Bootstrapping SMT binary analysis tools by analyzing emulator binaries
Category: Vulnerability Development
Summary: SMT solvers are nifty, and pretty easy to apply to simple or
repetitive binary analysis problems. Implementing instruction semantics for
a specific branch structure, or the subset of arithmetic operations used by
a specific obfuscation pass has a pretty low time investment.
This changes when trying to build more generic tools that can be re-used or
run at scale. Binary analysis projects with accurate IR semantics for a
wide range of instructions are usually large efforts, and their
verification quality falls off for instructions and architectures that
didn't see extensive use.
What if we want to start from scratch and stay simple?
* Emulators and bytecode interpreters contain executable instruction
* Code for executing a single instruction tends to be short and loop-free.
* Our endgame is to use SMT tools to reason about short, loop-free code.
As such, we'll walk through the process of taking an emulator
cross-compiled for its own architecture, and using a manually-written
SMT-LIB specification of instruction semantics to extract the instruction
semantics implemented by the interpreter's decode-and-execute code. Then,
cross-compiling a bunch of other emulators and interpreters, and using the
first extracted specification to extract specifications for completely
I will also examine the use of constraint solving to explore differences
between instruction semantics specifications. This provides assurance of
their accuracy, but also discovers edge cases between targets. The overall
goal is to get good at deriving SMT formulas for code that looks like a
non-JIT bytecode interpreter's step function, without depending on any
specific existing binary analysis toolchains.