Decompilers and beyond
Hex-Rays
Ilfak Guilfanov
Presentation Outline
Why do we need decompilers?
Complexity must be justified
Typical decompiler design
There are some misconceptions
Decompiler based analysis
New analysis type and tools become possible
Future
“...is bright and sunny”
Your feedback
Online copy of this presentation is available at
http://www.hex-rays.com/idapro/ppt/decompilers_and_beyond.ppt
(c) 2008 Hex-Rays SA 2
Disassemblers
We need disassemblers to analyze binary code
Simple disassemblers produce a listing with instructions
Better disassemblers assist in analysis by annotating the
code, providing good navigation, etc. You know the difference.
Even the ideal disassembler stays at low level: the output
is an assembler listing
The main output of a disassembler is still a one-to-one
mapping of opcodes to instruction mnemonics
No leverage, no abstractions, little insight
The analyst must mentally map assembly instructions to
higher level abstractions and concepts
A boring and routine task after a while
Disassembler limitations
The output is
Boring
Inhuman
Repetitive
Error prone
Requires special skills
Did I say repetitive?
Yet some geeks like it?...
Decompilers
The need:
Software grows like gas
Time spent on analysis skyrockets
Malware proliferates and mutates
We need better tools to handle this
Decompilation is the next logical step, yet a tough one
Building the ideal decompiler
The answer is clear and easy to give: ideal decompilers
do not exist
It is customary to compare compilers and decompilers:
Preprocessing
Lexical analysis
Syntax analysis
Code generation
Optimization
This comparison is correct but superficial
Compilers are privileged
Strictly defined input language
Anything nonconforming – spit out an error message
Reasonable amount of information on all functions,
variables, types, etc.
The output may be ugly
Who will ever read it but some geeks? :)
Machine code decompilers are impossible
Informal and sometimes hostile input
Many problems are unsolved or proved to be unsolvable
in general
The output is examined in detail by a human being, any
suboptimality is noticed because it annoys the analyst
Conclusion: robust decompilers are impossible
What if we address the common cases? For example, if
we cover 90%, will the rest be handled manually?
Easy for humans, hard for computers
In fact, many (all?) problems encountered during
decompilation are hard
For every problem, there is a naïve solution, which,
unfortunately, does not work
Just a few examples...
Function calls are a problem
Function calls require answering the following questions:
Where does the function expect its input registers?
Where does it return the result?
What registers or memory cells does it spoil?
How does it change the stack pointer?
Does it return to the caller or somewhere else?
Function return values are a problem
Does the function return anything?
How big is the return value?
Function input arguments are a problem
When a register is accessed, the purpose can be
To save its value
To allocate the stack frame
To pass a function argument
Indirect accesses are a problem
Pointer aliases
No precise object boundaries
Indirect jumps are a problem
Indirect jumps are used for switch idioms and tail calls
Recognizing them is necessary to build the control flow
graph
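To make the switch idiom concrete, here is a minimal sketch of the easy half of the problem: once the jump table address and the number of cases are known, the targets can be read out and added to the control flow graph as edges. The function names, the dictionary used as memory, and the addresses are all hypothetical; finding the table and its size in real code is the hard part and is not shown.

```python
def recover_switch_targets(memory, table_base, case_count, entry_size=4):
    """Read case_count little-endian code addresses from a jump table.

    memory is a hypothetical dict mapping byte addresses to byte values.
    case_count is assumed to come from the bounds check before the jump.
    """
    targets = []
    for index in range(case_count):
        start = table_base + index * entry_size
        raw = bytes(memory[start + k] for k in range(entry_size))
        targets.append(int.from_bytes(raw, "little"))
    return targets


def store_table(memory, table_base, addresses, entry_size=4):
    """Demo helper: write the addresses into memory as a jump table."""
    for index, address in enumerate(addresses):
        for k, byte in enumerate(address.to_bytes(entry_size, "little")):
            memory[table_base + index * entry_size + k] = byte


memory = {}
store_table(memory, 0x403000, [0x401020, 0x401035, 0x401050])
targets = recover_switch_targets(memory, 0x403000, case_count=3)
print([hex(t) for t in targets])
```

Each recovered address becomes a successor of the block that ends with the indirect jump.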
Problems, problems, problems...
Save-restore (push/pop) pairs
Partial register accesses (al/ah/ax/eax)
64-bit arithmetic
Compiler idioms
Variable live ranges (for stack variables)
Lost type information
Pointers vs. numbers
Virtual functions
Recursive functions
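As one illustration from this list, 64-bit arithmetic in 32-bit code shows up as pairs of instructions (add followed by adc on x86). The sketch below models such a pair and shows that it equals a single 64-bit addition, which is what the decompiler has to recognize. The function names are made up for the example.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def add_adc(lo_a, hi_a, lo_b, hi_b):
    """Model an add/adc pair working on two 32-bit halves."""
    lo_sum = lo_a + lo_b
    carry = lo_sum >> 32                   # carry flag produced by 'add'
    lo = lo_sum & MASK32
    hi = (hi_a + hi_b + carry) & MASK32    # 'adc' consumes the carry
    return lo, hi


def as_u64(lo, hi):
    """Combine two 32-bit halves into one 64-bit value."""
    return (hi << 32) | lo


a_lo, a_hi = 0xFFFFFFFF, 0x00000001
b_lo, b_hi = 0x00000001, 0x00000000
pair_result = as_u64(*add_adc(a_lo, a_hi, b_lo, b_hi))
wide_result = (as_u64(a_lo, a_hi) + as_u64(b_lo, b_hi)) & MASK64
print(hex(pair_result), hex(wide_result))
```

Both values are the same, so the two 32-bit instructions can be shown to the user as one 64-bit addition.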
Hopeless situation?
Well, yes and no
While a fully automatic decompiler capable of handling
arbitrary input is impossible, approximate solutions exist
We could start with a “simple” case:
Compiler generated output (no hostile adversary generating
increasingly complex input)
Only 32-bit code
No floating point, exception handling and other fancy stuff
Basic ideas
Make some configurable assumptions about the input
(calling conventions, stack frames, memory model, etc)
Use a sound theoretical approach to solvable problems
(data flow analysis on registers, peephole optimization
within basic blocks, instruction simplification, etc)
Use heuristics for unsolvable problems (indirect jumps,
function prolog/epilogs, call arguments)
Prefer to generate ugly but correct output rather than nice
but incorrect code
Let the user guide the decompilation in difficult cases
(specify indirect call targets, function prototypes, etc)
Interactivity is necessary to achieve good results
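A minimal sketch of peephole optimization within one basic block, to show the flavor of the "solvable" part. The tuple encoding of instructions and the three rules are assumptions made for this example, not the rules of any real decompiler. The sketch also assumes that the condition codes set by the rewritten instructions are not used later; a real implementation must verify that first.

```python
def peephole(block):
    """Simplify a basic block given as (opcode, destination, source) tuples.

    Assumption: flags produced by the affected instructions are dead.
    """
    simplified = []
    for opcode, dst, src in block:
        if opcode == "mov" and dst == src:
            continue                              # mov r, r does nothing
        if opcode in ("add", "sub") and src == 0:
            continue                              # adding/subtracting zero
        if opcode == "xor" and dst == src:
            simplified.append(("mov", dst, 0))    # xor r, r clears r
            continue
        simplified.append((opcode, dst, src))
    return simplified


block = [
    ("mov", "eax", "eax"),
    ("xor", "ecx", "ecx"),
    ("add", "edx", 0),
    ("mov", "ebx", "ecx"),
]
optimized = peephole(block)
print(optimized)
```

Note how the sketch follows the "ugly but correct" rule: anything it does not recognize is passed through unchanged.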
Decompiler architecture
Overall, it could look like this:
Add-ons: decompiler based analysis tools,
plugins, visualizers, etc
Kernel: decompiler core engine
Microgen: translate decoded instructions to
microcode; handle all platform specific
aspects
Disassembler: read input file, decode
instructions and divide into functions
Decompilation phases - 1
Microcode generation: Analyze function prolog and epilog, switch idioms, verify the function
Local optimization: Simplify instructions, propagate expressions, determine block types and control graph edges
Global optimization: Globally propagate expressions, delete dead code, resolve memory references, analyze call instructions, determine input/output registers of the function
Local variable allocation: Determine variable live ranges and their sizes, get rid of all stack and register references, schedule instruction combinations, assign simple types to all variables
continued...
Decompilation phases - 2
Structural analysis: Analyze control flow graph and create while/if/switch and other constructs
Pseudocode generation: Based on the microcode and structural analysis results, generate output text
Pseudocode transformation: Massage the output to make it more readable, create for-loops, remove superfluous gotos, create break/continue, add/remove casts, etc
Type analysis: Analyze pseudocode, build type equations and solve them, modify variable types
Final touch: Rename variables, create va_list, etc
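The type analysis phase can be pictured as solving equations of the form "these two variables have the same type". The sketch below uses a union-find structure for that. The class name, the method names, and the variable names v1..v4 are invented for the illustration, and real type equations are richer than plain equality (pointers, structures, signedness).

```python
class TypeEquations:
    """Toy solver for 'type(a) == type(b)' equations using union-find."""

    def __init__(self):
        self.parent = {}
        self.known = {}          # root variable -> concrete type name

    def find(self, var):
        self.parent.setdefault(var, var)
        while self.parent[var] != var:
            self.parent[var] = self.parent[self.parent[var]]  # path halving
            var = self.parent[var]
        return var

    def assign(self, var, type_name):
        """Record a known type, e.g. taken from a library prototype."""
        root = self.find(var)
        old = self.known.get(root)
        if old is not None and old != type_name:
            raise ValueError(f"conflict: {old} vs {type_name}")
        self.known[root] = type_name

    def equate(self, a, b):
        """Record that a and b must have the same type."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        type_a = self.known.pop(root_a, None)
        type_b = self.known.pop(root_b, None)
        if type_a is not None and type_b is not None and type_a != type_b:
            raise ValueError(f"conflict: {type_a} vs {type_b}")
        self.parent[root_a] = root_b
        merged = type_a if type_a is not None else type_b
        if merged is not None:
            self.known[root_b] = merged

    def type_of(self, var):
        return self.known.get(self.find(var))


eq = TypeEquations()
eq.equate("v1", "v2")          # v1 = v2;
eq.assign("v2", "char *")      # v2 is passed to a function taking char *
eq.equate("v3", "v1")          # v3 = v1;
print(eq.type_of("v3"), eq.type_of("v4"))
```

A variable with no constraints, like v4 here, keeps the simple type assigned during local variable allocation.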
Microcode – just generated
It is very detailed
Redundant
One basic block at a time
After preoptimization
After local optimization
This is much better
Please note that the condition codes are still present
because they might be used by other blocks
Use-def lists are calculated dynamically
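For readers who have not met use-def lists before, here is a minimal sketch for a single basic block. Each instruction is reduced to a destination and a list of sources; this encoding and the function name are assumptions for the example and say nothing about how the real microcode stores them.

```python
def use_def_chains(block):
    """For every instruction, map each used name to the index of the
    instruction in this block that defined it, or None if the value
    comes from outside the block."""
    last_definition = {}
    chains = []
    for index, (dst, sources) in enumerate(block):
        chains.append({src: last_definition.get(src) for src in sources})
        last_definition[dst] = index
    return chains


block = [
    ("eax", ["ebx"]),          # 0: eax = f(ebx)
    ("ecx", ["eax", "edx"]),   # 1: ecx = f(eax, edx)
    ("eax", ["ecx"]),          # 2: eax = f(ecx)
    ("esi", ["eax"]),          # 3: esi = f(eax)
]
chains = use_def_chains(block)
print(chains)
```

Instruction 3 uses the eax defined by instruction 2, not the one from instruction 0; telling these apart is the whole purpose of the lists.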
After global optimization
Condition codes are gone
The LDX instruction got propagated to jz and all
references to eax are gone
Note that the jz target has changed (@3) since global
optimization removed some unused code and blocks
We are ready for local variable allocation
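Why can the condition codes disappear at this stage? Once the whole function has been analyzed, we know which values are still needed after each block, and definitions nobody reads can be deleted. A minimal sketch of that deletion for one block follows. The encoding, the function name, and the set of live values are assumptions, and the sketch assumes the instructions have no side effects such as memory writes or calls.

```python
def eliminate_dead(block, live_out):
    """Remove instructions whose result is never used.

    block is a list of (destination, [sources]); live_out is the set of
    names still needed after the block, as found by global analysis.
    """
    live = set(live_out)
    kept = []
    for dst, sources in reversed(block):
        if dst in live:
            live.discard(dst)       # this definition satisfies the use
            live.update(sources)    # and makes its own sources live
            kept.append((dst, sources))
    kept.reverse()
    return kept


block = [
    ("eax", ["ebx"]),
    ("cf", ["eax"]),     # condition code, never read later
    ("ecx", ["eax"]),
    ("zf", ["ecx"]),     # condition code, never read later
]
cleaned = eliminate_dead(block, live_out={"ecx"})
print(cleaned)
```

During local optimization the same block must keep cf and zf, because without the global picture they have to be treated as possibly used elsewhere.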
After local variable allocation
All registers have been replaced by local variables (ecx0,
esi1; except ds)
Use-def lists are useless now but we do not need them
anymore
Now we will perform structural analysis and create
pseudocode
Control graphs
[Figures: original graph view; control flow graph]
Graph structure as a tree
Structural analysis extracts the standard control flow
constructs from CFG
The result is a tree similar to the one below. It will be used
to generate pseudocode
The structural analysis algorithm is robust and can handle
any graphs, including irreducible ones
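To give an idea of what "extracting constructs" means, here is a deliberately small sketch that recognizes three textbook shapes around a two-way branch. The graph encoding and the function names are invented for the example. Unlike the algorithm mentioned above, this sketch gives up on anything irregular, irreducible graphs included.

```python
def predecessors(cfg):
    """Invert a {node: [successors]} graph."""
    preds = {node: [] for node in cfg}
    for node, succs in cfg.items():
        for succ in succs:
            preds[succ].append(node)
    return preds


def classify(cfg, node):
    """Name the construct headed by node: while, if-then, if-then-else."""
    succs = cfg[node]
    if len(succs) != 2:
        return "other"
    preds = predecessors(cfg)
    for body, _exit in (succs, succs[::-1]):
        if cfg[body] == [node] and preds[body] == [node]:
            return "while"            # body jumps straight back to the header
    for then, join in (succs, succs[::-1]):
        if cfg[then] == [join] and preds[then] == [node]:
            return "if-then"          # one arm falls into the other successor
    first, second = succs
    if (len(cfg[first]) == 1 and cfg[first] == cfg[second]
            and preds[first] == [node] and preds[second] == [node]):
        return "if-then-else"         # both arms meet at a common block
    return "other"


loop = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
branch = {"a": ["b", "c"], "b": ["c"], "c": []}
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(classify(loop, "head"), classify(branch, "a"), classify(diamond, "a"))
```

A real implementation collapses each recognized region into one node and repeats, which is how the tree is built.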
Initial pseudocode is ugly
Almost unreadable...
Transformations improve it
Some casts still remain
Interactive operation allows us to fine tune it
Final result after some renamings and type adjustments:
The initial assembly is too long to be displayed on a slide
Pseudocode is much shorter and more readable
What decompilation gives us
Obvious benefits
Saves time
Eliminates routine tasks
Makes source code recovery easier (...)
New things
Next abstraction level - closer to application domain
Data flow based tools (vulnerability scanner, anyone? :)
Binary translation
Base to build on...
To be useful and make other tools possible, a decompiler
must have a programmable API
It already exists but it needs some refinement
Microcode is not accessible yet
Decompiler is retargetable (x86 now, ARM will be next)
Both interactive and batch modes are possible
In addition to being a tool to examine binaries, decompiler
could be used for...
...program verification
Well, “verification” won't be strict but it can help to spot
interesting locations in the code:
Missing return value validations (e.g. for NULL pointers)
Missing input value validations
Taint analysis
Insecure code patterns
Uninitialized variables
etc..
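As a taste of the first item, the sketch below flags pointer dereferences that happen before the result of a call was compared against NULL. The statement encoding, the function name, and the set of functions that may return NULL are all hypothetical, and the sketch only handles straight-line code; real pseudocode needs the control flow to be taken into account.

```python
def unchecked_derefs(statements, may_return_null):
    """Return indexes of dereferences of possibly-NULL call results.

    statements are tuples: ("call", var, function), ("check", var),
    or ("deref", var).
    """
    unchecked = set()
    findings = []
    for index, (kind, var, *rest) in enumerate(statements):
        if kind == "call":
            if rest[0] in may_return_null:
                unchecked.add(var)
            else:
                unchecked.discard(var)    # overwritten by a safe value
        elif kind == "check":
            unchecked.discard(var)        # e.g. if ( !var ) return;
        elif kind == "deref" and var in unchecked:
            findings.append(index)
    return findings


statements = [
    ("call", "p", "malloc"),
    ("deref", "p"),            # used without validation
    ("call", "q", "malloc"),
    ("check", "q"),
    ("deref", "q"),            # fine, it was checked
]
findings = unchecked_derefs(statements, may_return_null={"malloc"})
print(findings)
```

The output is a list of places worth a look, not a proof of a bug, which is the "won't be strict" part.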
...assembly listing improvement
Hardcore users who prefer to work with assembly
instructions can benefit from data flow analysis results
Hover the mouse over a register or data to get:
Its possible values or value ranges
Locations where it is defined
Locations where it is used
Highlight definitions or uses of the current register in two
different colors
Show list of indirect call targets, calling conventions, etc
Gray out dead instructions
Determine if a value comes from a system call (ReadFile)
etc...
...more insight into the application domain
One could reconstruct data types used by the application
In fact, serious reverse engineering is impossible without
knowing data types (...)
Fortunately, the API already exposes all necessary information
for type handling
Plenty of work ahead
...more abstract representations
Tools to build more abstract representations
Function clustering (think of modules or libraries)
Global data flow diagrams (functions exposed to tainted
data in red)
Statistical analysis of pseudocode
C++ template detection, generic code detection
...binary code comparison
You know better than me the possible applications
To find code plagiarisms
To detect changes between program versions
To find library functions (high-gear FLIRT)
etc... (you know better than me :)
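A minimal sketch of the comparison idea: normalize away details that change between builds (which register was picked, which constant was used) and hash what remains. It works on textual x86 instructions only to stay short; the same approach could be applied to microcode or pseudocode. The function name and the normalization rules are assumptions for the example.

```python
import hashlib
import re

IMMEDIATE = re.compile(r"\b0x[0-9a-fA-F]+\b|\b\d+\b")
REGISTER = re.compile(r"\b(e?[abcd]x|e?[sd]i|e?[sb]p)\b")


def fingerprint(instructions):
    """Hash an instruction sequence after replacing registers and
    immediate values with placeholders."""
    normalized = []
    for line in instructions:
        mnemonic, _, operands = line.partition(" ")
        operands = IMMEDIATE.sub("IMM", operands)
        operands = REGISTER.sub("REG", operands)
        normalized.append(f"{mnemonic} {operands}".strip())
    return hashlib.sha256("\n".join(normalized).encode()).hexdigest()


version_1 = ["mov eax, 5", "add eax, ebx", "ret"]
version_2 = ["mov ecx, 7", "add ecx, edx", "ret"]   # other registers/constant
version_3 = ["mov eax, 5", "sub eax, ebx", "ret"]   # different operation

same = fingerprint(version_1) == fingerprint(version_2)
changed = fingerprint(version_1) != fingerprint(version_3)
print(same, changed)
```

The higher the level of the representation being hashed, the less the compiler's choices disturb the result, which is why a decompiler is a good base for such tools.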
Back to the earth
The tools and possibilities described on the previous
slides do not exist yet
Yet they become possible thanks to decompilation
We have a long way to go
More processors and platforms
Floating point calculations
Exception handling
Type recovery
Handling hostile code
In fact, too many ideas to enumerate them here
The future is bright... is it?...
The “thank you” slide
Thank you for your attention!
Questions?