Decompilers and beyond - Hex-Rays

					Decompilers and beyond
Ilfak Guilfanov
 Presentation Outline

        Why do we need decompilers?
             Complexity must be justified
        Typical decompiler design
             There are some misconceptions
        Decompiler based analysis
             New analysis type and tools become possible
             “Bright and sunny”
        Your feedback

        Online copy of this presentation is available at

(c) 2008 Hex-Rays SA                                                    2

        We need disassemblers to analyze binary code
        Simple disassemblers produce a listing with instructions
        Better disassemblers assist in analysis by annotating the
        code, providing good navigation, etc. You know the difference.
        Even the ideal disassembler stays at low level: the output
        is an assembler listing
        The main output of a disassembler is still a one-to-one
        mapping of opcodes to instruction mnemonics
        No leverage, no abstractions, little insight
        The analyst must mentally map assembly instructions to
        higher level abstractions and concepts
        A boring and routine task after a while

 Disassembler limitations

        The output is
             Error prone
             Requires special skills
             Did I say repetitive?
        Yet some geeks like it?...


        The need:
             Software grows like gas
             Time spent on analysis skyrockets
             Malware proliferates and mutates
        We need better tools to handle this
        Decompilation is the next logical step, yet a tough one

 Building ideal decompiler

        The answer is clear and easy to give: ideal decompilers
        do not exist
        It is customary to compare compilers and decompilers:
             Lexical analysis
             Syntax analysis
             Code generation
        This comparison is correct but superficial

 Compilers are privileged

        Strictly defined input language
             Anything nonconforming – spit out an error message
        Reasonable amount of information on all functions,
        variables, types, etc.
        The output may be ugly
             Who will ever read it but some geeks? :)

 Machine code decompilers are impossible

        Informal and sometimes hostile input
        Many problems are unsolved or proved to be unsolvable
        in general
        The output is examined in detail by a human being, any
        suboptimality is noticed because it annoys the analyst

        Conclusion: robust decompilers are impossible

        What if we address the common cases? For example, if
        we cover 90%, will the rest be handled manually?

 Easy for humans, hard for computers

        In fact, many (all?) problems encountered during
        decompilation are hard
        For every problem, there is a naïve solution, which,
        unfortunately, does not work
        Just a few examples...

 Function calls are a problem

        Function calls require answering the following questions:
             Where does the function expect its input registers?
             Where does it return the result?
             What registers or memory cells does it spoil?
             How does it change the stack pointer?
             Does it return to the caller or somewhere else?
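The questions above amount to a per-function summary that has to be known before a call site can be analyzed. A minimal sketch in Python, assuming 32-bit x86 and two common calling conventions; `CallSummary`, its field names and `esp_after_call` are hypothetical names made up for this illustration, not part of any decompiler API:

```python
from dataclasses import dataclass


@dataclass
class CallSummary:
    """Hypothetical per-function answers to the five questions (x86-32)."""
    inputs: tuple = ()        # where the function expects its inputs
    outputs: tuple = ()       # where it returns the result
    spoiled: tuple = ()       # registers it may overwrite
    purged_bytes: int = 0     # argument bytes the callee pops on return
    returns: bool = True      # False for exit()-like functions


def esp_after_call(esp_before_call, summary):
    """Stack pointer seen by the caller once the call has returned."""
    if not summary.returns:
        return None           # control never comes back to the caller
    # 'call' pushes a 4-byte return address; 'ret N' pops it plus N bytes,
    # so the net change visible to the caller is +N.
    return esp_before_call + summary.purged_bytes


# Assumed examples: a cdecl function and a stdcall function with two arguments.
CDECL_ONE_ARG = CallSummary(inputs=("[esp+4]",), outputs=("eax",),
                            spoiled=("eax", "ecx", "edx"))
STDCALL_TWO_ARGS = CallSummary(inputs=("[esp+4]", "[esp+8]"), outputs=("eax",),
                               spoiled=("eax", "ecx", "edx"), purged_bytes=8)
NORETURN = CallSummary(returns=False)
```

With such a summary the caller's stack pointer can be tracked across the call; without it, every later stack reference in the caller is suspect.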

 Function return values are a problem

        Does the function return anything?
        How big is the return value?

 Function input arguments are a problem

        When a register is accessed, the purpose can be
             To save its value
             To allocate a stack frame
             To pass a function argument
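The three cases can be told apart only by looking at what happens to the register later. The sketch below is deliberately one of the naive solutions: it works on a toy instruction format (tuples of mnemonic and operands, an assumption of this example) and ignores control flow, aliases and calls. The helper names `reads`, `writes` and `classify_register` are hypothetical.

```python
def reads(ins):
    """Operands read by a toy instruction."""
    op, *args = ins
    if op == "mov":
        return set(args[1:])
    if op in ("add", "sub", "cmp", "push"):
        return set(args)
    return set()


def writes(ins):
    """Operands written by a toy instruction."""
    op, *args = ins
    if op in ("mov", "add", "sub", "pop"):
        return set(args[:1])
    return set()


def classify_register(insns, reg):
    """Guess why `reg` is first touched in a straight-line function body."""
    for i, ins in enumerate(insns):
        op, *args = ins
        if op == "push" and args == [reg]:
            # Pushed before any other use: either saved or used to grab stack space
            restored = any(o == "pop" and a == [reg] for o, *a in insns[i + 1:])
            return "saved" if restored else "stack allocation"
        if reg in reads(ins):
            return "argument"      # read before being written
        if reg in writes(ins):
            return "scratch"       # written before being read
    return "unused"


body = [("push", "ebx"), ("mov", "ebx", "ecx"), ("pop", "ebx"), ("ret",)]
```

Here `ebx` is classified as saved and `ecx` as an argument. A `push ecx` that only reserves four bytes of stack would come out as "stack allocation", but the same instruction could also be passing an argument to a callee, which this sketch cannot distinguish.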

 Indirect accesses are a problem

        Pointer aliases
        No precise object boundaries

 Indirect jumps are a problem

        Indirect jumps are used for switch idioms and tail calls
        Recognizing them is necessary to build the control flow
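As an illustration of a switch idiom, the sketch below matches one textbook x86 shape: a bounds check, a jump to the default case, then a jump through a table of 4-byte entries. The tuple instruction format, the `jmp_table` pseudo-mnemonic and the function name `recover_switch` are assumptions of this example; real compilers emit many more variants.

```python
def recover_switch(insns, memory):
    """Find `cmp r, N / ja default / jmp_table r, base` and read the jump table.

    insns  : list of tuples such as ("cmp", "eax", 2)
    memory : dict mapping addresses to 4-byte values (the jump table contents)
    """
    for a, b, c in zip(insns, insns[1:], insns[2:]):
        if a[0] == "cmp" and b[0] == "ja" and c[0] == "jmp_table" and a[1] == c[1]:
            case_count = a[2] + 1          # valid indexes are 0..N inclusive
            base = c[2]
            targets = [memory[base + 4 * i] for i in range(case_count)]
            return {"index": a[1], "default": b[1], "targets": targets}
    return None                            # not a recognized idiom


code = [("cmp", "eax", 2), ("ja", 0x500), ("jmp_table", "eax", 0x1000)]
table = {0x1000: 0x400, 0x1004: 0x410, 0x1008: 0x420}
switch = recover_switch(code, table)
```

The recovered targets become edges of the control flow graph; an indirect jump that matches no known idiom leaves the graph incomplete.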

 Problems, problems, problems...

        Save-restore (push/pop) pairs
        Partial register accesses (al/ah/ax/eax)
        64-bit arithmetic
        Compiler idioms
        Variable live ranges (for stack variables)
        Lost type information
        Pointers vs. numbers
        Virtual functions
        Recursive functions
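To take one item from this list, partial register accesses mean that a write to `al` or `ah` changes part of `eax` and leaves the rest intact, so data flow analysis cannot treat the four names as independent variables. A small Python model of that overlap (the class name `Eax` is made up for the example):

```python
class Eax:
    """Model of eax with its overlapping sub-registers ax, al and ah."""

    def __init__(self, value=0):
        self.eax = value & 0xFFFFFFFF

    @property
    def ax(self):
        return self.eax & 0xFFFF

    @ax.setter
    def ax(self, v):
        # Writing ax keeps the upper 16 bits of eax
        self.eax = (self.eax & 0xFFFF0000) | (v & 0xFFFF)

    @property
    def al(self):
        return self.eax & 0xFF

    @al.setter
    def al(self, v):
        # Writing al keeps bits 8..31
        self.eax = (self.eax & 0xFFFFFF00) | (v & 0xFF)

    @property
    def ah(self):
        return (self.eax >> 8) & 0xFF

    @ah.setter
    def ah(self, v):
        # Writing ah replaces bits 8..15 only
        self.eax = (self.eax & 0xFFFF00FF) | ((v & 0xFF) << 8)


r = Eax(0x11223344)
r.ah = 0xAA        # eax is now 0x1122AA44
```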

 Hopeless situation?

        Well, yes and no
        While a fully automatic decompiler capable of handling
        arbitrary input is impossible, approximate solutions exist
        We could start with a “simple” case:
             Compiler generated output (no hostile adversary generating
             increasingly complex input)
             Only 32-bit code
             No floating point, exception handling and other fancy stuff

 Basic ideas

        Make some configurable assumptions about the input
        (calling conventions, stack frames, memory model, etc)
        Use a sound theoretical approach to solvable problems
        (data flow analysis on registers, peephole optimization
        within basic blocks, instruction simplification, etc)
        Use heuristics for unsolvable problems (indirect jumps,
        function prolog/epilogs, call arguments)
        Prefer to generate ugly but correct output rather than nice
        but incorrect code
        Let the user guide the decompilation in difficult cases
        (specify indirect call targets, function prototypes, etc)
        Interactivity is necessary to achieve good results
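Peephole optimization within a basic block is the simplest of the techniques named above. A minimal sketch, assuming a toy tuple instruction format and three made-up rewrite rules; it ignores condition codes, which a real implementation must not do because `xor` and `add` set flags:

```python
def peephole(block):
    """Apply a few local simplification rules to one basic block."""
    out = []
    for ins in block:
        op, *args = ins
        if op == "xor" and len(args) == 2 and args[0] == args[1]:
            out.append(("mov", args[0], 0))      # xor r, r  ->  mov r, 0
        elif op in ("add", "sub") and args[1] == 0:
            continue                             # add r, 0  ->  nothing
        elif op == "mov" and args[0] == args[1]:
            continue                             # mov r, r  ->  nothing
        else:
            out.append(ins)
    return out


before = [("xor", "eax", "eax"), ("add", "ebx", 0), ("mov", "ecx", "ecx"), ("ret",)]
after = peephole(before)
```

Each rule is correct on its own terms, which is what makes this a solvable problem; the heuristics are reserved for the cases listed as unsolvable.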

 Decompiler architecture

        Overall, it could look like this (top layer first):
             Add-ons: decompiler based analysis tools, plugins,
             visualizers, etc
             Kernel: decompiler core engine
             Microgen: translate decoded instructions to microcode;
             handle all platform specific
             Disassembler: read input file, decode instructions and
             divide into functions

 Decompilation phases - 1

        Microcode generation
             Analyze function prolog and epilog, switch idioms,
             verify the function
        Local optimization
             Simplify instructions, propagate expressions,
             determine block types and control graph edges
        Global optimization
             Globally propagate expressions, delete dead code,
             resolve memory references, analyze call instructions,
             determine input/output registers of the function
        Local variable allocation
             Determine variable live ranges and their sizes,
             get rid of all stack and register references,
             schedule instruction combinations,
             assign simple types to all variables
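Dead code deletion, one of the global optimization steps, can be illustrated on a single straight-line block with a backward liveness scan. This is a sketch under strong assumptions: instructions are hypothetical `(destination, sources)` pairs with no side effects, and the set of values live at the end of the block is given.

```python
def eliminate_dead(block, live_out):
    """Drop assignments whose result is never used.

    block    : list of (dst, srcs) pairs, e.g. ("t2", ("t1",))
    live_out : names that are still needed after the block
    """
    live = set(live_out)
    kept = []
    for dst, srcs in reversed(block):
        if dst in live:
            live.discard(dst)      # this definition satisfies the later uses
            live.update(srcs)      # and makes its own operands live
            kept.append((dst, srcs))
        # otherwise the value is never read: the instruction is dead
    kept.reverse()
    return kept


block = [("t1", ("a", "b")), ("t2", ("t1",)), ("t3", ("a",)), ("r", ("t2",))]
live_code = eliminate_dead(block, {"r"})
```

In this example the assignment to `t3` disappears because nothing reads it. Across a whole function the same idea needs liveness information propagated over the control flow graph.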


 Decompilation phases - 2

        Structural analysis
             Analyze control flow graph and create
             while/if/switch and other constructs
        Pseudocode generation
             Based on the microcode and structural analysis
             results, generate output text
        Transformations
             Massage the output to make it more readable,
             create for-loops, remove superfluous gotos,
             create break/continue, add/remove casts, etc
        Type analysis
             Analyze pseudocode, build type equations and
             solve them, modify variable types
        Final touch
             Rename variables, create va_list, etc
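The type analysis phase builds type equations and solves them. The sketch below shows the idea for the simplest possible kind of equation, "these two things have the same type", solved with a union-find structure. The set of concrete types, the variable names and the function name are assumptions of this example; real type systems also need pointers, structures and subtyping.

```python
CONCRETE = {"int", "unsigned int", "char *"}   # assumed set of known types


def solve_type_equations(equations):
    """Solve equations of the form (a, b) meaning type(a) == type(b)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    for a, b in equations:
        parent[find(a)] = find(b)              # merge the two classes

    class_type = {}
    for term in list(parent):
        if term in CONCRETE:
            root = find(term)
            if class_type.setdefault(root, term) != term:
                raise ValueError("conflicting types in one class")

    return {t: class_type.get(find(t), "unknown")
            for t in parent if t not in CONCRETE}


eqs = [("v1", "v2"), ("v2", "int"), ("v3", "char *"), ("v4", "v4")]
types = solve_type_equations(eqs)
```

A conflict (one variable equated with both `int` and `char *`) raises an error here; a decompiler would more likely fall back to a cast in the output.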

 Microcode – just generated

        It is very detailed
        One basic block at a time

 After preoptimization

 After local optimization

        This is much better
        Please note that the condition codes are still present
        because they might be used by other blocks
        Use-def lists are calculated dynamically
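For readers unfamiliar with use-def lists: for every operand an instruction reads, the list names the instruction that defined it. A minimal sketch for one basic block, using hypothetical `(destination, sources)` pairs rather than real microcode; how and when the actual decompiler computes its lists is not shown on the slide.

```python
def use_def_chains(block):
    """For each instruction, map every source to the index of its definition.

    A value of None means the operand is defined outside this block.
    """
    last_def = {}
    chains = []
    for index, (dst, srcs) in enumerate(block):
        chains.append({s: last_def.get(s) for s in srcs})
        last_def[dst] = index          # later uses of dst refer to this instruction
    return chains


block = [("eax", ("ecx",)), ("ebx", ("eax",)), ("eax", ("ebx", "eax"))]
chains = use_def_chains(block)
```

The condition codes mentioned above are the kind of operand that shows up as "defined here, used elsewhere", which is why local optimization alone cannot delete them.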

 After global optimization

        Condition codes are gone
        The LDX instruction got propagated to jz and all
        references to eax are gone
        Note that the jz target has changed (@3) since global
        optimization removed some unused code and blocks
        We are ready for local variable allocation
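The propagation of a load into the conditional jump can be imitated on a toy representation. In the sketch below, instructions are tuples, expressions are nested tuples, and `ldx`, `jz` and `@3` merely echo the names used on the slide; the format itself is an assumption and is not the real microcode. Memory stores that could invalidate a propagated load are ignored.

```python
def mentions(expr, name):
    """True if `name` occurs anywhere inside the expression."""
    if expr == name:
        return True
    return isinstance(expr, tuple) and any(mentions(e, name) for e in expr)


def subst_all(expr, env):
    """Replace every register that has a known defining expression."""
    if isinstance(expr, str) and expr in env:
        return env[expr]
    if isinstance(expr, tuple):
        return tuple(subst_all(e, env) for e in expr)
    return expr


def propagate(block):
    """Forward-propagate `mov dst, expr` definitions into later operands."""
    env, out = {}, []
    for ins in block:
        if ins[0] == "mov":
            _, dst, src = ins
            src = subst_all(src, env)
            # Writing dst invalidates every known expression that involves it
            env = {r: e for r, e in env.items()
                   if r != dst and not mentions(e, dst)}
            if not mentions(src, dst):
                env[dst] = src
            out.append(("mov", dst, src))
        else:
            out.append((ins[0],) + tuple(subst_all(a, env) for a in ins[1:]))
    return out


block = [("mov", "eax", ("ldx", "ds", "esi")), ("jz", "eax", 0, "@3")]
result = propagate(block)
```

After propagation the jump no longer refers to `eax`, so the `mov` becomes dead and a later dead code pass can remove it.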

 After local variable allocation

        All registers have been replaced by local variables (ecx0,
        esi1; except ds)
        Use-def lists are useless now but we do not need them
        Now we will perform structural analysis and create
        pseudocode
 Control graphs

        Figure: original graph view and control flow graph

 Graph structure as a tree

        Structural analysis extracts the standard control flow
        constructs from the CFG
        The result is a tree similar to the one below. It will be used
        to generate pseudocode
        The structural analysis algorithm is robust and can handle
        any graph, including irreducible ones
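A small fragment of this idea can be sketched as pattern matching around a two-way branch. The code below recognizes only three shapes (while, if-then, if-then-else) on a CFG given as a dictionary of successor lists; the representation and the function names are assumptions, and unlike the algorithm described above this sketch gives up on anything else, irreducible graphs included.

```python
def predecessors(cfg):
    """Invert a CFG given as {node: [successors]} (every node must be a key)."""
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)
    return preds


def classify(cfg, n):
    """Name the construct headed by the two-way node n, if it is a simple one."""
    if len(cfg[n]) != 2:
        return None
    preds = predecessors(cfg)
    a, b = cfg[n]
    for body, other in ((a, b), (b, a)):
        if preds[body] != [n]:
            continue                       # body must be entered only from n
        if cfg[body] == [n]:
            return ("while", n, body, other)
        if cfg[body] == [other]:
            return ("if-then", n, body, other)
    if (preds[a] == [n] and preds[b] == [n]
            and len(cfg[a]) == 1 and cfg[a] == cfg[b]):
        return ("if-then-else", n, a, b, cfg[a][0])
    return None


loop = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
diamond = {"c": ["t", "e"], "t": ["j"], "e": ["j"], "j": []}
triangle = {"c": ["t", "j"], "t": ["j"], "j": []}
```

A full structural analysis applies such reductions repeatedly, collapsing each recognized region into a single node until the whole graph has become a tree.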

 Initial pseudocode is ugly

        Almost unreadable...

 Transformations improve it

        Some casts still remain

 Interactive operation allows us to fine tune it

        Final result after some renamings and type adjustments:
        The initial assembly is too long to be displayed on a slide
        Pseudocode is much shorter and more readable

 What decompilation gives us

        Obvious benefits
             Saves time
             Eliminates routine tasks
             Makes source code recovery easier (...)
        New things
             Next abstraction level - closer to application domain
             Data flow based tools (vulnerability scanner, anyone? :)
             Binary translation

 Base to build on...

        To be useful and make other tools possible, decompiler
        must have a programmable API
        It already exists but it needs some refinement
             Microcode is not accessible yet
        Decompiler is retargetable (x86 now, ARM will be next)
        Both interactive and batch modes are possible
        In addition to being a tool to examine binaries, decompiler
        could be used for...

 ...program verification

        Well, “verification” won't be strict but it can help to spot
        interesting locations in the code:
             Missing return value validations (e.g. for NULL pointers)
             Missing input value validations
             Taint analysis
             Insecure code patterns
             Uninitialized variables
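As an example of the first item, a checker for missing return value validation could look roughly like this. It is a sketch over a made-up, already simplified statement list (`alloc`, `check`, `deref`), straight-line code only; a real tool would work on the decompiler's output and follow all control flow paths.

```python
def unchecked_allocations(stmts):
    """Report pointers that are dereferenced before being checked for NULL.

    stmts: list of (kind, variable) with kind in {"alloc", "check", "deref"}
    Returns (variable, allocation index, dereference index) triples.
    """
    pending = {}                   # variable -> index of its unchecked allocation
    findings = []
    for index, (kind, var) in enumerate(stmts):
        if kind == "alloc":
            pending[var] = index
        elif kind == "check":
            pending.pop(var, None)          # validated, no longer suspicious
        elif kind == "deref" and var in pending:
            findings.append((var, pending.pop(var), index))
    return findings


stmts = [("alloc", "p"), ("deref", "p"),
         ("alloc", "q"), ("check", "q"), ("deref", "q")]
report = unchecked_allocations(stmts)
```

The result is a list of locations worth a look, not a proof of a bug, which matches the non-strict sense of "verification" used here.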

 ...assembly listing improvement

        Hardcore users who prefer to work with assembly
        instructions can benefit from data flow analysis results
        Hover the mouse over a register or data to get:
             Its possible values or value ranges
             Locations where it is defined
             Locations where it is used
        Highlight definitions or uses of the current register in two
        different colors
        Show list of indirect call targets, calling conventions, etc
        Gray out dead instructions
        Determine if a value comes from a system call (ReadFile)

 ...more insight into the application domain

        One could reconstruct data types used by the application
        In fact, serious reverse engineering is impossible without
        knowing data types (...)
        Fortunately the API already exposes all necessary information
        for type handling
        Plenty of work ahead

 ...more abstract representations

        Tools to build more abstract representations
             Function clustering (think of modules or libraries)
             Global data flow diagrams (functions exposed to tainted
             data in red)
             Statistical analysis of pseudocode
             C++ template detection, generic code detection

 ...binary code comparison

        You know the possible applications better than I do
             To find code plagiarisms
             To detect changes between program versions
             To find library functions (high-gear FLIRT)
             etc... (you know better than me :)
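One way pseudocode could help with comparison is that it can be normalized before matching. The sketch below renames identifiers in order of first appearance, collapses whitespace and hashes the result, so that two functions differing only in names get the same fingerprint. The keyword list, the `idN` naming scheme and the function names are assumptions of this example, and it is far cruder than a real comparison tool.

```python
import hashlib
import re

# Assumed subset of C keywords that must survive the renaming
KEYWORDS = {"if", "else", "while", "for", "do", "return", "goto", "break",
            "continue", "switch", "case", "int", "char", "unsigned", "void"}


def normalize(code):
    """Rename identifiers by order of first appearance and squeeze whitespace."""
    names = {}

    def canon(match):
        word = match.group(0)
        if word in KEYWORDS:
            return word
        return names.setdefault(word, f"id{len(names)}")

    renamed = re.sub(r"\b[A-Za-z_]\w*", canon, code)
    return " ".join(renamed.split())


def fingerprint(code):
    """Hash of the normalized text, usable as a lookup key."""
    return hashlib.sha256(normalize(code).encode()).hexdigest()


first = "int f(int a1) { int v2 = a1 + 1; return v2; }"
second = "int g(int x)  {\n  int y = x + 1; return y; }"
```

Both `first` and `second` normalize to the same text and therefore share a fingerprint, while changing `+` to `-` produces a different one.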

 Back to earth

        The tools and possibilities described on the previous
        slides do not exist yet
        Yet they become possible thanks to decompilation
        We have a long way to go
             More processors and platforms
             Floating point calculations
             Exception handling
             Type recovery
             Handling hostile code
             In fact, too many ideas to enumerate them here
        The future is bright... is it?...

 The “thank you” slide

                       Thank you for your attention!
