     PinOS: A Programmable Framework for Whole-System Dynamic Instrumentation

                Prashanth P. Bungale
                               14th June 2007

                 Joint work with Chi-Keung Luk

 Pin Overview
 PinOS motivation and goals
 Architecture
 Design Issues
 Evaluation
 Future work

                               What is Pin?

 A Dynamic Binary Instrumentation System
       Inserts instructions into, and deletes instructions from, the instruction stream at run time, without source code

 Programmable Instrumentation
       Provides APIs to write instrumentation tools (called PinTools) in C/C++

 Multiplatform
       Supports 32-bit and 64-bit x86, Itanium
       Supports Linux, Windows, MacOS

 Robust
       Instruments real-life and multithreaded applications
         Databases, search engines, web browsers

 Increasingly Popular
       Over 10,000 downloads since Pin was released in June 2004

                   Pin Instrumentation Uses
     Computer Architecture Research
      –   Branch predictor simulation
      –   Cache simulation
      –   Trace generation
      –   Instruction Emulation
          • E.g., emulate newly proposed instructions

     Software Instrumentation
      – Profiling for optimization
        • Basic block counts, edge counts
      – Bug checking

                               PinOS Goals

 Extend Pin to instrument OS code as well
       Programmable through extended Pintool API

 Fine-grain instrumentation of both kernel- and user-level code
       No limitation on where and what kind of instrumentation can be inserted
       Not achievable by existing probe-based tools (e.g., DTrace and KProbes)

 Only active when needed
       Attach PinOS to, and detach it from, the guest whenever needed

 Generalized Infrastructure
    •   Single framework to instrument Linux, Windows, etc.

           PinTool on PinOS: Tracing Memory Writes
    #include "pin.H"    // Pin's standard header

    FILE * trace;

    // Print a memory write record
    VOID RecordMemWrite(VOID * ip, VOID * va, VOID * pa, UINT32 size) {
            Host_fprintf(trace, "%p: W %p %p %u\n", ip, va, pa, size);
    }

    // Called for every instruction
    VOID Instruction(INS ins, VOID *v) {
            if (INS_IsMemoryWrite(ins))
                INS_InsertCall(ins, IPOINT_BEFORE,
                     AFUNPTR(RecordMemWrite), IARG_INST_PTR,
                     IARG_MEMORYWRITE_EA,    // virtual address of the write
                     IARG_MEMORYWRITE_PA,    // physical address; PinOS extension, name assumed
                     IARG_MEMORYWRITE_SIZE, IARG_END);
    }

    int main(int argc, char *argv[]) {
            PIN_Init(argc, argv);
            trace = Host_fopen("atrace.out", "w");  // file lives in the host domain
            INS_AddInstrumentFunction(Instruction, 0);
            PIN_StartProgram();        // Never returns
            return 0;
    }


      Architecture (figure): Xen-Domain0 runs the host OS; Xen-DomainU runs the
      guest OS, with PinOS (engine, Pintool, and code cache) beneath it. Both
      domains run on the Xen Virtual Machine Monitor (VMM), on the hardware.

    1 To run PinOS between guest and hardware: use Xen
    2 Virtualize and present a fake processor to the guest OS

         Xen 3.0 - A Convenient Environment

 Uses Intel VT to run unmodified operating systems
 Open-source availability
 We modify Xen 3.0 for PinOS's purposes:
       Steal physical and virtual memory for PinOS
       Provide I/O services to PinOS
       Hijack initial control of guest domain
       Perform PinOS attach/detach

 Provides support for debugging PinOS

                   Stealing Physical Memory

 Memory requirements
       PinOS executable, Pintool executable, code cache, PinOS stack and heap, I/O buffers

 Physical Memory
       Pre-allocate a separate range of machine pages for PinOS

         (Figure: guest physical pages mapped onto machine pages, with a separate range of machine pages reserved for PinOS)

                     Stealing Virtual Memory

 Steal some portion of guest address space
        Current strategy: steal part of guest’s kernel address space
          – Minimizes chance of VA space conflicts

 Map the stolen VA space to the pre-allocated pages in the Xen shadow page tables
 Propagate the stealing to every shadow table
        i.e., to every address space ever encountered in the guest OS

 Detect and report any conflicts
        So far, no guest OS mapping activity has been observed in the stolen VA space
        Should be less of an issue with a 64-bit address space
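To make the conflict check concrete, here is a minimal C sketch. The bounds `STOLEN_VA_START`/`STOLEN_VA_END` and the hook `check_guest_mapping` are hypothetical; the slides say only that PinOS steals part of the guest kernel address space and reports conflicts, not where the region lies or how the check is wired in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bounds of the stolen region in the guest's kernel VA space;
   illustrative values only, not taken from PinOS. */
#define STOLEN_VA_START 0xF8000000u
#define STOLEN_VA_END   0xFC000000u   /* exclusive */

/* True if a single virtual address falls in the stolen region. */
static bool in_stolen_range(uint32_t va) {
    return va >= STOLEN_VA_START && va < STOLEN_VA_END;
}

/* Hypothetically called whenever the guest installs a mapping covering
   [va, va + len). Reports and returns true if it overlaps the stolen region. */
static bool check_guest_mapping(uint32_t va, uint32_t len) {
    assert(len > 0);
    uint64_t end = (uint64_t)va + len;   /* 64-bit so the sum cannot wrap */
    bool overlaps = va < STOLEN_VA_END && end > STOLEN_VA_START;
    if (overlaps)
        fprintf(stderr, "VA conflict: guest maps [%#lx, %#llx) in stolen region\n",
                (unsigned long)va, (unsigned long long)end);
    return overlaps;
}
```

An interval-overlap test is used rather than checking only the start address, so a guest mapping that begins just below the stolen region and runs into it is still reported.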

           Memory Virtualization
      Guest OS page table            Xen shadow page table
      V0   -> P0                     V0   -> M0
      V1   -> P1                     V1   -> M1
      …                              …
      Vi   -> Pi                     Vi   -> Mi
                                     …
                                     Vk   -> Mk
                     PinOS memory:   Vk+1 -> Mk+1
                                     …
                                     Vn   -> Mn
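The table above amounts to composing two maps, plus a fixed stolen range. The toy C sketch below assumes flat single-level arrays (`gpt`, `p2m`, `pinos`, and `build_shadow` are hypothetical names); real x86 shadow paging walks multi-level tables. It shows each guest-owned shadow entry as the guest mapping composed with Xen's physical-to-machine map, and the PinOS range filled identically in every shadow table.

```c
#include <assert.h>
#include <stdint.h>

#define NPAGES   16          /* toy address space of 16 virtual pages          */
#define PINOS_VP 12          /* hypothetical: pages 12..15 are stolen by PinOS */
#define NO_PAGE  UINT32_MAX

/* Build the shadow table for one guest address space.
   gpt   : guest page table, virtual page -> guest physical page (or NO_PAGE)
   p2m   : Xen's physical-to-machine map, guest physical page -> machine page
   pinos : machine pages pre-allocated for PinOS, one per stolen virtual page
   spt   : output shadow table, virtual page -> machine page                  */
static void build_shadow(const uint32_t gpt[NPAGES], const uint32_t *p2m,
                         const uint32_t pinos[NPAGES - PINOS_VP],
                         uint32_t spt[NPAGES]) {
    /* Guest-owned part (V0..Vk): compose the guest mapping with p2m. */
    for (uint32_t v = 0; v < PINOS_VP; v++)
        spt[v] = (gpt[v] == NO_PAGE) ? NO_PAGE : p2m[gpt[v]];
    /* Stolen part (Vk+1..Vn): the same in every shadow table, so PinOS is
       mapped in every guest address space. */
    for (uint32_t v = PINOS_VP; v < NPAGES; v++)
        spt[v] = pinos[v - PINOS_VP];
}
```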

                       I/O Services for PinOS

 I/O Service requirements
         PinOS’s own debugging log, Pintools’ input/output

 I/O channels implemented as shared ring buffers
         PinOS writes I/O requests to a buffer shared between the guest and host domains
         A daemon process in the host domain periodically polls and processes the requests

 Sharing the ring buffers
         Allocated in guest domain
         “Mapped in” by host domain

         (Figure: a daemon process in the host domain and PinOS in the guest domain communicate through the shared ring buffers)
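A single-producer/single-consumer ring is one common way to build such a channel. The C11 sketch below is an assumption about the layout (`struct io_ring`, `ring_put`, and `ring_get` are hypothetical), with PinOS as the only producer and the host daemon as the only consumer; release/acquire ordering on the indices stands in for the barriers a real cross-domain implementation would need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 8            /* must be a power of two */
#define MSG_BYTES  64

/* Hypothetical layout of one shared I/O channel. */
struct io_ring {
    _Atomic uint32_t head;      /* next slot PinOS (producer) writes   */
    _Atomic uint32_t tail;      /* next slot the daemon (consumer) reads */
    char slot[RING_SLOTS][MSG_BYTES];
};

/* PinOS side: enqueue one request; returns false if the ring is full. */
static bool ring_put(struct io_ring *r, const char *msg) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;
    char *dst = r->slot[head % RING_SLOTS];
    strncpy(dst, msg, MSG_BYTES - 1);
    dst[MSG_BYTES - 1] = '\0';
    /* Publish the slot only after its contents are written. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Daemon side, called on each poll: copy out one request if available. */
static bool ring_get(struct io_ring *r, char out[MSG_BYTES]) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    memcpy(out, r->slot[tail % RING_SLOTS], MSG_BYTES);
    /* Hand the slot back to the producer. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The daemon's periodic poll would call `ring_get` until it returns false. Keeping the slot count a power of two lets the free-running 32-bit indices wrap around without a special case.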

                        PinOS Attach/Detach

 Attach/Detach lets PinOS be active only during the execution of interest
        Avoids overhead
          e.g., PinOS need not be active during every OS boot
        Improves precision / accuracy
          Running PinOS over the entire run may pollute the collected instrumentation data

 Implementing Attach
        Read the entire state of the guest machine
        Start PinOS activity from that point on
        Use VT support for reading and setting hidden register state

                    Timeline: Native -> (Attach) -> PinOS -> (Detach) -> Native

             Code-Cache Indexing and Sharing

 Pin uses the VA as the code cache index
 In PinOS, different processes can use the same VA for different code
        The virtual address alone is not sufficient to distinguish code

 Option 1: <AddressSpaceID, VirtualAddress>
          Easy to implement (on x86, use the CR3 value)
          But no sharing of code across address spaces

 Option 2: <PhysicalAddress, VirtualAddress>
          Can share code across address spaces
          Allows persistence across application runs
          But much more challenging to implement
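The difference between the two options shows up as soon as two processes share code, for example a shared library mapped at the same VA. In this minimal C sketch (the key types and equality helpers are hypothetical), option 1 yields distinct keys because the CR3 values differ, while option 2 yields the same key, so one translation can serve both processes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Option 1: address space (on x86, the CR3 value) plus virtual address. */
struct key_asid { uint32_t cr3; uint32_t va; };

/* Option 2: physical address of the code plus virtual address. The VA is
   kept because translated code embeds guest VAs (e.g. a pushed return EIP). */
struct key_pava { uint32_t pa; uint32_t va; };

static bool eq_asid(struct key_asid a, struct key_asid b) {
    return a.cr3 == b.cr3 && a.va == b.va;
}

static bool eq_pava(struct key_pava a, struct key_pava b) {
    return a.pa == b.pa && a.va == b.va;
}
```

Option 2's extra difficulty is that the physical address behind a VA must be known, and kept valid, whenever a cached translation is reused.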

                                     Results on booting FC4-Linux

   (Figure: execution time (secs) and code cache space used (MB) for
   <AddressSpaceID, VirtualAddress> vs. <PhysicalAddress, VirtualAddress> indexing.
   Recoverable values: execution time with PhysicalAddress indexing is 840 secs;
   code cache space used is 1538 MB with AddressSpaceID vs. 71 MB with PhysicalAddress.)

                               <PhysicalAddress, VirtualAddress> is the clear winner!

             Interrupt/Exception Virtualization

 PinOS virtualizes interrupts and exceptions:
        Maintaining control
          Ex: Timer interrupt triggering process preemption
        Maintaining transparency
          Ex: Guest interrupt handler attempting to identify thread ID based on ESP

 Install own interrupt handlers in Interrupt Descriptor Table (IDT)
        So all interrupts and exceptions are routed through PinOS

 Handling interrupts (asynchronous)
        When PinOS receives one, put it on a queue
        Add a pending-interrupt check at every trace entry
        Set up the interrupted guest context from the trace's guest address and context
        Continue instrumentation at the corresponding guest interrupt handler

 Handling exceptions (synchronous)
        Recover the excepting guest address and context, and set up the guest context
        Continue instrumentation at the corresponding guest exception handler
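The asynchronous path can be sketched as a queue filled by PinOS's own IDT handler and drained by the check at trace entry. This C sketch is hypothetical (`irq_queue`, `irq_enqueue`, and `trace_entry_check` are illustrative names); it also assumes, beyond what the slides state, that delivery waits while the guest's virtual interrupt flag (IF) is clear, as the hardware would.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PENDING 32

/* Hypothetical queue of asynchronous interrupts that arrived while
   translated code was running. */
struct irq_queue {
    uint8_t vec[MAX_PENDING];
    unsigned head, count;
};

/* PinOS's own IDT handler: record the vector instead of delivering it now. */
static bool irq_enqueue(struct irq_queue *q, uint8_t vector) {
    if (q->count == MAX_PENDING)
        return false;
    q->vec[(q->head + q->count) % MAX_PENDING] = vector;
    q->count++;
    return true;
}

/* Check at every trace entry. If an interrupt can be delivered, dequeue it;
   the caller would then build the interrupted guest context from this
   trace's guest address and continue at the guest's handler for *vector. */
static bool trace_entry_check(struct irq_queue *q, bool guest_if, uint8_t *vector) {
    if (q->count == 0 || !guest_if)   /* nothing pending, or guest masked IRQs */
        return false;
    *vector = q->vec[q->head];
    q->head = (q->head + 1) % MAX_PENDING;
    q->count--;
    return true;
}
```

Checking only at trace entries means the guest always sees the interrupt at a real guest instruction boundary, which keeps the interrupted context consistent.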

                         Exception Virtualization
 Precise Exception Delivery
         In the face of “pseudo” instruction boundaries
         Log all guest-visible state changes and roll them back to the most recent
          guest instruction boundary

     Original Guest Code              Translated Code
         movw %ds, (%edx)                 spill %eax
                                          movw M.%ds, %ax
                                          movw %ax, (%edx)
                                          restore %eax
         call proc                        pushl <current-eip>
                                          jmp xlated-proc

     (Figure labels mark a guest instruction boundary and a “pseudo” instruction
      boundary within the translated code.)

 Faithful Exception Delivery
         When emulating instructions, check the same conditions and raise the same
          exceptions that the hardware semantics guarantee
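The log-and-rollback idea behaves like an undo log that is cleared at each real guest instruction boundary. In this C sketch (`struct undo_log`, `logged_write`, `log_rollback`, and `log_commit` are hypothetical names), the scratch write is modeled on the translated code above: `movw M.%ds, %ax` clobbers %eax, so if the following store faults, %eax must be restored before the exception reaches the guest handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_MAX 16

/* One undo record: a guest-visible location and its previous value. */
struct undo { uint32_t *loc; uint32_t old; };
struct undo_log { struct undo e[LOG_MAX]; size_t n; };

/* Reached at every real guest instruction boundary: nothing left to undo. */
static void log_commit(struct undo_log *l) { l->n = 0; }

/* Perform a guest-visible write, recording the old value first. */
static void logged_write(struct undo_log *l, uint32_t *loc, uint32_t val) {
    assert(l->n < LOG_MAX);
    l->e[l->n].loc = loc;
    l->e[l->n].old = *loc;
    l->n++;
    *loc = val;
}

/* On an exception at a pseudo-instruction boundary: undo in reverse order so
   the guest sees the state of the most recent real boundary. */
static void log_rollback(struct undo_log *l) {
    while (l->n > 0) {
        l->n--;
        *l->e[l->n].loc = l->e[l->n].old;
    }
}
```

Undoing in reverse order matters when the same location is written twice, as %eax is in the usage test: the oldest recorded value is restored last.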

     Coherence: Handling Self-Modifying Code

 Self-modifying code problem
        The content of a code page may change after PinOS has cached translations of that page

 Write-monitoring Solution
        Standard page-table trick: write-protect pages holding translated code so writes to them trap

 Bookkeeping
        Maintain a reverse page-mapping table
          i.e., a PA -> VA mapping table
        Upon bringing in code from a given physical page:
          Write-protect all virtual pages that have ever mapped this physical page
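The bookkeeping step can be sketched with a small reverse map. In this toy C version, `struct rmap`, `rmap_add`, `protect_code_page`, and the `writable` array (standing in for PTE write bits) are hypothetical; the point is that write-protecting one virtual page is not enough, because any alias of the physical page could be used to modify the code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPHYS 8          /* toy number of physical pages */
#define NVIRT 16         /* toy number of virtual pages  */
#define MAX_ALIASES 4

/* Reverse map: for each physical page, every virtual page that has mapped it. */
struct rmap {
    uint32_t va[NPHYS][MAX_ALIASES];
    unsigned n[NPHYS];
};

static bool writable[NVIRT];   /* toy stand-in for per-page PTE write bits */

/* Record that virtual page va maps physical page pa (duplicates ignored). */
static void rmap_add(struct rmap *r, uint32_t pa, uint32_t va) {
    assert(pa < NPHYS && va < NVIRT);
    for (unsigned i = 0; i < r->n[pa]; i++)
        if (r->va[pa][i] == va)
            return;
    assert(r->n[pa] < MAX_ALIASES);
    r->va[pa][r->n[pa]++] = va;
}

/* Called when code is first translated from physical page pa: protect every
   virtual alias, so any later store to that code traps into PinOS. */
static void protect_code_page(const struct rmap *r, uint32_t pa) {
    for (unsigned i = 0; i < r->n[pa]; i++)
        writable[r->va[pa][i]] = false;
}
```

A virtual mapping created after the page was protected would also need to start out write-protected; that case is omitted here.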

                         Experiment Setup
 Environment:
      Xen 3.0.2 running on Intel VT-enabled machines
      Guest domain installed with Fedora Core 4 Linux

 Benchmarks:
      Fedora Core 4 Linux boot
      Apache-bench (web-server)
      mysql-test (database server)

 Pintools:
      Insmix
        Code profiler that collects basic-block and instruction mix info
      CMP$im
        Cache simulator that models a multi-level cache hierarchy
        Results are in the paper

     Distribution of Kernel and User-level Instructions

                      Basic Block Count Results

            Top 5 hottest kernel-level basic blocks of mysql-test-alter-table

     Bbl Addr     Bbl Symbol Name                  Count   Num-Ins   Ins %
     0xc0111a40   delay_pit + 0x1a              93531291         2   1.17%
     0xc8aac20b   ext3_do_update_inode + 0x82   10177398         6   0.38%
     0xc011d58f   __might_sleep + 0x2a           5170776         5   0.16%
     0xc011d57f   __might_sleep + 0x1a           5170776         4   0.13%
     0xc011d565   __might_sleep                  5170776        10   0.32%
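The Ins % column is not defined on the slide; one reading consistent with all five rows is each block's share of N dynamically executed instructions. Under that assumption, the first row gives N, and the second row checks out against it:

```latex
\text{Ins\%} = 100 \times \frac{\text{Count} \times \text{Num-Ins}}{N}
\quad\Rightarrow\quad
N \approx \frac{93531291 \times 2}{0.0117} \approx 1.6 \times 10^{10},
\qquad
100 \times \frac{10177398 \times 6}{1.6 \times 10^{10}} \approx 0.38
```

The three __might_sleep rows give 0.16, 0.13, and 0.32 with the same N, matching the table to within rounding.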

                              Insmix Results

 Privileged instruction   mysql-test-alter-table   fc4-boot
 CLI                                     8912950    2806991
 STI                                     2217286     845921
 IRETD                                    599646     574204
 OUT                                      551209      57181
 OUTSW                                    990104       9824
 IN                                       311762      31994
 INSW                                        240      48403
 HLT                                         207       4458
 INVLPG                                      619      54923
 CLTS                                       1350         80
 RDTSC                                      7043       1777
 LGDT                                          0          2
 LLDT                                          0          2
 LIDT                                          0          2
 LTR                                           0          1
 INVD                                          0          0
 WBINVD                                        0          0
 RDMSR                                         0         15
 WRMSR                                         0          0
 RDPMC                                         0          0
 LMSW                                          0          0
 MOV CR                                       NA         NA
 MOV DR                                       NA         NA

     Performance of PinOS

                             Future Work

 Make PinOS capable of instrumenting Windows
 PinOS Infrastructure Support
      64-bit support (x86_64)
      Multi-Processor support (MP)

 Now that we have this powerful infrastructure,
  let’s write Pintools!
      Interesting Pintools include debuggers, profilers, tracing tools, etc.

 Plan to release PinOS to the public
   Interesting users and uses may demand further enhancements


 Thanks to the entire Pin team
      For giving us a robust Pin to start with

 Thanks to:
      Mark Charney
         For helping us better understand XED
         For fixing XED issues (only a few) very promptly

      Greg Lueck
        For many helpful discussions, especially about signals
         For fixing related bugs in mainline Pin

      Prof. Jonathan Shapiro and Swaroop Sridhar
         For collaboration on initial ideas about segmentation virtualization

     Thank You!


     Backup Slides…

             Correctness Issue with Trace Linking

                Guest code in process A          Guest code in process B
                  <V1, P1>: jmp V2                 <V1, P1>: jmp V2
                  <V2, P2>                         <V2, P3>

                Code cache:
                  V1': translation of <V1, P1>, ending in jmp V2'
                  V2': translation of <V2, P2>

     Step 1: Process A is instrumented and its translation is cached.
     Step 2: Process B is instrumented and finds that <V1, P1> is already
             translated, so there is no need to retranslate.
             However, the jump to V2' is incorrect because V2 is now
             mapped to P3 instead of P2!

              Code-Cache Indexing and Sharing

 Our solution:
         Check the predicted page mapping against the actual one at each trace entry
         Maintain a “SoftTLB” that caches the current guest page mappings
         Assign a TLB entry once, and always use that same entry, for a given VA->PA mapping
           So that the trace-entry check is a lookup at a constant address

                   A Translated Trace in Code Cache
                 Translation of <V2, P2>:
                 V2’: if (SoftTLB[V2] != P2) {
                          // <V2, P2> is invalid.
                          call PinOS();   // Never returns
                      }
                      // <V2, P2> is still valid.
                      // Execute the rest of the trace.

                 SoftTLB (current guest mappings):   VA    PA
                                                     V1    P1
                                                     V2    P3
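The trace-entry check above can be written out in C. In this sketch the SoftTLB is a flat array, and the slot number is assumed to be fixed when the trace is translated (the "assign once" rule), so the check reads a constant address; `softtlb_set`, `softtlb_invalidate`, and `trace_still_valid` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SOFTTLB_SLOTS 64
#define NO_MAPPING UINT32_MAX

/* Each VA->PA pair a trace was translated under gets one fixed slot; the
   slot holds the guest's *current* physical page for that VA. */
static uint32_t softtlb_pa[SOFTTLB_SLOTS];

/* PinOS records the current mapping (e.g. after walking guest page tables). */
static void softtlb_set(unsigned slot, uint32_t pa) { softtlb_pa[slot] = pa; }

/* PinOS drops a mapping it can no longer trust. */
static void softtlb_invalidate(unsigned slot) { softtlb_pa[slot] = NO_MAPPING; }

/* The check at trace entry: a translation of <V2, P2> is valid only if the
   SoftTLB still says V2 maps to P2. */
static bool trace_still_valid(unsigned slot, uint32_t predicted_pa) {
    return softtlb_pa[slot] == predicted_pa;
}
```

In the process A/B scenario, switching to process B updates the slot for V2 to P3, so the cached translation of <V2, P2> fails the check and control goes back to PinOS instead of running stale code.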

Coherence: Handling Page-Mapping Changes

 Problem
      The guest's page mappings may change after PinOS has cached them

 Solution
      Xen already marks guest page-table pages as read-only and thus
       tracks all writes to them
      Modify Xen to inform PinOS which page-table entries changed
      PinOS then invalidates these page mappings in its SoftTLB
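The Xen-to-PinOS notification reduces to turning a changed page-table entry into the virtual page it covers. This C sketch assumes 32-bit non-PAE paging, where a virtual page number is the page-directory index times 1024 plus the page-table index, and a toy SoftTLB indexed by virtual page; `on_guest_pte_write` is a hypothetical hook, not a Xen API.

```c
#include <assert.h>
#include <stdint.h>

#define PTES_PER_TABLE 1024          /* 32-bit non-PAE paging                */
#define NO_MAPPING UINT32_MAX
#define TOY_VPAGES 64                /* toy SoftTLB covering 64 virtual pages */

static uint32_t soft_tlb[TOY_VPAGES];   /* virtual page -> physical page */

/* Hypothetical hook: Xen has trapped a guest write to a page-table page and
   reports which entry changed; PinOS drops the mapping that entry covered. */
static void on_guest_pte_write(uint32_t pde_index, uint32_t pte_index) {
    assert(pte_index < PTES_PER_TABLE);
    uint32_t vpage = pde_index * PTES_PER_TABLE + pte_index;
    if (vpage < TOY_VPAGES)
        soft_tlb[vpage] = NO_MAPPING;
}
```

After invalidation, the next trace-entry check against that page fails, forcing PinOS to look up the new mapping before running any cached translation.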

     Virtualization of System-Level State
                      Segmentation Support
                         •   Segment Registers
                         •   GDT/LDT
                      Paging Support
                         •   CR3 (PDBR)
                         •   Page-table structures
                      Interrupt/Exception Delivery
                         •   IDT
                      Task support
                         •   TR
                         •   Including privileged bits like IF

     Review of IA-32 Memory Management

             Review of segment addressing

      Segment registers (CS, DS, SS, ES, FS, GS) each hold a segment selector,
      which selects a segment descriptor in the LDT or the GDT (8K entries each).

                                                       Courtesy: Gregory Lueck

            Review of segment addressing

      Segment selector: index; table indicator (0 – GDT, 1 – LDT); privilege info
      Segment descriptor: base address, limit, other

                                               Courtesy: Gregory Lueck

              Review of segment addressing

                         mov %fs:0x10, %eax

      FS selector: index, table indicator 1 (LDT) -> segment descriptor (base address, limit, other)
      Address used = base address + 0x10

                                               Courtesy: Gregory Lueck

        Hidden Part of Segment Register

      Visible part: index, GDT/LDT        Hidden part: base, limit, access rights

 Hidden part “cached” from LDT / GDT
 May be out of sync with the table, and software depends on this!
 Saving a segment register writes only the visible part to memory
 Restoring reloads the hidden part from the GDT / LDT
 Asymmetry: save / restore may change contents!

                                                 Courtesy: Gregory Lueck
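The save/restore asymmetry can be reproduced with a toy model in which the hidden part is reduced to a base address. This C sketch is illustrative only (`struct seg_reg`, `gdt_base`, and the helper names are hypothetical) and ignores the selector's table-indicator and privilege bits; it uses selector 0x10 (GDT entry 2), as in the irreversible-segmentation example.

```c
#include <assert.h>
#include <stdint.h>

/* Toy segment register: visible selector plus hidden descriptor cache,
   with the descriptor reduced to just a base address. */
struct seg_reg { uint16_t selector; uint32_t cached_base; };

#define GDT_ENTRIES 8
static uint32_t gdt_base[GDT_ENTRIES];   /* toy GDT: entry -> base address */

/* A load (e.g. mov to %ds) fills the hidden part from the table. */
static void seg_load(struct seg_reg *r, uint16_t sel) {
    assert((sel >> 3) < GDT_ENTRIES);    /* index = selector bits 3..15 */
    r->selector = sel;
    r->cached_base = gdt_base[sel >> 3];
}

/* Saving writes only the visible selector to memory... */
static uint16_t seg_save(const struct seg_reg *r) { return r->selector; }

/* ...and restoring is another load, which re-reads the (possibly changed) table. */
static void seg_restore(struct seg_reg *r, uint16_t saved) { seg_load(r, saved); }
```

Running the usage sequence shows the problem: after the guest rewrites GDT entry 2, a save followed by a restore silently replaces the cached base A with B.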

            Irreversible Segmentation Problem


     Step 1: GDT[0x10] = A; DS selector 0x10, descriptor cache A
     Step 2: Guest writes B into GDT[0x10]; DS descriptor cache is still A
     Step 3: The instrumentation system performs a gratuitous save / restore of DS;
             the restore reloads from GDT[0x10], so the DS descriptor cache becomes B
             Wrong! It should still be A, as the guest has not yet explicitly
             performed a load into DS!

                       Segmentation Virtualization
  Key Insight: Just virtualize hardware descriptor caches
            Don’t virtualize segmentation tables GDT/LDT at all!

  Whenever the guest explicitly loads a segment register:
            Copy the guest segment descriptor into the corresponding cache
            Issue the hardware register load instruction with a modified selector
            Use dynamic translation for doing this

  Example (figure): the guest issues mov 0x10 -> ds. PinOS copies entry 0x10 of
  the guest GDT/LDT into the DS descriptor cache, one of the descriptor caches
  (CS, DS, ES, FS, GS, SS, LDTR, TR) stolen by PinOS in its own GDT, which is the
  one active on the hardware. It then issues mov 0x2 -> ds on the hardware, and
  the emulated DS register is updated with 0x10.
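The key insight can be sketched by leaving the guest's table untouched and copying a descriptor only on an explicit guest load. In this C sketch, `guest_gdt`, `desc_cache`, `emulated_sel`, `virt_seg_load`, and `hw_reload` are hypothetical names, descriptors are again reduced to base addresses, and loading the hardware register with PinOS's own selector (the `mov 0x2 -> ds` step) is not modeled.

```c
#include <assert.h>
#include <stdint.h>

enum seg { SEG_CS, SEG_DS, SEG_ES, SEG_FS, SEG_GS, SEG_SS, NSEGS };

#define GDT_ENTRIES 8
static uint32_t guest_gdt[GDT_ENTRIES]; /* guest's table: never virtualized       */
static uint32_t desc_cache[NSEGS];      /* PinOS-owned cache slot per register     */
static uint16_t emulated_sel[NSEGS];    /* selector the guest believes it loaded   */

/* Translation of an explicit guest load such as "mov 0x10 -> ds":
   snapshot the descriptor now, remember the guest's selector. */
static void virt_seg_load(enum seg s, uint16_t guest_sel) {
    assert((guest_sel >> 3) < GDT_ENTRIES);
    desc_cache[s] = guest_gdt[guest_sel >> 3];
    emulated_sel[s] = guest_sel;
}

/* A gratuitous save/restore by PinOS reloads from its own cache slot,
   so later guest writes to its GDT cannot leak into the register. */
static uint32_t hw_reload(enum seg s) { return desc_cache[s]; }
```

Because the snapshot is taken only on explicit guest loads, PinOS needs no write tracking on the guest's GDT/LDT, matching the "no need for tracking guest table writes" point below.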

     Irreversible Segmentation Problem Solved
     Hardware GDT entry 0x2 (PinOS's DS descriptor cache) holds A throughout.
     Step 1: Guest GDT[0x10] = A; DS: hardware selector 0x2, descriptor cache A;
             emulated DS selector 0x10
     Step 2: Guest writes B into GDT[0x10]; DS descriptor cache is still A
     Step 3: A gratuitous save / restore of DS by the instrumentation system reloads
             from hardware GDT entry 0x2, so the DS descriptor cache stays A,
             as the guest expects

         Implications of Virtualization Scheme
 Gratuitous loads are now performed with the cached descriptors
        Ensures preservation of guest-expected hardware semantics

 Allows PinOS to easily steal the rest of the hardware table for its own descriptors
 With this scheme, there is no need to track guest table writes!
 However, PinOS must tame/emulate all segmentation instructions
        lds/les/lfs/lgs/lss
        mov ds/es/fs/gs/ss, […]
        mov […], ds/es/fs/gs/ss
        pop ds/es/fs/gs/ss
        push ds/es/fs/gs/ss
        lgdt, sgdt
        lldt, sldt
        lar, lsl, verr, verw
        ltr, str, task gate transfer through interrupt
        Far jumps, calls and returns, iret, sysenter and sysexit
        Software interrupt: int n, into, int 3
        Hardware interrupt / exception

                         Related Work I
 Dynamic Optimization
      Dynamo [2000], DynamoRIO [2003]
      Mojo [2000]

 Software Dynamic Translation
      Strata [2003]

 Dynamic Binary Analysis and Instrumentation
      Shade [1994] - SPARC & MIPS
      Walkabout [2002], Valgrind [2004]
      Pin [2005], HDTrans [2006]

Probe-based Dynamic Binary Instrumentation
      KernInst [1999], DynInst [2000], LTT [2000],
      DProbes [2001], KProbes [2004]
      DTrace [2004], SystemTap [2005]

                         Related Work II

 Full Machine Simulation/Emulation
      Embra (SimOS) [1996] – MIPS
      Simics [2002]
      Bochs [2002], QEMU [2005]

 Para-Virtualization
      Denali [2002], Xen [2003]

 Full Virtualization
      VMware [2002]

 Hardware-assisted Virtualization
      Intel Virtualization Technology (VT) [2006]
      AMD Pacifica Technology [2006]