Slide 1 - USFCS

Document Sample
Slide 1 - USFCS Powered By Docstoc
					Intel‟s „cmpxchg‟ instruction

    How does the Linux kernel‟s
  „cmos_lock‟ mechanism work?
    Review of the i386 registers
           EAX                                            CS

           EBX                                            DS

           ECX                                            ES

           EDX                                            FS

           ESI                                            GS

           EDI                                            SS
                                              Segment Registers (16-bits)

           ESP                                      EIP
General Registers (32-bits)

                              Program Control and Status Registers (32 bits)
       The x86 „system‟ registers
           CR0                          DR0

           CR1                          DR1

           CR2                          DR2

           CR3                          DR3

           CR4                          DR4

           CR5                          DR5
           CR6                          DR6
           CR7                          DR7
Control Registers (32-bits)   Debug Registers (32-bits)

         means „unimplemented‟                        GDTR
     How often is „cmpxchg‟ used?
$ cat vmlinux.asm | grep cmpxchg

c01046de:    f0   0f   b1   15   3c 99 30   lock   cmpxchg   %edx,0xc030993c
c0105591:    f0   0f   b1   15   3c 99 30   lock   cmpxchg   %edx,0xc030993c
c01055d9:    f0   0f   b1   15   3c 99 30   lock   cmpxchg   %edx,0xc030993c
c010b895:    f0   0f   b1   11              lock   cmpxchg   %edx,(%ecx)
c010b949:    f0   0f   b1   0b              lock   cmpxchg   %ecx,(%ebx)
c0129a9f:    f0   0f   b1   0b              lock   cmpxchg   %ecx,(%ebx)
c0129acf:    f0   0f   b1   0b              lock   cmpxchg   %ecx,(%ebx)
c012d377:    f0   0f   b1   0e              lock   cmpxchg   %ecx,(%esi)
c012d41a:    f0   0f   b1   0e              lock   cmpxchg   %ecx,(%esi)
c012d968:    f0   0f   b1   16              lock   cmpxchg   %edx,(%esi)
c012e568:    f0   0f   b1   2e              lock   cmpxchg   %ebp,(%esi)       Here’s the occurrence
c012e57a:    f0   0f   b1   2e              lock   cmpxchg   %ebp,(%esi)
c012e58a:    f0   0f   b1   2e              lock   cmpxchg   %ebp,(%esi)       that we studied in the
c012e83f:    f0   0f   b1   13              lock   cmpxchg   %edx,(%ebx)
c012e931:    f0   0f   b1   0a              lock   cmpxchg   %ecx,(%edx)         ‘rtc_cmos_read()’
c012ea94:    f0   0f   b1   11              lock   cmpxchg   %edx,(%ecx)
c012ecf4:    f0   0f   b1   13              lock   cmpxchg   %edx,(%ebx)
c012f08e:    f0   0f   b1   4b   18         lock   cmpxchg   %ecx,0x18(%ebx)
c012f163:    f0   0f   b1   11              lock   cmpxchg   %edx,(%ecx)
                                                                               …plus 28 other times!
c0150d0f:    f0   0f   b1   3b              lock   cmpxchg   %edi,(%ebx)
c0150d87:    f0   0f   b1   31              lock   cmpxchg   %esi,(%ecx)
c0199c5e:    f0   0f   b1   0b              lock   cmpxchg   %ecx,(%ebx)
c024b06f:    f0   0f   b1   0b              lock   cmpxchg   %ecx,(%ebx)
c024b2fe:    f0   0f   b1   51   18         lock   cmpxchg   %edx,0x18(%ecx)
c024b321:    f0   0f   b1   51   18         lock   cmpxchg   %edx,0x18(%ecx)
c024b34b:    f0   0f   b1   4b   18         lock   cmpxchg   %ecx,0x18(%ebx)
c024b960:    f0   0f   b1   53   18         lock   cmpxchg   %edx,0x18(%ebx)
         Intel‟s documentation
• You can find out what any of the Intel x86
  instructions does by consulting the official
  software developer‟s manual, online at:

• Our course-webpage has a link to this site
  that you can just click (under „Resources‟)
• The instruction-set reference is two parts:
  – Volume 2A: for opcodes A through M
  – Volume 2B: for opcodes N through Z
        Example: „cmpxchg‟
• Operation of the „cmpxchg‟ instruction is
  described (on 3 pages) in Volume 2A
• There‟s an English-sentence description,
  and also a description in „pseudo-code‟
• You probably do not want to print out this
  complete volume (.pdf) – over 700 pages!
• (You could order a printed copy from Intel)
                    Instruction format
• Intel‟s assembly language syntax differs
  from the GNU/Linux syntax (known as
  „AT&T syntax‟ with roots in UNIX history)
• When AT&T syntax is used, the „cmpxchg‟
  instruction has this layout:
      [lock] cmpxchg reg, reg/mem

                      mnemonic    source   destination
optional ‘prefix’
                       opcode    operand    operand
(used for SMP)
          „effects‟ and „affects‟
• According to Intel‟s manual, the „cmpxchg‟
  instruction also uses two „implicit‟ operands
  (i.e., operands not mentioned in the instruction)
   – The CPU‟s accumulator register
   – The CPU‟s EFLAGS register
• The accumulator-register (EAX) is both a
  source-operand and a destination-operand
• The six status-bits in the EFLAGS register will
  get modified, as a „side-effect‟ this instruction
        „cmpxchg‟ description
• This instruction compares the accumulator
  with the destination-operand (so the ZF-bit
  in EFLAGS gets assigned accordingly)
• Then:
  – If (accumulator == destination)
     { ZF  1; destination  source; }
  – If (accumulator != destination)
     { ZF  0; accumulator  destination; }
            An instruction-instance
• In our recent disassembly of Linux‟s kernel
  function „rtc_cmos_read()‟, this „cmpxchg‟
  instruction-instance was used:

       lock cmpxchg               %edx, cmos_lock

   prefix       opcode      source-operand    destination-operand

 Note: Keep in mind that the accumulator %eax will affect what happens!
 So we need to consider this instruction within it‟s surrounding context
          The complete function
c0105574 <rtc_cmos_read>:
c0105574:       53                       push %ebx
c0105575:       9c                       pushf
c0105576:       5b                       pop    %ebx
c0105577:       fa                       cli
c0105578:       64 8b 15 08   20 30 c0   mov    %fs:0xc0302008,%edx
c010557f:       0f b6 c8                 movzbl %al,%ecx
c0105582:       42                       inc    %edx
c0105583:       c1 e2 08                 shl    $0x8,%edx
c0105586:       09 ca                    or     %ecx,%edx
c0105588:       a1 3c 99 30   c0         mov    0xc030993c,%eax
c010558d:       85 c0                    test %eax,%eax
c010558f:       75 f7                    jne    c0105588 <rtc_cmos_read+0x14>
c0105591:       f0 0f b1 15   3c 99 30   lock cmpxchg %edx,0xc030993c
c0105598:       c0
c0105599:       85 c0                    test   %eax,%eax
c010559b:       75 eb                    jne    c0105588 <rtc_cmos_read+0x14>
c010559d:       88 c8                    mov    %cl,%al
c010559f:       e6 70                    out    %al,$0x70
c01055a1:       e6 80                    out    %al,$0x80
c01055a3:       e4 71                    in     $0x71,%al
c01055a5:       e6 80                    out    %al,$0x80
c01055a7:       c7 05 3c 99   30 c0 00   movl   $0x0,0xc030993c
c01055ae:       00 00 00
c01055b1:       53                       push %ebx
c01055b2:       9d                       popf
c01055b3:       0f b6 c0                 movzbl %al,%eax
c01055b6:       5b                       pop    %ebx
c01055b7:       c3                       ret
      The „preparation‟ steps
• The instructions that preceed „cmpxchg‟
  will setup register EDX (source operand)
  and register EAX (the x86 „accumulator‟)
• Several instructions are used to set up a
  value in EDX, and result in this layout:
         31                                             8 7                0
               The current processor‟s value for               CMOS
  EDX:             „per_cpu__cpu_number‟                      register‟s
                            plus 1                              index

                                                         this might be zero…
         …but this part is guaranteed to be non-zero!
     The „cmos_lock‟ variable
• This global variable is initialized to zero,
  meaning that access to CMOS memory
  locations is not currently „locked‟
• If some CPU stores a non-zero value in
  this variable‟s memory-location, it means
  that access to CMOS memory is „locked‟
• The kernel needs to insure that only one
  CPU at a time can set this „lock‟
     The „most likely‟ senario
• One of the CPUs wishes to access CMOS
  memory – so it needs to test „cmos_lock‟
  to be sure that access is now „unlocked‟
  (i.e., cmos_lock == 0 is true)
• The CPU copies the „cmos_lock‟ variable
  into the EAX, where it can then be tested
  using the „test %eax, %eax‟ instruction
• A conditional-jump follows the test
                The „busy-wait‟ loop

# Here is a „busy-wait‟ loop, used to wait for the CMOS access to be „unlocked‟
spin:    mov      cmos_lock, %eax               # copy lock-variable to accumulator
         test     %eax, %eax                    # was CMOS access „unlocked‟?
         jnz      spin                          # if it wasn‟t, then check it again

         # A CPU will fall through to here if „unlocked‟ access was detected,
         # and that CPU will now attempt to set the „lock‟ – in other words, it
         # will try to assign a non-zero value to the „cmos_lock‟ variable.

         # But there‟s a potential „race‟ here – the „cmos_lock‟ might have been
         # zero when it was copied, but it could have been changed by now…
         # … and that‟s why we need to execute „lock cmpxchg‟ at this point
        Busy-waiting will be brief
               spin:    # see if the lock-variable is clear
                        mov       cmos_lock, %eax
                        test      %eax, %eax
                        jnz       spin

                        # ok, now we try to grab the lock
                        lock cmpxchg %edx, cmos_lock

                        # did another CPU grab it first?
                        test     %eax, %eax
                        jnz      spin

If our CPU wins the „race‟, the (non-zero) value from source-operand EDX will
have been stored into the (previously zero) „cmos_lock‟ memory-location, but
the (previously zero) accumulator EAX will not have been modified; hence our
CPU will not jump back, but will fall through and execute the „critical section‟ of
code (just a few instructions), then will promptly clear the „cmos_lock‟ variable.
               The „less likely‟ case
                spin:    # see if the lock-variable is clear
                         mov       cmos_lock, %eax
                         test      %eax, %eax
                         jnz       spin

                         # ok, now we try to grab the lock
                         lock cmpxchg %edx, cmos_lock

                         # did another CPU grab it first?
                         test     %eax, %eax
                         jnz      spin

If our CPU loses the „race‟, because another CPU changed „cmos_lock‟ to some
non-zero value after we had fetched our copy of it, then the (now non-zero) value
from the „cmos_lock‟ destination-operand will have been copied into EAX, and so
the final conditional-jump shown above will take our CPU back into the spin-loop,
where it will resume busy-waiting until the „winner‟ of the race clears „cmos_lock‟.
start     Setup nonzero
           value in EDX                  flowchart

          EAX  cmos_lock            EAX
                                   is zero?


               yes                                              no
                            EAX equals cmos_lock ?

            ZF  1                                       ZF  0
        cmos_lock  EDX                              EAX  cmos_lock

                            no       EAX       yes   critical
                                   is zero?          section

                                                           cmos_lock  0

    „btr‟/‟bts‟ versus „cmpxchg‟
• In an earlier lesson we used the „btr‟/‟bts‟
  instructions to achieve „mutual exclusion‟,
  whereas Linux uses „cmpxchg‟ to do that
• We think „btr‟/‟bts‟ is easier to understand,
  so why do you think the Linux developers
  would prefer to use „cmpxchg‟ instead?

          <allow some class discussion here>
        In-class exercise #1
• Was it really necessary to insert a second
  „test %eax, %eax‟ following „cmpxchg‟?
• Can you design a simple LKM that would
  verify your answer to that question?
• The Intel documentation does not state
  precisely how other EFLAGS status-bits
  (besides ZF) are affected by „cmpxchg‟,
  only that they reflect the comparison of
  „accumulator‟ and „destination‟ operands
 31                               11 10   9   8   7   6   5   4   3   2   1   0

                 AVR   N          OD I T S Z A P C
                           IOPL             0 0 1
                 CMF   T          FFFFFF F F F

• Usually the CPU implements comparison-
  of-operands by performing a subtraction
        In-class exercise #2
• Can you decide what Intel means by “the
  comparison operation”, by writing suitable
  code that examines the effect on EFLAGS
  of „cmpxchg opnd1, opnd2‟ and these two
  plausable alternatives:
               cmp opnd1, opnd2
               cmp opnd2, opnd1