Solaris Best Practices Workshop - DTrace

					Solaris 10 DTrace Workshop
Bob Netherton
Principal Engineer / OS Ambassador
Sun Microsystems, Inc.
Scott Dickson
Systems Engineer / OS Ambassador
Sun Microsystems, Inc.
You gain proficiency with DTrace as
     you would with a musical
            instrument.

    Practice, Practice, Practice


                                      Slide 2
Agenda
• Introduction to DTrace
    > Safe Dynamic Instrumentation
• DTrace concepts & structure
    > Probes & Providers
    > Actions & the D Language
    > DTrace scripts
•   General Approaches to Using DTrace
•   Common Scenarios
•   DTrace ToolKit
•   DTrace and Dynamic Languages
                                         Slide 3
Learning Goals
• Understand why DTrace is needed and where it can
  and should be used
• Learn the basic DTrace concepts and terminology
• Apply these concepts by writing your first DTrace
  scripts and through simple exercises
• Be able to digest the Solaris Dynamic Tracing
  Guide



                                                      Slide 4
http://sun.com/software/solaris/benchmarks.html

Extreme Performance
• Designed for CMT and highly threaded loads
• Virtual memory: memory placement and page size
• Networking stack built for 10Gb and beyond
• Network virtualization
• Scale up or scale out
• Unique observability – optimize what you can see

                                                                       Slide 5
http://sun.com/bigadmin/content/dtrace/

Extreme Observability: DTrace
• Designed for use on production systems
• Safe. Minimally intrusive.
• Problems solved in minutes, not days
• Instrument every line in every application . . .
• . . . but maintain system perspective
• No code changes required

                                                                           Slide 6
DTrace Results on Production Systems
• Financial Database: +32% in a morning
• Online Parcel Tracking System: +35% in an afternoon
• Futures Forecasting Application: +80% in a day
• Message Handling Benchmark: +267% in 2 days
• Data Routing Application: +300% in 5 hours

Source: Sun Microsystems, Inc.                                                           Slide 7
Why Dynamic Tracing?
• We know how to debug fatal, non-reproducible
  failure:
  > Obtain core file or crash dump
  > Debug problem postmortem using mdb(1), dbx(1)
• Techniques for debugging transient failures are not
  so clear or well-defined
  > Typical techniques push traditional tools (e.g. truss(1),
    mdb(1)) beyond their design centers
  > Many transient problems cannot be debugged at all
    using current tools and techniques
  > Where do you go after truss?
                                                                Slide 8
Transient Failure Definition
• Any unacceptable behavior that does not result in
  fatal failure of the system
• May be a clear failure:
  > “read(2) is returning EIO on a device that isn't reporting
    any errors.”
  > “Our application occasionally doesn't receive its timer
    signal.”
  > “One of our threads is missing a condition variable
    wakeup.”
  > “It's not running as fast as I think it should be.”


                                                                 Slide 9
Transient failure
• May be a failure based on customer's definition of
  “unacceptable”:
  > “We were expecting to accommodate 100 users per CPU
    – and we're able to get no more than 60 users per CPU.”
  > “Every morning from about 9am to about 10:30am, the
    system is a dog.”
• In these situations, it is up to someone to
  understand the performance inhibitors – and either
  eliminate them or reset expectations


                                                              Slide 10
Debugging transient failure
• Historically, we have debugged transient failure
  using process-centric tools: truss(1), pstack(1),
  prstat(1), etc.
• These tools were not designed to debug systemic
  problems, but they are what we have
• The tools designed for systemic problems (i.e.,
  mdb(1)) are designed for postmortem analysis...




                                                      Slide 11
Postmortem techniques
• One technique is to use postmortem analysis to
  debug transient problems by inducing fatal failure
  during period of transient failure
• Better than nothing, but not by much:
  > Requires inducing fatal failure, which nearly always
    results in more downtime than the transient failure
  > Requires a keen intuition to be able to suss out a
    dynamic problem from a static snapshot of state




                                                           Slide 12
Invasive techniques
• If existing tools cannot root-cause transient failure,
  more invasive techniques must be used
• Typically, custom instrumentation is developed for
  the failing program and/or the kernel
• The customer reproduces the problem using the
  instrumented binaries




                                                           Slide 13
Invasive techniques
• Requires either:
  > running instrumented binaries in production
  or
  > reproducing a transient problem in a development
       environment
• Neither of these is desirable!
• Invasive techniques are slow, error prone, and often
  ineffective – we must develop a better way



                                                       Slide 14
Dynamic Instrumentation
• Static probes (such as TNF) are designed to collect
  a specific set of data
  > The questions are asked in advance of the problem
  > Extending the data collection would require a rebuild of
    the system
• Want to be able to dynamically modify the system to
  record arbitrary data
• Must be able to do this on a production system
• Must be completely safe – there should be no way
  to induce a fatal failure
                                                               Slide 15
We Need an Example
           Why is the system suddenly slow?
            No one is using CPU, memory, I/O.
                  Just lots of cross-calls.
       What process is causing the cross-calls?
                  The X Windows Server.
Which X Windows Servers are causing the cross-calls?
                         These six.
    What are they doing when the cross-calls happen?
They are rapidly mapping /dev/null to create empty pixmaps.
                 Why are they doing that?
        They are running a stock-ticker applet that
    updates across the network 100 times per second.
                                                              Slide 16
DTrace
Solaris Dynamic Tracing – An Observability Revolution
• Built into Solaris 10 from the outset
• Seamless, global view of the system from user-level
  thread to kernel
• Not reliant on pre-determined trace points, but
  dynamic instrumentation
• Data aggregation at source minimizes requirement
  for postprocessing
• Built for live use on production systems
   > Completely safe
   > No loops – so no halting problem!
                                                        Slide 17
DTrace
Solaris Dynamic Tracing – An Observability Revolution

• Ease-of-use and instant gratification engenders
  serious hypothesis testing
   > Tightens loop of hypothesis – instrument – test &
     observe – refine and repeat
• Instrumentation directed by high-level control
  language “D” (not unlike AWK or C) for easy
  scripting and command line use
• Comprehensive probe coverage and powerful data
  management allow for concise answers to arbitrary
  questions
                                                         Slide 18
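As a preview of that concision (aggregations and the @ syntax are covered later in this workshop), a single hypothetical one-liner answers "who is making all the system calls?":

```
# Count system calls per executable name; press Ctrl-C to print totals
dtrace -n 'syscall:::entry { @[execname] = count(); }'
```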
This Workshop
• An introduction to the basics of DTrace
  > With exercises so you can play along at home
• Not designed to be a comprehensive tutorial or act
  as a reference guide
• After this, you should feel comfortable diving into
  the Solaris Dynamic Tracing Guide for more detail




                                                        Slide 19
DTrace and Privileges
• dtrace_proc
  > Permits use of pid and fasttrap providers
  > No visibility into kernel or other users' processes
  > proc_owner allows probes in all processes
• dtrace_user
  > Permits use of profile and syscall providers
  > No visibility into kernel or other users' processes
• dtrace_kernel – superset of dtrace_user
  > Full kernel visibility, all processes
  > Cannot issue destructive actions
      > Requires dtrace_proc, dtrace_user, dtrace_kernel & [e]uid=0
                                                                   Slide 20
DTrace and Privileges
• Assign privileges to a user or a role in /etc/user_attr
• Add privileges for DTrace to “defaultpriv”
  defaultpriv=basic,dtrace_user,dtrace_proc,\
     dtrace_kernel,proc_owner

• proc_owner allows you to send signals to other
  processes and to inspect their internal state




                                                            Slide 21
Create a DTrace Role
• Create an entry for the role in /etc/security/prof_attr
  dtrace::::DTrace Role:
• Configure dtrace command in /etc/security/exec_attr
  dtrace:suser:cmd:::/usr/sbin/dtrace:uid=0
• Add the role to /etc/passwd & /etc/shadow
  roleadd -s /bin/pfksh -P dtrace dtrace
  passwd dtrace
• Add this role to the desired users
  usermod -R dtrace dickson
• Use su to assume a role once you have logged in.

                                                            Slide 22
DTrace Overview




                  Slide 23
DTrace Components
• Probes
  > A point of instrumentation
• Providers
  > DTrace-specific facilities for managing probes, and the
    interaction of collected data with consumers
• Consumers
  > A process that interacts with dtrace – typically
    dtrace(1)
• Using DTrace
  > Command line – dtrace(1)
  > Scripts written in the 'D' language
                                                              Slide 24
DTrace Probes
• A probe is a point of instrumentation
• A probe
  >   Is made available by a provider
  >   Identifies the module and function that it instruments
  >   Has a name
  >   Is assigned an integer identifier
• A probe is uniquely identified by its fully qualified
  name: provider:module:function:name
  > Patterns can be used for each component of the name
  > Missing components match everything

                                                               Slide 25
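For example (a sketch; the exact probes matched vary by system and release), sh(1)-style glob patterns can be used in any component of a probe name, and empty components match everything:

```
# List syscall provider probes whose function name starts with "read"
dtrace -l -n 'syscall::read*:entry'

# Empty module field: match "entry" probes in every module
dtrace -l -n 'syscall:::entry'
```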
DTrace Providers
• A provider represents a methodology for
  instrumenting the system
• Providers make probes available to the DTrace
  framework
• DTrace informs providers when a probe is to be
  enabled
• Providers transfer control to DTrace when a probe is
  hit (fired) ...
  > ... by any thread in any process on any processor


                                                        Slide 26
How it works
• dtrace(1) compiles the D language script
• Intermediate code is checked for safety (like java)
• Compiled code is executed in the kernel by DTrace
• DTrace instructs the provider to enable probes
• As soon as the D program exits, all instrumentation
  is removed
• No limit (modulo system resources) on number of D
  scripts that can run simultaneously
• Different users can use DTrace simultaneously
  without corruption or collision issues
                                                        Slide 27
  DTrace – The Big Picture
[Architecture diagram]
• DTrace consumers in userland – dtrace(1M), lockstat(1M),
  plockstat(1M), and D scripts (script.d) – sit on top of
  libdtrace(3LIB), which enters the kernel through the
  dtrace(7D) driver
• In the kernel, the DTrace framework coordinates the
  providers: sysinfo, vminfo, proc, syscall, sdt, fasttrap, fbt

                                                                                       Slide 28
DTrace Sample Script
• Stable script to track time spent on each CPU

sched:::on-cpu                        /* probe */
{
   self->ts = timestamp;
}

sched:::off-cpu                       /* probe */
/self->ts/                            /* predicate */
{
   @[curcpu->cpu_id] =                /* action */
      quantize(timestamp - self->ts);
}

                                                      Slide 29
Common Providers
• DTrace has quite a few providers, e.g.
  > function boundary tracing (fbt) – instrument every
      function entry and return in the kernel
  >   syscall – instrument the system call table
  >   lockstat – instrument the kernel synchronization primitives
  >   profile – configurable-rate profile interrupt
  >   vminfo – instrument kernel vm statistics (like vmstat)
  >   sysinfo – instrument kernel sys statistics (like mpstat)
  >   io – instrument disk I/O events
  >   pid – instrument every function entry and return in a user
      process
  >   ...                                                       Slide 30
DTrace Consumers
• A DTrace consumer is a process that interacts with
  DTrace
• No limit on concurrent consumers; DTrace handles
  the multiplexing
• Some programs are DTrace consumers only as an
  implementation detail
• dtrace(1M) is a DTrace consumer that acts as a
  generic front-end to the DTrace facility
  > Almost always this is what you will use
  > lockstat, plockstat also are DTrace consumers

                                                       Slide 31
Listing Probes
• Probes can be listed with the “-l” option to
  dtrace(1M)
• You can list probes
  >   from a specific provider: “-P <provider>”
  >   in a specific module: “-m <module>”
  >   in a specific function: “-f <function>”
  >   with a specific name: “-n <name>”
• Remember a fully qualified probe name is
  > provider:module:function:name


                                                  Slide 32
Exercise 1
• Use dtrace to list all available probes
  > How many probes are defined on your system?
  > What are the different providers?
  > Which provider provides the greatest number of probes?
• List the probes
  >   in the “read” function
  >   in the “ufs” module
  >   with the name “xcalls”
  >   from the “sysinfo” provider


                                                             Slide 33
Exercise 1
• Use dtrace to list all available probes
 # dtrace -l | more
    ID PROVIDER MODULE FUNCTION NAME
     1   dtrace                  BEGIN
     2   dtrace                  END
     3   dtrace                  ERROR
   480      fbt     bl bl_attach entry
   481      fbt     bl bl_attach return
   482      fbt     bl bl_detach entry


  > How many probes are available?
 # dtrace -l | egrep -v PROVIDER | wc -l
 94026
                                            Slide 34
Exercise 1
• Use dtrace to list all available probes
   > What are the different providers?

# echo `dtrace -l | awk '{print $2}' | \
   egrep -v PROVIDER | sort | uniq`
Xserver872 dtrace fbt fsinfo io libCrun1029 libCrun1074
libCrun1078 libCrun1084 libCrun1087 libCrun1088
libCrun1098 libCrun1100 libCrun1102 libCrun1104
libCrun1107 libCrun1109 libCrun1111 libCrun1113
libCrun1115 libCrun1424 libCrun1519 lockstat lx-syscall
mib proc profile sched sdt syscall sysevent sysinfo
vminfo




                                                      Slide 35
Exercise 1
• Use dtrace to list all available probes
   > What are the different providers?

# dtrace -l | awk '{print $2}' | egrep -v PROVIDER | \
   sort | uniq -c | sort -nr | head -1
90690 fbt

• List probes
   >   in the “read” function: dtrace -l -f read
   >   in the “ufs” module: dtrace -l -m ufs
   >   with the name “xcalls”: dtrace -l -n xcalls
   >   from the “sysinfo” provider:
       dtrace -l -P sysinfo
                                                         Slide 36
Fully Specifying Probes
• To specify multiple components of a probe,
  separate the components with a colon (:)
• Empty components match everything
• For example, “syscall::open:entry” specifies a probe
  >   from the “syscall” provider
  >   in any module
  >   in the “open” function
  >   named “entry”



                                                         Slide 37
Enabling Probes
• Enable a probe by specifying it without the “-l”
  option
• When an executing thread hits an enabled probe,
  the associated action occurs
• In this way, probes are enabled with the default
  action.
  > The default action prints the current CPU, probe ID,
    and probe name; no other action occurs
• For example, “dtrace -m nfs” enables every probe in
  the “nfs” module
                                                              Slide 38
Exercise 2
• Enable probes in the “random” module.
• Enable probes in the “syscall” provider
• Enable probes named “zfod”
• Enable probes provided by the “syscall” provider in
  the “open” function
• Enable the “entry” probe in the “clock” function




                                                        Slide 39
Exercise 2
• Enable probes in the “random” module.
  dtrace -m random
• Enable probes in the “syscall” provider
  dtrace -P syscall
• Enable probes named “zfod”
  dtrace -n zfod
• Enable probes provided by the “syscall” provider in
  the “open” function
  dtrace -n syscall::open:
• Enable the “entry” probe in the “clock” function
  dtrace -n ::clock:entry
                                                        Slide 40
Actions
• Actions are taken when a probe fires
• Actions are completely programmable
  > But, code is simple, fall-through code – no loops or
    branches
• Most actions record or report state in the system
• Some actions change the state of the system in a
  well-defined way
  > These are called destructive actions
  > Disabled by default – must be explicitly enabled
• Many actions take as parameters expressions in the
  D language                                               Slide 41
D Language for Scripting DTrace
• C-like language; some constructs like awk(1)
• Complete access to kernel C types, complete
  support for ANSI-C operators
• Complete access to kernel statics and globals
• Rich set of built-in variables
• Global and local storage; rich storage capabilities
• Strings as first-class citizens



                                                        Slide 42
DTrace Sample Script
• Stable script to track time spent on each CPU

sched:::on-cpu                        /* probe */
{
   self->ts = timestamp;
}

sched:::off-cpu                       /* probe */
/self->ts/                            /* predicate */
{
   @[curcpu->cpu_id] =                /* action */
      quantize(timestamp - self->ts);
}

                                                      Slide 43
Built-in D Variables
• Examples of built-in variables:
  > pid and tid – process and thread ID
  > execname – name of current executable
  > timestamp – time since boot in nanoseconds
  > arg0, arg1, ... argn – arguments passed to the
    probe
  > probeprov, probemod, probefunc,
    probename – current probe's provider, module,
    function, and name
  > zonename – name of current zone
• See Table 3-1 in the Dynamic Tracing Guide for a
  complete list                                      Slide 44
Actions: trace()
• trace() records results of the D expression to
 the trace buffer
  > Net effect is to output the value of the expression
• Examples:
  > trace(pid) traces the current process id
  > trace(execname) traces the name of the currently
    running executable
  > trace(probefunc) traces the function name
    containing this probe


                                                          Slide 45
Actions
• Actions are indicated by following a probe
  specification with “{ <action> }”
• Examples:
  > dtrace -n 'readch { trace(pid) }'
  > dtrace -n 'ufs { trace(execname) }'
  > dtrace -n \
    'syscall:::entry {trace(probefunc)}'
• Multiple actions for a single probe are separated by
  semicolons:
  > dtrace -n 'xcalls {trace(pid);\
    trace(execname)}'
                                                         Slide 46
Exercise 3
• Trace the name of every executable that calls the
  poll (pollsys in Solaris 10) system call
• Trace the PID in every entry to the pagefault
  function
• Trace the timestamp in every entry to the clock
  function




                                                      Slide 47
Exercise 3
• Trace the name of every executable that calls the
  poll (pollsys in Solaris 10) system call
  dtrace -n 'syscall::pollsys: \
    { trace(execname) }'
• Trace the PID in every entry to the pagefault
  function
  dtrace -n '::pagefault:entry \
    {trace(pid)}'
• Trace the timestamp in every entry to the clock
  function
  dtrace -n '::clock:entry \
    { trace(timestamp) }'
                                                      Slide 48
D Scripts
• DTrace quickly becomes difficult to manage from
  just the command line
• dtrace -s will read commands from a script
  rather than stdin
• Executable DTrace interpreter files may be created
• These look and feel somewhat like shell-scripts
• All DTrace interpreter files begin with
 #!/usr/sbin/dtrace -s


                                                       Slide 49
D Scripts
• Basic structure of a D script:
 provider:module:function:name
 /predicate/
 {
    action;
    [action;]
 }

• This script will trace the executable name upon
  entry into any system call.
 #!/usr/sbin/dtrace -s <other options>
 syscall:::entry
 {
    trace(execname);
 }
                                                    Slide 50
Predicates
• Predicates allow actions to only be taken when
  certain conditions are met
• A predicate is a D expression
• Actions will only be taken if the predicate
  expression evaluates to true
• A predicate takes the form “/expression/” and is
  placed between the probe description and the
  action


                                                     Slide 51
Predicates
• Example: Trace the PID of every process named
  “date” that performs an open(2)
 #!/usr/sbin/dtrace -s
 syscall::open:entry
 /execname == "date"/
 {
    trace(pid);
 }




                                                  Slide 52
Exercise 4
• Trace the timestamp in every ioctl(2) from
  processes named dtrace
• Use the arg0 variable to trace the executable name
  of every process read(2)-ing from file descriptor 0
     ssize_t read(int filedes, \
          void *buf, size_t nbytes);
• Use the arg2 variable to trace the executable name
  of every process write(2)-ing more than 100 bytes
     ssize_t write(int filedes, \
       const void *buf, size_t nbytes);


                                                        Slide 53
Exercise 4
• Trace the timestamp in every ioctl(2) from
  processes named dtrace
 dtrace -n 'syscall::ioctl:entry \
   /execname == "dtrace"/ \
   {trace(timestamp)}'




                                               Slide 54
Exercise 4
• Use the arg0 variable to trace the executable name
  of every process read(2)-ing from file descriptor 0
     ssize_t read(int filedes, \
          void *buf, size_t nbytes);
 dtrace -n 'syscall::read:entry \
   /arg0 == 0/ {trace(execname)}'




                                                        Slide 55
Exercise 4
• Use the arg2 variable to trace the executable name
  of every process write(2)-ing more than 100 bytes
     ssize_t write(int filedes, \
       const void *buf, size_t nbytes);
  dtrace -n 'syscall::write:entry \
    /arg2 > 100/ {trace(execname)}'
• Note that arg2 represents the requested write size.
  To find out how much was actually written, use the
  write:return probe and consult arg0 and arg1


                                                       Slide 56
More Actions
• tracemem() records memory at a specified location
  for a specified length
• stack() records the current kernel stack trace
• ustack() records the current user stack trace
  > For Java VM 1.5 and beyond, a Java stack trace is
    returned
• exit() tells the DTrace consumer to exit with
  specified status


                                                        Slide 57
Destructive Actions
• Must specify “-w” option to DTrace and have
  [e]uid=0 to use destructive actions
• stop() stops the current process.
  > Use prun(1) to resume the stopped process
• raise() sends a specified signal to the current
  process
• breakpoint() triggers a kernel breakpoint and
  transfers control to the kernel debugger, kmdb(1)
• panic() induces a kernel panic
• chill() spins for a specified number of nanoseconds
                                                        Slide 58
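As a minimal sketch of a destructive action (the process name "myapp" is hypothetical), raise() delivers a signal to the current process; note the -w in the interpreter line:

```
#!/usr/sbin/dtrace -ws

/* Send SIGINT (signal 2) to the hypothetical process
   "myapp" the first time it calls ioctl(2) */
syscall::ioctl:entry
/execname == "myapp"/
{
   raise(2);
   exit(0);
}
```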
Exercise 5
• Use dtrace to record a kernel stack trace in the
  “zfod” probe
• Use dtrace to record a user stack trace in the “zfod”
  probe
• Write a dtrace script to stop any process named
  “xcalc”, belonging to your username, that performs
  an ioctl(2)
• Modify the above to record the process ID of the
  process being stopped

                                                          Slide 59
Exercise 5
• Use dtrace to record a kernel stack trace in the
  “zfod” probe

 #!/usr/sbin/dtrace -s
 :::zfod
 {
    stack();
 }




                                                     Slide 60
Exercise 5
• Use dtrace to record a user stack trace in the “zfod”
  probe

 #!/usr/sbin/dtrace -s
 :::zfod
 {
    ustack();
 }




                                                          Slide 61
Exercise 5
• Write a dtrace script to stop any process named
  “xcalc”, belonging to your username, that performs
  an ioctl(2)
 #!/usr/sbin/dtrace -ws
 syscall::ioctl:entry
 /execname == "xcalc" && uid == <yourid>/
 {
    stop();
    exit(0);
 }

• To resume xcalc,
     prun `pgrep -U <username> xcalc`

                                                       Slide 62
Exercise 5
• Modify the above to record the process ID of the
  process being stopped
 #!/usr/sbin/dtrace -ws
 syscall::ioctl:entry
 /execname == "xcalc" && uid == <yourid>/
 {
    stop();
    trace(pid);
    exit(0);
 }




                                                     Slide 63
Output Formatting
• The printf() function combines the trace action with
  the ability to control the output format and content
• printf takes a printf(3C)-like format string as an
  argument, followed by corresponding items to print
• Examples
     printf("%d was here\n", pid);
     printf("I am %s\n", execname);




                                                     Slide 64
Output Formatting
• Normally, dtrace provides details on the firing probe,
  plus any explicitly traced data.
• Use the “quiet” option, “-q”, to dtrace to suppress
  probe details
• The quiet option may also be set in a D script via a
  pragma statement. See the Dynamic Tracing Guide
  for other options and pragma statements.
     #pragma D option quiet



                                                       Slide 65
Global D Variables
• D allows you to define your own variables that are
  global to your D program.
• Like awk(1), D tries to infer variable type upon
  instantiation, obviating explicit variable declaration.
 vminfo:::zfod
 {
    zfods++;
 }

 profile:::tick-1sec
 {
    printf("%d zfods/sec\n", zfods);
    zfods = 0;
 }
                                                            Slide 66
Thread-Local D Variables
• D allows for thread-local variables
• Simplify variable storage in a multi-threaded world
  > Share the same name regardless of what thread is firing,
      but separate storage is used for each thread
  >   Prevents race conditions associated with global variables
  >   Removes the need to keep track of which thread is
      running
  >   Denoted by self->variablename
  >   Unassigned variables have the value of 0
  >   Storage is released when variable is set to 0

                                                               Slide 67
Thread-local D Variables
 #!/usr/sbin/dtrace -s
 #pragma D option quiet

 syscall::pollsys:entry
 {
    self->ts = timestamp;
 }

 syscall::pollsys:return
 /self->ts && (timestamp - self->ts) > 1000000000/
 {
    printf("%s polled for %d seconds\n", execname,
       (timestamp - self->ts) / 1000000000);
    self->ts = 0;
 }

                                                     Slide 68
Thread-local D Variables
 syscall::ioctl:entry
  /execname == "date"/
 {
    self->follow = 1;
 }

 fbt:::
 /self->follow/
 {}

 syscall::ioctl:return
 /self->follow/
 {
    self->follow = 0;
 }

                           Slide 69
Exercise 6
• Write a D script to trace the executable name and
  amount of time spent in every open(2)
• Write a D script to follow a brk(2) system call
  through the kernel when called from the date(1)
  command
• Add #pragma D option flowindent to the
  above and observe the change in output




                                                      Slide 70
Exercise 6
• Write a D script to trace the executable name and
  amount of time spent in every open(2)
 #!/usr/sbin/dtrace -s
 #pragma D option quiet

 syscall::open:entry
 {
    self->ts = timestamp;
 }
 syscall::open:return
 /self->ts/
 {
     printf("In %s open took %d nanoseconds\n",
        execname, (timestamp - self->ts));
    self->ts = 0;
 }
                                                      Slide 71
Exercise 6
• Write a D script to follow a brk(2) system call
  through the kernel when called from the date(1)
  command
  #!/usr/sbin/dtrace -s
 #pragma D option quiet

 syscall::brk:entry
  /execname == "date"/
 { self->follow = 1;}

 fbt:::
 /self->follow/
 {}

 syscall::brk:return
 /self->follow/
 { self->follow = 0;}
                                                    Slide 72
Exercise 6
• Add #pragma D option flowindent to the
  above and observe the change in output
 #!/usr/sbin/dtrace -s
 #pragma D option flowindent

 syscall::brk:entry
  /execname == "date"/
 { self->follow = 1;}

 fbt:::
 /self->follow/
 {}

 syscall::brk:return
 /self->follow/
 { self->follow = 0;}
                                           Slide 73
Aggregations
• When trying to understand suboptimal performance,
  one often looks for patterns that point to bottlenecks
• When looking for patterns, one often doesn't want to
  study each datum – one wishes to aggregate the
  data and look for larger trends
• Traditionally, one has had to use conventional tools
  (e.g. awk(1), perl(1))




                                                       Slide 74
Aggregations
• DTrace supports the aggregation of data as a first
  class operation
• An aggregating function, f(x), is a function where x
  is a set, such that:
  f(f(x0) ∪ f(x1) ∪ ... ∪ f(xn)) = f(x0 ∪ x1 ∪ ... ∪ xn)
• E.g., count, sum, maximum, and minimum are
  aggregating functions; median and mode are not



                                                           Slide 75
Aggregations
• An aggregation is the result of an aggregating
  function keyed by an arbitrary tuple
• Aggregations are stored in aggregation arrays
  > Similar to associative arrays, denoted by @
• For example, to count all system calls on a system
  by system call name:
  dtrace -n 'syscall:::entry \
    { @syscalls[probefunc] = count(); }'

• By default, aggregation results are printed when
  dtrace(1M) exits                                     Slide 76
Aggregations
• Aggregations need not be named
• Aggregations can be keyed by more than one
  expression
• For example, to count all ioctl system calls by both
  executable name and file descriptor:
  dtrace -n 'syscall::ioctl:entry \
    { @[execname, arg0] = count(); }'




                                                         Slide 77
Aggregations
• Some other aggregating functions:
  >   avg(): the average of specified expressions
  >   min(): the minimum of specified expressions
  >   max(): the maximum of specified expressions
  >   quantize(): power-of-two distribution of specified
      expressions
• For example, distribution of write(2) sizes by
  executable name:
  dtrace -n 'syscall::write:entry \
    { @[execname] = quantize(arg2); }'

                                                           Slide 78
Aggregations
bash-3.2$ dtrace -n 'syscall::write:entry \
        { @[execname] = quantize(arg2); }'
dtrace: description 'syscall::write:entry ' matched 1 probe
^C
   gnome-terminal
            value ---------- Distribution ---------- count
                0 |                                  0
                1 |@@@                               1
                2 |                                  0
                4 |                                  0
                8 |                                  0
               16 |@@@                               1
               32 |@@@                               1
               64 |                                  0
              128 |@@@@@@@@@@@@@                     5
              256 |@@@@@@@@@@@@@                     5
              512 |                                  0
             1024 |@@@@@                             2
             2048 |                                  0
                                                          Slide 79
Aggregations
• You can control output formatting for aggregations
  using printa()
   > printa(@aggregation-name)
   > printa(format-string, @aggregation-name)
   > Format string works like printf(), with formats for each
     element of the tuple
      > Use @ in the format string for the aggregation result itself
profile:::profile-997 { @a[execname] = count() }
END { printa("%8u\t%s\n", @a); }

      1 utmpd
      2 dtrace
      2 mysqld
     ...
   4829 sched
                                                                       Slide 80
Exercise 7 – Aggregations
• Count the number of system calls by executable
  name
• Count the number of write system calls by process id
• Get a distribution of read(2) sizes by executable name
  and file descriptor
• Get a distribution of time spent in read(2) by
  executable name and file descriptor



                                                     Slide 81
Exercise 7 – Aggregations
• Count the number of system calls by executable
  name
 dtrace -n 'syscall:::entry
        {@[execname] = count()}'
• Count the number of write system calls by process id
 dtrace -n 'syscall::write:entry
        {@[pid] = count()}'




                                                     Slide 82
Exercise 7 – Aggregations
• Get a distribution of read(2) sizes by executable name
  and file descriptor
 #!/usr/sbin/dtrace -s
 #pragma D option quiet

 syscall::read:entry
 { self->follow = 1; self->filedes = arg0;}

 syscall::read:return
 /self->follow && ((int) arg1 != -1)/
 { @[execname, self->filedes] = quantize(arg1);
    self->follow = 0; }


 Note: Cast in predicate on return probe needed because arg1 is
 treated as an unsigned int by default.                           Slide 83
Exercise 7 – Aggregations
• Get a distribution of time spent in read(2) by
  executable name and file descriptor
 #!/usr/sbin/dtrace -s
 #pragma D option quiet

 syscall::read:entry
 { self->ts = timestamp; self->filedes = arg0;}

 syscall::read:return
 /self->ts/
 { @[execname, self->filedes] =
       quantize(timestamp - self->ts);
    self->ts = 0; }


                                                   Slide 84
DTrace Providers




                   Slide 85
DTrace Provider
• Three probes to give AWK-like controls
• BEGIN
  > First probe to fire
  > Used for initialization or to print headings
• END
  > Fires as DTrace terminates
  > Used to dump aggregation arrays under user control and
    formatting
• ERROR
  > Fires when a runtime error occurs in another probe

                                                             Slide 86
Exercise 8 – Hello World
• Write the canonical Hello World program in D
  > You know you have to do it . . .




                                                 Slide 87
Exercise 8 – Hello World
• Write the canonical Hello World program in D
  > You know you have to do it . . .

 #!/usr/sbin/dtrace -s
 #pragma D option quiet

 BEGIN
 {
    printf("Hello World\n");
    exit(0);
 }




                                                 Slide 88
Profile Provider
• Allows a probe to be fired at specific intervals
• Not related to any other provider
• Relatively low overhead
• Useful to infer system state by time-based
  observations
• profile-<interval> fires on all CPUs
• tick-<interval> fires on a single CPU
• Arguments
    > arg0 is the program counter in the kernel or 0 if in user
    > arg1 is the program counter in user or 0 if in the kernel
                                                                  Slide 89
Profile Provider
 • Example: Sample the system 1000 times per
   second and count the top user call frame
   #!/usr/sbin/dtrace -s
   profile-1001
   /arg1/
   {
      @[ustack(1)] = count();
   }


Note: Using a profile slightly out of sync with other system timers often
gives better results, since it avoids harmonic effects with other system events


                                                                            Slide 90
Function Boundary Trace (FBT)
• Allows instrumentation of most kernel functions
• fbt:::entry
  > Arguments to the probe are the same as the kernel
    function
  > Use the OpenSolaris Source Browser to follow the code
    and understand the arguments
• fbt:::return
  > arg0 is the offset of the actual return (many kernel
    functions have multiple return points)
  > arg1 is the return value, if any
• Tail-called optimizations may not be completely
  observable                                                Slide 91
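For example, a minimal sketch (picking bcopy() purely as an illustration; any kernel function visible to fbt would do) pairs entry and return probes to get a latency distribution for one function, which keeps the probe effect far smaller than enabling all of fbt:

 #!/usr/sbin/dtrace -s
 #pragma D option quiet

 /* record entry time in a thread-local variable */
 fbt::bcopy:entry
 {
    self->ts = timestamp;
 }

 /* predicate ensures we saw the matching entry */
 fbt::bcopy:return
 /self->ts/
 {
    @["bcopy (ns)"] = quantize(timestamp - self->ts);
    self->ts = 0;
 }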
Syscall Provider
• Instrumentation for entry and return from all system
  calls
• Useful to get an understanding of the interaction
  between an application and the kernel (system)
• syscall:::entry
  > Arguments are taken from the system call
  > See the manual section 2 for values and types
• syscall:::return
  > arg0 and arg1 are set to the return value from the call
  > Global variable errno may be used to track errors
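For instance, a sketch that combines errno with an aggregation to summarize failing system calls system-wide by executable, call, and error number:

  dtrace -n 'syscall:::return /errno != 0/ \
     { @[execname, probefunc, errno] = count(); }'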
                                                              Slide 92
PID Provider
• Allows dynamic instrumentation of applications,
  regardless of compiler or compile flags used
• Example: list all function entry probes in xclock
  # xclock -update 1 &
  [1] 3477
  # dtrace -l -n pid3477:::entry
• Example: list all functions only in the xclock
  executable, not the libraries that it includes
  # dtrace -l -n 'pid$target:xclock::entry' \
      -c "xclock"




                                                      Slide 93
copyin() / copyinstr()
• Recall that DTrace executes in the kernel's context
  and address space
  > Does not have direct access to the user address space
  > Can fetch data from process address space into kernel
  > If done a lot, can have a significant probe effect
  > copyin(addr, size), copyinstr(addr)

• DTrace already includes all of the kernel types and
  structs.
• The D language allows type & struct definitions for
  user types and data
                                                            Slide 94
PID Provider
• Observe changing bounds of the running xclock
 #!/usr/sbin/dtrace -s
 #pragma D option quiet

 typedef struct {
     short x, y;
     unsigned short width, height;
 } XRectangle;

 pid$1::RenderResetBounds:entry
 {
         b = (XRectangle *)copyin(arg0,
                sizeof(XRectangle));
         printf("x,y,w,h = %5d, %5d, %5d, %5d\n",
                 b->x, b->y, b->width, b->height);
 }
                                                     Slide 95
IO Provider
• Allows instrumentation of physical disk I/O.
  > Can show how file system buffering and read-ahead /
    write-behind is working
• Argument types are detailed in Chapter 27 of
  Dynamic Tracing Guide
  > args[0] – struct buf *
  > args[1] – devinfo_t *
  > args[2] – fileinfo_t *




                                                          Slide 96
IO Provider
• Example: show the device, program, I/O size, I/O
  type, and file name of all disk I/O, with a summary
 #!/usr/sbin/dtrace -s
 #pragma D option quiet
 BEGIN { printf("%-10s %10s %10s %3s %s\n",
    "Device","Program","I/O Size","R/W","Path"); }
 io:::start {
    printf("%-10s %10s %10d %3s %s\n",
        args[1]->dev_statname, execname, args[0]->b_bcount,
        args[0]->b_flags & B_READ ? "R" : "W",
        args[2]->fi_pathname);
    @[execname,pid,args[2]->fi_pathname] =
        sum(args[0]->b_bcount); }
 END { printf("%-10s %8s %10s %s\n", "Program",
           "PID", "Total", "Path");
    printa("%-10s %8d %10@d %s\n",@);
 }
                                                         Slide 97
IO Provider
bash-3.2$ dtrace -s ioprovider.d
Device        Program   I/O Size R/W Path
cmdk0           sched       8192   W <none>
cmdk0           sched       1024   W <none>
cmdk0      gnome-terminal       4096   R /usr/lib/libgtk-x11-2.0.so.0.1200
cmdk0      gnome-terminal       1024   R /usr/share/themes/nimbus/gtk-2.0/
cmdk0         fsflush       8192   R <none>
cmdk0            bash       8192   R /usr/sbin/mkfile
cmdk0            bash       5120   R /usr/sbin/mkfile
cmdk0          mkfile       8192   W /export/home/dickson/foo
cmdk0          mkfile      57344   W /export/home/dickson/foo
cmdk0          mkfile      49152   W /export/home/dickson/foo
cmdk0          mkfile       8192   W /export/home/dickson/foo
^C
Program         PID      Total Path
gnome-terminal     1222       1024 /usr/share/themes/nimbus/gtk-2.0/close.
gnome-terminal     1222       4096 /usr/lib/libgtk-x11-2.0.so.0.1200.2
sched             0      10752 <none>
bash           1303      13312 /usr/sbin/mkfile
fsflush           3      49152 <none>
mkfile         1303    4202496 /export/home/dickson/foo

                                                                   Slide 98
MIB Provider
• Allows dynamic instrumentation of all MIB counters,
  such as those reported by netstat -s and kstat
• Example: show TCP connections / second
 #!/usr/sbin/dtrace -s
 #pragma D option quiet
 mib:::tcpActiveOpens,mib:::tcpPassiveOpens {
    @[execname,probefunc,probename] = sum(arg0);
    event = 1;
 }
 tick-1sec /event/ {
    printa("%20s %20s %20s %@d\n",@);
    trunc(@); event = 0; }

 # dtrace -s mibprovider.d
          ssh     tcp_conn_request     tcpPassiveOpens 1
          ssh     tcp_connect_ipv4      tcpActiveOpens 1
                                                           Slide 99
Other Standard Providers
• See Dynamic Tracing Guide for more details on
  standard providers
• lockstat – kernel lock contention & synchronization
  events
• plockstat – user-level synchronization primitives
• sched – scheduling, on / off CPU, priority changes
• fpuinfo – software simulation of h/w floating point
  primitives
• proc – process, LWP creation, termination
• sysinfo / vminfo – sys / vm kernel statistics
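As a sketch of the sched provider (probe names as documented in the Dynamic Tracing Guide), the on-cpu / off-cpu pair yields a distribution of time each executable spends on CPU per scheduling interval:

 #!/usr/sbin/dtrace -s
 #pragma D option quiet

 /* timestamp when a thread is placed on CPU */
 sched:::on-cpu
 {
    self->ts = timestamp;
 }

 /* on removal, aggregate elapsed on-CPU time by executable */
 sched:::off-cpu
 /self->ts/
 {
    @[execname] = quantize(timestamp - self->ts);
    self->ts = 0;
 }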
                                                        Slide 100
Dynamic Language Providers
• Providers have been created for a number of higher
  level, interpreted, or dynamic languages
  > JavaScript, Java, Perl, PHP, Python, Ruby, sh, Tcl
• Application specific providers have also been
  created for particular applications
  > PostgreSQL, MySQL 6
  > Allow for the tracing of application specific events
• Both leverage Statically Defined Tracing
  > Allows probes to be compiled into applications,
    interpreters, run-time environments, etc.
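As an illustrative sketch (exact probe names and arguments vary by interpreter build; this assumes a DTrace-enabled PHP whose function-entry probe passes the function name as arg0), a dynamic language provider can count interpreted function calls by name:

  dtrace -n 'php*:::function-entry { @[copyinstr(arg0)] = count(); }'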

                                                           Slide 101
Exploring DTrace
• This has been just an introduction to DTrace –
  there's much, much more:
  >   BEGIN, END probes     – Aggregation formatting
  >   Normalization         – Provider specifics
  >   Associative arrays    – Clause-local variables
  >   User-level tracing    – Ring buffering
  >   Speculative tracing   – Anonymous tracing
  >   Postmortem tracing    – Privilege model
  >   Explicit versioning   – Well-defined stability

                                                       Slide 102
Exploring DTrace – Resources
• Solaris Dynamic Tracing Guide on docs.sun.com
  > All examples are in /usr/demo/dtrace
• Solaris Internals: Solaris 10 and OpenSolaris Kernel
  Architecture – McDougall & Mauro
• Solaris Performance and Tools: DTrace and MDB
  Techniques for Solaris 10 and OpenSolaris – McDougall,
  Mauro, Gregg
• OpenSolaris.org
  > http://opensolaris.org/os/community/dtrace
• http://blogs.sun.com/
  > Bryan Cantrill (bmc), Adam Leventhal (ahl)
  > http://blogs.sun.com/bmc/entry/dtrace_at_google
                                                           Slide 103
DTrace in Practice




                     Slide 104
General Approaches
• Start with existing tools
  >   truss
  >   vmstat/mpstat/iostat/lockstat
  >   pfiles/pmap/pldd/pstack/ptree
  >   prstat -mL
• Understand the new tools in Solaris 10:
  >   intrstat
  >   plockstat (DTrace consumer)
  >   pstack (for Java)
  >   pfiles (now shows file names)
                                            Slide 105
Solaris Performance & Tracing Tools
   Process stats                             Kernel Tracing / debugging
   ● cputrack - per-processor hw counters    ●   dtrace – trace and monitor kernel
   ● pargs – process arguments               ●   lockstat – monitor locking statistics
   ● pflags – process flags
                                             ●   lockstat -k – profile kernel
   ● pcred – process credentials
                                             ●   mdb – debug live and kernel cores
   ● pldd – process's library dependencies

   ● psig – process signal disposition       System Stats
   ● pstack – process stack dump             ●   acctcom – process accounting
   ● pmap – process memory map               ●   busstat – Bus hardware counters
   ● pfiles – open files and names           ●   cpustat – CPU hardware counters
   ● prstat – process statistics             ●   iostat – IO & NFS statistics
   ● ptree – process tree                    ●   kstat – display kernel statistics
   ● ptime – process microstate times        ●   mpstat – processor statistics
   ● pwdx – process working directory        ●   netstat – network statistics
                                             ●   nfsstat – nfs server stats
   Process control                           ●   sar – kitchen sink utility
   ● pgrep – grep for processes              ●   vmstat – virtual memory stats
   ● pkill – kill processes list

   ● pstop – stop processes
                                             Process Tracing / debugging
   ● prun – start processes                  ●   abitrace – trace ABI interfaces
   ● prctl – view/set process resources      ●   dtrace – trace the world
   ● pwait – wait for process                ●   mdb – debug/control processes
   ● preap – reap a zombie process           ●   truss – trace functions and system calls


                                                                                            Slide 106
General Approaches
• You will type "dtrace" thousands of times
• Often you will hit a dead-end
• Think like Edison...
• Use the manual!
  > Install it locally!
  > And the examples in /usr/demo/dtrace


                                              Slide 107
General Approaches
• Use aggregations to make sense of large amounts
  of data.
  > Consider looking into speculations as well
• Look for outliers and holes
• Use quantize() / lquantize()
  > min() / max() / avg() can hide important data
• Use these providers:
  > profile, pid, sched, proc, io – rely less on fbt
• Beware of probe effect with the pid provider

                                                    Slide 108
General Approaches
• A good starting place is mpstat
  >   Which column is out of line with the others?
  >   User/system time ratios?
  >   context switching?
  >   lock contention?
  >   cross calls?
  >   faults?
  >   system call levels?
  >   Interrupts?
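For example, if mpstat shows an unusual cross-call (xcal) column, the sysinfo provider can attribute those cross calls to executables in one line:

  dtrace -n 'sysinfo:::xcalls { @[execname] = count(); }'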


                                                     Slide 109
  Scenario 1
High User Time




                 Slide 110
 Scenario: High User Time Code
 multi.c – Interest Rate Model
# mpstat 1
   CPU minf mjf xcal    intr ithr    csw icsw migr smtx   srw syscl   usr sys   wt idl
     0    0   0    0     342 240     146   60    0    0     0   229   100   0    0   0
     0    0   0    0     335 233     144   60    0    0     0   181   100   0    0   0


• Where is this utilization coming from?
• Use profile provider (small probe effect)
• Examples:
   >   dtrace   -n   'profile-1001   {@[arg1] = count()}'
   >   dtrace   -n   'profile-1001   /pid==1234/ {@[arg1] = count()}'
   >   dtrace   -n   'profile-1001   /execname == "multi"/{@[arg1] = count()}'
   >   dtrace   -n   'profile-1001   {@[execname,ustack(1)] = count()}'
   >   dtrace   -n   'profile-1001   {@[execname] = count()}'



                                                                                    Slide 111
Scenario: High User Time
• Profile a user application: look at ustack()
# dtrace -n 'profile-1001/execname == "multi"/{@[ustack(1)] = count()} \
             END{trunc(@,10)}'
^C
CPU     ID                    FUNCTION:NAME
  0      2                             :END
libm.so.2`exp+0x4
 94
libm.so.2`exp+0x54
 99
libm.so.2`exp+0x67
111
                                               ustack(1) does not
libm.so.2`exp+0x98                            add much overhead
116
multi`intio_calc_n_month_rate+0x8f
117
multi`intio_calc_n_month_rate+0x186
138
libm.so.2`exp+0x10d
163
libm.so.2`exp+0x90
184
libm.so.2`exp+0x61
617
libm.so.2`log+0x1a
854                                                                        Slide 112
Scenario: High User Time
• Clearly this application is sitting in the math libraries
• Non-production probing:
  > We want a function count & time spent in each function
  > (Warning: This can have significant probe-effect.)
  # cat quantify.d
  #!/usr/sbin/dtrace -s
  pid$1:::entry
  {
          self->ts[self->stack++] = timestamp;
  }

  pid$1:::return
  / self->ts[self->stack - 1] /
  {
          this->elapsed = timestamp - self->ts[--self->stack];
          @[probefunc] = count();
          @a[probefunc] = quantize(this->elapsed);
          self->ts[self->stack] = 0;
  }



                                                                 Slide 113
Scenario: High User Time
# ./quantify.d
  intio_calc_pmms_rate                                               16
  intio_calc_adj_rate                                                81
  intio_calc_adjustment                                              82
  intio_calc_n_month_rate                                         34655
  log                                                             39380
  exp                                                             78763
  intio_calc_adjustment

          value ------------- Distribution -------------      count
           2048 |                                             0
           4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     82
           8192 |                                             0

  log
          value    ------------- Distribution -------------   count
           2048   |                                           0
           4096   |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   39337
           8192   |                                           19
          16384   |                                           12
          32768   |                                           9
          65536   |                                           0
         131072   |                                           2
         262144   |                                           0
         524288   |                                           1
        1048576   |                                           0
                                                                          Slide 114
Scenario: High User Time
exp
        value    ------------- Distribution -------------   count
         2048   |                                           0
         4096   |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   78685
         8192   |                                           37
        16384   |                                           26
        32768   |                                           9
        65536   |                                           2
       131072   |                                           3
       262144   |                                           0
       524288   |                                           1
      1048576   |                                           0

intio_calc_pmms_rate
         value ------------- Distribution -------------     count
      16777216 |                                            0
      33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    16
      67108864 |                                            0

intio_calc_adj_rate
         value ------------- Distribution ------------- count
       2097152 |                                        0
       4194304 |@@@@@@@@@@@@@@@@                        33
       8388608 |                                        0
      16777216 |                                        0
      33554432 |@@@@@@@@@@@@@@@@@@@@@@@@                48
      67108864 |                                        0           Slide 115
Scenario: High User Time
Another View of the Application
# cat watchpid.d
#!/usr/sbin/dtrace -Fs
pid$1:::entry{}
pid$1:::return{}

# ./watchpid.d `pgrep multi`
  0    -> intio_calc_n_month_rate
  0      -> log
  0      <- log
  0      -> log
  0      <- log
  0      -> exp                      ●Look for best exp(), log()
  0      <- exp                      ●Possibly inline code
  0      -> exp
  0      <- exp
  0      -> exp
  0      <- exp
  0      -> exp
  0      <- exp
  0    <- intio_calc_n_month_rate
  0    -> intio_calc_n_month_rate
  0      -> log
  0      <- log
...
                                                                  Slide 116
Scenario: High User Time
• Other things to watch for:
  > Calls to .mul(), .div(), etc. for SPARC
     > Compiled for SPARC V7 – recompile!
  > Excessive calls to libraries such as:
     > getenv(3C), getrusage(4C), getrlimit(2)
     > time(2), gettimeofday(3C)
           –   possibly replace with gethrtime(3C)
           –   time(2) on SPARC is very expensive! (7X)
           –   time(2) on Intel is somewhat expensive! (1.5X)
           –   gettimeofday(3C) is slightly more expensive than gethrtime(3C) (10%)
  > Watch for memmove(3C) vs. memcpy(3C)
    > Use memcpy() if regions do not overlap
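A hedged sketch for spotting excessive timing calls in one process with the pid provider (this assumes the calls go through libc; adjust the module name if the application links them differently):

  dtrace -n 'pid$target:libc:gettimeofday:entry \
     { @[ustack(5)] = count(); }' -p <pid>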

                                                                                 Slide 117
   Scenario 2
High System Time




                   Slide 118
Scenario: High System Time
System time >10%, or usr/sys ratio approaching 2:1
# vmstat 1
 kthr      memory            page              disk             faults     cpu
 r b w   swap free re mf pi po fr de      sr cd s0 s1   --   in    sy  cs   us   sy   id
 0 0 0 1157600 751464 3188 24781 3937 0   0 0 0 433 0   0 0 1236 15801 1543 32   41   28
 0 0 0 1157292 752688 2738 17618 4578 0   0 0 0 309 0   0 0 995 11657 1289 29    29   43
 0 0 0 1157156 752476 4450 40268 2116 0   0 0 0 306 0   0 0 987 24698 1608 26    63   11


 • Profile the system. See who is asking for resources
# dtrace -n 'profile-1001 {@[execname,uid,pid,tid] = count()}'
  grep                                0    11239        1                                     2
  grep                                0    11101        1                                     2
  grep                                0    12155        1                                     2
  grep                                0    11617        1                                     2
  grep                                0    12457        1                                     3
  grep                                0    11916        1                                     3
  dtrace                              0    10453        1                                     3
  grep                                0    11506        1                                     4
  grep                                0    11488        1                                     4
  Xorg                             5667      408        1                                     5
  gnome-terminal                   5667      660        1                                     7
  sched                               0        0        1                                    19
  find                                0      804        1                                   509
  sched                               0        0        0                                  1573
                                                                                                  Slide 119
Scenario: High System Time
• Profile the kernel. See what functions are being hit
# dtrace -n 'profile-1001 /arg0/{@[arg0] = count()}END \
        {trunc(@,10);printa("%a %@u\n", @)}'
^C
CPU     ID                    FUNCTION:NAME
  0      2                             :END
unix`hwblkpagecopy+0xca 27
unix`page_numtopp_nolock+0x29 29
unix`ddi_get8+0x13 36
unix`cas64+0x1a 46
unix`fakesoftint_return+0x2 59
unix`page_exists+0x3f 64
unix`page_lookup_create+0x5a 66
unix`page_lookup_nowait+0x45 81
unix`mutex_enter+0x11 241
unix`cpu_halt+0x9d 660




                                                                  Slide 120
 Scenario: High System Time
 • What are the most frequently called kernel functions?
# dtrace -n 'fbt:::entry {@[probefunc] = count()}END{trunc(@,10)}'
        (can be expensive)

CPU     ID                   FUNCTION:NAME
  0      2                            :END

  htable_e2va                                                  30397
  page_pptonum                                                 40784
  x86_hm_exit                                                  50531
  x86_hm_enter                                                 50531
  htable_va2entry                                              53327
  x86pte_access_pagetable                                      53706
  x86pte_release_pagetable                                     53706
  apic_setspl                                                  54169
  psm_get_cpu_id                                               55707
  page_next_scan_large                                         67220


 • Demand paging-related work
                                                                       Slide 121
   Scenario 3
High System Calls




                    Slide 122
 Scenario: High System Calls
 >100 per second
% vmstat 1
 kthr      memory            page            disk          faults     cpu
 r b w   swap free re mf pi po fr de sr cd s0 s1 --      in   sy  cs us sy    id
 0 0 0 1158040 759184 950 3939 1480 0 1 0 55 84 0 0 0   549 3143 577 6 9      85
 0 0 0 1102740 720324 7 42 25 0 0 0 0 5 0 0 0           403 327 244 2 0       98
 0 0 0 1102740 720324 0   0 0 0 0 0 0 0 0 0 0           370 261 202 1 1       98


 • System calls show interface between application
   and OS
# dtrace -n 'syscall:::entry/pid != $pid/{@[execname,probefunc] = count()}'
...
  gnome-terminal       pollsys                  44
  Xorg                 read                     46
  mixer_applet2        ioctl                    54
  acroread             pollsys                 423
  acroread             ioctl                   840




                                                                              Slide 123
Scenario: High System Calls
• Aggregate on syscall
   dtrace -n 'syscall:::entry \
    { @[probefunc] = count()}'
   dtrace -n 'syscall:::entry \
    { @[probefunc,execname] = count()}'
   dtrace -n 'syscall:::entry \
    { @[probefunc,pid] = count()}'
• If top syscalls are read()/write()/poll()
  > aggregate on arg0 (fd #)
  > look up file descriptor using pfiles



                                              Slide 124
Scenario: High System Calls
• Use argN to drill down into system calls
• For example, what fds are being used for ioctl()?
   dtrace -n 'syscall::ioctl:entry \
      /execname == "acroread"/ {@[arg0] = count()}'

• Look that up with pfiles
# pfiles `pgrep acroread`
654:    /export/home/jdf/Acrobat4/Reader/intelsolaris/bin/acroread ../../dtrac
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
      O_RDONLY
      /devices/pseudo/mm@0:null
   1: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
      O_WRONLY|O_LARGEFILE
      /devices/pseudo/mm@0:null
   2: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
      O_WRONLY|O_LARGEFILE
      /devices/pseudo/mm@0:null
   3: S_IFSOCK mode:0666 dev:276,0 ino:33309 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK FD_CLOEXEC
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX                                                        Slide 125
Scenario: High System Calls
• Use quantize() on read()/write(). Look for 1-byte I/O's
# dtrace   -n 'syscall::write:entry {@ = quantize(arg2)}'
   value     ------------- Distribution ------------- count
       0   |                                          0
       1   |@                                         48
       2   |                                          1
       4   |@                                         26
       8   |@                                         43
      16   |@@@                                       115
      32   |@@                                        82
      64   |                                          13
     128   |                                          7
     256   |@@                                        73
     512   |                                          5
    1024   |                                          16
    2048   |                                          7
    4096   |                                          0
    8192   |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            1280
   16384   |                                          0

# dtrace -n 'syscall::write:entry /arg2 == 1/ {trace(execname);ustack()}'
  0     13                      write:entry   gnome-terminal
          libc.so.1`_write+0x15
          libvte.so.4.4.0`vte_terminal_io_write+0x73
          libglib-2.0.so.0.400.1`g_source_callback_funcs
          libvte.so.4.4.0`vte_terminal_io_write

                                                                            Slide 126
Scenario: High System Calls
• Looking at write size again
# dtrace -n 'syscall::write:return {@[execname, arg0] = count()}'
xscreensaver                                            8            2
gnome-netstatus-                                      208            2
gnome-terminal                                        264            4
gnome-terminal                                        300            6
soffice.bin                                             8           14
soffice.bin                                             1           26




                                                                         Slide 127
Scenario: High System Calls
• Look for system call errors:
# dtrace -n 'syscall:::return /errno/ {trace(execname);trace(pid);trace(errno)}'
  0    318     pollsys:return    Xorg                               408        4
  0     12     read:return    gnome-terminal                        660       11
  0     12     read:return    Xorg                                  408       11
  0     12     read:return    dsdm                                  570       11
  0     12     read:return    nautilus                              650       11
  0     12     read:return    Xorg                                  408       11



• Why not truss?
 >   truss is much easier to use, provides better information
 >   But watch production use!
 >   truss only looks at one process
 >   DTrace is better at looking for system-wide events

                                                                                   Slide 128
      Scenario 4
High Context Switching




                         Slide 129
Scenario: High Context Switches
(>500 per CPU) Code: what.c
# mpstat 1
CPU minf mjf xcal   intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0   87    261 112 16908   73 939 9144    0 27571  59 40   0   1
  1    0   0   68     70    3 12237  53 659 10460    0 27079  62 38   0   0
  2    0   0   60     66    2 15987  53 622 10071    0 29405  59 41   0   0
  3    0   0   48     93    8 21788  77 932 8587    0 30584  55 45   0   0


 • Determine what processes are being switched off:
# dtrace -n 'off-cpu{@[execname] = count()}'
dtrace: description 'off-cpu' matched 2 probes
^C

 pageout                                                         1
 automountd                                                      2
 sendmail                                                        3
 fsflush                                                         4
 nscd                                                            4
 svc.startd                                                     15
 dtrace                                                         55
 sched                                                       17543
 what                                                        19442

                                                                        Slide 130
Scenario: High Context Switches
• From prstat we see 'what' is threaded.
# prstat -L
   PID USERNAME    SIZE     RSS   STATE   PRI NICE      TIME    CPU   PROCESS/LWPID
 25774 tgendron   1684K   1312K   run      10    0   0:00:32    76%   what/34
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/16
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/15
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/14
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/12
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/11
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/10
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/9
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/8
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/6
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/5
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/4
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/3
 25774 tgendron   1684K   1312K   sleep    59    0   0:00:00   0.0%   what/2




                                                                                      Slide 131
Scenario: High Context Switches
• Aggregate on tid since application is threaded
 # dtrace -n 'off-cpu /execname == "what"/{@[tid] = count()}'
 ^C
        34               34
        25              501
        23              517
        15              520
        13              522
        27              522
       ...
         2              567
        22              569
        29              571
         9              571
        14              574
        11              579
        32              629



  • Looks like all but thread 34 are switching (thread 34 is
    the producer)                                                   Slide 132
  Scenario: High Context Switches
  • Get a stack trace when the context switch occurs
# dtrace -n 'off-cpu /execname == "what"/{@[ustack()] = count()}END{trunc(@,5)}'
dtrace: description 'off-cpu ' matched 3 probes
^C
CPU     ID                    FUNCTION:NAME
  1      2                             :END
...
              libc.so.1`__lwp_park+0x10
              libc.so.1`cond_wait_queue+0x28
              libc.so.1`cond_wait+0x10
              what`consumer+0x44
              libc.so.1`_lwp_start
              885
              libc.so.1`__lwp_park+0x10
              what`consumer+0xa8
              libc.so.1`_lwp_start
             8158
              libc.so.1`__lwp_park+0x10
              libc.so.1`cond_wait_queue+0x84
              libc.so.1`cond_wait+0x10
              what`consumer+0x44
              libc.so.1`_lwp_start
            11948
                                                                             Slide 133
Scenario: High Context Switches
• Most of the context switching happens in cond_wait(),
  with some also in mutex_lock().
• Examining the code, we see the producer is calling
 cond_broadcast()
  > Generally a bad idea. Use cond_signal() instead
• Consumer is waiting on a condition variable after
  each unit of work.
  > Why not check for more work to do before waiting?
• Make minor changes and run what_new.c

                                                        Slide 134
 Scenario: High Context Switches
 • mpstat and ptime before change:
# mpstat 1
CPU minf mjf xcal    intr ithr csw icsw migr smtx     srw syscl   usr sys   wt idl
  0   90   0    0     266 166 1517   26    0    1       0 2579     30   3    0 67
  0    3   0    0     250 150 2190   69    0    2       0 3645     96   4    0   0
  0    0   0    0     244 145 1976   65    0    0       0 3317     96   4    0   0
  0    0   0    0     256 155 1972   69    0    0       0 3322     97   3    0   0

# ptime ./what 128
real        9.464
user        8.111
sys         0.208

 • mpstat and ptime after change
# mpstat 1
CPU minf mjf xcal    intr ithr   csw icsw migr smtx   srw syscl   usr sys   wt idl
  0 376    4    0     384 285    480   41    0    0     0 23595     8   5    0 87
  0    9   0    0     332 231    767   13    0    0     0 2053      5   2    0 93
  0 481    0    0     288 188    450   10    0    1     0 1393      4   3    0 93

# ptime ./what_new 128
real        1.189
user        0.161
sys         0.016                                                                    Slide 135
     Scenario 5
Threaded Application




                       Slide 136
Scenario: Threaded Application
 • Use prstat to see time waiting for locks:
# prstat -mL 1
    PID   USERNAME   USR   SYS   TRP   TFL   DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  29621   root        69    31   0.0   0.0   0.0 0.3 0.0 0.3 89 21 25K     0 what/34
  29621   root       5.6   3.7   0.0   0.0   0.0 49 0.0 42 2K      6 3K    0 what/33
  29621   root       5.7   3.4   0.0   0.0   0.0 44 0.0 46 2K      8 3K    0 what/15
  29621   root       5.6   3.5   0.0   0.0   0.0 48 0.0 43 2K      6 3K    0 what/16
  29621   root       5.5   3.5   0.0   0.0   0.0 43 0.0 48 2K      1 3K    0 what/31
  29621   root       5.4   3.5   0.0   0.0   0.0 49 0.0 42 2K      9 3K    0 what/3
  29621   root       5.5   3.4   0.0   0.0   0.0 47 0.0 45 2K      9 3K    0 what/29

 • Use the DTrace consumer plockstat:
# plockstat -p 29618
Mutex block
Count     nsec Lock                         Caller
-------------------------------------------------------------------------------
 3051   636785 what`thelock                 what`consumer+0x44
 2017   465852 what`thelock                 what`consumer+0xa8
  689   205946 what`thelock                 what`consumer+0x44
 2243   162300 what`thelock                 what`producer+0xb4
 1506   115574 what`thelock                 what`consumer+0xa8
  429    50190 what`thelock                 what`producer+0xb4
    3    49706 libc.so.1`libc_malloc_lock   what`consumer+0x6c
    2    14600 libc.so.1`libc_malloc_lock   what`consumer+0x6c
                                                                                        Slide 137
 Scenario: Threaded Application
(plockstat output cont.)
Mutex spin

Count     nsec Lock                         Caller
-------------------------------------------------------------------------------
 7080    17731 what`thelock                 what`producer+0xb4
 8722    15680 what`thelock                 what`consumer+0x44
11613    13897 what`thelock                 what`consumer+0xa8
  406    11684 libc.so.1`libc_malloc_lock   what`producer+0x64
  335    11651 libc.so.1`libc_malloc_lock   what`consumer+0x78
  672    10091 libc.so.1`libc_malloc_lock   what`producer+0x20
 4451     8016 libc.so.1`libc_malloc_lock   what`consumer+0x6c

Mutex unsuccessful spin

Count     nsec Lock                         Caller
-------------------------------------------------------------------------------
 3746    30148 what`thelock                 what`consumer+0x44
 2673    27705 what`thelock                 what`producer+0xb4
 3525    27424 what`thelock                 what`consumer+0xa8
    5    27056 libc.so.1`libc_malloc_lock   what`consumer+0x6c




                                                                             Slide 138
Scenario: Threaded Application
• We see our hot lock (thelock).
  > But also notice libc_malloc_lock!
• Replace the default malloc with libmtmalloc or libumem:




                                          Slide 139
  Scenario: Threaded Application
# export LD_PRELOAD=libumem.so.1
(Run the app, then:)
# plockstat -p `pgrep what`
^C
Mutex block

Count     nsec Lock                         Caller
-------------------------------------------------------------------------------
   45 38589443 what`thelock                 what`consumer+0x44
  442 23234077 what`thelock                 what`consumer+0xa8
    4   304380 what`thelock                 what`consumer+0x44
    9   240817 what`thelock                 what`producer+0xb4
    6    33386 what`thelock                 what`consumer+0xa8
   12    15773 what`thelock                 what`producer+0xb4

Mutex spin

Count     nsec Lock                         Caller
-------------------------------------------------------------------------------
   15    15610 what`thelock                 what`consumer+0xa8
   18    15591 what`thelock                 what`consumer+0x44
  525    12021 what`thelock                 what`producer+0xb4
   41     6353 0x2d000                      libumem.so.1`umem_cache_free+0x6c
    1     6160 0x2d380                      libumem.so.1`umem_cache_free+0x6c

                                                                             Slide 140
Scenario: Threaded Application
• Watch for mutex_lock() calls
  dtrace -n 'pid26939::*mutex_lock:entry \
          { @[ustack()] = count() }'

• Then watch for cases in the code like this:
  mutex_lock (&thelock);
  foo++;
  mutex_unlock(&thelock);

• Replace with new atomic lock API – atomic_ops(3C)
• Example: 32 threads, each incrementing an integer
  until it hits 10M
      (mutex version)          (atomic version)
    % cc atom.c -o atom      % cc -DATOM atom.c -o atom
    % ptime ./atom           % ptime ./atom
    real        2.010        real        0.396
    user        1.887        user        0.376
    sys         0.008        sys         0.004

                                                         Slide 141
Scenario: Threaded Application
• Watch for robust locks – very inefficient
USYNC_PROCESS_ROBUST (passed in mutex_init() call)
PTHREAD_MUTEX_ROBUST_NP
           (set with pthread_mutexattr_setrobust_np())


• Ask whether truly needed




                                              Slide 142
Chime – Aggregation Visualizer
• Chime provides a Java-based visualization tool for
  DTrace aggregations
• Dynamically generates D program based on
  arguments and templates
• Simple GUI interface for creating and using tools
• Allows for strip charts, line graphs, drill-down
• http://opensolaris.org/os/project/dtrace-chime/



                                                       Slide 143
DTrace ToolKit
• Collection of >230 DTrace utilities
  > Authored by Brendan Gregg, available at opensolaris.org
  > http://opensolaris.org/os/community/dtrace/dtracetoolkit
  > http://bdgregg.blogspot.com/2007/10/dtracetoolkit-ver-099.html
• Geared toward the system administrator
  > What's going on with this system?
  > Create layered tools so DTrace expertise is not required
  > Modules for Java, JavaScript, Perl, PHP, Python, Ruby,
    Shell, Tcl
• Understand how DTrace works, but start here to get
  work done.                                                   Slide 145
DTrace Toolkit
• Some tools are re-works of standard tools in D,
  often with extensions
  > dvmstat – like vmstat, but watches individual processes
  > dexplorer – collects overall system view like explorer
  > dtruss – like truss
• Lots of real-time snoop-like tools for other resources
  > execsnoop – Watch all calls to exec
  > rwsnoop – Watch reads and writes for system or process
  > iosnoop – Watch I/O activity for system or process


                                                              Slide 146
DTrace Toolkit
• Other tools broken into categories
  > Apps, Cpu, Disk, FS, Kernel, Locks, Mem, Net, Proc,
      System, User, Zones
• Highlights
  >   Cpu – Lots of multi-cpu tools. Watch interrupts, loads
  >   Kernel – Context switches, DNLC, dispatch priorities
  >   Vm – Swap, virtual memory access and use
  >   Proc – mmap, lastwords (final stack trace)
• New toolkits / libraries for other languages and
  environments.
  > Java, JavaScript, Perl, PHP, Python, Ruby, Shell, Tcl
                                                               Slide 147
So, what's next
• DTrace ports either done or in progress
  > Mac OS X
  > Assorted *BSD systems
• New providers
  > USDT, JSDT – [User, Java] Statically-Defined Tracing
  > iSCSI, ZFS, etc.




                                                           Slide 148
The DTrace Revolution
• DTrace tightens the diagnosis loop:
  hypothesis – instrument – gather data, test &
  observe – refine and repeat
• Tightened loop effects a revolution in the way we
  diagnose transient failure.
• Focus can shift from instrumentation stage to
  hypothesis stage:
  > Much less labor intensive, less error prone
  > Much more brain intensive
  > Much more effective! (And a lot more fun)

                                                      Slide 149
Your Turn


You gain proficiency with DTrace as
     you would with a musical
            instrument.

    Practice, Practice, Practice


                                   Slide 150
Exploring DTrace – Resources
• Solaris Dynamic Tracing Guide on docs.sun.com
  > http://
  > All examples are in /usr/demo/dtrace
• Solaris Internals: Solaris 10 and OpenSolaris Kernel
  Architecture – McDougall & Mauro
• Solaris Performance and Tools: DTrace and MDB
  Techniques for Solaris 10 and OpenSolaris –
  McDougall, Mauro, Gregg
• OpenSolaris.org DTrace Community
  > http://opensolaris.org/os/community/dtrace

                                                     Slide 151
Exploring DTrace – Resources
• Brendan Gregg's DTrace Toolkit
  > http://opensolaris.org/os/community/dtrace/dtracetoolkit
• http://blogs.sun.com/
  >   Bryan Cantrill (bmc), Adam Leventhal (ahl)
  >   http://blogs.sun.com/bmc/entry/dtrace_at_google
  >   http://blogs.sun.com/main/tags/dtrace
  >   jonathan, bobn, jeffv, bmc, relling, jimmauro, ahl, barts,
      jonh, mrbenchmark, roch, realneel, hosts of others
• http://developers.sun.com
• http://sun.com/bigadmin
                                                                   Slide 152
Thank You!




             Slide 153
Solaris 10 DTrace
Workshop
Bob Netherton
bob.netherton@sun.com
Scott Dickson
scott.dickson@sun.com
Appendix: DTrace and Java




                            Slide 155
jstack() Action
•   The jstack() action prints a mixed-mode stack trace
•   Both Java frames and native (C/C++) frames are shown
•   Only JVM versions 5.0_01 and above are supported
•   jstack shows hex numbers for JVM versions before 5.0_01
       #!/usr/sbin/dtrace -s
       syscall::pollsys:entry
       / pid == $1 / {
           jstack(50,8192);
       }
• first optional argument limits the number of frames shown
• second optional argument changes the string size
• jstackstrsize pragma / -x to increase buffer for all jstack()'s
                                                                Slide 156
The dvm Provider
• java.net project to add DTrace support to:
    > 1.4.2 (libdvmpi.so)
    > 1.5 (libdvmti.so)
    > https://solaris10-dtrace-vm-agents.dev.java.net/
•   Download shared libs
•   Add location of libs to LD_LIBRARY_PATH variable
•   Set JAVA_TOOL_OPTIONS to -Xrundvmti:all
•   Name of provider - “dvm”

                                                         Slide 157
The dvm Provider: Probes
• dvm probes and their signatures
  vm-init(), vm-death()
  thread-start(char *thread_name), thread-end()
  class-load(char *class_name)
  class-unload(char *class_name)
  gc-start(), gc-finish()
  gc-stats(long used_objects, long used_object_space)
  object-alloc(char *class_name, long size)
  object-free(char *class_name)
  method-entry(char *class_name, char *method_name, char
     *method_signature)
  method-return(char *class_name, char *method_name, char
     *method_signature)                                  Slide 158
The dvm Provider: alloc and free
• Object allocation/deallocation

#!/usr/sbin/dtrace -qs
dvm$target:::object-alloc
{
  printf("%s allocated %d size objects\n",
    copyinstr(arg0), arg1);
}
dvm$target:::object-free
{
  printf("%s freed %d size objects\n",
    copyinstr(arg0), arg1);
}

# ./java_alloc.d -p `pgrep -n java`

                                             Slide 159
The dvm Provider: Methods
• Count methods called

#!/usr/sbin/dtrace -s
dvm$target:::method-entry
{
  @[copyinstr(arg0),copyinstr(arg1)] = count();
}
# ./java_method_count.d -p `pgrep -n java`




                                              Slide 160
The dvm provider: Time Spent
• Time spent in methods
#!/usr/sbin/dtrace -s
dvm$target:::method-entry
{
  self->ts[copyinstr(arg0),copyinstr(arg1)] =
    vtimestamp;
}
dvm$target:::method-return
{
  @ts[copyinstr(arg0),copyinstr(arg1)] =
    sum(vtimestamp - self->ts[copyinstr(arg0),
      copyinstr(arg1)]);
}

# ./java_method.d -p `pgrep -n java`


                                                 Slide 161
DTrace in JDK 6
• JDK 6 supports DTrace “out of the box”
• hotspot provider implements all dvm probes plus
  extensions:
  > Method compilation (method-compile-begin/end)
  > Compiled method load/unload (compiled-method-load/unload)
  > JNI method probes
     > DTrace probes at entry to and return from each JNI method
• Strings are now unterminated UTF-8 data. Always pass the
  associated length value to copyinstr().
                                                                 Slide 162
JDK 6 DTrace Usage
• Certain probes are expensive and turned off by default:
  >   object-alloc
  >   method-entry, method-return
  >   monitor probes
       > monitor-wait, monitor-contended-enter, etc.
• Requires you to start your application with the flag
  -XX:+ExtendedDTraceProbes
• Use -XX:+DTrace{Alloc,Method,Monitor}Probes if
  possible
                                                         Slide 163
JDK6 hotspot_jni Provider
• Probes for Java Native Interface (JNI)
• Located at entry/return points of all JNI functions
• Probe arguments are same as corresponding JNI
  function arguments (for _entry probes)
• For *_return probes, probe argument is return value
• Examples:
     hotspot_jni$1:::GetPrimitiveArrayCritical_entry
     hotspot_jni$1:::GetPrimitiveArrayCritical_return

                                                    Slide 164