Debug processor reorder buffer timeout

Description

Intel's QuickPath Interconnect (QPI) technology, formerly known as CSI (Common System Interface), implements direct interconnection between chips instead of connecting them to the Northbridge through the FSB. It is aimed at AMD's HT bus and is meant to surpass the HT bus in speed, bandwidth, bandwidth per pin, power, and all other specifications.

By Ai Bee Lim
Senior Platform Application Engineer

Jack R Johnson
Senior Platform Application Engineer
Intel Corp.

Processor reorder buffer (ROB) timeouts are not new, yet debug engineers still often spend a lot of time debugging system issues that result from a processor ROB timeout. The purpose of this paper is to give context and guidance to help hardware and software engineers troubleshoot these issues.

Typically, processors indicate a ROB timeout by asserting the IERR# signal. Interestingly, an IERR# assertion does not indicate only a ROB timeout condition. It means that the processor has experienced an internal error, which may result from issues such as an error condition in the cache unit or error conditions in the internal bus.

For processors that support the Intel QuickPath Interconnect interface, there are no longer IERR# or MCERR# signals from the processors. Instead, they have been replaced by the CATERR# signal pin, which indicates that the processor has experienced a catastrophic error condition.

If the Machine Check capability of the processor is enabled, this event can also be recorded in the Machine Check Status register. The processor ROB timeout is only one of the Machine Check events that can be recorded. This paper focuses only on the processor ROB timeout error condition and provides guidance on debugging this Machine Check event.

Processor ROB timeout
First, let’s examine the meaning of a processor ROB timeout. Figure 1 is an example of a P6 processor micro-architecture with the Advanced Transfer Cache enhancement.

Figure 1: Intel P6 processor micro-architecture with cache transfer enhancement.

From the figure, the processor execution consists of a few blocks:
• Bus unit, which interacts with the system bus, known as the Front Side Bus for earlier processors and Intel QuickPath Interconnect for more recent processors;
• Second Level Cache unit, which interacts with the Fetch/Decode Unit and the First Level Cache unit;
• First Level Cache unit, which interacts with the Out-of-Order Execution Unit;
• Out-of-Order Execution Core unit, which handles out-of-order execution;
• Retirement unit, which is responsible for retiring processor instructions in order;
• Branch Prediction unit, which offers branch prediction hints to the processor.

In the processor Retirement unit, instructions are retired in order even though the processor can support out-of-order execution. It is important to note that instructions must retire in program order to ensure the correctness of program execution. The ROB timer is reset on retirement of each micro-instruction. During normal operation, the processor retires instructions before the ROB timer times out.

When the ROB timer expires, something has usually gone wrong within the system hardware, the software, or both. This document discusses some examples of ROB timeout events.

Machine check status
As described earlier, the ROB timeout is a type of Machine Check event, so it is recommended that Machine-Check Architecture events be enabled in the system in order to capture information related to a Machine Check event.

A processor ROB timeout is reported in bits [15:0] of MCi_STATUS (the MCACOD field) as an internal timer error condition with MCACOD == 0x400. Bit 38 of MC0_STATUS will also be set in the processor to report a BINIT# (Bus Init) timeout condition.

Causes and examples
Outstanding read: When the next instruction to be retired is a read operation and that read does not complete before the ROB timer expires, a ROB timeout will be reported. This means that a memory read, an IO read, a memory-mapped IO read, or a configuration read can cause this error.

When any single thread issues a read operation, it may be able to execute other instructions that do not require the completion of the read. At some point the thread needs the read result, and it will stall waiting for a completion. When that completion does not occur, the system is headed for a ROB timeout event.

There are several conditions that could prevent the read from completing before the ROB timer expires. Some of the more interesting ones include a device that never responds to a device status read, a device that partially responds, or one that returns an error result to other parts of the system but never actually completes the read.

Outstanding write: Typically, writes are posted and should not, by themselves, cause a ROB timeout. It is possible, however, that some issue downstream has consumed all the transaction resources, causing the functional unit to push back on the system bus, so that the processing unit sees no progress. This eventually leads to the ROB timer expiring, since the pending write instruction cannot be retired within the ROB timeout interval.

Debug tips and tools
This section focuses on the debug tips and tools that may be used to determine the root cause of a ROB timeout event.

The basics: It is important to ensure that all the basic tasks have been completed on the system in order to avoid spending unnecessary time working on issues that are already well known. Some of the first steps include:
• Ensure that the System BIOS has the latest Processor Microcode Update (MCU) and the latest Chipset Configuration information;
• Review known Machine Check or System Hang issues identified in the Processor Specification Update, the Chipset Specification Update, and any other Component Specification Updates (this applies to all components used on the system board);
• Review the silicon steppings of all the components in the system to verify that they are the latest available;
• Correlate the failure as much as possible by testing multiple systems and collecting as much data as possible. Work hard to understand all the variables that are involved in generating the failing case.

Machine check handler: Ensure that the Machine-Check Architecture of the processor is enabled in order to confirm that the processor is seeing a ROB timeout Machine Check event. This is done by setting the MCE bit (bit 6) in Control Register 4 (CR4) and following the Machine Check initialization routine.

To confirm that a system has enabled the Machine Check Architecture, check that the MCE bit is set in Control Register 4 (CR4.MCE == 1). Also check that the MCi_CTL register does not read 0x0. When MCi_CTL reads 0x0, the Machine Check initialization may not have taken place during BIOS initialization.

Part of the Machine Check initialization is to install a software handling routine at vector 18 (0x12), the Machine Check exception vector. At a minimum, the Machine Check handler at this vector should report all the Machine Check Status registers and the relevant Machine Check Address or Miscellaneous registers if they are valid.

The routine attached to the Machine Check vector can also report the Global Error status register for the chipset in order to capture a snapshot of the system condition leading to the Machine Check event. In most Machine Check cases, this register will provide more information about the system condition. In rare events that are discussed later in this paper, execution may stop and the handling routine may not execute correctly. Under this condition, the problem is more difficult to debug.

Some signals of interest
Processors that support Front Side Bus (FSB) Architecture: The processor IERR# signal is asserted when the processor experiences a serious internal error condition while the Machine Check Architecture (MCA) is disabled. IERR# remains asserted until the processor is reset.

When this signal is asserted, it indicates that the processor is experiencing a serious internal error condition that causes the system to reset. Assertion of this signal suggests that the Machine Check Architecture has not been enabled or that a serious error occurred while attempting to handle the Machine-Check event.

The processor MCERR# signal is asserted when the processor is experiencing a Machine Check event. If MCA is enabled and the system is correctly configured, the processor drives this signal on the system bus. During an uncorrectable Machine Check condition, MCERR# is asserted for three clocks and the Machine Check exception handling routine installed at vector 18 (0x12) is called. Seeing this condition confirms that the system has experienced a Machine Check event and that the Machine Check initialization completed properly so that MCA is enabled. If MCERR# is not asserted during a Machine Check event, the platform configuration settings should be reviewed to confirm the settings needed to drive the MCERR# signal on the system bus.

The processor Block Next Request (BNR#) signal is asserted by any FSB agent to insert a bus stall when the agent cannot accept any more new bus transactions because of limited resources, in order to prevent overflow. When this signal is asserted, the bus owner cannot issue a new transaction on the bus. This signal indicates that there is backpressure on the bus.

Processors that support Intel QuickPath Interconnect Architecture: For the current generation of processors that support the Intel QuickPath Interconnect Architecture, a few of the processor signals have been consolidated, and thus there are no longer separate IERR# and MCERR# signals. The processor Catastrophic Error (CATERR#) signal is asserted when the processor is experiencing a serious internal error condition.

When the Machine Check Exception is not enabled, the CATERR# signal stays asserted until the processor is reset (PLTRST# signal asserted). When the Machine Check Exception is initialized and enabled correctly, the CATERR# signal pulses instead. Hence, looking at a waveform of this signal will confirm whether the system has enabled the Machine Check Exception properly.

In-target probe tool (XDP/Arium): By design, the Machine Check status registers retain their values through a system reset, but the Machine-Check initialization procedures clear these registers. To check the status registers for the error condition that led to the Machine-Check event, one should take a snapshot of these registers before they are cleared on the next boot. If the Machine Check Exception is enabled, the exception routine will at a minimum record the information in the MCi_STATUS registers.

On some rare occasions, the Machine Check Exception is not executed properly. It may still be possible to read the MCi_STATUS registers before they are cleared at the next boot by the Machine-Check initialization procedures. This can be done with an in-circuit debug tool that allows manual control of the processor's execution. For example, it may be possible to stop the processor at a certain boot location with a breakpoint and then inspect the processor Machine Check registers in order to debug the system.

The in-circuit debug tool is very powerful. It allows the operator to stop the processor at various software locations and can help find clues to the problem area of the code before the ROB timeout occurs.

PCI Express completion timer timeout: If the system has experienced a ROB timeout, it is always good practice to enable the Completion Timeouts of all the PCI Express End Points. Some root complexes allow enabling of the PCI Express Completion Timeout on configuration accesses. If this is the case, enable this feature as well. This helps avoid an outstanding configuration request that targets a PCI Express End Point and is not completed before the ROB timer expires. Since the completion timeout is in milliseconds and the ROB timer is in seconds, the completion timeout should alert the system to take appropriate action before a ROB timeout occurs, even if the target cannot provide a response. Note: For a PCI Express transaction to time out, the transaction must have actually been initiated on the PCI Express interface.

PCI Express logic analyzer: If a certain PCI Express End Point in the system is suspected as the cause of the problem, one should collect a PCI Express logic analyzer trace. This should be an unfiltered capture of all transactions.

Review the logic analyzer traces to understand whether any outstanding requests going downstream have not been completed, which may cause a ROB timeout.

Bus logic analyzer
Processors that support the Front Side Bus (FSB) Architecture: An FSB bus trace provides visibility of the bus transactions between the CPU and the chipset. The FSB logic analyzer decodes these transactions so that the system user can make sense of all the CPU-to-chipset activity.

When debugging a ROB timeout, the system user will be able to trace which transaction the CPU is waiting on before the CPU resets. In an ideal case, this bus trace is important for understanding what caused the ROB timeout problem. Unfortunately, getting access to this debug tool or to all the FSB signals may be challenging, so this is one of the last options in most debug efforts.

Processors that support the Intel QuickPath Interconnect Architecture: For processors that support the Intel QuickPath Interconnect Architecture, similar logic analysis capability is available through the Mirror Port or Mid-Bus Probe. In this case, the system BIOS needs to be configured to allow this activity. There are also challenges in setting up the system board with the Intel QuickPath Interconnect logic analyzer. Thus, it is recommended that all the other methods be used to debug the problem before this step, unless one of the analyzers is readily available.

Summary
Debugging a system problem is never easy, and the more experience you have the better. The debug of a problem needs to be very methodical. Data collection needs to be precise, and every variable needs to be documented with the relevant observations so that the debug path can be identified easily. Processor ROB timeout is a specific problem, and this paper attempts to explain the causes and methods to help resolve the problem as quickly as possible.
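
As an illustration of the register checks described under "Machine check status" and "Machine check handler", the following is a minimal Python sketch that decodes register values already captured by some other means, such as an in-target probe or a handler log. The function names and sample values are hypothetical and do not come from any tool. The MCACOD field position, the 0x400 internal timer code and the CR4.MCE bit position follow the text above. The VAL bit (bit 63 of MCi_STATUS) is an added assumption taken from the Machine-Check Architecture definition and is not mentioned in this paper.

```python
# Hypothetical helpers for triaging captured machine-check register values.
# Reading the registers themselves needs privileged access or a debug probe
# and is not shown here.

MCACOD_MASK = 0xFFFF            # MCi_STATUS[15:0], the MCACOD field
MCACOD_INTERNAL_TIMER = 0x400   # internal timer error, reported on ROB timeout
STATUS_VAL_BIT = 1 << 63        # assumption: VAL bit marks the bank as valid
CR4_MCE_BIT = 1 << 6            # CR4.MCE, bit 6


def is_rob_timeout(mci_status):
    """Return True if a valid MCi_STATUS value carries MCACOD == 0x400."""
    if not mci_status & STATUS_VAL_BIT:
        return False
    return (mci_status & MCACOD_MASK) == MCACOD_INTERNAL_TIMER


def cr4_mce_set(cr4):
    """Return True if the MCE bit (bit 6) is set in a captured CR4 value."""
    return bool(cr4 & CR4_MCE_BIT)


def banks_with_ctl_cleared(mci_ctl_values):
    """Return the bank indices whose MCi_CTL reads 0x0.

    A bank reading 0x0 suggests that Machine Check initialization may not
    have run for that bank during BIOS initialization.
    """
    return [i for i, value in enumerate(mci_ctl_values) if value == 0]


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample_status = (1 << 63) | 0x400
    sample_cr4 = 0x6F0
    sample_ctls = [0xFFFFFFFF, 0x0, 0x3F]

    print("ROB timeout reported:", is_rob_timeout(sample_status))
    print("CR4.MCE set:", cr4_mce_set(sample_cr4))
    print("Banks with MCi_CTL == 0x0:", banks_with_ctl_cleared(sample_ctls))
```

A snapshot of MCi_STATUS taken before the next boot clears the registers can be passed to is_rob_timeout() for each bank. Any bank listed by banks_with_ctl_cleared() is a hint to review the BIOS Machine Check initialization.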




   eetasia.com | EE Times-Asia

						