Debug processor reorder buffer timeout
By Ai Bee Lim
Senior Platform Application Engineer
Jack R Johnson
Senior Platform Application Engineer
Intel Corp.
eetasia.com | EE Times-Asia

Processor reorder buffer (ROB) timeout is not new, but debug engineers still often spend a lot of time debugging system issues in which a processor ROB timeout is seen. The purpose of this paper is to give context and guidance to help hardware engineers and software engineers troubleshoot these issues.

Typically, processors indicate a ROB timeout with an IERR# signal assertion. Interestingly, an IERR# assertion does not mean a ROB timeout condition only. It means that the processor has experienced an internal error, which may be the result of issues such as an error condition in the cache unit or an error condition in the internal bus.

For processors that support the Intel QuickPath Interconnect interface, there are no longer IERR# or MCERR# signals from the processors. Instead, they have been replaced by the CATERR# signal pin, which indicates that the processor has experienced a catastrophic error condition.

If the Machine Check capability of the processor is enabled, this event can also be recorded in the Machine Check Status register. The processor ROB timeout is only one of the Machine Check events that can be recorded. This paper focuses only on the processor ROB timeout error condition and provides guidance on debugging this Machine Check event.

Processor ROB timeout
First, let’s examine the meaning of a processor ROB timeout. Figure 1 is an example of a P6 Processor Micro-architecture with Advanced Transfer Cache Enhancement.

Figure 1: Intel P6 processor micro-architecture with cache transfer enhancement.

From the figure, the processor execution consists of a few blocks:
• Bus unit, which interacts with the system bus, known as the Front Side Bus for earlier processors and Intel QuickPath Interconnect for more recent processors;
• Second Level Cache unit, which interacts with the Fetch/Decode Unit and the First Level Cache unit;
• First Level Cache unit, which interacts with the Out-of-Order Execution Unit;
• Execution Out-of-Order Core unit, which handles out-of-order execution;
• Retirement unit, which is responsible for retiring processor instructions in order;
• Branch Prediction unit, which offers branch prediction hints for the processor.

In the processor Retirement unit, the processor instructions are retired in order even though the processor can support out-of-order execution. It is important to note that the instructions must retire in program order to ensure the correctness of program execution. The ROB timer is reset on retirement of each micro-instruction. During normal operation, the processor retires instructions before the ROB timer times out.

When the ROB timer expires, something is usually wrong within the system hardware, the software, or both. This document discusses some examples of ROB timeout events.

Machine check status
As described earlier, the ROB timeout is a type of Machine Check event, thus it is recommended that Machine-Check Architecture events are enabled in the system in order to capture information related to a Machine Check event.

Processor ROB timeout is reported in bits [15:0] of MCi_STATUS (the MCACOD field) as an internal timer error condition with MCACOD == 0x400. Bit 38 of MC0_STATUS will also be set in the processor to report a BINIT# (Bus Init) timeout condition.

Causes and examples
Outstanding read: When the next instruction to be retired is a read operation and that read does not complete before the ROB timer expires, a ROB timeout will be reported. This means that a memory read, an IO read, a memory-mapped IO read, or a configuration read can cause this error.

When any single thread issues a read operation, it may be able to execute other instructions that do not require the completion of the read. At some point the
thread needs the read result and it will stall waiting for a completion. When that completion does not occur, the system is headed for a ROB timeout event.

There are several conditions that could prevent the read from completing before the ROB timer expires. Some of the more interesting ones include a device which never responds to a device status read, a device which partially responds, or one that returns an error result to other parts of the system but never actually completes the read.

Outstanding write: Typically, writes are posted and should not, by themselves, cause a ROB timeout. It is possible that some issues downstream have consumed all the transaction resources, causing the functional unit to push back on the system bus and resulting in no progress as seen at the processing unit. This eventually leads to the ROB timer expiring, since the pending write instruction cannot be retired within the ROB timeout interval.

Debug tips and tools
This section focuses on the debug tips and tools that may be used to determine the root cause of a ROB timeout event.

The basics: It is important to ensure that the system has all the basic tasks completed in order to avoid spending unnecessary time working on issues that are already well known. Some of the first steps include:
• Ensure that the System BIOS has the latest Processor Microcode Update (MCU) and the latest Chipset Configuration information;
• Review known Machine Check or System Hang issues which are identified in the Processor Specification Update, the Chipset Specification Update, and any other Component Specification Updates (this applies to all components used on the system board);
• Review the silicon steppings of all the components in the system to verify that they are the latest available;
• Correlate the failure as much as possible by testing multiple systems and collecting as much data as possible. Work hard to understand all the variables that are involved with generating the failing case.

Machine check handler: Ensure that the Machine-Check Architecture of the processor is enabled in order to confirm that the processor is seeing a ROB timeout Machine Check event. This is done by setting the MCE bit (bit 6) in Control Register 4 (CR4), and following the initialization of the Machine Check routine.

In order to confirm that a system has enabled the Machine Check Architecture, check that the MCE bit is set in Control Register 4 (CR4.MCE == 1). Check the MCi_CTL register to make sure that it does not read 0x0. When MCi_CTL reads 0x0, the Machine Check initialization may not have taken place during BIOS initialization.

Part of the Machine Check initialization is to install a software handling routine at the Machine Check exception vector, vector 18 (0x12). At a minimum, the Machine Check handler at vector 18 should report all the Machine Check Status registers and the relevant Machine Check Address or Miscellaneous registers if they are valid.

The routine attached to vector 18 can also report the Global Error status register for the chipset to have a snapshot of the system condition leading to the Machine Check event. For most Machine Check cases, this register will provide more information about the system condition. In rare events that will be discussed later in this paper, the execution may stop and the handling routine may not execute correctly. Under this condition, the problem is more difficult to debug.

Some signals of interest
Processors that support Front Side Bus (FSB) Architecture: The processor IERR# signal is asserted when the processor is experiencing a serious internal error condition while the Machine Check Architecture (MCA) is disabled. IERR# will be asserted until the processor is reset. When this signal is asserted, it indicates that the processor is experiencing a serious internal error condition causing the system to reset. Assertion of this signal suggests that the Machine Check Architecture has not been enabled or that a serious error occurred while attempting to handle the Machine-Check event.

The processor MCERR# signal is asserted when the processor is experiencing a Machine Check event. If MCA is enabled and the system is correctly configured, then the processor will drive this signal on the system bus. During an uncorrectable Machine Check condition, MCERR# will be asserted for three clocks and the Machine Check Exception handling routine installed at vector 18 (0x12) will be called. Seeing this condition verifies that the system has experienced a Machine Check event and that the Machine Check initialization has completed properly in order to have MCA enabled. If MCERR# is not asserted during a Machine Check event, the platform configuration settings should be reviewed in order to confirm the correct settings needed to drive the MCERR# signal on the system bus.

The processor Block Next Request (BNR#) signal is asserted by any FSB agent to insert a bus stall when the agent cannot accept any more new bus transactions due to resource availability and to prevent overflow. When this signal is asserted, the bus owner cannot issue a new transaction on the bus. This signal indicates that there is backpressure on the bus.

Processors that support Intel QuickPath Interconnect Architecture: For the current generation of processors that support the Intel QuickPath Interconnect Architecture, a few of the processor signals have been consolidated and thus there are no longer separate IERR# and MCERR# signals. The processor Catastrophic Error (CATERR#) signal is asserted when the processor is experiencing a serious internal error condition. When the Machine Check Exception is not enabled, the CATERR# signal is asserted until the processor is reset (PLTRST# signal asserted). The CATERR# signal will pulse when the Machine Check Exception is initialized and enabled correctly. Hence, looking at a waveform of this signal will confirm whether the system has enabled the Machine Check Exception properly.

In target probe tool (XDP/Arium): By design, Machine Check status registers retain their value through system reset, but the initialization of the Machine-Check procedures will clear these registers. In order to check the status registers for the error condition that led to the Machine-Check event, one should take a snapshot of these registers before they are cleared on the next boot. If the Machine Check Exception is enabled, the exception routine at a minimum will record the MCi_STATUS register information.

On some rare occasions, the Machine Check Exception is not executed properly. It may still be possible to read the MCi_STATUS registers before they are cleared at the next boot by the Machine-Check initialization procedures. This can be done with an in-circuit debug tool that allows manual control of the processor’s execution. For example, it may be possible to stop the processor at a certain boot location with a breakpoint and then inspect the processor Machine Check registers in order to debug the system.

The in-circuit debug tool is very powerful. It allows the operator to stop the processor at various software locations and can help find clues to the problem area of the code before the ROB timeout occurs.

PCI Express completion timer timeout: If the system has experienced a ROB timeout, it is always good practice to enable the Completion Timeout on all the PCI Express End Points. Some root complexes allow enabling of the PCI Express Completion Timeout on configuration accesses. If this is the case, enable this feature as
well. This helps avoid an outstanding configuration request, targeting a PCI Express end point, that is not completed before the ROB timer expires. Since the completion timeout is in milliseconds and the ROB timer is in seconds, the completion timeout should alert the system to take appropriate action before a ROB timeout occurs, even if the target cannot provide a response. Note: For a PCI Express transaction to time out, the transaction must have actually been initiated on the PCI Express interface.

PCI Express logic analyzer: If a certain PCI Express End Point in the system is suspected as the cause of the problem, then one should collect a PCI Express logic analyzer trace. This should be an unfiltered capture of all transactions.

Review the logic analyzer traces to understand if there are any outstanding requests going downstream that have not been completed, which may cause a ROB timeout.

Bus logic analyzer
Processors that support the Front Side Bus (FSB) Architecture: An FSB bus trace will provide visibility of the bus transactions between the CPU and the Chipset. The FSB Logic Analyzer decodes these transactions so that the system user will be able to make sense of all the CPU to Chipset activity. In the case of debugging a ROB timeout, the system user will be able to trace which transaction the CPU is waiting on before the CPU resets. In an ideal case, this bus trace is important to understand what caused the ROB timeout problem. Unfortunately, getting access to this debug tool or getting access to all the FSB signals may be challenging, so this is one of the last options in most debug efforts.

Processors that support the Intel QuickPath Interconnect Architecture: For processors that support the Intel QuickPath Interconnect Architecture, similar logic analyzing capability is available through the Mirror Port or Mid-Bus Probe. In this case, the system BIOS needs to be configured to allow this activity. Also, there are challenges in setting up the system board with the Intel QuickPath Interconnect logic analyzer. Thus, it is also recommended that all the other methods are used to debug the problem before this step unless one of the analyzers is readily available.

Summary
Debugging a system problem is never easy, and the more experience you have the better. The debug of a problem needs to be very methodical. Data collection needs to be precise, and every variable needs to be documented with the relevant observations so that the debug path can be identified easily. Processor ROB timeout is a specific problem, and this paper attempts to explain the causes and methods to help resolve the problem as quickly as possible.
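
As an illustration of the register checks described under "Machine check status" and "Machine check handler", the following is a minimal Python sketch, not a definitive implementation. It assumes the raw register values have already been captured by some other means, for example with an in-target probe. The MCACOD field position, the 0x400 code, bit 38 and CR4.MCE bit 6 come from the text above. The VAL, UC, MISCV and ADDRV bit positions are an assumption taken from the Intel architecture manuals, and all function names are hypothetical.

```python
"""Decode a captured MCi_STATUS value and sanity-check MCA enablement (illustrative)."""

# Values taken from the article.
MCACOD_MASK = 0xFFFF          # MCACOD field, bits [15:0] of MCi_STATUS
INTERNAL_TIMER_ERROR = 0x400  # MCACOD value reported for a ROB timeout
BINIT_TIMEOUT_BIT = 38        # bit the article cites in MC0_STATUS
CR4_MCE_BIT = 6               # CR4.MCE

# Assumption: flag positions from the Intel manuals, not stated in the article.
VAL_BIT = 63    # register holds a valid error record
UC_BIT = 61     # error was not corrected
MISCV_BIT = 59  # MCi_MISC holds valid data
ADDRV_BIT = 58  # MCi_ADDR holds valid data


def bit(value: int, position: int) -> bool:
    """Return True when the given bit of value is set."""
    return bool((value >> position) & 1)


def decode_mci_status(status: int) -> dict:
    """Break a raw 64-bit MCi_STATUS value into the fields discussed in the text."""
    mcacod = status & MCACOD_MASK
    return {
        "valid": bit(status, VAL_BIT),
        "uncorrected": bit(status, UC_BIT),
        "misc_valid": bit(status, MISCV_BIT),
        "addr_valid": bit(status, ADDRV_BIT),
        "mcacod": mcacod,
        "internal_timer_error": mcacod == INTERNAL_TIMER_ERROR,
        "binit_timeout": bit(status, BINIT_TIMEOUT_BIT),
    }


def looks_like_rob_timeout(status: int) -> bool:
    """True for a valid record whose MCACOD is the internal timer error code."""
    fields = decode_mci_status(status)
    return fields["valid"] and fields["internal_timer_error"]


def mca_setup_problems(cr4: int, mci_ctl: list) -> list:
    """List the enablement problems the text says to check for."""
    problems = []
    if not bit(cr4, CR4_MCE_BIT):
        problems.append("CR4.MCE is clear")
    for bank, ctl in enumerate(mci_ctl):
        if ctl == 0:
            problems.append(f"MC{bank}_CTL reads 0x0")
    return problems


if __name__ == "__main__":
    # Made-up snapshot: valid, uncorrected, bit 38 set, MCACOD 0x400.
    sample = (1 << VAL_BIT) | (1 << UC_BIT) | (1 << BINIT_TIMEOUT_BIT) | 0x400
    print(decode_mci_status(sample))
    print(looks_like_rob_timeout(sample))
    print(mca_setup_problems(cr4=0x0, mci_ctl=[0xFF, 0x0]))
```

Keeping the decode separate from the register capture means the same helper can be applied to values recorded by a Machine Check handler or to a snapshot taken with a debug tool before the next boot clears the registers.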