					                                                                                Vig 1

Avneesh Vig

Computer System Design

Prof. Arthur Glaser

November 24, 2009

            Intel® QuickPath Interconnect: A New Way to Communicate

       In November 2008, a new micro-architecture was introduced by Intel.

Popularly known as the Nehalem architecture, it introduced various new

technologies and provided an efficient power management unit. The speed and

performance that it offered led it to gain market appreciation fairly quickly. The

Nehalem micro-architecture introduces various new features along with

improving earlier technologies such as Hyper-threading Technology, which now

permits execution of two threads per processor core, allowing up to 16 threads on an eight-core system. The memory hierarchy gains a new level, an 8 MB L3 cache, which improves performance by reducing latency to

frequently used data. The architecture also utilizes an integrated triple-channel DDR3 memory controller which, unlike earlier platforms, runs memory at twice the frequency of DDR2-667 and thus allows for higher bandwidth. In addition

to these introductions and others, such as a second-level translation look-aside buffer and Intel® Turbo Boost technology, one of the main changes was made to the

interconnecting bus architecture. Replacing the previously used Front Side Bus

(FSB), Intel redesigned the system interconnect technology and launched it with

this architecture, naming it Intel QuickPath Interconnect (QPI).

Before introducing the new interconnect technology, we will review the evolution of microprocessor systems, which helps justify the current direction in

system interconnect technologies. Invented in the early 1970s, microprocessor

systems were fairly simple, consisting mainly of a processor and memory. Intel’s

4004 microprocessor (µP) was developed in 1971, and was the first single-chip

processor. This µP utilized a system bus that directly connected the processor

to memory which satisfied the instruction and data rate requirements. But as the

micro-architecture of processors evolved, higher clock frequencies were utilized

to increase performance of the processor. As a result, greater data rates from

memory were required to keep up with the processor speed. The dynamic

random access memory (DRAM) technology improved over years to catch up

and provide faster data transfers. However, things changed in the 1980s when the

pipelined architecture for microprocessors was introduced. The DRAM system

memory could not keep up in speed with the processor, and this brought about

the introduction of caches in microprocessor systems.

      Caches solved the bottleneck issue that was faced by processors due to

the slow DRAM memories. The cache was built using fast static random access

memory (SRAM) technology, which would offer low-latency, fast data transfers

without stalling while waiting for data (Maddox 1). The cache is an intermediary

between the processor and system memory, storing blocks of sequential bytes

that it fetches from main memory. To facilitate the high bursts of data requested

by the cache, the system bus was also improved to run at higher data rates,

keeping up with the speed of faster processors (Maddox 2). Then, in the early 1990s,

with the advent of multi-processor systems, the system bus architecture had to

be modified again to permit two or more processors to be connected to a shared

memory. Hence the system interconnect had to evolve to manage requests from

multiple processors and execute them efficiently. This new system interconnect

was named the Front Side Bus (FSB).

       Intel’s Pentium Pro microprocessor was the first Intel architecture to

introduce the Front Side Bus. This new system interconnect had the capability to

support up to four processors, a memory controller, and an I/O controller

(Maddox 2). The FSB pipelined processor requests and then fulfilled them sequentially through the memory controller. It could pipeline up to eight transactions, and hence provided high throughput. But the system had now become more complex: each processor carried its own cache (up to eight caches), so the system interconnect had to provide services to maintain coherency of the caches with each other and with main memory. The FSB facilitated this by

using techniques such as snooping, which involves checking what memory

requests are being made and whether any of the caches have modified data. In

such a case, FSB would perform a set of functions and make known that a

change has been made and hence the data copies in the other caches are

invalid. The FSB provided effective cache coherency mechanisms and, operating at 400 MHz, allowed data transfer rates of up to 1.6 gigatransfers per second (Maddox 2). But as the speed of processors increased with

improvements, the data rate requirements heightened and led to a dual-bus

system. The number of processors connected to a single bus decreased and in

turn improved bandwidth (Maddox 8). Higher bandwidth requirements further

decreased the number of processors connected to a single bus to one. Now the

microprocessor system with four processors had four FSBs connecting it to the

shared memory controller. Even though this configuration allowed for high

bandwidth, the memory controller now had to accommodate four independent

FSBs. This meant that over 1500 pins were required by the controller, making it a

very expensive device (Maddox 8). The shared memory controller hence became a bottleneck, since it was the sole controller for multiple processors.

       The solution to this came by integrating one memory controller with each

processor on the same die (Maddox 8). Integrated memory controllers were becoming a necessity due to the higher data demands of processors and the growing number of processors in multiprocessor systems. Integrating the controller provides low-latency transfers and hence improves bandwidth (Intel 4). The FSB would no longer be able to fulfill the

requirements because it could efficiently support only five loads, and also

because of its bandwidth cap of 12.8 GB/s (Maddox 9). Furthermore, the bus could not support simultaneous data transfer in both directions, limiting improvement further. Hence, a new system bus architecture was needed to

interconnect the processors (with integrated caches and memory controllers) with

system memory. The new architecture needed to provide high-speed point-to-point

links connecting the processors with each other and also to their dedicated

memories. Keeping these factors in mind, Intel QuickPath Interconnect (QPI) was

designed, developed, and then utilized by the Nehalem micro-architecture to

provide optimum performance.

             Figure 1: Four-processor system based upon point-to-point links

         Intel’s QPI is a step in this new direction. Figure 1 displays a four-

processor system interconnected using bi-directional, high speed, point-to-point

links (Safranek 1). QPI is the interconnect that allows for such a system to work

efficiently and provide maximum performance. As can be seen in the figure,

each processor, also referred to as a socket or CPU, has its own dedicated

memory. This design concept, referred to as non-uniform memory access

(NUMA), was a result of integrating the memory controllers into the die of each

processor core. As a result, each processor now has a low-latency, high-bandwidth connection to its dedicated system memory. In the following paragraphs, we will learn what makes up QuickPath Interconnect and how it functions.


       Before we learn the architecture of QPI, let us familiarize ourselves with

some of the terminology used in systems which utilize Intel’s QuickPath

Interconnect. A caching agent is the processing unit that is connected to QPI

through its high performance cache. A home agent is the interface between

caching agents and a given set of memory addresses. It services coherent

transaction requests, and is a part of the integrated memory controller (Safranek

5). There are also devices that are responsible for connecting to the input/output

subsystem; these are referred to as I/O agents (Singh 1). And lastly, the devices

that provide access to code required for booting up the system are called

firmware agents. A single device can contain several of these agents, each

representing a single node. In a system, these devices exchange data over QPI

using something termed a link. Links are composed of lanes, which carry a set of unidirectional signals from one device to the other; each lane carries one signal. To achieve QPI’s bi-directional operation, two links are used, one in each direction, forming a link pair (Singh 2). Now that the terminology of a QPI system is familiar, we can discuss its architecture.

       QuickPath Interconnect is composed of four layers. This follows the same

level of abstraction found in the seven-layer Open Systems Interconnection (OSI)

model of networking (Singh 2). Figure 2 shows an illustration of the four-layer model. The figure is followed by details about each layer in the sections below.


               Figure 2: the four layers of Intel QuickPath Interconnect

The Physical Layer

       This layer deals with details of the operation of the signals on a particular

link between two agents (Singh 2). A link consists of 20 differential pairs in one direction. Each link is accompanied by one clock lane; hence a

link in one direction is a total of 21 lanes (Singh 3). Each lane requires two physical wires, bringing the total number of signals in each direction to 42. A link pair is formed to allow bi-directional operation, for a total of 84 signals in a full-width link pair (Safranek 2). Figure 3

shows the physical layer of the QPI architecture.

             Figure 3: Physical interface of Intel QuickPath Interconnect

       A phit is the unit of information exchanged in each direction between any

two agents. The phit for QPI equals 20 bits. If the link operates at full width, 84

pins will be utilized to transfer this data. Compare this to the 160 pins that were

required by the Front Side Bus (Safranek 2). However, to further reduce power consumption and to work around failures, the link can operate at half or quarter width. This feature will become clear when it is discussed in

detail in the next section. With current signaling speeds, the link allows for 6.4

Gigatransfers (GT) per second for regular multiprocessor systems, and 4.8 GT/s

for longer traces found in large multiprocessor systems (Singh 4). This is a big

leap from the 1.6 GT/s that the Front Side Bus allowed.
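These lane and pin counts reduce to simple arithmetic. As a sanity check, the following Python sketch (my own tally, not Intel code) reproduces the 42- and 84-signal figures quoted above from Safranek:

```python
# Tallying the physical signals of a QPI link, per the text above.
DATA_LANES = 20       # differential data pairs per direction
CLOCK_LANES = 1       # one forwarded-clock lane per direction
WIRES_PER_LANE = 2    # each lane is a differential pair (two wires)

def signals_per_direction() -> int:
    # 21 lanes x 2 wires = 42 signals in one direction
    return (DATA_LANES + CLOCK_LANES) * WIRES_PER_LANE

def signals_per_link_pair() -> int:
    # a full-width link pair carries one link in each direction
    return 2 * signals_per_direction()
```

Running this confirms 42 signals per direction and 84 for the full-width link pair, well under the roughly 160 pins the Front Side Bus required.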

       The physical layer is divided into two sections: analog and

logical. The analog section manages data transfer on the traces by driving

appropriate signals on the lanes, with proper timing relative to the clock, and then

recovering the data on the receiving end and translating it back into its digital

form. The logical section, in contrast, deals with interfacing the physical

layer with the link layer. It manages the flow of information between these layers

and handles the width of operation (Singh 4).

The Link Layer

       This layer controls the flow of information across the link and ensures that

it has been transferred without errors. When the link layers of two devices

communicate with each other, they exchange something called a flit (Singh 5). A

flit is always 80 bits wide and is sent and received by the link layers of the communicating devices. Each flit contains 72 bits of payload and 8 bits of Cyclic Redundancy Check (CRC). The physical layer transmits a flit as several phits, so only 20 bits of information travel on a link at a time. The link itself is

subdivided into four quadrants of 5 lanes each. When the link is in full width

mode, it carries 20 bits of data across all four quadrants and reliably delivers it.

But when an error or failure occurs on any one of the lanes, the link drops to half- or quarter-width mode accordingly and retransmits the data using only two quadrants or one quadrant (Singh 5). Figure 4 shows an illustration of the four quadrants of a link.

                   Figure 4: Mapping 20 lanes into four Quadrants
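The flit-to-phit relationship above can be illustrated with a short sketch. This is a toy model: the CRC-8 polynomial (0x07) and the MSB-first packing order are stand-ins for illustration, not Intel’s actual encoding.

```python
PHIT_BITS = 20   # what the physical layer moves per transfer
FLIT_BITS = 80   # 72-bit payload + 8-bit CRC, the link layer's unit

def crc8(data: int, bits: int = 72, poly: int = 0x07) -> int:
    """Toy bitwise CRC-8 over `bits` input bits (illustrative polynomial)."""
    reg = 0
    for i in reversed(range(bits)):
        reg ^= ((data >> i) & 1) << 7
        reg = ((reg << 1) ^ poly) & 0xFF if reg & 0x80 else (reg << 1) & 0xFF
    return reg

def flit_to_phits(payload: int) -> list[int]:
    """Append a CRC to a 72-bit payload and slice the 80-bit flit
    into four 20-bit phits, most significant phit first."""
    flit = (payload << 8) | crc8(payload)
    return [(flit >> (FLIT_BITS - PHIT_BITS * (i + 1))) & 0xFFFFF
            for i in range(FLIT_BITS // PHIT_BITS)]
```

At full width one flit thus crosses the link as four transfers; at half or quarter width the same flit simply takes eight or sixteen.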

       When the link runs in quarter- or half-width mode, it reduces power

consumption and allows for efficient power management. The two modes also

allow the link to transmit data reliably, working around and avoiding the lanes that

recently incurred transmission failures (Intel 4). The layer also contains a clock

fail-over mechanism that re-routes the clock signal through one of the 20 data lanes. These features allow the Link Layer to manage reliable data transfers between two agents (Intel 4). The next paragraph discusses the types of data managed by this layer.
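The degraded-width behavior can be sketched as a simple selection policy. The rules below are my own simplification of what the sources describe, not Intel’s actual failover algorithm:

```python
LANES_PER_QUADRANT = 5   # 20 lanes split into quadrants Q0..Q3

def usable_quadrants(failed_lanes: list[int]) -> list[int]:
    """Quadrants containing no failed lanes."""
    bad = {lane // LANES_PER_QUADRANT for lane in failed_lanes}
    return [q for q in range(4) if q not in bad]

def select_width(failed_lanes: list[int]) -> str:
    good = len(usable_quadrants(failed_lanes))
    if good == 4:
        return "full"     # all 20 lanes
    if good >= 2:
        return "half"     # 10 lanes across two good quadrants
    if good == 1:
        return "quarter"  # 5 lanes in the one good quadrant
    return "down"         # no usable quadrant remains
```

A single failed lane, for example, costs the link only that lane’s quadrant, so transfers continue at reduced width rather than failing outright.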

       The Link layer abstracts the link into a set of message classes. To

understand this abstraction, consider an analogy with a post office: both carry several different types of entities. Just as a post office delivers letters, packages, and so on, with options for how fast we want each item delivered, a link carries various types of data, such as snoops and data responses (Singh 6). The message classes represent these different types of data. There are six message classes: Home (HOM),

Data Response (DRS), Non-Data Response (NDR), Snoop (SNP), Non-Coherent

Standard (NCS), and Non-Coherent Bypass (NCB) (Singh 6). A collection of

these six classes is called a virtual network. Intel QPI supports up to three virtual

networks: VN0, VN1, and VNA (Safranek 3). VN0 and VN1 each provide a channel per message class and are independent of one another; each is independently buffered. VNA is added for low-cost performance: it is adaptively buffered and shared among all traffic-generating agents.

Figure 5 shows how the virtual networks and the six message classes fit into the QPI architecture.


                Figure 5: Virtual networks and the six message classes
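The channel structure just described can be summarized in a few lines of Python. The buffer counts here are invented for illustration; only the class names and the buffering split between VN0/VN1 and VNA come from the text:

```python
MESSAGE_CLASSES = ("HOM", "DRS", "NDR", "SNP", "NCS", "NCB")

def build_virtual_networks(vna_buffers: int = 32,
                           per_class_buffers: int = 1) -> dict:
    """VN0 and VN1 buffer each message class independently;
    VNA is one adaptively shared pool for all traffic."""
    networks = {vn: {mc: per_class_buffers for mc in MESSAGE_CLASSES}
                for vn in ("VN0", "VN1")}
    networks["VNA"] = {"shared": vna_buffers}
    return networks
```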

The Routing Layer

       This layer is responsible for directing messages to their proper destination.

It maintains a number of routing tables, which specify where to direct each incoming packet. Each packet carries its

destination in the destination field. When the routing layer receives a packet on

the receiving device, it decodes the destination Node ID and uses it to index the

routing table. The routing table then identifies the next link to which the packet

should be forwarded. The transmission end of that link then determines the final

destination of the packet (Safranek 4). The routing table is set up by the firmware

agents when the system boots (Singh 6). The table can also be populated by the system BIOS or by software familiar with the system topology.
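The lookup described above amounts to indexing a table by destination Node ID. A minimal sketch follows; the topology, Node IDs, and link names are invented for illustration:

```python
# Populated at boot by the firmware agents or BIOS (invented example).
ROUTING_TABLE = {
    0: "local",    # this socket's own caching/home agents
    1: "link_A",
    2: "link_B",
    3: "link_A",   # node 3 is reached through the neighbor on link A
}

def route(packet: dict) -> str:
    """Decode the destination field and index the routing table."""
    return ROUTING_TABLE[packet["dest_node_id"]]
```

Note that the table encodes only the next hop; the device at the other end of that link repeats the lookup until the packet reaches its destination node.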

The Protocol Layer

       As seen in figure 2, the protocol layer is the top-most layer in the Intel QPI

hierarchy. It is responsible for managing the coherency of data among all caching

and home agents (Singh 7). It uses the MESIF protocol for cache coherency.

The interconnect uses this protocol to ensure that valid data is accessed in the caches and that no outdated data is treated as current. The

MESIF protocol consists of five states: (1) modified, indicating that a cache block is dirty (has been written to); (2) exclusive, indicating that a cache block is the only copy among the caches; (3) shared, indicating that the cache block may also be present in other caches; (4) invalid, indicating that the cache block does not contain valid, current information; and (5) the new forward state (Maddox 7).

       The forward state was a new addition to the protocol which allows for fast

transfers of shared data. When two caches contain the same cache block, it is

considered to be in a shared state. If the block is modified in one of these caches, main memory no longer contains the up-to-date information. Hence, when another processor requests this cache line, the updated block should be presented to it, not the stale data in main

memory. Traditionally, the memory controller would obtain a copy of the

manipulated cache block and then provide it to the cache that is requesting it; but

now with the advent of a forward state, the cache with the updated cache line

can itself forward the block to the requesting processor (Singh 7). This reduces

the time needed to get the requested data, hence improving overall performance

(Singh 8).
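The benefit of the forward state can be shown with a toy model. This is a sketch of the idea only, not Intel’s protocol engine; the transition shown (F ownership migrating to the newest sharer) is one commonly described policy:

```python
M, E, S, I, F = "Modified", "Exclusive", "Shared", "Invalid", "Forward"

def serve_read(cache_states: dict) -> str:
    """Given each cache's state for one line, decide who supplies it."""
    for cache, state in cache_states.items():
        if state in (M, E, F):   # a single cache is responsible
            return cache         # it forwards the line directly
    return "memory"              # no designated forwarder: go to DRAM

def after_forward(cache_states: dict, forwarder: str, requester: str) -> dict:
    """After a forward, F ownership migrates to the newest sharer."""
    states = dict(cache_states)
    states[forwarder] = S
    states[requester] = F
    return states
```

Without the F state, a line shared by several caches would have to be supplied by memory; with it, exactly one cache answers directly, saving the memory round trip.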

       The layers discussed above make up the Intel QuickPath architecture.

The interconnect adds various new features that provide reliable low-latency data transfers, cache coherency, and high bandwidth. In the next section,

we will discuss the overall performance of the Intel QPI architecture.


       Small-scale systems such as desktops and workstations, as well as very large systems such as servers, benefit from the high performance Intel

QuickPath Interconnect has to offer. There are two types of cache coherency

mechanisms that the architecture utilizes, each one suited for either a small scale

or a large scale processor system (Singh 9). The first mechanism is called the

source snoop mechanism, which proves best for small scale computing, and the

second is the home snoop mechanism that proves better for large scale systems.

In order to appreciate the two strategies, let us first familiarize ourselves with the

traditional way of snooping. Please refer to figure 6 for the following example.

         Figure 6: four processors interconnected in a multiprocessor system

       Suppose, for instance, that processor 1 needs data currently located in the cache of processor 4. In a regular snoop, this transfer is accomplished in four hops. The home agent maintains a directory of which pages are where, so processor 1 would send a request to processor 4; processor 4 would communicate with

other processors to confirm that it has the updated copy, the processors would

then reply back to processor 4, and then finally processor 4 would transfer the

requested data to processor 1 (Video 3). However, in a source snoop, the same

is accomplished with three hops. Working with the same example as before,

when processor 1 sends a request to processor 4, it simultaneously sends it to

the other processors as well. The processors then confirm that the data in

processor 4 is current, and the data is then transferred from processor 4 to

processor 1 (Video 3). With the forward state introduced earlier in this paper, this is further reduced to two hops if processor 2 or 3 also contains the same data.
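The hop counts in this example can be tabulated directly. The function below simply encodes the scenario described above, as summarized from the Video source:

```python
def hops(mechanism: str, forward_hit: bool = False) -> int:
    """Hops to satisfy processor 1's request for a line in processor 4."""
    if forward_hit:
        return 2   # a nearer cache in F state forwards the line directly
    if mechanism == "traditional":
        return 4   # request, peer snoop, snoop replies, data transfer
    if mechanism == "source":
        return 3   # broadcast request, confirmations, data transfer
    raise ValueError(f"unknown mechanism: {mechanism}")
```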

       The hop count is the same for a home snoop; the only difference is the

initiator of the snoop. In a home snoop, the home agent would initiate the snoop

(Singh 7). The home agent sends a snoop to all processors to determine which one has the most current copy; the processors reply with confirmation, and the agent then transfers

the requested copy. This process is further improved by a directory that the home

agent manages (Singh 9). This directory keeps information about who has the

most current copy, and hence the home agent only goes to that particular cache

and retrieves it for the requesting processor.
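The directory optimization can be sketched as follows. The data structures are invented for illustration, but the behavior matches the text: with a directory entry the home agent snoops only one cache, and without one it must ask everybody.

```python
class HomeAgent:
    """Toy home agent tracking which cache owns the current copy."""

    def __init__(self) -> None:
        self.directory: dict = {}   # cache-line address -> owning cache

    def record_owner(self, line: int, cache: str) -> None:
        self.directory[line] = cache

    def snoop_targets(self, line: int, all_caches: list) -> list:
        owner = self.directory.get(line)
        # targeted snoop if the owner is known, broadcast otherwise
        return [owner] if owner is not None else list(all_caches)
```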

       The overall performance of a system is improved greatly by the low-

latency point-to-point links provided by QPI. They allow for speedy data transfers

between the processors themselves, a much more efficient design than funneling everything through one slow shared resource. Compared to the Front Side Bus, Intel QuickPath

Interconnect supplies 25.6 GB/s of bandwidth per link-pair, and can transfer a 64

byte cache block in only 5.6 ns (Safranek 6).
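The quoted bandwidth follows from the signaling rate. The derivation below is my own arithmetic, under the common assumption that 16 of each phit’s 20 bits carry payload data: 6.4 GT/s × 2 bytes × 2 directions = 25.6 GB/s. At the one-way rate, a 64-byte block takes 5 ns of raw payload time; the quoted 5.6 ns presumably includes header and CRC overhead.

```python
def link_pair_bandwidth_gb_s(transfers_per_s: float = 6.4e9,
                             data_bits_per_transfer: int = 16,
                             directions: int = 2) -> float:
    """GB/s across a link pair, counting only the data bits per phit."""
    return transfers_per_s * (data_bits_per_transfer / 8) * directions / 1e9

def cache_block_time_ns(block_bytes: int = 64,
                        one_way_gb_s: float = 12.8) -> float:
    """Raw payload time for one cache block over one direction."""
    return block_bytes / one_way_gb_s   # bytes / (GB/s) gives ns
```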


       QPI also meets the demands of server systems by providing premium

Reliability, Availability, and Serviceability (RAS) features. It offers various levels

of error detection, and provides mechanisms for their correction and avoidance.

When a clock pin failure occurs, the built-in clock fail-over mechanism re-routes

the clock from the problem lane to a data lane (Intel 4). The self-healing

capability of links allows them to re-configure themselves and utilize only the

good parts of the link (Intel 4). In addition to rerouting capabilities, QPI has

implicit Cyclic Redundancy Check with link-level retry. Alongside each 72-bit payload, the link sends 8 CRC bits to verify that the data was transferred reliably over the link. If the data on the receiving end is incorrect, the link re-transmits the payload until correct data has been delivered (Intel 4). This

approach provides faster and more reliable data transfer compared to previous

systems that only used 32 CRC bits. The goal of QuickPath Architecture is to

keep the system up and running, working around any link and logical errors.
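The CRC-with-retry behavior is, in essence, a stop-and-wait loop. The sketch below is a generic model of that idea, not Intel’s mechanism, and the CRC-8 polynomial is a stand-in:

```python
def crc8(data: int, bits: int = 72, poly: int = 0x07) -> int:
    """Toy bitwise CRC-8 (illustrative polynomial, not Intel's)."""
    reg = 0
    for i in reversed(range(bits)):
        reg ^= ((data >> i) & 1) << 7
        reg = ((reg << 1) ^ poly) & 0xFF if reg & 0x80 else (reg << 1) & 0xFF
    return reg

def send_with_retry(payload: int, channel, max_tries: int = 8):
    """Retransmit a payload until the receiver's CRC check passes."""
    for attempt in range(1, max_tries + 1):
        rx_payload, rx_crc = channel(payload, crc8(payload))
        if crc8(rx_payload) == rx_crc:      # clean delivery
            return rx_payload, attempt
    raise RuntimeError("link failed after retries")
```

Because any single-bit corruption changes the CRC, the receiver can reject a damaged flit and the sender simply tries again, keeping the link up through transient errors.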

       Having surveyed the features QuickPath Interconnect has to offer, we can now appreciate the improvement in performance. QPI is a new step towards

interconnecting multiprocessor systems. The traditional system bus was no

longer suitable to meet the requirements of new and bigger multiprocessor

systems. QPI was developed with all of these factors in mind, and provides a substantial increase in performance. For these reasons, Intel’s

QuickPath Interconnect was successful in providing a new quick way of

interconnecting multiprocessor systems and improving their overall performance

by the use of high-speed, point-to-point, low-latency links.

                                     Works Cited

Intel. “White Paper: Intel QuickPath Interconnect.” Intel Inside. 1-5. Mar. 2008.

       Web. 15 Nov. 2009. <quickpath/whitepaper.pdf>.

Maddox, Robert A. “An Introduction to Intel QuickPath Interconnect.” Dr. Dobb’s

       (2009): 1-13. 6 Apr. 2009. Web. 19 Nov. 2009.

Singh, Gurbir. “The Architecture of the Intel QuickPath Interconnect.” Intel Inside

       (2009): 1-13. Web. 12 Nov. 2009.

Safranek, Robert J., and Michelle Moravan. “QuickPath Interconnect: Rules of

       the Revolution.” Dr. Dobb’s (2009): 1-6. 9 Nov. 2009. Web. 24 Nov. 2009.

Video. “Intel QuickPath Architecture Demo.” Intel Inside. 12 Nov. 2009. Web.
