CPUs
The Aim

Computers are not marketed these days from a purely technical point of view. Every retailer or manufacturer will try to give their product an edge over very similar products in its class. Graphics cards and motherboards are an excellent example of this right now: different names, same technology. Marketing even goes so far as to deviate from correct technical terminology. Kilo, Mega and Giga do not mean the same thing once the numbers are made "easy" for the general public. The technically correct definitions are:

- 1 bit is a single unit of information, depicted as a 1 or a 0
- There are 8 bits in a byte
- There are 1024 bytes in a kilobyte
- There are 1024 kilobytes in a Megabyte
- There are 1024 Megabytes in a Gigabyte
- And incidentally, although not used in this article, there are 1024 Gigabytes in a Terabyte

1024*1024*1024 is awkward and produces results that are not nice for marketing. Instead, manufacturers move to multiples of 1000: 1000 bytes in a kilobyte, 1000 kilobytes in a megabyte and so forth. This provides nice round numbers. Take this for example (we will cover the calculations later on):

Technically: PC2100 DDR Memory / DDR266 Memory
64 (bits) * 266,000,000 (Hz) = 17,024,000,000 bits/s
(17,024,000,000 / 8) / (1024*1024) = 2029.4MB/s

Marketing: PC2100 DDR Memory / DDR266 Memory
64 (bits) * 266,000,000 (Hz) = 17,024,000,000 bits/s
(17,024,000,000 / 8) / (1000*1000) = 2128MB/s

Convenient, don't you think? Not only does it provide a magical extra 100MB/s of bandwidth, it's also a nice round number (no decimal places).
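To make the arithmetic easy to reproduce, here is a minimal Python sketch (the helper name bandwidth_mb_per_s is our own, not from any library) that computes the same PC2100 figure under both conventions:

```python
def bandwidth_mb_per_s(width_bits: int, clock_hz: int, binary: bool = True) -> float:
    """Peak bandwidth in MB/s for a bus of the given width and clock rate.

    binary=True divides by 1024*1024 (the technically correct megabyte);
    binary=False divides by 1000*1000 (the marketing megabyte).
    """
    bytes_per_second = width_bits * clock_hz / 8
    divisor = 1024 * 1024 if binary else 1000 * 1000
    return bytes_per_second / divisor

# PC2100 DDR / DDR266: a 64-bit bus transferring 266 million times per second
print(round(bandwidth_mb_per_s(64, 266_000_000, binary=True), 1))   # 2029.4
print(round(bandwidth_mb_per_s(64, 266_000_000, binary=False), 1))  # 2128.0
```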

CPUs

At one time, believe it or not, there was a single system bus handling everything from CPU memory accesses all the way down to retrieving data from a hard drive. This had to change. CPU technology was advancing so quickly compared to the rest of the system (smaller manufacturing processes, improvements in silicon purity and so on) that CPU clock speed soon outpaced the paltry system bus speed. At first, manufacturers did what they still do now, partly: they made the other buses a fraction of the CPU bus speed, and we see this today. With a CPU on a 133MHz bus, the PCI bus runs at 1/4 of that, 33MHz (give or take), and the AGP bus runs at 1/2 of the CPU bus, 66MHz. More recently we have seen motherboard manufacturers allowing 1/5 PCI dividers. All this does is help overclockers, as it enables them to run the CPU at, say, a 166MHz bus speed whilst still keeping the PCI bus at roughly 33MHz (166/5 = 33MHz approx). But back then, as today, having a bus divider was not enough. Processor speeds were increasing too fast to keep altering the dividers, so CPUs were given a frequency multiplier. Thus a CPU could run at 1, 2, 3... 10x the bus speed whilst the AGP bus, for example, still ran at 66MHz. To illustrate:
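As a rough sketch of the divider and multiplier arrangement (a 13x multiplier is used purely as an example, and the nominal 133MHz bus really runs at about 133.3MHz):

```python
FSB_MHZ = 133.3        # the nominal "133MHz" front side bus
CPU_MULTIPLIER = 13    # CPU core clock = multiplier * FSB
PCI_DIVIDER = 4        # PCI runs at 1/4 of the FSB
AGP_DIVIDER = 2        # AGP runs at 1/2 of the FSB

cpu_mhz = FSB_MHZ * CPU_MULTIPLIER
pci_mhz = FSB_MHZ / PCI_DIVIDER
agp_mhz = FSB_MHZ / AGP_DIVIDER
print(f"CPU ~{cpu_mhz:.0f}MHz, PCI ~{pci_mhz:.0f}MHz, AGP ~{agp_mhz:.0f}MHz")
# CPU ~1733MHz (the "1.73GHz" class of chip), PCI ~33MHz, AGP ~67MHz
```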

Even more recently, in a bid to tap the overclocking and performance markets, manufacturers have allowed the PCI/AGP buses to run independently of the CPU bus (no more divider). This allows users to adjust the CPU bus speed without the risk of damaging PCI/AGP cards - a common cause of crashes whilst overclocking.

Bandwidth

So far we have established that the CPU clock frequency is for internal operations; all communication with the rest of the system is made via the CPU bus, called the Front Side Bus. Let us deal with the "internal operations" first. By this we mean raw processing power, which can be measured in MIPS (millions of instructions per second) - literally how many instructions the CPU can perform in a set amount of time. We can measure this via benchmarks.
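MIPS itself is just a ratio. Here is a purely illustrative sketch (the instruction count and timing below are invented numbers; real MIPS figures come from benchmark suites):

```python
def mips(instructions_executed: int, elapsed_seconds: float) -> float:
    """Millions of instructions per second."""
    return instructions_executed / elapsed_seconds / 1_000_000

# Hypothetical run: 3.4 billion instructions completed in 2 seconds
print(mips(3_400_000_000, 2.0))  # 1700.0 MIPS
```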

Another bonus of the high clock speed is the CPU cache. Processors of today typically have two caches, Level 1 and Level 2. In older processors (two years ago and more) the Level 1 cache was always built into the CPU core, while the Level 2 cache sat off-die in a separate chip, acting as an intermediary buffer between the CPU and the system memory (DRAM). Modern CPUs now have the Level 2 cache integrated into the CPU. The caches operate at the same clock frequency as the CPU, making them an extremely fast buffer. Algorithms specific to the CPU (and the manufacturing company, of course) help the cache minimise the bottleneck of the Front Side Bus, which we will discuss in a moment. These algorithms can prefetch data that "may be required" by the CPU, reducing the latency of retrieving the information from system memory when the CPU needs it. The size of the cache plays a big part in the performance of the CPU.

The larger the L2 cache, the greater the performance of the CPU - or is it? Does the AMD Athlon Classic with its 512K of L2 cache outperform an Athlon Thunderbird with its 256K of L2 cache, clock for clock? No, because the location of the cache makes all the difference. On-die is significantly faster than off-die: apart from the fact that it runs at the CPU clock speed, it is physically closer. So why don't manufacturers just increase the amount of cache in a CPU to boost performance? Two reasons, and both come down to cost. One is that cache is expensive to manufacture. The other is that more cache means a larger die - fewer processors per silicon wafer, and the cost per CPU goes up.

BUT... all this clock speed, all those MIPS, and still the CPU is limited by the Front Side Bus (FSB). The norm now is a 133MHz FSB with a data width of 64 bits.

Bandwidth Calculations

We calculate bandwidth by multiplying the data width (in bits) by the number of operations per second (Hz); refer back to the bits/bytes table at the start of this article if needed. This calculation will become familiar throughout this article, so don't worry if you don't understand it right away. Thus, the bandwidth of a 133MHz Front Side Bus is:

64 (bits) * 133,000,000 (Hz) = 8,512,000,000 bits/s
(8,512,000,000 / 8) / (1024*1024) = 1014.71MB/s

DDR & QDR

Using the above calculations, it is clear that trying to get all those MIPS down a 133MHz bus is effectively like trying to shove a brick down a straw. The CPUs cripple themselves with the low bus speeds.

What do they do? They use some tricks to effectively widen that straw. Later AMD processors allow data to be transferred twice every clock cycle instead of the once we have effectively been calculating for. This is called DDR, or Double Data Rate. What does this mean? Coupled with suitable DDR memory, the effective FSB doubles from 133MHz to 266MHz. Practically all AMD processors now are tied in with DDR memory (explained in the Memory section) and so people dispense with calling the FSB of AMD processors 133MHz DDR; they just call it a 266MHz FSB. Thus the bandwidth of AMD CPUs increases from 1014.71MB/s to 2029.42MB/s. Remember that figure, it's important. Intel employs a similar strategy: instead of transferring data once per cycle, P4s transfer data 4 times each clock cycle. This is known as QDR or, you guessed it, Quad Data Rate. Thus the effective bandwidth of P4s increases from 1014.71MB/s to 4058.84MB/s - an effective 533MHz bus. Hopefully you will recognise some of the marketing terms in there used to sell us PCs: AMD processors with a 266MHz bus! 533MHz P4s!
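A small extension of the earlier bandwidth sketch captures the transfers-per-cycle trick (again, the helper is our own illustration, not vendor code):

```python
def fsb_bandwidth_mb(width_bits: int, clock_hz: int, transfers_per_cycle: int = 1) -> float:
    """Peak FSB bandwidth in binary MB/s (divided by 1024*1024)."""
    return width_bits * clock_hz * transfers_per_cycle / 8 / (1024 * 1024)

print(round(fsb_bandwidth_mb(64, 133_000_000), 2))     # 1014.71 - plain 133MHz bus
print(round(fsb_bandwidth_mb(64, 133_000_000, 2), 2))  # 2029.42 - AMD DDR, the "266MHz" bus
print(round(fsb_bandwidth_mb(64, 133_000_000, 4), 2))  # 4058.84 - Intel QDR, the "533MHz" bus
```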

Latency

The problem with high multipliers in modern CPUs is the latency involved. The processor clock speed (we will use 1.73GHz as an example) is far in advance of the relatively paltry speeds of the memory bus, AGP bus and so on, so the CPU finds itself having to wait around for the rest of the system to catch up. We shall use an example to illustrate. A processor with a 133MHz bus speed running at 1.73GHz has a clock multiplier of 13 (13 * 133.3 ≈ 1733):

- The CPU sends a request to the system memory for information
- The CPU then waits one cycle (commonly known as the command rate, 1T)
- The memory undergoes what is known as the RAS-to-CAS latency - a delay in switching between rows and columns
- The memory has a delay in finding the data, known as the CAS latency

Thus whilst the CPU has waited 1 CPU cycle and then 4 bus cycles, it has had to wait 1 + (4 * multiplier) CPU cycles to get the data it was after. For every memory bus cycle, the CPU has undergone 13 cycles of its own. Not much when you consider this 1.73GHz CPU has 1.73 billion cycles per second, but how many times does the CPU access main memory? Quite a lot, and so it all adds up.
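In code, the cost of that single memory access comes out as follows (a back-of-the-envelope sketch using the simplified cycle counts from the example above, not real DRAM timings):

```python
CPU_MULTIPLIER = 13   # CPU clock = 13 x the 133MHz front side bus
CPU_SIDE_WAIT = 1     # the single CPU cycle counted up front
BUS_SIDE_WAIT = 4     # bus cycles lost to command rate + RAS-to-CAS + CAS (simplified)

# Every bus cycle the CPU waits costs it 'multiplier' of its own cycles
stalled_cpu_cycles = CPU_SIDE_WAIT + BUS_SIDE_WAIT * CPU_MULTIPLIER
print(stalled_cpu_cycles)  # 53 CPU cycles idle for one main-memory access
```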

Memory

We will consider three different types of computer memory in this article:

- SDR-SDRAM (Single Data Rate Synchronous Dynamic Random Access Memory) - SDR-SDRAM was the dominant memory of the late 90s. Later versions were available at speeds of 66/100/133MHz as standard. This type of memory is/was used by both Intel and AMD for their recent offerings, and was even used in the i845/845G chipset with the Pentium 4 processor. Later we will show what a mistake, or distinct waste of CPU, that was.
- DDR-SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory) - DDR-SDRAM has taken over where SDR memory left off. Particularly with AMD systems (Thunderbird / XP / Thoroughbred), DDR memory has come to the fore as the mainstream memory for the foreseeable future, with DDR-II on the horizon.
- RDRAM (RAMBUS Dynamic Random Access Memory) - Although only really made popular in the mainstream computer market via the Intel Pentium 4 processor, RDRAM technology dates back earlier than DDR memory.

SDR-SDRAM

To calculate memory bandwidth we need to know two things: its data width and its operating frequency. The latter is easier to find out, as it is usually part of the marketing/retail title; we usually see SDR memory at 100 or 133MHz. Taking 133MHz as the example, this means the memory can perform an operation 133 million times every second. The data width is just something you have to look up: SDR memory has a data width of 64 bits, or 8 bytes.

PC100 SDR Memory
The calculation is as follows: data width * operating frequency = bandwidth (in bits/s)

To convert to more realistic and manageable figures, divide the result by 8 to give bytes/s, then divide by 1024 to get kilobytes/s, and by 1024 again to get Megabytes/s. Thus:

64 (bits) * 100,000,000 (Hz) = 6,400,000,000 bits/s
(6,400,000,000 / 8) / (1024*1024) = 762.9MB/s memory bandwidth

PC133 SDR Memory
Using the same formula as we did for PC100 SDR memory, we can easily calculate the theoretical memory bandwidth of PC133 SDR memory.
64 (bits) * 133,000,000 (Hz) = 8,512,000,000 bits/s
(8,512,000,000 / 8) / (1024*1024) = 1014.7MB/s, or roughly 1GB/s memory bandwidth

DDR-SDRAM

DDR memory is slightly more complicated to understand, for two reasons. Firstly, DDR memory can transfer data on both the rising and falling edges of a clock cycle, so in theory it doubles the memory bandwidth of a system able to use it. Secondly, as a marketing push to compete with RAMBUS, the rival technology at the time DDR was introduced, DDR was named after its approximate peak theoretical bandwidth. Much like AMD and the PR rating of today's XP processors, people buy numbers, and DDR looked faster sold as PC1600 and PC2100 instead of PC200 and PC266.

PC1600 DDR Memory / DDR200 Memory
DDR memory has the same data width as SDR memory, 64 bits. We use the same calculation to measure bandwidth, just with the doubled effective frequency.
64 (bits) * 200,000,000 (Hz) = 12,800,000,000 bits/s
(12,800,000,000 / 8) / (1024*1024) = 1525.9MB/s
Notice the bandwidth is twice that of PC100 SDR memory.

PC2100 DDR Memory / DDR266 Memory
64 (bits) * 266,000,000 (Hz) = 17,024,000,000 bits/s
(17,024,000,000 / 8) / (1024*1024) = 2029.4MB/s, or roughly 2GB/s memory bandwidth

With the advent of improved memory yields, modules able to run at higher clock speeds are being released to the market. PC2700 has finally come into its own with the introduction of the AMD XP2700+/2800+ and the Intel i845PE chipset. Here are some bandwidths for the latest memory available:

PC2700 DDR Memory / DDR333 Memory
64 (bits) * 333,000,000 (Hz) = 21,312,000,000 bits/s
(21,312,000,000 / 8) / (1024*1024) = 2540.6MB/s

PC3200 DDR Memory / DDR400 Memory
64 (bits) * 400,000,000 (Hz) = 25,600,000,000 bits/s
(25,600,000,000 / 8) / (1024*1024) = 3051.8MB/s

PC3500 DDR Memory / DDR434 Memory
64 (bits) * 434,000,000 (Hz) = 27,776,000,000 bits/s
(27,776,000,000 / 8) / (1024*1024) = 3311.2MB/s
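The pattern is mechanical enough to script. The sketch below (our own helper names, same simplified assumptions as above) reproduces these figures and shows where the PCxxxx names come from - they are roughly the decimal bytes-per-second figure, rounded to a tidy number by marketing:

```python
MODULE_WIDTH_BITS = 64  # all the SDR/DDR DIMMs discussed here are 64 bits wide

def module_bandwidth_mb(effective_mhz: int) -> float:
    """Peak bandwidth in binary MB/s (divided by 1024*1024) at the effective data rate."""
    return MODULE_WIDTH_BITS * effective_mhz * 1_000_000 / 8 / (1024 * 1024)

def marketing_mb(effective_mhz: int) -> float:
    """Decimal MB/s - the figure the PCxxxx names are rounded from."""
    return MODULE_WIDTH_BITS * effective_mhz / 8

for rate in (200, 266, 333, 400, 434):  # DDR200 ... DDR434
    print(f"DDR{rate}: {module_bandwidth_mb(rate):7.1f}MB/s, marketing ~{marketing_mb(rate):.0f}")
# e.g. DDR333 prints 2540.6MB/s with a marketing figure of 2664 - sold as PC2700
```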

RDRAM

RDRAM is slightly more complicated in that, rather than a single wide 64-bit bus like SDR/DDR, it uses narrow 16-bit (or 32-bit) channels running at much higher clock rates. What does this mean in practice? Currently two sticks of RDRAM have to be used in a system, one per channel; DDR has the advantage (usually from a cost point of view) of being usable as a single DIMM. The calculation is basically the same, however - we just need to take the extra channel and the higher memory speed into account.

PC800
16 (bits) * 800,000,000 (Hz) = 12,800,000,000 bits/s
(12,800,000,000 / 8) / (1024*1024) = 1525.9MB/s
Multiplied by 2 for the dual channel configuration - 3051.8MB/s

PC1066
16 (bits) * 1,066,000,000 (Hz) = 17,056,000,000 bits/s
(17,056,000,000 / 8) / (1024*1024) = 2033.2MB/s
Multiplied by 2 for the dual channel configuration - 4066.4MB/s

PC1200
All PC1200 RAMBUS is 32-bit and is usually found in chipsets that take a single RDRAM module.
32 (bits) * 1,200,000,000 (Hz) = 38,400,000,000 bits/s
(38,400,000,000 / 8) / (1024*1024) = 4577.6MB/s
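Extending the same style of helper to per-channel widths and channel counts reproduces the RDRAM figures (again, just an illustrative sketch):

```python
def rdram_bandwidth_mb(channel_bits: int, clock_hz: int, channels: int) -> float:
    """Peak bandwidth in binary MB/s across one or more RDRAM channels."""
    return channel_bits * clock_hz * channels / 8 / (1024 * 1024)

print(round(rdram_bandwidth_mb(16, 800_000_000, 2), 1))    # 3051.8 - PC800, dual channel
print(round(rdram_bandwidth_mb(16, 1_066_000_000, 2), 1))  # 4066.5 - PC1066, dual channel
print(round(rdram_bandwidth_mb(32, 1_200_000_000, 1), 1))  # 4577.6 - PC1200, single 32-bit module
```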

Quick Roundup thus far

OK, lots of calculations boggling the mind - let us step back and look at the overall picture.

AMD
AMD processors use a 266MHz bus of 64-bit width, giving a bandwidth of 2029.4MB/s.
PC2100 DDR memory operates at 266MHz over a 64-bit memory bus, giving a bandwidth of 2029.4MB/s.
Pretty good, eh? A perfect balance. Thus we can say that AMD processors with a 266MHz bus are perfectly matched with PC2100 DDR memory. An increase in bandwidth on either side is useless, as the other will become the bottleneck.

Intel
P4 Northwoods use a 533MHz bus of 64-bit width, giving a bandwidth of 4058.84MB/s.
PC1066 RDRAM operates at 1066MHz over dual 16-bit channels, giving a bandwidth of 4066.4MB/s.
Again, almost a perfect match. An increase in bandwidth on either side is pretty much pointless.

Misconceptions
UNLEASH THE POWER OF YOUR AMD PROCESSOR WITH THE KT333 CHIPSET COUPLED WITH PC2700 MEMORY! Think about it: processor bandwidth is still the same, 2029.4MB/s, but the memory bandwidth is now increased to 2540.6MB/s. Will there be an increase in subsystem performance? I think not. If you're overclocking, it's a different story. The best performance increases from overclocking come from raising the bus speed, as that increases the CPU bandwidth, so having memory with "room to expand into", shall we say, is a good thing - it allows you to get the most out of your overclocked system. For regular users with a 266MHz FSB CPU, KT333/400 is pointless. The new XP2700+/2800+ processors from AMD have a 333MHz CPU bus speed by default. A 333MHz bus has a bandwidth of 2540.59MB/s, so the KT333 chipset is a perfect match for this CPU and will become the new "default". PC2700 memory is needed for KT333; the bandwidth of PC2700 memory is 2540.6MB/s. A perfect match again - how logical.

nForce
nForce is special, as it heralded the future of memory interfaces, for DDR at least. Dual DDR technology gives two 64-bit channels instead of one, making an effective 128-bit memory bus and allowing twice the bandwidth through it. Although Dual DDR never really made a huge impact on nForce memory bandwidth (so the benchmarks tell us, at least), it has great potential for a recent DDR convert.

Intel, a long-standing advocate of RAMBUS/RDRAM for the Pentium 4, has pledged to move away from the serial memory technology and embrace DDR. Unfortunately, as the memory bandwidth calculations earlier showed, DDR in its current form has neither the bandwidth nor the potential to scale up to RDRAM levels. Dual DDR will make a big difference to Pentium 4 chipsets: P4s with the QDR architecture can achieve bandwidths of around 4GB/s, perfectly matched with PC1066 RDRAM, while the fastest DDR memory currently available, PC3500, has a bandwidth of around 3.1GB/s. The P4 is crippled with current DDR chipsets, so doubling the memory bandwidth of DDR is something Intel is looking forward to.

The PCI Bus

The PCI bus is one of the older buses in a modern system. It is the bus which connects all the expansion cards in a system to the main chipset, along with IDE and USB. The PCI bus is 32 bits wide and runs at 33MHz. Using our familiar calculation we can easily work out its maximum bandwidth:

32 (bits) * 33,000,000 (Hz) = 1,056,000,000 bits/s
(1,056,000,000 / 8) / (1024*1024) = 125.9MB/s

In marketing terms (because of the differing way kilo, mega etc. are used) this is 133MB/s. It is relatively easy to imagine that, with modern ATA133 hard drives, PCI network adapters, sound cards and the like, the PCI bus can easily become saturated. There are three ways around this problem, two of which have already been implemented:

- Expand the bandwidth of the bus - Server motherboards, especially with the prevalence of SCSI hard drives requiring more bandwidth than the PCI bus can transfer, have moved to a 66MHz bus using 64-bit slots. This quadruples the bandwidth afforded:
  64 (bits) * 66,000,000 (Hz) = 4,224,000,000 bits/s
  (4,224,000,000 / 8) / (1024*1024) = 503.5MB/s
  In marketing terms this is 533MB/s.
- Move to a dedicated bus - The obvious example here is graphics cards. With the ever increasing speeds of graphics cards needed to deal with ever more complex games, the PCI bus of old simply cannot handle the sheer amount of information that needs to reach the northbridge and vice versa. Thus the AGP bus was born: a direct link from the AGP card to the chipset, running at 66MHz on a 32-bit bus, giving a maximum bandwidth of
  32 (bits) * 66,000,000 (Hz) = 2,112,000,000 bits/s
  (2,112,000,000 / 8) / (1024*1024) = 251.77MB/s
  In marketing terms this is 266MB/s.

IDE

IDE hard drives transmit data to the CPU, and vice versa, via the PCI bus. Of course, this means that any transfer is limited by the speed of the PCI bus.

At 133MB/s or thereabouts, ATA133 is as high as IDE can get (even though in reality it never gets close anyway). Recent innovations have tried to bypass the PCI bus for IDE transfers: VIA's V-Link technology is a dedicated bus running at 266MB/s between the Southbridge and Northbridge.

Serial ATA

The successor to IDE. Why is this in the PCI section? Well, currently, despite all the hype, Serial ATA controllers all use the PCI bus to transfer information. SATA150, with a theoretical maximum transfer rate of 150MB/s, is limited to the paltry 133MB/s of the PCI bus. Future chipsets will relieve Serial ATA of the PCI bus burden and allow direct access to the chipset, probably on a dedicated bus. This is needed for the next generation of SATA devices able to run at 300/600MB/s.

AGP Bus

As partly explained in the PCI Bus section above, the AGP bus was born to accommodate the ever expanding bandwidth needs of graphics cards. The 133MB/s capacity of the PCI bus simply wasn't able to handle cards faster than the Voodoo 3, one of the last PCI graphics cards. The AGP bus is a 32-bit bus like the PCI bus, but it operates at 66MHz, giving it a maximum bandwidth of 266MB/s. This was, and is, known as AGP 1x. Similar to the QDR implementation of the Intel Pentium 4 processor, the AGP bus was redesigned to allow data to be transferred 2, then 4 times every clock cycle; this is known as AGP 2x/4x. More recently AGP 8x has been introduced. Each iteration of AGP has doubled the bandwidth of the previous standard:

AGP 1x = 266MB/s
AGP 2x = 533MB/s
AGP 4x = 1066MB/s
AGP 8x = 2132MB/s
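The AGP doublings follow the same transfers-per-cycle idea as the P4's QDR bus. A quick sketch in marketing-style decimal MB/s (the true clock is about 66.6MHz, which is how the quoted 266/533/1066/2132 figures arise):

```python
AGP_WIDTH_BITS = 32
AGP_CLOCK_HZ = 66_600_000  # the "66MHz" AGP clock is really about 66.6MHz

for speed in (1, 2, 4, 8):  # AGP 1x, 2x, 4x, 8x
    mb_per_s = AGP_WIDTH_BITS * AGP_CLOCK_HZ * speed / 8 / 1_000_000
    print(f"AGP {speed}x: {mb_per_s:.1f}MB/s")
# 266.4, 532.8, 1065.6 and 2131.2MB/s - quoted as 266/533/1066/2132MB/s
```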

Hypertransport

In all walks of life, things move on. Standards defined 10 years ago and beyond can never hope to scale to today's needs. Just as the 8-bit ISA bus was superseded by the PCI bus, the outdated PCI bus now needs to be phased out and a new interconnect protocol defined. The leading contender for the throne at the moment is Hypertransport: an AMD-led consortium hopes to make Hypertransport the defining interconnect protocol of the foreseeable future.

What is Hypertransport

Hypertransport is a point-to-point interconnect primarily designed for speed, scalability and the unification of the various system buses we have today. The same link can be used to retrieve data from a network card or a bank of DDR memory. In the typical bus layout we know today, by contrast, the CPU talks to the northbridge over the Front Side Bus, memory and AGP hang off the northbridge, and the southbridge handles the PCI bus, IDE and USB.

Hypertransport would eliminate most of the bottlenecks found in today's systems. The PCI bus, as explained earlier, is easily saturated by the high-bandwidth peripherals in use.

In terms of speed, Hypertransport is capable (at the moment) of delivering throughputs of up to 51.2Gbps.
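That headline figure lines up with the widest, fastest link defined at the time - presumably a 32-bit link at an 800MHz clock with DDR signalling (our assumption); the small 2-bit, 500MHz example worked through below scales the same way:

```python
def ht_link_gbps(width_bits: int, clock_hz: int, ddr: bool = True) -> float:
    """Raw Hypertransport link throughput in decimal Gbps, one direction."""
    transfers_per_cycle = 2 if ddr else 1
    return width_bits * clock_hz * transfers_per_cycle / 1_000_000_000

print(ht_link_gbps(32, 800_000_000))  # 51.2 - the headline Gbps figure
print(ht_link_gbps(2, 500_000_000))   # 2.0  - the small example worked through below
```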

*Graph of achievable Hypertransport bandwidths taken from hypertransport.org.

If the achievable bandwidths seem confusing, we can simply apply our well-worn equations to them. Using a 2-bit link at a 500MHz clock rate as an example:

2 (bits) * 500,000,000 (Hz) = 1,000,000,000 bits/s
(1,000,000,000 / 8) / (1024*1024) = 119.2MB/s - and with DDR signalling this is doubled to 238.4MB/s

Or, to use Gbits (basically because it sounds like more):
1,000,000,000 / (1024*1024*1024) = 0.93Gbps (rounded up to 1Gbps), and with DDR signalling this is shunted up to 2Gbps.

We see Hypertransport in today's technology through one company's willingness to break from the norm. NVIDIA's nForce (and nForce2, of course) uses Hypertransport as the primary interconnect, offering throughputs of 800MB/s (nForce1) and 1600MB/s (nForce2). Not top-speed Hypertransport, but more than enough for today's components. VIA have validated Hypertransport for use in their upcoming K8 AMD Hammer chipsets, so the future is certainly picking up for the fledgling protocol.

3GIO

3GIO (3rd Generation Input/Output) is Intel's answer to Hypertransport. It is thought that PCI-Express will replace PCI 2.2 in the near future as a general high-bandwidth IO bus, required to cope with ever more bandwidth-hungry peripherals. Hypertransport has the edge at the moment: it is out there in nForce and nForce2 boards, not to mention being used extensively in upcoming AMD Hammer chipsets in 2003.

Roundup

Before we talk about what is to come, let us briefly cover what is going on at the moment. It should hopefully have become apparent that there are many pitfalls when deciding on a new computer system, for home users and businesses alike. As always, the technical details are buried under a big pile of marketing. Minor advancements in technology that, in reality, do nothing are heralded as the "next big thing"; a quick look under the surface, however, shows this not to be the case. It pains me to see users asking whether they should upgrade their VIA KT266A based motherboard to a VIA KT333 chipset because "it must be faster" - bigger numbers mean faster, right? Wrong. A balanced system means you can squeeze the most out of your setup, be it for gaming, CAD or other intensive work. Nobody wants to spend money needlessly, so read this article again, get a feel for the numbers involved and come to your own conclusions.

The Future

We briefly covered future IO buses. Hypertransport and PCI-Express are on the horizon, or indeed are already here; now we need the peripherals and components to make use of the additional bandwidth. At the moment, it seems that wherever you look there is a bottleneck. Hopefully in the future manufacturers will settle on fewer buses - it's less confusing for the consumer, and it also means computers will become less complex. Take, for example, USB 2.0 and Firewire (not covered in this article), two competing protocols that basically do the same thing: hot-pluggable, scalable, high-bandwidth connections. Why not settle on one and stick with it? Anyway, end of the ranting. We hope you enjoyed this article. It will be updated as new technologies emerge in this ever-changing industry. At the end of the day, this is a reference for us all.


				