The Evolution Of Chip Multi-Processors And Its Role In High Performance And Parallel Computing

					                                            (IJCSIS) International Journal of Computer Science and Information Security,
                                            Vol. 8, No. 7, October 2010

              A.Neela madheswari,                                      Dr.R.S.D.Wahida banu,
 Research Scholar, Anna University, Coimbatore,         Research Supervisor, Anna University, Coimbatore,
                     India.                                                   India.                      

Abstract - Today’s computing environments must support many threads and functional units so that multiple processes can run simultaneously. At the same time, processors must not suffer from the high heat dissipation caused by raising clock frequencies to attain higher speed, and they must still deliver high system performance. These requirements led to the emergence and growth of Chip Multi-Processor (CMP) architecture, which forms the basis for this paper. The paper discusses the role of CMPs in parallel and high performance computing environments and the need to move towards CMP architectures in the near future.

Keywords- CMPs; High Performance computing; Grid Computing; Parallel computing; Simultaneous multithreading.

             I. INTRODUCTION

Advances in semiconductor technology enable the integration of billions of transistors on a single chip. Such exponentially increasing transistor counts make reliability an important design challenge, since a processor’s soft error rate grows in direct proportion to the number of devices being integrated [7]. The huge number of transistors, on the other hand, has led to the popularity of multi-core processor, or chip multi-processor, architectures for improved system throughput [13].

Multi-core processors represent an evolutionary change in conventional computing as well as setting the new trend for high performance computing (HPC), but parallelism is nothing new. Intel has a long history with the concept of parallelism and the development of hardware-enhanced threading capabilities, and has been delivering threading-capable products for more than a decade. The move towards chip-level multiprocessing architectures with a large number of cores continues to offer dramatically increased performance and power characteristics [14].

In recent years, Chip Multi-Processing (CMP) architectures have been developed to enhance performance and power efficiency through the exploitation of both instruction-level and thread-level parallelism. For instance, the IBM Power5 processor enables two SMT threads to execute on each of its two cores and four chips to be interconnected to form an eight-core module [8]. Intel Montecito, Woodcrest, and AMD AMD64 processors all support dual cores [9]. Sun also shipped eight-core, 32-way Niagara processors in 2006 [10, 15]. Chip Multi-Processors (CMP) have the advantages of:
1. Parallelism of computation: Multiple processors on a chip can execute process threads concurrently.
2. Processor core density in systems: Highly scalable enterprise-class server systems as well as rack-mount servers can be built that fit several processor cores in a small volume.
3. Short design cycle and quick time-to-market: Since CMP chips are based on existing processor cores, the product schedules can be short [5].

             II. MOTIVATION

For the last few years, the software industry has made significant advances in computing, and the emerging grid computing, cloud computing and Rich Internet Applications are prime examples of distributed applications. Although we are in machine-based computing now, a shift towards human-based computing is also emerging, in which the voice, speech, gestures and commands of a human can be understood by computers, which then act according to those human signals. Video conferencing, natural language processing and speech recognition software are examples of this human-based computing. For these kinds of computing, there is a need for huge computing power with a number of processors together with the advancement in multi-processor technologies.

In this decade, computer architecture has entered a new ‘multi-core’ era with the advent of Chip Multi-Processors (CMP). Many leading companies, including Intel, AMD and IBM, have successfully released multi-core processor series, such as the Intel IXP network processors [28], the Cell processor [12], the AMD Opteron™, etc. CMPs have evolved largely because of the increased power consumption in nanoscale technologies, which has forced designers to seek alternatives to device scaling to improve performance. Increasing parallelism with multiple cores is an effective strategy [18].

     III. EVOLUTION OF PROCESSOR ARCHITECTURE

Dual and multi-core processor systems are going to change the dynamics of the market and enable new innovative designs delivering high performance with optimized power characteristics. They drive multithreading and parallelism at a level higher than the instruction level, and provide it to mainstream computing on a massive scale. From the operating system (OS) level, they look like a symmetric multi-processor (SMP) system, but they bring many more advantages than typical dual or multi-processor systems.

Multi-core processing is a long-term strategy for Intel that began more than a decade ago. Intel has more than 15 multi-core processor projects underway and is on the fast track to deliver multi-core processors in high volume across all of its platform families. Intel’s multi-core architecture will possibly feature dozens or even hundreds of processor cores on a single die. In addition to general-purpose cores, Intel multi-core processors will eventually include specialized cores for processing graphics, speech recognition algorithms, communication protocols, and more. Many new and significant innovations designed to optimize power, performance, and scalability are implemented in the new multi-core processors [14].

According to the number of functional units running simultaneously, processor architecture is classified into 3 main types:
(1) Single processor architecture, which does not support multiple functional units running simultaneously.
(2) Simultaneous multithreading (SMT) architecture, which supports multiple threads running simultaneously but does not let two threads use the same functional unit at any particular time.
(3) Multi-core architecture or Chip multi-processor (CMP) architecture, which supports functional units running simultaneously and may also support multiple threads simultaneously at any particular time.

A. Single processor architecture

The single processor architecture is shown in Figure 1. Here only one processing unit is present in the chip for performing arithmetic or logical operations. At any particular time, only one operation can be performed.

Figure 1: Single core CPU chip

B. Simultaneous multithreading (SMT) architecture

SMT permits multiple independent threads to execute simultaneously on the same core. If one thread is waiting for a floating point operation to complete, another thread can use the integer units. Without SMT, only a single thread can run at any given time. With SMT, however, the same functional unit cannot be used by two threads simultaneously: if two threads want to use the integer unit at the same time, one of them has to wait. Here all the caches of the system are shared.
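
The contention rule described for SMT can be illustrated with a toy cycle count. The sketch below is a minimal model under stated assumptions, not a description of any real processor: each thread is a list of operations labelled by the functional unit they need (the labels `int` and `fp` and the function names `smt_cycles` and `cmp_cycles` are hypothetical), an SMT core is assumed to have exactly one unit of each kind, and every operation is assumed to take one cycle. For contrast, it also counts cycles when each thread has a core of its own, as in type (3) of the classification.

```python
from collections import deque


def smt_cycles(threads):
    """Cycles needed on one SMT core with a single unit of each kind.

    Assumption: in each cycle every thread tries to issue its next
    operation, and a functional unit serves at most one thread per cycle.
    Threads earlier in the list win ties (a simplification).
    """
    queues = [deque(ops) for ops in threads]
    cycles = 0
    while any(queues):
        busy = set()  # units already claimed in this cycle
        for q in queues:
            if q and q[0] not in busy:
                busy.add(q.popleft())  # issue the operation, claim its unit
        cycles += 1
    return cycles


def cmp_cycles(threads):
    """Cycles needed when every thread runs on its own core.

    Assumption: each core has private units and issues one operation
    per cycle, so the longest thread determines the total.
    """
    return max((len(ops) for ops in threads), default=0)


if __name__ == "__main__":
    same_unit = [["int", "int"], ["int", "int"]]
    mixed = [["int", "fp"], ["fp", "int"]]
    print("same unit:", smt_cycles(same_unit), "SMT vs", cmp_cycles(same_unit), "CMP")
    print("mixed    :", smt_cycles(mixed), "SMT vs", cmp_cycles(mixed), "CMP")
```

In this model, two threads that both need the integer unit in every cycle are serialized on the shared core, while threads that alternate between integer and floating point work overlap fully.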
                                                            C. Chip Multi-Processor architecture


In multi-core or chip multi-processor architecture, multiple processing units, or cores, are present on a single die. Figure 2 shows a multi-core architecture with 3 cores in a single CPU chip. Here all the cores fit on a single processor socket, called a Chip Multi-Processor. The cores can run in parallel. Within each core, threads can be time-sliced as in a single processor system [17].

Figure 2: Chip multi-processor architecture

The multi-core architecture with cache and main memory, shown in Figure 3, comprises processor cores 0 to N; each core has a private L1 cache which consists of an instruction cache (I-cache) and a data cache (D-cache).

Figure 3: Multi-core architecture with memory

Each L1 cache is connected to the shared L2 cache. The L2 cache is unified and inclusive, i.e. it includes all the lines contained in the L1 caches. The main memory is connected to the L2 cache; if a data request misses in the L2 cache, the data access goes to main memory [20].

     IV. EXISTING ENVIRONMENTS FOR CHIP MULTI-PROCESSOR ARCHITECTURE

Chip multi-processors are used in environments ranging from desktops to high performance computing. Subsections A, B and C show the existence and the main role of CMPs in various computing environments.

A. High Performance Computing

High performance computing uses supercomputers and computer clusters to solve advanced computation problems. A list of the most powerful high-performance computers can be found on the Top500 list.

Top500 is a list of the world’s fastest computers. The list is created twice a year and includes some rather large systems. Not all Top500 systems are clusters, but many of them are built from the same technology. There may be HPC systems that are proprietary or whose owners are not interested in the Top500 ranking. The Top500 list also offers a wealth of historical data. The list was started in 1993 and has data on vendors, organizations, processors, memory, and so on for each entry in the list [22]. As per the information taken in June 2010 from [23], the first 10 systems are given in Table 1.

Table 1: Top 10 supercomputers list
Rank  Processor details                                                                                   Year
1.    Jaguar - Cray XT5-HE Opteron Six Core 2.6 GHz                                                       2009
2.    Nebulae - Dawning TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU                                 2010
3.    Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband   2009
4.    Kraken XT5 - Cray XT5-HE Opteron Six Core 2.6 GHz                                                   2009
5.    JUGENE - Blue Gene/P Solution                                                                       2009
6.    Pleiades - SGI Altix ICE 8200EX/8400EX, Xeon HT QC 3.0/Xeon Westmere 2.93 GHz, Infiniband           2010
7.    Tianhe-1 - NUDT TH-1 Cluster, Xeon E5540/E5450, ATI Radeon HD 4870 2                                2009
8.    BlueGene/L - eServer Blue Gene Solution                                                             2007
9.    Intrepid - Blue Gene/P Solution                                                                     2007
10.   Red Sky - Sun Blade x6275, Xeon X55xx 2.93 GHz, Infiniband                                          2010

Among the top 10 supercomputers, Jaguar and Kraken use multi-core processors that fall under CMP processors. Thus chip multi-processors are involved in high performance computing environments, and their role will extend in the near future since the worldwide HPC market is growing rapidly. Successful HPC applications span many industrial, government and academic sectors.

B. Grid computing

Grid computing has emerged as the next-generation parallel and distributed computing methodology, which aggregates dispersed heterogeneous resources for solving various kinds of large-scale parallel applications in science, engineering and commerce [3]. As per [24], the various grid computing environments are:
 1. DAS-2: DAS-2 is a wide-area distributed computer of 200 dual Pentium-III nodes.
 2. Grid5000: It is distributed over 9 sites and contains approximately 1500 nodes and approximately 5500 CPUs [29].
 3. NorduGrid: It is one of the largest production grids in the world, with more than 30 sites of heterogeneous clusters. Some of the cluster nodes contain dual Pentium III processors [ng].
 4. AuverGrid: It is a heterogeneous cluster [30].
 5. Sharcnet: It is a cluster of clusters. It consists of 10 sites and has 6828 processors [24].
 6. LCG: It contains 24115 processors [24].

In some of these grids the processors involved are of multi-core type. Hence chip multi-processors are also used in grid computing environments.

C. Parallel computing

Parallel computing plays a major role in current trends and in almost all fields. Formerly it was useful only for solving very large problems such as weather forecasting. Nowadays the concept of parallel computing is used everywhere from supercomputing environments to modern desktop environments, such as quad-core processors or GPU usage [25].

As per the parallel workloads archive [21], the parallel computing systems are listed as:
    1. CTC IBM SP2: It contains a 512-node IBM SP2, during 1996.
    2. DAS-2 5-Cluster: It contains 72 nodes, each a dual 1GHz Pentium-III, during 2003.
    3. HPC2N: It contains 120 nodes, each with two AMD Athlon MP2000+ processors (240 in total), during 2002.
    4. KTH IBM SP2: It contains a 100-node IBM SP2, during 1996.
    5. LANL: It contains a 1024-node Connection Machine CM-5, during 1994.
    6. LANL O2K: It contains a cluster of 16 Origin 2000 machines with 128 processors each (2048 total).
    7. LCG: It contains the LHC (Large Hadron Collider) Computing Grid, during 2005.
    8. LLNL Atlas: It contains 1152 nodes, each with 8 AMD Opteron processors, during 2006.
    9. LLNL T3D: It contains 128 nodes, each with two DEC Alpha 21064 processors, during 1996.
    10. LLNL Thunder: It contains 1024 nodes, each with 4 Intel IA-64 Itanium processors, during 2007.
    11. LLNL uBGL: It contains 2048 processors, during 2006.
    12. LPC: It contains 70 dual 3GHz Pentium-IV Xeon nodes, during 2004.
    13. NASA: It contains 128 nodes.


    14. OSC Cluster: It has two types of nodes, 32 quad-processor nodes and 25 dual-processor nodes, for a total of 178 processors, during 2000.
    15. SDSC: It contains 416 nodes, during 1995.
    16. SDSC DataStar: It contains 184 nodes, during 2004.
    17. SDSC Blue Horizon: It contains 144 nodes, during 2000.
    18. SDSC SP2: It contains a 128-node IBM SP2, during 1998.
    19. SHARCNET: It contains 10 clusters with quad and dual core processors, during 2005.

Hence most of the processors involved in the parallel computing machines are multi-core processor types. This implies the involvement of multi-core processors in parallel computing environments.

           V. CMP CHALLENGES

The advent of multi-core processors and the emergence of new parallel applications that take advantage of such processors pose difficult challenges to designers.

With relatively constant die sizes, limited on-chip cache, and scarce pin bandwidth, placing more cores on a chip reduces the amount of available cache and bus bandwidth per core, thereby exacerbating the memory wall problem [1]. The designer has to build a processor that provides a core with good single-thread performance in the presence of long-latency cache misses, while enabling as many of these cores as possible to be placed on the same die for high throughput.

Limited on-chip cache area, reduced cache capacity per core, and the increase in application cache footprints as applications scale up with the number of cores will make cache miss stalls more problematic [19].

The problem of shared L2 cache allocation is critical to the effective utilization of multi-core processors. Sometimes unbalanced cache allocation will happen, and this situation can easily lead to serious problems such as thread starvation and priority inversion, which threaten the processor’s utilization ratio and system performance.

Chip-Multiprocessor (CMP) or multi-core technology has become the mainstream in CPU designs. It embeds multiple processor cores into a single die to exploit thread-level parallelism for achieving higher overall chip-level Instructions-Per-Cycle (IPC) [2, 4, 6, 11, 27]. Combined with increased clock frequency, a multi-core, multithreaded processor chip demands higher on- and off-chip memory bandwidth and suffers longer average memory access delays despite an increasing on-chip cache size. Tremendous pressure is put on memory hierarchy systems to supply the needed instructions and data in a timely manner [16].

Memory and on-chip memory bandwidth are among the main concerns that play an important role in improving system performance in CMP architecture. Similarly, the interconnection of the cores within the single die is also an important consideration.

           VI. CONCLUSION

In today’s scenario, it is essential to shift towards chip multi-processor architectures. This applies not only to high performance and parallel computing but also to desktops, in order to face the challenges of system performance. Day by day, the challenges faced by CMPs become more complicated, but the applications and needs are also increasing. Suitable steps must be taken to decrease power consumption and leakage current.

References

[1] W. Wulf and S. McKee, “Hitting the Memory Wall: Implications of the Obvious”, ACM SIGArch Computer Architecture News, 23(1):20-24, March 1995.
[2] L. Hammond, B. A. Nayfeh and K. Olukotun, “A Single-Chip Multiprocessor”, IEEE Computer, Sep. 1997.
[3] I. Foster, C. Kesselman (Eds.), “The Grid: Blueprint for a Future Computing Infrastructure”, Morgan Kaufmann Publishers, 1999.
[4] J. M. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy, “IBM eserver Power4 System Microarchitecture”, IBM White Paper, Oct. 2001.
[5] Ishwar Parulkar, Thomas Ziaja, Rajesh Pendurkar, Anand D’Souza and Amitava Majumdar, “A Scalable, Low Cost Design-For-Test Architecture for UltraSPARC™ Chip Multi-Processors”, International Test Conference, IEEE, 2002, pp.726-735.
[6] Sun Microsystems, “Sun’s 64-bit Gemini Chip”, Sunflash, 66(4), Aug. 2003.
[ng] “NorduGrid – The Nordic Testbed for Wide area computing and data handling”, Final Report, Jan 2003.
[7] S. Mukherjee, J. Emer, and S. Reinhardt, “The soft error problem, an architectural perspective”, HPCA-11, 2005.
[8] B. Sinharoy, R. Kalla, J. Tendler, R. Eickemeyer, and J. Joyner, “Power5 system microarchitecture”, IBM Journal of Research and Development, 49(4/5):505–521, 2005.
[9] C. McNairy and R. Bhatia, “Montecito: A dual-core, dual-thread Itanium processor”, IEEE Micro, 25(2):10–20, 2005.
[10] P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-way multithreaded Sparc processor”, IEEE Micro, 25(2):21–29, 2005.
[11] AMD, “Multi-core Processors: The Next Evolution in Computing”, Core_Processors_WhitePaper.pdf, 2005.
[12] A. Eichenberger, J. O’Brien, et al., “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture”, IBM Systems Journal, 45:59–84.
[13] Huiyang Zhou, “A Case for fault tolerance and performance enhancement using Chip Multi-Processors”, IEEE Computer Architecture Letters, Vol.5, 2006.
[14] Pawel Gepner, Michal F. Kowalik, “Multi-Core Processors: New way to achieve high system performance”, in Proceedings of the International Symposium on Parallel Computing in Electrical Engineering, IEEE, 2006.
[15] Fengguang Song, Shirley Moore, Jack Dongarra, “L2 Cache Modeling for Scientific Applications on Chip Multi-Processors”, International Conference on Parallel Processing (ICPP), 2007.
[16] Lu Peng, Jih-Kwon Peir, Tribuvan K. Prakash, Yen-Kuang Chen and David Koppelman, “Memory performance and scalability of Intel’s and AMD’s Dual-Core Processors: A case study”, IEEE, 2007, pp.55-64.
[17] Jernej Barbic, “Multi-core architectures”, 15-213, Spring 2007, May 2007.
[18] Sushu Zhang, Karam S. Chatha, “Automated Techniques for Energy Efficient scheduling on Homogeneous and Heterogeneous Chip Multi-processor architectures”, IEEE, 2008, pp.61-66.
[19] Satyanarayana Nekkalapu, Haitham Akkary, Komal Jothi, Renjith Retnamma, Xiaoyu Song, “A Simple Latency Tolerant Processor”, IEEE, 2008, pp.384-389.
[20] Benhai Zhou, Jianzhong Qiao, Shu-kuan Lin, “Research on fine-grain cache assignment scheduling algorithm for multi-core processors”, IEEE, 2009, pp.1-4.
[21] Parallel workloads archive, Dror G. Feitelson, March 2009.
[22] Douglas Eadline, “High Performance Computing for Dummies”, SUN and AMD Special edition, 2009.
[23] Top 10 super computers, Sep 2010.
[24] Grid computing environments, June 2010.
[25] A. Neela Madheswari, R.S.D. Wahida Banu, “Important essence of co-scheduling for parallel job scheduling”, Advances in Computational Sciences and Technology, Vol.3, No.1, 2010, pp.49-55.
[26] The Distributed ASCI Supercomputer 2, Sep 2010.
[27] Intel, “Inside Intel Core Microarchitecture and Smart Memory Access”, e/sma.pdf.
[28] Intel, “Intel ixp2855 network processor - product brief”.
[29] Pierre Riteau, Mauricio Tsugawa, Andrea Matsunaga, Jose Fortes, Tim Freeman, Kate Keahey, “Sky computing on FutureGrid and Grid5000”.
[30] AuverGrid, http://gstat-, Sep 2010.

           AUTHOR’S PROFILE

A. Neela Madheswari received her Master of Computer Science and Engineering degree from Vinayaka Missions University in June 2006. Currently, she is doing her research in the area of Parallel and Distributed systems under Anna University, Coimbatore. Earlier she completed her B.E. in Computer Science and Engineering from Madras University, Chennai, in April 2000. Later, she joined Mahendra Engineering College as a Lecturer in the CSE department in 2002. She completed her M.E. in Computer Science and Engineering from Vinayaka Missions University during 2006, and she now serves as Assistant Professor at MET’S School of Engineering, Thrissur. Her research interests include Parallel and Distributed Computing and Web Technologies. She is a member of the Computer Society of India, Salem. She has presented papers in national and international journals and at national and international conferences. She is a reviewer for the journals IJCNS and IJCSIS.
