Myrinet Technology Roadmap

         Dr. Charles L. Seitz
            CEO & CTO
            Myricom, Inc.
          chuck@myri.com

   Myrinet Users Group Conference
           Vienna, Austria
            13 May 2002
                                Charles L. Seitz
         www.myri.com           MUG-2002, 13 May 2002
    Myrinet Technology – History & Roadmap
Year           Generation / link rate   Products & Features
1994           1st Generation           32-bit SBus (SPARC) interfaces, 8-port switches
1995           0.64+0.64 Gb/s links
1996                                    32-bit PCI interfaces (LANai 4), 8-port switches; SAN PHY level
1997                                    Clos “network in a box” of 8-port switches
1998           2nd Generation           16-port switches, HA features
               1.28+1.28 Gb/s links     64-bit PCI interfaces (LANai 7), GM message system
1999                                    Clos “network in a box” of 16-port switches
2000                                    64-bit PCI interfaces (LANai 9), SW16, Clos128
2001                                    Fiber becomes prevalent for Myrinet-2000 links
      (Past)   3rd Generation
2002  (Future) “Myrinet 2000”           PCI-X interfaces, GM 2; GbE & 1x InfiniBand ports on Myrinet switches
               2+2 Gb/s links
2003                                    3GIO interfaces
2004                                    4x Myrinet links

Current Market and Technology Forces (winds of change)
• Continued healthy growth for clusters
   – All of the major OEMs now offer clusters.
   – Excellent progress in distributed-computing applications.
   – Myricom’s competitive position -- the clear market leader
       • Myricom’s 2001 revenue growth was 102%; 5-year growth has been ~609%.
       • Myricom is already shipping >80% of ports in this market niche.
• Faster hosts, faster I/O (PCI-X and 3GIO)
   – Just what we hoped for to be able to build better clusters.
   – Moore’s Law still rules
       • Advances in microelectronics (including VCSEL fiber components) apply to
         interconnect in the same way as to processors and memory.
• InfiniBand
   – Contributing to the expectation that interconnect will become “commodity.”
   – However, IB has been a technical disappointment so far, and 1x IB is a
     ‘non-starter’ in the marketplace.


                  Myricom’s Strategy (Priorities)
• The “whole product” concept
   – Extraordinary efforts toward software reliability and customer support
• Extend Myrinet performance ~2x at close to present prices.
   – PCI-X interfaces with two Myrinet-2000 ports
       • Two-port NICs also have applications for high availability.
• Extend and broaden Myricom’s market
   – “Low-end” Myrinet interfaces (PCI-X)
       • Over the next ~18 months, bring the list prices of “low-end” fiber interfaces down
         to ~$700.
   – GbE ports on Myrinet switches
       • Interoperability for Myrinet, and possible market for GbE Beowulf clusters.
   – Remain positioned to ship components with InfiniBand ports
       • InfiniBand ports on Myrinet switches.
       • Myrinet-2000 PHY is exactly the 1x InfiniBand PHY.
   – Motherboard NIC modules
   – And more…

       Links: 2.5 GBaud, full duplex, mostly fiber




(At the PHY level, these links are identical to 1x InfiniBand.)

Advantages of fiber: small-diameter, lightweight, flexible cables; reliability;
EMC; 200m length; connector size. (See http://www.myri.com/news/01723/)
                       Links: Changes Planned
• (June 2002) - first chips with multi-protocol ports
   – A multi-protocol port can act as a Myrinet port, long-range-Myrinet port
     (1310nm single-mode fiber to 20km), GbE port, or InfiniBand port.
   – Interoperability between Myrinet, GbE, & InfiniBand.
• (Nov 2002) - “High-end” PCI-X interfaces with two ports
   – 2 x (250+250) MB/s = 1GB/s, a good match to 1 GB/s PCI-X.
   – GM-2 route dispersion can use both links concurrently
• (Early 2003) - SerDes function integrated into Myricom custom-VLSI chips
   – These serial links will displace today’s SAN-2000 PHY.
   – 2+2 Gb/s data rate, 2.5+2.5 GBaud (8b/10b encoded) links are also used as the
     base PHY by 3GIO. Myricom plans to support 3GIO -- initially 4x 3GIO -- as
     soon as 3GIO hosts become available.
• (Early 2004) - “4x” Myrinet (multi-protocol) links
   – Most product volume is expected to continue with “1x” links through 2006.
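
The rate figures in the bullets above follow from the 8b/10b line coding. A minimal sketch of that arithmetic, assuming 8 payload bits per 10 line bits and MB meaning 10^6 bytes; the function names are illustrative only:

```python
# Sketch of the link-rate arithmetic quoted on this slide.
# Assumption: 8b/10b coding carries 8 data bits in every 10 line bits.

def payload_gbps(line_rate_gbaud: float) -> float:
    """Data rate in Gb/s carried by an 8b/10b-encoded serial link."""
    return line_rate_gbaud * 8 / 10

def payload_mbytes_per_s(line_rate_gbaud: float) -> float:
    """Data rate in MB/s (10**6 bytes per second), one direction."""
    return payload_gbps(line_rate_gbaud) * 1000 / 8

# One Myrinet-2000 link runs at 2.5 GBaud in each direction.
one_direction = payload_mbytes_per_s(2.5)

# A two-port interface: two links, each counted in both directions.
two_port_nic = 2 * (one_direction + one_direction)

print(payload_gbps(2.5))   # 2.0 Gb/s per direction
print(one_direction)       # 250.0 MB/s per direction
print(two_port_nic)        # 1000.0 MB/s, the "1GB/s" matched to PCI-X
```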


Switches: 128-Host Clos Network (Flagship)




                    Switches: Changes Planned
• Switches with a mix of Myrinet, long-range-Myrinet, GbE, and
  InfiniBand ports (starting this year).
   – High-degree switches with GbE ports may find a market for “Beowulf”
     clusters that use next-generation hosts with GbE on the motherboard.
• The use of dispersive routing (GM 2) allows better utilization of
  Myrinet Clos networks (also HA at a finer time scale).
• More capable monitoring line card that can run Linux.
• Very few “Myrinet” changes until the advent of “4x Myrinet” links
  (early 2004).
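
One way to picture dispersive routing on a Clos network is to spread consecutive packets over the alternative minimal paths between two leaf switches. The sketch below is not GM 2's actual algorithm, which these slides do not describe. It assumes a two-level Clos with 8 spine switches and uses plain round-robin selection; the names `clos_routes` and `disperse` and the topology parameters are hypothetical.

```python
from itertools import cycle

def clos_routes(src_leaf: int, dst_leaf: int, spines: int = 8):
    """List the minimal routes between two leaf switches.

    Assumption: a two-level Clos in which every leaf switch has one
    uplink to each of `spines` spine switches.
    """
    if src_leaf == dst_leaf:
        # Both hosts hang off the same leaf switch: a single route.
        return [(f"leaf{src_leaf}",)]
    return [(f"leaf{src_leaf}", f"spine{s}", f"leaf{dst_leaf}")
            for s in range(spines)]

def disperse(packets, routes):
    """Pair each packet with a route, cycling through the routes
    so that consecutive packets take different paths."""
    return list(zip(packets, cycle(routes)))

routes = clos_routes(0, 5)
assignment = disperse(range(16), routes)
spines_used = {route[1] for _, route in assignment}

print(len(routes))       # 8 alternative paths
print(len(spines_used))  # 8: every spine carries part of the traffic
```

With a single fixed route, all 16 packets would cross the same spine switch; spreading them is what lets the traffic use the network's full bisection, and it also means a failed path affects only a fraction of the packets.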




Interfaces: Current production M3F-PCI64B-2




                   Myricom’s highest volume product

        Interfaces: 64-bit, 66MHz, Myrinet/PCI interfaces
• PCI64B, 133MHz RISC and memory
     – 1067 MB/s memory bandwidth
• PCI64C, 200MHz RISC and memory
     – 1600 MB/s memory bandwidth


[Block diagram: SAN port (500 MB/s) to the LANai 9 chip (network interface,
packet DMA, RISC); fast SRAM (1067 or 1600 MB/s); PCIDMA chip (DMA controller
& bus bridge) to the PCI bus (533 MB/s).]

                     Interfaces: Changes Planned
• Faster RISCs. Higher local-memory bandwidth
   – Lower latency, to ~4µs GM latency by the end of 2002 (from 7µs currently).
   – MPI latency will decrease correspondingly.
   – Higher throughput.
• Higher levels of integration
   – LANai (10) XP - 225MHz RISC and memory, PCI-X, and one port.
       • Pricing of low-end interfaces is expected to decline over the next 1-2 years to less than
         50% of current prices.
   – LANai (10) 2XP - 300MHz RISC and memory, PCI-X, and two ports.
   – LANai with on-chip memory (not this year)
       • Open-ended performance growth to 600+MHz RISC and memory.




                     LANai 10 – New Features
• 250+250 MB/s multi-protocol ports
   – Multi-protocol ports connect directly to a 10b SerDes (8b/10b-encoded data)
     to support Myrinet-Fiber, single-mode-fiber Myrinet, GbE, or InfiniBand.
• Three pinout versions of the LANai 10 - XM, XP, 2XP
   – See the following block diagrams, and Jakov’s talk
• Self-initialization of the LANai memory from ROM
   – Necessary for stand-alone protocol converters.
   – For interfaces, allows diskless hosts to boot over the Myrinet.
• Performance boost, plus headroom for future performance gains
   – Initially: XM/XP versions 200-225MHz ZBT SRAM & RISC
   – 2XP version 300+MHz ZBT SRAM & RISC
       • Evolve to products with DDR or “Sigma” SRAM
   – Headroom: 2.4–4.8+ GB/s local-memory data rate; 300–600+MHz RISC



                  LANai 10 as a protocol converter

[Block diagram: LANai XM. A SAN network interface connects to the line-card
XBar16 port; an X network interface connects through a SerDes to the line-card
front-panel port. Each network interface has its own send/recv DMA engines.
Other blocks: control & memory initialize (JTAG, to the line-card µC); L-bus
memory interface to x72b SRAM; RISC.
Front-panel port modes (program control): Myrinet, 1310nm fiber, InfiniBand, GbE.]

This circuitry is repeated for each line-card port.

    Low-Cost LANai 10 PCI-X Interface


[Block diagram: LANai XP. PCI-card port, through a SerDes, to an X network
interface with send/recv DMA engines; L-bus memory interface to x72b SRAM
(225MHz); RISC; PCI-X & DMA engine to the PCI-X bus; control & memory
initialize, with interface EEPROM & JTAG.]




    High-End LANai 10 PCI-X Interface
[Block diagram: LANai 2XP. Two PCI-card ports, each through its own SerDes to
an X network interface with send/recv DMA engines; L-bus memory interface to
x72b SRAM (300MHz); RISC; PCI-X & DMA engine to the PCI-X bus; control & memory
initialize, with interface EEPROM & JTAG; dRAM DMA engine to optional dRAM,
used for IO page tables.]


       Myrinet Software: Basic OS-Bypass Structure

[Layer diagram, top to bottom:
   Applications
   Middleware: MPI, VIA
   OS-bypass APIs (multiple host processes)
   Host OS: UDP and TCP over IP, over Ethernet and Myrinet
   Myrinet Control Program (MCP) (executes in the Myrinet interface)
   Links: Ethernet 10/100/1000 Mb/s; Myrinet 2000 + 2000 Mb/s]

                      The GM Message-Passing System
No Compromises
• Concurrent, protected, user-level access
• Reliable, ordered message delivery
• Very low CPU overhead
• Robust under network faults
• Mapping
• Segmentation and reassembly of long messages
• High-level flow control
• “Clean” API, with exception handling
• Zero-copy layering of other APIs

GM Data-Rate Performance (Myrinet-2000 Fiber Interfaces)
[Plot not reproduced. Annotations: UNIX user process to user process; fully
protected; end-to-end data integrity.]

GM short-message latency (Myrinet-2000 interfaces): ~7µs (PCI64C) or ~9µs (PCI64B)
GM CPU overhead = 1-2µs per message (LogP)
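
Segmentation and reassembly of long messages can be illustrated generically. The sketch below is not GM's wire format or API, neither of which is described in these slides. It shows the general technique of cutting a message into offset-tagged fragments and rebuilding it at the receiver; the fragment size of 4096 bytes and the names `segment` and `reassemble` are hypothetical.

```python
def segment(message: bytes, mtu: int):
    """Split a message into (offset, chunk) fragments of at most mtu bytes."""
    return [(off, message[off:off + mtu])
            for off in range(0, len(message), mtu)]

def reassemble(fragments, total_len: int) -> bytes:
    """Rebuild the message by writing each chunk at its offset.

    Because every fragment carries its offset, arrival order does not matter.
    """
    buf = bytearray(total_len)
    for off, chunk in fragments:
        buf[off:off + len(chunk)] = chunk
    return bytes(buf)

message = bytes(range(256)) * 40            # a 10240-byte test message
fragments = segment(message, mtu=4096)      # hypothetical fragment size
rebuilt = reassemble(reversed(fragments), len(message))

print(len(fragments))       # 3
print(rebuilt == message)   # True
```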

             Current Software Distributions

OS                 Platforms
Linux              IA-32, IA-64, Alpha, PowerPC, UltraSPARC
Win2000/XP         IA-32, IA-64
Solaris            UltraSPARC
Tru64              Alpha
HP-UX              PA-RISC, developed by HP, used for HyperFabric
AIX                PowerPC/IBM Power
Irix               MIPS
VxWorks            PowerPC
MacOS X            Apple Macintosh G4
FreeBSD, …         IA-32 & Alpha



Small Part of Myricom’s Software Lab




       Yes, Myrinet runs beautifully on McKinleys

                        “gm_debug” for Myricom’s early
                        4-processor McKinley boxes using
                        M3F-PCI64C interfaces in PCI-X slots.
                        Myricom is distributing both Linux and
                        Windows software for IA64 now. Several
                        Itanium clusters in service.

 DMA rate for 16384 Byte pages (64bit / 60MHz bus)
 Timing 32 pages.
         bus_read (send) = 418 MBytes/s
         bus_write (recv) = 439 MBytes/s

Much higher throughputs are expected for the future Myrinet/PCI-X Interfaces.
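
To put the gm_debug figures in context, the sketch below computes how long the 32-page transfer takes at each measured rate and what fraction of the bus's peak rate that represents. It assumes "MBytes" means 10^6 bytes and takes the bus width and clock exactly as printed (64 bit, 60MHz); the helper name `transfer_ms` is illustrative.

```python
PAGE_BYTES = 16384   # page size reported by gm_debug
PAGES = 32           # number of pages timed
BUS_BYTES = 8        # 64-bit bus
BUS_MHZ = 60         # bus clock as printed in the gm_debug output

def transfer_ms(rate_mbytes_s: float, nbytes: int = PAGE_BYTES * PAGES) -> float:
    """Time in milliseconds to move nbytes at the given rate.

    Assumption: 1 MByte = 10**6 bytes.
    """
    return nbytes / (rate_mbytes_s * 1e6) * 1e3

peak = BUS_BYTES * BUS_MHZ   # peak bus rate in MB/s, one transfer per clock

for name, rate in (("bus_read (send)", 418), ("bus_write (recv)", 439)):
    share = 100 * rate / peak
    print(f"{name}: {transfer_ms(rate):.3f} ms, {share:.1f}% of {peak} MB/s peak")
```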

     Current Choice of Myrinet Software Interfaces
• The GM API
  – Low level, but some applications are programmed at this level
• TCP/IP
  – Actually, “ethernet emulation,” included in all GM releases
      • 1.8 Gb/s TCP/IP under GM 2 (netperf benchmarks)
• MPICH-GM
  – An implementation of the Argonne MPICH directly over GM.
• VI-GM
  – An implementation of the VI Architecture API directly over GM.
      • Possibly relevant to InfiniBand compatibility
• Sockets-GM
  – An implementation of UNIX or Windows sockets (or DCOM) over GM.
    Completely transparent to application programs. Use the same binaries!
      • Sockets-GM/GM/Myrinet is similar to the proposed SDP/InfiniBand.



              Myrinet Software: Changes Planned
• Increasing emphasis on Myrinet interoperability with GbE and
  InfiniBand.
   – Requires improvements & simplification of the mapper.
• GM 2
   – GM-2.0-alpha0 (Linux, FreeBSD) is on the web for download now.
• Possible/probable Myricom support for HP-UX.
   – The only major software platform that we don’t support today.
• Performance.
   – We are still leaving some performance ‘on the table.’
• No other “middleware” layers in sight.
   – Ideas? SDP?
• Storage over Myrinet.
• Applications support.
   – Myricom is now large enough to support application developers.
