
                     Reliable Datagram Sockets
                           (RDS)

                           Ranjit Pandit
                     SilverStorm Technologies
                     rpandit@silverstorm.com




Sonoma Feb 6, 2006
              Agenda

•   Goals
•   High Level Design
•   Current status
•   Preliminary performance data
•   Future work




              Goals

• Provide reliable datagram service
   – performance
   – scalability
   – high availability
   – simplify application code

• Maintain sockets API
  – application code portability
  – faster time-to-market

                      Keep It Simple !!!
              Stack Overview

User       Socket Applications    Oracle 10g    UDP Applications

Kernel     TCP     UDP     SDP     RDS
               IP
              IPoIB
           OpenIB Access Layer

           Host Channel Adapter


              High Level Design

• RDS registers with the kernel as the driver for address
  family PF_INET_OFFLOAD and type SOCK_DGRAM

• Application creates an RDS socket with socket(2)
   – arg 1 = PF = PF_INET_OFFLOAD
   – arg 2 = Type = SOCK_DGRAM

• Sockets API calls supported
   – socket, bind, ioctl, sendmsg, recvmsg,
     poll, getsockopt/setsockopt
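
The socket-creation step can be sketched as follows. This is a minimal sketch, not the RDS code: `open_dgram_socket` is a hypothetical helper, and it assumes the constant PF_INET_OFFLOAD is supplied by the RDS driver's headers (its numeric value is not given in these slides, so the family is passed in as a parameter).

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>   /* close(), for callers */

/* Hypothetical helper: open a datagram socket in the given protocol family.
 * For RDS as described here, the caller would pass PF_INET_OFFLOAD, which is
 * ASSUMED to come from the RDS driver's headers.
 * Returns the descriptor, or -1 with errno set by socket(2). */
int open_dgram_socket(int family)
{
    int fd = socket(family, SOCK_DGRAM, 0);
    if (fd < 0)
        fprintf(stderr, "socket(family=%d): %s\n", family, strerror(errno));
    return fd;
}
```

A caller would write `int fd = open_dgram_socket(PF_INET_OFFLOAD);` and then use bind, sendmsg and recvmsg on `fd` as with a UDP socket. If the driver is not registered with the kernel, socket(2) is expected to fail, so the return value should always be checked.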


              Connection model

•    Applications are connectionless

•    RDS maintains node-to-node connections
•    IP addressing
•    Uses CMA
•    On-demand connection setup
    – connect on first sendmsg() or data recv
    – disconnect on error or policy like inactivity

•    Connection setup/teardown is transparent to applications
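
The on-demand connection model can be illustrated with a small in-memory table. This is a sketch under stated assumptions: the structure and function names (`node_conn`, `conn_ensure`, `conn_teardown`) are hypothetical, the table size is arbitrary, and the actual connection setup through CMA is replaced by a flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 8   /* illustrative table size, not from the slides */

/* Hypothetical per-node connection record, keyed by the peer's IP address. */
struct node_conn {
    uint32_t peer_ip;    /* peer node's IPv4 address */
    bool     in_use;     /* slot holds a record */
    bool     connected;  /* node-to-node connection is up */
    unsigned setups;     /* number of connection setups performed */
};

struct conn_table {
    struct node_conn nodes[MAX_NODES];
};

/* Find the record for peer_ip. If absent and create is true, claim a free
 * slot for it. Returns NULL if not found (or the table is full). */
struct node_conn *conn_lookup(struct conn_table *t, uint32_t peer_ip, bool create)
{
    struct node_conn *free_slot = NULL;
    for (size_t i = 0; i < MAX_NODES; i++) {
        struct node_conn *c = &t->nodes[i];
        if (c->in_use && c->peer_ip == peer_ip)
            return c;
        if (!c->in_use && free_slot == NULL)
            free_slot = c;
    }
    if (create && free_slot != NULL) {
        free_slot->peer_ip = peer_ip;
        free_slot->in_use = true;
        free_slot->connected = false;
        free_slot->setups = 0;
        return free_slot;
    }
    return NULL;
}

/* Called on every send: set up a connection only if none exists yet. */
bool conn_ensure(struct conn_table *t, uint32_t peer_ip)
{
    struct node_conn *c = conn_lookup(t, peer_ip, true);
    if (c == NULL)
        return false;
    if (!c->connected) {
        /* A real transport would establish the connection here. */
        c->connected = true;
        c->setups++;
    }
    return true;
}

/* Called on error or by an inactivity policy: drop the connection. */
void conn_teardown(struct conn_table *t, uint32_t peer_ip)
{
    struct node_conn *c = conn_lookup(t, peer_ip, false);
    if (c != NULL)
        c->connected = false;
}
```

The application never calls these functions itself. In this model they would run inside the transport on the send path, which is what keeps setup and teardown transparent.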
              Data and Control Channel

•   Uses RC QPs for node-level connections
•   Data and control QPs per session
•   Selectable MTU
•   Buffer-copy (b-copy) send/recv
•   Hardware flow control




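[Figure: Node 1 and Node 2, each split into User and Kernel.
 User, Node 1: processes P1, P2, … Pn with sockets s1, s2, … sn calling sendmsg(node2).
 User, Node 2: process P1 with socket S1 calling recvmsg().
 Kernel, both nodes: Rds on top of an RC QP; the two RC QPs connect Node 1 and Node 2.]
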

              Send

• Connection established on first send

• sendmsg()
    – allows send pipelining

• ENOBUFS is returned if send buffers are insufficient;
  the application retries
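
The retry behavior on the application side can be sketched as below. `send_with_retry` is a hypothetical helper; the retry limit and the 1 ms back-off are illustrative choices, not taken from the slides.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <time.h>

/* Hypothetical helper: call sendmsg() and retry while it fails with ENOBUFS.
 * Gives up after max_retries retries and returns the last result, leaving
 * errno as set by sendmsg(). Any other error is returned immediately. */
ssize_t send_with_retry(int fd, const struct msghdr *msg, int max_retries)
{
    const struct timespec backoff = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */

    for (int attempt = 0; ; attempt++) {
        ssize_t n = sendmsg(fd, msg, 0);
        if (n >= 0 || errno != ENOBUFS || attempt >= max_retries)
            return n;
        nanosleep(&backoff, NULL);  /* give send buffers time to drain */
    }
}
```

Because sendmsg() allows pipelining, an application may have several sends outstanding; a bounded retry like this keeps a temporary buffer shortage from turning into a hard failure.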




              Receive

• Identical to UDP recvmsg()
    – similar blocking/non-blocking behavior


• “Slow” receiver ports are stalled at the sender side
   – a combination of activity (LRU) and memory utilization
     is used to detect slow receivers
   – sendmsg() to a stalled destination port returns
     EWOULDBLOCK; the application can retry
         • a blocking socket can wait for unblock
   – recvmsg() on a stalled port un-stalls it
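
Since receive is described as identical to UDP recvmsg(), a receiver can be sketched with the standard call. `recv_datagram` is a hypothetical helper name; the non-blocking variant uses the standard MSG_DONTWAIT flag, and nothing here is specific to RDS.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: receive one datagram with recvmsg(), as with UDP.
 * If src is non-NULL it is filled with the sender's address.
 * With nonblocking set, returns -1 with errno EAGAIN/EWOULDBLOCK when no
 * datagram is queued, instead of waiting for one. */
ssize_t recv_datagram(int fd, void *buf, size_t len,
                      struct sockaddr_storage *src, bool nonblocking)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    if (src != NULL) {
        msg.msg_name = src;
        msg.msg_namelen = sizeof *src;
    }
    return recvmsg(fd, &msg, nonblocking ? MSG_DONTWAIT : 0);
}
```

A receiver that calls this regularly keeps draining its port, which per the bullets above is also what un-stalls a port that the sender side has marked as slow.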

              High Availability (failover)

• Use of RC and on-demand connection setup allows HA
   – connection setup/teardown transparent to
     applications
   – every sendmsg() could “potentially” result in a
     connection setup
   – if a path fails, connection is torn down, next send can
     connect on an alternate path (different port or
     different HCA)
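
The failover behavior can be modeled as picking the next usable path at send time. This is a sketch only: the `path` and `peer` structures and both functions are hypothetical, and real connection setup is reduced to recording which path is active.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one path to a peer: a port on an HCA. */
struct path {
    int  hca;     /* HCA index */
    int  port;    /* port number on that HCA */
    bool usable;  /* cleared once the path has been seen to fail */
};

struct peer {
    struct path *paths;   /* candidate paths to this peer */
    size_t       npaths;
    int          active;  /* index of the connected path, or -1 if none */
};

/* Called at send time: reuse the active path, or connect on the first
 * usable one. Returns the path index used, or -1 if no path is usable. */
int peer_ensure_path(struct peer *p)
{
    if (p->active >= 0)
        return p->active;
    for (size_t i = 0; i < p->npaths; i++) {
        if (p->paths[i].usable) {
            p->active = (int)i;  /* a real transport would connect here */
            return p->active;
        }
    }
    return -1;
}

/* Called when the active path fails: tear down the connection so that
 * the next send reconnects on an alternate path. */
void peer_path_failed(struct peer *p)
{
    if (p->active >= 0) {
        p->paths[p->active].usable = false;
        p->active = -1;
    }
}
```

In this model a failed path stays marked unusable; how and when a path is tried again is a policy decision the slides do not describe.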




              Preliminary performance RDS on
              OpenIB

[Chart: netperf (UDP_STREAM). Y axis: Mbits/sec, 0 to 4000. X axis: msg size (bytes), 2k to 64K.
 Series: UDP GbE; UDP ipoib send; UDP ipoib recv; Rds (send = recv).]

*Dual 2.4GHz Xeon, 2G memory, 4x PCI-X HCA
**Sdp ~3700Mb/sec TCP_STREAM

              Preliminary performance RDS on
              OpenIB

[Chart: netperf (UDP_STREAM). Y axis: Mbits/sec, 0 to 4000. X axis: msg size (bytes), 2k to 64K.
 Series: UDP GbE; UDP ipoib recv; Rds (send = recv).]

*Dual 2.4GHz Xeon, 2G memory, 4x PCI-X HCA
**Sdp ~3700Mb/sec TCP_STREAM

              Preliminary performance RDS on
              OpenIB

[Chart: Latency. Y axis: usec, 0 to 500. X axis: Msg size (bytes), 4 to 32768 in powers of two.
 Series: UDP GigE; UDP ipoib; Rds.]

              Status in OpenIB

• Z-copy
• Functionally 98% complete

• Running Netperf
• Running Oracle unit test (crload) stable today
• Code checked into contrib/silverstorm/
   https://openib.org/svn/trunk/contrib/silverstorm/rds/




              Future

• AIO
• Z-copy
• Shared recv queue




              Q&A





								