TCP Servers: Offloading TCP/IP Processing in Internet Servers



Liviu Iftode

Department of Computer Science

University of Maryland

and Rutgers University

My Research: Network-Centric Systems

• TCP Servers and Split-OS [NSF CAREER]
• Migratory TCP and Service Continuations
• Federated File Systems
• Smart Messages [NSF ITR-2] and Spatial Programming for Networks of Embedded Systems
• http://discolab.rutgers.edu

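Networking and Performance

[Diagram: nodes labeled C connect through a WAN IP network (TCP) to Internet servers (S); the servers connect through a storage network (SAN) to nodes labeled D. Questions posed for the SAN: IP or not IP? TCP or not TCP?]

• The transport-layer protocol must be efficient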

The Scalability Problem

[Figure: throughput (requests/s, 0 to 700) versus offered load (requests/s, 300 to 750) for the Uniprocessor and Dual Processor configurations.]

Apache web server on 1-way and 2-way 300 MHz Intel Pentium II SMP, repeatedly accessing a static 16 KB file

Breakdown of CPU Time for Apache

User space              20%
Other system calls       9%
Network processing      71%

The TCP/IP Stack

APPLICATION
SYSTEM CALLS
KERNEL

SEND                             RECEIVE
copy_from_application_buffers    copy_to_application_buffers
TCP_send                         TCP_receive
IP_send                          IP_receive
packet_scheduler                 software_interrupt_handler
setup_DMA                        hardware_interrupt_handler
packet_out                       packet_in

Breakdown of CPU Time for Apache

Hardware interrupt processing    8%
Software interrupt processing   11%
User space                      20%
IP Receive                       0%
IP Send                          0%
Other system calls               9%
TCP Receive                      7%
TCP Send                        45%
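
As a quick arithmetic check, the kernel networking components in this breakdown can be summed and compared with the 71% "Network Processing" share on the coarser chart. The sketch below only transcribes the slide's percentages; the dictionary keys and variable names are illustrative, not taken from the talk.

```python
# Percentages transcribed from the slide; key names are illustrative only.
network_components = {
    "hardware_interrupt": 8,
    "software_interrupt": 11,
    "ip_receive": 0,
    "ip_send": 0,
    "tcp_receive": 7,
    "tcp_send": 45,
}
other = {"user_space": 20, "other_system_calls": 9}

network_total = sum(network_components.values())
grand_total = network_total + sum(other.values())

print(f"network processing: {network_total}%")  # 71%
print(f"all components:     {grand_total}%")    # 100%
```

The two charts agree: the six networking components add up to the 71% network share, and TCP Send alone accounts for 45 of those 71 points.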

Serialized Networking Actions

APPLICATION
SYSTEM CALLS

SEND                             RECEIVE
copy_from_application_buffers    copy_to_application_buffers
TCP_send                         TCP_receive
IP_send                          IP_receive
packet_scheduler                 software_interrupt_handler
setup_DMA                        hardware_interrupt_handler
packet_out                       packet_in

[Diagram annotation: "Serialized Operations", placed next to the TCP and IP rows.]

TCP/IP Processing is Very Expensive

• Protocol processing can take up to 70% of the CPU cycles
  - For Apache web server on uniprocessors [Hu 97]
  - Can lead to Receive Livelock [Mogul 95]
• Interrupt handling consumes a significant amount of time
  - Soft Timers [Aron 99]
• Serialization affects scalability

Outline

• Motivation
• TCP Offloading using TCP Server
• TCP Server for SMP Servers
• TCP Server for Cluster-based Servers
• Prototype Evaluation

TCP Offloading Approach

• Offload network processing from application hosts to dedicated processors/nodes/I-NICs
• Reduce OS intrusion
  - network interrupt handling
  - context switches
  - serializations in the networking stack
  - cache and TLB pollution
• Should adapt to changing load conditions
• Software or hardware solution?

The TCP Server Idea

[Diagram: inside the SERVER, a Host Processor (Application and OS) and a TCP Server (TCP/IP) are linked by FAST COMMUNICATION; the CLIENT sits outside the server.]

TCP Server Performance Factors

• Efficiency of the TCP server implementation
  - event-based server, no interrupts
• Efficiency of communication between host(s) and TCP server
  - non-intrusive, low-overhead
• API
  - asynchronous, zero-copy
• Adaptiveness to load

TCP Servers for Multiprocessor Systems

[Diagram: a Multiprocessor (SMP) Server with CPU 0 through CPU N. The Application and Host OS share SHARED MEMORY with the TCP Server; the CLIENT sits outside the server.]

TCP Servers for Clusters with Memory-to-Memory Interconnects

[Diagram: a Cluster-based Server in which the Host (Application) and the TCP Server are joined by a MEMORY-to-MEMORY INTERCONNECT; the CLIENT sits outside the cluster.]

TCP Servers for Multiprocessor Servers

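SMP-based Implementation

[Diagram: Application and Host OS on one side, TCP Server on the other. The IO APIC splits interrupts into "Disk & Other Interrupts" (shown under the Host OS) and "Network and Clock Interrupts" (shown under the TCP Server).]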

SMP-based Implementation (cont'd)

[Diagram: the Application / Host OS side performs ENQUEUE SEND REQUEST into a SHARED QUEUE; the TCP Server side performs DEQUEUE AND EXECUTE SEND REQUEST.]
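
The enqueue/dequeue hand-off between the host and the TCP server can be sketched as a bounded ring buffer with one producer and one consumer. This is a single-threaded model written under assumptions: the `SendRequest` fields, the `SharedQueue` class, and the `drain` helper are hypothetical names, and a real shared-memory queue between processors would also need memory barriers, which are omitted here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SendRequest:
    """Hypothetical request record; the field names are assumptions."""
    socket_id: int
    payload: bytes


class SharedQueue:
    """Bounded ring buffer: one producer (host), one consumer (TCP server)."""

    def __init__(self, capacity: int) -> None:
        self._slots: List[Optional[SendRequest]] = [None] * capacity
        self._head = 0  # next slot to dequeue; advanced only by the consumer
        self._tail = 0  # next slot to fill; advanced only by the producer
        self._capacity = capacity

    def enqueue(self, req: SendRequest) -> bool:
        """Host side: returns False instead of blocking when the queue is full."""
        if self._tail - self._head == self._capacity:
            return False
        self._slots[self._tail % self._capacity] = req
        self._tail += 1
        return True

    def dequeue(self) -> Optional[SendRequest]:
        """TCP server side: returns None when no request is pending."""
        if self._head == self._tail:
            return None
        req = self._slots[self._head % self._capacity]
        self._head += 1
        return req


def drain(queue: SharedQueue, execute: Callable[[SendRequest], None]) -> int:
    """Dequeue and execute every pending request; returns how many ran."""
    done = 0
    while (req := queue.dequeue()) is not None:
        execute(req)
        done += 1
    return done


sent: List[int] = []
q = SharedQueue(capacity=2)
q.enqueue(SendRequest(1, b"GET /"))
q.enqueue(SendRequest(2, b"GET /a"))
overflow = q.enqueue(SendRequest(3, b"GET /b"))  # rejected: queue is full
executed = drain(q, lambda r: sent.append(r.socket_id))
```

Because each index is written by only one side, the host never waits on the TCP server to enqueue; a full queue is reported to the caller rather than blocking.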

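TCP Server Event-Driven Architecture

[Diagram: the Dispatcher sits above four modules: Send Handler, Receive Handler, Asynchronous Event Handler, and Monitor. Below them are the Shared Queue and the NIC, with arrows labeled "From Application Processors" and "To Application Processors".]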

Dispatcher

• Kernel thread executing at the highest priority level in the kernel
• Schedules different handlers based on input from the monitor
• Executes an infinite loop and does not yield the processor
  - No other activity can execute on the TCP Server processor
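
A minimal sketch of such a dispatcher loop follows, assuming the monitor's input arrives as a list of handler names that have work pending. The priority numbers are assumptions, except that the asynchronous event handler is described in the talk as having the highest priority among the TCP server modules. The loop is bounded by an `iterations` parameter so the example terminates; the dispatcher described here loops forever and never yields.

```python
from typing import Callable, Dict, List

# Lower number runs first. Only "aeh runs first" comes from the slides;
# the remaining ordering is an assumption made for illustration.
HANDLER_PRIORITY: Dict[str, int] = {
    "aeh": 0,
    "receive": 1,
    "send": 1,
    "monitor": 2,
}


def dispatch_once(ready: List[str],
                  handlers: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every handler flagged as ready, highest priority first."""
    order = sorted(ready, key=lambda name: HANDLER_PRIORITY[name])
    for name in order:
        handlers[name]()
    return order


def dispatcher(monitor_hints: Callable[[], List[str]],
               handlers: Dict[str, Callable[[], None]],
               iterations: int) -> None:
    """Bounded stand-in for the dispatcher's infinite loop."""
    for _ in range(iterations):
        dispatch_once(monitor_hints(), handlers)


log: List[str] = []
handlers = {name: (lambda n=name: log.append(n)) for name in HANDLER_PRIORITY}
hints = iter([["send", "aeh"], [], ["receive", "send", "aeh"]])
dispatcher(lambda: next(hints), handlers, iterations=3)
```

In each round the asynchronous event handler runs before the send and receive handlers, regardless of the order in which the hints list them.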

Asynchronous Event Handler (AEH)

• Handles asynchronous network events
• Interacts with the NIC
• Can be an Interrupt Service Routine or a Polling Routine
• Is a short-running thread
• Has the highest priority among TCP server modules
• The clock interrupt is used as a guaranteed trigger for the AEH when polling

Send and Receive Handlers

• Scheduled in response to a request in the Shared Memory queues
• Run at the priority of the network protocol
• Interact with the Host processors

Monitor

• Observes the state of the system queues and provides scheduling hints to the Dispatcher
• Used for book-keeping and dynamic load balancing
• Scheduled periodically or when an exception occurs
  - Queue overflow or empty
  - Bad checksum for a network packet
  - Retransmissions on a connection
• Can be used to reconfigure the set of TCP servers in response to load variation

TCP Servers for Cluster-based Servers

Cluster-based Implementation

[Diagram: on the Host, the Application sits above a Socket Stub, which performs TUNNEL SOCKET REQUEST over VI Channels; the TCP Server performs DEQUEUE AND EXECUTE SOCKET REQUEST.]

TCP Server Architecture

[Diagram: components of the TCP server: Eager Processor, Resource Manager, VI Connection Handler, Request Handler, Socket Call Processor, and TCP/IP Provider. One side faces the SAN (to host), the other the NIC (WAN).]

Sockets and VI Channels

• Pool of VIs created at initialization
  - Avoids the cost of creating VIs in the critical path
• Registered memory regions associated with each VI
  - Send and receive buffers associated with socket
  - Also used to exchange control data
• Socket mapped to a VI on the first socket operation
  - All subsequent operations on the socket tunneled through the same VI to the TCP server
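
The pooling and first-use mapping described above can be sketched as follows. `VIChannel` and `VIPool` are hypothetical names, the `bytearray` buffers only model pre-registered memory regions, and the `release` method is an assumption, since the slides do not say when a VI returns to the pool.

```python
from typing import Dict, List


class VIChannel:
    """Stand-in for a VI endpoint and its buffers (buffer size is an assumption)."""

    def __init__(self, vi_id: int, buf_size: int = 4096) -> None:
        self.vi_id = vi_id
        self.send_buf = bytearray(buf_size)  # models a pre-registered send region
        self.recv_buf = bytearray(buf_size)  # models a pre-registered receive region


class VIPool:
    def __init__(self, size: int) -> None:
        # All channels are created up front, outside the critical path.
        self._free: List[VIChannel] = [VIChannel(i) for i in range(size)]
        self._by_socket: Dict[int, VIChannel] = {}

    def channel_for(self, socket_fd: int) -> VIChannel:
        """Bind the socket to a VI on its first operation; reuse it afterwards."""
        vi = self._by_socket.get(socket_fd)
        if vi is None:
            if not self._free:
                raise RuntimeError("VI pool exhausted")
            vi = self._free.pop()
            self._by_socket[socket_fd] = vi
        return vi

    def release(self, socket_fd: int) -> None:
        """Assumed behaviour: return the socket's VI to the pool on close."""
        vi = self._by_socket.pop(socket_fd, None)
        if vi is not None:
            self._free.append(vi)


pool = VIPool(size=2)
first = pool.channel_for(socket_fd=7)
again = pool.channel_for(socket_fd=7)   # same socket, same VI
other = pool.channel_for(socket_fd=8)   # different socket, different VI
```

Creating the channels in the constructor keeps VI setup cost out of the per-request path; the lookup on each socket operation is a single dictionary access.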

Socket Call Processing

• Host library intercepts socket call
• Socket call parameters are tunneled to the TCP server over a VI channel
• TCP server performs socket operation and returns results to the host
• Library returns control to the application immediately or when the socket call completes (asynchronous vs. synchronous processing)
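
The difference between the synchronous and asynchronous return paths can be illustrated with a small stub. This is a sketch under assumptions, not the prototype's library: a one-worker thread pool from Python's standard `concurrent.futures` module stands in for the TCP server node, and `tunnel`, `send_sync`, `send_async`, and the fake `send` operation are hypothetical names.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Tuple

# Stand-in for the TCP server node: one worker that executes tunneled calls.
_tcp_server = ThreadPoolExecutor(max_workers=1)

# Hypothetical socket operations the stand-in server knows how to run.
_OPERATIONS: Dict[str, Callable[..., Any]] = {
    "send": lambda fd, data: len(data),  # pretend every byte was sent
}


def _execute_on_server(op: str, args: Tuple[Any, ...]) -> Any:
    return _OPERATIONS[op](*args)


def tunnel(op: str, *args: Any) -> Future:
    """Ship the call name and parameters to the server without waiting."""
    return _tcp_server.submit(_execute_on_server, op, args)


def send_sync(fd: int, data: bytes) -> int:
    """Synchronous variant: returns only once the server has produced a result."""
    return tunnel("send", fd, data).result()


def send_async(fd: int, data: bytes) -> Future:
    """Asynchronous variant: returns at once; the caller collects the result later."""
    return tunnel("send", fd, data)


sync_result = send_sync(3, b"hello")
pending = send_async(3, b"hello, world")
async_result = pending.result()
_tcp_server.shutdown()
```

Both variants tunnel the same parameters; they differ only in whether the library waits on the result before handing control back to the application.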

Design Issues for TCP Servers

• Splitting of the TCP/IP processing
  - Where to split?
• Asynchronous event handling
  - Interrupt or polling?
• Asynchronous API
• Event scheduling and resource allocation
• Adaptation to different workloads

Prototypes and Evaluation

SMP-based Prototype

• Modified Linux 2.4.9 SMP kernel on Intel x86 platform to implement TCP server
• Most parts of the system are kernel modules, with small inline changes to the TCP stack, software interrupt handlers and the task structures
• Instrumented the kernel using on-chip performance monitoring counters to profile the system

Evaluation Testbed

• Server
  - 4-way 550 MHz Intel Pentium II Xeon system with 1 GB DRAM and 1 MB on-chip L2 cache
• Clients
  - 4-way SMPs
  - 2-way 300 MHz Intel Pentium II system with 512 MB RAM and 256 KB on-chip L2 cache
• NIC: 3-Com 996-BT Gigabit Ethernet
• Server application: Apache 1.3.20 web server
• Client program: sclients [Banga 97]
  - Trace-driven execution of clients

Trace Characteristics

Logs        Number of files   Average file size   Number of requests   Average reply size
Forth       11931             19.3 KB             400335               8.8 KB
Rutgers     18370             27.3 KB             498646               19.0 KB
Synthetic   128               16.0 KB             50000                16.0 KB

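Splitting TCP/IP Processing

APPLICATION
SYSTEM CALLS

SEND                             RECEIVE
copy_from_application_buffers    copy_to_application_buffers
TCP_send                         TCP_receive
IP_send                          IP_receive
packet_scheduler                 software_interrupt_handler
setup_DMA                        interrupt_handler
packet_out                       packet_in

[Diagram labels: APPLICATION PROCESSORS beside the upper part of the stack and DEDICATED PROCESSORS beside the lower part. Split points are marked C3 (send side, next to IP_send), C2 (receive side, next to IP_receive), and C1 (next to interrupt_handler).]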

Implementations

Implementation   Interrupt processing (C1)   Receive Bottom (C2)   Send Bottom (C3)   Avoiding Interrupts (S1)
SMP_BASE         -                           -                     -                  -
SMP_C1C2         yes                         yes                   -                  -
SMP_C1C2S1       yes                         yes                   -                  yes
SMP_C1C2C3       yes                         yes                   yes                -
SMP_C1C2C3S1     yes                         yes                   yes                yes

Throughput

[Figure: bar chart of throughput (requests/s, 0 to 4500) for the Forth, Rutgers, and Synthetic traces across the configurations Uniprocessor, SMP_base_4processors, SMP_C1C2_1server, SMP_C1C2_2servers, SMP_C1C2C3_1server, SMP_C1C2C3_2servers, SMP_C1C2S1_1server, SMP_C1C2S1_2servers, SMP_C1C2C3S1_1server, and SMP_C1C2C3S1_2servers.]

CPU Utilization for Synthetic Trace

[Figure: stacked bars of CPU utilization (0% to 100%), split into Idle, User, and System time. Bars are shown for SMP_base and for C1C2, C1C2C3, C1C2S1, and C1C2C3S1 with one server (1ser) and two servers (2ser); each configuration has a pair of bars suffixed _A and _D.]

Throughput Using Synthetic Trace With Dynamic Content

[Figure: bar chart of throughput (requests/s, 0 to 4000) with 0%, 10%, and 20% dynamic content across the configurations SMP_base, SMP_C1C2_1server, SMP_C1C2_2servers, SMP_C1C2C3_1server, SMP_C1C2C3_2servers, SMP_C1C2S1_1server, SMP_C1C2S1_2servers, SMP_C1C2C3S1_1server, and SMP_C1C2C3S1_2servers.]

Adapting TCP Servers to Changing Workloads

• Monitor the queues
  - Identify low and high water marks to change the size of the processor set
  - Execute a special handler for exceptional events
• Queue length lower than the low water mark
  - Set a flag which dispatcher checks
  - Dispatcher sleeps if the flag is set
  - Reroute the interrupts
• Queue length higher than the high water mark
  - Wake up the dispatcher on the chosen processor
  - Reroute the interrupts
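
The water-mark policy can be sketched as a small decision function. The thresholds, the `ProcessorSet` record, and the limit of two TCP server processors are assumptions chosen for illustration; putting a dispatcher to sleep, waking it, and rerouting interrupts are represented only by comments.

```python
from dataclasses import dataclass


@dataclass
class ProcessorSet:
    """Number of processors currently dedicated to TCP servers (limits assumed)."""
    active: int
    minimum: int = 1
    maximum: int = 2


def adapt(pset: ProcessorSet, queue_length: int,
          low_mark: int, high_mark: int) -> str:
    """Grow or shrink the TCP server processor set based on queue length."""
    if queue_length < low_mark and pset.active > pset.minimum:
        pset.active -= 1  # flag set: one dispatcher sleeps, interrupts rerouted
        return "shrink"
    if queue_length > high_mark and pset.active < pset.maximum:
        pset.active += 1  # dispatcher woken on the chosen processor, interrupts rerouted
        return "grow"
    return "hold"


pset = ProcessorSet(active=1)
actions = [adapt(pset, qlen, low_mark=4, high_mark=32)
           for qlen in (10, 40, 50, 20, 2, 1)]
print(actions)
```

Keeping the two marks apart gives the policy hysteresis: queue lengths between the marks leave the processor set unchanged, so a brief fluctuation does not trigger repeated reconfiguration.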

Load behaviour and dynamic reconfiguration

Throughput with Dynamic Reconfiguration

[Figure: throughput (requests/s, 0 to 4000) versus % of CGI requests (0% to 40%) for SMP_base, SMP_C1C2C3_1server, SMP_C1C2C3_2servers, and SMP_C1C2C3_Dyn_Reconf.]

Cluster-based Prototype

• User-space implementation (bypass host kernel)
• Entire socket operation offloaded to TCP Server
  - C1, C2 and C3 offloaded by default
• Optimizations
  - Asynchronous processing: AsyncSend
  - Processing ahead: Eager Receive, Eager Accept
  - Avoiding data copy at host using pre-registered buffers
    - requires different API: MemNet
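
The slides name Eager Accept without describing its mechanics. The sketch below assumes it means that the TCP server accepts incoming connections before the host asks for them, so that the host's accept call can be answered from connections already parked at the server. `EagerAcceptor`, its methods, and the `backlog` limit are hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional


class EagerAcceptor:
    """TCP server side: accept ahead of the host and park the results."""

    def __init__(self, backlog: int) -> None:
        self._ready: Deque[int] = deque()
        self._backlog = backlog  # how far ahead the server may run (assumed limit)

    def on_connection(self, conn_id: int) -> bool:
        """Called when a client connects; accepted eagerly if there is room."""
        if len(self._ready) >= self._backlog:
            return False  # not accepted ahead of time; the caller decides what to do
        self._ready.append(conn_id)
        return True

    def host_accept(self) -> Optional[int]:
        """Host-side accept: served from the parked connections, oldest first."""
        return self._ready.popleft() if self._ready else None


acceptor = EagerAcceptor(backlog=2)
accepted: List[bool] = [acceptor.on_connection(c) for c in (101, 102, 103)]
served = [acceptor.host_accept() for _ in range(3)]
```

Under this assumption the cost of accepting is paid on the TCP server before the host needs the connection, which is the sense in which the processing happens "ahead".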

Implementations

Implementation            Kernel Bypassing (H1)   Asynchronous Processing (H2)   Avoiding Host Copies (H3)   Processing Ahead (S2)
Cluster_base              -                       -                              -                           -
Cluster_C1C2C3H1          yes                     -                              -                           -
Cluster_C1C2C3H1H3        yes                     -                              yes                         -
Cluster_C1C2C3H1H2H3      yes                     yes                            yes                         -
Cluster_C1C2C3H1H2H3S2    yes                     yes                            yes                         yes

Evaluation Testbed

• Server
  - Host and TCP Server: 2-way 300 MHz Intel Pentium II system with 512 MB RAM and 256 KB on-chip L2 cache
• Clients
  - 4-way 550 MHz Intel Pentium II Xeon system with 1 GB DRAM and 1 MB on-chip L2 cache
• NIC: 3-Com 996-BT Gigabit Ethernet
• Server application: Custom web server
  - Flexibility in modifying application to use our API
• Client program: httperf

Throughput with Synthetic Trace Using HTTP/1.0

[Figure: throughput (replies/sec, 0 to 900) versus offered load (requests/sec, 400 to 1000) for Cluster_base, Cluster_C1C2C3H1, Cluster_C1C2C3H1H3, Cluster_C1C2C3H1H2H3, and Cluster_C1C2C3H1H2H3S2.]

CPU Utilization

[Figure: stacked bars of CPU utilization (0% to 100%), split into Idle Time, User Time, and System Time, for Cluster_base and for the Host and TCP server (TCPS) of C1C2C3H1, C1C2C3H1H3, and C1C2C3H1H2H3. The chart also carries the label "Offered Load (Reqs/sec)".]

Throughput with Synthetic Trace Using HTTP/1.1

[Figure: throughput (replies/sec, 0 to 1400) versus offered load (requests/sec, 800 to 1300) for Cluster_base, Cluster_C1C2C3H1H3, Cluster_C1C2C3H1H2H3, and Cluster_C1C2C3H1H2H3S2.]

Throughput with Real Trace (Forth) Using HTTP/1.0

[Figure: throughput (replies/sec, 0 to 1200) versus offered load (requests/sec, 800 to 1300) for Cluster_base, Cluster_C1C2C3H1H3, Cluster_C1C2C3H1H2H3, and Cluster_C1C2C3H1H2H3S2.]

Related Work

• TCP Offloading Engines
• Communication Services Platform (CSP)
  - System architecture for scalable cluster-based servers, using a VIA-based SAN to tunnel TCP/IP packets inside the cluster
• Piglet - A vertical OS for multiprocessors
• Queue Pair IP - A new end point mechanism for inter-network processing inspired by memory-to-memory communication

Conclusions

• Offloading networking functionality to a set of dedicated TCP servers yields up to 30% performance improvement
• Performance Essentials:
  - TCP Server architecture
    - event-driven
    - polling instead of interrupts
    - adaptive to load
  - API
    - asynchronous, zero-copy

Future Work

• TCP Server software distributions
• Compare TCP Server Architecture with hardware-based offloading schemes
• Use TCP Servers in Storage Networking

Acknowledgements

My graduate students: Murali Rangarajan, Aniruddha Bohra and Kalpana Banerjee

http://discolab.rutgers.edu


