TCP Servers: Offloading TCP/IP
Processing in Internet Servers
Liviu Iftode
Department of Computer Science
University of Maryland
and Rutgers University
My Research: Network-Centric Systems
TCP Servers and Split-OS [NSF CAREER]
Migratory TCP and Service Continuations
Federated File Systems
Smart Messages [NSF ITR-2] and Spatial
Programming for Networks of Embedded Systems
http://discolab.rutgers.edu
Networking and Performance
[Diagram: clients (C) connect over a WAN IP network, using TCP, to Internet servers (S); the servers connect over storage networks (SAN) to storage (D). For the SAN: IP or not IP? TCP or not TCP?]
The transport-layer protocol must be efficient
The Scalability Problem
[Figure: throughput (requests/s, 0 to 700) versus offered load (requests/s, 300 to 750) for the uniprocessor and dual-processor configurations]
Apache web server on 1-way and 2-way 300 MHz Intel Pentium II SMP,
repeatedly accessing a static 16 KB file
Breakdown of CPU Time for Apache
User space: 20%
Other system calls: 9%
Network processing: 71%
The TCP/IP Stack
APPLICATION
SYSTEM CALLS
SEND RECEIVE
copy_from_application_buffers copy_to_application_buffers
TCP_send TCP_receive
IP_send IP_receive
KERNEL
packet_scheduler software_interrupt_handler
setup_DMA hardware_interrupt_handler
packet_out packet_in
Breakdown of CPU Time for Apache
Hardware interrupt processing: 8%
Software interrupt processing: 11%
User space: 20%
IP receive: 0%
IP send: 0%
Other system calls: 9%
TCP receive: 7%
TCP send: 45%
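As a quick consistency check, the detailed breakdown can be summed to recover the 71% network processing share from the coarser breakdown. The sketch below is only illustrative: the dictionary keys are labels chosen here, and treating every component other than user space and other system calls as network processing is an assumption about how the two charts relate.

```python
# Percentages of CPU time, copied from the detailed Apache breakdown.
breakdown = {
    "hardware_interrupt": 8,
    "software_interrupt": 11,
    "ip_receive": 0,
    "ip_send": 0,
    "tcp_receive": 7,
    "tcp_send": 45,
    "other_system_calls": 9,
    "user_space": 20,
}

# Assumption: everything except these two categories is network processing.
NON_NETWORK = {"other_system_calls", "user_space"}


def network_share(parts):
    """Sum the components that count as network processing."""
    return sum(value for key, value in parts.items() if key not in NON_NETWORK)
```

Under that grouping, TCP send alone accounts for well over half of the network processing time.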
Serialized Networking Actions
APPLICATION
SYSTEM CALLS
SEND RECEIVE
copy_from_application_buffers copy_to_application_buffers
[Diagram annotation: Serialized Operations]
TCP_send TCP_receive
IP_send IP_receive
packet_scheduler software_interrupt_handler
setup_DMA hardware_interrupt_handler
packet_out packet_in
TCP/IP Processing is Very Expensive
Protocol processing can take up to 70% of the
CPU cycles
For Apache web server on uniprocessors [Hu 97]
Can lead to Receive Livelock [Mogul 95]
Interrupt handling consumes a significant
amount of time
Soft Timers [Aron 99]
Serialization affects scalability
Outline
Motivation
TCP Offloading using TCP Server
TCP Server for SMP Servers
TCP Server for Cluster-based Servers
Prototype Evaluation
TCP Offloading Approach
Offload network processing from application
hosts to dedicated processors/nodes/I-NICs
Reduce OS intrusion
network interrupt handling
context switches
serializations in the networking stack
cache and TLB pollution
Should adapt to changing load conditions
Software or hardware solution?
The TCP Server Idea
[Diagram: inside the server, the host processor runs the application and the OS, and the TCP server runs TCP/IP and faces the client; the two are linked by fast communication.]
TCP Server Performance Factors
Efficiency of the TCP server implementation
event-based server, no interrupts
Efficiency of communication between host(s)
and TCP server
non-intrusive, low-overhead
API
asynchronous, zero-copy
Adaptiveness to load
TCP Servers for Multiprocessor Systems
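[Diagram: a multiprocessor (SMP) server with CPU 0 through CPU N; the application and host OS run on some processors and the TCP server, which faces the client, runs on others, communicating through shared memory.]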
TCP Servers for Clusters with
Memory-to-Memory Interconnects
[Diagram: a cluster-based server in which the host runs the application and a separate TCP server faces the client; the two are connected by a memory-to-memory interconnect.]
TCP Servers for Multiprocessor
Servers
SMP-based Implementation
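[Diagram: the application and host OS run on the host processors and the TCP server on a dedicated processor. The IO APIC routes disk and other interrupts to the host, and network and clock interrupts to the TCP server.]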
SMP-based Implementation (cont’d)
[Diagram: the application, through the host OS, enqueues a send request on a shared queue; the TCP server dequeues and executes the send request.]
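The enqueue and dequeue-and-execute steps can be sketched as a bounded FIFO shared by the two sides. This is a minimal single-threaded illustration, not the prototype's shared-memory queue: the names `SharedQueue` and `drain_and_execute` are hypothetical, and real shared memory would also need synchronization between processors.

```python
from collections import deque


class SharedQueue:
    """Bounded FIFO standing in for the shared-memory request queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()

    def enqueue(self, request):
        """Host side: post a send request; returns False if the queue is full."""
        if len(self._items) >= self.capacity:
            return False
        self._items.append(request)
        return True

    def dequeue(self):
        """TCP server side: take the oldest request, or None if empty."""
        return self._items.popleft() if self._items else None


def drain_and_execute(queue, send_fn):
    """TCP server side: dequeue and execute every pending send request."""
    executed = 0
    while (request := queue.dequeue()) is not None:
        send_fn(request)
        executed += 1
    return executed
```

A full queue is reported to the caller rather than blocking, so the host decides whether to retry or back off.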
TCP Server Event-Driven Architecture
[Diagram: the Dispatcher sits above four modules: Send Handler, Receive Handler, Asynchronous Event Handler and Monitor. Below them are the shared queue (from and to the application processors) and the NIC.]
Dispatcher
Kernel thread executing at the highest priority
level in the kernel
Schedules different handlers based on
input from the monitor
Executes an infinite loop and does not yield
the processor
No other activity can execute on the TCP Server
processor
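The dispatcher loop described above can be sketched as follows. This is a hedged illustration only: `dispatch_once` and `run_dispatcher` are hypothetical names, the monitor is modelled as a function returning an ordered list of handler names, and the loop is bounded by `iterations` purely so that the sketch terminates (the dispatcher on the slide loops forever without yielding).

```python
def dispatch_once(handlers, hints):
    """Run one dispatcher iteration.

    handlers: dict mapping a handler name to a zero-argument callable.
    hints: handler names in the order suggested by the monitor.
    Returns the names of the handlers that were run.
    """
    ran = []
    for name in hints:
        handler = handlers.get(name)
        if handler is not None:  # ignore hints for unknown handlers
            handler()
            ran.append(name)
    return ran


def run_dispatcher(handlers, monitor, iterations):
    """Loop without yielding; bounded here only so the sketch terminates."""
    log = []
    for _ in range(iterations):
        log.append(dispatch_once(handlers, monitor()))
    return log
```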
Asynchronous Event Handler (AEH)
Handles asynchronous network events
Interacts with the NIC
Can be an Interrupt Service Routine or a
Polling Routine
Is a short running thread
Has the highest priority among TCP server
modules
The clock interrupt is used as a guaranteed
trigger for the AEH when polling
Send and Receive Handlers
Scheduled in response to a request in the
Shared Memory queues
Run at the priority of the network protocol
Interact with the Host processors
Monitor
Observes the state of the system queues and
provides hints to the Dispatcher to schedule
Used for book-keeping and dynamic load
balancing
Scheduled periodically or when an exception
occurs
Queue overflow or empty
Bad checksum for a network packet
Retransmissions on a connection
Can be used to reconfigure the set of TCP
servers in response to load variation
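A minimal sketch of the monitor's two roles, detecting the exceptional conditions listed above and producing scheduling hints, might look like this. The `Snapshot` fields and the hint policy (serve the longer queue first) are assumptions made for illustration; the slides do not specify either.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical view of the system state sampled by the monitor."""
    send_queue_len: int
    recv_queue_len: int
    queue_capacity: int
    bad_checksums: int = 0
    retransmissions: int = 0


def exceptions(s):
    """Return the exceptional conditions that hold in snapshot s."""
    found = []
    for name, length in (("send", s.send_queue_len), ("recv", s.recv_queue_len)):
        if length >= s.queue_capacity:
            found.append(f"{name}_queue_overflow")
        elif length == 0:
            found.append(f"{name}_queue_empty")
    if s.bad_checksums > 0:
        found.append("bad_checksum")
    if s.retransmissions > 0:
        found.append("retransmission")
    return found


def hints(s):
    """Assumed policy: suggest the handler with the longer queue first."""
    order = sorted(
        [("send", s.send_queue_len), ("receive", s.recv_queue_len)],
        key=lambda pair: -pair[1],
    )
    return [name for name, length in order if length > 0]
```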
TCP Servers for Cluster-based
Servers
Cluster-based Implementation
[Diagram: on the host, the application calls a socket stub, which tunnels the socket request over VI channels; the TCP server dequeues and executes the socket request.]
TCP Server Architecture
[Diagram: components of the TCP server: Resource Manager, Eager Processor, VI Connection Handler, Request Handler, Socket Call Processor and TCP/IP Provider. The SAN side connects to the host; the NIC side connects to the WAN.]
Sockets and VI Channels
Pool of VIs created at initialization
Avoid cost of creating VIs in the critical path
Registered memory regions associated with
each VI
Send and receive buffers associated with socket
Also used to exchange control data
Socket mapped to a VI on the first socket
operation
All subsequent operations on the socket tunneled
through the same VI to the TCP server
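The pooling and first-use mapping described above can be sketched as below. This is an illustration under stated assumptions, not the prototype's code: `VIPool` is a hypothetical name, plain dictionaries with `bytearray` buffers stand in for VIs and their registered memory regions, and the `release` step on socket close is an assumption that the slide does not mention.

```python
class VIPool:
    """Pool of pre-created channels standing in for VIs with registered buffers."""

    def __init__(self, size, buffer_bytes):
        # Created once at initialization, off the critical path.
        self._free = [
            {
                "id": i,
                "send_buf": bytearray(buffer_bytes),
                "recv_buf": bytearray(buffer_bytes),
            }
            for i in range(size)
        ]
        self._by_socket = {}

    def channel_for(self, socket_id):
        """Map the socket to a VI on its first operation; reuse it afterwards."""
        vi = self._by_socket.get(socket_id)
        if vi is None:
            if not self._free:
                raise RuntimeError("VI pool exhausted")
            vi = self._free.pop()
            self._by_socket[socket_id] = vi
        return vi

    def release(self, socket_id):
        """Assumed step: return the socket's VI to the pool on close."""
        vi = self._by_socket.pop(socket_id, None)
        if vi is not None:
            self._free.append(vi)
```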
Socket Call Processing
Host library intercepts socket call
Socket call parameters are tunneled to the TCP
server over a VI channel
TCP server performs socket operation and
returns results to the host
Library returns control to the application
immediately or when the socket call completes
(asynchronous vs synchronous processing).
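The difference between synchronous and asynchronous socket call processing can be sketched with a worker thread playing the role of the TCP server. Everything here is hypothetical scaffolding: `TCPServerStub` and `socket_call` are illustrative names, the only operation modelled is `send`, and a thread pool replaces the VI channel.

```python
from concurrent.futures import ThreadPoolExecutor


class TCPServerStub:
    """Stands in for the remote TCP server; one worker thread plays its role."""

    def __init__(self):
        self._worker = ThreadPoolExecutor(max_workers=1)
        self.sent = []

    def _execute(self, op, args):
        # Only "send" is modelled; it returns the number of bytes accepted.
        if op == "send":
            self.sent.append(args)
            return len(args)
        raise ValueError(f"unsupported socket operation: {op}")

    def tunnel(self, op, args):
        """Ship the call parameters to the server; returns a Future."""
        return self._worker.submit(self._execute, op, args)

    def close(self):
        self._worker.shutdown(wait=True)


def socket_call(server, op, args, asynchronous=False):
    """Host library entry point: intercept the call and tunnel it."""
    future = server.tunnel(op, args)
    # Asynchronous: return immediately; synchronous: wait for completion.
    return future if asynchronous else future.result()
```

In the asynchronous case the application keeps the returned future and collects the result later, which lets it overlap its own work with the socket operation.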
Design Issues for TCP Servers
Splitting of the TCP/IP processing
Where to split?
Asynchronous event handling
Interrupt or polling?
Asynchronous API
Event scheduling and resource allocation
Adaptation to different workloads
Prototypes and Evaluation
SMP-based Prototype
Modified Linux 2.4.9 SMP kernel on Intel
x86 platform to implement TCP server
Most parts of the system are kernel modules,
with small inline changes to the TCP stack,
software interrupt handlers and the task
structures
Instrumented the kernel using on-chip
performance monitoring counters to profile
the system
Evaluation Testbed
Server
4-way 550 MHz Intel Pentium II Xeon system with 1 GB
DRAM and 1 MB on-chip L2 cache
Clients
4-way SMPs
2-way 300 MHz Intel Pentium II system with 512 MB
RAM and 256 KB on-chip L2 cache
NIC: 3Com 996-BT Gigabit Ethernet
Server Application: Apache 1.3.20 web server
Client program: sclients [Banga 97]
Trace driven execution of clients
Trace Characteristics
Logs       Number of files  Average file size  Number of requests  Average reply size
Forth      11931            19.3 KB            400335              8.8 KB
Rutgers    18370            27.3 KB            498646              19.0 KB
Synthetic  128              16.0 KB            50000               16.0 KB
Splitting TCP/IP Processing
[Diagram: the send and receive paths of the TCP/IP stack, split between application processors and dedicated processors, with three offloadable components marked C1, C2 and C3]
APPLICATION
SYSTEM CALLS
SEND RECEIVE
copy_from_application_buffers copy_to_application_buffers
TCP_send TCP_receive
IP_send (C3) IP_receive (C2)
packet_scheduler software_interrupt_handler
setup_DMA interrupt_handler (C1)
packet_out packet_in
Implementations
Offloaded components and optimizations:
C1: interrupt processing
C2: receive bottom
C3: send bottom
S1: avoiding interrupts
Implementations (each name lists what it enables):
SMP_BASE
SMP_C1C2
SMP_C1C2S1
SMP_C1C2C3
SMP_C1C2C3S1
Throughput
[Figure: throughput (requests/s, 0 to 4500) for the Forth, Rutgers and Synthetic traces, comparing Uniprocessor, SMP_base_4processors, and SMP_C1C2, SMP_C1C2C3, SMP_C1C2S1 and SMP_C1C2C3S1, each with 1 and 2 servers]
CPU Utilization for Synthetic Trace
[Figure: CPU utilization (0% to 100%), split into idle, user and system time, for SMP_base and for C1C2, C1C2C3, C1C2S1 and C1C2C3S1 with 1 and 2 servers; each configuration has two entries, suffixed _A and _D]
Throughput Using Synthetic Trace With Dynamic Content
[Figure: throughput (requests/s, 0 to 4000) with 0%, 10% and 20% dynamic content for SMP_base and for SMP_C1C2, SMP_C1C2C3, SMP_C1C2S1 and SMP_C1C2C3S1, each with 1 and 2 servers]
Adapting TCP Servers to Changing
Workloads
Monitor the queues
Identify low and high water marks to change the size
of the processor set
Execute a special handler for exceptional events
Queue length lower than the low water mark
Set a flag which dispatcher checks
Dispatcher sleeps if the flag is set
Reroute the interrupts
Queue length higher than the high water mark
Wake up the dispatcher on the chosen processor
Reroute the interrupts
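The water-mark rule above can be sketched as a pure function that decides the size of the TCP server processor set. The function name, the one-step growth and shrink, and the `min_servers` and `max_servers` bounds are assumptions for illustration; waking or sleeping the dispatcher and rerouting interrupts are only indicated in comments.

```python
def reconfigure(active_servers, queue_len, low_mark, high_mark,
                min_servers=1, max_servers=2):
    """Return the new number of TCP server processors."""
    if queue_len > high_mark and active_servers < max_servers:
        # Wake up the dispatcher on the chosen processor, reroute interrupts.
        return active_servers + 1
    if queue_len < low_mark and active_servers > min_servers:
        # Set the flag so one dispatcher sleeps, reroute interrupts.
        return active_servers - 1
    # Between the water marks (or at a bound): leave the set unchanged.
    return active_servers
```

Keeping a gap between the low and high water marks gives hysteresis, so small fluctuations in queue length do not cause repeated reconfiguration.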
Load behaviour and dynamic
reconfiguration
Throughput with Dynamic
Reconfiguration
[Figure: throughput (requests/s, 0 to 4000) versus percentage of CGI requests (0% to 40%) for SMP_base, SMP_C1C2C3_1server, SMP_C1C2C3_2servers and SMP_C1C2C3_Dyn_Reconf]
Cluster-based Prototype
User-space implementation (bypass host
kernel)
Entire socket operation offloaded to TCP Server
C1, C2 and C3 offloaded by default
Optimizations
Asynchronous processing: AsyncSend
Processing ahead: Eager Receive, Eager Accept
Avoiding data copy at host using pre-registered
buffers
requires different API: MemNet
Implementations
Optimizations:
H1: kernel bypassing
H2: asynchronous processing
H3: avoiding host copies
S2: processing ahead
Implementations (each name lists what it enables):
Cluster_base
Cluster_C1C2C3H1
Cluster_C1C2C3H1H3
Cluster_C1C2C3H1H2H3
Cluster_C1C2C3H1H2H3S2
Evaluation Testbed
Server
Host and TCP Server: 2-way 300 MHz Intel Pentium II
system with 512 MB RAM and 256 KB on-chip L2 cache
Clients
4-way 550 MHz Intel Pentium II Xeon system with 1 GB
DRAM and 1 MB on-chip L2 cache
NIC: 3Com 996-BT Gigabit Ethernet
Server application: Custom web server
Flexibility in modifying application to use our API
Client program: httperf
Throughput with Synthetic Trace
Using HTTP/1.0
[Figure: throughput (replies/sec, 0 to 900) versus offered load (requests/sec, 400 to 1000) for Cluster_base, Cluster_C1C2C3H1, Cluster_C1C2C3H1H3, Cluster_C1C2C3H1H2H3 and Cluster_C1C2C3H1H2H3S2]
CPU Utilization
[Figure: CPU utilization (0% to 100%), split into idle, user and system time, against offered load (reqs/sec), for Cluster_base and for the host and TCP server (TCPS) of C1C2C3H1, C1C2C3H1H3 and C1C2C3H1H2H3]
Throughput with Synthetic Trace
Using HTTP/1.1
[Figure: throughput (replies/sec, 0 to 1400) versus offered load (requests/sec, 800 to 1300) for Cluster_base, Cluster_C1C2C3H1H3, Cluster_C1C2C3H1H2H3 and Cluster_C1C2C3H1H2H3S2]
Throughput with Real Trace (Forth)
Using HTTP/1.0
[Figure: throughput (replies/sec, 0 to 1200) versus offered load (requests/sec, 800 to 1300) for Cluster_base, Cluster_C1C2C3H1H3, Cluster_C1C2C3H1H2H3 and Cluster_C1C2C3H1H2H3S2]
Related Work
TCP Offloading Engines
Communication Services Platform (CSP)
System architecture for scalable cluster-based
servers, using a VIA-based SAN to tunnel TCP/IP
packets inside the cluster
Piglet - A vertical OS for multiprocessors
Queue Pair IP - A new endpoint mechanism
for inter-network processing inspired by
memory-to-memory communication
Conclusions
Offloading networking functionality to a set of
dedicated TCP servers yields up to 30%
performance improvement
Performance Essentials:
TCP Server architecture
event driven
polling instead of interrupts
adaptive to load
API
asynchronous, zero-copy
Future Work
TCP Server software distributions
Compare TCP Server architecture with
hardware-based offloading schemes
Use TCP Servers in Storage Networking
Acknowledgements
My graduate students:
Murali Rangarajan, Aniruddha Bohra and
Kalpana Banerjee
http://discolab.rutgers.edu