TCP Servers: Offloading TCP Processing in Internet Servers.
Design, Implementation, and Performance
M. Rangarajan, A. Bohra, K. Banerjee, E.V. Carrera, R. Bianchini, L. Iftode, W. Zwaenepoel.
Presented by:
Thomas Repantis
CS260-Seminar in Computer Science, Fall 2004 – p.1/35
Overview
To execute TCP/IP processing on a dedicated processor, node, or device (the TCP server), using low-overhead, non-intrusive communication between it and the host(s) running the server application.
Three TCP Server architectures:
1. A dedicated network processor on a symmetric multiprocessor (SMP) server.
2. A dedicated node on a cluster-based server built around a memory-mapped communication interconnect such as VIA.
3. An intelligent network interface in a cluster of intelligent devices with a switch-based I/O interconnect such as Infiniband.
Introduction
• The network subsystem is nowadays one of the major performance bottlenecks in web servers: every outgoing data byte has to go through the same processing path in the protocol stack down to the network device.
• Proposed solution, a TCP Server architecture: decoupling the TCP/IP protocol stack processing from the server host and executing it on a dedicated processor/node.
Introductory Details
• The communication between the server host and the TCP server can dramatically benefit from using low-overhead, non-intrusive, memory-mapped communication.
• The network programming interface provided to the server application must use and tolerate asynchronous socket communication to avoid data copying.
Motivation
• The web server spends only 20% of its execution time in user space.
• Network processing, which includes TCP send/receive, interrupt processing, bottom-half processing, and IP send/receive, takes about 71% of the total execution time.
• The costs include processor cycles devoted to TCP processing, plus cache and TLB pollution (OS intrusion on the application execution).
TCP Server Architecture
• The application host avoids TCP processing by tunneling the socket I/O calls to the TCP server using fast communication channels.
• Shared memory and memory-mapped communication for tunneling.
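The tunneling idea can be sketched as a single-producer/single-consumer request ring in shared memory: the host posts socket calls, the TCP server drains them, and neither side needs a system call or an interrupt. All names here (`sock_req`, `chan_post`, `chan_poll`) are hypothetical; the paper's actual data structures may differ.

```c
#include <stdatomic.h>

/* Hypothetical socket channel: a shared-memory ring carrying tunneled
 * socket calls from the host to the TCP server. */
#define RING_SLOTS 64

typedef struct {
    int  opcode;              /* e.g. SEND, RECV, ACCEPT */
    int  sock_id;             /* which open socket */
    long buf_addr;            /* shared-memory offset of the payload */
    long len;
} sock_req;

typedef struct {
    sock_req slots[RING_SLOTS];
    _Atomic unsigned head;    /* advanced by the producer (host) */
    _Atomic unsigned tail;    /* advanced by the consumer (TCP server) */
} sock_channel;

/* Host side: enqueue a request without trapping into the kernel. */
int chan_post(sock_channel *c, const sock_req *r) {
    unsigned h = atomic_load(&c->head);
    if (h - atomic_load(&c->tail) == RING_SLOTS)
        return -1;                        /* channel full */
    c->slots[h % RING_SLOTS] = *r;
    atomic_store(&c->head, h + 1);        /* publish the request */
    return 0;
}

/* TCP-server side: poll the channel for the next request. */
int chan_poll(sock_channel *c, sock_req *out) {
    unsigned t = atomic_load(&c->tail);
    if (t == atomic_load(&c->head))
        return 0;                         /* nothing pending */
    *out = c->slots[t % RING_SLOTS];
    atomic_store(&c->tail, t + 1);
    return 1;
}
```

Because producer and consumer each write only their own index, the queue needs no locks, which is what makes the channel "non-intrusive" on the application host.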
Advantages
• Kernel Bypassing.
• Asynchronous Socket Calls.
• No Interrupts.
• No Data Copying.
• Process Ahead.
• Direct Communication with File Server.
Kernel Bypassing
• Bypassing the host OS kernel.
• Establishing a socket channel between the application and the TCP server for each open socket.
• The socket channel is created by the host OS kernel during the socket call.
Asynchronous Socket Calls
• Maximum overlap between the TCP processing of the socket call and the application execution.
• Avoid context switches whenever possible.
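The asynchronous interface can be illustrated with a send descriptor the host posts and later checks, never blocking in the kernel. This is an illustrative sketch; the descriptor layout and function names (`post_send`, `send_complete`, `finish_send`) are assumptions, not the paper's API.

```c
#include <stdatomic.h>

/* Hypothetical asynchronous send descriptor shared between the host
 * and the TCP server. */
typedef struct {
    const char *buf;          /* application buffer */
    long        len;
    _Atomic int done;         /* 0 = in flight, 1 = completed */
    long        result;       /* bytes sent, valid once done */
} async_send;

/* Host side: hand off the descriptor and return immediately, so the
 * application overlaps its own work with the TCP processing. */
void post_send(async_send *s, const char *buf, long len) {
    s->buf = buf; s->len = len; s->result = 0;
    atomic_store(&s->done, 0);
    /* at this point the descriptor becomes visible to the TCP server */
}

/* Host side: non-blocking completion check, avoiding a context switch. */
int send_complete(const async_send *s, long *result) {
    if (!atomic_load(&s->done)) return 0;
    *result = s->result;
    return 1;
}

/* TCP-server side: mark the call complete after protocol processing. */
void finish_send(async_send *s, long bytes) {
    s->result = bytes;
    atomic_store(&s->done, 1);
}
```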
No Interrupts
• Since the TCP server exclusively executes TCP processing, interrupts can easily and beneficially be replaced with polling.
• Too high a polling frequency would lead to bus congestion, while too low a frequency would result in an inability to handle all events.
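One common way to navigate that trade-off is to poll with a bounded per-round batch: each round drains at most a fixed budget of events, capping bus traffic per poll while still draining backlogs over successive rounds. A toy sketch, with hypothetical names (`nic_queue`, `poll_round`) and an arbitrary budget:

```c
/* Toy polling loop for the dedicated TCP-server processor: instead of
 * taking an interrupt per packet, poll the NIC event queue and handle
 * at most POLL_BATCH events per round. */
#define POLL_BATCH 8

typedef struct {
    int pending;    /* events waiting in the NIC queue */
    int handled;    /* events processed so far */
} nic_queue;

/* One polling round: drain up to POLL_BATCH pending events. */
int poll_round(nic_queue *q) {
    int n = q->pending < POLL_BATCH ? q->pending : POLL_BATCH;
    q->pending -= n;
    q->handled += n;
    return n;       /* events handled this round */
}
```

With 20 events pending, three rounds of budget 8 handle 8, 8, and 4 events; tuning the budget and the inter-round delay is exactly the frequency trade-off the bullet describes.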
No Data Copying
• With asynchronous system calls, the TCP server can avoid the double copying performed in the traditional in-kernel TCP implementation of the send operation.
• The application must tolerate waiting for completion of the send.
• For retransmission, the TCP server can read the data again from the application send buffer.
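The copy-avoidance logic can be sketched as a send record that keeps a reference to the application buffer instead of copying it, re-reads the unacknowledged tail on retransmission, and releases the buffer for reuse only once everything is acknowledged. Names (`zc_send`, `zc_retransmit`) are illustrative assumptions.

```c
/* Sketch of a copy-avoiding send: the TCP server records a reference
 * to the application's buffer rather than copying the payload into a
 * kernel socket buffer. */
typedef struct {
    const char *app_buf;   /* application buffer, never copied */
    long        len;
    long        acked;     /* bytes acknowledged by the peer */
} zc_send;

void zc_post(zc_send *s, const char *buf, long len) {
    s->app_buf = buf; s->len = len; s->acked = 0;
}

/* Retransmit: re-read the unacknowledged tail of the app buffer. */
const char *zc_retransmit(const zc_send *s, long *len) {
    *len = s->len - s->acked;
    return s->app_buf + s->acked;
}

/* The application may reuse the buffer only after full acknowledgment;
 * this is the "tolerate the wait" requirement from the slide. */
int zc_buffer_free(const zc_send *s) { return s->acked == s->len; }
```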
Process Ahead
• The TCP server can execute certain operations ahead of time, before they are actually requested by the host.
• Specifically, the accept and receive system calls.
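Eager accept, for instance, can be pictured as a pool of connections whose handshakes the TCP server completes before the host ever asks, so a later accept from the application returns immediately. A minimal sketch with hypothetical names (`accept_pool`, `pool_accept`):

```c
/* Eager-accept sketch: the TCP server finishes handshakes as soon as
 * connections arrive and parks them in a pre-accepted pool. */
#define POOL_SIZE 16

typedef struct {
    int conns[POOL_SIZE];   /* pre-accepted connection ids */
    int count;
} accept_pool;

/* TCP-server side: handshake completed before the host requested it. */
int pool_add(accept_pool *p, int conn_id) {
    if (p->count == POOL_SIZE) return -1;   /* pool full */
    p->conns[p->count++] = conn_id;
    return 0;
}

/* Host side: accept just takes a ready connection, with no waiting. */
int pool_accept(accept_pool *p) {
    if (p->count == 0) return -1;           /* fall back to waiting */
    return p->conns[--p->count];
}
```

Eager receive works analogously: data is pulled toward the host before the application issues its receive call.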
Direct Communication with File Server
• In a multi-tier architecture, a TCP server can be instructed to perform direct communication with the file server.
TCP Server in an SMP-based Architecture
• Dedicating a subset of the processors to in-kernel TCP processing.
• Network-generated interrupts are routed to the dedicated processors.
• The communication between the application and the TCP server is through queues in shared memory.
SMP-based Architecture Details
• Offloading interrupts and receive processing.
• Offloading TCP send processing.
TCP Server in a Cluster-based Architecture
• Dedicating a subset of nodes to TCP processing.
• VIA-based SAN interconnect.
Cluster-based Architecture Operation
• The TCP server node acts as the network endpoint for the outside world.
• The network data is transferred between the host node and the TCP server node across the SAN using low-latency memory-mapped communication.
Cluster-based Architecture Details
• The socket call interface is implemented as a user-level communication library.
• With this library, a socket call is tunneled across the SAN to the TCP server.
• Several implementations:
1. Split-TCP (synchronous)
2. AsyncSend
3. Eager Receive
4. Eager Accept
5. Setup With Accept
TCP Server in an Intelligent-NIC-based Architecture
• Cluster of intelligent devices over a switch-based I/O interconnect (Infiniband).
• The devices are considered "intelligent", i.e., each device has a programmable processor and local memory.
Intelligent-NIC-based Architecture Details
• Each open connection is associated with a memory-mapped channel between the host and the I-NIC.
• During a message send, the message is transferred directly from user space to a send buffer at the interface.
• A message receive is first buffered at the network interface and then copied directly to user space at the host.
4-way SMP-based Evaluation
• Dedicating two processors to network processing is always better than dedicating only one.
• Throughput benefits of up to 25-30%.
4-way SMP-based Evaluation
• When only one processor is dedicated to network processing, the network processor becomes a bottleneck and, consequently, the application processor suffers from idle time.
• When we apply two processors to handling the network overhead, there is enough network-processing capacity and the application processor becomes the bottleneck.
• The best system would be one in which the division of labor between the network and application processors is more flexible, allowing for some measure of load balancing.
2-node Cluster-based Evaluation for Static Load
• Asynchronous send operations outperform their synchronous counterparts.
2-node Cluster-based Evaluation for Static Load
• Smaller gain than that achievable with the SMP-based architecture.
• 17% is the greatest throughput improvement we can achieve with this architecture/workload combination.
2-node Cluster-based Evaluation for Static Load
• In the case of Split-TCP and AsyncSend, the host has idle time available, since it is the network processing at the TCP server that proves to be the bottleneck.
2-node Cluster-based Evaluation for Static and Dynamic Load
• Split-TCP and AsyncSend systems saturate later than regular TCP.
2-node Cluster-based Evaluation for Static and Dynamic Load
• At an offered load of about 500 reqs/sec, the host CPU is effectively saturated.
• 18% is the greatest throughput improvement we can achieve with this architecture.
2-node Cluster-based Evaluation for Static and Dynamic Load
• Balanced configurations depend heavily on the particular characteristics of the workload.
• A dynamic load-balancing scheme between host and TCP server nodes is required for ideal performance under dynamic workloads.
Intelligent-NIC-based Simulation Evaluation
• For all the simulated processor speeds, the Split-TCP system outperforms all the other implementations.
• The improvements over a conventional system range from 20% to 45%.
Intelligent-NIC-based Simulation Evaluation
• The ratio of processing power at the host to that available at the NIC plays an important role in determining server performance.
• In Split-TCP, the processor on the NIC saturates much earlier than the host processor or the network.
• We can achieve better performance with a Split-TCP implementation only with a fast processor on the NIC.
Conclusions about TCP Servers 1/2
• Offloading TCP/IP processing is beneficial to overall system performance when the server is overloaded.
• An SMP-based approach to TCP servers is more efficient than a cluster-based one.
• The benefits of SMP- and cluster-based TCP servers reach 30% in the scenarios we studied.
• The simulated results show greater gains, of up to 45%, for a cluster of devices.
Conclusions about TCP Servers 2/2
• TCP servers require substantial computing resources for complete offloading.
• The type of workload plays a significant role in the efficiency of TCP servers.
• Depending on the application workload, either the host processor or the TCP server can become the bottleneck.
• Hence, a scheme to balance the load between the host and the TCP server would be beneficial for server performance.