Eidgenössische Technische Hochschule Zürich
Ecole polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich
Ninth IEEE International Symposium on High Performance Distributed Computing, Pittsburgh, Pennsylvania, August 1-4, 2000
Speculative Defragmentation - A Technique to Improve the Communication Software Efficiency for Gigabit Ethernet
Ch. Kurmann, F. Rauch, M. Müller, T. Stricker
Laboratory for Computer Systems
ETHZ - Swiss Institute of Technology
CH-8092 Zurich

Comm. Speeds of Commodity PCs
→ For Gigabit Ethernet and TCP/IP the OS software cannot keep up with the hardware speed
[Figure: bar chart of transfer rate in MByte/s (axis 0 to 140) on Gigabit Ethernet 32bit-PCI and Myrinet 32bit-PCI for MPI-Linux 2.0-BIP, MPI-Linux 2.2, TCP-Linux 2.2 and TCP-Windows NT; bar labels 20, 35, 42 and 126]

Overview
• Why Gigabit Ethernet
• Packet Defragmentation
• TCP/IP Overheads
• Speculative Packet Defragmentation
• Performance Analysis
• Conclusion

Problem Statement
How can we sustain network bandwidths of 75-100 MByte/s with a commodity PC cluster node:
• memory copy 90 MByte/s
• 32-bit PCI I/O-bus 132 MByte/s
• commodity Gigabit Ethernet adapter 100 MByte/s
• standard TCP/IP protocol
• fully transparent standard socket-API

Papers 10 Years Ago
The same problem, 30-100 times slower
• memory copy < 3 MByte/s
• VME I/O-bus < 3 MByte/s
• commodity 10BaseT Ethernet adapter 1 MByte/s
• special purpose blast transfer protocol [Zwaenepoel85]
• optimistic bulk transfers [Carter89]
• transparent blasts by header padding [Peterson90]
Not standard protocol & not fully transparent
→ Solutions did not find their way into current systems!

Why Gigabit Ethernet
• Compatible with Ethernet and Fast Ethernet (UTP Cat5)
• Uncomplicated technology, which results in high reliability and low cost
• Switched Ethernet provides link-level flow control on full-duplex channels
• In larger networks only unacknowledged, connectionless datagram delivery service → TCP needed
• Standard frame size is still limited to 46-1500 Byte of data

Alternatives / Extensions
• Dedicated network hardware with customized lightweight protocols: Myrinet, SCI, Giganet, ServerNet
→ primarily designed for internal communication in server farms
• Jumbo Frames (9 KByte) for Gigabit Ethernet to reach a Maximum Transmission Unit (MTU) of a memory page:
→ • change of standard
  • higher latencies in store-and-forward switches
  • do not solve the header/payload separation

Packet [De]Fragmentation
• IP standard technique
• Data to be sent is fragmented into small chunks < network MTU (Maximum Transmission Unit)
• Network protocols enclose the frames with header/trailer
• Receiver separates the headers from the payload and defragments the data again
• Implications for Ethernet:
  • MTU < Memory Page
  • DMA-logic not optimal
→ Therefore memory copy for packet [de]fragmentation

TCP/IP Host Overheads
• Single largest overhead: copying and checksums
→ Zero-copy techniques
• Per-packet processing and interrupt overhead also high
→ Interrupt coalescing
[Figure: Host Overhead for TCP/IP over Gigabit Ethernet (PII 400 MHz, Linux 2.2). Stacked bar of percent CPU (axis 0 to 100) split into Copy & Checksum, Interrupt, TCP/IP and Driver / DMA Init]

OS Environment
[Figure: layered diagram of the OS environment.
Control path, from userspace through kernelspace to the NIC: User Application, Middleware (CORBA, MPI), Socket Layer, TCP/IP Stack, NIC Driver, NIC Firmware.
Data path: User Mapped Data Pages (ORB marshalling, buffering), System Page Pool with Send and Receive Buffers (protocol handling, packet generation), DMA over the PCI bus to the NIC.
Protection boundary copies between userspace and kernelspace are marked "Previous work"; driver copies are marked "Speculative Defragmentation".]

Required Technologies
• Well known solutions to eliminate the User/Kernel copy:
  • User-Level Network Interface (U-Net) or Virtual Interface Architecture (VIA)
  • User/Kernel Shared Memory (FBufs, IOLite), Copy Emulation or Page Remapping with Copy on Write
• The Driver copy remains for Gigabit Ethernet
→ Goal: Elimination of driver copy for the packet defragmentation and header separation
→ True zero-copy

Commodity GE-Adapters
• Until now, zero-copy support is only available for “intelligent” network adapters (ATM, SiliconTCP)
• Today’s Gigabit Ethernet adapters are too simple
  • no processor, TLBs on board
  • limited DMA capabilities
  • no protocol filtering implemented
→ Deterministic zero-copy implementation with commodity GE adapters is not possible!
• Approach: Making just the common case fast
→ Speculation Techniques for Defragmentation

Speculative Defragmentation I
• Our driver manages to send/receive entire 4 KByte pages
• Decomposition of 4 KByte IP-packets into 3 IP-fragments on driver level (standard IP fragmentation)
• Attachment of headers to the payload data with a separate DMA-descriptor
[Figure: DMA descriptor chain for one 4096-byte page. Each fragment uses a pair of descriptor entries labeled "data" and "zcdata", each with status and length fields.
1st Frag.: ETH, IP and TCP headers (14, 20, 20 bytes), 1460 bytes of payload
2nd Frag.: ETH and IP headers (14, 20 bytes), 1480 bytes of payload
3rd Frag.: ETH and IP headers (14, 20 bytes), 1156 bytes of payload]
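The fragment sizes 1460, 1480 and 1156 follow from standard IP fragmentation arithmetic. The sketch below reproduces them under stated assumptions: a 1500-byte Ethernet MTU, 20-byte IP and TCP headers without options, and a hypothetical helper name `fragment_page`. It illustrates the arithmetic only and is not the driver code.

```python
ETH_HEADER = 14   # bytes, Ethernet header
IP_HEADER = 20    # bytes, IP header without options (assumption)
TCP_HEADER = 20   # bytes, TCP header without options (assumption)
MTU = 1500        # bytes, standard Ethernet MTU
PAGE = 4096       # bytes, one 4 KByte memory page


def fragment_page(page_len=PAGE, mtu=MTU):
    """Return one (header_bytes, page_payload_bytes) tuple per IP fragment."""
    # Every IP fragment except the last must carry a multiple of 8 bytes.
    max_ip_payload = (mtu - IP_HEADER) // 8 * 8
    # The IP payload of the whole packet is the TCP header plus the page data.
    remaining = TCP_HEADER + page_len
    fragments = []
    first = True
    while remaining > 0:
        ip_payload = min(max_ip_payload, remaining)
        headers = ETH_HEADER + IP_HEADER
        data = ip_payload
        if first:
            # Only the first fragment carries the TCP header.
            headers += TCP_HEADER
            data -= TCP_HEADER
            first = False
        fragments.append((headers, data))
        remaining -= ip_payload
    return fragments


if __name__ == "__main__":
    for i, (hdr, data) in enumerate(fragment_page(), start=1):
        print(f"fragment {i}: {hdr} header bytes, {data} payload bytes")
```

The three payload sizes add up to exactly 4096 bytes, which is why one page maps onto three frames.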

Speculative Defragmentation II
What are we speculating about?
• Speculation that all fragments of a whole page will be received in order
• Speculation about the precise packet format (header lengths, data fields)
• The receiver has to fix the DMA descriptors without knowledge about the next packets to arrive
• In clusters with one or two switches, the probability is high that the three fragments arrive in order
• Software cleanup on mis-speculation
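The receive-side speculation can be modeled in a few lines. This is a simplified sketch, not the driver: the function name `receive_page`, the frame representation as `(fragment_index, payload)` tuples, and the assumption that the cleanup step still has access to the complete frames are all illustrative choices.

```python
EXPECTED_LENGTHS = (1460, 1480, 1156)  # payload bytes per fragment of a 4 KByte page
OFFSETS = (0, 1460, 2940)              # position of each fragment inside the page
PAGE_SIZE = 4096


def receive_page(frames):
    """Model one speculative page receive.

    frames: list of (fragment_index, payload) tuples in arrival order.
    Returns (page_bytes, speculation_ok).
    """
    page = bytearray(PAGE_SIZE)

    # Speculative step: the descriptors were fixed before any frame arrived,
    # so the k-th arriving payload lands in the k-th slot of the page.
    for slot, (_, payload) in enumerate(frames[:3]):
        n = min(len(payload), EXPECTED_LENGTHS[slot])
        page[OFFSETS[slot]:OFFSETS[slot] + n] = payload[:n]

    # Check the speculation: right count, right order, right lengths.
    ok = len(frames) == 3 and all(
        idx == slot and len(payload) == EXPECTED_LENGTHS[slot]
        for slot, (idx, payload) in enumerate(frames)
    )

    if not ok:
        # Software cleanup: rebuild the page by copying each payload
        # to the offset given by its fragment index.
        page = bytearray(PAGE_SIZE)
        for idx, payload in frames:
            if 0 <= idx < 3:
                chunk = payload[:EXPECTED_LENGTHS[idx]]
                page[OFFSETS[idx]:OFFSETS[idx] + len(chunk)] = chunk

    return bytes(page), ok
```

When the three fragments arrive in order, no copy is made; a reordered arrival is detected and repaired by the copying fallback.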

Speculative Defragmentation III
Fragmentation/Defragmentation of a 4 KByte memory page by the DMA of the network interface
[Figure: on the sending side, a 4 KByte page is fragmented into three pieces (1st: 1460, 2nd: 1480, 3rd: 1156 bytes), with the protocol headers taken from an sk_buff (header, zcdata). The frames cross the Ethernet network. On the receiving side, the three pieces are defragmented into a 4 KByte page, with the headers stored in an sk_buff.]

Performance Evaluation
• Gains by Successful Speculation
• Penalty for Speculation Misses
• Speculation Success Rates in Applications
• Consequences:
  - Network Control Architecture
  - Suggested Hardware Improvements

Gains with Speculation
→ 80 % increase in performance (bandwidth)
[Figure: TCP/IP Performance of Gigabit Ethernet, bar chart of transfer rate in MByte/s (axis 0 to 80). Configurations: Linux 2.2 Standard; ZeroCopy Remapping with Copying Driver; Speculative Defrag. with Copying Socket API; Spec. Defragmentation with ZeroCopy Remapping; Spec. Defragmentation with ZeroCopy FBufs. Bars are grouped into "1 copy" and "0 copy"; bar labels 42, 45, 46, 65 and 75.]

Penalty with Speculation Misses
→ The common case is fast, the fallback not much slower
[Figure: TCP/IP Performance of Gigabit Ethernet, bar chart of transfer rate in MByte/s (axis 0 to 80).
Compatibility (Zero-Copy Sender, Standard Receiver): 45
Linux 2.2 Operation (Standard Sender, Standard Receiver): 42
Fallback (Standard Sender, Zero-Copy Receiver): 35]

Evaluation of Success Rates
• Application traces show success of speculative transfers
• TreadMarks has an inherent scheduling that prevents interference
• TPC-D needs a control architecture or hardware changes

Traces            Oracle TPC-D               TreadMarks SOR
Ethernet frames   Master    Host1    Host2   Master    Host1    Host2
total             129835    67524    62311    68182    51095    50731
large              90725    45877    44848    44010    30707    30419
zcopy              79515    37833    41682    44004    30693    30405
ok                 38235    37833    41682    44004    30675    30399
Success rate        48 %    100 %    100 %    100 %   > 99 %   > 99 %
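The reported success rates are consistent with dividing the "ok" frame count by the "zcopy" frame count of each trace. That definition is an assumption inferred from the numbers, not stated on the slide; the short sketch below checks it against the six traces.

```python
# (zcopy frames, ok frames, success rate as stated on the slide)
TRACES = [
    (79515, 38235, "48 %"),
    (37833, 37833, "100 %"),
    (41682, 41682, "100 %"),
    (44004, 44004, "100 %"),
    (30693, 30675, "> 99 %"),
    (30405, 30399, "> 99 %"),
]


def success_rate(zcopy, ok):
    """Percentage of zero-copy frames whose speculation succeeded (assumed definition)."""
    return 100.0 * ok / zcopy


if __name__ == "__main__":
    for zcopy, ok, stated in TRACES:
        print(f"zcopy={zcopy:6d} ok={ok:6d} computed={success_rate(zcopy, ok):6.2f} % stated={stated}")
```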

Network Control Architecture
• Problem: Multiple synchronous, fast receives may garble the zero-copy frames
• Solution: Admission Control on Ethernet driver level with negotiation for one single sender to blast
• Implicit channel allocation by OS works
• Fully transparent
• No explicit scheduling of transfers through a special interface → the API remains the same
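The idea of admitting a single blasting sender at a time can be sketched as follows. The slide does not describe the negotiation protocol, so the class `BlastAdmission`, its methods, and the choice that a rejected sender uses the standard copying path are hypothetical illustrations only.

```python
class BlastAdmission:
    """Receiver-side admission control: at most one sender may blast at a time."""

    def __init__(self):
        self.owner = None  # sender currently allowed to blast, if any

    def request(self, sender):
        """Grant the zero-copy channel if it is free or already held by this sender."""
        if self.owner is None or self.owner == sender:
            self.owner = sender
            return True
        return False

    def release(self, sender):
        """Free the channel; only the current owner can release it."""
        if self.owner == sender:
            self.owner = None


def choose_path(admission, sender):
    # Assumption: a sender that is not admitted uses the standard copying path.
    return "zero-copy blast" if admission.request(sender) else "standard path"


if __name__ == "__main__":
    admission = BlastAdmission()
    print(choose_path(admission, "host1"))
    print(choose_path(admission, "host2"))
    admission.release("host1")
    print(choose_path(admission, "host2"))
```

Because the decision is taken inside the driver, the application keeps using the unchanged socket API.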

Suggested Hardware Improvements
• Additional control-path between the checksumming- and the DMA-logic for detection of protocol & header fields
→ Reliable header/payload separation
• Stream detection with a simple matching register and a separate DMA descriptor chain for fast transfers:
→ Detection of at least one high performance stream
→ Separation of this stream with its DMA descriptors
→ Improvement of the speculation rate
→ Lower driver complexity
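A matching register of this kind can be modeled in software as a value/mask comparison on selected header bits. The sketch below is purely illustrative: the packed header layout, the names `MatchRegister`, `pack_header` and `select_chain`, and the two chain labels are assumptions, not a description of any existing adapter.

```python
from dataclasses import dataclass


@dataclass
class MatchRegister:
    value: int  # header bits the fast stream must carry
    mask: int   # 1-bits select which header bits are compared

    def matches(self, header_bits: int) -> bool:
        return (header_bits & self.mask) == (self.value & self.mask)


def pack_header(src_ip: int, protocol: int) -> int:
    """Pack a 32-bit source address and an 8-bit protocol number (hypothetical layout)."""
    return (src_ip << 8) | protocol


def select_chain(reg: MatchRegister, header_bits: int) -> str:
    """Steer a frame to the fast descriptor chain if it belongs to the detected stream."""
    return "fast" if reg.matches(header_bits) else "default"


if __name__ == "__main__":
    TCP, UDP = 6, 17  # IP protocol numbers
    reg = MatchRegister(value=pack_header(0x0A000001, TCP), mask=0xFFFFFFFFFF)
    print(select_chain(reg, pack_header(0x0A000001, TCP)))
    print(select_chain(reg, pack_header(0x0A000002, TCP)))
    print(select_chain(reg, pack_header(0x0A000001, UDP)))
```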

Conclusion
• Speculation techniques open a new horizon for optimized network drivers and permit an “almost”-zero-copy implementation for TCP/IP over Gigabit Ethernet.
• The performance in our implementation was raised from 42 to 75 MByte/s (80%) using the standard Linux TCP-stack and commodity network interface hardware.
• Speculation works in network interfaces as well as in “Instruction Level Parallelism” and should be considered to find simple and effective hardware improvements for network interfaces.
• Existing Ethernet protocols and standard network interface chipsets prevent an accurate, fully deterministic defragmentation in hardware.