Streaming Exa-scale Data over 100Gbps Networks
Mehmet Balman
Scientific Data Management Group, Computational Research Division
Lawrence Berkeley National Laboratory
CRD All Hands Meeting, July 15, 2012
Outline
• A recent 100Gbps demo by ESnet and Internet2 at SC11
• One of the applications:
• Data movement of large datasets with many files (Scaling the Earth System Grid to 100Gbps Networks)
ESG (Earth Systems Grid)
• Over 2,700 sites
• 25,000 users
• IPCC Fifth Assessment Report (AR5): 2PB
• IPCC Fourth Assessment Report (AR4): 35TB
Applications’ Perspective
• Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications’ perspective.
• Data distribution for climate science
• How can scientific data movement and analysis between geographically disparate supercomputing facilities benefit from high-bandwidth networks?
Climate Data Distribution
• ESG data nodes
• Data replication in the ESG Federation
• Local copies: data files are copied into temporary storage in HPC centers for post-processing and further climate analysis.
Climate Data over 100Gbps
• Data volume in climate applications is increasing exponentially.
• An important challenge in managing ever-increasing data sizes in climate science is the large variance in file sizes.
• Climate simulation data consists of a mix of relatively small and large files with an irregular file size distribution in each dataset.
• Many small files
Keep the data channel full
[Diagram: FTP and RPC exchanges — the client repeatedly requests a file (or data) and waits for the server to send it, leaving the channel idle between round trips.]
• Concurrent transfers
• Parallel streams
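A minimal sketch of the concurrent-transfer idea (hypothetical Python, not the demo code): several transfers are kept in flight at once so the per-file request/response round trips overlap instead of serializing; parallel streams apply the same idea within a single transfer by splitting it across multiple TCP connections.

```python
# Hypothetical sketch: keep the data channel busy by running several
# file transfers concurrently; each transfer still pays its own
# request/response round trips, but together they overlap.
import concurrent.futures
import pathlib
import shutil

def transfer_one(src: pathlib.Path, dst_dir: pathlib.Path) -> str:
    """Stand-in for one FTP/RPC-style exchange: request a file, send the file."""
    dst = dst_dir / src.name
    shutil.copyfile(src, dst)     # placeholder for the actual network send
    return src.name

def transfer_dataset(files, dst_dir, concurrency=8):
    # Several transfers in flight at once keep the pipe from idling
    # between the round trips of any single transfer.
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        for name in pool.map(lambda f: transfer_one(f, dst_dir), files):
            print("done:", name)
```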
lots-of-small-files problem! File-centric tools?
• Not necessarily high-speed (same distance): latency is still a problem (a rough bandwidth-delay calculation below makes this concrete)
[Diagram: the same request-a-dataset / send-data exchange over a 100Gbps pipe and over a 10Gbps pipe.]
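The latency point can be made concrete with the bandwidth-delay product, the amount of data that has to be in flight to keep a pipe full (illustrative numbers, not a measurement from the demo):

```python
# Bandwidth-delay product: how much data must be outstanding to fill the pipe.
# Illustrative numbers only, e.g. a cross-country RTT of about 50 ms.
bandwidth_bps = 100e9        # 100Gbps pipe
rtt_s = 0.050                # 50 ms round-trip time
bdp_bytes = bandwidth_bps / 8 * rtt_s
print(f"{bdp_bytes / 2**20:.0f} MiB in flight")   # roughly 600 MiB

# A file-centric protocol that stalls for a round trip between small files
# leaves the channel idle for most of that window; a wider pipe does not help.
```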
Framework for the Memory-mapped Network Channel
memory caches are logically mapped between client and server
Moving climate files efficiently
The SC11 100Gbps demo environment
Advantages
• Decoupling I/O and network operations
• front-end (I/O processing)
• back-end (networking layer)
• Not limited by the characteristics of the file sizes: on-the-fly tar approach, bundling and sending many files together (a bundling sketch follows below)
• Dynamic data channel management: the parallelism level can be increased or decreased both in the network communication and in the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants)
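A minimal sketch of the bundling idea (hypothetical record format, not MemzNet's actual wire format): many small files are packed back-to-back into fixed-size blocks on the fly, so the transfer unit is the block rather than the file and small files no longer dictate the transfer pattern.

```python
# Hypothetical on-the-fly bundling: pack files into fixed-size blocks so the
# transfer unit is the block, independent of individual file sizes.
# Record layout (assumed): [name length][name][data length][data]; records may
# span block boundaries and are reassembled on the receiving side.
import os
import struct

BLOCK_SIZE = 4 * 1024 * 1024   # 4MB blocks, matching the SC11 demo block size

def bundle(paths, block_size=BLOCK_SIZE):
    """Yield fixed-size blocks containing the concatenated file records."""
    buf = bytearray()
    for path in paths:
        name = os.fsencode(path)
        with open(path, "rb") as f:
            data = f.read()
        buf += struct.pack("!I", len(name)) + name
        buf += struct.pack("!Q", len(data)) + data
        while len(buf) >= block_size:
            yield bytes(buf[:block_size])
            del buf[:block_size]
    if buf:
        yield bytes(buf.ljust(block_size, b"\0"))   # pad the final block
```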
The SC11 100Gbps Demo
• CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size 4MB; each block's data section was aligned to the system page size (a layout sketch follows below)
• 1GB cache both at the client and the server
• At NERSC, 8 front-end threads on each host for reading data files in parallel
• At ANL/ORNL, 4 front-end threads for processing received data blocks
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection
83Gbps throughput
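One way to read the alignment point (a sketch under assumptions, not the demo's exact layout): each 4MB block carries a small header padded out to the system page size, so the data section of every block starts on a page boundary.

```python
# Assumed block layout: a header padded to one page so that the data section
# of every 4MB block begins page-aligned.
import mmap
import struct

PAGE = mmap.PAGESIZE
BLOCK_SIZE = 4 * 1024 * 1024
HEADER_FMT = "!QQI"            # block id, payload length, flags (assumed fields)
HEADER_SIZE = PAGE             # header region padded to a full page
DATA_SIZE = BLOCK_SIZE - HEADER_SIZE

def make_block(block_id: int, payload: bytes, flags: int = 0) -> bytearray:
    assert len(payload) <= DATA_SIZE
    block = bytearray(BLOCK_SIZE)
    struct.pack_into(HEADER_FMT, block, 0, block_id, len(payload), flags)
    block[HEADER_SIZE:HEADER_SIZE + len(payload)] = payload   # data starts on a page boundary
    return block
```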
MemzNet: memory-mapped zero-copy network channel
[Diagram: front-end threads and back-end threads on each side access shared memory blocks; the memory caches are logically mapped between client and server over the network.]
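An illustrative producer/consumer sketch of this decoupling (not MemzNet's implementation): front-end threads fill memory blocks from files, back-end threads stream full blocks over TCP connections, and the shared queues of blocks act as the memory cache between the two sides.

```python
# Illustrative MemzNet-style decoupling: I/O (front-end) and network (back-end)
# threads share a pool of memory blocks; either side's thread count can change
# without closing or reopening the data channel.
import queue
import socket
import threading

BLOCK_SIZE = 4 * 1024 * 1024
free_blocks: "queue.Queue[bytearray]" = queue.Queue()
full_blocks: "queue.Queue[tuple[bytearray, int]]" = queue.Queue()

for _ in range(256):                    # 256 x 4MB = 1GB cache, as in the demo
    free_blocks.put(bytearray(BLOCK_SIZE))

def front_end(files):
    """I/O side: fill free memory blocks from files, hand them to the network side."""
    for path in files:
        with open(path, "rb") as f:
            while True:
                block = free_blocks.get()
                n = f.readinto(block)
                if n == 0:
                    free_blocks.put(block)
                    break
                full_blocks.put((block, n))

def back_end(host, port):
    """Network side: stream full blocks over one TCP connection, then recycle them."""
    with socket.create_connection((host, port)) as sock:
        while True:
            block, n = full_blocks.get()
            sock.sendall(memoryview(block)[:n])
            free_blocks.put(block)

def start(files, host, port, n_front=8, n_back=4):
    # e.g. 8 reader threads and 4 TCP streams, as in the SC11 configuration
    for i in range(n_front):
        threading.Thread(target=front_end, args=(files[i::n_front],), daemon=True).start()
    for _ in range(n_back):
        threading.Thread(target=back_end, args=(host, port), daemon=True).start()
```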
ANI 100Gbps testbed
[Diagram: ANI 100Gbps testbed topology. At NERSC, hosts nersc-diskpt-1, nersc-diskpt-2, and nersc-diskpt-3 connect through the nersc-C2940 switch and site router (nersc-mr2) over 4x10GE (MM) links; at ANL, hosts anl-mempt-1 and anl-mempt-2 connect similarly. The nersc-app and anl-app hosts attach to the same infrastructure, and the two sites reach each other through the ANI 100G router. Updated December 11, 2011.]
ANI Middleware Testbed
[Diagram: ANI Middleware Testbed. The NERSC and ANL site routers connect to the ANI 100G routers and the ANI 100G network over 100G links, with 4x10GE (MM) links to the test hosts (eth2-5), 1GE management links (eth0) through the nersc-asw1, anl-asw1, and anl-C2940 switches, and 10G links to ESnet. Host NICs: nersc-diskpt-1: 2x 2x10G Myricom, 1x 4x10G HotLava; nersc-diskpt-2: 1x 2x10G Myricom, 1x 2x10G Chelsio, 1x 6x10G HotLava; nersc-diskpt-3: 1x 2x10G Myricom, 1x 2x10G Mellanox, 1x 6x10G HotLava; anl-mempt-1: 2x 2x10G Myricom; anl-mempt-2: 2x 2x10G Myricom; anl-mempt-3: 1x 2x10G Myricom, 1x 2x10G Mellanox. Note: the ANI 100G routers and the 100G wave are available until summer 2012; testbed resources after that are subject to funding availability.]
SC11 100Gbps demo
Many TCP Streams
(a) Total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, and a total of 10 10Gbps NIC pairs were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, were started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., when the concurrency level is 4, there are 40 streams in total).
ANI testbed at 100Gbps (10x 10G NICs, three hosts): interrupts per CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals]; TCP buffer size is 50MB.
Effects of many streams
MemzNet’s Performance
[Chart: MemzNet vs. GridFTP throughput, in the SC11 demo and on the ANI Testbed; TCP buffer size set to 50MB.]
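For reference, a 50MB per-socket buffer can be requested with standard setsockopt calls along these lines (the kernel caps the effective value at net.core.rmem_max / net.core.wmem_max unless those limits are raised):

```python
# Requesting large per-socket TCP buffers; the kernel may cap these values
# at net.core.rmem_max / net.core.wmem_max unless the limits are raised.
import socket

BUF = 50 * 1024 * 1024   # 50MB, as used in the testbed runs

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF)
```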
MemzNet’s Architecture for data streaming
Experience with 100Gbps Network Applications
Mehmet Balman, Eric Pouyoul, Yushu Yao, E. Wes Bethel, Burlen Loring, Prabhat, John Shalf, Alex Sim, and Brian L. Tierney
DIDC – Delft, the Netherlands, June 19, 2012
Acknowledgements
Peter Nugent, Zarija Lukic, Patrick Dorn, Evangelos Chaniotakis, John Christman, Chin Guok, Chris Tracy, Lauren Rotman, Jason Lee, Shane Canon, Tina Declerck, Cary Whitney, Ed Holohan, Adam Scovel, Linda Winkler, Jason Hill, Doug Fuller, Susan Hicks, Hank Childs, Mark Howison, Aaron Thomas, John Dugan, Gopal Vaswani
The 2nd International Workshop on Network-aware Data Management
to be held in conjunction with the IEEE/ACM International Conference for High Performance
Computing, Networking, Storage and Analysis (SC'12)
http://sdm.lbl.gov/ndm/2012
Nov 11th, 2012
Papers due by the end of August
Last year's program: http://sdm.lbl.gov/ndm/2011