The Multicore-aware Data Transfer Middleware (MDTM) Project. L. Zhang, T. Li, S. Jin, D. Katramatos, L. Carpenter, P. DeMar, D. Yu (Co-PI), W. Wu (PI). 2015 Technology Exchange, Cleveland, OH, Oct 4-7, 2015. Funded by: ASCR/DOE Network Research Program
Page 1: The Multicore-aware Data Transfer Middleware (MDTM) Project, 2015/10/05

The Multicore-aware Data Transfer Middleware (MDTM) Project

L. Zhang, T. Li, S. Jin, D. Katramatos, L. Carpenter, P. DeMar, D. Yu (Co-PI), W. Wu (PI)

2015 Technology Exchange, Cleveland, OH, Oct 4-7, 2015

Funded by: ASCR/DOE Network Research Program

   

Page 2

Agenda

• The MDTM Project Background
• Part 1: Multicore-Aware Data Transfer Middleware (MDTM) – Liang Zhang, FNAL
• Part 2: MDTM Data Transfer Applications (mdtmBBCP) – Dantong Yu, BNL
• Integration: Part 1 + Part 2
• Future work

Page 3

Problem Space

Multicore/manycore has become the norm for high-performance computing.

Existing data movement tools (e.g., BBCP, GridFTP) are limited by major inefficiencies when run on multicore systems.

These inefficiencies will ultimately result in performance bottlenecks on end systems. Such bottlenecks also impede the effective use of advanced high-bandwidth networks.

Page 4

A simple inefficiency case …

[Figure: two diagrams of a Data Transfer Node (DTN) with NUMA NODE 1 and NUMA NODE 2 (cores, data transfer thread) joined by an interconnect; NIC, storage, and GPU attach through IOH1 and IOH2. Panel 1, "Scheduling without I/O locality": remote I/O access. Panel 2, "Scheduling with I/O locality": local I/O access.]

How can we improve?

General-purpose OSes have only limited support for I/O locality!
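As an illustration of what I/O-local scheduling means in practice, here is a minimal Python sketch for Linux. It is not MDTM code. It assumes the standard sysfs files `class/net/<iface>/device/numa_node` and `devices/system/node/node<N>/cpulist`; the interface name passed in and the `sysfs_root` parameter (added only so the lookup can be exercised against a fake tree) are hypothetical.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8,10-11" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def local_cpus_for_nic(iface, sysfs_root="/sys"):
    """Return the CPUs on the NUMA node the NIC is attached to, or None if unknown."""
    root = Path(sysfs_root)
    node = int((root / "class/net" / iface / "device/numa_node").read_text())
    if node < 0:  # the kernel reports -1 when it has no locality information
        return None
    cpulist = (root / "devices/system/node" / f"node{node}" / "cpulist").read_text()
    return parse_cpulist(cpulist)


def pin_to_nic(iface):
    """Restrict the caller to cores local to the NIC (Linux only)."""
    cpus = local_cpus_for_nic(iface)
    if cpus:
        os.sched_setaffinity(0, cpus)  # 0 means the caller itself
    return cpus
```

A network I/O thread could call `pin_to_nic("eth0")` (interface name assumed) before it starts sending, so that it runs next to the NIC rather than across the interconnect.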

Page 5

Our solution

• The Multicore-aware Data Transfer Middleware (MDTM) Project
– Collaborative effort by Fermilab and Brookhaven National Laboratory
– Funded by DOE's Office of Advanced Scientific Computing Research (ASCR)
– A three-year research project

MDTM aims to accelerate data movement toolkits on multicore systems

Page 6

MDTM Architecture

[Figure: layered stack with MDTM Data Transfer Applications/Tools, MDTM Middleware Services, OS Services, and Hardware; three "Access services" labels mark the links between layers.]

MDTM data transfer application
• Data transfer applications that use MDTM middleware services

MDTM middleware services
• A user-space scheduler that schedules and assigns system resources based on the needs of data transfer applications. It also takes into account other factors, including NUMA topology, I/O locality, and QoS

Page 7

MDTM is targeted for Data Transfer Node (DTN)

[Figure: a DTN with NUMA nodes 1, 2, ..., n joined by a system bus/switching fabric. Each node shows processors, memory, QPI links, an IOH, and PCIe connections to NICs, with links labeled "To WAN Networks (Front end)" and "To Local Storage (Back end)"; a PCI-E controller with a local disk also appears.]

Each DTN features one or multiple NUMA nodes. Each NUMA node features one or multiple processors, each consisting of multiple cores.

Page 8

An MDTM-based DTN Storage and Networking Architecture

[Figure: MDTM-based DTNs, each running MDTM apps on MDTM middleware. Labels: local storage (RAID, SSD); direct-connected storage (Fibre Channel, InfiniBand); distributed file system (InfiniBand, Ethernet); switch/router; InfiniBand or 10/40 GE links; one/multiple 10/40 GE links to WAN.]

MDTM Storage Architecture
• Local storage (RAID, SSD)
• Direct-connected storage (FC, IB)
• Distributed file system (IB, 10/40 GE)

MDTM Networking Architecture
• One or multiple WAN links for data transfer
  • Via 10/40 GE NICs
• One or multiple LAN links for storage access
  • Via 10/40 GE NICs, IB adaptors, FC adaptors

Page 9

Part I

Multicore-Aware Data Transfer Middleware (MDTM)

L. Zhang, FNAL

Page 10

MDTM Middleware

• A user-space resource scheduler that harnesses multicore parallelism to scale data movement toolkits on multicore systems
– Data transfer-centric scheduling and resource management capability based on the needs and requirements of data transfer applications
– NUMA topology-aware scheduler
– Enabling efficient network I/O on multicore systems
– Supporting QoS mechanisms to allow differentiated data transfer

Page 11

MDTM Middleware: System Profiling

• Hardware Topology and System Configuration
– System calls
– 3rd-party libraries like libpci
• System Status
– Core Workload Detection
– Intensive Threads Detection
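Core workload detection can be illustrated with the per-core counters in Linux's `/proc/stat`. The sketch below is an assumption about one way to do it, not the MDTM implementation: it parses two snapshots of that file and reports the busy fraction of each core in between. The function names are hypothetical.

```python
def parse_proc_stat(text):
    """Map each "cpuN" line of /proc/stat to (busy_ticks, total_ticks)."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "intr".
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user, nice, system, idle, iowait, irq, softirq, steal
        ticks = [int(x) for x in fields[1:9]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
        total = sum(ticks)
        samples[fields[0]] = (total - idle, total)
    return samples


def core_utilization(before, after):
    """Busy fraction per core between two /proc/stat snapshots."""
    util = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        elapsed = total1 - total0
        util[cpu] = (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return util
```

In use, one would read `/proc/stat`, sleep for a sampling interval, read it again, and pass both parsed snapshots to `core_utilization`; cores with a low busy fraction are candidates for new data transfer threads.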

Page 12

MDTM Middleware: Scheduling

[Figure: "MDTM Middleware Scheduling". A graph whose nodes are devices (CPUs/cores; PCI hubs/bridges; NICs, disks) and whose edges are the connections between devices.]

• Each connection is associated with a cost value that reflects scheduling factors such as distance and traffic throughput.
• Apply Dijkstra's algorithm to find the lowest-cost path from CPU cores to NICs/disks
• Pick the core associated with the lowest-cost path
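The scheduling idea on this slide can be sketched with a textbook Dijkstra search. This is an illustration under stated assumptions, not the MDTM scheduler: the topology, node names, and cost values below are made up, and links are treated as symmetric so the search can start at the device and stop at the cheapest core.

```python
import heapq


def lowest_cost_core(graph, device, cores):
    """Run Dijkstra from `device` and return (best_core, cost, path).

    `graph` maps a node name to {neighbor: cost}; links are assumed symmetric,
    so searching outward from the device gives the core-to-device cost.
    """
    dist = {device: 0.0}
    prev = {}
    heap = [(0.0, device)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    reachable = [c for c in cores if c in dist]
    if not reachable:
        return None
    best = min(reachable, key=lambda c: dist[c])
    path = [best]
    while path[-1] != device:
        path.append(prev[path[-1]])
    return best, dist[best], path


def undirected(edges):
    """Build a symmetric adjacency map from (a, b, cost) triples."""
    graph = {}
    for a, b, cost in edges:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


# Hypothetical two-node DTN: the NIC hangs off ioh1, which is local to node1.
topology = undirected([
    ("core0", "node1", 1), ("core1", "node1", 1),
    ("core8", "node2", 1), ("core9", "node2", 1),
    ("node1", "node2", 10),  # inter-socket interconnect
    ("node1", "ioh1", 2), ("node2", "ioh2", 2),
    ("ioh1", "nic0", 1),
])
best = lowest_cost_core(topology, "nic0", ["core0", "core1", "core8", "core9"])
```

With these made-up costs the search selects a core on node1, since any core on node2 pays the interconnect cost to reach the NIC. A real cost model would also fold in current load, as the first bullet suggests.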

Page 13

MDTM Middleware Modules

• MDTM Daemon
– Acquiring and publishing system information
– Scheduling and binding application threads
– Communicating with MDTM consoles and Apps
• MDTM API
– Interfacing the MDTM consoles and Apps
– Communicating with MDTM Daemon
– Requesting and reading system information
• MDTM Console
– Giving users access to system information and status
– Monitoring and development utility

[Figure: the middleware stack on the OS (Linux), showing Apps, the MDTM Console, the MDTM API, the MDTM Daemon, and shared memory.]

Page 14

MDTM Middleware R&D

– Multicore system profiling
– Data transfer-centric scheduling and resource management
– NUMA topology-aware scheduler
– Supporting core affinity for network and disk I/O
– Support for NUMA-aware buffer pools
– Core partitioning on NUMA systems
– Intelligent memory management on NUMA systems

Page 15

Part II

MDTM Data Transfer Applications (mdtmBBCP)

Dantong Yu, BNL

Page 16

mdtmBBCP Design Requirements

• Multiple core awareness → Fine-granularity design, i.e., end-to-end data transfer must be split into a sequence of tasks, each of which is handled by dedicated threads.
• I/O devices reside on different NUMA nodes → Must minimize data migration overheads from storage to networks
• Users, transfer requests, and file transfers must be optimized globally and parallelized!
• Resource-aware scheduling and pre-allocation

Page 17

mdtmBBCP Design

[Figure: a "Data Transfer Applications/Servers" block containing request/data preprocessing, thread/flow management, and data access and transmission, with a data transfer service interface, an MDTM interface, and a storage I/O interface to a) local disks, b) SAN, c) memdisk/flash disks. Other labels: SAN topology, block devices, Fibre Channel SAN, SSD/memdisk, control agent, control channel, network stack, data channel, destination host.]

Key techniques:
* Metadata access
  – Automatic preprocessing for various types of storage
  – Knowledge of storage system performance via tests
* Obtain knowledge of system layout (cores, disks, NICs, etc.)
* File grouping, sorting, load balancing
* Interface: file systems, storage, MDTM for layout
* Data structures: lists, sets, layout table, various statistics

Page 18

Major Features in mdtmBBCP

• Resource pre-allocation
– I/O-centric thread allocation, for storage and network
– Shared buffer space
– NUMA awareness: cores, disks, NICs
• Request preprocessing (more details in extra slides)
– File grouping by I/O device type and location
– File sorting by disk offset
– Post-transfer data write reordering optimization
• Different methods for handling large and small files
– Large file striping: parallel processing of the data of a single large file
– Small file pipelining: one-by-one processing of small files. Note: multiple groups of files are processed using multiple pipelines
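The preprocessing and striping steps listed above can be sketched as follows. This is an illustrative Python sketch, not mdtmBBCP code: the `FileEntry` record, its field names, and the assumption that each file's device and starting disk offset are already known are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class FileEntry(NamedTuple):
    path: str
    device: str       # hypothetical device label, e.g. "sda"
    disk_offset: int  # starting position on the device (assumed known)
    size: int         # bytes


def group_and_sort(files):
    """Group files by device, then order each group by on-disk offset."""
    groups = defaultdict(list)
    for f in files:
        groups[f.device].append(f)
    return {dev: sorted(fs, key=lambda f: f.disk_offset) for dev, fs in groups.items()}


def stripe(size, stripe_size):
    """Split one large file into (offset, length) ranges for parallel workers."""
    return [(off, min(stripe_size, size - off)) for off in range(0, size, stripe_size)]
```

Each device group can then feed its own pipeline of small files in offset order, which keeps reads on one device close to sequential, while the ranges from `stripe` let several threads work on a single large file at once.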

Page 19

Other Features in mdtmBBCP

• Third-party support and client/server mode
• Security with SSH support
• Automatic host system configuration setup
• Data transfer progress report
• Support for different I/O modes: direct I/O, asynchronous I/O
• Event-driven data transfer task processing

Page 20

mdtmBBCP R&D

• Metadata access
– Automatic preprocessing for various types of storage
– Knowledge of storage system performance via tests
• Retrieve system layout (cores, disks, NICs, etc.) for scheduling
• Implemented request preprocessing
– Request decomposition and regrouping into smaller tasks enhance
– Task grouping for affinity binding and concurrency
– Task sorting for I/O locality and optimization
– Load balancing
– Improve performance on different storage media
• Implemented interfaces: file systems, storage, MDTM for layout
• Software design and data structures: object-oriented, lists and sets, layout table, various statistics

Page 21

mdtmBBCP R&D

• Asynchronous request processing
– Serve all requests with the pre-allocated and reusable thread pools.
– Maximize the file transfer concurrency
• Support for both large file striping and small file pipelining
• Progress report for data transfer jobs
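The reusable thread pool idea can be sketched with Python's standard `concurrent.futures`. This is only an illustration: `transfer` is a hypothetical placeholder for a real transfer task, and Python's executor starts its worker threads lazily, so it is reusable but not strictly pre-allocated in the way the slide describes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def transfer(task):
    """Placeholder for one transfer task; here it only reports the byte count."""
    name, size = task
    return name, size


def serve_requests(tasks, workers=4):
    """Run all tasks on one reusable pool and collect results as they finish."""
    done = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(transfer, t) for t in tasks]
        for fut in as_completed(futures):
            name, size = fut.result()
            done[name] = size  # a progress report could be emitted here
    return done
```

Because results are collected as each future completes, the same loop is a natural place to update a progress report for the whole job.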

Page 22

Part III

Integration

Page 23

How does MDTM work?

An MDTM application spawns three types of threads
– Management threads to handle user requests and management-related functions
– Dedicated disk/storage I/O threads to read/write from/to disks/storage
– Dedicated network I/O threads to send/receive data

An MDTM data transfer application accesses MDTM middleware services explicitly via APIs

In operation, an MDTM middleware daemon will be launched. It will support two types of services
– Query service allows MDTM apps to access system configuration and status
– Scheduling service assigns system resources based on the requirements of data transfer applications
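The split into disk I/O threads and network I/O threads can be illustrated with two Python threads joined by a bounded queue. This is a minimal sketch under stated assumptions, not the MDTM API: the function names are hypothetical, in-memory byte streams stand in for the disk and the socket, and core binding is left out.

```python
import io
import queue
import threading

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


def disk_reader(src, buffers):
    """Disk I/O thread: read fixed-size blocks and hand them to the network side."""
    while True:
        block = src.read(BLOCK_SIZE)
        if not block:
            break
        buffers.put(block)
    buffers.put(None)  # sentinel: no more data


def network_sender(buffers, dst):
    """Network I/O thread: drain the queue; `dst` stands in for a socket."""
    while True:
        block = buffers.get()
        if block is None:
            break
        dst.write(block)


def run_transfer(src, dst, depth=8):
    """Management role: wire the two I/O threads together and wait for them."""
    buffers = queue.Queue(maxsize=depth)  # bounded, so a slow sender throttles the reader
    threads = [
        threading.Thread(target=disk_reader, args=(src, buffers)),
        threading.Thread(target=network_sender, args=(buffers, dst)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


source = io.BytesIO(b"0123456789")
sink = io.BytesIO()
run_transfer(source, sink)
```

In a middleware-managed setup, each of these threads would additionally be bound to a core chosen by the scheduling service, with the reader near the storage device and the sender near the NIC.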

Page 24

MDTM Logical Functions and Modules

[Figure: module diagram above the OS kernel (and hardware below), in three groups.
Data transfer application's native functions and modules: User Interface, Authentication & Access Control, ...
MDTM-based data transfer functions and modules: Request/data Preprocessing, Thread/flow Management, Data Access and Transmission, Data Transfer Service Interface.
MDTM middleware functions and modules: MDTM App Interface, Resource Scheduler, Thread Load Estimation, System Monitor, Statistics Store, QoS/Policy Manager, NUMA access cost modelling, Admin user input.
Arrows between application and middleware: data transfer profile, resource scheduling request/response, status query/response.]

I/O-centric architecture
Parallel data transfer
Data layout preprocessing
Disk/network I/O optimization

Data flow-centric scheduling
NUMA-aware scheduling
I/O locality optimization
Maximizing parallelism

Page 25

Major Activities

• The MDTM Project demo at SC14, New Orleans, November 2014.
– http://scdoe.info/demo-station-descriptions/
• L. Zhang, T. Li, Y. Ren, P. DeMar, S. Jin, D. Yu, W. Wu, "The MDTM Project", SC'14 Poster session, New Orleans, LA, 2014.
• ESCC Winter 2015 Talk
– https://escc.es.net/?q=node/7/107
• MDTM deployment on ESnet 100G Testbed, July 2015

Page 26

Initial Results

• We evaluate mdtmBBCP on the ESnet 100G testbed. mdtmBBCP is compared with GridFTP and BBCP. For a fair comparison, all the tools are configured with the same parameters: I/O block size and number of parallel streams. We use Time-to-Completion (TTC) as the performance metric. The test transfers a 100 GB file from nersc-tbn-2 to nersc-tbn-1.

        mdtmBBCP   GridFTP   BBCP
TTC     55s        101s      95s
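To put the TTC figures in more familiar units, the short script below converts them to average throughput and to speedup relative to mdtmBBCP. It assumes "100GB" means 100 × 10^9 bytes; the slide does not say whether decimal or binary gigabytes are meant.

```python
FILE_BYTES = 100 * 10**9  # assumes "100GB" means 10^9-byte gigabytes
ttc_seconds = {"mdtmBBCP": 55, "GridFTP": 101, "BBCP": 95}

# Average throughput in Gb/s: bits transferred divided by time to completion.
throughput_gbps = {tool: FILE_BYTES * 8 / t / 1e9 for tool, t in ttc_seconds.items()}

# Speedup of mdtmBBCP over each tool: ratio of completion times.
speedup = {tool: t / ttc_seconds["mdtmBBCP"] for tool, t in ttc_seconds.items()}

for tool in ttc_seconds:
    print(f"{tool:9s} {throughput_gbps[tool]:5.1f} Gb/s  {speedup[tool]:.2f}x")
```

Under that assumption the averages come to about 14.5 Gb/s for mdtmBBCP, 7.9 Gb/s for GridFTP, and 8.4 Gb/s for BBCP, so mdtmBBCP finishes roughly 1.8 times faster than GridFTP and 1.7 times faster than BBCP in this single test.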

Page 27

Future work

• MDTM R&D
– Production-quality distribution kit
– QoS
• MDTM field test and deployment
– Reaching out to potential MDTM users
– Alpha-release users
  • E.g., ESnet network engineers
– Beta-release users
  • CMS, ATLAS

Page 28

MDTM Source Code

The latest MDTM source code is available at https://cdcvs.fnal.gov/redmine/projects/mdtm

Features supported
– Multicore system profiling
– Thread/process scheduling
– Thread binding with I/O locality and load balancing
– NUMA-aware memory pre-allocation and binding
– Network I/O affinity

Page 29

MDTM Project Website

http://mdtm.fnal.gov

Page 30

Questions?

Demo: http://mdtm-server.fnal.gov:1337

