Date post: | 09-Feb-2017 |
Upload: | frederick-lefebvre |
A platform for data management and analytics in campuses and research labs
Frédérick [email protected]
● Compute Canada and its regional partners have put a lot of work into using the CANARIE and NREN networks to interconnect their infrastructure at high speed
● 10 GbE right now / 100 GbE for all new systems
● 25 Globus/GridFTP data transfer nodes have been deployed to facilitate data movement across the Compute Canada infrastructure
● Data doesn't just magically appear on Compute Canada’s systems.
● It gets created “somewhere”, has a life of its own, comes to our systems for a brief time and goes back home...
● As we centralize resources, we are moving storage and computing further away from researchers
● Visualization, real-time computations, and application development and prototyping can be impaired by the increased latency between researchers and the systems and their teams
● There is a need to improve the tools available to researchers to facilitate their use of Advanced Research Computing (ARC) resources.
○ Improved end-to-end networking
○ Wider deployment of data movement and pre-processing infrastructure
● Deploy Data Transfer Nodes (DTN) close to where data is generated and extend the Science-DMZ all the way to the labs
○ DTNs administered by the local ARC team
○ Local ingestion points can be dedicated to a research lab or the whole campus
Based on the Fiona DTN developed by SDSC for the Pacific Research Platform
https://fasterdata.es.net/science-dmz/DTN/fiona-flash-i-o-network-appliance/
● Science-DMZ
○ Dedicated research network
○ Away from firewalls
○ All the way to the researchers
Ref: Science-DMZ - es.net
http://fasterdata.es.net/science-dmz/science-dmz-architecture/
● High-speed data transfers need purpose-built Data Transfer Nodes
● Above all, they require fast drives to prevent disk IOs from becoming the bottleneck
● Spinning disks are seldom usable unless you are going to have lots of them
○ Think 10s of them to achieve 40 Gbps!
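The "10s of drives" figure can be checked with a back-of-the-envelope calculation. The sketch below is only illustrative: the per-drive throughput values are assumptions (roughly 200 MB/s sustained for a spinning disk, 2000 MB/s for an NVMe drive), not measurements from these systems, and RAID or filesystem overhead is ignored.

```python
import math


def drives_needed(target_gbps: float, per_drive_mb_s: float) -> int:
    """Minimum number of drives whose combined sequential throughput
    reaches target_gbps, ignoring RAID and filesystem overhead."""
    target_mb_s = target_gbps * 1000 / 8  # Gbps -> MB/s, decimal units
    return math.ceil(target_mb_s / per_drive_mb_s)


# Assumed sustained sequential rates (hypothetical, not measured)
SPINNING_MB_S = 200
NVME_MB_S = 2000

print(drives_needed(40, SPINNING_MB_S))  # 25
print(drives_needed(40, NVME_MB_S))      # 3
```

Under these assumptions, a 40 Gbps link needs about 25 spinning disks but only a handful of NVMe drives, which is why drive speed is the first sizing concern.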
● Modern processors have much more power than what is required to move data from drives to networks
● The fast I/O of a DTN and its large memory make it ideal for running streaming workloads, data analytics and general data transformation
● Why let it sit idle?
● Enhance the DTNs with the ability to run code on local data through a web interface
○ Focus for now on scripting languages and big data analytics with Apache Spark
○ Creates an environment where data can be ingested, explored, modified and then moved elsewhere
GridFTP server inside a container, bound to specific cores
All other cores shared by the OS and user code
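One way to picture the core split is sketched below. It assumes a Docker-style container runtime and a hypothetical image name (`example/gridftp`); the slides do not say which runtime or image is used. The `--cpuset-cpus` flag restricts the container to the listed cores. Keeping user code off those cores would need a separate mechanism, which is not shown here.

```python
def split_cores(total: int, reserved: int) -> tuple[list[int], list[int]]:
    """Split logical cores into a set reserved for data transfer
    and a set shared by the OS and user code."""
    if not 0 < reserved < total:
        raise ValueError("reserved must be between 1 and total - 1")
    cores = list(range(total))
    return cores[:reserved], cores[reserved:]


def docker_command(image: str, cores: list[int]) -> list[str]:
    """Build (but do not run) a docker command pinning a container to cores."""
    cpuset = ",".join(str(c) for c in cores)
    return ["docker", "run", "--detach", "--network", "host",
            "--cpuset-cpus", cpuset, image]


# 40 logical cores, 4 of them reserved for the transfer server (assumed split)
transfer_cores, shared_cores = split_cores(40, 4)
print(" ".join(docker_command("example/gridftp", transfer_cores)))
print(f"{len(shared_cores)} cores left for the OS and user code")
```

The command is only printed, so the reservation can be reviewed before anything is launched.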
● JupyterLab to manage and launch users’ Notebooks
● Authentication against the CC LDAP directory
● perfSONAR in containers (in progress)
● Scale out whole Notebooks or Apache Spark workloads to a parallel cluster (in progress)
● Network export of local storage
● Automated data transformation pipelines
● Software building blocks & code snippets in the Notebooks
1. Sensors upload data to local storage through an S3 API
2. Researcher explores their data with R and Apache Spark in a Notebook
3. Data is anonymized
4. Anonymized data is transferred to a CC system using Globus
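The anonymization step is not detailed in the slides. As a minimal sketch of what such a step could look like in a Notebook, the code below replaces identifying fields with a keyed hash. This is pseudonymization, which is weaker than full anonymization. The field names, record layout and key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(record: dict, id_fields: set[str], key: bytes) -> dict:
    """Return a copy of record where identifying fields are replaced
    by a truncated HMAC-SHA256 digest; other fields are kept as-is."""
    out = {}
    for name, value in record.items():
        if name in id_fields:
            digest = hmac.new(key, str(value).encode("utf-8"),
                              hashlib.sha256).hexdigest()
            out[name] = digest[:16]  # shortened token, stable for a given key
        else:
            out[name] = value
    return out


# Hypothetical sensor reading; the key must stay on the local DTN
reading = {"participant": "P-0042", "sensor": "hr-01", "value": 71}
print(pseudonymize(reading, {"participant"}, key=b"local-secret"))
```

Because the same key always produces the same token, records from one participant can still be linked after the transfer without exposing the original identifier.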
1. Sequencers output data on local storage through CIFS share
2. Fastq files are preprocessed locally
3. Files are characterized and indexed
4. Data is transferred to parallel system for further processing
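The slides do not say what "characterized" covers. The sketch below is one plausible reading: it computes a few summary statistics for a FASTQ file using only the standard library. It assumes the common four-line record layout (no wrapped sequences) and uses an in-memory sample instead of a real file.

```python
import io
from typing import Iterator, TextIO


def read_fastq(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def characterize(handle: TextIO) -> dict:
    """Summarize a FASTQ stream: read count, base count, mean length, GC fraction."""
    reads = bases = gc = 0
    for _, seq, _ in read_fastq(handle):
        reads += 1
        bases += len(seq)
        gc += sum(seq.count(b) for b in "GCgc")
    return {
        "reads": reads,
        "bases": bases,
        "mean_length": bases / reads if reads else 0.0,
        "gc_fraction": gc / bases if bases else 0.0,
    }


sample = "@r1\nGATTACA\n+\nIIIIIII\n@r2\nGGCC\n+\nIIII\n"
print(characterize(io.StringIO(sample)))
```

A summary like this could be stored next to each file as a simple index before the data moves to the parallel system.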
● A gateway to get researchers’ data onto Compute Canada’s infrastructure
● A local platform for data exploration & visualization, pre-processing and prototyping
● A generic web portal to submit workloads on ARC systems
○ We have automated node reservation to scale out Notebooks on Colosse.
○ The way we do it on Colosse requires the portal to be a submit host
○ There has to be a better way. A web API?
Processors                        2x Xeon E5-2640v4 = 40 logical cores
Memory                            128 GB DDR4
Network interfaces                Mellanox ConnectX-3 Pro dual-port 40GbE
Drives for OS                     2x 128 GB SATA SSD
Local storage (Perf. option)      8x 400 GB NVMe drives
Local storage (Capacity option)   24x 8 TB NL-SAS drives
● Cost is from ~12K to 25K and up
○ Storage is the differentiator
● There is a need for high-speed data transport services on campuses and in larger labs
● Local computing capabilities create new opportunities for quick innovation
● We envision a model where researchers finance their local portal to size it up to their needs
● We have selected 2 pilot sites that will be deployed this summer
● You can participate by:
○ Becoming a pilot site
○ Contributing to the platform design and development
○ Letting us know how we can improve the model
○ Helping us find a better name…
● Contact us: [email protected]