Introduction to Scientific Data Management

damien.francois@uclouvain.be, October 2015

http://www.cism.ucl.ac.be/training
Page 1: Introduction to Scientific Data Management · 2016-11-17 · Introduction to Scientific Data Management damien.francois@uclouvain.be October 2015 ... JSON, YML, XML. 19 Text File


Goal of this session:

“Share tools, tips and tricks related to the storage, transfer, and sharing of scientific data”


1. Data storage

Block – File – Object

Databases


Data storage

● Medium (Flash, Hard Drives, Tapes, DRAM)

● Attachment (IDE, SAS, SATA, iSCSI, ATAoE, FC)

● RAID, JBOD, erasure coding, software RAID, LVM

● Block storage, local filesystem

● Global filesystem, object store, RDBMS, NoSQL

● Schema, serialization (file formats, etc.)


Storage abstraction levels


Storage Medium Technologies

http://www.slideshare.net/IMEXresearch/ss-ds-ready-for-enterprise-cloud


Storage performance

http://www.slideshare.net/IMEXresearch/ss-ds-ready-for-enterprise-cloud


Storage safety (RAID)

https://www.extremetech.com/computing/170748-how-long-do-hard-drives-actually-live-for


Storage safety (RAID)

https://en.wikipedia.org/wiki/Standard_RAID_levels
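
As a rough illustration of how parity-based RAID levels (RAID 5, for instance) can survive the loss of one drive, the sketch below computes an XOR parity block over equally sized data blocks and rebuilds a missing block from the survivors. The block contents and function name are invented for the example; a real RAID layer works on disk stripes, not on Python byte strings.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three hypothetical data blocks, as if striped over three drives.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block would be stored on a fourth drive.
parity = xor_blocks(data_blocks)

# Simulate losing the second drive, then rebuild its block
# from the surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, any single missing block can be recovered this way; losing two blocks at once cannot be repaired with a single parity block.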


Storage abstraction levels


(local) Filesystems

http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/

Generation 0: No system at all. There was just an arbitrary stream of data. Think punchcards, data on audiocassette, Atari 2600 ROM carts.

Generation 1: Early random access. Here, there are multiple named files on one device with no folders or other metadata. Think Apple ][ DOS (but not ProDOS!) as one example.

Generation 2: Early organization (aka folders). When devices became capable of holding hundreds of files, better organization became necessary. We're referring to TRS-DOS, Apple //c ProDOS, MS-DOS FAT/FAT32, etc.

Generation 3: Metadata (ownership, permissions, etc.). As the user count on machines grew higher, the ability to restrict and control access became necessary. This includes AT&T UNIX, Netware, early NTFS, etc.

Generation 4: Journaling! This is the killer feature defining all current, modern filesystems: ext4, modern NTFS, UFS2, XFS, you name it. Journaling keeps the filesystem from becoming inconsistent in the event of a crash, making it much less likely that you'll lose data, or even an entire disk, when the power goes off or the kernel crashes.

Generation 5: Copy on Write snapshots, Per-block checksumming, Volume management, Far-future scalability, Asynchronous incremental replication, Online compression. Generation 5 filesystems are Btrfs and ZFS.


Network filesystem

NAS: ex. NFS

SAN: ex. GFS2

One source, many consumers

Pictures from https://www.redhat.com/magazine/008jun05/features/gfs_nfs/


Parallel / distributed filesystem

ex: Lustre, GPFS, BeeGFS, GlusterFS

Many sources, many consumers

Pictures from https://www.redhat.com/magazine/008jun05/features/gfs_nfs/


Special filesystems – in memory


Filesystems


What filesystem for what usage

● Home (NFS) : Small size, Small I/Os

● Global scratch (parallel FS) : Large size, Large I/Os

● Local scratch (local FS): Medium size, Large I/Os

● In-memory (tmpfs): Small Size, Very Large I/Os

● Mass storage (NFS): Large size, Small I/Os


Storage abstraction levels


Text File Formats – JSON, YML, XML


Text File Formats – CSV, TSV


Binary File Formats – CDF, HDF

http://pro.arcgis.com/en/pro-app/help/data/multidimensional/fundamentals-of-netcdf-data-storage.htm


Binary File Formats – CDF, HDF

https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/


Binary File Formats – CDF, HDF


What file format for what usage

● Meta data

– Configuration file: INI, YAML

– Result with context information: JSON

● Data

– Small data (kBs): CSV, TSV

– Medium data (MBs): compressed CSV

– Large data (GBs): netCDF, HDF5, XDMF

– Huge data (TBs): Database, Object store (“loss of innocence”)

Use dedicated libraries to write and read them
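
A minimal sketch of the "use dedicated libraries" advice for the small and medium cases, using only the Python standard library: context information goes to JSON, the data itself to CSV and to gzip-compressed CSV. The field names, values and file names are invented for the example; large data would go through a netCDF or HDF5 library instead, which is not shown here.

```python
import csv
import gzip
import json
import os
import tempfile

# Hypothetical run description and results, invented for this example.
context = {"code_version": "1.2", "n_steps": 3, "dt": 0.5}
rows = [(0, 0.0, 1.00), (1, 0.5, 0.61), (2, 1.0, 0.37)]

workdir = tempfile.mkdtemp(prefix="formats_")
json_path = os.path.join(workdir, "run.json")
csv_path = os.path.join(workdir, "results.csv")
gz_path = os.path.join(workdir, "results.csv.gz")

# Context information: JSON, written by the json library.
with open(json_path, "w") as fh:
    json.dump(context, fh, indent=2)

# Small data: plain CSV with a header row.
with open(csv_path, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["step", "time", "value"])
    writer.writerows(rows)

# Medium data: the same content, gzip-compressed while writing.
with gzip.open(gz_path, "wt", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["step", "time", "value"])
    writer.writerows(rows)

# Reading back goes through the same libraries, never through ad hoc parsing.
with open(json_path) as fh:
    context_back = json.load(fh)
with gzip.open(gz_path, "rt", newline="") as fh:
    rows_back = [(int(r["step"]), float(r["time"]), float(r["value"]))
                 for r in csv.DictReader(fh)]
```

Letting the library handle quoting, escaping and compression is the point: hand-written `print`/`split` code tends to break on the first value that contains a comma or a newline.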


Storage abstraction levels


Object storage

● Object: data (e.g. file) + meta data

● Often built on erasure coding

● Scale out easily

● Useful for web applications

● Access with REST API


RDBMS

Pictures from http://www.ibm.com/developerworks/library/x-matters8/

● Mostly needed for categorical data and alphanumerical data (not suited for matrices, but good for end-results)

● Indexes make finding a data element very fast (as well as computing sums, maxima, etc.)

● Encodes relations between data (constraints, etc)

● Atomicity, Consistency, Isolation, and Durability
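
The points above can be tried out with SQLite, which ships with Python. This is only a sketch under assumptions: the `runs` and `results` tables, their columns and the values are invented, and a production setup would more likely use a server-based RDBMS. It shows an index on a categorical column, an aggregate query, a relation enforced by a foreign key, and a transaction that is rolled back as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when explicitly asked to.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: simulation runs and their end results.
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, solver TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE results ("
    " run_id INTEGER NOT NULL REFERENCES runs(id),"
    " label TEXT NOT NULL,"
    " energy REAL NOT NULL)"
)
conn.execute("CREATE INDEX idx_results_label ON results(label)")

with conn:  # commits on success, rolls back if an exception escapes
    conn.execute("INSERT INTO runs VALUES (1, 'cg'), (2, 'gmres')")
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?)",
        [(1, "converged", -1.5), (1, "converged", -2.0), (2, "diverged", 3.0)],
    )

# Aggregates over one category; the index on label can serve the filter.
max_energy, total = conn.execute(
    "SELECT MAX(energy), SUM(energy) FROM results WHERE label = ?",
    ("converged",),
).fetchone()

# Constraint plus atomicity: run 99 does not exist, so the second insert
# fails and the first insert of the same transaction is undone as well.
try:
    with conn:
        conn.execute("INSERT INTO results VALUES (1, 'converged', -9.0)")
        conn.execute("INSERT INTO results VALUES (99, 'converged', 0.0)")
except sqlite3.IntegrityError:
    pass

n_rows = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```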


NoSQL

Pictures from http://www.tomsitpro.com/articles/rdbms-sql-cassandra-dba-developer,2-547-2.html

● Mostly needed for unstructured, semi-structured, and polymorphic data

● Scaling out is very easy

● Basic Availability, Soft-state, Eventual consistency


When to use?

– when you have a large number of small files

– when you perform a lot of direct writes in a large file

– when you want to keep structure/relations between data

– when software crashes have a non-negligible probability

– when files are updated by several processes

● When not to use:

– only sequential access

– simple matrices/vectors, etc.

– direct access on fixed-size records and no structure


Example: run a redis server

● Create a redis directory

● Copy /etc/redis.conf and modify the following lines:

Choose a port at random


Example: run a redis server

● Start the redis server

● Store values (normally you would do this in a Slurm job)


Example: run a redis server

● Check the values

● Retrieve the values
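
The commands shown on the original slides are not part of this transcript. As a stand-in, here is a hedged sketch of what storing and retrieving a value could look like from a job script, speaking the Redis wire protocol (RESP) over a plain socket. The key name, host and port are placeholders; in practice `redis-cli` or a Redis client library would normally be used instead.

```python
import socket


def encode_command(*args):
    """Encode a command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def send_command(host, port, *args, timeout=5.0):
    """Send one command and return the raw reply (first chunk only)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(encode_command(*args))
        return sock.recv(4096)


# Pure encoding, no server needed. The key name is hypothetical.
set_request = encode_command("SET", "job:42:energy", -1.5)

# With a server running (host and port below are placeholders; use the
# port chosen in your own redis.conf):
# send_command("localhost", 6379, "SET", "job:42:energy", -1.5)
# send_command("localhost", 6379, "GET", "job:42:energy")
```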


2. Data transfer

faster and less secure

parallel transfers


scp -c cipher ...

http://blog.famzah.net/2010/06/11/openssh-ciphers-performance-benchmark/


Fastest: No SSH at all

● Need friendly firewall (choose direction accordingly)

● Only over trusted networks

● If rsh is installed: rcp instead of scp

● If rsh is not installed: nc on both ends
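
To make the "nc on both ends" idea concrete, the sketch below does the same thing in Python: one side listens on a TCP port and writes whatever arrives to a file, the other side connects and streams the file, with no encryption and no authentication. It is an illustration, not a replacement for nc; the file names are invented, and the demo runs both ends on the loopback interface, whereas on a cluster they would be two hosts with a firewall that allows the connection.

```python
import os
import socket
import tempfile
import threading

CHUNK = 1 << 16


def receive_file(server_sock, dest_path):
    """Accept one connection and write everything received to dest_path."""
    conn, _ = server_sock.accept()
    with conn, open(dest_path, "wb") as out:
        while True:
            chunk = conn.recv(CHUNK)
            if not chunk:  # the sender closed the connection
                break
            out.write(chunk)


def send_file(host, port, src_path):
    """Stream src_path to host:port over plain, unencrypted TCP."""
    with socket.create_connection((host, port)) as sock, \
            open(src_path, "rb") as src:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            sock.sendall(chunk)


# Demo on the loopback interface with a throwaway file.
workdir = tempfile.mkdtemp(prefix="plain_tcp_")
src_path = os.path.join(workdir, "data.bin")
dest_path = os.path.join(workdir, "copy.bin")
with open(src_path, "wb") as fh:
    fh.write(os.urandom(200_000))

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

receiver = threading.Thread(target=receive_file, args=(server, dest_path))
receiver.start()
send_file("127.0.0.1", port, src_path)
receiver.join()
server.close()
```

Which side listens and which side connects is the "choose direction accordingly" point: the listening side is the one whose firewall must accept incoming connections.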


Resuming transfers

● When nothing changed but the transfer was interrupted

– rsync --size-only: do not perform byte-level file comparison


– rsync --append: do not re-check partially transmitted files, and resume the transfer where it was abandoned, assuming the first transfer attempt was made with scp or with rsync --inplace
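
The append-style resume can be sketched on local files as follows. This is not rsync's implementation, only the underlying idea: trust the bytes already at the destination, seek past them in the source, and copy the rest. The function name and file names are invented for the example.

```python
import os
import tempfile

CHUNK = 1 << 20


def resume_copy(src_path, dest_path):
    """Append the missing tail of src_path to dest_path; return bytes copied.

    Like an append-style resume, this trusts that the bytes already present
    in dest_path match the start of src_path and does not re-check them.
    """
    done = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    total = os.path.getsize(src_path)
    if done > total:
        raise ValueError("destination is larger than source; not a partial copy")
    copied = 0
    with open(src_path, "rb") as src, open(dest_path, "ab") as dest:
        src.seek(done)
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            dest.write(chunk)
            copied += len(chunk)
    return copied


# Demo: a 1000-byte file whose first transfer stopped after 400 bytes.
workdir = tempfile.mkdtemp(prefix="resume_")
src_path = os.path.join(workdir, "big.dat")
dest_path = os.path.join(workdir, "big.dat.partial")
payload = os.urandom(1000)
with open(src_path, "wb") as fh:
    fh.write(payload)
with open(dest_path, "wb") as fh:
    fh.write(payload[:400])

copied = resume_copy(src_path, dest_path)
```

The shortcut is only safe when nothing changed on the source since the first attempt, which is exactly the condition stated on the slide.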


Parallel data transfer: bbcp

● Better use of the bandwidth than SCP

● Needs to be installed on both sides (easy to install)

● Needs friendly firewalls


Parallel data transfers: parsync

http://moo.nac.uci.edu/~hjm/parsync/


Parallel data transfers: sbcast


Transferring ZOT files

● Zillions Of Tiny files

● More meta-data than data → large overhead for rsync

● Solution: Pre-tar or tar on the fly

● Needs friendly firewall

● Also avoid 'ls' and '*' as they sort the output. Favor 'find'
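
A small sketch of the "tar on the fly" idea with Python's `tarfile` module: many tiny files are packed into a single stream, so the transfer handles one object instead of paying per-file metadata costs. The directory layout and file names are invented, and an in-memory buffer stands in for the pipe or socket that would carry the stream to the remote side.

```python
import io
import os
import tarfile
import tempfile

# Build a directory of many tiny files (names and contents are made up).
src_dir = tempfile.mkdtemp(prefix="zot_")
for i in range(500):
    with open(os.path.join(src_dir, f"part_{i:04d}.txt"), "w") as fh:
        fh.write(f"{i}\n")

# Pack everything into one stream. os.scandir yields entries in directory
# order without sorting them, similar in spirit to 'find' rather than 'ls'.
stream = io.BytesIO()  # stands in for a pipe or socket to the remote side
with tarfile.open(fileobj=stream, mode="w|") as tar:
    with os.scandir(src_dir) as entries:
        for entry in entries:
            if entry.is_file():
                tar.add(entry.path, arcname=entry.name)

# On the receiving end, unpack the single stream back into files.
dest_dir = tempfile.mkdtemp(prefix="zot_copy_")
stream.seek(0)
with tarfile.open(fileobj=stream, mode="r|") as tar:
    tar.extractall(dest_dir)

n_received = len(os.listdir(dest_dir))
```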


3. Data sharing

with other users (Unix permissions, Encryption)

with external users (Owncloud)


Data sharing

Data sharing with other users


Sharing with all other users


Sharing with the group


Sharing and hiding
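
The slide contents for the three sharing cases are not in this transcript, so the sketch below is one plausible reading, expressed as Unix permission bits set from Python. The directory and file names are invented. "Sharing and hiding" is interpreted here as a directory that others can traverse but not list, so that a file is reachable only by someone who already knows its name; that interpretation is an assumption.

```python
import os
import stat
import tempfile


def make_shared_dir(parent, name, dir_mode, file_mode):
    """Create parent/name holding one file, with the given permission bits."""
    path = os.path.join(parent, name)
    os.mkdir(path)
    file_path = os.path.join(path, "results.h5")  # hypothetical file name
    with open(file_path, "wb") as fh:
        fh.write(b"placeholder")
    os.chmod(file_path, file_mode)
    os.chmod(path, dir_mode)
    return path, file_path


def mode_of(path):
    """Return the permission bits of path, e.g. 0o755."""
    return stat.S_IMODE(os.stat(path).st_mode)


base = tempfile.mkdtemp(prefix="sharing_")

# All other users: everyone may list the directory and read the file.
everyone_dir, everyone_file = make_shared_dir(base, "everyone", 0o755, 0o644)

# Group only: members of the owning group may list and read; others get nothing.
group_dir, group_file = make_shared_dir(base, "group", 0o750, 0o640)

# Sharing and hiding: others may traverse (x) but not list (no r) the
# directory, so the file is readable only by someone who knows its name.
hidden_dir, hidden_file = make_shared_dir(base, "hidden", 0o711, 0o644)
```

Every parent directory up to the root also needs at least execute permission for the intended audience, otherwise the settings on the shared directory itself have no effect.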


Sharing and encrypting


Data sharing

Data sharing with external users


Data sharing with external users

● owncloud

CISM login


Dropbox-like


External SFTP connectors


Dropbox-like


My home on Manneback


Can create a share URL


And distribute it


Exercise:

1. Run a redis server on Hmem

2. Populate it from compute nodes with random data

3. Extract the data from it and create an HDF5 file

4. Encrypt the file

5. Copy it to lemaitre2 using nc

6. Make it available to others who know of its name


Summary:

Storage: choose the right filesystem and the right file format

Transfer: use the parallel tools when possible and limit encryption in favor of throughput

Sharing: use all the potential of the UNIX permissions and try Owncloud
