D Robinson - Using HDF5 to work with large quantities of rich biological data

Date post: 18-Nov-2014
Upload: jan-aerts
Presentation at BOSC2012 by D Robinson - Using HDF5 to work with large quantities of rich biological data

Using HDF5 To Work With Large Quantities of Rich Biological Data
Dana Robinson (derobins@hdfgroup.org), The HDF Group
BOSC 2012, July 13, 2012

Today's Goal

For you to walk away from this talk with a basic understanding of the HDF5 technology stack.

Where is HDF5 used?

What is HDF5?

HDF5 is a highly scalable way to organize and store heterogeneous, multidimensional data of user-defined types.

HDF5 also allows data relationships and context to be stored using annotation and linking.

HDF5

The HDF5 technology suite includes:

• A structured binary file format

• An abstract data model for describing your data

• A data access library, written in C (with bindings for C++, Fortran 95/2003, and Java)

HDF5 has characteristics of …

• PDF: standard exchange format, heterogeneous information

• XML: self-describing, extensible types, rich metadata

• Binary flat file: high-performance

• Databases: subsetting, random access

• Directories and files: hierarchical, collections of related information

Advantages of HDF5

• Platform and architecture-independent

• Scalable in space and time
  • File size only limited by OS and filesystem
  • Data access time (esp. parallel) scales well

• Flexible (user-defined types and organization)

• Files are self-describing

Advantages of HDF5 (2)

• High-performance

• Parallel I/O via MPI-IO

• Supports compression and other filters

• Open source (BSD license)

• The HDF Group (THG) is committed to providing long-term support

HDF5 Data Objects

• Groups

• Datasets

• Datatypes

• Metadata (Attributes)

Example: LCMS Data

[Diagram of an HDF5 file for LC-MS data. Labels: sample name; chromatography parameters; MS spectra; MS parameters; MS/MS spectra; MS/MS parameters; protein IDs; protein ID type.]
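
One way such an LC-MS layout could be expressed with groups, datasets, attributes, a user-defined datatype, and a link is sketched below in Python with h5py. This is only an illustration, not the layout from the slide: the group and dataset names, the array shapes, the attribute values, and the compound datatype fields are all assumptions. The file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np


def build_lcms_file(name="lcms_example.h5"):
    """Create an in-memory HDF5 file with a hypothetical LC-MS layout."""
    # driver="core" with backing_store=False keeps the file in memory only.
    f = h5py.File(name, "w", driver="core", backing_store=False)

    # Attributes annotate objects; here the sample name sits on the root group.
    f.attrs["sample_name"] = "sample_001"  # hypothetical value

    # Groups organize related objects, much like directories.
    ms = f.create_group("ms")
    ms.attrs["scan_range_mz"] = np.array([300.0, 2000.0])  # hypothetical parameter

    # A dataset holds a multidimensional array: 10 spectra x 500 intensity bins.
    ms.create_dataset("spectra", data=np.zeros((10, 500), dtype="f4"))

    msms = f.create_group("msms")
    msms.create_dataset("spectra", data=np.zeros((4, 500), dtype="f4"))

    # A compound (user-defined) datatype for protein identifications.
    protein_dtype = np.dtype([("accession", "S16"), ("score", "f8")])
    ids = np.array([(b"P12345", 42.0), (b"Q67890", 17.5)], dtype=protein_dtype)
    f.create_dataset("protein_ids", data=ids)

    # A soft link records a relationship between objects without copying data.
    msms["precursor_scans"] = h5py.SoftLink("/ms/spectra")
    return f
```

The caller owns the returned file object and should call `close()` on it when done.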

HDF5 Data Access

Unlike many data storage systems, HDF5 has no built-in query engine or indexes.

You will have to write your own data access code, usually using the HDF5 API.
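
As a rough illustration of what hand-written access code can look like, the sketch below walks a file and selects datasets by an attribute value, standing in for a query. It uses the h5py Python binding rather than the C API, and the `ms_level` attribute, the `run0/spectra` style paths, and the helper names are assumptions made up for this example.

```python
import h5py
import numpy as np


def find_datasets(group, predicate):
    """Return sorted paths of all datasets under `group` that satisfy `predicate`."""
    matches = []

    def visitor(name, obj):
        # visititems calls this for every object below `group`;
        # returning None keeps the traversal going.
        if isinstance(obj, h5py.Dataset) and predicate(obj):
            matches.append(name)

    group.visititems(visitor)
    return sorted(matches)


def query_demo():
    with h5py.File("query_example.h5", "w", driver="core",
                   backing_store=False) as f:
        for i, level in enumerate([1, 2, 1]):
            # Intermediate groups (run0, run1, ...) are created automatically.
            d = f.create_dataset(f"run{i}/spectra", data=np.arange(5))
            d.attrs["ms_level"] = level  # hypothetical annotation
        return find_datasets(f, lambda d: d.attrs.get("ms_level") == 1)
```

Every object is visited on each call, so a scan like this grows with the number of objects in the file; anything resembling an index has to be built and maintained by the application.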

Dataspaces

HDF5 provides rich data subsetting functionality. Example: displaying a thumbnail of a high-resolution image.
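
The thumbnail case can be approximated with a strided selection, sketched here with h5py. The image size, the step of 8, and the function names are assumptions for illustration, and simple subsampling stands in for whatever resampling a real viewer would use.

```python
import h5py
import numpy as np


def read_thumbnail(dset, step):
    """Read every `step`-th element along both axes of a 2-D dataset."""
    # h5py turns the strided slice into an HDF5 hyperslab selection on the
    # dataset's dataspace, so the full image is never loaded into a NumPy array.
    return dset[::step, ::step]


def thumbnail_demo():
    # Hypothetical 1024 x 768 single-channel "image" with distinct pixel values.
    image = np.arange(1024 * 768, dtype="u4").reshape(1024, 768)
    with h5py.File("thumbnail_example.h5", "w", driver="core",
                   backing_store=False) as f:
        dset = f.create_dataset("image", data=image)
        thumb = read_thumbnail(dset, step=8)
    return image, thumb
```

With these sizes, `thumb` has shape `(128, 96)`.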

Filters and Compression

HDF5 supports data filters, including compression, which transform data as it enters or leaves the file.

Note that HDF5 data objects are filtered individually, not the entire file!

[Diagram: uncompressed data in user's buffer ↔ compression filter ↔ compressed data in the file]
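
Per-object filtering might look like the following h5py sketch, in which one dataset is gzip-compressed and a second dataset in the same file is stored unfiltered. The dataset names, chunk shape, and compression level are assumptions chosen for the example.

```python
import h5py
import numpy as np


def filter_demo():
    data = np.zeros((200, 200), dtype="f8")  # highly compressible test data
    with h5py.File("filter_example.h5", "w", driver="core",
                   backing_store=False) as f:
        # Filters are set per dataset and require chunked storage.
        packed = f.create_dataset("packed", data=data, chunks=(50, 50),
                                  compression="gzip", compression_opts=4)
        # A second dataset in the same file with no filter at all.
        plain = f.create_dataset("plain", data=data)

        settings = (packed.compression, plain.compression)
        # Reading goes back through the filter; the caller sees plain values.
        round_trip = packed[...]
    return settings, np.array_equal(round_trip, data)
```

The reading code is the same for both datasets; decompression happens inside the library as the data is copied into the caller's buffer.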

Higher Language Bindings

C++, Fortran (95 & 2003), Java, .NET, Python

• C++ & Fortran: distributed with the library

• Java: distributed separately

• .NET: distributed separately, not supported by THG (as-is)

• Python (PyTables, h5py): not distributed by THG

NOTE: HDF5 bindings are thin wrappers over the C API.

• There is no object-oriented interface to HDF5

• The bindings are not pure Java, .NET, etc.

