
Tue 20 Mar 2007 P Coe : MonAliSA project Oxford Physics

Flexible data storage for minimal effort

A tale of two formats


Persuasion

• General problem to be addressed

• First use case - ATLAS FSI c. 1998

• ATLAS FSI data format

Widening the scope :

• Recent work – MONALISA format 2006


Great Expectations

Want to use disk files to record...
• Data from experiments/simulations
• Prepared data ready for analysis
• Analysis results

Want to see the data (not unrelated)
• Plotting graphs is essential for analysis, publication etc


Laboratory data sources...

Take instrument readings...
...measuring ambient conditions...
...and deliberately induced signals

[Figure label: Data Acquisition]


Laboratory data is usually ...

• A collection of one dimensional arrays
– Counts from an ADC - 16 bit integer
– Photon counts - 16 or 32 bit integer, etc

• Some arrays only have a single element
– Ambient relative humidity - 1 float

• Treating all data as if it were in arrays may be a presumptuous idea... but fruitful


Data increases with compound interest...

• Preparation of acquired data
– filtering, averaging, smoothing, cutting, etc

• Analysis of prepared data
– fitting, calculating FFT, time derivatives etc

• Data collation
– measurement to measurement trends, etc

Want to emphasise that it is all treated as one dimensional arrays of data :

e.g. FFT spectrum in two 1-d arrays of doubles
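
As an illustration of the "everything is a 1-d array" idea, the sketch below stores a spectrum as two 1-d arrays of doubles. It is only a sketch: the talk does not say which two arrays were stored, so the choice of frequency and magnitude is an assumption, as are the names `dft_spectrum`, `signal`, `freqs` and `mags`. A plain DFT stands in for an FFT to keep the example short.

```python
import cmath
import math
from array import array


def dft_spectrum(samples, sample_rate_hz):
    """Return (frequencies, magnitudes) as two 1-d arrays of doubles."""
    n = len(samples)
    freqs = array("d")
    mags = array("d")
    for k in range(n // 2 + 1):
        # Direct DFT sum for bin k (slow, but fine for a short example)
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        freqs.append(k * sample_rate_hz / n)
        mags.append(abs(total) / n)
    return freqs, mags


# 8 samples of a sine wave with exactly one cycle across the window
signal = array("d", (math.sin(2 * math.pi * i / 8) for i in range(8)))
freqs, mags = dft_spectrum(signal, sample_rate_hz=8.0)
```

Both results have the same shape as raw instrument data (a typed 1-d array), so the same storage routines can handle acquired, prepared and analysed data alike.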


Let us not forget "annotation" Data

• Experimental set-up (which instruments, how and where connected etc)

• Other "one-off" parameters e.g. timestamp
• Version info for DAQ/analysis algorithms
• Seed parameters
• Other fit/preparation control parameters
• ...

(META?)

Software can deliver much more if annotations are included


An ideal data file format

• Holds data and meta data / related information
• Should be simple to write code to :
– find/read stored data of interest from the file
– write any stored data to the file so it can be identified
– append new data to the file, without disruption
• Handle (store/retrieve) data :
– flexibly (of any format, in any order)
– reliably (data should come back intact)
– robustly (absent data should not break the format)
– with language / platform independence


A database as a solution?

• A database in place of a file meets most requirements of a file format

• I have no database experience and did not want to be coupled to (tied down by) database related issues

• For example...is it easy to access the same data using different languages?

A pseudo-random quote from the web...

“17.1. Do You Really Need a Relational Database?

It is common for web developers to jump to the conclusion that they need an SQL-compliant RDBMS like Oracle, when in fact they have a rather small data set that could be organized as one table. Commercial RDBMSs are expensive as well as nontrivial to install and administer.”


ASCII text file : Less than ideal (1)

• Meta data
– some may be in column labels
– remaining meta data often poured out into the filename!

• How easy is it to ...
– ...find/read stored data of interest from the file?
  Code for reading single lines/columns is reasonable
  The price is a rigid file structure
– ...write any stored data to the file so it can be identified?
  Code for writing ASCII columns is very simple
  Data identity is based on assumptions about column order
– ...append new data to the file, without disruption?
  Maybe possible in Perl but...
  ... in most languages it is easier to create a new file


ASCII text file : Less than ideal (2)

• What about handling data? :
– flexibly (of any format, in any order)
  Data format is ASCII columns (or similar enough format) only!
  Ordering flexibility is lost (unless only humans read)
– reliably (data should come back intact)
  Rounding / formatting issues – places effort on user
– robustly (absent data should not break the format)
  Empty columns do break the format
– with language / platform independence
  Most languages allow ASCII I/O
  Most applications (Excel, Origin, etc) read ASCII
  Minor cross platform issues – very rarely fatal
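
The "rounding / formatting issues" point can be made concrete with a small sketch. The `%.6f` format below is just one plausible choice for an ASCII column (an assumption, not something taken from the talk), and Python is used only for brevity. The names `reading`, `as_text` and `as_bytes` are illustrative.

```python
import struct

reading = 0.1 + 0.2  # stored internally as the double 0.30000000000000004

# ASCII column with a fixed format: digits beyond the format are lost
as_text = "%.6f" % reading
from_text = float(as_text)

# Binary: 8 bytes, big endian IEEE double, restored bit for bit
as_bytes = struct.pack(">d", reading)
(from_bytes,) = struct.unpack(">d", as_bytes)

print(from_text == reading)   # False
print(from_bytes == reading)  # True
```

With ASCII the user has to pick a format wide enough for every column. With a binary double the question does not arise.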


Can a binary file format help?

The answer depends greatly on the :

– format design
• flexibility is easily lost without careful forethought / revision
• simplicity : The format should be “as simple as it needs to be, but no simpler” – paraphrasing Einstein (1933)

– format implementation
• design advantages are easily lost by the implementation



Two binary formats

• Old ATLAS format
– relies on ID codes to identify data
– has some naive file structures - too rigid
– implemented in LabVIEW (and ROOT!)

• New “MonAliSA” format
– relies on ID codes to identify data
– simpler and more flexible
– implemented in C, Java and LabVIEW


ATLAS demo FSI 1998-

• Experiments based on tuning a laser

• Timing of experiment has two "modes" based on rapid or slow laser tuning

• DAQ & analysis data file structures reflect this
– alternate fast/slow periods in "blocks"

• All I/O software in National Instruments language LabVIEW (v4.x later v6.1 for analysis)


Binary file format (ATLAS FSI 1998-)

Simple structure at highest level
• A single, minimal sized file header
• Followed by N data blocks - rarely of equal size

[Figure: file layout by byte location inside file. File Header (Version Number + Number of data blocks; field sizes marked as 2 bytes and 1 byte), followed by data blocks 1 2 3 4 5]


Inside a data block (roughly)

Block prefix : points to
a) start of first data label
b) start of next data block
also counts how many arrays stored

Header : most of the meta data

Array label : Identifies array (details on next slide)

1-d array : BIG ENDIAN or IEEE


ATLAS FSI format - data array label

[Figure: the label by byte location inside file, sitting between the previous array (which runs up to the start of this label) and the array attached to the end of the label]

Label fields shown in the figure, with example values :
• ID codes - e.g. 3.1
• Location at end of the data array (pointer to next label)
• Instrument Channel ID
• Fixed length string labelling the array contents - e.g. “Long Ref raw data” + white space padding to fixed length
• Element type - e.g. unsigned 16 bit integer
• Number of array elements
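
A minimal sketch of such a label, using the slide's example values. The talk names the label fields but not their widths or order, so every width in `LABEL`, the type code `U16` and the function names are assumptions made for illustration. Python is used for brevity, although the original I/O code was LabVIEW.

```python
import struct

# Hypothetical layout: category, subcategory, pointer to next label,
# instrument channel ID, 32 byte string, element type, element count.
LABEL = struct.Struct(">HHIH32sBI")
U16 = 1  # hypothetical element type code for unsigned 16 bit integers


def pack_labelled_array(category, subcategory, channel, text, values, start):
    """Return label + array bytes for an array written at file offset `start`."""
    data = struct.pack(">%dH" % len(values), *values)   # big endian array
    end = start + LABEL.size + len(data)                # where the next label begins
    label = LABEL.pack(category, subcategory, end, channel,
                       text.encode("ascii").ljust(32),  # white space padding
                       U16, len(values))
    return label + data


def unpack_labelled_array(buf, start):
    """Read the label at `start` and the array attached to its end."""
    cat, sub, end, _channel, text, _etype, count = LABEL.unpack_from(buf, start)
    values = struct.unpack_from(">%dH" % count, buf, start + LABEL.size)
    return (cat, sub), text.decode("ascii").rstrip(), list(values), end


blob = pack_labelled_array(3, 1, 0, "Long Ref raw data", [10, 20, 65535], 0)
ids, text, values, next_label = unpack_labelled_array(blob, 0)
```

Because the pointer to the next label is stored in the label itself, a reader can step over an array it does not want without knowing anything about the array's element type.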


ID code examples from ATLAS

• 2 numbers in code (category, subcategory)
• Categories :
1. DAQ timing parameters
2. Thermometer / humidity : "environmental"
3. Reference Interferometer System
...
6. Grid line interferometers
7. (Reference) Phase
8. Sine fitting prepared data
...


ID code examples from ATLAS

• Subcategories arbitrarily assigned by hand

• In 3 (Reference Interferometer System)
– 3.1 Long Reference Interferometer raw data
– 3.3 Etalon raw data
– 3.129 Long reference data for laser 1
– 3.257 Long reference data for laser 2


How does ATLAS format work? 1) Writing data to the file

• Meta structures in place first (tedious)
• Each array with label : placed at end of file
• Any order permitted by unique array ID codes (inside a given block at least)
• Writing each data array involves :
– Preparing meta data for label
– Writing label, including pointer to end of the array
– Writing the array after the label
– Updating meta structures : end of block & no. of arrays


How does ATLAS format work? 2) Reading data from the file

• Finding the correct data block is similar to finding array (below)
– block label and pointer to next block are in block prefix

• Find array : Seek array at (block, ID)
– (In the correct block) 1st array label easily found from prefix
– then iterate within the block...

• Read array label : Do the two ID codes match those required?
– If YES read label & array at end of the label
– If NO find next label using pointer in this label, then continue iteration

• N.B. reading finds 1st matching instance (in block) only
– should have only been one matching instance written
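
The write and read procedures for a single block can be sketched as below. This is a simplified stand-in, not the ATLAS LabVIEW code: the block header and most label fields are left out, all arrays are doubles, offsets are relative to the start of the block, and the field widths in `PREFIX` and `LABEL` are assumptions.

```python
import struct

# Hypothetical widths; only the roles of the fields come from the talk.
PREFIX = struct.Struct(">IIH")  # first label offset, end of block, array count
LABEL = struct.Struct(">HHII")  # category, subcategory, next label offset, element count


def new_block():
    """An empty block: the prefix points just past itself and counts no arrays."""
    return bytearray(PREFIX.pack(PREFIX.size, PREFIX.size, 0))


def append_array(block, category, subcategory, values):
    """Write label then array at the end of the block, then update the prefix."""
    first, end, count = PREFIX.unpack_from(block, 0)
    data = struct.pack(">%dd" % len(values), *values)
    next_label = end + LABEL.size + len(data)
    block += LABEL.pack(category, subcategory, next_label, len(values)) + data
    PREFIX.pack_into(block, 0, first, next_label, count + 1)


def find_array(block, category, subcategory):
    """Return the first array whose ID codes match, or None if absent."""
    offset, _end, count = PREFIX.unpack_from(block, 0)
    for _ in range(count):
        cat, sub, next_label, n = LABEL.unpack_from(block, offset)
        if (cat, sub) == (category, subcategory):
            return list(struct.unpack_from(">%dd" % n, block, offset + LABEL.size))
        offset = next_label  # skip the array without reading it
    return None


block = new_block()
append_array(block, 3, 1, [1.0, 2.0])
append_array(block, 7, 1028, [0.25, 0.5, 0.75])
found = find_array(block, 7, 1028)
missing = find_array(block, 6, 1)
```

Absent data simply yields `None` here, which mirrors the requirement that missing data should not break the format.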


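Search pattern schematic

e.g. Looking for 7.1028 data in block 3

[Figure: data blocks 1 to 5 in the file, with the search path marked]
• File header : "There are 5 blocks"
• Block 1 prefix : "I am block 1", "Block 2 starts here"
• Block 2 prefix : "I am block 2", "Block 3 starts here"
• Block 3 prefix : "I am block 3", "There are 26 arrays in this block", "My first data label starts here"
• Labels compared with required 7.1028 ID code
• Matching label : "I label 7.1028 data", followed by the array of interest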


ATLAS format - review (1)

Ideal : “Holds data and meta data / related information”

Meta data storage was useful but on the down side was...
• Scattered
– some in the label, some in header
• Inflexible
– No easy way to augment meta data
– All block header sections had to be complete or left out

Also, block headers
• added large effort overhead to writing new software
• made innovation in other areas painful / tedious
• held meta data that was not all used, some of it never


ATLAS format - review (2)

Ideal : “Should be simple to write code to :”

– “find/read stored data of interest from the file”
  Very simple
  Small stable I/O routines library
  ID codes stored in one place : easy to maintain, easy to use
– “write any stored data to the file so it can be identified”
  Same success with same I/O routines and ID codes
– “append new data to the file, without disruption”
  Only possible to append to last block of the file


ATLAS format - review (3)

Ideal : "Handle (store/retrieve) data"

– flexibly (of any format, in any order)
  Storage order flexible (within a block)
  All required numerical formats supported (string as bytes)
– reliably (data should come back intact)
  Never reported any data errors in 9 years
– robustly (absent data should not break the format)
  Absent data does not break the format
  Earlier caveat about Meta Data applies
– with language / platform independence
  Never fully tested this point in the ATLAS format
  Was beyond the scope of the implementation
  Mostly used LabVIEW – format worked across LabVIEW versions
  T. Kohno wrote a file reader for ROOT, no problems reported


MonAliSA 2006 : A new format

• Want to read/write files from different operating systems
– C and Java for DAQ, analysis, simulation
– Run C inside LabVIEW on Windows XP (DAQ)
– Most analysis / simulation on Linux
– Java work with LiCAS


Broadening the scope

Once you have a cross platform file format :

• Want to offer to ATLAS, LiCAS, etc...
– Saves duplicating "reinvented wheels"
– Same I/O software for each group

• Hence same basic format / file structure
– Using ID codes for data finding
• ID code range expanded from 2 numbers to 5
• Each group will want control over their own ID codes / software versions


Feeding lessons / requirements into the new format (1)...

• Kept the Labelled Data arrays
– Data labels retain ID codes, strings, pointers
– Data labels drop meta data (instrument ID)

• Removed data blocks
– same DoF recorded with "instance" label element
– now possible to append any data array to the end of the file

[Figure: OLD : Arrays stored in blocks (block 1, block 2, ... block 5). NEW : Standalone Arrays, labelled 1st, 2nd, 3rd, ... 5th]


Feeding lessons / requirements into the new format (2)...

• Removed almost all header structures– Meta data stored in arrays like other data

• Very small remaining file header holds
– file compatibility information
• Group ID (e.g. 2 = MonAliSA)
• Format version ID (file / header structures change)
• ID codes look up table version

This needs further explanation


New format : Simpler file structure

File Header 13 bytes
• Group ID
• Software ID
• ID codes version
• File format version
• File lock
• Number of arrays in file

Array Label
• ID codes
• Instance count
• Pointer to next label
• Data type
• Number of array elements
• Error detection checksum
• Variable length, label string

One dimensional array
• byte (can be general purpose)
• 16,32,64 bit integers (big endian)
• IEEE 32 bit float, 64 bit double

KEY TO TEXT COLOURS
RED : Centralised - set by protocol definitions
BLUE : "Group" specific - managed by a "Group"
File / array specific
Immutable
Mutable
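
A minimal sketch of this structure follows. Only the field names and the 13 byte header total come from the slide. The individual field widths, the use of CRC-32 as the checksum, the type code `DOUBLE` and all function names are assumptions. Python is used for brevity, while the real libraries are in C, Java and LabVIEW.

```python
import struct
import zlib

# Hypothetical widths chosen only so that the header totals 13 bytes:
# group, software, ID codes version, format version, file lock, number of arrays
HEADER = struct.Struct(">HHHHBI")
# Hypothetical label: 5 ID codes, instance, pointer to next label, data type,
# number of elements, checksum, length of the label string that follows.
LABEL = struct.Struct(">5HHIBIIH")
DOUBLE = 8  # hypothetical data type code for IEEE 64 bit doubles


def new_file(group, software, id_version, fmt_version):
    return bytearray(HEADER.pack(group, software, id_version, fmt_version, 0, 0))


def append_array(buf, id_codes, instance, text, values):
    """Append label + string + array at the end, then bump the array count."""
    data = struct.pack(">%dd" % len(values), *values)
    name = text.encode("ascii")
    next_label = len(buf) + LABEL.size + len(name) + len(data)
    buf += LABEL.pack(*id_codes, instance, next_label, DOUBLE,
                      len(values), zlib.crc32(data), len(name)) + name + data
    fields = list(HEADER.unpack_from(buf, 0))
    fields[5] += 1  # the array count is a mutable header field
    HEADER.pack_into(buf, 0, *fields)


def find_array(buf, group, id_codes, instance):
    """Return the matching array, checking group ID and checksum on the way."""
    header = HEADER.unpack_from(buf, 0)
    if header[0] != group:
        raise ValueError("file was written with another group's ID codes")
    offset = HEADER.size
    for _ in range(header[5]):
        *ids, inst, next_label, _dtype, n, crc, name_len = LABEL.unpack_from(buf, offset)
        if tuple(ids) == tuple(id_codes) and inst == instance:
            start = offset + LABEL.size + name_len
            data = bytes(buf[start:start + 8 * n])
            if zlib.crc32(data) != crc:
                raise ValueError("checksum mismatch")
            return list(struct.unpack(">%dd" % n, data))
        offset = next_label
    return None


f = new_file(group=2, software=1, id_version=1, fmt_version=1)
append_array(f, (3, 1, 0, 0, 0), 1, "Long Ref raw data", [1.5, 2.5])
append_array(f, (3, 1, 0, 0, 0), 2, "Long Ref raw data", [3.5])
second = find_array(f, 2, (3, 1, 0, 0, 0), 2)
```

With no blocks, appending only ever touches the end of the file and the array count in the header. The "instance" element distinguishes repeated measurements that the old format separated into blocks.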


Obvious questions

• Is reading / writing data with the new format similar to the old? – ANSWER YES!

• Why does the new, simpler format appear to be so complicated?

• What is all this about "central" and "group" management?

• How much longer will this talk go on for?


All because of ID codes ...

• For the 1998 - ATLAS format :
– ID codes were created and used
• by 1 or 2 programmers
• in the one set of software
• written on one machine
• in one language
– Keeping ID codes unique and distinct was simple enough

• For the 2007 – MonAliSA format :
– ID codes could be created and used by
• any users
• for any software they require
• written on any number of machines
• in any language (although so far only C, LabVIEW, Java are possible)
– ID code clashes need to be prevented


ID code management : Central

Any project / person producing software
• Issued with a copy of GIACoNDE
– a "Group" level ID code management tool
– written in Java (for platform independence)
– has group ID hardwired as a constant

Any Binary File Reading software
• Can check for matching group ID in file header


ID code management : Group

Uses GIACoNDE tool
• Present state - Beta version
• Creates ID codes with 5 parts
• Publishes ID code template
– ID codes represented by named constants
– in C header files
– in Java interface files
– together with writing
• group ID
• software ID
• ID code template version


ID code management : GIACoNDE


Future outlook

• GIACoNDE is close to completion
– already produces useable output
– some polishing still to be done

• File I/O libraries already written and tested
– written in C, LabVIEW, Java
– files written by one language can be read by another
– other groups encouraged to use the software
– Java I/O will be added to LiCAS framework


Not just suitable for lab data...

For example :

Plan to use binary I/O in the next version of the 3 player game "Austerlitz" for saving the state of play


THE END
