The GriPhyN Virtual Data System
Ian Foster for the VDS team
Jim Annis, Steve Kent, Vijay Sehkri (Fermilab); Michael Milligan, Yong Zhao (University of Chicago)
Science as “Workflow”: E.g., Galaxy Cluster Search
[Figure: workflow DAG over Sloan data, shown with a log-log plot of the galaxy cluster size distribution: number of clusters (1 to 100,000) against number of galaxies (1 to 100)]
Requirements
Express complex multi-step “workflows”
  Perhaps 100,000s of individual tasks
Operate on heterogeneous distributed data
  Different formats & access protocols
Harness many computing resources
  Parallel computers &/or distributed Grids
Execute workflows reliably
  Despite diverse failure conditions
Enable reuse of data & workflows
  Discovery & composition
Support many users, workflows, resources
  Policy specification & enforcement
Virtual Data System
Express complex multi-step “workflows”
  Perhaps 100,000s of individual tasks
Operate on heterogeneous distributed data
  Different formats & access protocols
Harness many computing resources
  Parallel computers &/or distributed Grids
Execute workflows reliably & efficiently
  Despite diverse failure conditions
Enable reuse of data & workflows
  Discovery & composition
Support many users, workflows, resources
  Policy specification & enforcement
VDL, XDTM
Pegasus, DAGMan, Globus
VDC
TBD
Virtual Data System
[Figure: VDS architecture in three stages: workflow spec, create execution plan, Grid workflow execution. Labeled elements: VDL program, virtual data catalog, virtual data workflow generator, abstract workflow, local planner, job planner, job cleanup, statically partitioned DAG, dynamically planned DAG, DAGMan DAG, DAGMan & Condor-G]
Genome Analysis & DB Update (GADU)
600-1000+ CPUs
The Rest of the Talk
Express complex multi-step “workflows”
  Perhaps 100,000s of individual tasks
Operate on heterogeneous distributed data
  Different formats & access protocols
Harness many computing resources
  Parallel computers &/or distributed Grids
Execute workflows reliably & efficiently
  Despite diverse failure conditions
Enable reuse of data & workflows
  Discovery & composition
Support many users, workflows, resources
  Policy specification & enforcement
VDL, XDTM
Pegasus, DAGMan, Globus
VDC
TBD
Ewa
“Messy” Scientific Data
Diverse storage formats & access protocols
  Logically identical dataset can be stored in text file (e.g. CSV), binary file, spreadsheet
  Data available from filesystem, database, HTTP, WebDAV, etc.
Metadata encoded in directory & file names
  E.g.: “fMRI volume is composed of an image file & header file with same prefix”
Format dependency hinders program and workflow reuse
But... Data is Often Logically Structured
Scientific data often maintain hierarchical structure
A common practice is to select a set of data items and apply a transformation to each individual item
A nested approach of such iterations could scale up to millions of objects
Introducing a Typing System
Describe logical data structures as types …
  … & physical representations as mappings
Define procedures in terms of typed datasets
  … & apply procedures to different physical data
Compose workflows from typed procedures
Benefits
  Type checking
  Dataset selection and iteration
  Discovery by types
  Dynamic binding
  Type conversion
XDTM (Moreau, Zhao, Wilde, Foster)
XML Dataset Typing and Mapping
Separates logical structure from physical representations
Logical structure described by XML Schema
  Primitive scalar types: int, float, string, date …
  Complex types (structs and arrays)
Mapping descriptor
  How logical elements map to physical
  External parameters (e.g. location)
XPath for dataset selection
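As a rough illustration of XPath-based dataset selection, the sketch below runs path queries over a small logical dataset. The XML layout (`subject`, `run`, `v`, and the `id`/`n` attributes) is a hypothetical stand-in, not the XDTM representation, and Python's `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical view of one subject; element and attribute names are
# assumptions made for this sketch only.
LOGICAL_DATASET = """
<subject id="s1">
  <run id="1"><v n="001"/><v n="002"/></run>
  <run id="5"><v n="001"/></run>
</subject>
"""

root = ET.fromstring(LOGICAL_DATASET)

# Select the volumes of run 1 only, using an attribute predicate.
run1_volumes = [v.get("n") for v in root.findall("./run[@id='1']/v")]

# Select every volume in every run.
all_volumes = root.findall(".//v")
```

The same selection expressions keep working if the physical storage changes, because they address the logical structure only.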
Mapping
Define a common mapping interface
  Initialize, read, create, write, close
Data providers implement the interface
  Responsible for data access details
XView maintains cached logical datasets
[Figure: VDS, XViewMgr and XView, with a Mapper in front of each Data Source]
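One way to picture the common mapping interface is as an abstract base class carrying the five operations named on the slide. This is a minimal sketch: the method signatures and the `InMemoryMapper` provider are assumptions for illustration, not the VDS API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class Mapper(ABC):
    """The five operations from the slide; all signatures are assumed."""

    @abstractmethod
    def initialize(self, params: Dict[str, Any]) -> None: ...

    @abstractmethod
    def read(self, path: str) -> Any: ...

    @abstractmethod
    def create(self, path: str) -> None: ...

    @abstractmethod
    def write(self, path: str, value: Any) -> None: ...

    @abstractmethod
    def close(self) -> None: ...


class InMemoryMapper(Mapper):
    """Hypothetical data provider that keeps everything in a dict."""

    def initialize(self, params: Dict[str, Any]) -> None:
        self.store: Dict[str, Any] = dict(params.get("initial", {}))
        self.is_open = True

    def read(self, path: str) -> Any:
        return self.store[path]

    def create(self, path: str) -> None:
        self.store.setdefault(path, None)

    def write(self, path: str, value: Any) -> None:
        if path not in self.store:
            raise KeyError(f"{path} has not been created")
        self.store[path] = value

    def close(self) -> None:
        self.is_open = False


mapper = InMemoryMapper()
mapper.initialize({})
mapper.create("run1/v001")
mapper.write("run1/v001", b"voxels")
stored = mapper.read("run1/v001")
mapper.close()
```

A filesystem or database provider would implement the same five methods, so procedures written against `Mapper` would not need to change.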
Use Case: Functional MRI
Logical Structure
DBIC Archive
  Study #1
    Group #1
      Subject #1
        Anatomy
          high-res volume
        Functional Runs
          run #1
            volume #001
            ...
            volume #275
          ...
          run #5
            volume #001
            ...
          snrun #...
          …
    Group #5
    ...
  Study #...

Physical Representation
DBIC Archive
  Study_2004.0521.hgd
    Group_1
      Subject_2004.e024
        volume_anat.img
        volume_anat.hdr
        bold1_001.img
        bold1_001.hdr
        ...
        bold1_275.img
        bold1_275.hdr
        ...
        bold5_001.img
        ...
        snrbold*_*air*
        ...
    Group_5
    ...
  Study ...
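The step from physical file names to logical volumes can be sketched as follows. The regular expression is an assumption based on the `bold1_001.img` / `bold1_001.hdr` naming shown in this use case, and `map_runs` is a hypothetical helper rather than a VDS mapper.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Volume:
    img: str
    hdr: str


# Assumed naming convention: bold<run>_<volume>.img / .hdr
BOLD_NAME = re.compile(r"bold(\d+)_(\d+)\.(img|hdr)$")


def map_runs(filenames: List[str]) -> Dict[int, List[Volume]]:
    """Group image/header pairs that share a prefix into runs of volumes."""
    parts: Dict[Tuple[int, int], Dict[str, str]] = defaultdict(dict)
    for name in filenames:
        match = BOLD_NAME.match(name)
        if match:
            run, vol, ext = match.groups()
            parts[(int(run), int(vol))][ext] = name
    runs: Dict[int, List[Volume]] = defaultdict(list)
    for (run, _vol), files in sorted(parts.items()):
        if "img" in files and "hdr" in files:  # skip incomplete pairs
            runs[run].append(Volume(files["img"], files["hdr"]))
    return dict(runs)


listing = [
    "volume_anat.img", "volume_anat.hdr",
    "bold1_001.img", "bold1_001.hdr",
    "bold1_002.hdr", "bold1_002.img",
    "bold5_001.img",  # header missing, so no volume is produced
]
runs = map_runs(listing)
```

Procedures can then iterate over `runs[1]` as a typed run without knowing how the files are named on disk.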
Type Definitions in VDL
type Image {};
type Header {};
type Volume {
  Image img;
  Header hdr;
}
type Anat Volume;
type Warp {};
type NormAnat {
  Anat aVol;
  Warp aWarp;
  Volume nHires;
}
Part of fMRI AIRSN (Spatial Normalization) Workflow
type Run {
  Volume v [ ];
}
type Subject {
  Anat anat;
  Run run [ ];
  Run snrun [ ];
}
type Group {
  Subject s[ ];
}
type Study {
  Group g[ ];
}
Type Definitions in XML Schema
<xs:schema
    targetNamespace="http://www.fmri.org/schema/airsn.xsd"
    xmlns="http://www.fmri.org/schema/airsn.xsd"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="Image"/>
  <xs:simpleType name="Header"/>
  <xs:complexType name="Volume">
    <xs:sequence>
      <xs:element name="img" type="Image"/>
      <xs:element name="hdr" type="Header"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="Run">
    <xs:sequence minOccurs="0" maxOccurs="unbounded">
      <xs:element name="v" type="Volume"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
Procedure Definition in VDL
(Run snr) functional( Run r, NormAnat a, Air shrink ) {
  Run yroRun = reorientRun( r , "y" );
  Run roRun = reorientRun( yroRun , "x" );
  Volume std = roRun[0];
  Run rndr = random_select( roRun, .1 ); //10% sample
  AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, [81,3,3] );
  Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k");
  Volume meanRand = softmean(reslicedRndr, "y", null );
  Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, [81,3,3] );
  Volume mnQA = reslice( meanRand, mnQAAir, "o", "k" );
  Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
  Run nr = reslice_warp_run( boldNormWarp, roRun );
  Volume meanAll = strictmean ( nr, "y", null );
  Volume boldMask = binarize( meanAll, "y" );
  snr = gsmoothRun( nr, boldMask, 6, 6, 6 );
}
Dataset Iteration
Functional analysis expressed in typed datasets
Iterate over each volume in a run
[Figure: workflow graph with nodes reorientRun, reorientRun, reslice_warpRun, random_select, alignlinearRun, resliceRun, softmean, alignlinear, combinewarp, strictmean, gsmoothRun, binarize]
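Iterating a per-volume operation over a run can be sketched in a few lines. Here `Volume` is just a string and `reorient` only tags the name, both stand-ins assumed for illustration; the real step would invoke an imaging tool on each volume.

```python
from typing import List

Volume = str          # stand-in for a typed volume (image + header)
Run = List[Volume]    # a run is an ordered collection of volumes


def reorient(volume: Volume, axis: str) -> Volume:
    """Placeholder for the real per-volume tool; only tags the name."""
    return f"{volume}.ro{axis}"


def reorient_run(run: Run, axis: str) -> Run:
    """Run-level procedure: apply the volume-level one to each element."""
    return [reorient(volume, axis) for volume in run]


run = ["vol001", "vol002", "vol003"]
yro_run = reorient_run(run, "y")
ro_run = reorient_run(yro_run, "x")
```

Because each volume is processed independently, every element of the list comprehension could become a separate job in an execution plan.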
Expanded Execution Plan
[Figure: the workflow expanded into individual numbered jobs, with job numbers running up to 57, e.g. reorient/01, reorient/02, alignlinear/03, reslice/04, softmean/13, alignlinear/17, combinewarp/21, reslice_warp/22 to reslice_warp/38, strictmean/39, binarize/40, gsmooth/41 to gsmooth/50. Stage labels: reorient, reorient, alignlinear, reslice, softmean, alignlinear, combine_warp, reslice_warp, strictmean, binarize, gsmooth]
Datasets dynamically instantiated from data sources by mappers
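The expansion from a run-level workflow into individual jobs can be sketched as building a dependency graph and ordering it. This is a simplified three-stage plan with hypothetical job names, not the planner used by VDS; it relies on `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def expand_plan(volumes: List[str]) -> Dict[str, Set[str]]:
    """Map each job to the set of jobs it depends on."""
    deps: Dict[str, Set[str]] = {}
    for v in volumes:
        deps[f"reorient_y/{v}"] = set()                   # no prerequisites
        deps[f"reorient_x/{v}"] = {f"reorient_y/{v}"}     # per-volume chain
    # One aggregate job that needs every reoriented volume.
    deps["strictmean"] = {f"reorient_x/{v}" for v in volumes}
    return deps


plan = expand_plan(["001", "002", "003"])
order = list(TopologicalSorter(plan).static_order())
```

Jobs with no path between them in the graph, such as the per-volume reorient chains, can be dispatched to different resources at the same time.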
Functional MRI Execution
Code Size Comparison
Workflow     Script   Generator   VDL
GENATLAS1        49          72     6
GENATLAS2        97         135    10
FILM1            63         134    17
FEAT             84         191    13
AIRSN           215        ~400    37
Lines of code with different workflow encodings
Virtual Data Schema
[Figure: entity-relationship diagram]
Entities and attributes:
  Invocation: dvID, host, start, duration, exitcode, stats
  Call: nmspace, name, version
  Procedure: nmspace, name, version
  FormalArg: argname, type, direction
  ActualArg: argname, value
  Workflow: wfid, fromDV, toDV
  Dataset: nmspace, name
  Annotation: object, pred, type/val, user, date
Relationships (with cardinalities 1 and *): passes, executes, calls, binds, references, describes, uses, includes
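A few of the schema entities can be sketched as plain records to show how invocation data might be queried. The attribute names follow the diagram; the links between records (`Call.procedure`, `Invocation.call`) are assumptions about how the “calls” and “executes” relationships connect, and the sample records are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Procedure:
    nmspace: str
    name: str
    version: str


@dataclass
class Call:
    nmspace: str
    name: str
    version: str
    procedure: Procedure  # assumed link for the "calls" relationship


@dataclass
class Invocation:
    dv_id: str
    host: str
    start: str
    duration: float
    exitcode: int
    call: Call  # assumed link for the "executes" relationship


def failed_invocations(invocations: List[Invocation]) -> List[Invocation]:
    """Return the invocations that ended with a nonzero exit code."""
    return [inv for inv in invocations if inv.exitcode != 0]


# Invented sample records, for illustration only.
proc = Procedure("example", "reorient", "1.0")
call = Call("example", "reorient_call", "1.0", proc)
history = [
    Invocation("dv1", "nodeA", "2004-09-30 18:33:56", 120.0, 0, call),
    Invocation("dv2", "nodeB", "2004-09-30 19:00:00", 3.5, 1, call),
]
failures = failed_invocations(history)
```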
fMRI Virtual Data Queries
Which transformations can process a “subject image”?
  Q: xsearchvdc -q tr_meta dataType subject_image input
  A: fMRIDC.AIR::align_warp
List anonymized subject-images for young subjects:
  Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
  A: 3472-4_anonymized.img
Show files that were derived from patient image 3472-3:
  Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img
  A: 3472-3_anonymized.img
     3472-3_anonymized.sliced.hdr
     atlas.hdr
     atlas.img
     …
     atlas_z.jpg
     3472-3_anonymized.sliced.img
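A derivation query of the kind shown above amounts to walking a provenance graph from one file to everything produced from it. The sketch below is a generic breadth-first traversal over an invented catalog; the file names and edges are hypothetical and it does not reproduce how `xsearchvdc` is implemented.

```python
from collections import deque
from typing import Dict, List

# Hypothetical provenance records: output file -> files it was derived from.
DERIVED_FROM: Dict[str, List[str]] = {
    "input.sliced.img": ["input.img"],
    "input.sliced.hdr": ["input.img"],
    "atlas.img": ["input.sliced.img", "input.sliced.hdr"],
    "atlas_z.jpg": ["atlas.img"],
    "unrelated.out": ["other.img"],
}


def derived_files(source: str, derived_from: Dict[str, List[str]]) -> List[str]:
    """Return every file derived, directly or indirectly, from `source`."""
    children: Dict[str, List[str]] = {}
    for output, inputs in derived_from.items():
        for parent in inputs:
            children.setdefault(parent, []).append(output)
    seen, queue, result = {source}, deque([source]), []
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                result.append(child)
                queue.append(child)
    return result


tree = derived_files("input.img", DERIVED_FROM)
```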
Provenance for ATLAS DC2 (High Energy Physics)
How much compute time was delivered?
| years | mon | year |
+-------+-----+------+
| .45   | 6   | 2004 |
| 20    | 7   | 2004 |
| 34    | 8   | 2004 |
| 40    | 9   | 2004 |
| 15    | 10  | 2004 |
| 15    | 11  | 2004 |
| 8.9   | 12  | 2004 |
+-------+-----+------+
Selected statistics for one of these jobs:
  start: 2004-09-30 18:33:56
  duration: 76103.33
  pid: 6123
  exitcode: 0
  args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556 ... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
  utime: 75335.86
  stime: 28.88
  minflt: 862341
  majflt: 96386
Which Linux kernel releases were used?
How many jobs were run on a Linux 2.4.28 Kernel?
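The monthly figures in the compute-time table can be totalled with a short calculation. This sketch assumes the “years” column means CPU-years delivered in each month of 2004; the values are copied from the table.

```python
# Month of 2004 -> compute time delivered (assumed to be CPU-years).
cpu_years = {6: 0.45, 7: 20.0, 8: 34.0, 9: 40.0, 10: 15.0, 11: 15.0, 12: 8.9}

# Total over June through December, and the busiest month.
total_cpu_years = sum(cpu_years.values())
peak_month = max(cpu_years, key=cpu_years.get)
```

Under that reading, the seven months add up to roughly 133 CPU-years, with September the busiest month.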
LIGO Inspiral Search Application
Describe…
Inspiral workflow application is the work of Duncan Brown, Caltech,
Scott Koranda, UW Milwaukee, and the LSC Inspiral group
FOAM: Fast Ocean/Atmosphere Model
250-Member Ensemble Run on TeraGrid under VDS
[Figure: workflow per ensemble member (1, 2, ... N): remote directory creation, the FOAM run, and atmos, ocean, and coupl postprocessing, with results transferred to archival storage]
Work of: Rob Jacob (FOAM), Veronica Nefedova (workflow design and execution)
FOAM and VDS
Climate supercomputer and grad student: 160 ensemble members in 75 days
TeraGrid and VDS: 250 ensemble members in 4 days
Visualization courtesy Pat Behling and Yun Liu, UW Madison
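The two ensemble figures on this slide can be compared as throughput. The sketch assumes the 160-member, 75-day figure belongs to the supercomputer run and the 250-member, 4-day figure to the TeraGrid run under VDS; the variable names are illustrative.

```python
# Ensemble members completed per day in each setting (pairing assumed).
supercomputer_rate = 160 / 75   # members per day
teragrid_vds_rate = 250 / 4     # members per day

# How many times faster the second setting completed ensemble members.
speedup = teragrid_vds_rate / supercomputer_rate
```

That works out to about 2 members per day against 62.5, close to a thirtyfold difference in throughput.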
Summary: Science as Workflow
[Figure: workflow states (executed, executing, executable, not yet executable) labeled “What I Did”, “What I Am Doing”, and “What I Want to Do”, with query, edit, and schedule operations and the execution environment]
Acknowledgements
The Virtual Data System group is:
  ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi
  U of Chicago: Ben Clifford, Ian Foster, Mike Wilde, Yong Zhao
GriPhyN is supported by the NSF
Many research efforts involved in this work are supported by the US Department of Energy, Office of Science