Date post: | 10-May-2015 |
Category: | Technology |
Upload: | ian-foster |
Opportunities for X-ray science in future computing architecture
Ian Foster
Computation Institute
University of Chicago & Argonne National Laboratory
[Figure: Fastest supercomputer (floating point ops/sec) by year introduced, 1940 to 2010, on a log scale from 1E+2 to 1E+17, running from ENIAC (vacuum tubes) through transistor, IC, vector, parallel-vector, and MPP machines to petaflop and multi-petaflop systems. Doubling time = 1.5 yr. Annotations: Argonne; My laptop.]
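The doubling time quoted on the chart implies a simple exponential growth law. As a minimal sketch (the function name and the sample spans are illustrative, not taken from the slides), the growth factor over a span of years is 2^(years / 1.5):

```python
def growth_factor(years: float, doubling_time_years: float = 1.5) -> float:
    """Multiplicative speed-up after `years`, assuming steady exponential growth."""
    return 2.0 ** (years / doubling_time_years)


# Illustrative spans only; the chart's individual data points are not reproduced here.
for span in (1.5, 15, 30):
    print(f"{span:>4} years -> x{growth_factor(span):.3g}")
```

Under this assumption, 15 years corresponds to ten doublings (about a thousandfold) and 30 years to twenty doublings (about a millionfold).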
Brahe: Acquire funding, build apparatus, collect data (durations shown: 30 years, ? years)
Kepler: Publish, analyze data, acquire data (durations shown: 10 years, 6 years, 2 years)
Steal data; poison advisor
Computers at Harvard, 1890
Sloan Digital Sky Survey
Aggregate SkyServer monthly traffic from 2001 to 2006. (Singh et al., 2006)
Sloan Digital Sky Survey publication statistics, Chen et al., 2009.
Three discontinuities:
1) Massive parallelism
2) Large data
3) Economics of aggregation
Intel x86 processor trends
Gordon Bell prize winners
Year | Flop/s | Processors | Application
1988 | 10^9 | 10^1 | Static finite element analysis
1998 | 10^12 | 10^3 | Metal magnetic atoms
2008 | 10^15 | 10^5 | Superconductive materials
2018 | 10^18? | 10^7? | ??
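The question marks in the 2018 row follow from extending the per-decade growth in the earlier rows. A small sketch of that arithmetic, reading the Flop/s and Processors columns as powers of ten (10^9 and 10^1 for 1988, and so on); the function names are illustrative:

```python
# Rows as read from the table above: (year, flop/s, processors).
winners = [(1988, 1e9, 1e1), (1998, 1e12, 1e3), (2008, 1e15, 1e5)]


def per_decade_factors(rows):
    """Growth factors between consecutive rows: (flop/s factor, processor factor)."""
    return [(b[1] / a[1], b[2] / a[2]) for a, b in zip(rows, rows[1:])]


def extrapolate(rows, decades=1):
    """Project the last row forward, assuming the most recent factors persist."""
    flops_factor, proc_factor = per_decade_factors(rows)[-1]
    year, flops, procs = rows[-1]
    return (year + 10 * decades,
            flops * flops_factor ** decades,
            procs * proc_factor ** decades)


for year, flops, procs in winners + [extrapolate(winners)]:
    print(year, f"{flops:.0e} flop/s", f"{procs:.0e} processors",
          f"{flops / procs:.0e} flop/s per processor")
```

On these numbers, total performance grows about 1000x per decade while processor count grows about 100x, so only about 10x per decade comes from faster individual processors; the rest comes from parallelism.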
[Figure: simulation complexity increasing over time, along these axes]
Dimensions: 1, 2, 3
Timescale: Short, Long, Multiscale
Resolution: Coarse, Fine, Adaptive
Coupled (& non-linear) equations: Few, Many
Parameters or ensemble members: 1, Many
Error analysis: No, Yes
Optimization: No, Yes
Algorithms: Simple, Complex
Dan Katz
Design: Materials with desired properties based on computation and data
Create: Synthesis and processing methods informed by computation; generate data
Understand: Relationship between materials properties and structure
Rational design of catalytic materials (Curtis, Greely, Zapol, Kumaran)
Identifying optimal candidates
High-throughput screening on BG/P
[SC08] “Towards Loosely-Coupled Programming on Petascale Systems”
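High-throughput screening fits the loosely coupled style: each candidate is scored independently, so the work is a large bag of tasks rather than one tightly coupled simulation. The sketch below shows only that pattern on a single machine with a thread pool. The scoring function and the candidate names are hypothetical placeholders, and nothing here represents the software actually run on BG/P.

```python
from concurrent.futures import ThreadPoolExecutor


def score_candidate(candidate: str) -> float:
    """Hypothetical stand-in for an expensive, independent simulation."""
    return sum(ord(ch) for ch in candidate) % 100 / 100.0


def screen(candidates, top_n=3, max_workers=4):
    """Score every candidate independently, then keep the best `top_n`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_candidate, candidates))  # order matches input
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


candidates = [f"material-{i:03d}" for i in range(20)]  # made-up identifiers
for name, score in screen(candidates):
    print(name, score)
```

Because no task depends on another, the same structure scales by adding workers; on a large system each task would typically be a separate job or process rather than a thread.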
Three discontinuities:
1) Massive parallelism
2) Large data
3) Economics of aggregation
PC disk drive capacity
Data generation and analysis costs outpace Moore’s Law
$900,000
Wilkening et al, IEEE Cluster09
Data complexity also increasing
ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
[source: GlaxoSmithKline]
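Part of what makes a record like this complex is that each two-letter line code (ID, DE, GN, OS, OC, KW, FT, SQ) carries a different kind of information with its own internal syntax. A minimal parsing sketch follows; it assumes one code per line followed by spaces, uses only a few lines of the record, and is not a full parser for the format.

```python
from collections import defaultdict

# A few lines from the record above, one two-letter line code per line (assumed layout).
RECORD = """\
ID   MURA_BACSU STANDARD; PRT; 429 AA.
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
"""


def parse_record(text: str) -> dict:
    """Group line contents by their leading two-letter code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2 or not line[:2].isalpha():
            continue  # skip blank lines and indented sequence lines
        code, _, rest = line.partition(" ")
        fields[code].append(rest.strip())
    return dict(fields)


fields = parse_record(RECORD)
keywords = [k.strip() for k in fields["KW"][0].rstrip(".").split(";")]
print(fields["OS"], keywords, len(fields["FT"]))
```

Even this toy version has to treat keywords, feature lines, and names differently, which illustrates why analysis effort grows with data complexity and not only with volume.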
Volume
Complexity
Analysis demands
Bob Grossman
“light sources alone are not enough … Enormous data sets of diffracted signals in reciprocal space and across wide energy ranges must be collected and analyzed in real time so that they can guide the ongoing experiments.”
 | Diamond Light Source | National Crystallography Service (NCS) | Local Earth Sciences Lab, University of Cambridge
Function | International service - multiple communities | UK service - multiple institutions. Also uses Diamond | Lone researcher at institution - uses NCS and ISIS large-scale facility
Administration | Peer-reviewed proposal required | Paper-based records – experiments, safety ERA, instrument time | Multiple proposals, multiple forms
Metadata | Core Scientific MetaData Model | eBank/eCrystals schema | ?
Identifiers | Beam-line number | DOI, InChI | ?
Workflow | Formulaic and bespoke | Formulaic, unrecorded | Complex, unrecorded
Software | In-house scripts | In-house scripts + open-source suite | In-house scripts + open-source suite
Raw data | In-house GDA store | ATLAS data-store | Laptop / local server
Derived data | Taken offsite on laptop / USB stick | eCrystals repository | Laptop / local server / USB stick
Source: Liz Lyon
Pattern recognition in x-ray spectromicroscopy
• Kevin Boyce, U. Chicago: study of the evolution of tree types, including now-extinct species that dominated in the “coal age” (Carboniferous). Acetate peel of fossilized wood.
• Shows how well we can separately map cellulose-derived material from lignin-derived material in plant cell walls, with implications for cellulosic ethanol production from biomass.
Lignin-derived and cellulose-derived regions in 400 million year old chert: Boyce et al., Proc. Nat. Acad. Sci. 101, 17555 (2004), with subsequent pattern recognition analysis by Lerotic, Jacobsen, Schäfer, and Vogt, Ultramicroscopy 100, 35 (2004).
LDRD: “Next Generation Data Exploration - Intelligence in Data Analysis, Visualization, & Mining”
• “Here’s a cell in this tissue. How much zinc does it have? In the rest of the tissue, how many cells are there like this, and what is their distribution of zinc content?”
  – Fluorescence and absorption spectral imaging
  – Databases to combine results of multiple experiments and instruments
  – Multivariate statistical analysis and pattern recognition
• People:
  – APS: Stefan Vogt (PI), Lydia Finney, Chris Jacobsen, Chris Roerhig, Claude Saunders, Jesse Ward; Mathematics and Computer Science, ANL: Sven Leyffer, Stefan Wild, Mark Hereld; Northwestern: Rachel Mak
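Questions of the form "how many cells are there like this" amount to grouping pixels or cells by the similarity of their spectra. As a rough illustration only, the sketch below runs plain k-means on a handful of made-up three-channel spectra. K-means is used here as a generic clustering technique and is not claimed to be the method used in the cited work; the data values are invented.

```python
def kmeans(spectra, k, iterations=20):
    """Plain k-means on per-pixel spectra (lists of floats); returns one label per pixel."""
    # Deterministic seeds: evenly spaced picks from the input.
    centroids = [list(spectra[i * len(spectra) // k]) for i in range(k)]
    labels = [0] * len(spectra)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, s in enumerate(spectra):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(s, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [s for s, lab in zip(spectra, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented spectra: the first two pixels resemble each other, as do the last two.
pixels = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.1, 0.9, 0.7], [0.2, 0.8, 0.9]]
print(kmeans(pixels, k=2))
```

Counting the labels then answers the "how many like this" question, and per-cluster statistics of a chosen channel give a distribution such as zinc content.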
Wavelength Division Multiplexing
“Lambdas”
Rapid evolution of 10GbE port prices makes campus-scale 10 Gbps affordable
Years on axis: 2005, 2007, 2009, 2010
Price points, in chart order:
• $80K/port, Chiaro (60 max)
• $5K, Force 10 (40 max)
• $500, Arista, 48 ports
• ~$1000 (300+ max)
• $400, Arista, 48 ports
Source: Philip Papadopoulos, SDSC, UCSD
Three discontinuities:
1) Massive parallelism
2) Large data
3) Economics of aggregation
Software-as-a-Service (SaaS)
Platform-as-a-Service (PaaS)
Infrastructure-as-a-Service (IaaS)
Economies of scale in operations
Resource | Cost for medium scale | Cost for large scale | Ratio
Network | $95 / Mbps / month | $13 / Mbps / month | ~7x
Storage | $2.20 / GB / month | $0.40 / GB / month | ~6x
Administration | ≈140 servers/admin | >1000 servers/admin | ~7x
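The ratio column can be checked directly from the two cost columns. A short sketch of that arithmetic, using the figures from the table above (the variable names are illustrative):

```python
# Figures as given in the table above: (medium scale, large scale).
costs = {
    "Network ($/Mbps/month)": (95.0, 13.0),
    "Storage ($/GB/month)": (2.20, 0.40),
}
# The table says "≈140" and ">1000", so the administration ratio is a lower bound.
servers_per_admin = (140, 1000)


def ratio(medium: float, large: float) -> float:
    """How many times more the medium-scale operator pays per unit."""
    return medium / large


for name, (medium, large) in costs.items():
    print(f"{name}: {ratio(medium, large):.1f}x cheaper at large scale")

admin_ratio = servers_per_admin[1] / servers_per_admin[0]
print(f"Administration: at least {admin_ratio:.1f}x more servers per admin")
```

The computed values are about 7.3x, 5.5x, and at least 7.1x, consistent with the rounded ~7x, ~6x, and ~7x shown.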
Time-consuming tasks in business
• Web presence
• Email (hosted Exchange)
• Calendar
• Telephony (hosted VOIP)
• Human resources and payroll
• Accounting
• Customer relationship mgmt
• Data analytics
• Content distribution
• …
SaaS
IaaS
Time-consuming tasks in science
• Run experiments
• Collect data
• Manage data
• Move data
• Acquire computers
• Analyze data
• Run simulations
• Compare experiment with simulation
• Search the literature
• Communicate with colleagues
• Publish papers
• Find, configure, install relevant software
• Find, access, analyze relevant data
• Order supplies
• Write proposals
• Write reports
• …
From http://geekandpoke.typepad.com
Globus Toolkit: Build the Grid
Components for building custom grid solutions
globustoolkit.org
Globus Online: Use the Grid
Cloud-hosted file transfer service
globusonline.org
A peek inside Globus Online
[Architecture diagram. Components shown: request collector, consumers (several), workers (several), GridFTP endpoints (two), datastore holding profiles + state, notification target.]
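The slide names the components but not how they are wired together. One common arrangement for a collector, queue consumers, and a pool of workers is sketched below; this wiring is an assumption made for illustration, not a description of the actual Globus Online implementation. The endpoint names are made up and the "transfer" is a placeholder that always succeeds.

```python
import queue
import threading


def run_transfers(requests, num_workers=3):
    """Fan requests out to worker threads and collect one status per request."""
    pending = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work for this worker
                pending.task_done()
                return
            task_id, source, dest = item
            # Placeholder for a real transfer; here every task "succeeds".
            with lock:
                results[task_id] = f"SUCCEEDED {source} -> {dest}"
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for request in requests:  # the "request collector" role
        pending.put(request)
    for _ in threads:  # one sentinel per worker
        pending.put(None)
    pending.join()
    for t in threads:
        t.join()
    return results


# Endpoint names below are invented for illustration.
requests = [(i, f"siteA:/data/file{i}", f"siteB:/data/file{i}") for i in range(5)]
print(run_transfers(requests))
```

Decoupling submission from execution in this way lets the service accept requests immediately and retry or reschedule individual tasks without involving the user.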
ALCF-NERSC task summary
Task ID : bc6d776c-2af4-11e0-9a1d-12313916526c
Task Type : TRANSFER
Parent Task ID : n/a
Status : SUCCEEDED
Request Time : 2011-01-28 15:39:04Z
Deadline : 2011-01-29 15:39:04Z
Completion Time : 2011-01-28 16:17:12Z
Total Tasks : 500
Tasks Successful : 500
Tasks Expired : 0
Tasks Canceled : 0
Tasks Failed : 0
Tasks Pending : 0
Tasks Retrying : 0
Command : transfer (+500 input lines)
Files : 500
Directories : 0
Bytes Transferred: 1073741824000
MBits/sec : 3754.342
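The reported rate in the task summary can be reproduced from the byte count and the two timestamps. The sketch below assumes the rate is measured from request time to completion time and that one Mbit is 10^6 bits; both are assumptions, though the result agrees with the reported figure.

```python
from datetime import datetime

# Values copied from the task summary above.
request_time = "2011-01-28 15:39:04Z"
completion_time = "2011-01-28 16:17:12Z"
bytes_transferred = 1073741824000
files = 500
reported_mbits_per_sec = 3754.342


def parse_utc(stamp: str) -> datetime:
    """Parse the 'YYYY-MM-DD HH:MM:SSZ' stamps used in the summary."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%SZ")


elapsed_s = (parse_utc(completion_time) - parse_utc(request_time)).total_seconds()
mbits_per_sec = bytes_transferred * 8 / elapsed_s / 1e6  # assumes 1 Mbit = 10**6 bits
bytes_per_file = bytes_transferred // files

print(f"elapsed: {elapsed_s:.0f} s, throughput: {mbits_per_sec:.3f} Mbit/s")
print(f"average file size: {bytes_per_file} bytes")
```

The elapsed time is 2288 seconds (38 minutes 8 seconds), and the average file size works out to 2^31 bytes, that is 500 files of 2 GiB each.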
11 x 125 files, 200 MB each
11 users, 12 sites
Keith Cheng’s phenome project
Gordon Kindlmann
3000 zebrafish mutants
[Diagram: beamline data flow for the phenome project]
Sites: Argonne National Lab, Advanced Photon Source; Argonne / U Chicago Grid Supercomputing Facility; Penn State University, Phenome Project Coordination
Components: APS beamline, data acquisition, HPC cluster, GridFTP servers (three), SAN, NAS, DAS, graphics workstations, users
Service: Globus Online - hosted service for high-speed, reliable, secure data movement
Legend: 1 Gbps network link, 10 Gbps network link, regular Internet link, beamline data flow
Processing: tomographic reconstruction, deringing, segmentation, morphometrics & visualization; pattern recognition; segmentation & visualization; software development
Four theses
• Ultrascale computing enables new problem-solving methods
• Research data management is an essential service like electricity and networking
• Economies of scale motivate highly aggregated computing and storage
• Automation of science processes accelerates discovery and yields competitive advantage