Managing Data for the World Wide Telescope

aka: The Virtual Observatory

Jim Gray

Alex Szalay

SLAC Data Management Workshop


The Evolution of Science

• Observational Science
  – Scientist gathers data by direct observation
  – Scientist analyzes data
• Analytical Science
  – Scientist builds analytical model
  – Makes predictions
• Computational Science
  – Simulate analytical model
  – Validate model and make predictions
• Data Exploration Science
  – Data captured by instruments, or data generated by simulator
  – Processed by software
  – Placed in a database / files
  – Scientist analyzes database / files


Information Avalanche

• In science, industry, government, ...
  – Better observational instruments and better simulations are producing a data avalanche
• Examples
  – BaBar: grows 1 TB/day (2/3 simulation information, 1/3 observational information)
  – CERN: LHC will generate 1 GB/s, ~10 PB/y
  – VLBA (NRAO) generates 1 GB/s today
  – Pixar: 100 TB/movie
• New emphasis on informatics:
  – Capturing, organizing, summarizing, analyzing, visualizing

[Images: image courtesy C. Meneveau & A. Szalay @ JHU; BaBar, Stanford; Space Telescope; P&E Gene Sequencer, from http://www.genome.uci.edu/]


The Big Picture

[Diagram; surviving labels: "Experiments &", "Other Archives", "facts", "?"]

The Big Problems

• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it?
• How to coexist with others?
• Query and Vis tools
• Support/training
• Performance
  – Execute queries in a minute
  – Batch query scheduling



FTP - GREP

• Download (FTP and GREP) is not adequate
  – You can GREP 1 MB in a second
  – You can GREP 1 GB in a minute
  – You can GREP 1 TB in 2 days
  – You can GREP 1 PB in 3 years
• Oh!, and 1 PB ~ 3,000 disks
• At some point we need indices to limit search, plus parallel data search and analysis
• This is where databases can help
• Next generation technique: Data Exploration
  – Bring the analysis to the data!
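The GREP figures above are back-of-envelope estimates. A minimal sketch of the arithmetic follows, assuming a sustained scan rate of about 10 MB/s. That rate is an assumption chosen because it roughly reproduces the terabyte and petabyte figures; the slide does not state one.

```python
# Back-of-envelope sequential scan times.
# ASSUMPTION: 10 MB/s sustained scan rate; the slide does not state a rate.
SCAN_RATE_BYTES_PER_SEC = 10e6

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def scan_seconds(size_bytes, rate=SCAN_RATE_BYTES_PER_SEC):
    """Time to read size_bytes once, front to back, at the given rate."""
    return size_bytes / rate


if __name__ == "__main__":
    for label, size in [("1 GB", 1e9), ("1 TB", 1e12), ("1 PB", 1e15)]:
        secs = scan_seconds(size)
        print(f"{label}: {secs / SECONDS_PER_DAY:.3g} days "
              f"({secs / SECONDS_PER_YEAR:.3g} years)")
```

Under this assumption a terabyte takes a little over a day and a petabyte a little over three years, the same order as the figures on the slide.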


The Speed Problem

• Many users want to search the whole DB with ad hoc queries, often combinatorial
• Want ~1 minute response
• Brute force (parallel search):
  – 1 disk = 50 MBps => ~1M disks/PB ~ 300M$/PB
• Indices (limit search, do column store)
  – 1,000x less equipment: 1M$/PB
• Pre-compute answer
  – No one knows how to do it for all questions.
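The brute-force estimate can be checked with a few lines of arithmetic. This is a sketch only: the 50 MBps disk rate and the one-minute target come from the slide, while the function name and the `fraction_scanned` parameter are hypothetical, introduced here to model an index that touches 1/1000 of the data.

```python
import math

DISK_BYTES_PER_SEC = 50e6   # from the slide: 1 disk = 50 MBps
TARGET_SECONDS = 60         # from the slide: ~1 minute response


def disks_needed(data_bytes, fraction_scanned=1.0):
    """Disks that must read in parallel to finish within TARGET_SECONDS.

    fraction_scanned is a hypothetical knob: 1.0 means brute force,
    0.001 models an index or column store that reads 1/1000 of the data.
    """
    bytes_per_disk = DISK_BYTES_PER_SEC * TARGET_SECONDS
    return math.ceil(data_bytes * fraction_scanned / bytes_per_disk)


brute_force = disks_needed(1e15)          # scan the whole petabyte
with_index = disks_needed(1e15, 0.001)    # scan one part in a thousand
```

With these inputs the brute-force count comes out at roughly 330,000 disks, the same order of magnitude as the rounded "~1M disks/PB" on the slide, and the indexed case needs about a thousand times fewer.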


Next-Generation Data Analysis

• Looking for
  – Needles in haystacks: the Higgs particle
  – Haystacks: dark matter, dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
  – Correlation functions are N^2, likelihood techniques N^3
• As data and computers grow at the same rate, we can only keep up with N log N
• A way out?
  – Relax notion of optimal (data is fuzzy, answers are approximate)
  – Don’t assume infinite computational resources or memory
• Combination of statistics & computer science


Analysis and Databases

• Much statistical analysis deals with
  – Creating uniform samples – data filtering
  – Assembling relevant subsets
  – Estimating completeness – censoring bad data
  – Counting and building histograms
  – Generating Monte-Carlo subsets
  – Likelihood calculations
  – Hypothesis testing
• Traditionally these are performed on files
• Most of these tasks are much better done inside a database
• Move Mohamed to the mountain, not the mountain to Mohamed.
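As an illustration of doing filtering and histogramming inside the database rather than on downloaded files, here is a minimal sketch using Python's built-in sqlite3 module. The table `obj`, its columns, and the synthetic data are all hypothetical; a real archive would have its own schema.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, mag REAL, flags INTEGER)")

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [(i, rng.uniform(14.0, 24.0), rng.choice([0, 0, 0, 1]))
        for i in range(10_000)]
conn.executemany("INSERT INTO obj VALUES (?, ?, ?)", rows)

# Censor bad data (flags != 0), keep a magnitude range,
# and build a histogram in 1-magnitude bins, all in one query.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS mag_bin, COUNT(*) AS n
    FROM obj
    WHERE flags = 0 AND mag >= 16.0 AND mag < 22.0
    GROUP BY mag_bin
    ORDER BY mag_bin
    """
).fetchall()

for mag_bin, n in histogram:
    print(mag_bin, n)
```

Only the small histogram leaves the database; the 10,000 rows never travel to the client, which is the point of bringing the analysis to the data.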


Organization & Algorithms

• Use of clever data structures (trees, cubes):
  – Up-front creation cost, but only N log N access cost
  – Large speedup during the analysis
  – Tree-codes for correlations (A. Moore et al 2001)
  – Data Cubes for OLAP (all vendors)
• Fast, approximate heuristic algorithms
  – No need to be more accurate than cosmic variance
  – Fast CMB analysis by Szapudi et al (2001)
• N log N instead of N^3 => 1 day instead of 10 million years
• Take cost of computation into account
  – Controlled level of accuracy
  – Best result in a given time, given our computing resources
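To make the "up-front creation cost, N log N access cost" trade concrete, here is a small sketch that counts close pairs on a line two ways. It is a deliberately simple one-dimensional stand-in (sort once, then binary search), not the tree codes of Moore et al.; the function names and test data are hypothetical.

```python
import bisect
import random


def count_pairs_brute(xs, r):
    """O(N^2): test every pair of points."""
    n = len(xs)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if abs(xs[i] - xs[j]) <= r)


def count_pairs_sorted(xs, r):
    """O(N log N): sort once, then binary-search each point's neighbours."""
    s = sorted(xs)                      # up-front creation cost
    total = 0
    for i, x in enumerate(s):
        # Index one past the last point whose value is <= x + r.
        hi = bisect.bisect_right(s, x + r)
        total += hi - i - 1             # points after i within distance r
    return total


rng = random.Random(1)
points = [rng.randrange(1_000_000) for _ in range(500)]  # integer positions
same_answer = count_pairs_brute(points, 500) == count_pairs_sorted(points, 500)
```

Both functions return the same count; only the amount of work differs, and the gap widens quickly as N grows.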


World Wide Telescope
Virtual Observatory

• Premise: Most data is (or could be) online
• The Internet is the world’s best telescope:
  – It has data on every part of the sky
  – In every measured spectral band: optical, x-ray, radio, ...
  – As deep as the best instruments (2 years ago).
  – It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no ...).
  – It’s a smart telescope: links objects and data to literature on them.


Why Astronomy?

• Community has lots of data
• Data is real and well documented
  – High-dimensional (with confidence intervals)
  – Spatial, temporal
• Diverse and distributed
  – Many different instruments from many different places and many different times
• Community wants to share/cross compare
  – Can freely share data and algorithms.
  – “DataMining, Not Data MINE!!” Mark Ellisman, UCSD
• They are well organized
• Community is small and homogeneous
• No commercial or privacy concerns
  – All the problems are technical or social.


The WWT Components

• Data Sources
  – Literature
  – Archives
• Unified Definitions
  – Units
  – Semantics/Concepts/Metrics, Representations
  – Provenance
• Object model
• Classes and methods
• Portals


Data Sources

• Literature online and cross indexed
  – Simbad, ADS, NED: http://simbad.u-strasbg.fr/Simbad, http://adswww.harvard.edu/, http://nedwww.ipac.caltech.edu/
• Many curated archives online
  – FIRST, DPOSS, 2MASS, USNO, IRAS, SDSS, VizieR, ...
  – Typically files with English meta-data and some programs
• Groups, Researchers, Amateurs Publish
  – Datasets online in various formats
  – Data publications are ephemeral (may disappear)
  – Many have unknown provenance
• Documentation varies; some good and some none.


Unified Definitions

• Universal Content Definitions (UCDs)
  – Collated all table heads from all the literature
  – 100,000 terms reduced to ~1,500
  – Rough consensus that this is the right thing.
  – Refinement in progress as people use UCDs
• Defines
  – Units:
    • gram, radian, second, jansky, ...
  – Semantic Concepts / Metrics
    • Std error, Chi2 fit, magnitude, flux @ passband, velocity, ...


Provenance

• Most data will be derived.
• To do science, need to trace derived data back to source.
• So programs and inputs must be registered.
• Must be able to re-run them.
• Example: Space Telescope Calibrated Data
  – Run on demand
  – Can specify software version (to get old answers)
• Scientific Data Provenance and Curation are largely unsolved problems (some ideas but no science).
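One way to picture "programs and inputs must be registered" and "must be able to re-run them" is the toy sketch below. Everything in it is hypothetical (the in-memory registry, the `calibrate` program, its two versions, and the numbers); it is not how the Space Telescope pipeline works, only an illustration of keeping enough information to get old answers back.

```python
import hashlib
import json

REGISTRY = {}  # hypothetical in-memory registry: "name@version" -> callable


def register(name, version, func):
    """Record a program under an explicit version."""
    REGISTRY[f"{name}@{version}"] = func


def derive(name, version, inputs):
    """Run a registered program and return the result with its provenance."""
    key = f"{name}@{version}"
    result = REGISTRY[key](**inputs)
    record = {"program": key, "inputs": inputs}
    # A stable fingerprint of how the result was produced.
    record["fingerprint"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return result, record


def rerun(record):
    """Recompute a derived value from its provenance record alone."""
    name, version = record["program"].split("@")
    return derive(name, version, record["inputs"])[0]


# Two versions of a toy calibration step (made-up arithmetic).
register("calibrate", "1.0", lambda counts, gain: counts * gain)
register("calibrate", "1.1", lambda counts, gain: counts * gain - 2)

value, prov = derive("calibrate", "1.0", {"counts": 100, "gain": 3})
old_answer = rerun(prov)   # still uses version 1.0, even though 1.1 exists
```

Because the record names the software version, re-running it reproduces the old answer even after a newer version has been registered.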


Object Model

• General acceptance of XML
• Recent acceptance of XML Schema (XSD over DTD)
• Wait-and-See about SOAP/WSDL/…
  – “Web Services are just Corba with angle brackets.”
  – “FTP is good enough for me.”
• Personal opinion:
  – Web Services are much more than “Corba + <>”
  – Huge focus on interop
  – Huge focus on integrated tools
• But the community says “Show me!”
  – Many technologists convinced, but not yet the astronomers


[Diagram; surviving labels: "Your program", "Web", "Web Service", "Data in your address"]






Classes and Methods

• First Class: VOTable http://www.us-vo.org/VOTable/
  – Represents an answer set in XML
    • Defined by an XML Schema (XSD)
    • Metadata (in terms of UCDs)
    • Data representation (numbers and text)
  – First method
    • Cone Search: Get objects in this cone
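The idea behind Cone Search, "get objects in this cone", can be sketched as a filter on angular separation. This is a local illustration only: the real method is a web service that returns a VOTable, and a real archive would use a spatial index rather than a linear scan. The catalog rows and function names below are hypothetical.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable for small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return catalog rows whose position lies within radius_deg of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]


# Made-up catalog: positions in decimal degrees.
catalog = [
    {"id": "a", "ra": 180.00, "dec": 0.00},
    {"id": "b", "ra": 180.05, "dec": 0.02},
    {"id": "c", "ra": 181.00, "dec": 0.00},
    {"id": "d", "ra": 359.99, "dec": 89.99},
]
hits = cone_search(catalog, ra=180.0, dec=0.0, radius_deg=0.1)
```

Here `hits` contains objects "a" and "b"; "c" is a full degree away and "d" is near the pole.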





Other Classes

• Space-Time class
  – http://hea-www.harvard.edu/~arots/nvometa/STCdoc.pdf
• Image Class (returns pixels)
  – SdssCutout
  – Simple Image Access Protocol
  – HyperAtlas http://bill.cacr.caltech.edu/usvo-pubs/files/hyperatlas.pdf
• Spectral
  – Simple Spectral Access Protocol
  – 500K spectra available at http://voservices.net/wave
• Query Services
  – ADQL and SkyNode http://skyservice.pha.jhu.edu/develop/vo/adql/
  – And http://SkyQuery.Net
• Registry:
  – see below




The Registry

• UDDI seemed inappropriate
  – Complex
  – Irrelevant questions
  – Relevant questions missing
• Evolved Dublin Core
  – Represent Datasets, Services, Portals
  – Needs to be machine readable
  – Federation (DNS model)
  – Push & Pull: register then harvest
• http://www.ivoa.net/twiki/bin/view/IVOA/IvoaResReg
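The "Push & Pull: register then harvest" pattern can be sketched as follows. This is a toy in-memory model under stated assumptions, not the IVOA registry interface: the `Registry` class, the sequence-number scheme, and the example identifiers and record fields are all hypothetical.

```python
class Registry:
    """Toy registry: publishers push records, peer registries pull (harvest) them."""

    def __init__(self):
        self._records = {}   # identifier -> (sequence number, record)
        self._seq = 0

    def register(self, identifier, record):
        """Push: add or update one resource description."""
        self._seq += 1
        self._records[identifier] = (self._seq, dict(record))
        return self._seq

    def harvest(self, since=0):
        """Pull: return (latest sequence number, records changed after `since`)."""
        changed = {ident: rec
                   for ident, (seq, rec) in self._records.items()
                   if seq > since}
        return self._seq, changed


home = Registry()
home.register("ivo://example/catalog", {"type": "Dataset", "title": "Example catalog"})
home.register("ivo://example/cone", {"type": "Service", "title": "Example cone search"})

mirror = {}                               # a peer registry's local copy
mark, batch = home.harvest()              # first harvest pulls everything
mirror.update(batch)

home.register("ivo://example/portal", {"type": "Portal", "title": "Example portal"})
mark, batch = home.harvest(since=mark)    # incremental harvest: only the new record
mirror.update(batch)
```

Harvesting incrementally keeps federated copies in step without re-sending the whole registry each time.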



• SkyServer:
  – Navigator showing cutout web service
  – List: showing many calls and variant use.
• SkyQuery:
  – Show integration of various archives.
  – Explain spatial join xMatch operator.


SkyServer.SDSS.org

• A modern Astronomy archive
  – Raw Pixel data lives in file servers
  – Catalog data (derived objects) lives in Database
  – Online query to any and all
• Also used for education
  – 150 hours of online Astronomy
  – Implicitly teaches data analysis
• Interesting things
  – Spatial data search
  – Client query interface via Java Applet
  – Query interface via Emacs
  – Popular
  – Cloned by other surveys (a template design)
  – Web services are core of it.


SkyQuery: A Prototype WWT

• Started with SDSS data and schema
• Imported 12 other datasets into that spine schema (a day per dataset plus load time)
• Unified them with a portal
• Implicit spatial join among the datasets.
• All built on Web Services
  – Pure XML
  – Pure SOAP
  – Used .NET toolkit
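A spatial join pairs objects from different archives that lie at (nearly) the same sky position. The sketch below is a simplified nearest-neighbour match within a tolerance, not the actual SkyQuery xMatch operator; it uses a flat-sky approximation, ignores RA wraparound at 0/360 degrees, and runs in O(N*M). The catalogs and names are hypothetical.

```python
import math


def sep_arcsec(ra1, dec1, ra2, dec2):
    """Small-angle separation in arcseconds (flat-sky approximation)."""
    dra = (ra1 - ra2) * math.cos(math.radians((dec1 + dec2) / 2))
    return math.hypot(dra, dec1 - dec2) * 3600.0


def cross_match(left, right, tol_arcsec):
    """Pair each left object with its nearest right object within tol_arcsec."""
    matches = []
    for left_id, left_ra, left_dec in left:
        best = None
        for right_id, right_ra, right_dec in right:
            d = sep_arcsec(left_ra, left_dec, right_ra, right_dec)
            if d <= tol_arcsec and (best is None or d < best[1]):
                best = (right_id, d)
        if best is not None:
            matches.append((left_id, best[0], round(best[1], 2)))
    return matches


# Made-up rows: (id, ra in degrees, dec in degrees).
optical = [("o1", 10.0000, 20.0000), ("o2", 10.0100, 20.0100), ("o3", 50.0, -5.0)]
radio = [("r1", 10.0002, 20.0001), ("r2", 10.0101, 20.0098), ("r3", 120.0, 30.0)]
pairs = cross_match(optical, radio, tol_arcsec=3.0)
```

With these rows, "o1" pairs with "r1" and "o2" with "r2", each under one arcsecond apart, while "o3" and "r3" have no counterpart. A production join would replace the inner loop with a spatial index.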


Federation: SkyQuery.Net

• Combine 4 archives initially

• Added 9 more

• Send query to portal, portal joins data from archives.

• Problem: want to do multi-step data analysis (not just single query).

• Solution: Allow personal databases on portal

• Problem: some queries are monsters

• Solution: “batch schedule” on portal server, which deposits the answer in a personal database.







SkyQuery Structure

• Each SkyNode publishes
  – Schema Web Service
  – Database Web Service
• Portal
  – Plans query (2 phase)
  – Integrates answers
  – Is a web service



• Portal allows federation of data but…

• Intermediate results may be large.

• Intermediate results feed into next analysis step.

• Sending them back-and-forth to client is costly and sometimes infeasible.

• Solution: create a working DB for client at Portal: MyDB



• Anyone can create a personal DB at SkyServer portal.
  – It is about 100 MB
  – It is private

• Simple queries done immediately

• Complex queries done by batch scheduler

• All queries can create/read/write MyDB tables

• Very popular with “serious” users.

• MyDB will be sharable by a group.
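The split between immediate and batch queries, with answers deposited in a personal database, can be pictured with the sketch below. It uses two in-memory SQLite databases as stand-ins for the archive and for one user's MyDB. The threshold, the caller-supplied `expected_rows` estimate, and the table names are all hypothetical; this is not the SkyServer implementation.

```python
import sqlite3
from collections import deque

archive = sqlite3.connect(":memory:")      # stands in for the shared archive
archive.execute("CREATE TABLE photo (id INTEGER, mag REAL)")
archive.executemany("INSERT INTO photo VALUES (?, ?)",
                    [(i, 15.0 + (i % 80) / 10.0) for i in range(1000)])

mydb = sqlite3.connect(":memory:")         # stands in for one user's MyDB
batch_queue = deque()

QUICK_ROW_LIMIT = 100   # hypothetical threshold between "simple" and "complex"


def submit(sql, into_table, expected_rows):
    """Run small queries now; queue large ones for the batch scheduler."""
    if expected_rows <= QUICK_ROW_LIMIT:
        return archive.execute(sql).fetchall()
    batch_queue.append((sql, into_table))
    return None


def run_batch():
    """Drain the queue, depositing each answer as a table in MyDB."""
    while batch_queue:
        sql, table = batch_queue.popleft()
        rows = archive.execute(sql).fetchall()
        # Table names are trusted in this sketch; real code must validate them.
        mydb.execute(f"CREATE TABLE {table} (id INTEGER, mag REAL)")
        mydb.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)


quick = submit("SELECT id, mag FROM photo WHERE id < 5", None, expected_rows=5)
pending = submit("SELECT id, mag FROM photo WHERE mag < 18.0", "bright",
                 expected_rows=400)
run_batch()
deposited = mydb.execute("SELECT COUNT(*) FROM bright").fetchone()[0]
```

The small query returns its rows directly, while the large one returns nothing until the batch run has written its answer into the `bright` table, where later queries can read it without shipping the rows back to the client.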


Open SkyQuery

• SkyQuery being adopted by AstroGrid as reference implementation for OGSA-DAI (Open Grid Services Architecture, Data Access and Integration).

• SkyNode basic archive object http://www.ivoa.net/twiki/bin/view/IVOA/SkyNode

• SkyQuery Language (VoQL) is evolving. http://www.ivoa.net/twiki/bin/view/IVOA/IvoaVOQL


The WWT Components: Outline

• Data Sources
  – Literature
  – Archives
• Unified Definitions
  – Units
  – Semantics/Concepts/Metrics, Representations
  – Provenance
• Object model
• Classes and methods
• Portals
• WWT is a poster child for the Data Grid.

What we learned

• Astro is a community of 10,000
• Homogeneous & Cooperative
• If you can’t do it for Astro, do not bother with 3M bio-info.
• Agreement
  – Takes time
  – Takes endless meetings
• Big problems are non-technical
  – Legacy is a big problem.
• Plumbing and tools are there. But…
  – What is the object model?
  – What do you want to save?
  – How to document provenance?