Data Integration Via Universal Keys
Robert Grossman
University of Illinois at Chicago
Part 1
Background & Requirements
The New Challenge
DataSet 1
DataSet 2
vs.
Build a statistically valid model. Publish it.
Data - Find it, get it, explore it, enrich it. Make decisions.
What are Some Examples?
1. Astronomy - National Virtual Observatory
– Integrate Sloan Digital Sky Survey (SDSS), 2MASS, DPOSS, etc.
2. Bioinformatics
– Integrate distributed information about chemical compounds, pathways, networks
3. Examples from oceanography, homeland defense, …
How Do We Browse Distributed Data about Biological Networks?
How do we integrate all this?
– For humans with search
– For machines with Service Oriented Architectures
Pathway database
Pathway database
Protein sequence database
Publications
CBC Proteomics Repository
Light-Weight Data Integration
Databases Data Webs
Group
Community
KEGG, MetaCyc
“Google” for pathways & networks?
Data Grids
Biogrids, e.g. caBIG
Full
Partial
Collaboration
Requirements
Scale to large data sets
Persistent services that provide high performance access to large distributed data sets for e-science
Support
– Browsing and exploration
– Queries for keys and metadata
– Range queries
– Joins
– Continuous queries
Two Projects
DataSpace (2001-2005) - web service based infrastructure supporting
– Universal keys
– Metadata
– Range queries
Sector (2005-present) - peer-to-peer system for large data and distributed queries supporting
– scalable transport and web/grid services
– range queries
– universal keys for joining distributed columns
Related Work
DODS - Distributed Oceanographic Data System & OPeNDAP, Data Access Protocol (www.opendap.org/)
Franklin, Halevy & Maier, From Databases to Dataspaces, SIGMOD 2005.
…
Part 2
Technical Challenge 1 - Keys
What is the minimal infrastructure that allows you to join two distributed tables for the purpose of exploration and browsing?
Full semantic integration is not required, indeed not desired
Want “Google” for distributed data
What is the Minimal Infrastructure for Integrating Distributed Data?
Service oriented architecture for universal keys (UKs) k[i] that associate data (k[i], x[i]) on one server with data (k[j], y[j]) on another server
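The association above amounts to an equality join on the key column. A minimal sketch, assuming each server has already returned its rows as (key, value) pairs; the key strings, the values, and the name join_by_uk are hypothetical and not part of DataSpace or Sector:

```python
def join_by_uk(left, right):
    """Inner-join two lists of (key, value) rows on equal universal keys."""
    right_index = {}
    for key, y in right:
        right_index.setdefault(key, []).append(y)
    # Emit one (key, x, y) row for every pair of rows whose keys match.
    return [(key, x, y) for key, x in left for y in right_index.get(key, [])]


# Made-up rows standing in for (k[i], x[i]) on one server and (k[j], y[j]) on another.
server_a = [("uk-001", 14.2), ("uk-002", 15.8), ("uk-003", 13.1)]
server_b = [("uk-002", 12.9), ("uk-003", 12.7), ("uk-004", 11.0)]

joined = join_by_uk(server_a, server_b)
print(joined)
```

Only keys present on both servers survive the join, which is all that exploration and browsing require; no shared schema beyond the key is assumed.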
Examples of Universal Keys (UKs)
Astronomy - right ascension and declination used by Sloan Digital Sky Survey Community
Earth science - 1/2 degree x 1/2 degree latitude x longitude grid used by Community Climate Model 3 (CCM3)
Bioinformatics - enzyme EC numbers
In all cases, a globally unique ID or GUID is used by the UK service so that universal keys can be checked for equality unambiguously.
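As an illustration of the earth science case, the sketch below snaps a latitude/longitude point to a 1/2 degree cell and derives a GUID from the cell. It is only a sketch under stated assumptions: the namespace URL is hypothetical, cells are identified by their south-west corner, and the names grid_cell and grid_uk are made up here; the actual UK service may assign GUIDs differently.

```python
import math
import uuid

# Hypothetical namespace for this sketch; a real UK service would define its own.
GRID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/uk/half-degree-grid")


def grid_cell(lat, lon, step=0.5):
    """Snap a point to the south-west corner of its step x step degree cell."""
    return (math.floor(lat / step) * step, math.floor(lon / step) * step)


def grid_uk(lat, lon):
    """Return a GUID that is identical for all points falling in the same cell."""
    cell_lat, cell_lon = grid_cell(lat, lon)
    return uuid.uuid5(GRID_NAMESPACE, f"{cell_lat:.1f},{cell_lon:.1f}")


# Two nearby points in the same cell receive the same key; a point in another cell does not.
print(grid_uk(41.87, -87.65) == grid_uk(41.60, -87.90))
print(grid_uk(41.87, -87.65) == grid_uk(42.10, -87.65))
```

Because the GUID is computed from the cell rather than from the raw coordinates, two servers can check keys for equality without exchanging anything else.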
Integrating SDSS & 2MASS Data to Find Candidate Brown Dwarfs
Sloan Digital Sky Survey (SDSS)
– 82 million stars
– Visible spectrum
Two Micron All Sky Survey (2MASS)
– 208 million stars
– Infrared spectrum
Two separate locations - Query at SC 05 in Seattle
– SDSS in Tokyo & 2MASS in Chicago
Found 289,283 candidate brown dwarfs
– Computation - object in both locations; infrared value is 2 degree brighter
DataSpace & Data Webs
 | Document Web | Data Web
Basic object | Document | Data table
Basic operation | Retrieve document by URL | Retrieve metadata, perform range queries, integrate columns by UK
What don’t we do | Everything else | Everything else
Part 3
Technical Challenge 2 - Infrastructure / Middleware
Technical Challenge 2 - Moving the Data
Today’s TCP-based network protocols, web, grid & data services do not scale effectively to high bandwidth wide area networks.
Link | Available bandwidth | Usable bandwidth
OC-3 | 155 Mbps | 5 Mbps
OC-12 | 622 Mbps | 5 Mbps
OC-48 | 2488 Mbps | 5 Mbps
OC-192 | 9953 Mbps | 5 Mbps
Usable bandwidth is with a single TCP flow as usually deployed.
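To make the gap concrete, the short calculation below divides the 5 Mbps usable figure by the available bandwidth of each link listed above. The numbers are taken from this slide; the function name utilization is just a label for this sketch.

```python
# Available bandwidth per link type, in Mbps, as listed on the slide.
AVAILABLE_MBPS = {"OC-3": 155, "OC-12": 622, "OC-48": 2488, "OC-192": 9953}
# Usable bandwidth of a single TCP flow as usually deployed, in Mbps.
USABLE_MBPS = 5


def utilization(usable_mbps, available_mbps):
    """Fraction of the available bandwidth that is actually used."""
    return usable_mbps / available_mbps


for link, available in AVAILABLE_MBPS.items():
    print(f"{link}: {utilization(USABLE_MBPS, available):.2%} of available bandwidth used")
```

The fraction falls from roughly 3% on OC-3 to roughly 0.05% on OC-192: adding capacity does not help a single flow.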
1. Goal: Exploit available bandwidth of wide area 10 Gbps network.
2. Developed new application level network protocol - UDT
3. UDT is fair to other high volume data flows
4. UDT is friendly to commodity TCP flows.
5. UDT is easy to deploy since application level.
New Transport Protocols & Services (e.g. UDT)
UDT has been downloaded over 5000 times from SourceForge
Part 4
Example - Integrating Distributed Databases of Chemical Compounds
Smiles Strings are Not Unique
Path (a) yields CC1=CC(Br)CCC1
Path (b) yields CC1=CC(CCC1)Br
[Figure: ring structure of the example compound, traversed along path (a) and along path (b)]
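The two strings for paths (a) and (b) differ as text yet describe the same atoms and bonds. The sketch below parses just enough Smiles syntax for these two strings (C, Br, '=', branches, single-digit ring closures) and compares a numbering-independent summary of each graph. The parser and both function names are written for this illustration only; matching summaries are a necessary condition for the graphs to be identical, not a proof of it.

```python
def parse_smiles_subset(smiles):
    """Parse a tiny Smiles subset into (atom symbols, set of bonds)."""
    atoms, bonds = [], set()
    prev = None          # index of the atom the next atom bonds to
    branch_stack = []    # atoms to return to when a branch closes
    open_rings = {}      # ring-closure digit -> atom index that opened it
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch == "=":
            pass  # bond order is ignored in this sketch
        elif ch.isdigit():
            if ch in open_rings:
                bonds.add(frozenset((open_rings.pop(ch), prev)))
            else:
                open_rings[ch] = prev
        else:
            symbol = "Br" if smiles.startswith("Br", i) else ch
            atoms.append(symbol)
            if prev is not None:
                bonds.add(frozenset((prev, len(atoms) - 1)))
            prev = len(atoms) - 1
            i += len(symbol) - 1
        i += 1
    return atoms, bonds


def neighbourhood_fingerprint(atoms, bonds):
    """Each atom with its sorted neighbour symbols, independent of atom numbering."""
    neighbours = {i: [] for i in range(len(atoms))}
    for bond in bonds:
        a, b = tuple(bond)
        neighbours[a].append(atoms[b])
        neighbours[b].append(atoms[a])
    return sorted((atoms[i], tuple(sorted(ns))) for i, ns in neighbours.items())


path_a = "CC1=CC(Br)CCC1"
path_b = "CC1=CC(CCC1)Br"
fingerprint_a = neighbourhood_fingerprint(*parse_smiles_subset(path_a))
fingerprint_b = neighbourhood_fingerprint(*parse_smiles_subset(path_b))
print(path_a == path_b)                # the strings differ
print(fingerprint_a == fingerprint_b)  # the summaries agree
```

A key used for joining distributed tables therefore cannot simply be whichever Smiles string a database happens to store.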
Unique Smiles
There are many variants of Smiles strings.
One of the most common is the unique Smiles string (Weininger et al. 1989).
The claim is that a unique starting atom and well-defined branching decisions can be made using an algorithm that generates “canonical labels”.
Counter Example 1
Counter Example 2
Counter Example 3
What are Natural Operations?
The set of paths is naturally defined.
Paths can be lex ordered.
[Figure: fragment with a central C bonded to O, to C, and to O-H]
1. Set of paths of length less than or equal to 2 originating from C: {CO, CC, COH}.
2. Lexicographically order: [CC, CO, COH].
3. Concatenate: CCCOCOH
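The three steps above can be reproduced in a few lines. The sketch assumes the fragment in the figure is a central C bonded to O, to C, and to an O that carries an H; the atom numbering and the name paths_from are arbitrary choices for this illustration.

```python
def paths_from(atoms, adj, start, max_len):
    """Set of element strings for simple paths of 1..max_len bonds starting at start."""
    found = set()
    stack = [(start,)]
    while stack:
        path = stack.pop()
        if len(path) > 1:
            found.add("".join(atoms[i] for i in path))
        if len(path) - 1 < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:  # simple paths only: never revisit an atom
                    stack.append(path + (nxt,))
    return found


# Assumed fragment: atom 0 is the central C; atom 3 is the O bonded to H.
atoms = {0: "C", 1: "O", 2: "C", 3: "O", 4: "H"}
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}

paths = paths_from(atoms, adj, 0, 2)   # step 1: {CO, CC, COH}
ordered = sorted(paths)                # step 2: [CC, CO, COH]
key = "".join(ordered)                 # step 3: CCCOCOH
print(ordered, key)
```

Because the result depends only on the set of paths, it does not change if the atoms are numbered differently.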
Universal Chemical Keys (UCKs)
[Figure: structures of NSC 682322 and NSC 682323]
1. Fix depth d. Compute path labels λ(u), for nodes u.
2. Loop over all pairs of nodes u and v, compute length of shortest path n and form λ(u) n λ(v).
3. Lex order.
4. Concatenate.
5. Hash.
Loop over all pairs of nodes u and v and form “natural labels”
Universal Chemical Key (1 of 2)
Fix d = 2 or 3.
Label each node by recursively computing strings based upon paths (of length d) originating from the node.
Create labels from all pairs of nodes and lexicographically order & concatenate (more or less - see paper)
Hash to produce short string.
[Figure: fragment with labeled nodes O (node 1), C (node 2), O (node 3), C (node 4), H (node 10)]
UCK Algorithm (2 of 2)
Assign a sequence of labels λ(d) to each vertex reflecting the structure of larger and larger local neighborhoods
[Figure: fragment with nodes O1, C2, O3, C4]
V | λ(0) | λ(1) | λ(2)
1 | O | OCH | OCCOH
2 | C | CCOO | CCCCOOH
3 | O | OC | OCCO
C | CO | CO | CCC | O | C | OC | C | O | O
lexicographically order
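The overall shape of the algorithm (fix d, label each node, label each pair of nodes with the shortest path length between them, lex order, concatenate, hash) can be sketched as follows. This is a simplified illustration and not the published UCK algorithm: the node label used here is just the sorted concatenation of path strings, bond orders are ignored, MD5 is an arbitrary choice of hash, and all function names are made up. It will not reproduce the λ values or UCK strings shown on these slides.

```python
import hashlib
from collections import deque


def path_label(atoms, adj, start, depth):
    """Simplified node label: element symbol plus sorted path strings up to depth bonds."""
    labels, stack = [], [(start,)]
    while stack:
        path = stack.pop()
        if len(path) > 1:
            labels.append("".join(atoms[i] for i in path))
        if len(path) - 1 < depth:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    stack.append(path + (nxt,))
    return atoms[start] + "".join(sorted(labels))


def distances(adj, start):
    """Breadth-first search: shortest path length from start to every reachable node."""
    dist, queue = {start: 0}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def uck_sketch(atoms, bonds, depth=2):
    """Hash of the lex-ordered pair labels of a connected molecular graph."""
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    lam = [path_label(atoms, adj, i, depth) for i in range(len(atoms))]
    pair_labels = []
    for u in range(len(atoms)):
        dist = distances(adj, u)
        for v in range(u, len(atoms)):
            lo, hi = sorted((lam[u], lam[v]))  # make the pair label symmetric in u and v
            pair_labels.append(f"{lo}{dist[v]}{hi}")
    long_key = "".join(sorted(pair_labels))
    return hashlib.md5(long_key.encode()).hexdigest()


# Formic acid written with two different atom numberings, plus two C2H6O skeletons.
formic_1 = uck_sketch(["C", "O", "O", "H", "H"], [(0, 1), (0, 2), (2, 3), (0, 4)])
formic_2 = uck_sketch(["H", "O", "C", "H", "O"], [(2, 4), (2, 1), (1, 0), (2, 3)])
ethanol_skeleton = uck_sketch(["C", "C", "O"], [(0, 1), (1, 2)])
ether_skeleton = uck_sketch(["C", "O", "C"], [(0, 1), (1, 2)])
print(formic_1 == formic_2, ethanol_skeleton != ether_skeleton)
```

Sorting the pair labels before concatenation is what removes the dependence on atom numbering; the final hash only shortens the result.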
Example
NSC | Formula | UCK
682322 | C17H18O4 | 132020…
682323 | C17H18O4 | 098900…
[Figure: structures of NSC 682322 and NSC 682323]
Example 2 – Benzoic acid (NSC149)
[Figure: benzoic acid with numbered atoms O1, C2, O3, C4 to C9, H10 to H15]
Initialization ……
Vertex | λ(2)
1 | OCCOH
2 | CCCCOOH
3 | OCCO
Pair | μ(2)
(1,2) | OCCOH1CCCCOOH
… | …
Example 2 – Benzoic acid (NSC149)
The UCK String: C7H6O2-
CCCCCCHH0CCCCCCHHCCCCCCHH0CCCCCCHHCCCCCCHH1CCCHCCHCOOCCCCCCHH1CCCHCCHCOOCCCCCCHH1CCCHCCHHCCCCCCHH1CCCHCCHHCCCCCCHH1HCCCCCCCCCHH1HCCCCCCCCCHH2CCCCCCHHCCCCCCHH2CCCCCCHHCCCCCCHH2CCCCOOHCCCCCCHH2CCCCOOHCCCCCCHH2CCCHCCHHCCCCCCHH2CCCHCCHHCCCCCCHH2HCCCCCCCCCHH2HCCCCCCCCCHH3CCCHCCHHCCCCCCHH3CCCHCCHHCCCCCCHH3HCCCCCCCCCHH3HCCCCCCCCCHH3HCCCCCCCCCHH3HCCCCCCCCCHH3OCCOCCCCCCHH3OCCOCCCCCCHH3OCCOHCCCCCCHH3OCCOHCCCCCCHH4HCCCCCCCCCHH4HCCCCCCCCCHH4HOCCCCCCCHH4HOCCCCCOOH0CCCCOOHCCCCOOH1CCCHCCHCOOCCCCOOH1OCCOCCCCOOH1OCCOHCCCCOOH2CCCCCCHHCCCCOOH2CCCCCCHHCCCCOOH2HOCCCCCOOH3CCCHCCHHCCCCOOH3CCCHCCHHCCCCOOH3HCCCCCCCOOH3HCCCCCCCOOH4CCCHCCHHCCCCOOH4HCCCCCCCOOH4HCCCCCCCOOH5HCCCCCCHCCHCOO0CCCHCCHCOOCCCHCCHCOO1CCCCCCHHCCCHCCHCOO1CCCCCCHHCCCHCCHCOO1CCCCOOHCCCHCCHCOO2CCCHCCHHCCCHCCHCOO2CCCHCCHHCCCHCCHCOO2HCCCCCCHCCHCOO2HCCCCCCHCCHCOO2OCCOCCCHCCHCOO2OCCOHCCCHCCHCOO3CCCHCCHHCCCHCCHCOO3HCCCCCCHCCHCOO3HCCCCCCHCCHCOO3HOCCCCHCCHCOO4HCCCCCCHCCHH0CCCHCCHHCCCHCCHH0CCCHCCHHCCCHCCHH0CCCHCCHHCCCHCCHH1CCCCCCHHCCCHCCHH1CCCCCCHHCCCHCCHH1CCCHCCHHCCCHCCHH1CCCHCCHHCCCHCCHH1CCCHCCHHCCCHCCHH1CCCHCCHHCCCHCCHH1HCCCCCCHCCHH1HCCCCCCHCCHH1HCCCCCCHCCHH2CCCCCCHHCCCHCCHH2CCCCCCHHCCCHCCHH2CCCHCCHCOOCCCHCCHH2CCCHCCHCOOCCCHCCHH2CCCHCCHHCCCHCCHH2CCCHCCHHCCCHCCHH2HCCCCCCHCCHH2HCCCCCCHCCHH2HCCCCCCHCCHH2HCCCCCCHCCHH2HCCCCCCHCCHH2HCCCCCCHCCHH3CCCCCCHHCCCHCCHH3CCCCCCHHCCCHCCHH3CCCCOOHCCCHCCHH3CCCCOOHCCCHCCHH3CCCHCCHCOOCCCHCCHH3HCCCCCCHCCHH3HCCCCCCHCCHH3HCCCCCCHCCHH3HCCCCCCHCCHH4CCCCOOHCCCHCCHH4HCCCCCCHCCHH4HCCCCCCHCCHH4OCCOCCCHCCHH4OCCOCCCHCCHH4OCCOHCCCHCCHH4OCCOHCCCHCCHH5HOCCCCHCCHH5HOCCCCHCCHH5OCCOCCCHCCHH5OCCOHCCCHCCHH6HOCHCCC0HCCCHCCC0HCCCHCCC0HCCCHCCC0HCCCHCCC0HCCCHCCC1CCCCCCHHHCCC1CCCCCCHHHCCC1CCCHCCHHHCCC1CCCHCCHHHCCC1CCCHCCHHHCCC2CCCCCCHHHCCC2CCCCCCHHHCCC2CCCHCCHCOOHCCC2CCCHCCHCOOHCCC2CCCHCCHHHCCC2CCCHCCHHHCCC2CCCHCCHHHCCC2CCCHCCHHHCCC2CCCHCCHHHCCC2CCCHCCHHHCCC3CCCCCCHHHCCC3CCCCCCHHHCCC3CCCCCCHHHCCC3CCCCCCHHHCCC3CCCCOOHHCCC3CCCCOOHHCCC3CCCHCCHCOOHCCC3CCCHCCHCOOHCCC3CCCHCCHHHCCC3CCCHCCHHHCCC3CCCHCCHHHCCC3CCCHCCHHHCCC3HCCCHCCC3H
CCCHCCC3HCCCHCCC3HCCCHCCC3HCCCHCCC3HCCCHCCC3HCCCHCCC3HCCCHCCC4CCCCCCHHHCCC4CCCCCCHHHCCC4CCCCOOHHCCC4CCCCOOHHCCC4CCCHCCHCOOHCCC4CCCHCCHHHCCC4CCCHCCHHHCCC4HCCCHCCC4HCCCHCCC4HCCCHCCC4HCCCHCCC4HCCCHCCC4HCCCHCCC4HCCCHCCC4HCCCHCCC4OCCOHCCC4OCCOHCCC4OCCOHHCCC4OCCOHHCCC5CCCCOOHHCCC5HCCCHCCC5HCCCHCCC5HCCCHCCC5HCCCHCCC5HOCHCCC5HOCHCCC5OCCOHCCC5OCCOHCCC5OCCOHHCCC5OCCOHHCCC6HOCHCCC6HOCHCCC6OCCOHCCC6OCCOHHCCC7HOCHOC0HOCHOC1OCCOHHOC2CCCCOOHHOC3CCCHCCHCOOHOC3OCCOHOC4CCCCCCHHHOC4CCCCCCHHHOC5CCCHCCHHHOC5CCCHCCHHHOC5HCCCHOC5HCCCHOC6CCCHCCHHHOC6HCCCHOC6HCCCHOC7HCCCOCCO0OCCOOCCO1CCCCOOHOCCO2CCCHCCHCOOOCCO2OCCOHOCCO3CCCCCCHHOCCO3CCCCCCHHOCCO3HOCOCCO4CCCHCCHHOCCO4CCCHCCHHOCCO4HCCCOCCO4HCCCOCCO5CCCHCCHHOCCO5HCCCOCCO5HCCCOCCO6HCCCOCCOH0OCCOHOCCOH1CCCCOOHOCCOH1HOCOCCOH2CCCHCCHCOOOCCOH2OCCOOCCOH3CCCCCCHHOCCOH3CCCCCCHHOCCOH4CCCHCCHHOCCOH4CCCHCCHHOCCOH4HCCCOCCOH4HCCCOCCOH5CCCHCCHHOCCOH5HCCCOCCOH5HCCCOCCOH6HCCC
Analysis of NCI Database
Description | Number | Remark
Total number of chemical compounds | 236,917 | Some compounds have duplicate entries
Number of chem. comp. with single entry | 202,384 | All gave unique UCK
Number chem. comp. 2 or more entries | 33,533 | UCK gave same key to same compounds
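An analysis of this kind groups database entries by key and then counts keys with one entry and keys with several. A minimal sketch, using made-up entry IDs and keys rather than NCI data; the function name group_by_key is hypothetical:

```python
from collections import defaultdict


def group_by_key(entries):
    """Map each key to the list of entry IDs that carry it."""
    groups = defaultdict(list)
    for entry_id, key in entries:
        groups[key].append(entry_id)
    return groups


# Made-up (entry ID, key) pairs; entries 1 and 3 stand for duplicate records of one compound.
entries = [(1, "k-a"), (2, "k-b"), (3, "k-a"), (4, "k-c")]

groups = group_by_key(entries)
single = sum(1 for ids in groups.values() if len(ids) == 1)
multiple = sum(1 for ids in groups.values() if len(ids) >= 2)
print(len(groups), single, multiple)
```

Entries that land in the same group are candidates for being the same compound; entries alone in their group have a key shared with no other entry.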
Techniques Extend to Pathways
KEGG database : Lysine biosynthesis
Part 5 -
Summary & Conclusion
There is a Sweet Spot
There is a sweet spot technically for integrating distributed tabular data using universal keys
There are many, many technical reasons why this is stupid
– By and large, the same technical reasons apply to using URLs to access remote data
There are some very practical reasons to use this approach.
DataSpace References
Robert Grossman and Marco Mazzucco, DataSpace - A Web Infrastructure for the Exploratory Analysis and Mining of Data, IEEE Computing in Science and Engineering, July/August, 2002, pages 44-51.
Asvin Ananthanarayan, Rajiv Balachandran, Yunhong Gu, Robert Grossman, Xinwei Hong, Jorge Levera, Marco Mazzucco, Data Webs for Earth Science Data, Parallel Computing, Volume 29, 2003, pages 1363-1379.
Alon Halevy, Michael Franklin and David Maier, Principles of Dataspace Systems, Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 1 - 9, 2006.
Ian Foster and Robert L. Grossman, Data Integration in a Bandwidth Rich World, Communications ACM, Volume 46, Issue 11, November, 2003, pages 50-57.
Sector References
Robert L. Grossman, Yunhong Gu, David Handley, Michal Sabala, Joe Mambretti, Alex Szalay, Ani Thakar, Kazumi Kumazoe, Oie Yuji, Minsun Lee, Yoonjoo Kwon, and Woojin Seok, Data Mining Middleware for Wide Area High Performance Networks, Journal of Future Generation Computer Systems (FGCS), 2006.
Yunhong Gu, Robert L. Grossman, Alex Szalay and Ani Thakar, Distributing the Sloan Digital Sky Survey Using UDT and Sector, Second IEEE International Conference on E-Science and Grid Computing, 2006.
Unique Chemical Keys References
Robert L. Grossman, Pavan Kasturi, Donald Hamelberg, Bing Liu, An Empirical Study of the Universal Chemical Key Algorithm for Assigning Unique Keys to Chemical Compounds, Journal of Bioinformatics and Computational Biology, Volume 2, Number 1, 2004, pages 155-171.
Greeshma Neglur and Robert L. Grossman, Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples, 2nd International Workshop on Data Integration in the Life Sciences (DILS 2005), La Jolla, July 20-22, 2005.
NCI Open Database Compounds, retrieved from http://cactus.nci.nih.gov/ on August 10, 2006.