Data Integration Via Universal Keys Robert Grossman University of Illinois at Chicago

Data Integration Via Universal Keys

Robert Grossman
University of Illinois at Chicago


Part 1

Background & Requirements


The New Challenge

DataSet 1

DataSet 2

vs.

Build a statistically valid model. Publish it.

Data - Find it, get it, explore it, enrich it. Make decisions.


What are Some Examples?

1. Astronomy - National Virtual Observatory
– Integrate Sloan Digital Sky Survey (SDSS), 2MASS, DPOSS, etc.
2. Bioinformatics
– Integrate distributed information about chemical compounds, pathways, networks
3. Examples from oceanography, homeland defense, …


How Do We Browse Distributed Data about Biological Networks?

How do we integrate all this?
– For humans with search
– For machines with Service Oriented Architectures

Pathway database

Pathway database

Protein sequence database

Publications

CBC Proteomics Repository


Light-Weight Data Integration

[Figure: diagram positioning Databases, Data Webs and Data Grids. Labels: Group, Community, Full, Partial, Collaboration. Examples: KEGG, MetaCyc; “Google” for pathways & networks?; Biogrids, e.g. caBIG]


Requirements

Scale to large data sets
Persistent services that provide high performance access to large distributed data sets for e-science
Support
– Browsing and exploration
– Queries for keys and metadata
– Range queries
– Joins
– Continuous queries


Two Projects

DataSpace (2001-2005) - web service based infrastructure supporting
– Universal keys
– Metadata
– Range queries

Sector (2005-present) - peer-to-peer system for large data and distributed queries supporting
– scalable transport and web/grid services
– range queries
– universal keys for joining distributed columns


Related Work

DODS - Distributed Oceanographic Data System & OPeNDAP, Data Access Protocol (www.opendap.org/)
Franklin, Halevy & Maier, From Databases to Dataspaces, SIGMOD 2005.
…


Part 2

Technical Challenge 1 - Keys


Technical Challenge 1 - Keys

What is the minimal infrastructure that allows you to join two distributed tables for the purpose of exploration and browsing?
Full semantic integration is not required, indeed not desired
Want “Google” for distributed data


What is the Minimal Infrastructure for Integrating Distributed Data?

Service oriented architecture for universal keys (UKs) k[i] that associate data (k[i], x[i]) on one server with data (k[j], y[j]) on another server
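As a rough illustration of the idea, the sketch below joins two columns held by different servers on a shared universal key. The two dictionaries stand in for remote services; the names `server_a`, `server_b`, `join_on_universal_key` and the sample values are hypothetical and are not taken from DataSpace or Sector.

```python
def join_on_universal_key(left, right):
    """Inner join of two {universal_key: value} columns on their shared keys."""
    shared = left.keys() & right.keys()
    return {k: (left[k], right[k]) for k in sorted(shared)}


# Hypothetical data: (k[i], x[i]) on one server, (k[j], y[j]) on another.
server_a = {"uk-001": 3.2, "uk-002": 4.7, "uk-003": 1.1}
server_b = {"uk-002": "red", "uk-003": "blue", "uk-004": "green"}

joined = join_on_universal_key(server_a, server_b)
print(joined)
```

Only the keys need to agree across servers; the attribute columns x and y can use whatever local schema each server prefers.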


Examples of Universal Keys (UKs)

Astronomy - right ascension and declination used by Sloan Digital Sky Survey Community
Earth science - 1/2 degree x 1/2 degree latitude x longitude grid used by Community Climate Model 3 (CCM3)
Bioinformatics - enzyme EC numbers
In all cases, a globally unique ID or GUID is used by the UK service so that universal keys can be checked for equality unambiguously.
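One way to picture the earth science case is to snap each measurement to its half-degree cell and derive a deterministic GUID from the cell indices. This is only a sketch: the cell indexing, the `half-degree-grid/...` naming scheme and the use of `uuid.uuid5` are assumptions made for illustration, not the convention of CCM3 or of the UK service.

```python
import math
import uuid

GRID_DEGREES = 0.5  # cell size from the slide: 1/2 degree x 1/2 degree


def grid_cell(lat, lon, step=GRID_DEGREES):
    """Snap a latitude/longitude pair to the integer indices of its grid cell."""
    return (math.floor(lat / step), math.floor(lon / step))


def universal_key(lat, lon, namespace=uuid.NAMESPACE_URL):
    """Derive a deterministic GUID for the grid cell (illustrative scheme)."""
    i, j = grid_cell(lat, lon)
    return uuid.uuid5(namespace, f"half-degree-grid/{i}/{j}")


# Two nearby measurements fall in the same cell and so share a key;
# a third measurement half a degree further north does not.
k1 = universal_key(41.87, -87.65)
k2 = universal_key(41.60, -87.90)
k3 = universal_key(42.10, -87.65)
print(k1 == k2, k1 == k3)
```

Because the GUID is computed from the cell alone, two servers can test keys for equality without exchanging any other metadata.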


Integrating SDSS & 2MASS Data to Find Candidate Brown Dwarfs

Sloan Digital Sky Survey (SDSS)
– 82 million stars
– Visible spectrum
Two Micron All Sky Survey (2MASS)
– 208 million stars
– Infrared spectrum
Two separate locations - Query at SC 05 in Seattle
– SDSS in Tokyo & 2MASS in Chicago
Found 289,283 candidate brown dwarfs
– Computation - object in both locations; infrared value is 2 degree brighter


DataSpace & Data Webs

                    Document Web                Data Web
Basic object        Document                    Data table
Basic operation     Retrieve document by URL    Retrieve metadata, perform range queries, integrate columns by UK
What don’t we do    Everything else             Everything else


Part 3

Technical Challenge 2 - Infrastructure / Middleware


Technical Challenge 2 - Moving the Data

Today’s TCP-based network protocols, web, grid & data services do not scale effectively to high bandwidth wide area networks.

Link      Available bandwidth    Usable bandwidth
OC-3      155 Mbps               5 Mbps
OC-12     622 Mbps               5 Mbps
OC-48     2488 Mbps              5 Mbps
OC-192    9953 Mbps              5 Mbps

Usable bandwidth is with a single TCP flow as usually deployed.
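To make the gap concrete, the short calculation below turns the figures on this slide into the fraction of each link that a single TCP flow actually uses. The numbers are copied from the slide; the names `links`, `USABLE_MBPS` and `utilization` are just illustrative.

```python
# Available bandwidth per link, in Mbps, as listed on the slide.
links = {"OC-3": 155, "OC-12": 622, "OC-48": 2488, "OC-192": 9953}
USABLE_MBPS = 5  # single TCP flow as usually deployed, per the slide


def utilization(available_mbps, usable_mbps=USABLE_MBPS):
    """Fraction of the available bandwidth that is actually used."""
    return usable_mbps / available_mbps


for name, available in links.items():
    print(f"{name}: {utilization(available):.2%} of {available} Mbps used")
```

The usable figure stays flat while the available figure grows, so utilization drops from a few percent on OC-3 to well under a tenth of a percent on OC-192.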


New Transport Protocols & Services (e.g. UDT)

1. Goal: Exploit available bandwidth of wide area 10 Gbps network.
2. Developed new application level network protocol - UDT.
3. UDT is fair to other high volume data flows.
4. UDT is friendly to commodity TCP flows.
5. UDT is easy to deploy since it runs at the application level.

UDT has been downloaded over 5000 times from SourceForge.


Part 4

Example - Integrating Distributed Databases of Chemical Compounds


Smiles Strings are Not Unique

Path (a) yields CC1=CC(Br)CCC1

Path (b) yields CC1=CC(CCC1)Br

[Figure: the same cyclic structure with a bromine substituent, traversed along path (a) and along path (b)]


Unique Smiles

There are many variants of Smiles strings.
One of the most common is the unique Smiles string (Weininger et al. 1989).
The claim is that a unique starting atom and well defined branching decisions can be made using an algorithm generating “canonical labels”.


Counter Example 1


Counter Example 2


Counter Example 3


What are Natural Operations?

The set of paths is naturally defined.
Paths can be lex ordered.

[Figure: molecular fragment with atoms O, C, C, O, H]

1. Set of paths of length less than or equal to 2 originating from C: {CO, CC, COH}.
2. Lexicographically order: [CC, CO, COH].
3. Concatenate: CCCOCOH
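The three steps above can be sketched in a few lines. The fragment is an assumed reading of the slide's figure (a carbon bonded to an O, a C and an O-H group, with other hydrogens left out); the node ids and the helper name `path_strings` are illustrative only.

```python
# Assumed reading of the figure: the central carbon (node 2) is bonded to
# an oxygen (3), a carbon (4) and a hydroxyl oxygen (1) that carries H (10).
atoms = {1: "O", 2: "C", 3: "O", 4: "C", 10: "H"}
bonds = [(2, 3), (2, 4), (2, 1), (1, 10)]

neighbors = {a: [] for a in atoms}
for u, v in bonds:
    neighbors[u].append(v)
    neighbors[v].append(u)


def path_strings(start, max_len):
    """Element strings of all simple paths with 1..max_len bonds from start."""
    found = set()

    def walk(path):
        if len(path) > 1:
            found.add("".join(atoms[a] for a in path))
        if len(path) - 1 < max_len:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    walk([start])
    return found


paths = path_strings(2, 2)   # step 1: set of paths from the central C
ordered = sorted(paths)      # step 2: lexicographic order
key = "".join(ordered)       # step 3: concatenate
print(ordered, key)
```

Collecting the paths in a set is what merges the two C-O bonds into a single entry CO, as in the slide's set notation.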


Universal Chemical Keys (UCKs)

[Figure: structures of the compounds 682322 and 682323]

1. Fix depth d. Compute path labels λ(u), for nodes u.
2. Loop over all pairs of nodes u and v, compute length of shortest path n and form λ(u) n λ(v).
3. Lex order.
4. Concatenate.
5. Hash.

Loop over all pairs of nodes u and v and form “natural labels”
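A simplified sketch of these five steps follows. It is not the published algorithm: the label recursion (atom symbol followed by the sorted labels of unvisited neighbours), the use of all ordered pairs, and SHA-256 as the hash are assumptions chosen to match the outline on the slide. The function names and the small test fragment are hypothetical.

```python
import hashlib
from collections import deque


def node_label(graph, atoms, u, depth, path=()):
    """Atom symbol of u followed by the sorted labels of its unvisited neighbours."""
    if depth == 0:
        return atoms[u]
    path = path + (u,)
    subs = sorted(node_label(graph, atoms, v, depth - 1, path)
                  for v in graph[u] if v not in path)
    return atoms[u] + "".join(subs)


def distances_from(graph, source):
    """Breadth-first search: number of bonds on the shortest path to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def uck(graph, atoms, depth=2):
    labels = {u: node_label(graph, atoms, u, depth) for u in graph}  # step 1
    pairs = []
    for u in graph:                                                  # step 2
        dist = distances_from(graph, u)
        pairs.extend(f"{labels[u]}{dist[v]}{labels[v]}" for v in dist)
    text = "".join(sorted(pairs))                                    # steps 3-4
    return hashlib.sha256(text.encode()).hexdigest()                 # step 5


# Hypothetical fragment, given as adjacency lists, and a renumbered copy.
fragment_atoms = {1: "O", 2: "C", 3: "O", 4: "C", 10: "H"}
fragment = {1: [2, 10], 2: [1, 3, 4], 3: [2], 4: [2], 10: [1]}
rename = {1: 7, 2: 5, 3: 9, 4: 1, 10: 2}
renumbered = {rename[u]: [rename[v] for v in nbrs] for u, nbrs in fragment.items()}
renumbered_atoms = {rename[u]: s for u, s in fragment_atoms.items()}

key_a = uck(fragment, fragment_atoms)
key_b = uck(renumbered, renumbered_atoms)
key_c = uck(fragment, {**fragment_atoms, 3: "N"})  # swap one atom
print(key_a == key_b, key_a == key_c)
```

The point of the design is that nothing in the key depends on how the nodes happen to be numbered, so two servers that store the same structure with different atom orderings still produce the same key.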


Universal Chemical Key (1 of 2)

Fix d = 2 or 3.
Label each node by recursively computing strings based upon paths (of length d) originating from the node.
Create labels from all pairs of nodes and lexicographically order & concatenate (more or less - see paper)
Hash to produce short string.

[Figure: molecular fragment with numbered nodes: O (node 3), C (node 2), C (node 4), O (node 1), H (node 10)]


UCK Algorithm (2 of 2)

Assign a sequence of labels λ(d) to each vertex reflecting the structure of larger and larger local neighborhoods

[Figure: molecular fragment with vertices O3, C2, C4, O1]

V    λ(0)    λ(1)    λ(2)
1    O       OCH     OCCOH
3    O       OC      OCCO
2    C       CCOO    CCCCOOH

C | CO | C0 | CCC | O | C | OC | C | O | O

lexicographically order


Example

NSC       Formula     UCK
682322    C17H18O4    132020…
682323    C17H18O4    098900…

[Figure: structures of the compounds 682322 and 682323]


Example 2 – Benzoic acid (NSC149)

[Figure: benzoic acid drawn twice with numbered atoms O1, C2, O3, C4 to C9, H10 to H15]

Initialization

Vertex    λ(2)
1         OCCOH
2         CCCCOOH
3         OCCO
…         …

Pair      μ(2)
(1,2)     OCCOH1CCCCOOH
…         …
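The tabulated labels can be reproduced with a short sketch, under two assumptions: the atom numbering read off the figure (O1, C2, O3, ring carbons C4 to C9, hydroxyl H10, ring hydrogens H11 to H15, with the assignment of each ring hydrogen to a ring carbon guessed), and a label rule in which λ(d) of a vertex is its atom symbol followed by the sorted λ(d-1)-style labels of its unvisited neighbours. The names `label`, `distance` and `concatenated` are illustrative.

```python
from collections import deque

# Assumed atom numbering for benzoic acid, read off the slide's figure.
atoms = {1: "O", 2: "C", 3: "O",
         **{i: "C" for i in range(4, 10)},
         **{i: "H" for i in range(10, 16)}}
bonds = [(1, 2), (2, 3), (2, 4), (1, 10),                  # carboxyl group
         (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 4),   # six-membered ring
         (5, 11), (6, 12), (7, 13), (8, 14), (9, 15)]      # ring hydrogens (guessed)

graph = {a: [] for a in atoms}
for u, v in bonds:
    graph[u].append(v)
    graph[v].append(u)


def label(u, depth, path=()):
    """Atom symbol of u followed by the sorted labels of its unvisited neighbours."""
    if depth == 0:
        return atoms[u]
    path = path + (u,)
    subs = sorted(label(v, depth - 1, path) for v in graph[u] if v not in path)
    return atoms[u] + "".join(subs)


def distance(source):
    """Breadth-first search distances (in bonds) from source to every atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


lam = {u: label(u, 2) for u in graph}               # λ(2) for every vertex
mu_1_2 = f"{lam[1]}{distance(1)[2]}{lam[2]}"        # μ(2) for the pair (1,2)

# Sorted concatenation of λ(u) n λ(v) over all ordered pairs (u, v).
pairs = [f"{lam[u]}{d}{lam[v]}" for u in graph for v, d in distance(u).items()]
concatenated = "".join(sorted(pairs))

print(lam[1], lam[2], lam[3], mu_1_2)
```

Under these assumptions the labels for vertices 1, 2 and 3 and the pair label for (1,2) agree with the table, and the sorted concatenation opens with the same CCCCCCHH0CCCCCCHH entries as the UCK string on the following slide.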


Example 2 – Benzoic acid (NSC149)

The UCK String:C7H6O2-

CCCCCCHH0CCCCCCHHCCCCCCHH0CCCCCCHHCCCCCCHH1CCCHCCHCOOCCCCCCHH1CCCHCCHCOOCCCCCCHH1CCCHCCHHCCCCCCHH1CCCHCCHHCCCCCCHH1HCCCCCCCCCHH1HCCCCCCCCCHH2CCCCCCHHCCCCCCHH2CCCCCCHHCCCCCCHH2CCCCOOHCCCCCCHH2CCCCOOHCCCCCCHH2CCCHCCHHCCCCCCHH2CCCHCCHHCCCCCCHH2HCCCCCCCCCHH2HCCCCCCCCCHH3CCCHCCHHCCCCCCHH3CCCHCCHHCCCCCCHH3HCCCCCCCCCHH3HCCCCCCCCCHH3HCCCCCCCCCHH3HCCCCCCCCCHH3OCCOCCCCCCHH3OCCOCCCCCCHH3OCCOHCCCCCCHH3OCCOHCCCCCCHH4HCCCCCCCCCHH4HCCCCCCCCCHH4HOCCCCCCCHH4HOCCCCCOOH0CCCCOOHCCCCOOH1CCCHCCHCOOCCCCOOH1OCCOCCCCOOH1OCCOHCCCCOOH2CCCCCCHHCCCCOOH2CCCCCCHHCCCCOOH2HOCCCCCOOH3CCCHCCHHCCCCOOH3CCCHCCHHCCCCOOH3HCCCCCCCOOH3HCCCCCCCOOH4CCCHCCHHCCCCOOH4HCCCCCCCOOH4HCCCCCCCOOH5HCCCCCCHCCHCOO0CCCHCCHCOOCCCHCCHCOO1CCCCCCHHCCCHCCHCOO1CCCCCCHHCCCHCCHCOO1CCCCOOHCCCHCCHCOO2CCCHCCHHCCCHCCHCOO2CCCHCCHHCCCHCCHCOO2HCCCCCCHCCHCOO2HCCCCCCHCCHCOO2OCCOCCCHCCHCOO2OCCOHCCCHCCHCOO3CCCHCCHHCCCHCCHCOO3HCCCCCCHCCHCOO3HCCCCCCHCCHCOO3HOCCCCHCCHCOO4HCCCCCCHCCHH0CCCHCCHHCCCHCCHH0CCCHCCHHCCCHCCHH0CCCHCCHHCCCHCCHH1CCCCCCHHCCCHCCHH1CCCCCCHHCCCHCCHH1CCCHCCHHCCCHCCHH1CCCHCCHHCCCHCCHH1CCCHCCHHCCCHCCHH1CCCHCCHHCCCHCCHH1HCCCCCCHCCHH1HCCCCCCHCCHH1HCCCCCCHCCHH2CCCCCCHHCCCHCCHH2CCCCCCHHCCCHCCHH2CCCHCCHCOOCCCHCCHH2CCCHCCHCOOCCCHCCHH2CCCHCCHHCCCHCCHH2CCCHCCHHCCCHCCHH2HCCCCCCHCCHH2HCCCCCCHCCHH2HCCCCCCHCCHH2HCCCCCCHCCHH2HCCCCCCHCCHH2HCCCCCCHCCHH3CCCCCCHHCCCHCCHH3CCCCCCHHCCCHCCHH3CCCCOOHCCCHCCHH3CCCCOOHCCCHCCHH3CCCHCCHCOOCCCHCCHH3HCCCCCCHCCHH3HCCCCCCHCCHH3HCCCCCCHCCHH3HCCCCCCHCCHH4CCCCOOHCCCHCCHH4HCCCCCCHCCHH4HCCCCCCHCCHH4OCCOCCCHCCHH4OCCOCCCHCCHH4OCCOHCCCHCCHH4OCCOHCCCHCCHH5HOCCCCHCCHH5HOCCCCHCCHH5OCCOCCCHCCHH5OCCOHCCCHCCHH6HOCHCCC0HCCCHCCC0HCCCHCCC0HCCCHCCC0HCCCHCCC0HCCCHCCC1CCCCCCHHHCCC1CCCCCCHHHCCC1CCCHCCHHHCCC1CCCHCCHHHCCC1CCCHCCHHHCCC2CCCCCCHHHCCC2CCCCCCHHHCCC2CCCHCCHCOOHCCC2CCCHCCHCOOHCCC2CCCHCCHHHCCC2CCCHCCHHHCCC2CCCHCCHHHCCC2CCCHCCHHHCCC2CCCHCCHHHCCC2CCCHCCHHHCCC3CCCCCCHHHCCC3CCCCCCHHHCCC3CCCCCCHHHCCC3CCCCCCHHHCCC3CCCCOOHHCCC3CCCCOOHHCCC3CCCHCCHCOOHCCC3CCCHCCHCOOHCCC3CCCHCCHHHCCC3CCCHCCHHHCCC3CCCHCCHHHCCC3CCCHCCHHHCCC3HCCCHCCC3H
CCCHCCC3HCCCHCCC3HCCCHCCC3HCCCHCCC3HCCCHCCC3HCCCHCCC3HCCCHCCC4CCCCCCHHHCCC4CCCCCCHHHCCC4CCCCOOHHCCC4CCCCOOHHCCC4CCCHCCHCOOHCCC4CCCHCCHHHCCC4CCCHCCHHHCCC4HCCCHCCC4HCCCHCCC4HCCCHCCC4HCCCHCCC4HCCCHCCC4HCCCHCCC4HCCCHCCC4HCCCHCCC4OCCOHCCC4OCCOHCCC4OCCOHHCCC4OCCOHHCCC5CCCCOOHHCCC5HCCCHCCC5HCCCHCCC5HCCCHCCC5HCCCHCCC5HOCHCCC5HOCHCCC5OCCOHCCC5OCCOHCCC5OCCOHHCCC5OCCOHHCCC6HOCHCCC6HOCHCCC6OCCOHCCC6OCCOHHCCC7HOCHOC0HOCHOC1OCCOHHOC2CCCCOOHHOC3CCCHCCHCOOHOC3OCCOHOC4CCCCCCHHHOC4CCCCCCHHHOC5CCCHCCHHHOC5CCCHCCHHHOC5HCCCHOC5HCCCHOC6CCCHCCHHHOC6HCCCHOC6HCCCHOC7HCCCOCCO0OCCOOCCO1CCCCOOHOCCO2CCCHCCHCOOOCCO2OCCOHOCCO3CCCCCCHHOCCO3CCCCCCHHOCCO3HOCOCCO4CCCHCCHHOCCO4CCCHCCHHOCCO4HCCCOCCO4HCCCOCCO5CCCHCCHHOCCO5HCCCOCCO5HCCCOCCO6HCCCOCCOH0OCCOHOCCOH1CCCCOOHOCCOH1HOCOCCOH2CCCHCCHCOOOCCOH2OCCOOCCOH3CCCCCCHHOCCOH3CCCCCCHHOCCOH4CCCHCCHHOCCOH4CCCHCCHHOCCOH4HCCCOCCOH4HCCCOCCOH5CCCHCCHHOCCOH5HCCCOCCOH5HCCCOCCOH6HCCC


Analysis of NCI Database

Description                                     Number     Remark
Total number of chemical compounds              236,917    Some compounds have duplicate entries
Number of chem. comp. with single entry         202,384    All gave unique UCK
Number of chem. comp. with 2 or more entries    33,533     UCK gave same key to same compounds


Techniques Extend to Pathways

KEGG database : Lysine biosynthesis


Part 5

Summary & Conclusion


There is a Sweet Spot

There is a sweet spot technically for integrating distributed tabular data using universal keys
There are many, many technical reasons why this is stupid
– By and large, the same technical reasons apply to using URLs to access remote data
There are some very practical reasons to use this approach.


DataSpace References

Robert Grossman and Marco Mazzucco, DataSpace - A Web Infrastructure for the Exploratory Analysis and Mining of Data, IEEE Computing in Science and Engineering, July/August, 2002, pages 44-51.

Asvin Ananthanarayan, Rajiv Balachandran, Yunhong Gu, Robert Grossman, Xinwei Hong, Jorge Levera, Marco Mazzucco, Data Webs for Earth Science Data, Parallel Computing, Volume 29, 2003, pages 1363-1379.

Alon Halevy, Michael Franklin and David Maier, Principles of Dataspace Systems, Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 1 - 9, 2006.

Ian Foster and Robert L. Grossman, Data Integration in a Bandwidth Rich World, Communications ACM, Volume 46, Issue 11, November, 2003, pages 50-57.


Sector References

Robert L. Grossman, Yunhong Gu, David Handley, Michal Sabala, Joe Mambretti, Alex Szalay, Ani Thakar, Kazumi Kumazoe, Oie Yuji, Minsun Lee, Yoonjoo Kwon, and Woojin Seok, Data Mining Middleware for Wide Area High Performance Networks, Journal of Future Generation Computer Systems (FGCS), 2006.

Yunhong Gu, Robert L. Grossman, Alex Szalay and Ani Thakar, Distributing the Sloan Digital Sky Survey Using UDT and Sector, Second IEEE International Conference on E-Science and Grid Computing, 2006.


Unique Chemical Keys References

Robert L. Grossman, Pavan Kasturi, Donald Hamelberg, Bing Liu, An Empirical Study of the Universal Chemical Key Algorithm for Assigning Unique Keys to Chemical Compounds, Journal of Bioinformatics and Computational Biology, Volume 2, Number 1, 2004, pages 155-171.

Greeshma Neglur and Robert L. Grossman, Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples, 2nd International Workshop on Data Integration in the Life Sciences (DILS 2005), La Jolla, July 20-22, 2005.

NCI Open Database Compounds, retrieved from http://cactus.nci.nih.gov/ on August 10, 2006.

