A Tool for Supporting Integration Across Multiple Flat-File Datasets
Xuan Zhang, Gagan Agrawal
Ohio State University
Outline
• Motivation
• System Overview
• System Implementation– Languages– Query Execution
• Experiments
Motivation
• Biological researches ask for– Accessing multiple heterogeneous data
sources• Lack of common data model, data format
– Tracking multiple objects
• A motivating example: protein sequence analysis
An Example
>unknown sequence. ……MCMFGSSVIECPNPRIWFVWPYEFPLFLLPGGDRMEI……
NCBI protein-protein BLAST service
List of similar sequences
Clustering Analysis (ClustalW, RiPE, etc)
Format Information
Out: BLAST Fixed column width
BLAST Partial name
In: FASTA In: Full name
To predict protein function
Current Solution
• Manual– Copy-and-paste keyword search– Format conversion programs– NCBI link-out
• Database– Load data (BLAST output, sequence
database)– Parse input; Re-format output
Our Approach
• Join request between BLAST output and SWISSPROT (sequence database)
• Data maintained in flat files
• Query specification and data description are high-level, declarative
• Data parsing and query processing are behind the scene
Advantage
• Retrieve multiple pieces of information all at once
• Data easily available
• Declarative languages only
• High flexibility
• Low over-head
System Overviewquery
Query parser
Metadatacollection
Datasetdescriptors
Descriptorparser
Application analyzer
QUERYINFOR
DataReader DataWriter
synchronizer
Source data files
TargetData file
Source/target names
Schema & Layout information
mappings
Outline
• Motivation• System Overview• System Implementation
– Languages• Query Language• Metadata Description Language
– System• Query Analysis• Query Execution
• Experiments
Query Language
• Declarative, SQL-like• Projection, selection, cross product, join queries• Example AUTOWRAP POSTBLAST
FROM BLASTP, SWISSPROT
BY BLASTP.SP_ID = SWISSPROT.ID
WHERE
POSTBLAST.QUERY = BLASTP.QUERY
POSTBLAST.SP_AC = BLASTP.SP_AC
POSTBLAST.SP_ID = BLASTP.SP_ID
POSTBLAST.FULL_DESCR = SWISSPROT.DEPOSTBLAST.FULL_DESCR = SWISSPROT.DE
POSTBLAST.SEQUENCE = SWISSPORT.SQPOSTBLAST.SEQUENCE = SWISSPORT.SQ
POSTBLAST.SCORE = BLASTP.SCORE
POSTBLAST.E_VALUE = BLASTP.E_VALUE
Target dataset
Source datasets
Join criteria
Attribute pairs
Metadata Description Language
• One descriptor for each flat file dataset. – Schema– Layout
• Re-usable by different queries
• Can be learned semi-automatically using data mining techniques
• Example: BLAST output
Schema Descriptors
• Written in XML DTD format
• Example<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT BLASTP (QUERY, SEQUENCE*)>
<!ELEMENT QUERY (#PCDATA)>
<!ELEMENT SEQUENCE (SP_AC, SP_ID, DESCR, SCORE, E_VALUE)>
<!ELEMENT SP_AC (#PCDATA)>
<!ELEMENT SP_ID (#PCDATA)>
<!ELEMENT DESCR (#PCDATA)>
<!ELEMENT SCORE (#PCDATA)>
<!ELEMENT E_VALUE (#PCDATA)>
Layout Descriptors
• Example
DATASET "BLASTP" {
DATATYPE {BLASTP}
DATASPACE LINESIZE = 90 {
… …
}
DATA {data/Blast_htm.txt}
}
Dataset name
Schema name
File layout
File location
Description of File Layout
• Layout descriptor • Actual data file"BLASTP" VERSION
… …
"Query=" QUERY
"\nDatabase:" DB_NAME
<
"\nsp|" SP_AC
"|" SP_ID
" " DESCR
" " SCORE
" " E_VALUE
>
"\n\nALIGNMENT" DUMMY
BLASTP 2.2.11 [Jun-05-2005]
Reference: … …
RID: … …
Query= Random 50 residue protein sequence.
Database: Non-redundant SwissProt sequences 175,661 sequences; 64,716,374 total letters
Score ESequences producing significant alignments: (Bits) Value
sp|P11884|AL1A1_SHEEP Modification methylase MwoI (N-4 cytosin... 30.0 1.5 sp|P00352|AL1A1_HUMAN Oxygen-independent coproporphyrinogen II... 28.1 5.7 sp|P40530|YIE2_YEAST Oxygen-independent coproporphyrinogen II... 28.1 5.7
ALIGNMENTS>sp|P11884|AL1A1_SHEEP Modification methylase MwoI (N-4 cytosine-specific … …
Query Analysisquery
Query parser
Metadatacollection
Datasetdescriptors
Descriptorparser
QUERYINFOR
DataReader DataWriter
synchronizer
Source data files
TargetData file
Source/target names
Schema & Layout information
mappings
Application analyzerApplication analyzer
Terminology
• DLM-VAR node/pair– a pairing of a delimiter and an attribute value– E.g "Query=" QUERY
• Reach-ability– DLM-VAR node r is reachable from node a iff
configuration “ar” is allowed by the layout description
• Regular v.s Semi-structured Attribute– Regular: fixed number of values per entry– Semi-structured: various number of values per entry
• Number v.s Index– Label for layout node v.s schema node– 1 index/number, 1+ number/index
Application Analyzer
1. Label Schema and Layout Tree2. Query analysis
– Record layout information• Delimiter look-up table
– Draw correspondence between schema and layout• Label look-up table
– Collect constant values in query• Pseudo-label look-up table
– Calculate reachable nodes• Reachable look-up table
– Other information Parameters
QUERYINFORQUERYINFOR
QUERY-PROC Structure
• Three general action modules– DataReader– DataWriter– Synchronizer
• One query-specific data module– QUERYINFOR
QUERYINFOR
DataWriterDataReader
Synchronizer
Source 1
Source 2Target
QUERY-PROC Structure (cont.)
• One value buffer– Configuration vary from query to query– Accessible to three general modules
QUERY
SP_ID
SCORE
E_VALUE
SP_AC
BLASTP
Source 1
Source 2
Regular Semi-structured
QUERY-PROC Action
• DataReader– Extract attribute value
• Start: Delimiter look-up table• End: Reachable look-up table
– Fill value buffer: Label look-up table
• DataWriter– Retrieve from value buffer: Label look-up
table– Write target file: Delimiter look-up table
• Truncate or wrap: Reachable look-up table + label look-up table
QUERY-PROC Action (cont.)
• Synchronizer– Set up pseudo-attributes: Pseudo label look-
up table– Call DataReader on source 1 and 2, Call
DataWriter on target: Parameters– Test join conditions: Parameters– Clean value buffer: Parameters
Outline
• Motivation• System Overview• System Implementation
– Languages– System
• Experiments
Outline
Post-BLAST Query
• Enhance BLAST output
• Join query between BLAST output and SWISSPROT
• Results in FASTA format
• 2 modes– UNIQUE: halt once a
match is found in source 2
– ALL: search all source 2 entries
0
200
400
600
800
1000
1200
1400
1600
1800
2000
Tim
e (
se
c)
3 5 12
Query Size (Sequence Number)
UNIQUE
ALL
Chip-supplement Query
• Look up microarray genes information
• Join query between protein array and yeast genome database
• Results in tabular form• 2 queries
– Chip-Supplement: array join genome
– Chip-Supplement-Sorted: genome join array
0
10
20
30
40
50
60
70
80
90
Tim
e (
se
c)
Chip-Supplement Chip-Supplement-
Sorted
Query Type
UNIQUE
ALL
OMIM-plus Query
• Add reverse links of proteins to disease database
• Join query between OMIM database and SWISSPROT database
• Results in OMIM form
• 86.38 seconds/entry * 12,158 OMIM entry = 291.7 hours
Summary
• A data integration tool
• Answers query on flat-file datasets
• Light-weighted– Modest programming efforts– No DBMS– Various flat file formats supported