The Integrated Molecular Analysis of Genomes and their Expression
Consortium’s Data Mining Tools: Introducing the IQ
Peg Folta
Lawrence Livermore National Laboratory
3/12/02
TRANSCRIPTOME 2002 Seattle, WA
I.M.A.G.E. maintains the world’s largest publicly available cDNA collection
5,819,514 clones arrayed
I.M.A.G.E. clones account for 64% of human ESTs in GenBank
[Chart legend: cumulative, arrayed]
The I.M.A.G.E. collection has been shaped by projects (C-GAP, MGC…)
[Charts: composition of the collection by category]
• Species: Human, Mouse, Zebrafish, Xenopus, Other
• Library method: Standard, Full-length, Normalized, Subtracted, Norm/Sub, Norm/FL
• Developmental state: adult, embryonic, juvenile
• Tissue: normal, abnormal, treated
• Clone sequence: 3' EST, 5' EST, Full length
Informatics focus this year was on tools to characterize and query the collection.
• IMAGEne – mature clustering tool
• IMAGEne Tissue – allows searching of tissue type dominance in clusters
• IQ – Intelligent Query tool allows mining of I.M.A.G.E. data
• Library/plate query – allows selective searching of libraries and plates
• Problem report and query – allows users to report or query problems related to I.M.A.G.E. clones
Redesign of data management system
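One way to picture the "tissue type dominance" search offered by IMAGEne Tissue is as a per-cluster tally of the tissue annotations on its clones. The sketch below is illustrative only: the function name, the 50% threshold, and the sample clusters are assumptions, not the tool's actual definition of dominance.

```python
from collections import Counter

def dominant_tissue(clone_tissues, min_fraction=0.5):
    """Return (tissue, fraction) when one tissue accounts for at least
    min_fraction of a cluster's clones, otherwise None.

    The 0.5 threshold is an assumed value for illustration."""
    if not clone_tissues:
        return None
    tissue, count = Counter(clone_tissues).most_common(1)[0]
    fraction = count / len(clone_tissues)
    return (tissue, fraction) if fraction >= min_fraction else None

# Hypothetical clusters: each maps to the tissue label of every member clone.
clusters = {
    "cluster_A": ["liver", "liver", "liver", "brain"],
    "cluster_B": ["brain", "kidney", "liver", "heart"],
}
dominance = {name: dominant_tissue(tissues) for name, tissues in clusters.items()}
```

Here `cluster_A` is reported as liver-dominated, while `cluster_B` has no dominant tissue.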
IMAGEne-Human Process
• Inputs: 2,289,020 quality I.M.A.G.E. sequences; 14,566 NCBI RefSeq
• IMAGEne flow labels: 1,676,516 sequences; 623,294 sequences; remaining sequences
• Requirement: >50 basepairs of contiguous, non-repeat sequence
• Known clusters: 14,566
• Candidate clusters w/consensus: 67,521
• I.M.A.G.E. singletons: 268,472
• Lower quality I.M.A.G.E. sequences: 279,262
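The ">50 basepairs of contiguous, non-repeat sequence" requirement could be checked roughly as follows. This is a minimal sketch, assuming repeats have already been masked (lowercase letters or N/X) by an upstream step; the function names are hypothetical and this is not IMAGEne's own code.

```python
import re

MIN_CLEAN_RUN = 50  # from the slide: more than 50 contiguous non-repeat bases

def longest_clean_run(seq):
    """Length of the longest stretch of unmasked A/C/G/T bases.

    Assumption: masked repeats appear as lowercase letters, N, or X."""
    runs = re.findall(r"[ACGT]+", seq)
    return max((len(run) for run in runs), default=0)

def passes_filter(seq):
    """True when the sequence has a clean run strictly longer than the threshold."""
    return longest_clean_run(seq) > MIN_CLEAN_RUN

# A 60-base clean run followed by masked bases passes; two 48-base runs do not.
good = "ACGT" * 15 + "nnnn" + "ACGT" * 5
bad = "ACGT" * 12 + "N" + "ACGT" * 12
```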
Known gene clusters with full length I.M.A.G.E. clones have doubled in number.
[Chart: # of clusters (0 to 16,000) by IMAGEne version (V3.0, V3.1, V3.2.1, V3.3), split by cluster coverage: Empty, Unknown, Partial, Predicted Full, Full. Avg. gene length values shown: 3392, 2763, 3380, 1896, 1578.]
Known Gene Cluster distribution of full length clones
[Chart: number of clones (0 to 500) versus length of clone (200 to 7200); avg. length = 948]
Candidate gene clusters: consensus sequences and contigs are generated by CAP4
[Chart: number of clusters versus number of contigs in a cluster (x axis 1 to 21). Bar labels shown: 61,314; 4,971; 824; 95; 227; 40.]
Singleton: Wheat within the chaff
[Chart: # of sequences (0 to 1200) versus high quality sequence length (0 to 4000)]
305 full insert sequences are singletons.
62,143 singletons have a 3’ PolyA site.
Avg. length is 547
Introducing the Intelligent Query - IQ
• For a given category (currently clone and library) a user can specify a query based on key database attributes.
• The user can specify the fields returned.
• Various result format options (HTML, text)
• Initial version was rolled out last summer
• New functionality to be added this year (additional categories, etc.)
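The query model described above (pick a category, filter on key attributes, choose the returned fields, choose HTML or text output) can be sketched in a few lines. Everything here is hypothetical: `run_query`, the attribute names, and the sample records are stand-ins for illustration and do not reflect the IQ tool's real interface or schema.

```python
from html import escape

def run_query(records, criteria, fields, fmt="text"):
    """Filter records on exact attribute matches and return the chosen fields.

    records  -- list of dicts for one category (e.g. clones)
    criteria -- attribute/value pairs that every returned record must match
    fields   -- names of the columns to return, in order
    fmt      -- "text" (tab-separated) or "html" (a simple table)
    """
    matches = [r for r in records if all(r.get(k) == v for k, v in criteria.items())]
    rows = [[str(r.get(f, "")) for f in fields] for r in matches]
    if fmt == "text":
        return "\n".join("\t".join(row) for row in [list(fields)] + rows)
    if fmt == "html":
        head = "".join(f"<th>{escape(f)}</th>" for f in fields)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
            for row in rows
        )
        return f"<table><tr>{head}</tr>{body}</table>"
    raise ValueError(f"unsupported format: {fmt}")

# Made-up clone records; the identifiers are placeholders, not real I.M.A.G.E. IDs.
clones = [
    {"clone_id": "example-1", "species": "Human", "library": "lib-A", "tissue": "liver"},
    {"clone_id": "example-2", "species": "Mouse", "library": "lib-B", "tissue": "brain"},
    {"clone_id": "example-3", "species": "Human", "library": "lib-A", "tissue": "brain"},
]
text_result = run_query(clones, {"species": "Human"}, ["clone_id", "tissue"])
html_result = run_query(clones, {"species": "Human", "tissue": "brain"}, ["clone_id"], fmt="html")
```

Keeping the filter, the field selection, and the output format as separate arguments mirrors the three choices the bullets give the user.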
Other tools to mine I.M.A.G.E. information
Query plates from libraries. Query for reported problems.
Quality control for historical collection
Plates        Source                                     Well Error Rate
1-3705        Incyte                                     13
              LLNL Master                                10
              Research Genetics                          12
              Resource Center of Human Genome Project    10
              ATCC                                       11
3,796-6000    Incyte                                     7
              LLNL Master                                7
              Research Genetics                          10
              Resource Center of Human Genome Project    12
QC on-going: LLNL Replication Master vs. GenBank
Month     Well error rate   Plate error rate   Well error rate   Plate error rate
6/2000    1 (1,3)           0                  7 (4,11)          2
10/2000   1 (0,3)           0                  1 (0,3)           2
12/00     0 (0,2)           2                  1 (0,3)           2
1/01      2 (1,4)           0                  6 (4,11)          3
2/01      1 (0,3)           0                  2 (1,5)           2
3/01      2 (1,5)           2                  2 (1,5)           0
4/01      1 (0,3)           2                  2 (1,4)           0
5/01      0 (0,1)           0                  2 (1,5)           0
6/01      1 (0,3)           0                  1 (0,4)           0
7/01      1 (0,4)           0                  2 (1,6)           0
8/01      2 (1,3)           0                  3 (2,6)           0
Next for I.M.A.G.E. Informatics
• Extensive expansion of query tools and data access
• IMAGEne non-species specific
• Analysis of human cluster candidate genes and singletons
• Redesign of the web site for easier navigation
Strongly influenced by public needs: you have a say!
Acknowledgements
• LLNL
  – Christa Prange, I.M.A.G.E. PI
  – Tim Harsch, Amber Johnston, Julie Amundson
• Sponsors
  – DOE, Marv Stodolsky
  – NIH, Bob Strausberg
This work was partially funded by the NIH and was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48.
image.llnl.gov