The Integrated Molecular Analysis of Genomes and their Expression Consortium’s Data Mining Tools: Introducing the IQ

Peg Folta

Lawrence Livermore National Laboratory, 3/12/02

TRANSCRIPTOME 2002, Seattle, WA

I.M.A.G.E. maintains the world’s largest publicly available cDNA collection

5,819,514 clones arrayed

I.M.A.G.E. clones account for 64% of human ESTs in GenBank

[Chart: cumulative and arrayed clone counts]

The I.M.A.G.E. collection has been shaped by projects (C-GAP, MGC…)

[Charts: composition of the collection]

• Species: Xenopus, Human, Other, Zebrafish, Mouse

• Library Method: Standard, Full-length, Norm/Sub, Normalized, Subtracted, Norm/FL

• Developmental state: adult, embryonic, juvenile

• Tissue: abnormal, normal, treated

• Clone sequence: 3' EST, 5' EST, Full length

Informatics focus this year was on tools to characterize and query the collection.

• IMAGEne – mature clustering tool

• IMAGEne Tissue – allows searching of tissue type dominance in clusters

• IQ – Intelligent Query tool allows mining of I.M.A.G.E. data

• Library/plate query – allows selective searching of libraries and plates

• Problem report and query – allows users to report or query problems related to I.M.A.G.E. clones

Redesign of data management system

IMAGEne-Human Process

[Flow diagram; labels as shown]

• 2,289,020 quality I.M.A.G.E. sequences

• 14,566 NCBI RefSeq

• IMAGEne

• 1,676,516 sequences

• 623,294 sequences

• Remaining sequences

• >50 basepairs of contiguous, non-repeat sequence

• Known clusters: 14,566

• Candidate clusters w/consensus: 67,521

• I.M.A.G.E. singletons: 268,472

• 279,262 lower quality I.M.A.G.E. sequences
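The ">50 basepairs of contiguous, non-repeat sequence" label can be read as a filter on sequences. The following is a minimal sketch of one possible interpretation, not the IMAGEne implementation: it assumes repeats and low-quality bases have already been hard-masked as N or X, and the function names and threshold constant are hypothetical.

```python
import re

# Assumption: the criterion is "more than 50 contiguous unmasked bases".
MIN_RUN = 50


def longest_unmasked_run(seq):
    """Length of the longest stretch of A/C/G/T, assuming masked bases are N or X."""
    runs = re.findall(r"[ACGT]+", seq.upper())
    return max((len(run) for run in runs), default=0)


def passes_filter(seq, min_run=MIN_RUN):
    """True when the sequence has more than min_run contiguous unmasked bases."""
    return longest_unmasked_run(seq) > min_run


print(passes_filter("A" * 60))                        # one 60-base run
print(passes_filter("A" * 30 + "N" * 5 + "C" * 30))   # two 30-base runs
```

Sequences that are soft-masked (lowercase) would need a different pattern, since this sketch upper-cases the input before scanning.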

Initial query page, construct the query.

Clusters matching query results, choose your cluster.

Display of cluster

Known gene clusters with full length I.M.A.G.E. clones have doubled in number.

[Chart: # of clusters (0 to 16,000) by IMAGEne version (V3.0, V3.1, V3.2.1, V3.3), broken down by cluster coverage: Empty, Unknown, Partial, Predicted Full, Full. Avg. gene length values shown: 3392, 2763, 3380, 1896, 1578.]

Known Gene Cluster distribution of full length clones

[Histogram: Number of Clones (0 to 500) by Length of Clone (200 to 7200). avg. length = 948]

Candidate gene cluster consensus sequences and contigs are generated by CAP4

[Chart: Number of Clusters (0 to 1000) by Number of Contigs in a cluster (1 to 21). Data labels shown: 61,314; 4,971; 824; 95; 227; 40.]

Candidate Gene cluster characteristics.

[Chart: categories shown: full insert, 3'&5', 3' only, 5' only. Values shown: 1938, 26236, 28317, 11030.]
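The four counts on this slide add up to the 67,521 candidate clusters with consensus given on the IMAGEne-Human Process slide. A quick arithmetic check follows; the variable names are illustrative, and the counts are listed in the order they appear, without assuming which category each one belongs to.

```python
# Counts shown on the "Candidate Gene cluster characteristics" slide,
# in order of appearance (category assignment is not assumed here).
category_counts = [1938, 26236, 28317, 11030]

# Candidate clusters with consensus, from the IMAGEne-Human Process slide.
candidate_clusters = 67521

total = sum(category_counts)
print(total, total == candidate_clusters)
```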

Singleton: Wheat within the chaff

[Histogram: # of sequences (0 to 1200) by High Quality Sequence Length (0 to 4000)]

305 full insert sequences are singletons.

62,143 singletons have a 3' PolyA site.

Avg. length is 547

IMAGEne Tissue query allows searching for tissue proportions within clusters.
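As an illustration of what "tissue proportions within clusters" could mean in practice, here is a minimal sketch. The cluster data, function names, and threshold are all hypothetical and are not taken from the IMAGEne Tissue tool.

```python
from collections import Counter

# Hypothetical data: cluster id -> source tissue of each member clone.
CLUSTERS = {
    "cluster_1": ["brain", "brain", "liver", "brain"],
    "cluster_2": ["liver", "kidney"],
}


def tissue_proportions(tissues):
    """Fraction of a cluster's clones that come from each tissue."""
    counts = Counter(tissues)
    total = sum(counts.values())
    return {tissue: n / total for tissue, n in counts.items()}


def clusters_dominated_by(clusters, tissue, min_fraction):
    """Ids of clusters where the tissue makes up at least min_fraction of clones."""
    return [
        cluster_id
        for cluster_id, tissues in clusters.items()
        if tissue_proportions(tissues).get(tissue, 0.0) >= min_fraction
    ]


print(tissue_proportions(CLUSTERS["cluster_1"]))
print(clusters_dominated_by(CLUSTERS, "brain", 0.5))
```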

Introducing the Intelligent Query - IQ

• For a given category (currently clone and library) a user can specify a query based on key database attributes.

• The user can specify the fields returned.

• Various result format options (HTML, text)

• Initial version was rolled out last summer

• New functionality to be added this year (additional categories, etc.)
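The query model described in the bullets above (filter on attributes, choose the returned fields, pick an output format) can be sketched as follows. This is an illustration only: the records, attribute names, and the functions `iq_query` and `render` are hypothetical and do not reflect the actual IQ interface or the I.M.A.G.E. database schema.

```python
from html import escape

# Hypothetical clone records; the attribute names are assumptions.
CLONES = [
    {"image_id": 101, "species": "Human", "tissue": "brain", "library": "lib_a"},
    {"image_id": 102, "species": "Mouse", "tissue": "liver", "library": "lib_b"},
    {"image_id": 103, "species": "Human", "tissue": "liver", "library": "lib_c"},
]


def iq_query(records, criteria, fields):
    """Keep records matching every criterion; return only the chosen fields."""
    return [
        {name: rec[name] for name in fields}
        for rec in records
        if all(rec.get(key) == value for key, value in criteria.items())
    ]


def render(rows, fields, fmt="text"):
    """Format result rows as tab-separated text or as a simple HTML table."""
    if fmt == "text":
        lines = ["\t".join(fields)]
        lines += ["\t".join(str(row[name]) for name in fields) for row in rows]
        return "\n".join(lines)
    if fmt == "html":
        head = "".join(f"<th>{escape(name)}</th>" for name in fields)
        body = "".join(
            "<tr>"
            + "".join(f"<td>{escape(str(row[name]))}</td>" for name in fields)
            + "</tr>"
            for row in rows
        )
        return f"<table><tr>{head}</tr>{body}</table>"
    raise ValueError(f"unknown format: {fmt}")


rows = iq_query(CLONES, {"species": "Human"}, ["image_id", "tissue"])
print(render(rows, ["image_id", "tissue"], fmt="text"))
```

Keeping the filter step separate from the rendering step means the same result rows can be shown as HTML or text without re-running the query.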

Specify a clone-based query.

Next, specify which clone-centric results will be returned and in what format.

HTML version of clone-based query results.

Specify a library-based query.

Similarly, specify which library-centric results will be returned.

HTML version of library-based query results.

Other tools to mine I.M.A.G.E. information

Query plates from libraries. Query for reported problems.

Quality control for historical collection

Plates        Source                                     Well Error Rate
1-3705        Incyte                                     13
              LLNL Master                                10
              Research Genetics                          12
              Resource Center of Human Genome Project    10
              ATCC                                       11
3,796-6000    Incyte                                     7
              LLNL Master                                7
              Research Genetics                          10
              Resource Center of Human Genome Project    12

Ongoing QC results: LLNL Replication Master vs. GenBank

Month      Well error rate   Plate error rate   Well error rate   Plate error rate
6/2000     1 (1,3)           0                  7 (4,11)          2
10/2000    1 (0,3)           0                  1 (0,3)           2
12/00      0 (0,2)           2                  1 (0,3)           2
1/01       2 (1,4)           0                  6 (4,11)          3
2/01       1 (0,3)           0                  2 (1,5)           2
3/01       2 (1,5)           2                  2 (1,5)           0
4/01       1 (0,3)           2                  2 (1,4)           0
5/01       0 (0,1)           0                  2 (1,5)           0
6/01       1 (0,3)           0                  1 (0,4)           0
7/01       1 (0,4)           0                  2 (1,6)           0
8/01       2 (1,3)           0                  3 (2,6)           0

Slide labels: On-going comparing master to GenBank; Error in replication @ LLNL

Next for I.M.A.G.E. Informatics

• Extensive expansion of query tools and data access

• IMAGEne non-species specific

• Analysis of human cluster candidate genes and singletons

• Redo of web site, easier to navigate

MUCH influenced by public needs…..you have a say!

Acknowledgements

• LLNL

  – Christa Prange, I.M.A.G.E. PI

  – Tim Harsch, Amber Johnston, Julie Amundson

• Sponsors

  – DOE, Marv Stodolsky

  – NIH, Bob Strausberg

This work was partially funded by the NIH and was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48.

image.llnl.gov

