
APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Date post: 18-Mar-2016
APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery. 2003.8.27 Sangsoo Kim Nat’l Genome Informat’n Ct. Korea Res. Inst. of Biosci. & Biotech. Bio-Databases & Servers. Contents Bibliographic (Journal abstracts such as Medline) Experimental data (Sequences or structures) - PowerPoint PPT Presentation
Transcript
Page 1: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

APAN e-Science Workshop

e-Bio System for Bio-Knowledge Discovery

2003.8.27
Sangsoo Kim

Nat’l Genome Informat’n Ct.
Korea Res. Inst. of Biosci. & Biotech.

Page 2: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Bio-Databases & Servers

• Contents
– Bibliographic (Journal abstracts such as Medline)
– Experimental data (Sequences or structures)
– Results from annotation and analyses
– Bioinformatic analysis tools

• Purpose
– Storing & managing raw data
– Querying for knowledge discovery
– Sharing information with others
– Serving others with online analysis

Page 3: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

New Role of Databases

• New discoveries of biological knowledge are published in scientific journals
• But journal space is limited and not suited to publishing large amounts of high-throughput data
• The supplementary information is provided on an accompanying website
• Readers can download the supplementary information and analyze it from different perspectives
• Combining it with other information may yield unexpected results
• Journal publishers require supplementary information to be deposited in public archives

Page 4: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Example - Nucleotide Sequence Repositories

• Nucleotide sequences discovered by sequencing experiments are deposited in one of the public archives, and the journal paper lists only the accession numbers (without deposition, you cannot publish a sequence discovery in journals)

• Public archives are
– DDBJ operated by CIB, NIG in Japan
– EMBL operated by EMBL-EBI in UK
– GenBank operated by NCBI, NIH in USA

• The contents of these archives are exchanged daily and are freely accessible to everybody

• Now extended to archive DNA chip data as well

Page 5: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Growth of GenBank

A Nucleotide Sequence Repository

Human Genome Project

Page 6: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

RTFM

Entrez: Home Page

Page 7: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

GenBank as HTML

Entrez: Display

FASTA as HTML

Page 8: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Example – BLAST Servers

• Originally developed to compare my sequence to those in the repository in order to check whether mine is novel or not
• Extended to detect distantly related sequences, serving as the major sequence annotation tool
• Servers accept various kinds of queries and return alignment results over WWW
• The most widely used bioinformatic tool
• For the analysis of many sequences, it is better to use a local installation

Page 9: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

BLAST (Basic Local Alignment Search Tool)

http://www.ncbi.nlm.nih.gov/BLAST

program   query      database
blastn    dna        dna
blastp    protein    protein
blastx    dna (6x)   protein
tblastn   protein    dna (6x)
tblastx   dna (6x)   dna (6x)

(6x: translated in all six reading frames)

RTFM
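The program list above can be read as a lookup from molecule types to programs. The following is a minimal sketch of that lookup, plus a helper that lists the six reading frames implied by "(6x)". The function names are illustrative and are not part of any BLAST distribution; the frames are returned untranslated, since a codon table is outside the scope of the slide.

```python
# Query and database molecule types per program, as listed on the slide.
BLAST_PROGRAMS = {
    "blastn": ("dna", "dna"),
    "blastp": ("protein", "protein"),
    "blastx": ("dna", "protein"),
    "tblastn": ("protein", "dna"),
    "tblastx": ("dna", "dna"),
}

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def candidate_programs(query_type, database_type):
    """Return the programs that accept this query/database pairing."""
    return sorted(
        name
        for name, (query, database) in BLAST_PROGRAMS.items()
        if query == query_type and database == database_type
    )


def six_frames(dna):
    """Return the six reading frames (3 forward, 3 reverse-complement).

    Each frame is trimmed to a whole number of codons and left untranslated.
    """
    reverse = dna.translate(_COMPLEMENT)[::-1]
    frames = []
    for strand in (dna, reverse):
        for offset in range(3):
            usable = len(strand) - offset
            frames.append(strand[offset: offset + usable - usable % 3])
    return frames
```

For a DNA query against a DNA database, two programs qualify: blastn compares the nucleotides directly, while tblastx compares the six-frame translations of both sides.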

Page 10: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Descriptions Alignments

BLASTN (Cont'd)

Page 11: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Example – Derived Databases

• Swiss-Prot & PIR
– Proteins are predicted from deposited nucleotide sequences, either mRNA or genomic DNA
– Functions and features of the proteins are annotated manually by experts

• Protein motifs
– Prosite, Pfam, BLOCKS, InterPro
– Keyword querying and motif detection in the user’s sequence

• Gene Ontology
– Hierarchical organization of biological terms
– Cataloging associated gene products

Page 12: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Expert Protein Analysis System

ExPASy (http://www.expasy.ch)

Page 13: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

NiceProt View

Page 14: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Gene Ontology

• Systematic classification of biological terminology
– Molecular function
– Biological process
– Cellular component

• Controlled vocabulary
• Associated GENE list
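A hierarchy of controlled terms with associated genes can be sketched as a small graph. The example below is a simplified illustration, not real GO data: the term relationships are reduced to a few parent links, and the gene names (geneA, geneB, geneC) are hypothetical. It shows why the hierarchy matters: a query on a general term also returns genes annotated to its more specific descendants.

```python
# Simplified, illustrative hierarchy: child term -> list of parent terms.
PARENTS = {
    "molecular function": [],
    "catalytic activity": ["molecular function"],
    "kinase activity": ["catalytic activity"],
    "binding": ["molecular function"],
}

# Hypothetical gene-product associations (term -> set of gene names).
ASSOCIATIONS = {
    "kinase activity": {"geneA"},
    "catalytic activity": {"geneB"},
    "binding": {"geneA", "geneC"},
}


def ancestors(term):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def genes_for(term):
    """Genes annotated to `term` itself or to any of its descendants."""
    return {
        gene
        for annotated_term, genes in ASSOCIATIONS.items()
        if annotated_term == term or term in ancestors(annotated_term)
        for gene in genes
    }
```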

Page 15: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery
Page 16: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Data Mining

• Objective:
– Discovery of (biological) knowledge by querying information in the databases and comprehending it

• Problems:
– Too many databases
– Different protocols for access
– Lack of standards
– Poor quality or propagation of errors

• Solutions:
– Data warehousing or federated databases

Page 17: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Catalog of Bio-DBs arranged by Data Domain

Page 18: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Database of Databases

• Data warehousing
– Collect all databases by mirroring
– Store in a unified format
– Entrez (NCBI) or SRS (EBI)
– Powerful but heavy maintenance load

• Federated databases
– Maintained by participating members
– Accessed by common protocols
– Bio-DAS or Web Services via SOAP/XML
– Next generation technology, but dependent on both the cooperation by members and Internet bandwidth
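The trade-off between the two approaches can be sketched in a few lines. This is an illustrative model only: the class names, the `search`/`dump` methods standing in for a "common protocol", and the sample records are all assumptions made for the example. The warehouse copies everything up front and answers from its local store; the federated search asks each member at query time.

```python
class Source:
    """A member database exposing agreed methods (the 'common protocol')."""

    def __init__(self, name, records):
        self.name = name
        self._records = dict(records)  # record id -> free-text description

    def add(self, record_id, text):
        self._records[record_id] = text

    def search(self, keyword):
        return {rid: text for rid, text in self._records.items() if keyword in text}

    def dump(self):
        return dict(self._records)


def federated_search(sources, keyword):
    """Query every member live; results are always current."""
    hits = {}
    for source in sources:
        for rid, text in source.search(keyword).items():
            hits[f"{source.name}:{rid}"] = text
    return hits


class Warehouse:
    """Mirror all members into one local store in a unified key format."""

    def __init__(self):
        self._store = {}

    def mirror(self, sources):
        for source in sources:
            for rid, text in source.dump().items():
                self._store[f"{source.name}:{rid}"] = text

    def search(self, keyword):
        return {key: text for key, text in self._store.items() if keyword in text}


# Hypothetical members and records, for illustration only.
member_a = Source("member_a", {"A1": "human kinase mRNA"})
member_b = Source("member_b", {"B1": "mouse kinase gene", "B2": "human globin gene"})
members = [member_a, member_b]

warehouse = Warehouse()
warehouse.mirror(members)
```

A record added to a member after mirroring is visible to `federated_search` immediately, but reaches the warehouse only after the next `mirror` run, which is the maintenance load the slide refers to.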

Page 19: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

www.ngic.re.kr

Page 20: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

www.ncbi.nih.gov/LocusLink

Page 21: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

New Data Types

• Textual
– Nucleotide or amino acid sequences
– Associated feature annotation
– Bibliographical texts

• Numeric
– Gene expression profiles
– Results from statistical analysis

• Graphical
– Protein-protein interaction network
– Genetic network
– Biochemical reaction pathways

Page 22: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery
Page 23: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery
Page 24: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery
Page 25: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Building a Nation from a Land of City States

Lincoln D. Stein
Cold Spring Harbor Laboratory

Page 26: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Italy in the Middle Ages

Page 27: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Bioinformatics, ca. 2002

Bioinformatics In the XXI Century

Page 28: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Making Easy Things Hard

Give me all human sequences submitted to GenBank/EMBL last week.

Page 29: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Lots of ways to do it

• Download weekly update of GenBank/EMBL from FTP site
• Use official network-based interfaces to data:
– NCBI toolkit
– EBI CORBA & XEMBL servers

• Use friendly web interfaces at NCBI, EBI
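As one concrete illustration of a network-based interface, the query "all human sequences from the last week" could be expressed as an NCBI E-utilities ESearch request. E-utilities is not named on the slide, so treat this as an assumed example of the general idea rather than the method the talk had in mind. The sketch only builds the request URL and sends nothing; the choice of `pdat` as the date type is also an assumption about which date field best matches "submitted".

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def recent_human_sequences_url(days=7, retmax=100):
    """Build an ESearch URL for human nucleotide records dated within `days`."""
    params = {
        "db": "nuccore",                   # nucleotide database
        "term": "Homo sapiens[Organism]",  # restrict to human records
        "reldate": days,                   # only records from the last N days
        "datetype": "pdat",                # assumed date field; others exist
        "retmax": retmax,                  # cap on the number of ids returned
    }
    return f"{ESEARCH}?{urlencode(params)}"
```

Fetching the URL and parsing the returned id list is left out deliberately, since that is exactly the fetch-and-parse duplication the following slides criticize.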

Page 30: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Perl/Java/Python to the Rescue

• One script to do the web fetch
• Another to parse the file format
• A third to move into private database
• A fourth to repeat this weekly
• Result:
– 6,719 scripts that do the same thing
– None of them work together
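The parse and load steps listed above might look like the following sketch. It assumes FASTA as the file format and SQLite as the private database, neither of which the slide specifies; the fetch step is replaced by a hard-coded sample string, and the record names and sequences are made up for illustration.

```python
import sqlite3


def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # The identifier is the first word of the header; the rest is description.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def load(conn, records):
    """Insert records, replacing rows with the same id so reruns are safe."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seq "
        "(id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO seq VALUES (?, ?, ?)", records)
    conn.commit()


# Stand-in for the fetch step: a hypothetical two-record download.
SAMPLE = """>seq1 hypothetical example record
ACGTAC
GTTA
>seq2 another hypothetical record
TTGACA
"""

conn = sqlite3.connect(":memory:")
load(conn, parse_fasta(SAMPLE))
```

Because `load` replaces rows by id, running it again on the same download (the "repeat this weekly" step) does not duplicate records.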

Page 31: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

What’s Wrong with This?

• My EMBL fetcher is poorly documented so you write your own
• Your fetcher won’t work with my parser
• My parser won’t work with your fetcher
• We’ve now wasted 20 hours rather than 10
• Multiply this by 6,719

Page 32: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

What else is Wrong?

• NCBI/EBI tweaks something
• 6,719 scripts fail at once
• 6,719 bioinformaticists tear their hair
• 21,261 biologists curse the bioinformaticists
• 6,719 bioinformaticists curse their own existence

Page 33: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Unifying Bioinformatics Services

MIMBD: Meetings on the Interconnection of Molecular Biology Databases

Federated models: Gaea, Kleisli
Data warehouses: GUS, MODs, Ensembl, UCSC
Ad hoc web services
Formal web services

Page 34: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Ad hoc services

BioXXX

Your Script

Conf file

Page 35: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Formal Web Services

SeqFetchService

BLATService

MicroarrayService

BLASTService

SeqFetchService

GOService

Page 36: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Formal Web Services

ServiceRegistry

SeqFetchService

BLATService

MicroarrayService

BLASTService

SeqFetchService

GOService

Page 37: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Formal Web Services

Your Script

ServiceRegistry

BioXXX MicroarrayService

SeqFetchService

BLATService

MicroarrayService

BLASTService

SeqFetchService

GOService
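The diagram suggests that a script asks the registry which providers offer a service and then calls one of them. A minimal in-process sketch of that idea follows. The class, the provider names, and the handler functions are all hypothetical; a real registry would return network endpoints rather than Python callables. Two providers register the same service type here, which may be what the repeated SeqFetchService box in the diagram stands for.

```python
class ServiceRegistry:
    """Maps a service type to the providers that have registered for it."""

    def __init__(self):
        self._providers = {}

    def register(self, service_type, provider, handler):
        self._providers.setdefault(service_type, []).append((provider, handler))

    def lookup(self, service_type):
        return list(self._providers.get(service_type, []))


def call_service(registry, service_type, *args):
    """Try each registered provider in turn and return the first success."""
    failures = []
    for provider, handler in registry.lookup(service_type):
        try:
            return provider, handler(*args)
        except Exception as exc:  # a failing provider should not stop the script
            failures.append(f"{provider}: {exc}")
    raise LookupError(f"no working provider for {service_type}: {failures}")


# Hypothetical providers: one unreachable, one working.
def offline_fetch(accession):
    raise ConnectionError("provider unreachable")


def working_fetch(accession):
    return f">{accession}\nACGT"  # placeholder sequence, not real data


registry = ServiceRegistry()
registry.register("SeqFetchService", "site_one", offline_fetch)
registry.register("SeqFetchService", "site_two", working_fetch)
```

The calling script names only the service type, so a provider can change or disappear without the script being rewritten.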

Page 38: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Technical Infrastructure is Here*

• Common vocabulary: GO
• Transport format: XML
• Data definition language: XSD
• Wire protocol: SOAP
• Service definition language: WSDL
• Service registry: UDDI

*(almost)
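To make the "XML over SOAP" layers concrete, here is a sketch of a SOAP 1.1 request envelope built with Python's standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace `urn:example:seqfetch`, the `getSequence` operation, and the `accession` parameter are hypothetical, since the slides do not define any actual service interface. In a real deployment those names would come from the service's WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:seqfetch"                     # hypothetical service


def build_request(accession):
    """Return a SOAP envelope asking the hypothetical service for a sequence."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}getSequence")
    ET.SubElement(call, f"{{{SERVICE_NS}}}accession").text = accession
    return ET.tostring(envelope, encoding="unicode")


def read_accession(xml_text):
    """Server-side counterpart: pull the accession back out of a request."""
    root = ET.fromstring(xml_text)
    node = root.find(
        f"{{{SOAP_NS}}}Body/{{{SERVICE_NS}}}getSequence/{{{SERVICE_NS}}}accession"
    )
    return None if node is None else node.text
```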

Page 39: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Distributed Annotation System
http://www.biodas.org

Reference Server

AC003027 AC005122 M10154

Annotation Server Annotation Server

AC003027 M10154 WI1029 AFM820 AFM1126 WI443

AC005122

Annotation Server

Thursday 10:30 AM
Canyon IV
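The diagram shows one reference server holding the sequences and several independent annotation servers describing features on them. A client's job is then to merge features from all annotation servers for a given region of a reference sequence. The sketch below models that merge in memory and does not reproduce the DAS wire protocol. The server names, the coordinates, and the assignment of marker labels to segments are invented for illustration; only the identifiers themselves appear on the slide.

```python
# Hypothetical annotation servers: name -> features on reference segments.
ANNOTATION_SERVERS = {
    "markers_a": [
        {"segment": "AC003027", "start": 1200, "end": 1450, "label": "WI1029"},
        {"segment": "AC005122", "start": 300, "end": 520, "label": "AFM820"},
    ],
    "markers_b": [
        {"segment": "AC003027", "start": 900, "end": 1300, "label": "WI443"},
    ],
}


def features_on(segment, start, stop, servers=ANNOTATION_SERVERS):
    """Merge features overlapping segment:start-stop from every server."""
    merged = []
    for server_name, features in servers.items():
        for feature in features:
            overlaps = feature["start"] <= stop and feature["end"] >= start
            if feature["segment"] == segment and overlaps:
                # Record which server supplied the feature.
                merged.append({**feature, "source": server_name})
    return sorted(merged, key=lambda f: (f["start"], f["end"], f["source"]))
```

Because every server refers to the same reference coordinates, the client can combine annotations from groups that never coordinated with each other.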

Page 40: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Europe, ca 2000

Page 41: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Bioinformatics, ca 2010?

Page 42: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

NGIC

KNIH

Human

Proteome

Animal

Ag-Bio

Crop

Plant

Microbial

Universities

Research Institutes

Industry

Collection and Sharing of National Genome Information

Page 43: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

NGIC

KNIH

Human

Proteome

Animal

Ag-Bio

Crop

Plant

Microbial

Data Grid

KISTI ETRI

Application Grid

National Genome Information Network

