Machine-learning in building bioinformatics databases for infectious diseases Victor Tong Institute for Infocomm Research A*STAR, Singapore ASEAN-China International Bioinformatics Workshop 2008 17 Apr 2008
Page 1

Machine-learning in building bioinformatics databases for infectious diseases

Victor Tong
Institute for Infocomm Research
A*STAR, Singapore

ASEAN-China International Bioinformatics Workshop 2008
17 Apr 2008

Page 2

Overview

Definitions and background

Architectures of existing immunological databases

Machine-learning for biological databases

Conclusion

Page 3

The information centric world

Biology produces more data than we can process
  >3000 HLA alleles
  10^7 to 10^15 different T-cell receptors
  10^11 linear 9-mer epitopes
  Post-translational spliced epitopes

Data are stored in databases, literature, laboratory records, clinical records, …

A major issue: turning data into knowledge

Page 4

Use of bioinformatics

Impractical to do manual curation
  ≥ 16 million PubMed abstracts
  ~80K immunology-related references

Large amounts of data that are difficult to interpret
  Protein-protein interaction extraction from text

Bioinformatics: systematic construction and updating of databases

Page 5

Ad hoc bioinformatics

Biological system

Computational analysis

Biological interpretation

Page 6

More systematic use of bioinformatics

Biological system

Computational analysis

Biological interpretation

Formal description

Mathematical problem

Conversion of results

Page 7

Knowledge discovery from databases is the process of automated extraction of useful information or knowledge from individual or multiple databases

Page 8

1) Data explosion

Current databases:
  Volume of data increasing exponentially
  GenBank, SWISS-PROT, IMGT, PubMed, etc.

New databases:
  Growth in numbers
  Increase in size
  More complex

Biologists:
  Maintain personal data bank
  Information relevant to their research
  Define objectives for data mining and analysis

Page 9

2) Data quality

Nature of biological data:
  Fuzzy and complex
  Varying interpretations

Problems with raw data:
  Inconsistent
  Inaccurate
  Redundant
  Irrelevant
  Incomplete
  Incorrect

Data cleaning:
  Limit on the percentage error that can be tolerated in the data
  Prevent propagation of errors to our databases
  Prevent depreciation of data quality
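As a rough illustration of the data-cleaning idea, the sketch below drops incomplete and redundant records and flags a batch for manual review when the rejected fraction exceeds a tolerance. The record fields (`epitope`, `allele`), the function name and the 5% tolerance are hypothetical choices made for this example, not details from the talk.

```python
def is_blank(value):
    """True when a field is missing or contains only whitespace."""
    return value is None or not str(value).strip()


def clean_records(records, required=("epitope", "allele")):
    """Drop incomplete and redundant records.

    Returns (kept_records, rejected_fraction).
    """
    seen = set()
    kept = []
    for rec in records:
        # Incomplete: a required field is missing or blank.
        if any(is_blank(rec.get(field)) for field in required):
            continue
        # Redundant: the same normalised key has already been kept.
        key = tuple(str(rec[field]).strip().upper() for field in required)
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    rejected = 1 - len(kept) / len(records) if records else 0.0
    return kept, rejected


records = [
    {"epitope": "SIINFEKL", "allele": "H-2Kb"},
    {"epitope": "siinfekl", "allele": "H-2Kb"},        # redundant (case differs)
    {"epitope": "GILGFVFTL", "allele": ""},            # incomplete
    {"epitope": "GILGFVFTL", "allele": "HLA-A*02:01"},
]
kept, rejected = clean_records(records)

MAX_REJECTED = 0.05  # assumed tolerance, for illustration only
needs_review = rejected > MAX_REJECTED
```

Here half of the toy records are rejected, so the batch would be held back for review rather than loaded into the database.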

Page 10

3) Database creation and maintenance

Software tools and programming efforts:
  Data collection
  Constructing databases
  Integrating data mining tools
  Updating the databases

Nature of the databases:
  Short lifespan
  Hard to maintain

Page 11

4) Data integration

Disparities in data sources:
  Data structures
  Data formats
  Views
  Search mechanisms
  Location

Page 12

Overview

Definitions and background

Architectures of existing immunological databases

Machine-learning for biological databases

Conclusion

Page 13

Web-resources for immune epitope information

Immune Epitope Database and Analysis Resource (IEDB)
  Contains B-cell epitopes, T-cell epitopes and MHC ligands for humans, non-human primates, rodents and other animal species.
  URL: http://www.immuneepitope.org

The international ImMunoGeneTics information system (IMGT)
  Specializes in Ig, T-cell receptors, MHC, Ig superfamily, MHC superfamily and related proteins of the immune system of human and other vertebrate species.
  URL: http://imgt.cines.fr/

SYFPEITHI
  Contains ~3,500 T-cell epitopes, MHC ligands and peptide motifs for humans and rodents.
  URL: http://www.syfpeithi.de/

Page 14

Web-resources for immune epitope information

MHCBN
  Contains T-cell epitopes, TAP ligands, MHC binding peptides and MHC non-binding peptides for humans and rodents.
  URL: http://www.imtech.res.in/raghava/mhcbn/

MPID-T
  Contains 3D structural information of 187 T-cell receptors, MHCs and interacting epitopes for humans and rodents, spanning 40 alleles.
  URL: http://surya.bic.nus.edu.sg/mpidt/

AntiJen/JenPep
  Contains T-cell epitopes, MHC ligands, TAP ligands and B-cell epitopes.
  URL: http://www.jenner.ac.uk/antijen/

Page 15

The IEDB class diagram

Page 16

Relationships between an epitope & contexts

Page 17
Page 18
Page 19

Overview

Definitions and background

Architectures of existing immunological databases

Machine-learning for biological databases

Conclusion

Page 20

Naïve Bayes classifiers

Attribute values are assumed to be conditionally independent given the target value

Goal: to assign a new instance, described by a set of attribute values <a1, a2, …, an>, the most probable target value Vtarget among the candidate values vj

The target class may be defined as:

$$V_{target} = \arg\max_{v_j} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$$
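The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the toy attributes (whether an abstract mentions "epitope" or "MHC"), the function names and the use of Laplace (add-one) smoothing are all assumptions made for the example. Log-probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow when there are many attributes.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """examples: list of (attribute_tuple, target_value) pairs."""
    class_counts = Counter(label for _, label in examples)
    # value_counts[label][i][a]: how often attribute i took value a in that class
    value_counts = defaultdict(lambda: defaultdict(Counter))
    domains = defaultdict(set)  # observed values of each attribute
    for attrs, label in examples:
        for i, a in enumerate(attrs):
            value_counts[label][i][a] += 1
            domains[i].add(a)
    return class_counts, value_counts, domains


def classify_nb(model, attrs):
    """Return the target value maximising P(vj) * prod_i P(ai | vj)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log P(vj)
        for i, a in enumerate(attrs):
            # Laplace-smoothed estimate of P(ai | vj) (an assumption here)
            num = value_counts[label][i][a] + 1
            den = n + len(domains[i])
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy attributes: (mentions "epitope", mentions "MHC")
examples = [
    (("yes", "yes"), "relevant"),
    (("yes", "no"), "relevant"),
    (("no", "yes"), "relevant"),
    (("no", "no"), "irrelevant"),
    (("no", "no"), "irrelevant"),
    (("yes", "no"), "irrelevant"),
]
model = train_nb(examples)
prediction = classify_nb(model, ("yes", "yes"))
```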

Page 21

Comparison of popular text classification algorithms

Dataset
  20,910 PubMed abstracts
  181,299 unique words

AROC (area under the ROC curve)
  NBC (naïve Bayes classifier): 0.838
  ANN (artificial neural network): 0.831
  SVM (support vector machine): 0.825
  DT (decision tree): 0.809

Wang et al., BMC Bioinformatics 2007, 8:269
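For readers unfamiliar with the metric, the AROC can be read as the probability that a randomly chosen relevant abstract receives a higher classifier score than a randomly chosen irrelevant one, with ties counted as half. The sketch below computes it directly from that definition; the scores and labels are made-up toy values, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def aroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (1 = positive, 0 = negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy classifier scores and true labels (hypothetical values)
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
area = aroc(scores, labels)  # 8 of the 9 positive/negative pairs are ordered correctly
```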

Page 22

Feature selection (FS)

Data source
  PubMed abstracts
  Medical Subject Headings (MeSH): the National Library of Medicine's controlled vocabulary, used for indexing articles and for cataloging books and other holdings
  Publication title
  Author(s)
  etc.

Page 23

Feature selection (FS)

Algorithms
  Document frequency (DF): ranks features based on the number of abstracts they appear in
  Information gain (IG): measures the number of bits of information obtained for category prediction based on whether a feature occurs in a document

$$IG(u) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(u) \sum_{i=1}^{m} P(c_i \mid u) \log P(c_i \mid u) + P(\bar{u}) \sum_{i=1}^{m} P(c_i \mid \bar{u}) \log P(c_i \mid \bar{u})$$

where u is the feature of interest, ū denotes its absence, and ci (i = 1, …, m) denotes the set of categories the documents belong to
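Both ranking criteria can be sketched on a toy corpus. In the example below each document is a set of terms plus a category; the documents, terms and function names are hypothetical, and base-2 logarithms are assumed because the slide speaks of bits. Probabilities are estimated as simple relative frequencies.

```python
import math
from collections import Counter


def document_frequency(docs):
    """docs: list of (set_of_terms, category). Returns term -> number of docs containing it."""
    df = Counter()
    for terms, _ in docs:
        df.update(set(terms))
    return df


def _sum_p_log_p(counts, total):
    """Sum of P log2 P over category counts (zero counts contribute nothing)."""
    return sum((c / total) * math.log2(c / total) for c in counts.values() if c)


def information_gain(docs, u):
    """IG of term u: category entropy minus entropy conditioned on presence/absence of u."""
    n = len(docs)
    all_cats = Counter(cat for _, cat in docs)
    with_u = Counter(cat for terms, cat in docs if u in terms)
    without_u = all_cats - with_u      # Counter subtraction keeps positive counts only
    n_u = sum(with_u.values())
    n_not_u = n - n_u
    ig = -_sum_p_log_p(all_cats, n)                             # -sum P(ci) log P(ci)
    if n_u:
        ig += (n_u / n) * _sum_p_log_p(with_u, n_u)             # P(u) term
    if n_not_u:
        ig += (n_not_u / n) * _sum_p_log_p(without_u, n_not_u)  # P(not u) term
    return ig


docs = [
    ({"epitope", "binding"}, "relevant"),
    ({"epitope", "mhc"}, "relevant"),
    ({"kinase", "binding"}, "irrelevant"),
    ({"kinase", "yeast"}, "irrelevant"),
]
df = document_frequency(docs)
ig_epitope = information_gain(docs, "epitope")  # separates the categories perfectly
ig_binding = information_gain(docs, "binding")  # carries no category information
```

In this toy corpus "epitope" and "binding" have the same document frequency, yet only "epitope" has non-zero information gain, which is one reason the two criteria can rank features differently.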

Page 24

Feature condensation (FC)

Stemming
  Reduces words to their common root, e.g. “binding”, “binds”, “bind” to “bind”
  Porter stemmer: AROC = 0.846 to AROC = 0.842
  Domain-specific vocabulary may be reduced to unsuitable terms
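To make the idea concrete, here is a deliberately crude suffix-stripping function. It is not the Porter stemmer, which applies a much longer sequence of rules; the suffix list and minimum stem length are arbitrary choices for illustration. It reproduces the "bind" example and also shows how blind suffix removal can mangle a biological term.

```python
def crude_stem(word, suffixes=("ing", "es", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 letters.

    A toy stand-in for a real stemmer, for illustration only.
    """
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [crude_stem(w) for w in ("binding", "binds", "bind")]  # all become "bind"
mangled = crude_stem("virus")  # the trailing "s" is stripped, giving "viru"
```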

Page 25

Feature extraction (FE)

Rules to capture immune-related expressions and group them together
  Reduction of feature space (i.e. number of unique words)
  Enrichment of information content
  Better performance?

Page 26

Feature extraction (FE)

Examples:

Sequence length
  identify the sequence length and replace it with “~range<50~” or “~range>50~”, according to whether the sequence to be mapped spans fewer or more than 50 amino acids

MHC alleles
  identify MHC alleles and replace them with “~mhc_allele~”

Protein sequences
  identify sequences as a) exclusively containing characters representing the 20 amino acids, b) in upper case with length > threshold, and replace them with “~sequence~”
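Rules of this kind map naturally onto regular expressions. The sketch below is one possible rendering, not the rule set used in the cited work: the three patterns, the threshold of 8 residues, the restriction to HLA-style allele names and the choice to treat a length of exactly 50 as "~range>50~" are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; real curation rules would need to be far more thorough.
MHC_ALLELE = re.compile(r"\bHLA-[A-Z]+\d*\*\d+(?::\d+)*")      # e.g. HLA-A*0201
SEQUENCE = re.compile(r"\b[ACDEFGHIKLMNPQRSTVWY]{8,}\b")       # upper-case run of the 20 aa letters
LENGTH = re.compile(r"\b(\d+)[ -]?(?:aa|amino acids?|residues?)\b")


def _length_token(match):
    """Map a stated sequence length to one of the two range tokens."""
    return "~range<50~" if int(match.group(1)) < 50 else "~range>50~"


def extract_features(text):
    """Replace alleles, sequences and stated lengths with generic tokens."""
    text = MHC_ALLELE.sub("~mhc_allele~", text)
    text = SEQUENCE.sub("~sequence~", text)
    text = LENGTH.sub(_length_token, text)
    return text


short_example = extract_features("The 9 aa peptide GILGFVFTL binds HLA-A*0201.")
long_example = extract_features("a 120 residue fragment")
```

Collapsing every allele name or peptide sequence into a single token is what shrinks the feature space: thousands of distinct strings become one feature whose presence is informative regardless of the specific allele or sequence.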

Page 27

Performance comparison

Wang et al., BMC Bioinformatics 2007, 8:269

Page 28

Overview

Definitions and background

Architectures of existing immunological databases

Machine-learning for biological databases

Conclusion

Page 29

Conclusion

Machine-learning algorithms enable a systematic approach to database construction and facilitate scientific discovery

It must be performed with due care and must be scientifically and technically sound

