Department of Clinical and Biological Sciences, Turin University.
Department of Genetics, General and Molecular Biology, Naples.
Department of Mathematics and Information Science, Italy.
CINECA, Italy.
R.A. Calogero, G. Iazzetti, S. Motta, G. Pedrazzi, S. Rago, E. Rossi, R. Turra
Data Mining applications in biological fields
• On sequence databases / molecular structures
Protein structure prediction, homology search, genomic sequence analysis, gene identification and mapping, gene expression microarrays, …
• On biomedical literature databases
Identification and classification of biological terms, identification of keywords and concepts, clustering, supervised classification, …
Biomedical literature analysis: state of the art
Two different approaches:
• Information Extraction
Application of Natural Language Processing techniques to produce structured representations (templates). Entities and relations must be defined before extraction from the texts. Syntactic and semantic analysis drive the extraction.
• Text Mining
Identification of word patterns within the document corpus. No entities are defined in advance, which allows new concepts and new relations to be identified. No semantics.
Text Mining - The KDD process
Databases, Web sites, …
→ documents selection → Target documents
→ grammatical analysis and lemmatization → Transformed documents
→ meta-information extraction → Keywords
→ Text Mining → Patterns
→ interpretation and validation of results → Knowledge
>>> 35:TOYOTA: Avalon Receives Top Score in Frontal Offset Crash Tests
Toyota Motor Corp.'s Avalon received the top score -- a "good" rating earning a "best pick" -- in the 40 mile per hour frontal offset crash tests on new or updated vehicles. The tests were conducted by the Insurance Institute for Highway Safety, a nonprofit group funded by automobile insurers. Nissan Motor Co. Ltd.'s Maxima midsize sedan and Infiniti I30 luxury sedan, the Nissan Sentra small car and Mazda Motor Corp.'s Mazda MPV minivan all scored "average" marks. Isuzu Motors Ltd.'s Rodeo sport utility, also sold by Honda Motor Co. Ltd. as the Honda Passport, earned a "poor" rating due to high crash forces recorded on the crash dummy's head, indicating an increased likelihood of injury. In the crash tests, the vehicles were driven into a deformable barrier at 40 mph, with the driver's side of the vehicle taking the impact. The tests measured the potential for injury to the head, neck, chest and foot areas, and the risk of intrusion into the passenger compartment.
SUBJECTS: Japan; Safety; Passenger Vehicles; SOURCE: Reuters, June 21, 2000; Japan; English
tn.5.26.35 SOURCE Reuters
tn.5.26.35 DATE 6/21/2000
tn.5.26.35 MONTHYEAR 2000_06
tn.5.26.35 SUBJECTS Japan
tn.5.26.35 SUBJECTS Passenger_Vehicles
tn.5.26.35 SUBJECTS Safety
tn.5.26.35 STATE Japan
tn.5.26.35 LANGUAGE English
tn.5.26.35 ORG2 TOYOTA
tn.5.26.35 NN area
tn.5.26.35 NN automobile
tn.5.26.35 NN average
tn.5.26.35 NN barrier
tn.5.26.35 NN car
tn.5.26.35 NN chest
tn.5.26.35 NN compartment
tn.5.26.35 NN crash
tn.5.26.35 NN driver
tn.5.26.35 NN dummy
tn.5.26.35 NN foot
tn.5.26.35 NN force
tn.5.26.35 NN group
tn.5.26.35 NN head
tn.5.26.35 NN hour
tn.5.26.35 NN impact
tn.5.26.35 NN injury
tn.5.26.35 NN insurer
tn.5.26.35 NN intrusion
tn.5.26.35 NN likelihood
tn.5.26.35 NN luxury
tn.5.26.35 NN mark
tn.5.26.35 NN mile
tn.5.26.35 NN neck
tn.5.26.35 NN offset
tn.5.26.35 NN passenger
tn.5.26.35 NN potential
tn.5.26.35 NN rating
tn.5.26.35 NN risk
tn.5.26.35 NN safety
tn.5.26.35 NN score
tn.5.26.35 NN sedan
tn.5.26.35 NN side
tn.5.26.35 NN sport
tn.5.26.35 NN test
tn.5.26.35 NN utility
tn.5.26.35 NN vehicle
Tagging
Documents selection
Gene or protein
1,400,000 abstracts
The process
1) Identification of the different parts of a document (marking)
Textual part: TITLE (TI), TEXT (AB)
Meta-information: Affiliation (AD), Date (EDAT), Journal (TA), Publ. Type (PT), Country (CY)
2) Grammatical analysis (and lemmatisation)
3) Information Extraction
Phase 1: marking the document
Marking up the different parts of the documents:
Title, Abstract, …
TI
AB
ORG
AU
LA
TY
CO
JN
Phase 2: grammatical analysis
Automatic identification of:
NOUNS
ADJECTIVES
VERBS
PROPER NOUNS
Phase 2: grammatical analysis
Selection
NOUNS
Lemmatisation
KEYWORDS
New document format
20000219 NN astrocyte
20000219 NN brain
20000219 NN case
20000219 NN cell
20000219 NN control
20000219 NN disease
20000219 NN distribution
20000219 NN expression
20000219 NN frequency
20000219 NN glioma
20000219 NN grade
20000219 NN index
20000219 NN lesion
20000219 NN pattern
20000219 NN process
20000219 NN proliferation
20000219 NN rat
20000219 NN specimen
20000219 NN staining
20000219 NN subset
20000219 NN tumor
20000219 AD Department of Neurosurgery, Shiga University of Medical Science, Ohtsu, Japan
20000219 PD 1999
20000219 EDAT 1999
20000219 TA Brain Tumor Pathol
20000219 PT Journal Article
20000219 NPRO astrocytic
20000219 NPRO astrocytomas
20000219 NPRO immunohistochemically
20000219 NPRO MIB-1
20000219 NPRO nonneoplastic
20000219 NPRO p27
20000219 NPRO p27Kip
20000219 NPRO p27kip1
20000219 NPRO p27-positive
20000219 NPRO p27-positive.
20000219 JJ anomalous
20000219 JJ heterogeneous
20000219 JJ high-grade
20000219 JJ human
20000219 JJ low
20000219 JJ malignant
20000219 JJ normal
20000219 JJ reactive
20000219 JJ reciprocal
20000219 JJ surgical
20000219 JJ uniform
Phase 3: information extraction
20000219 gene CDKN1B
20000219 gene IFI27
20000219 gene P27
Gene “Dictionary”:
gene name   alias
CDKN1B      P27KIP1
IFI27       P27
P27         P27
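The dictionary lookup of Phase 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the case-insensitive exact match between proper-noun terms and aliases is an assumption.

```python
from collections import defaultdict

# (gene name, alias) pairs copied from the gene "Dictionary" slide.
GENE_ALIAS_PAIRS = [
    ("CDKN1B", "P27KIP1"),
    ("IFI27", "P27"),
    ("P27", "P27"),
]

def build_alias_index(pairs):
    """Map each upper-cased alias to the set of gene names it may denote."""
    index = defaultdict(set)
    for gene, alias in pairs:
        index[alias.upper()].add(gene)
    return index

def extract_genes(doc_id, terms, alias_index):
    """Return sorted 'doc_id gene NAME' records for terms matching an alias.

    Assumption: a term matches an alias when the two are equal
    after upper-casing (no partial or fuzzy matching).
    """
    genes = set()
    for term in terms:
        genes |= alias_index.get(term.upper(), set())
    return [f"{doc_id} gene {gene}" for gene in sorted(genes)]

alias_index = build_alias_index(GENE_ALIAS_PAIRS)
# A subset of the NPRO terms of document 20000219.
npro_terms = ["astrocytic", "MIB-1", "p27", "p27Kip", "p27kip1", "p27-positive"]
records = extract_genes("20000219", npro_terms, alias_index)
for record in records:
    print(record)
```

Under these assumptions, "p27" resolves to both IFI27 and P27 and "p27kip1" resolves to CDKN1B, which reproduces the three gene records shown for document 20000219.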
Final document
20000219 NN astrocyte
20000219 NN brain
20000219 NN case
20000219 NN cell
20000219 NN control
20000219 NN disease
20000219 NN distribution
20000219 NN expression
20000219 NN frequency
20000219 NN glioma
20000219 NN grade
20000219 NN index
20000219 NN lesion
20000219 NN pattern
20000219 NN process
20000219 NN proliferation
20000219 NN rat
20000219 NN specimen
20000219 NN staining
20000219 NN subset
20000219 NN tumor
20000219 AD Department of Neurosurgery, Shiga University of Medical Science, Ohtsu, Japan
20000219 PD 1999
20000219 EDAT 1999
20000219 TA Brain Tumor Pathol
20000219 PT Journal Article
20000219 NPRO astrocytic
20000219 NPRO astrocytomas
20000219 NPRO immunohistochemically
20000219 NPRO MIB-1
20000219 NPRO nonneoplastic
20000219 NPRO p27
20000219 NPRO p27Kip
20000219 NPRO p27kip1
20000219 NPRO p27-positive
20000219 JJ anomalous
20000219 JJ heterogeneous
20000219 JJ high-grade
20000219 JJ human
20000219 JJ low
20000219 JJ malignant
20000219 JJ normal
20000219 JJ reactive
20000219 JJ reciprocal
20000219 JJ surgical
20000219 JJ uniform
20000219 gene CDKN1B
20000219 gene IFI27
20000219 gene P27
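Similarity index

        W1 W2 ... Wm
Doc i   1 1 1 1 0 1 1 0 1 0 1 0
Doc j   1 0 0 1 1 1 0 1 0 0 0 1

$N_{11} = \sum_{k=1}^{m} x_{ik}\, x_{jk}$
$N_{10} = \sum_{k=1}^{m} x_{ik}\,(1 - x_{jk})$
$N_{01} = \sum_{k=1}^{m} (1 - x_{ik})\, x_{jk}$
$N_{00} = \sum_{k=1}^{m} (1 - x_{ik})\,(1 - x_{jk})$

$s(i,j) = \dfrac{a\, N_{11}}{b\, N_{11} + c\,(N_{10} + N_{01})}$
Condorcet: a = b = 1, c = 1/2
Dice: a = b = 1, c = 1/4

Similarity threshold
If s(i,j) exceeds the threshold, Doc i and Doc j are similar. The threshold lies in [0,1]; default: 0.5.

Weighting system
$N_{11} = \sum_{k=1}^{m} x_{ik}\, x_{jk}\, w_k$ (N10 and N01 are weighted in the same way)
$w_k = 1 / x_{\cdot k}$
$w_k = \log(N / x_{\cdot k})$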
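A minimal sketch of the similarity index on the two example vectors, assuming binary term vectors stored as Python lists. The function names are hypothetical, and reading x.k as the number of documents containing term k and N as the total number of documents is an assumption, since the slide does not define them.

```python
import math

def contingency(x_i, x_j, weights=None):
    """Return (N11, N10, N01) for two binary term vectors, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(x_i)
    n11 = sum(w * a * b for a, b, w in zip(x_i, x_j, weights))
    n10 = sum(w * a * (1 - b) for a, b, w in zip(x_i, x_j, weights))
    n01 = sum(w * (1 - a) * b for a, b, w in zip(x_i, x_j, weights))
    return n11, n10, n01

def similarity(x_i, x_j, a=1.0, b=1.0, c=0.5, weights=None):
    """s(i,j) = a*N11 / (b*N11 + c*(N10 + N01)); defaults are the Condorcet setting."""
    n11, n10, n01 = contingency(x_i, x_j, weights)
    denominator = b * n11 + c * (n10 + n01)
    return a * n11 / denominator if denominator else 0.0

def log_weights(docs):
    """w_k = log(N / x.k), assuming x.k counts the documents containing term k."""
    n_docs = len(docs)
    column_totals = [sum(column) for column in zip(*docs)]
    # Terms that occur in no document get weight 0 to avoid dividing by zero.
    return [math.log(n_docs / total) if total else 0.0 for total in column_totals]

doc_i = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
doc_j = [1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1]
THRESHOLD = 0.5  # default threshold from the slide

score = similarity(doc_i, doc_j)
print(score, "similar" if score > THRESHOLD else "not similar")
```

For the two example vectors this gives N11 = 3, N10 = 5 and N01 = 3, so s = 3 / (3 + 4) ≈ 0.43 with the Condorcet setting, which is below the default threshold of 0.5.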
clustering
Results example: RET <OR> BRCA1
[Figures: clusters and cluster keywords for the query RET <OR> BRCA1]
Locus Link
(http://www.ncbi.nlm.nih.gov/LocusLink/)
gene        alias
SERPINA3    SERPINA3
SERPINA3    ACT
SERPINA3    AACT
Locus Link Filter
• extract OFFICIAL_SYMBOL and ALIAS_SYMBOL
• put them on a row
• select the terms with at least 3 characters
• make GENE/ALIAS pairs to search for in the Medline documents
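The filter steps can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, the length rule is read as "at least 3 characters", and the symbols are passed in directly rather than parsed from a LocusLink record.

```python
def make_gene_alias_pairs(official_symbol, alias_symbols, min_length=3):
    """Return (gene, term) pairs for the official symbol and each alias.

    Terms shorter than `min_length` characters are dropped, and each
    pair is emitted only once.
    """
    terms = [official_symbol] + list(alias_symbols)
    pairs = []
    for term in terms:
        pair = (official_symbol, term)
        if len(term) >= min_length and pair not in pairs:
            pairs.append(pair)
    return pairs

# Official symbol and alternate symbols taken from the SERPINA3 example.
pairs = make_gene_alias_pairs("SERPINA3", ["ACT", "AACT"])
for gene, alias in pairs:
    print(gene, alias)
```

The official symbol is paired with itself so that a single lookup table covers both official names and aliases.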
Alternate symbol: ACT,AACT
SERPINA3
Filter
Index
Meta-information
gene        alias
A1BG        A1BG
A2M         A2M
A2MP        A2MP
NAT1        NAT1
NAT1        AAC1
NAT2        NAT2
NAT2        AAC2
AACP        AACP
AACP        NATP
SERPINA3    SERPINA3
SERPINA3    ACT
SERPINA3    AACT
AADAC       AADAC
AADAC       DAC
AAMP        AAMP
AANAT       AANAT
AANAT       SNAT
AANAT       AA-NAT
AARS        AARS
AAVS1       AAVS1
AAVS1       AAV
ABAT        ABAT
ABAT        GABAT
20000219 gene CDKN1B
20000219 gene IFI27
20000219 gene P27
A1BG 18650110 45822308 69800214
A2M 78121104 74300722 51024679
A2MP …
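The gene index (gene symbol followed by the identifiers of the documents that mention it) can be built from the per-document gene records. A minimal sketch, assuming records of the form "doc_id TAG value"; the function name is hypothetical.

```python
from collections import defaultdict

def build_gene_index(records):
    """Map each gene symbol to the list of document ids that carry it."""
    index = defaultdict(list)
    for record in records:
        doc_id, tag, value = record.split(maxsplit=2)
        # Only 'gene' records contribute; each document is listed once per gene.
        if tag == "gene" and doc_id not in index[value]:
            index[value].append(doc_id)
    return dict(index)

records = [
    "20000219 NN tumor",
    "20000219 gene CDKN1B",
    "20000219 gene IFI27",
    "20000219 gene P27",
]
gene_index = build_gene_index(records)
for gene, doc_ids in gene_index.items():
    print(gene, " ".join(doc_ids))
```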
“stop words” list
ABNORMAL   GAS       MIN             SEX
ACT        GEL       MINOR           SIDE
ACTS       GREAT     NET             SKIN
AIR        HAND      NON             SMOOTH
ALPHA      HER       NONE            SPIN
ARM        HIS       OLD             STEP
BEST       HOMOLOG   OUT             SUB
BETA       HOW       PAST            TERM
BIS        III       POINT           TRANSCRIPTIONAL
CONTACT    KEY       POLE            TYPE1
DELTA      KILLER    POLY            TYPE-I
DYE        KIT       PRE             UPSTREAM
EARLY      LACK      PRO             WHITE
END        LARGE     PROTEINKINASE   WHO
FACT       LIGHT     RAY
FAR        MAP       RED
FAT        MEN       RING
FISH       MET       SALT
GAMMA      MICE      SEA
GAP        MID       SET
For these aliases we ran a constrained search, or no search at all
KIT <near/6> (protein <or> gene <or> product)
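A rough sketch of what such a constrained search does. The exact semantics of the `<near/6>` operator in the search engine used are not given here, so the interpretation below (alias and context word at most 6 tokens apart, in either order, matched case-insensitively) is an assumption, and the function name is hypothetical.

```python
import re

def near_match(text, alias, context_words, window=6):
    """Return True if `alias` occurs within `window` tokens of a context word."""
    tokens = [t.upper() for t in re.findall(r"[A-Za-z0-9-]+", text)]
    context = {w.upper() for w in context_words}
    alias_positions = [i for i, t in enumerate(tokens) if t == alias.upper()]
    context_positions = [i for i, t in enumerate(tokens) if t in context]
    return any(
        abs(i - j) <= window
        for i in alias_positions
        for j in context_positions
    )

CONTEXT = ["protein", "gene", "product"]
print(near_match("The KIT gene encodes a receptor", "KIT", CONTEXT))
print(near_match("A commercial kit was used for staining", "KIT", CONTEXT))
```

The first sentence matches because "gene" directly follows "KIT"; the second does not, since none of the context words appears.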
Terms recognition - open problems
• Non-standardised terminology (different conventions)
• Open vocabulary (new terms are continually added)
• Use of abbreviations, upper case/lower case, names that describe the function, …
• Synonyms
• Term class cross-over (e.g. proteins named after the related DNA)
• Prepositions and conjunctions (ambiguity in the interpretation of dependence)
• Co-reference
Terms recognition - approaches
Manual coding of Knowledge *
Learning methods
Maximum entropy * *
Hidden Markov models *
Decision trees * *
Naive Bayes *
Statistical methods
naive Bayes + “word lists”*
Hybrid methods * * * LTG (Language Technology Group)
* training   * dictionary use   * hand-coded rules
Test in biological field
genes (DNA)    proteins
96,7
-----          -----
47,2           75,9
17,8 - 44,6    83,4 - 87,5
84,4           84,5
83,8           70,3
-----          -----
F-score = 2*P*R/(P+R)
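For reference, the F-score formula as a small function, with P the precision and R the recall. The input values in the example are hypothetical and are not taken from the table.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall: 2*P*R / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values, for illustration only.
example = f_score(0.8, 0.6)
print(round(100 * example, 1))
```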
Clustering approaches
• Vectorial representation
descriptive terms (nouns, verbs, adjectives, …)
representation (binary, quantitative)
• Metric
similarity index
Euclidean metric
cosine angle
• Algorithms
hierarchical
partitional (K-means, Self Organizing Maps, Autoclass, …)
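Two of the listed metrics, sketched for documents represented as term vectors (binary or quantitative). The function names and the example vectors are hypothetical.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two term vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term-frequency vectors over the same three keywords.
doc_a = [2, 0, 1]
doc_b = [4, 0, 2]
print(euclidean_distance(doc_a, doc_b), cosine_similarity(doc_a, doc_b))
```

The cosine ignores vector length, so two documents with the same keyword proportions count as identical even when one is longer, whereas the Euclidean distance between them grows with the difference in counts.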