Department of Clinical and Biological Sciences, Turin University.
Department of Genetics, General and Molecular Biology, Naples.
Department of Mathematics and Information Science, Italy.
CINECA, Italy.

R.A. Calogero, G. Iazzetti, S. Motta, G. Pedrazzi, S. Rago, E. Rossi, R. Turra

Data Mining applications in biological fields

• On sequence databases / molecular structures

Protein structure prediction, homology search, genomic sequence analysis, gene identification and mapping, gene expression microarrays, …

• On biomedical literature databases

Identification and classification of biological terms, identification of keywords and concepts, clustering, supervised classification, …

Biomedical literature analysis: state of the art

Two different approaches:

• Information Extraction

Application of Natural Language Processing techniques to produce structured representations (templates). Entities and relations must be defined before they are extracted from the texts. Syntactic and semantic analysis drives the extraction.

• Text Mining

Identification of word patterns inside the document corpus. No entities are defined in advance, which allows new concepts and new relations to be identified. No semantics.

Text Mining - The KDD process

Databases, Web sites, …
→ documents selection → Target documents
→ grammatical analysis and lemmatization → Transformed documents
→ meta-information extraction → Keywords
→ Text Mining → Patterns
→ interpretation and validation of results → Knowledge

>>> 35:TOYOTA: Avalon Receives Top Score in Frontal Offset Crash Tests

Toyota Motor Corp.'s Avalon received the top score -- a "good" rating earning a "best pick" -- in the 40 mile per hour frontal offset crash tests on new or updated vehicles. The tests were conducted by the Insurance Institute for Highway Safety, a nonprofit group funded by automobile insurers. Nissan Motor Co. Ltd.'s Maxima midsize sedan and Infiniti I30 luxury sedan, the Nissan Sentra small car and Mazda Motor Corp.'s Mazda MPV minivan all scored "average" marks. Isuzu Motors Ltd.'s Rodeo sport utility, also sold by Honda Motor Co. Ltd. as the Honda Passport, earned a "poor" rating due to high crash forces recorded on the crash dummy's head, indicating an increased likelihood of injury. In the crash tests, the vehicles were driven into a deformable barrier at 40 mph, with the driver's side of the vehicle taking the impact. The tests measured the potential for injury to the head, neck, chest and foot areas, and the risk of intrusion into the passenger compartment.

SUBJECTS: Japan; Safety; Passenger Vehicles
SOURCE: Reuters, June 21, 2000; Japan; English

tn.5.26.35 SOURCE Reuters
tn.5.26.35 DATE 6/21/2000
tn.5.26.35 MONTHYEAR 2000_06
tn.5.26.35 SUBJECTS Japan
tn.5.26.35 SUBJECTS Passenger_Vehicles
tn.5.26.35 SUBJECTS Safety
tn.5.26.35 STATE Japan
tn.5.26.35 LANGUAGE English
tn.5.26.35 ORG2 TOYOTA
tn.5.26.35 NN area
tn.5.26.35 NN automobile
tn.5.26.35 NN average
tn.5.26.35 NN barrier
tn.5.26.35 NN car
tn.5.26.35 NN chest
tn.5.26.35 NN compartment
tn.5.26.35 NN crash
tn.5.26.35 NN driver
tn.5.26.35 NN dummy
tn.5.26.35 NN foot
tn.5.26.35 NN force
tn.5.26.35 NN group
tn.5.26.35 NN head
tn.5.26.35 NN hour
tn.5.26.35 NN impact
tn.5.26.35 NN injury
tn.5.26.35 NN insurer
tn.5.26.35 NN intrusion
tn.5.26.35 NN likelihood
tn.5.26.35 NN luxury
tn.5.26.35 NN mark
tn.5.26.35 NN mile
tn.5.26.35 NN neck
tn.5.26.35 NN offset
tn.5.26.35 NN passenger
tn.5.26.35 NN potential
tn.5.26.35 NN rating
tn.5.26.35 NN risk
tn.5.26.35 NN safety
tn.5.26.35 NN score
tn.5.26.35 NN sedan
tn.5.26.35 NN side
tn.5.26.35 NN sport
tn.5.26.35 NN test
tn.5.26.35 NN utility
tn.5.26.35 NN vehicle

Tagging

Documents selection

Gene or protein

1,400,000 abstracts

The process

1) Identification of the different parts of a document (marking)

Textual part: TITLE (TI), TEXT (AB)

Meta-information: Affiliation (AD), Date (EDAT), Journal (TA), Publ. Type (PT), Country (CY)

2) Grammatical analysis (and lemmatisation)

3) Information extraction

Phase 1: marking the document

Marking up the different parts of the documents: Title, Abstract, …

Tags: TI, AB, ORG, AU, LA, TY, CO, JN

Phase 2: grammatical analysis

Automatic identification of:

NOUNS
ADJECTIVES
VERBS
PROPER NOUNS

Selection → NOUNS → Lemmatisation → KEYWORDS
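The selection and lemmatisation step can be illustrated with a minimal sketch. It assumes the tokens have already been tagged by some part-of-speech tagger; the tags NN, JJ and NPRO appear in the document format used in these slides, while the VB tag, the `LEMMAS` table and the `keywords` function are hypothetical stand-ins for whatever tagger and lemmatiser were actually used.

```python
# Toy lemma table: an assumption for illustration, not the authors' lemmatiser.
LEMMAS = {"astrocytes": "astrocyte", "cells": "cell", "tumors": "tumor"}


def keywords(tagged_tokens):
    """Keep the nouns (tag NN), map each to its lemma, return sorted unique keywords."""
    nouns = (token.lower() for token, tag in tagged_tokens if tag == "NN")
    return sorted({LEMMAS.get(noun, noun) for noun in nouns})


# Hypothetical tagger output for a short sentence.
tagged = [
    ("Astrocytes", "NN"), ("express", "VB"), ("p27", "NPRO"),
    ("malignant", "JJ"), ("tumors", "NN"), ("cells", "NN"), ("cell", "NN"),
]
print(keywords(tagged))  # ['astrocyte', 'cell', 'tumor']
```

Adjectives, verbs and proper nouns are dropped here only because the slide selects nouns as keywords; they can be kept under their own tags, as the new document format does.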

New document format

20000219 NN astrocyte
20000219 NN brain
20000219 NN case
20000219 NN cell
20000219 NN control
20000219 NN disease
20000219 NN distribution
20000219 NN expression
20000219 NN frequency
20000219 NN glioma
20000219 NN grade
20000219 NN index
20000219 NN lesion
20000219 NN pattern
20000219 NN process
20000219 NN proliferation
20000219 NN rat
20000219 NN specimen
20000219 NN staining
20000219 NN subset
20000219 NN tumor
20000219 AD Department of Neurosurgery, Shiga University of Medical Science, Ohtsu, Japan
20000219 PD 1999
20000219 EDAT 1999
20000219 TA Brain Tumor Pathol
20000219 PT Journal Article
20000219 NPRO astrocytic
20000219 NPRO astrocytomas
20000219 NPRO immunohistochemically
20000219 NPRO MIB-1
20000219 NPRO nonneoplastic
20000219 NPRO p27
20000219 NPRO p27Kip
20000219 NPRO p27kip1
20000219 NPRO p27-positive
20000219 NPRO p27-positive.
20000219 JJ anomalous
20000219 JJ heterogeneous
20000219 JJ high-grade
20000219 JJ human
20000219 JJ low
20000219 JJ malignant
20000219 JJ normal
20000219 JJ reactive
20000219 JJ reciprocal
20000219 JJ surgical
20000219 JJ uniform
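As a rough illustration of how this record format could be consumed, the sketch below groups the records by document and tag. It assumes each record is a single line of the form "document id, tag, value" separated by whitespace, which is how the example reads; the function name `parse_records` is hypothetical.

```python
from collections import defaultdict


def parse_records(lines):
    """Group 'doc_id TAG value' records into {doc_id: {TAG: [values]}}."""
    docs = defaultdict(lambda: defaultdict(list))
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 3:
            continue  # skip blank or incomplete records
        doc_id, tag, value = parts
        docs[doc_id][tag].append(value)
    return docs


sample = [
    "20000219 NN astrocyte",
    "20000219 NN brain",
    "20000219 TA Brain Tumor Pathol",
    "20000219 NPRO p27kip1",
    "20000219 JJ malignant",
]
docs = parse_records(sample)
print(docs["20000219"]["NN"])  # ['astrocyte', 'brain']
print(docs["20000219"]["TA"])  # ['Brain Tumor Pathol']
```

Splitting at most twice keeps multi-word values such as the journal title intact.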

Phase 3: information extraction

20000219 gene CDKN1B
20000219 gene IFI27
20000219 gene P27

Gene “Dictionary”: gene name, alias

CDKN1B  P27KIP1
IFI27   P27
P27     P27
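The dictionary lookup can be sketched as follows. The three gene/alias pairs are the ones shown on the slide; the case-insensitive exact match and the function name `extract_genes` are assumptions, since the slides do not say how terms are compared with aliases.

```python
# Gene/alias pairs taken from the slide's gene "dictionary".
GENE_ALIASES = [("CDKN1B", "P27KIP1"), ("IFI27", "P27"), ("P27", "P27")]


def extract_genes(terms):
    """Return the genes whose alias equals one of the terms, ignoring case (assumed rule)."""
    found = {term.upper() for term in terms}
    return sorted({gene for gene, alias in GENE_ALIASES if alias in found})


# Some of the proper nouns (NPRO) of document 20000219.
terms = ["astrocytic", "MIB-1", "p27", "p27Kip", "p27kip1", "p27-positive"]
print(extract_genes(terms))  # ['CDKN1B', 'IFI27', 'P27']
```

The ambiguous alias P27 yields two genes, IFI27 and P27, which matches the three gene records produced for this document.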

Final document

20000219 NN astrocyte
20000219 NN brain
20000219 NN case
20000219 NN cell
20000219 NN control
20000219 NN disease
20000219 NN distribution
20000219 NN expression
20000219 NN frequency
20000219 NN glioma
20000219 NN grade
20000219 NN index
20000219 NN lesion
20000219 NN pattern
20000219 NN process
20000219 NN proliferation
20000219 NN rat
20000219 NN specimen
20000219 NN staining
20000219 NN subset
20000219 NN tumor
20000219 AD Department of Neurosurgery, Shiga University of Medical Science, Ohtsu, Japan
20000219 PD 1999
20000219 EDAT 1999
20000219 TA Brain Tumor Pathol
20000219 PT Journal Article
20000219 NPRO astrocytic
20000219 NPRO astrocytomas
20000219 NPRO immunohistochemically
20000219 NPRO MIB-1
20000219 NPRO nonneoplastic
20000219 NPRO p27
20000219 NPRO p27Kip
20000219 NPRO p27kip1
20000219 NPRO p27-positive
20000219 JJ anomalous
20000219 JJ heterogeneous
20000219 JJ high-grade
20000219 JJ human
20000219 JJ low
20000219 JJ malignant
20000219 JJ normal
20000219 JJ reactive
20000219 JJ reciprocal
20000219 JJ surgical
20000219 JJ uniform


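Similarity index

Each document is represented as a binary vector over the m words W1 W2 ... Wm:

Doc i   1 1 1 1 0 1 1 0 1 0 1 0
Doc j   1 0 0 1 1 1 0 1 0 0 0 1

$$N_{11} = \sum_{k=1}^{m} x_{ik}\, x_{jk}$$

$$N_{10} = \sum_{k=1}^{m} x_{ik}\, (1 - x_{jk})$$

$$N_{01} = \sum_{k=1}^{m} (1 - x_{ik})\, x_{jk}$$

$$N_{00} = \sum_{k=1}^{m} (1 - x_{ik})\, (1 - x_{jk})$$

$$s(i,j) = \frac{a\, N_{11}}{b\, N_{11} + c\, (N_{10} + N_{01})}$$

Condorcet: a = b = 1, c = 1/2
Dice: a = b = 1, c = 1/4

Similarity threshold

If s(i,j) exceeds the threshold, Doc i and Doc j are similar. The threshold lies in [0,1]; default: 0.5.

Weighting system

$$N_{11} = \sum_{k=1}^{m} x_{ik}\, x_{jk}\, w_k$$

(N10 = ..., N01 = ... are weighted in the same way)

$$w_k = \frac{1}{x_{\cdot k}} \qquad \text{or} \qquad w_k = \log\!\left(\frac{N}{x_{\cdot k}}\right)$$

Here x_{.k} presumably denotes the number of documents that contain word W_k, and N the total number of documents.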
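The counts and the similarity index can be checked on the two example vectors with a minimal sketch. The function name `similarity` and its keyword arguments are hypothetical; the optional weights follow the weighting system above, applied to N11, N10 and N01 alike, which is how the "(N10=.. N01=...)" note is read here.

```python
def similarity(xi, xj, a=1.0, b=1.0, c=0.5, weights=None):
    """s(i,j) = a*N11 / (b*N11 + c*(N10 + N01)) for two binary vectors."""
    w = weights if weights is not None else [1.0] * len(xi)
    n11 = sum(wk * p * q for p, q, wk in zip(xi, xj, w))
    n10 = sum(wk * p * (1 - q) for p, q, wk in zip(xi, xj, w))
    n01 = sum(wk * (1 - p) * q for p, q, wk in zip(xi, xj, w))
    denom = b * n11 + c * (n10 + n01)
    return a * n11 / denom if denom else 0.0


# The two example documents from the slide.
doc_i = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
doc_j = [1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1]

# Unweighted counts: N11 = 3, N10 = 5, N01 = 3 (and N00 = 1).
s_half = similarity(doc_i, doc_j, c=0.5)      # 3 / (3 + 4) = 3/7
s_quarter = similarity(doc_i, doc_j, c=0.25)  # 3 / (3 + 2) = 0.6
print(round(s_half, 4), round(s_quarter, 4))  # 0.4286 0.6
```

With the default threshold of 0.5 the two documents count as similar for c = 1/4 but not for c = 1/2, so the choice of c matters as much as the threshold.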

Clustering

Results example: RET <OR> BRCA1

[Figures: Cluster; Cluster Keywords]

Locus Link

(http://www.ncbi.nlm.nih.gov/LocusLink/)

gene      alias
SERPINA3  SERPINA3
SERPINA3  ACT
SERPINA3  AACT

Locus Link Filter

• extract OFFICIAL_SYMBOL and ALIAS_SYMBOL

• put them on a row

• select the terms with at least 3 characters

• make GENE/ALIAS pairs to search for in Medline documents
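The filter steps can be sketched in a few lines. Parsing of the Locus Link file itself is left out; the function `gene_alias_pairs` and its `min_len` parameter are hypothetical names, and the official symbol is assumed to be paired with itself, as in the SERPINA3 example.

```python
def gene_alias_pairs(official_symbol, alias_symbols, min_len=3):
    """Pair the official symbol with itself and with each alias of at least min_len characters."""
    terms = [official_symbol] + list(alias_symbols)
    return [(official_symbol, term) for term in terms if len(term) >= min_len]


pairs = gene_alias_pairs("SERPINA3", ["ACT", "AACT"])
print(pairs)
# [('SERPINA3', 'SERPINA3'), ('SERPINA3', 'ACT'), ('SERPINA3', 'AACT')]
```

Terms shorter than three characters are dropped because very short symbols match too many unrelated words in free text.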

SERPINA3
Alternate symbol: ACT, AACT

Filter → Index → Meta-information

gene      alias
A1BG      A1BG
A2M       A2M
A2MP      A2MP
NAT1      NAT1
NAT1      AAC1
NAT2      NAT2
NAT2      AAC2
AACP      AACP
AACP      NATP
SERPINA3  SERPINA3
SERPINA3  ACT
SERPINA3  AACT
AADAC     AADAC
AADAC     DAC
AAMP      AAMP
AANAT     AANAT
AANAT     SNAT
AANAT     AA-NAT
AARS      AARS
AAVS1     AAVS1
AAVS1     AAV
ABAT      ABAT
ABAT      GABAT

A1BG  18650110 45822308 69800214
A2M   78121104 74300722 51024679
A2MP  …

“stop words” list

ABNORMAL  GAS      MIN            SEX
ACT       GEL      MINOR          SIDE
ACTS      GREAT    NET            SKIN
AIR       HAND     NON            SMOOTH
ALPHA     HER      NONE           SPIN
ARM       HIS      OLD            STEP
BEST      HOMOLOG  OUT            SUB
BETA      HOW      PAST           TERM
BIS       III      POINT          TRANSCRIPTIONAL
CONTACT   KEY      POLE           TYPE1
DELTA     KILLER   POLY           TYPE-I
DYE       KIT      PRE            UPSTREAM
EARLY     LACK     PRO            WHITE
END       LARGE    PROTEINKINASE  WHO
FACT      LIGHT    RAY
FAR       MAP      RED
FAT       MEN      RING
FISH      MET      SALT
GAMMA     MICE     SEA
GAP       MID      SET

For these aliases we ran a constrained search or no search at all

KIT <near/6> (protein <or> gene <or> product)
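A constrained search of this kind can be approximated as below. The sketch assumes that <near/6> means "within six words on either side", which is a guess at the operator's semantics in the search engine used; `near_match` and `CONTEXT_WORDS` are hypothetical names.

```python
import re

CONTEXT_WORDS = {"protein", "gene", "product"}


def near_match(text, alias, window=6):
    """True if alias occurs within `window` tokens of one of the context words."""
    tokens = re.findall(r"[A-Za-z0-9-]+", text.lower())
    alias = alias.lower()
    for pos, token in enumerate(tokens):
        if token == alias:
            start = max(0, pos - window)
            nearby = tokens[start:pos] + tokens[pos + 1:pos + 1 + window]
            if CONTEXT_WORDS.intersection(nearby):
                return True
    return False


print(near_match("Expression of the KIT gene in mast cells", "KIT"))      # True
print(near_match("The assay kit was stored at room temperature", "KIT"))  # False
```

The second sentence shows why an ambiguous alias such as KIT is only accepted when a word like "gene" or "protein" appears nearby.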

Terms recognition - open problems

• Non-standardised terminology (different conventions)

• Open vocabulary (new terms are continually added)

• Use of abbreviations, upper case/lower case, names that describe the function, …

• Synonyms

• Term class cross-over (e.g. proteins named after the related DNA)

• Prepositions and conjunctions (ambiguity in the interpretation of dependence)

• Co-reference

Terms recognition - approaches

Manual coding of knowledge *

Learning methods

Maximum entropy * *

Hidden Markov models *

Decision trees * *

Naive Bayes *

Statistical methods

Naive Bayes + “word lists” *

Hybrid methods * * * LTG (Language Technology Group)

Legend: * training, * dictionary use, * hand-coded rules

Test in biological field

genes (DNA)   proteins
96,7
-----         -----
47,2          75,9
17,8 - 44,6   83,4 - 87,5
84,4          84,5
83,8          70,3
-----         -----

F-score = 2*P*R/(P+R)
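For reference, the F-score formula can be written as a small function. P is taken to be precision and R recall, as is usual for this measure; the example values are hypothetical and are not taken from the table.

```python
def f_score(precision, recall):
    """F-score = 2*P*R / (P + R); returns 0.0 when both inputs are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical precision and recall, not taken from the results above.
print(round(f_score(0.8, 0.6), 4))  # 0.6857
```

The F-score is the harmonic mean of precision and recall, so it stays low unless both values are high.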

Clustering approaches

Vectorial representation

• descriptive terms (nouns, verbs, adjectives, …)

• representation (binary, quantitative)

Metric

• similarity index

• Euclidean metric

• cosine angle

Algorithms

• hierarchical

• partitional (K-means, Self Organizing Maps, Autoclass, …)
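The two geometric metrics listed above can be compared on the binary example documents used for the similarity index. This is a minimal sketch with hypothetical function names; it is not tied to any particular clustering algorithm.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


doc_i = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
doc_j = [1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1]
print(round(cosine(doc_i, doc_j), 4))     # 0.433
print(round(euclidean(doc_i, doc_j), 4))  # 2.8284
```

For binary vectors the squared Euclidean distance is simply the number of words present in one document but not the other, whereas the cosine also takes the shared words into account.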

