Information Extraction from the Cancer Literature
The Pediatric Hematology/Oncology Seminar Series
Children’s Hospital of Philadelphia
March 8, 2005
Philadelphia, PA
A Global Challenge
Cell: DNA sequence, genomic variation, microarrays, RNAi, protein interactions
Clinic: patient records, test results, clinical reports, procedures, phone calls
Example: MDS1, leukemia
Text, phenotype
Natural language understanding
Too Much Text
Biomedical text:
• 15 million articles
• 1.5 billion words
Solution 1: Approximate
• What you can find
• What finds you
Solution 2: Read everything
• Leukemia: 181,394 articles
• 20/day = 25 years
• 385,034 new articles by then
Solution 3: Impose structure on the descriptions
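The "read everything" estimate can be checked with a few lines of arithmetic. The sketch below uses only the figures on the slide; the assumption that reading happens 365 days a year is mine.

```python
# Figures taken from the slide.
leukemia_articles = 181_394
articles_per_day = 20
new_articles_by_then = 385_034

# Assumption: reading continues every day of the year.
days_needed = leukemia_articles / articles_per_day
years_needed = days_needed / 365

# Average rate at which new articles would appear over that period.
new_per_day = new_articles_by_then / days_needed

print(f"{years_needed:.1f} years to read the existing articles")
print(f"{new_per_day:.1f} new articles per day on average")
```

Under these assumptions the new articles arrive faster than they can be read, which is the point of the slide.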
IE Process
• Phase 1: Domain selection and definition
• Phase 2: Manual annotation
• Phase 3: Create and train machine-learning algorithms
• Phase 4: “Active Annotation”
• Phase 5: Utilization of annotations
Domain
Biological Domains
• Genomic variations in malignancy
• Neuroblastoma
Entity Classes
• Genes (genes, transcripts, proteins)
• Genomic variations (type, location, state)
• Malignant type
• Malignancy attributes
  – Developmental state
  – Clinical stage
  – Histology
  – Malignancy site
  – Differentiation status
  – Heredity status
Document Sets
MEDLINE: Abstracts --> Full Text
• Annotation training set: 4,000 MEDLINE abstracts
  – Genes commonly mutated in various malignancies
  – Genes implicated in neuroblastoma
• Abstracts are manually annotated (dual pass)
• Results are used to train automated taggers
Workflow Management
Extraction Process
Example sentence: MDS1 gene alterations often cause leukemia

Parsing
Separate: MDS1 | gene | alterations | often | cause | leukemia

Part-of-speech Tagging
Grammar:
MDS1: Noun
gene: Noun
alterations: Plural noun
often: Adverb
cause: Verb
leukemia: Noun
Named Entity Recognition
MDS1: Gene
alterations: Process
leukemia: Disease
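The three layers shown so far (token separation, part-of-speech labels, entity labels) can be pictured as annotations stacked on the same tokens. The sketch below is a toy illustration only: the lookup tables are hand-written stand-ins for the trained taggers described later, and the label assignments follow my reading of the slide.

```python
import re

sentence = "MDS1 gene alterations often cause leukemia"

# Step 1: parsing, reduced here to separating the sentence into tokens.
tokens = re.findall(r"\w+", sentence)

# Steps 2 and 3: hypothetical lookup tables, NOT the trained taggers.
POS_LEXICON = {
    "MDS1": "Noun",
    "gene": "Noun",
    "alterations": "Plural noun",
    "often": "Adverb",
    "cause": "Verb",
    "leukemia": "Noun",
}
ENTITY_LEXICON = {
    "MDS1": "Gene",
    "alterations": "Process",
    "leukemia": "Disease",
}

# Each token carries one annotation per layer; None means "no label".
annotations = [
    {
        "token": tok,
        "pos": POS_LEXICON.get(tok),
        "entity": ENTITY_LEXICON.get(tok),
    }
    for tok in tokens
]

for annotation in annotations:
    print(annotation)
```

Keeping every layer attached to the same token list is what lets later stages reuse the output of earlier ones.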
Definitions
Process
• Initial definitions: domain experts
  – Analyze representative subset of text mentions
  – Input of specific knowledge
• Manual annotation
  – Tag text with initial definitions
  – Iterative re-definition process
  – More text: tighter and more robust definitions
• Widen domain expertise
• Publication and utilization
Gene Entities: genes, transcripts, proteins, other
Genes: individual gene, gene family, gene superfamily
Gene: The Gene-Entity category includes genes as well as their downstream products such as transcripts and proteins, in addition to the more general groups of gene and protein families, super-families, and so forth. Note that the category name 'Gene-Entity' is not a completely accurate description of the members of this class since the category includes things other than genes. However, most things in this class are genes, and everything is either a gene or gene-derived (transcripts and proteins). The diagram that follows attempts to illustrate this point and provides some examples.
What is and What is Not Included? There are two ways to think about genes.
1. Genes as conceptual entities. (This is what we want to capture.) Genes refer to segments of the genome which have been identified with a specific function or product (for example, the gene for eye color in a fly or a membrane receptor in humans). Although they are "things", they really represent abstract concepts. We can talk about the gene "K-Ras", but we are really referring to an abstract concept – an "ideal form" of the K-Ras gene, which has known attributes. We can’t point to K-Ras; we can only point to instances of K-Ras. Each of these instances (a specific manifestation of the gene as described in #2 below) has the attributes and characteristics of the abstract concept of K-Ras, but the different instances of K-Ras may vary slightly from one another. (This parallels the concept of "species". We all have an intuitive grasp of the species concept, and can tell most species apart: a grizzly bear from a polar bear. However, when we visit the zoo we encounter instances of a species -- individual bears -- and not the concept itself.) Although this may seem pedantic, there is an important reason for making this distinction which we’ll describe below.
Let’s consider some examples based upon this logic:
a. For genes: c-kit, CD117, and alpha-smooth muscle actin
b. A non-biology example: a 2003 Ferrari Modena. This is an abstract concept for a specific type of car. However, you can’t point to an abstract 2003 Ferrari Modena; you can only point to specific instances, which may vary, even if slightly, from one another.
c. K-Ras as investigated in Bob. This can be a tricky example since it would appear as though we are talking about a specific instance of K-Ras. But remember, in nearly all cases, genes are paired in humans (sometimes there are even more
Definitions
Confounding Issues:
• Levels of specificity
  – Protein/enzyme/kinase/tyrosine kinase/NTRK1
  – TRK antibody
  – Colon cancer vs. cancer of the colon
• Boundary issues
  – Retinoblastoma
  – Head and neck cancer
  – MEN type 2B syndrome
Entity Annotation
Syntactic Analysis (Treebanking)
[Figure: syntax tree over the entity-labeled sentence "MDS1 gene alterations often cause leukemia". Phrase labels shown: noun phrase, adverb phrase, verb phrase. Nested spans shown: "alterations often cause leukemia", "often cause leukemia", "cause leukemia", "leukemia".]
Relation Tagging
Relationships:
Object: MDS1 gene
Event: alterations
Frequency: often
Action: cause
Result: leukemia
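One way to picture the output of relation tagging is as a small record with one field per role. The sketch below is an illustration under my own assumptions: the `Relation` class, its field names, and `to_sentence` are hypothetical and not part of the project's software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One tagged relation; the roles mirror the labels on the slide."""
    obj: str        # "Object" on the slide (renamed to avoid Python's `object`)
    event: str
    frequency: str
    action: str
    result: str


relation = Relation(
    obj="MDS1 gene",
    event="alterations",
    frequency="often",
    action="cause",
    result="leukemia",
)


def to_sentence(r: Relation) -> str:
    # Reassemble the roles in the order they occur in the example sentence.
    return f"{r.obj} {r.event} {r.frequency} {r.action} {r.result}"


print(to_sentence(relation))
```

Once the sentence is reduced to such a record, it can be queried by role (for example, every relation whose result is leukemia) rather than by free text.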
Annotation Viewer
Annotations
Task              Annotation start date  Annotated documents  Annotated words
Pre-tagging       11/3/03                3834                 1,456,000
Entity tagging    9/24/03                3829                 1,455,000
POS tagging       8/27/03                2332                 886,160
Treebanking       2/26/04                2300                 874,000
Relation tagging  10/31/04               618                  234,000
Automated Algorithms
• Pretagger
  – Assigns token, sentence, paragraph, section boundaries
  – Nearly 100% accuracy
  – Pipeline implementation: Finished
• Bio part-of-speech tagger
  – Assigns part-of-speech tags to tokens
  – Uses pretagging annotations
  – Accuracy of 97.3%
  – Pipeline implementation: Finished
Entity Taggers
Entity taggers: automated, machine-learning algorithms for named entity recognition in text
Goals
  – Highly accurate, precision > recall
  – Rapid deployment
  – Flexible design
Technique
  – Conditional random fields
  – Text feature-based
  – Uses pretagging, POS annotations
  – Probabilistic maximization of feature weights
  – Corrects for overfitting
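To make "text feature-based" concrete, the sketch below shows the kind of per-token feature dictionary that is commonly fed to a conditional random field. The specific features (word shape, suffix, neighboring words) are my assumptions for illustration; the slides do not list the features the taggers actually use, and no CRF library is invoked here.

```python
def word_shape(token: str) -> str:
    """Collapse characters into classes: upper -> A, lower -> a, digit -> 0."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("A")
        elif ch.islower():
            shape.append("a")
        elif ch.isdigit():
            shape.append("0")
        else:
            shape.append(ch)  # punctuation such as "-" is kept as-is
    return "".join(shape)


def token_features(tokens, pos_tags, i):
    """Hypothetical feature set for token i, using pretagging and POS layers."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "shape": word_shape(tok),
        "suffix3": tok[-3:].lower(),
        "has_digit": any(ch.isdigit() for ch in tok),
        "pos": pos_tags[i],
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = ["MDS1", "gene", "alterations", "often", "cause", "leukemia"]
pos_tags = ["Noun", "Noun", "Plural noun", "Adverb", "Verb", "Noun"]
features = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]
print(features[0])
```

Shape features of this kind are one reason gene symbols such as "MDS1" or "VEGFR-2" can be recognized even when the exact string was never seen in training.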
Entity Taggers
• GeneTaggerCRF
  – Tags gene symbols, names, and descriptions
    • KDR, VEGFR-2, VEGF receptor-2
    • vascular endothelial growth factor receptor type 2
  – 86% precision/79% recall
  – Pipeline implementation: Imminent
• VTag
  – Simultaneously tags variation types, locations, states
    • point mutation, loss of heterozygosity
    • codon 12, 11q23, base pair 17, Ki-ras
    • GGT, glycine, Asp
  – 85% precision/79% recall
  – Pipeline implementation: Imminent
Entity Taggers
• Mtag
  – Tags malignant type labels
    • acute myeloid leukemias (AMLs)
    • translocation t( 9;11) - positive leukemia
    • NB
    • transitional cell carcinoma of the bladder
    • Hypoplastic myelodysplastic syndrome
    • predominantly cystic bilateral neuroblastomas
  – 85% precision/82% recall
  – Pipeline implementation: Imminent
Entity Taggers
Relation Taggers: Identifying relationships between entities
Given this text:
Missense mutation at codon 45 (TCT to TTT)
Can we automatically identify:
1. Pairwise associations [(codon 45 and TCT); (TCT and TTT); etc.]
2. The entire mutation event:
VARIATION EVENT #60609
Variation type: missense mutation
Variation location: codon 45
Variation state 1: TCT
Variation state 2: TTT
Relation Tagger
Goals: Accurate, rapid, flexible
Technique
  – Maximum entropy
  – Feature-based probabilistic model
  – Events built upon binary associations
  – Uses pretagging, POS, and entity annotations
Domain
  – Genomic variation events
  – Tested on 447 abstracts: 1218 relations, 4773 entities
  – 38% of relations were non-binary
  – Baseline: Two entities within 5 words = related
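The baseline rule ("two entities within 5 words = related") is simple enough to sketch directly. The version below is one possible reading, not the project's implementation: I assume "within 5 words" means the second entity starts at most five token positions after the first one ends, and the `Entity` type, `baseline_pairs`, and the hand-assigned token indices are all hypothetical.

```python
from itertools import combinations
from typing import NamedTuple


class Entity(NamedTuple):
    label: str
    text: str
    start: int  # index of the entity's first token
    end: int    # index of the entity's last token (inclusive)


def baseline_pairs(entities, max_gap=5):
    """Pair up entities whose token distance is at most max_gap (assumed rule)."""
    ordered = sorted(entities, key=lambda e: e.start)
    pairs = []
    for a, b in combinations(ordered, 2):
        if b.start - a.end <= max_gap:
            pairs.append((a.text, b.text))
    return pairs


# Tokens: Missense(0) mutation(1) at(2) codon(3) 45(4) TCT(5) to(6) TTT(7)
entities = [
    Entity("variation type", "missense mutation", 0, 1),
    Entity("variation location", "codon 45", 3, 4),
    Entity("variation state", "TCT", 5, 5),
    Entity("variation state", "TTT", 7, 7),
]

pairs = baseline_pairs(entities)
for pair in pairs:
    print(pair)
```

Under this reading, "missense mutation" and "TTT" are six positions apart and are not paired, which illustrates how a fixed window can miss part of an event.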
Relation Tagger
Results
Binary
• Tagger: 77% precision/82% recall
• Baseline: 66% precision/77% recall
Event-wide
• Tagger: 63% precision/77% recall
• Baseline: 43% precision/66% recall
Example
"most common base change was a A ->G transition at codon 12 or 13"
Manual annotation:
• (transition, codon 12, A, G)
• (transition, codon 13, A, G)
Automated annotation:
• (transition, codon 12, A, G)
• (transition, codon 13, A, G)
• (base change, codon 12, A, G)
• (base change, codon 13, A, G)
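For this single example, precision and recall can be worked out by treating each annotation as a tuple and comparing the two sets. The helper `precision_recall` is an illustrative name of my own; the tuples are copied from the slide.

```python
def precision_recall(gold, predicted):
    """Precision and recall of predicted items against a gold-standard set."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


manual = {
    ("transition", "codon 12", "A", "G"),
    ("transition", "codon 13", "A", "G"),
}
automated = manual | {
    ("base change", "codon 12", "A", "G"),
    ("base change", "codon 13", "A", "G"),
}

precision, recall = precision_recall(manual, automated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Both manual events are recovered, so recall is perfect on this sentence, while the two extra "base change" events halve the precision.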
Relation Tagger
Annotation Pipeline
Document
Annotation: pretagging, POS tagging, entity tagging, relation tagging, treebanking, propbanking
Database, normalization, integration, interface
Data management
Carolyn Felix
Biomedical Annotation Database
Annotation Retrieval
What is this all good for, anyway?
Objective: To align the literature with genomic objects
Goal: Can we replicate a manually curated list of genes implicated in a biological process?
Domain: Angiogenesis
Rationale: To focus on the subset of genes implicated in the process of angiogenesis from whole-genome expression profiling
Applications: Entity Lists
The manual list
• Genes represented on the Affy U133 chips
• 340 genes, identified through:
  – Prior knowledge
  – Literature reviews
  – PubMed searches
  – Gene Ontology codes
  – Gene family-based inference
Applications: Entity Lists
The automated list
• Twelve partially specific angiogenic terms
• Concordancy searching of MEDLINE: 41,276 abstracts
• Trained GeneTaggerCRF with ~100 hand-annotated angiogenesis abstracts
• Tagged the document set
  – 104,118 mentions
  – 22,662 non-redundant mentions
Applications: Entity Lists
Normalization
• Human gene/alias/identifier list
  – Compiled identifiers from 19 public databases
  – 302,976 entries
  – 156,860 non-redundant entries
  – All entries mapped to 25,096 “official” gene symbols
• Aligned normalized gene and tagged gene lists
  – 50.01% of entries matched a known gene term
  – 2,389 identified genes
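The alignment step can be sketched as a dictionary lookup from a cleaned-up mention string to an official symbol, followed by counting. This is a minimal illustration, not the project's normalization code: the alias table holds only a few hypothetical entries built from names that appear in these slides, and the cleaning rule (lowercase, drop non-alphanumerics) is my assumption.

```python
import re
from collections import Counter

# Hypothetical excerpt of an alias table; keys are already cleaned.
ALIAS_TO_SYMBOL = {
    "kdr": "KDR",
    "vegfr2": "KDR",
    "vegfreceptor2": "KDR",
    "vegf": "VEGF",
}


def normalize_key(mention: str) -> str:
    """Lowercase and strip everything except letters and digits (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", mention.lower())


def align_mentions(mentions):
    """Count mentions per official symbol; collect mentions with no match."""
    counts = Counter()
    unmatched = []
    for mention in mentions:
        symbol = ALIAS_TO_SYMBOL.get(normalize_key(mention))
        if symbol is None:
            unmatched.append(mention)
        else:
            counts[symbol] += 1
    return counts, unmatched


mentions = ["KDR", "VEGFR-2", "VEGF receptor-2", "VEGF", "vegf", "endothelial marker"]
counts, unmatched = align_mentions(mentions)
print(counts.most_common())
print(unmatched)
```

Counting after normalization is what allows the different surface forms of one gene to contribute to a single frequency.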
Applications: Entity Lists
Gene    Description                                        Frequency
VEGF    Vascular endothelial growth factor                 9688
NUDT6   Antisense basic fibroblast growth factor           1887
FGF2    Fibroblast growth factor 2 (basic)                 1463
KDR     Kinase insert domain receptor                      1287
TGFB1   Transforming growth factor, beta 1                 909
TNF     Tumor necrosis factor                              908
FLT1    Fms-related tyrosine kinase 1 (VEGF/VPF receptor)  880
MMP2    Matrix metalloproteinase 2                         598
IL8     Interleukin 8                                      571
IL28B   Interleukin 28B                                    559
PECAM1  Platelet/endothelial cell adhesion molecule        558
ECGF1   Endothelial cell growth factor 1                   545
EGF     Epidermal growth factor                            524
TP53    Tumor protein p53                                  524
THBS1   Thrombospondin 1                                   501
PTGS2   Prostaglandin-endoperoxide synthase 2              427
FN1     Fibronectin 1                                      407
IL6     Interleukin 6                                      407
• Accuracy:
  – 247 (72.6%) of manual genes on the automated list
  – 91 (26.8%) of manual genes had no literature support
  – 2 (0.6%) of manual genes were missed for technical reasons
  – Overall, 99.2% recall
• Prediction:
  – Relevance ranked auto-tagged genes by number of mentions
  – Evaluated the top 40 NOT on the manual list
  – All 40 appear to be legitimate angiogenesis-related genes
• Gene Ontology (GO): 42 human genes associated with “angiogenesis” or related terms
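The accuracy figures can be reproduced from the counts on the slide. The one interpretive step is the recall: the 99.2% appears to be computed only over manual genes that have literature support (340 minus 91), which is my reading rather than something the slide states.

```python
manual_total = 340
on_automated_list = 247
no_literature_support = 91
missed_technical = 2

# The three groups account for every gene on the manual list.
assert on_automated_list + no_literature_support + missed_technical == manual_total

pct_found = 100 * on_automated_list / manual_total
pct_no_support = 100 * no_literature_support / manual_total
pct_missed = 100 * missed_technical / manual_total

# Assumption: recall is measured over genes that could be found in the literature.
findable = manual_total - no_literature_support
recall = 100 * on_automated_list / findable

print(f"found: {pct_found:.1f}%")
print(f"no literature support: {pct_no_support:.1f}%")
print(f"missed: {pct_missed:.1f}%")
print(f"recall over findable genes: {recall:.1f}%")
```

Read this way, only the two technical misses count against recall.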
Applications: Entity Lists
Applications: Entity Lists
Gene    Description                                        Frequency
NUDT6   Antisense basic fibroblast growth factor           1887
TNF     Tumor necrosis factor                              908
IL28B   Interleukin 28B                                    559
EGF     Epidermal growth factor                            524
TP53    Tumor protein p53                                  524
FN1     Fibronectin 1                                      407
IL6     Interleukin 6                                      407
CD34    CD34 antigen                                       384
EGFR    Epidermal growth factor receptor                   373
IL1B    Interleukin 1, beta                                323
PCNA    Proliferating cell nuclear antigen                 277
SOS1    Son of sevenless homolog 1                         243
FGF1    Fibroblast growth factor 1 (acidic)                239
TM7SF2  Transmembrane 7 superfamily member 2               230
GALGT2  4-GalNAc transferase                               229
PRAP1   Proline-rich acidic protein 1                      219
BMP6    Bone morphogenetic protein 6                       202
BCL2    B-cell CLL/lymphoma 2                              201
Applications: Directed Retrieval
Locus-specific Databases: Repositories of recorded mutation information
  – > 300 human genes
  – > 100 databases
  – Highly curated
  – Limited resources
CDKN2A database: Somatic and germline p16 mutations
  – Over 1400 mutation instances
  – Primarily identified through manual literature perusal
  – Large and inefficient effort
  – < 20% of identified articles contain mutation instances
Applications: Directed Retrieval
Experiment: Identify mutation instance-containing articles from “relevant” articles
• Literature search of PubMed using p16 key words:
  – 418 articles (1/2000 to 6/2002)
  – 78 articles contained mutation data (experts)
• Training
  – 218 articles
  – Logistic regression classifier
  – Features: words and word pairs
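The "words and word pairs" features can be illustrated with a small counting function. This is a sketch under assumptions: I take "word pairs" to mean adjacent words, the tokenization rule is mine, and `words_and_pairs` is a hypothetical name. The resulting counts are the kind of input a logistic regression classifier would be trained on; no classifier is fitted here.

```python
import re
from collections import Counter


def words_and_pairs(text: str) -> Counter:
    """Count single words and adjacent word pairs (assumed meaning of 'pairs')."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    features = Counter(words)
    # Join each adjacent pair with "_" so it cannot collide with a single word.
    features.update(f"{a}_{b}" for a, b in zip(words, words[1:]))
    return features


features = words_and_pairs("A missense mutation in p16 was found.")
for name, count in sorted(features.items()):
    print(name, count)
```

Pairs such as "missense_mutation" carry signal that the individual words do not, which is presumably why they were included alongside single words.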
Applications: Directed Retrieval
Evaluation
• Experts
  – Identified 200 candidate articles
  – 32 articles contained mutation information
  – 16% precision; ~100%(?) recall; F-measure 0.28
• Algorithm
  – Predicted that 88 of the 200 articles contained relevant info
  – 29 of 32 with relevant info identified
  – 44% precision; 91% recall; F-measure 0.59
  – Second random trial: comparable results
• Relevance ranking: Associated with value
  – In progress: refinement of relevance with text annotations
Conclusion: automation significantly reduces workload
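The F-measures in the evaluation follow from the standard harmonic mean of precision and recall. The sketch below recomputes the expert precision and the algorithm recall from the counts, and takes the algorithm's 44% precision as reported: the counts shown (29 of 88) would give about 33%, so that figure is not recomputed here.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Experts: 32 of 200 candidate articles contained mutation information.
expert_precision = 32 / 200
expert_recall = 1.0  # stated as "~100%(?)" on the slide
expert_f = f_measure(expert_precision, expert_recall)

# Algorithm: 29 of the 32 relevant articles were identified.
algorithm_recall = 29 / 32
algorithm_precision = 0.44  # as reported, not recomputed from the counts
algorithm_f = f_measure(algorithm_precision, round(algorithm_recall, 2))

print(f"experts:   P={expert_precision:.2f} R={expert_recall:.2f} F={expert_f:.2f}")
print(f"algorithm: P={algorithm_precision:.2f} R={algorithm_recall:.2f} F={algorithm_f:.2f}")
```

The gain in F-measure comes almost entirely from precision, that is, from experts having fewer irrelevant articles to read.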
The Global Challenge
What is MYCN?
What is MYCN related to? How?
Genes
Proteins
Pathways
Cells
Tissues
Phenotypes
Traits
Diseases
Behaviors
Environment
Genome
Literature
Integration
Cell
Disease
MYCN
Genomic position
Genomic context
Known alteration
Cellular location
Protein function
Cell type
Disease association
Clinical observation
Symptom
Environmental factor
Resources
BioIE group: http://bioie.ldc.upenn.edu/
Resources: http://bioie.ldc.upenn.edu/index.jsp?page=doc_resources.html
Documentation: http://bioie.ldc.upenn.edu/index.jsp?page=doc_users.html
Software/Tools: http://bioie.ldc.upenn.edu/index.jsp?page=doc_soft_tools.htm
Contributors
University of Pennsylvania: Avik Basu, Ann Bies, Christine Brisson, Dan Caroff, Hareesh Chandrupatla, Melissa Demian, Jacqueline Ewing, Nadeene Francesco, Hubert Jin, Aravind Joshi, Sanipa Koetswawasdi, Seth Kulick, Jeremy LaCivita, Justin Lacasse, Matt Leger, Alexis Lerro, Mark Liberman, Mark Mandel, Mark Manocchio, Mitch Marcus, Ryan McDonald, Tom Morton, Grace Mrowicki, Sina Neshatian, Ben Newman, Michael Noda, Martha Palmer, Eric Pancoast, Anita Patel, Fernando Pereira, Ariel Richmond, Karen Rudo, Andrew Schein, Mike Schultz, Jonathan Schwartz, Amanda van Scoyoc, Nilay Shah, Sarah Stippich, Sabrina Sumner, Rachel Swetz, Partha Talukdar, Julie Wang, Colin Warner, Christopher Wright, Johanna Wright, Dalal Zakhary, Ramez Zakhary
University of Vermont: Claire Anduka, Mark Greenblatt, Joan Murphy, Amy Rodgers
Sanger Institute: Sally Bamford, Elisabeth Dawson, Jon Teague, Richard Wooster
CHOP: Shannon Davis, Jayanti Jagannathan, Yang Jin, Jessica Kim, Jeremy Lautman, Pete White, Scott Winters, Garrett Brodeur, Mike Hogarty, John Maris