Post on 11-Feb-2016
The Use of Semantic Graphs for Modeling Biomedical Text
Laura Plaza
NIL - Natural Interaction based on Language
Universidad Complutense de Madrid
Semantic graph-based representation, applied to:
• Text summarization
• Automatic indexing
• Information retrieval
Why semantic?
Synonymy: "Cerebrovascular diseases during pregnancy may result from hemorrhage" = "Brain vascular disorders during gestation may result from hemorrhage"
Polysemy: "The common cold is more common in cold weather than in summer"
Why graphs?
Pneumococcal infection is a lung infection caused by Streptococcus pneumoniae. Mycoplasma pneumonia is another type of atypical pneumonia.
[Figure: concept graph with the nodes Pneumonia, Pneumococcal pneumonia and Influenza; relation label: Co-occurs with]
The patient reported feeling short of breath and was diagnosed with pneumonia.
[Figure: relation label: Symptom]
Our Proposal
• Using concepts and relations from external knowledge sources to represent the text as a graph
• Exploiting the topology of the network to identify groups of semantically related concepts that represent different topics
Representation Process
1. Document pre-processing
2. Concept identification
3. Document representation
4. Concept clustering and topic recognition
Document preprocessing
Concept Identification
The goal of the trial was to assess cardiovascular mortality for stroke
Concepts:
• Goals (Intellectual Product)
• Clinical Trials (Research Activity)
• Cardiovascular system (Body System)
• Mortality vital statistics (Quantitative Concept)
• Cerebrovascular accident (Disease or Syndrome)
Concept Identification - Ambiguity
Tissues are often cold

Phrase: “Tissues”
Meta Mapping (1000)
  1000 C0040300:Tissues (Body tissue)
Phrase: “are”
Phrase: “often cold”
Meta Mapping (888)
  694 C0332183:Often (Frequent)
  861 C0234192:Cold (Cold Sensation)
Meta Mapping (888)
  694 C0332183:Often (Frequent)
  861 C0009443:Cold (Common Cold)
Meta Mapping (888)
  694 C0332183:Often (Frequent)
  861 C0009264:Cold (Cold temperature)
Word sense disambiguation (WSD) methods:
• Personalized PageRank (PPR)
• Journal Descriptor Indexing (JDI)
• Machine Readable Dictionary (MRD)
• Automatic Extracted Corpus (AEC)
Document Representation
The goal of the trial was to assess cardiovascular mortality and morbidity for stroke, coronary heart disease and congestive heart failure, as an evidence-based guide for clinicians who treat hypertension.
[Figure: sentence graph for the sentence above. Node labels: Activity; Clinical or Research Activity; Research Activity; Study; Clinical Study; Clinical Trials; Anatomic Structure, System or Substance; Organ System; Cardiovascular System; Disease, Disorder or Finding; Disease or Disorder; Non-Neoplastic Disorder; Non-Neoplastic Disorder by Site; Non-Neoplastic Cardiovascular Disorder; Non-Neoplastic Vascular Disorder; Cerebrovascular Disorder; Cerebrovascular Accident; Disorder by Site; Respiratory and Thoracic Disorder; Thoracic Disorder; Heart Disorder; Coronary Heart Disease; Non-Neoplastic Heart Disorder; Congestive Heart Failure; Finding by Site or System; Cardiovascular System Finding; Blood Pressure Finding; Hypertensive Disease; Personnel; Professional Personnel; Clinicians]
Document Representation
• All the sentence graphs are merged into a single Document Graph
• The graph is extended with more semantic relations
• Each edge is assigned a weight in [0, 1]
• Different relations may be assigned different weights
• The more specific the concepts are, the more weight is assigned to the edge
The goal of the trial was to assess cardiovascular mortality and morbidity for stroke, coronary heart disease and congestive heart failure, as an evidence-based guide for clinicians who treat hypertension.
While event rates for fatal cardiovascular disease were similar, there was a disturbing tendency for stroke to occur more often in the doxazosin group than in the group taking chlorthalidone.
[Figure: document graph for the two sentences above.
Legend: Is-a relations; Associated-with relations; Other related relations.
Node labels: Clinicians; Research Activity; Study; Clinical Study; Clinical Trials; Organ System; Cardiovascular System; Disease or Disorder; Non-Neoplastic Disorder; Non-Neoplastic Disorder by Site; Non-Neoplastic Cardiovascular Disorder; Non-Neoplastic Vascular Disorder; Cerebrovascular Disorder; Cerebrovascular Accident; Disorder by Site; Respiratory and Thoracic Disorder; Thoracic Disorder; Heart Disorder; Coronary Heart Disease; Non-Neoplastic Heart Disorder; Congestive Heart Failure; Finding by Site or System; Cardiovascular System Finding; Blood Pressure Finding; Hypertensive Disease; Disorder of Cardiovascular System; Cardiovascular Diseases; Cardiovascular Drug; Alpha-Adrenergic Blocking Agent; Doxazosin; Pharmaceutical Adjuvant; Diuretic; Thiazide Diuretics; Chlorthalidone.
Edge weights shown: 1/2, 2/3, 3/4, 1]
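The merging and weighting steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from the talk: the rule that an is-a edge between depth d and depth d + 1 gets weight d / (d + 1) is inferred from the fractions 1/2, 2/3 and 3/4 in the figure, the weight 1 for the other relation types is likewise assumed, and the function names and example paths are hypothetical.

```python
from fractions import Fraction


def is_a_edge_weights(path):
    """Weight each is-a edge along a root-to-leaf list of concepts.

    Assumption: the edge from the concept at depth d to its child at
    depth d + 1 weighs d / (d + 1), so edges between more specific
    concepts weigh more and every weight stays in [0, 1].
    """
    weights = {}
    for depth, (parent, child) in enumerate(zip(path, path[1:]), start=1):
        weights[(parent, child)] = Fraction(depth, depth + 1)
    return weights


def merge_sentence_graphs(sentence_paths, extra_relations=(), extra_weight=Fraction(1)):
    """Merge per-sentence hierarchies into one weighted document graph.

    extra_relations holds non-hierarchical edges (for example
    "associated with"); giving them weight 1 is an assumption.
    """
    graph = {}
    for path in sentence_paths:
        graph.update(is_a_edge_weights(path))
    for source, target in extra_relations:
        graph.setdefault((source, target), extra_weight)
    return graph


# Hypothetical input: one chain of labels taken from the figure, treated
# here as a single root-to-leaf path.
example_path = ["Activity", "Clinical or Research Activity", "Research Activity", "Study"]
document_graph = merge_sentence_graphs([example_path])
```

Storing the graph as a dictionary keyed by edge keeps the merge idempotent: a concept pair that appears in several sentences is stored once.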
Concept Clustering & Topic Recognition
[Figure: document graph with the hub vertices highlighted]
Concept Clustering & Topic Recognition
• Concepts are ranked by salience
• The n vertices with the highest salience are called hub vertices

$$Salience(v_i) = \sum_{e_j \,\mid\, \exists v_k \,\wedge\, e_j \text{ connects } (v_i, v_k)} weight(e_j)$$
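As a rough sketch of this ranking step, the salience of a vertex can be computed as the sum of the weights of the edges that touch it, and the hubs taken as the n top-ranked vertices. The edge-dictionary layout and the function names are assumptions made for this example.

```python
from collections import defaultdict


def salience(weighted_edges):
    """Sum, for every vertex, the weights of the edges connected to it.

    weighted_edges maps (vertex_a, vertex_b) pairs to a weight in [0, 1].
    """
    scores = defaultdict(float)
    for (u, v), weight in weighted_edges.items():
        scores[u] += weight
        scores[v] += weight
    return dict(scores)


def hub_vertices(weighted_edges, n):
    """Return the n vertices with the highest salience.

    Ties are broken alphabetically so the result is deterministic;
    that tie-breaking rule is an assumption of this sketch.
    """
    scores = salience(weighted_edges)
    return sorted(scores, key=lambda vertex: (-scores[vertex], vertex))[:n]
```

Calling `hub_vertices(graph, n)` with a small n keeps only the most connected, most specific concepts, which is what the clustering step starts from.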
Concept Clustering & Topic Recognition
• The hub vertices are grouped into Hub Vertex Sets (HVSs)
• The remaining vertices are assigned to the cluster to which they are most connected
• The number and properties of the clusters depend strongly on the parameters' values
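The assignment of the remaining vertices can be sketched as follows. The slides do not say how the hubs are grouped into HVSs, so this example takes the HVSs as given input. Measuring connectivity as the summed weight of the edges between a vertex and the hubs of each HVS is an assumption, as are the function name and the decision to leave vertices with no link to any hub unassigned.

```python
def assign_to_clusters(weighted_edges, hub_vertex_sets):
    """Grow each Hub Vertex Set into a cluster.

    Every non-hub vertex joins the cluster whose hubs it is most strongly
    connected to (sum of edge weights). Ties go to the first cluster, and
    vertices with no edge to any hub stay unassigned.
    """
    clusters = [set(hvs) for hvs in hub_vertex_sets]
    if not clusters:
        return clusters
    hubs = set().union(*clusters)
    vertices = {vertex for edge in weighted_edges for vertex in edge}
    for vertex in sorted(vertices - hubs):
        connectivity = [0.0] * len(clusters)
        for (u, v), weight in weighted_edges.items():
            if vertex == u:
                neighbour = v
            elif vertex == v:
                neighbour = u
            else:
                continue
            for index, hvs in enumerate(hub_vertex_sets):
                if neighbour in hvs:
                    connectivity[index] += weight
        best = max(range(len(clusters)), key=connectivity.__getitem__)
        if connectivity[best] > 0:
            clusters[best].add(vertex)
    return clusters
```

Connectivity is measured against the original HVSs rather than the growing clusters, so the result does not depend on the order in which vertices are visited.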
Concept Clustering & Topic Recognition
[Figure: example clusters. Concept labels shown: Chlorthalidone; Congestive heart failure; Drug pseudoallergen by function; Adverse reactions; Amlodipine; Blood pressure finding; Hepatic; Cerebrovascular accident; Health personnel; Clinicians; Persons; Elderly; Patients; Organism; Population group]
Text Summarization
Creating a compacted version of one or various documents
Motivation
• Summaries as an indication of what a document is about
• Improving indexing, categorization, and IR
Types
• Extracts vs. abstracts
• Single vs. multi-document
• Generic vs. application-oriented
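Text Summarization
[Figure: every sentence (Sentence 1 ... Sentence n) is compared with every cluster (Cluster 1 ... Cluster m); example similarity values shown: 35.0, 12.0, 4.0, 86.0]

$$similarity(C_i, S_j) = \sum_{v_k \,\mid\, v_k \in S_j} w_{k,j}$$

$$w_{k,j} = \begin{cases} 0 & \text{if } v_k \notin C_i \\ 1.0 & \text{if } v_k \in HVS(C_i) \\ 0.5 & \text{if } v_k \in C_i \wedge v_k \notin HVS(C_i) \end{cases}$$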
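The sentence-to-cluster similarity is a weighted count, which the following sketch makes concrete: concepts of the sentence that are hubs of the cluster count 1.0, other concepts of the cluster count 0.5, and concepts outside the cluster count 0. The function and parameter names are hypothetical, and the sketch assumes that the HVS is a subset of its cluster.

```python
def sentence_cluster_similarity(sentence_concepts, cluster, hvs,
                                hub_weight=1.0, other_weight=0.5):
    """Similarity between one sentence and one cluster.

    sentence_concepts: concepts (graph vertices) found in the sentence.
    cluster: all concepts of the cluster, hubs included.
    hvs: the hub vertex set of that cluster.
    """
    total = 0.0
    for concept in set(sentence_concepts):
        if concept in hvs:
            total += hub_weight
        elif concept in cluster:
            total += other_weight
        # Concepts outside the cluster contribute 0.
    return total
```

Each concept is counted once per sentence, since the sum runs over the vertices of the sentence graph rather than over word occurrences.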
Text Summarization
Sentences ranked by similarity within each cluster:
Cluster 1: Sentence 1 (98.0), ..., Sentence 6 (18.0)
...
Cluster n: Sentence n (28.0), ..., Sentence 3 (1.0)
Sentence selection
• H.1: Selecting the top n ranked sentences from the biggest cluster
• H.2: Selecting n_i sentences from each cluster
• H.3: Weighting the sentence-to-cluster similarity by the clusters' sizes
+ other traditional criteria: frequency, position, similarity with the title, etc.
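Heuristics H.1 and H.2 can be sketched as below. The slides do not specify how n_i is chosen, so the per-cluster quotas are passed in by the caller. The nested-dictionary layout, the function names, and the use of integer sentence positions as identifiers are assumptions for this example. H.3 is left out because its exact weighting is not given here.

```python
def rank_sentences(scores):
    """Sort sentence ids by decreasing similarity, then by position."""
    return sorted(scores, key=lambda sentence: (-scores[sentence], sentence))


def select_h1(similarity, cluster_sizes, n):
    """H.1: top n sentences of the biggest cluster.

    similarity[cluster][sentence] is the sentence-to-cluster similarity;
    cluster_sizes[cluster] is the number of concepts in the cluster.
    """
    biggest = max(cluster_sizes, key=cluster_sizes.get)
    return sorted(rank_sentences(similarity[biggest])[:n])


def select_h2(similarity, quotas):
    """H.2: quotas[cluster] sentences from each cluster, without repeats."""
    chosen = []
    for cluster, quota in quotas.items():
        ranked = rank_sentences(similarity[cluster])
        chosen.extend([s for s in ranked if s not in chosen][:quota])
    return sorted(chosen)
```

Both functions return the selected sentences in document order, so the extract reads in the same sequence as the source text.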
Text Summarization
Evaluation: How is the important content preserved in the summary?
• ROUGE automatic evaluation metrics
• Comparison with the abstracts of the articles

Method          ROUGE-2   ROUGE-SU4
H.3*            0.3538    0.3267
H.2*            0.3421    0.3205
H.1*            0.3453    0.3189
LexRank         0.3248    0.3097
SUMMA           0.3187    0.2989
AutoSummarize   0.2446    0.2318
Text Summarization
Evaluation: How does ambiguity affect summarization?

WSD method      ROUGE-2   ROUGE-SU4
AEC             0.3670    0.3379
MRD             0.3611    0.3341
JDI             0.3538    0.3267
First mapping   0.3283    0.3117
Summarization of Biological Entity-related Information
Multi-document, application-oriented summarization
Given a list of genes (or proteins):
1. Retrieving documents related to the genes
2. Building a semantic graph-based representation of the corpus
3. Identifying groups of genes/proteins
4. Generating a summary for each group that describes the functionality of the entities
Automatic Indexing of Biomedical Literature using Summaries
Title + Abstract
MTI
Ordered list of MeSH main headings
Full text
Refined list of MeSH Headings
Automatic Indexing of Biomedical Literature using Summaries
What about using the full texts?
◦ Recall increases but precision decreases
What about using automatic summaries of different lengths?
◦ As the length increases, recall improves but precision worsens
◦ There is a summary length which maximizes F-measure
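The trade-off can be made explicit with the standard F-measure, the harmonic mean of precision and recall when beta = 1. The sketch below picks the summary length with the highest F-measure from results supplied by the caller. The function names and the input layout are hypothetical, and no measured values from the study are used.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-measure; beta = 1 gives the harmonic mean."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def best_summary_length(results):
    """Return the summary length whose (precision, recall) maximizes F.

    results maps a summary length (for example a percentage of the
    full text) to the (precision, recall) obtained when indexing from
    a summary of that length.
    """
    return max(results, key=lambda length: f_measure(*results[length]))
```

Because precision falls and recall rises with length, the F-measure typically peaks at an intermediate length rather than at either extreme.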
Retrieval of Similar Patient Cases
Motivation: Facilitating access to previous cases
Problem: Given a reference patient record, retrieve others from the clinical database that are similar to the reference one
Retrieval of Similar Patient Cases
When can we consider that two patient records are similar?
• Same symptom or sign (e.g., fever)
• Same diagnosis (e.g., bacterial pneumonia)
• Same test or procedure (e.g., endoscopy biopsy)
• Same medication (e.g., clopidogrel)
But... absent criteria are not relevant!
Retrieval of Similar Patient Cases
• The records are represented using UMLS graphs
• Concepts are filtered by semantic types
• Negated concepts are ignored

Category             UMLS Semantic Types
Symptoms and Signs   Sign or Symptom; Finding
Diseases             Disease or Syndrome; Pathologic Function
Procedures           Therapeutic or Preventive Procedure; Diagnostic Procedure
Body Parts           Body Location or Region; Body Part, Organ, or Organ Component
Medications          Pharmacologic Substance
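The filtering step can be sketched as a simple predicate over the concepts extracted from a record: keep a concept only if it is not negated and its semantic type is one of those in the table. The `Concept` class and the function name are hypothetical; the semantic type names are the UMLS ones listed above.

```python
from dataclasses import dataclass

# Semantic types kept for the comparison, as listed in the table.
KEPT_SEMANTIC_TYPES = {
    "Sign or Symptom",
    "Finding",
    "Disease or Syndrome",
    "Pathologic Function",
    "Therapeutic or Preventive Procedure",
    "Diagnostic Procedure",
    "Body Location or Region",
    "Body Part, Organ, or Organ Component",
    "Pharmacologic Substance",
}


@dataclass(frozen=True)
class Concept:
    """A concept found in a patient record (hypothetical structure)."""
    name: str
    semantic_type: str
    negated: bool = False


def filter_concepts(concepts):
    """Drop negated concepts and concepts of irrelevant semantic types."""
    return [
        concept for concept in concepts
        if not concept.negated and concept.semantic_type in KEPT_SEMANTIC_TYPES
    ]
```

Dropping negated concepts before building the graph reflects the point made earlier: an absent criterion should not make two records look similar.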
Retrieval of Similar Patient Cases
We compute the similarity between the reference record and all records in the database
[Figure: table with the columns Similarity and Votes for the candidate records; MaxSimilarity = 0.4869]
Similarity
[Figure: comparison of two UMLS concept graphs, Graph A and Graph B.
Node labels: Finding by site; Clinical finding; Disease; Bacterial pneumonia; Infectious disease; Disorder by body site; Pneumonia due to Streptococcus; Mycoplasma pneumonia; Respiratory finding; Functional finding of respiratory tract; Coughing; Pneumococcal pneumonia; Pneumonia due to anaerobic bacteria; Pneumonia due to pleuropneumonia; Virus Diseases.
Edge weights shown: 1/11, 2/11, 3/11, 8/11, 9/11, 10/11, 11/11 and 3/5, 4/5, 5/5]
Automatic Indexing of EHR
Discovering relevant SNOMED-CT concepts in health records, in 4 steps:
1. Spell checking
2. Acronym expansion and WSD
3. Negation detection
4. Concept identification
Automatic Indexing of EHR
1. Spell checking
◦ Hunspell + Levenshtein + keyboard + phonetic distance
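Of the distances named above, the Levenshtein edit distance is the easiest to illustrate. The sketch below ranks correction candidates by edit distance only. It is a simplified stand-in, not the combined Hunspell, keyboard and phonetic scoring used in the system, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]


def rank_corrections(word, candidates):
    """Order candidate corrections from closest to farthest."""
    return sorted(candidates, key=lambda candidate: (levenshtein(word, candidate), candidate))
```

In a fuller system the candidate list would come from the dictionary (for example the suggestions of a spell checker), and the edit distance would be only one term of the final score.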
Automatic Indexing of EHR
2. Acronym expansion and WSD
◦ A list of abbreviations + Machine Learning + expert rules
Automatic Indexing of EHR
3. Negation detection
◦ Spanish adaptation of the NegEx algorithm
◦ Negation cue + Negation scope
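The cue-plus-scope idea can be illustrated with a deliberately small sketch: a token is treated as negated if it follows a negation cue within a fixed window and before a scope terminator. This is not the NegEx algorithm or its Spanish adaptation. The cue list, the terminator list, the window size and the function name are all assumptions chosen for the example.

```python
# Hypothetical, very small Spanish lexicons for illustration only.
NEGATION_CUES = {"no", "sin", "niega"}
SCOPE_TERMINATORS = {"pero", "aunque", ".", ","}


def negated_token_indices(tokens, max_scope=5):
    """Return the indices of tokens that fall inside a negation scope.

    The scope starts right after a cue and ends at the first terminator
    or after max_scope tokens, whichever comes first.
    """
    negated = set()
    for i, token in enumerate(tokens):
        if token.lower() not in NEGATION_CUES:
            continue
        for j in range(i + 1, min(len(tokens), i + 1 + max_scope)):
            if tokens[j].lower() in SCOPE_TERMINATORS:
                break
            negated.add(j)
    return negated


# "patient without fever but with cough": only "fiebre" is negated.
example = negated_token_indices("paciente sin fiebre pero con tos".split())
```

Concepts whose tokens fall inside a negated span would then be excluded from indexing, as in the similar-case retrieval described earlier.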
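Automatic Indexing of EHR
4. Concept identification
Query: "El recién nacido fue ingresado" (The newborn was admitted)
The query is matched against the SNOMED-CT concept descriptions.
Candidate mappings:
◦ Recién nacido (Newborn)
◦ Recién nacido prematuro (Premature newborn)
◦ Ingreso del paciente (Patient admission)
A scoring function selects the final mappings:
◦ Recién nacido (Newborn)
◦ Ingreso del paciente (Patient admission)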
Automatic Indexing of EHR
Future work
◦ Representing the EHR as a graph using different relations from SNOMED-CT
◦ Computing the salience of the concepts to obtain the most representative ones
◦ Using such representation in different NLP tasks (e.g., categorization, IR, etc.)
Further Readings
Summarization
• Plaza, L., Díaz, A., Gervás, P. (2011). A semantic graph-based approach to biomedical summarization. Artificial Intelligence in Medicine, 53.
• Plaza, L. (2012). Evaluating the importance of sentence position for automatic summarization of biomedical literature. Submitted to Bioinformatics.
Word Sense Disambiguation
• Plaza, L., Stevenson, M., Díaz, A. (2012). Resolving Ambiguity in Biomedical Text to Improve Summarization. Information Processing & Management, 48(4).
• Plaza, L., Jimeno-Yepes, A., Díaz, A., Aronson, A. (2011). Studying correlation between different word sense disambiguation methods and summarization effectiveness in biomedical texts. BMC Bioinformatics, 12.
Automatic Indexing
• Jimeno-Yepes, A., Plaza, L., Mork, J., Díaz, A., Aronson, A. (2012). Using automatic summaries to improve automatic indexing. To appear in BMC Bioinformatics.
Retrieval of Similar Cases
• Plaza, L., Díaz, A. (2010). Retrieval of Similar Electronic Health Records using UMLS Concept Graphs. 15th International Conf. on Applications of Natural Language to Information Systems.