Automating Discovery from Biomedical Texts
Marti Hearst & Barbara Rosario, UC Berkeley
Agyinc Visit, August 16, 2000
The LINDI Project: Linking Information for New Discoveries
Two Main Thrusts:
– UIs for building and reusing hypothesis-seeking strategies
– Statistical language analysis techniques for extracting propositions
Scenario: Explore Functions of a Gene
Objective
– Determine the functions of a newly sequenced Gene X.
Known facts
– Gene X co-expresses (is activated in the same cell) with Genes A, B, and C.
– The relationships of Genes A, B, and C to certain types of diseases (from the medical literature).
Question
– What types of diseases is Gene X related to?
Gene Co-expression: Role in the genetic pathway
[Figure: pathway diagrams placing the unknown genes g? and h? relative to PSA, Kall., and PAP]
Other possibilities as well
Make use of the literature
Look up what is known about the other genes.
Different articles in different collections
Look for commonalities
– Similar topics indicated by Subject Descriptors
– Similar words in titles and abstracts: adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
Developing Strategies
Different strategies seem needed for different situations
– First: see what is known about Kallikrein.
– 7341 documents. Too many.
– AND the result with the “disease” category
» If the result is non-empty, this might be an interesting gene
– Now get 803 documents
[Figure: Explore Functions of New Gene X. Elements shown: Medical Literature, Gene-A, Keywords. Operations shown: Query, Projection, Mapping. Slide adapted from K. Patel.]
Developing Strategies
Different strategies seem needed for different situations
– First: see what is known about Kallikrein.
– 7341 documents. Too many.
– AND the result with the “disease” category
» If the result is non-empty, this might be an interesting gene
– Now get 803 documents
– AND the result with PSA
» Get 11 documents. Better!
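The narrowing steps on this slide amount to successive Boolean AND operations over sets of document IDs. The following is a minimal sketch under that assumption; the toy index, its document IDs, and the helper names `and_query` and `narrow` are invented for illustration and do not reproduce the 7341 / 803 / 11 counts or the system actually used.

```python
from functools import reduce

# Hypothetical toy inverted index: term -> set of document IDs.
# The contents are invented for illustration only.
INDEX = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 9},
    "psa": {3, 4, 7},
}


def and_query(index, terms):
    """Return the IDs of documents that match every term (Boolean AND)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    return reduce(set.intersection, postings)


def narrow(index, terms):
    """Add one term at a time and record how many documents remain."""
    return [
        (terms[: i + 1], len(and_query(index, terms[: i + 1])))
        for i in range(len(terms))
    ]


steps = narrow(INDEX, ["Kallikrein", "disease", "PSA"])
for used_terms, remaining in steps:
    print(" AND ".join(used_terms), "->", remaining, "documents")
```

Recording the count after each added term mirrors how the researcher judges whether a result set is still too large to read.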
[Figure: Explore Functions of New Gene X. Elements shown: Medical Literature, Gene-A, Gene-B, Gene-C, Keywords. Operations shown: Query, Projection, Intersection.]
Developing Strategies
Look for commonalities among these documents
– Manual scan through ~100 category labels
– Would have been better if
» Automatically organized
» Intersections of “important” categories scanned for first
[Figure: Explore Functions of New Gene X. Elements shown: Medical Literature, Gene-A, Gene-B, Gene-C, Keywords. Operations shown: Query, Projection, Intersection, Mapping, Slicing. Slide adapted from K. Patel.]
Try a new tack
Researcher uses knowledge of the field to realize these are related to prostate cancer and diagnostic tests
New tack: intersect search on all three known genes
– Hope they all talk about diagnostics and prostate cancer
– Fortunately, 7 documents returned
– Bingo! A relation to regulation of this cancer
[Figure: Explore Functions of New Gene X. Elements shown: Medical Literature, Gene-A, Gene-B, Gene-C, Keywords, Possible Function for Gene-X. Operations shown: Query, Projection, Intersection, Mapping, Slicing. Slide adapted from K. Patel.]
Formulate a Hypothesis
Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
New tack: do some lab tests
– See if mystery gene is similar in molecular structure to the others
– If so, it might do some of the same things they do
Strategies again
In hindsight, combining all three genes was a good strategy.
– Store this for later
Might not have worked
– Need a suite of strategies
– Build them up via experience and a good UI
The System
Doing the same query with slightly different values each time is time-consuming and tedious
Same goes for cutting and pasting results
– IR systems don’t support varying queries like this very well.
– Each situation is a bit different
Some automatic processing is needed in the background to eliminate/suggest hypotheses
The User Interface
A general search interface should support
– History
– Context
– Comparison
– Operators: Intersection, Union, Slicing
– Operator Reuse
– Visualization (where appropriate)
We have an initial implementation
It needs lots of work
Architecture of LINDI UI
Data Layer
Annotation Layer
User Interface Layer
Data Layer
Purpose
– Hide different formats of text collections
Components
– Data: Abstractions representing records of a text collection
– Operations: performed on the data
Data
– A set of records
– Each record is a set of tuples with types
Operations
– union, intersection, projection, mapping
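One way to picture the data layer described above is sketched here, assuming that a record can be modeled as a frozenset of (type, value) tuples and a data set as a Python set of records. The function names, the field names (`id`, `mesh`), and the sample values are illustrative assumptions, not the LINDI implementation.

```python
def make_record(**fields):
    """Build a record: one (type, value) tuple per value of each field."""
    tuples = set()
    for ftype, value in fields.items():
        values = value if isinstance(value, (list, tuple, set)) else [value]
        tuples.update((ftype, v) for v in values)
    return frozenset(tuples)


def union(a, b):
    """Records that appear in either data set."""
    return a | b


def intersection(a, b):
    """Records that appear in both data sets."""
    return a & b


def projection(records, ftype):
    """Keep only tuples of one type in each record; drop records left empty."""
    projected = {frozenset(t for t in rec if t[0] == ftype) for rec in records}
    projected.discard(frozenset())
    return projected


def mapping(records, fn):
    """Apply fn to every record and collect the results (order is arbitrary)."""
    return [fn(rec) for rec in records]


# Illustrative records; the IDs and headings are made up for the example.
doc_a = make_record(id="d1", mesh=["Prostatic Neoplasms", "Kallikreins"])
doc_b = make_record(id="d2", mesh=["Prostatic Neoplasms"])
dataset = {doc_a, doc_b}
mesh_only = projection(dataset, "mesh")
```

Because records are hashable frozensets, union and intersection fall out of Python's built-in set operators, which keeps the operation layer independent of how any particular text collection stores its fields.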
Annotation Layer
Purpose
– Associate data sets with the operations that produced them (history)
– History is a first-class object
Advantage
– Streamline a sequence of operations
– Reuse operations
– Parameterize operations
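A sketch of what treating history as a first-class object could look like, assuming each data set carries the list of operations that produced it. The class names `Step` and `Annotated`, the `replay` helper, and its `overrides` parameter are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class Step:
    """One recorded operation: a name, the function, and its extra arguments."""
    name: str
    fn: Callable[..., Any]
    args: Tuple[Any, ...] = ()


@dataclass
class Annotated:
    """A data set paired with the history of operations that produced it."""
    data: Any
    history: List[Step] = field(default_factory=list)

    def apply(self, name, fn, *args):
        """Run fn(data, *args); return a new Annotated with the step appended."""
        return Annotated(fn(self.data, *args), self.history + [Step(name, fn, args)])


def replay(history, start, overrides=None):
    """Reuse a recorded history on new data, optionally with new arguments."""
    overrides = overrides or {}
    result = Annotated(start)
    for step in history:
        args = overrides.get(step.name, step.args)
        result = result.apply(step.name, step.fn, *args)
    return result


def keep_over(xs, n):
    return [x for x in xs if x > n]


def double(xs):
    return [2 * x for x in xs]


first = Annotated([1, 5, 9]).apply("filter", keep_over, 4).apply("double", double)
reused = replay(first.history, [3, 6])
reparameterized = replay(first.history, [1, 5, 9], {"filter": (8,)})
```

Here `reused` runs the same sequence on different data (reuse), and `reparameterized` runs it with a different threshold (parameterization), without the user rebuilding either step by hand.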
User Interface
Direct manipulation of information objects and access operations
– Query
– Intersection
– Union
– Mapping
– Slicing
Record and reuse of past operations
Parameterization of operations
Streamlining of operations
Initial Palette
Query Structure Determined by Collection Type
Query Operation Results
Projection Operation and Subsequent Results
Parameterized Query: Repeat operations with different values
[Figure labels: GA, GB, GC]
Intersection over Projected Attribute
Example Interaction with UI Prototype
1. Query on gene names
2. Project out only MeSH headings
3. Intersect the results
4. Map to create a ranking
5. Slice out the top-ranked
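The five steps above can be sketched as a small pipeline. This is an illustration under stated assumptions: `COLLECTION` is an invented in-memory stand-in for a literature backend, the headings attached to each gene are made up, and ranking by summed heading counts is one plausible reading of "map to create a ranking", not the prototype's actual scoring.

```python
from collections import Counter

# Hypothetical stand-in for a literature backend: gene name -> records,
# each record holding the MeSH headings of one document. All data is invented.
COLLECTION = {
    "GA": [{"mesh": {"Prostatic Neoplasms", "Tumor Markers"}},
           {"mesh": {"Adenocarcinoma", "Prostatic Neoplasms"}}],
    "GB": [{"mesh": {"Prostatic Neoplasms", "Antibodies"}},
           {"mesh": {"Tumor Markers"}}],
    "GC": [{"mesh": {"Prostatic Neoplasms", "Tumor Markers"}}],
}


def query(gene):
    """Step 1: fetch the records for one gene name."""
    return COLLECTION.get(gene, [])


def project_mesh(records):
    """Step 2: keep only the MeSH headings, counted across the records."""
    return Counter(h for rec in records for h in rec["mesh"])


def intersect(counters):
    """Step 3: headings shared by every gene, with their counts summed."""
    shared = set.intersection(*(set(c) for c in counters))
    return {h: sum(c[h] for c in counters) for h in shared}


def rank(scores):
    """Step 4: map headings to a ranking, highest count first, ties by name."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def top(ranking, k):
    """Step 5: slice out the k top-ranked headings."""
    return ranking[:k]


projected = [project_mesh(query(g)) for g in ("GA", "GB", "GC")]
ranking = rank(intersect(projected))
result = top(ranking, 1)
```

Note that `intersect` expects at least one projected result; an empty gene list would need separate handling.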
Future Work on UI
As currently designed
– Better labeling
– Better layout
» Intuitive
» Scalable
– Connection to real backend
– User Testing
» Does direct manipulation work?
» What operator sequences help?
» How to improve parameterization?
More advanced
– Support for strategies
– Incorporation of NLP
Language Analysis Component
Goals:
– Extract Propositions from Text
– Make Inferences
Language Analysis Component
Why Extract Propositions from Text?
– Text is how knowledge at the propositional level is communicated
– Text is continually being created and updated by the outside world
Example: Statistical Semantic Grammar
To detect causal relationships between medical concepts
– Title: Magnesium deficiency implicated in increased stress levels.
– Interpretation: <nutrient><reduction> related-to <increase><symptom>
– Inference:
» Increase(stress, decrease(mg))
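To make the title-to-inference step concrete, here is a toy sketch that tags the words of the example title with semantic classes, checks them against the pattern above, and builds a nested term. The lexicon and cue list are hand-written for this single title and are assumptions for illustration; a statistical semantic grammar would learn such assignments from data rather than hard-code them.

```python
import re

# Toy lexicon: word -> (semantic class, symbol used in the output term).
# Hand-written for this one title; purely illustrative.
LEXICON = {
    "magnesium": ("nutrient", "mg"),
    "deficiency": ("reduction", "decrease"),
    "increased": ("increase", "increase"),
    "stress": ("symptom", "stress"),
}
RELATION_CUES = {"implicated"}
PATTERN = ["nutrient", "reduction", "related-to", "increase", "symptom"]


def tag(title):
    """Return (class, symbol) pairs for known words; unknown words are skipped."""
    tags = []
    for word in re.findall(r"[a-z]+", title.lower()):
        if word in LEXICON:
            tags.append(LEXICON[word])
        elif word in RELATION_CUES:
            tags.append(("related-to", None))
    return tags


def infer(title):
    """Match the pattern and return a nested term, or None if it does not fit."""
    tags = tag(title)
    if [cls for cls, _ in tags] != PATTERN:
        return None
    nutrient, reduction, _, increase, symptom = (sym for _, sym in tags)
    return (increase, symptom, (reduction, nutrient))


def render(term):
    """Render a nested tuple as head(arg, ...)."""
    if isinstance(term, tuple):
        head, *args = term
        return f"{head}({', '.join(render(a) for a in args)})"
    return term


term = infer("Magnesium deficiency implicated in increased stress levels.")
print(render(term))
```

Representing the inference as a nested tuple keeps the proposition machine-readable, so later steps could compare or chain propositions extracted from different titles.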
Statistical Semantic Grammars
Empirical NLP has made great strides
– But mainly applied to syntactic structure
Semantic grammars are powerful, but
– Brittle
– Time-consuming to construct
Idea:
– Use what we now know about statistical NLP to build up a probabilistic grammar
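One standard ingredient of a probabilistic grammar is estimating rule probabilities by relative frequency from observed rule applications. The sketch below assumes that approach; the slides do not say which estimation method LINDI would use, and the rule names and observations are invented for illustration.

```python
from collections import Counter


def estimate_rule_probs(observed_rules):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(observed_rules)
    lhs_totals = Counter()
    for (lhs, _rhs), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}


# Invented observations of semantic rules, written as (lhs, rhs) pairs.
OBSERVED = [
    ("CAUSE", ("nutrient", "reduction")),
    ("CAUSE", ("nutrient", "reduction")),
    ("CAUSE", ("drug", "increase")),
    ("EFFECT", ("increase", "symptom")),
]

probs = estimate_rule_probs(OBSERVED)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.2f}")
```

The probabilities for each left-hand side sum to one, which is what lets competing interpretations of the same text be scored against each other rather than accepted or rejected outright.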
LINDI: Target Components
1. Special UI for retrieving appropriate docs
2. Language analysis on docs to detect causal relationships between concepts
3. Probabilistic representation of concepts and relationships
4. UI + User: Hypothesis creation