Date posted: 18-Jul-2015
Category: Science
Uploaded by: pistoia-alliance
Text-mining for pharma R&D in a social world
a Pistoia Alliance Debates webinar
Tuesday March 17th, 2015 @ 3-4pm UK
chaired by Veit Ulshoefer
© Pistoia Alliance
Chair and Panelists
David Milward
Chief technology officer (CTO) at Linguamatics. He is a pioneer of interactive text mining, and a founder of Linguamatics. He has
over 20 years' experience of product development, consultancy and research in natural language processing (NLP). After receiving
a PhD from the University of Cambridge, he was a researcher and lecturer at the University of Edinburgh. He has published in the
areas of information extraction, spoken dialogue, parsing, syntax and semantics.
Jane Reed
Head of life science strategy at Linguamatics. She is responsible for developing the strategic vision for Linguamatics’ growing
product portfolio and business development in the life science domain. Jane has extensive experience in life sciences informatics.
She worked for more than 15 years in vendor companies supplying data products, data integration and analysis and consultancy
to pharma and biotech - with roles at Instem, BioWisdom, Incyte, and Hexagen. Before moving into the life science industry, Jane
worked in academia with post-docs in genetics and genomics.
Luca Toldo
Associate Director Information Services at Merck KGaA.
Gordon Baxter
Chief Scientific Officer at Instem plc. Has been both a customer (in senior R&D roles in Pharma) and a vendor (in senior roles at
Pharmagene, BioWisdom and now Instem) of IT solutions targeting numerous points in the R&D continuum. Board member of
Pistoia Alliance. Keen interest in Translational Informatics; finding value in bringing data together from research, development and
medical practice over 20 years. PhD from University of Bradford, UK.
Text-mining for pharma R&D
in a social world
17th March 2015
Dr. Jane Reed, Head of Life Science Strategy, Linguamatics
What information do we need?
• What targets are involved in bone cancer?
• Which companies are patenting a particular technology?
• How are people comparing my product with others?
• What are the safety risks of my product compared to others in the same class?
• What are common factors shared by patients requiring rehospitalisation?
• What other diseases could my drug treat?
Challenges
• Most of the answers to these questions are in free text documents
• Ever-increasing amounts of text data to examine
– Different kinds of documents
• External literature, patents, news, internal reports, blogs, presentations
– Different formats
• HTML, PDF, XML, Word, PPT, Wiki
[Chart: PubMed Records, axis from 0 to 25,000,000]
Search Engines – keywords
All these documents contain the keywords ‘breast cancer’.
You have to read ALL of each document to find the bit relevant to you.
Breast Cancer
Issues with Keyword Search
• Can pull back hundreds or thousands of hits
• Can retrieve noisy or irrelevant hits
• May not retrieve all the relevant hits, depending on the keywords used
• Difficult to ask “open” questions or pull out connections
What is Text Mining?
Interpret Meaning, Identify & Extract
• Facts
• Relationships
• Assertions
Linguistic Processing Using NLP
• Interprets meaning of the text
• Groups words into meaningful units
• Search for different forms of words
We find that germline BRCA1 mutations are seen in early-onset breast cancer patients.
BRCA1 gene mutations have been found in ca. 50% of hereditary breast cancers.
Analysis layers:
• sentences
• noun groups - match entities
• verb groups - match actions
• morphology - different forms
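As a rough illustration of the steps on this slide, the sketch below splits the two example sentences and matches different word forms with regular expressions. It is a minimal sketch under stated assumptions, not how any particular product works: the pattern names and the `annotate` helper are hypothetical, and a real NLP pipeline would use a proper tokenizer, chunker and morphological analyser rather than hand-written patterns.

```python
import re

TEXT = (
    "We find that germline BRCA1 mutations are seen in early-onset breast cancer patients. "
    "BRCA1 gene mutations have been found in ca. 50% of hereditary breast cancers."
)

# Split after a full stop only when the next token starts with a capital letter,
# so the abbreviation "ca. 50%" does not end a sentence.
SENTENCE_BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-Z])")

# Hypothetical patterns, one per concept, covering different word forms.
PATTERNS = {
    "gene": re.compile(r"\bBRCA1\b"),
    "mutation": re.compile(r"\bmutat(?:ion|ions|ed|es|e)\b", re.IGNORECASE),
    "disease": re.compile(r"\bbreast cancers?\b", re.IGNORECASE),
}


def split_sentences(text):
    """Return the sentences of `text` using the simple boundary rule above."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


def annotate(sentence):
    """Return, for each concept label, the surface forms found in the sentence."""
    return {
        label: [m.group(0) for m in pattern.finditer(sentence)]
        for label, pattern in PATTERNS.items()
    }


sentences = split_sentences(TEXT)
annotations = [annotate(s) for s in sentences]
for sentence, found in zip(sentences, annotations):
    print(sentence)
    print("  ", found)
```

Both "breast cancer" and "breast cancers" end up under the same label, which is the effect the morphology layer is meant to achieve.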
Semantics
• Finding meaning rather than “surface” word
• Use concepts e.g. “breast cancer” to pick up different
ways the concept might be expressed (synonyms)
– e.g. “breast neoplasm”, “breast tumour”
• Disambiguate cases where one term could mean several
concepts
– e.g. NLP: Natural Language Processing, Neuro-Linguistic
Programming
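One simple way to picture the disambiguation step is to score each candidate meaning by the context words around the ambiguous term. The sketch below is only an illustration: the cue-word sets and the `disambiguate` function are assumptions made for this example, and production systems use far richer context models.

```python
import re

# Hypothetical cue words for each sense of the abbreviation "NLP".
SENSES = {
    "Natural Language Processing": {"text", "parsing", "corpus", "mining", "language"},
    "Neuro-Linguistic Programming": {"therapy", "coaching", "hypnosis", "practitioner"},
}


def disambiguate(sentence):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns None when no cue word is present. Ties go to the first sense listed.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("We used NLP for text mining of a clinical corpus"))
print(disambiguate("An NLP practitioner offered coaching and therapy"))
```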
Semantics
• Find the same relationship however expressed e.g.
– “Statins treat high cholesterol”
– “High cholesterol is treated by statins”
– “Treatment of high cholesterol by statins”
• Provide results in a more standardized, semantic,
representation
– Better clustering of results
– Better statistics
– Connect results from text mining with other databases
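The three "statins" phrasings can be mapped to one standardized relation. The sketch below does this with three hand-written patterns; it is a toy under stated assumptions (the patterns, the `normalise` function and the triple layout are all hypothetical), whereas a real system would rely on linguistic analysis rather than fixed templates.

```python
import re

# Hypothetical templates, tried in order: nominal, passive, then active voice.
PATTERNS = [
    re.compile(r"^treatment of (?P<disease>.+?) (?:by|with) (?P<drug>.+)$"),
    re.compile(r"^(?P<disease>.+?) (?:is|are) treated (?:by|with) (?P<drug>.+)$"),
    re.compile(r"^(?P<drug>.+?) treats? (?P<disease>.+)$"),
]


def normalise(statement):
    """Return a (drug, relation, disease) triple, or None if no template fits."""
    text = statement.strip().rstrip(".").lower()
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return (match.group("drug"), "treats", match.group("disease"))
    return None


statements = [
    "Statins treat high cholesterol",
    "High cholesterol is treated by statins",
    "Treatment of high cholesterol by statins",
]
triples = [normalise(s) for s in statements]
print(triples)
```

Because every phrasing yields the same triple, results can be counted, clustered or joined to other databases on the normalized form.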
From Words to Meaning
“Among them, nimesulide, a selective COX2 inhibitor, …”
Identifying entities and relations: nimesulide inhibits COX2 (Entrez Gene ID: 5743)
Linguistics to establish relationships
Toolbox of Methods
NLP
• Precise linguistic relationships, sentence co-occurrence
• Precise negation e.g. “pressure” but not “blood pressure”
Terminologies
• Search for concepts and classes, not just keywords
• e.g. cancer and get synonyms and children: Malignant neoplasms, Malignant tumor …
Regular Expressions
• Rule based pattern matching for e.g. measurements, lab codes, mutations
• e.g. microRNA: let-?\d+.* mirn?a?-?\d+.*
Chemistry
Fielded Search
• Restrict within particular regions of a document, including nested e.g. table cell in table in Description
High Throughput
• Simultaneous processing of large numbers of items e.g. 500 compounds, 500 genes from microarray experiment, etc.
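The two microRNA patterns quoted on this slide can be tried out directly. The sketch below applies them token by token; the case-insensitive flag, the `find_micrornas` helper and the sample sentence are assumptions added for illustration.

```python
import re

# The two patterns from the slide, combined into one alternation.
# Matching case-insensitively is an assumption, not stated on the slide.
MICRORNA = re.compile(r"(?:let-?\d+.*|mirn?a?-?\d+.*)", re.IGNORECASE)


def find_micrornas(text):
    """Return whitespace-separated tokens that fully match a microRNA pattern."""
    tokens = [token.strip(".,;:()") for token in text.split()]
    return [token for token in tokens if MICRORNA.fullmatch(token)]


# Hypothetical sample sentence.
sample = "Expression of let-7a, miR-21 and MIRN155 changed; the letter and mirror controls did not."
found = find_micrornas(sample)
print(found)
```

Requiring a digit after the prefix is what keeps ordinary words such as "letter" and "mirror" out of the results.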
Whatever the Content...
Scientific literature, social media, patents, news feeds, EHRs, internal reports, drug labels, clinical trials ...
Visualisations from Unstructured Text
Identify, Extract, Synthesize, Analyze
• Pie charts for drill down
• Dashboards with up-to-date information
• Trending over time
• Interaction networks
• Mind maps with clustering via facts
• Clustered results table
Gene-disease mapping
Target ID/selection
Mutation/expression analysis
Toxicity analysis and prediction
Biomarker discovery
Drug repurposing
Patent analysis
KOL identification
Opportunity scouting
Trial site selection and study design
Safety
Competitive intelligence
Pharmacovigilance
Social media analysis
Comparative Effectiveness
Regulatory Submission QC
HEOR
SAR
Solutions & Applications in Life Sciences
Text-mining in Life Sciences
Advanced text analytics delivers value along the pipeline
Text-mining in Healthcare
Reusable queries deliver value in multiple healthcare workflows
[Diagram labels: Electronic Health Record (pathology, radiology, initial assessment, discharge, check up); structured data; Enterprise Data Warehouse; ClinicalTrials.gov; FDA drug labels; scientific literature; clinical case histories and/or genomic interpretation; patient characteristics; patient lists; care gap models; potential adverse drug reactions; matching clinical trials]
Social Media is different!
• Use of Natural Language Processing (NLP) provides precise
analysis of otherwise noisy data
• Tapping a growing source of information to allow:
– early warning
– non-intrusive gathering of information without need for surveys etc.
– minimal cost of data collection
– discovery of key opinion leaders / sites, distinct populations
– tracking of communication flow
Issues with Mining Twitter
• Noise
– Nature of Twitter
• Similar information
– Saying the same thing with different words
– Retweets
• Spam
– Deliberate subversion/distraction
• Search
– Keyword search brings back a lot of irrelevant information
– #hashtags become overloaded
Analysis of Language & Constructions in Twitter
• Vocabulary
– Informal and shortened forms of words
• “u”, “ur”, “gonna”, “gotta”, “wanna”, “yall”, “ain't”
– Differs from scientific or news text, but predictable
– Can use I2E for a data-driven approach to generate the vocabulary
• Grammar
– Informal, but surprisingly grammatical
• Twitterisms
– Abbreviated URLs e.g. bit.ly
– Conventions to mark topics (#tags), whether the Tweet is a retweet (RT), or usernames (@tags)
– Need to include looking for # and @ tags as well as conventional organisation names e.g.
• @oxfamnz
• @oxfamireland
• #Oxfam
• @oxfam_de
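The Twitter conventions above lend themselves to simple pattern matching. The following sketch pulls out hashtags, mentions and organisation references and expands a few informal words. It is a minimal sketch: the sample tweet, the expansions in `INFORMAL` and all function names are assumptions for illustration, and it does not represent how I2E handles tweets.

```python
import re

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
# Matches "Oxfam" as a plain word, a #tag or an @tag, including suffixed handles.
ORGANISATION = re.compile(r"[#@]?\boxfam\w*", re.IGNORECASE)

# Assumed expansions for some informal forms listed on the slide.
INFORMAL = {
    "u": "you",
    "gonna": "going to",
    "gotta": "got to",
    "wanna": "want to",
    "yall": "you all",
}


def expand_informal(text):
    """Replace known informal words with their longer forms."""
    def swap(match):
        word = match.group(0)
        return INFORMAL.get(word.lower(), word)
    return re.sub(r"\b\w+\b", swap, text)


def parse_tweet(tweet):
    """Collect the Twitter-specific features of a single tweet."""
    return {
        "is_retweet": tweet.startswith("RT "),
        "hashtags": HASHTAG.findall(tweet),
        "mentions": MENTION.findall(tweet),
        "organisations": ORGANISATION.findall(tweet),
        "expanded": expand_informal(tweet),
    }


# Hypothetical tweet built from the handles on the slide.
tweet = "RT @oxfamnz: u gotta see what #Oxfam and @oxfam_de are doing"
parsed = parse_tweet(tweet)
print(parsed)
```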
Terminologies and Ontologies #1
• Different ways of saying the same thing
– I have the flu
– I have H1N1
– Getting swine flu
– Got a dose of the swine flu
– Got the dreaded flu
– I feel the swineflu comin
– I HAVE SWINE FLUUUUU
– i have the pig flu
– I'm in bed with swine flu
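To make the point concrete, a single pattern can cover all nine phrasings listed above. The sketch below is an illustration only; the `FLU_MENTION` pattern and helper names are assumptions, and treating plain "flu" as the same concept as swine flu follows the grouping on the slide rather than any medical judgement.

```python
import re

# Covers "swine flu", "swineflu", "pig flu", "H1N1", "flu" and stretched
# spellings such as "FLUUUUU".
FLU_MENTION = re.compile(r"\b(?:(?:swine|pig)\s*flu+|h1n1|flu+)\b", re.IGNORECASE)

TWEETS = [
    "I have the flu",
    "I have H1N1",
    "Getting swine flu",
    "Got a dose of the swine flu",
    "Got the dreaded flu",
    "I feel the swineflu comin",
    "I HAVE SWINE FLUUUUU",
    "i have the pig flu",
    "I'm in bed with swine flu",
]


def matched_phrase(text):
    """Return the first flu mention found in the text, or None."""
    match = FLU_MENTION.search(text)
    return match.group(0) if match else None


def mentions_flu(text):
    return matched_phrase(text) is not None


for tweet in TWEETS:
    print(repr(matched_phrase(tweet)), "<-", tweet)
```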
Terminologies and Ontologies #2
• Can still leverage same tools:
– Domain knowledge to search for concepts and classes, not just keywords
• E.g. organisations, places, numerical data
– Terminology discovery - data driven approach
• Use NLP to see what words are actually used
• Bootstrap from any existing vocabulary
• Use precise linguistic patterns and wildcards to find new vocabulary
• Use substrings/regular expressions to pick up variation in ways to refer to the same
organization
NLP for Tweets
• Find and extract patterns, not just keywords
• Capture the 1000s of ways people say the same thing
• Pick up the subtleties e.g. “don’t like” or “looks like” vs. “do like”
• Exclude confounding sentences so they are not counted as positive statements
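The difference between “don’t like”, “looks like” and “do like” can be sketched with three ordered patterns. This is a hedged toy example: the patterns, labels and the `classify_like` function are assumptions for illustration, and real tweets need much broader coverage.

```python
import re

NEGATED = re.compile(
    r"\b(?:don't|do not|doesn't|does not|didn't|did not)\s+(?:really\s+)?like\b"
)
# "like" used for comparison rather than opinion.
CONFOUNDING = re.compile(
    r"\b(?:looks?|looked|feels?|felt|sounds?|sounded|seems?|seemed)\s+like\b"
)
POSITIVE = re.compile(r"\b(?:i|we|they|you)\s+(?:do\s+|really\s+)*like\b")


def classify_like(text):
    """Classify the use of "like" in a sentence; None if no pattern applies."""
    normalised = text.lower().replace("’", "'")
    if NEGATED.search(normalised):
        return "negative"
    if CONFOUNDING.search(normalised):
        return "not an opinion"
    if POSITIVE.search(normalised):
        return "positive"
    return None


# Hypothetical sentences.
for sentence in [
    "I don’t like this inhaler",
    "It looks like aspirin",
    "I do like the new inhaler",
]:
    print(classify_like(sentence), "<-", sentence)
```

Checking the negated and confounding patterns first is what stops them from being counted as positive statements.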
Text Mining for Pharma R&D
scientific achievements and legal conundrum
Luca Toldo, Associate Director, Information Services, Merck KGaA, Darmstadt
/in/toldo
Multiple Sclerosis - bridge clinical observations and
published scientific knowledge using ontologies
http://dx.doi.org/10.1371/journal.pone.0116718
Alzheimer - answer questions automatically
http://www.clef-initiative.eu/documents/71612/c1c82df0-f1cd-453e-9a08-8740becd04a3
Which medical disorder first described in 1866
can increase the risk of developing Alzheimer's
disease?
APOE-e2
APOE-e3
APOE-e4
Down's syndrome
Parkinson's disease
... using sentence splitting, stemming, and information retrieval techniques:
• GENIA sentence splitter
• Krovetz stemming
• Indri (lemurproject.org)
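The general idea of combining sentence splitting, stemming and retrieval to answer a multiple-choice question can be shown with a toy example. Everything below is a stand-in under stated assumptions: a regular expression replaces the GENIA sentence splitter, crude suffix stripping replaces Krovetz stemming, term overlap replaces Indri ranking, and the four-sentence corpus is hypothetical. It is not the system described in the linked paper.

```python
import re

STOPWORDS = {"which", "in", "can", "the", "of", "a", "an", "is", "was", "that", "with", "have"}


def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lower-case, tokenize, drop stopwords and one-letter tokens, then stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {stem(w) for w in words if len(w) > 1 and w not in STOPWORDS}


def split_sentences(text):
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]


def rank_answers(question, candidates, corpus):
    """Score each candidate by question-term overlap with sentences mentioning it."""
    question_terms = terms(question)
    sentences = split_sentences(corpus)
    scores = {}
    for candidate in candidates:
        evidence = set()
        for sentence in sentences:
            if candidate.lower() in sentence.lower():
                evidence |= terms(sentence)
        scores[candidate] = len(question_terms & evidence)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


QUESTION = (
    "Which medical disorder first described in 1866 can increase the risk "
    "of developing Alzheimer's disease?"
)
CANDIDATES = ["APOE-e2", "APOE-e3", "APOE-e4", "Down's syndrome", "Parkinson's disease"]
# Hypothetical background text for illustration.
CORPUS = (
    "Down's syndrome was first described in 1866. "
    "People with Down's syndrome have an increased risk of developing Alzheimer's disease. "
    "Parkinson's disease was first described in 1817. "
    "APOE-e4 is a gene variant that raises the risk of Alzheimer's disease."
)

ranked = rank_answers(QUESTION, CANDIDATES, CORPUS)
for answer, score in ranked:
    print(score, answer)
```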
Biomarker discovery
http://dx.doi.org/10.1186/1472-6947-12-148
Increase efficiency in pharmacovigilance through automatic
sentence identification.
Result: POS -- 82% Precision; 70% Recall
NEG -- 93% Precision; 96% Recall
http://www.cs.gmu.edu/~hrangwal/kd-hcm/proc/papers/2-Gurulingappa_et_al.pdf
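Precision and recall are often combined into a single F1 score, the harmonic mean of the two. The slide does not report F1, so the values computed below are derived here from the rounded percentages above and may differ slightly from any figure in the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide.
results = {"POS": (0.82, 0.70), "NEG": (0.93, 0.96)}
f1 = {label: f1_score(p, r) for label, (p, r) in results.items()}

for label, value in f1.items():
    print(f"{label}: F1 = {value:.3f}")
```

The lower recall on the POS class pulls its F1 well below that of the NEG class, which is typical when positive sentences are the rarer class.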
Pharmacovigilance - predict drug label changes
http://dx.doi.org/10.1002/pds.3493
Up to 76% of drug label changes could be predicted through data mining methods using publicly available structured data.
The Peregrine-JSRE hybrid system uniquely detected four adverse drug events that were not found in the other databases.
(some of) the conundrums ... when
dealing with social text mining
• Copyright
• Data privacy
• Regulations
• Ethics
• Civil laws
• Penal laws
Knowledge for Life: a practical view on medical text mining.
http://www.sciencedaily.com/releases/2012/09/120921111034.htm
WP2B - Analytics
CR from Social Media: EudraVigilance feeds MAH !
https://youtu.be/1own4pxICIk
Text mining for Pharma R&D
• is a mature methodology, with scalable technologies
• delivers added value across the whole value chain
• is easily adaptable to any kind of textual data
• increases the efficiency of knowledge workers
• enables data-driven decision making from unstructured data
• bridges lay and scientific language, using ontologies and linguistics
• Web-RADR deals with pharmacovigilance on social media
Pistoia Alliance Spring Conference
at HP’s Zurich campus, Switzerland, 14th April 2015
http://pistoia-spring-2015.eventbrite.com/
@pistoiaalliance #pistoia2015
http://my.yapp.us/PISTOIAEUR15
Is consumerisation changing IT?
Join us for the next Pistoia Alliance Debates webinar,
Wednesday 29th April @ 3-4pm UK
https://attendee.gotowebinar.com/register/4629369829010843393