Pistoia Alliance Debates: Text Mining for Pharma R&D in a Social World (17th March 2015)

Text-mining for pharma R&D in a social world

a Pistoia Alliance Debates webinar

Tuesday March 17th, 2015 @ 3-4pm UK

chaired by Veit Ulshoefer

This webinar is being recorded

© Pistoia Alliance

Chair and Panelists

David Milward

Chief technology officer (CTO) at Linguamatics. He is a pioneer of interactive text mining, and a founder of Linguamatics. He has over 20 years' experience of product development, consultancy and research in natural language processing (NLP). After receiving a PhD from the University of Cambridge, he was a researcher and lecturer at the University of Edinburgh. He has published in the areas of information extraction, spoken dialogue, parsing, syntax and semantics.

Jane Reed

Head of life science strategy at Linguamatics. She is responsible for developing the strategic vision for Linguamatics’ growing product portfolio and business development in the life science domain. Jane has extensive experience in life sciences informatics. She worked for more than 15 years in vendor companies supplying data products, data integration and analysis, and consultancy to pharma and biotech, with roles at Instem, BioWisdom, Incyte, and Hexagen. Before moving into the life science industry, Jane worked in academia with post-docs in genetics and genomics.

Luca Toldo

Associate Director Information Services at Merck KGaA.

Gordon Baxter

Chief Scientific Officer at Instem plc. Has been both a customer (in senior R&D roles in Pharma) and a vendor (in senior roles at Pharmagene, BioWisdom and now Instem) of IT solutions targeting numerous points in the R&D continuum. Board member of Pistoia Alliance. Keen interest in Translational Informatics: finding value in bringing data together from research, development and medical practice over 20 years. PhD from University of Bradford, UK.


Text-mining for pharma R&D in a social world

17th March 2015

Dr. Jane Reed, Head of Life Science Strategy, Linguamatics


What information do we need?

• What targets are involved in bone cancer?

• Which companies are patenting a particular technology?

• How are people comparing my product with others?

• What are the safety risks of my product compared to others in the same class?

• What are common factors shared by patients requiring rehospitalisation?

• What other diseases could my drug treat?


Challenges

• Most of the answers to these questions are in free text documents

• Ever-increasing amounts of text data to examine

– Different kinds of documents

• External literature, patents, news, internal reports, blogs, presentations

– Different formats

• HTML, PDF, XML, Word, PPT, Wiki

[Chart: PubMed Records, axis from 0 to 25,000,000]


Search Engines – keywords



Breast Cancer



All these documents contain the keywords ‘breast cancer’.

You have to read ALL of each document to find the part that is relevant to you.


Issues with Keyword Search

• Can pull back hundreds or thousands of hits

• Can retrieve noisy or irrelevant hits

• May not retrieve all the relevant hits, depending on the keywords used

• Difficult to ask “open” questions or pull out connections


What is Text Mining?


Interpret Meaning, Identify & Extract


• Facts

• Relationships

• Assertions


Text mining vs. keyword search?

Example: What genes affect breast cancer?


Linguistic Processing Using NLP

• Interprets meaning of the text

• Groups words into meaningful units

• Search for different forms of words


We find that germline BRCA1 mutations are seen in early-onset breast cancer patients.

BRCA1 gene mutations have been found in ca. 50% of hereditary breast cancers.

[Figure: the two example sentences annotated in successive layers]

• sentences

• noun groups match entities

• verb groups match actions

• morphology - different forms

Semantics

• Finding meaning rather than “surface” word

• Use concepts, e.g. “breast cancer”, to pick up different ways the concept might be expressed (synonyms)

– e.g. “breast neoplasm”, “breast tumour”

• Disambiguate cases where one term could mean several concepts

– e.g. NLP: Natural Language Processing, Neuro-Linguistic Programming
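The synonym idea on this slide can be illustrated with a minimal Python sketch. It assumes a tiny hand-written terminology holding only the slide's three surface forms; the names `TERMINOLOGY` and `find_concepts` are hypothetical, and a real deployment would load a curated vocabulary. Disambiguation of terms such as "NLP" is not covered here.

```python
import re

# Mini-terminology built from the slide's example (an assumption for
# illustration); a real system would load a curated vocabulary instead.
TERMINOLOGY = {
    "breast cancer": ["breast cancer", "breast neoplasm", "breast tumour"],
}


def find_concepts(text, terminology=TERMINOLOGY):
    """Return the sorted concept labels whose synonyms occur in the text."""
    found = set()
    for concept, synonyms in terminology.items():
        for synonym in synonyms:
            # \b keeps matches on word boundaries; "s?" covers simple plurals.
            pattern = r"\b" + re.escape(synonym) + r"s?\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(concept)
                break
    return sorted(found)
```

A document mentioning only "breast neoplasm" is then returned for the concept "breast cancer", which a plain keyword search for "breast cancer" would miss.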


Semantics

• Find the same relationship however expressed e.g.

– “Statins treat high cholesterol”

– “High cholesterol is treated by statins”

– “Treatment of high cholesterol by statins”

• Provide results in a more standardized, semantic representation

– Better clustering of results

– Better statistics

– Connect results from text mining with other databases
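As a rough illustration of mapping the three statin phrasings to one standardized representation, the Python sketch below uses hand-written regular expressions. This is a deliberate simplification and an assumption of this example: real NLP systems rely on linguistic analysis rather than string patterns, and the names `PATTERNS` and `extract_treats` are hypothetical.

```python
import re

# One pattern per phrasing on the slide; the named groups "drug" and
# "disease" mark where each argument appears in that phrasing.
PATTERNS = [
    # active: "Statins treat high cholesterol"
    re.compile(r"^(?P<drug>.+?) treats? (?P<disease>.+)$", re.IGNORECASE),
    # passive: "High cholesterol is treated by statins"
    re.compile(r"^(?P<disease>.+?) (?:is|are) treated by (?P<drug>.+)$", re.IGNORECASE),
    # nominal: "Treatment of high cholesterol by statins"
    re.compile(r"^treatment of (?P<disease>.+?) (?:by|with) (?P<drug>.+)$", re.IGNORECASE),
]


def extract_treats(sentence):
    """Return a normalized (drug, 'treats', disease) triple, or None."""
    cleaned = sentence.strip().rstrip(".")
    for pattern in PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return (match.group("drug").lower(), "treats", match.group("disease").lower())
    return None
```

All three phrasings yield the same triple, which is what makes clustering, counting and joining with other databases straightforward.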


From Words to Meaning

“Among them, nimesulide, a selective COX2 inhibitor, …”

Identifying entities and relations: nimesulide inhibits COX2 (Entrez Gene ID: 5743)

Linguistics to establish relationships


Toolbox of Methods

• NLP

– Precise linguistic relationships, sentence co-occurrence

– Precise negation e.g. “pressure” but not “blood pressure”

• Terminologies

– Search for concepts and classes, not just keywords

– e.g. cancer and get synonyms and children: Malignant neoplasms, Malignant tumor …

• Regular Expressions

– Rule based pattern matching for e.g. measurements, lab codes, mutations

– e.g. microRNA: let-?\d+.* mirn?a?-?\d+.*

• Chemistry

• Fielded Search

– Restrict within particular regions of a document, including nested e.g. table cell in table in Description

• High Throughput

– Simultaneous processing of large numbers of items e.g. 500 compounds, 500 genes from microarray experiment, etc.
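The microRNA patterns listed under Regular Expressions can be tried directly in Python. In this sketch the two patterns are combined into one expression, and the trailing `.*` is narrowed to `[\w-]*` so that a match stops at the end of the token; that narrowing, the case-insensitive flag and the name `find_micrornas` are assumptions of this example, not part of the slide.

```python
import re

# The slide's two patterns (let-?\d+ and mirn?a?-?\d+) combined; the suffix
# [\w-]* replaces ".*" so each match covers a single token only.
MICRORNA = re.compile(r"\b(?:let-?\d+|mirn?a?-?\d+)[\w-]*", re.IGNORECASE)


def find_micrornas(text):
    """Return microRNA-like tokens in order of appearance."""
    return MICRORNA.findall(text)
```

For example, `find_micrornas("Levels of let-7a and miR-21 rose, while MIRN155 fell.")` picks out the three microRNA names and ignores the rest of the sentence.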


Whatever the Content...


Scientific literature, social media, patents, news feeds, EHRs, internal reports, drug labels, clinical trials ...


Visualisations from Unstructured Text

Identify, Extract, Synthesize, Analyze

• Pie charts for drill down

• Dashboards with up-to-date information

• Trending over time

• Interaction networks

• Mind maps with clustering via facts

• Clustered results table


Text-mining in Life Sciences

Advanced text analytics delivers value along the pipeline

Solutions & Applications in Life Sciences

• Gene-disease mapping

• Target ID/selection

• Mutation/expression analysis

• Toxicity analysis and prediction

• Biomarker discovery

• Drug repurposing

• Patent analysis

• KOL identification

• Opportunity scouting

• Trial site selection and study design

• Safety

• Competitive intelligence

• Pharmacovigilance

• Social media analysis

• Comparative Effectiveness

• Regulatory Submission QC

• HEOR

• SAR


Text-mining in Healthcare

Reusable queries deliver value in multiple healthcare workflows

[Diagram. Labels: Care gap models; Pathology, radiology, initial assessment, discharge, check up; Structured data; Patient characteristics; Potential adverse drug reactions; ClinicalTrials.gov; Matching clinical trials; Clinical case histories and/or genomic interpretation; Electronic Health Record; Enterprise Data Warehouse; Patient lists; FDA drug labels; Scientific literature]


Text mining for Social Media

Specific technical issues

Jane Z Reed


Social Media is different!

• Use of Natural Language Processing (NLP) provides precise analysis of otherwise noisy data

• Tapping a growing source of information to allow:

– early warning

– non-intrusive gathering of information without need for surveys etc.

– minimal cost of data collection

– discovery of key opinion leaders / sites, distinct populations

– tracking of communication flow


Issues with Mining Twitter

• Noise

– Nature of Twitter

• Similar information

– Saying the same thing with different words

– Retweets

• Spam

– Deliberate subversion/distraction

• Search

– Keyword search brings back a lot of irrelevant information

– #hashtags become overloaded
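The retweet part of the "similar information" problem can be reduced by normalising tweets before comparing them. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the retweet prefix is assumed to look like "RT @user:", and only tweets that become identical after normalisation are collapsed. Paraphrases ("saying the same thing with different words") need the concept-level methods instead.

```python
import re


def normalize(tweet):
    """Reduce a tweet to a comparison key."""
    text = tweet.lower()
    text = re.sub(r"^rt\s+@\w+:?\s*", "", text)  # drop a leading retweet marker
    text = re.sub(r"https?://\S+", "", text)     # drop URLs
    text = re.sub(r"[^a-z0-9#@\s]", "", text)    # drop punctuation, keep # and @
    return " ".join(text.split())                # collapse whitespace


def deduplicate(tweets):
    """Keep the first tweet for each normalised key, preserving order."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize(tweet)
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique
```

Keeping `#` and `@` in the key is a design choice: two tweets that differ only in hashtag or mention may belong to different conversations.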


Analysis of Language & Constructions in Twitter

• Vocabulary

– Informal and shortened forms of words

• “u”, “ur”, “gonna”, “gotta”, “wanna”, “yall”, “ain't”

– Differs from scientific or news text, but predictable

– Can use I2E for a data-driven approach to generate the vocabulary

• Grammar

– Informal, but surprisingly grammatical

• Twitterisms

– Abbreviated URLs e.g. bit.ly

– Conventions to mark topics (#tags), whether the Tweet is a retweet (RT), or usernames (@tags)

– Need to look for # and @ tags as well as conventional organisation names, e.g.

• @oxfamnz

• @oxfamireland

• #Oxfam

• @oxfam_de


Terminologies and Ontologies #1

• Different ways of saying the same thing

– I have the flu

– I have H1N1

– Getting swine flu

– Got a dose of the swine flu

– Got the dreaded flu

– I feel the swineflu comin

– I HAVE SWINE FLUUUUU

– i have the pig flu

– I'm in bed with swine flu
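The variants above can all be caught by a single pattern once spelling variation is allowed for. The Python sketch below is one hedged way to do it: the pattern and the name `mentions_flu` are assumptions of this example, and treating "flu", "swine flu", "pig flu" and "H1N1" as one concept is a simplification made only for illustration.

```python
import re

# Optional "swine"/"pig" prefix (with or without a space), "flu" with any
# number of trailing u's, or the strain name H1N1.
FLU = re.compile(r"\b(?:(?:swine|pig)\s*)?flu+\b|\bh1n1\b", re.IGNORECASE)


def mentions_flu(tweet):
    """True if the tweet contains one of the flu variants."""
    return FLU.search(tweet) is not None
```

The word boundaries keep unrelated words such as "flute" or "influenza-like" from matching on the "flu" substring alone.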


Terminologies and Ontologies #2

• Can still leverage same tools:

– Domain knowledge to search for concepts and classes, not just keywords

• E.g. organisations, places, numerical data

– Terminology discovery - data driven approach

• Use NLP to see what words are actually used

• Bootstrap from any existing vocabulary

• Use precise linguistic patterns and wildcards to find new vocabulary

• Use substrings/regular expressions to pick up variation in ways to refer to the same organization
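Data-driven terminology discovery can be sketched as a pattern with a wildcard slot whose fillers are counted. The Python example below looks at which single word appears directly before "flu"; the pattern and the name `discover_modifiers` are hypothetical, and counting raw words is a stand-in for the precise linguistic patterns the slide refers to.

```python
import re
from collections import Counter

# Wildcard slot: any single word directly before "flu" (elongated forms allowed).
SLOT = re.compile(r"\b(\w+)\s+flu+\b", re.IGNORECASE)


def discover_modifiers(tweets):
    """Count the words that fill the wildcard slot before 'flu'."""
    counts = Counter()
    for tweet in tweets:
        for word in SLOT.findall(tweet):
            counts[word.lower()] += 1
    return counts
```

Run over a collection of tweets, the most frequent fillers (for instance "swine" or "pig") are candidates for the vocabulary, while function words such as "the" would be filtered out by a reviewer or a stop-word list.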


NLP for Tweets

• Find and extract patterns, not just keywords

• Capturing the 1000s of ways people say the same thing

• Pick up the subtleties, e.g. “don’t like” or “looks like” vs. “do like”

• Exclude confounding sentences so they are not counted as positive statements
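The “don’t like” / “looks like” / “do like” distinction can be illustrated with a small rule cascade in Python. This is a sketch under stated assumptions: the three patterns, the category labels and the name `classify_like` are hypothetical, and real systems use fuller linguistic analysis than these rules.

```python
import re

# Checked in order: negation first, then comparison uses of "like",
# and only then is a bare "like" treated as a positive statement.
NEGATED = re.compile(r"\b(?:do(?:es)?n['’]?t|do(?:es)? not|never)\s+like\b", re.IGNORECASE)
COMPARISON = re.compile(r"\b(?:looks?|feels?|sounds?|seems?)\s+like\b", re.IGNORECASE)
POSITIVE = re.compile(r"\blike\b", re.IGNORECASE)


def classify_like(tweet):
    """Label how the word 'like' is used in a tweet."""
    if NEGATED.search(tweet):
        return "negative"
    if COMPARISON.search(tweet):
        return "not an opinion"
    if POSITIVE.search(tweet):
        return "positive"
    return "no match"
```

A keyword search for "like" would count all three kinds of tweet as positive; ordering the rules from most to least specific is what excludes the confounding cases.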


Text Mining for Pharma R&D

scientific achievements and legal conundrum

Luca Toldo, Associate Director, Information Services, Merck KGaA, Darmstadt

/in/toldo


Multiple Sclerosis - bridge clinical observations and published scientific knowledge using ontologies

http://dx.doi.org/10.1371/journal.pone.0116718


Alzheimer - answer questions automatically

http://www.clef-initiative.eu/documents/71612/c1c82df0-f1cd-453e-9a08-8740becd04a3

Which medical disorder first described in 1866 can increase the risk of developing Alzheimer's disease?

• APOE-e2

• APOE-e3

• APOE-e4

• Down's syndrome

• Parkinson's disease

... using sentence splitting, stemming, and Information retrieval techniques:

• GENIA sentence splitter

• Krovetz stemming

• Indri (lemurproject.org)
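The three steps named above (sentence splitting, stemming, retrieval) can be sketched end to end in plain Python. Everything here is a stand-in and an assumption of this example: a regex splitter instead of the GENIA sentence splitter, crude suffix stripping instead of Krovetz stemming, term overlap instead of Indri ranking, and a four-sentence illustrative corpus written for this sketch rather than the evaluation data.

```python
import re

STOP = {"the", "of", "in", "a", "an", "s", "was", "is", "by", "to",
        "with", "which", "can", "have"}


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def stem(word):
    # Crude suffix stripping; a stand-in, not Krovetz stemming.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text):
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {stem(w) for w in words if w not in STOP}


def answer(question, candidates, corpus):
    """Pick the candidate whose supporting sentences best overlap the question."""
    question_stems = stems(question)
    sentences = split_sentences(corpus)
    scores = {}
    for candidate in candidates:
        candidate_stems = stems(candidate)
        # Retrieve sentences mentioning the candidate, then sum their
        # term overlap with the question.
        scores[candidate] = sum(
            len(question_stems & stems(sentence))
            for sentence in sentences
            if candidate_stems <= stems(sentence)
        )
    return max(scores, key=scores.get)


# Illustrative corpus (an assumption of this sketch, not the CLEF data).
CORPUS = (
    "Down's syndrome was first described by John Langdon Down in 1866. "
    "People with Down's syndrome have an increased risk of developing Alzheimer's disease. "
    "Parkinson's disease was first described by James Parkinson in 1817. "
    "APOE-e4 is a gene variant linked to a higher risk of Alzheimer's disease."
)
QUESTION = ("Which medical disorder first described in 1866 can increase "
            "the risk of developing Alzheimer's disease?")
CANDIDATES = ["APOE-e2", "APOE-e3", "APOE-e4", "Down's syndrome", "Parkinson's disease"]

best = answer(QUESTION, CANDIDATES, CORPUS)
```

On this toy corpus the evidence for Down's syndrome is spread over two sentences, one carrying the date and one carrying the Alzheimer's link, which is why scores are summed over all sentences that mention a candidate.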


Biomarker discovery

http://dx.doi.org/10.1186/1472-6947-12-148


Increase efficiency in pharmacovigilance through automatic sentence identification.

Result: POS -- 82% Precision; 70% Recall

NEG -- 93% Precision; 96% Recall

http://www.cs.gmu.edu/~hrangwal/kd-hcm/proc/papers/2-Gurulingappa_et_al.pdf
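Precision and recall are often combined into a single F1 score, the harmonic mean of the two. The Python sketch below derives F1 from the reported figures; note that these F1 values are computed here for illustration and are not reported on the slide, and the names `f1_score` and `reported` are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide, as fractions.
reported = {"POS": (0.82, 0.70), "NEG": (0.93, 0.96)}

# Derived values, rounded to three decimals.
f1 = {label: round(f1_score(p, r), 3) for label, (p, r) in reported.items()}
```

The derived score is lower for the POS class than for the NEG class, reflecting the lower recall on positive sentences.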


Pharmacovigilance - predict drug label changes

http://dx.doi.org/10.1002/pds.3493

Up to 76% of drug label changes could be predicted through data mining methods using publicly available structured data.

The Peregrine-JSRE hybrid system uniquely detected four adverse drug events that were not found in the other databases.


(some of) the conundrums ... when dealing with social text mining

• Copyright

• Data privacy

• Regulations

• Ethics

• Civil Laws

• Penal laws


Social Media and pharmacovigilance


Knowledge for Life: a practical view on medical text mining.

http://www.sciencedaily.com/releases/2012/09/120921111034.htm


WP2B - Analytics



CR from Social Media: EudraVigilance feeds MAH !


https://youtu.be/1own4pxICIk


Text mining for Pharma R&D

• is a mature methodology, with scalable technologies

• delivers added value across the whole value chain

• is easily adaptable to any kind of textual data

• increases the efficiency of knowledge workers

• enables data-driven decision making from unstructured data

• using ontologies and linguistics, bridges layman and science

• Web-RADR deals with pharmacovigilance on social media


Panel discussion

Audience can ask questions in the following Q&A session

Audience Q&A

Please use the chat / question / hand-raise functions in GoToWebinar

Pistoia Alliance Spring Conference

at HP’s Zurich campus, Switzerland, 14th April 2015

http://pistoia-spring-2015.eventbrite.com/

@pistoiaalliance #pistoia2015

http://my.yapp.us/PISTOIAEUR15

Is consumerisation changing IT?

Join us for the next Pistoia Alliance Debates webinar, Wednesday 29th April @ 3-4pm UK

https://attendee.gotowebinar.com/register/4629369829010843393

@pistoiaalliance www.pistoiaalliance.org

Thank you for attending
