
How Scientists Read, How Computers Read, and What We Should Do

Date post: 10-May-2015
Upload: anita-de-waard
Description:
Talk at ALPSP 2012 in the text mining session
Page 1: How Scientists Read, How Computers Read, and What We Should Do

How Scientists Read, How Computers Read,

and What We Should Do

Anita de Waard
Disruptive Technologies Director

Elsevier Labs

(= not what it says in the abstract!)


Outline

1. How do scientists read?
2. How do computers read?
3. What should we do?



How we read

• Letter < syllable < word < clause < sentence < discourse:

This is how linguistics is structured. But it is not how we understand text!



Scientists read:

• Why do scientists read?
– They want to ingest knowledge:
– read, integrate with their current knowledge

• What do scientists read?
– Things that are ‘interesting’:
– Pertinent (within their ‘shell of interest’)
– Possibly or probably true
– Novel, but in agreement with what we know


What is this paper about? NOUN PHRASES

human breast cancer
noninvasive MCF7-Ras
antisense oligonucleotides
high-grade malignancy
cell viability
retroviral vector
miR-31
cloned
transiently expressed miRNA sponges

Is it pertinent? -> Possibly…
Is it true? -> ?
Is it new, but in agreement with what I know? -> ?


What is this paper about? TRIPLES

miR-31 PREVENT acquisition of aggressive traits
miR-31 INHIBIT noninvasive MCF7-Ras cells
miR-31 ENHANCE invasion
cell viability AFFECT inhibitor
miR-31 expression DEPRIVE metastatic cells

Is it pertinent? -> Possibly…
Is it true? -> ?
Is it new, but in agreement with what I know? -> ?


What is this paper about? METADISCOURSE

The preceding observations demonstrated that X expression deprives Y cells of attributes associated with Z. We next asked whether X also prevents the acquisition of A traits by B cells. To do so, we transiently inhibited X in C cells with either D or E. Both approaches inhibited X function by > 4.5-fold (Figure S7A). Suppression of X enhanced invasion by 20-fold and motility by 5-fold, but F was unaffected by either inhibitor (Figure 3A; Figure S7B). The E sponge reduced X function by 2.5-fold, but did not affect the activity of other known Js (Figures S8A and S8B). Collectively, these data indicated that sustained X activity is necessary to prevent the acquisition of Z traits by both K and untransformed B cells.

Is it pertinent? -> Need content
Is it true? -> Sounds likely! I know this stuff!
Is it new, but in agreement with what I know? -> Need content


What is this paper about? CLAIMS AND EVIDENCE

Claim:
• Sustained miR-31 activity is necessary to prevent the acquisition of aggressive traits by both tumor cells and untransformed breast epithelial cells.

Evidence: Method:
• We transiently inhibited miR-31 in noninvasive MCF7-Ras cells with either antisense oligonucleotides or miRNA sponges.

Evidence: Result:
• Both approaches inhibited miR-31 function by >4.5-fold (Figure S7A).
• Suppression of miR-31 enhanced invasion by 20-fold and motility by 5-fold, but cell viability was unaffected by either inhibitor (Figure 3A; Figure S7B).
• The miR-31 sponge reduced miR-31 function by 2.5-fold, but did not affect the activity of other known antimetastatic miRNAs (Figures S8A and S8B).

Is it pertinent? -> Probably
Is it true? -> Sounds likely!
Is it new, but in agreement with what I know? -> Check/know


What is this paper about? DATA

Is it pertinent? -> Need content
Is it true? -> Need methods
Is it new, but in agreement with what I know? -> Check/know


Is it pertinent? -> Possibly
Is it true? -> Probably!
Is it new, but in agreement with what I know? -> Need background

What is this paper about? METADATA


How scientists read:

Representation Pertinence Truth Fit with knowledge

Noun phrases x
Triples x
Metadiscourse x
Claims and evidence x x x
Data x x x
Metadata x

Text mining

Data-centric science

Publishing


Outline

1. How do scientists read?
2. How do computers read?
3. What should we do?


Noun Phrases: some issues

• Problem 1: disambiguating terms (© GoPubMed):
– Hnrpa1 = Tis = Fli-2 = nuclear ribonucleoprotein A1 = helix destabilizing protein = single-strand binding protein = hnRNP core protein A1 = HDP-1 = topoisomerase-inhibitor suppressed.
– Cellulose 1,4-beta-cellobiosidase = exoglucanase
– COLD ≠ C.O.L.D. ≠ cold (runny nose) ≠ cold (low T)

• Problem 2: disambiguating entities (© M. Martone):
– 95 antibodies were (manually!) identified in 8 articles
– 52 did not contain enough information to determine the antibody used
– Some provided details in other papers
– Failed to give species, clonality, vendor, or catalog number
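The term problem above can be made concrete with a small sketch. It assumes a hand-built lookup table; the names `SYNONYMS`, `AMBIGUOUS` and `normalize` are hypothetical and do not describe how GoPubMed works. Synonyms collapse to one label, while an ambiguous string such as "cold" can only be returned as a list of candidates, because choosing between them needs context.

```python
# Toy synonym table: each surface form (lower-cased) maps to one canonical label.
# The entries are taken from the slide; the table itself is an illustration.
SYNONYMS = {
    "hnrpa1": "Hnrpa1",
    "tis": "Hnrpa1",
    "fli-2": "Hnrpa1",
    "hdp-1": "Hnrpa1",
    "helix destabilizing protein": "Hnrpa1",
    "exoglucanase": "Cellulose 1,4-beta-cellobiosidase",
}

# Ambiguous surface forms: one string, several unrelated meanings.
AMBIGUOUS = {
    "cold": ["COLD", "C.O.L.D.", "cold (runny nose)", "cold (low T)"],
}


def normalize(term):
    """Return the candidate meanings of a term (empty list if unknown)."""
    key = term.strip().lower()
    if key in AMBIGUOUS:
        return list(AMBIGUOUS[key])
    if key in SYNONYMS:
        return [SYNONYMS[key]]
    return []


print(normalize("Fli-2"))      # one candidate: the synonym resolves cleanly
print(len(normalize("COLD")))  # several candidates: context is needed to choose
```

A lookup like this handles the first two bullets of Problem 1 but not the third, which is why disambiguation is listed as an open issue.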


Noun Phrases: some progress

• Despite these difficulties, noun phrase recall/precision is quite high, e.g. I2B2 2011 [1], [2], others: 90%-98%
• Many tools, see [3] for a list; e.g. GoPubMed
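For readers less familiar with the recall/precision figures quoted above, the following minimal sketch shows how the two measures are computed for extracted noun phrases against a manually annotated gold standard. The function name and the two example phrase sets are hypothetical; they are not data from the cited evaluations.

```python
def precision_recall(extracted, gold):
    """Compute (precision, recall) for two collections of noun phrases."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted phrases that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard phrases that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative sets only: three phrases agree, one is spurious, one is missed.
gold = {"miR-31", "cell viability", "retroviral vector", "antisense oligonucleotides"}
extracted = {"miR-31", "cell viability", "retroviral vector", "cloned"}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.75
```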


Triples: some issues:

• Contingent on good NP & VP detection
• Hard to parse text! E.g. a commercial tool gave:
– Extracted: insulin maintaining glucose homeostasis
  Sentence: When insulin secretion cannot be increased adequately (type I diabetes defect) to overcome insulin resistance in maintaining glucose homeostasis, hyperglycemia and glucose intolerance ensues.
– Extracted: insulin may be involved glucose homeostasis
  Sentence: Because PANDER is expressed by pancreatic beta-cells and in response to glucose in a similar way to those of insulin, PANDER may be involved in glucose homeostasis.


Triples: some progress:

Biological Expression Language [4]:

We provide evidence that these miRNAs are potential novel oncogenes participating in the development of human testicular germ cell tumors by numbing the p53 pathway, thus allowing tumorigenic growth in the presence of wild-type p53.

Increased abundance of miR-372 decreases activity of TP53
r(MIR:miR-372) -| tscript(p(HUGO:Trp53))
Context: cancer
SET Disease = "Cancer"

Activity of TP53 decreases cell growth
tscript(p(HUGO:Trp53)) -| bp(GO:"Cell Growth")
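One benefit of a formal notation is that statements can be chained mechanically. The sketch below is a toy illustration, not OpenBEL tooling: it stores the two statements above as plain (subject, relation, object) tuples, uses a hypothetical `decreases` label in place of `-|`, and multiplies the signs along the chain to infer the net effect of miR-372 on cell growth.

```python
# Each statement is (subject, relation, object).
# "decreases" stands in for the -| operator; this is not a BEL parser.
STATEMENTS = [
    ("r(MIR:miR-372)", "decreases", "tscript(p(HUGO:Trp53))"),
    ("tscript(p(HUGO:Trp53))", "decreases", 'bp(GO:"Cell Growth")'),
]

SIGN = {"increases": 1, "decreases": -1}


def net_effect(source, target, statements, sign=1, seen=None):
    """Follow statements from source to target and combine their signs.

    Returns "increases", "decreases", or None when no path exists.
    Nodes already visited are skipped, so cycles cannot loop forever.
    """
    seen = (seen or set()) | {source}
    for subj, rel, obj in statements:
        if subj != source or obj in seen:
            continue
        new_sign = sign * SIGN[rel]
        if obj == target:
            return "increases" if new_sign > 0 else "decreases"
        result = net_effect(obj, target, statements, new_sign, seen)
        if result is not None:
            return result
    return None


print(net_effect("r(MIR:miR-372)", 'bp(GO:"Cell Growth")', STATEMENTS))  # increases
```

Two "decreases" steps compose to a net increase, which matches the quoted sentence: numbing the p53 pathway allows tumorigenic growth.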


Use biological pathway visualizations as a user interface for knowledge discovery.



Author-created triples: MSR ActiveText


Metadiscourse: why it matters:

• Voorhoeve et al., 2006: “These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor suppressor LATS2.”

• Kloosterman and Plasterk, 2006: “In a genetic screen, miR-372 and miR-373 were found to allow proliferation of primary human cells that express oncogenic RAS and active p53, possibly by inhibiting the tumor suppressor LATS2 (Voorhoeve et al., 2006).”

• Okada et al., 2011: “Two oncogenic miRNAs, miR-372 and miR-373, directly inhibit the expression of Lats2, thereby allowing tumorigenic growth in the presence of p53 (Voorhoeve et al., 2006).”

“[Y]ou can transform .. fiction into fact just by adding or subtracting references”, Bruno Latour [5]


Adding metadiscourse to triples:

Biological statement with BEL/epistemic markup

Statement:
These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor-suppressor LATS2.

BEL representation:
r(MIR:miR-372) -|(tscript(p(HUGO:Trp53)) -| kin(p(PFH:"CDK Family")))
Increased abundance of miR-372 decreases abundance of LATS2
r(MIR:miR-372) -| r(HUGO:LATS2)

Epistemic evaluation:
Value = Possible
Source = Unknown
Basis = Unknown

Biological statement with MedScan/epistemic markup

Statement:
Furthermore, we present evidence that the secretion of nesfatin-1 into the culture media was dramatically increased during the differentiation of 3T3-L1 preadipocytes into adipocytes (P < 0.001) and after treatments with TNF-alpha, IL-6, insulin, and dexamethasone (P < 0.01).

MedScan Analysis:
IL-6 NUCB2 (nesfatin-1)
Relation: MolTransport
Effect: Positive
CellType: Adipocytes
Cell Line: 3T3-L1

Epistemic evaluation:
Value = Probable
Source = Author
Basis = Data
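The Value/Source/Basis evaluation above can travel with an extracted relation as ordinary structured data. The sketch below assumes a hypothetical `EpistemicStatement` record and a hypothetical ranking of certainty values (only "Possible" and "Probable" occur in the two examples). It illustrates the idea and is not the markup scheme's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpistemicStatement:
    relation: str  # the extracted relation, kept as free text here
    value: str     # epistemic value, e.g. "Possible" or "Probable"
    source: str    # who makes the claim, e.g. "Author" or "Unknown"
    basis: str     # what the claim rests on, e.g. "Data" or "Unknown"


# Assumed ordering of certainty; extend it if the scheme has more levels.
CERTAINTY_RANK = {"Possible": 1, "Probable": 2}

statements = [
    EpistemicStatement("r(MIR:miR-372) -| r(HUGO:LATS2)",
                       "Possible", "Unknown", "Unknown"),
    # Free-text rendering of the MedScan result, for illustration only.
    EpistemicStatement("IL-6 / NUCB2 (nesfatin-1): MolTransport, Positive",
                       "Probable", "Author", "Data"),
]


def at_least(items, minimum):
    """Keep statements whose certainty is at least `minimum`."""
    threshold = CERTAINTY_RANK[minimum]
    return [s for s in items if CERTAINTY_RANK.get(s.value, 0) >= threshold]


for s in at_least(statements, "Probable"):
    print(s.relation, "|", s.source, "|", s.basis)
```

Keeping the evaluation next to the relation lets a downstream user filter out hedged statements rather than treating every extracted triple as fact.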


Claims and Evidence, some examples: Data2Semantics [11]

• Linking clinical guidelines to evidence in a linked data form
• Goal: improve speed of integration of research > practice
• Issue: evidence is not even correct within guideline?
• Studies have demonstrated inconsistent results regarding the use of such markers of inflammation as C-reactive protein (CRP), interleukins-6 (IL-6) and -8, and procalcitonin (PCT) in neutropenic patients with cancer [55–57].
• [55]: PCT and IL-6 are more reliable markers than CRP for predicting bacteremia in patients with febrile neutropenia
• [56]: In conclusion, daily measurement of PCT or IL-6 could help identify neutropenic patients with a stable course when the fever lasts >3 d. …, it would reduce adverse events and treatment costs.
• [57]: Our study supports the value of PCT as a reliable tool to predict clinical outcome in febrile neutropenia.


Claims and Evidence, example: Drug Interaction Knowledgebase [12]

• Extracting adverse drug interactions (ADIs) from the literature and creating a linked data node from them
• Goal: improve speed and coverage of ADIs and allow improved access for patients and doctors
• Issue: how to identify evidence?
– Claim: R-citalopram_is_not_substrate_of_cyp2c19
– Evidence: At 10 uM R- or S-CT, ketoconazole reduced reaction velocity to 55-60% of control, quinidine to 80%, and omeprazole to 80-85% of control (Fig. 6)
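A minimal sketch of how a claim and its supporting sentence might be stored together so that the evidence stays traceable. The `Claim` and `Evidence` classes and the `figure` field are hypothetical and do not reflect the actual Drug Interaction Knowledgebase schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    text: str         # sentence quoted from the source article
    figure: str = ""  # optional pointer to the figure the sentence cites


@dataclass
class Claim:
    identifier: str
    evidence: List[Evidence] = field(default_factory=list)

    def is_supported(self):
        """Treat a claim as supported once it has at least one evidence item."""
        return len(self.evidence) > 0


claim = Claim("R-citalopram_is_not_substrate_of_cyp2c19")
claim.evidence.append(Evidence(
    text=("At 10 uM R- or S-CT, ketoconazole reduced reaction velocity to "
          "55-60% of control, quinidine to 80%, and omeprazole to 80-85% "
          "of control (Fig. 6)"),
    figure="Fig. 6",
))

print(claim.identifier, claim.is_supported())
```

The hard part named in the slide, deciding which sentences count as evidence for which claim, is not solved by the record; it only gives the answer a place to live.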


Data, e.g. Web Science 2.0: Mark Wilkinson (SADI, Madrid)

Using what is known about interactions in fly & yeast: predict new interactions with a human protein


Wilkinson: doing science ON the web:

These are different Web services!

...selected at run-time based on the same model


Data

• All this evidence is based on data
• Increasingly: science is distributed between
– Groups creating data
– Groups using data – creating tools
– Groups using tools on data – ideas
• All of these groups need to communicate!


In summary:

1. How do scientists read?
2. How do computers read?
3. What should we do?


How we read vs. computers:

Level:                People read:              Computers read:
Noun phrases          Know topic                Pretty well
Triples               Know topic                Pretty well
Metadiscourse         Trust method              Not very well
Claims and evidence   Understand and trust      Not very well
Data                  Trust - and new science!  Can enable!


Is this the future of publishing? [17]

1. Research: Each item in the system has metadata (including provenance) and relations to other data items added to it.

2. Workflow: All data items created in the lab are added to a (lab-owned) workflow system.

3. Authoring: A paper is written in an authoring tool which can pull data with provenance from the workflow tool in the appropriate representation into the document.

4. Editing and review: Once the co-authors agree, the paper is ‘exposed’ to the editors, who in turn expose it to reviewers. Reports are stored in the authoring/editing system, the paper gets updated, until it is validated.

5. Publishing and distribution: When a paper is published, a collection of validated information is exposed to the world. It remains connected to its related data item, and its heritage can be traced.

6. User applications: distributed applications run on this ‘exposed data’ universe.

[Figure labels: metadata; Review; Edit; Revise; Publisher runs service (‘app’). Sample text shown in the authoring tool: “Rats were subjected to two grueling tests (click on fig 2 to see underlying data). These results suggest that the neurological pain pro-”]


What should we do?

• Experiment! All over the place. Scientists get it!
• Support scientists working on these (e.g. text miners, web science evangelists, data repositories, etc.) – great return for your investment!
• Join forums where interactions happen between scientists, publishers, libraries, etc., e.g. Force11.org:
– Collective, sponsored by Sloan, aimed at enabling/supporting this discussion
– Planning workshop, innovative projects for 2013
– Please join us at http://force11.org!


References

[1] J Am Med Inform Assoc. 2010 September; 17(5): 514–518. http://dx.doi.org/10.1136/jamia.2010.003947
[2] Quanzhi Li, Yi-Fang Brook Wu (2006). Identifying important concepts from medical documents. Journal of Biomedical Informatics 39 (2006) 668–679.
[3] Useful list of resources in bioinformatics: http://www.bioinformatics.ca/
[4] Biological Expression Language: http://www.openbel.org
[5] Latour, B. and Woolgar, S. (1979). Laboratory Life: the Social Construction of Scientific Facts. Sage Publications.
[6] Light M, Qiu XY, Srinivasan P. (2004). The language of bioscience: facts, speculations, and statements in between. BioLINK 2004: Linking Biological Literature, Ontologies and Databases 2004:17-24.
[7] Wilbur WJ, Rzhetsky A, Shatkay H (2006). New directions in biomedical text annotations: definitions, guidelines and corpus construction. BMC Bioinformatics 2006, 7:356.
[8] Thompson P., Venturi G., McNaught J, Montemagni S, Ananiadou S. (2008). Categorising modality in biomedical texts. Proc. LREC 2008 Wkshp Building and Evaluating Resources for Biomedical Text Mining 2008.
[9] Kim, S-M. and Hovy, E.H. (2004). Determining the Sentiment of Opinions. Proceedings of the COLING conference, Geneva, 2004.
[10] de Waard, A. and Pander Maat, H. (2012). Epistemic Modality and Knowledge Attribution in Scientific Discourse: A Taxonomy of Types and Overview of Features. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 47–55, Jeju, Republic of Korea, 12 July 2012.
[11] Data2Semantics project: http://www.data2semantics.org/
[12] Boyce R, Collins C, Horn J, Kalet I. (2009). Computing with evidence Part I: A drug-mechanism evidence taxonomy oriented toward confidence assignment. J Biomed Inform. 2009 Dec;42(6):979-89. Epub 2009 May 10. See also http://dbmi-icode-01.dbmi.pitt.edu/dikb-evidence/front-page.html
[13] Sándor, Àgnes and de Waard, Anita (2012). Identifying Claimed Knowledge Updates in Biomedical Research Articles. Workshop on Detecting Structure in Scholarly Discourse, ACL 2012.
[14] Blake, C. (2010). Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles. Journal of Biomedical Informatics, 43(2):173-189.
[15] See e.g. http://ucsdbiolit.codeplex.com/ and http://research.microsoft.com/en-us/projects/ontology/ for MS Word ontology add-ins.
[16] de Waard, A. and Schneider, J. (2012). Formalising Uncertainty: An Ontology of Reasoning, Certainty and Attribution (ORCA). Semantic Technologies Applied to Biomedical Informatics and Individualized Medicine workshop, ISWC 2012.
[17] de Waard, A. (2010). The Future of the Journal? Integrating research data with scientific discourse. LOGOS: The Journal of the World Book Community, Volume 21, Numbers 1-2, 2010, pp. 7-11(5). Also published in Nature Precedings, http://precedings.nature.com/documents/4742/version/1

