Cochrane workshop 2016

Post on 14-Feb-2017

Workshop overview
• Y/our backgrounds and interests and what we want
• How does mining work and what can it do for YOU/Cochrane?
• Demonstration with emphasis on dictionaries.
• What would YOU like a system to do?
• Your dictionary/ies in action
• Advanced (chemistry, diagram mining)

• ANY early adopter can obtain our (Open) software and run it at home for any resource (medical, agricultural, government, climate, etc.). We will help you during the next 24 hours.

• All material CC BY.

Cochrane UK & Ireland Symposium 2016,

Birmingham, UK, 2016-03-15

Let the Machine Help with your

Systematic Reviews

Peter Murray-Rust [1,2]

Christopher Kittel [2]

[1] University of Cambridge [2] The ContentMine

Simple, Universal, Knowledge creation and re-use

The Right to Read is the Right to Mine* (*Peter Murray-Rust, 2011)

http://contentmine.org

Resources
• Europe PubMed Central: http://europepmc.org/
• ContentMine toolkit: https://github.com/ContentMine/
• Wikidata: https://www.wikidata.org/wiki/Wikidata:Main_Page
• Hypothes.is: https://hypothes.is/ [1]

• Etherpad: http://pads.cottagelabs.com/p/cochrane2016

• Note: early adopters can obtain our (Open) software and run it at home…

• [1] Not used in CochraneBham workshop

Europe PubMedCentral

[Diagram: CONTENTMINE Complete OPEN Platform for Mining Scientific Literature.
Sources: EPMC, arXiv, CORE, HAL, university repositories, ToC services, and publisher sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …).
Tools: catalogue; getpapers (query); Daily Crawl and quickscrape (crawl, with scrapers and queries); norma (Normalizer, Structurer, Semantic Tagger); ami (search, Lookup, with taggers and plugins: Chem, Phylo, Trials, Crystal, Plants; COMMUNITY).
Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs.
Intermediate: Semantic ScholarlyHTML at 30,000 pages/day, sectioned into abstract, methods, references, captioned figures (Fig. 1) and HTML tables; Text, Data, Figures.
Outputs: Facts, Visualization and Analysis.]

dictionaries

Dictionaries!

[Diagram: MINING with sections and dictionaries. Dict A and Dict B are applied to the sections of each document: abstract, methods, references, captioned figures (Fig. 1), HTML tables, image captions and table captions.]

[W3C Annotation / https://hypothes.is/ ]

Disease Dictionary (ICD-10)

<dictionary title="disease">
  <entry term="1p36 deletion syndrome"/>
  <entry term="1q21.1 deletion syndrome"/>
  <entry term="1q21.1 duplication syndrome"/>
  <entry term="3-methylglutaconic aciduria"/>
  <entry term="3mc syndrome"/>
  <!-- ... -->
  <entry term="corpus luteum cyst"/>
  <entry term="cortical blindness"/>
  <!-- ... -->
</dictionary>

SELECT DISTINCT ?thingLabel WHERE {
  ?thing wdt:P494 ?wd .
  ?thing wdt:P279 wd:Q12136 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

wdt:P494 = ICD-10 (P494) identifier
wd:Q12136 = disease (Q12136): abnormal condition that affects the body of an organism

Wikidata ontology for disease
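One way to turn the result of a query like the one above into a dictionary file is sketched below. It is an illustration, not the ContentMine tooling itself: the sample data is hypothetical (laid out in the standard SPARQL JSON results format, with two labels borrowed from the disease excerpt), the function names are made up for this sketch, and lower-casing the labels is an assumption based on how the terms appear in the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like a SPARQL JSON result for ?thingLabel.
sample_results = {
    "results": {
        "bindings": [
            {"thingLabel": {"type": "literal", "xml:lang": "en", "value": "Cortical blindness"}},
            {"thingLabel": {"type": "literal", "xml:lang": "en", "value": "corpus luteum cyst"}},
            {"thingLabel": {"type": "literal", "xml:lang": "en", "value": "cortical blindness"}},
        ]
    }
}


def labels_from_sparql(results, variable="thingLabel"):
    """Return the distinct labels bound to `variable`, lower-cased and sorted."""
    labels = {
        binding[variable]["value"].strip().lower()
        for binding in results["results"]["bindings"]
        if variable in binding
    }
    return sorted(labels)


def build_dictionary(title, terms):
    """Build a <dictionary title="..."> element with one <entry term="..."/> per term."""
    root = ET.Element("dictionary", {"title": title})
    for term in terms:
        ET.SubElement(root, "entry", {"term": term})
    return root


disease_terms = labels_from_sparql(sample_results)
disease_xml = ET.tostring(build_dictionary("disease", disease_terms), encoding="unicode")
print(disease_xml)
```

Using a set removes labels that differ only in case, which keeps the dictionary free of duplicate entries.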

• ChEBI (chemicals at EBI): ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/names_3star.tsv.gz

• combined with WIKIDATA: World Health Organisation International Nonproprietary Name (P2275)

* => 4947 items in the dictionary (inn.xml)

DRUGS
<dictionary title="inn">
  <entry term="(r)-fenfluramine"/>
  <entry term="abacavir"/>
  <entry term="abafungin"/>
  <entry term="abafungina"/>
  <entry term="abafungine"/>
  <entry term="abafunginum"/>
  <entry term="abamectin"/>
  <entry term="abarelix"/>
  <entry term="abatacept"/>
  <!-- ... -->
</dictionary>

<dictionary title="funders">
  <!-- from http://help.crossref.org/funder-registry with thanks -->
  <entry id="http://dx.doi.org/10.13039/100001436" term="1675 Foundation"/>
  <entry id="http://dx.doi.org/10.13039/100004343" term="3M"/>
  <entry id="http://dx.doi.org/10.13039/501100005957" term="8020 Promotion Foundation"/>
  <entry id="http://dx.doi.org/10.13039/501100007139" term="A Richer Life Foundation"/>
  <entry id="http://dx.doi.org/10.13039/100006543" term="A World Celiac Community Foundation"/>
  <entry id="http://dx.doi.org/10.13039/100001962" term="A-T Children's Project"/>
  <entry id="http://dx.doi.org/10.13039/100008456" term="A. Alfred Taubman Medical Research Institute"/>
  <!-- ... -->
</dictionary>

11566 entries

Funders Dictionary

Dengue Mosquito

<dictionary name="genus">
  <entry term="Aa"/>
  <entry term="Aaaba"/>
  <entry term="Aacanthocnema"/>
  <entry term="Aaosphaeria"/>
  <entry term="Aaptos"/>
  <entry term="Aaptosyax"/>
  <entry term="Aaroniella"/>
  <entry term="Aaronsohnia"/>
  <entry term="Abablemma"/>
  <!-- ... -->
</dictionary>

Genera from NCBI TaxDump

<dictionary title="hgnc">
  <entry term="A1BG" name="alpha-1-B glycoprotein"/>
  <entry term="A1BG-AS1" name="A1BG antisense RNA 1"/>
  <entry term="A1CF" name="APOBEC1 complementation factor"/>
  <entry term="A2M" name="alpha-2-macroglobulin"/>
  <entry term="A2M-AS1" name="A2M antisense RNA 1 (head to head)"/>
  <entry term="A2ML1" name="alpha-2-macroglobulin-like 1"/>
  <entry term="A2ML1-AS1" name="A2ML1 antisense RNA 1"/>
  <!-- ... -->
</dictionary>

Human Genes (HGNC)

<entry term="Aaas" name="achalasia, adrenocortical insufficiency, alacrimia"/>
<entry term="Aacs" name="acetoacetyl-CoA synthetase"/>
<entry term="Aadac" name="arylacetamide deacetylase (esterase)"/>
<entry term="Aadacl2" name="arylacetamide deacetylase-like 2"/>
<entry term="Aadacl3" name="arylacetamide deacetylase-like 3"/>
<entry term="Aadat" name="aminoadipate aminotransferase"/>
<entry term="Aaed1" name="AhpC/TSA antioxidant enzyme domain containing 1"/>
<entry term="Aagab" name="alpha- and gamma-adaptin binding protein"/>
<entry term="Aak1" name="AP2 associated kinase 1"/>
<entry term="Aamdc" name="adipogenesis associated Mth938 domain containing"/>
<entry term="Aamp" name="angio-associated migratory protein"/>

Mouse genes (JAXson)

Ebola!

<dictionary title="tropicalVirus">
  <entry term="ZIKV" name="Zika virus"/>
  <entry term="Zika" name="Zika virus"/>
  <entry term="DENV" name="Dengue virus"/>
  <entry term="Dengue" name="Dengue virus"/>
  <entry term="CHIKV" name="Chikungunya virus"/>
  <entry term="Chikungunya" name="Chikungunya virus"/>
  <entry term="WNV" name="West Nile virus"/>
  <entry term="West Nile" name="West Nile virus"/>
  <entry term="YFV" name="Yellow fever virus"/>
  <entry term="Yellow fever" name="Yellow fever virus"/>
  <entry term="HPV" name="Human papilloma virus"/>
  <entry term="Human papilloma virus" name="Human papilloma virus"/>
</dictionary>
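A dictionary like this maps several surface terms (ZIKV, Zika) onto one name. The sketch below shows one plausible way to apply it to text; it is not how ami itself does the matching. The function names, the case-sensitive whole-word rule and the sample sentence are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# A shortened copy of the tropicalVirus dictionary from the slide.
DICTIONARY_XML = """<dictionary title="tropicalVirus">
  <entry term="ZIKV" name="Zika virus"/>
  <entry term="Zika" name="Zika virus"/>
  <entry term="DENV" name="Dengue virus"/>
  <entry term="Dengue" name="Dengue virus"/>
</dictionary>"""


def load_dictionary(xml_text):
    """Map each term to its name attribute (or to the term itself if no name is given)."""
    root = ET.fromstring(xml_text)
    return {e.get("term"): e.get("name", e.get("term")) for e in root.iter("entry")}


def count_matches(text, term_to_name):
    """Count whole-word, case-sensitive hits, grouped by the entry's name."""
    # Longest terms first, so multi-word terms are tried before shorter ones.
    terms = sorted(term_to_name, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b")
    counts = {}
    for match in pattern.finditer(text):
        name = term_to_name[match.group(1)]
        counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical sentence, not taken from any paper.
text = "ZIKV and DENV co-circulate; Zika infection may follow Dengue exposure."
virus_counts = count_matches(text, load_dictionary(DICTIONARY_XML))
print(virus_counts)
```

Grouping by name rather than by term means an abbreviation and its spelled-out form count as the same entity.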

Terms co-occurring with “Zika”

<dictionary title="cochrane">
  <entry term="Cochrane Library"/>
  <entry term="Cochrane Reviews"/>
  <entry term="Cochrane Central Register of Controlled Trials"/>
  <entry term="Cochrane"/>
  <entry term="randomize"/>
  <entry term="meta-analysis"/>
  <entry term="Embase"/>
  <entry term="MEDLINE"/>
  <entry term="eligibility"/>
  <entry term="exclusion"/>
  <entry term="outcome"/>
  <entry term="Review Manager"/>
  <entry term="STATA"/>
  <entry term="RCT"/>
</dictionary>

Terms lexically related to “meta-analysis”

Mining strategy
• Discover; negotiate permissions => bibliography
• Crawl / Scrape (download) documents AND supplemental
• Normalize: PDF => XML
• Index: facets => Facts and snippets (“entities”)
• Interpret/analyze entities => relationships, aggregations (“Transformative”)
• Publish
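The Index step (facets => Facts and snippets) can be pictured as recording each hit together with a little surrounding text. The sketch below is a minimal illustration under stated assumptions: the field names `pre`, `exact` and `post`, the window size and the sample sentence are hypothetical, and the real ContentMine output format may differ.

```python
import re


def find_snippets(text, term, window=30):
    """Return one snippet per whole-word hit of `term`, with up to `window` characters of context on each side."""
    snippets = []
    for match in re.finditer(r"\b" + re.escape(term) + r"\b", text):
        snippets.append({
            "pre": text[max(0, match.start() - window):match.start()],
            "exact": match.group(0),
            "post": text[match.end():match.end() + window],
        })
    return snippets


# Hypothetical normalized text using terms from the cochrane dictionary.
normalized_text = "We searched MEDLINE and Embase for trials. Two reviewers screened each RCT."
rct_snippets = find_snippets(normalized_text, "RCT", window=20)
print(rct_snippets)
```

Keeping the context next to the matched term lets a reviewer judge a hit without opening the full paper.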

Demo

PMR runs getpapers and ami

Chris runs Python visualization of drug co-occurrence
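The counting behind a co-occurrence visualization can be sketched in a few lines. This is not the code run in the demo: the per-paper term sets are hypothetical (the drug names come from the inn dictionary excerpt), and counting each unordered pair once per paper is an assumption about how co-occurrence is defined.

```python
from collections import Counter
from itertools import combinations

# Hypothetical: the set of dictionary terms found in each paper.
papers = {
    "paper1": {"abacavir", "abatacept"},
    "paper2": {"abacavir", "abamectin", "abatacept"},
    "paper3": {"abamectin"},
}


def cooccurrence(term_sets):
    """Count how many papers mention each unordered pair of terms."""
    counts = Counter()
    for terms in term_sets:
        # Sorting makes (a, b) and (b, a) the same key.
        for pair in combinations(sorted(terms), 2):
            counts[pair] += 1
    return counts


pair_counts = cooccurrence(papers.values())
for pair, n in pair_counts.most_common():
    print(pair, n)
```

The resulting pair counts are what a graph or heat-map library would take as edge weights.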

Systematic Reviews

Can we:
• eliminate true negatives automatically?
• extract data from formulaic language?
• mine diagrams?
• annotate existing sources?
• forward-reference clinical trials?

Polly has 20 seconds to read this paper…

…and 10,000 more

ContentMine software can do this in a few minutes

Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
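The figures in Polly's account can be checked with simple arithmetic. The sketch below uses only numbers quoted in the slides; combining the "20 seconds" per paper with the 10,000 abstracts is an assumption made here to give a rough lower bound on pure reading time.

```python
abstracts = 10_000
researchers = 6
days_each_min, days_each_max = 2, 3   # "about 2-3 days of work" per researcher
seconds_per_abstract = 20             # "Polly has 20 seconds to read this paper"

per_researcher = abstracts / researchers          # about 1,667 ("~1,600" in the quote)
person_days_min = researchers * days_each_min     # 12, the minimum quoted
person_days_max = researchers * days_each_max     # 18
reading_hours = abstracts * seconds_per_abstract / 3600  # pure reading time in hours

print(round(per_researcher), person_days_min, person_days_max, round(reading_hours, 1))
```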

400,000 Clinical Trials in 10 government registries

Mapping trials => papers

http://www.trialsjournal.com/content/16/1/80

2009 => 2015. What’s happened in the last 6 years?

Search the whole scientific literature for “2009-0100068-41”
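As a rough illustration, an exact-phrase search for a trial identifier can be expressed as a query URL for the Europe PMC REST search service. This sketch only builds the URL and makes no network call. It covers Europe PMC only, not the whole literature; the function name and page size are hypothetical, and it is not necessarily how the workshop ran the search.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Europe PMC RESTful search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(identifier, page_size=25):
    """Build a search URL that looks for `identifier` as an exact phrase."""
    params = {
        "query": f'"{identifier}"',  # double quotes request a phrase match
        "format": "json",
        "pageSize": page_size,
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


trial_url = build_search_url("2009-0100068-41")
print(trial_url)
```

Quoting the identifier keeps the hyphenated number together instead of letting it be split into separate search tokens.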

Diagram Mining

[Plot: Ln bacterial load per fly (y-axis; tick labels read as 11.5, 11.0, 10.5, 10.0, 9.5, 9.0, 6.5, 6.0) against days post-infection (x-axis; ticks 0, 1, 2, 3, 4, 5).]

Bitmap Image and Tesseract OCR
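Once OCR has read the tick labels of a plot, pixel positions can be converted back into data values by calibrating each axis against two ticks. The sketch below assumes linear axes. All pixel coordinates are hypothetical, and the tick values are those of the "Ln Bacterial load per fly" plot. It illustrates the idea only and is not the ContentMine diagram-mining code.

```python
def make_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Return a function that maps a pixel coordinate to a data value on a linear axis."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical pixel rows of the y-axis ticks 9.0 and 11.5 (rows grow downwards).
to_ln_load = make_axis_mapper(pixel_a=400, value_a=9.0, pixel_b=100, value_b=11.5)
# Hypothetical pixel columns of the x-axis ticks 0 and 5 days.
to_days = make_axis_mapper(pixel_a=50, value_a=0, pixel_b=550, value_b=5)

# A data point detected at column 250, row 280 in the bitmap.
point = (to_days(250), to_ln_load(280))
print(point)
```

Because image rows increase downwards, the y-axis scale comes out negative, and the mapper handles that without special treatment.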