Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI


Linking literature to data in the life sciences

OpenAIREplus workshop, Copenhagen, 11 June 2012

Overview

• What literature? What data?

• How we make literature-data connections

• Case study

• Challenges and future directions

What literature? What data?

Big Data: Deposition

Primary research articles

Big Data: Curated

Annotation

Unstructured Data

Funder mandates

Journal requirements

Metadata

Standards

Data Landscape and Definitions

*reuse

PMC336623 Extended to several other biological data types

[Figure: yearly charts, 2001 to 2010, for the European Nucleotide Archive, Ensembl and Ensembl Genomes, UniProt, InterPro, and ArrayExpress]

• Big data

• Thematic data

• Public data

• Archived data

• Two petabytes of data

• Scales to 7 petabytes of raw disk

• Majority is DNA

Two core literature databases

• 26 million abstracts (PubMed, Patents, Agricola)

• Website and web services

• 2.2 million full text articles (217K articles with supplementary data)

• Website

• Citation networks

• Database links

• Whatizit text mining

• Supplemented by CiteXplore

• Additional text mining

• Over 1.1 million new records per year

• Over 150K new articles per year

UK PubMed Central Overview

• Built in collaboration with PubMed Central USA (+ PMC Canada) since 2006

• Led by the European Bioinformatics Institute since 2011, with the British Library and the University of Manchester

• Supported by 16 UK and 2 European funders, led by the Wellcome Trust. Research spend: ~2 billion GBP

• A life-science web-based repository

• Manuscript submission service (self archiving by grant holders)

• Database of grant information, with details of about 18,000 PIs

• Grant reporting and funder analysis tool

• 250K requests, 40K IPs, 7K direct interactive searches per day

How many articles?

Overall: 20% OA (~ 450K OA articles out of 2.2 million total)

How we make literature-data connections

• by the author - on submission, as metadata (primary databases)

• by database curators - information and links from the literature

• expensive, slow, but high quality

Text mining

• by algorithms that use terminologies (can be subject to lag)

• post publication – can find new associations

• variable quality, but high throughput
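The terminology-driven approach in the bullets above can be illustrated with a minimal sketch. It is not the Whatizit implementation: the three-entry terminology, the `annotate` function, and the sample phrase are all hypothetical, and a real pipeline would load curated vocabularies and handle synonyms and ambiguity.

```python
import re

# Hypothetical mini-terminology mapping a term to a semantic type.
# Real systems load much larger curated vocabularies; these entries
# are illustrative only.
TERMINOLOGY = {
    "dna repair": "GO Terms",
    "colon cancer": "Disease",
    "e. coli": "Organism",
}


def annotate(text, terminology=TERMINOLOGY):
    """Return (start, end, matched text, semantic type) tuples, sorted by position."""
    hits = []
    for term, semantic_type in terminology.items():
        # Whole-word, case-insensitive match of the literal term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), match.group(0), semantic_type))
    return sorted(hits)


# Invented sample phrase built from terms that appear in this talk.
sample = "Human colon cancer and DNA repair in E. coli"
annotations = annotate(sample)
for start, end, matched, semantic_type in annotations:
    print(start, end, matched, semantic_type)
```

A term missing from the terminology is simply not found, which is one way to read the "subject to lag" caveat: the dictionary has to catch up with new names before the algorithm can tag them.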

Links from Literature to Databases

• Proteins

• Nucleotides

• OMIM

• Chemicals

• Structure

• Clinical reviews

• Protein families

• Protein-protein interactions

• Gene expression experiments …

Semantic Type    Unique Terms     Articles    Annotations
Gene/Protein          225,905    1,288,809     15,021,502
GO Terms               32,486    1,806,539     15,016,957
Organism              178,847    1,689,251     12,322,782
Disease               170,592    1,743,212     16,201,198
Accession No.         232,950       65,640        331,329
Chemical               76,350    1,669,500     22,438,980

Text Mining in UKPMC (2.2 million articles)
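The text-mining figures above can be turned into two derived measures: annotations per annotated article, and the share of the 2.2 million articles that carry each semantic type. The sketch below only does arithmetic on the numbers shown; the variable names are illustrative, and it assumes the "Articles" column counts articles with at least one annotation of that type.

```python
# (semantic type, unique terms, articles, annotations), copied from the table above.
ROWS = [
    ("Gene/Protein", 225_905, 1_288_809, 15_021_502),
    ("GO Terms", 32_486, 1_806_539, 15_016_957),
    ("Organism", 178_847, 1_689_251, 12_322_782),
    ("Disease", 170_592, 1_743_212, 16_201_198),
    ("Accession No.", 232_950, 65_640, 331_329),
    ("Chemical", 76_350, 1_669_500, 22_438_980),
]

# Corpus size stated in the slide title.
TOTAL_ARTICLES = 2_200_000

# Average number of annotations in each article that has the given type.
per_article = {name: annotations / articles for name, _, articles, annotations in ROWS}

# Fraction of the whole corpus in which the type occurs at all
# (assumes "Articles" means articles with at least one such annotation).
coverage = {name: articles / TOTAL_ARTICLES for name, _, articles, _ in ROWS}

for name in per_article:
    print(f"{name:<14} {per_article[name]:6.2f} per article  {coverage[name]:6.1%} of corpus")
```

On these figures, accession numbers stand out: they occur in only a small fraction of articles, which is the gap that explicit literature-data links are meant to close.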

Case study

3.9 billion years ago

E. coli meets humans

Human colon cancer DNA repair


Protein structure in PDBe

Link to the literature from the PDBe record

Algorithms that find similar structures

Text mine full text for 1ewq

Towards understanding DNA repair mechanisms
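The "text mine full text for 1ewq" step in the case study can be sketched as an accession-number search. Classic PDB identifiers are four characters, a digit followed by three alphanumerics, so a bare pattern match is noisy. The sketch below assumes a hypothetical cue-word filter (`CUES`) and context window to cut false positives; it is not the method used in UKPMC, and the sample sentences are invented.

```python
import re

# Classic PDB identifier: one digit (1-9) followed by three alphanumerics.
PDB_ID = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")

# Hypothetical cue words that must appear near a candidate identifier.
CUES = re.compile(r"\b(PDB|Protein Data Bank)\b", re.IGNORECASE)


def find_pdb_ids(text, window=40):
    """Return lower-cased PDB-like identifiers that have a cue word nearby."""
    found = []
    for match in PDB_ID.finditer(text):
        candidate = match.group(0)
        if candidate.isdigit():
            # Skip plain numbers such as years or counts.
            continue
        context = text[max(0, match.start() - window): match.end() + window]
        if CUES.search(context):
            found.append(candidate.lower())
    return found


# Invented sentences for illustration only.
print(find_pdb_ids("Coordinates are available from the PDB under accession 1ewq (see 2010 update)."))
print(find_pdb_ids("We counted 1200 cells and 3 mice."))
```

Requiring a cue word trades recall for precision: a mention of an identifier with no nearby "PDB" is missed, which fits the "variable quality, but high throughput" caveat about text mining.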

Challenges and future directions

Data-driven science

Data re-use: biology is post publication

Linking: citing papers and data (provenance and integration)

Metrics and attribution

Hard decisions about value of keeping complete data sets

Big Data: Deposition

Primary research articles

Big Data: Curated

Annotation

Unstructured Data

Data landscape - possibilities

reuse?

Structured links

analysis

Analysis supplied by Mimas, University of Manchester

XSLDOC

Solutions that make sense to scientists

http://ukpmc.ac.uk
