
Facilitating the development of controlled vocabularies for metabolomics with text mining

Date post: 23-Jan-2016
Page 1: Facilitating the development of controlled vocabularies for metabolomics with text mining

I. Spasić,1 D. Schober,2 S. Sansone,2 D. Rebholz-Schuhmann,2 D. Kell,1 N. Paton1 and the MSI Ontology Working Group Members3

1 MCISB http://www.mcisb.org
2 EBI http://www.ebi.ac.uk
3 MSI http://msi-workgroups.sf.net

Page 2

Motivation

• experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology & bioinformatics

• controlled vocabularies and ontologies play a crucial role in consistent interpretation and seamless integration of information scattered across public resources

• hence the pressing need for controlled vocabularies and ontologies for metabolomics

Page 3

Metabolomics Society

• http://www.metabolomicssociety.org

• the most recent community-wide initiative to coordinate the efforts in standardising reporting structures of metabolomics experiments

• five working groups:

– biological sample context

– chemical analysis

– data analysis

– ontology

– data exchange

Page 4

MSI OWG

• Metabolomics Standardisation Initiative Ontology WG

• http://msi-ontology.sourceforge.net

• coordinated by Dr Susanna-Assunta Sansone

• develop a common semantic framework for metabolomics studies by means of

– controlled vocabularies

– ontologies

so as to be able to:

– describe the experimental process consistently

– ensure meaningful and unambiguous data exchange

Page 5

Scope

• the coverage of the domain reflects the typical structure of metabolomics investigations:

– general components (investigation design; sample source, characteristics, treatments and collection; computational analysis)

– technology-specific components (sample preparation; instrumental analysis; data pre-processing)

• analytical technologies: mass spectrometry (MS), gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), nuclear magnetic resonance (NMR) spectroscopy…

Page 6

Terms

• terms:

– linguistic representations of domain-specific concepts

– means of conveying scientific and technical information

• CV terms:

– used to tag units of information so that they can be more easily retrieved by a search

– improve technical communication by ensuring that everyone is using the same term to mean the same thing

Page 7

Term acquisition

• CV terms are chosen and organised by trained professionals who possess expertise in the subject area

• in the rapidly developing domain of metabolomics, new analytical techniques emerge regularly, often compelling domain experts to use non-standardised terms

• problem: manual term acquisition approaches are time-consuming, labour-intensive and error-prone

• solution: a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a CV with terms already in use in the scientific literature

Page 8

Strategy

• each CV is compiled in an iterative process consisting of the following steps:

1. create an initial CV by re-using the existing terminologies from database models, glossaries, etc. and normalise the terms according to the common naming conventions

2. expand the CV with other frequently co-occurring terms identified automatically using text mining over a relevant corpus of scientific publications

3. circulate the proposed CV to the practitioners in the relevant area of metabolomics for validation in order to ensure its quality and completeness

Page 9

A text mining workflow

1. information retrieval: gather a technology-specific corpus of documents

search terms: MeSH terms & CV terms
documents: abstracts & full papers
resources: Entrez, i.e. MEDLINE & PubMed Central (PMC)

2. term recognition: extract terms as lexical units frequently occurring in a domain-specific corpus

method: C-value provided by NaCTeM

3. term filtering: filter out terms not directly related to a given technology, such as those denoting substances, organisms, organs, diseases, etc.

resources: UMLS — MetaThesaurus & Semantic Network

Page 10

Information retrieval using MeSH terms

• MeSH = Medical Subject Headings

• http://www.nlm.nih.gov/mesh/

• MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed

• MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts

Page 11

IR using MeSH terms

• finding the relevant MeSH terms using the MeSH browser

• http://www.nlm.nih.gov/mesh/MBrowser.html

• look up: NMR

• resulting MeSH term(s): Magnetic Resonance Spectroscopy

• PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms]
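The same query can also be issued programmatically. The following is a minimal sketch, assuming the NCBI E-utilities `esearch` endpoint with its `db`, `term` and `retmax` parameters (the slides only show the interactive PubMed query). The helper name `build_esearch_url` is hypothetical, and the URL is only constructed here, not fetched.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service (assumption: the slides
# mention Entrez but not a specific programmatic interface).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(mesh_term: str, db: str = "pubmed", retmax: int = 100) -> str:
    """Return an esearch URL restricted to articles indexed with `mesh_term`."""
    query = f"{mesh_term}[MeSH Terms]"  # same field tag as in the PubMed query
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("Magnetic Resonance Spectroscopy")
print(url)
```

Passing `db="pmc"` instead would point the same query at PubMed Central, which matters for the full-text retrieval discussed on the next slide.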

Page 12

Beyond MeSH terms

• NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study, so an abstract can be expected to report only the results discovered and not the experimental conditions leading to these results

• the experimental conditions are typically reported within “Materials & Methods” sections or as part of the supplementary material, so it is important to process full-text articles as opposed to abstracts only

• as a consequence, an IR approach based on MeSH terms or search terms limited to abstracts will result in low recall (i.e. many of the relevant articles will be overlooked)
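To make the recall argument concrete, here is a small sketch that computes recall as the fraction of relevant documents actually retrieved. The document IDs and the split between abstract-only and full-text hits are invented for illustration and are not data from the study.

```python
def recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of the relevant documents that were actually retrieved."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(retrieved & relevant) / len(relevant)


# Hypothetical example: four papers use NMR, but only one mentions it in the
# abstract; the other three describe it in "Materials & Methods" only.
relevant = {"doc1", "doc2", "doc3", "doc4"}
found_via_abstracts = {"doc1"}
found_via_full_text = {"doc1", "doc2", "doc3", "doc4"}

print(recall(found_via_abstracts, relevant))  # 0.25
print(recall(found_via_full_text, relevant))  # 1.0
```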

[Figure: mentions of "NMR" in the biomedical literature, spread across MEDLINE (abstracts) and PubMed Central (full papers)]

Page 13

Selecting search terms

2400

Page 14

Selecting documents

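[Diagram: for each doc ID, the number of matching terms is compared with a threshold; documents above the threshold go into the local corpus]

[Chart: number of documents (y-axis, 0 to 30,000) against number of search terms (x-axis, 1 to 31); label "= 3"]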
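The selection step on this slide can be sketched as follows. This is an illustration only: plain case-insensitive substring matching stands in for whatever matching the authors used, the default threshold of 3 is read off the "= 3" label as an assumption, and the function names, search terms and documents are hypothetical.

```python
def count_matching_terms(text: str, search_terms: list[str]) -> int:
    """Number of distinct search terms that occur in `text` (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for term in search_terms if term.lower() in lowered)


def select_documents(docs: dict[str, str], search_terms: list[str],
                     threshold: int = 3) -> list[str]:
    """IDs of documents matching more than `threshold` distinct search terms."""
    return [doc_id for doc_id, text in docs.items()
            if count_matching_terms(text, search_terms) > threshold]


# Hypothetical search terms and documents.
terms = ["chemical shift", "pulse sequence", "free induction decay",
         "magnetic field", "spectrometer"]
docs = {
    "d1": "The spectrometer recorded the free induction decay; the pulse "
          "sequence and chemical shift referencing are described below.",
    "d2": "Chemical shift values were taken from the literature.",
}

local_corpus = select_documents(docs, terms)
print(local_corpus)  # ['d1']
```

Requiring several distinct terms rather than one trades some recall for precision: a document that mentions a single technique-related term in passing is left out of the local corpus.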

Page 15

Term recognition: C-value

• http://www.nactem.ac.uk/batch.php

Page 16

C-value

• syntactic pattern matching used to select term candidates:

(ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*) N

• termhood of each candidate term t is calculated using:

– |t| its length as the number of words

– f(t) its frequency of occurrence

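– S(t) the set of other candidate terms containing it as a subphrase

\[
C(t) =
\begin{cases}
\ln|t| \cdot f(t), & \text{if } S(t) = \emptyset \\
\ln|t| \cdot \Bigl( f(t) - \dfrac{1}{|S(t)|} \sum_{s \in S(t)} f(s) \Bigr), & \text{if } S(t) \neq \emptyset
\end{cases}
\]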
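The termhood formula can be sketched in a few lines. This is an illustration under stated assumptions, not the NaCTeM implementation: candidate terms are represented as tuples of words, their frequencies are assumed to be already counted, and the example frequencies are invented.

```python
import math
from collections import Counter


def contains(longer: tuple[str, ...], shorter: tuple[str, ...]) -> bool:
    """True if `shorter` occurs in `longer` as a contiguous, proper subphrase."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_values(freq: Counter) -> dict[tuple[str, ...], float]:
    """C-value of every candidate term t, given its frequency f(t)."""
    scores = {}
    for t, f_t in freq.items():
        nested_in = [s for s in freq if contains(s, t)]  # S(t)
        if nested_in:
            # Subtract the average frequency of the longer terms containing t.
            f_t -= sum(freq[s] for s in nested_in) / len(nested_in)
        scores[t] = math.log(len(t)) * f_t  # ln|t| is 0 for one-word terms
    return scores


# Hypothetical candidate frequencies.
freq = Counter({
    ("chemical", "shift"): 10,
    ("proton", "chemical", "shift"): 4,
    ("carbon", "chemical", "shift"): 2,
})
scores = c_values(freq)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 3))
```

Here "chemical shift" is nested in two longer candidates, so its frequency of 10 is reduced by their average frequency (3) before being weighted by ln 2. Note that with ln|t| as the length factor, single-word candidates always score 0.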

Page 17

C-value results

Page 18

Unified Medical Language System (UMLS)

• UMLS = an “ontology” which merges information from over 100 biomedical source vocabularies

• http://umlsks.nlm.nih.gov

• UMLS contains the following semantic classes relevant to our problem:

Organism               A.1.1
Anatomical Structure   A.1.2
Substance              A.1.4
Biological Function    B.2.2.1
Injury or Poisoning    B.2.3

• we used these classes to automatically extract the corresponding terms from the UMLS thesaurus
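The filtering step can be sketched as a lookup against these classes. The sketch assumes that each known term has already been mapped to a class code in the slide's notation and that descendants of a class extend its code with further dotted components. The lookup table and its codes below are made up for illustration and are not taken from UMLS.

```python
# Semantic classes to filter out, in the notation used on the slide.
EXCLUDED_CLASSES = {
    "A.1.1": "Organism",
    "A.1.2": "Anatomical Structure",
    "A.1.4": "Substance",
    "B.2.2.1": "Biological Function",
    "B.2.3": "Injury or Poisoning",
}


def is_excluded(class_code: str) -> bool:
    """True if `class_code` is an excluded class or one of its descendants."""
    return any(class_code == root or class_code.startswith(root + ".")
               for root in EXCLUDED_CLASSES)


def filter_terms(candidates: list[str], term_classes: dict[str, str]) -> list[str]:
    """Keep candidates that are unknown to the lookup or not in an excluded class."""
    return [t for t in candidates
            if t not in term_classes or not is_excluded(term_classes[t])]


# Hypothetical lookup table (codes invented for the example).
term_classes = {"glucose": "A.1.4.1", "liver": "A.1.2.3", "mouse": "A.1.1.3"}
candidates = ["chemical shift", "glucose", "liver", "pulse sequence", "mouse"]

print(filter_terms(candidates, term_classes))
# ['chemical shift', 'pulse sequence']
```

Matching on `root + "."` rather than on the bare prefix avoids treating a sibling such as `A.1.10` as a descendant of `A.1.1`.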

Page 19

Summary

UMLS

Page 20

Page 21

Results

• input: 243 NMR terms & 152 GC terms

• output: 5,699 NMR terms & 2,612 GC terms

2%

16.25

0.13

Page 22

The End

