Facilitating the development of controlled vocabularies for metabolomics with text mining

I. Spasić,1 D. Schober,2 S. Sansone,2 D. Rebholz-Schuhmann,2 D. Kell,1 N. Paton1 and the MSI Ontology Working Group Members3

1 MCISB http://www.mcisb.org

2 EBI http://www.ebi.ac.uk

3 MSI http://msi-workgroups.sf.net

Motivation

• experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology & bioinformatics

• controlled vocabularies and ontologies play a crucial role in consistent interpretation and seamless integration of information scattered across public resources

• the pressing need for vocabularies and ontologies for metabolomics

Metabolomics Society

• http://www.metabolomicssociety.org

• the most recent community-wide initiative to coordinate the efforts in standardising reporting structures of metabolomics experiments

• five working groups:

– biological sample context

– chemical analysis

– data analysis

– ontology

– data exchange

MSI OWG

• Metabolomics Standardisation Initiative Ontology WG

• http://msi-ontology.sourceforge.net

• msi-workgroups-ontology@lists.sourceforge.net

• coordinated by Dr Susanna-Assunta Sansone

• develop a common semantic framework for metabolomics studies by means of

– controlled vocabularies

– ontologies

so as to be able to:

– describe the experimental process consistently

– ensure meaningful and unambiguous data exchange

• the coverage of the domain reflects the typical structure of metabolomics investigations:

– general components (investigation design; sample source, characteristics, treatments and collection; computational analysis)

– technology-specific components (sample preparation; instrumental analysis; data pre-processing)

• analytical technologies: mass spectrometry (MS), gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), nuclear magnetic resonance (NMR) spectroscopy…

• terms:

– linguistic representations of domain-specific concepts

– means of conveying scientific and technical information

• CV terms:

– used to tag units of information so that they can be more easily retrieved by a search

– improve technical communication by ensuring that everyone is using the same term to mean the same thing

Term acquisition

• CV terms are chosen and organised by trained professionals who possess expertise in the subject area

• in the rapidly developing domain of metabolomics, new analytical techniques emerge regularly, often compelling domain experts to use non-standardised terms

• problem: manual term acquisition approaches are time-consuming, labour-intensive and error-prone

• solution: a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a CV with terms already in use in the scientific literature

Strategy

• each CV is compiled in an iterative process consisting of the following steps:

1. create an initial CV by re-using the existing terminologies from database models, glossaries, etc. and normalise the terms according to the common naming conventions

2. expand the CV with other frequently co-occurring terms identified automatically using text mining over a relevant corpus of scientific publications

3. circulate the proposed CV to the practitioners in the relevant area of metabolomics for validation in order to ensure its quality and completeness

A text mining workflow

1. information retrieval: gather a technology-specific corpus of documents

search terms: MeSH terms & CV terms

documents: abstracts & full papers

resources: Entrez (MEDLINE & PubMed Central (PMC))

2. term recognition: extract terms as lexical units frequently occurring in a domain-specific corpus

method: C-value provided by NaCTeM

3. term filtering: filter out terms not directly related to a given technology, such as those denoting substances, organisms, organs, diseases, etc.

resources: UMLS (Metathesaurus & Semantic Network)

Information retrieval using MeSH terms

• MeSH = Medical Subject Headings

• http://www.nlm.nih.gov/mesh/

• MeSH is the NLM's CV used for indexing articles for MEDLINE/PubMed

• MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts

IR using MeSH terms

• finding the relevant MeSH terms using the MeSH browser

• http://www.nlm.nih.gov/mesh/MBrowser.html

• look up: NMR

• resulting MeSH term(s): Magnetic Resonance Spectroscopy

• PubMed query: Magnetic Resonance Spectroscopy [MeSH Terms]
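As a rough illustration, a query like the one above could also be submitted programmatically through the NCBI E-utilities `esearch` service. The sketch below only builds the request URL and sends nothing over the network; the helper name `build_esearch_url` and the `retmax` default are assumptions made for this example, not part of the original workflow.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(mesh_term, db="pubmed", retmax=100):
    """Build an esearch URL restricted to a MeSH heading (hypothetical helper)."""
    query = f"{mesh_term}[MeSH Terms]"
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})

url = build_esearch_url("Magnetic Resonance Spectroscopy")
```

Passing `db="pmc"` instead would point the same query at PubMed Central, although which search fields are honoured there is not something the slides specify.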

Beyond MeSH terms

• NMR (or any other analytical technique used in metabolomics) is rarely itself the focus of a metabolomics study, so an abstract can be expected to report only the results discovered and not the experimental conditions leading to these results

• the experimental conditions are typically reported within “Materials & Methods” sections or as part of the supplementary material, so it is important to process full-text articles as opposed to abstracts only

• as a consequence, an IR approach based on MeSH terms or on search terms limited to abstracts will result in low recall (i.e. many of the relevant articles will be overlooked)

[Figure: biomedical literature, comprising MEDLINE (abstracts) and PubMed Central (full papers)]

Selecting search terms

Selecting documents

[Figure: number of matching search terms per doc ID; documents with a count > threshold go into the local corpus]
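The document selection step, in which a document enters the local corpus when its number of matching search terms exceeds a threshold, might be sketched as follows. Case-insensitive substring matching, the function names and the toy documents are all assumptions made for illustration; the slides do not say how matches were actually counted.

```python
def count_matches(text, search_terms):
    """Number of distinct search terms occurring in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for term in search_terms if term.lower() in lowered)

def select_documents(docs, search_terms, threshold):
    """Return IDs of documents matching more than `threshold` search terms."""
    return [doc_id for doc_id, text in docs.items()
            if count_matches(text, search_terms) > threshold]

# Toy data, invented for illustration only.
docs = {
    "doc1": "Samples were analysed by NMR spectroscopy; chemical shift "
            "and pulse sequence were recorded.",
    "doc2": "The patients were followed up for two years.",
    "doc3": "A pulse sequence was applied before acquisition.",
}
search_terms = ["NMR spectroscopy", "chemical shift", "pulse sequence"]
local_corpus = select_documents(docs, search_terms, threshold=1)
```

Here only `doc1` matches more than one search term, so it is the only document kept.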

Term recognition: C-value

• http://www.nactem.ac.uk/batch.php

C-value

• syntactic pattern matching used to select term candidates:

(ADJ | N)+ | ((ADJ | N)* [N PREP] (ADJ | N)*) N

• termhood of each candidate term t is calculated using:

– |t| its length as the number of words

– f(t) its frequency of occurrence

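– S(t) the set of other candidate terms containing it as a subphrase

\[
C(t) =
\begin{cases}
\ln|t| \cdot f(t), & \text{if } S(t) = \emptyset \\
\ln|t| \cdot \left( f(t) - \dfrac{1}{|S(t)|} \sum_{s \in S(t)} f(s) \right), & \text{if } S(t) \neq \emptyset
\end{cases}
\]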
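A minimal sketch of the termhood calculation, using |t|, f(t) and S(t) as defined above, is given below. It assumes that candidate terms and their frequencies have already been extracted, and that "subphrase" means a contiguous word sequence; the function names and toy frequencies are invented for illustration and this is not the NaCTeM implementation.

```python
import math

def nested_in(short, long):
    """True if `short` occurs as a contiguous word sequence inside the longer term `long`."""
    s, l = short.split(), long.split()
    return len(s) < len(l) and any(
        l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))

def c_value(freq):
    """Compute C(t) for every candidate term in `freq` (term -> frequency)."""
    scores = {}
    for t, f_t in freq.items():
        # S(t): other candidate terms that contain t as a subphrase
        containing = [s for s in freq if nested_in(t, s)]
        if containing:
            # subtract the mean frequency of the containing terms
            f_t = f_t - sum(freq[s] for s in containing) / len(containing)
        scores[t] = math.log(len(t.split())) * f_t
    return scores

# Toy frequencies, invented for illustration only.
freq = {
    "magnetic resonance": 10,
    "nuclear magnetic resonance": 6,
    "nuclear magnetic resonance spectroscopy": 4,
}
scores = c_value(freq)
```

Note that with the natural logarithm of the length, any single-word candidate scores zero, so a measure of this form only ranks multi-word terms.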

C-value results

Unified Medical Language System (UMLS)

• UMLS = an “ontology” which merges information from over 100 biomedical source vocabularies

• http://umlsks.nlm.nih.gov

• UMLS contains the following semantic classes relevant to our problem:

Organism A.1.1

Anatomical Structure A.1.2

Substance A.1.4

Biological Function B.2.2.1

Injury or Poisoning B.2.3

• we used these classes to automatically extract the corresponding terms from the UMLS thesaurus
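The filtering idea can be sketched as a simple lookup: candidate terms whose semantic class is one of those listed above are discarded, and everything else is kept. The `term_classes` dictionary below is a hypothetical stand-in for terms extracted from the UMLS, and mapping each term directly to one top-level class is a simplifying assumption.

```python
# Semantic classes named on the slide, treated here as "not technology-related".
EXCLUDED_CLASSES = {"Organism", "Anatomical Structure", "Substance",
                    "Biological Function", "Injury or Poisoning"}

def filter_terms(candidates, term_classes):
    """Drop candidates whose known semantic class is in EXCLUDED_CLASSES."""
    return [t for t in candidates
            if term_classes.get(t.lower()) not in EXCLUDED_CLASSES]

# Hypothetical lookup table standing in for terms extracted from the UMLS.
term_classes = {
    "glucose": "Substance",
    "liver": "Anatomical Structure",
    "mouse": "Organism",
}
candidates = ["pulse sequence", "glucose", "chemical shift", "liver"]
kept = filter_terms(candidates, term_classes)
```

Terms that are absent from the lookup table are kept by default, which favours recall and leaves the final judgement to the domain experts who validate the CV.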

Summary

Results

• input: 243 NMR terms & 152 GC terms

• output: 5,699 NMR terms & 2,612 GC terms

The End
