The Semantic Web: New-style data-integration (and how it works for life-scientists too!)

Post on 02-Jan-2016


The Semantic Web: New-style data-integration

(and how it works for life-scientists too!)

Frank van Harmelen, AI Department

Vrije Universiteit Amsterdam

What’s the problem?

(data-mess in bio-inf)

Life Science Data

Recent focus on genetic data

“genomics: the study of genes and their function. Recent advances in genomics are bringing about a revolution in our understanding of the molecular mechanisms of disease, including the complex interplay of genetic and environmental factors. Genomics is also stimulating the discovery of breakthrough healthcare products by revealing thousands of new biological targets for the development of drugs, and by giving scientists innovative ways to design new drugs, vaccines and DNA diagnostics. Genomics-based therapeutics include "traditional" small chemical drugs, protein drugs, and potentially gene therapy.”

The Pharmaceutical Research and Manufacturers of America - http://www.phrma.org/genomics/lexicon/g.html

Study of genes and their function

Understanding molecular mechanisms of disease

Development of drugs, vaccines, and diagnostics

Kenneth Griffiths and Richard Resnick, Tut. at Intell. Systems for Molec. Biol., 2003

The Study of Genes...

• Chromosomal location

• Sequence

• Sequence Variation

• Splicing

• Protein Sequence

• Protein Structure

… and Their Function

• Homology

• Motifs

• Publications

• Expression

• HTS

• In Vivo/Vitro Functional Characterization

Understanding Mechanisms of Disease

Metabolic and regulatory pathway induction

Development of Drugs, Vaccines, Diagnostics

Differing types of Drugs, Vaccines, and Diagnostics
• Small molecules
• Protein therapeutics
• Gene therapy
• In vitro, In vivo diagnostics

Development requires
• Preclinical research
• Clinical trials
• Long-term clinical research

All of which often feeds back into ongoing Genomics research and discovery.

The Industry’s Problem

Too much unintegrated data:
– from a variety of incompatible sources
– no standard naming convention
– each with a custom browsing and querying mechanism (no common interface)
– and poor interaction with other data sources

What are the Data Sources?

• Flat Files
• URLs
• Proprietary Databases
• Public Databases
• Data Marts
• Spreadsheets
• Emails
• …

Sample Problem: Hyperprolactinemia

Overproduction of prolactin
– prolactin stimulates mammary gland development and milk production

Hyperprolactinemia is characterized by:
– inappropriate milk production
– disruption of menstrual cycle
– can lead to conception difficulty

Understanding transcription factors for prolactin production

“Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”

Q1 (SEQUENCE): “Show me all genes that are homologous to known transcription factors”

Q2 (EXPRESSION): “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells”

Q3 (LITERATURE): “Show me all genes in the public literature that are putatively related to hyperprolactinemia”

(Q1 ∩ Q2 ∩ Q3)
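The combined question amounts to intersecting the answers of the three per-source queries. A minimal sketch of that step, assuming each source has already returned a set of gene identifiers; the identifiers and the function name `candidate_genes` are made up for illustration and do not come from the slides:

```python
# Hypothetical gene identifiers; each set stands in for one source's answer.
q1_sequence = {"GENE_A", "GENE_B", "GENE_C"}    # homologous to known transcription factors
q2_expression = {"GENE_B", "GENE_C", "GENE_D"}  # more than 3-fold expression differential
q3_literature = {"GENE_C", "GENE_B", "GENE_E"}  # related to hyperprolactinemia in the literature


def candidate_genes(*result_sets):
    """Return the genes present in every result set (the intersection)."""
    if not result_sets:
        return set()
    return set.intersection(*map(set, result_sets))


candidates = candidate_genes(q1_sequence, q2_expression, q3_literature)
print(sorted(candidates))  # ['GENE_B', 'GENE_C']
```

The intersection itself is trivial; the hard part in practice is that the three sources rarely use the same identifier for the same gene, which is the integration problem the following slides address.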

The Complexity of Biological Data

Source: PhRMA & FDA 2003

Pharmaceutical Productivity

Stitching this all together by hand?

Source: Stephens et al. J Web Semantics 2006

The Medical tower of Babel

• MeSH: Medical Subject Headings, National Library of Medicine, 22.000 descriptions
• EMTREE: Commercial (Elsevier), drugs and diseases, 45.000 terms, 190.000 synonyms
• UMLS: Integrates 100 different vocabularies
• SNOMED: 200.000 concepts, College of American Pathologists
• Gene Ontology: 15.000 terms in molecular biology
• NCI Cancer Ontology: 17,000 classes (about 1M definitions)

Problem with the Current WWW

Why would Semantic Web technology help?

machine accessible meaning (What it’s like to be a machine)

<name>

<symptoms>

<drug>

<drugadministration>

<disease>

<treatment>

IS-A

alleviates

META-DATA

What is meta-data?

• it's just data
• it's data describing other data
• it's meant for machine consumption

disease

name

symptoms

drug

administration

Required are:
1. one or more standard vocabularies, so search engines, producers and consumers all speak the same language
2. a standard syntax, so meta-data can be recognised as such
3. lots of resources with meta-data attached

mechanisms for attribution and trust

is this page really about Pamela Anderson?

no shared understanding

Conceptual and terminological confusion

Actors: both humans and machines

Agree on a conceptualization

Make it explicit in some language.

world

concept

language

What are ontologies & what are they used for

standard vocabularies (“Ontologies”)
• Identify the key concepts in a domain
• Identify a vocabulary for these concepts
• Identify relations between these concepts
• Make these precise enough so that they can be shared between
  – humans and humans
  – humans and machines
  – machines and machines

Shared content-vocabularies: Ontologies

Formal, explicit specification of a shared conceptualisation
• Formal: machine processable
• explicit specification: concepts, properties, relations, functions
• shared: consensual knowledge
• conceptualisation: abstract model of some domain

Biomedical ontologies (a few..)

• MeSH: Medical Subject Headings, National Library of Medicine, 22.000 descriptions
• EMTREE: Commercial (Elsevier), drugs and diseases, 45.000 terms, 190.000 synonyms
• UMLS: Integrates 100 different vocabularies
• SNOMED: 200.000 concepts, College of American Pathologists
• Gene Ontology: 15.000 terms in molecular biology
• NCI Cancer Ontology: 17,000 classes (about 1M definitions)

What’s inside an ontology?

• terms + specialisation hierarchy
• classes + class-hierarchy
• instances
• slots/values
• inheritance (multiple? defaults?)
• restrictions on slots (type, cardinality)
• properties of slots (symm., trans., …)
• relations between classes (disjoint, covers)
• reasoning tasks: classification, subsumption

Increasing semantic “weight”

NB: we’re not doing philosophy

Ontologies are not
• definitive descriptions of what exists in the world (= philosophy)

Ontologies are
• models of the world, constructed to facilitate communication

Yes, ontologies exist (because we build them)

Remember “required are”:
1. one or more standard vocabularies, so search engines, producers and consumers all speak the same language
2. a standard syntax, so meta-data can be recognised as such
3. lots of resources with meta-data attached

Stack of languages

• XML: Surface syntax, no semantics
• XML Schema: Describes structure of XML documents
• RDF: Datamodel for “relations” between “things”
• RDF Schema: RDF Vocabulary Definition Language
• OWL: A more expressive Vocabulary Definition Language

RDF Triples in Life Sciences

Bluffer’s guide to RDF (1)

• Object --Attribute-> Value triples
• objects are web-resources
• Value is again an Object:
  – triples can be linked
  – data-model = graph

pers05 --Author-of-> ISBN...

pers05 --Author-of-> ISBN... --Publ-by-> MIT
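The point that linked triples form a graph can be made concrete in a few lines. This is only a sketch using plain Python tuples, not an RDF library; `isbn:0000` is a placeholder for the ISBN elided on the slide, and `build_graph` and `follow` are illustrative helper names:

```python
from collections import defaultdict

# Placeholder identifiers: "isbn:0000" stands in for the elided ISBN.
triples = [
    ("pers05", "Author-of", "isbn:0000"),
    ("isbn:0000", "Publ-by", "MIT"),
]


def build_graph(triples):
    """Index triples by subject so edges can be followed node to node."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def follow(graph, start, path):
    """Follow a sequence of predicates from `start`; return all nodes reached."""
    frontier = {start}
    for predicate in path:
        frontier = {obj
                    for node in frontier
                    for pred, obj in graph.get(node, [])
                    if pred == predicate}
    return frontier


graph = build_graph(triples)
print(follow(graph, "pers05", ["Author-of", "Publ-by"]))  # {'MIT'}
```

Because the value of the first triple is the object of the second, the question "who published what pers05 wrote?" becomes a two-step walk over the graph.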

Bluffer’s guide to RDF (2)

• Every identifier is a URL = world-wide unique naming!
• Has XML syntax
• Any statement can be an object
  – graphs can be nested

NYT --claims-> (pers05 --Author-of-> ISBN...)

<rdf:Description rdf:about="#pers05">
  <authorOf>ISBN...</authorOf>
</rdf:Description>
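The idea that a statement can itself be the object of another statement can be sketched with nested tuples. RDF expresses this through its reification vocabulary; the nested `Statement` type below is only an illustrative stand-in, and `isbn:0000` is again a placeholder for the elided ISBN:

```python
from typing import NamedTuple, Union


class Statement(NamedTuple):
    """One triple; subject or obj may itself be a Statement."""
    subject: Union[str, "Statement"]
    predicate: str
    obj: Union[str, "Statement"]


authorship = Statement("pers05", "Author-of", "isbn:0000")
claim = Statement("NYT", "claims", authorship)  # a statement about a statement


def depth(statement):
    """Nesting depth: 1 for a plain triple, 2 for a statement about a triple, and so on."""
    inner = [part for part in (statement.subject, statement.obj)
             if isinstance(part, Statement)]
    return 1 + max((depth(part) for part in inner), default=0)


print(depth(authorship), depth(claim))  # 1 2
```

Keeping the claim separate from the claimed triple matters for attribution and trust: the data records that NYT says pers05 wrote the book, without asserting that it is true.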

What does RDF Schema add?

• Defines vocabulary for RDF
• Organizes this vocabulary in a typed hierarchy
• Class, subClassOf, type
• Property, subPropertyOf
• domain, range

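[Figure: RDF Schema example. Classes Person, Teacher and Student, with Teacher and Student each subClassOf Person; individuals Marta and Frank, each with a type; property supervises with a domain and a range, used between the two individuals.]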
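What the schema buys you is inference: from one `supervises` fact plus the schema, types can be derived that were never stated. The sketch below assumes a reading of the figure in which Frank supervises Marta, with `supervises` having domain Teacher and range Student; that direction is an assumption, as is the single-parent class hierarchy:

```python
# Schema (assumed reading of the figure): both classes sit under Person.
SUBCLASS_OF = {"Teacher": "Person", "Student": "Person"}
DOMAIN = {"supervises": "Teacher"}  # the subject of supervises is a Teacher
RANGE = {"supervises": "Student"}   # the object of supervises is a Student

facts = [("Frank", "supervises", "Marta")]


def superclasses(cls):
    """Yield cls and every class above it (single-parent hierarchy assumed)."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def infer_types(facts):
    """Derive type facts from domain/range declarations and subClassOf."""
    types = {}
    for subject, prop, obj in facts:
        if prop in DOMAIN:
            types.setdefault(subject, set()).update(superclasses(DOMAIN[prop]))
        if prop in RANGE:
            types.setdefault(obj, set()).update(superclasses(RANGE[prop]))
    return types


for individual, classes in sorted(infer_types(facts).items()):
    print(individual, sorted(classes))
# Frank ['Person', 'Teacher']
# Marta ['Person', 'Student']
```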


OWL: things RDF Schema can’t do

• equality
• enumeration
• number restrictions
  – Single-valued/multi-valued
  – Optional/required values
• inverse, symmetric, transitive
• boolean algebra
  – Union, complement…
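Of these, the property characteristics (inverse, symmetric, transitive) are the easiest to illustrate. A minimal sketch that computes the triples entailed by such declarations; the property names `partOf`, `hasPart` and `interactsWith` and the data are invented for the example, and a real OWL reasoner does far more than this:

```python
def owl_closure(triples, inverse=None, symmetric=(), transitive=()):
    """Add triples entailed by inverse, symmetric and transitive declarations."""
    inverse = dict(inverse or {})
    # An inverse declaration works in both directions.
    inverse.update({q: p for p, q in list(inverse.items())})
    known = set(triples)
    while True:
        new = set()
        for s, p, o in known:
            if p in inverse:
                new.add((o, inverse[p], s))
            if p in symmetric:
                new.add((o, p, s))
            if p in transitive:
                new.update((s, p, o2) for s2, p2, o2 in known
                           if p2 == p and s2 == o)
        if new <= known:  # nothing left to derive
            return known
        known |= new


# Made-up facts for illustration.
stated = {
    ("nucleus", "partOf", "cell"),
    ("cell", "partOf", "tissue"),
    ("geneA", "interactsWith", "geneB"),
}
closure = owl_closure(
    stated,
    inverse={"partOf": "hasPart"},
    symmetric={"interactsWith"},
    transitive={"partOf"},
)
print(("tissue", "hasPart", "nucleus") in closure)  # True
```

Three stated facts yield eight once the declarations are applied, which is the sense in which OWL carries more meaning than RDF Schema.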

OWL: more expressivity (Full, DL, Lite)

• OWL Full: allows meta-classes etc.
• OWL DL: negation, disjunction, full cardinality, enumerated types
• OWL Lite: (sub)classes, individuals, (sub)properties, domain, range, conjunction, (in)equality, cardinality 0/1, datatypes, inverse, transitive, symmetric, hasValue, someValuesFrom, allValuesFrom

RDF Schema


Question: who writes the ontologies?

• Professional bodies, scientific communities, companies, publishers, …
  – See previous slide on Biomedical ontologies
  – Same developments in many other fields
• Good old fashioned Knowledge Engineering
• Convert from DB-schema, UML, etc.

Question: Who writes the meta-data?

• Automated learning
  – shallow natural language analysis
  – Concept extraction

Example: Encyclopedia Britannica on “Amsterdam”

[Figure: concepts extracted from the entry: amsterdam, trade, antwerp, europe, merchant, city, town, center, netherlands]

• exploit existing legacy-data
  – Amazon
  – Lab equipment?
• side-effect from user interaction
  – MIT Lab photo-annotator
• NOT from manual effort
• Web 2.0 community/social interaction


Some working examples?

• DOPE
• HCLS (http://www.w3.org/2001/sw/hcls/)

DOPE: Background

• Vertical Information Provision
  – Buy a topic instead of a Journal!
  – Web provides new opportunities
• Business driver: drug development
  – Rich, information-hungry market
  – Good thesaurus (EMTREE)

The Data

• Document repositories:
  – ScienceDirect: approx. 500.000 fulltext articles
  – MEDLINE: approx. 10.000.000 abstracts
• Extracted Metadata
  – The Collexis Metadata Server: concept-extraction ("semantic fingerprinting")
• Thesauri and Ontologies
  – EMTREE: 60.000 preferred terms, 200.000 synonyms

Architecture:

[Figure: a query interface on top of an RDF Schema layer (EMTREE), which sits over RDF views of Datasource 1 … Datasource n]

Architecture:

[Figure: component diagram]
• GUI: Spectacle (Aduna)
• Mediator: Sesame (Aduna)
• Metadata Server (Collexis)
• EMTREE Thesaurus (RDFS)
• Document Model (RDFS), Source Model (RDF)
• Connections: http requests, Java Client, SOAP, SeRQL
• Additional Source of Data: Source Model (RDF), SeRQL, Gene Thesaurus (RDFS)
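The role of the mediator in this architecture can be sketched as follows: normalise the user's term through the thesaurus, then put the same question to every source. This is an illustration only and does not use the Sesame, SeRQL or Collexis interfaces; the thesaurus entries, source names and documents are all invented:

```python
# Hypothetical thesaurus: synonym -> preferred term (EMTREE plays this role in DOPE).
THESAURUS = {
    "aspirin": "acetylsalicylic acid",
    "asa": "acetylsalicylic acid",
    "acetylsalicylic acid": "acetylsalicylic acid",
}

# Hypothetical sources, each exposing (document, property, value) triples.
SOURCES = {
    "source_1": [("doc1", "about", "acetylsalicylic acid"),
                 ("doc2", "about", "prolactin")],
    "source_n": [("doc9", "about", "acetylsalicylic acid")],
}


def preferred_term(term):
    """Map a user term to the thesaurus's preferred term, if it has one."""
    return THESAURUS.get(term.strip().lower(), term)


def mediate(term, sources=SOURCES):
    """Ask every source for documents about the preferred form of `term`."""
    wanted = preferred_term(term)
    hits = []
    for name, triples in sources.items():
        for doc, prop, value in triples:
            if prop == "about" and value == wanted:
                hits.append((name, doc))
    return sorted(hits)


print(mediate("Aspirin"))  # [('source_1', 'doc1'), ('source_n', 'doc9')]
```

The user never needs to know which vocabulary each source uses; adding another source means adding another entry behind the mediator, not another query interface.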

Summarising…

• Data integration on the Web:
  – machine processable data besides human processable data
• Syntax for meta-data
  – XML (not much meaning)
  – RDF (some meaning)
  – RDF Schema (some meaning)
  – OWL (more meaning)
• Vocabularies for meta-data
  – Lots of them in bio-inf.
• Actual meta-data:
  – Lots in bio-inf.
• Will enable:
  – Better search engines (recall, precision, concepts)
  – Combining information across pages (inference)
  – …

Things to do for you

• Practical: Use existing software to construct new use-scenarios
• Conceptual: Create an ontology for some area of bio-medical expertise
  – from scratch
  – as a refinement of an existing ontology
• Technical: Transform an existing data-set into meta-data format, and provide a query interface (for humans and machines)
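As a starting point for the technical task, a tabular data-set can be turned into triples with very little code. The sketch below assumes a CSV file with one row per gene; the data, the column names and the `ex:` prefix are invented, and a real solution would emit RDF with proper URLs and expose it through a standard query language:

```python
import csv
import io

# Made-up data set; column names and the "ex:" prefix are illustrative.
CSV_DATA = """gene,organism,fold_change
GENE_A,human,3.5
GENE_B,mouse,1.2
"""


def rows_to_triples(text, key="gene", prefix="ex:"):
    """Turn each CSV row into (subject, property, value) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = prefix + row[key]
        for column, value in row.items():
            if column != key:
                triples.append((subject, prefix + column, value))
    return triples


def query(triples, subject=None, prop=None, value=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if subject in (None, t[0])
            and prop in (None, t[1])
            and value in (None, t[2])]


triples = rows_to_triples(CSV_DATA)
print(query(triples, prop="ex:organism", value="human"))
# [('ex:GENE_A', 'ex:organism', 'human')]
```

The pattern-matching `query` function is the machine-facing interface; a human-facing one would be a form or search box built on top of the same call.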