Large-scale integration of data and text
Lars Juhl Jensen
association networks
guilt by association
Korbel et al., Nature Biotechnology, 2004
gene neighborhood
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
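A minimal sketch of how phylogenetic profiles can be compared, assuming each gene is represented by the set of genomes in which it occurs. Jaccard similarity is used here as a simple stand-in measure, not necessarily the one used in the cited work, and the gene and genome names are hypothetical.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity between two sets of genomes containing a gene."""
    a, b = set(profile_a), set(profile_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical presence data: gene -> genomes in which an ortholog is found.
profiles = {
    "geneA": {"sp1", "sp2", "sp4"},
    "geneB": {"sp1", "sp2", "sp4"},
    "geneC": {"sp3"},
}

# Genes with identical profiles score 1.0; genes never seen together score 0.0.
same = jaccard(profiles["geneA"], profiles["geneB"])
different = jaccard(profiles["geneA"], profiles["geneC"])
```

Genes with highly similar profiles are candidates for functional association under the guilt-by-association idea.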
Cell
Cellulosomes
Cellulose
experimental data
gene coexpression
physical interactions
Jensen & Bork, Science, 2008
genetic interactions
Beyer et al., Nature Reviews Genetics, 2007
curated knowledge
Letunic & Bork, Trends in Biochemical Sciences, 2008
different formats
different identifiers
phylogenetic profiles
affinity purification
von Mering et al., Nucleic Acids Research, 2005
score calibration
von Mering et al., Nucleic Acids Research, 2005
implicit weighting by quality
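One way to picture score calibration, sketched under the assumption that a gold standard of known associations is available: rank pairs by raw score, cut the ranking into bins, and record the fraction of correct pairs in each bin. The function name, the binning scheme, and the data are illustrative assumptions, not the published procedure.

```python
def calibrate(scored_pairs, n_bins=2):
    """Map raw scores to the observed fraction of true pairs per bin.

    scored_pairs: list of (raw_score, is_true) tuples, where is_true comes
    from an assumed gold standard.
    Returns a list of (lowest raw score in bin, fraction true in bin).
    """
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    size = max(1, len(ranked) // n_bins)
    table = []
    for start in range(0, len(ranked), size):
        chunk = ranked[start:start + size]
        fraction_true = sum(1 for _, ok in chunk if ok) / len(chunk)
        table.append((chunk[-1][0], fraction_true))
    return table


# Hypothetical raw scores with gold-standard labels.
pairs = [
    (0.9, True), (0.8, True), (0.7, True), (0.6, False),
    (0.3, False), (0.2, True), (0.1, False), (0.05, False),
]
table = calibrate(pairs, n_bins=2)
```

Because every evidence type is mapped onto the same scale, a noisy data source ends up with low calibrated scores, which is the sense in which quality is weighted implicitly.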
homology-based transfer
orthologous groups
Franceschini et al., Nucleic Acids Research, 2013
Exercise 1
Query STRING for human TYMS
Show network in confidence mode
Show up to 20 interaction partners
Show only experimental evidence
Show also low-confidence links
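The same query can also be expressed as a URL rather than through the web interface. The sketch below only builds the URL and does not contact the server. The endpoint and parameter names follow the STRING REST API as best recalled and should be checked against its documentation; the taxon 9606 for human and the cutoff of 150 for low-confidence links are assumptions. Restricting to experimental evidence is not encoded here.

```python
from urllib.parse import urlencode


def string_partners_url(gene, species=9606, limit=20, required_score=150):
    """Build a URL asking STRING for the interaction partners of one gene.

    Endpoint and parameter names are assumptions to verify against the
    STRING API documentation.
    """
    params = {
        "identifiers": gene,
        "species": species,                # assumed NCBI taxon ID for human
        "limit": limit,                    # up to 20 interaction partners
        "required_score": required_score,  # assumed low-confidence cutoff
    }
    return "https://string-db.org/api/tsv/interaction_partners?" + urlencode(params)


url = string_partners_url("TYMS")
```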
exponential growth
~40 seconds per paper
as smart as a dog
teach it specific tricks
named entity recognition
comprehensive lexicon
cyclin dependent kinase 1
orthographic variation
prefixes and suffixes
flexible matching
spaces and hyphens
cyclin dependent kinase 1
cyclin-dependent kinase 1
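A minimal sketch of flexible matching, assuming the lexicon entry is split into tokens and any run of spaces or hyphens between tokens is treated as optional. The function name and the word-boundary rule are illustrative choices, not a description of the actual tagger.

```python
import re


def flexible_pattern(name):
    """Compile a regex that matches a name regardless of spaces and hyphens."""
    tokens = re.split(r"[\s-]+", name.strip())
    # Allow any mix of spaces and hyphens (or none) between tokens.
    body = r"[\s-]*".join(re.escape(token) for token in tokens)
    # Refuse matches that continue into a longer word or number.
    return re.compile(
        r"(?<![A-Za-z0-9])" + body + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


pattern = flexible_pattern("cyclin dependent kinase 1")
hyphenated = pattern.search("the cyclin-dependent kinase 1 gene")
spaced = pattern.search("Cyclin dependent kinase 1 is active")
longer_name = pattern.search("cyclin dependent kinase 10")
```

Both spellings from the slide match the single lexicon entry, while the trailing lookahead keeps "kinase 1" from matching inside "kinase 10".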
~22 million abstracts
~2 million full-text articles
restricted access
information extraction
within paragraphs
score calibration
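Co-occurrence within paragraphs can be sketched as a simple count, assuming named entity recognition has already produced the set of entities mentioned in each paragraph. The tagged paragraphs below are hypothetical, and the raw counts would still need to be calibrated into scores.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(paragraphs):
    """Count how many paragraphs mention each pair of entities together.

    paragraphs: iterable of collections of entity names, one per paragraph.
    """
    counts = Counter()
    for entities in paragraphs:
        # Sort so each pair has one canonical order; set() drops repeats.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Hypothetical output of the tagging step, one set per paragraph.
tagged = [
    {"CDK1", "CCNB1"},
    {"CDK1", "CCNB1", "TYMS"},
    {"TYMS"},
]
counts = cooccurrence_counts(tagged)
```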
NLP
Natural Language Processing
grammatical analysis
part-of-speech tagging
what you learned in school
pronoun pronoun verb preposition noun
semantic tagging
words of special interest
sentence parsing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
Saric et al., Proceedings of ACL, 2004
general approach
curated knowledge
experimental data
computational predictions
common identifiers
score calibration
protein networks
Szklarczyk et al., Nucleic Acids Research, 2015
string-db.org
chemical networks
Kuhn et al., Nucleic Acids Research, 2014
stitch-db.org
high-throughput screens
subcellular localization
Binder et al., Database, 2014
compartments.jensenlab.org
model organism databases
sequence-based predictions
tissue expression
tissues.jensenlab.org Santos et al., submitted, 2015
Brenda Tissue Ontology
high-throughput studies
mass spectrometry
immunohistochemistry
disease associations
diseases.jensenlab.org Frankild et al., Methods, 2015
Disease Ontology
genetics studies
Genetics Home Reference
NHGRI GWAS Catalog
cancer mutation data
Exercise 2
Find TYMS-related diseases
http://diseases.jensenlab.org
Find some inhibitors of TYMS
http://stitch-db.org
Assess their tissue specificity
http://tissues.jensenlab.org