Date post: | 15-Jul-2015 |
Category: |
Education |
Upload: | andrzej-stefan-czech |
What is bioinformatics?
Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines computer science, statistics, mathematics, and engineering to study and process biological data.
http://en.wikipedia.org/wiki/Bioinformatics
2015-03-23 3
A little bit of history
• 1951 – Sequencing peptide (Frederick Sanger)
• 1965 – Sequencing RNA (Robert Holley)
• 1970 – Term BIOINFORMATICS coined by Paulien Hogeweg & Ben Hesper
• 1977 – Sequencing DNA (Frederick Sanger)
• 1990 – Human Genome Project started (expected duration 15 years)
• 2003 – Human Genome Project completed
Cost of sequencing
Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125
Cost of sequencing & data analysis
Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125
Future of biological research
• With rapidly advancing automation, less human effort will be needed for sample preparation
• As the amount of information increases, data analysis will become more important
• The information output of experiments is growing beyond human capability: high-level summaries and statistics are needed
Different flavors of bioinformatics
• Sequence analysis
• Annotation
• Data analysis
• Computational biology
• Structural bioinformatics
• Systems biology
• Scientific programming
• Expression analysis
• Network biology
• Biostatistics
• Computational genomics
• Databases
High throughput sequencing
Platform | run time | read length | output per run
• 454 Roche | ~10 hours | 400bp | 500 Mbp
• Illumina | ~250 hours | 2x100bp | >150 Gbp
• PacBio | 14 hours | 1..10kbp | ~50 Mbp
• Ion Torrent | 4 hours | 400bp | <10 Gbp
• Nanopore | 1..? hours | 1..?bp | 1..?bp
High throughput sequencing
Billions of short sequence fragments (= ‘reads’)
800 Gb (~170 DVDs)
600.000.000.000 nucleotides = ~800.000x
Quality filtering and trimming
TAGCGCAATACTTTCTGTTAGCGCAAATCCTAGTAGTGCAT
CCATGTGTGGGTTGTGTTNNNNNNNNNNNNNNNNNNNNNNN
AGTGGTATCAACGCAGAGTACGGGGGACCTTNNNNNNNNNN
Quality filtering and trimming
TAGCGCAATACTTTCTGTTAGCGCAAATCCTAGTAGTGCAT
AGTGGTATCAACGCAGAGTACGGG
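The filtering and trimming step shown on these two slides can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: `trim_and_filter` and `min_length` are hypothetical names, and only trailing `N` calls are trimmed. Real pipelines also trim on per-base quality scores, which is presumably why the third read on the slide ends up shorter than what this sketch returns.

```python
def trim_and_filter(reads, min_length=20):
    """Trim trailing N calls and drop reads that end up too short.

    min_length is an assumed cut-off chosen for illustration only.
    """
    kept = []
    for read in reads:
        trimmed = read.rstrip("N")  # remove uncalled bases at the 3' end
        # Discard reads that are too short or still contain uncalled bases.
        if len(trimmed) >= min_length and "N" not in trimmed:
            kept.append(trimmed)
    return kept


reads = [
    "TAGCGCAATACTTTCTGTTAGCGCAAATCCTAGTAGTGCAT",
    "CCATGTGTGGGTTGTGTTNNNNNNNNNNNNNNNNNNNNNNN",
    "AGTGGTATCAACGCAGAGTACGGGGGACCTTNNNNNNNNNN",
]
kept = trim_and_filter(reads)
for read in kept:
    print(read)
```

The first read passes unchanged, the second is dropped because too little sequence remains, and the third keeps only its called bases.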
Sequence search (BLAST)
• BLAST is one of the most commonly used bioinformatics programs
• It finds short sub-sequences of your query in the subject sequence
• It matches words from the query against the subject database, then uses heuristics to verify and extend each match
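The word-matching idea can be illustrated with a small Python sketch. It is not BLAST itself: the function names `split_into_words` and `find_word_hits` are hypothetical, the word size of 4 is taken from the example slides, and only exact word hits are reported (BLAST additionally scores neighbouring words and extends the hits).

```python
from collections import defaultdict


def split_into_words(sequence, word_size):
    """Return all overlapping words (k-mers) of the given size."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


def find_word_hits(query, subject, word_size=4):
    """Map each query word to the subject positions where it occurs exactly."""
    index = defaultdict(list)
    for position, word in enumerate(split_into_words(subject, word_size)):
        index[word].append(position)
    return {word: index.get(word, [])
            for word in split_into_words(query, word_size)}


# Query and subject taken from the sequence search example slides.
query = "AGAGATATGGT"
subject = "GAGATGGTATCAACGCAGAGATCTGGTGTT"
hits = find_word_hits(query, subject)
for word, positions in hits.items():
    print(word, positions)
```

Words with an empty position list have no exact match in the subject; the remaining hits are the seeds that a BLAST-like tool would try to extend.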
Sequence search
Subject: GAGATGGTATCAACGCAGAGATCTGGTGTT
Query: AGAGATATGGT
Query words (length 4): AGAG, GAGA, AGAT, GATA, ATAT, TATG, ATGG, TGGT
Sequence search
HSP = high-scoring segment pair
Still possible HSP (with a gap), score 0+1+1+1+1-3-2+1+1+1+1 = 3:
AGAGATATGGT
 ||||  ||||
 GAGA--TGGT
Better HSP (no gap), score 1+1+1+1+1+1-1+1+1+1+1 = 9:
AGAGATATGGT
|||||| ||||
AGAGATCTGGT
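The two HSP scores on this slide (3 and 9) can be reproduced with a short Python sketch. The scoring values are read off the slide and are assumptions about the scheme used: +1 for a match, -1 for a mismatch, -3 for opening a gap, -2 for extending it, and 0 for a position where one sequence overhangs the other (written here as a space). `score_alignment` is a hypothetical helper name.

```python
def score_alignment(query_row, subject_row, match=1, mismatch=-1,
                    gap_open=-3, gap_extend=-2):
    """Score two aligned rows of equal length.

    '-' marks a gap, ' ' marks an unaligned overhang that scores 0.
    """
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for q, s in zip(query_row, subject_row):
        if q == " " or s == " ":
            in_gap = False
            continue  # overhang: contributes nothing
        if q == "-" or s == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
            continue
        in_gap = False
        score += match if q == s else mismatch
    return score


gapped = score_alignment("AGAGATATGGT", " GAGA--TGGT")
ungapped = score_alignment("AGAGATATGGT", "AGAGATCTGGT")
print(gapped, ungapped)
```

The ungapped hit against the later part of the subject scores higher, which is why it is the better HSP.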
Sequence/genome alignment
• Global alignment
– global optimization that "forces" the alignment to span the entire length of both sequences (Needleman–Wunsch algorithm or Clustal style)
• Local alignment
– identifies short regions of similarity within long divergent sequences (Smith–Waterman algorithm or BLAST style)
Sequence alignment
• Global alignment
FTFTALILLAVAV
|  ||| ||| ||
F--TAL-LLA-AV
• Local alignment
FTFTALILL-AVAV
  |||| || ||
--FTAL-LLAAV--
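The difference between the two approaches shows up clearly in the dynamic programming recurrences. The sketch below computes only the optimal scores, not the alignments themselves, and assumes a simple scheme that is not taken from the slides: +1 match, -1 mismatch, -1 per gap position. `nw_score` and `sw_score` are hypothetical names.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning a and b end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def sw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score of any pair of sub-sequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets a local alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Sequences from the alignment example slide.
global_score = nw_score("FTFTALILLAVAV", "FTALLLAAV")
local_score = sw_score("FTFTALILLAVAV", "FTALLLAAV")
print(global_score, local_score)
```

The only structural differences are the initialisation of the first row and column and the floor at 0, yet they are what turn a global alignment into a local one.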
Genome alignment
• Glocal alignment
• Uses a word matching method
• Creates a suffix tree for faster search
• Searches the suffix tree for exact word matches, clusters them, and then uses local alignment methods to extend the match
Assembly
• Short read assembly is extremely difficult and computationally intensive!
• For longer reads, Overlap-Layout-Consensus (OLC) assemblers are used
• For large numbers of shorter reads, De Bruijn graph assemblers work better
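To give an idea of what a De Bruijn graph assembler starts from, here is a minimal Python sketch that builds the graph from a set of reads. Nodes are (k-1)-mers and every k-mer found in a read adds an edge from its prefix to its suffix. The function name `de_bruijn_graph`, the toy reads and the choice k=3 are assumptions for illustration; real assemblers also handle sequencing errors, reverse complements and graph simplification.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a De Bruijn graph: (k-1)-mer -> set of following (k-1)-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


# Two overlapping toy reads (hypothetical data).
reads = ["ACGTA", "GTACC"]
graph = de_bruijn_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because overlapping reads share k-mers, they automatically share nodes and edges, so no pairwise read comparison is needed. This is what makes the approach attractive for very large numbers of short reads.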
Source: Commins, Toft & Fares (CC BY-SA 2.5)
Source: Schatz, Delcher, Salzberg, http://www.genome.org/cgi/doi/10.1101/gr.101360.109.
Genome annotation
• Prediction of:
– Genes
– Repeats
– Non-coding RNAs (rRNAs, tRNAs, miRNAs, snRNAs, siRNAs, ta-siRNAs)
– Promoters
– Enhancers
– Protein binding sites
– …
Genome annotation
Gene structure (5’ to 3’): 5’ UTR /Promoter/ | RBS | Start codon | Exon 1 | Intron | Exon 2 | Intron | Exon 3 | Stop codon | 3’ UTR
Genome annotation
ACCTCTCACTCTTTCTTCTCATCTTCTTCAATTATAACAACCTAACCATGTCTTTAGAACAAGAAT TTCTCTTCCTAATTCCTTCAACAATGGCCAATAATCTCAATCTCTTCCTTTGTTTCCTTCTCTTTATTTGTTTCTTCACTCTCTGCCTTAGCCCTGGTGGTCTAGCTTGGGCTCTAATTTCAAAGCCAAAAAACCAATCCATCATTCCTAGAATCAGCTTATGAGCTTTTGTTTCATAGAGCAATGGGGTTTGCTCCATTGGAGAGTGTGTTTGTGTTGGTGATGAAGAAAATAAGAACAGCCCTAATGATAATGAAGATGAAGATTTTGTTGATGTGTTGCTTGATTTGGAAAAGGAAAACAAACTCACCGACTCTGATATGATTGCTGTGTTAGGTACGTATGTGTATTATAATTTCTTGTTTCATTACTATTTTGATATTTTTCTACTGCACT TCAATTTTAATCGGTTTGAAATGATTTTTTAATATGCTCTTACAAGATTATGACTTGGGAAAGATTCTTACATCTTTAAATATTTCAATTTTTTGTGTGATACATGAAATGCATGACTGTTTTTTACTTGCGATTTACATGTTGAAATTTTCTTTACTTTGATATTCTATGTTTTTTTAACAAATTTTCTCTTAAATAAATGACATGTAGGAAATGACCTCCCAAATCTTCCTTATCTCCATGCAGTCGTCAAAGAGACTCTTAGAAT GCACCCTCCCGGCCCACTTCTCTCTTGGGCACGCCTTGCCATCCATGACACCCATGTCGCAGGCCACTACATTCCTGCTGGCACCACTGCGATGGTCAACATGTGGGCCATAACCCACGACGACCAAACTGTGGCTCGCTCAGTTAGTTCATAAGTTCGAATGGGTTCAAGCTGATGAATCGAAAATCAAAGTGGATTTGTCTGAGTGTTTGAAGCTATCTCTGGAAATGAAACACCCTTTGATTTGTAGGGCTATCCCGAGGAATGTAGGGTTCGAGTCTCACCCTGATCATGCATGACAGATTAAAAAAAAAAAGAA AAGGCACATCTAGGGGAGCTTATTATGATATTATCATATGTTGAAAATTAAATGTGTTTGTTGCTTTCTTTTCTTTTTTTCTTTTTCCTTTCTTCTTTCTCTTAATCAATTGATATTATATCTTGTGTGGAACAAATAGTATCGGATTCGAGATTTAATGTTGGGATAATCCTTAAATGTAATTCCGTTATTAAGTGTGAA
Genome annotation
ACCTCTCACTCTTTCTTCTCATCTTCTTCAATTATAACAACCTAACC[ATG]TCTTTAGAACAAGAATTTCTCTTCCTAATTCCTTCAACAATGGCCAATAATCTCAATCTCTTCCTTTGTTTCCTTCTCTTTATTTGTTTCTTCACTCTCTGCCTTAGCCCTGGTGGTCTAGCTTGGGCTCTAATTTCAAAGCCAAAAAACCAATCCATCATTCCTAGAATCAGCTTATGAGCTTTTGTTTCATAGAGCAATGGGGTTTGCTCCATTGGAGAGTGTGTTTGTGTTGGTGATGAAGAAAATAAGAACAGCCCTAATGATAATGAAGATGAAGATTTTGTTGATGTGTTGCTTGATTTGGAAAAGGAAAACAAACTCACCGACTCTGATATGATTGCTGTGTTAG|GTACGTATGTGTATTATAATTTCTTGTTTCATTACTATTTTGATATTTTTCTACTGCACTTCAATTTTAATCGGTTTGAAATGATTTTTTAATATGCTCTTACAAGATTATGACTTGGGAAAGATTCTTACATCTTTAAATATTTCAATTTTTTGTGTGATACATGAAATGCATGACTGTTTTTTACTTGCGATTTACATGTTGAAATTTTCTTTACTTTGATATTCTATGTTTTTTTAACAAATTTTCTCTTAAATAAATGACATGTAG|GAAATGACCTCCCAAATCTTCCTTATCTCCATGCAGTCGTCAAAGAGACTCTTAGAATGCACCCTCCCGGCCCACTTCTCTCTTGGGCACGCCTTGCCATCCATGACACCCATGTCGCAGGCCACTACATTCCTGCTGGCACCACTGCGATGGTCAACATGTGGGCCATAACCCACGACGACCAAACTGTGGCTCGCTCAGTTAGTTCATAAGTTCGAATGGGTTCAAGCTGATGAATCGAAAATCAAAGTGGATTTGTCTGAGTGTTTGAAGCTATCTCTGGAAATGAAACACCCTTTGATTTGTAGGGCTATCCCGAGGAATGTAGGGTTCGAGTCTCACCC[TGA]TCATGCATGACAGATTAAAAAAAAAAAGAAAAGGCACATCTAGGGGAGCTTATTATGATATTATCATATGTTGAAAATTAAATGTGTTTGTTGCTTTCTTTTCTTTTTTTCTTTTTCCTTTCTTCTTTCTCTTAATCAATTGATATTATATCTTGTGTGGAACAAATAGTATCGGATTCGAGATTTAATGTTGGGATAATCCTTAAATGTAATTCCGTTATTAAGTGTGAA
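In the annotated sequence above, `[ATG]` marks the start codon, `[TGA]` the stop codon and each `|` an exon/intron boundary. Assuming that reading of the markup, the coding sequence can be recovered by cutting out the introns, as in the Python sketch below. `spliced_cds` is a hypothetical helper, and the short toy sequence stands in for the full example.

```python
def spliced_cds(annotated):
    """Return the coding sequence from a sequence annotated with
    [start codon], [stop codon] and '|' exon/intron boundaries."""
    start = annotated.index("[")                      # start codon bracket
    stop = annotated.index("]", annotated.rindex("["))  # end of stop codon
    body = annotated[start:stop + 1].replace("[", "").replace("]", "")
    parts = body.split("|")
    # Segments alternate exon, intron, exon, ...; keep only the exons.
    return "".join(parts[0::2])


# Toy gene: one intron (GT...AG) between two exons.
toy = "CC[ATG]GCTAAG|GTAAGTCTTTCAG|GAATGC[TGA]TT"
cds = spliced_cds(toy)
codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
print(cds)
print(codons)
```

Gene prediction tools do the reverse: they start from the raw sequence and have to infer where these marks belong.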
PDB and structural information
• Protein Data Bank holds information about structure of proteins, nucleic acids and complexes – over 100 000 entries!
• The 3D structure can be resolved by:
– X-ray diffraction
– NMR
– Electron microscopy
– Simulations
PDB and structural information
HEADER TRANSCRIPTION 18-MAR-04 1VD4
TITLE SOLUTION STRUCTURE OF THE ZINC FINGER DOMAIN OF TFIIE ALPHA
COMPND 2 MOLECULE: TRANSCRIPTION INITIATION FACTOR IIE, ALPHA
COMPND 8 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE 10 EXPRESSION_SYSTEM_PLASMID: PET11D
KEYWDS ZINC FINGER, TRANSCRIPTION
EXPDTA SOLUTION NMR
NUMMDL 20
AUTHOR M.OKUDA,A.TANAKA,Y.ARAI,M.SATOH,H.OKAMURA,A.NAGADOI,
REMARK 500 CHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400
REMARK 500
REMARK 500 M RES CSSEQI PSI PHI
REMARK 500 1 GLU A 118 -36.12 -163.20
REMARK 500 1 ARG A 119 -92.03 -138.92
REMARK 500 1 THR A 122 -70.74 -110.33
SITE 1 AC1 5 CYS A 129 CYS A 132 CYS A 154 CYS A 157
SITE 2 AC1 5 THR A 159
CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 1.000000 0.000000 0.000000 0.00000
SCALE3 0.000000 0.000000 1.000000 0.00000
MODEL 1
ATOM 1 N ARG A 113 1.980 -19.277 -19.127 1.00 0.00 N
ATOM 2 CA ARG A 113 1.202 -19.280 -17.853 1.00 0.00 C
ATOM 3 C ARG A 113 0.666 -17.875 -17.557 1.00 0.00 C
ATOM 4 O ARG A 113 0.625 -17.023 -18.421 1.00 0.00 O
ATOM 5 CB ARG A 113 2.199 -19.713 -16.778 1.00 0.00 C
ATOM 6 CG ARG A 113 2.435 -21.222 -16.875 1.00 0.00 C
ATOM 7 CD ARG A 113 3.604 -21.619 -15.971 1.00 0.00 C
ATOM 8 NE ARG A 113 2.986 -21.899 -14.645 1.00 0.00 N
ATOM 9 CZ ARG A 113 3.125 -23.073 -14.094 1.00 0.00 C
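The ATOM records shown above can be read with a few lines of Python. Note the assumption: official PDB files use fixed-width columns and are normally parsed by column position, but the spacing is lost in the text shown here, so this sketch splits on whitespace instead. `parse_atom_line` is a hypothetical helper and only handles records with exactly the twelve fields seen above.

```python
import math


def parse_atom_line(line):
    """Parse one whitespace-separated ATOM record into a dictionary."""
    fields = line.split()
    if len(fields) != 12 or fields[0] != "ATOM":
        raise ValueError("not a 12-field ATOM record: " + line)
    return {
        "serial": int(fields[1]),
        "atom": fields[2],
        "residue": fields[3],
        "chain": fields[4],
        "residue_number": int(fields[5]),
        "xyz": (float(fields[6]), float(fields[7]), float(fields[8])),
        "occupancy": float(fields[9]),
        "b_factor": float(fields[10]),
        "element": fields[11],
    }


# First two atoms of the entry shown above (backbone N and C-alpha of ARG 113).
n_atom = parse_atom_line("ATOM 1 N ARG A 113 1.980 -19.277 -19.127 1.00 0.00 N")
ca_atom = parse_atom_line("ATOM 2 CA ARG A 113 1.202 -19.280 -17.853 1.00 0.00 C")

# Distance between the two atoms, in the same units as the coordinates.
bond_length = math.dist(n_atom["xyz"], ca_atom["xyz"])
print(round(bond_length, 2))
```

Once the coordinates are available as numbers, distances, angles and contacts can all be computed directly from the file.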
Molecular networks
• Bioinformatics is needed to describe interactions between proteins, DNA, drugs…
• When thousands of interactions are analyzed, network science comes into use
• The set of all protein-protein interactions in a single cell is called the interactome
• A single interaction can be studied in vivo/in vitro, but a more complex network can only be investigated in silico
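In code, an interaction network is simply a graph. The Python sketch below stores a handful of interactions as an adjacency map and counts the number of partners (the degree) of each protein. The protein names P1 to P4 and the interaction list are made up for illustration, and `build_network` is a hypothetical helper.

```python
from collections import defaultdict


def build_network(interactions):
    """Turn a list of undirected interaction pairs into an adjacency map."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network


# Hypothetical protein-protein interactions.
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
network = build_network(interactions)

# Degree = number of interaction partners; highly connected nodes are hubs.
degrees = {protein: len(partners) for protein, partners in network.items()}
hub = max(degrees, key=degrees.get)
print(degrees)
print("hub:", hub)
```

With thousands of interactions the same representation still works, and measures such as degree, shortest paths or clusters can be computed automatically.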
Molecular networks
Source: Hennah & Porteous. PLoS ONE 4 (3): e4906. doi:10.1371/journal.pone.0004906
Metabolic pathways
• Bioinformatics is also useful for describing series of biochemical reactions that often happen in different cellular compartments
• Special (graph) databases had to be designed to describe pathways
• Modeling the flow of metabolites through a pathway is virtually impossible without computers
Simulation of biological systems
• Simulation of cell-cell interactions
• Description of interactions within a population
• Interactions between species
• Food chains => food web
• Social relations
• Evolution of populations
• Modeling in pharmacology
Databases
• Relational databases (SQL): MySQL, Microsoft SQL Server, SAP
• Non-relational databases (NoSQL): Cassandra, MongoDB, CouchDB
• Graph databases: Neo4j, Bio4j
• RDF: N-Triples, RDF/XML, Bio2RDF
Databases
• Different types of public resources are available:
– Sequence data: nucleic sequence, protein sequence, EST, genome
– Metadata/Ontologies: functional annotation, gene models, gene ontologies
– Structural data: protein structure, complex structure, RNA structure
– Variation data: SNP, SSR, indels
– Interactions, metabolic data, pathways
Text/data mining
• Obtaining information from several scientific resources is becoming more difficult as the volume of information grows
• The number of different resources/databases is growing, and a simple search has to be repeated for each of them
• Filtering relevant information is a big intellectual/computational burden
Text mining
• Retrieval, analysis and formatting (parsing) of information into searchable databases
• Recognition of patterns
• Recognition of natural language
• Extraction of semantic or grammatical relationships
• Coreference: terms that refer to the same object
Text mining example
• Query: Find promoters known to work in E. coli with the σ70 holoenzyme (Eσ70), also known as σD
• PREFIX sbol: <http://sbols.org/sbol.owl#>
  PREFIX pr: <http://partsregistry.org/#>
  SELECT DISTINCT ?name
  WHERE {
    ?part a sbol:Part ;
          sbol:status ?st ;
          sbol:name ?name ;
          sbol:dnaSequence ?seq ;
          a pr:promoter ;
          a ?cl .
    FILTER (?cl = pr:sigma70_ecoli_prokaryote_rnap
            && ?st != 'Deleted')
  }
Open source software
• Software that anyone can use, modify, share and distribute.
• Source code is known and can (should!) be modified to fit the user requirements
• Society driven development
• Dynamic development and early releases
• Security and transparency
How to become a bioinformatician?
• Get a computer with Linux
• Learn how to use the bash shell and how to run programs from the command line
• Learn to code in Python or Perl
• Try solving basic problems on
How to become a bioinformatician?
• Read blogs:
• Read fora for geeks:
• Get an account on:
Want to know more?
• Join my network on http://nl.linkedin.com/in/andrzejstefanczech
• Come to Wageningen for an internship at Genetwister Technologies B.V. http://www.genetwister.nl/
• Slides from this lecture are also available on SlideShare