Introduction to bioinformatics


Andrzej Stefan Czech

Leeuwarden

2015-03-23


What is bioinformatics?

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines computer science, statistics, mathematics, and engineering to study and process biological data.

http://en.wikipedia.org/wiki/Bioinformatics


A little bit of history

• 1951 – Sequencing peptide (Frederick Sanger)

• 1965 – Sequencing RNA (Robert Holley)

• 1970 – Term BIOINFORMATICS coined by Paulien Hogeweg & Ben Hesper

• 1977 – Sequencing DNA (Frederick Sanger)

• 1990 – Human Genome Project started (expected duration 15 years)

• 2003 – Human Genome Project completed


Why is bioinformatics so important?

• It’s all about money!

Cost of sequencing

Source: Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125

Cost of sequencing & data analysis

Source: Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125


Future of biological research

• With rapidly advancing automation, less human effort will be needed for sample preparation

• As the amount of information increases, data analysis will become more important

• The information output of experiments is growing beyond human capability: high-level summaries and statistics are needed


Different flavors of bioinformatics


• Sequence analysis

• Annotation

• Data analysis

• Computational biology

• Structural bioinformatics

• Systems biology

• Scientific programming

• Expression analysis

• Network biology

• Biostatistics

• Computational genomics

• Databases


BIOINFORMATICS IN SEQUENCE ANALYSIS


High throughput sequencing

Platform: run time | read length | output per run

• 454 Roche: ~10 hours | 400 bp | 500 Mbp

• Illumina: ~250 hours | 2x100 bp | >150 Gbp

• PacBio: 14 hours | 1..10 kbp | ~50 Mbp

• Ion Torrent: 4 hours | 400 bp | <10 Gbp

• Nanopore: 1..? hours | 1..? bp | 1..? bp


Billions of short sequence fragments (= ‘reads’)


[Figure: 600 000 000 000 nucleotides = 800 Gb = ~170 DVDs = ~800 000x]

Quality filtering and trimming

TAGCGCAATACTTTCTGTTAGCGCAAATCCTAGTAGTGCAT

CCATGTGTGGGTTGTGTTNNNNNNNNNNNNNNNNNNNNNNN

AGTGGTATCAACGCAGAGTACGGGGGACCTTNNNNNNNNNN




AGTGGTATCAACGCAGAGTACGGG
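The filtering and trimming illustrated by these reads can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: it strips the run of uncalled bases (N) from the end of each read and discards reads that end up too short. The function names and the `min_length=20` cutoff are assumptions made for this example.

```python
def trim_trailing_n(read: str) -> str:
    """Remove the run of uncalled bases (N) at the 3' end of a read."""
    return read.rstrip("N")


def filter_and_trim(reads, min_length=20):
    """Trim every read, then keep only those still at least min_length long.

    min_length is an arbitrary cutoff chosen for this sketch.
    """
    kept = []
    for read in reads:
        trimmed = trim_trailing_n(read)
        if len(trimmed) >= min_length:
            kept.append(trimmed)
    return kept


# The three example reads from the slide.
reads = [
    "TAGCGCAATACTTTCTGTTAGCGCAAATCCTAGTAGTGCAT",
    "CCATGTGTGGGTTGTGTTNNNNNNNNNNNNNNNNNNNNNNN",
    "AGTGGTATCAACGCAGAGTACGGGGGACCTTNNNNNNNNNN",
]
clean_reads = filter_and_trim(reads)
for read in clean_reads:
    print(read)
```

The trimmed read on the slide is shorter than what N-stripping alone produces, presumably because real pipelines also trim on per-base quality scores, which are not shown here.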


Sequence search (BLAST)

• BLAST is one of the most commonly used bioinformatics tools

• It finds small sub-sequences of your query in the subject sequence

• It matches short words from the query against the subject database, then uses heuristics to verify and extend each match

Sequence search

Subject: GAGATGGTATCAACGCAGAGATCTGGTGTT

Query: AGAGATATGGT

Query words (length 4): AGAG, GAGA, AGAT, GATA, ATAT, TATG, ATGG, TGGT

HSP = high-scoring segment pair

Still possible HSP (score 3):

Query:    AGAGATATGGT
           ||||  ||||
Subject:   GAGA--TGGTATCAACGCAGAGATCTGGTGTT
Score:    0+1+1+1+1-3-2+1+1+1+1 = 3

Better HSP (score 9):

Query:    AGAGATATGGT
          |||||| ||||
Subject:  AGAGATCTGGT (subject positions 17-27)
Score:    1+1+1+1+1+1-1+1+1+1+1 = 9
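The word-matching idea behind this example can be sketched in Python. This is a simplified illustration under stated assumptions, not BLAST's actual implementation: the query is split into overlapping words of length 4, exact word hits are located in the subject, and each hit's offset is scored without gaps using +1 for a match and -1 for a mismatch. The function names and the scoring values are choices made for this sketch.

```python
def words(seq, k=4):
    """All overlapping words (k-mers) of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def seed_hits(query, subject, k=4):
    """Exact word matches as (query_position, subject_position) pairs."""
    hits = []
    for q_pos, word in enumerate(words(query, k)):
        s_pos = subject.find(word)
        while s_pos != -1:
            hits.append((q_pos, s_pos))
            s_pos = subject.find(word, s_pos + 1)
    return hits


def ungapped_score(query, subject, offset, match=1, mismatch=-1):
    """Score the query laid over the subject at a fixed offset, without gaps."""
    score = 0
    for i, base in enumerate(query):
        j = i + offset
        if 0 <= j < len(subject):  # ignore query bases that overhang the subject
            score += match if base == subject[j] else mismatch
    return score


query = "AGAGATATGGT"
subject = "GAGATGGTATCAACGCAGAGATCTGGTGTT"

# Each seed implies an offset (diagonal) between subject and query positions.
offsets = {s_pos - q_pos for q_pos, s_pos in seed_hits(query, subject)}
scores = {offset: ungapped_score(query, subject, offset) for offset in offsets}
best_offset = max(scores, key=scores.get)

print(words(query))
print(best_offset, scores[best_offset])
```

With these inputs the best ungapped placement starts at subject index 16 (0-based) and scores 9, which corresponds to the "better HSP" of the example. The gapped alignment with score 3 would need gap handling that this sketch leaves out.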

Sequence/genome alignment

• Global alignment

– global optimization that "forces" the alignment to span the entire length of both sequences (Needleman–Wunsch algorithm or Clustal style)

• Local alignment

– identify short regions of similarity within long divergent sequences (Smith–Waterman algorithm or BLAST style)


Sequence alignment

• Global alignment

FTFTALILLAVAV
|  ||| ||| ||
F--TAL-LLA-AV

• Local alignment

FTFTALILL-AVAV
  |||| || ||
--FTAL-LLAAV--
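The difference between the two alignment styles can be made concrete with a small dynamic-programming sketch. It computes only the optimal score (no traceback), following the Needleman–Wunsch recurrence for the global case and the Smith–Waterman variant for the local case. The scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions chosen for illustration.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal alignment score of a and b.

    local=False: global alignment (Needleman-Wunsch), ends are penalised.
    local=True:  local alignment (Smith-Waterman), scores never drop below 0.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


# The two sequences from the example, with the gap characters removed.
print(alignment_score("FTFTALILLAVAV", "FTALLLAAV"))
print(alignment_score("FTFTALILLAVAV", "FTALLLAAV", local=True))
```

A production aligner would normally use a substitution matrix and affine gap penalties instead of these flat values.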


Genome alignment

• Glocal alignment

• Uses a word matching method

• Builds a suffix tree for faster search

• Searches the suffix tree for exact word matches, clusters them, and then uses local alignment methods to extend the matches
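The exact-match lookup step can be illustrated with a toy data structure. The sketch below builds an uncompressed suffix trie, a simpler relative of the compact suffix tree that genome aligners use, and looks up every position where a word occurs. It needs memory quadratic in the sequence length, so it is only suitable for short examples. The class and function names are made up for this illustration.

```python
class Node:
    """One trie node: child nodes plus the start positions of suffixes passing through."""

    def __init__(self):
        self.children = {}
        self.starts = []


def build_suffix_trie(text):
    """Insert every suffix of text into a trie, recording where each suffix starts."""
    root = Node()
    for start in range(len(text)):
        node = root
        for char in text[start:]:
            node = node.children.setdefault(char, Node())
            node.starts.append(start)
    return root


def find_word(root, word):
    """Return all 0-based positions where word occurs in the indexed text."""
    node = root
    for char in word:
        node = node.children.get(char)
        if node is None:
            return []
    return node.starts


genome = "GAGATGGTATCAACGCAGAGATCTGGTGTT"
trie = build_suffix_trie(genome)
print(find_word(trie, "TGGT"))
print(find_word(trie, "GAGA"))
```

Once the index is built, each lookup takes time proportional to the word length rather than the genome length, which is the reason for building such an index before searching.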


Suffix tree

Source: Delcher et al. Nucleic Acids Research, 1999, Vol. 27, No. 11


Assembly

• Short read assembly is extremely difficult and computationally intensive!

• For longer reads, Overlap-Layout-Consensus (OLC) assemblers are used

• For large numbers of shorter reads, de Bruijn graph assemblers work better
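The core of the de Bruijn approach can be sketched as follows: every read is cut into k-mers, and each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is only the graph construction step. Real assemblers also handle sequencing errors, reverse complements and graph simplification, none of which is shown. The function name and the example reads are invented for this illustration.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return graph


# Two hypothetical overlapping reads.
example_reads = ["ACGTAC", "GTACGA"]
graph = build_de_bruijn_graph(example_reads, k=4)
for node, successors in graph.items():
    print(node, "->", successors)
```

Because reads are broken into k-mers up front, the graph size depends on the number of distinct k-mers rather than the number of reads, which is why this representation suits very large sets of short reads.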

Source: Commins, Toft & Fares (CC BY-SA 2.5)

Source: Schatz, Delcher, Salzberg, http://www.genome.org/cgi/doi/10.1101/gr.101360.109

Genome annotation

• Prediction of:

– Genes

– Repeats

– Non-coding RNAs (rRNAs, tRNAs, miRNAs, snRNAs, siRNAs, ta-siRNAs)

– Promoters

– Enhancers

– Protein binding sites

– …

[Figure: gene model, 5’ to 3’: 5’ UTR /Promoter/, RBS, start codon, Exon 1, Intron, Exon 2, Intron, Exon 3, stop codon, 3’ UTR]

ACCTCTCACTCTTTCTTCTCATCTTCTTCAATTATAACAACCTAACCATGTCTTTAGAACAAGAAT TTCTCTTCCTAATTCCTTCAACAATGGCCAATAATCTCAATCTCTTCCTTTGTTTCCTTCTCTTTATTTGTTTCTTCACTCTCTGCCTTAGCCCTGGTGGTCTAGCTTGGGCTCTAATTTCAAAGCCAAAAAACCAATCCATCATTCCTAGAATCAGCTTATGAGCTTTTGTTTCATAGAGCAATGGGGTTTGCTCCATTGGAGAGTGTGTTTGTGTTGGTGATGAAGAAAATAAGAACAGCCCTAATGATAATGAAGATGAAGATTTTGTTGATGTGTTGCTTGATTTGGAAAAGGAAAACAAACTCACCGACTCTGATATGATTGCTGTGTTAGGTACGTATGTGTATTATAATTTCTTGTTTCATTACTATTTTGATATTTTTCTACTGCACT TCAATTTTAATCGGTTTGAAATGATTTTTTAATATGCTCTTACAAGATTATGACTTGGGAAAGATTCTTACATCTTTAAATATTTCAATTTTTTGTGTGATACATGAAATGCATGACTGTTTTTTACTTGCGATTTACATGTTGAAATTTTCTTTACTTTGATATTCTATGTTTTTTTAACAAATTTTCTCTTAAATAAATGACATGTAGGAAATGACCTCCCAAATCTTCCTTATCTCCATGCAGTCGTCAAAGAGACTCTTAGAAT GCACCCTCCCGGCCCACTTCTCTCTTGGGCACGCCTTGCCATCCATGACACCCATGTCGCAGGCCACTACATTCCTGCTGGCACCACTGCGATGGTCAACATGTGGGCCATAACCCACGACGACCAAACTGTGGCTCGCTCAGTTAGTTCATAAGTTCGAATGGGTTCAAGCTGATGAATCGAAAATCAAAGTGGATTTGTCTGAGTGTTTGAAGCTATCTCTGGAAATGAAACACCCTTTGATTTGTAGGGCTATCCCGAGGAATGTAGGGTTCGAGTCTCACCCTGATCATGCATGACAGATTAAAAAAAAAAAGAA AAGGCACATCTAGGGGAGCTTATTATGATATTATCATATGTTGAAAATTAAATGTGTTTGTTGCTTTCTTTTCTTTTTTTCTTTTTCCTTTCTTCTTTCTCTTAATCAATTGATATTATATCTTGTGTGGAACAAATAGTATCGGATTCGAGATTTAATGTTGGGATAATCCTTAAATGTAATTCCGTTATTAAGTGTGAA


ACCTCTCACTCTTTCTTCTCATCTTCTTCAATTATAACAACCTAACC[ATG]TCTTTAGAACAAGAATTTCTCTTCCTAATTCCTTCAACAATGGCCAATAATCTCAATCTCTTCCTTTGTTTCCTTCTCTTTATTTGTTTCTTCACTCTCTGCCTTAGCCCTGGTGGTCTAGCTTGGGCTCTAATTTCAAAGCCAAAAAACCAATCCATCATTCCTAGAATCAGCTTATGAGCTTTTGTTTCATAGAGCAATGGGGTTTGCTCCATTGGAGAGTGTGTTTGTGTTGGTGATGAAGAAAATAAGAACAGCCCTAATGATAATGAAGATGAAGATTTTGTTGATGTGTTGCTTGATTTGGAAAAGGAAAACAAACTCACCGACTCTGATATGATTGCTGTGTTAG|GTACGTATGTGTATTATAATTTCTTGTTTCATTACTATTTTGATATTTTTCTACTGCACTTCAATTTTAATCGGTTTGAAATGATTTTTTAATATGCTCTTACAAGATTATGACTTGGGAAAGATTCTTACATCTTTAAATATTTCAATTTTTTGTGTGATACATGAAATGCATGACTGTTTTTTACTTGCGATTTACATGTTGAAATTTTCTTTACTTTGATATTCTATGTTTTTTTAACAAATTTTCTCTTAAATAAATGACATGTAG|GAAATGACCTCCCAAATCTTCCTTATCTCCATGCAGTCGTCAAAGAGACTCTTAGAATGCACCCTCCCGGCCCACTTCTCTCTTGGGCACGCCTTGCCATCCATGACACCCATGTCGCAGGCCACTACATTCCTGCTGGCACCACTGCGATGGTCAACATGTGGGCCATAACCCACGACGACCAAACTGTGGCTCGCTCAGTTAGTTCATAAGTTCGAATGGGTTCAAGCTGATGAATCGAAAATCAAAGTGGATTTGTCTGAGTGTTTGAAGCTATCTCTGGAAATGAAACACCCTTTGATTTGTAGGGCTATCCCGAGGAATGTAGGGTTCGAGTCTCACCC[TGA]TCATGCATGACAGATTAAAAAAAAAAAGAAAAGGCACATCTAGGGGAGCTTATTATGATATTATCATATGTTGAAAATTAAATGTGTTTGTTGCTTTCTTTTCTTTTTTTCTTTTTCCTTTCTTCTTTCTCTTAATCAATTGATATTATATCTTGTGTGGAACAAATAGTATCGGATTCGAGATTTAATGTTGGGATAATCCTTAAATGTAATTCCGTTATTAAGTGTGAA
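The annotated sequence marks a start codon ([ATG]), intron boundaries (|) and a stop codon ([TGA]). The simplest building block of gene prediction, scanning for an open reading frame, can be sketched as below. The sketch assumes an intron-free sequence and a single strand, so it would not recover the coding sequence of the example above, which contains introns. The function name and the short test sequence are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_first_orf(sequence):
    """Return the ORF starting at the first ATG, including its stop codon.

    Returns None if there is no ATG or no in-frame stop codon after it.
    """
    start = sequence.find("ATG")
    if start == -1:
        return None
    # Walk codon by codon in the reading frame defined by the start codon.
    for i in range(start, len(sequence) - 2, 3):
        if sequence[i:i + 3] in STOP_CODONS:
            return sequence[start:i + 3]
    return None


# Hypothetical intron-free sequence: the TGA inside "ATGA" is out of frame.
print(find_first_orf("CCATGAAATTTTGAGG"))
```

Real annotation pipelines combine such signals with splice-site models, sequence composition and experimental evidence such as expression data.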

Source: Joint Genome Institute

Expression analysis


STRUCTURAL BIOINFORMATICS


PDB and structural information

• The Protein Data Bank (PDB) holds structural information on proteins, nucleic acids and complexes – over 100 000 entries!

• The 3D structure can be determined by:

– X-ray diffraction

– NMR

– Electron microscopy

– Simulations

HEADER TRANSCRIPTION 18-MAR-04 1VD4

TITLE SOLUTION STRUCTURE OF THE ZINC FINGER DOMAIN OF TFIIE ALPHA

COMPND 2 MOLECULE: TRANSCRIPTION INITIATION FACTOR IIE, ALPHA

COMPND 8 ENGINEERED: YES

SOURCE MOL_ID: 1;

SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;

SOURCE 10 EXPRESSION_SYSTEM_PLASMID: PET11D

KEYWDS ZINC FINGER, TRANSCRIPTION

EXPDTA SOLUTION NMR

NUMMDL 20

AUTHOR M.OKUDA,A.TANAKA,Y.ARAI,M.SATOH,H.OKAMURA,A.NAGADOI,

REMARK 500 CHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400

REMARK 500

REMARK 500 M RES CSSEQI PSI PHI

REMARK 500 1 GLU A 118 -36.12 -163.20

REMARK 500 1 ARG A 119 -92.03 -138.92

REMARK 500 1 THR A 122 -70.74 -110.33

SITE 1 AC1 5 CYS A 129 CYS A 132 CYS A 154 CYS A 157

SITE 2 AC1 5 THR A 159

CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1

ORIGX1 1.000000 0.000000 0.000000 0.00000

ORIGX3 0.000000 0.000000 1.000000 0.00000

SCALE1 1.000000 0.000000 0.000000 0.00000

SCALE3 0.000000 0.000000 1.000000 0.00000

MODEL 1

ATOM 1 N ARG A 113 1.980 -19.277 -19.127 1.00 0.00 N

ATOM 2 CA ARG A 113 1.202 -19.280 -17.853 1.00 0.00 C

ATOM 3 C ARG A 113 0.666 -17.875 -17.557 1.00 0.00 C

ATOM 4 O ARG A 113 0.625 -17.023 -18.421 1.00 0.00 O

ATOM 5 CB ARG A 113 2.199 -19.713 -16.778 1.00 0.00 C

ATOM 6 CG ARG A 113 2.435 -21.222 -16.875 1.00 0.00 C

ATOM 7 CD ARG A 113 3.604 -21.619 -15.971 1.00 0.00 C

ATOM 8 NE ARG A 113 2.986 -21.899 -14.645 1.00 0.00 N

ATOM 9 CZ ARG A 113 3.125 -23.073 -14.094 1.00 0.00 C
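The ATOM records above carry the atom name, residue, chain, residue number and x/y/z coordinates (in ångström). As a minimal sketch, the records can be parsed and used to compute an interatomic distance. One assumption is important: the lines are split on whitespace, which works for the excerpt as printed here, whereas original PDB files are fixed-column and should be parsed by column position. The `Atom` type and the function names are invented for this example.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse a whitespace-separated ATOM record (assumption: no fused columns)."""
    fields = line.split()
    if len(fields) < 9 or fields[0] != "ATOM":
        raise ValueError(f"not an ATOM record: {line!r}")
    return Atom(int(fields[1]), fields[2], fields[3], fields[4], int(fields[5]),
                float(fields[6]), float(fields[7]), float(fields[8]))


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the units of the coordinates."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


nitrogen = parse_atom_line("ATOM 1 N ARG A 113 1.980 -19.277 -19.127 1.00 0.00 N")
alpha_carbon = parse_atom_line("ATOM 2 CA ARG A 113 1.202 -19.280 -17.853 1.00 0.00 C")
print(f"N-CA distance: {distance(nitrogen, alpha_carbon):.2f}")
```

For these two atoms the distance comes out at roughly 1.49, which is in the expected range for a backbone N-CA bond.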


Source: PDB

COMPUTATIONAL BIOLOGY/SYSTEMS BIOLOGY


Molecular networks

• Bioinformatics is needed to describe interactions between proteins, DNA, drugs…

• When thousands of interactions are analyzed, network science comes into play

• The set of all protein-protein interactions in a single cell is called the interactome

• A single interaction can be studied in vivo or in vitro, but more complex networks can only be investigated in silico
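A protein-protein interaction network is commonly represented as a graph in which proteins are nodes and interactions are edges. The sketch below stores such a graph as an adjacency mapping and finds the most connected protein (a "hub"). The protein names P1 to P4 and their interactions are hypothetical placeholders, not real data.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected graph: protein -> set of interaction partners."""
    network = defaultdict(set)
    for protein_a, protein_b in interactions:
        network[protein_a].add(protein_b)
        network[protein_b].add(protein_a)
    return network


# Hypothetical pairwise interactions, for illustration only.
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
network = build_network(interactions)

# The degree of a node is its number of interaction partners.
degrees = {protein: len(partners) for protein, partners in network.items()}
hub = max(degrees, key=degrees.get)
print(hub, degrees[hub])
```

The same representation scales to thousands of interactions, where measures such as degree distribution or clustering become more informative than inspecting individual edges.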

Source: Hennah & Porteous. PLoS ONE 4 (3): e4906. doi:10.1371/journal.pone.0004906

Metabolic pathways

• Bioinformatics is also useful for describing series of biochemical reactions, which often take place in different cellular compartments

• Special (graph) databases had to be designed to describe pathways

• Modeling the flow of metabolites through a pathway is virtually impossible without computers
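A pathway can be stored as a directed graph in which each metabolite points to the metabolites produced from it. The sketch below encodes the first steps of glycolysis in strongly simplified form (enzymes, cofactors and reversibility are left out) and lists everything reachable from a starting metabolite with a breadth-first search. The dictionary layout and the function name are assumptions made for this example.

```python
from collections import deque

# Simplified first steps of glycolysis: substrate -> products.
pathway = {
    "glucose": ["glucose 6-phosphate"],
    "glucose 6-phosphate": ["fructose 6-phosphate"],
    "fructose 6-phosphate": ["fructose 1,6-bisphosphate"],
    "fructose 1,6-bisphosphate": [
        "glyceraldehyde 3-phosphate",
        "dihydroxyacetone phosphate",
    ],
}


def reachable(graph, start):
    """Metabolites reachable from start, in breadth-first order (start included)."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        metabolite = queue.popleft()
        order.append(metabolite)
        for product in graph.get(metabolite, []):
            if product not in seen:
                seen.add(product)
                queue.append(product)
    return order


for metabolite in reachable(pathway, "glucose"):
    print(metabolite)
```

Quantitative modeling of metabolite flow needs reaction rates and stoichiometry on top of this connectivity, which is where the computational cost grows quickly.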

Source: KEGG, www.genome.jp/kegg/pathway/map/map00010.html

Source: KEGG, www.genome.jp/kegg/pathway/map/map01100.html

Simulation of biological systems

• Simulation of cell-cell interactions

• Description of interactions within a population

• Interactions between species

• Food chains => food web

• Social relations

• Evolution of populations

• Modeling in pharmacology


BIOINFORMATICS TOOLS


Databases

• Relational databases: MySQL, Microsoft SQL Server, SAP

• Non-relational databases (NoSQL): Cassandra, MongoDB, CouchDB

• Graph databases: Neo4j, Bio4j

• RDF: N-Triples, RDF/XML, Bio2RDF
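As a small illustration of the relational model, the sketch below stores a few sequences in a table and queries them with SQL. It uses SQLite through Python's built-in `sqlite3` module instead of the server-based systems listed above, purely so that the example is self-contained. The table layout, sequence names and organisms are hypothetical.

```python
import sqlite3

# In-memory database: nothing is written to disk.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE sequences (name TEXT PRIMARY KEY, organism TEXT, residues TEXT)"
)

# Hypothetical records, for illustration only.
connection.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("seq_a", "organism 1", "ACGTACGT"),
        ("seq_b", "organism 1", "GGCC"),
        ("seq_c", "organism 2", "TTTTAAAACC"),
    ],
)

# Parameterised query: name and length of every sequence from one organism.
rows = connection.execute(
    "SELECT name, LENGTH(residues) FROM sequences WHERE organism = ? ORDER BY name",
    ("organism 1",),
).fetchall()
print(rows)
connection.close()
```

The other database families in the list trade this fixed table schema for flexible documents, graph traversal or linked-data triples.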

• Different types of public resources are available:

– Sequence data: nucleic sequence, protein sequence, EST, genome

– Metadata/Ontologies: functional annotation, gene models, gene ontologies

– Structural data: protein structure, complexes structure, RNA structure

– Variation data: SNP, SSR, indels

– Metabolic data: interactions, pathways


• Where to look for the data?


• How to use them?

– Browsing websites directly

– Downloading

– Using an API
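Programmatic access usually means building a request URL with the right parameters. The slides do not name a specific API, so the sketch below picks NCBI E-utilities (`efetch`) as one example and only constructs the URL for downloading a record in FASTA format. It does not send the request. `ACCESSION_ID` is a placeholder, not a real accession, and the function name is invented for this example.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, database="nucleotide"):
    """Build an efetch URL that requests one record as plain-text FASTA."""
    params = {
        "db": database,
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# ACCESSION_ID is a placeholder; substitute a real accession before fetching.
url = build_efetch_url("ACCESSION_ID")
print(url)
```

The resulting URL could then be opened with `urllib.request.urlopen(url)`. Public services generally publish usage limits that scripts should respect.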


Text/data mining

• Obtaining information from several scientific resources is becoming more difficult as the volume of information grows

• The number of different resources/databases is growing, and even a simple search has to be repeated for each of them

• Filtering relevant information is a big intellectual/computational burden


Text mining

• Retrieval, analysis and formatting (parsing) of information into searchable databases

• Recognition of patterns

• Recognition of natural language

• Extraction of semantic or grammatical relationships

• Coreference: terms that refer to the same object
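Two of these steps, pattern recognition and resolving terms that refer to the same object, can be sketched with regular expressions. The example finds abbreviated species names such as "E. coli" in a sentence and normalises spelling variants to one form before counting them. The sentence, the pattern and the function name are made up for illustration. Real text-mining systems rely on curated dictionaries and language models rather than a single pattern.

```python
import re
from collections import Counter

# One capital letter, a period, an optional space, then a lowercase word.
SPECIES_PATTERN = re.compile(r"\b[A-Z]\.\s?[a-z]+\b")


def species_mentions(text):
    """Find abbreviated species names and normalise them to the form 'E. coli'."""
    mentions = SPECIES_PATTERN.findall(text)
    return [re.sub(r"\.\s?", ". ", mention) for mention in mentions]


# Hypothetical sentence with two spellings of the same organism.
text = ("Promoters were tested in E.coli; expression in E. coli "
        "was compared with B. subtilis.")
counts = Counter(species_mentions(text))
print(counts)
```

After normalisation both spellings count as the same organism, which is the kind of coreference a searchable database needs to resolve.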


Text mining example

• Query: Find promoters known to work in E. coli with the σ70 holoenzyme (Eσ70), aka σD

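# Namespace prefixes; SPARQL requires IRIs in angle brackets
PREFIX sbol: <http://sbols.org/sbol.owl#>
PREFIX pr: <http://partsregistry.org/#>

SELECT DISTINCT ?name
WHERE {
  # Parts that are promoters, with their status, name and DNA sequence
  ?part a sbol:Part;
        sbol:status ?st;
        sbol:name ?name;
        sbol:dnaSequence ?seq;
        a pr:promoter;
        a ?cl.
  # Keep only sigma70 E. coli promoters that have not been deleted
  FILTER (?cl = pr:sigma70_ecoli_prokaryote_rnap
          && ?st != 'Deleted')
}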


Open source software

• Software that anyone can use, modify, share and distribute.

• Source code is available and can (should!) be modified to fit the user's requirements

• Community-driven development

• Dynamic development and early releases

• Security and transparency


Open source software repositories

• CRAN (The Comprehensive R Archive Network)

• CodePlex

Specificity of working with Big Data


CAN I BE A BIOINFORMATICIAN, TOO?


How to become a bioinformatician?

• Get a computer with Linux

• Learn how to use the bash shell and how to run programs from the command line

• Learn to code in Python or Perl

• Try solving basic problems on http://rosalind.info/problems/locations/

• Read blogs: http://omicsomics.blogspot.fr/, http://blog.openhelix.eu/, http://www.homolog.us/blogs/

• Read fora for geeks: http://stackoverflow.com/, http://seqanswers.com/, https://www.biostars.org/

• Get an account on:

Want to know more?

• Join my network on http://nl.linkedin.com/in/andrzejstefanczech

• Come to Wageningen for an internship at Genetwister Technologies B.V. http://www.genetwister.nl/

• Slides from this lecture are also available on SlideShare: http://www.slideshare.net/andrzejsczech

Image credits: Biocomicals
