Post on 03-Jan-2016
Filling in:
Ioannis Pandis, PhD
i.pandis@ic.ac.uk
CO341: Introduction to Bioinformatics
Prof. Yi-Ke Guo (yg@ic.ac.uk)
Sequencing and Genomics
DNA Sequencing
Sequencing Analysis
Gene Expression
Gene Expression Analysis
Functional Genomics
DNA Structure
Double Helix (Crick & Watson)
– 2 coiled matching strands
– Backbone of sugar phosphate pairs
Nitrogenous Base Pairs
– Roughly 20 atoms in a base
– Adenine Thymine [A,T]
– Cytosine Guanine [C,G]
– Weak bonds (can be broken)
– Form long chains called polymers
Read the sequence on 1 strand
– GATTCATCATGGATCATACTAAC
Differences in DNA
Figure labels: 2% (tiny); roughly 4%; share material
DNA differentiates:
– Species/race/gender
– Individuals
We share DNA with
– Primates, mammals
– Fish, plants, bacteria
Genotype
– DNA of an individual
– Genetic constitution
Phenotype
– Characteristics of the resulting organism
– Nature and nurture
Genes
Chunks of DNA sequence
– Between 600 and 1200 bases long
– 22,000 human genes, 100,000 genes in tulips
Large percentage of human genome
– termed “junk”: does not code for proteins
“Simpler” organisms such as bacteria
– Are much more “evolved” (have hardly any junk)
– Viruses have overlapping genes (zipped/compressed)
Often the active part of a gene is split into exons
– Separated by introns
Transcription
Take one strand of DNA
Write out the counterparts to each base
– G becomes C (and vice versa)
– A becomes T (and vice versa)
Change Thymine [T] to Uracil [U]
You have transcribed DNA into messenger RNA
Example:
– Start: GGATGCCAATG
– Intermediate: CCTACGGTTAC
– Transcribed: CCUACGGUUAC
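The transcription recipe above can be sketched in a few lines of Python. This is a minimal illustration of the slide's simplified rule (complement each base, then swap T for U); it ignores strand direction, and the names `COMPLEMENT` and `transcribe` are made up for this example.

```python
# Base-pairing rules from the slide: G <-> C and A <-> T.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def transcribe(dna: str) -> str:
    """Complement each base, then replace thymine (T) with uracil (U)."""
    intermediate = "".join(COMPLEMENT[base] for base in dna.upper())
    return intermediate.replace("T", "U")


print(transcribe("GGATGCCAATG"))  # CCUACGGUUAC
```

Running it on the slide's start sequence reproduces the transcribed string shown in the example.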
The Synthesis of Proteins
Instructions for generating Amino Acid sequences
– (i) DNA double helix is unzipped
– (ii) One strand is transcribed to messenger RNA
– (iii) RNA acts as a template: ribosomes translate the RNA into the sequence of amino acids
Amino acid sequences fold into a 3D molecule
Gene expression
– Every cell has every gene in it (has all chromosomes)
– Which ones produce proteins (are expressed) & when?
Genetic Code
How the translation occurs
Think of this as a function:
– Input: triples of base letters (codons)
– Output: amino acid
– Example: ACC becomes threonine (T)
Gene sequences end with:
– TAA, TAG or TGA
Example Synthesis
TCGGTGAATCTGTTTGAT
Transcribed to: AGCCACUUAGACAAACUA
Translated to: SHLDKL
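The "genetic code as a function" idea can be sketched as a lookup table from codons to one-letter amino acid codes. The table below is the standard genetic code written in the usual compact order; the names `CODON_TABLE` and `translate` are illustrative, and stopping at the first stop codon is an assumption of this sketch.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
# "*" marks a stop codon (UAA, UAG, UGA; written TAA, TAG, TGA in DNA letters).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Map each codon to its amino acid letter, stopping at a stop codon."""
    rna = rna.upper().replace("T", "U")  # also accept DNA-style codons such as ACC
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AGCCACUUAGACAAACUA"))  # SHLDKL
print(translate("ACC"))                 # T (threonine)
```

Applied to the transcribed sequence from the example, it gives the same six amino acids as the slide.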
Evolution of Genes: Inheritance
Evolution of species
– Caused by reproduction and survival of the fittest
But actually, it is the genotype which evolves
– Organism has to live with it (or die before reproduction)
– Three mechanisms: inheritance, mutation and crossover
Inheritance: properties from parents
– Embryo has cells with 23 pairs of chromosomes
– Each pair: 1 chromosome from father, 1 from mother
– Most important factor in offspring’s genetic makeup
Evolution of Genes: Mutation
Genes alter (slightly) during reproduction
– Caused by errors, from radiation, from toxicity
– 3 possibilities: deletion, insertion, substitution
Substitution: ACGTTGACTC → ACGATGACTT
Deletion: ACGTTGACTC → ACGTGACTC
Insertion: ACGTTGACTC → AGCGTTGACTC
– Frameshift: ACGTTGACTC → AGCGTTGACTC
Mutations are categorised into:
– Neutral or
– Deleterious
A single change has a massive effect on translation
– Causes a different protein conformation
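To see why a single inserted base has such a large effect, it helps to split the slide's insertion example into codons before and after the change. The helper name `codons` is hypothetical, and reading starts at position 0 as an assumption of this sketch.

```python
def codons(sequence: str) -> list[str]:
    """Split a sequence into complete, non-overlapping triplets from position 0."""
    return [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]


original = "ACGTTGACTC"
inserted = "AGCGTTGACTC"  # the slide's insertion example (one extra base)

print(codons(original))  # ['ACG', 'TTG', 'ACT']
print(codons(inserted))  # ['AGC', 'GTT', 'GAC']
```

Every codon after the insertion point changes, which is the frameshift: the downstream amino acids are all read differently.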
Evolution of Genes: Crossover (Recombination)
DNA sections are swapped – From male and female genetic input to offspring DNA
Sequencing for Medical Study
Phenotype
Genotype
Hypothesis
Test Hypothesis by Genetic Manipulation
Typical Cycle of the Study
Phenotype: Two groups: 1. Develop colorectal cancer at young age; 2. Do not
Genotype: Mutation in APC gene
Hypothesis: APC is a tumor suppressor gene
Test Hypothesis by Genetic Manipulation: Delete APC in mouse. Control: isogenic APC+
Technologies Required
Phenotype: Observation
Genotype: ?Sequencing? (in 2005: $9 million/genome; not feasible)
Hypothesis: Reading/Thinking
Test Hypothesis by Genetic Manipulation: Gene Deletion/Replacement
This is changing rapidly: bp/$ increases exponentially with time
Adapted from Shendure et al 2004
In 1980, the sequencing cost per finished bp ≈ $1.00
In 2003, the sequencing cost per finished bp ≈ $0.01
>>> a 100-fold reduction in 20-25 years
History of DNA Sequencing
1870 Miescher: Discovers DNA
1940 Avery: Proposes DNA as ‘Genetic Material’
1953 Watson & Crick: Double Helix Structure of DNA
1965 Holley: Sequences Yeast tRNA-Ala
1970 Wu: Sequences Cohesive End DNA
1977 Sanger: Dideoxy Chain Termination; Gilbert: Chemical Degradation
1980 Messing: M13 Cloning
1986 Hood et al.: Partial Automation
1990, 2002, 2008:
• Cycle Sequencing • Improved Sequencing Enzymes • Improved Fluorescent Detection Schemes
• Next Generation Sequencing • Improved enzymes and chemistry • Improved image processing
Efficiency (bp/person/year), in increasing order: 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000 (2008)
Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)
1928???
Griffith's experiment, reported in 1928 by Frederick Griffith
Sanger Sequencing (Chain-termination Methods)
DNA is fragmented
Cloned to a plasmid vector
Cyclic sequencing reaction
Separation by electrophoresis
Readout with fluorescent tags
Basics of the “old” technology
Clone the fragmented DNA.
Generate a ladder of labeled (colored) molecules that are different by 1 nucleotide.
Separate mixture on some matrix.
Detect fluorochrome by laser.
Interpret peaks as string of DNA.
Strings are 500 to 1,000 letters long
Assemble all strings into a genome
The Process Is Sequential
Genome size: 3 × 10^9 bp at 1x coverage
Time: 10x coverage × 3 × 10^9 bp ÷ (2 × 10^6 bp/day) = 15,000 days ≈ 40 years
Cost: 10x coverage × 3 × 10^9 bp × $0.001/bp = $30 million
That is what the old technology takes
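The arithmetic behind these figures can be checked with a short script. The constants are the slide's numbers; the variable names and the 365-day year are assumptions made for this sketch.

```python
GENOME_BP = 3_000_000_000           # 3 x 10^9 bp
COVERAGE = 10                       # 10x coverage
THROUGHPUT_BP_PER_DAY = 2_000_000   # 2 x 10^6 bp/day on a sequential machine
COST_PER_BP = 0.001                 # dollars per bp

total_bp = COVERAGE * GENOME_BP
days = total_bp / THROUGHPUT_BP_PER_DAY
years = days / 365
cost = total_bp * COST_PER_BP

print(f"{days:,.0f} days = {years:.1f} years")  # 15,000 days = 41.1 years
print(f"${cost:,.0f}")                          # $30,000,000
```

The result is about 41 years, which the slide rounds to 40.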
New Generation Sequencing
Basics of the “new” technology
Get DNA and fragment it
Attach all fragments to glass slides.
Perform amplification by some form of PCR
Sequence ALL these fragments in PARALLEL using chain termination or other methods such as pyro-sequencing
Extend and amplify signal with some color scheme.
Detect fluorochrome by microscopy.
Interpret series of spots as short strings of DNA.
Strings are 30-300 letters long
Multiple images are interpreted as 0.4 to 1.2 GB/run (1,200,000,000 letters/day).
Map or align strings to one or many genomes.
Making Millions of Short Sequence Reads in Parallel
Technology Overview: Solexa/Illumina Sequencing
http://www.illumina.com/
Immobilize DNA to Surface
Source: www.illumina.com
Sequence Colonies
Call Sequence
From Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4
Sequence Alignment
Meyerson et al, 2011
2006: $10 million; 2008: $100,000; 2009: $10,000; 2010: $5,000; 2012: $1,000; ???: $100
So, how fast is cost going down?
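One way to answer this is to express each decline as a halving time. The sketch below assumes a constant exponential decline between two data points, which is a simplification; the figures are the ones quoted in these slides (per-bp cost for 1980 and 2003, per-genome cost for 2006 and 2012), and the function name `halving_time_years` is made up for this example.

```python
import math


def halving_time_years(year_start, cost_start, year_end, cost_end):
    """Years needed for the cost to halve, assuming a constant exponential decline."""
    halvings = math.log2(cost_start / cost_end)
    return (year_end - year_start) / halvings


per_bp = halving_time_years(1980, 1.00, 2003, 0.01)
per_genome = halving_time_years(2006, 10_000_000, 2012, 1_000)

print(f"1980-2003: cost halves every {per_bp:.1f} years")
print(f"2006-2012: cost halves every {per_genome * 12:.1f} months")
```

Under that assumption the older technology halved its cost roughly every three and a half years, while the 2006 to 2012 figures correspond to a halving roughly every five to six months.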
Informatics
Informatics challenge: ample applications
– All the genomics research can be uniformly done through sequencing (with the help of proper assay design)
– Bioinformatics turns the sequencer into a universal genomics interpreter
– Not a challenge, rather a big opportunity!!!
For Edison, the phonograph was not primarily designed for playing music but …….
One Stone, Many Birds: NGS May Enable a Uniform Bioinformatics
Mapped Position: Structure/functionality (Mapping)
BP Variant: SNP & Mutation Pattern (Detecting)
Read Numbers: Quantified Abundance (Counting)
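The three uses of short reads (mapping, detecting, counting) can be illustrated on toy data. This is only a sketch under strong assumptions: reads are placed without gaps at the position with the fewest mismatches, which is far simpler than a real aligner, and the reference, the reads and the name `best_hit` are all hypothetical.

```python
from collections import Counter


def best_hit(read, reference):
    """Return (offset, mismatch_positions) for the gapless placement with fewest mismatches."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = [offset + i for i, (a, b) in enumerate(zip(read, window)) if a != b]
        if best is None or len(mismatches) < len(best[1]):
            best = (offset, mismatches)
    return best


# Hypothetical toy reference and reads; "CAGTC" carries one base change.
reference = "GATTCAGACCTAGCT"
reads = ["TTCAG", "CAGAC", "CAGTC", "TAGCT"]

coverage = Counter()   # counting: how many reads cover each position
variants = Counter()   # detecting: positions where a read disagrees with the reference
for read in reads:
    offset, mismatches = best_hit(read, reference)   # mapping
    coverage.update(range(offset, offset + len(read)))
    variants.update(mismatches)

print(dict(variants))   # {7: 1}
print(coverage[4])      # 3
```

The same mapped reads give all three outputs: where each read lands, where it differs from the reference, and how many reads pile up at each position.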
Match These Sequences
How do we match this sequence:
gattcagacctagct
With this sequence:
gtcagatcct
Possible Answers
1. No indels: gattcagacctagct / gtcagatcct
2. With indels: gattcaga-cctagct / g-t-cagatcct
3. No overhang: gattcagacctagc-t / gtcagatcct
4. With overhang: gattcagacctagct / gtcagatcct
Sequence Matching Algorithms #1
Without indels
Hamming distance
Scoring schemes
– Certain changes in sequence more likely
– Due to chemical properties of the residues
BLAST algorithm
– Idea: match local regions and expand
– Seven part process
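Matching without indels can be sketched as sliding the shorter sequence along the longer one and counting mismatches (the Hamming distance) at each offset. The sketch uses unit cost per mismatch rather than a residue-specific scoring scheme, allows no overhang, and is not BLAST; the function names are illustrative.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def best_gapless_offset(long_seq: str, short_seq: str):
    """Slide short_seq along long_seq (no overhang); return (offset, distance) of the best fit."""
    offsets = range(len(long_seq) - len(short_seq) + 1)
    scored = ((o, hamming(long_seq[o:o + len(short_seq)], short_seq)) for o in offsets)
    return min(scored, key=lambda pair: pair[1])


print(best_gapless_offset("gattcagacctagct", "gtcagatcct"))  # (2, 4)
```

For the two sequences from the matching example, the best gapless placement starts at offset 2 and still has 4 mismatches, which is why indels are worth considering.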
Sequence Matching Algorithms #2
With indels
Drawing of Dotplots
Dynamic Programming (getting from A to B)
– Quickest route to Z + Quickest route from Z
Figure: dotplot and route diagram; sequence labels VPFLLMMVLGVPFMMLG; nodes A, B, C, D, E, F, G, Z
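The dynamic programming idea (the best route through an intermediate point is built from the best routes to and from it) can be sketched with the classic edit-distance recurrence. This is one simple instance with unit costs for substitution, insertion and deletion, not necessarily the scoring used in the lecture; `edit_distance` is an illustrative name.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # match or substitute
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("gattcagacctagct", "gtcagatcct"))
```

Each table cell reuses the optimal answers for shorter prefixes, so the cost grows with the product of the two sequence lengths instead of exploding with the number of possible alignments.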
Searching Databases
We have ways to score how well 2 seqs match
Now want to use this in databases
– Given a known gene sequence
– Which genes in the database are closely related
Have to worry about:
– Repeated subsequences biasing matches
– Accuracy and significance of matches
– Sensitivity and specificity (false + and false -)
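Sensitivity and specificity can be computed directly from the counts of true and false positives and negatives in a search result. The counts below are hypothetical, chosen only to illustrate the two formulas.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truly related sequences that the search reports."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of unrelated sequences that the search correctly rejects."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search: 10 truly related genes among 110 database entries.
print(sensitivity(true_pos=8, false_neg=2))    # 0.8
print(specificity(true_neg=90, false_pos=10))  # 0.9
```

A looser match threshold usually raises sensitivity (fewer false negatives) at the price of specificity (more false positives), and a stricter one does the opposite.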
Functional Genomics—Transcriptomics
Transcriptome – the complete set of coding and non-coding RNA molecules in a cell at a particular time: Varies between cell types
Transcriptomics – the study of the transcripts in a cell, cell type, organism, etc.
Methods for Transcriptomics
Microarray-based:
– High-throughput gene expression profiling
– Hybridization of labeled cDNAs to an array of complementary DNA probes
– Measurement of expression levels based on hybridization intensity
Sequence-based:
– Full-length cDNA (FLcDNA) sequencing: complete sequencing of cDNA clone
– Expressed sequence tag (EST) sequencing: Single-pass sequencing of cDNA clone
– Serial Analysis of Gene Expression (SAGE): Short sequence tags at 3’ end of transcript; Tags concatenated and sequenced
NGS enables whole transcriptome sequencing: Sequence Census Method
Machine Learning
Machine learning (inductive reasoning)
– Automatic proposing of hypotheses based on data
– Has many applications in bioinformatics, such as microarray analysis
Example: predictive toxicology
– Given: set of toxic drugs and a set of non-toxic drugs
– Given: background information (chemistry, etc.)
– Produces: hypothesis why drugs are toxic (toxic mechanism)
Overview of machine learning
– Aims, techniques, methodologies, representations
Artificial neural networks
Support vector machines, etc.
Machine Learning
Larrañaga et al. 2005
QUESTIONS? The End!