
Filling in:

Ioannis Pandis, PhD

i.pandis@ic.ac.uk

CO341: Introduction to Bioinformatics

Prof. Yi-Ke Guo (yg@ic.ac.uk)

Sequencing and Genomics

DNA Sequencing

Sequencing Analysis

Gene Expression

Gene Expression Analysis

Functional Genomics

DNA Structure

Double Helix (Crick & Watson)
– 2 coiled matching strands
– Backbone of sugar phosphate pairs

Nitrogenous Base Pairs
– Roughly 20 atoms in a base
– Adenine Thymine [A,T]
– Cytosine Guanine [C,G]
– Weak bonds (can be broken)
– Form long chains called polymers

Read the sequence on 1 strand
– GATTCATCATGGATCATACTAAC

Differences in DNA

[Figure: differences in DNA; labels read "2% tiny" and "Roughly 4%"]

DNA differentiates:
– Species/race/gender
– Individuals

We share DNA with
– Primates, mammals
– Fish, plants, bacteria

Genotype
– DNA of an individual
– Genetic constitution

Phenotype
– Characteristics of the resulting organism
– Nature and nurture

Genes

Chunks of DNA sequence
– Between 600 and 1200 bases long
– 22,000 human genes, 100,000 genes in tulips

Large percentage of human genome
– Termed “junk”: does not code for proteins

“Simpler” organisms such as bacteria
– Are much more “evolved” (have hardly any junk)
– Viruses have overlapping genes (zipped/compressed)

Often the active part of a gene is split into exons
– Separated by introns

Transcription

Take one strand of DNA
Write out the counterparts to each base
– G becomes C (and vice versa)
– A becomes T (and vice versa)
Change Thymine [T] to Uracil [U]
You have transcribed DNA into messenger RNA

Example:
Start: GGATGCCAATG
Intermediate: CCTACGGTTAC
Transcribed: CCUACGGUUAC
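The transcription rule on this slide can be written as a tiny function. This is a minimal sketch that follows the slide's simplified recipe (complement each base, then swap T for U); the name `transcribe` is chosen here for illustration and is not from the lecture.

```python
# Complement of each DNA base, as listed on the slide.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def transcribe(template_dna):
    """Transcribe a DNA strand into mRNA using the slide's two steps."""
    # Step 1: write out the counterpart of each base (the "Intermediate").
    intermediate = "".join(COMPLEMENT[base] for base in template_dna.upper())
    # Step 2: change thymine (T) to uracil (U).
    return intermediate.replace("T", "U")


print(transcribe("GGATGCCAATG"))  # CCUACGGUUAC, matching the slide
```

Note that this follows the slide and reads both strands left to right; strand direction (5' to 3') is ignored in this sketch.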

The Synthesis of Proteins

Instructions for generating Amino Acid sequences
– (i) DNA double helix is unzipped
– (ii) One strand is transcribed to messenger RNA
– (iii) RNA acts as a template: ribosomes translate the RNA into the sequence of amino acids

Amino acid sequences fold into a 3D molecule

Gene expression
– Every cell has every gene in it (has all chromosomes)
– Which ones produce proteins (are expressed) & when?

Genetic Code

How the translation occurs

Think of this as a function:
– Input: triplets of base letters (codons)
– Output: amino acid
– Example: ACC becomes threonine (T)

Gene sequences end with: – TAA, TAG or TGA

Example Synthesis

TCGGTGAATCTGTTTGAT
Transcribed to:
AGCCACUUAGACAAACUA
Translated to:
SHLDKL
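The "genetic code as a function" idea can be sketched as a lookup table from codons to one-letter amino acid codes. The table below is the standard genetic code written in the conventional U, C, A, G order; the names `CODON_TABLE` and `translate` are illustrative choices, not part of the lecture.

```python
from itertools import product

# Standard genetic code: amino acids for all 64 codons, with each of the
# three positions cycling through U, C, A, G (third position fastest).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon (*)."""
    protein = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":  # UAA, UAG or UGA ends the sequence
            break
        protein.append(amino)
    return "".join(protein)


print(CODON_TABLE["ACC"])               # T (threonine)
print(translate("AGCCACUUAGACAAACUA"))  # SHLDKL, matching the slide
```

The stop codons appear here as UAA, UAG and UGA because the table is written in RNA letters; they correspond to TAA, TAG and TGA in DNA.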

Evolution of Genes: Inheritance

Evolution of species– Caused by reproduction and survival of the fittest

But actually, it is the genotype which evolves– Organism has to live with it (or die before reproduction)– Three mechanisms: inheritance, mutation and crossover

Inheritance: properties from parents– Embryo has cells with 23 pairs of chromosomes– Each pair: 1 chromosome from father, 1 from mother– Most important factor in offspring’s genetic makeup

Evolution of Genes: Mutation

Genes alter (slightly) during reproduction
– Caused by errors, from radiation, from toxicity
– 3 possibilities: deletion, insertion, substitution

Substitution: ACGTTGACTC → ACGATGACTT
Deletion: ACGTTGACTC → ACGTGACTC
Insertion: ACGTTGACTC → AGCGTTGACTC
– Frameshift: ACGTTGACTC → AGCGTTGACTC

Mutations are categorised into:– Neutral or– Deleterious

A single change can have a massive effect on translation, because it can cause a different protein conformation
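To see why an insertion or deletion is so disruptive, it helps to split the example sequences into codons. The sketch below is illustrative only (the helper name `codons` is an assumption of this example): a substitution changes individual codons in place, whereas an insertion or deletion shifts the reading frame for everything downstream.

```python
def codons(seq):
    """Split a sequence into complete triplets, dropping leftover bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


original = "ACGTTGACTC"
examples = {
    "original": original,
    "substitution": "ACGATGACTT",
    "deletion": "ACGTGACTC",
    "insertion": "AGCGTTGACTC",
}

for name, seq in examples.items():
    print(f"{name:>12}: {' '.join(codons(seq))}")
```

With the slide's sequences, the insertion turns ACG TTG ACT into AGC GTT GAC, so every codon changes. The deletion produces ACG TGA CTC, where TGA happens to be a stop codon.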

Evolution of Genes: Crossover (Recombination)

DNA sections are swapped – From male and female genetic input to offspring DNA

Sequencing for Medical Study

Phenotype

Genotype

Hypothesis

Test Hypothesis by Genetic Manipulation

Typical Cycle of the Study

Phenotype: Two groups:
1. Develop colorectal cancer at young age
2. Do not

Genotype: Mutation in APC gene

Hypothesis: APC is a tumor suppressor gene

Test: Delete APC in mouse. Control: isogenic APC+

Technologies Required

Phenotype

Genotype

Hypothesis

Observation

?Sequencing?

Reading/Thinking

Gene Deletion/Replacement

In 2005: $9 million/genome

Not feasible

This is changing rapidly: bp/$ increases exponentially with time

Adapted from Shendure et al 2004

In 1980, the sequencing cost per finished bp ≈ $1.00
In 2003, the sequencing cost per finished bp ≈ $0.01

>>> a 100-fold reduction in 20-25 years

History of DNA Sequencing

Miescher: Discovers DNA
Avery: Proposes DNA as ‘Genetic Material’
Watson & Crick: Double Helix Structure of DNA
Holley: Sequences Yeast tRNA(Ala)
Wu: Sequences Cohesive End DNA
Sanger: Dideoxy Chain Termination
Gilbert: Chemical Degradation
Messing: M13 Cloning
Hood et al.: Partial Automation
• Cycle Sequencing
• Improved Sequencing Enzymes
• Improved Fluorescent Detection Schemes
• Next Generation Sequencing
• Improved enzymes and chemistry
• Improved image processing

Efficiency (bp/person/year), as labelled on the figure: 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000 (2008)

Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)

1928???


Griffith's experiment, reported in 1928 by Frederick Griffith


Sanger Sequencing (Chain-termination Methods)

– DNA is fragmented
– Cloned to a plasmid vector
– Cyclic sequencing reaction
– Separation by electrophoresis
– Readout with fluorescent tags

Basics of the “old” technology
– Clone the fragmented DNA.
– Generate a ladder of labeled (colored) molecules that differ by 1 nucleotide.
– Separate the mixture on some matrix.
– Detect the fluorochrome by laser.
– Interpret peaks as a string of DNA.
– Strings are 500 to 1,000 letters long.
– Assemble all strings into a genome.

The Process Is Sequential

Genome: 3 × 10^9 bp at 1x coverage; target: 10x coverage

Time: 10x coverage × 3 × 10^9 bp ÷ (2 × 10^6 bp/day) = 15,000 days ≈ 40 years

Cost: 10x coverage × 3 × 10^9 bp × $0.001/bp = $30 million

That is what the old technology takes
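The slide's back-of-the-envelope estimate can be reproduced in a few lines. This is only a sketch using the figures quoted on the slide; the variable names are illustrative.

```python
# Figures quoted on the slide.
GENOME_SIZE_BP = 3e9          # human genome, about 3 x 10^9 bp
COVERAGE = 10                 # each base read 10 times on average
THROUGHPUT_BP_PER_DAY = 2e6   # sequential (Sanger-style) throughput
COST_PER_BP_USD = 0.001

total_bp = COVERAGE * GENOME_SIZE_BP        # bases that must be read
days = total_bp / THROUGHPUT_BP_PER_DAY     # one sequential pipeline
years = days / 365
cost_usd = total_bp * COST_PER_BP_USD

print(f"{days:,.0f} days, about {years:.0f} years")  # 15,000 days, about 41 years
print(f"about ${cost_usd / 1e6:.0f} million")        # about $30 million
```

The time estimate assumes a single sequential pipeline, which is the slide's point: running many reads in parallel is what removes the 40-year bottleneck.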

New Generation Sequencing

Basics of the “new” technology
– Get DNA and fragment it.
– Attach all fragments to glass slides.
– Perform amplification by some form of PCR.
– Sequence ALL these fragments in PARALLEL using chain termination or other methods such as pyrosequencing.
– Extend and amplify the signal with some color scheme.
– Detect the fluorochrome by microscopy.
– Interpret series of spots as short strings of DNA.
– Strings are 30-300 letters long.
– Multiple images are interpreted as 0.4 to 1.2 GB/run (1,200,000,000 letters/day).
– Map or align strings to one or many genomes.

Making Millions of Short Sequence Reads in Parallel

Technology Overview: Solexa/Illumina Sequencing

http://www.illumina.com/

Immobilize DNA to Surface

Source: www.illumina.com

Sequence Colonies

Call Sequence

From Debbie Nickerson, Department of Genome Sciences, University of Washington, http://tinyurl.com/6zbzh4

Sequence Alignment

Meyerson et al, 2011

So, how fast is cost going down?

2006: $10 million
2008: $100,000
2009: $10,000
2010: $5,000
2012: $1,000
???: $100
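One way to answer "how fast" is to convert each pair of price points into a halving time, assuming the cost falls exponentially between them. The sketch below uses only the figures quoted in these slides; the function name `halving_time` is an illustrative choice.

```python
import math


def halving_time(year_a, cost_a, year_b, cost_b):
    """Years needed for the cost to halve, assuming exponential decline."""
    return (year_b - year_a) / math.log2(cost_a / cost_b)


# Per-genome costs quoted on the slide (USD).
GENOME_COST_USD = {2006: 10_000_000, 2008: 100_000, 2009: 10_000,
                   2010: 5_000, 2012: 1_000}

years = sorted(GENOME_COST_USD)
for a, b in zip(years, years[1:]):
    t = halving_time(a, GENOME_COST_USD[a], b, GENOME_COST_USD[b])
    print(f"{a}-{b}: cost halves every {t:.2f} years")

# Earlier slide: cost per finished bp fell from $1.00 (1980) to $0.01 (2003).
sanger_era = halving_time(1980, 1.00, 2003, 0.01)
ngs_era = halving_time(2006, GENOME_COST_USD[2006], 2012, GENOME_COST_USD[2012])
print(f"1980-2003: {sanger_era:.2f} years; 2006-2012: {ngs_era:.2f} years")
```

Under this assumption the cost halved roughly every 3.5 years between 1980 and 2003, but roughly every 5 to 6 months between 2006 and 2012.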

Informatics

Informatics challenge: ample applications
– All genomics research can be done uniformly through sequencing (with the help of proper assay design)
– Bioinformatics turns the sequencer into a universal genomics interpreter
– Not a challenge, rather a big opportunity!

For Edison, the phonograph was not primarily designed for playing music, but ...

One Stone, Many Birds: NGS May Enable a Uniform Bioinformatics

Mapped Position: Structure/functionality (Mapping)
BP Variant: SNP & Mutation Pattern (Detecting)
Read Numbers: Quantified Abundance (Counting)

Match These Sequences

How do we match this sequence:

gattcagacctagct

With this sequence:

gtcagatcct

Possible Answers

1. No indels:
gattcagacctagct
gtcagatcct

2. With indels:
gattcaga-cctagct
g-t-cagatcct

3. No overhang:
gattcagacctagc-t
gtcagatcct

4. With overhang:
gattcagacctagct
gtcagatcct

Sequence Matching Algorithms #1

Without indels
Hamming distance
Scoring schemes
– Certain changes in sequence are more likely, due to chemical properties of the residues
BLAST algorithm
– Idea: match local regions and expand
– Seven part process
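Matching without indels can be illustrated with the Hamming distance: slide the shorter sequence along the longer one and count mismatches at each offset. This is a minimal sketch of ungapped matching only, not of BLAST; the function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def best_ungapped_offset(reference, query):
    """Try every offset of query within reference; return (offset, mismatches)."""
    candidates = [
        (offset, hamming(reference[offset:offset + len(query)], query))
        for offset in range(len(reference) - len(query) + 1)
    ]
    return min(candidates, key=lambda pair: pair[1])


print(best_ungapped_offset("gattcagacctagct", "gtcagatcct"))  # (2, 4)
```

For the two sequences from the "Match These Sequences" slide, the best ungapped placement still has 4 mismatches out of 10, which is why alignments with indels are worth considering. A real scoring scheme would also weight mismatches differently rather than counting each as 1.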

Sequence Matching Algorithms #2

With indels
Drawing of Dotplots
Dynamic Programming (getting from A to B)
– Quickest route to Z + Quickest route from Z

VPFLLMMVLGVPFMMLG
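The dynamic programming idea (the best route through an intermediate point is built from the best routes to and from it) can be sketched with the unit-cost edit distance. This is a simplified stand-in for alignment scoring, with every substitution, insertion and deletion costing 1; the function name `edit_distance` is illustrative.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                        # delete char_a
                curr[j - 1] + 1,                    # insert char_b
                prev[j - 1] + (char_a != char_b),   # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACGTTGACTC", "ACGTGACTC"))    # 1 (one deletion)
print(edit_distance("ACGTTGACTC", "AGCGTTGACTC"))  # 1 (one insertion)
print(edit_distance("ACGTTGACTC", "ACGATGACTT"))   # 2 (two substitutions)
```

Each table cell is computed once from three neighbouring cells, so the cost grows with the product of the two sequence lengths instead of with the number of possible alignments.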

Searching Databases

We have ways to score how well two sequences match
Now we want to use this in databases
– Given a known gene sequence
– Which genes in the database are closely related?
Have to worry about:
– Repeated subsequences biasing matches
– Accuracy and significance of matches
– Sensitivity and specificity (false positives and false negatives)
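Sensitivity and specificity have standard definitions in terms of true and false positives and negatives. The sketch below applies them to a database search; the counts are made up purely for illustration.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of truly related sequences that the search reports."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of unrelated sequences that the search correctly rejects."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search result: 50 related genes in the database, 45 found;
# 950 unrelated genes, 50 of them wrongly reported as matches.
print(f"sensitivity = {sensitivity(true_pos=45, false_neg=5):.2f}")   # 0.90
print(f"specificity = {specificity(true_neg=900, false_pos=50):.2f}")  # 0.95
```

Loosening the match threshold usually raises sensitivity at the expense of specificity, so the two are reported together.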

Functional Genomics—Transcriptomics

Transcriptome – the complete set of coding and non-coding RNA molecules in a cell at a particular time: Varies between cell types

Transcriptomics – the study of the transcripts in a cell, cell type, organism, etc.

Methods for Transcriptomics

Microarray-based:
– High-throughput gene expression profiling
– Hybridization of labeled cDNAs to an array of complementary DNA probes
– Measurement of expression levels based on hybridization intensity

Sequence-based:
– Full-length cDNA (FLcDNA) sequencing: complete sequencing of cDNA clone
– Expressed sequence tag (EST) sequencing: single-pass sequencing of cDNA clone
– Serial Analysis of Gene Expression (SAGE): short sequence tags at 3’ end of transcript; tags concatenated and sequenced

NGS enables whole transcriptome sequencing: Sequence Census Method

Machine Learning

Machine learning (inductive reasoning)
– Automatic proposing of hypotheses based on data
– Has many applications in bioinformatics, such as microarray analysis

Example: predictive toxicology
– Given: set of toxic drugs and a set of non-toxic drugs
– Given: background information (chemistry, etc.)
– Produces: hypothesis of why drugs are toxic (the toxic mechanism)

Overview of machine learning
– Aims, techniques, methodologies, representations
– Artificial neural networks, support vector machines, etc.

Machine Learning

Larrañaga et al. 2005

QUESTIONS?

The End!
