Date post: 28-Mar-2015
Page 1: Andrew Meade (A.Meade@Reading.ac.uk), School of Biological Sciences.

Andrew Meade (A.Meade@Reading.ac.uk)
School of Biological Sciences

Page 2:

Molecular sequence growth rates: from 600 to 100 million sequences in 25 years

Human Genome project

Page 3:

Molecular sequence growth rates

18 million new sequences a year (2007 – 2008)
Rate of growth is accelerating
Doubling every 2 years
Likely to continue with new sequencing technology
Cost, time and technical ability required have all fallen
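As a rough illustration of what "doubling every 2 years" implies, the sketch below projects a sequence count forward under constant exponential growth. The starting value of 100 million comes from the earlier slide; the function name and the chosen horizons are hypothetical, and real growth need not stay exponential.

```python
def projected_sequences(current: float, years_ahead: float,
                        doubling_time: float = 2.0) -> float:
    """Project a count forward, assuming it doubles every `doubling_time` years."""
    return current * 2 ** (years_ahead / doubling_time)


if __name__ == "__main__":
    start = 100e6  # 100 million sequences, the figure quoted on the slides
    for years in (2, 4, 10):
        # Horizons are arbitrary examples, not figures from the talk.
        print(f"after {years:2d} years: {projected_sequences(start, years):.3g}")
```

Under this assumption the collection grows 32-fold in ten years, which is the scale any analysis method would have to keep up with.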

Page 4:

It’s worse than it looks

Lack of suitable tools for sequence analysis
Analysis methods don’t always scale linearly
Methods have changed:
  Simple heuristics -> Statistical methods
  Simple rules -> More realistic models
  Descriptive results -> Biological process
  Sub system analysis -> Systems biology
Computing power is a major rate-limiting step

The gap between data and analytical methods is widening

Page 5:

Tools for genomic analysis

Current tools:
Co-opted for purpose
Designed for smaller data sets
Limited to a single computer
External data required
Hard to generalise

Required tools:
Custom built
Limited by available hardware
Use available computers
Models derived from data
Identify informative information in the data

Page 6:

454 parallel sequencing

Fast, 400-600 million bases per 10 hours
  Human genome in 100 hours, HGP 13 years
Cheap, 20¢ per kb, currently $12
  Human genome for $100,000, HGP $10 billion
Accurate, 99% accurate on 400th base
Small chunks, 400 – 800 bases per sequence
Similar to parallel computing, hard to convert raw power to useful results
The catch - analysis

Page 7:

454 sequencing

Sequence populations of bacteria (16S) taken from cow guts under different experimental conditions
Identify how changes in feed affect bacterial populations
332,000 sequences in total
£8,000 using 454, previously over £2 million

Page 8:

454 sequencing analysis

Find how closely related the sequences are to each other
Perform an approximate match between all pairs of sequences, allowing for insertions, deletions and mutations
332,000^2 * 0.5 = 5.5 * 10^10 comparisons
874 years on a single computer
Trivially parallel task, easy to distribute over nodes, different clusters, different OS / hardware
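The all-pairs comparison can be sketched as follows. The slides do not name the matching algorithm, so this sketch assumes plain Levenshtein edit distance as one measure that allows for insertions, deletions and substitutions. The helper names and the round-robin split of rows across workers are hypothetical, chosen only to show why the task distributes so easily.

```python
from typing import Iterator, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def pair_count(n: int) -> int:
    """Number of unordered pairs among n sequences (about n^2 / 2)."""
    return n * (n - 1) // 2


def pairs_for_worker(n: int, worker: int, n_workers: int) -> Iterator[Tuple[int, int]]:
    """Yield the (i, j) pairs, i < j, owned by one worker.

    Rows are dealt out round-robin; no worker needs to talk to another.
    """
    for i in range(worker, n, n_workers):
        for j in range(i + 1, n):
            yield i, j


if __name__ == "__main__":
    seqs = ["ACGT", "ACGA", "AGT", "TTTT"]  # toy data, not from the study
    for w in range(2):
        for i, j in pairs_for_worker(len(seqs), w, 2):
            print(w, i, j, edit_distance(seqs[i], seqs[j]))
    print(pair_count(332_000))  # exact pair count for the 332,000 sequences
```

Each worker only needs the sequence file and its own worker index, which is what makes the job straightforward to spread over nodes, clusters and operating systems.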

Page 9:

454 sequencing analysis 2

Cluster sequences from the previous step to find what species are present and in what quantities
102 GB of data
Distributed code to reduce memory and processing requirements
Linear scaling (memory, CPU) up to 200 nodes
Problems with disk access
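One way to picture the clustering step is shown below. The slides do not say which clustering method was used, so this sketch assumes single-linkage grouping with a distance cut-off, implemented with a union-find structure. The function name, the threshold and the toy distances are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def cluster_sizes(n: int,
                  distances: Iterable[Tuple[int, int, float]],
                  threshold: float) -> List[int]:
    """Group n sequences by linking every pair whose distance is <= threshold.

    Returns the cluster sizes, largest first. Under the assumption that one
    cluster stands for one species, the sizes stand for relative quantities.
    """
    parent = list(range(n))

    def find(x: int) -> int:
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, d in distances:
        if d <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri  # merge the two clusters

    return sorted(Counter(find(x) for x in range(n)).values(), reverse=True)


if __name__ == "__main__":
    # Toy pairwise distances (i, j, distance); not data from the study.
    toy = [(0, 1, 1), (1, 2, 2), (3, 4, 1), (0, 3, 9)]
    print(cluster_sizes(5, toy, threshold=2))
```

Because the distances are consumed as a stream, only the `parent` array has to fit in memory, which hints at how memory use can be kept down even when the pairwise results run to many gigabytes.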

Page 10:

Bayesian phylogenetic inference

Infer evolutionary histories (phylogenies) from molecular data
Widely used in all areas of biology
Used to investigate how genes and proteins change and adapt to their environment
How viruses spread and mutate
Reconstruct ancestral genes and proteins
Used in conservation studies to identify species that are most at risk of extinction and most valuable to conserve

Page 11:

Mammal Mitochondrial

44 taxa
13 protein-coding regions
16,400 nucleotides

Page 12:

Mammal Mitochondrial scaling

[Figure: run time against number of computers]
1 computer: ~70 days
60 computers: ~2 days
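The two run times quoted for the mammal mitochondrial data set can be turned into a speedup and a parallel efficiency. This is a minimal sketch with hypothetical function names, and the inputs are the approximate "~" figures from the slide, so the results are only indicative.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """How many times faster the parallel run is than the single-computer run."""
    return t_single / t_parallel


def efficiency(t_single: float, t_parallel: float, n_computers: int) -> float:
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup(t_single, t_parallel) / n_computers


if __name__ == "__main__":
    days_on_1, days_on_60 = 70.0, 2.0  # approximate values read off the slide
    print(f"speedup:    {speedup(days_on_1, days_on_60):.1f}x")
    print(f"efficiency: {efficiency(days_on_1, days_on_60, 60):.0%}")
```

An efficiency below 100% means the run on 60 computers is slower than perfect linear scaling would give.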

