Caporaso sloan qiime_workshop_slides_18_oct2012

QIIME Workshop

Get started by opening: http://bit.ly/mbe-qiime2012

and read up at: www.qiime.org

Greg Caporaso

gregcaporaso@gmail.com

Extract DNA and amplify marker gene with barcoded primers

Pool amplicons and sequence

Assign millions of sequences from thousands of samples to OTUs

Compute UniFrac distances and compare samples
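As a rough illustration of the last step, here is a minimal sketch of unweighted UniFrac between two samples. It assumes the tree has already been reduced to a list of branches, each with a length and the set of samples that have sequences below it. The function name, the data layout, and the toy branches are hypothetical and are not QIIME's own implementation.

```python
def unweighted_unifrac(branches, sample_a, sample_b):
    """Fraction of observed branch length that leads to only one of the two samples.

    branches: list of (length, set of sample IDs with descendants on that branch).
    """
    unique_length = 0.0
    observed_length = 0.0
    for length, samples in branches:
        in_a = sample_a in samples
        in_b = sample_b in samples
        if in_a or in_b:
            observed_length += length
            if in_a != in_b:
                # Branch leads to sequences from exactly one of the two samples.
                unique_length += length
    if observed_length == 0:
        raise ValueError("neither sample is present in the tree")
    return unique_length / observed_length


# Hypothetical toy tree: two private branches, one shared branch, one unrelated branch.
toy_branches = [
    (1.0, {"gut"}),
    (1.0, {"tongue"}),
    (2.0, {"gut", "tongue"}),
    (0.5, {"palm"}),
]

gut_vs_tongue = unweighted_unifrac(toy_branches, "gut", "tongue")
gut_vs_palm = unweighted_unifrac(toy_branches, "gut", "palm")
print(gut_vs_tongue, gut_vs_palm)
```

Samples that share no branch length get a distance of 1, and samples with identical membership get 0.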

www.qiime.org

Assign reads to samples

>GCACCTGAGGACAGGCATGAGGAA…
>GCACCTGAGGACAGGGGAGGAGGA…
>TCACATGAACCTAGGCAGGACGAA…
>CTACCGGAGGACAGGCATGAGGAT…
>TCACATGAACCTAGGCAGGAGGAA…
>GCACCTGAGGACACGCAGGACGAC…
>CTACCGGAGGACAGGCAGGAGGAA…
>CTACCGGAGGACACACAGGAGGAA…
>GAACCTTCACATAGGCAGGAGGAT…
>TCACATGAACCTAGGGGCAAGGAA…
>GCACCTGAGGACAGGCAGGAGGAA…
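The idea of assigning reads to samples by their leading barcode can be sketched as follows. This is only an illustration: the barcode length, the barcode-to-sample map, and the toy reads are made up, and exact prefix matching is a simplification (no error correction or quality filtering).

```python
def assign_reads_to_samples(reads, barcode_to_sample, barcode_length):
    """Group reads by the sample whose barcode matches the start of each read."""
    assigned = {}
    unassigned = []
    for read in reads:
        barcode = read[:barcode_length]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            unassigned.append(read)
            continue
        # Store the read with its barcode removed.
        assigned.setdefault(sample, []).append(read[barcode_length:])
    return assigned, unassigned


# Hypothetical 4-base barcodes and toy reads.
toy_barcodes = {"GCAC": "sampleA", "TCAC": "sampleB", "CTAC": "sampleC"}
toy_reads = ["GCACAAAA", "TCACGGGG", "GCACTTTT", "NNNNCCCC"]

assigned, unassigned = assign_reads_to_samples(toy_reads, toy_barcodes, 4)
print(assigned)
print(unassigned)
```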

>5000 samples in analysis pipeline

• Stream and lake water

• Marine water, sediment and reef

• Soil (forest, farm, peatland, tundra, …)

• Air

• Coalbed

• Arctic ice core

• Insect-associated

• Human-associated (gut, mouth, skin)

http://www.earthmicrobiome.org/

>5000 samples analyzed to date

Alpha diversity by environment type

Where do we look for new diversity?

* As determined by no hit to Greengenes database.

http://analytics.google.com

Running QIIME

Native installation on OS X or Linux (laptops through 16,416-core compute cluster*)

Ubuntu Linux Virtual Box

Amazon Web Services (EC2)

* http://ncar.janus.rc.colorado.edu/

IPython notebook

Moving Pictures of the Human Microbiome

• Two subjects sampled daily, one for six months, one for 18 months

• Four body sites: tongue, palm of left hand, palm of right hand, and gut (via fecal swabs).

Moving Pictures of the Human Microbiome

• Investigate the relative temporal variability of body sites.

• Is there a temporal core microbiome?

• Technical points: do we observe the same conclusions on 454 and Illumina data?

Moving Pictures of the Human Microbiome: QIIME tutorial

• A small subset of the full data set to facilitate short run time: ~0.1% of the full sequence collection.

• Sequenced across six Illumina GAIIx lanes, with a subset of the samples also sequenced on 454.

• The online tutorial contains details on all of the steps: go back and read that text.

Key QIIME files

• Mapping file: per sample meta-data, user-defined

• Input sequence file

• OTU table: sample x OTU matrix, central to downstream analyses [now in biom format]

• Parameters file: defines analyses, for use with the ‘workflow’ scripts (optional)

Mapping file

Mapping file: always run check_id_map.py

= required field

Sequences file

>[sampleID_seqID] description

Barcodes have been removed!!

Sequences file: can be user-provided, or generated by split_libraries.py
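To make the label convention concrete, here is a small sketch that counts sequences per sample from labels of the form sampleID_seqID. The function name and the toy records are hypothetical. Splitting on the last underscore is an assumption made for this sketch, so that a sample ID is never cut short.

```python
from collections import Counter


def count_seqs_per_sample(fasta_lines):
    """Count sequences per sample from '>sampleID_seqID description' label lines."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a label
        fields = line[1:].split()
        if not fields:
            raise ValueError("empty sequence label")
        # The label is the first whitespace-delimited field; the rest is description.
        sample_id, separator, _seq_id = fields[0].rpartition("_")
        if not separator or not sample_id:
            raise ValueError("label is not of the form sampleID_seqID: %r" % fields[0])
        counts[sample_id] += 1
    return counts


# Hypothetical toy records (barcodes already removed).
toy_fasta = [
    ">subject1.gut_0 hypothetical description",
    "ACGT",
    ">subject1.gut_1 hypothetical description",
    "ACGA",
    ">subject2.tongue_0 hypothetical description",
    "TTGA",
]

seq_counts = count_seqs_per_sample(toy_fasta)
print(seq_counts)
```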

OTU table (classic format)

sample x OTU matrix

OTU identifiers

Sample identifiers

Optional per OTU taxonomic information

http://biom-format.org

OTU tables are now in biological observation matrix (.biom) format

(QIIME 1.4.0-dev and later)

Google: “biom format”

See convert_biom.py for translating between classic and biom OTU tables

sample x observation contingency matrix

Observation counts

Marker gene (e.g., 16S) surveys

Comparative genomics

Metagenomics

Metatranscriptomics

Metabolomics . . .

http://www.biom-format.org

The Biological Observation Matrix (BIOM) Format or: How I Learned To Stop Worrying and Love the Ome-ome

JSON-based format for representing arbitrary sample x observation contingency tables with optional metadata
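A rough sketch of what such a JSON table might look like, with a helper that expands the sparse data into a dense matrix. The field names follow my recollection of the BIOM 1.0 layout and should be checked against the specification at biom-format.org; the OTU IDs, sample IDs, and counts are invented. In this sketch, rows are observations and columns are samples.

```python
import json

# Hypothetical table: field names are an assumption to verify against the BIOM spec.
biom_like = {
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "type": "OTU table",
    "generated_by": "hand-written example",
    "date": "2012-10-18T00:00:00",
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [3, 2],
    "rows": [
        {"id": "OTU_1", "metadata": None},
        {"id": "OTU_2", "metadata": None},
        {"id": "OTU_3", "metadata": None},
    ],
    "columns": [
        {"id": "sampleA", "metadata": None},
        {"id": "sampleB", "metadata": None},
    ],
    # Sparse entries as [row index, column index, count]; missing entries are zero.
    "data": [[0, 0, 5], [1, 1, 3], [2, 0, 1], [2, 1, 7]],
}


def to_dense(table):
    """Expand the sparse [row, column, value] entries into a list of rows."""
    n_rows, n_cols = table["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in table["data"]:
        dense[row][col] = value
    return dense


# Round-trip through JSON text, as the table would be stored on disk.
restored = json.loads(json.dumps(biom_like))
dense = to_dense(restored)
print(dense)
```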

McDonald et al., GigaScience (2012).

Comparative genomic (B) and metagenome analysis (C) with QIIME

Working with OTU tables

• single_rarefaction.py: even sampling (very important if you have different numbers of seqs/sample!)

• filter_otus_from_otu_table.py

• filter_samples_from_otu_table.py

• per_library_stats.py
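Even sampling (rarefaction) can be illustrated with a short sketch: each sample is subsampled without replacement to the same depth. This is not the code behind single_rarefaction.py; the function names, the toy table, and the choice to drop samples with fewer sequences than the requested depth are assumptions made for the sketch.

```python
import random


def rarefy(counts, depth, rng):
    """Subsample one sample's OTU counts to `depth` sequences without replacement."""
    if sum(counts.values()) < depth:
        return None  # too few sequences to sample at this depth
    # One pool entry per observed sequence, labeled by its OTU.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    rarefied = {}
    for otu in rng.sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied


def rarefy_table(table, depth, seed=0):
    """Rarefy every sample to the same depth; samples below the depth are dropped."""
    rng = random.Random(seed)
    result = {}
    for sample, counts in table.items():
        rarefied = rarefy(counts, depth, rng)
        if rarefied is not None:
            result[sample] = rarefied
    return result


# Hypothetical toy table with very different numbers of sequences per sample.
toy_table = {
    "sampleA": {"OTU_1": 50, "OTU_2": 30},
    "sampleB": {"OTU_1": 5, "OTU_3": 15},
    "sampleC": {"OTU_2": 4},
}

even_table = rarefy_table(toy_table, depth=20)
print(even_table)
```

After this step every remaining sample has exactly 20 sequences, so diversity comparisons are not driven by sequencing effort.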

OTU picking: terminology

OTU picking

• De Novo

– Reads are clustered based on similarity to one another.

• Reference-based

– Closed reference: any reads which don’t hit a reference sequence are discarded

– Open reference: any reads which don’t hit a reference sequence are clustered de novo
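The three strategies can be contrasted in a toy sketch. Everything here is a simplification chosen for illustration: similarity is the fraction of matching positions between equal-length strings (real OTU pickers use alignment-based methods), clustering is a simple greedy pass, and the function names, threshold, and sequences are hypothetical.

```python
def similarity(a, b):
    """Fraction of matching positions between two toy sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def pick_de_novo(reads, threshold):
    """Greedy clustering: a read joins the first centroid it matches, else founds an OTU."""
    otus = {}  # centroid sequence -> member reads
    for read in reads:
        for centroid in otus:
            if similarity(read, centroid) >= threshold:
                otus[centroid].append(read)
                break
        else:
            otus[read] = [read]
    return otus


def pick_closed_reference(reads, references, threshold):
    """Assign reads to reference sequences; reads with no hit are returned as failures."""
    otus = {}
    failures = []
    for read in reads:
        for ref_id, ref_seq in references.items():
            if similarity(read, ref_seq) >= threshold:
                otus.setdefault(ref_id, []).append(read)
                break
        else:
            failures.append(read)
    return otus, failures


def pick_open_reference(reads, references, threshold):
    """Closed-reference first, then cluster the failures de novo instead of discarding."""
    otus, failures = pick_closed_reference(reads, references, threshold)
    for i, members in enumerate(pick_de_novo(failures, threshold).values()):
        otus["denovo%d" % i] = members
    return otus


# Hypothetical reference collection and reads.
toy_refs = {"ref1": "AAAAAAAAAA"}
toy_reads = ["AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG", "AAAAAAAAAA"]

closed_otus, discarded = pick_closed_reference(toy_reads, toy_refs, 0.9)
open_otus = pick_open_reference(toy_reads, toy_refs, 0.9)
de_novo_otus = pick_de_novo(toy_reads, 0.9)
print(closed_otus, discarded)
print(open_otus)
print(de_novo_otus)
```

In the toy data, closed-reference picking discards the two C-rich reads, while open-reference picking keeps them as a new OTU.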

De novo OTU picking

• Pros

– All reads are clustered

• Cons

– Not parallelizable

– OTUs may be defined by erroneous reads

Closed-reference OTU picking

• Pros

– Built-in quality filter

– Easily parallelizable

– OTUs are defined by high-quality, trusted sequences

• Cons

– Reads that don’t hit reference dataset are excluded, so you can never observe new OTUs

Percentage of reads that do not hit the reference collection, by environment type.

Open-reference OTU picking

• Pros

– All reads are clustered

– Partially parallelizable

• Cons

– Only partially parallelizable

– Mix of high quality sequences defining OTUs (i.e., the database sequences) and possible low quality sequences defining OTUs (i.e., the sequencing reads)

Considerations in analysis

Variation in sampling depth is an important consideration

Human skin, colored by individual, at 500 sequence/sample

Image/analysis credit: Justin Kuczynski

Data reference: Forensic identification using skin bacterial communities. Fierer N, Lauber CL, Zhou N, McDonald D, Costello EK, Knight R. Proc Natl Acad Sci U S A. 2010 Apr 6;107(14):6477-81.

Human skin, colored by sampling depth, at either 50 or 500 sequences/sample

Human skin, colored by sampling depth, at either 50 (blue) or 500 (red) sequences/sample

How deep is deep enough?

It depends on the question…

– Differences between community types: not many sequences.

– Rare biosphere: more (but be careful about sequencing noise!)

Panels: 100 sequences/sample; 10 sequences/sample; 1 sequence/sample

Direct sequencing of the human microbiome readily reveals community differences. J Kuczynski et al. Genome Biology (2011).

How deep is deep enough?

Figure 1

Can we get accurate taxonomic assignment from short reads?

Extra slides

This work is licensed under the Creative Commons Attribution 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Feel free to use or modify these slides, but please credit me by placing the following attribution information where you feel that it makes sense: Greg Caporaso, www.caporaso.us.
