Post on 05-Dec-2014
QIIME Workshop
Get started by opening: http://bit.ly/mbe-qiime2012
and read up at: www.qiime.org
Greg Caporaso
gregcaporaso@gmail.com
Extract DNA and amplify marker gene with barcoded primers
Pool amplicons and sequence
Assign millions of sequences from thousands of samples to OTUs
Compute UniFrac distances and compare samples
www.qiime.org
Assign reads to samples
>GCACCTGAGGACAGGCATGAGGAA…
>GCACCTGAGGACAGGGGAGGAGGA…
>TCACATGAACCTAGGCAGGACGAA…
>CTACCGGAGGACAGGCATGAGGAT…
>TCACATGAACCTAGGCAGGAGGAA…
>GCACCTGAGGACACGCAGGACGAC…
>CTACCGGAGGACAGGCAGGAGGAA…
>CTACCGGAGGACACACAGGAGGAA…
>GAACCTTCACATAGGCAGGAGGAT…
>TCACATGAACCTAGGGGCAAGGAA…
>GCACCTGAGGACAGGCAGGAGGAA…
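Assigning reads to samples amounts to looking up the barcode at the start of each read in the per-sample barcode list. The following is only a minimal sketch, assuming fixed-length barcodes and exact matching; the barcodes, sample names, and reads are made up for illustration, and the function name `assign_reads` is hypothetical rather than part of QIIME.

```python
def assign_reads(reads, barcode_to_sample, barcode_length):
    """Group reads by sample using an exact match on the leading barcode."""
    assigned = {}    # sample ID -> list of reads with the barcode trimmed off
    unassigned = []  # reads whose barcode is not in the lookup table
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            unassigned.append(read)
        else:
            assigned.setdefault(sample, []).append(insert)
    return assigned, unassigned


# Hypothetical 4-nt barcodes and short reads, for illustration only.
barcodes = {"GCAC": "gut", "TCAC": "tongue", "CTAC": "palm"}
reads = ["GCACCTGAGG", "TCACATGAAC", "CTACCGGAGG", "GCACCTGTTT", "AAAATTTTCC"]
assigned, unassigned = assign_reads(reads, barcodes, barcode_length=4)
print(assigned)
print(unassigned)
```

Trimming the barcode at this step matches the sequences file described later, where barcodes have already been removed.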
>5000 samples in analysis pipeline
• Stream and lake water
• Marine water, sediment and reef
• Soil (forest, farm, peatland, tundra, …)
• Air
• Coalbed
• Arctic ice core
• Insect-associated
• Human-associated (gut, mouth, skin)
http://www.earthmicrobiome.org/
>5000 samples analyzed to date
Alpha diversity by environment type
Where do we look for new diversity?
* As determined by no hit to Greengenes database.
http://analytics.google.com
Running QIIME
Native installation on OS X or Linux (laptops through 16,416-core compute cluster*)
Ubuntu Linux Virtual Box
Amazon Web Services (EC2)
* http://ncar.janus.rc.colorado.edu/
IPython notebook
Moving Pictures of the Human Microbiome
• Two subjects sampled daily, one for six months, one for 18 months
• Four body sites: tongue, palm of left hand, palm of right hand, and gut (via fecal swabs).
Moving Pictures of the Human Microbiome
• Investigate the relative temporal variability of body sites.
• Is there a temporal core microbiome?
• Technical points: do we observe the same conclusions on 454 and Illumina data?
Moving Pictures of the Human Microbiome: QIIME tutorial
• A small subset of the full data set to facilitate short run time: ~0.1% of the full sequence collection.
• Sequenced across six Illumina GAIIx lanes, with a subset of the samples also sequenced on 454.
• The online tutorial contains details on all of the steps: go back and read that text.
Key QIIME files
• Mapping file: per sample meta-data, user-defined
• Input sequence file
• OTU table: sample x OTU matrix, central to downstream analyses [now in biom format]
• Parameters file: defines analyses, for use with the ‘workflow’ scripts (optional)
Mapping file
Mapping file: always run check_id_map.py
= required field
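To make the idea of required fields concrete, here is a small sketch of reading a tab-delimited mapping file and checking its header. The list of required columns is an assumption based on the usual QIIME 1 convention, not something stated on the slide, and the example file contents are invented; check_id_map.py remains the authoritative validator.

```python
import csv
import io

# Assumed required columns (QIIME 1 convention); not taken from the slide.
REQUIRED_COLUMNS = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence", "Description"]


def read_mapping(handle):
    """Return {sample_id: {column: value}} from a tab-delimited mapping file."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, records = rows[0], rows[1:]
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    mapping = {}
    for record in records:
        if record[0].startswith("#"):  # skip comment lines after the header
            continue
        sample_id = record[0]
        if sample_id in mapping:
            raise ValueError("duplicate sample ID: " + sample_id)
        mapping[sample_id] = dict(zip(header[1:], record[1:]))
    return mapping


# Hypothetical two-sample mapping file; BodySite is a user-defined column.
example = "\n".join([
    "\t".join(["#SampleID", "BarcodeSequence", "LinkerPrimerSequence", "BodySite", "Description"]),
    "\t".join(["S1", "GCAC", "GTGCCAGC", "gut", "subject 1 gut"]),
    "\t".join(["S2", "TCAC", "GTGCCAGC", "tongue", "subject 1 tongue"]),
])
mapping = read_mapping(io.StringIO(example))
print(mapping["S1"]["BodySite"])
```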
Sequences file
>[sampleID_seqID] description
Barcodes have been removed!!
Sequences file: can be user-provided, or generated by split_libraries.py
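Because each header has the form `>sampleID_seqID description`, the sample a read belongs to can be recovered from the header alone. A minimal sketch, assuming the sequence ID is whatever follows the last underscore of the first whitespace-delimited token; the FASTA records and the function name `seqs_per_sample` are invented for illustration.

```python
from collections import Counter


def seqs_per_sample(fasta_lines):
    """Count sequences per sample from '>sampleID_seqID description' headers."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            label = line[1:].split()[0]          # 'sampleID_seqID'
            sample_id = label.rsplit("_", 1)[0]  # drop the trailing seqID
            counts[sample_id] += 1
    return counts


# Hypothetical demultiplexed sequences (barcodes already removed).
fasta = [
    ">gut.1_0 read_a",
    "CTGAGGACAGG",
    ">gut.1_1 read_b",
    "CTGAGGACACG",
    ">palm.1_2 read_c",
    "CGGAGGACAGG",
]
counts = seqs_per_sample(fasta)
print(dict(counts))
```

Per-sample counts like these are what drive the choice of an even sampling depth later on.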
OTU table (classic format)
sample x OTU matrix
OTU identifiers
Sample identifiers
Optional per OTU taxonomic information
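The three parts of the classic table (OTU identifiers, sample identifiers, optional taxonomy) can be pulled apart with a few lines of parsing. This is a sketch under assumptions: the table is tab-delimited, the header line starts with `#OTU ID`, and the optional taxonomy column is named `Consensus Lineage`. The example table is invented.

```python
def parse_classic_otu_table(lines):
    """Parse a tab-delimited classic OTU table.

    Returns (sample_ids, counts, taxonomy): counts maps OTU ID to a list of
    per-sample counts, taxonomy maps OTU ID to its lineage string (if present).
    """
    sample_ids, has_taxonomy = None, False
    counts, taxonomy = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if fields[0] == "#OTU ID":
            # Assumed name of the optional trailing taxonomy column.
            has_taxonomy = fields[-1] == "Consensus Lineage"
            sample_ids = fields[1:-1] if has_taxonomy else fields[1:]
        elif line.startswith("#"):
            continue  # other comment lines
        else:
            if sample_ids is None:
                raise ValueError("data row found before the '#OTU ID' header")
            values = fields[1:-1] if has_taxonomy else fields[1:]
            counts[fields[0]] = [int(float(v)) for v in values]
            if has_taxonomy:
                taxonomy[fields[0]] = fields[-1]
    return sample_ids, counts, taxonomy


# Hypothetical table with two samples and two OTUs.
table_lines = [
    "# hypothetical OTU table",
    "#OTU ID\tgut\ttongue\tConsensus Lineage",
    "0\t5\t0\tk__Bacteria; p__Firmicutes",
    "1\t0\t3\tk__Bacteria; p__Bacteroidetes",
]
sample_ids, counts, taxonomy = parse_classic_otu_table(table_lines)
print(sample_ids, counts, taxonomy)
```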
http://biom-format.org
OTU tables are now in biological observation matrix (.biom) format
(QIIME 1.4.0-dev and later)
Google: “biom format”
See convert_biom.py for translating between classic and biom OTU tables
sample x observation contingency matrix
Observation counts
Marker gene (e.g., 16S) surveys
Comparative genomics
Metagenomics
Metatranscriptomics
Metabolomics . . .
http://www.biom-format.org
The Biological Observation Matrix (BIOM) Format or: How I Learned To Stop Worrying and Love the Ome-ome
JSON-based format for representing arbitrary sample x observation contingency tables with optional metadata
McDonald et al., GigaScience (2012).
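To give a feel for what a JSON-based sample x observation table looks like, here is a hand-written sketch. The field names follow my understanding of the BIOM 1.0 layout and should be checked against the specification at biom-format.org; the table contents and the helper `to_dense` are invented for illustration and are not part of the biom-format package.

```python
import json

# Hypothetical table: 3 observations (OTUs) x 2 samples, stored sparsely.
biom_table = {
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "type": "OTU table",
    "generated_by": "hand-written example",
    "date": "2012-01-01T00:00:00",
    "rows": [  # observations, with optional metadata
        {"id": "OTU0", "metadata": {"taxonomy": ["k__Bacteria", "p__Firmicutes"]}},
        {"id": "OTU1", "metadata": None},
        {"id": "OTU2", "metadata": None},
    ],
    "columns": [  # samples, with optional metadata
        {"id": "gut", "metadata": {"BodySite": "gut"}},
        {"id": "tongue", "metadata": None},
    ],
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [3, 2],
    # Sparse entries are [row index, column index, value].
    "data": [[0, 0, 5], [1, 1, 3], [2, 0, 2], [2, 1, 1]],
}


def to_dense(table):
    """Expand the table's data into a dense observation x sample list of lists."""
    n_rows, n_cols = table["shape"]
    if table["matrix_type"] == "dense":
        return [list(row) for row in table["data"]]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in table["data"]:
        dense[row][col] = value
    return dense


# Round-trip through JSON text, as would happen when writing and reading a file.
round_tripped = json.loads(json.dumps(biom_table))
dense = to_dense(round_tripped)
print(dense)
```

The sparse layout is what keeps large tables with mostly zero counts compact.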
Comparative genomic (B) and metagenome analysis (C) with QIIME
Working with OTU tables
• single_rarefaction.py: even sampling (very important if you have different numbers of seqs/sample!)
• filter_otus_from_otu_table.py
• filter_samples_from_otu_table.py
• per_library_stats.py
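Even sampling means drawing the same number of sequences from every sample, without replacement. The sketch below illustrates the idea only and is not the code behind single_rarefaction.py; dropping samples that have fewer sequences than the requested depth is an assumption about the desired behavior, and the table and function names are hypothetical.

```python
import random


def rarefy_sample(otu_counts, depth, rng):
    """Subsample one sample's OTU counts to `depth` sequences without replacement."""
    if sum(otu_counts.values()) < depth:
        return None  # too shallow to rarefy to this depth
    # Expand counts into one entry per sequence, then draw `depth` of them.
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    rarefied = {}
    for otu in rng.sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied


def rarefy_table(table, depth, seed=0):
    """Rarefy every sample to the same depth, dropping samples that are too shallow."""
    rng = random.Random(seed)
    result = {}
    for sample_id, otu_counts in table.items():
        rarefied = rarefy_sample(otu_counts, depth, rng)
        if rarefied is not None:
            result[sample_id] = rarefied
    return result


# Hypothetical table with very different numbers of sequences per sample.
table = {
    "gut": {"OTU0": 400, "OTU1": 100},
    "tongue": {"OTU0": 30, "OTU2": 70},
    "palm": {"OTU1": 20},
}
even = rarefy_table(table, depth=50)
print(even)
```

Expanding counts into a per-sequence pool is simple but memory-hungry; it is fine for a toy table and would need a smarter sampler for millions of sequences.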
OTU picking: terminology
OTU picking
• De novo
– Reads are clustered based on similarity to one another.
• Reference-based
– Closed reference: any reads which don’t hit a reference sequence are discarded
– Open reference: any reads which don’t hit a reference sequence are clustered de novo
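The three strategies differ only in what happens to reads that fail to match a reference. The toy sketch below makes that explicit. It assumes equal-length sequences and uses the fraction of matching positions as a stand-in for real sequence identity; the greedy centroid clustering, the function names, and the example sequences are all illustrative assumptions, not QIIME's implementation.

```python
def identity(a, b):
    """Fraction of matching positions; a toy stand-in for alignment-based identity."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, refs, threshold):
    """Assign each read to its best-matching reference; collect reads that fail."""
    clusters, failures = {}, []
    for read in reads:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in refs.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            clusters.setdefault(best_id, []).append(read)
        else:
            failures.append(read)  # discarded in closed-reference picking
    return clusters, failures


def de_novo(reads, threshold, prefix="denovo"):
    """Greedy clustering: a read joins the first centroid it matches, else seeds a new OTU."""
    centroids, clusters = [], {}
    for read in reads:
        for otu_id, seed in centroids:
            if identity(read, seed) >= threshold:
                clusters[otu_id].append(read)
                break
        else:
            otu_id = prefix + str(len(centroids))
            centroids.append((otu_id, read))
            clusters[otu_id] = [read]
    return clusters


def open_reference(reads, refs, threshold):
    """Closed-reference first, then cluster the failures de novo."""
    clusters, failures = closed_reference(reads, refs, threshold)
    clusters.update(de_novo(failures, threshold))
    return clusters


# Hypothetical reference set and reads.
refs = {"ref1": "AAAAAAAAAA", "ref2": "CCCCCCCCCC"}
reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "GGGGGGGGGG", "GGGGGGGGGT"]
closed_otus, failures = closed_reference(reads, refs, threshold=0.9)
open_otus = open_reference(reads, refs, threshold=0.9)
print(closed_otus, failures)
print(open_otus)
```

In this toy data the two G-rich reads are lost under closed-reference picking but form a new OTU under open-reference picking.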
De novo OTU picking
• Pros
– All reads are clustered
• Cons
– Not parallelizable
– OTUs may be defined by erroneous reads
Closed-reference OTU picking
• Pros
– Built-in quality filter
– Easily parallelizable
– OTUs are defined by high-quality, trusted sequences
• Cons
– Reads that don’t hit reference dataset are excluded, so you can never observe new OTUs
Percentage of reads that do not hit the reference collection, by environment type.
Open-reference OTU picking
• Pros
– All reads are clustered
– Partially parallelizable
• Cons
– Only partially parallelizable
– Mix of high quality sequences defining OTUs (i.e., the database sequences) and possible low quality sequences defining OTUs (i.e., the sequencing reads)
Considerations in analysis
Variation in sampling depth is an important consideration
Human skin, colored by individual, at 500 sequence/sample
Image/analysis credit: Justin Kuczynski
Data reference: Forensic identification using skin bacterial communities. Fierer N, Lauber CL, Zhou N, McDonald D, Costello EK, Knight R. Proc Natl Acad Sci U S A. 2010 Apr 6;107(14):6477-81.
Variation in sampling depth is an important consideration
Human skin, colored by sampling depth, at either 50 (blue) or 500 (red) sequences/sample
Variation in sampling depth is an important consideration
How deep is deep enough?
It depends on the question…
– Differences between community types: not many sequences.
– Rare biosphere: more (but be careful about sequencing noise!)
100 sequences/sample 10 sequences/sample 1 sequence/sample
Direct sequencing of the human microbiome readily reveals community differences. J Kuczynski et al. Genome Biology (2011).
Can we get accurate taxonomic assignment from short reads?
Extra slides
This work is licensed under the Creative Commons Attribution 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Feel free to use or modify these slides, but please credit me by placing the following attribution information where you feel that it makes sense: Greg Caporaso, www.caporaso.us.