Post on 17-May-2018
Hospital Microbiome Project
QIIME Analysis 1
Asli Yazağan ayazagan.com
Contents
16S rRNA SEQUENCING DATA ANALYSIS TUTORIAL WITH QIIME
Report Overview
How to Obtain Microbiome Data
How to Setup QIIME
Essential files for QIIME
Sequence File (.fna)
Quality File (.qual)
Mapping File
Basic Statistics on Sequence Data
OTU Picking
Basic Statistics on OTU Table
OTU Heatmap
Data Analysis
Summarize Communities by Taxonomic Composition
Investigating Alpha Diversity
Identifying Differentially Abundant OTUs
Normalizing OTU Table
Beta-diversity and PCoA
Jackknifed Beta Diversity Analysis
Make Bootstrapped Tree
Comparing Categories
Conclusion
REFERENCES
Tables and Figures
Figure 1. FASTQ File Format
Figure 2. Mothur output for sequence summary
Figure 3. Summary for biom file
Figure 4. rep_set_tax_assignments.txt
Figure 5. Heatmap for HMP data
Figure 6. Pie plot of the degree of sharing of microbial taxa in 14 collected samples from 7 different points at a four-month interval in a hospital room.
Figure 7. Area plot of the degree of sharing of microbial taxa in 14 collected samples from 7 different points at a four-month interval in a hospital room.
Figure 8. Bar plot of the degree of sharing of microbial taxa in 14 collected samples from 7 different points at a four-month interval in a hospital room.
Figure 9. Microbial composition of the microbial taxa in 14 collected samples
Figure 10. Rarefaction plot for date_s
Figure 11. Rarefaction plot for sample_type_s
Figure 12. Diff_otus.txt for Computer Mouse and Countertop
Figure 13. MA plot for differential abundance of Computer Mouse and Countertop
Figure 14. Dispersion Estimate Plot for Differential Abundance of Computer Mouse and Countertop
Figure 15. MA plot for Computer Mouse Samples
Figure 16. Dispersion Estimate Plot for Computer Mouse Samples
Figure 17. PCoA plot for the bacterial community collected in the hospital room. Communities were characterized by samples collected in February and April. Bray-Curtis is used as distance metric.
Figure 18. PCoA plot for the bacterial community collected in the hospital room.
Figure 19. 3D PCoA Plots for HMP samples
Figure 20. Distance Boxplot for Surface type
Figure 21. Distance Comparison among surface types
Figure 22. Jackknifed UPGMA clustering (using the weighted UniFrac metric) showing the similarity of bacterial communities based on 16S rRNA genes.
16S rRNA SEQUENCING DATA ANALYSIS TUTORIAL WITH QIIME
Report Overview
The rapid progress of DNA sequencing techniques has changed metagenomics research and data
analysis over the past few years. Sequencing of the 16S rRNA gene has become a relatively easy
way to study microbial composition and diversity (Fierer et al., 2007). High-throughput
bioinformatics analyses increasingly rely on pipeline frameworks to process sequence data and
metadata. Popular bioinformatics pipelines in the literature are QIIME, mothur and UPARSE. In
this study, QIIME (Quantitative Insights Into Microbial Ecology) (Caporaso et al., 2010), an
open-source bioinformatics pipeline, is used to perform microbiome analysis from raw DNA
sequencing data. QIIME is designed to create quality graphics and statistics from raw
sequencing data generated on Illumina or other platforms. A typical QIIME analysis workflow
consists of demultiplexing, quality filtering, clustering (OTU picking), chimera removal,
taxonomic assignment, phylogenetic reconstruction, and diversity analyses and visualizations.
This document is organized as an introductory tutorial on how to analyze 16S sequencing
data using current methods. Microbiome analysis starts from a few basic questions about the
data. This tutorial covers the following questions:
1. Proportionally, what microbes are found in each sample community?
2. How many species are in each sample?
3. Are there species significantly more abundant in one set of samples than in another?
4. How much does diversity change between samples?
5. Do different sample groupings significantly differ in their microbial composition?
This document is structured around these questions: each section is primarily concerned
with how to answer one particular question about the microbiome data.
How to Obtain Microbiome Data
The Sequence Read Archive (SRA) is a bioinformatics database that provides a public
repository for DNA sequencing data obtained with next-generation sequencing (NGS) technologies.
Raw sequence data and metadata can be searched as well as downloaded for further
downstream analysis.
Biotechnology companies such as 454, Ion Torrent, Illumina, SOLiD, Helicos and
Complete Genomics provide a line of products and services for sequencing, genotyping and gene
expression. Illumina is one of the successful companies whose technology has reduced the cost
of sequencing a human genome to a reasonable price. Since Illumina will eventually be used for
sequencing in this project, the SRA database was searched for 16S rRNA data generated on
Illumina systems, and the Hospital Microbiome Project data were obtained from it.
Every experiment in the SRA database has an accession code and metadata such as the study
abstract, experiment attributes and the owner of the data. The raw sequence data for an
experiment can be downloaded in FASTA or FASTQ format using its accession code.
The Hospital Microbiome Project (HMP) (Shogan et al., 2013) aims to collect microbial
samples from surfaces, air, staff, and patients at the University of Chicago's new hospital
pavilion, involving 10 patient rooms, 2 nursing stations, staff, water and air sampling, both daily
and weekly over a year, in order to better understand the factors that influence bacterial
population development in health care environments.
As a preliminary exploration, a small data set from HMP was analyzed. Data collected
from seven different points (countertop, computer mouse, station phone, chair armrest, corridor
floor, hot tap water faucet and cold tap water faucet) in the same room (S10) at two different
time points (27/02/2013 and 17/04/2013) were used.
How to Setup QIIME
QIIME is a software package of Python wrapper scripts that can be downloaded and
used on Linux systems. It can also be run on Windows inside a VirtualBox virtual machine. I used
QIIME version 1.9.0 in VirtualBox on Windows.
Essential files for QIIME
QIIME works with the FASTQ file format. A FASTQ file uses four lines per sequence. A
typical sequence file in FASTQ format is structured as follows:
Line 1 begins with a '@' character and is followed by a sequence identifier and
an optional description.
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence
identifier again as line 1.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same
number of symbols as letters in the sequence.
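The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, not part of QIIME: the function name `read_fastq` and the two example records are hypothetical, and the quality decoding assumes Phred+33 encoding (score = ASCII code minus 33), which should be checked against the sequencing platform.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (identifier, sequence, quality scores) for each four-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: %r" % header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: %r" % header)
        # Assumption: Phred+33 encoding of the quality line.
        scores = [ord(ch) - 33 for ch in quality]
        # The identifier is the first word after '@'; the rest is the description.
        yield header[1:].split()[0], sequence, scores


# Hypothetical two-record example, not taken from the HMP data.
example = StringIO("@read1 demo\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\n!!5I\n")
records = list(read_fastq(example))
for identifier, sequence, scores in records:
    print(identifier, sequence, scores)
```

The length check mirrors the rule that Line 4 must contain the same number of symbols as Line 2.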
The FASTQ format holds the sequence data together with its quality data. QIIME provides the
convert_fastaqual_fastq.py script to convert a FASTQ file into a .qual file with the
quality scores and an .fna file with the sequence data.
convert_fastaqual_fastq.py -f seqs.fastq -c fastq_to_fastaqual
Figure 1. FASTQ File Format
Sequence File (.fna)
The sequence file holds the raw sequence data for each read. A typical sequence file in
.fna format is structured as follows:
Line 1 begins with a '>' character and is followed by an Accession Run Code.
Line 2 is the raw sequence letters.
Quality File (.qual)
The quality file holds the quality scores for each read. A typical quality file in .qual
format is structured as follows:
Line 1 begins with a '>' character and is followed by an Accession Run Code.
Line 2 is the quality scores.
Mapping File
QIIME requires a metadata mapping file for most analyses. The mapping file is created by
the user and contains all of the information, categorical or numeric, about the samples that is
needed to perform the data analysis. It can be created in Excel or a text editor and must be
tab-delimited. The mapping file is important because it links each sample identifier with its
metadata. In a typical mapping file, each line describes one sample. A line starts with the
“SampleID”, followed by the “BarcodeSequence” used for the sample and the
“LinkerPrimerSequence” used to amplify the sample, and ends with a description column. The
first column must be “SampleID”; a sample ID may contain only alphanumeric characters and
periods and cannot contain underscores. The SampleID should match the sequence headers used in
the FASTA files. Any other metadata that relates to the samples, and any additional information
about specific samples that may be useful when considering outliers, can be added as further
columns. The last column must be “Description”. In some circumstances, users may need to
generate a mapping file that does not contain barcodes and/or primers. In that case, the
“BarcodeSequence” and “LinkerPrimerSequence” fields can be left empty.
To check whether a mapping file is in the right format, QIIME provides
validate_mapping_file.py. This script tests for many common problems in the mapping file and
writes a “_corrected.txt” version of the mapping file to the output folder. If the
BarcodeSequence and LinkerPrimerSequence fields are empty, barcode and primer testing
need to be disabled with the -b and -p parameters.
validate_mapping_file.py -m <mapping_filepath> -o <outputpath> -p -b
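To make the formatting rules concrete, the sketch below checks a few of them in plain Python. It is only an illustration and covers a small subset of what validate_mapping_file.py tests. The function name `check_mapping`, the `SampleType` column and the sample IDs are hypothetical, and the header is assumed to be written as `#SampleID` with a leading `#`, as in QIIME 1 mapping files.

```python
import re

# Assumption: the header line starts with these three columns, '#' included.
REQUIRED_FIRST = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence"]


def check_mapping(text):
    """Return a list of problems found in tab-delimited mapping file text."""
    problems = []
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    if header[:3] != REQUIRED_FIRST:
        problems.append("header must start with " + ", ".join(REQUIRED_FIRST))
    if header[-1] != "Description":
        problems.append("last column must be Description")
    seen = set()
    for line in lines[1:]:
        if line.startswith("#"):
            continue  # comment line
        fields = line.split("\t")
        sample_id = fields[0]
        if len(fields) != len(header):
            problems.append("%s: expected %d fields" % (sample_id, len(header)))
        # Only alphanumeric characters and periods are allowed (no underscores).
        if not re.fullmatch(r"[A-Za-z0-9.]+", sample_id):
            problems.append("%s: invalid characters in SampleID" % sample_id)
        if sample_id in seen:
            problems.append("%s: duplicate SampleID" % sample_id)
        seen.add(sample_id)
    return problems


# Hypothetical mapping file without barcodes or primers (fields left empty).
good = ("#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tSampleType\tDescription\n"
        "S10.mouse.feb\t\t\tmouse\tcomputer mouse in February\n")
bad = good.replace("S10.mouse.feb", "S10_mouse_feb")
print(check_mapping(good))
print(check_mapping(bad))
```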
Basic Statistics on Sequence Data
The count_seqs.py -i <sequence_file.fna> script in QIIME counts the sequences and
calculates the mean and standard deviation of the sequence length. Our file had a total of
220028 sequences, with a mean sequence length of 151 and a standard deviation of 0.
Mothur gives more detailed statistics such as the minimum, maximum, median and quartiles.
Running the summary.seqs(fasta=<sequence_file.fna>) command displays the following screen and
creates a summary output file.
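The statistics reported by these tools are simple to reproduce for a small file. The following sketch is an illustration only, not the code of count_seqs.py or mothur: the helper names and the three example sequences are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics
from io import StringIO


def fasta_lengths(handle):
    """Return the length of every sequence in a FASTA (.fna) file."""
    lengths, current = [], None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)  # sequences may span several lines
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Count, mean and spread of the sequence lengths."""
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        # Assumption: population standard deviation.
        "stdev": statistics.pstdev(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


# Hypothetical three-sequence file; the last sequence spans two lines.
example = StringIO(">a\nACGT\n>b\nACGTAC\n>c\nACGT\nACGT\n")
stats = summarize(fasta_lengths(example))
print(stats)
```

A standard deviation of 0, as reported for our file, simply means that every read has the same length.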
OTU Picking
Picking OTUs is also called "clustering", because sequences above some threshold of identity
are clustered together into an OTU. There are three different methods for OTU picking:
De novo Clustering
Closed-reference
Open-reference
Figure 2. Mothur output for sequence summary
Which method to choose depends on what is known a priori about the microbial
community. If the community is well studied, 16S databases contain many representatives
and the closed-reference OTU picking strategy is suitable. The de novo method is suitable for
discovering new species. The open-reference method combines the closed-reference and de novo
methods and is the method recommended by the QIIME developers. It first clusters sequences
against a database of 16S reference sequences called Greengenes, then applies de novo
clustering to the sequences that are not similar to the reference sequences.
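The two-stage logic of open-reference picking can be sketched on toy data. This is a deliberately simplified illustration and not how UCLUST or usearch work internally: the `identity` function is a naive position-by-position comparison standing in for a real aligner, and all names, sequences and the 0.85 threshold in the example are hypothetical (QIIME's reference sets are clustered at 97%).

```python
def identity(a, b):
    """Fraction of matching positions; a toy stand-in for a real aligner."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / longest


def pick_open_reference(reads, reference, threshold=0.97):
    """Assign reads to reference OTUs first, then cluster the rest de novo."""
    otus = {}       # OTU id -> list of read ids
    centroids = {}  # new OTU id -> centroid sequence
    for read_id, seq in reads.items():
        # Step 1 (closed reference): find the best hit in the reference set.
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(seq, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otus.setdefault(best_id, []).append(read_id)
            continue
        # Step 2 (de novo): join an existing new cluster or start one.
        for otu_id, centroid in centroids.items():
            if identity(seq, centroid) >= threshold:
                otus[otu_id].append(read_id)
                break
        else:
            otu_id = "New.%d" % len(centroids)
            centroids[otu_id] = seq
            otus[otu_id] = [read_id]
    return otus


reference = {"ref1": "ACGTACGTAC"}
reads = {"r1": "ACGTACGTAC", "r2": "ACGTACGTAT",
         "r3": "TTTTTTTTTT", "r4": "TTTTTTTTTA"}
otus = pick_open_reference(reads, reference, threshold=0.85)
print(otus)
```

Reads r1 and r2 hit the reference, while r3 and r4 fail and end up together in a new de novo cluster.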
Table 1. Which OTU picking strategy suits which study?
OTU picking strategy | Script | Suitable studies
Closed reference | pick_closed_reference_otus.py | Human, mouse, gut, skin, oral microbiome
De novo | pick_de_novo_otus.py | Environmental, soil, water and other poorly characterized microbiomes
Open reference | pick_open_reference_otus.py | Any microbiome study. QIIME developers recommend this method.
The following table compares the advantages and disadvantages of the OTU picking strategies.
Table 2. Advantages and Disadvantages of OTU Picking Strategies
OTU picking strategy | Advantages | Disadvantages
Closed reference | Fast and parallelizable. Suitable for big datasets. Because it uses reference databases, it produces good-quality taxonomies and trees. | Not possible to find new species.
De novo | Clusters all sequences. | Not parallelizable, so slow for big datasets.
Open reference | Clusters all sequences. Part of the work is parallelized. Faster than de novo. | The non-parallelizable part of the work is slow. It may take a very long time when many new species that are not in the reference databases are found.
The open-reference OTU picking strategy was used for our HMP data analysis, through QIIME's
pick_open_reference_otus.py script. This script performs many substeps in a single run: it
(1) picks OTUs, (2) chooses a representative sequence for each OTU, (3) assigns known
taxonomy to those OTUs, (4) builds a phylogenetic tree, and (5) creates an OTU table.
pick_open_reference_otus.py -i <sequence_file.fna> -r <97_otus.fasta> -o <outputpath> -s 0.1 -m <clustering algorithm> -p <parameter_file>
97_otus.fasta is the reference OTU file from Greengenes. Greengenes is the database of
reference 16S sequences that is used to assign taxonomy. The 97_otus.fasta file is created by
clustering all the sequences in the Greengenes database into 97% identity clusters. A
representative sequence is chosen from each of those clusters and used to create the 97_tree and
97_taxonomy. Sequences in our data are compared with the representative sequences in
97_otus.fasta, and the taxonomy of the most similar sequence is assigned to our sequence.
The default clustering algorithm for the pick_open_reference_otus.py script is UCLUST.
Because usearch is widely used for OTU picking, usearch was used as the clustering algorithm
for our data.
The parameter file was created by the user and contains the line
“pick_otus:enable_rev_strand_match True”. This line is needed if most or all of the sequences
fail to hit the reference during the prefiltering or closed-reference OTU picking steps, which
can happen when the sequences are in the reverse orientation with respect to the reference
database. The setting addresses this problem; however, it doubles the amount of memory used in
the workflow.
The script also creates an index.html file, a navigation page with an informative table
about the output files. The most important outputs of the script are the following four files:
rep_set.tre: The phylogenetic tree describing the relationship of all of our sequences.
rep_set.fna: The list of representative sequences for each OTU.
otu_table_mc2_w_tax.biom: The final OTU results, including taxonomic
assignments and per-sample abundances, stored in a biom file. mc2 stands for
“minimum count 2”, meaning that each OTU must contain at least 2 sequences. This is the file
most often used for deeper analysis.
final_otu_map_mc2.txt: The listing of which reads were clustered into which OTU.
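The relationship between the OTU map, the mc2 filter and the per-sample abundances can be illustrated in a few lines. This sketch assumes the tab-delimited OTU map layout (OTU identifier first, then the read identifiers) and assumes read identifiers of the form `<SampleID>_<number>`; reads downloaded from SRA may be labeled differently. The function names and example data are hypothetical.

```python
def parse_otu_map(text):
    """Parse OTU map text: one OTU per line, OTU id then read ids, tab-separated."""
    otu_map = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            otu_map[fields[0]] = fields[1:]
    return otu_map


def filter_min_size(otu_map, min_size=2):
    """Keep only OTUs with at least min_size reads (the 'mc2' filter)."""
    return {otu: reads for otu, reads in otu_map.items() if len(reads) >= min_size}


def per_sample_counts(otu_map):
    """Build OTU -> {sample: count}, assuming read ids look like SampleID_number."""
    table = {}
    for otu, reads in otu_map.items():
        row = table.setdefault(otu, {})
        for read in reads:
            sample = read.rsplit("_", 1)[0]
            row[sample] = row.get(sample, 0) + 1
    return table


# Hypothetical OTU map: OTU2 is a singleton and is dropped by the filter.
example = "OTU1\tS1_1\tS1_2\tS2_1\nOTU2\tS2_2\n"
kept = filter_min_size(parse_otu_map(example))
table = per_sample_counts(kept)
print(table)
```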
Basic Statistics on OTU Table
The biom summarize-table -i <biom_file> -o <outputpath> command creates a summary of the
OTU table.
Figure 3 shows the summary of the biom file. 9605 OTUs were picked. Counting the sequences
in the representative sequence file rep_set.fna should give the same number.
The assign_taxonomy.py -i <rep_set.fna> -o <taxonomyResults_outputpath> script is used
to assign taxonomy to each OTU representative sequence. It creates the
rep_set_tax_assignments.txt file, which contains an entry for each representative sequence,
listing the taxonomy to the greatest depth allowed by the confidence threshold (80% by default;
it can be changed with the -c option), and a column of confidence values for the deepest level
of taxonomy shown.
Figure 3. Summary for biom file
Figure 4. rep_set_tax_assignments.txt
OTU Heatmap
The make_otu_heatmap.py -i <biom file> -o <heatmap.pdf> script creates a PDF file with a
visualization of the OTU table. Each row corresponds to an OTU and each column corresponds to a
sample. The higher the relative abundance of an OTU in a sample, the more intense the color at
the corresponding position in the heatmap.
Figure 5. Heatmap for HMP data
Data Analysis
Summarize Communities by Taxonomic Composition
Looking at the relative abundances of taxa per sample in the OTU table shows which
microbes are found in each sample community.
Question: Proportionally, what microbes are found in each sample community?
Scripts: summarize_taxa.py and plot_taxa_summary.py
Output: Plots showing relative abundance data per sample
The summarize_taxa.py -i <biom file> -o <taxaSummary_outputpath> script generates
text files with relative abundance data per sample, giving a basic overview of the members of
the community at each taxonomic rank. The taxonomic ranks to summarize are selected with the
-L parameter (1 for kingdom, 2 for phylum, 3 for class, 4 for order, 5 for family, 6 for genus,
7 for species). The output text files can be passed to the plot_taxa_summary.py script to
create plots with the following command:
plot_taxa_summary.py -i <taxaSummary_outputpath/otu_table_w_tax.txt> -l <taxonomic rank> -c pie,bar,area -o <taxaCharts_outputpath>
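The core of the taxa summary is collapsing OTU counts to a chosen taxonomic rank and dividing by each sample's total. The sketch below shows that idea only. The function name, the OTU counts and the taxonomy strings are hypothetical, and taxonomy strings shorter than the requested level are simply truncated, which is a simplification.

```python
def summarize_taxa(counts, taxonomy, level):
    """Collapse counts (otu -> {sample: n}) to a rank; return relative abundances."""
    collapsed = {}  # taxon -> {sample: summed count}
    totals = {}     # sample -> total count
    for otu, row in counts.items():
        ranks = [r.strip() for r in taxonomy[otu].split(";")]
        taxon = ";".join(ranks[:level])  # level 2 keeps kingdom and phylum
        for sample, n in row.items():
            bucket = collapsed.setdefault(taxon, {})
            bucket[sample] = bucket.get(sample, 0) + n
            totals[sample] = totals.get(sample, 0) + n
    return {taxon: {s: n / totals[s] for s, n in row.items()}
            for taxon, row in collapsed.items()}


# Hypothetical counts and taxonomy strings, not results from the HMP data.
counts = {
    "OTU1": {"mouse": 6, "floor": 1},
    "OTU2": {"mouse": 2, "floor": 3},
    "OTU3": {"mouse": 2},
}
taxonomy = {
    "OTU1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "OTU2": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "OTU3": "k__Bacteria; p__Actinobacteria; c__Actinobacteria",
}
phylum = summarize_taxa(counts, taxonomy, level=2)
for taxon, row in phylum.items():
    print(taxon, row)
```

Within each sample the relative abundances sum to 1, which is what the pie, bar and area charts display.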
The following pie plot shows the total relative abundance for all data.
Figure 6. Pie plot of the degree of sharing of microbial taxa in 14 collected samples from 7 different points
at a four-month interval in a hospital room.
The following area and bar plots show the relative abundance of taxa for each sample.
Figure 7. Area plot of the degree of sharing of microbial taxa in 14 collected samples from 7 different
points at a four-month interval in a hospital room.
Figure 8. Bar plot of the degree of sharing of microbial taxa in 14 collected samples from 7 different
points at a four-month interval in a hospital room.
Figure 9 shows the microbial composition of each sample at the two time points at the phylum
level. From the plots, the taxa on the computer mouse, countertop and tap faucet handles appear
to change more between the two time points. On the other hand, those samples show similar taxa
proportions within the same time point. This might be because the same person used those
locations at the first time point, and a different person was using them at the second time
point, which modified the microbial abundance of taxa in the samples from the second time point.
[Figure 9 panels, one per sample in February and in April: Corridor Floor, Computer Mouse, Countertop, Station Phone, Chair Armrest, Cold Tap Water Faucet Handle, Hot Tap Water Faucet Handle]
Figure 9. Microbial composition of the microbial taxa in 14 collected samples
Investigating Alpha Diversity
The diversity of species within a single sample or environment is described by alpha diversity.
Question: How many species are in each sample?
Script: alpha_rarefaction.py -i <biom file> -o <alphaDiversity_outputpath>
-p <parameters.txt> -m <mapping file>
Output: Rarefaction plots.
This script performs several steps: (1) generate rarefied OTU tables; (2) compute
alpha diversity metrics for each rarefied OTU table; (3) collate alpha diversity results; and (4)
generate alpha rarefaction plots. Alpha diversity increases with sequencing depth, and
rarefaction plots are useful for comparing alpha diversity between two or more samples that may
have unequal sequencing depth. The plot shows the alpha diversity value against the number of
sequences included. To build rarefaction curves, each community is randomly subsampled without
replacement at different intervals, and the average number of OTUs at each interval is plotted
against the size of the subsample.
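The subsampling procedure described above can be sketched as follows. This is an illustration of the idea rather than the code behind alpha_rarefaction.py; the function name, the example counts, the number of repeats and the fixed random seed are all hypothetical choices.

```python
import random


def rarefaction_curve(otu_counts, depths, repeats=10, seed=0):
    """Average number of observed OTUs at each subsampling depth."""
    rng = random.Random(seed)
    # Expand the counts into one entry per sequence, labeled by its OTU.
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    curve = []
    for depth in depths:
        if depth > len(pool):
            break  # the sample has fewer sequences than this depth
        # random.sample draws without replacement.
        observed = [len(set(rng.sample(pool, depth))) for _ in range(repeats)]
        curve.append((depth, sum(observed) / repeats))
    return curve


# Hypothetical sample with 100 sequences spread over four OTUs.
sample = {"OTU1": 50, "OTU2": 30, "OTU3": 15, "OTU4": 5}
curve = rarefaction_curve(sample, depths=[10, 50, 100])
for depth, mean_otus in curve:
    print(depth, mean_otus)
```

At the full depth of 100 sequences every OTU is necessarily observed, so the curve ends at 4; at lower depths the rare OTU4 is often missed.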
The alpha diversity metrics are listed in a text file that serves as the parameter file.
Observed_species, shannon and chao1 are commonly used alpha diversity metrics.
Observed_species is the number of OTUs identified per sample. Shannon diversity is a measure of
entropy, and chao1 is a measure that predicts OTU richness at high sequencing depth. The
following command creates the parameters.txt file:
echo 'alpha_diversity:metrics observed_species,shannon,chao1' > parameters.txt
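For reference, the three metrics can be written out directly. The sketch below uses the textbook definitions; the base-2 logarithm for Shannon and the bias-corrected form of Chao1 are assumptions about the conventions used, and the example count vectors are hypothetical.

```python
import math


def observed_species(counts):
    """Number of OTUs with at least one sequence."""
    return sum(1 for n in counts if n > 0)


def shannon(counts):
    """Shannon entropy of the OTU proportions (assumption: log base 2)."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness estimate from singletons and doubletons."""
    singletons = sum(1 for n in counts if n == 1)
    doubletons = sum(1 for n in counts if n == 2)
    return observed_species(counts) + singletons * (singletons - 1) / (2 * (doubletons + 1))


even = [10, 10, 10, 10]    # four equally abundant OTUs
skewed = [5, 1, 1, 2, 0]   # two singletons, one doubleton, one absent OTU
print(observed_species(even), shannon(even), chao1(even))
print(observed_species(skewed), shannon(skewed), chao1(skewed))
```

For the skewed sample, Chao1 is above the observed count because the singletons suggest that some OTUs were not sampled.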
After running the script on our data, an HTML page with rarefaction plots was created.
Figure 10. Rarefaction plot for date_s
Figure 11. Rarefaction plot for sample_type_s
Identifying Differentially Abundant OTUs
Question: Are there species significantly more abundant in one set of samples than in
another? Which microbes are significantly different between two sample groupings? Do specific
groups of samples differ in their microbial composition?
Script: differential_abundance.py -i <biom file> -o <output.txt> -m <mapping file> -a
DESeq2_nbinom -c <mapping category> -x <subcategory 1> -y <subcategory 2> -d
Output: A text file with a list of differentially abundant OTUs and their statistics, and an MA
plot.
OTU differential abundance testing is used to identify OTUs that differ between two
mapping file sample categories, given by -x and -y in the script. The method used to identify
differentially abundant OTUs is given by -a. DESeq2_nbinom and metagenomeSeq_fitZIG are the
differential abundance algorithms that can be used in QIIME (Paulson, Stine, Bravo, & Pop, 2013).
The -d option creates an MA plot. The MA plot shows the relationship between
intensity and difference between two groups of samples. The x-axis represents the average
quantitated value across the two groups, and the y-axis shows the difference between them. The
option also creates a Dispersion Estimate plot that visualizes the fitted dispersion vs. mean
relationship.
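The two axes of an MA plot can be computed by hand for a pair of groups. The sketch below uses the classical definition (M is the log2 ratio, A is the average of the log2 values) and is only meant to show what the axes mean; DESeq2 applies its own normalization and fold-change estimation and plots the mean of normalized counts on the x-axis. The function name, the pseudocount and the example means are hypothetical.

```python
import math


def ma_values(group_x, group_y, pseudocount=1.0):
    """Return otu -> (A, M) from mean counts per group (otu -> mean count)."""
    result = {}
    for otu in group_x:
        # The pseudocount avoids taking the logarithm of zero.
        x = math.log2(group_x[otu] + pseudocount)
        y = math.log2(group_y.get(otu, 0.0) + pseudocount)
        a = 0.5 * (x + y)  # A: average log abundance (x-axis)
        m = y - x          # M: log2 fold change of y relative to x (y-axis)
        result[otu] = (a, m)
    return result


# Hypothetical mean counts for two sample groups.
mouse = {"OTU1": 7, "OTU2": 3}
countertop = {"OTU1": 31, "OTU2": 3}
ma = ma_values(mouse, countertop)
for otu, (a, m) in ma.items():
    print(otu, a, m)
```

An OTU with M close to 0 has similar abundance in both groups; a large positive M means it is more abundant in the group passed second.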
To see whether any OTUs are significantly more abundant in the countertop samples than in
the computer mouse samples, “countertop” was passed as the -y option and “computer mouse” was
passed as the -x option. According to the output text file, members of Actinobacteria are
significantly more abundant in the countertop samples.
Figure 12. Diff_otus.txt for Computer Mouse and Countertop
Figure 13. MA plot for differential abundance of Computer Mouse and Countertop
Figure 14. Dispersion Estimate Plot for differential abundance of Computer Mouse and Countertop
Looking at the microbial abundance of taxa in the computer mouse samples taken in February and
April, the pie charts showed visibly different taxonomy. As an experiment, the differential
abundance script was run on those samples; Figures 15 and 16 show the MA plot and the
dispersion estimate plot.
Figure 15. MA plot for Computer Mouse Samples
Figure 16. Dispersion Estimate Plot for Computer Mouse Samples
Normalizing OTU Table
When analyzing microbial data, uneven sequencing depth can lead to biased results.
Having a different number of sequences for each sample causes inaccurate results in beta
diversity analyses.
Question: How to prevent bias resulting from uneven sequencing depth?
Script: normalize_table.py -i <biom file> -a CSS -o <normalized biom file>
Output: Biom table with normalized counts. This table is used as the input biom file for the
beta diversity script.
The -a option determines the normalization algorithm to apply to the input biom table. The
default algorithm is CSS. CSS stands for “cumulative sum scaling”, an adaptive extension of
the quantile normalization approach that is better suited to marker gene survey data: raw
counts are divided by the cumulative sum of counts up to a percentile determined using a
data-driven approach (Paulson, Stine, Bravo, & Pop, 2013). DESeq2 is
another normalization algorithm option. DESeq2 outputs negative values for low-abundance
OTUs as a result of its log transformation and throws away low-depth samples (e.g. fewer than
1000 sequences/sample). This presents a problem when using the Bray-Curtis and UniFrac metrics,
which are common metrics for calculating ecological distance. There is not a good solution yet,
but CSS is currently the recommended normalization algorithm.
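The idea of dividing by a cumulative sum up to a percentile can be shown on a single sample. This is a strongly simplified sketch and not the metagenomeSeq implementation that QIIME calls: the percentile is fixed here instead of being chosen from the data, the quantile is taken over nonzero counts with a nearest-rank rule, and the scaling constant of 1000 is an assumption. The function name and the example counts are hypothetical.

```python
import math


def css_normalize(sample_counts, percentile=0.5, scale=1000.0):
    """Scale counts by the sum of all counts at or below a chosen quantile."""
    nonzero = sorted(n for n in sample_counts.values() if n > 0)
    if not nonzero:
        return {otu: 0.0 for otu in sample_counts}
    # Assumption: nearest-rank quantile of the nonzero counts.
    rank = max(1, math.ceil(percentile * len(nonzero)))
    quantile = nonzero[rank - 1]
    # Cumulative sum of the counts up to the quantile.
    scaling = sum(n for n in nonzero if n <= quantile)
    return {otu: n / scaling * scale for otu, n in sample_counts.items()}


# Hypothetical sample dominated by one very abundant OTU.
sample = {"A": 1, "B": 2, "C": 3, "D": 94}
normalized = css_normalize(sample)
print(normalized)
```

Because the dominant OTU D lies above the quantile, it does not contribute to the scaling factor, which is the property that distinguishes this approach from dividing by the total count.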
Beta-diversity and PCoA
In microbiome research it is important to analyze how different each sample is from all of
the others. Another important question is whether any grouping of samples is more similar in
composition than average. Beta diversity is a measure of diversity that describes how much the
species composition differs between samples.
Question: How much does diversity change between samples?
Script: beta_diversity.py, principal_coordinates.py, make_2d_plots.py
Output: Distance matrix and Principal Coordinates plots
Mathematical and phylogenetic metrics can be used to measure the difference between two
samples. Two commonly used metrics in microbiome studies are bray_curtis and
unweighted_unifrac.
beta_diversity.py -i <normalized biom file> -m <distance metric> -o <beta_div_output_path> -t <rep_set.tre>
The output of the command is a distance matrix that defines the distance between every pair of
samples. I used the Bray-Curtis metric to calculate distances. This matrix can be visualized in
a Principal Coordinates Analysis (PCoA) plot.
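The Bray-Curtis distance itself is short enough to write out: the sum of absolute count differences divided by the sum of all counts in the two samples. The sketch below builds a small distance matrix this way; the function names and the three example samples are hypothetical, and beta_diversity.py should be used for real analyses.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two samples (otu -> count)."""
    otus = set(u) | set(v)
    numerator = sum(abs(u.get(o, 0) - v.get(o, 0)) for o in otus)
    denominator = sum(u.get(o, 0) + v.get(o, 0) for o in otus)
    return numerator / denominator if denominator else 0.0


def distance_matrix(samples):
    """Return sorted sample names and the matrix of pairwise distances."""
    names = sorted(samples)
    matrix = [[bray_curtis(samples[a], samples[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical counts: 'phone' has the same composition as 'mouse'.
samples = {
    "mouse": {"OTU1": 6, "OTU2": 4},
    "floor": {"OTU1": 2, "OTU3": 8},
    "phone": {"OTU1": 6, "OTU2": 4},
}
names, matrix = distance_matrix(samples)
print(names)
for row in matrix:
    print(row)
```

A distance of 0 means identical composition and 1 means no shared OTUs; the matrix is symmetric with zeros on the diagonal.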
principal_coordinates.py -i <beta_div_output_path>/<metric_normalized_otu_table.txt> -o <beta_div_coords.txt>
make_2d_plots.py -i <beta_div_coords.txt> -m <mapping file>
The resulting PCoA plots are shown in the following figures. Figure 17 shows how microbial
community similarity changes between the two sample collection dates; the overall community
appears to have largely changed between the two time points. Figure 18 shows the microbial
community similarity among sample types. The computer mouse, countertop, station phone and
chair armrest samples plot close together, meaning that they have similar microbial
communities. However, the computer mouse and countertop samples in this group were collected in
February, whereas the station phone and chair armrest samples were collected in April. The pie
charts also show that these samples have very similar compositions. The pie charts show very
different compositions for the computer mouse and countertop samples at the two time points.
This can also be seen in the PCoA plots: for example, the two purple circles lie far apart on
the PC1-PC2 and PC1-PC3 plots in Figure 18.
Figure 17. PCoA plot for the bacterial community collected in the Hospital Room. Communities were
characterized by samples collected in February and April. Bray-Curtis is used as distance metric.
Figure 18. PCoA plot for the bacterial community collected in the Hospital Room. Communities were
characterized by type of samples collected. Bray-Curtis is used as distance metric.
Jackknifed Beta Diversity Analysis
Question: How to compare samples to each other?
Script: jackknifed_beta_diversity.py -i <biom file> -t <rep_set.tre> -m <mapping file>
-o <Jackknife_Output folder> -e <rarefaction_depth>
Output: 3D PCoA plots with Emperor
This script does the following steps:
i. Compute a beta diversity distance matrix from the full data set
ii. Perform multiple rarefactions at a single depth (-e option is to change the
rarefaction depth)
iii. Compute distance matrices for all the rarefied OTU tables
iv. Build UPGMA trees for all the rarefactions
v. Compare all the trees to get consensus and support values for branching
vi. Perform principal coordinates analysis on all the rarefied distance matrices
vii. Generate plots of the principal coordinates
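Steps ii and iii above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the OTU table is a plain list of per-sample count lists, and Bray-Curtis is used as a stand-in distance because it needs no phylogenetic tree, whereas the QIIME script is run here with UniFrac.

```python
import random


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one sample's OTU counts."""
    pool = [otu for otu, n in enumerate(counts) for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rarefied = [0] * len(counts)
    for otu in rng.sample(pool, depth):
        rarefied[otu] += 1
    return rarefied


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity (stand-in metric; assumes non-empty samples)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))


def jackknifed_distance_matrices(table, depth, replicates, seed=0):
    """Rarefy every sample `replicates` times and build one distance matrix per replicate."""
    rng = random.Random(seed)
    matrices = []
    for _ in range(replicates):
        rare = [rarefy(sample, depth, rng) for sample in table]
        matrices.append([[bray_curtis(x, y) for y in rare] for x in rare])
    return matrices


# Made-up table: two samples, three OTUs.
toy_table = [[20, 5, 0], [0, 5, 20]]
toy_matrices = jackknifed_distance_matrices(toy_table, depth=10, replicates=5)
```

Each replicate yields a slightly different distance matrix; the spread across replicates is what the later steps summarize as jackknife support.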
[Legend for the PCoA plot by sample type: Cold T.W.F.H, Hot T.W.F.H, Comp. Mouse, Countertop, Station Phone, Armchair Rest, Corridor Floor]
Emperor is an interactive next generation tool for analysis, visualization and
interpretation of high throughput microbial ecology datasets (Vázquez-Baeza, Pirrung, Gonzalez,
& Knight, 2013). After the script runs, three sub-folders are created for each distance metric,
along with 3D PCoA plots. The unweighted_unifrac/emperor_pcoa_plot folder contains an HTML file
with the 3D PCoA plots, as in Figure 19. Each point represents one of the samples, and distances
between samples were calculated using unweighted UniFrac. Samples that lie close to each other
have communities with very similar overall phylogenetic trees.
Figure 19. 3D PCoA Plots for HMP samples
The jackknife analysis creates a large collection of distance matrices on which statistics can be
computed.
Question: How to analyze distance matrices?
Script: dissimilarity_mtx_stats.py -i <Jackknife_Output folder/unweighted_unifrac/rare_dm>
-o <stat_output_folder>
Output: Three files (means.txt, medians.txt and stdevs.txt) containing the mean, median and
standard deviation of the distance between each pair of samples.
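As a rough illustration of what these three summaries contain, the sketch below computes them element by element across a list of distance matrices. The function name matrix_stats is hypothetical, and the use of the population standard deviation is an assumption; the QIIME script may use a different convention.

```python
import statistics


def matrix_stats(matrices):
    """Element-wise mean, median and standard deviation across distance matrices.

    `matrices` is a list of equally sized square matrices (lists of lists).
    Uses the population standard deviation (an assumption for this sketch).
    """
    n = len(matrices[0])
    means = [[0.0] * n for _ in range(n)]
    medians = [[0.0] * n for _ in range(n)]
    stdevs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            values = [m[i][j] for m in matrices]
            means[i][j] = statistics.mean(values)
            medians[i][j] = statistics.median(values)
            stdevs[i][j] = statistics.pstdev(values)
    return means, medians, stdevs


# Three made-up rarefied distance matrices for two samples.
replicates = [
    [[0.0, 0.2], [0.2, 0.0]],
    [[0.0, 0.4], [0.4, 0.0]],
    [[0.0, 0.9], [0.9, 0.0]],
]
means, medians, stdevs = matrix_stats(replicates)
```

A large standard deviation for a pair of samples suggests that their distance depends strongly on which reads happen to be drawn during rarefaction.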
Question: Are the samples in an individual category closer to each other than they are to samples
outside the category?
Script: make_distance_boxplots.py -m <mapping file> -o <BoxPlot_Output_Folder> -d
<stat_output_folder>/means.txt -f <category> --save_raw_data
Output: Boxplot as a PDF file
In Figure 20, the first and second boxplots represent all within-category distances and all
between-category distances, respectively.
Figure 20. Distance Boxplot for Surface type
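The grouping behind the first two boxplots can be sketched as follows: every pair of samples is assigned to "within" if both samples share the same category value, and to "between" otherwise. The function name, sample IDs and distances below are made up for illustration.

```python
from itertools import combinations


def split_distances(sample_ids, dist, groups):
    """Split pairwise distances into within-group and between-group lists.

    sample_ids: sample names in the row/column order of `dist`
    dist:       symmetric distance matrix (list of lists)
    groups:     dict mapping sample name -> category value
    """
    within, between = [], []
    for i, j in combinations(range(len(sample_ids)), 2):
        same = groups[sample_ids[i]] == groups[sample_ids[j]]
        (within if same else between).append(dist[i][j])
    return within, between


# Hypothetical example with two surface types.
ids = ["s1", "s2", "s3", "s4"]
surface = {"s1": "Countertop", "s2": "Countertop",
           "s3": "Corridor Floor", "s4": "Corridor Floor"}
toy_dist = [
    [0.0, 0.1, 0.7, 0.8],
    [0.1, 0.0, 0.6, 0.9],
    [0.7, 0.6, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
within, between = split_distances(ids, toy_dist, surface)
```

If samples in a category really are more similar to each other, the within distances should be clearly smaller than the between distances.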
Question: How to compare distances between samples grouped by the different values of a
mapping file field?
Script: make_distance_comparison_plots.py -m <mapping file> -d
<unweighted_unifrac_otu_table.txt> -f <category from mapping file> -c <comparison_groups>
-o <output_folder> -a <label_type> -t <plot_type>
Output: Distance Comparison Plot
Figure 21 shows the boxplots that allow comparison among surface types. Countertop,
Corridor Floor and Station Phone were taken as comparison groups and were compared
with the other surface types.
Figure 21. Distance Comparison among surface types
Make Bootstrapped Tree
Question: How to make a bootstrapped tree?
Script: make_bootstrapped_tree.py
-m <Jackknife_Output folder/unweighted_unifrac/upgma_cmp/master_tree.tre>
-s <Jackknife_Output folder/unweighted_unifrac/upgma_cmp/jackknife_support.txt>
-o <Jackknife_Output folder/unweighted_unifrac/upgma_cmp/Tree.pdf>
Figure 22. Jackknifed UPGMA clustering (using the weighted UniFrac metric) showing the similarity of
bacterial communities based on 16S rRNA genes.
Comparing Categories
In the HMP data, seven different points in a room were sampled: countertop, computer
mouse, station phone, chair armrest, corridor floor, hot tap water faucet and cold tap water
faucet. Plots reveal how the microbial composition of one sample differs from that of the other
samples, but statistical support is needed.
To generate statistical support for hypotheses, the adonis and anosim (analysis of similarities)
statistical tests can be used. Adonis (a PERMANOVA implementation) is a nonparametric statistical
method that takes a beta diversity distance matrix, a mapping file and a category in the mapping
file from which to determine the sample grouping. It computes an R2 value (effect size), which
shows the percentage of variation explained by the supplied mapping file category, as well as a
p-value to determine the statistical significance. Anosim is a method that tests whether two or
more groups of samples are significantly different. Anosim only works with a categorical
variable, which is used to do the grouping.
[Figure 22 leaf labels, as extracted: Cold. T. W. F. H. April; St. Phone February; Cold T. W. F. H. February; Corr. Floor February; Countertop February; Ch. Armrest February; Hot T.W. H. February; Comp. Mouse February; Countertop April; Comp. Mouse April; St. Phone April; Ch. Armrest April; Corr. Floor April; Hot. T. W. F. H. April]
Question: Are the differences between samples grouped by a parameter in the mapping file
(e.g. sample type) statistically significant?
Script 1: compare_categories.py --method adonis -i <metric_normalized_otu_table.txt>
-m <mapping file> -c <comparingCategory> -o <adonis_out_folder>
Script 2: compare_categories.py --method anosim -i <metric_normalized_otu_table.txt>
-m <mapping file> -c <comparingCategory> -o <anosim_out_folder>
Output: p-value and R2 value. The p-value indicates the statistical significance of the grouping
of samples by the parameter. The R2 value indicates the percentage of variation in distances that
is explained by the grouping.
The adonis and anosim statistical tests were applied to the “sample_type_s” and “date_s”
categories in the HMP data. Samples grouped by date_s or by sample_type_s do not differ
significantly in terms of microbial composition (p = 0.2, p = 0.58).
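To show where an R2 value and a permutation p-value of this kind come from, here is a minimal PERMANOVA-style sketch working directly on a distance matrix. It is an illustration under stated assumptions, not the code behind compare_categories.py: the function names are hypothetical, the data are made up, and the p-value is estimated by shuffling group labels.

```python
import random
from itertools import combinations


def sums_of_squares(dist, labels):
    """Total and within-group sums of squared distances, plus the group count."""
    n = len(labels)
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    members = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)
    ss_within = sum(
        sum(dist[i][j] ** 2 for i, j in combinations(idxs, 2)) / len(idxs)
        for idxs in members.values()
    )
    return ss_total, ss_within, len(members)


def pseudo_f(ss_total, ss_within, n, k):
    """Pseudo-F statistic: between-group over within-group mean squares."""
    if ss_within == 0:
        return float("inf")
    return ((ss_total - ss_within) / (k - 1)) / (ss_within / (n - k))


def permanova(dist, labels, permutations=999, seed=0):
    """Return (R2, p-value) for the grouping given by `labels`."""
    n = len(labels)
    ss_total, ss_within, k = sums_of_squares(dist, labels)
    r2 = (ss_total - ss_within) / ss_total
    f_obs = pseudo_f(ss_total, ss_within, n, k)

    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        _, ss_w_perm, _ = sums_of_squares(dist, shuffled)
        if pseudo_f(ss_total, ss_w_perm, n, k) >= f_obs - 1e-12:
            hits += 1
    # The observed labeling counts as one of the permutations.
    return r2, (hits + 1) / (permutations + 1)


# Made-up example: six samples on a line, forming two well-separated groups.
positions = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
toy_labels = ["February"] * 3 + ["April"] * 3
r2, p_value = permanova(toy_dist, toy_labels)
```

With only a few samples per group, the number of distinct label arrangements is small, so the p-value cannot become very small even when R2 is high. This is worth keeping in mind when interpreting results from a data set of this size.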
Conclusion
As a preliminary exploration, a small data set from HMP was analyzed. Data collected
from seven different points (countertop, computer mouse, station phone, chair armrest, corridor
floor, hot tap water faucet and cold tap water faucet) in the same room (S10) at two different
time points (27/02/2013 and 17/04/2014) were used. For each sample, the number and kinds of
microbes found, the change in diversity between samples, and the comparison of microbial
composition among sample groupings were investigated using the QIIME pipeline. Moreover,
significant changes in abundance among samples were investigated. Visualization and statistical
tools were used to draw conclusions.
REFERENCES
Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., …
Knight, R. (2010). QIIME allows analysis of high-throughput community sequencing data.
Nature Methods, 7(5), 335–6. https://doi.org/10.1038/nmeth.f.303
Fierer, N., Breitbart, M., Nulton, J., Salamon, P., Lozupone, C., Jones, R., … Jackson, R. B.
(2007). Metagenomic and small-subunit rRNA analyses reveal the genetic diversity of
bacteria, archaea, fungi, and viruses in soil. Applied and Environmental Microbiology,
73(21), 7059–7066. https://doi.org/10.1128/AEM.00358-07
Paulson, J. N., Stine, O. C., Bravo, H. C., & Pop, M. (2013). Robust methods for differential
abundance analysis in marker gene surveys. Nature Methods, 10(12), 1200–1202.
https://doi.org/10.1038/nmeth.2658
Paulson, J. N., Stine, O. C., Bravo, H. C., & Pop, M. (2013). Differential abundance analysis for
microbial marker-gene surveys. Nature Methods, 10(12), 1200–2.
https://doi.org/10.1038/nmeth.2658
Shogan, B. D., Smith, D. P., Packman, A. I., Kelley, S. T., Landon, E. M., Bhangar, S., …
Gilbert, J. (2013). The Hospital Microbiome Project: Meeting report for the 2nd Hospital
Microbiome Project, Chicago, USA, January 15th, 2013. Standards in Genomic Sciences,
8(3), 571–9. https://doi.org/10.4056/sigs.4187859
Vázquez-Baeza, Y., Pirrung, M., Gonzalez, A., & Knight, R. (2013). EMPeror: a tool for
visualizing high-throughput microbial community data. GigaScience, 2(1), 16.
https://doi.org/10.1186/2047-217X-2-16