Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data...

Date post: 17-May-2020
Public data resources for metagenomics, Alex Mitchell
Transcript
Page 1: Public data resources for metagenomics · (2) Upload sequence data and metadata (3) Sequence data is archived in ENA and accessioned (4) Sequence data is analysed by the pipeline

Public data resources for metagenomics

Alex Mitchell

Page 2

My background

Doctorate in pharmacology (1995-1998)

Post-doc in molecular biology (1998-2001)

Bioinformatics research (2001-2011)

Co-ordinator for InterPro and EBI metagenomics databases (2011-)

Page 3

My background

Page 4

Overview

• Considerations for the analysis of metagenomic sequence data

• What public metagenomic analysis resources offer

• The EBI metagenomics resource

Page 5

What is metagenomics?

“Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples.”

“Metagenomics is the study of all genomes present in any given environment without the need for prior individual identification or amplification”

“Metagenome” used by Handelsman et al., in 1998 to describe “collective genomes of soil microflora”

“Metagenomics” means literally ‘beyond genomics’

Page 6

Sampling from environment

Filtering step

Extraction of DNA

Sequencing

Page 7

Quality control

Taxonomic analysis: 16S rRNA, 18S rRNA, ITS, etc.

Functional analysis: identification and characterisation of protein coding sequences

Page 8

Applications of taxonomic analyses

Diversity analysis

Identification of new species

Comparing populations from different sites or states

Page 9

Applications of functional analyses

Bioprospecting for novel sequences with functional applications

Reconstruction of pathways present in the community

Comparing functional activities from different sites or states

Page 10

Why is metagenomics challenging?

• Short sequence fragments are hard to characterise

• Assembly can lead to chimeras

• Iddo Friedberg: ‘Metagenomics is like a disaster in a jigsaw shop’

  • Millions of different pieces

  • Thousands of different puzzles

  • All mixed together

  • Most of the pieces are missing

  • No boxes to refer to

Page 11

Limitations and pitfalls

Data used for analysis can have limitations:

• 16S rRNA genes - limited resolving power and subject to copy number variation

• Viral sequences – currently no gold-standard reference database

• Protist sequences – little experimentally-derived annotation of protein function in public databases

Page 12

Additional pitfalls

• Different functional and taxonomic analysis tools can give different results

• The same tools can give different results depending on the version and underlying algorithm (e.g., HMMER2 vs HMMER3)

• The same version of the same tools can give different results depending on the reference database used

Page 13

Reference databases

Page 14

Reference databases

Page 15

Reference databases

Page 16

Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts.

Other considerations: data analysis speed

Page 17

• The cost of sequencing has really gone down

• Now I can do metagenomics!

• Awesome!

• Amount of sequence generated has increased 5,000-fold

• Computational speed has increased only 10-fold

• Time taken to analyse has increased 500-fold

• $@%*!!!

Data analysis speed
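
The 500-fold figure follows from dividing the growth in data volume by the growth in compute speed. A minimal sketch of that arithmetic, using the slide's round numbers (the variable names are illustrative only):

```python
# Fold-increases quoted on the slide.
sequence_growth = 5_000   # amount of sequence generated
compute_growth = 10       # computational speed

# If data volume grows 5,000-fold while compute only gets 10 times faster,
# the same analysis takes sequence_growth / compute_growth times longer.
analysis_time_growth = sequence_growth / compute_growth
print(f"Analysis time increases ~{analysis_time_growth:.0f}-fold")
```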

Page 18

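[Figure: chart of percentage cost shares, with labels including ~80 bp/$ and ~2m bp/$; the assignment of percentages to chart segments is not recoverable from this transcript]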

Sboner et al. Genome Biology (2011) 12:125

Data analysis cost

Page 19

Raw sequence data:

• Important for metagenomics as some samples are hard to replicate

• Large file sizes

Analysis results ?

• Easiest to repeat, although it takes time & requires keeping track of analysis steps and versions

Data description including metadata

• Essential: what, where, who, how and when

• If absent, raw data have very limited usefulness

What data to store?

Page 20

Metadata includes the in-depth, controlled description of the sample that your sequence was taken from

The importance of metadata

Where did it come from? What were the environmental conditions (lat/long, depth, pH, salinity, temperature…) or clinical observations?

How was it sampled? How was it extracted? How was it stored? What sequencing platform was used?
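
As a rough illustration of checking that a sample description answers these questions, the sketch below flags missing descriptors in a record. The field names, the required set and the sample values are all invented for this example; real checklists define their own fields and controlled vocabularies.

```python
# Hypothetical minimal set of descriptors (not taken from any real checklist).
REQUIRED_FIELDS = {
    "latitude", "longitude", "collection_date", "environment",
    "sampling_method", "extraction_method", "sequencing_platform",
}

def missing_metadata(sample: dict) -> set:
    """Return the required fields that are absent or empty in a sample record."""
    return {f for f in REQUIRED_FIELDS if sample.get(f) in (None, "")}

# Invented sample record for illustration.
sample = {
    "latitude": 35.2, "longitude": -150.4, "depth_m": 75,
    "collection_date": "2013-05-14", "environment": "marine",
    "sequencing_platform": "Illumina",
}
print(sorted(missing_metadata(sample)))
```

A check like this only confirms that fields are present; it says nothing about whether the values use a standardised vocabulary.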

Page 21

• If metadata is adequately described, using a standardised vocabulary, querying and interpretation across projects becomes possible

The importance of metadata

Show the microbial species found in the North Pacific

… at depths of 50 – 100 m

… in samples taken May-June

… compared to the Indian Ocean, under the same conditions
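
The query above could be sketched as follows once samples carry structured metadata. The records, field names and taxon labels are invented placeholders, not a real schema or real results.

```python
from datetime import date

# Invented sample records; field names are illustrative only.
samples = [
    {"id": "S1", "region": "North Pacific", "depth_m": 60,
     "collected": date(2012, 5, 20), "taxa": {"taxon_A", "taxon_B"}},
    {"id": "S2", "region": "North Pacific", "depth_m": 300,
     "collected": date(2012, 6, 2), "taxa": {"taxon_D"}},
    {"id": "S3", "region": "Indian Ocean", "depth_m": 80,
     "collected": date(2011, 6, 11), "taxa": {"taxon_A", "taxon_C"}},
]

def taxa_for(region, min_depth=50, max_depth=100, months=(5, 6)):
    """Union of taxa from samples matching region, depth range and month."""
    found = set()
    for s in samples:
        if (s["region"] == region
                and min_depth <= s["depth_m"] <= max_depth
                and s["collected"].month in months):
            found |= s["taxa"]
    return found

north_pacific = taxa_for("North Pacific")
indian_ocean = taxa_for("Indian Ocean")
print("Shared:", sorted(north_pacific & indian_ocean))
print("North Pacific only:", sorted(north_pacific - indian_ocean))
```

The comparison only works because every record describes region, depth and date in the same way, which is the point of standardised metadata.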

Page 22

Where are you going to store this?

• Locally: back-up? long term? sharing? access?

• Amazon, Google or specialist research clouds

• Public repositories, such as ENA, NCBI or DDBJ

Considerations: storing data

Page 23

• Free!

• Secure long term storage

• No need for local infrastructure

• Enforced compliance:

  • Publisher requirements (accession numbers)

  • Institutional requirements

  • Funder requirements

• Data are more useful:

  • Data are reusable and can be discovered by others

  • Available for re- and meta-analyses

Public repositories

Page 24

• Transferring a 100 Gb NGS data file across the internet

  • 'Normal' network bandwidth (1 Gigabit/s) ~ 1 week*

  • High-speed bandwidth (10 Gigabit/s) < 1 day*

Considerations: moving data

* Stein, Genome Biol. (2010) 11:207

Traditional methods may be the most effective!

Page 25

Metagenomics portals

http://www.ebi.ac.uk/metagenomics

http://metagenomics.anl.gov/

http://camera.calit2.net/

http://img.jgi.doe.gov/

Page 26

Submit data

Sequence analysis (prebuilt workflows)

Quality filtering of sequences

Visualisation/Interpretation

What do metagenomics portals offer?

Sequence archiving

Tools to help capture & store metadata

Tools to help transfer data

Page 27

Data archiving

Powerful analysis

Easy submission

A free resource for the analysis, archiving & browsing of metagenomic study data

http://www.ebi.ac.uk/metagenomics

Page 28

(1) Register for an account

(2) Upload sequence data and metadata

(3) Sequence data is archived in ENA and accessioned

(4) Sequence data is analysed by the pipeline

(5) Projects, metadata and results are made available on the website for private or public browsing / download

The submission & analysis process

~ 1-2 weeks, depending on study size, compute farm usage, etc

Page 29

The submission process can be run interactively

3

Page 30

The GSC (Genomic Standards Consortium) has created minimum standards for metagenomics metadata

Metagenomics standards

Page 31

Metadata is captured via a GSC-compliant checklist

GSC MIxS

Page 32

The sequence analysis pipeline

raw reads → QC → processed reads (plus discarded reads)

processed reads → rRNAselector → reads with rRNA / reads without rRNA

reads with rRNA → Qiime → Taxonomic analysis (amplicon-based data are also analysed in this branch)

reads without rRNA → FragGeneScan → predicted CDS (pCDS) → InterProScan → Function assignment / Unknown function

Page 33

EBI Metagenomics: QC step by step

• Clipping - low quality ends trimmed and adapter sequences removed

• Quality filtering - sequences with > 10% undetermined nucleotides removed

• Read length filtering - short sequences (< 100 nt) are removed

• Duplicate sequences removal - clustered on 99% identity (UCLUST v 1.1.579) and representative sequence chosen

• Repeat masking - RepeatMasker (open-3.2.2), removes reads with 50% or more nucleotides masked (low complexity regions)
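
Two of the filters above are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the pipeline's implementation: it applies the 10% undetermined-nucleotide and 100 nt thresholds from the slide, and it substitutes exact-match de-duplication for the 99% identity clustering that the pipeline performs with UCLUST. Clipping and repeat masking are omitted because they rely on external tools.

```python
def passes_quality_filter(seq: str, max_n_fraction: float = 0.10) -> bool:
    """Reject reads in which more than 10% of bases are undetermined (N)."""
    return len(seq) > 0 and seq.upper().count("N") / len(seq) <= max_n_fraction

def passes_length_filter(seq: str, min_length: int = 100) -> bool:
    """Reject reads shorter than 100 nt."""
    return len(seq) >= min_length

def simple_qc(reads):
    """Apply the N-content and length filters, then drop exact duplicates.

    Exact matching is a simplification of clustering at 99% identity.
    """
    seen = set()
    kept = []
    for seq in reads:
        if not (passes_quality_filter(seq) and passes_length_filter(seq)):
            continue
        if seq in seen:
            continue
        seen.add(seq)
        kept.append(seq)
    return kept

reads = [
    "ACGT" * 30,             # 120 nt, no N: kept
    "ACGT" * 30,             # exact duplicate of the first read: dropped
    "ACGT" * 10,             # 40 nt: too short
    "N" * 20 + "ACGT" * 25,  # 120 nt with 20 N (about 17%): dropped
]
print(len(simple_qc(reads)))
```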

Page 34

EBI Metagenomics: QC consequences

Roche 454

Illumina

Ion Torrent

Page 35

EBI Metagenomics: taxonomic analysis

processed reads → rRNAselector → reads with rRNA → Qiime → Taxonomic analysis (amplicon-based data are also analysed in this branch)

Page 36

Taxonomic analysis with EBI Metagenomics

EBI Metagenomics currently only provides taxonomy analysis for Prokaryotes.

rRNA sequences are identified using rRNASelector:

hidden Markov models are used to identify rRNA sequences

60 bp minimum overlap with well-curated HMM model

E-value < 10^-5

Annotations are associated using Qiime:

rRNA are annotated using the Greengenes reference database
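
The two thresholds quoted above (at least 60 bp of overlap with the model and an E-value below 10^-5) amount to a simple filter on HMM search hits. The sketch below assumes a hypothetical `HmmHit` record with invented values; it does not reflect rRNASelector's actual output format or interface.

```python
from dataclasses import dataclass

@dataclass
class HmmHit:
    """Hypothetical record for one read-versus-model match."""
    read_id: str
    overlap_bp: int   # length of the read aligned to the rRNA model
    evalue: float

def is_rrna(hit: HmmHit, min_overlap: int = 60, max_evalue: float = 1e-5) -> bool:
    """Keep hits that satisfy both thresholds from the slide."""
    return hit.overlap_bp >= min_overlap and hit.evalue < max_evalue

# Invented hits for illustration.
hits = [
    HmmHit("read1", 120, 1e-20),  # passes both thresholds
    HmmHit("read2", 45, 1e-12),   # overlap too short
    HmmHit("read3", 90, 1e-3),    # E-value too high
]
rrna_reads = [h.read_id for h in hits if is_rrna(h)]
print(rrna_reads)
```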

Page 37

EBI Metagenomics taxonomy visualizations

Page 38

Re-analysis of: Sutton et al. (2013), Impact of Long-Term Diesel Contamination on Soil Microbial Community Structure.

Validation of taxonomic analysis

Alpha diversity analysis

polluted

clean

clean (outlier)
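
The slide does not say which alpha diversity measure was plotted. As one common example, the Shannon index can be computed from per-taxon counts as sketched below; the counts are invented and are not taken from the Sutton et al. data.

```python
import math

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Invented communities: one evenly spread, one dominated by a single taxon.
even_community = [25, 25, 25, 25]
skewed_community = [97, 1, 1, 1]
print(round(shannon_index(even_community), 3))
print(round(shannon_index(skewed_community), 3))
```

An evenly spread community scores higher than one dominated by a single taxon, even when both contain the same number of taxa.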

Page 39

EBI Metagenomics: overview of functional analysis

reads without rRNA → FragGeneScan → predicted CDS (pCDS) → InterProScan → Function assignment / Unknown function

Page 40

EBI Metagenomics: functional annotation

EBI Metagenomics uses FragGeneScan to predict CDSs directly from the reads:

hidden Markov models to correct frame-shift using codon usage

probabilistic identification of start and stop codons

60 bp minimum ORF

Annotation is carried out using InterProScan to mine a subset of the InterPro database

Page 41

Why not BLAST?

• BLAST: Basic Local Alignment Search Tool

• Relatively fast

• User friendly

• Very good at recognising similarity between closely related sequences

Page 42

Using BLAST for annotation

Page 43

Using BLAST for annotation

Page 44

Using BLAST for annotation

Page 45

Because BLAST performs local pairwise alignment, it:

• can sometimes struggle with multi-domain proteins

• is less useful for weakly-similar sequences (e.g., divergent homologues)

Using BLAST for annotation

Page 46

BLAST alignment of 2 proteins: 60S acidic ribosomal protein P0 from 2 closely-related species

Using BLAST for annotation

Page 47

60S acidic ribosomal protein P0: multiple sequence alignment

Page 48

An alternative approach

• Alternatively, we can model the pattern of conserved amino acids at specific positions within a multiple sequence alignment

• We can use these models to infer relationships with the characterised sequences from which the alignment was constructed

• This is the approach taken by protein signature databases
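
To make the idea of modelling conserved positions concrete, the sketch below builds a very simple position-specific scoring profile from a toy alignment and scores query sequences against it. This is a deliberately reduced illustration (ungapped, uniform background, invented sequences); it is not a hidden Markov model and not how any particular signature database builds its models.

```python
import math
from collections import Counter

# Toy alignment of equal-length sequences, invented for illustration.
alignment = ["MKVLA", "MKILA", "MRVLG", "MKVMA"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(seqs, pseudocount=1.0):
    """Per-column log-odds scores relative to a uniform background."""
    background = 1 / len(ALPHABET)
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile

def score(profile, query):
    """Sum of column scores for an ungapped query of the profile's length."""
    return sum(col[aa] for col, aa in zip(profile, query))

profile = build_profile(alignment)
# A query matching the conserved residues scores higher than an unrelated one.
print(score(profile, "MKVLA") > score(profile, "WWWWW"))
```

Residues seen often at a position score well there and residues never seen score poorly, so a divergent homologue that keeps the conserved positions can still be recognised.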

Page 49

Three different protein signature approaches

Single motif methods: Patterns

Multiple motif methods: Fingerprints

Full alignment methods: Profiles & Hidden Markov models (HMMs)

* For a detailed description, see: https://www.ebi.ac.uk/training/online/course/introduction-protein-classification-ebi

Page 50

Structural domains

Functional annotation of families/domains

Protein features (sites)

Hidden Markov Models, Fingerprints

Profiles, Patterns

HAMAP

Page 51

The aim of InterPro

InterPro

Page 52

Features of InterPro

• Manually checked and updated against a manually annotated database

• Errors are identified and fixed

• Annotated with full text abstracts and Gene Ontology terms

Page 53

… with a brief diversion into the Gene Ontology…

http://geneontology.org/

Page 54

Aims of the Gene Ontology

• Allow cross-species and/or cross-database comparisons

• Unify the representation of gene and gene product attributes across species

http://geneontology.org/

Page 55

Inconsistency in naming of biological concepts

English is not a very precise language

• Same name for different concepts

• Different names for the same concept

An example …

Tactition, Tactile sense, Taction → Sensory perception of touch; GO:0050975

http://geneontology.org/

Page 56

• A way to capture biological knowledge in a written and computable form

The Gene Ontology

• A set of concepts and their relationships to each other arranged as a hierarchy

www.ebi.ac.uk/QuickGO

Less specific concepts

More specific concepts

http://geneontology.org/

Page 57

The Concepts in GO

1. Molecular Function

An elemental activity or task or job

• protein kinase activity

• insulin receptor activity

2. Biological Process

A commonly recognised series of events

• cell division

3. Cellular Component

Where a gene product is located

• mitochondrion

• mitochondrial matrix

• mitochondrial inner membrane

http://geneontology.org/

Page 58

Anatomy of a GO term

Unique identifier

Term name

Definition

Synonyms

http://geneontology.org/
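
The four parts of a GO term listed above map naturally onto a small record type. The sketch below uses the touch example from the earlier slide; the `GOTerm` class and the lookup helper are illustrative only, and the definition text is a placeholder rather than the curated GO definition.

```python
from dataclasses import dataclass, field

@dataclass
class GOTerm:
    identifier: str                      # unique identifier
    name: str                            # term name
    definition: str                      # definition
    synonyms: list = field(default_factory=list)

touch = GOTerm(
    identifier="GO:0050975",
    name="sensory perception of touch",
    definition="(placeholder; see the GO entry for the curated definition)",
    synonyms=["tactition", "tactile sense", "taction"],
)

def build_lookup(terms):
    """Map every name and synonym (case-insensitive) to its GO identifier."""
    lookup = {}
    for term in terms:
        for label in [term.name, *term.synonyms]:
            lookup[label.lower()] = term.identifier
    return lookup

lookup = build_lookup([touch])
print(lookup["taction"], lookup["tactile sense"])
```

Whichever name an annotator uses, the lookup resolves to the same identifier, which is what allows comparison across databases.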

Page 59

InterPro2GO

InterPro

Page 60

We now return to your scheduled programming...

Page 61

Using InterPro for annotation

• Underlies the automated system that adds annotation to UniProtKB/TrEMBL

• Provides matches to 67 million proteins - over 80% of UniProtKB

• Source of ~170 million GO mappings for ~50 million distinct UniProtKB sequences

Annotation consistency:

• Using InterPro and GO for annotation allows direct comparison with all of the proteins in UniProtKB

Page 62

Analysing metagenomic sequences with InterPro

Considerations for metagenome analysis:

• Vast numbers of short reads

• analysis speed

• ability to cope with sequence fragments

• Making sense of output

• visualisation on web site

• downstream analysis and sample comparison

Page 63

Structural domains

Functional annotation of families/domains

Protein features (sites)

Hidden Markov Models, Fingerprints

Patterns

Databases

4

Page 64

Assembly of metagenomics data

• Metagenomics: it is not clear how to avoid assembling sequences from different species together, which produces chimaeras

Page 65

EBI Metagenomics does not perform assembly

We are still able to annotate metagenome data as shown by this re-analysis of rumen metagenomics by Hess et al, (2011)

Page 66

Visualising data: InterProScan results

Page 67

Visualising data: GO Slims

• GO slims are cut-down versions of the GO ontologies containing a subset of the terms in the whole GO

• Give a broad overview of the ontology content without the detail of the specific fine-grained terms
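
Mapping annotations onto a slim means walking up the hierarchy from each fine-grained term until a slim term is reached. The sketch below uses a toy hierarchy built from the cellular component examples on the earlier slide; term names stand in for GO identifiers, the relationships are simplified, and the one-term slim is hypothetical.

```python
from collections import Counter

# Toy hierarchy: child term -> parent terms (simplified for illustration).
PARENTS = {
    "mitochondrial matrix": ["mitochondrion"],
    "mitochondrial inner membrane": ["mitochondrion"],
    "mitochondrion": ["cellular component"],
    "cellular component": [],
}
SLIM = {"mitochondrion"}  # hypothetical slim containing a single term

def map_to_slim(term):
    """Walk up the hierarchy and collect the slim terms reached."""
    found, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in SLIM:
            found.add(current)
            continue  # stop at the first slim term on this path
        stack.extend(PARENTS.get(current, []))
    return found

annotations = ["mitochondrial matrix", "mitochondrial inner membrane", "mitochondrion"]
slim_counts = Counter(s for a in annotations for s in map_to_slim(a))
print(slim_counts)
```

Three annotations at different levels of detail collapse into a single slim category, which is what makes the summary charts readable.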

Page 68

GO Slims

Page 69

GO Slims

Slimmed term:

Page 70

Visualising data: GO slims

• For visualisation, EMG uses a GO slim specially developed for metagenomic data sets

Page 71

EBI Metagenomics output files

sequence files

tab or comma separated files

TreeView, TOL,

Newick Viewer …

Megan …

sequence files

Page 72

Simplified overview of MG-RAST pipeline

http://metagenomics.anl.gov/

[Figure: pipeline diagram. Labels: Reads; Quality control; Feature prediction (FragGeneScan); Clustering (Uclust); Protein databases; Blat; Abundance profiles; Metabolic reconstruction; Metabolic model; rRNAs; RNA database; SILVA; Blat; Community profiles; Blat]

Page 73

MG-RAST & EBI Metagenomics Functional analysis

Example: Analysis of Prairie Soil Sample

ammonia monooxygenase: NH3 + A-H2 + O2 → NH2OH + A + H2O

MG-RAST: 92 hits to 8 different databases

Hits by annotation:

• 12 Ammonia monooxygenase

• 2 ammonia monooxygenase family protein

• 4 Ammonia monooxygenase subunit A

• 5 Ammonia monooxygenase, putative

• 62 Putative ammonia monooxygenase

• 3 putative ammonia monooxygenase protein

• 4 putative ammonia monooxygenase subunit A

Hits by database: 8 KEGG, 18 eggNOG, 13 GenBank, 11 IMG, 8 PATRIC, 10 RefSeq, 12 TrEMBL, 9 SEED

The 13 GenBank hits:

• 1 ammonia monooxygenase family protein

• 2 ammonia monooxygenase subunit A

• 1 ammonia monooxygenase, putative

• 6 putative ammonia monooxygenase

• 2 Putative ammonia monooxygenase

• 1 putative ammonia monooxygenase subunit A

EBI Metagenomics:

• 3 IPR003393 Ammonia monooxygenase/particulate methane monooxygenase, subunit A

• 25 IPR007820 Putative ammonia monooxygenase/protein AbrB
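
The contrast in this example can be tallied in a few lines. The counts below are read from the slide, where several numbers are fused to the neighbouring text; splitting them this way is an assumption, although the MG-RAST counts then sum to the quoted 92 hits.

```python
# MG-RAST hit counts per free-text annotation, as read from the slide.
mgrast_hits = {
    "Ammonia monooxygenase": 12,
    "ammonia monooxygenase family protein": 2,
    "Ammonia monooxygenase subunit A": 4,
    "Ammonia monooxygenase, putative": 5,
    "Putative ammonia monooxygenase": 62,
    "putative ammonia monooxygenase protein": 3,
    "putative ammonia monooxygenase subunit A": 4,
}
# EBI Metagenomics hit counts per InterPro entry, as read from the slide.
interpro_hits = {"IPR003393": 3, "IPR007820": 25}

print(len(mgrast_hits), "distinct names,", sum(mgrast_hits.values()), "hits")
print(len(interpro_hits), "InterPro entries,", sum(interpro_hits.values()), "hits")
```

Seven differently worded names have to be reconciled by hand on one side, while the other side groups its hits under two stable accessions.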

Page 74

MG-RAST & EBI Metagenomics Taxonomy analysis

MG-RAST

EBI Metagenomics: only Prokaryotic taxonomy (333 OTU)

Bacteria

Archaebacteria

Eukaryotes

Others (including virus)

(55 categories)

(15 categories)

(98 categories)

(3 types)

Example: Analysis of Prairie Soil Sample

domain level of taxonomy

Page 75

Example: Analysis of Prairie Soil Sample

Phylum level of bacteria domain taxonomy

28 categories

MG-RAST

13 OTU

EBI Metagenomics

MG-RAST & EBI Metagenomics Taxonomy analysis

Page 76

IMG/M

http://img.jgi.doe.gov/m

Page 77

Some other metagenomics packages and tools

http://www.computationalbioenergy.org/software.html

http://ab.inf.uni-tuebingen.de/software/megan/

http://cbcb.umd.edu/software/metAMOS

CloVR metagenomics

http://clovr.org/methods/clovr-metagenomics/

Page 78

Hands-on session

• Using InterProScan to analyse a single metagenomic sequence

• Exploring EMG Portal’s analysis of a metagenomic data set

• Comparing analysis results for samples within a project using STAMP

Page 79

Questions?

