

Additional File 1

MetaMeta: Integrating metagenome analysis tools to improve taxonomic profiling

Vitor C. Piro, Marcel Matschkowski, Bernhard Y. Renard

1 Implementation

Figure 1: Score and bin matrices. Left: matrix with an example of calculated scores for 6 tools. Right: matrix showing the division of the scores into 4 bins.

1.1 File formats

MetaMeta accepts the BioBoxes format directly (https://github.com/bioboxes/rfc/tree/master/data-format) or a .tsv (tab-separated) file in one of the following formats.

Profiling: rank, taxon name or taxid, abundance

Example:

    genus    Methanospirillum          0.0029
    genus    Thermus                   0.0029
    genus    568394                    0.0029
    species  Arthrobacter sp. FB24     0.0835
    species  195                       0.0582
    species  Mycoplasma gallisepticum  0.0536

Binning: read id, taxon name or taxid, length of sequence assigned

Example:

    M2|S1|R140  354      201
    M2|S1|R142  195      201
    M2|S1|R145  457425   201
    M2|S1|R146  562      201
    M2|S1|R147  1245471  201
    M2|S1|R150  354      201
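As an illustration of the profiling layout described above, the following is a minimal Python sketch of a reader for the three-column file. It is not MetaMeta's own parser: the names `ProfileEntry` and `read_profile` are hypothetical, and skipping lines that start with `#` is an assumption that the format description does not state.

```python
import csv
import io
from typing import List, NamedTuple


class ProfileEntry(NamedTuple):
    rank: str
    taxon: str        # taxon name or taxid, kept as text
    abundance: float


def read_profile(handle) -> List[ProfileEntry]:
    """Parse tab-separated lines of the form: rank, taxon name or taxid, abundance."""
    entries = []
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: blank lines and '#' comment lines are ignored.
        if not row or row[0].startswith("#"):
            continue
        rank, taxon, abundance = row
        entries.append(ProfileEntry(rank, taxon, float(abundance)))
    return entries


example = (
    "genus\tMethanospirillum\t0.0029\n"
    "species\t195\t0.0582\n"
    "species\tArthrobacter sp. FB24\t0.0835\n"
)
profile = read_profile(io.StringIO(example))
```

Keeping the taxon column as text lets the same field hold either a name or a numeric taxid, as the format allows.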

1.2 Mode functions

The mode parameter selects one of 5 functions that shift the results toward higher precision or higher sensitivity (Figure 2). Each bin has a cut-off value $C_{bin}$ defined as:

Very-sensitive: $C_{bin} = \log(bin + 3) / \log(maxbins + 3)$
Sensitive: $C_{bin} = \log(bin + 1) / \log(maxbins + 1)$
Linear: $C_{bin} = bin / maxbins$
Precise: $C_{bin} = 2^{bin} / 2^{maxbins}$
Very-precise: $C_{bin} = 4^{bin} / 4^{maxbins}$

where maxbins is the total number of bins.
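The five cut-off functions can be evaluated directly. The sketch below is a hedged illustration rather than code from MetaMeta: the function name `cutoff` and the 1-based bin index are assumptions, and the precise modes are read as powers (2^bin / 2^maxbins and 4^bin / 4^maxbins).

```python
import math


def cutoff(mode: str, bin_index: int, max_bins: int) -> float:
    """Cut-off value for one bin; bin_index is assumed to run from 1 to max_bins."""
    if not 1 <= bin_index <= max_bins:
        raise ValueError("bin_index must be between 1 and max_bins")
    if mode == "very-sensitive":
        return math.log(bin_index + 3) / math.log(max_bins + 3)
    if mode == "sensitive":
        return math.log(bin_index + 1) / math.log(max_bins + 1)
    if mode == "linear":
        return bin_index / max_bins
    if mode == "precise":
        return 2.0 ** bin_index / 2.0 ** max_bins
    if mode == "very-precise":
        return 4.0 ** bin_index / 4.0 ** max_bins
    raise ValueError(f"unknown mode: {mode}")


MODES = ["very-sensitive", "sensitive", "linear", "precise", "very-precise"]

# Cut-off values for 4 bins in each mode.
table = {mode: [cutoff(mode, b, 4) for b in range(1, 5)] for mode in MODES}
```

With 4 bins, the first-bin cut-off falls from about 0.71 (very-sensitive) to about 0.016 (very-precise), while the last bin is 1.0 in every mode.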

2 Results

2.1 Databases

Table 1: MetaMeta pre-configured databases

Tool     Archaea + Bacteria (v1)                        Custom
CLARK    Yes (https://doi.org/10.5281/zenodo.819305)    Yes
DUDes    Yes (https://doi.org/10.5281/zenodo.819343)    Yes
GOTTCHA  Yes (https://doi.org/10.5281/zenodo.819341)    No
kaiju    Yes (https://doi.org/10.5281/zenodo.819425)    Yes
kraken   Yes (https://doi.org/10.5281/zenodo.819363)    Yes
mOTUs    Yes (https://doi.org/10.5281/zenodo.819365)    No

Figure 2: Example of cut-off values for 4 bins in each mode.
[Plot: x axis, Bins (1 to 4); y axis, Cut-off (% of taxons kept), 0.0 to 1.0; one curve per mode: very-sensitive, sensitive, linear, precise, very-precise.]

2.2 Computer specifications

The main evaluations were performed with MetaMeta v1.1 on an x86 cluster with a total of 1000 cores and roughly 3.5 TB RAM. The sub-sampling evaluations on CAMI data were performed with MetaMeta v1.0 on a machine with 60 CPUs (Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz), 1056 GB RAM, Debian GNU/Linux 8.4 and a 2.8 TB SSD.

2.3 Datasets and Parameters

The MetaMeta pipeline was executed with all 6 pre-configured tools using the archaea and bacteria database (Table 1).

All CAMI toy sets (low, medium and high complexity) were obtained from https://data.cami-challenge.org/

148 stool samples from HMP were obtained from http://hmpdacc.org/

List of analyzed samples: SRS011061, SRS011134, SRS011239, SRS011271, SRS011302, SRS011405, SRS011452, SRS011529, SRS011586, SRS012273, SRS012902, SRS013158, SRS013215, SRS013476, SRS013521, SRS013687, SRS013800, SRS013951, SRS014235, SRS014287, SRS014313, SRS014459, SRS014613, SRS014683, SRS014923, SRS014979, SRS015065, SRS015133, SRS015190, SRS015217, SRS015264, SRS015369, SRS015578, SRS015663, SRS015782, SRS015794, SRS015854, SRS015960, SRS016018, SRS016056, SRS016095, SRS016203, SRS016267, SRS016335, SRS016495, SRS016517, SRS016585, SRS016753, SRS016954, SRS016989, SRS017103, SRS017191, SRS017247, SRS017307, SRS017433, SRS017521, SRS017701, SRS017821, SRS018133, SRS018313, SRS018351, SRS018427, SRS018575, SRS018656, SRS018817, SRS019030, SRS019161, SRS019267, SRS019397, SRS019582, SRS019601, SRS019685, SRS019787, SRS019910, SRS019968, SRS020233, SRS020328, SRS020869, SRS021484, SRS021948, SRS022071, SRS022137, SRS022524, SRS022609, SRS022713, SRS023346, SRS023526, SRS023583, SRS023829, SRS023914, SRS023971, SRS024009, SRS024075, SRS024132, SRS024265, SRS024331, SRS024388, SRS024435, SRS024549, SRS024625, SRS042284, SRS042628, SRS043001, SRS043411, SRS043701, SRS045004, SRS045645, SRS045713, SRS047014, SRS047044, SRS048164, SRS048870, SRS049164, SRS049712, SRS049900, SRS049959, SRS049995, SRS050299, SRS050422, SRS050752, SRS050925, SRS051031, SRS051882, SRS052027, SRS052697, SRS053214, SRS053335, SRS053398, SRS054590, SRS054956, SRS055982, SRS056259, SRS056519, SRS057478, SRS057717, SRS058723, SRS058770, SRS062427, SRS063040, SRS063985, SRS064276, SRS064557, SRS064645, SRS065504, SRS075398, SRS077730, SRS078176

The sample SRS023176 could not be properly analyzed due to inconsistent read pairs.

Table 2: MetaMeta (v1.1) parameters used for the CAMI and HMP data. Default parameters were used when not stated below.

Parameter      Default  CAMI low/med./high  HMP
trimming       0        -                   -
desiredminlen  70       -                   -
subsample      0        -                   -
mode           linear   -                   sensitive
cutoff         0.0001   -                   0.00001
bins           4        -                   -
ranks          species  -                   -

2.4 Results

Table 3: MetaMeta (v1.0) parameters used for the sub-sampled CAMI data. Default parameters were used when not stated below. N/A: not applicable

Parameter      Default  CAMI 1%  CAMI 5%  CAMI 10%  CAMI 16.6%  CAMI 25%  CAMI 50%  CAMI 100%
trimming       0        1        1        1         1           1         1         1
desiredminlen  70       -        -        -         -           -         -         -
strictness     0.8      -        -        -         -           -         -         -
errorcorr      0        -        -        -         -           -         -         -
subsample      0        1        1        1         1           1         1         -
samplesize     1        0.01     0.05     0.1       -           0.25      0.5       N/A
replacement    0        -        -        -         -           1         1         N/A
mode           linear   precise  precise  precise   precise     precise   precise   precise
cutoff         0.0001   0.00001  0.00001  0.00001   0.00001     0.00001   0.00001   0.00001
bins           4        3        3        3         3           3         3         3
ranks          species  -        -        -         -           -         -         -

Figure 3: True and False Positives - CAMI medium complexity set. In blue (left y axis): True Positives. In red (right y axis): False Positives. Results at species level. Each marker represents one out of four samples from the CAMI medium complexity set.
[Plot: x axis, tools (clark, dudes, gottcha, kaiju, kraken, motus, metametamerge); left y axis, True Positives (55 to 75); right y axis, False Positives (0 to 2500).]

Figure 4: Precision and Sensitivity - CAMI medium complexity set. Results at species level. Each marker represents one out of four samples from the CAMI medium complexity set.
[Plot: x axis, Sensitivity (0.0 to 1.0); y axis, Precision (0.0 to 1.0); legend: clark, dudes, gottcha, kaiju, kraken, motus, metametamerge.]
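For readers who want to reproduce the quantities plotted in the true/false positive and precision/sensitivity figures, the sketch below uses the standard definitions on sets of taxa. It is an illustration only: the function name and the example taxids are hypothetical, and the evaluation code actually used for the figures is not shown in this file.

```python
from typing import Iterable, Tuple


def precision_sensitivity(predicted: Iterable[str],
                          expected: Iterable[str]) -> Tuple[int, int, float, float]:
    """Return (true positives, false positives, precision, sensitivity)."""
    predicted, expected = set(predicted), set(expected)
    tp = len(predicted & expected)   # taxa reported and present in the truth
    fp = len(predicted - expected)   # taxa reported but absent from the truth
    fn = len(expected - predicted)   # taxa in the truth that were missed
    precision = tp / (tp + fp) if predicted else 0.0
    sensitivity = tp / (tp + fn) if expected else 0.0
    return tp, fp, precision, sensitivity


# Hypothetical species-level taxids, not taken from the CAMI data.
tp, fp, precision, sensitivity = precision_sensitivity(
    predicted={"562", "195", "354", "1280"},
    expected={"562", "195", "210"},
)
```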

Figure 5: L1 norm error. Mean of the L1 norm measure at each taxonomic level for four samples from the medium complexity CAMI set.
[Plot: x axis, taxonomic level (superkingdom, phylum, class, order, family, genus, species); y axis, L1 norm (0.0 to 1.4); legend: clark, dudes, gottcha, kaiju, kraken, motus, metametamerge.]
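The L1 norm between a predicted and an expected abundance profile at one taxonomic level can be sketched as below. This is a generic definition offered as an assumption: the function name is hypothetical, and any normalisation applied before computing the values in the figure is not specified in this file.

```python
from typing import Dict


def l1_norm(predicted: Dict[str, float], expected: Dict[str, float]) -> float:
    """Sum of absolute abundance differences over all taxa seen in either profile."""
    taxa = set(predicted) | set(expected)
    # A taxon missing from one profile counts as abundance 0 there.
    return sum(abs(predicted.get(t, 0.0) - expected.get(t, 0.0)) for t in taxa)


# Hypothetical profiles, each summing to 1.
error = l1_norm({"a": 0.5, "b": 0.5}, {"a": 0.5, "c": 0.5})
```

When both profiles sum to 1, this measure ranges from 0 (identical profiles) to 2 (no shared taxa).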

Figure 6: True and False Positives - CAMI low complexity set. In blue (left y axis): True Positives. In red (right y axis): False Positives. Results at species level.
[Plot: x axis, tools (clark, dudes, gottcha, kaiju, kraken, motus, metametamerge); left y axis, True Positives (5 to 10); right y axis, False Positives (0 to 2500).]

Figure 7: Precision and Sensitivity - CAMI low complexity set. Results at species level.
[Plot: x axis, Sensitivity (0.0 to 1.0); y axis, Precision (0.0 to 1.0); legend: clark, dudes, gottcha, kaiju, kraken, motus, metametamerge.]

Figure 8: L1 norm error. L1 norm measure at each taxonomic level for one sample from the low complexity CAMI set.
[Plot: x axis, taxonomic level (superkingdom, phylum, class, order, family, genus, species); y axis, L1 norm (0.0 to 1.5); legend: clark, dudes, gottcha, kaiju, kraken, motus, metametamerge.]

Figure 9: Sub-sampling. Precision at species level for one randomly selected CAMI high complexity sample. Each sub-sample was executed five times. Lines represent the mean, and the area around each line the maximum and minimum achieved values. The evaluated sample sizes are: 100%, 50%, 25%, 16.6%, 10%, 5%, 1%. 16.6% is the exact division of the whole sample among 6 tools. Sub-samples above that value were taken with replacement and those below it without replacement.
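The sub-sampling scheme in the caption can be pictured with the sketch below. It rests on assumptions that this file does not confirm: the sample size is read as the fraction of reads given to each tool, "without replacement" is read as disjoint sets across tools, and "with replacement" as independent draws that may share reads between tools. The function name and the seed handling are hypothetical.

```python
import random
from typing import List, Sequence


def subsample_for_tools(reads: Sequence[str], fraction: float, n_tools: int,
                        replacement: bool, seed: int = 0) -> List[List[str]]:
    """Draw one sub-sample of round(fraction * len(reads)) reads per tool."""
    rng = random.Random(seed)
    k = round(fraction * len(reads))
    if replacement:
        # Independent draws: different tools may receive the same read.
        return [rng.sample(list(reads), k) for _ in range(n_tools)]
    if k * n_tools > len(reads):
        raise ValueError("not enough reads for disjoint sub-samples")
    # Disjoint draws: shuffle once, then hand out consecutive slices.
    shuffled = list(reads)
    rng.shuffle(shuffled)
    return [shuffled[i * k:(i + 1) * k] for i in range(n_tools)]


reads = [f"read{i}" for i in range(60)]
split = subsample_for_tools(reads, 1 / 6, 6, replacement=False)   # exact division
shared = subsample_for_tools(reads, 0.5, 6, replacement=True)     # larger than 1/6
```

Under this reading, a fraction of 1/6 with 6 tools is the largest size at which every tool can still receive a distinct set of reads, which is why larger sizes need replacement.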

Figure 10: Rulegraph. Overview of the rules and their dependencies in the MetaMeta pipeline.
[Graph: one node per rule. Rules shown: all, trim_reads, errorcorr_reads, subsample_reads, clean_reads, clean_files, krona, metametamerge, metametamerge_get_taxdump, database_profile, db_archaea_bacteria_download, db_archaea_bacteria_unpack, db_archaea_bacteria_check, clark_db_custom_1 to clark_db_custom_3, clark_db_custom_check, clark_db_custom_profile, clark_run_1, clark_rpt, dudes_db_custom_1 to dudes_db_custom_4, dudes_db_custom_check, dudes_db_custom_profile, dudes_run_1, dudes_run_2, dudes_rpt, gottcha_run_1, gottcha_rpt, kaiju_db_custom_1 to kaiju_db_custom_4, kaiju_db_custom_check, kaiju_db_custom_profile, kaiju_run_1, kaiju_rpt, kraken_db_custom_1 to kraken_db_custom_3, kraken_db_custom_check, kraken_db_custom_profile, kraken_run_1, kraken_rpt, motus_run_1, motus_rpt.]

Figure 11: DAG - pre-configured database. Directed acyclic graph of the MetaMeta pipeline for one sample, one database (pre-configured) and six tools.
[Graph: nodes are rule instances annotated with their tool (clark, dudes, gottcha, kaiju, kraken, motus), database (archaea_bacteria) or sample (sample_data_custom_viral).]

Figure 12: DAG - custom database. Directed acyclic graph of the MetaMeta pipeline for one sample, one database (custom) and 4 tools.
[Graph: nodes are rule instances annotated with their tool (clark, dudes, kaiju, kraken), database (custom_viral_db) or sample (sample_data_custom_viral).]

Figure 13: DAG - multiple samples. Directed acyclic graph of the MetaMeta pipeline for two samples, two databases (pre-configured and custom) and six tools.
[Graph: nodes are rule instances annotated with their tool (clark, dudes, gottcha, kaiju, kraken, motus), database (archaea_bacteria, custom_viral_db) or sample (sample_data_custom_viral, sample_data_custom_viral_2).]
