Additional File 1
MetaMeta: Integrating metagenome analysis
tools to improve taxonomic profiling
Vitor C. Piro, Marcel Matschkowski, Bernhard Y. Renard
1 Implementation
Figure 1: Score and bin matrices. Left: matrix with an example of calculated scores for 6 tools. Right: matrix showing the division of the scores into 4 bins.
1.1 File formats
MetaMeta accepts the BioBoxes format directly (https://github.com/bioboxes/rfc/tree/master/data-format) or a tab-separated (.tsv) file in one of the following formats.

Profiling: rank, taxon name or taxid, abundance
Example:
    genus      Methanospirillum            0.0029
    genus      Thermus                     0.0029
    genus      568394                      0.0029
    species    Arthrobacter sp. FB24       0.0835
    species    195                         0.0582
    species    Mycoplasma gallisepticum    0.0536

Binning: readid, taxon name or taxid, length of sequence assigned
Example:
    M2|S1|R140    354        201
    M2|S1|R142    195        201
    M2|S1|R145    457425     201
    M2|S1|R146    562        201
    M2|S1|R147    1245471    201
    M2|S1|R150    354        201
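The two .tsv layouts above can be read with a few lines of code. The following is a minimal sketch, not MetaMeta's own parser: the function names `parse_profile` and `parse_binning` are hypothetical, and it assumes the three fields are separated by single tab characters.

```python
def parse_profile(text):
    """Parse profiling lines: rank <TAB> taxon name or taxid <TAB> abundance."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rank, taxon, abundance = line.split("\t")
        records.append((rank, taxon, float(abundance)))
    return records


def parse_binning(text):
    """Parse binning lines: readid <TAB> taxon name or taxid <TAB> assigned length."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        read_id, taxon, length = line.split("\t")
        records.append((read_id, taxon, int(length)))
    return records


profile = parse_profile("genus\tThermus\t0.0029\nspecies\t195\t0.0582\n")
binning = parse_binning("M2|S1|R140\t354\t201\n")
```

The taxon field is kept as a string in both cases, because the format allows either a name or a numeric taxid in that column.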
1.2 Mode functions
The mode parameter selects one of 5 functions, which produce results ranging from more sensitive to more precise (Figure 2). Each bin has a cut-off value $C_{bin}$ defined as:

Very-sensitive: $C_{bin} = \log(bin + 3) / \log(maxbins + 3)$
Sensitive: $C_{bin} = \log(bin + 1) / \log(maxbins + 1)$
Linear: $C_{bin} = bin / maxbins$
Precise: $C_{bin} = 2^{bin} / 2^{maxbins}$
Very-precise: $C_{bin} = 4^{bin} / 4^{maxbins}$

where maxbins is the total number of bins.
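The cut-off definitions above can be evaluated directly. The sketch below is illustrative rather than MetaMeta's implementation: the function name `cutoff` is hypothetical, and the precise and very-precise modes are read as powers of 2 and 4 (an assumption about the typeset exponents).

```python
import math


def cutoff(mode, bin_index, max_bins):
    """Return the cut-off value C for one bin (1-based) under the given mode."""
    if mode == "very-sensitive":
        return math.log(bin_index + 3) / math.log(max_bins + 3)
    if mode == "sensitive":
        return math.log(bin_index + 1) / math.log(max_bins + 1)
    if mode == "linear":
        return bin_index / max_bins
    if mode == "precise":
        return 2 ** bin_index / 2 ** max_bins
    if mode == "very-precise":
        return 4 ** bin_index / 4 ** max_bins
    raise ValueError(f"unknown mode: {mode}")


# Cut-offs for 4 bins in every mode, as plotted in Figure 2.
modes = ["very-sensitive", "sensitive", "linear", "precise", "very-precise"]
table = {m: [cutoff(m, b, 4) for b in range(1, 5)] for m in modes}
```

In every mode the last bin has a cut-off of 1, and the sensitive modes keep a larger fraction in the lower bins than the precise modes do.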
2 Results
2.1 Databases
Table 1: MetaMeta pre-configured databases

Tool       Archaea + Bacteria (v1)                        Custom
CLARK      Yes (https://doi.org/10.5281/zenodo.819305)    Yes
DUDes      Yes (https://doi.org/10.5281/zenodo.819343)    Yes
GOTTCHA    Yes (https://doi.org/10.5281/zenodo.819341)    No
kaiju      Yes (https://doi.org/10.5281/zenodo.819425)    Yes
kraken     Yes (https://doi.org/10.5281/zenodo.819363)    Yes
mOTUs      Yes (https://doi.org/10.5281/zenodo.819365)    No
Figure 2: Example of cut-off values for 4 bins in each mode.
[Plot: x axis: Bins (1 to 4); y axis: Cut-off (% of taxons kept), 0.0 to 1.0; one line per mode: very-sensitive, sensitive, linear, precise, very-precise.]
2.2 Computer specifications
The main evaluations were performed with MetaMeta v1.1 on an x86 cluster consisting of a total of 1000 cores and roughly 3.5 TB RAM. The sub-sampling evaluations on CAMI data were performed with MetaMeta v1.0 on: 60 CPUs x Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 1056 GB RAM, Debian GNU/Linux 8.4, 2.8 TB SSD.
2.3 Datasets and Parameters
The MetaMeta pipeline was executed with all 6 pre-configured tools using the archaea and bacteria database (Table 1).
All CAMI toy sets (low, medium and high complexity) were obtained from https://data.cami-challenge.org/
148 stool samples from HMP were obtained at: http://hmpdacc.org/
List of analyzed samples: SRS011061, SRS011134, SRS011239, SRS011271, SRS011302, SRS011405, SRS011452, SRS011529, SRS011586, SRS012273, SRS012902, SRS013158, SRS013215, SRS013476, SRS013521, SRS013687, SRS013800, SRS013951, SRS014235, SRS014287, SRS014313, SRS014459, SRS014613, SRS014683, SRS014923, SRS014979, SRS015065, SRS015133, SRS015190, SRS015217, SRS015264, SRS015369, SRS015578, SRS015663, SRS015782, SRS015794, SRS015854, SRS015960, SRS016018, SRS016056, SRS016095, SRS016203, SRS016267, SRS016335, SRS016495, SRS016517, SRS016585, SRS016753, SRS016954, SRS016989, SRS017103, SRS017191, SRS017247, SRS017307, SRS017433, SRS017521, SRS017701, SRS017821, SRS018133, SRS018313, SRS018351, SRS018427, SRS018575, SRS018656, SRS018817, SRS019030, SRS019161, SRS019267, SRS019397, SRS019582, SRS019601, SRS019685, SRS019787, SRS019910, SRS019968, SRS020233, SRS020328, SRS020869, SRS021484, SRS021948, SRS022071, SRS022137, SRS022524, SRS022609, SRS022713, SRS023346, SRS023526, SRS023583, SRS023829, SRS023914, SRS023971, SRS024009, SRS024075, SRS024132, SRS024265, SRS024331, SRS024388, SRS024435, SRS024549, SRS024625, SRS042284, SRS042628, SRS043001, SRS043411, SRS043701, SRS045004, SRS045645, SRS045713, SRS047014, SRS047044, SRS048164, SRS048870, SRS049164, SRS049712, SRS049900, SRS049959, SRS049995, SRS050299, SRS050422, SRS050752, SRS050925, SRS051031, SRS051882, SRS052027, SRS052697, SRS053214, SRS053335, SRS053398, SRS054590, SRS054956, SRS055982, SRS056259, SRS056519, SRS057478, SRS057717, SRS058723, SRS058770, SRS062427, SRS063040, SRS063985, SRS064276, SRS064557, SRS064645, SRS065504, SRS075398, SRS077730, SRS078176

The sample SRS023176 could not be properly analyzed due to inconsistent read pairs.
Table 2: MetaMeta (v1.1) parameters used for the CAMI and HMP data. Default parameters were used when not stated below.

Parameter        Default    CAMI low/med./high    HMP
trimming         0          -                     -
desiredminlen    70         -                     -
subsample        0          -                     -
mode             linear     -                     sensitive
cutoff           0.0001     -                     0.00001
bins             4          -                     -
ranks            species    -                     -
2.4 Results
Table 3: MetaMeta (v1.0) parameters used for the sub-sampled CAMI data. Default parameters were used when not stated below. N/A: not applicable.

Parameter        Default    CAMI 1%    CAMI 5%    CAMI 10%   CAMI 16.6%   CAMI 25%   CAMI 50%   CAMI 100%
trimming         0          1          1          1          1            1          1          1
desiredminlen    70         -          -          -          -            -          -          -
strictness       0.8        -          -          -          -            -          -          -
errorcorr        0          -          -          -          -            -          -          -
subsample        0          1          1          1          1            1          1          -
samplesize       1          0.01       0.05       0.1        -            0.25       0.5        N/A
replacement      0          -          -          -          -            1          1          N/A
mode             linear     precise    precise    precise    precise      precise    precise    precise
cutoff           0.0001     0.00001    0.00001    0.00001    0.00001      0.00001    0.00001    0.00001
bins             4          3          3          3          3            3          3          3
ranks            species    -          -          -          -            -          -          -
Figure 3: True and False Positives - CAMI medium complexity set. In blue (left y axis): True Positives. In red (right y axis): False Positives. Results at species level. Each marker represents one out of four samples from the CAMI medium complexity set.
[Plot: x axis: clark, dudes, gottcha, kaiju, kraken, motus, metameta, merge; left y axis: True Positives (55 to 75); right y axis: False Positives (0 to 2500).]
Figure 4: Precision and Sensitivity - CAMI medium complexity set. Results at species level. Each marker represents one out of four samples from the CAMI medium complexity set.
[Plot: x axis: Sensitivity (0.0 to 1.0); y axis: Precision (0.0 to 1.0); legend: clark, dudes, gottcha, kaiju, kraken, motus, metameta, merge.]
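The quantities plotted in Figures 3 and 4 can be derived from the set of taxa a tool reports and the set of taxa truly present. The sketch below uses the standard definitions (precision = TP / (TP + FP), sensitivity = TP / (TP + FN)); that the evaluation here used exactly these definitions is an assumption, and the function name is hypothetical.

```python
def precision_sensitivity(predicted, truth):
    """Return (TP, FP, precision, sensitivity) for two collections of taxa."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # reported and truly present
    fp = len(predicted - truth)   # reported but not present
    fn = len(truth - predicted)   # present but not reported
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, precision, sensitivity


# Toy example with made-up taxon labels.
tp, fp, precision, sensitivity = precision_sensitivity(
    predicted={"a", "b", "c", "d"}, truth={"a", "b", "e"}
)
```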
Figure 5: L1 norm error. Mean of the L1 norm measure at each taxonomic level for four samples from the medium complexity CAMI set.
[Plot: x axis: superkingdom, phylum, class, order, family, genus, species; y axis: L1 norm (0.0 to 1.4); legend: clark, dudes, gottcha, kaiju, kraken, motus, metameta, merge.]
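As an illustration of the measure in this figure, the L1 norm between a true and a predicted abundance profile at one taxonomic level can be computed as the sum of absolute abundance differences over all taxa in either profile. This is the usual definition, assumed here rather than taken from the text; the function name and the toy profiles are hypothetical.

```python
def l1_norm(truth, predicted):
    """Sum of absolute abundance differences over the union of taxa.

    Both arguments map taxon -> relative abundance; a taxon missing
    from one profile counts as abundance 0 in that profile.
    """
    taxa = set(truth) | set(predicted)
    return sum(abs(truth.get(t, 0.0) - predicted.get(t, 0.0)) for t in taxa)


# Toy profiles: one shared taxon, one missed taxon, one spurious taxon.
error = l1_norm({"a": 0.5, "b": 0.5}, {"a": 0.5, "c": 0.5})
```

When both profiles sum to 1, the value ranges from 0 (identical profiles) to 2 (no taxa in common).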
Figure 6: True and False Positives - CAMI low complexity set. In blue (left y axis): True Positives. In red (right y axis): False Positives. Results at species level.
[Plot: x axis: clark, dudes, gottcha, kaiju, kraken, motus, metameta, merge; left y axis: True Positives (5 to 10); right y axis: False Positives (0 to 2500).]
Figure 7: Precision and Sensitivity - CAMI low complexity set. Results at species level.
[Plot: x axis: Sensitivity (0.0 to 1.0); y axis: Precision (0.0 to 1.0); legend: clark, dudes, gottcha, kaiju, kraken, motus, metameta, merge.]
Figure 8: L1 norm error. L1 norm measure at each taxonomic level for one sample from the low complexity CAMI set.
[Plot: x axis: superkingdom, phylum, class, order, family, genus, species; y axis: L1 norm (0.0 to 1.5); legend: clark, dudes, gottcha, kaiju, kraken, motus, metameta, merge.]
Figure 9: Sub-sampling. Precision at species level for one randomly selected CAMI high complexity sample. Each sub-sample was executed five times. Lines represent the mean, and the area around each line the maximum and minimum achieved values. The evaluated sample sizes are: 100%, 50%, 25%, 16.6%, 10%, 5%, 1%. 16.6% is the exact division among 6 tools, using the whole sample. Sub-samples above that value were taken with replacement and those below it without replacement.
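The caption's distinction between sampling with and without replacement can be illustrated as follows. This sketch rests on an interpretation that is an assumption: "without replacement" is taken to mean that the tools receive disjoint sub-samples (possible only up to 1/6 of the reads each for 6 tools), and "with replacement" that each tool's sub-sample is drawn independently, so reads may be shared between tools. The function name and arguments are hypothetical and do not reflect MetaMeta's code.

```python
import random


def subsample_per_tool(read_ids, n_tools, fraction, replacement, seed=0):
    """Return one list of read ids per tool, each holding `fraction` of the reads."""
    rng = random.Random(seed)
    size = round(len(read_ids) * fraction)
    if replacement:
        # Independent draw per tool: a read may appear in several sub-samples.
        return [rng.sample(list(read_ids), size) for _ in range(n_tools)]
    if size * n_tools > len(read_ids):
        raise ValueError("not enough reads for disjoint sub-samples")
    # Disjoint sub-samples: shuffle once, then cut consecutive slices.
    shuffled = list(read_ids)
    rng.shuffle(shuffled)
    return [shuffled[i * size:(i + 1) * size] for i in range(n_tools)]


reads = list(range(600))
disjoint = subsample_per_tool(reads, n_tools=6, fraction=0.1, replacement=False)
shared = subsample_per_tool(reads, n_tools=6, fraction=0.5, replacement=True)
```

Under this reading, a fraction of 1/6 with 6 tools uses every read exactly once, which matches the caption's description of 16.6% as the exact division of the whole sample.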
Figure 10: Rulegraph. Overview of the rules and their dependencies in the MetaMeta pipeline.
[Graph nodes (rules): all, krona, metametamerge, metametamerge_get_taxdump, clean_files, clean_reads, trim_reads, errorcorr_reads, subsample_reads, database_profile; db_archaea_bacteria_download, db_archaea_bacteria_unpack, db_archaea_bacteria_check; clark_run_1, dudes_run_1, dudes_run_2, gottcha_run_1, kaiju_run_1, kraken_run_1, motus_run_1; clark_rpt, dudes_rpt, gottcha_rpt, kaiju_rpt, kraken_rpt, motus_rpt; clark_db_custom_1 to clark_db_custom_3, clark_db_custom_check, clark_db_custom_profile; dudes_db_custom_1 to dudes_db_custom_4, dudes_db_custom_check, dudes_db_custom_profile; kaiju_db_custom_1 to kaiju_db_custom_4, kaiju_db_custom_check, kaiju_db_custom_profile; kraken_db_custom_1 to kraken_db_custom_3, kraken_db_custom_check, kraken_db_custom_profile.]
Figure 11: DAG - pre-configured database. Directed acyclic graph of the MetaMeta pipeline for one sample, one database (pre-configured) and six tools.
[Graph: one node per rule instance for sample sample_data_custom_viral and database archaea_bacteria, covering read pre-processing (trim_reads, errorcorr_reads, subsample_reads, clean_reads), per-tool database download, unpack and check, per-tool run and report rules for clark, dudes, gottcha, kaiju, kraken and motus, and the final metametamerge, krona, clean_files and all rules.]
Figure 12: DAG - custom database. Directed acyclic graph of the MetaMeta pipeline for one sample, one database (custom) and 4 tools.
[Graph: one node per rule instance for sample sample_data_custom_viral and database custom_viral_db, covering read pre-processing, the custom database build, check and profile rules, and the run and report rules for clark, dudes, kaiju and kraken, followed by metametamerge, krona, clean_files and all.]
Figure 13: DAG - multiple samples. Directed acyclic graph of the MetaMeta pipeline for two samples, two databases (pre-configured and custom) and six tools.
[Graph: one node per rule instance for samples sample_data_custom_viral and sample_data_custom_viral_2 and databases archaea_bacteria and custom_viral_db, combining the rule types shown in Figures 11 and 12.]