Insights & Perspectives: Think again

Inferring human microbial dynamics from temporal metagenomics data: Pitfalls and lessons

Hong-Tai Cao 1)2)3), Travis E. Gibson 1), Amir Bashan 1)4) and Yang-Yu Liu 1)5)*

1) Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
2) Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA
3) Chu Kochen Honors College, College of Electrical Engineering, Zhejiang University, Hangzhou, Zhejiang, China
4) Department of Physics, Bar-Ilan University, Ramat-Gan, Israel
5) Center for Cancer Systems Biology, Dana-Farber Cancer Institute, Boston, MA, USA

*Corresponding author: Yang-Yu Liu, E-mail: [email protected]

DOI 10.1002/bies.201600188
Bioessays 39, 2, 1600188, © 2016 WILEY Periodicals, Inc.
www.bioessays-journal.com

Abbreviations: FMT, fecal microbiota transplantation; OTU, operational taxonomic unit.

The human gut microbiota is a very complex and dynamic ecosystem that plays a crucial role in health and well-being. Inferring microbial community structure and dynamics directly from time-resolved metagenomics data is key to understanding the community ecology and predicting its temporal behavior. Many methods have been proposed to perform the inference. Yet, as we point out in this review, there are several pitfalls along the way. Indeed, the uninformative temporal measurements and the compositional nature of the relative abundance data raise serious challenges in inference. Moreover, the inference results can be largely distorted when only focusing on highly abundant species by ignoring or grouping low-abundance species. Finally, the implicit assumptions in various regularization methods may not reflect reality. Those issues have to be seriously considered in ecological modeling of human gut microbiota.

Keywords: dynamics inference; ecological modeling; human microbiome; temporal metagenomics

Additional supporting information may be found in the online version of this article at the publisher’s web-site.

Introduction

We coexist with trillions of microbes that live in and on our bodies [1]. Those microorganisms play key roles in human physiology and diseases [2]. Propelled by metagenomics and next-generation DNA sequencing technologies, many scientific advances have been made through the work of large-scale, consortium-driven metagenomic projects [3, 4]. Despite these technical advances that help us acquire more accurate organismal compositions and metabolic functions, little is known about the underlying ecological dynamics of our microbiota. Indeed, the microbes in our guts form very complex and dynamic ecosystems, which can be altered by diet change, medical interventions, and other factors [5–7]. The alterability of our microbiota not only offers a promising future for practical microbiome-based therapies [7, 8], such as fecal microbiota transplantation (FMT) [9, 10], but also raises long-term safety concerns. Careless interventions could shift our microbiota to an undesired state with unintended health consequences due to its high complexity. Consequently, there is an urgent need to understand the underlying ecological dynamics of our microbiota; in the absence of this knowledge we lack a theoretical framework for microbiome-based therapies in general.

Measured temporal data, reasonable dynamical models, and objective criteria for model selection are the key elements in successfully inferring the system dynamics [11]. In the context of human gut microbiota, the measured temporal data are the time-series of microbe abundances, typically measured from the stool samples of a few individuals. Different dynamical models have been used to describe the dynamics of microbial ecosystems, for example linear models [12]; nonlinear models such as different variations of the Generalized Lotka-Volterra (GLV) model [13–18]; and other models [19]. Among these models, GLV is a very popular one due to its simplicity. Given the measured temporal data and a dynamical model with many unknown parameters, we need to identify those parameters that yield the best model estimation according to certain criteria (e.g. minimum estimation error).

There are many methods that infer the microbial dynamics and reconstruct the ecological network from temporal metagenomics data based on the GLV model [20–23]. An overview of the workflow is depicted in Fig. 1. We apply certain perturbations to the system (for example the administration of antibiotics or prebiotics) and measure the species abundances as a function of time using DNA sequencing technologies. The unknown underlying microbial dynamics can be parameterized in a population dynamics model with various model parameters, such as the intrinsic growth rates and the inter- and intra-species interactions in the GLV model. In particular, the inter-species interactions can be captured by an ecological network and visualized as a directed graph, as shown in Fig. 1C. If the data are “rich” or informative enough, then we can reconstruct the ecological dynamics by identifying all the model parameters. The model parameters can then be used in turn to predict the temporal behavior of the microbial ecosystem, an ultimate goal of ecological modeling of human gut microbiota.

Yet, this is just an ideal case. In reality, there are many pitfalls along the way. For example, the temporal data could be uninformative due to either a low sampling rate or “unexcited” system dynamics. The compositional nature of the relative abundance data will cause fundamental limitations in inference. And overlooking low-abundance but strongly interacting species might lead to erroneous model parameters. These issues can seriously affect the inference results if they are not dealt with thoughtfully. In this work, we systematically study those pitfalls and point out possible solutions. Note that, here, we aim to reconstruct the ecological dynamics and the corresponding directed inter-species interaction network, rather than constructing any undirected microbial association network using similarity-based techniques, for example Pearson or Spearman correlations for abundance data or the hypergeometric distribution for presence-absence data. The construction of microbial association networks has its own pitfalls, as discussed in detail in [24].

Dynamics inference requires model, data, and methods

Choose a proper dynamics model for the microbial ecosystem

One of the key elements in system identification is choosing a reasonable dynamics model. Recently, population dynamics models, especially the classical GLV model, have been used for predictive modeling of the intestinal microbiota [16, 20–23]. Consider a collection of $n$ microbes in a habitat, with the population of microbe $i$ at time $t$ denoted as $x_i(t)$. The GLV model assumes that the microbe populations follow a set of ordinary differential equations (ODEs):

$$\dot{x}_i(t) = x_i(t)\left( r_i + \sum_{j=1}^{n} a_{ij} x_j(t) \right), \tag{1}$$

$i = 1, \ldots, n$. Here $r_i$ is the intrinsic growth rate of microbe $i$, $a_{ij}$ (when $i \neq j$) accounts for the impact that microbe $j$ has on the population change of microbe $i$, and the terms $a_{ii} x_i^2$ are adopted according to Verhulst’s logistic growth model [25]. Both $r_i$ and $a_{ij}$ are assumed to be time-invariant, that is, they are constant regardless of how the system evolves over time. By collecting the individual populations $x_i(t)$ into a state vector $x(t) = (x_1(t), \cdots, x_n(t))^T \in \mathbb{R}^n_{\geq 0}$, Equation (1) can be represented in a compact form

$$\dot{x}(t) = \mathrm{diag}(x(t)) \left( r + A x(t) \right), \tag{2}$$

where $r = (r_1, \cdots, r_n)^T \in \mathbb{R}^n$ is a column vector of the intrinsic growth rates, $A = (a_{ij}) \in \mathbb{R}^{n \times n}$ is the inter-species interaction matrix, and $\mathrm{diag}$ generates a diagonal matrix from a vector.
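As a minimal sketch of equation (2), assuming NumPy and a hypothetical two-species system (the values of `r` and `A` below are chosen only for illustration and are not taken from the paper), the right-hand side can be written as a one-line function. The check uses the fact that a strictly positive fixed point must satisfy $r + A x^* = 0$.

```python
import numpy as np

def glv_rhs(x, r, A):
    """Right-hand side of equation (2): dx/dt = diag(x) (r + A x)."""
    return x * (r + A @ x)

# Hypothetical parameters, for illustration only.
r = np.array([1.0, 0.5])
A = np.array([[-1.0, -0.2],
              [ 0.3, -1.0]])

# A strictly positive fixed point solves r + A x* = 0, i.e. x* = -A^{-1} r.
x_star = -np.linalg.solve(A, r)
print("fixed point:", x_star)
print("derivative at fixed point:", glv_rhs(x_star, r, A))
```

At such a fixed point the derivative vanishes, which is why samples taken at steady state carry little information about `r` and `A` individually.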

The original GLV model, equation (2), excludes all the external perturbations applied to the system. For a class of asymptotically stable microbial ecosystems that follow this deterministic model without any external perturbations, the microbe abundance profile will asymptotically approach a unique steady state [15]. However, time-series data of the steady state reveal little about the underlying dynamics, which is a bad scenario for system identification.

To excite the system and get “richer” or more informative time-series data, we apply external perturbations to drive the system and measure its response. In fact, we have to wisely design drive-response experiments to infer the underlying dynamics [19, 26]. Recently, an extended GLV model has been proposed to explicitly consider the impact of various external stimuli or perturbations $u_q(t)$ on the system dynamics [21, 23]:

$$\dot{x}(t) = \mathrm{diag}(x(t)) \left( r + A x(t) + C u(t) \right), \tag{3}$$

where $u(t) = (u_1(t), \cdots, u_l(t))^T \in \mathbb{R}^l$ is the perturbation vector at time $t$, and $C = (c_{iq}) \in \mathbb{R}^{n \times l}$ is the susceptibility matrix, with $c_{iq}$ representing the stimulus strength of perturbation $u_q(t)$ on species $i$. This mimics realistic perturbations from antibiotics or prebiotics, which can inhibit or benefit the growth of certain microbes. The presence or absence of the antibiotics or prebiotics is evaluated as a binary perturbation $u(t)$ (Fig. 1A), and the overall influence on species $i$ is the sum over perturbations of the products $c_{iq} u_q(t)$, scaled by the species abundance $x_i(t)$, as in equation (3). We can then infer the microbial system under this particular drive-response scheme.

Besides the binary perturbation scheme, there is another type of drive-response experiment, which does not require us to introduce the susceptibility matrix $C$ into the GLV model at all. This driving perturbation is implemented by setting up different initial conditions for the microbial ecosystem. For each initial condition change (which mimics the immediate result of an FMT), the system will respond by displaying certain transient behavior before it reaches the equilibrium (steady) state. We can treat the initial conditions as jumps or finite pulses and then concatenate several perturbed time-series corresponding to different initial conditions. By construction, the concatenated time-series data contain various transient behaviors of the system corresponding to different finite pulses, which could be very informative and help us infer the underlying system dynamics. Further comparisons between the above two drive-response experiments are discussed later (see Supplementary Fig. S1).

Figure 1. Overview of the workflow inferring microbial dynamics from time-series data. Given suitable perturbations (A) on a microbial ecosystem, and the corresponding time-series of microbe abundances (B), we aim to infer the microbial dynamics and reconstruct the underlying microbe-microbe interaction network (C) by using classical population dynamics models, e.g. the Generalized Lotka-Volterra (GLV) model, and various standard system identification techniques (D). In the ideal case, the reconstructed microbe-microbe interaction network (E) captures all the key features of the original network (C), and the predicted time-series (F) agrees well with the original measurement (B). Yet, as pointed out in this paper, there are many pitfalls in inferring the microbial dynamics from time-series data. In both (C) and (E), positive (or negative) interactions are shown in blue (or red) arrows, respectively. The absolute interaction strengths are proportional to the arrow widths and the microbiota growth rates are represented by circle colors. NRMSE represents the normalized root mean square error.

Collect informative data to identify model parameters

Prior to the era of high-throughput DNA sequencing, microbiology studies heavily relied on cultivating microbes from collected samples. Yet, this process is rather tedious and time-consuming. Thanks to the development of next generation sequencing, we can now study microbiomes by direct DNA sequencing. In particular, the 16S ribosomal RNA (rRNA) gene targeted amplicon sequencing is a popular approach. In this approach, part of the 16S rRNA gene, which is the most ubiquitous and conserved marker gene of the bacterial genome, is sequenced [27]. Due to its simplicity, relatively low cost and availability of various developed analysis pipelines, this approach has become routine for determining the taxonomic composition and species diversity of microbial communities [28]. By filtering spurious reads and carefully clustering/grouping the remaining reads into the so-called Operational Taxonomic Units (OTUs) based on sequence similarity, one can obtain reliable and informative counts from 16S rRNA gene sequences. Indeed, as working names of groups of related bacteria, OTUs are intended to represent some degree of taxonomic relatedness. One can then assign a frequency to each distinct OTU within the microbial community describing their relative abundances within the population.

Note that comparing microbial composition between two or more populations on the basis of OTUs in their corresponding samples is totally different from comparing the absolute abundance of the taxa in the microbial ecosystems from which the samples are collected. As the total taxa abundance of the entire microbial ecosystem is unknown, it is only reasonable to draw inferences regarding the relative abundance of a taxon in the ecosystem using its relative abundance in the collected sample. In short, the microbial community can be described in terms of which OTUs are present and their relative abundances. The intrinsic compositionality of the relative abundance data will cause trouble in inference.

To reveal the pitfalls in inference, we generate synthetic time-series data of microbe abundances using the classical GLV model in this work. Although there are already human microbial time-series data available [29, 30], we find in our previous work [15] that the time-series data are not “rich” enough to infer the human microbial dynamics. Indeed, the inter-species interaction matrix $A$ reconstructed from the real time-series data is almost the same as that reconstructed from the randomly shuffled time-series data, where temporality is completely removed (see Figs. S13–S15 in [15]). Another reason for using synthetic data is that we can control the “richness” of data and quantify the error between the inferred results and the ground truth.

As there is no closed-form solution to the ODEs of the GLV model in equation (3), we solve them at predetermined time points. Many numerical integration methods such as explicit Runge-Kutta formula [31, 32], Adams-Bashforth-Moulton method [33] and Gear’s method [34, 35] can be used to approximate the solutions of equation (3). In this work, we choose the frequently used Runge-Kutta method. The total number of the synthetic data points is obtained by dividing the integral interval by the step-size. Note that the integral interval $[0, t]$ in numerical integration can be mapped to any length of time in reality, such as several weeks, days, or hours. To assign a realistic time unit to the synthetic data, we leverage two observations: (i) in our simulations (with the model parameters and initial conditions chosen as described in Supporting Information), the GLV systems typically reach equilibrium state at around $t = 1$; (ii) human microbial ecosystems relax to the equilibrium state in about 10 days after small perturbations [16, 20, 23, 36]. Hence, we map the integral interval $[0, t]$ in the simulation to $[0, 10t]$ days in real time. For example, if we run the numerical integration from $t = 0$ to 10, this is equivalent to collecting the time-series data from day 0 to day 100. We emphasize that all the results presented in this work do not depend on the details of the time unit chosen in our simulations.
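The integration step described above can be sketched as follows. This is an illustrative fixed-step classical fourth-order Runge-Kutta scheme written with NumPy, not the authors' code; the two-species parameters, the susceptibility matrix `C`, the pulse timing in `u_of_t`, and the helper names are all hypothetical.

```python
import numpy as np

def extended_glv_rhs(x, u, r, A, C):
    """Equation (3): dx/dt = diag(x) (r + A x + C u)."""
    return x * (r + A @ x + C @ u)

def rk4_simulate(x0, u_of_t, r, A, C, t_end, dt):
    """Classical fourth-order Runge-Kutta with a fixed step size dt."""
    n_steps = int(round(t_end / dt))
    ts = dt * np.arange(n_steps + 1)
    xs = np.empty((n_steps + 1, len(x0)))
    xs[0] = x0
    for k in range(n_steps):
        t, x = ts[k], xs[k]
        k1 = extended_glv_rhs(x, u_of_t(t), r, A, C)
        k2 = extended_glv_rhs(x + 0.5 * dt * k1, u_of_t(t + 0.5 * dt), r, A, C)
        k3 = extended_glv_rhs(x + 0.5 * dt * k2, u_of_t(t + 0.5 * dt), r, A, C)
        k4 = extended_glv_rhs(x + dt * k3, u_of_t(t + dt), r, A, C)
        xs[k + 1] = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return ts, xs

# Hypothetical two-species system with one binary perturbation.
r = np.array([1.0, 0.5])
A = np.array([[-1.0, -0.2],
              [ 0.3, -1.0]])
C = np.array([[0.5], [-0.5]])

def u_of_t(t):
    """Binary perturbation, switched on for 2 <= t < 3 (arbitrary choice)."""
    return np.array([1.0 if 2.0 <= t < 3.0 else 0.0])

ts, xs = rk4_simulate(np.array([0.1, 0.1]), u_of_t, r, A, C, t_end=10.0, dt=0.1)
days = 10.0 * ts  # the paper's mapping of [0, t] to [0, 10 t] days
print(xs.shape, days[-1])
```

With `t_end=10.0` and `dt=0.1` this yields one sample per simulated "day" over 100 days, matching the time mapping described in the text.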

Inference methods are applied under various assumptions

Let $x_i(t_k)$ be the population of the $i$-th microbial species or OTU and $u_q(t_k)$ be the $q$-th external perturbation at time point $t_k$. Here $k = 0, 1, \cdots, T$. The synthetic temporal data are generated based on the intrinsic growth rate vector $r$, the inter-species interaction matrix $A$, and the susceptibility matrix $C$. We need an inference method to identify all the model parameters in $r$, $A$, and $C$, based on the time-series data $\{x_i(t_k), u_q(t_k)\}$.

Move $x_i(t)$ of equation (3) to the left-hand side (that is, divide both sides by $x_i(t)$) and then integrate both sides over the time interval $[t_k, t_{k+1}]$, yielding

$$\ln x_i(t_{k+1}) - \ln x_i(t_k) = \left( r_i + \sum_{j=1}^{n} a_{ij} x_j(t_k) + \sum_{q=1}^{l} c_{iq} u_q(t_k) \right) (t_{k+1} - t_k) + e_i(t_k), \tag{4}$$

where we have assumed that $x_i(t)$ and $u_q(t)$ are roughly constant over $t \in [t_k, t_{k+1}]$, $t_k \geq 0$. Here $e_i(t_k)$ represents the corresponding error arising from the approximation of the integral by holding the integrand constant over the time interval.

Define the scaled log-difference matrix $Y = \{y_{ik}\} = \{y_i(t_k)\} \in \mathbb{R}^{n \times T}$ where $y_i(t_k) = (\ln x_i(t_{k+1}) - \ln x_i(t_k)) / (t_{k+1} - t_k)$, the parameter vector $\theta_i^T = [r_i, a_{i1}, \ldots, a_{in}, c_{i1}, \ldots, c_{il}]^T \in \mathbb{R}^{1+n+l}$, and the vector $\phi_k = (1, x_1(t_k), \ldots, x_n(t_k), u_1(t_k), \ldots, u_l(t_k))^T \in \mathbb{R}^{1+n+l}$, then the discretized GLV model in equation (4) can be represented by a system of linear algebraic equations:

$$Y = \Theta \Phi + \frac{E}{t_{k+1} - t_k}. \tag{5}$$

Here $\Theta = \mathrm{col}\{\theta_i\} = (\theta_1^T, \theta_2^T, \cdots, \theta_n^T)^T = (r, A, C) \in \mathbb{R}^{n \times (1+n+l)}$ is the parameter matrix that needs to be identified. $E \in \mathbb{R}^{n \times T}$ represents the corresponding approximation error matrix. $\Phi = \mathrm{row}\{\phi_k\} = (\phi_0, \phi_1, \cdots, \phi_{T-1}) \in \mathbb{R}^{(1+n+l) \times T}$. Equation (5) is often called the identification function that can be used to solve for the unknown parameter matrix $\Theta$.

Given any time-series data $x(t_k)$ and $u(t_k)$ of the GLV model, $\Theta$ should be a solution of the identification function (5). Yet, $\Theta$ usually cannot be exactly solved, as equation (5) is usually underdetermined because of the limited available data. Indeed, the number of equations $n \times T$ is typically less than the number of unknowns $n \times (1+n+l)$. $\Theta$ can be approximately solved by optimization methods. There are many algorithms to obtain an approximate solution, though. We discuss those methods as follows.
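The matrices $Y$ and $\Phi$ defined for equation (5) can be assembled from sampled data along the following lines. This is a sketch assuming NumPy, with samples stored row-wise in time; the function name `build_regression_matrices` and the toy numbers are hypothetical.

```python
import numpy as np

def build_regression_matrices(ts, X, U):
    """Assemble Y and Phi of equation (5).

    ts: (T+1,) sampling times t_0 ... t_T
    X:  (T+1, n) abundances, row k holds x(t_k)
    U:  (T+1, l) perturbations, row k holds u(t_k)
    Returns Y with shape (n, T) and Phi with shape (1+n+l, T).
    """
    dt = np.diff(ts)                                   # t_{k+1} - t_k
    Y = (np.diff(np.log(X), axis=0) / dt[:, None]).T   # scaled log-differences
    Phi = np.vstack([np.ones(len(dt)), X[:-1].T, U[:-1].T])
    return Y, Phi

# Toy data (illustration only): two species, one perturbation, three samples.
ts = np.array([0.0, 1.0, 3.0])
X = np.exp(np.array([[0.0, 0.0],
                     [1.0, 2.0],
                     [3.0, 2.0]]))
U = np.array([[0.0], [1.0], [0.0]])

Y, Phi = build_regression_matrices(ts, X, U)
print(Y)
print(Phi.shape)
```

Dividing by each interval length separately lets the same code handle non-uniform sampling; the last sample only enters through the final log-difference, so $\Phi$ has $T$ columns for $T+1$ samples.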

Least square

Mathematically, $\Theta$ can be estimated as $\hat{\Theta}$ by solving the following optimization problem:

$$\min_{\hat{\Theta}} \; \| Y - \hat{\Theta} \Phi \|_F^2, \tag{6}$$

where $\|Z\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} z_{ij}^2}$ is the Frobenius norm of matrix $Z = (z_{ij}) \in \mathbb{R}^{m \times n}$. The solution $\hat{\Theta}$ can be obtained by the classical least-square regression method:

$$\hat{\Theta} = Y \Phi^T \left( \Phi \Phi^T \right)^{\dagger}, \tag{7}$$

where $(\Phi \Phi^T)^{\dagger}$ represents the pseudo-inverse matrix of $\Phi \Phi^T$. Note that $(\Phi \Phi^T)^{\dagger} = (\Phi \Phi^T)^{-1}$ when $\Phi \Phi^T$ is non-singular.
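Equation (7) translates almost directly into NumPy. The sketch below is illustrative only: random regressors stand in for a real $\Phi$, the noise-free responses are generated from a hypothetical `Theta_true`, and the explicit `rcond` cutoff is an assumption added to make the rank decision robust. It also shows the underdetermined case discussed earlier, where the data are fitted exactly even though the parameters are not recovered.

```python
import numpy as np

def least_squares_estimate(Y, Phi):
    """Equation (7): Theta_hat = Y Phi^T (Phi Phi^T)^dagger."""
    return Y @ Phi.T @ np.linalg.pinv(Phi @ Phi.T, rcond=1e-10)

rng = np.random.default_rng(0)
n, l, T = 3, 1, 50
Theta_true = rng.normal(size=(n, 1 + n + l))     # hypothetical (r, A, C)
Phi = np.vstack([np.ones(T), rng.uniform(0.1, 1.0, size=(n + l, T))])
Y = Theta_true @ Phi                             # noise-free responses

# Enough informative samples: the parameters are recovered.
Theta_hat = least_squares_estimate(Y, Phi)

# Too few samples (T = 3 < 1 + n + l = 5): the fit is exact, the parameters are not.
Phi_short, Y_short = Phi[:, :3], Y[:, :3]
Theta_min = least_squares_estimate(Y_short, Phi_short)

print(np.abs(Theta_hat - Theta_true).max())
print(np.abs(Theta_min @ Phi_short - Y_short).max(),
      np.abs(Theta_min - Theta_true).max())
```

In the short-data case the pseudo-inverse returns the minimum-norm solution, one of infinitely many parameter matrices that reproduce the samples.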

Regularizations

In statistical regression, the least-square solution (without any penalty) in equation (7) can be biased and cause over-fitting. Regularization methods can reduce the over-fitting issue by adding different penalty terms (e.g. based on $\ell_1$- or $\ell_2$-norm) to the regression. In particular, lasso regularization [37–39], which uses $\ell_1$-norm penalties, solves the regression problem in the form of

$$\min_{\beta_i, \hat{\theta}_i} \left( \frac{1}{2T} \sum_{k=1}^{T} \left( y_{ik} - \phi_k^T \hat{\theta}_i^T \right)^2 + \beta_i \sum_{j=1}^{1+n+l} |\hat{\theta}_{ij}| \right), \tag{8}$$

where $\hat{\theta}_{ij}$ is the $j$-th element in $\hat{\theta}_i$ and $i = 1, 2, \cdots, n$. Lasso regression estimates the unknown parameters in the $i$-th row of $\Theta$. There are several techniques for solving this regularized problem and choosing its penalty parameter, such as truncated singular value decomposition, L-curve, cross validation and so on. Detailed algorithms and discussions can be found in [40]. In this work, we use the $k$-fold cross validation method and let $k = 5$ in lasso regularization.
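To make the sparsifying effect of the $\ell_1$ penalty in equation (8) concrete, here is a small coordinate-descent sketch for one row of $\Theta$ with a fixed penalty $\beta_i$. It is not the solver used in the paper, and it does not perform the 5-fold cross validation; the random regressors, the sparse `theta_true`, and the value `beta=0.1` are hypothetical choices for illustration.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator, the closed-form 1-D lasso solution."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_row(y, Phi, beta, n_iter=500):
    """Minimize (1/(2T)) * sum_k (y_k - theta . phi_k)^2 + beta * ||theta||_1.

    y: (T,) one row of Y; Phi: (p, T) regressors with no all-zero row.
    """
    p, T = Phi.shape
    theta = np.zeros(p)
    row_sq = (Phi ** 2).sum(axis=1) / T          # (1/T) * sum_k phi_jk^2
    residual = y - theta @ Phi
    for _ in range(n_iter):
        for j in range(p):
            residual += theta[j] * Phi[j]        # drop coordinate j from the fit
            rho = Phi[j] @ residual / T
            theta[j] = soft_threshold(rho, beta) / row_sq[j]
            residual -= theta[j] * Phi[j]        # put the updated coordinate back
    return theta

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 200))                  # stand-in regressors
theta_true = np.array([1.5, 0.0, 0.0, -2.0, 0.0])
y = theta_true @ Phi
theta_hat = lasso_row(y, Phi, beta=0.1)
print(theta_hat)
```

The nonzero entries are shrunk toward zero by roughly the penalty, while entries that are truly zero are set exactly to zero; this is the sparsity assumption that lasso builds into the inferred interactions.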

Different from the lasso regularization that uses $\ell_1$-norm penalties, Tikhonov regularization, also known as ridge regression in statistics, uses $\ell_2$-norm penalties:

$$\min_{\beta_i, \hat{\theta}_i} \left( \frac{1}{2T} \sum_{k=1}^{T} \left( y_{ik} - \phi_k^T \hat{\theta}_i^T \right)^2 + \frac{\beta_i}{2} \| \hat{\theta}_i \|^2 \right), \tag{9}$$

where $\| \cdot \|$ represents the $\ell_2$-norm and $i = 1, 2, \cdots, n$. Similar to lasso regression, the above penalty terms $\beta_i$ can also be determined by cross validation. There are $n$ different $\beta_i$’s penalizing all the model parameters.
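For a fixed $\beta_i$, the objective in equation (9) has a closed-form minimizer: setting its gradient to zero gives $(\Phi \Phi^T + T \beta_i I)\, \hat{\theta}_i^T = \Phi\, y_i^T$, where $y_i$ is the $i$-th row of $Y$. The sketch below assumes NumPy; the random data and the two penalty values are hypothetical, and the selection of $\beta_i$ by cross validation is not shown.

```python
import numpy as np

def ridge_row(y, Phi, beta):
    """Minimize (1/(2T)) * sum_k (y_k - theta . phi_k)^2 + (beta/2) * ||theta||^2."""
    p, T = Phi.shape
    return np.linalg.solve(Phi @ Phi.T + T * beta * np.eye(p), Phi @ y)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(6, 4))     # 6 unknowns but only T = 4 samples
y = rng.normal(size=4)

theta_small = ridge_row(y, Phi, beta=0.01)
theta_large = ridge_row(y, Phi, beta=10.0)
print(np.linalg.norm(theta_small), np.linalg.norm(theta_large))
```

The penalty makes the linear system solvable even with fewer samples than unknowns, and a larger $\beta_i$ yields an estimate with a smaller norm; neither property says anything about closeness to the true parameters.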

Linear combinations of $\ell_1$- and $\ell_2$-norm penalties in equations (8) and (9) result in the so-called elastic net regularization method [41]:

$$\min_{\beta_i, \hat{\theta}_i} \left( \frac{1}{2T} \sum_{k=1}^{T} \left( y_{ik} - \phi_k^T \hat{\theta}_i^T \right)^2 + \beta_i P_{\mu}\!\left( \hat{\theta}_i^T \right) \right), \tag{10}$$

where

$$P_{\mu}\!\left( \hat{\theta}_i^T \right) = \frac{1-\mu}{2} \| \hat{\theta}_i \|^2 + \mu \sum_{j=1}^{1+n+l} |\hat{\theta}_{ij}|,$$

and $\mu \in [0, 1]$ is a predetermined parameter for the optimization. The elastic net regularization becomes the Tikhonov (or lasso) regularization when $\mu = 0$ (or 1), respectively.

All the regularization methods (lasso, Tikhonov and elastic net) use penalty terms to regularize the least-square regression. The penalty terms make the absolute values of the estimates smaller and suppress the unimportant parameters toward 0. Unimportant parameters in $\hat{\theta}_i$ will be forced to be 0 in lasso regularization in equation (8) due to the presence of the penalty terms $\beta_i \sum_{j=1}^{1+n+l} |\hat{\theta}_{ij}|$. Therefore lasso is a kind of sparse regression that implicitly assumes the interaction matrix $A$ in the GLV model is sparse (which is of course not necessarily true). Although these regularization methods reduce the norm of the estimates and aim to make the results more realistic, it does not mean the results are getting close to the ground truth.
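The penalty $P_{\mu}$ of equation (10) is simple enough to write out and check at its two limits. The following plain-Python sketch uses a hypothetical function name and an arbitrary test vector.

```python
def elastic_net_penalty(theta, mu):
    """P_mu(theta) = (1 - mu)/2 * ||theta||_2^2 + mu * ||theta||_1, with mu in [0, 1]."""
    l2_squared = sum(v * v for v in theta)
    l1 = sum(abs(v) for v in theta)
    return 0.5 * (1.0 - mu) * l2_squared + mu * l1

theta = [3.0, -4.0]                                  # arbitrary example vector
ridge_limit = elastic_net_penalty(theta, mu=0.0)     # half the squared l2-norm
lasso_limit = elastic_net_penalty(theta, mu=1.0)     # the l1-norm
mixed = elastic_net_penalty(theta, mu=0.5)
print(ridge_limit, lasso_limit, mixed)
```

Setting `mu=0.0` reproduces the Tikhonov penalty of equation (9) and `mu=1.0` the lasso penalty of equation (8), in line with the limits stated in the text.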

Pitfalls in current dynamic inference

Accurate time-series prediction does not imply accurate inference

As the ground truth is typically unknown in real world system identification problems, the identified system parameters are usually verified by simulating the model dynamics and comparing the predicted time-series with the measured one. This is suitable for simple systems but not for complex microbial systems. Indeed, accurate temporal predictions are possible even if the identified interactions look totally different from the actual ones [42].

To demonstrate the above point, we set up a synthetic microbial system with eight species, following the GLV dynamics with three binary perturbations. It is a microbial system with homogeneous interaction strengths among all species, with mean degree 6.4 in the underlying ecological network. The abundance of a certain species is increased when its susceptibility is positive and the binary perturbation is turned on. The populations of all the species in the microbial system are simulated from $t = 0$ to 10, which is mapped to 100 days. The sampling rate is set to be once per day, which means there are a total of 100 data points for this data set, where the time interval between two adjacent data points is one day.

Comparing A2 and A3 of Fig. 2, we find that we can accurately predict the temporal behavior of the microbial population, given the same initial conditions and the time-series perturbation data (Fig. 2A1). Yet, the identified inter-species interaction network (Fig. 2B2) looks drastically different from the ground truth (Fig. 2B1). For example, some strong interactions (e.g. 2 → 1) are lost, and some unessential interactions are inferred as dominant interactions (e.g. 6 → 5). In fact, all the identified model parameters are quite different from the ground truth (see Fig. 2C1–C3). Their differences are measured in terms of normalized root mean square error (NRMSE) and details are provided in the Supplementary Information. The above result clearly demonstrates that accurate temporal prediction could be just due to over-fitting, and the identified model parameters could be far from the ground truth.
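The exact NRMSE definition is given in the paper's supplementary material, which is not reproduced here. As an illustration only, the sketch below assumes one common convention, namely the root mean square error divided by the range (maximum minus minimum) of the true values; other normalizations, such as dividing by the mean or the standard deviation, are equally plausible, so the function should be read as a hypothetical stand-in.

```python
import math

def nrmse(true_values, estimates):
    """RMSE normalized by the range of the true values (assumed convention)."""
    if len(true_values) != len(estimates):
        raise ValueError("inputs must have the same length")
    mse = sum((e - t) ** 2 for t, e in zip(true_values, estimates)) / len(true_values)
    return math.sqrt(mse) / (max(true_values) - min(true_values))

# Flattened entries of a true and an inferred parameter set (toy numbers).
perfect = nrmse([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
shifted = nrmse([0.0, 2.0], [1.0, 3.0])
print(perfect, shifted)
```

The same function can be applied separately to the predicted time-series and to the inferred parameters, which is what makes the contrast in Fig. 2 quantifiable: a small error on the former does not bound the error on the latter.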

Sampling rate really matters

Different sampling rates capture different resolutions of the dynamics of the microbial system [43]. The inferred microbial networks from time-series data can be misleading if the microbial system is sampled at an improper frequency. Unfortunately, there are no simple rules like Nyquist frequency for the GLV model, and the ideal sampling rate depends on the particular microbial system of interest [6, 43]. Results presented in Fig. 1 (sample 100 times per day) and Fig. 2 (sample once per day) clearly suggest that sampling rate is really an important factor determining the performance of inference, as discussed below in detail.

The sampling rate is crucial as it bridges the measured discrete time-series data and the original continuous-time microbial system. Obviously, a higher sampling rate makes the interpolated discrete time-series data better approximate the continuous-time dynamics of the original system. It should be pointed out that the scaled log-difference $y_{ik}$ in equation (4) represents the linearized approximation of the GLV. As $t_{k+1} - t_k$ increases linearly, $y_{ik}$ changes nonlinearly, which results in a nonlinear $e_i(t_k)$. The sampling rate becomes important because of this nonlinear behavior of the approximation error. Though we can arbitrarily increase the sampling rate for synthetic data, it is rather costly in real data collection and may not even be feasible for human gut microbial systems. Hence, it would be more desirable if the time-series data can approximate the original microbial dynamics with higher accuracy at a low sampling rate.
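The growth of the approximation error with the sampling interval can be seen in the simplest possible case, a single species ($n = 1$, no perturbation), where the GLV equation reduces to logistic growth and has a known closed-form solution. The sketch below compares the scaled log-difference with the model term $r + a\,x(t_k)$ for the four integration steps quoted later in the text; the parameter values and function names are hypothetical.

```python
import math

def logistic(t, x0=0.1, r=1.0, a=-1.0):
    """Exact solution of dx/dt = x (r + a x), the one-species GLV model."""
    K = -r / a
    return K / (1.0 + (K / x0 - 1.0) * math.exp(-r * t))

def max_discretization_error(dt, t_end=10.0, r=1.0, a=-1.0):
    """Largest |y_k - (r + a x(t_k))| over the sampled trajectory."""
    n_steps = int(round(t_end / dt))
    worst = 0.0
    for k in range(n_steps):
        x_now = logistic(k * dt, r=r, a=a)
        x_next = logistic((k + 1) * dt, r=r, a=a)
        y = (math.log(x_next) - math.log(x_now)) / dt   # scaled log-difference
        worst = max(worst, abs(y - (r + a * x_now)))
    return worst

errors = {dt: max_discretization_error(dt) for dt in (0.7, 0.2, 0.1, 0.05)}
for dt, err in errors.items():
    print(f"dt = {dt:4.2f}  max |e| / dt-scaled error = {err:.4f}")
```

Because the data here are exact, the whole discrepancy comes from holding the integrand constant over each interval, which is the error term $e_i(t_k)$ of equation (4) divided by the step.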

The binary perturbation scheme helps us excite the system to get more informative time-series data, but the extended GLV model in equation (3) introduces more model parameters (namely the whole susceptibility matrix $C$) that bring new approximation error into $e_i(t_k)$ and require more available data points. In reality, the finest longitudinal data of human gut microbiota are actually sampled just on a daily basis for hundreds of days due to many limitations, and the data set is still limited. Hence, we prefer the perturbation scheme that uses concatenated time-series with different initial conditions. Indeed, we find that this initial-condition-perturbation scheme is much better than the binary perturbation scheme in terms of a smaller number of unknowns. It also provides more accurate inference results compared to the binary external perturbations (see Supplementary Fig. S1).

Figure 2. Perfect time-series prediction does not imply accurate network reconstruction. A1: Time-series of binary perturbations. A2: Synthetic time-series of species abundances generated from a GLV model. Both perturbation and abundance data are sampled once per day. A3: Predicted time-series of species abundances calculated from the inferred GLV model. B1: Original inter-species interaction network. B2: Reconstructed inter-species interaction network. Here in both B1 and B2 only the top-10 strongest interactions are shown. Circle colors represent growth rates. C1: Inferred interaction strengths versus true interaction strengths. C2: Inferred growth rates versus true growth rates. C3: Inferred susceptibilities versus true susceptibilities.

We choose four sampling rates (weekly, every two days, daily, and twice a day), as shown in Fig. 3, to evaluate the impact of sampling rate on the performance of inference with the initial-condition-perturbation scheme. All results are obtained by the same regression method under different time steps, that is, the sampling intervals $t_{k+1} - t_k$ are 7, 2, 1, and 0.5 days, respectively. (In the numerical integration, the time steps are 0.7, 0.2, 0.1, and 0.05, respectively.) They lead to different approximation errors. Results show that a higher sampling rate with smaller approximation error leads to better inference results. Even when the data are sampled every 2 days, the inferred interactions are much more reliable than the results with a weekly sampling rate. In reality, this initial-condition-perturbation scheme can be implemented by fecal microbiota transplantation, which immediately changes the abundances of multiple species (or even introduces some new species). In the rest of this paper, we will focus on this type of perturbation.

Figure 3. Impact of sampling rates on inferring microbial dynamics. Row-1: Time-series of species abundances generated from a GLV model with different sampling rates: (A1): once a week; (B1): every two days; (C1): daily; and (D1): twice a day. Row-2: Predicted time-series of species abundances calculated from the corresponding inferred GLV model. Row-3: True interaction strengths versus inferred interaction strengths from time-series data of different sampling rates. Row-4: True growth rates versus inferred growth rates from time-series data of different sampling rates.
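The dependence of inference quality on the sampling interval described above can be sketched numerically. The following is not the authors' code: it assumes the standard GLV form dx_i/dt = x_i (r_i + sum_j a_ij x_j), a forward-difference approximation of the log-derivative, plain least squares, and a hypothetical three-species system (`r_true`, `A_true`) chosen only for illustration. Time-series from several random initial conditions are concatenated, mimicking the initial-condition-perturbation scheme, and the relative parameter error is reported for the four sampling intervals.

```python
import numpy as np

def glv_rhs(x, r, A):
    """Right-hand side of the GLV model: dx_i/dt = x_i * (r_i + sum_j a_ij x_j)."""
    return x * (r + A @ x)

def simulate(x0, r, A, h, n_steps):
    """Integrate the GLV model with a fixed-step RK4 scheme (fine step h)."""
    xs = np.empty((n_steps + 1, x0.size))
    xs[0] = x0
    for k in range(n_steps):
        x = xs[k]
        k1 = glv_rhs(x, r, A)
        k2 = glv_rhs(x + 0.5 * h * k1, r, A)
        k3 = glv_rhs(x + 0.5 * h * k2, r, A)
        k4 = glv_rhs(x + h * k3, r, A)
        xs[k + 1] = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return xs

def infer(runs, dt):
    """Least-squares fit of [r, A] from concatenated runs sampled every dt."""
    regressors, targets = [], []
    for xs in runs:
        # Forward difference of log-abundance approximates r_i + sum_j a_ij x_j(t_k).
        targets.append(np.diff(np.log(xs), axis=0) / dt)
        regressors.append(np.hstack([np.ones((len(xs) - 1, 1)), xs[:-1]]))
    X, Y = np.vstack(regressors), np.vstack(targets)
    theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return theta.T  # row i holds [r_i, a_i1, ..., a_in]

# Hypothetical 3-species system (illustrative values, not taken from the paper).
r_true = np.array([1.0, 0.8, 1.2])
A_true = np.array([[-1.0, -0.2, 0.1],
                   [0.15, -1.0, -0.3],
                   [-0.1, 0.2, -1.0]])
theta_true = np.hstack([r_true[:, None], A_true])

rng = np.random.default_rng(0)
h, n_steps = 0.01, 700  # fine integration step and horizon (7 time units)
fine_runs = [simulate(rng.uniform(0.1, 1.5, 3), r_true, A_true, h, n_steps)
             for _ in range(8)]

errors = {}
for dt in (0.7, 0.2, 0.1, 0.05):  # sampling intervals in model time units
    stride = int(round(dt / h))
    sampled = [xs[::stride] for xs in fine_runs]
    theta_hat = infer(sampled, dt)
    errors[dt] = np.linalg.norm(theta_hat - theta_true) / np.linalg.norm(theta_true)
    print(f"dt = {dt:4.2f}: relative parameter error = {errors[dt]:.4f}")
```

Because the data here are noise-free, any parameter error in this sketch comes from the finite-difference approximation alone, which is the effect the sampling-rate comparison isolates.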


Compositionality raises serious challenges

Microbial communities can typically be described in terms of memberships and relative abundances of OTUs. Using relative abundance data instead of the original time-series data is a limitation of the available data, as the total population is unknown. The compositionality of relative abundance data will not significantly alter the original data only when the total population is roughly time-invariant, which is not necessarily true. Even if the relative abundance data can approximate the original data, a time-invariant total population will be linearly correlated with the constant row in $F$, which will introduce linear correlations among the rows of $F$ and lead to the rank deficiency of $FF^{T}$. Hence, a roughly time-invariant total population will cause $FF^{T}$ to be almost singular, drastically reducing the numerical stability of the inverse and worsening the inference results.

Figure 4. Compositionality of relative abundance data impedes the inference of microbial dynamics. Column-1: using absolute abundance data. A1: Time-series of absolute abundances; A2: Predicted time-series of absolute abundances; A3: True interaction strengths versus inferred interaction strengths; A4: True growth rates versus inferred growth rates. Column-2: using relative abundance data. B1: Time-series of relative abundances; B2: Predicted time-series of relative abundances; B3: True interaction strengths versus inferred interaction strengths; B4: True growth rates versus inferred growth rates. Inference results from relative abundances are far from the ground truth. The time-series prediction of relative abundances also differs significantly from that of the original relative abundances.

In addition to rank deficiency, compositionality causes a more serious issue: it distorts the original dynamics when the total population is time-variant. We normalize the original synthetic data to mimic the limitation of real metagenomic data. Results are shown by the top (blue) curves in A1 and B1 of Fig. 4. The first jump is a positive jump in the original data (A1), representing an increase in the absolute abundance of this species. Yet, it becomes negative after normalization (B1), indicating a decrease in the relative abundance of this species. Hence, relative abundance data are not reliable, as they cannot represent the original data in this case. One promising solution to this issue is to measure overall microbial biomass over time in the ecosystem via the quantitative PCR technique [20, 21, 23].
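Both compositionality effects can be reproduced on toy numbers. The sketch below is illustrative only: it assumes the regressor matrix (written F in the text) stacks a constant row of ones on top of the abundance rows, and all abundance values are made up. The first part checks the rank of that matrix for absolute versus normalized abundances; the second part shows a species whose absolute abundance increases while its relative abundance decreases.

```python
import numpy as np

rng = np.random.default_rng(1)
n_species, n_samples = 4, 50

# Hypothetical absolute abundances (rows: species, columns: time points).
x_abs = rng.uniform(0.1, 1.0, size=(n_species, n_samples))
# Relative abundances: every column is normalized to sum to one.
x_rel = x_abs / x_abs.sum(axis=0, keepdims=True)

def regressor(x):
    """Stack a constant row of ones on top of the abundance rows (assumed layout)."""
    return np.vstack([np.ones((1, x.shape[1])), x])

F_abs, F_rel = regressor(x_abs), regressor(x_rel)
rank_abs = np.linalg.matrix_rank(F_abs)
rank_rel = np.linalg.matrix_rank(F_rel)
print("rank with absolute abundances:", rank_abs)
# With relative abundances the species rows sum to the constant row,
# so one row is redundant and F F^T cannot be inverted reliably.
print("rank with relative abundances:", rank_rel)
print("cond(F F^T), absolute:", np.linalg.cond(F_abs @ F_abs.T))
print("cond(F F^T), relative:", np.linalg.cond(F_rel @ F_rel.T))

# Sign flip: one species grows in absolute terms while the rest grow faster.
focal_abs = np.array([10.0, 15.0])    # focal species before / after the jump
others_abs = np.array([90.0, 185.0])  # all other species combined
focal_rel = focal_abs / (focal_abs + others_abs)
abs_jump = focal_abs[1] - focal_abs[0]
rel_jump = focal_rel[1] - focal_rel[0]
print("absolute jump:", abs_jump, "relative jump:", rel_jump)
```

If the total biomass were measured at each time point, for example by qPCR as suggested above, multiplying `x_rel` by that total would recover `x_abs` and remove both problems in this toy setting.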

Grouping or ignoring low-abundance species lacks justification

Since the number of equations is typically much smaller than the number of unknowns, many previous works group the low-abundance species together and treat them as a pseudo-species [16, 22, 23]. This approach sounds rational for reducing the number of unknowns (i.e. model parameters). Yet, we do not know if it indeed works as expected. If the low-abundance species are also strongly interacting species (i.e. they interact strongly with their interacting partners), they can easily drive the microbial ecosystem to different steady states [15]. Simply grouping all the low-abundance species together might generate distorted interaction networks. To test this approach, we systematically study the impact of grouping low-abundance species on inference.

We define high-abundance species to be those species that together account for 90% or more of the total abundance in the sampled time-series data. We compare three different scenarios: (i) We infer the interactions using the entire time-series data without grouping low-abundance species. (ii) We simply remove the low-abundance species from the temporal data, and focus only on the remaining species. (iii) We group all the low-abundance species as a new species, and then perform the inference. Inspired by [15], we deliberately generate a microbial system with interaction strength heterogeneity. The inferred results for the above three scenarios are shown in Fig. 5. Note that when all the species are considered, the identified interactions are accurate. Yet, ignoring or grouping low-abundance species leads to poor inference results.

Figure 5. Ignoring or grouping low-abundance species impedes the inference of microbial dynamics. Column A: Without ignoring or grouping low-abundance species, the inference results are acceptable, and the predicted time-series agrees well with the original time-series data, provided the sampling rate is high enough. Column B: After ignoring the low-abundance species, the inference results are much worse, although the predicted time-series still agrees well with the original time-series data. Column C: If we group the low-abundance species together and regard them as a new species, the inference results are still not comparable to the results obtained using the original data. In generating these figures, we consider a system of $n = 15$ species with a heterogeneous inter-species interaction network with mean degree $\langle k \rangle = 11.2$.

Figure 6. Inappropriate regularization impedes the inference of microbial dynamics. Column A: Without any regularization, we can perform the inference using the least-square method (i.e. no penalty terms). The inference results are not acceptable. Column B: With Tikhonov regularization (also known as ℓ2-regularization or ridge regression), the inference results are still bad. Column C: With lasso regularization (also known as ℓ1-regularization), the inference results are slightly better. Column D: With elastic net regularization, which uses a linear combination of ℓ1- and ℓ2-norm penalty terms (with $m = 0.5$ in equation (10)), the inference results are as good as those obtained using lasso only. Note that in all four cases, the predicted time-series agrees well with the original time-series data. In generating these figures, we consider a microbial ecosystem of $n = 30$ species with a homogeneous inter-species interaction network and mean degree $\langle k \rangle = 23.2$.

We emphasize that grouping low-abundance species is not a solution to the underdetermined problem. Even if the microbial interaction network is assumed to be homogeneous, the reconstructed network obtained by grouping some low-abundance species can be misleading, because grouping can create false interactions between the grouped species and highly abundant species.
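For concreteness, the preprocessing behind the three scenarios compared in this section can be sketched as follows. This is a hypothetical illustration, not the authors' pipeline: the function names, the toy abundance table, and the reading of the 90% criterion (the smallest set of top-ranked species whose summed abundance reaches 90% of the total) are all assumptions.

```python
import numpy as np

def split_by_abundance(series, threshold=0.9):
    """Return indices of high- and low-abundance species.

    series: array of shape (time points, species) with non-negative abundances.
    High-abundance species are the smallest set of top-ranked species whose
    summed abundance reaches `threshold` of the total (an assumed reading
    of the 90% criterion).
    """
    totals = series.sum(axis=0)
    order = np.argsort(-totals)  # most abundant species first
    cumulative = np.cumsum(totals[order]) / totals.sum()
    n_high = int(np.searchsorted(cumulative, threshold)) + 1
    return np.sort(order[:n_high]), np.sort(order[n_high:])

def ignore_low(series, high):
    """Scenario (ii): drop the low-abundance species."""
    return series[:, high]

def group_low(series, high, low):
    """Scenario (iii): merge the low-abundance species into one pseudo-species."""
    pseudo = series[:, low].sum(axis=1, keepdims=True)
    return np.hstack([series[:, high], pseudo])

# Toy data: 2 time points, 5 species (illustrative numbers only).
series = np.array([[40.0, 35.0, 10.0, 10.0, 5.0],
                   [60.0, 25.0, 14.0, 0.5, 0.5]])
high, low = split_by_abundance(series)
ignored = ignore_low(series, high)
grouped = group_low(series, high, low)
print("high:", high, "low:", low)
print("grouped series:\n", grouped)
```

Scenario (i) simply uses `series` unchanged. Both reduced tables discard information about the individual low-abundance species, so any interaction they take part in has to be absorbed by the remaining parameters, which is one way to see how false interactions can arise.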

Regularizations need to be done with care

As the identification function of equation (5) is typically under-determined, regularization methods such as those in equations (8)–(10) are preferred to the least-square regression method (no regularization) in equation (7). To determine which of the methods, least-square regression (no regularization), Tikhonov (with ℓ2-norm penalty), lasso (with ℓ1-norm penalty), or elastic net (with a linear combination of ℓ1- and ℓ2-norm penalties), works best, we apply them to the same time-series data (Fig. 6).

We find that least-square regression does not identify the model parameters. To our surprise, Tikhonov regularization does not work well either. This is partially due to the fact that it penalizes the ℓ2-norm of the unknowns, rather than the absolute values of the unknowns as lasso regularization does. If the unknowns differ by orders of magnitude, then Tikhonov regularization is doomed to failure. By contrast, lasso regularization shrinks the absolute values of the unknowns to avoid the overfitting problem. Hence it works very well even if the unknowns differ by orders of magnitude. Although lasso implicitly assumes that the interaction matrix A is sparse, its performance does not change significantly when the mean degree of the interaction network changes (see Supplementary Fig. S2). Although elastic net regularization combines both ℓ1- and ℓ2-norm penalties and benefits from the advantages of both lasso and Tikhonov regularizations [41], there is no significant improvement in the inference results, as shown in Supplementary Fig. S3.
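The four estimators compared above can be written down compactly for a generic under-determined linear problem y = Xw. The sketch below is a minimal illustration under stated assumptions: the objective functions are the textbook forms (they may be scaled differently from equations (7)–(10)), the lasso and elastic net are solved with a simple proximal-gradient (ISTA) loop rather than any particular package, and the data, penalty weights, and the name `elastic_net` are hypothetical.

```python
import numpy as np

def least_squares(X, y):
    """Minimum-norm least-squares solution (no penalty)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge(X, y, lam):
    """Tikhonov / ridge: minimize 0.5*||y - Xw||^2 + 0.5*lam*||w||_2^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def elastic_net(X, y, lam1, lam2=0.0, n_iter=5000):
    """Proximal gradient (ISTA) for
    0.5*||y - Xw||^2 + lam1*||w||_1 + 0.5*lam2*||w||_2^2.
    Setting lam2 = 0 gives the lasso."""
    w = np.zeros(X.shape[1])
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 + lam2)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) + lam2 * w
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)  # soft threshold
    return w

# Toy under-determined problem: 20 equations, 40 unknowns, and a sparse
# ground truth whose non-zero entries differ by more than an order of magnitude.
rng = np.random.default_rng(2)
m, p = 20, 40
w_true = np.zeros(p)
w_true[:5] = [2.0, -1.0, 0.5, -0.1, 0.05]
X = rng.standard_normal((m, p))
y = X @ w_true

estimates = {
    "least squares": least_squares(X, y),
    "ridge": ridge(X, y, lam=0.1),
    "lasso": elastic_net(X, y, lam1=0.1),
    "elastic net": elastic_net(X, y, lam1=0.05, lam2=0.05),  # equal-weight mix
}
for name, w_hat in estimates.items():
    err = np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true)
    zeros = int(np.sum(w_hat == 0))
    print(f"{name:13s} relative error = {err:.3f}, exact zeros = {zeros}")
```

All four estimators can fit `y` closely on such a problem, which mirrors the observation that good time-series prediction does not by itself indicate which regularization recovers the parameters. How the errors rank depends on the data and the penalty weights, so the printed numbers should be read as one toy instance, not as a general result.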

Conclusions and prospects

Inferring microbial dynamics from temporal metagenomics data is a very challenging task. Existing methods work well in predicting the population evolution of microbial systems. Yet, the identified model parameters might be totally different from their ground-truth values. Without direct experimental validation, it is hard to conclude that the inferred dynamics represent the true underlying microbial dynamics. New inference methods that can leverage some prior knowledge of the growth rates and/or inter-species interactions need to be developed.

Note that in this work, we do not focus on some other issues in dealing with real microbiome data, for example measurement noise, which of course will also affect the inference. Instead, we focus on synthetic data generated from the GLV model. We point out that even with “clean” time-series data, current technological limitations and common practices can lead to poor system identification. Some of these pitfalls can be overcome with more information, for example the measurement of total bacterial biomass present in the samples using qPCR techniques. Other pitfalls are more difficult to deal with. New inference methods that can take full advantage of existing microbiome data sets still need to be developed. In particular, Bayesian inference algorithms could be very useful in practice, because they not only estimate the error in inferences of dynamical systems parameters but also perform statistical modeling of temporal metagenomics data [20].

Authors’ contribution

Y.-Y.L. conceived and designed the project. H.-T.C. performed all the numerical simulations and data analysis. All authors analyzed the results. Y.-Y.L. and H.-T.C. wrote the manuscript. All authors edited the manuscript.


Acknowledgment

This work was partially supported by the John Templeton Foundation (award number 51977).

The authors have declared no conflict of interest.

References

1. Sender R, Fuchs S, Milo R. 2016. Revised estimates for the number of human and bacteria cells in the body. PLoS Biol 14: 1–14.

2. Clemente JC, Ursell LK, Parfrey LW, Knight R. 2012. The impact of the gut microbiota on human health: an integrative view. Cell 148: 1258–70.

3. Consortium H. 2012. Structure, function and diversity of the healthy human microbiome. Nature 486: 207–14.

4. Consortium H. 2012. A framework for human microbiome research. Nature 486: 215–21.

5. Costello EK, Stagaman K, Dethlefsen L, Bohannan BJM, et al. 2012. The application of ecological theory toward an understanding of the human microbiome. Science 336: 1255–62.

6. Gerber GK. 2014. The dynamic microbiome. FEBS Lett 588: 4131–9.

7. Lozupone CA, Stombaugh JI, Gordon JI, Jansson JK, et al. 2012. Diversity, stability and resilience of the human gut microbiota. Nature 489: 220–30.

8. Lemon KP, Armitage GC, Relman DA, Fischbach MA. 2012. Microbiota-targeted therapies: an ecological perspective. Sci Transl Med 4: 137rv5.

9. Aroniadis OC, Brandt LJ. 2013. Fecal microbiota transplantation: past, present and future. Curr Opin Gastroenterol 29: 79–84.

10. Borody TJ, Paramsothy S, Agrawal G. 2013. Fecal microbiota transplantation: indications, methods, evidence, and future directions. Curr Gastroenterol Rep 15: 1–7.

11. Ljung L. 1978. Convergence analysis of parametric identification methods. IEEE Trans Automatic Control 23: 770–83.

12. Faith JJ, McNulty NP, Rey FE, Gordon JI. 2011. Predicting a human gut microbiota’s response to diet in gnotobiotic mice. Science 333: 101–4.

13. Bomze IM. 1983. Lotka-Volterra equation and replicator dynamics: a two-dimensional classification. Biol Cybernetics 48: 201–11.

14. Bomze IM. 1995. Lotka-Volterra equation and replicator dynamics: new issues in classification. Biol Cybernetics 72: 447–53.

15. Gibson TE, Bashan A, Cao H-T, Weiss ST, et al. 2016. On the origins and control of community types in the human microbiome. PLoS Comput Biol 12: e1004688.

16. Marino S, Baxter NT, Huffnagle GB, Petrosino JF, et al. 2014. Mathematical modeling of primary succession of murine intestinal microbiota. Proc Natl Acad Sci USA 111: 439–44.

17. Metz JA, Geritz SA, Meszéna G, Jacobs FJ, et al. 1996. Adaptive dynamics, a geometrical study of the consequences of nearly faithful reproduction. Stochastic Spatial Struct Dyn Syst 45: 183–231.

18. Steinway SN, Biggs MB, Loughran Jr TP, Papin JA, et al. 2015. Inference of network dynamics and metabolic interactions in the gut microbiome. PLoS Comput Biol 11: e1004338.

19. Timme M, Casadiego J. 2014. Revealing networks from dynamics: an introduction. J Physics A Math Theor 47: 343001.

20. Bucci V, Tzen B, Li N, Simmons M, et al. 2016. MDSINE: microbial dynamical systems INference engine for microbiome time-series analyses. Genome Biol 17: 1.

21. Buffie CG, Bucci V, Stein RR, McKenney PT, et al. 2015. Precision microbiome reconstitution restores bile acid mediated resistance to Clostridium difficile. Nature 517: 205–8.

22. Fisher CK, Mehta P. 2014. Identifying keystone species in the human gut microbiome from metagenomic timeseries using sparse linear regression. PLoS ONE 9: 2451.

23. Stein RR, Bucci V, Toussaint NC, Buffie CG, et al. 2013. Ecological modeling from time-series inference: insight into dynamics and stability of intestinal microbiota. PLoS Comput Biol 9: e1003388.

24. Faust K, Raes J. 2012. Microbial interactions: from networks to models. Nat Rev Microbiol 10: 538–50.

25. Goel N, Maitra S, Montroll E. 1971. On the Volterra and other nonlinear models of interacting populations. Rev Mod Phys 43: 231–76.

26. Balsa-Canto E, Alonso AA, Banga JR. 2008. Computational procedures for optimal experimental design in biological systems. IET Syst Biol 2: 163–72.

27. Morgan XC, Huttenhower C. 2012. Chapter 12: human microbiome analysis. PLoS Comput Biol 8: e1002808.

28. Goodrich JK, Di Rienzi SC, Poole AC, Koren O, et al. 2014. Conducting a microbiome study. Cell 158: 250–62.

29. Caporaso JG, Lauber CL, Costello EK, Berg-Lyons D, et al. 2011. Moving pictures of the human microbiome. Genome Biol 12: R50.

30. David LA, Materna AC, Friedman J, Campos-Baptista MI, et al. 2014. Host lifestyle affects human microbiota on daily timescales. Genome Biol 15: 1.

31. Bogacki P, Shampine LF. 1989. A 3(2) pair of Runge-Kutta formulas. Appl Math Lett 2: 321–5.

32. Dormand JR, Prince PJ. 1980. A family of embedded Runge-Kutta formulae. J Comput Appl Math 6: 19–26.

33. Shampine LF, Gordon MK. 1975. Computer solution of ordinary differential equations: the initial value problem. San Francisco: WH Freeman.

34. Shampine LF, Reichelt MW. 1997. The MATLAB ODE suite. SIAM J Sci Comput 18: 1–22.


35. Shampine LF, Reichelt MW, Kierzenka JA. 1999. Solving index-1 DAEs in MATLAB and Simulink. SIAM Rev 41: 538–52.

36. White JR. 2010. Novel methods for metagenomic analysis. College Park: University of Maryland.

37. Tibshirani R. 1996. Regression shrinkage and selection via the lasso. J Roy Statist Soc B (Methodological) 58: 267–88.

38. Tibshirani R. 1997. The lasso method for variable selection in the Cox model. Stat Med 16: 385–95.

39. Yuan M, Lin Y. 2006. Model selection and estimation in regression with grouped variables. J Roy Stat Soc B (Statistical Methodology) 68: 49–67.

40. Hansen PC. 2007. Regularization tools version 4.0 for Matlab 7.3. Numer Algor 46: 189–94.

41. Zou H, Hastie T. 2005. Regularization and variable selection via the elastic net. J Roy Stat Soc B (Statistical Methodology) 67: 301–20.

42. Angulo MT, Moreno JA, Barabási A-L, Liu Y-Y. 2015. Fundamental limitations of network reconstruction. arXiv preprint arXiv:150803559.

43. Faust K, Lahti L, Gonze D, de Vos WM, et al. 2015. Metagenomics meets time series analysis: unraveling microbial community dynamics. Curr Opin Microbiol 25: 56–66.
