Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Post on 22-Jan-2018


Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Nathalie Villa-Vialaneix, joint work with Jérôme Mariette
http://www.nathalievilla.org

Séminaire de Probabilité et Statistique, Laboratoire J.A. Dieudonné, Université de Nice

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 1/41

Outline

1 Metagenomic datasets and associated questions

2 A typical (and rich) case study: TARA Oceans datasets

3 A UMKL framework for integrating multiple metagenomic data

4 Application to TARA Oceans datasets


What are metagenomic data?

Source: [Sommer et al., 2010]

abundance data: sparse n × p matrices with count data, with samples in rows and descriptors (species, OTUs, KEGG groups, k-mers, ...) in columns. Generally p ≫ n.

phylogenetic tree (evolution history between species, OTUs, ...): one tree with p leaves, built from the sequences collected in the n samples.

Source: Wikimedia Commons, Donovan.parks


What are metagenomic data used for?

produce a profile of the diversity of a given sample ⇒ allows comparing diversity between various conditions

used in various fields: environmental science, microbiota, ...

Processed by computing a relevant dissimilarity between samples (the standard Euclidean distance is not relevant) and by using this dissimilarity in subsequent analyses.


β-diversity data: dissimilarities between count data

Compositional dissimilarities (nig: count of species g in sample i)

Jaccard: the fraction of species specific to either sample i or j:

djac = ∑g (I{nig>0, njg=0} + I{njg>0, nig=0}) / ∑g I{nig+njg>0}

Bray-Curtis: the fraction of the sample which is specific to either sample i or j:

dBC = ∑g |nig − njg| / ∑g (nig + njg)

Other dissimilarities are available in the R package phyloseq; most of them are not Euclidean.
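Both dissimilarities are easy to sanity-check in a few lines of NumPy (a minimal sketch on hypothetical toy counts, not Tara data):

```python
import numpy as np

def jaccard_dissimilarity(n_i, n_j):
    """Fraction of observed species present in exactly one of the two samples."""
    present_i, present_j = n_i > 0, n_j > 0
    specific = np.sum(present_i & ~present_j) + np.sum(present_j & ~present_i)
    return specific / np.sum(present_i | present_j)

def bray_curtis_dissimilarity(n_i, n_j):
    """Fraction of the total counts that is specific to either sample."""
    return np.sum(np.abs(n_i - n_j)) / np.sum(n_i + n_j)

counts_i = np.array([10, 0, 3, 5])
counts_j = np.array([8, 2, 0, 5])
print(jaccard_dissimilarity(counts_i, counts_j))      # 0.5
print(bray_curtis_dissimilarity(counts_i, counts_j))  # 7/33 ≈ 0.212
```

Note that both are invariant to the order of the two samples, as required for building a symmetric dissimilarity matrix.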

β-diversity data: phylogenetic dissimilarities

Phylogenetic dissimilarities

For each branch e, denote by le its length and by pei the fraction of counts in sample i corresponding to species below branch e.

Unifrac: the fraction of the tree specific to either sample i or sample j:

dUF = ∑e le (I{pei>0, pej=0} + I{pej>0, pei=0}) / ∑e le I{pei+pej>0}

Weighted Unifrac: the fraction of the diversity specific to sample i or to sample j:

dwUF = ∑e le |pei − pej| / ∑e (pei + pej)

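A minimal sketch of the weighted Unifrac formula above, assuming the tree has already been flattened into a list of branches with lengths le and per-sample fractions pei (the toy 3-branch tree is hypothetical):

```python
import numpy as np

def weighted_unifrac(branch_lengths, p_i, p_j):
    """Weighted Unifrac following the slide's formula:
    sum_e l_e |p_ei - p_ej| / sum_e (p_ei + p_ej).
    branch_lengths[e] is l_e; p_i[e] is the fraction of sample i's
    counts below branch e (branch enumeration is assumed done upstream)."""
    l = np.asarray(branch_lengths, dtype=float)
    return np.sum(l * np.abs(p_i - p_j)) / np.sum(p_i + p_j)

# hypothetical 3-branch tree
l = [1.0, 2.0, 1.0]
p_i = np.array([0.5, 0.5, 0.0])
p_j = np.array([0.25, 0.25, 0.5])
print(weighted_unifrac(l, p_i, p_j))  # 1.25 / 2.0 = 0.625
```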

Outline

1 Metagenomic datasets and associated questions

2 A typical (and rich) case study: TARA Oceans datasets

3 A UMKL framework for integrating multiple metagenomic data

4 Application to TARA Oceans datasets

TARA Oceans datasets

The 2009-2013 expedition

Co-directed by Étienne Bourgois and Éric Karsenti.

7,012 datasets collected from 35,000 samples of plankton and water (11,535 Gb of data).

Study of the plankton: bacteria, protists, metazoans and viruses, representing more than 90% of the biomass in the ocean.

TARA Oceans datasets

Science (May 2015) - studies on:

eukaryotic plankton diversity [de Vargas et al., 2015],

ocean viral communities [Brum et al., 2015],

global plankton interactome [Lima-Mendez et al., 2015],

global ocean microbiome [Sunagawa et al., 2015],

...

→ datasets of different types and from different sources, analyzed separately.

Background of this talk

Objectives

Until now: many papers using many methods, but no integrated analysis performed.

What do the datasets reveal when integrated in a single analysis?

Our purpose: develop a generic method to integrate phylogenetic, taxonomic and functional community composition together with environmental factors.

TARA Oceans datasets that we used

[Sunagawa et al., 2015, de Vargas et al., 2015, Brum et al., 2015]

Datasets used

environmental dataset: 22 numeric features (temperature, salinity, ...).

bacteria phylogenomic tree: computed from ∼ 35,000 OTUs.

bacteria functional composition: ∼ 63,000 KEGG orthologous groups.

eukaryotic plankton composition, split into 4 groups: pico (0.8-5 µm), nano (5-20 µm), micro (20-180 µm) and meso (180-2000 µm).

virus composition: ∼ 867 virus clusters based on shared gene content.

TARA Oceans datasets that we used

Common samples

48 samples,

2 depth layers: surface (SRF) and deep chlorophyll maximum (DCM),

31 different sampling stations.

Outline

1 Metagenomic datasets and associated questions

2 A typical (and rich) case study: TARA Oceans datasets

3 A UMKL framework for integrating multiple metagenomic data

4 Application to TARA Oceans datasets

Kernel methods

Kernel viewed as the dot product in an implicit Hilbert space

K : X × X → R such that K(xi, xj) = K(xj, xi) and, ∀m ∈ N, ∀x1, ..., xm ∈ X, ∀α1, ..., αm ∈ R, ∑ᵐᵢ,ⱼ₌₁ αi αj K(xi, xj) ≥ 0.

⇒ [Aronszajn, 1950]: ∃! (H, ⟨., .⟩) and φ : X → H such that K(xi, xj) = ⟨φ(xi), φ(xj)⟩

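The two defining conditions above (symmetry and positive semi-definiteness of every Gram matrix) can be checked numerically for a classical kernel, here the Gaussian kernel on R^d (an illustration only, not part of the talk's method):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # 10 points of X = R^3

# Gaussian (RBF) kernel: K(x, x') = exp(-||x - x'||^2 / 2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

assert np.allclose(K, K.T)                    # K(xi, xj) = K(xj, xi)
assert np.linalg.eigvalsh(K).min() > -1e-10   # sum_ij ai aj K(xi, xj) >= 0
print("Gaussian Gram matrix is symmetric and PSD")
```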

Exploratory analysis with kernels

A well known example: kernel PCA [Schölkopf et al., 1998]

PCA analysis performed in the feature space induced by the kernel K.

In practice:

K is centered: K ← K − (1/N) I_N K − (1/N) K I_N + (1/N²) I_N K I_N, with I_N the N × N matrix of ones;

K-PCA is performed by the eigen-decomposition of the (centered) K.

If (αk)k=1,...,N ∈ R^N and (λk)k=1,...,N are the eigenvectors and eigenvalues, the PC axes are

ak = ∑ᴺᵢ₌₁ αki φ(xi)

and the ak = (aki)i=1,...,N are orthonormal in the feature space induced by the kernel:

∀ k, k′: ⟨ak, ak′⟩ = αk⊤ K αk′ = δkk′, with δkk′ = 0 if k ≠ k′ and 1 otherwise.

Coordinates of the projections of the observations (φ(xi))i:

⟨ak, φ(xi)⟩ = ∑ᴺⱼ₌₁ αkj Kji = Ki. αk = λk αki,

where Ki. is the i-th row of K.

No representation for the variables (no real variables...).

Other unsupervised kernel methods: kernel SOM [Olteanu and Villa-Vialaneix, 2015, Mariette et al., 2017]
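The centering and eigen-decomposition steps above fit in a short NumPy sketch (following the slide's convention that observation i has coordinate λk αki on axis k, with αk of unit Euclidean norm):

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """K-PCA as on the slide: double-center the Gram matrix, then
    eigen-decompose; observation i gets coordinate lambda_k * alpha_ki
    on axis k."""
    N = K.shape[0]
    J = np.ones((N, N)) / N
    Kc = K - J @ K - K @ J + J @ K @ J            # double centering
    eigval, eigvec = np.linalg.eigh(Kc)           # ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
    return eigvec[:, :n_components] * eigval[:n_components]

# with a linear kernel K = X X^T, K-PCA coincides with standard PCA
# up to a per-axis rescaling of the scores
X = np.random.default_rng(1).normal(size=(20, 5))
scores = kernel_pca(X @ X.T)
print(scores.shape)  # (20, 2)
```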

Usefulness of K-PCA

Non linear PCA

Source: By Petter Strandmark - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=3936753

[Mariette et al., 2017]: K-PCA for non numeric datasets - here categorical time series: job trajectories after graduation, from the French survey "Generation 98" [Cottrell and Letrémy, 2005]

color is the mode of the trajectories

From multiple dissimilarities to multiple kernels

1 several (non Euclidean) dissimilarities D1, ..., DM, transformed into similarities with [Lee and Verleysen, 2007]:

Km(xi, xj) = −(1/2) ( Dm(xi, xj) − (1/N) ∑ᴺₖ₌₁ Dm(xi, xk) − (1/N) ∑ᴺₖ₌₁ Dm(xj, xk) + (1/N²) ∑ᴺₖ,ₖ′₌₁ Dm(xk, xk′) )

2 if the result is non positive, clipping or flipping (removing the negative part of the eigenvalue decomposition, or taking its opposite) produces kernels [Chen et al., 2009].
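Steps 1 and 2 together, as a minimal NumPy sketch (the clip/flip repair via eigen-decomposition is one standard reading of [Chen et al., 2009]):

```python
import numpy as np

def dissimilarity_to_kernel(D, repair="clip"):
    """Similarity by double centering [Lee and Verleysen, 2007], then,
    if the result is not positive semi-definite, 'clip' (drop negative
    eigenvalues) or 'flip' (take their absolute value)."""
    row = D.mean(axis=1, keepdims=True)
    col = D.mean(axis=0, keepdims=True)
    K = -0.5 * (D - row - col + D.mean())
    eigval, eigvec = np.linalg.eigh(K)
    if repair == "clip":
        eigval = np.maximum(eigval, 0.0)
    elif repair == "flip":
        eigval = np.abs(eigval)
    return (eigvec * eigval) @ eigvec.T  # V diag(lambda) V^T

# a squared Euclidean distance matrix is already Euclidean,
# so clipping changes (almost) nothing here
X = np.random.default_rng(2).normal(size=(8, 3))
D = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = dissimilarity_to_kernel(D)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```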

From multiple kernels to an integrated kernel

How to combine multiple kernels?

naive approach: K* = (1/M) ∑m Km

supervised framework: K* = ∑m βm Km with βm ≥ 0 and ∑m βm = 1, with the βm chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011]

unsupervised framework, but the input space is R^d [Zhuang et al., 2011]: K* = ∑m βm Km with βm ≥ 0 and ∑m βm = 1, with the βm chosen so as to:

minimize the distortion between all training data, ∑ij K*(xi, xj) ‖xi − xj‖²;

AND minimize the approximation of the original data by the kernel embedding, ∑i ‖ xi − ∑j K*(xi, xj) xj ‖².

Our proposal: 2 UMKL frameworks which do not require the data to have values in R^d.


STATIS-like framework [L'Hermier des Plantes, 1976, Lavit et al., 1994]

Similarities between kernels:

Cmm′ = ⟨Km, Km′⟩F / (‖Km‖F ‖Km′‖F) = Trace(Km Km′) / √(Trace((Km)²) Trace((Km′)²)).

(Cmm′ is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework)

maximize ∑ᴹₘ₌₁ ⟨K*(v), Km / ‖Km‖F⟩F = v⊤ C v

for K*(v) = ∑ᴹₘ₌₁ vm Km and v ∈ R^M such that ‖v‖₂ = 1.

Solution: first eigenvector of C ⇒ set β = v / ∑ᴹₘ₌₁ vm (consensual kernel).

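A minimal NumPy sketch of this STATIS-like weighting (toy Gram matrices, not Tara data; the sign fix relies on C having nonnegative entries when the Km are PSD):

```python
import numpy as np

def statis_weights(kernels):
    """STATIS-like consensus: C[m, m'] is the cosine similarity between
    Gram matrices (Frobenius inner product); beta is the leading
    eigenvector of C, rescaled to sum to 1."""
    M = len(kernels)
    C = np.empty((M, M))
    for m in range(M):
        for mp in range(M):
            C[m, mp] = (np.trace(kernels[m] @ kernels[mp])
                        / np.sqrt(np.trace(kernels[m] @ kernels[m])
                                  * np.trace(kernels[mp] @ kernels[mp])))
    eigval, eigvec = np.linalg.eigh(C)
    v = eigvec[:, -1]              # leading eigenvector
    v = -v if v.sum() < 0 else v   # C >= 0, so v can be taken >= 0
    return v / v.sum()

# three hypothetical PSD Gram matrices on the same 6 samples
rng = np.random.default_rng(3)
Ks = [A @ A.T for A in (rng.normal(size=(6, 6)) for _ in range(3))]
beta = statis_weights(Ks)
K_star = sum(b * K for b, K in zip(beta, Ks))  # consensual kernel
print(beta)
```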

A kernel preserving the original topology of the data I

From an idea similar to that of [Lin et al., 2010]: find a kernel such that the local geometry of the data in the feature space is similar to that of the original data.

Proxy of the local geometry:

Km → Gmk (k-nearest-neighbor graph) → Amk (adjacency matrix)

⇒ W = ∑m I{Amk > 0} or W = ∑m Amk

Feature space geometry measured by

Δi(β) = ( ⟨φ*β(xi), φ*β(x1)⟩, ..., ⟨φ*β(xi), φ*β(xN)⟩ ) = ( K*β(xi, x1), ..., K*β(xi, xN) )

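The proxy of the local geometry above can be sketched as follows (two assumptions not spelled out on the slide: neighbors are found with the kernel-induced distance d²(i, j) = Kii + Kjj − 2Kij, and each k-NN graph is symmetrized by union):

```python
import numpy as np

def knn_adjacency(K, k):
    """Adjacency matrix of the k-nearest-neighbor graph in the feature
    space of K, using the kernel-induced squared distance."""
    N = K.shape[0]
    d2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K
    A = np.zeros((N, N), dtype=int)
    for i in range(N):
        nn = np.argsort(d2[i])[1:k + 1]  # skip i itself (distance 0)
        A[i, nn] = 1
    return np.maximum(A, A.T)  # symmetrize by union

def consensus_W(kernels, k=3):
    """W = sum over kernels of the k-NN adjacency matrices
    (the slide's second variant)."""
    return sum(knn_adjacency(K, k) for K in kernels)

# example with two hypothetical Gram matrices on the same 6 samples
rng = np.random.default_rng(4)
X = rng.normal(size=(6, 2))
K1 = X @ X.T
K2 = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
W = consensus_W([K1, K2], k=2)
```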

A kernel preserving the original topology of the data II

Sparse version:

minimize ∑ᴺᵢ,ⱼ₌₁ Wij ‖Δi(β) − Δj(β)‖²

for K*β = ∑ᴹₘ₌₁ βm Km and β ∈ R^M such that βm ≥ 0 and ∑ᴹₘ₌₁ βm = 1

⇔ minimize ∑ᴹₘ,ₘ′₌₁ βm βm′ Smm′ over β ∈ R^M such that βm ≥ 0 and ∑ᴹₘ₌₁ βm = 1,

for Smm′ = ∑ᴺᵢ,ⱼ₌₁ Wij ⟨Δᵐᵢ − Δᵐⱼ, Δᵐ′ᵢ − Δᵐ′ⱼ⟩ and Δᵐᵢ = (Km(xi, x1), ..., Km(xi, xN)).

Non sparse version:

minimize ∑ᴺᵢ,ⱼ₌₁ Wij ‖Δi(v) − Δj(v)‖²

for K*v = ∑ᴹₘ₌₁ vm Km and v ∈ R^M such that vm ≥ 0 and ‖v‖₂ = 1

⇔ minimize ∑ᴹₘ,ₘ′₌₁ vm vm′ Smm′ over v ∈ R^M such that vm ≥ 0 and ‖v‖₂ = 1,

for the same Smm′.


Optimization issues

Sparse version: min β⊤ S β such that β ≥ 0 and ‖β‖₁ = ∑m βm = 1 ⇒ standard QP problem with linear constraints (e.g., package quadprog in R).

Non sparse version: min β⊤ S β such that β ≥ 0 and ‖β‖₂ = 1 ⇒ QPQC problem (hard to solve).

Equivalent to the following problem: min over β, B of Trace(S₂ X) such that Trace(A X) = 1, Trace(Aj X) ≥ 0 and B = β β⊤, with S₂ = ( 0  0M⊤ ; 0M  S ) and:

X = ( 1  β⊤ ; β  B ),  A = ( 0  0M⊤ ; 0M  IM ),  Aj = ( 0  1j⊤ ; 1j  0MM )

Relaxed into the following problem: min over β, B of Trace(S₂ X) such that Trace(A X) = 1, Trace(Aj X) ≥ 0 and X = ( 1  β⊤ ; β  B ) is positive semi-definite.

Semi-definite programming ⇒ efficient solvers exist.

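quadprog is an R package; as a dependency-free Python sketch of the sparse problem (minimize β⊤Sβ over the probability simplex), projected gradient descent is enough for small M (the diagonal S below is a hypothetical toy example, not a real S matrix from the method):

```python
import numpy as np

def project_to_simplex(y):
    """Euclidean projection onto {b : b >= 0, sum(b) = 1}."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(y)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(y + theta, 0.0)

def sparse_umkl_weights(S, n_iter=2000):
    """Minimize b^T S b over the simplex by projected gradient descent."""
    M = S.shape[0]
    lr = 1.0 / (2 * np.linalg.norm(S, 2) + 1e-12)  # safe step size
    b = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        b = project_to_simplex(b - lr * (S + S.T) @ b)
    return b

# toy example: S diagonal -> optimal weights proportional to 1 / S_mm
S = np.diag([1.0, 2.0, 4.0])
beta = sparse_umkl_weights(S)
print(np.round(beta, 4))  # [0.5714 0.2857 0.1429], i.e. (4/7, 2/7, 1/7)
```

A dedicated QP solver (quadprog in R, or an SDP solver for the relaxed non sparse version) would of course be preferred in practice.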

A proposal to improve the interpretability of K-PCA in our framework

Issue: how to assess the importance of a given species in the K-PCA?

our datasets are either numeric (environmental) or built from an n × p count matrix

⇒ for a given species, randomly permute the counts and re-do the analysis (kernel computation - with the same optimized weights - and K-PCA)

the influence of a given species in a given dataset on a given PC subspace is assessed by computing the Crone-Crosby distance between the two PCA subspaces [Crone and Crosby, 1995] (∼ Frobenius norm between the projectors)

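The Crone-Crosby distance itself is short to implement (a sketch, using the common 1/√2 normalization of the Frobenius norm between orthogonal projectors):

```python
import numpy as np

def crone_crosby_distance(A1, A2):
    """Crone-Crosby distance between the column spaces of A1 and A2:
    (1/sqrt(2)) * Frobenius norm of the difference of the two
    orthogonal projectors onto those subspaces."""
    Q1, _ = np.linalg.qr(A1)
    Q2, _ = np.linalg.qr(A2)
    return np.linalg.norm(Q1 @ Q1.T - Q2 @ Q2.T, "fro") / np.sqrt(2)

# identical subspaces -> 0; orthogonal lines in R^2 -> 1
e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
print(crone_crosby_distance(e1, e1))  # 0.0
print(crone_crosby_distance(e1, e2))  # 1.0
```

In the permutation scheme above, A1 and A2 would hold the first PC axes of the original and of the permuted K-PCA, respectively.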

Outline

1 Metagenomic datasets and associated questions

2 A typical (and rich) case study: TARA Oceans datasets

3 A UMKL framework for integrating multiple metagenomic data

4 Application to TARA Oceans datasets

Integrating 'omics data using kernels

M TARA Oceans datasets (xᵐᵢ), i = 1, ..., N, m = 1, ..., M, measured on the same ocean samples (1, ..., N) and taking values in arbitrary spaces (Xm)m:

environmental dataset,

bacteria phylogenomic tree,

bacteria functional composition,

eukaryote pico-plankton composition,

...

virus composition.

Environmental dataset: standard Euclidean distance, given by K(xi, xj) = xi⊤ xj.

Bacteria phylogenomic tree: the weighted Unifrac distance, given by dwUF(xi, xj) = ∑e le |pei − pej| / ∑e (pei + pej).

All composition-based datasets (bacteria functional composition, eukaryote (pico, nano, micro, meso)-plankton composition and virus composition): the Bray-Curtis dissimilarity, dBC(xi, xj) = ∑g |nig − njg| / ∑g (nig + njg), with nig the gene g abundances summarized at the KEGG orthologous group level in sample i.

Combination of the M kernels by a weighted sum: K* = ∑ᴹₘ₌₁ βm Km, where βm ≥ 0 and ∑ᴹₘ₌₁ βm = 1.

Apply standard data mining methods (clustering, linear models, PCA, ...) in the feature space.

Correlation between kernels (STATIS)

Low correlations between the bacteria functional composition and the other datasets.

Strong correlation between the environmental variables and the small organisms (bacteria, eukaryote pico-plankton and viruses).

Influence of k (number of neighbors) on (βm)m

k ≥ 5 provides stable results

(βm)m values returned by graph-MKL

The dataset least correlated to the others, the bacteria functional composition, has the highest coefficient.

Three kernels have a weight equal to 0 (sparse version).

Proof of concept: using [Sunagawa et al., 2015]

Datasets

139 samples, 3 layers (SRF, DCM and MES)

kernels: phychem, pro-OTUs and pro-OGs

Proof of concept: using [Sunagawa et al., 2015]

Proteobacteria (clades SAR11 (Alphaproteobacteria) and SAR86) dominate the sampled areas of the ocean in terms of relative abundance and taxonomic richness.

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 36/41

K-PCA on K∗

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 37/41
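K-PCA on the combined kernel follows [Schölkopf et al., 1998]: double-center the kernel matrix, then project the samples on its leading eigenvectors scaled by the square roots of the eigenvalues. A minimal sketch on a toy linear kernel standing in for K∗:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))
K = X @ X.T                      # toy linear kernel in place of K*

# double-center the kernel: Kc = J K J with J = I - (1/n) 11^T
n = K.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

# eigendecomposition of the symmetric Kc, sorted by decreasing eigenvalue
eigval, eigvec = np.linalg.eigh(Kc)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# principal coordinates of the samples on the first two axes
coords = eigvec[:, :2] * np.sqrt(np.maximum(eigval[:2], 0.0))
```

The resulting `coords` are what the K-PCA scatterplots display; coloring them by a variable from one dataset (e.g. the environmental one) shows how that dataset structures the integrated representation.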

K-PCA on K∗ - environmental dataset

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 38/41

Conclusion and perspectives

Summary

an integrative exploratory method

... particularly well suited for multiple metagenomic datasets

with enhanced interpretability

Perspectives

implement the SDP solution and test it

improve biological interpretation

soon-to-be-released R package

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 40/41

Questions?

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 41/41

References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.

Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J., Gorsky, G., Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S., Dimier, C., Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker, P., Karsenti, E., and Sullivan, M. (2015). Patterns and ecological drivers of ocean viral communities. Science, 348(6237).

Chen, Y., Garcia, E., Gupta, M., Rahimi, A., and Cazzanti, L. (2009). Similarity-based classification: concepts and algorithms. Journal of Machine Learning Research, 10:747–776.

Cottrell, M. and Letrémy, P. (2005). How to use the Kohonen algorithm to simultaneously analyse individuals in a survey. Neurocomputing, 63:193–207.

Crone, L. and Crosby, D. (1995). Statistical applications of a metric on subspaces to satellite meteorology. Technometrics, 37(3):324–328.

de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I., Carmichael, M., Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O., Guidi, L., Horák, A., Jaillon, O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F., Zingone, A., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S., Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and Karsenti, E. (2015). Eukaryotic plankton diversity in the sunlit ocean. Science, 348(6237).

Gönen, M. and Alpaydin, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268.

Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18(1):97–119.

Lee, J. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction. Information Science and Statistics. Springer, New York; London.

L'Hermier des Plantes, H. (1976). Structuration des tableaux à trois indices de la statistique. PhD thesis, Université de Montpellier. Thèse de troisième cycle.

Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F., Bittner, L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d'Oviedo, F., de Meester, L., Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P., Sebastian, M., Souffreau, C., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G., Not, F., Ogata, H., Speich, S., Stemmann, L., Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M., Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015). Determinants of community structure in the global plankton interactome. Science, 348(6237).

Lin, Y., Liu, T., and Fuh, C. (2010). Multiple kernel learning for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147–1160.

Mariette, J., Olteanu, M., and Villa-Vialaneix, N. (2017). Efficient interpretable variants of online SOM for large dissimilarity data. Neurocomputing, 225:31–48.

Olteanu, M. and Villa-Vialaneix, N. (2015). On-line relational and multiple relational SOM. Neurocomputing, 147:15–30.

Robert, P. and Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Applied Statistics, 25(3):257–265.

Schölkopf, B., Smola, A., and Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319.

Sommer, M., Church, G., and Dantas, G. (2010). A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion. Molecular Systems Biology, 6(360).

Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A., Cornejo-Castillo, F., Costea, P., Cruaud, C., d'Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka, F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P., Karsenti, E., Raes, J., Acinas, S., and Bork, P. (2015). Structure and function of the global ocean microbiome. Science, 348(6237).

Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011). Unsupervised multiple kernel clustering. Journal of Machine Learning Research: Workshop and Conference Proceedings, 20:129–144.

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 41/41