Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Post on 22-Jan-2018


Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Nathalie Villa-Vialaneix, joint work with Jérôme Mariette
http://www.nathalievilla.org

Séminaire de Probabilité et Statistique, Laboratoire J.A. Dieudonné, Université de Nice

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 1/41

Outline

1 Metagenomic datasets and associated questions

2 A typical (and rich) case study: TARA Oceans datasets

3 A UMKL framework for integrating multiple metagenomic data

4 Application to TARA Oceans datasets


What are metagenomic data?

Source: [Sommer et al., 2010]

abundance data: sparse n × p matrices with count data, with samples in rows and descriptors (species, OTUs, KEGG groups, k-mers, ...) in columns. Generally p ≫ n.

phylogenetic tree (evolution history between species, OTUs, ...): one tree with p leaves, built from the sequences collected in the n samples.

Source: Wikimedia Commons, Donovan.parks


What are metagenomic data used for?

produce a profile of the diversity of a given sample ⇒ allows comparing diversity between various conditions

used in various fields: environmental science, microbiota, ...

Processed by computing a relevant dissimilarity between samples (the standard Euclidean distance is not relevant) and by using this dissimilarity in subsequent analyses.


β-diversity data: dissimilarities between count data

Compositional dissimilarities (nig: count of species g in sample i)

Jaccard: the fraction of species specific to either sample i or j:

djac = ∑g (I{nig>0, njg=0} + I{njg>0, nig=0}) / ∑g I{nig+njg>0}

Bray-Curtis: the fraction of the sample which is specific to either sample i or j:

dBC = ∑g |nig − njg| / ∑g (nig + njg)

Other dissimilarities are available in the R package phyloseq; most of them are not Euclidean.
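Both dissimilarities are easy to sanity-check in a few lines of NumPy (a minimal sketch on hypothetical toy counts, not Tara data):

```python
import numpy as np

def jaccard_dissimilarity(n_i, n_j):
    """Fraction of observed species present in exactly one of the two samples."""
    present_i, present_j = n_i > 0, n_j > 0
    specific = np.sum(present_i & ~present_j) + np.sum(present_j & ~present_i)
    return specific / np.sum(present_i | present_j)

def bray_curtis_dissimilarity(n_i, n_j):
    """Fraction of the total counts that is specific to either sample."""
    return np.sum(np.abs(n_i - n_j)) / np.sum(n_i + n_j)

counts_i = np.array([10, 0, 3, 5])
counts_j = np.array([8, 2, 0, 5])
print(jaccard_dissimilarity(counts_i, counts_j))      # 0.5
print(bray_curtis_dissimilarity(counts_i, counts_j))  # 7/33 ≈ 0.212
```

Note that both are invariant to the order of the two samples, as required for building a symmetric dissimilarity matrix.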

β-diversity data: phylogenetic dissimilarities

Phylogenetic dissimilarities

For each branch e, denote by le its length and by pei the fraction of counts in sample i corresponding to species below branch e.

Unifrac: the fraction of the tree specific to either sample i or sample j:

dUF = ∑e le (I{pei>0, pej=0} + I{pej>0, pei=0}) / ∑e le I{pei+pej>0}

Weighted Unifrac: the fraction of the diversity specific to sample i or to sample j:

dwUF = ∑e le |pei − pej| / ∑e (pei + pej)

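A minimal sketch of the weighted Unifrac formula above, assuming the tree has already been flattened into a list of branches with lengths le and per-sample fractions pei (the toy 3-branch tree is hypothetical):

```python
import numpy as np

def weighted_unifrac(branch_lengths, p_i, p_j):
    """Weighted Unifrac following the slide's formula:
    sum_e l_e |p_ei - p_ej| / sum_e (p_ei + p_ej).
    branch_lengths[e] is l_e; p_i[e] is the fraction of sample i's
    counts below branch e (branch enumeration is assumed done upstream)."""
    l = np.asarray(branch_lengths, dtype=float)
    return np.sum(l * np.abs(p_i - p_j)) / np.sum(p_i + p_j)

# hypothetical 3-branch tree
l = [1.0, 2.0, 1.0]
p_i = np.array([0.5, 0.5, 0.0])
p_j = np.array([0.25, 0.25, 0.5])
print(weighted_unifrac(l, p_i, p_j))  # 1.25 / 2.0 = 0.625
```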

Outline

1 Metagenomic datasets and associated questions

2 A typical (and rich) case study: TARA Oceans datasets

3 A UMKL framework for integrating multiple metagenomic data

4 Application to TARA Oceans datasets

TARA Oceans datasets

The 2009-2013 expedition

Co-directed by Étienne Bourgois and Éric Karsenti.

7,012 datasets collected from 35,000 samples of plankton and water (11,535 Gb of data).

Study of the plankton: bacteria, protists, metazoans and viruses, representing more than 90% of the biomass in the ocean.

TARA Oceans datasets

Science (May 2015) - studies on:

eukaryotic plankton diversity [de Vargas et al., 2015],

ocean viral communities [Brum et al., 2015],

global plankton interactome [Lima-Mendez et al., 2015],

global ocean microbiome [Sunagawa et al., 2015],

...

→ datasets of different types and from different sources, analyzed separately.

Background of this talk

Objectives

Until now: many papers using many methods, but no integrated analysis performed.

What do the datasets reveal when integrated in a single analysis?

Our purpose: develop a generic method to integrate phylogenetic, taxonomic and functional community composition together with environmental factors.

TARA Oceans datasets that we used

[Sunagawa et al., 2015, de Vargas et al., 2015, Brum et al., 2015]

Datasets used

environmental dataset: 22 numeric features (temperature, salinity, ...).

bacteria phylogenomic tree: computed from ∼ 35,000 OTUs.

bacteria functional composition: ∼ 63,000 KEGG orthologous groups.

eukaryotic plankton composition, split into 4 groups: pico (0.8-5 µm), nano (5-20 µm), micro (20-180 µm) and meso (180-2000 µm).

virus composition: ∼ 867 virus clusters based on shared gene content.

TARA Oceans datasets that we used

Common samples

48 samples,

2 depth layers: surface (SRF) and deep chlorophyll maximum (DCM),

31 different sampling stations.

Outline

1 Metagenomic datasets and associated questions

2 A typical (and rich) case study: TARA Oceans datasets

3 A UMKL framework for integrating multiple metagenomic data

4 Application to TARA Oceans datasets

Kernel methods

Kernel viewed as the dot product in an implicit Hilbert space

K : X × X → R such that K(xi, xj) = K(xj, xi) and, ∀m ∈ N, ∀x1, ..., xm ∈ X, ∀α1, ..., αm ∈ R, ∑ᵐᵢ,ⱼ₌₁ αi αj K(xi, xj) ≥ 0.

⇒ [Aronszajn, 1950]: ∃! (H, ⟨., .⟩) and φ : X → H such that K(xi, xj) = ⟨φ(xi), φ(xj)⟩

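The two defining conditions above (symmetry and positive semi-definiteness of every Gram matrix) can be checked numerically for a classical kernel, here the Gaussian kernel on R^d (an illustration only, not part of the talk's method):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # 10 points of X = R^3

# Gaussian (RBF) kernel: K(x, x') = exp(-||x - x'||^2 / 2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

assert np.allclose(K, K.T)                    # K(xi, xj) = K(xj, xi)
assert np.linalg.eigvalsh(K).min() > -1e-10   # sum_ij ai aj K(xi, xj) >= 0
print("Gaussian Gram matrix is symmetric and PSD")
```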

Exploratory analysis with kernels

A well known example: kernel PCA [Schölkopf et al., 1998]

PCA analysis performed in the feature space induced by the kernel K.

In practice:

K is centered: K ← K − (1/N) I_N K − (1/N) K I_N + (1/N²) I_N K I_N, with I_N the N × N matrix of ones;

K-PCA is performed by the eigen-decomposition of the (centered) K.

If (αk)k=1,...,N ∈ R^N and (λk)k=1,...,N are the eigenvectors and eigenvalues, the PC axes are

ak = ∑ᴺᵢ₌₁ αki φ(xi)

and the ak = (aki)i=1,...,N are orthonormal in the feature space induced by the kernel:

∀ k, k′: ⟨ak, ak′⟩ = αk⊤ K αk′ = δkk′, with δkk′ = 0 if k ≠ k′ and 1 otherwise.

Coordinates of the projections of the observations (φ(xi))i:

⟨ak, φ(xi)⟩ = ∑ᴺⱼ₌₁ αkj Kji = Ki. αk = λk αki,

where Ki. is the i-th row of K.

No representation for the variables (no real variables...).

Other unsupervised kernel methods: kernel SOM [Olteanu and Villa-Vialaneix, 2015, Mariette et al., 2017]
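The centering and eigen-decomposition steps above fit in a short NumPy sketch (following the slide's convention that observation i has coordinate λk αki on axis k, with αk of unit Euclidean norm):

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """K-PCA as on the slide: double-center the Gram matrix, then
    eigen-decompose; observation i gets coordinate lambda_k * alpha_ki
    on axis k."""
    N = K.shape[0]
    J = np.ones((N, N)) / N
    Kc = K - J @ K - K @ J + J @ K @ J            # double centering
    eigval, eigvec = np.linalg.eigh(Kc)           # ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
    return eigvec[:, :n_components] * eigval[:n_components]

# with a linear kernel K = X X^T, K-PCA coincides with standard PCA
# up to a per-axis rescaling of the scores
X = np.random.default_rng(1).normal(size=(20, 5))
scores = kernel_pca(X @ X.T)
print(scores.shape)  # (20, 2)
```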

Usefulness of K-PCA

Non linear PCA

Source: By Petter Strandmark - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=3936753

[Mariette et al., 2017]: K-PCA for non numeric datasets - here categorical time series: job trajectories after graduation, from the French survey "Generation 98" [Cottrell and Letrémy, 2005]

color is the mode of the trajectories

From multiple dissimilarities to multiple kernels

1 several (non Euclidean) dissimilarities D1, ..., DM, transformed into similarities with [Lee and Verleysen, 2007]:

Km(xi, xj) = −(1/2) ( Dm(xi, xj) − (1/N) ∑ᴺₖ₌₁ Dm(xi, xk) − (1/N) ∑ᴺₖ₌₁ Dm(xj, xk) + (1/N²) ∑ᴺₖ,ₖ′₌₁ Dm(xk, xk′) )

2 if the result is non positive, clipping or flipping (removing the negative part of the eigenvalue decomposition, or taking its opposite) produces kernels [Chen et al., 2009].
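Steps 1 and 2 together, as a minimal NumPy sketch (the clip/flip repair via eigen-decomposition is one standard reading of [Chen et al., 2009]):

```python
import numpy as np

def dissimilarity_to_kernel(D, repair="clip"):
    """Similarity by double centering [Lee and Verleysen, 2007], then,
    if the result is not positive semi-definite, 'clip' (drop negative
    eigenvalues) or 'flip' (take their absolute value)."""
    row = D.mean(axis=1, keepdims=True)
    col = D.mean(axis=0, keepdims=True)
    K = -0.5 * (D - row - col + D.mean())
    eigval, eigvec = np.linalg.eigh(K)
    if repair == "clip":
        eigval = np.maximum(eigval, 0.0)
    elif repair == "flip":
        eigval = np.abs(eigval)
    return (eigvec * eigval) @ eigvec.T  # V diag(lambda) V^T

# a squared Euclidean distance matrix is already Euclidean,
# so clipping changes (almost) nothing here
X = np.random.default_rng(2).normal(size=(8, 3))
D = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = dissimilarity_to_kernel(D)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```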

From multiple kernels to an integrated kernel

How to combine multiple kernels?

naive approach: K* = (1/M) ∑m Km

supervised framework: K* = ∑m βm Km with βm ≥ 0 and ∑m βm = 1, with the βm chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011]

unsupervised framework, but the input space is R^d [Zhuang et al., 2011]: K* = ∑m βm Km with βm ≥ 0 and ∑m βm = 1, with the βm chosen so as to:

minimize the distortion between all training data, ∑ij K*(xi, xj) ‖xi − xj‖²;

AND minimize the approximation of the original data by the kernel embedding, ∑i ‖ xi − ∑j K*(xi, xj) xj ‖².

Our proposal: 2 UMKL frameworks which do not require the data to have values in R^d.


STATIS-like framework [L'Hermier des Plantes, 1976, Lavit et al., 1994]

Similarities between kernels:

Cmm′ = ⟨Km, Km′⟩F / (‖Km‖F ‖Km′‖F) = Trace(Km Km′) / √(Trace((Km)²) Trace((Km′)²)).

(Cmm′ is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework)

maximize ∑ᴹₘ₌₁ ⟨K*(v), Km / ‖Km‖F⟩F = v⊤ C v

for K*(v) = ∑ᴹₘ₌₁ vm Km and v ∈ R^M such that ‖v‖₂ = 1.

Solution: first eigenvector of C ⇒ set β = v / ∑ᴹₘ₌₁ vm (consensual kernel).

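A minimal NumPy sketch of this STATIS-like weighting (toy Gram matrices, not Tara data; the sign fix relies on C having nonnegative entries when the Km are PSD):

```python
import numpy as np

def statis_weights(kernels):
    """STATIS-like consensus: C[m, m'] is the cosine similarity between
    Gram matrices (Frobenius inner product); beta is the leading
    eigenvector of C, rescaled to sum to 1."""
    M = len(kernels)
    C = np.empty((M, M))
    for m in range(M):
        for mp in range(M):
            C[m, mp] = (np.trace(kernels[m] @ kernels[mp])
                        / np.sqrt(np.trace(kernels[m] @ kernels[m])
                                  * np.trace(kernels[mp] @ kernels[mp])))
    eigval, eigvec = np.linalg.eigh(C)
    v = eigvec[:, -1]              # leading eigenvector
    v = -v if v.sum() < 0 else v   # C >= 0, so v can be taken >= 0
    return v / v.sum()

# three hypothetical PSD Gram matrices on the same 6 samples
rng = np.random.default_rng(3)
Ks = [A @ A.T for A in (rng.normal(size=(6, 6)) for _ in range(3))]
beta = statis_weights(Ks)
K_star = sum(b * K for b, K in zip(beta, Ks))  # consensual kernel
print(beta)
```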

A kernel preserving the original topology of the data I

From an idea similar to that of [Lin et al., 2010]: find a kernel such that the local geometry of the data in the feature space is similar to that of the original data.

Proxy of the local geometry:

Km → Gmk (k-nearest-neighbor graph) → Amk (adjacency matrix)

⇒ W = ∑m I{Amk > 0} or W = ∑m Amk

Feature space geometry measured by

Δi(β) = ( ⟨φ*β(xi), φ*β(x1)⟩, ..., ⟨φ*β(xi), φ*β(xN)⟩ ) = ( K*β(xi, x1), ..., K*β(xi, xN) )

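The proxy of the local geometry above can be sketched as follows (two assumptions not spelled out on the slide: neighbors are found with the kernel-induced distance d²(i, j) = Kii + Kjj − 2Kij, and each k-NN graph is symmetrized by union):

```python
import numpy as np

def knn_adjacency(K, k):
    """Adjacency matrix of the k-nearest-neighbor graph in the feature
    space of K, using the kernel-induced squared distance."""
    N = K.shape[0]
    d2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K
    A = np.zeros((N, N), dtype=int)
    for i in range(N):
        nn = np.argsort(d2[i])[1:k + 1]  # skip i itself (distance 0)
        A[i, nn] = 1
    return np.maximum(A, A.T)  # symmetrize by union

def consensus_W(kernels, k=3):
    """W = sum over kernels of the k-NN adjacency matrices
    (the slide's second variant)."""
    return sum(knn_adjacency(K, k) for K in kernels)

# example with two hypothetical Gram matrices on the same 6 samples
rng = np.random.default_rng(4)
X = rng.normal(size=(6, 2))
K1 = X @ X.T
K2 = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
W = consensus_W([K1, K2], k=2)
```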

A kernel preserving the original topology of the data II

Sparse version:

minimize ∑ᴺᵢ,ⱼ₌₁ Wij ‖Δi(β) − Δj(β)‖²

for K*β = ∑ᴹₘ₌₁ βm Km and β ∈ R^M such that βm ≥ 0 and ∑ᴹₘ₌₁ βm = 1

⇔ minimize ∑ᴹₘ,ₘ′₌₁ βm βm′ Smm′ over β ∈ R^M such that βm ≥ 0 and ∑ᴹₘ₌₁ βm = 1,

for Smm′ = ∑ᴺᵢ,ⱼ₌₁ Wij ⟨Δᵐᵢ − Δᵐⱼ, Δᵐ′ᵢ − Δᵐ′ⱼ⟩ and Δᵐᵢ = (Km(xi, x1), ..., Km(xi, xN)).

Non sparse version:

minimize ∑ᴺᵢ,ⱼ₌₁ Wij ‖Δi(v) − Δj(v)‖²

for K*v = ∑ᴹₘ₌₁ vm Km and v ∈ R^M such that vm ≥ 0 and ‖v‖₂ = 1

⇔ minimize ∑ᴹₘ,ₘ′₌₁ vm vm′ Smm′ over v ∈ R^M such that vm ≥ 0 and ‖v‖₂ = 1,

for the same Smm′.


Optimization issues

Sparse version: min β⊤ S β such that β ≥ 0 and ‖β‖₁ = ∑m βm = 1 ⇒ standard QP problem with linear constraints (e.g., package quadprog in R).

Non sparse version: min β⊤ S β such that β ≥ 0 and ‖β‖₂ = 1 ⇒ QPQC problem (hard to solve).

Equivalent to the following problem: min over β, B of Trace(S₂ X) such that Trace(A X) = 1, Trace(Aj X) ≥ 0 and B = β β⊤, with S₂ = ( 0  0M⊤ ; 0M  S ) and:

X = ( 1  β⊤ ; β  B ),  A = ( 0  0M⊤ ; 0M  IM ),  Aj = ( 0  1j⊤ ; 1j  0MM )

Relaxed into the following problem: min over β, B of Trace(S₂ X) such that Trace(A X) = 1, Trace(Aj X) ≥ 0 and X = ( 1  β⊤ ; β  B ) is positive semi-definite.

Semi-definite programming ⇒ efficient solvers exist.

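quadprog is an R package; as a dependency-free Python sketch of the sparse problem (minimize β⊤Sβ over the probability simplex), projected gradient descent is enough for small M (the diagonal S below is a hypothetical toy example, not a real S matrix from the method):

```python
import numpy as np

def project_to_simplex(y):
    """Euclidean projection onto {b : b >= 0, sum(b) = 1}."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(y)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(y + theta, 0.0)

def sparse_umkl_weights(S, n_iter=2000):
    """Minimize b^T S b over the simplex by projected gradient descent."""
    M = S.shape[0]
    lr = 1.0 / (2 * np.linalg.norm(S, 2) + 1e-12)  # safe step size
    b = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        b = project_to_simplex(b - lr * (S + S.T) @ b)
    return b

# toy example: S diagonal -> optimal weights proportional to 1 / S_mm
S = np.diag([1.0, 2.0, 4.0])
beta = sparse_umkl_weights(S)
print(np.round(beta, 4))  # [0.5714 0.2857 0.1429], i.e. (4/7, 2/7, 1/7)
```

A dedicated QP solver (quadprog in R, or an SDP solver for the relaxed non sparse version) would of course be preferred in practice.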

A proposal to improve the interpretability of K-PCA in our framework

Issue: how to assess the importance of a given species in the K-PCA?

our datasets are either numeric (environmental) or built from an n × p count matrix

⇒ for a given species, randomly permute the counts and re-do the analysis (kernel computation - with the same optimized weights - and K-PCA)

the influence of a given species in a given dataset on a given PC subspace is assessed by computing the Crone-Crosby distance between the two PCA subspaces [Crone and Crosby, 1995] (∼ Frobenius norm between the projectors)

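The Crone-Crosby distance itself is short to implement (a sketch, using the common 1/√2 normalization of the Frobenius norm between orthogonal projectors):

```python
import numpy as np

def crone_crosby_distance(A1, A2):
    """Crone-Crosby distance between the column spaces of A1 and A2:
    (1/sqrt(2)) * Frobenius norm of the difference of the two
    orthogonal projectors onto those subspaces."""
    Q1, _ = np.linalg.qr(A1)
    Q2, _ = np.linalg.qr(A2)
    return np.linalg.norm(Q1 @ Q1.T - Q2 @ Q2.T, "fro") / np.sqrt(2)

# identical subspaces -> 0; orthogonal lines in R^2 -> 1
e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
print(crone_crosby_distance(e1, e1))  # 0.0
print(crone_crosby_distance(e1, e2))  # 1.0
```

In the permutation scheme above, A1 and A2 would hold the first PC axes of the original and of the permuted K-PCA, respectively.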

Outline

1 Metagenomic datasets and associated questions

2 A typical (and rich) case study: TARA Oceans datasets

3 A UMKL framework for integrating multiple metagenomic data

4 Application to TARA Oceans datasets

Integrating 'omics data using kernels

M TARA Oceans datasets (xᵐᵢ), i = 1, ..., N, m = 1, ..., M, measured on the same ocean samples (1, ..., N) and taking values in arbitrary spaces (Xm)m:

environmental dataset,

bacteria phylogenomic tree,

bacteria functional composition,

eukaryote pico-plankton composition,

...

virus composition.

Environmental dataset: standard Euclidean distance, given by K(xi, xj) = xi⊤ xj.

Bacteria phylogenomic tree: the weighted Unifrac distance, given by dwUF(xi, xj) = ∑e le |pei − pej| / ∑e (pei + pej).

All composition-based datasets (bacteria functional composition, eukaryote (pico, nano, micro, meso)-plankton composition and virus composition): the Bray-Curtis dissimilarity, dBC(xi, xj) = ∑g |nig − njg| / ∑g (nig + njg), with nig the gene g abundances summarized at the KEGG orthologous group level in sample i.

Combination of the M kernels by a weighted sum: K* = ∑ᴹₘ₌₁ βm Km, where βm ≥ 0 and ∑ᴹₘ₌₁ βm = 1.

Apply standard data mining methods (clustering, linear models, PCA, ...) in the feature space.

Correlation between kernels (STATIS)

Low correlations between the bacteria functional composition and the other datasets.

Strong correlation between the environmental variables and the small organisms (bacteria, eukaryote pico-plankton and viruses).

Influence of k (number of neighbors) on (βm)m

k ≥ 5 provides stable results

(βm)m values returned by graph-MKL

The dataset least correlated to the others, the bacteria functional composition, has the highest coefficient.

Three kernels have a weight equal to 0 (sparse version).

Proof of concept: using [Sunagawa et al., 2015]

Datasets

139 samples, 3 layers (SRF, DCM and MES)

kernels: phychem, pro-OTUs and pro-OGs

Proof of concept: using [Sunagawa et al., 2015]

Proteobacteria (clades SAR11 (Alphaproteobacteria) and SAR86) dominate the sampled areas of the ocean in terms of relative abundance and taxonomic richness.

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 36/41

K-PCA on K∗

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 37/41
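K-PCA on the combined kernel follows [Schölkopf et al., 1998]: double-center the kernel matrix, then project the samples on its leading eigenvectors scaled by the square roots of the eigenvalues. A minimal sketch on a toy linear kernel standing in for K∗:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))
K = X @ X.T                      # toy linear kernel in place of K*

# double-center the kernel: Kc = J K J with J = I - (1/n) 11^T
n = K.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

# eigendecomposition of the symmetric Kc, sorted by decreasing eigenvalue
eigval, eigvec = np.linalg.eigh(Kc)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# principal coordinates of the samples on the first two axes
coords = eigvec[:, :2] * np.sqrt(np.maximum(eigval[:2], 0.0))
```

The resulting `coords` are what the K-PCA scatterplots display; coloring them by a variable from one dataset (e.g. the environmental one) shows how that dataset structures the integrated representation.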

K-PCA on K∗ - environmental dataset

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 38/41

Conclusion and perspectives

Summary

an integrative exploratory method

... particularly well suited for multiple metagenomic datasets

with enhanced interpretability

Perspectives

implement the SDP solution and test it

improve biological interpretation

soon-to-be-released R package

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 40/41

Questions?

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 41/41

References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.

Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J., Gorsky, G., Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S., Dimier, C., Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker, P., Karsenti, E., and Sullivan, M. (2015). Patterns and ecological drivers of ocean viral communities. Science, 348(6237).

Chen, Y., Garcia, E., Gupta, M., Rahimi, A., and Cazzanti, L. (2009). Similarity-based classification: concepts and algorithms. Journal of Machine Learning Research, 10:747–776.

Cottrell, M. and Letrémy, P. (2005). How to use the Kohonen algorithm to simultaneously analyse individuals in a survey. Neurocomputing, 63:193–207.

Crone, L. and Crosby, D. (1995). Statistical applications of a metric on subspaces to satellite meteorology. Technometrics, 37(3):324–328.

de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I., Carmichael, M., Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O., Guidi, L., Horák, A., Jaillon, O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F., Zingone, A., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S., Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and Karsenti, E. (2015). Eukaryotic plankton diversity in the sunlit ocean. Science, 348(6237).

Gönen, M. and Alpaydin, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268.

Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18(1):97–119.

Lee, J. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction. Information Science and Statistics. Springer, New York; London.

L'Hermier des Plantes, H. (1976). Structuration des tableaux à trois indices de la statistique. PhD thesis, Université de Montpellier. Thèse de troisième cycle.

Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F., Bittner, L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d'Oviedo, F., de Meester, L., Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P., Sebastian, M., Souffreau, C., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G., Not, F., Ogata, H., Speich, S., Stemmann, L., Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M., Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015). Determinants of community structure in the global plankton interactome. Science, 348(6237).

Lin, Y., Liu, T., and Fuh, C. (2010). Multiple kernel learning for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147–1160.

Mariette, J., Olteanu, M., and Villa-Vialaneix, N. (2017). Efficient interpretable variants of online SOM for large dissimilarity data. Neurocomputing, 225:31–48.

Olteanu, M. and Villa-Vialaneix, N. (2015). On-line relational and multiple relational SOM. Neurocomputing, 147:15–30.

Robert, P. and Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Applied Statistics, 25(3):257–265.

Schölkopf, B., Smola, A., and Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319.

Sommer, M., Church, G., and Dantas, G. (2010). A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion. Molecular Systems Biology, 6(360).

Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A., Cornejo-Castillo, F., Costea, P., Cruaud, C., d'Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka, F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P., Karsenti, E., Raes, J., Acinas, S., and Bork, P. (2015). Structure and function of the global ocean microbiome. Science, 348(6237).

Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011). Unsupervised multiple kernel clustering. Journal of Machine Learning Research: Workshop and Conference Proceedings, 20:129–144.

Nathalie Villa-Vialaneix | Unsupervised multiple kernel learning 41/41