Integrative Multivariate ‘omics Analysis with mixOmics • in draft •
mixOmics: an R package for ‘omics feature selection and multiple data integration
Florian Rohart1, Benoît Gautier1, Amrit Singh2,3, and Kim-Anh Lê Cao∗1
1The University of Queensland Diamantina Institute, Translational Research Institute, Brisbane, QLD 4102, Australia,
2UBC James Hogg Research Centre for Heart Lung Innovation, St. Paul’s Hospital, University of British Columbia, Vancouver, BC, Canada.
3Prevention of Organ Failure (PROOF) Centre of Excellence, Vancouver, BC, Canada.
Abstract
The advent of high throughput technologies has led to a wealth of publicly available biological data coming from different sources, the so-called ‘omics data (transcriptomics for the study of transcripts, proteomics for proteins, metabolomics for metabolites, etc). Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have focused on identifying small subsets of molecules (a ‘molecular signature’) that explain or predict biological conditions, but mainly for the analysis of a single data set. In addition, commonly used methods are univariate and consider each biological feature independently. In contrast, linear multivariate methods adopt a systems biology approach by statistically integrating several data sets at once and offer an unprecedented opportunity to probe relationships between heterogeneous data sets measured at multiple functional levels.
mixOmics is an R package which provides a wide range of linear multivariate methods for data exploration, integration, dimension reduction and visualisation of biological data sets. The methods we have developed extend Projection to Latent Structure (PLS) models for discriminant analysis and data integration and include $\ell_1$ penalisations to identify molecular signatures. Here we introduce the mixOmics methods specifically developed to integrate large data sets, either at the N-level, where the same individuals are profiled using different ‘omics platforms (same N), or at the P-level, where independent studies including different individuals are generated under similar biological conditions using the same ‘omics platform (same P). In both cases, the main challenge is data heterogeneity, due to inherent platform-specific artefacts (N-integration), or systematic differences arising from experiments assayed at different geographical sites or different times (P-integration). We present and illustrate those novel multivariate methods on existing ‘omics data available from the package.
I. Introduction
The advent of novel ‘omics technologies (e.g. transcriptomics, proteomics, metabolomics, etc) has enabled new opportunities for biological and medical research discoveries. Commonly, each feature from each technology (transcripts, proteins, metabolites, etc) is analysed independently through univariate statistical methods such as ANOVA, linear models or t-tests. Such analysis ignores relationships between the different features and may miss crucial biological information. Indeed, biological features act in concert to modulate and influence biological systems and signalling pathways. Multivariate approaches, which model features as a set, can therefore provide a more insightful picture of a biological system, and complement the results obtained from univariate methods. In mixOmics we considered multivariate projection-based methodologies for ‘omics data analysis (Meng et al., 2016) because of several appealing properties. Firstly, they are computationally efficient to handle large datasets, where the number of biological features (usually in the thousands) is much larger than the number of samples (usually less than 50). Secondly, they perform dimension reduction by projecting the data into a smaller subspace while capturing and highlighting the largest sources of variation from the data, resulting in powerful visualisation of the system under study. Lastly, they are highly flexible to answer various biological questions (Boulesteix and Strimmer, 2007): mixOmics multivariate methods have been successfully applied in several recent studies to identify biomarkers in ‘omics studies ranging from metabolomics and brain imaging to microbiome, and to statistically integrate data sets generated from different biological sources (Labus et al., 2015; Cook et al., 2016; Guidi et al., 2016; Mahana et al., 2016; Ramanan et al., 2016; Rollero et al., 2016).
∗Corresponding author [email protected]
bioRxiv preprint doi: https://doi.org/10.1101/108597; this version posted February 14, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
In this paper, we introduce the mixOmics multivariate methods developed for supervised analysis, where the aims are to classify or discriminate sample groups, to identify the most discriminant subset of biological features, and to predict the class of new samples. In particular, we implemented two novel frameworks for the integration of multiple data sets. DIABLO enables the integration of the same N biological samples measured on different ‘omics platforms (N-integration, Singh et al. 2016), while MINT enables the integration of several independent data sets or studies measured on the same P predictors (P-integration, Rohart et al. 2016a). One of the main challenges in N- and P-integration is to overcome the technical variance among ‘omics platforms, either between different types of ‘omics or within the same type of ‘omics generated by several laboratories, in order to extract common information. To date, very few statistical methods can perform N- and P-integration in a supervised context. For instance, N-integration is often performed by concatenating all the different ‘omics datasets (Liu et al., 2013), thus ignoring the heterogeneity between ‘omics platforms, or by combining the molecular signatures identified from separate analyses of each ‘omics platform (Günther et al., 2012), thus disregarding the relationships between the different ‘omics functional levels. With P-integration, statistical methods are often sequentially combined to accommodate technical differences among studies or platforms before classifying samples. Such sequential approaches are not appropriate for the prediction of new samples as they are prone to overfitting (Rohart et al., 2016a). Our two frameworks have the potential to lead to new discoveries, either by modelling relationships between different types of ‘omics data (N-integration) or by enabling the integrative analysis of independent ‘omics studies, thereby increasing sample size and statistical power (P-integration).
The present article introduces the main functionalities in mixOmics, presents our multivariate frameworks for the identification of molecular signatures in one and several data sets, and illustrates each framework in a case study available in the package.
II. The mixOmics R package
mixOmics is a user-friendly R package dedicated to data exploration, mining, integration and visualisation. It provides a wide range of innovative multivariate methods for the analysis and integration of large data sets in several settings (sparse PLS-DA, DIABLO and MINT, Fig. 1) with appealing outputs such as (i) insightful visualisations of the data whose dimension has been reduced with the use of latent components, (ii) identification of molecular signatures and (iii) improved usage with common calls to all visualisation and performance assessment methods (see a list of those S3 functions in Suppl. S1).
Multivariate projection-based methods. The multivariate dimension reduction techniques implemented in mixOmics perform unsupervised analyses such as Principal Component Analysis (using NonLinear Iterative Partial Least Squares, Wold 1975), Independent Component Analysis (Yao et al., 2012), Partial Least Squares regression (PLS, also known as Projection to Latent Structures, Wold 1966), regularised Canonical Correlation Analysis (rCCA, González et al. 2008), regularised Generalised Canonical Correlation Analysis (rGCCA, based on a PLS algorithm, Tenenhaus and Tenenhaus 2011) and multi-group PLS (Eslami et al., 2013a), as well as supervised analyses such as PLS-Discriminant Analysis (PLS-DA, Nguyen and Rocke 2002b,a; Boulesteix 2004), and recently GCC-DA (Singh et al., 2016) and multi-group PLS-DA (Rohart et al., 2016a).
While each multivariate method aims at answering a specific biological question, the uniqueness of the mixOmics package is to provide novel sparse variants that enable the identification of key predictors (e.g. genes, proteins, metabolites) in large biological data sets. Feature selection is performed via $\ell_1$ regularisation (LASSO, Tibshirani 1996), which is implemented directly into the optimisation of the statistical criterion specific to each method. Such criteria include the maximisation of the most important source of variation in the data, of the covariance or correlation between different ‘omics sets, or of the segregation of a categorical outcome of interest. Solving the optimisation criterion yields latent components and loading vectors. Latent components are linear combinations of the original predictors, where each predictor is assigned a coefficient given in the loading vector. Therefore, linear multivariate methods reduce the dimension of the data by projecting the samples into a smaller, interpretable space spanned by a few components.
In mixOmics methods, the parameters to choose include the total number of components, also called dimensions H, and the $\ell_1$ penalty on each dimension for all sparse methods. Contrary to other R packages implementing $\ell_1$ penalisation methods (e.g. glmnet, Friedman et al. 2010; PMA, Witten et al. 2013), and in order to improve the usability of the methods, the $\ell_1$ penalised problem is solved via soft-thresholding and the penalty parameter is equivalently replaced by the number of features to select on each dimension. In our multivariate models, the number of features to select is tuned via repeated cross-validation. The result is a subset of correlated features that best discriminate the outcome and constitute a molecular signature.
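The correspondence between the penalty and the number of selected features can be sketched in a few lines of base R. The helper names `soft_threshold` and `keep_top` are hypothetical and are not part of the mixOmics API; the snippet only illustrates, on a toy loading vector, that keeping the largest coefficients amounts to soft-thresholding at a suitably chosen penalty.

```r
# Soft-thresholding operator: shrinks every coefficient towards zero by lambda
# and sets to zero those whose absolute value does not exceed lambda.
soft_threshold <- function(a, lambda) {
  sign(a) * pmax(abs(a) - lambda, 0)
}

# Hypothetical helper: pick lambda so that exactly `keep` coefficients survive
# (assumes no ties among the absolute values), then rescale to unit l2 norm.
keep_top <- function(a, keep) {
  abs_sorted <- sort(abs(a), decreasing = TRUE)
  lambda <- if (keep < length(a)) abs_sorted[keep + 1] else 0
  a_sparse <- soft_threshold(a, lambda)
  a_sparse / sqrt(sum(a_sparse^2))
}

a <- c(0.9, -0.1, 0.4, 0.05, -0.7)   # toy loading vector (made-up values)
a_sparse <- keep_top(a, keep = 3)

print(round(a_sparse, 3))
print(which(a_sparse != 0))          # indices of the selected features
```

Expressing the tuning parameter as a count of features rather than as a penalty value makes the grid to explore directly interpretable, which is the usability argument made above.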
Historically, our first methods were dedicated to the integration of two ‘omics data sets (González et al., 2008; Lê Cao et al., 2008, 2009b,a), or the discriminant analysis of a single ‘omics data set (Lê Cao et al., 2011). The integrative methods presented in this manuscript focus on the integration of multiple biological data sets to address cutting-edge biological and biomedical questions.
Implementation. mixOmics is fully implemented in the R language and exports more than 30 functions for performing a statistical analysis, tuning its parameters or visualising its results. mixOmics mainly depends on the R base packages (parallel, methods, grDevices, graphics, stats, utils) and recommended packages (MASS, lattice), but also imports functions from a limited number of other R packages (igraph, rgl, ellipse, corpcor, RColorBrewer, plyr, dplyr, tidyr, reshape2, ggplot2). In mixOmics, we provide generic R/S3 functions to assess the performance of the methods (predict, plot, print, perf, auroc, etc) and to visualise the results (plotIndiv, plotArrow, plotVar, plotLoadings, etc) as described in the next paragraph.
[Figure 1 diagram. Rows: data input, method, graphics (sample plots, variable plots). Columns: single ‘omics, sparse PLS-DA with inputs X (N × P) and Y; integrative ‘omics, DIABLO with inputs X1 (N × P1), X2 (N × P2), ..., XQ (N × PQ) and Y, and MINT with inputs X1 (N1 × P), X2 (N2 × P), ..., XM (NM × P) and Y1, Y2, ..., YM.]
Figure 1: Overview of the mixOmics multivariate methods for single (sparse PLS-DA) and integrative (DIABLO and MINT) ‘omics supervised analyses. X denotes a predictor dataset, and Y a categorical outcome response. Integrative analyses include N-integration (across studies generated on the same N samples and different types of predictor features), and P-integration (the same P predictors are measured on independent studies). See also Suppl. S1 for a summary of the different method calls and plot functions.
Currently, seventeen methods are implemented in mixOmics to integrate large biological datasets, amongst which twelve have similar names of the form (mint).(block).(s)pls(da) (see Table 1), as they are wrappers of a single main hidden function of mixOmics. The wrapper functions check and shape the input parameters before passing them to the hidden function, which extends the SGCCA algorithm (Tenenhaus et al., 2014) to perform either N- or P-integration. The remaining five statistical methods are PCA, sparse PCA, IPCA, rCCA and rGCCA. Each statistical method implemented in mixOmics returns a list of essential outputs which are used in our S3 visualisation functions.
Graphical outputs to visualise multivariate analysis results. mixOmics aims to provide insightful and user-friendly graphical outputs to interpret the statistical and biological results, some of which were introduced in González et al. (2012). Thanks to R/S3 functions as listed in S1, the function calls are identical for all multivariate methods implemented in the mixOmics package, as we illustrate in the next sections. We provide various visualisations, including sample plots and feature plots, which are based on the component scores and the loading vectors, respectively.
Framework        Analysis       Sparse   Function name
Single ‘omics    unsupervised   -        pca
                                -        ipca
                                X        spca
                 supervised     -        plsda
                                X        splsda
Two ‘omics       unsupervised   -        rcca
                                -        pls
                                X        spls
P-integration    unsupervised   -        mint.pls
                                X        mint.spls
                 supervised     -        mint.plsda
                                X        mint.splsda
N-integration    unsupervised   -        wrapper.rgcca
                                -        block.pls
                                X        block.spls
                 supervised     -        block.plsda
                                X        block.splsda (DIABLO)
Table 1: Seventeen statistical methods available in mixOmics. An X in the Sparse column marks a method that performs feature selection.
Here we list the main visualisation functions in mixOmics.
• plotIndiv (Sample plot): represents samples by plotting the component scores. Such a plot visualises similarities between samples in the small subspace spanned by the components. For the integrative methods described in Sections IV and V, samples from each data set or each study are represented on separate plots, making it possible to visualise the agreement between the data sets at the sample level. Confidence ellipses for each class can be displayed.
• plotArrow (Arrow representation): plots the component scores associated to either the X data (start of the arrow) or the Y outcome (tip of the arrow). As such, short arrows indicate a good discrimination of the classes. In the case of N-integration, the start of the arrow indicates the centroid between all data sets for a given sample and the tips of the arrows the location of that sample in each data set. In that specific case, short arrows indicate a strong agreement between the matching data sets, and long arrows a disagreement between the matching data sets.
• plotVar (Correlation circle plots): displays features selected by the multivariate method. Each feature coordinate is defined as the Pearson correlation between the original feature and the component for each dimension (see González et al. (2012) for a detailed description). Correlation circle plots are particularly useful to visualise the contribution of each feature to the definition of each component (features close to the large circle of radius 1), as well as the correlation structure between features (clusters of features). The cosine of the angle between any two points represents the correlation (positive, negative or null) between the two features.
Both plotIndiv and plotVar offer the usual plot arguments to display symbols, colours and legend. Graphic styles include default ggplot2, graphics, lattice and 3D plots.
• cim (Clustered Image Maps): heatmap plots to visualise the distances between features with respect to each sample. By default we use the Euclidean distance and the complete linkage method. For the specific case of N-integration, the function cimDiablo represents the selected features from the different data sets.
• network (Relevance networks): represents the correlation structure between features of different types. A similarity matrix representing the association between pairs of features across all components is calculated as the sum of the correlations between the original features and the loading vector across all dimensions of interest $h = 1, \ldots, H$ (see González et al. (2012) for more details). Those networks are bipartite and thus only a link between two features of different types is represented.
• plotLoadings: represents the loading coefficient of each feature selected on each dimension of the multivariate model. Features are ranked according to their contribution to the component (bottom to top), and colours indicate the class for which the mean (or median) expression value is the highest (or the lowest) for each feature. Such graphical output gives more insight into the molecular signature, especially when interpreted in conjunction with the sample plot.
Other graphical outputs are available in mixOmics to represent the classification performance of multivariate models using the generic function plot. The list of functions for each framework presented here is summarised in Suppl. S1.
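The coordinates behind a correlation circle plot can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, principal components from `prcomp` stand in for the components of a fitted mixOmics model, and the coordinates are taken to be the Pearson correlations between each feature and each component's scores, following the description in González et al. (2012). The package's own plotVar function is what produces the actual plot.

```r
set.seed(1)

# Simulated data: 20 samples, 5 features (hypothetical gene names).
X <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5,
            dimnames = list(NULL, paste0("gene", 1:5)))

# Stand-in for a fitted multivariate model: scores on the first two
# principal components of the centred and scaled data.
pc <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pc$x[, 1:2]

# Coordinates of each feature: its Pearson correlation with each component.
coords <- cor(X, scores)
print(round(coords, 2))

# Because the two components are uncorrelated, every feature lies inside
# the circle of radius 1; features near the circle are well represented.
radius <- sqrt(rowSums(coords^2))
print(round(radius, 2))
```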
General notations. We assume each data set has been normalised using appropriate techniques specific for the type of ‘omics platform. Let X denote a data matrix of size N observations (rows) × P predictors (e.g. expression levels of P genes, in columns). The categorical outcome y is expressed as a dummy matrix Y in which each column represents one outcome category and each row indicates the class membership of each sample. Y is of size N observations (rows) × K outcome categories (columns). For any vector $a \in \mathbb{R}^p$ we denote its $\ell_1$ norm by $||a||_1 = \sum_{j=1}^{p} |a_j|$ and its $\ell_2$ norm by $||a||_2 = (\sum_{j=1}^{p} a_j^2)^{1/2}$. For any matrix we denote its transpose by $^\top$.
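As a small illustration of these notations, the dummy coding of a categorical outcome and the two norms can be written in base R as follows. The outcome labels and the vector `a` are made-up toy values.

```r
# Toy categorical outcome for N = 6 samples and K = 4 classes.
y <- factor(c("BL", "EWS", "EWS", "NB", "RMS", "BL"))

# Dummy matrix Y (N x K): one column per category, and a 1 in row i marks
# the class membership of sample i.
Y <- sapply(levels(y), function(k) as.integer(y == k))
print(Y)

# l1 and l2 norms of a toy vector.
a <- c(3, -4, 0)
l1_norm <- sum(abs(a))
l2_norm <- sqrt(sum(a^2))
print(c(l1 = l1_norm, l2 = l2_norm))
```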
III. Multivariate analysis of one data set
Linear Discriminant Analysis (LDA) and Projection to Latent Structure (PLS, Wold 1966) are popular multivariate methods for supervised analyses. In mixOmics we mainly focus on PLS methods for their flexibility to solve a variety of analytical problems (Boulesteix and Strimmer, 2007). PLS regression (Wold, 1966) was originally developed for unsupervised analysis to integrate two continuous data sets measured on the same observations. We introduce here a supervised version, called PLS-Discriminant Analysis (PLS-DA, Nguyen and Rocke 2002a; Barker and Rayens 2003), a natural extension that replaces one of the data sets with a dummy matrix Y. PLS-DA fits a multivariate classifier that assigns samples to known classes, with the ultimate aim of predicting the classes of external test samples whose outcome is often unknown.
[Figure 2 diagram: the predictor matrix X (N × P) and the outcome y, with components $t_1 = X_1 a_1$, $t_2 = X_2 a_2$, $t_3$ and their associated loading vectors $a_1$, $a_2$, $a_3$, chosen to maximise cov(t, Y).]
Figure 2: Example of data matrix decomposition for single ‘omics analysis. The predictor matrix X is decomposed into a set of components $(t_1, \ldots, t_H)$ and associated loading vectors $(a_1, \ldots, a_H)$; Y is the outcome coded as a dummy matrix and combined linearly (see exact formula in Equation (1)). $X_h$ is the deflated (residual) matrix, starting with $X_1 = X$, for $h = 1, \ldots, H$, where H is the dimension of the model (number of components).
PLS-DA. Briefly, PLS-DA is an iterative method that constructs H successive artificial (latent) components $t_h = X_h a_h$ and $u_h = Y_h b_h$ for $h = 1, \ldots, H$, where the $h$th component $t_h$ (respectively $u_h$) is a linear combination of the X (Y) features. H denotes the dimension of the PLS-DA model. The weight coefficient vector $a_h$ ($b_h$) is the loading vector that indicates the importance of each feature to define the component. For each dimension $h = 1, \ldots, H$, PLS-DA solves

$$\max_{(a_h, b_h)} \mathrm{cov}(X_h a_h, Y_h b_h), \quad \text{s.t. } ||a_h||_2 = ||b_h||_2 = 1 \qquad (1)$$

where $X_h$, $Y_h$ are the residual (deflated) matrices extracted from each iterative linear regression (see Lê Cao et al. 2011 for more details). The PLS-DA model assigns to each sample i H pairs of scores $(t_h^i, u_h^i)$, which effectively represent the projection of that sample into the X- or Y-space spanned by those PLS components. As $H \ll P$, the projection space is small, allowing for dimension reduction as well as insightful sample plot representation. Note that the projection into the Y-space is of no use for a discriminant analysis such as PLS-DA.
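For the first dimension, the criterion in Equation (1) has a closed-form solution: with centred columns, the covariance between $X a$ and $Y b$ is proportional to $a^\top X^\top Y b$, which is maximised under the unit-norm constraints by the leading pair of singular vectors of $X^\top Y$. The base R sketch below illustrates this on simulated data. It is not the package's implementation: it covers only the first component, omits the deflation step, and only centres the data, whereas the scaling applied in practice may differ.

```r
set.seed(42)

# Simulated data: N = 30 samples, P = 6 features, K = 3 classes.
N <- 30; P <- 6
y <- factor(rep(c("A", "B", "C"), each = 10))
X <- matrix(rnorm(N * P), N, P)
X[y == "B", 1] <- X[y == "B", 1] + 3   # feature 1 shifted in class B
X[y == "C", 2] <- X[y == "C", 2] - 3   # feature 2 shifted in class C

# Dummy outcome matrix, then column-centre both matrices.
Y  <- sapply(levels(y), function(k) as.numeric(y == k))
Xc <- scale(X, center = TRUE, scale = FALSE)
Yc <- scale(Y, center = TRUE, scale = FALSE)

# Leading singular vectors of X'Y give the first pair of loading vectors.
sv <- svd(crossprod(Xc, Yc))
a1 <- sv$u[, 1]
b1 <- sv$v[, 1]

# First pair of components and the covariance they achieve.
t1 <- Xc %*% a1
u1 <- Yc %*% b1
max_cov <- drop(cov(t1, u1))

print(round(a1, 3))    # loading vector: importance of each feature
print(max_cov)         # equals the largest singular value divided by N - 1
```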
Feature selection with sparse PLS-DA. We developed a sparse version of PLS-DA (Lê Cao et al., 2011) which includes an $\ell_1$ penalisation (Tibshirani, 1996) on the loading vector $a_h$ to shrink some coefficients to zero. Thus, for each dimension $h = 1, \ldots, H$, sPLS-DA solves:

$$\max_{(a_h, b_h)} \mathrm{cov}(X_h a_h, Y_h b_h), \quad \text{s.t. } ||a_h||_2 = ||b_h||_2 = 1 \text{ and } ||a_h||_1 \leq \lambda_h \qquad (2)$$

where $\lambda_h$ is a non-negative parameter that controls the amount of shrinkage in $a_h$. The component scores $t_h = X_h a_h$ are now defined on a small subset of features with non-zero coefficients, leading to a feature selection that aims to maximise the discrimination between the K outcome classes in Y.
Prediction. Once fitted, the (sparse) PLS-DA model can be applied to an external test set X of size ($N_{test} \times P$) to predict the class of new samples (see Lê Cao et al. 2011 for more details). The predict function outputs the predicted scores for each test sample. Since the predicted scores are expressed as continuous values, a prediction distance must be applied to obtain the final predicted class membership. Distances such as the ‘maximum distance’, the ‘Mahalanobis distance’ and the ‘centroid distance’ are provided (see Figure 3B1).
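The role of a prediction distance can be illustrated with a simplified base R sketch. All numbers are made up, and the two rules below are written under assumptions: the maximum distance is taken to assign the class whose predicted dummy score is largest, and the centroid distance is taken to assign the class whose centroid (the mean of the training sample scores for that class) is closest in Euclidean distance in the component space. The exact computations in the package's predict function may differ in their details.

```r
classes <- c("BL", "EWS", "NB", "RMS")

# Maximum distance: hypothetical predicted dummy scores for three test
# samples (rows), one column per class.
Y_hat <- matrix(c( 0.70, 0.20, 0.05, 0.05,
                   0.10, 0.35, 0.15, 0.40,
                  -0.05, 0.15, 0.80, 0.10),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("s", 1:3), classes))
pred_max <- classes[apply(Y_hat, 1, which.max)]
print(pred_max)

# Centroid distance: hypothetical training scores on H = 2 components,
# two training samples per class.
train_t <- rbind(c(-2.0,  0.0), c(-2.2,  0.2),   # BL
                 c( 2.0,  1.0), c( 1.8,  1.2),   # EWS
                 c( 0.0, -2.0), c( 0.2, -2.2),   # NB
                 c( 3.0, -3.0), c( 3.2, -2.8))   # RMS
train_y <- factor(rep(classes, each = 2), levels = classes)

# Class centroids in the component space (K x H matrix).
centroids <- apply(train_t, 2, function(col) tapply(col, train_y, mean))

# Hypothetical scores of three test samples on the same components.
test_t <- rbind(c(-1.9, 0.1), c(1.5, 0.8), c(0.3, -1.8))

dist_to_centroids <- function(t_i) {
  sqrt(rowSums(sweep(centroids, 2, t_i)^2))
}
pred_centroid <- classes[apply(test_t, 1,
                               function(t_i) which.min(dist_to_centroids(t_i)))]
print(pred_centroid)
```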
Choice of parameters. One important parameter to choose in PLS-DA and sPLS-DA is the number of components, or dimension of the model, called ncomp in mixOmics. While this parameter has mostly been ignored in the PLS-DA literature, it plays a crucial role in ensuring maximal prediction accuracy. Our experience when analysing a large number of ‘omics data sets has shown that ncomp = K − 1 was usually sufficient to summarise most of the discriminatory information from the data (Lê Cao et al., 2011; Shah et al., 2016). The second parameter to choose pertains to the $\lambda_h$ penalisation parameters for sPLS-DA, which are replaced by the number of features to select on each component with the argument keepX.
The tune function performs repeated cross-validation (CV) over a user-supplied grid of keepX values. The keepX value that leads to the best prediction accuracy of the model is reported for each component. Prediction accuracy is evaluated according to the overall classification error rate, or the Balanced Error Rate (BER) when the number of samples per class is unbalanced. Both measures are calculated on the left-out samples during the CV procedure, and averaged across the repeated CV runs. The number of folds in CV depends on the number of samples N and can be specified in the function, along with a sufficient number of repeats (e.g. nrepeat = 50-100). In the case of small N, leave-one-out validation is advised and nrepeat is set to 1. Additional outputs from the tune function include (1) the stability of the selected features across all CV runs, which represents a useful measure of reproducibility of the molecular signature, and (2) receiver operating characteristic (ROC) curves and Area Under the Curve (AUC), averaged using one-vs-all comparisons if K > 2. Note however that ROC and AUC criteria may not be particularly insightful as the prediction threshold in our methods is based on a specified distance as described earlier.
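The difference between the two error measures can be seen on a toy example in base R. The sketch assumes that the BER is the average of the per-class error rates; the class sizes mimic an unbalanced design and the misclassifications are invented.

```r
# True classes with unbalanced sizes (8, 23, 12 and 20 samples).
truth <- factor(rep(c("BL", "EWS", "NB", "RMS"), times = c(8, 23, 12, 20)))

# Invented predictions: start from the truth and add three errors.
pred <- truth
pred[c(1, 2)] <- "EWS"   # 2 of the 8 BL samples misclassified
pred[9]       <- "RMS"   # 1 of the 23 EWS samples misclassified

# Overall error rate: proportion of misclassified samples.
overall <- mean(pred != truth)

# Balanced error rate: error rate within each class, averaged over classes,
# so that small classes weigh as much as large ones.
per_class <- tapply(pred != truth, truth, mean)
ber <- mean(per_class)

print(round(per_class, 3))
print(round(c(overall = overall, BER = ber), 3))
```

Here the errors are concentrated in the smallest class, so the BER is larger than the overall error rate.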
An additional option that we developed and implemented for all tune functions in mixOmics is to tune and fit a constrained model. The process is as follows: once the optimal keepX value is chosen on one component, the model is fitted with that specific keepX, and the resulting feature selection is then fixed for the tuning of the following component. In other words, the tuning is performed on the optimal list of selected features (keepX.constraint) instead of the number of features (keepX). Such a strategy was implemented in the sister package bootPLS and successfully applied in our recent integrative study (Rohart et al., 2016b). Our experience has shown that constrained tuning and models improve the performance of the methods. We illustrate an example in Suppl. V.
The tuning step must be conducted with caution to avoid overfitting, as widely described in the literature (see for example Ambroise and McLachlan 2002). Our tune procedure performs repeated CV and reports the frequency of selected features across all repeated CV folds and the classification error rate for each keepX value. Once ncomp and keepX for each component are chosen, the final PLS-DA or sPLS-DA model is fitted on the whole data set and the final performance can be obtained with the perf function, which also performs repeated CV (see Suppl. V).
Extensions of PLS-DA for repeated measurements and 16S microbiome data. PLS-DA and sPLS-DA were extended to account for repeated measurement designs, as described in Liquet et al. (2012), by specifying the argument multilevel in the plsda and splsda functions. Recent extensions in the package include sPLS-DA analysis to identify microbial communities for 16S data, with an additional logratio argument to account for compositional data in microbiome experiments (Lê Cao et al. 2016, see also our mixMC framework in www.mixOmics.org/mixMC).
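To give an idea of what a log-ratio transformation does to compositional counts, here is a base R sketch of the centred log-ratio (CLR), one common transformation of this kind. It is an assumption that this corresponds to what the logratio argument applies; the counts and the pseudo-count `offset` are invented for the illustration.

```r
# Toy count table: 2 samples (rows) x 4 taxa (columns).
counts <- matrix(c(10,  0,  5, 85,
                   20, 30,  0, 50),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("s1", "s2"), paste0("OTU", 1:4)))

# Assumption: a pseudo-count of 1 is added to avoid taking log(0).
offset <- 1

# Centred log-ratio: log of each value minus the mean log of its sample.
clr <- function(x) {
  lx <- log(x)
  lx - mean(lx)
}
X_clr <- t(apply(counts + offset, 1, clr))

print(round(X_clr, 3))
print(rowSums(X_clr))   # each transformed sample sums to zero (up to rounding)
```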
Usage in mixOmics. Figure 3 illustrates the different graphical outputs obtained when analysing a single data set, from unsupervised to supervised analyses. The data set analysed is a microarray data set available from the mixOmics package investigating Small Round Blue Cell Tumors (SRBCT, Khan et al. 2001), with 63 tumour samples and the expression levels of 2,308 genes. Samples are classified into four classes: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS). The aim of this analysis is to assess similarities between tumour types using Principal Component Analysis (Fig. 3A), to classify the different tumour subtypes with PLS-DA (Fig. 3B) and to identify a molecular gene signature discriminating the tumour types with sPLS-DA (Fig. 3C). The full pipeline, results interpretation and associated R code are available in Electronic Suppl. V.
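A condensed sketch of such a pipeline is given below. It is not the authors' supplementary code: the keepX grid, the number of folds and the number of repeats are illustrative values chosen to keep the run short (many more repeats are advised above), and argument names may vary between package versions. The code is wrapped in a check so that it is skipped when mixOmics is not installed.

```r
if (requireNamespace("mixOmics", quietly = TRUE)) {
  library(mixOmics)

  # SRBCT data shipped with the package: gene expression and tumour class.
  data(srbct)
  X <- srbct$gene
  Y <- srbct$class

  # Unsupervised exploration with PCA, samples coloured by tumour class.
  res.pca <- pca(X, ncomp = 3, center = TRUE, scale = TRUE)
  plotIndiv(res.pca, group = Y, ind.names = FALSE, legend = TRUE)

  # Supervised classification with PLS-DA, using ncomp = K - 1 = 3.
  res.plsda <- plsda(X, Y, ncomp = 3)
  plotIndiv(res.plsda, ind.names = FALSE, legend = TRUE)

  # Tune the number of features per component (illustrative small grid).
  tune.res <- tune.splsda(X, Y, ncomp = 3, validation = "Mfold", folds = 5,
                          dist = "max.dist", test.keepX = c(5, 10, 20),
                          nrepeat = 3)
  keepX.opt <- tune.res$choice.keepX

  # Final sparse model, its molecular signature and its performance.
  res.splsda <- splsda(X, Y, ncomp = 3, keepX = keepX.opt)
  plotLoadings(res.splsda, comp = 1, contrib = "max", method = "mean")
  print(head(selectVar(res.splsda, comp = 1)$name))

  perf.res <- perf(res.splsda, validation = "Mfold", folds = 5, nrepeat = 3)
  print(perf.res$error.rate)
}
```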
IV. Integration of heterogeneous data sets with DIABLO
The integration of multiple ‘omics datasets measured on the same N biological samples (Figure 1) is based on a variant of the multivariate methodology Generalised Canonical Correlation Analysis (GCCA, Tenenhaus and Tenenhaus 2011; Tenenhaus et al. 2014), which, contrary to what its name suggests, generalises PLS for N-integration. Our recent development DIABLO further improved the implementation of GCCA to include feature selection in a supervised framework and in a user-friendly manner (Günther et al., 2014; Singh et al., 2016).
Method. We denote Q ‘omics data sets $X^{(1)} (N \times P_1), X^{(2)} (N \times P_2), \ldots, X^{(Q)} (N \times P_Q)$ measuring the expression levels of $P_q$ ‘omics features on the same N biological samples, $q = 1, \ldots, Q$. GCCA solves for each component $h = 1, \ldots, H$:

$$\max_{a_h^{(1)}, \ldots, a_h^{(Q)}} \sum_{q,j=1,\, q \neq j}^{Q} c_{q,j}\, \mathrm{cov}(X_h^{(q)} a_h^{(q)}, X_h^{(j)} a_h^{(j)}), \quad \text{s.t. } ||a_h^{(q)}||_2 = 1 \text{ and } ||a_h^{(q)}||_1 \leq \lambda^{(q)} \qquad (3)$$

where $\lambda^{(q)}$ is the penalisation parameter, $a_h^{(q)}$ is the loading vector on component h associated to the residual matrix $X_h^{(q)}$ of the data set $X^{(q)}$, and $C = \{c_{q,j}\}_{q,j}$ is the design matrix. C is a $Q \times Q$ matrix of zeros and ones which specifies whether datasets should be correlated: zeros where datasets are not connected and ones where datasets are connected. Thus, it is possible to constrain the model to only take into account specific pairwise covariances by setting the design matrix (see Tenenhaus et al. (2014) for more details). Such a design makes it possible to model a particular association between pairs of ‘omics data, as expected from prior biological knowledge or experimental design. DIABLO extends Equation (3) to a supervised discriminant analysis framework by replacing one data matrix $X^{(q)}$ with the outcome dummy matrix Y.
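The effect of the design matrix on the criterion in Equation (3) can be made concrete with a base R sketch. The block names are hypothetical, and random numbers stand in for the components $X_h^{(q)} a_h^{(q)}$ of each block; how the outcome Y enters the design within the package is not shown here.

```r
set.seed(7)

blocks <- c("mRNA", "miRNA", "protein")   # hypothetical block names
Q <- length(blocks)

# Full design: every pair of blocks is connected. The diagonal is zero
# because the sum in the criterion excludes q = j.
design_full <- matrix(1, Q, Q, dimnames = list(blocks, blocks))
diag(design_full) <- 0

# Partial design: the miRNA and protein blocks are not connected.
design_partial <- design_full
design_partial["miRNA", "protein"] <- 0
design_partial["protein", "miRNA"] <- 0

# Stand-in for the block components: one column of N = 20 scores per block.
scores <- matrix(rnorm(20 * Q), 20, Q, dimnames = list(NULL, blocks))

# Value of the criterion for a given design: weighted sum of the pairwise
# covariances between block components.
objective <- function(C, scores) sum(C * cov(scores))

print(design_partial)
print(objective(design_full, scores))
print(objective(design_partial, scores))
```

Setting an entry of the design to zero simply removes the corresponding pairwise covariance from the quantity being maximised.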
Prediction. DIABLO includes several prediction strategies such as a majority vote and a weighted vote. Both are based on the predictions obtained from each ‘omics dataset via the $X_h^{(q)} a_h^{(q)}$ components. The majority vote consists in assigning to a sample the class that has received the
[Figure 3 (image not recoverable from the extraction). Panels recovered: classification error rate against the number of components (1 to 10), overall and BER, for the max.dist, centroids.dist and mahalanobis.dist prediction distances; ROC curve for component 2 (sensitivity (%) against 100 − specificity (%)), with AUC values BL vs Other(s): 1, EWS vs Other(s): 1, NB vs Other(s): 0.8627, RMS vs Other(s): 0.8244; clustered image map of the selected genes across samples (colour key from −6.36 to 6.36); loadings on components 1 and 2, coloured by outcome (EWS, BL, NB, RMS); sample plot "sPLS-DA on SRBCT" (X-variate 1: 5% expl. var against X-variate 2: 6% expl. var); stability of the selected features on components 1 and 2.]
g1116
g1159
g372
g812
g867
g1089
g1220
g1795
g1887
g1914
g276
g1279
g1624
g1735
g54
g597
g956
g1007
g1021
g119
g1375
g780
g1022
g1347
g1839
g236
g546
g941
g1876
g799
g1067
g1770
g444
g554
g365
g1974
g417
g707
g1746
g190
g2199
g602
g655
g2156
g744
g796
g1084
g1224
g268
g666 g67
g1301
g1621
g2099
g244
g378
g484
g555
g903
g1088
g1142
g1201
g1518
g1649
g1773
g1962
g657
g970
g1493
g1634
g169
g1924
g1964
g2279
g415
g449
g998
g2192
g510 g89
g1066
g107
g137
g1443
g1517
g1784
g1857
g2104
g2175
g2300
g504
g575
g695
g1734
g1822
g2136
g2241
g289
g604
g688
g912
g1068
g1070
g1105
g1122
g1246
g1286
g1522
g1587
g159
g1853
g1942
g1944
g2022
g2127
g2217
g2301
g256
g424
g756
g901
g965
g973
g1016
g1315
g1317
g1348
g1460
g1686
g1765
g1928
g1992
g2227
g2276
g442
g632
g639 g76 g1
g1017
g1042
g1206
g1215
g1217
g1291
g1309
g1312
g1318
g1325
g1378
g1392
g1416
g1486
g1496
g1700
g1873
g1956
g2209 g3
g315
g606
g636
g865
g972
g1002
g1187
g1203
g1221
g1226
g1273
g1302
g1351
g1370
g1384
g1405
g1424
g1436
g1479
g1489
g1524
g1548
g1565
g1608
g1655
g1710
g1716
g1751
g1909
g191
g1968
g1979
g2106
g2116
g2186
g264
g288
g384
g477
g499
g573
g590
g612
g667
g694
g733
g761
g794
g801
g832
g837
g873
g881
g891 g90
g906
g951
g1005
g1008
g1023
g1151
g1228
g1262
g1308
g1324
g139
g141
g142
g1434
g1445
g1457
g1458
g1462
g1490
g151
g1535
g1577
g1599
g1626
g1656
g1661
g1697
g1714
g1761
g1800
g1803
g1823
g1837
g1863
g1929
g1930
g1949
g1977
g1994
g2000
g2016
g2039
g205
g2053
g2080
g2081
g2114
g214
g2151
g217
g2207
g2247
g2289
g2295
g251
g258
g279 g37
g381
g388
g397
g450
g465
g490
g497
g522
g533
g558
g594
g615
g642
g643
g665
g714
g715
g736
g745
g747
g798 g80
g800
g808
g817
g861
g875
g932
g933 g94
g948
g982
0.0
0.2
0.4
0.6
0.8
Comp 3
Features
Stability
g742
g879
g255
g1804
g2157
g1601
g2199
g236
g417
g1764
g1003
g174
g1896
g1955
g603
g910
g1434
g2046
g2159
g975
g976
g2000
g422
g2144
g799
g1862
g1030
g1353
g1924
g1579
g823
g1776
g1084
g1829
g2083
g2258
g597
g1911
g153
g2047
g1263
g575
g483
g1348
g1698
g2241
g1066
g1662
g695
g819
g1088
g1007
g972
g1347
g372
g129
g1158
g123
g1295
g1386 g2
g442
g783
g836
g846
g1116
g1387
g1460
g1606
g1915
g1916
g335
g585
g589
g956
g2203
g276
g1196
g1914
g188
g849
g1198
g2156 g3
g1207
g842
g1091
g1536
g901
g1369
g1655
g325
g1730
g338
g509
g780
g1723
g187
g1624
g1944
g433
g1055
g1735
g400
g190
g2049
g2230
g758
g1036
g1309
g951 g85
g1389
g246
g545
g1327
g1974
g1980
g424
g1273
g1436
g229
g655
g1954
g554
g555
g1319
g1074
g1194
g1645
g1706
g1799
g1372
g1006
g1370
g2136
g54
g1283
g1375
g1775
g373
g1021
g336
g971
g719
g982
g380
g214
g1649
g1887
g2050
g1032
g1105
g1738
g1962
g1022
g1110
g1704
g571
g1392
g1484
g828
g1302
g1664
g2146
g532
g707
g2104
g1090
g1741
g606
g632
g1308
g1659
g2175
g566
g1330
g2117
g493 g50
g941
g1599
g1894
g36
g970
g1462
g1888
g1992
g586
g602
g1023
g1220
g25
g67
g753
g1151
g1576
g1932
g1979
g2131
g348
g469
g1286
g1656
g1909
g2228
g801
g840
g1017
g1768
g2192
g639
g1453
g1496
g845
g1217
g151
g1581
g1708
g2227
g230
g499
g1093
g1315
g142
g1538
g26
g289
g642
g666
g747
g1099
g1201
g1206
g1250
g166
g2042
g2162
g2303
g903
g933
g965
g988
g1215
g1493
g2040
g650
g733
g841
g881
g998
g1343
g1427
g1577
g1626
g1682
g191
g744
g1132
g1291
g141
g1531
g1597
g1634
g165
g169
g1714
g1884
g209
g2097
g2118
g2207
g2240
g453
g462
g501
g796 g1
g1008
g1067
g1279
g1292
g1384
g1414
g1443
g1497
g1548
g1621
g1770
g1772
g1839
g1881
g1917
g1930
g1942
g2127
g2186
g558
g667 g74
g795
g859
g867
g912
g937
g1016
g1076
g1089
g1422
g1673
g1746
g1784
g1823
g1882
g2116
g2209
g2231
g2276
g248 g46
g527
g1012
g1013
g105
g1054
g1072
g1298
g1488
g1535
g1594
g1628
g1671
g1710
g1803
g182
g1853
g2088
g2163
g2266
g262
g368
g444
g449
g474
g582
g607
g714
g731
g891
g922
g1004
g1301
g1335
g1498
g1694
g1759
g1762
g1795
g1850
g1868
g1883
g1945
g1957
g1994
g2039
g2086
g2089
g2114
g2213
g222
g2253
g2301
g294
g437
g481
g543
g544
g590
g761
g820
g873
g883
g905
g1002
g1042
g1061
g107
g108
g1112
g1157
g1187
g1224
g1261
g1264
g1266
g1268
g1287
g1318
g1324
g1341
g137
g1377
g1405
g1417
g1426
g1445
g1489
g1518
g1522
g1529
g1587
g1608
g1613
g1697
g1699
g17
g1702
g1734
g1742
g1797
g1867
g1901
g1920
g1929
g1931
g1935
g1949
g1991
g2009
g205
g2075
g2080
g2167
g2198
g2208
g2235
g2279
g2285
g2289
g242
g251
g252
g268 g29
g291
g378
g419
g430
g443 g45
g450
g455
g463
g536
g552
g615
g635
g645
g647
g653
g657
g665
g702 g77
g771
g788
g789
g794
g797
g812
g814
g857
g875 g90
PCA on SRBCT
−20
0
20
−20 −10 0 10 20PC1: 11% expl. var
PC2:
11%
exp
l. va
r
LegendEWSBLNBRMS
PLSDA on SRBCT
−20
−10
0
10
20
−20 0 20X−variate 1: 10% expl. var
X−va
riate
2: 6
% e
xpl.
var
LegendEWSBLNBRMS
1 2 3 4 5 6 7 8 9 10
Principal Components
Expl
aine
d Va
rianc
e
0.00
0.02
0.04
0.06
0.08
0.10
A1 A2 B1 B2
C3
C2
C6C1
C5
C4
Figure 3: Illustration of PLS-DA and sPLS-DA in mixOmics. A) Unsupervised preliminary analysis with PCA,A1: percentage of explained variance per component, A2: PCA sample plot. B) Supervised analysiswith PLS-DA, B1: PLS-DA sample plot with confidence ellipse plots, B2: classification performance percomponent (overall or BER) for three prediction distances using 50 * 5-fold cross-validation. C) Supervisedanalysis and feature selection with sparse PLS-DA, C1: sPLS-DA sample plot with confidence ellipseplots, C2: arrow plot representing each sample pointing towards its outcome category, C3: coefficient weightof the features selected on component 1 and component 2, with colour indicating the class with maximalmean expression value for each feature, C4: feature stability when evaluating the performance of a sPLS-DAmodel with 10, 40 and 60 features on the first three components (50 * 5-fold cross-validation), C5: ClusteredImage Map (Euclidian Distance, Complete linkage). Samples are represented in rows, selected features incolumns (10, 40 and 60 genes selected on each component respectively), C6: receiver operating characteristic(ROC) curve and Area Under the Curve (AUC) averaged using one-vs-all comparisons.
highest number of predictions over all ‘omics data set, while the weighted vote combines thepredictions of all ’omics after weighting each by the correlation between the component X(q)
h a(q)hand the outcome. In both strategies, a prediction distance is to be specified to obtain a predictedclass, as described in Section III. Ties are indicated as ‘NA’ in the predicted classes.
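The two voting rules can be sketched in a few lines. This is a minimal illustration, not the mixOmics implementation: it assumes the per-data-set class predictions have already been obtained with a chosen prediction distance, the function names are hypothetical, and the weights stand in for the correlations between each data set's component and the outcome (the numbers are made up). A tie is returned as None, playing the role of ‘NA’.

```python
from collections import defaultdict


def _winner(scores):
    """Return the class with the highest score, or None when several classes tie."""
    best = max(scores.values())
    winners = [cls for cls, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None


def majority_vote(block_predictions):
    """Each data set (block) casts one vote for its predicted class."""
    scores = defaultdict(float)
    for predicted_class in block_predictions.values():
        scores[predicted_class] += 1.0
    return _winner(scores)


def weighted_vote(block_predictions, block_weights):
    """Each data set votes with a weight (here: a hypothetical correlation with the outcome)."""
    scores = defaultdict(float)
    for block, predicted_class in block_predictions.items():
        scores[predicted_class] += block_weights[block]
    return _winner(scores)


# Hypothetical predictions for one test sample from three data sets.
predictions = {"mrna": "Basal", "mirna": "LumA", "prot": "LumA"}
weights = {"mrna": 0.9, "mirna": 0.4, "prot": 0.3}  # made-up weights

majority = majority_vote(predictions)           # two of three blocks agree
weighted = weighted_vote(predictions, weights)  # the mRNA block dominates
tied = majority_vote({"mrna": "Basal", "mirna": "LumA"})  # one vote each
```

In this toy case the two strategies disagree: the majority vote follows the two concordant data sets, whereas the weighted vote follows the single data set with the largest weight.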
Specific outputs to visualise multiple ‘omics data sets integration. Several types of graphical outputs are available to support interpretation of the statistical results. To represent samples, plotIndiv displays the component scores from each ‘omics data set individually. This type of plot makes it possible to visualise the agreement between all data sets at the sample level. The plotArrow function enables a similar visualisation (see Section II). The function plotDiablo draws a matrix scatterplot of the components from each data set for a given dimension; it can be used to check whether the pairwise correlation between two ‘omics data sets has been modelled according to the design. The function circosPlot shows pairwise correlations among the selected features over all data sets. Features are represented on the side of the circos plot, with colours indicating the type of data, and external (optional) lines display expression levels for each outcome category. It is an extension of the method used in plotVar, cim and network (see González et al. 2012).
Parameters tuning. The first parameter to choose in DIABLO is the design matrix, which can be specified either from prior biological knowledge or by using a preliminary multivariate method integrating two data sets at a time (e.g. PLS) to assess the potential common information between data sets in an unsupervised analysis. In addition, the function plotDiablo run on all features (non-sparse model) can further confirm the suitability of the design to maximise the correlations between data sets. By default, the design links each data set to the outcome Y. Similar to PLS-DA, the number of components ncomp needs to be specified. We usually found that K − 1 components were sufficient to discriminate the sample classes, but this should be further assessed with the model performance and graphical outputs (see our example 4 and Suppl. V). Finally, and most importantly, the number of features to select per data set and per component needs to be specified with the list argument keepX. The tune function evaluates the performance of the model over a grid of different keepX parameters using repeated cross-validation, based on the (balanced) classification error rate, with a parallelisation option (argument cpus). Note that this tuning step might become cumbersome, as there might be numerous combinations to evaluate. A constrained tuning is also available, see Section III. Our experience shows that a minimal error rate can be attained with a rather small number of features per component and data set (< 20, Singh et al. 2016). However, the user can enlarge the search grid to ensure a sufficiently large number of selected features when the focus is on the downstream biological interpretation (e.g. enrichment analyses).
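To see why the keepX tuning step can become cumbersome, the following sketch enumerates every combination of candidate values across data sets. It is an illustration only: the candidate values, the number of folds and the number of repeats are hypothetical, and the helper keepx_grid is not a mixOmics function.

```python
from itertools import product


def keepx_grid(candidates):
    """Enumerate all combinations of candidate keepX values, one value per data set."""
    blocks = list(candidates)
    return [
        dict(zip(blocks, combo))
        for combo in product(*(candidates[block] for block in blocks))
    ]


# Hypothetical candidate numbers of features per data set, for one component.
candidates = {"mrna": [5, 10, 20], "mirna": [5, 10, 20], "prot": [5, 10]}
grid = keepx_grid(candidates)

# Each combination is evaluated on every fold of every repeat.
n_folds, n_repeats = 5, 10  # assumed cross-validation settings
n_model_fits = len(grid) * n_folds * n_repeats
```

The grid size is the product of the number of candidate values per data set, so adding a data set or a few more candidates multiplies the number of models to fit. This is the reason for keeping the grid coarse at first, or for running the evaluation in parallel.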
Usage in mixOmics. Figure 4 displays some of the graphical outputs when performing N-integration. The multi-‘omics breast cancer study analysed includes mRNA (P1 = 200), miRNA (P2 = 184) and proteomics (P3 = 142) data that were normalised and drastically filtered for illustrative purposes in this manuscript. The data were divided into a training set composed of N = 150 samples and an external test set of Ntest = 70 samples in which the proteomics data are missing (see details in Singh et al. 2016). The aim of N-integration is to identify a highly correlated multi-‘omics signature discriminating the breast cancer subgroups Luminal A, Her2 and Basal. Figure 4A displays the design matrix and the sample correlation between each component from each data set, B the sample plots for each data set, C our different feature plots and D a clustered image map of the multi-‘omics signature. The full pipeline, results interpretation and associated R code are available in Electronic Suppl. V.
Figure 4: Illustration of DIABLO analysis in mixOmics. A1: design, and A2: sample scatterplot from plotDiablo displaying the first component in each data set (upper plot) and Pearson correlation between each component (lower plot). B: sample plot per data set (block). C: feature outputs, C1: Correlation Circle plot representing each type of selected features, C2: relevance network visualisation of the selected features, C3: Circos plot showing the positive (negative) correlations (r > 0.7) between selected features as indicated by the brown (black) links, with feature names appearing in the quadrants, C4: coefficient weight of the features selected on component 1 in each data set, with colour indicating the class with a maximal mean expression value for each feature. D: Clustered Image Map (Euclidean distance, complete linkage) of the multi-omics signature. Samples are represented in rows, selected features on the first component in columns.
V. P-integration across independent data sets with MINT
The integration of independent data sets measured on the same common P features under similar conditions or treatments (Figure 1) is a useful approach to increase sample size and gain statistical power. In this context, the challenge is to accommodate systematic differences that arise from differences between protocols, geographical sites or the use of different technological platforms to generate the same type of ‘omics data (e.g. transcriptomics). This systematic unwanted variation, also called ‘batch effect’, often acts as a strong confounder in the statistical analysis and may lead to spurious results and conclusions if it is not accounted for in the statistical model.
Method. MINT (Rohart et al., 2016a) is an extension of the multi-group PLS framework (mg-PLS, Eslami et al. 2013b, 2014), where ‘groups’ represent independent studies, to a supervised framework with feature selection. MINT seeks to identify a common projection space for all studies that is defined on a small subset of discriminative features and that displays an analogous discrimination of the samples across studies.
We combine M data sets denoted $X^{(1)}$ ($N_1 \times P$), $X^{(2)}$ ($N_2 \times P$), ..., $X^{(M)}$ ($N_M \times P$), measured on the same P predictors but from independent studies, with $N = \sum_{m=1}^{M} N_m$. Each data set $X^{(m)}$, $m = 1, \ldots, M$, has an associated dummy outcome $Y^{(m)}$ in which all K classes are represented. We denote by $X$ ($N \times P$) and $Y$ ($N \times K$) the concatenation of all $X^{(m)}$ and $Y^{(m)}$ respectively. In the MINT framework, each feature of the data sets $X^{(m)}$ and $Y^{(m)}$ is centered and scaled. For each component h, MINT solves:
$$\max_{a_h, b_h} \sum_{m=1}^{M} N_m \, \mathrm{cov}\left(X_h^{(m)} a_h, \; Y_h^{(m)} b_h\right), \quad \text{s.t. } \|a_h\|_2 = 1 \text{ and } \|a_h\|_1 \le \lambda \qquad (4)$$
where $a_h$ and $b_h$ are the global loading vectors common to all studies, and $t_h^{(m)} = X_h^{(m)} a_h$ and $u_h^{(m)} = Y_h^{(m)} b_h$ are the partial PLS-components, which are study specific. Residual (deflated) matrices are calculated at each iteration of the algorithm based on the global components and loading vectors (see Rohart et al. 2016a). Thus the MINT algorithm models the study structure during the integration process. The penalisation parameter λ controls the amount of shrinkage and thus the number of non-zero weights in the global loading vector $a_h$. Similarly to sPLS-DA (Section III), MINT selects a combination of features on each PLS-component.
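The link between the ℓ1 penalty and the number of selected features can be illustrated with soft-thresholding. The sketch below is not the MINT algorithm: it assumes that the penalty is applied by soft-thresholding an unpenalised weight vector, with the threshold chosen so that at most a requested number of weights (the role played by keepX) remains non-zero, followed by rescaling to unit ℓ2 norm. The function name sparse_loading and the input weights are hypothetical.

```python
import math


def sparse_loading(weights, keep_x):
    """Soft-threshold `weights` so that at most `keep_x` entries stay non-zero,
    then rescale the result to unit L2 norm."""
    if not 0 < keep_x <= len(weights):
        raise ValueError("keep_x must be between 1 and the number of weights")
    magnitudes = sorted((abs(w) for w in weights), reverse=True)
    # The threshold is the largest magnitude that has to be shrunk to zero
    # (zero when every weight is kept).
    threshold = magnitudes[keep_x] if keep_x < len(weights) else 0.0
    shrunk = [math.copysign(max(abs(w) - threshold, 0.0), w) for w in weights]
    norm = math.sqrt(sum(s * s for s in shrunk))
    return [s / norm for s in shrunk] if norm > 0 else shrunk


# Hypothetical unpenalised weights for five features.
raw_weights = [0.5, -2.0, 0.1, 1.5, -0.3]
a = sparse_loading(raw_weights, keep_x=2)
n_selected = sum(1 for value in a if value != 0)
```

A stronger penalty corresponds to a larger threshold and therefore fewer non-zero weights. The surviving weights keep their sign but are shrunk towards zero before normalisation, which is the shrinkage effect mentioned above.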
Specific graphical outputs. The set of partial components $t_h^{(m)}$, $h = 1, \ldots, H$, provides study-specific outputs in plotIndiv. These graphics can act as a quality control step to detect studies that cluster outcome classes differently from the other studies (i.e. ‘outlier’ studies). The function plotLoadings displays the coefficient weights of the features globally selected by the model, represented individually in each study. Visualisation of the global loading vectors is also available. Note that the projection into the Y-space is not useful in MINT.
Parameters tuning. We take advantage of the independence between studies to evaluate the performance based on a novel CV technique called ‘Leave-One-Group-Out Cross-Validation’ (LOGOCV, Rohart et al., 2016a). LOGOCV performs CV where each group (study) m is left out once. The aim is to reflect a realistic prediction of independent external studies. The tune function implements LOGOCV to choose the optimal number of features keepX or the optimal set of features keepX.constraint to select in X, as described in the earlier Sections. Note that LOGOCV cannot be repeated (there is no nrepeat argument), as this type of cross-validation is not random.
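The splitting scheme behind LOGOCV can be sketched as follows. This is a minimal illustration with a hypothetical helper name and made-up study labels, not the code used by the tune function: each study is held out once as a test set and the model is trained on all remaining studies.

```python
def leave_one_group_out(study_labels):
    """Yield (held_out_study, train_indices, test_indices), once per study."""
    for held_out in sorted(set(study_labels)):
        test = [i for i, s in enumerate(study_labels) if s == held_out]
        train = [i for i, s in enumerate(study_labels) if s != held_out]
        yield held_out, train, test


# Hypothetical study membership for eight samples from four studies.
study = [1, 1, 2, 2, 2, 3, 4, 4]
splits = list(leave_one_group_out(study))

for held_out, train, test in splits:
    print(f"study {held_out} held out: train on {train}, test on {test}")
```

The splits are fully determined by the study labels, so running the procedure again yields exactly the same folds. This is why repeating the cross-validation would add no information, in contrast to random M-fold CV.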
Usage in mixOmics. Figure 5 displays some of the graphical outputs when performing P-integration with mixOmics. We combined four independent transcriptomics stem cell studies that measure the expression levels of 400 genes across 125 samples (cells). The data were normalised and drastically filtered for illustrative purposes in this manuscript. The cells were classified into Fibroblasts, hESC and hiPSC. The aim of this P-integration analysis is to identify a robust molecular signature across all studies to discriminate the three different cell types. After applying MINT via the mint.plsda and mint.splsda functions, generic visualisation functions of the mixOmics R package such as plotIndiv, cim and plotLoadings can be used; the outputs in Figure 5 were obtained by calls to those functions. Figure 5A displays a MINT PLS-DA sample plot, Figure 5B the tuning and performance evaluation of the MINT sPLS-DA analysis and Figure 5C the different sample and feature plots obtained with MINT sPLS-DA. The full pipeline, results interpretation and associated R code are available in Electronic Suppl. V.
Figure 5: Illustration of MINT analysis in mixOmics. A: Preliminary analysis with MINT PLS-DA (no feature selection), sample plot displaying the sample cell types. B: Parameter tuning and performance with MINT sPLS-DA, B1: BER (y-axis) with respect to the number of selected features (x-axis) when 1 and 2 components are successively added to the model. The full diamond represents the optimal number of features to select on each component using Leave-One-Group-Out cross-validation and the maximum distance, B2: final performance of the MINT sPLS-DA model for a selection of 6 and 55 transcripts on each component respectively: overall BER and error rate per cell type with the maximum distance. C: MINT sPLS-DA graphical outputs using plotIndiv, cim and plotLoadings. C1: global sample plot with confidence ellipse plots. C2: study-specific sample plots. C3: Clustered Image Map (Euclidean distance, complete linkage). Samples are represented in rows, selected features on the first component in columns. C4: coefficient weight of the features selected on component 1 in each study, with colour indicating the class with a maximal mean expression value for each transcript.
Values shown in panel B2:
          BER    Fib    hESC   hiPS
  Comp1   0.50   0.00   0.92   0.57
  Comp2   0.15   0.03   0.22   0.19
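As a worked check of the panel B2 values (Comp1: BER 0.50 with per-class error rates 0.00, 0.92 and 0.57; Comp2: BER 0.15 with 0.03, 0.22 and 0.19), the sketch below recomputes the BER from the error rates per cell type. It assumes the usual definition of the balanced error rate as the unweighted mean of the per-class error rates; the function name is illustrative.

```python
def balanced_error_rate(class_error_rates):
    """Unweighted mean of the per-class error rates."""
    return sum(class_error_rates.values()) / len(class_error_rates)


# Error rates per cell type as reported in panel B2 of Figure 5.
comp1 = {"Fibroblast": 0.00, "hESC": 0.92, "hiPS": 0.57}
comp2 = {"Fibroblast": 0.03, "hESC": 0.22, "hiPS": 0.19}

ber_comp1 = balanced_error_rate(comp1)  # (0.00 + 0.92 + 0.57) / 3
ber_comp2 = balanced_error_rate(comp2)  # (0.03 + 0.22 + 0.19) / 3
```

Rounded to two decimals, both values agree with the BER column of the panel. Because every class contributes equally regardless of its size, the fibroblast samples, which are perfectly classified on the first component, do not mask the poor discrimination of hESC and hiPS on that component.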
Conclusions
The technological race in high-throughput biology has led to increasingly complex biological problems, which require innovative statistical and analytical tools. Our package mixOmics focuses on data exploration and data mining, which are crucial steps for a first understanding of large data sets. In this article we presented our latest methods to answer cutting-edge integrative and multivariate questions in biology. In particular, our supervised frameworks DIABLO and MINT substantially extend the key PLS-DA method to perform N- and P-integration of multiple data sets, classification and class prediction of external studies. Combined together, those two frameworks bear the promise of NP-integration (combining multiple studies that each have several types of data). The sparse versions of our methods are particularly insightful to identify molecular signatures across those multiple data sets.
Feature selection resulting from our methods helps to refine biological hypotheses, suggests downstream analyses including statistical inference analyses, and may propose biological experimental validations. Indeed, multivariate methods have appealing properties to mine and analyse large and complex biological data, as they allow more relaxed assumptions about data distribution, data size (n << p) and data range than univariate methods, and provide insightful visualisations. In addition, the identification of a combination of discriminative features meets biological assumptions that cannot be addressed with univariate methods. Nonetheless, we believe that combining different types of statistical methods (univariate, multivariate, machine learning) is the key to answering complex biological questions. However, such questions must be well stated in order for those exploratory integrative methods to provide meaningful results, especially in the non-trivial case of multiple data integration.
We illustrated our different frameworks on classical ‘omics data; however, mixOmics methods can also be applied to data beyond the realm of ‘omics as long as they are expressed as continuous values. Our future work will include extensive development for other types of data, such as genotypic as well as time course biological data. Finally, while our manuscript focused mainly on supervised methodologies, the package also includes their unsupervised counterparts to investigate relationships and associations between features with no prior phenotypic or response information.
Availability and requirements
The R package mixOmics is available from CRAN (R Core Team, 2016), with tutorials and newsletter updates available from our website www.mixOmics.org.
Conflict of Interest
The authors declare that they have no competing interests.
Availability of supporting data
The data sets supporting the results of this article are available from the mixOmics R package in a processed format. R scripts, full tutorials and reports to reproduce the results from the proposed framework are available as Sweave code from our website www.mixOmics.org.
Authors’ contributions
FR implemented the MINT method; FR, BG and AS implemented the DIABLO method; FR was the main developer of the mixOmics package from v6.0.0. KALC manages and supervises the mixOmics project. FR and KALC edited the manuscript.
Acknowledgements
FR was supported, in part, by the Australian Cancer Research Foundation (ACRF) for the Diamantina Individualised Oncology Care Centre at The University of Queensland Diamantina Institute. KALC was supported, in part, by the National Health and Medical Research Council (NHMRC) Career Development fellowship (APP1087415). The authors would like to thank the numerous mixOmics users for continuously helping us improve the package.
References
Ambroise, C. and McLachlan, G. J. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences, 99(10):6562–6566.
Barker, M. and Rayens, W. (2003). Partial least squares for discrimination. Journal of Chemometrics, 17(3):166–173.
Boulesteix, A.-L. (2004). PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 3(1):1–30.
Boulesteix, A.-L. and Strimmer, K. (2007). Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief. Bioinform., 8(1):32–44.
Cook, J. A., Chandramouli, G. V., Anver, M. R., Sowers, A. L., Thetford, A., Krausz, K. W., Gonzalez, F. J., Mitchell, J. B., and Patterson, A. D. (2016). Mass spectrometry–based metabolomics identifies longitudinal urinary metabolite profiles predictive of radiation-induced cancer. Cancer Research, 76(6):1569–1577.
Eslami, A., Qannari, E. M., Kohler, A., and Bougeard, S. (2013a). Multi-group PLS regression: application to epidemiology. In New Perspectives in Partial Least Squares and Related Methods, pages 243–255. Springer.
Eslami, A., Qannari, E. M., Kohler, A., and Bougeard, S. (2013b). Multi-group PLS regression: application to epidemiology. In New Perspectives in Partial Least Squares and Related Methods, pages 243–255. Springer.
Eslami, A., Qannari, E. M., Kohler, A., and Bougeard, S. (2014). Algorithms for multi-group PLS. J. Chemometrics, 28(3):192–201.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.
González, I., Déjean, S., Martin, P. G., Baccini, A., et al. (2008). CCA: An R package to extend canonical correlation analysis. Journal of Statistical Software, 23(12):1–14.
González, I., Lê Cao, K.-A., Davis, M. J., Déjean, S., et al. (2012). Visualising associations between paired ‘omics’ data sets. BioData Mining, 5(1):19.
Guidi, L., Chaffron, S., Bittner, L., Eveillard, D., Larhlimi, A., Roux, S., Darzi, Y., Audic, S., Berline, L., Brum, J. R., et al. (2016). Plankton networks driving carbon export in the oligotrophic ocean. Nature.
Günther, O. P., Chen, V., Freue, G. C., Balshaw, R. F., Tebbutt, S. J., Hollander, Z., Takhar, M., McMaster, W. R., McManus, B. M., Keown, P. A., et al. (2012). A computational pipeline for the development of multi-marker bio-signature panels and ensemble classifiers. BMC bioinformatics, 13(1):326.
Günther, O. P., Shin, H., Ng, R. T., McMaster, W. R., McManus, B. M., Keown, P. A., Tebbutt, S. J., and Lê Cao, K.-A. (2014). Novel multivariate methods for integration of genomics and proteomics data: applications in a kidney transplant rejection study. Omics: a journal of integrative biology, 18(11):682–695.
Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature medicine, 7(6):673–679.
Labus, J. S., Van Horn, J. D., Gupta, A., Alaverdyan, M., Torgerson, C., Ashe-McNalley, C., Irimia, A., Hong, J.-Y., Naliboff, B., Tillisch, K., et al. (2015). Multivariate morphological brain signatures predict patients with chronic abdominal pain from healthy control subjects. Pain, 156(8):1545–1554.
Lê Cao, K., Rossouw, D., Robert-Granié, C., Besse, P., et al. (2008). A sparse PLS for variable selection when integrating omics data. Statistical applications in genetics and molecular biology, 7:Article 35.
Lê Cao, K.-A., Boitard, S., and Besse, P. (2011). Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC bioinformatics, 12(1):253.
Lê Cao, K.-A., González, I., and Déjean, S. (2009a). integrOmics: an R package to unravel relationships between two omics datasets. Bioinformatics, 25(21):2855–2856.
Lê Cao, K.-A., Lakis, V. A., Bartolo, F., Costello, M.-E., Chua, X.-Y., Brazeilles, R., and Rondeau, P. (2016). MixMC: Multivariate insights into microbial communities. PloS one, 11(8):e0160169.
Lê Cao, K.-A., Martin, P. G., Robert-Granié, C., and Besse, P. (2009b). Sparse canonical methods for biological data integration: application to a cross-platform study. BMC bioinformatics, 10(1):34.
Liquet, B., Lê Cao, K.-A., Hocini, H., and Thiébaut, R. (2012). A novel approach for biomarker selection and the integration of repeated measures experiments from two assays. BMC bioinformatics, 13:325.
Liu, Y., Devescovi, V., Chen, S., and Nardini, C. (2013). Multilevel omic data integration in cancer cell lines: advanced annotation and emergent properties. BMC systems biology, 7(1):14.
Mahana, D., Trent, C. M., Kurtz, Z. D., Bokulich, N. A., Battaglia, T., Chung, J., Müller, C. L., Li, H., Bonneau, R. A., and Blaser, M. J. (2016). Antibiotic perturbation of the murine gut microbiome enhances the adiposity, insulin resistance, and liver disease associated with high-fat diet. Genome medicine, 8(1):1.
Meng, C., Zeleznik, O. A., Thallinger, G. G., Kuster, B., Gholami, A. M., and Culhane, A. C. (2016). Dimension reduction techniques for the integrative analysis of multi-omics data. Briefings in bioinformatics, page bbv108.
Nguyen, D. V. and Rocke, D. M. (2002a). Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 18(9):1216–1226.
Nguyen, D. V. and Rocke, D. M. (2002b). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18(1):39–50.
R Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Ramanan, D., Bowcutt, R., Lee, S. C., San Tang, M., Kurtz, Z. D., Ding, Y., Honda, K., Gause, W. C., Blaser, M. J., Bonneau, R. A., et al. (2016). Helminth infection promotes colonization resistance via type 2 immunity. Science, 352(6285):608–612.
Rohart, F., Eslami, A., Matigian, N., Bougeard, S., and Lê Cao, K.-A. (2016a). MINT: A multivariate integrative method to identify reproducible molecular signatures across independent experiments and platforms. bioRxiv, page 070813.
Rohart, F., Mason, E. A., Matigian, N., Mosbergen, R., Korn, O., Chen, T., Butcher, S., Patel, J., Atkinson, K., Khosrotehrani, K., Fisk, N. M., Lê Cao, K., and Wells, C. A. (2016b). A molecular classification of human mesenchymal stromal cells. PeerJ, 4:e1845.
Rollero, S., Mouret, J.-R., Sanchez, I., Camarasa, C., Ortiz-Julien, A., Sablayrolles, J.-M., and Dequin, S. (2016). Key role of lipid management in nitrogen and aroma metabolism in an evolved wine yeast strain. Microbial cell factories, 15(1):1.
Shah, A. K., Lê Cao, K.-A., Choi, E., Chen, D., Gautier, B., Nancarrow, D., Whiteman, D. C., Baker, P. R., Clauser, K. R., Chalkley, R. J., et al. (2016). Glyco-centric lectin magnetic bead array (LeMBA) - proteomics dataset of human serum samples from healthy, Barrett's esophagus and esophageal adenocarcinoma individuals. Data in Brief, 7:1058–1062.
Singh, A., Gautier, B., Shannon, C. P., Vacher, M., Rohart, F., Tebbutt, S. J., and Lê Cao, K.-A. (2016). DIABLO - an integrative, multi-omics, multivariate method for multi-group classification. bioRxiv, page 067611.
Tenenhaus, A., Philippe, C., Guillemot, V., Lê Cao, K.-A., Grill, J., and Frouin, V. (2014). Variable selection for generalized canonical correlation analysis. Biostatistics, page kxu001.
Tenenhaus, A. and Tenenhaus, M. (2011). Regularized generalized canonical correlation analysis. Psychometrika, 76(2):257–284.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
Witten, D., Tibshirani, R., Gross, S., and Narasimhan, B. (2013). PMA: Penalized Multivariate Analysis. R package version 1.0.9.
Wold, H. (1966). Estimation of principal components and related models by iterative least squares. J. Multivar. Anal., pages 391–420.
Wold, H. (1975). Path models with latent variables: The NIPALS approach. Acad. Press.
Yao, F., Coquery, J., and Lê Cao, K.-A. (2012). Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets. BMC bioinformatics, 13(1):24.
Supplementary Material
Method                    PLS-DA   sPLS-DA        DIABLO          sparse DIABLO          MINT         sparse MINT
Function call             plsda    splsda         block.plsda     block.splsda           mint.plsda   mint.splsda
Parameters                ncomp    ncomp, keepX   design, ncomp   design, ncomp, keepX   ncomp        ncomp, keepX
Performance
  tune, plot.tune                  ✓                              ✓                                   ✓
  perf, plot.perf         ✓        ✓              ✓               ✓                      ✓            ✓
  auroc                   ✓        ✓              ✓               ✓                      ✓            ✓
Sample plot
  plotIndiv               ✓        ✓              ✓               ✓                      ✓            ✓
  plotArrow               ✓        ✓              ✓               ✓                      ✓            ✓
  plotDiablo                                      ✓               ✓
Variable plot
  plotVar                 ✓        ✓              ✓               ✓                      ✓            ✓
  plotLoadings            ✓        ✓              ✓               ✓                      ✓            ✓
  circosPlot                                      ✓               ✓
  cim                     ✓        ✓              ✓               ✓                      ✓            ✓
  network                 ✓        ✓              ✓               ✓                      ✓            ✓
Variable list
  selectVar               ✓        ✓              ✓               ✓                      ✓            ✓
Figure S1: List of the main mixOmics functions for supervised analyses.
Electronic Supporting Information
Sweave and R code for PLS-DA analysis are available on our website at this link.
Sweave and R code for DIABLO analysis are available on our website at this link.
Sweave and R code for MINT analysis are available on our website at this link.