sgcca.Rd
SGCCA extends RGCCA to address the issue of variable selection. Specifically, RGCCA is combined with an L1-penalty, which gives rise to Sparse GCCA (SGCCA), implemented in the function sgcca().

Consider \(J\) matrices \(X_1, X_2, ..., X_J\) that represent \(J\) sets of variables observed on the same set of \(n\) individuals. The matrices must have the same number of rows, but may (and usually will) have different numbers of columns. Blocks are not necessarily fully connected within the SGCCA framework, so using SGCCA requires a user-specified design matrix \(C\) that characterizes the connections between blocks. Elements of the symmetric design matrix \(C = (c_{jk})\) are equal to 1 if block \(j\) and block \(k\) are connected, and 0 otherwise.

The SGCCA algorithm is very similar to the RGCCA algorithm and keeps the same monotone convergence properties: the bounded criterion to be maximized increases at each step of the iterative procedure and reaches a stationary point at convergence. Moreover, using a deflation strategy, sgcca() can compute several SGCCA block components (specified by ncomp) for each block. Components of the same block are guaranteed to be orthogonal under this deflation strategy. This implementation uses the so-called symmetric deflation, i.e. each block is deflated with respect to its own component. The number of components may differ from one block to another.
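For orientation, the criterion described above can be written as the following optimization problem. This is a sketch based on our reading of the formulation in the reference listed on this page; the symbols \(g\) (the scheme function) and \(s_j\) (the L1 bound for block \(j\)) are notation introduced here, with \(s_j = c_1[j] \sqrt{p_j}\) taken from the description of the c1 argument.

$$\max_{a_1, \ldots, a_J} \sum_{j \ne k} c_{jk} \, g\left(\mathrm{cov}(X_j a_j, X_k a_k)\right) \quad \text{subject to} \quad \lVert a_j \rVert_2 = 1 \ \text{and} \ \lVert a_j \rVert_1 \le s_j, \quad j = 1, \ldots, J.$$

Under this reading, \(g(x) = x\) corresponds to scheme = "horst", \(g(x) = x^2\) to "factorial", and \(g(x) = |x|\) to "centroid".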
sgcca(A, C = 1 - diag(length(A)), c1 = rep(1, length(A)), ncomp = rep(1, length(A)), scheme = "centroid", scale = TRUE, init = "svd", bias = TRUE, tol = .Machine$double.eps, verbose = FALSE)
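A minimal base-R sketch of how a design matrix might be built, assuming a hypothetical three-block setting in which blocks 1 and 2 are each connected to block 3 but not to each other. The object names are illustrative only; nothing here calls the package.

```r
# Hypothetical setting: J = 3 blocks, with blocks 1 and 2 each linked to block 3.
J <- 3
C <- matrix(0, J, J)
C[1, 3] <- C[3, 1] <- 1
C[2, 3] <- C[3, 2] <- 1

# Sanity checks: symmetric and binary. The zero diagonal mirrors the default
# design; it is a convention assumed in this sketch.
stopifnot(isSymmetric(C), all(C %in% c(0, 1)), all(diag(C) == 0))

# The default in the usage above, 1 - diag(J), connects every pair of blocks.
C_complete <- 1 - diag(J)
```

Setting both C[j, k] and C[k, j] in one statement keeps the matrix symmetric by construction.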
A | A list that contains the \(J\) blocks of variables \(X_1, X_2, ..., X_J\). |
---|---|
C | A design matrix that describes the relationships between blocks (default: complete design). |
c1 | Either a \(1 \times J\) vector or a \(\max(ncomp) \times J\) matrix encoding the L1 constraints applied to the outer weight vectors. Elements of c1 vary between \(1/\sqrt{p_j}\) and 1 (larger values of c1 correspond to less penalization). If c1 is a vector, the L1 constraint is the same for all components of a given block: $$\forall h, \ \lVert a_{j,h} \rVert_{L_1} \le c_1[j] \sqrt{p_j},$$ with \(p_j\) the number of variables of \(X_j\). If c1 is a matrix, row \(h\) defines the constraints applied to the weights of component \(h\): $$\forall h, \ \lVert a_{j,h} \rVert_{L_1} \le c_1[h,j] \sqrt{p_j}.$$ |
ncomp | A \(1 \times J\) vector that contains the number of components for each block (default: rep(1, length(A)), which means one component per block). |
scheme | Either "horst", "factorial" or "centroid" (Default: "centroid"). |
scale | If scale = TRUE, each block is standardized to zero means and unit variances and then divided by the square root of its number of variables (default: TRUE). |
init | Mode of initialization used in the SGCCA algorithm, either by Singular Value Decomposition ("svd") or random ("random") (default: "svd"). |
bias | A logical value indicating whether the biased (TRUE) or unbiased (FALSE) estimator of the variance/covariance is used (default: TRUE). |
tol | Stopping value for convergence (default: .Machine$double.eps). |
verbose | Will report progress while computing if verbose = TRUE (default: FALSE). |
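To make the role of c1 concrete, here is a minimal base-R sketch of one common way to obtain a unit-norm weight vector that satisfies an L1 bound of \(c_1 \sqrt{p}\): soft-thresholding followed by L2 normalization, with the threshold found by bisection. The helper names `soft_threshold` and `l1_project` are hypothetical and are not functions of the package; the package's internal solver may differ in its details.

```r
# Soft-thresholding operator: shrink each entry towards zero by lambda.
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

# Hypothetical helper: return a unit-L2-norm vector proportional to
# soft_threshold(x, lambda), with lambda chosen by bisection so that the
# L1 norm is at most c1 * sqrt(length(x)).
# Assumes the largest |x| is unique (true almost surely for continuous data).
l1_project <- function(x, c1, n_iter = 40) {
  p <- length(x)
  stopifnot(c1 >= 1 / sqrt(p), c1 <= 1)
  bound <- c1 * sqrt(p)
  unit <- function(v) v / sqrt(sum(v^2))
  if (sum(abs(unit(x))) <= bound) return(unit(x))  # constraint already met
  lo <- 0
  hi <- max(abs(x))
  for (i in seq_len(n_iter)) {
    mid <- (lo + hi) / 2
    v <- soft_threshold(x, mid)
    # Too dense: raise the threshold. Sparse enough: try a smaller one.
    if (sum(abs(unit(v))) > bound) lo <- mid else hi <- mid
  }
  unit(soft_threshold(x, hi))
}

# Illustrative input (made-up numbers): smaller c1 gives more exact zeros.
x <- c(3, -2, 1, 0.5, 0.1, -0.05, 0.02, 0.01, 0.3)
a_sparse <- l1_project(x, c1 = 0.5)
a_dense  <- l1_project(x, c1 = 1)
sum(a_sparse != 0)
sum(a_dense != 0)
```

With c1 = 1 the bound equals \(\sqrt{p}\), which every unit-norm vector satisfies, so no entry is set to zero; lowering c1 towards \(1/\sqrt{p}\) forces progressively more entries to zero.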
A list of class sgcca with the following elements:
Y | A list of \(J\) elements. Each element of Y is a matrix that contains the SGCCA components for each block. |
---|---|
a | A list of \(J\) elements. Each element of a is a matrix that contains the outer weight vectors for each block. |
astar | A list of \(J\) elements. Each element of astar is a matrix such that Y[[j]][, h] = A[[j]] %*% astar[[j]][, h]. |
C | A design matrix that describes the relationships between blocks (user specified). |
scheme | The scheme chosen by the user (user specified). |
c1 | A vector or matrix that contains the value of c1 applied to each block \(X_j\), \(j = 1, \ldots, J\), and each dimension (user specified). |
ncomp | A \(1 \times J\) vector that contains the number of components for each block (user specified). |
crit | A vector that contains the values of the objective function at each iteration. |
AVE | Indicators of model quality based on the Average Variance Explained (AVE): AVE (for one block), AVE (outer model), AVE (inner model). |
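The relationship between Y, a and astar, and the orthogonality obtained through deflation, can be illustrated with a small base-R sketch. It assumes a single simulated block and uses random unit-norm vectors as stand-ins for the outer weights, so it does not reproduce an SGCCA solution; all object names are hypothetical.

```r
set.seed(1)
X <- scale(matrix(rnorm(20 * 5), 20, 5))  # hypothetical block: 20 rows, 5 variables

unit <- function(v) v / sqrt(sum(v^2))

# Stand-in weight vectors (random, NOT SGCCA weights).
a1 <- unit(rnorm(5))
a2 <- unit(rnorm(5))

# First component, computed on the original block.
y1 <- X %*% a1

# Deflate the block with respect to its own component.
p1 <- crossprod(X, y1) / sum(y1^2)        # regression of each column of X on y1
X_defl <- X - y1 %*% t(p1)

# Second component, computed on the deflated block.
y2 <- X_defl %*% a2

# Weights that give y2 directly from the original block: y2 = X %*% astar2.
astar2 <- a2 - a1 * drop(crossprod(p1, a2))
astar <- cbind(a1, astar2)

drop(crossprod(y1, y2))        # numerically zero: the components are orthogonal
max(abs(X %*% astar2 - y2))    # numerically zero: astar maps X to the components
```

Orthogonality here comes from the deflation step alone: y1 is orthogonal to every column of the deflated block, whatever weights are applied to it afterwards.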
Tenenhaus, A., Philippe, C., Guillemot, V., Le Cao, K. A., Grill, J., and Frouin, V., "Variable selection for generalized canonical correlation analysis," Biostatistics, vol. 15, no. 3, pp. 569-583, 2014.
#############
# Example 1 #
#############
if (FALSE) {
# Download the dataset's package at http://biodev.cea.fr/sgcca/.
# --> gliomaData_0.4.tar.gz

require(gliomaData)
data(ge_cgh_locIGR)

A <- ge_cgh_locIGR$multiblocks
Loc <- factor(ge_cgh_locIGR$y)
levels(Loc) <- colnames(ge_cgh_locIGR$multiblocks$y)
C <- matrix(c(0, 0, 1, 0, 0, 1, 1, 1, 0), 3, 3)
tau = c(1, 1, 0)

# rgcca algorithm using the dual formulation for X1 and X2
# and the dual formulation for X3
A[[3]] = A[[3]][, -3]
result.rgcca = rgcca(A, C, tau, ncomp = c(2, 2, 1), scheme = "factorial", verbose = TRUE)

# sgcca algorithm
result.sgcca = sgcca(A, C, c1 = c(.071,.2, 1), ncomp = c(2, 2, 1),
                     scheme = "centroid", verbose = TRUE)

############################
# plot(y1, y2) for (RGCCA) #
############################
layout(t(1:2))
plot(result.rgcca$Y[[1]][, 1], result.rgcca$Y[[2]][, 1], col = "white",
     xlab = "Y1 (GE)", ylab = "Y2 (CGH)", main = "Factorial plan of RGCCA")
text(result.rgcca$Y[[1]][, 1], result.rgcca$Y[[2]][, 1], Loc, col = as.numeric(Loc), cex = .6)
plot(result.rgcca$Y[[1]][, 1], result.rgcca$Y[[1]][, 2], col = "white",
     xlab = "Y1 (GE)", ylab = "Y2 (GE)", main = "Factorial plan of RGCCA")
text(result.rgcca$Y[[1]][, 1], result.rgcca$Y[[1]][, 2], Loc, col = as.numeric(Loc), cex = .6)

############################
# plot(y1, y2) for (SGCCA) #
############################
layout(t(1:2))
plot(result.sgcca$Y[[1]][, 1], result.sgcca$Y[[2]][, 1], col = "white",
     xlab = "Y1 (GE)", ylab = "Y2 (CGH)", main = "Factorial plan of SGCCA")
text(result.sgcca$Y[[1]][, 1], result.sgcca$Y[[2]][, 1], Loc, col = as.numeric(Loc), cex = .6)
plot(result.sgcca$Y[[1]][, 1], result.sgcca$Y[[1]][, 2], col = "white",
     xlab = "Y1 (GE)", ylab = "Y2 (GE)", main = "Factorial plan of SGCCA")
text(result.sgcca$Y[[1]][, 1], result.sgcca$Y[[1]][, 2], Loc, col = as.numeric(Loc), cex = .6)

# sgcca algorithm with multiple components and different L1 penalties for each components
# (-> c1 is a matrix)
init = "random"
result.sgcca = sgcca(A, C, c1 = matrix(c(.071,.2, 1, 0.06, 0.15, 1), nrow = 2, byrow = TRUE),
                     ncomp = c(2, 2, 1), scheme = "factorial", scale = TRUE, bias = TRUE,
                     init = init, verbose = TRUE)

# number of non zero elements per dimension
apply(result.sgcca$a[[1]], 2, function(x) sum(x!=0))
#(-> 145 non zero elements for a11 and 107 non zero elements for a12)
apply(result.sgcca$a[[2]], 2, function(x) sum(x!=0))
#(-> 85 non zero elements for a21 and 52 non zero elements for a22)

init = "svd"
result.sgcca = sgcca(A, C, c1 = matrix(c(.071,.2, 1, 0.06, 0.15, 1), nrow = 2, byrow = TRUE),
                     ncomp = c(2, 2, 1), scheme = "factorial", scale = TRUE, bias = TRUE,
                     init = init, verbose = TRUE)
}