PLS Generalized Linear Regression and Kernel Multilogit Algorithm (KMA) for Microarray Data Classification

We implement extensions of the partial least squares generalized linear regression (PLSGLR) due to Bastien et al. (2005) through its combination with logistic regression and linear discriminant analysis, to get a partial least squares generalized linear regression-logistic regression model (PLSGLR-log), and a partial least squares generalized linear regression-linear discriminant analysis model (PLSGLRDA). These two classification methods are then compared with classical methodologies like the k-nearest neighbours (KNN), linear discriminant analysis (LDA), partial least squares discriminant analysis (PLSDA), ridge partial least squares (RPLS), and support vector machines(SVM). Furthermore, we implement the kernel multilogit algorithm (KMA) by Dalmau et al. (2015)and compare its performance with that of the other classifiers. The results indicate that for both un-preprocessed and preprocessed data, the KMA has the lowest classification error rates.


Introduction
The field of genomics has witnessed a tremendous increase in the amount of data generation due to biotechnological advances like microarrays and next-generation sequencing platforms. These biotechnological advances have made it possible to simultaneously monitor expression levels for thousands of genes, and thus help in solving particular problems related to the identification of molecular variants in them, and their relation to the classification, diagnosis, prognosis and treatment of different conditions. The high dimensional data generated from microarray technology involve many thousands of genes measured simultaneously, a different microarray for each individual. This definitely introduces some noise and unwanted variations that might stem from technical or unknown sources.
In a microarray experiment let n and p be the numbers of the samples and genes respectively, so that the generated data is a n × p matrix. The main challenge with these technologies is that the resultant data are noisy due to biological and technological variations, and at the same time they usually are high dimensional, i.e., they have more variables than cases due to a low sample size, so n << p. This condition makes the direct application of most classical statistical methodology implausible, leading researchers to propose new solutions for this type of problem.
Normally before the down stream analysis of the data generated from DNA microarrays, a preprocessing and normalization stage is performed to remove the noise, filtering out the genes with low expression values, addressing missing values, and standardizing the data via a log-transformation. One of the most used preprocessing procedures for microarray data was proposed by Dudoit et al. (2002), which entails three basic steps, namely: thresholding, filtering out of genes outside of a range of minimum/maximum intensities, and finally, standardization of the expression values by a log transformation (see (Alshamlan et al. (2013); Dudoit et al. (2002))).
This work considers classification problems for microarray data sets under two conditions: un-preprocessed and preprocessed. In the un-preprocessed data all genes available in the study are included, while in the preprocessed only the subset of genes believed to play important roles in the biological problem of interest are used. We extend the Partial Least Squares Generalized Linear Regression (PLSGLR) algorithm of Bastien et al. (2005) by combining it with Logistic Regression, to give PLSGLR-log, and with Linear Discriminant Analysis to come up with PLSGLRDA. Furthermore, we compare their performance with that of the kernel multilogit algorithm (KMA) proposed by (Dalmau et al., 2015), and of the classical methods: the k-Nearest Neighbour (KNN), Ridge Partial Least Squares (RPLS), Partial Least Squares-Linear Discriminant Analysis (PLSDA), the usual Linear Discriminant Analysis (LDA) and the Support Vector Machines (SVM), when applied to a set of microarray data, referred to in this work as the Colon data set (by (Alon et al., 1999)). We evaluate the classifiers with regard to their classification error rates in this data set and compare them.
Our work addresses problems similar to many studies involving classification in microarrays, with typically high dimensional data and low numbers of samples (or subjects). Following a two stage strategy, many involve the use of the original PLS to build the components, even though the response variables are discrete, for example the analysis of (Nguyen and Rocke, 2002a,b); this is intuitively not correct since the original PLS is an algorithm best suited for continuous response variables. And in almost all of the procedures a variable (gene) selection step is implemented, with an accompanying computing cost. This paper describes a procedure suitable for categorical data, and its performance is studied with and without the gene selection step, and compared to that of each of the other classifiers used. An additional advantage of our approach is that the PLSGLR can deal with missing values, unlike the original PLS, commonly used in the literature.
The proposed two stage strategy for the classification problem is described as follows. Step 1: Dimension reduction In this stage, we propose to use PLSGLR to project the high dimensional data to a low dimension space thus resulting in new components (latent variables), which preserve the information in the intrinsic structure of the data.
Step 2: Use of latent variables for classification Analyze the obtained latent variables with the classical statistical classifiers: i PLSGLR components with logistics regression to get the PLSGLRlogistic model denoted as (PLSGLR-log) ii PLSGLR components with linear discriminant analysis model to get PLSGLR-Linear Discriminant Analysis model denoted as (PLSGLRDA) To the best of our knowledge, the proposed combination of PLS generalized linear regression algorithm with logistic and discriminant analysis has not been used before in cases where n << p. The PLS generalized linear regression algorithm is simple, and a good performance when compred to the classical methods would make it an attractive alternative.

Kernel multilogit algorithm (KMA)
The KMA was recently proposed by Dalmau et al. (2015). This algorithm works by first transforming a categorical response variable to a continuous one via a multilogit transformation.
The transformed variable is then used with the explanatory variables in a regression model for classification and prediction. Finally, the new predicted variables are transformed back using the inverse multilogit function to the original space to enable classification.
Let the response variable vector y be categorical with class labels {1, 2, . . . , C}. To classify a discrete variable from predictor variables x, the first step is to transform the response variable y into a new space using the multilogit function. The multinomial logit model with C as the reference category can be given as where f (x; θ i ) = x T θ i . The expected value of y being a multinomial random variable is given by E(y|x) = [Pr(y = 1|x), Pr(y = 2|x), . . . , Pr(y = C|x)] T . Now, denoting t = E(y|x), the original response variable y is not used but instead a transformed version ϑ = logit(t) is used.
The logit transformation is done with C as the reference category as follows In the second step a parametric linear model is proposed and its parameter estimates can be obtained via the standard Bayesian formula Pr(ϑ|x) = Pr(x|ϑ)Pr(ϑ)/Pr(x) where Pr(ϑ|x) is the posterior probability distribution, Pr(x|ϑ) is the likelihood function and Pr(x) is the normalization constant, assuming that ϑ ∈ R C−1 for a given x ∈ R m follows a multivariate normal distribution ϑ|x ∼ N (Θ T x, α 2 I), Θ ∈ R m×C−1 ,Pr(ϑ|x) is also multivariate normally distributed. Furthermore, the prior parameters are assumed to follow a normal distribution, i.e. θ ∼ N (0, β 2 I) where β is known. The parameter matrix Θ is thus estimated by optimizing an equivalent of a regularized least squares function ..,n , . F is the Frobenius norm of a matrix and λ is the regularization parameter. The result is a closed form estimate given byΘ = (X T X+λI) −1 X T ϑ.
To capture non-linearities which may be present, a dual representation Θ = X T Γ is taken so The final step of the algorithm involves prediction/classification given a new set of re- The inverse of a logit function is given by The class labels associated with x new are then computed using the estimated conditional distribution by finding the components that maximize those of t new i.e. using the Bayes rule.
The computed t new is then used to get the class label (ŷ new ) of the new data; for details see Dalmau et al. (2015).

Partial least squares (PLS) and some of its applications in genomics
PLS is a very useful approach because it is able to analyze data with many, noisy, collinear as well as incomplete variables. PLS is usually utilized in data reduction when there is multicollinearity or when the data have more variables than the number of samples. Essentially, the PLS aims at maximizing the covariance between the response variables Y and the predictors X, i.e., cov(X T Y ) of highly multidimensional data by finding a linear subspace of the explanatory variables (Höskuldsson, 1988;Wold et al., 2001). Some literature on PLS can be found in Wold et al. (1984Wold et al. ( , 2001; ?, among others. The research on PLS is still very active due to its ability to address problems associated with the high dimensional data such as multicollinearity and high dimensionality, among others. In the recent past, PLS has been utilized predominantly in high dimensional data in different fields like chemometrics and the "omics" like genomics, proteomics, metabolomics, and many other fields that generate large amounts of data, like spectroscopy (Gromski et al., 2015).

PLS generalized linear regression algorithm
In this section, we present an algorithm that can be applied to any Generalized Linear Regression which was developed by Bastien et al. (2005). Consider the response data y with the explanatory variables x 1 , . . . , x p ; then a PLS-General Linear Regression (PLSGLR) can be written as where θ a conditional expectation of the variable y if its distribution is continuous, or a vector of probabilities if the variable y follows a discrete distribution with a finite support, while g(.) is the link function chosen according to the probability distribution of y. The PLS components are given by t h = p j=1 w * hj x j , j = 1, . . . , p, h = 1, . . . , m. To compute the PLS components, let X = x 1 . . . , x p be a matrix of p centred explanatory variables x j 's. The key objective is to determine m orthogonal PLS components defined as a linear combination of the x j 's. The algorithm is presented as follows: 1. Computation of the first PLS component t 1 : First, the regression coefficients a 1j of x j are computed using the usual GLM procedure of y on x j , j = 1 . . . p. The column vector a 1 which contains a 1j is then normalized : w 1 = a 1 / a 1 . Finally, the component t 1 is computed as t 1 = Xw 1 /w 1 w 1 .

Computation of the second PLS component t 2 :
Involves the computation of the linear model coefficients a 2j of x j in the GLM setting of y on t 1 and x j , j = 1, ..., p. Since the main idea of PLS is to create the orthogonal components t 2 , the component t 1 is added as a variable in estimating y on t 1 and x j , j = 1, ..., p. This is because the structure of PLSGLR does not allow the residuals of y to be obtained in each iteration that would aid in construction of orthogonal components. The column vector a 2 which contains a 2j is normalized: w 2 = a 2 / a 2 and thereafter, the residual matrix X 1 is obtained via the regression of X on t 1 . The use of residual matrix in the attainment of the next component ensures orthogonality between the different components. The component t 2 is calculated by t 2 = X 1 w 2 /w 2 w 2 . Finally, t 2 is expressed in terms of X : t 2 = Xw * 2 . 5 Applications to real data sets

Some exploratory analysis
In this study, we will describe in detail the analysis of the Colon data by Alon et al. (1999), obtained from the R package plsgenomics, which consists of a (62 × 2000) matrix giving the expression levels of 2000 genes for 62 colon tissue samples.
An exploratory analysis of the data was done in order to visualize the differences in the un-preprocessed and preprocessed microarray data sets. The preprocessing is done using the R package plsgenomics see https://rdrr.io/cran/plsgenomics/, that implements the recommendations of (Dudoit et al., 2002). To visualize the differences between the preprocessed and un-preprocessed data sets, we consider the pairs of box plots, relative log expression (RLE), and principal components analysis (PCA) plots presented in Figures 1,2, 3, 4 and 5 respectively. Un−preprocessed colon data sample expressi on Figure 1: Box plot for the un-preprocessed colon data. The box plot for un-preprocessed data clearly shows that the data are noisy and have a lot of variations. The data have some unwanted variations that are expected to affect the analysis. They also lack symmetry.  Figure 2: Box plot for the preprocessed colon data. This plot presents less variations. The data seem to have a symmetric distribution and do not show the presence of unwanted variation. From the two figures, it is expected that the preprocessed data would be easier to analyze.
The same pair of data sets is examined using RLE plots, to show how the preprocessed data compares with the un-preprocessed data set with regard to the batch effect or any other abnormality. The RLE plots have been extensively used in studies of microarray data to reveal the effectiveness of data normalization; for an example see Gagnon-Bartsch and Speed (2011).
The RLE plots are simple yet very powerful in the visualization of data to detect unwanted variations. To understand how an RLE plot is constructed, consider a data matrix X p×n where p is the number of genes while n the number of microarray samples, and so the element of the data matrix x ij represents the i th gene in the j th sample. The RLE plot is then constructed by first calculating the median across each of the p rows, and then substracting the respective median across each row of the data matrix X, i.e (x ij − median x i * ). The median is used because it is robust and not affected by outliers. A box plot is then generated for each of the n samples, and a good one will be centered around zero and its width (interquartile range) should be equal to or less than 0.2 (see (Gagnon-Bartsch and Speed, 2011)).  Figure 3: RLE plots for the un-preprocessed and preprocessed colon data. The RLE plot for the un-preprocessed data shows the presence of a lot of heterogeneity, implying that the data have variations that do not necessarily come from biological factors. However, the RLE plot for the processed data shows homogeneity and lack of unwanted noise, and should give better results when analyzed statistically.
Finally we compare the ease of classification between the un-preprocessed and preprocessed data. The simplest way to visualize the separability of categories in a given data set is the use of principal components analysis (PCA) plots.  Gagnon- Bartsch and Speed (2011) note that one of the key challenges of the removal of unwanted variation is the difficulty in distinguishing the unwanted variations from the biological variation of interest. Furthermore, they note that the most appropriate way to deal with unwanted variation depends on the final objective of the analysis, for instance: differential expression (DE), classification, or clustering.

Analysis of the un-preprocessed data
In this analysis, we compare the performance of our proposed model extensions PLSGLR-log, PLSGLRDA and the KMA Dalmau et al. (2015) to that of the classical methods when the data has neither been preprocessed nor variables been selected, thus testing the performance of the classification algorithms in the presence of noise. The performance of the methodologies is then compared using a 10 fold cross validation (10-CV) and the corresponding classification error rates are computed. The results are presented in Table 2. A particular method is judged to be the "best" if it has a lower classification error rate relative to the other methods, otherwise it is a poor classifier. The results based on minimal cross validation classification error rates indicate that for the Colon data, the KMA emerges as the best, followed by PLSDA, and RPLS, while the worst were KNN and PLSGLR-log.

Analysis of preprocessed data
During the preprocessing of microarray data the feature selection step is usually performed.
This is because out of the thousands of variables (genes) generated, only a handful may play an important role towards the biological problem of interest. The thousands of data points are likely to be noisy due to biological or technical reasons. Thus the feature selection extracts a subset of the genes that are most informative (optimum subset of features). This reduces the noise by removing irrelevant or redundant features (Awada et al., 2012;Dudoit et al., 2002).
Most commonly used feature selection methods involve ranking the genes based on some value of a univariate statistic, like the t-statistic, the F-statistic, or the Wilcoxon and Kruskal-Wallis statistics. A cut-off point based on either the number of genes or the p-value is imposed, to determine the number of variables to be used. Dudoit et al. (2002) suggest a gene selection method based on ranking. This is achieved by finding the ratio of between-group to withingroup sum of squares (BSS/W SS) so that for a gene j, wherex .j andx kj are the average expression levels of gene j and across all samples in class k, respectively. The p genes with the biggest ratio are selected. In this study, we adopted the Dudoit et al. (2002) method of feature selection.
The preprocessing and the gene selection were performed using the recommendations of Dudoit et al. (2002). The top p genes were thus selected using Equation 6 for the implementation of the classification methods.
The classification error rates for the various methodologies when applied to the data under consideration are presented in Table 3. The results indicate that KMA was the best, followed by RPLS, PLSDA. PLSGRDA performed equally well, while KNN emerged as the worst classifier, also in every comparison.

Summary and conclusions
In this study, two extensions of the PLSGLR were considered in addition to the KMA for a comparative study with some classical classification methodologies, namely KNN, LDA, PLSDA, RPLS and SVM, when applied to one commonly used microarray data set. The data were considered when un-preprocessed and when preprocessed. For both the un-preprocessed and preprocessed cases, the KMA emerged as a clear "winner" based on lower classification error rates. The KMA algorithm can therefore be recommended for classification problems involving noisy and non-noisy data. This could be due to the fact that the chosen kernels map the samples to a higher dimensional space, where they become linearly separable. This leads to a better classification ability by the KMA. Furthermore, the three new algorithms can therefore be considered as an addition to the existing literature for the microarray data classification problems.