ResearchCharacteristics of predictor sets found using differential prioritizationGippsland School of Information Technology, Monash University, Churchill, VIC 3842, Australia
Algorithms for Molecular Biology 2007, 2:7doi:10.1186/1748-7188-2-7 The electronic version of this article is the complete one and can be found online at: http://www.almob.org/content/2/1/7
©
2007 Ooi et al; licensee BioMed Central Ltd. AbstractBackgroundFeature selection plays an undeniably important role in classification problems involving high dimensional datasets such as microarray datasets. For filter-based feature selection, two well-known criteria used in forming predictor sets are relevance and redundancy. However, there is a third criterion which is at least as important as the other two in affecting the efficacy of the resulting predictor sets. This criterion is the degree of differential prioritization (DDP), which varies the emphases on relevance and redundancy depending on the value of the DDP. Previous empirical works on publicly available microarray datasets have confirmed the effectiveness of the DDP in molecular classification. We now propose to establish the fundamental strengths and merits of the DDP-based feature selection technique. This is to be done through a simulation study which involves vigorous analyses of the characteristics of predictor sets found using different values of the DDP from toy datasets designed to mimic real-life microarray datasets. ResultsA simulation study employing analytical measures such as the distance between classes before and after transformation using principal component analysis is implemented on toy datasets. From these analyses, the necessity of adjusting the differential prioritization based on the dataset of interest is established. This conclusion is supported by comparisons against both simplistic rank-based selection and state-of-the-art equal-priorities scoring methods, which demonstrates the superiority of the DDP-based feature selection technique. Reapplying similar analyses to real-life multiclass microarray datasets provides further confirmation of our findings and of the significance of the DDP for practical applications. ConclusionThe findings have been achieved based on analytical evaluations, not empirical evaluation involving classifiers, thus providing further basis for the usefulness of the DDP and validating the need for unequal priorities on relevance and redundancy during feature selection for microarray datasets, especially highly multiclass datasets. BackgroundThe aim of feature selection is to form, from all available features in a dataset, a relatively small subset of features capable of producing the optimal classification accuracy. This subset is called the predictor set. A feature selection technique is made of two components: the predictor set scoring method (which evaluates the goodness of a candidate predictor set); and the search method (which searches the gene subset space for the predictor set based on the scoring method). The technique becomes wrapper-based when classifiers are invoked in the predictor set scoring method. Otherwise, the technique is filter-based, which is the focus of this study. An important principle behind most filter-based feature selection studies can be summarized by the following statement: A good predictor set should contain features highly correlated to the target class concept, and yet uncorrelated with each other [1]. The predictor set attribute referred to in the first part of this statement, 'relevance', is the backbone of rank-based feature selection techniques. The aspect alluded to in the second part, 'redundancy', refers to pairwise relationships between all pairs of features in the predictor set. The relevance of a predictor set tells us how well the predictor set is able to distinguish among different classes. The redundancy in a predictor set indicates the amount of similarity among the members of the predictor set, or rather, the amount of repetitions in terms of the information conveyed by the members of the predictor set. Previous studies [1,2] have based their feature selection techniques on the concept of relevance and redundancy having equal importance in the formation of a good predictor set. We call the predictor set scoring methods used in such correlation-based feature selection techniques equal-priorities scoring methods. On the other hand, it is demonstrated in [3] using a 2-class problem that seemingly redundant features may improve the discriminant power of the predictor set instead, although it remains to be seen how this scales up to multiclass domains with thousands of features. A study was implemented on the effect of varying the importance of minimizing redundancy in predictor set evaluation in [4]. However, due to its use of a relevance score that is inapplicable to multiclass problems, the study was limited to only binary classification. Currently, when it comes to the use of filter-based feature selection for multiclass molecular classification, three popular recommendations are: 1) no selection [5,6]; 2) select based on relevance alone [5,7]; and finally, 3) select based on relevance and redundancy [2,8]. Thus, so far, relevance and redundancy are the two existing criteria which have ever been used in predictor set scoring methods for multiclass molecular classification. To these two criteria we introduce one modification and a new criterion in our previous study [9]: • Antiredundancy, which is a parameter opposite to redundancy in terms of quality and thus is to be maximized along with relevance. Accordingly, instead of maximizing relevance and minimizing redundancy, we now maximize both relevance and antiredundancy. • Aside from relevance and antiredundancy/redundancy, there is a third criterion in feature selection which is necessary for the formation of the predictor set. The third criterion is the degree of differential prioritization (DDP), which represents the relative importance placed between relevance and antiredundancy. DDP compels the search method to prioritize the optimization of one of the two criteria (of relevance or antiredundancy) at the cost of the optimization of the other. In other words, DDP controls the balance between the two requirements in feature selection (maximizing relevance and maximizing antiredundancy). Therefore, unlike other existing correlation-based techniques, the novelty of the DDP-based feature selection technique is that it does not take for granted that the optimizations of both elements of relevance and antiredundancy are to have equal priorities in the search for the predictor set [10,11]. DDP is represented by a variable α which can take any value from 0 to 1. Decreasing the value of α forces the search method to put more priority on maximizing antiredundancy at the cost of maximizing relevance. Raising the value of α increases the emphasis on maximizing relevance (and at the same time decreases the emphasis on maximizing antiredundancy) during the search for the predictor set [10,11]. A predictor set found using a larger value of α contains more features with strong relevance to the target class concept, but also more redundancy among these features. Conversely, a predictor set obtained using a smaller value of α contains less redundancy among its member features, but at the same time also has fewer features with strong relevance to the target class concept. At α = 0.5, we get an equal-priorities scoring method. At α = 1, the feature selection technique becomes rank-based. Thus, the beauty of the DDP concept is that it subsumes the two existing concepts in feature selection which are represented by equal-priorities scoring methods and rank-based techniques. A large body of our work has provided empirical support regarding the efficacy of the DDP concept in feature selection [9-12], including comparisons to other feature selection techniques on highly multiclass microarray datasets in [11]. However, we have yet to establish the fundamental strengths and merits of the DDP-based feature selection technique. This is precisely the aim of this paper, which is to be realized through a simulation study involving vigorous analyses of predictor sets found using the DDP-based feature selection technique and simple but illustrative examples using toy datasets. To generate toy datasets for this purpose, we employ two models which are well-known and recognized not only in the domains of molecular classification and microarray analysis but also conventional data mining [12]. Later in this paper, we also show how close conditions in real-life multiclass microarray datasets resemble those of our toy datasets. Additional advantages of toy datasets include the unlimited number of datasets we can generate (vs. the limited number of available real-life microarray datasets [12]); the control we are able to exercise over dataset characteristics such as the number of classes and features; and prior knowledge of the members of the ideal predictor set, which provides the ultimate means for measuring the efficacy of the feature selection technique without involving the inductions of actual classifiers. The organization of the paper is as follows: Beginning with descriptions of the models used to produce the toy datasets: the OVA (one-vs.-all) and PW (pairwise) models, we proceed to analyze the characteristics of the predictor sets obtained from each of the toy datasets and then summarize the properties of the predictor sets which are dependent on the associated DDP values. After reapplying the same set of analyses to eight real-life multiclass microarray datasets, we demonstrate how the DDP works for datasets with different number of classes. We then follow with further discussion of the results and present the conclusions of the study. Finally, in the Methods section, we describe the DDP-based feature selection technique and the real-life datasets used in this study. ResultsToy datasetsThe aim of toy datasets is to provide simple but clear and demonstrative examples on the importance of the correct choice of the value of the DDP in forming the best predictor set. Furthermore, another advantage of toy datasets is the fact that we know exactly just how large a predictor set should be for each case, facilitating the task of determining the value of the maximum size of the predictor set, P. It is widely accepted that over-expression or under-expression (suppression) of genes causes the difference in phenotype among samples of different classes. The categorization of gene expression is given as follows. • A gene is over-expressed: if its expression value is above baseline. • A gene is under-expressed: if its expression value is below baseline. • Baseline interval: the normal range of expression value. As one of the data processing steps recommended in [13], logarithmic transformation are applied on microarray datasets: base 10 log for data derived from oligonucleotide (Affymetrix) platform and base 2 log for data derived from cDNA (two-color) platform. Later, another of the data processing steps, normalization, is conducted. Normalization involves the standardization of the gene expression data by mean-centering so that the samples have mean 0 across genes [13]. The purpose of normalization is to prevent the expression levels in one particular sample from dominating the average expression levels across samples [14]. (This normalization is not to be confused with dye normalization, which is performed in an earlier stage of data processing.) Since the result of normalization is that the mean expression across all genes in a sample is 0, the 'average' genes in a sample have expression values of or close to 0. As the 'average' genes are associated with the baseline or the normal range of expression, the value 0 denotes the center of the baseline interval. Over-expression is represented by positive values and under-expression by negative values. With this categorization, we next employ two well-known paradigms leading to the OVA and PW models, which are then used to generate two different sets of toy datasets. One-vs.-all (OVA) modelThe crux of the OVA concept has gained wide, albeit tacit, acceptance among researchers involved in gene expression analysis. The fact that particular genes are only over-expressed in tissues of a certain type of cancer, and not any other types of cancer or normal tissues [6], is part of the domain knowledge. Hence the term 'marker' – for genes that mark the particular cancer associated with them. In the OVA model, certain groups of genes, also called the 'marker genes' are only over-expressed (or under-expressed) in samples belonging to a particular class and never in samples of other classes. This model emphasizes that a group of marker genes is specific to one class. Therefore for a K-class dataset, there are K different groups of marker genes. Let us denote as G the number of genes in each group of marker genes, Xmax and Xmin the maximum and minimum limits, respectively, to the absolute value of the class means for the whole dataset. Thus, for the g-th gene in a group of marker genes, the maximum limit to the absolute value of the class means is defined as: xmax,g = Xmax - (ΔX)(g - 1)(1) where g = 1, 2, ..., G, and For the g-th gene in a group of marker genes, the difference between the class means of subsequent classes is defined in the following manner: The purpose of equations (1), (2), and (3) is to produce the following effect: We would like to vary the class means such that there is an imbalance or inequality in terms of class means among the K classes. The reasons are firstly to mimic a condition prevalent in multiclass microarray datasets (imbalance among classes in terms of class means even after normalization), especially in datasets with large number of classes; and secondly, to present a challenge to the feature selection technique in choosing sufficiently relevant but non-redundant genes. We will provide further elucidation on the second reason later in this section. Another purpose of the equations is to generate genes with varying relevance in each group of marker genes. Based on equations (1), (2), and (3), the first gene in a group of marker genes (the gene associated with g = 1) has the strongest relevance among the members of that group of marker genes. Accordingly, the gene with the weakest relevance is the last gene in a group of marker genes (the gene associated with g = G). The reason for doing this is also to present a challenge to the feature selection technique in choosing sufficiently relevant but non-redundant genes. Next, initialize a matrix M: = (μi, k)N × K of zeros where N is the total number of genes in the dataset, and, in this case, is the product of G and K. This is the matrix of class means, whose element, μi,k, represents the mean of gene i across samples belonging to class k (k = 1, 2, ..., K): μ(g - 1)K + k, k = (-1g)[xmax, g - (Δxg)(k - 1)](4) The [(g - 1)K + k]-th gene is the g-th member of the k-th group of marker genes and therefore has non-zero class mean for class k and zero class means for all other classes – the archetypal OVA trait. The term (-1g) serves to change the sign of the class mean at different values of g so as to produce both over- and under-expressed marker genes. Standard deviation among samples of the same class, or class standard deviation, is set to 1 for all instances, σi,k = 1 for all k and i. For all k, a total of m samples are generated for class k using Gaussian distribution of mean μi,k and standard deviation σi,k for gene i. In Table 1, an entry on the i-th row and k-th column represents the class mean of class k for gene i, where i = [(g - 1)K + k], and therefore gene i is the g-th member of the k-th group of marker genes. We can see that using relevance alone as a criterion, and with uniform class size, marker genes associated with class 1 and 4 will always be favored more than marker genes specific to any other classes, regardless of the value of g. Including antiredundancy as the second criterion will obviate this imbalanced predilection – therein lies the reason for us to use unequal values for class means among different classes. But how much weight is to be assigned to relevance, and how much to antiredundancy? Table 1. A 4-class example from the OVA model. μi, k represents the mean of gene i across samples belonging to class k. The ostensible answer would be equal weights, which is the foundation of existing equal-priorities scoring methods. But as mentioned previously in the Background section, it has been implied that antiredundancy is not as important as relevance for the 2-class problem [3] – this is obvious in case of our OVA toy dataset; any subset of sufficiently relevant genes is capable of differentiating between the two classes. Hence we ask the questions which motivate the concept of the DDP: If at K = 2, antiredundancy is not as important as relevance, will this change as the number of classes increases (an important theme in multiclass classification studies)? As K increases, might not the importance of antiredundancy (w.r.t. relevance) increase as well? If yes, is there a point where antiredundancy eventually overcomes relevance in terms of importance as a criterion in feature selection? These questions are to be answered from the analyses in this study. Pairwise (PW) modelIn the PW model, for a given pair of classes, a group of marker genes only distinguishes samples from one class of the pair of classes against samples from the other class of the pair of classes. As implied by its name, this model represents the 1-vs.-1 paradigm as opposed to the 1-vs.-others of the OVA model. For a K-class dataset, there are As is the case in the OVA model, we denote as G the number of genes in each group of marker genes, Xmax and Xmin the maximum and minimum limits, respectively, to the absolute value of the class means for the whole dataset. The definitions of xmax,g, ΔX, and Δxg are the same as for the OVA model. Initialize a matrix M: = (μi,k)N × K of zeros where N is the total number of genes in the dataset, and, in this case, is the product of G and for b = 1 and b = 2. For the PW model, the The procedure for the generation of datasets is similar to that of the OVA model. Experiment settingsIn this study, for both models, Xmax and Xmin are set to 100 and 1 respectively, while the number of samples per class, or class size, m, is set to 100 uniformly for all classes. Ten values of α are tested from 0.1 to 1 with equal intervals of 0.1, α denoting the value of the DDP. For both models, the number of genes in each group of marker genes, G, is set to 3, 5, 10, 20, and 30. We test for K = 2 to K = 30, K denoting the number of classes in a dataset. Since no inductions of classifiers are to be implemented in this study, whole datasets are used as training sets during feature selection. For toy datasets generated from the OVA model, the minimum predictor set size necessary to differentiate among the K classes is K - 1. The optimal predictor set is actually any subset of K - 1 genes from the first K of the marker genes (i.e., at g = 1) generated using the class means defined in equation (4). In case of toy datasets based on the PW model, the optimal predictor set is any subset S of K - 2 genes from the first where |S| = K - 2, Therefore for datasets generated from the OVA model, P is set to K - 1 and for those from the PW model, P is set to K - 2. Separation of classesA natural way to measure separation of classes is the distance between pairs of class centers. We use two popular metrics, the Euclidean and the Manhattan (or taxicab) distances. At the end of the "One-vs.-all (OVA) model" subsection under the Results section, we discuss a preceding study on feature selection [3] which inspired the DDP concept. The authors of that study employ a form of separation of classes to demonstrate that a redundant feature may enhance the predictor set's ability to distinguish between two classes in a 2-class problem (thus implying that antiredundancy is not as important as relevance for the 2-class problem). This form of separation of classes corresponds to the Manhattan distance used in our study. In a 2-class problem, the authors of [3] first present two features from a toy dataset which are both relevant but redundant w.r.t each other, contained in a predictor set distinguishing between the two classes. Then, after a 45° rotation of those two features, the authors of that study show that the Manhattan distance between the class centers along one axis is now greater by a factor of We observe that the Euclidean distance remains the same before and after the transformation in that study. Therefore we have included the Euclidean distance as another form of separation of classes to study, if any, the differences between the two distances in the context of the DDP. Moreover, the Euclidean distance is as popularly used as the Manhattan distance in the field of intelligent data analysis. For the q-th pair of classes, Cq = {c1,q, c2,q} (where in a K-class problem q = 1, 2,...,
where θ denotes The Manhattan distance between the q-th pair of classes as measured by Sα is computed as follows: Averaging across all where θ denotes If there is more than one value of α satisfying equations (9) or (12), the mean among these values is taken as Figure 1 shows that the number of classes, K, influences the value of
Conversely, we find this to be untrue in terms of the Manhattan distance (Figure 2). Regardless of the number of classes, K, the value of
Principal component analysis (PCA)PCA linearly transforms the data such that the greatest amount of variance among samples comes to lie along the axis representing the first principal component (PC). Similarly, the second PC contains the second largest variance among samples, and so on. An important property of the PCs is that a PC is always orthogonal to the adjacent PC. In addition to analyzing the predictor sets in the original projection, we investigate the characteristics of the predictor sets after transformation by PCA. In the original form, the data are characterized along axes representing members of the predictor set (original feature space). After transformation by PCA, data are characterized along axes representing the PCs derived from the members of the predictor set (PCA-transformed space or PC space). The input data matrix is never mean-centered throughout the transformation procedures – this is to enable comparisons in terms of distance metrics between data in original feature space and data in PC space later in this study. (For instance, in this manner, the Euclidean distance remains constant in both original feature space and PC space.) The sole effect of not mean-centering the dataset is that the first PC will span the variance characterized by the overall distance of the dataset from the origin [15]. In case of our models (OVA and PW), marker genes contain non-zero class mean for each of the classes (OVA model) or non-zero class means for each of the pairs of classes (PW model) that they mark, and zero class means for all other classes. Thus for both models, even without mean-centering, the variance contained by the first PC will still be variance among classes, because for both models, the distance of a data point (a sample) from the origin as measured by each gene is actually characterized by the class of that data point. The main use of PCA in this case is to rotate the data from the original sets of axes (represented by the members of the predictor set) so that the data are now projected along new sets of axes (represented by the PCs) which are orthogonal and hence minimally correlated to each other. In this study, PCA is conducted only on the members of the predictor set, not on the whole dataset. The reason we apply PCA in this manner is to expand on the finding in [3] which we discuss in the beginning of the "Separation of classes" subsection. Therefore, each of the PCs in this study contains information only from the predictor set, and never from any gene which is not a member of the predictor set. Antiredundancy of PCA-transformed predictor setsLet us denote the antiredundancy of predictor set Sα after transformation by PCA as For untransformed predictor sets, the value of the DDP satisfying the expression
Separation of classes in PCA-transformed predictor setsThe Euclidean distance remains the same whether the predictor sets have been transformed by PCA or not; hence we do not repeat the analysis described in the previous subsection (on separation of classes) for PCA-transformed predictor sets. The Manhattan distance, however, is affected by the transformation. The procedures involved in computing the DDP value leading to the best separation of classes are the same as in case of the untransformed predictor sets described in the previous subsection (on separation of classes). To distinguish between the DDP value associated with untransformed predictor sets and the DDP value associated with PCA-transformed predictor sets, we will denote the latter as Although we have seen from Figure 2 that K has no effect on
Indeed, the observation regarding Summary of analysesWe have found that as K increases, three parameters decrease in an exponential-like manner: • the value of the DDP producing the best separation of classes in terms of the Euclidean distance; • the value of the DDP producing the highest antiredundancy in PC space; and • the value of the DDP producing the best separation of classes in terms of the Manhattan distance in PC space. We have shown that regardless of the model type (OVA or PW) or the value set to the model parameter, G (3, 5, 10, 20, or 30), each of these three characteristics can be optimized by choosing the right value of the DDP, and that this value, in turn, is determined by the number of classes in the dataset. Investigating the imbalance of class means in real-life datasetsBefore reapplying the analyses to real-life datasets, we investigate how close conditions in real-life datasets match those of toy datasets. We have mentioned in the section on the generation of toy datasets that imbalance in terms of class means among classes is prevalent in highly multiclass microarray datasets. Investigation is conducted on whole datasets (no splitting) in this case. For class k, we choose the class mean with the greatest absolute value (equivalent to the absolute value of μ(g - 1)K+k,k from equation (4) or Next, to illustrate the imbalance among classes in terms of class means, we compute the range of class means: The results are shown in Table 2. We observe that for datasets with large K (such as BRN, GCM, and NCI60) the range R( Table 2. Range of class means in real-life datasets. Reapplying the analyses on real-life datasetsFor real-life datasets, the analyses are implemented separately upon the training set of each split, there being a total of 10 splits of training and test sets. The mean across all splits is taken for the characteristics measured in the analyses: We will assume that the optimal P for each real-life dataset is also directly proportional to K (as is the case for toy datasets). However, allowing for remnant noise (left even after data preprocessing), we assign larger values to P for real-life datasets (30K) than for toy datasets with similar K. Furthermore, we conduct two versions of the analyses involving PCA-transformed predictor sets: • in the first version, only the top three PCs are used • in the second version, all PCs are used, as is the case for toy datasets. The reason for this is given as follows: The large percentage of the total variance among samples which is represented by the top PCs is 'relevant' variance (i.e., variance which is due to the difference among classes and thus is relevant w.r.t. the target class concept). On the other hand, the last PCs contain the remainder (small) percentage of the total variance, which is most likely caused by noise or variance within class (i.e., 'non-relevant' variance as opposed to the first type of variance since variance within class is not relevant w.r.t. the target class concept). Variance within class in real-life datasets, unlike variance within class in toy datasets (which is fixed at 1 in this study), differs from class to class even within the same dataset, and is likely to be larger than 1. This is the reason for the first version of the analysis. Figure 5 shows that for majority of real-life datasets, the trend regarding the aforementioned three characteristics is similar to the trend for toy datasets (Figures 1, 3, and 4), as indicated by the accompanying gray trend-lines. There are several divergences. In the
In Figure 6, we repeat the analysis involving PCA using all PCs instead of merely the top three PCs. This applies only to the
DiscussionIn this section, we demonstrate how the DDP works for datasets with different number of classes. Then we discuss the reasons for the difference between the behaviors of A look at how the DDP concept worksA few examples from OVA-based toy datasets generated at G = 30 are used to demonstrate how the DDP concept works for different number of classes. Figures 7, 8, 9, 10 show predictor sets obtained using several values of the DDP:
For the 4-class toy dataset, the values of the DDP which maximize separation of classes in terms of the Euclidean distance are 0.7, 0.8, 0.9, and 1; we can observe from Figure 7 that the predictor set produced from these DDP values do not contain non-relevant genes (genes which are generated at large values of g and hence barely differentiate among the K classes). On the other hand, predictor sets obtained using the DDP values of 0.5 and smaller do comprise such non-relevant genes. Figure 8 shows that for the 6-class toy dataset, only predictor sets obtained at the DDP values of Note that at smaller values of K, For the 10-class toy dataset, the predictor set obtained at Figure 10 shows that for the 14-class toy dataset, the predictor set obtained at In summary, a predictor set obtained at At P equal to K - 1 for OVA-based toy datasets we observe that none of the DDP values (whether The difference between the behaviors of
|





on Google Scholar









author email
corresponding author email




















Figure 1.
Figure 2.



Figure 3.

Figure 4.



Figure 5.
Figure 6.
Figure 7.
Figure 8.
Figure 9.
Figure 10.






