The Effect of Feature Selection on Classification of High Dimensional Data
Meredith Gerber
m dot gerber at alum dot dot edu
MAS.622J
Fall 2006
1. Problem
The Differential Mobility Spectrometer (DMS) is a promising technology for differentiating biochemical samples outside of a lab setting, but it is still emerging and analysis techniques are not yet standardized.
The DMS is a micromachined device that separates compounds based on the difference in their mobilities at high and low electric fields (Miller, Eiceman et al. 2000). (See Figure 1.)
Figure 1. Mode of DMS detection. Chemical separation in the DMS is achieved by an alternating electric field transverse to the direction of carrier gas flow. Changes in the DC compensation voltage allow different compounds to pass through to the detector according to differences in their mobilities between the high and low fields. Image from (Krylova, Krylov et al. 2003).
The DMS has been used in laboratory settings for differentiating biologically-relevant chemical samples (Shnayderman, Mansfield et al. 2005), demonstrating that different bacteria, and even different growth phases of the same bacteria have distinct chemical characteristics that the DMS can detect.
DMS technology has even been applied in non-laboratory settings, demonstrating its potential as a compact, reagent-less, fieldable sensor technology (Snyder, Dworzanski et al. 2004).
Analysis of this data is particularly challenging for two reasons:
1) The high dimensional nature of the data
2) The difficult link between identifiable data features and specific chemical compounds
This report presents potential directions and points of consideration for pattern recognition problems involving high-dimensional data, with an eye toward selecting features with clear chemical relevance.
2. Data
The entire data set consists of 100 samples in each of three different classes corresponding to the material sampled: Bovine Serum Albumin (BSA), Ovalbumin (OVA), and water. Data were split into a training/testing subset (90%) and a sequestered validation subset (10%) randomly. In the training/testing subset of the data there were 85 BSA samples, 92 OVA samples, and 93 water samples. The training/testing set will be used to optimize parameters and feature sets, and the validation set will be used to validate parameters and selected features.
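The random 90/10 split described above can be sketched as follows; this is a minimal numpy illustration (the function name and the fixed random seed are assumptions, not from the original analysis):

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen for reproducibility (assumption)

def split_validation(n_samples, holdout_frac=0.10):
    """Shuffle sample indices, then sequester holdout_frac of them
    as the validation subset; the rest form the training/testing subset."""
    idx = rng.permutation(n_samples)
    n_val = int(round(holdout_frac * n_samples))
    return idx[n_val:], idx[:n_val]  # (train/test indices, validation indices)

train_idx, val_idx = split_validation(100)  # 100 samples per class
```

Applied per class, this yields roughly the 85/92/93 training/testing counts reported above once the 10% validation samples are removed.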
Each file has 98,000 dimensions corresponding to unique combinations of three different variables: scan number, compensation voltage (Vc), and polarity. When the positive and negative spectra are placed next to each other (even though they are recorded at the same time), each data file can be thought of as a matrix, the entries of which vary in two dimensions, uniquely, according to the chemical composition of the sample. (See Figure 2.)
Figure 2. Example data sample. Compensation voltage is on the vertical axis and scan number (elution time) is on the horizontal axis. The positive and negative spectra are placed next to each other horizontally for visualization purposes, but as their scan numbers indicate, the data were collected simultaneously. (A) indicates the Reactant Ion Potential (RIP), and (B) indicates some chemical features which may also be of interest from a classification perspective.
The color represents sensor output values; the noticeable horizontal line (A) at compensation voltage -15 is the Reactant Ion Potential (RIP). It is mostly an artifact of the carrier gas with which the sample travels through the system. It is not hypothesized to yield relevant chemical data. The activity of interest seems to occur at compensation voltages of -14 and higher. These data have already been trimmed for relevance in the time domain, but an additional trimming here in the voltage domain could be beneficial in that it will help to remove from consideration nearly half of the remaining features.
All of the data have been pre-processed prior to this analysis. They have been trimmed in the time domain in addition to having been background-adjusted so that the baseline value is common across all samples.
The biggest pattern recognition problem here is the curse of dimensionality. The dimensionality of each sample is about 98,000 and there are only 100 samples per class. The challenge is to create a classifier that, given an input, can accurately determine the underlying class of that input (e.g., diseased or not diseased) within limited space/memory and limited time. This problem is common across biology applications such as DMS, mass spectrometry, and microarray analysis.
3. Approaches
Feature selection has been performed on the training/testing data set so far with two approaches:
1. Two-class Fisher Linear Discriminant
2. PCA
In all cases, 50 features were selected from the data, and classification was performed with a range of numbers of features, between 1 and 50, to demonstrate trends in classification as the number of features used is changed. Additionally, features were selected at a range of radial distances from previously selected features (measured in scan-Vc space pixels). This was done to decrease the degree of correlation between selected features. The specific numbers of features and radial distances were:
Numbers of features used: [1 3 5 10 20 30 50]
Radial distances used: [1 2 3 5 8 10]
Classification has been performed with SVM (polynomial kernel) and K-Nearest Neighbors (K-NN) with a range of values for K:
Values for K: [1 3 7 11]
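As a sketch, this classifier suite could be assembled with scikit-learn as below. This is an assumption-laden illustration: the original analysis tool is not stated, and the polynomial kernel degree is not given in the report, so degree 3 here is a placeholder.

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def make_classifiers(ks=(1, 3, 7, 11)):
    """Build the classifier suite described above: one polynomial-kernel
    SVM plus one K-NN classifier per value of K."""
    clfs = {"svm_poly": SVC(kernel="poly", degree=3)}  # degree is an assumption
    for k in ks:
        clfs[f"knn_k{k}"] = KNeighborsClassifier(n_neighbors=k)
    return clfs
```

Each classifier is then fit on the selected features of the training folds and scored on the held-out fold.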
3.1 Fisher Discriminant, Two-Class (FLD)
For each pixel, the Fisher Linear Discriminant was calculated as indicated in Equation 1 (Duda, Hart et al. 2001). Pixels were considered as potential features and the features with the highest discriminant values were retained.
J = (μ1 − μ2)² / (σ1² + σ2²)     (1)

where μi represents the mean of the pixel values across all samples of class i and σi² is the variance of those values. The maximum values were between 1 and 1.5, and the range of values can be seen in Figure 3.
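A per-pixel implementation of Equation 1 might look like the following numpy sketch (the function name and the small epsilon guard against flat pixels are illustrative assumptions):

```python
import numpy as np

def fisher_scores(X1, X2):
    """Two-class Fisher Linear Discriminant per pixel (Eq. 1):
    J = (mu1 - mu2)**2 / (var1 + var2), computed independently for
    each of the d pixel dimensions.
    X1: (n1, d) samples of class 1; X2: (n2, d) samples of class 2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    v1, v2 = X1.var(axis=0), X2.var(axis=0)
    return (mu1 - mu2) ** 2 / (v1 + v2 + 1e-12)  # epsilon avoids 0/0
```

Ranking pixels by this score and keeping the top 50 reproduces the selection step described above.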
Figure 3. Histogram of the top 50 Fisher Linear Discriminant values in the first cross-validation loop of the two-class comparison between BSA and OVA. The higher the discriminant value, the more separable the two classes.
The features with the highest Fisher Linear Discriminants are sent to the classifier for classification. However, in many cases, these features are not independent. Each pixel is a potential feature, but as can be seen in Figure 2, chemical features encompass a range of scans and compensation voltages. Potential features which are near to each other (in scan number and compensation voltage) are more likely to be correlated (in their information content).
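The radius rule described in Section 3 can be combined with these scores in a single greedy pass, sketched below (names are illustrative; distances are Euclidean in scan-Vc pixel space, as the report implies):

```python
import numpy as np

def select_with_radius(scores, coords, n_features=50, radius=3):
    """Greedily keep the highest-scoring pixels, skipping any candidate
    within `radius` (in scan-Vc pixels) of an already-selected feature.
    scores: (d,) discriminant values; coords: (d, 2) pixel coordinates."""
    order = np.argsort(scores)[::-1]  # best score first
    chosen = []
    for i in order:
        if all(np.hypot(*(coords[i] - coords[j])) >= radius for j in chosen):
            chosen.append(i)
        if len(chosen) == n_features:
            break
    return np.array(chosen)
```

With radius 1 this degenerates to plain top-k selection; larger radii force the selected features apart, which is why the higher radius rules reduce correlation between features.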
3.2 Principal Component Analysis (PCA)
A dimension reduction approach was implemented using PCA. Rather than selecting specific pre-existing features, PCA extracts features from the original data, making linear combinations of potential features resulting in a feature set with lower dimensionality than the original feature set. In this case, however, features with clear chemical identity were desired, so a traditional PCA approach was modified to create a feature selector that would yield specific (scan number, Vc) points.
The eigendecomposition step of PCA is space/memory intensive. In order to make this feasible, the dimensionality of the data had to be further reduced prior to feature selection. First, a subset of the data was selected in scan-Vc space (see Figure 4). Then, further reduction was achieved by down-sampling this space until fewer than 5,000 dimensions remained (out of an original 98,000).
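The reduction-then-eigendecomposition pipeline could be sketched as follows. The uniform step size and function name are assumptions; the report states only that the space was evenly sampled to below 5,000 dimensions.

```python
import numpy as np

def downsample_and_pca(X, step=5):
    """Evenly subsample the flattened scan-Vc dimensions, then
    eigendecompose the covariance of the reduced data.
    X: (n_samples, d) data matrix; keeps every `step`-th dimension."""
    keep = np.arange(0, X.shape[1], step)   # evenly sample the feature grid
    Xs = X[:, keep]
    Xc = Xs - Xs.mean(axis=0)               # center before PCA
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)      # symmetric -> real, ascending
    order = np.argsort(evals)[::-1]         # re-sort descending
    return evals[order], evecs[:, order], keep
```

Because `keep` records which original dimensions survived, large eigenvector coefficients can be mapped back to specific (scan number, Vc) points, which is how the modified PCA approach yields features with clear chemical identity rather than opaque linear combinations.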
Figure 4. Feature space covered by the PCA analysis. This space includes about 21,000 of the 98,000 potential features from the data displayed. This number of potential features had to be further reduced in order to perform PCA because of space and memory constraints; to do so, this space was evenly sampled.
Figure 5. Percent of variance explained by the top several eigenvectors. The top eigenvector explains a bit over 70% of the variance, whereas the second eigenvector accounts for only slightly more than 3%.
Figure 6. Eigenvector coefficients for the top eigenvector. These coefficients explain 70% of the variance of the distribution of (training and test) data. Some sets of high-coefficient feature indices are consecutive (A, B, C, D) while another has a sinusoidal nature (E). This likely indicates features which are oriented more along either the scan- or the Vc-axis.
4. Results
4.1 FLD
Figure 7. Classification performance for Water vs. BSA (SVM) with FLD feature selection. Note that classification improves as features are added, until about 30 features. Also note that classification at this point is generally better with a higher radius rule, likely because the selected features are less correlated.
Figure 8. Classification performance for Water vs. BSA (K-NN, K=7) with FLD feature selection. Note that performance and trends are generally similar to those with SVM (see Figure 7): classification improves as features are added, until about 30 features, and classification at this point is generally better with a higher radius rule, likely because the selected features are less correlated.
Summary of FLD classification results (at r=3, 30 features)

Comparison   | SVM | K-NN, K=7
H2O vs. BSA  | 93% | 94%
H2O vs. OVA  | 96% | 95%
BSA vs. OVA  | 83% | 88%
Table 1. Classification performance summary for FLD feature selection. These are results of classification with 4-fold cross-validation on the training/testing data set. The K=7 K-NN classifier was chosen as a point of comparison for the SVM results.
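The 4-fold cross-validation behind these numbers can be sketched as below (the seeded shuffle is an assumption; the report does not say how folds were assigned):

```python
import numpy as np

def four_fold_indices(n_samples, n_folds=4, seed=0):
    """Yield (train, test) index arrays for n_folds-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, n_folds)
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, test
```

Feature selection is re-run inside each fold (note the "first cross-validation loop" in Figure 3), so the reported accuracies are not biased by selecting features on held-out data.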
4.2 PCA
Figure 9. Classification performance for Water vs. BSA (K-NN, K=7) with PCA feature selection. Compare to the results with FLD feature selection (Figure 8): the trends are similar, though classification accuracy is not as good.
Summary of PCA classification results (at r=3, 30 features)

Comparison   | SVM | K-NN, K=7
H2O vs. BSA  | 88% | 90%
H2O vs. OVA  | 84% | 88%
BSA vs. OVA  | 85% | 86%
Table 2. Classification performance summary for PCA feature selection. These are results of classification with 4-fold cross-validation on the training/testing data set. The K=7 K-NN classifier was chosen as a point of comparison for the SVM results.
5. References
Duda, R. O., P. E. Hart, et al. (2001). Pattern Classification, John Wiley & Sons.
Krylova, N. S., E. Krylov, et al. (2003). "Effect of Moisture on the Field Dependence of Mobility for Gas-Phase Ions of Organophosphorus Compounds at Atmospheric Pressure with Field Asymmetric Ion Mobility Spectrometry." Journal of Physical Chemistry A 107: 3648-3654.
Miller, R. A., G. A. Eiceman, et al. (2000). "A novel micromachined high-field asymmetric waveform-ion mobility spectrometer." Sensors and Actuators B 67: 300-306.
Shnayderman, M., B. Mansfield, et al. (2005). "Species-specific bacteria identification using differential mobility spectrometry and bioinformatics pattern recognition." Anal Chem 77(18): 5930-7.
Snyder, A. P., J. P. Dworzanski, et al. (2004). "Correlation of mass spectrometry identified bacterial biomarkers from a fielded pyrolysis-gas chromatography- ion mobility spectrometry biodetector with the microbiological gram stain classification scheme." Analytical Chemistry 76: 6492-6499.