Fisher Linear Discriminant

The FLD technique is primarily a dimensionality reduction method. It projects the data so that, under Gaussian assumptions, the reduced class distributions are as separable as possible: the projected means of the two classes are pushed as far apart as possible, while the projected variances are kept as small as possible. Reducing to a single dimension for classification, as I did here, is probably the simplest application of Fisher discriminants, although reductions to more than one dimension are also possible.
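
A minimal sketch of that reduction to one dimension, assuming two numpy arrays X1 and X2 holding the feature vectors of the two classes (one row per example); the names and structure are illustrative, not the project's actual code:

    import numpy as np

    def fisher_direction(X1, X2):
        """Return the 1-D projection direction w ~ Sw^-1 (m1 - m2)."""
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        # Within-class scatter matrix: sum of the two per-class scatters.
        Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
        # Direction that maximizes the gap between projected means
        # relative to the projected within-class variance.
        w = np.linalg.solve(Sw, m1 - m2)
        return w / np.linalg.norm(w)

    # Projecting each class onto w yields the reduced 1-D data:
    #   y1 = X1 @ w;  y2 = X2 @ w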

I was pleasantly surprised by this technique; I had completely underestimated it. I expected it to perform only slightly better than Bayesian classification, since after the dimension reduction a conventional Bayes classification takes place. The FLD does improve the separation, but I did not anticipate how much. In fact, this turned out to be the most accurate classifier.
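
For reference, the conventional Bayes step on the projected scalars could look like the following sketch, assuming Gaussian class-conditional densities; y1 and y2 stand for the projected training values of the two classes and p1, p2 for the class priors (all names are mine):

    import numpy as np

    def gaussian_bayes_1d(y1, y2, y_new, p1=0.5, p2=0.5):
        """Assign class 1 or 2 to projected value(s) y_new."""
        mu1, mu2 = y1.mean(), y2.mean()
        s1, s2 = y1.std(ddof=1), y2.std(ddof=1)

        def log_gauss(y, mu, s):
            # Constant -0.5*log(2*pi) omitted; it cancels in the comparison.
            return -0.5 * ((y - mu) / s) ** 2 - np.log(s)

        score1 = log_gauss(y_new, mu1, s1) + np.log(p1)
        score2 = log_gauss(y_new, mu2, s2) + np.log(p2)
        return np.where(score1 >= score2, 1, 2)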

As usual, I began by surveying the features against the validation data one at a time (the survey loop is sketched after the table). I was impressed by how many of them (22 out of 24) classified better than the priors:

Rate (%)   Feature #
60.4167    7
60.4167    21
62.5000    14
64.5833    5
64.5833    17
66.6667    1
66.6667    2
66.6667    3
66.6667    4
66.6667    6
66.6667    8
66.6667    9
66.6667    10
66.6667    11
66.6667    12
66.6667    13
66.6667    15
66.6667    16
66.6667    18
66.6667    20
66.6667    22
66.6667    23
66.6667    24
68.7500    19
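
The per-feature survey amounts to a simple loop; a hedged sketch, assuming hypothetical training/validation arrays Xtr, ytr, Xva, yva and a stand-in scoring routine fld_bayes_rate built from pieces like those sketched earlier:

    def survey_features(Xtr, ytr, Xva, yva, fld_bayes_rate):
        """Score every feature on its own; return (rate, feature #) pairs."""
        rates = []
        for f in range(Xtr.shape[1]):
            rate = fld_bayes_rate(Xtr[:, [f]], ytr, Xva[:, [f]], yva)
            rates.append((rate, f + 1))   # features are numbered from 1
        return sorted(rates)              # ascending by rate, as tabulated above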

Improved Feature Separation

FLD improves the separation of otherwise poor features. For example, recall the track density features examined in detail in the Bayesian classification section; FLD improved them significantly:

Feature         Bayes rate without FLD   Bayes rate with FLD
Kick Density    60.66%                   66.66%
Snare Density   60.66%                   66.66%
Clap Density    60.66%                   66.66%

This can also be seen in the graphs of their distributions. The FLD transformed the datasets into more separable ones. This is clearest in features 2 and 3, though less so in feature 1. Also notice the vertical axis: the pre-FLD distributions had maximum heights around 4, whereas the FLD transform significantly narrowed the distributions, lowering the variances dramatically:

[Figure: Feature 1 after FLD]

[Figure: Feature 2 after FLD]

[Figure: Feature 3 after FLD]

Next, I selected groups of potentially useful features using forward, backward, and random selection processes and tested them (the greedy forward-selection step is sketched after the table):

Feature            Fwd      Bkwd     Rand 1   Rand 2   Rand 3
1                  1        0        1        0        0
2                  1        0        1        1        1
3                  1        0        1        0        1
4                  0        0        0        0        1
5                  1        0        1        0        1
6                  1        0        0        1        0
7                  1        0        0        1        1
8                  1        1        1        1        1
9                  0        0        1        0        0
10                 0        1        0        1        0
11                 1        0        1        0        0
12                 1        1        1        0        1
13                 1        0        0        0        1
14                 0        0        0        0        0
15                 0        0        0        1        0
16                 1        1        1        0        1
17                 0        0        1        1        0
18                 1        1        0        1        1
19                 0        0        0        1        0
20                 1        1        1        0        0
21                 0        1        1        1        0
22                 0        1        1        1        0
23                 1        1        0        1        0
24                 0        1        1        0        1
Validation Rate    75.00%   66.66%   64.58%   72.92%   77.08%
Testing Rate       60.30%   58.10%   61.00%   56.60%   61.03%
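
The greedy forward-selection pass might look roughly like the following sketch, assuming a stand-in score(features) function that returns the validation rate of an FLD + Bayes classifier built on the given feature subset (not the project's actual code):

    def forward_select(n_features, score):
        """Greedily add the feature that most improves the validation rate."""
        chosen, best_rate = [], 0.0
        improved = True
        while improved:
            improved = False
            for f in range(1, n_features + 1):
                if f in chosen:
                    continue
                rate = score(chosen + [f])
                if rate > best_rate:
                    best_rate, best_f, improved = rate, f, True
            if improved:
                chosen.append(best_f)   # keep the single best addition this round
        return chosen, best_rate

Backward elimination works the same way in reverse, starting from all 24 features and greedily dropping one at a time.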

Improved Separation

The following five graphs show the distributions of the two classes after the FLD reduction for each selected feature group:

[Figure: Forward Selected Distributions]

[Figure: Backward Selected Distributions]

[Figure: Random #1 Selected Distributions]

[Figure: Random #2 Selected Distributions]

[Figure: Random #3 Selected Distributions]

Cheating

Just out of curiosity, I ran the random feature-search script against the testing data in order to find the theoretical maximum classification rate. It reached a maximum of 65.44%. This result cannot be counted as an actual classification result; it is only a reference number suggesting how much better the classification might get with more data.
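
The random search itself could be as simple as the following sketch, where score(subset) is the same hypothetical FLD + Bayes scoring routine as above, here evaluated against the testing data rather than the validation data; the number of trials and the subset sizes are illustrative assumptions:

    import random

    def random_search(n_features, score, trials=1000, seed=0):
        """Score many random feature subsets and keep the best rate."""
        rng = random.Random(seed)
        best_rate, best_subset = 0.0, None
        for _ in range(trials):
            k = rng.randint(1, n_features)
            subset = rng.sample(range(1, n_features + 1), k)
            rate = score(subset)
            if rate > best_rate:
                best_rate, best_subset = rate, sorted(subset)
        return best_subset, best_rate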

Results

The correct classification rate on the testing data for the feature group that scored highest against the validation data (the Random #3 set, at 77.08% validation) is 61.03%.
