Bayesian Classification
The first classifier I explored was a Bayesian classifier with Gaussian class-conditional distributions. I expected it to work well for some features and poorly for others. Consider the density feature, for example. I believed that very low and very high densities would lead to bad rhythms, leaving a sweet spot in the middle where all the good patterns lived. The data, however, do not support this hypothesis: examining the distributions, the good and bad classes overlap heavily in track density. The table below gives the per-class statistics:
| Feature | Mean (bad) | Mean (good) | sigma^2 (bad) | sigma^2 (good) |
|---|---|---|---|---|
| Clap Density | 0.6659 | 0.6603 | 0.0132 | 0.0146 |
| Snare Density | 0.4380 | 0.4512 | 0.0165 | 0.0150 |
| Kick Density | 0.3563 | 0.3646 | 0.0153 | 0.0145 |
This is even easier to see graphically. In the plots below, the red curve is the bad class and the green curve is the good class. Recall that the area under each curve is unity. The Bayes error rate grows with the area shared by both curves (with equal priors it is half that shared area). In all three of these graphs the curves overlap almost completely, so each density feature alone classifies at little better than chance:
![](img/bdf1.jpg)
Kick PDF

![](img/bdf2.jpg)
Snare PDF

![](img/bdf3.jpg)
Clap PDF
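To make the near-total overlap concrete, it can be checked numerically from the table above. Here is a minimal MATLAB sketch (not one of the project's scripts) for the clap-density row:

```matlab
% Numerically integrate the area shared by the two fitted Gaussians,
% using the clap-density means and variances from the table above.
mu_bad  = 0.6659;  var_bad  = 0.0132;
mu_good = 0.6603;  var_good = 0.0146;

x = linspace(-0.5, 2, 10000);   % grid wide enough to cover both PDFs
p_bad  = exp(-(x - mu_bad ).^2 / (2*var_bad )) / sqrt(2*pi*var_bad );
p_good = exp(-(x - mu_good).^2 / (2*var_good)) / sqrt(2*pi*var_good);

overlap = trapz(x, min(p_bad, p_good));  % shared area; comes out close to 1
fprintf('clap-density overlap = %.3f\n', overlap);
```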
Some Better Features
Although the simple density features were not very compatible with Gaussian classification, one of the nice surprises of this project was discovering that a few other features fared much better. The 8th harmonic, the density of silence (state 0), and the density of snare drums (state 2) had the highest correct classification rates when taken independently. The difference is easy to see in their probability density functions (PDFs):
![](img/bdf14.jpg)
8th Harmonic

![](img/bdf17.jpg)
Silence Density

![](img/bdf19.jpg)
Snare Density
Taken independently, the correct classification rates for these three features are:

| Feature | Classification Rate |
|---|---|
| 8th Harmonic | 65.89% |
| State 0 Density | 67.44% |
| State 2 Density | 62.14% |
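For reference, the per-feature decision behind these rates is just a comparison of prior-weighted Gaussian likelihoods. A minimal sketch, where train_good, train_bad, and x are hypothetical stand-in names rather than the project's actual code:

```matlab
% Univariate Gaussian Bayes rule for a single feature.
mu_g = mean(train_good);  var_g = var(train_good);   % class "good"
mu_b = mean(train_bad);   var_b = var(train_bad);    % class "bad"
P_g  = numel(train_good) / (numel(train_good) + numel(train_bad));
P_b  = 1 - P_g;                                      % priors from class counts

% Prior-weighted likelihoods of a test value x; pick the larger one.
g = P_g * exp(-(x - mu_g)^2 / (2*var_g)) / sqrt(2*pi*var_g);
b = P_b * exp(-(x - mu_b)^2 / (2*var_b)) / sqrt(2*pi*var_b);
is_good = (g >= b);
```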
A Big Mistake
A big mistake I made was to ignore the curse of dimensionality. Remembering that with Bayesian classification additional features can never increase the Bayes error, I hurried to concatenate as many features as possible into my Bayesian classifier. It was only after I got a miserable classification rate with all 24 features enabled (54.4%) that I remembered the second half of the rule: additional features only help when the class densities are known exactly, which in practice means unlimited training data. With a finite training set, every added dimension makes the density estimates noisier.
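The effect is easy to reproduce on synthetic data. The sketch below (a toy illustration, not the project's data or scripts) fits a naive Gaussian Bayes classifier to one informative dimension plus d-1 pure-noise dimensions; with a small, fixed training set, the test error tends to climb as d grows:

```matlab
% Toy illustration of the curse of dimensionality (synthetic data, not the
% project's features). Priors are equal, so they drop out of the comparison.
N = 25;  M = 5000;                         % training / test samples per class
score = @(T, mu, v) sum(bsxfun(@rdivide, ...
    -bsxfun(@minus, T, mu).^2, 2 * v), 2) - 0.5 * sum(log(v));
for d = [1 8 24]
    sep = [1.5, zeros(1, d - 1)];          % only dimension 1 separates classes
    Xa = randn(N, d);  Xb = bsxfun(@plus, randn(N, d), sep);
    Ta = randn(M, d);  Tb = bsxfun(@plus, randn(M, d), sep);
    mua = mean(Xa, 1);  va = var(Xa, 0, 1);   % naive Gaussian fit per class
    mub = mean(Xb, 1);  vb = var(Xb, 0, 1);
    err = (mean(score(Ta, mub, vb) > score(Ta, mua, va)) + ...
           mean(score(Tb, mua, va) > score(Tb, mub, vb))) / 2;
    fprintf('d = %2d   test error = %.3f\n', d, err);
end
```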
Feature Finding
Then I got curious. I wanted to know the quality of every single feature, so I wrote a MATLAB script called Feature Find that ran the Bayesian classifier on each feature independently against the validation set (a sketch of the idea appears after the table). The results were surprising. The biggest shocker was that many features classified just barely above the priors (P1 = 59.71%, P2 = 40.42%):
| Classification Rate | Feature # |
|---|---|
| 59.33% | 20 |
| 60.66% | 1 |
| 60.66% | 2 |
| 60.66% | 3 |
| 60.66% | 4 |
| 60.66% | 6 |
| 60.66% | 7 |
| 60.66% | 9 |
| 60.66% | 10 |
| 60.66% | 11 |
| 60.66% | 12 |
| 60.66% | 13 |
| 60.66% | 15 |
| 60.66% | 16 |
| 60.66% | 18 |
| 60.66% | 21 |
| 60.66% | 22 |
| 60.66% | 23 |
| 60.66% | 24 |
| 61.33% | 8 |
| 61.66% | 5 |
| 62.13% | 19 |
| 65.89% | 14 |
| 67.44% | 17 |
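The original Feature Find script is not reproduced here, but the idea fits in a few lines. In this sketch, Xtrain and Xval are assumed N-by-24 feature matrices, ytrain and yval are label vectors, and bayes_rate is a hypothetical helper that fits the per-class Gaussians (as above) and returns the correct-classification rate:

```matlab
% Feature Find: score every feature independently on the validation set.
K = size(Xtrain, 2);                       % K = 24 features
rate = zeros(K, 1);
for k = 1:K
    rate(k) = bayes_rate(Xtrain(:, k), ytrain, Xval(:, k), yval);
end
[sorted, order] = sort(rate, 'descend');   % rank features by rate
disp([order, 100 * sorted]);               % feature # next to its rate (%)
```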
Feature Seeking
Given that some features are better than others, I wanted to know which combinations of features would be best. There are 2^K = 2^24 ≈ 16.8 million possible subsets, far too many to evaluate exhaustively, so I wrote a new version of Feature Find called Feature Seek. This script classifies sequentially, tentatively adding one more feature each time: if the new feature makes the classification worse, it gets dropped; otherwise, it stays enabled. It is not perfect, but it is interesting and helpful. When I ran it the first time, I realized it is heavily biased toward the first few features. The first feature is always better than classifying with no features at all, so it always gets picked up and can never be dropped later, and these early features may crowd out other useful ones. Therefore, I ran it forwards, from feature 1 to feature 24, then backwards, from 24 to 1. The best three independent features did not all survive: forwards, features 17 and 19 survived but 14 did not; backwards, 14 and 19 survived but 17 did not. Ten features did not survive in either direction.
Because I was rooting for my favorite features, I decided to initialize the feature-enable vector with features 14, 17, and 19, the ones with the highest independent classification rates, and then ran the forwards and backwards sequential Feature Seek (a sketch of the script follows below). Interestingly, all three favorites remained enabled after both sweeps, so they must genuinely be important.
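Here is a sketch of one sequential pass, again using a hypothetical helper, bayes_rate_multi, which classifies using only the enabled columns of the feature matrices (one Gaussian per enabled feature). Seeding enable([14 17 19]) reproduces the initialized variant described above:

```matlab
% Feature Seek: one sequential pass over the feature-enable vector
% (forwards shown; use k = 24:-1:1 for the backwards pass).
enable = false(1, 24);    % or seed favorites: enable([14 17 19]) = true;
best = 0;                 % with a seeded vector, start from its measured rate
for k = 1:24
    trial = enable;  trial(k) = true;   % tentatively add feature k
    r = bayes_rate_multi(Xtrain, ytrain, Xval, yval, trial);
    if r >= best                        % keep the feature unless it hurts
        enable = trial;  best = r;
    end
end
```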
Random Feature Seek
Reconsidering the bias of sequential movement, I wrote a Random Feature Seek (RFS) script. RFS picks a feature at random, inverts its enabled/disabled status, classifies, and keeps the change if the rate improves (a sketch follows below). I ran it three times for 200 iterations each. As the table shows, the selected features were similar to the sequential runs, but with no obvious bias toward the first or last features. This method also produced the highest classification rate for Bayes (71.06%).
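A sketch of RFS under the same assumed names; the starting subset here is my assumption, since the write-up does not say how it was initialized:

```matlab
% Random Feature Seek: flip one random feature per iteration,
% keep the flip only if the validation rate improves.
enable = rand(1, 24) > 0.5;                % assumed random starting subset
best = bayes_rate_multi(Xtrain, ytrain, Xval, yval, enable);
for it = 1:200
    k = randi(24);                         % feature chosen at random
    trial = enable;  trial(k) = ~trial(k); % invert enabled/disabled status
    r = bayes_rate_multi(Xtrain, ytrain, Xval, yval, trial);
    if r > best
        enable = trial;  best = r;
    end
end
```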
Finally, it is interesting that across all seven feature-selection runs there were definite trends. Feature 10, the 4th harmonic, was never selected. Features 5, 6, 12, and 20 were each picked only once. Features 8, 14, 17, and 19 were picked most often. What is especially interesting is that while 8, 14, 17, and 19 were among the strongest independent features, feature 5 was also among the strongest independently yet among the weakest in combination with others. Feature 5 represents the evenness, which is computed from other features. Alone, it therefore carries quite a bit of information; but in the presence of the features from which it is derived, it has little more to contribute.
(1 = Enabled, 0 = Disabled)

| Feat. # | Fwd | Bkwd | Init. Fwd | Init. Bkwd | Rand1 | Rand2 | Rand3 |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 8 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 9 | 0 | 0 | 1 | 1 | 1 | 0 | 1 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 12 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 13 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 14 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| 15 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 16 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 17 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
| 18 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 19 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| 20 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 21 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 22 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 23 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 24 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| rate: | 66.48% | 67.73% | 70.39% | 69.08% | 67.76% | 69.74% | 71.06% |
Results
The final correct classification rate for the group of features that tested highest against the validation data was 52.20%.