Consider the state transition matrices for the good and bad classes.
Each entry gives the probability that one drum sound will be heard
following another. It is easy to see, for example, that transitions
from state 0 to state 0 happen more often in bad rhythms than in good
ones:
good =
    0.2702 0.3007 0.3064 0.3016 0.2842 0.2853 0.2485 0.2486
    0.1530 0.1608 0.1410 0.1467 0.1554 0.1706 0.1775 0.1568
    0.1694 0.1556 0.1718 0.1467 0.1554 0.1324 0.1420 0.1405
    0.0701 0.0745 0.0731 0.0870 0.0784 0.0706 0.0858 0.0649
    0.1580 0.1451 0.1474 0.1630 0.1421 0.1500 0.1627 0.1784
    0.0722 0.0641 0.0615 0.0571 0.0717 0.0706 0.0828 0.0811
    0.0708 0.0627 0.0641 0.0734 0.0691 0.0882 0.0592 0.0649
    0.0365 0.0366 0.0346 0.0245 0.0438 0.0324 0.0414 0.0649

bad =
    0.3178 0.2933 0.3176 0.2895 0.2934 0.3235 0.3012 0.3463
    0.1505 0.1538 0.1541 0.1654 0.1614 0.1581 0.1352 0.1245
    0.1430 0.1342 0.1645 0.1203 0.1504 0.1342 0.1639 0.1089
    0.0656 0.0871 0.0728 0.0846 0.0747 0.0551 0.0676 0.0739
    0.1567 0.1502 0.1238 0.1523 0.1411 0.1434 0.1639 0.1401
    0.0757 0.0818 0.0662 0.0902 0.0646 0.0735 0.0676 0.0739
    0.0599 0.0631 0.0699 0.0695 0.0729 0.0680 0.0717 0.0739
    0.0308 0.0364 0.0312 0.0282 0.0415 0.0441 0.0287 0.0584
For example, if a rhythm segment happened to have, as its first four steps:
Note that since the rhythm is repeated, the 16th transition is from the 16th step to the first step.
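The counting step, including the wrap-around transition from the last
step back to the first, can be sketched as follows. The example rhythm
and the 0-7 state coding here are hypothetical, not taken from the
dataset:

```python
import numpy as np

def transition_matrix(states, n_states=8):
    """Estimate a transition matrix from one repeating rhythm.

    Because the rhythm loops, the step after the last one is the
    first one, so that wrap-around transition is counted too.
    """
    counts = np.zeros((n_states, n_states))
    for i in range(len(states)):
        src = states[i]
        dst = states[(i + 1) % len(states)]  # wrap: last step -> first step
        counts[src, dst] += 1
    # Normalize each row into probabilities; unvisited rows stay zero.
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

# A hypothetical 16-step rhythm over 8 states.
rhythm = [0, 2, 0, 1, 0, 2, 0, 3, 0, 2, 0, 1, 0, 2, 0, 4]
T = transition_matrix(rhythm)
```

Class matrices like the ones above would then be averages of these
per-rhythm estimates over each labeled set.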
| Dataset    | Classification Rate |
|------------|---------------------|
| Training   | 49.5%               |
| Validation | 64.6%               |
| Testing    | 52.9%               |
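The write-up does not spell out the scoring rule, but a standard
Markov-model classifier of this kind scores a rhythm by summing log
transition probabilities under each class matrix and picking the
larger. A minimal sketch under that assumption:

```python
import math

def log_likelihood(states, T, eps=1e-12):
    """Sum of log transition probabilities of a repeating rhythm
    under transition matrix T, including the wrap-around step."""
    total = 0.0
    for i in range(len(states)):
        p = T[states[i]][states[(i + 1) % len(states)]]
        total += math.log(max(p, eps))  # floor zero probabilities
    return total

def classify(states, T_good, T_bad):
    """Label a rhythm by whichever class matrix likes it more."""
    good = log_likelihood(states, T_good)
    bad = log_likelihood(states, T_bad)
    return "good" if good >= bad else "bad"
```

With only two classes, this reduces to a single log-likelihood-ratio
threshold at zero.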
The results are not great. In trying to reason about why, I considered that, when hearing a rhythm, it is sometimes not so important which drum gets hit; what matters is simply that, at that instant in time, some drum gets hit. From this perspective, a rhythm consists of a repeating sequence of noises with varying amplitudes.
In other words, what I was trying to do was examine the transition of a feature through time. It might then be possible to say, for example, that a rhythm more often consists of a short group of loud sounds followed by a longer group of quiet ones than of other combinations of loud and quiet sounds over brief and long intervals.
At first, this concept sounded like too much additional work for this
project. However, I noticed a shortcut that enabled me to report some
results: I simply remapped the states 1-8 into a string that looked
like this:
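The remapping step itself is just a table lookup applied to each state
before re-estimating the transition matrices. The loud/quiet mapping
in this sketch is a hypothetical illustration, not the one actually
used:

```python
def remap(states, mapping):
    """Relabel a rhythm: mapping[s] is the new label for
    original state s (0-7)."""
    return [mapping[s] for s in states]

# Hypothetical example: collapse the eight drums into loud (1)
# vs quiet (0) before training the Markov model.
loud_quiet = [1, 1, 0, 0, 1, 0, 0, 0]
print(remap([0, 2, 5, 0, 4], loud_quiet))  # -> [1, 0, 0, 1, 1]
```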
| Dataset    | Classification Rate |
|------------|---------------------|
| Training   | 50.7%               |
| Validation | 50.0%               |
| Testing    | 50.8%               |
So perhaps it was a better idea to remap the 8 states into a new 8-state space with 8 varying levels of amplitude. Treating each remapping as a permutation of the levels, there are 8! = 40,320 possible mappings, namely:
[1 2 3 4 5 6 7 8]
[2 1 3 4 5 6 7 8]
[3 1 2 4 5 6 7 8]
[3 2 1 4 5 6 7 8]
[2 3 1 4 5 6 7 8]
[4 2 3 1 5 6 7 8]
[ .... ]
[8 7 6 5 4 3 2 1]
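These orderings can be enumerated directly with the standard library;
`itertools.permutations` yields them in lexicographic order, which
differs slightly from the hand-written listing above:

```python
from itertools import islice, permutations
from math import factorial

# All 8! = 40320 one-to-one remappings of the eight states.
assert factorial(8) == 40320

# Peek at the first few, in lexicographic order.
for p in islice(permutations(range(1, 9)), 3):
    print(list(p))
```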
There was almost no time left on the assignment, so rather than explore this space carefully, I just sampled mappings at random. This was easy because the Markov Model classifier runs very quickly: I wrote a script that generated three hundred different amplitude remappings, classified with each of them, and saved the best remappings.
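The random search can be sketched as below. The `score_fn` is assumed
to wrap the existing classifier and return its classification rate on
the chosen tuning set; the function names and sampling details here
are my own. Note the mags vectors reported below repeat levels, so the
sampler draws each state's level independently rather than permuting:

```python
import random

def random_remapping(n_states=8, n_levels=8, rng=random):
    """One random amplitude remapping; levels may repeat."""
    return [rng.randrange(n_levels) for _ in range(n_states)]

def search(score_fn, n_trials=300, seed=0):
    """Try n_trials random remappings, keep the best scorer.

    score_fn(mapping) -> classification rate on the tuning set
    (hypothetical hook into the Markov-model classifier).
    """
    rng = random.Random(seed)
    best_map, best_score = None, float("-inf")
    for _ in range(n_trials):
        m = random_remapping(rng=rng)
        s = score_fn(m)
        if s > best_score:
            best_map, best_score = m, s
    return best_map, best_score
```

Selecting the winner on the same set it is scored on is exactly what
invites the overfitting described next.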
Unfortunately, this technique was very prone to overfitting.
On the validation dataset, it produced 73% correct classification,
but that number dropped to 49.6% on the testing data. The chosen
magnitude remapping was:
mags = [0 0 7 0 7 7 1 0]
I decided to repeat the experiment, training the magnitude remapper
on the training set instead of the validation set.
This resulted in a rate of 57.8% on validation data, which
dropped to 51.6% on testing data. The magnitude remapping
it chose was:
mags = [7 4 7 7 7 3 6 6]
(one row per time step, one column per drum)
    1 0 1 0 0 0 0 0
    0 0 0 0 0 0 0 0
    1 0 0 1 0 0 0 0
    0 0 0 0 0 0 0 0
    1 0 0 0 1 0 0 0
    0 0 0 0 0 0 0 0
    1 0 0 0 0 1 0 0
    0 0 0 0 0 0 0 0
    1 0 0 0 0 0 1 0
    0 0 0 0 0 0 0 0
    1 0 0 0 0 0 0 1
    0 0 0 0 0 0 0 0
    1 0 0 0 0 0 0 0
    1 0 0 0 0 0 0 0
    1 0 0 0 0 0 0 0
Moreover, under this model, such a change is indistinguishable from
this sequence: