Automatic Drum Samples Classification
A final project for Pattern Recognition MAS 622J/1.126J
Eyal Shahar, MIT Media Lab
All quotes are taken from the sketch "More Cowbell" as performed on "Saturday Night Live"
Background
"I put my pants on just like the rest of you -- one leg at a time. Except, once my pants are on, I make gold records."
Musicians today, both professional and hobbyist, who rely heavily on their computers to make music,
usually find themselves with hard drives full of music samples of all sorts.
The majority of these are individual drum sounds, often called "hits" or "one-shots".
Arranging these samples into folders is usually done manually, by listening to every sample
and moving it into the desired folder. Later, while making music, retrieving these samples
means, once more, tediously auditioning each and every sample.
This project is a first step towards making the life of the computer-based musician a little bit
easier by automatically classifying these samples and allowing better methods of retrieval.
Objective
"Before we're done here.. y'all be wearing gold-plated diapers."
The goal of this project is to automatically classify drum samples,
and to compare classification techniques and optimal feature sets.
Training And Testing Sets
"I gotta have more cowbell, baby!"
The training set consists of 1000 samples, divided into six classes: bass drums, snares, hi-hats, cymbals, tom-toms and claps.
The testing set consists of 1200 samples.
The following table describes the distribution of the sets.
Features
"... The last time I checked, we don't have a whole lot of songs that feature the cowbell."
Most of the feature extraction was done using the
MIRtoolbox for Matlab, developed at the University of Jyväskylä. The features extracted this way are:
- Brightness – Percentage of energy above 1500Hz
- Rolloff – Frequency below which 85% of the energy is found
- Roughness – an estimate of sensory dissonance, based on the frequency ratio of each pair of sinusoidal components
- Irregularity – degree of variation of successive spectrum peaks
- MFCC – Mel frequency cepstrum coefficients
In addition, two more features were extracted using custom algorithms:
- Pitch – The samples are sliced into equal length frames.
In each frame the peak of the spectrum is found.
The peak bin indices are averaged, and the frequency corresponding to the average FFT bin is returned.
- Decay – The amplitude envelope of the sample is calculated with MIRtoolbox.
The peak amplitude is found, and the decay time is calculated as the time
it takes the amplitude to drop from the peak to 50% of it.
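The two custom features, and the simpler spectral ones, can be sketched in a few lines. The original code was written in Matlab with MIRtoolbox; the Python/NumPy version below is an illustrative stand-in, and its frame length, rectified envelope and thresholds are simplified assumptions rather than the original implementation:

```python
import numpy as np

def brightness(signal, sr, cutoff=1500.0):
    """Percentage of spectral energy above the cutoff (1500 Hz here)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    total = spectrum.sum()
    return spectrum[freqs > cutoff].sum() / total if total > 0 else 0.0

def rolloff(signal, sr, fraction=0.85):
    """Frequency below which 85% of the spectral energy is found."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    cumulative = np.cumsum(spectrum)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return freqs[min(idx, len(freqs) - 1)]

def pitch(signal, sr, frame_len=1024):
    """Slice into equal-length frames, find each frame's spectral-peak bin,
    average the bins and convert the average back to a frequency."""
    n_frames = len(signal) // frame_len
    peak_bins = [np.argmax(np.abs(np.fft.rfft(
        signal[i * frame_len:(i + 1) * frame_len])))
        for i in range(n_frames)]
    return np.mean(peak_bins) * sr / frame_len

def decay_time(signal, sr, drop=0.5):
    """Time for the amplitude envelope to fall from its peak to 50% of it.
    (A crude rectified envelope; MIRtoolbox computes a smoothed one.)"""
    envelope = np.abs(signal)
    peak = np.argmax(envelope)
    below = np.where(envelope[peak:] < drop * envelope[peak])[0]
    return below[0] / sr if below.size else (len(signal) - peak) / sr
```

On a pure sine the brightness is essentially 0 or 1 depending on which side of 1500 Hz it falls, which is a quick sanity check for the thresholding.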
The Matlab GUI
"... and, Gene - Really explore the studio space this time."
To manage the learning and testing processes, a Matlab GUI was created.
It provides quick and intuitive access to feature extraction; loading and saving of the training and testing data sets;
saving and loading of classification models; selection of the active model; invocation of the learning and testing routines;
and graphic visualization of the feature space.
Classification methods
"Let's just do the thing."
Support vector machine
For this method, Matlab's SVM tools were used.
The main drawback of this implementation is that in the absence of the optimization package,
as on my computer, the algorithm falls back to a linear kernel.
Six SVMs were trained, one for each class, using a one-versus-all approach.
For validation, a leave-one-out method was used.
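The one-versus-all scheme is independent of the underlying binary learner. The sketch below shows the structure in Python, with a least-squares linear discriminant standing in for Matlab's linear-kernel SVM; the function names and the stand-in classifier are illustrative assumptions, not the original code:

```python
import numpy as np

def train_one_vs_all(X, y, classes):
    """Fit one linear discriminant per class: that class vs. everything else.
    (A least-squares linear classifier stands in for the linear-kernel SVM.)"""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    models = {}
    for c in classes:
        target = np.where(y == c, 1.0, -1.0)   # +1 for the class, -1 for the rest
        w, *_ = np.linalg.lstsq(Xb, target, rcond=None)
        models[c] = w
    return models

def predict_one_vs_all(models, X):
    """Assign each sample to the class whose discriminant responds most strongly."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    classes = list(models)
    scores = np.vstack([Xb @ models[c] for c in classes])  # (n_classes, n_samples)
    return np.array([classes[i] for i in np.argmax(scores, axis=0)])
```

The per-class discriminants are exactly what makes it possible to inspect which features matter for each class separately, as discussed in the insights below.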
K Nearest Neighbors
A custom k-NN algorithm was written for this method, and it was trained to find the optimal k among the odd values of k from 1 to 15.
Validation was done using a leave-one-out approach.
Neural Network
Matlab's neural-network tools were used for this algorithm, testing both one and two hidden layers,
with each layer tested with 5 to 10 units.
Validation is part of the toolbox's functionality, so no additional validation was done;
the MSE calculated during the learning process was used as the measure of performance for determining the best network
and feature-set configuration.
Feature Selection
"Well, it's just that I find Gene's cowbell playing distracting."
Forward feature selection was implemented in the learning process of all classification methods:
at first, the algorithm was run with each single feature as input.
The feature that performed best remained in the feature set, and the algorithm was tested again with each of the remaining
features as a second feature. This process repeated itself until the performance no longer improved by more than 0.5%.
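The selection loop itself is classifier-agnostic. Below is a Python sketch of the procedure, assuming a generic `score_fn` (for instance, leave-one-out accuracy); this is a simplification of the per-classifier Matlab implementations, not the original code:

```python
import numpy as np

def forward_select(X, y, score_fn, min_gain=0.005):
    """Greedy forward feature selection: each round, try adding every remaining
    feature, keep the one that scores best, and stop once the improvement
    over the previous round is 0.5% or less."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        trial = {f: score_fn(X[:, selected + [f]], y) for f in remaining}
        f_best = max(trial, key=trial.get)
        if trial[f_best] - best_score <= min_gain:
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = trial[f_best]
    return selected, best_score
```

Being greedy, the procedure can miss feature pairs that are only informative together, but it keeps the number of training runs linear in the number of features per round.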
Results And Performance
"...And I'd be doing myself a disservice and every member of this band, if I didn’t perform the hell out of this!"
The K-nearest neighbors gave the best results with k = 9. The selected features were Brightness, Irregularity,
Decay, MFCC 1, MFCC 2, MFCC 3 and MFCC 5.
The neural-network learning algorithm produced a 2 hidden layer network, with 9 and 7 units respectively.
The SVM learning algorithm found these features to be optimal:
The following table and graph show the accuracy of detection of the testing set:
Conclusions
"Guess what? I got a fever! And the only prescription.. is more cowbell!"
Random Insights
- It is interesting to see which features are selected for each of the SVM instances,
and to note that these features do indeed say a lot about the behavior of samples from each class.
Hi-hats, for example, are very short and high-pitched, and indeed "Decay" and "Pitch" are among the selected features.
Cymbal sounds, on the other hand, are usually very long, noisy and have plenty of high frequencies,
so we see "Decay", "Roughness" and "Brightness" selected.
- It is clear that classes that were less common in the training set, namely toms and claps, were detected less
accurately during testing. The extreme case is the toms: with only 12 instances in the training set, they had
negligible detection rates with both the k-NN and the neural-network algorithms. Furthermore, the performance of the neural network for each
class seems to depend heavily on the number of samples from that class present in the training set.
It is therefore important to have enough samples of each class, preferably with all classes equally represented.
- The apparent reason that the SVM algorithm is an exception is that it has the advantage of producing a separate decision
machine for each class, and thus has the opportunity to isolate each class's unique characteristics, even when a small
training set is introduced.
Possible Improvements
The following steps can be considered in order to improve recognition results:
- Extend and manually validate the training set - It is very likely that significantly better results could be achieved
by enlarging the training set. Also, some samples were noticed to be of dubious character; these, perhaps, should not be part
of the training set.
- Try temporal features approaches, namely HMM - Certain features, such as pitch and MFCC, can give better results when
calculated for small time frames. Moreover, their behavior over time can give more information about the sound's class.
- Implement multiple algorithms and majority selection - It is clear that some algorithms are better at detecting
certain classes than others. In fact, for every class there is an algorithm that performs very well. Applying all the algorithms
and taking a joint decision might yield better performance.
Future Work
As stated earlier, this work can serve as the framework of a system with stronger capabilities, such as:
- Melodic instrument samples - Should these algorithms mature to a high degree of reliability, it would be interesting
to see whether they can be applied to melodic instruments.
- Naive retrieval methods - Having a database of sounds and their features provides the user with new ways to retrieve sounds.
The user would be able to select a sound and "stroll around" that sound's neighborhood, knowing that the sounds there are
close to the one selected.
- Subjective feature extraction - Having the user define new features, and figuring out how these features are
reflected in the existing ones, could give fascinating results. Suppose a user teaches the machine which sounds are "duller" than
others, "sharper", more "soothing", or "yellow". Musicians could then retrieve sounds from the database based on their subjective definitions.
Final project presentation (.pdf)
Project proposal presentation (.pdf)