Music Classification (Artist ID, Album ID, Genre ID, Style ID)

Music information retrieval (MIR) is the science of extracting and organizing metadata in music. MIR helps people organize their music, allows libraries to automatically index their collections, and encourages musicians, labels and record stores to seek out new audiences. Some facets of MIR are "easy problems:" if you sort your MP3s by Artist in iTunes you are performing a retrieval task-- but the interesting applications all require a notion of computational music intelligence, in which a system makes a prediction about what is in an audio signal. Can a computer tell you if you'll like a song, or what kind of music it is, or which instruments are playing, or how people describe it?

For this project you'll want to think about how machine learning can be useful for music retrieval. Audio-based MIR work starts with a glut of data and only gets worse with scale and time. Your average song, uncompressed, is 40 megabytes of data; the average personal collection is a thousand times that. Both the features you calculate from these songs and the machine learning apparatus you use to make sense of them need to do a good job of finding what's important in an audio signal and of considering only the data that informs the task.

Below you'll find audio-derived features for 4,000 songs covering over 200 artists, along with brief edited metadata for each song. There are a few suggested tasks, of which you'll probably want to concentrate on just one. Each is well studied in the (recent) literature, and you can find papers in various MIR and multimedia conference proceedings for guidelines and example results. I've included a few of our own group's papers below, mostly concentrating on the artist ID problem.


Feature Extraction

Here are the features zipped up in a 300MB file.

We have extracted for you two types of audio features for 4,000 songs by 208 currently popular artists. This frees you to think about the classification problem and not the machine representation. The features we provide are influenced by speech research and perform relatively well in most music intelligence tasks, although there is still room for much improvement. If you have an idea for a feature and can provide me with matlab or UNIX-compatible code that returns a set of feature vectors from a sound file or signal, I might be able to run it on our testbed.

The first feature is the well-known MFCC (Mel Frequency Cepstral Coefficients), used widely in speech recognition and identification. The base analysis rate is 5 Hz, so for a two-minute song there will be up to 600 vectors of 13-dimensional data. You can read up on MFCCs [4] if you'd like, but suffice it to say they try to emulate the ear's natural response to sound and then perform a DCT for decorrelation.

The second feature is 'modulation cepstra,' or Penny, which is described in detail in [5]. Penny is the FFT of the MFCC: how the MFCC response changes over time. The first bands of Penny indicate high-level structure (events, beats) while higher dimensions correlate with timbral properties of the sound ("the sound of the sound"). This implementation of Penny gives you 26 dimensions per modulation frame.
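
If you want a feel for what a modulation-style feature is doing, here is a minimal matlab sketch over MFCC frames. This is not the exact Penny implementation; the window length, hop, and number of modulation bins kept are all assumptions, and the path is a placeholder.

 % Rough modulation-spectrum sketch, NOT the exact Penny feature.
 M = libread('some_song.mfcc_5hz');           % placeholder path; 13 x nFrames MFCC matrix
 win  = 32;                                   % frames per modulation window (~6.4 s at 5 Hz, assumed)
 nWin = floor(size(M, 2) / win);
 feats = zeros(size(M, 1) * 4, nWin);
 for w = 1:nWin
     seg  = M(:, (w-1)*win+1 : w*win);        % one block of MFCC frames
     spec = abs(fft(seg, [], 2));             % how each cepstral band modulates over time
     lowbins = spec(:, 1:4);                  % keep only the slowest modulation bins
     feats(:, w) = lowbins(:);
 end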

Don't blindly trust the features: if your results aren't great, try omitting dimensions or creating delta dimensions for "time-delay embedding." You could also try cepstral mean normalisation (CMN).
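
For instance, CMN and delta dimensions are each a line or two of matlab. A minimal sketch, assuming M is the dimensions-by-frames matrix returned by libread (the subset of dimensions kept at the end is an arbitrary example):

 M_cmn  = M - repmat(mean(M, 2), 1, size(M, 2));   % cepstral mean normalisation
 deltas = [zeros(size(M, 1), 1), diff(M, 1, 2)];   % frame-to-frame differences ("delta" dims)
 M_aug  = [M_cmn; deltas];                         % stack deltas as extra dimensions
 keep    = [1:8, 14:21];                           % arbitrary example subset of dimensions
 M_small = M_aug(keep, :);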

Textual Features

I've also included a set of optional textual features for each album. Community reaction to a signal is just as important as the signal itself, so if you are interested in the relationship between the audio and the community, you could do something interesting with both. For ideas check [5], [6], [7], [8], [9].

Dataset Labelling and Use

Each song in the dataset has two files associated with it, one for each feature. The MFCC features have the extension .mfcc_5hz while the Penny features have .penny_5hz_new2. These files are binary dumps and can be read with the following matlab script: libread.m. To read in a file, simply do:

 M = libread('./Akufen/My Way/05. Wet Floors.mp3.penny_5hz_new2');

And M will have your 26 x 193 matrix: dimensions by frames. If you're not using matlab, look at the source of libread.m or this C snippet to see how to read the files in.
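
Songs have different numbers of frames, so for frame-agnostic classifiers you may want to collapse each file into one fixed-length vector first. A minimal sketch, assuming per-dimension means and standard deviations are enough for your task:

 M = libread('./Akufen/My Way/05. Wet Floors.mp3.penny_5hz_new2');
 songvec = [mean(M, 2); std(M, 0, 2)];   % 52 x 1 summary: per-dimension mean and std over frames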

Artist and Album Metadata

The artist name, album name, and song title metadata are encoded in the directory path. It's simply Artist Name/Album Name/Song Title.feature.
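
A minimal sketch of pulling the labels back out of a path, assuming relative paths of the form shown above:

 fn     = './Akufen/My Way/05. Wet Floors.mp3.penny_5hz_new2';
 parts  = regexp(fn, '/', 'split');   % {'.', 'Akufen', 'My Way', '05. Wet Floors.mp3.penny_5hz_new2'}
 artist = parts{2};                   % 'Akufen'
 album  = parts{3};                   % 'My Way'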

Genre and Style Metadata

The genre list for each album is encoded as a text file here. The style list for each album is here. These list the album name (the directory) and then the corresponding genre or style tags, with a tab character preceding each one.
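
A minimal sketch of reading one of these files in matlab; the filename 'genres.txt' is a placeholder for whatever the linked file is actually called:

 fid = fopen('genres.txt');                    % placeholder name for the linked genre (or style) file
 albums = {}; labels = {};
 line = fgetl(fid);
 while ischar(line)
     fields = regexp(line, '\t', 'split');     % album directory, then its tab-separated tags
     albums{end+1} = fields{1};
     labels{end+1} = fields(2:end);
     line = fgetl(fid);
 end
 fclose(fid);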

Community Metadata

If you want it, I've also placed community metadata vectors [9] for each album in its directory. There will be a BOTH.adj.tfidf and a BOTH.np.tfidf file, which contain the TF-IDF values for adjective and noun phrases, respectively, for that album. The fields look like
 2.94117647058824 0.0240631767768468 122.227272727273 revelatory
 2.94117647058824 0.0262507383020147 112.041666666667 astonishing
 2.94117647058824 0.0284382998271826 103.423076923077 coherent
which is TF, DF, TF-IDF, and the term.
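
A minimal sketch of reading one of these files in matlab and pulling out the strongest terms. The path is an assumed example, and the simple %s format only handles single-word terms; multi-word noun phrases would need a line-by-line read.

 fid = fopen('./Akufen/My Way/BOTH.adj.tfidf');   % assumed location inside an album directory
 C = textscan(fid, '%f %f %f %s');                % TF, DF, TF-IDF, term (single-word terms only)
 fclose(fid);
 tfidf = C{3}; terms = C{4};
 [~, order] = sort(tfidf, 'descend');             % strongest adjectives first
 terms(order(1:5))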

Tasks

There are three recommended tasks: artist ID, genre ID, and style ID. Artist ID is by far the most challenging but the least practical. Genre ID suffers from ill-defined ground truth but can give encouraging results. Style ID is more helpful for recommendation and similarity and has multiple truth labels per observation.

Artist ID

When you hear a new song by an artist you're familiar with, how long does it take you to recognize who it is? What if they've changed their style completely, or switched producers, or aged 15 years?

We are interested in the artist ID problem because it seems to hold the key to many computational music perception problems. If a system can reliably identify the artist of a song it hasn't heard yet, then the learning algorithm and feature space can be bootstrapped into performing more applicable Music-IR tasks such as genre/style ID, robust song ID, segmentation, summarization, etc. We have found artist ID to be hard but not impossible, and enormously dependent on a good time-aware learning algorithm with some musical knowledge.

The most common way to train an artist ID system is to create a set of n discriminant classifiers, one for each artist. You train each classifier on the positive examples by that artist and a random set of music not by that artist. You then test your classifiers on a held-out test set of music by the same artist, hopefully across albums and time periods, to make a robust discriminator [1]. You'll run into a few problems along the way; a sketch of the basic one-vs-rest setup follows.
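
A minimal sketch of that setup, with a toy positive-vs-negative centroid score standing in for whatever real discriminant (SVM, neural net, GMM, ...) you train per artist. X, y, and xtest are assumptions: per-song feature vectors, artist indices 1..n, and one held-out song vector.

 % X: nSongs x nDims per-song features; y: artist index per song; xtest: 1 x nDims held-out song.
 nArtists = max(y);
 scores   = zeros(nArtists, 1);
 for a = 1:nArtists
     pos    = X(y == a, :);                       % music by artist a
     negidx = find(y ~= a);
     negidx = negidx(randperm(numel(negidx)));    % shuffle everything else
     neg    = X(negidx(1:size(pos, 1)), :);       % random negatives, same count as positives
     % toy "discriminant": is the held-out song closer to the positive or negative centroid?
     scores(a) = sum((xtest - mean(neg, 1)).^2) - sum((xtest - mean(pos, 1)).^2);
 end
 [~, predicted_artist] = max(scores);             % the most confident per-artist machine wins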

We've had luck with "classifying by combining," or "anchor models," in which you first train a set of 'little machines,' each of which answers a question about music. For example, you could train a machine that tells you if the song is rock or classical, or if the singer is male or female. You then take the output of each of those machines and treat that as your new feature space, upon which you train a normal 1-in-n artist ID classifier. This forces your features to reflect semantic dimensions instead of purely statistical ones. See for example [7], where we derived those little machines automatically from the community metadata text: i.e., we let the listeners tell us which were the most important discriminative qualities of music.
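
A minimal sketch of the anchor-model idea; the anchor score functions named here are placeholders for little machines you would have trained yourself, and X is the assumed matrix of per-song feature vectors.

 % anchors: K 'little machines', each mapping a song vector to one score.
 % These function names are placeholders, not provided code.
 anchors = {@rock_vs_classical_score, @male_vs_female_score, @electronic_score};
 K = numel(anchors);
 A = zeros(size(X, 1), K);                 % the new, semantic feature space
 for i = 1:size(X, 1)
     for k = 1:K
         A(i, k) = anchors{k}(X(i, :));    % each song becomes a K-dimensional anchor vector
     end
 end
 % Now train the usual 1-in-n artist ID system on A instead of X.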

Genre ID

Pretend you are running a record store and want to have the computer tell you which CDs are "Rock" and which are "Electronic" so that you can nicely categorize your stock. [I personally despise the Genre ID task because we end up teaching computers about marketing constructs instead of meaningful musical characteristics. Ask four people which genre Bjork is in and you will get back four different answers. What is the difference between Rock and Pop anyway?] It's a worthwhile exercise because it works, for the most part-- and it is immediately useful for listeners. Genre ID is done in a very similar fashion to artist ID, except that you'll find you need a lot less data to get good results, and for most learning systems too much data will make them perform poorly. The genre ground truth labels are in the files listed above. It would be interesting to see what the cutoff / elbow is for information density vs. genre classifier performance; a quick way to probe that is sketched below.
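
A minimal sketch of probing that elbow, assuming Xtrain/ytrain/Xtest/ytest splits and a train_and_score helper that wraps whatever genre classifier you build (that helper is not provided):

 % Vary how much training data the genre classifier sees and watch where accuracy levels off.
 sizes = [50 100 200 400 800 1600];
 acc   = zeros(size(sizes));
 for i = 1:numel(sizes)
     shuffled = randperm(size(Xtrain, 1));
     subset   = shuffled(1:sizes(i));
     acc(i)   = train_and_score(Xtrain(subset, :), ytrain(subset), Xtest, ytest);
 end
 plot(sizes, acc, 'o-'); xlabel('training songs'); ylabel('genre ID accuracy');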

Style ID

Styles are a bit more expressive and reflect just a few artists, like "Math Rock" or "Adult Contemporary Jazz Fusion." The task is just like Genre ID, but there are more truth labels and they are fine-grained. Some styles will only match up with one artist, which might cause a bias problem. Also, a single album will resolve to more than one truth label, which your classifier could take advantage of (see the sketch below). The labels are in the style file listed above.
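
A minimal sketch of turning those multiple tags into a binary album-by-style label matrix, assuming albums and labels are the cell arrays read from the style file above:

 allstyles = unique([labels{:}]);                  % every style tag seen in the dataset
 Y = zeros(numel(albums), numel(allstyles));       % albums x styles indicator matrix
 for i = 1:numel(albums)
     Y(i, :) = ismember(allstyles, labels{i});     % 1 wherever the album carries that style
 end
 % Train one binary classifier per column of Y; an album can be positive for several styles.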

Other Tasks You Could Do With This Data

Good luck!

Feel free to contact me (bwhitman@media) with any questions.


References


Brian Whitman - bwhitman@media.mit.edu