Artist Classification of Music
When you hear a new song by an artist you're familiar with, how long does it
take for you to recognize the artist? What if the artist has changed their style
completely, or switched producers, or has aged 15 years?
We are interested in the Artist ID problem for a number of reasons, but
mostly because it seems to hold the key to many computational music perception
problems. If a system can reliably identify the artist of a song it hasn't heard
yet, then the learning algorithm and feature space can be bootstrapped into
performing more applicable Music-IR tasks such as genre/style ID, robust song
ID, segmentation, summarization, etc. We have found artist ID to be hard but not
impossible, and enormously dependent on a good time-aware learning algorithm
with some musical knowledge.
(Feel free to contact me (bwhitman@media) with any questions)
***Dataset will be provided soon***
Feature Extraction
We provide three sets of audio-derived features for
this music understanding task:
- A power spectral density (PSD) estimate of each tenth of a song. A PSD is
roughly the mean of the STFT magnitudes across analysis windows, normalized by
some window-dependent constant. We use an STFT bin size of 1024, which returns
a vector of 513 FFT magnitudes. The intended effect is to gauge the "sound of
the sound": a spectral fingerprint of audio aboutness. (A PSD sketch follows
this list.)
- MFCC vectors with a 4-second window. For each 4 seconds of audio we
compute the Mel Frequency Cepstral Coefficients, widely used in speech
analysis as a pitch-independent measure of audio. We have reason to believe
that the DCT-estimation step of 'cepstra,' combined with the psychoacoustic
scaling of the frequency axis, will be a better spectral aboutness measure
than the flat PSD. (An MFCC sketch also follows this list.)
- HMM-derived state paths from a generic MPEG-7 music model. The model can
be obtained from [5] if you wish to examine the transition probabilities, but
for our purposes you can consider the HMM pre-process a 'black box.' The model
assigns one of 20 states to each 1/100th of a second of audio.
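To make the PSD feature concrete, here is a minimal Matlab sketch of one PSD
estimate per tenth of a song. It assumes x is a mono signal stored as a column
vector; the exact window and normalization constant we use may differ, so
treat this as an illustration rather than the extraction code itself.

  % Sketch: one 513-point PSD estimate per tenth of a song.
  % Assumes x is a mono signal as a column vector.
  nfft  = 1024;                                  % STFT bin size
  win   = 0.5 - 0.5*cos(2*pi*(0:nfft-1)'/nfft);  % Hann window
  tenth = floor(length(x) / 10);
  PSD   = zeros(10, nfft/2 + 1);
  for i = 1:10
      seg  = x((i-1)*tenth + 1 : i*tenth);
      nwin = floor(length(seg) / nfft);
      mags = zeros(nwin, nfft/2 + 1);
      for w = 1:nwin
          frame      = seg((w-1)*nfft + 1 : w*nfft) .* win;
          spec       = abs(fft(frame));
          mags(w, :) = spec(1:nfft/2 + 1)';
      end
      PSD(i, :) = mean(mags, 1) / nfft;          % crude window-dependent scaling
  end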
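Likewise, a simplified mel-cepstrum sketch for a single 1024-sample frame. Our
real MFCCs summarize a full 4-second window, and the filter count and mel
formula here are textbook defaults rather than our exact parameters; sr is the
sample rate.

  % Sketch: 13 mel cepstral coefficients from one 1024-sample frame.
  % Assumes frame is a 1024 x 1 audio frame and sr is the sample rate.
  nfft = 1024; nmel = 26;
  mel  = @(f) 1127 * log(1 + f/700);             % Hz -> mel
  imel = @(m) 700 * (exp(m/1127) - 1);           % mel -> Hz
  edges = imel(linspace(mel(0), mel(sr/2), nmel + 2));
  bins  = floor(edges / sr * nfft) + 1;          % filter edges as FFT bins
  fb    = zeros(nmel, nfft/2 + 1);               % triangular mel filterbank
  for m = 1:nmel
      for k = bins(m):bins(m+1)
          fb(m, k) = (k - bins(m)) / max(bins(m+1) - bins(m), 1);
      end
      for k = bins(m+1):bins(m+2)
          fb(m, k) = (bins(m+2) - k) / max(bins(m+2) - bins(m+1), 1);
      end
  end
  spec = abs(fft(frame, nfft));                  % magnitude spectrum
  logE = log(fb * spec(1:nfft/2 + 1) + eps);     % log mel-band energies
  D    = cos(pi * (0:12)' * ((1:nmel) - 0.5) / nmel);  % DCT-II basis
  cep  = D * logE;                               % the 13 MFCCs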
Dataset Information
Our feature set has been extracted from the NECI
Minnowmatch Testbed, a set of roughly 10,000 songs from the 1,000 most popular
albums on the OpenNap peer-to-peer network in August 2001. (The albums were
purchased, not downloaded. We can't distribute the source files, sorry.) Because
of the popularity constraint, the songs in this collection stay mostly within
the pop and rock genres.
Dataset Labelling and Use
We are providing data for each song in a .MAT
Matlab file, named like so:
Artist##Album##Track_Number##Song_Title##(Disc hash or other information).mp3.mat
This is the only encoding of per-song textual metadata. Loading a file with a
Matlab command like:
arto = load('Arto_Lindsay##Noon_Chill##13##Reentry##NO.mp3.mat');
returns a struct containing the following fields:
- arto.PSD: the 10 x 513 matrix of PSD information
- arto.cep: the n x 13 matrix of MFCC coefficients
- arto.Path: the 1 x n path of HMM states
Now you're free to do what you like with each matrix.
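For example, here is a minimal sketch that walks a directory of these files,
pulls the artist name back out of the filename, and stacks one mean-PSD row
per song (the variable names are our own):

  % Sketch: build a songs x 513 feature matrix plus artist labels.
  files  = dir('*.mp3.mat');
  X      = zeros(length(files), 513);
  labels = cell(1, length(files));
  for i = 1:length(files)
      s         = load(files(i).name);
      X(i, :)   = mean(s.PSD, 1);                % average the 10 tenths
      labels{i} = strtok(files(i).name, '#');    % artist = first ##-field
  end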
Problems
- "The Producer Effect" (aka "The Album Effect"): an artist ID classifier
might learn the production (EQ, compression, sound field) rather than the
actual music. A good way to test against this is to train and test
cross-album; that is, make sure your system can identify an artist outside of
the album you trained on. Even better, use a live CD (we have a few in the
Mima testbed.)
- "The Arto Effect": very quiet artists (low energy variance) will perform
horribly put up against loud artists in the same artist space. To get around
this, you may need feature pre-processing systems that normalize the input
space.
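Because the filename also carries the album name, a cross-album split is easy
to set up: hold every song from one album out of training. A sketch, reusing
the files bookkeeping from the loading example above (note that album titles
can collide across artists, so the key below combines artist and album):

  % Sketch: split train/test by album rather than by song.
  albums = cell(1, length(files));
  for i = 1:length(files)
      [artist, rest] = strtok(files(i).name, '#');
      albums{i} = [artist '##' strtok(rest, '#')];  % artist##album key
  end
  holdout = strcmp(albums, albums{1});     % hold out the first album
  train   = find(~holdout);                % train on all other albums
  test    = find(holdout);                 % test cross-album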
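And for the Arto Effect, a sketch of the simplest such pre-processing: z-score
each feature dimension on the training set so quiet artists and loud artists
land in comparable ranges (X and train as in the sketches above):

  % Sketch: per-dimension z-scoring fit on the training rows only.
  mu = mean(X(train, :), 1);
  sd = std(X(train, :), 0, 1) + eps;       % eps guards constant dimensions
  Xn = (X - repmat(mu, size(X, 1), 1)) ./ repmat(sd, size(X, 1), 1);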
Thoughts
- Can we encode the all-important time domain? Preliminary studies show that
encoding time (either with a time-delay network or more explicitly) helps an
artist classifier.
- Investigate self-similarity within a feature space... perhaps knowing the
song structure (from clustering or autocorrelation) can help; see the
self-similarity sketch after this list.
- Get more musical: hard with this feature space (time is too blurred), but
perhaps devise a beat tracker on the state paths.
- Do more than artist ID: we see from the anchor model paper [2] that the
outputs of multiple classifiers (like the SVM outputs we use in [1]) can also
be used to do content-based music similarity. Also, how about song ID? Genre ID?
- Kernel methods: what's the best kernel for a discrete sequence like the
state paths? What sort of statistics can work? A bigram-histogram starting
point is sketched after this list.
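On the self-similarity thought, one common starting point (not necessarily
what [5] does internally) is a frame-by-frame cosine similarity matrix over a
song's MFCCs; repeated sections show up as off-diagonal blocks:

  % Sketch: n x n cosine self-similarity over a song's MFCC frames.
  cep = arto.cep;                              % n x 13, from the .MAT file
  nrm = sqrt(sum(cep.^2, 2)) + eps;
  C   = cep ./ repmat(nrm, 1, size(cep, 2));   % unit-length frames
  S   = C * C';                                % self-similarity matrix
  imagesc(S); colormap(gray);                  % blocks suggest song structure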
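And for the kernel question, one simple baseline (our illustration, not a
result from the references) is to map each discrete state path to a bigram
histogram, which any standard vector kernel can then consume:

  % Sketch: normalized state-bigram histogram of a 20-state HMM path.
  p = arto.Path;                           % 1 x n state sequence
  K = 20;
  h = zeros(K, K);
  for t = 1:length(p) - 1
      h(p(t), p(t+1)) = h(p(t), p(t+1)) + 1;
  end
  h = h(:)' / max(sum(h(:)), 1);           % fixed-length 1 x 400 feature
  % e.g. an RBF kernel between two such histograms ha, hb:
  % k = exp(-sum((ha - hb).^2) / (2 * sigma^2));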
References
1. Whitman, Brian, Gary Flake, and Steve Lawrence. "Artist Detection in Music
with Minnowmatch." In Proceedings of the 2001 IEEE Workshop on Neural
Networks for Signal Processing, pp. 559-568. Falmouth, Massachusetts,
September 10-12, 2001.
2. Berenzweig, Adam, Daniel P.W. Ellis, and Steve Lawrence. "Anchor Models for
Artist Classification and Similarity Measurement of Music." 2002.
3. Berenzweig, Adam, Daniel P.W. Ellis, and Steve Lawrence. "Using Voice
Segments to Improve Artist Classification of Music." To appear in the AES
22nd International Conference on Virtual, Synthetic and Entertainment Audio,
2002.
4. Kim, Youngmoo and Brian Whitman. "Singer Identification in Popular Music
Recordings Using Voice Coding Features." In Proceedings of the 3rd
International Conference on Music Information Retrieval, Paris, France,
October 13-17, 2002.
5. MusicStructure.org - MPEG-7 segmentation demos.
6. Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modelling." In
Proceedings of the 1st International Conference on Music Information
Retrieval, 2000.
Brian Whitman - bwhitman@media.mit.edu