Artist Classification of Music
When you hear a new song by an artist you're familiar with, how long does it
take for you to recognize the artist? What if the artist has changed their style
completely, or switched producers, or has aged 15 years?
We are interested in the Artist ID problem for a number of reasons, but
mostly because it seems to hold the key to many computational music perception
problems. If a system can reliably identify the artist of a song it hasn't heard
yet, then the learning algorithm and feature space can be bootstrapped into
performing more applicable Music-IR tasks such as genre/style ID, robust song
ID, segmentation, summarization, etc. We have found artist ID to be hard but not
impossible, and enormously dependent on a good time-aware learning algorithm
with some musical knowledge.
(Feel free to contact me (bwhitman@media) with any questions)
***Dataset will be provided soon***
Feature Extraction
We provide three sets of audio-derived features for
this music understanding task:
- A power spectral density (PSD) estimate of each tenth of a song. A PSD is
roughly the mean of the STFT magnitudes across analysis windows, normalized by
some window-dependent constant. We use an STFT bin size of 1024, which returns
a vector of 513 FFT magnitudes. The intended effect is to gauge the "sound of
the sound": a spectral fingerprint of audio aboutness. (A PSD sketch follows
this list.)
- MFCC vectors with a 4-second window. For each 4 seconds of audio we
compute the Mel Frequency Cepstral Coefficients, widely used in speech
analysis as a pitch-independent measure of audio. We have reason to believe
that the DCT-estimation step of 'cepstra,' combined with the psychoacoustic
scaling of the frequency axis, will be a better spectral aboutness measure
than the flat PSD. (An MFCC sketch also follows this list.)
- HMM-derived state paths from a generic MPEG-7 music model. The model can
be obtained from [5] if you wish to examine the transition probabilities, but
for our purposes you can consider the HMM pre-process a 'black box.' The model
assigns one of 20 states to each 1/100th of a second of audio.
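To make the PSD feature concrete, here is a minimal Matlab sketch of one PSD
estimate per tenth of a song. It assumes x is a mono signal stored as a column
vector; the exact window and normalization constant we use may differ, so
treat this as an illustration rather than the extraction code itself.

  % Sketch: one 513-point PSD estimate per tenth of a song.
  % Assumes x is a mono signal as a column vector.
  nfft  = 1024;                                  % STFT bin size
  win   = 0.5 - 0.5*cos(2*pi*(0:nfft-1)'/nfft);  % Hann window
  tenth = floor(length(x) / 10);
  PSD   = zeros(10, nfft/2 + 1);
  for i = 1:10
      seg  = x((i-1)*tenth + 1 : i*tenth);
      nwin = floor(length(seg) / nfft);
      mags = zeros(nwin, nfft/2 + 1);
      for w = 1:nwin
          frame      = seg((w-1)*nfft + 1 : w*nfft) .* win;
          spec       = abs(fft(frame));
          mags(w, :) = spec(1:nfft/2 + 1)';
      end
      PSD(i, :) = mean(mags, 1) / nfft;          % crude window-dependent scaling
  end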
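Likewise, a simplified mel-cepstrum sketch for a single 1024-sample frame. Our
real MFCCs summarize a full 4-second window, and the filter count and mel
formula here are textbook defaults rather than our exact parameters; sr is the
sample rate.

  % Sketch: 13 mel cepstral coefficients from one 1024-sample frame.
  % Assumes frame is a 1024 x 1 audio frame and sr is the sample rate.
  nfft = 1024; nmel = 26;
  mel  = @(f) 1127 * log(1 + f/700);             % Hz -> mel
  imel = @(m) 700 * (exp(m/1127) - 1);           % mel -> Hz
  edges = imel(linspace(mel(0), mel(sr/2), nmel + 2));
  bins  = floor(edges / sr * nfft) + 1;          % filter edges as FFT bins
  fb    = zeros(nmel, nfft/2 + 1);               % triangular mel filterbank
  for m = 1:nmel
      for k = bins(m):bins(m+1)
          fb(m, k) = (k - bins(m)) / max(bins(m+1) - bins(m), 1);
      end
      for k = bins(m+1):bins(m+2)
          fb(m, k) = (bins(m+2) - k) / max(bins(m+2) - bins(m+1), 1);
      end
  end
  spec = abs(fft(frame, nfft));                  % magnitude spectrum
  logE = log(fb * spec(1:nfft/2 + 1) + eps);     % log mel-band energies
  D    = cos(pi * (0:12)' * ((1:nmel) - 0.5) / nmel);  % DCT-II basis
  cep  = D * logE;                               % the 13 MFCCs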
Dataset Information
Our feature set has been extracted from the NECI
Minnowmatch Testbed, a set of roughly 10,000 songs from the 1,000 most popular
albums on the OpenNap peer-to-peer network in August 2001. (The albums were
purchased, not downloaded. We can't distribute the source files, sorry.) Because
of the popularity constraint, the songs in this collection stay mostly within
the pop and rock genres.
Dataset Labelling and Use
We are providing data for each song in a .MAT
Matlab file, named like so:
Artist##Album##Track_Number##Song_Title##(Disc hash or other information).mp3.mat
This is the only encoding of per-song textual metadata. Loading a file with a
Matlab command like:
arto = load('Arto_Lindsay##Noon_Chill##13##Reentry##NO.mp3.mat');
returns a struct containing the following fields:
- arto.PSD: the 10 x 513 matrix of PSD information
- arto.cep: the n x 13 matrix of MFCC coefficients
- arto.Path: the 1 x n path of HMM states
Now you're free to do what you like with each matrix.
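For example, here is a minimal sketch that walks a directory of these files,
pulls the artist name back out of the filename, and stacks one mean-PSD row
per song (the variable names are our own):

  % Sketch: build a songs x 513 feature matrix plus artist labels.
  files  = dir('*.mp3.mat');
  X      = zeros(length(files), 513);
  labels = cell(1, length(files));
  for i = 1:length(files)
      s         = load(files(i).name);
      X(i, :)   = mean(s.PSD, 1);                % average the 10 tenths
      labels{i} = strtok(files(i).name, '#');    % artist = first ##-field
  end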
Problems
- "The Producer Effect" (aka "The Album Effect"): an artist ID classifier
might learn the production (EQ, compression, sound field) rather than the
actual music. A good way to test against this is to train and test
cross-album; that is, make sure your system can identify an artist outside of
the album you trained on. Even better, use a live CD (we have a few in the
Mima testbed.)
- "The Arto Effect": very quiet artists (low energy variance) will perform
horribly put up against loud artists in the same artist space. To get around
this, you may need feature pre-processing systems that normalize the input
space.
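Because the filename also carries the album name, a cross-album split is easy
to set up: hold every song from one album out of training. A sketch, reusing
the files bookkeeping from the loading example above (note that album titles
can collide across artists, so the key below combines artist and album):

  % Sketch: split train/test by album rather than by song.
  albums = cell(1, length(files));
  for i = 1:length(files)
      [artist, rest] = strtok(files(i).name, '#');
      albums{i} = [artist '##' strtok(rest, '#')];  % artist##album key
  end
  holdout = strcmp(albums, albums{1});     % hold out the first album
  train   = find(~holdout);                % train on all other albums
  test    = find(holdout);                 % test cross-album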
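And for the Arto Effect, a sketch of the simplest such pre-processing: z-score
each feature dimension on the training set so quiet artists and loud artists
land in comparable ranges (X and train as in the sketches above):

  % Sketch: per-dimension z-scoring fit on the training rows only.
  mu = mean(X(train, :), 1);
  sd = std(X(train, :), 0, 1) + eps;       % eps guards constant dimensions
  Xn = (X - repmat(mu, size(X, 1), 1)) ./ repmat(sd, size(X, 1), 1);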
Thoughts
- Can we encode the all-important time domain? Preliminary studies show that
encoding time (either with a time-delay network or more explicitly) helps an
artist classifier.
- Investigate self-similarity within a feature space... perhaps knowing the
song structure (from clustering or autocorrelation) can help; see the
self-similarity sketch after this list.
- Get more musical: hard with this feature space (time is too blurred), but
perhaps devise a beat tracker on the state paths.
- Do more than artist ID: we see from the anchor model paper [2] that the
outputs of multiple classifiers (like the SVM outputs we use in [1]) can also
be used to do content-based music similarity. Also, how about song ID? Genre ID?
- Kernel methods: what's the best kernel for a discrete sequence like the
state paths? What sort of statistics can work? A bigram-histogram starting
point is sketched after this list.
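On the self-similarity thought, one common starting point (not necessarily
what [5] does internally) is a frame-by-frame cosine similarity matrix over a
song's MFCCs; repeated sections show up as off-diagonal blocks:

  % Sketch: n x n cosine self-similarity over a song's MFCC frames.
  cep = arto.cep;                              % n x 13, from the .MAT file
  nrm = sqrt(sum(cep.^2, 2)) + eps;
  C   = cep ./ repmat(nrm, 1, size(cep, 2));   % unit-length frames
  S   = C * C';                                % self-similarity matrix
  imagesc(S); colormap(gray);                  % blocks suggest song structure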
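And for the kernel question, one simple baseline (our illustration, not a
result from the references) is to map each discrete state path to a bigram
histogram, which any standard vector kernel can then consume:

  % Sketch: normalized state-bigram histogram of a 20-state HMM path.
  p = arto.Path;                           % 1 x n state sequence
  K = 20;
  h = zeros(K, K);
  for t = 1:length(p) - 1
      h(p(t), p(t+1)) = h(p(t), p(t+1)) + 1;
  end
  h = h(:)' / max(sum(h(:)), 1);           % fixed-length 1 x 400 feature
  % e.g. an RBF kernel between two such histograms ha, hb:
  % k = exp(-sum((ha - hb).^2) / (2 * sigma^2));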
References
1. Whitman, Brian, Gary Flake, and Steve Lawrence. "Artist Detection in Music
with Minnowmatch." In Proceedings of the 2001 IEEE Workshop on Neural
Networks for Signal Processing, pp. 559-568. Falmouth, Massachusetts,
September 10-12, 2001.
2. Berenzweig, Adam, Daniel P.W. Ellis, and Steve Lawrence. "Anchor Models for
Artist Classification and Similarity Measurement of Music." 2002.
3. Berenzweig, Adam, Daniel P.W. Ellis, and Steve Lawrence. "Using Voice
Segments to Improve Artist Classification of Music." To appear in the AES
22nd International Conference on Virtual, Synthetic and Entertainment Audio,
2002.
4. Kim, Youngmoo and Brian Whitman. "Singer Identification in Popular Music
Recordings Using Voice Coding Features." In Proceedings of the 3rd
International Conference on Music Information Retrieval, Paris, France,
October 13-17, 2002.
5. MusicStructure.org - MPEG-7 segmentation demos.
6. Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modelling." In
Proceedings of the 1st International Conference on Music Information
Retrieval, 2000.
Brian Whitman - bwhitman@media.mit.edu