|
I have been interested in applying machine learning techniques to classify audio. What work has been done on finding good features? I have some background in music theory and signal processing. Would features used in speech recognition transfer well?
|
There's been a substantial amount of work done on this topic. Much of it is in industry, but there are plenty of published papers; a Google Scholar search for 'audio fingerprinting' will get you off and running. A good starting point is the survey paper by Pedro Cano and others. One of my favorite approaches is described by Avery Wang in his paper on how Shazam works. It's devilishly simple: 1) identify peaks in the spectrogram, 2) build a hash keyed on pairs of peaks, 3) profit.

In general, speech recognition features (MFCC, PLP) don't translate especially well to this task. First, they deliberately discard pitch in order to be robust to speaker differences, and pitch is clearly important for music recognition. Second, the harmonic structure of music, with many formants across multiple instruments, is dramatically more complicated than that of speech, which comes from a single instrument, the voice. To accommodate this complexity you need higher-order coefficients, but at higher orders those coefficients become increasingly unreliable.
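For concreteness, here is a minimal sketch of the peak-pair idea in Python. This is not Wang's actual implementation; scipy is my choice of library here, and the neighborhood size, magnitude threshold, and fan-out are illustrative parameters:

```python
import numpy as np
from scipy import signal
from scipy.ndimage import maximum_filter

def spectrogram_peaks(samples, rate, neighborhood=20, min_magnitude=10.0):
    """Return (time, frequency) index pairs of local maxima in the spectrogram."""
    freqs, times, sxx = signal.spectrogram(samples, fs=rate)
    # A bin is a peak if it equals the maximum over its local neighborhood
    # and exceeds a magnitude floor (both thresholds are illustrative).
    local_max = maximum_filter(sxx, size=neighborhood) == sxx
    peaks = np.argwhere(local_max & (sxx > min_magnitude))
    # argwhere yields (freq_index, time_index) rows; sort by time.
    return sorted((t, f) for f, t in peaks)

def fingerprint(peaks, fan_out=5):
    """Hash each peak against a few of its successors: (f1, f2, dt) -> t1."""
    hashes = {}
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            hashes[(f1, f2, t2 - t1)] = t1
    return hashes
```

Matching a query clip then amounts to counting how many of its hash keys collide with a stored track's keys at a consistent time offset. The scheme is robust to noise because only the strongest spectrogram points survive into the fingerprint.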
|
Features based on rhythm and tone should probably be added as well, since I think speech recognition puts a lot of emphasis on the pitch rather than the rhythm of the words. Something like an HMM or CRF might be useful for modeling the resulting feature sequences, but I am not quite sure.
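If it helps, here is a hedged sketch of what rhythm and tone features might look like in practice. librosa is my choice of library, not one named in this thread, and "clip.wav" is a placeholder path:

```python
import librosa

# Load a short audio clip ("clip.wav" is a placeholder).
y, sr = librosa.load("clip.wav")

# Rhythm: onset strength envelope and an estimated global tempo.
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
tempo, beats = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)

# Tone: chroma features summarize spectral energy per pitch class.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
```

Frame-level features like these could then serve as the observation sequence for an HMM or CRF, as suggested above.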