|
Hi all, I'm looking for some guidance about which techniques/algorithms I should research for the following problem. I've currently got an algorithm that clusters extremely similar-sounding mp3s using acoustic fingerprinting. In each cluster, I have some metadata (song/artist/album) for each file. For that cluster, I'd like to pick the most representative song/artist/album that matches an existing row in my database, or if there is no best match, decide to insert a new row. For a cluster, there is generally some correct metadata, but individual files have many types of problems:
- Artist/songs are completely misnamed, or just slightly misspelled
- the artist, song, or album is missing, but the rest of the information is there
- the song is actually a live recording, but only some of the files in the cluster are labeled as such. In this case the song name should have (live) in it.

A simple voting algorithm works fairly well, but I'd like to have something I can train on a large set of data that might pick up more nuances than what I've got right now. Any links to papers or similar projects would be greatly appreciated. Thanks! |
|
You can try a supervised clustering algorithm, probably using features such as short sequences of contiguous characters (character n-grams), or a distance function between names based on edit distance or string kernels. The paper I linked to has some references for non-Bayesian methods you can try.

+1: supervised clustering... that's new for me. How did you come across this article? How do you find such interesting subjects?
(Dec 09 '11 at 05:58)
Lucian Sasu
|
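To make the answer's suggestion a bit more concrete, here is a minimal sketch of the two kinds of string features it mentions: character n-grams and an edit-distance-based similarity. It is only an illustration under assumptions, not the method from the linked paper. The function names (`char_ngrams`, `ngram_jaccard`, `edit_similarity`, `pair_features`), the choice of n=3, and the lowercase/whitespace normalization are all hypothetical choices made for this example.

```python
def char_ngrams(text, n=3):
    """Return the set of contiguous character n-grams of a normalized string."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    s = " ".join(text.lower().split())
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_jaccard(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two empty strings are treated as identical
    return len(ga & gb) / len(ga | gb)


def edit_distance(a, b):
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1.0 means identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def pair_features(a, b):
    """Hypothetical feature vector for a pair of names (e.g. two artist strings)."""
    return [edit_similarity(a, b), ngram_jaccard(a, b)]
```

A vector like `pair_features("The Beatles", "The Beatels")` could be computed for each metadata field and fed to whatever supervised model you end up training, alongside the vote counts you already have. Slight misspellings should score high on both features, while completely misnamed entries should score low.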