Hello. First of all, please excuse my lack of knowledge: I'm very new to machine learning and am probably getting ahead of myself. I'm looking to implement the distance-dependent Chinese restaurant process (dd-CRP) in MATLAB to cluster audio tracks, using 26 features. I'm not 100% sure this is the correct approach, so please feel free to correct me. I'm guessing the process might go like this:
While this is occurring, I will keep track of how many tables there are. I will run the algorithm over, for example, 16 audio tracks. The audio will be fed in frame by frame: the first feature vector comes from the first frame of audio track 1, the second from the first frame of track 2, and so on. I'm trying to find out which audio tracks tend to cluster together most, but I don't want to define how many centroids there are. Obviously I'll have to keep track of which audio track is at which "table". Does this make sense? Thanks, Daithi
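The feeding order described above can be made concrete with a small sketch. It is written in Python purely for illustration (the question targets MATLAB), and the function name `interleave_by_frame` and the nested-list layout of `tracks` are assumptions, not something from the post.

```python
def interleave_by_frame(tracks):
    """Return (track_index, frame_index, feature_vector) tuples in frame-major order.

    `tracks` is a list of tracks; each track is a list of per-frame feature
    vectors (26 values per frame in the question). Tracks may differ in
    length; a track that has run out of frames is simply skipped.
    """
    n_frames = max(len(track) for track in tracks)
    ordered = []
    for frame in range(n_frames):
        for track_id, track in enumerate(tracks):
            if frame < len(track):
                # Keep the track index so each "customer" can be traced back
                # to the audio track it came from.
                ordered.append((track_id, frame, track[frame]))
    return ordered
```

Carrying the track index along with each feature vector is what makes it possible to report, afterwards, which tracks ended up at which table.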
Hi, and welcome to our group. You could use it the way you describe, but the procedure is not formally defined and will not give you any guarantees; it is basically an empirical rule for filling up the tables. Dirichlet processes, which is what you want to use, are (in my opinion) not mature enough to be used by people without experience with other clustering methods, so I highly recommend you start with something simpler, like K-means or a mixture of Gaussians. What you are proposing has little theoretical basis, and while it is a good exercise, I think you need to look at deeper questions, like:
All of these are questions that can be answered with a bit more reading on the process and how its generative story is set up.

Hi Leon, First of all, thanks for getting back to me. I failed to mention that I intend there to be an infinite number of tables. I've already used the K-means algorithm, but it doesn't address the audio engineering question I am asking. I've thought the problem through a bit more. Indeed, I would need to update the centroid of each table every time a feature vector is added, possibly by taking the mean of the vectors at that table. Then I could use some threshold theta that I set to decide how similar a new feature vector has to be in order to be added to a table. Correct me if I'm wrong, but I thought the distance-dependent CRP works on the premise that a new customer sits at a table where the customers already seated are similar to him/her.
(Nov 13 '13 at 06:31)
daithi ronain
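A minimal sketch of the threshold rule described in this comment, written in Python for illustration. This is the heuristic from the comment, not the dd-CRP itself: it is deterministic and involves no probabilities. The function name `threshold_cluster`, the use of Euclidean distance, and the "join the nearest table within theta" tie-breaking are all assumptions.

```python
import math


def threshold_cluster(points, theta):
    """Sequentially seat each point at the nearest table within `theta`.

    A table's centroid is the running mean of the points seated at it.
    If no centroid lies within `theta` (Euclidean distance, an assumption),
    the point opens a new table.
    Returns (labels, means): a table index per point, and the final centroids.
    """
    means, counts, labels = [], [], []
    for x in points:
        best, best_d = None, float("inf")
        for k, m in enumerate(means):
            d = math.dist(x, m)
            if d < best_d:
                best, best_d = k, d
        if best is not None and best_d <= theta:
            # Incremental mean update for the chosen table.
            counts[best] += 1
            n = counts[best]
            means[best] = [mi + (xi - mi) / n for mi, xi in zip(means[best], x)]
            labels.append(best)
        else:
            means.append(list(x))
            counts.append(1)
            labels.append(len(means) - 1)
    return labels, means
```

The number of tables this produces depends entirely on `theta` and on the order in which the points arrive, so shuffling the input can change the result.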
Your understanding of the distance-dependent CRP is correct, but what you propose is a heuristic with no formal basis, and as such it will give you no guarantees of a set number of clusters. How are you going to define the metric and the distance, and what happens in borderline cases? The CRP works with probabilities. Check Blei's paper: http://www.cs.princeton.edu/~blei/papers/BleiFrazier2011.pdf You'll see there is an inference process associated with finding the centroids. Your idea is sweet though; I would look for MATLAB code that already implements it. Keep going at it, and you can shoot me an email (it is in my profile) if you want further detail.
(Nov 14 '13 at 18:55)
Leon Palafox ♦
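To illustrate what "works with probabilities" means here, the sketch below draws a seating arrangement from the dd-CRP prior as defined in the linked paper: customer i links to another customer j with probability proportional to a decay function f(d_ij), or to itself with probability proportional to alpha, and tables are the connected components of the resulting link graph. It is Python for illustration only. The exponential decay f(d) = exp(-d / decay_scale), the function names, and the precomputed `distances` matrix are assumptions, and this samples the prior only; it does not perform the posterior inference over the audio features that the paper describes.

```python
import math
import random


def sample_ddcrp_links(distances, alpha, decay_scale, rng):
    """Draw one customer link per customer from the dd-CRP prior.

    distances[i][j] is the distance between customers i and j.
    Self-links get weight alpha; other links get exp(-d / decay_scale)
    (exponential decay is an assumed choice of decay function).
    """
    n = len(distances)
    links = []
    for i in range(n):
        weights = [
            alpha if j == i else math.exp(-distances[i][j] / decay_scale)
            for j in range(n)
        ]
        links.append(rng.choices(range(n), weights=weights, k=1)[0])
    return links


def tables_from_links(links):
    """Label customers by connected component of the link graph (union-find)."""
    parent = list(range(len(links)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in enumerate(links):
        parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(links))]
```

A typical call would be `tables_from_links(sample_ddcrp_links(D, alpha=1.0, decay_scale=1.0, rng=random.Random(0)))`. Unlike a fixed threshold, borderline cases are resolved by chance: a distant customer can still be linked, just with low probability, and larger `alpha` yields more self-links and therefore more tables.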