|
I know a lot of nonparametric bayesian models have unbounded or infinite number of features. Many models are derived by taking the limit of the distribution in the infinite. For example, the Indian Buffet Process. When taking the limit when K -> infinity, what do we want to get for a valid distribution? Is it that P(Z) should have a limit > 0? I still not sure why we need the infinite because in practice, the amount of data is finite and limited. Dont we only need to choose a K that is large enough. What is the advantage of considering the infinite limit? |
|
There are some advantages to considering the infinite limit. You can see for example Zoubin Ghahramani talking about some of them in this videolecture. The main reasons, from my point of view, are:
Thanks for the responses. Your point (1) is pretty interesting way to look at. So taking the infinite limit helps discover finite model with flexibility to fit much better to data (like in DPMM, growing new cluster for data that is not fitted very well). Is it true in practice infinite model is implemented as a certain truncation level? Point (2): is the finite truncated LDA a truncated infinite model? I also dont understand how to derive the infinite limit. What do I want the Probability to behave when K->infinity. It seems like when K is very large, the probability should go to 0.
(Mar 29 '11 at 18:31)
WeakLearner
1
Yes, infinite models are usually explicitly truncated for efficiency reasons unless you use some specific sampling algorithms, but these algorithms do a lot of memory allocation and freeing for the extra space. Truncation is not a huge problem because of some results on the expected model size of nonparametric models. The finite truncated LDA almost qualifies as a truncated infinite model, but not in a good way. A truncated infinite model would have deeper forms of topic sharing between different documents, and there is the identifiability problem. Remember that if you have a finite symmetric dirichlet model, for any given sample there are K! other samples with exactly the same parameters (the identifiability issue), so you need to correct for that when computing the probability of the infinite model. Also, you need to switch from the space of indices to the space of partitions, and alpha should go down as k increases. It's actually hard to compute the full-sample likelihood according to the DP, as you have infinite things lying around. I suggest you read the original papers to get an idea. David Blei has a really good list here: http://bit.ly/a694bH . Some of the mathematics is beyond me, as taking these limits can be rather tricky.
(Mar 29 '11 at 18:44)
Alexandre Passos ♦
2
Just to add: To deal with the limit of K going to infinity, people usually have specific ways for specific models. For example, for the Indian Buffet Process (IBP) which is a distribution over binary matrices having potentially infinite number of columns (K), the probability of a single binary matrix would be zero as K goes to infinity. To deal with this, Griffith and Ghahramani (see the IBP technical report http://bit.ly/eS8ZKQ ) define an "equivalence class" of binary matrices which is basically the set of all binary matrices having K columns, such that when each of which is reordered column-wise, map to the same "left-ordered-form" matrix. So instead of considering the probability of a single matrix, you define the probability of the equivalence class represented by this "left-ordered-form" matrix. Now when you make K go to infinity, the probability is no more zero but rather a well-defined quantity.
(Mar 29 '11 at 20:00)
ebony
What is the maths behind the probability should not be zero? I was trying to read more about measure theory, stochastic process but it is still beyond me how to connect the two.
(Mar 29 '11 at 21:00)
WeakLearner
Say, you are modeling a random variable Z and have a nonparametric prior P(Z). As in the IBP example where Z is the binary matrix wih K columns, when K goes to infinity, P(Z) would end up having zero mass for each of the possible Z's, and therefore the prior would not be meaningful. It's to avoid such cases that people use things such as the notion of equivalence class as in the IBP case.
(Mar 29 '11 at 21:30)
ebony
|