2
2

I know a lot of nonparametric bayesian models have unbounded or infinite number of features. Many models are derived by taking the limit of the distribution in the infinite. For example, the Indian Buffet Process. When taking the limit when K -> infinity, what do we want to get for a valid distribution? Is it that P(Z) should have a limit > 0?

I still not sure why we need the infinite because in practice, the amount of data is finite and limited. Dont we only need to choose a K that is large enough. What is the advantage of considering the infinite limit?

asked Mar 29 '11 at 16:17

WeakLearner's gravatar image

WeakLearner
51126


One Answer:

There are some advantages to considering the infinite limit. You can see for example Zoubin Ghahramani talking about some of them in this videolecture.

The main reasons, from my point of view, are:

  1. Reducing complexity: Say you have a mixture of K gaussians model. If you estimate it traditionally, you will almost always get precisely K gaussians in the output, regardless of whether that makes sense for the model or it is overfitting. Considering the infinite limit leads us to finite models (for example, finite stick-breaking priors) that don't have this property, and where you can specify an upper bound on the number of gaussians instead of the exact number of gaussians, thereby reducing the cost of inference substantially (a variational algorithm for a DP mixture of gaussians is assymptotically just as fast as a variational algorithm for a finite mixture of gaussians, and fits models with much smaller complexity most of the time). So while it seems counterintuitive, an infinite model might lead to solutions that are smaller than a model with large but finite K.

  2. Speed issues: Sometimes, also, infinite models, due to added flexibility, converge better and faster. The HDP-LDA is anecdotally (although someone might have a paper) a lot faster than finite truncated LDA most of the time, and finds better topics due to the exchangeability assumption. Also, looking at the properties of infinite models can lead to ideas such as the one in the Beam sampling for the infinite hidden Markov model paper, by Jurgen Van Gael et al. Nothing stops you from using that to sample a finite HMM, but it seems more natural in the infinite model, where it can be seen as a sampling-based correspondence to the Efficient Staggered decoding for sequence labeling paper by Kaji et al.

  3. Less fiddling: in line with item (1), it takes a lot less fiddling to fit nonparametric models than their parametric alternatives in some settings, as it might not be always obvious how to set the truncation parameter and cross-validation is slow.

  4. Intellectual reasons: Studying the gaussian process helps understand the behavior of neural networks; better priors for finite models can be crafted copying the structure of infinite models; infinite models lead to compact algorithms and models for sparsity in a bayesian setting; etc. Intellectually they might be interesting to consider in the same sense that real-valued continuous functions are an interesting abstraction: while dubiously grounded in reality, it leads to interesting analogies and extends many useful ideas.

answered Mar 29 '11 at 17:58

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1896744214334

Thanks for the responses.

Your point (1) is pretty interesting way to look at. So taking the infinite limit helps discover finite model with flexibility to fit much better to data (like in DPMM, growing new cluster for data that is not fitted very well). Is it true in practice infinite model is implemented as a certain truncation level?

Point (2): is the finite truncated LDA a truncated infinite model?

I also dont understand how to derive the infinite limit. What do I want the Probability to behave when K->infinity. It seems like when K is very large, the probability should go to 0.

(Mar 29 '11 at 18:31) WeakLearner
1

Yes, infinite models are usually explicitly truncated for efficiency reasons unless you use some specific sampling algorithms, but these algorithms do a lot of memory allocation and freeing for the extra space. Truncation is not a huge problem because of some results on the expected model size of nonparametric models.

The finite truncated LDA almost qualifies as a truncated infinite model, but not in a good way. A truncated infinite model would have deeper forms of topic sharing between different documents, and there is the identifiability problem.

Remember that if you have a finite symmetric dirichlet model, for any given sample there are K! other samples with exactly the same parameters (the identifiability issue), so you need to correct for that when computing the probability of the infinite model. Also, you need to switch from the space of indices to the space of partitions, and alpha should go down as k increases. It's actually hard to compute the full-sample likelihood according to the DP, as you have infinite things lying around. I suggest you read the original papers to get an idea. David Blei has a really good list here: http://bit.ly/a694bH . Some of the mathematics is beyond me, as taking these limits can be rather tricky.

(Mar 29 '11 at 18:44) Alexandre Passos ♦
2

Just to add: To deal with the limit of K going to infinity, people usually have specific ways for specific models. For example, for the Indian Buffet Process (IBP) which is a distribution over binary matrices having potentially infinite number of columns (K), the probability of a single binary matrix would be zero as K goes to infinity.

To deal with this, Griffith and Ghahramani (see the IBP technical report http://bit.ly/eS8ZKQ ) define an "equivalence class" of binary matrices which is basically the set of all binary matrices having K columns, such that when each of which is reordered column-wise, map to the same "left-ordered-form" matrix. So instead of considering the probability of a single matrix, you define the probability of the equivalence class represented by this "left-ordered-form" matrix.

Now when you make K go to infinity, the probability is no more zero but rather a well-defined quantity.

(Mar 29 '11 at 20:00) ebony

What is the maths behind the probability should not be zero? I was trying to read more about measure theory, stochastic process but it is still beyond me how to connect the two.

(Mar 29 '11 at 21:00) WeakLearner

Say, you are modeling a random variable Z and have a nonparametric prior P(Z). As in the IBP example where Z is the binary matrix wih K columns, when K goes to infinity, P(Z) would end up having zero mass for each of the possible Z's, and therefore the prior would not be meaningful. It's to avoid such cases that people use things such as the notion of equivalence class as in the IBP case.

(Mar 29 '11 at 21:30) ebony
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.