This is somewhat related to an earlier question: how does the prior affect your model (and the inference) as the amount of data (in terms of the number of examples) grows? It seems that a large amount of data would eventually wash out the effect of the prior. If that is the case, can one just use something like maximum likelihood (or EM) to estimate parameters instead of going the fully Bayesian route? Of course, you wouldn't get confidence estimates, but if I don't care about those, can I just use MLE/EM without taking the prior distribution into account? I think the role of priors becomes important in Bayesian models with a complicated hierarchy of parameters/hyperparameters, where priors help share statistical strength among parameters (especially when you don't have enough data). But assuming my model is relatively simple (with maybe one level of parameters), should I really care about using priors if I have enough data?
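To make the "washing out" concrete, here is a small self-contained sketch (my own toy example, not part of the original question): a Beta-Bernoulli model where the posterior means under a flat prior and under a badly misspecified prior both converge to the MLE as the number of observations grows.

```python
# Hypothetical illustration: the prior's influence on the posterior mean shrinks as n grows.
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.3

for n in [10, 100, 10_000]:
    x = rng.binomial(1, true_p, size=n)
    heads = x.sum()

    mle = heads / n
    # Posterior means under a flat Beta(1, 1) prior and a strongly misspecified Beta(50, 5) prior.
    flat_post_mean = (heads + 1) / (n + 2)
    skewed_post_mean = (heads + 50) / (n + 55)

    print(f"n={n:6d}  MLE={mle:.3f}  Beta(1,1) mean={flat_post_mean:.3f}  "
          f"Beta(50,5) mean={skewed_post_mean:.3f}")
```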
Priors are important when you're fitting complex models relative to the amount of data you have. Here's an illustration. In summary: yes, if your models are very simple and you have a lot of data, then priors don't matter much. But why not fit good models instead of simple play models? :)
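As a hedged sketch of that point (my own toy setup, not the answerer's illustration): a linear regression with almost as many parameters as training examples, where plain maximum likelihood (ordinary least squares) fits the noise, while a zero-mean Gaussian prior on the weights (i.e., the MAP / ridge estimate) does much better out of sample.

```python
# Toy example (assumed setup): 100 parameters, 110 training examples.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, d = 110, 1000, 100
w_true = rng.normal(0, 1, d)
noise = 1.0

X_train = rng.normal(0, 1, (n_train, d))
y_train = X_train @ w_true + noise * rng.normal(size=n_train)
X_test = rng.normal(0, 1, (n_test, d))
y_test = X_test @ w_true + noise * rng.normal(size=n_test)

# Maximum likelihood: ordinary least squares.
w_mle, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# MAP estimate under a zero-mean Gaussian prior on w with precision alpha
# (equivalent to ridge regression).
alpha = 1.0
w_map = np.linalg.solve(X_train.T @ X_train + alpha * np.eye(d),
                        X_train.T @ y_train)

for name, w in [("MLE (least squares)", w_mle), ("MAP (Gaussian prior)", w_map)]:
    print(f"{name:22s} test MSE = {np.mean((X_test @ w - y_test) ** 2):.2f}")
```

With a lot more training data the two estimates converge, which is exactly the trade-off being discussed: the prior matters most when the model is complex relative to the data.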
In my opinion the trade-off is always the one you are talking about. If you have enough data then you probably don't need to care about the priors... but whether there are cases where you would still want good priors even with lots of data, hmm, that's a thinker... As an aside, I think finding good priors has always been something of a holy grail for learning in hierarchical models in general, and in ANNs and autoencoder-like networks specifically, and there is newer work on prior learning worth knowing about. I did a small study project on a deep belief net algorithm (which is a fancy name for "better" neural networks and deep networks in general). The idea is the same layer-by-layer learning, but with better priors at each level. Between levels the weights are learned by a specific Gibbs sampling algorithm, and these weights effectively become a "prior" for the next layer, since the data is presented to the next layer by passing through the previous one. The problem with earlier ANNs was that there was no fast enough way of doing this layer-by-layer prior-learning trick; ANNs worked horribly before it and much better after. It's interesting that they (Geoffrey Hinton and his group) called them priors, because that's what they are in the algorithmic sense, even though they are obtained from the data... Priors are taking on a more important meaning with this new work.
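For what that layer-by-layer idea looks like in code, here is a very rough, hedged sketch (my own minimal NumPy version, not Hinton's reference implementation): each layer is a restricted Boltzmann machine trained with one step of Gibbs sampling (contrastive divergence, CD-1), and its hidden activations become the input for, and in effect the prior over, the next layer.

```python
# Rough sketch of greedy layer-wise pretraining with CD-1; sizes and data are made up.
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=10, lr=0.05):
    """Train one RBM layer with CD-1; return its weights and hidden activations."""
    n_samples, n_visible = data.shape
    W = rng.normal(0, 0.01, (n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)

    for _ in range(n_epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)

        # Negative phase: one Gibbs step (reconstruct visibles, then hiddens).
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)

        # CD-1 update: data-driven statistics minus model-driven statistics.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n_samples
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)

    return W, sigmoid(data @ W + b_hid)

# Greedy stacking: each layer's hidden activations are the "data" for the next layer,
# and the learned weights would initialize (act as a prior for) a deeper network.
X = (rng.random((500, 64)) < 0.3).astype(float)   # toy binary data
activations, weights = X, []
for n_hidden in [32, 16]:
    W, activations = train_rbm(activations, n_hidden)
    weights.append(W)
```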
Priors become less important as data becomes plentiful. EM / maximum likelihood can work with less data as well, but it can have degenerate behaviors. For example, EM for estimating the means and covariances of a mixture model has a degenerate solution in which a component's covariance collapses toward zero (onto a single data point), making the likelihood diverge; and for word segmentation models a naive EM will just segment the entire corpus as one word or segment each character as its own word. You might be able to correct for these behaviors without the need of a prior (see, for example, http://www.aclweb.org/anthology/N/N09/N09-1069.pdf ).

Also, how much data counts as "plenty" is problem-specific, and for high-dimensional models you might need a huge amount of data before the posterior is close enough to a point estimate for priors to make no difference.

Mostly, though, the point-estimates versus full-posterior question (whether by sampling or variational inference) is orthogonal to the priors-or-no-priors question: you can get point estimates using priors (by doing MAP inference), and you can get confidence estimates without priors (technically by using improper priors, effectively sampling from the likelihood function instead of maximizing it, or by using the bootstrap to compute the empirical variance of the maximum likelihood estimator). In general there is no one-size-fits-all recipe for priors/no priors and point estimates/confidence estimates, and people have gotten very good results both with and without priors, both with a lot of data and with a little. As far as I'm concerned, I find it easier to start with a more "proper" implementation with noninformative priors and tweak for performance as needed.
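As a small hedged sketch of that last point (a toy example of mine, not from the answer): confidence estimates for a maximum likelihood estimator with no prior anywhere, obtained with a nonparametric bootstrap.

```python
# Hypothetical example: bootstrap the MLE of an exponential rate parameter.
import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=200)   # toy data; true rate = 0.5

def mle_rate(x):
    # The MLE of an exponential rate parameter is 1 / (sample mean).
    return 1.0 / x.mean()

point_estimate = mle_rate(data)

# Nonparametric bootstrap: resample with replacement, refit, inspect the spread.
boot = np.array([mle_rate(rng.choice(data, size=data.size, replace=True))
                 for _ in range(2000)])

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"MLE = {point_estimate:.3f}, bootstrap SE = {boot.std():.3f}, "
      f"95% interval ~ ({lo:.3f}, {hi:.3f})")
```

The same machinery works for any estimator you can refit on resampled data, so it is one way to get uncertainty estimates while staying entirely in the maximum-likelihood world.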