A popular approach to fitting log-linear models is to minimize the regularized negative log-likelihood, where the regularization term is the sum of squares or of absolute values of the natural parameters (thetas). Two things seem unsatisfactory here:
Some ideas for how this could be dealt with: a) regularizing in mean-parameter space. In the dice example, that means regularizing by (p1-1/3)^2 + (p2-1/3)^2 + (1-p1-p2-1/3)^2, where p1, p2 are the predicted probabilities; or b) regularizing by theta' A theta, where A is derived from some property of the model. In the dice example we could take A to be the Fisher information matrix at (0,0); this gives a penalty proportional to x1^2 + x2^2 - x1 x2 and makes the resulting learner treat outcomes 1, 2, 3 as (mostly) a priori the same. Have these ideas been tried? What are some approaches for addressing the issues above?
Edit: in response to Aman's question about why those values are preferred, the best way to visualize it is to plot, in mean-parameter space, the prior induced by the squared norm on the natural parameters. That is, represent dice distributions as points of the probability simplex over three outcomes, and for each (p1, p2, p3) show the probability of the corresponding (theta1, theta2) under the prior Exp[-theta1^2 - theta2^2]. The top graph shows horizontal cross-sections. You can see that this prior is quite different from a natural prior in this space, the uniform multinomial distribution (pictured on the right for n=5).
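To make ideas (a) and (b) concrete, here is a minimal Python sketch. It assumes the parameterization used elsewhere in this thread, P(i) proportional to Exp[x_i] with x3 fixed at 0, and it reads the third term of the mean-space penalty as (p3 - 1/3)^2. All function names are made up for illustration.

```python
import math

def softmax_ref(x1, x2):
    """Dice probabilities with outcome 3 as the reference (x3 = 0)."""
    z = 1.0 + math.exp(x1) + math.exp(x2)
    return (math.exp(x1) / z, math.exp(x2) / z, 1.0 / z)

def mean_space_penalty(x1, x2):
    """Idea (a): squared distance of the predicted probabilities from uniform."""
    return sum((p - 1.0 / 3.0) ** 2 for p in softmax_ref(x1, x2))

def fisher_at_origin():
    """Fisher information at x = (0, 0): diag(p) - p p' with p = (1/3, 1/3)."""
    p = 1.0 / 3.0
    return [[p - p * p, -p * p], [-p * p, p - p * p]]

def quad_penalty(A, x):
    """Idea (b): the quadratic form x' A x."""
    return sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))

A = fisher_at_origin()
# Parameter vectors that favour outcome 1, 2 or 3 by the same log-odds.
favour = {1: (1.0, 0.0), 2: (0.0, 1.0), 3: (-1.0, -1.0)}
for outcome, x in favour.items():
    l2 = x[0] ** 2 + x[1] ** 2
    print(outcome, l2, quad_penalty(A, x), mean_space_penalty(*x))
```

The plain squared norm charges twice as much for favouring outcome 3 as for favouring outcome 1 or 2, while the Fisher-based form and the mean-space penalty charge the same amount for all three.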
Edit Sept 10: Alexandre suggests using an overcomplete parameterization (one parameter per outcome), in which case the prior is symmetric. In p-space it penalizes dice by the exp of squares of log-probabilities. Below is a plot of that prior compared to the natural prior in that space (uniform multinomial) and to the entropic prior. You can see that the equal-entropy lines are roughly circles, whereas our prior has triangular-looking contours, which makes our learner biased towards low-entropy models.
I asked Daphne Koller and she answered:
So I (Alexandre) believe that the MAP estimate with a Gaussian prior is not all zeros if you use the overparameterized distribution (one factor per outcome) with a Gaussian prior on the parameters.
Oops, good point: it is adding a constant to the log-parameters, not multiplying them by a constant, that preserves the log-likelihood. So this prior does treat all outcomes symmetrically, but now there is a bias towards low-entropy distributions; I've updated the question with an edit.
(Sep 10 '10 at 16:39)
Yaroslav Bulatov
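The invariance mentioned in this comment is easy to check numerically. This is a small sketch assuming the overparameterized form P(i) proportional to Exp[t_i]; the parameter values are arbitrary and chosen only for illustration.

```python
import math

def softmax(t):
    """Probabilities P(i) proportional to exp(t_i), computed stably."""
    m = max(t)
    e = [math.exp(v - m) for v in t]
    s = sum(e)
    return [v / s for v in e]

t = [0.4, -0.1, 0.3]                 # arbitrary illustrative log-parameters
shifted = [v + 2.5 for v in t]       # add the same constant to every parameter
scaled = [v / 2.0 for v in t]        # divide every parameter by a constant

print(softmax(t))
print(softmax(shifted))   # same distribution, hence same likelihood
print(softmax(scaled))    # a different distribution
```

Since shifting leaves the distribution unchanged but scaling does not, shrinking all parameters towards zero does change the likelihood, so the squared-norm penalty does not force the MAP estimate to (0,0,0).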


I am confused. If you don't have a feature that upvotes the occurrence of 3, then your learner will be biased against it. Yet I don't see how your learner is biased against 3. Doesn't the second probability distribution have the highest value for 3 (0.36)? Please clarify.
For the first dice dataset, the objective function is 2 x1 + x2 - 4 Log[1 + Exp[x1] + Exp[x2]] - x1^2 - x2^2; for the second, it is x1 + x2 - 4 Log[1 + Exp[x1] + Exp[x2]] - x1^2 - x2^2. I maximized each function, then plugged the maximizing parameter values in to get probabilities (i.e., P(1) = Exp[x1]/(1 + Exp[x1] + Exp[x2])).
Note that penalizing by x1^2 + x2^2 - 1/2 x1 x2 approximates a uniform multinomial prior on the space of dice counts. Below are the results of using that penalty on the same two datasets; you can see that it treats observations of 1 and observations of 3 more symmetrically:
{0.39, 0.30, 0.31} {0.31, 0.31, 0.37}
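For anyone who wants to reproduce these numbers, here is a rough sketch in plain Python. It assumes the two datasets are the counts (2, 1, 1) and (1, 1, 2), which is what the coefficients of the objectives above imply, and it uses simple gradient ascent rather than whatever optimizer was used originally. The names `fit` and `cross` are made up for illustration; `cross` is the coefficient of the x1 x2 term in the penalty.

```python
import math

def probs(x1, x2):
    """P(1), P(2), P(3) with outcome 3 as the reference (x3 = 0)."""
    z = 1.0 + math.exp(x1) + math.exp(x2)
    return (math.exp(x1) / z, math.exp(x2) / z, 1.0 / z)

def fit(counts, cross=0.0, step=0.1, iters=5000):
    """Maximize n1*x1 + n2*x2 - n*log(Z) - (x1^2 + x2^2 - cross*x1*x2)."""
    n1, n2, n3 = counts
    n = n1 + n2 + n3
    x1 = x2 = 0.0
    for _ in range(iters):
        p1, p2, _p3 = probs(x1, x2)
        # Gradient of the concave objective with respect to x1 and x2.
        g1 = n1 - n * p1 - 2.0 * x1 + cross * x2
        g2 = n2 - n * p2 - 2.0 * x2 + cross * x1
        x1 += step * g1
        x2 += step * g2
    return probs(x1, x2)

for counts in [(2, 1, 1), (1, 1, 2)]:
    print(counts, "squared norm:", [round(p, 2) for p in fit(counts)])
    print(counts, "cross = 1/2: ", [round(p, 2) for p in fit(counts, cross=0.5)])
```

With the plain squared norm, two observations of 1 pull P(1) up further than two observations of 3 pull P(3) up, which is the asymmetry being discussed; with `cross=0.5` the gap narrows.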
Your problem is that your likelihood, not the regularization, penalizes against 3. If you treat all parameters equally, your unnormalized likelihood becomes exp(x1 f1 + x2 f2 + x3 f3), which is completely symmetric in theta space, so L2 regularization makes sense and treats all values indiscriminately. A recent paper by Agarwal and Daumé gives good reasons to always regularize with a conjugate prior instead of a Gaussian: http://hal3.name/docs/daume10conjugate.pdf
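A quick way to check the symmetry claim is to fit the three-parameter form directly. This is only a sketch: it assumes one indicator feature per outcome, the counts (2, 1, 1) and (1, 1, 2) implied by the objectives quoted in this thread, a squared-norm penalty with unit weight, and plain gradient ascent. The function name is hypothetical.

```python
import math

def fit_overparameterized(counts, step=0.1, iters=5000):
    """Maximize sum_i n_i*t_i - n*log(sum_i exp(t_i)) - sum_i t_i^2."""
    n = sum(counts)
    t = [0.0] * len(counts)
    for _ in range(iters):
        z = sum(math.exp(v) for v in t)
        p = [math.exp(v) / z for v in t]
        # Gradient: observed count minus expected count minus penalty term.
        grad = [c - n * pi - 2.0 * ti for c, pi, ti in zip(counts, p, t)]
        t = [ti + step * g for ti, g in zip(t, grad)]
    z = sum(math.exp(v) for v in t)
    return t, [math.exp(v) / z for v in t]

for counts in [(2, 1, 1), (1, 1, 2)]:
    t, p = fit_overparameterized(counts)
    print(counts, [round(v, 3) for v in t], [round(v, 2) for v in p])
```

Under these assumptions the two fitted distributions are permutations of each other, so no outcome is singled out, and the fitted parameters are nonzero and sum to zero.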
That link isn't accessible... do you have the name of the paper? Also, the set of distributions over 3 outcomes is a two-parameter family, so what you are giving is an overcomplete representation. If (t1, t2, t3) gives data likelihood C, then so does (t1/x, t2/x, t3/x) for any x. Because you penalize by the squared norm, the solution to the corresponding minimization problem is always (0,0,0)... that's probably worse than having a lopsided prior :)
This is true; this seems to be one of those circumstances where MAP doesn't make sense. To solve this, I guess it's better to constrain all parameters to sum to the same value, as can be done with the exponential-family representation of the Bernoulli distribution (see the exponential families chapter of the Koller and Friedman book). You can then regularize the thetas under that constraint, which pushes them all towards the same value unless the likelihood requires something different.
The paper is "A geometric view of conjugate priors" and that link is down, but the arxiv link http://arxiv.org/pdf/1005.0047 is up.
Also maybe you could email some expert in the area and ask them about this and post it back here (or tell them that you posted it here and ask if they'd like to answer). It feels as if on most graphical model questions here it's mostly you, me, and spinxl39 talking, recently.