A popular approach to fitting log-linear models is to maximize the regularized log-likelihood, where the regularization term is the sum of squares or of absolute values of the natural parameters (thetas).

Two things seem unsatisfactory here:

  1. This corresponds to a prior which expects features of the data to be correlated, which introduces strange biases. For instance, take a log-linear parameterization of a 3-sided die (x=1, 2 or 3; features are f_1, f_2, where f_i(x)=1 if x=i and 0 otherwise) and train it to maximize the log-likelihood minus the sum of squared thetas. If we observe counts 2,1,1, the estimated probabilities are {0.39, 0.29, 0.32}, but if we see 1,1,2, they are {0.32, 0.32, 0.36}. The outcome seen twice gets 0.39 when it is 1 but only 0.36 when it is 3, so the learner is biased against dice that land on 3.

  2. Parameters can be fundamentally different (e.g., local feature vs. structural feature parameters). Shouldn't we regularize them differently?
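
The dice example in point 1 can be checked with a few lines of code. This is only a minimal sketch: it assumes theta3 is fixed at 0 as the reference outcome and uses plain gradient ascent with a hand-picked step size, and the function names are made up for illustration.

```python
import math

def probs(x1, x2):
    """Die probabilities for natural parameters (x1, x2), with theta3 = 0."""
    z = 1.0 + math.exp(x1) + math.exp(x2)
    return [math.exp(x1) / z, math.exp(x2) / z, 1.0 / z]

def fit_l2(counts, steps=5000, lr=0.1):
    """Gradient ascent on log-likelihood minus (x1^2 + x2^2).

    The objective is concave, so a small fixed step size (an assumption
    chosen by hand here) converges to the unique maximum.
    """
    n = sum(counts)
    x1 = x2 = 0.0
    for _ in range(steps):
        p1, p2, _p3 = probs(x1, x2)
        g1 = counts[0] - n * p1 - 2.0 * x1  # d/dx1 of the objective
        g2 = counts[1] - n * p2 - 2.0 * x2  # d/dx2 of the objective
        x1 += lr * g1
        x2 += lr * g2
    return probs(x1, x2)

twice_on_1 = fit_l2((2, 1, 1))
twice_on_3 = fit_l2((1, 1, 2))
print([round(p, 2) for p in twice_on_1])
print([round(p, 2) for p in twice_on_3])
```

The outcome observed twice ends up with a larger estimate when it is outcome 1 than when it is outcome 3, which is the asymmetry described above.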

Some ideas of how this could be dealt with: a) regularizing in mean parameter space. In the dice example, regularizing by (p1-1/3)^2+(p2-1/3)^2+(p3-1/3)^2, where p1,p2 are the predicted probabilities and p3=1-p1-p2, or b) regularizing by theta'A theta, where A is derived from some properties of the model. In the dice example, we could take as A the Fisher Information matrix at (0,0); this gives a penalty proportional to 2x1^2+x2^2-x1 x2, and makes the resulting learner treat outcomes 1,2,3 as (mostly) a priori the same. Have these ideas been tried?
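
For idea b), here is a small sketch of what the Fisher Information matrix at (0,0) looks like. It assumes the standard fact that, in natural parameters, the Fisher information of a multinomial is the covariance of the indicator features, diag(p) - p p', with outcome 3 as the reference. The helper names are hypothetical. Under these assumptions the quadratic form comes out proportional to x1^2 + x2^2 - x1 x2, which is exactly symmetric under relabeling of the outcomes; the coefficients quoted above differ from this, so they may reflect a different scaling or a typo.

```python
def fisher_at_uniform(k=3):
    """Fisher information w.r.t. natural parameters at theta = 0.

    At theta = 0 every outcome has probability 1/k. The last outcome is the
    reference (its parameter is fixed at 0), so the matrix is (k-1) x (k-1)
    with entries delta_ij * p - p * p.
    """
    p = 1.0 / k
    return [[(p if i == j else 0.0) - p * p for j in range(k - 1)]
            for i in range(k - 1)]

def quad_form(A, x):
    """Compute x' A x for a small dense matrix given as nested lists."""
    m = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(m) for j in range(m))

A = fisher_at_uniform(3)
print(A)  # entries are 2/9 on the diagonal and -1/9 off the diagonal

x1, x2 = 0.3, -0.7
penalty = quad_form(A, [x1, x2])
symmetric_form = (2.0 / 9.0) * (x1 * x1 + x2 * x2 - x1 * x2)
print(penalty, symmetric_form)
```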

What are some approaches for addressing the issues above?

Edit: in response to Aman's question as to why those values are preferred, the best way to visualize it is to plot the prior induced by the squared norm on natural parameters in mean parameter space. I.e., represent distributions of dice as points of the probability simplex, and for each p1,p2,p3, show the probability of the corresponding theta1,theta2 under the prior Exp[-theta1^2-theta2^2]. The top graph shows horizontal cross-sections. You can see this prior is quite different from a natural prior in this space, the uniform multinomial distribution (pictured on the right for n=5).
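
Without the plots, the same asymmetry can be seen by evaluating the prior at a couple of points. This is a rough sketch, assuming theta_i = Log[p_i/p3] (outcome 3 as reference) and ignoring any Jacobian term, since the description above just evaluates the prior at the corresponding thetas. The function name is made up.

```python
import math

def prior_in_p_space(p1, p2, p3):
    """Unnormalized prior Exp[-theta1^2 - theta2^2] at the die (p1, p2, p3).

    Uses theta_i = log(p_i / p3); no Jacobian correction is applied.
    """
    t1 = math.log(p1 / p3)
    t2 = math.log(p2 / p3)
    return math.exp(-(t1 * t1 + t2 * t2))

# Two dice that differ only by relabeling which outcome is the heavy one.
heavy_on_1 = prior_in_p_space(0.6, 0.2, 0.2)
heavy_on_3 = prior_in_p_space(0.2, 0.2, 0.6)
print(heavy_on_1, heavy_on_3)
```

The die that favors outcome 3 gets a noticeably smaller prior value than the relabeled die that favors outcome 1, even though a symmetric prior would score them equally.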

Edit Sept 10: Alexandre suggests using an overparameterized representation (one parameter per outcome), in which case the prior is symmetric. In p-space it penalizes dice by exp of squares of log-probabilities. Below is a plot of that prior compared to the natural prior in that space (uniform multinomial) and to the entropic prior. You can see that equal-entropy lines are roughly circles, whereas our prior has triangular-looking contours, which makes our learner biased towards low-entropy models.

asked Sep 08 '10 at 15:46

Yaroslav Bulatov

edited Sep 10 '10 at 16:36

I am confused. If you don't have a feature that upvotes the occurrence of 3, then your learner will be biased against it. Yet I don't see how your learner is biased against 3. Doesn't the second probability distribution have the highest value for 3 (0.36)? Please clarify.

(Sep 08 '10 at 17:14) Aman

For the first dice dataset, the objective function is 2 x1 + x2 - 4 Log[1 + Exp[x1] + Exp[x2]] - x1^2 - x2^2; for the second, it is x1 + x2 - 4 Log[1 + Exp[x1] + Exp[x2]] - x1^2 - x2^2. I maximized each function, then plugged the parameter values at the maximum in to get probabilities (i.e., P(1) = Exp[x1]/(1 + Exp[x1] + Exp[x2])).

Note that penalizing by x1^2+x2^2-1/2 x1 x2 approximates a uniform multinomial prior on the space of dice counts. Below are the results of using that penalty for the same 2 datasets; you can see it treats observations of 1 and observations of 3 more symmetrically:

{0.39, 0.30, 0.31} {0.31, 0.31, 0.37}
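
A sketch for reproducing the numbers above, with the cross-term coefficient as a parameter. It assumes the same setup as the objective functions in this comment (theta3 fixed at 0) and uses simple gradient ascent; the function name and the step size are my own choices.

```python
import math

def fit_quadratic(counts, cross=0.5, steps=5000, lr=0.1):
    """Maximize log-likelihood minus (x1^2 + x2^2 - cross * x1 * x2).

    cross=0.0 gives the plain squared-norm penalty; cross=0.5 gives the
    modified penalty discussed in this comment.
    """
    n = sum(counts)
    x1 = x2 = 0.0
    for _ in range(steps):
        z = 1.0 + math.exp(x1) + math.exp(x2)
        p1, p2 = math.exp(x1) / z, math.exp(x2) / z
        g1 = counts[0] - n * p1 - 2.0 * x1 + cross * x2
        g2 = counts[1] - n * p2 - 2.0 * x2 + cross * x1
        x1, x2 = x1 + lr * g1, x2 + lr * g2
    z = 1.0 + math.exp(x1) + math.exp(x2)
    return [math.exp(x1) / z, math.exp(x2) / z, 1.0 / z]

mod_a = fit_quadratic((2, 1, 1), cross=0.5)
mod_b = fit_quadratic((1, 1, 2), cross=0.5)
plain_a = fit_quadratic((2, 1, 1), cross=0.0)
plain_b = fit_quadratic((1, 1, 2), cross=0.0)

# Gap between the estimate for the twice-seen outcome in the two datasets.
print("plain gap:   ", plain_a[0] - plain_b[2])
print("modified gap:", mod_a[0] - mod_b[2])
```

The gap shrinks with the cross term but does not vanish, which matches "more symmetrically" rather than "exactly symmetrically".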

(Sep 08 '10 at 17:30) Yaroslav Bulatov

Your problem is that it is your likelihood, not the regularization, that penalizes against 3. If you treat all parameters equally, your unnormalized likelihood becomes exp(x1 f1 + x2 f2 + x3 f3), and this is completely symmetrical in theta space, so l2 regularization makes sense and will treat all values indiscriminately. A recent paper by Agarwal and Daumé gives good reasons to always regularize with a conjugate prior instead of a Gaussian: http://hal3.name/docs/daume10conjugate.pdf

(Sep 08 '10 at 17:39) Alexandre Passos ♦

That link isn't accessible... do you have the name of the paper? Also, the set of distributions over 3 outcomes is a two-parameter family, so what you are giving is an overcomplete representation. If t1,t2,t3 gives data likelihood C, then so does (t1/x,t2/x,t3/x) for any x. Because you penalize by the squared norm, the solution to the corresponding minimization problem is always (0,0,0) ... that's probably worse than having a lopsided prior :)

(Sep 08 '10 at 19:54) Yaroslav Bulatov

This is true; this seems to be one of those circumstances where MAP doesn't make sense. To solve this, I guess it's better to constrain all parameters to sum to the same value, as can be done with the exponential family representation of the Bernoulli distribution (see the exponential families chapter of the Koller and Friedman book). Then you can regularize the thetas given that constraint, which would push them all towards the same value unless the likelihood requires something different.

The paper is "A geometric view of conjugate priors" and that link is down, but the arxiv link http://arxiv.org/pdf/1005.0047 is up.

(Sep 08 '10 at 20:10) Alexandre Passos ♦

Also maybe you could email some expert in the area and ask them about this and post it back here (or tell them that you posted it here and ask if they'd like to answer). It feels as if on most graphical model questions here it's mostly you, me, and spinxl39 talking, recently.

(Sep 08 '10 at 20:12) Alexandre Passos ♦

One Answer:

I asked Daphne Koller and she answered:

The equivalence in which you can divide the entire factor by a constant and get the same distribution holds in probability space. The regularization is applied in log space. In log space multiplying everything by a constant has the effect of sharpening or flattening the probability distribution, which can significantly change the log-likelihood.

So I (Alexandre) believe that the MAP estimate is not all zeros if you use the overparameterized distribution (one factor per outcome) with a Gaussian prior on the parameters.
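
A quick numerical check of this, as a sketch only. It assumes one log-space parameter per outcome with a softmax normalization and a squared-norm penalty; the function names and the gradient-ascent settings are illustrative choices, not anything from the thread.

```python
import math

def softmax(t):
    """Normalize log-space parameters into a probability distribution."""
    m = max(t)
    e = [math.exp(v - m) for v in t]
    s = sum(e)
    return [v / s for v in e]

def fit_overparam(counts, steps=5000, lr=0.1):
    """Gradient ascent on log-likelihood minus sum(t_i^2), one t_i per outcome."""
    n = sum(counts)
    t = [0.0] * len(counts)
    for _ in range(steps):
        p = softmax(t)
        t = [ti + lr * (c - n * pi - 2.0 * ti)
             for ti, c, pi in zip(t, counts, p)]
    return t, softmax(t)

t = [0.3, -0.1, 0.5]
base = softmax(t)
shifted = softmax([v + 2.0 for v in t])  # adding a constant: same distribution
scaled = softmax([v * 2.0 for v in t])   # multiplying: a sharper distribution

theta_a, p_a = fit_overparam((2, 1, 1))
theta_b, p_b = fit_overparam((1, 1, 2))
print(theta_a, p_a)
print(theta_b, p_b)
```

In this sketch the fitted parameters are not all zero, and the two datasets give estimates that are relabelings of each other, so the overparameterized prior treats the outcomes symmetrically.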

answered Sep 10 '10 at 11:37

Alexandre Passos ♦

Oops, good point: it's adding a constant to the log-parameters, not multiplying them by one, that preserves the log-likelihood. So this prior does treat all outcomes as symmetric, but now there is a bias towards low-entropy distributions; see the updated edit in the question.

(Sep 10 '10 at 16:39) Yaroslav Bulatov

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.