|
I am working on a problem and one of the formulations involves maximizing the value of sum_i C_i log p_i, where C_i is a constant and p_i is constrained to be a log-probability. Intuitively, it makes sense to assign a higher probability to p_i values for which C_i is higher, but I'm not sure how to make this intuitive. Besides gradient steps followed by L1 projection or full-scale convex optimizers, how would you approach this problem? |
|
This is a pretty common problem with a satisfyingly simple solution: each p_i should be set proportional to its corresponding C_i value. There are two ways to see this. The first way is to add a constant to your objective function so that it takes the form of a Kullback-Leibler divergence: Let f(p_1, ..., p_N) = sum_i C_i log p_i, that is, your original objective function. Divide the whole thing by the sum of the C_i variables so that the constants are now a discrete probability distribution. Let C_tot = sum_i C_i, and let D_i = C_i / C_tot be the new constants. Obviously, maximizing sum_i C_i log p_i with respect to the p_i variables is the same as maximizing sum_i D_i log p_i. Now multiply by negative 1 and add the constant term sum_i D_i log D_i to the objective function. Since we multiplied by a negative, we now have to minimize instead of maximize, and our new objective function is sum_i D_i log D_i - sum_i D_i log p_i, which can be simplified as sum_i D_i log (D_i / p_i), which is exactly the Kullback-Leibler divergence from D to p. Clearly, this is minimized when each p_i = D_i. The other way to get this solution is to use the method of Lagrange multipliers. We can rewrite our constrained objective function as an unconstrained objective function with an additional variable, lambda, as follows: sum_i C_i log p_i - lambda (1 - sum_i p_i). The term lambda (1 - sum_i p_i) represents the constraint that the p_i variables have to sum to 1. Now we just set the partial derivatives equal to zero to find the optimum. The partial derivative w.r.t. p_i is p_i/C_i + lambda, and setting this equal to zero, we find that p_i = C_i/lambda. In other words, no matter what value of lambda optimizes our objective, each of the p_i variables must be set proportional to the corresponding C_i. I hope this answers your question; it's not exactly clear whether you're directly manipulating the p_i variables, or if they're functions of some other variables which you are manipulating. Thanks, this is exactly what I was looking for.
(Mar 16 '11 at 06:49)
Alexandre Passos ♦
|
|
Assuming p_i is your variable, as opposed to p_i(x_i) or something similar: Instead of constraining 0 < p_i <= 1, a simple re-statement to be a minimization problem:
... where -inf < LP_i <= 0. Hinton's rectified linear units are somewhat similar to this problem. He describes them as:
This makes sense, since the sum on the left is a crude form of integration, and an anti-derivative of the sigmoid is log(1 + e^x). log(1 + e^x) looks like the function max(0, x) with smooth but non-linear behavior for values of x near zero. The cross-over I see is in the fact that LP_i is constrained as 0 <= -LP_i < +inf, and LRUs have the exact same constraint. The shape or behavior of the LRU function may not be what you're after, but using some type of gradient learning rule on the LRU parameter would probably net you a reasonable solution without a lot of heart-ache.
This answer is marked "community wiki".
|
To be clear, you'd be optimizing on p_i or is p_i a function?
Yes, optimizing on p_i