I'm looking at this paper and I'm surprised that, in their analysis, the convergence rate depends on the value of the error function at the optimum (e.g. Eqs. (5) and (6) have L(w*) on the right-hand side).

However, if we switch from the error function l(w, z) to l(w, z) + 2013, the algorithms (e.g. SGD) should behave in exactly the same way as before, so why do the convergence rates need to differ?!
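To make the point concrete, here is a small sanity check (my own illustration, not from the paper, using a squared loss as a stand-in): SGD run on l and on l + 2013 produces identical iterates, because the constant vanishes when taking the gradient.

```python
import numpy as np

# Illustration only: adding a constant to the loss does not change SGD,
# because the constant disappears in the gradient.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

def grad_sq(w, x, yi):
    # Gradient of the squared loss l(w, (x, y)) = (x.w - y)^2.
    return 2.0 * (x @ w - yi) * x

def grad_sq_shifted(w, x, yi):
    # Gradient of l(w, (x, y)) + 2013: the derivative of the constant is zero,
    # so this is exactly the same expression as grad_sq.
    return 2.0 * (x @ w - yi) * x

def run_sgd(grad_fn, n_steps=300, lr=0.01, seed=1):
    w = np.zeros(X.shape[1])
    order = np.random.default_rng(seed).integers(0, len(y), size=n_steps)
    for i in order:                      # same sample order for both runs
        w = w - lr * grad_fn(w, X[i], y[i])
    return w

print(np.allclose(run_sgd(grad_sq), run_sgd(grad_sq_shifted)))  # True
```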

asked Oct 25 '13 at 19:37


Max

edited Oct 26 '13 at 11:53

Can you be more specific? Just glancing through the paper, AG simply appears to use a momentum term in its updates, which speeds up convergence. Roughly, their analysis shows that the bound on the error between the weights after the nth iteration and the optimal weights is tighter with AG than with SGD, indicating faster convergence with their method.
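For context, a momentum-style update of the kind referred to here looks roughly like the following (a generic Nesterov-style sketch, not necessarily the exact accelerated-gradient variant analyzed in the paper):

```python
# Generic Nesterov-style momentum update (a sketch of the idea only; not
# necessarily the exact accelerated-gradient method analyzed in the paper).
def accelerated_step(w, w_prev, grad, lr=0.1, beta=0.9):
    v = w + beta * (w - w_prev)    # extrapolate along the previous step (momentum)
    return v - lr * grad(v), w     # gradient step taken at the look-ahead point

# Tiny usage example on the 1-D quadratic f(w) = w^2, whose gradient is 2w:
w, w_prev = 5.0, 5.0
for _ in range(200):
    w, w_prev = accelerated_step(w, w_prev, lambda v: 2.0 * v)
print(abs(w) < 1e-6)  # True: the iterates converge to the minimizer 0
```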

(Oct 26 '13 at 11:38) Rakesh Chalasani

@RakeshChalasani edited

(Oct 26 '13 at 11:56) Max

One Answer:

L(w*) is the expected loss at the "optimal" (or in some sense "best") solution w*. If L(w) is the expected loss one arrives at after n iterations, then the bound on ||L(w) - L(w*)|| helps to understand how close one is to the optimal solution. Their analysis gives an indication of how their proposed method has a tighter upper bound, indicating faster convergence to the solution after n iterations.

The bound is not between two different loss functions as you indicated, but rather the same loss function computed at two different points in the solution space of w.
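For illustration only (this is the generic shape such results take for smooth, nonnegative losses, not a quote of the paper's Eqs. (5) and (6)), a bound of this flavor typically reads

$$\mathbb{E}\big[L(w_n)\big] - L(w^*) \;\le\; O\!\left(\sqrt{\frac{H\,L(w^*)}{n}} + \frac{H}{n}\right),$$

where H is the smoothness constant of the loss and n the number of iterations (norm and log factors omitted). When L(w*) is small the first term shrinks and the rate improves toward O(1/n), which is how L(w*) ends up on the right-hand side.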

answered Oct 26 '13 at 18:13


Rakesh Chalasani

edited Oct 26 '13 at 18:14

Sorry, but I'm afraid you didn't understand the question. Take L and consider a new loss L + 2013. The left-hand side (||L(w) - L(w*)||) doesn't change, but the right-hand side does, even though SGD should be unaffected. Do you see the problem now?

(Oct 28 '13 at 00:50) Max

Yes, I get your point now! Sorry about that. Odd as it may seem, L(w*) does matter for the convergence bound. But I think it is important to note that L(w) is a loss function here, not an arbitrary convex function. Adding a constant (e.g. L(w) + 2013) amounts to saying that that much loss is bound to happen, for example that a certain amount of misclassification is simply a given. Here is a point they make in the same paper (just below Eq. (3)): "This self-bounding property tells us that the norm of the gradient is small at a point if the loss is itself small at that point.".
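To spell out the property being quoted (a standard derivation for a nonnegative, H-smooth loss; the paper's constant may differ, e.g. 4H instead of 2H):

$$\ell(y,z) \;\le\; \ell(x,z) + \langle \nabla \ell(x,z),\, y - x\rangle + \tfrac{H}{2}\|y - x\|^2 .$$

Plugging in y = x - (1/H) ∇ℓ(x, z) and using ℓ ≥ 0 gives

$$0 \;\le\; \ell(x,z) - \tfrac{1}{2H}\,\|\nabla \ell(x,z)\|^2 \qquad\Longrightarrow\qquad \|\nabla \ell(x,z)\|^2 \;\le\; 2H\,\ell(x,z).$$

If the loss is shifted to ℓ + 2013, the gradient on the left is unchanged, but the right-hand side is inflated by 2H · 2013, so the self-bound (and any convergence rate derived through L(w*)) becomes much looser. The bound is still valid, just no longer tight, which is why the analysis, unlike SGD itself, is not invariant to adding a constant to the loss.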

(Oct 28 '13 at 09:30) Rakesh Chalasani