|
Using an L1 regularization term leads to sparse solutions. Can anyone explain intuitively why this is so? I know it has something to do with the fact that the L1 sphere is very "pointy" in high dimensions, but I do not know the details.
|
Here's the geometric idea from the original lasso paper (Regression Shrinkage and Selection via the Lasso):
In the figure from that paper, the center of the ellipses (the point marked $\hat{\beta}$) is the non-regularized OLS estimate. Each elliptical contour around the center defines a set of (biased) weights that result in the same squared error. Now consider L1 regularization, where we want to minimize the squared error subject to the constraint that the sum of the absolute values of the weights is less than some constant. This constraint defines a "square" around the origin (the black square on the left). (In ridge regression, where we want the sum of the squares of the weights to be less than some constant, we get a sphere around the origin, as in the picture on the right.) So what L1 regularization is doing is looking for the first elliptical contour that touches the black square. This is likely to happen at a corner of the square, and a corner is precisely where some of the weights are zeroed out.
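
You can see the same effect numerically. Here is a minimal sketch (not from the original answer) using scikit-learn's Lasso and Ridge on synthetic data where only a few features matter; the data sizes and alpha values are arbitrary choices for illustration:

```python
# Minimal sketch: lasso (L1) vs ridge (L2) on the same synthetic data.
# Only 3 of 20 features actually influence y.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]                 # the only relevant features
y = X @ true_beta + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)               # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)               # L2 penalty

print("exact zeros in lasso coef_:", np.sum(lasso.coef_ == 0.0))  # typically many
print("exact zeros in ridge coef_:", np.sum(ridge.coef_ == 0.0))  # typically none
```

The lasso coefficients for the irrelevant features land on a corner of the constraint set, i.e. exactly at zero, while ridge merely shrinks them toward zero.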
|
This is another point of view. What you really want to minimize is the number of non-zero elements in your vector, also known as the $\ell_0$ norm (technically not a norm, because it does not satisfy the scaling property). However, minimizing the number of non-zero elements directly is not tractable: the $\ell_0$ norm is non-convex and the problem is combinatorial. The $\ell_1$ norm is the closest function to the $\ell_0$ norm that is still convex, which is why it is used as a surrogate. This figure illustrates the idea:
Figure from E. J. Candès, M. B. Wakin, and S. P. Boyd, "Enhancing Sparsity by Reweighted $\ell_1$ Minimization," Journal of Fourier Analysis and Applications, vol. 14, no. 5, pp. 877–905, Nov. 2008.
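
As a small numeric illustration of why $\ell_1$ can stand in for $\ell_0$ while $\ell_2$ cannot (my own example, not from the cited paper): take two vectors with the same $\ell_2$ norm, one sparse and one dense.

```python
# Two vectors with identical l2 norm: l1, like l0, prefers the sparse one,
# while l2 cannot distinguish them at all.
import numpy as np

sparse = np.array([2.0, 0.0, 0.0, 0.0])   # one nonzero entry
dense  = np.array([1.0, 1.0, 1.0, 1.0])   # same energy, spread over all entries

for name, v in [("sparse", sparse), ("dense", dense)]:
    l0 = np.count_nonzero(v)              # number of nonzeros ("l0 norm")
    l1 = np.abs(v).sum()
    l2 = np.linalg.norm(v)
    print(f"{name}: l0={l0}  l1={l1:.2f}  l2={l2:.2f}")
# sparse: l0=1  l1=2.00  l2=2.00
# dense:  l0=4  l1=4.00  l2=2.00
```

Penalizing the $\ell_1$ norm therefore favors concentrating the same amount of energy in few coordinates, which is the behavior we actually want from $\ell_0$, while staying convex.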
|
grautur's answer is great. But here is another way to look at it. Consider the derivative of the L2 regularizer penalty when a parameter is zero. For example, take a simple two-parameter problem with current weights $x = 1$, $y = 0$. The regularizer cost is $\sqrt{x^2 + y^2}$, and $\frac{\partial}{\partial y}\sqrt{x^2 + y^2} = \frac{y}{\sqrt{x^2 + y^2}}$: the derivative with respect to a parameter is proportional to the current value of that parameter. So when $y = 0$ (and $x \neq 0$), there is no regularization cost to making an infinitesimal change to $y$, and the optimizer is free to move $y$ away from zero whenever that improves the training error even a tiny little bit. With an L1 penalty $|x| + |y|$, by contrast, the derivative with respect to $y$ is $\pm 1$ no matter how small $|y|$ is, so any move away from zero immediately incurs a regularization cost; the optimizer only moves $y$ off zero if the gain in training error outweighs that constant cost, which is what keeps many weights at exactly zero.
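
Here is a toy sketch of that effect (not part of the answer above): a two-weight least-squares problem where the second feature is pure noise, optimized with plain gradient descent under a squared L2 penalty and with proximal gradient (soft-thresholding) under an L1 penalty. The step size, penalty strength, and data are arbitrary illustrative choices, and note the answer uses the un-squared L2 norm while the sketch uses the usual squared ridge penalty, whose gradient likewise vanishes at zero.

```python
# Toy sketch: with an L2 penalty the useless weight is shrunk but (generically)
# never exactly zero, because the penalty gradient vanishes near zero; with an
# L1 penalty the soft-thresholding step drives it exactly to zero.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)     # second feature is pure noise

lam, lr, steps = 0.1, 0.01, 5000

def loss_grad(w):
    # gradient of (1 / 2n) * ||X w - y||^2
    return X.T @ (X @ w - y) / n

# squared L2 penalty lam * ||w||^2: its gradient 2 * lam * w vanishes as w -> 0
w_l2 = np.zeros(2)
for _ in range(steps):
    w_l2 -= lr * (loss_grad(w_l2) + 2.0 * lam * w_l2)

# L1 penalty lam * ||w||_1: proximal gradient step = soft-threshold by lr * lam
w_l1 = np.zeros(2)
for _ in range(steps):
    w_l1 -= lr * loss_grad(w_l1)
    w_l1 = np.sign(w_l1) * np.maximum(np.abs(w_l1) - lr * lam, 0.0)

print("L2-penalized weights:", w_l2)   # second weight small but nonzero
print("L1-penalized weights:", w_l1)   # second weight exactly 0.0
```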

