Typical optimization problems explore the function space and find local minima using some optimization method. With respect to second-order optimization methods, given a nonlinear objective function, is there a mathematically principled way to explore the function space to understand its curvature? (This of course depends heavily on the parameterization of the function.) What work out there seeks to address this problem? While this won't address the issue of actually finding a minimum, we might, for example, be able to tell whether the current machine can actually get stuck in one of these local minima. What current techniques in the machine learning literature explore this fundamental idea? The only thing I could think of, and I may sound like a noob here, is integrals over function spaces. What about those?
If your function space is finite dimensional, think of your space of functions as a space of points. Consider the space of functions of the form f(a,b) = a x + b y. Every such function is identified by (x,y), which is a point in R^2. Now you can integrate or optimize over this space of functions by computing integrals or minimizing an objective over the (x,y) plane (a small sketch of this is at the end of this comment). An interesting question is what to do when you don't have a parametric representation of your function. Friedman deals with that situation in (Friedman, 2001), where the gradient becomes a function to be learned directly from data, and the final result of the minimization is a sum of gradients at intermediate points. Since L-BFGS typically represents the approximate Hessian in terms of the last 3-5 gradient vectors, I imagine this could also be extended to the setting where gradients are functions.

Typically the functions we learn in statistical machine learning, like SVMs (SVMs solved by second-order optimization, not QP) and regression, are simple least-squares error + regularizer. Why do we use only such simple functions? I wanted to study my earlier question in this context; the paper you point to gives some useful connections in this regard. How is this connected to things like manifold learning (in the sense of finding minima of functions)? If I understand it correctly, the claim is that data is concentrated not on one high-dimensional manifold but on several low-dimensional ones, no? Doesn't the latter view seem a much more "natural" description of real-life data? What kinds of functions do we optimize in that case? Both views might be correct... I mean, data could lie on one or several manifolds, but how could one know which? What works address this? The gist of what I am trying to see here is: what is the way to analyze the "model + data" combination when the models are obtained by gradient and/or second-order optimization techniques? It seems that different models work well on different data, and exploring the function space would be very useful in this regard. BTW, when you say "what to do when you don't have a parametric representation of your function", do you mean I don't have a and b in f(a,b)?
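To make the "functions as points" picture above concrete, here is a minimal Python sketch (my own illustration, not from the original thread; the data, loss, and the use of scipy's L-BFGS-B are assumptions for the example): each function f(a,b) = a x + b y is identified by the point (x,y), and minimizing a squared-error objective over this function space is just minimizing over the (x,y) plane.

```python
# Minimal sketch: optimize over a finite-dimensional function space by
# optimizing over its coordinate plane. The data and loss are made up.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))          # inputs (a, b) for 100 examples
true_xy = np.array([2.0, -1.0])        # coordinates (x, y) of the "true" function
targets = A @ true_xy + 0.1 * rng.normal(size=100)

def loss(xy):
    # squared error, on the data, of the function identified by the point (x, y)
    return np.mean((A @ xy - targets) ** 2)

# L-BFGS walks over the (x, y) plane using gradients (estimated numerically
# here) and a limited-memory approximation of the Hessian.
result = minimize(loss, x0=np.zeros(2), method="L-BFGS-B")
print(result.x)  # should land close to true_xy
```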
(Sep 17 '10 at 18:02)
kpx
There are many complicated models, but only a few simple ones. Also, every simple model is close to many complicated models. So if you are trying models at random, you are more likely to have success with simple models. Think of regression: if you haven't seen your data and have one shot at fitting, would you try to fit it with y = a x, or with y = sinh(sqrt(a x^2))? Ultimately, you need a domain expert to improve over simple models; this is formalized by the No Free Lunch theorems, http://www.no-free-lunch.org/
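Here is a toy version of that one-shot fitting comparison (my own sketch with made-up linear data; it only illustrates the single case where the data really is simple):

```python
# Fit the same noisy linear data with the simple model y = a*x and with the
# more elaborate y = sinh(sqrt(a*x^2)), then compare in-sample squared error.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)
x = np.linspace(0.1, 3.0, 50)
y = 1.5 * x + 0.2 * rng.normal(size=x.size)

def simple(x, a):
    return a * x

def complicated(x, a):
    return np.sinh(np.sqrt(a * x ** 2))

for model in (simple, complicated):
    # keep a > 0 so the square root stays real during the fit
    params, _ = curve_fit(model, x, y, p0=[1.0], bounds=(1e-6, np.inf))
    mse = np.mean((model(x, *params) - y) ** 2)
    print(model.__name__, params, mse)  # the simple model fits this data much better
```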
(Sep 17 '10 at 19:09)
Yaroslav Bulatov
Most optimization problems you see in machine learning are not over function spaces. They minimize f(x) where x is a vector.
Integrals over function spaces are not generally relevant here: there you integrate over functions (e.g., elements of a Hilbert space), whereas most machine learning problems integrate or optimize over vectors.
Second-order methods generally use local Taylor approximations, hence they work better when the function is locally convex. In general, for discontinuous functions there is no way to explore their curvature. Even for infinitely differentiable functions you can have two different functions which have exactly the same derivatives (and second derivatives, and third, all the way up to infinity) at some point and yet have very little in common. You will find such examples in any analysis textbook (maybe check out baby Rudin?).
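For concreteness (my addition, not the answerer's): the standard smooth-but-non-analytic example from analysis texts is

```latex
% f and the zero function g agree in every derivative at x = 0,
% yet f is nonzero at every x \neq 0.
f(x) =
\begin{cases}
  e^{-1/x^{2}}, & x \neq 0, \\
  0,            & x = 0,
\end{cases}
\qquad
g(x) \equiv 0,
\qquad
f^{(k)}(0) = g^{(k)}(0) = 0 \ \text{for all } k \ge 0 .
```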
The most principled way to explore a function space is to use samples: either densely and uniformly spaced, or drawn by Monte Carlo. However, the number of samples necessary to explore a highly curved high-dimensional space is exponential in the number of dimensions, so this doesn't lead to interesting algorithms. Simulated annealing and other global optimization methods build on these ideas, but they are a lot slower than local optimizers when you want to find global minima.
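A minimal sketch of the sampling idea and of how quickly it degrades with dimension (my own illustration; the Rastrigin function is just a stand-in for a highly multimodal objective):

```python
# Explore a multimodal landscape by uniform Monte Carlo sampling in [-5, 5]^d
# and keep the best value found under a fixed sample budget.
import numpy as np

def rastrigin(x):
    # many local minima; the global minimum is 0 at the origin
    return 10 * x.shape[-1] + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x), axis=-1)

rng = np.random.default_rng(0)
for d in (2, 5, 10):
    samples = rng.uniform(-5, 5, size=(100_000, d))
    best = rastrigin(samples).min()
    # with the same budget, the best value found gets much worse as d grows,
    # which is the exponential sample-complexity point made above
    print(d, best)
```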