Typical optimization problems explore the function space and find local minima using some optimization method.

With respect to second-order optimization methods: given a nonlinear objective function, is there a mathematically principled way to explore the function space to understand its curvature?

Admittedly this depends heavily on finding the parameters of the function, but what work is out there that seeks to address this problem?

While this won't address the issue of actually finding a minimum, we might, for example, be able to tell whether the current machine can actually get stuck in one of these local minima.

What current techniques in the machine learning literature explore this fundamental idea?

The only thing I could think of (and I may sound like a noob here) is: what about integrals over function spaces?

asked Sep 17 '10 at 14:54


kpx

Most optimization problems you see in machine learning are not over function spaces. They minimize f(x) where x is a vector.

Integrals over function spaces are not generally relevant here either, since there you're integrating over functions (elements of, say, a Hilbert space) rather than over vectors, which is what most machine learning problems involve.

Second-order methods generally use local Taylor approximations, hence they work better when the function is locally convex. In general, for discontinuous functions there is no way to explore their curvature. Even for absolutely continuous and infinitely differentiable functions, you can have two different functions which have exactly the same derivatives (first, second, third, and so on to every order) at some point and yet have very little in common; the classic example is f(x) = exp(-1/x^2) for x != 0, f(0) = 0, whose derivatives of every order vanish at 0 exactly as the zero function's do. You will find such examples in any analysis textbook (maybe check out baby Rudin?).
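
To make that concrete, here is a small check of that classic construction (my own illustration, not part of the original comment, assuming sympy is available):

    import sympy as sp

    # Classic "flat" function: f(x) = exp(-1/x^2) for x != 0, f(0) = 0.
    # Every derivative has limit 0 at x = 0, so f agrees with the zero
    # function to all orders there, yet the two functions differ everywhere else.
    x = sp.symbols('x')
    f = sp.exp(-1 / x**2)
    for k in range(1, 5):
        print(k, sp.limit(sp.diff(f, x, k), x, 0))   # prints 0 for each k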

The most principled way to explore a function space is to use samples: either uniformly distributed on a fine grid or via Monte Carlo. However, the number of samples necessary to explore a highly curved, high-dimensional space is exponential in the number of dimensions, so this doesn't lead to interesting algorithms. Simulated annealing and other global optimization methods explore these ideas, but they are a lot slower than local optimizers when you want to find global minima.
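
As a rough sketch of that sampling idea (illustrative only; the objective, ranges, and sample counts are my own choices): probe curvature at random points by forming a finite-difference Hessian and checking the signs of its eigenvalues. Covering the space this way scales exponentially with the dimension d.

    import numpy as np

    def hessian_fd(f, x, h=1e-4):
        """Central finite-difference Hessian of f at the point x."""
        d = len(x)
        I = np.eye(d)
        H = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                H[i, j] = (f(x + h*I[i] + h*I[j]) - f(x + h*I[i] - h*I[j])
                           - f(x - h*I[i] + h*I[j]) + f(x - h*I[i] - h*I[j])) / (4*h*h)
        return H

    def curvature_profile(f, d, n_samples=1000, scale=2.0, seed=0):
        """Sample points uniformly in [-scale, scale]^d and classify the local curvature."""
        rng = np.random.default_rng(seed)
        counts = {"convex": 0, "concave": 0, "saddle": 0}
        for _ in range(n_samples):
            x = rng.uniform(-scale, scale, d)
            eig = np.linalg.eigvalsh(hessian_fd(f, x))
            if np.all(eig > 0):
                counts["convex"] += 1      # locally convex: second-order methods behave well here
            elif np.all(eig < 0):
                counts["concave"] += 1
            else:
                counts["saddle"] += 1
        return counts

    # Toy non-convex objective; covering [-2, 2]^d at resolution h needs on the order of (4/h)^d samples.
    f = lambda x: np.sum(np.sin(3 * x) + 0.1 * x**2)
    print(curvature_profile(f, d=3))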

(Sep 17 '10 at 15:34) Alexandre Passos ♦

One Answer:

If your function space is finite-dimensional, then think of your space of functions as a space of points. Consider the space of functions of the form f(a,b) = a x + b y. Every such function is identified by (x, y), which is a point in R^2. Now you can integrate or optimize over this space of functions by computing integrals or minimizing an objective over the (x, y) plane. An interesting question is what to do when you don't have a parametric representation of your function. Friedman deals with that situation in (Friedman, 2001), where the gradient becomes a function to be learned directly from data, and the final result of the minimization is a sum of gradients at intermediate points. Since L-BFGS typically represents the approximate Hessian in terms of the last 3-5 gradient vectors, I imagine this could also be extended to the setting where gradients are functions.
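
Here is a minimal sketch of that functional-gradient idea under squared loss (my own illustration, not Friedman's exact algorithm): the negative gradient y - F(x) is itself fit as a function, here by a one-split regression stump, and the final model is the accumulated sum of these fitted gradients.

    import numpy as np

    def fit_stump(x, r):
        """Fit a one-split regression stump to the pseudo-residuals r."""
        best = None
        for t in np.unique(x):
            left, right = r[x <= t], r[x > t]
            if len(right) == 0:
                continue
            pred = np.where(x <= t, left.mean(), right.mean())
            err = np.mean((r - pred) ** 2)
            if best is None or err < best[0]:
                best = (err, t, left.mean(), right.mean())
        _, t, lval, rval = best
        return lambda z: np.where(z <= t, lval, rval)

    def boost(x, y, n_rounds=100, lr=0.1):
        """Functional gradient descent on squared loss: F is a growing sum of stumps."""
        stumps = []
        F = lambda z: np.zeros_like(z, dtype=float)
        for _ in range(n_rounds):
            residual = y - F(x)                      # negative gradient of 0.5 * (y - F)^2
            stumps.append(fit_stump(x, residual))    # "learn the gradient" as a function
            fs = list(stumps)
            F = lambda z, fs=fs: lr * sum(h(z) for h in fs)
        return F

    # Toy usage: learn a sine curve without any explicit parametric form for F.
    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, 200)
    y = np.sin(x) + 0.1 * rng.normal(size=200)
    F = boost(x, y)
    print("train MSE:", np.mean((y - F(x)) ** 2))

Note that F never gets an explicit parametric form; it is just a sum of learned gradient approximations, which is the representation any extension of L-BFGS to this setting would have to work with.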

answered Sep 17 '10 at 17:09


Yaroslav Bulatov

Typically the functions we learn in statistical machine learning, like SVMs (solved by second-order optimization rather than QP) and regression, are of the simple form LSE + regularizer.

Why do we use only such simple functions?
I can think of two reasons: 1) they are easily optimized by a variety of techniques and have nice closed-form solutions; 2) the kernel trick follows so naturally from the minimization of LSE + regularizer, and this leads to good things like reduced computational complexity and makes the data linearly separable in a high-dimensional space.
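
To illustrate both points (my own sketch, not from the thread): ridge regression, i.e. LSE + an L2 regularizer, has a closed-form solution, and rewriting that solution in terms of inner products gives kernel ridge regression, which is exactly where the kernel trick enters.

    import numpy as np

    def ridge(X, y, lam=1.0):
        """Primal closed form: w = (X^T X + lam I)^{-1} X^T y."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def kernel_ridge(K, y, lam=1.0):
        """Dual closed form: alpha = (K + lam I)^{-1} y; predictions are K_test @ alpha."""
        n = K.shape[0]
        return np.linalg.solve(K + lam * np.eye(n), y)

    # Illustrative data; with the linear kernel X X^T the two forms agree exactly.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

    w = ridge(X, y)
    alpha = kernel_ridge(X @ X.T, y)
    print(np.allclose(X @ w, (X @ X.T) @ alpha, atol=1e-6))   # True

Swapping the linear kernel for a nonlinear one is the kernel trick, and the implicit high-dimensional feature space is what makes the data linearly separable there.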

I wanted to study my earlier question in this context. The paper you cite gives a useful connection in this regard...

How is this connected to things like manifold learning (in the sense of finding minima of functions)? If I understand it correctly, the claim is that data is concentrated not on one high-dimensional manifold but on several low-dimensional ones, no? Doesn't the latter view seem a much more "natural" description of real-life data?

What kinds of functions do we optimize in that case?

Both views might be correct... I mean, data could lie on one manifold or on several... But how could one know which? What work addresses this?

The gist of what I am trying to get at is: what is the way to analyze the "model + data" combination, where the models are obtained by gradient and/or second-order optimization techniques? It seems that different models work well on different data, and exploring the function space would be very useful in this regard...

BTW, when you say "what to do when you don't have a parametric representation of your function", you mean I don't have the a and b in f(a,b), right?

(Sep 17 '10 at 18:02) kpx

There are many complicated models, but only a few simple ones. Also, every simple model is close to many complicated models. So if you are trying models at random, you are more likely to have success with simple models. Think of regression: if you don't see your data and have one shot at fitting, would you try to fit it with y = a x, or with y = sinh(sqrt(a x^2))? Ultimately, you need a domain expert to improve over simple models; this is formalized by the No Free Lunch theorems, http://www.no-free-lunch.org/

(Sep 17 '10 at 19:09) Yaroslav Bulatov
