Hi all,

I am curious if there is any work that tries to go along the direction of fitting programs to data.

For example, the MNIST digits are normally viewed as high-dimensional data points, but the "correct" model of a digit (say a 5) is in some sense not a cluster of data points, but a program: someone took a pen and wrote a 5. The best latent model of all 5's then is a little program that starts somewhere, "draws ink" to the left, a bit down, and then around in a circle. Wouldn't we want to have semi-automatic methods for discovering this truth?
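To make this concrete, here is a toy sketch of what such a "pen program" for a 5 might look like. Everything here is made up for illustration (the coordinates, the rasterizer, the 28x28 canvas size borrowed from MNIST); it's just the "ink to the left, a bit down, around in a circle" idea written out:

```python
import numpy as np

def stroke(canvas, pts):
    """Rasterize a polyline of pen positions onto a binary canvas."""
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        for t in np.linspace(0.0, 1.0, 20):
            x, y = x0 + t * (x1 - x0), y0 + t * (y1 - y0)
            canvas[int(round(y)), int(round(x))] = 1.0

canvas = np.zeros((28, 28))
stroke(canvas, [(20, 5), (9, 5)])        # draw ink to the left (top bar)
stroke(canvas, [(9, 5), (9, 13)])        # a bit down (the stem)
cx, cy, r = 11, 18, 5                    # then around in a circle (the bowl)
arc = [(cx + r * np.cos(a), cy + r * np.sin(a))
       for a in np.linspace(-np.pi / 2, np.pi, 30)]
stroke(canvas, [(9, 13)] + arc)

for row in canvas:
    print("".join("#" if v else "." for v in row))
```

The latent model I'm imagining would recover something like this program (and a distribution over its parameters) from pixel data alone.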

In other words, are there any ways to try to automatically model not the appearance of the data, but their causes? In some sense, every real-world dataset is a result of a causal process. Also, this is the kind of learning that people seem to often do. Think of, for example, a physicist going through tons of data from an experiment. Often they aren't looking to model its appearance, but to explain it (i.e. what process gave rise to this data?), and that insight is much more powerful.

I am not naive to the immediate difficulty of doing this, and I wouldn't know where to start on most datasets (a set of parametrized basis programs??), but I'm wondering if anyone has at least tried.

EDIT: I already accepted an answer, but for the future I should say that the best example of what I was looking for is really the work and ideas of Josh Tenenbaum. I just watched his NIPS talk, in which he happens to bring up both of the examples I gave.

asked Feb 15 '11 at 13:49

karpathy

edited Mar 08 '11 at 22:38


3 Answers:

You seem to be asking about two different problems: learning programs and learning causal structures. When learning programs you have no guarantee you're actually approaching the causal structure of the data, and the current state of the art in reasoning about causal inference involves declarative rather than procedural (that is, program-like) models.

The oldest way of learning programs is genetic programming. Essentially, you use a language like LISP that allows easy structural operations on code, define an error function (or, if you prefer to think of it this way, a test set that assigns a numerical score to each program), and randomly futz with random programs, mixing and matching random parts, until the score you have achieved is good enough. This is prone to overfitting, and to producing completely uninterpretable programs. A more recent, and far more interesting, approach is the paper by Liang, Jordan, and Klein, Learning Programs: A Hierarchical Bayesian Approach, which builds heavily on the observation that many constructs are shared among programs for different tasks, and that it is more efficient to learn to reuse these constructs than to recreate them from scratch, effectively controlling overfitting.
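For flavor, here is a minimal genetic-programming sketch: symbolic regression over arithmetic expression trees, with mutation only (real GP also uses crossover), and a made-up target function and population sizes:

```python
import random

OPS = {'+': lambda a, b: a + b, '*': lambda a, b: a * b}

def rand_tree(depth=3):
    """Random expression tree over x, constants, + and *."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(['x', random.uniform(-2, 2)])
    return (random.choice(list(OPS)), rand_tree(depth - 1), rand_tree(depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def error(tree, data):
    return sum((evaluate(tree, x) - y) ** 2 for x, y in data)

def mutate(tree):
    """Replace a random subtree with a fresh random one."""
    if not isinstance(tree, tuple) or random.random() < 0.3:
        return rand_tree(2)
    op, left, right = tree
    if random.random() < 0.5:
        return (op, mutate(left), right)
    return (op, left, mutate(right))

data = [(x, x * x + x) for x in range(-5, 6)]   # target: x^2 + x
pop = [rand_tree() for _ in range(200)]
for gen in range(50):
    pop.sort(key=lambda t: error(t, data))       # keep the fittest, futz with them
    pop = pop[:50] + [mutate(random.choice(pop[:50])) for _ in range(150)]

pop.sort(key=lambda t: error(t, data))
print(pop[0], error(pop[0], data))
```

Note how little pressure there is toward interpretability here: the fittest tree can be an arbitrary tangle, which is exactly the overfitting problem mentioned above.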

For learning causal structures, the state of the art follows Judea Pearl and other researchers. A good read on the topic is his book Causality: Models, Reasoning, and Inference, which explores the connections between Bayesian networks and causal dependencies and, if I recall correctly, introduces a notation called the do-calculus that models causal inference as arising from interventional experiments. You can find more information on the state of the art of this research at Isabelle Guyon's causality workbench, and many more references in the papers from the NIPS 2008 workshop on causality.
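To illustrate the do-operator's point (the numbers here are entirely made up), consider the confounded graph Z -> X, Z -> Y, X -> Y. Conditioning on X is not the same as intervening on X:

```python
# Toy demonstration that P(Y=1 | X=1) != P(Y=1 | do(X=1)) under confounding.
P_z = {0: 0.5, 1: 0.5}                     # P(Z)
P_x_given_z = {0: 0.1, 1: 0.9}             # P(X=1 | Z=z)
P_y_given_xz = {(0, 0): 0.1, (0, 1): 0.5,  # P(Y=1 | X=x, Z=z)
                (1, 0): 0.4, (1, 1): 0.8}

# Observational: conditioning on X=1 skews the distribution over Z.
p_x1 = sum(P_z[z] * P_x_given_z[z] for z in (0, 1))
p_obs = sum(P_z[z] * P_x_given_z[z] * P_y_given_xz[(1, z)]
            for z in (0, 1)) / p_x1

# Interventional: do(X=1) cuts the Z -> X edge; Z keeps its prior.
p_do = sum(P_z[z] * P_y_given_xz[(1, z)] for z in (0, 1))

print(p_obs, p_do)   # 0.76 vs 0.6 -- association overstates the causal effect
```

This is the backdoor adjustment in miniature: a purely predictive model of the data would happily report the first number, while the causal question asks for the second.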

answered Feb 15 '11 at 14:10

Alexandre Passos ♦

Thank you, these are great links and keywords. Google failed me on this.

(Feb 15 '11 at 14:20) karpathy

You should also check out the literature on Inductive Logic Programming (ILP), which is similar to what people do with Prolog, but for inductive rather than deductive reasoning.

More recent developments in ILP focus on integrating reasoning under uncertainty, and are hence somewhat converging with Bayesian modeling, Markov Logic Networks, and so on.
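To give a feel for the inductive direction (this is a made-up toy, not how a real ILP system like Progol or Aleph works): instead of querying a rule, we search a tiny hypothesis space of rules for one consistent with positive and negative examples:

```python
# Toy ILP-flavored search: find a rule of the form
#   grandparent(X, Y) :- r1(X, Z), r2(Z, Y)
# consistent with the examples. All facts here are invented.
parent = {('alice', 'bob'), ('bob', 'carol'), ('dave', 'alice')}
relations = {'parent': parent,
             'child': {(b, a) for (a, b) in parent}}

pos = {('alice', 'carol'), ('dave', 'bob')}    # grandparent holds
neg = {('alice', 'bob'), ('carol', 'alice')}   # grandparent does not hold

consts = {c for rel in relations.values() for pair in rel for c in pair}

def covers(r1, r2, pair):
    """Does r1(X, Z), r2(Z, Y) prove the pair for some Z?"""
    x, y = pair
    return any((x, z) in relations[r1] and (z, y) in relations[r2]
               for z in consts)

for r1 in relations:
    for r2 in relations:
        if (all(covers(r1, r2, e) for e in pos)
                and not any(covers(r1, r2, e) for e in neg)):
            print(f"grandparent(X, Y) :- {r1}(X, Z), {r2}(Z, Y).")
```

Deduction would run the rule forward over facts; induction, as here, recovers the rule from examples. The probabilistic ILP work mentioned above relaxes the all-or-nothing coverage test into a likelihood.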

answered Feb 15 '11 at 14:19

ogrisel

If you are using something like sparse coding, then the dictionary you are learning implicitly amounts to stroke patterns, and each digit is a sparse combination of the dictionary elements. For example, see the papers Differentiable Sparse Coding, or Simple Method for High-Performance Digit Recognition Based on Sparse Coding.
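A minimal sketch of the inference half of sparse coding (greedy matching pursuit against a fixed dictionary; the dictionary here is random as a stand-in, whereas in the papers above it would be learned from digit images, which is where the stroke-like atoms come from):

```python
import numpy as np

def matching_pursuit(D, x, k=3):
    """Greedily pick k atoms (columns of D) to approximate x."""
    residual, code = x.copy(), np.zeros(D.shape[1])
    for _ in range(k):
        j = np.argmax(np.abs(D.T @ residual))   # best-matching atom
        coef = D[:, j] @ residual
        code[j] += coef
        residual -= coef * D[:, j]
    return code, residual

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 32))                   # stand-in for learned "strokes"
D /= np.linalg.norm(D, axis=0)                  # unit-norm atoms
x = 2.0 * D[:, 3] - 1.0 * D[:, 17]              # a "digit" built from two strokes
code, residual = matching_pursuit(D, x)
print(np.nonzero(code)[0], np.linalg.norm(residual))
```

The sparse code is the loose analogue of the question's "program": a short description saying which strokes were used and how strongly, though without any ordering or pen dynamics.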

There has also been prior work on modeling stroke patterns more explicitly. For example, using a Chebyshev series representation of the 2D coordinates X and Y (see the paper Online Stroke Modeling for Character Recognition). Another approach is a 2D HMM (see Handwritten Digit Recognition Using Two-dimensional Hidden Markov Models for Feature Extraction).
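Roughly the flavor of the Chebyshev idea, under my own made-up pen trajectory: each stroke's coordinate functions X(t), Y(t) compress into a short coefficient vector that can then feed a recognizer:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Hypothetical pen trajectory for one stroke, sampled at 50 time steps
# over normalized time t in [-1, 1].
t = np.linspace(-1.0, 1.0, 50)
x = np.cos(1.5 * t) + 0.1 * np.sin(8 * t)      # stand-in for recorded X(t)
y = np.sin(1.5 * t)                            # stand-in for recorded Y(t)

deg = 6                                        # compact stroke descriptor
cx, cy = C.chebfit(t, x, deg), C.chebfit(t, y, deg)
x_hat, y_hat = C.chebval(t, cx), C.chebval(t, cy)
print("descriptor length:", len(cx) + len(cy))
print("max reconstruction error:",
      max(np.abs(x - x_hat).max(), np.abs(y - y_hat).max()))
```

Unlike the pixel-space models, this operates directly on the pen's causal trajectory, which is closer to what the question is asking for, though it requires online (tablet-recorded) data rather than static images.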

answered Feb 15 '11 at 14:19

ebony
