To start with, let's say that someone has a supervised learning model that performs well from a loss-function standpoint but suffers from a high run-time for prediction. That is, the model is trained on a small subset of data points but then needs to generate predictions for a much larger number of inputs, and it is not efficient in terms of run-time. A good example here would be a tree ensemble, maybe multi-layered: every time we predict Y for a new point, we spend a lot of clock cycles evaluating a huge number of trees that are highly similar to one another. A natural solution would be to learn an approximation of the model in a different, faster, probably more compact representation. In this sense we'd be fitting a supervised learning problem again, but instead of predicting Y from X, we'd be predicting the full model's Y-prediction from X (or possibly the outputs of individual sub-structures within the full model). So a few questions:

1) Does this approach make sense to begin with, or am I overlooking an even simpler solution, either in general or for tree ensembles in particular?

2) Are there any fundamental trade-offs between evaluation run-time and fidelity? That is, if I'm reducing model X to model Y and requiring the similarity/correlation between the two models' predictions to be above (1 - epsilon), is there a known lower bound on representation size or run-time? This would help me know what kind of speed-ups are achievable before I start throwing effort at the wall.

3) Is there any representation you'd suggest for the approximation model? Two issues complicate this: the features are both highly non-linear and deeply interacting, so some form of universal or near-universal representation is required. What jumps out at me is highly sparse (to reduce run-time), possibly multi-layer, neural networks.

4) For my dataset in particular, the features are captured in real time, so the feature values at one point tend to be very similar to the values at the previous point. I don't know if there's any way to capitalize on this to reduce processing time.

5) Are there any good supervised learning algorithms that particularly excel at low/zero noise (since the full model's predictions should be completely determined by the features) and very large datasets (theoretically infinite, since I can always generate random points and label them with the full model), but with potentially complicated structure?
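For concreteness, here is a minimal sketch of the setup in questions 1 and 2, assuming scikit-learn; every model choice and constant below is illustrative, not a recommendation. The teacher is a boosted tree ensemble, the student is a small net trained on teacher labels for fresh random points, and at the end we check both the speed-up and the teacher/student correlation:

```python
import time
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Stand-ins for the original (small) training set.
X_train = rng.normal(size=(2000, 10))
y_train = np.sin(X_train[:, 0]) * X_train[:, 1] + rng.normal(scale=0.1, size=2000)

teacher = GradientBoostingRegressor(n_estimators=500).fit(X_train, y_train)

# The teacher is deterministic, so we can label as many fresh random
# points as we like: a zero-noise, effectively infinite training set.
X_big = rng.normal(size=(50_000, 10))
student = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=200)
student.fit(X_big, teacher.predict(X_big))

# Fidelity and speed on held-out points.
X_test = rng.normal(size=(10_000, 10))
t0 = time.perf_counter(); y_t = teacher.predict(X_test); t1 = time.perf_counter()
y_s = student.predict(X_test); t2 = time.perf_counter()
print("teacher/student correlation:", np.corrcoef(y_t, y_s)[0, 1])
print(f"teacher {t1 - t0:.3f}s, student {t2 - t1:.3f}s")
```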
What you are suggesting sounds exactly like Bucila et al.'s Model Compression. They start with a tree ensemble trained on some labelled data, use that model to label a much larger quantity of unlabelled data, and then train a multilayer neural net on the original training data together with the newly labelled data. It sounds like an NN would also be a good match for you, since NNs are compact but highly expressive, and training on the much larger set of initially unlabelled, possibly synthetic, data helps with some of the weaknesses of NNs such as overfitting and local minima. They report success with the approach.

That's exactly what I'm looking for. Thanks!
(Jan 10 '13 at 21:42)
Douglas Colkitt
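To make the recipe above concrete, here is a minimal sketch of the Bucila-style setup, again assuming scikit-learn; the function name and parameters are illustrative (the paper also discusses generating synthetic unlabelled points when real ones are scarce):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def compress(teacher, X_train, y_train, X_unlabeled):
    """Train a compact net on the original labels plus teacher labels
    for a much larger unlabelled pool."""
    X_aug = np.vstack([X_train, X_unlabeled])
    y_aug = np.concatenate([y_train, teacher.predict(X_unlabeled)])
    return MLPRegressor(hidden_layer_sizes=(128,), max_iter=300).fit(X_aug, y_aug)
```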
Kavukcuoglu, Ranzato and LeCun have also applied this idea in an unsupervised learning context with PSD (Predictive Sparse Decomposition, http://arxiv.org/abs/1010.3467). What it comes down to is that they jointly train a sparse coding model and a feedforward neural network that predicts the sparse codes. I really like the idea of training both at the same time, so that the more complicated model that is going to be replaced (by a simpler one, such as a neural network, for faster inference/prediction) is also "aware" of the fact that it will be discarded and approximated afterwards. In the PSD paper they say that this drives the sparse coding model to learn representations that are easier for the neural network to predict. I haven't really thought about whether this would be applicable in a supervised context as well, but it seems like it could be worth investigating. Or probably someone has already done that :)
(Jan 12 '13 at 08:55)
Sander Dieleman
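For illustration only, here is a toy numpy sketch of the joint-training idea; this is not the algorithm from the PSD paper (in particular, the encoder here is a plain linear map rather than PSD's parameterization). It alternates sparse-inference steps on the codes with gradient steps on the dictionary and the encoder, so the codes stay both good reconstructions and easy to predict:

```python
# Joint objective (toy):  ||X - Z D^T||^2 + lam*|Z|_1 + alpha*||Z - X W^T||^2
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 20, 40                  # samples, input dim, code dim
X = rng.normal(size=(n, d))
D = 0.1 * rng.normal(size=(d, k))      # dictionary (decoder)
W = 0.1 * rng.normal(size=(k, d))      # feedforward encoder
lam, alpha, lr, step = 0.1, 1.0, 0.01, 0.05

def soft(v, t):                        # soft-thresholding: prox step for the L1 term
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

for epoch in range(50):
    Z = X @ W.T                        # warm-start the codes from the encoder
    for _ in range(20):                # ISTA on Z (constant factors folded into `step`)
        grad = (Z @ D.T - X) @ D + alpha * (Z - X @ W.T)
        Z = soft(Z - step * grad, step * lam)
    D -= lr * ((Z @ D.T - X).T @ Z) / n          # decoder follows the inferred codes
    W -= lr * alpha * ((X @ W.T - Z).T @ X) / n  # encoder learns to predict the codes
```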
This has also been done by Petrov et al. The idea there is to use an accurate but slow constituency parser to label lots of data and then train a more efficient dependency parser on the automatically labelled data. It is presented in the context of domain adaptation for parsing questions, but I think the idea is more generally applicable.
Personally, I feel NNs inspire a lot more belief than any other model, but I doubt you will actually get good performance with an NN on a very non-linear problem. In any case you might want to look at active learning approaches. I don't know the field, but I'm thinking of how one does adaptive numerical integration: you bisect the range; if your prediction is within tolerance you move on to the next region, otherwise you insert a point and bisect again.

The adaptive numerical integration is an interesting idea. But the problem, compared to single-variable integration, is that the feature space is high-dimensional, so it will require a lot of bisections before numerical stability is reached. In some sense this approach seems basically equivalent to growing a decision tree, which might not be a bad approach, with leaf divisions stopping once the points in a node are all within some epsilon of each other.
(Jan 12 '13 at 01:13)
Douglas Colkitt
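As a sketch of that last idea (hypothetical code: `model` stands for the slow full model, the region bounds `lo`/`hi` are numpy arrays, and all constants are placeholders), one can grow such a tree by probing each region with a handful of teacher-labelled points and splitting until they agree to within epsilon:

```python
import numpy as np

def grow(model, lo, hi, eps, rng, n_probe=64, depth=0, max_depth=12):
    """Recursively partition the box [lo, hi], querying the slow model for labels."""
    X = rng.uniform(lo, hi, size=(n_probe, len(lo)))   # probe points in this region
    y = model.predict(X)
    if y.max() - y.min() < eps or depth >= max_depth:  # predictions agree: make a leaf
        return {"leaf": float(y.mean())}
    dim = int(np.argmax(hi - lo))                      # bisect the widest dimension
    mid = (lo[dim] + hi[dim]) / 2.0
    left_hi, right_lo = hi.copy(), lo.copy()
    left_hi[dim] = right_lo[dim] = mid
    return {"dim": dim, "mid": mid,
            "left": grow(model, lo, left_hi, eps, rng, n_probe, depth + 1, max_depth),
            "right": grow(model, right_lo, hi, eps, rng, n_probe, depth + 1, max_depth)}
```

Evaluating the resulting tree is a single root-to-leaf walk per point, but, as noted above, the number of regions needed for a given epsilon can blow up quickly as the dimensionality grows.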