Consider the case where you have a dataset with fairly low dimensionality and a very large number of data points, the kind of problem where in most supervised learning approaches you'd be more concerned with minimizing bias than variance. In this problem, however, the relatively low dimensionality is somewhat deceptive, because the variables have very complex, high-dimensional interactions.

Here's an example of what I'm talking about. Suppose you were looking at randomly sampled positions from chess games between evenly matched players. The data you have is which pieces each player has lost, and you want to predict whether the game will ultimately end in a win, loss, or draw. Your feature space is 30 0/1 variables: 8 pawns, 2 rooks, 2 knights, 2 bishops, 1 queen (no king, because if he's missing the game is over), multiplied by 2 sides. Assume you have access to hundreds of millions of independent data points (you can always run AI-vs-AI games and sample new positions).

The idiosyncrasies of the interactions here run far deeper than standard linear/quadratic variable interactions. For example:

- Early in the game, when a lot of material is on the board, knights tend to be better than bishops; when there isn't much material left, bishops tend to be better.
- If a player is missing a few pawns on files far away from each other, that's not as bad as a cluster of missing pawns, which opens up a hole in his defense.
- At the end of the game it doesn't matter much whether the other side has one bishop or two, but the difference between having one bishop and no bishop can be the difference between a guaranteed draw and a win.
- A queen is about equal to two rooks in the early game, but becomes much less powerful late in the game.
- The difference between one side having 1 pawn and the other side having no pawns is far greater than the difference between having 7 pawns and 8 (because of the chance to promote the pawn to another piece).

As you can see, there are endless variations in how the variables can interact. Many of the interactions run along an early-game/late-game dimension, i.e. low-material vs. high-material interactions, which basically involve some metric derived from the underlying 30 variables: the more pieces that have been captured off the board, the "later" the game is, so a rough approximation might just be the sum of the 30 0/1 variables (a small sketch of this encoding and the derived metric is given at the end of this question). This makes the problem very challenging, because it's almost as if you need a combined supervised-unsupervised learning process that can discover interesting projections of the variables to interact on (i.e. discover the early/late-game metric and start interacting other variables on it), and then plug those into a supervised method.

This brings me to the various approaches I'm considering:

- Boosted trees: The problem is that the underlying weak learners don't have any chance to capture the interesting structure in the feature space. For trees to work they would need many nodes to discover the complex interactions, and boosting tends not to do well with deep trees.
- Random forests: This has a better chance of working, but I would have to use very deep trees to capture the interactions, and even though random forests do pretty well with deep trees, I've never heard of trees 20+ levels deep, which is what you would need to discover some of the early/late-game interactions (at least if you were splitting nodes on single variables).
- SVMs: This seems somewhat appealing. I'm not very experienced with SVMs (I have much more experience with trees), but from what I understand they have a good reputation for capturing non-parametric interactions like these. My only concern is that a Gaussian kernel wouldn't work here; since I'm inexperienced, any input on this in particular would be greatly appreciated.
- Logistic regression: This one I have to throw out right away because the problem is so non-linear; I would have to include on the order of 2^30 interaction terms (one for every possible combination of missing pieces).
- Neural nets: I'm always somewhat skeptical of these, but the universal function approximation property does seem to be what I'm looking for here. The network would need many layers, though, and even though the data set is large, overfitting is always a problem with ANNs in my opinion. Maybe bagging many-layered ANNs would be a good approach, but I've never heard of anyone doing this, so I'm not sure whether there are good arguments against it.
- Manifold learning: This is the most "far out" approach, but it seems like the only one with the direct capability of recovering what I want, i.e. the low-material/high-material variable projection. Of course, I don't know exactly how to translate an unsupervised method into a supervised problem...

Anyway, this problem has me tied up in knots, so any interesting approaches or good advice would be greatly appreciated. Thanks.
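To make the setup above concrete, here is a minimal sketch of the 30-dimensional 0/1 encoding and the rough "game phase" metric (sum of captured pieces) described in the question. The names (`material_vector`, `game_phase`) and the specific index layout are illustrative assumptions, not anything the question specifies.

```python
# Hypothetical sketch of the feature encoding described in the question.
# Each sample is a 30-dimensional 0/1 vector: for each side,
# 8 pawns, 2 rooks, 2 knights, 2 bishops, 1 queen (kings excluded),
# where 1 means the piece has been captured (is off the board).

import numpy as np

PIECES_PER_SIDE = 8 + 2 + 2 + 2 + 1  # = 15, so 30 features for both sides

def game_phase(material_vector):
    """Rough 'early vs. late game' metric: the more pieces captured,
    the later the game.  material_vector is a length-30 array of 0/1,
    so 0 = full board, 30 = only the kings left."""
    return int(np.sum(material_vector))

# Example: White is missing two pawns and a knight, Black is missing a rook
# (the indices chosen here are arbitrary, assumed positions in the layout).
x = np.zeros(2 * PIECES_PER_SIDE, dtype=int)
x[[0, 1, 10]] = 1              # White's missing pieces
x[PIECES_PER_SIDE + 8] = 1     # one of Black's rooks
print(game_phase(x))           # -> 4, i.e. still fairly early in the game
```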
If the input representation doesn't encode the positions of the pieces, you will be very limited in what your algorithm can learn. Many chess positions exist with the same material on each side but radically different game-theoretic values. Without the positional information, though, I don't think the interactions are nearly as complicated as you might think: I bet you would have trouble beating logistic regression, or a hash table with 2^30 entries holding a quantized average result.

But suppose you add information about the actual locations of the pieces. Then you have a nearly hopeless problem. The best approach is to create highly informative expert features and use a linear model or a shallow neural net. If you want to use the function as a static evaluator inside a chess-playing AI, it will need to be very fast anyway.

What is your end goal with the trained model? What is the actual problem you are trying to solve? Does it have anything to do with chess? The more information you can give about what you actually want, the more helpful the answers people post will be. If you add this information, I will try to edit my answer to respond to it.
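For illustration only, here is a minimal sketch of the look-up-table baseline mentioned in this answer: pack the 30 binary material features into a single integer key and keep a running average of observed results per key. The function names and the +1/0/-1 result coding are assumptions made for this sketch, not part of the original answer.

```python
# Sketch of the hash-table baseline: key each position by its 30-bit
# material signature and store a running average of the observed results.
# Result coding assumed: +1 win, 0 draw, -1 loss (from one side's view).

from collections import defaultdict

def material_key(bits):
    """Pack a length-30 0/1 sequence into a single integer in [0, 2**30)."""
    key = 0
    for i, b in enumerate(bits):
        key |= (b & 1) << i
    return key

# For each key, keep [sum of results, count].
table = defaultdict(lambda: [0.0, 0])

def update(bits, result):
    entry = table[material_key(bits)]
    entry[0] += result
    entry[1] += 1

def predict(bits):
    """Average outcome observed for this exact material balance (0.0 if unseen)."""
    s, n = table.get(material_key(bits), (0.0, 0))
    return s / n if n else 0.0
```

With hundreds of millions of samples, many of the 2^30 possible material signatures will be observed often enough for these per-key averages to be a surprisingly strong reference point for any learned model.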
Many things could work well for this problem, and many other things couldn't. SVMs are not particularly promising: the structure of the more popular kernels means you need a really thorough sampling of whatever manifold your (features X labels) space lies on to generalize well when there is complex underlying structure.

A family of approaches that has been somewhat successful in similar scenarios recently is deep learning / unsupervised feature learning. Decision trees are also known to be competitive in some such settings and are possibly easier to train. Many of these approaches involve a long period of unsupervised pretraining, where all sorts of higher-order features (such as your early-game vs. late-game scenario) are discovered, followed by a supervised fine-tuning phase where a classifier is learned on top of those features (which may or may not involve adjusting the features themselves); a minimal sketch of this pretrain-then-fine-tune recipe is given after the comment below.

Apart from this, your list seems fairly complete. I do strongly encourage you, however, to try some feature engineering (explicitly encoding this early/late-game feature, for example) before investing too much time implementing, training, and testing fairly complex learning-heavy algorithms, since in my experience most of the performance usually comes from good representations of your inputs (i.e. features).

Thanks a lot for the advice. Deep learning sounds pretty applicable here, and I'm only barely familiar with it; it sounds like a very promising avenue. As for feature engineering, I have been doing that heavily. One of the problems is that my derived features still leave me with a higher error rate than what I know is achievable based on our competitors (this chess example isn't my exact problem, but there are very close analogies). So I know there are better derived features; I just don't know where mine are falling short. I was hoping to try a black-box approach on the underived features and compare its predictions to my feature-engineered predictions, to figure out in what circumstances my derived features are deficient.
(Apr 09 '12 at 19:37)
Douglas Colkitt
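To make the pretrain-then-fine-tune idea in the answer above concrete, here is a minimal sketch: a small autoencoder is first trained on the 30-dimensional material vectors without labels, and a classifier is then trained on top of the learned code. PyTorch, the layer sizes, the plain-autoencoder choice, and the placeholder data are all assumptions made for this illustration; the answer does not prescribe any particular architecture.

```python
# Illustrative sketch: unsupervised pretraining + supervised fine-tuning
# on 30-dimensional material vectors.  All sizes and hyperparameters are
# arbitrary assumptions for this example.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(30, 16), nn.ReLU(), nn.Linear(16, 8), nn.ReLU())
decoder = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 30))
classifier = nn.Linear(8, 3)  # win / draw / loss

X = torch.rand(4096, 30).round()      # placeholder for real material vectors
y = torch.randint(0, 3, (4096,))      # placeholder for real outcomes

# 1) Unsupervised pretraining: learn a compressed code for material patterns
#    (this is where something like an "early vs. late game" direction can emerge).
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(X)), X)
    loss.backward()
    opt.step()

# 2) Supervised fine-tuning: train a classifier on top of the learned code,
#    optionally continuing to adjust the encoder itself.
opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
for _ in range(50):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(classifier(encoder(X)), y)
    loss.backward()
    opt.step()
```

The hope, under these assumptions, is that the low-dimensional code picks up directions such as total material remaining, i.e. roughly the early/late-game axis the question describes, which the classifier can then interact with the individual piece variables.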