This is a follow-up to my previous question on training a classifier to predict the winner from card game logs. The dataset I am using has grown by a factor of 4, so I now have 1.3 million games. With my current feature set (~300 features), training a random forest of 500 trees on 100k game-state instances (I take only one game state at random per game) using R's randomForest takes about 18 hours and uses about 30% of the memory on a system with 16 GB of RAM. I can exploit my multicore machine by running 3 or 4 tree-building processes in parallel. So far I've trained 3 different sets of 500, 500, and 1500 trees on 100k batches, and the combined generalization error is still decreasing as I add more training data/trees.

I have two main ideas:

1) Sample the best data: use domain-specific heuristics to filter the training data to a reasonable size, focusing on keeping the best training data.

1a) Run an Elo-like rating system on players, and prefer to train on games where one or both players are good.

1b) Come up with some heuristic for measuring games in which players executed different strategies, and prefer games where players took divergent strategies (intuitively, these games contain much more real "information" than games with mirror strategies).

Efficient sampling: if I want a fair distribution of games by turn, then I have the problem of sampling from what is essentially a 3d space of player skill x strategy diversity x current turn number. I want to maximize the first two while getting an even distribution over current turns. Is there some nice algorithm for sampling that preserves the desired properties? Maybe I should just bucket by turn number and then repeatedly take non-dominated training instances from the remaining 2d player skill x strategy diversity set for a given turn bucket? (A rough sketch of that idea follows below.)

2) Optimize the ensemble: use extra training data to optimize the ensemble by pruning out uninformative trees. In addition to biased sampling, I am considering ensemble pruning to improve the forest. Are there any R packages that implement this?

Maybe I should just step back, stop trying to squeeze the most out of R's randomForest via clever preprocessing (biased sampling) or post-processing (ensemble pruning), and use a learning system that scales better?
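Here is what I mean by bucket-then-Pareto, as a rough R sketch. The `skill`, `diversity`, and `turn` columns are placeholders for per-game scores I would have to compute first, and peeling off successive non-dominated fronts until each turn bucket hits its quota is just one way to realize the idea:

```r
# Keep only non-dominated rows (maximizing both skill and diversity).
pareto_front <- function(df) {
  df <- df[order(-df$skill), ]
  # After sorting by skill (descending), a row is non-dominated iff its
  # diversity beats every higher-skill row seen so far.
  keep <- df$diversity > cummax(c(-Inf, head(df$diversity, -1)))
  df[keep, ]
}

sample_games <- function(games, per_turn = 1000) {
  do.call(rbind, lapply(split(games, games$turn), function(bucket) {
    picked <- bucket[0, ]
    # Peel off successive Pareto fronts until this turn bucket's quota is
    # met, so every turn contributes about the same number of instances.
    while (nrow(picked) < per_turn && nrow(bucket) > 0) {
      front <- pareto_front(bucket)
      picked <- rbind(picked, front)
      bucket <- bucket[!rownames(bucket) %in% rownames(front), ]
    }
    head(picked, per_turn)
  }))
}
```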
Did you ever try logistic regression as suggested in an answer to your previous question? I see no reason you couldn't train a binary logistic regression model on 1.3 million 300-dimensional training cases quite quickly using minibatched stochastic gradient descent. You should also be able to train a simple feedforward neural network on this dataset. These are easy things to try and easy to do efficiently, so you might as well see if they work well. And if you are willing to use a GPU, you should be able to complete a single training epoch on a dataset that size in a few minutes for even a very large neural net (for a 1.1-million-case dataset of 384-dimensional inputs, I can do a single epoch of training for a 384-1536-183 neural net in about 2 minutes using a GPU and minibatches of 128 cases). However, maybe you have already tried these things and have reason to believe a decision tree model will provide better performance. I can imagine that it might be appropriate for this task, although I don't know since I haven't tried it. Since you don't really have that much data, I would search around and see if there is a faster random forest implementation, or see if you can change the metaparameters to construct smaller models. Or just wait longer?

I agree. 1.3 million examples and 300 dimensions is not large. Worst case, use stochastic gradient descent and it should be trivial to learn on even a modest laptop. SGD is easy to use with LR.
(Jul 05 '11 at 21:26)
jrennie
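Concretely, the minibatched SGD suggestion above could look like this in plain R. The learning rate, batch size, and epoch count are illustrative defaults, not tuned values:

```r
# Minibatched SGD for binary logistic regression.
# X: n x d feature matrix; y: 0/1 outcome vector (did player 1 win?).
sgd_logistic <- function(X, y, epochs = 5, batch = 128, lr = 0.01) {
  w <- rep(0, ncol(X))
  b <- 0
  n <- nrow(X)
  for (e in seq_len(epochs)) {
    idx <- sample(n)  # reshuffle the data every epoch
    for (start in seq(1, n, by = batch)) {
      rows <- idx[start:min(start + batch - 1, n)]
      Xb <- X[rows, , drop = FALSE]
      p  <- 1 / (1 + exp(-(Xb %*% w + b)))  # predicted win probability
      g  <- as.vector(p) - y[rows]          # gradient of log loss wrt the logits
      w  <- w - lr * as.vector(crossprod(Xb, g)) / length(rows)
      b  <- b - lr * mean(g)
    }
  }
  list(weights = w, bias = b)
}
```

A fitted model then gives win probabilities for new game states via `1 / (1 + exp(-(Xnew %*% fit$weights + fit$bias)))`, which also matches the regression framing mentioned in the comments below.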
I had a somewhat similar problem a while back. The process for generating the sample data itself is what needed parallelizing. We put together a pretty basic Gentoo image for the 8-core EC2 machines (at the time they averaged between 15-25 cents USD/hr if you bought time on the spot market). I could spin up 50 or so instances and have the whole thing operational in very short order (usually under 20 minutes). Furthermore, we were able to use ClusterSSH to connect to and administer all the boxes simultaneously. This, combined with a simple web server we set up with some CGI to divvy out tests to execute (somewhat along the lines of how MapReduce works), made it very easy for us to churn through a million or more tests in about 1-2 hours. The rest just amounts to getting your code to talk to a remote server to get a piece of work, etc. The two of us got the majority of the EC2 work done in one night (about 6-8 hours).
Why don't you try Apache Mahout? It's an open-source library of machine-learning algorithms, most of which have MapReduce implementations, including random forests. You could also try Matlab/Scilab or other software; AFAIK Matlab can be installed on several computers and parallelize across them. CUDA is another option.
(Jun 27 '11 at 19:45)
Robert Layton
CUDA doesn't solve the memory problem, though. Even a Tesla only has 4 GB of memory.
(Jun 29 '11 at 20:07)
Brian Vandenberg
@Brian Vandenberg There is no memory problem with 1.3 million rows and 300 columns. That's around 6 gigabytes of CSV, and with row sampling it could be 2 gigabytes for each tree.
(Feb 10 '14 at 00:51)
Jeremiah M
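For what it's worth, the back-of-envelope behind those figures (the ~15 bytes per CSV value is an assumption, not a measurement):

```r
rows <- 1.3e6; cols <- 300
rows * cols * 15 / 1e9  # ~5.9 GB as CSV text, at roughly 15 bytes per value
rows * cols * 8  / 1e9  # ~3.1 GB in RAM as a dense matrix of doubles
```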
1.3 million rows and 300 columns of data is not much for a random forest. I am training a forest on a dataset that is the same size as yours, and I can do 300 trees in about 1 hour. You need to build your own efficient random forest library in an object-oriented language: one that is multi-threaded, uses a running variance to estimate optimal split points (sketched below), and uses other such tricks. randomForest() from R is widely known to be too slow; it copies the original matrix multiple times in RAM and has other hideous behavior under the hood. The reason I built my own library is that I called randomForest(), waited 24 hours, and it was still running! I sample about 350,000 rows for each tree.
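Roughly, the running-variance trick looks like this (an illustrative R sketch, not the library itself). Maintaining running sums of y and y² makes each side's sum of squared errors available in O(1) per candidate threshold, so scoring every split point of a feature costs one sort plus one linear pass:

```r
# Score all split points of one numeric feature for a regression tree.
best_split <- function(x, y) {
  ord <- order(x)
  x <- x[ord]; y <- y[ord]
  n <- length(y)
  total_sum <- sum(y); total_sq <- sum(y^2)
  left_sum <- 0; left_sq <- 0
  best <- list(score = Inf, threshold = NA)
  for (i in seq_len(n - 1)) {
    left_sum <- left_sum + y[i]   # running sums do the heavy lifting
    left_sq  <- left_sq  + y[i]^2
    if (x[i] == x[i + 1]) next    # no threshold between equal feature values
    # per-side sum of squared errors: sum(y^2) - (sum(y))^2 / n
    sse_left  <- left_sq - left_sum^2 / i
    sse_right <- (total_sq - left_sq) - (total_sum - left_sum)^2 / (n - i)
    score <- sse_left + sse_right # lower total SSE = better split
    if (score < best$score) {
      best <- list(score = score, threshold = (x[i] + x[i + 1]) / 2)
    }
  }
  best
}
```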
I want to make sure I understand: you basically have a classification problem with 1.3 million training examples of dimension 300?
Well, perhaps regression is more appropriate: given a game state, estimate the probability of a given player winning from that state. But the dimensions are correct. The 300 features are very sparse; probably 90% of them are 0 in a given instance.
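As an aside on the sparsity point: with ~90% zeros, a sparse representation shrinks the in-memory footprint several-fold. A toy sketch using R's Matrix package (note that randomForest itself wants a dense matrix, so this mainly helps the logistic-regression/SGD route discussed above):

```r
library(Matrix)
# Toy data at ~10% density, standing in for the real 1.3M x 300 feature matrix.
X_dense  <- matrix(rbinom(1000 * 300, 1, 0.1) * rnorm(1000 * 300), 1000, 300)
X_sparse <- Matrix(X_dense, sparse = TRUE)  # dgCMatrix: stores nonzeros only
object.size(X_dense)   # ~2.4 MB dense
object.size(X_sparse)  # a small fraction of that at 10% density
```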
Can you explain what your real goal is here? Do you want to predict who will win a game assuming strong play, or do you want the model to just guess who will win assuming bad players continue to play badly? If you want to predict who is at an actual advantage in the game, running an Elo-like system to filter your data is a perfectly good idea.
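For reference, a minimal version of that Elo-like pass over the game log could look like the sketch below; the `winner`/`loser` schema, the K factor of 32, and the 1500 starting rating are conventional placeholders, not tuned choices:

```r
# One Elo pass: rate every player, so games can later be filtered by rating.
elo_ratings <- function(games, K = 32) {
  players <- unique(c(games$winner, games$loser))
  r <- setNames(rep(1500, length(players)), as.character(players))
  for (i in seq_len(nrow(games))) {
    w <- as.character(games$winner[i]); l <- as.character(games$loser[i])
    e_w <- 1 / (1 + 10^((r[l] - r[w]) / 400))  # expected score of the winner
    r[w] <- r[w] + K * (1 - e_w)               # winner gains what the loser sheds
    r[l] <- r[l] - K * (1 - e_w)
  }
  r
}
```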