This is a somewhat conceptual question, though within the sklearn framework. I'm not very experienced with ML, so I'm hoping I'm overlooking something obvious. I am trying to implement a variant of an ensemble approach within sklearn and don't think it's supported by any of the built-in ensemble methods. Before I hack something together, I want to make sure there isn't something more elegant that I'm overlooking.

BACKGROUND
I'm making predictions on a set of financial outcomes that tend to be quite linear and are a good match for linear regression models with a few key features. My overall dataset is reasonably large (1M+ observations), so I can get stable coefficient values that work well out of sample. While I'm open to exploring other, more exotic supervised ML algorithms (forests, etc.), this particular problem fits well with LinearRegression, so I'd prefer to keep it simple. Within my full dataset there are 500+ distinct items (so approximately 2k observations per each of the 500 items). When I build a separate linear model for each of these 500 items, I get improved in-sample results but huge problems with overfitting and out-of-sample performance.

OBJECTIVE

What I'd like to do is try hybrid approaches where I can blend the outcomes of the generic model (fitted to the 1M rows of all items) and the item-specific model (fitted to the ~2k rows for that item) to get a better balance of predictive power and minimized curve fitting. Ideally, I would be able to use a composite linear regression model (or other ML approach) that assigns weights to the predictions made by the general model and the item-specific model; a rough sketch of what I mean follows below. My belief is that some items within the set behave predictably differently from the set in aggregate, while others are well described by the aggregate.
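For concreteness, here's a minimal sketch of the kind of two-stage blend I have in mind. All names here (df, item_id, the feature columns) are hypothetical, and as written the stage-2 blender is fit on the same rows as the stage-1 models, so in practice it would need out-of-fold predictions to avoid leakage:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["x1", "x2", "x3"]  # hypothetical feature column names

def fit_blended(df):
    # Stage 1a: one generic model fitted on all ~1M rows.
    general = LinearRegression().fit(df[FEATURES], df["y"])

    # Stage 1b: one model per item, fitted on that item's ~2k rows.
    per_item = {
        item: LinearRegression().fit(g[FEATURES], g["y"])
        for item, g in df.groupby("item_id")
    }

    # Stage 2: per item, learn weights on the two prediction streams
    # with another LinearRegression (a simple two-feature stack).
    # NOTE: for a real fit these should be out-of-fold predictions.
    blenders = {}
    for item, g in df.groupby("item_id"):
        stacked = np.column_stack([
            general.predict(g[FEATURES]),
            per_item[item].predict(g[FEATURES]),
        ])
        blenders[item] = LinearRegression().fit(stacked, g["y"])
    return general, per_item, blenders

def predict_blended(models, df):
    general, per_item, blenders = models
    out = pd.Series(index=df.index, dtype=float)
    for item, g in df.groupby("item_id"):
        # Unseen items would need a fallback to the general model;
        # omitted here to keep the sketch short.
        stacked = np.column_stack([
            general.predict(g[FEATURES]),
            per_item[item].predict(g[FEATURES]),
        ])
        out.loc[g.index] = blenders[item].predict(stacked)
    return out
```

The learned stage-2 weights would tell me, per item, how much the item-specific model should pull the prediction away from the aggregate model, which is exactly the balance I'm after.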
QUESTION

Is there a standard method of creating this sort of ensemble of prediction models, other than "ensemble methods" like random forests, etc.? If there is, how would I generally implement it in sklearn? I looked at FeatureUnion and pipelines, but they only seem to work for transformers, not predictors. Is there a standard approach to nesting linear models inside of a tree? I found this Stack Exchange question (http://stats.stackexchange.com/questions/82503/can-random-forest-methodology-be-applied-to-linear-regressions) and this link (http://labs.genetics.ucla.edu/horvath/RGLM/TalkRGLM.pdf), which are along these lines, though they differ in that they: (1) use a large ensemble of randomized/bootstrapped GLMs and (2) are based on an R package. If there isn't a standard method, is there a strong reason why I shouldn't go down this road? Thanks in advance!