I'd really like to know whether it's reasonable to use some of the existing packages (particularly in Python, like scikit-learn) to take a large dataset, determine which features have the largest impact, and then output a set of factors that can be used outside of the package. So many current machine learning examples (and papers) just show success using the estimator itself and don't show what the actual output could be. Granted, for something like hidden Markov models or SVMs it's not so simple to show, but I'm thinking that linear / logistic regression could work for this.

Right now we make predictions based on ratings data using a relatively simple "model": we turn the rating into a percentage likelihood based on a set of pre-existing factors (so, for example, if I rate a "1" on a 1-to-10 scale, that could be equivalent to a 5% likelihood of some outcome). We have one binary attribute that decides which of two sets of factors to use. So something like this table (where the first column is the type of item, the second column is the rating, and the body is the appropriate factor, based on the value of X) is what we use now.
We now have a dataset of about 100K ratings from 5-7K people, along with outcomes and about 7 categorical variables. We have a simple way to calculate an error value as well. Is it reasonable to think that I could end up with a set of tables like the following that could then be applied (given that I have three attributes: X, Y, and a "type" of A or B)?
And a second table for Attribute Y
You would then multiply the factors together to get the prediction for one element, so (Type=A, X=2, Y=1) results in 0.18 as the prediction. This is what I'd like to get out of it; is that a reasonable goal? I hope this doesn't seem overly broad--I'm just looking for some initial validation that this is possible before I start working with these different packages to find something that works.
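To make the current scheme concrete, here is a minimal sketch of the table-lookup-and-multiply approach described above. The factor values are invented for illustration; only the (Type=A, X=2, Y=1) -> 0.18 combination comes from the example in the question.

```python
# Hypothetical factor tables, one per attribute, keyed by (type, rating).
# All numbers except the worked example are made up.
FACTORS_X = {("A", 2): 0.6, ("B", 2): 0.4}
FACTORS_Y = {("A", 1): 0.3, ("B", 1): 0.5}

def predict(item_type, x_rating, y_rating):
    """Multiply the per-attribute factors to get a likelihood."""
    return FACTORS_X[(item_type, x_rating)] * FACTORS_Y[(item_type, y_rating)]

print(predict("A", 2, 1))  # 0.6 * 0.3 = 0.18
```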
I hope this doesn't come off as snide, but it seems like you might be a little confused about the types of problems machine learning actually solves. I say this because what you've described sounds like a standard supervised regression problem, and I'm not sure why you'd want to build a model just to tear it apart and try to reassemble the pieces.

In this scenario you have a bunch of (input, output) pairs and you'd like to model the relationship between them, i.e., using your example, input = (Type=A, X=2, Y=1), output = 0.18. Now, if you think the relationship between your inputs and outputs is approximately linear, use linear regression. If not, you may want to consider polynomial regression, neural networks, or something else that can model nonlinear relationships. If, on the other hand, your outputs are actually probabilities, you should consider logistic regression (binary) or multinomial regression (categorical), and so on and so forth. Either way, whatever method you end up using is the model: you fit its parameters on your data, and when you feed it new data, it gives you the predictions. Hope this helps clear some things up.

Hey, thanks for answering--I'm definitely confused and not wedded to any particular buzzword, just want to solve the problem in a better way (and, in particular, figure out which features are really important, etc.). So what's the dividing line between "machine learning" and linear regression? For example, scikit-learn has a whole section of linear models as well as tools to handle feature reduction, etc.: http://scikit-learn.org/stable/
(Apr 26 '13 at 19:03)
Joe Reolan
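To illustrate the answer's point that the fitted regression is the model and its parameters can be read out directly, here is a sketch using scikit-learn's LogisticRegression. The data here is synthetic (the features, sample size, and data-generating process are all invented stand-ins for the real ratings data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: each row is (type, X rating, Y rating),
# the target is a binary outcome. Values are invented for illustration.
n = 1000
item_type = rng.integers(0, 2, n)   # 0 = "A", 1 = "B"
x = rng.integers(1, 11, n)          # ratings on a 1-10 scale
y = rng.integers(1, 11, n)
X = np.column_stack([item_type, x, y])
# Outcome probability loosely tied to the X rating so the fit is non-trivial.
outcome = (rng.random(n) < x / 12).astype(int)

model = LogisticRegression().fit(X, outcome)

# The fitted parameters ARE the model; they can be inspected and exported.
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)

# Predicted probability of the outcome for a new item, e.g. (Type=A, X=2, Y=1):
print(model.predict_proba([[0, 2, 1]])[:, 1])
```

The coefficients and intercept are exactly the "set of factors that can be used outside of the package" the question asks about: for logistic regression the prediction is just the logistic function applied to a weighted sum of the inputs.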
There isn't a clear dividing line. Linear models are frequently employed in ML. There is a lot of overlap between Machine Learning and other fields like statistics, optimization, etc. If you are interested in figuring out which features are relevant for a specific task, the keyword you probably want is "feature selection."
(Apr 26 '13 at 19:36)
alto
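As a sketch of what the "feature selection" keyword looks like in practice with scikit-learn, here is one of several available approaches (SelectKBest with the ANOVA F-test) run on synthetic data in which only the first two of five features actually drive the outcome:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)

# Synthetic data: 5 features, but only columns 0 and 1 influence the label.
n = 500
X = rng.random((n, 5))
y = (X[:, 0] + 2 * X[:, 1] + 0.1 * rng.standard_normal(n) > 1.4).astype(int)

# Score each feature against the label and keep the top 2.
selector = SelectKBest(f_classif, k=2).fit(X, y)
print("score per feature:", selector.scores_)
print("selected columns:", selector.get_support(indices=True))
```

Higher scores indicate features more strongly associated with the outcome, which directly addresses the "which features have the largest impact" part of the question.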
@alto Thanks again, that's really helpful. So I guess the better question is deciding which of multiple linear models to use, etc. At least now I have a better starting point for investigating. Cheers!
(Apr 26 '13 at 19:38)
Joe Reolan