I'm new to scikit-learn and I'm trying to create a multinomial naive Bayes model to predict movies' box office gross. Below is just a toy example; I'm not sure whether it is logically correct (suggestions are welcome!).
E.g. input:
# G.I. Joe: Retaliation
{ "screens": 3719, "opening-gross": 40501814, "distributor": "Paramount",
"genre" : "Action", "budget" : 130000000 }
In my first approach, I split the continuous attributes into ranges. The Y values correspond to the estimated gross I'm trying to predict (e.g. 1: < $20 million, 2: > $20 million). I also discretized the number of screens on which the movie was shown.
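import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

def get_data():
    # Continuous attributes are already discretized into labels such as "few" or "some".
    measurements = [
        {'movie': 'Life of Pi', 'screens': "some", 'distributor': "fox" ....},
        {'movie': 'The Croods', 'screens': "some", 'distributor': "fox" ....},
        {'movie': 'Spring Breakers', 'screens': "few", 'distributor': "TriStar" ...},
    ]
    # One-hot encode the string-valued attributes.
    vec = DictVectorizer()
    arr = vec.fit_transform(measurements).toarray()
    return arr

def predict(X):
    # Gross class for each movie (1: < $20 million, 2: > $20 million).
    Y = np.array([1, 1, 2])
    clf = MultinomialNB()
    clf.fit(X, Y)
    # predict() expects a 2D array, so slice to keep the row dimension.
    print(clf.predict(X[2:3]))

if __name__ == "__main__":
    vector = get_data()
    predict(vector)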
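The discretization step described above could be done with np.digitize. This is only a minimal sketch: the $20 million cut-off comes from the example classes, while the screen-count edges, the "many" label, and the helper names are assumptions made up for illustration.

```python
import numpy as np

# The $20 million cut-off is from the post; everything else here is hypothetical.
GROSS_EDGES = np.array([20_000_000])

# Hypothetical screen-count edges: "few" < 1000 <= "some" < 3000 <= "many".
SCREEN_EDGES = np.array([1000, 3000])
SCREEN_LABELS = np.array(["few", "some", "many"])


def gross_to_class(gross):
    """Map gross figures to class 1 (below $20 million) or 2 (at or above)."""
    # np.digitize returns 0 for values below the first edge and 1 otherwise.
    return np.digitize(gross, GROSS_EDGES) + 1


def screens_to_label(screens):
    """Map screen counts to a coarse label by bin index."""
    return SCREEN_LABELS[np.digitize(screens, SCREEN_EDGES)]


classes = gross_to_class([5_000_000, 40_501_814])
labels = screens_to_label([600, 3719])
print(classes)
print(labels)
```

Keeping the edges in one place makes it easy to try different ranges without touching the rest of the code.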
Following a suggestion here, where it says:
Independently fit a gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform all the dataset by taking the class assignment probabilities (with predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)) and then refit a new model (e.g. a new gaussian NB) on the new features.
I'm now updating the model to deal with both continuous and categorical data. Can someone give me some extra help on how to accomplish that? It seems that naive Bayes is a good strategy for giving me ranges as answers, but should I use a linear classifier instead? Any hints are appreciated!
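As I understand the quoted suggestion, it could be sketched roughly as follows. The rows and labels are made up purely for illustration, and the log transform on the continuous columns is my own assumption, not part of the suggestion.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Made-up toy rows (not real box office figures).
categorical = [
    {"distributor": "fox", "genre": "adventure"},
    {"distributor": "fox", "genre": "animation"},
    {"distributor": "fox", "genre": "action"},
    {"distributor": "TriStar", "genre": "comedy"},
    {"distributor": "TriStar", "genre": "drama"},
    {"distributor": "Paramount", "genre": "comedy"},
]
# Columns: screens, budget (made-up values).
continuous = np.array([
    [2900, 120e6],
    [4000, 135e6],
    [3700, 130e6],
    [1100, 5e6],
    [600, 8e6],
    [900, 12e6],
])
y = np.array([2, 2, 2, 1, 1, 1])  # 1: < $20 million, 2: > $20 million

# Multinomial NB on the one-hot encoded categorical part.
X_cat = DictVectorizer(sparse=False).fit_transform(categorical)
mnb = MultinomialNB().fit(X_cat, y)

# Gaussian NB on the continuous part (log scale is an assumption).
X_cont = np.log10(continuous)
gnb = GaussianNB().fit(X_cont, y)

# Stack the class probabilities of both models as new features.
stacked = np.hstack((mnb.predict_proba(X_cat), gnb.predict_proba(X_cont)))

# Refit a new model on the stacked features.
final = GaussianNB().fit(stacked, y)
print(final.predict(stacked))
```

Note that the second-stage model here is fit on probabilities predicted for the same rows the first-stage models were trained on, which is optimistic; with real data, out-of-fold probabilities (e.g. from cross_val_predict with method="predict_proba") would probably be a safer choice.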
asked Apr 02 '13 at 22:04 by ksiomelo
Just try an SVM (LinearSVC or SGDClassifier); that's much easier. Be sure to scale your features using StandardScaler (and maybe a log transform) first, and skip the movie feature.
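A minimal sketch of what that could look like, assuming the records are dicts like the ones in the question. The training rows, labels, and the prepare helper are made up for illustration; only the G.I. Joe screens and budget figures come from the question.

```python
import math

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def prepare(record):
    """Hypothetical helper: drop the title and log-transform numeric fields."""
    out = {}
    for key, value in record.items():
        if key == "movie":
            continue  # the title is unique per row, so it carries no signal
        out[key] = math.log10(value) if isinstance(value, (int, float)) else value
    return out


# Made-up training rows (not real box office figures).
movies = [
    {"movie": "A", "screens": 2900, "budget": 120e6, "distributor": "fox", "genre": "Adventure"},
    {"movie": "B", "screens": 4000, "budget": 135e6, "distributor": "fox", "genre": "Animation"},
    {"movie": "C", "screens": 1100, "budget": 5e6, "distributor": "TriStar", "genre": "Comedy"},
    {"movie": "D", "screens": 600, "budget": 8e6, "distributor": "TriStar", "genre": "Drama"},
]
y = [2, 2, 1, 1]  # 1: < $20 million, 2: > $20 million

# DictVectorizer one-hot encodes strings and passes numbers through;
# StandardScaler then puts all columns on a comparable scale.
model = make_pipeline(
    DictVectorizer(sparse=False),
    StandardScaler(),
    LinearSVC(random_state=0),
)
model.fit([prepare(m) for m in movies], y)

new_movie = {"movie": "G.I. Joe: Retaliation", "screens": 3719, "budget": 130000000,
             "distributor": "Paramount", "genre": "Action"}
pred = model.predict([prepare(new_movie)])
print(pred)
```

Category values not seen during fitting (here the distributor and genre of the new movie) are silently ignored by DictVectorizer, so the prediction for that row rests on the numeric features alone.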