I'm new to scikit-learn and I'm trying to create a Multinomial Naive Bayes model to predict movies' box office results. Below is just a toy example; I'm not sure whether it is logically correct (suggestions are welcome!).

E.g. input:

# G.I. Joe: Retaliation
{ "screens": 3719, "opening-gross": 40501814, "distributor": "Paramount",
"genre" : "Action", "budget" : 130000000 }

In my first approach, I split the continuous attributes into ranges. The Y values correspond to the estimated gross I'm trying to predict (e.g. 1: < $20 million, 2: > $20 million). I also discretized the number of screens on which the movie was shown.
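For illustration, the binning described above could be sketched with np.digitize. The $20 million cut-off comes from the question; the screen thresholds (1000 and 3000) and the "many" label are assumptions made up for this example.

```python
import numpy as np

# Gross classes from the question: 1 means < $20 million, 2 means > $20 million.
gross = np.array([5_000_000, 40_501_814])
gross_class = np.digitize(gross, bins=[20_000_000]) + 1

# Hypothetical screen thresholds; the real cut-offs are not given in the question.
screen_labels = np.array(["few", "some", "many"])
screens = np.array([500, 2000, 3719])
screen_bucket = screen_labels[np.digitize(screens, bins=[1000, 3000])]

print(gross_class)
print(screen_bucket)
```

np.digitize returns the index of the bin each value falls into, so adding 1 maps the gross values onto the 1/2 labels used for Y.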

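import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

def get_data():
    # Each movie is a dict of already-discretized attributes.
    measurements = [
        {'movie': 'Life of Pi', 'screens': "some", 'distributor': "fox" ....},
        {'movie': 'The Croods', 'screens': "some", 'distributor': "fox" ....},
        {'movie': 'Spring Breakers', 'screens': "few", 'distributor': "TriStar" ...},
    ]
    # One-hot encode the string-valued attributes into a dense array.
    vec = DictVectorizer()
    arr = vec.fit_transform(measurements).toarray()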

    return arr

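def predict(X):
    # Class labels, one per movie in X.
    Y = np.array([1, 1, 2])
    clf = MultinomialNB()
    clf.fit(X, Y)
    # predict expects a 2D array, so slice to keep the row dimension.
    print(clf.predict(X[2:3]))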

if __name__ == "__main__":
    vector = get_data()
    predict(vector)

Following a suggestion here, which says:

Independently fit a gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform all the dataset by taking the class assignment probabilities (with predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)) and then refit a new model (e.g. a new gaussian NB) on the new features.
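A minimal sketch of how that suggestion might look in code. All the values below are made up for illustration, and the choice of columns (screens and budget as the continuous part, distributor and genre as the categorical part) is an assumption, not something the suggestion prescribes.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Hypothetical toy data, invented for this sketch.
categorical = [
    {"distributor": "fox", "genre": "drama"},
    {"distributor": "fox", "genre": "animation"},
    {"distributor": "Paramount", "genre": "action"},
    {"distributor": "TriStar", "genre": "comedy"},
    {"distributor": "TriStar", "genre": "drama"},
    {"distributor": "indie", "genre": "comedy"},
]
# Continuous part: number of screens, budget in millions of dollars.
continuous = np.array([
    [2900.0, 120.0],
    [4000.0, 135.0],
    [3700.0, 130.0],
    [1100.0, 5.0],
    [800.0, 8.0],
    [300.0, 2.0],
])
y = np.array([2, 2, 2, 1, 1, 1])  # 1: gross < $20 million, 2: gross > $20 million

# Step 1: fit one model per feature type, independently.
X_cat = DictVectorizer().fit_transform(categorical)  # one-hot counts
multinomial = MultinomialNB().fit(X_cat, y)
gaussian = GaussianNB().fit(continuous, y)

# Step 2: use the class probabilities of both models as new features.
multinomial_probas = multinomial.predict_proba(X_cat)
gaussian_probas = gaussian.predict_proba(continuous)
stacked = np.hstack((multinomial_probas, gaussian_probas))

# Step 3: refit a new model on the stacked probabilities.
final = GaussianNB().fit(stacked, y)
pred = final.predict(stacked)
print(pred)
```

This sketch computes the probabilities on the same rows the first-stage models were trained on, which is optimistic; on real data it would probably be safer to obtain them from held-out folds (for example with cross_val_predict and method="predict_proba").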

I'm now updating the model to deal with both continuous and categorical data. Can someone give me some extra help on how to accomplish that? It seems that Naive Bayes is a good strategy for giving me ranges as answers, but should I use a linear classifier instead? Any hints are appreciated!

asked Apr 02 '13 at 22:04

ksiomelo

Just try an SVM (LinearSVC or SGDClassifier); that's much easier. Be sure to scale your features using StandardScaler (and maybe apply a log transform) first, and skip the movie feature.

(Apr 03 '13 at 07:13) larsmans
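
A rough sketch of what this comment suggests, assuming screens and budget are the numeric features and distributor and genre are one-hot encoded. Only the first row uses values from the question; the other rows and all labels are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Numeric features: screens, budget. Row 0 is from the question, the rest are made up.
numeric = np.array([
    [3719, 130_000_000],
    [2900, 120_000_000],
    [1100, 5_000_000],
    [300, 2_000_000],
], dtype=float)
# Categorical features; the movie title is deliberately left out.
categorical = [
    {"distributor": "Paramount", "genre": "Action"},
    {"distributor": "fox", "genre": "Drama"},
    {"distributor": "TriStar", "genre": "Comedy"},
    {"distributor": "TriStar", "genre": "Drama"},
]
y = np.array([2, 2, 1, 1])  # hypothetical labels, 1: < $20 million, 2: > $20 million

# Log transform first, then scale to zero mean and unit variance.
X_num = StandardScaler().fit_transform(np.log1p(numeric))
# One-hot encode the string-valued attributes.
X_cat = DictVectorizer(sparse=False).fit_transform(categorical)
X = np.hstack((X_num, X_cat))

clf = LinearSVC(random_state=0).fit(X, y)
preds = clf.predict(X)
print(preds)
```

On real data the scaler and vectorizer would be fitted on the training split only and then applied to the test split with transform.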

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.