Hi, I'm new here, so please bear with me. I am looking for a simple example of how to run a multinomial Naive Bayes classifier. I came across this example on StackOverflow:

http://stackoverflow.com/questions/10098533/implementing-bag-of-words-naive-bayes-classifier-in-nltk

import numpy as np
from nltk.probability import FreqDist
from nltk.classify import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

from nltk.corpus import movie_reviews
pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')]
neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')]
add_label = lambda lst, lab: [(x, lab) for x in lst]
#Original code from thread:
#classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))
classif.train(add_label(pos, 'pos') + add_label(neg, 'neg'))#Made changes here

#Original code from thread:    
#l_pos = np.array(classif.batch_classify(pos[100:]))
#l_neg = np.array(classif.batch_classify(neg[100:]))
l_pos = np.array(classif.batch_classify(pos))#Made changes here
l_neg = np.array(classif.batch_classify(neg))#Made changes here
print "Confusion matrix:\n%d\t%d\n%d\t%d" % (
          (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
          (l_neg == 'pos').sum(), (l_neg == 'neg').sum())

I received this warning after running the example:

C:\Python27\lib\site-packages\scikit_learn-0.13.1-py2.7-win32.egg\sklearn\feature_selection\univariate_selection.py:327: 
UserWarning: Duplicate scores. Result may depend on feature ordering. There are probably duplicate features,
or you used a classification score for a regression task.
warn("Duplicate scores. Result may depend on feature ordering."

Confusion matrix:
876 124
63  937

So, my questions are:

  1. Can anyone tell me what this warning message means?
  2. I made some changes to the original code, so why are the confusion-matrix results so much higher than those in the original thread?
  3. I have about 5,000 documents. How can I split them 75%:25% for train:test purposes?
  4. How can I test the accuracy of this classifier?
  5. I modified the example above to test it on my own dataset. I noticed that if I increase k in the chi2 step from k=1000 to k=2000 or higher, it yields higher results. So what would be the ideal value for k?
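For questions 3 and 4, here is a minimal sketch on a synthetic stand-in corpus (not the movie-reviews data above). It assumes a modern scikit-learn, where `train_test_split` lives in `sklearn.model_selection`; in the 0.13 release shown in the warning it was under `sklearn.cross_validation`:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the ~5,000 documents; MultinomialNB
# expects non-negative (count-like) features, hence abs().
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X = abs(X)

# 75% train / 25% test split (question 3).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = MultinomialNB().fit(X_train, y_train)

# Held-out accuracy (question 4): score on the 25% the model never saw.
acc = accuracy_score(y_test, clf.predict(X_test))
print(len(X_train), len(X_test), round(acc, 3))
```

The key point is that accuracy should only ever be reported on the held-out 25%, never on the documents used for training.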

Thanks!

asked Jul 04 '13 at 09:39

cryssie80

One Answer:

You are doing feature selection, so each feature is given a score according to how informative it is. The warning is telling you that two features were given the same score, which it assumes can mean one of two things: (a) you put a duplicate feature in, or (b) you really do have two different features with the same score, in which case the feature selection could give different results if it started from the other end (i.e. if it selects only one of the two tied features).
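The tie behavior described here can be sketched in miniature with pure Python (hypothetical scores, not real chi2 values): top-k selection over tied scores depends on the order in which the features are encountered.

```python
# Hypothetical feature scores; f2 and f3 tie at exactly 0.5.
scores = {'f1': 0.9, 'f2': 0.5, 'f3': 0.5, 'f4': 0.1}
k = 2

def pick(order):
    # Python's sort is stable, so tied scores keep their input order.
    return sorted(order, key=scores.get, reverse=True)[:k]

order_a = ['f1', 'f2', 'f3', 'f4']
order_b = ['f1', 'f3', 'f2', 'f4']
print(pick(order_a), pick(order_b))  # the tied feature that survives differs
```

With `order_a` the top-2 set keeps `f2`; with `order_b` it keeps `f3` instead, which is exactly the ordering dependence the warning is about.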

Read the scikit-learn manual/tutorial; it's very clear.
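On question 5 (choosing k): rather than picking a single value by hand, one common approach is to cross-validate several candidates. A sketch on synthetic data, assuming a modern scikit-learn where `GridSearchCV` lives in `sklearn.model_selection` (it was `sklearn.grid_search` in the 0.13 era):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in corpus; chi2 requires non-negative features.
X, y = make_classification(n_samples=300, n_features=40,
                           n_informative=8, random_state=0)
X = abs(X)

pipe = Pipeline([('chi2', SelectKBest(chi2)),
                 ('nb', MultinomialNB())])

# Cross-validate a grid of k values and keep the best performer.
grid = GridSearchCV(pipe, {'chi2__k': [5, 10, 20, 40]}, cv=5)
grid.fit(X, y)
print(grid.best_params_['chi2__k'], round(grid.best_score_, 3))
```

The "ideal" k is whatever maximizes held-out (cross-validated) accuracy on your own data; it will differ from corpus to corpus.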

answered Jul 04 '13 at 15:32

SeanV

edited Jul 04 '13 at 15:37
