I have a document classification problem with only 2 classes, and my training matrix, after CountVectorizer, has shape (40845 x 218904). I'd like to know how I can remove the least frequent words/features, using a cutoff of 4, when min_df is supposed to be a float between 0 and 1. I actually got good accuracy and F1 results by setting min_df to 4; however, I can't explain what exactly is happening. I'm using the Python sklearn (scikit-learn) package on a 6GB machine.
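
To make the question concrete, here is a minimal sketch of the two ways min_df can be passed to CountVectorizer. It assumes the behaviour described in the scikit-learn documentation: an integer min_df is an absolute number of documents, while a float is a proportion of documents. The toy corpus and variable names are made up for illustration and are not my real data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not the real dataset): 5 short documents.
# Document frequencies: spam=5, offer=2, now=2, all other words=1.
docs = [
    "spam offer now",
    "spam offer today",
    "spam meeting now",
    "spam report",
    "spam rare",
]

# Integer min_df: drop terms that appear in fewer than 2 documents.
vec_int = CountVectorizer(min_df=2)
vec_int.fit(docs)
vocab_int = sorted(vec_int.vocabulary_)

# Float min_df: drop terms that appear in fewer than 30% of the documents
# (0.3 * 5 = 1.5 documents, so a term needs at least 2 documents to survive).
vec_float = CountVectorizer(min_df=0.3)
vec_float.fit(docs)
vocab_float = sorted(vec_float.vocabulary_)

# No frequency filtering, for comparison.
vocab_all = sorted(CountVectorizer(min_df=1).fit(docs).vocabulary_)

print("all terms:   ", vocab_all)
print("min_df=2:    ", vocab_int)
print("min_df=0.3:  ", vocab_float)
```

If that reading is right, min_df=4 on my data would mean "drop every term that occurs in fewer than 4 of the 40845 documents", which as a proportion is roughly 0.0001.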

asked Dec 07 '13 at 05:09

nms

edited Dec 07 '13 at 05:21

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.