I have a document classification problem with only 2 classes. After CountVectorizer, my training matrix has shape (40845 x 218904). I'd like to know how to remove the 4 least frequent words/features, given that min_df apparently must be a float between 0 and 1. I actually got good accuracy and F1 results by setting min_df to 4, but I can't explain what exactly is happening. I'm using Python's sklearn (scikit-learn) package on a 6 GB machine.
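
To make the question concrete, here is a minimal sketch of the two ways `min_df` can be passed, run on a toy corpus rather than the real data. The `docs` list and the `0.5` threshold are made up for illustration. As far as the scikit-learn documentation describes it, an integer `min_df` is read as an absolute number of documents, while a float is read as a proportion of documents, so `min_df=4` would drop every term that appears in fewer than 4 documents rather than dropping 4 terms.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical). Document frequencies:
# apple: 5, banana: 4, cherry: 3, date: 1
docs = [
    "apple banana",
    "apple banana cherry",
    "apple banana cherry",
    "apple banana date",
    "apple cherry",
]

# Integer min_df: keep only terms that occur in at least 4 documents.
vec_int = CountVectorizer(min_df=4)
X_int = vec_int.fit_transform(docs)
vocab_int = sorted(vec_int.vocabulary_)

# Float min_df: keep only terms that occur in at least 50% of the documents.
vec_float = CountVectorizer(min_df=0.5)
X_float = vec_float.fit_transform(docs)
vocab_float = sorted(vec_float.vocabulary_)

print(vocab_int, X_int.shape)
print(vocab_float, X_float.shape)
```

If that reading is right, the number of columns removed depends on how many rare terms the corpus contains, which may be why the matrix shrinks so much with `min_df=4`.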