I need to build a classification model for a given data set. Each data point is generally a plain text file. My feature extraction method is to build an N-gram vector using the bag-of-words model. In addition, I extract a few numerical features (specifically, 3 of them) from each data point. These numerical features are constructed independently of the text contents. The values of these numerical features are nominal variables, such as 100, 75, 10000, etc. My questions are:

1) The number of text features is quite large, around 3000, while the number of numerical features is quite small, only 3. How can I make sure the impact of the numerical features is not lost?

2) Do I need to normalize across both the numerical feature set and the text-generated feature set?
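
To make the setup concrete, here is a minimal sketch of one possible way to put the two feature sets on comparable scales. It is an illustration under assumptions, not a recommended or definitive method: it L2-normalizes each document's N-gram count vector, standardizes each numerical column to zero mean and unit variance, and then concatenates the two blocks. The function names (`ngram_counts`, `combine`) and the `numeric_weight` knob are hypothetical, and the sample texts and numbers are made up for the example.

```python
import math
from collections import Counter


def ngram_counts(text, n=1):
    """Count word N-grams in a text (whitespace tokenization, lowercased)."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are left as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def standardize_columns(rows):
    """Z-score each column: subtract the column mean, divide by the column std."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, stds)]
            for row in rows]


def combine(texts, numeric_rows, numeric_weight=1.0, n=1):
    """Concatenate L2-normalized N-gram vectors with standardized numeric columns.

    numeric_weight is a hypothetical tuning knob: raising it increases the
    influence of the few numerical features relative to the text block.
    """
    counts = [ngram_counts(t, n) for t in texts]
    vocab = sorted(set().union(*counts))
    text_block = [l2_normalize([c[g] for g in vocab]) for c in counts]
    num_block = standardize_columns(numeric_rows)
    features = [t + [numeric_weight * v for v in z]
                for t, z in zip(text_block, num_block)]
    return features, vocab


# Made-up example data: two documents, three numerical features each.
texts = ["spam spam ham", "ham eggs"]
numeric_rows = [[100, 75, 10000], [300, 25, 20000]]
features, vocab = combine(texts, numeric_rows)
```

In this sketch the text block of every row has unit length regardless of how many N-grams the vocabulary contains, so the 3 numerical columns are not drowned out purely by the size of the text vector. Whether this particular scaling helps would depend on the classifier, and `numeric_weight` would normally be chosen by cross-validation.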

asked Nov 21 '13 at 23:35

ouyang


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.