|
I need to build a classification model for a given data set. Each data point is a plain text file. My feature extraction method is to build an N-gram vector using the bag-of-words model. In addition, I extract a few numerical features (specifically, 3) from each data point. These numerical features are constructed independently of the text contents. Typical values for these numerical features are 100, 75, 10000, etc. My questions are: 1) The number of text features is quite large, around 3000, while the number of numerical features is quite small, only 3. How can I make sure the impact of the numerical features is not lost? 2) Do I need to normalize both the numerical features and the text-generated features? |
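To make the setup concrete, here is a minimal sketch of the feature construction described above, in plain Python. The sample documents, the numeric rows, and all function names (`ngrams`, `build_vocab`, `text_vector`, `standardize_columns`) are hypothetical and only for illustration. The choice to L2-normalize each text vector and z-score each numerical column is an assumption about one possible way to put both blocks on comparable scales, not an established answer to the questions.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocab(docs, n_max=2):
    """Map every n-gram seen in the corpus to a column index."""
    vocab = sorted({g for d in docs for g in ngrams(d, n_max)})
    return {g: i for i, g in enumerate(vocab)}


def text_vector(doc, vocab, n_max=2):
    """Bag-of-words count vector, scaled to unit L2 length (assumed choice)."""
    counts = Counter(g for g in ngrams(doc, n_max) if g in vocab)
    vec = [0.0] * len(vocab)
    for gram, count in counts.items():
        vec[vocab[gram]] = float(count)
    norm = sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def standardize_columns(rows):
    """Z-score each numerical column: subtract the mean, divide by the std."""
    scaled_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        scaled_cols.append([(x - mean) / std if std > 0 else 0.0 for x in col])
    return [list(r) for r in zip(*scaled_cols)]


# Hypothetical toy data: three text files and their 3 numerical features.
docs = ["the cat sat", "the dog sat down", "a cat and a dog"]
numeric = [[100, 75, 10000], [80, 60, 12000], [120, 90, 8000]]

vocab = build_vocab(docs)
text_part = [text_vector(d, vocab) for d in docs]
numeric_part = standardize_columns(numeric)

# Final feature matrix: text columns followed by the 3 numerical columns.
features = [t + n for t, n in zip(text_part, numeric_part)]
```

In this sketch each row of `features` has one column per n-gram plus three scaled numerical columns, so the raw value 10000 no longer dwarfs the n-gram counts purely because of its units. Whether that is enough to keep 3 columns from being drowned out by roughly 3000 is exactly what question 1 asks.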