I have a dataset consisting of numerical and categorical attributes as features. Are there classification techniques that are specifically sensitive to, and designed to handle, such a mix of feature types? I have tried NaiveBayes, Logit and SVM and got fairly ordinary accuracy.

Edit: I am performing a multi-class (3-class) classification task on a given dataset. I have extracted discriminatory features from the data according to my intuition, domain knowledge and frequency of occurrence; the total comes to close to 30 features. My feature set consists of different data types: strings, continuous numerical quantities and discrete numerical quantities. I tried LogisticRegression, multi-class SVM and NaiveBayes for the task but did not get good accuracy. I was wondering whether machine learning classifiers are sensitive to having different types of data in the same training set. If so, are there machine learning techniques or paradigms that cater to such applications?
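To make the setup concrete, here is a stripped-down sketch of the kind of pipeline I am describing (the file and column names are made up and this is not my actual code; the string attributes are one-hot encoded and the numeric ones passed through):

    # Sketch only: one-hot encode string columns, pass numeric columns through,
    # then fit one of the classifiers I tried (here LogisticRegression).
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("data.csv")                     # placeholder file name
    y = df["label"]                                  # the 3-class target
    X = df.drop(columns=["label"])

    categorical_cols = X.select_dtypes(include="object").columns
    numeric_cols = X.select_dtypes(exclude="object").columns

    preprocess = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", "passthrough", numeric_cols),
    ])

    model = Pipeline([("pre", preprocess),
                      ("clf", LogisticRegression(max_iter=1000))])
    print(cross_val_score(model, X, y, cv=5).mean())  # 5-fold CV accuracy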
Generally, machine learning methods can handle 30 features of mixed types quite well. The fact that all of these models are failing points toward one of three causes: you do not have enough data, your problem is just naturally hard, or you are not using a good feature representation (that is, with only the features you chose there is no natural, easy way to separate examples of the different classes). Did you try the standard tricks to make sure that, once the examples are represented as feature vectors, the different components have roughly the same variance? Did you try kernels in the SVM, and if so, did you carefully tune the kernel hyperparameters jointly with the C hyperparameter? And how many examples do you have (tens, hundreds, thousands, millions)?
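Concretely, the kind of thing I mean looks roughly like this in scikit-learn (a sketch only: X and y are assumed to already hold the numerically encoded features and labels, and the grid values are just illustrative):

    # Sketch: standardize every component to unit variance, then tune the RBF
    # kernel width (gamma) jointly with C over a log-spaced grid.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    pipe = Pipeline([
        ("scale", StandardScaler()),           # roughly equal variance per component
        ("svm", SVC(kernel="rbf")),
    ])

    param_grid = {
        "svm__C": np.logspace(-2, 3, 6),       # illustrative ranges only
        "svm__gamma": np.logspace(-4, 1, 6),
    }

    search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)                           # X, y: encoded features and labels
    print(search.best_params_, search.best_score_)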
Alexandre is probably right that you will have to do some error analysis to figure out which new features should be added. Try focusing on the cases the classifier gets wrong: can you guess why it is confusing those cases with the wrong class?

I think you can avoid the hassle of tweaking feature representations (e.g. scaling, normalization) and tuning hyperparameters, though. I recommend trying bagged decision trees while you are doing feature engineering. A good decision tree implementation should a) handle both numeric and categorical features, b) be invariant to monotonic scaling of the numeric features, c) support multiclass prediction with a single model, and d) handle missing values. You will have trouble finding support for "bag of words" type features, though. The YaDT software package looked good to me, although I haven't used it myself.

Given a good DT implementation, you can script up a bagging implementation in a day or two. With bagging you can usually ignore how the decision tree is pruned (just grow it big) and the choice of splitting criterion, because their impact will be small. That leaves the ensemble size (the number of bagging iterations) as the only parameter to choose, and the results are not overly sensitive to it: in general more is better, and 100 iterations are usually enough to get most of the achievable predictive accuracy. If you end up with a lot of features, you can switch to random forests for computational efficiency.
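As a sketch of how little code that bagging script needs, here is a bare-bones version using scikit-learn's DecisionTreeClassifier (which, unlike YaDT, needs the categorical features encoded as numbers first); X_train, y_train, X_test and y_test are assumed to be NumPy arrays:

    # Bare-bones bagging: fit each tree on a bootstrap resample, grow it big
    # (no pruning), and combine the trees by majority vote.
    from collections import Counter
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_bagged_trees(X, y, n_trees=100, seed=0):
        rng = np.random.default_rng(seed)
        trees = []
        for _ in range(n_trees):
            idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample, with replacement
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees

    def predict_bagged_trees(trees, X):
        votes = np.stack([tree.predict(X) for tree in trees])   # shape: (n_trees, n_samples)
        return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

    trees = fit_bagged_trees(X_train, y_train, n_trees=100)      # 100 iterations is usually enough
    print(np.mean(predict_bagged_trees(trees, X_test) == y_test))  # test accuracy

scikit-learn's BaggingClassifier (or RandomForestClassifier, for the many-features case) gives you the same thing off the shelf, but the point is that the scripted version really is this short.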
This question is going to have to be quite a bit more specific in order to get a useful answer.
@apc I did so. I have nothing more to add beyond the information above.