|
I am working on a text classification problem using a bag-of-words approach with Tf-Idf features. Unfortunately my dataset is rather small (is any dataset ever big enough?) and noisy. To improve performance, I would like to use an unlabeled background collection and semi-supervised learning. What would be a good way to do so, and how much does it matter if the background collection comes from a different source? One idea is to apply latent semantic analysis (LSA) to the Tf-Idf vectors; another is to cluster the words and use the cluster frequencies as features.
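A minimal sketch of the LSA idea, assuming scikit-learn and two hypothetical corpora (`labeled_texts`, `background_texts` are placeholder data, not from the question): the Tf-Idf vocabulary and the SVD are fit on the labeled and background documents together, so the latent dimensions are estimated from the larger combined corpus, and only the labeled documents are then projected for supervised training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical data: a small labeled set plus an unlabeled background collection.
labeled_texts = ["the movie was great", "terrible plot and acting"]
background_texts = [
    "an enjoyable film overall",
    "the acting felt flat",
    "a gripping story with strong performances",
]

# Fit Tf-Idf and LSA on labeled + background data together.
vectorizer = TfidfVectorizer()
X_all = vectorizer.fit_transform(labeled_texts + background_texts)

lsa = TruncatedSVD(n_components=2, random_state=0)
lsa.fit(X_all)

# Project only the labeled documents into the latent space; these dense
# vectors become the features for whatever supervised classifier you use.
X_labeled = lsa.transform(vectorizer.transform(labeled_texts))
print(X_labeled.shape)  # (2, 2)
```

The number of components (2 here, for the toy corpus) would normally be a few hundred and tuned by cross-validation on the labeled set.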
|
When doing semi-supervised learning, unlabeled data from a different domain can be very tricky. I'd say there are a few main approaches you can try:

- EM, if you are using a generative model such as Naive Bayes (see Nigam and McCallum);
- co-training, which works with both discriminative and generative models but has the somewhat awkward requirement of two independent views of the data;
- a graph-based method such as label propagation, followed by training a supervised model on the propagated labels, as in Subramanya et al.
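As a rough sketch of the EM-style family above, here is self-training (a simpler, hard-assignment relative of the Nigam and McCallum EM approach, not their exact algorithm) with a Naive Bayes base model in scikit-learn; the texts and labels are made-up placeholders, and `-1` marks the unlabeled background documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.semi_supervised import SelfTrainingClassifier

# Hypothetical tiny dataset; -1 marks unlabeled background documents.
texts = [
    "loved this film",
    "a wonderful experience",
    "awful and boring",
    "worst movie ever",
    "an enjoyable story",
    "dreadful pacing throughout",
]
labels = np.array([1, 1, 0, 0, -1, -1])

X = TfidfVectorizer().fit_transform(texts)

# MultinomialNB provides predict_proba, which SelfTrainingClassifier uses
# to pseudo-label confident unlabeled examples in each round.
clf = SelfTrainingClassifier(MultinomialNB(), threshold=0.6)
clf.fit(X, labels)

preds = clf.predict(X)
```

The `threshold` controls how confident the model must be before an unlabeled document gets a pseudo-label; with out-of-domain background data you would want it fairly high, so that only examples the model is sure about leak into training.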