
I am working on a text classification problem using a bag-of-words approach with Tf-Idf features. Unfortunately, my dataset is rather small (is any dataset ever big enough?) and noisy.

To improve performance, I would like to use an unlabeled background collection and semi-supervised learning. What would be a good way to do so? How much does it matter if the background collection comes from a different source?

My idea was to apply latent semantic analysis (LSA) to the Tf-Idf vectors. Another idea could be to cluster the words and use the cluster frequencies as features.
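
For concreteness, this is roughly what I have in mind for the LSA option: a minimal sketch in Python with scikit-learn, where the documents and variable names are just toy placeholders for my own data.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression

    labeled_texts = ["good product", "terrible service"]          # toy placeholders
    labels = [1, 0]
    unlabeled_texts = ["okay product", "great service", "bad product"]

    # Learn the vocabulary and the latent space on all available text,
    # including the unlabeled background collection.
    vectorizer = TfidfVectorizer()
    X_all = vectorizer.fit_transform(labeled_texts + unlabeled_texts)

    # In practice something like 100-300 components is common; 2 is only
    # because the toy vocabulary is tiny.
    lsa = TruncatedSVD(n_components=2, random_state=0)
    lsa.fit(X_all)

    # The classifier is trained on the labeled portion only, in the LSA space.
    X_labeled = lsa.transform(vectorizer.transform(labeled_texts))
    clf = LogisticRegression().fit(X_labeled, labels)

    print(clf.predict(lsa.transform(vectorizer.transform(["great product"]))))

The point would be that the Tf-Idf vocabulary and the SVD projection are learned from all of the text, while the classifier only ever sees the labeled examples.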

asked Sep 11 '13 at 01:46

Maarten


One Answer:

When doing semi-supervised learning, unlabeled data from a different domain can be very tricky.

I'd say there are a few main approaches you can try: EM, if you're using a generative model like naive Bayes (see Nigam and McCallum); co-training, which works with discriminative and generative models but has the weird multi-view requirement; and a graph-based method like label propagation followed by training a supervised model, as in Subramanya et al.
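
For the graph-based route, a rough sketch with scikit-learn's LabelSpreading (not Subramanya et al.'s exact method; the toy documents and variable names are placeholders for your data) would look something like this: propagate labels over the labeled plus unlabeled documents, then fit an ordinary supervised classifier on the propagated labels.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.semi_supervised import LabelSpreading
    from sklearn.svm import LinearSVC

    labeled_texts = ["good product", "terrible service"]          # toy placeholders
    labels = [1, 0]
    unlabeled_texts = ["okay product", "great service", "bad product"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(labeled_texts + unlabeled_texts).toarray()

    # scikit-learn marks unlabeled points with -1.
    y = np.array(labels + [-1] * len(unlabeled_texts))

    # Propagate labels over a k-NN similarity graph of all documents.
    propagator = LabelSpreading(kernel="knn", n_neighbors=2)
    propagator.fit(X, y)

    # Then train an ordinary supervised model on the (partly inferred) labels.
    clf = LinearSVC().fit(X, propagator.transduction_)
    print(clf.predict(vectorizer.transform(["great product"]).toarray()))

If the background collection comes from a different domain, the propagated labels on it can be unreliable, so you may want to weight or filter those examples before the supervised step.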

answered Sep 11 '13 at 11:25

Alexandre Passos ♦

