|
I have developed an algorithm for text classification (TC) that does not require any labeled dataset, and I want to compare it with other algorithms. Which TC algorithms should I select for the comparison? Currently I am comparing my algorithm with Kamal Nigam et al., “Text Classification from Labeled and Unlabeled Documents using EM”, Machine Learning (special issue on information retrieval) 39.2-3 (2000). Please let me know your thoughts. |
|
You should probably also compare it with completely unsupervised approaches, e.g. clustering. My recommendation is to look at the RCV1 corpus, which is a standard benchmark data set for text classification. To get the state of the art, see which papers in Google Scholar most recently cite the JMLR paper about RCV1 (Lewis et al., 2004). The Nigam et al. work is a good historical benchmark. |
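If you do add a clustering baseline, you need some way to score its output against the gold topic labels so the numbers are comparable with a classifier's. One common choice is purity: each cluster is credited with its majority gold label. Below is a minimal sketch of that metric in plain Python; the function name `purity` and the toy data are made up for illustration, and this is not something taken from the papers mentioned above.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, gold_labels):
    """Fraction of documents that carry the majority gold label of their cluster."""
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have the same length")
    if not cluster_ids:
        raise ValueError("need at least one document")

    # Count how often each gold label occurs inside each cluster.
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, gold_labels):
        per_cluster[cluster][label] += 1

    # Each cluster contributes the size of its most frequent gold label.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in per_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six documents, two clusters, two topics.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sports", "sports", "politics", "politics", "politics", "sports"]
print(purity(clusters, labels))
```

Purity rises trivially as the number of clusters grows, so it is only a fair comparison if you fix the number of clusters to the number of classes, or report a measure such as normalized mutual information alongside it.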