|
A common case: we have a large text database with tags associated with each text, and we must predict tags for new text. I'm not sure that a traditional SVM can be trained on so many categories, and it cannot assign multiple categories to a single text. Naive Bayes has similar trouble with multi-category problems.
|
I'd suggest (as almost always) starting with evaluation and working backwards. If someone gave you a tagging of the text, without telling you how it was produced, how would you measure how good that tagging is? (Or, if given two taggings, how would you decide which is better, and by how much?) What are the highest and lowest values of the effectiveness measure? What would a tagging look like that got the highest possible effectiveness? One that got the lowest possible effectiveness? If this is subjective (as it usually is), what is the degree of agreement between human annotators at assigning tags, or at evaluating tags? This will lead to other questions, e.g. about whether you want to assume (and enforce) a finite tag set. These are the sorts of questions that should be answered long before you start thinking about particular learning or language processing algorithms.
I second this, and almost feel like retracting my own answer to bring this to the top. It's far too easy to forget to question the models and instead jump into reductions.
(Sep 14 '10 at 11:32)
Alexandre Passos ♦
Dave isn't even talking about models, but about the choice of the real world loss, and how that plays into designing the objective function.
(Sep 23 '10 at 18:34)
Joseph Turian ♦♦
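To make the evaluation questions above concrete, here is a minimal sketch of two candidate effectiveness measures for taggings, where each document's tags are a Python set. Micro-averaged precision/recall/F1 and mean Jaccard overlap are my own choice of illustration, not measures prescribed anywhere in this thread, and the function names are hypothetical. Both range from 0.0 (no tag in common) to 1.0 (taggings identical), and the Jaccard function can also be applied to two human annotators' taggings as a crude agreement figure.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over all (document, tag) pairs.

    gold and predicted are equal-length lists of tag sets, one set per document.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # tags that are both predicted and correct
        fp += len(p - g)   # predicted tags that are not in the gold set
        fn += len(g - p)   # gold tags that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def mean_jaccard(tagging_a, tagging_b):
    """Mean per-document Jaccard overlap between two taggings.

    Two empty tag sets are treated as perfect agreement (an assumption).
    """
    scores = []
    for a, b in zip(tagging_a, tagging_b):
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = [{"python", "ml"}, {"svm"}]
    predicted = [{"python"}, {"svm", "bayes"}]
    print(micro_prf(gold, predicted))
    print(mean_jaccard(gold, predicted))
```

Which measure is appropriate depends on the real-world loss: micro-averaging weights frequent tags more heavily, while per-document Jaccard treats every document equally.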
|
|
One approach would be to treat tag prediction as a multi-label text classification problem. See this paper: Multilabel Text Classification for Automated Tag Suggestion. For large-scale problems (a large number of tags), you may look at other approaches such as Large scale multi-label classification via metalabeler.
Yes, this should be a first attempt; it's probably easier to get working correctly than an approach based on generative models.
(Sep 11 '10 at 16:27)
Alexandre Passos ♦
What technique does the metalabeler approach use to handle a large number of tasks? I skimmed it but did not see. I agree with the general approach of treating this as a multi-label textcat problem. One approach if there is a very large number of possible outputs is to use a tree-structured output space, so that only a logarithmic number of possible labels need to be considered.
(Sep 13 '10 at 03:49)
Joseph Turian ♦♦
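As an illustration of the multi-label framing discussed above, here is a minimal sketch of the simplest reduction, often called binary relevance: train one independent "has this tag / lacks this tag" classifier per tag and return every tag whose classifier fires. This is not the metalabeler method or the tree-structured approach; the per-tag model is a small multinomial Naive Bayes with add-one smoothing chosen only to keep the example dependency-free, and the class name, tokenizer, and toy data are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


class BinaryRelevanceNB:
    """One two-class Naive Bayes model per tag (binary relevance)."""

    def fit(self, texts, tag_sets):
        self.tags = sorted(set().union(*tag_sets))
        docs = [tokenize(t) for t in texts]
        self.vocab = {w for d in docs for w in d}
        self.n_docs = len(docs)
        # For each tag, word counts and document counts for the
        # positive (True) and negative (False) class.
        self.word_counts = {t: {True: Counter(), False: Counter()} for t in self.tags}
        self.doc_counts = {t: {True: 0, False: 0} for t in self.tags}
        for words, tags in zip(docs, tag_sets):
            for t in self.tags:
                label = t in tags
                self.doc_counts[t][label] += 1
                self.word_counts[t][label].update(words)
        return self

    def _log_score(self, tag, label, words):
        # Smoothed log prior plus smoothed log likelihood of known words.
        prior = (self.doc_counts[tag][label] + 1) / (self.n_docs + 2)
        counts = self.word_counts[tag][label]
        total = sum(counts.values()) + len(self.vocab)
        score = math.log(prior)
        for w in words:
            if w in self.vocab:  # words unseen in training are ignored
                score += math.log((counts[w] + 1) / total)
        return score

    def predict(self, text):
        words = tokenize(text)
        return {t for t in self.tags
                if self._log_score(t, True, words) > self._log_score(t, False, words)}


if __name__ == "__main__":
    texts = ["svm kernel margin", "naive bayes classifier",
             "python list comprehension", "python svm library"]
    tag_sets = [{"ml"}, {"ml"}, {"python"}, {"ml", "python"}]
    model = BinaryRelevanceNB().fit(texts, tag_sets)
    print(model.predict("python svm kernel"))
```

Prediction cost here grows linearly with the number of tags, which is exactly the limitation the tree-structured output space mentioned above is meant to address.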
|
|
I think an interesting (and easy-to-use) approach is Labeled LDA. The standard baseline for this sort of data is SVMs, as expected, and they perform OK once you find a clever way of extracting the tags (such as ranking them).
Is there any open-source software for trying Labeled LDA?
(Sep 11 '10 at 11:07)
yura
I think http://nlp.stanford.edu/software/tmt/tmt-0.2/ supports Labeled LDA.
(Sep 11 '10 at 11:08)
Alexandre Passos ♦
|