|
For my research I am using Labelled LDA (L-LDA) on the ModApte split of the Reuters-21578 dataset. In this dataset each news story has a title and a body. To test the effect of L-LDA, I apply it to all three input variants: Title, Body, and Title&Body. I compare my results with Joachims (1998), who reports the break-even performance for the top 10 categories and the micro-averaged performance over all 90 categories. Now, LDA is said to work very well on short pieces of text. When I apply L-LDA to the dataset, the micro-averaged performance over all 90 categories gives the following results:
This indeed shows that L-LDA performs better on short text (Title). However, when I look at the break-even performance of the top 10 categories, Title&Body performs best for 3 out of 10 categories and Title for 6 out of 10. Also, when comparing Body and Title&Body, Title&Body is better than Body for 6 out of 10 categories. My question is: how is it possible that Title&Body, even though it has more features, scores better than Body when applying L-LDA? |
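
For readers unfamiliar with the two measures compared above, here is a minimal sketch of how they are commonly computed. It is not necessarily how the numbers in this question were produced. The function names, the category names, and all counts and scores are hypothetical and only illustrative; they are not Reuters-21578 results. The break-even point is approximated as the precision at the cutoff where the number of predicted positives equals the number of true positives (at that cutoff precision and recall coincide), ignoring ties and interpolation.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from contingency counts (0.0 if undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def micro_average(counts):
    """Pool the counts of all categories, then compute precision/recall once.

    Large categories dominate this measure because they contribute
    most of the pooled counts.
    """
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return precision_recall(tp, fp, fn)


def macro_average(counts):
    """Compute precision/recall per category, then average them equally."""
    pairs = [precision_recall(c["tp"], c["fp"], c["fn"]) for c in counts.values()]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n


def break_even(scores, labels):
    """Approximate precision/recall break-even point for one category.

    Rank documents by score and keep the top k, where k is the number of
    true positives; precision equals recall at that cutoff.
    """
    k = sum(labels)
    if k == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / k


# Hypothetical counts: one large and one small category.
toy_counts = {
    "large_category": {"tp": 900, "fp": 50, "fn": 100},
    "small_category": {"tp": 10, "fp": 10, "fn": 30},
}

micro_p, micro_r = micro_average(toy_counts)
macro_p, macro_r = macro_average(toy_counts)
print(f"micro: precision={micro_p:.3f} recall={micro_r:.3f}")
print(f"macro: precision={macro_p:.3f} recall={macro_r:.3f}")

# Hypothetical classifier scores and gold labels for a single category.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_labels = [1, 0, 1, 1, 0]
print(f"break-even: {break_even(toy_scores, toy_labels):.3f}")
```

Because the micro-average pools counts, it mostly reflects the largest categories, whereas a per-category tally such as "best in 6 out of 10 categories" weights every category equally. The two views can therefore rank Title, Body, and Title&Body differently without contradicting each other.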