For my research I am using Labelled LDA (L-LDA) on the ModApte split of the Reuters-21578 dataset. In this dataset, each news story has a title and a body. To test the effect of L-LDA, I apply it to all three input variants: Title, Body, and Title&Body. I compare my results with Joachims (1998), who reports break-even performance for the top 10 categories and micro-averaged performance over all 90 categories.

LDA is said to work very well on short pieces of text. When I apply L-LDA to the dataset, the micro-averaged performance over all 90 categories is as follows:

  • Title: 95.68%
  • Body: 93.01%
  • Title&Body: 91.88%

This indeed shows that L-LDA performs better on short text (Title). However, when I look at the break-even performance of the top 10 categories, Title&Body performs best in 3 out of 10 categories and Title in 6 out of 10.
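To make the two measures concrete, here is a minimal sketch of one common way to compute them. It assumes each category has a list of classifier scores and binary gold labels, and it takes the precision/recall break-even point as precision at the cutoff k = number of positive documents (at that cutoff precision and recall are equal by construction). Joachims (1998) may use an interpolated variant, so treat this as an illustration rather than the exact evaluation procedure. The function names and the toy data are hypothetical and are not taken from Reuters-21578.

```python
def break_even_counts(scores, labels):
    """Return (true positives, positives) at the cutoff k = number of positives."""
    n_pos = sum(labels)
    # Rank documents by descending score and keep the top n_pos of them.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = sum(label for _, label in ranked[:n_pos])
    return tp, n_pos


def break_even_point(scores, labels):
    """Per-category break-even: precision == recall == tp / n_pos at k = n_pos."""
    tp, n_pos = break_even_counts(scores, labels)
    return tp / n_pos if n_pos else 0.0


def micro_break_even(per_category):
    """Micro-average: pool the counts of all categories before dividing."""
    total_tp = 0
    total_pos = 0
    for scores, labels in per_category.values():
        tp, n_pos = break_even_counts(scores, labels)
        total_tp += tp
        total_pos += n_pos
    return total_tp / total_pos if total_pos else 0.0


# Hypothetical toy data: one frequent and one rare category.
toy = {
    "big": ([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 1, 0, 1, 0]),
    "small": ([0.9, 0.2, 0.8, 0.1, 0.3, 0.4], [0, 1, 0, 0, 0, 0]),
}

per_category = {name: break_even_point(s, y) for name, (s, y) in toy.items()}
micro = micro_break_even(toy)
print(per_category, micro)
```

Because the micro-average pools counts, frequent categories dominate it, so it can rank the three input variants differently from a per-category count of wins among the top 10 categories.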

Also, when comparing Body and Title&Body directly, Title&Body is better than Body in 6 out of 10 categories.

My question is: how is it possible that Title&Body scores better than Body under L-LDA, even though it has more features?

asked Jul 27 '14 at 09:51

TheGreatEye

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.