My issue is hard to summarize in a title, so I hope this text makes clear what my problem is about. I also posted this question to stats.stackexchange.com, but perhaps more people here are able to help me.

I'm dealing with partially supervised text classification. I have a set of positive documents and a set of unlabeled documents (which contains both positive and negative documents). My goal is to identify documents in this unlabeled set which are most probably negative documents. As soon as I've identified them, I use those and the positive set to classify the rest of the unlabeled documents.

In order to identify the set of reliable negative documents, I use a special version of the Rocchio classification algorithm which is explained in this paper:

Xiao-Li Li, Bing Liu, See-Kiong Ng (2010). Negative Training Data can be Harmful to Text Classification. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2010).

In the upper left corner of page 6 (figure 4), there is pseudocode explaining the algorithm. Below, you find a screenshot of the two relevant lines that my question is about.

[Screenshot: an excerpt from a special version of the Rocchio classification algorithm]

At this stage of the algorithm, I have a set P of positive documents and a set PN of potential negative documents that were identified in a previous step. Each document in these two sets is represented as a vector (in bold letters) of TF-IDF values over the word vocabulary of its respective set. In the first line of the pseudocode above, I subtract the PN-vector from the P-vector. In the second line, I do the reverse. The goal is to create a positive prototype vector p and a negative prototype vector n.
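To make the two subtractions concrete, here is a minimal sketch of a standard Rocchio-style prototype computation. It is not taken from the paper: the function names, the toy weights, and the values of `alpha` and `beta` are illustrative assumptions. The sketch also assumes that a term missing from a document has weight 0, which amounts to indexing both sets over the union of their vocabularies. That is one possible choice, not something the paper is known to prescribe.

```python
import math

def normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def centroid(docs):
    """Average of the length-normalized document vectors in one set."""
    total = {}
    for doc in docs:
        for term, w in normalize(doc).items():
            total[term] = total.get(term, 0.0) + w
    return {t: w / len(docs) for t, w in total.items()}

def rocchio_prototypes(P, PN, alpha=16.0, beta=4.0):
    """Return (p, n) prototype vectors; alpha and beta are assumed defaults."""
    cp, cn = centroid(P), centroid(PN)
    # Assumption: both prototypes live in the union vocabulary,
    # and a term absent from a centroid contributes weight 0.
    vocab = set(cp) | set(cn)
    p = {t: alpha * cp.get(t, 0.0) - beta * cn.get(t, 0.0) for t in vocab}
    n = {t: alpha * cn.get(t, 0.0) - beta * cp.get(t, 0.0) for t in vocab}
    return p, n

# Hypothetical TF-IDF weights, one document per set.
P = [{"good": 1.0, "film": 1.0}]
PN = [{"bad": 1.0, "film": 1.0}]
p, n = rocchio_prototypes(P, PN)
```

With this choice, a term that occurs only in PN gets a negative weight in p, and a term that occurs only in P gets a negative weight in n. Restricting each subtraction to a single set's vocabulary would drop those negative entries.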

My question is the following:

Which vocabulary do I have to take into account for each of these two subtractions? Do I have to create all feature vectors from the entire vocabulary of both the positive and the potential negative set? Or do I have to use only the vocabulary from the positive set in the first line and only the vocabulary from the potential negative set in the second line? Or something completely different? Unfortunately, this isn't explained anywhere.

I'm confused. Please help. Thank you so much in advance! :)

asked Oct 23 '12 at 10:01

Peter Stahl