My issue is a bit hard to explain in the question's title, so hopefully I can make it clear in this text. I also posted this question to stats.stackexchange.com, but maybe there are more people here who can help me.

I'm dealing with partially supervised text classification. I have a set of positive documents and a set of unlabeled documents (which contains both positive and negative documents). My goal is to identify the documents in the unlabeled set that are most probably negative. Once I've identified them, I use them together with the positive set to classify the rest of the unlabeled documents. To identify this set of reliable negative documents, I use a special version of the Rocchio classification algorithm, which is explained in this paper:
In the upper left corner of page 6 (figure 4), there is pseudocode explaining the algorithm. Below you can find a screenshot of the two relevant lines that my question is about.
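In case the screenshot doesn't come through: as far as I can tell, the two lines are essentially the standard Rocchio prototype equations, where α and β are the usual Rocchio weighting parameters and each document d enters as its L2-normalized TF-IDF vector:

```latex
\vec{p} = \alpha \frac{1}{|P|}  \sum_{\vec{d} \in P}  \frac{\vec{d}}{\|\vec{d}\|}
        - \beta  \frac{1}{|PN|} \sum_{\vec{d} \in PN} \frac{\vec{d}}{\|\vec{d}\|}

\vec{n} = \alpha \frac{1}{|PN|} \sum_{\vec{d} \in PN} \frac{\vec{d}}{\|\vec{d}\|}
        - \beta  \frac{1}{|P|}  \sum_{\vec{d} \in P}  \frac{\vec{d}}{\|\vec{d}\|}
```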
At this stage of the algorithm, I have a set P of positive documents and a set PN of potential negative documents, which were identified in a previous step. Each document in these two sets is represented as a vector (in bold letters) of TF-IDF values over the word vocabulary of the respective set. In the first line of the code above, I subtract the PN vector from the P vector; in the second line, I do it the other way around. The goal is to create a positive prototype vector p and a negative prototype vector n.

My question is the following: which vocabulary do I have to take into account for each of these two subtractions? Do I have to build all feature vectors from the combined vocabulary of both the positive and the potential negative set? Or do I have to use only the vocabulary of the positive set in the first line and the vocabulary of the potential negative set in the second line? Or something completely different? Unfortunately, this isn't explained anywhere, and I'm confused. Please help. Thank you so much in advance! :)
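To make the question concrete, here is a minimal sketch of how I would compute the two prototypes if the answer is "use one shared vocabulary built from P ∪ PN". That shared vocabulary is exactly the interpretation I'm asking about, and the TfidfVectorizer usage as well as the α = 16, β = 4 weights are my own assumptions, not something taken from the paper:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def rocchio_prototypes(P, PN, alpha=16.0, beta=4.0):
    """P and PN are lists of raw document strings.

    Assumption: one shared vocabulary (and IDF statistics) is built from
    P union PN -- this is the interpretation my question is about.
    """
    vectorizer = TfidfVectorizer(norm=None)   # normalize explicitly below, to mirror d / ||d||
    vectorizer.fit(P + PN)                    # shared vocabulary over both sets

    P_mat = vectorizer.transform(P).toarray()    # |P|  x |V| TF-IDF matrix
    PN_mat = vectorizer.transform(PN).toarray()  # |PN| x |V| TF-IDF matrix

    def mean_of_normalized(M):
        # Average of the L2-normalized document vectors of one set.
        norms = np.linalg.norm(M, axis=1, keepdims=True)
        norms[norms == 0] = 1.0               # guard against empty documents
        return (M / norms).mean(axis=0)

    # Positive prototype: weighted positives minus weighted potential negatives.
    p = alpha * mean_of_normalized(P_mat) - beta * mean_of_normalized(PN_mat)
    # Negative prototype: the same subtraction the other way around.
    n = alpha * mean_of_normalized(PN_mat) - beta * mean_of_normalized(P_mat)
    return p, n, vectorizer
```

If this interpretation is correct, each remaining unlabeled document would then be vectorized with the same shared vocabulary and assigned to whichever of p and n it is more cosine-similar to.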
