|
I am designing an application that will deal with individual personal data sets. Each set will effectively be a series of pages that relate to something particular to the user that requires storage. Users will capture a document, and then we will automatically apply tags to it based on rules the user defines. They can also group tags into more generic collections. Once we have a critical mass of user data, I want to be able to do more clever and interesting things with it, hence my post here. Things I would like to do:
What can people suggest in terms of open software / frameworks / algorithms that would help us? Anything relevant is appreciated. Rav |
|
I think what you're looking for is described in Learning Document-Level Semantic Properties from Free-text Annotations. Basically, this is an LDA variant which jointly generates the text as well as document-level tags. The upshot is that when you do topic modeling, documents with similar tags get similar topics, and when you have text by itself, the model predicts tags with strong accuracy. The code for the project is available here. You linked to a whole edition of JAIR, not a single paper.
(Jul 05 '10 at 15:20)
Alexandre Passos ♦
Thanks. That's the last time I add an answer via the iPhone!
(Jul 05 '10 at 16:09)
aria42
|
|
This kind of thing can be done using some search log analysis. |
|
To your first question, you seem to have provided an answer in the tags: you can just use LSA (or LDA, or some variant thereof) to compute query-document similarity, instead of using the words themselves. To your second question, you can use labeled LDA to model the tags of the documents you have and, when the user is typing in a new document, recommend some high-probability tags given the words in the document. |
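As a rough illustration of the first suggestion, here is a minimal LSA sketch using scikit-learn (the tiny corpus and query are invented for illustration): documents are projected into a low-rank latent space, and query-document similarity is computed there rather than over the raw words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "latent semantic analysis of text documents",
    "tagging and labeling personal document collections",
    "user defined rules for automatic tagging",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)       # term-document matrix

svd = TruncatedSVD(n_components=2)       # low-rank latent semantic space
X_lsa = svd.fit_transform(X)             # documents projected into that space

query = ["automatic tagging rules"]
q_lsa = svd.transform(vectorizer.transform(query))

sims = cosine_similarity(q_lsa, X_lsa)[0]  # query-document similarities
best = sims.argmax()                       # index of the most similar document
```

The point of going through the SVD is that documents sharing latent topics can score as similar even when they share few exact words.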
|
With respect to your second question, one approach is to train a binary probabilistic classifier for each tag. Each training instance consists of a document to which someone has applied tags, and potentially features encoding the context in which they did so. (I'm a little unclear on the details of your application, so I'm not sure what the appropriate context is.) Instances with the tag are positive examples; instances without the tag are negative examples. You train as many binary classifiers as you have distinct tags. When you have an instance for which you want to make suggestions, you apply all your binary classifiers to it. You will get a probability for each tag. Sort the tags by probability and show the highest-probability ones to the user as suggestions. Note that the use of a classifier that outputs calibrated probabilities is important here, because you need the outputs of different classifiers to be on the same scale. As for the particular type of probabilistic classifier, I happen to like logistic regression models. Class probability trees are another possibility, though you'd probably want to do some model averaging there. There's a lot of open source software for fitting logistic regression models, including BXR, which I worked on. For ultra-large-scale data, vowpal wabbit is getting a lot of attention. If you're looking for an open source package with commercial support as well, lingpipe is one possibility. |
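The per-tag binary classifier idea can be sketched with scikit-learn's logistic regression (the tiny corpus, tag set, and labels below are invented for illustration): one binary model per tag, with probabilities on a common scale so the tags can be ranked.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = [
    "invoice from the electricity company",
    "monthly electricity bill and payment",
    "doctor appointment and medical report",
    "medical insurance claim form",
]
tags = ["utilities", "medical"]
# one binary label vector per tag: 1 if the tag was applied, else 0
labels = {
    "utilities": [1, 1, 0, 0],
    "medical":   [0, 0, 1, 1],
}

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)

# train one binary logistic regression classifier per distinct tag
classifiers = {t: LogisticRegression().fit(X, labels[t]) for t in tags}

def suggest_tags(text, top_k=2):
    """Apply every per-tag classifier and return the top-k (tag, probability) pairs."""
    x = vectorizer.transform([text])
    scored = [(t, clf.predict_proba(x)[0, 1]) for t, clf in classifiers.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest probability first
    return scored[:top_k]

suggestions = suggest_tags("electricity bill for march")
```

Because each classifier outputs a probability rather than a raw score, the suggestions from different tags are directly comparable, which is exactly the calibration point made above.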
|
For the second part of your question, you should measure the IDF (inverse document frequency) statistics of your dataset so as to be able to compute the TF-IDF score of each term (or bi- or tri-gram of terms) in your document, and suggest the top 5 as candidate tags. You should also have a look at KEA and maui-indexer as implementations of similar approaches. Also, this demo page from the Sematext software might further give you intuitions on collocations and SIPs (Statistically Improbable Phrases).

I don't think this is necessarily a good idea. Just recommending bi- or tri-grams that have a higher-than-average count in the current document can be highly redundant. Usually people use tags that suggest categorizations that are not explicit in the text, but are somehow relevant. This question, for example, has the "lsi" and "lsa" tags, and I've seen many usages of tags that follow this pattern. This is why most other answers try to predict the tags based on the words in the document, rather than just recommend words that are already in the document.
(Jul 05 '10 at 21:43)
Alexandre Passos ♦
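For concreteness, the TF-IDF suggestion in the answer above can be sketched with the standard library alone (the toy corpus and document are invented, and this unigram version skips stemming, stopword removal, and n-grams): IDF is estimated from the collection, and the highest-scoring terms in a new document become candidate tags.

```python
import math
from collections import Counter

corpus = [
    "tax return forms and receipts",
    "holiday photos from the beach",
    "tax deductible charity receipts",
    "recipe for chocolate cake",
]

def tokens(text):
    return text.lower().split()

# document frequency: in how many documents each term appears
n_docs = len(corpus)
df = Counter()
for doc in corpus:
    df.update(set(tokens(doc)))

def idf(term):
    # smoothed inverse document frequency over the collection
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def suggest_terms(text, top_k=5):
    """Score each term in the document by TF * IDF and return the top-k terms."""
    tf = Counter(tokens(text))
    scored = [(term, count * idf(term)) for term, count in tf.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in scored[:top_k]]

candidates = suggest_terms("tax receipts for the charity donation")
```

Note how rare words dominate the ranking, including uninformative ones the corpus happens not to contain, which illustrates the redundancy and stopword concerns raised in the comment above.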
|