
I am designing an application that will deal with individual personal data sets. Each set will effectively be a series of pages that relate to something particular to the user that requires storage.

Users will capture a document and then we will automatically apply tags to it based on rules a user will define. They can also group tags into more generic collections.

Once we have a critical mass of user data I want to be able to do more clever and interesting things with it, hence my post here. Things I would like to do:

  • Semantic Searching - so, for example, a user queries "policy" and we know to display Car Insurance and Home Insurance documents even if they don't mention "policy"

  • Auto-Suggest Tags - so we learn the user's behaviour and make suggestions as to tags they might want to use based on previously processed documents

What can people suggest in terms of open-source software / frameworks / algorithms that would help us?

Anything relevant appreciated.

Rav

asked Jul 05 '10 at 09:30

rav


5 Answers:

I think what you're looking for is described in Learning Document-Level Semantic Properties from Free-text Annotations. Basically this is an LDA variant which jointly generates the text as well as document-level tags. The upshot is that when you do topic modeling, documents with similar tags get similar topics, and when you have text by itself, the model predicts tags with strong accuracy. The code for the project is available here.
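
The paper's model is more involved, but here is a minimal sketch of the same general idea, assuming scikit-learn: fold the tags into the text as pseudo-tokens, fit plain LDA, and score candidate tags for a new document against its inferred topic mixture. The toy corpus, the tag names, and the "tag__" prefix are all invented for illustration; this is an approximation, not the model from the paper.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Hypothetical training data: (document text, tags applied by the user).
    train_docs = [
        ("car insurance policy renewal quote", ["insurance", "car"]),
        ("home insurance policy certificate", ["insurance", "home"]),
        ("monthly bank statement and balance", ["banking"]),
    ]

    # Append each tag as a pseudo-word so topics capture word/tag co-occurrence.
    texts = [text + " " + " ".join("tag__" + t for t in tags) for text, tags in train_docs]

    vectorizer = CountVectorizer(token_pattern=r"\S+")
    X = vectorizer.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    vocab = vectorizer.get_feature_names_out()
    tag_idx = {w: i for i, w in enumerate(vocab) if w.startswith("tag__")}

    # Normalise topic-word weights so each row is P(word | topic).
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

    def suggest_tags(new_text, top_n=3):
        doc_topic = lda.transform(vectorizer.transform([new_text]))[0]  # P(topic | doc)
        scores = {t.replace("tag__", ""): float(doc_topic @ topic_word[:, i])
                  for t, i in tag_idx.items()}
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    print(suggest_tags("my new car insurance documents"))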

answered Jul 05 '10 at 13:58

aria42

edited Jul 05 '10 at 16:09

You linked to a whole edition of JAIR, not a single paper.

(Jul 05 '10 at 15:20) Alexandre Passos ♦

Thanks. That's the last time I add an answer via the iPhone!

(Jul 05 '10 at 16:09) aria42

This kind of thing can be done using some search-log analysis.
Users often don't type a single query; they keep trying different things until they find what they were looking for. So if you can detect that people who initially search for "policy" end their search session by clicking insurance documents, you will know which queries have an affinity with which categories.
Auto-suggest could also be done by analyzing the logs to see what a user's favourite tags are.
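
A rough sketch of how that session analysis might look, assuming you log each session's queries together with the tags of the document the user finally clicked (the session log and tag names below are invented):

    from collections import Counter, defaultdict

    # Hypothetical search-session log: queries typed, then tags of the clicked document.
    sessions = [
        {"queries": ["policy", "insurance policy"], "clicked_tags": ["car-insurance"]},
        {"queries": ["policy"],                     "clicked_tags": ["home-insurance"]},
        {"queries": ["statement"],                  "clicked_tags": ["banking"]},
    ]

    # Count, for each query, which tags the session ended on.
    query_to_tags = defaultdict(Counter)
    for s in sessions:
        for q in s["queries"]:
            query_to_tags[q].update(s["clicked_tags"])

    def categories_for(query, top_n=2):
        # Return the tags most often clicked after this query was issued.
        return [tag for tag, _ in query_to_tags[query].most_common(top_n)]

    print(categories_for("policy"))  # e.g. ['car-insurance', 'home-insurance']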

answered Jul 05 '10 at 12:51

Aditya Mukherji

To your first question, you seem to have provided an answer in the tags: you can just use LSA (or LDA, or some variant thereof) to compute query-document similarity, instead of using the words themselves.
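
As a rough illustration of the LSA route, here is a sketch using scikit-learn's TF-IDF plus truncated SVD rather than any particular LSA package; the toy documents and the number of components are placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "car insurance policy and renewal schedule",
        "home insurance certificate and cover details",
        "monthly bank statement",
    ]

    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs)
    svd = TruncatedSVD(n_components=2, random_state=0)
    doc_vecs = svd.fit_transform(X)  # documents projected into the latent space

    def search(query, top_n=2):
        # Project the query into the same latent space and rank documents by cosine similarity.
        q_vec = svd.transform(tfidf.transform([query]))
        sims = cosine_similarity(q_vec, doc_vecs)[0]
        return [docs[i] for i in sims.argsort()[::-1][:top_n]]

    print(search("policy"))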

To your second question, you can use labeled LDA to model the tags of the documents you have and, when the user is typing in a new document, recommend high-probability tags given the words in the document.

answered Jul 05 '10 at 13:14

Alexandre Passos ♦

With respect to your second question, one approach is to train a binary probabilistic classifier for each tag. Each training instance consists of a document to which someone has applied tags, and potentially features encoding the context in which they did so. (I'm a little unclear on the details of your application, so I'm not sure what the appropriate context is.) Instances with the tag are positive examples; instances without it are negative examples. You train as many binary classifiers as you have distinct tags.

When you have an instance for which you want to make suggestions, you apply all your binary classifiers to it. You will get a probability for each tag. Sort the tags by probability and show the highest probability ones to the user as suggestions. Note that the use of a classifier that outputs calibrated probabilities is important here, because you need the outputs of different classifiers to be on the same scale.

As for the particular type of probabilistic classifier, I happen to like logistic regression models. Class probability trees are another possibility, though you'd probably want to do some model averaging there. There's a lot of open-source software for fitting logistic regression models, including BXR, which I worked on. For ultra-large-scale data, vowpal wabbit is getting a lot of attention. If you're looking for an open-source package with commercial support as well, lingpipe is one possibility.
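
A small sketch of the one-classifier-per-tag setup with logistic regression, using scikit-learn in place of BXR / vowpal wabbit / lingpipe; the training texts, tags, and bag-of-words features are invented stand-ins for whatever features suit the application:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    train_texts = [
        "car insurance policy renewal",
        "home insurance policy certificate",
        "monthly bank statement and balance",
        "credit card statement",
    ]
    train_tags = [["insurance", "car"], ["insurance", "home"], ["banking"], ["banking"]]

    vec = TfidfVectorizer()
    X = vec.fit_transform(train_texts)
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(train_tags)  # one binary column per tag

    # OneVsRestClassifier fits an independent logistic regression per tag.
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

    def suggest_tags(text, top_n=3):
        # One calibrated-ish probability per tag; sort and show the top ones.
        probs = clf.predict_proba(vec.transform([text]))[0]
        ranked = sorted(zip(mlb.classes_, probs), key=lambda p: p[1], reverse=True)
        return ranked[:top_n]

    print(suggest_tags("new car insurance documents"))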

answered Jul 05 '10 at 17:17

Dave Lewis

For the second part of your question, you should measure the IDF (Inverse Document Frequency) statistics of your dataset so as to be able to compute the TF-IDF score of each term (or bi-gram or tri-gram of terms) in your documents, and suggest the top 5 as candidate tags.
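
A small sketch of that TF-IDF approach, assuming the IDF statistics are computed over the user's whole document collection (the example corpus is invented):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "car insurance policy renewal schedule and premium",
        "home insurance policy certificate of cover",
        "monthly bank statement with opening and closing balance",
    ]

    # Unigrams through trigrams, scored by TF-IDF against the whole collection.
    vec = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
    X = vec.fit_transform(corpus)
    terms = vec.get_feature_names_out()

    def candidate_tags(doc_index, top_n=5):
        # Highest-scoring n-grams of one document become the suggested tags.
        row = X[doc_index].toarray().ravel()
        return [terms[i] for i in row.argsort()[::-1][:top_n]]

    print(candidate_tags(0))  # top-5 TF-IDF n-grams of the first document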

You should also have a look at KEA and maui-indexer as implementations of similar approaches.

Also, this demo page from the Sematext software might give you further intuitions on collocations and SIPs (Statistically Improbable Phrases).

answered Jul 05 '10 at 20:45

ogrisel

I don't think this is necessarily a good idea. Just recommending bi- or tri-grams that have a higher-than-average count in the current document might be highly redundant. Usually people use tags that suggest categorizations that are not explicit in the text, but are somehow relevant. This question, for example, has the "lsi" and "lsa" tags, and I've seen many usages of tags that follow this pattern. This is why most other answers try to predict the tags based on the words in the document, not just recommend words that are already in the document.

(Jul 05 '10 at 21:43) Alexandre Passos ♦