Hi, I have a data set that has been tagged rigorously by humans, and I want to find, given a set of keywords, similar 'documents' using only keywords and not raw text. Any ideas what the best approach is?

Thanks!

asked Jul 09 '10 at 14:16

AK Roston

edited Jul 09 '10 at 14:59


5 Answers:

You can use any bag-of-words technique for this. The best one will depend on the details of your setting. A very simple way to return documents "similar" to a set of keywords is to represent each document as a keyword vector (think of a bag of keywords: a vector the size of the keyword vocabulary, with a 1 in the component corresponding to each keyword the document has and 0 elsewhere) and return the documents whose vectors have high similarity to the query's keyword vector (high cosine similarity, low Euclidean distance, high inner product, etc.).

Or is there a reason why this simple approach shouldn't work?
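
For concreteness, here is a minimal sketch of that binary bag-of-keywords / cosine-similarity idea in Python; the toy corpus, vocabulary, and function names are made up for illustration:

    # Minimal sketch (toy data): binary keyword vectors + cosine similarity.
    import numpy as np

    def to_vector(keywords, vocab_index):
        """Binary bag-of-keywords vector over a fixed keyword vocabulary."""
        v = np.zeros(len(vocab_index))
        for kw in keywords:
            if kw in vocab_index:
                v[vocab_index[kw]] = 1.0
        return v

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return 0.0 if denom == 0 else float(np.dot(a, b) / denom)

    # Toy corpus of keyword-tagged documents (made up for illustration).
    docs = {
        "doc1": {"python", "clustering", "nlp"},
        "doc2": {"nlp", "keywords", "similarity"},
        "doc3": {"cooking", "recipes"},
    }
    vocab_index = {kw: i for i, kw in enumerate(sorted(set().union(*docs.values())))}

    query = {"nlp", "similarity"}
    q_vec = to_vector(query, vocab_index)

    # Rank documents by cosine similarity to the query's keyword vector.
    ranked = sorted(docs, key=lambda d: cosine(q_vec, to_vector(docs[d], vocab_index)),
                    reverse=True)
    print(ranked)  # doc2 first, then doc1, then doc3

With a 50k-keyword vocabulary and only a handful of keywords per document, you would store these vectors sparsely in practice, but the idea is the same.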

answered Jul 09 '10 at 15:27

Alexandre Passos ♦

edited Jul 09 '10 at 15:44

I think this would work. So I would have a dictionary of the complete set of keywords (50k or so), create a vector for each document's set of keywords [0,0,0,1,0,0, ...], and then, given an input vector (the document I am trying to find similar things for), measure the distance to every other vector (with cosine distance) and order by shortest distance?

(Jul 09 '10 at 16:07) AK Roston

Yes, that is what I described. If you want a clustering (i.e., a partition into subsets with high within-subset similarity), you can use the k-means algorithm. In a nutshell: divide your set of documents randomly into subsets, then repeat: compute the mean of each subset, and assign each document to the closest mean. Once the assignments stop changing, you have a reasonably good partition of your documents.

(Jul 09 '10 at 16:11) Alexandre Passos ♦
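
As a concrete illustration of the k-means recipe in the comment above, here is a bare-bones numpy sketch; it assumes the documents are already stacked into a dense array X, and in practice a library implementation would usually be preferable:

    # Bare-bones sketch of the procedure above; X is assumed to be a dense
    # (n_docs x vocab_size) numpy array of keyword vectors.
    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Divide the documents randomly into k subsets.
        assignments = rng.integers(0, k, size=len(X))
        for _ in range(n_iters):
            # Compute the mean of each subset (re-seed empty subsets with a random doc).
            centers = np.array([
                X[assignments == j].mean(axis=0) if np.any(assignments == j)
                else X[rng.integers(len(X))]
                for j in range(k)
            ])
            # Assign each document to its closest mean (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            new_assignments = dists.argmin(axis=1)
            if np.array_equal(new_assignments, assignments):
                break  # the partition has stopped changing
            assignments = new_assignments
        return assignments, centers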

Would any normalization be required on the vectors, since each document may have a different number of keywords (anywhere from 1 to 10 or so)?

(Jul 09 '10 at 16:17) AK Roston

There are many strategies you can experiment with. The simplest are to normalize each vector to unit norm, or to make each vector's components sum to 1 (the latter is called tf). Another commenter mentioned tf-idf, which is tf as above with each keyword's weight also multiplied by log(idf), where idf is the ratio of the total number of documents to the number of documents containing that keyword. Performance varies among these, but not by much, so be sure to try a few before settling on one.

(Jul 09 '10 at 16:19) Alexandre Passos ♦
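
A minimal sketch of the three weighting schemes mentioned above (unit-norm, tf, and a standard tf-idf variant), applied to a documents-by-keywords count matrix X; the function names are illustrative, and a library transformer would normally do this for you:

    # The weighting schemes above, applied to a documents-by-keywords count matrix X.
    import numpy as np

    def unit_norm(X):
        """Scale each row to Euclidean norm 1."""
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X / np.where(norms == 0, 1, norms)

    def tf(X):
        """Scale each row so its components sum to 1."""
        sums = X.sum(axis=1, keepdims=True)
        return X / np.where(sums == 0, 1, sums)

    def tf_idf(X):
        """tf weighting times log(idf), with idf = n_docs / document frequency."""
        n_docs = X.shape[0]
        df = np.count_nonzero(X > 0, axis=0)        # docs containing each keyword
        idf = np.log(n_docs / np.maximum(df, 1))
        return tf(X) * idf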

I was able to get what I consider to be pretty good results with this technique. However, I am finding that the size of the data set makes computing the similarities take too long to do in real time. Are there optimization techniques to try, so that I can do something more efficient than comparing every vector to the query vector at run time?

(Jul 14 '10 at 12:08) AK Roston

Are you using some sort of sparse vector library? If not, I really recommend it; it can speed up this sort of problem immensely. Or you could perform some dimensionality reduction beforehand (for example with random projections: replace every vector v with the product Av, where A is a rectangular matrix filled with randomly generated numbers).

(Jul 14 '10 at 12:15) Alexandre Passos ♦
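
A rough sketch of both suggestions, sparse storage plus a random projection, using numpy and scipy; the sizes and the synthetic data below are illustrative assumptions:

    # Sketch of both ideas: a sparse document-by-keyword matrix plus a random
    # projection down to a small dense space (synthetic data for illustration).
    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)
    n_docs, vocab_size, reduced_dim = 10_000, 50_000, 128

    # Each document gets 1-10 random keywords, stored sparsely.
    rows, cols = [], []
    for doc in range(n_docs):
        for kw in rng.choice(vocab_size, size=rng.integers(1, 11), replace=False):
            rows.append(doc)
            cols.append(kw)
    X = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)),
                          shape=(n_docs, vocab_size))

    # Random projection: replace every vector v with Av, for a random matrix A.
    A = rng.standard_normal((vocab_size, reduced_dim)) / np.sqrt(reduced_dim)
    X_reduced = X @ A   # dense (n_docs x reduced_dim); much cheaper to compare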

To clarify: by "bag-of-words technique", Alexandre means any approach in which each document is represented by a "bag of words" vector, i.e. a vector with one element for every word in the vocabulary, set to 0 or 1 depending on whether or not that word occurs in the document.

answered Jul 09 '10 at 15:33

aditi

Simply try TF-IDF, as Alexandre suggests.

answered Jul 09 '10 at 15:54

Andrej

Here is a quick solution: generate a keyword-frequency vector (or any other kind of feature vector) for each document, pass it through a dimensionality-reduction method (PCA, t-SNE, ...), then run k-means or hierarchical clustering (hclust) on the resulting data. This gives you a relatively good, generic document-clustering tool. Some of the more advanced document-clustering methods are domain specific and really depend on the amount of data and its structure.
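
A rough sketch of that pipeline, assuming scikit-learn is available; the data matrix, number of components, and number of clusters below are placeholders:

    # Sketch of the pipeline (PCA then k-means) with scikit-learn; the data and
    # parameter values below are placeholders.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    X = np.random.rand(1000, 5000)     # stand-in for real keyword-frequency vectors

    X_reduced = PCA(n_components=50).fit_transform(X)
    labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X_reduced)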

answered Jul 09 '10 at 15:57

Mark Alen

The k-means algorithm can work well for document clustering if you have a relatively small total number of words. It will help to remove stop words (words that occur very frequently in the data set). I've tried scoring the terms in a set of documents using TF-IDF and then clustering the resulting feature vectors with k-means, and gotten decent results. The downside to this approach is that you have to have a good idea of the number of categories in your data set beforehand.

As an aside, I've found that using Particle Swarm Optimization as a preprocessor to k-means can greatly improve the accuracy of the results. Fair warning: this is an advanced approach and will most likely be unnecessary for beginners. It can also be very helpful to "seed" the k-means algorithm with initial centroids as a form of semi-supervised learning. That means choosing a number of documents that best represent the different categories in your dataset (even if these documents don't occur in your set) and using their TF-IDF feature vectors as your initial k-means centroids.
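
A rough sketch of the "seeded" k-means idea, assuming scikit-learn; the tiny corpus and seed documents below are placeholders:

    # Sketch of seeding k-means with hand-picked "representative" documents.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    corpus = ["keywords for document one",
              "other keywords here",
              "yet more keywords"]               # documents to cluster (placeholders)
    seeds = ["keywords that typify category one",
             "keywords that typify category two"]  # one representative per category

    vec = TfidfVectorizer()
    X = vec.fit_transform(corpus + seeds)        # TF-IDF vectors over a shared vocabulary
    X_corpus, X_seeds = X[:len(corpus)], X[len(corpus):]

    # Use the seed documents' TF-IDF vectors as the initial centroids.
    km = KMeans(n_clusters=len(seeds), init=X_seeds.toarray(), n_init=1)
    labels = km.fit_predict(X_corpus.toarray())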

answered Jul 11 '10 at 04:36

Joel H
