Hi, I have a data set that has been tagged rigorously by humans, and I want to find, given a set of keywords, similar 'documents' using only the keywords and not the raw text. Any ideas what the best approach is? Thanks!
You can use any bag-of-words technique for this; the best one will depend on the details of your environment. A very simple way to return documents "similar" to a set of keywords is to represent each document as a bag-of-keywords vector (a vector the size of the keyword vocabulary, with a 1 in the component corresponding to each keyword the document has and 0 elsewhere) and return the documents whose vectors have high similarity to the query keyword vector, measured by cosine similarity, low Euclidean distance, a high inner product, etc. Or is there a reason why this simple approach shouldn't work?
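For concreteness, here is a minimal sketch of that bag-of-keywords ranking in Python with numpy. The `documents` and `query` values are made-up placeholders for your own tagged data; only the vector construction and the cosine ranking are the point.

```python
import numpy as np

# Placeholder data: each document is just its list of human-assigned keywords.
documents = {
    "doc1": ["machine learning", "clustering", "text"],
    "doc2": ["databases", "indexing"],
    "doc3": ["clustering", "k-means", "text"],
}
query = ["clustering", "text"]

# Build the keyword vocabulary and a binary vector per document.
vocab = sorted({kw for kws in documents.values() for kw in kws})
index = {kw: i for i, kw in enumerate(vocab)}

def to_vector(keywords):
    v = np.zeros(len(vocab))
    for kw in keywords:
        if kw in index:          # ignore keywords outside the vocabulary
            v[index[kw]] = 1.0
    return v

doc_vectors = {name: to_vector(kws) for name, kws in documents.items()}
q = to_vector(query)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a.dot(b) / denom if denom else 0.0

# Rank documents by cosine similarity to the query keyword vector.
ranked = sorted(doc_vectors, key=lambda name: cosine(q, doc_vectors[name]), reverse=True)
for name in ranked:
    print(name, round(cosine(q, doc_vectors[name]), 3))
```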
I think this would work. So I would have a dictionary of the complete set of keywords (50k or so), create a vector for each set of keywords [0,0,0,1,0,0 ...] and then given an input vector (the document I am trying to find similar things for), measure the distance to every other vector (with cosine) and order by shortest distance?
(Jul 09 '10 at 16:07)
AK Roston
Yes, that is what I described. If you want a clustering (i.e., a partition into subsets with high similarity), you can use the k-means algorithm. In a nutshell: divide your set of documents randomly into subsets, then repeat: compute the mean of each subset and assign each document to the closest mean. Once the assignments stop changing, you have a reasonably good partition of your documents.
(Jul 09 '10 at 16:11)
Alexandre Passos ♦
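To make the comment above concrete, here is a rough numpy-only sketch of that k-means loop. `X` is assumed to be your (documents x keywords) matrix; the all-pairs distance computation is written for clarity, not for very large data sets.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Toy k-means: X is a (num_documents x vocabulary_size) float array."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize the means from k randomly chosen documents.
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignments = np.full(len(X), -1)
    for _ in range(n_iters):
        # Assign each document to the closest mean (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break                              # nothing changed, we are done
        assignments = new_assignments
        # Recompute each mean from the documents assigned to it.
        for j in range(k):
            members = X[assignments == j]
            if len(members):
                means[j] = members.mean(axis=0)
    return assignments, means
```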
Would any normalization be required on the vectors, since every document may have a different number of keywords (anywhere from 1 to 10 or so)?
(Jul 09 '10 at 16:17)
AK Roston
There are many strategies you can experiment with. The simplest are normalizing each vector to norm 1, or normalizing so that each vector's components sum to 1 (the latter is called tf). Another commenter mentioned tf-idf, which is tf as above with each keyword's weight also multiplied by log(idf), where idf is the ratio of the total number of documents to the number of documents containing that specific keyword. The performance varies, but not by much, so be sure to try a few of these before deciding on one.
(Jul 09 '10 at 16:19)
Alexandre Passos ♦
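As an illustration of the weighting schemes mentioned in the comment above, here is a small numpy sketch. `X` stands in for a (documents x keywords) count matrix, and the idf uses the standard log(number of documents / document frequency) form.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row to unit Euclidean norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms

def tf(X):
    """Scale each row so its components sum to 1."""
    sums = X.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0
    return X / sums

def tfidf(X):
    """tf weighting times log(num_documents / document frequency) per keyword."""
    df = (X > 0).sum(axis=0)                       # documents containing each keyword
    idf = np.log(X.shape[0] / np.maximum(df, 1))
    return tf(X) * idf

# scikit-learn's TfidfVectorizer / TfidfTransformer implement the same idea
# with a few extra smoothing options, if you prefer not to roll your own.
```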
I was able to get what I consider pretty good results with this technique. However, I am finding that the size of the data set makes computing the similarities too slow for real time. Are there optimization techniques that would be more efficient than comparing the query vector against every document vector at run time?
(Jul 14 '10 at 12:08)
AK Roston
Are you using some sort of sparse vector library? If not, I really recommend you do; it can speed up this sort of problem immensely. Or you could perform some dimensionality reduction beforehand (maybe with random projections; that is, replace every vector v with the product Av, where A is a rectangular matrix filled with randomly distributed numbers).
(Jul 14 '10 at 12:15)
Alexandre Passos ♦
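Here is a hedged sketch of both suggestions, assuming scipy and scikit-learn are available. The randomly generated `X_sparse` matrix is only a stand-in for real keyword vectors, and the sizes are placeholders.

```python
import numpy as np
from scipy import sparse
from sklearn.random_projection import GaussianRandomProjection

num_docs, vocab_size = 20_000, 50_000

# Sparse storage: only the handful of nonzero keyword entries per document
# are kept (roughly 10 nonzeros per row in this synthetic example).
X_sparse = sparse.random(num_docs, vocab_size, density=10 / vocab_size,
                         format="csr", random_state=0)

# Random projection: replace every vector v with Av, where A is a random
# rectangular matrix, to get a much lower-dimensional dense representation.
proj = GaussianRandomProjection(n_components=128, random_state=0)
X_small = proj.fit_transform(X_sparse)        # (num_docs x 128) dense matrix

# Cosine similarities against a query are now much cheaper to compute.
q = X_small[0]
sims = X_small @ q / (np.linalg.norm(X_small, axis=1) * np.linalg.norm(q) + 1e-12)
top = np.argsort(-sims)[:10]                  # indices of the 10 closest documents
```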
To clarify: by "bag-of-words technique", Alexandre means any clustering algorithm in which each document is represented by a "bag of words" vector, i.e. one with one element for every word in the vocabulary, set to 0 or 1 depending on whether or not that word occurs in that document.
Here is a quick solution: generate a keyword frequency vector (or any other kind of feature vector) for each document, pass it through a dimensionality reduction method (PCA, t-SNE, ...), then run k-means or hclust on the resulting data. That gives you a relatively good and generic document clustering tool. Some of the more advanced document clustering methods are domain specific and really depend on the amount of data and its structure.
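A sketch of that pipeline with scikit-learn, using TruncatedSVD in place of plain PCA because it accepts sparse input. `keyword_docs` is a toy placeholder where each document is its keywords joined into one string; substitute your own feature vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

keyword_docs = [
    "clustering text k-means",
    "databases indexing storage",
    "clustering similarity text",
]

X = TfidfVectorizer().fit_transform(keyword_docs)          # sparse keyword feature matrix
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)  # dimensionality reduction
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_reduced)
print(labels)                                              # cluster assignment per document
```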
The k-means algorithm can work well for document clustering if the total number of words is relatively small. It helps to remove stop words (words that occur very frequently in the data set). I've tried scoring the terms in a set of documents using TF-IDF and then clustering the resulting feature vectors with k-means, and gotten decent results. The downside to this approach is that you need a good idea of the number of categories in your data set beforehand.

As an aside, I've found that using Particle Swarm Optimization as a preprocessor to k-means can greatly improve the accuracy of the results. Fair warning: this is an advanced approach and will most likely be unnecessary for beginners.

It can also be very helpful to "seed" the k-means algorithm with initial centroids as a form of semi-supervised learning. That is, choose a number of documents that best represent the different categories in your data set (even if these documents don't occur in your set) and use their TF-IDF feature vectors as your initial k-means centroids.
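Here is a small sketch of the seeding idea from the last paragraph, assuming scikit-learn. The `corpus` and `seeds` strings are made-up placeholders, with one hand-picked representative document per expected category.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

corpus = [
    "clustering text documents with k-means",
    "stop words and tf idf weighting",
    "database indexing and storage engines",
    "query planning in relational databases",
]
# One hand-picked representative document per expected category.
seeds = [
    "clustering tf idf text",
    "relational databases indexing",
]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus).toarray()
seed_centroids = vec.transform(seeds).toarray()

# n_init=1 because the starting centroids are fixed rather than random.
km = KMeans(n_clusters=len(seeds), init=seed_centroids, n_init=1)
labels = km.fit_predict(X)
print(labels)
```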