Let's say I have several hundred million documents, which are very short (only a few words). There are several million terms in the vocabulary. What is the fastest way to find the top-k semantically related terms for each term in the vocabulary?
When I say fastest, I mean that it should take under a week of computation time, and as little human time as possible. So use of existing implementations is encouraged.
Edit: At the behest of commentators, this problem is now an actual challenge, with a dataset, you can play with.
I am just giving a sugestion here, but as far as I remember Topic Models by Blei are quite good at detecting topics from a corpus of documents, I am not in my lab, so I do not remember the order of magnitude, you might want to give a look over Topic Models, which in essence are an application of Dirichlet Processes or Hiercharchichal Dirichlet Processes. I am not sure about the computational time though, there are a couple of implementations aout there, as soon as I get to my lab I will put the link
answered Nov 04 '10 at 05:28
Leon Palafox ♦
The best way to solve such monstrous problems is probably to avoid them completely: use known dictionaries/thesauri/someone else's results; rephrase the problem (aka the million dollar question of "what is semantics, anyway?"); rethink the constraints on the solution (incremental updates ok? what happens if we get the answer wrong?).
Assuming that's not an option :), I believe you'll have to use randomized algorithms in one form or another (i.e. the hashing trick, or subsampling and sketching) to curb either the number of features or the number of observations, or both. One option is to use the Vector Space Model and model term "semantics" through term co-occurence. Ready implementations of this type include the pair-wise similarities in MAHOUT.
On such data scales, many improvements over the standard bag-of-words are not feasible (including Latent Dirichlet Allocation: it's not really there yet... by a long shot). Latent Semantic Analysis and (especially) Random Indexing can be done, using the aforementioned hashing trick. In theory, there are many options but the only sane and scalable online implementations I know of are gensim (Python) and Apache MAHOUT (Java).
But perhaps such techniques are an overkill in your case; if you describe your problem domain in more detail, the suggestions can be more concrete.
answered Nov 04 '10 at 13:15
A good way of defining semantic similarity is by using concept of analogy, I would suggest following paper to understand definition of similar concepts:
Robert Speer, Catherine Havasi, and Henry Lieberman. AnalogySpace: Reducing the Dimensionality of Common Sense Knowledge, AAAI 2008 
Using a network of common sense concepts linked to each other via simple assertions such as IsA or HasA etc and Sparse Singular Value decomposition, an Analogy Space can be constructed. Finding similar concepts is as simple as finding dot products between vectors in analogy space.
You can construct a simple network using co-occurrence of terms from your data, may be you can add data from Conceptnet (common sense) + Wordnet (lexicon) + Open Cyc (common sense) + DBpedia (Real world entities) . + Freebase (Real world entities) .
The divisi package used to compute such similarities is open source and can be found from http://csc.media.mit.edu/analogyspace
Though after around a million nodes one runs into problems, a single SVD model cannot approximate correctly. I proposed a solution in which the network is divided into multiple communities (overlapping / non overlapping) using community detection algorithms from field of network analysis and complex systems. One can then create models for each community and get significantly better results. Note that this idea can be applied to all sorts of networks such as Social Networks. Another advantage is that the if a concept exists in multiple communities, then one can get different similar terms, which represent different meaning of that concept.
I presented this Idea at International Semantic Web Conference last year as a poster and I am currently working on it. 
 Robert Speer, Catherine Havasi, and Henry Lieberman. AnalogySpace:
Reducing the Dimensionality of Common Sense Knowledge, AAAI 2008
 Akshay Bhat. Analogy Engines for The Semantic Web, ISWC 2009 http://kcap09.stanford.edu/share/posterDemos/130/paper130.pdf
answered Nov 06 '10 at 03:34
Isn't Dekang Lin's distributional similarity thesaurus-building stuff the best for this? From the late-90's.
If I had to bet, I'd bet they do this much better than the sorts of models currently popular in the machine learning community. Watching topic model people try to do this is kind of painful. Or NN's, for that matter.