Let's say I have several hundred million documents, which are very short (only a few words). There are several million terms in the vocabulary. What is the fastest way to find the top-k semantically related terms for each term in the vocabulary?

When I say fastest, I mean that it should take under a week of computation time, and as little human time as possible. So use of existing implementations is encouraged.

Edit: At the behest of commentators, this problem is now an actual challenge, with a dataset, you can play with.

asked Nov 04 '10 at 03:30

Joseph%20Turian's gravatar image

Joseph Turian ♦♦

edited Nov 05 '10 at 21:38


I think that would be a great challenge to publish on your blog. Are the documents freely re-distributable?

(Nov 04 '10 at 06:42) ogrisel

Not to totally hijack this thread, but can I +1 the challenge? I bet anyone (else?) working on twitter/chat status/wall post data would get out of their chairs for it.

(Nov 04 '10 at 17:46) Andrew Rosenberg

Funny, I was going to post a challenge on my blog today of a slightly different nature: Given 10M strings, doing fuzzy string matching and cluster very similar strings, on a single machine. Anyway, I think I'll post this challenge task.

(Nov 05 '10 at 10:07) Joseph Turian ♦♦

Both challenges are interesting IMHO.You mght want ot use kaggle.com to handle the submission & leaderboard. However I think it's more focused on supervised learning than clustering.

(Nov 05 '10 at 10:26) ogrisel

4 Answers:

I am just giving a sugestion here, but as far as I remember Topic Models by Blei are quite good at detecting topics from a corpus of documents, I am not in my lab, so I do not remember the order of magnitude, you might want to give a look over Topic Models, which in essence are an application of Dirichlet Processes or Hiercharchichal Dirichlet Processes. I am not sure about the computational time though, there are a couple of implementations aout there, as soon as I get to my lab I will put the link

answered Nov 04 '10 at 05:28

Leon%20Palafox's gravatar image

Leon Palafox ♦

The best way to solve such monstrous problems is probably to avoid them completely: use known dictionaries/thesauri/someone else's results; rephrase the problem (aka the million dollar question of "what is semantics, anyway?"); rethink the constraints on the solution (incremental updates ok? what happens if we get the answer wrong?).

Assuming that's not an option :), I believe you'll have to use randomized algorithms in one form or another (i.e. the hashing trick, or subsampling and sketching) to curb either the number of features or the number of observations, or both. One option is to use the Vector Space Model and model term "semantics" through term co-occurence. Ready implementations of this type include the pair-wise similarities in MAHOUT.

On such data scales, many improvements over the standard bag-of-words are not feasible (including Latent Dirichlet Allocation: it's not really there yet... by a long shot). Latent Semantic Analysis and (especially) Random Indexing can be done, using the aforementioned hashing trick. In theory, there are many options but the only sane and scalable online implementations I know of are gensim (Python) and Apache MAHOUT (Java).

But perhaps such techniques are an overkill in your case; if you describe your problem domain in more detail, the suggestions can be more concrete.

answered Nov 04 '10 at 13:15

Radim's gravatar image


I enjoy your suggested answers, but I couldn't disagree more with this sentiment: "The best way to solve such monstrous problems is probably to avoid them completely"

(Nov 05 '10 at 10:38) Joseph Turian ♦♦

Let's agree to disagree then :)

I'd take a simple outside-of-the-box "solution" over a technological tour de force anytime.

But I see I misjudged this question, it is rather hypothetical and what-if. Good luck with the challenge though!

(Nov 06 '10 at 03:19) Radim

A good way of defining semantic similarity is by using concept of analogy, I would suggest following paper to understand definition of similar concepts:

Robert Speer, Catherine Havasi, and Henry Lieberman. AnalogySpace: Reducing the Dimensionality of Common Sense Knowledge, AAAI 2008 [1]

Using a network of common sense concepts linked to each other via simple assertions such as IsA or HasA etc and Sparse Singular Value decomposition, an Analogy Space can be constructed. Finding similar concepts is as simple as finding dot products between vectors in analogy space.

You can construct a simple network using co-occurrence of terms from your data, may be you can add data from Conceptnet (common sense) + Wordnet (lexicon) + Open Cyc (common sense) + DBpedia (Real world entities) . + Freebase (Real world entities) .

The divisi package used to compute such similarities is open source and can be found from http://csc.media.mit.edu/analogyspace

Though after around a million nodes one runs into problems, a single SVD model cannot approximate correctly. I proposed a solution in which the network is divided into multiple communities (overlapping / non overlapping) using community detection algorithms from field of network analysis and complex systems. One can then create models for each community and get significantly better results. Note that this idea can be applied to all sorts of networks such as Social Networks. Another advantage is that the if a concept exists in multiple communities, then one can get different similar terms, which represent different meaning of that concept.

I presented this Idea at International Semantic Web Conference last year as a poster and I am currently working on it. [2]

References: [1] Robert Speer, Catherine Havasi, and Henry Lieberman. AnalogySpace: Reducing the Dimensionality of Common Sense Knowledge, AAAI 2008

[2] Akshay Bhat. Analogy Engines for The Semantic Web, ISWC 2009 http://kcap09.stanford.edu/share/posterDemos/130/paper130.pdf

Akshay Bhat
Student, Cornell University

answered Nov 06 '10 at 03:34

Akshay%20Bhat's gravatar image

Akshay Bhat

Isn't Dekang Lin's distributional similarity thesaurus-building stuff the best for this? From the late-90's.

If I had to bet, I'd bet they do this much better than the sorts of models currently popular in the machine learning community. Watching topic model people try to do this is kind of painful. Or NN's, for that matter.

answered Dec 21 '10 at 00:33

brendan642's gravatar image


edited Dec 21 '10 at 00:37

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.