While we were all twiddling our thumbs, a 17-year-old Canadian boy has apparently found an information retrieval algorithm that:

a) performs with twice the precision of the current, and widely-used vector space model

b) is 'fairly accurate' at identifying similar words.

c) makes microsearch more accurate

Here is a good interview.

Unfortunately, there's no published paper I can find yet, but, from the snatches I remember from the graphical models and machine learning classes I took a few years ago, I think we should be able to reconstruct it from his submision abstract, and what he says about it in interviews.

From interview:

Some searches find words that appear in similar contexts. That’s pretty good, but that’s following the relationships to the first degree. My algorithm tries to follow connections further. Connections that are close are deemed more valuable. In theory, it follows connections to an infinite degree.

And the abstract puts it in context:

A novel information retrieval algorithm called "Apodora" is introduced, using limiting powers of Markov chain-like matrices to determine models for the documents and making contextual statistical inferences about the semantics of words. The system is implemented and compared to the vector space model. Especially when the query is short, the novel algorithm gives results with approximately twice the precision and has interesting applications to microsearch.

I feel like someone who knows about markov-chain-like matrices or information retrieval would immediately be able to realize what he's doing.

So: what is he doing?

asked Aug 06 '11 at 11:15

aditi's gravatar image

aditi
85072033

edited Aug 06 '11 at 11:18

Also, while this is a fantastic project for a 17-yo, and I only wish I was doing this kind of stuff at his age, the odds that there's really anything that can't be easily replicated/improved in his algorithm are very very small.

(Aug 06 '11 at 14:00) Jacob Jensen

I unfortunately agree with Jacob: unless someone out there is making money out of this it's probably a fluke if there's no paper, as it's quite possible to beat the state-of-the-art on toy/small data.

I also agree with Jacob that it sounds a lot like he's defining a markov chain between words somehow and using it to derive similarity measures or something like this. The one thing that doesn't make sense is that smoothing things out (which is what similarity measures do) usually reduces the precision at the expense of increasing recall substantially.

(Aug 06 '11 at 15:20) Alexandre Passos ♦

3 Answers:

From the use of words like 'context' and the fact that he's introduced a second order level of statistical dependency, I suspect he is doing something related to the LDA-HMM method outlined in the paper: Griffiths, T., Steyvers, M., Blei, D., & Tenenbaum, J. (2005). Integrating topics and syntax. Advances in Neural Information Processing Systems. There are some inherent limits to the resolution of the search due to model averaging. However, I'm envious of doing stuff like this at 17 and I hope to heck he's done something independent and at least incrementally better. Even a different direction on the same topic would be pretty cool.

answered Aug 08 '11 at 17:04

Aengus%20Robinson's gravatar image

Aengus Robinson
21551114

Sounds like he's doing clustering based on exploration and word adjacency. This could easily be a slightly re-weighted form of spectral word adjacency clustering. Assemble a giant word adjacency matrix, weighted by proximity, and apply either modularity, spectral or other methods to find clusters and get semantic similarity? But really, everything is too vague.

answered Aug 06 '11 at 12:06

Jacob%20Jensen's gravatar image

Jacob Jensen
1644285360

edited Aug 06 '11 at 16:05

There is not much information to go on, but he could be doing something similar to the methods in this paper:

We present a novel approach to query reformulation which combines syntactic and semantic information by means of generalized Levenshtein distance algorithms where the substitution operation costs are based on probabilistic term rewrite functions.

where "query reformulation" mainly refers to measuring short string similarity.

answered Aug 07 '11 at 16:46

Daniel%20Mahler's gravatar image

Daniel Mahler
8462912

edited Aug 07 '11 at 22:50

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.