|
While we were all twiddling our thumbs, a 17-year-old Canadian boy has apparently found an information retrieval algorithm that: a) performs with twice the precision of the current, and widely-used vector space model b) is 'fairly accurate' at identifying similar words. c) makes microsearch more accurate Here is a good interview. Unfortunately, there's no published paper I can find yet, but, from the snatches I remember from the graphical models and machine learning classes I took a few years ago, I think we should be able to reconstruct it from his submision abstract, and what he says about it in interviews. From interview:
And the abstract puts it in context:
I feel like someone who knows about markov-chain-like matrices or information retrieval would immediately be able to realize what he's doing. So: what is he doing? |
|
From the use of words like 'context' and the fact that he's introduced a second order level of statistical dependency, I suspect he is doing something related to the LDA-HMM method outlined in the paper: Griffiths, T., Steyvers, M., Blei, D., & Tenenbaum, J. (2005). Integrating topics and syntax. Advances in Neural Information Processing Systems. There are some inherent limits to the resolution of the search due to model averaging. However, I'm envious of doing stuff like this at 17 and I hope to heck he's done something independent and at least incrementally better. Even a different direction on the same topic would be pretty cool. |
|
Sounds like he's doing clustering based on exploration and word adjacency. This could easily be a slightly re-weighted form of spectral word adjacency clustering. Assemble a giant word adjacency matrix, weighted by proximity, and apply either modularity, spectral or other methods to find clusters and get semantic similarity? But really, everything is too vague. |
|
There is not much information to go on, but he could be doing something similar to the methods in this paper:
where "query reformulation" mainly refers to measuring short string similarity. |
Also, while this is a fantastic project for a 17-yo, and I only wish I was doing this kind of stuff at his age, the odds that there's really anything that can't be easily replicated/improved in his algorithm are very very small.
I unfortunately agree with Jacob: unless someone out there is making money out of this it's probably a fluke if there's no paper, as it's quite possible to beat the state-of-the-art on toy/small data.
I also agree with Jacob that it sounds a lot like he's defining a markov chain between words somehow and using it to derive similarity measures or something like this. The one thing that doesn't make sense is that smoothing things out (which is what similarity measures do) usually reduces the precision at the expense of increasing recall substantially.