Summary
In the spirit of shared tasks and NLP “bake offs”, I hereby announce the first MetaOptimize Challenge. It’s an open problem, and I am interested in involving practitioners who want to demo their style, as well as people who want to learn some large-scale IR/NLP. Hopefully, we’ll all learn something about various real-world approaches.
Join the announcement list to hear about any developments or important announcements.
Join the discuss list to chat about techniques and approaches.
I also have an ulterior motive.
The Problem
Let’s say I have several ten or hundred million documents, which are very short (only a few words). There are several million word types in the vocabulary. What is the fastest way to find the top-k (say k=10) semantically related words for each word in the vocabulary?
“Semantically related” is purposefully left vague.
When I say fastest, I mean that it should take under a week of computation time, and as little human time as possible. So use of existing implementations is encouraged. Single machine or righteously parallel solutions will both be considered, as long as your approach works and you demo it, preferably in the next two weeks.
Background
I brought up this question on MetaOptimize Q+A: Find semantically related terms over a large vocabulary (>1M)? I had some ideas in mind. But I wanted to hear about other ideas.
Olivier Grisel and Andrew Rosenberg commented on my question, suggesting I post this as a public challenge. So here goes. I hope people participate.
Why this is cool?
Here is one potential application:
Increased insight into emerging topics, trends, and new products. Run this on social media updates (Facebook posts, Tweets) after collecting sufficient mentions of a topic, trend or product, and have insight more insight into what is being discussed.
Coming up with other applications is an left as exercise for the reader.
Problem Details
Here is a sample dataset, for development (6.7 million documents, 40 MB gzipped). There is one document per line. Each word is separated by a space:
abbey seal abbey seekers abbey series
Here is the sample vocabulary file, in decreasing order of frequency (1.1 million word types, 4.5 MB gzipped).
The first column is the frequency and the second column is the word.
There might be words in the dataset that are not in the vocabulary:
32972 group 31998 research 30820 information 30090 uk 29721 10 29665 london
I will soon post a larger dataset and vocabulary file, and announce it on the –announce mailing list.
The desired output you produce is a file with eleven columns.
The first column should be identical to the second column of the vocabulary file. There will be as many lines as there are in the vocabulary file. The next ten columns should be the ten most related words, in descending order of relevance:
group groups working research support pm steering ltd pvc advisory age research researchers centre group researcher project | programme institute unit council
The challenge is, within two week, to post a full output file for the larger dataset. By “full” I mean there is one line for every vocabulary word.
How will you evaluate it?
You should explain why your solution is correct. There is no “right” answer. Specifying evaluation pretty much determines the solution, as Alexandre Passos says (p.c.).
Honestly, being able to define the problem and justify your answer is half the puzzle.
Edit: For any submission, I will post for a random subset of vocab words each entry’s 10 related terms. I’ll then ask people to vote blind. This is a reasonable technique for quantitative evaluation.
Why is it hard?
First you need to take each word, and define a similarity measure between words, depending upon their usage. You need to define this similarity measure over an appropriate feature vector for each word, and choosing a good feature vector is not necessarily obvious.
Second of all, you need to do fast retrieval of the ten most similar words. But if you look at all 1M * 1M pairs, that’s 1 trillion comparisons.
Challenge Details
In a few days, I’m going to write a small post discussing the problem, and possible approaches. I will also point to existing open-source code that can perhaps solve the problem, so that skilled engineers have enough information to put together a working implementation, even if they have no background in NLP/IR. If you write up a good solution on your own blog or on the MetaOptimize Q+A forum, I’ll mention it in my blog post.
If you have a solution, please share it within, say, two weeks (Friday, November 19th). Share your full result file, or send it to me and I’ll put it on s3.
Join the announcment list to hear about any developments or important announcements.
Join the discuss list to chat about techniques and approaches.
Data Set
The data set is unique terms that occur in a crawl of .uk.
I took the UKWAC web-as-corpus crawl (2 billion words, crawled in 2008), ran it through the splitta sentence splitter, removed all funny characters, ran the Penn treebank word tokenizer, and perform term extraction with topia.termextract, discarding terms that are single words:
./sentencesplit.py | remove-nonascii-characters.pl | ~/dev/common-scripts/tokenizer.sed | ./topiaterms.py | gzip –c > ukwac-allmultiwordterms.txt.gz
I then lowercased the terms, sorted them, and uniqued them, to give the dataset:
zcat ukwac-allmultiwordterms.txt.gz | remove-nonascii-characters.pl | perl –né ‘print lc($_);’ | sort | uniq | gzip –c > ukwac-uniqmultiwordterms.txt.gz
Finally, I constructed the vocabulary from the unique terms, to give the vocabulary:
zcat ukwac-uniqmultiwordterms.txt.gz | perl –né ‘s/ /\n/g; print’ | sort | uniq –c | sort –rn | gzip –c > ukwac-vocabulary.txt.gz
Ulterior motive
If you do this, I have more exciting project work for you, and can pay. This is very similar to the style of interview question I ask, and it’s also very similar to the sort of work I do. So if you can hack it, you’re basically my ideal choice for a collaborator right now.