NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?

Summary

In the spirit of shared tasks and NLP “bake-offs”, I hereby announce the first MetaOptimize Challenge. It’s an open problem, and I am interested in involving practitioners who want to demo their style, as well as people who want to learn some large-scale IR/NLP. Hopefully, we’ll all learn something about various real-world approaches.

Join the announcement list to hear about any developments or important announcements.

Join the discuss list to chat about techniques and approaches.

I also have an ulterior motive.


The Problem

Let’s say I have several tens or hundreds of millions of documents, each very short (only a few words). There are several million word types in the vocabulary. What is the fastest way to find the top-k (say k=10) semantically related words for each word in the vocabulary?

“Semantically related” is purposefully left vague.

When I say fastest, I mean that it should take under a week of computation time, and as little human time as possible. So use of existing implementations is encouraged. Single-machine or righteously parallel solutions will both be considered, as long as your approach works and you demo it, preferably in the next two weeks.


Background

I brought up this question on MetaOptimize Q+A: Find semantically related terms over a large vocabulary (>1M)? I had some ideas in mind. But I wanted to hear about other ideas.

Olivier Grisel and Andrew Rosenberg commented on my question, suggesting I post this as a public challenge. So here goes. I hope people participate.


Why is this cool?

Here is one potential application:
Increased insight into emerging topics, trends, and new products. Run this on social media updates (Facebook posts, tweets) after collecting sufficient mentions of a topic, trend, or product, and gain more insight into what is being discussed.

Coming up with other applications is left as an exercise for the reader.


Problem Details

Here is a sample dataset, for development (6.7 million documents, 40 MB gzipped). There is one document per line. Each word is separated by a space:

abbey seal
abbey seekers
abbey series

Here is the sample vocabulary file, in decreasing order of frequency (1.1 million word types, 4.5 MB gzipped).
The first column is the frequency and the second column is the word.
There might be words in the dataset that are not in the vocabulary:

  32972 group
  31998 research
  30820 information
  30090 uk
  29721 10
  29665 london
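
For concreteness, here is a minimal sketch of how one might stream both files in Python; the filenames are placeholders for whatever you download:

    import gzip

    # Placeholder filenames; substitute the actual downloads.
    DATASET = "dataset.txt.gz"
    VOCABULARY = "vocabulary.txt.gz"

    def documents(path=DATASET):
        """Yield each document as a list of word tokens (one document per line)."""
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                yield line.split()

    def vocab_entries(path=VOCABULARY):
        """Yield (word, frequency) pairs in decreasing order of frequency."""
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                freq, word = line.split(None, 1)
                yield word.strip(), int(freq)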

I will soon post a larger dataset and vocabulary file, and announce it on the -announce mailing list.

The desired output you produce is a file with eleven columns.
The first column should be identical to the second column of the vocabulary file. There will be as many lines as there are in the vocabulary file. The next ten columns should be the ten most related words, in descending order of relevance:

group groups working research support pm steering ltd pvc advisory age
research researchers centre group researcher project | programme institute unit council
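
If you already have a ranked neighbour list for each vocabulary word, emitting this format is trivial. A minimal sketch, where `vocab` and `related` are hypothetical names you would fill in from your own pipeline:

    # `vocab`: list of vocabulary words, in the order of the vocabulary file.
    # `related`: dict mapping each word to its ten most related words, in
    # descending order of relevance. Both are assumed here, not provided.
    with open("output.txt", "w") as out:
        for w in vocab:
            out.write(" ".join([w] + related[w][:10]) + "\n")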

The challenge is, within two weeks, to post a full output file for the larger dataset. By “full” I mean there is one line for every vocabulary word.


How will you evaluate it?

You should explain why your solution is correct. There is no “right” answer. Specifying evaluation pretty much determines the solution, as Alexandre Passos says (p.c.).

Honestly, being able to define the problem and justify your answer is half the puzzle.

Edit: For any submission, I will post the 10 related terms for a random subset of vocabulary words, and then ask people to vote blind. This is a reasonable technique for quantitative evaluation.


Why is it hard?

First, you need to take each word and define a similarity measure between words, depending upon their usage. You need to define this similarity measure over an appropriate feature vector for each word, and choosing a good feature vector is not necessarily obvious.
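
As one concrete instantiation (an arbitrary choice, not a prescription): use each word's within-document co-occurrence counts as its sparse feature vector, and compare words with cosine similarity. A minimal sketch, reusing the `documents()` reader sketched earlier:

    from collections import Counter, defaultdict
    import math

    # One possible feature choice: for each word, count the other words
    # it shares a document with.
    vectors = defaultdict(Counter)
    for doc in documents():
        for w in doc:
            for other in doc:
                if other != w:
                    vectors[w][other] += 1

    def cosine(u, v):
        """Cosine similarity between two sparse count vectors."""
        if len(v) < len(u):
            u, v = v, u  # iterate over the shorter vector
        dot = sum(c * v[f] for f, c in u.items() if f in v)
        norm_u = math.sqrt(sum(c * c for c in u.values()))
        norm_v = math.sqrt(sum(c * c for c in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0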

Second, you need to do fast retrieval of the ten most similar words. If you naively compare all 1M * 1M pairs, that's 1 trillion comparisons.
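
This is where approximate methods such as locality-sensitive hashing earn their keep. A minimal sketch of the random-hyperplane variant: project each feature vector onto a few random directions and keep only the signs, so that cosine-similar words get bit signatures with small Hamming distance, and only colliding signatures need exact comparison. For clarity this uses dense NumPy arrays; at a million-word vocabulary you would need sparse projections, and the bit width (64 here) is an arbitrary choice:

    import numpy as np

    def signatures(matrix, n_bits=64, seed=0):
        """matrix: (n_words, n_features) feature matrix.
        Returns an (n_words, n_bits) boolean signature matrix."""
        rng = np.random.default_rng(seed)
        hyperplanes = rng.standard_normal((n_bits, matrix.shape[1]))
        return matrix @ hyperplanes.T > 0

    def hamming(a, b):
        """Hamming distance between two signatures; it grows with the
        angle between the underlying feature vectors."""
        return np.count_nonzero(a != b)

Ravichandran, Pantel, and Hovy (ACL 2005) scale exactly this scheme to noun clustering by sorting the signatures under random permutations of the bit order; the paper comes up again in the comments below.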


Challenge Details

In a few days, I'm going to write a small post discussing the problem and possible approaches. I will also point to existing open-source code that can perhaps solve the problem, so that skilled engineers have enough information to put together a working implementation, even if they have no background in NLP/IR. If you write up a good solution on your own blog or on the MetaOptimize Q+A forum, I'll mention it in my blog post.

If you have a solution, please share it within, say, two weeks (Friday, November 19th). Share your full result file, or send it to me and I'll put it on S3.

Join the announcement list to hear about any developments or important announcements.

Join the discuss list to chat about techniques and approaches.


Data Set

The data set consists of the unique terms that occur in a crawl of the .uk domain.

I took the UKWAC web-as-corpus crawl (2 billion words, crawled in 2008), ran it through the splitta sentence splitter, removed all funny characters, ran the Penn treebank word tokenizer, and performed term extraction with topia.termextract, discarding terms that are single words:


./sentencesplit.py | remove-nonascii-characters.pl | ~/dev/common-scripts/tokenizer.sed | ./topiaterms.py | gzip -c > ukwac-allmultiwordterms.txt.gz

I then lowercased the terms, sorted them, and uniqued them, to give the dataset:


zcat ukwac-allmultiwordterms.txt.gz | remove-nonascii-characters.pl | perl -ne 'print lc($_);' | sort | uniq | gzip -c > ukwac-uniqmultiwordterms.txt.gz

Finally, I constructed the vocabulary file from the unique terms:

zcat ukwac-uniqmultiwordterms.txt.gz | perl -ne 's/ /\n/g; print' | sort | uniq -c | sort -rn | gzip -c > ukwac-vocabulary.txt.gz



Ulterior motive

If you do this, I have more exciting project work for you, and can pay. This is very similar to the style of interview question I ask, and it's also very similar to the sort of work I do. So if you can hack it, you're basically my ideal choice for a collaborator right now.

  • http://twitter.com/atpassos_ml/status/719219356344321 Alexandre Passos

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://twitter.com/hntweets/status/721546804985856 Hacker News

    NLP Challenge: Find semantically related terms over a large vocabulary (1M): http://bit.ly/daFNyD Comments: http://bit.ly/9H94uY

  • http://twitter.com/turian/status/723941857427456 Joseph Turian

    NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://twitter.com/albertzeyer/status/724911685373953 Albert Zeyer

    MetaOptimize NLP Challenge: Find semantically related terms over a large vocabulary (1M) http://goo.gl/RDvcw

  • http://twitter.com/newsycombinator/status/729094182404096 news.yc Popular

    NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://j.mp/aVT6ah

  • http://twitter.com/rgaidot/status/729543262343168 Régis Gaidot

    NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://t.co/XrP1KJa #nlp

  • http://twitter.com/frikifeeds/status/729960083886080 Tech & Freak Feeds

    NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://dlvr.it/85fZK

  • http://twitter.com/earlkman/status/730446237274112 Emiliano Kargieman

    RT @newsycombinator: NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://j.mp/aVT6ah

  • http://twitter.com/hackernewsyc/status/731183423946752 Hacker News YC

    NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://goo.gl/fb/OOFfe

  • http://twitter.com/myikegami_bot/status/731392270925825 m.y.ikegami_bot

    NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://goo.gl/fb/sJKNx

  • http://twitter.com/spencertipping/status/731786137051136 Spencer Tipping

    RT @rgaidot: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://t.co/XrP1KJa #nlp

  • http://twitter.com/hackernws/status/734141196795904 hackernews

    NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://bit.ly/bqvIPw

  • http://twitter.com/josephjay/status/734141192601601 Joe

    NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://bit.ly/cyj8SG

  • http://twitter.com/spencertipping/status/734347149713408 Spencer Tipping
  • http://twitter.com/bartezzini/status/735043248979968 bartezzini

    NLP Challenge: Find semantically related terms over a large vocabulary (1M)?: Comments http://goo.gl/fb/DpM29

  • http://twitter.com/jbrownlee/status/736749386989568 Jason Brownlee

    nlp challenge: find semantically related terms over a large vocabulary (>1m)? http://bit.ly/cjv2pE

  • stuntgoat

    “The desired output you product is a file with eleven columns.”
    You mean _product_, not product.

  • stuntgoat

    _I_ mean _produce_, not product.

  • http://twitter.com/alexbowe/status/746093440663552 Alex Bowe

    I might do this RT @newsycombinator: NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://j.mp/aVT6ah

  • http://twitter.com/matthewsinclair/status/748126705033216 Matthew Sinclair

    @mat_kelcey This sounds like you — “NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?” http://bit.ly/bJKvhR

  • stuntgoat

    Hi! Thanks for your challenge. I am interested in your work.

    When you vaguely say ‘semantically related’, do you mean that each vocabulary-list word should be related in regard to the other words within the same ‘document’ (‘document’ in this case refers to the same line)? If so, how can I get the 10 most semantically meaningful words that are related to a vocabulary-list word that only appears once in the entire dataset and shares fewer than 10 words within that single ‘document’?

    My guess is: for vocabulary-list words that only appear 1 to maybe 5 times in the dataset, there are not enough other words within all ‘documents’ to reach a total of 10, let alone words that appear more than once, so as to be semantically more meaningful. I hope this makes sense, because I am curious how you would like to handle these situations.

    PS ‘~~~~~~~’ is said to occur 2 times according to the sample vocabulary-list but it only occurs once in the dataset. Which document has the error?

  • http://twitter.com/milesosborne Miles Osborne

    Isn't this almost exactly the same as using locality sensitive hashing?

    http://www.isi.edu/natural-language/people/hovy/papers/05ACL-clustering.pdf

    P05-1077: Deepak Ravichandran; Patrick Pantel; Eduard Hovy
    Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering

  • Anonymous

    What does “discarding terms that will single words” mean? Is “single” being used as a verb in that context?

  • Anonymous

    What happened to “the” and other common words? Are they removed during term extraction?

    Their presence might be needed for some approaches.

  • Anonymous

    Are you allowing use of outside data sources, such as WordNet? If yes, then that means I could also use other corpora as well, right? How about Mechanical Turk?

  • http://twitter.com/gilesgoatboy/status/755682005487616 Giles

    @disqus guys please filter RTs for uniqueness. http://bit.ly/czmPqt

  • http://twitter.com/lakshminp/status/762783687778304 Lakshmi Narasimhan

    RT @newsycombinator: NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://j.mp/aVT6ah

  • http://twitter.com/jacobrothstein/status/766241220333569 Jacob Rothstein

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://twitter.com/abhishektiwari/status/766601460715520 Abhishek Tiwari

    RT @newsycombinator: NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://j.mp/aVT6ah

  • http://twitter.com/yzli_pictional/status/768354130337792 YZ Li

    RT @newsycombinator: NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://j.mp/aVT6ah

  • http://twitter.com/brentcappello/status/779509032820736 Brent Cappello

    NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://bit.ly/cyj8SG

  • http://twitter.com/doomie/status/785468723564544 doomie

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • KN

    My approach, which I have posted as a comment on Hacker News:
    http://news.ycombinator.com/item?id=1876651

  • http://twitter.com/josephdung/status/794846411358208 Joseph Dung

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://twitter.com/2angle/status/798147890642944 shimo

    NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? — http://bit.ly/aVT6ah

  • http://twitter.com/newsyc50/status/815916010967040 Hacker News 50

    NLP Challenge: Find semantically related terms over a large vocabulary (1M http://bit.ly/aymHcR (http://bit.ly/cD4oIH) #guru

  • http://twitter.com/vnce/status/818003180519424 Vincent Lacey

    this looks like fun. @wasauce RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M) http://bit.ly/cGOkKf

  • http://twitter.com/ogrisel/status/841227956133888 Olivier Grisel

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://twitter.com/nono2357/status/857093200412672 nono2357

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://twitter.com/kaule/status/858574808944640 Alvin Kaule

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://twitter.com/bncolorado/status/868837113200640 Borja N Colorado

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://twitter.com/tophackernews/status/872564897357824 Top Hacker News

    NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://bit.ly/bNBa77 http://ff.im/-tgwWU

  • kevin

    Will you share the results for all of us to use in our own projects?

  • http://metaoptimize.com Joseph Turian

    Very creative. You can use whatever data you like. This is supposed to be like the “real world”, not an academic exercise, and in practice getting more information is half the battle.

    You don't even have to use the provided training data if you don't want to.

  • http://twitter.com/algoriffic/status/902393428447232 Anthony

    NLP challenge to find semantically related words: http://j.mp/crGWur

  • http://twitter.com/crtvzen/status/904800552095744 bsm

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://metaoptimize.com Joseph Turian

    Typo that I corrected.

    topia.termextract extracts terms, which can be single or multiple words.
    I discarded all single-word terms.

  • http://metaoptimize.com Joseph Turian

    topia.termextract tries to extract multi-word terms, and focuses on content words. If you have an additional dataset that you would like to contribute, send me an email and we can talk it over.

  • http://metaoptimize.com Joseph Turian

    Yes, LSH is one solid approach to making retrieval linear rather than quadratic.

  • http://metaoptimize.com Joseph Turian

    For words that are very rare: well, figuring out the best way to handle them is part of the challenge.

    I see ‘~~~~~~~’ occur twice:

    vegetable crisps ~~~~~~~ friday
    brrrrrrskreeooowwwwwwwwwwwwwwww ~~~~~~~

  • http://metaoptimize.com Joseph Turian

    Good typo catch, thanks.

  • http://metaoptimize.com Joseph Turian

    Of course. The whole point of this challenge is for us to share and compare techniques.

  • http://twitter.com/geeniemart/status/925335449239553 Geeniemart

    NLP Challenge: Find semantically related terms over a large … http://bit.ly/cC3AbC

  • http://twitter.com/nealrichter/status/926877589970944 nealrichter

    RT @jbrownlee: nlp challenge: find semantically related terms over a large vocabulary (>1m)? http://bit.ly/cjv2pE

  • http://twitter.com/andrewmaxr/status/938334494920704 andrewmaxr

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • Spencer Tipping

    I have a first-order finite context model solution on the training set up on Github: http://github.com/spencertipping/metaoptimize-challenge (in the fcm directory). The results generated from this are at http://spencertipping.com/metaoptimize-challenge/final-formatted.gz if anyone is interested. I doubt this is a great solution (I don't have much of an NLP background, and most words have only one similarity due to the sparsity of input), but it may be useful to someone.

  • http://twitter.com/juliengrenier/status/975637393178624 Julien Grenier

    RT @newsycombinator: NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://j.mp/aVT6ah

  • http://twitter.com/semanticvoid/status/987641646424065 anand kishore

    RT @algoriffic: NLP challenge to find semantically related words: http://j.mp/crGWur

  • http://twitter.com/mstrohm/status/989345846661120 Markus Strohmaier

    RT @algoriffic: NLP challenge to find semantically related words: http://j.mp/crGWur

  • http://twitter.com/infolandscape/status/993330129281024 Fitzgerald Analytics

    RT @algoriffic: NLP challenge to find semantically related words: http://j.mp/crGWur

  • http://twitter.com/thibaudvibes/status/996329585119232 Thibaud VIBES

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://twitter.com/caporal_/status/1091441908518912 W.

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • Lushan Han

    Your preprocessing on the training dataset suggests that you tend to find “collocations” rather than “semantically related” terms. Moreover, the unique operation on the terms makes the dataset less useful, because co-occurrence frequency counts are critical for building a reasonable similarity or association measure.

    I see the biggest challenge for this task as efficiency. Since you are only looking for “semantically related” terms, a first-order affinity like PMI or LLR might work better, and definitely faster, than distributional similarity approaches. LLR is preferable to PMI when dealing with rare words with frequency less than 5.

    The paper “DIRT – Discovery of Inference Rules from Text” addresses a similar problem with an approximation to speed up the comparison. The vocabulary size in their case is 220,000, five times smaller than your vocabulary. However, that work was done eight years ago, on a much slower machine than today's.

    Lushan Han

  • http://twitter.com/chengweiwei/status/1307481879879680 Weiwei Cheng

    NLP challenge: Find semantically related terms over a large vocabulary http://j.mp/crGWur @algoriffic @semanticvoid

  • http://twitter.com/jamborta/status/1311385451495424 Tamas Jambor

    NLP Challenge: find semantically related terms over a large vocabulary (>1M) http://bit.ly/aBWobF #mahout #ir

  • http://twitter.com/mainec/status/1314789917728768 MaineC

    RT @jamborta: NLP Challenge: find semantically related terms over a large vocabulary (>1M) http://bit.ly/aBWobF #mahout #ir

  • http://twitter.com/josephreisinger/status/1328892266680320 joseph reisinger

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://twitter.com/harittweets/status/1344161722662912 harit himanshu

    RT @newsycombinator: NLP Challenge: Find semantically related terms over a large vocabulary (1M)? http://j.mp/aVT6ah

  • http://twitter.com/mrmdesai/status/1507102640054272 mandar

    RT @algoriffic: NLP challenge to find semantically related words: http://j.mp/crGWur

  • http://twitter.com/light_caster/status/1655107695411200 Konstantin Selivanov

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://metaoptimize.com Joseph Turian

    I would be very excited if you tried an LLR approach. Ted Dunning was suggesting that style of approach to me too.
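
    For concreteness, a minimal sketch of Dunning's log-likelihood ratio over a 2x2 co-occurrence contingency table; the function names are illustrative, and k11/k12/k21/k22 are the usual both/left-only/right-only/neither document counts:

        import math

        def xlogx(x):
            return x * math.log(x) if x > 0 else 0.0

        def entropy(*counts):
            return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

        def llr(k11, k12, k21, k22):
            # Dunning's G^2: k11 = docs with both words, k12/k21 = docs
            # with one word only, k22 = docs with neither.
            row = entropy(k11 + k12, k21 + k22)
            col = entropy(k11 + k21, k12 + k22)
            mat = entropy(k11, k12, k21, k22)
            return 2.0 * max(0.0, row + col - mat)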

  • Mikes

    I like this problem, but I think it would be better with the following adjustments:

    1) The data isn't clean. I'd love to throw out the numbers (at least the low ones) and the extra characters like dashes. If the goal is to find semantic meaning, you're adding more noise than value by including them.

    2) Leaving the goal of “semantically defined” vague doesn't really give us much to shoot at. I went for co-occurrence and produced reasonable results. Another person interpreted it as “similar” and produced something different. What's the goal here? Any good algorithm needs a spec.

    3) Is the goal really to find semantic relationships based on YOUR data, or just to find semantic relationships based on web-mined data? The data that you zipped up isn't terrible, but it's surprisingly noisy. We can find better data on Twitter, Wikipedia, or most other crawls that I've seen.

    The goal here is good. The setup is not.

  • http://twitter.com/karanjude/status/2111921247494145 karanjude

    NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? – MetaOptimize http://icio.us/ot4na0

  • http://metaoptimize.com Joseph Turian

    *The data isn't clean.*

    Welcome to the real world.

    *I'd love to throw out the numbers (at least the low ones) and the extra characters like dashes.*

    You can do that if you like.

    *Leaving the goal of “semantically defined” vague doesn't really give us much to shoot at.*

    Part of this task is to see how people define the problem. Part of the exercise is learning through evaluation and looking at people's outputs. Yeah, it's less well-defined and less academic that way. I find that more interesting.

    I think it will also be interesting to see if there is a mismatch between people's interpretations of what “semantically related” means and which methods produce a certain interpretation.

    *Is the goal really to find semantic relationships based on YOUR data, or just to find semantic relationships based on web-mined data?*

    The goal is to find semantic relationships over a particular vocabulary. You can use the data set that generated that vocabulary. And/or you can use auxiliary data.

    But I apologize to the extent that you don't like the setup. This is my first time running a challenge and I'm trying to learn for next time.

  • http://mobilei.tk/?p=525 Strata Week: Life, by the numbers | mobilei.tk

    […] of MetaOptimize has announced a competition in the field of natural language processing (NLP). The challenge is to construct a method for finding the top semantically-related terms over a vocabulary of […]

  • http://twitter.com/socalsue2/status/3260846453039104 Susan

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf #semweb

  • http://twitter.com/sbos/status/4653613326540800 Sergey Bartunov

    Less than two days remain until the end of this wonderful and instructive competition: http://bit.ly/aiwZdn And I have so many silly things to do

  • http://twitter.com/sauravsahay/status/5044073282928640 Saurav Sahay

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/cGOkKf

  • http://blog.ethanjfast.com/2010/11/nlp-challenge/ NLP Challenge | Ethan Fast

    […] out seven grad school applications, I found enough time to participate in Joseph Turian's NLP Challenge. Basically, the idea is that given a vocabulary and a large set of small documents, you need to […]

  • Anonymous

    “the” is a content word when it is missing. In other words, a text without “the” is not a typical document at all. The removal of so-called “non-content” words from all documents takes some good techniques off the table.

  • http://twitter.com/sznyelveszet/status/7105123549454336 Gépész Nyelvész

    NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://bit.ly/fAds4p If you have no work to do, jump on it!

  • http://twitter.com/zybler/status/8002514091773952 Hao Wooi Lim

    #nlp contest has drawn to an end a week ago, any news about the results? Who won, the approach used, etc http://goo.gl/nswzt

  • http://twitter.com/nokuno/status/8003232265666560 Yoh Okuno

    RT @zybler: #nlp contest has drawn to an end a week ago, any news about the results? Who won, the approach used, etc http://goo.gl/nswzt

  • Guillaume Pitel

    I found your challenge too late to participate, but I can still answer it:
    1 — creating good features is probably the hardest part. Several possibilities are open for such a large-scale problem: local embedding with thematic dimensions (provided you can gather a few thematic collections), HOOI for the PCA, or you could use my (still undisclosed) method, whose results are described here: http://blog.guillaume-pitel.fr/index.php?post/2010/07/My-neighbours-are-nicer-than-yours-%3A%29
    2 — finding the top-K nearest neighbours can easily be done (I think) with LSH or KD-trees

    As for the speed of feature extraction, I once ran this experiment: 500K words as vocabulary, 20 * 40M documents (they were word windows); using my method, it took approx. 30 min on a quad core with a GeForce 280 (I was also experimenting with GPGPU computing).

  • Griffin

    So, did anything ever happen with this? I checked out both lists mentioned above, but found nothing. No new blog posts pertaining to this challenge either.

  • http://probreasoning.wordpress.com/2011/01/16/hello-world/ Related words in a corpus | probreasoning

    […] will attempt to describe a simple approach to find related words in a large corpus. I follow the problem described by Joseph Turian. We have documents and a total of unique words among […]

  • https://probreasoning.wordpress.com/2011/01/16/hello-world/ probreasoning

    I used a very simple LLR-type approach, described at https://probreasoning.wordpress.com/2011/01/16/hello-world/

  • Abiya Veni

    I want information about how to identify semantically similar words.

  • Thilagaranim

    I need a database to find semantically related verbs.

  • http://twitter.com/enarduin/status/73421843792527361 eNarduin

    RT @turian: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)? http://t.co/5jZkiqp
