Hi all,

I am a newcomer here; nice to meet you all. As far as I know, IR focuses more on implementation and ML focuses more on theory. When you read SIGIR papers, many of them tackle interesting problems and the dataset is often new, but the solution is usually not that novel or impressive. For ML, on the other hand, the first image that jumps into my head is a huge pile of math equations and strange symbols.

That is the difference as I see it (my opinion is not mature). In recent years, however, the border between the two has become less clear. Applying ML methods (such as classification and clustering) in IR papers is very common, even when the authors do not seem to know why those sophisticated algorithms apply. Such papers do not discuss why they use ML; they just use it. If you can add some complicated math to your paper, you may have a higher chance of being accepted.

What I want to ask is: what is the most obvious difference between IR and ML? For a student in IR, how much ML knowledge is sufficient? Under which circumstances is ML needed in IR?

I hope my question is not too ambiguous to answer.

Best regards, Eric

asked Nov 30 '11 at 05:10

ericzhao


2 Answers:

I think the line is blurring with emerging topics such as Learning to Rank and, much more recently, Learning to Search.

Also, we should always keep in mind that querying a full-text index basically amounts to running a memory- and disk-optimized k-Nearest Neighbors model, which is a baseline machine learning algorithm, over a dataset of sparse TF-IDF vectors with cosine similarity as the similarity function. The algorithm might seem simple from a mathematical point of view, but implementing it in a scalable way to index billions of documents with limited CPU and memory resources is far from trivial.
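To make the k-NN view concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the whitespace tokenizer, the smoothed IDF formula (log(N/df) + 1), the toy documents and the function names are all hypothetical choices, not how any particular search engine does it. It also scans every document, whereas a real inverted index only visits the postings of the query terms, which is exactly the scalability part that is hard.

```python
import math
from collections import Counter

# Toy corpus, purely illustrative.
DOCS = [
    "the cat sat on the mat",
    "dogs chase cats",
    "information retrieval ranks documents",
    "machine learning for retrieval",
]

def tokenize(text):
    # Naive lowercase + whitespace split; real systems also clean and stem.
    return text.lower().split()

def normalize(vec):
    # Scale a sparse vector (dict term -> weight) to unit L2 norm.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

def build_index(docs):
    # Return (idf, vectors): IDF weights and one unit-length TF-IDF vector per doc.
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: count each term once per doc
    n = len(docs)
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # assumed smoothing
    vectors = [
        normalize({t: c * idf[t] for t, c in Counter(toks).items()})
        for toks in tokenized
    ]
    return idf, vectors

def search(query, idf, vectors, k=2):
    # k nearest neighbours of the query under cosine similarity.
    counts = Counter(t for t in tokenize(query) if t in idf)
    q = normalize({t: c * idf[t] for t, c in counts.items()})
    scored = []
    for i, vec in enumerate(vectors):
        # Vectors are unit length, so the dot product is the cosine similarity.
        score = sum(w * vec.get(t, 0.0) for t, w in q.items())
        if score > 0:
            scored.append((i, score))
    scored.sort(key=lambda pair: -pair[1])
    return scored[:k]

idf, vectors = build_index(DOCS)
for doc_id, score in search("information retrieval", idf, vectors):
    print(doc_id, round(score, 3), DOCS[doc_id])
```

On this toy corpus the query "information retrieval" ranks document 2 first and document 3 second, because only those two share terms with the query.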

Edit: I think learning ML is not strictly necessary, but it helps you better understand some of the theory behind IR. Conversely, practical experience with large-scale text preprocessing (cleanup, tokenization, stemming, etc.) and indexing is not necessarily taught to ML students.

answered Nov 30 '11 at 05:52

ogrisel

edited Nov 30 '11 at 05:57

Ogrisel,

Thank you for your reply. From your answer I can tell you know a lot about IR. I agree with you that learning ML helps with IR research.

(Dec 01 '11 at 01:19) ericzhao

Personal opinion: the convergence between these two will accelerate, for a few reasons.

1) Data has exploded in recent years, and access to big data will continue to grow, as will what we define as big data. But the rate at which it grows will slow down and fall below the growth in access to computational power. I'm not just talking about Moore's law, but also improved parallel programming tools, improved cloud computing capacity and elasticity (e.g. EC2), and improved parallel algorithms.

In 10 years, a sufficiently determined researcher will be able to direct over a petaflop of computing power towards his problem, needing only a moderate-sized grant. This is enough to start seriously applying superlinear algorithms to terabyte-scale data, leading to a lot more flexibility.

2) Linear-time and sublinear algorithms are becoming much better understood. Research into online learning for even very complex models has taken off and is coming into its own. Whether or not he has access to what is today top-tier supercomputing, the IR specialist/researcher of 10 years from now will be using these tools.

3) Not-so-huge data will become much more interesting. IR for a gigantic database is one important problem domain, but what if you only have a few dozen gigabytes of data? That's enough to store all the scholarly work on a not-too-narrow topic, all the tweets with even a very popular hashtag, or even all the news stories written about a particular event. With moderate resources, you'll be able to perform some very muscular NLP on these mid-sized datasets. As capabilities improve, demand will follow, and vice versa, in a virtuous cycle.

answered Dec 03 '11 at 15:53

Jacob Jensen

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.