Almost everyone uses the same techniques for improving recall, but at the expense of precision at least as far as set retrieval.

  1. Stemming & Lemmatization to conflate ([porters => porter])
  2. Inflecting terms to expand (bark => [barks, barking, barker])

Intuitively, even if you want the recall at the expense of precision regarding set retrieval, there is still an opportunity in the ranking model to improve rank of the 'better' matches based on a 'better' inflection match. For example, one could search for the following:

"bark tree"

with the intent of looking up the topic of tree bark. However, this would also recall (and score/rank similarly) "barking up the wrong tree" probably even factoring in document length and other factors. This is clearly not ideal. We lost information. While it's OK for the documents about dogs barking up the wrong tree to be in the result set, ideally they'd be ranked lower. Actually, this sounds like a decent hedging of our bets.

... similarly for inflecting/expanding, one would search for (AND (OR bark barks barking barker) (OR tree trees)).

With the former technique (stemming & lemmatization), there is no opportunity to boost a match because the representation is lossy. With the latter technique, perhaps one could deliberately repeat or boost the exact term, i.e. for "bark tree" (emphasis in CAPS):

....(AND (OR BARK BARK barks barking barker) (OR TREE TREE trees)).

However, doing this is rather ad hoc. It probably also does strange things to ranking for various reasons. Why should terms with more inflections account for more words in the query? Why should 2 different inflections matched in the document be something like a linear boost? And why should it be exact-or-not? Pluralization might be considered less of a change than "gerundization" for example, though I have nothing to back that up academically.

So 2 questions:

  1. Has anyone done any rigorous testing on whether this is a smart thing to do?
  2. Is there a smart way to do it, accounting for some level of "closeness" rather than a boolean function of (exact || not-exact). For example, I'd say "shoe" is closer to "shoes" -- still a noun -- than "shoeing" -- verb/gerund.

So is this worth study? Has anyone studied it? Are there good overviews?

... or is this barking up the wrong tree (sorry, couldn't help it).

asked May 30 '13 at 02:31

f00's gravatar image

f00
46116

edited May 30 '13 at 02:55


2 Answers:

Here's what Sphinx does:

11.2.49. expand_keywords Expand keywords with exact forms when possible. Queries against indexes with expand_keywords feature enabled are internally expanded as follows. If the index was built with both stemming and index_exact_words enabled, exact form is also added. Here's an example that shows how internal expansion works:

running -> ( running | =running )

Expanded queries take naturally longer to complete, but can possibly improve the search quality, as the documents with exact form matches should be ranked generally higher than documents with stemmed matches.

Is this really a reasonable approach?

I suppose it's better than what Microsoft does in MS SQL. Here's what MS SQL:

FREETEXT queries will add words to the query via inflectional generation (inflected forms of the original query words); these words are treated as separate words with no special relationship to the words from which they were generated. Synonyms generated from the Thesaurus feature are treated as separate, equally weighted terms. Each word in the query contributes to the rank.

They're basically the same, but MS SQL is a bigger liability in general if you match more than few different inflections (i.e. dog, dogs, dogged) re: "treated as separate words with no special relationship to the words from which they were generated."

Both techniques really "treat inflections as separate words." Both would reward roughly linearly (assuming similar idf). I'm not sure this is reasonable, as in most rankers, frequency of same term is almost never a linear function. Usually there's a log, square root, or hyperbolic function because for a query "a b c," a document with "a a a" is obviously inferior to one with "a b c."

So why should this be different?

answered May 30 '13 at 14:56

f00's gravatar image

f00
46116

edited May 30 '13 at 15:29

There are (at least) two orthogonal issues that you want to address here:

  1. Improving recall by looking at word variations (e.g. by stemming or inflection)
  2. Avoiding bad matches that are due to polysemy or when there are idiomatic expressions ("barking up the wrong tree").

Idiomatic expressions are are tricky weird case, and perhaps you'd do best just to detect those automatically. Use a dictionary to find near matches, and then collapse them into one token.

To avoid bad matches due to polysemy ("plant" can be a vegetable or a factory), you could run a word sense disambiguation algorithm. See Yarowsky 1995 for a classic simple WSD algorithm. You could then retokenize everything according to its word sense, e.g. "plant_FACTORY".

To use word variations, but have a weighting scheme that makes more sense, you could duplicate the tokens. "barks" becomes "barks_ORIGINAL bark_STEM". Then, you improve recall, but still have a sensible weighting for rarer word forms, and not doublecount (overly emphasize) words with many variations.

answered May 31 '13 at 01:31

Joseph%20Turian's gravatar image

Joseph Turian ♦♦
579051125146

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.