
Note: I'm fairly new to NLP, so my terminology may not be correct.

Suppose I have a set of objects, each object being a text sentence of at most 200 characters. I want to take one object from the set and match it against the others by meaning, producing a "weight of relatedness" between the chosen object and each object remaining in the set.

For example, if I have four objects:

  • What is a car and how it works?
  • How does a plane fly?
  • Does ufo exist?
  • How automobile works?

If I take the 4th object and match it against the remaining 3, I expect output like: [0.998, 0.01, 0.01]

Note that in the example above all sentences are questions, but in my real-world data set I expect only around 70% of the sentences to be questions.

What methods / algorithms / libraries should I be looking at to solve this problem?

asked Jun 08 '11 at 08:43

Andrey Staev

edited Jun 09 '11 at 20:48

Robert Layton


4 Answers:

This is kind of a classic information retrieval task. Represent each sentence as a vector, and then all you need to do is compute the similarity between two vectors. I would avoid the word "meaning" and instead talk about word co-occurrence. Off the top of my head, Semantic Vectors comes to mind, and a book you may find helpful is Introduction to Information Retrieval; look for vector space modeling. See also the Wikipedia article on the Vector Space Model. I guess other people in this community will come up with other helpful comments. Good luck!
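As a minimal sketch of the vector space approach (plain bag-of-words with cosine similarity; a real system would add stemming, stop-word removal, and TF-IDF weighting on top of this):

```python
import math
from collections import Counter

def to_vector(sentence):
    """Very naive tokenization: lowercase, drop '?', split on whitespace."""
    return Counter(sentence.lower().replace("?", "").split())

def cosine_similarity(a, b):
    """Cosine of the angle between two bag-of-words vectors."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sentences = [
    "What is a car and how it works?",
    "How does a plane fly?",
    "Does ufo exist?",
]
query = to_vector("How automobile works?")
weights = [cosine_similarity(query, to_vector(s)) for s in sentences]
```

Note that "car" and "automobile" never match lexically here, so the first weight stays well below the 0.998 the question hopes for; that gap is exactly what the latent/semantic methods in the other answers try to close.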

answered Jun 08 '11 at 09:47

Svetoslav Marinov

edited Jun 08 '11 at 09:55

A nice library to look at is gensim. In particular, look at the tutorial; the example it walks through is quite similar to what you want to do.

answered Jun 08 '11 at 10:51

Alejandro

Vector space models will only work on near-exact semantic matches unless you do some creative filtering.

In natural language processing, there is a task known as textual entailment. The goal is to infer whether or not something is true given prior information. While that might not sound exactly like what you're looking for, many approaches to this task have developed excellent text similarity metrics that you could probably use. Look at the Recognizing Textual Entailment challenges for many of the solutions people have come up with.

answered Jun 08 '11 at 11:11

Kirk Roberts

edited Jun 09 '11 at 11:55

An interesting gem I heard about yesterday is Explicit Semantic Analysis (ESA), by Gabrilovich et al. The idea is to construct a "concept vector" for each word: a weighted bag of the Wikipedia articles in which that word appears. This gives you a word similarity, which you can then average to get a document similarity. It seems to outperform many latent methods by leveraging this already-built huge knowledge base in an aggregate way, without over-trusting any single article.
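A toy sketch of the idea follows. The concept vectors here are made up for illustration; real ESA derives them from a TF-IDF matrix over all of Wikipedia:

```python
import math

# Hypothetical concept vectors: word -> {wikipedia_article: weight}.
# Real ESA computes these weights from Wikipedia's word-article
# TF-IDF matrix; these toy values just mimic the shape of the data.
concept_vectors = {
    "car":        {"Car": 0.9, "Engine": 0.4, "Transport": 0.3},
    "automobile": {"Car": 0.8, "Engine": 0.5, "Transport": 0.2},
    "plane":      {"Aircraft": 0.9, "Transport": 0.3},
    "ufo":        {"Unidentified flying object": 0.9},
}

def cosine(u, v):
    keys = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def doc_similarity(words_a, words_b):
    """Average pairwise word similarity between two token lists."""
    pairs = [(a, b) for a in words_a for b in words_b
             if a in concept_vectors and b in concept_vectors]
    if not pairs:
        return 0.0
    return sum(cosine(concept_vectors[a], concept_vectors[b])
               for a, b in pairs) / len(pairs)
```

Because "car" and "automobile" share concepts ("Car", "Engine"), they score high even though they never co-occur lexically, which is exactly the failure mode of the plain bag-of-words approach above.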

answered Jun 10 '11 at 03:01

Alexandre Passos ♦


powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.