I was thinking about what steps are required to build an algorithm for the following task:

Suppose you are given a text and want to extract the named entities in it (NER), and you further want to get the text passages that relate to these named entities (coreference?). What steps would be required to build such a system?

The output would be the names (of persons, companies, or even products) together with the passages that refer to them, like "is friendly" or "went bankrupt".

What NLP techniques are necessary for this and how should they be combined?

Edit: I see my question was not specific enough, so I will try to give an example. Suppose I am given the following text (copied from a review on a tech site):

But the Xoom, which sported a 10.1-inch screen, was a bit too heavy (1.6 pounds) and much too expensive, and the 3G and 4G models were available only through Verizon.

This sentence describes some features of the Xoom tablet. These features are

(Xoom is) a bit too heavy, too expensive, available only through Verizon

As I understand coreference resolution, this would be a correct result for it, is that right? If so, you could apply a coreference resolution algorithm to this sentence, which would (ideally) extract these features of the Xoom. Then I could use features like "a bit too heavy" to predict whether each one is positive or negative. So the result would look something like this (no offense to Xoom users ;) ):

a bit too heavy (negative), 
too expensive (negative),  
available only through Verizon (neutral)

At the end I could say: Okay this review of the Xoom is negative.
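
To make the last two steps concrete, here is a minimal sketch of how the labelling and the final verdict might look, assuming the feature phrases have already been extracted. The tiny lexicon and the function names `phrase_polarity` and `review_polarity` are made up for illustration; a real system would probably use a trained classifier or a published sentiment lexicon instead.

```python
# Hypothetical, hand-made polarity lexicon (an assumption for illustration only).
LEXICON = {"heavy": -1, "expensive": -1, "friendly": 1, "bankrupt": -1}


def phrase_polarity(phrase):
    """Label one extracted phrase by summing the lexicon scores of its words."""
    score = sum(LEXICON.get(word, 0) for word in phrase.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def review_polarity(phrases):
    """Majority vote over the phrase labels; neutral phrases do not vote."""
    labels = [phrase_polarity(p) for p in phrases]
    pos = labels.count("positive")
    neg = labels.count("negative")
    if pos == neg:
        return "neutral"
    return "positive" if pos > neg else "negative"


# The three features taken from the Xoom sentence above.
features = ["a bit too heavy", "too expensive", "available only through Verizon"]
labels = {p: phrase_polarity(p) for p in features}
overall = review_polarity(features)
```

The hard part is of course not this voting step but getting the phrases in the first place, which is what my question is about.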

Returning to my question: Is this technically (on the basis of currently existing algorithms) realistic? And if so, what type of algorithms would solve such a problem?

asked Jan 26 '12 at 11:02

Tom

edited Jan 29 '12 at 09:19

I have edited your question so that it is more direct.

(Jan 27 '12 at 16:22) Joseph Turian ♦♦

One Answer:

What problem are you actually trying to solve? What is the end product?

Coreference resolution might make your problem much harder than necessary.

You might want to consider a web-scale relation extraction approach. For example, see ReVerb from the University of Washington.

Check out the ClueWeb09 extractions. They include, for example: ("Adair", "went bankrupt in", "1972")

You could then apply postprocessing to clean up the data.
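
As a rough sketch of what such postprocessing might look like, assuming the extractions are available as plain (argument1, relation, argument2) tuples: the code below normalizes the first argument, drops duplicate triples, and groups what remains by entity. The helper names `normalize` and `group_by_entity` are hypothetical, and the Xoom triples are hand-written for illustration, not actual extractor output.

```python
from collections import defaultdict

# The first triple is the example quoted above; the Xoom triples are
# hand-written illustrations (an assumption), not real extractor output.
triples = [
    ("Adair", "went bankrupt in", "1972"),
    ("the Xoom", "was", "a bit too heavy"),
    ("The Xoom", "was", "much too expensive"),
    ("the Xoom", "was", "a bit too heavy"),
]


def normalize(argument):
    """Lowercase, collapse whitespace, and strip a leading article."""
    words = argument.lower().split()
    if words and words[0] in {"the", "a", "an"}:
        words = words[1:]
    return " ".join(words)


def group_by_entity(triples):
    """Map each normalized entity to its distinct (relation, argument2) pairs."""
    grouped = defaultdict(list)
    seen = set()
    for arg1, relation, arg2 in triples:
        key = (normalize(arg1), relation.lower(), arg2.lower())
        if key in seen:
            continue  # skip exact duplicates after normalization
        seen.add(key)
        grouped[key[0]].append((relation, arg2))
    return dict(grouped)


by_entity = group_by_entity(triples)
```

Grouping by entity like this gives you, for each name, the list of things said about it, without ever running a coreference system.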

But it is hard to say what is an appropriate approach without understanding the end goal.

answered Jan 27 '12 at 16:26

Joseph Turian ♦♦

Thank you for your answer. I will check your link soon. I also edited my original question to make it clearer.

(Jan 29 '12 at 09:20) Tom

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.