This might be too open-ended, but I'm interested in what problems people are facing with text that don't have an academic analogue. I'm looking for new research problems. It's okay if the problem is close to, but not quite, an existing academic NLP problem.

asked Jul 06 '10 at 14:57 by aria42

edited Dec 03 '10 at 06:53 by Alexandre Passos ♦

Another one is cross-language NLP (abstracting away from machine translation). Say language E has been studied extensively in academia: hundreds of problems have been solved, and plenty of free, open-source libraries are available for anything you like (parsing, anaphora resolution, summarization, NER, etc.). Languages A, S and T (you name it), on the other hand, have received very little academic attention, for any number of reasons. Now, in a real-world application you may want to attack NER for all of E, A, S and T. But you cannot do it properly, or even simplify it, because A, S and T have completely different morphology, syntax, etc., and often there are simply no free tools, or even academic papers, available. This discrepancy between the state of the art of NLP research in different languages is, I think, one interesting problem that people in the enterprise world face and need to tackle over large, noisy datasets.

(Jun 01 '11 at 03:35) Svetoslav Marinov

7 Answers:

There are three big problems in information access: navigation, importance (which you touch on in "authority" in your answer), and summarization. I'll talk about navigation later.

Regarding summarization:

I'd be interested in web-scale multi-document summarization and source aggregation.

The same story is repeated ten times across online newspapers and summarized on Twitter in a variety of ways. Aggregate this into one headline with a link to all the articles.

Afterwards, bloggers comment on the story. It would be interesting to get a summary of the different commentaries, with duplicates aggregated, and to have the summaries of those commentaries dangle off the summary of the main story, and so on.
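
A rough sketch of the aggregation step, assuming you already have the article texts; the similarity threshold and the greedy all-pairs clustering are purely illustrative (a web-scale version would need something like minhash/LSH):

```python
# Greedy clustering of near-duplicate news articles by TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "Iceland volcano eruption grounds flights across Europe",
    "Flights across Europe grounded after Iceland volcano eruption",
    "Central bank raises interest rates by a quarter point",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)
sim = cosine_similarity(vectors)

clusters = []  # each cluster is a list of article indices
for i in range(len(articles)):
    for cluster in clusters:
        if sim[i, cluster[0]] > 0.3:  # arbitrary threshold
            cluster.append(i)
            break
    else:
        clusters.append([i])

for cluster in clusters:
    # Use the first article as a stand-in "headline" for the group.
    print(articles[cluster[0]], "->", len(cluster), "article(s)")
```

Picking (or generating) a better headline for each cluster than "the first article we saw" is exactly where the summarization problem starts.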

answered Jul 06 '10 at 16:01 by Joseph Turian ♦♦

edited Jul 06 '10 at 16:02

Summarization is actually more important than many people realize. In particular, anyone unfamiliar with the topic would be surprised to learn that we have no effective method (yet) of actually CREATING summaries; we can only extract them from important sentences in the original source documents.
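
To make the distinction concrete, here is a toy version of the extractive approach (word-frequency scoring, nothing is generated); everything in it is illustrative:

```python
# Toy extractive summarizer: score each sentence by the frequency of its words
# in the document and keep the top-scoring ones. No new text is ever written.
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )
    return " ".join(scored[:n_sentences])

doc = ("The volcano erupted on Tuesday. Ash clouds grounded flights across Europe. "
       "Airlines expect the disruption from the ash clouds to last several days.")
print(extractive_summary(doc))
```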

(Jul 08 '10 at 03:39) Daniel Duckworth

Summarization is indeed very hard. You don't appreciate how hard it is to string a few sentences together in a lucid and explanatory way (despite the fact that a great many humans have trouble with it) until you try to get a computer to do it.

(May 27 '11 at 16:38) Jacob Jensen

The work of Katja Filippova is closely related to this; see her work on sentence fusion and sentence compression: https://sites.google.com/site/katjaf/

(Jul 05 '11 at 23:15) maxime caron

I'll go first to get the ball rolling. A problem I've become interested in is automated ways of assessing authority in forums and social networks. A site like this has a formal mechanism for voting on answers, so there is an objective system for assessing textual contributions. For other settings, such as reviews or blog-post comments, there are no formal mechanisms for assessing authority. Is this a problem the "crowd" will solve in general across domains if we give them a formal mechanism, or do we need an NLP solution?

answered Jul 06 '10 at 15:40 by aria42

An NLP solution should be better, because it allows for more general application (if it is good enough, I mean), like use in search engines or other forms of automated information extraction.

Also, karma systems like the one on this site tend to reward frequent posters instead of authorities (of course, a person can be both, but, for example, I'm pretty sure my answers are not half as authoritative as those from most users with lower karma, but I post a lot, which helps).

I think most of the previous work in this area is based on PageRank, Kleinberg's hubs and authorities (HITS), and Gerrish and Blei's work on modeling influence in text corpora with topic models.
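
As a sketch of the graph-based side of this, you could run PageRank over a "user A endorsed / cited / replied approvingly to user B" graph; the users, edges and damping factor below are made up:

```python
# Authority scores from a user-interaction graph via PageRank.
# An edge u -> v means "u endorsed, cited or replied approvingly to v".
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("alice", "bob"), ("carol", "bob"), ("dave", "bob"),
    ("bob", "carol"), ("alice", "carol"), ("dave", "alice"),
])

scores = nx.pagerank(G, alpha=0.85)
for user, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{user}: {score:.3f}")
```

The hard NLP part is deciding, from the text alone, which replies actually count as endorsements.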

(Jul 06 '10 at 16:00) Alexandre Passos ♦

An NLP solution has its pros, but it also has its disadvantages. In fact, it would have to evolve constantly (just like spam filters and defenses against Google bombs), since people will try to game the solution. Of course, at the moment that is very unlikely, and it will probably remain so until the rewards go up. So my guess is that a combination (a weighted average?) of the karma system and an NLP system would be interesting.
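
Something like the toy combination below is what I have in mind; the weights, and the assumption that both scores are already normalised to [0, 1], are made up:

```python
# Toy blend of a karma-based score and an NLP-based authority score,
# both assumed to lie in [0, 1]. The 0.4 / 0.6 weighting is arbitrary.
def combined_authority(karma_score, nlp_score, karma_weight=0.4):
    return karma_weight * karma_score + (1 - karma_weight) * nlp_score

print(combined_authority(karma_score=0.9, nlp_score=0.3))  # prolific but shallow poster
print(combined_authority(karma_score=0.2, nlp_score=0.8))  # low-karma expert
```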

(Jul 17 '10 at 12:28) anandjeyahar

The biggest NLP problem is language semantics and semantic processing technologies (don't confuse this with the Semantic Web). There is a set of good articles about this at www.sempl.net.

This answer is marked "community wiki".

answered Jul 15 '10 at 15:19 by Igor

edited Jul 19 '10 at 12:02 by Andrew Rosenberg

Can you elaborate on what kinds of tasks this entails?

(Jul 19 '10 at 12:51) aria42

When we talk about natural language processing by software, we assume the software is intelligent enough to process and understand natural-language phrases correctly. That is certainly not true today. A couple of simple examples:

1. The phrase 'a child eats by hands' is perfectly understandable to a human, but very complicated for a computer. NOBODY EVER eats 'by hands'. We eat by mouth, sometimes holding the food in our hands.

2. Two phrases: 'close the door behind you' and 'close the safe'. The first phrase, strictly speaking, makes no semantic sense: we are not really able to close a door; we close the room with the door, but nobody talks like that. In the best case people say 'close the door of the room', and even that is semantically imprecise. The second phrase is semantically correct, but nobody will tell you 'close the safe by its door', because 'the door' is understood by default by a human, BUT NOT BY A COMPUTER.

That's why I recommend again going to www.sempl.net and reading about semantics and semantic coding. Thanks to Andrew Rosenberg for correcting this web site address in the previous comment.

(Jul 21 '10 at 10:43) Igor

@Igor: this sounds a lot like bickering about meaning based on very strict and unnatural definitions of what the words mean and how they can be used.

(Jul 21 '10 at 11:05) Alexandre Passos ♦

I agree that semantics - what words mean and how these meanings combine - is a big open question in NLP. But I have a very different impression of semantics than Igor describes in the above post.

I would claim that all of those sentences - "close the door behind you", "close the safe", and even "a child eats by hands", though the latter sounds very non-native in American English - are semantically "correct". If a machine or system of semantic description can't interpret them correctly, it's a problem with the machine or system, not the sentences.

The work at www.sempl.net all strikes me as very brittle and almost directly inspired by the classical AI perspective that "meaning" operates under first-order logic axioms. While there's certainly a place for logic in AI and decision making, language and meaning only occasionally follow these strict rules (cf. the examples above; also the fact that people will say that the number 3 is "more odd" than 517, even though under the "meaning" of "odd" they are equally valid members of the set of whole numbers that are not evenly divisible by 2).

(Jul 21 '10 at 11:32) Andrew Rosenberg

A thing I think could be useful is extreme multi-task learning for predicting everything at once: which feed-reader items, news items, emails, or search results to throw away as uninteresting, and where to file the rest. I don't see how to make this work, though, except by postulating some miraculous deep architecture that generates really good features for whatever it is you want to do with text.

Although, honestly, this seems too close to AI to work correctly.
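
The usual concrete form of this is hard parameter sharing: one encoder shared by all tasks, with a cheap head per task. A structural sketch (the tasks, sizes, and data are all invented):

```python
# Hard parameter sharing: one shared text encoder, one small head per task.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, vocab_size=1000, hidden=64, task_sizes=(2, 2, 5)):
        super().__init__()
        # Shared encoder: bag-of-words -> hidden representation.
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        # One linear head per task (e.g. spam/ham, interesting/boring, folder).
        self.heads = nn.ModuleList([nn.Linear(hidden, n) for n in task_sizes])

    def forward(self, bow, task_id):
        return self.heads[task_id](self.encoder(bow))

model = MultiTaskModel()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One joint training step on fake data for each task.
for task_id, n_classes in enumerate((2, 2, 5)):
    x = torch.rand(8, 1000)                # fake bag-of-words batch
    y = torch.randint(0, n_classes, (8,))  # fake labels
    loss = loss_fn(model(x, task_id), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Whether the shared encoder ends up producing really good features for every task, rather than a mushy compromise, is exactly the part I don't see how to make work.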

answered Jul 06 '10 at 16:35 by Alexandre Passos ♦

Another problem that I think is very interesting but still beyond the state of the art is some form of constrained natural language generation (constraining the generated text to come from a given topic distribution in a topic model, or to be close to a given document, or to carry a given sentiment, etc.).

Of course, it's slightly sad that this would mostly benefit spammers.
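
A toy illustration of the kind of constraint I mean: bias a (here, absurdly small) language model toward words associated with a target topic at decoding time. The model, the topic weights and the numbers are all invented:

```python
# Sample from a tiny bigram model, reweighting each candidate word by how
# strongly it is associated with a target "sports" topic.
import random

bigrams = {
    "<s>":     {"the": 0.6, "a": 0.4},
    "the":     {"team": 0.3, "market": 0.3, "game": 0.4},
    "a":       {"team": 0.5, "market": 0.5},
    "team":    {"won": 0.6, "crashed": 0.4},
    "market":  {"won": 0.2, "crashed": 0.8},
    "game":    {"won": 0.9, "crashed": 0.1},
    "won":     {"</s>": 1.0},
    "crashed": {"</s>": 1.0},
}
topic_weight = {"team": 3.0, "game": 3.0, "won": 2.0}  # boost "sports" words

def generate(bias=True):
    word, out = "<s>", []
    while word != "</s>":
        candidates = bigrams[word]
        scores = {w: p * (topic_weight.get(w, 1.0) if bias else 1.0)
                  for w, p in candidates.items()}
        word = random.choices(list(scores), weights=list(scores.values()))[0]
        out.append(word)
    return " ".join(out[:-1])

random.seed(0)
print(generate(bias=False))  # unconstrained sample
print(generate(bias=True))   # sample nudged toward the "sports" topic
```

Doing this with a real language model, while keeping the output fluent, is the part that is beyond the state of the art.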

answered Jul 07 '10 at 17:46 by Alexandre Passos ♦

All these problems come from the core issue of not having a solid model for representing the semantics of natural language. First-order logic has proved too rigid; probabilistic FOL shows more promise but hasn't been solved yet. Once there is a representation that holds the complete semantic information of a sentence, summarization, translation, language generation and other tasks become easier to solve.
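
A toy sketch of the log-linear flavour of this: weighted logical rules score entire possible worlds, with P(world) proportional to exp of the total weight of the satisfied rules (the view taken by Markov logic). The atoms, rules and weights below are invented:

```python
# Weighted logical rules over possible worlds, Markov-logic style:
# P(world) is proportional to exp(sum of weights of satisfied rules).
import itertools
import math

atoms = ["smokes_anna", "smokes_bob", "friends_anna_bob"]

def rule_score(world):
    w = dict(zip(atoms, world))
    score = 0.0
    # Soft rule (weight 1.5): friends tend to share the same smoking habit.
    if not w["friends_anna_bob"] or w["smokes_anna"] == w["smokes_bob"]:
        score += 1.5
    # Soft rule (weight 0.7 per person): by default, people don't smoke.
    score += 0.7 * sum(1 for a in ("smokes_anna", "smokes_bob") if not w[a])
    return score

worlds = list(itertools.product([False, True], repeat=len(atoms)))
unnorm = [math.exp(rule_score(world)) for world in worlds]
Z = sum(unnorm)

for world, u in sorted(zip(worlds, unnorm), key=lambda t: -t[1])[:3]:
    print(f"P = {u / Z:.3f}  {dict(zip(atoms, world))}")
```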

answered May 31 '11 at 08:40 by Scott Frye

What about Markov logic?

(Jul 05 '11 at 23:18) maxime caron

Extracting the location referenced in a webpage interests me a lot. I posted a question about this at http://metaoptimize.com/qa/questions/1190/approach-to-find-the-location-referenced-in-a-web-page
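
The naive baseline would be a gazetteer match with a couple of page-level cues; a toy sketch (the gazetteer and the weights are invented stand-ins for something like GeoNames):

```python
# Naive location guesser: count mentions of known place names in a page,
# giving extra weight to matches in the title.
import re
from collections import Counter

GAZETTEER = {"london", "paris", "mumbai", "bangalore", "new york"}

def guess_location(title, body):
    scores = Counter()
    for text, weight in ((title, 3), (body, 1)):
        words = re.findall(r"[a-z]+", text.lower())
        candidates = words + [" ".join(pair) for pair in zip(words, words[1:])]
        for token in candidates:
            if token in GAZETTEER:
                scores[token] += weight
    return scores.most_common(1)[0][0] if scores else None

print(guess_location("Traffic updates for Bangalore",
                     "Heavy rain in Bangalore today; flights to Mumbai delayed."))
```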

answered Jul 17 '10 at 02:29 by ArchieIndian

edited Jul 17 '10 at 02:30

Down voters: it's awfully impolite to down-vote without providing at least some amount of feedback.

(May 31 '11 at 18:59) Brian Vandenberg

@Brian Vandenberg: I didn't downvote, but if you follow the linked question there are lots of comments and answers attesting that this problem doesn't seem to be open at all, hence it doesn't really qualify as a candidate for the next big NLP problem.

(May 31 '11 at 19:04) Alexandre Passos ♦

Well put, Alex.

(Jun 01 '11 at 13:12) Brian Vandenberg

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.