NLP applications are basically determined by three components: features, structure, and search over that structure. Why, then, is it so hard to develop NLP tools? Why don't we have higher-level NLP languages that work well?

I find it extremely frustrating that for most tools I've tried to develop, I have had to build them almost from scratch.

asked Aug 08 '10 at 12:05 by Lev Ratinov

edited Aug 30 '10 at 23:15 by Frank


3 Answers:

You may want to look at ScalaNLP, which is under active development out of the Berkeley NLP group.

answered Aug 31 '10 at 12:44 by Shane

This is highly language-dependent.

There are several frameworks for English NLP: OpenNLP, GATE, NLTK, LingPipe, and I am sure there are others I missed.

There are several open-source IR frameworks: Lucene, Sphinx and others.

If you are developing for languages that are less popular, you may be right.

If you use a non-mainstream programming language, you may still be right.

As Alexandre said, please give a specific example and people may be able to help.

answered Aug 31 '10 at 08:41 by Yuval F

+1 for language dependency and for pointing out relevant tools such as OpenNLP and LingPipe (though disappointing on some points). However, for me NLTK is exactly the kind of tool the OP was referring to: it is merely a development kit, and complex parsing, NER tagging, and anaphora resolution must be built almost from scratch.

(Aug 31 '10 at 14:08) log0

@Yuval F: I think you misunderstood Lev's question. The issue here is not a library of pre-packaged tools such as taggers and parsers. These are indeed available to some extent, and you mentioned some of them.

Rather, the question was about a library/framework that should help in developing new tools (parsers, taggers, semantic parsers, or any other task you may think of). Maybe the question is confusing because what Lev is after is not really an NLP library, but rather a structured-prediction library that is well suited to NLP usage (i.e., an easy way to do IO over text, support for graph and tree structures, easy feature extraction from text, graphs, and trees, scalability to large data sets, etc.).

I am not aware of any such framework, but it would definitely be nice to have (if somewhat restrictive, in the sense that once it is available, people may be too lazy to think of smarter approaches that are not in the library).
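To make that wishlist concrete, here is a rough Python sketch of the kind of interface such a framework might expose; every name in it is hypothetical and not taken from any existing library:

    # Hypothetical sketch only: none of these classes exist in a real library.
    class FeatureTemplate:
        """Declares one feature as a function of the input and a position."""
        def __init__(self, name, fn):
            self.name, self.fn = name, fn

        def extract(self, sentence, i):
            return "%s=%s" % (self.name, self.fn(sentence, i))

    templates = [
        FeatureTemplate("word",    lambda s, i: s[i]),
        FeatureTemplate("suffix3", lambda s, i: s[i][-3:]),
        FeatureTemplate("prev",    lambda s, i: s[i - 1] if i > 0 else "<S>"),
    ]

    def featurize(sentence):
        """Build one feature list per token position from the templates."""
        return [[t.extract(sentence, i) for t in templates]
                for i in range(len(sentence))]

    # A framework like the one described would then feed these features into
    # a generic structured learner (perceptron, CRF, ...) that it supplies.
    print(featurize(["dogs", "chase", "cats"]))

The point of such an interface is that only the templates change from task to task, while the IO, the data structures, and the learner stay fixed.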

(Sep 01 '10 at 20:25) yoavg

Maybe because the dividing lines between these abstractions are not as clear as they could be? Features for one problem can easily be very different from features for another, and search depends heavily on the structure you impose on the model. But aren't there already libraries for most basic NLP tasks, like parsing, chunking, and tagging, as well as for CRFs, HMMs, and graphical models, that one could use to build other sorts of NLP tools?

I'm not really experienced, so I'd like to know what you think is missing, specifically.

answered Aug 09 '10 at 07:48 by Alexandre Passos ♦


For one thing, feature extraction for NLP is quite a pain. It's usually tons of repetitive, boring code. I bet there is an abstraction that can cover at least 90% of the easy, repetitive cases that you will generally have in most applications; then you could add extra stuff on top of that. (And no, Mallet's Pipes do not fill this role.)

Also, some dynamic programming code is quite generic. I guess Dyna is trying to fill this niche, but from my (very brief, long-ago) experience with it, it does not tie in nicely with outside code.
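To illustrate the generic dynamic programming point, here is a minimal sketch (not Dyna and not from any existing library) of a Viterbi decoder parameterized by an arbitrary score function, so the same routine could serve tagging, chunking, or segmentation:

    def viterbi(n_positions, labels, score):
        """Generic first-order Viterbi decoding.

        score(i, prev_label, label) returns the (log-)score of assigning
        `label` at position i after `prev_label` (prev_label is None at i=0).
        Returns the highest-scoring label sequence.
        """
        best = dict((lab, (score(0, None, lab), [lab])) for lab in labels)
        for i in range(1, n_positions):
            new_best = {}
            for lab in labels:
                prev, (s, path) = max(
                    best.items(),
                    key=lambda kv: kv[1][0] + score(i, kv[0], lab))
                new_best[lab] = (s + score(i, prev, lab), path + [lab])
            best = new_best
        return max(best.values(), key=lambda sp: sp[0])[1]

    # Toy usage: in practice the scores would come from learned feature weights.
    tags = ["DT", "NN", "VB"]
    gold = {(0, "DT"), (1, "NN"), (2, "VB")}
    print(viterbi(3, tags, lambda i, p, t: 1.0 if (i, t) in gold else 0.0))
    # -> ['DT', 'NN', 'VB']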

(Sep 01 '10 at 20:32) yoavg

This is true. I've wasted thousands of lines of code counting words, suffixes, prefixes, n-grams, weighted, unweighted, considering distributional similarities, etc.
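That boilerplate tends to look roughly like the following hand-rolled sketch (illustrative only, not taken from any library), and it gets rewritten for every new project:

    from collections import Counter

    def count_features(tokens, max_n=3, affix_len=3):
        """Count the usual suspects: words, prefixes, suffixes, and n-grams."""
        counts = Counter()
        for tok in tokens:
            counts["word=" + tok] += 1
            counts["prefix=" + tok[:affix_len]] += 1
            counts["suffix=" + tok[-affix_len:]] += 1
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts["%dgram=%s" % (n, "_".join(tokens[i:i + n]))] += 1
        return counts

    print(count_features(["the", "dog", "chased", "the", "cat"]))

Weighted variants and distributional-similarity features add more code of the same shape, which is exactly why a shared abstraction would pay off.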

(Sep 01 '10 at 20:42) Alexandre Passos ♦

I can relate to yoavg and Alexandre in this regard. Does Scala help?

(Feb 11 '11 at 14:29) Dexter