NLP applications are basically determined by three components: features, structure, and search over the structure. Why is it so hard to develop NLP tools, then? Why don't we have higher-level NLP languages that work well? I find it extremely frustrating that for most tools I've tried to develop, I had to build them almost from scratch.
|
This is highly language-dependent. There are several frameworks for English NLP: OpenNLP, GATE, NLTK, LingPipe, and I am sure there are others I missed. There are several open-source IR frameworks: Lucene, Sphinx, and others. If you are developing for languages that are less popular, you may be right. If you use a non-mainstream programming language, you may still be right. As Alexandre said, please give a specific example and people may be able to help.

+1 for language dependency and some relevant tools such as OpenNLP and LingPipe (though disappointing on some points). However, for me NLTK is exactly the kind of tool the OP was referring to. It is merely a development kit; complex parsing, NER tagging, and anaphora resolution must be built almost from scratch.
(Aug 31 '10 at 14:08)
log0
1
@Yuval F: I think you misunderstood Lev's question. The issue here is not a library of pre-packaged tools such as taggers and parsers. These are indeed available to some extent, and you mentioned some of them. Rather, the question was about a library/framework that should help in developing new tools (parsers, taggers, semantic parsers, or any other task you may think of). Maybe the question is confusing because what Lev is after is not really an NLP library, but rather a structured-prediction library which is well suited for NLP usage (i.e., an easy way to do IO over text, support for graph and tree structures, easy feature extraction from text, graphs, and trees, scalability to large data sets, etc.). I am not aware of any such framework, but it definitely would be nice to have (if somewhat restrictive, in the sense that once this is available, people may be too lazy to think of smarter stuff which is not in the library).
(Sep 01 '10 at 20:25)
yoavg
Maybe because the dividing lines between these abstractions are not as clear as they could be? Features for one problem are easily very different from features for another, and search depends a lot on the structure you put on the model as well. But aren't there already libraries for most basic tools in NLP, like parsing, chunking, tagging, etc., and also for CRFs, HMMs, and graphical models that one could use to build other sorts of NLP tools? I'm not really experienced, so I'd like to know what you think is missing specifically.
2
For one thing, feature extraction for NLP is quite a pain. It's usually tons of repetitive, boring code. I bet there is an abstraction that can cover at least 90% of the easy, repetitive cases that you will generally have in most applications. Then you could add extra stuff on top of that. (and no, Mallet's Pipes do not fill this role). Also, some dynamic programming stuff is quite generic. I guess Dyna is trying to fill this niche, but from my (very short, a long time ago) experience with it, it does not really tie in nicely with outside code.
(Sep 01 '10 at 20:32)
yoavg
1
This is true. I've wasted thousands of lines of code counting words, suffixes, prefixes, n-grams, weighted, unweighted, considering distributional similarities, etc.
(Sep 01 '10 at 20:42)
Alexandre Passos ♦
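As a rough illustration of the abstraction yoavg and Alexandre are asking for, here is a minimal Python sketch (the function names are my own for illustration, not from any existing library) that factors the usual word/prefix/suffix/n-gram counting into a few reusable helpers, so that per-application code only composes them:

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def affix_features(token, max_len=3):
    """Prefix and suffix features of a token, up to max_len characters."""
    feats = {}
    for k in range(1, min(max_len, len(token)) + 1):
        feats["prefix=" + token[:k]] = 1
        feats["suffix=" + token[-k:]] = 1
    return feats


def extract_features(tokens):
    """Bag-of-words, affix, and bigram counts for one tokenized sentence."""
    feats = Counter()
    for tok in tokens:
        feats["word=" + tok] += 1
        feats.update(affix_features(tok))
    for bg in ngrams(tokens, 2):
        feats["bigram=" + "_".join(bg)] += 1
    return feats
```

For example, `extract_features("the cat sat".split())` produces counts such as `word=the`, `bigram=cat_sat`, and `suffix=at` (the last with count 2, from "cat" and "sat"). The point of the sketch is that the repetitive 90% lives in a handful of generic extractors, and application code just combines them; weighted variants or distributional-similarity features would be additional extractors layered on the same `Counter` interface.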
I can relate to yoavg and Alexandre in this regard. Does Scala help?
(Feb 11 '11 at 14:29)
Dexter