If I use NLTK (Apache License 2.0) to train a part-of-speech tagger on the treebank corpus (whose README says it is for non-commercial purposes only), what license(s) cover the resulting tagger? If I understand the Apache license correctly, the tagger is not a "Derivative Work", because it is a separable object that only links to NLTK. What about the treebank license: can I use the tagger for commercial purposes, since it is now separate from the corpus? And if I build an API around the tagger, can the tagger be used commercially through that API?

asked Feb 06 '11 at 21:51

Jacob Perkins


2 Answers:

The Apache license is permissive, so you're set there. Using the treebank is a legal can of worms that you'll probably want a lawyer to think through. Copyright isn't much of a worry, as the model training process is thoroughly transformative and closely analogous to human learning. However, it appears that additional contractual restrictions have been placed on the corpus. There are some rights over text that can't be taken away (e.g., first-sale doctrine), but I don't know which apply in this case, and it's also not a usual form of text (the tagging is what we're interested in, not the content). I'm not aware of any legal reason you couldn't make text available under a license that forbids incorporating it into ML models, but I'm not a lawyer.

Then, even if you're in the legal clear, it's probably a novel enough case that you could still have to defend yourself if it came to that. On the other hand, especially if you don't advertise the treebank roots, a lawsuit feels unlikely to me (but again, I'm not a lawyer, and that's just idle speculation).

answered Feb 08 '11 at 08:37

Paul Barba

Thanks Paul, I'm not particularly worried about lawsuits, and I agree that obscuring the training corpus would definitely reduce the likelihood. But I do wonder, if copyright no longer applies to the trained tagger, then perhaps license usage restrictions do not apply either, since most corpus licenses are focused on copyright & distribution of the text.

(Feb 08 '11 at 10:50) Jacob Perkins

The Apache OpenNLP (incubating) developers recently had the same discussion on their mailing list. The emerging consensus is that it would be better to bootstrap completely open source / freely distributable annotated corpora from sources such as Wikipedia / Wikinews. That way we would no longer have to deal with (boring) legal constraints and could just focus on the (challenging) technical issues. Please join the OpenNLP mailing list if you would like to participate in this effort.

It would be great to develop a common tooling framework for validating / curating open source annotated corpora for NLP.

answered Feb 07 '11 at 19:59

ogrisel

Do you have a link to that discussion? I couldn't seem to find the right thread in the archives. While I totally agree with the idea of open/free corpora, the fact remains that much of the good corpora have restricted licenses, but it's not clear what the limits of the licenses are.

(Feb 07 '11 at 21:29) Jacob Perkins

I think that most copyright laws do not apply to statistical models built from copyrighted text (since you cannot accurately reconstruct the text from the model), but I would feel better if we could redistribute both the trained models and the original training corpus, so that people can tweak / refine the corpus annotations and retrain the models themselves without any legal burden.

The thread I was mentioning is "Distributing our statistical models" (on page 2 of the archive browser; permalink to the first post of the thread).

(Feb 08 '11 at 04:26) ogrisel

This is good news. The models created using copyrighted material have been infuriating, since they're still not very good and can't be improved, because the developers can't release the training data due to copyright concerns. This basically renders the current models useless in the long run. I'm not sure why they didn't simply start off with Gutenberg texts...

(Mar 09 '11 at 08:25) Cerin

Cerin - The Gutenberg texts don't have linguistic annotations on them, so they can only be used for certain types of models, such as n-gram language models. And they are old texts rather than modern newswire texts. (Though that in itself is a limitation of many existing treebanks, now that there is so much interest in analysis of social media texts that differ significantly from newswire.)

(Jun 10 '11 at 22:09) Jason Baldridge

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.