Hi, I am looking for a Java-based library for doing the following on a huge dataset of natural language (English). I want to do tokenization, lemmatization, stemming, and stop-word removal, build a vocabulary, and, if possible, build a tf-idf score table. Is there a single library with which I can do all of the above?

asked Jul 04 '12 at 04:29


Lancelot


2 Answers:

Any of the following:

answered Jul 05 '12 at 13:31


Pedro Oliveira

also http://lucene.apache.org/
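To make the requested pipeline concrete, here is a minimal sketch in plain Java (standard library only, not Lucene) of tokenization, stop-word removal, vocabulary building, and a tf-idf table. Everything in it is an assumption made for illustration: the class name `TfIdfSketch`, the tiny stop-word list, the regex tokenizer, and the choice of tf = count / document length with idf = ln(N / df). Stemming and lemmatization are left out because they need language resources that a library would supply; a stemmer would slot in at the end of `tokenize`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class; not part of any library.
class TfIdfSketch {
    // Tiny illustrative stop-word list (an assumption; real lists are much longer).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "on", "the", "to"));

    // Lowercase, split on anything that is not a letter or digit, drop stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Sorted set of every distinct token in the corpus.
    static Set<String> vocabulary(List<List<String>> docs) {
        Set<String> vocab = new TreeSet<>();
        for (List<String> doc : docs) {
            vocab.addAll(doc);
        }
        return vocab;
    }

    // One term -> score map per document, with tf = count / doc length and idf = ln(N / df).
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: in how many documents each term appears.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> table = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.size();
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                scores.put(e.getKey(), tf * idf);
            }
            table.add(scores);
        }
        return table;
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<>();
        docs.add(tokenize("The cat sat on the mat"));
        docs.add(tokenize("The dog sat on the log"));
        System.out.println("Vocabulary: " + vocabulary(docs));
        System.out.println("tf-idf:     " + tfIdf(docs));
    }
}
```

This keeps everything in memory, so for a huge dataset it only shows the shape of the computation; an indexing library such as Lucene would be the more realistic route at scale.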

answered Jul 06 '12 at 00:00


mat kelcey


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.