I'm looking for a toy corpus of text documents to build some really simple examples with LSI. I would like a corpus composed of a sequence of sentences or really short texts. Can you point me to some references?

asked Nov 06 '12 at 06:52

pietro


One Answer:

If you are using Python, NLTK comes with a couple of relatively small corpora. Some small corpora that have been used a lot are the 20 newsgroups data set and the DMOZ corpus. The 20 newsgroups data consists of short posts grouped into 20 categories and might be what you are looking for. You can find it here.
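If a handful of hand-written sentences is enough to get started, something like the following might do before moving to a real corpus. This is only a minimal sketch: the six sentences are made up for illustration (they are not taken from NLTK or 20 newsgroups), raw word counts are used instead of tf-idf, and LSI is done as a plain truncated SVD with NumPy. The names `docs`, `doc_vecs`, and `cosine` are my own.

```python
import numpy as np

# Hypothetical toy corpus: six short sentences on two topics (made up for illustration).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the cat and the dog played",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
    "markets rallied as investors bought bonds",
]

# Vocabulary: one row of the matrix per distinct word.
vocab = sorted({word for doc in docs for word in doc.split()})
row_of = {word: i for i, word in enumerate(vocab)}

# Term-document matrix of raw counts (terms x documents).
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[row_of[word], j] += 1

# LSI as a truncated SVD: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document


def cosine(a, b):
    """Cosine similarity between two document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Compare a same-topic pair with a cross-topic pair in the latent space.
print("cat/mat vs dog/cat:", cosine(doc_vecs[0], doc_vecs[1]))
print("cat/mat vs stocks :", cosine(doc_vecs[0], doc_vecs[3]))
```

With only two latent dimensions, the animal sentences and the finance sentences should end up pointing in clearly different directions, which is usually all a toy LSI demo needs to show. Swapping `docs` for posts from one of the corpora mentioned above should work the same way, although a sparse matrix would be advisable at that size.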

answered Nov 06 '12 at 07:57

Philemon Brakel


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.