|
I'm looking for a toy corpus of text documents to build some really simple examples with LSI. I would like a corpus made up of a sequence of sentences or very short texts. Can you point me to some references? |
|
If you are using Python, NLTK comes with a couple of relatively small corpora. Two small corpora that have been used a lot are the 20 Newsgroups data set and the DMOZ corpus. The 20 Newsgroups data consists of short posts grouped into 20 categories and might be what you are looking for. You can find it here. |
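If even those corpora are more than a first example needs, a handful of hand-written sentences can serve as the corpus. The sketch below is only an illustration under that assumption: the six sentences in `DOCS` are made up (they are not taken from NLTK, 20 Newsgroups, or DMOZ), and all function names are hypothetical. It builds a term-document count matrix and approximates a rank-2 truncated SVD with plain power iteration, which is the core step of LSI; in practice a library SVD routine would normally replace `truncated_svd`.

```python
import math
import random
from collections import Counter

# Hypothetical toy corpus (made up for this sketch): three sentences about
# pets and three about finance, with no words shared between the two groups.
DOCS = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog sat on the rug",
    "investors sold stocks and bonds",
    "markets fell as investors sold stocks",
    "bonds rallied as markets fell",
]

def term_document_matrix(docs):
    """Return (vocab, matrix): one row per term, one column per document."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted({word for c in counts for word in c})
    return vocab, [[c[word] for c in counts] for word in vocab]

def matvec(a, v):
    """Compute A v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def rmatvec(a, u):
    """Compute A^T u."""
    return [sum(a[i][j] * u[i] for i in range(len(a))) for j in range(len(a[0]))]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def truncated_svd(a, k, iters=200, seed=0):
    """Approximate the k largest singular triplets (sigma, u, v) of `a`
    by power iteration on A^T A, deflating after each triplet."""
    a = [list(row) for row in a]  # work on a copy, since deflation mutates it
    rng = random.Random(seed)
    triplets = []
    for _ in range(k):
        v = [rng.random() for _ in a[0]]
        for _ in range(iters):
            v = rmatvec(a, matvec(a, v))
            n = norm(v)
            v = [x / n for x in v]
        u = matvec(a, v)
        sigma = norm(u)
        u = [x / sigma for x in u]
        triplets.append((sigma, u, v))
        # Deflate: subtract the rank-1 component sigma * u * v^T.
        for i, row in enumerate(a):
            for j in range(len(row)):
                row[j] -= sigma * u[i] * v[j]
    return triplets

def lsi_document_vectors(docs, k=2):
    """Map each document to its k-dimensional coordinates in LSI space."""
    _, matrix = term_document_matrix(docs)
    triplets = truncated_svd(matrix, k)
    return [[sigma * v[j] for sigma, _, v in triplets] for j in range(len(docs))]

def cosine(x, y):
    return sum(p * q for p, q in zip(x, y)) / (norm(x) * norm(y))

doc_vectors = lsi_document_vectors(DOCS, k=2)
print("pets vs pets:   ", cosine(doc_vectors[0], doc_vectors[1]))
print("pets vs finance:", cosine(doc_vectors[0], doc_vectors[3]))
```

Because the two groups of sentences share no vocabulary, the first similarity should come out close to 1 and the second close to 0. Real corpora such as 20 Newsgroups will not separate this cleanly.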