Hi all,

How would I find a corpus of Spanish text, as large as possible, for training general NLP models?

Ideally, it would be a corpus of books and/or magazines, since those have less variation in grammar, spelling, etc.

Thanks P

asked Jan 01 '13 at 02:29 by Petar Maymounkov
edited Jan 07 '13 at 23:30 by Leon Palafox ♦


3 Answers:

I've been having some trouble finding a suitable corpus, so I usually end up parsing webpages, which is not that difficult.
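For reference, here is a minimal sketch of that page-scraping approach, using only the Python standard library. The URL is a placeholder and the tag-stripping is deliberately simple; swap in whichever Spanish-language site you want to collect text from (and check its robots.txt and terms of use first).

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.skip = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

# Placeholder URL: point this at the Spanish-language page you want to scrape.
url = "http://example.com/some-spanish-page"
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

parser = TextExtractor()
parser.feed(html)
corpus_text = "\n".join(parser.chunks)
print(corpus_text[:500])
```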

You can try sending an email to these guys; perhaps they can be of some help.

answered Jan 03 '13 at 20:33 by Leon Palafox ♦

What about the Spanish Wikipedia? http://es.wikipedia.org

There are also books in Spanish in the Gutenberg project: http://www.gutenberg.org/browse/languages/es
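For Wikipedia, the full Spanish dump can be downloaded from https://dumps.wikimedia.org/eswiki/ and then stripped of wiki markup with a tool such as WikiExtractor. As a smaller self-contained example, here is a hedged Python sketch that fetches a plain-text Spanish book from Project Gutenberg; the ebook id and file name are examples (2000 is Don Quijote), so check the book's page if the URL does not resolve, and note that Gutenberg asks bulk downloaders to use its mirrors.

```python
import urllib.request

# Example Spanish ebook ids from the language listing above; 2000 is Don Quijote.
ebook_ids = [2000]

for ebook_id in ebook_ids:
    # Many Gutenberg books expose a UTF-8 plain-text file at this conventional
    # path, but file names vary, so check the book's page if this 404s.
    url = f"http://www.gutenberg.org/files/{ebook_id}/{ebook_id}-0.txt"
    text = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    with open(f"gutenberg_{ebook_id}.txt", "w", encoding="utf-8") as fh:
        fh.write(text)
    print(f"saved ebook {ebook_id}: {len(text)} characters")
```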

answered Jan 04 '13 at 09:48 by Alejandro (edited Jan 04 '13 at 09:53)

I know about this tool, FreeLing, developed at the Universitat Politècnica de Catalunya. It provides a range of NLP tools for Spanish. There are no corpora available for download, but I suppose you can get in touch with the people who work on the project and they can provide you with corpora for non-commercial use.
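FreeLing itself is not a corpus, but once installed it can be driven from the command line; below is a rough sketch of piping a Spanish text file through its analyzer from Python. The command name (`analyze`) and the `es.cfg` configuration follow the FreeLing documentation, but the exact binary name and config path depend on your version and install prefix, so treat them as assumptions to verify locally.

```python
import subprocess

# Feed a UTF-8 Spanish text file through FreeLing's analyzer and capture the output.
# "analyze" and "es.cfg" are taken from the FreeLing docs; adjust for your install.
with open("spanish_input.txt", "r", encoding="utf-8") as fin, \
     open("spanish_analyzed.txt", "w", encoding="utf-8") as fout:
    subprocess.run(["analyze", "-f", "es.cfg"], stdin=fin, stdout=fout, check=True)
```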

answered Jan 05 '13 at 22:20 by Martin SAVESKI
