Hi,

I want to create a corpus of news articles from online media like the BBC. What I want to have is an .xml or Excel file that contains the plain text of each news article along with its topic (Sport, Politics, ...).

Could anybody give me some advice on doing this task?

asked Apr 29 '13 at 02:53

Pashutan Modaresi

Can you use the Guardian or Reuters datasets?

http://www.guardian.co.uk/open-platform

http://www.guardian.co.uk/news/datablog/interactive/2013/jan/14/all-our-datasets-index

(Apr 29 '13 at 16:55) eugene tani

2 Answers:

You can use the jsoup Java library for this task. The library is fast and robust, and it uses a CSS-like syntax for selecting HTML elements.
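As a rough illustration of the CSS-like selection, here is a minimal sketch that parses an HTML string and pulls out a topic and the body text. It assumes jsoup is on the classpath. The class name, the two selectors, and the sample HTML are made up for the example; a real news site will need selectors chosen by inspecting its own markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ArticleExtractor {

    /** Plain text and topic of one news article. */
    static final class Article {
        final String topic;
        final String text;

        Article(String topic, String text) {
            this.topic = topic;
            this.text = text;
        }
    }

    // Hypothetical selectors: inspect the target site's HTML and adjust them.
    static final String TOPIC_SELECTOR = "meta[property=article:section]";
    static final String BODY_SELECTOR = "div.story-body p";

    /** Parses an HTML string and pulls out the topic and the body text. */
    static Article extract(String html) {
        Document doc = Jsoup.parse(html);
        // The topic is assumed to sit in a meta tag; empty string if it is missing.
        Element topicMeta = doc.select(TOPIC_SELECTOR).first();
        String topic = (topicMeta == null) ? "" : topicMeta.attr("content");
        // text() strips the tags and combines the text of all matched paragraphs.
        String text = doc.select(BODY_SELECTOR).text();
        return new Article(topic, text);
    }

    public static void main(String[] args) {
        // Made-up sample page, so the sketch runs without network access.
        String html = "<html><head>"
                + "<meta property=\"article:section\" content=\"Sport\">"
                + "</head><body><div class=\"story-body\">"
                + "<p>First paragraph.</p><p>Second paragraph.</p>"
                + "</div></body></html>";
        Article article = extract(html);
        System.out.println(article.topic);
        System.out.println(article.text);
    }
}
```

To work on a live page instead of a string, Jsoup.connect(url).get() returns the same kind of Document, subject to the site's terms of use.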

answered May 15 '13 at 02:43

Anton Kazennikov

Be careful with the legal issues. News organizations make money from their content, and it takes some convincing for them to let you host it. Maybe an approach like the one taken by the Wikilinks corpus provided by Google would be better?

Essentially, your dataset can be a list of URLs plus rules for extracting the information you want from each web page, maybe even with a script that does the extraction. Then people can go and download the data themselves.
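One way to picture such a dataset is a small XML manifest that stores only the URL, the topic label, and the extraction rule for each article, and none of the article text. The sketch below uses only the Java standard library. The class name, the XML layout, the attribute names, and the example.com URLs are all assumptions made up for illustration, not an established format.

```java
import java.util.Arrays;
import java.util.List;

class CorpusManifest {

    /** One manifest entry: where the article lives and how to extract it. */
    static final class Entry {
        final String url;
        final String topic;
        final String bodySelector; // CSS-like rule for locating the article text

        Entry(String url, String topic, String bodySelector) {
            this.url = url;
            this.topic = topic;
            this.bodySelector = bodySelector;
        }
    }

    /** Escapes the characters that are unsafe inside an XML attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Serializes the entries as one article element per line. */
    static String toXml(List<Entry> entries) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<corpus>\n");
        for (Entry e : entries) {
            sb.append("  <article url=\"").append(escape(e.url))
              .append("\" topic=\"").append(escape(e.topic))
              .append("\" bodySelector=\"").append(escape(e.bodySelector))
              .append("\"/>\n");
        }
        sb.append("</corpus>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical entries; the URLs and selectors are placeholders.
        List<Entry> entries = Arrays.asList(
                new Entry("https://example.com/news/1", "Sport", "div.story-body p"),
                new Entry("https://example.com/news/2?a=1&b=2", "Politics", "div.story-body p"));
        System.out.print(toXml(entries));
    }
}
```

Anyone who receives the manifest can then fetch each URL and apply the stored rule locally, so the content itself is never redistributed.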

answered Apr 29 '13 at 09:32

Alexandre Passos ♦


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.