|
Hi, I want to create a corpus of news articles from online media like the BBC. What I want is an .xml or Excel file that contains the plain text of each news article along with its topic (Sport, Politics, ...). Could anybody give me some advice on doing this task? |
|
You can use the JSoup Java library for this task. The library is fast and robust, and it uses CSS-like syntax for selecting HTML elements. |
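To give an idea of what that looks like, here is a minimal sketch that pulls a topic and the plain text out of one article page. The class name `ArticleExtractor` and both selectors are assumptions made up for illustration; every news site uses different markup, so you would have to inspect the pages and adjust them.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ArticleExtractor {

    /** Topic and plain text of a single article. */
    public static final class Article {
        public final String topic;
        public final String text;

        Article(String topic, String text) {
            this.topic = topic;
            this.text = text;
        }
    }

    // Hypothetical selectors: real sites need their own.
    static final String TOPIC_SELECTOR = "meta[name=section]";
    static final String BODY_SELECTOR = "article p";

    /** Parses an HTML string and extracts the topic and paragraph text. */
    public static Article extract(String html) {
        Document doc = Jsoup.parse(html);

        // Topic: read the content attribute of the first matching tag, if any.
        Element topicTag = doc.selectFirst(TOPIC_SELECTOR);
        String topic = (topicTag == null) ? "" : topicTag.attr("content");

        // Body: text of every matching paragraph, one paragraph per line.
        List<String> paragraphs = doc.select(BODY_SELECTOR).eachText();
        String text = String.join("\n", paragraphs);

        return new Article(topic, text);
    }
}
```

To work on a live page instead of a string, `Jsoup.connect(url).get()` returns the same kind of `Document`; it throws an `IOException` on network errors, so you need to handle that.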
|
Be careful with the legal issues. News organizations make money from their content, and it takes some convincing for them to let you host it. Maybe an approach like the one taken by the Wikilinks corpus provided by Google is better? Essentially, your dataset can be a list of URLs plus rules for extracting the information you want from the web pages, maybe even with a script that does it. Then people can download the data themselves. |
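A rough sketch of such a "URLs plus rules" manifest could look like this. The class `CorpusManifest`, its fields, and the tab-separated layout are all assumptions chosen for illustration, not the format the Wikilinks corpus uses.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CorpusManifest {

    /** One manifest row: where the article lives and how to extract its text. */
    public static final class Entry {
        public final String url;
        public final String topic;
        public final String bodySelector; // CSS-like rule for the article text

        public Entry(String url, String topic, String bodySelector) {
            this.url = url;
            this.topic = topic;
            this.bodySelector = bodySelector;
        }

        /** Serializes the entry as one tab-separated line. */
        public String toTsvLine() {
            return String.join("\t", url, topic, bodySelector);
        }

        /** Parses a line written by toTsvLine(). */
        public static Entry fromTsvLine(String line) {
            String[] parts = line.split("\t", -1);
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected 3 fields: " + line);
            }
            return new Entry(parts[0], parts[1], parts[2]);
        }
    }

    /** Joins all entries into the text of a manifest file, one entry per line. */
    public static String toTsv(List<Entry> entries) {
        return entries.stream()
                .map(Entry::toTsvLine)
                .collect(Collectors.joining("\n"));
    }
}
```

You would distribute only this manifest; anyone who wants the corpus runs their own download script over it, so the article text itself is never rehosted.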
Can you use the Guardian or Reuters datasets?
http://www.guardian.co.uk/open-platform
http://www.guardian.co.uk/news/datablog/interactive/2013/jan/14/all-our-datasets-index