
Is there a corpus of political texts available for NLP work? I need a large dataset of political blogs, news articles, or even tweets.

If I cannot find one, what is the best way to curate one?

I do not care about any metadata beyond the actual text. I just need a corpus that contains a lot of text.

asked Jul 07 '10 at 22:51

Mark Alen

edited Jul 08 '10 at 03:46

ogrisel


6 Answers:

I've seen papers using the bitterlemons corpus. It's a collection of texts on the Israeli/Palestinian divide, annotated with the side of the writer. I have a friend who did a more thorough (but unlabeled) mining of the bitterlemons site; send me an email if you're interested.

answered Jul 07 '10 at 23:03

Alexandre Passos ♦

There's the CORPS corpus (http://hlt.fbk.eu/corps), but that consists of political speeches, not commentary.

Collecting tweets is really easy through the Twitter API (http://apiwiki.twitter.com/). Since tweets are mostly public, it's straightforward to scrape a bunch of them from as many sources as you like: @chucktodd, @glennbeck, whatever you want. Identifying "political" tweets in the immense amount of data that comes from that firehose is an interesting clustering question, but that doesn't mean you couldn't do it.
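
As a crude starting point for separating "political" tweets from the rest, a keyword filter can serve as a baseline before trying any clustering. This is only a sketch: the term list, the `looks_political` helper, and the sample tweets are all made up for illustration, and a keyword match is a much simpler technique than the clustering mentioned above.

```python
import re

# Hypothetical seed vocabulary; extend it for your own domain.
POLITICAL_TERMS = {"senate", "congress", "election", "vote", "campaign", "policy"}
TOKEN = re.compile(r"[a-z']+")


def looks_political(text, terms=POLITICAL_TERMS, min_hits=1):
    """Return True if the text contains at least `min_hits` distinct seed terms."""
    tokens = set(TOKEN.findall(text.lower()))
    return len(tokens & terms) >= min_hits


# Made-up example tweets.
tweets = [
    "The Senate will vote on the budget tonight",
    "Great coffee this morning",
    "Election results are in",
]
political = [t for t in tweets if looks_political(t)]
```

Raising `min_hits` trades recall for precision; tweets kept by the filter could then seed a proper classifier or clustering step.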

Blogs are slower, but the posts are longer, and most blogs have RSS feeds that you could mine for content.
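
Mining a feed for text might look roughly like the sketch below, which uses only the Python standard library. It assumes a plain RSS 2.0 feed whose post body sits in each item's `description` element as escaped HTML; real feeds vary (Atom, `content:encoded`, truncated summaries). The helper names and the sample feed are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(fragment):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def posts_from_rss(rss_xml):
    """Return (title, plain_text) pairs for every non-empty item in the feed."""
    root = ET.fromstring(rss_xml)
    posts = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        body = strip_html(item.findtext("description") or "")
        if body:  # skip items with no text
            posts.append((title, body))
    return posts


# Made-up feed used only to exercise the functions above.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example politics blog</title>
  <item>
    <title>Budget vote</title>
    <description>&lt;p&gt;The Senate &lt;b&gt;passed&lt;/b&gt; the budget.&lt;/p&gt;</description>
  </item>
  <item>
    <title>Empty post</title>
    <description></description>
  </item>
</channel></rss>"""

posts = posts_from_rss(SAMPLE_FEED)
```

In practice the XML string would come from fetching each blog's feed URL (for example with `urllib.request.urlopen`), and many feeds only carry the most recent posts, so repeated polling is needed to build up volume.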

As far as curating reliable/reusable data goes, it really depends on what you want to do with it. Do you need to balance the content by political point of view? By author? By topic? If not, then no problem. Maintaining that sort of balance has some perks, but it means that you have to be a little more careful in deciding what to include in the final corpus.
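
If balance does matter, one simple way to get it is to downsample every group to the size of the smallest one. The sketch below assumes each document is a dict carrying a label such as `viewpoint`; that field name, the `balance_by` helper, and the toy documents are hypothetical, and downsampling is only one of several possible balancing strategies.

```python
import random
from collections import defaultdict


def balance_by(docs, key, seed=0):
    """Downsample so every value of `key` contributes equally many documents."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)
    if not groups:
        return []
    # The smallest group sets the per-group quota.
    target = min(len(members) for members in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], target))
    return balanced


# Toy documents with a hypothetical "viewpoint" label.
docs = [
    {"viewpoint": "left", "text": "a"},
    {"viewpoint": "left", "text": "b"},
    {"viewpoint": "left", "text": "c"},
    {"viewpoint": "right", "text": "d"},
    {"viewpoint": "right", "text": "e"},
]
balanced = balance_by(docs, "viewpoint")
```

The same function balances by author or topic by passing a different `key`. The cost is discarded data, which is the extra care in choosing what to include mentioned above.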

answered Jul 07 '10 at 23:23

Andrew Rosenberg

Thanks, CORPS seems to be very useful. I just sent them an email. I just hope they reply.

(Jul 07 '10 at 23:28) Mark Alen

What about the New York Times Annotated Corpus from the LDC (not free) and its linked open dataset?

The annotated corpus has every single article published by the New York Times from 1987 to 2007. I recommend it because, in addition to being clean and grammatical, it has extremely useful metadata: not just date, time, and author, but also which topics, people, places, and organizations occur in each article. You could easily select articles by the tag "United States Politics and Government" (an actual tag) or some such.

answered Jul 08 '10 at 11:48

aditi

If you are okay with it, US Congressional Floor debate transcripts are available here.

This corpus is used in the paper "Get out the vote: Determining support or opposition from Congressional floor-debate transcripts", by Matt Thomas, Bo Pang, and Lillian Lee. Proceedings of EMNLP, pp. 327–335, 2006.

answered Sep 20 '10 at 09:03

Dexter

Justin Grimmer (Stanford) collected "over 24,000 Senate press releases, collected from each Senate office in 2007". See: "A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases".

Sean Gerrish (Princeton) used GovTrack, "an independent Website which provides comprehensive tracking of legislative information to the public. Our collection contains 4,915 documents, 1,253 unique legislators". See: "The Ideal Point Topic Model: Predicting Legislative Roll Calls from Text".

answered Dec 23 '10 at 11:39

Yariv Maron

edited Dec 23 '10 at 11:48

Here are a few more political blog corpora.

  • Yano et al.: http://www.ark.cs.cmu.edu/blog-data/
  • Eisenstein and Xing: http://sailing.cs.cmu.edu/socialmedia/blog2008.html

And, not strictly a blog, but:

  • Bitter Lemons corpus: http://sites.google.com/site/weihaolinatcmu/data

answered Dec 23 '10 at 13:39

brendan642

edited Dec 23 '10 at 13:40


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.