Does anyone know of any document corpora that include the name of each document's author? Specifically, I'm looking for large corpora (>1000 documents) with at least a sub-sample labelled for authorship.

Most studies in authorship analysis use only a relatively small amount of data. I want to see whether the methods work on a large scale, and I need the data to do so. My guess is that some standard document-clustering dataset has authorship as metadata, but this isn't really information that gets 'advertised'. Any thoughts?

edit: There are a few new users, so I'm shamelessly bumping this question to see if anyone has any ideas. In short, I'm looking for a large collection of text with known authors for at least some of the documents.

asked Jul 09 '11 at 01:40

Robert Layton
edited Aug 17 '11 at 07:16

I believe RCV1 (Reuters) has authorship detail, whereas the original Reuters collection did not. Is this true?

(Jul 09 '11 at 01:40) Robert Layton

I'm intrigued. My area isn't NLP, so I know little about the topic. How do you relate authors to papers? By topic or by style?

(Jul 10 '11 at 09:52) Leon Palafox ♦

@Leon: Usually style works much better than topic, but of course it depends on how homogeneous the document collection is.

(Jul 10 '11 at 12:13) Alexandre Passos ♦

@Leon: Alexandre is right, stylistic features work much better. They are also preferred over topic for another reason: if you don't address the topic issue, your method won't generalise when an author writes on a new topic. The current state of the art uses character n-grams (n=3,4,5) and keeps the most frequent ones (compare with topic modelling, which discards the most frequent words).

(Jul 10 '11 at 22:02) Robert Layton
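
For readers new to the idea, here is a minimal sketch of the character n-gram features described in the comment above. The function names (`char_ngrams`, `top_ngrams`, `to_vector`) and the default `limit=1000` are illustrative assumptions, not taken from any particular paper or library.

```python
from collections import Counter

def char_ngrams(text, sizes=(3, 4, 5)):
    """Yield every overlapping character n-gram of the given sizes."""
    for n in sizes:
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def top_ngrams(documents, limit=1000, sizes=(3, 4, 5)):
    """Return the `limit` most frequent n-grams across all documents.

    Unlike topic modelling, the most frequent items are kept, not dropped.
    """
    counts = Counter()
    for doc in documents:
        counts.update(char_ngrams(doc, sizes))
    return [gram for gram, _ in counts.most_common(limit)]

def to_vector(text, vocabulary, sizes=(3, 4, 5)):
    """Represent a document as relative frequencies over a fixed vocabulary."""
    counts = Counter(char_ngrams(text, sizes))
    total = sum(counts.values()) or 1  # avoid dividing by zero on tiny texts
    return [counts[gram] / total for gram in vocabulary]
```

The resulting vectors can be passed to any standard classifier or clustering method; which one works best is outside the scope of this sketch.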

How do you define style? Is it the kind of words they use, like the most frequent n-grams?

(Jul 12 '11 at 08:31) Leon Palafox ♦

Basically. The bag-of-n-grams method performs quite well, taking the top L (usually a thousand or two) n-grams of size 3 or 4. Local methods, such as CNG (Keselj et al., 2004), take the top L n-grams for each author, then compare both the n-gram lists and their frequencies between author profiles and new documents.

(Jul 12 '11 at 19:51) Robert Layton
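
As a rough illustration of the profile-based approach mentioned above, here is a sketch in the spirit of CNG. It assumes the commonly cited relative-difference dissimilarity, summing ((f1 - f2) / ((f1 + f2) / 2))^2 over the n-grams in either profile; the helper names (`profile`, `dissimilarity`, `attribute`) and the choice to concatenate each author's known texts into one string are assumptions made for this example.

```python
from collections import Counter

def profile(text, n=3, limit=1000):
    """Map the `limit` most frequent character n-grams to relative frequencies."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.most_common(limit)}

def dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over both profiles.

    An n-gram missing from one profile is treated as having frequency 0,
    so it contributes the maximum per-item value of 4.
    """
    total = 0.0
    for gram in set(p1) | set(p2):
        f1, f2 = p1.get(gram, 0.0), p2.get(gram, 0.0)
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total

def attribute(text, author_texts, n=3, limit=1000):
    """Return the author whose profile is least dissimilar to the document.

    `author_texts` maps an author name to that author's known writing,
    concatenated into a single string (an assumption of this sketch).
    """
    doc_profile = profile(text, n, limit)
    profiles = {a: profile(t, n, limit) for a, t in author_texts.items()}
    return min(profiles, key=lambda a: dissimilarity(doc_profile, profiles[a]))
```

For example, `attribute(unknown_text, {"author_a": text_a, "author_b": text_b})` returns whichever key has the closest profile. On real data, both `n` and `limit` would need tuning.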

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.