I was wondering if anyone knows of any document corpora that include the name of each document's author. Specifically, I'm looking for large corpora (more than 1000 documents), with at least a sub-sample labelled for authorship. Most studies in authorship analysis use only a relatively small amount of data. I want to see whether the methods work on a large scale, and I need the data to do so. My thought was that a standard document clustering dataset might have authorship as metadata, but this isn't really information that gets 'advertised'. Any thoughts? Edit: There are a few new users, so I'm shamelessly bumping this question to see if anyone has any ideas. In short, I'm looking for a large collection of text with known authors for at least some of the documents.
I believe RCV1 (Reuters) has authorship details; the original did not. Is this true?
I'm intrigued. Since my area isn't NLP, I know little about the topic. How do you relate authors to papers: by topic or by style?
@Leon: Usually style works much better than topic, but of course it depends on how homogeneous the document collection is.
@Leon: Alexandre is right, stylistic features work much better. They are also preferred over topic features: if you don't address the topic issue, your method won't generalise when an author writes on a new topic. The current state of the art uses character n-grams (n=3,4,5), keeping the most frequent ones (compare this with topic modelling, which discards the most frequent words).
How do you define style? Is it the kind of words they use, like the most frequent n-grams?
Basically. The bag-of-n-grams method performs quite well, taking the top L (usually a thousand or two) n-grams of size 3 or 4. Local methods, such as CNG (Keselj et al., 2004), build a list of the top L n-grams for each author, then compare both the lists and the n-gram frequencies between authors and new documents.
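To make the profile idea concrete, here is a rough Python sketch of a CNG-style comparison. Treat it as an illustration under assumptions rather than the method from the paper: the function names (`char_ngrams`, `profile`, `dissimilarity`, `attribute`) are made up for this example, and the relative-difference formula is my reading of the approach, so check it against the original before relying on it.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile(text, n=3, top_l=1000):
    """Map the top_l most frequent character n-grams to relative frequencies."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.most_common(top_l)}


def dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over both profiles.

    Assumed formula: ((f1 - f2) / ((f1 + f2) / 2)) ** 2 per n-gram, where an
    n-gram missing from a profile counts as frequency 0.
    """
    score = 0.0
    for gram in set(p1) | set(p2):
        f1 = p1.get(gram, 0.0)
        f2 = p2.get(gram, 0.0)
        score += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return score


def attribute(doc, author_texts, n=3, top_l=1000):
    """Return the author whose profile is least dissimilar to the document's."""
    doc_profile = profile(doc, n, top_l)
    author_profiles = {
        author: profile(text, n, top_l) for author, text in author_texts.items()
    }
    return min(
        author_profiles,
        key=lambda author: dissimilarity(author_profiles[author], doc_profile),
    )


# Toy data, purely hypothetical; real profiles need far more text per author.
known = {
    "alice": "the cat sat on the mat the cat sat",
    "bob": "zzz qqq xxz zzq qxz zzz qqq",
}
print(attribute("the cat on the mat", known))
```

In practice each author's text would be the concatenation of all their known documents, and L and n would be tuned on held-out data.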