I have sets of (book-length) documents by both Alice and Bob. Given a new document D, I want to know whether D was written by Alice or Bob. What are some useful methods? One simple way is to pick features like sentence length, proportion of adjectives/adverbs/etc., proportion of words in some set, etc., and treat it like a standard text classification problem. What are some features that have been used in the past with good effect? Are there other methods that have worked well? I'm interested in both general techniques and actual applications to famous authors (e.g., did Shakespeare actually write Shakespeare? When person FOO died and BAR finished FOO's work, how well did BAR imitate FOO's writing style?) I'm also interested in some related problems:
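For what it's worth, here is a minimal Python sketch of the baseline the question describes: hand-picked stylistic features fed to an off-the-shelf classifier. The feature set, the function-word list, and the toy training chunks are illustrative assumptions, not recommendations; with book-length documents you would split each book into many chunks.

    import re
    from sklearn.linear_model import LogisticRegression

    # Illustrative function words; topic-independent style cues.
    FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "with", "but"]

    def style_features(text):
        """Simple stylistic features for one chunk of text."""
        words = re.findall(r"[a-z']+", text.lower())
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        n = max(len(words), 1)
        feats = [
            n / max(len(sentences), 1),   # mean sentence length (in words)
            sum(map(len, words)) / n,     # mean word length
            len(set(words)) / n,          # type-token ratio
        ]
        feats += [words.count(w) / n for w in FUNCTION_WORDS]
        return feats

    # Toy stand-ins for chunks of each author's known writing.
    alice_chunks = ["She walked to the shore, and the wind was cold. It was late.",
                    "The letters sat in the drawer. She did not open them that day."]
    bob_chunks = ["Run. Fight. Win. That was the plan, but plans rarely survive.",
                  "He loaded the truck fast. No time to think, no time to stop."]
    D = "The tide came in slowly, and she watched it from the old house."

    X = [style_features(c) for c in alice_chunks + bob_chunks]
    y = [0] * len(alice_chunks) + [1] * len(bob_chunks)
    clf = LogisticRegression().fit(X, y)
    print("Alice" if clf.predict([style_features(D)])[0] == 0 else "Bob")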
It's important to keep in mind what the goal of building a classifier for this problem is. It's verrrry easy to fool yourself in authorship attribution: to build a classifier that you think distinguishes author style when it's really distinguishing the topics the authors tend to write about. Using stylistic features helps, but appropriate choice of training data is crucial as well (e.g., trying to balance for topic where possible). The state of knowledge in authorship attribution is very poor, alas, and better data sets are desperately needed; David Madigan, Shlomo Argamon, and I unsuccessfully tried to stir up some funding for that a few years ago. The AAAC corpus has a few problems in it that are normalised for subject, as does the TO BHMA corpus. Both are still 'smaller' datasets, but they do exist.
(Dec 07 '10 at 17:37)
Robert Layton
The problem of "given a set of authors, which one wrote a particular piece?" is known as Authorship Attribution and is part of a (only slightly) larger field called Authorship Analysis. There are a number of techniques for this, which are summarised fairly well in Stamatatos' survey paper.

Personally (I am doing this for my PhD), I find n-gram models to be the best methodology. The SCAP methodology is particularly good, and it's very easy to implement (Original paper, Shameless plug for my paper using it for Twitter Authorship). The issue with SCAP is that there isn't much room for growth in the basic model: you can't just throw the results into an SVM and see an improvement. This is because SCAP gives a distance matrix between known authors and test documents, but no vector space representation. For that kind of more standard machine learning model, take the top L n-grams for a corpus (L can be around 500, n between 2 and 6 for English) and compute the normalised frequency of each of those n-grams in each document. That gives a nice, easy vector space model, which can then be used with standard machine learning methods for your one-class problem.

As for features that have been used well in the past, Zheng has a good overview here. I haven't been able to check that link, but the paper is titled "A framework for authorship identification of online messages: Writing-style features and classification techniques" (Zheng, 2006). Features such as sentence length have been shown not to be wildly useful by themselves, but they can be useful as part of a larger model.

Finally, for your Shakespeare example: this is a case of plagiarism detection, which can determine whether a single part of a document was written by someone else. I haven't looked too far into this, so I'll let someone else answer that part. If you don't want to look into yet another field, you could always split each document into sections of about 200 words and, using the authorship attribution techniques above, see which sections are wildly different from the other sections by the same author.

Even more finally, my own PhD work is on unsupervised authorship analysis, the case where none of the authors are known. Short of throwing a method from supervised learning into a clustering algorithm, there is not much out there right now. This field is, loosely, similarity detection (otherwise known as authorship distinction) and focuses more on finding metrics and methodologies that strongly correlate with authorship. It is still a very new area, so hopefully more will be known about how to do this soon.
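To make the two representations above concrete, here is a minimal Python sketch of both: SCAP-style character n-gram profiles compared by intersection, and the top-L normalised n-gram frequency vectors. The parameter choices (n = 3, L = 500) follow the ranges given above, but the file names and the similarity normalisation are illustrative assumptions, not the reference implementations from the cited papers.

    from collections import Counter

    def ngram_counts(text, n=3):
        """Character n-gram counts for one document or author corpus."""
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def scap_profile(text, n=3, L=500):
        """SCAP profile: the set of the L most frequent character n-grams."""
        return {g for g, _ in ngram_counts(text, n).most_common(L)}

    def scap_similarity(profile_a, profile_b):
        """Simplified profile intersection: fraction of shared n-grams."""
        return len(profile_a & profile_b) / max(len(profile_a), 1)

    def topL_vector(text, vocab, n=3):
        """Normalised frequencies over a fixed top-L n-gram vocabulary;
        this gives the vector space representation usable with SVMs etc."""
        counts = ngram_counts(text, n)
        total = sum(counts.values()) or 1
        return [counts[g] / total for g in vocab]

    # Assumed corpus files: known Alice text, known Bob text, disputed D.
    alice = open("alice_known.txt").read()
    bob = open("bob_known.txt").read()
    d = open("disputed.txt").read()

    d_profile = scap_profile(d)
    scores = {"Alice": scap_similarity(d_profile, scap_profile(alice)),
              "Bob": scap_similarity(d_profile, scap_profile(bob))}
    print(max(scores, key=scores.get))

    # Vector-space variant: vocabulary = top L n-grams of the whole corpus.
    vocab = [g for g, _ in ngram_counts(alice + bob).most_common(500)]
    x_d = topL_vector(d, vocab)  # feed vectors like this to a standard classifier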
You can try this paper: http://jmlr.csail.mit.edu/proceedings/papers/v13/kaelin10a/kaelin10a.pdf. They tackle the problem of deciding whether a fund donor is a Democrat or a Republican given certain conditions. I had an opportunity to talk with the author (Fabian Kaelin), and he had a similar idea in mind; you might try to contact him and check it out, or check his web page.
Edit: the response below answers a slightly different question, namely, how to identify an author based on his/her handwriting. I initially misinterpreted the question asked. I am leaving my answer here anyway; you never know who it will be useful for.

To the best of my knowledge, the two best-performing types of features for handwriting-based writer identification are (1) edge-based directional features and (2) grapheme features. Identification can be performed on those features using your favorite supervised learner.

The idea of edge-based directional features is to measure (changes in) the orientation of the script. The most prominent feature of this type is the edge-hinge feature, which is computed by first applying a simple edge detector to the handwriting (e.g., using a 3x3 Laplace filter and thresholding). From each edge pixel, two edge segments emanate; the edge-hinge feature estimates the joint distribution of the angles of these two segments with respect to the baseline in a histogram. It generally helps to compute the feature at multiple scales (i.e., for a number of segment lengths). A rough sketch of this computation follows below the comments.

Grapheme features estimate the distribution by which writers generate small components of handwriting. Ideally, these handwriting components would be individual characters, but segmenting connected handwriting into characters requires handwriting recognition, whilst recognition requires segmentation (a chicken-and-egg problem known as Sayre's paradox). As a surrogate for individual characters, so-called graphemes are extracted; you can extract graphemes from the handwriting by, e.g., following the edge contour of the script and making a vertical 'cut' whenever the edge direction changes from downwards to upwards. This yields a set of binary objects (graphemes) that can easily be extracted. Subsequently, a vector quantizer (e.g., k-means) is applied to the graphemes extracted from the training corpus to construct a grapheme codebook. The final feature is a histogram over graphemes that measures how often a writer produces each grapheme. Some Matlab code that implements these two features is available here.

Neat! I hadn't thought about handwriting analysis, since the data I'm interested in is purely digital text, but I'll definitely look into this as well.
(Nov 27 '10 at 23:20)
grautur
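For the purely digital-text case this won't apply, but here is a rough Python sketch of the edge-hinge computation described in the answer above, to make the idea concrete. It substitutes OpenCV contour tracing for the Laplace-filter edge detection, purely for brevity; the `leg` and `bins` values and the file name are illustrative assumptions. Computing it for several `leg` values gives the multi-scale version.

    import cv2
    import numpy as np

    def edge_hinge(image_path, leg=5, bins=12):
        """Joint histogram of the directions of the two contour segments
        emanating from each 'hinge' pixel. `leg` is the segment length in
        pixels; `bins` quantises each direction."""
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        # Binarise (ink = white) and trace contours; OpenCV >= 4 assumed.
        _, bw = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        contours, _ = cv2.findContours(bw, cv2.RETR_LIST,
                                       cv2.CHAIN_APPROX_NONE)
        hist = np.zeros((bins, bins))
        for c in contours:
            pts = c[:, 0, :]                      # (N, 2) contour pixels
            if len(pts) < 2 * leg + 1:
                continue
            for i in range(leg, len(pts) - leg):
                dx1, dy1 = pts[i - leg] - pts[i]  # first emanating segment
                dx2, dy2 = pts[i + leg] - pts[i]  # second emanating segment
                a1 = np.arctan2(dy1, dx1)         # angles w.r.t. the x-axis,
                a2 = np.arctan2(dy2, dx2)         # assuming an upright scan
                q1 = int((a1 + np.pi) / (2 * np.pi) * bins) % bins
                q2 = int((a2 + np.pi) / (2 * np.pi) * bins) % bins
                hist[q1, q2] += 1
        return (hist / (hist.sum() or 1)).ravel()  # normalised joint histogram

    # Usage (assuming a scanned handwriting sample exists on disk):
    # features = edge_hinge("handwriting_sample.png")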