Hi,

I have a collection of n text documents, each containing user-generated text. Of these n documents, k belong to one class of users and the remaining n-k belong to another class, with k << n-k.

I would like to build a signature for the k users based on their text content. Which algorithms (machine learning or natural language processing) should I look into? I am aware that language models could be one avenue.

asked Jun 10 '11 at 10:32

Dexter

edited Jun 10 '11 at 20:09

Robert Layton


One Answer:

Have a look at authorship profiling methods, which infer properties of an author from features of their text. The most commonly predicted properties are age, gender and first language, but others are possible.

Authorship profiling is essentially a text classification task in which the classes are the values of the author attribute you wish to predict; however, the techniques are a bit different. For example, character n-grams are among the best-performing features, and keeping the most common n-grams usually does better than taking those from the middle of the frequency range, à la standard text categorization.
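To make the character n-gram idea concrete, here is a minimal sketch in plain Python. It is only an illustration, not a method taken from the references below: the function names, the choice of trigrams, the top_k cutoff and the nearest-centroid rule with cosine similarity are all assumptions made for the example. It keeps the most frequent character n-grams across the collection, turns each document into a vector of relative frequencies, and assigns a new document to the class whose average vector is most similar.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(docs, n=3, top_k=500):
    """Keep the top_k most frequent n-grams over the whole collection."""
    counts = Counter()
    for doc in docs:
        counts.update(char_ngrams(doc, n))
    return [gram for gram, _ in counts.most_common(top_k)]


def vectorize(doc, vocab, n=3):
    """Relative frequency of each vocabulary n-gram in one document."""
    counts = Counter(char_ngrams(doc, n))
    total = sum(counts.values()) or 1  # avoid division by zero on tiny docs
    return [counts[gram] / total for gram in vocab]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(docs, labels, n=3, top_k=500):
    """Build the vocabulary and one averaged profile (centroid) per class."""
    vocab = build_vocabulary(docs, n, top_k)
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(vectorize(doc, vocab, n))
    centroids = {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in by_class.items()
    }
    return vocab, centroids


def predict(doc, vocab, centroids, n=3):
    """Return the label whose centroid is most similar to the document."""
    vec = vectorize(doc, vocab, n)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy data: one "target" document versus three "other" ones.
docs = ["aaaa aaaa aaaa", "bbbb bbbb bbbb", "bbbb bbbb", "bbbb bbbb bbbb bbbb"]
labels = ["target", "other", "other", "other"]
vocab, centroids = train(docs, labels, n=2, top_k=50)
print(predict("aaaa aaaa", vocab, centroids, n=2))
```

Because each class profile is an average over that class's own documents, the small class is not swamped by the large one at prediction time. The vocabulary, however, is still chosen from the whole collection, so with k << n-k it may be worth checking that n-grams typical of the k users actually make the cutoff. In practice a proper classifier and held-out evaluation would replace the nearest-centroid rule.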

Some references follow. Koppel and Argamon are probably good researchers to follow in this area.

  • S. Hariharan, Gender Prediction in Chat based Medium’s Using Text Mining (2011), in: International Journal of Research and Reviews in Information Sciences (IJRRIS), 1:1
  • S. Argamon, M. Koppel, J. W. Pennebaker and J. Schler, Automatically profiling the author of an anonymous text (2009), in: Commun. ACM, 52:2(119-123)
  • J. Schler, M. Koppel, S. Argamon and J. Pennebaker, Effects of age and gender on blogging (2006), in: Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs
  • M. Koppel, S. Argamon and A. R. Shimoni, Automatically categorizing written texts by author gender (2002), in: Literary and Linguistic Computing, 17:4(401)

answered Jun 10 '11 at 20:09

Robert Layton

