|
Hi, I have a collection of n text documents, each containing user-generated text. Of these n documents, k belong to one user class and the rest belong to another, with k << n-k. I would like to build a signature for the k users from their text-based content. What algorithms (machine learning or natural language processing) should I look into? I am aware that language models could be one avenue. |
|
Have a look at authorship profiling methods. They try to determine properties of an author from features of their text. The properties most commonly studied are age, gender, and first language, but others are possible. Authorship profiling is basically a text classification task whose classes are defined by the author attribute you want to find; however, the techniques are a bit different. For example, character n-grams are one of the best feature types, and using the most frequent n-grams usually does better than taking the middle of the frequency range, as is done in standard text categorization. For references, Koppel and Argamon are probably good researchers to follow in this area.
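To make the character n-gram idea concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions, not a method taken from the literature mentioned above: the function names, the trigram size, the `top_k` cutoff, the centroid-plus-cosine comparison, and the toy documents are all hypothetical choices. It keeps the most frequent character n-grams as features, averages the feature vectors of each class into a "signature", and assigns a new text to the class with the closest signature.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_vocabulary(docs, n=3, top_k=200):
    """Keep the top_k most frequent n-grams over all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(char_ngrams(doc, n))
    return [gram for gram, _ in counts.most_common(top_k)]

def vectorize(text, vocab, n=3):
    """Relative frequency of each vocabulary n-gram in the text."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values()) or 1  # avoid division by zero on short texts
    return [counts[gram] / total for gram in vocab]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(text, vocab, signatures, n=3):
    """Return the label whose signature is most similar to the text."""
    vec = vectorize(text, vocab, n)
    return max(signatures, key=lambda label: cosine(vec, signatures[label]))

# Toy data (hypothetical): the small class of interest vs. everyone else.
target_docs = ["zzzz zzzz", "zzzz zz"]
other_docs = ["aaaa aaaa", "aaaa aa"]

vocab = build_vocabulary(target_docs + other_docs)
signatures = {
    "target": centroid([vectorize(d, vocab) for d in target_docs]),
    "other": centroid([vectorize(d, vocab) for d in other_docs]),
}
label = classify("zzzzz", vocab, signatures)
```

Because each class is summarized by an average vector, the signature for the k users is not swamped by the much larger n-k class, though with real data a proper classifier and class weighting would likely be worth trying instead of this nearest-centroid comparison.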
|