
I'm currently working on a hobby project that deals with machine learning. My goal is to detect the topic (e.g. Technology or Health) of a given article based on its top keywords.

My plan was to manually assign a topic to a number of articles, then write a script that would automatically classify other articles based on keyword scores.

For instance, if I have an article whose top keywords are "machine learning" and "artificial intelligence", and I manually set the topic to "Technology", then each of those keywords would get +1 for Technology.

The problem is that whenever a new keyword appears, I won't be able to classify the article automatically, as the keyword will be unknown to the system.
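
To make the scheme concrete, here is a minimal sketch of what I have in mind, assuming each article is just a list of keyword strings. The names `keyword_scores`, `label_article` and `classify` are made up for illustration. The last call shows the failure case: an article with only unseen keywords gets no topic at all.

```python
from collections import Counter, defaultdict

# keyword -> Counter of topic votes (hypothetical structure, for illustration)
keyword_scores = defaultdict(Counter)


def label_article(keywords, topic):
    """Record a manually labelled article: each keyword gets +1 for the topic."""
    for kw in keywords:
        keyword_scores[kw][topic] += 1


def classify(keywords):
    """Sum the topic votes of the known keywords; return None if none are known."""
    totals = Counter()
    for kw in keywords:
        if kw in keyword_scores:  # membership test avoids creating empty entries
            totals.update(keyword_scores[kw])
    if not totals:
        return None  # every keyword is unknown to the system
    return totals.most_common(1)[0][0]


label_article(["machine learning", "artificial intelligence"], "Technology")
label_article(["nutrition", "exercise"], "Health")

print(classify(["machine learning", "robotics"]))  # one known keyword is enough
print(classify(["quantum computing"]))             # nothing known: no topic
```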

Is there another way of doing this?

Thanks :)

asked Sep 17 '10 at 15:31


Christian Joudrey


One Answer:

You can use a semi-supervised learning algorithm. A very simple one in your setting is naive Bayes with EM. In standard naive Bayes you have a few classes, and for each class you keep a count vector (counting, for each keyword, how many times it was seen in that class). To classify a document, compute for each class the product, over the words in the document, of (count(word, class) + 0.1) / (sum over w of count(w, class) + 0.1 * total_words_seen_so_far). After you do this for all classes, normalize these probabilities (i.e., divide each one by the sum over all classes). The class of the new document is the one with the highest normalized probability. Because of the 0.1 smoothing term, a keyword that has never been seen still gets a small nonzero probability in every class, so an unknown keyword no longer blocks classification.

Now, for each word w in that document and each class c, set count(w, c) = count(w, c) + pc, where pc is the normalized probability you just computed for class c.
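
The procedure above could be sketched roughly as follows. This is only an illustration under a few assumptions: `SoftNaiveBayes` and its method names are made up, "total_words_seen_so_far" is read as the number of distinct keywords seen so far, and no class prior is used since the description does not include one. The product is computed in log space to avoid underflow on long keyword lists.

```python
import math
from collections import defaultdict

ALPHA = 0.1  # smoothing constant from the formula above


class SoftNaiveBayes:
    def __init__(self, classes):
        self.classes = list(classes)
        # counts[c][w]: possibly fractional count of keyword w in class c
        self.counts = {c: defaultdict(float) for c in self.classes}
        self.vocab = set()

    def add_labeled(self, keywords, label):
        """Hard counts from a manually labelled article."""
        for w in keywords:
            self.counts[label][w] += 1.0
            self.vocab.add(w)

    def posterior(self, keywords):
        """Normalized class probabilities for a list of keywords."""
        # Assumption: vocabulary size stands in for "total words seen so far".
        v = max(len(self.vocab), 1)
        log_scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values())
            log_scores[c] = sum(
                math.log((self.counts[c].get(w, 0.0) + ALPHA) / (total + ALPHA * v))
                for w in keywords
            )
        m = max(log_scores.values())  # subtract the max before exponentiating
        unnorm = {c: math.exp(s - m) for c, s in log_scores.items()}
        z = sum(unnorm.values())
        return {c: p / z for c, p in unnorm.items()}

    def classify_and_update(self, keywords):
        """Classify, then add the class probabilities to the counts."""
        probs = self.posterior(keywords)
        for w in keywords:
            self.vocab.add(w)
            for c in self.classes:
                self.counts[c][w] += probs[c]
        return max(probs, key=probs.get), probs


nb = SoftNaiveBayes(["Technology", "Health"])
nb.add_labeled(["machine learning", "artificial intelligence"], "Technology")
nb.add_labeled(["nutrition", "exercise"], "Health")

# "neural networks" is new, but the article is still classified, and the new
# keyword picks up a fractional Technology count for later articles.
label, probs = nb.classify_and_update(["machine learning", "neural networks"])
print(label, probs)
```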

This is roughly what the EM algorithm does for naive Bayes (with a few simplifications that shouldn't harm performance too much). For more information, see Nigam et al., Semi-supervised text classification with EM. An easy way to improve performance is to go through all the documents again and do both steps separately (i.e., first compute all the probabilities, then use those probabilities to recompute the class counts, and repeat until convergence).
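
A rough sketch of that batch variant, under the same assumptions as the formula above (0.1 smoothing, no class prior, vocabulary size standing in for the total number of words). The function names, the `max_iters` and `tol` parameters, and the convergence test on the change in probabilities are illustrative choices, not taken from the paper.

```python
import math
from collections import defaultdict

ALPHA = 0.1  # smoothing constant


def posterior(counts, vocab_size, keywords):
    """Normalized class probabilities for one document (computed in log space)."""
    log_scores = {}
    for c, wc in counts.items():
        total = sum(wc.values())
        log_scores[c] = sum(
            math.log((wc.get(w, 0.0) + ALPHA) / (total + ALPHA * vocab_size))
            for w in keywords
        )
    m = max(log_scores.values())
    unnorm = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: p / z for c, p in unnorm.items()}


def build_counts(classes, labeled, unlabeled, resp):
    """Hard counts from labelled docs plus soft counts from unlabelled docs."""
    counts = {c: defaultdict(float) for c in classes}
    for keywords, label in labeled:
        for w in keywords:
            counts[label][w] += 1.0
    for keywords, probs in zip(unlabeled, resp):
        for w in keywords:
            for c in classes:
                counts[c][w] += probs[c]
    return counts


def em_naive_bayes(classes, labeled, unlabeled, max_iters=50, tol=1e-6):
    vocab = {w for kws, _ in labeled for w in kws}
    vocab |= {w for kws in unlabeled for w in kws}
    v = max(len(vocab), 1)
    counts = build_counts(classes, labeled, [], [])  # start from labelled data only
    resp = None
    for _ in range(max_iters):
        # Step 1: compute all probabilities with the current counts.
        new_resp = [posterior(counts, v, kws) for kws in unlabeled]
        # Step 2: rebuild the counts from scratch using those probabilities.
        counts = build_counts(classes, labeled, unlabeled, new_resp)
        if resp is not None:
            change = max(
                (abs(n[c] - o[c]) for n, o in zip(new_resp, resp) for c in classes),
                default=0.0,
            )
            if change < tol:
                resp = new_resp
                break
        resp = new_resp
    return counts, resp


labeled = [
    (["machine learning", "artificial intelligence"], "Technology"),
    (["nutrition", "exercise"], "Health"),
]
unlabeled = [
    ["machine learning", "neural networks"],
    ["neural networks", "gpu"],  # shares no keyword with the labelled articles
    ["exercise", "sleep"],
]
counts, resp = em_naive_bayes(["Technology", "Health"], labeled, unlabeled)
for kws, probs in zip(unlabeled, resp):
    print(kws, max(probs, key=probs.get))
```

In this toy data the second unlabelled article contains no labelled keyword at all; it can only lean toward Technology through "neural networks", which it shares with the first unlabelled article. That is the kind of propagation the repeated passes are meant to provide.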

answered Sep 17 '10 at 15:52


Alexandre Passos ♦


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.