I'm working with NLP to classify documents. I'm at the point where the NLP software is returning classification probabilities for each category, and the probabilities are quite accurate.

The next task is to use these probabilities to assign one or more categories to each unlabeled document. Each document in my dataset has:

  • Title
  • Body

Each document is categorized in three ways:

  • Type A: 30 categories, where each document must belong to at least one and at most two categories. If no category is a strong match, the document is assigned to an "Unknown" category.

  • Type B: 10 other categories, where each document is only associated with a category if there is a strong match, and each document can belong to as many categories as match.

  • Type C: 4 other categories, where each document must belong to exactly one category. If there isn't a strong match, the document is assigned to a default category.

The NLP tool classifies the Title and Body separately (training and using a different model for each). The Body produces the more accurate classifications, but surely using the Title can help improve accuracy in some cases (right?). Example result probabilities might look like this (imagining there are only 5 categories):

Document 1
----------------
Title:
Category 1 0.950
Category 2 0.030
Category 3 0.020
Category 4 0.005
Category 5 0.005

Body:
Category 1 0.920
Category 2 0.050
Category 3 0.020
Category 4 0.020
Category 5 0.010

Document 2
----------------
Title:
Category 1 0.572
Category 2 0.185
Category 3 0.092
Category 4 0.077
Category 5 0.074

Body:
Category 1 0.129
Category 2 0.785
Category 3 0.052
Category 4 0.032
Category 5 0.002

Document 3
----------------
Title:
Category 1 0.455
Category 2 0.425
Category 3 0.102
Category 4 0.010
Category 5 0.008

Body:
Category 1 0.462
Category 2 0.449
Category 3 0.081
Category 4 0.006
Category 5 0.002
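
One possible way to merge the Title and Body results into a single score per category is a weighted average that trusts the Body more. The following is only a minimal Python sketch under stated assumptions: the function name `combine` and the weight of 0.7 are hypothetical placeholders, not something the NLP tool provides, and the weight would need to be tuned on labeled documents.

```python
def combine(title_probs, body_probs, body_weight=0.7):
    """Weighted average of two {category: probability} dicts.

    body_weight is an assumed placeholder; a category missing from
    one of the dicts is treated as having probability 0.0 there.
    """
    title_weight = 1.0 - body_weight
    categories = set(title_probs) | set(body_probs)
    return {
        c: title_weight * title_probs.get(c, 0.0)
           + body_weight * body_probs.get(c, 0.0)
        for c in categories
    }


# Document 2 from the examples above: Title and Body disagree.
title = {"Category 1": 0.572, "Category 2": 0.185, "Category 3": 0.092,
         "Category 4": 0.077, "Category 5": 0.074}
body = {"Category 1": 0.129, "Category 2": 0.785, "Category 3": 0.052,
        "Category 4": 0.032, "Category 5": 0.002}

combined = combine(title, body)
best = max(combined, key=combined.get)
print(best, round(combined[best], 3))  # Category 2 0.605
```

With this weighting the Body still wins for Document 2, but the combined score (about 0.605) is lower than the Body score alone, which reflects the disagreement between the two models.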

As a human, I can fairly easily analyze the results and see which category(s) should be assigned, whether no category is a strong match, etc.

What I need now is an automated way to do this. Essentially, I need to fill in these functions:

determineCategoriesTypeA($title_probabilities, $body_probabilities)
     Returns one category, two categories, or "unknown" if there
     are no strong matches.

determineCategoriesTypeB($title_probabilities, $body_probabilities)
     Returns 0-10 categories, but only those that are strong matches
     (most likely 2-3 max will be strong matches).

determineCategoryTypeC($title_probabilities, $body_probabilities)
     Returns one category, or the default category if there is no strong match.
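
To make the three stubs above concrete, here is a rough Python sketch based on simple fixed thresholds. It assumes each function receives a single {category: probability} dict per document (the Body probabilities, or some merged Title/Body score), and every threshold value is an arbitrary placeholder that would need tuning on labeled data. It also assumes that for Type B the scores of different categories can be high at the same time, which may not hold if the tool forces them to sum to 1.

```python
def rank(probs):
    """Return (category, probability) pairs, highest probability first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)


def determine_categories_type_a(probs, strong=0.6, pair_min=0.35, pair_gap=0.15):
    """One category, two close categories, or "Unknown"."""
    (first, p1), (second, p2) = rank(probs)[:2]
    if p1 >= strong:
        return [first]                      # one clear winner
    if p2 >= pair_min and p1 - p2 <= pair_gap:
        return [first, second]              # two near-tied candidates
    return ["Unknown"]


def determine_categories_type_b(probs, threshold=0.4):
    """Every category whose score reaches the threshold (possibly none)."""
    return [c for c, p in rank(probs) if p >= threshold]


def determine_category_type_c(probs, threshold=0.5, default="Default"):
    """The top category if it is strong enough, otherwise the default."""
    top, p = rank(probs)[0]
    return top if p >= threshold else default


# Body probabilities of the three example documents.
doc1 = {"Category 1": 0.920, "Category 2": 0.050, "Category 3": 0.020,
        "Category 4": 0.020, "Category 5": 0.010}
doc2 = {"Category 1": 0.129, "Category 2": 0.785, "Category 3": 0.052,
        "Category 4": 0.032, "Category 5": 0.002}
doc3 = {"Category 1": 0.462, "Category 2": 0.449, "Category 3": 0.081,
        "Category 4": 0.006, "Category 5": 0.002}

for doc in (doc1, doc2, doc3):
    print(determine_categories_type_a(doc),
          determine_categories_type_b(doc),
          determine_category_type_c(doc))
```

With these placeholder thresholds, Document 3 gets both Category 1 and Category 2 under Type A, but falls back to the default under Type C because neither score reaches 0.5.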

So I've got to figure out how to analyze the probabilities and return the categories that match. I don't have a statistics background (I'm just a lowly programmer), so I'm not sure what the best solution is or even where to start.

Edit:

This problem seems to be called multi-label classification. There is a very comprehensive PDF on the topic, which contains a list of "thresholding strategies" that can be used. The authors have created an add-on library for Weka called Mulan that implements a number of multi-label learning and thresholding strategies.

At this point I need to either try Weka and Mulan, or implement one of the thresholding strategies using the results from my current NLP tool.
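
If I implement thresholding myself, one simple strategy would be to pick a single global cut-off by trying several candidates on labeled validation documents and keeping the one with the best micro-averaged F1. The Python sketch below illustrates that idea only; it is not taken from Mulan or from the PDF, the helper names are hypothetical, and the validation data is made up purely for illustration.

```python
def micro_f1(predicted, actual):
    """Micro-averaged F1 over parallel lists of label sets."""
    tp = sum(len(p & a) for p, a in zip(predicted, actual))
    fp = sum(len(p - a) for p, a in zip(predicted, actual))
    fn = sum(len(a - p) for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def tune_threshold(prob_dicts, true_labels, candidates):
    """Return (threshold, score) for the candidate with the best micro-F1.

    Ties are resolved in favor of the earliest candidate.
    """
    best_t, best_score = None, -1.0
    for t in candidates:
        predicted = [{c for c, p in probs.items() if p >= t}
                     for probs in prob_dicts]
        score = micro_f1(predicted, true_labels)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Made-up validation set: classifier scores plus hand-assigned labels.
validation_probs = [
    {"A": 0.90, "B": 0.20, "C": 0.10},
    {"A": 0.45, "B": 0.60, "C": 0.10},
    {"A": 0.10, "B": 0.35, "C": 0.70},
]
validation_labels = [{"A"}, {"A", "B"}, {"C"}]

candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
best_t, best_score = tune_threshold(validation_probs, validation_labels, candidates)
print(best_t, best_score)
```

The same loop could be run once per category to get per-category thresholds, at the cost of needing more labeled documents for each category.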

asked Nov 24 '11 at 17:22

Roger

edited Nov 24 '11 at 22:09
