Revision history
Revision n. 6

Feb 01 '11 at 15:26

mfhughes

Some input from a non-expert here. I am developing a text classifier as well, with high dimensionality (many word features) and a large number of input documents.

Based on the advice of Alexandre Passos, I achieved a large jump in accuracy by using the 'maximum entropy' classifier in my ML library, Python NLTK. This model is also known as logistic regression. The maxent classifier uses the exact same feature data structures as the naive Bayes classifier, so it was a slide-in replacement.
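A minimal sketch of that slide-in swap, assuming NLTK (with NumPy, which the maxent trainer needs) is installed. The toy training data and the bag-of-words feature function are invented for illustration; the point is that the same `(featureset, label)` pairs feed both trainers:

```python
from nltk.classify import NaiveBayesClassifier, MaxentClassifier

def features(text):
    # Bag-of-words featureset: a dict of {token: True}.
    # The identical dicts work for both classifiers.
    return {word: True for word in text.lower().split()}

train = [
    (features("cheap pills buy now"), "spam"),
    (features("limited offer buy cheap"), "spam"),
    (features("meeting agenda for monday"), "ham"),
    (features("project status and agenda"), "ham"),
]

nb = NaiveBayesClassifier.train(train)
# Same training list slides straight into the maxent trainer.
me = MaxentClassifier.train(train, trace=0, max_iter=10)

test = features("buy cheap pills")
print(nb.classify(test), me.classify(test))
```

`trace=0` silences the per-iteration training output, and `max_iter` caps the iterative optimizer so the example finishes quickly.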

I struggled for a couple of months with naive Bayes and feature selection (tokens, bigrams, the top 1000 most informative features, etc.). I could never achieve better than 50% accuracy.

You might also want to build a separate classifier (perhaps as an input or adjunct to your main classifier) that looks at file metadata: discretized creation date, file owner, file size, file type/extension, email headers, etc. I had surprisingly good results classifying on this type of data with naive Bayes. I still need to explore this idea further with other classifiers.
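A sketch of what such a metadata featureset might look like, using only the Python standard library. The feature names and bucket thresholds here are illustrative choices, not anything prescribed above; the resulting dicts plug into the same classifier `train()` calls as the word features:

```python
import os
import time

def metadata_features(path):
    """Discretized file-metadata features (names and buckets are illustrative)."""
    st = os.stat(path)
    _, ext = os.path.splitext(path)
    return {
        "ext": ext.lower(),
        # Discretize size into coarse buckets rather than using raw byte counts.
        "size_bucket": ("small" if st.st_size < 10_000 else
                        "medium" if st.st_size < 1_000_000 else
                        "large"),
        # Discretize the modification timestamp down to the year.
        "mtime_year": time.gmtime(st.st_mtime).tm_year,
    }
```

Discretizing (bucketing sizes, truncating dates) keeps the feature space small and categorical, which suits naive Bayes and maxent alike.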
