1
2

I am currently looking into using NLP based features to build a machine learning classifier for a problem at hand.

I have a collection of 12,000 text documents where each document is labelled into two classes: X and Y. I would like to build an ML classifier to distinguish between these two classes. I would like to in particular use NLP bases features for the ML classifier. Initially, I would like to explore POS tags and grammatical relations(also known as typed dependencies) as features. The text documents are noisy viz. the language is informal (SMS, chat style communication).

Firstly, taking into account that POS Taggers and Typed Dependency Parsers are trained on formal language would it make sense (since my data is noisy) to rely on them to use them as features for my ML classifier?

Secondly, assuming I can rely on these tools (this is a more generic question) is there way I can extract the statistically(or other pre-defined parameter) significant POS Tag sequence(s) OR Grammatical Relation/ Typed Dependency sequence(s) from each of the individual classes? Something like a list of Longest Common Subsequence(s) ranked by their order of statistical (or other predefined parameter) significance. I came across Tree Kernels (in addition to String Kernels) as an alternative to LCS.

Your thoughts/opinions on the same ?

PS: Kindly pardon me if my questions sound amateurish.

asked Sep 20 '10 at 06:47

mcenley's gravatar image

mcenley
356243436


4 Answers:

If your classifier is any good (that is, robust) it should be able to deal with feature noise, which is what essentially happens when you try to run NLP tools on informal data. Of course, they might be too noisy and/or irrelevant for your problem, so try many variations of incorporating them.

About your second question, if your classifier is any good (here I mean regularized) it should be able to separate relevant from irrelevant POS n-grams by itself (that is, define a feature for each POS tag n-gram in each document, and let the classifier figure out which are relevant). About the dependency relations, they are harder to incorporate in a classifier. String kernels extend the idea of POS n-grams to arbitrary length, but I'm not a fan of using strict kernel-based methods for this sort of problem since (a) you need to store n² kernel values before starting the optimization (or be ready to compute them over and over again), and this does not sound like a good idea with 12k documents, and (b) it gets hard to mix and match lots of different kernels with hand-crafted features, which is the dominant way to make progress in NLP. For dependency data I'm not sure what is the standard way to use them as features, but you could try using features for subtrees of fixed size (the tree equivalent to ngrams), either considering or ignoring word identity.

answered Sep 20 '10 at 07:19

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1893744214333

Alexandre, Thanks for the response !

1] I do understand some classifiers are immune to noise but I want to avoid the Garbage in- Garbage out scenario viz. giving incorrect POS tags and grammatical sequences may lead to incorrect results (though classification may work but results won't be generalizable).

2] W.r.t. to POS tags, I want my input feature to be a POS tag sequence (completely complete) of the document (of any length) and hence don't want to restrict their length/size on n. The length of my documents significantly vary and hence, this is one requirement which I may not be able to relax. Even if I am, I would like to pose a generic question (assuming it's not) as to how to incorporate such sequences as "features" in a ML classifier. The number of permutations/combinations of such a sequence would be very huge and hence the traditional method of counting (and incorporating) such occurrences as features would lead to an exponential increase in feature dimension which would lead me to apply a new feature reduction algorithm on the feature space. I would like to avoid that scenario as after the decision boundary is formed by the classifier I would like to know the best POS tag sequences (I want to avoid doing this manually and encode them as features) ranked by their discriminatory power which led to successful classification. Therefore, I thought string kernels would be a nice way forward.

3] W.r.t. to Grammatical relations, I wouldn't like to limit the subtrees for reasons similar to POS tag sequences. Tree Kernels (as mentioned in the link in my post) provide a good alternative to avoid such tasks. Is there any specific reason not to use them?

4] Obviously, using different kernels for different types of features( POS tags, Grammatical relations aren't the ONLY features for my classifier, other features include standard statistical features like average word length/sentence, number of function words etc.) introduces a whole new ( and probably rather complex) dimension. Something which currently I am not exposed to. Is it possible to have different kernels for different features in the same classification task? OR Should I use three different classifiers depending on the kernels and combine their output to another classifier and make a decision?

PS: Thanks for bearing with me !

(Sep 20 '10 at 07:44) mcenley
2

(1) To protect against over fitting the best way is to use separate validation and test sets. Doing that you shouldn't have to worry about feature noise

(2) String kernels are completely uninteresting if you want an interpretable model because they are completely black-box. All a string kernel model will tell you is that such and such documents are the more relevant to classification. If you use POS n-grams (or up-to-n-grams), what you get is essentially a finitist approximation to string kernels, but one that lets you treat directly each substring as its own feature. After training a regularized linear classifier you can inspect the weights it assigns to different POS n-grams and see which are more relevant, or you can even use the lasso (l1 regularization) or elastic net (l1 and l2 regularization, which generalizes better than the lasso) to bring down the weights of most features and see only the relevant ones when you actually inspect your feature vector. Then you can trim your feature set to use only features similar to those and train again (with l2 regularization) to get the best classifier you can get. To repeat: the best way of finding which ones are relevant is indeed dealing with them explicitly, since string kernels are completely opaque, and then inspecting the weight vector, possibly using the lasso to do model selection.

(3) Tree kernels are just like string kernels, so the argument I gave in (2) applies: they are opaque (you can't tell from a classifier which subtrees do matter) and can be well approximated by lots of features corresponding to finite subtrees.

(4) Combining different kernels in the same classifier is a much harder problem (called kernel learning or multiple kernel learning). There's been a lot of recent work in this area, trying to reliably tune different kernels for a single prediction task, but the overall complexity of training is a lot higher and the results are not as good. There was a workshop in nips 2009 where you can get more references. Overall, for what you seem to want, I'd say there are two ways you can go: one is to use an unstructured linear model (logistic regression or svms, basically) and use a lot of features (tag up-to-n-grams, subtrees of up-to-n length) and model selection (for interpretability; l2 regularized logistic regression is known to perform well with a ridiculously high number of irrelevant features in nlp problems, against some theoretical work) to find out what matters. The second way is to explicitly use a structured model, like a CRF, and make the POS tags and dependency tree part of the model itself, such as is done in this paper: http://www.blogger.com/aclweb.org/anthology-new/N/N10/N10-1120.pdf (Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables). If your problem is binary classification, however, such a model should be overkill and you're probably better with a featuritis model.

Fundamentally, string and tree kernels are nice if what you're after is performance, but for non trivial tasks specific features are usually better, since with these kernels you're effectively averaging over all possible features, which decreases interpretative power and makes it sure you will need a lot of documents to represent the simpler decision boundaries (since you will effectively need to "average out" most of the infinite features availiable to these models).

(Sep 20 '10 at 07:59) Alexandre Passos ♦

Alexandre, Thanks once again !

1] Great!

2] Thanks for clearing the String/Tree Kernel doubt. In order to make this discussion more generic let us consider two scenarios:

2.a] Using String/Tree Kernels as I don't want to find the most discriminatory sequences Since, this is opaque and will require multiple kernel learning for the prediction tax Shogun (http://www.fml.tuebingen.mpg.de/raetsch/suppl/shogun) Machine Learning toolbox seems a good implementation. One point I would like to stress is that though my dataset size is small, the real-world setting may have millions of instances. I could only get so much annotated! Hence, I am worried about the performance of the classifier. I haven't ruled out the possibility of MKL as yet.

2.b] Not using String/Tree Kernels to find the distinguishing sequences Limiting the size of POS tags and Subtrees sounds a good idea. However, I have a question regarding the choice of the classifier. I guess I may have to used LibLinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/index.html) instead of LibSVM with a linear kernel? Does SVM provide me an option for l1 or l2 regularization? I am yet not clear on how I could use the output provided by a SVM to "interpret" the model. I know one could get the distance of a point from the hyperplane. (http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f4151). I also see some work done by FML-MPI to improve the "interpretability of predictive classifiers like SVMs" . FIRM (http://www.fml.tuebingen.mpg.de/raetsch/members/raetsch/bibliography/ZKSR2009) seems to be one such algorithm. I also read the CRF + dependency tree paper but I can't use the same (in the current task) as I don't have such phrase/word level annotations (in this paper sentiment is annotated or calculated using a lexicon). Unfortunately, in the current setting I guess this may not help.

3] I read the paper link posted by Awais (Sentiment Classification Using Word Sub-sequences and Dependency Sub-trees) and they use the most frequent word sequences and Dependency sub-trees. How does the idea come across to you? My first reaction to it is that it may be missing some important features as my understanding is that frequent doesn't necessarily mean discriminator and vice-versa. If I'm wrong, I was wondering if I can mix-match and use string/tree kernels for classification and use the frequent subsequence/tree mining algorithms from the paper to characterize my class.

:-)

(Sep 20 '10 at 11:17) mcenley
1
  1. a) Just to clarify, you only need MKL if you want to use more than one out of string kernels, tree kernels, and other features. Shogun seems nice. The size (and prediction complexity) of the SVM with kernels depends linearly on the training set size, so if your training set is small this shouldn't be an issue.

  2. b) I like to use logistic regression instead of support vector machines precisely because of interpretability. I use a variant of the standard theano implementation (you'll find it in deeplearning.net/tutorial ) that I tuned to support elastic net and optimization by L-BFGS (which gives better results on small data sets than stochastic graident descent). To interprete the weights of any linear classifier (logistic regression and svm included) you need to have access to the classifier itself (and I'm not sure how to do it with either liblinear or libsvm, but with the theano implementation it's trivial). What you're interested in are the features whose weights have the highest modulos. If the weights are positive, the features are generally good indicators of the positive class; same for negative weights. If the weights are negative the features are highly predictive of the negative class. This requires that all features have roughly the same mean and variance, but this is very easy to ensure in an nlp setting, since most of them are binary. My code is really hacky, but it should be easy to extend the theano implementation to do what I did. You can get a good, fast implementation of logistic regression (that does support model selection, although not with the lasso) at http://www.cs.utah.edu/~hal/megam/ , and a faster general linear classifier implementation at http://hunch.net/~vw/ ; both make it relatively easy to tune the hyperparameters and inspect the weights.

3) Yes, this is an issue with looking at only the most frequent X when doing classification, for all values of X. If you can afford it's better to do some model selection during training instead of doing that, since the model selection will take care of the irrelevant features (which might just as well be frequent, i.e. stopwords) by itself.

(Sep 20 '10 at 11:35) Alexandre Passos ♦

Thanks Alexandre !

1 a] Yes. I will have to use more than one.

2 b] If needed (and appropriate) I will contact you later for your code. I will also in the mean time delve more into logistic regression. I guess megam should be good even if no lasso. What say? The Yahoo research link looks more like an online learning algorithm. I will currently stay away from it.

3] Thanks for buttressing me on this point. However, by model selection I hope you mean interpretability of the model built with the training data and tested on the test data.

(Sep 20 '10 at 11:54) mcenley

3) By model selection I mean most features having weight 0. If this is true, the model is very interpretable, since you can compare the few remaining weights and see what makes sense. What I meant is that it's better to have an algorithm choose which features will matter than heuristics beforehand, if you can afford it.

(Sep 20 '10 at 11:56) Alexandre Passos ♦

Also, vowpal wabbit is really fast and can easily train better-performing classifiers than many batch algorithms.

(Sep 20 '10 at 12:07) Alexandre Passos ♦
showing 5 of 7 show all

I would start simple, and use a bag-of-words. It's a good baseline for many textcat applications.

answered Sep 22 '10 at 23:31

Joseph%20Turian's gravatar image

Joseph Turian ♦♦
467541105126

Joseph, The BOW baseline is already implemented and as mentioned in the comments to Awais the precision/recall of the system isn't good. Hence, the problem requires deeper understanding of text using the linguistic information.

(Sep 23 '10 at 05:14) mcenley

What is the actual problem?

(Sep 23 '10 at 16:52) Joseph Turian ♦♦

Joesph,

I would like to encode POS tags and grammatical relations (typed dependencies) OR syntactic parse trees as features for a binary text classification task at hand (using a Machine Learning classifier). The underlying data is noisy (SMS, forum style data). I am planning to use the Stanford POS Tagger and Stanford Dependency parser for extracting the POS tags and syntactic parse trees. I understand that these tools may not work correctly for noisy data but assuming that these tools work lets think about how we could use deeper linguistic cues to help text classification. My question is more generic, in terms of using NLP based features like POS tags and syntactic parse trees as features for a ML classifier. The Bag-Of-Words (BOW) model has been tried for the aforementioned task but the precision/recall is very low (as expected).

Following are my requirements : 1]Building a ML classifier with various types of features like String (POS tags), Trees (Syntactic parse trees), Numeric (Average word length, number of punctuation marks), Binary etc 2]I want to avoid manual selection of features (by eye-balling) as far as possible (to avoid overfitting to the dataset and make the classifier more generalizable and to avoid any loss of information). For example, I would like to feed the complete POS tags/Grammatical relations to the classifier and let the classifier deal with them 3]Feature selection algorithm before feeding the features to the classifier is an open question as the requirement in point 1 may lead to an exponential increase in features 4]The model developed must be "interpretable" viz. I should know which features are the most important for the classification task (depending on their feature weights etc.)

Based on my requirements above I have the following ways(based on prior approaches in literature) in which I can encode such NLP based features: 1]Traditional way of eye-balling and selecting the discriminating features (POS tags and syntactic parse trees) and encoding them depending on their presence in text (binary) OR the number of times they appear in text (statistical) 2]n-gram POS sequences and n length dependency sub trees (all of them). The paper (Sentiment Classification Using Word Sub-sequences and Dependency sub-trees) which uses the idea 3]Generalized Dependency Trees. The paper(Generalized Dependency Features for Opinion Mining) which uses such an idea for an Opinion Mining Task 4]Use of String/Tree Kernels with MKL (least preferred choice due to difficulty in modeling and interpretability)

I would like to know which of the above would be the best way to start with keeping in mind my requirements. I understand POS taggers/Dependency parsers may not be accurate for noisy data but are there other similar tools which handle noisy text?

(Sep 24 '10 at 07:42) mcenley

For your second question, you might want to take a look at http://www.springerlink.com/content/n9ep5q1gcupeya8r/

answered Sep 20 '10 at 07:38

Awais%20Athar's gravatar image

Awais Athar
462

Awais, Thanks for the paper !

My first reaction to it is that it may be missing some important features as my understanding is that frequent doesn't necessarily mean discriminator and vice-versa. Your thoughts?

(Sep 20 '10 at 11:17) mcenley

Yes, you are right. At the most, you can use the FREQT program (mentioned in the paper) to enumerate all subtrees and perform very basic frequency-based filtering. After that, you can always do some sort of feature selection. You might also want to try using generalized dependency relations as features. (http://www.aclweb.org/anthology/P/P09/P09-2079.pdf)

In case you decide to use LIBSVM, there's a tool available which helps you get the decision function for linear SVMs. (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#primal_variable_w_of_linear_svm_and_feature_selection)

(Sep 21 '10 at 11:26) Awais Athar

Awais, Thanks ! One line from the paper which immediately got my attention was "one critical issue might be the sparseness of the very specific linguistic features, which may cause the classifier learned from such features to not generalize". Does this mean string/tree kernels are a bad bet? Also, if I implement my features as Alexandre has mentioned wouldn't it lead to the same "feature explosion" and "non-generalizable" phenomena mentioned in the paper?

Currently, I would like to stay away from feature selection techniques. I am apprehensive about how good they work. In addition to the paper you mentioned, I also looked into this paper (http://portal.acm.org/citation.cfm?id=1597270) which implements such generalized dependency features.

However, wouldn't it be better to start with simple dependency features and try them out first before moving to such lexicalized/generalized dependency representations?

How does the the feature selection work for SVMs? All they have mentioned is that there's a feature selection algorithm.

(Sep 22 '10 at 07:14) mcenley

I am not really sure about string/tree kernels being a bad bet. I guess it depends on the nature and size of your corpus as well. Given that it is chat/SMS text, you can to to clean it up a bit (http://aclweb.org/anthology/C/C10/C10-2022.pdf). What kind of results do you have for a bag-of-words model?

The Joshi-Rose paper uses chi-squared feature selection. If time and memory are not a problem, You can try different attribute selection methods on WEKA. Its SVMAttributeEval tool might be of interest to you.

answered Sep 22 '10 at 09:18

Awais%20Athar's gravatar image

Awais Athar
462

1

The problem with string/tree kernels is that the weight vector can't be interpreted (the dot product happens in an infinite dimensional projected space, and all you have to represent that are weights assigned to some training examples). So this is why I suggest explicit truncated string/tree kernels because if you do that the weight space become finite and you become capable of inspecting individual features.

(Sep 22 '10 at 09:46) Alexandre Passos ♦

Awais, Thanks! The results on the BOW model (as expected) aren't good : low precision and recall. I could clean up using some simple heuristics but that's it and it may not help much. Just to note the actual records may go up to millions. Only, the size of my dataset is 12k. Plus I am concerned about performance.

Alexandre, Wouldn't feature explosion be an issue in such a case? Also, since I have two definitive (and discrete) classes should I use a logistic regression (continuous classes) model? Just trying to validate my research method/design.

(Sep 22 '10 at 09:59) mcenley
1

Logistic regression does not necessarily mean continuous classes, you can use it as a black box classifier. Yes, you will have a lot of features, and you might have to tune the size of the POS ngrams and subtrees to control the amount of features you're using. After that, I suggest you use an algorithm that does feature selection (such as l1 regularization or megam's approximate bayes factor feature selection).

(Sep 22 '10 at 10:01) Alexandre Passos ♦
1

A fast implementation of l1 regularization is http://ttic.uchicago.edu/~tewari/code/smidas/

(Sep 22 '10 at 10:02) Alexandre Passos ♦

Thanks Alexandre! This discussion was one of the best ML discussions I've had. :-)

(Sep 22 '10 at 10:18) mcenley
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.