I'm not really familiar with NLP-type stuff, so part of the purpose of this question is simply to understand the lingo.

The background: I'm building predictive models on text using n-gram features. These get very good performance in terms of AUC, but looking at why particular predictive decisions are made leaves a lot to be desired. Specifically, I've noticed that I often have two sequential two-grams as highly indicative features when I actually have the corresponding four-gram as a known feature. Put another way: I'm seeing ab and cd as indicative features, two "phrases" that clearly go together, while abcd is another feature that logically has the same or more indicative power and exists in my dictionary. Furthermore, I often see abc as yet another indicative feature, so there's clearly a ton of redundancy, maybe even linear dependence. What I want is for the model to overcome this while still using text as the input, in order to maintain interpretability.

The solution I'm currently favoring is a "chunking" of the documents (I think this is the correct usage of the word "chunking") into probable phrases (e.g., grams) and terms that do not overlap. First of all, is this a good approach to the problem? Second, I want to base this on a ton of data: my models are currently all trained in a single pass over millions of documents using SGD, and I'd like to adopt a similar single-pass approach for the chunking.

My proposed technique: build a radix tree to maintain sequence probabilities of grams, i.e., maintain the transition probabilities. Then, when a document is being chunked, start with one term. If the next term in the document is sufficiently probable (I guess: its transition probability is higher than some threshold rho), add it to the chunk and repeat; otherwise treat the current chunk as a feature and ship it off to the model (a rough sketch is below). How does this sound? A good approach? Is there a better way? What about deciding where chunks start and stop: will a simple (or even adaptive) threshold work? How do I choose this threshold?

Thanks, downer
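To make that concrete, here is a rough sketch of the rule I'm picturing. All names and the value of rho are placeholders, and a plain dict-of-dicts stands in for the radix tree; the point is just the thresholded extension rule.

```python
# Rough sketch only -- placeholder names, a dict-of-dicts standing in for the
# radix tree, and an arbitrary rho.
from collections import defaultdict

class TransitionCounts:
    """Streaming counts for estimating P(next term | current chunk)."""
    def __init__(self, max_len=4):
        self.max_len = max_len
        self.totals = defaultdict(int)                             # prefix -> times seen
        self.next_counts = defaultdict(lambda: defaultdict(int))   # prefix -> next term -> count

    def update(self, tokens):
        """Count every prefix (up to max_len - 1 terms) and the term that follows it."""
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + self.max_len, len(tokens))):
                prefix = tuple(tokens[i:j])
                self.totals[prefix] += 1
                self.next_counts[prefix][tokens[j]] += 1

    def transition_prob(self, prefix, term):
        total = self.totals.get(tuple(prefix), 0)
        if total == 0:
            return 0.0
        return self.next_counts[tuple(prefix)].get(term, 0) / total

def chunk(tokens, counts, rho=0.3):
    """Greedily grow a chunk while the transition probability stays above rho."""
    if not tokens:
        return []
    chunks, current = [], [tokens[0]]
    for term in tokens[1:]:
        if counts.transition_prob(current, term) > rho:
            current.append(term)               # extend the current chunk
        else:
            chunks.append(" ".join(current))   # ship the finished chunk as a feature
            current = [term]
    chunks.append(" ".join(current))
    return chunks
```

The idea would be to stream counts.update(tokens) over the corpus in one pass, then call chunk(tokens, counts) when featurizing; chunks are capped at max_len terms because longer prefixes are never counted. Picking rho is exactly the part I don't know how to do.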
I'm trying to establish a good relationship between the phenomena being modeled and the model itself; right now, what the model is capturing is not what's really happening. E.g., there are single, longer phrases that have all of the predictive power of their components, and I want to force the model to use these longer phrases, if possible.
I don't understand your concern. If abcd is an indicative phrase, ab and cd are still good features, and they may generalize better because there are more instances of ab and cd being useful in your data. As long as the weights are learned jointly, the model will compensate for the correlations between them. This is in contrast to (e.g.) naive Bayes, where weights are determined independently and probabilities are often poorly calibrated.
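To see the difference, here is a toy sketch on synthetic data (assuming scikit-learn; the column names and probabilities are made up, not anything from your pipeline). The features are [ab, cd, abcd], where abcd firing forces ab and cd to fire as well, so they are strongly correlated.

```python
# Toy comparison: jointly learned weights vs. independently learned likelihoods
# on redundant, correlated "n-gram" features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
n = 20000
y = rng.integers(0, 2, size=n)

# abcd is genuinely indicative; ab and cd also fire on their own as noise
abcd = rng.random(n) < np.where(y == 1, 0.6, 0.05)
ab = abcd | (rng.random(n) < 0.2)
cd = abcd | (rng.random(n) < 0.2)
X = np.column_stack([ab, cd, abcd]).astype(float)

lr = LogisticRegression().fit(X, y)   # weights learned jointly
nb = BernoulliNB().fit(X, y)          # per-feature likelihoods learned independently

doc = np.array([[1.0, 1.0, 1.0]])     # a document containing ab, cd and abcd
print("logistic regression P(y=1):", lr.predict_proba(doc)[0, 1])
print("naive Bayes         P(y=1):", nb.predict_proba(doc)[0, 1])
print("logistic regression weights [ab, cd, abcd]:", lr.coef_[0])
```

You should see the jointly trained model spread weight across the correlated columns and stay close to the true conditional probability, while naive Bayes counts essentially the same evidence three times and pushes its estimate toward 1.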