|
I just came across hash kernels, and they piqued my interest because they can deal with very high-dimensional feature spaces. Can anyone explain how one would use them to reduce bag-of-words feature vectors, and what the pros and cons of doing so are?
|
Hash kernels are really easy to use. Suppose you have a family of K independent hash functions (MurmurHash, run with K different seeds, can give you such a family) and a high-dimensional feature space. Building an ordinary bag-of-words vector goes something like this:
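A minimal sketch in Python (the helper name and the plain-dict vocabulary are just for illustration):

    from collections import defaultdict

    def bag_of_words(tokens, vocab):
        # vocab maps word -> column index; it grows as new words
        # appear and must be shared across all documents
        vec = defaultdict(float)
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)  # assign the next free index
            vec[vocab[tok]] += 1.0
        return vec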
Building such a vector with feature hashing instead looks like this:
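A matching sketch, assuming the mmh3 package for seeded MurmurHash (any family of independently seeded hash functions works the same way):

    import mmh3  # mmh3.hash(key, seed) returns a signed 32-bit int

    def hashed_bag_of_words(tokens, dim, k=2):
        # dim: size of the reduced feature vector; k: number of hash functions
        vec = [0.0] * dim
        for tok in tokens:
            for seed in range(k):
                # Python's % keeps the index in [0, dim) even when
                # mmh3.hash returns a negative int
                vec[mmh3.hash("word:" + tok, seed) % dim] += 1.0
        return vec

No vocabulary is needed: a word's indices are fully determined by the seeds and dim.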
and that's it. With feature hashing (or hash kernels) you can also add the hash of whatever feature you want, such as substrings, just by hashing a unique name for that feature (I use names like "word:foo", "substr:f*o", etc.) with the K hash functions and adding the results to the feature vector; see the sketch below. The value of K is a hyperparameter that helps you deal with collisions. You can even use this in structured learning, where the usual feature space is even bigger. See Hash Kernels for Structured Learning for better explanations of both the method and its extensions, and see vowpal wabbit for a fast implementation.

Those are the pros. The cons are mainly that you lose the inherent interpretability of linear models (each feature activates many different entries, which can have different weights, so it gets harder to tell which features matter), that you get two extra hyperparameters to tune (the number of hash functions and the size of the reduced feature vector), that the feature vector gets K times denser (so the memory used to store a single instance might increase), and that in some settings the cost in classification performance can be non-negligible.
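To make the arbitrary-feature idea concrete, here is a sketch (same assumptions as above) that adds any feature by hashing its unique name:

    def add_feature(vec, name, value=1.0, k=2):
        # hash the feature's unique name with each of the k seeds
        for seed in range(k):
            vec[mmh3.hash(name, seed) % len(vec)] += value

    # e.g. add_feature(vec, "word:foo") and add_feature(vec, "substr:f*o")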
|
I haven't used hash kernels myself, but in NLP feature extraction one advantage may be that it's easier to parallelize the process of creating a bag-of-words matrix, since no word --> idx mappings have to be synchronized or communicated between simultaneous extraction processes.
Also, it has the nice property that it is no longer necessary to keep any state at all, except for the seed of the hash function, so you can even share feature vectors between completely different corpora.
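A sketch of that point, reusing the hypothetical hashed_bag_of_words from above; the workers share nothing but dim and the hash seeds, so extraction parallelizes trivially:

    from functools import partial
    from multiprocessing import Pool

    docs = ["the cat sat on the mat", "a dog barked"]

    if __name__ == "__main__":
        with Pool() as pool:
            # no word -> idx mapping to synchronize between workers
            vecs = pool.map(partial(hashed_bag_of_words, dim=2**16),
                            [d.split() for d in docs])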
(Alexandre Passos ♦, Mar 28 '11 at 19:13)
|