
I just came across hash kernels, and they piqued my interest because they can deal with very high-dimensional feature spaces. Can anyone explain how one would use them to reduce bag-of-words-style feature vectors, as well as the pros and cons of using them?

asked Nov 05 '10 at 01:12

Oliver Mitevski

edited Nov 05 '10 at 11:16

ogrisel


2 Answers:

Hash kernels are really easy to use. Suppose you have a family of K independent hash functions (murmurhash can give you such a family) and a high-dimensional feature space. Building a bag-of-words vector looks like:

  1. For each word w in the document:
    1. V[index(w)] += 1

while building such a vector using feature hashing is:

  1. For each word w in the document:
    1. For each hash function h_k:
      1. V[h_k(w) % len(V)] += 1

and that's it. Now with feature hashing (or hash kernels) you can also add the hashes of whatever feature you want, like substrings, etc., just by hashing a unique name for that feature (I use names like "word:foo", "substr:f*o", etc.) with the K hash functions and adding them to the feature vector. The value of K is a hyperparameter that helps you deal with collisions. You can even use it in structured learning (where the usual feature space is even bigger). See Hash Kernels for Structured Learning for better explanations both of the method and of its extensions, and also see vowpal wabbit for a fast implementation.
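As a concrete illustration, here is a minimal Python sketch of the hashed bag-of-words construction above. The bucket count, the default K=2, and the use of a salted standard-library hash in place of a real murmurhash family (e.g. the mmh3 package) are illustrative assumptions, not part of the method:

    import hashlib

    def hashed_bow(words, num_buckets=2**18, num_hashes=2):
        # Hashed bag-of-words: no word -> index dictionary is needed;
        # each of the K salted hashes maps a word straight to a bucket.
        V = [0] * num_buckets
        for w in words:
            for k in range(num_hashes):
                # Salting a single hash with k simulates K independent
                # hash functions; murmurhash with K seeds works equally well.
                digest = hashlib.md5(f"{k}:{w}".encode("utf-8")).digest()
                V[int.from_bytes(digest[:8], "little") % num_buckets] += 1
        return V

    V = hashed_bow("the quick brown fox jumps over the lazy dog".split())

Named features like "word:foo" or "substr:f*o" drop into the same loop: you just hash the feature's name string instead of the raw word.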

Those are the pros. The cons are mainly that you lose the inherent interpretability of linear models (each feature activates many different entries, which can have different weights, so it gets harder to tell which features matter), that you get two extra hyperparameters to tune (the number of hash functions and the size of the reduced feature vector), that the feature vector becomes K times less sparse (so the memory used to store a single instance might increase when using hashing), and that for some settings the cost in classification performance can be non-negligible.

answered Nov 05 '10 at 06:50

Alexandre Passos ♦

edited Nov 05 '10 at 15:40

I haven't used hash kernels myself, but in NLP feature extraction one advantage may be that it is easier to parallelize the construction of a bag-of-words matrix, since no word --> idx mappings have to be synchronized or communicated between simultaneous extraction processes.
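For example, under the same assumptions as the sketch in the first answer (reusing the hypothetical hashed_bow function defined there), each worker hashes its documents with no shared vocabulary:

    from multiprocessing import Pool

    # corpus is assumed to be any iterable of document strings.
    if __name__ == "__main__":
        docs = [doc.split() for doc in corpus]
        with Pool() as pool:
            # No shared word -> idx dictionary, so the workers need
            # no synchronization or communication at all.
            X = pool.map(hashed_bow, docs)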

answered Mar 28 '11 at 19:10

David the Dude


Also, it has the nice property that it is no longer necessary to keep any state at all, except for the seed of the hash function, so you can even share feature vectors between completely different corpora.
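For instance, with the hypothetical hashed_bow sketch above, vectors built from unrelated corpora land in the same space:

    # The only shared state is the hash function and the bucket count,
    # so vectors from different corpora are directly comparable.
    v1 = hashed_bow("words from one corpus".split())
    v2 = hashed_bow("a totally different vocabulary".split())
    assert len(v1) == len(v2)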

(Mar 28 '11 at 19:13) Alexandre Passos ♦