I've heard people mention the "hashing trick" in machine learning, particularly with regard to machine learning on large data.
What is this trick, and what is it used for? Is it similar to the use of random projections?
(Yes, I know that there's a brief page about it here. I guess I'm looking for an overview that might be more helpful than reading a bunch of papers.)
asked Aug 13 '11 at 22:48
Let's say we want to design a function v = phi(x), which from a d-dimensional vector x = (x(1), x(2), ..., x(d)) outputs a new m-dimensional vector v, with m either greater or smaller than d. In other words, phi can be used either for reducing dimensionality of x (d > m) or for sparsifying x (m > d).
One way to do so is to use a hash function h to map x(1) to v(h(1)), x(2) to v(h(2)), ..., x(d) to v(h(d)). Hash functions are functions which, from an arbitrary integer input, can output an integer in a specified range. Good hash functions should have uniform output and obey the avalanche effect: a small perturbation in the input must result in a large change in the output. This ensures that any dimension in x will be mapped to a random dimension in v. Note that this will typically result in collisions (two dimensions in x can be mapped to the same dimension in v), but in practice this won't affect performance if m is big enough.
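A minimal sketch of this mapping in Python (the names `h` and `phi` match the notation above; MD5 is used here only to get a deterministic hash, since Python's built-in `hash` is salted per process):

```python
import hashlib

def h(i, m):
    # Hash an integer index into the range [0, m).
    return int(hashlib.md5(str(i).encode()).hexdigest(), 16) % m

def phi(x, m):
    # Map a d-dimensional vector x to an m-dimensional vector v.
    # Colliding dimensions simply get summed together.
    v = [0.0] * m
    for i, value in enumerate(x):
        v[h(i, m)] += value
    return v

x = [1.0, 2.0, 3.0, 4.0, 5.0]
v = phi(x, m=3)  # reduce from d=5 to m=3; some collisions are expected
```

Note that whatever collisions occur, the sum of the coefficients is preserved, since each x(i) lands in exactly one slot of v.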
In document classification, documents are typically transformed to vectors in the bag-of-words representation. The problem with that is that you don't know the dimensionality of your vectors until you've made an entire pass over your dataset. Fortunately, there exist hash functions which take strings as input. We can thus map words to a dimension in v as we go. I think this is why it is called the hashing trick: similarly to the kernel trick, v = phi(x) never needs to be computed explicitly. The ability to transform documents as you go is very useful in online algorithms such as SGD. Vowpal Wabbit uses something like 2^26 for m.
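The one-pass, dictionary-free vectorization described above can be sketched like this (function names are illustrative; again MD5 stands in for a fast string hash):

```python
import hashlib

def word_index(word, m):
    # Hash a string token directly to a dimension in [0, m);
    # no vocabulary dictionary is ever built.
    return int(hashlib.md5(word.encode()).hexdigest(), 16) % m

def hash_bow(document, m):
    # Fixed-size bag-of-words vector, computed in a single pass
    # over the document's tokens.
    v = [0.0] * m
    for word in document.split():
        v[word_index(word, m)] += 1.0
    return v

v = hash_bow("the quick brown fox jumps over the lazy dog", m=2**10)
```

Because the output dimensionality is fixed in advance, this works unchanged in an online setting: each new document can be hashed and fed to SGD immediately.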
I've also seen hash functions used to automatically generate cross-product features. If you have a hash function which gives you an integer from two integers, i.e. i = h(a,b), you can now map the combined feature x(a) * x(b) to v(i). Cross-product features can be useful to model the interactions between features.
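A sketch of hashed cross-product features, under the assumption that the pair hash i = h(a, b) is implemented by hashing a string encoding of the pair (the helper names are made up for illustration):

```python
import hashlib

def h2(a, b, m):
    # Hash a pair of feature indices (a, b) to a single index in [0, m).
    return int(hashlib.md5(f"{a}x{b}".encode()).hexdigest(), 16) % m

def cross_features(x, pairs, m):
    # Build an m-dimensional vector of hashed interaction
    # features x[a] * x[b] for the requested index pairs.
    v = [0.0] * m
    for a, b in pairs:
        v[h2(a, b, m)] += x[a] * x[b]
    return v

x = [2.0, 3.0, 5.0]
v = cross_features(x, pairs=[(0, 1), (1, 2)], m=8)
```

This avoids materializing the full d² interaction space: only the pairs you actually generate cost anything.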
The paper "Feature Hashing for Large Scale Multitask Learning" (Weinberger et al., ICML09) also shows how to use the hashing trick for multi-task learning. For example, in spam filtering, since each user typically receives different kinds of spam, it would be nice if we could use one binary classifier per user. This is an example of multi-task learning since each task is related. Obviously it would require massive amounts of storage to store the weight vector of each classifier. The authors show that it is possible to use one hash function per user (as well as one global function) to combine the different users into one feature space. This is a recommended reading as it also covers the background on the hashing trick.
To summarize some of the applications:

- reducing dimensionality (d > m) or sparsifying (m > d) a feature vector;
- vectorizing documents on the fly, without a vocabulary-building pass over the dataset;
- generating cross-product (interaction) features;
- multi-task learning, e.g. one spam filter per user folded into a single feature space.
Pretty awesome, huh?
Adding to mathieu's answer, you can think of the hashing trick as using the Johnson-Lindenstrauss lemma to do dimensionality reduction of the weight vectors directly from the document representation, without ever storing the original, possibly huge, weight vector.
The idea behind the Johnson-Lindenstrauss lemma is that you can use certain families of suitably-constructed random projections to reduce the dimensionality of vectors while approximately preserving dot products. Given such a projection A, you can bound the expected value of (⟨Ax, Ay⟩ − ⟨x, y⟩)². Usually Johnson-Lindenstrauss is used with an A whose entries are independent Gaussians (and then it's easy to see how it works, if you follow the proofs on Wikipedia).
However, it also works with a projection A that is binary and sparse, such that each row of A has a small number of ones and the rest are zeros (technically each entry is either 1/n or 0, for some value of n). Then, if you look at what the projection is doing, each coefficient of the projected vector is the sum of a few randomly chosen coefficients of the original vector. If you look at the columns of this matrix, you will see that each original coefficient gets added to a small random subset of the indices of the new vector.
The idea of feature hashing comes from this: you can use a small number of randomly chosen hash functions to represent the columns of the matrix A above. Assume you have a very big, very sparse feature vector for each training example. Now, for every feature x in this training example, you compute a small number of randomly chosen hash functions h1(x), ..., hn(x) and add the value of that feature to the h1(x)-th, ..., hn(x)-th indices of a fixed-dimensionality vector. If you look at each component of this fixed-dimensionality vector (which is what your learning algorithm will use as a feature vector), every entry is the sum of a small number of the features of the original huge sparse feature vector, just as you need for Johnson-Lindenstrauss to apply.
Now since most learning algorithms rely only on the dot product between instances, you know everything should be ok, because these random projections will then preserve the dot product very well in expectation (the intuition behind this is that each feature is mixed with random features in each coefficient it shows up in; then the learning algorithm can choose which of these copies to assign a high weight to). Indeed, these are sometimes called hash kernels precisely because they define a dot product you can use (however, unlike most kernels commonly used, the projection phi(x) used here is explicit instead of implicit).
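A small sketch of a hash kernel along these lines, with the explicit phi and a second hash that assigns each feature a random sign in {−1, +1} (the signed variant from the Weinberger et al. paper, which makes the hashed dot product unbiased; function names are illustrative):

```python
import hashlib

def h(token, m):
    # First hash: pick the target index in [0, m).
    return int(hashlib.md5(token.encode()).hexdigest(), 16) % m

def sign(token):
    # Second hash: a pseudo-random sign, so that collision noise
    # cancels in expectation and the hashed dot product is unbiased.
    return 1 if int(hashlib.sha1(token.encode()).hexdigest(), 16) % 2 else -1

def hash_phi(tokens, m):
    # The explicit feature map phi(x) of the hash kernel.
    v = [0.0] * m
    for t in tokens:
        v[h(t, m)] += sign(t)
    return v

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

a = hash_phi("the cat sat".split(), m=2**10)
b = hash_phi("the cat ran".split(), m=2**10)
# dot(a, b) approximates the number of shared tokens (2 here),
# exactly so when no collisions occur.
```

Any linear learner (perceptron, SVM, logistic regression via SGD) can be trained on these explicit hashed vectors directly, which is the practical appeal over implicit kernels.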
answered Aug 14 '11 at 07:23
Alexandre Passos ♦
Say I train a text classifier on "hashing trick" feature vectors. Then, when classifying an unseen new text, I guess one would hash ALL the words in this text into the input vector, not just the ones encountered in the training set?
answered Feb 12 '12 at 13:24
David the Dude