# What is the "hashing trick"?

 15 16 I've heard people mention the "hashing trick" in machine learning, particularly with regards to machine learning on large data. What is this trick, and what is it used for? Is it similar to the use of random projections? (Yes, I know that there's a brief page about it here. I guess I'm looking for an overview that might be more helpful than reading a bunch of papers.) asked Aug 13 '11 at 22:48 grautur 961●12●23●28 An accessible introduction to the hashing trick by Weinberger et al. has appeared in the Virus Bulletin. (Jun 11 '12 at 09:53) larsmans

 14 Adding to mathieu's answer, you can think of the hashing trick as using the Johnson-Lindenstrauss lemma to do dimensionality reduction of the weight vectors directly from the document representation without every storing the original, possibly huge, weight vector. The idea behind the johnson-lindenstrauss lemma is that you can use certain families of suitably-constructed random projections to reduce the dimensionality of vectors while preserving the dot product. Given such a projection A, then, you can bound the expected value of || Ax dot Ay - x dot y||². Usually Johnson-Lindesntrauss is used with something very similar to an independent gaussian A (and then it's easy to see how it works, if you follow the proofs on wikipedia). However, it also works with a projection A that is binary and sparse, such that each row of A has a small number of ones and the rest is zeros (technically it's either 1/n or 0, for some value of n). Then, if you look at what the projection is doing, each coefficient of the projected vector is the sum of a few randomly chosen coefficients of the original vector. If you look at the columns of this matrix, then, you will see that each original coefficient will get added to a small random subset of the indices of the new vector. The idea of feature hashing comes from this: you can use a small number of randomly chosen hash functions to represent the columns of the matrix A above. Assume you have this very big, very sparse feature vector for each training example. Now, for every feature in this training example, you go through a small number of randomly chosen hash functions h1(x)...hn(x) and add the value of that feature to the h1(x)-th index of a fixed-dimensionality vector. Now if you look at each component of this fixed-dimensionality vector (which is what your learning algorithm will use as a feature vector) every entry will be the sum of a small number of the features of the original huge sparse feature vector, just as you need for Johnson-Lindenstrauss to apply. Now since most learning algorithms rely only on the dot product between instances, you know everything should be ok, because these random projections will then preserve the dot product very well in expectation (the intuition behind this is that each feature is mixed with random features in each coefficient it shows up in; then the learning algorithm can choose which of these copies to assign a high weight to). Indeed, these are sometimes called hash kernels precisely because they define a dot product you can use (however, unlike most kernels commonly used, the projection phi(x) used here is explicit instead of implicit). answered Aug 14 '11 at 07:23 Alexandre Passos ♦ 25541●54●278●421
 -1 Say, I learn a text classifier on "hash-trick" feature vectors. Then during classification of an unseen new text - I guess one would hash ALL the words in this text into the input vector and not just the ones encountered in the training set? answered Feb 12 '12 at 13:24 David the Dude 60●4●5●8 This is not an answer, and you should post it as a question. Answering your question, you can't gain anything in terms of accuracy by including those features, but the usual reason for using the hashing trick is to not keep an explicit map of the features you've seen so far as that can be expensive. You won't lose much in performance by including all features anyway. (Feb 12 '12 at 13:26) Alexandre Passos ♦ Yeah, sorry for adding this as an answer when it is not. Back to the "hash-trick" - ok, this is what I suspected. Thanks. (Feb 12 '12 at 13:41) David the Dude
 toggle preview community wiki