15

I've heard people mention the "hashing trick" in machine learning, particularly with regard to machine learning on large data.

What is this trick, and what is it used for? Is it similar to the use of random projections?

(Yes, I know that there's a brief page about it here. I guess I'm looking for an overview that might be more helpful than reading a bunch of papers.)

asked Aug 13 '11 at 22:48

grautur

An accessible introduction to the hashing trick by Weinberger et al. has appeared in the Virus Bulletin.

(Jun 11 '12 at 09:53) larsmans

3 Answers:
29

Let's say we want to design a function v = phi(x) which maps a d-dimensional vector x = (x(1), x(2), ..., x(d)) to a new m-dimensional vector v, where m may be either smaller or larger than d. In other words, phi can be used either for reducing the dimensionality of x (d > m) or for sparsifying x (m > d).

One way to do so is to use a hash function h to map x(1) to v(h(1)), x(2) to v(h(2)), ..., x(d) to v(h(d)). A hash function maps an arbitrary input (here, a dimension index) to an integer in a specified range. Good hash functions should have uniform output and obey the avalanche effect: a small perturbation in the input must result in a large change in the output. This ensures that any dimension of x is mapped to an essentially random dimension of v. Note that this will typically result in collisions (two dimensions of x can be mapped to the same dimension of v), but in practice this won't hurt performance if m is big enough.
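As a rough illustration (this sketch is not from the original answer; the function names and the choice of MD5 are arbitrary), such a phi could look like this in Python:

    # Minimal sketch: each original dimension index is hashed into one of
    # m buckets, and collisions simply add up.
    import hashlib

    def hashed_index(i, m):
        """Map an original dimension index i to a bucket in [0, m)."""
        digest = hashlib.md5(str(i).encode("utf-8")).hexdigest()
        return int(digest, 16) % m

    def phi(x, m):
        """Project a d-dimensional list x onto m hashed dimensions."""
        v = [0.0] * m
        for i, value in enumerate(x):
            v[hashed_index(i, m)] += value
        return v

    # Example: reduce a 10-dimensional vector to 4 hashed dimensions (m < d).
    print(phi([1, 0, 2, 0, 0, 3, 0, 0, 1, 0], m=4))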

In document classification, documents are typically transformed into vectors in the bag-of-words representation. The problem is that you don't know the dimensionality of your vectors until you've made an entire pass over your dataset. Fortunately, there exist hash functions which take strings as input. We can thus map words to a dimension in v as we go. I think this is why it is called the hashing trick: similarly to the kernel trick, v = phi(x) never needs to be computed explicitly. The ability to transform documents as you go is very useful in online algorithms such as SGD. Vowpal Wabbit uses something like 2^26 for m.
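To make the "as we go" part concrete, here is a hedged sketch (the hash choice and table size are arbitrary, not what Vowpal Wabbit actually uses) of hashing a document's tokens straight into a fixed-size sparse vector, with no vocabulary ever built:

    # Hash each word directly into one of m buckets; no pass over the
    # corpus and no word-to-index dictionary is needed.
    import hashlib

    def hash_bow(tokens, m=2**20):
        v = {}  # sparse vector: bucket index -> count
        for token in tokens:
            idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % m
            v[idx] = v.get(idx, 0) + 1
        return v

    print(hash_bow("the hashing trick maps words to buckets on the fly".split()))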

I've also seen hash functions used to automatically generate cross-product features. If you have a hash function which gives you an integer from two integers, i.e. i = h(a,b), you can now map the combined feature x(a) * x(b) to v(i). Cross-product features can be useful to model the interactions between features.
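A hedged sketch of that idea (the feature names and the pairing scheme are made up for illustration):

    # Hash a pair of feature names to a single bucket and store the product
    # of their values there, giving cheap interaction features.
    import hashlib

    def cross_bucket(a, b, m=2**20):
        key = ("%s^%s" % (a, b)).encode("utf-8")
        return int(hashlib.md5(key).hexdigest(), 16) % m

    def cross_features(features, m=2**20):
        """features: dict of name -> value; returns hashed pairwise products."""
        v = {}
        names = sorted(features)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                v[cross_bucket(a, b, m)] = features[a] * features[b]
        return v

    print(cross_features({"age": 3.0, "clicks": 2.0, "country=jp": 1.0}))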

The paper "Feature Hashing for Large Scale Multitask Learning" (Weinberger et al., ICML09) also shows how to use the hashing trick for multi-task learning. For example, in spam filtering, since each user typically receives different kinds of spam, it would be nice if we could use one binary classifier per user. This is an example of multi-task learning since each task is related. Obviously it would require massive amounts of storage to store the weight vector of each classifier. The authors show that it is possible to use one hash function per user (as well as one global function) to combine the different users into one feature space. This is a recommended reading as it also covers the background on the hashing trick.

To summarize some of the applications:

  • dimensionality reduction
  • sparsification
  • bag-of-words on the fly
  • cross-product features
  • multi-task learning

Pretty awesome, huh?

answered Aug 14 '11 at 00:30

Mathieu Blondel

edited Aug 14 '11 at 01:00

Awesome indeed, and awesome answer! You've convinced me that I should go read all those papers now =). A question about your document classification example, though: what's the advantage of using hash functions to map strings to some arbitrary dimension in a high-dimensional space? I might be misunderstanding what you mean, but it seems like we can get the same result (without going all the way up to 2^26 dimensions) by simply keeping track of what words we've seen already. For example, the first distinct word we see gets mapped to the first dimension, the second distinct word we see gets mapped to the second dimension, and so on. Is the advantage simply that we don't need to perform this book-keeping (which doesn't seem all that complex to me, though I guess if we don't need to keep track of the order in which we see words, then that maybe makes it easier to parallelize certain algorithms)? Also, was this an example of the sparsification use case of hashing?

(Aug 21 '11 at 01:57) grautur
2

In addition to the convenience of not having to do the book-keeping as you mention, there's the problem of not knowing the dimensionality of your data unless you make an entire pass over your dataset. However, in an online or large-scale setting, you want to update the hypothesis upon receiving each example. For example, with a linear model, during training, for efficiency, you need O(1) access to the weights, thus you need to store the weight vector in dense format. During prediction though, you can store the weight vector in sparse format since, for computing dot products, you only need to iterate over the common features.

(Aug 21 '11 at 06:48) Mathieu Blondel
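A minimal sketch of the dense-training / sparse-prediction point made in the comment above (the data structures are illustrative):

    # During training a dense array gives O(1) access for weight updates;
    # at prediction time only the weights of features present in x matter,
    # so iterating over the sparse example is enough.
    def predict(weights, x):
        """weights: dict index -> weight; x: dict index -> feature value."""
        return sum(weights.get(i, 0.0) * value for i, value in x.items())

    w = {3: 0.5, 17: -1.2}   # only non-zero weights stored
    x = {3: 1.0, 8: 2.0}     # only non-zero features stored
    print(predict(w, x))     # 0.5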
1

@grautur, you'd still end up using a hash map to store the look-up table you'd use for determining whether you've encountered the word previously.

(Aug 22 '11 at 11:37) Brian Vandenberg

"Good hash functions should have uniform output and obey the avalanche effect: a small perturbation in the input must result in a great change in the output. This ensures that any dimension in x will be mapped to a random dimension in v." Mathieu, can you explain more on this? Is this the case for VW? And what about LSH?

(Nov 21 '11 at 18:34) ZXZ
14

Adding to Mathieu's answer, you can think of the hashing trick as using the Johnson-Lindenstrauss lemma to do dimensionality reduction of the weight vectors directly from the document representation without ever storing the original, possibly huge, weight vector.

The idea behind the Johnson-Lindenstrauss lemma is that you can use certain families of suitably-constructed random projections to reduce the dimensionality of vectors while preserving the dot product. Given such a projection A, you can bound the expected value of |Ax · Ay − x · y|². Usually Johnson-Lindenstrauss is used with something very similar to an independent Gaussian A (and then it's easy to see how it works, if you follow the proofs on Wikipedia).
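A quick numerical illustration of that claim (a hedged sketch; the dimensions and random seed are arbitrary):

    # A dense Gaussian random projection approximately preserves dot products.
    import numpy as np

    rng = np.random.default_rng(0)
    d, m = 10_000, 1_000
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d))  # entries with variance 1/m

    x = rng.normal(size=d)
    y = x + rng.normal(size=d)   # correlated with x, so x . y is large

    print(x @ y)                 # original dot product
    print((A @ x) @ (A @ y))     # projected dot product, close in expectation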

However, it also works with a projection A that is binary and sparse, such that each row of A has a small number of ones and the rest is zeros (technically it's either 1/n or 0, for some value of n). Then, if you look at what the projection is doing, each coefficient of the projected vector is the sum of a few randomly chosen coefficients of the original vector. If you look at the columns of this matrix, then, you will see that each original coefficient will get added to a small random subset of the indices of the new vector.

The idea of feature hashing comes from this: you can use a small number of randomly chosen hash functions to represent the columns of the matrix A above. Assume you have a very big, very sparse feature vector for each training example. Now, for every feature in this training example, you go through a small number of randomly chosen hash functions h1(x), ..., hn(x) and add the value of that feature to the h1(x)-th, ..., hn(x)-th indices of a fixed-dimensionality vector. Now, if you look at each component of this fixed-dimensionality vector (which is what your learning algorithm will use as a feature vector), every entry is the sum of a small number of the features of the original huge sparse feature vector, just as you need for Johnson-Lindenstrauss to apply.
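A hedged sketch of that construction (the number of hash functions, the 1/sqrt(n) scaling, and the sizes are all illustrative):

    # Each original coefficient is added to the buckets chosen by a few
    # hash functions; dividing by sqrt(n_hashes) keeps the projected dot
    # product on the same scale as the original one.
    import hashlib
    import numpy as np

    def buckets(i, m, n_hashes=2):
        """n_hashes pseudo-random buckets in [0, m) for original index i."""
        return [int(hashlib.md5(("%d:%d" % (s, i)).encode()).hexdigest(), 16) % m
                for s in range(n_hashes)]

    def hash_project(x, m, n_hashes=2):
        v = np.zeros(m)
        for i, value in enumerate(x):
            if value != 0:
                for b in buckets(i, m, n_hashes):
                    v[b] += value
        return v / np.sqrt(n_hashes)

    rng = np.random.default_rng(1)
    d, m = 10_000, 1_000
    x = rng.normal(size=d)
    y = x + rng.normal(size=d)

    print(x @ y)                                    # original dot product
    print(hash_project(x, m) @ hash_project(y, m))  # approximately preserved

(The construction in Weinberger et al. additionally uses a second hash that assigns each feature a random sign, which removes the bias introduced by collisions; the sketch above omits that for brevity.)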

Now since most learning algorithms rely only on the dot product between instances, you know everything should be ok, because these random projections will then preserve the dot product very well in expectation (the intuition behind this is that each feature is mixed with random features in each coefficient it shows up in; then the learning algorithm can choose which of these copies to assign a high weight to). Indeed, these are sometimes called hash kernels precisely because they define a dot product you can use (however, unlike most kernels commonly used, the projection phi(x) used here is explicit instead of implicit).

answered Aug 14 '11 at 07:23

Alexandre Passos ♦

-1

Say I learn a text classifier on "hash-trick" feature vectors. Then, during classification of a new, unseen text, I guess one would hash ALL the words in this text into the input vector and not just the ones encountered in the training set?

answered Feb 12 '12 at 13:24

David the Dude

This is not an answer, and you should post it as a question.

Answering your question, you can't gain anything in terms of accuracy by including those features, but the usual reason for using the hashing trick is to not keep an explicit map of the features you've seen so far as that can be expensive. You won't lose much in performance by including all features anyway.

(Feb 12 '12 at 13:26) Alexandre Passos ♦

Yeah, sorry for adding this as an answer when it is not. Back to the "hash-trick" - ok, this is what I suspected. Thanks.

(Feb 12 '12 at 13:41) David the Dude