I recently read a paper (Weiss et al., 2008) about Spectral Hashing. They give some pretty impressive results, outperforming LSH and a few others, all with a pretty compact algorithm. They also benchmarked against an RBM autoencoder, but found the performance increase to be insignificant. What I'm wondering is whether there are some limitations to their approach that aren't immediately obvious (or mentioned explicitly in their paper). I ask because I can't find any references on the internet to people actually implementing their algorithm for practical purposes (other than their own implementation that accompanies the paper). By contrast, I've seen boatloads of people doing all sorts of things with autoencoders and such. Is there some fundamental advantage to one of the seemingly state-of-the-art approaches to semantic hashing? NB: Perhaps I'm just terrible at searching. If so, please direct me :-)
Spectral hashing doesn't scale well with the code size: if you want a good 10-bit code, spectral hashing is great, but for 32-bit codes or larger a deep auto-encoder will certainly produce much better codes. I would also add that the paper you link to compares against an "RBM", which is a somewhat bizarre thing to compare to. Why not a deep auto-encoder that uses RBM pre-training? If they didn't first do PCA on the data, then I would not expect a shallow auto-encoder or an RBM to do much better than doing PCA and then binarizing sensibly. Also, the following sentence, from the 6th paragraph of the paper's introduction, is wrong, and I am not quite sure what the authors were trying to say with it: "In [5] the authors used an auto-encoder with several hidden layers. The architecture can be thought of as a restricted Boltzmann machine (RBM) in which there are only connections between layers and not within layers."
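To make "PCA and then binarizing sensibly" concrete, here is a minimal toy sketch of the kind of baseline I mean. This is my own illustration, not code from the paper; thresholding each projected coordinate at its median is just one sensible choice, since it keeps every bit roughly balanced.

```python
import numpy as np

def pca_binary_codes(X, n_bits=32):
    # Project onto the top principal directions, then binarize each
    # coordinate at its median so every bit splits the data roughly 50/50.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt = principal directions
    P = Xc @ Vt[:n_bits].T
    return (P > np.median(P, axis=0)).astype(np.uint8)

codes = pca_binary_codes(np.random.randn(1000, 64), n_bits=16)
print(codes.shape)   # (1000, 16)
```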
Not sure about this, but "fundamental advantage of approaches to semantic hashing" is very context dependent. For instance, the hashing approaches used in web-scale retrieval applications can diverge pretty wildly from stuff like autoencoders.
Thanks for the reply. Perhaps 'fundamental' wasn't the best word to capture what I meant. I didn't mean some fundamental fact that makes one algorithm better than another for all situations. Rather, I want to know (a) is there a reason I can't find instances of people using spectral hashing, (b) is there a problem domain where the spectral method breaks down and (c) what is it that has made so many people pick up the auto-encoder method instead?
Performing PCA and eigenfunction thresholding aren't the most expensive tasks, in the grand scheme of things, so presumably the spectral hashing method could scale towards something web-scale. The authors tested it with a database of 80M images, although they didn't give computational performance metrics.
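For anyone who hasn't read the paper, here is a rough, simplified sketch of what those two steps look like as I understand them: PCA, then thresholding low-frequency analytic eigenfunctions of the 1D Laplacian along each principal direction, under the paper's assumption that the projections are roughly uniform. The function name, the eps knob, and the candidate enumeration are my own simplifications; the authors' companion code is the reference implementation.

```python
import numpy as np

def spectral_hash_sketch(X, n_bits=32, eps=1.0):
    # eps stands in for the kernel width parameter; treat it as a tuning knob.
    # 1. Center the data and project onto the top principal directions.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt = principal directions
    n_dims = min(n_bits, Vt.shape[0])
    P = Xc @ Vt[:n_dims].T                               # (n_samples, n_dims)

    # 2. Treat each projected coordinate as roughly uniform on [lo, hi].
    lo, hi = P.min(axis=0), P.max(axis=0)
    extent = hi - lo + 1e-12

    # 3. Enumerate (direction, frequency) pairs and keep the n_bits with the
    #    smallest analytic eigenvalues, i.e. the smoothest eigenfunctions.
    candidates = []
    for j in range(n_dims):
        for k in range(1, n_bits + 1):                   # k = 0 is the trivial constant function
            lam = 1.0 - np.exp(-(eps ** 2) / 2.0 * (k * np.pi / extent[j]) ** 2)
            candidates.append((lam, j, k))
    chosen = sorted(candidates)[:n_bits]

    # 4. Threshold the corresponding sine eigenfunctions at zero to get bits.
    bits = np.empty((P.shape[0], n_bits), dtype=np.uint8)
    for b, (_, j, k) in enumerate(chosen):
        phi = np.sin(np.pi / 2.0 + k * np.pi / extent[j] * (P[:, j] - lo[j]))
        bits[:, b] = phi > 0
    return bits

codes = spectral_hash_sketch(np.random.randn(1000, 128), n_bits=32)
print(codes.shape)   # (1000, 32)
```

The cheap part is exactly what you'd expect from the description: one PCA plus some closed-form sines, so generating a code for a new item costs very little.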
Really, I'm just curious whether people here have some experience-driven reasons for going one way or the other.
Unfortunately, most papers are never used in applications or reimplemented. This is just how science works, and it's not necessarily horrible; maybe follow-up work to this will interest practitioners. As far as I know, few people who use this kind of LSH in an application have enough computational cycles to spare on computing eigenvectors in exchange for a few percent of performance (and, compared to plain LSH, you now have to train a model rather than just using one).
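To illustrate the "just using one" point: the classic sign-random-projection flavor of LSH has no training step at all; you draw a random projection matrix once and hash with it forever. A minimal sketch (my own toy code, with arbitrary toy data):

```python
import numpy as np

def random_hyperplane_lsh(X, n_bits=32, seed=0):
    # Sign-random-projection LSH: draw random hyperplanes once, then take
    # the sign of each projection as a bit. No model fitting involved.
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], n_bits))
    return (X @ R > 0).astype(np.uint8)

codes = random_hyperplane_lsh(np.random.randn(1000, 64))
print(codes.shape)   # (1000, 32)
```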
Of course I understand that not everything explored scientifically makes the jump to production. Perhaps the answer to my (a) question is simply "nobody's decided it's worth doing yet." If so, that's exactly what I want to know, along with the reasons why.
So then, Alexandre, you think that the answer to my (c) question is simply a gains-versus-cycles decision? Not about problem domains, practical implementation issues, etc.?
Thanks.
I haven't read the paper carefully, so I can't comment. I've also never used hashing for anything practical, so I wouldn't recognize the dealbreakers instantly.