I'm playing around with SIFT descriptors for images in an attempt to use them in a 'bag-of-words' style approach to image classification.

I've cobbled together an assortment of random images, and after extracting features I have about 1.15 million feature descriptors.

I've been mucking around with k-means as a first pass at binning the features, but in the same way that it helps to pre-process words (removing pluralizations, prefixes, suffixes, etc.), I want a reasonably concise way to consolidate near-duplicate features so as to avoid skewing the clusters.
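To be concrete, the binning step I have in mind looks roughly like the sketch below (a minimal sketch using OpenCV's SIFT and scikit-learn's MiniBatchKMeans; the image paths, vocabulary size, and batch size are placeholders, not anything I've tuned):

    import cv2
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Extract SIFT descriptors from a handful of images (paths are placeholders).
    sift = cv2.SIFT_create()
    descriptors = []
    for path in ["img1.jpg", "img2.jpg"]:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)
        if desc is not None:
            descriptors.append(desc)
    all_desc = np.vstack(descriptors).astype(np.float32)

    # Quantize the descriptors into a visual vocabulary with k-means.
    k = 1000  # vocabulary size, placeholder value
    kmeans = MiniBatchKMeans(n_clusters=k, batch_size=10_000, random_state=0)
    kmeans.fit(all_desc)

    # Bag-of-words histogram for one image: count descriptors per cluster.
    words = kmeans.predict(descriptors[0])
    histogram = np.bincount(words, minlength=k)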

I'm looking for suggestions, whether that be a methodology, paper recommendation, or a smack to the side of the head.

asked Oct 13 '11 at 03:18 by Brian Vandenberg

One Answer:

How about "Near duplicate image detection: min-hash and tf-idf weighting"? MinHash is a locality-sensitive hashing (LSH) scheme that lets you build compact signatures for sets. You can create multiple signatures and use them as a "bag of words" for the image to perform near-duplicate detection.
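For illustration, here is a minimal MinHash sketch over sets of quantized visual-word IDs (this is not the paper's exact pipeline; the hash family, signature length, and the example word IDs below are arbitrary placeholders):

    import numpy as np

    _PRIME = 2_147_483_647  # large prime for the universal hash family

    def minhash_signature(word_ids, num_hashes=64, seed=0):
        """MinHash signature for a set of visual-word IDs.

        Each hash is h_i(x) = (a_i * x + b_i) mod prime; keeping the minimum
        per hash gives a signature whose fraction of matching slots estimates
        the Jaccard similarity between the underlying sets.
        """
        rng = np.random.default_rng(seed)  # same seed => same hash family for every image
        a = rng.integers(1, _PRIME, size=num_hashes, dtype=np.int64)
        b = rng.integers(0, _PRIME, size=num_hashes, dtype=np.int64)
        x = np.asarray(sorted(set(word_ids)), dtype=np.int64)
        hashed = (a[:, None] * x[None, :] + b[:, None]) % _PRIME
        return hashed.min(axis=1)

    def estimated_jaccard(sig_a, sig_b):
        """Fraction of agreeing slots approximates the Jaccard similarity."""
        return float(np.mean(sig_a == sig_b))

    # Each image is represented by the set of cluster IDs of its descriptors.
    sig1 = minhash_signature([3, 17, 17, 42, 99])
    sig2 = minhash_signature([3, 17, 42, 100])
    print(estimated_jaccard(sig1, sig2))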

answered Oct 14 '11 at 13:54 by bronzebeard

+1, it's not a bad idea. That might be something to look at when I have more free time; in this case I was looking for something more along the lines of simple statistical methods.

(Oct 14 '11 at 19:17) Brian Vandenberg

I've heard good things about chapter 3 of "Mining of Massive Datasets": http://infolab.stanford.edu/~ullman/mmds.html

(Oct 15 '11 at 02:11) Mathieu Blondel