|
My problem can be best described as follows:
For example: find well matched objects to test subject Objects: A[b:34, d:99] B[a:99, b:65, c:45, d:89] C[a:34, c:99, d:34, e:23, f:33] I have looked into nearest neighbor solutions such as KD-Tree and LSH, with some success. KD-Tree feels very much like the wrong solution given the variance in the attributes against each object, I'm not dealing with a fixed number of dimensions. I have been unable to really explore LSH solution given the available resources on line that explain the available hashing functions. Unfortunately my knowledge of mathematics isn't at the level required to understand fully most of the material. Any pointers in the right direction for best approaches and algorithms would be very much appreciated. Any pointers on the best persistent data store for such a set of data would also go down very well, I'm currently storing each object in MongoDB with the attributes as a hash field. |
|
I can see a few options. The most obvious is, if the features are sparse enough, to just use a traditional search engine indexing algorithm, probably with something like Lucene. You can treat these feature:weight pairs as word:tfidf-score pairs, and then any search engine can give you a list of everything that has at least one word in common ordered by the dot product. If things are denser than that and you can't afford these long scans then LDH is probably one way to go. Also, do you need every single item in the result list or just an approximate K-best? How expensive is just doing a linear search on your database? Can you afford to keep it in memory? Can these queries be cached, or precomputed offline? |
Alexandre: I came here to post the same thing - reinterpret the question as a search engine. Did you want to put that in an actual answer, then I can upvote it :)