
My problem can be best described as follows:

  • I have a database of unknown size, potentially very large
  • Each object will have an arbitrary number of numerical attributes
  • The attributes attached to each object will differ from one object to the next, but specific attributes are expected to appear in many objects rather than few
  • Each attribute will be a numeric value in a fixed range (0-100)
  • Given a test subject, I am attempting to sort the objects by closeness of match, best match first; the ordering does not need to be exact. This will be a common operation, and speed is of the utmost importance, not precision.
  • The number of attributes globally available is likely to increase over time

For example: find well-matched objects for a given test subject

Objects:
A[b:34, d:99]
B[a:99, b:65, c:45, d:89]
C[a:34, c:99, d:34, e:23, f:33]

Find best match for: T[a:99, b:35, f:45]

I have looked into nearest neighbor solutions such as KD-trees and LSH, with some success. A KD-tree feels very much like the wrong solution given the variance in attributes from one object to the next; I'm not dealing with a fixed number of dimensions. I have been unable to properly explore LSH, as the resources available online that explain the possible hashing functions are hard to follow; unfortunately my knowledge of mathematics isn't at the level required to fully understand most of the material.

Any pointers in the right direction for the best approaches and algorithms would be very much appreciated. Any pointers on the best persistent data store for such a data set would also go down very well; I'm currently storing each object in MongoDB with the attributes as a hash field.
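
For reference, each object currently ends up as a MongoDB document shaped roughly like this (a minimal sketch; the database, collection, and field names are just illustrative, not a real schema):

    from pymongo import MongoClient

    client = MongoClient()
    objects = client["matching"]["objects"]   # hypothetical database/collection names

    # the attributes live in a single embedded document (the "hash field")
    objects.insert_one({"_id": "A", "attrs": {"b": 34, "d": 99}})
    objects.insert_one({"_id": "B", "attrs": {"a": 99, "b": 65, "c": 45, "d": 89}})
    objects.insert_one({"_id": "C", "attrs": {"a": 34, "c": 99, "d": 34, "e": 23, "f": 33}})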

asked Aug 14 '11 at 08:04


Rosie Arnfield

edited Aug 14 '11 at 13:08


Alexandre Passos ♦


Alexandre: I came here to post the same thing - reinterpret the question as a search engine. Do you want to put that in an actual answer so I can upvote it? :)

(Aug 14 '11 at 22:19) Robert Layton

One Answer:

I can see a few options. The most obvious is, if the features are sparse enough, to just use a traditional search engine indexing algorithm, probably with something like Lucene. You can treat these feature:weight pairs as word:tfidf-score pairs, and then any search engine can give you a list of everything that has at least one word in common ordered by the dot product.
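
To make that concrete, here is a rough Python sketch of the same idea (a hand-rolled inverted index rather than Lucene itself, using the example data from the question):

    from collections import defaultdict

    objects = {
        "A": {"b": 34, "d": 99},
        "B": {"a": 99, "b": 65, "c": 45, "d": 89},
        "C": {"a": 34, "c": 99, "d": 34, "e": 23, "f": 33},
    }

    # inverted index: attribute -> postings list of (object id, weight)
    index = defaultdict(list)
    for obj_id, attrs in objects.items():
        for attr, weight in attrs.items():
            index[attr].append((obj_id, weight))

    def rank(query):
        # accumulate dot products, touching only objects that share an attribute
        scores = defaultdict(float)
        for attr, q_weight in query.items():
            for obj_id, weight in index.get(attr, []):
                scores[obj_id] += q_weight * weight
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(rank({"a": 99, "b": 35, "f": 45}))   # B, then C, then A for the example above

Lucene gives you the same scoring reduced to a sparse dot product over postings lists, plus tf-idf weighting and on-disk indexes for free.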

If things are denser than that and you can't afford these long scans, then LSH is probably the way to go.
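
To make LSH slightly more approachable, here is a hedged sketch of one common variant, random-hyperplane hashing, where the sign pattern of a few random projections approximates cosine similarity. It assumes the attribute vocabulary can be mapped to fixed vector positions, so a growing vocabulary would mean reserving slots in advance or periodically rehashing:

    import random
    from collections import defaultdict

    ATTRS = ["a", "b", "c", "d", "e", "f"]   # global attribute vocabulary, assumed fixed here
    N_BITS = 8                               # signature length; more bits = finer buckets

    random.seed(0)
    # one random hyperplane (a vector of Gaussian weights) per signature bit
    planes = [[random.gauss(0, 1) for _ in ATTRS] for _ in range(N_BITS)]

    def signature(attrs):
        # bit i is set if the object falls on the positive side of hyperplane i
        bits = 0
        for i, plane in enumerate(planes):
            dot = sum(plane[j] * attrs.get(name, 0) for j, name in enumerate(ATTRS))
            if dot > 0:
                bits |= 1 << i
        return bits

    objects = {
        "A": {"b": 34, "d": 99},
        "B": {"a": 99, "b": 65, "c": 45, "d": 89},
        "C": {"a": 34, "c": 99, "d": 34, "e": 23, "f": 33},
    }

    buckets = defaultdict(list)
    for obj_id, attrs in objects.items():
        buckets[signature(attrs)].append(obj_id)

    # the candidate set for the test subject is whatever shares its bucket;
    # score only those candidates exactly instead of scanning the whole database
    query = {"a": 99, "b": 35, "f": 45}
    print(buckets.get(signature(query), []))

In practice you would build several independent tables like this and take the union of the matching buckets, then rank only those candidates exactly.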

Also, do you need every single item in the result list or just an approximate K-best? How expensive is just doing a linear search on your database? Can you afford to keep it in memory? Can these queries be cached, or precomputed offline?
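
On the last point, the brute-force baseline is only a few lines and is worth benchmarking before anything cleverer (same in-memory dict-of-dicts layout assumed as above):

    import heapq

    def top_k(query, objects, k=10):
        # exact k-best by dot product; one pass over the whole database
        def score(attrs):
            return sum(w * attrs.get(a, 0) for a, w in query.items())
        return heapq.nlargest(k, objects.items(), key=lambda kv: score(kv[1]))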

answered Aug 15 '11 at 06:44


Alexandre Passos ♦
