Hello, I was thinking about scaling(distributing) Nearest Neighbor Search algorithm (simple exhaustive search). Assumption is that you have N nodes, S samples stored in database and request dispatcher before nodes. There could be two aproaches:

  1. Keep copy of whole database (with samples) on each node and than for each request choose node on which request will be dispatched (according to e.g. round robin algorithm), and this node will simply return list of nearest neighbors.

  2. Each node will have local DB with S/N samples stored. Dispatcher sends request to all nodes, all nodes returns list of nearest neighbours. Dispatcher merges results and sends it back.

I was thinking which approach could handle more requests per second, which has lower latency, which is better ;)

asked Jul 22 '11 at 14:05

nightride's gravatar image

nightride
1025611

edited Jul 22 '11 at 14:12


2 Answers:

At first approximation 1 favors throughput & 2 favors latency, since 1 minimizes the amount of communication while 2 parallelizes the search, assuming that takes longer than the communication. However, it all depends on a number of factors:

how long does it take for a complete sequential search? how many nodes? what is the network bandwidth & latency? how reliable is your network? how many nearest neighbors are you looking for? must the result be exact be exact or is some degree of approximation acceptable?

A simple counter example to my answer above is if you use a lot of nodes & your network is flaky, and you really need exact NNs, then 2 can have poor latency, since it will have the latency of the slowest node and with enough nodes it is likely at least one will flake out. Depending on what the factors are in your case you could try an intermediate approach where you have overlap between nodes and you send the query to each node, but not wait for all to respond, just when you have received enough responses to be sufficiently confident you have seen the best result. Alternatively you can also not send responses to all only to be confident enough that you have covered all your items. All that depends on being able to tolerate some degree of approximation, though.

answered Jul 26 '11 at 20:46

Daniel%20Mahler's gravatar image

Daniel Mahler
8462912

Approach 2 is certainly going to be better than 1. You are distributing load and parallelizing much better in the second case.

answered Jul 22 '11 at 14:12

Rob%20Renaud's gravatar image

Rob Renaud
41551321

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.