1
1

I want to find the globally optimal (or close to optimal) pairwise alignment between two long sequences of objects (objects are strings in my case, but the algorithm is expected to operate on any object sequences, given that I provide a suitable distance function).

For shorter sequences, I can use the dynamic time warping (DTW) algorithm but the DTW algorithm needs to compute and store a n*m distance matrix (n,m are lengths of the sequences) which is not feasible for longer sequences. Can you recommend such algorithm? A working implementation would be a plus.

I asked the question at stackoverflow but I guess this site is more suitable for this.

Edit: just to clarify my problem, the objects in the sequences do no match always exactly, i.e., there are deletions, insertions and substitutions:

Input:
Sequence A: i saw the top of the mountain
Sequence B: then i see top of the mountains

Result:
Sequence A: i saw the top of the mountain Sequence B: then i see top of the mountains

asked May 17 '11 at 05:21

paraba's gravatar image

paraba
256288

edited May 17 '11 at 06:23


4 Answers:

I am not sure this is directly related to your issue, but this PyCon 2011 talk by Titus Brown might give you some interesting ideas.

answered May 17 '11 at 06:04

ogrisel's gravatar image

ogrisel
498995591

Thanks. The talk is very interesting indeed. I'm trying to figure out whether I can use some ideas for my problem.

(May 17 '11 at 09:14) paraba

You should ask some biologists, specially people working on DNA or protein matching. There are entire books on the tradeoffs between different algorithms for this task, and some are very efficient, considering biologists sometimes want to search for alignments of entire genes on the genebank efficiently.

You can probably use the BLAST algorithms for this.

answered May 17 '11 at 06:49

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
2554154278421

The problem with BLAST is that it seems to be very specific to biological sequences. Also, I would like to provide my own distance function and BLAST doesn't seem to allow that. But yes, I have thought of asking biolgists, too.

(May 17 '11 at 09:17) paraba

would it not be possible to apply Levenshtein Distance? This is typically used to compute the minimum number of {adds, deletes, changes} needed to make two strings equal. The resulting "closest match" string isnt typically given in any implementation i've used, but i think you can just back this out from the edit matrix data structure used in the algorithm

answered May 17 '11 at 08:45

downer's gravatar image

downer
54891720

I cannot use Levenshtein distance because my sequences are long (tens of thousands), and the Levenshtein algorithm uses memory proportional to M*N, where M and N are lengths of my sequences. This quickly becomes too large.

(May 17 '11 at 09:12) paraba

Beam pruning might help here. For example, at each position, you retain only the 100 lowest-cost partial alignments. Then memory usage is then linear in the sequence length.

(May 17 '11 at 14:38) bedk

thanks for the beam pruning hint. what would a good starting point be for reading up? the van Hamme paper?

(May 23 '11 at 06:31) Paul

There is a book which you may want to consult: Encyclopedia of Distances:

http://www.springer.com/mathematics/geometry/book/978-3-642-00233-5

Another book I think of right now is: Algorithms on strings, trees and sequences:

http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198

I hope you find them useful.

answered May 17 '11 at 10:14

Svetoslav%20Marinov's gravatar image

Svetoslav Marinov
26618

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.