I want to find the globally optimal (or close to optimal) pairwise alignment between two long sequences of objects (the objects are strings in my case, but the algorithm is expected to operate on any object sequences, given that I provide a suitable distance function). For shorter sequences I can use the dynamic time warping (DTW) algorithm, but DTW needs to compute and store an n*m distance matrix (n, m being the lengths of the sequences), which is not feasible for longer sequences. Can you recommend such an algorithm? A working implementation would be a plus. I asked this question on Stack Overflow, but I guess this site is more suitable.

Edit: just to clarify my problem, the objects in the sequences do not always match exactly, i.e., there are deletions, insertions, and substitutions:

Input:
Sequence A: i saw the top of the mountain
Sequence B: then i see top of the mountains
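For concreteness, here is a minimal sketch of my own (not from any library) of DTW with a user-supplied distance function, keeping only two rows of the matrix. This gives the distance in O(min(n, m)) memory, but not the alignment path, which is exactly the part I am missing for long sequences:

```python
def dtw_distance(a, b, dist):
    """DTW distance using only two rows of the cost matrix
    (O(min(n, m)) memory). The alignment path itself is NOT
    recovered; this sketch only illustrates the memory issue."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer sequence, store the shorter row
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = dist(x, y)
            curr[j] = cost + min(prev[j],      # step in a only
                                 curr[j - 1],  # step in b only
                                 prev[j - 1])  # step in both
        prev = curr
    return prev[-1]
```

Usage: `dtw_distance([1, 2, 3], [1, 3], lambda x, y: abs(x - y))`, with any distance function over the object type.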
I am not sure this is directly related to your issue, but this PyCon 2011 talk by Titus Brown might give you some interesting ideas.

Thanks. The talk is very interesting indeed. I'm trying to figure out whether I can use some of its ideas for my problem.
(May 17 '11 at 09:14)
paraba
You should ask some biologists, especially people working on DNA or protein matching. There are entire books on the trade-offs between different algorithms for this task, and some are very efficient, considering biologists sometimes want to search efficiently for alignments of entire genes in GenBank. You can probably use the BLAST algorithms for this.

The problem with BLAST is that it seems to be very specific to biological sequences. Also, I would like to provide my own distance function, and BLAST doesn't seem to allow that. But yes, I have thought of asking biologists, too.
(May 17 '11 at 09:17)
paraba
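To illustrate the BLAST-style idea in a domain-agnostic way (a sketch of the seeding step only, not BLAST itself): index exact k-length runs of one sequence and look them up in the other. The resulting matches serve as anchors for a later extension/alignment step, and the objects only need to be hashable:

```python
from collections import defaultdict

def find_seed_matches(a, b, k=2):
    """BLAST-style seeding sketch: find exact length-k runs shared by
    both sequences. Each returned (i, j) pair means a[i:i+k] == b[j:j+k]
    and can anchor a local alignment/extension step around it."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[tuple(a[i:i + k])].append(i)
    seeds = []
    for j in range(len(b) - k + 1):
        for i in index.get(tuple(b[j:j + k]), []):
            seeds.append((i, j))
    return seeds
```

On the example from the question, `find_seed_matches("i saw the top of the mountain".split(), "then i see top of the mountains".split())` anchors on the shared runs "top of" and "of the".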
Would it not be possible to apply Levenshtein distance? This is typically used to compute the minimum number of {adds, deletes, changes} needed to make two strings equal. The resulting "closest match" string isn't typically given in any implementation I've used, but I think you can back it out from the edit-matrix data structure used in the algorithm.

I cannot use Levenshtein distance because my sequences are long (tens of thousands of elements), and the Levenshtein algorithm uses memory proportional to M*N, where M and N are the lengths of my sequences. This quickly becomes too large.
(May 17 '11 at 09:12)
paraba
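For what it's worth, the M*N memory cost applies only to the naive formulation: the distance alone needs just two rows, and Hirschberg's divide-and-conquer variant recovers the full alignment in linear space as well. A rough sketch of the two-row version with a pluggable substitution cost (names and defaults are my own choices):

```python
def edit_distance(a, b,
                  sub_cost=lambda x, y: 0 if x == y else 1,
                  indel_cost=1):
    """Levenshtein distance over arbitrary object sequences in
    O(min(n, m)) memory, keeping only two rows of the DP matrix.
    Hirschberg's algorithm extends this idea to recover the full
    alignment in linear space too."""
    if len(a) < len(b):
        a, b = b, a  # store rows of the shorter sequence
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for x in a:
        curr = [prev[0] + indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + indel_cost,          # delete x
                            curr[j - 1] + indel_cost,      # insert y
                            prev[j - 1] + sub_cost(x, y))) # substitute/match
        prev = curr
    return prev[-1]
```

It works on word sequences just as well as on characters, e.g. `edit_distance("i saw the top".split(), "i see the top".split())`.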
Beam pruning might help here. For example, at each position you retain only the 100 lowest-cost partial alignments. Memory usage is then linear in the sequence length.
(May 17 '11 at 14:38)
bedk
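A rough sketch of the beam idea (my own illustration, not taken from any particular paper): process the DP matrix row by row, but after each row keep only the `beam_width` lowest-cost cells. The result becomes a heuristic upper bound on the true distance when the beam is too narrow:

```python
def beam_edit_distance(a, b, beam_width=100,
                       sub_cost=lambda x, y: 0 if x == y else 1,
                       indel_cost=1):
    """Beam-pruned edit distance sketch: per DP row, retain only the
    beam_width lowest-cost cells, so work and memory per row are
    bounded by the beam width. With a narrow beam the result may
    overestimate the true distance (or be inf if the goal cell is
    pruned away)."""
    # row 0: cost of deleting a prefix of b (already pruned to the beam)
    prev = {j: j * indel_cost for j in range(min(beam_width, len(b) + 1))}
    for i, x in enumerate(a, start=1):
        curr = {0: i * indel_cost}  # column 0: cost of deleting a prefix of a
        for j in range(1, len(b) + 1):
            y = b[j - 1]
            candidates = []
            if j in prev:
                candidates.append(prev[j] + indel_cost)          # delete x
            if (j - 1) in curr:
                candidates.append(curr[j - 1] + indel_cost)      # insert y
            if (j - 1) in prev:
                candidates.append(prev[j - 1] + sub_cost(x, y))  # sub/match
            if candidates:
                curr[j] = min(candidates)
        # prune: keep only the beam_width lowest-cost cells of this row
        prev = dict(sorted(curr.items(), key=lambda kv: kv[1])[:beam_width])
    return prev.get(len(b), float("inf"))
```

With a sufficiently wide beam this reproduces the exact Levenshtein distance; shrinking `beam_width` trades accuracy for memory.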
Thanks for the beam pruning hint. What would a good starting point be for reading up? The van Hamme paper?
(May 23 '11 at 06:31)
Paul
There is a book you may want to consult: Encyclopedia of Distances: http://www.springer.com/mathematics/geometry/book/978-3-642-00233-5

Another book that comes to mind is Algorithms on Strings, Trees and Sequences: http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198

I hope you find them useful.