3
1

I am looking to use SVM + a string kernel (Lodhi 2002) for a classification task involving sequences of bytes from a hard drive. I would prefer to use an existing implementation as reference when I write my own. I don't anticipate there will be a tool that will fully perform my classification task using a string kernel, so I am mainly looking for a reference implementation of just the kernel. Does anyone know where I could find such a reference?

asked Jan 09 '11 at 18:44

Travis%20Wolfe's gravatar image

Travis Wolfe
235119


3 Answers:

The paper talks about subsequence kernel which is basically a variation on the dynamic programming solution to longest common subsequence of two strings.

Pseudo code for the subsequence kernel is in book: Kernel Methods for Pattern Analysis (http://www.kernel-methods.net/). Also if you need it, here it's my Python implementation:

def SSK(lamb, p):
        """Return subsequence kernel"""
        def SSKernel(xi,xj,lamb,p):
            mykey = (xi, xj) if xi>xj else (xj, xi)
            if not mykey in cache:
                dps = []
                for i in xrange(len(xi)):
                    dps.append([lamb**2 if xi[i] == xj[j] else 0 for j in xrange(len(xj))])
                dp = []
                for i in xrange(len(xi)+1):
                    dp.append([0]*(len(xj)+1))
                k = [0]*(p+1)
                for l in xrange(2, p + 1):
                    for i in xrange(len(xi)):
                        for j in xrange(len(xj)):
                            dp[i+1][j+1] = dps[i][j] + lamb * dp[i][j+1] + lamb * dp[i+1][j] - lamb**2 * dp[i][j]
                            if xi[i] == xj[j]:
                                dps[i][j] = lamb**2 * dp[i][j]
                                k[l] = k[l] + dps[i][j]
                cache[mykey] = k[p]
            return cache[mykey]
        return lambda xi, xj: SSKernel(xi,xj,lamb,p)/(SSKernel(xi,xi,lamb,p) * SSKernel(xj,xj,lamb,p))**0.5

It's not commented but, xi and xj are strings, lamb and p are parameters described in the paper. It also caches already seen pairs of xi and xj.

answered Jan 19 '11 at 16:54

Rok%20Mocnik's gravatar image

Rok Mocnik
7122

could you comment on the applications of such a kernel?

(Jan 19 '11 at 17:02) Alexandre Gramfort

this is just what i asked for, and now i remember why i asked for a reference implementation :) unfortunately this may be too slow. i have some preliminary experiments using an edit distance kernel (which should be faster than SSK), and it is REALLY slow.

(Jan 19 '11 at 17:19) Travis Wolfe

What about using k-mers kernel? It should be much faster than SSK and edit distance, since it's linear in the maximum length of both strings.

(Jan 19 '11 at 18:18) Rok Mocnik

this code can be trivially rewritten with cython to get a huge speed up.

(Jan 20 '11 at 11:43) Alexandre Gramfort

hi,

it looks like libsvm has an implementation of string kernels:

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#libsvm_for_string_data

hope this helps

Alex

answered Jan 09 '11 at 19:26

Alexandre%20Gramfort's gravatar image

Alexandre Gramfort
91237

Their package implements an edit distance kernel (which apparently doesn't always produce a proper kernel according to them). It should help nonetheless, thanks.

(Jan 09 '11 at 23:25) Travis Wolfe

if you come up with a good implementation BSD compatible, I'm sure the scikit-learn folks [1] would be interested.

[1] http://scikit-learn.sourceforge.net/

(Jan 10 '11 at 09:42) Alexandre Gramfort

Someone (offline) gave me this solution as well.

http://ace.cs.ohiou.edu/~razvan/code/ssk_core.tar.gz.

It implements the SSK (subsequence string kernel) in Java to be integrated with libsvm. I haven't tried it yet, but it looks legit.

answered Jan 24 '11 at 16:48

Travis%20Wolfe's gravatar image

Travis Wolfe
235119

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.