I'm using SVMs with a linear kernel for binary sentence classification. My dataset contains many duplicate instances, i.e. many sentences in the training set have identical feature vectors. In the trained model I see a lot of duplicate support vectors, each with a different Lagrange multiplier (alpha*y) term. For example:

-0.064019597363882741 5660:1 # fcda5cf6-e00b-11e2-ddad-001517e11bd0
-0.00028270965981203044 5660:1 # 67d4d4d2-855c-11e3-f88e-001517e11900
-0.00084391434273911603 5660:1 # 38d53d84-e435-11e2-ddad-001517e11bd0
-0.031641060077848393 5660:1 # 069c2c6e-e8be-11e2-edb1-001517e116e4
-0.0028105012661461251 5660:1 # 623c52e4-e038-11e2-edb1-001517e116e4
-0.00098570725949942428 5660:1 # f67aae78-e020-11e2-5884-001517e11900

For those unfamiliar with the SVMLight model format, each of the above lines represents a support vector, where:

  • The 1st term is alpha*y.
  • Key-value pairs ('5660:1') are feature_number:feature_value. So all of the above training instances contain just one feature, with value 1.
  • Everything following '#' is a comment (instance IDs, in my case). A small parsing sketch follows this list.
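
To make the format concrete, here is a minimal Python sketch (the names are my own, not part of SVMLight) that parses such model lines and sums the alpha*y terms per unique feature vector. Since the decision function f(x) = sum_i alpha_i*y_i*K(x_i, x) + b only ever sees the sum of alpha*y over identical vectors, collapsing duplicates this way does not change predictions:

    from collections import defaultdict

    def parse_sv_line(line):
        """Parse one SVMLight support-vector line into (alpha*y, features, comment)."""
        body, _, comment = line.partition('#')
        tokens = body.split()
        alpha_y = float(tokens[0])  # 1st term is alpha*y
        features = tuple(sorted(
            (int(k), float(v)) for k, v in (t.split(':', 1) for t in tokens[1:])
        ))
        return alpha_y, features, comment.strip()

    sv_lines = [
        "-0.064019597363882741 5660:1 # fcda5cf6-e00b-11e2-ddad-001517e11bd0",
        "-0.00028270965981203044 5660:1 # 67d4d4d2-855c-11e3-f88e-001517e11900",
    ]

    combined = defaultdict(float)
    for line in sv_lines:
        alpha_y, features, _ = parse_sv_line(line)
        combined[features] += alpha_y  # duplicates collapse into one summed coefficient

    for features, total in combined.items():
        print(features, total)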

Why do the above support vectors have different alpha*y values even though they have identical feature vectors? And why does the model contain multiple SVs with the same feature vector in the first place? I had expected all SVs to have unique feature vectors, with the ones duplicated in the training set simply getting a larger alpha value. I agree that duplicating training instances can shift the margin if the duplicates fall inside the (soft) margin, but I can't understand why each duplicate support vector would end up with a different alpha*y value.
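
For anyone who wants to reproduce this behaviour outside SVMLight, below is a minimal sketch assuming scikit-learn (whose SVC wraps libsvm, a different solver than SVMLight). It trains a linear-kernel SVM on deliberately duplicated rows and prints the per-support-vector alpha*y coefficients:

    import numpy as np
    from sklearn.svm import SVC

    # Tiny binary problem with deliberately duplicated rows in each class.
    X = np.array([[0.0], [0.0], [0.0],   # three identical negative instances
                  [1.0], [1.0], [1.0]])  # three identical positive instances
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel='linear', C=1.0).fit(X, y)

    # dual_coef_ holds alpha*y for each support vector; identical feature
    # vectors can each appear as their own SV with their own coefficient.
    for idx, coef in zip(clf.support_, clf.dual_coef_[0]):
        print(f"instance {idx}: x={X[idx]}, alpha*y={coef:.6f}")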

asked Jul 22 '14 at 20:36 by A W