Let's say I have a list of records, and I want to de-dupe one string field.

The field contains many name variants, misspellings, acronyms, etc.

What is an appropriate error measure for de-dup quality?

I could compute an error over all nrows^2 pairwise comparisons, but I don't think that is instructive. I would prefer an error measure based on the number of rows (field tokens) or the number of field types, taking into account that several distinct field types may have been glommed together into one.

asked Oct 04 '12 at 11:17

Joseph Turian ♦♦


One Answer:

One important thing is to avoid all-pairs comparisons by using canopies whenever possible. The idea is to find some cheap, high-recall, low-precision features you can index by, and only compare records that match in at least one canopy.
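A minimal sketch of the canopy idea, assuming each record is a single name string; the blocking keys here (lowercased tokens plus 3-character prefixes) are just an illustrative choice, not a recommendation:

```python
from collections import defaultdict
from itertools import combinations

def canopy_keys(name):
    """Cheap, high-recall / low-precision blocking keys for a name string."""
    tokens = name.lower().split()
    keys = set(tokens)                               # whole tokens
    keys |= {t[:3] for t in tokens if len(t) >= 3}   # 3-char prefixes
    return keys

def candidate_pairs(records):
    """Yield only pairs of record indices that share at least one canopy key."""
    index = defaultdict(set)
    for i, name in enumerate(records):
        for key in canopy_keys(name):
            index[key].add(i)
    seen = set()
    for bucket in index.values():
        for i, j in combinations(sorted(bucket), 2):
            if (i, j) not in seen:
                seen.add((i, j))
                yield i, j

records = ["Jon Smith", "John Smith", "J. Smith", "Mary Jones"]
print(list(candidate_pairs(records)))  # far fewer pairs than all nrows^2
```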

Apart from that, as far as I know, state-of-the-art approaches mostly involve scoring "similarity" on the different fields with manually set weights, and then manually setting thresholds for merging records so that precision and recall on a small, manually deduped set of records stay at acceptable levels. One of the old coref examples in factorie for doing bibtex record matching might be helpful, but the code is not very easy to read (and the new, cleaner version is spread over many files).
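A hedged sketch of that manually weighted scoring step; the field names, weights, and the 0.8 merge threshold are made-up values you would tune against a small hand-labelled sample:

```python
from difflib import SequenceMatcher

# Hypothetical field weights and threshold, hand-tuned on labelled examples.
WEIGHTS = {"name": 0.6, "affiliation": 0.3, "email": 0.1}
MERGE_THRESHOLD = 0.8

def field_similarity(a, b):
    """Simple character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_score(rec_a, rec_b):
    """Weighted sum of per-field similarities."""
    return sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())

def should_merge(rec_a, rec_b):
    return record_score(rec_a, rec_b) >= MERGE_THRESHOLD
```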

One other important thing is to do the deduping incrementally: it's often fairly easy to find "consensus" values for already-deduped records, and comparing new records against the consensus value can make matches easier to spot than comparing against other noisy samples.
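A minimal sketch of the incremental idea, where a cluster's consensus value is simply its most common raw string (an assumed, simplistic choice) and `similar` is any boolean match function, e.g. a thresholded score like the one above:

```python
from collections import Counter

def consensus(cluster):
    """Most common raw value in a cluster stands in for its 'true' value."""
    return Counter(cluster).most_common(1)[0][0]

def dedupe_incrementally(names, similar):
    """Assign each name to the first cluster whose consensus value it matches,
    starting a new cluster when nothing matches."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similar(name, consensus(cluster)):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```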

answered Oct 04 '12 at 22:26

Alexandre Passos ♦

Thanks Alex. Those are similar to the techniques I had in mind.

I was specifically asking what evaluation measure I should use, not what is the technique to perform the deduping.

(Oct 05 '12 at 02:05) Joseph Turian ♦♦

Oh, sorry, I didn't understand that (I confused "error measure" with "score function"). I think people in my lab use variants of precision and recall: for a given "true record" and a "merged record", precision is the fraction of records in the merged record that represent the true record, and recall is the fraction of records representing the true record that ended up in the merged record. It's hard to label data for this, unfortunately.

(Oct 05 '12 at 07:55) Alexandre Passos ♦

Okay, so essentially F1 at the true-record level, and then I can macro-average over that.

(Oct 05 '12 at 14:00) Joseph Turian ♦♦
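A sketch of that evaluation, under the assumption that gold labels map each row to a true-record id, the system outputs a predicted cluster id per row, and each true record is paired with the predicted cluster holding most of its rows (an assumed pairing rule); F1 is then macro-averaged over true records:

```python
from collections import Counter, defaultdict

def macro_f1(gold, pred):
    """gold[i] / pred[i]: true-record id and predicted cluster id for row i."""
    pred_clusters = defaultdict(list)
    for g, p in zip(gold, pred):
        pred_clusters[p].append(g)

    f1s = []
    for true_id, n_true in Counter(gold).items():
        # Predicted cluster containing the most rows of this true record.
        best_p = max(pred_clusters, key=lambda p: pred_clusters[p].count(true_id))
        cluster = pred_clusters[best_p]
        overlap = cluster.count(true_id)
        precision = overlap / len(cluster)
        recall = overlap / n_true
        f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
        f1s.append(f1)
    return sum(f1s) / len(f1s)

# Example: rows 0-2 belong to one true record, rows 3-4 to another.
print(macro_f1(gold=["a", "a", "a", "b", "b"], pred=[1, 1, 2, 2, 2]))  # 0.8
```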
