I'm wondering what a good way is to tell whether two names could potentially refer to the same person. E.g. "Susan B. Yates", "Susie Yates", and "Susan Bell Yates" could all be the same person. Those probably refer to someone different from "Mrs. Sally Yates" or "Susan Wates".

I'd imagine there are lots of interesting edge cases, such as Hispanic last names (e.g. I have a friend who may alternately write his last name as Almazzo or Valente Almazzo). Similarly, there are names from the southern U.S. like "Billy Ray" or "Mary Anne", which taken together are considered the first name rather than a first name followed by a middle or last name. I'm far more interested in whether two names could refer to the same person than in actually parsing the name, but it seems hard to do without splitting into a first and last name.

Are there any existing approaches or libraries (Java preferred) for this problem that I should take a look at?

asked Apr 18 '12 at 21:48

Ben McCann

Do you have any training data, or just the names? Depending on your training data, you could use entity recognition.

(Apr 19 '12 at 05:52) Leon Palafox ♦

2 Answers:

There are 3 separate problems here depending upon the context of what you're trying to do.

(1) Coreference (as mentioned by bedk) is when the two names occur in the same document. Usually the first occurrence is the full name (Susan Yates) and following occurrences use part of the name (Yates), so coref systems tend to be pretty good at merging these into a single "mention chain", or cluster.
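To make the "mention chain" idea concrete, here is a minimal sketch of a naive heuristic, not how real coreference systems work: each mention joins the first chain whose head (the chain's first mention) contains all of its tokens. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy illustration only: real coreference resolvers use far richer features.
final class MentionChainSketch {

    /** Lowercases a mention and splits it on whitespace. */
    static List<String> tokens(String mention) {
        return Arrays.asList(mention.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    /**
     * Groups mentions, given in document order, into chains.
     * A mention joins the first existing chain whose head mention
     * contains every token of the new mention; otherwise it starts a new chain.
     */
    static List<List<String>> chains(List<String> mentionsInOrder) {
        List<List<String>> chains = new ArrayList<>();
        for (String mention : mentionsInOrder) {
            List<String> target = null;
            for (List<String> chain : chains) {
                if (tokens(chain.get(0)).containsAll(tokens(mention))) {
                    target = chain;
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                chains.add(target);
            }
            target.add(mention);
        }
        return chains;
    }
}
```

With the mentions "Susan Yates", "Yates", "Sally Brown", "Brown" in that order, this yields two chains, one per person. It breaks as soon as two people in the same document share a surname, which is exactly where a real coref system earns its keep.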

(2) Entity grounding (or entity linking) is when you're interested in tying a name mentioned in text to an actual person (or any other entity). Typically, this involves grounding to a resource (such as Wikipedia) or database (such as a gazetteer). Thus, you can tell whether two names in completely separate documents refer to the same individual by checking whether they have the same grounding. For example, if three documents mention "Scott Brown", and two of those mentions ground to the Wikipedia article "Scott Brown" (i.e., the U.S. Senator) while the third grounds to "Scott Brown (Scottish footballer)", then you know that only two of the three refer to the same person.
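As a rough sketch of the Scott Brown example, assuming some entity linker has already assigned each mention a grounding (that linker is the hard part and is not shown), deciding whether mentions refer to the same person reduces to comparing identifiers. The `Mention` class and the document IDs below are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroundingSketch {

    /** A name mention plus the entity it was grounded to (assumed given). */
    static final class Mention {
        final String docId;
        final String surface;
        final String entityId; // e.g. a Wikipedia article title

        Mention(String docId, String surface, String entityId) {
            this.docId = docId;
            this.surface = surface;
            this.entityId = entityId;
        }
    }

    /** Two mentions refer to the same entity iff their groundings are equal. */
    static boolean sameEntity(Mention a, Mention b) {
        return a.entityId.equals(b.entityId);
    }

    /** Groups document IDs by the entity their mention was grounded to. */
    static Map<String, List<String>> docsByEntity(List<Mention> mentions) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Mention m : mentions) {
            groups.computeIfAbsent(m.entityId, k -> new ArrayList<>()).add(m.docId);
        }
        return groups;
    }
}
```

Note that the surface string "Scott Brown" plays no role in the comparison; only the grounding does, which is what lets this handle several people with the same name.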

(3) Name matching is the task of determining how likely it is that two different spellings refer to the same person. This is most commonly used when dealing with non-English names, which as you mention could have different spellings or transliterations. This task is completely context free (i.e., you don't have, or choose not to use, any additional information such as document-level context). Thus the task cannot handle polysemy (multiple people with the same name), only synonymy (different names for the same person).
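For a feel of what context-free matching looks like on the examples from the question, here is a minimal rule-based sketch. It is not the approach used in the MITRE challenge or in LingPipe, and it gives a yes/no answer rather than a likelihood. The assumptions are mine: the last token is the surname and must match exactly, an initial matches any name starting with that letter, and the title list and nickname table are tiny hypothetical stand-ins for real resources.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class NameMatchSketch {

    // Hypothetical, deliberately tiny resources; real ones would be much larger.
    private static final Set<String> TITLES =
            new HashSet<>(Arrays.asList("mr", "mrs", "ms", "dr"));
    private static final Map<String, String> NICKNAMES = new HashMap<>();
    static {
        NICKNAMES.put("susie", "susan");
        NICKNAMES.put("sue", "susan");
    }

    /** Lowercases, turns punctuation into spaces, drops titles, splits on whitespace. */
    static List<String> tokens(String name) {
        String cleaned = name.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\s]", " ").trim();
        List<String> out = new ArrayList<>();
        for (String t : cleaned.split("\\s+")) {
            if (!t.isEmpty() && !TITLES.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    /** Maps a known nickname to its full form; other names pass through unchanged. */
    static String canonical(String given) {
        return NICKNAMES.getOrDefault(given, given);
    }

    /** Tokens are compatible if equal, or if one is an initial of the other. */
    static boolean compatible(String a, String b) {
        if (a.equals(b)) return true;
        if (a.length() == 1) return b.startsWith(a);
        if (b.length() == 1) return a.startsWith(b);
        return false;
    }

    /** True if nothing in the two names rules out their being the same person. */
    static boolean couldBeSamePerson(String n1, String n2) {
        List<String> a = tokens(n1);
        List<String> b = tokens(n2);
        if (a.isEmpty() || b.isEmpty()) return false;
        // Assumption: the last token is the surname and must match exactly.
        if (!a.get(a.size() - 1).equals(b.get(b.size() - 1))) return false;
        // A bare surname is compatible with any name sharing that surname.
        if (a.size() == 1 || b.size() == 1) return true;
        if (!compatible(canonical(a.get(0)), canonical(b.get(0)))) return false;
        // Compare middle names pairwise; a missing middle name is not a conflict.
        List<String> ma = a.subList(1, a.size() - 1);
        List<String> mb = b.subList(1, b.size() - 1);
        for (int i = 0; i < Math.min(ma.size(), mb.size()); i++) {
            if (!compatible(ma.get(i), mb.get(i))) return false;
        }
        return true;
    }
}
```

On the question's examples, "Susan B. Yates", "Susie Yates" and "Susan Bell Yates" are all mutually compatible, while "Mrs. Sally Yates" and "Susan Wates" are rejected. The surname-is-last-token assumption is exactly what fails for the "Valente Almazzo" and "Billy Ray" cases, and exact surname equality ignores transliteration variants, so treat this as a baseline to measure a proper similarity model against.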

On the surface, it seems that you're interested in (3), though I listed the others in case you were dealing with additional context. If name matching is all you're interested in, see the MITRE challenge, which addressed this exact issue. Bob Carpenter has a very simple Java-based implementation using LingPipe available on his blog.

answered Apr 19 '12 at 16:03

Kirk Roberts

This problem goes by the name "coreference resolution" in the NLP literature. A Google search on "coreference resolution software" turns up a number of packages. I cannot comment on the quality of any of them, as this is a bit outside my area.

answered Apr 19 '12 at 07:48

bedk

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.