I have an ill-formatted CSV file in which each line is a bibliography entry. Unfortunately, the fields are sometimes shifted a little, so part of the title or the journal name may appear in the author field. Is there an easy, premade tool, ideally in Python, that can determine whether a set of words is a human name or not?

asked Oct 15 '11 at 18:14

Jacob Jensen

A very difficult problem. Most names will be surnames, and Chinese names are difficult to determine (Li, Ma, etc.). You may have more luck getting the names of the journals and titles, and removing those.

(Oct 16 '11 at 20:50) Robert Layton

I got the data in XML form now, so no problem.

(Oct 18 '11 at 04:39) Jacob Jensen

One Answer:

I think you're better off getting a dictionary of names from the web (for example, from the US census data) and doing exact matching, or some form of approximate matching to catch misspellings. Unfortunately, you'll also have to inspect your output and see what it missed.
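
A minimal sketch of that idea, using only the standard library. The tiny name set, the 0.8 similarity cutoff, and the helper names `is_known_name` and `name_fraction` are all assumptions for illustration; in practice you would load a real name list (such as the census surnames) and tune the cutoff on your own data.

```python
import difflib

# Hypothetical sample; replace with a full name list loaded from a file.
KNOWN_NAMES = {"smith", "jensen", "passos", "layton", "johnson", "li", "ma"}


def is_known_name(token, names=KNOWN_NAMES, cutoff=0.8):
    """Return True if token matches a known name exactly or approximately."""
    word = token.strip(".,;").lower()
    if not word:
        return False
    if word in names:  # exact match
        return True
    # Approximate match to catch misspellings; cutoff is an assumed threshold.
    return bool(difflib.get_close_matches(word, names, n=1, cutoff=cutoff))


def name_fraction(field, names=KNOWN_NAMES):
    """Fraction of tokens in a field (ignoring initials) that look like names."""
    tokens = [t for t in field.replace(",", " ").split()
              if len(t.strip(".")) > 1]
    if not tokens:
        return 0.0
    return sum(is_known_name(t, names) for t in tokens) / len(tokens)


if __name__ == "__main__":
    for field in ("Jensen, J. and Passos, A.",
                  "Journal of Machine Learning Research"):
        print(field, "->", name_fraction(field))
```

A field whose fraction is low is a candidate for a shifted title or journal name. Short surnames such as Li or Ma are only caught by the exact match here, so the flagged rows still need a manual look.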

answered Oct 17 '11 at 08:47

Alexandre Passos ♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.