|
I have an ill-formatted CSV file in which each line is a bibliography entry. Unfortunately, the fields are sometimes shifted a little, so part of the title or the journal name may end up in the author field. Is there an easy, premade tool, ideally in Python, that can determine whether a set of words is a human name or not? |
|
I think you're better off getting a dictionary of names from the web (for example, from the US census data) and doing exact matching, or some form of approximate matching to catch misspellings. Unfortunately, you'll also have to inspect your output by hand to see what it missed. |
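Something along these lines might work as a starting point. It is only a rough sketch using the standard library's `difflib`; the tiny `KNOWN_NAMES` list, the helper names `looks_like_name` and `name_fraction`, and the `0.85` cutoff are all placeholders I made up, and you would load a real name list (such as census surnames) instead.

```python
import difflib

# Hypothetical sample list; replace with a real dictionary of names,
# e.g. surnames loaded from census data.
KNOWN_NAMES = sorted({"smith", "johnson", "garcia", "li", "ma", "nguyen"})


def looks_like_name(token, known=KNOWN_NAMES, cutoff=0.85):
    """Return True if token matches a known name exactly or approximately."""
    word = token.strip(" .,;").lower()
    if not word:
        return False
    if word in known:
        return True  # exact match
    # Approximate match to catch misspellings; cutoff is an arbitrary guess.
    return bool(difflib.get_close_matches(word, known, n=1, cutoff=cutoff))


def name_fraction(field):
    """Fraction of the words in a field that look like names (initials skipped)."""
    tokens = [t for t in field.replace(",", " ").split() if len(t.strip(".")) > 1]
    if not tokens:
        return 0.0
    return sum(looks_like_name(t) for t in tokens) / len(tokens)
```

You could then flag any author field where `name_fraction` comes out low and check those rows manually. Short names like Li or Ma only match exactly here, and common words that are also surnames will produce false positives, so the manual check is still needed.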
A very difficult problem. Most names will be surnames, and Chinese names are difficult to detect (Li, Ma, etc.). You may have more luck getting the names of the journals and titles, and removing those.
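A minimal sketch of the removal idea, assuming you can build a list of the journal names that occur in your file. `KNOWN_JOURNALS` and `strip_journals` are hypothetical names, and the two journals are just examples:

```python
import re

# Hypothetical list; in practice collect the journal names from your own data.
KNOWN_JOURNALS = ["Journal of Applied Physics", "Nature"]


def strip_journals(field, journals=KNOWN_JOURNALS):
    """Remove known journal names from a field, ignoring case."""
    # Try longer names first so a short name cannot break up a longer one.
    for journal in sorted(journals, key=len, reverse=True):
        pattern = r"\b" + re.escape(journal) + r"\b"
        field = re.sub(pattern, " ", field, flags=re.IGNORECASE)
    # Collapse leftover whitespace and trim stray separators.
    return re.sub(r"\s+", " ", field).strip(" ,;")
```

Whatever remains in the author field after this is more likely to be actual names, though a journal title that is abbreviated or misspelled in the file would slip through.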
I got the data in XML form now, so no problem.