Hi All,

What specific techniques from supervised learning can we apply to improve our efficiency in matching large quantities of company names to a reference dataset of 3 million company names? We have already done a lot of semi-automatic "fuzzy matching" (using the Levenshtein algorithm) and may therefore have a good-sized training set to use.

For example, every month we receive a few thousand new names, one of which might be: "BMI UK Ltd". We fuzzy match this against the reference list and find the following possible matches: "BMI Training Solutions", "BMI Limited", "BMI UK Training". We currently have to choose one of the options by hand - "BMI Limited" in this case - and we have recorded the positive matches, while keeping a list of the fuzzy match results that we discarded (e.g. "BMI UK Training" in this example). The training set consists of matching attempts for some 300,000 companies, and I think there are usually about 2-4 fuzzy match results for each attempt. The reference dataset contains about 3 million company names.

In the future we will be seeing names that we have never seen at all, so I was assuming that we would have to somehow predict match likelihood based on features that are unrelated to the proper nouns themselves, except for some very common tokens ('Inc', 'PLC', 'Ltd', 'com', 'partner' and their synonyms/stems, etc.). Many of the smaller companies also have their industry in the name (e.g. 'First Bus', 'Smiths Plumbers'), so this might be helpful.

Many thanks for reading. Appreciate any thoughts on techniques to consider.

John
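P.S. To make it concrete, below is a rough sketch of the kind of thing I was imagining, in Python with scikit-learn: turn each (new name, fuzzy-match candidate) pair into features and train a binary classifier on our past accept/discard decisions. The feature set, the suffix list and the tiny training data are purely illustrative, not what we actually run.

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

# Illustrative list of very common "legal" tokens; not exhaustive.
LEGAL_SUFFIXES = {"ltd", "limited", "plc", "inc", "llp", "co"}

def tokens(name):
    return name.lower().replace(".", " ").replace(",", " ").split()

def features(query, candidate):
    """Features for one (new name, fuzzy-match candidate) pair."""
    q, c = tokens(query), tokens(candidate)
    q_core = set(q) - LEGAL_SUFFIXES
    c_core = set(c) - LEGAL_SUFFIXES
    overlap = len(q_core & c_core) / max(len(q_core | c_core), 1)
    return [
        SequenceMatcher(None, query.lower(), candidate.lower()).ratio(),  # whole-string similarity
        overlap,                          # token overlap, ignoring legal suffixes
        abs(len(q) - len(c)),             # difference in token counts
    ]

# Hypothetical labelled pairs from past manual decisions: 1 = accepted, 0 = discarded.
past_decisions = [
    ("BMI UK Ltd", "BMI Limited", 1),
    ("BMI UK Ltd", "BMI Training Solutions", 0),
    ("BMI UK Ltd", "BMI UK Training", 0),
    ("Smiths Plumbers Ltd", "Smiths Plumbing Limited", 1),
    ("Smiths Plumbers Ltd", "Smith & Sons Bakery", 0),
]
X = [features(q, c) for q, c, _ in past_decisions]
y = [label for _, _, label in past_decisions]

model = LogisticRegression().fit(X, y)

# At matching time: score each fuzzy-match candidate and rank them
# (or flag low-scoring cases for manual review).
query = "BMI UK Ltd"
candidates = ["BMI Training Solutions", "BMI Limited", "BMI UK Training"]
probs = model.predict_proba([features(query, c) for c in candidates])[:, 1]
for name, p in sorted(zip(candidates, probs), key=lambda pair: -pair[1]):
    print(f"{p:.2f}  {name}")
```

Is this roughly the right shape, or are there better-suited techniques (and features) for this kind of candidate-ranking problem?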