|
Good day, I've been scouring machine learning texts, papers and websites looking for some indication of the best approach given our data / problem domain, but have come up with very little. Here's the situation: there are roughly 250000 'categories', and their defining features are 100% textual, unrelated and generally not shared between categories. Each category represents a specific car--not a make or a model--a physical car. My car, your car and your neighbour's car are all distinct categories. A pythonic example:
So say we have a master list of ~250000 cars represented as above--let's call that data set A. I mentioned above that the features are generally not shared between categories; by this I mean that VINs and plates are unique to the vehicle and full names are fairly unique.
The problem we're trying to solve is that we have another data set--data set B--with hundreds of millions of incomplete, inaccurate and messy snippets of features from unknown cars. Some will map 100% to a car within A as follows:
Whereas some will only contain one or two attributes:
And others may have an ambiguous mix of attributes:
We wish to determine the best match for each of the unknown car reports. Better still would be a list of the top X closest matches. This could obviously be approached in a straightforward programmatic manner, giving either equal or weighted importance to each of the possible features of a car... but I'm looking into what machine learning methodologies have to offer. This matching of elements in B against A will be occurring on an ongoing basis as data flows in--millions per day. Any ideas what approach may be a good fit? Bonus points if it's included in the scikit-learn Python library. Right now, I'd be happy with a solution where features are considered to either match or not... but down the road I'd also like to incorporate some sort of fuzzy, Levenshtein-like matching on the features. If this example seems unusual it's because it's completely contrived, but it mirrors the actual data/problem pretty well. Any ideas would be a great help, thanks! |
|
This seems more like a database problem than a machine learning problem. Unless I am mistaken, you are basically saying that you have a large number of entries and receive an even larger number of queries, some of which only contain partial data. The partial data does create some uncertainty, but with only three features and so much disparity between entries, you could just throw some rules together to prioritize matching order (or return all potential matches). |
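To make the "throw some rules together" idea concrete, here is a minimal sketch of prioritized exact matching, not a definitive implementation. The field names (`vin`, `plate`, `owner`), the priority order, the power-of-two weights and the sample records are all assumptions made up for illustration; in practice the in-memory dictionaries would likely be replaced by indexed database columns.

```python
from collections import defaultdict

# Hypothetical field names, listed from most to least trusted (an assumption).
PRIORITY = ("vin", "plate", "owner")


def normalize(value):
    """Upper-case and strip all whitespace so 'abc 123' equals 'ABC123'."""
    return "".join(str(value).upper().split())


def build_indexes(cars):
    """Build one lookup table per field: normalized value -> set of car ids."""
    indexes = {field: defaultdict(set) for field in PRIORITY}
    for car_id, car in cars.items():
        for field in PRIORITY:
            value = car.get(field)
            if value:
                indexes[field][normalize(value)].add(car_id)
    return indexes


def match(report, indexes, top_x=5):
    """Return up to top_x (car_id, score) pairs, best match first.

    Weights are powers of two, so a hit on a higher-priority field always
    outranks any combination of hits on lower-priority fields.
    """
    scores = defaultdict(int)
    n = len(PRIORITY)
    for rank, field in enumerate(PRIORITY):
        value = report.get(field)
        if not value:
            continue  # partial report: this field was not supplied
        for car_id in indexes[field].get(normalize(value), ()):
            scores[car_id] += 2 ** (n - 1 - rank)
    # Sort by descending score, then by id so ties are deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_x]


# Made-up records standing in for data set A.
cars = {
    "car_1": {"vin": "VIN0001", "plate": "ABC 123", "owner": "Jane Doe"},
    "car_2": {"vin": "VIN0002", "plate": "XYZ 789", "owner": "John Roe"},
}
indexes = build_indexes(cars)

print(match({"plate": "abc123"}, indexes))
print(match({"vin": "VIN0002", "plate": "ABC 123"}, indexes))
```

With these sample records, the second query is an ambiguous mix: it ranks `car_2` first because of the VIN hit and still returns `car_1` as a lower-ranked candidate because of the plate hit. Fuzzy, Levenshtein-like matching could later replace the exact dictionary lookup without changing the scoring rule.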