|
I apologize if this question is too simple for this site. I really just need some advice on where to start with this problem. Let's say I have sales transaction data from a number of different retailers that all sell the same products. Even though they sell the same products, they all identify them a little differently. For example, one retailer calls a particular product "Kellogs Corn Flakes 20oz." and another calls it "Kel CF, 20". There are other representations from the hundreds of other retailers. The problem to solve is how to map each of the different product descriptions to a standard set of product descriptions so that the data can be aggregated. Assume that I have some group of retailers already mapped manually. I've been going over in my head where to start with this. Is it a search problem, where the search query is the retailer's representation and the standard description is the "document" to find? Or is it a classification problem, in that I'm trying to classify each description into a standard product description category? Or can named entity recognition play into it somehow? Any advice you could provide to get me started would be much appreciated. I've been looking at tools like Apache Lucene, Solr and OpenNLP, but it's just not clear to me how to characterize the problem. |
|
The problem that you are trying to solve is really specific and, as far as I can tell, is an ontology description problem. In biology this is a heavily researched area for the NLP community, because biologists tend to discover new strains of DNA (for example) without knowing whether another group has already discovered them. They then upload them to a database, and the database creates families of categories for the newly found strains. Papers like this might give you a starting point for your problem. I wish I could help you more, but NLP is not really my area.
Thanks, Leon. I hadn't thought of something like that. Will definitely check it out.
(Sep 13 '11 at 10:38)
Dave Kincaid
|
So is it correct that for every product you already have a unique canonical description, and that for some of the non-canonical descriptions you are also given the matching canonical description? Apart from that, the only information about each item is the description text. Yes?
How many distinct products? How many total descriptions?
What is the distribution of the number of descriptions per product like?
What is the distribution of description length in words? And in characters?
Are all descriptions in English?
Can you get feedback on the mappings as you generate them?
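Most of the counts asked about above can be read straight off the manually mapped data. Here is a minimal sketch in Python, assuming the mapped data is available as (retailer description, canonical product) pairs; the sample rows, the product identifiers, and the `summarize` helper are hypothetical and only illustrate the shape of the calculation.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample of manually mapped data:
# (retailer description, canonical product identifier).
mapped = [
    ("Kellogs Corn Flakes 20oz.", "corn_flakes_20oz"),
    ("Kel CF, 20", "corn_flakes_20oz"),
    ("Cheerios 15 oz", "cheerios_15oz"),
]

def summarize(pairs):
    """Return basic size and length statistics for mapped descriptions."""
    # Number of retailer descriptions attached to each canonical product.
    per_product = Counter(product for _, product in pairs)
    # Description lengths, counted in whitespace-separated words and in characters.
    word_lengths = [len(text.split()) for text, _ in pairs]
    char_lengths = [len(text) for text, _ in pairs]
    return {
        "distinct_products": len(per_product),
        "total_descriptions": len(pairs),
        "descriptions_per_product": dict(per_product),
        "mean_words": mean(word_lengths),
        "median_words": median(word_lengths),
        "mean_chars": mean(char_lengths),
        "median_chars": median(char_lengths),
    }

stats = summarize(mapped)
for name, value in stats.items():
    print(name, value)
```

If most products turn out to have only a handful of descriptions each, that would argue for treating this as retrieval against the canonical descriptions rather than as a classifier with one class per product, but that is a judgment to make once the real numbers are in.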