If I wanted to extract product names from text, how would I get training data? For example, 'makeup' and 'Dolce & Gabbana' are both terms that might indicate a product someone would buy. Running a Google search to check whether ads appear for the term seems like a good approach, but Google would ban you as a bot before long. What other ways are there to determine whether a word has commercial significance?

asked Nov 30 '11 at 11:45

Ben McCann


Have you tried Wikipedia? Dolce & Gabbana is in the category Luxury Brands (see the bottom of its page), which is itself in the category brand, and so on. You can download Wikipedia dumps in SQL or XML.
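To make the category idea concrete, here is a minimal sketch of walking up a category hierarchy to decide whether a term looks commercial. It assumes you have already extracted child-to-parent category links from a dump into a dictionary. The `PARENTS` table below is hand-written toy data, not real Wikipedia content, and the names `is_commercial`, `COMMERCIAL_ROOTS` and `max_depth` are made up for illustration.

```python
from collections import deque

# Toy child -> parent-category links. Hypothetical data standing in for
# whatever you extract from a Wikipedia dump; not real Wikipedia content.
PARENTS = {
    "Dolce & Gabbana": ["Luxury brands"],
    "Luxury brands": ["Brands"],
    "Makeup": ["Cosmetics"],
    "Cosmetics": ["Products"],
    "Photosynthesis": ["Plant physiology"],
    "Plant physiology": ["Botany"],
}

# Categories treated as evidence of commercial significance (an assumption).
COMMERCIAL_ROOTS = {"Brands", "Products"}


def is_commercial(term, parents=PARENTS, roots=COMMERCIAL_ROOTS, max_depth=5):
    """Breadth-first walk up the category graph, at most max_depth links."""
    seen = {term}
    queue = deque([(term, 0)])
    while queue:
        node, depth = queue.popleft()
        if node in roots:
            return True
        if depth == max_depth:
            continue  # do not climb any further from this node
        for parent in parents.get(node, []):
            if parent not in seen:  # guard against cycles in the graph
                seen.add(parent)
                queue.append((parent, depth + 1))
    return False


for term in ("Dolce & Gabbana", "Makeup", "Photosynthesis"):
    print(term, is_commercial(term))
```

The depth limit matters on real data, because category graphs tend to connect almost everything if you climb far enough. Terms that reach a root within a few links could serve as positive training examples, and terms that do not as negatives.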

answered Dec 02 '11 at 11:22

Renaud Richardet


