Hi,

I am trying to extract structured data from e-commerce product offering pages. So far I have extracted the following from an HTML file:

simplified data structure: <classification> content content_length tag tag_id tree_pos abs_pos parent_tag parent_tag_id

example learning data set:

<price> "12.99 €" "7" "p" "price" "5" "12.3" "div" "price-info"
<title> "Casette iPhone Case" "19" "H1" "product-title" "12" "60.53" "div" "header"
<desc> "This is our best product, you need it. Get it now. In red or blue" "50" "p" "description" "13" "323" "div" "description"
<stock> ...
<image> ...
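
To make the record layout concrete, here is a minimal Python sketch that parses one line of the learning set into a typed record. The `Sample` class and `parse_sample` function are hypothetical names chosen for illustration, and lower-casing the tag names is an assumption, not something the extraction step is known to do.

```python
import re
from dataclasses import dataclass


@dataclass
class Sample:
    """One labelled training record (hypothetical container)."""
    label: str
    content: str
    content_length: int
    tag: str
    tag_id: str
    tree_pos: int
    abs_pos: float
    parent_tag: str
    parent_tag_id: str


def parse_sample(line: str) -> Sample:
    """Parse '<label> "v1" "v2" ...' into a Sample with 8 quoted fields."""
    label_match = re.match(r"\s*<(\w+)>", line)
    if label_match is None:
        raise ValueError("line does not start with a <classification> label")
    values = re.findall(r'"([^"]*)"', line)
    if len(values) != 8:
        raise ValueError(f"expected 8 quoted fields, got {len(values)}")
    (content, length, tag, tag_id,
     tree_pos, abs_pos, parent_tag, parent_tag_id) = values
    return Sample(
        label=label_match.group(1),
        content=content,
        content_length=int(length),
        tag=tag.lower(),            # assumption: tag names are case-insensitive
        tag_id=tag_id,
        tree_pos=int(tree_pos),
        abs_pos=float(abs_pos),
        parent_tag=parent_tag.lower(),
        parent_tag_id=parent_tag_id,
    )


sample = parse_sample(
    '<title> "Casette iPhone Case" "19" "H1" "product-title" '
    '"12" "60.53" "div" "header"'
)
print(sample.label, sample.tag, sample.tree_pos, sample.abs_pos)
```

The field-count check is deliberate: a line with a missing or extra quoted value is rejected instead of being shifted silently into the wrong columns.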

Which classification algorithm would fit best to deal with this kind of learning set?

Content is essentially the target of the classification for structured data.
I see the following levels of scale:
- nominal (HTML tag type: H1, p, etc.)
- ordinal (tree position, length)
- ratio (absolute values)
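
As a rough illustration of how these three scales could end up in one numeric feature vector, here is a small Python sketch. The tag vocabulary `TAGS` and the function name `encode_features` are assumptions made up for this example; the nominal tag is one-hot encoded, while the ordinal and ratio values are simply passed through as numbers.

```python
# Hypothetical tag vocabulary; a real one would be collected from the training pages.
TAGS = ["h1", "p", "div", "span", "img"]


def encode_features(tag, tree_pos, content_length, abs_pos):
    """Turn one record's non-string attributes into a flat list of floats."""
    # Nominal: one-hot, so no artificial order is imposed on tag types.
    tag_vector = [1.0 if tag.lower() == known else 0.0 for known in TAGS]
    # Ordinal and ratio: kept as plain numbers (no scaling applied here).
    return tag_vector + [float(tree_pos), float(content_length), float(abs_pos)]


print(encode_features("H1", 12, 19, 60.53))
```

A tag that is not in the vocabulary encodes as all zeros in this sketch, which is one simple way to handle unseen tag types.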

Tag-id is a hard one. It's essentially a string, and I see a useful way to exploit it: put the tag-id strings from the learning set into some kind of "bag of words". If there is an exact match, a synonym match, or a match with a low string edit distance, output a score for the specific class.
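
The tag-id idea could be sketched roughly as follows in Python. All names here (`tokenize`, `build_bags`, `score_tag_id`) are hypothetical, the scores 1.0 and 0.5 and the distance threshold of 1 are arbitrary assumptions, and synonym matching is left out because it would need an external thesaurus.

```python
import re
from collections import defaultdict


def tokenize(tag_id):
    """Split an id such as 'product-title' into lower-case word tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", tag_id.lower()) if t]


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def build_bags(training):
    """Collect the id tokens seen for each class from (label, tag_id) pairs."""
    bags = defaultdict(set)
    for label, tag_id in training:
        bags[label].update(tokenize(tag_id))
    return bags


def token_score(token, bag, max_distance=1):
    """1.0 for an exact match, 0.5 for a near match, 0.0 otherwise."""
    if token in bag:
        return 1.0
    if any(edit_distance(token, word) <= max_distance for word in bag):
        return 0.5
    return 0.0


def score_tag_id(tag_id, bags):
    """Best token score per class for one unseen tag id."""
    tokens = tokenize(tag_id)
    return {label: max((token_score(t, bag) for t in tokens), default=0.0)
            for label, bag in bags.items()}


bags = build_bags([("price", "price"),
                   ("title", "product-title"),
                   ("desc", "description")])
print(score_tag_id("prize", bags))
print(score_tag_id("product_title", bags))
```

The per-class scores could then be appended to the other features, so the classifier sees one numeric column per class instead of a raw string.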

Am I basically on the right track?

What would you suggest in terms of a learning algorithm?

Big TIA & Best

Uwe Stoll

asked Oct 18 '11 at 04:35

ustoll

edited Oct 18 '11 at 04:37


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.