Hi,
I am trying to extract structured data from e-commerce product offering pages.
I extracted the following from an HTML file:
Simplified data structure:
<classification> content content_length tag tag_id tree_pos abs_pos parent_tag parent_tag_id
Example training data set:
<price> "12.99 €" "7" "p" "price" "5" "12.3" "div" "price-info"
<title> "Casette iPhone Case" "19" "H1" "product-title" "12" "60.53" "div" "header"
<desc> "This is our best product, you need it. Get it now. In red or blue" "50" "p" "description" "13" "323" "div" "description"
<stock> ...
<image> ...
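To make the row format concrete, here is a minimal Python sketch of how one such line could be parsed into a typed record. The `Sample` class and `parse_row` function are names I made up for illustration, and I am assuming every row has exactly one class label followed by the eight quoted fields listed above.

```python
import shlex
from dataclasses import dataclass


@dataclass
class Sample:
    label: str            # class label, e.g. "price" or "title"
    content: str
    content_length: int
    tag: str
    tag_id: str
    tree_pos: int
    abs_pos: float
    parent_tag: str
    parent_tag_id: str


def parse_row(line: str) -> Sample:
    """Parse one '<label> "field" ...' row (hypothetical helper)."""
    fields = shlex.split(line)  # honours the double quotes around each field
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    label = fields[0].strip("<>")
    (content, length, tag, tag_id,
     tree_pos, abs_pos, parent_tag, parent_tag_id) = fields[1:]
    return Sample(
        label=label,
        content=content,
        content_length=int(length),
        tag=tag.lower(),              # normalise "H1" and "h1" to one value
        tag_id=tag_id,
        tree_pos=int(tree_pos),
        abs_pos=float(abs_pos),
        parent_tag=parent_tag.lower(),
        parent_tag_id=parent_tag_id,
    )


row = parse_row(
    '<title> "Casette iPhone Case" "19" "H1" "product-title" '
    '"12" "60.53" "div" "header"'
)
print(row)
```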
Which classification algorithm would best fit this kind of training set?
The content is essentially the target of the classification, i.e. the piece of structured data that should end up labelled as price, title, description and so on.
I see the following levels of scale:
- nominal (HTML tag type: H1, p, etc.)
- ordinal (tree position, length)
- ratio (absolute values)
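As a rough sketch of how these scale levels could be turned into one numeric feature vector: the nominal tag is one-hot encoded, while the ordinal and ratio values are passed through as numbers. The tag vocabulary `TAGS` and the function names are assumptions for illustration only, not something fixed by my data.

```python
# Hypothetical tag vocabulary; in practice it would be collected from the training set.
TAGS = ["h1", "p", "div", "span"]


def one_hot(value, vocabulary):
    """Nominal scale: one slot per known value, no ordering implied."""
    return [1.0 if value == known else 0.0 for known in vocabulary]


def encode(tag, tree_pos, content_length, abs_pos):
    """Build one flat feature vector from a single extracted element."""
    nominal = one_hot(tag.lower(), TAGS)
    ordinal = [float(tree_pos), float(content_length)]  # order is meaningful
    ratio = [float(abs_pos)]                            # true zero, distances meaningful
    return nominal + ordinal + ratio


vec = encode("H1", 12, 19, 60.53)
print(vec)
```

A tag outside the vocabulary simply yields an all-zero nominal part in this sketch.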
The tag ID is a hard one. It is essentially a string, and I see a useful way to exploit it by putting the training-set strings into some kind of "bag of words". If there is an exact match, a synonym match, or a match with a low string edit distance, the feature outputs a score for the specific class.
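Here is a minimal sketch of the scoring idea, just to show what I mean. The score values (1.0, 0.8, 0.5), the synonym table, the per-class word bags and the distance threshold are all placeholders I picked for illustration; the edit distance is plain Levenshtein distance.

```python
import re

# Hypothetical per-class bags of words, collected from tag IDs in the training set.
BAGS = {"price": {"price"}, "title": {"product", "title"}, "desc": {"description"}}
# Hypothetical synonym table; a real one might come from a thesaurus.
SYNONYMS = {"price": {"cost", "amount"}, "title": {"name", "heading"}}


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def tag_id_score(tag_id, class_name, max_dist=1):
    """Score how strongly a tag ID hints at class_name (0.0 means no hint)."""
    tokens = [t for t in re.split(r"[-_\s]+", tag_id.lower()) if t]
    bag = BAGS.get(class_name, set())
    best = 0.0
    for tok in tokens:
        if tok in bag:
            best = max(best, 1.0)   # exact match
        elif tok in SYNONYMS.get(class_name, set()):
            best = max(best, 0.8)   # synonym match
        elif any(edit_distance(tok, known) <= max_dist for known in bag):
            best = max(best, 0.5)   # near match by edit distance
    return best


print(tag_id_score("price-box", "price"))
print(tag_id_score("cost", "price"))
print(tag_id_score("prize", "price"))
print(tag_id_score("footer", "price"))
```

Computing one such score per class would give a small numeric vector that can sit next to the other features.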
Am I basically on the right track?
Which learning algorithm would you suggest?
Big TIA & Best
Uwe Stoll
asked
Oct 18 '11 at 04:35
ustoll