For text extraction from HTML, should I use the Jericho HTML Parser or boilerpipe? What are the pros and cons?

Boilerpipe seems very fast at producing text, but the Jericho parser seems to give more flexibility for extracting information from tags in addition to the text. Is that true?

asked Jun 21 '11 at 21:34

Melipone Moody

edited Jun 23 '11 at 15:54

Joseph Turian ♦♦

(Jun 23 '11 at 15:57) Joseph Turian ♦♦

One Answer:

I think you answered your own question. Judging from the two projects' web pages, you are right: boilerpipe focuses on speed, while Jericho focuses on flexibility. Given that, and if you aren't working on a project with a lot of interdependency, I would use boilerpipe until you find it too restrictive, and then move to Jericho.

If you modularise the calls (i.e. define an interface Extract and write a boilerpipe-specific class that implements it), you shouldn't have much trouble switching later on.
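
To make the modularisation idea concrete, here is a minimal sketch in Java. The interface name Extract comes from the suggestion above; everything else (the method name extractText, the classes NaiveTagStripper, Indexer and ExtractDemo) is hypothetical and only for illustration. The implementation shown is a naive JDK-only stand-in, not boilerpipe or Jericho, so that the sketch runs without extra jars.

```java
import java.util.regex.Pattern;

// The single seam between your code and whichever HTML library you pick.
// "extractText" is an assumed method name, not part of either library.
interface Extract {
    String extractText(String html);
}

// Stand-in implementation using only the JDK. It is NOT boilerpipe or Jericho:
// it just replaces tags with spaces and collapses whitespace.
final class NaiveTagStripper implements Extract {
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    @Override
    public String extractText(String html) {
        String noTags = TAGS.matcher(html).replaceAll(" ");
        return SPACES.matcher(noTags).replaceAll(" ").trim();
    }
}

// Calling code depends only on the interface, so the extractor can be swapped
// later without touching this class.
final class Indexer {
    private final Extract extractor;

    Indexer(Extract extractor) {
        this.extractor = extractor;
    }

    int wordCount(String html) {
        String text = extractor.extractText(html);
        return text.isEmpty() ? 0 : text.split(" ").length;
    }
}

class ExtractDemo {
    public static void main(String[] args) {
        Indexer indexer = new Indexer(new NaiveTagStripper());
        System.out.println(indexer.wordCount("<p>Hello <b>world</b></p>"));
    }
}
```

A boilerpipe-backed class would implement Extract by delegating to one of boilerpipe's extractors (for example ArticleExtractor.INSTANCE.getText(html), as far as I recall its API), and a Jericho-backed one could delegate to new Source(html).getTextExtractor().toString(). Check each library's documentation before relying on those calls. Switching libraries then means changing only the line that constructs the extractor.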

answered Jun 22 '11 at 21:02

Robert Layton

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.