|
For text extraction from HTML, should I use the Jericho HTML Parser or boilerpipe? What are the pros and cons of each? boilerpipe seems very fast at producing text, but Jericho seems to give more flexibility for extracting information from tags in addition to text. Is that right? |
|
I think you answered your own question. Judging from the project webpages, you are right: boilerpipe focuses on speed, Jericho focuses on flexibility. In that case, and if you aren't working on a project with a lot of interdependency, I would use boilerpipe until you find it too restrictive, and then move to Jericho. If you modularise the calls (i.e. define an interface Extract and write a boilerpipe-centric class that implements it), you shouldn't have much trouble switching later on. |
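The modularisation suggested above could be sketched roughly as follows. This is only an illustration: the method name `extractText` and the class `TagStrippingExtractor` are hypothetical, and the implementation shown is a JDK-only placeholder so that the snippet compiles without either library on the classpath. A boilerpipe-backed class would delegate to something like `ArticleExtractor.INSTANCE.getText(html)` and a Jericho-backed one to `new Source(html).getTextExtractor().toString()`; check those calls against the version you actually use.

```java
import java.util.regex.Pattern;

/** The interface the rest of the project depends on (the method name is an assumption). */
interface Extract {
    String extractText(String html);
}

/**
 * Placeholder implementation using only the JDK, standing in for a
 * boilerpipe-centric class. A real implementation would call the library
 * here and handle the library's own exceptions inside this method, so that
 * callers never see library-specific types.
 */
class TagStrippingExtractor implements Extract {
    // Crude tag matcher: anything between '<' and the next '>'.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    @Override
    public String extractText(String html) {
        // Replace each tag with a space, then collapse runs of whitespace.
        String withoutTags = TAGS.matcher(html).replaceAll(" ");
        return SPACES.matcher(withoutTags).replaceAll(" ").trim();
    }
}
```

Calling code would hold only an `Extract` reference, for example `Extract extractor = new TagStrippingExtractor();`, so moving from boilerpipe to Jericho later means writing one new class and changing that single construction site.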
See also: Text extraction from HTML pages