|
I would like to extract text from PDFs with a two-column layout, such as ICML papers. So far I have tried the following methods for two-column PDF text extraction on (Ubuntu) Linux:
Two can be eliminated immediately: pyPdf extractText emits a stream of characters with no spacing, and acroread requires manual intervention for each document. Each of the remaining methods was only partially successful on the sample ICML PDFs. Failure modes include intermixing the columns, putting the right-column text before the left-column text, and dropping the spaces between words. pdftotext succeeded most consistently. Since pdftotext still leaves room for improvement, the next options I plan to try are JPDFText, Apache PDFBox, Asprise Java PDF Reader, Multivalent, and PDFTextStream. Can anyone recommend a two-column PDF text extraction method that they have found to be reliable? |
|
I have hacked together a basic tool (https://github.com/ogrisel/paper2ebook) that rearranges a two-column PDF file so that it is readable on a Kindle, Android phone, iPhone, or any other small-screen device that can display PDF files. It uses PDFBox. It does not do what you want, but you can look at its source code, along with the ExtractTextByArea and PrintTextLocations examples from the PDFBox library, to build what you need. |
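To give a rough idea of what you could do once you have text positions (for example from something like the PrintTextLocations example), here is a minimal sketch in plain Python. It is not taken from paper2ebook or PDFBox. The function name `reorder_two_columns`, the `(x, y, text)` fragment format, and the sample coordinates are all made up for illustration, and it assumes that y grows downward, that fragments on the same line share exactly the same y, and that the column gutter sits at the page midline.

```python
from itertools import groupby


def reorder_two_columns(fragments, page_width):
    """Return the text of one page with the left column before the right.

    fragments: iterable of (x, y, text) tuples (hypothetical format),
    with y assumed to grow downward from the top of the page.
    """
    midline = page_width / 2.0  # assumption: gutter is at the page center
    left = [f for f in fragments if f[0] < midline]
    right = [f for f in fragments if f[0] >= midline]

    lines = []
    for column in (left, right):
        # Top-to-bottom, then left-to-right within a line.
        column.sort(key=lambda f: (f[1], f[0]))
        # Assumption: fragments on the same line have identical y values.
        for _, row in groupby(column, key=lambda f: f[1]):
            lines.append(" ".join(text for _, _, text in row))
    return "\n".join(lines)


# Made-up fragments, deliberately interleaved the way a naive extractor
# might emit them.
sample = [
    (320, 100, "right"), (50, 100, "left"), (90, 100, "one"),
    (360, 100, "one"), (50, 112, "left"), (90, 112, "two"),
    (320, 112, "right"), (360, 112, "two"),
]
print(reorder_two_columns(sample, page_width=612))
```

On a real paper this would need more work: titles and abstracts that span the full page width would be split at the midline, and real baselines rarely match exactly, so you would want to group y values within a tolerance.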