I would like to extract text from PDFs with a two-column layout, such as ICML papers.

So far I have tried the following methods for two-column PDF text extraction on (Ubuntu) Linux:

  1. pdftotext: at the command line
  2. acroread: open and then choose "Save as text"
  3. pyPdf extractText() method: in Python
  4. pdfminer: at the command line
  5. ps2pdf > ps2txt: at the command line

Two of these can be eliminated immediately: pyPdf's extractText() emits a stream of characters with no spaces between words, and acroread requires manual intervention for each document.

Each of the remaining methods was only partially successful on the sample ICML PDFs. Failure modes included intermixing the two columns, putting the right-column text before the left-column text, and dropping the spaces between words.

pdftotext gave the most consistent results.
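One way to keep pdftotext from intermixing the columns is to crop each page to one column at a time and extract the two halves separately. The following is a minimal sketch, not a tested pipeline. It assumes the Poppler build of pdftotext (which provides the -f, -l, -x, -y, -W and -H options), US Letter pages of 612 x 792 points, and a column boundary at the exact middle of the page. The function names and the num_pages parameter are hypothetical.

```python
import subprocess


def column_crop_args(pdf_path, page, column, page_width=612, page_height=792):
    """Build a pdftotext command that extracts one column of one page.

    The crop values are in pixels at pdftotext's default 72 DPI, which
    matches PDF points. The 612 x 792 default is an assumption (US Letter).
    """
    half = page_width // 2
    if column == "left":
        x, width = 0, half
    elif column == "right":
        x, width = half, page_width - half
    else:
        raise ValueError("column must be 'left' or 'right'")
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),      # restrict to a single page
        "-x", str(x), "-y", "0",               # top-left corner of the crop
        "-W", str(width), "-H", str(page_height),
        pdf_path, "-",                         # "-" writes the text to stdout
    ]


def extract_two_column_text(pdf_path, num_pages):
    """Extract the left column, then the right column, page by page."""
    parts = []
    for page in range(1, num_pages + 1):
        for column in ("left", "right"):
            cmd = column_crop_args(pdf_path, page, column)
            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
            parts.append(result.stdout)
    return "\n".join(parts)
```

Anything that spans both columns, such as the title and abstract on the first page, would be cut in half by this approach and needs separate handling.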

Since pdftotext still leaves room for improvement, the next options I plan to try are JPDFText, Apache PDFBox, Asprise Java PDF Reader, Multivalent, and PDFTextStream.

Can anyone recommend a two-column PDF text extraction method that they have found to be reliable?

asked Dec 17 '10 at 22:56

Which One

edited Dec 17 '10 at 23:05


One Answer:

I have hacked together a basic tool that rearranges a two-column PDF file so that it is readable on a Kindle, Android phone, iPhone, or any other small-screen device that can display PDF files:

https://github.com/ogrisel/paper2ebook

It uses PDFBox. It does not do what you want, but you can look at its source code, and at the ExtractTextByArea and PrintTextLocations examples from the PDFBox library, to build what you need.
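Once text positions are available, the column reordering itself is simple. Here is a rough Python sketch of the idea, not PDFBox code. It assumes the positions have already been collected into (x, y, text) tuples, with the origin at the top-left of the page and y increasing downward, and that the column boundary sits at the middle of the page. The function name and the tuple layout are hypothetical.

```python
def reorder_two_columns(fragments, page_width):
    """Return the text of one page in reading order: left column, then right.

    fragments: list of (x, y, text) tuples, where x and y locate the start
    of each text fragment (assumed top-left origin, y increasing downward).
    """
    mid = page_width / 2.0
    left = [f for f in fragments if f[0] < mid]
    right = [f for f in fragments if f[0] >= mid]

    # Within a column, read top to bottom, then left to right on each line.
    def reading_order(fragment):
        return (fragment[1], fragment[0])

    ordered = sorted(left, key=reading_order) + sorted(right, key=reading_order)
    return " ".join(text for _, _, text in ordered)


# Fragments arrive interleaved, as they often do in extracted text.
sample = [
    (320, 100, "right1"),
    (50, 100, "left1"),
    (50, 120, "left2"),
    (320, 120, "right2"),
]
print(reorder_two_columns(sample, page_width=612))  # left1 left2 right1 right2
```

A fixed midpoint is the weakest part of this sketch: full-width elements such as titles, wide figures, and tables would be split between the two columns.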

answered Dec 18 '10 at 07:25

ogrisel

edited Dec 19 '10 at 08:22

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.