nblock's ~

Extracting tabular data from pdf files

Yesterday, I got an e-mail from a colleague asking me to convert the content of a pdf file back to text. The pdf file contained a single huge table with a few columns. Several websites offer this kind of conversion, but using them was not an option because the pdf file contained confidential data. Here is a screenshot of the pdf file:

A screenshot of the pdf file.

Convert pdf to text

pdftotext is quite handy for this task. With the -layout option, it tries to preserve the visual layout of the pdf file in the text output. Without an explicit output name, the text is written to input.txt:

pdftotext -layout input.pdf
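
As a minimal sketch, the conversion can be wrapped in a small shell function that names the output file explicitly. The function names txt_name and pdf_to_txt are made up for illustration; the only pdftotext features assumed are the -layout option and the optional second argument for the output file.

```shell
# Derive the output name the way pdftotext does by default:
# a trailing ".pdf" is replaced with ".txt".
txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Convert one pdf file, keeping the layout, and print the name
# of the text file that was written.
pdf_to_txt() {
    src=$1
    out=$(txt_name "$src")
    pdftotext -layout "$src" "$out" && printf '%s\n' "$out"
}

# Usage (requires pdftotext to be installed):
#   pdf_to_txt input.pdf
```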

Cleaning up the text file

A quick look at the text file revealed a lot of bogus empty lines as well as invalid first and last lines. Those issues can easily be fixed with sed:

sed -i -e '/^$/d' -e '1d' -e '$d' input.txt
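
To check the effect before changing the file in place, the same three sed commands can be run as a filter. The sketch below uses a made-up sample and a hypothetical helper name, clean_text. Note that the addresses 1 and $ refer to lines of the input, not to the lines left after the empty ones are removed.

```shell
# Same three commands as the in-place call, but reading stdin and
# writing stdout so the result can be inspected first.
clean_text() {
    sed -e '/^$/d' -e '1d' -e '$d'
}

# Hypothetical sample: junk first line, blank lines, junk last line.
printf '%s\n' 'junk header' '' 'Alice   42' '' 'Bob     17' 'junk footer' |
    clean_text
```

Only the two data rows remain in the output. On BSD and macOS sed, -i expects a backup suffix argument (for example -i ''), so the in-place call is written for GNU sed.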

Importing the text file into LibreOffice

LibreOffice Calc can import this text file as a table. In the import dialog, select Fixed width under the separator options and mark the column borders in the preview.

The import screen of LibreOffice.
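
What a fixed-width import does can be sketched outside LibreOffice with awk: each line is cut at given column positions. The helper split_fixed, the start columns 1 9 13 and the sample rows are all assumptions for illustration, not taken from the original pdf file.

```shell
# Split fixed-width lines into tab-separated fields.
# The arguments are 1-based start columns (hypothetical values below).
split_fixed() {
    awk -v starts="$*" '
    BEGIN { n = split(starts, s, " ") }
    {
        line = ""
        for (i = 1; i <= n; i++) {
            # A field runs to the next start column, or to the end of the line.
            len = (i < n) ? s[i + 1] - s[i] : length($0) - s[i] + 1
            field = substr($0, s[i], len)
            line = line (i > 1 ? "\t" : "") field
        }
        print line
    }'
}

printf '%s\n' 'Alice   42  Vienna' 'Bob     17  Graz' | split_fixed 1 9 13
```

The padding stays inside each field, just as it does in the imported cells.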

The rows and columns should now match your expectations. One remaining issue is the surrounding whitespace in every cell. This is easily fixed with the following search and replace pattern (enable regular expressions in the options):

  • Search: ^[:space:]*(.+?)[:space:]*$
  • Replace: $1
The Search & Replace screen of LibreOffice.
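
The same trimming can be tried on the command line. This is a rough sketch using sed with the POSIX class [[:space:]]; the helper name trim is made up. It strips each end in a separate substitution, because a single greedy group such as (.+) would also capture the trailing whitespace.

```shell
# Remove leading and trailing whitespace from every input line.
trim() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# The brackets make the remaining whitespace visible.
printf '[%s]\n' "$(printf '  Alice  \n' | trim)"   # prints [Alice]
```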

Now save the file and you're done.




tagged libreoffice, pdftotext and sed