Extracting tabular data from pdf files
written on Saturday, May 9, 2015
Yesterday, a colleague e-mailed me asking to convert the content of a pdf file back to text. The pdf file contained a single huge table with a few columns. Several websites offer this kind of conversion, but uploading the file to one of them was not an option because it contained confidential data. Here is a screenshot of the pdf file:
Convert pdf to text
pdftotext is quite handy for this task. With the -layout option, it tries to preserve the visual layout of the pdf file in the text output. If no output file is given, the text is written next to the input file with a .txt extension, here input.txt:
pdftotext -layout input.pdf
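If you have more than one file to convert, the call above can be wrapped in a small function. This is only a sketch: the function name pdf_to_text and the file name report.pdf are made up for illustration, and it assumes pdftotext is on your PATH.

```shell
# Hypothetical helper: convert one pdf file to text, keeping the layout.
# $1 = input pdf, $2 = optional output file (defaults to <input>.txt)
pdf_to_text() {
    p2t_in=$1
    p2t_out=${2:-${p2t_in%.pdf}.txt}
    # Passing the output name explicitly makes the default naming visible.
    pdftotext -layout "$p2t_in" "$p2t_out" && printf '%s\n' "$p2t_out"
}

# Usage (file name is an example):
#   pdf_to_text report.pdf
```

On success the function prints the name of the text file it wrote, which is convenient for feeding the next step.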
Cleaning up the text file
A quick look at the text file revealed a lot of bogus empty lines, as well as invalid first and last lines. Those issues are easily fixed with sed:
sed -i -e '/^$/d' -e '1d' -e '$d' input.txt
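To see what the three expressions do before touching the real file, you can try them on a throwaway sample. The sample rows below are made up, and the sketch drops -i so that sed prints to stdout instead of editing in place. Note that the addresses 1 and $ refer to the first and last line of the original input, not of the result after the empty lines are gone.

```shell
# Build a small sample that mimics raw pdftotext output (made-up data).
sample=$(mktemp)
printf '%s\n' 'Page header' '' 'Alice   42' '' 'Bob     17' 'Page footer' > "$sample"

# Same expressions as above: drop empty lines, the first line and the last line.
cleaned=$(sed -e '/^$/d' -e '1d' -e '$d' "$sample")

# Only the two data rows remain.
printf '%s\n' "$cleaned"
rm -f "$sample"
```

Once the output looks right, the same expressions can be run with -i on the real file.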
Importing the text file into LibreOffice
LibreOffice Calc can import this text file as a table. In the import dialog, select Fixed width in the separator options and mark the column borders in the preview.
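The rows and columns should now match your expectations. One remaining issue is the leading and trailing whitespace in the cells. It can be removed with the following search and replace pattern (enable regular expressions in the options). The ^ and $ anchors together with the lazy .+? keep the trailing whitespace out of the captured group; a greedy .+ would capture it and put it right back:
- Search: ^[:space:]*(.+?)[:space:]*$
- Replace: $1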
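As a side note, the effect of such a trim on a single cell value can be checked on the command line. This is a rough sed equivalent, not what LibreOffice runs internally; the function name trim_cell and the sample value are made up for illustration.

```shell
# Hypothetical helper: strip leading and trailing whitespace from one value.
trim_cell() {
    printf '%s\n' "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Whitespace inside the value is left alone; only the edges are trimmed.
trimmed=$(trim_cell '   Alice Smith  ')
printf '[%s]\n' "$trimmed"
```

Using two separate substitutions, one per edge, sidesteps the question of greedy versus lazy matching altogether.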
Now save the file and you're done.