Python – Parsing PDFs with Tika

Reading text from a PDF is quite easy with Python, provided of course that the PDF is “readable”, e.g. it was made from a word processor rather than scanned as an image.

The first thing to do is to install Tika and Java:
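The Python bindings are normally installed from PyPI with pip install tika, and they need a Java runtime available on the PATH. As a rough sketch (the helper name check_tika_prerequisites is made up for this post, it is not part of Tika), you can confirm that both pieces are visible from Python before going further:

```python
import importlib.util
import shutil


def check_tika_prerequisites():
    """Report whether the tika package and a java executable can be found.

    Hypothetical helper for illustration only; it does not start Tika.
    """
    return {
        # find_spec returns None when the top-level package is not installed.
        "tika": importlib.util.find_spec("tika") is not None,
        # shutil.which returns None when no java executable is on the PATH.
        "java": shutil.which("java") is not None,
    }


for name, found in check_tika_prerequisites().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If either line says "missing", install that piece first and run the check again.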

With both installed, the code below will print my master's thesis to the command prompt:
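A minimal sketch of that step, assuming the tika package's parser.from_file call, which returns a dictionary with "content" and "metadata" keys. The file name thesis.pdf and the helper names extract_pdf and format_metadata are placeholders, not the ones from my machine:

```python
def extract_pdf(path):
    """Parse a PDF with Tika and return (text, metadata)."""
    # Imported inside the function so this file still loads without tika.
    from tika import parser

    parsed = parser.from_file(path)
    # "content" may be None, e.g. for scanned PDFs with no text layer.
    text = parsed.get("content") or ""
    metadata = parsed.get("metadata") or {}
    return text, metadata


def format_metadata(metadata):
    """Render the metadata dictionary as sorted 'key: value' lines."""
    return "\n".join(f"{key}: {metadata[key]}" for key in sorted(metadata))


# Usage (needs tika, Java, and a real file; "thesis.pdf" is a placeholder):
#   text, metadata = extract_pdf("thesis.pdf")
#   print(text)
#   print(format_metadata(metadata))
```

The first call may take a while, because the library has to start a Tika server in the background before it can parse anything.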

As a bonus, you will get all the metadata from the file.

If you want to locate my master's thesis supervisors in the PDF, you can add the following lines to the code above:

The idea of the regex is to capture everything between the substring “Client and supervisor” and the next “\n”.
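A sketch of such a regex, assuming the marker text is exactly “Client and supervisor” and that the name follows it on the same line. The sample string and the name in it are invented for the example; in practice you would pass in the text that Tika extracted:

```python
import re

# Non-greedy group: capture everything after the marker up to the next newline.
# The marker wording is an assumption; adjust it to match your own PDF.
SUPERVISOR_PATTERN = re.compile(r"Client and supervisor(.*?)\n")


def find_supervisors(text):
    """Return the text following the marker on each matching line."""
    # Strip surrounding spaces and a separating colon, if present.
    return [match.strip(" :") for match in SUPERVISOR_PATTERN.findall(text)]


# Made-up sample text standing in for the extracted PDF content.
sample = "Title page\nClient and supervisor: Jane Doe\nOther line\n"
print(find_supervisors(sample))  # prints ['Jane Doe']
```

Because the pattern requires a trailing newline, a marker on the very last line of the text without a line break after it will not match.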

And this is what you will actually see in the console:

That is pretty much all.

Enjoy it 🙂

