Python – Parsing PDFs with Tika

Reading text from a PDF is quite easy with Python, provided the PDF is “readable”, e.g. exported from a word processor rather than scanned as an image.

The first step is to install the tika package and make sure Java is available:

pip install tika
conda install -c conda-forge tika  # alternative to pip
java -version  # prints the installed Java version; also works on older JVMs
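
Tika itself is a Java library, and the Python package launches it through the `java` executable, so it can be handy to confirm from Python that Java is reachable. A minimal sketch using only the standard library; the helper name `java_available` is made up for this example:

```python
import shutil


def java_available(executable="java"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if java_available():
        print("Java found on PATH.")
    else:
        print("Java not found; install a Java runtime before using tika.")
```

This only checks that an executable named `java` exists on the PATH; it does not verify the version.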

With that in place, the code below prints my master's thesis to the console:

import tika
from tika import parser
import re


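def main():
    tika.initVM()
    # from_file accepts a local path or a URL and returns a dict
    # with "metadata" and "content" keys.
    parsed_pdf = parser.from_file("https://www.vitoshacademy.com/wp-content/uploads/2014/01/Vitosh_K._Doynov.pdf")

    print(parsed_pdf)
    # Print each metadata key followed by its value.
    for my_key, my_value in parsed_pdf["metadata"].items():
        print(my_key)
        print(f'\t{my_value}\n')

    # The extracted text of the document.
    my_content = parsed_pdf['content']
    print(my_content)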

As a bonus, you will get all the metadata from the file.
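
The parser returns a plain dictionary, and a metadata value can be either a single string or a list of strings. The sketch below formats such a dictionary without calling Tika at all, so the sample values are invented for illustration and `format_metadata` is a hypothetical helper, not part of the tika package:

```python
def format_metadata(metadata):
    """Return one 'key: value' line per metadata entry, sorted by key."""
    lines = []
    for key in sorted(metadata):
        value = metadata[key]
        # Repeated fields may arrive as a list of strings; join them.
        if isinstance(value, list):
            value = ", ".join(str(item) for item in value)
        lines.append(f"{key}: {value}")
    return lines


# Invented sample that mimics the shape of the parser output.
sample = {
    "metadata": {
        "Content-Type": "application/pdf",
        "dc:creator": ["First Author", "Second Author"],
    },
    "content": "Some extracted text\n",
}

if __name__ == "__main__":
    for line in format_metadata(sample["metadata"]):
        print(line)
```

With a real file, `sample["metadata"]` would be replaced by `parsed_pdf["metadata"]` from the code above.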

To locate my thesis supervisors in the PDF, add the following lines to the end of main() and define the helper find_between after it:

    supervisor = find_between(my_content, "Supervisor:", '\n')
    print(supervisor)

    regex = r'Client and supervisor:(.*?)\n'
    matches = re.findall(regex, my_content)
    for match in matches:
        print(match)


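def find_between(s, first, last):
    """Return the text between the first occurrence of `first` and the next `last`."""
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        # Either marker is missing from the string.
        return ""


if __name__ == "__main__":
    main()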

The regex captures everything between the substring Client and supervisor: and the next newline (\n), so it returns the rest of each line that starts with that label.
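
To see the pattern at work without downloading anything, here is a small sketch that runs it on an invented string standing in for the extracted text. The names in `sample_content` are placeholders, not the real supervisors:

```python
import re

# Invented text standing in for parsed_pdf["content"].
sample_content = (
    "Title page\n"
    "Client and supervisor: Prof. Example One\n"
    "Client and supervisor: Dr. Example Two\n"
    "Abstract\n"
)

pattern = r"Client and supervisor:(.*?)\n"
# The non-greedy group stops at the first newline after the label;
# strip() removes the space that follows the colon.
matches = [m.strip() for m in re.findall(pattern, sample_content)]

if __name__ == "__main__":
    for match in matches:
        print(match)
```

If Tika extracts no text from a file, `content` can be `None`, so guarding with something like `parsed_pdf["content"] or ""` before calling `re.findall` avoids a `TypeError`.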

Running the script prints the supervisor lines to the console.

That is pretty much all.

Enjoy it 🙂
