Reading from a PDF is actually quite an easy task with Python, provided of course that the PDF is “readable”, e.g. exported from a word processor rather than scanned as an image.
The first thing to do is to install Tika and Java:
```
pip install tika
conda install -c conda-forge tika  # as alternative
java --version  # this one checks the installed java version in the command prompt
```
With both installed, the code below prints my master thesis to the command prompt:
```python
import re

import tika
from tika import parser


def main():
    tika.initVM()
    # from_file accepts a local path or a URL and returns a dict
    parsed_pdf = parser.from_file("https://www.vitoshacademy.com/wp-content/uploads/2014/01/Vitosh_K._Doynov.pdf")
    print(parsed_pdf)

    # Print every metadata key with its value indented below it
    for my_key, my_value in parsed_pdf["metadata"].items():
        print(f'{my_key}')
        print(f'\t{my_value}\n')

    my_content = parsed_pdf['content']
    print(my_content)


if __name__ == "__main__":
    main()
```
As a bonus, you will get all the metadata from the file.
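One caveat: for a PDF without a text layer, the 'content' entry may come back as None or empty, so it is worth guarding against that before working with the text. Below is a minimal sketch, assuming the parsed result is a dict with 'metadata' and 'content' keys as in the code above. The helper names extract_text and format_metadata and the sample dict are hypothetical, made up for illustration, so the snippet runs without Tika or Java:

```python
def extract_text(parsed):
    """Return the stripped text of a Tika-style result dict, or '' if there is none."""
    content = parsed.get("content")
    return content.strip() if content else ""


def format_metadata(parsed):
    """Render each metadata key on its own line, with the value tab-indented below it."""
    metadata = parsed.get("metadata") or {}
    return "\n".join(f"{key}\n\t{value}" for key, value in metadata.items())


# Hypothetical sample standing in for the dict returned by parser.from_file()
sample = {"metadata": {"Content-Type": "application/pdf"}, "content": None}

print(repr(extract_text(sample)))
print(format_metadata(sample))
```

With a real file, you would pass the dict returned by parser.from_file() instead of the sample.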
Using the code above, if you want to locate my master thesis supervisors from the PDF, you may add the following lines:
```python
    # Add these lines at the end of main(), after my_content is assigned
    supervisor = find_between(my_content, "Supervisor:", '\n')
    print(supervisor)

    regex = r'Client and supervisor:(.*?)\n'
    matches = re.findall(regex, my_content)
    for match in matches:
        print(match)


# Module-level helper; define it before main() is called
def find_between(s, first, last):
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        # Either marker is missing from the text
        return ""
```
The idea of the regex is to capture everything between the substring "Client and supervisor:" and the next newline (\n).
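To see how the non-greedy group behaves without parsing a PDF, here is a small self-contained sketch on a made-up text snippet. The names in sample_text are placeholders, not the actual contents of the thesis:

```python
import re

# Made-up sample text; the real thesis content will differ
sample_text = (
    "Title: Some thesis\n"
    "Client and supervisor: Prof. A. Example\n"
    "Client and supervisor: Dr. B. Placeholder\n"
)

regex = r'Client and supervisor:(.*?)\n'

# (.*?) lazily captures everything after the label up to the first newline;
# strip() drops the space that follows the colon
matches = [m.strip() for m in re.findall(regex, sample_text)]
print(matches)
```

Note that re.findall returns every occurrence, while find_between stops at the first one it finds.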
Running the script then prints the matched supervisor lines to the console.
Pretty much that is all.
Enjoy it 🙂