Scraping a web site up to the N-th level with Python

Scraping data from a web site has always been a pleasure with Python. If you come from the VBA world, BeautifulSoup4 is indeed beautiful! I had scraped a web site with Python before, but now that I have started to take a closer look at Python, I have decided to write it again, and hopefully better, introducing some OOP into it.

So, what is the idea?

Pretty much, there is a web site which serves as a starting point. It is "crawled" and the links on it are collected. These links are then written to a set as tuples, each containing the link itself and the web page on which it was found. This is the 0-th level of the task. A set of urls and their parents looks like this:
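The original illustration is not included here, but a hypothetical level-0 set might look like this (the urls are made up for the example):

```python
# Each tuple pairs a discovered url (left) with the page
# on which it was found, i.e. its parent (right).
links = {
    ("https://example.com/about", "https://example.com"),
    ("https://example.com/blog", "https://example.com"),
    ("https://example.com/contact", "https://example.com"),
}
```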

Crawling Class

The left part of every tuple is the url to be visited and the right part is its parent. On the next level, the set produced by the 0-th level already contains data, so a new crawl is carried out for every entry in it, until the initial set has been crawled completely. Then the data is written back to the set. This logic lives in scrape_site(self). It is probably easier to grasp when the whole class is presented:
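The original code embed is missing here, so below is a minimal sketch of how such a class could look, using urllib from the standard library for fetching and BeautifulSoup4 for parsing. The method names scrape_site, get_links and print_if_needed follow the text; everything else (attribute names, error handling details) is my assumption, not the original code:

```python
from urllib.request import urlopen
from urllib.parse import urljoin

from bs4 import BeautifulSoup


class SiteCrawler:
    """Crawls a start url down to N levels, collecting (url, parent) tuples."""

    def __init__(self, start_url, levels):
        self.start_url = start_url
        self.levels = levels
        self.links = set()        # tuples of (url, parent)
        self.crawled_count = 0

    def get_links(self, url):
        """Crawl a single url and return a set of (child_url, url) tuples."""
        found = set()
        try:
            html = urlopen(url, timeout=10).read()
        except Exception:
            # Many sites do not allow crawling - simply skip them.
            return found
        soup = BeautifulSoup(html, "html.parser")
        for anchor in soup.find_all("a", href=True):
            found.add((urljoin(url, anchor["href"]), url))
        self.crawled_count += 1
        self.print_if_needed(url)
        return found

    def print_if_needed(self, text_to_print):
        """Print only every 10th crawled site, to keep the output short."""
        if self.crawled_count % 10 == 0:
            print(text_to_print)

    def scrape_site(self):
        """Crawl level by level; each pass visits every url gathered so far."""
        self.links = self.get_links(self.start_url)      # level 0
        for _ in range(self.levels):
            new_links = set()
            for url, _parent in self.links:
                new_links |= self.get_links(url)
            self.links |= new_links                      # write back to the set
        return self.links
```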

get_links(self, url) is the method which crawls a url and returns a set of (url, parent) tuples, i.e. links. Since many sites do not allow crawling, an exception handler is implemented. And in order not to print every site being crawled, print_if_needed(self, text_to_print) is implemented, printing only every 10th site.

Database Class

Writing to the DB is also kept in a separate class. Separate methods for writing to the server, creating a table and dropping a table are implemented. Interestingly, the cursor and the connection to the database are created on initialization of the class and are used until the end of the program, never leaving the class. This is actually one of the reasons OOP exists: to separate concerns and provide better scoping. This is how the class looks:
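The original code embed is missing here as well. The post does not say which database is used, so this is a sketch against sqlite3 from the standard library; the connection and cursor are created once in __init__ and never leave the class, as described above. The class, method and table names are my assumption:

```python
import sqlite3


class DbWriter:
    """Owns a single connection and cursor for the life of the program."""

    def __init__(self, db_path="links.db"):
        # Created once on initialization, reused by all methods below,
        # and never handed out to the rest of the program.
        self.connection = sqlite3.connect(db_path)
        self.cursor = self.connection.cursor()

    def create_table(self):
        self.cursor.execute(
            "CREATE TABLE IF NOT EXISTS links (url TEXT, parent TEXT)"
        )
        self.connection.commit()

    def drop_table(self):
        self.cursor.execute("DROP TABLE IF EXISTS links")
        self.connection.commit()

    def write_links(self, links):
        """Insert a set of (url, parent) tuples into the table."""
        self.cursor.executemany("INSERT INTO links VALUES (?, ?)", links)
        self.connection.commit()

    def close(self):
        self.connection.close()
```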

Main Class

At the end, having the two main classes, the main program becomes really simple. In about 13 lines, the main procedure crawls the site we pass to it and writes the result to a database. In general, make sure you ask for only 0 or 1 levels, because anything more would take quite a long time. Here is the code:
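The original gist is not included here; a minimal sketch of the glue code could look like this, assuming hypothetical crawler and writer objects with scrape_site, create_table, write_links and close methods (the names are my assumption, not the original API):

```python
def main(crawler, writer):
    """Crawl the site, then persist the (url, parent) tuples to the DB."""
    links = crawler.scrape_site()   # returns a set of (url, parent) tuples
    writer.create_table()
    writer.write_links(links)
    writer.close()
```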

I hope you enjoy it! If you have ideas to improve it, feel free to submit a pull request on GitHub.
