Scraping data from a web site has always been a pleasure with Python. If you come from the VBA world, BeautifulSoup4 is indeed beautiful! I have scraped a web site with Python before, but now that I have started to take a closer look at the language, I have decided to write it again, and hopefully better, introducing some OOP into it.
So, what is the idea?
Pretty much, there is a web site which serves as a starting point. It is “crawled” and the links on it are collected. These links are then written to a set as tuples, each holding the link itself and the page on which it was found. This is the 0th level of the task. A set of urls and their parents looks like this:
```python
{('https://www.vitoshacademy.com/tag/lists/', 'https://www.vitoshacademy.com/2014/03/'),
 ('https://help.github.com/articles/which-remote-url-should-i-use', 'https://github.com/vitoshacademy/02.PrimitiveDataTypesAndVariables'),
 ('https://www.vitoshacademy.com/book-review-beginning-sql-queries-from-novice-to-professional/', 'https://www.vitoshacademy.com/category/review/'),
 ('https://www.vitoshacademy.com/tag/icpc/', 'https://www.vitoshacademy.com/category/c-sharp-tricks/')}
```
Crawling Class
The left part of every tuple is the url to be visited and the right part is its parent. On the next level, the set built during the 0th level is already filled with data, so every url in it is crawled in turn, until the whole set has been processed. The newly found links are then written back to the set. This logic lives in scrape_site(self). It is probably easier to grasp when the whole class is presented:
```python
import urllib.request

from bs4 import BeautifulSoup


class SiteCrawler():

    def __init__(self, url, level):
        self.links = set()
        self.initial_url = url
        self.levels = level
        self.counter_frequency = 10
        self.counter = 0

    def scrape_site(self):
        # Level 0 - the links found on the starting page itself.
        links_previous_level = self.get_links(self.initial_url)
        self.links = links_previous_level.copy()

        # Every further level crawls the urls found on the previous one.
        for i in range(self.levels):
            links_this_level = set()
            for link in links_previous_level:
                link_to_visit = link[0]
                self.counter = self.counter + 1
                self.print_if_needed(link)
                new_set = self.get_links(link_to_visit)
                links_this_level = links_this_level.union(new_set)
            links_previous_level = links_this_level.copy()
            self.links = self.links.union(links_this_level)

    def get_links(self, url):
        try:
            resp = urllib.request.urlopen(url)
            soup = BeautifulSoup(resp,
                                 from_encoding=resp.info().get_param('charset'),
                                 features="html.parser")
            hrefs = soup.find_all('a', href=True)
            links = set()
            for href in hrefs:
                href_to_append = href['href']
                if href_to_append[:4] == "http":
                    tuple_to_append = (href_to_append, url)
                    if tuple_to_append not in self.links:
                        links.add(tuple_to_append)
            return links
        except Exception:
            print("NOT ALLOWED -> " + url)
            # Return a set holding a single (error, url) tuple, so the
            # caller can union it like any other result.
            return {("Scraping problem - " + url, url)}

    def print_if_needed(self, text_to_print):
        # Print only every counter_frequency-th link, to keep the output short.
        if self.counter % self.counter_frequency == 0:
            print(text_to_print)
```
get_links(self, url) is the method which crawls a url and returns a set of (url, parent) tuples – the links. Since many sites do not allow crawling, an exception handler is implemented. And in order not to print every site being crawled, print_if_needed(self, text_to_print) prints only every 10th one.
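If you want to try the crawler on its own, a minimal sketch could look like this (assuming the class above is saved as crawler.py, as the import in the main procedure below suggests):

```python
# Standalone check of SiteCrawler - a sketch, not part of the original program.
from crawler import SiteCrawler

site_crawler = SiteCrawler("https://vitoshacademy.com", 0)
site_crawler.scrape_site()

# Show a few of the collected (url, parent) tuples.
for pair in list(site_crawler.links)[:5]:
    print(pair)
```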
Database Class
Writing to the DB lives in a separate class as well. Separate methods for writing to the database, creating a table and dropping a table are implemented. Interestingly, the cursor and the connection to the database are created on initialization of the class and are used inside the class until the end of the program, never leaving its scope. This is actually one of the reasons why OOP exists – to separate concerns and provide better scoping. This is how the class looks:
```python
import sqlite3


class SqlServer():

    def __init__(self, db_location, table_name, column_name_1, column_name_2):
        # Connection and cursor live with the instance for the whole run.
        self.connection = sqlite3.connect(db_location)
        self.cursor = self.connection.cursor()
        self.table_name = table_name
        self.column_name_1 = column_name_1
        self.column_name_2 = column_name_2

    def write_to_sql_server(self, links):
        # The values are parameterized; only the identifiers are formatted in.
        sql_command = """INSERT INTO %s (%s, %s) VALUES (?, ?)""" \
            % (self.table_name, self.column_name_1, self.column_name_2)
        self.cursor.executemany(sql_command, links)
        self.connection.commit()

    def create_table(self):
        sql_command = """
        CREATE TABLE %s (
            Id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
            %s TEXT,
            %s TEXT
        )""" % (self.table_name, self.column_name_1, self.column_name_2)
        self.cursor.execute(sql_command)

    def drop_table_if_exists(self):
        sql = "DROP TABLE IF EXISTS %s" % self.table_name
        self.cursor.executescript(sql)
        self.connection.commit()
```
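The class can be exercised on its own as well. A minimal sketch, using an in-memory SQLite database (the ":memory:" location is my assumption, just to keep the example self-contained):

```python
# Quick standalone test of SqlServer - a sketch with an in-memory database.
from server import SqlServer

server = SqlServer(":memory:", "urls", "Address", "Parent")
server.drop_table_if_exists()
server.create_table()
server.write_to_sql_server({("https://example.com", "https://example.com/parent")})

# The cursor is an attribute of the instance, so the rows can be read back:
server.cursor.execute("SELECT Id, Address, Parent FROM urls")
print(server.cursor.fetchall())
```

One thing to keep in mind about the design: the table and column names are injected with %s formatting, because identifiers cannot be parameterized in sqlite3, so they should only ever come from trusted code, never from user input.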
Main Procedure
At the end, having the two main classes, the program itself becomes really simple. In about 13 lines the main procedure crawls the site we pass to it and writes the results to a database. In general, make sure you pass 0 or 1 for the levels, as going for more would take quite a long time. Here is the code:
```python
from server import SqlServer
from crawler import SiteCrawler


def main():
    db_location = r"C:\Users\vitos\Desktop\db\my.db"
    table_name = "urls"
    column_name_1 = "Address"
    column_name_2 = "Parent"

    # Prepare a clean table for the results.
    sql_server = SqlServer(db_location, table_name, column_name_1, column_name_2)
    sql_server.drop_table_if_exists()
    sql_server.create_table()

    # Crawl the site and write the collected links to the database.
    url = "https://vitoshacademy.com"
    levels = 0
    site_crawler = SiteCrawler(url, levels)
    site_crawler.scrape_site()
    sql_server.write_to_sql_server(site_crawler.links)


if __name__ == "__main__":
    main()
```
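Once main() has finished, the results can be checked directly with sqlite3. A minimal sketch, reusing the same database location and table name as above:

```python
# Verify the stored links - a sketch reusing the db path from main().
import sqlite3

connection = sqlite3.connect(r"C:\Users\vitos\Desktop\db\my.db")
cursor = connection.cursor()
cursor.execute("SELECT COUNT(*) FROM urls")
print("Stored links:", cursor.fetchone()[0])
connection.close()
```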
I hope you enjoy it! If you have ideas to improve it, feel free to submit a pull request on GitHub.