Scraping a web site with BeautifulSoup 4 and Python is a walk in the park, especially if you have scraped a web site with VBA before. Still, there are a few tricks that should be taken into account.
Trick #1 – make sure that you get the correct encoding in Python and BS4:
def get_links(url):
    resp = urllib.request.urlopen(url)
    soup = BeautifulSoup(resp,
                         from_encoding=resp.info().get_param('charset'),
                         features="html.parser")
    hrefs = soup.find_all('a', href=True)
    links = []
    for href in hrefs:
        links.append(tuple([href['href'],]))
    return links
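The charset passed to from_encoding comes from the HTTP Content-Type header: resp.info() returns an email.message.Message-like object, and get_param('charset') parses that header. A minimal, stdlib-only sketch of that mechanism (the header value below is made up for illustration):

```python
from email.message import Message

# urllib's resp.info() returns an email.message.Message-like object;
# get_param() reads from the Content-Type header by default.
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'

print(headers.get_param('charset'))  # ISO-8859-1
```

If the server sends no charset at all, get_param returns None, and BeautifulSoup falls back to its own encoding detection.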
Trick #2 – if a single column is to be inserted in the DB, make sure that you pass each row as a one-element tuple in Python:
links.append(tuple([href['href'],]))
And we need that many parentheses because tuple([href['href'],]) first builds a one-element list and then converts it to a tuple. The trailing comma is the usual way to mark a one-element tuple in Python – the shorter, idiomatic form is (href['href'],) – without the comma, (href['href']) would be just a string in parentheses, not a tuple.
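To see why the one-element tuple matters, here is a small self-contained sketch with an in-memory SQLite database (the table and values here are illustrative, not the ones from the article):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE urls (Address TEXT)")

# Each row passed to executemany() must be a sequence of bind values.
# For a single column, that means a one-element tuple per row.
links = [("https://vitoshacademy.com/",), ("/about/",)]
cur.executemany("INSERT INTO urls (Address) VALUES (?)", links)
conn.commit()

print(cur.execute("SELECT COUNT(*) FROM urls").fetchone()[0])  # 2

# Passing bare strings instead would fail: a string is itself a sequence,
# so sqlite3 tries to bind every character as a separate parameter.
try:
    cur.executemany("INSERT INTO urls (Address) VALUES (?)", ["/contact/"])
except sqlite3.ProgrammingError as exc:
    print("bare string fails:", exc)
```

The same reasoning applies to a single cur.execute() call: the parameters argument must be a tuple (or list), even when there is only one placeholder.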
Trick #3 – use "_" in the variable names (snake_case), because this is the naming convention the Python people follow (PEP 8)… Not the whole world is .NET or VBA, unfortunately.
Anyway, the result of the task looks like this:
And the code is here:
import sqlite3
import urllib.request
from bs4 import BeautifulSoup


def main():
    conn = sqlite3.connect(r"C:\Users\vitos\Desktop\db\my.db")
    cur = conn.cursor()
    table_name = "urls"
    column_name = "Address"
    drop_table_if_exists(table_name, cur, conn)
    sql_command = """
    CREATE TABLE %s (
        Id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
        %s TEXT
    )""" % (table_name, column_name)
    cur.execute(sql_command)
    url = "https://vitoshacademy.com"
    links = get_links(url)
    sql_command = "INSERT INTO %s (%s) VALUES (?)" % (table_name, column_name)
    cur.executemany(sql_command, links)
    conn.commit()
    conn.close()


def get_links(url):
    resp = urllib.request.urlopen(url)
    soup = BeautifulSoup(resp,
                         from_encoding=resp.info().get_param('charset'),
                         features="html.parser")
    hrefs = soup.find_all('a', href=True)
    links = []
    for href in hrefs:
        links.append(tuple([href['href'],]))
    return links


def drop_table_if_exists(table_name, cur, conn):
    sql = "DROP TABLE IF EXISTS %s" % table_name
    cur.executescript(sql)
    conn.commit()


if __name__ == "__main__":
    main()
Enjoy it!