Scraping a web site and writing its links to a database with Python

Scraping a web site with BeautifulSoup 4 and Python is a walk in the park if you have scraped a web site with VBA before. Still, there are a few tricks that should be taken into account.

Trick #1 – make sure that you get the correct encoding in Python and BS4:
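A minimal sketch of how this can look, assuming the page arrives as raw bytes in a known encoding. The sample markup, the windows-1251 encoding and the variable names are made up for illustration; the real scraper would take the bytes from the HTTP response.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, encoded the way a server might send it.
raw_html = '<html><body><a href="/за-нас">За нас</a></body></html>'.encode("windows-1251")

# from_encoding tells BS4 how to decode the bytes instead of letting it guess.
soup = BeautifulSoup(raw_html, "html.parser", from_encoding="windows-1251")

link_text = soup.a.get_text()
link_href = soup.a["href"]

# original_encoding reports which encoding BS4 actually used for decoding.
print(soup.original_encoding, link_text, link_href)
```

If the text comes out as garbage characters, the encoding passed here (or the one BS4 guessed) is the first thing to check.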

Trick #2 – if it is one column to be inserted in the DB, make sure that you pass a tuple in Python:

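And we need so many parentheses because a tuple with a single member must be written with a trailing comma. ('a', 'href') is a tuple with two members. In our case we write (href['href'],), where the comma indicates that it is a tuple even though the second member is not there. Without the comma, (href['href']) is just the string itself in parentheses. The same tuples can also be built from lists, as tuple(['a', 'href']) and tuple([href['href']]).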
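Here is a small sketch of the one-column insert. It assumes an in-memory SQLite database through the standard sqlite3 module; the table name, the column name and the sample links are hypothetical, and the database used for the original task may well be a different one.

```python
import sqlite3

# Hypothetical links; in the scraper these would be the href values found by BS4.
links = ["https://example.com/a", "https://example.com/b"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (url TEXT)")

for url in links:
    # (url,) is a one-element tuple. Passing the bare string would make sqlite3
    # treat every character as a separate parameter and raise an error.
    conn.execute("INSERT INTO links (url) VALUES (?)", (url,))

conn.commit()
rows = conn.execute("SELECT url FROM links ORDER BY url").fetchall()
print(rows)
```

Note that every row comes back as a one-element tuple as well, so the same comma shows up in the results.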

Trick #3 – use “_” in the variable names (for example link_text rather than linkText), because this is how the Python people write… Not the whole world is .NET or VBA, unfortunately.

Anyway the result of the task looks like this:

And the code is here:

Enjoy it!
