Python – Crawling the top 50 sites from Bulgaria for their servers

Have you ever wondered what servers the top 50 most visited Bulgarian sites run on? That is definitely an interesting question, if you are a Bulgarian IT admin with plenty of free time 🙂

Anyway, even though I am neither of these, I decided to make a small web crawler that goes around these sites and politely collects the type of their servers. With Python and the requests library this was not that difficult.


So, what did I do? I googled “top bg internet sites” (if some day yandex.ru or bing.com offer me some funding, I will say “I yandexed” or “I binged” instead of “I googled”). That led me to a pretty good list in *.csv format from SimilarWebRanks. The format was easy to work with, so I started writing my Python crawler.

The crawler consists of two parts: extracting the web sites from the csv file and checking their servers. Simple as that. I have even marked these two parts in the code with the comments “First part” and “Second part”. Once the crawler has finished, we have a dictionary with the servers. Since this was not pretty enough, I turned the dictionary into a pie chart, so you get a better idea of the top servers. This took me a little more time, because I wanted to be able to generate a tuple for the explode section of the pie, but in the end everything fell into place. So, here is the beautiful pie chart:
[Figure: result_servers, the pie chart of the servers found by the crawler]
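
The explode tuple mentioned above can be generated from the dictionary in a few lines. This is only a minimal sketch under my own assumptions: that the chart is drawn with matplotlib (its `pie` function takes an `explode` argument), that the largest slice is the one pulled out, and that the names `explode_for`, `draw_pie` and the output file name are made up for illustration.

```python
def explode_for(counts, offset=0.1):
    """Return labels, sizes and an explode tuple for a {server: count} dict.

    Assumption: only the largest slice(s) are pulled out of the pie.
    """
    labels = list(counts)
    sizes = [counts[label] for label in labels]
    if not sizes:
        return [], [], ()
    biggest = max(sizes)
    # One entry per slice: `offset` for the biggest slice, 0.0 for the rest.
    explode = tuple(offset if size == biggest else 0.0 for size in sizes)
    return labels, sizes, explode


def draw_pie(counts, path="result_servers.png"):
    """Save the pie chart to `path` (a hypothetical file name)."""
    # Imported here so explode_for can be tried without matplotlib installed.
    import matplotlib.pyplot as plt

    labels, sizes, explode = explode_for(counts)
    fig, ax = plt.subplots()
    ax.pie(sizes, explode=explode, labels=labels, autopct="%1.1f%%", startangle=90)
    ax.axis("equal")  # keep the pie circular
    fig.savefig(path)
    plt.close(fig)
```

Calling `draw_pie(servers)` with the dictionary from the crawler would then write the image; building the tuple in its own function keeps it the same length as the list of slices, which `pie` requires.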

If I were writing a master's thesis, I would probably have commented on the chart as well, but that is not the case, so the comments are left to you! 🙂

Last but not least, here comes the code:
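
What follows is a rough sketch of the two parts described above, not necessarily the exact listing. It assumes that the csv has the site's domain in its first column with no header row, and that the "type of server" is whatever the site reports in its `Server` response header. The function names (`read_sites`, `server_of`, `count_servers`, `crawl`) are made up for illustration.

```python
import csv
from collections import Counter


# First part: extract the web sites from the csv file.
def read_sites(csv_file, limit=50):
    """Return up to `limit` domains taken from the first column."""
    sites = []
    for row in csv.reader(csv_file):
        if not row or not row[0].strip():
            continue  # skip blank lines
        sites.append(row[0].strip())
        if len(sites) == limit:
            break
    return sites


# Second part: check the server of each site.
def server_of(site, timeout=5):
    """Return the Server header of `site`, or a placeholder if it is missing."""
    # Imported here so the csv part can be tried without requests installed.
    import requests

    try:
        # HEAD asks only for the headers, which keeps the crawler polite.
        response = requests.head(
            "http://" + site, timeout=timeout, allow_redirects=True
        )
    except requests.RequestException:
        return "unreachable"
    return response.headers.get("Server", "unknown")


def count_servers(sites, lookup=server_of):
    """Build the {server: number of sites} dictionary."""
    return dict(Counter(lookup(site) for site in sites))


def crawl(path):
    """Run both parts on the csv file at `path`."""
    with open(path, newline="", encoding="utf-8") as csv_file:
        return count_servers(read_sites(csv_file))
```

Passing the lookup function as a parameter makes it possible to try the counting with a fake lookup before touching any real site. Keep in mind that some sites hide or rewrite their `Server` header, so the dictionary shows only what the sites choose to report.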

Enjoy it in my GitHub repo as well! 🙂
