Have you ever wondered on what servers the top 50 most visited Bulgarian sites run? That is definitely an interesting question, if you are a Bulgarian IT admin and have plenty of free time 🙂
Anyway, since I am neither of these, I decided to make a small web crawler that goes around these sites and politely asks for the type of their servers. With Python and the requests library this was not that difficult.
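The core trick is just one HTTP GET per site and a look at the Server response header. A minimal sketch (the address here is only an example, not one of the fifty sites):

import requests

r = requests.get("http://www.example.com", timeout=2)
# e.g. "nginx/1.18.0" or "Apache", or None if the header is missing
print(r.headers.get("Server"))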
So, what did I do? I googled “top bg internet sites” (if some day yandex.ru or bing.com offer some financing, I will use “I yandexed” or “I binged” instead of “I googled”). Then I found a pretty good list in *.csv format from SimilarWebRanks. The format was easy to work with, and I started to create my Python crawler.
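For reference, the crawler below assumes that the second comma-separated field of every row is the bare domain, which it then turns into a full address. The row shown here is hypothetical; the real SimilarWebRanks columns may differ:

# Hypothetical csv row - the real SimilarWebRanks export may have other columns
line = "1,abv.bg,News and Media"
domain = line.split(",")[1]      # -> "abv.bg"
url = "http://www." + domain     # -> "http://www.abv.bg"
print(url)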
The crawler consists of two parts – extracting the web sites from the csv file and checking their servers. Simple as that. I have even marked these two parts in the code with the comments “First part” and “Second part”. Once the crawler has run, we have a dictionary with the servers. As that was not pretty enough, I have turned the dictionary into a pie chart, so you get a better idea about the top servers. This took me a little more time, because I wanted to be able to generate a tuple for the explode section of the pie (there is a small sketch of that below), but finally everything fell into place. So, here is the beautiful pie chart:
If I was writing a master thesis, I would have probably commented on the chart as well, but this is not the fact, so the comments are left for you! 🙂
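A quick word on the explode tuple mentioned above, since it is the only slightly unusual part of the chart. It needs one offset per slice: 0 leaves a slice in place, a positive value pulls it out of the pie. A minimal sketch with made-up counts (not the real survey data):

import matplotlib.pyplot as plt

# Made-up server counts, only to show what the explode tuple does
servers = {"apache": 30, "nginx": 15, "other": 5}
# One offset per slice: 0.2 pulls the first two slices out, 0 leaves the rest in place
explode = tuple(0.2 if i < 2 else 0 for i in range(len(servers)))
plt.pie(list(servers.values()), labels=list(servers.keys()),
        explode=explode, autopct='%1.1f%%', startangle=20)
plt.axis('equal')
plt.show()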
Last but not least, here comes the code:
import requests
import matplotlib.pyplot as plt


class SomeStrings:
    myHeaders = {}
    ua1 = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    myHeaders["User-Agent"] = ua1
    servApache = "apache"
    servNginx = "nginx"
    csvName = "Top 50 BG Websites.csv"


class TopSitesServer:
    def __init__(self):
        self.topSites = []
        self.serverDict = {}

    def generate_dictionary(self):
        # First part of the crawler - extract the site addresses from the csv file
        with open(SomeStrings.csvName, 'r') as mySource:
            for line in mySource:
                myAddress = line.split(",")
                myWWW = "http://www." + myAddress[1].strip()
                self.topSites.append(myWWW)

        # Second part of the crawler - ask every site for its "Server" header
        for site in self.topSites:
            try:
                r = requests.get(
                    site, headers=SomeStrings.myHeaders, timeout=2)
                serverType = r.headers["server"]
                # Normalize the many Apache/nginx version strings to a single label
                if SomeStrings.servApache in serverType.lower():
                    serverType = SomeStrings.servApache
                elif SomeStrings.servNginx in serverType.lower():
                    serverType = SomeStrings.servNginx
                if serverType not in self.serverDict:
                    self.serverDict[serverType] = 1
                else:
                    self.serverDict[serverType] += 1
            except Exception:
                # Sites that time out or send no "Server" header are skipped
                pass

    def printChart(self, values=2):
        myDict = self.serverDict
        myLabels = list(myDict.keys())
        mySizes = list(myDict.values())
        if values > len(myLabels):
            values = len(myLabels)
        # Explode (pull out) the first "values" slices of the pie
        explode = tuple(
            [0 if i > values else 0.2 for i in range(len(myLabels))])
        myColors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral',
                    'green', 'blue', 'yellow', 'silver', 'whitesmoke',
                    'navajowhite', 'ivory', 'mintcream', 'linen', 'tan',
                    'sienna', 'c', 'g']
        plt.pie([float(v) for v in mySizes], labels=myLabels, explode=explode,
                colors=myColors, autopct='%1.1f%%', startangle=20)
        plt.axis('equal')
        plt.show()


checkServers = TopSitesServer()
checkServers.generate_dictionary()
checkServers.printChart(50000)