VBA – Data scraping from Internet with Excel – Part 2

Some time ago I wrote an article for Data scraping from Internet with Excel , which was scraping book information from amazon.com, based on a given word. Thus, for a keyword as “VBA”, this is the data, printed in the immediate window:

The code was 25 lines, thus it had some points for improvement. Thus, I have decided to make something bigger out of it with some additional options:

  • writing it to Excel, instead of printing to the immediate window
  • getting the prices of the books
  • scraping multiple titles
  • creating some user interface and reporting
  • analyzing the data

So, at the end, my Excel file was with 3 sheets – RawData, Summary and Input. This is how Input looks, after running the code:

The code starts with deleting anything in the input table, except the column A. Then it gets all the data per keyword. At the end it counts the entries and gives some information about the total count of entries per keyword, their max price and their average prices. Initially the scraped data is collected in the RawData worksheet. It looks like this:

As you see, the data is really unstructured, but there could be a pattern somehow, if you take a good look of multiple entries. E.g., the first row with data is usually the title of the book and the second is the author. Concerning the price, it is a bit more tricky, but I have decided to come up with a rule, stating that the first string between the first $-sign and the first space after it should be the price. If the digit is not numeric, I return empty string.

The last part of the code is building a Summary worksheet, thus the data becomes somehow useable. This is how the summary part looks like:

With this data, writing the Max(), Count() and Average() formulas becomes really a piece of cake for a VBA Developer.

So, the code is split into five modules, with no classes:

  • StartUp
  • AmazonInternet
  • ConstValues
  • ExcelRelated
  • General

Here is the code of each module with some description.

StartUp

This module has the Main sub, which is called by the Run button. This is how it looks like:

My idea about every Main function or sub, is that it should be like a story, which could be told to someone, without using much efforts. I hope I have managed to tell the story with this one.

AmazonInternet

This module has 3 functions. The first one, PageWithResultsExists checks whether after clicking the N-th page of Amazon.com with our results, there are some results left. MakeUrl is the one that makes the URL with a keyword and i. Navigate goes to the given URL:

ConstValues

Rather small one, with 3 lines only. 2 Contant values and one public variable. Public variables are bad, but today I was lazy, thus I have decided to leave them:

ExcelRelated

Lots of Excel Related functions. From the good old lastRow, which returns the last used row of a given worksheet to the self-explanatory WriteToExcel and RawDataToStructured.

General

The general module consists of some general subs, which I have decided to copy from my VBA repo. The subs there are created and tested by me multiple times and I feel quite secure to use them:

Pretty much this is all. The GitHub repo of the project is here – https://github.com/Vitosh/VBA_personal/tree/master/DataScrapingFromInternet.

The Excel file is in the GitHub repo. As the Excel file is with macros enabled, it is probably a good idea not to download it.

Cheers and Merry Christmas!

Tagged with: , , ,