VBA – Data scraping from Internet with Excel – Part 2
Some time ago I wrote an article for Data scraping from Internet with Excel , which was scraping book information from amazon.com, based on a given word. Thus, for a keyword as “VBA”, this is the data, printed in the immediate window:
The code was 25 lines, thus it had some points for improvement. Thus, I have decided to make something bigger out of it with some additional options:
- writing it to Excel, instead of printing to the immediate window
- getting the prices of the books
- scraping multiple titles
- creating some user interface and reporting
- analyzing the data
So, at the end, my Excel file was with 3 sheets – RawData, Summary and Input. This is how Input looks, after running the code:

The code starts with deleting anything in the input table, except the column A. Then it gets all the data per keyword. At the end it counts the entries and gives some information about the total count of entries per keyword, their max price and their average prices. Initially the scraped data is collected in the RawData worksheet. It looks like this:

As you see, the data is really unstructured, but there could be a pattern somehow, if you take a good look of multiple entries. E.g., the first row with data is usually the title of the book and the second is the author. Concerning the price, it is a bit more tricky, but I have decided to come up with a rule, stating that the first string between the first $-sign and the first space after it should be the price. If the digit is not numeric, I return empty string.
The last part of the code is building a Summary worksheet, thus the data becomes somehow useable. This is how the summary part looks like:

With this data, writing the Max(), Count() and Average() formulas becomes really a piece of cake for a VBA Developer.
So, the code is split into five modules, with no classes:
- StartUp
- AmazonInternet
- ConstValues
- ExcelRelated
- General
Here is the code of each module with some description.
StartUp
This module has the Main sub, which is called by the Run button. This is how it looks like:
Public Sub Main()
If IN_PRODUCTION Then On Error GoTo Main_Error
CleanWorksheets
Dim keyword As String: keyword = GetNextKeyWord
While keyword <> ""
Dim appIE As Object
Set appIE = CreateObject("InternetExplorer.Application")
LogMe keyword
Dim nextPageExists As Boolean: nextPageExists = True
Dim i As Long: i = 1
Dim firstRow As Long: firstRow = lastRow(tblRawData.Name) + 1
While nextPageExists
WaitSomeMilliseconds
Navigate i, appIE, keyword
nextPageExists = PageWithResultsExists(appIE, keyword)
If nextPageExists Then WriteToExcel appIE, keyword
i = i + 1
Wend
LogMe Time, keyword, "RawDataToStructured"
RawDataToStructured keyword, firstRow
keyword = GetNextKeyWord
WaitSomeMilliseconds 4000
appIE.Quit
Wend
FixWorksheets
WriteFormulas
LogMe "Program has ended!"
On Error GoTo 0
Exit Sub
Main_Error:
MsgBox "Error " & Err.Number & " (" & Err.Description & ") in procedure Main, line " & Erl & "."
End Sub
My idea about every Main function or sub, is that it should be like a story, which could be told to someone, without using much efforts. I hope I have managed to tell the story with this one.
AmazonInternet
This module has 3 functions. The first one, PageWithResultsExists checks whether after clicking the N-th page of Amazon.com with our results, there are some results left. MakeUrl is the one that makes the URL with a keyword and i. Navigate goes to the given URL:
Option Explicit
Public Function PageWithResultsExists(appIE As Object, keyword As String) As Boolean
On Error GoTo PageWithResultsExists_Error
Dim allData As Object
Set allData = appIE.document.getElementById("s-results-list-atf")
PageWithResultsExists = True
IeErrors = 0
On Error GoTo 0
Exit Function
PageWithResultsExists_Error:
WaitSomeMilliseconds
IeErrors = IeErrors + 1
Select Case Err.Number
Case 424
If IeErrors > MAX_IE_ERRORS Then
PageWithResultsExists = False
IeErrors = 0
Else
LogMe "PageWithResultsExists", IeErrors, keyword, IeErrors
PageWithResultsExists appIE, keyword
End If
Case Else
Err.Raise Err.Number, Err.Source, Err.Description, Err.HelpFile, Err.HelpContext
End Select
End Function
Public Function MakeUrl(i As Long, keyword As String) As String
MakeUrl = "https://www.amazon.com/s/ref=sr_pg_" & i & "?rh=i%3Aaps%2Ck%3A" & keyword & "&page=" & i & "&keywords=" & keyword
End Function
Public Sub Navigate(i As Long, appIE As Object, keyword As String)
Do While appIE.Busy
DoEvents
Loop
With appIE
.Navigate MakeUrl(i, keyword)
.Visible = False
End With
Do While appIE.Busy
DoEvents
Loop
End Sub
ConstValues
Rather small one, with 3 lines only. 2 Contant values and one public variable. Public variables are bad, but today I was lazy, thus I have decided to leave them:
Public IeErrors As Long Public Const MAX_IE_ERRORS = 10 Public Const IN_PRODUCTION = False
ExcelRelated
Lots of Excel Related functions. From the good old lastRow, which returns the last used row of a given worksheet to the self-explanatory WriteToExcel and RawDataToStructured.
Public Function GetNextKeyWord() As String
With tblInput
Dim lastRowB As Long
lastRowB = lastRow(.Name, 2) + 1
GetNextKeyWord = Trim(.Cells(lastRowB, 1))
If Len(GetNextKeyWord) <> 0 Then .Cells(lastRowB, 2) = Now
End With
End Function
Public Sub WriteFormulas()
Dim i As Long
With tblInput
For i = lastRow(.Name) To 2 Step -1
.Cells(i, 3).FormulaR1C1 = "=COUNTIF(Summary!C[1],Input!RC[-2])"
.Cells(i, 4).FormulaArray = "=MAX(IF(Summary!C=RC[-3],Summary!C[-1]))"
FormatUSD .Cells(i, 4)
.Cells(i, 5).FormulaArray = "=AVERAGE(IF(Summary!C[-1]=Input!RC[-4],Summary!C[-2]))"
FormatUSD .Cells(i, 5)
Next i
End With
End Sub
Public Sub FixWorksheets()
OnStart
With tblInput
.Range("B1") = "Start Time"
.Range("C1") = "Count"
.Range("D1") = "Max"
.Range("E1") = "Average"
End With
With tblSummary
.Range("A1") = "Title"
.Range("B1") = "Author"
.Range("C1") = "Price"
.Range("D1") = "Keyword"
End With
Dim ws As Worksheet
For Each ws In Worksheets
ws.Columns.AutoFit
Next ws
OnEnd
End Sub
Public Sub FormatUSD(myRange As Range)
myRange.NumberFormat = "_-[$$-409]* #,##0.00_ ;_-[$$-409]* -#,##0.00 ;_-[$$-409]* ""-""??_ ;_-@_ "
End Sub
Public Sub CleanWorksheets()
tblRawData.Cells.Delete
tblSummary.Cells.Delete
tblInput.Columns("B:F").Delete
End Sub
Public Function GetNthString(n As Long, myRange As Range) As String
Dim i As Long
Dim myVar As Variant
myVar = Split(myRange, vbCrLf)
For i = LBound(myVar) To UBound(myVar)
If Len(myVar(i)) > 0 And n = 0 Then
GetNthString = myVar(i)
Exit Function
ElseIf Len(myVar(i)) > 0 Then
n = n - 1
End If
Next i
End Function
Public Function GetPrice(myRange As Range) As String
Dim i As Long
Dim myVar As Variant
myVar = Split(myRange, "$")
If UBound(myVar) > 0 Then
GetPrice = Mid(myVar(1), 1, InStr(1, myVar(1), " "))
Else
GetPrice = ""
End If
End Function
Public Sub WriteToExcel(appIE As Object, keyword As String)
If IN_PRODUCTION Then On Error GoTo WriteToExcel_Error
Dim allData As Object
Set allData = appIE.document.getElementById("s-results-list-atf")
Dim book As Object
Dim myRow As Long
For Each book In allData.getElementsByClassName("a-fixed-left-grid-inner")
With tblRawData
myRow = lastRow(.Name) + 1
On Error Resume Next
.Cells(myRow, 1) = book.innertext
.Cells(myRow, 2) = keyword
On Error GoTo 0
End With
Next
IeErrors = 0
On Error GoTo 0
Exit Sub
WriteToExcel_Error:
IeErrors = IeErrors + 1
If IeErrors > MAX_IE_ERRORS Then
Debug.Print "Error " & Err.Number & " (" & Err.Description & ") in procedure WriteToExcel, line " & Erl & "."
Else
LogMe "WriteToExcel", IeErrors, keyword, IeErrors
WriteToExcel appIE, keyword
End If
End Sub
Public Sub RawDataToStructured(keyword As String, firstRow As Long)
Dim i As Long
For i = firstRow To lastRow(tblRawData.Name)
With tblRawData
If InStr(1, .Cells(i, 1), "Sponsored ") < 1 Then
Dim title As String
title = GetNthString(0, .Cells(i, 1))
Dim author As String
author = GetNthString(1, .Cells(i, 1))
Dim price As String
price = GetPrice(.Cells(i, 1))
If Not IsNumeric(price) Or price = "0" Then price = ""
Dim currentRow As String: currentRow = lastRow(tblSummary.Name) + 1
With tblSummary
.Cells(currentRow, 1) = title
.Cells(currentRow, 2) = author
.Cells(currentRow, 3) = price
.Cells(currentRow, 4) = keyword
End With
End If
End With
Next i
End Sub
Public Function lastRow(wsName As String, Optional columnToCheck As Long = 1) As Long
Dim ws As Worksheet
Set ws = Worksheets(wsName)
lastRow = ws.Cells(ws.Rows.Count, columnToCheck).End(xlUp).Row
End Function
General
The general module consists of some general subs, which I have decided to copy from my VBA repo. The subs there are created and tested by me multiple times and I feel quite secure to use them:
Public Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As LongPtr)
Public Sub OnEnd()
Application.ScreenUpdating = True
Application.EnableEvents = True
Application.AskToUpdateLinks = True
Application.DisplayAlerts = True
Application.Calculation = xlAutomatic
ThisWorkbook.Date1904 = False
Application.StatusBar = False
End Sub
Public Sub OnStart()
Application.ScreenUpdating = False
Application.EnableEvents = False
Application.AskToUpdateLinks = False
Application.DisplayAlerts = False
Application.Calculation = xlAutomatic
ThisWorkbook.Date1904 = False
ActiveWindow.View = xlNormalView
End Sub
Public Sub LogMe(ParamArray arg() As Variant)
Debug.Print Join(arg, "--")
End Sub
Public Sub PrintMeUsefulFormula()
Dim strFormula As String
Dim strParenth As String
strParenth = """"
strFormula = Selection.FormulaR1C1
strFormula = Replace(strFormula, """", """""")
strFormula = strParenth & strFormula & strParenth
Debug.Print strFormula
End Sub
Public Sub WaitSomeMilliseconds(Optional Milliseconds As Long = 1000)
Sleep Milliseconds
End Sub
Pretty much this is all. The GitHub repo of the project is here – https://github.com/Vitosh/VBA_personal/tree/master/DataScrapingFromInternet.
The Excel file is in the GitHub repo. As the Excel file is with macros enabled, it is probably a good idea not to download it.
Cheers and Merry Christmas!