VBA – Data scraping from Internet with Excel – Part 2

Some time ago I wrote an article for Data scraping from Internet with Excel , which was scraping book information from amazon.com, based on a given word. Thus, for a keyword as “VBA”, this is the data, printed in the immediate window:

The code was 25 lines, thus it had some points for improvement. Thus, I have decided to make something bigger out of it with some additional options:

  • writing it to Excel, instead of printing to the immediate window
  • getting the prices of the books
  • scraping multiple titles
  • creating some user interface and reporting
  • analyzing the data

So, at the end, my Excel file was with 3 sheets – RawData, Summary and Input. This is how Input looks, after running the code:

The code starts with deleting anything in the input table, except the column A. Then it gets all the data per keyword. At the end it counts the entries and gives some information about the total count of entries per keyword, their max price and their average prices. Initially the scraped data is collected in the RawData worksheet. It looks like this:

As you see, the data is really unstructured, but there could be a pattern somehow, if you take a good look of multiple entries. E.g., the first row with data is usually the title of the book and the second is the author. Concerning the price, it is a bit more tricky, but I have decided to come up with a rule, stating that the first string between the first $-sign and the first space after it should be the price. If the digit is not numeric, I return empty string.

The last part of the code is building a Summary worksheet, thus the data becomes somehow useable. This is how the summary part looks like:

With this data, writing the Max(), Count() and Average() formulas becomes really a piece of cake for a VBA Developer.

So, the code is split into five modules, with no classes:

  • StartUp
  • AmazonInternet
  • ConstValues
  • ExcelRelated
  • General

Here is the code of each module with some description.

StartUp

This module has the Main sub, which is called by the Run button. This is how it looks like:

Public Sub Main()

    If IN_PRODUCTION Then On Error GoTo Main_Error
        
    CleanWorksheets
    Dim keyword As String: keyword = GetNextKeyWord
    
    While keyword <> ""
        
        Dim appIE As Object
        Set appIE = CreateObject("InternetExplorer.Application")
        LogMe keyword
        Dim nextPageExists As Boolean: nextPageExists = True
        Dim i As Long: i = 1
        Dim firstRow As Long: firstRow = lastRow(tblRawData.Name) + 1
        
        While nextPageExists
        
            WaitSomeMilliseconds
            Navigate i, appIE, keyword
            nextPageExists = PageWithResultsExists(appIE, keyword)
            If nextPageExists Then WriteToExcel appIE, keyword
            i = i + 1
            
        Wend
        
        LogMe Time, keyword, "RawDataToStructured"
        RawDataToStructured keyword, firstRow
        keyword = GetNextKeyWord
        WaitSomeMilliseconds 4000
        appIE.Quit
        
    Wend
    
    FixWorksheets
    WriteFormulas
    
    LogMe "Program has ended!"
    
    On Error GoTo 0
    Exit Sub

Main_Error:

    MsgBox "Error " & Err.Number & " (" & Err.Description & ") in procedure Main, line " & Erl & "."

End Sub

My idea about every Main function or sub, is that it should be like a story, which could be told to someone, without using much efforts. I hope I have managed to tell the story with this one.

AmazonInternet

This module has 3 functions. The first one, PageWithResultsExists checks whether after clicking the N-th page of Amazon.com with our results, there are some results left. MakeUrl is the one that makes the URL with a keyword and i. Navigate goes to the given URL:

Option Explicit

Public Function PageWithResultsExists(appIE As Object, keyword As String) As Boolean

    On Error GoTo PageWithResultsExists_Error
    
    Dim allData As Object
    Set allData = appIE.document.getElementById("s-results-list-atf")
    PageWithResultsExists = True
    IeErrors = 0
    
    On Error GoTo 0
    Exit Function

PageWithResultsExists_Error:
    
    WaitSomeMilliseconds
    IeErrors = IeErrors + 1
    
    Select Case Err.Number
        
        Case 424
            
            If IeErrors > MAX_IE_ERRORS Then
                PageWithResultsExists = False
                IeErrors = 0
            Else
                LogMe "PageWithResultsExists", IeErrors, keyword, IeErrors
                PageWithResultsExists appIE, keyword
            End If
        Case Else
            Err.Raise Err.Number, Err.Source, Err.Description, Err.HelpFile, Err.HelpContext
    End Select
    
End Function

Public Function MakeUrl(i As Long, keyword As String) As String

    MakeUrl = "https://www.amazon.com/s/ref=sr_pg_" & i & "?rh=i%3Aaps%2Ck%3A" & keyword & "&page=" & i & "&keywords=" & keyword

End Function

Public Sub Navigate(i As Long, appIE As Object, keyword As String)
    
    Do While appIE.Busy
        DoEvents
    Loop
    
    With appIE
        .Navigate MakeUrl(i, keyword)
        .Visible = False
    End With
    
    Do While appIE.Busy
        DoEvents
    Loop
    
End Sub

ConstValues

Rather small one, with 3 lines only. 2 Contant values and one public variable. Public variables are bad, but today I was lazy, thus I have decided to leave them:

Public IeErrors As Long
Public Const MAX_IE_ERRORS = 10
Public Const IN_PRODUCTION = False

ExcelRelated

Lots of Excel Related functions. From the good old lastRow, which returns the last used row of a given worksheet to the self-explanatory WriteToExcel and RawDataToStructured.

Public Function GetNextKeyWord() As String
    
    With tblInput
        Dim lastRowB As Long
        lastRowB = lastRow(.Name, 2) + 1
        GetNextKeyWord = Trim(.Cells(lastRowB, 1))
        If Len(GetNextKeyWord) <> 0 Then .Cells(lastRowB, 2) = Now
    End With
    
End Function

Public Sub WriteFormulas()
    
    Dim i As Long
    With tblInput
        For i = lastRow(.Name) To 2 Step -1
            .Cells(i, 3).FormulaR1C1 = "=COUNTIF(Summary!C[1],Input!RC[-2])"
            
            .Cells(i, 4).FormulaArray = "=MAX(IF(Summary!C=RC[-3],Summary!C[-1]))"
            FormatUSD .Cells(i, 4)
            
            .Cells(i, 5).FormulaArray = "=AVERAGE(IF(Summary!C[-1]=Input!RC[-4],Summary!C[-2]))"
            FormatUSD .Cells(i, 5)
        Next i
    End With
    
End Sub

Public Sub FixWorksheets()
    OnStart
    With tblInput
        .Range("B1") = "Start Time"
        .Range("C1") = "Count"
        .Range("D1") = "Max"
        .Range("E1") = "Average"
    End With
    
    With tblSummary
        .Range("A1") = "Title"
        .Range("B1") = "Author"
        .Range("C1") = "Price"
        .Range("D1") = "Keyword"
    End With
    
    Dim ws As Worksheet
    For Each ws In Worksheets
        ws.Columns.AutoFit
    Next ws
    OnEnd
End Sub

Public Sub FormatUSD(myRange As Range)

    myRange.NumberFormat = "_-[$$-409]* #,##0.00_ ;_-[$$-409]* -#,##0.00 ;_-[$$-409]* ""-""??_ ;_-@_ "

End Sub

        
Public Sub CleanWorksheets()

    tblRawData.Cells.Delete
    tblSummary.Cells.Delete
    tblInput.Columns("B:F").Delete
        
End Sub

Public Function GetNthString(n As Long, myRange As Range) As String
    
    Dim i As Long
    Dim myVar As Variant
    
    myVar = Split(myRange, vbCrLf)
    For i = LBound(myVar) To UBound(myVar)
        If Len(myVar(i)) > 0 And n = 0 Then
            GetNthString = myVar(i)
            Exit Function
        ElseIf Len(myVar(i)) > 0 Then
            n = n - 1
        End If
    Next i
    
End Function


Public Function GetPrice(myRange As Range) As String
    
    Dim i As Long
    Dim myVar As Variant
    myVar = Split(myRange, "$")
    
    If UBound(myVar) > 0 Then
        GetPrice = Mid(myVar(1), 1, InStr(1, myVar(1), " "))
    Else
        GetPrice = ""
    End If
        
End Function

Public Sub WriteToExcel(appIE As Object, keyword As String)

    If IN_PRODUCTION Then On Error GoTo WriteToExcel_Error
    
    Dim allData As Object
    Set allData = appIE.document.getElementById("s-results-list-atf")
    
    Dim book As Object
    Dim myRow As Long
        
    For Each book In allData.getElementsByClassName("a-fixed-left-grid-inner")
        With tblRawData
            myRow = lastRow(.Name) + 1
            On Error Resume Next
            .Cells(myRow, 1) = book.innertext
            .Cells(myRow, 2) = keyword
            On Error GoTo 0
        End With
    Next
        
    IeErrors = 0
    
    On Error GoTo 0
    Exit Sub

WriteToExcel_Error:

    IeErrors = IeErrors + 1
    
    If IeErrors > MAX_IE_ERRORS Then
        Debug.Print "Error " & Err.Number & " (" & Err.Description & ") in procedure WriteToExcel, line " & Erl & "."
    Else
        LogMe "WriteToExcel", IeErrors, keyword, IeErrors
        WriteToExcel appIE, keyword
    End If
    
End Sub

Public Sub RawDataToStructured(keyword As String, firstRow As Long)
    
    Dim i As Long
    For i = firstRow To lastRow(tblRawData.Name)
        With tblRawData
            If InStr(1, .Cells(i, 1), "Sponsored ") < 1 Then
                Dim title As String
                title = GetNthString(0, .Cells(i, 1))
                Dim author As String
                author = GetNthString(1, .Cells(i, 1))
                Dim price As String
                price = GetPrice(.Cells(i, 1))
                If Not IsNumeric(price) Or price = "0" Then price = ""
                Dim currentRow As String: currentRow = lastRow(tblSummary.Name) + 1
                With tblSummary
                    .Cells(currentRow, 1) = title
                    .Cells(currentRow, 2) = author
                    .Cells(currentRow, 3) = price
                    .Cells(currentRow, 4) = keyword
                End With
            End If
        End With
    Next i

End Sub

Public Function lastRow(wsName As String, Optional columnToCheck As Long = 1) As Long

    Dim ws As Worksheet
    Set ws = Worksheets(wsName)
    lastRow = ws.Cells(ws.Rows.Count, columnToCheck).End(xlUp).Row

End Function

General

The general module consists of some general subs, which I have decided to copy from my VBA repo. The subs there are created and tested by me multiple times and I feel quite secure to use them:

Public Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As LongPtr)

Public Sub OnEnd()

    Application.ScreenUpdating = True
    Application.EnableEvents = True
    Application.AskToUpdateLinks = True
    Application.DisplayAlerts = True
    Application.Calculation = xlAutomatic
    ThisWorkbook.Date1904 = False
    
    Application.StatusBar = False
    
End Sub

Public Sub OnStart()
    
    Application.ScreenUpdating = False
    Application.EnableEvents = False
    Application.AskToUpdateLinks = False
    Application.DisplayAlerts = False
    Application.Calculation = xlAutomatic
    ThisWorkbook.Date1904 = False
    
    ActiveWindow.View = xlNormalView

End Sub

Public Sub LogMe(ParamArray arg() As Variant)

    Debug.Print Join(arg, "--")
    
End Sub

Public Sub PrintMeUsefulFormula()

    Dim strFormula  As String
    Dim strParenth  As String

    strParenth = """"

    strFormula = Selection.FormulaR1C1
    
    strFormula = Replace(strFormula, """", """""")

    strFormula = strParenth & strFormula & strParenth
    Debug.Print strFormula
    
End Sub

Public Sub WaitSomeMilliseconds(Optional Milliseconds As Long = 1000)
    Sleep Milliseconds
End Sub

Pretty much this is all. The GitHub repo of the project is here – https://github.com/Vitosh/VBA_personal/tree/master/DataScrapingFromInternet.

The Excel file is in the GitHub repo. As the Excel file is with macros enabled, it is probably a good idea not to download it.

Cheers and Merry Christmas!