Some time ago I wrote an article for Data scraping from Internet with Excel , which was scraping book information from amazon.com, based on a given word. Thus, for a keyword as “VBA”, this is the data, printed in the immediate window:
The code was 25 lines, thus it had some points for improvement. Thus, I have decided to make something bigger out of it with some additional options:
- writing it to Excel, instead of printing to the immediate window
- getting the prices of the books
- scraping multiple titles
- creating some user interface and reporting
- analyzing the data
So, at the end, my Excel file was with 3 sheets – RawData, Summary and Input. This is how Input looks, after running the code:
The code starts with deleting anything in the input table, except the column A. Then it gets all the data per keyword. At the end it counts the entries and gives some information about the total count of entries per keyword, their max price and their average prices. Initially the scraped data is collected in the RawData worksheet. It looks like this:
As you see, the data is really unstructured, but there could be a pattern somehow, if you take a good look of multiple entries. E.g., the first row with data is usually the title of the book and the second is the author. Concerning the price, it is a bit more tricky, but I have decided to come up with a rule, stating that the first string between the first $-sign and the first space after it should be the price. If the digit is not numeric, I return empty string.
The last part of the code is building a Summary worksheet, thus the data becomes somehow useable. This is how the summary part looks like:
With this data, writing the Max(), Count() and Average() formulas becomes really a piece of cake for a VBA Developer.
So, the code is split into five modules, with no classes:
- StartUp
- AmazonInternet
- ConstValues
- ExcelRelated
- General
Here is the code of each module with some description.
StartUp
This module has the Main sub, which is called by the Run button. This is how it looks like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 |
Public Sub Main() If IN_PRODUCTION Then On Error GoTo Main_Error CleanWorksheets Dim keyword As String: keyword = GetNextKeyWord While keyword <> "" Dim appIE As Object Set appIE = CreateObject("InternetExplorer.Application") LogMe keyword Dim nextPageExists As Boolean: nextPageExists = True Dim i As Long: i = 1 Dim firstRow As Long: firstRow = lastRow(tblRawData.Name) + 1 While nextPageExists WaitSomeMilliseconds Navigate i, appIE, keyword nextPageExists = PageWithResultsExists(appIE, keyword) If nextPageExists Then WriteToExcel appIE, keyword i = i + 1 Wend LogMe Time, keyword, "RawDataToStructured" RawDataToStructured keyword, firstRow keyword = GetNextKeyWord WaitSomeMilliseconds 4000 appIE.Quit Wend FixWorksheets WriteFormulas LogMe "Program has ended!" On Error GoTo 0 Exit Sub Main_Error: MsgBox "Error " & Err.Number & " (" & Err.Description & ") in procedure Main, line " & Erl & "." End Sub |
My idea about every Main function or sub, is that it should be like a story, which could be told to someone, without using much efforts. I hope I have managed to tell the story with this one.
AmazonInternet
This module has 3 functions. The first one, PageWithResultsExists checks whether after clicking the N-th page of Amazon.com with our results, there are some results left. MakeUrl is the one that makes the URL with a keyword and i. Navigate goes to the given URL:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 |
Option Explicit Public Function PageWithResultsExists(appIE As Object, keyword As String) As Boolean On Error GoTo PageWithResultsExists_Error Dim allData As Object Set allData = appIE.document.getElementById("s-results-list-atf") PageWithResultsExists = True IeErrors = 0 On Error GoTo 0 Exit Function PageWithResultsExists_Error: WaitSomeMilliseconds IeErrors = IeErrors + 1 Select Case Err.Number Case 424 If IeErrors > MAX_IE_ERRORS Then PageWithResultsExists = False IeErrors = 0 Else LogMe "PageWithResultsExists", IeErrors, keyword, IeErrors PageWithResultsExists appIE, keyword End If Case Else Err.Raise Err.Number, Err.Source, Err.Description, Err.HelpFile, Err.HelpContext End Select End Function Public Function MakeUrl(i As Long, keyword As String) As String MakeUrl = "https://www.amazon.com/s/ref=sr_pg_" & i & "?rh=i%3Aaps%2Ck%3A" & keyword & "&page=" & i & "&keywords=" & keyword End Function Public Sub Navigate(i As Long, appIE As Object, keyword As String) Do While appIE.Busy DoEvents Loop With appIE .Navigate MakeUrl(i, keyword) .Visible = False End With Do While appIE.Busy DoEvents Loop End Sub |
ConstValues
Rather small one, with 3 lines only. 2 Contant values and one public variable. Public variables are bad, but today I was lazy, thus I have decided to leave them:
1 2 3 |
Public IeErrors As Long Public Const MAX_IE_ERRORS = 10 Public Const IN_PRODUCTION = False |
ExcelRelated
Lots of Excel Related functions. From the good old lastRow, which returns the last used row of a given worksheet to the self-explanatory WriteToExcel and RawDataToStructured.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 |
Public Function GetNextKeyWord() As String With tblInput Dim lastRowB As Long lastRowB = lastRow(.Name, 2) + 1 GetNextKeyWord = Trim(.Cells(lastRowB, 1)) If Len(GetNextKeyWord) <> 0 Then .Cells(lastRowB, 2) = Now End With End Function Public Sub WriteFormulas() Dim i As Long With tblInput For i = lastRow(.Name) To 2 Step -1 .Cells(i, 3).FormulaR1C1 = "=COUNTIF(Summary!C[1],Input!RC[-2])" .Cells(i, 4).FormulaArray = "=MAX(IF(Summary!C=RC[-3],Summary!C[-1]))" FormatUSD .Cells(i, 4) .Cells(i, 5).FormulaArray = "=AVERAGE(IF(Summary!C[-1]=Input!RC[-4],Summary!C[-2]))" FormatUSD .Cells(i, 5) Next i End With End Sub Public Sub FixWorksheets() OnStart With tblInput .Range("B1") = "Start Time" .Range("C1") = "Count" .Range("D1") = "Max" .Range("E1") = "Average" End With With tblSummary .Range("A1") = "Title" .Range("B1") = "Author" .Range("C1") = "Price" .Range("D1") = "Keyword" End With Dim ws As Worksheet For Each ws In Worksheets ws.Columns.AutoFit Next ws OnEnd End Sub Public Sub FormatUSD(myRange As Range) myRange.NumberFormat = "_-[$$-409]* #,##0.00_ ;_-[$$-409]* -#,##0.00 ;_-[$$-409]* ""-""??_ ;_-@_ " End Sub Public Sub CleanWorksheets() tblRawData.Cells.Delete tblSummary.Cells.Delete tblInput.Columns("B:F").Delete End Sub Public Function GetNthString(n As Long, myRange As Range) As String Dim i As Long Dim myVar As Variant myVar = Split(myRange, vbCrLf) For i = LBound(myVar) To UBound(myVar) If Len(myVar(i)) > 0 And n = 0 Then GetNthString = myVar(i) Exit Function ElseIf Len(myVar(i)) > 0 Then n = n - 1 End If Next i End Function Public Function GetPrice(myRange As Range) As String Dim i As Long Dim myVar As Variant myVar = Split(myRange, "$") If UBound(myVar) > 0 Then GetPrice = Mid(myVar(1), 1, InStr(1, myVar(1), " ")) Else GetPrice = "" End If End Function Public Sub WriteToExcel(appIE As Object, keyword As String) If IN_PRODUCTION Then On Error GoTo WriteToExcel_Error Dim allData As Object Set allData = appIE.document.getElementById("s-results-list-atf") Dim book As Object Dim myRow As Long For Each book In allData.getElementsByClassName("a-fixed-left-grid-inner") With tblRawData myRow = lastRow(.Name) + 1 On Error Resume Next .Cells(myRow, 1) = book.innertext .Cells(myRow, 2) = keyword On Error GoTo 0 End With Next IeErrors = 0 On Error GoTo 0 Exit Sub WriteToExcel_Error: IeErrors = IeErrors + 1 If IeErrors > MAX_IE_ERRORS Then Debug.Print "Error " & Err.Number & " (" & Err.Description & ") in procedure WriteToExcel, line " & Erl & "." Else LogMe "WriteToExcel", IeErrors, keyword, IeErrors WriteToExcel appIE, keyword End If End Sub Public Sub RawDataToStructured(keyword As String, firstRow As Long) Dim i As Long For i = firstRow To lastRow(tblRawData.Name) With tblRawData If InStr(1, .Cells(i, 1), "Sponsored ") < 1 Then Dim title As String title = GetNthString(0, .Cells(i, 1)) Dim author As String author = GetNthString(1, .Cells(i, 1)) Dim price As String price = GetPrice(.Cells(i, 1)) If Not IsNumeric(price) Or price = "0" Then price = "" Dim currentRow As String: currentRow = lastRow(tblSummary.Name) + 1 With tblSummary .Cells(currentRow, 1) = title .Cells(currentRow, 2) = author .Cells(currentRow, 3) = price .Cells(currentRow, 4) = keyword End With End If End With Next i End Sub Public Function lastRow(wsName As String, Optional columnToCheck As Long = 1) As Long Dim ws As Worksheet Set ws = Worksheets(wsName) lastRow = ws.Cells(ws.Rows.Count, columnToCheck).End(xlUp).Row End Function |
General
The general module consists of some general subs, which I have decided to copy from my VBA repo. The subs there are created and tested by me multiple times and I feel quite secure to use them:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 |
Public Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As LongPtr) Public Sub OnEnd() Application.ScreenUpdating = True Application.EnableEvents = True Application.AskToUpdateLinks = True Application.DisplayAlerts = True Application.Calculation = xlAutomatic ThisWorkbook.Date1904 = False Application.StatusBar = False End Sub Public Sub OnStart() Application.ScreenUpdating = False Application.EnableEvents = False Application.AskToUpdateLinks = False Application.DisplayAlerts = False Application.Calculation = xlAutomatic ThisWorkbook.Date1904 = False ActiveWindow.View = xlNormalView End Sub Public Sub LogMe(ParamArray arg() As Variant) Debug.Print Join(arg, "--") End Sub Public Sub PrintMeUsefulFormula() Dim strFormula As String Dim strParenth As String strParenth = """" strFormula = Selection.FormulaR1C1 strFormula = Replace(strFormula, """", """""") strFormula = strParenth & strFormula & strParenth Debug.Print strFormula End Sub Public Sub WaitSomeMilliseconds(Optional Milliseconds As Long = 1000) Sleep Milliseconds End Sub |
Pretty much this is all. The GitHub repo of the project is here – https://github.com/Vitosh/VBA_personal/tree/master/DataScrapingFromInternet.
The Excel file is in the GitHub repo. As the Excel file is with macros enabled, it is probably a good idea not to download it.
Cheers and Merry Christmas!