In 2013 and 2014 (wow, already 7 years ago!) I wrote two articles about linear regression with Excel. Now I am getting more and more interested in Python, so I guess it would be interesting to remake the article as a Python one. So, this is our input, the daily profit per week:
So, starting and loading the data looks like this:
import numpy as np
from sklearn.linear_model import LinearRegression
import pandas as pd
import matplotlib.pyplot as plt

x = np.array([x for x in range(0, 8)]).reshape(-1, 1)
y = np.array([35, 36, 43, 47, 50, 51, 52, 57])
The .reshape(-1, 1) is required, because scikit-learn expects x as a two-dimensional array, essentially a list of one-element lists (one row per observation).
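Just to illustrate the difference, a quick sketch of the flat array versus the reshaped one:

flat = np.array(range(0, 8))      # shape (8,)  - a plain 1-D array
column = flat.reshape(-1, 1)      # shape (8, 1) - one row per observation
print(flat.shape, column.shape)   # (8,) (8, 1)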
Now, starting with the model, the following 2 lines do the magic:
model = LinearRegression()
model.fit(x, y)
The model is now “fitted”. This means that a line is produced which “fits” the dots in such a way that the sum of the squared vertical distances between the dots and the line (the residuals) is as small as possible. This is how to produce the fitted line and the scattered points:
plt.scatter(x, y, color = 'black')
line = model.coef_*x + model.intercept_
plt.plot(x, line, 'r', label = f' y = {model.coef_} x + {model.intercept_}')
plt.legend(fontsize = 22)
plt.show()
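To make the “smallest squared distance” idea a bit more tangible, here is a small sketch that computes the sum of the squared residuals for the fitted line; this is the quantity the least-squares fit minimizes, so no other slope and intercept would give a smaller number:

residuals = y - (model.coef_ * x.ravel() + model.intercept_)   # vertical distances to the line
print((residuals ** 2).sum())                                  # the minimized quantity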
The more interesting part of linear regression is the “Prediction”. It is like asking: “If we only had that tiny red line from the plot above, where would we have put our values for a given period?” And the answer is actually quite simple: “On that red line!” This is how to do it. First, generate the predicted values:
y_pred = model.predict(x) |
They look like this:
array([35.5, 38.60714286, 41.71428571, 44.82142857, 47.92857143, 51.03571429, 54.14285714, 57.25]) |
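As a quick sanity check (just a sketch, nothing more), model.predict(x) returns exactly what the line equation gives us:

manual = model.coef_ * x.ravel() + model.intercept_   # y = slope * x + intercept
print(np.allclose(manual, y_pred))                    # True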
And they are quite different from the original values. How different? See for yourself:
fig = plt.figure()
ax1 = fig.add_subplot()
ax1.scatter(x, y, color = "black", label = "real data")
ax1.scatter(x, y_pred, color = "red", label = "prediction")
ax1.set_xlabel('periods', fontsize=10)
ax1.set_ylabel('money', fontsize='large')
plt.legend(loc='best')
plt.rcParams["figure.figsize"] = (10,5)
fig.suptitle('Real vs Predicted', fontsize=16)
plt.show()
Alternatively, we may use fewer lines to produce the same plot, without the add_subplot() part from the code above. But I guess it is less fun:
plt.scatter(x, y, color ='black', marker='x', label='real data')
plt.scatter(x, y_pred, c='red', marker='o', label='prediction')
plt.legend(loc='upper left')
plt.rcParams["figure.figsize"] = (10,5)
plt.show()
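And since pandas is already imported at the top (and otherwise unused), here is a small sketch that puts the real and predicted values side by side, together with their differences:

comparison = pd.DataFrame({'period': x.ravel(), 'real': y, 'predicted': y_pred})
comparison['difference'] = comparison['real'] - comparison['predicted']
print(comparison)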
And if we want to finish with something that makes the article a bit more statistical, here are these linear regression features:
- coefficient of determination (or r^2) – how much of the variation in y the line explains
- intercept – this is the a in the formula Y = a + bX, the value of y at x = 0
- slope – this is the b in the formula, i.e. how much y changes for every unit of x. If the slope is 7 and the intercept is 0, it means that for x = [1, 2, 3] we get y = [7, 14, 21].
r_sq = model.score(x,y)
print(f'coefficient of determination: {r_sq}')
print(f'intercept: {model.intercept_}')
print(f'slope: {model.coef_}')
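If you are curious where the coefficient of determination comes from, here is a hand-made sketch of the same number, using the usual formula r^2 = 1 - SS_res / SS_tot:

ss_res = ((y - y_pred) ** 2).sum()      # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()    # total sum of squares
print(1 - ss_res / ss_tot)              # same value as model.score(x, y)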
The code is available here. Enjoy!