There is something interesting about linear regression. I have just noticed that I actually wrote quite a few articles on it some years ago, but today I wanted to make a YouTube video as well. So what is the difference this time? I really hope I have become a bit better at explaining the basic theory and plotting the data.
If not, then no worries – the third article will probably come in another 4 years, if that linear trend holds. Here is just some minimal code from the video, in case you need to impress someone:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import warnings

warnings.filterwarnings('ignore')

# Define five manual data points
x_values = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # Reshape for sklearn
y_true = np.array([2.3, 2.9, 3.6, 4.1, 5.0])

# Define a manually set linear equation for prediction
manual_slope = 0.5
manual_intercept = 2.0
y_pred_manual = manual_slope * x_values + manual_intercept

# Create and fit the linear regression model
model = LinearRegression()
model.fit(x_values, y_true)

# Get the predicted values and coefficients from the linear regression model
y_pred_best_fit = model.predict(x_values)

# Calculate residuals and squared residuals for manual prediction
residuals_manual = y_true.flatten() - y_pred_manual.flatten()
squared_residuals_manual = residuals_manual ** 2

# Calculate residuals and squared residuals for best fit line
residuals_best_fit = y_true.flatten() - y_pred_best_fit
squared_residuals_best_fit = residuals_best_fit ** 2

# Plot the actual values, manual prediction line, and best fitting line
plt.figure(figsize=(12, 8))
plt.scatter(x_values, y_true, color='blue', label='Actual values (y_true)')
plt.plot(x_values, y_pred_manual, color='orange', linestyle='--', label='Manual Prediction Line')
plt.plot(x_values, y_pred_best_fit, color='red', label='Best Fitting Line', linewidth=2)

# Draw vertical lines for residuals and squares for squared residuals for both lines
for i in range(len(x_values)):
    # Residuals for the manual prediction line
    plt.vlines(x_values[i], y_pred_manual[i], y_true[i], color='green', linestyle='dotted')
    square_side_manual = np.abs(residuals_manual[i])
    plt.gca().add_patch(plt.Rectangle((x_values[i] - square_side_manual / 2, y_pred_manual[i]),
                                      square_side_manual, square_side_manual,
                                      color='purple', alpha=0.3))
    plt.text(x_values[i] + 0.2, (y_pred_manual[i] + y_true[i]) / 2,
             f'{squared_residuals_manual[i]:.2f}', color='purple')

    # Residuals for the best fitting line
    plt.vlines(x_values[i], y_pred_best_fit[i], y_true[i], color='cyan', linestyle='dotted')
    square_side_best_fit = np.abs(residuals_best_fit[i])
    plt.gca().add_patch(plt.Rectangle((x_values[i] - square_side_best_fit / 2, y_pred_best_fit[i]),
                                      square_side_best_fit, square_side_best_fit,
                                      color='magenta', alpha=0.3))
    plt.text(x_values[i] - 1.0, (y_pred_best_fit[i] + y_true[i]) / 2,
             f'{squared_residuals_best_fit[i]:.2f}', color='magenta')

# Adding labels and titles
plt.title('Comparison of Manual Prediction Line and Best Fitting Line with Residuals')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Dependent Variable (y)')
plt.legend()

# Show plot
plt.grid(True)
plt.show()
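The squares drawn in the plot are the squared residuals, and the best fitting line is the one that makes their sum as small as possible. If you want to check that numerically, here is a minimal sketch that reuses the squared_residuals_manual and squared_residuals_best_fit arrays from the code above:

# Minimal check: compare the sum of squared residuals (SSR) of both lines,
# reusing squared_residuals_manual and squared_residuals_best_fit from the code above
ssr_manual = squared_residuals_manual.sum()
ssr_best_fit = squared_residuals_best_fit.sum()

print(f'SSR of the manual line:       {ssr_manual:.4f}')
print(f'SSR of the best fitting line: {ssr_best_fit:.4f}')
# The least-squares fit minimizes the SSR, so the second value can never be larger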
What does the code do? It produces the plot shown above the code, comparing two ways of drawing a regression line: the first with a manually defined intercept and slope, and the second with the line fitted by the sklearn library. Run the code below to print the fitted coefficients:
# Define a manually set linear equation for prediction
manual_slope = 0.5
manual_intercept = 2.0

# Coefficients of the best fitting line (requires the fitted model from the code above)
best_line_slope = model.coef_[0]
best_line_intercept = model.intercept_

print(best_line_slope, best_line_intercept)
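If you want to double-check the sklearn result by hand, the slope and intercept of a one-feature least-squares line can also be computed with the textbook formulas. This is only a sketch and assumes the x_values and y_true arrays from the first code block:

# Sketch: closed-form ordinary least squares for a single feature,
# assuming x_values and y_true from the first code block
x = x_values.flatten()
slope = ((x - x.mean()) * (y_true - y_true.mean())).sum() / ((x - x.mean()) ** 2).sum()
intercept = y_true.mean() - slope * x.mean()

print(slope, intercept)  # should match model.coef_[0] and model.intercept_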
The rest is covered in the YouTube video; I hope you enjoy it 🙂
The GitHub code is here:
🙂