

🎓 24/167
This post is a part of the Regression educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page; the order in which they appear in Research can be arbitrary.
I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!
Linear regression is one of the foundational methods in machine learning and statistics, used widely in both academic research and practical applications. It provides a systematic way to model the relationship between one or more explanatory variables (also called features or predictors) and a continuous target variable. Despite being one of the oldest and perhaps simplest forms of regression, linear regression remains a cornerstone for understanding model building, interpretability, and optimization in data science.
Motivation and overview of linear regression in machine learning
Motivation for linear regression stems from its ability to capture a linear (or linearly transformed) relationship between inputs and a continuous output. In other words, it tries to find a hyperplane in the feature space that best fits the observed data. This "best fit" is typically defined by minimizing some form of error measure, most commonly the sum of squared errors.
Many real-world phenomena — such as forecasting housing prices, predicting the lifespan of an engineering component, or relating health factors to overall well-being — can often be approximated with a linear model if we limit the scope or cleverly design features. Even in modern deep learning systems, linear components appear in the final layers for tasks like regression or classification (logistic regression being the linear model for classification).
Historical background and practical applications
Linear regression can be traced back to the 19th century, when it was studied in the context of astronomical observations and social statistics. Pioneers such as Adrien-Marie Legendre and Carl Friedrich Gauss formalized the method of least squares, the mathematical backbone of linear regression.
From this historical standpoint, linear regression has grown into a ubiquitous tool across disciplines:
- Economics: to predict economic indicators (e.g., GDP growth, inflation rates).
- Healthcare: to model risk factors against patient outcomes like blood pressure or insurance claim amounts.
- Marketing and sales: to understand relationships between advertising spend and sales revenue.
- Engineering: to estimate how factors like stress, temperature, or load affect a system's performance.
Its popularity persists largely because of its interpretability: each coefficient has a meaningful explanation relating a specific feature to the outcome.
The linear model in a regression problem
A regression problem differs from classification primarily in that the target variable (also called the label or response variable) is continuous rather than discrete. A linear regression model posits that the target can be described (or approximated) by a weighted sum of input features plus an intercept (often called the bias in machine learning contexts).
Formally, for a single feature $x$:
$$\hat{y} = w_0 + w_1 x$$
where $w_0$ is the intercept and $w_1$ the slope. We can write this in vector form for multiple features:
$$\hat{y} = w_0 + \mathbf{w}^\top \mathbf{x}$$
When using a more compact notation, we incorporate $w_0$ (the intercept) into the weight vector by extending each feature vector with a constant 1:
$$\hat{y} = \mathbf{w}^\top \mathbf{x}$$
Basic assumptions of linear regression commonly include:
- Linearity: The relationship between inputs and the output is linear in parameters.
- Independence: Observations are assumed to be independent.
- Homoscedasticity: The variance of the error terms is constant.
- Normality of residuals: The error terms are often assumed (though not strictly required) to be normally distributed for small-sample inference.
Interpretation of coefficients: Each coefficient $w_j$ indicates how much the predicted value changes with respect to a unit change in feature $x_j$, holding all other features constant. This interpretability is a major advantage of linear models.
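To make the notation concrete, here is a minimal NumPy sketch (with made-up numbers, purely for intuition) showing that the model is just a weighted sum plus an intercept, and that folding the intercept into the weight vector via a constant-1 column gives identical predictions:
import numpy as np

# Hypothetical weights for a toy house-price model (numbers are made up for illustration)
X = np.array([[1200.0, 3.0],   # each row: [area, bedrooms]
              [800.0, 2.0]])
w = np.array([150.0, 10000.0])  # one weight per feature
w0 = 20000.0                    # intercept (bias)

# Prediction as a weighted sum plus intercept
y_hat = w0 + X @ w
print(y_hat)  # [230000. 160000.]

# Compact form: prepend a constant-1 column and fold the intercept into the weight vector
X_ext = np.hstack([np.ones((X.shape[0], 1)), X])
w_ext = np.concatenate(([w0], w))
print(X_ext @ w_ext)  # identical predictions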
Cost function and error metrics
A central theme in regression modeling is how to measure the discrepancy between model predictions and actual observed values. This measurement is crucial for both training (where we minimize the cost) and evaluation (where we assess performance). Below are the most frequently used metrics, each with its own implications:
Mean squared error (MSE)
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
Here, $y_i$ is the true target, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations. MSE is also commonly used as the cost function to optimize in linear regression via least squares. The squaring of the errors penalizes large deviations more heavily, making MSE sensitive to outliers. Variables in the formula:
- $y_i$: Ground truth label for sample $i$.
- $\hat{y}_i$: Predicted label for sample $i$.
- $n$: Total number of samples.
Mean absolute error (MAE)
MAE measures the average magnitude of the errors without squaring. Therefore, unlike MSE, it penalizes all residuals in a more uniform way and is less sensitive to outliers. However, it is not differentiable at zero, which can complicate the analytic solutions or certain gradient-based optimizations (though subgradient methods do exist).
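For reference, MAE is defined analogously to MSE, with absolute values in place of squares:
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$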
Root mean squared error (RMSE)
RMSE is simply the square root of MSE, bringing the error metric back to the same units as the target variable. This makes RMSE often easier to interpret in many practical scenarios.
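In formula form:
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} = \sqrt{\text{MSE}}$$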
Mean absolute percentage error (MAPE) and symmetric MAPE (SMAPE)
MAPE gauges the relative size of the errors by dividing by the actual target values:
$$\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
This is particularly useful if you want to measure the error in percentage terms (e.g., an error of 10k on a 1-million-dollar house is less serious than a 10k error on a 100k-dollar house).
SMAPE modifies MAPE to account for both predicted and actual values in the denominator:
$$\text{SMAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{\left|y_i - \hat{y}_i\right|}{\left(\left|y_i\right| + \left|\hat{y}_i\right|\right)/2}$$
This helps control skew issues when $y_i$ is very large or very small.
R-squared (coefficient of determination)
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
Here, $\bar{y}$ denotes the mean of the observed data. R-squared measures how much better (or worse) your regression model is compared to a simple baseline that predicts the mean of $y$ for all observations. An $R^2$ value of 1 indicates a perfect fit, and 0 indicates that your model does no better than the naive mean-based approach.
Other potential metrics
- Adjusted R-squared: Adjusts for the number of features, preventing an artificially high $R^2$ due to adding irrelevant predictors.
- AIC and BIC: Information criteria used for model selection (these go beyond measuring pure predictive error, incorporating complexity penalties).
- Explained variance score: Indicates how much variance is explained by the model vs. total variance in the data.
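To tie the metrics together, below is a minimal NumPy sketch that computes them directly from their definitions on a tiny set of made-up targets and predictions (the numbers are illustrative only):
import numpy as np

# Illustrative ground truth and predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 6.5])

residuals = y_true - y_pred

mse = np.mean(residuals ** 2)                     # Mean squared error
mae = np.mean(np.abs(residuals))                  # Mean absolute error
rmse = np.sqrt(mse)                               # Root mean squared error
mape = 100 * np.mean(np.abs(residuals / y_true))  # Mean absolute percentage error
smape = 100 * np.mean(np.abs(residuals) /
                      ((np.abs(y_true) + np.abs(y_pred)) / 2))  # Symmetric MAPE
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # R-squared

print(mse, mae, rmse, mape, smape, r2)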
Analytical approach to linear regression
Derivation of the normal equation
When using the least squares approach and MSE as the cost function, one can solve for the optimal weights $\mathbf{w}$ in closed form. Suppose you have a dataset $(\mathbf{X}, \mathbf{y})$, where $\mathbf{X}$ is an $n \times d$ matrix of features (including a column of ones for the intercept) and $\mathbf{y}$ is an $n$-dimensional vector of targets. The cost function in matrix form is:
$$J(\mathbf{w}) = \frac{1}{2n}\left\lVert \mathbf{X}\mathbf{w} - \mathbf{y} \right\rVert^2$$
where the factor of $\frac{1}{2}$ is simply a convenience for cleaner derivatives. Differentiating and setting the gradient to zero yields:
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \frac{1}{n}\,\mathbf{X}^\top\left(\mathbf{X}\mathbf{w} - \mathbf{y}\right) = 0 \quad\Longrightarrow\quad \mathbf{X}^\top\mathbf{X}\,\mathbf{w} = \mathbf{X}^\top\mathbf{y}$$
If $\mathbf{X}^\top\mathbf{X}$ is invertible (non-singular), the normal equation becomes:
$$\mathbf{w} = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{y}$$
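As a sketch of the closed-form solution, the normal equation can be computed directly in NumPy; solving the linear system $\mathbf{X}^\top\mathbf{X}\,\mathbf{w} = \mathbf{X}^\top\mathbf{y}$ is generally preferable to explicitly inverting the matrix (the data below is made up for illustration):
import numpy as np

# Made-up data: 5 samples, 1 raw feature
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.9, 3.6, 4.5, 5.1])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) w = X^T y instead of forming the inverse explicitly
w = np.linalg.solve(X.T @ X, X.T @ y)
print("Intercept:", w[0], "Slope:", w[1])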
Advantages and limitations of the analytical solution
Advantages:
- Direct formula, no need for iterative methods if the dimensionality is manageable.
- Conceptual simplicity, easy to understand from a linear algebra perspective.
Limitations:
- Computational cost: Inverting a $d \times d$ matrix can be expensive and numerically unstable for large $d$.
- Ill-conditioning: If features are collinear or nearly collinear, $\mathbf{X}^\top\mathbf{X}$ can become singular or poorly conditioned, leading to unstable solutions.
Computational considerations for large-scale problems
For high-dimensional scenarios (large $d$), direct matrix inversion is impractical. Instead, methods such as gradient descent, stochastic gradient descent, or advanced linear algebra techniques (e.g., singular value decomposition, QR decomposition) are typically used. These methods can handle very large datasets where the matrix $\mathbf{X}^\top\mathbf{X}$ wouldn't even fit in memory for direct computation.
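As a sketch, NumPy's np.linalg.lstsq solves the least squares problem with an SVD-based routine without explicitly forming $\mathbf{X}^\top\mathbf{X}$, which tends to be more numerically stable (this reuses the made-up design matrix from the previous sketch):
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.9, 3.6, 4.5, 5.1])
X = np.column_stack([np.ones_like(x), x])

# SVD-based least squares; avoids forming X^T X explicitly
w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("Intercept:", w[0], "Slope:", w[1])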
Multiple linear regression
Extending from simple to multiple predictors
In multiple linear regression, we allow more features. Conceptually, the line becomes a plane (for two features) or a hyperplane (for higher dimensions). Our model is:
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$$
Geometric interpretation (hyperplanes)
Each data point lives in $d$-dimensional feature space, and the model's predictions lie on a hyperplane defined by the weights $\mathbf{w}$. The objective is to find the hyperplane that minimizes the sum of squared distances to the observed data points, measured along the target dimension.
Common pitfalls and multicollinearity
When two or more features are nearly linearly dependent, the model suffers from multicollinearity, which can lead to large swings in the values of the estimated coefficients. Methods like Ridge regression or Lasso (discussed in a later chapter on regularization) introduce penalties that help mitigate these issues by shrinking coefficients.
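To see why this is a numerical problem, here is a small hypothetical sketch: duplicating a feature with a bit of noise makes $\mathbf{X}^\top\mathbf{X}$ nearly singular, which shows up as a very large condition number (the data is made up for illustration):
import numpy as np

rng = np.random.default_rng(0)

# Made-up feature and a nearly identical copy of it
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=1e-3, size=100)  # almost collinear with x1
X = np.column_stack([np.ones(100), x1, x2])

# A huge condition number signals an ill-conditioned (near-singular) X^T X
print("Condition number of X^T X:", np.linalg.cond(X.T @ X))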
Polynomial features
Motivation for non-linear patterns
A purely linear model might be insufficient for certain phenomena that exhibit curvature or more intricate relationships. Polynomial features allow us to handle such cases without leaving the conceptual framework of linear regression.
Generating polynomial terms
For a single feature $x$, generating polynomial features up to degree $d$ means you include $x, x^2, \dots, x^d$ as if they were separate input features in a linear model. The resulting hypothesis remains linear in terms of these extended features but can fit non-linear patterns in $x$.
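Written out, the degree-$d$ polynomial hypothesis for a single feature is still a linear model in the weights:
$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d$$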
Balancing complexity and overfitting risks
Higher-degree polynomials can capture complex trends, yet they are prone to overfitting. A polynomial model of very high degree could track noise in the data rather than the underlying relationship. Regularization strategies (e.g., Ridge, Lasso, or ElasticNet) help penalize large coefficients, reducing variance and improving generalization.
Practical implementation hints
In applying linear regression, several practical considerations can make the difference between a robust model and a misleading one:
- Data preprocessing:
  - Handle missing values appropriately (imputation or removal if justified).
  - Address outliers if they are suspected to distort the fit.
  - Use feature scaling (standardization or min-max normalization), especially when using gradient-based optimizers.
- Choosing the right metric:
  - If outliers are critical to capture, MSE might be preferable.
  - If you care about overall consistency in the magnitude of errors, MAE might be a better choice.
  - If relative performance is important (e.g., a 10k difference means different things for small vs. large values), consider MAPE or SMAPE.
- Best practices for model evaluation (see the sketch after this list):
  - Use cross-validation to obtain a more reliable estimate of performance.
  - Examine residual plots to detect patterns in errors (non-linearity, heteroskedasticity, etc.).
  - Compare the model against baseline approaches (e.g., predicting the mean, or a simpler model) using $R^2$ or other relevant statistics.
- Handling multicollinearity:
  - If features are highly correlated, consider dimension reduction techniques (e.g., PCA, introduced later in the course) or apply regularization.
  - Sometimes removing redundant features or combining them in a more meaningful way is sufficient.
- Large-scale data:
  - Instead of the closed-form normal equation, rely on gradient descent or its variants. Libraries like scikit-learn often default to more numerically stable decompositions (such as SVD).
  - For extremely large datasets, consider stochastic gradient methods and frameworks with automatic differentiation (e.g., TensorFlow, PyTorch).
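As a sketch of a few of these practices combined (scaling, cross-validation, and a baseline comparison), assuming scikit-learn is available; the data and the names model and baseline are made up for illustration:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Made-up data: 100 samples, 3 features, linear signal plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + rng.normal(scale=0.1, size=100)

# Scaling + linear regression in one pipeline, evaluated with 5-fold CV (R^2 by default)
model = make_pipeline(StandardScaler(), LinearRegression())
baseline = DummyRegressor(strategy="mean")  # naive mean-predicting baseline

print("Model R^2:", cross_val_score(model, X, y, cv=5).mean())
print("Baseline R^2:", cross_val_score(baseline, X, y, cv=5).mean())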
Example implementations in Python
Below is a simple demonstration of using Python with scikit-learn. This example uses a single feature for simplicity, but it easily extends to multiple features.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Hypothetical dataset
X = np.array([1, 2, 3, 4, 5]).reshape(-1,1)
y = np.array([2.3, 2.9, 3.6, 4.5, 5.1])
# Fit the model
model = LinearRegression()
model.fit(X, y)
# Predictions
y_pred = model.predict(X)
# Evaluate
mse = mean_squared_error(y, y_pred)
print("MSE:", mse)
print("Weights:", model.coef_, "Intercept:", model.intercept_)
# Visualization
plt.scatter(X, y, label="Data")
plt.plot(X, y_pred, color="orange", label="Fitted line")
plt.legend()
plt.show()
By adding polynomial features:
from sklearn.preprocessing import PolynomialFeatures
poly_transform = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_transform.fit_transform(X)
model_poly = LinearRegression()
model_poly.fit(X_poly, y)
y_pred_poly = model_poly.predict(X_poly)
mse_poly = mean_squared_error(y, y_pred_poly)
print("Polynomial MSE:", mse_poly)
If you have more extensive feature sets or large data volumes, consider gradient descent-based methods such as SGDRegressor in scikit-learn or frameworks like TensorFlow and PyTorch for automatic differentiation and more advanced optimization strategies.
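For instance, a minimal SGDRegressor sketch might look like the following; gradient-based optimizers are sensitive to feature scale, so a StandardScaler is included in the pipeline (it reuses the small dataset above purely for illustration, although in practice SGD pays off on much larger data):
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Same toy data as above, purely for illustration
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2.3, 2.9, 3.6, 4.5, 5.1])

# Scaling matters for gradient-based optimizers, hence the StandardScaler in the pipeline
sgd_model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3, random_state=0))
sgd_model.fit(X, y)
print(sgd_model.predict(X))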

[Figure: Linear regression line illustration. Caption: Straight-line fit on a synthetic dataset.]
For multicollinearity or ill-conditioned feature matrices, you can switch to Ridge or Lasso:
from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X, y)
A larger $\alpha$ in Ridge means more shrinkage on coefficients, mitigating large coefficient blow-ups in near-singular systems.
Key takeaways:
- Linear regression, despite its apparent simplicity, provides a critical stepping stone for more advanced machine learning methods.
- Proper understanding of cost functions and error metrics ensures consistent optimization and model evaluation.
- Analytical solutions can be derived neatly, but in practice, we often rely on numerical methods for large-scale or ill-posed problems.
- Polynomial features allow linear regression to capture non-linearities, albeit with caution regarding overfitting.
- Thorough data preprocessing, metric selection, and the right blend of regularization remain essential for robust performance.
This chapter sets the stage for deeper discussions on regularization, model interpretability, and advanced optimization methods. Mastering the fundamentals of linear regression is essential before moving on to more complex machine learning algorithms.