Linear Regression and Its Assumptions
Linear regression is one of the most important and widely used algorithms in predictive modelling. It is used to predict a continuous numerical outcome by learning a straight-line relationship between input features and the target variable.
Because linear regression is simple, interpretable, and mathematically clear, it is often used as a baseline model for regression problems and as a foundation for understanding more advanced machine learning techniques.
What is Linear Regression?
Linear regression is a supervised learning algorithm used for predicting numerical values. It assumes that the target variable can be explained as a linear combination of one or more input variables.
For example, a real estate company may use linear regression to predict house price using house area, number of rooms, location score, property age, and distance from city centre.
Core Idea: Linear regression tries to fit the best possible straight line, or linear equation, that explains the relationship between input features and a numerical target variable.
Simple Linear Regression
Simple linear regression uses one input variable to predict one continuous target variable. It tries to fit a straight line through the data points.
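As a minimal sketch of the idea, the slope and intercept of the best-fit line can be computed directly from the data. The function name `fit_simple_linear_regression` and the area and price figures below are made up for illustration; the points were chosen to lie exactly on a line so the result is easy to check.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = co-variation of x and y divided by the variation of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data: area in sq. ft. and price in ₹ lakh
area = [800, 1000, 1200, 1500, 1800]
price = [42, 50, 58, 70, 82]
intercept, slope = fit_simple_linear_regression(area, price)
```

For this toy data the fitted line is approximately price = 10 + 0.04 × area, because the points were constructed from that line.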
[Figure: Visual idea of linear regression]
Multiple Linear Regression
Multiple linear regression uses more than one input variable to predict the target. This is more common in real-world predictive modelling because outcomes are usually influenced by many factors.
For example, house price may be predicted using area, location score, number of bedrooms, property age, floor number, and distance from metro station.
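The sketch below shows one way a multiple linear regression could be fitted with NumPy's least-squares solver. The data is synthetic, and the "true" coefficients, intercept, and noise level are assumptions chosen only so that the recovered coefficients can be compared against known values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical features: area (sq. ft.), property age (years), location score (1-10)
area = rng.uniform(500, 2500, n)
age = rng.uniform(0, 30, n)
location = rng.uniform(1, 10, n)

# Synthetic target built from assumed "true" coefficients plus random noise
true_coefs = np.array([4000.0, -75000.0, 250000.0])
true_intercept = 1_000_000.0
X = np.column_stack([area, age, location])
y = true_intercept + X @ true_coefs + rng.normal(0, 50_000, n)

# Add a column of ones so the intercept is estimated together with the coefficients
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept_hat, coefs_hat = beta[0], beta[1:]
```

With enough data and modest noise, `coefs_hat` should land close to `true_coefs`, which is what makes each coefficient readable as the effect of one feature with the others held constant.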
Key Terms in Linear Regression
| Term | Meaning | Example Interpretation |
|---|---|---|
| Target Variable | The numerical outcome we want to predict. | House price, sales revenue, delivery time, customer spend. |
| Feature / Predictor | The input variable used to predict the target. | House area, number of rooms, customer income. |
| Intercept | The predicted value of the target when all input features are zero. | Baseline prediction before adding feature effects. |
| Coefficient | The effect of one feature on the target, holding other features constant. | If area coefficient is 4,000, each extra sq. ft. adds ₹4,000 to predicted price. |
| Residual | Difference between actual value and predicted value. | If actual price is ₹80 lakh and predicted price is ₹75 lakh, residual is ₹5 lakh. |
| Error Term | Unexplained variation not captured by the model. | Market sentiment, negotiation effect, or unrecorded property quality. |
How Linear Regression Learns
Linear regression finds the line or equation that minimizes the difference between actual values and predicted values. These differences are called residuals.
The most common method is Ordinary Least Squares, which chooses coefficients that minimize the sum of squared residuals.
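In symbols, the Ordinary Least Squares objective can be written as follows. The notation is an assumption introduced here for illustration: $n$ observations, $p$ features, $X$ the feature matrix with a leading column of ones for the intercept, and $y$ the vector of actual values.

```latex
\begin{align}
\hat{y}_i &= \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \\
e_i &= y_i - \hat{y}_i, \\
\hat{\boldsymbol{\beta}} &= \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} e_i^2
  = (X^\top X)^{-1} X^\top y .
\end{align}
```

The closed-form expression on the last line holds only when $X^\top X$ is invertible, which fails when one feature is an exact linear combination of the others.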
[Figure: Linear regression training process]
Why Linear Regression is Useful
Linear Regression Assumptions
Linear regression works best when certain assumptions are reasonably satisfied. These assumptions help ensure that coefficients, predictions, and statistical interpretations are reliable.
| Assumption | Meaning | How to Check | What to Do If Violated |
|---|---|---|---|
| Linearity | Relationship between features and target should be approximately linear. | Scatter plots, residual plots. | Use transformations, polynomial features, or non-linear models. |
| Independence of Errors | Residuals should not be correlated with each other. | Check time order, residual autocorrelation, Durbin-Watson test. | Use time-series methods or add lag features. |
| Homoscedasticity | Residual variance should be roughly constant across prediction levels. | Residuals vs fitted values plot. | Transform target, use weighted regression, or robust errors. |
| Normality of Residuals | Residuals should be approximately normally distributed for inference. | Histogram or Q-Q plot of residuals. | Transform variables, check outliers, or use robust methods. |
| No Strong Multicollinearity | Input features should not be highly correlated with each other. | Correlation matrix, VIF. | Remove redundant features, combine variables, or use regularization. |
| No Extreme Influential Outliers | A few extreme points should not dominate the fitted line. | Box plots, residual plots, leverage, Cook’s distance. | Investigate, cap, transform, remove if erroneous, or use robust regression. |
Assumption 1: Linearity
Linear regression assumes that the relationship between each predictor and the target is approximately linear. This means a straight-line pattern should reasonably describe the relationship.
If the relationship is curved, linear regression may underfit the data and produce biased predictions.
Signs that linearity holds:
- Scatter plot shows a roughly straight-line pattern.
- Residuals are randomly scattered around zero.
- Feature effect is stable across the range.
Signs that linearity may be violated:
- Scatter plot shows a curve or U-shape.
- Residual plot shows a clear pattern.
- Predictions are poor at low or high values.
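A rough numerical version of the residual check is sketched below. It fits a straight line to data that was deliberately generated from a curved (quadratic) relationship, then compares average residuals at the low, middle, and high ends of the feature range. The data and the three-way split are illustrative assumptions, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
# Hypothetical curved relationship: the target depends on x squared
y = 2.0 + 0.5 * x**2 + rng.normal(0, 1.0, x.size)

# Fit a straight line (degree-1 polynomial) and compute residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Average residual in the lowest, middle, and highest parts of the x range
low = residuals[:20].mean()
mid = residuals[40:60].mean()
high = residuals[80:].mean()
```

Here the straight line under-predicts at both ends and over-predicts in the middle, so `low` and `high` come out positive while `mid` is negative. That U-shaped residual pattern is the kind of signal that suggests a transformation or polynomial term.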
Assumption 2: Independence of Errors
The residuals should be independent of each other. This is especially important when data is collected over time or when observations are grouped by customer, store, region, or machine.
For example, monthly sales errors may be correlated across time because sales in one month are related to sales in previous months. In such cases, a standard linear regression model may not be enough.
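The Durbin-Watson statistic is one common check for this kind of first-order autocorrelation. The sketch below computes it from its definition rather than from a library, and compares independent residuals with artificially autocorrelated ones. Both residual series are simulated, and the carry-over factor of 0.8 is an arbitrary assumption.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in their time order.

    Values near 2 suggest little first-order autocorrelation; values toward 0
    suggest positive autocorrelation, and values toward 4 suggest negative.
    """
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(2)
independent = rng.normal(0, 1, 500)

# Hypothetical autocorrelated residuals: each value carries over 80% of the previous one
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = 0.8 * correlated[t - 1] + rng.normal(0, 1)

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
```

The independent series should give a value close to 2, while the autocorrelated series should fall well below it.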
Assumption 3: Homoscedasticity
Homoscedasticity means that the spread of residuals should be approximately constant across all levels of predicted values. In simple words, the model should not make very small errors for low values and very large errors for high values.
If residual spread increases or decreases systematically, the problem is called heteroscedasticity.
Example: In house price prediction, errors may be much larger for luxury houses than for affordable houses. This creates unequal error variance and may require transformation or a different modelling approach.
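One informal way to look for this pattern without a plot is to compare residual spread between low and high fitted values. The sketch below simulates prices whose noise grows with house area, which is an assumption made purely to produce heteroscedastic data. It is a quick diagnostic, not a formal statistical test.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
area = rng.uniform(500, 3000, n)
# Hypothetical prices (₹ lakh) whose noise grows with area: larger houses, larger errors
price = 10 + 0.04 * area + rng.normal(0, 1, n) * (area / 500)

slope, intercept = np.polyfit(area, price, deg=1)
fitted = intercept + slope * area
residuals = price - fitted

# Compare residual spread for the lower and upper halves of the fitted values
order = np.argsort(fitted)
spread_low = residuals[order[: n // 2]].std()
spread_high = residuals[order[n // 2:]].std()
ratio = spread_high / spread_low
```

A `ratio` well above 1 indicates that errors are wider for expensive houses than for cheaper ones, which is the unequal-variance situation described above.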
Assumption 4: Normality of Residuals
Linear regression assumes residuals are approximately normally distributed when we want to make statistical inferences such as confidence intervals and hypothesis tests.
For prediction alone, slight non-normality is often less serious than strong non-linearity, leakage, outliers, or heteroscedasticity. However, extremely non-normal residuals may indicate missing patterns or poor model fit.
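Alongside a histogram or Q-Q plot, skewness and excess kurtosis give a quick numeric summary of how far residuals depart from a normal shape. The helper name `skewness_and_excess_kurtosis` and both simulated residual sets below are illustrative assumptions.

```python
import numpy as np

def skewness_and_excess_kurtosis(residuals):
    """Return (skewness, excess kurtosis); both are near 0 for normal data."""
    e = np.asarray(residuals, dtype=float)
    z = (e - e.mean()) / e.std()
    return np.mean(z ** 3), np.mean(z ** 4) - 3.0

rng = np.random.default_rng(4)
# Simulated residuals: one roughly normal set, one strongly right-skewed set
normal_resid = rng.normal(0, 1, 2000)
skewed_resid = rng.exponential(1.0, 2000) - 1.0

skew_normal, kurt_normal = skewness_and_excess_kurtosis(normal_resid)
skew_skewed, kurt_skewed = skewness_and_excess_kurtosis(skewed_resid)
```

Values far from zero, as with the skewed set, are a prompt to look for outliers, a missing transformation, or a missing pattern in the model.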
Assumption 5: No Strong Multicollinearity
Multicollinearity occurs when input features are highly correlated with each other. It can make coefficients unstable and difficult to interpret.
For example, house area and number of rooms may be highly correlated. If both are included, the model may struggle to assign separate effects clearly.
Practical Insight: Multicollinearity may not always destroy prediction accuracy, but it can seriously reduce coefficient interpretability.
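The Variance Inflation Factor (VIF) for a feature can be computed by regressing that feature on all the others and taking 1 / (1 - R²). The sketch below does this with NumPy only. The function name and the synthetic area, rooms, and age data are assumptions, with the number of rooms deliberately generated from area so that the two are highly correlated.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (features only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Regress feature j on an intercept plus all remaining features
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(6)
n = 300
area = rng.uniform(500, 2500, n)
rooms = area / 400 + rng.normal(0, 0.3, n)   # hypothetical: rooms closely track area
age = rng.uniform(0, 30, n)                  # hypothetical: unrelated to the others

vifs = variance_inflation_factors(np.column_stack([area, rooms, age]))
```

In this setup the VIFs for area and rooms come out high while the VIF for age stays close to 1, matching the intuition that only the first two features are redundant with each other.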
Assumption 6: No Extreme Influential Outliers
Outliers can strongly influence a linear regression line because the model minimizes squared errors. A few extreme observations can pull the line toward themselves and distort the coefficients.
Outliers should be investigated before treatment. They may be data errors, rare valid events, or important business signals.
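Cook's distance combines a point's residual with its leverage to measure how much the fitted model depends on that single observation. The following is a minimal NumPy sketch of the standard formula; the function name and the simulated data, including one deliberately planted extreme point, are assumptions for illustration.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an OLS fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    p = design.shape[1]  # number of estimated parameters, including the intercept
    # The diagonal of the hat matrix gives each observation's leverage
    hat = design @ np.linalg.pinv(design.T @ design) @ design.T
    leverage = np.diag(hat)
    residuals = y - hat @ y
    mse = np.sum(residuals ** 2) / (n - p)
    return (residuals ** 2 / (p * mse)) * leverage / (1.0 - leverage) ** 2

rng = np.random.default_rng(7)
x_clean = np.linspace(0, 10, 30)
y_clean = 3 + 2 * x_clean + rng.normal(0, 1, 30)

# Plant one hypothetical extreme point far from both the data and the trend
x = np.append(x_clean, 20.0)
y = np.append(y_clean, 10.0)

distances = cooks_distance(x, y)
most_influential = int(np.argmax(distances))
```

The planted point (index 30) should have by far the largest distance. A large value is a reason to investigate the observation, not an automatic reason to delete it.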
Common Diagnostics for Linear Regression
| Diagnostic Tool | Used To Check | What to Look For |
|---|---|---|
| Scatter Plot | Linearity between feature and target. | Roughly straight relationship. |
| Residual Plot | Linearity and homoscedasticity. | Random scatter around zero with no clear pattern. |
| Histogram of Residuals | Residual normality. | Approximately bell-shaped distribution. |
| Q-Q Plot | Residual normality. | Points approximately following a straight diagonal line. |
| Correlation Matrix | Multicollinearity. | Very high correlations between predictors. |
| VIF | Multicollinearity severity. | High VIF values may indicate redundant predictors. |
Model Evaluation Metrics for Linear Regression
Linear regression is evaluated using regression metrics that compare actual values with predicted values.
| Metric | Meaning | Interpretation |
|---|---|---|
| MAE (Mean Absolute Error) | Average absolute difference between actual and predicted values. | Easy to understand in original units. |
| MSE (Mean Squared Error) | Average squared prediction error. | Penalizes large errors more strongly. |
| RMSE (Root Mean Squared Error) | Square root of MSE. | Error in original units, sensitive to large errors. |
| R² (Coefficient of Determination) | Proportion of variation in target explained by the model. | Higher is generally better, but must be interpreted carefully. |
| Adjusted R² | R² adjusted for number of predictors. | Useful when comparing models with different numbers of features. |
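These metrics can be computed in a few lines. The sketch below uses plain Python; the function name `regression_metrics` and the actual and predicted prices are made up for illustration, and `n_features` is the number of predictors assumed to be in the model.

```python
import math

def regression_metrics(actual, predicted, n_features):
    """Return MAE, MSE, RMSE, R² and adjusted R² for paired values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_resid = sum(e ** 2 for e in errors)
    r2 = 1 - ss_resid / ss_total
    # Adjusted R² penalizes R² for the number of predictors used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "Adjusted R2": adj_r2}

# Hypothetical actual and predicted prices in ₹ lakh
actual = [80, 60, 95, 70, 110]
predicted = [75, 62, 90, 74, 104]
metrics = regression_metrics(actual, predicted, n_features=2)
```

For these five points the MAE is ₹4.4 lakh and the MSE is 21.2 (in squared lakh). Adjusted R² is lower than R² because the sample is tiny relative to the number of predictors.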
Example: House Price Prediction
Business Problem
A real estate company wants to predict house prices using area, number of bedrooms, property age, distance from city centre, and location score.
| Feature | Possible Coefficient | Business Interpretation |
|---|---|---|
| Area | +4,000 | Each additional sq. ft. is associated with ₹4,000 higher predicted price, holding other variables constant. |
| Property Age | -75,000 | Each additional year of age is associated with ₹75,000 lower predicted price, holding other variables constant. |
| Location Score | +2,50,000 | Each one-point increase in location score is associated with ₹2.5 lakh higher predicted price, holding other variables constant. |
| Distance from City Centre | -1,20,000 | Each additional kilometre from the city centre is associated with ₹1.2 lakh lower predicted price, holding other variables constant. |
These coefficients are useful because they make the model explainable to business users, not just predictive.
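To make the interpretation concrete, the sketch below uses the four coefficients from the table to compare two houses. The houses themselves are hypothetical. No intercept is given in the example, but it cancels out when taking the difference between two predictions, so none is needed here.

```python
# Coefficients from the table above (₹ per one-unit change in each feature)
coefficients = {
    "area_sqft": 4_000,
    "property_age_years": -75_000,
    "location_score": 250_000,
    "distance_km": -120_000,
}

def predicted_price_difference(house_a, house_b):
    """Predicted price of house_a minus house_b; the intercept cancels out."""
    return sum(coefficients[k] * (house_a[k] - house_b[k]) for k in coefficients)

# Two hypothetical houses
house_a = {"area_sqft": 1200, "property_age_years": 5, "location_score": 8, "distance_km": 6}
house_b = {"area_sqft": 1000, "property_age_years": 10, "location_score": 7, "distance_km": 4}

difference = predicted_price_difference(house_a, house_b)
```

House A is predicted to cost ₹11.85 lakh more than house B: ₹8 lakh for the extra 200 sq. ft., ₹3.75 lakh for being 5 years newer, ₹2.5 lakh for the higher location score, minus ₹2.4 lakh for being 2 km further out.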
When Linear Regression Works Well
Suitable conditions:
- Target variable is continuous.
- Relationships are approximately linear.
- Interpretability is important.
- Dataset is clean and well-prepared.
- Business wants coefficient-level explanation.
Typical use cases:
- House price prediction.
- Sales forecasting baseline.
- Demand estimation.
- Marketing spend impact analysis.
- Cost and revenue prediction.
When Linear Regression May Not Work Well
Warning signs:
- Relationship is strongly non-linear.
- Data has many extreme outliers.
- Important feature interactions are missing.
- Residuals show strong patterns.
- Target variable is categorical.
Alternatives to consider:
- Polynomial regression.
- Decision trees.
- Random forest regression.
- Gradient boosting regression.
- Regularized regression such as Ridge or Lasso.
Common Mistakes in Linear Regression
| Mistake | Why It Is Harmful | Better Approach |
|---|---|---|
| Using linear regression for categorical target | Linear regression predicts continuous values, not classes. | Use logistic regression or classification models for categorical targets. |
| Ignoring non-linearity | Model may underfit and produce biased predictions. | Use transformations, interaction terms, polynomial features, or non-linear models. |
| Ignoring outliers | Extreme points can heavily influence coefficients. | Investigate outliers and use robust methods if needed. |
| Interpreting correlation as causation | Coefficient relationships do not automatically prove cause and effect. | Use business logic, experiments, or causal methods for causal claims. |
| Not checking multicollinearity | Coefficients may become unstable and misleading. | Use correlation checks, VIF, feature selection, or regularization. |
Best Practices for Linear Regression
Linear Regression Checklist
- Use it for continuous targets: Linear regression is designed for numerical prediction.
- Start with EDA: Check distributions, outliers, missing values, and relationships.
- Check linearity: Use scatter plots and residual plots.
- Inspect residuals: Residuals should not show strong patterns.
- Check multicollinearity: Use correlation matrix or VIF.
- Handle outliers carefully: Investigate before removing or capping.
- Use a train-validation-test split: Evaluate generalization, not memorization.
- Interpret coefficients carefully: Coefficients show association, not automatic causation.
- Compare with other models: Use linear regression as a baseline before trying complex models.
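The last few checklist items can be sketched together: hold out part of the data, fit on the rest, and compare the model with a naive baseline. For brevity this example uses a single train-test split rather than a full train-validation-test split, and the data, coefficients, and 80/20 ratio are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
# Synthetic data with two hypothetical features and known linear structure
X = rng.uniform(0, 10, (n, 2))
y = 5 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, n)

# Shuffle the row indices, then hold out the last 20% as a test set
idx = rng.permutation(n)
split = int(0.8 * n)
train, test = idx[:split], idx[split:]

def add_intercept(features):
    """Prepend a column of ones so the intercept is estimated too."""
    return np.column_stack([np.ones(len(features)), features])

# Fit on the training rows only, then predict the unseen test rows
beta, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)
test_pred = add_intercept(X[test]) @ beta

rmse_model = np.sqrt(np.mean((y[test] - test_pred) ** 2))
# Naive baseline: always predict the mean of the training target
rmse_baseline = np.sqrt(np.mean((y[test] - y[train].mean()) ** 2))
```

If `rmse_model` is not clearly lower than `rmse_baseline` on the held-out rows, the features are adding little, and any more complex model should be judged against the same split.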
Why Linear Regression Remains Important
Even though many advanced machine learning models exist, linear regression remains important because it is simple, fast, transparent, and easy to explain. It helps analysts understand the basic relationship between variables and provides a strong foundation for predictive modelling.
In many business situations, interpretability is as important as accuracy. Linear regression is especially useful when stakeholders want to know not only what the prediction is, but also why the model made that prediction.
Practical Insight: Linear regression is often the first model to build in a regression problem. Even if a more advanced model performs better later, linear regression provides an interpretable benchmark.
Key Takeaways
- Linear regression predicts continuous numerical outcomes.
- Simple linear regression uses one predictor; multiple linear regression uses many predictors.
- Coefficients show the expected change in the target for a one-unit change in a feature, holding other features constant.
- Residuals are the differences between actual and predicted values.
- Major assumptions include linearity, independence of errors, homoscedasticity, normal residuals, no strong multicollinearity, and no extreme influential outliers.
- Residual plots, scatter plots, Q-Q plots, correlation matrices, and VIF help diagnose problems.
- Linear regression is interpretable, fast, and useful as a baseline model.
- It should be used carefully when relationships are non-linear, outliers are extreme, or assumptions are strongly violated.