House Price Prediction with Linear Regression: California Housing Dataset (Step-by-Step)

Codeayan Team · Apr 11, 2026

This project builds a multiple linear regression model to predict median house values in California using the well‑known California Housing Dataset. The analysis covers the entire regression workflow—from data loading and exploratory correlation analysis to model evaluation and diagnostic residual plots.

Dataset Source: scikit‑learn – California Housing Dataset

GitHub Repository: View on GitHub – The complete code, visual outputs, and downloadable notebook are available in the repository. Note: Only the code blocks are displayed below; all generated plots and results can be found in the GitHub repo.

About the Dataset

The California Housing Dataset contains aggregated data for 20,640 census block groups in California, derived from the 1990 census. The goal is to predict the median house value (in units of $100,000) for each block group using eight numeric features:

  • MedInc – median income in the block group (in tens of thousands of dollars)
  • HouseAge – median house age in block group
  • AveRooms – average number of rooms per household
  • AveBedrms – average number of bedrooms per household
  • Population – block group population
  • AveOccup – average household occupancy
  • Latitude – block group latitude
  • Longitude – block group longitude
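Before converting anything to a DataFrame, it is worth sanity-checking the raw object that scikit-learn returns. A minimal sketch (note that the first call downloads the data and caches it locally under ~/scikit_learn_data):

```python
from sklearn.datasets import fetch_california_housing

# Fetch the dataset (downloaded once, then cached locally by scikit-learn)
housing = fetch_california_housing()

# The Bunch object exposes the feature names and the raw arrays
print(housing.feature_names)
print(housing.data.shape)    # (20640, 8)  -> 20,640 block groups, 8 features
print(housing.target.shape)  # (20640,)    -> one median house value per block group
```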

Project Objectives

  • Understand the relationship between housing features and median value using correlation analysis.
  • Train a multiple linear regression model using ordinary least squares (OLS).
  • Evaluate model performance with RMSE and R² metrics.
  • Diagnose model assumptions using residual plots.
  • Interpret which features most strongly influence house prices.

Analysis Roadmap

  1. Data Loading & DataFrame Conversion – Fetch the dataset and prepare a Pandas DataFrame.
  2. Correlation Heatmap – Visualise feature correlations and identify the strongest predictor.
  3. Train/Test Split – Split data 80/20 with a fixed random seed for reproducibility.
  4. Model Fitting & Coefficient Interpretation – Train a linear regression model and explain coefficient meanings.
  5. Prediction & Evaluation – Calculate RMSE and R² on the test set.
  6. Actual vs. Predicted Scatter Plot – Assess overall prediction accuracy visually.
  7. Residuals vs. Fitted Values Plot – Check for heteroscedasticity and model bias.
  8. Feature Impact Interpretation – Determine which features have the largest positive and negative effect.

Note: All code is written in Python using scikit‑learn, Pandas, Matplotlib, and Seaborn.



Step 1: Data Loading & DataFrame Conversion

The California Housing dataset is conveniently bundled with scikit‑learn. We use fetch_california_housing() to obtain the data as a dictionary‑like object. The features and target are then converted into a single Pandas DataFrame for easier manipulation and analysis.

What the Code Does

  • fetch_california_housing() returns a Bunch object containing data, target, feature_names, and DESCR.
  • We create a DataFrame df using the data array and assign the feature names as column headers.
  • The target variable MedHouseVal (median house value in $100,000 units) is added as a new column.
  • The first five rows and the shape of the dataset are printed for a quick sanity check.

Code Preview (what follows):
• Import pandas and fetch_california_housing
• Fetch the dataset
• Build DataFrame from data and feature_names
• Add MedHouseVal column from target
• Display df.head() and df.shape

import pandas as pd
from sklearn.datasets import fetch_california_housing

# Fetch the dataset
california_data = fetch_california_housing()

# Convert the feature matrix into a Pandas DataFrame
df = pd.DataFrame(california_data.data, columns=california_data.feature_names)

# Add the target variable (median house value in units of $100,000)
df['MedHouseVal'] = california_data.target

# Display the first few rows and check data info
print(df.head())
print("\nDataset Shape:", df.shape)


Step 2: Correlation Heatmap – Identifying the Strongest Predictor

Before building the regression model, we examine the linear relationships between all numeric features and the target variable MedHouseVal. A Seaborn heatmap of the Pearson correlation matrix provides an intuitive overview.

What the Code Does

  • Compute the correlation matrix using df.corr().
  • Plot a heatmap with sns.heatmap(), annotating each cell with the correlation coefficient.
  • Extract the correlations with MedHouseVal, sort them descending, and print the result.
  • The output clearly shows that MedInc (median income) has the highest positive correlation (~0.69), making it the strongest single predictor.

Code Preview (what follows):
• Import matplotlib.pyplot and seaborn
• Compute corr_matrix = df.corr()
• Draw heatmap with annotations and 'coolwarm' colormap
• Print sorted correlations with target

import matplotlib.pyplot as plt
import seaborn as sns

# Calculate the correlation matrix
corr_matrix = df.corr()

# Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# Draw the heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap of California Housing Features')
plt.show()

# Identify the strongest predictor
target_corr = corr_matrix['MedHouseVal'].sort_values(ascending=False)
print("\nCorrelations with Target (MedHouseVal):")
print(target_corr)
print("\nInsight: 'MedInc' (Median Income) is the strongest predictor of house prices, with a correlation of ~0.69.")


Step 3: Train/Test Split

To evaluate model performance on unseen data, we split the dataset into training and testing subsets. An 80/20 split is used, and a fixed random_state ensures reproducibility.

What the Code Does

  • Separate the DataFrame into features X (all columns except MedHouseVal) and target y.
  • Use train_test_split() with test_size=0.2 and random_state=42.
  • Print the shapes of the resulting training and testing sets to confirm the split.

Code Preview (what follows):
• Import train_test_split from sklearn.model_selection
• Define X = df.drop('MedHouseVal', axis=1), y = df['MedHouseVal']
• Split with train_test_split(X, y, test_size=0.2, random_state=42)
• Print X_train.shape and X_test.shape

from sklearn.model_selection import train_test_split

# Separate features (X) and target (y)
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

# Split the data 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")


Step 4: Model Fitting & Coefficient Interpretation

We instantiate a LinearRegression model and fit it to the training data. The resulting coefficients (weights) quantify the expected change in the target variable for a one‑unit increase in each feature, holding all other features constant.

What the Code Does

  • Create a LinearRegression object and call .fit(X_train, y_train).
  • Extract the intercept and coefficients, storing them in a DataFrame for readability.
  • Sort the coefficients to easily see the most influential features.
  • Print the intercept and a business‑friendly explanation of how to interpret the coefficients.

Code Preview (what follows):
• Import LinearRegression from sklearn.linear_model
• Fit model: model = LinearRegression().fit(X_train, y_train)
• Create coefficient DataFrame and sort
• Print intercept and coefficient explanation

from sklearn.linear_model import LinearRegression

# Initialize and fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Create a DataFrame to view coefficients easily
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values(by='Coefficient', ascending=False)

print("Intercept:", model.intercept_)
print("\nModel Coefficients:")
print(coefficients)

print("\n--- Business Meaning ---")
print("A coefficient represents the expected change in the median house value (in $100k units)")
print("for a 1-unit increase in that feature, assuming all other features remain perfectly constant.")
print("For example, a MedInc coefficient of ~0.44 means that for every $10,000 increase in median")
print("neighborhood income, the median house value increases by ~$44,000.")


Step 5: Prediction & Evaluation (RMSE and R²)

With the model trained, we predict house values on the unseen test set and compute two key regression metrics: Root Mean Squared Error (RMSE) and R‑squared (R²).

Metrics Explained

  • RMSE – measures the average magnitude of prediction errors in the same units as the target (here, $100,000). Lower is better.
  • R² – represents the proportion of variance in the target explained by the model. Values closer to 1 indicate a better fit.
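The formulas behind both metrics are easy to verify by hand. A small sketch on toy numbers (illustrative values only, not model output), computing each metric directly from its definition:

```python
import numpy as np

# Toy actual and predicted values (illustrative only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

# RMSE: square root of the mean of the squared errors
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE = {rmse:.4f}")  # 0.1581
print(f"R²   = {r2:.4f}")    # 0.9800
```

These hand-rolled values match what mean_squared_error and r2_score return, which is a useful way to convince yourself of what the library functions actually compute.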

What the Code Does

  • Generate predictions with model.predict(X_test).
  • Compute RMSE using np.sqrt(mean_squared_error(y_test, y_pred)).
  • Compute R² using r2_score(y_test, y_pred).
  • Print both metrics with human‑readable dollar amounts.

Code Preview (what follows):
• Import numpy, mean_squared_error, r2_score
• Predict: y_pred = model.predict(X_test)
• Calculate and print RMSE and R²

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the unseen test data
y_pred = model.predict(X_test)

# Calculate metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"Root Mean Squared Error (RMSE): {rmse:.4f} (approx ${rmse * 100000:,.0f} error)")
print(f"R-squared (R2) Score: {r2:.4f} ({r2 * 100:.2f}% of variance explained)")


Step 6: Actual vs. Predicted Scatter Plot

A scatter plot of actual versus predicted values offers a visual assessment of model accuracy. Points that lie close to the diagonal y = x line indicate accurate predictions.

What the Code Does

  • Create a scatter plot with y_test on the x‑axis and y_pred on the y‑axis.
  • Overlay a red dashed line representing perfect predictions (y = x).
  • Add axis labels, a title, and a legend for clarity.

Code Preview (what follows):
• Set figure size
• Scatter predictions: plt.scatter(y_test, y_pred, alpha=0.3)
• Plot y = x reference line
• Label axes and show plot

plt.figure(figsize=(8, 6))

# Plot actual vs predicted
plt.scatter(y_test, y_pred, alpha=0.3, color='blue', label='Predictions')

# Plot the perfect prediction y=x line
# The ideal model would have all points falling exactly on this red dashed line
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], color='red', linestyle='--', linewidth=2, label='Perfect Fit (y=x)')

plt.xlabel('Actual Median House Value ($100k)')
plt.ylabel('Predicted Median House Value ($100k)')
plt.title('Actual vs. Predicted House Prices')
plt.legend()
plt.show()


Step 7: Residuals vs. Fitted Values Plot

Residual plots are essential for diagnosing violations of linear regression assumptions. We plot the residuals (actual minus predicted) against the fitted (predicted) values. A random scatter around zero suggests that the assumptions of linearity and equal variance (homoscedasticity) are reasonable.

What the Code Does

  • Calculate residuals: residuals = y_test - y_pred.
  • Scatter y_pred (fitted) vs. residuals.
  • Add a horizontal dashed line at y = 0 for reference.
  • Print a diagnostic note explaining what patterns (e.g., a funnel shape) would indicate heteroscedasticity.

Code Preview (what follows):
• Compute residuals
• Scatter fitted values vs. residuals: plt.scatter(y_pred, residuals, alpha=0.3)
• Add reference line: plt.axhline(y=0, color='red', linestyle='--')
• Label axes and print interpretation guide

# Calculate residuals
residuals = y_test - y_pred

plt.figure(figsize=(8, 6))

# Scatter plot of fitted values (predictions) vs residuals
plt.scatter(y_pred, residuals, alpha=0.3, color='purple')

# Add a horizontal line at 0 (representing zero error)
plt.axhline(y=0, color='red', linestyle='--', linewidth=2)

plt.xlabel('Fitted Values (Predicted Prices)')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Residuals vs. Fitted Values')
plt.show()

print("\n--- Diagnostic Check ---")
print("If the points form a random, evenly spread cloud around the red line, the model is homoscedastic.")
print("If you notice a 'funnel' shape (errors getting wider as predicted prices increase), this indicates")
print("heteroscedasticity, meaning our model struggles more with predicting highly expensive homes.")


Step 8: Feature Impact Interpretation

The raw linear regression coefficients directly reflect the magnitude and direction of each feature’s influence on the predicted house value. We identify which features have the largest positive and negative impacts.

What the Code Does

  • Using the sorted coefficients DataFrame from Step 4, extract the first row (largest positive coefficient) and the last row (largest negative coefficient).
  • Print the feature names and coefficient values.
  • Provide a final interpretation: AveBedrms (average bedrooms) has the largest positive weight, while Longitude and Latitude have notable negative weights, indicating geographic trends in house prices.
  • Discuss the counter‑intuitive negative coefficient for AveRooms (holding bedrooms constant) as a possible multicollinearity effect.

Code Preview (what follows):
• Identify biggest_positive and biggest_negative from the coefficients DataFrame
• Print the results and a business interpretation

# Extract the largest positive and negative impact features based on raw coefficients
biggest_positive = coefficients.iloc[0]
biggest_negative = coefficients.iloc[-1]

print(f"Biggest Positive Impact: {biggest_positive['Feature']} (Coef: {biggest_positive['Coefficient']:.4f})")
print(f"Biggest Negative Impact: {biggest_negative['Feature']} (Coef: {biggest_negative['Coefficient']:.4f})")

print("\n--- Final Interpretation ---")
print(f"Positive Impact: '{biggest_positive['Feature']}' has the largest positive mathematical weight.")
print("In this unscaled model, holding everything else constant, a 1-unit increase in AveBedrms")
print("drastically drives up the predicted price.")
print(f"\nNegative Impact: '{biggest_negative['Feature']}' has the largest negative mathematical weight.")
print("Both Longitude and Latitude carry large negative coefficients, reflecting geographic trends")
print("in California house prices rather than any causal effect of the coordinates themselves.")
print("\nAlso note the counter-intuitive negative coefficient on AveRooms: holding bedrooms and all")
print("else constant, more rooms predicts a lower value. This is most likely an artifact of")
print("multicollinearity between AveRooms and AveBedrms rather than a genuine market effect.")
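Because the coefficients above are unscaled, their magnitudes are not directly comparable across features measured in different units. An optional follow-up, sketched here as a standalone snippet that rebuilds the same split, is to refit on standardized features so that every coefficient is expressed per standard deviation of its feature:

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the same data and 80/20 split as above so the sketch runs standalone
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardizing the features puts all coefficients on a common scale
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

# Sort by absolute magnitude to rank features by influence
scaled_coefs = pd.Series(
    pipe.named_steps['linearregression'].coef_, index=X.columns
).sort_values(key=abs, ascending=False)
print(scaled_coefs)
```

On standardized features, MedInc and the two coordinate features typically dominate, which lines up with the correlation analysis in Step 2 better than the unscaled ranking (where AveBedrms appears inflated by its small numeric range).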