Univariate and Bivariate Analysis
Exploratory Data Analysis becomes more powerful when we study variables at two levels: individually and in relation to each other. Univariate analysis helps us understand one variable at a time, while bivariate analysis helps us understand how two variables move together.
These two techniques help identify distributions, outliers, category imbalance, relationships, trends, and predictors that may be useful for machine learning models.
What is Univariate Analysis?
Univariate analysis means analysing one variable at a time. The goal is to understand the basic behaviour of a single feature or target variable without considering its relationship with other variables.
For example, if we analyse only customer age, monthly income, transaction amount, product category, or churn status individually, we are performing univariate analysis.
Core Idea: Univariate analysis answers the question: “What does this one variable look like?”
What is Bivariate Analysis?
Bivariate analysis means analysing the relationship between two variables. The goal is to understand whether one variable changes when another variable changes.
For example, if we analyse the relationship between house size and house price, income and loan default, advertising spend and sales, or customer complaints and churn, we are performing bivariate analysis.
Core Idea: Bivariate analysis answers the question: “How does one variable behave in relation to another variable?”
Univariate vs Bivariate Analysis
| Aspect | Univariate Analysis | Bivariate Analysis |
|---|---|---|
| Number of Variables | One variable at a time. | Two variables together. |
| Main Question | What does this variable look like? | How are these two variables related? |
| Common Outputs | Distribution, frequency, spread, outliers. | Relationship, trend, comparison, association. |
| Common Visuals | Histogram, box plot, bar chart. | Scatter plot, grouped bar chart, box plot by category, cross-tabulation. |
| Modelling Use | Helps detect data quality issues and preprocessing needs. | Helps identify useful predictors and relationships with the target. |
Simple Difference Between Univariate and Bivariate Analysis
Why These Analyses Matter for Predictive Modelling
Univariate Analysis for Numerical Variables
For numerical variables, univariate analysis focuses on central tendency, spread, distribution shape, skewness, and outliers.
| Check | What to Look For | Recommended Visual | Modelling Action |
|---|---|---|---|
| Central Value | Mean, median, and whether they are far apart. | Summary table. | Decide whether mean or median better represents typical behaviour. |
| Spread | Minimum, maximum, range, standard deviation, IQR. | Box plot. | Check if scaling or outlier treatment is required. |
| Shape | Symmetry, skewness, long tails, multiple peaks. | Histogram. | Apply transformation if distribution is highly skewed. |
| Outliers | Extreme values that are unusually high or low. | Box plot or percentile table. | Investigate whether to remove, cap, transform, or keep. |
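The checks in this table can be computed directly. The following is a minimal sketch using only Python's standard library; the income values are made up for illustration, and the 1.5 × IQR fence is a common rule of thumb rather than something this section prescribes.

```python
import statistics

# Hypothetical monthly incomes (in thousands); the last value is deliberately extreme.
incomes = [32, 35, 38, 40, 41, 43, 45, 48, 52, 250]

# Central value: compare mean and median.
mean = statistics.mean(incomes)
median = statistics.median(incomes)

# Spread: range, standard deviation, and interquartile range (IQR).
value_range = max(incomes) - min(incomes)
stdev = statistics.stdev(incomes)
q1, _, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

# Outliers: flag points more than 1.5 * IQR beyond the quartiles (assumed rule).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in incomes if x < lower or x > upper]

print(f"mean={mean}, median={median}, range={value_range}, stdev={stdev:.1f}")
print(f"IQR={iqr}, fences=({lower}, {upper}), outliers={outliers}")
```

Here the mean (62.4) sits well above the median (42) because of a single extreme value, which is the kind of gap the central value check is meant to surface.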
Univariate Analysis for Categorical Variables
For categorical variables, univariate analysis focuses on frequency counts, category percentages, dominant categories, rare categories, and class imbalance.
| Check | What to Look For | Recommended Visual | Modelling Action |
|---|---|---|---|
| Frequency Count | Number of records in each category. | Bar chart. | Understand category distribution. |
| Dominant Category | One category appearing much more than others. | Bar chart or frequency table. | Check whether feature has low information value. |
| Rare Categories | Categories with very few observations. | Frequency table. | Group rare categories into “Other” before encoding. |
| Target Balance | Whether target classes are balanced or imbalanced. | Bar chart. | Use stratified splitting and appropriate metrics. |
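As a rough illustration of the frequency and rare-category checks, the sketch below counts categories, converts the counts to shares, and folds rare categories into “Other”. The data and the 10% threshold are assumptions chosen for the example, not recommended values.

```python
from collections import Counter

# Hypothetical product categories for 20 orders.
categories = (
    ["Electronics"] * 10 + ["Clothing"] * 7 + ["Toys", "Garden", "Books"]
)

# Frequency count and percentage share of each category.
counts = Counter(categories)
total = len(categories)
shares = {cat: n / total for cat, n in counts.items()}

# Assumed threshold: categories below 10% of records are treated as rare.
RARE_THRESHOLD = 0.10
grouped = [cat if shares[cat] >= RARE_THRESHOLD else "Other" for cat in categories]
grouped_counts = Counter(grouped)

for cat, n in counts.most_common():
    print(f"{cat:12s} {n:3d}  {shares[cat]:.0%}")
print(dict(grouped_counts))
```

The same counting approach applies to a classification target: a heavily lopsided share table is the signal to consider stratified splitting.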
Bivariate Analysis: Choosing the Right Technique
The best bivariate analysis method depends on the data types of the two variables being compared. Numerical-to-numerical relationships are different from categorical-to-numerical or categorical-to-categorical relationships.
Bivariate Analysis Matrix
- Numerical vs Numerical: Use scatter plots, correlation, trend lines, and pair plots. Example: House area vs. house price.
- Categorical vs Numerical: Use grouped summaries, box plots, violin plots, and bar charts of averages. Example: Product category vs. sales amount.
- Categorical vs Categorical: Use cross-tabulation, stacked bar charts, and proportion tables. Example: Plan type vs. churn status.
- Time vs Numerical: Use line charts, rolling averages, seasonal plots, and trend analysis. Example: Month vs. sales revenue.
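The numerical and categorical pairings are covered in the sections that follow. For the time-based case, a rolling average is one simple way to see the trend behind noisy monthly figures. This is a minimal sketch with invented sales numbers and an assumed window of three months.

```python
# Hypothetical monthly sales revenue for eight consecutive months.
monthly_sales = [100, 120, 90, 130, 150, 110, 160, 170]

# Assumed window size: average each month with the two months before it.
WINDOW = 3
rolling = [
    sum(monthly_sales[i - WINDOW + 1 : i + 1]) / WINDOW
    for i in range(WINDOW - 1, len(monthly_sales))
]

for month, value in enumerate(rolling, start=WINDOW):
    print(f"month {month}: {value:.1f}")
```

The first two months have no rolling value because a full window is not yet available, so the smoothed series is shorter than the original.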
Numerical vs Numerical Analysis
When both variables are numerical, bivariate analysis helps identify whether they move together, move opposite to each other, or show no clear relationship.
| Analysis Method | What It Shows | Example | Modelling Insight |
|---|---|---|---|
| Scatter Plot | Pattern between two continuous variables. | Advertising spend vs. sales. | Shows linear, non-linear, clustered, or outlier patterns. |
| Correlation | Strength and direction of linear relationship. | Income vs. credit limit. | Helps detect useful predictors and multicollinearity. |
| Trend Line | Average direction of relationship. | Property size vs. price. | Shows whether relationship may be linear or non-linear. |
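To make the correlation row concrete, the sketch below computes the Pearson correlation coefficient by hand for the advertising spend vs. sales example. The numbers are invented, and `pearson` is a helper written for this illustration rather than a library function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical advertising spend and sales figures.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 62, 79, 101]

r = pearson(ad_spend, sales)
print(f"Pearson r = {r:.3f}")
```

A value close to +1 indicates a strong positive linear relationship; the coefficient says nothing about whether spend causes sales.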
Categorical vs Numerical Analysis
When one variable is categorical and the other is numerical, we compare the distribution or average value of the numerical variable across categories.
For example, we may compare average monthly spending across customer segments or compare house prices across different locations.
| Analysis Method | What It Shows | Example | Modelling Insight |
|---|---|---|---|
| Grouped Mean / Median | Average numerical value by category. | Average spend by customer segment. | Shows which categories have higher or lower outcomes. |
| Box Plot by Category | Distribution of numerical variable across groups. | Salary distribution by department. | Shows spread, outliers, and group differences. |
| Bar Chart of Averages | Comparison of average values between categories. | Average order value by region. | Helps identify categories with predictive value. |
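A grouped summary of this kind can be sketched as follows. The segment names and spend values are hypothetical, and the example deliberately includes one extreme value to show why the median is worth reporting next to the mean.

```python
import statistics
from collections import defaultdict

# Hypothetical (customer segment, monthly spend) records.
records = [
    ("Premium", 120), ("Premium", 150), ("Premium", 135),
    ("Standard", 60), ("Standard", 75), ("Standard", 300),
    ("Basic", 30), ("Basic", 40),
]

# Collect the numerical values under each category.
by_segment = defaultdict(list)
for segment, spend in records:
    by_segment[segment].append(spend)

# Summarise each group with its size, mean, and median.
summary = {
    segment: {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }
    for segment, values in by_segment.items()
}

for segment, stats in summary.items():
    print(segment, stats)
```

In this made-up data the Standard segment has the highest mean (145) but a much lower median (75), because a single large value pulls the mean upward.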
Categorical vs Categorical Analysis
When both variables are categorical, bivariate analysis focuses on frequency combinations and proportions. This is especially useful in classification problems.
For example, in customer churn prediction, we may analyse whether churn rate differs across plan types, regions, payment methods, or complaint categories.
| Analysis Method | What It Shows | Example | Modelling Insight |
|---|---|---|---|
| Cross-Tabulation | Counts for combinations of two categories. | Plan type vs. churn status. | Shows whether categories are associated with the target. |
| Proportion Table | Percentage distribution within categories. | Churn rate by region. | Helps compare groups fairly even when group sizes differ. |
| Stacked Bar Chart | Visual comparison of category proportions. | Payment method vs. repeat purchase. | Highlights group-level differences in outcome behaviour. |
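The cross-tabulation and proportion table can be sketched with simple counting. The plan types, churn labels, and counts below are invented for illustration.

```python
from collections import Counter

# Hypothetical (plan type, churned) pairs for 150 customers.
customers = (
    [("Monthly", "Yes")] * 30 + [("Monthly", "No")] * 70
    + [("Annual", "Yes")] * 5 + [("Annual", "No")] * 45
)

# Cross-tabulation: counts for every (plan, churn) combination.
crosstab = Counter(customers)

# Proportion table: churn rate within each plan type.
plan_totals = Counter(plan for plan, _ in customers)
churn_rate = {
    plan: crosstab[(plan, "Yes")] / plan_totals[plan] for plan in plan_totals
}

for plan in plan_totals:
    print(plan, crosstab[(plan, "Yes")], crosstab[(plan, "No")], f"{churn_rate[plan]:.0%}")
```

Raw counts alone would show 30 churners on monthly plans against 5 on annual plans; the within-group rates (30% against 10%) account for the different group sizes.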
Target-Based Bivariate Analysis
In predictive modelling, one of the most important uses of bivariate analysis is studying each feature against the target variable. This helps identify which features may be useful predictors.
Feature-to-Target Analysis Workflow
Example: Customer Churn Analysis
Business Problem
A telecom company wants to predict customer churn. Before building the model, analysts perform univariate and bivariate analysis to understand customer behaviour.
| Analysis Type | Variable or Relationship | Finding | Modelling Decision |
|---|---|---|---|
| Univariate | Monthly charges | Distribution is right-skewed with a few high-value customers. | Check outliers and consider transformation if needed. |
| Univariate | Contract type | Most customers are on monthly contracts. | One-hot encode contract type and check target relationship. |
| Bivariate | Contract type vs. churn | Monthly contract customers have much higher churn rate. | Contract type is likely an important predictor. |
| Bivariate | Tenure vs. churn | New customers churn more frequently than long-term customers. | Create tenure groups or use tenure as a strong feature. |
| Bivariate | Support tickets vs. churn | Customers with repeated complaints show higher churn risk. | Create complaint frequency feature. |
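The tenure vs. churn row suggests creating tenure groups. One possible way to do this is sketched below; the customer records and the bin edges (12 and 24 months) are assumptions made for the example and would need to be chosen from the real data.

```python
from collections import defaultdict

# Hypothetical (tenure in months, churned flag) pairs; 1 means the customer churned.
customers = [
    (1, 1), (3, 1), (5, 0), (8, 1),
    (14, 0), (20, 1), (22, 0),
    (30, 0), (40, 0), (55, 0),
]

def tenure_group(months):
    """Assign a tenure band using assumed cut-offs at 12 and 24 months."""
    if months < 12:
        return "0-11"
    if months < 24:
        return "12-23"
    return "24+"

# Count customers and churners per tenure band.
totals = defaultdict(int)
churned = defaultdict(int)
for months, flag in customers:
    group = tenure_group(months)
    totals[group] += 1
    churned[group] += flag

# Churn rate per band: a feature-to-target summary.
rates = {group: churned[group] / totals[group] for group in totals}
print(rates)
```

In this invented sample the churn rate falls from 75% in the newest band to 0% in the longest-tenure band, which is the pattern described in the table.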
Example: House Price Analysis
Regression Problem
A real estate company wants to predict house prices. Univariate analysis helps understand individual variables, while bivariate analysis helps understand what drives price.
- Univariate: Analyse distribution of price, area, rooms, property age, and location.
- Bivariate: Analyse area vs. price, location vs. price, rooms vs. price, and property age vs. price.
- Insight: If area has a strong positive relationship with price, it becomes an important model feature.
- Insight: If price differs strongly by location, location encoding becomes important.
Common Patterns Found During Bivariate Analysis
| Pattern | Meaning | Possible Modelling Action |
|---|---|---|
| Positive Relationship | As one variable increases, the other also increases. | Use as predictor; consider linear relationship. |
| Negative Relationship | As one variable increases, the other decreases. | Use as predictor; check business interpretation. |
| No Clear Relationship | Variables do not show visible association. | Feature may have weak individual predictive power. |
| Non-Linear Relationship | Relationship changes direction or shape. | Use transformations, bins, or tree-based models. |
| Group Difference | Numerical outcome differs across categories. | Encode category carefully and consider interaction features. |
| Outlier Relationship | Some points behave very differently from the pattern. | Investigate outliers and decide whether to treat them. |
Common Mistakes to Avoid
| Mistake | Why It Is Harmful | Better Approach |
|---|---|---|
| Skipping univariate analysis | Data quality issues, skewness, and outliers may remain hidden. | Analyse every important variable individually first. |
| Using only correlation | The Pearson correlation coefficient measures linear association only, so it can miss non-linear patterns. | Use scatter plots and grouped summaries along with correlation. |
| Ignoring categorical relationships | Important category-level patterns may be missed. | Use cross-tabs, stacked bars, and group-wise target rates. |
| Confusing association with causation | A relationship between two variables does not prove one causes the other. | Interpret relationships carefully and validate with business logic. |
| Not analysing features against target | Important predictive signals may remain undiscovered. | Perform feature-to-target analysis for every meaningful feature. |
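The “using only correlation” mistake is easy to demonstrate. In the sketch below, `y` is completely determined by `x`, yet the Pearson coefficient is zero because the relationship is symmetric rather than linear. The data are synthetic and `pearson` is a helper written for this illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Synthetic U-shaped relationship: y depends entirely on x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r = pearson(xs, ys)
print(f"Pearson r = {r:.3f}")
```

A scatter plot of these points would show the curve immediately, which is why visual checks belong alongside the coefficient.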
Best Practices for Univariate and Bivariate Analysis
Analysis Checklist
- Start with univariate analysis: Understand each variable before studying relationships.
- Separate numerical and categorical variables: Use different summaries and visuals for each type.
- Analyse the target variable carefully: Check class imbalance, skewness, and unusual values.
- Use bivariate analysis with the target: Identify features that may have predictive value.
- Use the right chart: Histograms for numerical distributions, bar charts for categories, scatter plots for numerical relationships.
- Compare groups carefully: Use proportions, medians, and distributions, not only raw counts.
- Look for non-linear relationships: Not all predictive patterns are straight-line relationships.
- Connect findings to feature engineering: Convert EDA insights into useful model inputs.
- Validate with business logic: Statistical patterns should make practical sense.
How This Analysis Improves Predictive Models
Univariate analysis improves modelling by revealing data quality issues, outliers, skewness, missingness, and imbalance. Bivariate analysis improves modelling by revealing relationships, target patterns, useful features, and possible transformations.
Together, these methods help analysts move from raw data to modelling strategy. They guide decisions about encoding, scaling, transformations, feature selection, feature engineering, and model evaluation.
Practical Rule: Do not start modelling before asking two questions: “What does each variable look like?” and “How does each important variable relate to the target?”
Key Takeaways
- Univariate analysis studies one variable at a time.
- Bivariate analysis studies the relationship between two variables.
- Univariate analysis helps detect distributions, outliers, rare categories, and imbalance.
- Bivariate analysis helps identify relationships and useful predictors.
- Numerical, categorical, and time-based variables require different analysis techniques.
- Feature-to-target analysis is especially important for predictive modelling.
- EDA findings should guide preprocessing, feature engineering, model selection, and evaluation strategy.
- Strong predictive modelling begins with careful univariate and bivariate analysis.