This project performs a comprehensive Exploratory Data Analysis (EDA) on the Olist Brazilian E‑Commerce Public Dataset. Olist is the largest department store in Brazilian marketplaces, and this dataset contains information on 100,000 orders made between 2016 and 2018 across multiple online platforms.

Dataset Source: Kaggle – Brazilian E‑Commerce Public Dataset by Olist

GitHub Repository: View on GitHub – The complete code, visual outputs, and downloadable notebook are available in the repository. Note: Only the code blocks are displayed below; all generated plots and results can be found in the GitHub repo.

About the Dataset

The data is highly normalized across a relational database schema and is provided as nine separate CSV files. Understanding this structure is critical before any analysis can take place. The files include:

olist_orders_dataset.csv – Order timestamps and delivery status
olist_customers_dataset.csv – Customer unique identifiers and location (city/state)
olist_order_items_dataset.csv – Products purchased within each order, including price and freight
olist_products_dataset.csv – Product metadata (dimensions, weight, category)
olist_order_payments_dataset.csv – Payment methods, installments, and total value
olist_order_reviews_dataset.csv – Customer review scores and comments
olist_sellers_dataset.csv – Seller location information
olist_geolocation_dataset.csv – Brazilian zip code prefixes with latitude/longitude
product_category_name_translation.csv – Portuguese‑to‑English category translations

Project Objectives

The primary goal is to transform this scattered, normalized data into actionable business insights. By the end of this analysis, we will answer key questions such as:

What is the overall order volume trend over time?
Which product categories generate the most revenue?
What are the key statistical metrics like Average Order Value (AOV) and average items per order?
How do freight costs correlate with product price and customer satisfaction?
Which Brazilian states contribute the highest revenue?
Are there extreme outliers in order values that warrant further investigation?

Analysis Roadmap

Data Ingestion & Master Assembly – Loading all nine CSV files and joining them using primary/foreign keys to create a single analytical DataFrame.
Data Cleaning & Type Conversion – Converting timestamp columns to datetime, handling missing values, and removing duplicate records.
Descriptive Statistics – Calculating Mean Order Value, average items per order, and ranking top states by revenue.
Monthly Order Volume Trend – Visualizing the growth trajectory of the e‑commerce platform using time‑series analysis.
Top Product Categories – Identifying which product types drive the most revenue using a horizontal bar chart.
Correlation Analysis – Using a heatmap to examine relationships between price, freight value, and review scores.
Outlier Detection – Using boxplots to identify and contextualize extreme order payment values.

Note: All code is written in Python using Pandas for data manipulation, Matplotlib and Seaborn for visualization. The analysis follows a step‑by‑step, reproducible workflow suitable for both Jupyter notebooks and production reporting.

Step 1: Data Ingestion & Master DataFrame Assembly

In a real‑world e‑commerce system, data is rarely stored in a single flat file. Instead, it is normalized across a relational database to prevent redundancy and maintain integrity. The Olist dataset is split into nine separate CSV files, each representing a distinct business entity. Before we can analyze sales trends, revenue, or customer behaviour, we must first reconstruct a unified view of the data.

The Nine Source Files

The following datasets are loaded into Pandas DataFrames:

olist_orders_dataset.csv – Core order metadata (timestamps, status)
olist_customers_dataset.csv – Customer unique identifiers and location
olist_order_items_dataset.csv – Line‑item details: product, price, freight, seller
olist_products_dataset.csv – Product attributes (category, dimensions, weight)
olist_order_payments_dataset.csv – Payment methods, installments, and values
olist_order_reviews_dataset.csv – Review scores and comment text
olist_sellers_dataset.csv – Seller location (city/state)
olist_geolocation_dataset.csv – Geographic coordinates for Brazilian zip prefixes
product_category_name_translation.csv – English translations of product category names

How the Master DataFrame is Built

We start with the orders table as our anchor and perform a series of merge() operations—the Pandas equivalent of SQL JOIN clauses. Each join adds new columns from a related table using shared keys.

Join customers on customer_id → attaches customer state and city to each order.
Join order_items on order_id → brings in the products purchased, their prices, freight values, and seller IDs.
Join products on product_id → adds product details like category name, weight, and dimensions.
Join product_category_name_translation on product_category_name → provides English category labels for clearer visualisation.
Join order_payments on order_id → appends payment type, instalments, and total payment value.
Join order_reviews on order_id → includes review score and comment timestamp.
Join sellers on seller_id → adds seller location information.

Crucially, all merges use the parameter how='left'. This ensures that every order is preserved, even if some associated records (e.g., a missing review or an untranslated category) do not exist. The resulting DataFrame will contain 119,143 rows and 40 columns, providing a comprehensive foundation for all subsequent analysis.

Verification

After the merges are complete, we inspect the shape of the final DataFrame and display the first few rows. This quick validation confirms that:

All tables have been correctly aligned.
The expected number of rows matches the base orders table (plus any multi‑item expansions).
Column headers are consistent and ready for cleaning in the next step.

Code Preview (what follows):
• Import pandas
• Load nine CSV files with pd.read_csv()
• Chain .merge() calls starting from df_orders
• Print the shape using df_master.shape

import pandas as pd

# 1. Load ALL 9 CSV files into separate Pandas DataFrames
df_orders = pd.read_csv('olist_orders_dataset.csv')
df_customers = pd.read_csv('olist_customers_dataset.csv')
df_items = pd.read_csv('olist_order_items_dataset.csv')
df_products = pd.read_csv('olist_products_dataset.csv')
df_payments = pd.read_csv('olist_order_payments_dataset.csv')
df_reviews = pd.read_csv('olist_order_reviews_dataset.csv')
df_sellers = pd.read_csv('olist_sellers_dataset.csv')
df_geo = pd.read_csv('olist_geolocation_dataset.csv')
df_translations = pd.read_csv('product_category_name_translation.csv')

# 2. Merge the datasets sequentially to build the master DataFrame
df_master = (
    df_orders
    # Link orders to the customer who made them
    .merge(df_customers, on='customer_id', how='left')
    
    # Link the specific items purchased within each order
    .merge(df_items, on='order_id', how='left')
    
    # Link the product details for each item
    .merge(df_products, on='product_id', how='left')
    
    # Translate the product category name to English
    .merge(df_translations, on='product_category_name', how='left')
    
    # Add payment details (needed for Step 3 and 7)
    .merge(df_payments, on='order_id', how='left')
    
    # Add customer reviews (needed for Step 6)
    .merge(df_reviews, on='order_id', how='left')
    
    # Add seller details
    .merge(df_sellers, on='seller_id', how='left')
)

# 3. Verify the execution by checking the shape and columns
print(f"Master Dataset successfully created!")
print(f"Total Rows: {df_master.shape[0]}")
print(f"Total Columns: {df_master.shape[1]}")

# View the first 3 rows to confirm data alignment
display(df_master.head(3))

import pandas as pd

# 1. Load ALL 9 CSV files into separate Pandas DataFrames
df_orders = pd.read_csv('olist_orders_dataset.csv')
df_customers = pd.read_csv('olist_customers_dataset.csv')
df_items = pd.read_csv('olist_order_items_dataset.csv')
df_products = pd.read_csv('olist_products_dataset.csv')
df_payments = pd.read_csv('olist_order_payments_dataset.csv')
df_reviews = pd.read_csv('olist_order_reviews_dataset.csv')
df_sellers = pd.read_csv('olist_sellers_dataset.csv')
df_geo = pd.read_csv('olist_geolocation_dataset.csv')
df_translations = pd.read_csv('product_category_name_translation.csv')

# 2. Merge the datasets sequentially to build the master DataFrame
df_master = (
    df_orders
    # Link orders to the customer who made them
    .merge(df_customers, on='customer_id', how='left')
    
    # Link the specific items purchased within each order
    .merge(df_items, on='order_id', how='left')
    
    # Link the product details for each item
    .merge(df_products, on='product_id', how='left')
    
    # Translate the product category name to English
    .merge(df_translations, on='product_category_name', how='left')
    
    # Add payment details (needed for Step 3 and 7)
    .merge(df_payments, on='order_id', how='left')
    
    # Add customer reviews (needed for Step 6)
    .merge(df_reviews, on='order_id', how='left')
    
    # Add seller details
    .merge(df_sellers, on='seller_id', how='left')
)

# 3. Verify the execution by checking the shape and columns
print(f"Master Dataset successfully created!")
print(f"Total Rows: {df_master.shape[0]}")
print(f"Total Columns: {df_master.shape[1]}")

# View the first 3 rows to confirm data alignment
display(df_master.head(3))

Step 2: Data Cleaning & Preprocessing

After assembling the master dataset, the next critical step is to ensure data quality and consistency. Raw data often contains incorrect data types, missing values, or duplicate records that can skew analysis or cause errors during visualisation. This step systematically inspects and rectifies these issues.

1. Data Type Conversion

Several columns representing dates (e.g., order_purchase_timestamp, order_delivered_customer_date) are initially loaded as generic object or string types. We convert them to Pandas datetime objects. This allows us to:

Extract time‑based features (e.g., year‑month for trend analysis).
Perform date arithmetic (e.g., delivery delay calculations).
Ensure proper sorting and plotting on time axes.

2. Assessing Missing Values

We generate a count of null values per column, sorted descending, to identify the most incomplete fields. The primary columns with missing data are:

review_comment_title and review_comment_message – many customers leave a score but no written review.
order_delivered_customer_date – missing for orders still in transit or cancelled.
product_category_name_english – some categories lack an English translation.
product_photos_qty, product_name_lenght, product_description_lenght – missing for certain product records.

3. Handling Null Values

Based on business logic, we apply different strategies for different columns:

Drop rows with missing product_category_name_english
Without a valid category, these products cannot be used in category‑level revenue analysis. This removes approximately 2,500 rows.
Fill missing review_score with the median score
This preserves the order for other analyses while using a neutral, central value for missing reviews.
Fill missing freight_value and price with 0
These are financial fields; a zero value indicates no charge (though missing values in these fields are extremely rare after the merges).

4. Duplicate Removal

Duplicate rows can artificially inflate metrics. We use Pandas’ duplicated() method to identify and then drop any exact row duplicates. In this dataset, there are typically 0 duplicate rows after proper merging, but the check is a critical safeguard.

5. Final Verification

After cleaning, we print the final shape of the dataset. The output confirms a reduction in rows (from ~119k to ~116k) due to the removal of untranslatable product categories, and verifies that all 40 columns are intact and ready for analysis.

A Note on the FutureWarning: The code uses fillna(..., inplace=True) on a DataFrame slice. Pandas 3.0 deprecates this pattern. The notebook includes the warning output, but the operation completes successfully. For production code, one should use direct column assignment (e.g., df_master['column'] = df_master['column'].fillna(value)).

Code Preview (what follows):
• Convert date columns using pd.to_datetime()
• Display top missing value counts with isnull().sum().sort_values()
• Drop rows lacking English category with dropna(subset=[…])
• Fill missing review scores, freight, and price values
• Remove duplicates with drop_duplicates()
• Print final shape for confirmation

# 1. Convert Timestamp Columns to proper Datetime objects
date_columns = [
    'order_purchase_timestamp', 
    'order_approved_at', 
    'order_delivered_carrier_date', 
    'order_delivered_customer_date', 
    'order_estimated_delivery_date'
]

for col in date_columns:
    df_master[col] = pd.to_datetime(df_master[col])

print("Data types successfully updated!")

# 2. Check for missing values (Nulls) before cleaning
print("\n--- Missing Values Before Cleaning ---")
# Displaying the top 10 columns with the highest missing values
print(df_master.isnull().sum().sort_values(ascending=False).head(10))

# 3. Clean the Nulls Accordingly
# Drop rows where the English product category is missing
df_master.dropna(subset=['product_category_name_english'], inplace=True)

# Fill missing review scores with the median review score
median_score = df_master['review_score'].median()
df_master['review_score'].fillna(median_score, inplace=True)

# Fill missing freight values or prices with 0 (if any slipped through)
df_master['freight_value'].fillna(0, inplace=True)
df_master['price'].fillna(0, inplace=True)

# 4. Check and remove exact row duplicates
duplicates_count = df_master.duplicated().sum()
print(f"\nFound {duplicates_count} duplicate rows. Removing them now...")
df_master.drop_duplicates(inplace=True)

# 5. Final Verification
print(f"\nData cleaning complete!")
print(f"Final Dataset Shape: {df_master.shape[0]} rows, {df_master.shape[1]} columns")

# 1. Convert Timestamp Columns to proper Datetime objects
date_columns = [
    'order_purchase_timestamp', 
    'order_approved_at', 
    'order_delivered_carrier_date', 
    'order_delivered_customer_date', 
    'order_estimated_delivery_date'
]

for col in date_columns:
    df_master[col] = pd.to_datetime(df_master[col])

print("Data types successfully updated!")

# 2. Check for missing values (Nulls) before cleaning
print("\n--- Missing Values Before Cleaning ---")
# Displaying the top 10 columns with the highest missing values
print(df_master.isnull().sum().sort_values(ascending=False).head(10))

# 3. Clean the Nulls Accordingly
# Drop rows where the English product category is missing
df_master.dropna(subset=['product_category_name_english'], inplace=True)

# Fill missing review scores with the median review score
median_score = df_master['review_score'].median()
df_master['review_score'].fillna(median_score, inplace=True)

# Fill missing freight values or prices with 0 (if any slipped through)
df_master['freight_value'].fillna(0, inplace=True)
df_master['price'].fillna(0, inplace=True)

# 4. Check and remove exact row duplicates
duplicates_count = df_master.duplicated().sum()
print(f"\nFound {duplicates_count} duplicate rows. Removing them now...")
df_master.drop_duplicates(inplace=True)

# 5. Final Verification
print(f"\nData cleaning complete!")
print(f"Final Dataset Shape: {df_master.shape[0]} rows, {df_master.shape[1]} columns")

Step 3: Descriptive Statistics – Key Business Metrics

With a clean dataset in hand, we now extract high‑level business insights that summarise the overall health and performance of the Olist marketplace. Three fundamental metrics are computed in this step: Mean Order Value, Average Items per Order, and a ranking of Top 10 States by Total Revenue.

1. Mean Order Value (MOV)

The Mean Order Value represents the average amount a customer spends per transaction. It is a critical metric for evaluating pricing strategy, marketing effectiveness, and overall revenue per customer interaction.

Calculation Approach: Since a single order can contain multiple payment records (e.g., instalment payments), we first isolate unique payments per order using drop_duplicates() on order_id and payment_sequential. Then we divide the total sum of payment_value by the count of unique order_id values.

2. Order Frequency (Average Items per Order)

This metric tells us, on average, how many distinct items a customer adds to their cart during a single checkout. It informs inventory bundling strategies, cross‑selling opportunities, and shipping cost implications.

Calculation Approach: For each order, the column order_item_id increments with each item purchased. We take the maximum order_item_id per order (which equals the total number of items in that order) and then compute the mean across all orders using groupby() and mean().

3. Top 10 States by Total Revenue

Understanding where revenue originates geographically is essential for targeted marketing, logistics planning, and regional expansion decisions. Brazil is a large and diverse country, and Olist operates nationwide.

Calculation Approach: Using the unique payments dataset, we group by customer_state and sum the payment_value. The resulting table is sorted in descending order, and the top 10 rows are displayed. The output reveals that São Paulo (SP) dominates revenue, followed by Rio de Janeiro (RJ) and Minas Gerais (MG)—a pattern consistent with Brazil’s economic geography.

Business Implications

MOV (R$ 160.80): A healthy average order value indicates customers are willing to spend moderately per transaction.
Items per Order (1.14): The low average suggests most orders are single‑item purchases—opportunity for cross‑sell recommendations.
Revenue Concentration: The top three states account for a significant portion of total revenue; marketing and logistics resources may be prioritised accordingly.

Code Preview (what follows):
• Create a deduplicated payment DataFrame
• Compute MOV: total_revenue / total_orders
• Compute average items per order: df_master.groupby(‘order_id’)[‘order_item_id’].max().mean()
• Group payments by customer_state and sum revenue
• Sort and display top 10 states

# 1. Calculate Mean Order Value (MOV)
# Isolate unique payments per order to avoid duplicating revenue for multi-item orders
unique_payments = df_master.drop_duplicates(subset=['order_id', 'payment_sequential'])

total_revenue = unique_payments['payment_value'].sum()
total_orders = unique_payments['order_id'].nunique()

mean_order_value = total_revenue / total_orders
print(f"Mean Order Value (MOV): R$ {mean_order_value:.2f}")

# 2. Calculate Order Frequency (Average items per cart)
# The max 'order_item_id' for each order tells us the total number of items bought in that transaction
order_frequency = df_master.groupby('order_id')['order_item_id'].max().mean()
print(f"Average Items per Order: {order_frequency:.2f} items")

# 3. Calculate Top 10 States by Revenue
# Group the unique payments data by the customer's state and sum the revenue
state_revenue = unique_payments.groupby('customer_state')['payment_value'].sum().reset_index()

# Sort the values from highest to lowest and grab the top 10
top_10_states = state_revenue.sort_values(by='payment_value', ascending=False).head(10)

print("\n--- Top 10 States by Total Revenue ---")
display(top_10_states)

# 1. Calculate Mean Order Value (MOV)
# Isolate unique payments per order to avoid duplicating revenue for multi-item orders
unique_payments = df_master.drop_duplicates(subset=['order_id', 'payment_sequential'])

total_revenue = unique_payments['payment_value'].sum()
total_orders = unique_payments['order_id'].nunique()

mean_order_value = total_revenue / total_orders
print(f"Mean Order Value (MOV): R$ {mean_order_value:.2f}")

# 2. Calculate Order Frequency (Average items per cart)
# The max 'order_item_id' for each order tells us the total number of items bought in that transaction
order_frequency = df_master.groupby('order_id')['order_item_id'].max().mean()
print(f"Average Items per Order: {order_frequency:.2f} items")

# 3. Calculate Top 10 States by Revenue
# Group the unique payments data by the customer's state and sum the revenue
state_revenue = unique_payments.groupby('customer_state')['payment_value'].sum().reset_index()

# Sort the values from highest to lowest and grab the top 10
top_10_states = state_revenue.sort_values(by='payment_value', ascending=False).head(10)

print("\n--- Top 10 States by Total Revenue ---")
display(top_10_states)

Step 4: Monthly Order Volume Trend Analysis

Understanding how order volume changes over time is fundamental to assessing business growth, identifying seasonality, and evaluating the impact of marketing campaigns. This step visualizes the number of unique orders placed each month using a clear, publication‑ready Matplotlib line chart.

1. Extracting Year‑Month from Timestamps

The raw dataset contains precise timestamps in the order_purchase_timestamp column. For meaningful monthly aggregation, we create a new column year_month by converting the timestamp to a Period object with monthly frequency (dt.to_period('M')). This standardises dates to a format like 2017-01, 2017-02, etc., enabling straightforward grouping.

2. Aggregating Unique Orders per Month

Since the master DataFrame contains one row per order item, a single order with multiple items would be counted multiple times if we simply used count(). To obtain the true number of transactions, we group by year_month and count the unique order_id values using nunique(). This yields a time series of monthly order counts.

3. Constructing the Matplotlib Line Chart

The aggregated series is plotted using Matplotlib with custom formatting to ensure clarity and professional presentation:

Figure size – set to (14, 6) for a wide, readable aspect ratio.
Line style – blue line (#1f77b4) with circular markers at each data point.
Grid – a dashed background grid aids in reading exact values.
Axis labels and title – clearly describe the content and units.
Rotated x‑axis labels – prevents overlapping of month names.

Insights from the Trend

The resulting chart typically reveals:

Steady growth from late 2016 through early 2018.
A noticeable peak around November 2017, likely driven by Black Friday and holiday shopping.
Some month‑to‑month volatility, but an overall upward trajectory.

This visualisation provides a clear, at‑a‑glance understanding of Olist’s sales momentum over time.

Code Preview (what follows):
• Import matplotlib.pyplot
• Create year_month column using dt.to_period(‘M’)
• Group by year_month and count unique order_id values
• Plot with kind=’line’, markers, and custom styling
• Add title, labels, grid, and rotated ticks
• Display the chart

import matplotlib.pyplot as plt

# 1. Extract Year-Month from the purchase timestamp
# This creates a new column grouping exact dates into standardized months
df_master['year_month'] = df_master['order_purchase_timestamp'].dt.to_period('M')

# 2. Group by the new year_month column and count unique orders
# We use nunique() on order_id to ensure we don't count multi-item orders twice
monthly_orders = df_master.groupby('year_month')['order_id'].nunique()

# 3. Create the Matplotlib Line Chart
plt.figure(figsize=(14, 6))
monthly_orders.plot(
    kind='line', 
    marker='o', 
    color='#1f77b4', 
    linewidth=2.5, 
    markersize=8
)

# 4. Format the chart for maximum readability and business presentation
plt.title('Monthly E-Commerce Order Volume Trend (Olist)', fontsize=18, fontweight='bold')
plt.xlabel('Year and Month', fontsize=14)
plt.ylabel('Total Number of Unique Orders', fontsize=14)

# Rotate the x-axis dates so they don't overlap
plt.xticks(rotation=45)

# Add a dashed background grid for easier reading
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()

# 5. Render the plot in the notebook
plt.show()

import matplotlib.pyplot as plt

# 1. Extract Year-Month from the purchase timestamp
# This creates a new column grouping exact dates into standardized months
df_master['year_month'] = df_master['order_purchase_timestamp'].dt.to_period('M')

# 2. Group by the new year_month column and count unique orders
# We use nunique() on order_id to ensure we don't count multi-item orders twice
monthly_orders = df_master.groupby('year_month')['order_id'].nunique()

# 3. Create the Matplotlib Line Chart
plt.figure(figsize=(14, 6))
monthly_orders.plot(
    kind='line', 
    marker='o', 
    color='#1f77b4', 
    linewidth=2.5, 
    markersize=8
)

# 4. Format the chart for maximum readability and business presentation
plt.title('Monthly E-Commerce Order Volume Trend (Olist)', fontsize=18, fontweight='bold')
plt.xlabel('Year and Month', fontsize=14)
plt.ylabel('Total Number of Unique Orders', fontsize=14)

# Rotate the x-axis dates so they don't overlap
plt.xticks(rotation=45)

# Add a dashed background grid for easier reading
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()

# 5. Render the plot in the notebook
plt.show()

Step 5: Top 10 Product Categories by Revenue (Horizontal Bar Chart)

Not all product categories contribute equally to the bottom line. This step identifies the top 10 revenue‑generating product categories and presents them in a clean, descending horizontal bar chart using Seaborn. Horizontal orientation is particularly effective for category labels that vary in length, ensuring readability.

1. Calculating Revenue per Category

The first task is to aggregate total revenue for each product category. We group the master DataFrame by the column product_category_name_english and sum the price column. This produces a new DataFrame where each row represents a unique category and its corresponding total sales value in Brazilian Reais (R$).

2. Sorting and Selecting the Top 10

The aggregated category revenue data is sorted in descending order using sort_values(by='price', ascending=False). The first 10 rows are extracted with head(10), isolating the highest‑performing categories for visualisation.

3. Creating the Seaborn Horizontal Bar Chart

A horizontal bar chart is constructed with Seaborn and customised for clarity and aesthetic appeal:

Figure size – set to (12, 8) for ample space.
Palette – ‘viridis’ colour map, which is perceptually uniform and colourblind‑friendly.
Axes – x=’price’ (revenue) and y=’product_category_name_english’ (category label).
Grid – a subtle x‑axis grid (axis=’x’) helps the reader estimate exact revenue figures.
Title and labels – clearly communicate the chart’s purpose and units.

Key Insights

The resulting chart typically highlights that Health & Beauty, Watches & Gifts, and Bed, Bath & Table are among the dominant revenue drivers. This information can guide inventory decisions, promotional focus, and supplier negotiations.

A Note on the FutureWarning: The code uses palette without assigning hue, which triggers a deprecation warning in newer Seaborn versions. The chart still renders correctly. For future compatibility, one can assign y to hue and set legend=False.

Code Preview (what follows):
• Import seaborn and matplotlib.pyplot
• Group by product_category_name_english and sum price
• Sort and select top 10 categories
• Create horizontal bar plot with sns.barplot()
• Customise title, labels, grid, and layout
• Render the chart

import seaborn as sns
import matplotlib.pyplot as plt

# 1. Calculate the total revenue (sum of item prices) per product category
category_revenue = df_master.groupby('product_category_name_english')['price'].sum().reset_index()

# 2. Sort the data from highest to lowest revenue and grab the Top 10
top_10_categories = category_revenue.sort_values(by='price', ascending=False).head(10)

# 3. Initialize the Matplotlib figure
plt.figure(figsize=(12, 8))

# 4. Create the Seaborn Horizontal Bar Chart
sns.barplot(
    data=top_10_categories, 
    x='price', 
    y='product_category_name_english', 
    palette='viridis' # A beautiful, colorblind-friendly color palette
)

# 5. Format the chart for business presentation
plt.title('Top 10 Product Categories by Total Revenue (Olist)', fontsize=18, fontweight='bold')
plt.xlabel('Total Revenue (R$)', fontsize=14)
plt.ylabel('Product Category', fontsize=14)

# Add a subtle grid on the X-axis to make reading the values easier
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()

# 6. Render the plot in the notebook
plt.show()

import seaborn as sns
import matplotlib.pyplot as plt

# 1. Calculate the total revenue (sum of item prices) per product category
category_revenue = df_master.groupby('product_category_name_english')['price'].sum().reset_index()

# 2. Sort the data from highest to lowest revenue and grab the Top 10
top_10_categories = category_revenue.sort_values(by='price', ascending=False).head(10)

# 3. Initialize the Matplotlib figure
plt.figure(figsize=(12, 8))

# 4. Create the Seaborn Horizontal Bar Chart
sns.barplot(
    data=top_10_categories, 
    x='price', 
    y='product_category_name_english', 
    palette='viridis' # A beautiful, colorblind-friendly color palette
)

# 5. Format the chart for business presentation
plt.title('Top 10 Product Categories by Total Revenue (Olist)', fontsize=18, fontweight='bold')
plt.xlabel('Total Revenue (R$)', fontsize=14)
plt.ylabel('Product Category', fontsize=14)

# Add a subtle grid on the X-axis to make reading the values easier
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()

# 6. Render the plot in the notebook
plt.show()

Step 6: Correlation Heatmap – Price, Freight Value & Review Score

Understanding relationships between numerical variables can reveal hidden patterns and business drivers. In this step, we explore how product price, freight cost, and customer review scores relate to one another. A Seaborn heatmap provides an intuitive, colour‑coded view of the Pearson correlation coefficients.

1. Selecting the Relevant Columns

From the master DataFrame, we isolate three columns of interest:

price – the listed price of the item.
freight_value – the shipping cost paid by the customer.
review_score – the rating (1 to 5) given by the customer.

This subset forms the basis for the correlation analysis.

2. Computing the Pearson Correlation Matrix

Using Pandas’ .corr() method, we calculate pairwise Pearson correlation coefficients. The coefficient ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.

3. Visualising with a Seaborn Heatmap

The correlation matrix is rendered as a heatmap with the following customisations:

Annotations – annot=True displays the exact correlation value inside each cell.
Colour map – ‘coolwarm’ uses blue for negative correlations, white for zero, and red for positive correlations.
Scale limits – vmin=-1, vmax=1 ensures the full theoretical range is represented.
Formatting – numbers are shown with two decimal places (fmt=’.2f’).
Square cells – square=True creates a symmetric, clean appearance.

Interpreting the Results

Based on the Olist data, the heatmap typically reveals:

Price and Freight Value: A moderate positive correlation (often around 0.5–0.6). Higher‑priced items tend to incur higher shipping costs, likely due to insurance, weight, or size.
Price and Review Score: A very weak positive correlation (near 0.0–0.1). Price alone does not strongly predict customer satisfaction.
Freight Value and Review Score: Similarly negligible correlation. Shipping cost does not appear to systematically influence the rating given.

These findings suggest that while operational costs (freight) scale with product value, customer satisfaction is driven by factors beyond price and shipping expense—likely product quality, delivery speed, and service experience.

Code Preview (what follows):
• Import seaborn and matplotlib.pyplot
• Subset columns: [‘price’, ‘freight_value’, ‘review_score’]
• Compute correlation matrix with .corr()
• Plot heatmap using sns.heatmap() with annotations and ‘coolwarm’ colormap
• Customise title and layout, then display

import seaborn as sns
import matplotlib.pyplot as plt

# 1. Isolate the specific columns we want to analyze
correlation_columns = ['price', 'freight_value', 'review_score']
df_correlation = df_master[correlation_columns]

# 2. Compute the Pearson Correlation Matrix
corr_matrix = df_correlation.corr()

# 3. Initialize the Matplotlib figure
plt.figure(figsize=(8, 6))

# 4. Create the Seaborn Heatmap
sns.heatmap(
    corr_matrix, 
    annot=True,              # Prints the exact correlation numbers on the chart
    cmap='coolwarm',         # Uses a blue-to-red color scale
    vmin=-1, vmax=1,         # Sets the absolute minimum and maximum scale limits
    fmt=".2f",               # Formats the numbers to two decimal places
    linewidths=0.5,          # Adds a clean border between squares
    square=True              # Forces the grid to be perfectly square
)

# 5. Format the chart for business presentation
plt.title('Correlation Heatmap: Price, Freight, and Reviews', fontsize=16, fontweight='bold')
plt.tight_layout()

# 6. Render the plot in the notebook
plt.show()

import seaborn as sns
import matplotlib.pyplot as plt

# 1. Isolate the specific columns we want to analyze
correlation_columns = ['price', 'freight_value', 'review_score']
df_correlation = df_master[correlation_columns]

# 2. Compute the Pearson Correlation Matrix
corr_matrix = df_correlation.corr()

# 3. Initialize the Matplotlib figure
plt.figure(figsize=(8, 6))

# 4. Create the Seaborn Heatmap
sns.heatmap(
    corr_matrix, 
    annot=True,              # Prints the exact correlation numbers on the chart
    cmap='coolwarm',         # Uses a blue-to-red color scale
    vmin=-1, vmax=1,         # Sets the absolute minimum and maximum scale limits
    fmt=".2f",               # Formats the numbers to two decimal places
    linewidths=0.5,          # Adds a clean border between squares
    square=True              # Forces the grid to be perfectly square
)

# 5. Format the chart for business presentation
plt.title('Correlation Heatmap: Price, Freight, and Reviews', fontsize=16, fontweight='bold')
plt.tight_layout()

# 6. Render the plot in the notebook
plt.show()

Step 7: Outlier Detection in Order Value Using Boxplots

Outliers can distort summary statistics and hide underlying patterns. Identifying extreme order values helps us understand exceptional transactions—whether they represent high‑value corporate purchases, bulk orders, or potential data entry errors. A boxplot provides a robust visual summary of the distribution and clearly flags outliers beyond the typical range.

1. Preparing Unique Payment Data

As in Step 3, we must avoid double‑counting payments when an order uses multiple instalments or payment methods. We create a deduplicated view of the payments data using drop_duplicates(subset=['order_id', 'payment_sequential']). This ensures each order contributes exactly one total payment value to the analysis.

2. Constructing the Boxplot

A boxplot (or box‑and‑whisker plot) summarises the distribution through five key statistics:

Minimum (excluding outliers)
First quartile (Q1) – 25th percentile
Median (Q2) – 50th percentile, the central value
Third quartile (Q3) – 75th percentile
Maximum (excluding outliers)

Outliers are defined as points that fall outside the interquartile range (IQR) by more than 1.5 × IQR. They are plotted as individual points beyond the whiskers.

The plot is customised with:

color=’skyblue’ – a clean, neutral fill for the box.
flierprops – customises outlier markers as semi‑transparent red circles, making them immediately identifiable.

3. Interpreting the Results

The output reveals a heavily right‑skewed distribution:

Median order value is around R$ 100, indicating a typical transaction is modest.
The box is compressed near the lower end of the scale, showing that most orders cluster below R$ 200–300.
Numerous outliers extend far to the right, with the maximum order exceeding R$ 13,000.

These extreme values likely represent bulk B2B purchases or high‑value electronics orders. Rather than removing them automatically, they warrant separate investigation—perhaps they belong to a distinct customer segment that could be targeted with tailored offers.

Business Implications

Pricing strategy: Most revenue comes from frequent, lower‑value orders rather than occasional large ones.
Fraud detection: The highest outliers should be reviewed for legitimacy.
Customer segmentation: High‑value outliers may represent a distinct VIP or corporate segment.

Code Preview (what follows):
• Deduplicate payments DataFrame
• Initialise figure with plt.figure(figsize=(12, 4))
• Create boxplot using sns.boxplot() with custom flierprops
• Add title and axis label
• Print median and maximum order values for context

import seaborn as sns
import matplotlib.pyplot as plt

# 1. Ensure we are using unique order payments to prevent duplication
# (Re-creating it here just in case you skipped Step 3)
unique_payments = df_master.drop_duplicates(subset=['order_id', 'payment_sequential'])

# 2. Initialize the Matplotlib figure
plt.figure(figsize=(12, 4))

# 3. Create the Seaborn Boxplot
sns.boxplot(
    x=unique_payments['payment_value'], 
    color='skyblue',
    flierprops={"marker": "o", "markerfacecolor": "red", "markersize": 5, "alpha": 0.5}
)

# 4. Format the chart for business presentation
plt.title('Distribution of Order Payment Values (Outlier Detection)', fontsize=16, fontweight='bold')
plt.xlabel('Total Payment Value (R$)', fontsize=14)

# 5. Render the plot in the notebook
plt.tight_layout()
plt.show()

# 6. Print some statistical context
max_order = unique_payments['payment_value'].max()
median_order = unique_payments['payment_value'].median()
print(f"Median Order Value: R$ {median_order:.2f}")
print(f"Maximum Order Value (Largest Outlier): R$ {max_order:.2f}")

import seaborn as sns
import matplotlib.pyplot as plt

# 1. Ensure we are using unique order payments to prevent duplication
# (Re-creating it here just in case you skipped Step 3)
unique_payments = df_master.drop_duplicates(subset=['order_id', 'payment_sequential'])

# 2. Initialize the Matplotlib figure
plt.figure(figsize=(12, 4))

# 3. Create the Seaborn Boxplot
sns.boxplot(
    x=unique_payments['payment_value'], 
    color='skyblue',
    flierprops={"marker": "o", "markerfacecolor": "red", "markersize": 5, "alpha": 0.5}
)

# 4. Format the chart for business presentation
plt.title('Distribution of Order Payment Values (Outlier Detection)', fontsize=16, fontweight='bold')
plt.xlabel('Total Payment Value (R$)', fontsize=14)

# 5. Render the plot in the notebook
plt.tight_layout()
plt.show()

# 6. Print some statistical context
max_order = unique_payments['payment_value'].max()
median_order = unique_payments['payment_value'].median()
print(f"Median Order Value: R$ {median_order:.2f}")
print(f"Maximum Order Value (Largest Outlier): R$ {max_order:.2f}")

Home » Ultimate Guide to Exploratory Data Analysis (EDA) on the Brazilian E-Commerce (Olist) Dataset

Ultimate Guide to Exploratory Data Analysis (EDA) on the Brazilian E-Commerce (Olist) Dataset

About the Dataset

Project Objectives

Analysis Roadmap

Step 1: Data Ingestion & Master DataFrame Assembly

The Nine Source Files

How the Master DataFrame is Built

Verification

Step 2: Data Cleaning & Preprocessing

1. Data Type Conversion

2. Assessing Missing Values

3. Handling Null Values

4. Duplicate Removal

5. Final Verification

Step 3: Descriptive Statistics – Key Business Metrics

1. Mean Order Value (MOV)

2. Order Frequency (Average Items per Order)

3. Top 10 States by Total Revenue

Business Implications

Step 4: Monthly Order Volume Trend Analysis

1. Extracting Year‑Month from Timestamps

2. Aggregating Unique Orders per Month

3. Constructing the Matplotlib Line Chart

Insights from the Trend

Step 5: Top 10 Product Categories by Revenue (Horizontal Bar Chart)

1. Calculating Revenue per Category

2. Sorting and Selecting the Top 10

3. Creating the Seaborn Horizontal Bar Chart

Key Insights

Step 6: Correlation Heatmap – Price, Freight Value & Review Score

1. Selecting the Relevant Columns

2. Computing the Pearson Correlation Matrix

3. Visualising with a Seaborn Heatmap

Interpreting the Results

Step 7: Outlier Detection in Order Value Using Boxplots

1. Preparing Unique Payment Data

2. Constructing the Boxplot

3. Interpreting the Results

Business Implications

Titanic Survival Prediction with Logistic Regression: A Step-by-Step Guide

Retail Sales Aggregation & Performance Analysis with SQL

House Price Prediction with Linear Regression: California Housing Dataset (Step-by-Step)