From Raw Transactions to Actionable Intelligence

In the fast‑paced world of retail, understanding sales patterns is not just helpful—it is essential for survival. Retail sales aggregation transforms millions of raw transaction rows into clear, strategic insights. This project leverages SQL to analyze the classic Sample Superstore Dataset, a rich collection of 9,994 orders spanning multiple years, regions, and product categories. Through a series of targeted queries, we will answer critical business questions: Which regions drive the most profit? Which sub‑categories deserve more shelf space? How do discounts truly impact the bottom line? By the end of this analysis, you will have a solid grasp of essential SQL techniques including GROUP BY, HAVING, window functions, and conditional logic with CASE WHEN. If you are new to data analysis, you might also enjoy our guide on Brazilian E‑Commerce EDA.

About the Dataset

The Superstore dataset contains order‑level details including Order Date, Ship Mode, Customer Segment, Region, Category, Sales, Quantity, Discount, and Profit. It is ideal for practicing retail sales aggregation because it mirrors real‑world complexity—multiple years, varying discount levels, and both profitable and loss‑making transactions. We imported the CSV into DB Browser for SQLite to execute the queries below.

What You Will Learn

Aggregating metrics across regions and categories with GROUP BY.
Filtering aggregated results using HAVING.
Classifying performance bands dynamically with CASE WHEN.
Comparing year‑over‑year growth using window functions like LAG().
Identifying loss‑making orders that require immediate attention.

Note: All queries are written in standard SQL and executed in SQLite. The code blocks follow each explanatory section.

Step 1: Regional Performance Overview with GROUP BY

The first question any retail executive asks is: “How are we performing across different regions?” To answer this, we need retail sales aggregation at the regional level. The GROUP BY clause groups rows that have the same values in the Region column. Then, aggregate functions—SUM()—calculate the total sales, total profit, and total quantity sold for each region. Consequently, we can instantly identify top‑performing and under‑performing geographic areas.

What This Query Reveals

Which region generates the highest revenue (Sales).
Whether high sales always translate to high profit.
Regional differences in order volume (Quantity).

In many cases, the West and East regions lead in absolute sales. However, profit margins may vary significantly due to differing discount strategies or product mixes. This initial snapshot sets the stage for deeper investigation. For a different approach to exploratory analysis, check out our Python‑based Olist E‑Commerce EDA.

Code Preview (what follows): SELECT Region, SUM(Sales), SUM(Profit), SUM(Quantity) FROM superstore GROUP BY Region;

SELECT Region, 
       SUM(Sales) AS Total_Sales, 
       SUM(Profit) AS Total_Profit, 
       SUM(Quantity) AS Total_Quantity
FROM superstore
GROUP BY Region;

SELECT Region, 
       SUM(Sales) AS Total_Sales, 
       SUM(Profit) AS Total_Profit, 
       SUM(Quantity) AS Total_Quantity
FROM superstore
GROUP BY Region;

Step 2: Identifying High‑Margin Sub‑Categories with CASE WHEN

Revenue alone does not tell the full story. A million‑dollar category that yields razor‑thin margins may be less valuable than a smaller category with stellar profitability. Therefore, our next step in retail sales aggregation focuses on profit margin, calculated as total profit divided by total sales. We group by Sub_Category, compute the margin, and then use ORDER BY with LIMIT 5 to see the top performers.

Additionally, a CASE WHEN statement categorizes each sub‑category into a Margin Band (e.g., ‘High Margin’ for margins ≥ 25%, ‘Low Margin’ for positive margins below 25%, and ‘Loss Making’ for negative margins). This dynamic classification helps business users quickly spot which areas deserve more investment and which need pricing or cost reviews.

Business Implications

High Margin: Potential candidates for increased marketing spend or premium placement.
Low Margin: Opportunities for cost reduction or supplier negotiation.
Loss Making: Immediate investigation required—possibly driven by excessive discounting.

Code Preview (what follows): SELECT Sub_Category, (SUM(Profit)/SUM(Sales)) AS Profit_Margin, CASE WHEN … END AS Margin_Band FROM superstore GROUP BY Sub_Category ORDER BY Profit_Margin DESC LIMIT 5;

SELECT Sub_Category, 
       (SUM(Profit) / SUM(Sales)) AS Profit_Margin,
       CASE 
           WHEN (SUM(Profit) / SUM(Sales)) >= 0.25 THEN 'High Margin'
           WHEN (SUM(Profit) / SUM(Sales)) > 0 THEN 'Low Margin'
           ELSE 'Loss Making'
       END AS Margin_Band
FROM superstore
GROUP BY Sub_Category
ORDER BY Profit_Margin DESC
LIMIT 5;

SELECT Sub_Category, 
       (SUM(Profit) / SUM(Sales)) AS Profit_Margin,
       CASE 
           WHEN (SUM(Profit) / SUM(Sales)) >= 0.25 THEN 'High Margin'
           WHEN (SUM(Profit) / SUM(Sales)) > 0 THEN 'Low Margin'
           ELSE 'Loss Making'
       END AS Margin_Band
FROM superstore
GROUP BY Sub_Category
ORDER BY Profit_Margin DESC
LIMIT 5;

Step 3: Filtering Aggregated Results with HAVING

While WHERE filters rows before aggregation, HAVING filters groups after aggregation. This distinction is crucial for effective retail sales aggregation. In this step, we want to identify which years had total sales exceeding the overall average yearly sales. First, we extract the year from the Order_Date string (since SQLite imports dates as text, we use SUBSTR(Order_Date, -4)). Next, we group by year and sum the sales. Finally, the HAVING clause compares each year’s total to a subquery that computes the average yearly sales across all years.

This technique is invaluable for monitoring long‑term trends. Years that consistently beat the average indicate strong growth phases, while those falling below may signal economic downturns or operational issues. Moreover, it demonstrates how to nest aggregate functions within a subquery to create dynamic benchmarks.

Code Preview (what follows): SELECT SUBSTR(Order_Date, -4) AS Order_Year, SUM(Sales) FROM superstore GROUP BY Order_Year HAVING SUM(Sales) > (SELECT SUM(Sales)/COUNT(DISTINCT SUBSTR(Order_Date, -4)) FROM superstore);

SELECT 
    SUBSTR(Order_Date, -4) AS Order_Year,
    SUM(Sales) AS Total_Sales
FROM superstore
GROUP BY Order_Year
HAVING SUM(Sales) > (
    SELECT SUM(Sales) / COUNT(DISTINCT SUBSTR(Order_Date, -4)) FROM superstore
)
ORDER BY Total_Sales DESC;

SELECT 
    SUBSTR(Order_Date, -4) AS Order_Year,
    SUM(Sales) AS Total_Sales
FROM superstore
GROUP BY Order_Year
HAVING SUM(Sales) > (
    SELECT SUM(Sales) / COUNT(DISTINCT SUBSTR(Order_Date, -4)) FROM superstore
)
ORDER BY Total_Sales DESC;

Step 4: Discount Impact Analysis with Conditional Aggregation

Discounts are a double‑edged sword. They can drive volume but often at the expense of profitability. To quantify this trade‑off in our retail sales aggregation, we compare average profit and average sales for orders with and without discounts. A CASE WHEN statement inside the GROUP BY clause splits the data into two cohorts: ‘With Discount’ (Discount > 0) and ‘No Discount’. We then compute the average profit and average sales for each group.

Typically, this analysis reveals a stark reality: discounted orders yield significantly lower average profit, and in some cases, negative profit. Meanwhile, full‑price orders maintain healthy margins. This insight empowers managers to refine discounting policies—perhaps by capping maximum discount percentages or restricting discounts to specific low‑margin clearance items. For more on avoiding misleading aggregate conclusions, see our deep dive into Simpson’s Paradox.

Code Preview (what follows): SELECT CASE WHEN Discount > 0 THEN ‘With Discount’ ELSE ‘No Discount’ END AS Discount_Status, AVG(Profit), AVG(Sales) FROM superstore GROUP BY Discount_Status;

SELECT 
    CASE WHEN Discount > 0 THEN 'With Discount' ELSE 'No Discount' END AS Discount_Status,
    AVG(Profit) AS Avg_Profit,
    AVG(Sales) AS Avg_Sales
FROM superstore
GROUP BY CASE WHEN Discount > 0 THEN 'With Discount' ELSE 'No Discount' END;

SELECT 
    CASE WHEN Discount > 0 THEN 'With Discount' ELSE 'No Discount' END AS Discount_Status,
    AVG(Profit) AS Avg_Profit,
    AVG(Sales) AS Avg_Sales
FROM superstore
GROUP BY CASE WHEN Discount > 0 THEN 'With Discount' ELSE 'No Discount' END;

Step 5: Flagging Unprofitable Transactions with WHERE and CASE WHEN

Not all sales are good sales. In retail sales aggregation, it is equally important to identify individual loss‑making orders. These transactions erode overall profitability and often stem from excessive discounting, shipping errors, or product returns. A simple WHERE Profit < 0 clause isolates these problematic rows. To make the output actionable, we include an Action_Flag column generated via CASE WHEN. Orders with negative profit are marked ‘Review Needed’.

Sorting the results by profit in ascending order (ORDER BY Profit ASC) brings the most severe losses to the top. This allows operations teams to prioritize investigations. Common root causes include misapplied promotional codes or high‑cost expedited shipping on low‑margin items. For a similar diagnostic approach in machine learning, check out our House Price Prediction project.

Code Preview (what follows): SELECT Order_ID, Customer_Name, Product_Name, Sales, Profit, CASE WHEN Profit < 0 THEN 'Review Needed' ELSE 'OK' END AS Action_Flag FROM superstore WHERE Profit < 0 ORDER BY Profit ASC LIMIT 10;

SELECT Order_ID, Customer_Name, Product_Name, Sales, Profit,
       CASE WHEN Profit < 0 THEN 'Review Needed' ELSE 'OK' END AS Action_Flag
FROM superstore
WHERE Profit < 0
ORDER BY Profit ASC
LIMIT 10;

SELECT Order_ID, Customer_Name, Product_Name, Sales, Profit,
       CASE WHEN Profit < 0 THEN 'Review Needed' ELSE 'OK' END AS Action_Flag
FROM superstore
WHERE Profit < 0
ORDER BY Profit ASC
LIMIT 10;

Step 6: Ranking Cities by Revenue with ORDER BY and LIMIT

Geographic insights become even more powerful when we drill down from regions to cities. This retail sales aggregation query ranks cities based on total revenue. By grouping on both City and State (to disambiguate cities with the same name), we sum sales and then use ORDER BY Revenue DESC combined with LIMIT 10 to retrieve the top‑performing metropolitan areas.

Such rankings are essential for logistics planning, targeted local marketing campaigns, and decisions about where to open new warehouses or retail outlets. In the Superstore dataset, major hubs like New York City and Los Angeles often dominate the list. However, smaller cities with specialized B2B demand may also appear, revealing hidden opportunities.

Code Preview (what follows): SELECT City, State, SUM(Sales) AS Revenue FROM superstore GROUP BY City, State ORDER BY Revenue DESC LIMIT 10;

SELECT City, State, SUM(Sales) AS Revenue
FROM superstore
GROUP BY City, State
ORDER BY Revenue DESC
LIMIT 10;

SELECT City, State, SUM(Sales) AS Revenue
FROM superstore
GROUP BY City, State
ORDER BY Revenue DESC
LIMIT 10;

Step 7: Year‑Over‑Year Sales Comparison Using LAG()

Our final step in retail sales aggregation tackles a classic business metric: year‑over‑year (YoY) growth. Unlike simple totals, YoY growth contextualizes performance relative to previous periods. To achieve this in SQL, we use a window function: LAG(). This function looks back at a previous row in the ordered result set. After grouping sales by year, LAG(SUM(Sales)) fetches the total sales of the immediately preceding year.

With the current year’s sales and the previous year’s sales side‑by‑side, we calculate the percentage change. A positive percentage indicates growth, while a negative value signals a contraction. This query is particularly insightful for boardroom presentations and strategic planning. For those interested in predictive modeling based on historical trends, our Linear Regression tutorial offers a great next step.

Note that window functions require careful ordering. Here, we order by the extracted year to ensure the lag correctly references the previous calendar year, not just an arbitrary row.

Code Preview (what follows): SELECT SUBSTR(Order_Date, -4) AS Year, SUM(Sales) AS Total_Sales, LAG(SUM(Sales)) OVER (ORDER BY SUBSTR(Order_Date, -4)) AS Previous_Year_Sales, ROUND((SUM(Sales) – LAG(SUM(Sales)) OVER (…)) / LAG(SUM(Sales)) OVER (…) * 100, 2) AS YoY_Growth_Percentage FROM superstore GROUP BY Year;

SELECT 
    SUBSTR(Order_Date, -4) AS Year, 
    SUM(Sales) AS Total_Sales,
    LAG(SUM(Sales)) OVER (ORDER BY SUBSTR(Order_Date, -4)) AS Previous_Year_Sales,
    ROUND(((SUM(Sales) - LAG(SUM(Sales)) OVER (ORDER BY SUBSTR(Order_Date, -4))) / 
     LAG(SUM(Sales)) OVER (ORDER BY SUBSTR(Order_Date, -4))) * 100, 2) AS YoY_Growth_Percentage
FROM superstore
GROUP BY Year;

SELECT 
    SUBSTR(Order_Date, -4) AS Year, 
    SUM(Sales) AS Total_Sales,
    LAG(SUM(Sales)) OVER (ORDER BY SUBSTR(Order_Date, -4)) AS Previous_Year_Sales,
    ROUND(((SUM(Sales) - LAG(SUM(Sales)) OVER (ORDER BY SUBSTR(Order_Date, -4))) / 
     LAG(SUM(Sales)) OVER (ORDER BY SUBSTR(Order_Date, -4))) * 100, 2) AS YoY_Growth_Percentage
FROM superstore
GROUP BY Year;

Retail Sales Aggregation & Performance Analysis with SQL