Project Documentation: Understanding the Customer Order Lifecycle
Welcome to the documentation for this project, which focuses on tracking the complete lifecycle of customer orders, from placement to final delivery. Our objective is to provide comprehensive insights into this process using a data warehousing technique known as “Accumulating Snapshot Fact Modelling.” This document will explain the core concepts and practical applications of this methodology.

Project Overview
This project addresses the critical need for detailed analytics within an online retail environment. Beyond simply recording what was purchased, businesses require a clear understanding of the operational efficiency and duration of key order fulfillment stages. Specifically, this solution provides insights into:
- The time elapsed between order placement, shipment, and customer delivery.
- The financial aspects associated with each order.
- Detailed customer and product information linked to each transaction.
Understanding Accumulating Snapshot Fact Modelling
Accumulating Snapshot Fact Modelling is an effective approach for tracking processes that progress through a series of discrete, sequential steps over time. Consider the journey of a typical customer order:
- Order Placed: The initial event, marking the beginning of the order’s lifecycle.
- Order Confirmed: The system acknowledges the order, indicating the next stage.
- Order Prepared: The product is being readied for dispatch.
- Order Shipped: The product leaves the warehouse via a carrier.
- Order Delivered: The product reaches the customer, signifying completion.
For each order, the accumulating snapshot fact table records:
- Key Event Timestamps: such as order_purchase_timestamp, order_delivered_carrier_date, and order_delivered_customer_date.
- Calculated Durations: metrics like days_to_ship (time from purchase to carrier delivery) and days_to_deliver (time from carrier delivery to customer delivery).
- Associated Attributes: relevant details such as total_price, payment_type, and foreign keys linking to customer and product dimensions.
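Since the notebook code is not reproduced inline here, a minimal pure-Python sketch of the two duration metrics may help. The days_between helper and the sample timestamps are illustrative assumptions, not taken from the project:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def days_between(start: str, end: str) -> int:
    """Whole days elapsed between two timestamps in FMT (illustrative helper)."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).days

# Hypothetical milestone timestamps for one order
order = {
    "order_purchase_timestamp": "2017-10-02 10:56:33",
    "order_delivered_carrier_date": "2017-10-04 19:55:00",
    "order_delivered_customer_date": "2017-10-10 21:25:13",
}

# Time from purchase to carrier delivery
days_to_ship = days_between(order["order_purchase_timestamp"],
                            order["order_delivered_carrier_date"])
# Time from carrier delivery to customer delivery
days_to_deliver = days_between(order["order_delivered_carrier_date"],
                               order["order_delivered_customer_date"])
```

In the real pipeline these durations are computed with Spark date-difference functions, but the arithmetic is the same.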
Business Scenarios for Accumulating Snapshot Fact Modelling
This modelling technique is highly versatile and applicable to any business process characterized by a defined sequence of steps. Here are several additional scenarios where it can provide significant value:
- Customer Onboarding Process:
- Milestones: Application submission, document verification, account approval, first service activation, initial transaction.
- Insights: Average onboarding duration, identification of bottlenecks in the customer journey, and conversion rates at each stage.
- Loan Application Lifecycle:
- Milestones: Application receipt, credit assessment, approval, fund disbursement, first repayment.
- Insights: Time-to-decision, time-to-fund disbursement, and analysis of application drop-off points.
- Manufacturing Production Workflow:
- Milestones: Raw material intake, assembly start, quality control completion, packaging, final shipment.
- Insights: Production cycle times, identification of delays in specific manufacturing phases, and overall process efficiency.
- Healthcare Patient Journey:
- Milestones: Appointment scheduling, patient check-in, consultation, diagnosis, treatment initiation, discharge.
- Insights: Patient wait times, duration of various treatment phases, and optimization of patient flow.
- Software Development Lifecycle (SDLC):
- Milestones: Feature request, development commencement, testing phase, bug resolution, production deployment.
- Insights: Lead time for feature delivery, identification of high-effort or delayed phases, and team productivity analysis.
Project Architecture: A Layered Approach
This project employs a structured data architecture, commonly referred to as a “Medallion Architecture” (Bronze, Silver, Gold layers), to systematically build and refine the analytical fact table. Below, we’ll walk through the code snippets that bring each layer to life.

Data and code notebook: Data-Warehousing-Types-of-Facts-design/Accumulating Snapshot Fact Modelling at main · Sarb1236/Data-Warehousing-Types-of-Facts-design

1. Bronze Layer: Raw Data Ingestion
The Bronze layer is our initial landing zone for raw, untransformed data. It preserves the original state of the source data, directly loading CSV files into Delta tables.

Code Snippet: This snippet defines a pipeline_config dictionary to centralize all configurations for the various tables (customers, products, orders, etc.). For each table, it specifies the path to the raw CSV file. The code then iteratively reads these CSVs into Spark DataFrames, infers their schema, and writes them as bronze_{table_name} Delta tables. This ensures that our raw data is stored in a robust, versioned format, ready for the next stage.
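The notebook itself uses Spark DataFrames and Delta tables; as a rough illustration of the config-driven loop, here is a stdlib-only sketch. The config entries, inline CSV strings, and the bronze dict are hypothetical stand-ins for file paths and the lakehouse:

```python
import csv
import io

# Hypothetical config: one entry per source table, as described above.
# The real config points at raw CSV file paths; inline strings stand in here.
pipeline_config = {
    "customers": {"csv": "customer_id,customer_city\n1,Sao Paulo\n2,Rio de Janeiro\n"},
    "orders": {"csv": "order_id,customer_id\n10,1\n11,2\n"},
}

bronze = {}  # stands in for the Delta lakehouse
for table_name, spec in pipeline_config.items():
    # Read the raw CSV; the header row plays the role of schema inference.
    rows = list(csv.DictReader(io.StringIO(spec["csv"])))
    # Persist the data unchanged under the bronze_{table_name} convention.
    bronze[f"bronze_{table_name}"] = rows
```

The key design point survives the simplification: adding a new source table means adding one config entry, not new ingestion code.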
2. Silver Layer: Data Cleansing and Standardization
In the Silver layer, data from the Bronze layer undergoes initial cleansing, deduplication, type casting, and standardization. This prepares the data for more advanced analytical modelling.

Code Snippet: The process_silver_table function takes a Bronze table, applies a set of cleaning rules defined in the pipeline_config, and writes the cleaned data to a silver_{table_name} Delta table. The rules include deduplication based on primary keys, type casting for numerical and timestamp columns, trimming whitespace from string columns, and performing regular expression replacements. This systematic approach ensures data quality and consistency before the data is used for analytical purposes.
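A simplified pure-Python sketch of such a cleaning routine follows. The signature and sample rules are assumptions (the real process_silver_table operates on Spark DataFrames and also handles timestamp casts):

```python
import re

def process_silver_table(rows, primary_key, numeric_cols=(), trim_cols=(), regex_rules=()):
    """Apply Silver-layer rules: trim, regex-replace, cast, then deduplicate on the key."""
    seen, cleaned = set(), []
    for row in rows:
        row = dict(row)
        for col in trim_cols:                      # strip stray whitespace
            row[col] = row[col].strip()
        for col, pattern, repl in regex_rules:     # normalize string values
            row[col] = re.sub(pattern, repl, row[col])
        for col in numeric_cols:                   # cast numeric columns
            row[col] = float(row[col])
        if row[primary_key] not in seen:           # deduplicate on the primary key
            seen.add(row[primary_key])
            cleaned.append(row)
    return cleaned
```

Centralizing the rules per table (rather than hand-writing each transformation) is what keeps the layer auditable and consistent.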
3. Dimension Tables (SCD2): Tracking Historical Changes
We construct dimension tables for key entities such as customers and products. The “SCD2” (Slowly Changing Dimension Type 2) implementation ensures that a complete historical record of changes to these attributes is maintained, providing accurate historical context for analysis. We also create a dim_date table for time-based analysis.
Code Snippet (dim_date):
This snippet builds the dim_date table, which is crucial for time-based analysis. It creates a sequence of dates for a defined range (2016-2025) and then extracts various time attributes like year, month, day of month, day of week, week of year, quarter, day of year, month name, and day name. This comprehensive date dimension allows for flexible aggregation and filtering of data by different time granularities.
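As an illustration, the same attributes can be derived with the standard library. build_dim_date and its column names are hypothetical approximations of the notebook's Spark output:

```python
from datetime import date, timedelta

def build_dim_date(start=date(2016, 1, 1), end=date(2025, 12, 31)):
    """One row per calendar day in [start, end], with the attributes described above."""
    rows, d = [], start
    while d <= end:
        rows.append({
            "date": d,
            "year": d.year,
            "month": d.month,
            "day_of_month": d.day,
            "day_of_week": d.isoweekday(),          # 1 = Monday ... 7 = Sunday
            "week_of_year": d.isocalendar()[1],
            "quarter": (d.month - 1) // 3 + 1,
            "day_of_year": d.timetuple().tm_yday,
            "month_name": d.strftime("%B"),
            "day_name": d.strftime("%A"),
        })
        d += timedelta(days=1)
    return rows
```

Precomputing these columns once means every downstream query can group by quarter, month name, and so on without repeating date arithmetic.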
Code Snippet (SCD2 Merge):
This snippet performs the SCD2 merge for dim_customers and dim_products. The scd2_merge function takes a silver table and its business keys, then adds sk (surrogate key), effective_start_date, effective_end_date, and is_current columns. It then uses a MERGE INTO SQL statement to update existing records (setting is_current to FALSE and effective_end_date for old versions) and insert new or changed records. This ensures that historical changes to customer or product attributes are preserved, allowing for accurate historical reporting.
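To make the merge semantics concrete, here is a pure-Python sketch of Type 2 versioning. The notebook uses MERGE INTO on Delta tables and supports composite business keys; this in-memory scd2_merge, its single-key parameter, and the fixed dates are illustrative only:

```python
from datetime import date

def scd2_merge(dim, incoming, business_key, today=date(2018, 1, 1)):
    """Expire changed current rows and insert new versions (Type 2 history).

    dim rows carry sk, effective_start_date, effective_end_date and is_current;
    incoming is the latest silver snapshot for the entity.
    """
    next_sk = max((r["sk"] for r in dim), default=0) + 1
    current = {r[business_key]: r for r in dim if r["is_current"]}
    for row in incoming:
        existing = current.get(row[business_key])
        if existing and all(existing.get(k) == v for k, v in row.items()):
            continue                                   # unchanged: nothing to do
        if existing:                                   # changed: close out the old version
            existing["is_current"] = False
            existing["effective_end_date"] = today
        dim.append({**row, "sk": next_sk,              # insert the new current version
                    "effective_start_date": today,
                    "effective_end_date": None,
                    "is_current": True})
        next_sk += 1
    return dim
```

Queries that need "attributes as of today" filter on is_current; point-in-time reporting filters on the effective date range instead.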
4. Accumulating Snapshot Fact Table (Gold Layer): The Core of Our Insights
This is the central analytical table, fact_order_lifecycle. It integrates cleaned data from the Silver layer with surrogate keys from the dimension tables. This table is designed to track the critical timestamps and derived durations of the order lifecycle.
Code Snippet:
This snippet builds the fact_order_lifecycle table, which is the heart of our analytical solution. It joins the silver_orders, silver_order_items, and silver_payments tables. Crucially, it also joins with the dim_customers and dim_products tables (filtered to is_current = TRUE to get the latest dimension attributes) to bring in the surrogate keys (customer_sk, product_sk). Finally, it selects relevant columns and calculates key metrics like total_price (combining item price and freight value) and the durations (days_to_ship, days_to_deliver) using date difference functions. This table provides a single, comprehensive view of each order’s journey and its associated details.
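A stdlib sketch of the same joins and derivations follows. The function, its inputs, and the sample column names are assumptions; the notebook performs these joins in Spark:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def build_fact_order_lifecycle(orders, items, payments, dim_customers, dim_products):
    """Join the silver tables with current dimension rows and derive the metrics."""
    # Surrogate-key lookups, restricted to current dimension versions.
    cust_sk = {r["customer_id"]: r["sk"] for r in dim_customers if r["is_current"]}
    prod_sk = {r["product_id"]: r["sk"] for r in dim_products if r["is_current"]}
    pay = {p["order_id"]: p for p in payments}
    facts = []
    for item in items:  # grain: one row per order item
        order = next(o for o in orders if o["order_id"] == item["order_id"])
        purchase = datetime.strptime(order["order_purchase_timestamp"], FMT)
        carrier = datetime.strptime(order["order_delivered_carrier_date"], FMT)
        customer = datetime.strptime(order["order_delivered_customer_date"], FMT)
        facts.append({
            "order_id": order["order_id"],
            "customer_sk": cust_sk[order["customer_id"]],
            "product_sk": prod_sk[item["product_id"]],
            "payment_type": pay[order["order_id"]]["payment_type"],
            "total_price": item["price"] + item["freight_value"],
            "days_to_ship": (carrier - purchase).days,
            "days_to_deliver": (customer - carrier).days,
        })
    return facts
```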
5. Gold View for Reporting: Simplified Analytics
A final, user-friendly view named gold_order_summary is created. This view simplifies data access for business intelligence tools and reporting, abstracting away the underlying complexity of joins and calculations.
Code Snippet:
This snippet creates a VIEW called gold_order_summary. This view simplifies data consumption for reporting tools by pre-joining the fact_order_lifecycle table with the dim_customers and dim_products tables (again, only considering the current dimension records). It selects key business-friendly columns like Customer_Name, product_category_name, order_date, total_price, and the calculated days_to_ship and days_to_deliver. By providing a view, business users can easily query this consolidated data without needing to understand the underlying table structures or join logic.
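As a rough analogue of what the view exposes, here is a small projection function over hypothetical fact and dimension rows (the field names mirror the columns listed above; the real object is a SQL view, not Python):

```python
def gold_order_summary(facts, dim_customers, dim_products):
    """Pre-join fact rows with current dimension attributes for reporting."""
    cust = {r["sk"]: r for r in dim_customers if r["is_current"]}
    prod = {r["sk"]: r for r in dim_products if r["is_current"]}
    return [{
        "Customer_Name": cust[f["customer_sk"]]["customer_name"],
        "product_category_name": prod[f["product_sk"]]["product_category_name"],
        "order_date": f["order_date"],
        "total_price": f["total_price"],
        "days_to_ship": f["days_to_ship"],
        "days_to_deliver": f["days_to_deliver"],
    } for f in facts]
```

Consumers see only business-friendly names; surrogate keys and join logic stay hidden behind the view.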
This layered architectural approach ensures data quality, traceability, and ease of consumption for diverse analytical requirements.