Data Preparation for YOLOv5 CV model in Retail Analytics

May 20, 2025
5 minutes

Introduction

Preparing high-quality datasets for computer vision model training demands a deliberate approach to addressing marketing bias. While our earlier article introduced the concept at a high level, this recipe focuses on the practical side, providing data engineers with a technical, developer-focused walkthrough for converting theoretical bias dimensions from a bias matrix into concrete, production-ready data pipelines. For full context, we recommend reviewing our foundational pieces on the LLM data preparation framework and the underlying principles of marketing bias.

Business Background

A multi-channel retailer aimed to deepen its understanding of in-store customer behavior through the use of YOLOv5-based computer vision technology. Partnering with a specialized AI consultancy, the retailer began deploying custom-trained models on edge devices throughout its store network, with the goal of eventually scaling coverage to all locations.

This initiative went beyond passive observation. By linking visual insights to their e-commerce and product information management (PIM) systems, the retailer sought to generate actionable intelligence, such as identifying when customers examine labels but don’t purchase, or detecting common interest in overlooked products. These signals could then power real-time personalization and long-term experience optimization.

To achieve this, a critical early step was the creation of a training dataset that accurately reflected the diversity of in-store shopping behavior, not just across demographics or store types, but across time. That required a dedicated effort to identify and mitigate temporal bias.

Solution Overview

YOLOv5 models in this setting are trained on annotated video frames that capture real-world customer interactions in-store. The typical process involves:

Extracting frames from raw video footage
Annotating those frames with structured behavior labels
Training the model on this annotated dataset
Deploying the model to edge devices
Streaming real-time frame data or interaction events to local servers for further actioning

These structured interaction events are then forwarded by the local server to backend enterprise systems (e.g., CRM, PIM, e-commerce) where they inform decisions ranging from product placement to automated marketing offers.

But none of this works reliably without an unbiased dataset. Specifically, temporal bias—the uneven representation of certain time windows in the dataset—can significantly degrade model performance during underrepresented conditions like early mornings, off-season shopping, or promotional periods.

Bridging Marketing and Engineering: The Bias Matrix

To formalize their bias mitigation strategy, the team adopted a bias matrix—a shared framework developed by marketing and analytics stakeholders. This matrix identifies key bias categories, defines how they should be measured, establishes acceptable target distributions, and outlines mitigation strategies.

Here is a simplified excerpt from the full matrix:

Bias Category	Bias Type	Measurement Method	Target Distribution	Mitigation Strategy
Temporal Bias	Day of Week	% frames per day	Mon–Fri: 10–15% each; Sat–Sun: 15–20% each	Stratified sampling across days
Temporal Bias	Time of Day	% frames per time block	Morning: 20–25%; Afternoon: 30–35%; Evening: 25–30%; Night: 10–15%	Weighted extraction from recordings
Temporal Bias	Seasonal	% frames per season	Spring, Summer, Fall, Winter: 20–30% each	Multi-season collection and augmentation
Temporal Bias	Promotional vs. Normal	% frames during promotions	Normal: 60–70%; Promotional: 30–40%	Targeted collection during promotions

While the full matrix includes other bias types—such as demographic, store environment, and product interaction biases—this guide focuses exclusively on temporal bias.

In the sections that follow, we will walk through how to operationalize each temporal subcategory into a robust and reusable data pipeline that ensures balanced temporal representation during training.

This includes:

Structuring metadata for temporal traceability
Monitoring live data distributions
Designing stratified and weighted sampling logic
Implementing augmentation for rare conditions
Introducing performance-driven feedback loops

Together, these steps form a pipeline blueprint that transforms the bias matrix from a theoretical checklist into a concrete system architecture.

Technical Implementation

The temporal bias rows of the matrix require the training data to follow this temporal distribution:

Day of Week: Mon–Fri (10–15% each), Sat–Sun (15–20% each)
Time of Day: Morning (20–25%), Afternoon (30–35%), Evening (25–30%), Night (10–15%)
Season: Each season (20–30%)
Promotional Period: Normal (60–70%), Promotional (30–40%)

In other words, out of every 100 samples, each weekday (Monday through Friday) should contribute 10 to 15 samples and each weekend day 15 to 20, with the seven days together adding up to exactly 100. The same kind of constraint applies to time of day, season, and promotional period.
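
Before building anything on these ranges, it can help to confirm that each dimension can actually add up to 100%. The following is a minimal sketch of such a check; the ranges are copied from the matrix excerpt, while the function names `is_feasible` and `is_valid_split` are illustrative and not part of the original pipeline.

```python
# Target ranges (percent) copied from the bias matrix excerpt.
TARGET_RANGES = {
    "day_of_week": {
        "Mon": (10, 15), "Tue": (10, 15), "Wed": (10, 15), "Thu": (10, 15),
        "Fri": (10, 15), "Sat": (15, 20), "Sun": (15, 20),
    },
    "time_of_day": {
        "morning": (20, 25), "afternoon": (30, 35),
        "evening": (25, 30), "night": (10, 15),
    },
    "season": {s: (20, 30) for s in ("spring", "summer", "fall", "winter")},
    "promotional": {"normal": (60, 70), "promotional": (30, 40)},
}


def is_feasible(ranges):
    """Return True if some split inside the ranges can total exactly 100%."""
    lowest = sum(low for low, _ in ranges.values())
    highest = sum(high for _, high in ranges.values())
    return lowest <= 100 <= highest


def is_valid_split(shares, ranges):
    """Check that a concrete split sums to 100% and respects every range."""
    in_range = all(ranges[k][0] <= shares.get(k, 0) <= ranges[k][1] for k in ranges)
    return in_range and abs(sum(shares.values()) - 100) < 1e-9


if __name__ == "__main__":
    for dimension, ranges in TARGET_RANGES.items():
        print(dimension, "feasible:", is_feasible(ranges))
```

A split such as 12% for each weekday and 20% for each weekend day passes `is_valid_split`, whereas 15% for every weekday plus 20% for each weekend day stays inside every range but sums to 115% and is rejected.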

The process of translating the bias matrix into concrete data pipelines is outlined below.

Step 1: Extract Image Frames from Video

This is the first step in the data preparation process: parsing the raw video footage into image frames. YOLOv5 is trained on annotated image frames, not on whole video files.

The input to this step is a raw video file (MP4, AVI, etc.) captured from in-store cameras. Each video file represents a continuous sequence of real-world customer activity within a store over a certain time period (e.g., 1 hour of footage from a Tuesday afternoon in Store #42).
Extraction can run as a batch process using OpenCV's Python module (cv2), which takes the raw video file as input and writes out individual frames.

				
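import cv2
import os

def extract_frames(video_path, output_dir, sampling_rate=1):
    """Save roughly `sampling_rate` frames per second of video as JPEG files."""
    os.makedirs(output_dir, exist_ok=True)  # create the output folder if missing
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    # Keep every Nth frame; the interval must be at least 1 to avoid modulo by zero
    frame_interval = max(1, int(fps / sampling_rate))

    frame_count = 0
    saved_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break  # end of video or unreadable frame
        if frame_count % frame_interval == 0:
            frame_name = f"frame_{saved_count:05d}.jpg"
            cv2.imwrite(os.path.join(output_dir, frame_name), frame)
            saved_count += 1
        frame_count += 1

    cap.release()
    return saved_count  # number of frames written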

Each frame is now an image file (e.g., frame_00001.jpg) and a candidate training example, but it is not usable until it has been enriched with metadata.

Step 2: Enrich Frames With Metadata

Frames alone are just pixels. To implement bias mitigation, we must know when and where each frame was captured. Metadata provides:

The timestamp (e.g., May 2, 2025, 14:33)
The day of the week (e.g., Friday)
The time block (e.g., afternoon)
The season (e.g., spring)
Whether it was during a promotional period
Context about the store (location, format, department)

This enriched metadata allows us to:

Monitor bias (e.g., 22% of frames are from Friday afternoons)
Sample intentionally (e.g., we need more morning frames)
Augment strategically (e.g., simulate winter lighting)

To hold this metadata, you need to alter your existing table or create a new one:

				
CREATE TABLE frame_metadata (
    frame_id VARCHAR(255) PRIMARY KEY,
    timestamp DATETIME,
    day_of_week TINYINT,
    time_block VARCHAR(20),
    season VARCHAR(20),
    is_promotional BOOLEAN,
    frame_path VARCHAR(255)
);

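Your code must now derive this metadata for each frame. The script below reads the recording start time from the video filename and adds each frame's offset into the video to estimate its timestamp.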

				
import cv2
import os
import pandas as pd
from datetime import datetime, timedelta

# ------------- Helper Functions -------------

def categorize_time_block(hour):
    """Classify the hour into a time block."""
    if 6 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 17:
        return 'afternoon'
    elif 17 <= hour < 21:
        return 'evening'
    else:
        return 'night'

def determine_season(month):
    """Map month to a season."""
    if month in [12, 1, 2]:
        return 'winter'
    elif month in [3, 4, 5]:
        return 'spring'
    elif month in [6, 7, 8]:
        return 'summer'
    else:
        return 'fall'

def check_promotional(timestamp, promo_calendar):
    """Return True if the date is in the promotional calendar."""
    date_str = timestamp.strftime('%Y-%m-%d')
    return date_str in promo_calendar

def parse_timestamp_from_filename(filename):
    """
    Extract timestamp from filename format: storeID_YYYY-MM-DD_HH-MM.mp4
    Example: 'store42_2025-05-10_14-00.mp4'
    """
    # Drop the extension (.mp4, .avi, ...) before splitting on underscores
    parts = os.path.splitext(filename)[0].split('_')
    if len(parts) < 3:
        raise ValueError("Filename does not contain timestamp in expected format.")
    timestamp_str = f"{parts[1]}_{parts[2]}"
    return datetime.strptime(timestamp_str, '%Y-%m-%d_%H-%M')

# ------------- Main Function -------------

def extract_frames_with_metadata(video_path, output_dir, sampling_rate=1, store_info=None, promo_calendar=None):
    """
    Extract frames from a video and enrich each with temporal metadata.
    
    Args:
        video_path (str): Path to input video file.
        output_dir (str): Directory to save extracted frames.
        sampling_rate (int): Frames per second to extract.
        store_info (dict): Optional store metadata.
        promo_calendar (set): Set of promotional dates (YYYY-MM-DD).
    
    Returns:
        List of dictionaries, each containing metadata for one frame.
    """
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)

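    fps = cap.get(cv2.CAP_PROP_FPS)
    if fps <= 0:
        cap.release()
        raise ValueError(f"Could not read a frame rate from {video_path}")
    # Keep every Nth frame; the interval must be at least 1 to avoid modulo by zero
    frame_interval = max(1, int(fps / sampling_rate))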
    frame_index = 0
    saved_index = 0
    metadata_records = []

    # Get base timestamp from video filename; the stem also prefixes frame IDs
    video_name = os.path.basename(video_path)
    video_stem = os.path.splitext(video_name)[0]
    base_timestamp = parse_timestamp_from_filename(video_name)

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if frame_index % frame_interval == 0:
            # Save frame image; the video stem keeps IDs unique across videos
            frame_id = f"{video_stem}_{saved_index:05d}"
            frame_filename = f"{frame_id}.jpg"
            frame_path = os.path.join(output_dir, frame_filename)
            cv2.imwrite(frame_path, frame)

            # Estimate timestamp for this frame
            frame_timestamp = base_timestamp + timedelta(seconds=(frame_index / fps))

            # Generate metadata
            metadata = {
                'frame_id': frame_id,
                'timestamp': frame_timestamp.isoformat(),
                'day_of_week': frame_timestamp.weekday(),
                'time_block': categorize_time_block(frame_timestamp.hour),
                'season': determine_season(frame_timestamp.month),
                'is_promotional': check_promotional(frame_timestamp, promo_calendar or set()),
                'frame_path': frame_path
            }

            if store_info:
                metadata.update({
                    'store_id': store_info.get('store_id'),
                    'store_format': store_info.get('format'),
                    'location_type': store_info.get('location_type'),
                    'department': store_info.get('department')
                })

            metadata_records.append(metadata)
            saved_index += 1

        frame_index += 1

    cap.release()
    return metadata_records

# ------------- Run Script -------------

if __name__ == "__main__":
    # Input video (must include timestamp in filename)
    video_file = "store42_2025-05-10_14-00.mp4"
    output_folder = "frames_output"

    # Example store metadata
    store_metadata = {
        'store_id': 'store42',
        'format': 'medium',
        'location_type': 'urban',
        'department': 'snacks'
    }

    # Example promotional calendar
    promotional_dates = {"2025-05-10", "2025-05-11"}

    # Run extraction
    metadata = extract_frames_with_metadata(
        video_path=video_file,
        output_dir=output_folder,
        sampling_rate=1,  # 1 frame per second
        store_info=store_metadata,
        promo_calendar=promotional_dates
    )

    # Save metadata to CSV
    pd.DataFrame(metadata).to_csv("frame_metadata.csv", index=False)
    print(f"Extracted {len(metadata)} frames and saved metadata to frame_metadata.csv.")
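
The records returned by extract_frames_with_metadata can also be written to the frame_metadata table defined earlier instead of (or in addition to) a CSV file. The sketch below assumes SQLite through Python's built-in sqlite3 module as a stand-in for whichever database actually hosts the table; `load_records` is a hypothetical helper, and `INSERT OR REPLACE` is SQLite-specific syntax that would need adapting for other engines.

```python
import sqlite3

# Same columns as the frame_metadata table defined earlier in this step.
DDL = """
CREATE TABLE IF NOT EXISTS frame_metadata (
    frame_id VARCHAR(255) PRIMARY KEY,
    timestamp DATETIME,
    day_of_week TINYINT,
    time_block VARCHAR(20),
    season VARCHAR(20),
    is_promotional BOOLEAN,
    frame_path VARCHAR(255)
)
"""

COLUMNS = ("frame_id", "timestamp", "day_of_week", "time_block",
           "season", "is_promotional", "frame_path")


def load_records(conn, records):
    """Insert metadata dicts into frame_metadata and return the table's row count.

    Keys that are not table columns (e.g. store_id) are ignored in this sketch.
    """
    conn.execute(DDL)
    rows = [tuple(rec[col] for col in COLUMNS) for rec in records]
    # INSERT OR REPLACE (SQLite syntax) keeps re-runs idempotent per frame_id.
    conn.executemany(
        "INSERT OR REPLACE INTO frame_metadata VALUES (?, ?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM frame_metadata").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # in-memory database for the demo
    example = [{
        "frame_id": "store42_2025-05-10_14-00_00000",
        "timestamp": "2025-05-10T14:00:00",
        "day_of_week": 5,
        "time_block": "afternoon",
        "season": "spring",
        "is_promotional": True,
        "frame_path": "frames_output/store42_2025-05-10_14-00_00000.jpg",
    }]
    print(load_records(connection, example), "row(s) stored")
```

Because frame_id is the primary key, this only works across multiple videos if frame IDs are unique per video, for example by including the video name in the ID.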

Step 3: Temporal Distribution Monitoring

The goal of this step is to measure and track how well your current dataset matches the temporal targets defined in the bias matrix. You’re not yet selecting or sampling data here—you’re auditing what’s already been collected or extracted.

This step helps answer key questions like:

Are we overrepresenting certain days or time blocks?
Have we collected enough data from underrepresented seasons?
Are promotional periods adequately covered?
Are we drifting away from the target distribution?

The output of this step is a set of summary statistics and optionally a dashboard or alert system that gives visibility into bias risks.

Why This Matters

Before you can fix a bias, you have to see it.

Raw data pipelines often reflect uncontrolled collection patterns:

Store staff may record more during peak hours.
Some cameras may run longer on weekdays than weekends.
Promotions may not be tagged consistently.

If you blindly use this data for training, you risk building a model that performs well only in overrepresented conditions—and fails silently when exposed to underrepresented contexts (e.g., late nights or winter).

Temporal Distribution Monitoring gives you ground truth visibility over the balance of your dataset, so you can take action in later steps (sampling, weighting, augmentation).

What Exactly Are You Doing in This Step?

You are:

Querying your enriched metadata (from Step 2) to compute the actual distribution of frames across temporal dimensions:
- How many frames came from each day of the week?
- What percent came from each time block (morning, afternoon, evening, night)?
- What is the seasonal breakdown?
- What portion occurred during promotions?
Comparing those distributions to the target ranges specified in the bias matrix.
Identifying any deviations, i.e., subcategories that are underrepresented or overrepresented compared to the target.
(Optionally) Visualizing the metrics via bar charts, gauges, or dashboards to make imbalances obvious and track drift over time.
(Optionally) Triggering alerts if certain thresholds are violated, for example if “morning” frames fall below 15% even though the target minimum is 20%.

Sample Output for Step 3

Imagine you’ve processed 10,000 frames and want to evaluate them against the target temporal bias distribution.

1. Day of Week Distribution

Day	Frame Count	Actual %	Target Range (%)	Status
Monday	980	9.8%	10–15%	Slightly Low
Tuesday	1120	11.2%	10–15%	Within Range
Wednesday	1105	11.1%	10–15%	Within Range
Thursday	1087	10.9%	10–15%	Within Range
Friday	1012	10.1%	10–15%	Within Range
Saturday	1845	18.5%	15–20%	Within Range
Sunday	1851	18.5%	15–20%	Within Range

Note: Monday is slightly under target; all other days are compliant.

2. Time of Day Distribution

Time Block	Frame Count	Actual %	Target Range (%)	Status
Morning	1680	16.8%	20–25%	Underrepresented
Afternoon	3502	35.0%	30–35%	Within Range
Evening	2863	28.6%	25–30%	Within Range
Night	1955	19.5%	10–15%	Overrepresented

Observation: Nighttime data is significantly overrepresented; morning data is under-collected.

3. Seasonal Distribution

Season	Frame Count	Actual %	Target Range (%)	Status
Spring	2300	23.0%	20–30%	Within Range
Summer	2430	24.3%	20–30%	Within Range
Fall	2515	25.2%	20–30%	Within Range
Winter	2755	27.5%	20–30%	Within Range

Observation: Seasonal coverage is balanced—no action needed here.

4. Promotional vs. Normal Periods

Period	Frame Count	Actual %	Target Range (%)	Status
Normal	7680	76.8%	60–70%	Overrepresented
Promotional	2320	23.2%	30–40%	Underrepresented

Observation: Not enough data collected during promotional campaigns.

Summary Status Report

Dimension	Issues Detected?	Recommended Action
Day of Week	Minor (Monday < 10%)	Increase Monday frame collection
Time of Day	Yes	Reduce night sampling, increase morning frames
Season	No	No action needed
Promotional Periods	Yes	Increase targeted capture during promotions

As the status report shows, the raw data is skewed, and each flagged dimension comes with a specific corrective action.

The full script for this step is as follows:

				
import pandas as pd

# --- Bias Matrix Temporal Targets ---

BIAS_TARGETS = {
    'day_of_week': {
        0: (10, 15),  # Monday
        1: (10, 15),
        2: (10, 15),
        3: (10, 15),
        4: (10, 15),
        5: (15, 20),  # Saturday
        6: (15, 20),  # Sunday
    },
    'time_block': {
        'morning': (20, 25),
        'afternoon': (30, 35),
        'evening': (25, 30),
        'night': (10, 15),
    },
    'season': {
        'spring': (20, 30),
        'summer': (20, 30),
        'fall': (20, 30),
        'winter': (20, 30),
    },
    'is_promotional': {
        True: (30, 40),
        False: (60, 70),
    }
}

# --- Monitoring Function ---

def calculate_temporal_distribution(df, targets):
    """
    Compare actual distributions with bias matrix targets.

    Args:
        df: DataFrame containing metadata
        targets: Dict with bias targets

    Returns:
        DataFrame with one row of metrics per subcategory
    """
    summary = []

    for category, category_targets in targets.items():
        value_counts = df[category].value_counts(normalize=True) * 100  # as %
        for value, (min_t, max_t) in category_targets.items():
            actual = value_counts.get(value, 0)
            if actual < min_t:
                status = 'Underrepresented'
            elif actual > max_t:
                status = 'Overrepresented'
            else:
                status = 'Within Range'

            summary.append({
                'Dimension': category,
                'Category': value,
                'Actual %': round(actual, 2),
                'Target Range (%)': f"{min_t}–{max_t}",
                'Status': status
            })

    return pd.DataFrame(summary)

# --- Load Metadata CSV ---

def load_metadata(csv_path):
    """
    Load frame metadata CSV and ensure correct dtypes.
    """
    df = pd.read_csv(csv_path)

    # Coerce categorical types
    df['time_block'] = df['time_block'].astype(str)
    df['season'] = df['season'].astype(str)
    df['is_promotional'] = df['is_promotional'].astype(bool)
    df['day_of_week'] = df['day_of_week'].astype(int)  # 0 = Monday

    return df

# --- Run Script ---

if __name__ == "__main__":
    metadata_file = "frame_metadata.csv"
    df = load_metadata(metadata_file)
    
    report = calculate_temporal_distribution(df, BIAS_TARGETS)
    report.to_csv("temporal_bias_report.csv", index=False)

    print("\n=== Temporal Bias Summary ===\n")
    print(report)
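
The optional alerting step can be layered on top of this report. The following is a minimal sketch, assuming rows shaped like those of temporal_bias_report.csv (with the target range stored as a "min–max" string); the function name `find_violations` and the `tolerance` parameter are illustrative, and delivering the messages (email, chat, cloud alerting) is left out.

```python
def find_violations(report_rows, tolerance=0.0):
    """Return one alert message per row that misses its target range.

    report_rows: dicts with 'Dimension', 'Category', 'Actual %' and
                 'Target Range (%)' keys, as produced by the report above.
    tolerance:   percentage points of slack allowed before alerting
                 (an assumed knob, not part of the bias matrix).
    """
    alerts = []
    for row in report_rows:
        # The report stores ranges as "min–max" with an en dash.
        low, high = (float(x) for x in str(row["Target Range (%)"]).split("–"))
        actual = float(row["Actual %"])
        label = f"{row['Dimension']}={row['Category']}"
        if actual < low - tolerance:
            alerts.append(
                f"{label}: {actual:.1f}% is {low - actual:.1f} points below the minimum of {low:g}%"
            )
        elif actual > high + tolerance:
            alerts.append(
                f"{label}: {actual:.1f}% is {actual - high:.1f} points above the maximum of {high:g}%"
            )
    return alerts


if __name__ == "__main__":
    sample_rows = [
        {"Dimension": "day_of_week", "Category": 0, "Actual %": 9.8, "Target Range (%)": "10–15"},
        {"Dimension": "time_block", "Category": "morning", "Actual %": 16.8, "Target Range (%)": "20–25"},
        {"Dimension": "time_block", "Category": "night", "Actual %": 19.5, "Target Range (%)": "10–15"},
    ]
    for message in find_violations(sample_rows, tolerance=0.5):
        print("ALERT:", message)
```

With a pandas report, the rows can be obtained via `report.to_dict("records")`. A small tolerance keeps marginal misses, such as Monday at 9.8%, from triggering alerts on every run.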

Step 4: Obtain Temporal Balance

The goal of this step is to select a balanced subset of frames from your raw dataset that closely matches the temporal distribution targets defined in the bias matrix. You are no longer just observing bias (as in Step 3); you are actively shaping your training dataset to correct for it.

This ensures that:

Underrepresented time categories (e.g. winter evenings or promotional mornings) are not ignored
Overrepresented categories (e.g. Saturday afternoons) are not allowed to dominate model training
Your model is exposed to diverse time-based conditions, improving generalization and robustness

Why This Matters

Without stratified sampling, your training data may inherit the imbalances of how and when footage was collected. For example:

If cameras record mostly during the day, the model may fail at night.
If most footage is from promotional events, the model might overfit to campaign behaviors.
If winter shopping is underrepresented, the model may misinterpret seasonal behaviors (e.g., people wearing bulky coats, less time browsing).

This step gives you fine-grained control over what data is included, ensuring that the final dataset reflects your fairness goals.
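
One simple way to quantify how strongly each category needs to be up- or down-sampled is the ratio of its target share to its actual share. The sketch below illustrates the idea with the time-of-day shares from the sample output above; the target points (24/33/28/15) are an assumption chosen inside the matrix ranges so that they sum to 100, and `sampling_weights` is a hypothetical helper rather than part of the original pipeline.

```python
def sampling_weights(actual_pct, target_pct):
    """Weight each category by target share / actual share.

    A weight above 1 means the category should be drawn more often than it
    occurs in the raw data; below 1 means it should be drawn less often.
    Categories with no data get None: they need new collection or augmentation.
    """
    weights = {}
    for category, target in target_pct.items():
        actual = actual_pct.get(category, 0.0)
        weights[category] = target / actual if actual > 0 else None
    return weights


if __name__ == "__main__":
    # Actual shares taken from the time-of-day sample output.
    actual = {"morning": 16.8, "afternoon": 35.0, "evening": 28.6, "night": 19.5}
    # Assumed target points inside the bias matrix ranges (sum to 100).
    target = {"morning": 24.0, "afternoon": 33.0, "evening": 28.0, "night": 15.0}
    for block, weight in sampling_weights(actual, target).items():
        print(f"{block}: weight {weight:.2f}")
```

Weights like these could feed a weighted random draw, but the stratified approach used in this step selects fixed counts per category instead, which gives tighter control over the final distribution.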

What Exactly Are You Doing?

You are:

Using the metadata-enriched frame dataset (from Step 2)
Using the temporal bias targets (defined in the matrix and checked against the data in Step 3)
Calculating the number of frames needed for each temporal subcategory (e.g., 2,500 frames from morning sessions)
Querying or filtering the metadata to pull that exact number of frames per category (or as close as possible)
Compensating if some bins don’t have enough data by logging, augmenting (in later steps), or rebalancing

The output is a balanced, bias-mitigated training dataset ready for annotation and YOLOv5 training.

				
import pandas as pd

# --- Define Temporal Bias Targets ---

TEMPORAL_TARGETS = {
    'time_block': {
        'morning': 0.24,     # 24% (target range 20–25%)
        'afternoon': 0.33,   # 33% (target range 30–35%)
        'evening': 0.28,     # 28% (target range 25–30%)
        'night': 0.15        # 15% (target range 10–15%)
    },
    'season': {
        'spring': 0.25,
        'summer': 0.25,
        'fall': 0.25,
        'winter': 0.25
    },
    'is_promotional': {
        True: 0.35,
        False: 0.65
    }
}

# --- Utility Function to Calculate Sampling Plan ---

def compute_stratified_frame_counts(total_frames, targets):
    """
    Calculate number of frames needed per combination of temporal categories.
    
    Args:
        total_frames: Total number of frames to select.
        targets: Nested dict of target proportions per temporal dimension.
    
    Returns:
        Dict mapping category combinations to number of frames.
    """
    target_counts = {}

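    # Build all combinations (cartesian product of categories). The share of a
    # combination is the product of its per-dimension targets, i.e. the
    # dimensions are balanced independently of one another.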
    for t_block, p_time in targets['time_block'].items():
        for season, p_season in targets['season'].items():
            for promo, p_promo in targets['is_promotional'].items():
                combo = (t_block, season, promo)
                proportion = p_time * p_season * p_promo
                target_counts[combo] = int(round(total_frames * proportion))
    
    return target_counts

# --- Sample Frames from Metadata Based on Category Combinations ---

def stratified_sample_frames(df, target_counts):
    """
    Select frames from the metadata DataFrame according to target counts.

    Args:
        df: Metadata DataFrame.
        target_counts: Dict with category combinations as keys and sample sizes as values.

    Returns:
        List of selected frame_ids.
    """
    selected_frames = []

    for combo, count in target_counts.items():
        time_block, season, promo = combo

        subset = df[
            (df['time_block'] == time_block) &
            (df['season'] == season) &
            (df['is_promotional'] == promo)
        ]

        available = len(subset)

        if available == 0:
            print(f"WARNING: No frames found for {combo}")
            continue

        sample_size = min(count, available)
        sampled_ids = subset.sample(n=sample_size, replace=False)['frame_id'].tolist()
        selected_frames.extend(sampled_ids)

        print(f"Selected {sample_size}/{count} frames for {combo} (available: {available})")

    return selected_frames

# --- Main Execution ---

def run_stratified_sampling(metadata_path, total_frames_to_select, output_path):
    df = pd.read_csv(metadata_path)

    # Ensure proper types
    df['time_block'] = df['time_block'].astype(str)
    df['season'] = df['season'].astype(str)
    df['is_promotional'] = df['is_promotional'].astype(bool)

    # Step 1: Compute target counts per category combo
    target_counts = compute_stratified_frame_counts(total_frames_to_select, TEMPORAL_TARGETS)

    # Step 2: Select frames
    selected_ids = stratified_sample_frames(df, target_counts)

    # Step 3: Save results
    selected_df = df[df['frame_id'].isin(selected_ids)]
    selected_df.to_csv(output_path, index=False)

    print(f"\nStratified sampling complete. {len(selected_ids)} frames written to {output_path}")

# --- Example Usage ---

if __name__ == "__main__":
    metadata_csv = "frame_metadata.csv"  # from Step 2
    output_file = "stratified_sample.csv"
    sample_size = 10000  # Adjust based on your target dataset size

    run_stratified_sampling(metadata_csv, sample_size, output_file)

Output

A file stratified_sample.csv containing the subset of metadata for the selected frames
Each frame respects the desired cross-distribution across:
- time_block × season × is_promotional
Warnings will be shown if not enough frames exist for any bin
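
One caveat with per-combination counts: when each cell is rounded independently, the counts are not guaranteed to sum to the requested total. A largest-remainder allocation avoids this by rounding every cell down and then handing the leftover frames to the cells with the largest fractional parts. The following is a sketch of that alternative, not the method used in the script above; `allocate_counts` is a hypothetical name, and the sketch assumes each dimension's proportions sum to 1.

```python
from itertools import product
from math import floor


def allocate_counts(total_frames, targets):
    """Split total_frames across all category combinations so counts sum exactly.

    targets: {dimension: {category: proportion}}, each dimension summing to 1.
    Returns a dict keyed by tuples of categories, in the order of `targets`.
    """
    dims = list(targets)
    quotas = {}
    for combo in product(*(targets[d].items() for d in dims)):
        key = tuple(category for category, _ in combo)
        share = 1.0
        for _, proportion in combo:
            share *= proportion
        quotas[key] = total_frames * share  # ideal, possibly fractional, count

    # Round every cell down first.
    counts = {key: floor(quota) for key, quota in quotas.items()}
    leftover = max(total_frames - sum(counts.values()), 0)

    # Hand the remaining frames to the cells with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for key in by_remainder[:leftover]:
        counts[key] += 1
    return counts


if __name__ == "__main__":
    example_targets = {
        "time_block": {"morning": 0.24, "afternoon": 0.33, "evening": 0.28, "night": 0.15},
        "season": {"spring": 0.25, "summer": 0.25, "fall": 0.25, "winter": 0.25},
        "is_promotional": {True: 0.35, False: 0.65},
    }
    plan = allocate_counts(1003, example_targets)
    print(len(plan), "cells,", sum(plan.values()), "frames in total")
```

The returned dictionary uses the same (time_block, season, is_promotional) key order as the sampling plan above, so it could be passed to the same sampling function.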

You now have a bias-aware, balanced dataset ready for use in fine-tuning. The next step could be to copy the selected frames into a new dataset directory structure, or to feed them into a downstream annotation process.

Pipeline Deployment Overview: Azure Cloud

Core Azure Services

Pipeline Component	Azure Service Used
Raw video storage	Azure Blob Storage
Job orchestration	Azure Data Factory or Azure Synapse Pipelines
Compute for scripts	Azure Batch, Azure Functions, or Azure ML Pipelines
Metadata database	Azure SQL Database or Azure Data Lake
Monitoring & alerts	Azure Monitor + Log Analytics
Final dataset storage	Azure Data Lake Gen2 or Blob Storage

Sample Pipeline in Azure Data Factory

Trigger: When new video is uploaded to a Blob container.
Step 1 (Metadata Enrichment):
- Run a Python script in Azure Batch or Azure ML Compute.
- Output: Frame images + metadata CSV to Azure Blob / Data Lake.
Step 2 (Monitoring):
- Use a pipeline step to run the audit script.
- Save the bias report as a CSV or send alerts via Azure Monitor.
Step 3 (Stratified Sampling):
- Run your stratified sampler Python script as a Custom Activity.
- Output: Selected metadata → used for dataset creation.

Pipeline Deployment Overview: AWS Cloud

Core AWS Services

Pipeline Component	AWS Service Used
Raw video storage	Amazon S3
Job orchestration	AWS Step Functions or Amazon MWAA (Airflow)
Compute for scripts	AWS Lambda, EC2, or AWS Batch
Metadata database	Amazon RDS or Amazon Athena
Monitoring & alerts	Amazon CloudWatch + SNS
Final dataset storage	S3 (versioned buckets)

Sample Pipeline in AWS Step Functions

Trigger: S3 event when a new video file is uploaded.
Step 1 (Metadata Enrichment):
- Run your Python script in AWS Batch or Lambda.
- Store output images + metadata CSV in a processed/ S3 path.
Step 2 (Bias Monitoring):
- Run audit job using a Lambda or Batch job.
- Store report in S3 and send alerts via SNS if any category drifts.
Step 3 (Stratified Sampling):
- Execute as a Lambda or Docker job on ECS/Fargate.
- Output selected frame list to S3.
Step 4 (Optional Augmentation):
- Trigger augmentation only for flagged categories (CloudWatch metric).
Step 5 (Dataset Packaging):
- Move sampled frames into versioned dataset/ folders (e.g., v1/train/, v1/test/).
- Store metadata manifest (manifest.json) in each dataset folder.
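
The five steps above could be wired together in a state machine definition along the following lines. This is an illustrative sketch in Amazon States Language, built as a Python dictionary: the task ARNs are placeholders, the state names are made up for this example, and the `$.needs_augmentation` flag is an assumed field that the sampling task would have to emit (the text above mentions a CloudWatch metric as the trigger instead).

```python
import json


def build_state_machine(arns):
    """Return an Amazon States Language definition for the five pipeline steps.

    arns: dict of (hypothetical) Lambda or Batch task ARNs keyed by step name.
    """
    return {
        "Comment": "Temporal-bias data preparation pipeline (illustrative)",
        "StartAt": "MetadataEnrichment",
        "States": {
            "MetadataEnrichment": {
                "Type": "Task", "Resource": arns["enrich"], "Next": "BiasMonitoring",
            },
            "BiasMonitoring": {
                "Type": "Task", "Resource": arns["monitor"], "Next": "StratifiedSampling",
            },
            "StratifiedSampling": {
                "Type": "Task", "Resource": arns["sample"], "Next": "NeedsAugmentation",
            },
            # Assumed flag in the state input decides whether to augment.
            "NeedsAugmentation": {
                "Type": "Choice",
                "Choices": [{
                    "Variable": "$.needs_augmentation",
                    "BooleanEquals": True,
                    "Next": "Augmentation",
                }],
                "Default": "DatasetPackaging",
            },
            "Augmentation": {
                "Type": "Task", "Resource": arns["augment"], "Next": "DatasetPackaging",
            },
            "DatasetPackaging": {
                "Type": "Task", "Resource": arns["package"], "End": True,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs; replace REGION and ACCOUNT_ID with real values.
    placeholder_arns = {
        step: f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{step}"
        for step in ("enrich", "monitor", "sample", "augment", "package")
    }
    print(json.dumps(build_state_machine(placeholder_arns), indent=2))
```

The printed JSON is what would be supplied as the state machine definition; the S3 upload trigger and SNS alerts are configured separately and are not part of this sketch.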
