
Data Preparation for YOLOv5 CV model in Retail Analytics

Introduction

Preparing high-quality training datasets for computer vision models demands a deliberate approach to addressing marketing bias. While our earlier article introduced the concept at a high level, this recipe focuses on the practical side, providing data engineers with a technical, developer-focused walkthrough for converting theoretical bias dimensions from a bias matrix into concrete, production-ready data pipelines. For full context, we recommend reviewing our foundational pieces on the LLM data preparation framework and the underlying principles of marketing bias.

Business Background

A multi-channel retailer aimed to deepen its understanding of in-store customer behavior through the use of YOLOv5-based computer vision technology. Partnering with a specialized AI consultancy, the retailer began deploying custom-trained models on edge devices throughout its store network, with the goal of eventually scaling coverage to all locations.

This initiative went beyond passive observation. By linking visual insights to their e-commerce and product information management (PIM) systems, the retailer sought to generate actionable intelligence, such as identifying when customers examine labels but don’t purchase, or detecting common interest in overlooked products. These signals could then power real-time personalization and long-term experience optimization.

To achieve this, a critical early step was the creation of a training dataset that accurately reflected the diversity of in-store shopping behavior, not just across demographics or store types, but across time. That required a dedicated effort to identify and mitigate temporal bias.

Solution Overview

YOLOv5 models in this setting are trained on annotated video frames that capture real-world customer interactions in-store. The typical process involves:

  • Extracting frames from raw video footage
  • Annotating those frames with structured behavior labels
  • Training the model on this annotated dataset
  • Deploying the model to edge devices
  • Streaming real-time frame data or interaction events to local servers for further action

These structured interaction events are then forwarded by the local server to backend enterprise systems (e.g., CRM, PIM, e-commerce) where they inform decisions ranging from product placement to automated marketing offers.

But none of this works reliably without an unbiased dataset. Specifically, temporal bias—the uneven representation of certain time windows in the dataset—can significantly degrade model performance during underrepresented conditions like early mornings, off-season shopping, or promotional periods.

Bridging Marketing and Engineering: The Bias Matrix

To formalize their bias mitigation strategy, the team adopted a bias matrix—a shared framework developed by marketing and analytics stakeholders. This matrix identifies key bias categories, defines how they should be measured, establishes acceptable target distributions, and outlines mitigation strategies.

Here is a simplified excerpt from the full matrix:

| Bias Category | Bias Type | Measurement Method | Target Distribution | Mitigation Strategy |
|---|---|---|---|---|
| Temporal Bias | Day of Week | % frames per day | Mon–Fri: 10–15% each; Sat–Sun: 15–20% each | Stratified sampling across days |
| Temporal Bias | Time of Day | % frames per time block | Morning: 20–25%; Afternoon: 30–35%; Evening: 25–30%; Night: 10–15% | Weighted extraction from recordings |
| Temporal Bias | Seasonal | % frames per season | Spring, Summer, Fall, Winter: 20–30% each | Multi-season collection and augmentation |
| Temporal Bias | Promotional vs. Normal | % frames during promotions | Normal: 60–70%; Promotional: 30–40% | Targeted collection during promotions |

While the full matrix includes other bias types—such as demographic, store environment, and product interaction biases—this guide focuses exclusively on temporal bias.

In the sections that follow, we will walk through how to operationalize each temporal subcategory into a robust and reusable data pipeline that ensures balanced temporal representation during training.

This includes:

  1. Structuring metadata for temporal traceability
  2. Monitoring live data distributions
  3. Designing stratified and weighted sampling logic
  4. Implementing augmentation for rare conditions
  5. Introducing performance-driven feedback loops

Together, these steps form a pipeline blueprint that transforms the bias matrix from a theoretical checklist into a concrete system architecture.

Technical Implementation

The temporal bias dimension implies that the training data must satisfy the following temporal distribution:

  1. Day of Week: Mon–Fri (10–15% each), Sat–Sun (15–20% each)
  2. Time of Day: Morning (20–25%), Afternoon (30–35%), Evening (25–30%), Night (10–15%)
  3. Season: Each season (20–30%)
  4. Promotional Period: Normal (60–70%), Promotional (30–40%)

This implies that, for every 100 samples, each weekday (Monday through Friday) should contribute between 10 and 15 samples, while Saturday and Sunday should each contribute between 15 and 20, with the chosen values summing to exactly 100. Similar conditions must be met for time of day, season, and promotional period.
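As a quick sanity check before building anything, the sketch below (our own illustration; the per-day values are assumptions picked from within the matrix ranges) verifies that a concrete day-of-week allocation is feasible:

# Feasibility check: choose per-day sample counts inside the bias-matrix
# ranges and confirm they sum to exactly 100. The chosen values are
# illustrative assumptions, not prescribed by the matrix.
DAY_RANGES = {
    'Mon': (10, 15), 'Tue': (10, 15), 'Wed': (10, 15), 'Thu': (10, 15),
    'Fri': (10, 15), 'Sat': (15, 20), 'Sun': (15, 20),
}
allocation = {'Mon': 12, 'Tue': 12, 'Wed': 12, 'Thu': 12,
              'Fri': 12, 'Sat': 20, 'Sun': 20}

assert sum(allocation.values()) == 100, "quota must sum to 100"
for day, n in allocation.items():
    lo, hi = DAY_RANGES[day]
    assert lo <= n <= hi, f"{day}: {n} samples outside {lo}-{hi} range"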

The process of translating the bias matrix into concrete data pipelines is outlined below.

Step 1: Extract Image Frames from Video

This is the first step in the data preparation process: parsing the raw video footage into image frames. Note that YOLOv5 is trained on annotated image frames, not on whole video files.

  1. Extract frames from raw video footage – The input to this step is a raw video file (MP4, AVI, etc.) captured from in-store cameras. Each video file represents a continuous sequence of real-world customer activity within a store over a certain time period (e.g., 1 hour of footage from a Tuesday afternoon in Store #42).
  2. The extraction can be run as a batch process using the Python OpenCV module (cv2), which takes the raw video file as input and writes out individual frames.
import cv2
import os

def extract_frames(video_path, output_dir, sampling_rate=1):
    """Extract `sampling_rate` frames per second from a video file."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    frame_interval = max(1, int(fps / sampling_rate))

    frame_count = 0
    saved_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # Keep only every frame_interval-th frame
        if frame_count % frame_interval == 0:
            frame_name = f"frame_{saved_count:05d}.jpg"
            cv2.imwrite(os.path.join(output_dir, frame_name), frame)
            saved_count += 1
        frame_count += 1

    cap.release()

Each frame is now an image file (e.g., frame_00001.jpg) that becomes a candidate training example—but it isn't usable until it's enriched with metadata.

Step 2: Enrich Frames With Metadata

Frames alone are just pixels. To implement bias mitigation, we must know when and where each frame was captured. Metadata provides:

  • The timestamp (e.g., May 2, 2025, 14:33)

  • The day of the week (e.g., Friday)

  • The time block (e.g., afternoon)

  • The season (e.g., spring)

  • Whether it was during a promotional period

  • Context about the store (location, format, department)

This enriched metadata allows us to:

  • Monitor bias (e.g., 22% of frames are from Friday afternoons)
  • Sample intentionally (e.g., we need more morning frames)
  • Augment strategically (e.g., simulate winter lighting)

To hold this metadata, alter an existing table or create a new one:

CREATE TABLE frame_metadata (
    frame_id VARCHAR(255) PRIMARY KEY,
    timestamp DATETIME,
    day_of_week TINYINT,        -- 0 = Monday ... 6 = Sunday
    time_block VARCHAR(20),     -- morning / afternoon / evening / night
    season VARCHAR(20),         -- spring / summer / fall / winter
    is_promotional BOOLEAN,
    frame_path VARCHAR(255)
);

Your code must now extract this metadata from each frame.

import cv2
import os
import pandas as pd
from datetime import datetime, timedelta

# ------------- Helper Functions -------------

def categorize_time_block(hour):
    """Classify the hour into a time block."""
    if 6 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 17:
        return 'afternoon'
    elif 17 <= hour < 21:
        return 'evening'
    else:
        return 'night'

def determine_season(month):
    """Map month to a season."""
    if month in [12, 1, 2]:
        return 'winter'
    elif month in [3, 4, 5]:
        return 'spring'
    elif month in [6, 7, 8]:
        return 'summer'
    else:
        return 'fall'

def check_promotional(timestamp, promo_calendar):
    """Return True if the date is in the promotional calendar."""
    date_str = timestamp.strftime('%Y-%m-%d')
    return date_str in promo_calendar

def parse_timestamp_from_filename(filename):
    """
    Extract timestamp from filename format: storeID_YYYY-MM-DD_HH-MM.mp4
    Example: 'store42_2025-05-10_14-00.mp4'
    """
    parts = filename.replace('.mp4', '').split('_')
    if len(parts) < 3:
        raise ValueError("Filename does not contain timestamp in expected format.")
    timestamp_str = f"{parts[1]}_{parts[2]}"
    return datetime.strptime(timestamp_str, '%Y-%m-%d_%H-%M')

# ------------- Main Function -------------

def extract_frames_with_metadata(video_path, output_dir, sampling_rate=1, store_info=None, promo_calendar=None):
    """
    Extract frames from a video and enrich each with temporal metadata.
    
    Args:
        video_path (str): Path to input video file.
        output_dir (str): Directory to save extracted frames.
        sampling_rate (int): Frames per second to extract.
        store_info (dict): Optional store metadata.
        promo_calendar (set): Set of promotional dates (YYYY-MM-DD).
    
    Returns:
        List of dictionaries, each containing metadata for one frame.
    """
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)

    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    frame_interval = max(1, int(fps / sampling_rate))
    frame_index = 0
    saved_index = 0
    metadata_records = []

    # Get base timestamp from video filename
    base_timestamp = parse_timestamp_from_filename(os.path.basename(video_path))

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if frame_index % frame_interval == 0:
            # Save frame image
            frame_filename = f"frame_{saved_index:05d}.jpg"
            frame_path = os.path.join(output_dir, frame_filename)
            cv2.imwrite(frame_path, frame)

            # Estimate timestamp for this frame
            frame_timestamp = base_timestamp + timedelta(seconds=(frame_index / fps))

            # Generate metadata
            metadata = {
                'frame_id': f"{saved_index:05d}",
                'timestamp': frame_timestamp.isoformat(),
                'day_of_week': frame_timestamp.weekday(),
                'time_block': categorize_time_block(frame_timestamp.hour),
                'season': determine_season(frame_timestamp.month),
                'is_promotional': check_promotional(frame_timestamp, promo_calendar or set()),
                'frame_path': frame_path
            }

            if store_info:
                metadata.update({
                    'store_id': store_info.get('store_id'),
                    'store_format': store_info.get('format'),
                    'location_type': store_info.get('location_type'),
                    'department': store_info.get('department')
                })

            metadata_records.append(metadata)
            saved_index += 1

        frame_index += 1

    cap.release()
    return metadata_records

# ------------- Run Script -------------

if __name__ == "__main__":
    # Input video (must include timestamp in filename)
    video_file = "store42_2025-05-10_14-00.mp4"
    output_folder = "frames_output"

    # Example store metadata
    store_metadata = {
        'store_id': 'store42',
        'format': 'medium',
        'location_type': 'urban',
        'department': 'snacks'
    }

    # Example promotional calendar
    promotional_dates = {"2025-05-10", "2025-05-11"}

    # Run extraction
    metadata = extract_frames_with_metadata(
        video_path=video_file,
        output_dir=output_folder,
        sampling_rate=1,  # 1 frame per second
        store_info=store_metadata,
        promo_calendar=promotional_dates
    )

    # Save metadata to CSV
    pd.DataFrame(metadata).to_csv("frame_metadata.csv", index=False)
    print(f"Extracted {len(metadata)} frames and saved metadata to frame_metadata.csv.")

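If you are persisting the metadata into the frame_metadata table defined earlier rather than (or in addition to) the CSV, a minimal loading sketch follows; SQLite stands in for whatever database you actually run (Azure SQL, Amazon RDS, etc.):

import sqlite3
import pandas as pd

# Append the enriched metadata produced above to the frame_metadata table.
# SQLite is used purely for illustration; swap the connection for your
# production database driver.
df = pd.read_csv("frame_metadata.csv")
conn = sqlite3.connect("retail_cv.db")  # hypothetical local database file
df[['frame_id', 'timestamp', 'day_of_week', 'time_block',
    'season', 'is_promotional', 'frame_path']].to_sql(
    'frame_metadata', conn, if_exists='append', index=False)
conn.close()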

Step 3: Temporal Distribution Monitoring

The goal of this step is to measure and track how well your current dataset matches the temporal targets defined in the bias matrix. You’re not yet selecting or sampling data here—you’re auditing what’s already been collected or extracted.

This step helps answer key questions like:

  • Are we overrepresenting certain days or time blocks?

  • Have we collected enough data from underrepresented seasons?

  • Are promotional periods adequately covered?

  • Are we drifting away from the target distribution?

The output of this step is a set of summary statistics and optionally a dashboard or alert system that gives visibility into bias risks.

Why This Matters

Before you can fix a bias, you have to see it.

Raw data pipelines often reflect uncontrolled collection patterns:

  • Store staff may record more during peak hours.

  • Some cameras may run longer on weekdays than weekends.

  • Promotions may not be tagged consistently.

If you blindly use this data for training, you risk building a model that performs well only in overrepresented conditions—and fails silently when exposed to underrepresented contexts (e.g., late nights or winter).

Temporal Distribution Monitoring gives you ground truth visibility over the balance of your dataset, so you can take action in later steps (sampling, weighting, augmentation).

What Exactly Are You Doing in This Step?

You are:

  1. Querying your enriched metadata (from Step 2) to compute the actual distribution of frames across temporal dimensions:

    • How many frames came from each day of the week?

    • What percent came from each time block (morning, afternoon, evening, night)?

    • What is the seasonal breakdown?

    • What portion occurred during promotions?

  2. Comparing those distributions to the target ranges specified in the bias matrix.

  3. Identifying any deviations, i.e., subcategories that are underrepresented or overrepresented compared to the target.

  4. (Optionally) Visualizing the metrics via bar charts, gauges, or dashboards to make imbalances obvious and track drift over time.

  5. (Optionally) Triggering alerts if certain thresholds are violated—for example, if “morning” frames fall below 15%, even though the target minimum is 20%.

Sample Output for Step 3

Imagine you’ve processed 10,000 frames and want to evaluate them against the target temporal bias distribution.

1. Day of Week Distribution
| Day | Frame Count | Actual % | Target Range (%) | Status |
|---|---|---|---|---|
| Monday | 980 | 9.8% | 10–15% | Slightly Low |
| Tuesday | 1320 | 13.2% | 10–15% | Within Range |
| Wednesday | 1305 | 13.1% | 10–15% | Within Range |
| Thursday | 1287 | 12.9% | 10–15% | Within Range |
| Friday | 1412 | 14.1% | 10–15% | Within Range |
| Saturday | 1845 | 18.5% | 15–20% | Within Range |
| Sunday | 1851 | 18.5% | 15–20% | Within Range |

Note: Monday is slightly under target; all other days are compliant.


2. Time of Day Distribution
| Time Block | Frame Count | Actual % | Target Range (%) | Status |
|---|---|---|---|---|
| Morning | 1680 | 16.8% | 20–25% | Underrepresented |
| Afternoon | 3502 | 35.0% | 30–35% | Within Range |
| Evening | 2863 | 28.6% | 25–30% | Within Range |
| Night | 1955 | 19.5% | 10–15% | Overrepresented |

Observation: Nighttime data is significantly overrepresented; morning data is under-collected.


3. Seasonal Distribution
| Season | Frame Count | Actual % | Target Range (%) | Status |
|---|---|---|---|---|
| Spring | 2300 | 23.0% | 20–30% | Within Range |
| Summer | 2430 | 24.3% | 20–30% | Within Range |
| Fall | 2515 | 25.2% | 20–30% | Within Range |
| Winter | 2755 | 27.5% | 20–30% | Within Range |

Observation: Seasonal coverage is balanced—no action needed here.


4. Promotional vs. Normal Periods
| Period | Frame Count | Actual % | Target Range (%) | Status |
|---|---|---|---|---|
| Normal | 7680 | 76.8% | 60–70% | Overrepresented |
| Promotional | 2320 | 23.2% | 30–40% | Underrepresented |

Observation: Not enough data collected during promotional campaigns.


Summary Status Report

| Dimension | Issues Detected? | Recommended Action |
|---|---|---|
| Day of Week | Minor (Monday < 10%) | Increase Monday frame collection |
| Time of Day | Yes | Reduce night sampling, increase morning frames |
| Season | No | No action needed |
| Promotional Periods | Yes | Increase targeted capture during promotions |

As the status report shows, the raw dataset is skewed, and there are specific steps you must take to correct the bias.

The full code for this step is as follows:

import pandas as pd

# --- Bias Matrix Temporal Targets ---

BIAS_TARGETS = {
    'day_of_week': {
        0: (10, 15),  # Monday
        1: (10, 15),
        2: (10, 15),
        3: (10, 15),
        4: (10, 15),
        5: (15, 20),  # Saturday
        6: (15, 20),  # Sunday
    },
    'time_block': {
        'morning': (20, 25),
        'afternoon': (30, 35),
        'evening': (25, 30),
        'night': (10, 15),
    },
    'season': {
        'spring': (20, 30),
        'summer': (20, 30),
        'fall': (20, 30),
        'winter': (20, 30),
    },
    'is_promotional': {
        True: (30, 40),
        False: (60, 70),
    }
}

# --- Monitoring Function ---

def calculate_temporal_distribution(df, targets):
    """
    Compare actual distributions with bias matrix targets.

    Args:
        df: DataFrame containing metadata
        targets: Dict with bias targets

    Returns:
        summary_table: A list of dictionaries with metrics per subcategory
    """
    total_frames = len(df)
    summary = []

    for category, category_targets in targets.items():
        value_counts = df[category].value_counts(normalize=True) * 100  # as %
        for value, (min_t, max_t) in category_targets.items():
            actual = value_counts.get(value, 0)
            if actual < min_t:
                status = 'Underrepresented'
            elif actual > max_t:
                status = 'Overrepresented'
            else:
                status = 'Within Range'

            summary.append({
                'Dimension': category,
                'Category': value,
                'Actual %': round(actual, 2),
                'Target Range (%)': f"{min_t}–{max_t}",
                'Status': status
            })

    return pd.DataFrame(summary)

# --- Load Metadata CSV ---

def load_metadata(csv_path):
    """
    Load frame metadata CSV and ensure correct dtypes.
    """
    df = pd.read_csv(csv_path)

    # Coerce categorical types
    df['time_block'] = df['time_block'].astype(str)
    df['season'] = df['season'].astype(str)
    df['is_promotional'] = df['is_promotional'].astype(bool)
    df['day_of_week'] = df['day_of_week'].astype(int)  # 0 = Monday

    return df

# --- Run Script ---

if __name__ == "__main__":
    metadata_file = "frame_metadata.csv"
    df = load_metadata(metadata_file)
    
    report = calculate_temporal_distribution(df, BIAS_TARGETS)
    report.to_csv("temporal_bias_report.csv", index=False)

    print("\n=== Temporal Bias Summary ===\n")
    print(report)

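The optional alerting described in this step can hang directly off this report. A minimal sketch follows; the alert_on_violations helper and console output are our own illustration, and in production the print call would post to Azure Monitor, SNS, Slack, or similar:

def alert_on_violations(report_df):
    """Print an alert for every subcategory outside its target range.

    Call with the DataFrame returned by calculate_temporal_distribution.
    """
    violations = report_df[report_df['Status'] != 'Within Range']
    for _, row in violations.iterrows():
        # Swap print() for your real notification channel
        print(f"BIAS ALERT: {row['Dimension']}={row['Category']} at "
              f"{row['Actual %']}% (target {row['Target Range (%)']}, "
              f"{row['Status']})")
    return len(violations)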

Step 4: Stratified Sampling for Temporal Balance

The goal of this step is to select a balanced subset of frames from your raw dataset that closely matches the temporal distribution targets defined in the bias matrix. You are no longer just observing bias (as in Step 3)—you are actively shaping your training dataset to correct for it.

This ensures that:

  • Underrepresented time categories (e.g. winter evenings or promotional mornings) are not ignored

  • Overrepresented categories (e.g. Saturday afternoons) are not allowed to dominate model training

  • Your model is exposed to diverse time-based conditions, improving generalization and robustness

Why This Matters

Without stratified sampling, your training data may inherit the imbalances of how and when footage was collected. For example:

  • If cameras record mostly during the day, the model may fail at night.

  • If most footage is from promotional events, the model might overfit to campaign behaviors.

  • If winter shopping is underrepresented, the model may misinterpret seasonal behaviors (e.g., people wearing bulky coats, less time browsing).

This step gives you fine-grained control over what data is included, ensuring that the final dataset reflects your fairness goals.

What Exactly Are You Doing?

You are:

  1. Using the metadata-enriched frame dataset (from Step 2)

  2. Using the temporal bias targets (defined in the matrix and validated in Step 3)

  3. Calculating the number of frames needed for each temporal subcategory (e.g., 2,500 frames from morning sessions)

  4. Querying or filtering the metadata to pull that exact number of frames per category (or as close as possible)

  5. Compensating if some bins don’t have enough data by logging, augmenting (in later steps), or rebalancing

The output is a balanced, bias-mitigated training dataset ready for YOLOv5 training.

import pandas as pd

# --- Define Temporal Bias Targets ---

TEMPORAL_TARGETS = {
    'time_block': {
        'morning': 0.24,     # within the 20–25% target
        'afternoon': 0.33,   # within the 30–35% target
        'evening': 0.29,     # within the 25–30% target
        'night': 0.14        # within the 10–15% target; shares sum to 1.0
    },
    'season': {
        'spring': 0.25,
        'summer': 0.25,
        'fall': 0.25,
        'winter': 0.25
    },
    'is_promotional': {
        True: 0.35,
        False: 0.65
    }
}

# --- Utility Function to Calculate Sampling Plan ---

def compute_stratified_frame_counts(total_frames, targets):
    """
    Calculate number of frames needed per combination of temporal categories.
    
    Args:
        total_frames: Total number of frames to select.
        targets: Nested dict of target proportions per temporal dimension.
    
    Returns:
        Dict mapping category combinations to number of frames.
    """
    target_counts = {}

    # Build all combinations (cartesian product of categories)
    for t_block, p_time in targets['time_block'].items():
        for season, p_season in targets['season'].items():
            for promo, p_promo in targets['is_promotional'].items():
                combo = (t_block, season, promo)
                proportion = p_time * p_season * p_promo
                target_counts[combo] = int(round(total_frames * proportion))
    
    return target_counts

# --- Sample Frames from Metadata Based on Category Combinations ---

def stratified_sample_frames(df, target_counts):
    """
    Select frames from the metadata DataFrame according to target counts.

    Args:
        df: Metadata DataFrame.
        target_counts: Dict with category combinations as keys and sample sizes as values.

    Returns:
        List of selected frame_ids.
    """
    selected_frames = []

    for combo, count in target_counts.items():
        time_block, season, promo = combo

        subset = df[
            (df['time_block'] == time_block) &
            (df['season'] == season) &
            (df['is_promotional'] == promo)
        ]

        available = len(subset)

        if available == 0:
            print(f"WARNING: No frames found for {combo}")
            continue

        sample_size = min(count, available)
        sampled_ids = subset.sample(n=sample_size, replace=False)['frame_id'].tolist()
        selected_frames.extend(sampled_ids)

        print(f"Selected {sample_size}/{count} frames for {combo} (available: {available})")

    return selected_frames

# --- Main Execution ---

def run_stratified_sampling(metadata_path, total_frames_to_select, output_path):
    df = pd.read_csv(metadata_path)

    # Ensure proper types
    df['time_block'] = df['time_block'].astype(str)
    df['season'] = df['season'].astype(str)
    df['is_promotional'] = df['is_promotional'].astype(bool)

    # Step 1: Compute target counts per category combo
    target_counts = compute_stratified_frame_counts(total_frames_to_select, TEMPORAL_TARGETS)

    # Step 2: Select frames
    selected_ids = stratified_sample_frames(df, target_counts)

    # Step 3: Save results
    selected_df = df[df['frame_id'].isin(selected_ids)]
    selected_df.to_csv(output_path, index=False)

    print(f"\nStratified sampling complete. {len(selected_ids)} frames written to {output_path}")

# --- Example Usage ---

if __name__ == "__main__":
    metadata_csv = "frame_metadata.csv"  # from Step 2
    output_file = "stratified_sample.csv"
    sample_size = 10000  # Adjust based on your target dataset size

    run_stratified_sampling(metadata_csv, sample_size, output_file)


Output

  • A file stratified_sample.csv containing the subset of metadata for the selected frames

  • Each frame respects the desired cross-distribution across:

    • time_block × season × is_promotional

  • Warnings will be shown if not enough frames exist for any bin

You now have a bias-aware, balanced dataset ready for model training. Note that this sampler stratifies over time_block × season × is_promotional; day of week can be added as a fourth dimension in exactly the same way. The next step could be to copy the selected frames into a new dataset directory structure (sketched below), or to feed them into a downstream annotation process.
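A minimal sketch of that packaging step follows; the materialize_dataset helper and directory layout are our own assumptions, echoing the versioned dataset/ convention used in the deployment section below:

import os
import shutil
import pandas as pd

def materialize_dataset(sample_csv, dest_dir="dataset/v1/train"):
    """Copy the stratified-sampled frames into a dataset directory."""
    os.makedirs(dest_dir, exist_ok=True)
    df = pd.read_csv(sample_csv)
    for path in df['frame_path']:
        shutil.copy2(path, dest_dir)
    # Keep the metadata next to the images as a manifest
    df.to_csv(os.path.join(dest_dir, 'manifest.csv'), index=False)

materialize_dataset("stratified_sample.csv")

For bins that triggered "not enough frames" warnings, the blueprint calls for augmentation of rare conditions. As one hedged example, frames from well-lit periods can be darkened to approximate underrepresented evening or winter lighting; the gamma value below is an assumption to be tuned against real footage:

import cv2
import numpy as np

def simulate_low_light(image_path, output_path, gamma=1.8):
    """Apply gamma darkening to approximate low-light conditions."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(image_path)
    # Map each pixel value through a gamma curve (gamma > 1 darkens)
    table = np.array([(i / 255.0) ** gamma * 255 for i in range(256)]).astype('uint8')
    cv2.imwrite(output_path, cv2.LUT(img, table))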

Pipeline Deployment Overview: Azure Cloud

Core Azure Services

| Pipeline Component | Azure Service Used |
|---|---|
| Raw video storage | Azure Blob Storage |
| Job orchestration | Azure Data Factory or Azure Synapse Pipelines |
| Compute for scripts | Azure Batch, Azure Functions, or Azure ML Pipelines |
| Metadata database | Azure SQL Database or Azure Data Lake |
| Monitoring & alerts | Azure Monitor + Log Analytics |
| Final dataset storage | Azure Data Lake Gen2 or Blob Storage |

Sample Pipeline in Azure Data Factory

  1. Trigger: When new video is uploaded to a Blob container.

  2. Step 1 (Metadata Enrichment):

    • Run a Python script in Azure Batch or Azure ML Compute.

    • Output: Frame images + metadata CSV to Azure Blob / Data Lake.

  3. Step 2 (Monitoring):

    • Use a pipeline step to run the audit script.

    • Save the bias report as a CSV or send alerts via Azure Monitor.

  4. Step 3 (Stratified Sampling):

    • Run your stratified sampler Python script as a Custom Activity.

    • Output: Selected metadata → used for dataset creation.
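To make the trigger concrete, here is a minimal sketch of the pipeline entry point using the Azure Functions Python v2 programming model with a Blob trigger. The container path, connection setting, and the process_video stub are assumptions for illustration:

import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="video",
                  path="raw-videos/{name}",          # hypothetical container
                  connection="AzureWebJobsStorage")
def on_new_video(video: func.InputStream):
    """Fires when a new store video lands in Blob Storage."""
    process_video(video.name, video.read())

def process_video(blob_name, data):
    """Stub: persist the blob locally and kick off frame extraction."""
    local_path = "/tmp/" + blob_name.split("/")[-1]
    with open(local_path, "wb") as f:
        f.write(data)
    # extract_frames_with_metadata(local_path, "/tmp/frames", ...)  # from Step 2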

Pipeline Deployment Overview: AWS Cloud

Core AWS Services

| Pipeline Component | AWS Service Used |
|---|---|
| Raw video storage | Amazon S3 |
| Job orchestration | AWS Step Functions or Amazon MWAA (Airflow) |
| Compute for scripts | AWS Lambda, EC2, or AWS Batch |
| Metadata database | Amazon RDS or Amazon Athena |
| Monitoring & alerts | Amazon CloudWatch + SNS |
| Final dataset storage | S3 (versioned buckets) |

Sample Pipeline in AWS Step Functions

  1. Trigger: S3 event when a new video file is uploaded.

  2. Step 1 (Metadata Enrichment):

    • Run your Python script in AWS Batch or Lambda.

    • Store output images + metadata CSV in a processed/ S3 path.

  3. Step 2 (Bias Monitoring):

    • Run audit job using a Lambda or Batch job.

    • Store report in S3 and send alerts via SNS if any category drifts.

  4. Step 3 (Stratified Sampling):

    • Execute as a Lambda or Docker job on ECS/Fargate.

    • Output selected frame list to S3.

  5. Step 4 (Optional Augmentation):

    • Trigger augmentation only for flagged categories (CloudWatch metric).

  6. Step 5 (Dataset Packaging):

    • Move sampled frames into versioned dataset/ folders (e.g., v1/train/, v1/test/).

    • Store metadata manifest (manifest.json) in each dataset folder.
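To ground the trigger step, a minimal sketch of the S3-triggered Lambda follows; the bucket layout and the commented downstream call are assumptions, and long videos are better handed to AWS Batch given Lambda's 15-minute cap:

import os
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by an S3 put event for a newly uploaded store video."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        local_path = os.path.join("/tmp", os.path.basename(key))
        s3.download_file(bucket, key, local_path)
        # Run the Step 1/2 extraction here, or submit an AWS Batch job:
        # extract_frames_with_metadata(local_path, "/tmp/frames", ...)
    return {"status": "ok"}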