Introduction
Preparing high-quality datasets for LLM fine-tuning demands a deliberate approach to addressing marketing bias. While our earlier article introduced the concept at a high level, this recipe focuses on the practical side, providing data engineers with a technical, developer-focused walkthrough for converting theoretical bias dimensions from a bias matrix into concrete, production-ready data pipelines. For full context, we recommend reviewing our foundational pieces on the LLM data preparation framework and the underlying principles of marketing bias.
Business Background
A multi-channel retailer aimed to deepen its understanding of in-store customer behavior through the use of YOLOv5-based computer vision technology. Partnering with a specialized AI consultancy, the retailer began deploying custom-trained models on edge devices throughout its store network, with the goal of eventually scaling coverage to all locations.
This initiative went beyond passive observation. By linking visual insights to their e-commerce and product information management (PIM) systems, the retailer sought to generate actionable intelligence, such as identifying when customers examine labels but don’t purchase, or detecting common interest in overlooked products. These signals could then power real-time personalization and long-term experience optimization.
To achieve this, a critical early step was the creation of a training dataset that accurately reflected the diversity of in-store shopping behavior, not just across demographics or store types, but across time. That required a dedicated effort to identify and mitigate temporal bias.
Solution Overview
YOLOv5 models in this setting are trained on annotated video frames that capture real-world customer interactions in-store. The typical process involves:
- Extracting frames from raw video footage
- Annotating those frames with structured behavior labels
- Training the model on this annotated dataset
- Deploying the model to edge devices
- Streaming real-time frame data or interaction events to local servers for downstream processing
These structured interaction events are then forwarded by the local server to backend enterprise systems (e.g., CRM, PIM, e-commerce) where they inform decisions ranging from product placement to automated marketing offers.
But none of this works reliably without an unbiased dataset. Specifically, temporal bias—the uneven representation of certain time windows in the dataset—can significantly degrade model performance during underrepresented conditions like early mornings, off-season shopping, or promotional periods.
Bridging Marketing and Engineering: The Bias Matrix
To formalize their bias mitigation strategy, the team adopted a bias matrix—a shared framework developed by marketing and analytics stakeholders. This matrix identifies key bias categories, defines how they should be measured, establishes acceptable target distributions, and outlines mitigation strategies.
Here is a simplified excerpt from the full matrix:
Bias Category | Bias Type | Measurement Method | Target Distribution | Mitigation Strategy |
---|---|---|---|---|
Temporal Bias | Day of Week | % frames per day | Mon–Fri: 10–15% each | Stratified sampling across days |
Temporal Bias | Time of Day | % frames per time block | Morning: 20–25% | Weighted extraction from recordings |
Temporal Bias | Seasonal | % frames per season | Spring, Summer, Fall, Winter: 20–30% each | Multi-season collection and augmentation |
Temporal Bias | Promotional vs. Normal | % frames during promotions | Normal: 60–70% | Targeted collection during promotions |
While the full matrix includes other bias types—such as demographic, store environment, and product interaction biases—this guide focuses exclusively on temporal bias.
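One way to make the matrix usable by engineering is to hold each row as a small typed record that pipeline code can query. The sketch below is only an illustration of that idea: the class name `BiasMatrixRow`, its field names, and the helper `targets_for` are assumptions, and the rows simply restate the excerpt above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiasMatrixRow:
    """One row of the bias matrix (field names are illustrative)."""
    category: str
    bias_type: str
    measurement: str
    targets: dict  # label -> (min %, max %)
    mitigation: str


# Rows copied from the simplified excerpt; the full matrix has more entries.
TEMPORAL_ROWS = [
    BiasMatrixRow("Temporal Bias", "Day of Week", "% frames per day",
                  {"Mon-Fri (each)": (10, 15)}, "Stratified sampling across days"),
    BiasMatrixRow("Temporal Bias", "Time of Day", "% frames per time block",
                  {"Morning": (20, 25)}, "Weighted extraction from recordings"),
    BiasMatrixRow("Temporal Bias", "Seasonal", "% frames per season",
                  {"Each season": (20, 30)}, "Multi-season collection and augmentation"),
    BiasMatrixRow("Temporal Bias", "Promotional vs. Normal", "% frames during promotions",
                  {"Normal": (60, 70)}, "Targeted collection during promotions"),
]


def targets_for(rows, bias_type):
    """Return the target ranges recorded for one bias type."""
    for row in rows:
        if row.bias_type == bias_type:
            return row.targets
    raise KeyError(bias_type)


print(targets_for(TEMPORAL_ROWS, "Seasonal"))
```

Keeping the matrix in one structure like this means the monitoring and sampling code can read their targets from the same place marketing reviews them.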
In the sections that follow, we will walk through how to operationalize each temporal subcategory into a robust and reusable data pipeline that ensures balanced temporal representation during training.
This includes:
- Structuring metadata for temporal traceability
- Monitoring live data distributions
- Designing stratified and weighted sampling logic
- Implementing augmentation for rare conditions
- Introducing performance-driven feedback loops
Together, these steps form a pipeline blueprint that transforms the bias matrix from a theoretical checklist into a concrete system architecture.
Technical Implementation
The temporal bias dimension requires the training data to follow this temporal distribution:
- Day of Week: Mon–Fri (10–15% each), Sat–Sun (15–20% each)
- Time of Day: Morning (20–25%), Afternoon (30–35%), Evening (25–30%), Night (10–15%)
- Season: Each season (20–30%)
- Promotional Period: Normal (60–70%), Promotional (30–40%)
In other words, out of every 100 samples, each weekday (Monday through Friday) should contribute between 10 and 15 samples, and Saturday and Sunday between 15 and 20 each, with all seven shares adding up to exactly 100. The ranges leave room for this: the minimums add up to 80 and the maximums to 115. The same kind of condition applies to time of day, season, and promotional period.
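A quick way to sanity-check a set of target ranges is to confirm that 100% lies between the sum of the minimums and the sum of the maximums. The sketch below does that for the day-of-week and time-of-day targets listed above; the function names are illustrative.

```python
def range_sum_bounds(ranges):
    """Return (sum of minimums, sum of maximums) for a dict of (min, max) ranges."""
    lowest = sum(lo for lo, _ in ranges.values())
    highest = sum(hi for _, hi in ranges.values())
    return lowest, highest


def is_feasible(ranges, total=100):
    """True if some allocation inside every range can add up to `total` percent."""
    lowest, highest = range_sum_bounds(ranges)
    return lowest <= total <= highest


DAY_OF_WEEK = {
    "Mon": (10, 15), "Tue": (10, 15), "Wed": (10, 15), "Thu": (10, 15),
    "Fri": (10, 15), "Sat": (15, 20), "Sun": (15, 20),
}
TIME_OF_DAY = {
    "morning": (20, 25), "afternoon": (30, 35),
    "evening": (25, 30), "night": (10, 15),
}

print(range_sum_bounds(DAY_OF_WEEK), is_feasible(DAY_OF_WEEK))
print(range_sum_bounds(TIME_OF_DAY), is_feasible(TIME_OF_DAY))
```

If a dimension fails this check, no dataset can satisfy its targets, and the matrix itself needs revisiting before any pipeline work starts.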
The process of translating the bias matrix into concrete data pipelines is outlined below.
Step 1: Extract Image Frames from Video
The first step in data preparation is to parse the raw video footage into image frames, because YOLOv5 is trained on annotated image frames rather than on whole video files.
- Extract frames from raw video footage: The input to this step is a raw video file (MP4, AVI, etc.) captured from in-store cameras. Each video file represents a continuous sequence of real-world customer activity within a store over a certain time period (e.g., 1 hour of footage from a Tuesday afternoon in Store #42).
- Extraction can run as a batch process using the Python OpenCV module, which takes the raw video file as input and writes out individual frames.
import cv2
import os

def extract_frames(video_path, output_dir, sampling_rate=1):
    """Save roughly `sampling_rate` frames per second of video as JPEG files."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    # Keep every Nth frame; never let the interval drop to 0 (modulo by zero)
    frame_interval = max(1, int(fps / sampling_rate))
    frame_count = 0
    saved_count = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if frame_count % frame_interval == 0:
            frame_name = f"frame_{saved_count:05d}.jpg"
            cv2.imwrite(os.path.join(output_dir, frame_name), frame)
            saved_count += 1
        frame_count += 1
    cap.release()
    return saved_count
Each frame is now an image file (e.g., frame_00001.jpg) that becomes a candidate training example, but it is not usable until it has been enriched with metadata.
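To estimate how many frames a video will produce before running the extraction, the interval arithmetic can be reproduced on its own. This is a rough sketch that assumes a constant frame rate and mirrors the truncating `int(fps / sampling_rate)` interval used in the extraction function; the function name is illustrative.

```python
def expected_saved_frames(fps, duration_seconds, sampling_rate=1):
    """Estimate how many frames are saved when every Nth frame is kept."""
    total_frames = int(fps * duration_seconds)
    frame_interval = max(1, int(fps / sampling_rate))
    # Frames 0, N, 2N, ... are kept, which is ceil(total_frames / N)
    return -(-total_frames // frame_interval)


# One hour of 30 fps footage sampled at 1 frame per second
print(expected_saved_frames(30, 3600, 1))
# One hour of 25 fps footage sampled at 2 frames per second:
# the interval truncates from 12.5 to 12, so slightly more than 7,200 frames are kept
print(expected_saved_frames(25, 3600, 2))
```

The second case shows that when the frame rate is not an exact multiple of the sampling rate, the effective sampling rate drifts a little from the requested one.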
Step 2: Enrich Frames With Metadata
Frames alone are just pixels. To implement bias mitigation, we must know when and where each frame was captured. Metadata provides:
- The timestamp (e.g., May 2, 2025, 14:33)
- The day of the week (e.g., Friday)
- The time block (e.g., afternoon)
- The season (e.g., spring)
- Whether it was during a promotional period
- Context about the store (location, format, department)
This enriched metadata allows us to:
- Monitor bias (e.g., 22% of frames are from Friday afternoons)
- Sample intentionally (e.g., we need more morning frames)
- Augment strategically (e.g., simulate winter lighting)
To hold this metadata, you need to alter your existing table or create a new one:
CREATE TABLE frame_metadata (
    frame_id VARCHAR(255) PRIMARY KEY,
    timestamp DATETIME,
    day_of_week TINYINT,
    time_block VARCHAR(20),
    season VARCHAR(20),
    is_promotional BOOLEAN,
    frame_path VARCHAR(255)
);
Your code must now derive this metadata for each frame, starting from the recording start time encoded in the video filename.
import cv2
import os
import pandas as pd
from datetime import datetime, timedelta

# ------------- Helper Functions -------------
def categorize_time_block(hour):
    """Classify the hour into a time block."""
    if 6 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 17:
        return 'afternoon'
    elif 17 <= hour < 21:
        return 'evening'
    else:
        return 'night'

def determine_season(month):
    """Map month to a season (Northern Hemisphere, by calendar month)."""
    if month in [12, 1, 2]:
        return 'winter'
    elif month in [3, 4, 5]:
        return 'spring'
    elif month in [6, 7, 8]:
        return 'summer'
    else:
        return 'fall'

def check_promotional(timestamp, promo_calendar):
    """Return True if the date is in the promotional calendar."""
    date_str = timestamp.strftime('%Y-%m-%d')
    return date_str in promo_calendar

def parse_timestamp_from_filename(filename):
    """
    Extract timestamp from filename format: storeID_YYYY-MM-DD_HH-MM.mp4
    Example: 'store42_2025-05-10_14-00.mp4'
    """
    # Drop the extension, whatever it is (.mp4, .avi, ...)
    stem = os.path.splitext(filename)[0]
    parts = stem.split('_')
    if len(parts) < 3:
        raise ValueError("Filename does not contain timestamp in expected format.")
    timestamp_str = f"{parts[1]}_{parts[2]}"
    return datetime.strptime(timestamp_str, '%Y-%m-%d_%H-%M')
# ------------- Main Function -------------
def extract_frames_with_metadata(video_path, output_dir, sampling_rate=1, store_info=None, promo_calendar=None):
    """
    Extract frames from a video and enrich each with temporal metadata.

    Args:
        video_path (str): Path to input video file.
        output_dir (str): Directory to save extracted frames.
        sampling_rate (int): Frames per second to extract.
        store_info (dict): Optional store metadata.
        promo_calendar (set): Set of promotional dates (YYYY-MM-DD).

    Returns:
        List of dictionaries, each containing metadata for one frame.
    """
    os.makedirs(output_dir, exist_ok=True)
    # Prefix frame files and IDs with the video name so they stay unique across videos
    video_id = os.path.splitext(os.path.basename(video_path))[0]
    # Get base timestamp from video filename
    base_timestamp = parse_timestamp_from_filename(os.path.basename(video_path))

    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    if fps <= 0:
        cap.release()
        raise ValueError(f"Could not read a valid frame rate from {video_path}")
    # Keep every Nth frame; the interval must be at least 1
    frame_interval = max(1, int(fps / sampling_rate))
    frame_index = 0
    saved_index = 0
    metadata_records = []

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if frame_index % frame_interval == 0:
            # Save frame image
            frame_filename = f"{video_id}_frame_{saved_index:05d}.jpg"
            frame_path = os.path.join(output_dir, frame_filename)
            cv2.imwrite(frame_path, frame)
            # Estimate timestamp for this frame from its offset into the video
            frame_timestamp = base_timestamp + timedelta(seconds=(frame_index / fps))
            # Generate metadata
            metadata = {
                'frame_id': f"{video_id}_{saved_index:05d}",
                'timestamp': frame_timestamp.isoformat(),
                'day_of_week': frame_timestamp.weekday(),  # 0 = Monday
                'time_block': categorize_time_block(frame_timestamp.hour),
                'season': determine_season(frame_timestamp.month),
                'is_promotional': check_promotional(frame_timestamp, promo_calendar or set()),
                'frame_path': frame_path
            }
            if store_info:
                metadata.update({
                    'store_id': store_info.get('store_id'),
                    'store_format': store_info.get('format'),
                    'location_type': store_info.get('location_type'),
                    'department': store_info.get('department')
                })
            metadata_records.append(metadata)
            saved_index += 1
        frame_index += 1

    cap.release()
    return metadata_records
# ------------- Run Script -------------
if __name__ == "__main__":
    # Input video (must include timestamp in filename)
    video_file = "store42_2025-05-10_14-00.mp4"
    output_folder = "frames_output"
    # Example store metadata
    store_metadata = {
        'store_id': 'store42',
        'format': 'medium',
        'location_type': 'urban',
        'department': 'snacks'
    }
    # Example promotional calendar
    promotional_dates = {"2025-05-10", "2025-05-11"}
    # Run extraction
    metadata = extract_frames_with_metadata(
        video_path=video_file,
        output_dir=output_folder,
        sampling_rate=1,  # 1 frame per second
        store_info=store_metadata,
        promo_calendar=promotional_dates
    )
    # Save metadata to CSV
    pd.DataFrame(metadata).to_csv("frame_metadata.csv", index=False)
    print(f"Extracted {len(metadata)} frames and saved metadata to frame_metadata.csv.")
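The script above writes the metadata to CSV, while the schema earlier in this step describes a database table. As a minimal sketch of bridging the two, the records could be loaded with Python's built-in `sqlite3` module. SQLite is used here only as a stand-in for whatever database the pipeline actually targets, and the function name, the `INSERT OR REPLACE` policy, and the sample record are assumptions.

```python
import sqlite3

COLUMNS = ["frame_id", "timestamp", "day_of_week", "time_block",
           "season", "is_promotional", "frame_path"]


def load_metadata_records(conn, records):
    """Create frame_metadata if needed and insert one row per metadata record."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS frame_metadata (
            frame_id VARCHAR(255) PRIMARY KEY,
            timestamp DATETIME,
            day_of_week TINYINT,
            time_block VARCHAR(20),
            season VARCHAR(20),
            is_promotional BOOLEAN,
            frame_path VARCHAR(255)
        )
    """)
    # Only the schema's columns are stored; extra keys such as store_id are ignored
    rows = [tuple(rec[col] for col in COLUMNS) for rec in records]
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany(
        f"INSERT OR REPLACE INTO frame_metadata ({', '.join(COLUMNS)}) "
        f"VALUES ({placeholders})",
        rows,
    )
    conn.commit()
    return len(rows)


# Hypothetical record in the shape produced by the extraction script
sample = [{
    "frame_id": "store42_2025-05-10_14-00_00000",
    "timestamp": "2025-05-10T14:00:00",
    "day_of_week": 5,
    "time_block": "afternoon",
    "season": "spring",
    "is_promotional": True,
    "frame_path": "frames_output/store42_2025-05-10_14-00_frame_00000.jpg",
    "store_id": "store42",
}]
conn = sqlite3.connect(":memory:")
inserted = load_metadata_records(conn, sample)
print(inserted, conn.execute("SELECT COUNT(*) FROM frame_metadata").fetchone()[0])
```

Once the metadata sits in a table, the distribution checks in the next step can also be expressed as simple `GROUP BY` queries.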
Step 3: Temporal Distribution Monitoring
The goal of this step is to measure and track how well your current dataset matches the temporal targets defined in the bias matrix. You’re not yet selecting or sampling data here—you’re auditing what’s already been collected or extracted.
This step helps answer key questions like:
- Are we overrepresenting certain days or time blocks?
- Have we collected enough data from underrepresented seasons?
- Are promotional periods adequately covered?
- Are we drifting away from the target distribution?
The output of this step is a set of summary statistics and optionally a dashboard or alert system that gives visibility into bias risks.
Why This Matters
Before you can fix a bias, you have to see it.
Raw data pipelines often reflect uncontrolled collection patterns:
- Store staff may record more during peak hours.
- Some cameras may run longer on weekdays than weekends.
- Promotions may not be tagged consistently.
If you blindly use this data for training, you risk building a model that performs well only in overrepresented conditions—and fails silently when exposed to underrepresented contexts (e.g., late nights or winter).
Temporal Distribution Monitoring gives you ground truth visibility over the balance of your dataset, so you can take action in later steps (sampling, weighting, augmentation).
What Exactly Are You Doing in This Step?
You are:
- Querying your enriched metadata (from Step 2) to compute the actual distribution of frames across temporal dimensions:
  - How many frames came from each day of the week?
  - What percent came from each time block (morning, afternoon, evening, night)?
  - What is the seasonal breakdown?
  - What portion occurred during promotions?
- Comparing those distributions to the target ranges specified in the bias matrix.
- Identifying any deviations, i.e., subcategories that are underrepresented or overrepresented compared to the target.
- (Optionally) Visualizing the metrics via bar charts, gauges, or dashboards to make imbalances obvious and track drift over time.
- (Optionally) Triggering alerts if certain thresholds are violated, for example if “morning” frames fall below 15% even though the target minimum is 20%.
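The alerting idea in the last item can be sketched as a plain function that compares observed percentages with the target ranges and returns human-readable messages. The function name and the observed values are made up for illustration, and how the messages are delivered (email, dashboard, pager) is left open.

```python
def find_threshold_violations(actual_pct, targets):
    """Return one message per category whose share falls outside its target range."""
    alerts = []
    for name, (lo, hi) in targets.items():
        value = actual_pct.get(name, 0.0)
        if value < lo:
            alerts.append(f"{name}: {value:.1f}% is below the {lo}% minimum")
        elif value > hi:
            alerts.append(f"{name}: {value:.1f}% is above the {hi}% maximum")
    return alerts


TIME_BLOCK_TARGETS = {
    "morning": (20, 25), "afternoon": (30, 35),
    "evening": (25, 30), "night": (10, 15),
}
# Hypothetical observed shares, with morning at the 15% level mentioned above
observed = {"morning": 15.0, "afternoon": 35.0, "evening": 30.0, "night": 20.0}

for message in find_threshold_violations(observed, TIME_BLOCK_TARGETS):
    print(message)
```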
Sample Output for Step 3
Imagine you’ve processed 10,000 frames and want to evaluate them against the target temporal bias distribution.
1. Day of Week Distribution
Day | Frame Count | Actual % | Target Range (%) | Status |
---|---|---|---|---|
Monday | 980 | 9.8% | 10–15% | Slightly Low |
Tuesday | 1120 | 11.2% | 10–15% | Within Range |
Wednesday | 1105 | 11.1% | 10–15% | Within Range |
Thursday | 1087 | 10.9% | 10–15% | Within Range |
Friday | 1012 | 10.1% | 10–15% | Within Range |
Saturday | 1845 | 18.5% | 15–20% | Within Range |
Sunday | 1851 | 18.5% | 15–20% | Within Range |
Note: Monday is slightly under target; all other days are compliant.
2. Time of Day Distribution
Time Block | Frame Count | Actual % | Target Range (%) | Status |
---|---|---|---|---|
Morning | 1680 | 16.8% | 20–25% | Underrepresented |
Afternoon | 3502 | 35.0% | 30–35% | Within Range |
Evening | 2863 | 28.6% | 25–30% | Within Range |
Night | 1955 | 19.5% | 10–15% | Overrepresented |
Observation: Nighttime data is significantly overrepresented; morning data is under-collected.
3. Seasonal Distribution
Season | Frame Count | Actual % | Target Range (%) | Status |
---|---|---|---|---|
Spring | 2300 | 23.0% | 20–30% | Within Range |
Summer | 2430 | 24.3% | 20–30% | Within Range |
Fall | 2515 | 25.2% | 20–30% | Within Range |
Winter | 2755 | 27.5% | 20–30% | Within Range |
Observation: Seasonal coverage is balanced—no action needed here.
4. Promotional vs. Normal Periods
Period | Frame Count | Actual % | Target Range (%) | Status |
---|---|---|---|---|
Normal | 7680 | 76.8% | 60–70% | Overrepresented |
Promotional | 2320 | 23.2% | 30–40% | Underrepresented |
Observation: Not enough data collected during promotional campaigns.
Summary Status Report
Dimension | Issues Detected? | Recommended Action |
---|---|---|
Day of Week | Minor (Monday < 10%) | Increase Monday frame collection |
Time of Day | Yes | Reduce night sampling, increase morning frames |
Season | No | No action needed |
Promotional Periods | Yes | Increase targeted capture during promotions |
As can be seen from the status report, the original data looks skewed and there are specific steps that you must take to correct the bias.
The full code snippet for this step will be as follows:
import pandas as pd

# --- Bias Matrix Temporal Targets ---
# Each entry maps a category value to its (min %, max %) target range.
BIAS_TARGETS = {
    'day_of_week': {
        0: (10, 15),  # Monday
        1: (10, 15),  # Tuesday
        2: (10, 15),  # Wednesday
        3: (10, 15),  # Thursday
        4: (10, 15),  # Friday
        5: (15, 20),  # Saturday
        6: (15, 20),  # Sunday
    },
    'time_block': {
        'morning': (20, 25),
        'afternoon': (30, 35),
        'evening': (25, 30),
        'night': (10, 15),
    },
    'season': {
        'spring': (20, 30),
        'summer': (20, 30),
        'fall': (20, 30),
        'winter': (20, 30),
    },
    'is_promotional': {
        True: (30, 40),
        False: (60, 70),
    }
}
# --- Monitoring Function ---
def calculate_temporal_distribution(df, targets):
    """
    Compare actual distributions with bias matrix targets.

    Args:
        df: DataFrame containing metadata
        targets: Dict with bias targets

    Returns:
        DataFrame with one row of metrics per subcategory
    """
    summary = []
    for category, category_targets in targets.items():
        # Share of frames per category value, as a percentage
        value_counts = df[category].value_counts(normalize=True) * 100
        for value, (min_t, max_t) in category_targets.items():
            actual = value_counts.get(value, 0)
            if actual < min_t:
                status = 'Underrepresented'
            elif actual > max_t:
                status = 'Overrepresented'
            else:
                status = 'Within Range'
            summary.append({
                'Dimension': category,
                'Category': value,
                'Actual %': round(actual, 2),
                'Target Range (%)': f"{min_t}–{max_t}",
                'Status': status
            })
    return pd.DataFrame(summary)
# --- Load Metadata CSV ---
def load_metadata(csv_path):
    """
    Load frame metadata CSV and ensure correct dtypes.
    """
    df = pd.read_csv(csv_path)
    # Coerce categorical types
    df['time_block'] = df['time_block'].astype(str)
    df['season'] = df['season'].astype(str)
    df['is_promotional'] = df['is_promotional'].astype(bool)
    df['day_of_week'] = df['day_of_week'].astype(int)  # 0 = Monday
    return df

# --- Run Script ---
if __name__ == "__main__":
    metadata_file = "frame_metadata.csv"
    df = load_metadata(metadata_file)
    report = calculate_temporal_distribution(df, BIAS_TARGETS)
    report.to_csv("temporal_bias_report.csv", index=False)
    print("\n=== Temporal Bias Summary ===\n")
    print(report)
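An "Underrepresented" status can be turned into a concrete collection target. If a category has n frames out of N and must reach at least p percent, and only frames of that category are added, the number of extra frames x must satisfy (n + x) / (N + x) ≥ p / 100, which gives x ≥ (p·N − 100·n) / (100 − p). The sketch below applies this to the counts from the sample output; the function name is illustrative, and it assumes existing frames are kept and `min_pct` is a whole-number percentage.

```python
def frames_needed(count, total, min_pct):
    """Extra frames of one category needed to reach min_pct of the grown dataset."""
    numerator = min_pct * total - 100 * count
    if numerator <= 0:
        return 0  # already at or above the minimum
    # Integer ceiling division keeps the result exact
    return -(-numerator // (100 - min_pct))


# Counts taken from the sample output (10,000 frames in total)
print(frames_needed(1680, 10000, 20))  # morning frames, 20% minimum
print(frames_needed(2320, 10000, 30))  # promotional frames, 30% minimum
print(frames_needed(3502, 10000, 30))  # afternoon frames, already in range
```

Adding frames to one category shrinks every other category's share, so the report should be rerun after each collection round.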
Step 4: Obtain Temporal Balance
The goal of this step is to select a balanced subset of frames from your raw dataset that closely matches the temporal distribution targets defined in the bias matrix. You are no longer just observing bias (as in Step 3); you are actively shaping your training dataset to correct for it.
This ensures that:
- Underrepresented time categories (e.g. winter evenings or promotional mornings) are not ignored
- Overrepresented categories (e.g. Saturday afternoons) are not allowed to dominate model training
- Your model is exposed to diverse time-based conditions, improving generalization and robustness
Why This Matters
Without stratified sampling, your training data may inherit the imbalances of how and when footage was collected. For example:
- If cameras record mostly during the day, the model may fail at night.
- If most footage is from promotional events, the model might overfit to campaign behaviors.
- If winter shopping is underrepresented, the model may misinterpret seasonal behaviors (e.g., people wearing bulky coats, less time browsing).
This step gives you fine-grained control over what data is included, ensuring that the final dataset reflects your fairness goals.
What Exactly Are You Doing?
You are:
- Using the metadata-enriched frame dataset (from Step 2)
- Using the temporal bias targets (defined in the matrix and validated in Step 3)
- Calculating the number of frames needed for each temporal subcategory (e.g., 2,500 frames from morning sessions)
- Querying or filtering the metadata to pull that exact number of frames per category (or as close as possible)
- Compensating if some bins don’t have enough data by logging, augmenting (in later steps), or rebalancing
The output is a balanced, bias-mitigated training dataset ready for use in fine-tuning the YOLOv5 model.
import pandas as pd

# --- Define Temporal Bias Targets ---
# Each dimension's proportions sum to 1.0 and sit inside the bias matrix ranges.
TEMPORAL_TARGETS = {
    'time_block': {
        'morning': 0.24,    # 24% (target range 20–25%)
        'afternoon': 0.34,  # 34% (target range 30–35%)
        'evening': 0.28,    # 28% (target range 25–30%)
        'night': 0.14       # 14% (target range 10–15%)
    },
    'season': {
        'spring': 0.25,
        'summer': 0.25,
        'fall': 0.25,
        'winter': 0.25
    },
    'is_promotional': {
        True: 0.35,
        False: 0.65
    }
}
# --- Utility Function to Calculate Sampling Plan ---
def compute_stratified_frame_counts(total_frames, targets):
    """
    Calculate number of frames needed per combination of temporal categories.

    Args:
        total_frames: Total number of frames to select.
        targets: Nested dict of target proportions per temporal dimension.

    Returns:
        Dict mapping category combinations to number of frames.
    """
    target_counts = {}
    # Build all combinations (cartesian product of categories).
    # Multiplying the marginal proportions treats the dimensions as independent;
    # rounding each bin means the counts may not sum exactly to total_frames.
    for t_block, p_time in targets['time_block'].items():
        for season, p_season in targets['season'].items():
            for promo, p_promo in targets['is_promotional'].items():
                combo = (t_block, season, promo)
                proportion = p_time * p_season * p_promo
                target_counts[combo] = int(round(total_frames * proportion))
    return target_counts
# --- Sample Frames from Metadata Based on Category Combinations ---
def stratified_sample_frames(df, target_counts, random_state=42):
    """
    Select frames from the metadata DataFrame according to target counts.

    Args:
        df: Metadata DataFrame.
        target_counts: Dict with category combinations as keys and sample sizes as values.
        random_state: Seed so that repeated runs select the same frames.

    Returns:
        List of selected frame_ids.
    """
    selected_frames = []
    for combo, count in target_counts.items():
        time_block, season, promo = combo
        subset = df[
            (df['time_block'] == time_block) &
            (df['season'] == season) &
            (df['is_promotional'] == promo)
        ]
        available = len(subset)
        if available == 0:
            print(f"WARNING: No frames found for {combo}")
            continue
        # Take what is available when a bin holds fewer frames than requested
        sample_size = min(count, available)
        sampled_ids = subset.sample(n=sample_size, replace=False, random_state=random_state)['frame_id'].tolist()
        selected_frames.extend(sampled_ids)
        print(f"Selected {sample_size}/{count} frames for {combo} (available: {available})")
    return selected_frames
# --- Main Execution ---
def run_stratified_sampling(metadata_path, total_frames_to_select, output_path):
    df = pd.read_csv(metadata_path)
    # Ensure proper types
    df['time_block'] = df['time_block'].astype(str)
    df['season'] = df['season'].astype(str)
    df['is_promotional'] = df['is_promotional'].astype(bool)
    # 1. Compute target counts per category combo
    target_counts = compute_stratified_frame_counts(total_frames_to_select, TEMPORAL_TARGETS)
    # 2. Select frames
    selected_ids = stratified_sample_frames(df, target_counts)
    # 3. Save results
    selected_df = df[df['frame_id'].isin(selected_ids)]
    selected_df.to_csv(output_path, index=False)
    print(f"\nStratified sampling complete. {len(selected_ids)} frames written to {output_path}")

# --- Example Usage ---
if __name__ == "__main__":
    metadata_csv = "frame_metadata.csv"  # from Step 2
    output_file = "stratified_sample.csv"
    sample_size = 10000  # Adjust based on your target dataset size
    run_stratified_sampling(metadata_csv, sample_size, output_file)
Output
- A file, stratified_sample.csv, containing the subset of metadata for the selected frames.
- The selected frames follow the desired cross-distribution across time_block × season × is_promotional.
- The log shows a warning for any empty bin and the selected/requested counts for every other bin, so shortfalls are visible.
You now have a bias-aware, balanced dataset ready for use in fine-tuning. The next step could be to copy the selected frames into a new dataset directory structure, or to feed them into a downstream annotation process.
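Before moving on, it can help to list which bins fell short, since those are the candidates for targeted collection or augmentation. The sketch below works on plain dictionaries rather than a DataFrame so it can run anywhere; the function name and the tiny sample data are illustrative.

```python
from collections import Counter


def bin_shortfalls(records, target_counts):
    """Return {combo: missing_frames} for bins holding fewer frames than requested."""
    available = Counter(
        (rec["time_block"], rec["season"], rec["is_promotional"]) for rec in records
    )
    return {
        combo: wanted - available[combo]
        for combo, wanted in target_counts.items()
        if available[combo] < wanted
    }


# Hypothetical metadata: three winter-morning promotional frames, one spring-night frame
records = [
    {"time_block": "morning", "season": "winter", "is_promotional": True},
    {"time_block": "morning", "season": "winter", "is_promotional": True},
    {"time_block": "morning", "season": "winter", "is_promotional": True},
    {"time_block": "night", "season": "spring", "is_promotional": False},
]
targets = {
    ("morning", "winter", True): 5,
    ("night", "spring", False): 1,
    ("evening", "summer", False): 2,
}
print(bin_shortfalls(records, targets))
```

A report like this gives the collection team a concrete list of conditions to record next.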
Pipeline Deployment Overview: Azure Cloud
Core Azure Services
Pipeline Component | Azure Service Used |
---|---|
Raw video storage | Azure Blob Storage |
Job orchestration | Azure Data Factory or Azure Synapse Pipelines |
Compute for scripts | Azure Batch, Azure Functions, or Azure ML Pipelines |
Metadata database | Azure SQL Database or Azure Data Lake |
Monitoring & alerts | Azure Monitor + Log Analytics |
Final dataset storage | Azure Data Lake Gen2 or Blob Storage |
Sample Pipeline in Azure Data Factory
- Trigger: When a new video is uploaded to a Blob container.
- Step 1 (Metadata Enrichment):
  - Run a Python script in Azure Batch or Azure ML Compute.
  - Output: Frame images + metadata CSV to Azure Blob / Data Lake.
- Step 2 (Monitoring):
  - Use a pipeline step to run the audit script.
  - Save the bias report as a CSV or send alerts via Azure Monitor.
- Step 3 (Stratified Sampling):
  - Run your stratified sampler Python script as a Custom Activity.
  - Output: Selected metadata, used for dataset creation.
Pipeline Deployment Overview: AWS Cloud
Core AWS Services
Pipeline Component | AWS Service Used |
---|---|
Raw video storage | Amazon S3 |
Job orchestration | AWS Step Functions or Amazon MWAA (Airflow) |
Compute for scripts | AWS Lambda, EC2, or AWS Batch |
Metadata database | Amazon RDS or Amazon Athena |
Monitoring & alerts | Amazon CloudWatch + SNS |
Final dataset storage | S3 (versioned buckets) |
Sample Pipeline in AWS Step Functions
- Trigger: S3 event when a new video file is uploaded.
- Step 1 (Metadata Enrichment):
  - Run your Python script in AWS Batch or Lambda.
  - Store output images + metadata CSV in a processed/ S3 path.
- Step 2 (Bias Monitoring):
  - Run the audit job as a Lambda function or Batch job.
  - Store the report in S3 and send alerts via SNS if any category drifts.
- Step 3 (Stratified Sampling):
  - Execute as a Lambda function or a Docker job on ECS/Fargate.
  - Output the selected frame list to S3.
- Step 4 (Optional Augmentation):
  - Trigger augmentation only for flagged categories (CloudWatch metric).
- Step 5 (Dataset Packaging):
  - Move sampled frames into versioned dataset/ folders (e.g., v1/train/, v1/test/).
  - Store a metadata manifest (manifest.json) in each dataset folder.
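The packaging step can be sketched without any cloud SDK: split the selected frame IDs into train and test sets and serialize the result as a manifest. The manifest layout, the function name, and the 80/20 split are assumptions made for illustration; the article does not specify what manifest.json contains.

```python
import json
import random


def build_manifest(frame_ids, version="v1", test_fraction=0.2, seed=42):
    """Split frame IDs into train/test lists and return a manifest dict."""
    ids = sorted(frame_ids)
    # A seeded shuffle keeps the split reproducible across runs
    random.Random(seed).shuffle(ids)
    n_test = int(round(len(ids) * test_fraction))
    return {
        "version": version,
        "train": sorted(ids[n_test:]),
        "test": sorted(ids[:n_test]),
    }


# Hypothetical frame IDs standing in for the stratified sample
frame_ids = [f"store42_2025-05-10_14-00_{i:05d}" for i in range(10)]
manifest = build_manifest(frame_ids)
manifest_json = json.dumps(manifest, indent=2)
print(len(manifest["train"]), len(manifest["test"]))
```

One caveat with a frame-level random split is that neighboring frames from the same video are nearly identical, so splitting by video or by recording session may give a more honest test set.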