Smarter Segments, Smarter Campaigns: A Practical Guide to Einstein Segment Creation

What if you could describe your ideal customer segment in plain English and have AI instantly build it for you? Salesforce’s Einstein Segment Creation transforms this possibility into practice, eliminating the need for SQL queries or complex filter navigation. The key to unlocking its full power lies in strategically integrating your own unique data sources and combining them with the data that Salesforce Marketing Cloud collects natively.

Let’s explore how to make Einstein work seamlessly with your specific business context.

What is Einstein Segment Creation?

Einstein Segment Creation is a generative AI feature within Salesforce Marketing Cloud and Data Cloud. It empowers marketers to define audience segments using natural language prompts. Instead of manually setting filters or writing SQL queries, you can simply describe your desired audience—for example, “Leads in the United States who have opened an email in the last 30 days”—and Einstein will generate the corresponding segment criteria for you.

How It Works

  1. Accessing the Feature: Navigate to the Segments section in Data Cloud or the Campaigns page in Marketing Cloud.
  2. Creating a Segment: Click on “Create with Einstein” and enter a description of your target audience in the chat interface.
  3. AI Processing: Einstein interprets your prompt, queries the unified dataset, and generates a segment based on your criteria.
  4. Review and Activation: You can review, edit, and activate the segment for use in your campaigns.

This AI-driven approach simplifies and accelerates audience segmentation, making it accessible even to those without technical expertise.

What Happens After You Enter a Prompt?

Once you submit your natural language prompt to Einstein Segment Creation:

  • Einstein processes your intent using natural language processing and maps the entities in your prompt (such as job title, email clicks, or geography) to actual data fields in your unified dataset.
  • A visual segment definition is generated, showing:
    • Filters and logic Einstein used
    • Source data sets referenced
    • An estimated segment size
  • You can review and tweak this segment before finalizing. Add or modify filters, exclude groups, or refine the logic.
  • You also get a preview of segment membership — a sample list of contacts or accounts that match the criteria.

After approval, the segment appears in Data Cloud > Segments, and if your environment is connected to Marketing Cloud, you can sync it for use in journeys, campaigns, or dynamic audiences.

The Three-Step Flow for Using Custom-Data-Enhanced Segments

To successfully use AI-powered segments in Marketing Cloud based on enriched customer data, follow this three-step process:

  • Step 1: Send and aggregate data into Data Cloud
    Unify and enrich customer data from CRM, product systems, and behavioral sources within Data Cloud.
  • Step 2: Create the segment in Marketing Cloud
    Use Einstein’s natural language interface to describe your audience and automatically generate a segment definition.
  • Step 3: Activate and sync the segment to Marketing Cloud
    Push the refined audience from Data Cloud into a Marketing Cloud Data Extension for use in campaigns, journeys, or personalization workflows.

This seamless flow bridges advanced segmentation logic with campaign execution, allowing marketers to act on deep insights without complex technical steps. Let’s explore these individually.

Step 1: Getting Your Data into Salesforce Data Cloud – The Right Way (Establishing the Foundation)

Although the UI for segment creation resides within Marketing Cloud, the actual segment building happens in Data Cloud. Adding data to customer profiles within Data Cloud isn’t just a technical step; it is the foundation of every intelligent segment Einstein will build for you. Here’s how to approach it strategically.

What Kind of Data Are We Talking About?

When we say “your own data,” we mean any first-party information that gives insight into your customers or prospects. This could include:

  • CRM data: Names, email addresses, phone numbers, lead source, account hierarchy, opportunity stages.
  • Behavioral data: Email opens, clicks, website visits, event registrations, mobile app usage.
  • Transactional data: Purchase history, subscription renewals, cart abandonment, refund requests.
  • Offline data: Call center logs, store visits, surveys, loyalty programs.
  • External data: Third-party enrichment data, ad campaign metrics from platforms like Meta or Google Ads.

Connecting Your Data Sources

Salesforce provides multiple ways to bring this data into Data Cloud:

  1. Native Connectors: Integration with Salesforce products like Sales Cloud or Service Cloud.
  2. Marketing Cloud Integration: Sync data extensions from Marketing Cloud Engagement.
  3. APIs and Batch Uploads: Push data from external systems or data warehouses (a sketch of the API route follows this list).
  4. Cloud Storage Connectors: Automate ingestion from AWS S3, Google Cloud Storage, or Azure.
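For the API route, here is a minimal sketch of pushing records into Data Cloud through its streaming Ingestion API. The tenant endpoint, source API name, object name, and field names below are placeholders for illustration, and the access token is assumed to have been obtained separately through your connected app’s OAuth flow:

import requests

# Placeholder values -- replace with your own tenant endpoint, ingestion source, and object
TENANT_ENDPOINT = "https://your-tenant.c360a.salesforce.com"
SOURCE_API_NAME = "crm_ingestion"        # hypothetical Ingestion API source
OBJECT_NAME = "lead_engagement"          # hypothetical object within that source
ACCESS_TOKEN = "<OAuth access token>"    # obtained via your connected app

def push_records(records):
    """Send a batch of records to the Data Cloud streaming Ingestion API (illustrative)."""
    url = f"{TENANT_ENDPOINT}/api/v1/ingest/sources/{SOURCE_API_NAME}/{OBJECT_NAME}"
    response = requests.post(
        url,
        json={"data": records},
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.status_code

# Example: push one enriched lead record
push_records([{"email": "jane@example.com", "trial_status": "active", "demo_status": False}])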

Structuring and Updating the Data

Unless you are already using a separate customer data platform where profile data is aggregated under a single record per customer, you will need to perform essential data preparation steps, such as mapping source objects to the Data Cloud data model and resolving identities into unified profiles, before the additional data can actually be used. You will also need to ensure that changes in source data from legacy systems are continuously synced into Data Cloud.

Note: If you update data in Data Cloud, the predictive scores and AI-driven segment membership will update after the next scoring and segment refresh.

Step 2: Create the Segment in Marketing Cloud

Once your data is unified within Data Cloud, it’s time to create your segment using Einstein:

  • Open the Campaigns tab in Marketing Cloud or go directly to Segments in Data Cloud.
  • Click “Create with Einstein”.
  • In the chat interface, describe your desired audience in plain language (e.g., “Finance managers in Canada who clicked on a webinar invite email last month and have not booked a demo”).

Einstein interprets the prompt using natural language processing, identifies matching data attributes, and auto-generates a segment with filter logic and size estimates. You can preview the list, tweak filters, and finalize the segment—all without writing SQL.

Step 3: Sync Data Cloud with Marketing Cloud

After finalizing your segment in Data Cloud, make it campaign-ready:

  • Go to Data Cloud > Segments
  • Locate your new segment and click Activate
  • Choose Marketing Cloud Engagement as the destination

What gets synced is not all the raw data from your CRM, product systems, or external sources. Instead, Salesforce sends a refined audience list—the result of your segment definition. This includes:

  • Identifiers like SubscriberKey, email address, or contact ID
  • Select enriched fields needed for personalization (e.g., job title, region, trial status)
  • Only the data that’s required for segmentation or campaign logic

This data appears in Marketing Cloud as a Data Extension, which can then be used in:

  • Journey Builder (as entry criteria)
  • Email Studio (for targeting and personalization)
  • Mobile Studio (for push/SMS)

The full customer profile and unified data remain in Data Cloud, allowing Marketing Cloud to act on the intelligence without duplicating all the underlying information.

This ensures your AI-powered segment is fully actionable inside Marketing Cloud workflows while keeping your data infrastructure clean and scalable.

Dynamic Segment Syncing

Once a segment has been activated and synced to Marketing Cloud, its membership can be kept up-to-date—but only if you configure it accordingly. Salesforce allows you to choose from:

  • Real-time sync: Automatically updates the Marketing Cloud Data Extension as soon as membership changes in Data Cloud.
  • Scheduled sync: Refreshes segment membership at regular intervals (e.g., hourly, daily).
  • Manual sync: Requires the user to manually re-activate the segment to refresh the audience.

If you’re using real-time or scheduled sync, changes in customer data, such as a trial beginning, a demo being booked, or a change in job role, will trigger an automatic update to segment membership. As a result, the corresponding Data Extension in Marketing Cloud Engagement will reflect the updated audience without requiring manual intervention.

However, if the segment was exported as a static list (manual sync), the Data Extension remains frozen in time. You’ll need to re-run and re-sync the segment to reflect changes.

Pro Tip: For dynamic journeys and personalization strategies, always use real-time or scheduled syncing to ensure your campaigns stay accurate and relevant.

Real-Life Segment Example: B2B SaaS Marketing Use Case

Company Profile

A B2B SaaS company sells enterprise workflow automation tools. Their primary customers are large organizations, and they use a mix of free trials, product-led growth (PLG), and account-based marketing (ABM).

Target Segment: “Engaged Mid-Funnel Accounts from Healthcare Sector in North America, Focused on Primary Care Providers and Medical Reps”

Why This Segment?

The marketing team wants to focus on accounts that have shown buying signals but haven’t converted. Sales is prioritizing the healthcare vertical in the US and Canada with a focus on a specific sub-audience: Primary Care Providers (e.g., GP surgeries) and Medical Representatives selling into these practices. The goal is to accelerate deals by precisely targeting these influential decision-makers.

Example Prompt to Einstein:

“Companies in the healthcare sector based in North America where at least 2 contacts have clicked on a product tour email in the last 30 days. Only include primary care providers or medical representatives selling to them.”

What Data Might Already Exist in Marketing Cloud?

  • Email interaction data (opens, clicks)
  • Subscriber metadata (title, company, country)
  • Campaign membership (nurture journeys)
  • Basic industry tags (if imported manually)

Additional Data That Strengthens This Segment:

| Data Point | Example Field | Why It Helps | Where It Might Come From |
| --- | --- | --- | --- |
| Industry | industry from CRM | Filters for healthcare accounts | Salesforce Sales Cloud, third-party enrichment |
| Audience Type | audience_type field | Excludes non-target roles (e.g., procurement) | Enrichment sources, firmographic data, CRM |
| Region | country, region | Targets North America only | Subscriber profile, CRM |
| Product Trial Status | trial_status = active | Includes trial users | Product analytics tools (e.g., Mixpanel, Pendo) |
| Demo Booking Status | demo_status = false | Excludes sales-engaged leads | CRM (Sales Cloud), Sales activity logs |
| Engagement Level | Web visits, email clicks | Prioritizes active accounts | Marketing Cloud Data Views, Web Analytics |

How Einstein Uses This Data

When a marketer enters a natural language prompt, Einstein:

  • Parses “healthcare” and filters based on industry
  • Identifies niche roles using audience_type (e.g., PCPs)
  • Identifies location from country
  • Pulls click and open activity from _Click and _Open data views
  • Checks trial activity from custom fields or DEs
  • Excludes accounts with demo_status = true

The result is a smart segment like:

Filters:

  • Industry = Healthcare
  • Audience Type = PCP
  • Country IN (US, Canada)
  • Trial Status = Active
  • At least two contacts clicked a product tour email in last 30 days
  • Demo not booked

Einstein may also suggest enhancements like:

  • “Would you like to filter by company size?”
  • “Should I prioritize users with high product usage scores?”

What to Watch Out For: Common Pitfalls in Segment Creation

While Einstein Segment Creation dramatically simplifies the audience-building process, it’s not without its practical challenges. Here are key pitfalls to be aware of:

  • Prompt Variability
    Different natural language prompts can yield slightly different logic. Standardize your phrasing and test multiple versions before deploying.
  • No Segment Versioning
    Changes overwrite prior versions. Always document or clone critical prompts for reference.
  • Testing Limitations
    There’s no true sandbox. Use preview membership and export sample contacts to validate logic before activation.
  • Ambiguous Language
    Vague prompts like “engaged leads” may be misinterpreted. Be specific: use measurable criteria like “clicked email in last 30 days.”
  • Overlapping Prompts
    Multiple prompts can result in nearly identical segments. Track and label your prompts clearly to avoid redundancy.
  • Incomplete or Inconsistent Data
    Segments are only as good as the data behind them. If critical fields like industry or trial_status are missing or inconsistently populated, targeting breaks down.
  • Sync Mode Confusion
    Static vs. dynamic segments behave differently. Ensure your campaign team understands whether the audience list auto-updates.
  • Field Accessibility Gaps
    Not all fields in Data Cloud are available in Marketing Cloud unless explicitly included during sync setup.
  • Privacy and Compliance Risks
    Avoid syncing sensitive or irrelevant fields to Marketing Cloud, especially where data access is governed by regional or role-based policies.

Understanding and planning for these challenges will help you unlock Einstein Segment Creation’s potential while minimizing campaign risk.

Final Words

Einstein Segment Creation isn’t just a feature—it’s your gateway to a more intelligent, responsive marketing organization. By combining natural language segment creation with comprehensive activation across email, journeys, mobile, and advertising channels, you create a marketing engine that’s both powerful and accessible.

Evaluating LLM Output Using an AI-Powered Evaluator Model

Introduction

Our introductory post on measuring LLM performance explored how AI-based evaluation methods can help assess generative AI output using business-aligned proxy metrics—predictive scores tied to real-world outcomes like email open rates for specific LLM outputs.

In this post, we take a more hands-on approach and walk through the technical details of implementation.

You’ll see how to build two predictive models designed to evaluate the effectiveness of different LLMs in generating marketing emails:

  • A Subject Line Open Rate Predictor
  • A Click-Through Rate (CTR) Predictor for the email body

These evaluation models are the core of a business-aligned, outcome-oriented evaluation pipeline for fine-tuned LLMs.

How LLMs and Evaluation Models Work Together

To avoid confusion, it’s important to distinguish between what goes into your LLM and what goes into your evaluation models:

  • The LLM takes in a natural language prompt, such as:

“Write a promotional email for 20% off on eco-friendly water bottles. Include a subject line and a short persuasive body.”

It outputs a subject line and body text.

  • The evaluation models do not receive prompts. Instead, they ingest structured data extracted from the LLM’s outputs—things like word counts, text embeddings, flags (e.g., “contains discount”), and metadata (e.g., campaign type).

In short, LLMs generate content. Evaluation models score it.

Ingredients: What You’ll Need

Data Sources

  • Historical email campaigns (subject lines, bodies, open rates, CTRs)

Tools

  • Python, Pandas, scikit-learn
  • SentenceTransformer (for embeddings)
  • Optional: AutoML platforms like Google Vertex AI or DataRobot

Step 1: Getting the Data (If You Use Salesforce Marketing Cloud)

If your email campaigns were sent using Salesforce Marketing Cloud (SFMC), you can extract the necessary training data in a few ways:

  • Data Extract via Automation Studio: Use a Data Extract Activity to pull email tracking data (such as subject lines, opens, and clicks). This can be scheduled as part of an automation and exported to Enhanced FTP.
  • REST or SOAP API: Use SFMC’s Tracking API to retrieve send results, opens, and click-throughs programmatically.

You’ll want to collect:

  • Subject lines
  • Email body copy (or references to content blocks)
  • Open and click-through rates
  • Metadata such as send time, audience segment, and campaign category

This becomes the foundation of your evaluation models.

Once extracted, your raw data needs to be stored in a secure and accessible location for further processing. In most practical setups, this means one of the following:

  • Cloud storage (e.g., Amazon S3, Google Cloud Storage, Azure Blob): scalable and integrates well with ML pipelines.
  • Data warehouse (e.g., Snowflake, BigQuery, Redshift): ideal for structured querying, joins with other business data, and large-scale reporting.
  • Local file system or shared drive: acceptable for small teams or prototyping.

From there, you can read the raw data into a Python or notebook environment (via Pandas or Spark), and begin your feature extraction and model training workflows.
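As a minimal sketch of that loading step, assume the tracking extract has landed as a CSV with one row per send and columns such as subject_line, email_body, sends, unique_opens, and unique_clicks (the file name and column names are illustrative, not SFMC defaults):

import pandas as pd

# Load the exported tracking data (path and schema are assumptions for this example)
df = pd.read_csv("sfmc_tracking_extract.csv")

# Derive the target variables used by the two evaluation models
df["open_rate"] = df["unique_opens"] / df["sends"]
df["click_through_rate"] = df["unique_clicks"] / df["sends"]

# Keep only what the feature engineering steps below will need
df = df[["subject_line", "email_body", "send_hour", "campaign_type",
         "audience_segment", "open_rate", "click_through_rate"]]
print(df.head())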

To prepare the training dataset, you might need to implement some or all of the steps outlined in our data preparation framework for generative AI models.

Step 2: Prepare the Feature Sets

A Note About Feature Engineering

Before a machine learning model can make accurate predictions, it needs structured, numerical input—this is what “features” are. We extract features from raw email text and metadata so that the model can learn patterns that influence open rates or click-through rates. For example, whether a subject line contains urgency words, includes a discount, or is sent at a particular time can all impact engagement. Feature engineering helps convert human-readable content into model-friendly formats like embeddings, binary indicators, and numeric counts.

Here is an example of a single row of feature data for the Open Rate Predictor:

| Feature | Value |
| --- | --- |
| Embedding (vector) | [0.11, -0.04, …, 0.07] (384 dims) |
| word_count | 6 |
| char_count | 48 |
| has_percent | 1 |
| has_number | 1 |
| has_urgency | 1 |
| is_personalized | 0 |
| send_hour | 10 |
| campaign_type_promo | 1 |
| audience_segment_loyal | 1 |
The full dataset is a matrix of many such rows, each representing one LLM-generated (or historical) email entry.

Below is the code to transform raw text and metadata into meaningful features for both evaluators.

Open Rate Predictor (for Subject Lines)

Add Text Embeddings

Use a transformer model to convert subject lines into vectors that capture semantic meaning.

				
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
# encode() returns a 2D array; store one vector per row by converting it to a list of vectors
df['embeddings'] = list(model.encode(df['subject_line'].tolist()))
				
			

Add Text-Derived Features

				
					df['word_count'] = df['subject_line'].str.split().str.len()
df['char_count'] = df['subject_line'].str.len()
df['has_percent'] = df['subject_line'].str.contains('%').astype(int)
df['has_number'] = df['subject_line'].str.contains(r'\d').astype(int)
df['has_urgency'] = df['subject_line'].str.contains(r'now|today|urgent|final', case=False).astype(int)
df['is_personalized'] = df['subject_line'].str.contains(r'\[First Name\]', case=False).astype(int)
				
			

Add Metadata Features

				
					# Assume these columns already exist in your DataFrame
df = pd.get_dummies(df, columns=['campaign_type', 'audience_segment'], drop_first=True)
				
			

Prepare the Final Feature Set

  • Embedding vectors (typically shape: N x 384)

  • Scalar and binary features

Store these together as a NumPy array:

				
import numpy as np
# One-hot columns created earlier by pd.get_dummies
dummy_cols = list(df.columns[df.columns.str.startswith('campaign_type_') | df.columns.str.startswith('audience_segment_')])
scalar_cols = ['word_count', 'char_count', 'has_percent', 'has_number', 'has_urgency', 'is_personalized', 'send_hour']
X = np.hstack([df['embeddings'].tolist(), df[scalar_cols + dummy_cols].values])
y = df['open_rate']
				
			

This final matrix X is used for training. For offline training, you can simply store your features in:

  • .pkl (Pickle) files: ideal for fast local reloading in Python

  • .csv files: good for portability and manual inspection

  • .parquet or .feather: efficient for larger datasets and DataFrame-based pipelines

CTR Predictor (for Email Body Copy)

The conceptual outline for the CTR predictor mirrors that of the open rate predictor: instead of the email subject line you will use the email body, and instead of open rate as the output you will use email click-through rate.

Add Text Embeddings for Email Body

				
# As before, store one embedding vector per row rather than a 2D array
df['embeddings'] = list(model.encode(df['email_body'].tolist()))
				
			

Add Structure-Based Features

				
import re

df['word_count'] = df['email_body'].str.split().str.len()
df['link_count'] = df['email_body'].str.count('http')
df['cta_count'] = df['email_body'].str.count(r'buy now|shop now|learn more', flags=re.IGNORECASE)
				
			

Add Sentiment and Signals

				
					from textblob import TextBlob
df['sentiment'] = df['email_body'].apply(lambda x: TextBlob(x).sentiment.polarity)
df['has_urgency'] = df['email_body'].str.contains(r'limited time|hurry|act now|final', case=False).astype(int)
df = pd.get_dummies(df, columns=['cta_type'], drop_first=True)
				
			

Assemble the Final Feature Set

				
					X = np.hstack([
    df['embeddings'].tolist(),
    df[['word_count', 'link_count', 'cta_count', 'sentiment', 'has_urgency'] + list(df.columns[df.columns.str.startswith('cta_type_')])].values
])
y = df['click_through_rate']
				
			

Store the final X and y for reuse and training. You can persist the preprocessed features with joblib, pickle, or parquet format.
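For example, a minimal sketch of persisting and reloading the prepared features with joblib (the file name is arbitrary):

import joblib

# Persist the prepared features and targets so training can be re-run without re-extraction
joblib.dump({'X': X, 'y': y}, 'ctr_features.joblib')

# ...later, in the training script
data = joblib.load('ctr_features.joblib')
X, y = data['X'], data['y']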

Step 3: Model Training, Evaluation, and Use

With the feature matrices X and labels y prepared, you can now train your predictive models using a supervised learning approach.

Train the Model

				
					from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
				
			

Evaluate the Model

Use standard regression metrics:

  • R² (coefficient of determination): how much variance your model explains

  • MAE (mean absolute error): average difference between predicted and actual values

  • RMSE (root mean squared error): penalizes large errors

Visual diagnostics (like scatterplots of predicted vs. actual) can also help.
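A minimal sketch of computing these metrics with scikit-learn, continuing from the predictions generated above:

import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

r2 = r2_score(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"R²: {r2:.3f}  MAE: {mae:.4f}  RMSE: {rmse:.4f}")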

Score New Content

Once trained, your models can be used to evaluate new LLM-generated content:

  • Feed each subject line into the Open Rate model

  • Feed each body copy into the CTR model

You can then compute a composite score to rank the outputs:

final_score = 0.4 * predicted_open_rate + 0.6 * predicted_ctr

Doing this for every LLM contestant produces an evaluation matrix that gives you an objective basis for selecting which model to use.
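Here is a minimal sketch of that ranking step. It assumes open_rate_model and ctr_model are the two trained evaluators from Step 3, and build_subject_features and build_body_features are hypothetical helpers that apply the same feature engineering used during training to a single candidate output:

# build_subject_features() and build_body_features() are hypothetical helpers that
# must return feature rows with the same shape as the training matrices
candidates = {
    'llm_a': {'subject': "Last chance: 20% off eco-friendly bottles",
              'body': "Hurry, the offer ends tonight. Shop now."},
    'llm_b': {'subject': "Refresh your routine with 20% off",
              'body': "Our eco-friendly bottles keep drinks cold all day. Learn more."},
}

scores = {}
for name, content in candidates.items():
    predicted_open_rate = open_rate_model.predict(build_subject_features(content['subject']))[0]
    predicted_ctr = ctr_model.predict(build_body_features(content['body']))[0]
    # Weighted composite score using the formula above
    scores[name] = 0.4 * predicted_open_rate + 0.6 * predicted_ctr

print(max(scores, key=scores.get))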

Data Preparation for YOLOv5 CV model in Retail Analytics

Introduction

Preparing high-quality training datasets for computer vision models demands a deliberate approach to addressing marketing bias. While our earlier article introduced the concept at a high level, this recipe focuses on the practical side, providing data engineers with a technical, developer-focused walkthrough for converting theoretical bias dimensions from a bias matrix into concrete, production-ready data pipelines. For full context, we recommend reviewing our foundational pieces on the LLM data preparation framework and the underlying principles of marketing bias.

Business Background

A multi-channel retailer aimed to deepen its understanding of in-store customer behavior through the use of YOLOv5-based computer vision technology. Partnering with a specialized AI consultancy, the retailer began deploying custom-trained models on edge devices throughout its store network, with the goal of eventually scaling coverage to all locations.

This initiative went beyond passive observation. By linking visual insights to their e-commerce and product information management (PIM) systems, the retailer sought to generate actionable intelligence, such as identifying when customers examine labels but don’t purchase, or detecting common interest in overlooked products. These signals could then power real-time personalization and long-term experience optimization.

To achieve this, a critical early step was the creation of a training dataset that accurately reflected the diversity of in-store shopping behavior, not just across demographics or store types, but across time. That required a dedicated effort to identify and mitigate temporal bias.

Solution Overview

YOLOv5 models in this setting are trained on annotated video frames that capture real-world customer interactions in-store. The typical process involves:

  • Extracting frames from raw video footage
  • Annotating those frames with structured behavior labels
  • Training the model on this annotated dataset
  • Deploying the model to edge devices
  • Streaming real-time frame data or interaction events to local servers for further actioning

These structured interaction events are then forwarded by the local server to backend enterprise systems (e.g., CRM, PIM, e-commerce) where they inform decisions ranging from product placement to automated marketing offers.

But none of this works reliably without an unbiased dataset. Specifically, temporal bias—the uneven representation of certain time windows in the dataset—can significantly degrade model performance during underrepresented conditions like early mornings, off-season shopping, or promotional periods.

Bridging Marketing and Engineering: The Bias Matrix

To formalize their bias mitigation strategy, the team adopted a bias matrix—a shared framework developed by marketing and analytics stakeholders. This matrix identifies key bias categories, defines how they should be measured, establishes acceptable target distributions, and outlines mitigation strategies.

Here is a simplified excerpt from the full matrix:

| Bias Category | Bias Type | Measurement Method | Target Distribution | Mitigation Strategy |
| --- | --- | --- | --- | --- |
| Temporal Bias | Day of Week | % frames per day | Mon–Fri: 10–15% each; Sat–Sun: 15–20% each | Stratified sampling across days |
| Temporal Bias | Time of Day | % frames per time block | Morning: 20–25%; Afternoon: 30–35%; Evening: 25–30%; Night: 10–15% | Weighted extraction from recordings |
| Temporal Bias | Seasonal | % frames per season | Spring, Summer, Fall, Winter: 20–30% each | Multi-season collection and augmentation |
| Temporal Bias | Promotional vs. Normal | % frames during promotions | Normal: 60–70%; Promotional: 30–40% | Targeted collection during promotions |
While the full matrix includes other bias types—such as demographic, store environment, and product interaction biases—this guide focuses exclusively on temporal bias.

In the sections that follow, we will walk through how to operationalize each temporal subcategory into a robust and reusable data pipeline that ensures balanced temporal representation during training.

This includes:

  1. Structuring metadata for temporal traceability
  2. Monitoring live data distributions
  3. Designing stratified and weighted sampling logic
  4. Implementing augmentation for rare conditions
  5. Introducing performance-driven feedback loops

Together, these steps form a pipeline blueprint that transforms the bias matrix from a theoretical checklist into a concrete system architecture.

Technical Implementation

The temporal bias dimension implies that the training data must contain the following temporal distribution:

  1. Day of Week: Mon–Fri (10–15% each), Sat–Sun (15–20% each)
  2. Time of Day: Morning (20–25%), Afternoon (30–35%), Evening (25–30%), Night (10–15%)
  3. Season: Each season (20–30%)
  4. Promotional Period: Normal (60–70%), Promotional (30–40%)

This implies that for every 100 samples, each weekday (Monday through Friday) should contribute between 10 and 15 samples, while Saturday and Sunday should each contribute between 15 and 20 samples, ensuring the overall total remains within the 100-sample limit. Similar conditions must be met for the time of the day, season, and promotional period.

The process of translating the bias matrix into concrete data pipelines is outlined below.

Step 1: Extract Image Frames from Video

This is the first step in the data preparation process and involves parsing the raw video footage into image frames. Notice that to train the YOLOv5 model, you need to pass it annotated image frames and not the whole video file.

  1. Extract frames from raw video footage – The input to this step is a raw video file (MP4, AVI, etc.) captured from in-store cameras. Each video file represents a continuous sequence of real-world customer activity within a store over a certain time period (e.g., 1 hour of footage from a Tuesday afternoon in Store #42).
  2. The extraction can be done as a batch process using the Python OpenCV module, which takes the raw video file as input and outputs individual frames.
				
					import cv2
import os

def extract_frames(video_path, output_dir, sampling_rate=1):
    """Extract frames from a video at roughly `sampling_rate` frames per second."""
    os.makedirs(output_dir, exist_ok=True)

    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    # Guard against a sampling rate higher than the video's FPS
    frame_interval = max(1, int(fps / sampling_rate))

    frame_count = 0
    saved_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # Keep only every Nth frame to match the requested sampling rate
        if frame_count % frame_interval == 0:
            frame_name = f"frame_{saved_count:05d}.jpg"
            cv2.imwrite(os.path.join(output_dir, frame_name), frame)
            saved_count += 1
        frame_count += 1

    cap.release()

				
			

Each frame is now an image file (e.g., frame_00001.jpg) that becomes a candidate training example—but it’s not usable yet until it’s enriched with metadata.

Step 2: Enrich Frames With Metadata

Frames alone are just pixels. To implement bias mitigation, we must know when and where each frame was captured. Metadata provides:

  • The timestamp (e.g., May 2, 2025, 14:33)

  • The day of the week (e.g., Friday)

  • The time block (e.g., afternoon)

  • The season (e.g., spring)

  • Whether it was during a promotional period

  • Context about the store (location, format, department)

This enriched metadata allows us to:

  • Monitor bias (e.g., 22% of frames are from Friday afternoons)
  • Sample intentionally (e.g., we need more morning frames)
  • Augment strategically (e.g., simulate winter lighting)

To hold this metadata, you need to alter your existing table or create a new one:

				
					CREATE TABLE frame_metadata (
    frame_id VARCHAR(255) PRIMARY KEY,
    timestamp DATETIME,
    day_of_week TINYINT,
    time_block VARCHAR(20),
    season VARCHAR(20),
    is_promotional BOOLEAN,
    frame_path VARCHAR(255)
);

				
			

Your code must now extract this metadata from each frame.

				
					import cv2
import os
import json
import pandas as pd
from datetime import datetime, timedelta

# ------------- Helper Functions -------------

def categorize_time_block(hour):
    """Classify the hour into a time block."""
    if 6 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 17:
        return 'afternoon'
    elif 17 <= hour < 21:
        return 'evening'
    else:
        return 'night'

def determine_season(month):
    """Map month to a season."""
    if month in [12, 1, 2]:
        return 'winter'
    elif month in [3, 4, 5]:
        return 'spring'
    elif month in [6, 7, 8]:
        return 'summer'
    else:
        return 'fall'

def check_promotional(timestamp, promo_calendar):
    """Return True if the date is in the promotional calendar."""
    date_str = timestamp.strftime('%Y-%m-%d')
    return date_str in promo_calendar

def parse_timestamp_from_filename(filename):
    """
    Extract timestamp from filename format: storeID_YYYY-MM-DD_HH-MM.mp4
    Example: 'store42_2025-05-10_14-00.mp4'
    """
    parts = filename.replace('.mp4', '').split('_')
    if len(parts) < 3:
        raise ValueError("Filename does not contain timestamp in expected format.")
    timestamp_str = f"{parts[1]}_{parts[2]}"
    return datetime.strptime(timestamp_str, '%Y-%m-%d_%H-%M')

# ------------- Main Function -------------

def extract_frames_with_metadata(video_path, output_dir, sampling_rate=1, store_info=None, promo_calendar=None):
    """
    Extract frames from a video and enrich each with temporal metadata.
    
    Args:
        video_path (str): Path to input video file.
        output_dir (str): Directory to save extracted frames.
        sampling_rate (int): Frames per second to extract.
        store_info (dict): Optional store metadata.
        promo_calendar (set): Set of promotional dates (YYYY-MM-DD).
    
    Returns:
        List of dictionaries, each containing metadata for one frame.
    """
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)

    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_interval = int(fps / sampling_rate)
    frame_index = 0
    saved_index = 0
    metadata_records = []

    # Get base timestamp from video filename
    base_timestamp = parse_timestamp_from_filename(os.path.basename(video_path))

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if frame_index % frame_interval == 0:
            # Save frame image
            frame_filename = f"frame_{saved_index:05d}.jpg"
            frame_path = os.path.join(output_dir, frame_filename)
            cv2.imwrite(frame_path, frame)

            # Estimate timestamp for this frame
            frame_timestamp = base_timestamp + timedelta(seconds=(frame_index / fps))

            # Generate metadata
            metadata = {
                'frame_id': f"{saved_index:05d}",
                'timestamp': frame_timestamp.isoformat(),
                'day_of_week': frame_timestamp.weekday(),
                'time_block': categorize_time_block(frame_timestamp.hour),
                'season': determine_season(frame_timestamp.month),
                'is_promotional': check_promotional(frame_timestamp, promo_calendar or set()),
                'frame_path': frame_path
            }

            if store_info:
                metadata.update({
                    'store_id': store_info.get('store_id'),
                    'store_format': store_info.get('format'),
                    'location_type': store_info.get('location_type'),
                    'department': store_info.get('department')
                })

            metadata_records.append(metadata)
            saved_index += 1

        frame_index += 1

    cap.release()
    return metadata_records

# ------------- Run Script -------------

if __name__ == "__main__":
    # Input video (must include timestamp in filename)
    video_file = "store42_2025-05-10_14-00.mp4"
    output_folder = "frames_output"

    # Example store metadata
    store_metadata = {
        'store_id': 'store42',
        'format': 'medium',
        'location_type': 'urban',
        'department': 'snacks'
    }

    # Example promotional calendar
    promotional_dates = {"2025-05-10", "2025-05-11"}

    # Run extraction
    metadata = extract_frames_with_metadata(
        video_path=video_file,
        output_dir=output_folder,
        sampling_rate=1,  # 1 frame per second
        store_info=store_metadata,
        promo_calendar=promotional_dates
    )

    # Save metadata to CSV
    pd.DataFrame(metadata).to_csv("frame_metadata.csv", index=False)
    print(f"Extracted {len(metadata)} frames and saved metadata to frame_metadata.csv.")

				
			

Step 3: Temporal Distribution Monitoring

The goal of this step is to measure and track how well your current dataset matches the temporal targets defined in the bias matrix. You’re not yet selecting or sampling data here—you’re auditing what’s already been collected or extracted.

This step helps answer key questions like:

  • Are we overrepresenting certain days or time blocks?

  • Have we collected enough data from underrepresented seasons?

  • Are promotional periods adequately covered?

  • Are we drifting away from the target distribution?

The output of this step is a set of summary statistics and optionally a dashboard or alert system that gives visibility into bias risks.

Why This Matters

Before you can fix a bias, you have to see it.

Raw data pipelines often reflect uncontrolled collection patterns:

  • Store staff may record more during peak hours.

  • Some cameras may run longer on weekdays than weekends.

  • Promotions may not be tagged consistently.

If you blindly use this data for training, you risk building a model that performs well only in overrepresented conditions—and fails silently when exposed to underrepresented contexts (e.g., late nights or winter).

Temporal Distribution Monitoring gives you ground truth visibility over the balance of your dataset, so you can take action in later steps (sampling, weighting, augmentation).

What Exactly Are You Doing in This Step?

You are:

  1. Querying your enriched metadata (from Step 2) to compute the actual distribution of frames across temporal dimensions:

    • How many frames came from each day of the week?

    • What percent came from each time block (morning, afternoon, evening, night)?

    • What is the seasonal breakdown?

    • What portion occurred during promotions?

  2. Comparing those distributions to the target ranges specified in the bias matrix.

  3. Identifying any deviations, i.e., subcategories that are underrepresented or overrepresented compared to the target.

  4. (Optionally) Visualizing the metrics via bar charts, gauges, or dashboards to make imbalances obvious and track drift over time.

  5. (Optionally) Triggering alerts if certain thresholds are violated—for example, if “morning” frames fall below 15%, even though the target minimum is 20%

Sample Output for Step 3

Imagine you’ve processed 10,000 frames and want to evaluate them against the target temporal bias distribution.

1. Day of Week Distribution
| Day | Frame Count | Actual % | Target Range (%) | Status |
| --- | --- | --- | --- | --- |
| Monday | 980 | 9.8% | 10–15% | Slightly Low |
| Tuesday | 1120 | 11.2% | 10–15% | Within Range |
| Wednesday | 1105 | 11.1% | 10–15% | Within Range |
| Thursday | 1087 | 10.9% | 10–15% | Within Range |
| Friday | 1012 | 10.1% | 10–15% | Within Range |
| Saturday | 1845 | 18.5% | 15–20% | Within Range |
| Sunday | 1851 | 18.5% | 15–20% | Within Range |

Note: Monday is slightly under target; all other days are compliant.


2. Time of Day Distribution
| Time Block | Frame Count | Actual % | Target Range (%) | Status |
| --- | --- | --- | --- | --- |
| Morning | 1680 | 16.8% | 20–25% | Underrepresented |
| Afternoon | 3502 | 35.0% | 30–35% | Within Range |
| Evening | 2863 | 28.6% | 25–30% | Within Range |
| Night | 1955 | 19.5% | 10–15% | Overrepresented |

Observation: Nighttime data is significantly overrepresented; morning data is under-collected.


3. Seasonal Distribution
| Season | Frame Count | Actual % | Target Range (%) | Status |
| --- | --- | --- | --- | --- |
| Spring | 2300 | 23.0% | 20–30% | Within Range |
| Summer | 2430 | 24.3% | 20–30% | Within Range |
| Fall | 2515 | 25.2% | 20–30% | Within Range |
| Winter | 2755 | 27.5% | 20–30% | Within Range |

Observation: Seasonal coverage is balanced—no action needed here.


4. Promotional vs. Normal Periods
| Period | Frame Count | Actual % | Target Range (%) | Status |
| --- | --- | --- | --- | --- |
| Normal | 7680 | 76.8% | 60–70% | Overrepresented |
| Promotional | 2320 | 23.2% | 30–40% | Underrepresented |

Observation: Not enough data collected during promotional campaigns.


Summary Status Report

| Dimension | Issues Detected? | Recommended Action |
| --- | --- | --- |
| Day of Week | Minor (Monday < 10%) | Increase Monday frame collection |
| Time of Day | Yes | Reduce night sampling, increase morning frames |
| Season | No | No action needed |
| Promotional Periods | Yes | Increase targeted capture during promotions |

As can be seen from the status report, the original data looks skewed and there are specific steps that you must take to correct the bias.

The full code snippet for this step will be as follows:

				
					import pandas as pd
from collections import defaultdict

# --- Bias Matrix Temporal Targets ---

BIAS_TARGETS = {
    'day_of_week': {
        0: (10, 15),  # Monday
        1: (10, 15),
        2: (10, 15),
        3: (10, 15),
        4: (10, 15),
        5: (15, 20),  # Saturday
        6: (15, 20),  # Sunday
    },
    'time_block': {
        'morning': (20, 25),
        'afternoon': (30, 35),
        'evening': (25, 30),
        'night': (10, 15),
    },
    'season': {
        'spring': (20, 30),
        'summer': (20, 30),
        'fall': (20, 30),
        'winter': (20, 30),
    },
    'is_promotional': {
        True: (30, 40),
        False: (60, 70),
    }
}

# --- Monitoring Function ---

def calculate_temporal_distribution(df, targets):
    """
    Compare actual distributions with bias matrix targets.

    Args:
        df: DataFrame containing metadata
        targets: Dict with bias targets

    Returns:
        summary_table: A list of dictionaries with metrics per subcategory
    """
    total_frames = len(df)
    summary = []

    for category, category_targets in targets.items():
        value_counts = df[category].value_counts(normalize=True) * 100  # as %
        for value, (min_t, max_t) in category_targets.items():
            actual = value_counts.get(value, 0)
            if actual < min_t:
                status = 'Underrepresented'
            elif actual > max_t:
                status = 'Overrepresented'
            else:
                status = 'Within Range'

            summary.append({
                'Dimension': category,
                'Category': value,
                'Actual %': round(actual, 2),
                'Target Range (%)': f"{min_t}–{max_t}",
                'Status': status
            })

    return pd.DataFrame(summary)

# --- Load Metadata CSV ---

def load_metadata(csv_path):
    """
    Load frame metadata CSV and ensure correct dtypes.
    """
    df = pd.read_csv(csv_path)

    # Coerce categorical types
    df['time_block'] = df['time_block'].astype(str)
    df['season'] = df['season'].astype(str)
    df['is_promotional'] = df['is_promotional'].astype(bool)
    df['day_of_week'] = df['day_of_week'].astype(int)  # 0 = Monday

    return df

# --- Run Script ---

if __name__ == "__main__":
    metadata_file = "frame_metadata.csv"
    df = load_metadata(metadata_file)
    
    report = calculate_temporal_distribution(df, BIAS_TARGETS)
    report.to_csv("temporal_bias_report.csv", index=False)

    print("\n=== Temporal Bias Summary ===\n")
    print(report)
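
    # Optional (items 4 and 5 above): flag any subcategory that has drifted outside
    # its target range; wire this into your preferred alerting channel
    # (email, Slack webhook, Azure Monitor, CloudWatch/SNS, ...)
    violations = report[report['Status'] != 'Within Range']
    if not violations.empty:
        print("\nALERT: temporal bias thresholds violated:")
        print(violations)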

				
			

Step 4: Obtain Temporal Balance

The goal of this step is to select a balanced subset of frames from your raw dataset that closely matches the temporal distribution targets defined in the bias matrix. You are no longer just observing bias (as in Step 3)—you are actively shaping your training dataset to correct for it.

This ensures that:

  • Underrepresented time categories (e.g. winter evenings or promotional mornings) are not ignored

  • Overrepresented categories (e.g. Saturday afternoons) are not allowed to dominate model training

  • Your model is exposed to diverse time-based conditions, improving generalization and robustness

Why This Matters

Without stratified sampling, your training data may inherit the imbalances of how and when footage was collected. For example:

  • If cameras record mostly during the day, the model may fail at night.

  • If most footage is from promotional events, the model might overfit to campaign behaviors.

  • If winter shopping is underrepresented, the model may misinterpret seasonal behaviors (e.g., people wearing bulky coats, less time browsing).

This step gives you fine-grained control over what data is included, ensuring that the final dataset reflects your fairness goals.

What Exactly Are You Doing?

You are:

  1. Using the metadata-enriched frame dataset (from Step 2)

  2. Using the temporal bias targets (defined in the matrix and validated in Step 3)

  3. Calculating the number of frames needed for each temporal subcategory (e.g., 2,500 frames from morning sessions)

  4. Querying or filtering the metadata to pull that exact number of frames per category (or as close as possible)

  5. Compensating if some bins don’t have enough data by logging, augmenting (in later steps), or rebalancing

The output is a balanced, bias-mitigated training dataset ready for use in YOLOv5 model training.

				
					import pandas as pd
import random
from collections import defaultdict
import os

# --- Define Temporal Bias Targets ---

TEMPORAL_TARGETS = {
    'time_block': {
        'morning': 0.23,     # 23% (target range 20–25%)
        'afternoon': 0.33,   # 33% (target range 30–35%)
        'evening': 0.29,     # 29% (target range 25–30%)
        'night': 0.15        # 15% (target range 10–15%); proportions sum to 1.0
    },
    'season': {
        'spring': 0.25,
        'summer': 0.25,
        'fall': 0.25,
        'winter': 0.25
    },
    'is_promotional': {
        True: 0.35,
        False: 0.65
    }
}

# --- Utility Function to Calculate Sampling Plan ---

def compute_stratified_frame_counts(total_frames, targets):
    """
    Calculate number of frames needed per combination of temporal categories.
    
    Args:
        total_frames: Total number of frames to select.
        targets: Nested dict of target proportions per temporal dimension.
    
    Returns:
        Dict mapping category combinations to number of frames.
    """
    target_counts = {}

    # Build all combinations (cartesian product of categories)
    for t_block, p_time in targets['time_block'].items():
        for season, p_season in targets['season'].items():
            for promo, p_promo in targets['is_promotional'].items():
                combo = (t_block, season, promo)
                proportion = p_time * p_season * p_promo
                target_counts[combo] = int(round(total_frames * proportion))
    
    return target_counts

# --- Sample Frames from Metadata Based on Category Combinations ---

def stratified_sample_frames(df, target_counts):
    """
    Select frames from the metadata DataFrame according to target counts.

    Args:
        df: Metadata DataFrame.
        target_counts: Dict with category combinations as keys and sample sizes as values.

    Returns:
        List of selected frame_ids.
    """
    selected_frames = []

    for combo, count in target_counts.items():
        time_block, season, promo = combo

        subset = df[
            (df['time_block'] == time_block) &
            (df['season'] == season) &
            (df['is_promotional'] == promo)
        ]

        available = len(subset)

        if available == 0:
            print(f"WARNING: No frames found for {combo}")
            continue

        sample_size = min(count, available)
        sampled_ids = subset.sample(n=sample_size, replace=False)['frame_id'].tolist()
        selected_frames.extend(sampled_ids)

        print(f"Selected {sample_size}/{count} frames for {combo} (available: {available})")

    return selected_frames

# --- Main Execution ---

def run_stratified_sampling(metadata_path, total_frames_to_select, output_path):
    df = pd.read_csv(metadata_path)

    # Ensure proper types
    df['time_block'] = df['time_block'].astype(str)
    df['season'] = df['season'].astype(str)
    df['is_promotional'] = df['is_promotional'].astype(bool)

    # Step 1: Compute target counts per category combo
    target_counts = compute_stratified_frame_counts(total_frames_to_select, TEMPORAL_TARGETS)

    # Step 2: Select frames
    selected_ids = stratified_sample_frames(df, target_counts)

    # Step 3: Save results
    selected_df = df[df['frame_id'].isin(selected_ids)]
    selected_df.to_csv(output_path, index=False)

    print(f"\nStratified sampling complete. {len(selected_ids)} frames written to {output_path}")

# --- Example Usage ---

if __name__ == "__main__":
    metadata_csv = "frame_metadata.csv"  # from Step 2
    output_file = "stratified_sample.csv"
    sample_size = 10000  # Adjust based on your target dataset size

    run_stratified_sampling(metadata_csv, sample_size, output_file)

				
			

Output

  • A file stratified_sample.csv containing the subset of metadata for the selected frames

  • Each frame respects the desired cross-distribution across:

    • time_block × season × is_promotional

  • Warnings will be shown if not enough frames exist for any bin

You now have a bias-aware, balanced dataset ready for use in fine-tuning. The next step could be to copy the selected frames into a new dataset directory structure, or to feed them into a downstream annotation process.
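
A minimal sketch of that copy step, assuming the directory layout shown is your own convention:

import os
import shutil
import pandas as pd

# Read the stratified selection produced in the previous step
selected = pd.read_csv("stratified_sample.csv")

# Copy each selected frame into a versioned dataset folder (layout is an example)
dataset_dir = os.path.join("dataset", "v1", "train", "images")
os.makedirs(dataset_dir, exist_ok=True)

for frame_path in selected["frame_path"]:
    shutil.copy(frame_path, dataset_dir)

print(f"Copied {len(selected)} frames to {dataset_dir}")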

Pipeline Deployment Overview: Azure Cloud

Core Azure Services

| Pipeline Component | Azure Service Used |
| --- | --- |
| Raw video storage | Azure Blob Storage |
| Job orchestration | Azure Data Factory or Azure Synapse Pipelines |
| Compute for scripts | Azure Batch, Azure Functions, or Azure ML Pipelines |
| Metadata database | Azure SQL Database or Azure Data Lake |
| Monitoring & alerts | Azure Monitor + Log Analytics |
| Final dataset storage | Azure Data Lake Gen2 or Blob Storage |

Sample Pipeline in Azure Data Factory

  1. Trigger: When new video is uploaded to a Blob container.

  2. Step 1 (Metadata Enrichment):

    • Run a Python script in Azure Batch or Azure ML Compute.

    • Output: Frame images + metadata CSV to Azure Blob / Data Lake.

  3. Step 2 (Monitoring):

    • Use a pipeline step to run the audit script.

    • Save the bias report as a CSV or send alerts via Azure Monitor.

  4. Step 3 (Stratified Sampling):

    • Run your stratified sampler Python script as a Custom Activity.

    • Output: Selected metadata → used for dataset creation.

Pipeline Deployment Overview: AWS Cloud

Core AWS Services

| Pipeline Component | AWS Service Used |
| --- | --- |
| Raw video storage | Amazon S3 |
| Job orchestration | AWS Step Functions or Amazon MWAA (Airflow) |
| Compute for scripts | AWS Lambda, EC2, or AWS Batch |
| Metadata database | Amazon RDS or Amazon Athena |
| Monitoring & alerts | Amazon CloudWatch + SNS |
| Final dataset storage | S3 (versioned buckets) |

Sample Pipeline in AWS Step Functions

  1. Trigger: S3 event when a new video file is uploaded.

  2. Step 1 (Metadata Enrichment):

    • Run your Python script in AWS Batch or Lambda.

    • Store output images + metadata CSV in a processed/ S3 path.

  3. Step 2 (Bias Monitoring):

    • Run audit job using a Lambda or Batch job.

    • Store report in S3 and send alerts via SNS if any category drifts.

  4. Step 3 (Stratified Sampling):

    • Execute as a Lambda or Docker job on ECS/Fargate.

    • Output selected frame list to S3.

  5. Step 4 (Optional Augmentation):

    • Trigger augmentation only for flagged categories (CloudWatch metric).

  6. Step 5 (Dataset Packaging):

    • Move sampled frames into versioned dataset/ folders (e.g., v1/train/, v1/test/).

    • Store metadata manifest (manifest.json) in each dataset folder.
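
To make the first pipeline step concrete, here is a minimal sketch of an S3-triggered Lambda handler that hands a newly uploaded video to the extraction logic from Step 2. The bucket name, key prefixes, and the packaging of extract_frames_with_metadata into the deployment (module name frame_extraction) are assumptions; long or high-resolution videos are better handled by AWS Batch than by Lambda.

import os
import boto3
import pandas as pd

# Assumes the Step 2 extraction code is packaged with the deployment as frame_extraction.py
from frame_extraction import extract_frames_with_metadata

s3 = boto3.client("s3")
PROCESSED_BUCKET = os.environ.get("PROCESSED_BUCKET", "retail-cv-processed")  # example name

def lambda_handler(event, context):
    # Standard S3 event payload: bucket and key of the uploaded video
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    local_video = f"/tmp/{os.path.basename(key)}"
    s3.download_file(bucket, key, local_video)

    # Extract frames plus temporal metadata locally, then push results onward
    metadata = extract_frames_with_metadata(local_video, "/tmp/frames", sampling_rate=1)
    manifest_path = "/tmp/frame_metadata.csv"
    pd.DataFrame(metadata).to_csv(manifest_path, index=False)

    prefix = f"processed/{os.path.basename(key)}"
    for item in metadata:
        s3.upload_file(item["frame_path"], PROCESSED_BUCKET, f"{prefix}/{os.path.basename(item['frame_path'])}")
    s3.upload_file(manifest_path, PROCESSED_BUCKET, f"{prefix}/frame_metadata.csv")

    return {"frames_extracted": len(metadata)}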