Introduction
Our introductory post on measuring LLM performance explored how AI-based evaluation methods can help assess generative AI output using business-aligned proxy metrics: predictive scores that tie specific LLM outputs to real-world outcomes such as email open rates.
In this post, we take a more hands-on approach and walk through the technical details of implementation.
You’ll see how to build two predictive models designed to evaluate the effectiveness of different LLMs in generating marketing emails:
- A Subject Line Open Rate Predictor
- A Click-Through Rate (CTR) Predictor for the email body
These evaluation models are the core of a business-aligned, outcome-oriented evaluation pipeline for fine-tuned LLMs.
How LLMs and Evaluation Models Work Together
To avoid confusion, it’s important to distinguish between what goes into your LLM and what goes into your evaluation models:
- The LLM takes in a natural language prompt, such as:
“Write a promotional email for 20% off on eco-friendly water bottles. Include a subject line and a short persuasive body.”
It outputs a subject line and body text.
- The evaluation models do not receive prompts. Instead, they ingest structured data extracted from the LLM’s outputs—things like word counts, text embeddings, flags (e.g., “contains discount”), and metadata (e.g., campaign type).
In short, LLMs generate content. Evaluation models score it.
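To make this concrete, here is an illustrative (made-up) example of the kind of structured record an evaluation model ingests; the full feature set is built in Step 2 below:
# Illustrative only: the evaluation model never sees the prompt, just features like these.
evaluation_input = {
    'embedding': [0.11, -0.04, 0.07],   # semantic vector of the subject line (truncated)
    'word_count': 6,
    'has_discount': 1,                   # flag: "contains discount"
    'campaign_type': 'promo',            # metadata from the campaign system
}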
Ingredients: What You’ll Need
Data Sources
- Historical email campaigns (subject lines, bodies, open rates, CTRs)
Tools
- Python, Pandas, scikit-learn
- SentenceTransformer (for embeddings)
- Optional: AutoML platforms like Google Vertex AI or DataRobot
Step 1: Getting the Data (If You Use Salesforce Marketing Cloud)
If your email campaigns were sent using Salesforce Marketing Cloud (SFMC), you can extract the necessary training data in a few ways:
- Data Extract via Automation Studio: Use a Data Extract Activity to pull email tracking data (such as subject lines, opens, and clicks). This can be scheduled as part of an automation and exported to Enhanced FTP.
- REST or SOAP API: Use SFMC’s Tracking API to retrieve send results, opens, and click-throughs programmatically.
You’ll want to collect:
- Subject lines
- Email body copy (or references to content blocks)
- Open and click-through rates
- Metadata such as send time, audience segment, and campaign category
This becomes the foundation of your evaluation models.
Once extracted, your raw data needs to be stored in a secure and accessible location for further processing. In most practical setups, this means one of the following:
- Cloud storage (e.g., Amazon S3, Google Cloud Storage, Azure Blob): scalable and integrates well with ML pipelines.
- Data warehouse (e.g., Snowflake, BigQuery, Redshift): ideal for structured querying, joins with other business data, and large-scale reporting.
- Local file system or shared drive: acceptable for small teams or prototyping.
From there, you can read the raw data into a Python or notebook environment (via Pandas or Spark), and begin your feature extraction and model training workflows.
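For example, a minimal sketch for loading an exported tracking file into Pandas might look like this (the file name and exact column names are assumptions; adapt them to your export):
import pandas as pd

# Load the exported campaign tracking data (file name and columns are illustrative).
df = pd.read_csv('email_campaign_history.csv')

# Expected fields, one row per campaign: subject_line, email_body, open_rate,
# click_through_rate, send_hour, campaign_type, audience_segment
df = df.dropna(subset=['subject_line', 'email_body', 'open_rate', 'click_through_rate'])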
To prepare the training dataset, you might need to implement some or all of the steps outlined in our data preparation framework for generative AI models.
Step 2: Prepare the Feature Sets
A Note About Feature Engineering
Before a machine learning model can make accurate predictions, it needs structured, numerical input: this is what "features" are. We extract features from raw email text and metadata so that the model can learn patterns that influence open rates or click-through rates. For example, whether a subject line contains urgency words, includes a discount, or is sent at a particular time can all impact engagement. Feature engineering converts human-readable content into model-friendly formats like embeddings, binary indicators, and numeric counts.
Here is an example of a single row of feature data for the Open Rate Predictor:
| Feature | Value |
| --- | --- |
| embedding (vector) | [0.11, -0.04, …, 0.07] (384 dims) |
| word_count | 6 |
| char_count | 48 |
| has_percent | 1 |
| has_number | 1 |
| has_urgency | 1 |
| is_personalized | 0 |
| send_hour | 10 |
| campaign_type_promo | 1 |
| audience_segment_loyal | 1 |
Below is the code to transform raw text and metadata into meaningful features for both evaluators.
Open Rate Predictor (for Subject Lines)
Add Text Embeddings
Use a transformer model to convert subject lines into vectors that capture semantic meaning.
from sentence_transformers import SentenceTransformer

# Encode each subject line into a 384-dimensional vector that captures semantic meaning.
model = SentenceTransformer('all-MiniLM-L6-v2')
# Wrap in list() so each embedding is stored as a single cell in the DataFrame.
df['embeddings'] = list(model.encode(df['subject_line'].tolist()))
Add Text-Derived Features
# Simple surface features derived from the subject line text.
df['word_count'] = df['subject_line'].str.split().str.len()
df['char_count'] = df['subject_line'].str.len()
df['has_percent'] = df['subject_line'].str.contains('%').astype(int)
df['has_number'] = df['subject_line'].str.contains(r'\d').astype(int)
df['has_urgency'] = df['subject_line'].str.contains(r'now|today|urgent|final', case=False).astype(int)
# Flag subject lines that contain a literal personalization placeholder such as "[First Name]".
df['is_personalized'] = df['subject_line'].str.contains(r'\[First Name\]', case=False).astype(int)
Add Metadata Features
# Assume these columns already exist in your DataFrame
df = pd.get_dummies(df, columns=['campaign_type', 'audience_segment'], drop_first=True)
Prepare the Final Feature Set
The final feature set combines:
- Embedding vectors (typically shape: N x 384)
- Scalar and binary features
Store them as a single NumPy array:
import numpy as np

# One-hot columns created by pd.get_dummies above.
dummy_cols = list(df.columns[df.columns.str.startswith('campaign_type_') | df.columns.str.startswith('audience_segment_')])

# Stack the embedding vectors alongside the scalar, binary, and one-hot features.
X = np.hstack([np.vstack(df['embeddings']),
               df[['word_count', 'char_count', 'has_percent', 'has_number', 'has_urgency',
                   'is_personalized', 'send_hour'] + dummy_cols].astype(float).values])
y = df['open_rate']
This final matrix X is used for training. For offline training, you can simply store your features in:
- .pkl (Pickle) files: ideal for fast local reloading in Python
- .csv files: good for portability and manual inspection
- .parquet or .feather files: efficient for larger datasets and DataFrame-based pipelines
CTR Predictor (for Email Body Copy)
The outline for this second recipe (the CTR Predictor) mirrors the first (the Open Rate Predictor): instead of the email subject line you use the email body, and instead of open rate you use the email click-through rate as the target.
Add Text Embeddings for Email Body
# Reuse the same SentenceTransformer model to embed the email body.
df['embeddings'] = list(model.encode(df['email_body'].tolist()))
Add Structure-Based Features
import re

# Structural signals from the email body: length, number of links, and CTA phrases.
df['word_count'] = df['email_body'].str.split().str.len()
df['link_count'] = df['email_body'].str.count('http')
df['cta_count'] = df['email_body'].str.count(r'buy now|shop now|learn more', flags=re.IGNORECASE)
Add Sentiment and Signals
from textblob import TextBlob

# Sentiment polarity ranges from -1 (negative) to +1 (positive).
df['sentiment'] = df['email_body'].apply(lambda x: TextBlob(x).sentiment.polarity)
df['has_urgency'] = df['email_body'].str.contains(r'limited time|hurry|act now|final', case=False).astype(int)
df = pd.get_dummies(df, columns=['cta_type'], drop_first=True)
Assemble the Final Feature Set
# Stack body embeddings alongside the structural, sentiment, and one-hot CTA features.
X = np.hstack([
    np.vstack(df['embeddings']),
    df[['word_count', 'link_count', 'cta_count', 'sentiment', 'has_urgency']
       + list(df.columns[df.columns.str.startswith('cta_type_')])].astype(float).values
])
y = df['click_through_rate']
Store the final X and y for reuse and training. You can persist the preprocessed features with joblib, pickle, or Parquet.
Step 3: Model Training, Evaluation, and Use
With the feature matrices X and labels y prepared, you can now train your predictive models using a supervised learning approach.
Train the Model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hold out 20% of the historical campaigns for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Evaluate the Model
Use standard regression metrics:
- R² (coefficient of determination): how much variance your model explains
- MAE (mean absolute error): average difference between predicted and actual values
- RMSE (root mean squared error): penalizes large errors
Visual diagnostics (like scatterplots of predicted vs. actual) can also help.
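A quick sketch of computing these metrics with scikit-learn on the held-out test set:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Compare held-out actuals against the model's predictions.
r2 = r2_score(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"R²: {r2:.3f}  MAE: {mae:.4f}  RMSE: {rmse:.4f}")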
Score New Content
Once trained, your models can be used to evaluate new LLM-generated content:
- Feed each subject line into the Open Rate model
- Feed each body copy into the CTR model
You can then compute a composite score, such as a weighted combination of the two predictions, to rank the outputs; a minimal sketch follows below. Doing this for every LLM contestant produces an evaluation matrix that gives you an objective basis for choosing which model to use.
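Here is an illustrative sketch of that scoring step. The equal 0.5/0.5 weighting, the model names (open_rate_model, ctr_model), and the feature-building helpers are assumptions; in practice you would reuse the exact feature pipeline from Step 2 on the new content.
# Illustrative scoring sketch: names, helpers, and weights are assumptions, not prescriptions.
def composite_score(subject_line, email_body, open_rate_model, ctr_model,
                    build_subject_features, build_body_features,
                    w_open=0.5, w_ctr=0.5):
    # Reuse the same feature pipeline from Step 2 on the new LLM-generated content.
    subject_features = build_subject_features(subject_line)   # shape: (1, n_subject_features)
    body_features = build_body_features(email_body)           # shape: (1, n_body_features)

    predicted_open_rate = open_rate_model.predict(subject_features)[0]
    predicted_ctr = ctr_model.predict(body_features)[0]

    # Weighted combination used to rank candidate LLM outputs.
    return w_open * predicted_open_rate + w_ctr * predicted_ctr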