
Evaluating LLM Output Using an AI-Powered Evaluator Model

Introduction

Our introductory post on measuring LLM performance explored how AI-based evaluation methods can help assess generative AI output using business-aligned proxy metrics—predictive scores tied to real-world outcomes like email open rates for specific LLM outputs.

In this post, we take a more hands-on approach and walk through the technical details of implementation.

You’ll see how to build two predictive models designed to evaluate the effectiveness of different LLMs in generating marketing emails:

  • A Subject Line Open Rate Predictor
  • A Click-Through Rate (CTR) Predictor for the email body

These evaluation models are the core of a business-aligned, outcome-oriented evaluation pipeline for fine-tuned LLMs.

How LLMs and Evaluation Models Work Together

To avoid confusion, it’s important to distinguish between what goes into your LLM and what goes into your evaluation models:

  • The LLM takes in a natural language prompt, such as:

“Write a promotional email for 20% off on eco-friendly water bottles. Include a subject line and a short persuasive body.”

It outputs a subject line and body text.

  • The evaluation models do not receive prompts. Instead, they ingest structured data extracted from the LLM’s outputs—things like word counts, text embeddings, flags (e.g., “contains discount”), and metadata (e.g., campaign type).

In short, LLMs generate content. Evaluation models score it.

Ingredients: What You’ll Need

Data Sources

  • Historical email campaigns (subject lines, bodies, open rates, CTRs)

Tools

  • Python, Pandas, scikit-learn
  • SentenceTransformer (for embeddings)
  • Optional: AutoML platforms like Google Vertex AI or DataRobot

Step 1: Getting the Data (If You Use Salesforce Marketing Cloud)

If your email campaigns were sent using Salesforce Marketing Cloud (SFMC), you can extract the necessary training data in a few ways:

  • Data Extract via Automation Studio: Use a Data Extract Activity to pull email tracking data (such as subject lines, opens, and clicks). This can be scheduled as part of an automation and exported to Enhanced FTP.
  • REST or SOAP API: Use SFMC’s Tracking API to retrieve send results, opens, and click-throughs programmatically.

You’ll want to collect:

  • Subject lines
  • Email body copy (or references to content blocks)
  • Open and click-through rates
  • Metadata such as send time, audience segment, and campaign category

This becomes the foundation of your evaluation models.

Once extracted, your raw data needs to be stored in a secure and accessible location for further processing. In most practical setups, this means one of the following:

  • Cloud storage (e.g., Amazon S3, Google Cloud Storage, Azure Blob): scalable and integrates well with ML pipelines.
  • Data warehouse (e.g., Snowflake, BigQuery, Redshift): ideal for structured querying, joins with other business data, and large-scale reporting.
  • Local file system or shared drive: acceptable for small teams or prototyping.

From there, you can read the raw data into a Python or notebook environment (via Pandas or Spark), and begin your feature extraction and model training workflows.
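
As a minimal sketch, assuming the extract landed as a flat CSV file with one row per campaign (the file name and column names below are illustrative, not SFMC's export schema), this also sets up the df DataFrame used throughout Step 2:

import pandas as pd

# Load the exported tracking data (file name and columns are illustrative)
df = pd.read_csv('email_campaign_history.csv')

# Columns expected by the feature-engineering steps in Step 2:
#   subject_line, email_body, open_rate, click_through_rate,
#   send_hour, campaign_type, audience_segment, cta_type
print(df.head())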

To prepare the training dataset, you might need to implement some or all of the steps outlined in our data preparation framework for generative AI models.

Step 2: Prepare the Feature Sets

A Note About Feature Engineering

Before a machine learning model can make accurate predictions, it needs structured, numerical input: this is what “features” are. We extract features from raw email text and metadata so that the model can learn the patterns that influence open rates or click-through rates. For example, whether a subject line contains urgency words, includes a discount, or is sent at a particular time can all affect engagement. Feature engineering converts human-readable content into model-friendly formats like embeddings, binary indicators, and numeric counts.

Here is an example of a single row of feature data for the Open Rate Predictor:

Feature                    Value
Embedding (vector)         [0.11, -0.04, …, 0.07] (384 dims)
word_count                 6
char_count                 48
has_percent                1
has_number                 1
has_urgency                1
is_personalized            0
send_hour                  10
campaign_type_promo        1
audience_segment_loyal     1

The full dataset is a matrix of many such rows, each representing one LLM-generated (or historical) email entry.

Below is the code to transform raw text and metadata into meaningful features for both evaluators.

Open Rate Predictor (for Subject Lines)

Add Text Embeddings

Use a transformer model to convert subject lines into vectors that capture semantic meaning.

				
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
# Store one embedding vector per row (encode() returns a 2-D array, so wrap it in a list)
df['embeddings'] = list(model.encode(df['subject_line'].tolist()))

Add Text-Derived Features

				
# Simple surface features derived from the subject line text
df['word_count'] = df['subject_line'].str.split().str.len()
df['char_count'] = df['subject_line'].str.len()
df['has_percent'] = df['subject_line'].str.contains('%').astype(int)
df['has_number'] = df['subject_line'].str.contains(r'\d').astype(int)
df['has_urgency'] = df['subject_line'].str.contains(r'now|today|urgent|final', case=False).astype(int)
df['is_personalized'] = df['subject_line'].str.contains(r'\[First Name\]', case=False).astype(int)

Add Metadata Features

				
# Assume these columns already exist in your DataFrame
df = pd.get_dummies(df, columns=['campaign_type', 'audience_segment'], drop_first=True)

Prepare the Final Feature Set

  • Embedding vectors (typically shape: N x 384)

  • Scalar and binary features

Combine them into a single NumPy array:

				
import numpy as np

# Gather the one-hot columns created by get_dummies, then stack embeddings and scalar features
dummy_cols = [c for c in df.columns if c.startswith('campaign_type_') or c.startswith('audience_segment_')]
scalar_cols = ['word_count', 'char_count', 'has_percent', 'has_number', 'has_urgency', 'is_personalized', 'send_hour']
X = np.hstack([np.vstack(df['embeddings'].tolist()), df[scalar_cols + dummy_cols].astype(float).values])
y = df['open_rate']

This final matrix X is used for training. For offline training, you can simply store your features in:

  • .pkl (Pickle) files: ideal for fast local reloading in Python

  • .csv files: good for portability and manual inspection

  • .parquet or .feather: efficient for larger datasets and DataFrame-based pipelines
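
For instance, a minimal pickle sketch (assuming the X and y built above; the file name is a placeholder):

import pickle

# Persist the engineered features and labels for later training runs
with open('open_rate_features.pkl', 'wb') as f:
    pickle.dump({'X': X, 'y': y}, f)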

CTR Predictor (for Email Body Copy)

The conceptual outline for recipe 2 mirrors recipe 1: instead of the email subject line, you will use the email body, and instead of open rate as the target, you will use click-through rate.

Add Text Embeddings for Email Body

				
# Reuse the SentenceTransformer model loaded earlier for the subject lines
df['embeddings'] = list(model.encode(df['email_body'].tolist()))

Add Structure-Based Features

				
import re

# Basic structural signals from the email body
df['word_count'] = df['email_body'].str.split().str.len()
df['link_count'] = df['email_body'].str.count('http')
df['cta_count'] = df['email_body'].str.count(r'buy now|shop now|learn more', flags=re.IGNORECASE)

Add Sentiment and Signals

				
from textblob import TextBlob

# Sentiment polarity plus urgency and CTA-type indicators
df['sentiment'] = df['email_body'].apply(lambda x: TextBlob(x).sentiment.polarity)
df['has_urgency'] = df['email_body'].str.contains(r'limited time|hurry|act now|final', case=False).astype(int)
df = pd.get_dummies(df, columns=['cta_type'], drop_first=True)

Assemble the Final Feature Set

				
X = np.hstack([
    np.vstack(df['embeddings'].tolist()),
    df[['word_count', 'link_count', 'cta_count', 'sentiment', 'has_urgency']
       + [c for c in df.columns if c.startswith('cta_type_')]].astype(float).values
])
y = df['click_through_rate']

Store the final X and y for reuse and training. You can persist the preprocessed features with joblib, pickle, or Parquet.
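
A comparable joblib sketch (again, the file name is a placeholder):

import joblib

# Save the CTR feature matrix and labels together, then reload them for training
joblib.dump({'X': X, 'y': y}, 'ctr_features.joblib')
data = joblib.load('ctr_features.joblib')
X, y = data['X'], data['y']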

Step 3: Model Training, Evaluation, and Use

With the feature matrices X and labels y prepared, you can now train your predictive models using a supervised learning approach.

Train the Model

				
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hold out 20% of the historical campaigns for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Evaluate the Model

Use standard regression metrics:

  • R² (coefficient of determination): how much variance your model explains

  • MAE (mean absolute error): average difference between predicted and actual values

  • RMSE (root mean squared error): penalizes large errors

Visual diagnostics (like scatterplots of predicted vs. actual) can also help.
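
For example, with scikit-learn (using the y_test and predictions arrays from the training step):

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

r2 = r2_score(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))  # version-agnostic RMSE
print(f"R²: {r2:.3f} | MAE: {mae:.4f} | RMSE: {rmse:.4f}")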

Score New Content

Once trained, your models can be used to evaluate new LLM-generated content:

  • Feed each subject line into the Open Rate model

  • Feed each body copy into the CTR model

You can then compute a composite score to rank the outputs:

final_score = 0.4 * predicted_open_rate + 0.6 * predicted_ctr

Doing this for every LLM contestant produces an evaluation matrix that gives you an objective basis for deciding which model to ultimately select.
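
To make the ranking concrete, here is a small self-contained sketch. The predicted open rates and CTRs below are made-up illustrative numbers standing in for the outputs of the two evaluator models:

def composite_score(predicted_open_rate, predicted_ctr):
    # Weights mirror the formula above; adjust them to your business priorities
    return 0.4 * predicted_open_rate + 0.6 * predicted_ctr

# Hypothetical predictions for three LLM contestants (illustrative values only)
candidate_scores = {
    'llm_a': composite_score(0.31, 0.042),
    'llm_b': composite_score(0.27, 0.051),
    'llm_c': composite_score(0.35, 0.038),
}

# Rank contestants by composite score, highest first
ranking = sorted(candidate_scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)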