Designing the data processing layer for LLM repositories is where raw data transforms into training-ready datasets. From an enterprise architecture perspective, data processing is not just about choosing Spark over Flink or AWS Glue—it’s a comprehensive building block that encompasses transformation logic, quality validation, orchestration patterns, and operational resilience.
This article is the second in a three-part series proposing a blueprint for designing enterprise-grade LLM data repositories. Following our exploration of the Data Storage building block, this installment focuses on the Data Processing building block, detailing its architecture components and enterprise-level considerations.
Primer: What is an LLM Data Repository?
An LLM data repository is a specialized infrastructure built to support the training, fine-tuning, and inference of large language models. Unlike traditional data warehouses that handle mainly analytics queries, LLM repositories must accommodate:
- Diverse data formats (text, structured data, multimedia)
- Large-scale feature engineering pipelines
- Both batch training data and real-time inference workloads
While this post focuses on LLMs, the architectural blueprint applies broadly to other AI use cases, including computer vision, text-to-speech, and multimodal systems.

Data Processing: The Transformation Engine for AI Workloads
The Data Processing building block handles the ingestion, transformation, validation, and preparation of data for AI workloads. Enterprise Architecture for this block goes far beyond choosing Apache Spark, AWS Glue, or Airflow. While selecting processing engines is important, it’s only part of the equation. Enterprise Architects must design the complete data processing ecosystem including:
- Processing architecture patterns (batch, stream, lambda, kappa)
- Data transformation and validation logic
- Quality gates and error handling strategies
- Orchestration and scheduling frameworks
- Monitoring and alerting systems
- Cost optimization and resource management strategies
These definitions form the enterprise blueprint that ensures data processing systems are scalable, reliable, and aligned with business objectives.
Dissecting Data Processing: 10 Sub-blocks
Here’s a practical look at the sub-blocks within Data Processing and the specific deliverables that EAs must generate when rolling out global AI initiatives:
| Component to Architect | What the Enterprise Architect Must Define |
| --- | --- |
| Processing Patterns | Establish enterprise-wide standards and technologies for how to use batch, stream, and real-time data processing across all AI initiatives |
| Data Ingestion Frameworks | Define approved ingestion technologies, integration patterns, and governance policies for connecting to enterprise data sources |
| Transformation Logic & Engines | Set standards for transformation frameworks, code deployment practices, and reusable transformation libraries across teams |
| Data Quality Validation | Create enterprise data quality frameworks, validation rule taxonomies, and quality gate policies for all processing pipelines |
| Workflow Orchestration | Establish orchestration platform standards, workflow design principles, and cross-team dependency management protocols |
| Error Handling & Recovery | Define enterprise-wide error handling patterns, escalation procedures, and business continuity requirements for data processing |
| Resource Management | Set cost governance policies, resource allocation frameworks, and performance SLA standards across all processing workloads |
| Processing Monitoring | Establish monitoring standards, alerting hierarchies, and performance metrics that apply to all enterprise data processing initiatives |
| Schema Evolution | Create schema governance policies, versioning standards, and backward compatibility requirements for enterprise data contracts |
| Compliance & Auditing | Define audit trail requirements, regulatory compliance frameworks, and data lineage standards for all processing activities |
Note the framing: each row shifts from “what the component does” to “what frameworks, standards, and policies the EA must establish” to govern how teams implement these components across the entire enterprise.
From Components to Blueprint: A Practical Example
Identifying the components of the Data Processing building block is only the first step. Enterprise architects must define how each component works individually and how the components combine to create data transformation pipelines.
Take the first sub-block, processing patterns, as an example. To illustrate what Enterprise Architecture involves (and how it differs from project-level solution architecture), the next section shows how to architect this sub-block in practice, along with specific deliverables and examples.
How to Architect Processing Patterns Sub-block
Most AI use cases rely on three data processing patterns: batch, streaming, and hybrid. Enterprise Architects must decide how these patterns should be implemented for specific business requirements, performance goals, and operational constraints.
While there is no cookie-cutter approach that can be applied as-is to all business scenarios, the following template provides a generic framework for due diligence.
Batch Processing
Standard Pattern:

Detailed Architecture Components:
- Ingestion: Scheduled extracts via APIs, SFTP, or database replication
- Storage Zones: Bronze (raw) → Silver (cleaned) → Gold (business-ready)
- Processing Engine:
- Parallel data processing must be implemented using Databricks jobs
- For tasks that require access to storage data, projects must use AWS Glue jobs written in PySpark
- For tasks that do not require accessing storage, all processing/transformation logic must be implemented in Python/PySpark and deployed via AWS ECS tasks/services
- Orchestration: A dedicated orchestration engine that sequences the data transformation steps
Approved Technology Stack:
- Primary: PySpark
- Cloud Options (AWS): EMR Serverless, Step Functions, ECS, S3, Glue
- Orchestration: AWS Step Functions
Performance Standards:
- Throughput: Minimum 1TB/hour processing capability
- SLA Framework:
- Daily batch jobs: Complete within an 8-hour window
- Weekly aggregations: Complete within a 24-hour window
- Historical reprocessing: Complete within a 72-hour window
- Resource Efficiency: Target 70%+ cluster utilization during processing windows
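To make the batch pattern concrete, here is a minimal PySpark sketch of a Bronze → Silver → Gold job. The S3 paths, column names, and deduplication rule are illustrative assumptions rather than part of the reference architecture; the same structure can run as an AWS Glue job or an EMR Serverless job under the approved stack.

```python
from pyspark.sql import SparkSession, functions as F

# Illustrative zone paths -- replace with the governed bucket layout for your domain.
BRONZE = "s3://example-datalake/bronze/documents/"
SILVER = "s3://example-datalake/silver/documents/"
GOLD = "s3://example-datalake/gold/document_stats/"

spark = SparkSession.builder.appName("batch-bronze-silver-gold").getOrCreate()

# Bronze -> Silver: parse raw JSON extracts, drop rows without a key, deduplicate.
raw = spark.read.json(BRONZE)
clean = (
    raw.filter(F.col("document_id").isNotNull())   # assumed key column
       .withColumn("ingested_at", F.current_timestamp())
       .dropDuplicates(["document_id"])
)
clean.write.mode("overwrite").parquet(SILVER)

# Silver -> Gold: a business-ready aggregate (documents per source system per day).
gold = (
    clean.groupBy("source_system", F.to_date("ingested_at").alias("ingest_date"))
         .agg(F.count("*").alias("document_count"))
)
gold.write.mode("overwrite").partitionBy("ingest_date").parquet(GOLD)

spark.stop()
```

In practice, the output format and partitioning would follow the Delta Lake and schema governance standards defined elsewhere in the blueprint, and the job would be triggered by the Step Functions orchestration layer rather than run ad hoc.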
Stream Processing
Standard Pattern:
Detailed Architecture Components:
- Message Broker: Kafka clusters with 3+ brokers, replication factor 3
- Stream Processing: Flink clusters with checkpointing to persistent storage
- State Management: RocksDB for local state, S3/ADLS for checkpoints
- Output Sinks: Real-time feature store, operational databases, downstream Kafka topics
- Schema Evolution: Confluent Schema Registry with backward compatibility
Approved Technology Stack:
- Primary: Apache Kafka 3.5+, Apache Flink 1.18+, and RocksDB
- Cloud Options:
- AWS: MSK + Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) + DynamoDB/S3
- Monitoring: Prometheus + Grafana for metrics, Jaeger for tracing
Performance Standards:
- Latency:
- P95 processing latency < 100ms
- P99 processing latency < 500ms
- Throughput: Minimum 100,000 events/second per cluster
- Availability: 99.9% uptime with automatic failover
- Backpressure Handling: Automatic scaling triggers at 80% capacity
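As a sketch of the streaming pattern, the PyFlink job below reads from a Kafka topic with checkpointing enabled, applies a toy enrichment step, and writes to a downstream topic. The broker address, topic names, and enrichment logic are assumptions for illustration, and the Kafka connector JAR must be available on the Flink classpath (for example via env.add_jars).

```python
import json

from pyflink.common import Types, WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaOffsetsInitializer,
    KafkaRecordSerializationSchema,
    KafkaSink,
    KafkaSource,
)

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60 s to the configured state backend
# NOTE: the flink-connector-kafka JAR must be on the classpath, e.g. via env.add_jars(...)

source = (
    KafkaSource.builder()
    .set_bootstrap_servers("kafka-broker:9092")   # illustrative broker address
    .set_topics("raw-events")                     # illustrative source topic
    .set_group_id("llm-stream-enrichment")
    .set_starting_offsets(KafkaOffsetsInitializer.latest())
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)

sink = (
    KafkaSink.builder()
    .set_bootstrap_servers("kafka-broker:9092")
    .set_record_serializer(
        KafkaRecordSerializationSchema.builder()
        .set_topic("enriched-events")             # illustrative downstream topic
        .set_value_serialization_schema(SimpleStringSchema())
        .build()
    )
    .build()
)


def enrich(raw: str) -> str:
    """Toy enrichment: tag each JSON event with the length of its text field."""
    event = json.loads(raw)
    event["text_length"] = len(event.get("text", ""))
    return json.dumps(event)


stream = env.from_source(source, WatermarkStrategy.no_watermarks(), "kafka-raw-events")
stream.map(enrich, output_type=Types.STRING()).sink_to(sink)

env.execute("stream-enrichment-job")
```

The RocksDB state backend and S3/ADLS checkpoint storage called out above would be set in the cluster configuration rather than in the job code.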
Hybrid Processing
This pattern runs parallel batch and streaming processing systems that eventually converge at a serving layer to provide a complete view of the data.
Standard Pattern:

Detailed Architecture Components:
- Batch Layer (green) for historical truth and high accuracy
- Speed Layer (orange) for real-time delta and low latency
- Serving Layer (purple) that unifies both views
- Applications consuming the unified interface
Approved Technology Stack:
- Storage Format: Delta Lake (preferred)
- Processing:
- Batch: PySpark/Spark with Delta Lake integration
- Stream: Flink with Delta Lake sink connector
- Serving:
- Real-time: Redis/DynamoDB for sub-second queries
- Analytical: Snowflake/BigQuery for complex analytics
- Orchestration: AWS Step Functions
Performance Standards:
- Consistency: Eventual consistency within 15 minutes for the Lambda pattern, immediate for Kappa
- Query Performance:
- Real-time queries: < 10ms P95 latency
- Analytical queries: < 30 seconds for typical dashboards
- Storage Efficiency: < 30% storage overhead for maintaining both batch and stream views
- Recovery Time: Full system recovery within 4 hours using replay mechanisms
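To show how the speed layer and batch layer can converge on one serving table, here is a hedged PySpark Structured Streaming sketch that upserts streaming micro-batches into a Delta table maintained by the batch layer. The table path, join key, and Kafka topic are illustrative assumptions, and the delta-spark and spark-sql-kafka packages are assumed to be on the classpath.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

# Illustrative locations -- replace with your governed lakehouse layout.
SERVING_TABLE = "s3://example-datalake/serving/events_delta/"
KAFKA_BOOTSTRAP = "kafka-broker:9092"

spark = (
    SparkSession.builder.appName("speed-layer-delta-upsert")
    # Assumes the delta-spark and spark-sql-kafka packages are on the classpath.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)


def upsert_to_delta(micro_batch_df, batch_id):
    """Merge one streaming micro-batch into the serving Delta table (created by the batch layer)."""
    target = DeltaTable.forPath(spark, SERVING_TABLE)
    (
        target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.event_id = s.event_id")  # assumed join key
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


# Speed layer: read the enriched Kafka stream and upsert it into the Delta serving table.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP)
    .option("subscribe", "enriched-events")        # illustrative topic name
    .load()
    .select(
        F.col("key").cast("string").alias("event_id"),
        F.col("value").cast("string").alias("payload"),
        F.col("timestamp").alias("event_time"),
    )
)

(
    events.writeStream
    .foreachBatch(upsert_to_delta)
    .option("checkpointLocation", SERVING_TABLE + "_checkpoints/")
    .start()
    .awaitTermination()
)
```

The batch layer would write to the same Delta table on its own schedule, so downstream consumers query a single unified view instead of stitching the two layers together themselves.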
These reference architectures provide the foundational framework that all data processing initiatives must follow. Project teams have flexibility in implementation details—such as specific Spark configurations, Kafka topic structures, or Delta Lake partitioning strategies—as long as they adhere to the approved technology stacks and performance standards outlined above.
Full-scale Blueprint
Enterprise Architects should apply this same methodology to define all remaining sub-blocks—including Orchestration, Error Handling, Resource Management, and others. The primary goal is establishing enterprise-wide consistency through standardized technology frameworks and implementation patterns.
This approach delivers three critical outcomes: project teams leverage proven, reusable components rather than building custom solutions from scratch; regulatory compliance and audit capabilities become embedded by design across all initiatives; and the organization benefits from a unified technology stack that reduces operational complexity and enables knowledge sharing across teams.
Next Up
This post outlined the Data Processing building block as part of our broader enterprise blueprint for LLM data repositories. In the third and final post of this series, we’ll explore Feature Management: the building block that delivers ML-ready features reliably and efficiently to both training and inference workloads.
By the end of this series, you’ll have a comprehensive, enterprise-level blueprint for designing LLM data repositories that are scalable, reliable, and business-aligned.
Let's Architect This Together
Ready to design processing architectures that can scale with your AI ambitions? Or need help selecting the right processing patterns for your specific use cases?
We'd love to help. Book a free discovery call to discuss your specific data processing challenges and explore how our team can support your enterprise architecture journey.
You can also explore our full suite of AI for Marketing services for insights and solutions across data architecture, LLM implementation, and advanced analytics.
Get in touch