Designing the data processing layer for LLM repositories is where raw data transforms into training-ready datasets. From an enterprise architecture perspective, data processing is not just about choosing Spark over Flink or AWS Glue—it’s a comprehensive building block that encompasses transformation logic, quality validation, orchestration patterns, and operational resilience.
This article is the second in a three-part series proposing a blueprint for designing enterprise-grade LLM data repositories. Following our exploration of the Data Storage building block, this installment focuses on the Data Processing building block, detailing its architecture components and enterprise-level considerations.
Primer: What is an LLM Data Repository?
An LLM data repository is a specialized infrastructure built to support the training, fine-tuning, and inference of large language models. Unlike traditional data warehouses that handle mainly analytics queries, LLM repositories must accommodate:
- Diverse data formats (text, structured data, multimedia)
- Large-scale feature engineering pipelines
- Both batch training data and real-time inference workloads
While this post focuses on LLMs, the architectural blueprint applies broadly to other AI use cases, including computer vision, text-to-speech, and multimodal systems.

Data Processing: The Transformation Engine for AI Workloads
The Data Processing building block handles the ingestion, transformation, validation, and preparation of data for AI workloads. Enterprise Architecture for this block goes far beyond choosing Apache Spark, AWS Glue, or Airflow. While selecting processing engines is important, it’s only part of the equation. Enterprise Architects must design the complete data processing ecosystem including:
- Processing architecture patterns (batch, stream, lambda, kappa)
- Data transformation and validation logic
- Quality gates and error handling strategies
- Orchestration and scheduling frameworks
- Monitoring and alerting systems
- Cost optimization and resource management strategies
These definitions form the enterprise blueprint that ensures data processing systems are scalable, reliable, and aligned with business objectives.
Dissecting Data Processing: 10 Sub-blocks
Here’s a practical look at the sub-blocks within Data Processing and the specific deliverables that EAs must generate when rolling out global AI initiatives:
| Component to Architect | What the Enterprise Architect Must Define |
| --- | --- |
| Processing Patterns | Establish enterprise-wide standards and technologies for how to use batch, stream, and real-time data processing across all AI initiatives |
| Data Ingestion Frameworks | Define approved ingestion technologies, integration patterns, and governance policies for connecting to enterprise data sources |
| Transformation Logic & Engines | Set standards for transformation frameworks, code deployment practices, and reusable transformation libraries across teams |
| Data Quality Validation | Create enterprise data quality frameworks, validation rule taxonomies, and quality gate policies for all processing pipelines |
| Workflow Orchestration | Establish orchestration platform standards, workflow design principles, and cross-team dependency management protocols |
| Error Handling & Recovery | Define enterprise-wide error handling patterns, escalation procedures, and business continuity requirements for data processing |
| Resource Management | Set cost governance policies, resource allocation frameworks, and performance SLA standards across all processing workloads |
| Processing Monitoring | Establish monitoring standards, alerting hierarchies, and performance metrics that apply to all enterprise data processing initiatives |
| Schema Evolution | Create schema governance policies, versioning standards, and backward compatibility requirements for enterprise data contracts |
| Compliance & Auditing | Define audit trail requirements, regulatory compliance frameworks, and data lineage standards for all processing activities |
Note the framing: each row shifts from “what the component does” to “what frameworks, standards, and policies the EA must establish” to govern how teams implement these components across the entire enterprise.
From Components to Blueprint: A Practical Example
Identifying the components of the Data Processing building block is only the first step. Enterprise architects must define how each component works individually and how the components combine to create data transformation pipelines.
Take the first sub-block, processing patterns, as an example. To illustrate what Enterprise Architecture involves (and how it differs from project-level solution architecture), the next section shows how to architect this sub-block in practice, along with specific deliverables and examples.
How to Architect Processing Patterns Sub-block
Most AI use cases rely on three data processing patterns: batch, streaming, and hybrid. Enterprise Architects must decide how these patterns should be implemented for specific business requirements, performance goals, and operational constraints.
While there is no cookie-cutter approach that can be applied as-is to all business scenarios, the following template provides a generic framework for due diligence.
Batch Processing
Standard Pattern:

Detailed Architecture Components:
- Ingestion: Scheduled extracts via APIs, SFTP, or database replication
- Storage Zones: Bronze (raw) → Silver (cleaned) → Gold (business-ready)
- Processing Engine:
- Parallel data processing must be implemented using Databricks jobs
- For tasks that require access to storage data, projects must use AWS Glue jobs written in PySpark
- For tasks that do not require accessing storage, all processing/transformation logic must be implemented in Python/PySpark and deployed via AWS ECS tasks/services
- Orchestration: A dedicated orchestration engine that sequences the data transformation steps
Approved Technology Stack:
- Primary: PySpark
- Cloud Options (AWS): EMR Serverless, Step Functions, ECS, S3, Glue
- Orchestration: AWS Step Functions
Performance Standards:
- Throughput: Minimum 1TB/hour processing capability
- SLA Framework:
- Daily batch jobs: Complete within an 8-hour window
- Weekly aggregations: Complete within a 24-hour window
- Historical reprocessing: Complete within a 72-hour window
- Resource Efficiency: Target 70%+ cluster utilization during processing windows
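To make the batch pattern concrete, here is a minimal PySpark sketch of a Bronze → Silver → Gold job. The S3 paths, column names, and deduplication rule are illustrative assumptions rather than part of the reference architecture; the same structure can run as an AWS Glue job or an EMR Serverless job under the approved stack.

```python
from pyspark.sql import SparkSession, functions as F

# Illustrative zone paths -- replace with the governed bucket layout for your domain.
BRONZE = "s3://example-datalake/bronze/documents/"
SILVER = "s3://example-datalake/silver/documents/"
GOLD = "s3://example-datalake/gold/document_stats/"

spark = SparkSession.builder.appName("batch-bronze-silver-gold").getOrCreate()

# Bronze -> Silver: parse raw JSON extracts, drop rows without a key, deduplicate.
raw = spark.read.json(BRONZE)
clean = (
    raw.filter(F.col("document_id").isNotNull())   # assumed key column
       .withColumn("ingested_at", F.current_timestamp())
       .dropDuplicates(["document_id"])
)
clean.write.mode("overwrite").parquet(SILVER)

# Silver -> Gold: a business-ready aggregate (documents per source system per day).
gold = (
    clean.groupBy("source_system", F.to_date("ingested_at").alias("ingest_date"))
         .agg(F.count("*").alias("document_count"))
)
gold.write.mode("overwrite").partitionBy("ingest_date").parquet(GOLD)

spark.stop()
```

In practice, the output format and partitioning would follow the Delta Lake and schema governance standards defined elsewhere in the blueprint, and the job would be triggered by the Step Functions orchestration layer rather than run ad hoc.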
Stream Processing
Standard Pattern:
Detailed Architecture Components:
- Message Broker: Kafka clusters with 3+ brokers, replication factor 3
- Stream Processing: Flink clusters with checkpointing to persistent storage
- State Management: RocksDB for local state, S3/ADLS for checkpoints
- Output Sinks: Real-time feature store, operational databases, downstream Kafka topics
- Schema Evolution: Confluent Schema Registry with backward compatibility
Approved Technology Stack:
- Primary: Apache Kafka 3.5+, Apache Flink 1.18+, and RocksDB
- Cloud Options:
- AWS: MSK + Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) + DynamoDB/S3
- Monitoring: Prometheus + Grafana for metrics, Jaeger for tracing
Performance Standards:
- Latency:
- P95 processing latency < 100ms
- P99 processing latency < 500ms
- Throughput: Minimum 100,000 events/second per cluster
- Availability: 99.9% uptime with automatic failover
- Backpressure Handling: Automatic scaling triggers at 80% capacity
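As a sketch of the streaming pattern, the PyFlink job below reads from a Kafka topic with checkpointing enabled, applies a toy enrichment step, and writes to a downstream topic. The broker address, topic names, and enrichment logic are assumptions for illustration, and the Kafka connector JAR must be available on the Flink classpath (for example via env.add_jars).

```python
import json

from pyflink.common import Types, WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaOffsetsInitializer,
    KafkaRecordSerializationSchema,
    KafkaSink,
    KafkaSource,
)

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60 s to the configured state backend
# NOTE: the flink-connector-kafka JAR must be on the classpath, e.g. via env.add_jars(...)

source = (
    KafkaSource.builder()
    .set_bootstrap_servers("kafka-broker:9092")   # illustrative broker address
    .set_topics("raw-events")                     # illustrative source topic
    .set_group_id("llm-stream-enrichment")
    .set_starting_offsets(KafkaOffsetsInitializer.latest())
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)

sink = (
    KafkaSink.builder()
    .set_bootstrap_servers("kafka-broker:9092")
    .set_record_serializer(
        KafkaRecordSerializationSchema.builder()
        .set_topic("enriched-events")             # illustrative downstream topic
        .set_value_serialization_schema(SimpleStringSchema())
        .build()
    )
    .build()
)


def enrich(raw: str) -> str:
    """Toy enrichment: tag each JSON event with the length of its text field."""
    event = json.loads(raw)
    event["text_length"] = len(event.get("text", ""))
    return json.dumps(event)


stream = env.from_source(source, WatermarkStrategy.no_watermarks(), "kafka-raw-events")
stream.map(enrich, output_type=Types.STRING()).sink_to(sink)

env.execute("stream-enrichment-job")
```

The RocksDB state backend and S3/ADLS checkpoint storage called out above would be set in the cluster configuration rather than in the job code.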
Hybrid Processing
This pattern runs parallel batch and streaming processing systems that eventually converge at a serving layer to provide a complete view of the data.
Standard Pattern:

Detailed Architecture Components:
- Batch Layer (green) for historical truth and high accuracy
- Speed Layer (orange) for real-time delta and low latency
- Serving Layer (purple) that unifies both views
- Applications consuming the unified interface
Approved Technology Stack:
- Storage Format: Delta Lake (preferred)
- Processing:
- Batch: PySpark/Spark with Delta Lake integration
- Stream: Flink with Delta Lake sink connector
- Serving:
- Real-time: Redis/DynamoDB for sub-second queries
- Analytical: Snowflake/BigQuery for complex analytics
- Orchestration: AWS Step Functions
Performance Standards:
- Consistency: Eventual consistency within 15 minutes for the Lambda pattern, immediate for Kappa
- Query Performance:
- Real-time queries: < 10ms P95 latency
- Analytical queries: < 30 seconds for typical dashboards
- Storage Efficiency: < 30% storage overhead for maintaining both batch and stream views
- Recovery Time: Full system recovery within 4 hours using replay mechanisms
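To show how the speed layer and batch layer can converge on one serving table, here is a hedged PySpark Structured Streaming sketch that upserts streaming micro-batches into a Delta table maintained by the batch layer. The table path, join key, and Kafka topic are illustrative assumptions, and the delta-spark and spark-sql-kafka packages are assumed to be on the classpath.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

# Illustrative locations -- replace with your governed lakehouse layout.
SERVING_TABLE = "s3://example-datalake/serving/events_delta/"
KAFKA_BOOTSTRAP = "kafka-broker:9092"

spark = (
    SparkSession.builder.appName("speed-layer-delta-upsert")
    # Assumes the delta-spark and spark-sql-kafka packages are on the classpath.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)


def upsert_to_delta(micro_batch_df, batch_id):
    """Merge one streaming micro-batch into the serving Delta table (created by the batch layer)."""
    target = DeltaTable.forPath(spark, SERVING_TABLE)
    (
        target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.event_id = s.event_id")  # assumed join key
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


# Speed layer: read the enriched Kafka stream and upsert it into the Delta serving table.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP)
    .option("subscribe", "enriched-events")        # illustrative topic name
    .load()
    .select(
        F.col("key").cast("string").alias("event_id"),
        F.col("value").cast("string").alias("payload"),
        F.col("timestamp").alias("event_time"),
    )
)

(
    events.writeStream
    .foreachBatch(upsert_to_delta)
    .option("checkpointLocation", SERVING_TABLE + "_checkpoints/")
    .start()
    .awaitTermination()
)
```

The batch layer would write to the same Delta table on its own schedule, so downstream consumers query a single unified view instead of stitching the two layers together themselves.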
These reference architectures provide the foundational framework that all data processing initiatives must follow. Project teams have flexibility in implementation details—such as specific Spark configurations, Kafka topic structures, or Delta Lake partitioning strategies—as long as they adhere to the approved technology stacks and performance standards outlined above.
Full-scale Blueprint
Enterprise Architects should apply this same methodology to define all remaining sub-blocks—including Orchestration, Error Handling, Resource Management, and others. The primary goal is establishing enterprise-wide consistency through standardized technology frameworks and implementation patterns.
This approach delivers three critical outcomes: project teams leverage proven, reusable components rather than building custom solutions from scratch; regulatory compliance and audit capabilities become embedded by design across all initiatives; and the organization benefits from a unified technology stack that reduces operational complexity and enables knowledge sharing across teams.
Next Up
This post outlined the Data Processing building block as part of our broader enterprise blueprint for LLM data repositories. In the third and final post of this series, we’ll explore Feature Management: the building block that delivers ML-ready features reliably and efficiently to both training and inference workloads.
By the end of this series, you’ll have a comprehensive, enterprise-level blueprint for designing LLM data repositories that are scalable, reliable, and business-aligned.
Let's Architect This Together
Ready to design processing architectures that can scale with your AI ambitions? Or need help selecting the right processing patterns for your specific use cases?
We'd love to help. Book a free discovery call to discuss your specific data processing challenges and explore how our team can support your enterprise architecture journey.
You can also explore our full suite of AI for Marketing services for insights and solutions across data architecture, LLM implementation, and advanced analytics.
Get in touch