Build AI agents that scale: A systems-oriented reference architecture for startups

Every generation of builders faces a shift in abstraction. Assembly yielded to higher-level languages. Monoliths evolved into distributed systems. On-premises infrastructure gave way to cloud-native platforms. Now software itself is becoming AI-native, shaped by models, context, agents, and adaptive workflows.

At re:Invent 2025, Werner Vogels framed this moment clearly: the developers who succeed think in systems and build with precision. The companies that win are not the ones who adopted AI earliest or picked the most capable model. They are the ones who think in complete systems, who understand that every architectural decision ripples through the layers above and below it.

The modern AI stack is too often treated as a checklist: choose a model, attach retrieval, add orchestration, deploy. AI-powered products fail when components look impressive in isolation. They succeed when the system behaves predictably under real-world load, balancing speed, reliability, governance, and cost.

Peter DeSantis made this argument from the infrastructure side at re:Invent 2025. The five fundamentals AWS has obsessed over for twenty years (security, availability, elasticity, agility, and cost) are more important now, not less. AI workloads compound every architectural weakness. A system that scales at 100 users can fail structurally at 10,000. A cost model that looks reasonable at prototype stage can become untenable at production volume. And a governance approach built on good intentions cannot survive an enterprise security review.

This article outlines a systems-oriented reference architecture for model builders and AI-native SaaS teams moving from prototype to production, comprising five layers that deliver their full value when they work together.

Think in systems, not just services

A common mistake in generative AI design is optimizing each layer independently. One team picks the "best" model. Another picks the "best" vector store. A third chooses an orchestration framework based on familiarity. Each decision may look rational in isolation, but users experience the behavior of the whole system: retrieval speed, response quality, workflow durability, policy enforcement, tenant isolation, and cost-to-serve.

In AI-native software, those outcomes emerge from interactions across layers: how identity and permissions flow into retrieval and tool access; how context freshness shapes output quality; how orchestration handles retry, state, and long-running steps; how observability spans model calls, workflows, and application logic; how cost compounds across storage, inference, and workflow execution.

The best AI stack is not the one with the most individually impressive parts. It is the one whose feedback loops produce reliable, predictable system behavior. With that lens, here is a practical reference architecture for modern AI-native startups on AWS.

Layer 1: Data and context foundation

Every AI product is built on a data foundation. This layer determines whether the product can ground AI behavior in durable, governed context. In production systems, context shapes retrieval quality, model behavior, personalization, and trust. If this layer is brittle, stale, or poorly governed, the instability spreads upward.

Four principles guard against the failure modes that are most common in practice:

Source-of-truth data must outlive any one model or retrieval strategy. Teams that tie their data architecture too tightly to a specific embedding model often rebuild the foundation every time the model or access pattern changes.
Context must be organized for fast, relevant access. Retrieval latency is a product-quality issue that compounds across every layer above it.
The same context that improves accuracy can create risk if stale, over-shared, or poorly isolated across tenants. Governed boundaries are essential to accuracy and trust, not just compliance.
Long-lived unstructured data, vector embeddings, and operational state serve different purposes and should remain architecturally distinct, even when they live close together.

Amazon Simple Storage Service (Amazon S3) remains the canonical system of record for documents, transcripts, artifacts, and logs. S3 Vectors extends that foundation into native vector storage at billion-vector scale, preserving the elasticity, durability, and availability model of S3. For an ISV building a knowledge-intensive product, regulatory content, customer interaction history, and the embeddings that make both searchable can live in the same buckets under the same access policies, without a separate vector database to provision, scale, and secure.

A team previously managing a separate vector database would handle provisioning, monitor index health, and plan for scaling events separately from the rest of their infrastructure. S3 Vectors removes that entirely. It inherits the same access policies already governing the document store, so there is no separate scaling strategy, no additional credential management, and no new failure surface to monitor.

Specialized vector stores still have a place. OpenSearch is the better fit when the application needs to combine exact keyword matching with semantic relevance, or when retrieval performance must be optimized at lower latency. Amazon Nova Multimodal Embeddings becomes important when data is not purely text. A contract intelligence platform processing scanned PDFs alongside structured records, or a media platform indexing video with transcripts, benefits from a shared vector space instead of fragmented pipelines.

Key services: Amazon S3, Amazon S3 Vectors, Amazon OpenSearch Service (GPU-accelerated), Amazon Nova Multimodal Embeddings, Amazon Bedrock Knowledge Bases.

Starting point: Store source documents in S3 with prefix-based tenant isolation from day one, then configure a Bedrock Knowledge Base against that bucket before building any custom retrieval logic.
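
To make prefix-based tenant isolation concrete, here is a minimal sketch in Python. The tenants/<tenant_id>/ key layout, the tenant ID format, and the helper names are assumptions made for illustration, not an AWS-prescribed convention. The policy document uses the standard IAM JSON structure, with read access scoped to one tenant's prefix.

```python
import re

# Assumed layout: every object for a tenant lives under tenants/<tenant_id>/.
TENANT_ID_PATTERN = re.compile(r"^[a-z0-9-]{1,64}$")


def tenant_prefix(tenant_id: str) -> str:
    """Return the S3 key prefix reserved for one tenant."""
    if not TENANT_ID_PATTERN.fullmatch(tenant_id):
        # Reject IDs containing slashes, dots, or anything else that could
        # produce a key outside the tenant's own prefix.
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return f"tenants/{tenant_id}/"


def tenant_object_key(tenant_id: str, relative_path: str) -> str:
    """Build a full object key, refusing empty or '..' path segments."""
    parts = relative_path.split("/")
    if ".." in parts or "" in parts:
        raise ValueError(f"invalid relative path: {relative_path!r}")
    return tenant_prefix(tenant_id) + relative_path


def tenant_read_policy(bucket: str, tenant_id: str) -> dict:
    """IAM policy document limiting reads and listing to one tenant's prefix."""
    prefix = tenant_prefix(tenant_id)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
        ],
    }
```

Routing every write through one key-building helper means the isolation boundary is enforced in a single place, and the same prefix can later drive access policies and per-tenant lifecycle rules.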

Layer 2: Model and serving

This layer determines how the system generates intelligence and what it costs to do so. The decision is not which model is most capable, but which model strategy delivers the right balance of accuracy, latency, cost, and control for each workload type.

A domain-specific builder (legaltech, coding assistant, or financial document classifier) needs proprietary accuracy that a generic frontier model cannot consistently deliver or economically sustain. A modern ISV needs predictable latency and cost at query volume. And an inference consumer must avoid paying frontier-model prices for routine tasks like routing, summarization, or entity extraction, where a smaller tuned model performs equivalently at a fraction of the cost.

For most teams, Amazon Bedrock is the right starting point: a managed palette of 18+ open-weight models alongside Anthropic's frontier models, with Nova 2 in the best cost-to-performance tier, and none of the operational burden of running inference infrastructure. As the product matures, the right question shifts from "which model is best?" to "how much of our competitive advantage comes from proprietary model behavior versus proprietary product workflows?" Bedrock Reinforcement Fine-Tuning (RFT) can improve accuracy over base models on domain-specific tasks, making smaller, faster, more cost-effective variants practical at production volume.

For teams that need more control, Amazon SageMaker AI is the managed-but-controlled tier for builders who need to go deeper into fine-tuning, evaluation, MLOps, and custom deployment. It’s also the better fit when proprietary model behavior is part of the product itself. Teams that need runtime patterns a fully managed surface does not expose, such as bidirectional streaming for voice-native experiences, will find SageMaker the more practical choice. Streaming audio in and partial transcripts out makes the interaction feel fluid rather than latency-bound.

For teams building foundation models from scratch, EC2 Trn3 (Trainium3) offers 40 percent lower training cost and 5x higher output tokens per megawatt, with native PyTorch integration so teams can train and deploy without changing model code. Amazon Elastic Kubernetes Service (EKS) sits at the far end of the spectrum for teams that need complete runtime control or specialized serving stacks.

Model-tier decisions set the cost floor for the system. If every request defaults to a frontier model, costs climb quickly as retrieval, agents, and multi-step workflows are added. A disciplined model strategy keeps capability aligned to task requirements and prevents model cost from quietly overtaking product economics.

Bedrock's pricing page provides a current model-by-model cost comparison across input and output tokens. Running these calculations against your expected request volume is a worthwhile step before finalizing your architecture.
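
The calculation itself is simple arithmetic, sketched below in Python. The per-token prices are placeholders invented for this example and do not reflect any real model's pricing; substitute current figures from the pricing page. The tier names and function name are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Price in USD per 1,000 tokens. All values below are placeholders."""
    input_per_1k: float
    output_per_1k: float


# Placeholder prices for illustration only, not real Bedrock pricing.
LARGE_TIER = ModelPrice(input_per_1k=0.003, output_per_1k=0.015)
SMALL_TIER = ModelPrice(input_per_1k=0.0002, output_per_1k=0.0008)


def monthly_cost(
    price: ModelPrice,
    requests_per_month: int,
    input_tokens: int,
    output_tokens: int,
) -> float:
    """Estimate monthly inference spend for one workload on one model tier."""
    per_request = (
        (input_tokens / 1000) * price.input_per_1k
        + (output_tokens / 1000) * price.output_per_1k
    )
    return per_request * requests_per_month
```

With these placeholder prices, one million requests a month at 2,000 input tokens and 500 output tokens each comes to about 13,500 dollars on the larger tier and about 800 dollars on the smaller one. The gap widens further once agents and multi-step workflows multiply the number of model calls per user request.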

Key services: Amazon Bedrock (serverless inference, service tiers), Amazon Nova 2 family, Bedrock RFT, SageMaker AI, EC2 Trn3 (Trainium3).

Starting point: Identify which of your workloads require a frontier model and which could be handled by a smaller or cheaper alternative, as the cost difference compounds quickly at scale.

Layer 3: Inference and agentic runtime

Inference is where architectural intent becomes user-visible behavior. This layer governs latency, throughput, concurrency, session state, tool interaction patterns, quality under bursty demand, and cost per customer interaction. The challenge is not capability. It is reliability, isolation, and cost consistency under real conditions: multiple tenants, bursty demand, tool calls that can fail, and workflows that run for minutes rather than milliseconds.

This is the layer that determines whether a modern ISV can sell to enterprises or only to early adopters. An agentic application with excellent model performance but no tenant isolation, no workflow durability, and no auditable tool call history will fail a procurement review, not because it is technically wrong, but because it cannot be trusted at the system level.

Project Mantle, the inference engine underlying Bedrock, addresses reliability and isolation at infrastructure level. Service tier routing lets a contract intelligence platform route user-facing clause extraction to a priority lane and background regulatory cross-referencing to a flexible lane, optimizing cost without degrading user experience. Per-customer queue isolation means a burst in one tenant's document uploads does not affect another tenant's active review session. The Journal, a key innovation within Mantle, checkpoints inference state continuously so that a long-running due diligence workflow that fails 12 minutes in resumes at the 12-minute mark, not from scratch.
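
The general idea behind checkpointed execution can be sketched in a few lines of Python. This is not how the Journal is implemented, which the source does not describe; it is a simplified illustration that assumes a workflow made of named steps, JSON-serializable state, and a local file standing in for durable storage. All names are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable

# A step takes the current state and returns the updated state.
Step = Callable[[dict], dict]


def run_with_checkpoints(steps: list[tuple[str, Step]], checkpoint_path: Path) -> dict:
    """Run named steps in order, persisting progress after each completed step.

    If an earlier attempt failed partway through, completed steps are skipped
    and execution resumes from the first step that has not finished.
    """
    if checkpoint_path.exists():
        saved = json.loads(checkpoint_path.read_text())
    else:
        saved = {"completed": [], "state": {}}

    for name, step in steps:
        if name in saved["completed"]:
            continue  # finished in a previous attempt, do not repeat it
        saved["state"] = step(saved["state"])
        saved["completed"].append(name)
        # Persist only after the step succeeds, so a crash mid-step
        # causes that one step to be retried rather than skipped.
        checkpoint_path.write_text(json.dumps(saved))

    return saved["state"]
```

Calling run_with_checkpoints a second time with the same checkpoint path after a failure re-runs only the unfinished steps. For that to be safe, each step has to tolerate being retried, which is why checkpointing and idempotency tend to be designed together.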

Amazon Bedrock AgentCore provides the production runtime that most teams would otherwise spend months building: containerized agent deployment across any framework (LangGraph, CrewAI, Strands Agents, OpenAI Agents SDK), cross-session episodic memory, MCP-based tool access with Cedar policy enforcement, and continuous quality evaluation against live evaluators. A legal SaaS team running their own agent infrastructure typically absorbs multiple engineers to manage containerization, session handling, and tool security. AgentCore consolidates those concerns into a managed layer, freeing that engineering capacity for the clause library, risk taxonomy, and client-specific policy rules that win deals.

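The principle that makes or breaks this layer in enterprise sales cycles is the distinction between mechanisms and good intentions: a requirement enforced by policy, isolation, or logging, rather than one that depends on a prompting strategy.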

Key services: Amazon Bedrock (Project Mantle: service tiers, queue isolation, Journal), AgentCore (Runtime, Memory, Gateway, Evaluations, Identity), Strands Agents, AWS Step Functions, Amazon Simple Queue Service (SQS).

Starting point: List the enterprise requirements your agent needs to meet and map each one to a concrete mechanism, rather than a prompting strategy.
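
As an illustration of what a concrete mechanism looks like, the Python sketch below enforces a deny-by-default tool allow-list in code, outside the model's control. It does not use Cedar or the AgentCore API; the policy structure, class names, and tool names are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


class ToolCallDenied(Exception):
    """Raised when an agent requests a tool its tenant policy does not allow."""


@dataclass
class ToolPolicy:
    # Assumed structure: tenant id -> names of the tools that tenant may invoke.
    allowed_tools: dict[str, set[str]]
    audit_log: list[dict] = field(default_factory=list)

    def authorize(self, tenant_id: str, tool_name: str) -> None:
        """Deny by default, and record every decision whether allowed or not."""
        allowed = tool_name in self.allowed_tools.get(tenant_id, set())
        self.audit_log.append(
            {"tenant": tenant_id, "tool": tool_name, "allowed": allowed}
        )
        if not allowed:
            raise ToolCallDenied(f"{tool_name} is not permitted for tenant {tenant_id}")


def call_tool(
    policy: ToolPolicy,
    tenant_id: str,
    tool_name: str,
    tools: dict[str, Callable],
    **kwargs,
):
    """Single entry point for tool calls: the policy check cannot be bypassed
    by anything the model writes, because it runs before the tool does."""
    policy.authorize(tenant_id, tool_name)
    return tools[tool_name](**kwargs)
```

A prompt that says "never delete records" is an intention. A code path that raises ToolCallDenied and writes an audit entry is a mechanism that can be shown to a reviewer.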

Layer 4: Orchestration and compute

This is the layer where AI stops being a single model call and becomes software. Most production AI products are multi-step systems that retrieve context, invoke models, call tools, validate outputs, trigger downstream actions, persist results, and re-enter workflows asynchronously. Orchestration is part of the core application architecture, not an implementation detail.

Consider a financial services SaaS platform performing contract analysis. A single request might involve document ingestion, chunking, embedding generation, retrieval against prior agreements, multi-step reasoning over clauses, routing to a human reviewer, and triggering a downstream compliance workflow. That is a durable application workflow with branching logic, retries, state transitions, and asynchronous steps spanning minutes or hours, not a single inference call.

The structural insight here mirrors what made serverless computing transformative: the goal is not simply to make infrastructure easier to manage, but to remove entire categories of infrastructure management. Lambda Managed Instances applies that principle to a common gap. Some workloads require specific compute characteristics (high memory for embedding generation, document preprocessing pipelines, or CPU-heavy model inference) that are too heavy for simple serverless functions but do not warrant direct fleet management. A startup processing thousands of legal documents daily can run those functions on specific instance profiles while Lambda manages provisioning, scaling, and patching, keeping a serverless architecture without inheriting EC2 fleet operations.

For teams that require deeper runtime control, EKS provides the consistency and control that model builders running custom inference servers or platform teams standardizing on Kubernetes prefer.

Amazon DynamoDB fits naturally as the transactional control plane for workflow state, session metadata, tenant configuration, idempotency keys, tool results, and audit references. It’s the operational backbone that keeps the application coherent as work moves across services and workflow steps. This is distinct from the semantic memory layer used for retrieval.
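
Idempotency keys are the least obvious item on that list, so a small Python sketch may help. It uses an in-memory class as a stand-in for a table, which is an assumption made to keep the example self-contained. In DynamoDB, the equivalent of put_if_absent would be a conditional write that uses an attribute_not_exists condition on the key. The class, method, and status names are hypothetical.

```python
from typing import Callable, Optional


class IdempotencyStore:
    """In-memory stand-in for a table keyed by (tenant_id, idempotency_key)."""

    def __init__(self) -> None:
        self._items: dict[tuple[str, str], dict] = {}

    def put_if_absent(self, tenant_id: str, key: str, item: dict) -> bool:
        """Write only if no item exists yet, mirroring a conditional write."""
        item_key = (tenant_id, key)
        if item_key in self._items:
            return False
        self._items[item_key] = item
        return True

    def update(self, tenant_id: str, key: str, item: dict) -> None:
        self._items[(tenant_id, key)] = item

    def get(self, tenant_id: str, key: str) -> Optional[dict]:
        return self._items.get((tenant_id, key))


def run_once(store: IdempotencyStore, tenant_id: str, key: str,
             action: Callable[[], object]) -> dict:
    """Run the action at most once per (tenant, key).

    The key is claimed before the action runs, so a retried or duplicated
    request sees the existing record instead of repeating the side effect.
    A production version would also need to expire claims left in
    IN_PROGRESS by a crashed worker; that is omitted here.
    """
    claimed = store.put_if_absent(tenant_id, key, {"status": "IN_PROGRESS"})
    if not claimed:
        return store.get(tenant_id, key)
    result = action()
    store.update(tenant_id, key, {"status": "DONE", "result": result})
    return store.get(tenant_id, key)
```

Scoping the key by tenant keeps one customer's request identifiers from colliding with another's, which matters once retries and asynchronous re-entry are routine.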

Kiro, an AI-powered development environment, fits into this layer as a software delivery accelerator. Its role is to help teams translate natural language requirements into structured designs, specifications, and implementation tasks, allowing teams to move faster while keeping system architecture coherent.

Key services: Lambda Managed Instances, Lambda Durable Functions, Amazon DynamoDB, Amazon EC2 M9g (Graviton5), AWS Step Functions, Amazon ECS on Graviton5, Amazon EventBridge, Amazon SQS.

Starting point: Map your workflow on paper, identifying every step that could fail, branch, or run asynchronously, before choosing any orchestration tooling.

Layer 5: Governance, observability, and trust

This is where many AI stacks still break down. Teams treat governance as something that can be added later. Agents with broad tool access, limited evaluation rigor, and vague prompt-based constraints erode trust and create adoption barriers. The better principle? Mechanisms, not intentions.

Enterprise buyers ask two consistent questions before adopting an AI system: Can you demonstrate that our data never crosses tenant boundaries? And can you prove that your AI operates within the boundaries you claim, with mechanisms that enforce those boundaries and logs that show what happened?

For a healthcare ISV, this means a HIPAA-scoped agent that cannot access records outside a patient's authorized context. For a financial services SaaS provider, it means an investment research assistant whose tool calls are constrained by client-specific data access agreements. These are standard requirements for regulated enterprise deployments, not edge cases.

Bedrock Confidential Computing addresses the data isolation question in the inference plane by protecting data during execution and providing a higher-assurance runtime boundary. Services including Bedrock, AgentCore, Lambda, and S3 can operate within a unified identity model, allowing access governance to be applied consistently across data, model invocation, and agent tool usage, without building a separate authorization system for each layer. The same policies that govern data access also govern model calls and tool permissions. Tool-call logs then become auditable records of system behavior.

Governance at this layer also includes tenant-aware data controls, model and prompt versioning, traceability across tool calls and workflows, cost visibility by environment or customer, and end-to-end observability across application, workflow, and model layers. These are not architectural luxuries. They are what allow teams to reason about system behavior, investigate incidents at the correct layer, and demonstrate compliance without reconstructing events from raw logs.
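
Cost visibility by customer is one of the easier items here to prototype. The Python sketch below assumes each telemetry event already carries a tenant ID, the layer that emitted it, and a cost figure; those field names are hypothetical and would need to match whatever your telemetry pipeline actually records.

```python
from collections import defaultdict


def cost_by_tenant(events: list[dict]) -> dict[str, dict[str, float]]:
    """Aggregate cost per tenant, broken down by the layer that incurred it.

    Each event is assumed to look like:
        {"tenant_id": "acme", "layer": "inference", "cost_usd": 0.5}
    """
    totals: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for event in events:
        totals[event["tenant_id"]][event["layer"]] += event["cost_usd"]
    # Convert nested defaultdicts to plain dicts for reporting or serialization.
    return {tenant: dict(layers) for tenant, layers in totals.items()}


def total_for_tenant(breakdown: dict[str, dict[str, float]], tenant_id: str) -> float:
    """Sum all layers for one tenant; returns 0.0 for an unknown tenant."""
    return sum(breakdown.get(tenant_id, {}).values())
```

The aggregation is trivial. The hard part is upstream: this only works if tenant context is attached to every model call, workflow step, and storage operation at the moment it happens.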

For teams operating in EMEA, regulatory context shapes several of these architectural decisions. GDPR requirements around data residency mean tenant isolation is not just an enterprise sales requirement but a legal one. S3 prefix-based isolation and per-tenant encryption are the practical mechanisms that satisfy it. The EU AI Act introduces further obligations around transparency and human oversight for high-risk AI applications, mapping directly to audit logging and tool-call traceability.

Note that Bedrock model availability, S3 Vectors, and AgentCore are not uniformly available across all AWS EMEA regions. Teams should verify availability in their target region before committing to a specific architecture.

Startup teams should also note that some services referenced above, including S3 Vectors and AgentCore, are relatively new to production environments. Validate maturity for your specific use case and region before committing to them as core infrastructure.

Key services: Amazon Bedrock (confidential computing), AWS Identity and Access Management (IAM), Amazon CloudWatch, AgentCore Policy (Cedar), AgentCore Evaluations, AWS Security Hub.

Starting point: Decide what your audit log needs to prove to buyers or compliance officers. Then work backwards from that to the policies and tooling you need in place.

How the full system comes together

A user request enters the application. The runtime authenticates the request and resolves tenant context. Relevant memory and product knowledge are retrieved from the context layer. The orchestration layer determines whether the task is a simple model interaction or a multi-step workflow. The inference layer generates or reasons over the next step. Policies constrain which tools may be invoked. Long-running steps checkpoint state and recover cleanly on failure. The system emits telemetry across each stage. Evaluations and feedback loops measure quality over time. Product teams use those signals to refine prompts, update policies, improve retrieval, or decide when deeper model customization is warranted.
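
The first half of that flow can be sketched as a small Python pipeline. Every stage here is a stub and every name is hypothetical: the token format, the keyword-match retrieval, and the routing rule are placeholders for real authentication, a real retrieval layer, and real orchestration logic. The point is the shape, with tenant context resolved once and carried through every stage.

```python
from dataclasses import dataclass, field


@dataclass
class RequestContext:
    tenant_id: str
    user_input: str
    retrieved: list = field(default_factory=list)
    trace: list = field(default_factory=list)  # telemetry emitted per stage


def authenticate(token: str, user_input: str) -> RequestContext:
    # Placeholder token format "tenant:<id>". A real system would verify
    # a signed credential and resolve the tenant from it.
    if not token.startswith("tenant:"):
        raise PermissionError("unauthenticated request")
    ctx = RequestContext(tenant_id=token.split(":", 1)[1], user_input=user_input)
    ctx.trace.append("authenticated")
    return ctx


def retrieve(ctx: RequestContext, knowledge: dict) -> RequestContext:
    # Retrieval only ever sees the documents of the authenticated tenant.
    words = ctx.user_input.lower().split()
    tenant_docs = knowledge.get(ctx.tenant_id, [])
    ctx.retrieved = [d for d in tenant_docs if any(w in d.lower() for w in words)]
    ctx.trace.append("retrieved")
    return ctx


def plan(ctx: RequestContext) -> str:
    # Stand-in for orchestration: single model call or multi-step workflow.
    route = "workflow" if len(ctx.retrieved) > 1 else "single_call"
    ctx.trace.append(f"planned:{route}")
    return route


def handle_request(token: str, user_input: str, knowledge: dict) -> dict:
    ctx = authenticate(token, user_input)
    ctx = retrieve(ctx, knowledge)
    route = plan(ctx)
    # Stub for the model call; a real system would invoke inference here.
    answer = f"[{route}] answer grounded in {len(ctx.retrieved)} document(s)"
    ctx.trace.append("generated")
    return {"answer": answer, "trace": ctx.trace}
```

Policy enforcement, checkpointing, and evaluation would hang off the same context object, so each stage's telemetry can be correlated back to one tenant and one request.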

That is systems thinking in practice. The winning architecture is the one that aligns data, context, inference, workflow, governance, and operations into a coherent system that behaves well as the company scales.

For most startup teams, the entry point is Layer 1. Get your data foundation and tenant isolation right before tackling agents or orchestration. The layers that follow are only as reliable as the foundation beneath them.

Build it. Own it.

A well-designed AI stack improves time to market, reliability, trust, and cost discipline, not because of any single service, but because the layers work together as a coherent system. When the stack is systems-oriented, teams can evolve their architecture as the business grows without rebuilding from scratch every time a new capability appears.
