AWS Web3 Blog

EKS marks the spot: scaling Circle’s blockchain nodes with a modern Kubernetes stack

This is a guest post by James Fong, VP of Technical Operations at Circle, and Jake Scaltreto, Principal Site Reliability Engineer at Circle, in partnership with AWS.

Operating blockchain node infrastructure at scale is not for the faint of heart. A diverse assortment of blockchain node software, with varying requirements, release cycles, and myriad pitfalls, presents significant hurdles for even the most seasoned operations team. But what if you could do it with less toil and greater efficiency by leveraging the cloud?

At Circle, we operate scores of nodes across dozens of blockchains in order to provide reliable RPC services supporting products such as USDC, Cross-Chain Transfer Protocol (CCTP), Circle Payments Network (CPN), and others. Circle is a global financial technology firm that enables businesses of all sizes to harness the power of digital currencies and public blockchains for payments, commerce, and financial applications worldwide. Circle is building the world’s largest, most widely used stablecoin network, and issues, through its regulated affiliates, the USDC and EURC stablecoins. We’ve tackled the unique operational complexities of blockchain node infrastructure, from demanding compute and storage requirements to the constant operational effort of keeping nodes up to date and running, by building a robust solution on AWS.

We operate all of our blockchain node infrastructure on AWS, running all of our nodes exclusively in Amazon Elastic Kubernetes Service (Amazon EKS). We believe that the flexible capabilities of AWS and EKS have enabled us to scale more quickly compared to an on-premises solution.

In this post, we share details about how we operate Circle’s blockchain node infrastructure at scale, using AWS services, common off-the-shelf tools, and some custom tooling.

Circle’s blockchain node infrastructure overview

The following diagram provides an overview of Circle’s blockchain node infrastructure on AWS, illustrating the core components, workflows, and AWS services that enable scalable and reliable node operations.

Dynamic Instance Provisioning with Karpenter

Each blockchain has unique requirements for compute, memory, and storage. Some chains run easily on a small instance, while others demand significant resources and terabytes of ultra-low-latency storage. To support these diverse workloads in a multi-tenant EKS cluster, traditional Auto Scaling Group (ASG) based node groups would be challenging to manage.

Instead, we leverage Karpenter to handle our cluster autoscaling needs. With Karpenter, we can simply deploy pods and let Karpenter provision instances to accommodate them. This approach respects CPU and memory requests, pod affinity, and topology spread, allowing our team to focus on the workload’s requirements rather than the specific instance types. Karpenter also helps us keep costs in check by “binpacking” workloads onto more cost-efficient instance types as needed, maximizing the use of our compute resources.
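To make the idea of binpacking concrete, the following is a minimal sketch of a first-fit-decreasing placement over CPU and memory requests. It is an illustration only, not Karpenter’s actual scheduling algorithm; the Instance class, the binpack function, and the single fixed instance size are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A hypothetical instance with a fixed CPU and memory capacity."""
    cpu: float
    mem_gib: float
    pods: list = field(default_factory=list)

    def fits(self, cpu: float, mem_gib: float) -> bool:
        # Sum the requests of pods already placed on this instance.
        used_cpu = sum(p[1] for p in self.pods)
        used_mem = sum(p[2] for p in self.pods)
        return used_cpu + cpu <= self.cpu and used_mem + mem_gib <= self.mem_gib


def binpack(pods, instance_cpu, instance_mem_gib):
    """Place (name, cpu, mem_gib) requests using first-fit decreasing."""
    instances = []
    # Largest requests first, so big pods claim space before small ones fill gaps.
    for pod in sorted(pods, key=lambda p: (p[1], p[2]), reverse=True):
        name, cpu, mem = pod
        if cpu > instance_cpu or mem > instance_mem_gib:
            raise ValueError(f"{name} does not fit on this instance size")
        # Reuse the first instance with room; otherwise "provision" a new one.
        target = next((i for i in instances if i.fits(cpu, mem)), None)
        if target is None:
            target = Instance(instance_cpu, instance_mem_gib)
            instances.append(target)
        target.pods.append(pod)
    return instances


if __name__ == "__main__":
    requests = [("a", 4, 16), ("b", 2, 8), ("c", 2, 8), ("d", 6, 24)]
    for idx, inst in enumerate(binpack(requests, 8, 32)):
        print(idx, [p[0] for p in inst.pods])
```

In practice the autoscaler also weighs instance price, availability, affinity, and topology constraints, which this sketch leaves out.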

Leveraging StatefulSets for blockchain nodes

With cluster autoscaling managed by Karpenter, our next objective was to determine how to deploy stateful blockchain nodes in EKS. We use Kubernetes StatefulSets to manage our blockchain node pods.

When a new node is launched, the StatefulSet controller automatically provisions a PersistentVolumeClaim. We use the Amazon EBS CSI driver to provision Amazon EBS volumes for most of our nodes. Using distinct StorageClasses allows us to fine-tune the EBS configuration based on the chain’s needs; many chains function well with conservatively provisioned gp3 volumes, while others may require the higher IOPS and tighter latency guarantees that io2 volumes provide.
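As a rough sketch of what per-chain StorageClasses could look like, the helper below builds StorageClass manifests as Python dictionaries for the EBS CSI driver. The class names and the IOPS and throughput figures are illustrative assumptions, not Circle’s actual settings; the provisioner name and the type, iops, and throughput parameters are those documented for the driver.

```python
def ebs_storage_class(name, volume_type, iops=None, throughput=None):
    """Build a StorageClass manifest (as a dict) for the EBS CSI driver."""
    parameters = {"type": volume_type}
    if iops is not None:
        # StorageClass parameter values are strings.
        parameters["iops"] = str(iops)
    if throughput is not None:
        # Throughput in MiB/s; only meaningful for gp3 volumes.
        parameters["throughput"] = str(throughput)
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "ebs.csi.aws.com",
        "parameters": parameters,
        # Delay provisioning until a pod is scheduled, so the volume is
        # created in the same Availability Zone as the instance.
        "volumeBindingMode": "WaitForFirstConsumer",
        "allowVolumeExpansion": True,
    }


# Hypothetical classes: a modest gp3 default and a high-IOPS io2 tier.
GP3_DEFAULT = ebs_storage_class("chain-gp3", "gp3", iops=3000, throughput=125)
IO2_FAST = ebs_storage_class("chain-io2", "io2", iops=32000)
```

A StatefulSet’s volumeClaimTemplates would then reference one of these classes by name through storageClassName.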

In a few rare cases, certain chains demand even lower latencies than io2 can provide. For these cases, the requirement pushed our team to get creative. Fortunately, AWS offers EC2 instance types with locally attached NVMe storage. To leverage this, we use a custom startup script in our AMI to detect all locally attached NVMe block devices and combine them into a single striped LVM volume. This is a rare case where we are more hands-on with Karpenter, using affinity to select nodes of a particular instance family, such as i4i.
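The startup script itself is not shown here, but its core logic might resemble the following sketch, which picks out instance store devices and builds the LVM commands for a single striped volume. The volume group and logical volume names are hypothetical, and filtering on the NVMe model string is an assumption about how detection could work. The functions only return the commands, so they are safe to run anywhere.

```python
# Model string reported by EC2 instance store NVMe devices; EBS volumes
# report "Amazon Elastic Block Store" instead.
INSTANCE_STORE_MODEL = "Amazon EC2 NVMe Instance Storage"


def instance_store_devices(devices):
    """Filter a {device_path: model} mapping down to instance store devices."""
    return sorted(
        path for path, model in devices.items()
        if model.strip() == INSTANCE_STORE_MODEL
    )


def striped_volume_commands(devices, vg="nodedata", lv="data"):
    """Return the commands that stripe `devices` into one XFS-formatted volume."""
    if not devices:
        return []
    return [
        ["pvcreate", *devices],
        ["vgcreate", vg, *devices],
        # One stripe per device, using all free extents in the volume group.
        ["lvcreate", "--stripes", str(len(devices)),
         "--extents", "100%FREE", "--name", lv, vg],
        # reflink enables copy-on-write file copies on XFS.
        ["mkfs.xfs", "-m", "reflink=1", f"/dev/{vg}/{lv}"],
    ]


if __name__ == "__main__":
    found = {
        "/dev/nvme0n1": "Amazon Elastic Block Store",
        "/dev/nvme1n1": "Amazon EC2 NVMe Instance Storage",
        "/dev/nvme2n1": "Amazon EC2 NVMe Instance Storage",
    }
    for cmd in striped_volume_commands(instance_store_devices(found)):
        print(" ".join(cmd))
```

A real script would read the model strings from the system (for example, from sysfs) and execute the commands as root during boot.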

As instance storage is tied to the lifecycle of the EC2 instance, we built a custom tool to handle backing up and restoring node data. The tool runs as a non-terminating initContainer; upon startup, it checks if the local data directory is initialized, and if not, it downloads the most recent snapshot from an Amazon S3 bucket and restores it. When a pod is terminated, the tool creates an archive of the node’s data directory (leveraging XFS’s reflink feature for fast copy-on-write snapshotting) and syncs it to S3.
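The startup decision described above can be sketched as follows. This is a simplified illustration rather than the actual tool: the marker file name, the snapshot key layout (a sortable UTC timestamp embedded in each key), and the function names are all assumptions, and the S3 download itself is omitted.

```python
import os


def is_initialized(data_dir, marker=".initialized"):
    """Treat the data directory as initialized if a marker file exists."""
    return os.path.isfile(os.path.join(data_dir, marker))


def latest_snapshot(keys):
    """Pick the newest snapshot key.

    Assumes keys share a prefix and embed a sortable UTC timestamp, such as
    'snapshots/20250101T000000Z.tar', so lexicographic order is chronological.
    """
    return max(keys) if keys else None


def plan_startup(data_dir, snapshot_keys):
    """Decide what to do before the node software starts."""
    if is_initialized(data_dir):
        return ("start", None)
    key = latest_snapshot(snapshot_keys)
    if key is None:
        # No snapshot available: the node must sync from the network.
        return ("sync_from_scratch", None)
    return ("restore", key)
```

On the shutdown side, a reflink copy (for example, cp --reflink=always on XFS) would let the tool capture a consistent copy of the data directory quickly before archiving and uploading it.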

Customizing pod lifecycle with a CRD

With storage addressed, our next concern was determining how best to manage the lifecycle of the blockchain node pods themselves. We often run multiple nodes with varying configurations for a given blockchain, for example, running multiple execution clients like Geth and Reth, or nodes of different software versions for testing. Running pods with different specifications is not supported by a single Kubernetes StatefulSet resource, so we’ll typically run multiple single-replica StatefulSets supporting a particular chain and network. However, Kubernetes lacks native support for orchestrating the deployment of many StatefulSets in a way that is both automated and safe against unwanted disruptions.

To overcome this, we introduced a higher-level abstraction with a Custom Resource Definition (CRD) called the Nodeset. Each Nodeset represents a set of nodes for a given blockchain and network. An in-house operator, Blockchain Controller, watches for changes to StatefulSets and manages the pods’ lifecycles directly, effectively usurping some of the function of the StatefulSet controller (to achieve this, we configure our StatefulSets to use the OnDelete updateStrategy). The operator includes additional safeguards to ensure availability by monitoring the health of nodes and preventing an outage when a pod is terminated, such as using the Eviction API so that Pod Disruption Budgets are respected.
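With the OnDelete update strategy, a pod only picks up a new specification after something deletes it, so the operator decides when that is safe. The sketch below shows one way such a decision could be made; it is not the Blockchain Controller’s actual logic. The NodePod type, the min_healthy threshold, and the function name are assumptions, and a real operator would submit an eviction through the Kubernetes Eviction API rather than return a name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePod:
    name: str
    revision: str   # the spec revision the pod is currently running
    healthy: bool   # the health verdict for this node


def next_pod_to_evict(pods, target_revision, min_healthy=1):
    """Return the name of the next pod that is safe to replace, or None."""
    outdated = [p for p in pods if p.revision != target_revision]
    if not outdated:
        return None  # rollout complete
    # Replacing an unhealthy pod costs no availability, so do those first.
    for pod in outdated:
        if not pod.healthy:
            return pod.name
    # Otherwise only proceed if enough healthy nodes would remain.
    healthy_count = sum(1 for p in pods if p.healthy)
    if healthy_count - 1 < min_healthy:
        return None  # wait until a replacement becomes healthy
    return outdated[0].name
```

Calling this in a reconcile loop rolls a Nodeset one pod at a time, pausing whenever the set would drop below the healthy threshold.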

Ensuring node health with a custom monitor

Monitoring blockchain nodes presents a unique challenge. A simple TCP or HTTP probe can tell us if a node is ready to accept RPC traffic, but it doesn’t necessarily indicate if the node is up-to-date and consistent with the network.

To ensure clients receive timely and accurate information, we developed a custom health monitoring solution, Blockchain Monitor. This runs as a sidecar alongside the node software, continually verifying the local node’s state against trusted public RPCs. The monitor checks if the local node’s block height is within an acceptable tolerance, if block data is consistent, and if the node has sufficient peers to continue syncing reliably.
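The three checks can be expressed as a small pure function. The following is a sketch under stated assumptions, not Blockchain Monitor’s implementation: the NodeState type, the default tolerances, and the idea of comparing block hashes at shared heights are illustrative choices, and fetching state from the local node and the reference RPC is left out.

```python
from dataclasses import dataclass, field


@dataclass
class NodeState:
    height: int                                        # latest block height seen
    block_hashes: dict = field(default_factory=dict)   # height -> block hash
    peers: int = 0                                     # connected peer count


def check_health(local, reference, max_lag=5, min_peers=3):
    """Compare a local node against a trusted reference; return a list of problems."""
    problems = []

    # 1. Block height must be within tolerance of the reference.
    lag = reference.height - local.height
    if lag > max_lag:
        problems.append(f"node is {lag} blocks behind (tolerance {max_lag})")

    # 2. Block data must agree at every height both sides report.
    shared = sorted(set(local.block_hashes) & set(reference.block_hashes))
    for height in shared:
        if local.block_hashes[height] != reference.block_hashes[height]:
            problems.append(f"block hash mismatch at height {height}")
            break

    # 3. The node needs enough peers to keep syncing.
    if local.peers < min_peers:
        problems.append(f"only {local.peers} peers (minimum {min_peers})")

    return problems
```

An empty list means the node passes; a sidecar could expose that verdict on an HTTP endpoint for a readiness probe to consume.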

High-availability ingress

Networking and, in particular, ingress also pose non-trivial obstacles. We prefer to run our nodes in a high-availability (HA) configuration rather than load balancing RPC requests across all nodes for a given chain. This setup provides a “hot spare” that can be promoted if the active “leader” node becomes unhealthy. It also enables zero-downtime updates by applying changes to the standby node first.

To accomplish this, we again leverage the Nodeset CRD and Blockchain Controller. For each Nodeset, the operator provisions all necessary ingress resources: Services, EndpointSlices, and Gateway API HTTPRoutes (or GRPCRoutes). Blockchain Controller tracks the active “leader” node in the CRD’s status and can react to changes, such as pod readiness, by switching to a healthy node and updating the EndpointSlice.
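The leader-tracking behavior can be illustrated with a short sketch: keep the current leader while it is ready, otherwise promote the first ready standby, and point a single-endpoint EndpointSlice at the result. The function names, the "-leader" naming suffix, and the "rpc" port name are assumptions for the example; the EndpointSlice fields follow the discovery.k8s.io/v1 API.

```python
def choose_leader(current, readiness):
    """Pick the leader given {pod_name: is_ready}, in order of preference."""
    # Stay with the current leader while it is ready, to avoid needless flapping.
    if current is not None and readiness.get(current, False):
        return current
    # Otherwise promote the first ready standby, if any.
    for name, ready in readiness.items():
        if ready:
            return name
    return None


def leader_endpoint_slice(service, leader_ip, port):
    """Build an EndpointSlice manifest that routes only to the leader."""
    endpoints = []
    if leader_ip is not None:
        endpoints.append({"addresses": [leader_ip], "conditions": {"ready": True}})
    return {
        "apiVersion": "discovery.k8s.io/v1",
        "kind": "EndpointSlice",
        "metadata": {
            "name": f"{service}-leader",
            # This label associates the slice with its Service.
            "labels": {"kubernetes.io/service-name": service},
        },
        "addressType": "IPv4",
        "ports": [{"name": "rpc", "protocol": "TCP", "port": port}],
        "endpoints": endpoints,
    }
```

Because the Service has no selector in this arrangement, traffic follows whatever address the operator writes into the slice, which is what makes an explicit active/standby switch possible.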

For ingress, we selected Traefik Proxy as our Gateway API controller. It has proven highly scalable, comfortably handling thousands of requests per second on just a few pods. Integrated metrics and tracing capabilities provide robust observability patterns, allowing us to see at a glance how our node fleet is performing.

We placed Traefik behind an AWS Network Load Balancer (NLB) managed by the AWS Load Balancer Controller. The choice of NLB wasn’t arbitrary; the NLB’s static IP addresses and support for AWS PrivateLink allow us to provide reliable RPC services wherever clients may be running. Within the cluster, we use Cilium CNI for its highly performant, eBPF-powered traffic management and policy enforcement.
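For readers unfamiliar with the AWS Load Balancer Controller, an NLB is typically requested by annotating a Service of type LoadBalancer. The sketch below builds such a manifest as a Python dictionary. The Service name, selector labels, and port are hypothetical, and the choice of an internal scheme with IP targets is an assumption rather than a description of Circle’s configuration; the annotation keys are those documented by the controller.

```python
def nlb_service(name, selector, port, internal=True):
    """Build a Service manifest asking the AWS Load Balancer Controller for an NLB."""
    scheme = "internal" if internal else "internet-facing"
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "annotations": {
                # Hand the Service to the AWS Load Balancer Controller.
                "service.beta.kubernetes.io/aws-load-balancer-type": "external",
                # Register pod IPs directly as NLB targets.
                "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type": "ip",
                "service.beta.kubernetes.io/aws-load-balancer-scheme": scheme,
            },
        },
        "spec": {
            "type": "LoadBalancer",
            "selector": selector,
            "ports": [{"name": "https", "protocol": "TCP",
                       "port": port, "targetPort": port}],
        },
    }


# Hypothetical Service fronting the proxy pods.
TRAEFIK_NLB = nlb_service("traefik", {"app.kubernetes.io/name": "traefik"}, 443)
```

An internal NLB like this one can then be exposed to other VPCs through a PrivateLink endpoint service.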

Conclusion

Our journey to scale blockchain node infrastructure on AWS has shown that a modern, cloud-native approach is not just viable, but transformative. By using Amazon EKS and tools like Karpenter, we’ve built a platform that is resilient, performant, and cost-efficient, fundamentally changing how our team operates. The reduction in manual toil through automation has allowed us to shift our focus from day-to-day maintenance to innovation, enabling us to build new custom tooling and solve more complex challenges. This architecture is a testament to how the right combination of AWS services and open-source tools can empower teams and drive innovation.

To learn more about building resilient, scalable platforms, check out Amazon EKS and Circle’s developer documentation for integrating blockchain capabilities seamlessly.


About the authors

James Fong

James is the VP of Technical Operations at Circle, where he leads the design and operations of Circle’s multi-chain infrastructure platform, supporting over 25 blockchain networks. With over 15 years of experience in infrastructure engineering, DevOps, and site reliability, he specializes in building secure, scalable, and resilient systems. James is passionate about infrastructure-as-code, supply chain security, and delivering seamless developer experiences. As a forward-thinking leader in blockchain infrastructure, James Fong oversees Circle’s production infrastructure initiatives, including node orchestration, public RPC services, and next-generation tooling for digital asset ecosystems. His meticulous approach and instinctive leadership have shaped Circle’s multi-chain platform, enabling secure and scalable access to blockchain networks at global scale.

Jake Scaltreto

Jake is a Principal Site Reliability Engineer at Circle, where he builds tooling and processes to operate a fleet of nodes supporting more than two dozen blockchains. With nearly twenty years of systems engineering experience across industries including biotech, gaming, and banking, he is passionate about designing infrastructure that is reliable, secure, and repeatable. Based outside Boston, he lives with his girlfriend and two cats, and when not behind a keyboard, he can often be found behind the scenes producing theater.

Jigna Gandhi

Jigna is a Senior Solutions Architect at AWS, where she specializes in guiding financial services and Web3 customers in designing secure, scalable, and forward-thinking cloud architectures. With a strong focus on innovation, she collaborates with a diverse range of clients—from established banks exploring blockchain solutions to fast-growing Web3 startups aiming to scale globally. Jigna is passionate about bridging the gap between cutting-edge technologies and real-world impact, helping organizations navigate complex challenges while unlocking new opportunities through the power of the cloud.

Karim Akacem

Karim is a Senior Technical Account Manager and Enterprise Support Lead at AWS, where he partners with Fintech and Web3 customers to drive successful cloud adoption and operational excellence. With deep expertise in cloud operations, cloud financial management (FinOps), and emerging technologies, Karim delivers strategic technical guidance and tailored solutions that help enterprise clients navigate the complexities of their cloud transformation journeys. He plays a key role in aligning technical outcomes with business goals, advocating for customer needs within AWS, and fostering long-term partnerships built on trust, innovation, and impact.