How Place Exchange transformed their programmatic platform with Graviton and Karpenter on Amazon EKS
This post was co-created with Brian Annis, Director of Site Reliability Engineering at Place Exchange.
In the ever-evolving landscape of digital advertising, Place Exchange stands out as a game-changer for out-of-home (OOH) media. As the leading Supply-Side Platform for programmatic OOH, they’ve skillfully bridged the gap between digital and physical advertising realms. Their patented technology enables advertisers to integrate all types of OOH media – from outdoor billboards, to screens at retailers, transit hubs, and more – into their digital strategies. This allows seamless use of existing workflows, creative assets, and reporting tools across both online and OOH campaigns. This unified approach means that marketers can now plan, buy, and measure OOH advertising with the same ease and data-driven precision as their digital efforts. By making OOH advertising as accessible and manageable as running any other programmatic campaign, Place Exchange is opening up new possibilities for brands to connect with their audiences in meaningful physical environments woven into consumers’ daily journeys between home, work, shopping, and entertainment.
In this blog post you will learn about Place Exchange’s multi-step journey to reduce infrastructure spending while maintaining a highly available and performant ad exchange.
Background
As an early adopter of containerization, Place Exchange initially faced limited options for container orchestration, leading them to develop a custom Docker Swarm setup to run their ad exchange platform. However, Docker Swarm’s lack of autoscaling support made it unsuitable for their growing business needs, prompting a decision to migrate to Amazon Elastic Kubernetes Service (EKS), a more robust and scalable container orchestration solution.
With the move to EKS, Place Exchange implemented Cluster Autoscaler (CA) to scale their clusters dynamically. This allowed pending pods to trigger scale-up actions, expanding the cluster’s compute capacity on an as-needed basis, thus addressing the limitations they had faced with their previous setup and supporting their business growth more effectively.
Figure 1. EKS with Cluster Autoscaler Architecture – EKS with one managed on-demand nodegroup across a single x86 instance family
AWS released Karpenter, an open-source, high-performance Kubernetes cluster autoscaler, in November 2021 to address the limitations of the traditional Cluster Autoscaler (CA). Unlike CA, which depends on Auto Scaling Groups (ASGs) and requires complex configuration, Karpenter can launch instances of whatever type is needed without ASG constraints. This makes it easier to use multiple instance types in a Kubernetes cluster, because there is no need to create an ASG for each instance type, and it simplifies cluster management overall.
The Challenge
Place Exchange’s Site Reliability Engineering team set about introducing EC2 spot instances to the existing EKS clusters to help optimize for cost. Spot instances offer savings of up to 90% compared to on-demand instances.
However, the Cluster Autoscaler had several limitations that required additional configuration:
- Scheduling pods onto different nodegroups based on their requirements meant operators had to know which managed nodegroup was the best fit, and then apply a static label to the deployment or pod so that it was scheduled only on that nodegroup. This became cumbersome as the number of managed nodegroups needed to support various CPU-to-memory ratios grew.
- Operators had to choose whether to manage EC2 spot instances at the nodegroup level or the ASG level. Place Exchange elected to have separate and distinct nodegroups and ASGs that were either entirely on-demand or entirely spot, and to balance the two using the Priority Based expander in CA.
Despite configuring the Priority Based expander to prefer spot instances and fall back to on-demand instances when necessary, Place Exchange faced challenges with the CA. CA’s limitation of working only with same-sized instances within an ASG led to occasional disruptions across Availability Zones, causing unexpected downtime. Following an incident where all available spot instances were interrupted and CA responded slowly to remediate the issue, Place Exchange decided to discontinue the use of spot instances altogether.
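The prefer-spot, fall-back-to-on-demand behavior described above can be sketched in a few lines of Python. This is a simplified model of how a priority-based expander chooses a node group, not Place Exchange’s actual configuration: the node group names and regex patterns are hypothetical. In a real cluster the priorities are supplied to CA through a ConfigMap rather than Python, with a higher number meaning a higher priority.

```python
import re

# Hypothetical priorities: a higher number wins, following the priority
# expander's convention. Patterns are regexes matched against group names.
PRIORITIES = {
    50: [r".*-spot-.*"],
    10: [r".*-on-demand-.*"],
}


def pick_node_group(candidates, priorities=PRIORITIES):
    """Return the candidate group matching the highest-priority pattern.

    `candidates` are the node groups that could scale up right now.
    Returns None when no candidate matches any pattern.
    """
    for priority in sorted(priorities, reverse=True):
        matches = [
            name
            for name in candidates
            if any(re.search(pattern, name) for pattern in priorities[priority])
        ]
        if matches:
            # Deterministic tie-break to keep this sketch easy to follow.
            return sorted(matches)[0]
    return None


# Hypothetical node group names.
print(pick_node_group(["ads-on-demand-c5", "ads-spot-c5"]))  # spot is preferred
print(pick_node_group(["ads-on-demand-c5"]))  # only on-demand can scale up
```

The model covers only the selection step. It does not capture how long the autoscaler takes to notice that spot capacity has disappeared, which is the part that caused trouble in practice.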
Solution
Introducing Karpenter
Place Exchange deployed Karpenter and retired the Cluster Autoscaler. Karpenter was designed to relieve some of the pain points with the CA by directly provisioning instances that matched the needs of unscheduled pods pending in the cluster.
Karpenter can manage the spot capacity in both of Place Exchange’s nodepools, while allowing application owners to opt out by using labels. This flexibility, coupled with Karpenter’s ability to draw on large pools of spot instance types rather than a narrow instance family, meant that spot interruptions were rare in practice. With this strategy, Place Exchange is able to run nearly all services entirely on spot instances, which reduced costs by 68%.
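As a rough illustration of this setup, the Python sketch below builds a Karpenter NodePool manifest that allows both spot and on-demand capacity, plus a helper that pins a pod to on-demand nodes. It rests on several assumptions: the `karpenter.sh/v1` API version (older Karpenter releases use a different version), the nodepool and node class names, and the instance categories are all illustrative, and the opt-out is shown as a node selector on the `karpenter.sh/capacity-type` label, which may differ from the labels Place Exchange uses.

```python
import json


def spot_and_on_demand_nodepool(name="default", node_class="default"):
    """Build a NodePool manifest that permits spot and on-demand capacity.

    When both capacity types are allowed, Karpenter favors spot.
    All names and values here are illustrative assumptions.
    """
    return {
        "apiVersion": "karpenter.sh/v1",  # assumption: v1 API
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,
                    },
                    "requirements": [
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["spot", "on-demand"],
                        },
                        {
                            # A broad set of instance categories widens the
                            # pool of spot instance types to choose from.
                            "key": "karpenter.k8s.aws/instance-category",
                            "operator": "In",
                            "values": ["c", "m", "r"],
                        },
                    ],
                }
            }
        },
    }


def opt_out_of_spot(pod_spec):
    """Return a copy of pod_spec whose node selector requires on-demand."""
    pinned = dict(pod_spec)
    selector = dict(pinned.get("nodeSelector", {}))
    selector["karpenter.sh/capacity-type"] = "on-demand"
    pinned["nodeSelector"] = selector
    return pinned


# kubectl accepts JSON as well as YAML, so the output can be applied as-is.
print(json.dumps(spot_and_on_demand_nodepool(), indent=2))
```

Keeping the opt-out on the workload side means the nodepool itself stays generic, and only the services that cannot tolerate interruption carry extra configuration.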
Figure 2. EKS Cluster with Karpenter configured to use a mix of x86 EC2 on-demand and spot instance families
Graviton
Place Exchange saw another opportunity for cost optimization by implementing multi-architecture (x86 and ARM) images for all ad server builds. They also conducted several performance tests to ensure there was no regression. Deploying Graviton-powered spot instances yielded even greater cost savings compared to traditional x86 spot instances, enabling Place Exchange to reduce the operating costs of the ad serving application by an additional 20%. The transition to Graviton also resulted in 25% lower p95 request latency compared to the previous x86 instances using the same number of vCPUs.
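A multi-architecture rollout has two halves: building one image tag that covers both architectures, and letting the scheduler place pods on either one. The Python sketch below illustrates both under stated assumptions. The image tag and registry are hypothetical, the build is shown with `docker buildx` although the post does not say which build tool Place Exchange uses, and the scheduling half is shown as a requirement on the standard `kubernetes.io/arch` node label.

```python
import shlex

# amd64 covers x86 instances; arm64 covers Graviton instances.
ARCHES = ("amd64", "arm64")


def buildx_command(image_tag, context="."):
    """Return a docker buildx invocation that builds and pushes one
    image tag containing a variant for each architecture in ARCHES."""
    platforms = ",".join(f"linux/{arch}" for arch in ARCHES)
    return [
        "docker", "buildx", "build",
        "--platform", platforms,
        "--tag", image_tag,
        "--push",
        context,
    ]


def arch_requirement():
    """Return a node requirement that admits both x86 and Graviton nodes."""
    return {
        "key": "kubernetes.io/arch",
        "operator": "In",
        "values": list(ARCHES),
    }


# Hypothetical image tag, for illustration only.
print(shlex.join(buildx_command("registry.example.com/ad-server:1.0")))
print(arch_requirement())
```

Because the container runtime pulls the variant that matches the node it lands on, the same deployment can run on either architecture, which is what makes a side-by-side performance comparison practical before committing to Graviton.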
Figure 3. EKS Cluster with Karpenter & Graviton configured to use both EC2 on-demand and spot instances within both x86 and Graviton families
Conclusion
By strategically blending the use of EC2 Spot and Graviton instances to host the ad serving platform, Place Exchange achieved a 71% reduction in overall costs for the application, while also delivering a 25% decrease in latency per request. This dramatic optimization significantly lowered the day-to-day operating costs of the exchange.
The previous architecture had been more costly to operate and less performant. However, by transitioning to Graviton instances in the spot market, they were able to dramatically improve the cost-to-performance ratio, while also serving requests with lower end-to-end latency.
These principles of continuous iteration and staying abreast of the latest AWS techniques and tools have continued to benefit Place Exchange, allowing them to further enhance the efficiency and capabilities of their ad serving platform over time.
Below are some references that you can read for additional information:
Mixing AWS Graviton with x86 CPUs to optimize cost and resiliency using Amazon EKS
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.