2025

Delhivery achieves 160 ms latency for high-precision geocoding using Amazon EKS

Learn how Delhivery improved geocoding in logistics across India using generative AI models deployed on Amazon EKS.

Benefits

~80% reduction in model-serving costs

160 ms latency at 30 concurrent requests

8,000 requests handled per minute

Overview

To improve address matching with greater speed, cost efficiency, and control, Delhivery, a logistics provider in India, turned to generative AI. The company implemented a fine-tuned large language model (LLM) on Amazon Web Services (AWS) to support high-volume geocoding across its operations. The system now processes up to 8,000 requests per minute at a latency of 160 milliseconds, meeting internal benchmarks for real-time performance. As a result, Delhivery has cut model-serving costs by approximately 80 percent and accelerated prototyping cycles from two days to under six hours.


About Delhivery

Delhivery is a technology-driven logistics and supply chain services provider in India, offering transportation, warehousing, freight, and fulfillment solutions to businesses nationwide.

Opportunity | Using generative AI for high-precision geocoding

Operating one of India’s largest logistics networks, Delhivery helps clients move shipments nationwide through its extensive network of local couriers. A critical part of its operations is a proprietary navigational stack that powers high-precision geocoding of pickup and drop-off addresses—minimizing location errors and ensuring fast, accurate last-mile deliveries. Shyam Mukherjee, senior data scientist at Delhivery, says, “Geocoding with minimal error radius is at the core of what we do. Our goal is to map locations precisely, so shipments reach the right place without delay.”

To enhance address matching, the Delhivery team initially tested serverless LLMs from third-party providers. But these services came with limitations—rate caps of 2,000 requests per minute, or higher costs for provisioned access that exceeded actual usage needs. “We needed a solution that could handle up to 8,000 requests per minute and still keep costs in check,” explains Mukherjee. In addition, traditional machine learning models lacked contextual understanding and required long training cycles, slowing experimentation. These challenges made it difficult to scale efficiently, especially during demand spikes. Delhivery needed a faster, more flexible, and cost-effective solution to support high-volume inference and accelerate prototyping across its logistics stack.

Solution | Scaling a fine-tuned LLM using Amazon EKS

Delhivery began exploring custom LLMs after encountering rate limits and high costs with third-party API-based services. To address these constraints, the team evaluated the feasibility of self-hosted models that could offer greater flexibility and performance. After testing various LLMs, Delhivery selected a version of the open-source Llama 3.2 1B model that delivered the performance it needed. The team fine-tuned the model externally and began designing a solution tailored to its high-volume address-matching use case.

To accelerate the path to production, Delhivery engaged the AWS Prototyping and Cloud Engineering (PACE) team, which supported experimentation with different inference optimization techniques. This included identifying suitable instance types, optimizing model serving with NVIDIA Triton Inference Server, and packaging the deployment for Amazon Elastic Kubernetes Service (Amazon EKS), which was already part of Delhivery’s tech stack. “We started with basic benchmarking and ended up with a full deployment package we could plug into our existing stack,” says Mukherjee. “That made the transition to production much faster.”
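Triton Inference Server exposes models over the standard KServe-v2 HTTP protocol, so a client submits an address for inference as a plain JSON POST to `/v2/models/<name>/infer`. The sketch below shows what such a request body might look like; the input tensor name, shape, and example address are illustrative assumptions, not Delhivery's actual configuration:

```python
import json

# Hedged sketch of a KServe-v2 inference request body for Triton.
# "text_input" is an assumed input tensor name; the real model's
# input signature would come from its Triton model configuration.
def build_infer_request(address: str, input_name: str = "text_input") -> dict:
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [1, 1],          # one request, one string
                "datatype": "BYTES",      # Triton's type for text payloads
                "data": [address],
            }
        ]
    }

payload = build_infer_request("MG Road, Bengaluru")
body = json.dumps(payload)  # ready to POST to /v2/models/<name>/infer
```

The same body shape works for batched requests by widening the tensor's first dimension and appending more strings to `data`.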

To support production deployment, the team used Amazon EKS with Amazon EC2 g5.xlarge instances as cluster nodes. These instances, equipped with NVIDIA A10G GPUs, were chosen to deliver fast inference response times using the vLLM serving framework. With auto scaling in place, the Amazon EKS cluster seamlessly scales the deployment up or down based on demand.
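A deployment of this shape can be expressed as a Kubernetes manifest applied to the EKS cluster. The fragment below is a minimal sketch only: the container image, model path, and resource names are illustrative assumptions, not Delhivery's actual configuration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: geocoder-llm                  # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: geocoder-llm
  template:
    metadata:
      labels:
        app: geocoder-llm
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: g5.xlarge    # A10G GPU nodes
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest               # assumed image
          args: ["--model", "/models/llama-3.2-1b-ft"] # assumed model path
          resources:
            limits:
              nvidia.com/gpu: 1                        # one A10G per pod
```

In practice this would be paired with a Horizontal Pod Autoscaler (or a node autoscaler such as Karpenter) to provide the demand-based scaling described above.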

Outcome | Accelerating AI innovation at scale with 80% cost savings

Following a successful transition to production, Delhivery validated the performance of its fine-tuned LLM setup in its operational environment. The deployment consistently met internal targets for responsiveness and scale, achieving a latency of 160 milliseconds at a concurrency of 30. With support for up to 8,000 requests per minute, the system delivers the speed and throughput required for Delhivery’s high-volume geocoding workloads. Additionally, Delhivery reduced its monthly model-serving costs by approximately 80 percent by optimizing GPU utilization and eliminating the overhead of third-party API provisioning. “By maximizing throughput per node, we’ve brought infrastructure costs down significantly,” Mukherjee explains.
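The reported figures are internally consistent: at 160 ms per request with 30 requests in flight, the theoretical throughput ceiling comfortably exceeds the 8,000-requests-per-minute target. A quick back-of-envelope check:

```python
# Throughput ceiling implied by the case study's published figures.
latency_ms = 160        # observed per-request latency
concurrency = 30        # concurrent in-flight requests
target_rpm = 8_000      # required requests per minute

# Each in-flight slot can complete 60,000 / 160 = 375 requests per minute,
# so 30 slots give a ceiling of 30 * 375 = 11,250 requests per minute.
requests_per_slot_per_min = 60_000 / latency_ms
ceiling_rpm = concurrency * requests_per_slot_per_min

print(ceiling_rpm)  # 11250.0
assert ceiling_rpm >= target_rpm
```

The roughly 40 percent of headroom above the target suggests the latency and throughput figures can be met simultaneously by the same deployment, assuming load is spread evenly across in-flight slots.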

Engaging with the AWS PACE team helped Delhivery turn an early-stage prototype into a scalable, production-grade deployment—unlocking the full potential of its custom model architecture. Beyond performance and cost improvements, the deployment has accelerated innovation. Prototyping cycles that once took two days can now be completed in under six hours. Generative AI now supports the majority of internal services at Delhivery, and the team is building toward a unified model that can handle multimodal inputs, including text, images, and video. “This deployment gives us a cost-effective way to run generative AI across our operations and the flexibility to evolve as our needs grow,” says Mukherjee, who sees the work as a foundational step in transforming his company into an AI-first logistics organization.

This deployment gives us a cost-effective alternative for running generative AI across our operations and the flexibility to evolve it as our needs grow.

Shyam Mukherjee

Senior Data Scientist at Delhivery
