AWS for Industries

On-device SLMs with agentic orchestration for hyper-personalized customer experiences in telecom

In Opportunities for telecoms with small language models (SLMs), we explored how telecom operators can use SLMs deployed on Customer Premises Equipment (CPE) and Internet of Things (IoT) devices to enable autonomous environments and handle routine information requests. Subsequently, in Distributed inference with collaborative AI agents for telco-powered smart-X, we explored the spectrum of distributed computing locations available to telecoms, emphasizing the need to push inference to edge locations, use the Amazon Web Services (AWS) Cloud for aggregated learning, and employ artificial intelligence (AI) agents to orchestrate the overall solution. In this post, we build on these foundational concepts and showcase a real-world implementation developed in partnership with Intel and MongoDB, demonstrating how smartphones and residential gateways can perform on-device and far-edge AI inference. This solution enables telecom operators to use their last-mile connectivity to handle customer queries locally and deliver hyper-personalized experiences at scale, efficiently and cost-effectively.

Introduction

Nearly half of telecom contact center calls today involve repetitive, routine questions—such as billing inquiries and network troubleshooting—which rarely require human intervention. Handling these queries through centralized AWS Cloud services or human agents introduces unnecessary latency, high bandwidth costs, and operational redundancy, negatively impacting customer experiences and driving up expenses. Telecom providers already possess powerful, underused computing resources within customers’ homes and devices (such as residential gateways, set-top boxes, and smartphones), which could efficiently handle these routine queries locally. This sets the stage for a fundamental transformation where the traditional customer service model gives way to intelligent, hyper-personalized customer experiences delivered instantly through on-device AI.

Solution overview

Our solution deploys SLMs on customer premises equipment (the residential gateway, router, or set-top box at the far edge) and on user smartphones (the device edge). Instead of every query going to the contact center or the AWS Cloud, most customer questions can be answered right where they originate: on the device itself. This is achieved through a combination of edge inference, agentic orchestration, and cloud fallback for complex cases.

Figure 1: AI inference on a residential gateway (far-edge), and a smartphone (device-edge)

You can watch an overview of the solution in this video.

Networking architecture

Each customer’s residential gateway or user device in the field establishes a secure VPN tunnel back to the AWS Cloud (see the following figure). All communication between the far edge agents (on gateways and smartphones) and AWS services (the Amazon Bedrock agent and MongoDB) travels through encrypted channels. This streamlines network configuration: the far edge device is effectively on a virtual private network with the AWS Cloud, eliminating complex firewall traversal. We use the AWS IoT Core device gateway (MQTT over TLS) for messaging, and a VPN client on the gateway that connects to the cloud. As a result, queries escalated to the AWS Cloud, model updates, and data sync operations all traverse an encrypted tunnel rather than the public internet.

The far edge stack on the gateway and user devices comprises the CPU, the ONNX Runtime, and the AWS IoT Greengrass runtime. AWS IoT Greengrass handles local messaging, AWS Lambda functions, and deployment of our SLM component. The on-device SLM runs within this AWS IoT Greengrass environment, as a container or native binary, using Intel’s OpenVINO with the ONNX Runtime. A local database stores recent data and summaries. All edge devices are registered in AWS IoT Core, which allows remote monitoring and updates. Overall, the networking architecture provides secure, low-latency connectivity between millions of edge devices and the operator’s AWS Cloud, while keeping most inference traffic off the network.

Figure 2: End-to-end networking architecture for residential gateways
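To make the messaging piece concrete, the following is a minimal sketch of how a gateway could establish its mutually authenticated MQTT-over-TLS session with AWS IoT Core and publish an escalation event, using the AWS IoT Device SDK v2 for Python. The endpoint, certificate paths, client ID, topic, and payload are placeholders for illustration, not values from our deployment.

import json

from awscrt import mqtt
from awsiot import mqtt_connection_builder

# Placeholder endpoint, credentials, and identifiers for illustration only.
connection = mqtt_connection_builder.mtls_from_path(
    endpoint="example-ats.iot.us-east-1.amazonaws.com",
    cert_filepath="/greengrass/certs/device.pem.crt",
    pri_key_filepath="/greengrass/certs/private.pem.key",
    ca_filepath="/greengrass/certs/AmazonRootCA1.pem",
    client_id="residential-gateway-0001",
)
connection.connect().result()  # blocks until the TLS session is established

# Publish an escalation event for the cloud agent to pick up.
connection.publish(
    topic="gateways/0001/escalations",
    payload=json.dumps({"customerId": "C-42", "issue": "billing_discrepancy"}),
    qos=mqtt.QoS.AT_LEAST_ONCE,
)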

Technical implementation

The following figure shows user query routing across three reference use cases. A customer’s query is initially processed by a device agent on their smartphone. For Wi-Fi troubleshooting on a home network, the agent invokes the Wi-Fi Debugger Tool on the gateway, performing far-edge inference and returning the response locally. For routine queries such as billing explanations, the device agent responds directly using on-device inference. For billing discrepancies or other complex issues, the query is escalated to the AWS Cloud. Routine queries can also be addressed through voice channels (for example, Alexa), while more complex interactions are routed to Amazon Connect for human assistance. Throughout this process, the system keeps data synchronized between the AWS Cloud and the far edge, maintaining context-aware responses for the customer.

Figure 3: Multi-agent query routing in action

For a recorded demonstration of the customer experience, see this video.

Model deployment and inference using Intel’s OpenVINO

We packaged Meta’s Llama 3.2 1B onto a simulated home gateway and a smartphone. The residential gateway is simulated with an OnLogic Helix 401, which has a consumer-grade Intel Core i5 CPU; the smartphone has 8 GB of RAM and a Qualcomm mobile chipset. The Llama 3.2 1B model was chosen for its performance and its availability in the ONNX format, which allows compatibility with various device frameworks and machine learning (ML) accelerators. Intel’s OpenVINO toolkit compresses the model weights to INT4, which can shrink model size by eight times with minimal accuracy loss. This makes it possible to deploy, optimize, and accelerate the model for CPU inference, so even a low-power edge processor can run a language model efficiently.
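The following is a minimal sketch of this flow, assuming the optimum-intel and openvino-genai Python packages; the model directory name and prompt are illustrative. The export step compresses the weights to INT4 once (for example, on a build machine), and the resulting model is then loaded for CPU inference on the device.

# One-time export, compressing weights to INT4 with optimum-intel:
#   optimum-cli export openvino --model meta-llama/Llama-3.2-1B-Instruct \
#       --weight-format int4 llama-3.2-1b-int4-ov

import openvino_genai as ov_genai

# Load the INT4 OpenVINO model and run it on the gateway's CPU.
pipe = ov_genai.LLMPipeline("llama-3.2-1b-int4-ov", "CPU")

answer = pipe.generate(
    "Explain in two sentences why a customer's Wi-Fi might be slow on a crowded channel.",
    max_new_tokens=128,
)
print(answer)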

We use AWS IoT Greengrass to deploy and manage language models on far edge devices. AWS IoT Greengrass allows us to package the SLM and its runtime as an IoT edge component and push updates securely to gateways in the field. This provides a scalable way to manage models across millions of customer premises. The heavy tuning and periodic model updates happen in the AWS Cloud, while the inferencing runs on the edge device’s CPU. OpenVINO serves as the local abstraction layer, allowing the AWS IoT Greengrass deployment to be built once and deployed across all edge platforms.
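As a sketch of such a fleet rollout, the snippet below creates an AWS IoT Greengrass deployment targeting a thing group of gateways using boto3; the component name, version, and thing group ARN are hypothetical placeholders rather than values from our implementation.

import boto3

greengrass = boto3.client("greengrassv2")

# Roll out a new SLM component version to every gateway in the thing group.
# The component name, version, and target ARN are illustrative.
greengrass.create_deployment(
    targetArn="arn:aws:iot:us-east-1:123456789012:thinggroup/residential-gateways",
    deploymentName="slm-int4-rollout",
    components={
        "com.example.slm.inference": {"componentVersion": "1.0.1"},
    },
)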

AWS Cloud-to-edge data pipelines

There are two data workflows in place, as shown in the following figure. The first runs on the gateway, where system logs from the router are collected and summarized every hour using Llama 3.2 1B. The batch process takes approximately 1.5 minutes to complete, an acceptable delay because near real-time latency isn’t needed, and the summarization frequency can be adjusted based on service level agreement (SLA) requirements. These summaries are stored in a lightweight local database and serve as a concise history of network performance, which the SLM can reference when diagnosing issues at runtime.
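A minimal sketch of this hourly batch job is shown below, assuming the same OpenVINO pipeline as in the previous section, a syslog file on the gateway, and a local SQLite database; the file paths, table name, and prompt are illustrative.

import sqlite3
import time

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("llama-3.2-1b-int4-ov", "CPU")
db = sqlite3.connect("/var/lib/edge-agent/summaries.db")
db.execute("CREATE TABLE IF NOT EXISTS log_summaries (ts INTEGER, summary TEXT)")

# Summarize only the most recent router logs to keep the prompt small.
with open("/var/log/syslog") as f:
    recent_logs = "".join(f.readlines()[-400:])

summary = pipe.generate(
    "Summarize the key Wi-Fi and WAN events in these router logs in five bullet points:\n"
    + recent_logs,
    max_new_tokens=256,
)

db.execute("INSERT INTO log_summaries VALUES (?, ?)", (int(time.time()), summary))
db.commit()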

The second workflow runs in the AWS Cloud, where customer data from various sources is aggregated to create a comprehensive profile. After each customer interaction with the operator, a summary is generated using a large language model (LLM) through Amazon Bedrock and stored in MongoDB along with its vector embeddings and reranking models. These summaries are then synced to edge devices, such as gateways and smartphones, so that on-device SLMs maintain current context without overloading the network. Together, these workflows give Amazon Bedrock a ubiquitous operational data layer (ODL) across the various data stores and silos for inferencing, enabled through deep integration with MongoDB data workflows.

Figure 4: AWS Cloud and edge workflows
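The cloud-side workflow could look like the following sketch, which summarizes an interaction with a Bedrock-hosted LLM through the Converse API, embeds the summary with a Titan embedding model, and stores both in MongoDB; the model IDs, connection string, collection names, and sample data are illustrative assumptions, not the exact configuration of our demo.

import json

import boto3
from pymongo import MongoClient

bedrock = boto3.client("bedrock-runtime")
transcript = "..."        # the full customer interaction transcript (placeholder)
customer_id = "C-42"      # illustrative customer identifier

# 1. Summarize the interaction with an LLM hosted on Amazon Bedrock.
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user",
               "content": [{"text": "Summarize this support interaction:\n" + transcript}]}],
)
summary = response["output"]["message"]["content"][0]["text"]

# 2. Create a vector embedding of the summary for retrieval.
embedding_response = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps({"inputText": summary}),
)
embedding = json.loads(embedding_response["body"].read())["embedding"]

# 3. Store the summary and its embedding in MongoDB for sync to edge devices.
collection = MongoClient("mongodb+srv://...")["telco"]["customer_summaries"]
collection.insert_one(
    {"customerId": customer_id, "summary": summary, "embedding": embedding}
)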

While not part of the demo, on-device data synchronization with MongoDB Atlas on AWS can use Ditto’s Edge Sync Platform combined with the Ditto MongoDB Connector to provide robust bidirectional synchronization. Smartphones can run the operator’s application embedded with Ditto’s Small Peer SDK, creating a self-organizing peer-to-peer (P2P) mesh network capable of real-time data syncing over Bluetooth Low Energy, P2P Wi-Fi, local area networks, and cellular connections. These edge devices can synchronize data locally using conflict-free replicated data types (CRDTs), efficiently propagating only data changes (deltas) across the mesh network. When AWS Cloud connectivity is available, Ditto’s Big Peer middleware can forward these deltas to MongoDB Atlas hosted on AWS, maintaining causal consistency through MongoDB change streams. This distributed architecture eliminates single points of failure, providing reliable, low-latency, and resilient synchronization suitable for mobile and edge applications, as shown in the following figure.

Figure 5: Data synchronization architecture for MongoDB Atlas on AWS and Ditto Big Peer

This AWS Cloud-to-edge data sync keeps everything consistent: all edge agents and cloud agents operate from a single source of truth for customer context. Doing the heavy data processing in batch and sharing concise summaries minimizes the real-time computation needed at the edge. As a result, the edge SLMs become contextual experts about the customer and can give answers that feel personalized without querying the AWS Cloud for those details every time.
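One way to drive the cloud-to-edge propagation of new summaries is a MongoDB change stream on the summaries collection, as sketched below; the connection string, collection, and push_to_edge hook are illustrative placeholders for whatever delivery mechanism (IoT messaging, Ditto Big Peer, and so on) the operator uses.

from pymongo import MongoClient

collection = MongoClient("mongodb+srv://...")["telco"]["customer_summaries"]

def push_to_edge(customer_id, summary):
    """Illustrative hook: forward the new summary to the customer's gateway or smartphone."""
    print(f"sync {customer_id}: {summary[:60]}...")

# Watch for newly inserted summaries and forward only those deltas to the edge.
with collection.watch([{"$match": {"operationType": "insert"}}]) as stream:
    for change in stream:
        doc = change["fullDocument"]
        push_to_edge(doc["customerId"], doc["summary"])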

Agentic architecture

The heart of the system is a multi-agent orchestration layer (see the following figure) that spans the edge and the AWS Cloud. A Device Agent on the customer’s smartphone is the primary orchestrator, interfacing with various tools (Wi-Fi debugging, billing explanation) to handle routine tasks and collaborating with a Cloud Agent for tasks that need cloud resources. The Planning module decides how to answer a query (which tool or agent to invoke), and the Action module executes the steps. If a query is beyond the local tools (for example, a complex billing discrepancy), then the Device Agent calls the Cloud Agent to trigger AWS Cloud workflows. This agentic framework allows dynamic routing of user requests between on-device and AWS Cloud inference.

Figure 6: Multi-agent orchestration

This multi-agent approach is policy-driven and context-aware. Each agent (edge or AWS Cloud) has a domain of expertise and a set of tools. A high-level policy (the Planning module) uses rules and AI prompts to decide the route: if the query topic is Wi-Fi and the device is connected to the router, use the Wi-Fi tool; if the query’s complexity score exceeds a threshold, escalate to the cloud. The system essentially implements an expert system in which the most direct resolution path is attempted first. This is reminiscent of how skilled support triages issues: straightforward issues are handled immediately at Tier 1 and more complex ones are passed to Tier 2, except here Tier 1 is an on-device AI and Tier 2 is an LLM in the AWS Cloud. The result is a hyper-personalized experience: each customer’s devices become smart “personal support reps” that know their context, while the AWS Cloud provides accuracy and completeness when needed.
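A simplified sketch of that routing policy is shown below; the keyword classifier, complexity heuristic, and threshold are illustrative stand-ins for the Planning module’s actual rules and prompts.

COMPLEXITY_THRESHOLD = 0.7

def classify_topic(query):
    # Illustrative keyword classifier; the real Planning module can use the SLM itself.
    return "wifi" if "wi-fi" in query.lower() or "wifi" in query.lower() else "billing"

def complexity_score(query):
    # Illustrative heuristic: longer, multi-part questions score as more complex.
    return min(1.0, len(query.split()) / 60 + query.count("?") * 0.2)

def plan(query, context):
    """Decide where to resolve the query: far edge tool, on-device SLM, or cloud agent."""
    topic = classify_topic(query)
    if topic == "wifi" and context.get("connected_to_gateway"):
        return "wifi_debugger_tool"      # far-edge inference on the gateway
    if complexity_score(query) > COMPLEXITY_THRESHOLD:
        return "cloud_agent"             # escalate to the Amazon Bedrock agent
    return "on_device_slm"               # answer locally on the smartphone

print(plan("Why is my Wi-Fi slow in the evenings?", {"connected_to_gateway": True}))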

Future outlook

The home router can transform into a central AI hub for the smart home. The on-board SLM means that the gateway can offer more services beyond telecom support: for example, parental control queries (“What websites did the kids access today?”), smart home voice assistance, or home security alerts could all be handled by the local AI. The gateway can even augment older IoT devices by providing edge computing for them. For example, it could perform local facial recognition for a legacy security camera that couldn’t do it by itself. This creates opportunities for the operator to offer new value-added services using the same on-device AI infrastructure. It also positions the operator’s equipment as an integral part of the customer’s AI ecosystem, potentially increasing stickiness.

Privacy considerations make this edge-based approach particularly compelling. Processing queries and data locally on the router means that sensitive household information never leaves the premises. This architectural choice supports compliance with evolving privacy regulations such as GDPR and CCPA across different regions. The SLM-enabled gateway acts as a privacy-preserving intermediary—it can provide personalized services and insights while keeping customer data within their control. This design inherently supports data sovereignty requirements and gives customers transparency into how their information is used. Operators can use this privacy-first architecture as a key differentiator while rolling out AI-enhanced services.

Adding GPUs to home routers is costly, making GPU-equipped gateways impractical for widespread deployment. To reduce costs while enabling language model inference, we used Intel CPUs alongside the OpenVINO SDK. This setup allows efficient inference without the need for dedicated accelerators. MIPS-based processors, common in embedded networking systems, aren’t suitable for deploying language models. Standard frameworks such as ONNX and AWS IoT Greengrass lack support for MIPS, and while some lightweight C-based libraries exist, they can’t handle the complexity of SLMs. Even containerized solutions face challenges, because most AI frameworks assume x86 or ARM architectures and offer no MIPS support.

A router device typically runs an open software platform such as RDK-B or PrplOS/OpenWRT on top of its hardware. These platforms support deployment of various applications, such as Wi-Fi management, Quality of Experience (QoE) monitoring, and cybersecurity agents, all of which can interact directly with the router. Relying solely on system logs is insufficient for achieving deep observability and automation. To truly enable SLM inference for intelligent automation and closed-loop control, we need structured access to the router’s full data model. This is where industry-standard data models such as TR-069/TR-369 (for provisioning and management) and TR-181 (for detailed device telemetry and operational parameters) become essential. These models expose granular configurations and runtime metrics of the device—such as Wi-Fi performance, device associations, signal strength, channel usage, error rates, and more—which go far beyond what syslogs alone can offer. Processing this structured data with embedded or cloud-assisted SLMs enables real-time diagnostics, proactive issue detection, and autonomous policy adjustments, paving the way for AI-driven router management.
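As a sketch of what that could look like, the example below feeds a handful of TR-181 parameters into the on-device SLM; the parameter names follow the TR-181 data model, while the values, model directory, and prompt are invented for illustration.

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("llama-3.2-1b-int4-ov", "CPU")

# Illustrative snapshot of TR-181 parameters retrieved through TR-069/TR-369.
tr181_snapshot = {
    "Device.WiFi.Radio.1.Channel": 6,
    "Device.WiFi.Radio.1.Stats.ErrorsReceived": 1243,
    "Device.WiFi.SSID.1.Stats.RetransCount": 5210,
    "Device.WiFi.AccessPoint.1.AssociatedDeviceNumberOfEntries": 7,
}

prompt = (
    "You are a home network diagnostics assistant. Given these TR-181 metrics, "
    "explain the most likely cause of slow Wi-Fi and suggest one corrective action:\n"
    + "\n".join(f"{name} = {value}" for name, value in tr181_snapshot.items())
)

print(pipe.generate(prompt, max_new_tokens=160))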

Conclusion

On-device SLMs for hyper-personalized support offer a compelling win-win for telecom operators and customers alike. Offloading inference to the far edge reduces AWS Cloud expenses and network transit costs (data egress), and fewer calls into contact centers mean labor savings as well. It alleviates load on both data centers and contact centers, translating to millions in savings at scale. Customers experience near-instant response times for most queries because the latency of a round trip to the cloud is eliminated. Each response is also tailored with the customer’s actual data (made possible by the synced context), so it feels like speaking with a support rep who knows their history. This level of personalization can improve NPS/CSAT scores and reduce churn caused by unresolved frustrations. Because the models live on the customer’s devices, basic support functions continue even during internet outages or in low-bandwidth conditions.

Using AWS IoT for deployment means this solution can scale to millions of customer endpoints with relative ease; AWS IoT Core device management can handle large fleets of IoT devices and message traffic in the trillions. As more gateways come online, the operator can onboard them with AI capabilities through automated provisioning (zero-touch setup through IoT certificates), and model updates or policy changes are rolled out over-the-air to all devices.

The AWS cloud-managed, distributed approach is highly scalable. It uses the AWS Cloud for coordination while pushing compute to the far edge. In effect, operators gain a massively parallel inference engine distributed across their customer base. Scaling to new use cases (for example, adding a “network outage predictor” agent) might be as simple as a software update pushed to all gateways. Turning far edge devices into AI agents allows an operator to evolve from a service provider into an AI-enriched experience provider—delivering the future of customer experience right at the edge while maintaining the highest standards of data protection.

Intel – APN partner spotlight

Intel is an AWS Competency Partner that shares a passion with AWS for delivering constant innovation. Together, Intel and AWS have developed a variety of resources and technologies for high-performance computing, big data, machine learning, and IoT.

Contact Intel | Practice Overview | AWS Marketplace

Subhash Talluri

Subhash Talluri is a Lead AI/ML solutions architect of the Telecom Industry business unit at AWS. He’s been leading development of innovative AI/ML solutions for Telecom customers and partners worldwide. He brings interdisciplinary expertise in engineering and computer science to help build scalable, secure, and compliant AI/ML solutions through cloud-optimized architectures on AWS.

Alexander White

Alexander White is an Edge AI Solutions Architect at Intel supporting customers and partners in deploying AI at the edge. He focuses on aligning real-world needs with cutting-edge technologies, working closely with organizations like AWS to accelerate innovation and bring intelligent, data-driven solutions to life.

Alex Gledhill

Alex Gledhill leads Intel’s Edge and IoT sales team globally for AWS focused on building innovative technology solutions through strong partnerships. He works across networking, edge, and telecom, with experience in areas like 5G, Wi-Fi, AI at the edge, computer vision, robotics, and IoT.

Awaiz Khan

Awaiz Ahmad Khan is a wireless technology leader with 15+ years of experience in RAN architecture, private 5G, and spectrum innovation across UMTS, LTE, LTE-A, CBRS, and 5G. He has played a key role in developing private 5G solutions for enterprises and industrial applications, leveraging CBRS and shared spectrum models. Awaiz was an active contributor to Win Forum, shaping spectrum-sharing frameworks and interoperability standards. His work in CI/CD automation, AI-driven RAN optimization, and cloud-native deployments has earned him multiple industry awards and patents in wireless technologies. He holds master’s and bachelor’s degrees in Electrical Engineering.

Mohan Baskaran

Mohan Baskaran drives technology adoption by collaborating with a diverse array of partners. His focus includes crafting and executing go-to-market strategies that fuel innovation and growth across front and back office, network transformation, data modernization (with an emphasis on AI/ML, including agentic and generative AI), monetization, media, and wider business application modernization.

Nima Sajadpour

Nima Sajadpour is a Solutions Architect at AWS, specializing in emerging technologies with a focus on beyond connectivity solutions. Working within Telco’s emerging technologies team, he helps customers architect and implement next-generation solutions that extend beyond traditional connectivity paradigms. His expertise spans across IoT, smart spaces, and advanced cloud architectures, where he collaborates with strategic customers to design scalable solutions that bridge the gap between connected devices and intelligent cloud services.

Sneha Das Gupta

Sneha Das Gupta is a Senior Partner Solutions Architect at MongoDB, collaborating with partners on their modernization and AI journey and also focusing on developing production-ready generative AI solutions by using MongoDB's strengths in data management and retrieval.