AWS Database Blog

Scale read operations with Amazon Timestream for InfluxDB read replicas

Time-series data workloads often require high read throughput to support real-time alarms and notifications, monitoring dashboards, and analytics visualization tools. In this post, we show how to use Amazon Timestream for InfluxDB read replicas to scale your read operations by adding read replicas while maintaining a single write endpoint. Built in partnership with InfluxData, our read replica add-on offers InfluxDB open source customers the ability to horizontally scale their read capacity.

Challenges with managing time-series data

Modern infrastructure monitoring generates massive amounts of time-series data that often serves two distinct purposes. Organizations managing this data face a common challenge: addressing both immediate operational needs and longer-term analytical requirements with the same dataset. This dual-purpose data management includes:

  • Real-time operational needs – For quick detection of anomalies, instant dashboard updates, and immediate alerting
  • Deep analytical requirements – For resource optimization, capacity planning, and root cause analysis

This combination demands solutions that can efficiently handle both time-sensitive operations and complex historical analysis without compromise. The following diagram illustrates our single-instance implementation.

A diagram showing a single Timestream for InfluxDB instance with a Telegraf fleet writing data and three EC2 instances reading from the Timestream for InfluxDB instance.

A single Timestream for InfluxDB instance handles real-time operational needs and deep analytical requirements at the same time. The instance has competing demands and needs to deal with different types of requests without having a way to prioritize them according to urgency.

Solution overview

Let’s explore how Timestream for InfluxDB read replicas can help you effectively manage different workload patterns while maintaining system performance. For our example workload, we focus on solving this problem by adopting a Timestream for InfluxDB read replica cluster. We start by dividing our query workload into two groups based on their different needs:

  • Operational monitoring – This consists of the following:
    • High-frequency, lightweight queries, usually using functions like last() and max(), with short time ranges
    • Dashboards that refresh every 10–30 seconds, showing fleet-wide health averages over time
    • Alert evaluations that run every few minutes, looking only for values that breach predefined thresholds
    • Basic aggregations and comparisons
    • Required latency of less than 500 milliseconds

An example of an operational monitoring Flux query, querying an Internet of Things (IoT) dataset, is the following:

from(bucket: "iot")
    |> range(start: -1m)
    |> filter(fn: (r) => r["_measurement"] == "readings" and
        r["fleet"] == "East" and r["_field"] == "velocity")
    |> group()
    |> max(column: "_value")

This query returns the maximum speed of all trucks in the East fleet over the last minute.

  • Analytical processing – This consists of the following:
    • Complex correlation analysis
    • Machine learning (ML) model training
    • Seasonal pattern detection
    • Capacity planning calculations
    • Root cause analysis queries
    • Acceptable latency is from seconds to minutes

An example of an analytical processing Flux query is the following:

from(bucket: "iot")
  |> range(start: 2025-07-09T03:00:00.000Z, stop: 2025-07-09T04:00:00.000Z)
  |> filter(fn: (r) => r["_measurement"] == "diagnostics"
    and (r["_field"] == "current_load" or r["_field"] == "fuel_state"))
  |> aggregateWindow(every: 5m, fn: mean, createEmpty: true)
  |> derivative(unit: 5m) |> group(columns: ["name"])
  |> movingAverage(n: 25) |> group(columns: ["_time"])
  |> sort(columns: ["_time"], desc: false)
  |> limit(n: 10000)

The query analyzes the rate of change in truck load and fuel consumption over a 1-hour period. Due to the larger time range, use of movingAverage, and small aggregate window duration, this query takes longer to run and is considered CPU intensive.

Our solution implements a workload-optimized architecture, as shown in the following diagram.

A diagram showing a read replica cluster with a Telegraf fleet and an EC2 instance writing to the cluster's primary node while two EC2 instances read from the cluster's replica node.

The read replica configuration is as follows:

  • Primary node (operational):
    • Handles metric ingestion
    • Kept for quick, simple queries
    • Dedicated to real-time dashboards and alerting
    • Used for tasks that require minimal latency
    • Data sources include Telegraf and custom agents
    • Typical ingest rate of 100,000 metrics per second
  • Replica node (analytical):
    • Used for complex queries
    • Handles ML training workloads
    • Dedicated to queries that need higher memory allocation

Read replica nodes use the same hardware specifications as Timestream for InfluxDB instances. For more information, see Hardware specifications for DB instance classes.
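
To make this split concrete, the following minimal Python sketch (using the influxdb-client library) routes the two workload types to different nodes: operational queries to the primary and analytical queries to the read replica. The endpoint URLs, token, and organization are placeholders for illustration; substitute the values for your own cluster.

from influxdb_client import InfluxDBClient

# Placeholder connection details -- substitute your cluster's endpoints and credentials.
PRIMARY_URL = "https://primary.example.com:8086"   # primary (read/write) node
REPLICA_URL = "https://replica.example.com:8086"   # read replica (read-only) node
TOKEN = "YOUR_TOKEN"
ORG = "your-org"

# Operational query (low latency) -- sent to the primary node.
operational_flux = '''
from(bucket: "iot")
    |> range(start: -1m)
    |> filter(fn: (r) => r["_measurement"] == "readings" and
        r["fleet"] == "East" and r["_field"] == "velocity")
    |> group()
    |> max(column: "_value")
'''

# Analytical query (CPU intensive) -- sent to the replica node
# (abbreviated; see the full query earlier in this post).
analytical_flux = '''
from(bucket: "iot")
  |> range(start: 2025-07-09T03:00:00.000Z, stop: 2025-07-09T04:00:00.000Z)
  |> filter(fn: (r) => r["_measurement"] == "diagnostics"
    and (r["_field"] == "current_load" or r["_field"] == "fuel_state"))
  |> aggregateWindow(every: 5m, fn: mean, createEmpty: true)
'''

with InfluxDBClient(url=PRIMARY_URL, token=TOKEN, org=ORG) as primary:
    operational_tables = primary.query_api().query(operational_flux)

with InfluxDBClient(url=REPLICA_URL, token=TOKEN, org=ORG) as replica:
    analytical_tables = replica.query_api().query(analytical_flux)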

The solution in this post uses the following AWS resources:

  • Amazon Timestream for InfluxDB (a read replica cluster)
  • Amazon EC2 instances running the Telegraf fleet and query clients

You will incur charges for the AWS resources used in the solution. See Amazon Timestream pricing and Amazon EC2 pricing for more information.

Understanding read replicas

Read replicas are read-only copies of your primary database that help distribute read operations across multiple database instances. The following diagram shows the architecture of a single Timestream for InfluxDB instance (left) compared to the architecture of a Timestream for InfluxDB read-replica cluster (right).

The architecture of a single Timestream for InfluxDB side-by-side with the architecture of a Timestream for InfluxDB read replica cluster.

Some time-series workloads require many more read operations than write operations. Read replicas offer an ideal solution for this imbalance. Each read replica maintains a copy of your data that updates automatically. Your application can distribute read requests across multiple replicas instead of overloading a single instance.

The replication process works behind the scenes through asynchronous replication. When your application writes data, the primary instance first saves and confirms the data locally. Only after this confirmation does the primary instance begin copying the data to your read replicas. The primary instance doesn’t wait for replicas to acknowledge receipt before responding to your application, allowing for faster write operations. After data is committed by the writer DB instance, it is replicated to the read replica instance almost instantaneously. For more information about replica lag, refer to Replica lag in read replica clusters.
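
Because replication is asynchronous, a point written to the primary might not be visible on the replica immediately. The following minimal Python sketch (influxdb-client library, placeholder endpoints and credentials) writes a point to the primary and then briefly polls the replica before giving up, which is one simple way to tolerate replica lag in read-after-write scenarios:

import time
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

PRIMARY_URL = "https://primary.example.com:8086"   # placeholder primary endpoint
REPLICA_URL = "https://replica.example.com:8086"   # placeholder replica endpoint
TOKEN, ORG, BUCKET = "YOUR_TOKEN", "your-org", "iot"

# Write a point to the primary node, the only writable node in the cluster.
with InfluxDBClient(url=PRIMARY_URL, token=TOKEN, org=ORG) as primary:
    point = Point("readings").tag("fleet", "East").field("velocity", 88.0)
    primary.write_api(write_options=SYNCHRONOUS).write(bucket=BUCKET, record=point)

# Poll the replica; the point becomes visible once asynchronous replication catches up.
flux = '''
from(bucket: "iot")
    |> range(start: -1m)
    |> filter(fn: (r) => r["_measurement"] == "readings" and r["_field"] == "velocity")
    |> last()
'''
with InfluxDBClient(url=REPLICA_URL, token=TOKEN, org=ORG) as replica:
    for _ in range(5):
        if replica.query_api().query(flux):   # non-empty result means the write has replicated
            break
        time.sleep(1)                          # give the replica time to catch up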

With this launch, we are also introducing the concept of DB clusters. A DB cluster consists of one or more DB instances that are managed as a single entity. At launch, Timestream for InfluxDB only supports two-node clusters, comprising a primary node (write and read) and a secondary read replica (read-only node).

Changes that you want to introduce into your cluster configuration can affect all instances that are part of the cluster. Additionally, the only instance that can modify configurations or data is the primary instance. This means that you must use your primary for the following actions:

  • Write data
  • Create buckets, orgs, users, and access tokens
  • Configure dashboards, tasks, alarms, and notifications
  • Perform other actions that might change data

The focus of the read replicas is to extend your read capacity, so you can maintain write performance while expanding your read requirements.
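
As a brief illustration, the following Python sketch (influxdb-client library, placeholder endpoint and credentials) performs one of these administrative actions, creating a bucket, against the primary node, which is the only node that accepts writes and configuration changes:

from influxdb_client import InfluxDBClient

# Placeholder connection details for the cluster's primary (writable) node.
PRIMARY_URL = "https://primary.example.com:8086"
TOKEN, ORG = "YOUR_TOKEN", "your-org"

# Administrative changes such as creating a bucket must target the primary node;
# the read replica only serves queries.
with InfluxDBClient(url=PRIMARY_URL, token=TOKEN, org=ORG) as primary:
    primary.buckets_api().create_bucket(bucket_name="iot-archive", org=ORG)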

Use cases in which a read replica cluster can better serve your needs

Timestream for InfluxDB read replicas are recommended for the following scenarios:

  • Increasing read requirements – Your read requirements are growing faster than your write throughput and have started to impact the speed at which you can ingest data. By adding a secondary read replica, you can offload your queries to the replica so that the write performance of your primary remains unaffected.
  • Dashboarding and analytics queries are impacting performance – You have a set of dashboarding or analytics queries that have started to impact query performance on the most critical alarm and notification queries. You can implement read replicas to fix this by keeping only your most critical alarm- and notification-related read workloads on the primary and offloading the dashboarding and analytical queries to your secondary.
  • A large database serving many customers – Your database serves a large number of customers that send impromptu reads to your system. Imagine you’re a company tracking the performance of home-based solar cells and your customers can access dashboards from their apps to see their panel performance, payback costs, and usage-related savings. In this case, you can use read replicas to balance the query load, separating customers by region, product, or support tier.
  • Minimal downtime requirement – Your business requires the shortest possible downtime when issues affect your primary instance. The automated failover capability quickly detects problems with your primary instance and promotes a read replica to become the new primary without your intervention. This fast, automatic process minimizes disruption to your applications and helps maintain service availability even during unexpected events. When monitoring critical infrastructure, high availability is an important characteristic of a time-series data store. We also offer Multi-AZ instances that are synchronously replicated, but upon failure of your primary instance, the secondary InfluxDB engine must first start up, because it has been in standby mode until that point. Although this approach offers the highest durability for your data, the time required for your instance to be available again depends on many workload- and engine-related aspects, such as cardinality, the size of the database, and the rate of ingestion before the crash. These factors can extend downtime while your engine restarts. With Timestream for InfluxDB read replicas, although the asynchronous replica doesn’t offer the same level of durability as its Multi-AZ counterpart, it can provide faster failover regardless of the workload characteristics.

In summary, Timestream for InfluxDB read replicas can help you achieve your specific query or availability goals by offloading queries to the replica node to maintain performance of the primary node, balancing query load between the nodes according to workload characteristics, and providing faster failover regardless of workload compared to Multi-AZ instances.

How Timestream for InfluxDB read replicas work

Timestream for InfluxDB read replicas work in the following way:

  • When you need to write data to your Timestream for InfluxDB database, you can only write to the primary node in your cluster. A successful response from the API means that your write was accepted, persisted to disk, and is ready to be read.
  • As records are persisted to the Write Ahead Log (WAL), the engine uses an optimized data-transfer method called WAL shipping to send the new entries to the replica.
  • You can track the ReplicaLag metric in Amazon CloudWatch and create alarms so that you’re notified when replication lag exceeds your threshold, as shown in the sketch following this list.
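
As a sketch of the last point, the following Python (boto3) example creates a CloudWatch alarm on the ReplicaLag metric. The namespace, dimension name, and threshold shown here are assumptions for illustration only; verify the metric’s actual namespace and dimensions for your cluster in the CloudWatch console before using them.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average replica lag stays above 60 seconds for 5 consecutive minutes.
# The namespace and dimension name below are assumptions -- confirm them in the
# CloudWatch console for your Timestream for InfluxDB cluster.
cloudwatch.put_metric_alarm(
    AlarmName="timestream-influxdb-replica-lag",
    Namespace="AWS/Timestream/InfluxDB",                                     # assumed namespace
    MetricName="ReplicaLag",
    Dimensions=[{"Name": "DbClusterIdentifier", "Value": "my-cluster-id"}],  # assumed dimension
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=60.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)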

Read replica cluster performance

In this section, we discuss the performance you can expect from your read replica cluster in a few real-world scenarios.

Queries per second with simultaneous reads and writes

Read replica clusters can help you increase the query performance on your Timestream for InfluxDB workloads. Compared to single instances, clusters help you distribute your query load across your nodes, resulting in faster response times and improved throughput. This performance boost is particularly noticeable for high-concurrency scenarios and large-scale data analysis tasks. We compared the performance of a single db.influx.xlarge instance against a db.influx.xlarge read-replica cluster under the same workload. Five proxy servers aggregated IoT data from 100,000 trucks, writing data every 5 seconds in batches of 5,000 line protocol points. At the same time, 25 clients queried and retrieved 15-minute aggregates for different metrics over a 15-minute interval. For the single instance, all users wrote and queried from the same instance. For the read-replica cluster, all write requests were sent to the primary node, and read requests were divided between the primary and replica nodes.

The single instance averaged 59.46 queries per second, with all 25 clients sending query requests. The following diagram shows the number of requests sent to the single instance over time, as generated by the Goose testing framework.

A chart showing the number of requests sent to a single Timestream for InfluxDB instance over time.

The read replica cluster averaged 105.52 queries per second, with 25 querying clients split between the primary and read-replica nodes. The following diagram shows the number of requests sent to the read replica cluster over time.

A diagram showing requests sent to a Timestream for InfluxDB read replica cluster over time.

Compared to single-instance performance, this represents a 77.5% increase in read capacity.

Query CPU usage

Using the same dataset as the previous example, the following query analyzes the rate of change in truck load and fuel consumption over a 1-hour period:

from(bucket: "iot")
  |> range(start: 2025-07-09T03:00:00.000Z, stop: 2025-07-09T04:00:00.000Z)
  |> filter(fn: (r) => r["_measurement"] == "diagnostics"
    and (r["_field"] == "current_load" or r["_field"] == "fuel_state"))
  |> aggregateWindow(every: 5m, fn: mean, createEmpty: true)
  |> derivative(unit: 5m) |> group(columns: ["name"])
  |> movingAverage(n: 25) |> group(columns: ["_time"])
  |> sort(columns: ["_time"], desc: false)
  |> limit(n: 10000)

We compared the CPU usage of a single db.influx.xlarge instance against a db.influx.xlarge read-replica cluster under the same workload. Over 10 minutes, four clients sent the preceding CPU-intensive query as many times as possible. The single instance reached 100% CPU usage. The following screenshot shows the instance’s CPU utilization over a 30-minute period, as shown on the Timestream console. Note that the periods of 0% CPU usage are periods before and after testing.

A diagram showing the CPU usage of a single Timestream for InfluxDB instance over time.

The read-replica cluster, with the workload distributed evenly across instances, reached 46.33% CPU usage on its primary node. The following screenshot shows the primary node’s CPU utilization over a 30-minute period.

A diagram showing the CPU usage of a Timestream for InfluxDB read replica cluster's primary node over time.

The cluster’s replica node reached 54.41% CPU usage. The following screenshot shows the replica node’s CPU utilization over a 30-minute period.

A diagram showing the CPU usage of a Timestream for InfluxDB read replica cluster's secondary node over time.

The cluster handled a greater overall workload without overloading either instance, leaving capacity to handle additional requests.

Conclusion

With the Timestream for InfluxDB read replica feature, you can optimize your cluster for high query performance workloads. Where a standard Timestream for InfluxDB instance is subject to the workload limitations of a single instance, the read replica feature supports additional load on the extra read node without affecting primary node performance.

Get started by deploying a Timestream for InfluxDB read replica cluster for your own use case, and share your feedback in the comments.


About the authors

Victor Servin

Victor is a Senior Product Manager for the Amazon Timestream team at AWS, bringing over 18 years of experience leading product and engineering teams in the Telco vertical. With an additional 5 years of expertise in supporting startups with product-led growth strategies and scalable architecture, Victor’s data-driven approach is perfectly suited to drive the adoption of analytical products like Timestream. His extensive experience and commitment to customer success helps him support customers to efficiently achieve their goals.

Trevor Bonas

Trevor is a Senior Software Developer at Improving. Trevor has experience with time-series databases and ODBC driver development. In his free time, Trevor writes fiction and dabbles in game development.

Forest Vey

Forest Vey is a Team Lead at Improving. He is experienced working with time-series and serverless technologies. With a growing passion for cloud development and embedded systems applications, he has gained proficiency in designing software solutions from bare metal to the provisioning layer. When not immersed in learning about software development, he enjoys rock climbing and hiking with his friends.