AWS Big Data Blog

Enhance stability with dedicated cluster manager nodes using Amazon OpenSearch Service

Amazon OpenSearch Service is a managed service that you can use to deploy, operate, and secure OpenSearch clusters at scale in the AWS Cloud. With OpenSearch Service, you can configure clusters with different node types, such as data nodes, dedicated cluster manager nodes, dedicated coordinator nodes, and UltraWarm nodes. When configuring your OpenSearch Service domain, you can combine these node options to manage your cluster's overall stability, performance, and resiliency.

In this post, we show how deploying dedicated cluster manager nodes enhances the stability and reliability of your OpenSearch Service domain.

The benefit of dedicated cluster manager nodes

A dedicated cluster manager node handles the behind-the-scenes work of running an OpenSearch Service cluster, but it doesn't store data or process search requests. In the absence of dedicated cluster manager nodes, OpenSearch Service uses data nodes for cluster management. Combining these responsibilities on the data nodes can impact performance and stability, because data operations (like indexing and searching) compete with critical cluster management tasks for computing resources.

The cluster manager node is responsible for several key tasks: monitoring and tracking all the data nodes in the cluster, knowing how many indexes and shards there are and where they're located, and routing data to the correct places. It also updates and shares the cluster state whenever something changes, like creating an index or adding and removing nodes.

The problem, however, is that when traffic gets heavy, a cluster manager node that also serves data can become overloaded and unresponsive. If this happens, your cluster won't respond to write requests until it elects a new cluster manager, at which point the cycle might repeat itself. You can alleviate this issue by deploying dedicated cluster manager instances; this separation of duties between the manager nodes and the data nodes results in a much more stable cluster.
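You can see which node is currently elected as the cluster manager by querying the `_cat/nodes` API of your domain. The following is a minimal sketch that parses the API's plain-text response; the node names and the sample response are hypothetical, and any HTTP client can be used to fetch the real output.

```python
# Sketch: identify the elected cluster manager from the output of
# GET _cat/nodes?h=name,node.role,cluster_manager
# Each line is "<name> <roles> <is_manager>", where the last column
# is "*" for the elected cluster manager and "-" for every other node.

def find_cluster_manager(cat_nodes_text: str) -> str:
    for line in cat_nodes_text.strip().splitlines():
        name, roles, is_manager = line.split()
        if is_manager == "*":
            return name
    raise RuntimeError("no elected cluster manager found")

# Hypothetical response for a domain with three dedicated manager nodes
# ("m" role) and two data nodes; only one manager is elected at a time.
sample = """\
manager-0 m -
manager-1 m *
manager-2 m -
data-0 dir -
data-1 dir -
"""
print(find_cluster_manager(sample))  # -> manager-1
```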

Calculating the number of dedicated cluster manager nodes

In OpenSearch Service, a single node is elected as the cluster manager from all eligible nodes through a quorum-based voting process, confirming consensus before taking on the responsibility of coordinating cluster-wide operations and maintaining the cluster's state. A quorum is the minimum number of nodes that must agree before the cluster makes important decisions; it keeps your data consistent and your cluster running smoothly. When you use dedicated cluster manager nodes, only those nodes are eligible for election, and OpenSearch Service sets the quorum to half of the nodes, rounded down to the nearest whole number, plus one.

A single dedicated cluster manager node is explicitly prohibited by OpenSearch Service, because you would have no backup in the event of a failure. Three dedicated cluster manager nodes make sure that even if one node fails, the remaining two can still reach a quorum and maintain cluster operations. We recommend three dedicated cluster manager nodes for production use cases.

Multi-AZ with standby is an OpenSearch Service feature designed to deliver four 9s of availability using a third AWS Availability Zone as a standby. When you use Multi-AZ with standby, the service requires three dedicated cluster manager nodes. If you deploy Multi-AZ without standby or Single-AZ, we still recommend three dedicated cluster manager nodes: this provides two backup nodes in the event of one cluster manager node failure, and the necessary quorum (two) to elect a new manager. You can choose either three or five dedicated cluster manager nodes.

Having five dedicated cluster manager nodes works as well as three, and you can lose two nodes while maintaining a quorum. But because only one dedicated cluster manager node is active at any given time, this configuration means you pay for four idle nodes.
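The quorum arithmetic above can be sketched as follows. This is purely illustrative; OpenSearch Service computes the quorum for you.

```python
# Quorum for dedicated cluster manager nodes, as described above:
# half the nodes, rounded down to the nearest whole number, plus one.

def quorum(manager_count: int) -> int:
    return manager_count // 2 + 1

for n in (3, 5):
    q = quorum(n)
    # Fault tolerance: how many manager nodes can fail while a quorum remains.
    tolerance = n - q
    print(f"{n} managers -> quorum {q}, tolerates {tolerance} failure(s)")
# 3 managers -> quorum 2, tolerates 1 failure(s)
# 5 managers -> quorum 3, tolerates 2 failure(s)
```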

Cluster manager node configurations for different domain creation methods

This section explains the resources each domain creation method and template deploy when you set up an OpenSearch Service domain.

With the Easy create option, you can quickly create a domain that uses Multi-AZ with standby for high availability, with three cluster manager nodes distributed across three Availability Zones. The following table summarizes the configuration.

Domain creation method: Easy create

  • Dedicated cluster manager nodes: Yes
  • Number of cluster manager nodes: 3
  • Availability Zones: 3
  • Standby: Yes

The Standard create option provides templates for Production and Dev/test workloads. Both templates offer a Domain with standby and a Domain without standby deployment option. The following table summarizes these configuration options.

Domain creation method: Standard create, Production template, Domain with standby

  • Dedicated cluster manager nodes: Required
  • Number of cluster manager nodes: 3
  • Availability Zones: 3
  • Standby: Yes
  • Instance type choice: Yes

Domain creation method: Standard create, Production template, Domain without standby

  • Dedicated cluster manager nodes: Required
  • Number of cluster manager nodes: 3 or 5
  • Availability Zones: 3
  • Standby: No
  • Instance type choice: Yes

Domain creation method: Standard create, Dev/test template, Domain with standby

  • Dedicated cluster manager nodes: Required
  • Number of cluster manager nodes: 3
  • Availability Zones: 3
  • Standby: Yes
  • Instance type choice: Yes

Domain creation method: Standard create, Dev/test template, Domain without standby

  • Dedicated cluster manager nodes: Not required
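If you create domains programmatically, the same choices map to the `ClusterConfig` structure of the boto3 OpenSearch Service API. The following is a minimal sketch of a Production-style configuration with standby; the instance types, counts, and domain name are placeholder assumptions to adjust for your workload.

```python
# Sketch: ClusterConfig for a domain with dedicated cluster manager nodes,
# three Availability Zones, and Multi-AZ with standby. All values here are
# illustrative assumptions, not sizing recommendations.

cluster_config = {
    "InstanceType": "r6g.large.search",         # data nodes
    "InstanceCount": 6,
    "DedicatedMasterEnabled": True,             # dedicated cluster manager nodes
    "DedicatedMasterType": "m6g.large.search",
    "DedicatedMasterCount": 3,                  # 3 (or 5) for production
    "ZoneAwarenessEnabled": True,
    "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
    "MultiAZWithStandbyEnabled": True,
}

# Sanity checks mirroring the guidance above: 3 or 5 managers, always odd.
assert cluster_config["DedicatedMasterCount"] in (3, 5)
assert cluster_config["DedicatedMasterCount"] % 2 == 1

# To create the domain (requires AWS credentials and permissions):
# import boto3
# boto3.client("opensearch").create_domain(
#     DomainName="my-domain",   # hypothetical domain name
#     ClusterConfig=cluster_config,
# )
print("cluster config OK")
```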

Choosing a dedicated cluster manager instance type

Dedicated cluster manager instances handle critical cluster operations like shard distribution and index management, and they track cluster state changes. Because they don't store data or serve search and indexing traffic, you can typically select a smaller instance type than you use for data nodes. Refer to Choosing instance types for dedicated master nodes for more information on instance types for dedicated cluster manager nodes.

You should expect to occasionally adjust cluster manager instance size and type as your workload evolves over time. As with all scale questions, you need to monitor performance and make sure you have enough CPU and Java virtual machine (JVM) heap for your dedicated cluster managers. We recommend using Amazon CloudWatch alarms to monitor the following CloudWatch metrics, and adjust according to the alarm state:

  • ManagerCPUUtilization – Maximum is greater than or equal to 50% for 15 minutes, three consecutive times
  • ManagerJVMMemoryPressure – Maximum is greater than or equal to 95% for 1 minute, three consecutive times
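The first threshold above translates directly into CloudWatch alarm parameters. The following is a sketch of the parameters you could pass to `put_metric_alarm`; the domain name, account ID, and SNS topic ARN are placeholders, and OpenSearch Service metrics are assumed to be published under the `AWS/ES` namespace with `DomainName` and `ClientId` dimensions.

```python
# Sketch: CloudWatch alarm for ManagerCPUUtilization matching the guidance
# above (Maximum >= 50% over 15 minutes, three consecutive periods).
# DomainName, ClientId (account ID), and the SNS topic are placeholders.

alarm_params = {
    "AlarmName": "opensearch-manager-cpu-high",
    "Namespace": "AWS/ES",
    "MetricName": "ManagerCPUUtilization",
    "Dimensions": [
        {"Name": "DomainName", "Value": "my-domain"},
        {"Name": "ClientId", "Value": "111122223333"},
    ],
    "Statistic": "Maximum",
    "Period": 900,              # 15 minutes, in seconds
    "EvaluationPeriods": 3,     # three consecutive times
    "Threshold": 50.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    # "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:ops-alerts"],
}

# To create the alarm (requires AWS credentials and permissions):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
print("alarm params OK")
```

An analogous alarm for ManagerJVMMemoryPressure would use a 60-second period and a threshold of 95.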

Conclusion

Dedicated cluster manager nodes provide added stability and protection against split-brain situations, can be of a different instance type than data nodes, and are a clear benefit when OpenSearch Service is backing mission-critical applications for production workloads. They are typically not required for development workloads like proofs of concept, where the cost of running dedicated cluster manager nodes exceeds the tangible benefit of keeping the cluster up and running. To learn more about OpenSearch best practices, see link.


About the authors

Imtiaz (Taz) Sayed is the WW Tech Leader for Analytics at AWS. He enjoys engaging with the community on all things data and analytics. He can be reached through LinkedIn.

Chinmayi Narasimhadevara is a Senior Solutions Architect focused on Data Analytics and AI at AWS. She helps customers build advanced, highly scalable, and performant solutions.