Quick links:
- Introduction
- What is cluster state?
- Cluster state: purposes and use cases
- The impact of having large cluster state
- How to calculate the cluster state’s size
- How to detect large cluster state in OpenSearch logs
- How to solve this issue
- How to reduce the size of cluster state
Introduction
At Opster, we see many deployments of OpenSearch, and many common problems repeat themselves across them. One of the toughest is when the cluster state becomes too large, which poses many challenges and causes stability incidents and performance degradation.
Cluster state needs to be synced between all nodes in a cluster. Due to this, having large cluster states can cause time-outs and errors while syncing.
The OpenSearch cluster needs to maintain the cluster state in memory on every node, which creates another common problem: master nodes may fail to finish their initialization procedures because there is not enough memory to load the cluster state. This can cause nodes to disconnect and even lead to cluster downtime.
Though many customers can accurately detect on their own that the issue lies with the cluster state, solving these issues can be complicated. This requires some knowledge of what cluster state is composed of and setting up the correct automations for curating data and metadata from the cluster.
What is cluster state?
OpenSearch clusters are managed by the master nodes, or more specifically by the elected master node. The way the master node knows the state of the nodes and data in the cluster is through a data set known as the “cluster state”. The cluster state includes metadata about the nodes, indices, shards, shard allocation, index mappings and settings, and more. All of this information must be shared between all the nodes to ensure that operations across the cluster are coherent.
The structure of the cluster state varies between OpenSearch versions. This is the core structure of cluster state in OpenSearch:
{ "cluster_name": "es_8.1.3", "cluster_uuid": "EvVUB8IvRSaJr8m4UKT7-w", "version": 433, "state_uuid": "86A361nNSbCKCD9dYW-90g", "master_node": "aH_dmogxRAKyl6C3TE3IGw", "blocks": {}, "nodes": {}, "metadata": {}, "routing_table": {}, "routing_nodes": {} }
The cluster state stores: cluster-wide settings; index metadata, including the mapping and settings of each index; index templates; stored scripts; ingest pipelines; and so on.
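You can inspect individual parts of the cluster state instead of fetching the whole document. A minimal sketch using the cluster state API, where my-index is a hypothetical index name:

GET _cluster/state/metadata,routing_table
GET _cluster/state/metadata/my-index

The first request returns only the metadata and routing table portions of the cluster state; the second narrows the metadata down to a single index.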
Cluster state: purposes and use cases
- Coordinating requests – as each node in the cluster can accept any indexing, update, or search request, each node needs to know where every index and shard is located in the cluster. This information is stored in the routing table of the cluster state (see the sketch after this list).
- Managing index mapping and templates – as new documents are indexed, they may introduce new fields. The data nodes inform the elected master of the new mapping, as the master is the only node that can accept new fields. If a field is accepted, the master updates the cluster state, which propagates to all other nodes in the cluster and ensures field type consistency throughout the cluster.
- Cluster and index settings – just as with mappings, changes to cluster settings and index settings are accepted by the elected master and distributed through the cluster state.
- Node list – the list of nodes in the cluster is maintained in the cluster state. The elected master is the only node that can change this list, adding new nodes or removing disconnected ones, and it informs the other nodes of the change through the cluster state.
- Debugging and diagnostics – because the cluster state keeps track of such a wide variety of information about the cluster, it is also a useful source for debugging and diagnostic purposes.
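For example, the routing table and node list mentioned above can be viewed directly. A minimal sketch, assuming access to the cluster state API:

GET _cluster/state/nodes
GET _cluster/state/routing_table?filter_path=routing_table.indices

The first request shows the node list the elected master maintains; the second shows the routing table used to coordinate requests.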
The impact of having large cluster state
As mentioned in the introduction, a large cluster state can cause many issues and incidents in clusters.
Cluster state needs to be synced between all nodes in a cluster. Due to this, having large cluster states can cause time-outs and errors while syncing. Master nodes may not finish their initialization procedures due to insufficient memory to load the cluster state, which can cause nodes to disconnect and cluster downtimes.
How to calculate the cluster state’s size
In OpenSearch, you can request the size by using the Node Stats API:
GET _nodes/stats/discovery?filter_path=**.discovery
For each node, the above request returns the size of the serialized cluster state under the serialized_cluster_states object.
In versions prior to this API, there is no built-in way to get the size of the cluster state. The simplest way to discover the size is to fetch the cluster state itself and measure the size of the resulting file.
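A minimal shell sketch of this approach, assuming a cluster reachable at localhost:9200 with no authentication (adjust the URL and credentials to your environment):

curl -s "http://localhost:9200/_cluster/state" -o cluster_state.json
ls -lh cluster_state.json

The first command downloads the full cluster state to a file; the second prints its size.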
Another method is to use Postman. Run the GET cluster state API. The size will be displayed in the response section.
Note that when the cluster state is too big (more than a few GB), this method might fail with timeouts and an error fetching the state. If this process takes too long/fails, that could be an indication that the cluster state is too big (although there could be other reasons for this failure as well).
How to detect large cluster state in OpenSearch logs
Another method for detecting large cluster states is through the OpenSearch logs. Here are some examples of errors that can occur due to a large cluster state, indicating that this is the issue that requires your attention:
- INFO (TransportClientNodesService.java:522) – failed to get local cluster state for {#transport#-25}{pFu-9uiISaG_7UbNT-INPw}{56.241.23.136}{56.241.23.136:9303}, disconnecting… org.opensearch.transport.ReceiveTimeoutTransportException:
- [DEBUG][o.e.a.a.i.m.p.TransportPutMappingAction] [NODE_NAME_1] failed to put mappings on indices [[[INDEX_1/SOME_ID]]], type [_doc] org.opensearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s at org.opensearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:143) [opensearch-7.5.2.jar:7.5.2]
- “timed out waiting for all nodes to process published state [175229398] (timeout [30s], pending nodes: [{opensearch-data-1}{TULFVEcrTxCfINXVazprLA}{Yd1lhNcUQxaKFW8zBhIPjQ}{IP:9300}])”
How to solve this issue
- Reduce the cluster state – this is the most obvious solution, though it is not always straightforward to accomplish. See the possible steps you can take below.
- Raise the timeouts – as the cluster state grows, some requests might time out (see the example logs above). One way to work around this issue until you reduce the cluster state is to increase the timeouts relevant to cluster state sync, such as cluster.publish.timeout (see the configuration sketch after this list).
- Add resources to master nodes – adding more memory in particular will enable large cluster states to be held in memory, as a temporary solution until the cluster state is reduced.
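A minimal configuration sketch for the last two workarounds, assuming self-managed master-eligible nodes where you can edit the node configuration (the values are illustrative, not recommendations):

# opensearch.yml – static setting, requires a node restart:
cluster.publish.timeout: 60s

# jvm.options – raise the heap so the master can hold a large cluster state:
-Xms8g
-Xmx8g

Keep -Xms and -Xmx equal, and remember that reducing the cluster state itself is the lasting fix.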
How to reduce the size of cluster state
How to reduce the size of cluster state in OpenSearch:
- Avoid mapping explosions
Cluster state can become unreasonably large if there is a mapping explosion, which occurs when an index accumulates too many fields or the cluster holds too many indices. Because the cluster state stores the mapping of every index, it is updated and grows each time a new field is added.
You can overcome this problem by using strict mapping or by setting dynamic mapping to false. This prevents a new field from being created each time a document with an unknown field is indexed (see the mapping sketch at the end of this list).
- Delete empty indices
Deleting empty indices will reduce the space they take up in the cluster state.
- Reindex small indices into bigger indices
If you have a large number of indices in your cluster (we’ve even seen 25K indices in a single cluster), try dividing them across multiple clusters and merging your smaller indices into bigger ones (see the cleanup sketch at the end of this list). This will reduce the cluster state.
- Remove unused scripts and templates
To reduce the size of your cluster state, remove any scripts and templates that are no longer in use (also covered in the cleanup sketch at the end of this list).
- Increase hardware capacity
If the options above are not relevant for you, you can consider increasing the hardware capacity of your master nodes (both heap memory and CPU) to ensure they have enough resources to handle a large cluster state.
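A minimal mapping sketch for the first step, using a hypothetical index name my-index; "dynamic": "strict" rejects documents containing unknown fields, while "dynamic": false indexes them without adding the new fields to the mapping:

PUT my-index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "message": { "type": "text" }
    }
  }
}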
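And a cleanup sketch covering the remaining steps; every index, script, and template name below is a hypothetical placeholder:

# Find empty indices by sorting on document count
GET _cat/indices?v&s=docs.count:asc

# Delete an index confirmed to be empty and unused
DELETE empty-index-1

# Merge small indices into one larger index
POST _reindex
{
  "source": { "index": "logs-2023-01-*" },
  "dest": { "index": "logs-2023-01" }
}

# Remove stored scripts and index templates that are no longer used
DELETE _scripts/unused-script
DELETE _index_template/unused-template

Delete the source indices only after verifying that the reindex completed and the destination index holds the expected document count.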