Briefly, this error occurs when a search request cannot be completed successfully on any of the index's shards. Common causes include aggregations run on unsupported field types, and shards that have failed or cannot be allocated due to insufficient resources, network issues, or configuration errors. To resolve this, review the failing query and aggregations, check cluster health and the shard allocation settings, verify network connectivity between nodes, and ensure that the Elasticsearch cluster is properly configured and that there are no issues with the underlying hardware. Lastly, check the Elasticsearch logs for more specific error messages that can help identify the root cause.
For a complete solution to your search operation and to understand why all shards failed, try AutoOps for Elasticsearch & OpenSearch for free. With AutoOps and Opster’s proactive support, you don’t have to worry about your search operation – we take charge of it.
What this error means
The exception “all shards failed” is raised when a search request fails on every shard it was sent to. This can occur for various reasons, such as: text fields being used for document aggregations or metric aggregations; a search failing on a shard in an unrecoverable way, so that no response could be returned for that shard (though the shard itself is fine); or special aggregations (like global and reverse nested aggregations) not being used in the proper position in the aggregation tree.
Possible causes
Below are 5 reasons why this error may appear, including troubleshooting steps for each.
1. Text fields are not optimized for operations
This error sometimes occurs because text fields are not optimized for operations that require per-document field data, such as aggregations and sorting, so these operations are disabled on text fields by default.
Quick troubleshooting steps
To overcome this error, you can enable fielddata on the text field, but beware – fielddata can consume a large amount of heap memory and cause performance issues.
If you are not using an explicit index mapping definition, you can simply run the aggregation on the .keyword sub-field that dynamic mapping creates for string fields.
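For instance, a terms aggregation can target the keyword sub-field instead of the text field. A minimal sketch, where the index name my-index is an assumption and name is the field from the mapping example below:

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "names": {
      "terms": { "field": "name.keyword" }   // aggregate on the keyword sub-field, not the text field
    }
  }
}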
However, if you have defined an explicit index mapping and the field does not have a keyword sub-field, you can add a multi-field, which lets you index the same field in different ways for different purposes. Alternatively, you can change the data type of the name field from text to keyword in the index mapping definition (to enable aggregation on it), as shown below:
{ "mappings": { "properties": { "name":{ "type":"keyword" // note this } } } }
2. Metric aggregations can’t be performed on text fields
Metric aggregations perform mathematical calculations (such as sum, min, max, and avg) on the documents present in the parent bucket. They cannot be performed on text fields; if such an aggregation targets a text field, you will get the “all shards failed” exception.
Quick troubleshooting steps
- A sum/max/min (or other metric) aggregation can work on a script instead of a field. The script transforms the text into a numeric value (e.g. Integer.parseInt(doc.cost.value)). Alternatively, starting with Elasticsearch 7.11, you can use a runtime field, which can be used in both queries and aggregations (see the sketch after these steps).
- If you want to avoid scripts in the search query, you can change the data type of the cost field to a numeric type to avoid the error. The index mapping definition will look like the following:
{ "mappings": { "properties": { "cost":{ "type":"integer" // note this } } } }
3. At least one shard has failed
The aforementioned exception may also arise when at least one shard has genuinely failed. For example, after restarting a node, some shards may not recover, causing the cluster to stay red. You can check the health status of the cluster by using the Elasticsearch Check-Up or the cluster health API:
GET _cluster/health
One way to resolve the error is to delete the index completely (but it’s not an ideal solution).
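Before deleting anything, it is usually worth asking the cluster why a shard is unassigned and, if the failure was transient, retrying allocation. A minimal sketch (the index name my-index is an assumption):

// Explain why a specific shard is unassigned or cannot be allocated
GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}

// Retry shards whose allocation previously failed too many times
POST _cluster/reroute?retry_failed=true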
4. Global aggregations are not defined as top-level
Global aggregation is a special kind of aggregation that is executed globally on all the documents without being influenced by the query. If global aggregations are not defined as top-level aggregations, then you’ll get the “all shards failed” exception.
Quick troubleshooting steps
To avoid this error, you should ensure that global aggregations are defined only as top-level aggregations and not as sub-aggregations.
For example, the search query should be written as follows (note that the global aggregation is defined as a top-level aggregation):
{ "size": 0, "aggs": { "all_products": { "global": {}, "aggs": { "genres": { "terms": { "field": "cost" } } } } } }
5. Reverse nested aggregation is not used inside a nested aggregation
Reverse nested aggregation is a single-bucket aggregation that enables aggregating on parent documents from inside nested documents.
The reverse_nested aggregation must be defined inside a nested aggregation; if it is not, you’ll see this exception.
Quick troubleshooting steps
To avoid this error, you should ensure that the reverse_nested aggregation is defined inside a nested aggregation.
The corrected search query will be:
{ "aggs": { "comments": { "nested": { "path": "comments" }, "aggs": { "top_usernames": { "terms": { "field": "comments.username" }, "aggs": { "comment_issue": { "reverse_nested": {}, "aggs": { "top_tags": { "terms": { "field": "tags" } } } } } } } } } }
Overview
Data in an Elasticsearch index can grow to massive proportions. In order to keep it manageable, it is split into a number of shards. Each Elasticsearch shard is an Apache Lucene index, with each individual Lucene index containing a subset of the documents in the Elasticsearch index. Splitting indices in this way keeps resource usage under control. An Apache Lucene index has a limit of 2,147,483,519 documents.
Examples
The number of shards is set when an index is created, and this number cannot be changed later without reindexing the data (or shrinking/splitting it into a new index). When creating an index, you can set the number of shards and replicas as properties of the index using:
PUT /sensor
{
  "settings": {
    "index": {
      "number_of_shards": 6,
      "number_of_replicas": 2
    }
  }
}
The ideal number of shards should be determined based on the amount of data in an index. Generally, an optimal shard should hold 30-50GB of data. For example, if you expect to accumulate around 300GB of application logs in a day, having around 10 shards in that index would be reasonable.
During their lifetime, shards can go through a number of states, including:
- Initializing: An initial state before the shard can be used.
- Started: A state in which the shard is active and can receive requests.
- Relocating: A state that occurs when shards are in the process of being moved to a different node. This may be necessary under certain conditions, such as when the node they are on is running out of disk space.
- Unassigned: The state of a shard that has failed to be assigned. A reason is provided when this happens. For example, if the node hosting the shard is no longer in the cluster (NODE_LEFT) or due to restoring into a closed index (EXISTING_INDEX_RESTORED).
In order to view all shards, their states, and other metadata, use the following request:
GET _cat/shards
To view shards for a specific index, append the name of the index to the URL, for example:
GET _cat/shards/sensor
This command produces output such as the following example. By default, the columns shown include the name of the index, the number of the shard, whether it is a primary shard or a replica, its state, the number of documents, the size on disk, the IP address, and the node ID.
sensor 5 p STARTED    0 283b  127.0.0.1 ziap
sensor 5 r UNASSIGNED
sensor 2 p STARTED    1 3.7kb 127.0.0.1 ziap
sensor 2 r UNASSIGNED
sensor 3 p STARTED    3 7.2kb 127.0.0.1 ziap
sensor 3 r UNASSIGNED
sensor 1 p STARTED    1 3.7kb 127.0.0.1 ziap
sensor 1 r UNASSIGNED
sensor 4 p STARTED    2 3.8kb 127.0.0.1 ziap
sensor 4 r UNASSIGNED
sensor 0 p STARTED    0 283b  127.0.0.1 ziap
sensor 0 r UNASSIGNED
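When replicas remain UNASSIGNED, as in the output above, you can ask the same cat API to show the reason. A minimal sketch:

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason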
Notes and good things to know
- Having shards that are too large is simply inefficient. Moving huge indices across machines is both a time- and labor-intensive process. First, the Lucene merges take longer to complete and require greater resources. Moreover, moving the shards across the nodes for rebalancing also takes longer, and recovery time is extended. Thus, by splitting the data and spreading it across a number of machines, it can be kept in manageable chunks, minimizing risks.
- Having the right number of shards is important for performance. It is thus wise to plan in advance. When queries are run across different shards in parallel, they execute faster than on an index composed of a single shard, but only if each shard is located on a different node and there are sufficient nodes in the cluster. At the same time, however, shards consume memory and disk space, both in terms of indexed data and cluster metadata. Having too many shards can slow down queries, indexing requests, and management operations, so maintaining the right balance is critical; one way to reduce the primary shard count of an existing index is the shrink API, sketched below.
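As a rough sketch of shrinking an over-sharded index into a new index with fewer primaries (the target count must be a factor of the original count; the node name shrink-node-1 and the target index name are assumptions):

// Step 1: block writes and move a copy of every shard to a single node
PUT /sensor/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "shrink-node-1",
    "index.blocks.write": true
  }
}

// Step 2: shrink the 6-shard index into a new index with 2 primary shards
POST /sensor/_shrink/sensor-shrunk
{
  "settings": {
    "index.number_of_shards": 2,
    "index.number_of_replicas": 1
  }
}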
Overview
Search refers to searching for documents in an index or across multiple indices. A simple search is just a GET API request to the _search endpoint. The search query can be provided either as a query string or through a request body (both forms are sketched after the example response below).
Examples
When searching this index without any search parameters, every document is a hit, and by default the first 10 hits are returned.
GET my_documents/_search
A JSON object is returned in response to a search query. A 200 response code means the request was completed successfully.
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ ... ] } }
Notes and good things to know
- Distributed search is challenging: every shard of the index needs to be searched for hits, and then those hits are combined into a single sorted list as the final result.
- There are two phases of search: the query phase and the fetch phase.
- In the query phase, the query is executed on each shard locally and top hits are returned to the coordinating node. The coordinating node merges the results and creates a global sorted list.
- In the fetch phase, the coordinating node brings the actual documents for those hit IDs and returns them to the requesting client.
- A coordinating node needs enough memory and CPU in order to handle the fetch phase.
Log Context
The log “all shards failed” is emitted from the class SearchScrollAsyncAction.java. We extracted the following from the Elasticsearch source code for those seeking in-depth context:
addShardFailure(new ShardSearchFailure(failure, searchShardTarget));
int successfulOperations = successfulOps.decrementAndGet();
assert successfulOperations >= 0 : "successfulOperations must be >= 0 but was: " + successfulOperations;
if (counter.countDown()) {
    if (successfulOps.get() == 0) {
        listener.onFailure(new SearchPhaseExecutionException(phaseName, "all shards failed", failure, buildShardFailures()));
    } else {
        SearchPhase phase = nextPhaseSupplier.get();
        try {
            phase.run();
        } catch (Exception e) {