All shards failed – How to solve this Elasticsearch exception

Opster Team

August-23, Version: 6.8-8.9

Briefly, this error occurs when a search request fails on every shard it targets. This can happen because the shards are unassigned or unavailable (possibly due to insufficient resources, network issues, or configuration errors), or because the query or aggregation is invalid for the fields involved. To resolve it, check the cluster health and shard allocation settings, verify the network connectivity between nodes, and review the failing request against the index mapping. Lastly, check the Elasticsearch logs for more specific error messages that can help identify the root cause.

What this error means

The exception “all shards failed” arises when a search request fails on every shard it was sent to, so that no shard returns a successful response. This can occur for various reasons, such as: text fields being used for bucket or metric aggregations; a search failing on a shard and ending up in an unrecoverable state, so that no response could be given for that shard (though the shard itself is fine); or special aggregations (like the global and reverse nested aggregations) not being used in the proper position.
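The error body returned with this exception usually lists a reason per failed shard, which is often the quickest pointer to the real cause. The sketch below is a minimal, hedged example of reading those reasons. The helper name `summarize_shard_failures` and the sample payload are illustrative assumptions (the sample is hand-written, not captured output), and it assumes the response carries an `error.failed_shards` list.

```python
def summarize_shard_failures(response):
    """Return one readable line per entry in error.failed_shards."""
    error = response.get("error", {})
    lines = []
    for failure in error.get("failed_shards", []):
        reason = failure.get("reason", {})
        lines.append(
            "{}[{}]: {}: {}".format(
                failure.get("index"),
                failure.get("shard"),
                reason.get("type"),
                reason.get("reason"),
            )
        )
    return lines


# Hand-written sample shaped like a search error body (illustrative only).
sample_error = {
    "error": {
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "failed_shards": [
            {
                "shard": 0,
                "index": "products",
                "reason": {
                    "type": "illegal_argument_exception",
                    "reason": "aggregation on a text field",
                },
            }
        ],
    },
    "status": 400,
}

for line in summarize_shard_failures(sample_error):
    print(line)
```

The per-shard `reason` is what distinguishes a mapping or query problem from an unavailable shard, so it is worth reading before changing any cluster settings.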

Possible causes

Below are 5 reasons why this error may appear, including troubleshooting steps for each.

1. Text fields are not optimized for operations

This error sometimes occurs because text fields are not optimized for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default.

Quick troubleshooting steps

To overcome the above error, you can enable fielddata on the text field (by setting "fielddata": true in its mapping), but beware: fielddata is loaded into heap memory and can cause performance issues.

If you are not using an explicit index mapping definition, dynamic mapping by default indexes string values as a text field with a .keyword sub-field, so you can simply run the aggregation on the .keyword sub-field.

However, if you have defined an index mapping and it has no keyword sub-field, you can add a multi-field, which indexes the same field in different ways for different purposes. You can also change the data type of the name field from text to keyword in the index mapping definition (to enable aggregation on it). Note that changing the type of an existing field requires creating a new index and reindexing the data. The mapping is shown below –

{
  "mappings": {
    "properties": {
      "name":{
        "type":"keyword"       // note this
      }
    }
  }
}
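As a complement to the mapping above, here is a minimal sketch of the multi-field option: the field stays `text` for full-text search and gains a `keyword` sub-field for aggregations. The helper names, the aggregation name `top_<field>`, and the default `size` are illustrative assumptions; the functions only build request bodies and do not contact a cluster.

```python
import json


def text_with_keyword_mapping(field):
    """Mapping body: `field` as text plus a `keyword` sub-field."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword"}},
                }
            }
        }
    }


def terms_agg_on_keyword(field, size=10):
    """Search body running a terms aggregation on `<field>.keyword`."""
    return {
        "size": 0,
        "aggs": {
            "top_" + field: {
                "terms": {"field": field + ".keyword", "size": size}
            }
        },
    }


print(json.dumps(text_with_keyword_mapping("name"), indent=2))
print(json.dumps(terms_agg_on_keyword("name"), indent=2))
```

The first body would be sent when creating the index, the second as the search request. Aggregating on `name.keyword` avoids enabling fielddata on the text field.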

2. Metric aggregations can’t be performed on text fields

Metric aggregations compute a numeric value (such as a sum, minimum, maximum, or average) from the documents in the parent bucket. They therefore need numeric input and cannot be performed on text fields. If you run a metric aggregation on a text field, you will get the “all shards failed” exception.

Quick troubleshooting steps

  1. A metric aggregation such as sum, max, or min can work on a script instead of a field. The script transforms the text into a numeric value (e.g. Integer.parseInt(doc.cost.value)). Starting with ES 7.11, you can instead define a runtime field, which can be used in queries and aggregations.
  2. If you want to avoid scripts in the search query, you can change the data type of the cost field to a numeric type. The index mapping definition will look like this:

{
  "mappings": {
    "properties": {
      "cost":{
        "type":"integer"     // note this
      }
    }
  }
}
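The runtime field option mentioned in step 1 can be sketched as follows. This is a hedged example, not a definitive recipe: it assumes ES 7.11 or later, assumes the string values are readable through a keyword field (here `cost.keyword`), and the names `sum_via_runtime_field`, `cost_as_number`, and `total_cost` are made up for illustration. The function only builds the search body.

```python
import json

# Painless source: emit a long only when the document has a value.
# FIELD is a placeholder replaced with the keyword field name below.
SCRIPT_TEMPLATE = (
    "if (doc['FIELD'].size() != 0) "
    "{ emit(Long.parseLong(doc['FIELD'].value)); }"
)


def sum_via_runtime_field(keyword_field, runtime_name="cost_as_number"):
    """Search body that sums a string field through a `long` runtime field."""
    source = SCRIPT_TEMPLATE.replace("FIELD", keyword_field)
    return {
        "size": 0,
        "runtime_mappings": {
            runtime_name: {"type": "long", "script": {"source": source}}
        },
        "aggs": {"total_cost": {"sum": {"field": runtime_name}}},
    }


print(json.dumps(sum_via_runtime_field("cost.keyword"), indent=2))
```

A value that is not a valid number would still make the script fail at search time, so changing the mapping to a numeric type (step 2) remains the more robust fix.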

3. At least one shard has failed

The exception may also arise when the shards themselves have failed or are unavailable. For example, after a remote server restarts, some shards may not recover, causing the cluster to stay red. You can check the health status of the cluster by using the Elasticsearch Check-Up or the cluster health API:

GET _cluster/health

One way to resolve the error is to delete the affected index completely, but this is not an ideal solution because the data in that index is lost.
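Before deleting anything, it helps to read the cluster health response carefully. The sketch below interprets the `status` and `unassigned_shards` fields of a response that has already been fetched from the cluster health API. The function name and the sample values are illustrative assumptions.

```python
def describe_health(health):
    """Turn a cluster health response (as a dict) into a short diagnosis."""
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    if status == "red":
        return "red: at least one primary shard is unassigned ({} unassigned in total)".format(unassigned)
    if status == "yellow":
        return "yellow: all primaries are assigned, {} replica shard(s) unassigned".format(unassigned)
    if status == "green":
        return "green: all shards are assigned"
    return "unknown status: {!r}".format(status)


# Hand-written sample values, not real cluster output.
print(describe_health({"status": "red", "unassigned_shards": 3}))
```

When the status is red, the cluster allocation explain API (`GET _cluster/allocation/explain`) can usually tell you why a particular shard is not being assigned.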

4. Global aggregations are not defined as top-level

Global aggregation is a special kind of aggregation that is executed globally on all the documents without being influenced by the query. If global aggregations are not defined as top-level aggregations, then you’ll get the “all shards failed” exception.

Quick troubleshooting steps

To avoid this error, you should ensure that global aggregations are defined only as top-level aggregations and never as sub-aggregations.

For example, the following search query defines the global aggregation as a top-level aggregation, with the terms aggregation nested inside it:

{
 "size": 0,
 "aggs": {
   "all_products": {
     "global": {},
     "aggs": {
       "genres": {
         "terms": {
           "field": "cost"
         }
       }
     }
   }
 }
}
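This rule can be checked on the client before a request is sent. The following is a minimal sketch that walks the aggregation tree of a search body and reports any `global` aggregation found below the top level. The function name is hypothetical, and the check is purely structural (it does not contact Elasticsearch).

```python
def misplaced_global_aggs(body):
    """Return the paths of `global` aggregations that are not top-level."""
    misplaced = []

    def walk(aggs, path):
        for name, definition in aggs.items():
            here = path + [name]
            if "global" in definition and len(here) > 1:
                misplaced.append(" > ".join(here))
            # Both "aggs" and "aggregations" are accepted as keys.
            sub = definition.get("aggs") or definition.get("aggregations") or {}
            walk(sub, here)

    walk(body.get("aggs") or body.get("aggregations") or {}, [])
    return misplaced


valid = {
    "size": 0,
    "aggs": {
        "all_products": {
            "global": {},
            "aggs": {"genres": {"terms": {"field": "cost"}}},
        }
    },
}
invalid = {
    "size": 0,
    "aggs": {
        "genres": {
            "terms": {"field": "cost"},
            "aggs": {"all_products": {"global": {}}},
        }
    },
}

print(misplaced_global_aggs(valid))
print(misplaced_global_aggs(invalid))
```

An empty list means every global aggregation sits at the top level. Any path in the list points at an aggregation that needs to be moved up.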

5. Reverse nested aggregation is not used inside a nested aggregation

Reverse nested aggregation is a single bucket aggregation that enables aggregating on parent docs from nested documents.

The reverse_nested aggregation must be defined inside a nested aggregation. If it is used anywhere else, you’ll see this exception.

Quick troubleshooting steps

To avoid this error, you should ensure that the reverse_nested aggregation is defined inside a nested aggregation.

A search query with the reverse_nested aggregation correctly placed inside a nested aggregation looks like this –

{
 "aggs": {
   "comments": {
     "nested": {
       "path": "comments"
     },
     "aggs": {
       "top_usernames": {
         "terms": {
           "field": "comments.username"
         },
         "aggs": {
           "comment_issue": {
             "reverse_nested": {},
             "aggs": {
               "top_tags": {
                 "terms": {
                   "field": "tags"
                 }
               }
             }
           }
         }
       }
     }
   }
 }
}
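As with global aggregations, the placement rule can be checked structurally before the request is sent. The sketch below flags any `reverse_nested` aggregation that has no `nested` aggregation among its ancestors. The function name is hypothetical, and the check is deliberately simple: it does not validate the `path` values.

```python
def orphan_reverse_nested(body):
    """Return paths of `reverse_nested` aggs with no `nested` ancestor."""
    orphans = []

    def walk(aggs, path, inside_nested):
        for name, definition in aggs.items():
            here = path + [name]
            if "reverse_nested" in definition and not inside_nested:
                orphans.append(" > ".join(here))
            sub = definition.get("aggs") or definition.get("aggregations") or {}
            # Children are inside a nested context if this agg or an ancestor is `nested`.
            walk(sub, here, inside_nested or "nested" in definition)

    walk(body.get("aggs") or body.get("aggregations") or {}, [], False)
    return orphans


good = {
    "aggs": {
        "comments": {
            "nested": {"path": "comments"},
            "aggs": {
                "top_usernames": {
                    "terms": {"field": "comments.username"},
                    "aggs": {"comment_issue": {"reverse_nested": {}}},
                }
            },
        }
    }
}
bad = {
    "aggs": {
        "top_usernames": {
            "terms": {"field": "comments.username"},
            "aggs": {"comment_issue": {"reverse_nested": {}}},
        }
    }
}

print(orphan_reverse_nested(good))
print(orphan_reverse_nested(bad))
```

The `good` body mirrors the query above (without the innermost terms aggregation), while `bad` omits the enclosing nested aggregation and is reported.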

Log Context

The “all shards failed” message is raised in the class SearchScrollAsyncAction.java. The following excerpt from the Elasticsearch source code gives more in-depth context:

 addShardFailure(new ShardSearchFailure(failure, searchShardTarget));
 int successfulOperations = successfulOps.decrementAndGet();
 assert successfulOperations >= 0 : "successfulOperations must be >= 0 but was: " + successfulOperations;
 if (counter.countDown()) {
     if (successfulOps.get() == 0) {
         listener.onFailure(new SearchPhaseExecutionException(phaseName, "all shards failed", failure, buildShardFailures()));
     } else {
         SearchPhase phase = nextPhaseSupplier.get();
         try {
             phase.run();
         } catch (Exception e) {
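The bookkeeping in this excerpt can be paraphrased in a few lines. The following is a simplified Python sketch, not a translation of the Elasticsearch code: it assumes the success counter starts at the number of targeted shards and is decremented once per failure, and it assumes at least one shard was targeted. The function name is made up for illustration.

```python
def finish_search_phase(shard_succeeded):
    """shard_succeeded: one boolean per targeted shard.

    Returns the number of successful shards, or raises if none succeeded.
    """
    successful_ops = len(shard_succeeded)  # assumed starting value
    for ok in shard_succeeded:
        if not ok:
            successful_ops -= 1  # mirrors successfulOps.decrementAndGet()
    if successful_ops == 0:
        # Only reached when no shard produced a successful response.
        raise RuntimeError("all shards failed")
    return successful_ops


print(finish_search_phase([True, False, True]))
```

The point of the sketch is that a single failed shard does not raise the exception by itself. The exception is raised only when the success count reaches zero.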
