Elasticsearch Partial Snapshots – Customer Post Mortem

By Opster Expert Team - Musab Dogan

Updated: Jan 28, 2024 | 3 min read

Introduction

Elasticsearch snapshots are a critical feature that allows users to create backups of their data, which can be used to restore the cluster in case of a disaster. 

However, there are various reasons why Elasticsearch snapshots may sometimes fail. One common problem that users may encounter is the “PARTIAL snapshot” error, which indicates that one or more index shard snapshots could not be taken. 

The Opster support team has helped many customers with similar issues. The following is based on a real case that occurred for one of Opster’s customers. In this article, we will discuss the possible reasons for this error and how to solve it.

Overview & Definition

A snapshot copies segments from an index’s primary shards. When a snapshot begins, Elasticsearch immediately starts copying the segments of any primary shards that are available.

If a shard is initializing or relocating, Elasticsearch waits for those operations to finish before copying its segments. If one or more primary shards are not available, the snapshot attempt fails.
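For context, a basic snapshot run looks like the following (a minimal sketch, assuming a shared-filesystem repository named my_repository, an example backup path that is listed in path.repo, and a snapshot named my_snapshot):

PUT /_snapshot/my_repository
{
  "type": "fs",
  "settings": {
    "location": "/mount/backups/my_repository"
  }
}

PUT /_snapshot/my_repository/my_snapshot?wait_for_completion=true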

What is a partial snapshot error? 

A PARTIAL snapshot means that the global cluster state was stored, but the data of at least one shard was not stored successfully. The failures section of the snapshot response contains more detailed information about the shards that were not processed correctly.

A step-by-step guide to resolving partial snapshot issues

The example below, from one of Opster’s customers, provides a step-by-step guide to dealing with a PARTIAL snapshot error.

In this case, the cluster was running Elasticsearch 6.8 with more than 40 nodes. Snapshots had previously been completing successfully, but the snapshot state suddenly became PARTIAL. The Elasticsearch logs were examined, but they contained nothing that pointed to the root cause.

PARTIAL snapshots were detected with this API:

GET _cat/snapshots?v
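The status column makes PARTIAL snapshots easy to spot. Illustrative, abbreviated output (snapshot names and numbers are hypothetical):

id               status  start_time end_time indices successful_shards failed_shards total_shards
daily-2024.01.27 SUCCESS 00:00:01   00:02:10 120     240               0             240
daily-2024.01.28 PARTIAL 00:00:01   00:03:25 120     238               2             240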

The indices whose shard snapshots failed were identified with this API:

GET /_snapshot/my_repository/my_snapshot?filter_path=*.failures
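For each shard that could not be snapshotted, the failures section reports the index, shard, node, and reason. An illustrative response (all field values are hypothetical):

{
  "snapshots": [
    {
      "failures": [
        {
          "index": "failed_index_1",
          "shard_id": 0,
          "node_id": "aB3xY9kQTnGxLm2w",
          "status": "INTERNAL_SERVER_ERROR",
          "reason": "IndexShardSnapshotFailedException[failed to snapshot shard]"
        }
      ]
    }
  ]
}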

The following command was used to determine which node(s) held the primary shards of the failed indices:

GET _cat/shards/failed_index_1,failed_index_2?v&s=prirep
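The prirep column marks primaries (p) and replicas (r), and the node column shows where each copy lives. Illustrative output (shard counts, sizes, and node names are hypothetical):

index          shard prirep state   docs  store ip        node
failed_index_1 0     p      STARTED 12345 1.2gb 10.0.0.37 node37
failed_index_2 0     p      STARTED 67890 3.4gb 10.0.0.37 node37
failed_index_1 0     r      STARTED 12345 1.2gb 10.0.0.12 node12
failed_index_2 0     r      STARTED 67890 3.4gb 10.0.0.15 node15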

Taking a snapshot of a single failed index with the request below also resulted in a PARTIAL snapshot, suggesting that the root cause of the problem was the node(s):

PUT /_snapshot/my_repository/test_snapshot?wait_for_completion=true
{ "indices": "failed_index_1" }

The access_key and secret_key entries in the Elasticsearch keystore were checked and updated, and the node’s secure settings were then reloaded.
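On the node itself, checking and updating these keys is typically done with the elasticsearch-keystore tool before calling the reload API shown below (a minimal sketch, assuming an S3-backed repository using the default S3 client; the exact setting names depend on the repository type):

# List the settings currently stored in the keystore
bin/elasticsearch-keystore list
# Re-enter the repository credentials (assumed S3 client setting names)
bin/elasticsearch-keystore add s3.client.default.access_key
bin/elasticsearch-keystore add s3.client.default.secret_key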

POST /_nodes/<node37_id>/reload_secure_settings

This reload call resulted in a security exception.

Further investigation revealed that the ownership of /etc/elasticsearch/elasticsearch.keystore was:

root:root

Changing the ownership to the following resolved the security exception, but not the PARTIAL snapshot error.

elasticsearch:elasticsearch
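On a package-based install, that change is a single chown on the affected node (assuming the default keystore path shown above):

# Give the elasticsearch user ownership of the keystore file
chown elasticsearch:elasticsearch /etc/elasticsearch/elasticsearch.keystore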

Following the actions listed above, it appeared that the root cause might be the indices themselves, so the failed indices were checked. The error occurred consistently for the same indices, indicating a problem with a specific index or node.

The failed indices’ primary shard locations were checked, and it was found that some of them had primary shards allocated on node37. Taking a snapshot of a specific index with a primary shard on node37 again resulted in a PARTIAL snapshot, so the issue was narrowed down to node37.

The solution

To solve the “PARTIAL snapshot” error, it is essential to identify the root cause of the problem. In the case above, it became clear that the issue was related to node37. 

So, data node37 was removed from the cluster and a brand-new data node was added in its place. Once the repository credentials were entered on the new node, the problem was resolved.

You can restart the problem node to move its primary shards to other nodes: while the node is down, the replicas on other nodes are promoted to primaries. Before the restart, it is essential to make sure the cluster status is GREEN so that every shard has an up-to-date replica. When snapshots are taken again for all indices after the restart, the snapshot state should become SUCCESS.
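Cluster status can be checked before the restart, and the call can optionally block until the cluster reaches GREEN (wait_for_status and timeout are standard cluster health parameters):

GET _cluster/health?wait_for_status=green&timeout=60s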

Ultimately, you can remove the problem node from the cluster rather than spending more time trying to pin down its root cause. Adding a brand-new data node in its place will then fix the problem.
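If you prefer to drain the node gracefully before removing it, shard allocation can be excluded by node name so that its shards are moved off first (a sketch assuming the problem node is still named node37; _ip or _host can be used instead of _name):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "node37"
  }
}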