Quick Links
- Background
- 1. Reindexing data from a remote cluster
- 2. Transferring data using snapshots
- 3. Transferring data using Logstash
- Synchronization of updates
Background
When you want to upgrade an Elasticsearch cluster, it is sometimes easier to create a new, separate cluster and transfer data from the old cluster to the new one. This gives you the advantage of being able to test all of your data, configurations, and applications against the new cluster without any risk of downtime or data loss.
The disadvantages of that approach are that it requires some duplication of hardware and could create difficulties when trying to smoothly transfer and synchronize all of the data.
It may also be necessary to carry out a similar procedure if you need to migrate applications from one data center to another.
In this article, we will discuss and detail the three ways to transfer data between Elasticsearch clusters:
1. Reindexing from a remote cluster
2. Transferring data using snapshots
3. Transferring data using Logstash
Using snapshots is usually the quickest and most reliable way to transfer data. However, bear in mind that you can only restore a snapshot onto a cluster of an equal or higher version and never with a difference of over one major version. That means you can restore a 6.x snapshot onto a 7.x cluster but not an 8.x cluster.
If you need to increase by more than one major version, you will need to reindex or use Logstash.
Now, let’s look in detail at each of the three options for transferring data between Elasticsearch clusters.
1. Reindexing data from a remote cluster
Before starting to reindex, remember that you will need to set up appropriate mappings for all of the indices on the new cluster. To do that, you must either create the indices directly with the appropriate mappings or use index templates.
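For instance, on Elasticsearch 7.8 or later you could use a composable index template on the new cluster. This is only a sketch: the template name, index pattern, and field mappings below are hypothetical and should be replaced with your own.

# Hypothetical template – adjust the pattern and mappings to your data
PUT _index_template/company_template
{
  "index_patterns": ["companydatabase*"],
  "template": {
    "settings": {
      "number_of_shards": 1
    },
    "mappings": {
      "properties": {
        "name": { "type": "keyword" },
        "last_update_time": { "type": "date" }
      }
    }
  }
}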
Reindexing from remote — configuration required
In order to reindex from remote, you should add the configuration below to the elasticsearch.yml file of the cluster that is receiving the data, which, on Linux systems, is usually located at /etc/elasticsearch/elasticsearch.yml. The configuration to add is as follows:
reindex.remote.whitelist: "192.168.1.11:9200"
If you are using SSL, you should add the CA certificate to each node and include the following line in elasticsearch.yml on each node:
reindex.ssl.certificate_authorities: "/path/to/ca.pem"
Alternatively, you can add the line below to all Elasticsearch nodes in order to disable SSL verification. However, that approach is less recommended since it is not as secure as the previous option:
reindex.remote.whitelist: "192.168.1.11:9200"
reindex.ssl.verification_mode: none

systemctl restart elasticsearch.service
You will need to make these modifications on every node and carry out a rolling restart. For more information on how to do that, please see our guide How to perform rolling restarts in Elasticsearch.
Reindexing command
After you have defined the remote host in the elasticsearch.yml file and added the SSL certificates if necessary, you can start reindexing data with the command below:
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://192.168.1.11:9200",
      "username": "elastic",
      "password": "123456",
      "socket_timeout": "1m",
      "connect_timeout": "1m"
    },
    "index": "companydatabase"
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}
While reindexing, you may face timeout errors, so it is useful to set generous values for the socket_timeout and connect_timeout parameters shown above rather than relying on the defaults.
Now, let’s take a look at some other common errors that you may encounter when reindexing from remote.
Common errors when reindexing from remote
1. Reindexing not whitelisted
{ "error": { "root_cause": [ { "type": "illegal_argument_exception", "reason": "[192.168.1.11:9200] not whitelisted in reindex.remote.whitelist" } ], "type": "illegal_argument_exception", "reason": "[192.168.1.11:9200] not whitelisted in reindex.remote.whitelist" }, "status": 400 }
If you encounter this error, it means that you did not define the remote host's IP address or DNS name in elasticsearch.yml as described above, or that you forgot to restart the Elasticsearch services.
To fix this, add the remote host to reindex.remote.whitelist on all Elasticsearch nodes and restart the Elasticsearch services.
2. SSL handshake exception
{ "error": { "root_cause": [ { "type": "s_s_l_handshake_exception", "reason": "PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target" } ], "type": "s_s_l_handshake_exception", "reason": "PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target", "caused_by": { "type": "validator_exception", "reason": "PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target", "caused_by": { "type": "sun_cert_path_builder_exception", "reason": "unable to find valid certification path to requested target" } } }, "status": 500 }
This error means that you forgot to add the reindex.ssl.certificate_authorities setting to elasticsearch.yml as described above. To add it:
# elasticsearch.yml
reindex.ssl.certificate_authorities: "/path/to/ca.pem"
2. Transferring data using snapshots
Remember, as mentioned above, you can only restore a snapshot onto a cluster of an equal or higher version, and never with a difference of over one major version.
If you need to increase by more than one major version, you will need to reindex or use Logstash.
The following steps are required to transfer data via snapshots:
Step 1. Add the repository plugin to the first Elasticsearch cluster – In order to transfer data between clusters via snapshots, you need to ensure that the snapshot repository is accessible from both the old and the new clusters. Cloud storage repositories such as AWS S3, Google Cloud Storage, and Azure Blob Storage are generally ideal for this.
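For example, on versions where S3 support ships as a separate plugin (in newer releases it is bundled), you could install it on each node roughly as follows:

# Run from the Elasticsearch installation directory on every node
sudo bin/elasticsearch-plugin install repository-s3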
To take snapshots, please see our guide in the article below and follow the steps it describes: https://opster.com/guides/elasticsearch/how-tos/elasticsearch-snapshot/
Step 2. Restart Elasticsearch service (rolling restart).
Step 3. Create a repository for the first Elasticsearch cluster.
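As a sketch, registering an S3-backed repository on the first (source) cluster could look like the example below; the bucket name and endpoint are placeholders matching the read-only example further down.

# Writable repository on the source cluster (placeholder bucket/endpoint)
PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-analytic-data",
    "endpoint": "s3.eu-de.cloud-object-storage.appdomain.cloud"
  }
}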
Step 4. Add the repository plugin to the second Elasticsearch cluster.
Step 5. Add the repository as read-only to the second Elasticsearch cluster – You will need to add the repository by repeating the same steps that you took when creating it on the first Elasticsearch cluster.
Important note: when connecting the second Elasticsearch cluster to the same AWS S3 repository, you should define the repository as a read-only repository:
PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-analytic-data",
    "endpoint": "s3.eu-de.cloud-object-storage.appdomain.cloud",
    "readonly": "true"
  }
}
That is important because you want to prevent the risk of mixing Elasticsearch versions inside the same snapshot repository.
Step 6. Restore data to the second Elasticsearch cluster – After completing the steps above, you can restore the data and transfer it to the new cluster. Please follow the steps described in this article to restore data to the new cluster.
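For illustration, restoring a single index from the shared repository onto the second cluster could look like this; the snapshot name and index name are assumptions.

# Placeholder snapshot and index names
POST _snapshot/my_s3_repository/snapshot_1/_restore
{
  "indices": "companydatabase",
  "include_global_state": false
}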
3. Transferring data using Logstash
Before starting to transfer the data with Logstash, remember that you will need to set up appropriate mappings for all of the indices on the new cluster. To do that, you will need to either create the indices directly or use index templates.
To transfer data between two Elasticsearch clusters, you can set up a temporary Logstash server and use it to move your data between the two clusters. For small clusters, an instance with 2GB of RAM should be sufficient. For larger clusters, you can use an instance with four CPU cores and 8GB of RAM.
For guidance on installing Logstash, please see here.
Logstash configuration for transferring data from one cluster to another
A basic configuration to copy a single index from cluster A to cluster B is:
input {
  elasticsearch {
    hosts => ["192.168.1.11:9200"]
    index => "index_name"
    docinfo => true
  }
}
output {
  elasticsearch {
    hosts => "https://192.168.1.12:9200"
    index => "index_name"
  }
}
For a secured Elasticsearch cluster, you can use the configuration below:
input {
  elasticsearch {
    hosts => ["192.168.1.11:9200"]
    index => "index_name"
    docinfo => true
    user => "elastic"
    password => "elastic_password"
    ssl => true
    ssl_certificate_verification => false
  }
}
output {
  elasticsearch {
    hosts => "https://192.168.1.12:9200"
    index => "index_name"
    user => "elastic"
    password => "elastic_password"
    ssl => true
    ssl_certificate_verification => false
  }
}
Index metadata
The above configurations write to a single named index. If you want to transfer multiple indices and preserve the index names, then you will need to add the following line to the Logstash elasticsearch output:
index => "%{[@metadata][_index]}"
Also, if you want to preserve the original ID of each document, then you will need to add:
document_id => "%{[@metadata][_id]}"
Bear in mind that setting the document ID will make the data transfer significantly slower, so only preserve the original ID if you need to.
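Putting both settings together, an output section that preserves the original index names and document IDs might look like the sketch below (the host is a placeholder):

output {
  elasticsearch {
    hosts => "https://192.168.1.12:9200"
    # Keep the original index name and document ID from the source cluster
    index => "%{[@metadata][_index]}"
    document_id => "%{[@metadata][_id]}"
  }
}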
Synchronization of updates
All of the methods described above will take a relatively long period of time, and you might find that data in the original cluster has been updated while waiting for the process to complete.
There are various strategies to enable the synchronization of any updates that may have occurred during the data transfer process, and you should give some thought to these issues before starting that process. In particular, you need to think about:
- What method do you have to identify any data that has been updated/added since the start of the data transfer process (e.g., a “last_update_time” field in the data)?
- What method can you use to transfer the last piece of data?
- Is there a risk of records being duplicated? Usually, there is, unless the method you are using sets the document ID during reindexing to a known value.
The different methods to enable the synchronization of updates are described below.
1. Use of queueing systems
Some ingestion/updating systems use queues that enable you to “replay” data modifications received in the last x days. That may provide a means to synchronize any changes carried out.
2. Reindex from remote
Repeat the reindexing process for all items where “last_update_time” > x days ago. You can do this by adding a “query” parameter to the reindex request.
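As a sketch, assuming a "last_update_time" date field and a two-day window, the follow-up reindex could be limited like this:

# Assumes a last_update_time field; adjust the time window to your needs
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://192.168.1.11:9200",
      "username": "elastic",
      "password": "123456"
    },
    "index": "companydatabase",
    "query": {
      "range": {
        "last_update_time": {
          "gte": "now-2d"
        }
      }
    }
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}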
3. Logstash
In the Logstash input, you can add a query to filter all items where “last_update_time” > x days ago. However, this process will cause duplicates in non-time-series data unless you have set the document_id.
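For example, the Logstash elasticsearch input accepts a query parameter, so the synchronization run could be restricted like this (again assuming a "last_update_time" field):

input {
  elasticsearch {
    hosts => ["192.168.1.11:9200"]
    index => "index_name"
    docinfo => true
    # Assumed field name and time window – adjust to your data
    query => '{ "query": { "range": { "last_update_time": { "gte": "now-2d" } } } }'
  }
}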
4. Snapshots
It is not possible to restore only part of an index, so you would have to use one of the other data transfer methods described above (or a script) to apply any changes that have taken place since the data transfer process was carried out.
However, snapshot restore is a much quicker process than reindexing/Logstash, so it may be possible to suspend updates for a brief period of time while snapshots are transferred to avoid the problem altogether.