Quick links
- Introduction
- How to enable remote-backed storage
- How to recover data from remote repositories
- Known Limitations
- Conclusion
Remote-backed storage is an experimental feature. Consequently, using remote-backed storage in a production environment is not recommended. For updates on the progress of remote-backed storage, refer to GitHub.
Introduction
There are two main ways to back up your data:
- Creating snapshots
- Adding more replica nodes
Snapshots may not be reliable because the data ingested between snapshots can be lost. On the other hand, replica nodes are not optimal if users don’t plan to run many searches on them.
OpenSearch has users covered with the Remote Backed Storage feature, which stores segments and/or the translog in object storage for recovery.
Every time a CRUD (create/update/delete) operation is sent to the cluster, it goes through a three-step process:
- Index: Documents are written to the in-memory buffer and to the translog on disk. Documents are not yet searchable at this point.
- Refresh: Documents are converted into in-memory segments and become searchable.
- Flush: The translog is cleared, and segments are persisted to disk.
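The refresh and flush steps can also be triggered manually through the corresponding REST APIs, which is a handy way to observe the lifecycle above (the index name my-index is a placeholder):

```shell
# Force a refresh: buffered documents become searchable segments
curl -X POST "https://localhost:9200/my-index/_refresh" -ku admin:admin

# Force a flush: segments are persisted to disk and the translog is cleared
curl -X POST "https://localhost:9200/my-index/_flush" -ku admin:admin
```

In normal operation both happen automatically on a schedule, so these calls are only needed for experimentation or troubleshooting.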
Remote Backed Storage enables users to copy the shard segments and/or translog to a remote storage repository instead of a node’s disk.
Segment replication must be enabled for Remote Backed storage to work.
The remote-backed storage feature supports two levels of durability:
- Refresh Level durability: Segment files are uploaded to a remote storage after every refresh.
- Request Level durability: Translogs are uploaded before acknowledging requests.
How to enable remote-backed storage
Go to config/jvm.options, and add the following lines:
-Dopensearch.experimental.feature.replication_type.enabled=true
-Dopensearch.experimental.feature.remote_store.enabled=true
Alternatively, users can pass the flags through an environment variable when starting OpenSearch:
OPENSEARCH_JAVA_OPTS="-Dopensearch.experimental.feature.replication_type.enabled=true -Dopensearch.experimental.feature.remote_store.enabled=true" ./bin/opensearch
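Before creating a remote-backed index, the remote repositories it refers to must already be registered. As a minimal sketch, assuming a shared-filesystem repository (the names segment-repo and translog-repo match the index settings used in this post; the locations are hypothetical and must be listed under path.repo in opensearch.yml):

```shell
# Register a repository for segment files (location must be allowed by path.repo)
curl -X PUT "https://localhost:9200/_snapshot/segment-repo" -ku admin:admin \
  -H 'Content-Type: application/json' \
  -d'{ "type": "fs", "settings": { "location": "/mnt/segment-repo" } }'

# Register a repository for the translog
curl -X PUT "https://localhost:9200/_snapshot/translog-repo" -ku admin:admin \
  -H 'Content-Type: application/json' \
  -d'{ "type": "fs", "settings": { "location": "/mnt/translog-repo" } }'
```

In production, an object store repository type (such as S3 via the repository-s3 plugin) would typically be used instead of fs.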
Let’s configure an index with maximum durability. So, every time a document is indexed, translog and segments will be uploaded to the repository:
curl -X PUT "https://localhost:9200/my-index?pretty" -ku admin:admin -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "replication": {
        "type": "SEGMENT"
      },
      "remote_store": {
        "enabled": true,
        "repository": "segment-repo",
        "translog": {
          "enabled": true,
          "repository": "translog-repo",
          "buffer_interval": "300ms"
        }
      }
    }
  }
}'
replication.type must be set to SEGMENT because Remote Backed Storage builds on the segment replication feature. The remote store is enabled for both segments and the translog.
*Note that replicas are set to 0 to demonstrate that replicas aren’t required for this feature; Remote Backed Storage provides durability. The number of replicas should be based on availability requirements.
Users can use the same repository for both segments and translog if they choose to.
Finally, buffer_interval defines how long OpenSearch waits between translog uploads, accumulating as many operations as possible to optimize the writes.
*Note that setting translog.enabled to true is currently an irreversible operation.
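To confirm that the index was created with the expected remote store configuration, the settings can be read back:

```shell
# Inspect the index settings and verify the remote_store block
curl -X GET "https://localhost:9200/my-index/_settings?pretty" -ku admin:admin
```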
How to recover data from remote repositories
Users can recover data from the remote repository following these steps:
1. Close the index:
curl -X POST "https://localhost:9200/my-index/_close" -ku admin:admin
2. Use the _remotestore API:
curl -X POST "https://localhost:9200/_remotestore/_restore" -ku admin:admin -H 'Content-Type: application/json' -d' { "indices": ["my-index"] } '
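After the restore request is accepted, shard recovery can be monitored, and once the index is open again its documents can be queried. A minimal sketch:

```shell
# Monitor shard recovery progress for the restored index
curl -X GET "https://localhost:9200/_cat/recovery/my-index?v" -ku admin:admin

# Once recovery completes, verify the documents are back
curl -X GET "https://localhost:9200/my-index/_search?pretty" -ku admin:admin
```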
Known Limitations
Writing data to a remote store can be a high-latency operation compared to writing to a local file system, which may impact both indexing throughput and latency. For performance benchmarking results, refer to [Remote Store] Performance test setup for 2.3 release · Issue #3148 · opensearch-project/OpenSearch on GitHub.
Conclusion
OpenSearch’s Remote Backed Storage feature provides a reliable solution for data backups by storing segments and translogs in external object storage for recovery. It offers two levels of durability, Refresh Level and Request Level, to ensure that data is safely stored. By enabling the Remote Backed Storage feature, users can avoid the limitations of snapshots or of adding more replica nodes.
The feature leverages segment replication and allows segments and the translog to be stored in the same repository. While writing data to a remote store may add latency, the benefits of reliable data backups outweigh the potential impact on indexing throughput. OpenSearch’s Remote Backed Storage feature provides peace of mind for users who prioritize data protection.