Overview
As a best practice, a single Elasticsearch shard should not exceed 50GB.
Elasticsearch does not directly enforce this limit. However, if you exceed it, Elasticsearch may be unable to relocate or recover index shards (with the possible consequence of data loss), or you may reach the Lucene hard limit of roughly 2³¹ documents per Lucene index (that is, per Elasticsearch shard).
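As a rough way to check a cluster against the 50GB guideline, the sketch below filters rows shaped like the JSON output of GET _cat/shards?format=json&bytes=b. The helper name oversized_shards and the sample rows are hypothetical, and the sketch assumes that "store" arrives as a string of bytes (or null for unassigned shards).

```python
GIB = 1024 ** 3

def oversized_shards(rows, limit_gb=50):
    """Return (index, shard, size_gb) for every shard larger than limit_gb."""
    found = []
    for row in rows:
        store = row.get("store")
        if store is None:  # unassigned shard: no size reported
            continue
        size_gb = int(store) / GIB
        if size_gb > limit_gb:
            found.append((row["index"], row["shard"], round(size_gb, 1)))
    return found

# Made-up sample rows, for illustration only.
sample = [
    {"index": "logs-all-events", "shard": "0", "prirep": "p", "store": str(72 * GIB)},
    {"index": "my_old_index", "shard": "0", "prirep": "p", "store": str(12 * GIB)},
    {"index": "my_old_index", "shard": "1", "prirep": "p", "store": None},
]

print(oversized_shards(sample))  # [('logs-all-events', '0', 72.0)]
```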
How to resolve this issue
If your shards are too large, you have three options:
1. Delete records from the index
If appropriate for your application, you may consider permanently deleting records from your index (for example old logs or other unnecessary records).
POST /my-index/_delete_by_query
{
  "query": {
    "range": {
      "@timestamp": {
        "lte": "now-5d"
      }
    }
  }
}
2. Re-index into several small indices
The following would reindex into one index for each event_type.
POST _reindex
{
  "source": {
    "index": "logs-all-events"
  },
  "dest": {
    "index": "logs-2-"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._index = 'logs-2-' + (ctx._source.event_type)"
  }
}
3. Re-index into another single index but increase the number of shards
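The number of primary shards is fixed when an index is created and cannot be changed through the _settings endpoint afterwards, so create the new index with the desired shard count first:
PUT /my_new_index
{
  "settings": {
    "index": {
      "number_of_shards": 2
    }
  }
}
Then reindex the old data into it:
POST _reindex
{
  "source": {
    "index": "my_old_index"
  },
  "dest": {
    "index": "my_new_index"
  }
}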
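To choose number_of_shards for the new index, one simple approach is to divide the current index size by a target shard size and round up. This is a minimal sketch under that assumption. The function name and the example sizes are hypothetical, and a lower target than 50GB leaves headroom for growth.

```python
import math

def primary_shards_needed(index_size_gb, target_shard_gb=50):
    """Smallest number of primary shards that keeps each shard at or below the target size."""
    if index_size_gb < 0 or target_shard_gb <= 0:
        raise ValueError("index size must be >= 0 and target shard size must be > 0")
    return max(1, math.ceil(index_size_gb / target_shard_gb))

print(primary_shards_needed(120))      # 3 shards of about 40GB each
print(primary_shards_needed(120, 30))  # 4 shards of about 30GB each
```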
Care points for reindexing
Watch for duplicate data during the reindex process
If your application uses wildcards in the index name (e.g. logs-*), search results may contain duplicates while the reindex is running, because two copies of the data exist until the process completes and the old index is removed. This can be avoided by having the application query an alias instead.
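One way to use an alias here is to have the application search the alias, then move it from the old index to the new one once the reindex has finished. The sketch below only builds the request body for POST _aliases, which applies its listed actions atomically. The alias name my_alias and the helper name are assumptions for illustration.

```python
import json

def alias_swap_actions(alias, old_index, new_index):
    """Body for POST _aliases that moves `alias` from old_index to new_index in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Hypothetical names; send the printed JSON as the body of POST _aliases.
body = alias_swap_actions("my_alias", "my_old_index", "my_new_index")
print(json.dumps(body, indent=2))
```

Because searches go through the alias, they only ever see one copy of the data, before or after the swap.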
Watch for modifications and updates during the reindex process
If data may be updated while the reindex is running, you need a procedure to capture those changes and ensure that the new indices reflect every modification made during the transition period. For more information, please refer to Opster tips on reindexing.
How to avoid the issue
There are various mechanisms to optimize shard size.
Apply ILM (index lifecycle management)
With ILM, Elasticsearch automatically rolls over to a new index when the current index reaches a given maximum size or age. ILM suits documents that never need updating, but it is a poor fit for documents that require frequent updates, because to update a document you must first know which of the earlier indices contains it.
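As an illustration, the sketch below builds a minimal policy body for PUT _ilm/policy/<policy_name> with a rollover action in the hot phase. It assumes the max_size and max_age rollover conditions. Since max_size applies to the total size of all primary shards in the index, it is set here to the number of primaries times the per-shard target. The function name and default values are hypothetical, and the rollover options available depend on your Elasticsearch version.

```python
import json

def rollover_policy(primary_shards, shard_gb=50, max_age="30d"):
    """ILM policy body that rolls over before the average primary shard exceeds shard_gb."""
    if primary_shards < 1:
        raise ValueError("primary_shards must be at least 1")
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            # total size of all primaries, not the size of one shard
                            "max_size": f"{primary_shards * shard_gb}gb",
                            "max_age": max_age,
                        }
                    }
                }
            }
        }
    }

policy = rollover_policy(primary_shards=2)
print(json.dumps(policy, indent=2))
```

Recent Elasticsearch versions also offer a max_primary_shard_size rollover condition, which expresses the per-shard limit directly.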
Application strategy
You can design your application so that shards are created at a sensible size:
Date based approach
myindex-yyyy.MM.dd
or myindex-yyyy.MM
This has the advantage that it is easy to delete old data simply by deleting older indices.
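A minimal sketch of the date-based naming, assuming the application picks the target index at write time. The helper names are hypothetical. The last helper shows why cleanup is easy: the indices to delete can be identified from their names alone.

```python
from datetime import date, datetime, timedelta

def daily_index(prefix, day):
    """Index name in the form prefix-yyyy.MM.dd."""
    return f"{prefix}-{day:%Y.%m.%d}"

def monthly_index(prefix, day):
    """Index name in the form prefix-yyyy.MM."""
    return f"{prefix}-{day:%Y.%m}"

def expired_daily_indices(names, prefix, today, keep_days):
    """Names of daily indices dated more than keep_days before today."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        try:
            day = datetime.strptime(name, f"{prefix}-%Y.%m.%d").date()
        except ValueError:  # not one of this prefix's daily indices
            continue
        if day < cutoff:
            expired.append(name)
    return expired

today = date(2024, 3, 7)
print(daily_index("myindex", today))    # myindex-2024.03.07
print(monthly_index("myindex", today))  # myindex-2024.03
```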
Attribute based approach
myindex-<clientID>
This has the advantage of allowing you to know exactly which index to search/update if you know the client ID.
The disadvantage is that different clients may produce very different volumes of documents (and hence different shard sizes). This can be mitigated to some extent by adjusting the number of shards per client, but you may create the opposite problem (oversharding) if you have a large number of clients.
ID range based approach
This may be possible where documents carry an ID coming from another system or application (such as a SQL ID or client ID) that lets you distribute documents evenly.
e.g. myindex-<last_digit_of_ID>
The above would give 10 indices that are likely to be evenly populated. The mapping is also deterministic: given a document's ID, you know which index to update.
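The routing rule can be sketched as follows, assuming integer IDs. The helper name is hypothetical, and the even spread shown holds only when the last digits of real IDs are themselves evenly distributed, as they are for sequential IDs.

```python
from collections import Counter

def index_for_id(prefix, doc_id):
    """Route a document to prefix-<last digit of its integer ID>."""
    return f"{prefix}-{abs(doc_id) % 10}"

print(index_for_id("myindex", 48213))  # myindex-3

# Sequential IDs 1..1000 spread evenly: each of the 10 indices gets 100 documents.
counts = Counter(index_for_id("myindex", i) for i in range(1, 1001))
print(sorted(counts.items()))
```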