How can we choose the correct number of shards per index?
An Elasticsearch index consists of one or more primary shards. As of Elasticsearch version 7, the current default value for the number of primary shards per index is 1. In earlier versions, the default was 5 shards.
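If you need a different number, you have to set it when the index is created, since the number of primary shards cannot be changed afterwards without reindexing or the shrink/split APIs. Here is a minimal sketch using Python’s requests library against a hypothetical local cluster; the index name and shard counts are illustrative:

```python
import requests

# Create an index with an explicit number of primary shards.
# number_of_shards is fixed at creation time.
resp = requests.put(
    "http://localhost:9200/my-index",   # hypothetical cluster and index name
    json={
        "settings": {
            "number_of_shards": 3,      # primary shards (default is 1 since v7)
            "number_of_replicas": 1,
        }
    },
)
print(resp.json())                      # {'acknowledged': True, ...}
```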
Finding the right number of primary shards for your indices, and the right size for each shard, depends on a variety of factors: the amount of data that you have, your use case(s), your queries, the complexity of your documents and, of course, your SLAs.
There are two basic approaches to finding the right shard size for your use case:
- Benchmarking – running experiments
- Generic guidelines – following common best practices
Before we dive deeper into the approaches and sizing, let’s quickly go over what shards are used for in Elasticsearch and why it’s important to set the correct number of shards per index. This article will focus on primary shards.
Having multiple primary shards when indexing
Shards are basically used to parallelize work on an index. When you send a bulk request to index a list of documents, each document is routed to one of the available primary shards.
So, if you have 5 primary shards and send a bulk request with 100 documents, each shard will have to index roughly 20 documents, and the shards do this in parallel. Once all documents are indexed, the response is sent back to the client and the next batch of documents can be indexed.
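As a rough sketch, this is what such a bulk request looks like through the REST API, again with Python’s requests against the hypothetical cluster from above. The routing formula in the comment is the simplified default:

```python
import json
import requests

# Build an NDJSON bulk body: one action line plus one source line per document.
# Each document is routed to a primary shard, by default roughly via
# hash(_id) % number_of_primary_shards, so 100 docs spread over 5 shards
# come out at about 20 documents per shard.
docs = [{"title": f"doc {i}"} for i in range(100)]
lines = []
for i, doc in enumerate(docs):
    lines.append(json.dumps({"index": {"_index": "my-index", "_id": str(i)}}))
    lines.append(json.dumps(doc))
body = "\n".join(lines) + "\n"          # the bulk API requires a trailing newline

resp = requests.post(
    "http://localhost:9200/_bulk",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
)
result = resp.json()
print(result["took"], "ms, errors:", result["errors"])
```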
Indexing is usually quicker when you have more shards, because each document is indexed by exactly one primary shard, so more shards means more indexing work running in parallel. If fast ingestion is your biggest concern, you should have at least one shard per data node, or a shard count that is a multiple of the number of data nodes.
For instance, if you have 3 data nodes, you should have 3, 6 or 9 shards. This is very important: if you have 3 data nodes and decide to use 4 primary shards instead, your indexing will actually be slower than with 3 shards, because one node (i.e. one server) will host two shards and do double the work, while the other two sit idle once their single shard is done.
As a rule of thumb for quick indexing, you should have at least as many primary shards as data nodes, assuming that the number of data nodes itself fits your use case.
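To check how the primary shards actually landed on your data nodes, the _cat/shards API is handy. A quick sketch, reusing the illustrative index name:

```python
import requests

# List each shard of the index together with the node it lives on.
# With 4 primaries on 3 nodes, one node would show up twice here.
resp = requests.get(
    "http://localhost:9200/_cat/shards/my-index",
    params={"v": "true", "h": "shard,prirep,node,store"},
)
print(resp.text)
```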
Having multiple primary shards when searching
How does parallelization work for searching? Each shard contains only a section of the documents, meaning that each shard contains only a partial index. You can think of it as a book with multiple volumes – each volume has its own partial index.
When you submit a search request, the node that receives it acts as the coordinating node. Using the cluster state, it looks up which shards belong to the index and dispatches the query to all of them.
Each shard then locally computes a list of results with document IDs and scores (the “query phase”). These partial results are sent back to the coordinating node, which merges them into a single sorted result. Then the second phase begins (the “fetch phase”), in which the actual documents are retrieved.
If you go with the default size of 10, the top 10 documents are fetched. The fetch requests go only to the shards that hold those top documents.
Once all documents are back, some post-processing such as highlighting may be performed and then the entire response is sent back to the client.
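Putting the two phases together, a basic search request looks like this. A sketch against the same hypothetical index, with the default size of 10 made explicit:

```python
import requests

# The coordinating node fans this query out to every shard of the index,
# merges the per-shard top hits (query phase), then retrieves the top
# `size` documents from the shards that hold them (fetch phase).
resp = requests.post(
    "http://localhost:9200/my-index/_search",
    json={"query": {"match": {"title": "doc"}}, "size": 10},
)
hits = resp.json()["hits"]
print(hits["total"], [h["_id"] for h in hits["hits"]])
```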
Clearly, parallelizing a query comes with some overhead.
At this point you may start to wonder: what is the right number of shards for the fastest search? If you don’t have much data, the answer is as few shards as possible.
But how much data is too much? The sweet spot is not easy to define. For fast searching, stay with just 1 shard until it becomes more efficient to parallelize the work across multiple nodes or threads. There are some general guidelines in the last section of this article.
For a search use case, query performance is usually not the only important factor. You must also take into account relevancy, and thus the score. Scores can differ slightly when the same data is spread across multiple shards, because scoring statistics such as term frequencies are computed per shard.
Write vs. read access
As you can see, we already have a conflict: fast indexing benefits from more primary shards, while fast searching is usually better served by fewer.
It comes down to priorities. You have to decide which is more important to you: fast indexing or fast searching. Elasticsearch is built to support both but when it comes to optimizing you have to know your priorities.
Ways to determine the ideal shard size
1. Benchmarking
The best way to determine the ideal shard size is by running experiments. Pick a server that matches your future production hardware as closely as possible and then:
- Configure a 1 node Elasticsearch cluster.
- Index your production data.
- Measure the indexing speed with a benchmarking tool (e.g. Rally); a minimal timing sketch follows after this list.
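If setting up a full Rally track is too much for a first approximation, a crude stand-in is to time a fixed bulk-indexing workload yourself. A minimal sketch, assuming the hypothetical local test cluster and an illustrative document shape:

```python
import json
import time
import requests

def bulk_index(url, index, docs):
    """Bulk-index the given documents and return the elapsed wall-clock time."""
    lines = []
    for i, doc in enumerate(docs):
        lines.append(json.dumps({"index": {"_index": index, "_id": str(i)}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"
    start = time.perf_counter()
    resp = requests.post(
        f"{url}/_bulk",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
    )
    resp.raise_for_status()
    return time.perf_counter() - start

docs = [{"message": f"log line {i}"} for i in range(10_000)]
elapsed = bulk_index("http://localhost:9200", "bench-index", docs)
print(f"{len(docs) / elapsed:.0f} docs/sec")
```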
If the ingestion time meets your SLAs then you’re fine and you know your future setup. After that, benchmark your queries. If they are fast enough, you’re good to go.
Otherwise repeat the benchmarking with more shards or more nodes until you reach your SLAs.
This is the most accurate way to determine the shard size for your use case.
If this is too complicated, you can start with more shards and/or more nodes and scale down once you’re in production and know more about what you actually need.
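One built-in way to reduce the shard count later is the shrink API. A sketch, reusing the illustrative index; note that in practice a copy of every shard must first be relocated to a single node, which is omitted here for brevity:

```python
import requests

base = "http://localhost:9200"          # hypothetical local cluster

# The index has to be read-only before it can be shrunk.
requests.put(f"{base}/my-index/_settings",
             json={"index.blocks.write": True})

# Shrink to fewer primaries; the target shard count must be a factor
# of the source index's count (e.g. 6 -> 3, 2 or 1).
resp = requests.post(
    f"{base}/my-index/_shrink/my-index-small",
    json={"settings": {"index.number_of_shards": 1}},
)
print(resp.json())
```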
2. Generic guidelines
Often it is not possible to run experiments. Depending on your company, you might only get your servers provisioned once the cost can be estimated for the client.
That’s why it’s good to also have some generic guidelines and best practices that others have worked with successfully.
For logging, shard sizes between 10 and 50 GB usually perform well.
For search operations, 20-25 GB is usually a good shard size.
Another rule of thumb takes your overall heap size into account: you should aim for at most 20 shards per GB of heap on each node.
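The arithmetic is simple. A sketch with an illustrative 30 GB heap:

```python
# Shard budget per the heap-based rule of thumb:
# at most ~20 shards per GB of JVM heap on a data node.
heap_gb = 30            # illustrative heap size of one node
print(20 * heap_gb)     # -> at most 600 shards on that node
```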