Introduction
To diagnose indexing errors in Elasticsearch, it is important to look at the precise error message(s) received and the cluster monitoring data.
Broadly speaking, the causes of indexing failure fall into two areas: index-related and node-related failures. This article looks at both areas in detail, examining the causes of these failures and their potential fixes.
Index-related indexing failure
Typically, index-related indexing failures stem from issues with the index mappings or settings which mean that the document(s) in question cannot be indexed. In such cases, there is no point in retrying the same operation later, since it is bound to fail again unless some action is taken to enable Elasticsearch to index the document.
Three types of index-related indexing failure are discussed in more detail below:
- Version conflict
- Mapping-related failures
- “Too many fields” errors.
1. Version conflict
All Elasticsearch documents have a “_version” parameter associated with them. If you try to update a document specifying a version that is equal to or lower than the document’s current version, then the operation will fail with an error similar to the one below:
[version_conflict_engine_exception]: [search-telemetry:search-telemetry]: version conflict, required seqNo [1266538], primary term [19]. current document has seqNo [1266539] and primary term [19]
This is a safety mechanism to stop you from trying to update a document with data that is stale because it has already been updated by another operation. This will usually happen if you try to run two “update_by_query” (or other scripted updates) concurrently. Ideally, you should try to organize your architecture in such a way that it either:
- Does not run more than one update operation concurrently
- Includes mechanisms that retry operations that fail due to version conflicts (see the example after this list).
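As a minimal sketch (the index name and document ID are placeholders), the update API accepts a retry_on_conflict parameter that transparently retries the update if the document changed between the read and write phases, while “update_by_query” accepts conflicts=proceed, which skips conflicting documents rather than aborting the whole operation:

# Retry the update up to 3 times on version conflict
POST my_index/_update/1?retry_on_conflict=3
{
  "doc": {
    "status": "processed"
  }
}

# Continue past conflicting documents instead of failing
POST my_index/_update_by_query?conflicts=proceed
{
  "script": {
    "source": "ctx._source.status = 'processed'"
  }
}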
2. Mapping-related failures
If you try to index a document that contains a field which is incompatible with the index mapping, then Elasticsearch will throw an error. Examples of incompatible fields include the following:
- Bad date formats
- Non-numeric data for numeric fields
- Bad formats for IP, GeoPoint, or GeoShape fields
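For example, a date-format mismatch can be reproduced as follows (a minimal sketch; the index name, field, and values are hypothetical):

# A date field with an explicit, strict format
PUT my_index
{
  "mappings": {
    "properties": {
      "created_at": { "type": "date", "format": "yyyy-MM-dd" }
    }
  }
}

# Rejected with a mapper_parsing_exception, because the
# value does not match the declared date format
PUT my_index/_doc/1
{
  "created_at": "21/05/2022"
}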
3. “Too many fields” error
By default, Elasticsearch will limit the number of fields in an index to 1,000 to prevent the creation of too many fields from placing a heavy burden on the Elasticsearch cluster.
If you need to get around this, potential solutions include the following:
Increase the number of fields allowed in the index
This is the quick fix, but remember that the limit is there for a reason; just setting the limit to a higher number is likely to cause performance issues in the future. If you do want to increase the limit, this is the way to do it:
PUT my_index/_settings
{
  "index.mapping.total_fields.limit": 2000
}
Disable dynamic mapping
Another way to deal with the too many fields error is to disable dynamic mapping. With “dynamic”: false, fields that are not declared explicitly in the mapping are still stored in “_source”, but they are not indexed and cannot be searched. This can be done using the following:
PUT my_index
{
  "mappings": {
    "dynamic": false,
    "properties": {
      … here you will have to define each and every field that you want to search in the index …
    }
  }
}
Use dynamic mapping rules
You can limit the dynamic creation of fields by setting up the appropriate dynamic mapping rules.
For example, if you have an Elasticsearch client that creates step_1, step_2, step_3, etc. as fields and you do not need to search this data, then you could define a dynamic template so that any field whose name begins with “step” is mapped with “index”: false.
PUT my-index
{
  "mappings": {
    "dynamic_templates": [
      {
        "disable_step": {
          "match_mapping_type": "string",
          "match": "step*",
          "mapping": {
            "index": false
          }
        }
      }
    ]
  }
}
Use “enabled”: false
If you do not want Elasticsearch to create a large number of sub-fields as part of an object, then you may want to disable the creation of sub-fields for a given object by applying “enabled”: false.
For example, to disable the creation of sub-fields of the “detail_data” object, add the following mapping:
PUT my-index
{
  "mappings": {
    "properties": {
      "detail_data": {
        "type": "object",
        "enabled": false
      }
    }
  }
}
Node-related indexing failure
Typically, node-related failures depend upon the state of the Elasticsearch cluster and the resources available at the time to deal with the indexing request. For that reason, failed operations can typically be retried, and you can requeue them to run again at a later stage.
Five types of node-related indexing failure are discussed in more detail below:
- Rejected indexing – queue full
- Circuit breaker errors
- Cluster Red
- Disk flood stage cluster block
- Cluster block.
1. Rejected indexing – queue full
Elasticsearch will reject indexing requests when the number of queued index requests exceeds the queue size (by default 1,000). If that happens, you will see a log message similar to the one below:
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000)
This indicates that your client applications are sending indexing requests at a rate that is higher than Elasticsearch’s capacity to deal with those requests.
Look at your monitoring data to see whether indexing activity is irregular or concentrated on just one or two nodes. If so, you may be able to spread the indexing burden by increasing the number of primary shards on the index, as sketched below.
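Note that the number of primary shards is fixed when an index is created, so in practice this usually means creating a new index with more shards and reindexing into it (or using the split API). A minimal sketch, with placeholder index names:

# A new index with more primary shards
PUT my_index_v2
{
  "settings": {
    "index.number_of_shards": 4
  }
}

# Copy the existing data across
POST _reindex
{
  "source": { "index": "my_index" },
  "dest": { "index": "my_index_v2" }
}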
Whatever you do, resist the temptation to simply increase queue capacity, since that only increases the resources used to maintain the queue.
You can also consider taking action to improve indexing speed, for example:
- Using queues to spread indexing activity over time
- Increasing the refresh interval, so that refreshes happen less often (see the example after this list)
- Optimizing index mappings
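For instance, the refresh interval (1s by default) can be raised as follows; “30s” is just an illustrative value, and the interval can even be set to -1 to disable refreshes entirely during a bulk load:

PUT my_index/_settings
{
  "index.refresh_interval": "30s"
}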
Actions to improve indexing speed are described in more detail in this guide.
2. Circuit breaker errors
Circuit breaker errors are generally caused by the Elasticsearch node having insufficient memory to deal with the requests that it has been sent. If your indexing request receives a circuit breaker error in response, then you should consider the following actions to address it:
- Try reducing the batch size for bulk requests, especially if you are indexing large (or irregular-sized) documents (see the sketch after this list).
- Try implementing code to limit the size of the documents being indexed.
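For reference, a bulk request is a newline-delimited sequence of action and document lines, so reducing the batch size simply means sending fewer action/document pairs per request (the index name and documents below are placeholders; a common starting point is a few MB per request, tuned empirically):

POST _bulk
{ "index": { "_index": "my_index" } }
{ "message": "first document" }
{ "index": { "_index": "my_index" } }
{ "message": "second document" }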
Also, bear in mind that circuit breaker errors can be caused by issues other than indexing itself. Other factors that could overload your nodes include high volumes of search queries and large aggregations. So, if you receive circuit breaker errors, monitor your cluster to determine the cause, and take appropriate steps to either reduce or spread the load on your nodes or increase the resources available to your cluster.
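One place to start is the node stats API, which reports the current and limit sizes of each circuit breaker on each node, along with how many times each breaker has tripped:

GET _nodes/stats/breaker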
For a full discussion on circuit breaker errors, see the following guide.
3. Cluster red
If the index status is red, then the cluster does not have a good copy of one or more of the index’s primary shards. When this happens, it will not be possible to index new data until the index status returns to at least yellow, which indicates that all of the index’s primary shards are available again (even if some replicas are not).
To detect the cluster status, look at the cluster monitoring data or run:
GET _cluster/health
If you see that the cluster is red, then normally the best thing is to wait for the cluster to recover by itself. With the above command, you can watch the percentage of active shards increase progressively; the status turns yellow once all primary shards are active, and green once all replica shards are active too. However, if you see that the cluster does not recover, then check out this guide.
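If you are scripting a recovery check, the health API can also block until a given status is reached (the timeout value here is just an example):

GET _cluster/health?wait_for_status=yellow&timeout=60s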
4. Disk flood stage cluster block
If the disk on a node is above the flood stage disk watermark (by default 95% full), then you will not be able to index data to any of the indices with shards on the node. You will see a log message that looks like the following:
[2022-05-21T11:22:36,357][WARN ][o.e.c.r.a.DiskThresholdMonitor] [node-1] flood stage disk watermark [95%] exceeded on [KFhdit6dir84735GD][node-1]... free: 16.5gb[2.9%], all indices on this node will be marked readonly
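Recent Elasticsearch versions (7.4 and later) remove this block automatically once disk usage falls back below the high watermark; on older versions, after freeing disk space you can clear it manually (a sketch, applied here to all indices at once):

PUT _all/_settings
{
  "index.blocks.read_only_allow_delete": null
}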
For a full discussion on this issue and how to fix it, check out this guide.
5. Cluster block
Disk flood is the most common type of cluster block, but it is also possible that a cluster block has been applied manually. For a discussion on how to detect and fix cluster blocks, read this guide.
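As a starting point, you can list any blocks currently in force via the cluster state API, and, if a cluster-wide read-only block was applied manually, remove it by resetting the corresponding setting (a sketch; only do this once you understand why the block was applied):

# List the blocks currently applied to the cluster and its indices
GET _cluster/state/blocks

# Remove a manually applied cluster-wide read-only block
PUT _cluster/settings
{
  "persistent": {
    "cluster.blocks.read_only": null
  }
}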