Overview
First introduced in Elasticsearch 1.0, doc values are an on-disk data structure created to improve the performance of analytical operations, such as aggregations and sorting.
Elasticsearch’s full-text search capabilities are heavily dependent on a data structure called the inverted index, which extracts the terms of the documents being indexed and keeps a list of the documents that contain them. This makes it easy for Elasticsearch to find documents that contain the search terms the user is looking for. The problem with this structure is that it does not perform well when we try to go in the opposite direction: finding the terms in a given document. That is when doc values come to the rescue.
In this article, we’ll explain the purpose of the doc values structure in more detail and then show how to use doc-values-only fields, which are fields that are not indexed, often because they will only be subject to aggregations or sorting.
The inverted index vs. doc values
First of all, let’s establish the difference between these two data structures: the inverted index and doc values. When you send a document to be indexed by Elasticsearch, its text fields go through the analysis process. That process applies tokenizers and filters to extract the terms that populate what is called an inverted index, a data structure that makes it possible to run an efficient full-text search.
In an inverted index, the terms extracted and analyzed from the document are stored in a table-like structure. That structure includes a list of the documents that contain each term. Although it is actually a little bit more complex than that, the structure will resemble the one shown below:
Terms | Doc 1 | Doc 2 | Doc 3 |
---|---|---|---|
brown | X | X | |
dog | X | | X |
dogs | | X | X |
fox | X | | X |
foxes | | X | |
in | | X | |
jumped | X | | X |
lazy | X | X | |
leap | | X | |
over | X | X | X |
quick | X | X | X |
summer | | X | |
the | X | | X |
* extracted from the book Elasticsearch: The Definitive Guide
Even though that structure is perfect for querying terms, it is not very efficient when you need to invert that relationship, i.e., when you need to collect all the terms of specific documents. In that case, you would have to scan the whole table looking for terms that occur in, say, Doc 1. The problem only gets worse as more terms get indexed via new documents.
That’s why a data structure with the inverted relationship is also kept alongside the inverted index: the doc values. In the case of the example, that structure would look something like this:
Doc | Terms |
---|---|
Doc 1 | brown, dog, fox, jumped, lazy, over, quick, the |
Doc 2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer |
Doc 3 | dog, dogs, fox, jumped, over, quick, the |
Now it is a lot easier to retrieve the terms for Doc 1. You just need to retrieve the respective row for that document.
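To make the two lookup directions concrete, here is a minimal Python sketch. It is only an illustration and not how Lucene stores either structure on disk: the dictionaries and the function names are our own assumptions, and the term lists are copied from the doc values table above.

```python
from collections import defaultdict

# Per-document term lists, copied from the doc values table in this article.
doc_values = {
    "Doc 1": ["brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"],
    "Doc 2": ["brown", "dogs", "foxes", "in", "lazy", "leap", "over", "quick", "summer"],
    "Doc 3": ["dog", "dogs", "fox", "jumped", "over", "quick", "the"],
}


def build_inverted_index(columns):
    """Invert the doc -> terms mapping into a term -> docs mapping."""
    inverted = defaultdict(set)
    for doc_id, terms in columns.items():
        for term in terms:
            inverted[term].add(doc_id)
    return dict(inverted)


def docs_containing(inverted, term):
    """Search direction: a single lookup in the inverted index."""
    return sorted(inverted.get(term, set()))


def terms_via_inverted_index(inverted, doc_id):
    """Opposite direction on the inverted index: every term must be scanned."""
    return sorted(term for term, docs in inverted.items() if doc_id in docs)


def terms_via_doc_values(columns, doc_id):
    """Opposite direction on doc values: a single lookup by document."""
    return columns[doc_id]


inverted_index = build_inverted_index(doc_values)
print(docs_containing(inverted_index, "fox"))
print(terms_via_inverted_index(inverted_index, "Doc 1"))
print(terms_via_doc_values(doc_values, "Doc 1"))
```

Both ways of collecting the terms of Doc 1 return the same list, but the first one has to visit every term in the index, while the second one reads a single row.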
Doc values are generated by default for almost every field type. Fields of the types text and annotated_text will not have doc values generated for them.
Doc-values-only fields
If your indices contain many numeric, date, boolean, ip, geo_point, or keyword fields, you should consider configuring the rarely queried ones not to be indexed in order to save some disk space. Those fields remain queryable through the doc values structure even when they are not indexed (meaning no index structure, such as an inverted index, was generated for them). This should be considered only for fields that are not often queried: even though they are still queryable, you’ll notice a decrease in performance, since queries that run on doc values are much slower than queries that run on index structures.
So, ultimately, what we have is an interesting tradeoff between disk usage and query performance: you save some space by not indexing those fields, but whenever you need to filter by them, you’ll pay a performance tax.
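The tradeoff can be pictured with a small Python sketch. It is a deliberately simplified model under our own assumptions: the sample values are made up, a plain dictionary stands in for the real index structure, and a list stands in for the on-disk doc values column. What it shows is only the shape of the work involved: a filter on a doc-values-only field has to visit the value of every document, while an indexed field answers with a direct lookup.

```python
from collections import defaultdict

# Hypothetical values of a numeric field; the list position is the document number.
my_field_doc_values = [7, 42, 7, 13, 42, 7]


def build_index(column):
    """Stand-in for an index structure: value -> list of matching documents."""
    index = defaultdict(list)
    for doc_id, value in enumerate(column):
        index[value].append(doc_id)
    return dict(index)


def filter_with_index(index, value):
    """Indexed field: one lookup, at the cost of storing the extra structure."""
    return index.get(value, [])


def filter_with_doc_values(column, value):
    """Doc-values-only field: nothing extra is stored, but every document is visited."""
    return [doc_id for doc_id, v in enumerate(column) if v == value]


def sort_docs_by_field(column):
    """Sorting reads the column directly, which is the job doc values were built for."""
    return sorted(range(len(column)), key=lambda doc_id: column[doc_id])


my_field_index = build_index(my_field_doc_values)
print(filter_with_index(my_field_index, 7))
print(filter_with_doc_values(my_field_doc_values, 7))
print(sort_docs_by_field(my_field_doc_values))
```

Both filters return the same documents. The difference is that the second one avoids storing `my_field_index` and pays for it with a scan on every query.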
To prevent a field from being indexed so it only has a doc values structure, you need to change its mapping to have the index property set to false, like this:
    PUT my-index-000001
    {
      "mappings": {
        "properties": {
          "my_field": {
            "type": "long",
            "index": false
          }
        }
      }
    }
On the other hand, if you are sure that you will never run aggregations on or sort by a specific field, you can turn off the generation of doc values for it, like so:
    PUT my-index-000001
    {
      "mappings": {
        "properties": {
          "my_field": {
            "type": "long",
            "doc_values": false
          }
        }
      }
    }
Just keep in mind that it is not possible to disable doc values for the wildcard field type.
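The rules of thumb from this section can be summarized in a short Python sketch. The `suggest_mapping` helper is hypothetical and is not part of Elasticsearch or any client library, and the set of type names is only a partial list we assume for illustration. It simply encodes the advice above as a function that returns the field mapping you might start from.

```python
import json

# Partial, assumed list of concrete types covered by "numeric, date, boolean,
# ip, geo_point, and keyword", which stay queryable through doc values alone.
DOC_VALUES_QUERYABLE = {
    "long", "integer", "short", "byte", "double", "float",
    "date", "boolean", "ip", "geo_point", "keyword",
}

# Types for which no doc values are generated in the first place.
NO_DOC_VALUES_TYPES = {"text", "annotated_text"}


def suggest_mapping(field_type, filtered_often, aggregated_or_sorted):
    """Hypothetical helper: suggest mapping parameters for a single field."""
    mapping = {"type": field_type}
    # Rarely filtered, but aggregated or sorted: keep doc values, skip indexing.
    if aggregated_or_sorted and not filtered_often and field_type in DOC_VALUES_QUERYABLE:
        mapping["index"] = False
    # Never aggregated or sorted: skip doc values, except where that is not
    # possible (wildcard) or not applicable (text, annotated_text).
    if (not aggregated_or_sorted
            and field_type != "wildcard"
            and field_type not in NO_DOC_VALUES_TYPES):
        mapping["doc_values"] = False
    return mapping


print(json.dumps(suggest_mapping("long", filtered_often=False, aggregated_or_sorted=True)))
print(json.dumps(suggest_mapping("long", filtered_often=True, aggregated_or_sorted=False)))
print(json.dumps(suggest_mapping("wildcard", filtered_often=True, aggregated_or_sorted=False)))
```

Note that the helper never disables both structures at once, since a field with neither indexing nor doc values could no longer be queried.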
Conclusion
Mapping a field so that it is not indexed and only has a doc values structure generated for it can be an interesting way to fine-tune your Elasticsearch cluster. Of course, this will probably only make sense once you truly get to know your indices and, more importantly, how they are used. Disabling indexing for a couple of fields can really make a difference in disk usage, especially when your documents have lots of numeric, date, boolean, and similar fields that are rarely filtered by and are mostly used for sorting or aggregations.