What is synthetic _source?
When documents are indexed, field values from the _source are copied into separate per-field lists (data structures on disk that are optimized for lookups), depending on data type, so that each field can be searched independently. When matching values are found in these small lists, the original document is returned. Because only a small list is searched, rather than every field value in the entire document, searches take less time. This speeds up search, but the data is stored twice: once in the small lists and once in the original document.
Synthetic _source is an index mode that changes how documents are processed at ingest time so that data is not duplicated and storage space is saved. The independent lists are still created, but the original raw document is not kept. Instead, after a value is found, the _source content is reconstructed from the data in the small lists. Because raw documents are not stored and only the lists are kept on disk, a large amount of storage space can be saved.
Note that because reconstruction happens at the moment the document is retrieved, returning the _source takes additional time. Synthetic _source saves storage space at the cost of slower searches.
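The trade-off described above can be sketched with a toy Python model. This is not Elasticsearch code: the class names `StoredSourceIndex` and `SyntheticSourceIndex` are hypothetical, and plain dictionaries stand in for the per-field lists. It only illustrates that one variant keeps a raw copy of every document while the other rebuilds documents from the lists at read time.

```python
from collections import defaultdict


class StoredSourceIndex:
    """Toy model: keeps per-field value lists AND the raw document."""

    def __init__(self):
        self.columns = defaultdict(dict)  # field -> {doc_id: value}
        self.raw_docs = {}                # doc_id -> original document

    def index(self, doc_id, doc):
        self.raw_docs[doc_id] = dict(doc)  # second copy of the same data
        for field, value in doc.items():
            self.columns[field][doc_id] = value

    def search(self, field, value):
        # Only the list for one field is scanned; the stored document is returned.
        return [self.raw_docs[d] for d, v in self.columns[field].items() if v == value]


class SyntheticSourceIndex(StoredSourceIndex):
    """Toy model: keeps only the per-field lists and rebuilds documents on read."""

    def index(self, doc_id, doc):
        for field, value in doc.items():
            self.columns[field][doc_id] = value  # no raw copy is kept

    def _rebuild(self, doc_id):
        # Extra work at retrieval time: gather this document's value from every list.
        return {f: vals[doc_id] for f, vals in sorted(self.columns.items()) if doc_id in vals}

    def search(self, field, value):
        return [self._rebuild(d) for d, v in self.columns[field].items() if v == value]


doc = {"host": "Host_01", "status": "up"}
stored, synthetic = StoredSourceIndex(), SyntheticSourceIndex()
stored.index(1, doc)
synthetic.index(1, doc)

print(stored.search("host", "Host_01"))     # returned from the stored raw copy
print(synthetic.search("host", "Host_01"))  # rebuilt from the per-field lists
print(len(synthetic.raw_docs))              # 0: nothing is stored twice
```

Both searches return the same document, but only the first index pays for a duplicate copy, and only the second pays for the rebuild step on every hit.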
Please note:
- Using synthetic _source mode requires that a “doc values” list be created for each field. A field can be indexed as more than one data type by using multi-fields in the index mapping. For example, adding “keyword” as a second data type ensures that a “doc values” list is created for that field.
- Because the _source value is reconstructed from the “doc values” lists, the result can differ from the original JSON. For example, arrays are moved to leaves.
Example code:
This document
PUT index/_doc/1
{
  "test": [
    { "field": 1 },
    { "field": 2 }
  ]
}
will be reconstructed as:
{
  "test": {
    "field": [1, 2]
  }
}
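The "arrays are moved to leaves" rule can be mimicked in a few lines of Python. This is a simplified sketch, not Elasticsearch's implementation: the function name `arrays_to_leaves` is made up for illustration, lists that mix objects and plain values are left untouched, and any other changes Elasticsearch may apply during reconstruction (such as reordering values) are not modeled.

```python
def arrays_to_leaves(value):
    """Merge arrays of objects into one object whose leaf fields hold the arrays."""
    if isinstance(value, dict):
        return {key: arrays_to_leaves(child) for key, child in value.items()}
    if isinstance(value, list) and value and all(isinstance(item, dict) for item in value):
        merged = {}
        for item in value:
            for key, child in item.items():
                # Collect every value seen under the same key, flattening lists.
                merged.setdefault(key, []).extend(child if isinstance(child, list) else [child])
        # Recurse so arrays of nested objects are pushed further down.
        return {key: arrays_to_leaves(children) for key, children in merged.items()}
    return value


original = {"test": [{"field": 1}, {"field": 2}]}
print(arrays_to_leaves(original))  # {'test': {'field': [1, 2]}}
```

The object structure survives, but the information about which values originally belonged to the same array element is lost, so it is worth checking whether anything downstream depends on that grouping.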
How to configure an index in synthetic _source mode
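To put an index in synthetic _source mode, set the _source mode to “synthetic” in its mappings:
PUT index
{
  "mappings": {
    "_source": { "mode": "synthetic" }
  }
}
Test code: In the test below, an index in synthetic _source mode is compared to a standard index.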
The standard index uses a multi-field to illustrate how a document can be retrieved using full text search and aggregations, and how the values of disabled fields remain in the _source content.
PUT test_standard
{
  "mappings": {
    "properties": {
      "disabled_field": { "enabled": false },
      "multi_field": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}
Let’s ingest some sample documents:
PUT test_standard/_doc/1
{
  "multi_field": "Host_01",
  "disabled_field": "Required for storage 01"
}

PUT test_standard/_doc/2
{
  "multi_field": "Host_02",
  "disabled_field": "Required for storage 02"
}

PUT test_standard/_doc/3
{
  "multi_field": "Host_03",
  "disabled_field": "Required for storage 03"
}
A full text search retrieves the document together with its _source content.
GET test_standard/_search
{
  "query": {
    "match": { "multi_field": "host_01" }
  }
}
Results: The document is retrieved using full text search on a field that is analyzed. This returns all values available in the _source, including fields that have been disabled.
{
  "took": 17,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.9808291,
    "hits": [
      {
        "_index": "test_standard",
        "_id": "1",
        "_score": 0.9808291,
        "_source": {
          "multi_field": "Host_01",
          "disabled_field": "Required for storage 01"
        }
      }
    ]
  }
}
Here, the index in synthetic _source mode uses a multi-field to illustrate how a “text” field can be backed by a “doc values” list and how values in disabled fields are not available.
PUT test_synthetic
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "keyword_field": { "type": "keyword" },
      "multi_field": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "disabled_field": { "enabled": false }
    }
  }
}
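Before relying on a mapping like this, it can help to check which fields have a “doc values” list to reconstruct from. The Python sketch below applies a deliberately simplified rule to the properties of test_synthetic: a field counts as covered if it, or one of its multi-field sub-fields, has a type that creates doc values. The helper `fields_without_doc_values` and the `DOC_VALUE_TYPES` set are assumptions made for this illustration; the actual rules depend on the Elasticsearch version and field type.

```python
# Field types assumed, for this sketch only, to create doc values by default.
DOC_VALUE_TYPES = {"keyword", "long", "integer", "double", "boolean", "date", "ip"}


def has_doc_values(config):
    """True if this field definition creates a doc values list under the sketch's rule."""
    return config.get("type") in DOC_VALUE_TYPES and config.get("doc_values", True)


def fields_without_doc_values(properties):
    """Return the top-level fields that this sketch could not reconstruct."""
    missing = []
    for name, config in properties.items():
        if config.get("enabled") is False:
            missing.append(name)  # disabled fields are not indexed at all
            continue
        sub_fields = config.get("fields", {}).values()
        if not (has_doc_values(config) or any(has_doc_values(sub) for sub in sub_fields)):
            missing.append(name)
    return missing


synthetic_properties = {
    "keyword_field": {"type": "keyword"},
    "multi_field": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    "disabled_field": {"enabled": False},
}
print(fields_without_doc_values(synthetic_properties))  # ['disabled_field']
```

In this mapping only disabled_field is reported, which matches the behavior shown in the search results: its values do not appear in the reconstructed _source.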
Let’s ingest some sample documents:
PUT test_synthetic/_doc/1
{
  "keyword_field": "Host_01",
  "disabled_field": "Required for storage 01",
  "multi_field": "Some info about computer 1"
}

PUT test_synthetic/_doc/2
{
  "keyword_field": "Host_02",
  "disabled_field": "Required for storage 02",
  "multi_field": "Some info about computer 2"
}

PUT test_synthetic/_doc/3
{
  "keyword_field": "Host_03",
  "disabled_field": "Required for storage 03",
  "multi_field": "Some info about computer 3"
}
An exact match is required when searching “keyword” data types. Additionally, the values in the disabled field are no longer available.
GET test_synthetic/_search
{
  "query": {
    "match": { "keyword_field": "Host_01" }
  }
}
Results: The document is retrieved by using an exact match on a field that is not analyzed. This will reconstruct the _source value. Values for fields that have been disabled are unavailable.
{
  "took": 58,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "max_score": 0.9808291,
    "hits": [
      {
        "_index": "test_synthetic",
        "_id": "1",
        "_score": 0.9808291,
        "_source": {
          "keyword_field": "Host_01",
          "multi_field": "Some info about computer 1"
        }
      }
    ]
  }
}
Full text search is still available on the multi-field. The “text” field is analyzed into an inverted index at ingest time as usual, while the value shown in the _source is reconstructed from the “doc values” list of its “keyword” sub-field.
GET test_synthetic/_search
{
  "query": {
    "match": { "multi_field": "info" }
  }
}
Results: All documents containing the term “info” are retrieved. Values for fields that have been disabled are unavailable.
{
  "took": 10,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 3, "relation": "eq" },
    "max_score": 0.13353139,
    "hits": [
      {
        "_index": "test_synthetic",
        "_id": "1",
        "_score": 0.13353139,
        "_source": {
          "keyword_field": "Host_01",
          "multi_field": "Some info about computer 1"
        }
      },
      {
        "_index": "test_synthetic",
        "_id": "2",
        "_score": 0.13353139,
        "_source": {
          "keyword_field": "Host_02",
          "multi_field": "Some info about computer 2"
        }
      },
      {
        "_index": "test_synthetic",
        "_id": "3",
        "_score": 0.13353139,
        "_source": {
          "keyword_field": "Host_03",
          "multi_field": "Some info about computer 3"
        }
      }
    ]
  }
}
Good to know
- Use synthetic _source mode, and the removal of the stored _source content that comes with it, with caution. Any functionality that works from the _source content, such as runtime fields or Painless scripts that require _source, will not be available. In addition, values in fields that are not enabled will be unavailable after indexing.
- In an index mapping, field names and their data types can be predefined. Each data type gives a field different capabilities. For example, a field with a “keyword” data type allows for aggregations, exact match search, and sorting. “Keyword” and other non-analyzed data types create what is called a “doc values” list, containing the values from each of the indexed documents. A field with a “text” data type is processed by an analyzer and creates an “inverted index” list, which allows for full text searches.
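The difference between the two data types can be illustrated with a toy Python model. The `analyze` function here is only a rough stand-in for an analyzer (lowercase, then split into word tokens), and `match_text` and `match_keyword` are hypothetical helpers rather than Elasticsearch APIs.

```python
import re
from collections import defaultdict


def analyze(text):
    """Rough stand-in for an analyzer: lowercase, then split into word tokens."""
    return re.findall(r"\w+", text.lower())


docs = {
    1: "Some info about computer 1",
    2: "Some info about computer 2",
    3: "Host_03",
}

# "text" behavior: token -> set of document ids (an inverted index).
inverted_index = defaultdict(set)
for doc_id, value in docs.items():
    for token in analyze(value):
        inverted_index[token].add(doc_id)

# "keyword" behavior: the whole value is kept unchanged, one entry per document.
doc_values = dict(docs)


def match_text(query):
    """Ids of documents containing at least one token of the analyzed query."""
    hits = set()
    for token in analyze(query):
        hits |= inverted_index.get(token, set())
    return sorted(hits)


def match_keyword(query):
    """Ids of documents whose whole value equals the query exactly."""
    return sorted(doc_id for doc_id, value in doc_values.items() if value == query)


print(match_text("INFO"))        # [1, 2]  analysis makes the match case-insensitive
print(match_text("host_03"))     # [3]
print(match_keyword("host_03"))  # []      exact match fails on different case
print(match_keyword("Host_03"))  # [3]
```

This is why the full text query for host_01 finds Host_01 in the standard index, while a search on a “keyword” field needs the value exactly as it was indexed.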