Quick links
- Introduction
- When to use transforms in Elasticsearch
- Transform APIs
- Transforms limitations
- Notes and good things to know
Introduction
Many Elasticsearch indices are structured as streams of events, where each event is a separate document. You can use transforms to summarize this data into a structure that is more logical and conducive to analysis.
Transforms can be used to turn existing Elasticsearch indices into summarized indices. Alternatively, you can use transforms to identify the most recent documents from a set of indices for a given field value.
The transform API has several uses:
– Data transformations: The transform API is used to transform existing data stored in an index into a new index with a specified structure.
– Batch processing: The API enables users to process large datasets in batches, making it suitable for big data use cases.
– Aggregation and normalization: The API allows users to aggregate and normalize data, making it easier to analyze and report on.
– Continuous data sync: Transformations can be run as continuously running tasks so that the transformation indices are synced on a regular schedule, allowing users to quickly and easily analyze large datasets.
– Data analysis: The transformed data can be used for further analysis, reporting, and machine learning purposes.
The transformed data is stored in a new index so the original data remains unchanged.
Data in Elasticsearch can be transformed using several methods, such as ingest node pipelines or transforms. To learn more about data transformation in Elasticsearch, it’s recommended to read “How to leverage ingest pipelines to transform data transparently in Elasticsearch” and “When you should transform your data instead of using aggregations.”
When to use
A standard Elasticsearch aggregation will provide information about the top N keys. However, in some cases, you may need the complete index to be aggregated rather than just the top keys. This would typically be the case in machine learning or when you want to search, sort, or filter the results of one or more aggregations.
In addition, transforms can be used when you want aggregation results to be sorted by a pipeline aggregation. Pipeline aggregations cannot be used for sorting because they run during the reduce phase, after all other aggregations have completed.
Transforms can be used when you need to optimize queries by creating summary tables. It might be more effective to develop a transform to cache results if, for instance, your high-level dashboard is accessed by a lot of users and performs a sophisticated aggregation over a big dataset. This would mean that the aggregation query doesn’t need to be executed by every user.
Transform APIs
All transform endpoints have the following base: _transform/.
Below we will review the commands and details for:
- Create transform API
- Delete transform API
- Get transform API
- Get transform statistics API
- Preview transform API
- Reset transform API
- Start transform API
- Stop transform API
- Update transform API
- Upgrade transform API
Create transform API
The following request is used to create a transform in Elasticsearch:
PUT _transform/<transform_id>
The “create transform API” requires the following privileges:
- cluster: manage_transform (this privilege is granted by the transform_admin built-in role).
- source indices: read, view_index_metadata.
- destination index: read, create_index, index. The delete privilege is also required if a retention_policy is configured.
The <transform_id> is a required path parameter of type “string” that represents a unique identifier for the transform. This identifier may also include underscores, hyphens, and lowercase alphanumeric characters (a-z and 0-9). It must begin and conclude with alphanumeric characters and have a maximum character length of 64.
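The naming rules above can be checked before a request is sent. The following is a minimal Python sketch, assuming the rules are exactly as stated in this article; the function name is_valid_transform_id is hypothetical and is not part of any Elasticsearch client.

```python
import re

# Pattern derived from the rules above: lowercase letters, digits, hyphens and
# underscores; first and last characters alphanumeric; at most 64 characters.
_TRANSFORM_ID = re.compile(r"[a-z0-9](?:[a-z0-9_-]{0,62}[a-z0-9])?")


def is_valid_transform_id(transform_id):
    """Return True if the ID satisfies the naming rules described above."""
    return _TRANSFORM_ID.fullmatch(transform_id) is not None


for candidate in ["books_transform1", "_books", "Books", "books-"]:
    print(candidate, is_valid_transform_id(candidate))
```

Only the first candidate passes: the others start with an underscore, contain an uppercase letter, or end with a hyphen.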
The defer_validation query parameter is an optional parameter of type “Boolean.” Deferrable validations (such as checking that the source index exists) will not run when this parameter is set to true. This is useful if the source index has not yet been created at the time the transform is created.
The timeout query parameter is an optional parameter of type “time,” and its default value is 30s (30 seconds).
The create transform API body request can have the following elements:
- description: optional text that describes the transform.
- dest: a required object that represents the destination for the transform. dest has two properties: index and pipeline. The index is required, while the pipeline is an optional identifier for an ingest pipeline.
- frequency: an optional parameter representing the interval between checks for changes in the source indices when the transform is running continuously. The value must be between 1s and 1h (1 hour). The default is 1m (1 minute).
- latest: a required object for latest type transformations. It has two properties: sort and unique_key. sort is a required date field that identifies the latest documents. unique_key is a required array of one or more fields that are used to group the data.
- _meta: an optional parameter of type “object”, which defines transform metadata.
- pivot: a required object for pivot type transformations. It has two properties: aggregations and group_by. aggregations is a required property of type “object,” and it specifies how the data will be aggregated for each grouped document. These aggregations will be used to create the fields in the destination index. It supports the following aggregations: average, bucket script, bucket selector, cardinality, filter, geo bounds, geo centroid, geo line, max, median absolute deviation, min, missing, percentiles, range, rare terms, scripted metric, stats, sum, terms, top metrics, value count, and weighted average. group_by is a required property of type object, and it specifies how the data will be grouped into individual documents. It supports these groupings: date histogram, geotile grid, histogram, and terms.
- retention_policy: an optional object that defines a time-based retention policy. Its time object has two parameters, field and max_age. field is a required date field used to calculate the document age. max_age is a required time value that specifies the maximum age a document can reach before it is deleted from the destination index.
- settings: an optional object that defines the transform settings. It has five properties that are all optional:
- dates_as_epoch_millis (true/false default false) sets dates as epoch millis or ISO-formatted string
- docs_per_second (float) enables throttling of the transform request
- align_checkpoints (bool default true) optimizes checkpoints for performance
- deduce_mappings (bool default true) deduces mappings from the transform configuration
- max_page_search_size (integer default 500) defines the initial page size for the composite aggregations used to create the transformation. Note that the page size is then dynamically adjusted downwards if circuit breaker exceptions occur.
- source: a required object that represents the data source for the transform. It has three properties: index, query, and runtime_mappings. index is required. query is an optional way to filter the data used as an input for the transformation. runtime_mappings is an optional definition of any search-time runtime fields that the transform may use.
- sync.time.field: an optional parameter to define the time field that is used by the transform to synchronize the source and destination indices.
- sync.time.delay: an optional parameter to define the difference in time between the current time and the most recent time of the input data. The default is 60s. Remember that sync.time.delay must be greater than the refresh interval for the source index.
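The two timing constraints in the list above (frequency between 1s and 1h, and sync.time.delay greater than the source index refresh interval) can be checked before creating a transform. The following Python sketch is illustrative only: the helper names are hypothetical, it supports only a subset of Elasticsearch time units, and the 1s refresh interval in the usage lines is an assumed example value.

```python
import re

# Multipliers (in seconds) for a subset of Elasticsearch time units.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value):
    """Convert a time value such as '60s' or '5m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", value)
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def check_timing(frequency, delay, refresh_interval):
    """Return a list of problems found; an empty list means the values look fine."""
    problems = []
    if not 1 <= to_seconds(frequency) <= 3600:
        problems.append("frequency must be between 1s and 1h")
    if to_seconds(delay) <= to_seconds(refresh_interval):
        problems.append("sync.time.delay must be greater than the refresh interval")
    return problems


print(check_timing("5m", "60s", "1s"))  # values that satisfy both rules
print(check_timing("2h", "1s", "1s"))   # values that break both rules
```

The first call returns an empty list, while the second reports both problems.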
The following is an example of the create transform API using the pivot method:
PUT _transform/books_transform1
{
  "source": {
    "index": "books",
    "query": {
      "term": {
        "country": {
          "value": "USA"
        }
      }
    }
  },
  "pivot": {
    "group_by": {
      "book_id": {
        "terms": {
          "field": "book_id"
        }
      }
    },
    "aggregations": {
      "max_price": {
        "max": {
          "field": "book_price"
        }
      }
    }
  },
  "description": "Maximum priced book by book_id in the USA",
  "dest": {
    "index": "books_transform1",
    "pipeline": "bookstransform_pipeline"
  },
  "frequency": "5m",
  "sync": {
    "time": {
      "field": "publish_date",
      "delay": "60s"
    }
  },
  "retention_policy": {
    "time": {
      "field": "publish_date",
      "max_age": "30d"
    }
  }
}
The query filters the transform so that it is only carried out on books published in the USA.
It then creates one record for each book ID and adds a field for the maximum price for each. The transformation will run every five minutes and will sync new records based on the field “publish_date” but with a 60s delay. Data over 30 days old will be deleted.
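To make the effect of the pivot concrete, the following Python sketch reproduces the same filter, group, and aggregate steps in memory. It is only an illustration of the result the transform produces, not of how Elasticsearch computes it, and the sample documents are hypothetical.

```python
from collections import defaultdict

# Hypothetical source documents standing in for the "books" index.
books = [
    {"book_id": "b1", "country": "USA", "book_price": 12.5},
    {"book_id": "b1", "country": "USA", "book_price": 15.0},
    {"book_id": "b2", "country": "USA", "book_price": 9.0},
    {"book_id": "b2", "country": "UK", "book_price": 30.0},
]


def pivot_max_price(docs, country):
    """Filter by country, group by book_id, and keep the maximum price per group."""
    prices = defaultdict(list)
    for doc in docs:
        if doc["country"] == country:  # the "query" part
            prices[doc["book_id"]].append(doc["book_price"])  # the "group_by" part
    # The "aggregations" part: one output document per group.
    return [
        {"book_id": book_id, "max_price": max(values)}
        for book_id, values in sorted(prices.items())
    ]


print(pivot_max_price(books, "USA"))
```

The UK record for b2 is filtered out, so the output contains one document per book ID with the highest USA price.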
The following is an example of the create transform API using the latest method:
PUT _transform/books_transform2
{
  "source": {
    "index": "books"
  },
  "latest": {
    "unique_key": ["book_id"],
    "sort": "publish_date"
  },
  "description": "Latest published books",
  "dest": {
    "index": "books_transform2"
  },
  "frequency": "5m",
  "sync": {
    "time": {
      "field": "publish_date",
      "delay": "60s"
    }
  }
}
The above transformation will create an index that holds, for each book ID, the most recent document as determined by publish_date.
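The selection logic of a latest transform can be illustrated in memory as follows. This Python sketch is not how Elasticsearch implements it; the function name and sample documents are hypothetical, and dates are assumed to be ISO-formatted strings so that they compare correctly as text.

```python
def latest_by_key(docs, unique_key, sort_field):
    """Keep, for each unique_key value, the document with the greatest sort_field."""
    latest = {}
    for doc in docs:
        key = doc[unique_key]
        if key not in latest or doc[sort_field] > latest[key][sort_field]:
            latest[key] = doc
    return [latest[key] for key in sorted(latest)]


# Hypothetical source documents standing in for the "books" index.
books = [
    {"book_id": "b1", "publish_date": "2021-03-01", "book_price": 12.5},
    {"book_id": "b1", "publish_date": "2023-07-15", "book_price": 15.0},
    {"book_id": "b2", "publish_date": "2022-01-10", "book_price": 9.0},
]

for doc in latest_by_key(books, "book_id", "publish_date"):
    print(doc)
```

For b1, only the 2023-07-15 document is kept; b2 has a single document, which is kept as is.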
Delete transform API
The following request is used to delete an existing transform in Elasticsearch:
DELETE _transform/<transform_id>
The manage_transform cluster privilege is required for the delete transform API. The built-in role transform_admin includes this privilege. You must stop the transform before you delete it.
The <transform_id> is a required parameter of type “string” that represents a unique identifier for the transform.
The force query parameter is an optional parameter of type “Boolean.” If force is set to true, the transform is deleted regardless of its current state. If force is set to false, which is the default value, the transform cannot be deleted until it has been stopped.
The timeout query parameter is an optional parameter of type “time,” and its default value is 30s.
The following is an example of the delete transform API:
DELETE _transform/books_transform
Get transform API
This API is used to obtain the transform configuration information. A wildcard expression or comma-separated list of identifiers can be used to retrieve data for multiple transforms in a single API request. Using _all, specifying * as the <transform_id>, or leaving out the <transform_id> will return information for all transforms.
The following request is used to get the information about one transform:
GET _transform/<transform_id>
The get transform API has the following query parameters:
- allow_no_match: an optional parameter of type “Boolean.” It determines what happens when the request contains wildcard expressions that match no transforms, contains the _all string or no identifiers and there are no matches, or contains wildcard expressions that only partially match. With the default value of true, the request returns an empty transforms array when there are no matches and the subset of results when there are partial matches. If this parameter is false, the request returns a 404 status code when there are no matches or only partial matches.
- from: an optional parameter of type “integer.” Its default value is 0. It specifies the number of transforms to be skipped.
- size: an optional parameter of type “integer.” Its default value is 100. It determines the maximum number of transforms to be obtained.
- exclude_generated: an optional parameter of type “Boolean.” Its default value is false. It excludes fields that were added automatically when the transform was created, so that the configuration can be retrieved and added to another cluster in an appropriate format.
The API returns an array of transform resources that are arranged in ascending order according to their ID values. The response body has the create_time property of type “string,” which represents the time at which the transform was created. The value of this property can’t be changed since it is informational. In addition, the response body has the version property of type “string,” which represents the version of Elasticsearch that was running on the node when the transform was created.
The 404 response code indicates that there are either no resources that match the request or only partial matches for the request if allow_no_match is set to false.
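When a cluster holds more transforms than a single response returns, the from and size query parameters can be combined to page through them. The Python sketch below only generates the request lines; the function name is hypothetical, and the total number of transforms is assumed to be known in advance.

```python
def transform_page_requests(total, size=100):
    """Yield GET request lines that page through `total` transforms."""
    for offset in range(0, total, size):
        yield f"GET _transform?from={offset}&size={size}"


for line in transform_page_requests(total=250):
    print(line)
```

For 250 transforms and the default page size of 100, this produces three requests with from set to 0, 100, and 200.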
Get transform statistics API
This API is used to obtain transform usage information. A wildcard expression or comma-separated list of identifiers can be used to retrieve usage information for multiple transforms in a single API request. Using _all, specifying * as the <transform_id>, or leaving out the <transform_id> will return usage information for all transforms.
The following request is used to get usage information about one transform:
GET _transform/<transform_id>/_stats
The transform statistics API returns an array of transform statistics objects sorted in ascending order by ID value. All of the properties are informational; you cannot change any of their values. These properties include the following:
- The checkpointing object, which includes checkpoint statistics. Those statistics are the following: changes_last_detected_at, last.checkpoint, last.time_upper_bound_millis, last.timestamp_millis, last_search_time, next.checkpoint, next.checkpoint_progress, next.time_upper_bound_millis, next.timestamp_millis, and operations_behind.
- The ID property, which represents the transform identifier.
- The node object, which represents the node in which the transform is started. This is for started transforms only. Properties that the node includes are attributes, ephemeral_id, id, name, and transport_address.
- The reason property, which gives information on the specifics of the failure in the event that a transform has a failed state.
- The state property, which represents the transform status. It can be aborting, failed, indexing, started, stopped, or stopping.
- The stats object, which gives the transform’s statistical information. This information includes the following: delete_time_in_ms, documents_deleted, documents_indexed, documents_processed, exponential_avg_checkpoint_duration_ms, exponential_avg_documents_indexed, exponential_avg_documents_processed, index_failures, index_time_in_ms, index_total, pages_processed, processing_time_in_ms, processing_total, search_failures, search_time_in_ms, search_total, and trigger_count.
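A common use of these properties is to find transforms that have failed and read the reason. The following Python sketch works on a hypothetical, heavily trimmed response; the top-level transforms key and the sample values are assumptions made for illustration, and a real response contains many more fields.

```python
# Hypothetical, heavily trimmed response from GET _transform/_stats.
stats_response = {
    "count": 2,
    "transforms": [
        {"id": "books_transform1", "state": "started",
         "checkpointing": {"operations_behind": 120}},
        {"id": "books_transform2", "state": "failed",
         "reason": "example failure message"},
    ],
}


def failed_transforms(response):
    """Return (id, reason) pairs for every transform in the failed state."""
    return [
        (t["id"], t.get("reason", "unknown"))
        for t in response["transforms"]
        if t["state"] == "failed"
    ]


print(failed_transforms(stats_response))
```

Here only books_transform2 is reported, together with its reason text.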
Preview transform API
The preview transform API can be used to test or simulate transform definitions. It returns a preview of the results that you will obtain when you run the create transform API with the identical configuration. It returns a maximum of 100 results. All of the current data in the source index is used in the calculations. The following requests are used to preview a transform in Elasticsearch:
GET _transform/<transform_id>/_preview
POST _transform/<transform_id>/_preview
Alternatively, instead of specifying the transform_id, you can add a JSON body to define the specification of the transformation you want to test using exactly the same format as in the create transform API:
POST _transform/_preview
{...your JSON transform definition here ... }
A list of mappings and settings that would be created for the destination index is also produced. These mappings and settings would be the ones utilized if the destination index does not exist when the transform is started. If you want different mappings to the ones suggested, you should create the destination index before starting the transformation. Note that index templates will not override the deduced mappings from pivot transformations.
Reset transform API
The following request is used to reset a transform in Elasticsearch:
POST _transform/<transform_id>/_reset
The <transform_id> is a required parameter of type “string,” which represents a unique identifier for the transform. The transform must be stopped before it can be reset; alternatively, set the force query parameter to true. The reset transform API has the force and timeout query parameters described previously.
The reset transform API removes all checkpoints and deletes the destination index, leaving the transform in the same state as when it was first created.
Start transform API
The following request is used to start a transform in Elasticsearch:
POST _transform/<transform_id>/_start
The <transform_id> is a required parameter of type “string,” which represents a unique identifier for the transform. The start transform API has the timeout query parameter described previously.
The destination index is created when a transform is started if it doesn’t already exist. For the new index, the auto_expand_replicas parameter is set to 0-1 and the number_of_shards parameter is set to 1. If the transform is a pivot transform, the source indices and the transform aggregations are used to infer the mapping definitions for the destination index. If the transform is a latest transform, dynamic mappings are used instead of deduced mapping definitions.
If you want to control the destination mappings yourself, create the destination index before beginning the transform. Note that index templates will NOT override the deduced mappings from pivot transforms.
Stop transform API
The following requests are all valid to stop one or more transforms in Elasticsearch:
POST _transform/<transform_id>/_stop
POST _transform/<transform_id1>,<transform_id2>/_stop
POST _transform/_all/_stop
The <transform_id> is a required parameter of type “string,” which represents a unique identifier for the transform. Use a comma-separated list or a wildcard expression to stop multiple transforms. Use the identifier _all or * to stop all transforms. The stop transform API has the timeout, allow_no_match, force, wait_for_checkpoint, and wait_for_completion query parameters.
The wait_for_checkpoint parameter is an optional parameter of type “Boolean.” If set to true, the transform does not stop completely until the current checkpoint is completed. If set to false, the transform stops as soon as possible. The default is false.
The wait_for_completion parameter is an optional parameter of type “Boolean.” When set to true, the API blocks until the indexer has stopped completely. If set to false, the API returns immediately and the indexer is stopped asynchronously in the background. The default is false.
The 404 response code indicates that there are either no matches for the query or only partial matches for the request if allow_no_match is set to false.
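The stop request combines a list of identifiers with optional query parameters. The following Python sketch assembles such a request line; the helper name stop_request is hypothetical, and the sketch only builds the line without sending it.

```python
from urllib.parse import urlencode


def stop_request(transform_ids, **params):
    """Build a stop request line for one or more transform IDs."""
    target = ",".join(transform_ids)
    # Boolean query parameters are written as lowercase true/false.
    query = urlencode({
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    })
    line = f"POST _transform/{target}/_stop"
    return f"{line}?{query}" if query else line


print(stop_request(["books_transform1", "books_transform2"],
                   wait_for_checkpoint=True, timeout="1m"))
print(stop_request(["_all"]))
```

The first call stops two transforms once their current checkpoint completes; the second stops all transforms with default parameters.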
Update transform API
The following request is used to update specific properties of a transform:
POST _transform/<transform_id>/_update
The <transform_id> is a required parameter of type “string,” which represents a unique identifier for the transform.
The parameters and body definitions are exactly the same as those of the create transform API. However, with the update, you are only required to include those properties that you want to modify.
If the transform operation is already started, then the update will not take effect until the operation reaches the next checkpoint.
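Because only the properties being changed need to be sent, an update body is usually much smaller than the original definition. The Python sketch below prints one possible update request for the books_transform1 example; the new frequency and description values are made up for illustration.

```python
import json

# Only the properties being changed are included in the update body.
changes = {
    "frequency": "10m",
    "description": "Maximum priced book by book_id in the USA, checked every 10 minutes",
}

request_line = "POST _transform/books_transform1/_update"
print(request_line)
print(json.dumps(changes, indent=2))
```

Properties that are left out, such as source and pivot, are not part of the request at all.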
Upgrade transform API
The following request is used to upgrade all transforms to the current version of the Elasticsearch cluster:
POST _transform/_upgrade
The upgrade transform API has the timeout and dry_run query parameters. The dry_run parameter is an optional parameter of type “Boolean.” When true, updates are just checked for but not carried out. False is the default.
Transforms are compatible across minor versions and between supported major versions. However, the format of configurations and internal data structures may change from time to time. This API identifies transforms with outdated configuration formats and upgrades them to the most recent version, while also cleaning up the internal data structures used to maintain checkpoints and state for the transforms. The source and destination indices are unaffected by the transform upgrade.
The upgrade is aborted and an error regarding the underlying issue is returned if a transform upgrade step fails. Fix the problem, then restart the process. After the upgrade is complete, a summary is returned.
You will receive the following summary when all transforms have been upgraded:
{ "needs_update": 0, "updated": 3, "no_action": 2 }
Transforms limitations
Transforms have configuration limitations that affect how they are set up, operational limitations that affect how they behave while running, and Kibana limitations that apply only to transforms managed through the user interface.
The configuration limitations include:
- The latest transforms omit field names prefixed with underscores.
- If the remote cluster is properly configured, transforms support cross-cluster search.
- When a transform is previewed or started, deprecation warnings are shown if it contains Painless scripts that make use of deprecated syntax.
- Transforms work better on indexed fields rather than runtime fields which require more resources to generate.
- A continuous transform checks for source data changes on a regular basis. The scheduler’s functionality is currently restricted to a simple periodic timer with a frequency range of 1s to 1h.
The operational limitations include the following:
- Destination index mappings may not always be compatible with aggregation responses. Some aggregations can output strings and numbers under some circumstances.
- Batch transforms might not take modified documents into account.
- Continuous transform consistency does not take deleted or updated documents into account.
- The Kibana index pattern or the destination index is not deleted when a transform is deleted.
- The aggregation page size is adjusted dynamically. When circuit breaker exceptions occur, Elasticsearch reduces the page size, which improves reliability at the expense of performance.
- Transforms will fail if max_page_search_size exceeds index.max_terms_count, which is 65536 by default.
- Failed transforms must be stopped and fixed or deleted manually.
- Continuous transforms could produce inaccurate results if the documents are not yet searchable. It is important that the value of sync.time.delay is greater than the source index refresh interval.
- Date fields with nanosecond resolution are aggregated with millisecond resolution.
- Data streams are not supported as destination indices.
- ILM is not recommended for the destination index because it could result in duplicate documents.
The limitations in Kibana include the following:
- Transforms are visible in all Kibana spaces.
- A maximum of 1,000 transforms can be listed.
- Not all transform configuration options are supported by Kibana.
Notes and good things to know
Your source index is unaffected by any transforms. The transformed data is sent to a new index.
Transforms are persistent tasks that are resilient to node failures because they are maintained in the cluster state.
The credentials of the user who called the API are used when previewing a transform. The roles of the user who created the transform are used when you start it. If the two sets of roles differ, the preview might not accurately reflect the behavior of the transform. To prevent such issues, the same user who creates or updates the transform should preview it to make sure it is producing the desired data. Alternatively, you can provide the credentials using secondary authorization headers.
It is advised to upgrade transforms prior to upgrading the cluster in order to guarantee that continuous transforms continue to run.
If your transform needs to process a lot of historical data, it will initially consume a lot of resources, especially during the first checkpoint.
Ensure that your search aggregations and queries are optimized for better performance and that your transform is only processing the data that is required. You should apply a source query wherever possible to limit the amount of data the transform processes.