Quick links
- Overview
- How time series data streams work
- When to use time series data streams
- How to implement time series data streams
- Conclusion
Overview
A time series data stream (TSDS) is a specialized data stream dedicated to storing one or more metric time series in near real time. A TSDS behaves like a standard data stream, but it is optimized for metrics ingested in timestamp order.
Data streams in Elasticsearch let users store append-only data across a set of backing indices behind a single name. They act like a super alias, adding a level of abstraction over the ILM and rollover index management that previously had to be handled by hand.
Usage
A TSDS is used specifically for storing metrics added in timestamp order. Users can add dimensions to enable filtering on the data, but it is not suitable for text logs.
Examples of TSDS:
- CPU server metrics
- Temperature or rainfall data for weather forecasting
- Stock prices
Metrics: A metric is a numerical value that can be evaluated or plotted on a graph.
Dimensions: Dimensions are descriptive fields that identify the entity the data comes from, and that users might want to filter or aggregate on, e.g., hostname, data-center, country, and company_name.
How time series data streams work
Because a TSDS is specialized for storing metrics in near real time and in timestamp order, it can apply a number of optimizations for performance and disk usage.
Each document stored in TSDS has one timestamp, one or more metrics, and one or more dimensions
This is mandatory. Dimensions identify metrics, so if there is more than one dimension, a time series is identified by the combination of all its dimension values.
Segments are sorted by timestamp and dimension
Because segments are sorted this way, documents that belong to the same time series sit next to each other on disk, so reading a series means visiting documents in the order they are stored. This is the first optimization.
The second optimization is better compression of consecutive documents' field values. Consecutive documents share the same dimension values (because documents are index-sorted by dimension), so space is saved. The same applies to consecutive metric values: if a sensor sends the same temperature for an hour, those values compress well.
One major benefit of a TSDS over a standard data stream is this compression, which reduces the disk size of indices. Elastic reports a 44% gain (see https://www.elastic.co/guide/en/elasticsearch//reference/master/tsds.html).
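The compression effect described above can be illustrated with a toy run-length encoder. This is only a sketch: Elasticsearch's actual codecs are more sophisticated, and the host names and readings below are made up for illustration.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical (host, temperature) readings in interleaved arrival order.
arrival_order = [
    ("host-a", 21), ("host-b", 30), ("host-a", 21),
    ("host-b", 30), ("host-a", 21), ("host-b", 31),
]

# Sorting by dimension groups each series together. sorted() is stable,
# so the arrival (timestamp) order within each host is preserved.
sorted_by_dimension = sorted(arrival_order, key=lambda doc: doc[0])

unsorted_hosts = run_length_encode([host for host, _ in arrival_order])
sorted_hosts = run_length_encode([host for host, _ in sorted_by_dimension])
sorted_temps = run_length_encode([temp for _, temp in sorted_by_dimension])

print(len(unsorted_hosts), "runs before sorting")
print(len(sorted_hosts), "runs after sorting")
```

In this toy data, the dimension column needs six runs in arrival order but only two once the documents are sorted by dimension, and the repeated temperature values collapse as well.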
Backing indices have upper and lower time limits
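This differs from standard data streams, where every new document goes to the most recent backing index. In a TSDS, each backing index covers a time range, and a document is routed to the backing index whose range contains its timestamp. A metric with a timestamp in the past is therefore written to the index covering that period rather than to the newest index, provided that index is still writable.
The benefit is that a request filtered by time, which is common with time series, does not have to search all the backing indices: only the backing indices covering the requested period are searched and aggregated.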
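As a rough sketch of the idea above, the following models backing indices as time ranges and shows both routing a document by timestamp and pruning a time-filtered search. The index names and bounds are hypothetical, and the real routing logic lives inside Elasticsearch.

```python
from datetime import datetime, timezone


def utc(year, month, day, hour=0, minute=0, second=0):
    """Build a timezone-aware UTC datetime."""
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)


# Hypothetical backing indices, each covering [start, end).
BACKING_INDICES = [
    ("backing-000001", utc(2023, 2, 20, 0), utc(2023, 2, 20, 12)),
    ("backing-000002", utc(2023, 2, 20, 12), utc(2023, 2, 21, 0)),
    ("backing-000003", utc(2023, 2, 21, 0), utc(2023, 2, 21, 12)),
]


def index_for(timestamp):
    """Return the backing index whose time range contains the timestamp."""
    for name, start, end in BACKING_INDICES:
        if start <= timestamp < end:
            return name
    raise ValueError("no backing index accepts this timestamp")


def indices_for_range(range_start, range_end):
    """Return only the backing indices overlapping [range_start, range_end)."""
    return [
        name
        for name, start, end in BACKING_INDICES
        if start < range_end and range_start < end
    ]


print(index_for(utc(2023, 2, 20, 14, 1, 19)))
print(indices_for_range(utc(2023, 2, 20, 13), utc(2023, 2, 20, 15)))
```

A query for 13:00 to 15:00 only needs to touch the one index covering that window, which is the pruning the text describes.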
In TSDS, routing will be done using the dimensions as keys
All documents with the same keys will be routed to the same shard.
The benefit here is that when users search/aggregate on a specific dimension, only one shard is queried. This enables better performance by not having to query multiple shards in parallel.
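The routing idea can be sketched as hashing the dimension values to pick a shard. The hash function, the shard count, and the way the key is built are assumptions made for illustration, not Elasticsearch's actual routing algorithm.

```python
import hashlib

NUM_SHARDS = 3  # hypothetical number of primary shards
ROUTING_PATH = ("env_group", "dc_zone", "host")  # dimensions used as routing keys


def shard_for(doc):
    """Pick a shard from the dimension values only (metrics and timestamp are ignored)."""
    key = "|".join(str(doc[field]) for field in ROUTING_PATH)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


doc_a = {"@timestamp": "2023-02-20T14:01:19.000Z", "env_group": "INT",
         "dc_zone": "WE", "host": "front-int-06", "cpu": 16}
doc_b = {"@timestamp": "2023-02-20T14:02:19.000Z", "env_group": "INT",
         "dc_zone": "WE", "host": "front-int-06", "cpu": 42}

# Same dimensions, different timestamp and metric value: same shard.
print(shard_for(doc_a) == shard_for(doc_b))
```

Because only the dimensions feed the hash, every document of a given time series lands on the same shard regardless of when it was written.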
In TSDS, metrics are stored in near real time and timestamp order
Because backing indices are time-bound, users cannot store data outside those limits. Metrics are also expected to arrive in timestamp order and close to the current time: a document is accepted only if its timestamp falls within the time range of a backing index that is still writable.
Only one document per dimension and timestamp is allowed
IDs are computed from the dimensions and the timestamp. Users cannot set document IDs, and only one document per combination of dimension values is allowed at a given timestamp.
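A minimal sketch of why this rule follows from derived IDs: if the ID is a function of the dimensions and the timestamp, a second document with the same values maps to the same ID and is rejected. The ID format and the `ToyStore` class are hypothetical and do not reflect how Elasticsearch encodes its IDs.

```python
import hashlib

DIMENSIONS = ("env_group", "dc_zone", "host")


def document_id(doc):
    """Derive an ID from the dimension values and the timestamp only."""
    parts = [str(doc[name]) for name in DIMENSIONS] + [doc["@timestamp"]]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:20]


class ToyStore:
    """In-memory stand-in for an index that only supports create."""

    def __init__(self):
        self.docs = {}

    def create(self, doc):
        doc_id = document_id(doc)
        if doc_id in self.docs:
            raise ValueError("a document already exists for these dimensions and timestamp")
        self.docs[doc_id] = doc
        return doc_id


store = ToyStore()
first = {"@timestamp": "2023-02-20T14:01:19.000Z", "env_group": "INT",
         "dc_zone": "WE", "host": "front-int-06", "cpu": 16}
store.create(first)
# A different timestamp gives a different ID, so this second create succeeds.
store.create(dict(first, **{"@timestamp": "2023-02-20T14:01:20.000Z"}))
```

Trying to create another document with the same dimensions and timestamp, even with a different metric value, raises an error in this model.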
TSDS can be downsampled
Downsampling changes the time resolution of the data. For example, if there is one document every second, those documents can be condensed into a single document representing one hour. This document contains metric values computed over the one-hour period: min, max, sum, value count, and average.
The added value of downsampling is the ability to query a wider time period with less compute load. Since fewer documents have to be processed, aggregations put less pressure on memory.
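The following sketch shows what such a summary could look like for a single gauge. It is a toy model under stated assumptions: the `downsample` function, the use of `host` as the only dimension, and the output layout are illustrative, not the format Elasticsearch produces.

```python
from collections import defaultdict
from datetime import datetime, timezone


def downsample(docs, interval_seconds=3600):
    """Summarize the cpu metric per host and per fixed interval."""
    buckets = defaultdict(list)
    for doc in docs:
        epoch = int(datetime.fromisoformat(doc["@timestamp"]).timestamp())
        bucket_start = epoch - epoch % interval_seconds  # floor to interval start
        buckets[(doc["host"], bucket_start)].append(doc["cpu"])

    summaries = []
    for (host, bucket_start), values in sorted(buckets.items()):
        summaries.append({
            "@timestamp": datetime.fromtimestamp(bucket_start, tz=timezone.utc).isoformat(),
            "host": host,
            "cpu": {
                "min": min(values),
                "max": max(values),
                "sum": sum(values),
                "value_count": len(values),
            },
        })
    return summaries


raw_docs = [
    {"@timestamp": "2023-02-20T14:01:19+00:00", "host": "front-int-06", "cpu": 16},
    {"@timestamp": "2023-02-20T14:30:00+00:00", "host": "front-int-06", "cpu": 20},
    {"@timestamp": "2023-02-20T15:05:00+00:00", "host": "front-int-06", "cpu": 10},
]
hourly = downsample(raw_docs)
print(hourly[0])
```

The average does not need to be stored separately in this model, since it can be derived as sum divided by value count (36 / 2 = 18 for the first hour here).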
When to Use Time Series Data Streams
With this background on how a TSDS works, users can consider one when:
- Storing metrics data (not logs)
- Storing near real time data
- Storing continuous data
How to Implement Time Series Data Streams
1. Create the index lifecycle policy:
PUT _ilm/policy/my_tsds_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "12h",
            "max_primary_shard_size": "50gb"
          }
        }
      }
    }
  }
}
This is a classic ILM policy. Our backing indices will be rolled over every 12 hours or once a primary shard reaches 50 GB. Because a TSDS allocates data to specific indices according to timestamp, users must be careful with shrink, forcemerge, and searchable snapshots, since these operations make the index read-only. Once they have run, new data for the time periods covered by those indices can no longer be added.
2. Create the index template:
PUT _index_template/metrics-infra-index-template
{
  "index_patterns": ["metrics-infra*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": ["env_group", "dc_zone", "host"],
      "index.lifecycle.name": "my_tsds_policy",
      "index.look_ahead_time": "2h",
      "index.codec": "best_compression"
    },
    "mappings": {
      "properties": {
        "env_group": { "type": "keyword", "time_series_dimension": true },
        "dc_zone": { "type": "keyword", "time_series_dimension": true },
        "host": { "type": "keyword", "time_series_dimension": true },
        "cpu": { "type": "integer", "time_series_metric": "gauge" },
        "network_in": { "type": "long", "time_series_metric": "gauge" },
        "network_out": { "type": "long", "time_series_metric": "gauge" },
        "@timestamp": { "type": "date", "format": "strict_date_optional_time" }
      }
    }
  },
  "composed_of": [],
  "priority": 300,
  "_meta": { "description": "Infrastructure metric data" }
}
Only one template is created here to keep the example simple. In real use cases, we advise using the composable template capability for mappings.
The beginning is the same as for standard data streams. One difference is that the "index.mode" setting is set to "time_series", which activates the TSDS mode of the data stream.
- "index.routing_path": ["env_group", "dc_zone", "host"] is an optional setting that selects which time series dimensions are used for routing. By default, all dimensions are used.
- "index.look_ahead_time": defines how far ahead of the current time the data stream accepts timestamps. Documents whose timestamps fall outside the accepted time window are rejected.
- "index.codec": "best_compression" enables stronger compression, which works well when related data is stored close together, as is the case in a TSDS.
- "time_series_dimension": setting this to true declares a field as a dimension of the time series. Dimensions can only be of type keyword, ip, byte, short, integer, long, or unsigned_long.
- "time_series_metric": "gauge" or "counter" declares a numeric field as a metric value. A gauge can go up or down, whereas a counter can only increase. The field can be any numeric type or an aggregate type such as:
- histogram
- aggregate_metric_double
3. Store data:
POST metrics-infra-int/_doc
{
  "@timestamp": "2023-02-20T14:01:19.000Z",
  "env_group": "INT",
  "dc_zone": "WE",
  "host": "front-int-06",
  "cpu": 16,
  "network_in": 37485,
  "network_out": 9784043
}
This is straightforward, but the timestamp must fall inside the time range defined by the time series settings, so adjust the example timestamp to the current time when testing.
This is a test example. In practice, users would usually use the bulk API, where the required fields are the same.
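As a hedged sketch of the bulk variant, the helper below builds the newline-delimited body that could be sent to POST metrics-infra-int/_bulk. The `build_bulk_body` function is hypothetical. It relies on the fact that data streams only accept the create action in bulk requests, and that each action line is followed by the document on its own line.

```python
import json


def build_bulk_body(docs):
    """Build an NDJSON bulk body using the create action for every document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"create": {}}))  # action line
        lines.append(json.dumps(doc))             # document source line
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


metric_docs = [
    {"@timestamp": "2023-02-20T14:01:19.000Z", "env_group": "INT", "dc_zone": "WE",
     "host": "front-int-06", "cpu": 16, "network_in": 37485, "network_out": 9784043},
    {"@timestamp": "2023-02-20T14:01:19.000Z", "env_group": "INT", "dc_zone": "WE",
     "host": "front-int-07", "cpu": 12, "network_in": 1200, "network_out": 560000},
]
bulk_body = build_bulk_body(metric_docs)
print(bulk_body)
```

The resulting string can be passed as the request body by any HTTP client. Sending the request is left out here so the sketch stays independent of a particular client library.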
Downsampling
Downsampling is the process of reducing the granularity of metrics in order to save Elasticsearch cluster resources, especially disk space but also the RAM required to process queries. This process only applies to a TSDS; for standard indices, users would need to use Rollups. Downsampling converts a set of data points into a summary document with sum, max, min, value count, and average aggregations, along with the original dimensions.
Users can use the downsampling API to carry out the operation. The only parameter is the fixed interval to use. Elasticsearch requires the source index to be read-only (index.blocks.write set to true), so in practice the source is a backing index that has already been rolled over.
POST /metrics-infra-int/_downsample/my-downsampled-metrics-infra-int
{
  "fixed_interval": "1h"
}
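Users can also set downsampling to occur automatically via their ILM policy. Note that a PUT replaces the whole policy, so a real policy would keep the hot phase and its rollover action alongside the phase shown here.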
PUT _ilm/policy/my_tsds_policy
{
  "policy": {
    "phases": {
      "warm": {
        "actions": {
          "downsample": {
            "fixed_interval": "1d"
          }
        }
      }
    }
  }
}
- The downsampled index can be queried together with a regular index from the TSDS.
- Users can mix different intervals in the same query (but for consistency, it is usually better to keep the same one).
- If a query uses intervals smaller than the downsample.fixed_interval, the downsampled data appears at the start of each downsampled interval, e.g., for a 1h interval, it appears at minute 0. For this reason, it is better to query with intervals equal to or greater than the downsample interval whenever possible.
Conclusion
A TSDS is an optimized way to store time-based metrics, helping to achieve a reduction of up to 70% of disk space at the expense of a little extra configuration. If there is a large volume of metric data on a cluster, it is well worth the extra effort to achieve these savings. The downsampling feature can achieve even greater reductions in disk space and can be easily automated via ILM.