Why you may want to roll up your data
The cost of running an Elasticsearch cluster is largely proportional to the volume of data stored on it. If you are storing time-based data, it’s common to find that old data is queried less often than newer data, and that the old data is often only used to look at the “bigger picture” or to compare historical trends.
Rollup jobs provide a way to drastically reduce the storage cost of old data by storing documents that summarize it over a given time period. You keep the ability to search the key parameters of that data, albeit at reduced granularity. For example, if you are storing CPU and disk usage metrics recorded every minute, you could set up a rollup job to summarize this data into hourly buckets.
Defining a rollup job
You can define a rollup job using the following:
PUT _rollup/job/metrics
{
  "index_pattern": "metrics-*",
  "rollup_index": "metrics_rollup",
  "cron": "*/30 * * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "60m"
    },
    "terms": {
      "fields": [ "node", "environment" ]
    }
  },
  "metrics": [
    {
      "field": "cpu",
      "metrics": [ "min", "max", "sum", "avg" ]
    },
    {
      "field": "disk",
      "metrics": [ "avg", "max" ]
    }
  ]
}
Cron
The cron expression defines when the rollup job runs. Depending on the load profile of your Elasticsearch cluster, you may prefer to spread the load over frequent short runs, or run a single long job at 2am.
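The example job above runs every 30 seconds; a single nightly run at 2am would look something like this (note that Elasticsearch cron expressions include a leading seconds field):

"cron": "0 0 2 * * ?"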
Groups
Groups define the “buckets” into which your data should be summarized. Within groups:
1. Date Histogram
You must define a date histogram, with either a “calendar_interval” (e.g. 1M for one month) or a “fixed_interval” (e.g. 2h, 1d).
Bear in mind that monthly intervals may not give you evenly sized buckets, so it is generally preferable to use fixed intervals (a calendar-based variant is sketched after this list for comparison).
2. Terms
It is also important to include any other “buckets” which you may want to use to classify your data. For example, you may want to aggregate your data into sub buckets by node or environment.
Think carefully about which fields to include here: it will not be possible to subdivide your rolled up data on any field that is left out. At the same time, avoid fields with high cardinality, since these will increase the size of the rolled up index on disk.
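As an illustration, here is how the groups section of the example job would look with a calendar interval instead of a fixed one (field names as in the job definition above):

"groups": {
  "date_histogram": {
    "field": "@timestamp",
    "calendar_interval": "1M"
  },
  "terms": {
    "fields": [ "node", "environment" ]
  }
}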
Metrics
Define the metrics you will want to analyze and the aggregations you require.
Possible values are min, max, sum, avg, and value_count. Again, it will not be possible to query a metric that is not included here.
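For example, if you also wanted to keep a count of the raw CPU readings in each bucket, you could extend the cpu entry of the example job with value_count:

"metrics": [
  {
    "field": "cpu",
    "metrics": [ "min", "max", "sum", "avg", "value_count" ]
  },
  {
    "field": "disk",
    "metrics": [ "avg", "max" ]
  }
]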
Starting and stopping the rollup job
POST _rollup/job/<id>/_start
POST _rollup/job/<id>/_stop
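For the job defined above, the id is metrics. You can also check a job’s current state (started or stopped) and its indexing statistics with the get rollup jobs API:

GET _rollup/job/metrics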
Searching rolled up data
You can search rolled up data using exactly the same query syntax as you would for standard data. You can even search rolled up data combined with regular indices in a single request, and Elasticsearch will work out which combination of rolled up and live data to use to produce the results. The only thing you need to do is use the dedicated rollup endpoint, _rollup_search, instead of _search.
GET /metrics_rollup/_rollup_search
{
  "size": 0,
  "aggregations": {
    "max_cpu": {
      "max": {
        "field": "cpu"
      }
    }
  }
}
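And as a sketch of the combined case mentioned above, the same aggregation can run across the live metrics indices and the rollup index together (index names as in the example job):

GET /metrics-*,metrics_rollup/_rollup_search
{
  "size": 0,
  "aggregations": {
    "max_cpu": {
      "max": {
        "field": "cpu"
      }
    }
  }
}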