Introduction
There are two main types of data sources when we’re talking about search operations:
- Time-based data (for example, access logs)
- Non-time-based data (for example, product inventory)
Time-based data has a set of common features:
- The @timestamp field is mandatory and is the most important field.
- Documents are not updated.
- Documents are split into indices based on time.
- Old documents are less likely to be searched, or may not be searched at all, but they still need to be stored.
To move data around efficiently we can use ISM (Index State Management) policies. Index State Management lets you define custom policies that automate routine tasks, such as moving data across nodes or deleting indices under certain conditions. In this way, we can split our data based on age and define how it moves from the fastest hardware to slower hardware, and eventually to the trash can.
This is where data streams come into play. We need a way to tell the ISM policy how to split the data into indices. In the past we would use rollover aliases, but data streams now implement this behavior out of the box. Data streams also leverage index templates, which makes index generation even simpler.
How to create data streams in OpenSearch
First, we need to create an index template that configures every index matching our chosen pattern as a data stream:
PUT _index_template/logs-template
{
  "index_patterns": [
    "datalogs-*"
  ],
  "data_stream": {}
}
As mentioned earlier, ISM automates routine tasks like moving data across nodes or deleting indices under certain conditions. Without an ISM policy managing the underlying indices, there is little point in using a data stream.
From here, every index we create whose name starts with “datalogs-” will be a data stream. By default, you need to include a @timestamp field in your documents (the name of this field can be changed in the template, as shown below).
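For example, the template can point the data stream at a different timestamp field. A minimal sketch, assuming a hypothetical nginx template whose documents carry a request_time field instead of @timestamp:

PUT _index_template/logs-template-nginx
{
  "index_patterns": [
    "datalogs-nginx-*"
  ],
  "data_stream": {
    "timestamp_field": {
      "name": "request_time"
    }
  }
}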
Now we have two options:
1. We can call the data stream API directly:
PUT _data_stream/datalogs-example
2. We can just ingest a document into an index name that matches the pattern:
POST datalogs-test/_doc
{
  "@timestamp": "2022-06-20T16:24:03.417+00:00",
  "message": "this is an example log"
}
Run that last command and pay attention to the response:
{ "_index" : ".ds-datalogs-test-000001", "_id" : "trjdqIEB7jRCWfUHatx-", "_version" : 1, "result" : "created", "_shards" : { "total" : 2, "successful" : 2, "failed" : 0 }, "_seq_no" : 0, "_primary_term" : 1 }
What happened here? We pushed to the data stream, but OpenSearch created an underlying index that will hold the data. The data stream name is just an alias. The good thing is we don’t have to worry about which index we are pushing to: we just push to the data stream, configure the ISM policy, and everything works.
If we set a policy to perform a rollover after one day, in one day we will see a .ds-datalogs-test-000002 index, and so on. Notice that the index name starts with a dot, which means the index is hidden; we are not supposed to interact with it directly.
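As a reference, here is a minimal sketch of such a policy (the policy name, the one-day rollover, and the 30-day delete step are illustrative values, adjust them to your retention needs). The ism_template section attaches the policy automatically to newly created datalogs-* indices:

PUT _plugins/_ism/policies/datalogs-rollover-policy
{
  "policy": {
    "description": "Example: roll over daily, delete after 30 days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_index_age": "1d"
            }
          }
        ],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": {
              "min_index_age": "30d"
            }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "delete": {}
          }
        ]
      }
    ],
    "ism_template": {
      "index_patterns": [
        "datalogs-*"
      ],
      "priority": 100
    }
  }
}

With a policy like this in place, the rollover action is what produces .ds-datalogs-test-000002, .ds-datalogs-test-000003, and so on, while the delete state takes care of retention.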
We can get the details about our data stream with this command:
GET _data_stream/datalogs-test
And for even more insights:
GET _data_stream/datalogs-test/_stats
How to search against a data stream
Searching against a data stream is like searching against a regular index. All the underlying indices will be queried.
GET datalogs-test/_search
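Since every document in the stream carries a @timestamp, a typical search narrows the results to a time window. A minimal sketch using a standard range query (the one-hour window is just an example):

GET datalogs-test/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1h",
        "lte": "now"
      }
    }
  }
}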
How to delete a data stream
Deleting a data stream will remove all the underlying indices:
DELETE _data_stream/datalogs-test/
Data stream limitations
Data streams are optimized for time-series data.
- They are designed primarily for append-only data.
- The @timestamp field is required.
- Search requests are routed to all underlying indices, while write requests go to the latest write index.
Data streams can be combined with ISM policies, making it even simpler to move our data across nodes and manage retention.
Conclusion
Data streams enforce a setup that works perfectly with time-based data, making ISM policies much easier to configure.
If you want to ingest logs, or any documents centered around a timestamp, you can use data streams and forget about configuring rollover aliases. Just go ahead and start creating ISM policies to handle your data retention.