Trimming Down your Splunk Indexer Storage with TSIDX Retention Settings

By: Anshu August 04, 2016

Hi everyone. Today I wanted to cover the tsidx retention feature that was released in Splunk version 6.4. This feature helps you reduce the storage costs for your indexer while maintaining actively searchable data. Also in this blog, I wanted to try a new format and convey the information in an FAQ style. Please leave a comment if you found the new format helpful for learning about tsidx retention.

Tsidx File Fundamentals

First let's cover some fundamentals about tsidx files.

Q. What is a tsidx file?
A. Tsidx stands for "time-series index." A tsidx file is created at index time and is based on the raw data in a bucket.

Q. What is a tsidx file used for?
A. It's used during the search process to locate references to keywords in the raw data.

Q. What do tsidx files look like in a bucket?

A. Below is an example of a bucket's contents from my local Splunk instance. You'll notice the first item listed is the tsidx file. At only 186K it isn't a sizable file, but for larger data sets it would be much larger. It's also possible for a bucket to contain more than one tsidx file.

-rw-------  1 Anshu  wheel   186K Mar 24 21:22 1458868890-1458850476-3374714464259194028.tsidx
-rw-------  1 Anshu  wheel   122B Mar 24 21:22 Hosts.data
-rw-------  1 Anshu  wheel   280B Mar 24 21:22 SourceTypes.data
-rw-------  1 Anshu  wheel   2.4K Mar 24 21:22 Sources.data
-rw-------  1 Anshu  wheel   347B Mar 24 21:13 Strings.data
-rw-------  1 Anshu  wheel   7.5K Mar 24 21:22 bloomfilter
-rw-------  1 Anshu  wheel    67B Mar 24 21:13 bucket_info.csv
-rw-------  1 Anshu  wheel     0B Mar 24 21:22 optimize.result
drwx------  5 Anshu  wheel   170B Mar 24 21:22 rawdata
-rw-------  1 Anshu  wheel    77B Mar 24 21:21 splunk-autogen-params.dat
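
As a side note, the two leading numbers in the tsidx file name above look like Unix epoch timestamps. Here is a minimal Python sketch that splits the name and converts them to UTC. It assumes the name follows a latest-earliest-id pattern; that interpretation and the helper name parse_tsidx_name are my own assumptions for illustration, not something taken from the Splunk docs.

```python
from datetime import datetime, timezone


def parse_tsidx_name(filename):
    """Split a tsidx file name into (latest, earliest, unique_id).

    Assumption: the name follows <latest_epoch>-<earliest_epoch>-<id>.tsidx.
    """
    stem = filename.rsplit(".", 1)[0]            # drop the .tsidx extension
    latest, earliest, unique_id = stem.split("-", 2)

    def to_utc(epoch_text):
        return datetime.fromtimestamp(int(epoch_text), tz=timezone.utc)

    return to_utc(latest), to_utc(earliest), unique_id


latest, earliest, uid = parse_tsidx_name(
    "1458868890-1458850476-3374714464259194028.tsidx"
)
print(latest.isoformat())    # 2016-03-25T01:21:30+00:00
print(earliest.isoformat())  # 2016-03-24T20:14:36+00:00
print(latest - earliest)     # 5:06:54
```

If that reading is right, the first timestamp lines up with the file's modification time in the listing once you account for my local time zone.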

Okay, so now that we know a bit more about what a tsidx file is, let's ask the main question.

Q. Why would a Splunk administrator want to implement retention settings on tsidx files?

A. It boils down to the cost of storage. These files are incredibly useful for data that is actively being searched, but they become expensive from a storage perspective as that data is searched less often.

Now let's look at some details of the tsidx reduction process.

The Tsidx Reduction Process

Q. What is the tsidx reduction process?

A. This is the process that runs to implement the tsidx retention settings.

Q. Approximately how much space can be saved by implementing the tsidx reduction process?

A. This depends on the makeup of the data, but in general a bucket can be reduced in size by about 33% to 66%.
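
To get a feel for what that range means, here is a small Python sketch that applies the 33% and 66% figures to a given amount of bucket storage. The function name reduced_size_range and the 1000 GB input are made up for illustration; your actual savings will depend on your data.

```python
def reduced_size_range(original_gb, min_pct=33, max_pct=66):
    """Return (best_case_gb, worst_case_gb) remaining after reduction.

    min_pct and max_pct are the approximate savings percentages
    quoted above; they are rough guidance, not guarantees.
    """
    best_case = original_gb * (100 - max_pct) / 100   # largest savings
    worst_case = original_gb * (100 - min_pct) / 100  # smallest savings
    return best_case, worst_case


best, worst = reduced_size_range(1000)
print(f"1000 GB of eligible buckets -> roughly {best:.0f} to {worst:.0f} GB")
# 1000 GB of eligible buckets -> roughly 340 to 670 GB
```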

Q. Which files are removed during the process?
A. Regular tsidx files are replaced with "mini-tsidx" files, and the "merged_lexicon.lex" file is removed.

Q. What are some files that are retained after reduction?
A. The raw data file, mini-tsidx files, bloomfilter files.

Q. What is contained in the mini-tsidx files?
A. Only essential metadata, such as the header of the original tsidx file, which contains metadata about each event.

Q. How often does the reduction process run?
A. By default, it runs every 10 minutes.

Q. How long does the process take on average per bucket?
A. Just a few seconds.

Q. What happens if a bucket is a part of a current search process but is identified as a candidate for reduction by the reduction process?
A. The reduction will be delayed until the search on the bucket completes.

Q. What is the expected search performance on reduced buckets?
A. Search performance on reduced buckets is expected to be severely reduced. If a reduced bucket is part of a result set, the user is warned that this is occurring. The actual warning message is "Search on most recent data has completed. Expect slower search speeds as we search the minified buckets."

Q. Are tstats searches affected by reduced buckets?
A. Yes, tstats searches cannot use reduced buckets. The user will be notified that the tstats result set is not complete. The actual message is:
"Reduced buckets were found in index={index}. Tstats searches are not supported on reduced buckets. Search results will be incorrect."

Implementing Tsidx Retention

Q. What are the methods for implementing tsidx retention?
A. This can be done via the GUI, .conf files, or CLI.

Q. What are the settings in indexes.conf to implement this?
A. The following are the settings in indexes.conf. In this example, the time period before reduction is set to 1,209,600 seconds, which is 14 days:

[my_index]
enableTsidxReduction = true
timePeriodInSecBeforeTsidxReduction = 1209600
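
Since the setting is expressed in seconds, a quick conversion helps when picking a value. Below is a minimal Python sketch that converts days to seconds and does a rough age check on a bucket. The helper names are hypothetical, and the eligibility rule (comparing the bucket's newest event time to the threshold) is my simplified assumption about how the age is evaluated, so treat it as an illustration only.

```python
SECONDS_PER_DAY = 86_400


def days_to_seconds(days):
    """Convert a retention period in days to seconds."""
    return days * SECONDS_PER_DAY


def is_reduction_candidate(bucket_latest_epoch, now_epoch, threshold_sec):
    """Simplified assumption: a bucket qualifies once its newest
    event is older than the configured threshold."""
    return now_epoch - bucket_latest_epoch > threshold_sec


threshold = days_to_seconds(14)
print(threshold)  # 1209600

bucket_latest = 1458868890  # newest event time of a sample bucket (epoch)
print(is_reduction_candidate(bucket_latest, bucket_latest + days_to_seconds(10), threshold))  # False
print(is_reduction_candidate(bucket_latest, bucket_latest + days_to_seconds(15), threshold))  # True
```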

Q. Can tsidx retention be used in indexer clusters?
A. Yes. It's important to deploy the tsidx retention configuration via the cluster master in a configuration bundle so all indexers have the same settings.

Q. Can a reduced bucket be restored back to its original state?
A. Yes. First, configure the tsidx reduction settings for the index so that the bucket to be restored will not be reduced again. Next, issue the "splunk rebuild" command on the bucket.

Q. What is the process for restoring a bucket?

A. The process for restoring a bucket is similar to thawing frozen data. First, change the tsidx retention settings so the bucket falls outside of the range of buckets being reduced. Second, issue the "rebuild" command on the bucket. For the exact process, please refer to the following link on Splunk Docs: http://docs.splunk.com/Documentation/Splunk/6.4.2/Indexer/Reducetsidxdis...

Q. For which types of data and scenarios does it make sense to implement this?
A. The simple answer is any Splunk deployment that wants to reduce storage and generally doesn't actively search a certain index or indexes after a certain time period. From a real-world perspective, a scenario that comes to mind is an organization, such as a federal agency, that has to maintain compliance with a well-defined audit policy and undergo regular audit checks. Having the data "actively" searchable in Splunk makes meeting audit requests much simpler than having to thaw data. At the same time, storage usage and cost are greatly reduced, making this capability feasible. Hence, this becomes a comparable alternative to archiving data.

I hope this post has helped you learn more about the new tsidx reduction capabilities in Splunk v6.4. As always, happy Splunking!

Tags: Splunk 6.4, Splunk, tsidx, retention, storage, indexer, lexicon, bucket, rebuild, Audit
