Best Practices: Approaching ROT and Duplicates
The benefits
Your organization wants to understand what data exists and where it exists. Empowered with this information, you can move forward with strategic initiatives such as a data migration. What’s more, you can implement an ongoing strategy for effective data governance as new content is added at your organization.
How does it work?
First, an initial crawl of your organization’s content is performed. The result of this crawl is an index (or listing) of the crawled files and their associated metadata. An additional crawl can then be run to calculate a hash value for each file listed in the index and apply it to the corresponding index item, enabling the identification and management of duplicate data. Crawls are run from the Windows Command Prompt.
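To make the idea of an index concrete, here is a minimal Python sketch of a crawl that records one entry per file. It is an illustration only, not the product's crawler: the function name `crawl` and the record fields (`path`, `extension`, `size_bytes`, `modified`) are assumptions chosen for this example.

```python
import os
import tempfile
from datetime import datetime, timezone

def crawl(root):
    """Walk `root` and return one index record per file (hypothetical schema)."""
    index = []
    for folder, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            info = os.stat(path)
            index.append({
                "path": path,
                "extension": os.path.splitext(name)[1].lower(),
                "size_bytes": info.st_size,
                "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
            })
    return index

# Small demonstration against a throwaway folder.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "report.DOCX"), "w") as handle:
        handle.write("quarterly numbers")
    with open(os.path.join(root, "empty.tmp"), "w") as handle:
        pass  # zero-byte file
    for record in crawl(root):
        print(record["extension"], record["size_bytes"])
```

The index holds metadata only. File content is not read at this stage, which is why hashing is described as a separate, additional crawl.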
What’s a hash?
A hash is an identifier calculated from a file’s content and applied to the corresponding index item. Files with identical content receive the same hash value, which is how duplicates are identified. The value is an alphanumeric string based on the MD5 standard.
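The following Python sketch shows how an MD5 value can be calculated and used to group duplicates. It uses the standard `hashlib` module; the helper names (`md5_of_bytes`, `md5_of_file`, `group_duplicates`) and the sample paths are assumptions for illustration and do not describe how the product computes or stores its hashes.

```python
import hashlib
from collections import defaultdict

def md5_of_bytes(data):
    """Return the MD5 digest of `data` as a 32-character hexadecimal string."""
    return hashlib.md5(data).hexdigest()

def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def group_duplicates(hash_by_path):
    """Return lists of paths that share a hash (groups of two or more only)."""
    groups = defaultdict(list)
    for path, file_hash in hash_by_path.items():
        groups[file_hash].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]

# Hypothetical file contents standing in for files on disk.
contents = {
    "/shares/finance/budget.xlsx": b"2024 budget",
    "/shares/finance/archive/budget-copy.xlsx": b"2024 budget",
    "/shares/hr/policy.docx": b"leave policy",
}
hashes = {path: md5_of_bytes(data) for path, data in contents.items()}
print(group_duplicates(hashes))
```

Because the hash depends only on content, two files with different names, locations, or timestamps still match when their bytes are identical.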
Once these crawls are complete, rules are run against the index, tagging files as redundant, obsolete, and/or trivial (ROT) content based on your rule selections and/or customizations.
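As a rough illustration of rules being run against an index, the Python sketch below tags each index item with zero or more ROT labels. Everything in it is an assumption made for the example: the function `tag_rot`, the record fields, the 7-year default, and the choice of "zero-byte file" as a stand-in for a trivial-content rule. The actual rules and their parameters are whatever you select or customize.

```python
from collections import Counter
from datetime import date

def tag_rot(index, today, obsolete_after_years=7):
    """Return {path: set of tags} for a list of index records (hypothetical schema)."""
    hash_counts = Counter(item["hash"] for item in index)
    tagged = {}
    for item in index:
        tags = set()
        # Redundant: another index item has the same hash. This sketch tags
        # every copy; a real rule may designate one copy as the original.
        if hash_counts[item["hash"]] > 1:
            tags.add("redundant")
        # Obsolete: last modified longer ago than the configured threshold.
        age_years = (today - item["modified"]).days / 365.25
        if age_years > obsolete_after_years:
            tags.add("obsolete")
        # Trivial: an empty file, used here as a simple example criterion.
        if item["size_bytes"] == 0:
            tags.add("trivial")
        tagged[item["path"]] = tags
    return tagged

index = [
    {"path": "a.docx", "hash": "h1", "modified": date(2024, 1, 10), "size_bytes": 100},
    {"path": "a-copy.docx", "hash": "h1", "modified": date(2024, 2, 1), "size_bytes": 100},
    {"path": "old.xls", "hash": "h2", "modified": date(2010, 5, 1), "size_bytes": 2048},
    {"path": "empty.tmp", "hash": "h3", "modified": date(2024, 3, 1), "size_bytes": 0},
]
for path, tags in tag_rot(index, today=date(2025, 1, 1)).items():
    print(path, sorted(tags))
```

A single file can carry more than one tag, which matches the "redundant, obsolete, and/or trivial" wording above.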
Content cleanup considerations
Each organization’s content cleanup needs are different. These can vary depending on an organization’s size, sector, industry affiliations, etc. While needs vary, the following guiding questions can help determine the optimal approach to content cleanup for your organization.
Are the needs the same across your organization?
A content cleanup activity that is helpful for one department at your organization may be less so for another. For many ROT rules, content paths can be included or excluded, supporting the varying needs of teams across your organization.
Case in point
An IT team requires retention of specific file types that are considered unnecessary content for the Finance team. That same Finance team requires documents to be retained for 7 years, whereas the Legal team considers content older than 5 years to be no longer required.
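Path inclusion and exclusion can be pictured as a scope check that runs before a rule is applied. The Python sketch below is a hypothetical model of that idea using wildcard patterns; the function `in_scope`, the rule dictionary, the share paths, and the file extensions are all invented for the example and are not the product's configuration format.

```python
from fnmatch import fnmatchcase

def in_scope(path, include, exclude=()):
    """True if `path` matches an include pattern and no exclude pattern."""
    lowered = path.lower()  # compare case-insensitively
    included = any(fnmatchcase(lowered, pattern.lower()) for pattern in include)
    excluded = any(fnmatchcase(lowered, pattern.lower()) for pattern in exclude)
    return included and not excluded

# Hypothetical rule: installer images are unnecessary everywhere except
# under the IT share, where the IT team needs to retain them.
rule = {
    "name": "unneeded_installers",
    "extensions": (".iso", ".msi"),
    "include": ["/shares/*"],
    "exclude": ["/shares/it/*"],
}

def rule_applies(path, rule):
    """Apply the rule only to in-scope paths with a matching extension."""
    return path.lower().endswith(rule["extensions"]) and in_scope(
        path, rule["include"], rule["exclude"]
    )

print(rule_applies("/shares/finance/setup.iso", rule))  # True
print(rule_applies("/shares/it/setup.iso", rule))       # False
```

Scoping a rule by path lets the same rule serve the Finance team without touching content the IT team is required to keep.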
How does your organization define redundant?
For some organizations, or teams within organizations, retaining multiple files with the same content created at different times is a measure of compliance. Understanding whether this is a consideration at your organization can help with ROT rule selection.
Case in point
An IT team may store log files with the same content created at different times, indicating there is no change to the performance of applications under their purview.
How does your organization define obsolete?
Your organization may need to retain files for a different timeframe than what is included in standard ROT rules. These rules can be customized to account for such differences.
Case in point
That same IT team storing log files for compliance purposes may need to retain them for 3 years. Meanwhile, the Finance team at the same organization may need to retain tax-related documents for 7 years.
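The retention periods in this example can be modelled as a per-path lookup that overrides a default. The Python sketch below assumes a hypothetical table of path prefixes and retention years; the paths, the names `RETENTION_YEARS` and `is_obsolete`, and the 5-year default are illustrative assumptions, not values taken from any standard ROT rule.

```python
from datetime import date

# Hypothetical retention table: (path prefix, years to retain).
RETENTION_YEARS = [
    ("/shares/it/logs/", 3),
    ("/shares/finance/tax/", 7),
]
DEFAULT_RETENTION_YEARS = 5  # assumed fallback for all other content

def full_years_between(earlier, later):
    """Number of complete calendar years from `earlier` to `later`."""
    years = later.year - earlier.year
    if (later.month, later.day) < (earlier.month, earlier.day):
        years -= 1
    return years

def is_obsolete(path, modified, today):
    """True once the file's age reaches the retention period for its path."""
    limit = DEFAULT_RETENTION_YEARS
    for prefix, years in RETENTION_YEARS:
        if path.lower().startswith(prefix):
            limit = years
            break
    return full_years_between(modified, today) >= limit

today = date(2025, 1, 1)
modified = date(2021, 6, 1)  # three full years before `today`
print(is_obsolete("/shares/it/logs/app.log", modified, today))         # True
print(is_obsolete("/shares/finance/tax/return.pdf", modified, today))  # False
```

Two files of the same age receive different outcomes because the retention period is tied to where the content lives rather than to a single organization-wide threshold.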
How does your organization define trivial?
The parameters your organization uses to identify meaningless content may differ significantly from those of an organization in a different sector with a different mandate. They can also vary significantly across your organization’s departments and teams. Again, ROT rules can be customized to account for such differences.
Case in point
A Finance team may benefit from leveraging a ROT rule that searches for “Not Safe for Work (NSFW)” keywords and tags matching content as trivial. However, a Human Resources team at the same organization may require that content containing these same keywords be retained for employee management purposes.