Cleanup: Duplicates

Understanding Duplicates

What are duplicates?

Duplicates are different files with the same content. For example, a conference agenda saved in the Marketing department’s Events folder and the same conference agenda saved in the Training department’s Conference Sessions folder are duplicate files.
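To illustrate what "same content" means in practice, here is a minimal sketch that compares two files by hashing their bytes. It is not the Shinydocs implementation: the SHA-256 algorithm, the `file_hash` helper, and the file names are assumptions chosen for illustration only.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's content (hypothetical helper)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Two different files, in different folders, with identical content.
    marketing = Path(tmp) / "Marketing" / "Events" / "agenda.txt"
    training = Path(tmp) / "Training" / "Conference Sessions" / "agenda-copy.txt"
    for path in (marketing, training):
        path.parent.mkdir(parents=True)
        path.write_bytes(b"09:00 Keynote\n10:00 Workshop\n")

    # Matching hashes indicate the files are duplicates of each other.
    same_content = file_hash(marketing) == file_hash(training)
```

The file name and location play no part in the comparison; only the bytes inside each file do.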

Why bother with duplicates?

Duplicate data inflates storage costs and decreases efficiency, because employees waste time working out which copy of a file is the authentic one. Duplicates also increase time to recovery in a Disaster Recovery scenario and increase an organization’s security attack surface.

As with ROT (redundant, obsolete, and trivial data), identifying duplicates helps you understand what unnecessary data exists in your organization and where it exists. Once duplicates are identified, your organization, in collaboration with Shinydocs, can create a strategy for handling them that meets your data management goals.

Hot Tip: Acceptable redundancy

Not all duplicate files are necessarily redundant data. For example, two different Emergency Services departments — Fire Services and Paramedic Services — may each have a copy of a procedure within their file shares because each department’s share has restricted access. In this instance, the duplicate data is necessary.

How do we get started with duplicates?

Identifying duplicates begins with a metadata crawl, which is a prerequisite. A hash crawl is then run. Finally, the tag duplicates tooling reviews the hash crawl results in the index and tags duplicates according to the strategy your organization has determined best meets its needs.
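The idea of reviewing hash crawl results can be sketched as grouping index records by their hash value. This is an illustrative sketch only: the record layout, the `path` and `hash` field names, and the `group_duplicates` helper are hypothetical and do not reflect the actual Shinydocs index schema or tooling.

```python
from collections import defaultdict

# Hypothetical index records as they might look after the metadata and hash crawls.
records = [
    {"path": "/Marketing/Events/agenda.docx", "hash": "a1"},
    {"path": "/Training/Conference Sessions/agenda.docx", "hash": "a1"},
    {"path": "/Finance/budget.xlsx", "hash": "b2"},
]


def group_duplicates(index_records):
    """Group records by hash and keep only hashes shared by two or more files."""
    groups = defaultdict(list)
    for record in index_records:
        groups[record["hash"]].append(record)
    return {value: group for value, group in groups.items() if len(group) > 1}


duplicate_groups = group_duplicates(records)
```

In this sketch, the two agenda files end up in one group because they share a hash, while the budget file has a unique hash and is left out.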

Determining a duplicate data strategy

Before running the tag duplicates tooling, your organization must determine its duplicate data strategy by answering the following questions.

Create Date or Last Modified Date?

Your organization can use either the Create Date or the Last Modified Date to determine the primary documents in your duplicate data strategy. The tag duplicates tooling will select the primary document based on this parameter. By default, Create Date is used.

Ascending or Descending?

Your organization can also choose whether the primary document is selected in ascending or descending order. With ascending order, the document with the earliest Create Date or Last Modified Date is selected. With descending order, the document with the most recent Create Date or Last Modified Date is selected. By default, ascending is used.
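The two choices above (which date to use, and in which order) can be sketched as a small selection function. This is a rough illustration under stated assumptions, not the tooling's actual logic: the `select_primary` function, its parameter names, and the `create_date` and `last_modified_date` field names are hypothetical. How ties between identical dates are broken is not covered here; this sketch simply keeps the first record it encounters.

```python
from datetime import datetime


def select_primary(group, date_field="create_date", order="ascending"):
    """Pick the primary document from a group of duplicates (hypothetical helper).

    Defaults mirror the documented defaults: Create Date, ascending.
    """
    if date_field not in ("create_date", "last_modified_date"):
        raise ValueError(f"unsupported date field: {date_field}")
    if order not in ("ascending", "descending"):
        raise ValueError(f"unsupported order: {order}")
    # Ascending selects the earliest date; descending selects the most recent.
    chooser = min if order == "ascending" else max
    return chooser(group, key=lambda record: record[date_field])


# Hypothetical group of two duplicate files.
group = [
    {
        "path": "/Marketing/Events/agenda.docx",
        "create_date": datetime(2021, 3, 1),
        "last_modified_date": datetime(2023, 6, 15),
    },
    {
        "path": "/Training/Conference Sessions/agenda.docx",
        "create_date": datetime(2022, 1, 10),
        "last_modified_date": datetime(2022, 2, 1),
    },
]

primary_default = select_primary(group)
primary_latest_created = select_primary(group, order="descending")
primary_earliest_modified = select_primary(group, date_field="last_modified_date")
```

With the defaults, the Marketing copy is selected because it has the earliest Create Date. Switching either the date field or the order selects the Training copy in this example.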

Breadth and/or depth?

Lastly, consider the scope of your organization’s duplicate data strategy. Will you take a broad approach and identify duplicates across your index or indices? How deep into your organization’s folder structure will you search for duplicates? The answers to these questions vary by organization and are best discussed in collaboration with your Shinydocs team.