Best Practices: Finding and Removing Duplicates
This document outlines Shinydocs’ approach to finding and removing duplicate items. Since every business environment is unique, a standard approach is outlined below, along with modifications and advanced options for more complex environments.
Reduce Data Volume
Standard Approach
Before we begin looking for duplicates (and identifying primary duplicates), let’s make the process more efficient by removing documents that will be deleted anyway, or that can otherwise be ignored.
Agree and act on how to reduce the size of the ROT (Redundant, Obsolete and Trivial) problem.
Identify obsolete (i.e. old) and trivial (i.e. no value) files.
📔 We have standard ROT rules that can do this for you. Review these rules, adjust them (if needed), and then run them against your Analytics Engine after you have completed a metadata crawl.
Delete documents that are no longer needed.
Modifications
If you decide against deleting items as a precursor to finding duplicates, agree on which Insight values you want to ignore. If you are going to ignore an entire Insight holistically (rot_obsolete exists or rot_trivial exists, for example), simply use queries on these existing fields in the steps below.
If you are going to include some Insight values and exclude others, we suggest introducing a new Insight such as “obsolete_trivial_ignore” with a value of “yes” for all of them. In other words, if this field exists, that document can be ignored for the purpose of identifying duplicates.
📔 Adding Insights to a “live” Analytics Engine (Index) is not recommended. Updating the Index can have an adverse impact on queries being run against it at the same time (for Discovery Search, for example). A better approach is to have a “live” Index that only serves queries (for Discovery Search) and a copy, or “replica”, Index that receives the updates. On a regular schedule, the replica Index is then promoted to become the new “live” Index.
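For illustration only, here is a minimal Python sketch of how such a tag could be applied, assuming the Analytics Engine (Index) exposes an Elasticsearch-compatible API; the endpoint, index name, and the exact subset of ROT values you choose to ignore are assumptions, and in practice you may apply the tag via RunScript or other Cognitive Toolkit tooling instead.

```python
# Sketch only: tag selected documents with obsolete_trivial_ignore = "yes".
# Endpoint and index name are assumptions; adjust the query to the ROT values you chose to ignore.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # replica Index endpoint (assumption)
INDEX = "shinydocs"                           # hypothetical index name

query = {
    "bool": {
        "should": [
            {"exists": {"field": "rot_obsolete"}},
            {"exists": {"field": "rot_trivial"}},   # narrow this to the subset you agreed on
        ],
        "minimum_should_match": 1,
    }
}

es.update_by_query(
    index=INDEX,
    query=query,
    script={"source": 'ctx._source["obsolete_trivial_ignore"] = "yes"', "lang": "painless"},
    conflicts="proceed",
)
```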
Ask Users to Delete Files & Folders
Standard Approach
While automation for finding duplicates works well, there is no way to automatically know which files and folders users no longer need. We recommend asking users to delete files and folders that they know are no longer needed.
Modifications
If users need assistance in this area, use the Last Modified Date to generate file listings for users to review. This is usually most effective on a department-by-department basis. It can be done by performing an ExportFromIndex operation if departments can be easily identified by a path (or list of paths), or by Insights and values. Alternatively, a subject matter expert (SME) from each department could view the File Listing by Name via the Visualizer and export it, if it is a reasonable size to review. Users can then review the listings and, within a week or two, report which files can be outright deleted.
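As a rough illustration of the kind of listing described above, the Python sketch below queries the Index directly and writes a per-department CSV that reviewers can sort by the Last Modified column. It assumes an Elasticsearch-compatible API and hypothetical field names (path, name, lastModified, size); in practice you would more likely use ExportFromIndex or the Visualizer as described.

```python
# Sketch only: export a per-department file listing for user review.
# Field names (path, name, lastModified, size) are assumptions; adjust to your Index mapping.
import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")   # Analytics Engine endpoint (assumption)
INDEX = "shinydocs"                           # hypothetical index name
DEPARTMENT_PATH = r"\\fileserver\finance"     # hypothetical department root

query = {
    "query": {
        "bool": {
            "filter": [{"prefix": {"path": DEPARTMENT_PATH}}],
            "must_not": [
                {"exists": {"field": "rot_obsolete"}},
                {"exists": {"field": "rot_trivial"}},
            ],
        }
    }
}

with open("finance_file_listing.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "path", "lastModified", "size"])
    for hit in helpers.scan(es, index=INDEX, query=query):
        doc = hit["_source"]
        writer.writerow([doc.get("name"), doc.get("path"),
                         doc.get("lastModified"), doc.get("size")])
```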
Calculate Hash
Standard Approach
Calculate the hash for all documents except those that should be ignored (that is, exclude documents where rot_obsolete or rot_trivial exists, or, if you used “obsolete_trivial_ignore”, exclude documents where that field exists in the Analytics Engine (Index)). Also skip any document that already has a hash value, so no effort is wasted recalculating it.
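To make the scope concrete, here is a hedged sketch of that selection logic, assuming an Elasticsearch-compatible API; it only counts the documents that still need a hash (not ignored, and no hash value yet). The actual hashing is performed by your crawl/Cognitive Toolkit tooling, not by this script.

```python
# Sketch only: count documents that still need a hash calculated.
# Excludes ignored documents and anything that already has a hash value.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # Analytics Engine endpoint (assumption)
INDEX = "shinydocs"                           # hypothetical index name

query = {
    "bool": {
        "must_not": [
            {"exists": {"field": "hash"}},           # already hashed: skip
            {"exists": {"field": "rot_obsolete"}},   # or swap these two for
            {"exists": {"field": "rot_trivial"}},    # obsolete_trivial_ignore, if you used it
        ]
    }
}

print("Documents still needing a hash:", es.count(index=INDEX, query=query)["count"])
```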
Define Primary Duplicate
Common Definition
The next step will be to calculate the “primary” duplicate. Before we can do that, it is important to agree on its definition. Some common approaches are listed below. Review them and select the one that best fits your business:
The “original” document as determined by the creation date
As a secondary step, if you have a “published” area for documents (on your ECM or file system), it may be helpful to confirm that the “original” document is also in that location.
Advanced Definitions
The “published” document in your ECM
This might be the single copy that exists in the “published” area of your ECM (company policy documents, for example), as opposed to copies of that document elsewhere in the ECM or on the file system.
The “published” document on your file system, where a single copy exists in a shared, generally available folder (company policy documents, for example)
The “departmental” primary document
This approach limits the “scope” of duplicate detection to a department-by-department basis. Each department will ultimately keep one primary copy, and copies of that document could still be found in other departments.
As a secondary step, the primary for each department could be either:
the “original” document in that department (see above), or
the “published” document in that department (see above)
Find and Tag Primary Duplicates
Standard Approach
Find (and tag) the primary duplicates, based on the approach you have agreed on.
“Original” Documents
Run the Cognitive Toolkit “TagDuplicate” tool on the Creation Date with a query that finds all file system entries that have a hash field and are not ignored (that is, exclude documents where rot_obsolete or rot_trivial exists, or, if you used “obsolete_trivial_ignore”, exclude documents where that field exists in the Analytics Engine (Index)).
Modifications
“Original” Documents
If you have a “published” area, confirm that all entries with duplicate-hash = primary-duplicate are in those areas. An ExportFromIndex that includes the hash field and uses calculate master is an easy way to identify the correct “primary duplicate”: sort on the hash field to see where the primary and its duplicates are for each hash value. For any primary that is not in a published area, find the copy with the same hash that is in that area, manually assign it as primary-duplicate instead, and remove the incorrectly tagged primary-duplicate.
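A minimal sketch of that manual check follows, assuming you exported a CSV containing at least hash, path, and duplicate-hash columns (the column names and published paths here are assumptions; match them to your actual ExportFromIndex output).

```python
# Sketch only: flag hash groups whose tagged primary is NOT under a published area.
# Column names (hash, path, duplicate-hash) and the published roots are assumptions.
import csv
from collections import defaultdict

PUBLISHED_ROOTS = (r"\\fileserver\published", r"\\fileserver\policies")  # hypothetical

groups = defaultdict(list)
with open("export_from_index.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        groups[row["hash"]].append(row)

for file_hash, rows in groups.items():
    primaries = [r for r in rows if r.get("duplicate-hash") == "primary-duplicate"]
    published = [r for r in rows
                 if r["path"].lower().startswith(tuple(p.lower() for p in PUBLISHED_ROOTS))]
    if primaries and published and primaries[0] not in published:
        # Candidate for manual re-tagging: the primary should be the published copy.
        print(file_hash, ":", primaries[0]["path"], "-> should likely be", published[0]["path"])
```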
“Published” Documents
For each document that is in such a folder, add the Insight “published” with a value of “primary” (likely via RunScript). Next, extract the hash value for each of these documents, generating a CSV of name/value pairs (published / hash). Then run a query (likely via RunScript) that finds all other documents matching each of these hash values that do not have published=primary, and for each one add the Insight “published” with a value of “duplicate”.
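The Python sketch below illustrates those two passes, again assuming an Elasticsearch-compatible API and hypothetical values for anything not named above (endpoint, index name, the path field, and the published folder location); in practice you would run the equivalent logic via RunScript.

```python
# Sketch only: pass 1 tags published=primary inside the published folder,
# pass 2 tags every other copy of the same hash as published=duplicate.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")   # Analytics Engine endpoint (assumption)
INDEX = "shinydocs"                           # hypothetical index name
PUBLISHED_ROOT = r"\\fileserver\published"    # hypothetical published area

# Pass 1: everything under the published area becomes published=primary.
es.update_by_query(
    index=INDEX,
    query={"prefix": {"path": PUBLISHED_ROOT}},   # path field name is an assumption
    script={"source": 'ctx._source["published"] = "primary"', "lang": "painless"},
    conflicts="proceed",
    refresh=True,
)

# Collect the hash of every published primary (the published / hash pairs).
primary_hashes = {
    hit["_source"]["hash"]
    for hit in helpers.scan(es, index=INDEX,
                            query={"query": {"term": {"published": "primary"}}})
    if hit["_source"].get("hash")
}

# Pass 2: any other document with one of those hashes becomes published=duplicate.
for h in primary_hashes:
    es.update_by_query(
        index=INDEX,
        query={"bool": {"filter": [{"term": {"hash": h}}],
                        "must_not": [{"term": {"published": "primary"}}]}},
        script={"source": 'ctx._source["published"] = "duplicate"', "lang": "painless"},
        conflicts="proceed",
    )
```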
We now have all primary documents in our published folders tagged as “published=primary” and all copies of these documents tagged as “published=duplicate”.
For all other documents, those for which the field “published” does not exist, simply follow the “Original” Documents approach above.
Departmental Documents
Define a list of mutually exclusive file paths that can define the departments in question.
For each department, run the Cognitive Toolkit “TagDuplicate” tool on the Creation Date with a query that finds all file system entries that have a hash field and do not have the “obsolete_trivial_ignore” field; the query should also limit results to the path that matches the department in question (see the sketch below).
If you have “published” areas, confirm that all entries with duplicate-hash = primary-duplicate are in those areas. For those that are not, find the copy with the same hash that is in that area, manually assign it as primary-duplicate instead, and remove the incorrectly tagged primary-duplicate.
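Here is a sketch of the per-department selection described above, with the department paths, endpoint, index name, and path field treated as assumptions; it simply reports how many hashed, non-ignored entries fall in each department’s scope before you run TagDuplicate.

```python
# Sketch only: the per-department selection that would feed TagDuplicate.
# Department roots and the path field name are assumptions; adjust to your environment.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # Analytics Engine endpoint (assumption)
INDEX = "shinydocs"                           # hypothetical index name

DEPARTMENT_PATHS = {                          # hypothetical, mutually exclusive roots
    "finance": r"\\fileserver\finance",
    "hr": r"\\fileserver\hr",
}

for dept, root in DEPARTMENT_PATHS.items():
    query = {
        "bool": {
            "filter": [
                {"prefix": {"path": root}},
                {"exists": {"field": "hash"}},
            ],
            "must_not": [{"exists": {"field": "obsolete_trivial_ignore"}}],
        }
    }
    count = es.count(index=INDEX, query=query)["count"]
    print(f"{dept}: {count} hashed entries in scope for TagDuplicate")
```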
Target (& Delete) Biggest Offenders
Standard Approach
Before proceeding with the following steps, it is likely that by using the Shinydocs Visualizer you can readily identify some “big offenders” that you can act on immediately. These will be of two types:
Huge files
Huge number of duplicates
What “huge” is for your organization is relative. We recommend evaluating these in chunks of 10. Examine the 10 largest items; then the next 10; then the next 10; etc. Likewise, you can look at the 10 that have the most duplicates; then the next 10; then the next 10; etc.
The Duplicated Files Dashboard is a powerful resource for this. On this Dashboard, click “Duplicated Files”, which will then list ONLY files that are duplicates (both primary and non-primary). Then use the Top Duplicated Files and/or the File Listing by File Size to look for your big offenders. You will want to get the hash values for these duplicates, which you can see by selecting “Inspect” for any of them.
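If you prefer to pull the same information straight from the Index, the hedged sketch below runs a “top offenders” aggregation, assuming an Elasticsearch-compatible API and that hash is an aggregatable (keyword) field with a numeric size field alongside it; the field and index names are assumptions.

```python
# Sketch only: top 10 duplicated hashes ranked by total bytes consumed.
# Assumes hash is a keyword field and size is numeric; adjust names to your mapping.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # Analytics Engine endpoint (assumption)
INDEX = "shinydocs"                           # hypothetical index name

resp = es.search(
    index=INDEX,
    size=0,
    aggs={
        "by_hash": {
            "terms": {
                "field": "hash",
                "min_doc_count": 2,              # only hashes that actually have duplicates
                "size": 10,                      # review in chunks of 10
                "order": {"total_bytes": "desc"},
            },
            "aggs": {"total_bytes": {"sum": {"field": "size"}}},
        }
    },
)

for bucket in resp["aggregations"]["by_hash"]["buckets"]:
    print(bucket["key"], "-", bucket["doc_count"], "copies,",
          int(bucket["total_bytes"]["value"]), "bytes total")
```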
Look for “obvious” duplicates, for which all copies can be deleted. Perhaps you will see an ISO file (these are huge) for an OS that is no longer needed and is taking up a lot of space. You will need to make a judgement call on these, but the gains from deleting them are significant.
For an approach to actually deleting files found this way, jump ahead to Disposition of Non-Primary Duplicates, below.
Protect Primary Duplicates
Standard Approach
Agree and act on protecting primary duplicates.
Classify the primary duplicates as one of the following:
Ordinary - document has no overriding importance
Critical - document is critical to the business
Important - document is important to the business, but not critical
For Ordinary Primary Duplicates, leave permissions as is.
📔 For this scenario, folks other than the owner may end up modifying/deleting the Ordinary Primary Duplicate at some point.
Advanced
For Important Primary Duplicates, remove all modify permissions, except for Admin and the owning user (or perhaps group).
This can be done via the Cognitive Toolkit CacheFileSystemPermissions and then the SetFileSystemPermissions tools, for either duplicate-hash = primary-duplicate or published = primary (depending on your approach above).
For Critical Primary Duplicates, remove all modify permissions, except for Admin.
This can be done via the Cognitive Toolkit CacheFileSystemPermissions and then the SetFileSystemPermissions tools, for either duplicate-hash = primary-duplicate or published = primary (depending on your approach above).
Read Access by Others to Each Primary Duplicate
Standard Approach
Agree on, and have your IT department act on, read access by others to each Primary Duplicate.
Before removing any non-primary duplicates, have your IT department review and verify that “read” access to the primary duplicates is set correctly. How this is done will depend on where the document resides (file system or ECM).
For Ordinary Primary Duplicates, no additional access check is likely needed
Advanced
For Critical Primary Duplicates, confirm access is restricted appropriately:
If it is a company policy document, does everyone have read access?
If it is a “secure” document, is read access restricted to the appropriate users and groups (HR documents, for example)?
For Important Primary Duplicates, confirm that everyone who will need access to this document has access.
Disposition of Non-Primary Duplicates
Standard Approach
Agree and act on disposition of non-primary duplicates.
With primary duplicates identified, protected, and appropriate read access confirmed, you are OK to proceed with actual disposition of non-primary duplicates. There are a number of different approaches here, so first agree on which to take and then act on it. You may wish to follow more than one of these, depending on your situation.
Deletion:
All non-primary duplicates can be outright deleted. This can be done immediately via the Cognitive Toolkit Dispose tool matching either duplicate-hash = duplicate or published = duplicate (depending on your approach above).
Advanced
You may wish to schedule the disposition for some time in the future. In this case, via RunScript, tag all records with either duplicate-hash = duplicate or published = duplicate (depending on your approach above) with an additional Insight such as “disposition-date”, set to a date (we recommend the format yyyy-mm-dd) 30 days, 60 days, 90 days, 6 months, 9 months, or 12 months in the future (as per your disposition policy). If you also update the file system permissions on all non-primary duplicates to read-only, that will reinforce that each is a duplicate if anyone tries to edit the file before it is disposed. If you take this approach, we recommend that, each month, the list of files due for disposition is first exported via ExportFromIndex (which can then be shared with interested parties), followed by a Dispose on the same set about a week later (to allow for responses from interested parties).
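A minimal sketch of the scheduled-disposition tagging follows, again assuming an Elasticsearch-compatible API (in practice this would be a RunScript job); the 90-day window, endpoint, and index name are just examples.

```python
# Sketch only: stamp every non-primary duplicate with a disposition-date Insight
# set 90 days in the future, in yyyy-mm-dd format.
from datetime import date, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # Analytics Engine endpoint (assumption)
INDEX = "shinydocs"                           # hypothetical index name

disposition_date = (date.today() + timedelta(days=90)).isoformat()  # yyyy-mm-dd

# Use the tag that matches your approach: duplicate-hash = duplicate, or published = duplicate.
query = {"term": {"duplicate-hash": "duplicate"}}

es.update_by_query(
    index=INDEX,
    query=query,
    script={
        "source": 'ctx._source["disposition-date"] = params.when',
        "lang": "painless",
        "params": {"when": disposition_date},
    },
    conflicts="proceed",
)
```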
Archive:
All non-primary duplicates can be archived (i.e. moved to another medium). Depending on your environment, this is likely done with the assistance of your I.T. organization. A listing of such documents can be done via the Cognitive Toolkit ExportFromIndex tool matching either duplicate-hash = duplicate or published = duplicate (depending on your approach above).
You may wish to schedule this archive for some time in the future. In this case, the approach is the same as for Deletion above, except you are archiving files instead of deleting them.
Hide:
All non-primary duplicates can be hidden from view, for all but Admin users (as long as this is not blocked by the file system in question). This can be done immediately via the Cognitive Toolkit CacheFileSystemPermissions and then the SetFileSystemPermissions tools, for either duplicate-hash = duplicate or published = duplicate (depending on your approach above).
You should ALSO schedule the disposition for some time in the future, following the same process as for Deletion, above. The idea here is that files are not actually deleted quite yet, giving your users a period of time to notice and react for those few files that they really do need access to.
System Shortcuts:
This approach is to replace the non-primary duplicates with file system shortcuts to the primary duplicate (using Windows Shortcuts (shell links)). While this sounds somewhat “easy” to implement from a technical perspective, there are some ramifications to consider:
Some file types (e.g.: technical drawings) that open many files at a single time (usually in the same folder) will likely not work if any of those files were replaced with a shortcut. This includes Microsoft Office documents that may have internal links to other Office documents.
Permissions are likely to be an issue, as permissions on the primary duplicate may be very different from those on the non-primary duplicate being replaced with a shortcut. In other words, whoever had access before may not have access to the document that the shortcut points to, and this would likely not be discovered until some time in the future, when that user clicks the shortcut and gets a permission-denied error.
Replacing a non-primary document that has complex permissions with an equivalently permissioned shortcut would be difficult.
After implementation, a user who thinks they are still working on their original document may unintentionally make changes to the primary-duplicate, perhaps even deleting it (if they have permissions), affecting all users who try to use that document via shortcuts in the future.
For the above reasons, if implementing system shortcuts is still desired, we recommend doing so on a file-type-by-file-type basis. For example, doing so for all “.pdf” documents might be a safe approach.
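If you do go down this path, here is a hedged sketch of creating a single Windows shortcut with pywin32 (the WScript.Shell COM object); the paths are placeholders, and removing the original non-primary file would be a separate, deliberate step.

```python
# Sketch only: create a .lnk next to a non-primary duplicate, pointing at the primary.
# Requires pywin32; the paths below are placeholders.
import win32com.client

duplicate_path = r"\\fileserver\finance\policies\travel-policy.pdf"   # non-primary copy
primary_path = r"\\fileserver\published\policies\travel-policy.pdf"   # primary duplicate

shell = win32com.client.Dispatch("WScript.Shell")
shortcut = shell.CreateShortCut(duplicate_path + ".lnk")
shortcut.TargetPath = primary_path
shortcut.Save()
# Deleting the original duplicate_path file (and verifying permissions on the primary)
# would be handled separately, per the cautions above.
```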
As an alternative to system shortcuts, if your organization has Discovery Search, consider having users use that tool to find the documents they are looking for (rather than following shortcuts). In this scenario, non-primary duplicates are deleted, and users looking for a given document simply use Discovery Search to find it. This also avoids the clutter of introducing thousands or even millions of shortcuts.
Duplicate value
When using the TagDuplicate command, the “duplicate” value on items that are no longer duplicates can be cleared using the --overwrite option, but only after performing a RemoveField operation.
Scenario
You run the TagDuplicate operation and two files are identified and tagged as duplicates. If you delete one of the files from the index and re-run TagDuplicate, the remaining file will still be tagged with a “duplicate” value.
How to eliminate the “duplicate” value
The only way to have the duplicate value removed is to run RemoveField and then include the --overwrite option when re-running the TagDuplicate command. This process will update the value on the remaining file to reflect that it is no longer a duplicate.
If you would like to know which files are duplicates and are running TagDuplicate on a schedule, we recommend removing all duplicate tags prior to each run.
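For illustration, here is a sketch of clearing the tag across the whole Index before a scheduled run, assuming an Elasticsearch-compatible API; the supported route is the Cognitive Toolkit RemoveField operation, so treat this only as a picture of what that step accomplishes.

```python
# Sketch only: remove the duplicate-hash field everywhere before re-running TagDuplicate.
# The supported way to do this is the Cognitive Toolkit RemoveField tool; this shows the effect.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # Analytics Engine endpoint (assumption)
INDEX = "shinydocs"                           # hypothetical index name

es.update_by_query(
    index=INDEX,
    query={"exists": {"field": "duplicate-hash"}},
    script={"source": 'ctx._source.remove("duplicate-hash")', "lang": "painless"},
    conflicts="proceed",
)
```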