Discovering Duplicates - File System
Applying Hash Values
The Cognitive Suite can calculate a hash value for each file based on the file's entire byte content. If two or more files have exactly the same content, they will have the same hash value, which identifies them as duplicates.
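To see the principle outside the toolkit, you can compute hashes with the standard Windows certutil utility (the file paths below are examples only): two copies of the same file produce identical digests, while any difference in content produces a different digest.

    certutil -hashfile C:\Share\report.docx MD5
    certutil -hashfile C:\Share\copy-of-report.docx MD5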
Have you done the initial crawl of your files? If not, you may want to refer to this page before applying the hash values.
Step by step:
Open a Windows command prompt as Administrator.
Change directory (cd) to the folder where you extracted the shinydocs-cognitive-toolkit-yyyy-mm-dd.zip file.
Run the following command to crawl a file share and add hash values: CognitiveToolkit.exe AddHashAndExtractedText --action-keyword hash --query {path-to-file} -u {url-to-index} -i {name of index}
The action-keyword option allows you to crawl for hash, text, or both (the default is both hash and text).
Shinydocs recommends using the default (hash and text together) for complex systems, which typically have longer file download times (for example, Content Server, SharePoint, or FileNet).
Adding hash and full text in a single crawl of a file share is possible but not recommended. For file shares, Shinydocs recommends crawling for hash first and then extracting text in a separate step.
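As a sketch of this two-pass approach, the commands below run the hash crawl first and the text crawl second. The query file names, index URL, and index name are illustrative assumptions rather than values from your environment; only the command name and options come from the crawl command above.

    rem Pass 1 (example values): add hash values to files that do not have one yet
    CognitiveToolkit.exe AddHashAndExtractedText --action-keyword hash --query query-match-path-no-hash.json -u http://localhost:9200 -i file-share-index

    rem Pass 2 (example values): extract full text in a separate crawl, using your own query file that matches files still missing text
    CognitiveToolkit.exe AddHashAndExtractedText --action-keyword text --query query-match-path-no-text.json -u http://localhost:9200 -i file-share-index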
Did you know… There are two ways to run the tools from the command line: enter the command and its options directly at the prompt, or save the command in a .bat file that references a .json query file.
Example .bat file (COG-Query-AddHash-md5.bat) and .json query file (query-match-path-no-hash.json):
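The contents of those two files are not reproduced in this article, so the following is only a sketch of what they might contain. The command options come from the crawl command above; the index URL, index name, and the query body (which assumes an Elasticsearch-style index with path and hash fields) are assumptions you would adjust for your environment.

    rem COG-Query-AddHash-md5.bat (sketch, example values)
    CognitiveToolkit.exe AddHashAndExtractedText --action-keyword hash --query query-match-path-no-hash.json -u http://localhost:9200 -i file-share-index

    query-match-path-no-hash.json (sketch, assumed field names) - matches documents that have a path but no hash yet:
    {
      "query": {
        "bool": {
          "must": [ { "exists": { "field": "path" } } ],
          "must_not": [ { "exists": { "field": "hash" } } ]
        }
      }
    }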
Visualize Your Duplicated Data
Before you begin, ensure that the Analytics Engine and Visualizer have been installed. If not, refer to the installation instructions.
In a web browser (Chrome is recommended), enter the address of the machine where Shinydocs Visualizer is installed. This is typically <machine-name>:5601. If you are running the tool locally, use "localhost:5601".
From the Dashboard menu, open the basic dashboard for displaying metadata information: Duplicated Files
On this dashboard you will see a number of visualizations, each of which is based on a query of the underlying Analytics Engine (an example of such a query is shown after the list below).
The visualizations on this Dashboard include the following:
Total Number of Files
Number of Duplicated Files
Total Storage
Storage by Document Group & Type
Number of Files by Document Type
Number of Files with Full Text Extraction
Number of Files by Last Created Date
Number of Files by Last Modified Date
Number of Files by Last Accessed Date
Top Duplicated Files
File Listing by File Size
File Listing by Name
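If you want to run a query of this kind against the Analytics Engine directly rather than through the dashboard, the sketch below finds the most duplicated hash values. It assumes the Analytics Engine exposes an Elasticsearch-compatible REST API on port 9200, that the index is named file-share-index, and that the hash is stored in a keyword field named hash; adjust all three to match your environment.

    curl -X POST "http://localhost:9200/file-share-index/_search" -H "Content-Type: application/json" -d @find-duplicate-hashes.json

    find-duplicate-hashes.json (sketch) - returns hash values shared by two or more documents:
    {
      "size": 0,
      "aggs": {
        "duplicate_hashes": {
          "terms": { "field": "hash", "min_doc_count": 2, "size": 10 }
        }
      }
    }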
Did you know… The AddHashAndExtractedText tool has other advanced features. See https://enterprisefile.atlassian.net/wiki/x/E4B0Ng for a wider set of command options.