
Discovering Duplicates - File System

Applying Hash Values

The Cognitive Suite can calculate a hash value for each file from the file's content. If two or more files have exactly the same content (a byte-for-byte match), their hash values are the same, which identifies them as duplicates.
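To illustrate the idea of hash-based duplicate detection, here is a minimal Python sketch. It is not how the Cognitive Suite is implemented; the function names file_md5 and find_duplicates are hypothetical, and MD5 is assumed only because the example .bat file name on this page mentions md5.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by content hash and keep groups of two or more."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[file_md5(path)].append(path)
    # Only hashes shared by more than one file indicate duplicates.
    return {h: sorted(paths) for h, paths in groups.items() if len(paths) > 1}
```

Files with identical content always produce the same hash, so each group returned by find_duplicates is a set of duplicate candidates.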

Have you done the initial crawl of your files? If not, you may want to refer to this page before applying the hash values.

https://enterprisefile.atlassian.net/wiki/x/LACSMQ

Step by step:

  1. Open a Windows command prompt as Administrator.

  2. Change directory (cd) to where you extracted the shinydocs-cognitive-toolkit-yyyy-mm-dd.zip file.

  3. Run the following command to crawl a file share and add hash values: CognitiveToolkit.exe AddHashAndExtractedText --action-keyword hash --query {path-to-file} -u {url-to-index} -i {name of index}

The --action-keyword option controls whether the crawl adds the hash, the extracted text, or both. The default is both hash and text.

Shinydocs recommends using the default (hash and text) for complex systems that typically have longer file download times, such as Content Server, SharePoint or FileNet.

Adding hash and full text from a file share is possible, but not recommended. For file shares, Shinydocs recommends crawling for hash first and then crawling to extract text in a separate step.
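If you script your crawls, the hash-only command from step 3 can be assembled from its parts. The following is a sketch only: the function name build_add_hash_command is hypothetical, and the query file, index URL and index name in the usage section are placeholder values that you must replace with your own. Only the tool name and options shown in step 3 are used.

```python
def build_add_hash_command(query_path, index_url, index_name,
                           exe="CognitiveToolkit.exe"):
    """Return the argument list for a hash-only AddHashAndExtractedText run."""
    return [
        exe,
        "AddHashAndExtractedText",
        "--action-keyword", "hash",   # hash only, as recommended for file shares
        "--query", query_path,        # {path-to-file} in step 3
        "-u", index_url,              # {url-to-index} in step 3
        "-i", index_name,             # {name of index} in step 3
    ]


if __name__ == "__main__":
    # Placeholder values (assumptions), not taken from a real environment.
    command = build_add_hash_command(
        "query-match-path-no-hash.json",
        "http://localhost:9200",
        "my-index",
    )
    print(" ".join(command))
```

Building the command as a list rather than a single string avoids quoting problems if you later pass it to subprocess.run from the toolkit directory.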

Did you know… 

There are two ways to run tools in the CLI:

  1. Using the full command (as shown above).

  2. Using .bat files. Every installation of Shinydocs Cognitive Suite comes with a folder of .bat files. Use these as a starting point for customizing commands to your specific environment. For example, run this CLI command: COG-Query-AddHash-md5.bat

Example .bat file (COG-Query-AddHash-md5.bat) and .json file (query-match-path-no-hash.json):

query-match-path-no-hash.json

Visualize Your Duplicated Data

Before you begin, ensure that the Analytics Engine and Visualizer have been installed. If not, refer to the installation instructions.

  1. In a web browser (Chrome is recommended), enter the address where Shinydocs Visualizer is installed. This is typically <machine-name>:5601. If you are running the tool locally, use "localhost:5601".

  2. From the Dashboard menu, open the basic dashboard for displaying metadata information: Duplicated Files

On this dashboard you will see a number of visualizations, each of which is based on a query of the underlying Analytics Engine.

The visualizations on this Dashboard include the following: 

  • Total Number of Files

  • Number of Duplicated Files

  • Total Storage

  • Storage by Document Group & Type

  • Number of Files by Document Type

  • Number of Files with Full Text Extraction

  • Number of Files by Last Created Date

  • Number of Files by Last Modified Date

  • Number of Files by Last Accessed Date

  • Top Duplicated Files

  • File Listing by File Size

  • File Listing by Name
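As a rough illustration of what the duplicate-related visualizations summarize, the sketch below computes similar figures from a list of (path, hash) pairs. This is an assumption-laden example, not the dashboard's actual query: the function name duplicate_summary is hypothetical, and it assumes that every file sharing a hash with at least one other file counts as duplicated.

```python
from collections import Counter


def duplicate_summary(records, top_n=3):
    """Summarize duplicates from an iterable of (path, hash) pairs."""
    records = list(records)
    counts = Counter(file_hash for _, file_hash in records)
    # Keep only hashes that appear on more than one file.
    duplicated = {h: n for h, n in counts.items() if n > 1}
    # Most frequent hashes first; ties are broken by hash value.
    top = sorted(duplicated.items(), key=lambda item: (-item[1], item[0]))
    return {
        "total_files": len(records),
        "duplicated_files": sum(duplicated.values()),
        "top_duplicated": top[:top_n],
    }
```

The three keys loosely correspond to Total Number of Files, Number of Duplicated Files and Top Duplicated Files.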

Did you know… 

The AddHashAndExtractedText tool has other advanced features. See https://enterprisefile.atlassian.net/wiki/x/E4B0Ng for a wider range of command options.
