How to check for copies of master/controlled documents

This use case requires Shinydocs Pro 24.4 (CLI) or higher.

This feature is not currently available in the Shinydocs Pro Control Center.

Scenario

Critical business documents are stored across platforms like SharePoint Online or iManage, so it's essential to keep a single, master version in use. These master files, vital for operations, should stay within their designated repository to ensure consistency and security. However, duplicates often spread across file shares and other storage spaces, complicating document management. Shinydocs offers a streamlined solution, using your master copies to quickly identify these duplicates across your data repositories, ensuring your document landscape is clean and controlled.

Prerequisites

To perform this type of analysis, you should have:

  1. At least one crawled data source (with hash values generated) that serves as the source of truth. This is where the master documents live.

  2. At least one additional crawled data source (with hash values generated) to check for copies of the master documents.
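
Before running the analysis, it can help to confirm that each index actually contains hashed items. The following is a minimal sketch, not a Shinydocs tool. It assumes the Shinydocs Search Engine accepts an Elasticsearch-style `_count` request at the index URL and that hash values are stored in a field named `hash` (the field used by the query later in this article). The function names and the index names in the usage note are hypothetical. Note that a direct `_count` request wraps the condition in a top-level `query` key, which differs from the query file passed to the CLI with -q.

```python
import json
import urllib.request


def build_hash_count_request(index_url, index_name, hash_field="hash"):
    """Build a _count request for items in index_name that have hash_field.

    Assumption: the search engine exposes an Elasticsearch-style _count API.
    """
    body = {"query": {"exists": {"field": hash_field}}}
    return urllib.request.Request(
        url=f"{index_url.rstrip('/')}/{index_name}/_count",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def count_hashed_items(index_url, index_name, hash_field="hash"):
    """Send the request and return the number of items with a hash value."""
    request = build_hash_count_request(index_url, index_name, hash_field)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["count"]
```

For example, `count_hashed_items("http://localhost:9200", "master_docs")` would return the number of hashed items in a hypothetical index named `master_docs`. A result of zero suggests the hashing step has not been run for that index yet.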

How to

Once the data sources (indices) listed in the prerequisites are in place, you’re one command away from performing this analysis:

Using Shinydocs Cognitive Toolkit CLI

CODE
CognitiveToolkit.exe TagDuplicate -u <index_url> -q <query.json_path> -i <master_source_index> --control-set --duplicate-field-name duplicateOfMaster --match-indices <indices_to_check_for_duplicates_of_master>

| Parameter | Use |
| --- | --- |
| -u <index_url> | The URL endpoint of the Shinydocs Search Engine (typically http(s)://localhost:9200). |
| -q <query.json_path> | Path to your JSON query. This should be a simple query that limits results to items where the hash field exists. |
| -i <master_source_index> | The name, names, or wildcard name of the master index. This index should be where your important documents are stored. |
| --control-set | Does not take a value. Using --control-set enables the control set feature, which is pivotal in this use case. The index name passed to -i is treated as the control set. |
| --duplicate-field-name duplicateOfMaster | Customizable. Because this use case differs slightly from a normal duplicates check, use a name other than the default. duplicateOfMaster could also be duplicateOfMasterImanage or duplicateOfMasterSharePoint. Use whatever makes the most sense for your use case. |
| --match-indices <indices_to_check_for_duplicates_of_master> | The name, names, or wildcard name of the indices you want to check for duplicates of master files. |

It is very important that your <master_source_index> is NOT included in the indices passed to --match-indices. Otherwise, this task will produce incorrect results.
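
One way to guard against accidentally including the master index in --match-indices is to assemble the command in a small script that checks for the overlap first. The sketch below is illustrative only. The function name `build_tag_duplicate_args` is hypothetical, the flags are the ones described in the table above, and wildcard handling is approximated with Python's `fnmatch` rules, which may not match exactly how the search engine expands index wildcards.

```python
from fnmatch import fnmatchcase


def build_tag_duplicate_args(index_url, query_path, master_index, match_indices,
                             field_name="duplicateOfMaster"):
    """Return the TagDuplicate argument list, refusing overlapping indices."""
    # Reject any match entry that equals, or wildcard-matches, the master index.
    # Assumption: fnmatch-style wildcards approximate index wildcard expansion.
    overlapping = [p for p in match_indices if fnmatchcase(master_index, p)]
    if overlapping:
        raise ValueError(
            f"master index {master_index!r} is covered by --match-indices: "
            f"{overlapping}"
        )
    return [
        "CognitiveToolkit.exe", "TagDuplicate",
        "-u", index_url,
        "-q", query_path,
        "-i", master_index,
        "--control-set",  # takes no value
        "--duplicate-field-name", field_name,
        "--match-indices", ",".join(match_indices),
    ]
```

The returned list can be passed to `subprocess.run` or joined with spaces for review. With hypothetical index names, `build_tag_duplicate_args("http://localhost:9200", "HashExists.json", "master_docs", ["master_*"])` raises a `ValueError` because the wildcard covers the master index.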

Example

Your organization's critical documents are stored in iManage, and you are tasked with discovering whether those documents exist outside of iManage. You have already used Shinydocs tools to crawl and hash iManage, your file shares, and OneDrive. Now you are ready to see if any documents in iManage also exist on your file shares or OneDrive.

Let’s assume your index names are as follows:

| Index alias | Index ID | Description |
| --- | --- | --- |
| imanage | dbf2ddcd-4fba-4b08-83b4-bde7205e4098 | Your iManage index |
| acme_onedrive | 96d9ccd3-88df-45c7-ba0a-b78d22454fd2 | Your OneDrive index |
| acme_fs | 543ff26e-908e-421f-93ee-5c2755bc054a | Your file system index |

First, create a query that returns only items that have a hash value:

HashExists.json

CODE
{
  "bool": {
    "must": [
      {
        "exists": {
          "field": "hash"
        }
      }
    ]
  }
}
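
If you prefer to generate the query file rather than type it by hand, a short script can write it and guarantee that the JSON is syntactically valid. This is a minimal sketch: the function names are hypothetical, and the structure simply mirrors the HashExists.json content shown above.

```python
import json
from pathlib import Path


def render_exists_query(field="hash"):
    """Return query text with the same structure as HashExists.json."""
    query = {"bool": {"must": [{"exists": {"field": field}}]}}
    return json.dumps(query, indent=2)


def write_query_file(path, field="hash"):
    """Write the query to path and return the path for use with -q."""
    target = Path(path)
    target.write_text(render_exists_query(field) + "\n", encoding="utf-8")
    return target
```

Calling `write_query_file("HashExists.json")` creates the file in the current directory, ready to be passed to the -q parameter.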

Then run the TagDuplicate tool with the following parameters:

CODE
CognitiveToolkit.exe TagDuplicate -u http://localhost:9200 -q HashExists.json -i dbf2ddcd-4fba-4b08-83b4-bde7205e4098 --control-set --duplicate-field-name duplicateOfMaster --match-indices "96d9ccd3-88df-45c7-ba0a-b78d22454fd2,543ff26e-908e-421f-93ee-5c2755bc054a"

Once the operation is complete, refresh your index patterns in Shinydocs Dashboards to pick up the new duplicateOfMaster field. The field only appears if duplicates were found.

When you inspect your acme_onedrive and acme_fs indices, you can filter on the duplicateOfMaster field with “exists” to see exactly which files are duplicates of master documents in iManage.