How to check for copies of master/controlled documents
This use case requires Shinydocs Pro 24.4 (CLI) or higher.
This feature is not currently available in the Shinydocs Pro Control Center.
Scenario
Critical business documents are stored across platforms like SharePoint Online or iManage, so it's essential to keep a single master version in use. These master files, vital for operations, should stay within their designated repository to ensure consistency and security. However, duplicates often spread across file shares and other storage spaces, complicating document management. Shinydocs offers a streamlined solution, using your master copies to quickly identify these duplicates across your data repositories, ensuring your document landscape is clean and controlled.
Prerequisites
To perform this type of analysis, you should have:
At least one data source crawled (with hash values generated) that serves as the source of truth; this is where the master documents live (see the example record below)
At least one additional data source crawled (with hash values generated) to check for copies of the master documents
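Duplicate detection compares the hash values stored on each indexed item, so every index involved needs that field populated. The record below is only an illustrative sketch of what a crawled, hashed item might look like; apart from the hash field referenced later in this guide, the field names, the hash algorithm, and the values shown are assumptions for illustration, not the actual Shinydocs schema.
{
  "name": "Q3-Financial-Report.docx",
  "path": "\\\\fileshare01\\finance\\Q3-Financial-Report.docx",
  "extension": ".docx",
  "hash": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
}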
How to
Once you have the data sources (indices) described in the prerequisites, you’re one command away from performing this analysis:
Using Shinydocs Cognitive Toolkit CLI
CognitiveToolkit.exe TagDuplicate -u <index_url> -q <query.json_path> -i <master_source_index> --control-set --duplicate-field-name duplicateOfMaster --match-indices <indices_to_check_for_duplicates_of_master>
Parameter | Use |
---|---|
-u | The URL endpoint of the Shinydocs Search Engine (typically http(s)://localhost:9200) |
-q | Path to your JSON query. This should be a simple query that limits results to items where the hash field exists |
-i | The name, names, or wildcard name of the master index. This index should be where your important documents are stored. |
--control-set | Does not use a value; including this flag treats the master index (-i) as the control set |
--duplicate-field-name | Customizable. For this use case, which is slightly different from a normal duplicates check, use a different name than the default. |
--match-indices | The name, names, or wildcard name of the indices you want to check for duplicates of master files in. It is very important that your master index (-i) is not included in this list. |
Example
Your organization's critical documents are stored in iManage, and you are tasked with discovering whether any of those documents exist outside of iManage. You have already used Shinydocs tools to crawl and hash iManage, your file shares, and OneDrive. Now you are ready to see if any documents in iManage also exist on your file shares or OneDrive.
Let’s assume your index names are as follows:
Index alias | Index ID | Description |
---|---|---|
imanage | dbf2ddcd-4fba-4b08-83b4-bde7205e4098 | Your iManage index |
acme_onedrive | 96d9ccd3-88df-45c7-ba0a-b78d22454fd2 | Your OneDrive index |
acme_fs | 543ff26e-908e-421f-93ee-5c2755bc054a | Your file system index |
We will first need to create a query that returns items that have a hash value:
HashExists.json
{
  "bool": {
    "must": [
      {
        "exists": {
          "field": "hash"
        }
      }
    ]
  }
}
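Before running the tool, you can optionally sanity-check the query. The command below is a sketch that assumes the Shinydocs Search Engine exposes a standard OpenSearch/Elasticsearch-compatible _search API on port 9200; note that while the saved file contains only the query clause, a direct _search request wraps it in a top-level query object. A non-zero hits.total in the response confirms the iManage index contains hashed items.
curl -s -X POST "http://localhost:9200/dbf2ddcd-4fba-4b08-83b4-bde7205e4098/_search?size=0" -H "Content-Type: application/json" -d "{\"query\":{\"bool\":{\"must\":[{\"exists\":{\"field\":\"hash\"}}]}}}"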
Then we will run the TagDuplicate tool with the following parameters:
CognitiveToolkit.exe TagDuplicate -u http://localhost:9200 -q HashExists.json -i dbf2ddcd-4fba-4b08-83b4-bde7205e4098 --control-set --duplicate-field-name duplicateOfMaster --match-indices "96d9ccd3-88df-45c7-ba0a-b78d22454fd2,543ff26e-908e-421f-93ee-5c2755bc054a"
Once the operation is complete, be sure to refresh your index patterns in Shinydocs Dashboards to pick up the new duplicateOfMaster field (if any duplicates are found, of course).
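To make the result concrete, a tagged item in one of the match indices might look roughly like the sketch below. This is illustrative only: the field name comes from --duplicate-field-name, but the value shown and the surrounding fields are assumptions; as described next, what matters for reporting is simply that the field exists.
{
  "name": "Q3-Financial-Report.docx",
  "path": "\\\\fileshare01\\finance\\archive\\Q3-Financial-Report.docx",
  "hash": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
  "duplicateOfMaster": true
}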
When you inspect your acme_onedrive and acme_fs indices, you will be able to filter on the duplicateOfMaster field with an "exists" filter, showing you exactly which files are duplicates of data that already exists in iManage.
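If you prefer to pull the list programmatically rather than through Dashboards, a request like the following sketch works, again assuming the search engine accepts standard OpenSearch/Elasticsearch-style _search requests and that the acme_onedrive and acme_fs aliases resolve to the index IDs above; the field names in _source are assumptions about the crawler's metadata.
curl -s -X POST "http://localhost:9200/acme_onedrive,acme_fs/_search" -H "Content-Type: application/json" -d "{\"query\":{\"exists\":{\"field\":\"duplicateOfMaster\"}},\"_source\":[\"name\",\"path\"],\"size\":100}"
Each hit returned is a file outside of iManage whose hash matches one of your master documents.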