Discovering Duplicates - Advanced
This document is a companion to https://enterprisefile.atlassian.net/wiki/x/LABXNg. Here you will find advanced instructions for fine-tuning how the AddHashAndExtractedText tool processes your file(s).
Open the Cognitive Suite by opening a Windows command prompt in Administrator mode
Change directory (cd) to where you extracted the shinydocs-cognitive-toolkit-yyyy-mm-dd.zip file
To see all of the available options for the AddHashAndExtractedText tool, use the command:
CognitiveToolkit.exe AddHashAndExtractedText --help
Command options:
Tool: AddHashAndExtractedText
Usage: CognitiveToolkit AddHashAndExtractedText [options]
Command | Required/Optional | Info |
---|---|---|
--source-settings <SOURCE SETTINGS> | Required for repository sources Default: leave blank for file system | Do not use this parameter if the source is a file system! Path to JSON file that contains connection details for the repository (source). Templates for these files can be found in the CognitiveToolkit download under |
-a|--algorithm <ALGORITHM> | Optional Default: md5 | Algorithm (Available algorithms: md5, sha1, sha256, sha512) |
--max-characters <MAX_CHARACTERS> | Optional Default: all characters | The maximum number of characters for the extracted text field. Warning: Setting this value too high can result in timeouts or problems loading the index if the supporting hardware is unable to cope with the load. Shinydocs recommends 30,000 to ensure a performant index. |
--debug-level <DEBUG_LEVEL> | Optional Default: 20 | The level of depth of exception messages |
--action-keyword <ACTION_KEYWORD> | Optional Default: both | Action to perform (hash,text,both) |
--ocr-utility <OCR_UTILITY> | Optional Default: none | OCR Utility to use for text extraction (iron,none) |
-q|--query <QUERY> | Required | Path to JSON file containing your desired query. The results of this query will be the input for this tool. You can also use escaped JSON directly in the command. Paths with spaces and inline JSON must be wrapped in "double quotes". |
-s|--silent | Optional Note: For tasks that are scheduled, --silent is preferred as there is a slight performance increase. | Turn off the progress bar |
-t|--threads <THREADS> | Optional | Number of parallel processes to start |
--skip-errors | Optional | Skip re-processing errors - items marked as: Errors indicate there was a problem either generating the hash ID for the item or extracting its text. Check your logs for errors. |
-u|--index-server-url <INDEX_SERVER_URL> | Required | URL of the index server |
-i|--index-name <INDEX_NAME> | Required | Name of the index. Note the value used here as it will have to match what is used in future Cognitive Toolkit tools (such as addHash). |
--index-type <INDEX_TYPE> | Optional | Include a name for the index objects. If you do not include a name, the name “shinydocs” will be recorded here. You cannot change the index type easily! Only use this option if you know what you are doing. |
--force | Optional | Forcefully remove / Suppress prompt for confirmation |
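The options in the table above can be combined into a single command line. The following Python sketch writes a query file and assembles the argument list from the options documented here. It is an illustration, not an official Shinydocs script. The index server URL (http://localhost:9200), the index name (shiny), the query file name, and the helper names build_command and write_query are all hypothetical. The match_all query body assumes the index server accepts Elasticsearch-style query JSON, which this document does not confirm.

```python
import json
import tempfile
from pathlib import Path

# Values listed for --algorithm and --action-keyword in the options table.
ALGORITHMS = {"md5", "sha1", "sha256", "sha512"}
ACTIONS = {"hash", "text", "both"}


def write_query(directory):
    """Write a query file and return its path.

    Assumption: the index server accepts an Elasticsearch-style
    match_all query. Replace the body with your own query.
    """
    query = {"query": {"match_all": {}}}
    path = Path(directory) / "match-all.json"  # hypothetical file name
    path.write_text(json.dumps(query, indent=2), encoding="utf-8")
    return path


def build_command(query_path, index_url, index_name, algorithm="md5",
                  max_characters=30000, action="both", threads=None,
                  silent=True):
    """Assemble the AddHashAndExtractedText argument list.

    Only options that appear in the table above are used.
    """
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    if action not in ACTIONS:
        raise ValueError(f"unsupported action keyword: {action}")
    args = [
        "CognitiveToolkit.exe", "AddHashAndExtractedText",
        "-q", str(query_path),
        "-u", index_url,
        "-i", index_name,
        "-a", algorithm,
        "--max-characters", str(max_characters),
        "--action-keyword", action,
    ]
    if threads is not None:
        args += ["-t", str(threads)]
    if silent:
        # Preferred for scheduled tasks, per the table above.
        args.append("--silent")
    return args


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        query_file = write_query(tmp)
        command = build_command(
            query_file,
            "http://localhost:9200",  # hypothetical index server URL
            "shiny",                  # hypothetical index name
            algorithm="sha256",
            threads=4,
        )
        # Print the command for review instead of running it.
        print(" ".join(command))
```

To actually run the tool, pass the list to subprocess.run(command, check=True) from the directory where the toolkit was extracted. Passing arguments as a list avoids having to quote paths that contain spaces by hand.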