Using AddHashandExtractedText for duplicate detection and text analytics

This document is a companion to Initial Discovery - File Share. It provides advanced instructions for fine-tuning the initial discovery of your file share.

A hash value is a fingerprint computed from a file's contents. If two or more files have exactly the same hash value, they are, for practical purposes, byte-for-byte the same file.
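To illustrate the idea behind hash-based duplicate detection, here is a minimal Python sketch that hashes every file under a folder and groups files that share a hash. It is not how the Cognitive Toolkit is implemented; the function names `file_hash` and `find_duplicates` are hypothetical and exist only for this example. The algorithm names match the ones listed for the --algorithm option.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_hash(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)  # accepts md5, sha1, sha256, sha512
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, algorithm="md5"):
    """Group files under root by hash and keep only groups with two or more files."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_hash(path, algorithm)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates` on a folder returns a dictionary that maps each shared hash to the list of files with identical contents, regardless of file name or location.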

Open the Cognitive Toolkit by opening a Windows Command Prompt in Administrator mode.
Change directory (cd) to the folder where you extracted the shinydocs-cognitive-toolkit-yyyy-mm-dd.zip file.
To see all of the available options for the AddHashandExtractedText tool, use the following command at the Command Prompt:

CODE

CognitiveToolkit.exe AddHashandExtractedText --help

A list of command options is displayed.

Command options:

Option	Required/Optional	Info
--source-settings <SOURCE SETTINGS>	Required for non-file-system sources Default: leave blank for file system	Path to the JSON file that contains connection details for the repository (source). Do not use this parameter if the source is a file system. Templates for these files can be found in the Cognitive Toolkit download under `External Resources\Example Source Settings`.
--action-keyword <ACTION_KEYWORD>	Optional Default: both	Action to perform (hash,text,both)
--debug-level <DEBUG_LEVEL>	Optional Default: 20	The level of detail of exception messages
--force	Optional Default: false	Forcefully remove / Suppress prompt for confirmation
--index-type <INDEX_TYPE>	Optional Default: shinydocs	Include a name for the index objects. If you do not include a name, the name “shinydocs” will be recorded here. You cannot change the index type easily! Only use this option if you know what you are doing.
--max-characters <MAX_CHARACTERS>	Optional Default: all characters	The maximum number of characters for the extracted text field. Warning: Setting this value too high can result in timeouts or problems loading the index if the supporting hardware cannot cope with the load. Shinydocs recommends 30,000 to ensure a performant index.
--ocr-utility <OCR_UTILITY>	Optional Default: none	OCR Utility to use for text extraction (iron,none)
-a|--algorithm <ALGORITHM>	Optional Default: md5	Algorithm (Available algorithms: md5, sha1, sha256, sha512)
-i|--index-name <INDEX_NAME>	Required	Name of the index. Note the value used here as it will have to match what is used in future Cognitive Toolkit tools (such as addHash).
-q|--query <QUERY>	Required	Path to JSON file containing your desired query. The results of this query will be the input for this tool. You can also use escaped JSON directly in the command. Paths with spaces or using inline JSON will require “double-quotes”.
--skip-errors	Optional Default: false	Skip re-processing of items marked as `addhashandextractedtext:error`. Errors indicate that there was a problem either generating the hash for the item or extracting its text. Check your logs for errors.
-s|--silent	Optional Default: false	Turn off the progress bar. Note: For scheduled tasks, --silent is preferred because it gives a slight performance increase.
-t|--threads <THREADS>	Optional Default: 1	Number of parallel processes to start
-u|--index-server-url <INDEX_SERVER_URL>	Required	URL of the index server
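When the tool is run from a script or scheduled task, it can help to assemble the command line from the options in the table above. The following Python sketch does that, using only flags listed in the table. The helper name `build_command` and its parameter names are hypothetical, and the sketch only builds the argument list; it does not run the tool.

```python
VALID_ACTIONS = ("hash", "text", "both")
VALID_ALGORITHMS = ("md5", "sha1", "sha256", "sha512")


def build_command(index_name, index_server_url, query, source_settings=None,
                  action_keyword=None, algorithm=None, max_characters=None,
                  threads=None, silent=False):
    """Return the argument list for an AddHashandExtractedText run."""
    if action_keyword is not None and action_keyword not in VALID_ACTIONS:
        raise ValueError(f"action_keyword must be one of {VALID_ACTIONS}")
    if algorithm is not None and algorithm not in VALID_ALGORITHMS:
        raise ValueError(f"algorithm must be one of {VALID_ALGORITHMS}")

    # The three required options come first.
    args = [
        "CognitiveToolkit.exe", "AddHashandExtractedText",
        "--index-name", index_name,
        "--index-server-url", index_server_url,
        "--query", query,
    ]
    # Omit --source-settings when the source is a file system.
    if source_settings:
        args += ["--source-settings", source_settings]
    if action_keyword:
        args += ["--action-keyword", action_keyword]
    if algorithm:
        args += ["--algorithm", algorithm]
    if max_characters is not None:
        args += ["--max-characters", str(max_characters)]
    if threads is not None:
        args += ["--threads", str(threads)]
    if silent:
        args.append("--silent")
    return args
```

The returned list can be passed to `subprocess.run` from the folder that contains CognitiveToolkit.exe. Passing a list rather than a single string leaves the quoting of paths with spaces to Python.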

Example Commands:

Add hash and text extract for file share

CODE

CognitiveToolkit.exe AddHashandExtractedText --index-name shiny --index-server-url http://localhost:9200 --query "AddHashandExtractedText.json" --max-characters 30000

The above command looks in the index shiny (located at http://localhost:9200) for items that match the query in AddHashandExtractedText.json. For each result, the tool reaches out to the file share source to generate an MD5 hash and, where possible, extract text up to a maximum of 30,000 characters.
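Because the --query option expects a JSON file, it can be useful to create and check that file before running the tool. The Python sketch below writes a query file and confirms that it parses as JSON. The query body is an assumption: it uses an Elasticsearch-style match_all placeholder, which this document does not confirm as the format the tool expects, and a real query should be far more specific. The function names are hypothetical.

```python
import json
from pathlib import Path

# Assumption: placeholder query in Elasticsearch-style query DSL.
# Replace it with a query that targets only the items you want to process.
PLACEHOLDER_QUERY = {"query": {"match_all": {}}}


def write_query_file(path, query):
    """Write the query to disk as JSON and return the path."""
    target = Path(path)
    target.write_text(json.dumps(query, indent=2), encoding="utf-8")
    return target


def check_query_file(path):
    """Parse the file and return its contents; raise ValueError if it is not valid JSON."""
    text = Path(path).read_text(encoding="utf-8")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{path} is not valid JSON: {exc}") from exc
```

Running `check_query_file` on a hand-edited query file before a long run catches syntax mistakes such as a missing brace or trailing comma early.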

Add hash and text extract for Content Server

CODE

CognitiveToolkit.exe AddHashandExtractedText --index-name shiny --index-server-url http://localhost:9200 --source-settings "External Resources\Sample Source Settings\contentserver.json" --query "AddHashandExtractedText.json" --max-characters 30000

The above command looks in the index shiny (located at http://localhost:9200) for items that match the query in AddHashandExtractedText.json. For each result, the tool reaches out to the OpenText Content Server source to generate an MD5 hash and, where possible, extract text up to a maximum of 30,000 characters.

Add hash for FileNet

CODE

CognitiveToolkit.exe AddHashandExtractedText --index-name shiny --index-server-url http://localhost:9200 --source-settings "External Resources\Sample Source Settings\filenet.json" --query "AddHash.json" --action-keyword hash

The above command looks in the index shiny (located at http://localhost:9200) for items that match the query in AddHash.json. For each result, the tool reaches out to the FileNet source to generate an MD5 hash where possible. No text is extracted because --action-keyword is set to hash.

Text extract for Microsoft SharePoint

CODE

CognitiveToolkit.exe AddHashandExtractedText --index-name shiny --index-server-url http://localhost:9200 --source-settings "External Resources\Sample Source Settings\sharepoint.json" --query "TextExtraction.json" --action-keyword text --max-characters 30000

The above command looks in the index shiny (located at http://localhost:9200) for items that match the query in TextExtraction.json. For each result, the tool reaches out to the Microsoft SharePoint source to extract text, where possible, up to a maximum of 30,000 characters. No hash is generated because --action-keyword is set to text.

Tips

Make your query as specific as possible to the data you want to hash or extract text from.
If you notice the progress bar completing batches of 100 very quickly, this is likely a sign that you can increase --nodes-per-request in increments of 100.