Run Content Cleanup Crawls (Metadata and Hash)
Apply and activate license
Prior to beginning a metadata crawl, an active product license is required. If necessary, follow these steps to apply and activate a license:
Confirm the installation location for the Cognitive Toolkit at your organization
Access Shinydocs Collab and download your license file
Note: The file name will begin with license.cog
Navigate to the location where the Cognitive Toolkit is extracted
Copy the license file to this location
Open Command Prompt as an administrator or a Service account that has permissions to the file directory to be crawled
Change directory to the location where the Cognitive Toolkit is extracted
In a text editor such as Notepad++, copy the following command, including quotation marks:
CognitiveToolkit.exe Activate --path "C:\ShinyDocs\cognitive-toolkit-2.7.1-2022-05-02\License.cog.yyyy-mm-dd-to-yyyy-mm-dd.shinydocs.customer.xml"
Update the content between quotations with the full path to your organization’s license key
Copy the updated command from your text editor to Command Prompt and run
Perform a metadata crawl
An initial metadata crawl indexes the content from targeted sources, creating an index of the content and its associated metadata. This information can then be used to understand what data exists and where, as well as to inform content cleanup and/or migration decisions.
Prepare to crawl
The following details are required to complete a metadata crawl:
File path of location where the Cognitive Toolkit is extracted
Index URL
Index name
Full UNC path(s) for share(s) to be crawled
Preparing a batch (.bat) file
Metadata crawl commands can be entered manually through the command prompt. Alternatively, batch files can run multiple commands in sequence, eliminating this manual process when there is a set of bulk commands. Batch files can also be re-run whenever a command needs to be repeated - for example, for delta metadata crawls, hash crawls, or any other command. If there are many file shares, a batch file is the most efficient way to complete the metadata crawl.
To prepare a batch file for a metadata crawl:
Access a previous CrawlFileSystem.bat from Shinydocs Collab and make a copy of this file
Save the copy to the location where the Cognitive Toolkit is extracted
Open a new file in a text editor
Create a list of the path names with their full UNC path for each share to be crawled
Save the list of path names as a .txt file
Note: As a recommended best practice, save this file to the location where the Cognitive Toolkit is extracted
Open the .bat file in your preferred editor
Set inputfile to the path for the .txt file and save
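The two files described above can be sketched as follows. This is a minimal illustration only: the share paths, index URL, and index name are hypothetical placeholders, and the actual CrawlFileSystem.bat from Shinydocs Collab may be structured differently.

```bat
:: shares.txt (one full UNC path per line) - hypothetical example paths:
::   \\FileServer01\Finance
::   \\FileServer01\HR
::   \\FileServer02\Projects

:: CrawlFileSystem.bat - minimal sketch, not the official Collab copy
@echo off
set inputfile=C:\ShinyDocs\cognitive-toolkit-2.7.1-2022-05-02\shares.txt

:: Run a metadata crawl for each share listed in the input file
for /f "usebackq delims=" %%p in ("%inputfile%") do (
    CognitiveToolkit.exe CrawlFileSystem --path "%%p" --index-server-url http://IndexerServer:9200 --index-name file-share-metadata
)
```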
Execute the crawl
Crawling using a .bat file
To execute a metadata crawl using a .bat file:
Confirm the location of the .bat file
Note: As a recommended best practice, save .bat files to the location where the Cognitive Toolkit is extracted
Open Command Prompt as an administrator or a Service Account that has permissions to the file directory to be crawled
Change directory (cd) to the location where the .bat file is located
Append the directory result with the name of the .bat file and run
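For example, assuming the Cognitive Toolkit is extracted to the path used earlier in this guide and the batch file is named CrawlFileSystem.bat, the Command Prompt session would look like:

```bat
cd C:\ShinyDocs\cognitive-toolkit-2.7.1-2022-05-02
CrawlFileSystem.bat
```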
Crawling an individual share
To execute a crawl without using a .bat file:
Open Command Prompt as an administrator or the Service Account that has permissions to the file directory to be crawled
Change directory (cd) to where the Cognitive Toolkit is extracted
In a text editor such as Notepad++, copy the following command:
CognitiveToolkit.exe CrawlFileSystem --path <PATH> --index-server-url <INDEX_SERVER_URL> --index-name <INDEX_NAME>
Make the following updates to the copied command:
Replace PATH with the share UNC path to be crawled
Replace INDEX_SERVER_URL with the URL for the index where the results of the crawl are to be added (for example, http://IndexerServer:9200)
Replace INDEX_NAME with the name for the index where the results of the crawl are to be added
Note: For index naming conventions, please review these guidelines
Copy the updated command from your text editor to Command Prompt and run
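As an illustration, with a hypothetical share \\FileServer01\Finance, an index server at http://IndexerServer:9200, and an index named file-share-metadata, the completed command would read:

```bat
CognitiveToolkit.exe CrawlFileSystem --path "\\FileServer01\Finance" --index-server-url http://IndexerServer:9200 --index-name file-share-metadata
```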
Perform a hash crawl
A hash crawl examines your data and applies a hash value to each file. The Tag Duplicates tooling can then compare these hash values and flag content sharing the same hash as a duplicate.
Prepare to crawl
The following details are required to complete a hash crawl:
File path of location where the Cognitive Toolkit is extracted
Index URL
Index name
Path(s) for share(s) to be crawled
Preparing a batch (.bat) file
Batch files can be used to complete a hash crawl as an alternative to inputting individual file shares. To prepare a batch file for a hash crawl:
Access a previous Hash_TagDuplicates.bat from Shinydocs Collab and make a copy of this file
Save the copy to the location where the Cognitive Toolkit is extracted
Open the .bat file with your preferred text editor
Confirm the index_url matches the location of the index to be crawled
Confirm the index_name matches the name of the index to be crawled
Save the updated file
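A minimal sketch of what the edited file might contain is shown below. Note that this is an assumption about the file's shape: the index URL, index name, and query path are placeholders, the tag-duplicates command name is hypothetical, and the actual Hash_TagDuplicates.bat from Shinydocs Collab may differ.

```bat
@echo off
:: Placeholders - confirm these match your environment
set index_url=http://localhost:9200
set index_name=file-share-metadata

:: Add hash values (and, by default, extracted text) to each document in the index
CognitiveToolkit.exe AddHashAndExtractedText --query "C:\match_all.json" -u %index_url% -i %index_name%

:: Tag duplicates step - command name here is illustrative, not confirmed
CognitiveToolkit.exe TagDuplicates -u %index_url% -i %index_name%
```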
Excluding text extraction
By default, the command to add hash values also extracts text from the content and adds this to the index.
To exclude text extraction:
Add the following parameter to the AddHashandExtractedText command:
--action-keyword "hash"
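For example, a hash-only variant of the AddHashAndExtractedText command (with a placeholder index name) would read:

```bat
CognitiveToolkit.exe AddHashAndExtractedText --query "C:\match_all.json" -u "http://localhost:9200" -i file-share-metadata --action-keyword "hash"
```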
Excluding tag duplicates
By default, tagging duplicates functionality is included with a hash crawl.
To run a hash crawl without simultaneously tagging duplicates:
Comment out the tag duplicates command within the Hash_TagDuplicates.bat
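In a batch file, a line is commented out by prefixing it with REM (or ::). Assuming the tag-duplicates line in Hash_TagDuplicates.bat looks something like the hypothetical sketch below, the edit would be:

```bat
:: Hash values are still added to the index
CognitiveToolkit.exe AddHashAndExtractedText --query "C:\match_all.json" -u %index_url% -i %index_name%

:: Tag-duplicates step disabled by the REM prefix (command name is illustrative)
REM CognitiveToolkit.exe TagDuplicates -u %index_url% -i %index_name%
```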
Applying the hash crawl using a .bat file
To complete a hash crawl using a .bat file:
Confirm the location of the .bat file
Note: As a recommended best practice, save .bat files to the location where the Cognitive Toolkit is extracted
Open Command Prompt as an administrator or a Service Account that has permissions to the file directory to be crawled
Change directory (cd) to the location where the .bat file is located
Append the directory result with the name of the .bat file and run
Applying the hash crawl to an individual share
To complete a hash crawl for an individual share:
Open Command Prompt as an administrator or a Service Account that has permissions to the file directory to be crawled
Change directory (cd) to where the Cognitive Toolkit is extracted
In a text editor such as Notepad++, copy the following command:
CognitiveToolkit.exe AddHashAndExtractedText --query "C:\match_all.json" -u "http://localhost:9200" -i INDEXNAME
Make the following updates to the copied command:
Replace http://localhost:9200 with the URL for the index where the results of the crawl are to be added, if required
Replace INDEXNAME with the name for the index where the results of the crawl are to be added
Copy the updated command from your text editor to Command Prompt and run
Next Steps
Crawl results can be reviewed in the Visualizer while the metadata crawl runs and after the hash crawl is complete. Dashboards visualizing these results can be prepared to share with teams and project stakeholders.