Information on feature enhancements and fixed issues that are part of the Cognitive Suite 220.127.116.11 (January 2023) release.
--text-timeout configuration in exe.config file
You can now modify the default value for the
--text-timeout option for AddHashAndExtractedText.
For example, if a file takes longer than the default value of 60 seconds to process through Optical Character Recognition (OCR), full text extraction is not completed for that file.
How to enable
To give the OCR process enough time to complete on each file, set the
--text-timeout value higher than the default setting of 60 seconds in the exe.config file.
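The effect of a per-file timeout can be illustrated with a short Python sketch. This is not the Cognitive Suite implementation; extract_with_timeout and slow_ocr are hypothetical names, and the sketch only assumes that a file whose extraction exceeds the timeout ends up without extracted text.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def extract_with_timeout(extract, path, timeout_seconds=60):
    """Return the extracted text, or None if extraction exceeds the timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(extract, path)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The file is left without full text, as described above.
        return None
    finally:
        # Do not block on the worker; it is not cancelled and finishes
        # in the background.
        pool.shutdown(wait=False)


def slow_ocr(path):
    """Hypothetical stand-in for a slow OCR call."""
    time.sleep(0.3)
    return "text of " + path
```

With a timeout shorter than the processing time, the call returns None. Raising the timeout above the processing time lets the same file complete.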
CrawlBox: retrieving folder collaborators that have access to the folders and files
When using the CrawlBox operation, the
--crawl-collaborators option will retrieve folder collaborators added for a group.
How to enable
To crawl and capture Box groups and their users, include the
--crawl-collaborators option when running the CrawlBox operation.
CrawlSharePointOnline: SharePoint Property Bag
SharePoint items contain additional properties that are not captured in the index.
How to enable
To capture additional SharePoint properties, enter a comma-delimited list of properties to pull from SharePoint using the key
SharePointPropertyKeys in the config file.
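As an illustration of what a comma-delimited value looks like once it is split into individual keys, here is a minimal Python sketch. The property names in the usage note are hypothetical, and trimming whitespace around each key is an assumption of the sketch, not documented tool behaviour.

```python
def parse_property_keys(raw):
    """Split a comma-delimited string into a list of trimmed, non-empty keys."""
    return [key.strip() for key in raw.split(",") if key.strip()]
```

For example, parse_property_keys("Department, ProjectCode") yields the two keys Department and ProjectCode.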
AddHashAndExtractedText: false error messages when OCR text extracting (timeout)
Previously, false error messages were written to the log files when performing text extraction on OCR documents to test the text timeout setting in the config file. These false error messages are no longer logged.
AddHashAndExtractedText: IronOCR Errors logging document-id in message
Previously, OCR errors logged the document-id in the message ID field rather than in the document-id field column. The ID is now written to the document-id field column of the shinydocs-jobs index when an error occurs while processing files via OCR.
Note: The ID should not be removed from the log file message, as the message field is populated from the log file message, which contains the ID.
AddPathValidation: when the file is detected as in use, mark as valid
Previously, when performing a filesystem PathValidation command, errors were displayed for files that were currently open in MS Excel or MS Word. The operation no longer reports an error when it encounters open files.
CacheFileSystemPermission: cached-owner field is not populating with a value
When running the CacheFileSystemPermission command, the cached-owner field was not populated with a value and no errors were logged in the log file. Now, the cached-owner field is populating as expected.
CLI: email address fields added by operations need to be consistent
Previously, CrawlExchange and CrawlFileSystem with
--crawl-pst (and text extraction) used different fields to store email addresses. Any CLI tool that stores email addresses in an index now uses the same fields. The names were standardized, and the following names are used for CrawlExchange and for CrawlFileSystem (
CrawlExchange: attachments use "size" field rather than "length"
Previously, CrawlExchange recorded the size of attachments in a size field. All other sources use length to record the size of the file. For consistency, CrawlExchange has been changed to use length to record the size of the file.
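The rename can be pictured with a small Python sketch that maps an attachment record using the older size field onto the length field. The function name and the sample record are hypothetical; this is only an illustration of the field change, not a tool shipped with the release.

```python
def normalize_attachment_size(doc):
    """Return a copy of doc with a legacy "size" field renamed to "length"."""
    normalized = dict(doc)
    # Only rename when "length" is not already present, so an existing
    # "length" value is never overwritten.
    if "size" in normalized and "length" not in normalized:
        normalized["length"] = normalized.pop("size")
    return normalized
```

A record such as {"name": "report.pdf", "size": 2048} becomes {"name": "report.pdf", "length": 2048}.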
CrawlExchange: crawling public folders should work on Exchange On-Premise 2019
The Exchange source setting JSON file must contain
public-folder-access-email in order to crawl public folders while using CrawlExchange to crawl Exchange On-Premise 2019.
crawling all inboxes
Permissions required: Mailbox Search and ApplicationImpersonation
Permissions required: Mailbox Search
crawling public folders
Permissions required: Public Folders
Permissions required: Public Folders
We strongly recommend creating a new role group in Exchange for managing Shinydocs software with the recommended permissions.
CrawlSharepoint: index UTC times are not correct for SharePoint
SharePoint TimeZones are not transferable to system TimeZoneInfos.
CrawlSharePointOnline: SaveValue does not work for encrypting SharePoint Online secret key
The SaveValue option was previously missing the call to SwapSavedParameters, which meant that the option would not work for encrypting the SharePoint Online secret key. SwapSavedParameters has been added and the operation is now working as expected. Encrypted IDs can now be successfully added to the saved-parameters.yaml file, and the CrawlSharePointOnline tool can be run with a source setting file containing encrypted keys.
CrawlSharePointOnlineSites: provide status update when crawling SharePoint Online using site-index
Previously, if you deleted a previously crawled site in SharePoint Online, it would still be in the index and CrawlSharePointOnlineSites would not remove it. Now, a new field called deleted with a value of true is created in the index when a SharePoint site has been deleted and you rerun the CrawlSharePointOnlineSites tool.
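The behaviour described above can be sketched in Python as follows. The deleted field and its true value come from this release note; the url key, the function name, and the in-memory list standing in for the index are assumptions made for illustration.

```python
def mark_deleted_sites(indexed_sites, live_site_urls):
    """Flag index entries whose site no longer exists in SharePoint Online."""
    live = set(live_site_urls)
    for site in indexed_sites:
        if site["url"] not in live:
            # The entry stays in the index but is flagged as deleted.
            site["deleted"] = True
    return indexed_sites
```

Entries for sites that still exist are left unchanged, so a query on the deleted field can separate removed sites from live ones.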
CacheFileSystemPermissions: only processing 100 items and getting stack trace
Previously, when running the CacheFileSystemPermissions operation, the progress bar stopped and produced an error message after processing only 100 items. The operation now performs as expected.
ExtractAndCrawlPst: items duplicated when rerun on a pst that was already extracted
Previously, when adding a pst to an existing index containing a pst that was already extracted, the items in the first pst file were duplicated in the file on the fileshare and in the index. To prevent this duplication, a new option has been added to the ExtractAndCrawlPst operation:
--create-duplicates. This command-line parameter controls whether duplicates are created when the ExtractAndCrawlPst operation is rerun on a pst file. By default, the
--create-duplicates value is set to false. That means that if an existing msg file is found, it will be skipped.
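The skip logic can be sketched in Python as below. This is an illustration under stated assumptions, not the operation's actual code: plan_extraction and its arguments are hypothetical, and the sketch does not model how a duplicate is named when duplicates are allowed.

```python
def plan_extraction(msg_paths, existing_paths, create_duplicates=False):
    """Split msg paths into those to write and those to skip."""
    existing = set(existing_paths)
    to_write, skipped = [], []
    for path in msg_paths:
        if path in existing and not create_duplicates:
            # Default behaviour: an existing msg file is skipped.
            skipped.append(path)
        else:
            to_write.append(path)
    return to_write, skipped
```

With the default of false, rerunning on a pst whose items already exist writes nothing new for those items. Passing True writes them again.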
RemoveField: setting threads to more than 2 will bypass the provided query and remove all instances of the field in the specified index
Previously, rather than removing the field from only the items matched by the query, setting the threads to higher than 2 removed all instances of the field from everything in the index. This has been fixed, and RemoveField now removes the field only from the items matched by the query.
RestorePermissions - updated error message when permissions cannot be restored
If a document does not have a cached-permissions field (for example, the CacheFileSystemPermissions command was not run prior to running RestoreCachedFileSystemPermissions), the permissions for the file will not be reset. This is the intended behaviour; however, the status message in the console previously suggested that the document was processed successfully. This contradicted the log files, which recorded an error thrown while processing the document.
The console now displays the correct message while trying to restore the permissions for a file without cached-permissions.
SetFileSystemPermission: allow Security Groups with --identity parameter
Running the SetFileSystemPermissions operation with a security group was previously returning an error. Now, the --identity option can be used with security groups for ease of administration. For example, if there is a change of user permissions related to legal hold documents, user accounts can be updated in the security group instead of running the command each time.
TagDuplicate: change the status of documents that are no longer duplicates
When using the TagDuplicate operation, items that are no longer duplicates can be removed using the
How to enable
To remove the duplicate value, run RemoveField, and then include the
--overwrite option when running the TagDuplicate command. This process will change the value of the remaining file to reflect that it is no longer a duplicate.