Cognitive Suite 2.9.1.0 (January 2023)
Information on feature enhancements and fixed issues that are part of the Cognitive Suite 2.9.1.0 (January 2023) release.
Features
AddHashAndExtractedText: add --text-timeout configuration in exe.config file
Description
You can now modify the default value for the --text-timeout option for AddHashAndExtractedText.
By default, if a file takes longer than 60 seconds to process through Optical Character Recognition (OCR), full text extraction is not completed for that file.
How to enable
To ensure that the OCR process is completed on each file, set the --text-timeout value in the exe.config file to a value higher than the default of 60 seconds.
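A minimal sketch of what this could look like, assuming the toolkit reads the default from a standard .NET appSettings block; the key name shown here is hypothetical, so confirm the exact key in your shipped exe.config.

```xml
<!-- Hypothetical sketch: the key name may differ in your exe.config -->
<configuration>
  <appSettings>
    <!-- Raise the OCR text-extraction timeout from the 60-second
         default to 300 seconds so large scanned files can finish -->
    <add key="text-timeout" value="300" />
  </appSettings>
</configuration>
```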
CrawlBox: retrieving folder collaborators that have access to the folders and files
Description
When using the CrawlBox operation, the --crawl-collaborators option retrieves the folder collaborators added for a group.
How to enable
To crawl and capture Box groups and their users, include the --crawl-collaborators option when running the CrawlBox operation.
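For illustration, a hedged example of what the invocation might look like; the executable name and the remaining parameters are placeholders, not confirmed syntax:

```
# Hypothetical invocation; only the --crawl-collaborators option is confirmed
CognitiveToolkit CrawlBox --crawl-collaborators ...
```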
CrawlSharePointOnline: SharePoint Property Bag
Description
SharePoint items contain additional properties, stored in the SharePoint property bag, that are not captured in the index by default.
How to enable
To capture additional SharePoint properties, use the SharePointPropertyKeys key in the config file to specify a comma-delimited list of properties to pull from SharePoint.
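As a sketch, assuming the config file uses a .NET appSettings-style entry; the property names in the value are examples only, and only the SharePointPropertyKeys key itself comes from this release note:

```xml
<!-- Sketch: property names are placeholders -->
<add key="SharePointPropertyKeys" value="PropertyA,PropertyB,PropertyC" />
```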
Fixed Issues
AddHashAndExtractedText: false error messages during OCR text extraction (timeout)
Description
Previously, when performing text extraction on OCR documents with the text timeout feature configured in the config file, false error messages were written to the log files. These false error messages are no longer logged.
AddHashAndExtractedText: IronOCR errors logging document-id in message
Description
Previously, OCR errors logged the document-id in the message ID field rather than in the document-id field column. The ID now populates the document-id field column in the shinydocs-jobs index when there is an error processing files via OCR.
Note: The ID is intentionally retained in the log file message; the message field is populated from the log file message, which contains the ID.
AddPathValidation: when the file is detected as in use, mark as valid
Description
Previously, when performing a filesystem PathValidation command, errors were displayed for files that were open in MS Excel or MS Word. Files that are in use are now marked as valid, and no error is encountered when running the operation on open files.
CacheFileSystemPermission: cached-owner field is not populating with a value
Description
Previously, when running the CacheFileSystemPermission command, the cached-owner field was not populated with a value, and no errors were logged in the log file. The cached-owner field now populates as expected.
CLI: email address fields added by operations need to be consistent
Description
CrawlExchange and CrawlFileSystem with --crawl-pst (and text extraction) previously used different fields to store email addresses. Any CLI tool that stores email addresses in an index now uses the same fields. The names were standardized, and the following field names are used for CrawlExchange and for CrawlFileSystem (--crawl-pst); see the sketch after this list:
receivedBy
fromAddress
ccRecipients
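An illustrative sketch of how an indexed email document might carry the standardized fields; the values, and the assumption that ccRecipients holds multiple addresses, are examples only:

```json
{
  "receivedBy": "jane.doe@example.com",
  "fromAddress": "john.smith@example.com",
  "ccRecipients": ["team@example.com", "legal@example.com"]
}
```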
CrawlExchange: attachments use "size" field rather than "length"
Description
When using CrawlExchange, the size of attachments was recorded in a “size” field, while all other sources record the size of a file in a “length” field. For consistency, CrawlExchange has been changed to use “length” to record the size of the file.
CrawlExchange: crawling public folders should work on Exchange On-Premise 2019
Description
The exchange source setting json file must contain public-folder-access-email in order to crawl public folders while using CrawlExchange to crawl Exchange On-Premise 2019.
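A minimal sketch of the relevant entry; the surrounding structure of the source setting file and the example address are assumptions, and only the public-folder-access-email key comes from this note:

```json
{
  "public-folder-access-email": "public-folder-mailbox@example.com"
}
```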
Permissions
Feature | Permissions required (On-Premise) | Permissions required (Online)
---|---|---
Crawling all inboxes | Mailbox Search and ApplicationImpersonation | Mailbox Search
Crawling public folders | Public Folders | Public Folders
We strongly recommend creating a new role group in Exchange for managing Shinydocs software with the recommended permissions.
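For example, a role group with the On-Premise permissions above could be created in the Exchange Management Shell; the group and member names below are placeholders:

```powershell
# Create a dedicated role group for the Shinydocs service account
# (group name and member are examples)
New-RoleGroup -Name "Shinydocs Crawler" `
  -Roles "Mailbox Search", "ApplicationImpersonation" `
  -Members "svc-shinydocs"
```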
CrawlSharepoint: index UTC times are not correct for SharePoint
Description
SharePoint time zones are not directly transferable to system TimeZoneInfo values, which resulted in incorrect UTC times being recorded in the index.
CrawlSharePointOnline: SaveValue does not work for encrypting SharePoint Online secret key
Description
The SaveValue option was previously missing the call to SwapSavedParameters, which meant the option would not work for encrypting the SharePoint Online secret key. SwapSavedParameters has been added, and the operation now works as expected. Encrypted IDs can now be successfully added to the saved-parameters.yaml file, and the CrawlSharePointOnline tool can be run with a source setting file containing encrypted keys.
CrawlSharePointOnlineSites: provide status update when crawling SharePoint Online using site-index
Description
Previously, if you deleted a previously crawled site in SharePoint Online, it would still be in the index and CrawlSharePointOnlineSites would not remove it. Now, a new field called deleted with a value of true is created in the index when a SharePoint site has been deleted and you rerun the CrawlSharePointOnlineSites tool.
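Assuming the index can be queried with an Elasticsearch-style term query (an assumption; the index name and query endpoint are not specified in this note), sites flagged as deleted could be listed like this:

```json
{
  "query": {
    "term": { "deleted": true }
  }
}
```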
CacheFileSystemPermissions: only processing 100 items and getting stack trace
Description
When running the CacheFileSystemPermissions operation, the progress bar stopped and an error message was produced after processing only 100 items. The operation now performs as expected.
ExtractAndCrawlPst: items duplicated when rerun on a pst that was already extracted
Description
When adding a pst to an existing index that already contained an extracted pst, the items in the first pst file were duplicated both in the file on the fileshare and in the index. To prevent this duplication, a new option has been added to the ExtractAndCrawlPst operation: --create-duplicates. This command-line parameter allows or disallows duplicates to be created when the ExtractAndCrawlPst operation is rerun on a pst file. By default, the --create-duplicates value is set to false, which means that if an existing msg file is found, it is skipped.
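A hedged sketch of a rerun that permits duplicates; the executable name and the remaining parameters are placeholders:

```
# Hypothetical invocation; only the --create-duplicates option is confirmed.
# Allow duplicates on a rerun (default is false, which skips existing msg files):
CognitiveToolkit ExtractAndCrawlPst --create-duplicates true ...
```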
RemoveField: setting threads to more than 2 will bypass the provided query and remove all instances of the field in the specified index
Description
Rather than removing the field from the items matched by the query, setting threads higher than 2 removed all instances of the field from everything in the index. This has been fixed, and RemoveField now removes the field only from the items matched by the query.
RestorePermissions: updated error message when permissions cannot be restored
Description
If a document does not have a cached-permissions field (for example, because the CacheFileSystemPermissions command was not run prior to running RestoreCachedFileSystemPermissions), the permissions for the file are not reset. This is the intended behaviour; however, the status message in the console suggested that the document was processed successfully, contradicting the log files, where an error was recorded while processing the document.
The console now displays the correct message when trying to restore the permissions for a file without cached-permissions.
SetFileSystemPermission: allow Security Groups with --identity parameter
Description
Running the SetFileSystemPermissions operation with a security group previously returned an error. Now, the --identity option can be used with security groups for ease of administration. For example, if there is a change of user permissions related to legal hold documents, user accounts can be updated in the security group instead of running the command each time.
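A hedged sketch of using --identity with a security group; the executable name, group name, and remaining parameters are placeholders:

```
# Hypothetical invocation; only the --identity option is confirmed
CognitiveToolkit SetFileSystemPermissions --identity "DOMAIN\LegalHoldGroup" ...
```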
TagDuplicate: change the status of documents that are no longer duplicates
Description
When using the TagDuplicate operation, the duplicate value on items that are no longer duplicates can be removed using the --overwrite option.
How to enable
To remove the duplicate value, run RemoveField, and then include the --overwrite option when running the TagDuplicate command. This process changes the value of the remaining file to reflect that it is no longer a duplicate.
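A hedged sketch of the two-step process; executable names and remaining parameters are placeholders:

```
# Step 1 (hypothetical invocation): remove the stale duplicate field
CognitiveToolkit RemoveField ...
# Step 2: re-run TagDuplicate with --overwrite so the remaining file
# is updated to reflect that it is no longer a duplicate
CognitiveToolkit TagDuplicate --overwrite ...
```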