Cognitive Suite 2.9.1.0 (01.2023)
Information on feature enhancements and fixed issues that are part of the Cognitive Suite 2.9.1.0 (January 2023) release.
Features
AddHashAndExtractedText: add --text-timeout configuration in exe.config file
Description
You can now modify the default value for the --text-timeout option for AddHashAndExtractedText.
This matters when a file takes longer than the default value of 60 seconds to process through Optical Character Recognition (OCR): in that case, the full text extraction is not completed for that file.
How to enable
To ensure that the OCR process is completed on each file, set the --text-timeout value higher than the default of 60 seconds in the exe.config file.
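The effect of a per-file timeout can be illustrated with a short Python sketch. This is not the Cognitive Suite implementation: `slow_ocr` and `extract_with_timeout` are hypothetical names, and the durations are shortened so the example runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def slow_ocr(seconds: float) -> str:
    """Hypothetical stand-in for an OCR call that takes `seconds` to finish."""
    time.sleep(seconds)
    return "extracted text"


def extract_with_timeout(ocr_seconds: float, text_timeout: float):
    """Return the extracted text, or None if OCR exceeds text_timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(slow_ocr, ocr_seconds)
    try:
        return future.result(timeout=text_timeout)
    except FutureTimeout:
        # The file took longer than the timeout, so no full text is returned.
        return None
    finally:
        # Do not block on a worker that may still be running.
        pool.shutdown(wait=False)
```

In this sketch, a file whose OCR takes longer than the timeout yields no text, which is why raising the timeout lets slow files finish.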
CrawlBox: retrieving folder collaborators that have access to the folders and files
Description
When using the CrawlBox operation, the --crawl-collaborators option retrieves folder collaborators added for a group.
How to enable
To crawl and capture Box groups and their users, include the --crawl-collaborators option when running the CrawlBox operation.
CrawlSharePointOnline: SharePoint Property Bag
Description
SharePoint items contain additional properties that are not captured in the index by default. These properties can now be captured by listing them in the config file.
How to enable
To capture additional SharePoint properties, enter a comma-delimited list of the properties to pull from SharePoint under the key SharePointPropertyKeys in the config file.
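As an illustration of the comma-delimited format, the following Python sketch splits such a value into individual property names. The property names shown are hypothetical examples, and `parse_property_keys` is not part of the Cognitive Suite.

```python
def parse_property_keys(raw: str) -> list[str]:
    """Split a comma-delimited value into trimmed, non-empty property names."""
    return [key.strip() for key in raw.split(",") if key.strip()]


# Hypothetical value for the SharePointPropertyKeys setting.
keys = parse_property_keys("Department, ProjectCode ,ReviewDate")
print(keys)  # ['Department', 'ProjectCode', 'ReviewDate']
```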
Fixed Issues
AddHashAndExtractedText: false error messages when OCR text extracting (timeout)
Description
Previously, when performing a text extraction on OCR documents with the text timeout configured in the config file, false error messages were written to the log files. These false error messages are no longer logged.
AddHashAndExtractedText: IronOCR Errors logging document-id in message
Description
Previously, OCR errors logged the document ID in the message ID field rather than in the document-id field. The ID is now populated in the document-id field of the shinydocs-jobs index when an error occurs while processing files via OCR.
Note: The ID should not be removed from the log file message, as the message field is populated from the log file message, which contains the ID.
AddPathValidation: when the file is detected as in use, mark as valid
Description
Previously, when running the AddPathValidation command on a file system, errors were displayed for files that were open in MS Excel or MS Word. Files detected as in use are now marked as valid, and no error occurs when the operation runs against open files.
CacheFileSystemPermission: cached-owner field is not populating with a value
Description
When running the CacheFileSystemPermission command, the cached-owner field was not populated with a value and no errors were logged in the log file. Now, the cached-owner field is populating as expected.
CLI: email address fields added by operations need to be consistent
Description
Previously, CrawlExchange and CrawlFileSystem with --crawl-pst (and text extraction) used different fields to store email addresses. Any CLI tool that stores email addresses in an index now uses the same fields. The names have been standardized to the following for both CrawlExchange and CrawlFileSystem (--crawl-pst):
receivedBy
fromAddress
ccRecipients
CrawlExchange: attachments use "size" field rather than "length"
Description
Previously, CrawlExchange recorded the size of attachments in a size field. All other sources use length to record the size of the file. For consistency, CrawlExchange has been changed to use length to record the size of the file.
CrawlExchange: crawling public folders should work on Exchange On-Premise 2019
Description
The Exchange source setting JSON file must contain public-folder-access-email in order to crawl public folders when using CrawlExchange to crawl Exchange On-Premise 2019.
Permissions
Feature | On-Premise | Online |
---|---|---|
crawling all inboxes | Permissions required: Mailbox Search and ApplicationImpersonation | Permissions required: Mailbox Search |
crawling public folders | Permissions required: Public Folders | Permissions required: Public Folders |
We strongly recommend creating a new role group in Exchange for managing Shinydocs software with the recommended permissions.
CrawlSharepoint: index UTC times are not correct for SharePoint
Description
SharePoint TimeZones are not transferable to system TimeZoneInfos.
CrawlSharePointOnline: SaveValue does not work for encrypting SharePoint Online secret key
Description
The SaveValue option was previously missing the call to SwapSavedParameters, which meant that the option did not work for encrypting the SharePoint Online secret key. SwapSavedParameters has been added and the operation now works as expected. Encrypted IDs can now be added to the saved-parameters.yaml file, and the CrawlSharePointOnline tool can be run with a source setting file containing encrypted keys.
CrawlSharePointOnlineSites: provide status update when crawling sharepoint online using site-index
Description
Previously, if you deleted a previously crawled site in SharePoint Online, it would still be in the index and CrawlSharePointOnlineSites would not remove it. Now, a new field called deleted with a value of true is created in the index when a SharePoint site has been deleted and you rerun the CrawlSharePointOnlineSites tool.
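The behaviour described above can be sketched in Python as a comparison between the indexed sites and the sites that still exist. This is only an illustration under assumed data shapes: the index is modelled as a dictionary keyed by site URL, and `mark_deleted_sites` and the example URLs are hypothetical.

```python
def mark_deleted_sites(index: dict, live_site_urls: set) -> list:
    """Set deleted=True on indexed sites that no longer exist; return their URLs."""
    newly_marked = []
    for url, doc in index.items():
        if url not in live_site_urls and not doc.get("deleted"):
            doc["deleted"] = True  # the site stays in the index, flagged as deleted
            newly_marked.append(url)
    return newly_marked


# Hypothetical index contents from an earlier crawl.
index = {
    "https://example.sharepoint.com/sites/hr": {"title": "HR"},
    "https://example.sharepoint.com/sites/old": {"title": "Old project"},
}
live = {"https://example.sharepoint.com/sites/hr"}
marked = mark_deleted_sites(index, live)
```

After the call, only the entry for the site that no longer exists carries `deleted` set to true; the entry itself is not removed from the index.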
CacheFileSystemPermissions: only processing 100 items and getting stack trace
Description
Previously, when running the CacheFileSystemPermissions operation, the progress bar stopped and an error message was produced after only 100 items were processed. The operation now performs as expected.
ExtractAndCrawlPst: items duplicated when rerun on a pst that was already extracted
Description
When adding a pst to an existing index containing a pst already extracted, the items in the first pst file were duplicated both on the fileshare and in the index. To prevent this duplication, a new option has been added to the ExtractAndCrawlPst operation: --create-duplicates. This command line parameter controls whether duplicates are created when the ExtractAndCrawlPst operation is rerun on a pst file. By default, --create-duplicates is set to false, which means that if an existing msg file is found, it is skipped.
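The skip rule can be sketched in Python as follows. This is a simplified model, not the tool's code: `items_to_write` is a hypothetical helper, msg files are represented by name only, and how duplicates are named when they are allowed is not modelled.

```python
def items_to_write(item_names, existing_msg_names, create_duplicates=False):
    """Return the msg names that a rerun would write."""
    if create_duplicates:
        # Duplicates allowed: every item is written again.
        return list(item_names)
    existing = set(existing_msg_names)
    # Duplicates disallowed (the default): skip items whose msg already exists.
    return [name for name in item_names if name not in existing]


# Hypothetical rerun: two of three items were already extracted.
already_extracted = ["a.msg", "b.msg"]
rerun_items = ["a.msg", "b.msg", "c.msg"]
default_result = items_to_write(rerun_items, already_extracted)
```

With the default setting, only the item that has not yet been extracted is written.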
RemoveField: setting threads to more than 2 will bypass the provided query and remove all instances of the field in the specified index
Description
Previously, setting the threads to more than 2 removed all instances of the field from everything in the index, rather than removing the field only from the items matched by the query. This has been fixed, and RemoveField now removes the field only from the items matched by the query.
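The intended, query-scoped behaviour can be sketched in Python. This is an in-memory illustration only: `remove_field`, the predicate standing in for the query, and the document fields are all hypothetical.

```python
def remove_field(docs, field, matches_query):
    """Remove `field` only from docs for which matches_query(doc) is true."""
    removed = 0
    for doc in docs:
        if matches_query(doc) and field in doc:
            del doc[field]
            removed += 1
    return removed


# Hypothetical index contents: only the pdf item matches the query.
docs = [
    {"name": "a.pdf", "extension": "pdf", "tag": "x"},
    {"name": "b.docx", "extension": "docx", "tag": "x"},
]
count = remove_field(docs, "tag", lambda d: d["extension"] == "pdf")
```

Items outside the query keep the field, regardless of how the work is divided across threads.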
RestorePermissions: updated error message when permissions cannot be restored
Description
If a document does not have a cached-permissions field (for example, the CacheFileSystemPermissions command was not run prior to running RestoreCachedFileSystemPermissions), the permissions for the file will not be reset. This is the intended behaviour; however, the status message in the console previously suggested that the document was processed successfully. This contradicted the log files, which recorded an error for the document.
The console now displays the correct message when trying to restore the permissions for a file without cached-permissions.
SetFileSystemPermission: allow Security Groups with --identity parameter
Description
Running the SetFileSystemPermissions operation with a security group was previously returning an error. Now, the --identity option can be used with security groups for ease of administration. For example, if there is a change of user permissions related to legal hold documents, user accounts can be updated in the security group instead of running the command each time.
TagDuplicate: change the status of documents that are no longer duplicates
Description
When using the TagDuplicate operation, the duplicate status of items that are no longer duplicates can be removed using the --overwrite option.
How to enable
To remove the duplicate value, run RemoveField, and then include the --overwrite option when running the TagDuplicate command. This process changes the value of the remaining file to reflect that it is no longer a duplicate.