Cognitive Toolkit 2.10.0 (06/23)

Information on feature enhancements and fixed issues that are part of the Cognitive Toolkit 2.10.0 (June 2023) release.

Features

CrawlExchange - add path to the index to create better visualizations

The paths of Exchange email messages and attachments are now added as fields in the index to improve classification and help create better visualizations.

CrawlExchange - exclude certain folders to have better search results

To improve Discovery Search results, an option has been added to the CrawlExchange operation to filter drafts, deleted items, and junk email.

To exclude any or all three of these categories from Discovery Search results, use the --ignore-folders option when running the CrawlExchange operation. To learn more, see CrawlExchange.

CrawlExchange - normalize Exchange ID to standard

To improve integration with Shinydocs Search, Exchange Identifier (ID) has been normalized for both emails and attachments, making the ID easier to read and more URL-safe. The Exchange ID is stored in the index as exchangeId.

Note: For customers who crawled Exchange using an earlier version of Cognitive Toolkit, before the ID was normalized in this release (2.10), a run script is available to normalize the IDs before upgrading to 2.10. This run script will also change the field name body to fulltext. For more information, contact Shinydocs Support.
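
These notes do not describe the normalization scheme itself. As a rough illustration of what "URL-safe" can mean for an identifier, the sketch below assumes the raw ID is a standard Base64 string and maps it to the URL-safe Base64 alphabet. The function name and the sample ID are hypothetical and are not taken from Cognitive Toolkit.

```python
def to_url_safe_id(raw_id: str) -> str:
    """Map standard Base64 characters to the URL-safe alphabet and drop '=' padding.

    Assumption: the raw ID is standard Base64, which may contain '+', '/', and '='.
    """
    return raw_id.translate(str.maketrans("+/", "-_")).rstrip("=")


# Hypothetical sample ID, for illustration only.
print(to_url_safe_id("AAMk+ad/Zg=="))  # AAMk-ad_Zg
```

The characters +, /, and = all need escaping in a URL path or query string, which is why a URL-safe form is easier to pass between services.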

Increase the ignore_above property in our mapping for all path and parent fields in the index

The ignore_above property needs to be specifically set when putting mappings on the index. We've increased the ignore_above property to ensure that we can support customers who want to query for path/parent information with long file paths, for example during classification, when identifying behaviour, or slicing data.

This change will only be observed on newly created indices. On existing indices, use the workaround: Enable keyword queries for long paths (above 256 characters).
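
To show what the property controls, here is a minimal sketch that builds a mapping body for the path and parent fields and mimics the length check. It assumes the fields are mapped as keyword fields. The limit of 8191 is an assumed value for illustration only, since these notes do not state the new limit, and both helper functions are hypothetical.

```python
# Assumed limit for illustration; the value Cognitive Toolkit actually sets is not stated here.
ASSUMED_IGNORE_ABOVE = 8191


def build_path_mapping(ignore_above: int = ASSUMED_IGNORE_ABOVE) -> dict:
    """Return a mapping body that sets ignore_above on the path and parent keyword fields."""
    keyword = {"type": "keyword", "ignore_above": ignore_above}
    return {"properties": {"path": dict(keyword), "parent": dict(keyword)}}


def is_indexed_as_keyword(value: str, ignore_above: int) -> bool:
    """Strings longer than ignore_above characters are skipped by a keyword field."""
    return len(value) <= ignore_above


long_path = "\\\\server\\share\\" + "a" * 300  # 315 characters
print(is_indexed_as_keyword(long_path, 256))                   # False
print(is_indexed_as_keyword(long_path, ASSUMED_IGNORE_ABOVE))  # True
```

A value that exceeds the limit is still kept in the stored document, but exact-match queries and aggregations on the keyword field will not find it. This is why long paths could not be queried at the 256-character limit.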

File Share > File Share Migration

File share to file share migration is now possible using the Migrate operation:

Items are copied with all the metadata and the index (for example, date, time, user, destination path)
Structure/taxonomy of the files is maintained
Permissions and ownership are not carried over
Created and modified dates are maintained during the migration

Note: When files are migrated into a folder that already contains the same files, the Migrate operation respects the --disable-over-write true flag and uses the --location-field and --name-field-name configuration settings.

AddPathValidation - keep Exchange Web Services index up-to-date / add path-valid to the message

To keep Exchange search results in Discovery Search up-to-date, a path validation feature has been added to indicate, in search results, when a message was deleted or moved.

For example:

If the message was permanently deleted, it is marked “path-valid : false”
If the message was moved to a deleted items folder, it is marked “path-valid : false”

Note: If an Exchange item is moved to a different folder within Exchange, its ID is changed and another CrawlExchange operation must be performed to capture the item in the index again.

For example:

An item in folder A was removed from folder A. The index is up-to-date.
An item in folder A was moved to folder B. The path validation removes it from the index since the item is no longer in folder A. However, a new crawl must be performed to have a complete picture and identify that the item was added to folder B, since it has a new ID.
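
The second case can be sketched as follows. This is a simplified model, not the AddPathValidation implementation: it assumes validation marks an indexed item path-valid : false when its ID no longer exists in Exchange, and all IDs and the `validate_paths` helper are hypothetical.

```python
def validate_paths(index_docs: list[dict], live_ids: set[str]) -> None:
    """Set path-valid on each indexed document based on whether its ID still exists in the source."""
    for doc in index_docs:
        doc["path-valid"] = doc["exchangeId"] in live_ids


index_docs = [
    {"exchangeId": "id-1", "folder": "A"},
    {"exchangeId": "id-2", "folder": "A"},
]
# id-2 was moved to folder B, so Exchange now exposes it under a new ID (id-3).
live_ids = {"id-1", "id-3"}

validate_paths(index_docs, live_ids)
missing_from_index = live_ids - {doc["exchangeId"] for doc in index_docs}

print([doc["path-valid"] for doc in index_docs])  # [True, False]
print(missing_from_index)                         # {'id-3'}
```

Validation alone can only flag the old ID. The item under its new ID stays out of the index until the next CrawlExchange run.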

Integrate search library into Cognitive

To make the transition to OpenSearch as easy as possible, a parameter is added in the Cognitive Toolkit configuration file to set either Elasticsearch or OpenSearch. For more information, see Shinydocs is Migrating to OpenSearch.

The new keys in the configuration file are:

Fixed Issues

AddPathValidation - keep the Exchange Web Services index up-to-date / adding path-valid to the message

Previously, if an Exchange message was deleted or moved, its ID would change and the data would not be captured until a re-crawl was performed. This caused the Exchange Web Services index to be out-of-date, resulting in inaccurate Discovery Search results.

We’ve implemented path validation with this release; however, if an Exchange item is moved to a different folder within Exchange, its ID is changed and we cannot capture it until a re-crawl has been done. See AddPathValidation for more information.

We’re investigating whether we can optimize re-crawling, so that we can narrow the “full” crawl down to folder level and get the new ID of the moved message.

AddHashAndExtractedText: new field not created when mailbox is deleted

Previously, when performing AddHashAndExtractedText on an index that included items from a deleted Exchange mailbox, a field should have been added to the index to indicate the mailbox had been deleted, but was not. This issue is resolved. The new deleted field is added to the index with a value of true for emails and email attachments originating from a deleted mailbox.

Dispose - Update configuration file: dispose key for Exchange

Previously, when disposing of Exchange emails using the Dispose operation, the default action was to move the email to a deleted items folder (soft dispose). This default action is changed to immediate and permanent disposal (hard disposal).

To configure the default action of Dispose for Exchange, edit the value of the dispose key located within the Cognitive Toolkit configuration file for Exchange to either “hard” or “soft” and save your changes:

<!--Exchange-->
    <!-- value: hard (permanently delete) or soft (moved to deleted items)  -->
    <add key="ExchangeDeleteMode" value="soft" />

To learn more about hard vs soft disposal, read https://shinydocs.atlassian.net/wiki/spaces/DD/pages/2820112411/Disposal+of+Exchange+Emails.
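
To check which mode a configuration file currently sets, a small script along these lines may help. It is a sketch that assumes the key sits inside a standard .NET appSettings element; the surrounding XML in the sample and the `read_delete_mode` helper are illustrative and are not copied from the shipped configuration file.

```python
import xml.etree.ElementTree as ET

# Assumed layout: the key sits inside a standard .NET <appSettings> element.
SAMPLE_CONFIG = """
<configuration>
  <appSettings>
    <!-- value: hard (permanently delete) or soft (moved to deleted items) -->
    <add key="ExchangeDeleteMode" value="soft" />
  </appSettings>
</configuration>
"""


def read_delete_mode(config_xml: str) -> str:
    """Return the ExchangeDeleteMode value, rejecting anything other than hard or soft."""
    root = ET.fromstring(config_xml)
    for entry in root.iter("add"):
        if entry.get("key") == "ExchangeDeleteMode":
            mode = entry.get("value", "").strip().lower()
            if mode not in {"hard", "soft"}:
                raise ValueError(f"Unexpected ExchangeDeleteMode: {mode!r}")
            return mode
    raise KeyError("ExchangeDeleteMode not found")


print(read_delete_mode(SAMPLE_CONFIG))  # soft
```

Because a hard dispose cannot be undone, rejecting unexpected values before a run is safer than falling back to a default.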

FetchEmailMessage needs to be hardened against credentials expiring mid-extract

When running AddHashAndExtractedText on an index that contains content from ExchangeOnline, there is a risk that the session will expire partway through the extraction process, causing the remaining files to not be extracted. The FetchEmailMessage code is upgraded to handle this situation gracefully, so that the AddHashAndExtractedText operation picks up where it left off if it suddenly stops partway through the process.
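
The general pattern of resuming after an expired session can be sketched as below. This is not the FetchEmailMessage code. `SessionExpired`, `extract_all`, and the fake fetch function are hypothetical names used only to show the idea of renewing the session and retrying the same item rather than abandoning the remaining items.

```python
class SessionExpired(Exception):
    """Hypothetical error raised when the Exchange Online session is no longer valid."""


def extract_all(item_ids, fetch, renew_session, max_renewals=3):
    """Fetch every item, renewing the session and retrying the same item on expiry."""
    results, position, renewals = {}, 0, 0
    while position < len(item_ids):
        try:
            results[item_ids[position]] = fetch(item_ids[position])
            position += 1  # advance only after a successful fetch
        except SessionExpired:
            if renewals == max_renewals:
                raise  # give up rather than loop forever
            renewals += 1
            renew_session()  # then resume from the same position
    return results


expired_once = {"done": False}


def fake_fetch(item_id):
    # Simulate the session expiring the first time item "b" is requested.
    if item_id == "b" and not expired_once["done"]:
        expired_once["done"] = True
        raise SessionExpired()
    return f"text of {item_id}"


print(extract_all(["a", "b", "c"], fake_fetch, renew_session=lambda: None))
# {'a': 'text of a', 'b': 'text of b', 'c': 'text of c'}
```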

AddHashAndExtractedText - unexplained Errors hashing files in Exchange

Previously, unexplained errors were causing some confusion when using AddHashAndExtractedText with Exchange. To improve handling of these intermittent issues and make it easier to discover the root cause of the error messages, we’ve:

Improved tagging in the index and logging
Decreased the visibility of error messages that are not helpful

Query tools are unable to use wildcards for index names

The ability to use an asterisk to aggregate indices was not operating as expected. This has been fixed.
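
For readers unfamiliar with index wildcards, the sketch below shows how an asterisk pattern selects several indices at once. It uses Python's fnmatch module purely as an illustration. The index names are hypothetical, and the real matching is performed by the search engine, not by this code.

```python
from fnmatch import fnmatchcase


def expand_index_pattern(pattern: str, index_names: list[str]) -> list[str]:
    """Return the index names matched by a wildcard pattern, in sorted order."""
    return sorted(name for name in index_names if fnmatchcase(name, pattern))


# Hypothetical index names.
indices = ["exchange-2022", "exchange-2023", "fileshare-2023"]
print(expand_index_pattern("exchange-*", indices))  # ['exchange-2022', 'exchange-2023']
```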

ExtractAndCrawlPST - error message running command

Previously, when running ExtractAndCrawlPST, a missing or invalid JWT was causing an error message. This has been fixed.

ExtractEntities - error with extraction service

Previously, ExtractEntities was returning the following error: There was an error contacting the extraction service. ExtractEntities is now performing as expected.

CrawlExchange - Help Menu: update descriptions

The Help Menu for the CrawlExchange operation has been updated to include the following options:

--ignored-attachment-extensions (optional)
--ignore-folders (optional)
--email (default: All)
--exclude-auto-replies (default: False)

CrawlSharePointOnPrem - show message from "GET" url when there is an error

To improve handling and understanding of error messages received during a CrawlSharePointOnPrem operation, the GET URL message is now displayed in the log files.

AddHashAndExtractedText - do not create empty or null fields when extracting message files

Previously, when the AddHashAndExtractedText operation extracted metadata from an email message, such as the subject, to, and from fields, a field was created in the index even if the message had no value for that field. Fields are no longer created for values that do not exist within email messages.
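
The behaviour can be pictured with the following sketch, which filters out metadata entries that carry no value before a document is indexed. The helper names and sample values are hypothetical, and the operation's real rules for "empty" may differ.

```python
def has_value(value) -> bool:
    """Treat None, blank strings, and empty lists or tuples as missing."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    if isinstance(value, (list, tuple)):
        return len(value) > 0
    return True


def drop_empty_fields(metadata: dict) -> dict:
    """Keep only the fields that actually carry a value."""
    return {key: value for key, value in metadata.items() if has_value(value)}


message = {"subject": "Q2 report", "to": [], "from": "admin@example.com", "cc": None, "bcc": ""}
print(drop_empty_fields(message))  # {'subject': 'Q2 report', 'from': 'admin@example.com'}
```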

ExportFromIndex - exports blanks for any field that contains array data

Previously, after using RunScripts to tag data and then trying to export the data, all the fields defined as an array came out blank. Exported fields are now correctly populated for both arrays and strings.
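
One way to picture the corrected behaviour is a CSV export that joins array values into a single cell instead of leaving the cell blank. The sketch below is illustrative only: the separator, field names, and `export_rows` helper are assumptions, and the actual output format of ExportFromIndex may differ.

```python
import csv
import io


def export_rows(docs, fields, separator="; "):
    """Write documents to CSV text, joining array values into one cell."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    for doc in docs:
        row = []
        for field in fields:
            value = doc.get(field, "")
            if isinstance(value, (list, tuple)):
                value = separator.join(str(item) for item in value)
            row.append(value)
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical documents: one tagged with an array, one with a plain string.
docs = [{"name": "a.docx", "tags": ["hr", "pii"]}, {"name": "b.msg", "tags": "legal"}]
print(export_rows(docs, ["name", "tags"]))
```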

Visualizer Service: no details under Description

Previously, the Visualizer had no Description information populated in Windows Services. The following Description is now displayed: Enables users to generate visual representations of indexed data.

Indexer Service: no details under Description

Previously, the Analytics Engine (Indexer) had no Description information populated in Windows Services. The following Description is now displayed: Retrieves data and metadata from your files and turns them into an index pattern that can be later used to visualize your data.

AddHashAndExtractedText (text extraction on msg file): logs filling up with warning message for fromAddress on every item

Previously, when running AddHashAndExtractedText on message files, the fromAddress field was populated in the index, but a warning was written to the log file for every item. The warnings occurred because a local email server was in use, so user names (for example, Administrator) rather than fully qualified email addresses were used to send and receive email messages. The logging level for the warning has been reduced to address this situation.

ExtractAndCrawlPST - command line not ending when adding to index

Previously, when running the ExtractAndCrawlPST operation to add new PST files to an index on which ExtractAndCrawlPST had already been performed, the command line would hang and not complete. This issue is fixed and the operation now performs as expected.

CrawlExchange - display the body of the message in the full text field instead of the body field

When using the CrawlExchange operation, the body of an email message was previously displayed in the body field of the index.

Now, the body of the message is displayed in the full text field, so that the text and the content in the message is stored consistently in the same field for all the connectors.

Steps to take if Exchange has been crawled before the release of Cognitive Toolkit 2.10:

Remove the body field in the index and re-crawl.

CrawlExchange - if message contains in-line attachment, hasAttachments flag should be set to true

Exchange allows emails to include in-line attachments. When the CrawlExchange operation is run and in-line attachments are detected in an email, the field hasAttachments is set to true.

Cognitive Toolkit - Update to .NET 7

Cognitive Toolkit has been updated from .NET 6 to .NET 7. .NET is an open-source, managed computer software framework for Windows, Linux, and macOS operating systems.

FileSystem - Smart text extraction

Previously, when a folder was renamed, AddHashAndExtractedText treated every file with a changed path as a new file, requiring text extraction to be repeated.

Now, each hash is verified to ensure the file has not changed, even though the path name changed. If a file is found to be the same, text extraction is not performed on that file.
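
The idea can be sketched as a lookup keyed by content hash. This is a minimal illustration, not the AddHashAndExtractedText implementation: SHA-256 is an assumed hash algorithm, and the helper names, paths, and fake extractor are hypothetical.

```python
import hashlib


def sha256_of(content: bytes) -> str:
    """Hash the file content; the path plays no part in the hash."""
    return hashlib.sha256(content).hexdigest()


def extract_if_needed(path, content, text_by_hash, extract):
    """Reuse stored text when the content hash is already known; otherwise extract."""
    digest = sha256_of(content)
    if digest not in text_by_hash:
        text_by_hash[digest] = extract(content)
    return {"path": path, "hash": digest, "fulltext": text_by_hash[digest]}


calls = []


def fake_extract(content):
    # Stand-in for an expensive text extraction step.
    calls.append(content)
    return content.decode("utf-8").upper()


cache = {}
extract_if_needed("/old/report.txt", b"hello", cache, fake_extract)
doc = extract_if_needed("/renamed/report.txt", b"hello", cache, fake_extract)
print(doc["path"], len(calls))  # /renamed/report.txt 1
```

The renamed path produces a new document entry, but extraction runs only once because the content hash is unchanged.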

Images within Exchange emails are not being identified

Previously, images in the body of emails were not identified.

Now, in-line attachments in email are detected and the index field hasAttachments is set to true.

AddHashAndExtractedText - issues with Content Server documents

There were two issues with Content Server documents that have both been resolved:

Documents with no text, such as image attachments, were logged as errors during text extraction. Now, they're properly returning empty text.
Emails without metadata were logged as errors and all extracted text was discarded. Now, the extracted text is retained and no metadata shows up in the index.

CrawlExchange - handling long reply bodies

To view only the most relevant information from an email in Discovery Search, set the --max-characters option in CrawlExchange to display a predetermined number of characters, rather than the entire body of email replies. This option now performs as expected for Exchange Online, and Discovery Search results display only the allowed number of characters.

CrawlSharePoint - Path-valid=false is not reset when file is present and crawled again

Previously, files that had been flagged as path-valid=false (after being moved to a different folder) were not reset when a re-crawl was performed and the file was found again. AddPathValidation now restores the value of the field to: path-valid = true upon completion of a successful crawl.

Known Issues

Crawling all the emails and attachments from public folders

The CrawlExchange operation crawls all emails and attachments in public folders. Posts are not crawled at this time.