Setting up the shinydocs-jobs index

Compatible with Cognitive Toolkit 2.7.0 and higher.

You will need to configure each Cognitive Toolkit installation to use the shinydocs-jobs index. See below for instructions.

What is the shinydocs-jobs index?

An index called shinydocs-jobs can be created and used to log data by enabling the --use-shinydocs-jobs option during the following Cognitive Toolkit operations:

AddClassifications
AddExtractedTextFromEngineeringDrawings
AddFromSqlDatabase
AddHashAndExtractedText
AddPathValidation
AddPropertyData
CacheFileSystemPermissions
CopyItems
Dispose
ExportFromIndex
ExtractAndCrawlPst
ExtractEntities
FindSimilarClassification
Migrate
RemoveField
RemoveItems
RestoreCachedFileSystemPermissions
SetFileSystemPermissions
TagDuplicate
UpdateProperties

Enabling the --use-shinydocs-jobs option prompts the Cognitive Toolkit to send logging data to the shinydocs-jobs index during an operation. This will help you track current crawl activities, gather metrics and visualize errors, such as access and text extraction errors.

Visualizations & Dashboard

See: Shinydocs Jobs Visualizations & Dashboard

Key fields in the shinydocs-jobs index

Once this feature is enabled, there are key fields, and their purposes, that you should be aware of. There are two types of entries in this index:

Crawl operations and metrics

Contains information about crawl operations, including metrics from completed crawl operations. More fields are available; these are some of the key fields to know about:

Index Field	Use
commandArguments	This field is only present on crawl operations as `commandArguments: <arguments used with CognitiveToolkit.exe>`
startTime	The date and time that the process started
endTime	The date and time that the process ended. If this field is not present on an entry, the process is either not complete or was terminated unexpectedly
duration	Crawl operation duration in seconds. If this field is not present on an entry, the process is either not complete or was terminated unexpectedly
files-per-second	The rate at which the crawl operation was able to crawl or process files. If this field is not present on an entry, the process is either not complete or was terminated unexpectedly
file-quantity	The number of files crawled or processed. If this field is not present on an entry, the process is either not complete or was terminated unexpectedly
toolType	The name of the tool used in Cognitive Toolkit
machine	The name of the computer running the Cognitive Toolkit
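
Because endTime is absent on entries that are still running or were terminated unexpectedly, a query can pick those entries out. The following is a minimal sketch, assuming the index accepts standard Elasticsearch query DSL (as the delete-by-query request later in this article suggests); the function name is hypothetical.

```python
import json


def incomplete_crawl_query():
    """Build a query body for crawl entries that have no endTime."""
    return {
        "query": {
            "bool": {
                # commandArguments is only present on crawl operation entries
                "filter": [{"exists": {"field": "commandArguments"}}],
                # a missing endTime means the run is unfinished or was terminated
                "must_not": [{"exists": {"field": "endTime"}}],
            }
        }
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into a search request
    print(json.dumps(incomplete_crawl_query(), indent=2))
```

Note that a crawl that is still running matches this query as well, so check that no operations are in progress before treating the results as abandoned tasks.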

Errors

Contains errors encountered while crawling, hashing, extracting text, and so on. More fields are available; these are some of the key fields to know about:

Index Field	Use
type	This field is only present on errors as `type: error`
toolType	The name of the tool used in Cognitive Toolkit
exception	If there is an exception raised during the crawl operation, the exception will be noted here. Not all errors have an `exception`
message	The error raised during the crawl operation. All errors have a `message`
time	The date and time the error occurred
machine	The name of the computer running the Cognitive Toolkit
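
As an illustration of how these fields can be used together, the sketch below counts error entries per toolType. It assumes the entries have already been retrieved from the index as plain dictionaries; the sample documents and the function name are made up for the example.

```python
from collections import Counter


def count_errors_by_tool(entries):
    """Count entries marked `type: error`, grouped by toolType."""
    errors = [e for e in entries if e.get("type") == "error"]
    return Counter(e.get("toolType", "unknown") for e in errors)


# Hypothetical sample documents shaped like entries in the index
sample = [
    {"type": "error", "toolType": "AddHashAndExtractedText",
     "message": "Access denied"},
    {"type": "error", "toolType": "AddHashAndExtractedText",
     "message": "Extraction failed", "exception": "IOException"},
    # A crawl operation entry: no `type` field, so it is not counted
    {"toolType": "CopyItems", "commandArguments": "<arguments>"},
]

if __name__ == "__main__":
    for tool, count in count_errors_by_tool(sample).items():
        print(tool, count)
```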

Configure Cognitive Toolkit to log to the index

With logging to the index enabled, the Cognitive Toolkit will attempt to write log entries to the shinydocs-jobs index in addition to the log file. This can be helpful in visualizing current crawl activities, gathering metrics, and visualizing errors (access errors, errors in text extraction, etc.). If the connection between the Cognitive Toolkit and the index breaks, or the Cognitive Toolkit is terminated unexpectedly, it will be unable to log to the index.

This must be configured before performing any crawl operations. Processes that ran earlier, or that are already running when the setting is changed, will not be logged to the index (but will still be logged to the log file).

In the Cognitive Toolkit directory, which contains CognitiveToolkit.exe, open CognitiveToolkit.exe.config in a notepad-like editor
On line 20, uncomment <add key="JobIndexUrl" value="http://localhost:9200" /> by removing the leading <!-- and the trailing -->
Set the value to the URL and port of your Analytics Engine (index) endpoint, replacing http://localhost:9200
Save the file
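
To confirm that the setting is active after editing, the file can be parsed and checked. This is a rough sketch, assuming the JobIndexUrl line is disabled with a standard XML comment and that the file follows the usual .NET layout of <add> elements inside the configuration; the sample snippets and the function name are illustrative, not copied from the real file.

```python
import xml.etree.ElementTree as ET


def active_job_index_url(config_xml):
    """Return the JobIndexUrl value, or None if the key is absent or commented out."""
    root = ET.fromstring(config_xml)  # the default parser skips XML comments
    for node in root.iter("add"):
        if node.get("key") == "JobIndexUrl":
            return node.get("value")
    return None


# Assumed shape of the file before and after uncommenting the setting
COMMENTED = """<configuration><appSettings>
  <!-- <add key="JobIndexUrl" value="http://localhost:9200" /> -->
</appSettings></configuration>"""

ENABLED = """<configuration><appSettings>
  <add key="JobIndexUrl" value="http://localhost:9200" />
</appSettings></configuration>"""

if __name__ == "__main__":
    print(active_job_index_url(COMMENTED))
    print(active_job_index_url(ENABLED))
```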

Set up the index pattern

Once your first crawl has started, a new index will be created called shinydocs-jobs. To set up the index pattern in the Visualizer:

In the Visualizer, click Management
Select Index Patterns
Select Create index pattern
In the field Index pattern, enter shinydocs-jobs
Select Next
In the Time Filter field name drop-down, select “I don’t want to use the Time Filter”
Select Create index pattern
Wait for the index pattern to be created (less than 1 minute)
Once created, you will see a list of fields in your index pattern
In the Filter search bar, enter duration, and select the pencil (edit) button
In the Format drop-down, select Duration
The Input format should be Seconds and the Output format should be Human Readable
Select Save field. The index pattern is now set up.

Removing tasks that did not complete

Occasionally, there will be tasks that do not complete fully and therefore remain in the index incomplete. This can be caused by:

System shutting down while process is running
Application crash
Terminating the process before completing
Connection to the index breaks

To remove these items from the index you will need to use the Cognitive Toolkit or the Visualizer’s Dev Tools section. You will need the Task IDs (indexed as _id) of the entries to remove them (example ID: 5f0d242f-eafe-487d-982c-9250bbed4ecf). Either method will require a query to tell the index what to delete. You can use this query for either method:

Remove ID query

Replace <_id_here> with the ID(s) you want to remove

{
  "query": {
    "ids": {
      "values": [
        "<_id_here>"
      ]
    }
  }
}

Remove_ID_query.json
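
When several task IDs need to be removed, the query file can be generated instead of edited by hand. A minimal sketch follows; the function names are hypothetical, and the output has the same structure as the Remove ID query above.

```python
import json


def build_remove_ids_query(ids):
    """Return the Remove ID query body for one or more task IDs."""
    ids = list(ids)
    if not ids:
        raise ValueError("at least one ID is required")
    return {"query": {"ids": {"values": ids}}}


def write_query_file(ids, path="Remove_ID_query.json"):
    """Write the query to a JSON file and return the path that was written."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_remove_ids_query(ids), handle, indent=2)
    return path


if __name__ == "__main__":
    # Example ID taken from this article
    write_query_file(["5f0d242f-eafe-487d-982c-9250bbed4ecf"])
```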

Cognitive Toolkit Method

Replace http://<index_url:port> with your index URL and port, and <path_to_Remove_ID_query.json> with the path to the Remove ID query file

CognitiveToolkit.exe RemoveItems --index-name shinydocs-jobs --index-server-url http://<index_url:port> --query <path_to_Remove_ID_query.json>

After the Cognitive Toolkit has gathered the items to remove from the index, you will be prompted in the terminal to confirm the removal of these items. If you do not want to be prompted, add the --force argument to the command.
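
If the removal is scripted, the command can be assembled as an argument list. The sketch below only builds the arguments shown in this article (RemoveItems, --index-name, --index-server-url, --query, --force); the function name and the example URL and path are placeholders.

```python
def build_remove_items_command(index_url, query_path, force=False,
                               exe="CognitiveToolkit.exe"):
    """Assemble the RemoveItems command for the shinydocs-jobs index."""
    command = [
        exe, "RemoveItems",
        "--index-name", "shinydocs-jobs",
        "--index-server-url", index_url,
        "--query", query_path,
    ]
    if force:
        # --force skips the confirmation prompt in the terminal
        command.append("--force")
    return command


if __name__ == "__main__":
    # Placeholder URL and path; substitute your own values
    print(" ".join(build_remove_items_command(
        "http://localhost:9200", "Remove_ID_query.json", force=True)))
```

The resulting list can be passed to subprocess.run from the Cognitive Toolkit directory. Leaving force at its default keeps the confirmation prompt, which is the safer choice for one-off cleanups.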

Visualizer Method

Navigate to Dev Tools in the left sidebar menu.

Replace <_id_here> with the ID(s) you want to remove. Then run the command:

POST shinydocs-jobs/_delete_by_query
{
  "query": {
    "ids": {
      "values": [
        "<_id_here>"
      ]
    }
  }
}

The output after completion will look like this:

{
  "took" : 81,
  "timed_out" : false,
  "total" : 1,
  "deleted" : 1,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

The "deleted" value reflects the number of items removed from the index.