This guide explains the operations available within Cognitive Toolkit and how to use them.
Cognitive Toolkit contains 35 operations to help you understand and manage your data. These operations are runscript-based commands which can be run using a minimum set of required parameters or configured for more complex applications.
Run a command
Every operation starts with the same steps:
Type cmd in the Windows search box or charms bar, and then click Run as administrator. Alternatively, open a CMD prompt using a Service Account with the appropriate permissions.
Change directory (cd) to your Cognitive Toolkit root folder (i.e., the location of the extracted shinydocs-cognitive-toolkit-[version]-[date].zip file).
At the CognitiveToolkit.exe prompt, enter the command for the operation you wish to run, either with the minimum required parameters or edited as required.
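For scripted runs, the steps above can be sketched in Python. This is a minimal sketch and not part of the Cognitive Toolkit: the helper names `build_argv` and `run_operation` are hypothetical, and only the executable name comes from this guide.

```python
import subprocess
from pathlib import PureWindowsPath


def build_argv(toolkit_root, operation, options):
    """Return the argument list for one Cognitive Toolkit operation.

    options maps an option name to its value; use None for switches
    that take no value (for example --dry-run).
    """
    exe = PureWindowsPath(toolkit_root) / "CognitiveToolkit.exe"
    argv = [str(exe), operation]
    for name, value in options.items():
        argv.append(name)
        if value is not None:
            argv.append(str(value))
    return argv


def run_operation(toolkit_root, operation, options):
    """Run the operation from the toolkit root folder and return its exit code."""
    argv = build_argv(toolkit_root, operation, options)
    # cwd mirrors the "change directory to the root folder" step.
    completed = subprocess.run(argv, cwd=toolkit_root, check=False)
    return completed.returncode
```

Passing the arguments as a list lets Python's subprocess module quote values that contain spaces when it builds the Windows command line. `run_operation` is only meaningful on a machine where the toolkit is installed at the folder you pass in.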
Edit a command
To build a command you must provide, at a minimum, the required parameters identified in the Options Table for that command.
The command can be adapted to your specific requirements and environment by editing the command:
Add optional parameters as indicated in the command’s Options Table.
Option values are variables based on your environment, as well as the setup/location of source files.
Values should be surrounded by straight double quotation marks ("") within the command.
For example:
DO: --query "C:\query-match-path-no-classification.json"
DO NOT: --query C:\query-match-path-no-classification.json
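The quoting rule above can be applied mechanically when commands are generated by a script. The following is a minimal sketch under that assumption; `format_command` is a hypothetical helper, not a Cognitive Toolkit feature.

```python
def format_command(operation, options):
    """Build a command line in which every option value is double-quoted.

    options maps an option name to its value; None marks a switch
    that takes no value.
    """
    parts = ["CognitiveToolkit.exe", operation]
    for name, value in options.items():
        if value is None:
            parts.append(name)
            continue
        text = str(value)
        if '"' in text:
            # A value that already contains a quote would produce a broken command.
            raise ValueError(f"value for {name} contains a double quote: {text!r}")
        parts.append(f'{name} "{text}"')
    return " ".join(parts)


print(format_command(
    "AddPathValidation",
    {"--query": r"C:\query-match-path-no-classification.json", "--dry-run": None},
))
```

The sketch rejects values that already contain a double quote rather than guessing how they should be escaped.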
Tips
The first time you use Cognitive Toolkit, you are required to Activate your license.
Begin using Cognitive Toolkit by crawling an ECM and/or file system. This will build out an index in the Analytics Engine that can then be utilized to perform other operations.
Certain characters can negatively impact operation of the Cognitive Toolkit. Learn more about using special characters.
Source setting files provide the login credentials required to access a content source.
The following source setting files can be edited with your organization’s administrative login credential information:
Activate | Example |
---|
Activates the license. This must be performed before the Cognitive Toolkit can be used. | Command using minimum required inputs: CognitiveToolkit.exe activate -p "VALUE" |
Use the Activate Options Table to edit the command.
Activate Options Table
OPTION | VALUE | CONDITION |
---|
-p <VALUE> | Path to the license file | Required |
AddClassifications | Example |
---|
Records Content Server classification data in an index within the Analytics Engine. The fields created depend on the fields defined in Content Server. The CrawlContentServer operation must be performed before running AddClassifications. | Command using minimum required inputs: CognitiveToolkit.exe AddClassifications --query "C:\match_all.json" -u "VALUE" -i "VALUE" |
Use the AddClassifications Options Table to edit the command.
AddClassifications Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the data source (possible data sources: Box, Content Server, Documentum, Exchange, FileNet) | Optional (Other parameters will be ignored.) |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 1000 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
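Before launching a command from a script, it can help to confirm that every option marked Required in the table above is present. The check below is a minimal sketch; the `missing_required` helper is hypothetical, and the option names are copied from the AddClassifications Options Table.

```python
# Required options from the AddClassifications Options Table, as (short, long) pairs.
REQUIRED = [("-q", "--query"), ("-u", "--index-server-url"), ("-i", "--index-name")]


def missing_required(options, required=REQUIRED):
    """Return the long names of required options that have no value in options."""
    missing = []
    for short_name, long_name in required:
        # Either spelling of the option satisfies the requirement.
        if not (options.get(short_name) or options.get(long_name)):
            missing.append(long_name)
    return missing


options = {"--query": r"C:\match_all.json", "-u": "http://localhost:9200"}
print(missing_required(options))
```

The same check can be reused for other operations by passing a different `required` list built from that operation's Options Table.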
AddExtractedTextFromEngineeringDrawings | Example |
---|
Extracts full text from engineering drawings. | Command using minimum required inputs: CognitiveToolkit.exe AddExtractedTextFromEngineeringDrawings --query "C:\match_all.json" -u "VALUE" -i "VALUE" |
Use the AddExtractedTextFromEngineeringDrawings Options Table to edit your runscript command.
AddExtractedTextFromEngineeringDrawings Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the data source (Content Server, SharePoint, Box, etc.) | Optional *If not included, default: filesystem |
--path-to-micro-station <VALUE> | Fully qualified path to MicroStation | Optional *If not included, default: ‘C:\Program Files\Bentley\MicroStation CONNECT Edition\MicroStation\microstation.exe’ |
--is-v8i | MicroStation is V8i | Optional *If not included, default: false |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 1000 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
AddFromSqlDatabase | Example |
---|
Uses a SQL query to migrate data into an index within the Analytics Engine. | Command using minimum required inputs: CognitiveToolkit.exe AddFromSqlDatabase --database-type "oracle" --username "USERNAME" --password ****** --data-source 192.1.1.1:1521 --sql "C:\sqlquery.sql" --sql-parameters "id=tags" --column-prefix "sql" --query "C:\match_all.json" --index-server-url "http://localhost:9200" --index-name INDEXNAME |
Use the AddFromSqlDatabase Options Table to edit your runscript command.
AddFromSqlDatabase Options Table
OPTION | VALUE | CONDITION |
---|
--database-type <VALUE> | The type of database to connect to. Supported values: "oracle", "postgres", "sqlserver" | Required |
--username <VALUE> | The database username (login credentials) | Required |
--password <VALUE> | The database password (login credentials) | Required |
--data-source <VALUE> | The data source | Required |
--sql <VALUE> | The SQL to run or the path to a .sql file | Required |
--sql-parameters <VALUE> | A comma-separated list of keys and fields from the index to replace values in the SQL | Optional |
-c|--column-prefix <VALUE> | The prefix added to column names | Required |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 1000 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
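The --sql-parameters value in the example command ("id=tags") suggests a comma-separated list of key=field pairs. The sketch below parses a value of that shape so it can be checked before a run. The key=field format is an assumption inferred from that single example, `parse_sql_parameters` is a hypothetical helper, and how the toolkit substitutes the values into the SQL is not covered here.

```python
def parse_sql_parameters(value):
    """Split a 'key=field,key=field' string into a dict of key -> index field."""
    pairs = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, separator, field = item.partition("=")
        if not separator or not key.strip() or not field.strip():
            raise ValueError(f"expected key=field, got: {item!r}")
        pairs[key.strip()] = field.strip()
    return pairs


print(parse_sql_parameters("id=tags"))
```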
AddHashAndExtractedText | Example |
---|
Generates the hash value for each file specified while also extracting full text. Note: When a folder is renamed, changing the file path, AddHashAndExtractedText checks whether the file itself has changed. If the file is unchanged, text extraction is not performed on it. | Command using minimum required inputs: CognitiveToolkit.exe AddHashAndExtractedText --query "C:\match_all.json" -u "http://localhost:9200" -i INDEXNAME |
Use the AddHashAndExtractedText Options Table to edit your runscript command.
AddHashAndExtractedText Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the data source (Box, Content Server, Documentum, OneDrive, SharePoint) | Optional *If not included, default: leave blank for filesystem |
--algorithm <VALUE> | Algorithm to apply. Supported values: "sha1", "sha256", "sha512", "md5" | Optional *If not included, default: md5 |
--extraction-service-url <VALUE> | URL for the extraction service. --extraction-service-url must be included if the default location was changed during installation. | Optional *If not included, default: "http://localhost:55555" |
--max-characters <VALUE> | The maximum number of characters for the extracted text field | Optional *If not included, default: 0 [all characters] |
--debug-level <VALUE> | The level of depth of exception messages | Optional *If not included, default: 20 |
--action-keyword <VALUE> | By default, AddHashAndExtractedText both generates the hash and extracts text from the source. Include --action-keyword to perform only one of the two actions: --action-keyword "hash" applies the hash but does not extract full text; --action-keyword "text" extracts full text but does not apply the hash. | Optional *If not included, default: both |
--time-out <VALUE> | The timeout in seconds for each batch of nodes being processed | Optional *If not included, default: 60 seconds |
--ocr-utility <VALUE> | OCR utility to use for text extraction. Supported values: "iron", "azure", "none" | Optional *If not included, default: none |
--azure-subscription-key <VALUE> | Azure Computer Vision key, found on the Keys and Endpoint page for your Cognitive Services resource in the Azure Portal | Optional* *Required if ocr-utility is "azure" |
--azure-subscription-endpoint <VALUE> | Azure Computer Vision endpoint, found on the Keys and Endpoint page for your Cognitive Services resource in the Azure Portal | Optional* *Required if ocr-utility is "azure" |
--text-timeout <VALUE> | The number of seconds to wait before cancelling text extraction for an item (0 for unlimited). To ensure that the OCR process completes on each file, set the --text-timeout value in the exe.config file higher than the default of 60 seconds. | Optional *If not included, default: 60 seconds |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
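To illustrate what a hash value is and why a renamed folder does not change it, the sketch below hashes a file's contents with Python's standard hashlib module using the algorithm names from the table above. It is an illustration of the concept only, not the toolkit's implementation; `hash_file` is a hypothetical helper.

```python
import hashlib

# Algorithm names listed for --algorithm in the Options Table.
SUPPORTED_ALGORITHMS = ("sha1", "sha256", "sha512", "md5")


def hash_file(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file's contents, read in chunks."""
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm!r}")
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends only on the bytes in the file, two copies of the same file at different paths produce the same value, which is what makes it possible to recognize an unchanged file after its folder is renamed.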
AddPathValidation | Example |
---|
The AddPathValidation operation checks for changes to the index within the Analytics Engine. Detected changes result in a false value, and which changes are detected depends on the data source. File system: checks for files that have been moved, deleted, or renamed. Content Server: uses nodeID to check for files that have been deleted. SharePoint: checks for files that have been moved or deleted. Box: uses fileID to check for files that have been deleted. Documentum: uses nodeID to check for files that have been deleted. OneDrive: checks for files that have been moved or deleted. | Command using minimum required inputs: CognitiveToolkit.exe AddPathValidation --query "C:\match_all.json" -u "http://localhost:9200" -i INDEXNAME |
Use the AddPathValidation Options Table to edit your runscript command.
AddPathValidation Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the data source (Box, Content Server, Documentum, SharePoint, OneDrive) | Optional *If not included, default: leave blank for filesystem |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
AddPropertyData | Example |
---|
Pulls property data from an ECM and adds it to the index in the Analytics Engine. Content Server categories and attributes are currently supported. Category attributes: data is pulled via a direct database connector and/or REST API. Classification values: data is pulled via REST API. Records Management (RM) Classification values: data is pulled via REST API. | Command using minimum required inputs: CognitiveToolkit.exe AddPropertyData --query "C:\match_all.json" -u "http://localhost:9200" -i INDEXNAME |
Use the AddPropertyData Options Table to edit your runscript command.
AddPropertyData Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the data source (Content Server, SharePoint, Box, etc.) | Optional *If not included, default: leave blank for filesystem |
--legacy-naming | Use legacy naming (CS data only). If the --legacy-naming option is not used, the fields created in the index are prefixed with prop- | Optional *If not included, default: false |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
CacheFileSystemPermissions | Example |
---|
Caches the permissions on a file system item and creates the following fields within an index in the Analytics Engine: CachedPermsFieldName = "cached-permissions", CachedInheritanceFieldName = "cached-inheritance", CachedOwnerFieldName = "cached-owner". | Command using minimum required inputs: CognitiveToolkit.exe CacheFileSystemPermissions --query "C:\match_all.json" -u "http://localhost:9200" -i INDEXNAME |
Use the CacheFileSystemPermissions Options Table to edit your runscript command.
CacheFileSystemPermissions Options Table
OPTION | VALUE | CONDITION |
---|
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
CopyItems | Example |
---|
Copies an object from one index to another index within the Analytics Engine. | Command using minimum required inputs: CognitiveToolkit.exe CopyItems --destination-index-url "http://localhost:9200" --destination-index-name INDEXNAME_1 -u "http://localhost:9200" -i INDEXNAME_2 --query "C:\match_all.json" |
Use the CopyItems Options Table to edit your runscript command.
CopyItems Options Table
OPTION | VALUE | CONDITION |
---|
--destination-index-url <VALUE> | The destination index server | Required |
--destination-index-name <VALUE> | The destination index name | Required |
--destination-index-type <VALUE> | The destination index type | Optional |
--destination-index-shards <VALUE> | The destination index number of shards | Optional *If not included, default: 5 |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
CrawlBox | Example |
---|
Crawls for the metadata within Box and adds it to an index in the Analytics Engine. | Command using minimum required inputs: CognitiveToolkit.exe CrawlBox --source-settings "C:\SourceFiles\Box.json" --query "C:\box.json" -u "http://localhost:9200" -i INDEXNAME |
Use the CrawlBox Options Table to edit your runscript command.
CrawlBox Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the Box data source | Required |
--start-folder-id <VALUE> | Default Start Folder Id | Optional *If not included, default: 0 |
--users <VALUE> | Comma delimited list of user logins and folder ids in format <user_login>:<start_folder_id> | Optional *If not included, default: all users |
--include-shared-files | Include folders shared by other users | Optional *If not included, default: false |
--crawl-collaborators | Crawl and capture Box groups and their users | Optional *If not included, default: false |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
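The --users value in the table above is a comma-delimited list in the format <user_login>:<start_folder_id>. A minimal sketch of building that value from a mapping follows; `format_users` is a hypothetical helper and the logins shown are made up for illustration.

```python
def format_users(start_folders):
    """Join {user_login: start_folder_id} into 'login:id,login:id'."""
    entries = []
    for login, folder_id in start_folders.items():
        # Commas and colons are the delimiters, so they cannot appear in a login.
        if "," in login or ":" in login:
            raise ValueError(f"login contains a delimiter: {login!r}")
        entries.append(f"{login}:{folder_id}")
    return ",".join(entries)


print(format_users({"alice@example.com": 0, "bob@example.com": 12345}))
```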
CrawlContentServer | Example |
---|
Crawls for the metadata within Content Server and adds it to an index in the Analytics Engine. Use this tool to crawl the Content Server database directly or via REST API. Crawling the Content Server database directly leaves REST API available for other applications. | Command using minimum required inputs: CognitiveToolkit.exe CrawlContentServer --source-settings "C:\SourceFiles\ContentServer.json" --query "C:\content_server.json" -u "http://localhost:9200" -i INDEXNAME |
Use the CrawlContentServer Options Table to edit your runscript command.
CrawlContentServer Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the Content Server database | Optional *If not included, default: crawl Content Server via the REST API |
--starting-folder-id <VALUE> | The ID of the folder from which to begin traversing | Optional *If not included, default: 0 |
--allowed-types <VALUE> | A comma-delimited list of content type IDs you wish to crawl | Optional *If not included, default: 1,144,736,749 |
--modified-after <VALUE> | Items changed on/after this date. Note: To use the --modified-after option with Content Server, the --modified-after date has to be within 30 days of the date you run the tool. Example: if today’s date is 2023-03-28, a modified-after date of 2023-03-01 crawls items changed after March 1st, while a modified-after date of 2023-01-25 fails with the error: date_exceeds allowed number of days. Supported date formats are: yyyy-MM-dd, yyyy-MM-dd HH:mm, yyyy-MM-ddTHH:mm. Example: 2018-12-20 or 2018-12-20 19:42. Supported relative date formats are: now, now+/-1d[/d], now+/-1m[/d], now+/-1y[/d] | Optional |
--delta | Use the audit tables to detect changes based on the --modified-after option with a default of today’s date. | Optional *If included, crawl is performed via REST |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 1000 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
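The --modified-after row above lists three absolute date formats and a 30-day limit for Content Server. The sketch below checks a value against both before a crawl is started. It is a rough pre-check only: the helper names are hypothetical, relative formats such as now-1d are not handled, and treating exactly 30 days as allowed is an assumption.

```python
from datetime import datetime, timedelta

# yyyy-MM-dd, yyyy-MM-dd HH:mm, yyyy-MM-ddTHH:mm expressed as strptime patterns.
ABSOLUTE_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M", "%Y-%m-%dT%H:%M")


def parse_absolute_date(text):
    """Parse one of the supported absolute date formats into a datetime."""
    for pattern in ABSOLUTE_FORMATS:
        try:
            return datetime.strptime(text, pattern)
        except ValueError:
            continue
    raise ValueError(f"unsupported date format: {text!r}")


def within_content_server_window(modified_after, today, max_days=30):
    """Return True if modified_after is no more than max_days before today."""
    return today - parse_absolute_date(modified_after) <= timedelta(days=max_days)


today = datetime(2023, 3, 28)
print(within_content_server_window("2023-03-01", today))
print(within_content_server_window("2023-01-25", today))
```

With the dates from the example in the table, the first check passes (27 days) and the second does not (62 days).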
CrawlContentServerWorkflows | Example |
---|
Crawls for the metadata within Content Server workflows and adds it to an index in the Analytics Engine. Use this tool to crawl the Content Server database directly or via REST API. Crawling the Content Server database directly leaves REST API available for other applications. | Command using minimum required inputs: CognitiveToolkit.exe CrawlContentServerWorkflows --source-settings "C:\SourceFiles\ContentServer.json" --query "C:\content_server.json" -u "http://localhost:9200" -i INDEXNAME |
Use the CrawlContentServerWorkflows Options Table to edit your runscript command.
CrawlContentServerWorkflows Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the Content Server database | Optional *If not included, default: crawl Content Server via the REST API |
--process-status <VALUE> | Process archived status. Supported values: "archived", "noarchive" | Optional |
--initiated-after <VALUE> | Items changed on/after this date. Supported date formats are: yyyy-MM-dd, yyyy-MM-dd HH:mm, yyyy-MM-ddTHH:mm. Example: 2018-12-20 or 2018-12-20 19:42. Supported relative date formats are: now, now+/-1d[/d], now+/-1m[/d], now+/-1y[/d] | Optional |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 1000 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
CrawlDocumentum | Example |
---|
Crawls for the metadata within Documentum and adds it to an index in the Analytics Engine. | Command using minimum required inputs: CognitiveToolkit.exe CrawlDocumentum --source-settings "C:\SourceFiles\Documentum.json" --query "C:\documentum.json" -u "http://localhost:9200" -i INDEXNAME |
Use the CrawlDocumentum Options Table to edit your runscript command.
CrawlDocumentum Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the Documentum data source | Required |
--use-single-index | Use single index | Optional *If not included, default: false |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 1000 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
CrawlExchange | Example |
---|
Crawls for the metadata within Microsoft Exchange and adds it to an index in the Analytics Engine. | Command using minimum required inputs: CognitiveToolkit.exe CrawlExchange --source-settings "C:\SourceFiles\Exchange.json" --email tester1@exchange.local -u http://localhost:9200 -i INDEXNAME |
Use the CrawlExchange Options Table to edit your runscript command.
CrawlExchange Options Table
OPTION | VALUE | CONDITION |
---|
--page-size <VALUE> | Number of Exchange items to retrieve in a single request | Optional *If not included, default: 500 |
--max-characters <VALUE> | The maximum number of characters for the extracted email text field. Note: Setting this option when crawling Exchange Online will restrict the number of characters displayed in Enterprise Search results. | Optional *If not included, default: 0 [all characters] |
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the Exchange data source | Required |
--crawl-public-folders | Crawl public folders | Optional *If not included, default: false |
--email <VALUE> | Comma-separated list of email addresses to crawl. Note: Leave blank to crawl all mailboxes. Use when crawling public folders. | Required |
--exclude-auto-replies | Exclude auto-replies (for example, out-of-office replies) from the index | Optional *If not included, default: false |
--ignore-inline-attachments | Excludes all inline attachments | Optional *If not included, default: false |
--ignore-body | Excludes the body text of Exchange items from being indexed | Optional *If not included, default: false |
--ignore-attachment-extensions <VALUE> | Comma-separated list of extensions of the inline attachments that should be ignored | Optional |
--ignore-folders <VALUE> | Comma-separated list of folder names to be excluded from crawling. Supported values are: drafts, deleted items, junk email | Optional |
--after-last-modified-date <VALUE> | Items changed on/after this date. Supported date formats are: yyyy-MM-dd, yyyy-MM-dd HH:mm, yyyy-MM-ddTHH:mm. Example: 2018-12-20 or 2018-12-20 19:42. Supported relative date formats are: now, now+/-1d[/d], now+/-1m[/d], now+/-1y[/d] | Optional |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 1000 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
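The options above can be combined into a single runscript command. The sketch below is one hypothetical way to assemble such a command string in Python. The helper name build_command and the sample values are assumptions for illustration and are not part of Cognitive Toolkit. It only builds and prints the string; it does not run the operation.

```python
def build_command(operation, options, flags=()):
    """Assemble a CognitiveToolkit.exe command line as a single string."""
    parts = ["CognitiveToolkit.exe", operation]
    for name, value in options.items():
        # Wrap every value in double quotes, as this guide recommends.
        parts.append(f'{name} "{value}"')
    # Boolean switches such as --exclude-auto-replies take no value.
    parts.extend(flags)
    return " ".join(parts)


# Sample values only; replace them with the values for your environment.
command = build_command(
    "CrawlExchange",
    {
        "--source-settings": r"C:\SourceFiles\Exchange.json",
        "--email": "tester1@exchange.local",
        "--ignore-folders": "drafts,junk email",
        "-u": "http://localhost:9200",
        "-i": "INDEXNAME",
    },
    flags=["--exclude-auto-replies"],
)
print(command)
```

Quoting every value keeps values that contain spaces, such as "junk email", together as one argument.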
| Example |
---|
Crawls for the metadata within FileNet and adds it to an index in the Analytics Engine. | Command using minimum required inputs: Cognitivetoolkit.exe CrawlFileNet --source-settings "C:\SourceFiles\FileNet.json" -u http://localhost:9200 -i INDEXNAME
|
Use the CrawlFileNet Options Table to edit your runscript command.
CrawlFileNet Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the FileNet data source | Required |
--class-definition <VALUE> | The document class you want to filter for | Optional *If not included, default: 'All' |
--exclude-subclasses | Exclude subclasses | Optional *If not included, default: false |
--crawl-hidden | Crawl hidden document classes | Optional *If not included, default: false |
--where-clause <VALUE> | FileNet SQL WHERE clause. When used, it overrides the date options. | Optional |
--before-date-last-modified <VALUE> | Crawl everything before this date | Optional *If not included, default: Now |
--after-last-modified-date <VALUE> | Items changed on/after this date
Supported date formats: yyyy-MM-dd, yyyy-MM-dd HH:mm, yyyy-MM-ddTHH:mm Example: 2018-12-20 or 2018-12-20 19:42
| Optional *If not included, default: 1970-01-01 |
--interval <VALUE> | The number of months to crawl at a time | Optional *If not included, default: 3 |
-u|--index-server-url <VALUE> | URL of the index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost.
| Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start
| Optional *If not included, default: 1 |
-n|--nodes-per-request <VALUE> | Number of nodes per request
| Optional *If not included, default: 1000 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine
The --dry-run option allows you to quickly see how many items will be processed without actually creating the index.
| Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
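Before passing a value to --after-last-modified-date or --before-date-last-modified, it can help to confirm that it matches one of the supported absolute date formats. The following is a minimal sketch in Python, assuming the three formats listed above map to the strptime patterns shown. The function name parse_crawl_date is hypothetical, and Python's parser is slightly more lenient than the listed formats (for example, it accepts months without a leading zero).

```python
from datetime import datetime

# The three absolute formats listed in the table, written as strptime patterns.
SUPPORTED_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M", "%Y-%m-%dT%H:%M")


def parse_crawl_date(text):
    """Return a datetime if text matches a supported format, else raise ValueError."""
    for pattern in SUPPORTED_FORMATS:
        try:
            return datetime.strptime(text, pattern)
        except ValueError:
            continue  # Try the next supported pattern.
    raise ValueError(f"Unsupported date format: {text!r}")


print(parse_crawl_date("2018-12-20"))
print(parse_crawl_date("2018-12-20 19:42"))
```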
| Example |
---|
Base operation for data discovery. Generally performed prior to running any other Cognitive Toolkit operation. This operation crawls the specified path (or multiple paths) for metadata. The metadata is then stored in an index within the Analytics Engine where it can be further mined for insights. | Command using minimum required inputs: CognitiveToolkit.exe CrawlFileSystem --path-file "C:\path.json" -u http://localhost:9200 -i INDEXNAME
|
Use the CrawlFileSystem Options Table to edit your runscript command.
CrawlFileSystem Options Table
OPTION | VALUE | CONDITION |
---|
--path <VALUE> OR --path-file <VALUE> | Single path to crawl OR Text file that contains multiple paths to crawl
*At least one of these two options must be included in the runscript command. If --path is not used, --path-file must be used. If --path-file is not used, --path must be used.
| Optional* Optional* |
--include-hidden | Includes hidden files in the crawl | Optional *If not included, default: false |
--include-system | Includes system files in the crawl | Optional *If not included, default: false |
-a|--add-field-owner | Add the Owner field to the index | Optional *If not included, default: false |
--include-reparse | Includes reparse items
A file or directory can contain a reparse point, which is a collection of user-defined data. The format of this data is understood by the application which stores the data, and a file system filter, which interprets the data and processes the file. When an application sets a reparse point, it stores this data, plus a reparse tag, which uniquely identifies the data it is storing.
| Optional *If not included, default: false |
--after-date-last-modified | Crawls everything after this date
Supported date formats: yyyy-MM-dd, yyyy-MM-dd HH:mm, yyyy-MM-ddTHH:mm Example: 2018-12-20 or 2018-12-20 19:42 Supported relative date formats: now, now+/-1d[/d], now+/-1m[/d], now+/-1y[/d]
| Optional *If not included, default: all |
--validate | Validates file paths
Issue: Using the option --validate in a folder with more than 1024 folders produces an error. Solution: In the elasticsearch.yml file, set the following parameter to a number that exceeds your folder amount: indices.query.bool.max_clause_count
| Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost.
| Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start
| Optional *If not included, default: 1 |
-n|--nodes-per-request <VALUE> | Number of nodes per request
| Optional *If not included, default: 1000 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine
The --dry-run option allows you to quickly see how many items will be processed without actually creating the index.
| Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
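The --validate issue described above depends on how many folders are being crawled. The sketch below is one hypothetical way to count folders before a crawl and suggest a value for indices.query.bool.max_clause_count. It assumes the limit applies to every folder below the crawl path (the guide does not say whether only immediate subfolders count), and the function names and the headroom value are illustrative only.

```python
import os

DEFAULT_LIMIT = 1024  # Folder count quoted in this guide for the --validate issue.


def count_folders(root):
    """Count every folder below root (root itself is not counted)."""
    total = 0
    for _current, dirnames, _filenames in os.walk(root):
        total += len(dirnames)
    return total


def suggested_max_clause_count(root, limit=DEFAULT_LIMIT, headroom=256):
    """Return None if the limit is sufficient, else a value above the folder count."""
    folders = count_folders(root)
    if folders <= limit:
        return None
    # Any number that exceeds the folder count works; headroom is arbitrary.
    return folders + headroom
```

If the function returns a number, that number could be used for indices.query.bool.max_clause_count in elasticsearch.yml.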
| Example |
---|
Crawls for the metadata within Maximo and adds it to an index in the Analytics Engine. | Command using minimum required inputs: Cognitivetoolkit.exe CrawlMaximo --database-type postgres --schema-type work-order --connection-string "User ID=postgres;Password=mypassword;Host=localhost;Port=5435;Database=Maximodatabase" --sql-query "select assetnum,workorderid,worktype from work_orders" --key-fields workorderid --index-server-url http://localhost:9200 --index-name INDEXNAME
|
Use the CrawlMaximo Options Table to edit your runscript command.
CrawlMaximo Options Table
OPTION | VALUE | CONDITION |
---|
--database-type <VALUE> | The type of database
Supported types: 'oracle', 'postgres', 'sqlserver'
| Required |
--schema-type <VALUE> | The type of items being crawled
Supported types: 'work-order', 'condition-report', 'item', 'location', 'oem', 'company'
| Required |
--connection-string <VALUE> | The database Connection String, used to connect to the database | Required |
--sql-query <VALUE> | The SQL query to retrieve records | Required |
--key-fields <VALUE> | A comma separated list of fields that produce a unique key | Required |
--database-timeout <VALUE> | The length of time (in seconds) to wait for a connection to the server before terminating the attempt and generating an error | Optional *If not included, default: 120 |
--connection-string-password <VALUE> | A password to replace '{{password}}' in the connection string | Optional |
--chunk-size <VALUE> | The number of items sent to the index in a single request | Optional *If not included, default: 1000 |
-u|--index-server-url <VALUE> | URL of the index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost.
| Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start
| Optional *If not included, default: 1 |
-n|--nodes-per-request <VALUE> | Number of nodes per request.
| Optional *If not included, default: 100 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine
The --dry-run option allows you to quickly see how many items will be processed without actually creating the index.
| Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
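The --connection-string-password option replaces the '{{password}}' placeholder in the connection string, which keeps the password out of the connection string itself. The sketch below illustrates that substitution in Python. It is an illustration of the behaviour described in the table, not the toolkit's own code, and the function name and sample values are assumptions.

```python
PLACEHOLDER = "{{password}}"


def fill_connection_string(template, password):
    """Replace the {{password}} placeholder with the supplied password."""
    if PLACEHOLDER not in template:
        raise ValueError("Connection string has no {{password}} placeholder")
    return template.replace(PLACEHOLDER, password)


# Sample connection string with the placeholder in place of the password.
template = (
    "User ID=postgres;Password={{password}};"
    "Host=localhost;Port=5435;Database=Maximodatabase"
)
print(fill_connection_string(template, "mypassword"))
```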
| Example |
---|
Crawls for the metadata within OneDrive and adds it to an index in the Analytics Engine. | Command using minimum required inputs: Cognitivetoolkit.exe CrawlOneDrive --source-settings "C:\onedrive.json" -u http://localhost:9200 -i INDEXNAME --use-single-index
|
Use the CrawlOneDrive Options Table to edit your runscript command.
CrawlOneDrive Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the OneDrive data source | Required |
--use-single-index | Use single index | Optional *If not included, default: false |
--specific-accounts <VALUE> | Specify email addresses to crawl (comma-separated)
Example: 'email1@domain.com', 'email2@domain.com', 'email3@domain.com', 'email4@domain.com'
| Optional |
-u|--index-server-url <VALUE> | URL of the index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost.
| Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start
| Optional *If not included, default: 1 |
-n|--nodes-per-request <VALUE> | Number of nodes per request
| Optional *If not included, default: 1000 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine
The --dry-run option allows you to quickly see how many items will be processed without actually creating the index.
| Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
| Example |
---|
Crawls for the metadata within SharePoint Online and adds it to an index in the Analytics Engine. | Command using minimum required inputs: Cognitivetoolkit.exe CrawlSharePointOnline --source-settings "C:\sharepoint.json" -u http://localhost:9200 -i "shinydrive index" --use-single-index
|
Use the CrawlSharePointOnline Options Table to edit your runscript command.
CrawlSharePointOnline Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the SharePoint Online data source | Required |
--crawl-subsites | Crawl all of the subsites | Optional *If not included, default: false |
--remove-standard-filter | Remove filter for non-standard document libraries (Hidden, System, etc) | Optional *If not included, default: false |
--remove-document-library-type-filter | Remove filter for non-document library types | Optional *If not included, default: false |
--remove-site-assets-filter | Remove filter for site asset libraries | Optional *If not included, default: false |
--crawl-site-collection | Crawl From SiteCollection (overrides CrawlSubsites option) | Optional *If not included, default: false |
--crawl-from-index <VALUE> | Crawl From Previously Run 'CrawlSharePointSites' index | Optional |
--filter <VALUE> | SharePoint Filter File/JSON Note: The --filter option for CrawlSharePointOnline is only used in conjunction with --crawl-site-collection. Results will not be filtered unless the --crawl-site-collection option is also used. | Optional |
--use-single-index | Use single index | Optional *If not included, default: false |
-n|--nodes-per-request <VALUE> | Number of nodes per request
| Optional *If not included, default: 1000 |
-u|--index-server-url <VALUE> | URL of the index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost.
| Required |
-i|--index-name <VALUE> | Name of the index | Required |
--index-type <VALUE> | Type name for index objects | Optional *If not included, default: shinydocs |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
| Example |
---|
Crawls for the metadata within SharePoint On-Premise and adds it to an index in the Analytics Engine. | Command using minimum required inputs: Cognitivetoolkit.exe CrawlSharePointOnPrem --source-settings "C:\sharepoint.json" -u http://localhost:9200 -i "shinydrive index" --use-single-index
|
Use the CrawlSharePointOnPrem Options Table to edit your runscript command.
CrawlSharePointOnPrem Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the SharePoint On-Premise data source | Required |
--crawl-subsites | Crawl all of the subsites. | Optional *If not included, default: false |
--hidden | Crawl hidden lists | Optional *If not included, default: false |
--catalog | Crawl catalog lists | Optional *If not included, default: false |
--application | Crawl application lists | Optional *If not included, default: false |
--private | Crawl private lists | Optional *If not included, default: false |
--use-single-index | Use single index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost.
| Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start
| Optional *If not included, default: 1 |
-n|--nodes-per-request <VALUE> | Number of nodes per request
| Optional *If not included, default: 1000 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine
The --dry-run option allows you to quickly see how many items will be processed without actually creating the index.
| Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
| Example |
---|
Crawls SharePoint Online sites to create a list of all the site names and adds them to an index in the Analytics Engine. This information can then be used to crawl specific subsites using the CrawlSharePointOnline or CrawlSharePointOnPrem operations. | Command using minimum required inputs: Cognitivetoolkit.exe CrawlSharePointSites --source-settings "C:\sharepoint.json" -u http://localhost:9200 -i INDEXNAME
|
Use the CrawlSharePointSites Options Table to edit your runscript command.
CrawlSharePointSites Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the SharePointOnlineSites data source | Required |
--keyword-query <VALUE> | Additional Keyword Query parameters | Optional |
-u|--index-server-url <VALUE> | URL of the index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost.
| Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start
| Optional *If not included, default: 1 |
-n|--nodes-per-request <VALUE> | Number of nodes per request
| Optional *If not included, default: 1000 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine
The --dry-run option allows you to quickly see how many items will be processed without actually creating the index.
| Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
| Example |
---|
Deletes the specified data/files based on the query. If successful, this adds a field called [dispose] with a value of true to an index in the Analytics Engine. For confirmation, Dispose identifies the number of files that will be deleted before the dispose runs. | Command using minimum required inputs: Cognitivetoolkit.exe Dispose --query "C:\disposeQuery.json" -u http://localhost:9200 -i INDEXNAME
|
Use the Dispose Options Table to edit your runscript command.
Dispose Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the data source (Supported data sources: Content Server, Documentum, File System, OneDrive, SharePoint Online and SharePointOnPrem) | Optional *If not included, default: filesystem |
--verify-hash | Verifies that the file still matches the hash value before file deletion | Optional *If not included, default: false |
--hash-field <VALUE> | The hash field | Optional, but required if --verify-hash is specified |
--hash-algorithm <VALUE> | The hash algorithm to use when verifying
Supported values: “sha1”, “sha256”, “sha512”, “md5”
| Optional, but required if --verify-hash is specified |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request
| Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index
| Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost.
| Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start
| Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine
The --dry-run option allows you to quickly see how many items will be processed without actually creating the index.
| Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
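The --verify-hash option checks that a file still matches its stored hash before it is deleted. The sketch below shows the general idea in Python using the four algorithms listed for --hash-algorithm. It is a conceptual illustration only, not the toolkit's implementation, and the function names are hypothetical.

```python
import hashlib

# Algorithms listed as supported values for --hash-algorithm.
SUPPORTED_ALGORITHMS = ("sha1", "sha256", "sha512", "md5")


def file_hash(path, algorithm):
    """Compute the hex digest of a file, reading it in chunks."""
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"Unsupported hash algorithm: {algorithm}")
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read 1 MB at a time so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_indexed_hash(path, algorithm, indexed_hash):
    """Return True only if the file's current hash equals the stored value."""
    return file_hash(path, algorithm) == indexed_hash.strip().lower()
```

A file whose contents changed after it was hashed would return False here, which is the case --verify-hash is meant to catch before deletion.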
| Example |
---|
Specify and export fields/values from an index in the Analytics Engine into a comma-separated value (csv) file. | Command using minimum required inputs: Cognitivetoolkit.exe ExportFromIndex --fields creationTimeUtc,name,extension,path --query "C:\match_all.json" -u http://localhost:9200 -i INDEXNAME
|
Use the ExportFromIndex Options Table to edit your runscript command.
ExportFromIndex Options Table
OPTION | VALUE | CONDITION |
---|
--fields <VALUE> | Comma-delimited list of fields to include in the export | Required |
--filename <VALUE> | File to export the index to | Optional *If not included, default: export.csv |
--search-index-name <VALUE> | Index to search for the duplicates | Optional |
--max-file-size <VALUE> | Maximum file size in MB, limited to 1GB | Optional *If not included, default: 1GB |
--inspected-field <VALUE> | The name of the field that the duplicates were tagged on | Optional *If not included, default: hash |
--duplicate-field <VALUE> | The name of the field that identifies the duplicate | Optional *If not included, default: duplicate-hash |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request
| Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index
| Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost.
| Required |
-i|--index-name <VALUE> | Name of the index | Required |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine
The --dry-run option allows you to quickly see how many items will be processed without actually creating the index.
| Optional *If not included, default: false |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
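The --query option takes the path to a JSON query file, such as the match_all.json file used in the example command. This guide does not show the contents of that file. The sketch below assumes it holds a standard Elasticsearch match_all query body and writes one to disk with Python; the file name and the helper name are illustrative.

```python
import json

# Assumption: the query file uses the standard Elasticsearch query DSL.
match_all_query = {"query": {"match_all": {}}}


def write_query_file(path, query):
    """Write a query dictionary to a JSON file for use with --query."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(query, handle, indent=2)


write_query_file("match_all.json", match_all_query)
```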
| Example |
---|
Extracts text and performs a crawl of pst (email) files. | Command using minimum required inputs: Cognitivetoolkit.exe ExtractAndCrawlPst --query "C:\query-match-extension-pst-not-extracted.json" --index-server-url http://localhost:9200 --index-name INDEXNAME
|
Use the ExtractAndCrawlPst Options Table to edit your runscript command.
ExtractAndCrawlPst Options Table
OPTION | VALUE | CONDITION |
---|
--create-duplicates | Allow duplicate files
If this option is utilized, pst files that are crawled more than once will be duplicated in the file on the fileshare and in the index. By default, if an existing pst file is found, it will be skipped during the operation.
| Optional *If not included, default: false |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request
| Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index
| Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost.
| Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start
| Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine
The --dry-run option allows you to quickly see how many items will be processed without actually creating the index.
| Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
| Example |
---|
An information extraction technique whereby key elements from text are identified and classified into predefined categories. Categories include: Person, Location, Organization, Date, Money, Percent, Time.
This operation transforms unstructured data to structured data that is machine readable and available for standard processing. | Command using minimum required inputs: Cognitivetoolkit.exe ExtractEntities --extraction-service-url http://localhost:8181/ --query "C:\query-match-fullText-no-entities.json" --index-server-url http://localhost:9200 --index-name INDEXNAME
|
Use the ExtractEntities Options Table to edit your runscript command.
ExtractEntities Options Table
OPTION | VALUE | CONDITION |
---|
--field <VALUE> | The name of the index field to store the extracted entities | Optional *If not included, default: entities |
-c|--classes <VALUE> | Comma-separated list of entity classes to extract from text. The classes that can be extracted depend on the classifier model used to perform entity extraction. | Optional *If not included, default: all classes, Example:"LOCATION,PERSON,ORGANIZATION" |
-e|--extract-from <VALUE> | The name of the index field from which to extract entities | Optional *If not included, default: fullText |
--preserve-spacing | Preserve spacing from extracted entities. Depending on the source file, there may be unwanted line breaks in extracted entities. | Optional *If not included, default: false |
--extraction-service-url <EXTRACTION_SERVICE_URL> | URL for the entity extraction service | Optional *If not included, default: http://localhost:55555 |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request
| Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index
| Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost.
| Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start
| Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine
The --dry-run option allows you to quickly see how many items will be processed without actually creating the index.
| Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
| Example |
---|
Adds classifications to documents based on their similarity to other, already-classified documents in the Analytics Engine. For example, choose 5-10 documents of a similar kind and classify them by their document type, such as offer letters or purchase orders. The Shinydocs Cognitive Suite will “learn” from those examples and will be able to find other similar documents for classification. | Command using minimum required inputs: Cognitivetoolkit.exe FindSimilarClassification --classification-field classification --query "C:\query-match-path-no-classification.json" --tokens 100 --threshold 75 --min-docs 5 --min-terms 1 --match 1 --index-server-url http://localhost:9200 --index-name INDEXNAME
|
Use the FindSimilarClassification Options Table to edit your runscript command.
FindSimilarClassification Options Table
OPTION | VALUE | CONDITION |
---|
--field-list <VALUE> | Fields to compare | Optional *If not included, default: fullText |
--classification-field <VALUE> | Name of the field where classifications are found | Required |
--tokens <VALUE> | Number of tokens to compare | Optional *If not included, default: 500 |
--min-docs <VALUE> | Minimum document frequency | Optional *If not included, default: 5 |
--min-terms <VALUE> | Minimum term frequency | Optional *If not included, default: 2 |
--max-docs <VALUE> | Maximum document frequency | Optional |
--min-word-length <VALUE> | Minimum word length [number of characters] | Optional |
--threshold <VALUE> | Similarity threshold [minimum-should-match] | Optional *If not included, default: 90 (%) |
--match <VALUE> | Number of documents to match | Optional *If not included, default: 5 |
--size-similarity <VALUE> | Size similarity threshold (the percent delta between sizes) | Optional *If not included, default: 20 |
--inclusion <VALUE> | File extension inclusion list (Comma delimited) | Optional |
--exclusion <VALUE> | File extension exclusion list (Comma delimited) | Optional |
--print-query | Print the Elasticsearch query in the logs. Does not run operation! | Optional *If not included, default: false |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request
| Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index
| Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost.
| Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start
| Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine
The --dry-run option allows you to quickly see how many items will be processed without actually creating the index.
| Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
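Several of the options above resemble parameters of the Elasticsearch more_like_this query. The sketch below shows what such a query body could look like in Python, using the table's default values. The mapping from options to query parameters is an assumption made for illustration; this guide does not confirm that the operation builds its query this way, and --print-query remains the way to see the actual query.

```python
def more_like_this_query(doc_id, index_name, fields=("fullText",),
                         tokens=500, min_docs=5, min_terms=2, threshold=90):
    """Build a more_like_this query body (assumed mapping, not confirmed)."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),               # assumed: --field-list
                "like": [{"_index": index_name, "_id": doc_id}],
                "max_query_terms": tokens,            # assumed: --tokens
                "min_doc_freq": min_docs,             # assumed: --min-docs
                "min_term_freq": min_terms,           # assumed: --min-terms
                "minimum_should_match": f"{threshold}%",  # assumed: --threshold
            }
        }
    }


# "example-id" is a placeholder for an already-classified document's ID.
query = more_like_this_query("example-id", "INDEXNAME")
print(query["query"]["more_like_this"]["minimum_should_match"])
```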
| Example |
---|
Migrates data/files from one source to another. NOTES: A crawl of the origin source data is required before performing a migration. When migrating to SharePoint Online, field names are NOT case-sensitive at this time. When migrating File Share to File Share (internal servers only), permissions and ownership are not carried over.
| Command using minimum required inputs: Cognitivetoolkit.exe Migrate --destination-source-settings "C:\contentserver.json" --start-location 123456 --query "C:\match_all.json" -u http://localhost:9200 -i INDEXNAME --path-prefix-to-remove "\\TestFolder" --classification-mapping "C:\ClassificationsMapping.json"
|
Use the Migrate Options Table to edit your runscript command.
Migrate Options Table
OPTION | VALUE | CONDITION |
---|
--origin-source-settings <VALUE> | The origin settings file | Optional *If not included, default: File System Source |
--destination-source-settings <VALUE> | The destination settings file | Required |
--start-location <VALUE> | The default starting location | Optional depending on other input. Ignored otherwise. |
--location-field <VALUE> | The field name in the index indicating the destination location for the file migration | Optional |
--migrate-versions | Migrate all versions (Source dependent) | Optional *If not included, default: false |
--migrate-permissions | Migrate all permissions (Source dependent) | Optional *If not included, default: false |
--path-prefix-to-remove <VALUE> | Removes the provided text from the beginning of path | Optional Default: Smart Prefix Removal ie. 'c:\' or '\computer\' |
--description-field-name <VALUE> | The name of the field where the description value is found (only use with Content Server as destination source) | Optional |
--name-field-name <VALUE> | The name of the field where the name value is found | Optional |
--metadata-mapping <VALUE> | Location of the metadata mapping file (Source dependent) | Optional |
--user-mapping <VALUE> | Location of the user mapping file (Source dependent) | Optional |
--user-mapping-type <VALUE> | User mapping-type to use (Source dependent)
Supported values: off, file
| Optional *If not included, default: off |
--default-owner <VALUE> | The default owner of the documents being uploaded (Source dependent defaults) | Optional |
--classification-mapping <VALUE> | Location of the classification mapping file (Content Server Only) | Optional |
--is-records-management | Records management enabled (Content Server Only). This option is used in conjunction with the --classification-mapping option for Content Server. Including --is-records-management ensures that records management classifications are included in the migration. Excluding --is-records-management means that any associated records management classifications are excluded from the migration. | Optional *If not included, default: false |
--auto-upgrade-category | Auto category upgrade (Content Server Only) | Optional *If not included, default: false |
--disable-over-write | Disable over-write (SharePoint Only) | Optional *If not included, default: false |
--site-url | Site to migrate to (SharePoint Only) | Optional *If not included, default: site specified in the source settings |
-s|--scroll-size | The page size of the query results | Optional *If not included, default: false |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
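The Migrate options above can be assembled into a command line programmatically. The following is a minimal Python sketch, not part of the Cognitive Toolkit: the helper names `quote` and `build_migrate_command` are hypothetical, and the paths, URL, and index name are the placeholder values used in this guide. It only builds the text of the command, wrapping every value in double quotation marks as recommended earlier.

```python
def quote(value):
    """Wrap a value in double quotation marks, as this guide recommends."""
    return '"' + str(value) + '"'


def build_migrate_command(destination_settings, query, index_url, index_name, optional=None):
    """Assemble a Migrate command from the required options plus any optional ones.

    Hypothetical helper for illustration; it does not run the Toolkit.
    """
    parts = [
        "CognitiveToolkit.exe", "Migrate",
        "--destination-source-settings", quote(destination_settings),
        "--query", quote(query),
        "--index-server-url", quote(index_url),
        "--index-name", quote(index_name),
    ]
    for flag, value in (optional or {}).items():
        parts.append(flag)
        if value is not None:  # switches such as --migrate-versions take no value
            parts.append(quote(value))
    return " ".join(parts)


command = build_migrate_command(
    r"C:\contentserver.json",
    r"C:\match_all.json",
    "http://localhost:9200",
    "INDEXNAME",
    optional={"--start-location": 123456, "--migrate-versions": None},
)
print(command)
```

The printed line can be pasted at the CMD prompt in the Cognitive Toolkit root folder.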
| Example |
---|
Removes the field specified within the explicit index. | Command using minimum required inputs: Cognitivetoolkit.exe RemoveField --field parent --query "C:\match_all.json" -u http://localhost:9200 -i INDEXNAME
|
Use the RemoveField Options Table to edit your runscript command.
RemoveField Options Table
OPTION | VALUE | CONDITION |
---|
-f|--field <FIELD> | The name of the index field | Required |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
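Because RemoveField changes the index, it can be useful to preview a run with --dry-run before running it for real. The sketch below is an illustration in Python under stated assumptions: `remove_field_argv` is a hypothetical helper, the values are the placeholders from this guide, and the `subprocess.run` call is left commented out because it needs a machine where the Toolkit is installed. When a list of arguments is passed to `subprocess.run`, Python handles the quoting of each value itself.

```python
import subprocess


def remove_field_argv(field, query, index_url, index_name, dry_run=True):
    """Build the RemoveField argument list; preview with --dry-run by default."""
    argv = [
        "CognitiveToolkit.exe", "RemoveField",
        "--field", field,
        "--query", query,
        "--index-server-url", index_url,
        "--index-name", index_name,
    ]
    if dry_run:
        argv.append("--dry-run")
    return argv


# First pass: preview how many items would be processed.
preview = remove_field_argv("parent", r"C:\match_all.json", "http://localhost:9200", "INDEXNAME")
# Second pass: the same command without --dry-run.
actual = remove_field_argv("parent", r"C:\match_all.json", "http://localhost:9200", "INDEXNAME", dry_run=False)

print(preview)
print(actual)
# subprocess.run(preview, check=True)  # uncomment where CognitiveToolkit.exe is available
```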
| Example |
---|
Removes the index from your Analytics Engine, but does not remove the index pattern. | Command using minimum required inputs: Cognitivetoolkit.exe RemoveIndex -u http://localhost:9200 -i INDEXNAME
|
Use the RemoveIndex Options Table to edit your runscript command.
RemoveIndex Options Table
OPTION | VALUE | CONDITION |
---|
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 1000 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
| Example |
---|
Removes items from the Index. Use Case: There are times when a dataset has been crawled, but then files have later been deleted from the dataset. This will result in files having an invalid path in the index (path-valid = false). Apply the RemoveItems operation to remove items from the index that are displaying data for files that users have deleted. | Command using minimum required inputs: Cognitivetoolkit.exe RemoveItems --query "C:\match_field_keyword.json" -u http://localhost:9200 -i INDEXNAME
|
Use the RemoveItems Options Table to edit your runscript command.
RemoveItems Options Table
OPTION | VALUE | CONDITION |
---|
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters). In this case, a match_field_keyword.json query could be applied, for example: { "match" : { "path-valid" : "false" } } | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
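The query file used in the RemoveItems use case can be generated rather than typed by hand. The following Python sketch writes the path-valid query shown in the table to a file and prints a matching command with --dry-run. It is an illustration only: the file is written to a temporary folder chosen by Python (an assumption, not a Toolkit requirement), and the URL and index name are the placeholders used in this guide.

```python
import json
import os
import tempfile

# The query from the RemoveItems table: match items whose path is no longer valid.
query = {"match": {"path-valid": "false"}}

# Write the query to a JSON file; the temporary folder is only for illustration.
query_path = os.path.join(tempfile.mkdtemp(), "match_field_keyword.json")
with open(query_path, "w", encoding="utf-8") as handle:
    json.dump(query, handle, indent=2)

# Build the command text, previewing with --dry-run before removing anything.
command = (
    'CognitiveToolkit.exe RemoveItems --query "{}" '
    "-u http://localhost:9200 -i INDEXNAME --dry-run"
).format(query_path)
print(command)
```

Writing the file with `json.dump` guarantees that the query is valid JSON before it is handed to the command.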
| Example |
---|
Restores the FileSystem permissions. Can be used in conjunction with the CacheFileSystemPermissions and SetFileSystemPermissions tools. It relies on the following fields: CachedPermsFieldName = "cached-permissions", CachedInheritanceFieldName = "cached-inheritance", CachedOwnerFieldName = "cached-owner".
If you run into problems running SetFileSystemPermissions, you can perform a restore, which uses these cached fields to restore the original permissions on the actual file in the file system. | Command using minimum required inputs: Cognitivetoolkit.exe RestoreCachedFileSystemPermissions --query "C:\match_all.json" -u http://localhost:9200 -i INDEXNAME
|
Use the RestoreCachedFileSystemPermissions Options Table to edit your runscript command.
RestoreCachedFileSystemPermissions Options Table
OPTION | VALUE | CONDITION |
---|
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
| Example |
---|
Allows you to run the Cognitive Toolkit as a different user. | Command using minimum required inputs: Cognitivetoolkit.exe RunWithCredentials
|
Use the RunWithCredentials Options Table to edit your runscript command.
RunWithCredentials Options Table
OPTION | VALUE | CONDITION |
---|
--save-credentials | Save credentials on the machine | Optional *If not included, default: false |
--hide | Hide the window | Optional *If not included, default: false |
| Example |
---|
Saves values so that they can be used via substitution in tools. This tool also encrypts the passwords and user names supplied for required options, whether within the --source-settings file and/or directly on the command line. See also: Encrypting repository passwords with SaveValue. | Command using minimum required inputs: Cognitivetoolkit.exe SaveValue --query "C:\match_all.json" -u http://localhost:9200 -i INDEXNAME
|
Use the SaveValue Options Table to edit your runscript command.
SaveValue Options Table
OPTION | VALUE | CONDITION |
---|
--list | List all of the saved values | Optional |
--save <SAVE> | Name of saved value | Optional |
--value <VALUE> | Value | Optional |
--remove <REMOVE> | Remove from saved values | Optional |
--no-encryption | Do not encrypt value | Optional *If not included, default: false |
| Example |
---|
Resets permissions on the file system. Make sure you retain the Administrators' rights on the file system.
CacheFileSystemPermissions must be performed before running this operation.
| Command using minimum required inputs: Cognitivetoolkit.exe SetFileSystemPermissions --exclusions "Administrator,Administrators" -q "C:\match_all.json" -u http://localhost:9200 -i INDEXNAME --identity SHINYAD\identity1,identity2\identity3 --access-control-reason quarantine --rights none
|
Use the SetFileSystemPermissions Options Table to edit your runscript command.
SetFileSystemPermissions Options Table
OPTION | VALUE | CONDITION |
---|
--exclusions <VALUE> | A comma-separated list of users/groups you wish to exclude | Optional |
--identity <VALUE> | Identities to add file access control (Comma separated) | Required |
--access-control-reason <VALUE> | File access control change identifier, e.g. legal-hold, destruction, public-record | Required |
--rights <VALUE> | The level of permissions that will remain on the object (e.g. none, read, modify) | Optional *If not included, default: read |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
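Both --identity and --exclusions take comma-separated lists, so a stray comma or space inside a name changes the meaning of the command. The Python sketch below shows one way to build those values from lists. It is an illustration only: `comma_list` is a hypothetical helper, and the identity names are placeholders rather than real accounts.

```python
def comma_list(names):
    """Join user or group names into the comma-separated form the options expect."""
    cleaned = [name.strip() for name in names if name.strip()]
    for name in cleaned:
        if "," in name:
            # A comma inside a name would be read as a separator between two names.
            raise ValueError("a name cannot contain a comma: " + name)
    return ",".join(cleaned)


# Placeholder identities; replace them with accounts from your own environment.
identities = comma_list(["SHINYAD\\identity1", " SHINYAD\\identity2 "])
exclusions = comma_list(["Administrator", "Administrators"])

command = (
    "CognitiveToolkit.exe SetFileSystemPermissions "
    '--exclusions "{}" --identity "{}" '
    '--access-control-reason "legal-hold" --rights "read" '
    '-q "C:\\match_all.json" -u http://localhost:9200 -i INDEXNAME'
).format(exclusions, identities)
print(command)
```

Keeping Administrator and Administrators in the exclusions follows the advice above about retaining administrative rights on the file system.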
| Example |
---|
Tags any files that are considered duplicates, with the option to identify the primary duplicate. Run this operation after the AddHashAndExtractedText tool has added hash values to the index. | Command using minimum required inputs: Cognitivetoolkit.exe TagDuplicate --tag-primary --date-field lastCreationUtc --inspected-field hash --sort-order descending --items-to-process-query "C:\match_all.json" -u http://localhost:9200 -i INDEXNAME
|
Use the TagDuplicate Options Table to edit your runscript command.
TagDuplicate Options Table
OPTION | VALUE | CONDITION |
---|
--duplicate-field-name <VALUE> | The name of the field that will identify the duplicate | Optional *If not included, default: duplicate-{inspected-field} |
--tag-primary | Tag a primary duplicate. Primary duplicates are defined subjectively; the --date-field and --sort-order options determine which item in a set of duplicates is tagged as primary. | Optional *If not included, default: false |
--tag-unique | Tag unique documents. The unique status of an item is calculated at the time the TagDuplicate command is run, so there is no guarantee that an item will remain unique after a later crawl. | Optional *If not included, default: false |
--date-field <VALUE> | The name of the date field that will be used to determine the primary duplicate | Optional *If not included, default: creationTimeUtc |
--sort-order <VALUE> | The sort order to be used in conjunction with date-field. Supported values are: ‘ascending’, ‘descending’ | Optional *If not included, default: ascending |
--inspected-field <VALUE> | The name of the field that will be compared | Optional *If not included, default: ‘hash’ |
--use-keyword <VALUE> | Use the keyword field to filter | Optional *If not included, default: true |
--aggregate | Use the aggregate method (recommended for large datasets) | Optional *If not included, default: false |
-q|--items-to-process-query <VALUE> | Query for items to process (File or JSON input) | Required |
--match-against-query | Query for items to match against | Optional *If not included, default: match everything |
--overwrite | Items that are no longer duplicates will be erased from the index. | Optional *If not included, default: false |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Query will be used against the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
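To make the interaction between --inspected-field, --date-field and --sort-order concrete, the following Python sketch groups items by an inspected field, sorts each group by a date field, and treats the first item in that order as the primary. This is an illustration of the idea only, not the Toolkit's implementation: the `id` key, the tag names `primary`, `duplicate` and `unique`, and the rule that the first sorted item becomes the primary are all assumptions made for the example.

```python
from collections import defaultdict


def tag_duplicates(items, inspected_field="hash", date_field="creationTimeUtc",
                   sort_order="ascending"):
    """Return {item id: tag}; tags and the 'id' key are hypothetical names."""
    groups = defaultdict(list)
    for item in items:
        # Items that share the inspected value are candidates for duplication.
        groups[item[inspected_field]].append(item)

    tags = {}
    for group in groups.values():
        if len(group) == 1:
            tags[group[0]["id"]] = "unique"
            continue
        ordered = sorted(group, key=lambda item: item[date_field],
                         reverse=(sort_order == "descending"))
        # Assumption: the first item in the chosen sort order is the primary.
        tags[ordered[0]["id"]] = "primary"
        for item in ordered[1:]:
            tags[item["id"]] = "duplicate"
    return tags


items = [
    {"id": "a", "hash": "h1", "creationTimeUtc": "2020-01-01T00:00:00Z"},
    {"id": "b", "hash": "h1", "creationTimeUtc": "2021-06-01T00:00:00Z"},
    {"id": "c", "hash": "h2", "creationTimeUtc": "2019-03-15T00:00:00Z"},
]
print(tag_duplicates(items))
print(tag_duplicates(items, sort_order="descending"))
```

In this sketch, switching the sort order moves the primary tag from the oldest copy to the newest one, which is the kind of choice a data management strategy has to make.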
| Example |
---|
Once you have migrated items into Content Server, this tool helps you update the items with categories and attributes (if this was not completed during migration). It also allows you to move and/or rename documents within Content Server. | Command using minimum required inputs: Cognitivetoolkit.exe UpdateProperties --source-settings "C:\sourcesettings.json" -q "C:\match_all.json" -rm -cm "C:\ContentServerRMClassificationsMapping.json" -u http://localhost:9200 -i INDEXNAME
|
Use the UpdateProperties Options Table to edit your runscript command.
UpdateProperties Options Table
OPTION | VALUE | CONDITION |
---|
--source-settings <VALUE> | Path to the settings file containing access information such as the username and password for the data source (Possible data sources: Box, Content Server, Documentum, Exchange, Filenet) | Required |
--name | The field name that contains the updated node name | Optional |
--description | The field name that contains the updated node description | Optional |
--duplicate-resolution | Duplicate name conflict resolution. Supported values are: ‘duplicate’, ‘skip’, ‘version’ | Optional *If not included, default: skip |
--parent | The field name that contains the updated node parent | Optional |
--records-management-classification | Include this option if you are using records management for classification | Optional *If not included, default: false |
--metadata-mapping | Metadata mapping file | Optional |
--classification-mapping | Classifications file | Optional |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start | Optional *If not included, default: 1 |
--force | Forcefully remove/suppress prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually creating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
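Several options in these tables accept only a fixed set of values, and a typo is easier to catch before the command is run. The Python sketch below checks a value against the supported values listed in this guide for --duplicate-resolution, --sort-order and --user-mapping-type. It is a hypothetical pre-flight check, not a feature of the Toolkit; `SUPPORTED_VALUES` and `check_option` are names invented for the example, and options without a listed set of values are passed through unchanged.

```python
# Supported values as listed in the options tables of this guide.
SUPPORTED_VALUES = {
    "--duplicate-resolution": {"duplicate", "skip", "version"},
    "--sort-order": {"ascending", "descending"},
    "--user-mapping-type": {"off", "file"},
}


def check_option(flag, value):
    """Return the value unchanged, or raise ValueError if the guide does not list it."""
    allowed = SUPPORTED_VALUES.get(flag)
    if allowed is not None and value not in allowed:
        raise ValueError(
            "{} does not support {!r}; expected one of {}".format(flag, value, sorted(allowed))
        )
    return value


print(check_option("--duplicate-resolution", "version"))
print(check_option("--sort-order", "descending"))
```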
To access the complete list of available operations from within the Cognitive Toolkit, type the following at the root folder of the Cognitive Toolkit: CognitiveToolkit.exe -h (or CognitiveToolkit.exe --help)