Running CrawlFileSystem

CrawlFileSystem is the base operation for data discovery and is generally performed prior to running any other Cognitive Toolkit operation. It crawls the specified path (or multiple paths) for metadata. The metadata is then stored in an index where it can be further mined for insights.

Prerequisites

Cognitive Toolkit is a command-line tool. Every operation is run using the same sequence of steps:

Open CMD as an Administrator or as a service account that has the appropriate permissions.
Use cd to change to the directory where you extracted the shinydocs-extraction-[Version]-[YYYY-MM-DD].zip file (the Cognitive Toolkit root folder).
Type CognitiveToolkit.exe, then the runscript command (CrawlFileSystem), followed by the required parameters.
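
The sequence above can also be scripted. The following is a minimal Python sketch, not part of Cognitive Toolkit itself: the root folder C:\CognitiveToolkit and the helper names build_command and run_operation are hypothetical and only illustrate the order "change directory, then executable, operation, parameters".

```python
import subprocess
from pathlib import Path

# Hypothetical location of the extracted Cognitive Toolkit root folder.
TOOLKIT_ROOT = Path(r"C:\CognitiveToolkit")


def build_command(operation, options, exe="CognitiveToolkit.exe"):
    """Return the argument list: executable, operation name, then options."""
    return [exe, operation, *options]


def run_operation(operation, options, root=TOOLKIT_ROOT):
    """Run an operation with the toolkit root as working directory (the cd step)."""
    exe = str(Path(root) / "CognitiveToolkit.exe")
    # check=True raises CalledProcessError if the tool exits with an error code.
    return subprocess.run(build_command(operation, options, exe), cwd=root, check=True)


# Build (but do not run) a command; the path value is a placeholder.
command = build_command("CrawlFileSystem", ["--path", r"D:\Shares\HR"])
print(command)
```

The script still has to be started from an account with the appropriate permissions, as described in the first step.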

Running the command

To run the CrawlFileSystem command, you must provide, at a minimum, the required parameters listed in the following table:

Option values are based on your environment setup and location of source files.

CrawlFileSystem Options

Option	Details	Required
--path <VALUE> OR --path-file <VALUE>	Single path to crawl, OR a text file that contains multiple paths to crawl. *At least one of these two options must be included in the runscript command: if --path is not used, --path-file must be used, and if --path-file is not used, --path must be used.	No* (one of the two is required)
--include-hidden	Includes hidden files in the crawl	No (Default: false)
--include-system	Includes system files in the crawl	No (Default: false)
-a|--add-field-owner	Adds the Owner field to the index	No (Default: false)
--include-reparse	Includes reparse items. A file or directory can contain a reparse point, which is a collection of user-defined data. The format of this data is understood by the application that stores the data and by a file system filter, which interprets the data and processes the file. When an application sets a reparse point, it stores this data plus a reparse tag, which uniquely identifies the data it is storing.	No (Default: false)
--after-date-last-modified	Crawls all files modified after this date and ignores anything modified before it. This is useful for differentiating between crawls/indices. Supported date formats: yyyy-MM-dd, yyyy-MM-dd HH:mm, yyyy-MM-ddTHH:mm. Example: 2018-12-20 or 2018-12-20 19:42. Supported relative date formats: now, now+/-1d[/d], now+/-1m[/d], now+/-1y[/d].	No (Default: all)
--index-server-url <VALUE>	URL to the index server. If the Cognitive Toolkit and the index are running on different servers, set the value of --index-server-url to the IP address of the index server rather than the URL.	Yes
-i|--index-name <VALUE>	Name of the index	Yes
-t|--threads <VALUE>	Number of parallel processes to start. For recommendations on setting this value, see https://shinydocs.atlassian.net/wiki/spaces/SHINY/pages/2356117520	No (Default: 1)
-n|--nodes-per-request <VALUE>	Number of nodes per request. For recommendations on setting this value, see Setting the "--nodes-per-request" option	No (Default: 1000)
--force	Forcefully removes / suppresses the prompt for confirmation. When running batch files, this option allows the operation to continue without interruption.	No (Default: false)
--dry-run	Runs everything but doesn’t send nodes to the index. The --dry-run option lets you quickly see how many items will be indexed without actually creating the index.	No (Default: false)
--silent	Hides the progress bar during the running of the operation	No (Default: false)
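
The required-option rules in the table can be checked before the command is launched. The sketch below is an illustration only: crawl_options is a hypothetical helper, and the URL and index name in the usage line are placeholder values. The table says at least one of --path and --path-file must be present but does not say whether both may be combined, so the sketch simply passes through whatever is supplied.

```python
def crawl_options(index_server_url, index_name, path=None, path_file=None, flags=()):
    """Assemble CrawlFileSystem options, enforcing the 'Required' column."""
    if not path and not path_file:
        raise ValueError("Provide --path or --path-file (at least one is required)")
    if not index_server_url or not index_name:
        raise ValueError("--index-server-url and --index-name are required")

    options = []
    if path:
        options += ["--path", path]
    if path_file:
        options += ["--path-file", path_file]
    options += ["--index-server-url", index_server_url, "--index-name", index_name]
    # Optional switches from the table, e.g. "--dry-run" or "--add-field-owner".
    options += list(flags)
    return options


# Placeholder values; replace them with the ones from your environment.
example = crawl_options("http://localhost:9200", "hr_index",
                        path_file="paths.txt", flags=["--dry-run"])
print(example)
```

Running with --dry-run first is one way to see how many items would be indexed before committing to a full crawl.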

Example

The following runscript command is an example of the input required for crawling multiple paths in a file system:

CognitiveToolkit.exe CrawlFileSystem --path-file "<VALUE>" --index-server-url <VALUE> --index-name <VALUE> --add-field-owner --after-date-last-modified now-1d/d
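
When a fixed date is used for --after-date-last-modified instead of a relative one, the value can be checked against the three supported absolute formats before it is passed to the command. This is a rough Python sketch with a hypothetical helper name. It does not validate the relative forms, and Python's parser is lenient about zero padding, so treat it as a sanity check rather than a guarantee that the toolkit will accept the value.

```python
from datetime import datetime

# Python equivalents of yyyy-MM-dd, yyyy-MM-dd HH:mm and yyyy-MM-ddTHH:mm.
SUPPORTED_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M", "%Y-%m-%dT%H:%M")


def is_supported_absolute_date(value):
    """Return True if value parses with one of the supported absolute formats."""
    for fmt in SUPPORTED_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue  # try the next format
    return False


for candidate in ("2018-12-20", "2018-12-20 19:42", "20/12/2018"):
    print(candidate, is_supported_absolute_date(candidate))
```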

Notes

--path-file option

Cognitive Suite 2.6 and later eliminate the need to run the CrawlFileSystem operation more than once: a path input file lets you specify multiple paths to crawl in a single run.

Example of a path input file (.txt file):

\\Server2.companyABC.local\Files\HR\Offer Letters
\\Server2.companyABC.local\Files\HR\Acceptance Letters
\\Server2.companyABC.local\Files\HR\Onboarding
\\Server2.companyABC.local\Files\Finance\Purchase Orders
\\Server2.companyABC.local\Files\Finance\Invoices
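
A path input file like the one above can be generated rather than typed by hand. The following Python sketch writes one path per line, matching the example. The helper name write_path_file, the file name paths.txt, and the UTF-8 encoding are assumptions, since the encoding the toolkit expects is not stated here.

```python
import tempfile
from pathlib import Path

# The paths from the example above; raw strings keep the backslashes intact.
PATHS = [
    r"\\Server2.companyABC.local\Files\HR\Offer Letters",
    r"\\Server2.companyABC.local\Files\HR\Acceptance Letters",
    r"\\Server2.companyABC.local\Files\HR\Onboarding",
    r"\\Server2.companyABC.local\Files\Finance\Purchase Orders",
    r"\\Server2.companyABC.local\Files\Finance\Invoices",
]


def write_path_file(paths, destination):
    """Write one non-empty path per line and return the number of paths written."""
    lines = [p.strip() for p in paths if p.strip()]
    # Encoding is an assumption; adjust it if your environment requires another.
    Path(destination).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Demonstration in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "paths.txt"
    count = write_path_file(PATHS, target)
    written = target.read_text(encoding="utf-8").splitlines()

print(count, "paths written")
```

The resulting file is the value passed to the --path-file option.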