CrawlFileSystem is the base operation for data discovery and is generally performed prior to running any other Cognitive Toolkit operation. It crawls the specified path (or multiple paths) for metadata. The metadata is then stored in an index where it can be further mined for insights.
Cognitive Toolkit is a command-line tool. Every operation follows the same sequence to run:
Open CMD as an Administrator or a Service account that has the appropriate permissions.
Use cd to navigate to where you extracted the shinydocs-extraction-[Version]-[YYYY-MM-DD].zip file (Cognitive Toolkit root folder).
At the CognitiveToolkit.exe prompt, type in the runscript command (CrawlFileSystem), followed by the required parameters.
Running the command
To run the CrawlFileSystem command you must provide, at a minimum, the required parameters identified in the following chart:
Option values are based on your environment setup and location of source files.
Single path to crawl
Text file that contains multiple paths to crawl
*At least one of these two options must be included in the runscript command.
Includes hidden files in the crawl
Includes system files in the crawl
Add the Owner field to the index
Includes reparse items
A file or directory can contain a reparse point, which is a collection of user-defined data. The format of this data is understood by the application which stores the data, and a file system filter, which interprets the data and processes the file. When an application sets a reparse point, it stores this data, plus a reparse tag, which uniquely identifies the data it is storing.
Crawls all files modified after this date* and ignores anything modified before it. This is useful for differentiating between crawls/indices.
Supported date formats: yyyy-MM-dd, yyyy-MM-dd HH:mm, yyyy-MM-ddTHH:mm
Example: 2018-12-20 or 2018-12-20 19:42
Supported relative date formats: now, now+/-ld[/d], now+/-lm[/d], now+/-ly[/d]
URL to index server
If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the URL.
Name of index
Number of parallel processes to start
Number of nodes per request.
Forcefully remove / Suppress prompt for confirmation
When running batch files, this option allows the operation to continue without interruption.
Runs everything but doesn’t send nodes to index
The --dry-run option allows you to quickly see how many items will be indexed without actually creating the index.
Hides the progress bar during the running of the operation
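The supported absolute date formats listed above (yyyy-MM-dd, yyyy-MM-dd HH:mm, yyyy-MM-ddTHH:mm) can be pre-checked before a value is passed on the command line. The following is a minimal Python sketch, not part of Cognitive Toolkit: the function name parse_absolute_date is hypothetical, and Python's own parser may be more lenient than the toolkit's (for example, about zero padding), so treat it as a sanity check only. Relative formats such as now are not handled.

```python
from datetime import datetime

# Python strptime equivalents of the three absolute formats listed in the chart.
SUPPORTED_FORMATS = (
    "%Y-%m-%d",        # yyyy-MM-dd
    "%Y-%m-%d %H:%M",  # yyyy-MM-dd HH:mm
    "%Y-%m-%dT%H:%M",  # yyyy-MM-ddTHH:mm
)


def parse_absolute_date(value):
    """Return a datetime if value matches a supported format, else None.

    Hypothetical helper for pre-checking input; it does not reproduce
    the toolkit's own date parsing.
    """
    for fmt in SUPPORTED_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next format
    return None
```

With this sketch, the two example values from the chart (2018-12-20 and 2018-12-20 19:42) both parse, while a value such as 20/12/2018 returns None.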
The following runscript command is an example of the input required for crawling multiple paths in a file system:
CognitiveToolkit.exe CrawlFileSystem --path-file "<VALUE>" --index-server-url <VALUE> --index-name <VALUE> --add-field-owner --after-date-last-modified now-l/d
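When the crawl is launched from a script rather than typed by hand, the same command can be assembled as an argument list. The sketch below is illustrative only: build_crawl_command is a hypothetical helper, it uses only option names that appear on this page, and it assumes --dry-run is a switch that takes no value (like --add-field-owner in the example above). All values shown in the usage comment are placeholders.

```python
def build_crawl_command(path_file, index_server_url, index_name,
                        add_owner=False, after_date=None, dry_run=False):
    """Assemble a CrawlFileSystem argument list (hypothetical helper)."""
    args = [
        "CognitiveToolkit.exe", "CrawlFileSystem",
        "--path-file", path_file,
        "--index-server-url", index_server_url,
        "--index-name", index_name,
    ]
    if add_owner:
        args.append("--add-field-owner")
    if after_date:
        args += ["--after-date-last-modified", after_date]
    if dry_run:
        # Assumption: --dry-run is a value-less switch.
        args.append("--dry-run")
    return args


# Example usage (placeholder values; run from the Cognitive Toolkit root folder):
#   import subprocess
#   cmd = build_crawl_command(r"C:\crawl\paths.txt", "<index server>", "<index name>",
#                             add_owner=True, dry_run=True)
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids manual quoting when the path file location contains spaces.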
Cognitive Suite 2.6 and later eliminate the need to run the CrawlFileSystem operation more than once: a path input file lets you specify multiple paths to crawl in a single run.
Example of a path input file (.txt file):
\\Server2.companyABC.local\Files\HR\Offer Letters
\\Server2.companyABC.local\Files\HR\Acceptance Letters
\\Server2.companyABC.local\Files\HR\Onboarding
\\Server2.companyABC.local\Files\Finance\Purchase Orders
\\Server2.companyABC.local\Files\Finance\Invoices
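If the list of paths is maintained elsewhere, the path input file can be generated rather than edited by hand. The following Python sketch assumes the file holds one path per line, as in the example above. The names format_path_file and write_path_file are hypothetical, and the UTF-8 encoding is an assumption rather than a documented requirement.

```python
from pathlib import Path


def format_path_file(paths):
    """Return path-file text: one non-empty, trimmed path per line."""
    cleaned = [p.strip() for p in paths if p.strip()]
    return "\n".join(cleaned) + "\n" if cleaned else ""


def write_path_file(paths, destination):
    """Write the formatted paths to destination (encoding is an assumption)."""
    Path(destination).write_text(format_path_file(paths), encoding="utf-8")


# Paths taken from the example above; raw strings keep the backslashes literal.
EXAMPLE_PATHS = [
    r"\\Server2.companyABC.local\Files\HR\Offer Letters",
    r"\\Server2.companyABC.local\Files\HR\Onboarding",
    r"\\Server2.companyABC.local\Files\Finance\Invoices",
]
```

Calling write_path_file(EXAMPLE_PATHS, "paths.txt") would produce a file that can then be supplied through the --path-file option.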