Running CrawlFileSystem
CrawlFileSystem is the base operation for data discovery and is generally performed prior to running any other Cognitive Toolkit operation. It crawls the specified path (or multiple paths) for metadata. The metadata is then stored in an index where it can be further mined for insights.
Prerequisites
Cognitive Toolkit is a command-line tool. Every operation follows the same sequence to run:
1. Open CMD as an Administrator or as a Service account that has the appropriate permissions.
2. Change directory (cd) to navigate to where you extracted the shinydocs-extraction-[Version]-[YYYY-MM-DD].zip file (Cognitive Toolkit root folder).
3. After the CognitiveToolkit.exe prompt, type in the runscript command (CrawlFileSystem), followed by the required parameters.
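The sequence above can also be scripted. The following is a minimal Python sketch and not part of Cognitive Toolkit itself: the function name `run_operation` and its `runner` parameter are hypothetical, and it assumes that CognitiveToolkit.exe sits directly in the root folder and that the script is started by an account with the appropriate permissions.

```python
import subprocess
from pathlib import PureWindowsPath


def run_operation(root_folder, operation, params, runner=subprocess.run):
    """Run one Cognitive Toolkit operation from the toolkit root folder.

    root_folder: folder where the zip file was extracted
                 (assumed to contain CognitiveToolkit.exe).
    operation:   runscript command name, for example "CrawlFileSystem".
    params:      list of option strings that follow the command.
    runner:      hypothetical hook so the process call can be swapped out.
    """
    # Use the full path to the executable instead of relying on a search path.
    exe = str(PureWindowsPath(root_folder) / "CognitiveToolkit.exe")
    argv = [exe, operation, *params]
    # cwd mirrors the "cd" step; check=True raises CalledProcessError
    # if the process exits with a non-zero code.
    runner(argv, cwd=str(root_folder), check=True)
    return argv
```

Passing the arguments as a list avoids manual quoting of paths that contain spaces.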
Running the command
To run the CrawlFileSystem command, you must provide at least the required parameters identified in the following table. Option values depend on your environment setup and the location of your source files.
CrawlFileSystem Options
Option | Details | Required |
---|---|---|
--path <VALUE> OR --path-file <VALUE> | Single path to crawl, OR a text file that contains multiple paths to crawl. *At least one of these two options must be included in the runscript command. | No* |
--include-hidden | Includes hidden files in the crawl | No (Default: false) |
--include-system | Includes system files in the crawl | No (Default: false) |
-a|--add-field-owner | Add the Owner field to the index | No (Default: false) |
--include-reparse | Includes reparse items. A file or directory can contain a reparse point, which is a collection of user-defined data. The format of this data is understood by the application that stores the data and by a file system filter that interprets the data and processes the file. When an application sets a reparse point, it stores this data, plus a reparse tag that uniquely identifies the data it is storing. | No (Default: false) |
--after-date-last-modified | Crawls all files modified after this date* and ignores anything modified before it. This is useful for differentiating between crawls/indices. Supported date formats: yyyy-MM-dd, yyyy-MM-dd HH:mm, yyyy-MM-ddTHH:mm. Example: 2018-12-20 or 2018-12-20 19:42. Supported relative date formats: now, now+/-ld[/d], now+/-lm[/d], now+/-ly[/d] | No (Default: all) |
--index-server-url <VALUE> | URL to the index server. If the Cognitive Toolkit and the index are running on different servers, set the value of the --index-server-url option to the IP address of the index server rather than the URL. | Yes |
-i|--index-name <VALUE> | Name of index | Yes |
-t|--threads <VALUE> | Number of parallel processes to start. For recommendations on setting this value, see https://shinydocs.atlassian.net/wiki/spaces/SHINY/pages/2356117520 | No (Default: 1) |
-n|--nodes-per-request <VALUE> | Number of nodes per request. For recommendations on setting this number value, see Setting the "--nodes-per-request" option | No (Default: 1000) |
--force | Forcefully removes / suppresses the prompt for confirmation. When running batch files, this option allows the operation to continue without interruption. | No (Default: false) |
--dry-run | Runs everything but doesn’t send nodes to the index. The --dry-run option allows you to quickly see how many items will be indexed without actually creating the index. | No (Default: false) |
--silent | Hides the progress bar during the running of the operation | No (Default: false) |
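The required-option rules in the table can be checked before the command is launched. The following Python sketch is illustrative only: the function `build_crawl_args` is hypothetical and covers just a handful of the options listed above. It follows the table's wording that at least one of --path and --path-file must be present, and it assumes that supplying both is acceptable.

```python
def build_crawl_args(index_server_url, index_name, path=None, path_file=None,
                     threads=None, add_field_owner=False, dry_run=False):
    """Assemble the CrawlFileSystem argument list from the options table."""
    # --index-server-url and --index-name are marked "Yes" in the Required column.
    if not index_server_url or not index_name:
        raise ValueError("--index-server-url and --index-name are required")
    # At least one of --path / --path-file must be included.
    if not path and not path_file:
        raise ValueError("provide --path or --path-file")

    args = ["CrawlFileSystem"]
    if path:
        args += ["--path", path]
    if path_file:
        args += ["--path-file", path_file]
    args += ["--index-server-url", index_server_url, "--index-name", index_name]
    # Optional settings are only emitted when they differ from the defaults.
    if threads is not None:
        args += ["--threads", str(threads)]
    if add_field_owner:
        args.append("--add-field-owner")
    if dry_run:
        args.append("--dry-run")
    return args
```

Adding `dry_run=True` to a first run is a cheap way to see how many items would be indexed before committing to a full crawl.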
Example
The following runscript command is an example of the input required for crawling multiple paths in a file system:
CognitiveToolkit.exe CrawlFileSystem --path-file "<VALUE>" --index-server-url <VALUE> --index-name <VALUE> --add-field-owner --after-date-last-modified now-l/d
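The example above passes a relative date. When an absolute date is used instead, it can be checked against the three supported formats from the options table before the crawl is started. This is a hedged Python sketch: the name `matching_date_format` is hypothetical, relative values such as `now` are not handled, and Python's parser may be more lenient than the toolkit's own (it accepts single-digit months and days, for example).

```python
from datetime import datetime

# Formats as written in the options table, mapped to Python strptime patterns.
SUPPORTED_FORMATS = {
    "yyyy-MM-dd": "%Y-%m-%d",
    "yyyy-MM-dd HH:mm": "%Y-%m-%d %H:%M",
    "yyyy-MM-ddTHH:mm": "%Y-%m-%dT%H:%M",
}


def matching_date_format(value):
    """Return the name of the first supported format that parses value, else None."""
    for name, pattern in SUPPORTED_FORMATS.items():
        try:
            datetime.strptime(value, pattern)
            return name
        except ValueError:
            # Not this format; try the next one.
            continue
    return None
```

A return value of `None` means the string matches none of the absolute formats and should be reviewed before it is passed to --after-date-last-modified.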
Notes
--path-file option
Cognitive Suite 2.6 and later eliminates the need to run the CrawlFileSystem operation more than once: a path input file allows you to specify multiple paths to crawl in a single run.
Example of a path input file (.txt file):
\\Server2.companyABC.local\Files\HR\Offer Letters
\\Server2.companyABC.local\Files\HR\Acceptance Letters
\\Server2.companyABC.local\Files\HR\Onboarding
\\Server2.companyABC.local\Files\Finance\Purchase Orders
\\Server2.companyABC.local\Files\Finance\Invoices
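A path input file like the one above can be generated rather than typed by hand. The Python sketch below is an illustration under stated assumptions: the function `write_path_file` is hypothetical, the one-path-per-line layout is taken from the example, and UTF-8 encoding is an assumption because the expected encoding is not stated here. Blank entries and duplicates are dropped as a precaution.

```python
from pathlib import Path


def write_path_file(file_path, crawl_paths):
    """Write one crawl path per line and return the paths actually written."""
    cleaned = []
    for entry in crawl_paths:
        entry = entry.strip()
        # Skip empty entries and exact duplicates, keeping the original order.
        if entry and entry not in cleaned:
            cleaned.append(entry)
    # UTF-8 is an assumption; adjust if the toolkit expects another encoding.
    Path(file_path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```

The resulting file name is then passed as the value of the --path-file option.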