
Running CrawlFileSystem

CrawlFileSystem is the base operation for data discovery and is generally performed prior to running any other Cognitive Toolkit operation. It crawls the specified path (or multiple paths) for metadata. The metadata is then stored in an index where it can be further mined for insights.

Prerequisites

Cognitive Toolkit is a command-line tool. Every operation is run by following the same sequence of steps:

  1. Open CMD as an administrator or as a service account that has the appropriate permissions.

  2. Use cd to change to the directory where you extracted the shinydocs-extraction-[Version]-[YYYY-MM-DD].zip file (the Cognitive Toolkit root folder).

  3. Type CognitiveToolkit.exe, then the runscript command (CrawlFileSystem), followed by the required parameters.

Running the command

To run the CrawlFileSystem command, you must provide at least the required parameters identified in the following table.

Option values depend on your environment setup and the location of your source files.

CrawlFileSystem Options

Option

Details

Required

--path <VALUE>

Single path to crawl

No*

--path-file <VALUE>

Text file that contains multiple paths to crawl

No*

*At least one of these two options must be included in the runscript command. If --path is not used, --path-file must be used. If --path-file is not used, --path must be used.

--include-hidden

Includes hidden files in the crawl

No

(Default: false)

--include-system

Includes system files in the crawl

No

(Default: false)

-a|--add-field-owner

Add the Owner field to the index

No

(Default: false)

--include-reparse

Includes reparse items

A file or directory can contain a reparse point, which is a collection of user-defined data. The format of this data is understood by the application which stores the data, and a file system filter, which interprets the data and processes the file. When an application sets a reparse point, it stores this data, plus a reparse tag, which uniquely identifies the data it is storing.

No

(Default: false)

--after-date-last-modified

Crawls all files modified after this date and ignores anything modified before it. This is useful for differentiating between crawls/indices.

Supported date formats: yyyy-MM-dd, yyyy-MM-dd HH:mm, yyyy-MM-ddTHH:mm

Example: 2018-12-20 or 2018-12-20 19:42

Supported relative date formats: now, now+/-1d[/d], now+/-1m[/d], now+/-1y[/d]

No

(Default: all)

--index-server-url <VALUE>

URL to index server

If the Cognitive Toolkit and Index are running on different servers, the value of the --index-server-url option should be set to the IP address of the index server rather than the URL.

Yes

-i|--index-name <VALUE>

Name of index

Yes

-t|--threads <VALUE>

Number of parallel processes to start

For recommendations on setting this number value, see https://shinydocs.atlassian.net/wiki/spaces/SHINY/pages/2356117520

No

(Default: 1)

-n|--nodes-per-request <VALUE>

Number of nodes per request.

For recommendations on setting this number value, see Setting the "--nodes-per-request" option

No

(Default: 1000)

--force

Forcefully remove / Suppress prompt for confirmation

When running batch files, this option allows the operation to continue without interruption.

No

(Default: false)

--dry-run

Runs everything but doesn’t send nodes to index

The --dry-run option allows you to quickly see how many items will be indexed without actually creating the index.

No

(Default: false)

--silent

Hides the progress bar during the running of the operation

No

(Default: false)
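
The absolute date formats listed for --after-date-last-modified can be checked before a crawl is launched. The following Python sketch is not part of Cognitive Toolkit: the function name parse_absolute_date is hypothetical, and the mapping from the documented patterns to strptime format strings is an assumption. Relative values such as now are not handled.

```python
from datetime import datetime

# Assumed strptime equivalents of the documented patterns:
# yyyy-MM-dd, yyyy-MM-dd HH:mm, yyyy-MM-ddTHH:mm
SUPPORTED_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M", "%Y-%m-%dT%H:%M")


def parse_absolute_date(value):
    """Return a datetime if value matches one of the formats, else None.

    Note: strptime is slightly more lenient than the documented patterns
    (for example, it accepts a month or day without zero padding).
    """
    for fmt in SUPPORTED_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next format
    return None


if __name__ == "__main__":
    for candidate in ("2018-12-20", "2018-12-20 19:42", "20/12/2018"):
        print(candidate, "->", parse_absolute_date(candidate))
```

A value that returns None here is worth correcting before it is passed to the command.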

Example

The following runscript command is an example of the input required for crawling multiple paths in a file system:

CODE
CognitiveToolkit.exe CrawlFileSystem --path-file "<VALUE>" --index-server-url <VALUE> --index-name <VALUE> --add-field-owner --after-date-last-modified now-1d/d
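
When the command is launched from a script rather than typed by hand, the argument list can be assembled programmatically. The Python sketch below is an illustration only, not part of Cognitive Toolkit: build_crawl_command is a hypothetical helper, it covers only a subset of the options in the table, and it enforces the rule that at least one of --path and --path-file must be present.

```python
def build_crawl_command(index_server_url, index_name, path=None, path_file=None,
                        threads=None, dry_run=False, force=False):
    """Assemble a CrawlFileSystem argument list (hypothetical helper)."""
    # At least one of --path / --path-file is required by the operation.
    if not path and not path_file:
        raise ValueError("Provide --path or --path-file")

    cmd = ["CognitiveToolkit.exe", "CrawlFileSystem"]
    if path:
        cmd += ["--path", path]
    if path_file:
        cmd += ["--path-file", path_file]

    # Both of these options are marked as required in the options table.
    cmd += ["--index-server-url", index_server_url, "--index-name", index_name]

    if threads is not None:
        cmd += ["--threads", str(threads)]
    if dry_run:
        cmd.append("--dry-run")  # count items without sending nodes to the index
    if force:
        cmd.append("--force")    # suppress the confirmation prompt in batch runs
    return cmd


if __name__ == "__main__":
    # The URL, index name, and file name are placeholder values.
    print(build_crawl_command("http://localhost:9200", "demo",
                              path_file="paths.txt", dry_run=True))
```

The returned list can be passed to subprocess.run from the Cognitive Toolkit root folder. Passing arguments as a list avoids manual quoting of paths that contain spaces.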

 

Notes

--path-file option

Cognitive Suite 2.6 and later lets you crawl multiple paths in a single run of the CrawlFileSystem operation. List the paths to crawl in a path input file, one path per line, and pass that file with the --path-file option.

Example of a path input file (.txt file):

CODE
\\Server2.companyABC.local\Files\HR\Offer Letters
\\Server2.companyABC.local\Files\HR\Acceptance Letters
\\Server2.companyABC.local\Files\HR\Onboarding
\\Server2.companyABC.local\Files\Finance\Purchase Orders
\\Server2.companyABC.local\Files\Finance\Invoices
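
A path input file like the one above can also be generated from a list of paths. The following Python sketch assumes one path per line, as in the example, and writes the file as UTF-8; the encoding that the toolkit expects is not stated here, so treat that choice as an assumption. The helper name write_path_file is hypothetical.

```python
from pathlib import Path


def write_path_file(paths, destination):
    """Write one path per line to destination and return the number written."""
    # Drop surrounding whitespace and skip empty entries.
    lines = [p.strip() for p in paths if p.strip()]
    # UTF-8 is an assumption, not a documented requirement.
    Path(destination).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    count = write_path_file(
        [
            r"\\Server2.companyABC.local\Files\HR\Onboarding",
            r"\\Server2.companyABC.local\Files\Finance\Invoices",
        ],
        "paths.txt",
    )
    print(f"Wrote {count} paths to paths.txt")
```

The resulting file name is then supplied as the value of --path-file.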

 
