Scheduling Maintenance Crawls
Maintenance Crawls
Content is constantly being created, edited, disposed of, and moved. To search this content accurately, the Analytics Engine needs to record these changes regularly.
We recommend scheduling routine maintenance crawls, depending on your needs, to keep the Analytics Engine index up to date with all recent changes.
How does it work?
Cognitive Toolkit provides several types of crawls, each suited to a different operation.
A simple metadata crawl inspects all your data. Several other crawl types, but not all, can be configured with date math so that they collect only the most recent changes to the repository rather than re-crawling everything.
Metadata crawls cannot be configured using date math.
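To make the idea of date math concrete, the following minimal Python sketch resolves an expression of the form now-2d/d into a timestamp. It assumes the Elasticsearch-style convention, in which "now-2d" subtracts two days and "/d" rounds down to the start of that day when the value is used as a lower bound. The function name and the sample date are illustrative only and are not part of Cognitive Toolkit.

```python
from datetime import datetime, timedelta, timezone


def resolve_days_ago_rounded(days: int, now: datetime) -> datetime:
    """Resolve an expression shaped like now-<days>d/d (hypothetical helper)."""
    # "now-<days>d": step back the requested number of days.
    shifted = now - timedelta(days=days)
    # "/d": round down to midnight at the start of that day.
    return shifted.replace(hour=0, minute=0, second=0, microsecond=0)


# Sample value chosen for illustration: 15 May 2024, 14:30 UTC.
now = datetime(2024, 5, 15, 14, 30, tzinfo=timezone.utc)
print(resolve_days_ago_rounded(2, now).isoformat())
# prints 2024-05-13T00:00:00+00:00
```

A crawl limited this way would pick up items changed on or after 13 May, rather than the whole repository.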
Maintenance crawls are run using the Windows Task Scheduler. Depending on your content strategy, you may have multiple ongoing maintenance crawls scheduled. These schedules are saved as batch files or PowerShell files.
Things to Consider
When building a strategy for maintenance crawls, it’s important to take the following into account:
Impact of crawls on network traffic
Number of operations that need to be completed (e.g., when crawling new content, does the content strategy also require metadata, text extraction, and content enrichment steps?)
How often do you need the index updated?
How long does each operation take?
Several of these factors can only be determined once the initial project is underway.
Scheduling Best Practices
Step by Step
Create a folder within Cognitive Toolkit for scheduled tasks. All scheduled task queries should be saved to this folder.
Schedule a batch or PowerShell file. In most cases, the Start in (optional): location should be the parent folder in which the batch or PowerShell file resides.
In the Security options section, ensure the scheduled task is set to run using the appropriate service account for the actions being performed.
Select Run whether user is logged on or not.
As a general rule, we recommend running only one process at a time on the same items in your index. Each index item can only be updated by one process at a time. A Version Conflict error can result if two processes try to update the same item simultaneously. For example, running AddHashAndExtractedText and CrawlFileSystem simultaneously will result in a Version Conflict error.
The order of operations in the batch or PowerShell script should be the order that makes the most sense for the data. A metadata crawl is typically the first step in most operations.
Plan your scheduled task from start to finish.
To prevent the script from running everything at once, use the appropriate wait parameters when calling the CognitiveToolkit executable from your batch or PowerShell script.
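The sequencing advice above does not depend on the scripting language. As an illustration only, this Python sketch runs a list of steps one at a time and stops at the first failure, so that no two steps update the index at once. The step names and commands are placeholders that launch the Python interpreter; they are not real CognitiveToolkit invocations, and an actual scheduled task would substitute its own batch or PowerShell commands.

```python
import subprocess
import sys


def run_in_order(steps):
    """Run each (name, command) step to completion before starting the next.

    Returns the names of the steps that finished successfully.
    """
    completed = []
    for name, command in steps:
        # subprocess.run blocks until the process exits, so steps never overlap.
        result = subprocess.run(command)
        if result.returncode != 0:
            # Stop here: later steps usually depend on earlier ones.
            print(f"{name} failed with exit code {result.returncode}; stopping")
            break
        completed.append(name)
    return completed


# Placeholder steps (assumption): each just prints a message and exits.
steps = [
    ("metadata crawl", [sys.executable, "-c", "print('metadata crawl done')"]),
    ("text extraction", [sys.executable, "-c", "print('text extraction done')"]),
]
print(run_in_order(steps))
```

Stopping on the first non-zero exit code is a design choice for this sketch: it avoids running text extraction or enrichment against an index that the preceding crawl did not finish updating.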
Validate scheduled task queries by running them as a search in Visualizer (either by filtering or in Dev Tools). Ensure your crawls are capturing the desired data.
For query-based tools, use a date range query to process only data that has been modified in the last n days (e.g., now-2d/d).
Instead of using must in your query, consider using filter. Filter clauses are not scored, so they generally perform better.
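The two query recommendations above can be combined. The sketch below builds such a query body in Python and prints it as JSON, assuming the index accepts Elasticsearch-style query DSL (which the must and filter terminology suggests). The field name "modified" and the function name are hypothetical; substitute the date field your index actually uses.

```python
import json


def recent_changes_query(date_field: str, days: int) -> dict:
    """Build a query that matches items changed in the last <days> days."""
    return {
        "query": {
            "bool": {
                # "filter" instead of "must": the clause only includes or
                # excludes items and does not contribute to scoring.
                "filter": [
                    {"range": {date_field: {"gte": f"now-{days}d/d"}}}
                ]
            }
        }
    }


# "modified" is an assumed field name used for illustration.
print(json.dumps(recent_changes_query("modified", 2), indent=2))
```

The printed JSON can be pasted into Dev Tools in Visualizer to confirm it returns the expected items before the query is used in a scheduled task.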
Use the silent flag in commands where possible. This feature turns off the progress bar and may result in a slight performance improvement.
It may help to use a separate copy of Cognitive Toolkit for each repository. This keeps the logs separate and easier to understand.