Certain processes within the Cognitive Toolkit accept a --nodes-per-request option. To set it well, it helps to understand what the option does and why it exists.
The --nodes-per-request option specifies the number of nodes that should be generated before data is sent back to the index, for example: --nodes-per-request 1
A node is a basic unit of data: it represents the data gathered for one item being processed. Nodes-per-request is therefore the number of items held in memory before the index is updated, and the --nodes-per-request option can be used to throttle the amount of data sent to the index in each update.
When deciding how to set the --nodes-per-request option, consider how much processing power your indexing cluster has, as well as how often you want data sent back to the index. The higher the nodes-per-request, the larger each chunk of data sent to the index. The lower the nodes-per-request, the more updates are sent to the index, which increases network overhead.
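The buffering behaviour described above can be sketched in a few lines of Python. This is only an illustration of the general batching idea, not the Cognitive Toolkit's actual implementation: the function name `process_in_batches`, the `send_to_index` callback, and the shape of each node are all hypothetical.

```python
from typing import Callable, Iterable, List


def process_in_batches(
    items: Iterable[str],
    nodes_per_request: int,
    send_to_index: Callable[[List[dict]], None],
) -> int:
    """Buffer one node per item and flush every `nodes_per_request` nodes.

    Returns the number of update requests sent.
    """
    if nodes_per_request < 1:
        raise ValueError("nodes_per_request must be at least 1")

    buffer: List[dict] = []
    requests_sent = 0
    for item in items:
        # Hypothetical per-item work; stands in for hashing, text extraction, etc.
        buffer.append({"item": item, "processed": True})
        if len(buffer) == nodes_per_request:
            send_to_index(buffer)
            requests_sent += 1
            buffer = []
    if buffer:
        # Flush the final, partially filled batch.
        send_to_index(buffer)
        requests_sent += 1
    return requests_sent
```

In this sketch a larger `nodes_per_request` means fewer, larger calls to `send_to_index`, while a value of 1 sends one update per item, which mirrors the trade-off described above.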
If, for example, AddHashAndExtractedText is performed on an index that contains 10,000,000 items with the --nodes-per-request option set to 2,000 (e.g. --nodes-per-request 2000), the Cognitive Toolkit will complete the process for 2,000 items and then update the index with all the data it has gathered for those 2,000 items. The process is then repeated for the next 2,000 items, and so on, until the entire index has completed the AddHashAndExtractedText process. A total of 5,000 real-time updates (10,000,000 ÷ 2,000) would be performed throughout the entire process.
Number of index items: 10,000,000
Number of real-time updates: 5,000
Note: A setting of 2,000 nodes-per-request is the maximum recommended by Shinydocs.
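The number of real-time updates for a given index size can be estimated with a ceiling division. The helper below is a minimal sketch; the function name `update_count` is hypothetical and is not part of the Cognitive Toolkit. It assumes that a final, partially filled batch still results in one update.

```python
def update_count(index_items: int, nodes_per_request: int) -> int:
    """Estimate how many index updates a run performs."""
    if index_items < 0:
        raise ValueError("index_items must not be negative")
    if nodes_per_request < 1:
        raise ValueError("nodes_per_request must be at least 1")
    # Ceiling division using integers only, so large counts stay exact.
    return (index_items + nodes_per_request - 1) // nodes_per_request


if __name__ == "__main__":
    print(update_count(10_000_000, 2000))  # prints 5000
    print(update_count(10_000_000, 1))     # prints 10000000
```

With --nodes-per-request 1, the same 10,000,000-item index would produce one update per item, which is why low values are best reserved for operations where per-item updates matter.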
When to set a high number of nodes-per-request
For operations where real-time updates are not critically important, such as AddPathValidation, it makes sense to set the nodes-per-request to a higher number. This will decrease the number of updates sent to the index.
When to set a low number of nodes-per-request
For operations where real-time updates are critically important as an audit measure, such as Migration, it makes sense to set the nodes-per-request to 1. This will increase the number of real-time updates and deliver critical information about each migrated item.