
Cognitive Suite Requirements

This guide covers the hardware and software requirements for Shinydocs Cognitive Suite, which includes the following components: Cognitive Toolkit, Analytics Engine, and Visualizer.

Your solution from Shinydocs may include other products, such as Enterprise Search. Please review the additional requirements for those products separately.

Overview

The requirements for the Analytics Engine depend on the volume of data to be analyzed; depending on that volume, you may require multiple machines. This guide explains how many machines are required or recommended and the specifications each machine needs to meet.

The Analytics Engine, Coordinator nodes, and Cognitive Toolkit hardware requirements recommendations are based on years of experience crawling and analyzing data for organizations around the world.

The technology used by these products scales horizontally: if additional resources are required, you add more identically configured machines.

Power in Numbers

The Shinydocs Cognitive Toolkit, Analytics Engine, and Visualizer power most of the solutions offered by Shinydocs. These components are split into 4 parts (5 if a Coordinator node is used):

  • Cognitive Toolkit (backend, worker)

  • Shinydocs Extraction Service (backend)

  • Analytics Engine (backend)

  • Coordinator node (backend)

  • Visualizer Interface (frontend)

What is the Shinydocs Cognitive Toolkit?

The Cognitive Toolkit contains the tools needed to deliver the solutions provided by Shinydocs. You can think of these machines as workers. The workers reach out to your data to gather information (metadata, identifiers, text, etc.) from the source, and they can also perform work on the index (enrichment, finding duplicates, etc.). The more workers you have, the more they can accomplish by combining their efforts. You need at least 1 worker to get started; as your performance needs grow, you can add more machines with the Cognitive Toolkit installed, and they can be removed should your needs change. Processes like text extraction and OCR are quite CPU-intensive, so the more resources you give these machines, the faster they can process information.

What is the Shinydocs Extraction Service?

This service is installed with the Cognitive Toolkit. The Cognitive Toolkit now uses this service to extract text from documents, and it also provides the engine for extracting entities from your documents should you wish to do so.

What is the Analytics Engine?

The Analytics Engine (powered by Elasticsearch®) nodes are the core of the data solutions provided by Shinydocs: the brain. These machines house and serve all the data you collect from your systems in an index or indices. Using clustering technology, the nodes work together to distribute the load of all the information gathered about your data. These machines do not store your files; your files can stay right where they are. The Analytics Engine machines store the metadata, text, and enrichments produced by the Cognitive Toolkit and provide a visual interface for exploring and visualizing your data. They scale very well and need to be scaled according to your volume of data. Like a brain, these machines need to be fast to receive and serve data to the Cognitive Toolkit, Discovery Search, and the visual frontend; the faster the brain, the faster everything connected to it can function.

What is a Coordinator Node?

When there is a large volume of data, a Coordinator node assists the environment in much the same way a network load balancer would. These specialized nodes act as the centralized endpoint for the visual interface and the Cognitive Toolkit machine(s), simplifying configuration and ensuring the cluster of Analytics Engine nodes works efficiently.

What is the Visualizer?

The Visualizer is a web application that allows you to discover and visualize your data. Using a web browser (Google Chrome™ and Microsoft Edge® are supported), operators can access all crawled information, enrichments, and visualizations. Like other popular Business Intelligence and Analytics tools, the Shinydocs Visualizer can help you create beautiful graphs, charts, and tables, and operators can search across your data set to discover what your data is telling you. Access to this tool can be controlled via a reverse proxy; note, however, that the contents of the web UI are not individually permissioned by user or group.

This web application is installed on one of the Analytics Engine nodes or, if your data volume requires a Coordinator node, on the Coordinator node.

Getting the most out of this guide

Ready to get started? Here is a step-by-step overview of the key information and where to find it in this guide, so that you can successfully implement the Analytics Engine, Visualizer, and Cognitive Toolkit.

  1. Check your organization's expected data volume

  2. Match your expected data volume in the Table: Data Volume to Nodes provided

  3. Using the table, you will now know how many Analytics nodes you will need

    1. If your corresponding data volume suggests a Coordinator node, that is in addition to the Analytics nodes

  4. Use the Analytics Engine & Coordinator Node Hardware Requirements table to see the hardware requirements and software prerequisites per node

    1. Storage requirements are split across the index cluster; see Storage Sizing for details

    2. Check the Shards section to match your per index data volume with the recommended shards

  5. Determine how many Cognitive Toolkit machines you want to use. These are the workers; the more you have, the greater the throughput you will achieve. We provide recommendations based on your data volume

    1. Requirements for Cognitive Toolkit Machines can be found in the Cognitive Toolkit Machine Requirements section 

    2. Software prerequisites for Cognitive Toolkit can be found in the software prerequisites section

  6. Review the Anti-Virus and Security, Repository-Specific Requirements, Access and Security, and Best Practices sections

  7. Congratulations, you’re ready to install!

Analytics Engine & Coordinator Node Requirements

How many do I need?

The number of Analytics Engines you require is based on the volume of data that the tools will crawl and analyze.

Note: These recommendations are for production environments and may differ from specifications for small test environments.

Table: Data Volume to Nodes

Assumptions:

  1. ~1,000,000 files per terabyte of data volume

  2. ~70% of that data is text extractable 

  3. Data volume is per index. If data is being split into separate indices (e.g., an index per department), use that department's data volume for this calculation

Your Data Volume    Analytics Engine nodes    Coordinator nodes (production)    Total nodes
1 – 40 TB           1                         NA                                1
50 – 100 TB         3                         NA                                3
200 – 400 TB        4                         NA                                4
500 – 700 TB        6                         1                                 7
800 TB – 1 PB       9                         1                                 10

Note: The use of 2 Analytics nodes is not recommended; you will need 3 or more nodes to avoid index data loss in the event of a system failure.

SharePoint Scans

There is a limit of 1000 SharePoint subsites per analytics node. If there are more than 1000 subsites in SharePoint, an additional analytics node is required, regardless of data volume.

Equation: [Number of SharePoint subsites] ÷ 1,000 = Number of Analytics nodes (rounded up)

For example: If ACME Corp has 2,500 subsites in SharePoint with 20 TB of data, 3 Analytics nodes are required (2,500 ÷ 1,000 = 2.5, rounded up to 3).
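For illustration, here is a minimal sketch of this calculation in Python. The function name is ours, and the assumption that you take the larger of the subsite-based and volume-based node counts is our reading of the rule above:

```python
import math

def analytics_nodes_for_sharepoint(subsite_count: int, volume_based_nodes: int) -> int:
    """Nodes needed for a SharePoint crawl: at least one node per 1,000 subsites,
    and never fewer than the count suggested by data volume (Table: Data Volume to Nodes)."""
    subsite_based = math.ceil(subsite_count / 1000)
    return max(subsite_based, volume_based_nodes)

# ACME Corp example from above: 2,500 subsites, 20 TB of data (1 node by volume)
print(analytics_nodes_for_sharepoint(2500, 1))  # -> 3
```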

Why does this component need these resources?

The Analytics Engine is unlike most enterprise software. It needs to house all the metadata and text of your files, and it also:

  • Stores the extracted text of files

  • Computes very large workloads when visualizing data

    • Examples: calculating the sum of millions of files' sizes, searching all ingested data for keywords, calculating counts of all extensions, tokenizing words, etc.

  • Serves data to the Cognitive Toolkit for processing

  • Needs to be fast; otherwise, it would take years to go through the volumes of data enterprises keep

These requirements and recommendations are designed to give you and your organization a performant solution that can meet your business needs. 

Hardware Requirements

Dedicated machine(s) are required. The Analytics Engine and Coordinator nodes require dedicated hardware resources, as outlined in the following sections.

Table: Analytics Engine & Coordinator Node Hardware Requirements

Component      Requirement for production environments (per node)
CPU (cores)    12+
Memory         64 GB
Storage        High-performance NVMe SSD* (see Storage Sizing)
Network        10 Gbps interface

Shinydocs cannot guarantee specific throughput or performance due to variations in each environment. If your organization's hardware does not meet these requirements, you may experience slower performance and throughput, with a higher error rate.

CPU
  • Modern x86 64-bit enterprise-grade processor with 12 or more cores (Intel recommended)

    • More cores can be added to achieve higher performance

    • Due to the varying hardware and needs from organization to organization, you may need more cores than noted above to reach your desired performance

  • Dedicated cores preferred to vCores

  • Minimum 1.6 GHz per core

  • No VM execution cap

Memory
  • Maximum supported memory per node is 64 GB

    • If there is still insufficient memory in the cluster, additional nodes of the same hardware configuration should be added to the cluster

  • Enterprise-grade with Error-Correcting Code (ECC)

  • Dedicated memory is preferred

    • The Analytics Engine has a large memory footprint that is locked upon the service starting

Storage
  • Search speed and responsiveness are heavily dependent on your drives' I/O performance, latency, and throughput

  • Local storage required

    • Shared, network or remote-cloud storage is not supported due to performance

    • See Storage Sizing for details

  • SSD (NVMe preferred)

  • Random Read: >1,000 MB/s @ 100% Utilization

  • Sequential Read: >2,000 MB/s @ 100% Utilization

  • Sequential Write: >2,000 MB/s @ 100% Utilization

  • Maximum Latency: <3.00 ms

Note: Above must be supported on the backplane of allocated servers.

Dedicated disk performance per machine must be at or above recommendation.

Disk performance must be maintained if other VMs use the same disk(s).

Network
  • Local switch

  • Server grade hardware with:

    • No routing/load balancing between devices

    • 10 Gigabit Network Interface on all nodes

    • 10 Gigabit Line between nodes

Software Prerequisites 

Storage Sizing

When sizing storage for Analytics Engine nodes, you need to decide whether text extraction is part of your use case. Text extraction greatly increases the storage requirement because the index needs to store that text. Storage sizing correlates to the number of files in your data volume; since it is difficult for most organizations to determine that number before using the Cognitive Suite, storage sizing is estimated from the volume of data instead. Due to the unpredictable nature of enterprise data, once your data has been crawled and text extracted, there may be an opportunity to shrink unused storage, or you may need to expand it. It is easier and more efficient to start large and scale back as needed.

Note: With a cluster of Analytics Engines, the storage requirement is divided across the nodes.

[Storage Requirement] ÷ [Number of nodes] = Free disk space required per node.

Once a node reaches 90% capacity, the node's index data will become read-only to prevent the drive from filling completely. Any storage that reaches 85% capacity should be considered "low", and additional space should be allocated to avoid putting the cluster into read-only mode.

Coordinator nodes

  • 30 GB minimum on non-OS drive

Analytics Engine nodes

  • Without extracted text (metadata, incl. ROT, hash values, additional metadata): 1% of your data volume on the non-OS drive

  • With extracted text (metadata, incl. ROT, hash values, additional metadata): 10% of your data volume on the non-OS drive
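As a quick illustration of the sizing rules above (1% without extracted text, 10% with extracted text, split evenly across the cluster), here is a minimal sketch; the helper name is ours, and the 1,024 GB/TB conversion matches the worked examples below:

```python
def free_space_per_node_gb(data_volume_tb: float, nodes: int, with_text: bool) -> float:
    """Estimate free disk space needed per Analytics Engine node (non-OS drive).

    Rule of thumb from this guide: index storage is ~1% of the data volume without
    extracted text, ~10% with extracted text, divided evenly across the nodes.
    """
    ratio = 0.10 if with_text else 0.01
    total_gb = data_volume_tb * 1024 * ratio   # combined storage requirement
    return total_gb / nodes                    # split across the cluster

# 100 TB of data, text extraction enabled, 3-node cluster (compare the example table below)
print(round(free_space_per_node_gb(100, 3, with_text=True) / 1024, 2), "TB per node")  # ~3.33 TB
```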

Table: Data Volume to Required Storage Examples

Figures are index storage for metadata (incl. ROT, hash values, additional metadata), plus extracted text where applicable.

1 TB (approximately 1,000,000 files)

  • Without extracted text: 10 GB free space (combined), distributed across 1 Analytics Engine node
    • Node 1: 10 GB

  • With extracted text: 100 GB free space (combined), distributed across 1 Analytics Engine node
    • Node 1: 100 GB

100 TB (approximately 100,000,000 files)

  • Without extracted text: 1 TB free space (combined), distributed across 3 Analytics Engine nodes
    • Nodes 1–3: 341.3 GB each

  • With extracted text: 10 TB free space (combined), distributed across 3 Analytics Engine nodes
    • Nodes 1–3: 3.33 TB each

1 PB (approximately 1,000,000,000 files)

  • Without extracted text: 10 TB free space (combined), distributed across 9 Analytics Engine nodes
    • Nodes 1–9: 1.11 TB each

  • With extracted text: 100 TB free space (combined), distributed across 9 Analytics Engine nodes
    • Nodes 1–9: 11.11 TB each

Shards

Shards are pieces or fragments of the data stored by an index, on an index-by-index basis (primary shards). These shards are also replicated (replica shards) for clusters that have 3 or more nodes. When crawling your data with the Cognitive Toolkit, by default it creates your index with 5 primary shards and 1 replica per primary; combined, that equates to 10 shards. A minimal example of these settings is sketched after the layout table below.

Primary Shards    Replica Shards (per primary)    Shard Layout          Total Shards
1                 1                               P1 / R1               2
5                 1                               P1–P5 / R1–R5         10
10                1                               P1–P10 / R1–R10       20
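The Cognitive Toolkit creates and manages these indices for you, so you do not normally set shard counts by hand. Purely to illustrate what the defaults above correspond to in Elasticsearch terms, here is a sketch that assumes direct access to the Analytics Engine REST API (default port 9200) and uses a hypothetical index name:

```python
# Illustration only: the Cognitive Toolkit normally creates the index for you.
# This shows the Elasticsearch settings behind "5 primary shards, 1 replica per primary".
import json
import urllib.request

settings = {
    "settings": {
        "number_of_shards": 5,      # primary shards
        "number_of_replicas": 1,    # one replica per primary -> 10 shards total
    }
}

req = urllib.request.Request(
    "http://localhost:9200/shinydocs-example",   # hypothetical index name
    data=json.dumps(settings).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```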

Table: Data Volume to Shards

Assumptions:

  1. ~1,000,000 files per terabyte of data volume

  2. ~70% of that data is text extractable 

  3. Data volume is per index. If data is being split into separate indices (e.g., an index per department), use that department's data volume for this calculation

Index Data Volume    Without Text (primary / replica)    With Text (primary / replica)
Up to 100 TB         5 / 1                               5 / 1
200 TB               5 / 1                               10 / 1
300 TB               5 / 1                               15 / 1
400 TB               5 / 1                               20 / 1
500 TB               5 / 1                               25 / 1
600 TB               6 / 1                               30 / 1
700 TB               7 / 1                               35 / 1
800 TB               8 / 1                               40 / 1
900 TB               9 / 1                               45 / 1
1 PB                 10 / 1                              50 / 1

Note: There is a 1,000-shard limit per cluster.

Cognitive Toolkit Machine Requirements

Since the Cognitive Toolkit is the worker for your data, machines running it should be placed as close to the data as possible. For processes like hashing and text extraction, the toolkit needs to load each file to extract its contents; the closer the toolkit is to the data, the quicker it can load those files. This is more performant than loading files over a great physical distance.

[Figure: Cognitive Toolkit placement]

Hardware Prerequisites

The Shinydocs Cognitive Toolkit is your data worker; these machines are separate from the Analytics Engine nodes and have different requirements. In most cases, more Cognitive Toolkit machines will improve throughput. The Cognitive Toolkit requires dedicated hardware resources, as outlined in the following sections.

Table: Cognitive Toolkit Machine Hardware Requirements

Component      Recommendation for production environments (per machine)
CPU (cores)    8+
Memory         32 GB+
Storage        20 GB – SSD (non-OS) or similarly performing drive array
Network        1+ Gbps interface

Shinydocs cannot guarantee specific throughput or performance due to variations in each environment. If your organization's hardware does not meet these requirements, you may experience slower performance and throughput, with a higher error rate.

CPU
  • Modern x86 64-bit enterprise-grade processor with 8 or more cores (Intel recommended)

    • More cores can be added to achieve higher throughput in text extraction, hashing, and OCR

    • Due to the varying hardware and needs from organization to organization, you may need more cores than noted above to reach your desired performance

  • Dedicated cores preferred to vCores

  • Minimum 2.0 GHz per core

  • No VM execution cap

Memory
  • Additional memory may be required to achieve your desired performance

  • Dedicated memory is preferred

    • The Analytics Engine has a large memory footprint that is locked upon the service starting

Storage
  • Local storage required

    • Shared or network storage is not supported due to performance

  • SSD preferred, but not required

Network
  • Server-grade hardware with:

    • 1 Gigabit per second Network Interface preferred

Software Prerequisites 

Recommended Cognitive Toolkit Machines

It is up to you and your organization how many Cognitive Toolkit machines to deploy. One of the great features of Shinydocs Cognitive Suite is how simple it is to scale up: you can start with 1 machine and grow to 25, depending on your needs. For example, with Shinydocs Discovery Search, having up-to-date index data is important for the search experience, and you can add and remove Cognitive Toolkit machines to meet your timing needs. The following are our recommendations based on lab and customer testing.

Note: Some sources may not benefit from additional Cognitive Toolkit machines. These sources (like SharePoint Online) have throttle limits; additional Cognitive Toolkit machines cannot bypass those limits.

Table: Recommended Cognitive Toolkit Machines Based on Data Volume

Assumptions:

  1. ~1,000,000 files per terabyte of data volume

  2. ~70% of that data is text extractable 

  3. Data volume is per index. If data is being split into separate indices (e.g., an index per department), use that department's data volume for this calculation

Your Data Volume    Recommended machines (with text extraction)    Recommended machines (without text extraction)
1 – 40 TB           1                                              1
50 – 100 TB         2                                              1
200 – 400 TB        4 – 8                                          2 – 4
500 – 700 TB        10 – 14                                        5 – 7
800 TB – 1 PB       16 – 20                                        8 – 10

With Text Extraction: Approximately 1 Cognitive Toolkit Machine per 50 TB of data volume

Without Text Extraction: Approximately 1 Cognitive Toolkit Machine per 100 TB of data volume
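As a minimal sketch of these rules of thumb (the function name is ours; the ranges in the table above remain the authoritative recommendation):

```python
import math

def recommended_toolkit_machines(data_volume_tb: float, with_text: bool) -> int:
    """Rule of thumb from this guide: ~1 Cognitive Toolkit machine per 50 TB of data
    with text extraction, ~1 per 100 TB without (always at least 1)."""
    tb_per_machine = 50 if with_text else 100
    return max(1, math.ceil(data_volume_tb / tb_per_machine))

print(recommended_toolkit_machines(300, with_text=True))   # -> 6 (within the 4 - 8 range above)
print(recommended_toolkit_machines(300, with_text=False))  # -> 3 (within the 2 - 4 range above)
```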

Why does this component need these resources?

The Shinydocs Cognitive Toolkit and Extraction Service, as mentioned, are the workers for your data. Hashing, text extraction, OCR, entity extraction, and more are very compute-intensive processes. They require high-performance hardware not only to process your data but to do it as quickly as possible. You can easily benchmark the performance of different CPU configurations in your environment if you want. These requirements and recommendations are designed to give you and your organization a performant solution that can meet your business needs.

Example Configurations

Here are some examples to aid in the understanding of how these components work together. As noted, the number of Cognitive Toolkit machines is flexible based on how fast you want to process data. These examples use the recommendations in this guide.

1 TB Data Volume

[Figure: Example deployment, 1 TB]

100 TB Data Volume

[Figure: Example deployment, 100 TB]

500 TB Data Volume

[Figure: Example deployment, 500 TB]

Antivirus and Security

For the performance and stability of Shinydocs Cognitive Suite, ensure you have set up the proper exclusions in your security/antivirus solutions. While these items should be scanned before they are installed, we do not recommend operating without exclusions. These exclusions apply to the server(s) running the Cognitive Suite solution (Analytics Engine nodes and Cognitive Toolkit machines).

Note: The Cognitive Toolkit and the indices it ingests data into do not store your files, only the metadata. The index should be free from AV scans (on-demand, real time, scheduled) when in operation.

Anti-Virus Directory Exclusion List

Note: For simplicity, it may be easier for your security team to exclude one path: add an exclusion for [drive]:\shinydocs and place all products in that directory. Your environment may differ slightly; the following list assumes you have installed the solution on the D:\ drive. Please adjust based on your actual installation.

  • D:\Shinydocs\indexer

  • D:\Shinydocs\visualizer

  • D:\Shinydocs\cognitive-toolkit-x.x.x.x

Note: This is the directory that would contain the Cognitive Toolkit executable and dependencies for the executable

Other Considerations

There will be a significant negative performance impact on crawling, extracting hash values, and text extraction for files on a source (ex. File share, Content Server) if the source is actively scanned by Anti-Virus solutions. Crawl speeds will be limited by the throughput of the Anti-Virus processes.

  • Disable any scanning of local network traffic on the servers the Cognitive Toolkit solution is installed on

  • Disable the scanning of network traffic between server nodes for the Analytics Engines. This traffic operates on ports 9200 and 9300 unless configured otherwise (a quick port-reachability check is sketched below)
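If you need to confirm that this node-to-node traffic is not being blocked, here is a minimal reachability sketch; the host names are placeholders and the ports are the defaults mentioned above:

```python
# Quick reachability check for the Analytics Engine ports (9200 REST API, 9300 node-to-node).
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host in ["analytics-node-1", "analytics-node-2"]:   # hypothetical node names
    for port in (9200, 9300):
        print(host, port, "open" if port_open(host, port) else "blocked")
```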

Repository-Specific Requirements

Since the Cognitive Toolkit connects to many popular repositories, it is important to review these requirements to minimize access-related/firewall issues that can surface.

Microsoft SharePoint® (Online Office 365)

Authentication to Microsoft SharePoint® Online (Office 365) requires Azure application registration

  • 📚 See Azure Authentication for Microsoft SharePoint.

  • It is required that the servers running the Cognitive Toolkit can access the site root and all sub-sites.

    • An easy way to check whether this connection can be made is via a web browser, though more advanced methods do exist (example: https://ACMECorp.sharepoint.com/site); a scripted check is sketched at the end of this subsection

    • Either an internet connection is required or a network proxy must be set to allow the Cognitive Toolkit machine to connect to:

      • https://<site_name>.sharepoint.com

      • https://accounts.accesscontrol.windows.net

  • Azure app registration permissions for Sharepoint® Online (Office 365)

    • Crawl: Sites.Read.All (provides read access to SharePoint® site collections)

    • Migration (destination): Sites.ReadWrite.All (allows the crawling account write access to sites/libraries)
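Purely as an illustration of the "more advanced methods" mentioned above, here is a minimal reachability sketch (the helper name is ours, and the URL is the placeholder from the example); an HTTP 401 or 403 response still proves the endpoint is reachable even though the request is unauthenticated:

```python
# Minimal reachability check for a repository endpoint (illustrative only).
import urllib.error
import urllib.request

def endpoint_reachable(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(url, "->", resp.status)
            return True
    except urllib.error.HTTPError as e:    # server answered (e.g. 401 Unauthorized): reachable
        print(url, "->", e.code)
        return True
    except urllib.error.URLError as e:     # DNS, proxy, or firewall problem: not reachable
        print(url, "->", e.reason)
        return False

endpoint_reachable("https://ACMECorp.sharepoint.com/site")   # placeholder site from the example above
```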

Microsoft SharePoint® (On-Premise)

Authentication to Microsoft SharePoint® on-premise requires an administrator username and password

  • It is required that the servers running the Cognitive Toolkit can access the site root and all sub-sites.

    • An easy way to check if this connection can be made is via web browser, though more advanced methods do exist (example: https://ACMECorp.sharepoint.com/site)

  • Service account permissions for Sharepoint® (on-premise)

    • Crawl: Full Read (has full read-only access; provides read access to SharePoint site collections)

    • Migration (if destination): Full Control (has full control; allows the crawling account write access to sites/libraries)

Microsoft Exchange® Online (Office 365)

We offer two methods of authentication to Microsoft Exchange® Online

  • Username and password 

  • Azure® app registration

  • It is required that the servers running the Cognitive Toolkit can access EWS (Exchange Web Services)

    • An easy way to check if this connection can be made is via web browser, though more advanced methods do exist (example: https://outlook.office365.com/ews/exchange.asmx)

    • For Microsoft Exchange Online®, either an internet connection is required or a network proxy must be set to allow the Cognitive Toolkit machine to connect to:

      • https://outlook.office365.com

      • https://login.microsoftonline.com

  • Service Account Permissions

    • Azure® Office 365 Exchange Online API (app registration):

      • full_access_as_app: provides full access to mailboxes for the purpose of crawling

      • Grant Admin Consent: required step in Azure App Registration

    • Username and Password Authentication:

      • ApplicationImpersonation: allows the crawling account to impersonate each of the users and mailboxes

      • Mailbox Search: permission required for the email account used to crawl

Microsoft Exchange® (On-Premise)

Authentication to Microsoft Exchange® on-premise requires a domain username and password

  • It is required that the servers running the Cognitive Toolkit can access the Exchange WSDL site.

    • An easy way to check if this connection can be made is via a web browser, though more advanced methods do exist (example: https://ACMECorp/ews/Exchange.asmx)

  • Administrator service account permissions:

    • ApplicationImpersonation role

    • Mailbox Search role

OpenText™  Content Server

We offer two methods for crawling operations in Content Server:

  • Database & REST (best option, most performant)

  • REST only (slower, no database connection required)

  • If you are using the Database and REST method:

    • Database connection information (example: 10.0.255.254:1521)

    • Database Name (SQL) or Schema (Oracle)

    • Username and password for the database

      • Database account with Read Access to:

        • DTREE

        • KUAF

        • DVERSDATA

        • LLATTRDATA

        • CATREGIONMAP

    • Content Server account to authenticate at REST with read permission (and write if migrating to Content Server)

    • Access to Content Server's database from the server running the Cognitive Toolkit (example: 10.0.255.254:1521)

    • Access to Content Server URL (ex. http://acmecorp.com/otcs/cs.exe) from server running Cognitive Toolkit

  • If you are using the REST only method:

    • Content Server account to authenticate at REST with read permission

    • Access to Content Server URL (ex. http://acmecorp.com/otcs/cs.exe)

  • If you are migrating to OpenText™  Content Server, our Shiny Service module for Content Server is required 

Note: Complex redirects when accessing the Content Server endpoint, or the use of self-signed certificates, could result in the Cognitive Toolkit being unable to connect. Ensure your environment supports access to this site from the machine the Cognitive Toolkit is installed on and has the appropriate certificates installed as a trusted root.

OpenText™ Documentum™ 

  • Access to Documentum endpoint URL from server running Cognitive Toolkit

    • Example: http://localhost:8080/documentum-rest-endpoint

    • Account with read permission (and write if migrating to Documentum)

IBM® FileNet®

  • Access to FileNet URL

    • Example: http://ACMECorp.com:9443/wsi/TISKVJE87WCWE

    • Account with read permission (and write if migrating to FileNet)

Box Inc.

  • Access to Box URL from server running Cognitive Toolkit

    • Example: http://box.com 

  • Box application service account

  • Box dev console

  • For Box, either an internet connection is required or a network proxy must be set to allow the Cognitive Toolkit machine to connect.

Note: Box™ by Box Inc. has API limits that can incur costs to your organization. While the Cognitive Toolkit uses these API calls efficiently, depending on the volume of your data, you may go over your API limit. Please consult with your Box representative for more information.

Access and Security

Accessing and Crawling Your Data Sources

Service Accounts

When getting ready to crawl or migrate your data, it is necessary to provide access to the data so it can be crawled and, if migrating, ingested into the destination repository. Here are some guidelines for creating service accounts when working with Shinydocs Cognitive Suite. Shinydocs offers tools to encrypt and store the passwords used to connect to non-Windows repositories (Content Server, SharePoint, etc.). 📚 See Encrypting repository passwords with SaveValue.

Crawling Data – Metadata, Hash Values, Text Extraction

The phrase "Crawling Data" is an umbrella term that essentially means reading your data. Crawling is a read-only process that pulls data from your source repository. When preparing to crawl a source repository, it is good practice to set up a service account for access to the source. This service account is also used to run the Cognitive Toolkit; the toolkit needs this account's access to ensure it can gather all the information about your data. The service account will need the ability to:

  • Read all files in the source

  • Access all sub-items, as some repositories support disabling inheritance of permissions

  • Run applications (exe), batch files, and PowerShell files on the machine the Cognitive Toolkit is installed on

  • Be used in interactive sessions

  • Log in without manual authentication (to run scheduled tasks)

Actioning Data – Disposal

Depending on your organization's use case for Shinydocs Cognitive Suite, there may come a time when you want to dispose of ROT. In order to dispose of data on your source, you will need to ensure:

  • All requirements from Crawling Data – Metadata, Hash Values, Text Extraction have been met

  • The service account has delete permissions (sometimes this role is rolled up into write permissions)

Actioning Data – Migration

If you are using the Shinydocs Cognitive Suite to migrate files into an ECM, you will need to ensure:

  • There is a service account (via Active Directory or built-in user manager) on the ECM the Cognitive Toolkit can use and that the operator has those credentials

  • The service account will need read and write permissions in the destination repository

    • Some ECM solutions and their custom modules may have additional permissions needed (example: permission to assign classification values)

    • Check that any area the Cognitive Toolkit will write to has permissions to do so (example: categories in Content Server) 

Component Specific Security

The Visualizer and Analytics Engine are the key components that your organization will need to control access to. This section will let you know which components should be secured by your organization and some of the options you have for securing them.

Visualizer

The Visualizer front end can be locked down by forcing the Visualizer to listen only locally (binding to 127.0.0.1). From there, an IIS reverse proxy (or a commercially available reverse proxy solution of your choice) can be created to encrypt the connection with your organization's SSL certificate and to leverage your Active Directory for access to the web application.

Note: Anyone who has access to the Visualizer will have access to all contents and data in the Visualizer. There are no role-based permissions within the Visualizer. See Best Practice: Securing Shinydocs Visualizer access (Reverse Proxy in IIS) for more information regarding the setup of the reverse proxy in IIS.

Analytics Engine

The Analytics Engine is a backend service that contains a REST API. This REST API is accessible on the network (default port 9200). It is critical that your organization secures these connections to prevent malicious users from accessing data they should not. Follow your organization's standards for securing REST APIs and ensure the correct firewall rules are made.

  1. The Cognitive Toolkit machines must be able to communicate with the Analytics Engines via HTTP/HTTPS (or the Coordinator node if your implementation requires it). 

  2. Each of the Analytics Engines needs to be able to communicate with each other via HTTP/HTTPS

SSL

The Visualizer and Indexer can be configured to communicate via SSL.

📚 See Configuring SSL for Shinydocs Indexer.

Best Practices

We have been crawling data for years, and in that time we have learned a lot about the situations your organization can face during a technical implementation of the Cognitive Suite. Below are some of the best practices we have found to help make your implementation successful.

Installation

  • The Shinydocs Cognitive Toolkit should be installed on its own dedicated machine. The Cognitive Toolkit and the Analytics Engines are very powerful tools and require powerful hardware; problems can arise when both applications need a lot of resources at once. Having these components on separate machines prevents them from competing for CPU, memory, and storage I/O.

  • For large volumes of data, it is best to use the Shinydocs Jobs index to track metrics from the Cognitive Toolkit. This index includes information about crawls that have been completed and details about their operation. 📚 Check out Setting up the shinydocs-jobs index for more information on configuration and some of our pre-built visualizations and dashboards.

  • If your implementation suggests a Coordinator node, the Visualizer is best suited to be installed on that machine. This gives the Visualizer a direct connection to the cluster so it can retrieve data from the index without making a network hop. If your implementation does not suggest a Coordinator node, the Visualizer can be installed on any of the Analytics Engine nodes.

  • When naming your Analytics Engine nodes in the elasticsearch.yml configuration, choose a unique name that will help identify that node among the others. A common method is to name the node after the server's machine name; that way, if there are any problems, the machine with the problem can be identified more easily.

  • Ensure your organization has a way to secure access to Analytics Engine nodes. Some organizations segment their Analytics Engine cluster on the network, poking secure holes in the environment as needed. Others simply use their existing firewall solution.

Operation

  • Anti-Virus and packet inspection products can severely slow down the performance of the Cognitive Suite. This includes on-access scanning used by some organizations in their source repositories (i.e., file shares), which means that every time the toolkit seeks information from a file, that file will be scanned before the toolkit can get the file's data. Depending on your AV, you could see a 0.5x to 50x reduction in performance due to these processes.

  • Do not run multiple processes that could collide with each other (e.g., running a tagging script while a crawl operation is happening). An entry in the index can only be updated by one process at a time; concurrent updates can result in 409 errors, also known as "version conflicts". These conflicts can stop processes from completing fully. Be sure to factor this into your scheduled tasks and have the separate processes run in serial (e.g., all of the metadata crawling happens before hashing). 📚 See Cognitive Suite Crawling and Race Conditions.

  • For scheduled tasks, it is better to have separate Cognitive Toolkit directories for each repository (sometimes more than one for large data volumes). This allows for easier troubleshooting should something go wrong and the logs need to be inspected. You will know which log files, queries, script files etc. are relevant to that repository.

  • It is recommended to have Windows Error Reporting (WER) enabled on all servers using components of Shinydocs Cognitive Suite. These errors can be helpful in troubleshooting issues that may arise in your environment. 📚 For more information, please see the Microsoft article: Enable-WindowsErrorReporting.

  • When using the Visualizer (Kibana), be mindful of what you are asking for when making visualizations on your data. Large and complex visualizations can be extremely demanding on the system and may take a long time to load, sometimes even timing out. If this happens, check that the scope of your visualizations is accurate and to the point. Filter out as much data as possible that is not relevant to your interests and/or adjust the size property of the individual visualizations.

  • While there is virtually no regular maintenance required for the index, there are some maintenance-related notes in our Customer Portal. 📚 See Maintaining your indices.

  • For Content Server crawls, please take into account background processes such as Agents, WebReports, maintenance, or other scheduled tasks related to your CS instance.

Requirements and Recommendations Matrix

Data Volume    Analytics Engines    Coordinator nodes    Toolkit machines          Shards (primary:replica)
               required             required             With Text    No Text      With Text    No Text
1 TB           1                    NA                   1            1            5:1          5:1
2 TB           1                    NA                   1            1            5:1          5:1
5 TB           1                    NA                   1            1            5:1          5:1
10 TB          1                    NA                   1            1            5:1          5:1
20 TB          1                    NA                   1            1            5:1          5:1
30 TB          1                    NA                   1            1            5:1          5:1
40 TB          1                    NA                   1            1            5:1          5:1
50 TB          3                    NA                   2            1            5:1          5:1
100 TB         3                    NA                   2            1            5:1          5:1
200 TB         4                    NA                   4            2            10:1         5:1
300 TB         4                    NA                   6            3            15:1         5:1
400 TB         4                    NA                   8            4            20:1         5:1
500 TB         6                    1                    10           5            25:1         5:1
600 TB         6                    1                    12           6            30:1         6:1
700 TB         6                    1                    14           7            35:1         7:1
800 TB         9                    1                    16           8            40:1         8:1
900 TB         9                    1                    18           9            45:1         9:1
1 PB           9                    1                    20           10           50:1         10:1

Analytics Engine node

  • CPU (cores): 12+

  • Memory: 64 GB

  • Storage: High-performance NVMe SSD* (see Storage Sizing); 10% of your expected data volume (e.g., a 50 TB data volume would need 5 TB of free space)

  • Network: 10 Gbps interface

Cognitive Toolkit machine

  • CPU (cores): 8+

  • Memory: 32+ GB

  • Storage: 20 GB SSD (non-OS) or similarly performing drive array

  • Network: 1+ Gbps interface

Once the requirements above are installed and ready, it’s time to install the Cognitive Toolkit! 📚 Refer to the instructions in Installing and Upgrading Cognitive Suite.
