
Extracting full text from a file share

The Full Text Extraction tool makes the full text of your document(s) searchable, so you can use it to identify and narrow down your data and to create more comprehensive business visualizations.

Shinydocs Cognitive Suite enables you to extract the full text from specified file(s) within the index that was created during the initial metadata crawl. Before you begin, ensure that initial discovery has been completed and basic metadata has been extracted. If it has not, refer to Initial Discovery - File Share.

Step by step:

  1. Create a query that selects the file(s) from which you would like to extract full text, and save it as a .json file. You will reference this file when you run the full text extraction tool.

Example query:

CODE
{
	"bool": {
		"must_not": {
			"exists": {
				"field": "fullText"
			}
		},
		"must": [
			{
				"terms": {
					"extension": [
						"xls",
						"xlsx",
						"xlsm",
						"xlt",
						"csv",
						"ods",
						"odp",
						"odt",
						"txt",
						"doc",
						"docx",
						"docm",
						"rtf",
						"pot",
						"potx",
						"ppt",
						"pps",
						"ppsx",
						"msg"
					]
				}
			}
		]
	}
}

The above query matches only files that have NOT previously had text extraction performed (that is, files with no fullText field) and whose extension is in the list. The tool will extract full text only from the files this query returns.
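If you prefer to generate the query file rather than write it by hand, the following is a minimal Python sketch of one way to do it. It rebuilds the example query above from a list of extensions and saves it as a .json file. The function names and any output file name you pass in are illustrative assumptions, not part of Shinydocs Cognitive Suite.

```python
import json
from pathlib import Path

# Extensions copied from the example query above; adjust to your environment.
EXTENSIONS = [
    "xls", "xlsx", "xlsm", "xlt", "csv", "ods", "odp", "odt", "txt", "doc",
    "docx", "docm", "rtf", "pot", "potx", "ppt", "pps", "ppsx", "msg",
]


def build_fulltext_query(extensions):
    """Return a query matching listed extensions that have no fullText field."""
    return {
        "bool": {
            # Skip documents that already have extracted text.
            "must_not": {"exists": {"field": "fullText"}},
            # Keep only documents whose extension is in the list.
            "must": [{"terms": {"extension": list(extensions)}}],
        }
    }


def save_query(query, path):
    """Write the query to a .json file (hypothetical path) and return the path."""
    path = Path(path)
    path.write_text(json.dumps(query, indent=2), encoding="utf-8")
    return path
```

For example, `save_query(build_fulltext_query(EXTENSIONS), "fulltext-query.json")` writes a file that can then be passed to the tool with --query.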

  2. Open a command prompt as an Administrator.

  3. Navigate to the CognitiveToolkit root folder.

  4. Run the AddHashAndExtractedText tool by entering the following command, replacing the placeholder fields with values for your environment:

CODE
CognitiveToolkit.exe AddHashAndExtractedText --action-keyword text --max-characters {max_characters} --query {path-to-file} -u {url-to-index} -i {name of index}

We recommend including a --max-characters value to limit how many characters of full text are extracted from each document. Without a maximum character value, extremely large documents can significantly slow the extraction.

Sample command with a maximum character value of 30,000 characters:

CODE
CognitiveToolkit.exe AddHashAndExtractedText --action-keyword text --max-characters 30000 -q "C:\match_all.json" -u http://localhost:9200 -i testdata
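When the same command has to be run against several indices or query files, it can help to assemble it programmatically. The sketch below is one possible approach in Python, using only the options shown in the commands above. The install folder passed as `toolkit_dir` and the function names are assumptions for illustration.

```python
import subprocess
from pathlib import PureWindowsPath


def build_command(toolkit_dir, query_path, index_url, index_name,
                  max_characters=None):
    """Assemble the AddHashAndExtractedText argument list."""
    # toolkit_dir is the CognitiveToolkit root folder (location is an assumption).
    exe = str(PureWindowsPath(toolkit_dir) / "CognitiveToolkit.exe")
    cmd = [exe, "AddHashAndExtractedText", "--action-keyword", "text"]
    if max_characters is not None:
        # Recommended: cap the number of characters extracted per document.
        cmd += ["--max-characters", str(max_characters)]
    cmd += ["--query", query_path, "-u", index_url, "-i", index_name]
    return cmd


def run_extraction(cmd, toolkit_dir):
    """Run the command from the toolkit folder.

    check=True makes subprocess raise CalledProcessError if the process
    exits with a non-zero code.
    """
    return subprocess.run(cmd, cwd=toolkit_dir, check=True)
```

Calling `build_command(r"C:\CognitiveToolkit", r"C:\match_all.json", "http://localhost:9200", "testdata", 30000)` produces the same arguments as the sample command; pass the result to `run_extraction` from an elevated prompt to execute it.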

Validation

The index must be refreshed after running full text extraction. Refreshing the index adds the new fields and values to the selected index.

Refreshing an Index

  1. In a web browser (a Chromium-based browser is recommended), enter the address where Shinydocs Analytics has been installed. This is typically <machine-name>:5601. If you are running the tool locally, use "localhost:5601".

  2. Select Management > Index Patterns, then select the index you ran Full Text Extraction against

  3. Select Refresh on the far right

  4. You should then see the number of fields increase

  5. Navigate to the Discover tab, where fullText now appears under the Available fields section
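As an additional check outside the browser, you can count how many documents now have a fullText field. The following Python sketch assumes the index URL used with -u (for example http://localhost:9200) exposes an Elasticsearch-compatible `_count` endpoint and does not require authentication. Both points are assumptions about your environment, and the function names are illustrative.

```python
import json
import urllib.request


def build_count_request(index_url, index_name):
    """Build a POST request counting documents that have a fullText field."""
    body = {"query": {"exists": {"field": "fullText"}}}
    return urllib.request.Request(
        url=f"{index_url.rstrip('/')}/{index_name}/_count",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def count_docs_with_fulltext(index_url, index_name):
    """Send the request and return the number of matching documents."""
    request = build_count_request(index_url, index_name)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["count"]
```

For example, `count_docs_with_fulltext("http://localhost:9200", "testdata")` should return a number greater than zero once extraction has populated the field.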

Did you know… 

There are two ways to run tools from the CLI:

  1. Using the full command (as shown above).

  2. Using .bat files. Every installation of Shinydocs Cognitive Suite comes with a folder of .bat files. Use these as a starting point for customizing commands to your specific environment. For example, run this CLI command: COG-Query-AddFullText-less-1MB-less32K-chars.bat

Example .bat file (COG-Query-AddFullText-less-1MB-less32K-chars.bat) and .json file (query-match-path-no-fullText-less-1MB.json):

COG-Query-AddFullText-less-1MB-less-32K-chars.bat query-match-path-no-fullText-less-1MB.json

