Extracting entities from full text

Summary

This article describes how to use Shinydocs Cognitive Suite for entity extraction.

Entity extraction is an information extraction technique that identifies key elements in text and classifies them into predefined categories. In this way, it helps transform unstructured data into structured data. Structured data is machine-readable and available for standard processing such as retrieving information, extracting facts, and answering questions.

Classes of Data Extraction

Entity extraction aims to make data sorting easier by allowing machine learning to label items of interest. By default, the Shinydocs entity extraction operation extracts the following classes of information:

  1. Person

  2. Location

  3. Organization

  4. Date

  5. Money

  6. Percent

  7. Time

Prerequisites

Here’s what you need:

  • Cognitive Toolkit - an executable stored with its dependencies, examples, and resources. It is located in a zip file provided by Shinydocs. Alternatively, the latest version can be downloaded from Shinydocs Collab.

  • Shinydocs Extraction Service must be running as a service. Shinydocs Extraction Service is a powerful Java-based tool that highlights key points of data drawn from your text files through feature extractors and natural language processing. Download Shinydocs Extraction Service from Shinydocs Collab. Follow the instructions to Install Shinydocs Extraction Service.

📔 Before you begin, ensure your dataset has been crawled for:

  1. metadata

  2. hash

  3. full text

  4. ROT rules (strongly recommended)

Extracting Entities Using the ExtractEntities Operation

Step-by-step:

  1. Type cmd in the Windows search box or charms bar, and then click Run as administrator. Alternatively, you may open a CMD prompt using a Service Account with the appropriate permissions.

  2. Change directory (cd) to your Cognitive Toolkit root folder (i.e., the location of the extracted shinydocs-cognitive-toolkit-[version]-[date].zip file).

  3. At the CognitiveToolkit.exe prompt, run the ExtractEntities command with either the minimum required parameters or additional options as needed. For an example command with the minimum required inputs or additional options, see the ExtractEntities operation in the Encyclopedia of Cognitive Toolkit Operations.
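As a rough illustration of step 3, the following Python sketch assembles an ExtractEntities command line from the four options that the configuration table in this article marks as required. It only prints the command; it does not run it. The invocation form (executable name followed by the operation name) and every value shown (file name, URLs, index name) are hypothetical placeholders, not values taken from the product documentation, so check the Encyclopedia of Cognitive Toolkit Operations for the authoritative example.

```python
import subprocess  # used only for Windows-style quoting of the argument list

# Hypothetical placeholder values; replace each one with the value for your environment.
REQUIRED_OPTIONS = {
    "--query": "query-match-fullText-no-entities.json",
    "--index-server-url": "http://localhost:9200",
    "--index-name": "my-index",
    "--extraction-service-url": "http://localhost:5000",
}


def build_extract_entities_command(options, executable="CognitiveToolkit.exe"):
    """Return the argument list for one ExtractEntities run.

    Assumption: the operation name follows the executable name directly.
    """
    command = [executable, "ExtractEntities"]
    for flag, value in options.items():
        command += [flag, value]
    return command


command = build_extract_entities_command(REQUIRED_OPTIONS)

# list2cmdline joins the arguments using Windows command-line quoting rules.
print(subprocess.list2cmdline(command))
```

Building the command as a list first keeps each flag paired with its value, which makes it easier to add optional flags such as --threads later without quoting mistakes.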

Extracting Entities Using a Batch File

(Recommended method)

A batch file (BAT file) is a script file that stores commands to be executed in a serial order. It helps automate routine tasks without requiring user input or intervention. Using a batch file, you can save your standard command parameters and even schedule routine entity extraction.

Step-by-step:

  1. Create a batch file (BAT file).

    a. Download the sample BAT file: COG-Query-ExtractEntities.bat

    b. To configure the BAT file, open COG-Query-ExtractEntities.bat in a source code editor. Edit the code as required and save.

Possible Configurations:

Each option below lists whether it is required, its description, and its default value (if any).

--field <FIELD>
    Required: No
    Description: The name of the index field in which to store the extracted entities.
    Default: entities

--classes <CLASSES>
    Required: No
    Description: Comma-separated list of entity classes to extract from the text. The available classes depend on the classifier model used to perform entity extraction. The seven classes in the standard model are PERSON, LOCATION, ORGANIZATION, DATE, MONEY, PERCENT, and TIME.
    Default: If --classes is not specified, all seven classes are extracted.
    Example: With --classes PERSON, only the specified class is extracted. In this example, only the entities-person value is extracted.

--extract-from <EXTRACT_FROM>
    Required: No
    Description: The name of the index field to extract entities from.
    Default: fullText

--preserve-spacing
    Required: No
    Description: Preserve spacing from extracted entities. Depending on the source file, there may be unwanted line breaks in extracted entities.
    Default: false

--extraction-service-url <EXTRACTION_SERVICE_URL>
    Required: Yes
    Description: URL for the entity extraction service.
    Default: (none)

--query <QUERY>
    Required: Yes
    Description: The search query, supplied as a file or as JSON input.
    Default: (none)

--silent
    Required: No
    Description: Turn off the progress bar.
    Default: false

--nodes-per-request <NODES_PER_REQUEST>
    Required: No
    Description: The number of nodes per request.
    Default: 100

--threads <THREADS>
    Required: No
    Description: The number of parallel processes to start.
    Default: 1

--skip-errors
    Required: No
    Description: Skip re-processing errors.
    Default: false

--index-server-url <INDEX_SERVER_URL>
    Required: Yes
    Description: URL of the index server.
    Default: (none)

--index-name <INDEX_NAME>
    Required: Yes
    Description: Name of the index.
    Default: (none)

--index-type <INDEX_TYPE>
    Required: No
    Description: Type name for index objects.
    Default: shinydocs

--force
    Required: No
    Description: Forcefully remove / suppress the prompt for confirmation.
    Default: false

  2. Open a Command Line Interface (CLI) by opening a Windows Command Prompt as Administrator and navigating to the folder that contains the extracted stanford-named-entity-recognizer-service.bat file.

  3. Open a second Command Line Interface (CLI) by opening a Windows Command Prompt as Administrator and navigating to the folder extracted from shinydocs-cognitive-toolkit-yyyy-mm-dd (X.X.X).zip.

  4. To extract entities from your data, run the COG-Query-ExtractEntities.bat command.

 📔 We recommend you run it from a COG Batch Files folder.

  • Press Enter

  • Enter ‘y’ for yes

  • Press Enter

  • A progress bar tracks the extraction and completes when all entities have been successfully extracted.
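To make the relationship between the --classes and --field options and the resulting index fields more concrete, here is a minimal Python sketch. It assumes that result fields are named by joining the --field value and the lowercase class name with a hyphen, which matches the default field names listed in this article (for example, entities-person) but is not confirmed for custom --field values. The helper names parse_classes and result_fields are hypothetical, and the uppercase normalization is a choice made by this sketch, not documented behavior of the tool.

```python
# The seven classes of the standard model, as listed in this article.
STANDARD_CLASSES = (
    "PERSON", "LOCATION", "ORGANIZATION", "DATE", "MONEY", "PERCENT", "TIME",
)


def parse_classes(value=None):
    """Return the classes selected by a --classes value (all seven when omitted)."""
    if not value:
        return list(STANDARD_CLASSES)
    # Normalizing to uppercase is this sketch's own choice.
    selected = [item.strip().upper() for item in value.split(",") if item.strip()]
    unknown = [item for item in selected if item not in STANDARD_CLASSES]
    if unknown:
        raise ValueError("Not in the standard model: " + ", ".join(unknown))
    return selected


def result_fields(classes, field="entities"):
    """Map each class to an index field name (assumed pattern: <field>-<class>)."""
    return [f"{field}-{name.lower()}" for name in classes]


print(result_fields(parse_classes("PERSON")))
print(result_fields(parse_classes("PERSON,LOCATION,ORGANIZATION")))
print(len(parse_classes()))
```

Checking a class list this way before editing the BAT file can catch a misspelled class name early, rather than after a long extraction run.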

View Results

View Results using Discovery Desktop

Once the Analytics Engine has completed processing, you can view the results in Discovery Desktop.

  • Open Discovery Desktop

  • Add any of the following field names in the Columns to Display in the Search Filter:

    • entities-person

    • entities-location

    • entities-date

    • entities-percent

    • entities-organization

    • entities-time

    • entities-money

  • In the Search field, enter the following search parameter:

entities-person: [* TO *]

  • Select Search

Discovery Desktop will display all the documents that have had a Person entity extracted. In this case, there were 438 matches using full text.   

  • Select one of the documents and click to view. 

  • The document will open in its native application. 

  • Based on patterns in the full text, the CLI extracts entities such as Person, Location, Date, Percent, and Organization from each document.
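The search parameter used above follows the pattern "field name, colon, [* TO *]", which matches documents where the field has any value. The Python sketch below generates that search string for each of the seven entity fields. The combined query at the end assumes that the Search field accepts Lucene-style OR operators and parentheses; this article does not confirm that, so treat it as an assumption to verify. The function names are hypothetical.

```python
# Field names as listed in this article for the default --field value.
ENTITY_FIELDS = [
    "entities-person",
    "entities-location",
    "entities-date",
    "entities-percent",
    "entities-organization",
    "entities-time",
    "entities-money",
]


def exists_query(field):
    """Return a search string matching documents where the field has any value."""
    return f"{field}: [* TO *]"


def any_entity_query(fields):
    """Combine per-field searches with OR (assumes Lucene-style boolean syntax)."""
    return " OR ".join(f"({exists_query(field)})" for field in fields)


for field in ENTITY_FIELDS:
    print(exists_query(field))

print(any_entity_query(["entities-person", "entities-organization"]))
```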

View Results using Visualizer

  • Open Visualizer

  • Select the Management tab 

  • Select Index Patterns

  • Select the Refresh field list symbol

  • The extracted entities fields will now appear in the field list.

  • Select the Dashboard tab

  • Select Dashboard in the upper left corner to see a list of available dashboards

  • Select the Extracted Entities dashboard

Visualizer displays all of the documents that have had entities extracted from full text. In our example, we show the basic entity fields Person, Location, and Organization.

Click any part of the visualization to drill down into that area.

Configuration

Creating a BAT File

The Windows Command Prompt (cmd.exe) utilizes a DOS batch file, called a BAT file, to execute commands. 

Download the sample BAT file:

COG-Query-ExtractEntities.bat

To configure the bat file: 

  • Open the file COG-Query-ExtractEntities.bat using a text or source code editor and edit the code as required.

  • Save any changes.

Creating a JSON file

A JSON file stores simple data structures and objects in JavaScript Object Notation (JSON) format. A JSON query file must accompany the BAT file that is used to execute commands with the Windows Command Prompt (cmd.exe).

Download the sample JSON file:

query-match-fullText-no-entities.json

To configure the associated query .json file: 

  • Open the file query-match-fullText-no-entities.json using a text or source code editor, such as Notepad++.

  • Edit the following code, as required.

📔 Line 4 requires the full text field to exist in the index in order to extract entities. If the full text field doesn’t exist, it means that the data has not been crawled for full text and therefore entity extraction cannot be performed on the full text of the data.  

📔 Line 7 requires that the field “entities-person” does not exist. Because a person is commonly among the extracted entities, requiring this field to be absent ensures that entity extraction is performed only on new items. Items that have already been processed for entity extraction are skipped in subsequent runs.
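The contents of the sample JSON file are not reproduced in this article. As a minimal sketch of the logic the two notes describe, the Python snippet below builds a query that requires the fullText field to exist and the entities-person field to be absent, then prints it as JSON. It assumes the index accepts an Elasticsearch-style bool query with exists clauses; the exact structure and line layout of the real sample file may differ, so the line numbers in the notes above refer to that file and not to this output.

```python
import json

# Assumption: Elasticsearch-style query DSL; verify against the downloaded sample file.
query = {
    "query": {
        "bool": {
            # Only items that have been crawled for full text.
            "must": [{"exists": {"field": "fullText"}}],
            # Skip items that already have extracted person entities.
            "must_not": [{"exists": {"field": "entities-person"}}],
        }
    }
}

query_text = json.dumps(query, indent=2)
print(query_text)
```

If you extract only a subset of classes that excludes PERSON, the absent-field check would presumably need to name one of the fields your run actually produces; otherwise previously processed items would never be skipped.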
