Extracting entities from full text

Summary

This article describes how to use Shinydocs Cognitive Suite for entity extraction.

Entity extraction is an information extraction technique that identifies key elements in text and classifies them into predefined categories. This turns unstructured data into structured data, which is machine-readable and available for standard processing such as retrieving information, extracting facts, and answering questions.

Classes of Entities

Entity extraction aims to make data sorting easier by allowing machine learning to label items of interest. By default, Shinydocs entity extraction extracts the following classes of information:

Person
Location
Organization
Date
Money
Percent
Time
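
When only some of these classes are needed, the --classes option described later in this article takes them as a comma-separated list. The following Python sketch shows one way to build that value. The helper name classes_option_value is hypothetical, and the upper-case spelling is an assumption based on the example value "LOCATION,PERSON,ORGANIZATION" in the options table; the classes actually available depend on the classifier model in use.

```python
# The seven default classes listed above, written in the upper-case form
# used by the --classes example value (an assumption, see the lead-in).
DEFAULT_CLASSES = (
    "PERSON", "LOCATION", "ORGANIZATION", "DATE", "MONEY", "PERCENT", "TIME",
)


def classes_option_value(wanted):
    """Return a comma-separated value for --classes (hypothetical helper).

    Raises ValueError if a name is not one of the seven default classes.
    """
    normalized = [name.strip().upper() for name in wanted]
    unknown = [name for name in normalized if name not in DEFAULT_CLASSES]
    if unknown:
        raise ValueError(f"Unknown entity classes: {', '.join(unknown)}")
    # Drop duplicates while keeping the order the caller asked for.
    return ",".join(dict.fromkeys(normalized))


print(classes_option_value(["Person", "location", "Organization"]))
# prints PERSON,LOCATION,ORGANIZATION
```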

Getting Started

Note: Before you begin, ensure your dataset has been crawled for:

metadata
text extraction

Prerequisites

Here’s what you need:

Cognitive Toolkit - an executable stored with its dependencies, examples, and resources. It is located in a zip file provided by Shinydocs.
Shinydocs Extraction Service must be running as a service - Follow the instructions to Install Shinydocs Extraction Service.
Query - a JSON file containing the query you want to use. This query will inform the Cognitive Toolkit which data to process and extract entities from.
- A simple fullText exists query is a great place to start:
  fullText-exists.json
```
{
  "bool": {
    "must": [
      {
        "exists": {
          "field": "fullText"
        }
      }
    ]
  }
}
```
ExtractEntities command - build a command containing the options you want to use

CognitiveToolkit.exe ExtractEntities [options]
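
If you prefer to generate the query file from a script rather than by hand, the following Python sketch writes the same fullText exists query shown above and reads it back to confirm it is valid JSON. The function name write_query_file is hypothetical, and the example writes to a temporary folder only so that it can be run without leaving files behind; in practice you would pass the path where you keep your queries.

```python
import json
import tempfile
from pathlib import Path

# Same query as fullText-exists.json: match items that have a fullText field.
FULLTEXT_EXISTS_QUERY = {
    "bool": {"must": [{"exists": {"field": "fullText"}}]}
}


def write_query_file(path, query=FULLTEXT_EXISTS_QUERY):
    """Write the query as JSON, then read it back to confirm it parses."""
    path = Path(path)
    path.write_text(json.dumps(query, indent=2), encoding="utf-8")
    return json.loads(path.read_text(encoding="utf-8"))


# Demonstration in a temporary folder (removed again when the block exits).
with tempfile.TemporaryDirectory() as folder:
    saved = write_query_file(Path(folder) / "fullText-exists.json")
    print(saved["bool"]["must"][0]["exists"]["field"])  # prints fullText
```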

ExtractEntities Options

To build your command you must include all required options below. Include any other options, as desired.

OPTION	DESCRIPTION	CONDITION
--field <VALUE>	The name of the index field in which to store the extracted entities.	Optional. If not included, default: entities
-c|--classes <VALUE>	Comma-separated list of entity classes to extract from text. The classes that can be extracted depend on the classifier model used to perform entity extraction. Example: "LOCATION,PERSON,ORGANIZATION"	Optional. If not included, default: all classes
-e|--extract-from <VALUE>	The name of the index field from which to extract entities.	Optional. If not included, default: fullText
--preserve-spacing	Preserve spacing in extracted entities. Depending on the source file, there may be unwanted line breaks in extracted entities.	Optional. If not included, default: false
--extraction-service-url <EXTRACTION_SERVICE_URL>	URL for the entity extraction service.	Optional. If not included, default: http://localhost:55555
-q|--query <VALUE>	The path to the search query (file or JSON defining input parameters).	Required
-n|--nodes-per-request <VALUE>	Number of nodes per request. For recommendations on setting this value, see Setting the "--nodes-per-request" option.	Optional. If not included, default: 100
--use-shinydocs-jobs	Send logging data to the shinydocs-jobs index. For recommendations on setting up the shinydocs-jobs index, see Setting up the shinydocs-jobs index.	Optional. If not included, default: false
-u|--index-server-url <VALUE>	URL of the index server. If the Cognitive Toolkit and the index are running on different servers, set --index-server-url to the IP address of the index server rather than localhost.	Required
-i|--index-name <VALUE>	Name of the index.	Required
-t|--threads <VALUE>	Number of parallel processes to start. For recommendations on setting this value, see How to Set the "--threads" Option.	Optional. If not included, default: 1
--force	Suppress the prompt for confirmation.	Optional. If not included, default: false
--dry-run	Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without sending any changes to the index.	Optional. If not included, default: false
-s|--silent	Turn off the progress bar.	Optional
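
The rules in the table above (three required options, the rest optional) can be checked before anything is run. The following Python sketch assembles an ExtractEntities argument list and refuses to build it if a required option is missing. The function build_extract_entities_command is hypothetical and not part of the Cognitive Toolkit; it also assumes that the options listed in the table without a <VALUE> are plain switches, and it accepts only the long option names.

```python
# Required options, per the CONDITION column of the table.
REQUIRED = ("--query", "--index-server-url", "--index-name")

# Options the table lists without a <VALUE>; assumed here to be switches.
SWITCHES = {
    "--preserve-spacing", "--use-shinydocs-jobs",
    "--force", "--dry-run", "--silent",
}


def build_extract_entities_command(options, executable="CognitiveToolkit.exe"):
    """Return the command as a list of arguments (hypothetical helper).

    `options` maps long option names to values; switches map to True/False.
    """
    missing = [name for name in REQUIRED if name not in options]
    if missing:
        raise ValueError(f"Missing required options: {', '.join(missing)}")

    command = [executable, "ExtractEntities"]
    for name, value in options.items():
        if name in SWITCHES:
            if value:  # only emit a switch when it is turned on
                command.append(name)
        else:
            command.extend([name, str(value)])
    return command


command = build_extract_entities_command({
    "--index-server-url": "http://localhost:9200",
    "--index-name": "shiny",
    "--query": "fullText-exists.json",
    "--dry-run": True,  # count the items first without sending nodes
})
print(" ".join(command))
```

Starting with --dry-run is a cautious choice: it reports how many items the query matches before a full run is started.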

Example ExtractEntities command

CognitiveToolkit.exe ExtractEntities -u http://localhost:9200 -i shiny -q fullText-exists.json

This example:

assumes Shinydocs Extraction Service is running on the local machine at the default URL (http://localhost:55555)
checks each item in the shiny index at http://localhost:9200 that has a fullText field for any of the 7 default entity classes

Running the ExtractEntities Command

Once your metadata and text extraction crawls are complete, you can use the ExtractEntities command in Cognitive Toolkit to tag indexed data with any of the supported entity types.

This command will require a query (saved as a .json file) that captures only index data where the fullText field exists.

Step-by-step:

Type cmd in the Windows search box or charms bar. Alternatively, you may open a CMD prompt using a Service Account with the appropriate permissions.
Change directory (cd) to your Cognitive Toolkit root folder (i.e., the location of the extracted shinydocs-cognitive-toolkit-[version]-[date].zip file).
Enter your ExtractEntities command and press Enter.
```
CognitiveToolkit.exe ExtractEntities [options]
```
Once the operation is running, the CMD window will indicate its progress until completion.
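
If the command needs to run from a script (for example, on a schedule) rather than from an interactive CMD window, the steps above can be sketched in Python as follows. This is a minimal sketch, not a Shinydocs-provided tool: the function run_extract_entities is hypothetical, and it assumes CognitiveToolkit.exe sits directly in the toolkit root folder, as the steps above imply.

```python
import subprocess
from pathlib import Path


def run_extract_entities(toolkit_root, arguments, runner=subprocess.run):
    """Run ExtractEntities from the Cognitive Toolkit root folder.

    `arguments` is the list of options, e.g. ["-u", "http://localhost:9200"].
    `runner` defaults to subprocess.run; a stand-in can be passed for testing.
    """
    root = Path(toolkit_root)
    command = [str(root / "CognitiveToolkit.exe"), "ExtractEntities", *arguments]
    # cwd mirrors the "change directory" step; output goes to the console,
    # so the progress bar is shown just as in a CMD window.
    return runner(command, cwd=root)
```

A call might look like run_extract_entities(r"C:\CognitiveToolkit", ["-u", "http://localhost:9200", "-i", "shiny", "-q", "fullText-exists.json"]), where C:\CognitiveToolkit is a placeholder for your own toolkit folder.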

View Results

View Results using Visualizer

Open Visualizer
Select the Management tab
Select Index Patterns

Select the Refresh field list symbol
The extracted entities fields will now appear in the field list.

Select the Dashboard tab

Select Dashboard in the upper left corner to see a list of available dashboards

Select the Extracted Entities dashboard

Visualizer displays all of the documents that have had entities extracted from the fullText field. In our example, we are showing the basic entity fields: Person, Location, and Organization.

Click on any part of the visualization to drill down into that area.