Extracting entities from full text
Summary
This article describes how to use Shinydocs Cognitive Suite for entity extraction.
Entity extraction is an information extraction technique that identifies key elements in text and classifies them into predefined categories. In this way, it helps transform unstructured data into structured data. Structured data is machine-readable and available for standard processing such as retrieving information, extracting facts, and answering questions.
Classes of Entities
Entity extraction aims to make data sorting easier by allowing machine learning to label items of interest. By default, Shinydocs entity extraction extracts the following classes of information:
Person
Location
Organization
Date
Money
Percent
Time
Getting Started
Note: Before you begin, ensure your dataset has been crawled for:
metadata
text extraction
Prerequisites
Here’s what you need:
Cognitive Toolkit - an executable stored with its dependencies, examples, and resources. This is located in a zip file provided by Shinydocs.
Shinydocs Extraction Service - must be running as a service. Follow the instructions in Install Shinydocs Extraction Service.
Query - a JSON file containing the query you want to use. This query will inform the Cognitive Toolkit which data to process and extract entities from.
A simple fullText exists query is a great place to start:
fullText-exists.json
{ "bool": { "must": [ { "exists": { "field": "fullText" } } ] } }
ExtractEntities command - build a command containing the options you want to use
CognitiveToolkit.exe ExtractEntities [options]
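The query file listed in the prerequisites above can also be generated from a script, which avoids hand-editing mistakes in the JSON. The following is a minimal sketch in Python; the use of Python and the helper name `write_query_file` are assumptions for illustration and are not part of the Cognitive Toolkit. The query itself is the fullText exists query shown above.

```python
import json
from pathlib import Path

# The fullText exists query from this article: select every indexed
# document that has a fullText field.
FULLTEXT_EXISTS_QUERY = {
    "bool": {
        "must": [
            {"exists": {"field": "fullText"}}
        ]
    }
}


def write_query_file(path="fullText-exists.json"):
    """Write the query as JSON and return the path that was written.

    Hypothetical helper for illustration; not part of the Cognitive Toolkit.
    """
    target = Path(path)
    target.write_text(json.dumps(FULLTEXT_EXISTS_QUERY, indent=2), encoding="utf-8")
    # Reading the file back confirms that it parses as valid JSON.
    json.loads(target.read_text(encoding="utf-8"))
    return target
```

Calling `write_query_file()` from the Cognitive Toolkit root folder would create `fullText-exists.json` there, ready to pass to the `-q` option.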
ExtractEntities Options
To build your command you must include all required options below. Include any other options, as desired.
OPTION | VALUE | CONDITION |
---|---|---|
--field <VALUE> | The name of the index field to store the extracted entities | Optional *If not included, default: entities |
-c|--classes <VALUE> | Comma-separated list of entity classes to extract from text. The classes available depend on the classifier model used to perform entity extraction. | Optional *If not included, default: all classes. Example: "LOCATION,PERSON,ORGANIZATION" |
-e|--extract-from <VALUE> | The name of the index field from which to extract entities | Optional *If not included, default: fullText |
--preserve-spacing | Preserve spacing from extracted entities. Depending on the source file, there may be unwanted line breaks in extracted entities. | Optional *If not included, default: false |
--extraction-service-url <EXTRACTION_SERVICE_URL> | URL for the entity extraction service | Optional *If not included, default: http://localhost:55555 |
-q|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
-n|--nodes-per-request <VALUE> | Number of nodes per request. For recommendations on setting this value, see Setting the "--nodes-per-request" option. | Optional *If not included, default: 100 |
--use-shinydocs-jobs | Send logging data to the shinydocs-jobs index. For recommendations on setting up the shinydocs-jobs index, see Setting up the shinydocs-jobs index. | Optional *If not included, default: false |
-u|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and the index are running on different servers, set the --index-server-url option to the IP address of the index server rather than localhost. | Required |
-i|--index-name <VALUE> | Name of the index | Required |
-t|--threads <VALUE> | Number of parallel processes to start. For recommendations on setting this value, see How to Set the "--threads" Option. | Optional *If not included, default: 1 |
--force | Suppress the prompt for confirmation | Optional *If not included, default: false |
--dry-run | Runs everything but doesn’t send nodes to the Analytics Engine. The --dry-run option allows you to quickly see how many items will be processed without actually updating the index. | Optional *If not included, default: false |
-s|--silent | Turn off the progress bar | Optional |
Example ExtractEntities command
CognitiveToolkit.exe ExtractEntities -u http://localhost:9200 -i shiny -q fullText-exists.json
This example:
assumes Shinydocs Extraction Service is running on the local machine
checks each file in the shiny index at http://localhost:9200 where the fullText field exists, and extracts any of the seven supported entity classes.
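When the same command is run against several indexes, it can help to assemble the argument list in a script instead of retyping it. The sketch below is one way to do that in Python. The function name and its parameters are hypothetical; the flags it emits (`-u`, `-i`, `-q`, `-c`, `--dry-run`) are taken from the options table above, and nothing here calls the toolkit itself.

```python
def build_extract_entities_command(index_server_url, index_name, query_path,
                                   classes=None, dry_run=False):
    """Return the ExtractEntities command as a list of arguments.

    Hypothetical helper for illustration; only flags documented in the
    options table are used.
    """
    command = [
        "CognitiveToolkit.exe", "ExtractEntities",
        "-u", index_server_url,   # required: URL of the index server
        "-i", index_name,         # required: name of the index
        "-q", query_path,         # required: path to the search query
    ]
    if classes:
        # -c expects a comma-separated list, e.g. LOCATION,PERSON,ORGANIZATION
        command += ["-c", ",".join(classes)]
    if dry_run:
        # --dry-run processes the query without sending nodes to the Analytics Engine
        command.append("--dry-run")
    return command


# Reproduces the example command from this article.
example = build_extract_entities_command(
    "http://localhost:9200", "shiny", "fullText-exists.json"
)
print(" ".join(example))
```

Returning a list rather than a single string keeps each argument separate, so values containing spaces do not need manual quoting when the list is handed to a process launcher.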
Running the ExtractEntities Command
Once the metadata and text extraction crawls are complete, you can use the ExtractEntities command in Cognitive Toolkit to tag indexed data with any of the supported entity types.
This command requires a query (saved as a .json file) that captures only index data where the fullText field exists.
Step-by-step:
Type cmd in the Windows search box or charms bar. Alternatively, you may open a CMD prompt using a Service Account with the appropriate permissions.
Change directory (cd) to your Cognitive Toolkit root folder (i.e., the location of the extracted shinydocs-cognitive-toolkit-[version]-[date].zip file).
Enter your ExtractEntities command and press Enter.
CognitiveToolkit.exe ExtractEntities [options]
Once the operation is running, the CMD window will indicate its progress until completion.
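The manual steps above (change to the toolkit root folder, then enter the command) can also be scripted, for example for a scheduled run. The following Python sketch is an assumption-laden illustration: the function name, the `runner` parameter, and the folder path in the usage comment are hypothetical, and the script simply launches the same executable you would otherwise type at the CMD prompt.

```python
import subprocess
from pathlib import Path


def run_extract_entities(toolkit_root, options, runner=subprocess.run):
    """Launch ExtractEntities from the Cognitive Toolkit root folder.

    Hypothetical helper for illustration. `options` is a list of command-line
    arguments such as ["-u", "http://localhost:9200", "-i", "shiny", ...].
    `runner` defaults to subprocess.run and can be swapped out for testing.
    """
    root = Path(toolkit_root)
    # Use the full path to the executable so it is found regardless of the
    # directory the script itself was started from.
    executable = root / "CognitiveToolkit.exe"
    command = [str(executable), "ExtractEntities", *options]
    # cwd mirrors the "cd" step; check=True raises CalledProcessError if the
    # process exits with a non-zero status.
    return runner(command, cwd=str(root), check=True)


# Usage (the folder below is a placeholder, not a real install location):
# run_extract_entities(
#     r"C:\path\to\cognitive-toolkit",
#     ["-u", "http://localhost:9200", "-i", "shiny", "-q", "fullText-exists.json"],
# )
```

The toolkit still prints its own progress to the console, because the child process inherits the script's standard output by default.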
View Results
View Results using Visualizer
Open Visualizer
Select the Management tab
Select Index Patterns
Select the Refresh field list symbol
The extracted entities fields will now appear in the field list.
Select the Dashboard tab
Select Dashboard in the upper left corner to see a list of available dashboards
Select the Extracted Entities dashboard
Visualizer displays all of the documents that have had entities extracted from the fullText field. In our example, we are showing the basic entity fields: Person, Location, and Organization.
Click on any part of the visualization to drill down into that area.