Extracting entities from full text

Summary

This article describes how to use Shinydocs Cognitive Suite for entity extraction.

Entity extraction is an information extraction technique that identifies key elements in text and classifies them into predefined categories. In doing so, it helps transform unstructured data into structured data. Structured data is machine-readable and available for standard processing such as retrieving information, extracting facts, and answering questions.
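To make the idea of turning unstructured text into structured data concrete, here is a toy sketch. It uses plain regular expressions for two of the classes (Money and Percent); this is an illustrative stand-in chosen for brevity, not the classifier model that Shinydocs uses, and the patterns and function name are assumptions.

```python
import re

# Toy patterns for two entity classes. These regular expressions are
# illustrative assumptions only, not how Shinydocs extracts entities.
PATTERNS = {
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
}


def extract_entities(text):
    """Return a dict mapping each class name to the list of matches found."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


sample = "Revenue rose 12.5% to $1,200,000 in the third quarter."
print(extract_entities(sample))
# prints: {'MONEY': ['$1,200,000'], 'PERCENT': ['12.5%']}
```

The free-text sentence becomes a small dictionary keyed by class, which is the kind of structure that can then be searched, filtered, or aggregated.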

Classes of Entities

Entity extraction aims to make data sorting easier by allowing machine learning to label items of interest. By default, Shinydocs entity extraction extracts the following classes of information:

  1. Person

  2. Location

  3. Organization

  4. Date

  5. Money

  6. Percent

  7. Time
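If you want to restrict a run to some of these classes, the --classes option described later in this article takes a comma-separated list. The sketch below builds that value from human-readable names. It assumes that every class is spelled in upper case, as in the article's example value "LOCATION,PERSON,ORGANIZATION"; the spellings DATE, MONEY, PERCENT, and TIME are inferred from that example rather than confirmed, and the helper name is hypothetical.

```python
# The seven default classes listed above, upper-cased to match the article's
# example value "LOCATION,PERSON,ORGANIZATION" (spelling is an assumption).
DEFAULT_CLASSES = (
    "PERSON", "LOCATION", "ORGANIZATION", "DATE", "MONEY", "PERCENT", "TIME",
)


def classes_value(names):
    """Build a comma-separated class list, rejecting names outside the default set."""
    cleaned = [name.strip().upper() for name in names]
    unknown = [name for name in cleaned if name not in DEFAULT_CLASSES]
    if unknown:
        raise ValueError(f"Unknown entity classes: {', '.join(unknown)}")
    return ",".join(cleaned)


print(classes_value(["Location", "Person", "Organization"]))
# prints: LOCATION,PERSON,ORGANIZATION
```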

Getting Started

Note: Before you begin, ensure your dataset has been crawled for:

  1. metadata

  2. text extraction

Prerequisites

Here’s what you need:

  • Cognitive Toolkit - an executable stored with its dependencies, examples, and resources. This is located in a zip file provided by Shinydocs.

  • Shinydocs Extraction Service - must be running as a service. Follow the instructions in Install Shinydocs Extraction Service.

  • Query - a JSON file containing the query you want to use. This query will inform the Cognitive Toolkit which data to process and extract entities from.

    • A simple fullText exists query is a great place to start:
      fullText-exists.json

      JSON
      {
        "bool": {
          "must": [
            {
              "exists": {
                "field": "fullText"
              }
            }
          ]
        }
      }
  • ExtractEntities command - build a command containing the options you want to use

CODE
CognitiveToolkit.exe ExtractEntities [options]
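If you would rather generate the query file than type it by hand, a minimal sketch follows. It writes the same bool/must/exists structure shown above; the function name and its `field` parameter are hypothetical helpers introduced for illustration.

```python
import json
from pathlib import Path


def write_exists_query(path, field="fullText"):
    """Write a bool/must/exists query for `field` to `path` and return it as a dict."""
    query = {"bool": {"must": [{"exists": {"field": field}}]}}
    Path(path).write_text(json.dumps(query, indent=2), encoding="utf-8")
    return query


if __name__ == "__main__":
    write_exists_query("fullText-exists.json")
    # Re-read the file to confirm that it parses as valid JSON.
    print(json.loads(Path("fullText-exists.json").read_text(encoding="utf-8")))
```

Generating the file this way avoids stray commas or unbalanced braces, which would otherwise only surface when the command is run.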

ExtractEntities Options

To build your command, include all of the required options below. Add any of the optional ones as needed.

| OPTION | VALUE | CONDITION |
| --- | --- | --- |
| --field <VALUE> | The name of the index field in which to store the extracted entities | Optional. If not included, default: entities |
| -c\|--classes <VALUE> | Comma-separated list of entity classes to extract from text. The classes that can be extracted depend on the classifier model used to perform entity extraction. | Optional. If not included, default: all classes. Example: "LOCATION,PERSON,ORGANIZATION" |
| -e\|--extract-from <VALUE> | The name of the index field from which to extract entities | Optional. If not included, default: fullText |
| --preserve-spacing | Preserve spacing from extracted entities. Depending on the source file, there may be unwanted line breaks in extracted entities. | Optional. If not included, default: false |
| --extraction-service-url <EXTRACTION_SERVICE_URL> | URL for the entity extraction service | Optional. If not included, default: http://localhost:55555 |
| -q\|--query <VALUE> | The path to the search query (File or JSON defining input parameters) | Required |
| -n\|--nodes-per-request <VALUE> | Number of nodes per request. For recommendations on setting this value, see Setting the "--nodes-per-request" option. | Optional. If not included, default: 100 |
| --use-shinydocs-jobs | Send logging data to the shinydocs-jobs index. For recommendations on setting up the shinydocs-jobs index, see Setting up the shinydocs-jobs index. | Optional. If not included, default: false |
| -u\|--index-server-url <VALUE> | URL of the index server. If the Cognitive Toolkit and the index are running on different servers, set this option to the IP address of the index server rather than localhost. | Required |
| -i\|--index-name <VALUE> | Name of the index | Required |
| -t\|--threads <VALUE> | Number of parallel processes to start. For recommendations on setting this value, see How to Set the "--threads" Option. | Optional. If not included, default: 1 |
| --force | Forcefully remove/suppress prompt for confirmation | Optional. If not included, default: false |
| --dry-run | Runs everything but doesn't send nodes to the Analytics Engine. This lets you quickly see how many items will be processed without writing the results to the index. | Optional. If not included, default: false |
| -s\|--silent | Turn off the progress bar | Optional |
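When commands are assembled by a script, it helps to check the required options before anything runs. The sketch below builds an argument list from the options documented above and rejects a missing -u, -i, or -q value. The function and parameter names are hypothetical, and only a few of the optional flags are covered.

```python
def build_extract_entities_args(index_server_url, index_name, query, *,
                                classes=None, field=None, threads=None,
                                dry_run=False):
    """Return the argument list for an ExtractEntities run.

    The three positional parameters map to the required options
    (-u, -i, -q); the keyword parameters map to optional ones.
    """
    required = (("index_server_url", index_server_url),
                ("index_name", index_name),
                ("query", query))
    for name, value in required:
        if not value:
            raise ValueError(f"{name} is required")
    args = ["CognitiveToolkit.exe", "ExtractEntities",
            "-u", index_server_url, "-i", index_name, "-q", query]
    if classes:
        args += ["--classes", classes]
    if field:
        args += ["--field", field]
    if threads is not None:
        args += ["--threads", str(threads)]
    if dry_run:
        args.append("--dry-run")
    return args


args = build_extract_entities_args(
    "http://localhost:9200", "shiny", "fullText-exists.json",
    classes="LOCATION,PERSON,ORGANIZATION", dry_run=True)
print(" ".join(args))
```

Starting with --dry-run, as in this usage, is a low-risk way to see how many items a query selects before committing to a full run.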

Example ExtractEntities command

CODE
CognitiveToolkit.exe ExtractEntities -u http://localhost:9200 -i shiny -q fullText-exists.json

This example:

  • assumes Shinydocs Extraction Service is running on the local machine

  • checks every item in the shiny index at http://localhost:9200 whose fullText field exists for any of the 7 supported entity classes.
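Because the example relies on the extraction service being reachable, a quick connectivity check before a long run can save time. The sketch below only tests whether something accepts TCP connections at the default URL from the options table (http://localhost:55555). It does not confirm that the listener is actually the Shinydocs Extraction Service, and the function name is hypothetical.

```python
import socket
from urllib.parse import urlparse


def service_port_is_open(url="http://localhost:55555", timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    # Fall back to the scheme's usual port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(service_port_is_open())
```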

Running the ExtractEntities Command

Once your metadata and text extraction crawls are complete, you can use the ExtractEntities command in Cognitive Toolkit to tag indexed data with any of the supported entity types.

This command requires a query (saved as a .json file) that captures only index data where the fullText field exists.

Step-by-step:

  1. Type cmd in the Windows search box or charms bar. Alternatively, you may open a CMD prompt using a Service Account with the appropriate permissions.

  2. Change directory (cd) to your Cognitive Toolkit root folder (i.e., the location of the extracted shinydocs-cognitive-toolkit-[version]-[date].zip file).

  3. Enter your ExtractEntities command and press Enter.

    CODE
    CognitiveToolkit.exe ExtractEntities [options]
  4. Once the operation is running, the CMD window will indicate its progress until completion.
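The steps above can also be scripted, for example for a scheduled run. This is a minimal sketch, assuming Python is available on the machine. The function name, the `runner` parameter, and the folder path in the usage are hypothetical, and the script returns the process exit code without interpreting it, since this article does not document the toolkit's exit codes.

```python
import subprocess
from pathlib import Path


def run_extract_entities(toolkit_dir, options, runner=subprocess.run):
    """Run ExtractEntities from the toolkit folder and return the exit code.

    `runner` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without the real executable.
    """
    toolkit_dir = Path(toolkit_dir)
    command = [str(toolkit_dir / "CognitiveToolkit.exe"), "ExtractEntities", *options]
    # cwd mirrors step 2: run the command from the toolkit root folder.
    completed = runner(command, cwd=str(toolkit_dir))
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical install location; replace with your own toolkit folder.
    code = run_extract_entities(
        r"C:\CognitiveToolkit",
        ["-u", "http://localhost:9200", "-i", "shiny", "-q", "fullText-exists.json"],
    )
    print("Exit code:", code)
```

Passing the options as a list rather than a single string avoids quoting problems when a path or query file name contains spaces.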

View Results

View Results using Visualizer

  • Open Visualizer

  • Select the Management tab 

  • Select Index Patterns

  • Select the Refresh field list symbol

  • The extracted entities fields will now appear in the field list.

  • Select the Dashboard tab

  • Select Dashboard in the upper left corner to see a list of available dashboards

  • Select the Extracted Entities dashboard

Visualizer displays all of the documents that have had entities extracted from their full text. In our example, we are showing the basic entity fields: Person, Location, and Organization.

Click on any part of the visualization to drill down into that area.
