Shinydocs Extraction Service 2.1.0 (July 2023)

Information on feature enhancements and fixed issues that are part of the Shinydocs Extraction Service 2.1.0 (July 2023) release.

Fixed Issues

Update the installation folder name

Previously, the installed Shinydocs Extraction Service folder had the same name as another Shinydocs product. To eliminate confusion between products, the folder has been renamed.

Rename Installation Windows Directory

Previously, the Shinydocs Extraction Service was installed to the following folder by default: C:\Program Files\Shinydocs\Entity Extraction. There was no option to override the default installation folder.

The Extraction Service now installs to the following folder by default: C:\Program Files\Shinydocs\Extraction and users can now override the default to install to a location of their choosing.

Bind to a loopback address of 127.0.0.1

Previously, the bind was set to a default of 0.0.0.0:55555. This meant that if the firewall was down, other machines could also reach and use the service. The bind now defaults to 127.0.0.1:55555, which restricts access to the service to the machine it runs on.
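The difference between the two bind addresses can be illustrated with a minimal Python sketch. This is not the Extraction Service's own code: the open_listener helper is hypothetical, and port 0 (any free port) is used instead of 55555 so the example does not clash with a running service.

```python
import socket

def open_listener(host: str, port: int = 0) -> socket.socket:
    """Create a TCP listening socket bound to host:port (port 0 = any free port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    sock.listen()
    return sock

# Loopback only: connections are accepted solely from the local machine.
local_only = open_listener("127.0.0.1")
# All interfaces: other machines can connect unless a firewall blocks them.
all_interfaces = open_listener("0.0.0.0")

print(local_only.getsockname()[0])      # 127.0.0.1
print(all_interfaces.getsockname()[0])  # 0.0.0.0

local_only.close()
all_interfaces.close()
```

With a loopback bind, restricting access no longer depends on the firewall being up.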

Update the models directory with the change to Stanford NER 4.4.0

Previously, the Entity Extraction tool had difficulty identifying Organization Names. For example, when looking at job offer documents, the tool would successfully identify an Organization Name, but it would identify additional elements incorrectly as an Organization Name. This issue has now been addressed with the upgrade to Stanford Named Entity Recognizer (NER) 4.4.0.

Change the tika default temp directory to be the same as the extraction servlet upload directory

Previously, Tika would sometimes store temporary files in C:\windows\temp. Tika temporary files are now stored in: C:\Shinydocs\extraction-service\temp\shinydocs-extraction.

Extraction Service is explicitly referencing a path to Java which breaks when upgrading (using Amazon Corretto)

Both Indexer and Visualizer use a .BAT file to launch their services and, as a result, can reference the JAVA_HOME variable. Previously, in the NSSM service editor, the Extraction Service pointed to an explicit Application Path when using Amazon Corretto. This meant that any upgrade to Amazon Corretto would prevent the Shinydocs Extraction Service from operating properly.

This issue has been resolved.

Schedule Extraction Service cleanup more frequently

Previously, the Extraction Service Cleanup required both frequency and delay (duration) to be set to the same value (in minutes). This meant that, set to a value of 60, the cleanup service was required to keep files for a full 60 minutes (duration). Since the service would only run every 60 minutes (frequency), the temp files that accumulated during the first 60 minutes would remain in the TMP folder and not be cleaned up until the next cleanup at 120 minutes.

To ensure more frequent cleanup of the TMP folder, the Service Configuration (service.config) file has been updated with a new configuration (Service.CleanupFrequency). Now the cleanup process is controlled by two parameters:

Service.CleanupDelay = 60

Service.CleanupFrequency = 1

The Extraction Service Cleanup now runs every 1 minute (frequency) and removes files that were created 60 minutes or more ago (duration).
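The effect of separating the two parameters can be worked through with a small Python sketch. It assumes, for illustration only, that cleanup runs at exact multiples of the frequency counted from service start (minute 0). The removal_minute function is hypothetical and is not part of the service.

```python
def removal_minute(created_minute: int, delay: int, frequency: int) -> int:
    """Return the minute of the first cleanup run that removes the file.

    Assumption: cleanup runs at minutes 0, frequency, 2 * frequency, ...
    and removes any file that is at least `delay` minutes old.
    """
    eligible = created_minute + delay      # earliest minute the file may be removed
    runs = -(-eligible // frequency)       # ceiling division: first run at or after that minute
    return runs * frequency

# Old behaviour: Service.CleanupDelay = 60 with cleanup running every 60 minutes.
print(removal_minute(created_minute=1, delay=60, frequency=60))  # 120

# New behaviour: Service.CleanupDelay = 60, Service.CleanupFrequency = 1.
print(removal_minute(created_minute=1, delay=60, frequency=1))   # 61
```

Under these assumptions, a file created one minute after startup previously stayed in the TMP folder for 119 minutes and now stays for 60.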

Shinydocs Extraction Service in Paused status after performing an installation and unable to start

Previously, when installing the Shinydocs Extraction Service with the Use JAVA_Home System Environment Variable checked, the shinydocs-extraction service was set to a Paused status.

The Extraction Service is now operating as expected. When installing the Extraction Service with the Use JAVA_Home System Environment Variable checked, the shinydocs-extraction service is set to a Running status.

Fix the uninstaller to remove the previous installation folder

Previously, there was an issue with the uninstaller not removing the contents of the previous installation folder. The uninstaller is now working as expected. To learn how to uninstall Shinydocs Extraction Service, see Uninstalling Shinydocs Extraction Service.

Add timeout for long running extractions

To prevent files from taking an excessively long time during text extraction, a timeout-seconds parameter has been added to the cognitivetoolkit.config file. Using this parameter, you can set a value (in seconds) after which the text extraction process stops on that item. The process then continues with the next item in the queue.
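The "time out and move on to the next item" behaviour can be sketched in Python as follows. This is an illustration only, not how the Cognitive Toolkit implements timeout-seconds: extract_all and fake_extract are hypothetical names, and this sketch merely stops waiting for a slow item rather than terminating the underlying work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def extract_all(items, extract, timeout_seconds):
    """Run extract(item) for each item in order; skip items that exceed the timeout."""
    results = {}
    for item in items:
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(extract, item)
        try:
            results[item] = future.result(timeout=timeout_seconds)
        except FutureTimeout:
            results[item] = None        # timed out: record it and move on
        finally:
            pool.shutdown(wait=False)   # do not block the queue on the slow item
    return results

def fake_extract(name):
    # Hypothetical stand-in for text extraction; "slow" items take one second.
    if name.startswith("slow"):
        time.sleep(1.0)
    return f"text of {name}"

print(extract_all(["a.docx", "slow.pdf", "b.txt"], fake_extract, timeout_seconds=0.2))
```

In this run the slow item is recorded as None while the items before and after it are still processed.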

Disable the HTTP Methods TRACK and/or TRACE

TRACE and TRACK are HTTP methods used for debugging web server connections. These methods have been identified as vulnerabilities during penetration (PEN) testing. For security reasons, the remote web server no longer supports the TRACE or TRACK HTTP methods.

Extraction Service: no details under description

Previously, no service description appeared in Windows Services after installing the Extraction Service. Now, a description of the Extraction Service appears in the Description column.

AddHashAndExtractedText - Not text extracting downloaded files

Previously, when using the AddHashAndExtractedText operation in the Cognitive Toolkit, downloaded files were not text extracted. These files contain zipped data inside the file itself, which the Extraction Service needs in order to parse them.
Note: The data is zipped WITHIN the .docx, .pptx, etc. file. The file extension is not .zip in this case.

A configuration change in the Shinydocs Extraction Service’s tika.config file sets the markLimit to unlimited (-1). This change ensures that the Shinydocs Extraction Service finds the metadata needed to parse. The AddHashAndExtractedText operation now works as expected with downloaded files.
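The note about zipped data can be illustrated with a short Python sketch. Office Open XML files such as .docx and .pptx are ZIP packages regardless of their extension. The example builds a tiny stand-in package in memory (not a valid Word document), and the looks_like_zip_container helper is hypothetical.

```python
import io
import zipfile

def looks_like_zip_container(data: bytes) -> bool:
    """Return True if the bytes form a ZIP archive, whatever the file extension."""
    return zipfile.is_zipfile(io.BytesIO(data))

# Build a minimal stand-in for a .docx package: a ZIP archive holding XML parts.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")
fake_docx = buffer.getvalue()

print(looks_like_zip_container(fake_docx))  # True
print(fake_docx[:2])                        # b'PK'

# The parts a parser needs sit inside the archive.
with zipfile.ZipFile(io.BytesIO(fake_docx)) as archive:
    print(archive.namelist())
```

A parser therefore has to read into the archive to find the parts it needs, which is the kind of data the note refers to.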