Metadata captured during content analysis

Common metadata

Metadata field

Analysis step field is produced

Description

creationTimeUtc

Crawling content

The time when the item was created.

duplicate-hash

Calculating duplicates

Whether the item is a duplicate or the original (determined by creation date).

entities-date

Identifying people, places and organizations

Recognized dates in the content.

entities-location

Identifying people, places and organizations

Recognized locations in the content.

entities-person

Identifying people, places and organizations

Recognized person names in the content.

extension

Crawling content

The file extension of the item.

extraction-error

Extracting digital and image content

An error related to text extraction.

extraction-type

Extracting digital and image content

The type of text extraction process used.

fullText

Extracting digital and image content

The text content of an item (up to 30,000 characters) that was extracted.

hash

Extracting digital and image content

The hash value of the item.

id

Crawling content

System property

lastWriteTimeUtc

Crawling content

The time when the item was last modified.

length

Crawling content

The size of the item.

name

Crawling content

The name of the item (commonly file name).

outdated-binary

Crawling content

Indicates whether the item has been updated since the last crawl.

parent

Crawling content

The parent directory or container of the file.

path

Crawling content

The file path, URL, or crumbs to where the item is.

path-valid

Crawling content

If during a crawl the item is no longer in the source under the same name and in the same place, path-valid will be false.

potential_pii

Identifying content with personal information

Whether the content contains potential personally identifiable information (PII).

rot_obsolete

Identifying non-valuable content

Indicates if the document is obsolete (related to ROT analysis).

rot_redundant

Identifying non-valuable content

Indicates if the document is redundant (related to ROT analysis).

rot_trivial

Identifying non-valuable content

Indicates if the document is trivial in nature (related to ROT analysis).

schemaType

Crawling content

System property

zero-byte

Crawling content

If the item has a size of 0 bytes (empty).
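
The flag-style fields in the table above lend themselves to a quick review pass. The following is a minimal sketch, assuming an item is available as a dictionary keyed by the field names listed above and that flag values arrive either as booleans or as the strings "true"/"false". The dictionary shape, the helper names `is_true` and `review_flags`, and the sample item are all hypothetical and are not part of the product.

```python
from typing import Any, Mapping

# Flag fields taken from the table above. How they are serialized is an assumption.
FLAG_FIELDS = ("rot_redundant", "rot_obsolete", "rot_trivial", "zero-byte", "potential_pii")


def is_true(value: Any) -> bool:
    """Treat True or the string 'true' (any case) as set; anything else as unset."""
    if isinstance(value, bool):
        return value
    return isinstance(value, str) and value.strip().lower() == "true"


def review_flags(item: Mapping[str, Any]) -> list[str]:
    """Return the names of the flags that mark an item for review."""
    flags = [name for name in FLAG_FIELDS if is_true(item.get(name))]
    # path-valid is inverted: false means the item is no longer at the crawled location.
    if "path-valid" in item and not is_true(item["path-valid"]):
        flags.append("path-valid=false")
    return flags


# Hypothetical item for illustration only.
example = {"name": "budget.xlsx", "rot_trivial": True, "zero-byte": "false", "path-valid": False}
print(review_flags(example))
```

An item that carries none of these fields yields an empty list, which keeps the helper safe to run on sources that have not been through every analysis step.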

Source-specific metadata

File system

File system specific

Metadata field

Analysis step field is produced

Description

fileInfoFlags

Crawling content

Various flags related to file information.

business_function

Classifying documents by path

File system out-of-the-box path-based classification representing the business function applicable to the item.

content_type

Classifying documents by path

File system out-of-the-box path-based classification representing the content type with the business function applicable to the item.

owner

Beta feature

The filesystem owner of the file.

Exchange Online

Exchange specific

Metadata field

Analysis step field is produced

Description

ccRecipients

Crawling content

List of carbon copy recipients in emails.

conversationId

Crawling content

The unique ID of the email conversation.

conversationTopic

Crawling content

The topic of the email conversation.

fromAddress

Crawling content

The sender's email address.

hasAttachments

Crawling content

Whether the email contains attachments.

importance

Crawling content

The importance level of the email.

internetMessageId

Crawling content

The unique message ID from Exchange.

isRead

Crawling content

Whether the email has been read.

receivedBy

Crawling content

The recipient of the email.

sensitivity

Crawling content

The sensitivity level of the email.

sentDateTime

Crawling content

The time the email was sent.

sentOnBehalfOfAddress

Crawling content

Email address the message was sent on behalf of.

toRecipients

Crawling content

List of email recipients.

weblink

Crawling content

The weblink to access the Exchange Online item.

NetDocuments

NetDocuments specific

Metadata field

Analysis step field is produced

Description

cabinet

Crawling content

The cabinet where the document is stored.

created

Crawling content

The document's creation date.

createdBy

Crawling content

The user who created the document.

createdByGuid

Crawling content

The unique ID of the user who created the document.

docId

Crawling content

The document ID.

docNum

Crawling content

The document number.

envId

Crawling content

The environment ID.

ext

Crawling content

The document extension as reported by NetDocuments.

modified

Crawling content

The document's modification date.

modifiedBy

Crawling content

The user who modified the document.

modifiedByGuid

Crawling content

The unique ID of the user who modified the document.

prop-checksum

Crawling content

Hash checksum from NetDocuments.

prop-checksumAlgorithm

Crawling content

The algorithm used to generate the checksum.

prop-custom-attributes

Crawling content

Custom attributes associated with the file.

repository

Crawling content

The repository where the document is stored.

size

Crawling content

The size of the document according to NetDocuments.

url

Crawling content

URL to the NetDocuments item.

OneDrive

OneDrive specific

Metadata field

Analysis step field is produced

Description

prop-CTag

Crawling content

The CTag property, used for tracking changes.

prop-ETag

Crawling content

The ETag property, a version identifier for the file.

prop-driveId

Crawling content

The ID of the drive where the file is stored.

prop-lastModifiedBy

Crawling content

The user who last modified the file.

prop-webUrl

Crawling content

The URL to access the file online in OneDrive.

SharePoint Online

SharePoint Online specific

Metadata field

Description

documentLibrary

The document library where the file is stored.

documentLibraryGuid

The unique ID of the document library.

guid

The globally unique identifier of the document.

meta-category

The category of the content.

path-valid

Indicates whether the file path is valid.

sharePointParentUrl

The URL of the parent SharePoint item.

sharePointUrl

The URL of the SharePoint document.

siteGuid

The unique ID of the SharePoint site.

siteServerRelativeUrl

The server-relative URL of the SharePoint site.

prop.*

prop.<value> fields represent special metadata from a particular source. For SharePoint Online, the prop fields will include things like:

  • created by

  • custom column information

    • prop.<column_name>

  • version

  • modified by

  • etc.
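
Because the set of prop.<value> fields differs from source to source, it can help to pull them out of an item generically rather than by name. The sketch below assumes an item is available as a dictionary keyed by field name. The helper `source_props` and the sample field names `prop.version` and `prop.Department` are hypothetical.

```python
from typing import Any, Mapping


def source_props(item: Mapping[str, Any], prefix: str = "prop.") -> dict[str, Any]:
    """Collect every field whose name starts with the prefix, with the prefix stripped."""
    return {key[len(prefix):]: value for key, value in item.items() if key.startswith(prefix)}


# Hypothetical SharePoint Online item for illustration only.
item = {
    "guid": "hypothetical-guid",
    "sharePointUrl": "https://example.invalid/doc",
    "prop.version": "3.0",
    "prop.Department": "Finance",
}
print(source_props(item))
```

Sources that prefix their fields with prop- instead (OneDrive, Teams) could be handled by passing `prefix="prop-"`.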

Teams

Teams specific

Metadata field

Analysis step field is produced

Description

prop-attachedToId

Crawling content

The ID the message is attached to.

prop-contentType

Crawling content

The content type of the message.

prop-contentUrl

Crawling content

The URL of the content.

prop-messageCreationDate

Crawling content

The creation date of the message.

prop-msId

Crawling content

The Microsoft ID associated with the message.

prop-sender

Crawling content

The sender of the message.

prop-senderId

Crawling content

The unique ID of the sender.

prop-subType

Crawling content

The subtype of the message.

prop-threadId

Crawling content

The thread ID the message belongs to.

prop-userId

Crawling content

The unique user ID associated with the message.

recipients

Crawling content

The recipients of the message.

iManage

iManage specific

Metadata field

Analysis step field is produced

Description

author

Crawling content

The author of the document.

author_description

Crawling content

A description of the author.

checksum

Crawling content

The checksum of the file.

class

Crawling content

The class of the document.

class_description

Crawling content

A description of the document class.

coauthoring_file_size_warning

Crawling content

Warning related to file size for co-authoring.

content_type

Crawling content

The MIME type or file format of the content.

create_date

Crawling content

The document's creation date.

create_profile_date

Crawling content

The profile creation date.

custom1

Crawling content

iManage Custom field 1. In the iManage interface, this is usually Client.

custom1_description

Crawling content

Description of custom field 1.

custom1_ssid

Crawling content

The SSID of custom field 1.

custom2

Crawling content

Custom field 2. In the iManage interface, this is usually Matter.

custom2_description

Crawling content

Description of custom field 2.

custom2_ssid

Crawling content

The SSID of custom field 2.

custom#

Crawling content

Other custom fields used in iManage are indexed under the field name custom#, where # is the field number (e.g. custom12).

custom#_description

Crawling content

Description of custom field.

custom#_ssid

Crawling content

The SSID of custom field.

database

Crawling content

The database where the document is stored.

default_security

Crawling content

The default security level.

document_number

Crawling content

The document number.

edit_date

Crawling content

The date the document was last edited.

edit_profile_date

Crawling content

The date the profile was last edited.

file_create_date

Crawling content

The date the file was created.

file_edit_date

Crawling content

The date the file was last edited.

is_checked_out

Crawling content

Whether the document is checked out.

is_declared

Crawling content

Whether the document is declared.

is_external

Crawling content

Whether the document is marked as external.

is_external_as_normal

Crawling content

Whether external content is treated as normal.

is_hipaa

Crawling content

Whether the document is HIPAA-compliant.

is_in_use

Crawling content

Whether the document is currently in use.

is_related

Crawling content

Whether the document is related to others.

is_restorable

Crawling content

Whether the document can be restored.

iwl

Crawling content

A document workflow-related field.

last_user

Crawling content

The last user to modify the document.

last_user_description

Crawling content

A description of the last user.

operator

Crawling content

The operator managing the document.

operator_description

Crawling content

A description of the operator.

remediation-tag

Crawling content

Tags related to remediation actions.

retain_days

Crawling content

The number of days to retain the document.

size

Crawling content

The size of the document reported by iManage.

system_edit_date

Crawling content

The date the system last edited the document.

type

Crawling content

The type of document.

type_description

Crawling content

A description of the document type.

version

Crawling content

The version number of the document.

workspace_id

Crawling content

The workspace ID where the document resides.

workspace_name

Crawling content

The workspace name.

wstype

Crawling content

The workspace type.
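
The custom#, custom#_description, and custom#_ssid fields in the table above follow a regular naming pattern, so they can be grouped by field number. This is a minimal sketch, assuming an item is available as a dictionary keyed by field name. The helper `group_custom_fields` and the sample values are hypothetical.

```python
import re
from typing import Any, Mapping

# Matches custom1, custom12, custom1_description, custom2_ssid, and so on.
CUSTOM_FIELD = re.compile(r"^custom(\d+)(?:_(description|ssid))?$")


def group_custom_fields(item: Mapping[str, Any]) -> dict[int, dict[str, Any]]:
    """Group iManage custom fields by number, e.g. {1: {'value': ..., 'description': ...}}."""
    grouped: dict[int, dict[str, Any]] = {}
    for key, value in item.items():
        match = CUSTOM_FIELD.match(key)
        if match is None:
            continue  # not a custom field
        number = int(match.group(1))
        part = match.group(2) or "value"  # a bare custom# holds the value itself
        grouped.setdefault(number, {})[part] = value
    return grouped


# Hypothetical iManage item for illustration only.
item = {
    "custom1": "ACME",
    "custom1_description": "Acme Corp",
    "custom2": "0042",
    "custom12": "x",
    "class": "DOC",
}
print(group_custom_fields(item))
```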

Enhanced metadata (meta-)

These fields are only available on files processed with text extraction. Not all files will have all fields; which fields are present depends entirely on the content.

Metadata field name (subject to change)

Applies to MS Office files

Applies to PDF (and PDF-like) files

Applies to email messages

Applies to other files

Description

meta-protected

If the document was found to be encrypted or password protected, this field will be true.

Password-protected and encrypted files cannot have their text extracted or be OCRed, and only minimal metadata will be available. Files encrypted by DRM (such as Microsoft Purview) will also result in this field being true.

meta-category

If the Office document was saved with a category specified as an option, this field will capture it.

meta-charcterCount

The content’s character count as reported by the file’s metadata.

meta-charcterCountWithSpaces

The content’s character count with spaces as reported by the file’s metadata.

meta-company

If the Office document has any information saved for the “company” field, this will be captured in this field.

meta-contentStatus

The document status, typically taken from the “status” of the document. It is set by using the “status” field when saving or modifying a document, or by selecting “Mark as Final”.

meta-contentType

MIME type of the file (e.g. application/json, text/plain).

meta-creationTimeUtc

The internal creation timestamp of when the file was created. It is normal for this value to be different than the dates shown in your repository.

meta-creator

The creator is typically the user, organization, or application that was used to create the file.

meta-creatorTool

Similar to meta-producerLibrary, but with some additional details about the tool used to create the PDF (e.g. Adobe InDesign CC, PFU ScanSnap Cloud).

meta-digitallySigned

Whether the document is digitally signed. This refers to a digital (computer-generated) signature, not to a person's own signature placed on the document.

meta-docSecurity

Document security is one of 4 options:

  • (1) Password protected

  • (2) Recommended read-only

  • (4) Enforced read-only

  • (8) Locked for annotation

Value (1) is unlikely to appear when crawling, because this data cannot be extracted from password-protected files. It will only show once the document is unlocked.

meta-encoding

The text encoding of the file. This is primarily populated for file types that rely on encoding, such as txt, csv, reg, and other non-XML text-based files.

meta-language

If the PDF contains information about the language of the file (e.g. en-us), this field will capture it. Languages are normalized to lowercase.

meta-lastAuthor

Last author is typically available on Word, Excel, and PowerPoint files. This is the last author known to the file itself, not to the repository where it resides.

meta-lastWriteTimeUtc

The internal modified timestamp of when the file was last modified. It is normal for this value to be different than the dates shown in your repository.

meta-lineCount

The content’s line count as reported by the file’s metadata.

meta-producerLibrary

The PDF library that was used to create the document (e.g. Microsoft® Word 2019, Acrobat Distiller 20.0 (Windows), Microsoft® Excel® 2019).

meta-revision

How many revisions the document has gone through according to the internal metadata of the file.

meta-subject

The subject or description set when the file was created. For email messages, this is the email subject.

meta-tags

If tags exist on the document, they will be captured in this field. Tags are specified when saving an Office file. Some may refer to this field as “keywords” of the document.

meta-templateUsed

The template used when creating the document, if available. If only the default template (Normal.dotm) was used, this field will not be populated.

meta-title

If a title was saved to the document (sometimes auto-generated) this field will capture it, as well as email message titles, which are often similar to the subject.

page-count

The total page count reported by the document’s metadata.

meta-imageCount

Number of images in the Microsoft Office document. This number is based on the metadata in the file. Most Office documents are compatible. If your document does not contain this information, there may be an issue with how the file is being saved.

meta-tableCount

Number of ‘table’ objects in a Microsoft Office document as reported by the metadata.
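
The numeric meta-docSecurity values listed in the table above (1, 2, 4, 8) can be translated back into their labels. The sketch below treats the value as a set of bit flags, which is an assumption based on the values being powers of two. A single value such as 4 still maps to exactly one label. The helper `describe_doc_security` is hypothetical, and the value is assumed to arrive as an integer or a numeric string.

```python
from typing import Union

# Labels copied from the meta-docSecurity description above.
DOC_SECURITY_LABELS = {
    1: "Password protected",
    2: "Recommended read-only",
    4: "Enforced read-only",
    8: "Locked for annotation",
}


def describe_doc_security(value: Union[int, str]) -> list[str]:
    """Return the label for each bit set in the value (an empty list if none is set)."""
    number = int(value)  # accepts 4 as well as "4"
    return [label for bit, label in DOC_SECURITY_LABELS.items() if number & bit]


print(describe_doc_security(4))
print(describe_doc_security("2"))
```

If the field only ever holds one of the four listed values, a plain dictionary lookup on `DOC_SECURITY_LABELS` would do the same job.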

With the release of Shinydocs Pro 24.7 comes a new suite of metadata captured from files that are processed via text extraction (check out https://help.shinydocs.com/shinydocs-pro/24.3.0/digital-text-extraction-document-types for more information on the types of files eligible for text extraction).

Internal document metadata can vary greatly from file to file. This metadata architecture is governed by organizations like Microsoft and Adobe, and by the open-source community. Shinydocs makes its best attempt to parse these values from the document, giving you extremely valuable information about your data. Metadata may not be available on all files and document types, even where a field is noted as applying to that file type, because these values are not enforced in a document or in the application used to create or modify it.

Metadata extraction is only performed when digital text extraction and OCR are used.

