Metadata captured during content analysis
Common metadata
Metadata field | Analysis step that produces the field | Description |
---|---|---|
creationTimeUtc | Crawling content | The time when the item was created. |
duplicate-hash | Calculating duplicates | Whether the item is a duplicate or the original (determined by creation date). |
entities-date | Identifying people, places and organizations | Recognized dates in the content. |
entities-location | Identifying people, places and organizations | Recognized locations in the content. |
entities-person | Identifying people, places and organizations | Recognized person names in the content. |
extension | Crawling content | |
extraction-error | Extracting digital and image content | An error related to text extraction. |
extraction-type | Extracting digital and image content | The type of text extraction process used. |
fullText | Extracting digital and image content | The text content of an item (up to 30,000 characters) that was extracted. |
hash | Extracting digital and image content | The hash value of the item. |
id | Crawling content | System property |
lastWriteTimeUtc | Crawling content | The time when the item was last modified. |
length | Crawling content | The size of the item. |
name | Crawling content | The name of the item (commonly file name). |
outdated-binary | Crawling content | Indicates whether the item has been updated since the last crawl. |
parent | Crawling content | The parent directory or container of the file. |
path | Crawling content | The file path, URL, or breadcrumb trail to where the item is located. |
path-valid | Crawling content | If during a crawl the item is no longer in the source under the same name and in the same place, path-valid will be false. |
potential_pii | Identifying content with personal information | Whether the content contains potential personally identifiable information (PII). |
rot_obsolete | Identifying non-valuable content | Indicates if the document is obsolete (related to ROT analysis). |
rot_redundant | Identifying non-valuable content | Indicates if the document is redundant (related to ROT analysis). |
rot_trivial | Identifying non-valuable content | Indicates if the document is trivial in nature (related to ROT analysis). |
schemaType | Crawling content | System property |
zero-byte | Crawling content | Whether the item has a size of 0 bytes (empty). |
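
As a minimal sketch of how the fields in the table above might be combined, the following Python snippet labels items for review. It assumes each item is available as a plain dictionary keyed by the field names above and that the flag fields hold booleans. The `triage` helper, the label names, and the sample records are hypothetical and not part of the product.

```python
from typing import Any

# Flag fields taken from the table above. Storing them as booleans is an
# assumption made for this sketch; check your own index before relying on it.
ROT_FLAGS = ("rot_redundant", "rot_obsolete", "rot_trivial")


def triage(item: dict[str, Any]) -> list[str]:
    """Return review labels for one item (illustrative helper, not a product API)."""
    labels = []
    # path-valid is false when the item is no longer found under the same
    # name and in the same place in the source.
    if item.get("path-valid") is False:
        labels.append("missing-from-source")
    if item.get("zero-byte"):
        labels.append("empty")
    # Keep every ROT flag that is set on the item.
    labels.extend(flag for flag in ROT_FLAGS if item.get(flag))
    if item.get("potential_pii"):
        labels.append("potential_pii")
    return labels


# Made-up sample records for demonstration only.
items = [
    {"name": "report.docx", "path-valid": True, "rot_obsolete": True, "potential_pii": True},
    {"name": "empty.txt", "path-valid": True, "zero-byte": True, "rot_trivial": True},
    {"name": "moved.pdf", "path-valid": False},
]

for item in items:
    print(item["name"], triage(item))
```

Missing fields are read with `dict.get`, so an item that lacks a flag is simply treated as not flagged.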
Source-specific metadata
Enhanced metadata (meta-)
These fields are only available on files processed with text extraction. Not all files will have all fields; which fields are present depends entirely on the content.
Metadata field name | Applies to MS Office files | Applies to PDF (and PDF-like) files | Applies to email messages | Applies to other files | Description |
---|---|---|---|---|---|
meta-protected | ✅ | ✅ | ✅ | ✅ | If the document was found to be encrypted or password protected, this field will be true. Password-protected and encrypted files cannot have their text extracted or be OCRed, and only minimal metadata will be available. Files encrypted by DRM (like Purview) will also result in this field being true. |
meta-category | ✅ | | | | If the Office document was saved with a category specified as an option, this field will capture it. |
meta-charcterCount | ✅ | | | | The content’s character count as reported by the file’s metadata. |
meta-charcterCountWithSpaces | ✅ | | | | The content’s character count with spaces as reported by the file’s metadata. |
meta-company | ✅ | | | | If the Office document has any information saved for the “company” field, it will be captured in this field. |
meta-contentStatus | ✅ | | | | Document status is typically defined in the “status” of the document. This includes using the “status” field when saving or modifying a document, or selecting “Mark as Final”. |
meta-contentType | ✅ | ✅ | ✅ | ✅ | MIME type of the file (e.g. application/json, text/plain). |
meta-creationTimeUtc | ✅ | ✅ | ✅ | ✅ | The internal creation timestamp of when the file was created. It is normal for this value to be different than the dates shown in your repository. |
meta-creator | ✅ | ✅ | ✅ | | The creator is typically the user, organization, or application that was used to create the file. |
meta-creatorTool | | ✅ | | | Similar to meta-producerLibrary, but with some additional details about the tool used in the creation of the PDF (e.g. Adobe InDesign CC, PFU ScanSnap Cloud). |
meta-digitallySigned | ✅ | | | | Indicates digitally signed documents. This refers to a digital, computer-generated signature, not to a document that a person has signed with their own signature. |
meta-docSecurity | ✅ | | | | Document security is one of 4 options. Option (1) is unlikely to occur when crawling, because password-protected files cannot have this data extracted; it will only show when the document is unlocked. |
meta-encoding | | | | ✅ | Encoding will primarily appear on file types that rely on the encoding, such as txt, csv, reg, and other text-based files that are non-XML. |
meta-language | | ✅ | | | If the PDF contains information about the language of the file (e.g. en-us), this field will capture it. Languages are normalized to lowercase. |
meta-lastAuthor | ✅ | | | | Last author is typically available on Word, Excel, and PowerPoint files. This is the last author known to the file, not to the repository in which it resides. |
meta-lastWriteTimeUtc | ✅ | ✅ | ✅ | ✅ | The internal modified timestamp of when the file was last modified. It is normal for this value to be different than the dates shown in your repository. |
meta-lineCount | ✅ | | | | The content’s line count as reported by the file’s metadata. |
meta-producerLibrary | | ✅ | | | The PDF library that was used to create the document (e.g. Microsoft® Word 2019, Acrobat Distiller 20.0 (Windows), Microsoft® Excel® 2019). |
meta-revision | ✅ | | | | How many revisions the document has gone through according to the internal metadata of the file. |
meta-subject | ✅ | ✅ | ✅ | | The subject or description used when the file was made; for email messages, the email subject. |
meta-tags | ✅ | | | | If tags exist on the document, they will be captured in this field. Tags are specified when saving an Office file. Some may refer to this field as the “keywords” of the document. |
meta-templateUsed | ✅ | | | | Notes the template used when creating the document, if available. If no template is used (that is, the default Normal.dotm applies), this field will not be updated. |
meta-tite | ✅ | ✅ | ✅ | ✅ | If a title was saved to the document (sometimes auto-generated) this field will capture it, as well as email message titles, which are often similar to the subject. |
page-count | ✅ | ✅ | | | The total page count reported by the document’s metadata. |
meta-imageCount | ✅ | | | | Number of images in the Microsoft Office document. This number is based on the metadata in the file. |
meta-tableCount | ✅ | | | | Number of ‘table’ objects in a Microsoft Office document as reported by the metadata. |
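
Because not every file carries every enhanced field, and because the internal timestamps can legitimately differ from the repository dates, code that consumes these fields should probably treat each one as optional. The following Python sketch shows one way to do that. It assumes items are plain dictionaries keyed by the field names above and that timestamps are ISO 8601 strings. The `parse_utc` and `summarize` helpers, the output keys, and the sample values are hypothetical.

```python
from datetime import datetime
from typing import Any, Optional


def parse_utc(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp. The string format is an assumption."""
    if not value:
        return None
    # Python versions before 3.11 do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize(item: dict[str, Any]) -> dict[str, Any]:
    """Read enhanced metadata defensively (illustrative helper, not a product API)."""
    internal = parse_utc(item.get("meta-lastWriteTimeUtc"))
    repository = parse_utc(item.get("lastWriteTimeUtc"))
    # Only compare the two timestamps when both are present.
    if internal is not None and repository is not None:
        timestamps_differ: Optional[bool] = internal != repository
    else:
        timestamps_differ = None
    return {
        "protected": bool(item.get("meta-protected", False)),
        "language": item.get("meta-language"),  # may be absent on many files
        "timestamps_differ": timestamps_differ,
    }


# Made-up sample record for demonstration only.
sample = {
    "meta-protected": False,
    "meta-language": "en-us",
    "meta-lastWriteTimeUtc": "2023-05-01T10:00:00Z",
    "lastWriteTimeUtc": "2024-01-15T08:30:00Z",
}
print(summarize(sample))
print(summarize({}))
```

Returning `None` for a missing value, rather than a default such as an empty string, keeps "not reported by the file" distinct from "reported as empty".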
With the release of Shinydocs Pro 24.7 comes a new suite of metadata captured from files that are processed via text extraction (check out https://help.shinydocs.com/shinydocs-pro/24.3.0/digital-text-extraction-document-types for more information on the types of files eligible for text extraction).
Internal document metadata can vary greatly from file to file. This metadata architecture is governed by organizations like Microsoft, Adobe, and the open-source community. Shinydocs makes its best attempt to parse these values from the document, giving you extremely valuable information about your data. Metadata may not be available on all files and document types, even if noted in the table, because these values are not enforced in a document or in the application used to create or modify it.
Metadata extraction is only performed when digital text extraction and OCR are used.