What is an index?
The value of indexes
When working with Shinydocs, one of your first steps is to crawl your organization’s targeted content source(s) and create one or more indexes. These indexes form a comprehensive inventory of all the files located across your repositories. This not only enables your organization to discover detailed information about what data exists and where it exists, but also provides key insights that support the success of other data initiatives such as cleaning up content, deploying an enterprise search, and migrating data.
What do indexes look like?
Indexes are stored in the Analytics Engine, which is a component of the Cognitive Suite.
A basic index contains various fields of metadata, including:
Date created
Last modified date
File extension
File size
File name
File path
Each subsequent crawl or command that is run on an index creates additional fields of metadata within the index.
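For illustration only, a single crawled file might be represented in an index as a record like the following. The field names and values here are hypothetical, not the exact schema used by the Analytics Engine:

```python
# A minimal sketch of a basic index record for one crawled file.
# Field names and values are illustrative; the actual schema in your index may differ.
basic_record = {
    "file_name": "Q3-budget.xlsx",
    "file_path": r"\\fileshare\finance\2023\Q3-budget.xlsx",
    "file_extension": ".xlsx",
    "file_size": 48213,                      # size in bytes
    "date_created": "2023-07-04T09:12:33Z",
    "last_modified": "2023-09-18T16:45:02Z",
}
```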
For example:
After a hash crawl is run, a field containing the hash value is added to each file.
Text extraction adds a field to the index that contains the full text of a document. This field has a limit of 30,000 characters, which typically equates to the first 4-6 pages of a document.
Entity extraction adds specific fields depending on the targeted entities.
Once ROT (redundant, obsolete, and trivial) rules are run against an index, files in the index that meet the conditions for ROT are tagged with the corresponding rule.
Additional classifications can be added based on a file’s properties, characteristics, or attributes.
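Continuing the hypothetical record above, enrichment might add fields along these lines. Again, the field names and values are illustrative, not the exact ones produced by each crawl or command:

```python
# Hypothetical fields that subsequent crawls and commands could add to the
# basic_record sketched earlier. Names and values are illustrative only.
enrichment_fields = {
    "hash": "9c6b0571d8f1a2e4...",                    # added by a hash crawl (truncated for readability)
    "full_text": "Quarterly budget summary for ...",  # text extraction, capped at 30,000 characters
    "entities": {                                     # entity extraction, per targeted entity type
        "organization": ["Acme Corp"],
        "money": ["$1,200,000"],
    },
    "rot_rule": "Not modified in 7+ years",           # tag applied when a ROT rule's conditions are met
    "classification": "Financial record",             # additional classification
}

indexed_record = {**basic_record, **enrichment_fields}  # the enriched record
```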
Understanding your data
One of the most powerful outcomes gained from your indexes is the ability to fully understand your data. Once you understand what data exists, you can leverage it in many ways across your organization. Our Visualizer allows you to explore the content within your indexes and gain a clear picture of what exists. Through this process you can uncover key insights and trends, identify how to maximize the value of your data, and make informed decisions about how to optimize your data strategy moving forward.
Enriching indexes
Enriching your data refers to the process of adding meaning and context to your data to increase its value, purpose, and visibility within your organization. There are many ways you can enrich your indexes, including the following:
Hashing: Hashing is the first step in identifying duplicate content. A hash value is assigned to each file during a hash crawl. Hash values are derived from the binary content of a file, meaning that each unique file has a unique hash value, while duplicate files share the same hash value (see the sketch after this list).
Extraction: Extraction is most commonly performed through text extraction and entity extraction. Text extraction adds the full text of a file to the index, whereas entity extraction pulls specific pieces of text from your content into the index. Entities can include location, person, organization, money, percent, date, and time.
Tagging: Tagging refers to any action of adding descriptive metadata fields to the index for one or more items. These metadata tags not only identify attributes of the content but also add meaning and context to your data. Tagging helps you manage your data by making specific information easy to isolate and locate. For example, you can tag files as duplicate or ROT content, or add classifications.
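As a concrete illustration of the hashing step, the sketch below computes a hash from a file’s binary content. SHA-256 is used here only as an example; the actual algorithm used by a hash crawl is not specified in this article:

```python
import hashlib
from pathlib import Path

def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's binary content in chunks.

    SHA-256 is an assumption made for this sketch. The key property is the
    one described above: files with identical bytes produce identical hash
    values, so matching hashes indicate duplicate content.
    """
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```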
Data enrichment helps to uncover the value contained within your content, thus increasing your organization’s ability to fully leverage the wealth of its information.
Cleaning up your data
Cleaning up your content has many benefits, such as gaining insights into your data, freeing up storage space, removing content that carries risk, and improving employee productivity. Realizing these benefits relies on your indexes. Enriching your indexes through hashing, extraction, and tagging allows you to take action on the identified content appropriately and confidently.
For example:
Duplicate content can be easily identified and actioned once it has been hashed and tagged (see the sketch after this list)
ROT content that has been tagged with the rule whose conditions it meets can be confidently actioned
Using classifications makes it easier to locate and analyze your content
Content that contains sensitive information, such as PII, can be removed to reduce risk
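As a rough sketch of the duplicate-identification step, the following groups index records by their hash value and keeps only the groups that contain more than one file. The `hash` and `file_path` field names are the hypothetical ones from the earlier record sketches:

```python
from collections import defaultdict

def find_duplicates(records: list[dict]) -> dict[str, list[str]]:
    """Group hashed records by hash value and return only the groups
    with more than one file path, i.e. duplicate content candidates."""
    by_hash: dict[str, list[str]] = defaultdict(list)
    for record in records:
        if "hash" in record:
            by_hash[record["hash"]].append(record["file_path"])
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Tagging the files in each returned group as duplicates in the index is what makes them easy to isolate and action later.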
Removing duplicates and ROT from your content sources frees up storage space and makes it easier for employees to find what they are looking for, especially when this is paired with an enterprise search solution.
Improving enterprise search
An enterprise search solution, such as Discovery Search, relies on your indexes. When it comes to implementing an enterprise search, think of your indexes as the first point of contact between you and your data. When a search is conducted, Discovery Search queries the indexed data for the most relevant results. Additionally, enrichment techniques such as classification and extraction append more information to your indexes, which enables a more flexible and accurate search.
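The sketch below is a deliberately naive stand-in for what an enterprise search does with an index, not Discovery Search’s actual query or ranking logic. It scores each record by how often the query terms appear in its name, extracted text, and classification, which illustrates why enriched fields lead to more relevant results:

```python
def search(records: list[dict], query: str) -> list[dict]:
    """Naive relevance search over enriched index records.

    Scores each record by counting query-term occurrences in a few
    (hypothetical) indexed fields, then returns records sorted by score.
    """
    terms = query.lower().split()
    scored = []
    for record in records:
        haystack = " ".join(
            str(record.get(field, ""))
            for field in ("file_name", "full_text", "classification")
        ).lower()
        score = sum(haystack.count(term) for term in terms)
        if score:
            scored.append((score, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored]
```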
Supporting a data migration
Indexes play a pivotal role in supporting the success of any data migration project. By using your indexes to build an understanding of what data exists within your source location, you can make more informed decisions about your data migration plan.
This includes:
Cleaning up content before a migration so that existing problems aren’t perpetuated in the new destination, and effort is not spent on moving duplicate and ROT data
Deciding what will be included in or excluded from the migration (see the sketch after this list)
Anticipating challenges and minimizing the risk of surprises that could derail your success
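As a sketch of how an index supports these decisions, the following filters a set of index records down to migration candidates by excluding ROT-tagged files and all but one copy of each duplicate group. The field names (`rot_rule`, `hash`) are the hypothetical ones used in the earlier sketches, and `duplicate_hashes` could be the keys returned by the `find_duplicates` sketch above:

```python
def migration_candidates(records: list[dict], duplicate_hashes: set[str]) -> list[dict]:
    """Select records to migrate: drop ROT-tagged files and keep only
    the first copy of each known duplicate group."""
    seen_hashes: set[str] = set()
    keep = []
    for record in records:
        if record.get("rot_rule"):      # tagged by a ROT rule: exclude from the migration
            continue
        h = record.get("hash")
        if h in duplicate_hashes:
            if h in seen_hashes:        # a copy of this file was already kept
                continue
            seen_hashes.add(h)
        keep.append(record)
    return keep
```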
Using indexes created for the destination of the migration, you can visualize the results of the migration in the Shinydocs Visualizer. Built-in dashboards present these results in a form that can be shared with leadership and across your organization, or you can create your own dashboards.