Extracting and crawling PST files
Important Notes:
The user performing the query or crawl needs write permission on the file system for the extraction to succeed.
The user/service account needs write permission to expand the PST files into individual files in the same directory as the PST files.
There needs to be enough disk space on the file share to accommodate expanded PST files.
Outlook 2016/O365 or newer (64-bit is required) must be installed on the machine(s) running CognitiveToolkit.exe. You do not need to configure any inboxes or accounts in Outlook; the application just needs to be installed on that machine.
The .pst file extraction cannot violate the Windows 260-character path limitation.
Use --help in the command line to review available parameters and options for ExtractAndCrawlPst.
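The write-permission, disk-space, and path-length notes above can be checked before an extraction run. The following is a minimal sketch, not part of the Cognitive Toolkit: the function names, the `required_free_bytes` estimate, and the `min_headroom` threshold are all hypothetical values you would choose for your own environment.

```python
import os
import shutil
import tempfile

MAX_PATH = 260  # classic Windows limit; the count includes the terminating null


def path_headroom(directory: str, limit: int = MAX_PATH) -> int:
    """Characters left for anything created beneath `directory` (separators included)."""
    return (limit - 1) - len(os.path.abspath(directory))


def preflight(directory: str, required_free_bytes: int, min_headroom: int = 80) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    if not os.path.isdir(directory):
        return [f"{directory} is not a directory"]
    problems = []
    # Prove write permission by actually creating (and auto-deleting) a temp file.
    try:
        with tempfile.TemporaryFile(dir=directory):
            pass
    except OSError as exc:
        problems.append(f"cannot write to {directory}: {exc}")
    # Compare free space with the caller's estimate of the expanded PST size.
    free = shutil.disk_usage(directory).free
    if free < required_free_bytes:
        problems.append(f"only {free} bytes free, {required_free_bytes} required")
    # min_headroom is an assumed allowance for extracted folder and message names.
    if path_headroom(directory) < min_headroom:
        problems.append(f"only {path_headroom(directory)} path characters left")
    return problems
```

Run it as the same user or service account that will run CognitiveToolkit.exe, once per directory that contains .pst files. Because expanded size and extracted name lengths depend on the mailbox contents, treat the thresholds as rough estimates.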
How Does It Work?
Shinydocs Cognitive Toolkit has a tool for extracting .pst files found in the Shinydocs Index. Based on a query specified by the Shinydocs user, the tool will extract and expand any .pst files that are found and write those folders and messages to the file system in place. This is achieved through the ExtractAndCrawlPst tool.
The Cognitive Toolkit CrawlFileSystem tool also includes the "--crawl-pst" parameter. While it is possible to crawl and extract .pst files with this tool, it is NOT the recommended approach; we recommend using the ExtractAndCrawlPst tool instead. If you do extract .pst files using the CrawlFileSystem tool, the conditions listed under Important Notes apply. Note also that re-crawling and extracting the same .pst files re-extracts and expands them, adding duplicates of those folders and messages to the file system.
Crawl (metadata only) a file structure that contains .pst files. Do not extract any .pst files yet. For more information on the steps required for a metadata crawl, please refer to the following article: Initial Discovery - File Share
Run the following command:
CognitiveToolkit.exe ExtractAndCrawlPst --query "COG Batch Files\query-match-extension-pst-not-extracted.json" --index-server-url http://localhost:9200 --index-name shiny
The query-match-extension-pst-not-extracted.json will look like this:
{
  "bool": {
    "must": {
      "match": {
        "extension": "pst"
      }
    },
    "must_not": {
      "exists": { "field": "pst-extract-job" }
    }
  }
}
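Before running the extraction, it can be useful to see how many items the query would match. The sketch below assumes the index server at http://localhost:9200 is Elasticsearch-compatible and accepts the query above wrapped in a "query" key through the standard `_count` endpoint; that assumption, and the helper names, are not confirmed by this article.

```python
import json
import urllib.request

# Same clause as query-match-extension-pst-not-extracted.json.
PST_NOT_EXTRACTED = {
    "bool": {
        "must": {"match": {"extension": "pst"}},
        "must_not": {"exists": {"field": "pst-extract-job"}},
    }
}


def build_count_request(index_server_url: str, index_name: str, query: dict) -> urllib.request.Request:
    """Build a POST request for <server>/<index>/_count (assumed Elasticsearch-style API)."""
    url = f"{index_server_url.rstrip('/')}/{index_name}/_count"
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def count_matches(index_server_url: str, index_name: str, query: dict) -> int:
    """Send the request and return the number of matching index items."""
    with urllib.request.urlopen(build_count_request(index_server_url, index_name, query)) as resp:
        return json.load(resp)["count"]
```

For example, `count_matches("http://localhost:9200", "shiny", PST_NOT_EXTRACTED)` would report how many .pst files are still waiting to be extracted, if the server behaves as assumed.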
If you run this once, it matches all the *.pst files that have been discovered and extracts them. If you mistakenly run it a second time, files that have already been extracted are not extracted again: the Cognitive Toolkit tags each file to indicate that it has been extracted, and the query excludes items that have the pst-extract-job field. Users can then perform full text extraction on the .msg files that were created to extract the text from the individual email messages.
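The effect of the must / must_not pair on repeated runs can be illustrated with a small in-memory simulation. This is only a simplified model of the query logic, not how the Cognitive Toolkit is implemented: the `path` field, the sample items, and the job identifiers are hypothetical.

```python
def is_unextracted_pst(doc: dict) -> bool:
    """Simplified stand-in for the query: extension is pst AND pst-extract-job is absent."""
    return doc.get("extension") == "pst" and doc.get("pst-extract-job") is None


def run_extraction(index: list[dict], job_id: str) -> list[str]:
    """Pretend to extract every matching item, tagging it so later runs skip it."""
    extracted = []
    for doc in index:
        if is_unextracted_pst(doc):
            doc["pst-extract-job"] = job_id  # the tag that the must_not clause looks for
            extracted.append(doc["path"])
    return extracted


index = [
    {"path": r"\\share\mail\a.pst", "extension": "pst"},
    {"path": r"\\share\mail\b.pst", "extension": "pst"},
    {"path": r"\\share\docs\c.docx", "extension": "docx"},
]

first = run_extraction(index, "job-1")   # both .pst items match and get tagged
second = run_extraction(index, "job-2")  # nothing matches: the tag is already present
print(len(first), len(second))  # prints "2 0"
```

The second run returns nothing because every .pst item already carries the tag, which mirrors why rerunning ExtractAndCrawlPst with this query does not create duplicates.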