Detecting potential PII
To detect potential PII in an index, use FlagFieldBasedOnRegex with Shinydocs Cognitive Toolkit. This page provides the regex values to use in the command.
Beyond the regex, you can reduce false positives by adding keyword matching to your JSON query. For a relevant list of keywords, see Microsoft's reference on Data Loss Prevention: https://learn.microsoft.com/en-us/exchange/policy-and-compliance/data-loss-prevention/sensitive-information-types?view=exchserver-2019. Queries built using the Exchange keywords are available here: Using ElasticSearch for PII Discovery.
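As a rough illustration of keyword matching, the sketch below builds a query body that only selects documents whose fullText contains at least one keyword. It assumes that query.json accepts a standard Elasticsearch query body. The keyword list and the build_keyword_query helper are hypothetical and not part of Cognitive Toolkit; take the real terms from the Microsoft reference for the PII type you are targeting.

```python
import json

# Hypothetical keywords for illustration only; use the terms listed in the
# Microsoft sensitive-information-types reference for the relevant PII type.
KEYWORDS = ["health card", "OHIP", "health insurance"]


def build_keyword_query(keywords, field="fullText"):
    """Build an Elasticsearch bool query matching any one of the keywords."""
    return {
        "query": {
            "bool": {
                # One match_phrase clause per keyword, so multi-word
                # keywords must appear as a phrase.
                "should": [{"match_phrase": {field: kw}} for kw in keywords],
                # Require at least one keyword to be present.
                "minimum_should_match": 1,
            }
        }
    }


query = build_keyword_query(KEYWORDS)
# Print the JSON so it can be reviewed and saved as query.json.
print(json.dumps(query, indent=2))
```

Documents that contain a matching number but none of the keywords are then never passed to the regex, which is where the reduction in false positives would come from.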
Ontario Health Card
\b[1-9]\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}[a-zA-Z]{2}\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[1-9]\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}[a-zA-Z]{2}\b" --field-name "potential_PII" --value "Ontario Health Card" --search-field fullText
Ontario Driver’s License
\b[a-zA-Z]\d{4}[ -]{0,3}\d{5}[ -]{0,3}\d[0156]\d[0-3]\d\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[a-zA-Z]\d{4}[ -]{0,3}\d{5}[ -]{0,3}\d[0156]\d[0-3]\d\b" --field-name "potential_PII" --value "Ontario Drivers License" --search-field fullText
Canadian Passport Number
\b[a-zA-Z]{2}[0-9]{6,7}\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[a-zA-Z]{2}[0-9]{6,7}\b" --field-name "potential_PII" --value "Canadian Passport Number" --search-field fullText
Canadian Postal Code
\b[a-zA-Z]\d[a-zA-Z][ -]{0,3}\d[a-zA-Z]\d\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[a-zA-Z]\d[a-zA-Z][ -]{0,3}\d[a-zA-Z]\d\b" --field-name "potential_PII" --value "Canadian Postal Code" --search-field fullText
Canadian Social Insurance Number (SIN)
Use --valid-luhn true in your command for this PII type for more accurate results.
\b[1-79]\d{2}[ -]{0,3}\d{3}[ -]{0,3}\d{3}\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[1-79]\d{2}[ -]{0,3}\d{3}[ -]{0,3}\d{3}\b" --field-name "potential_PII" --value "Canadian Social Insurance Number" --search-field fullText --valid-luhn true
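The name of the --valid-luhn option suggests a Luhn checksum test on each regex match. The sketch below shows the standard Luhn algorithm in Python to illustrate why this filters out many random nine-digit numbers. It is not the toolkit's implementation, and the luhn_valid helper is a hypothetical name chosen for this example.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in candidate pass the Luhn checksum."""
    # Keep only the digits, so spaces and hyphens between groups are ignored.
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


# Both strings match the SIN regex, but only the first passes the checksum.
print(luhn_valid("130 692 544"))
print(luhn_valid("123 456 789"))
```

A string such as 123 456 789 has the right shape for a SIN but fails the checksum, so a Luhn test would drop it.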
Credit Card - AMEX
\b(3[47]\d{2}[ -]{0,3}\d{6}[ -]{0,3}\d{5})\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b(3[47]\d{2}[ -]{0,3}\d{6}[ -]{0,3}\d{5})\b" --field-name "potential_PII" --value "Credit Card AMEX" --search-field fullText
Credit Card - Visa
\b((4\d{3}[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4})|(4\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}))\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b((4\d{3}[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4})|(4\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}))\b" --field-name "potential_PII" --value "Credit Card Visa" --search-field fullText
Credit Card - MasterCard
\b(5[1-5]\d{2}|222[1-9]|22[3-9]\d|2[3-6]\d{2}|27[01]\d|2720)[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4}\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b(5[1-5]\d{2}|222[1-9]|22[3-9]\d|2[3-6]\d{2}|27[01]\d|2720)[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4}\b" --field-name "potential_PII" --value "Credit Card Mastercard" --search-field fullText
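Before running a pattern against a whole index, it can help to try it on a few sample strings. The following is a minimal sketch in Python and is not part of Cognitive Toolkit. The PATTERNS table copies the regexes from this page, while find_potential_pii and the sample text are hypothetical and made up for illustration. Python's re module may differ from the toolkit's regex engine in edge cases, so treat the result as a sanity check only.

```python
import re

# Regexes copied from this page, keyed by the value written to potential_PII.
PATTERNS = {
    "Ontario Health Card": r"\b[1-9]\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}[a-zA-Z]{2}\b",
    "Ontario Drivers License": r"\b[a-zA-Z]\d{4}[ -]{0,3}\d{5}[ -]{0,3}\d[0156]\d[0-3]\d\b",
    "Canadian Passport Number": r"\b[a-zA-Z]{2}[0-9]{6,7}\b",
    "Canadian Postal Code": r"\b[a-zA-Z]\d[a-zA-Z][ -]{0,3}\d[a-zA-Z]\d\b",
    "Canadian Social Insurance Number": r"\b[1-79]\d{2}[ -]{0,3}\d{3}[ -]{0,3}\d{3}\b",
    "Credit Card AMEX": r"\b(3[47]\d{2}[ -]{0,3}\d{6}[ -]{0,3}\d{5})\b",
    "Credit Card Visa": (
        r"\b((4\d{3}[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4})"
        r"|(4\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}))\b"
    ),
    "Credit Card Mastercard": (
        r"\b(5[1-5]\d{2}|222[1-9]|22[3-9]\d|2[3-6]\d{2}|27[01]\d|2720)"
        r"[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4}\b"
    ),
}


def find_potential_pii(text):
    """Return {label: [matched strings]} for every pattern that matches text."""
    found = {}
    for label, pattern in PATTERNS.items():
        # group(0) is the whole match, regardless of capture groups.
        hits = [m.group(0) for m in re.finditer(pattern, text)]
        if hits:
            found[label] = hits
    return found


# Made-up sample text containing a well-known test card number.
sample = "Card 4111 1111 1111 1111, postal code K1A 0B1."
for label, hits in find_potential_pii(sample).items():
    print(label, hits)
```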
Regex Explanations
Boundaries (\b) should be used at the beginning and end of a pattern to avoid matching in the middle of a string.
Although \s and a space are similar, \s also includes carriage return, line feed, and tab. The patterns on this page deliberately allow only a space or a hyphen as a separator, and they accept the sequence space, hyphen, space. That is what the segment [ -]{0,3} does: any combination of hyphens and spaces is allowed, 0 to 3 times.
Where possible, be consistent in using either [0-9] or \d. For these patterns both have the same meaning, and interchanging them makes a pattern harder to read. If you are not allowing all digits, you need to spell out the range of digits that is allowed. For example, [1-79] reads as 1 to 7 and 9. All characters within square brackets are individual characters, except a hyphen between two characters, which represents a range. A hyphen at the start or end of the brackets, as in [ -], is a literal hyphen.
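The points above can be checked with a small sketch. It uses Python's re module rather than Cognitive Toolkit, and the shortened three-digit patterns and variable names are hypothetical, chosen only to isolate the separator and character-class behaviour.

```python
import re

# Separator limited to spaces and hyphens, 0 to 3 characters long.
SPACE_OR_HYPHEN = re.compile(r"\b\d{3}[ -]{0,3}\d{3}\b")
# Same shape, but \s also accepts tabs, carriage returns and line feeds.
ANY_WHITESPACE = re.compile(r"\b\d{3}[\s-]{0,3}\d{3}\b")
# Character class meaning "1 to 7, or 9".
FIRST_DIGIT = re.compile(r"[1-79]")


def is_match(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return pattern.search(text) is not None


# Space, hyphen, space is three separator characters, so it is accepted.
print(is_match(SPACE_OR_HYPHEN, "123 - 456"))
# Four separator characters exceed the {0,3} limit.
print(is_match(SPACE_OR_HYPHEN, "123 -- 456"))
# A line feed is rejected by [ -] but accepted by [\s-].
print(is_match(SPACE_OR_HYPHEN, "123\n456"))
print(is_match(ANY_WHITESPACE, "123\n456"))

# List the digits that [1-79] accepts.
accepted = "".join(d for d in "0123456789" if FIRST_DIGIT.fullmatch(d))
print(accepted)
```

Using [ -] instead of \s keeps a number that ends one line from being joined to an unrelated number at the start of the next line.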