Detecting potential PII

To detect potential PII in an index, run the FlagFieldBasedOnRegex script with the Shinydocs Cognitive Toolkit. This page lists the regex patterns to pass to the command.

In addition to the regex, you can reduce false positives by adding keyword matching to your JSON query. For a relevant list of keywords, see Microsoft's reference on Data Loss Prevention: https://learn.microsoft.com/en-us/exchange/policy-and-compliance/data-loss-prevention/sensitive-information-types?view=exchserver-2019. Queries built using the Exchange keywords are available here: Using ElasticSearch for PII Discovery.
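
As a rough illustration of keyword matching, the sketch below builds a query that only selects documents containing at least one keyword phrase. It assumes that query.json holds a standard Elasticsearch query body and that the text lives in the fullText field used in the commands on this page. The function name and the keyword list are hypothetical; substitute keywords from the Microsoft reference.

```python
import json


def build_keyword_query(field, keywords):
    """Return an Elasticsearch bool query matching any of the keyword phrases."""
    return {
        "query": {
            "bool": {
                # One match_phrase clause per keyword; any single hit is enough.
                "should": [{"match_phrase": {field: kw}} for kw in keywords],
                "minimum_should_match": 1,
            }
        }
    }


# Hypothetical keywords for illustration only.
query = build_keyword_query("fullText", ["health card", "OHIP", "health insurance"])
query_json = json.dumps(query, indent=2)

# Save this output as query.json and pass it with -q.
print(query_json)
```

Narrowing the query this way means the regex is only evaluated against documents that already mention a related term.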

Ontario Health Card

CODE
\b[1-9]\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}[a-zA-Z]{2}\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[1-9]\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}[a-zA-Z]{2}\b" --field-name "potential_PII" --value "Ontario Health Card" --search-field fullText
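
Before running a command against a full index, it can help to try the pattern on a few sample strings. The sketch below does this with Python's re module, chosen here only for convenience; the toolkit may use a different regex engine, so treat the results as indicative. The helper name and all sample numbers are made up.

```python
import re

# Ontario Health Card pattern copied from this page.
ONTARIO_HEALTH_CARD = re.compile(
    r"\b[1-9]\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}[a-zA-Z]{2}\b"
)


def find_hits(text):
    """Return every substring of text that matches the pattern."""
    return [m.group(0) for m in ONTARIO_HEALTH_CARD.finditer(text)]


# Fabricated samples: the first three should match, the last two should not.
samples = [
    "Card: 1234-567-890-AB",
    "Card: 1234 567 890 ab",
    "Card: 1234567890AB",
    "Card: 0234-567-890-AB",  # leading 0 is excluded by [1-9]
    "Ref 1234-567-890",       # two-letter version code is missing
]

for sample in samples:
    print(sample, "->", find_hits(sample))
```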

Ontario Driver’s License

CODE
\b[a-zA-Z]\d{4}[ -]{0,3}\d{5}[ -]{0,3}\d[0156]\d[0-3]\d\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[a-zA-Z]\d{4}[ -]{0,3}\d{5}[ -]{0,3}\d[0156]\d[0-3]\d\b" --field-name "potential_PII" --value "Ontario Drivers License" --search-field fullText

Canadian Passport Number

CODE
\b[a-zA-Z]{2}[0-9]{6,7}\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[a-zA-Z]{2}[0-9]{6,7}\b" --field-name "potential_PII" --value "Canadian Passport Number" --search-field fullText

Canadian Postal Code

CODE
\b[a-zA-Z]\d[a-zA-Z][ -]{0,3}\d[a-zA-Z]\d\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[a-zA-Z]\d[a-zA-Z][ -]{0,3}\d[a-zA-Z]\d\b" --field-name "potential_PII" --value "Canadian Postal Code" --search-field fullText

Canadian Social Insurance Number (SIN)

Use --valid-luhn true in your command for this PII type to get more accurate results.

CODE
\b[1-79]\d{2}[ -]{0,3}\d{3}[ -]{0,3}\d{3}\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[1-79]\d{2}[ -]{0,3}\d{3}[ -]{0,3}\d{3}\b" --field-name "potential_PII" --value "Canadian Social Insurance Number" --search-field fullText --valid-luhn true
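
To show why a Luhn check reduces false positives, here is a minimal sketch of the standard Luhn checksum applied to regex matches. It illustrates the general algorithm and is not the toolkit's implementation of --valid-luhn. The function names are hypothetical and the sample numbers are fabricated.

```python
import re

# SIN pattern copied from this page.
SIN_PATTERN = re.compile(r"\b[1-79]\d{2}[ -]{0,3}\d{3}[ -]{0,3}\d{3}\b")


def luhn_valid(candidate):
    """Return True if the digits in candidate pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdecimal()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


def find_sins(text):
    """Return regex matches that also pass the Luhn checksum."""
    return [m.group(0) for m in SIN_PATTERN.finditer(text) if luhn_valid(m.group(0))]


# Both numbers match the regex, but only the first passes the checksum.
print(find_sins("SIN 130-692-544 and 123-456-789"))
```

Any nine-digit number in the right shape matches the regex, while only about one in ten random digit strings passes the checksum, which is why combining the two gives more accurate results.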

Credit Card - AMEX

CODE
\b(3[47]\d{2}[ -]{0,3}\d{6}[ -]{0,3}\d{5})\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b(3[47]\d{2}[ -]{0,3}\d{6}[ -]{0,3}\d{5})\b" --field-name "potential_PII" --value "Credit Card AMEX" --search-field fullText

Credit Card - Visa

CODE
\b((4\d{3}[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4})|(4\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}))\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b((4\d{3}[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4})|(4\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}))\b" --field-name "potential_PII" --value "Credit Card Visa" --search-field fullText

Credit Card - MasterCard

CODE
\b(5[1-5]\d{2}|222[1-9]|22[3-9]\d|2[3-6]\d{2}|27[01]\d|2720)[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4}\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b(5[1-5]\d{2}|222[1-9]|22[3-9]\d|2[3-6]\d{2}|27[01]\d|2720)[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4}\b" --field-name "potential_PII" --value "Credit Card Mastercard" --search-field fullText
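
The prefix alternation in the MasterCard pattern is the hardest part to read. The sketch below isolates that group and enumerates every four-digit prefix it accepts, to confirm that it covers 2221 to 2720 and 5100 to 5599 with no gaps. It uses Python's re module for convenience, and the variable names are hypothetical.

```python
import re

# Prefix group copied from the MasterCard pattern on this page.
MASTERCARD_PREFIX = re.compile(
    r"(5[1-5]\d{2}|222[1-9]|22[3-9]\d|2[3-6]\d{2}|27[01]\d|2720)"
)

# Try every four-digit string from 0000 to 9999 against the group.
accepted = [n for n in range(10000) if MASTERCARD_PREFIX.fullmatch(f"{n:04d}")]

# The two contiguous ranges the alternation is meant to cover.
expected = list(range(2221, 2721)) + list(range(5100, 5600))

print(accepted == expected)
print(accepted[0], accepted[-1], len(accepted))
```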

Regex Explanations

  • Boundaries (\b) should be used at the beginning and end to avoid matching in the middle of a string.

  • Although \s and a literal space are similar, \s also matches carriage return, line feed and tab. These patterns deliberately allow only a space or a hyphen, and they accept a sequence such as space, hyphen, space as a separator. That is what the segment [ -]{0,3} does: it allows any combination of hyphens and spaces, from 0 to 3 characters long.

  • Where possible, be consistent in using either [0-9] or \d. Both match the digits 0 to 9, and mixing them makes a pattern harder to read. If you are not allowing all digits, you need to spell out the range of digits allowed. For example, [1-79] reads as 1 to 7, and 9. Inside square brackets each character stands for itself, except the hyphen, which defines a range.
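
The points above can be checked with a few short tests. This is a small sketch using Python's re module, chosen for convenience; the simplified pattern and the sample strings are made up for illustration.

```python
import re

# Simplified pattern: two 3-digit groups with the [ -]{0,3} separator.
grouped = re.compile(r"\b\d{3}[ -]{0,3}\d{3}\b")

# Separator: 0 to 3 spaces or hyphens in any combination.
print(bool(grouped.search("123456")))      # no separator
print(bool(grouped.search("123 - 456")))   # space, hyphen, space
print(bool(grouped.search("123\t456")))    # tab is not allowed
print(bool(grouped.search("123 -- 456")))  # four separator characters

# Boundaries: \b stops matches inside a longer run of word characters.
print(bool(grouped.search("x123456y")))

# Character class: [1-79] accepts 1 to 7, and 9.
allowed = [d for d in "0123456789" if re.fullmatch(r"[1-79]", d)]
print(allowed)
```

The first two separator checks match, while the tab, the four-character separator, and the embedded string do not.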
