Detecting potential PII

To detect potential PII in an index, run the FlagFieldBasedOnRegex script with Shinydocs Cognitive Toolkit. This page provides the regex values to use in the command.

In addition to the regex, you can reduce false positives by adding keyword matching to your JSON query. For a relevant list of keywords, see Microsoft's Data Loss Prevention reference: https://learn.microsoft.com/en-us/exchange/policy-and-compliance/data-loss-prevention/sensitive-information-types?view=exchserver-2019. Queries built using the Exchange keywords are available in Using ElasticSearch for PII Discovery.
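
As a rough illustration of keyword matching, the sketch below builds an Elasticsearch query body that selects only documents whose text mentions at least one keyword. It assumes the query file accepts a standard Elasticsearch bool query and that the text lives in the fullText field used by the commands on this page. The keywords and the build_keyword_query helper are hypothetical placeholders, not values taken from the Microsoft list.

```python
import json


def build_keyword_query(field, keywords):
    """Return a query body matching documents that contain any keyword phrase.

    Hypothetical helper: the top-level "query" wrapper is an assumption
    about what the query file should contain.
    """
    return {
        "query": {
            "bool": {
                # Each keyword becomes one optional phrase clause.
                "should": [{"match_phrase": {field: kw}} for kw in keywords],
                # At least one of the phrases has to be present.
                "minimum_should_match": 1,
            }
        }
    }


# Placeholder keywords for illustration only.
query = build_keyword_query("fullText", ["health card", "OHIP"])
print(json.dumps(query, indent=2))
```

The printed JSON could be saved as the file passed with -q, so that the regex only runs against documents that already contain a relevant keyword.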

Low

Ontario Health Card

CODE
\b[1-9]\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}[a-zA-Z]{2}\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[1-9]\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}[a-zA-Z]{2}\b" --field-name "potential_PII" --value "Ontario Health Card" --search-field fullText
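
Before running a pattern against a full index, it can help to try it on a few sample strings. The sketch below does this for the Ontario Health Card pattern using Python's re module. The sample strings are made up, the find_candidates helper is hypothetical, and Python's regex engine is assumed to behave like the toolkit's for this simple pattern.

```python
import re

# Ontario Health Card pattern copied from this page.
HEALTH_CARD = re.compile(
    r"\b[1-9]\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}[a-zA-Z]{2}\b"
)


def find_candidates(text):
    """Return every substring of text that matches the pattern."""
    return [m.group(0) for m in HEALTH_CARD.finditer(text)]


# Made-up samples: spaced, compact, and one with an invalid leading zero.
samples = [
    "Card 1234 - 567 - 890 - AB on file",
    "1234567890AB",
    "0234-567-890-AB",
]
for sample in samples:
    print(repr(sample), "->", find_candidates(sample))
```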

Ontario Driver’s License

CODE
\b[a-zA-Z]\d{4}[ -]{0,3}\d{5}[ -]{0,3}\d[0156]\d[0-3]\d\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[a-zA-Z]\d{4}[ -]{0,3}\d{5}[ -]{0,3}\d[0156]\d[0-3]\d\b" --field-name "potential_PII" --value "Ontario Drivers License" --search-field fullText
Medium

Canadian Passport Number

CODE
\b[a-zA-Z]{2}[0-9]{6,7}\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[a-zA-Z]{2}[0-9]{6,7}\b" --field-name "potential_PII" --value "Canadian Passport Number" --search-field fullText

Canadian Postal Code

CODE
\b[a-zA-Z]\d[a-zA-Z][ -]{0,3}\d[a-zA-Z]\d\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[a-zA-Z]\d[a-zA-Z][ -]{0,3}\d[a-zA-Z]\d\b" --field-name "potential_PII" --value "Canadian Postal Code" --search-field fullText

Canadian Social Insurance Number (SIN)

For more accurate results, add --valid-luhn true to your command for this PII type.

CODE
\b[1-79]\d{2}[ -]{0,3}\d{3}[ -]{0,3}\d{3}\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[1-79]\d{2}[ -]{0,3}\d{3}[ -]{0,3}\d{3}\b" --field-name "potential_PII" --value "Canadian Social Insurance Number" --search-field fullText --valid-luhn true
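
To show why the Luhn check narrows the results, here is a minimal sketch of the standard Luhn algorithm applied to matches of the SIN pattern. It is not the toolkit's implementation: luhn_valid and find_sins are hypothetical names, and the sample numbers are synthetic. Separators are stripped first so that only digits reach the checksum.

```python
import re

# SIN pattern copied from this page.
SIN = re.compile(r"\b[1-79]\d{2}[ -]{0,3}\d{3}[ -]{0,3}\d{3}\b")


def luhn_valid(candidate):
    """Return True if the digits in candidate pass the Luhn checksum."""
    # Keep ASCII digits only; spaces and hyphens are dropped.
    digits = [int(ch) for ch in candidate if ch in "0123456789"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_sins(text):
    """Return regex matches that also pass the Luhn check."""
    return [m.group(0) for m in SIN.finditer(text) if luhn_valid(m.group(0))]


# Both numbers match the regex, but only the first passes the checksum.
print(find_sins("SIN 123 456 782 and 123-456-789"))
```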

Credit Card - AMEX

CODE
\b(3[47]\d{2}[ -]{0,3}\d{6}[ -]{0,3}\d{5})\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b(3[47]\d{2}[ -]{0,3}\d{6}[ -]{0,3}\d{5})\b" --field-name "potential_PII" --value "Credit Card AMEX" --search-field fullText

Credit Card - Visa

CODE
\b((4\d{3}[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4})|(4\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}))\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b((4\d{3}[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4})|(4\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}))\b" --field-name "potential_PII" --value "Credit Card Visa" --search-field fullText
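
The Visa pattern listed above wraps both alternatives in a single outer group so that the leading and trailing \b apply to each of them. The sketch below compares that form with an ungrouped variant on a made-up digit run that is too long to be a card number. It uses Python's re module, assumed to treat \b and alternation the same way as the toolkit's regex engine, and the matches helper is hypothetical.

```python
import re

ALT_16 = r"4\d{3}[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4}"
ALT_13 = r"4\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}"

# Outer group: both boundaries apply to both alternatives.
GROUPED = re.compile(r"\b((" + ALT_16 + ")|(" + ALT_13 + r"))\b")
# No outer group: the first alternative has no trailing boundary
# and the second has no leading boundary.
UNGROUPED = re.compile(r"\b(" + ALT_16 + ")|(" + ALT_13 + r")\b")


def matches(pattern, text):
    """Return all substrings of text matched by pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


card = "4111 1111 1111 1111"       # synthetic 16-digit number
long_run = "41111111111111119999"  # 20 digits, not a card number

print(matches(GROUPED, card), matches(UNGROUPED, card))
print(matches(GROUPED, long_run), matches(UNGROUPED, long_run))
```

In this sketch the ungrouped variant reports the first 16 digits of the 20-digit run as a hit, while the grouped form rejects it.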

Credit Card - MasterCard

CODE
\b(5[1-5]\d{2}|222[1-9]|22[3-9]\d|2[3-6]\d{2}|27[01]\d|2720)[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4}\b

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b(5[1-5]\d{2}|222[1-9]|22[3-9]\d|2[3-6]\d{2}|27[01]\d|2720)[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4}\b" --field-name "potential_PII" --value "Credit Card Mastercard" --search-field fullText

USA Social Security Number (Formatted)

CODE
(?is)(\b(?!666|000|9\d{2})\d{3} ?[\u002D\u007E\u00AD\u2010\u2011\u2012\u2013\u2212] ?(?!00)\d{2} ?[\u002D\u007E\u00AD\u2010\u2011\u2012\u2013\u2212] ?(?!0{4})\d{4}\b)

Example Command

CODE
%CognitiveToolkitExecutable% RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 --threads 4 --nodes-per-request 500 -q SSN.json --regex-pattern "(?is)(\b(?!666|000|9\d{2})\d{3} ?[\u002D\u007E\u00AD\u2010\u2011\u2012\u2013\u2212] ?(?!00)\d{2} ?[\u002D\u007E\u00AD\u2010\u2011\u2012\u2013\u2212] ?(?!0{4})\d{4}\b)" --field-name "potential_PII" --value "USA Social Security Number (Formatted)" --search-field fullText

USA Social Security Number (Proximity)

CODE
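(?is)((\b(Social Security|Social Security#|Soc Sec|SSN|SSNS|SSN#|SS#|SSID|Soc. Sec. No|tax(payer)? identification number|Tax ID Number)(?!\w).{0,300}\b((?!666|000|9\d{2})\d{3} ?(?!00)\d{2} ?(?!0{4})\d{4})\b)|(\b((?!666|000|9\d{2})\d{3} ?(?!00)\d{2} ?(?!0{4})\d{4})\b.{0,300}\b(Social Security|Social Security#|Soc Sec|SSN|SSNS|SSN#|SS#|SSID|Soc. Sec. No|tax(payer)? identification number|Tax ID Number)(?!\w)))

Example Command

CODE
%CognitiveToolkitExecutable% RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 --threads 4 --nodes-per-request 500 -q SSN.json --regex-pattern "(?is)((\b(Social Security|Social Security#|Soc Sec|SSN|SSNS|SSN#|SS#|SSID|Soc. Sec. No|tax(payer)? identification number|Tax ID Number)(?!\w).{0,300}\b((?!666|000|9\d{2})\d{3} ?(?!00)\d{2} ?(?!0{4})\d{4})\b)|(\b((?!666|000|9\d{2})\d{3} ?(?!00)\d{2} ?(?!0{4})\d{4})\b.{0,300}\b(Social Security|Social Security#|Soc Sec|SSN|SSNS|SSN#|SS#|SSID|Soc. Sec. No|tax(payer)? identification number|Tax ID Number)(?!\w)))" --field-name "potential_PII" --value "USA Social Security Number (Proximity)" --search-field fullText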

USA Passport Number

CODE
(?is)(\b(passport\s*(book|card)?\s*number|Passport.{1,150}Document Number|Passport.{1,150}Department of State|Passport No|Passport #|Passport[#:]|PassportID|Passportno|passportnumber|パスポート|パスポート番号|パスポートのNum|パスポート#|Numéro de passeport|Passeport n °|Passeport Non|Passeport #|Passeport#|PasseportNon|Passeportn °|Pasaporte)(?!\w).{0,500}\b[a-zA-Z0-9]\d{8}\b)|(\b[a-zA-Z0-9]\d{8}\b.{0,500}\b(passport\s*(book|card)?\s*number|Passport.{1,150}Document Number|Passport.{1,150}Department of State|Passport No|Passport #|Passport[#:]|PassportID|Passportno|passportnumber|パスポート|パスポート番号|パスポートのNum|パスポート#|Numéro de passeport|Passeport n °|Passeport Non|Passeport #|Passeport#|PasseportNon|Passeportn °|Pasaporte)(?!\w))

Example Command

CODE
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "(?is)(\b(passport\s*(book|card)?\s*number|Passport.{1,150}Document Number|Passport.{1,150}Department of State|Passport No|Passport #|Passport[#:]|PassportID|Passportno|passportnumber|パスポート|パスポート番号|パスポートのNum|パスポート#|Numéro de passeport|Passeport n °|Passeport Non|Passeport #|Passeport#|PasseportNon|Passeportn °|Pasaporte)(?!\w).{0,500}\b[a-zA-Z0-9]\d{8}\b)|(\b[a-zA-Z0-9]\d{8}\b.{0,500}\b(passport\s*(book|card)?\s*number|Passport.{1,150}Document Number|Passport.{1,150}Department of State|Passport No|Passport #|Passport[#:]|PassportID|Passportno|passportnumber|パスポート|パスポート番号|パスポートのNum|パスポート#|Numéro de passeport|Passeport n °|Passeport Non|Passeport #|Passeport#|PasseportNon|Passeportn °|Pasaporte)(?!\w))" --field-name "potential_PII" --value "US Passport Number" --search-field fullText

Medicare Beneficiary Number

CODE
(?i)\b(([1-9][AC-HJKMNP-RT-Y]{2}[0-9]-[AC-HJKMNP-RT-Y][AC-HJKMNP-RT-Y0-9][0-9]-[AC-HJKMNP-RT-Y]{2}[0-9]{2})|([1-9][AC-HJKMNP-RT-Y]{2}[0-9][AC-HJKMNP-RT-Y][AC-HJKMNP-RT-Y0-9][0-9][AC-HJKMNP-RT-Y]{2}[0-9]{2}))\b

Example Command

CODE
%CognitiveToolkitExecutable% RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 --threads 4 --nodes-per-request 500 -q medicare.json --regex-pattern "(?i)\b(([1-9][AC-HJKMNP-RT-Y]{2}[0-9]-[AC-HJKMNP-RT-Y][AC-HJKMNP-RT-Y0-9][0-9]-[AC-HJKMNP-RT-Y]{2}[0-9]{2})|([1-9][AC-HJKMNP-RT-Y]{2}[0-9][AC-HJKMNP-RT-Y][AC-HJKMNP-RT-Y0-9][0-9][AC-HJKMNP-RT-Y]{2}[0-9]{2}))\b" --field-name "potential_PII" --value "Medicare Number" --search-field fullText

Download

You can download these commands preconfigured in a batch file for your convenience. Specialized queries are also included.

Shinydocs-PII-v1.5.zip

FlagFieldBasedOnRegexScriptV2.cs

This version includes a minor update to the Luhn check function that ensures only digits are passed to the Luhn check algorithm. If you use the Luhn check (credit cards and SIN), this script is recommended.

FlagFieldBasedOnRegexScriptV2.cs

Regex Explanations

  • Boundaries (\b) should be used at the beginning and end to avoid matching in the middle of a string.

  • Although \s and a literal space are similar, \s also matches carriage return, line feed and tab. These patterns deliberately allow only a space or a hyphen as a separator, including the sequence space, hyphen, space. That is what the segment [ -]{0,3} does: it allows any combination of hyphens and spaces, from 0 to 3 characters long.

  • Where possible, be consistent in using either [0-9] or \d. Both have the same meaning for the purposes of these patterns, and mixing them makes an expression harder to read. If you are not allowing all digits, you need to spell out the allowed range. For example, [1-79] reads as 1 to 7, and 9. Every character within square brackets stands for itself, except a hyphen between two characters, which denotes a range.

  • A prefix of (?i) makes the expression case insensitive. Adding it can simplify an expression that contains many alphabetic characters.

  • A prefix of (?s) turns on single-line mode, in which the dot (.) also matches newline characters. This is required for proximity searches, where the keyword and the number are likely to sit on different lines of a document.

  • Proximity searches (USA Passport and SSN Proximity) place an expression such as .{0,500} between the keyword and the numeric pattern. It reads as 0 to 500 of any character between the two expressions, that is, within a proximity of 500 characters. The USA Passport pattern uses .{0,500}; the SSN Proximity pattern uses .{0,300}.
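
The points above can be checked with a short experiment. The sketch below uses Python's re module, which is assumed to treat these constructs the same way as the toolkit's regex engine. The proximity pattern is a simplified, hypothetical stand-in for the full SSN expression, and the sample text is made up.

```python
import re

# [1-79] is the range 1-7 plus the single character 9.
first_digit = re.compile(r"[1-79]")
allowed = [d for d in "0123456789" if first_digit.fullmatch(d)]
print(allowed)

# [ -]{0,3} accepts up to three spaces or hyphens, but no tabs or newlines.
separator = re.compile(r"[ -]{0,3}")
for candidate in ["", "-", " - ", "\t", " -- "]:
    print(repr(candidate), bool(separator.fullmatch(candidate)))

# Simplified proximity pattern: a keyword, up to 300 characters, nine digits.
PROXIMITY = r"\bSSN\b.{0,300}\b\d{9}\b"
text = "SSN on file:\n123456789"  # keyword and number on different lines

# Without (?s) the dot stops at the line break, so nothing is found.
without_s = re.search(PROXIMITY, text)
# With (?s) the dot also matches the newline, so the search succeeds.
with_s = re.search("(?s)" + PROXIMITY, text)
print(without_s is None, with_s is not None)
```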
