Detecting potential PII
To detect potential PII in an index, use the FlagFieldBasedOnRegex script with Shinydocs Cognitive Toolkit. This page provides the regex values to use in the command.
In addition to the regex, you can reduce false positives by adding keyword matching to your JSON query. For a relevant list of keywords, see the Microsoft reference on Data Loss Prevention sensitive information types: https://learn.microsoft.com/en-us/exchange/policy-and-compliance/data-loss-prevention/sensitive-information-types?view=exchserver-2019. Queries built using the Exchange keywords are available here: Using ElasticSearch for PII Discovery.
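As a rough illustration of keyword matching, the sketch below builds an Elasticsearch bool query that only selects documents containing at least one keyword phrase in fullText. The keyword list and the helper name build_keyword_query are hypothetical, and the sketch assumes the file passed with -q accepts standard Elasticsearch query DSL; check the queries on the linked page for the exact form used in practice.

```python
import json

# Hypothetical keywords; replace with terms relevant to the PII type.
KEYWORDS = ["health card", "OHIP", "health insurance"]


def build_keyword_query(keywords, field="fullText"):
    """Build a bool query that matches documents containing any keyword phrase."""
    return {
        "query": {
            "bool": {
                # One match_phrase clause per keyword; at least one must match.
                "should": [{"match_phrase": {field: kw}} for kw in keywords],
                "minimum_should_match": 1,
            }
        }
    }


# Print the JSON so it can be saved as the query file (for example query.json).
print(json.dumps(build_keyword_query(KEYWORDS), indent=2))
```

Narrowing the query this way means the regex only runs against documents that already mention a related term, which is where the reduction in false positives would come from.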
Ontario Health Card
\b[1-9]\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}[a-zA-Z]{2}\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[1-9]\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}[a-zA-Z]{2}\b" --field-name "potential_PII" --value "Ontario Health Card" --search-field fullText
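Before running the command against an index, it can help to try the pattern on a few sample strings. The sketch below does this with Python's re module, which is an assumption made for convenience: the toolkit script is C#, so the .NET regex engine is what actually runs, although this pattern uses only syntax the two engines share. All sample values are made up.

```python
import re

# Ontario Health Card pattern, copied from above.
HEALTH_CARD = re.compile(
    r"\b[1-9]\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}[a-zA-Z]{2}\b"
)


def find_health_cards(text):
    """Return every substring of text that matches the health card pattern."""
    return [m.group(0) for m in HEALTH_CARD.finditer(text)]


samples = [
    "Card: 1234-567-890-AB on file",   # hyphen separators
    "Card: 1234 - 567 - 890 - AB",     # space, hyphen, space separators
    "Card: 1234567890AB",              # no separators
    "Card: 0234-567-890-AB",           # leading zero, rejected by [1-9]
    "Card: 1234-567-890-ABC",          # three letters, rejected by the final \b
]

for sample in samples:
    print(sample, "->", find_health_cards(sample))
```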
Ontario Driver’s License
\b[a-zA-Z]\d{4}[ -]{0,3}\d{5}[ -]{0,3}\d[0156]\d[0-3]\d\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[a-zA-Z]\d{4}[ -]{0,3}\d{5}[ -]{0,3}\d[0156]\d[0-3]\d\b" --field-name "potential_PII" --value "Ontario Drivers License" --search-field fullText
Canadian Passport Number
\b[a-zA-Z]{2}[0-9]{6,7}\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[a-zA-Z]{2}[0-9]{6,7}\b" --field-name "potential_PII" --value "Canadian Passport Number" --search-field fullText
Canadian Postal Code
\b[a-zA-Z]\d[a-zA-Z][ -]{0,3}\d[a-zA-Z]\d\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[a-zA-Z]\d[a-zA-Z][ -]{0,3}\d[a-zA-Z]\d\b" --field-name "potential_PII" --value "Canadian Postal Code" --search-field fullText
Canadian Social Insurance Number (SIN)
Use --valid-luhn true in your command for this PII type for more accurate results.
\b[1-79]\d{2}[ -]{0,3}\d{3}[ -]{0,3}\d{3}\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b[1-79]\d{2}[ -]{0,3}\d{3}[ -]{0,3}\d{3}\b" --field-name "potential_PII" --value "Canadian Social Insurance Number" --search-field fullText --valid-luhn true
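To show why the Luhn option improves accuracy, here is a minimal sketch of a Luhn checksum applied to regex matches. It is an illustration in Python, not the toolkit's own implementation, and the function names are hypothetical. Non-digit characters are dropped first so that separators do not affect the result. The sample numbers are made up.

```python
import re

# Canadian SIN pattern, copied from above.
SIN = re.compile(r"\b[1-79]\d{2}[ -]{0,3}\d{3}[ -]{0,3}\d{3}\b")


def luhn_valid(candidate):
    """Return True if the digits in candidate pass the Luhn checksum."""
    # Keep ASCII digits only, so spaces and hyphens are ignored.
    digits = [int(c) for c in candidate if "0" <= c <= "9"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_sins(text):
    """Return regex matches that also pass the Luhn check."""
    return [m.group(0) for m in SIN.finditer(text) if luhn_valid(m.group(0))]


print(find_sins("SIN 130-692-544, reference 123 456 789"))
```

Both numbers in the sample match the regex, but only the first passes the checksum, so a plain nine-digit reference number is filtered out.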
Credit Card - AMEX
\b(3[47]\d{2}[ -]{0,3}\d{6}[ -]{0,3}\d{5})\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b(3[47]\d{2}[ -]{0,3}\d{6}[ -]{0,3}\d{5})\b" --field-name "potential_PII" --value "Credit Card AMEX" --search-field fullText
Credit Card - Visa
\b((4\d{3}[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4})|(4\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}))\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b((4\d{3}[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4})|(4\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}))\b" --field-name "potential_PII" --value "Credit Card Visa" --search-field fullText
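The outer parentheses in the Visa pattern matter: without them, the leading \b applies only to the 16-digit alternative and the trailing \b only to the 13-digit one. The sketch below compares the two forms using Python's re module (an assumption for illustration; alternation and \b work the same way in .NET). The variable names and sample strings are made up.

```python
import re

SIXTEEN = r"4\d{3}[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4}"
THIRTEEN = r"4\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}[ -]{0,3}\d{3}"

# Both boundaries apply to both alternatives.
GROUPED = re.compile(r"\b((" + SIXTEEN + ")|(" + THIRTEEN + r"))\b")
# Leading \b binds to the first alternative only, trailing \b to the second only.
UNGROUPED = re.compile(r"\b(" + SIXTEEN + ")|(" + THIRTEEN + r")\b")

samples = [
    "4111 1111 1111 1111",     # well-formed 16-digit number
    "REF4123456789012",        # 13 digits glued to a word
    "4111111111111111999",     # 16 digits followed by more digits
]

for sample in samples:
    grouped = GROUPED.search(sample)
    ungrouped = UNGROUPED.search(sample)
    print(
        sample,
        "| grouped:", grouped.group(0) if grouped else None,
        "| ungrouped:", ungrouped.group(0) if ungrouped else None,
    )
```

The ungrouped form reports matches inside longer tokens, which the grouped form rejects.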
Credit Card - MasterCard
\b(5[1-5]\d{2}|222[1-9]|22[3-9]\d|2[3-6]\d{2}|27[01]\d|2720)[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4}\b
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "\b(5[1-5]\d{2}|222[1-9]|22[3-9]\d|2[3-6]\d{2}|27[01]\d|2720)[ -]{0,3}\d{4}[ -]{0,3}\d{4}[ -]{0,3}\d{4}\b" --field-name "potential_PII" --value "Credit Card Mastercard" --search-field fullText
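The first group of the MasterCard pattern is easier to trust once its coverage is enumerated. This sketch, written in Python purely as a checking aid, tests every four-digit prefix against that group and compares the result with the two ranges it appears to encode, 5100 to 5599 and 2221 to 2720.

```python
import re

# First group of the MasterCard pattern above: the leading four digits.
PREFIX = re.compile(r"5[1-5]\d{2}|222[1-9]|22[3-9]\d|2[3-6]\d{2}|27[01]\d|2720")

# Every four-digit string the prefix group accepts in full.
accepted = {n for n in range(10000) if PREFIX.fullmatch(f"{n:04d}")}

# The ranges the alternation is expected to cover.
expected = set(range(5100, 5600)) | set(range(2221, 2721))

print("prefix group covers exactly the expected ranges:", accepted == expected)
```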
USA Social Security Number (Formatted)
(?is)(\b(?!666|000|9\d{2})\d{3} ?[\u002D\u007E\u00AD\u2010\u2011\u2012\u2013\u2212] ?(?!00)\d{2} ?[\u002D\u007E\u00AD\u2010\u2011\u2012\u2013\u2212] ?(?!0{4})\d{4}\b)
Example Command
%CognitiveToolkitExecutable% RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 --threads 4 --nodes-per-request 500 -q SSN.json --regex-pattern "(?is)(\b(?!666|000|9\d{2})\d{3} ?[\u002D\u007E\u00AD\u2010\u2011\u2012\u2013\u2212] ?(?!00)\d{2} ?[\u002D\u007E\u00AD\u2010\u2011\u2012\u2013\u2212] ?(?!0{4})\d{4}\b)" --field-name "potential_PII" --value "USA Social Security Number (Formatted)" --search-field fullText
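The negative lookaheads in the formatted SSN pattern reject area numbers 000, 666 and 900 to 999, group number 00, and serial number 0000, while the character class accepts several Unicode dash variants as separators. The sketch below exercises those rules with Python's re module, an assumption for illustration since the toolkit uses the .NET engine. The sample values are made up.

```python
import re

# USA SSN (formatted) pattern, copied from above and split across two lines.
SSN_FORMATTED = re.compile(
    r"(?is)(\b(?!666|000|9\d{2})\d{3} ?[\u002D\u007E\u00AD\u2010\u2011\u2012\u2013\u2212] ?"
    r"(?!00)\d{2} ?[\u002D\u007E\u00AD\u2010\u2011\u2012\u2013\u2212] ?(?!0{4})\d{4}\b)"
)

samples = [
    "123-45-6789",            # plain hyphens
    "123 - 45 - 6789",        # optional spaces around the separators
    "123\u201345\u20136789",  # en dashes (U+2013)
    "666-45-6789",            # area 666 is rejected
    "912-45-6789",            # areas starting with 9 are rejected
    "123-00-6789",            # group 00 is rejected
    "123-45-0000",            # serial 0000 is rejected
    "123456789",              # no separators, so not the formatted form
]

for sample in samples:
    print(repr(sample), "->", bool(SSN_FORMATTED.search(sample)))
```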
USA Social Security Number (Proximity)
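(?is)((\b(Social Security|Social Security#|Soc Sec|SSN|SSNS|SSN#|SS#|SSID|Soc. Sec. No|tax(payer)? identification number|Tax ID Number)(?!\w).{0,300}\b((?!666|000|9\d{2})\d{3} ?(?!00)\d{2} ?(?!0{4})\d{4})\b)|(\b((?!666|000|9\d{2})\d{3} ?(?!00)\d{2} ?(?!0{4})\d{4})\b.{0,300}\b(Social Security|Social Security#|Soc Sec|SSN|SSNS|SSN#|SS#|SSID|Soc. Sec. No|tax(payer)? identification number|Tax ID Number)(?!\w)))
Example Command
%CognitiveToolkitExecutable% RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 --threads 4 --nodes-per-request 500 -q SSN.json --regex-pattern "(?is)((\b(Social Security|Social Security#|Soc Sec|SSN|SSNS|SSN#|SS#|SSID|Soc. Sec. No|tax(payer)? identification number|Tax ID Number)(?!\w).{0,300}\b((?!666|000|9\d{2})\d{3} ?(?!00)\d{2} ?(?!0{4})\d{4})\b)|(\b((?!666|000|9\d{2})\d{3} ?(?!00)\d{2} ?(?!0{4})\d{4})\b.{0,300}\b(Social Security|Social Security#|Soc Sec|SSN|SSNS|SSN#|SS#|SSID|Soc. Sec. No|tax(payer)? identification number|Tax ID Number)(?!\w)))" --field-name "potential_PII" --value "USA Social Security Number (Proximity)" --search-field fullText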
USA Passport Number
(?is)(\b(passport\s*(book|card)?\s*number|Passport.{1,150}Document Number|Passport.{1,150}Department of State|Passport No|Passport #|Passport[#:]|PassportID|Passportno|passportnumber|パスポート|パスポート番号|パスポートのNum|パスポート#|Numéro de passeport|Passeport n °|Passeport Non|Passeport #|Passeport#|PasseportNon|Passeportn °|Pasaporte)(?!\w).{0,500}\b[a-zA-Z0-9]\d{8}\b)|(\b[a-zA-Z0-9]\d{8}\b.{0,500}\b(passport\s*(book|card)?\s*number|Passport.{1,150}Document Number|Passport.{1,150}Department of State|Passport No|Passport #|Passport[#:]|PassportID|Passportno|passportnumber|パスポート|パスポート番号|パスポートのNum|パスポート#|Numéro de passeport|Passeport n °|Passeport Non|Passeport #|Passeport#|PasseportNon|Passeportn °|Pasaporte)(?!\w))
Example Command
CognitiveToolkit.exe RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 -q query.json --regex-pattern "(?is)(\b(passport\s*(book|card)?\s*number|Passport.{1,150}Document Number|Passport.{1,150}Department of State|Passport No|Passport #|Passport[#:]|PassportID|Passportno|passportnumber|パスポート|パスポート番号|パスポートのNum|パスポート#|Numéro de passeport|Passeport n °|Passeport Non|Passeport #|Passeport#|PasseportNon|Passeportn °|Pasaporte)(?!\w).{0,500}\b[a-zA-Z0-9]\d{8}\b)|(\b[a-zA-Z0-9]\d{8}\b.{0,500}\b(passport\s*(book|card)?\s*number|Passport.{1,150}Document Number|Passport.{1,150}Department of State|Passport No|Passport #|Passport[#:]|PassportID|Passportno|passportnumber|パスポート|パスポート番号|パスポートのNum|パスポート#|Numéro de passeport|Passeport n °|Passeport Non|Passeport #|Passeport#|PasseportNon|Passeportn °|Pasaporte)(?!\w))" --field-name "potential_PII" --value "US Passport Number" --search-field fullText
Medicare Beneficiary Number
(?i)\b(([1-9][AC-HJKMNP-RT-Y]{2}[0-9]-[AC-HJKMNP-RT-Y][AC-HJKMNP-RT-Y0-9][0-9]-[AC-HJKMNP-RT-Y]{2}[0-9]{2})|([1-9][AC-HJKMNP-RT-Y]{2}[0-9][AC-HJKMNP-RT-Y][AC-HJKMNP-RT-Y0-9][0-9][AC-HJKMNP-RT-Y]{2}[0-9]{2}))\b
Example Command
%CognitiveToolkitExecutable% RunScript --path FlagFieldBasedOnRegexScript.cs -i shiny_index -u http://localhost:9200 --threads 4 --nodes-per-request 500 -q medicare.json --regex-pattern "(?i)\b(([1-9][AC-HJKMNP-RT-Y]{2}[0-9]-[AC-HJKMNP-RT-Y][AC-HJKMNP-RT-Y0-9][0-9]-[AC-HJKMNP-RT-Y]{2}[0-9]{2})|([1-9][AC-HJKMNP-RT-Y]{2}[0-9][AC-HJKMNP-RT-Y][AC-HJKMNP-RT-Y0-9][0-9][AC-HJKMNP-RT-Y]{2}[0-9]{2}))\b" --field-name "potential_PII" --value "Medicare Number" --search-field fullText
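The Medicare pattern accepts the number either with both hyphens or with none, and its letter classes leave out S, L, O, I, B and Z. The sketch below checks a few illustrative values using Python's re module, an assumption for convenience since the toolkit uses the .NET engine. The sample values are for demonstration only.

```python
import re

# Medicare Beneficiary Number pattern, copied from above and split across lines.
MBI = re.compile(
    r"(?i)\b(([1-9][AC-HJKMNP-RT-Y]{2}[0-9]-[AC-HJKMNP-RT-Y][AC-HJKMNP-RT-Y0-9][0-9]-"
    r"[AC-HJKMNP-RT-Y]{2}[0-9]{2})|"
    r"([1-9][AC-HJKMNP-RT-Y]{2}[0-9][AC-HJKMNP-RT-Y][AC-HJKMNP-RT-Y0-9][0-9]"
    r"[AC-HJKMNP-RT-Y]{2}[0-9]{2}))\b"
)

samples = [
    "MBI: 1EG4-TE5-MK73",   # hyphenated form
    "MBI: 1eg4te5mk73",     # unhyphenated, lower case (allowed by (?i))
    "MBI: 1SG4-TE5-MK73",   # S is not in the letter class
    "MBI: 1EG4-TE5MK73",    # mixed hyphenation fits neither alternative
]

for sample in samples:
    match = MBI.search(sample)
    print(sample, "->", match.group(0) if match else None)
```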
Download
You can download these commands preconfigured in a batch file for your convenience. Specialized queries are also included.
FlagFieldBasedOnRegexScriptV2.cs
This version has a minor update to the Luhn check function that ensures only digits are passed to the Luhn check algorithm. If you are using the Luhn check (credit cards and SIN), this script is recommended.
FlagFieldBasedOnRegexScriptV2.cs
Regex Explanations
Boundaries (\b) should be used at the beginning and end of a pattern to avoid matching in the middle of a string.
Although \s and a space are similar, \s also includes carriage return, line feed and tab. These patterns explicitly allow only a space or a hyphen as a separator, and they accept the sequence space, hyphen, space. That is what the segment [ -]{0,3} does: any combination of hyphens and spaces is allowed, 0 to 3 times.
Where possible, be consistent in using either [0-9] or \d. Both have the same meaning, and interchanging them makes a pattern harder to read. If you are not allowing all digits, you need to spell out the range of digits. For example, [1-79] reads as 1 to 7 and 9. All characters within square brackets are individual characters, except a hyphen between two characters, which represents a range.
A prefix of (?i) means the expression is treated as case insensitive. This can simplify an expression that contains a lot of alphabetic characters.
A prefix of (?s) makes the dot (.) match newline characters as well. This is required for proximity searches that are likely to span more than one line in a document.
For proximity searches (USA Passport and SSN Proximity) there is an expression such as .{0,500} between the keyword and the numeric pattern. This reads as 0 to 500 of any character between the two expressions, that is, within a proximity of 500 characters. The SSN proximity pattern uses .{0,300}.
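The points above can be tried out on small, simplified patterns. The sketch below uses Python's re module, an assumption for illustration since the toolkit uses the .NET engine, and shortened patterns rather than the full ones on this page. All sample text is made up.

```python
import re

# 1. [ -]{0,3} accepts only spaces and hyphens; [\s-] would also accept tabs
#    and line breaks.
strict = re.compile(r"\b\d{3}[ -]{0,3}\d{3}\b")
loose = re.compile(r"\b\d{3}[\s-]{0,3}\d{3}\b")
print("strict, space-hyphen-space:", bool(strict.search("123 - 456")))
print("strict, tab:", bool(strict.search("123\t456")))
print("loose, tab:", bool(loose.search("123\t456")))

# 2. [1-79] is the range 1 to 7 plus the single character 9.
first_digit = re.compile(r"[1-79]")
allowed = [d for d in "0123456789" if first_digit.fullmatch(d)]
print("digits accepted by [1-79]:", allowed)

# 3. (?s) lets the dot cross line breaks, which a proximity search needs when
#    the keyword and the number sit on different lines.
text = "Passport No:\nsee attached page\n123456789"
proximity = r"\bPassport No(?!\w).{0,500}\b[a-zA-Z0-9]\d{8}\b"
with_s = re.search("(?s)" + proximity, text)
without_s = re.search(proximity, text)
print("with (?s):", bool(with_s))
print("without (?s):", bool(without_s))
```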