Common Query Types
There are many ways to customize the query you provide to the Cognitive Toolkit. Each query type searches data differently, so you can pick the one that fits your use case. These types are nested in your bool → must/must_not/should.
When performing a query that involves forward slashes “/”, backslashes “\”, or periods “.”, be sure to add .keyword after the name of the field you want to search in.
Generally, you can mix and match query types to zero in on the desired data. You can use these query types with must, must_not, and should.
Example:
{
"bool": {
"must": [
{
"term": {
"parent.keyword": "\\\\root\\folder\\subfolder\\leaf"
}
}
]
}
}
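The doubled backslashes in the example above are JSON escaping: every literal backslash in the path has to be written twice. If you build queries in a script, a minimal sketch like the one below lets a JSON library handle the escaping for you. The helper name term_query is hypothetical (it is not part of the Cognitive Toolkit), and Python is used purely for illustration.

```python
import json

def term_query(field, value):
    """Hypothetical helper: wrap a single term clause in bool -> must."""
    return {"bool": {"must": [{"term": {field: value}}]}}

# Raw string: the path as it appears on disk (two leading backslashes).
path = r"\\root\folder\subfolder\leaf"
query = term_query("parent.keyword", path)

# json.dumps doubles every backslash, giving the escaped form used in the query.
print(json.dumps(query, indent=2))
```

Writing the path as a raw string and serializing it once avoids escaping backslashes by hand.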
Compute Expense & Case Sensitivity
Each query type below lists its compute expense and whether it is case-sensitive.
A high compute expense doesn’t mean you shouldn’t use a query type! Sometimes finding specific data requires a bit more resource usage. The general idea is to be as efficient as possible to reduce crawl times while still capturing the correct data.
LOW
Very efficient. Minimal processing power needed to execute the query. Fastest.
MEDIUM
Average efficiency. Will require the use of the built-in analyzer which marginally increases query time.
HIGH
Not efficient. Requires heavy use of the built-in analyzer. Slowest.
Common Types:
match
Compute expense: MEDIUM
Case-sensitive: NO
match is used when you want to find one word, or all words separately. Another way to think about match is to imagine there being an OR in between each word.
For example:
{
"bool": {
"must": [
{
"match": {
"fullText": "space invaders"
}
}
]
}
}
Will match
Any occurrence of:
space
invaders
space is vast and empty, to date there have been no earth invaders from space
Will not match
sp8ce
in vaders
spaceinvaders
The index engine will search for space or invaders and will show you results even if space was ten pages away from invaders.
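To make the "OR between each word" idea concrete, here is a rough Python sketch of the behavior described above. It is a simplified stand-in, not the index engine's actual analyzer: the tokenize function (lowercase, then split on non-word characters) and the function names are assumptions made for illustration.

```python
import re

def tokenize(text):
    # Simplified stand-in for the built-in analyzer:
    # lowercase the text and split it into word tokens.
    return re.findall(r"\w+", text.lower())

def match(query, field_value):
    # True if ANY query word appears as a whole token in the field value.
    wanted = set(tokenize(query))
    return any(token in wanted for token in tokenize(field_value))

for text in ["space", "invaders", "sp8ce", "in vaders", "spaceinvaders"]:
    print(text, "->", match("space invaders", text))
```

Whole tokens are compared, which is why spaceinvaders and in vaders do not count as hits.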
match_phrase
Compute expense: HIGH
Case-sensitive: NO
match_phrase is used when you want to find a string of words in the order you submit them. match_phrase works the same as using double quotes (“ ”) in a search engine like Google: it looks for the words in the order you submitted them. match_phrase does not match against symbols.
For example:
{
"bool": {
"must": [
{
"match_phrase": {
"fullText": "space invaders"
}
}
]
}
}
The index engine will search for space invaders, meaning it is searching for the string “space invaders”.
Will match
space invaders
space-invaders
Space Invaders was released by Taito in 1978
Will not match
spaceinvaders
space is vast and empty, to date there have been no earth invaders from space
space_invaders
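The difference from match can be sketched in Python as follows: the query words must appear next to each other, in order. As before, this is an approximation under stated assumptions, not the engine's real implementation. The tokenize function is a hypothetical stand-in that treats a hyphen as a word break and keeps an underscore inside a word, which is consistent with the examples above.

```python
import re

def tokenize(text):
    # Simplified stand-in for the built-in analyzer. \w+ keeps underscores
    # inside a token and splits on hyphens, spaces, and other symbols.
    return re.findall(r"\w+", text.lower())

def match_phrase(query, field_value):
    # True if the query tokens occur consecutively, in order, in the field value.
    phrase = tokenize(query)
    tokens = tokenize(field_value)
    n = len(phrase)
    if n == 0:
        return False
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

for text in ["space-invaders", "space_invaders", "spaceinvaders"]:
    print(text, "->", match_phrase("space invaders", text))
```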
term
Compute expense: LOW
Case-sensitive: YES
term is useful when you know exactly what you are looking for. term does not tokenize your search, meaning it works similarly to Find (CTRL + F). The field value must match exactly what is in your query, and casing matters here!
For example:
{
"bool": {
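"should": [
{
"term": {
"extension": "pdf"
}
},
{
"term": {
"extension": "PDF"
}
}
]
}
}
The index engine will search for pdf or PDF (different casing). The two term clauses sit under should rather than must because a file's extension cannot be both pdf and PDF at once: under must, every clause has to match the same item, so no item would be returned.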
Will match
filename.pdf
filename.PDF
Will not match
pdf.docx
filename.pdfx
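The exact, case-sensitive comparison can be pictured with a small Python sketch. It assumes the extension field holds the text after the final period of the file name, without the dot. The helper names are hypothetical and the code only imitates the behavior described here.

```python
def extension_of(filename):
    # Hypothetical helper: the value an "extension" field might hold,
    # i.e. the text after the last period, without the dot.
    return filename.rsplit(".", 1)[-1] if "." in filename else ""

def term(query_value, field_value):
    # Exact, case-sensitive comparison of the whole value; nothing is tokenized.
    return field_value == query_value

def is_pdf(filename):
    # Either casing is accepted, mirroring a query that lists pdf and PDF.
    ext = extension_of(filename)
    return term("pdf", ext) or term("PDF", ext)

for name in ["filename.pdf", "filename.PDF", "pdf.docx", "filename.pdfx"]:
    print(name, "->", is_pdf(name))
```

Any casing that is not listed, such as Pdf, would be missed, so each variant you expect has to be included in the query.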
terms
Compute expense: LOW
Case-sensitive: YES
terms works exactly the same way as term, but it allows you to submit an array of multiple values. This can be useful if you have a small list of terms to match against.
For example:
{
"bool": {
"must": [
{
"terms": {
"name.keyword": ["Invoice 2020-01-01.docx", "Invoice 2021-01-01.docx"]
}
}
]
}
}
Note the use of “name.keyword” instead of just “name”. This is because there are spaces and periods in what is being searched for.
Will match
Invoice 2020-01-01.docx
Invoice 2021-01-01.docx
Will not match
Invoices 2020-01-01.docx
Invoice20200101.docx
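When the list of values follows a pattern, it may be easier to generate it than to type it. The Python sketch below builds the same terms query as the example above. The helper name terms_query is hypothetical, and the year range is only there to reproduce the two invoice names.

```python
import json

def terms_query(field, values):
    """Hypothetical helper: wrap a terms clause in bool -> must."""
    return {"bool": {"must": [{"terms": {field: list(values)}}]}}

# Build the exact file names to match; each one must match character for character.
names = [f"Invoice {year}-01-01.docx" for year in (2020, 2021)]
query = terms_query("name.keyword", names)

print(json.dumps(query, indent=2))
```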
exists
Compute expense: LOW
Case-sensitive: NO
exists is as simple as it is useful. exists allows you to query whether a field exists on an index item. It only checks that the field exists, regardless of the value of the field.
Keep in mind, if your index contains items that have the field but no value (null), exists will still find those items. Remember it is checking if the field exists, not if the field has a value.
For example:
{
"bool": {
"must": [
{
"exists": {
"field": "hash"
}
}
]
}
}
Will match
Any item in the specified index that does have the field “hash” applied, even if the value is null or empty
Will not match
Any item in the specified index that does not have the field “hash”
Field name: hashes
Field name: h ash
Field name: hash-
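Because query types can also be placed under must_not, the same exists clause could be inverted to look for items that do not have a field yet. The Python sketch below builds such a query. The helper name missing_field_query is hypothetical, and the query is an illustration of the bool structure rather than a recommended crawl setting.

```python
import json

def missing_field_query(field):
    """Hypothetical helper: select items on which the given field does not exist."""
    return {"bool": {"must_not": [{"exists": {"field": field}}]}}

query = missing_field_query("hash")
print(json.dumps(query, indent=2))
```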
prefix
Compute expense: LOW
Case-sensitive: YES
prefix is very powerful when analyzing big data. prefix allows you to specify the characters a field value starts with. This can be useful when you want to target files in a specific directory (and its subdirectories) or if you want to find all documents that have a file name starting with “2020”.
Example 1:
{
"bool": {
"must": [
{
"prefix": {
"name": "2020"
}
}
]
}
}
Will match
2020-acme-profits.xlsx
2020843729402.cab
Will not match
Invoices 2020-01-01.docx
2 020-profits.xlsx
Example 2:
{
"bool": {
"must": [
{
"prefix": {
"parent.keyword": "\\\\ACME\\File share\\HR"
}
}
]
}
}
Note the use of .keyword, as we are looking at a string with escaped characters (the backslashes in the path).
Will match
\\ACME\File share\HR\new hires 2021\applicants
\\ACME\File share\HR\terminations 2021\processes\BBunny\incidents
Will not match
\\ACME\Fileshare\HR\new hires 2021 (no space between ‘File’ and ‘share’)
\\ACME\File share\hr\new hires 2021\applicants (hr is not uppercase, as it is in the query)
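Both prefix examples boil down to a case-sensitive "starts with" check on the field value, which the Python sketch below imitates. It is an illustration under that assumption, not the index engine's implementation, and the function name prefix is chosen only to mirror the query type.

```python
def prefix(query_value, field_value):
    # Case-sensitive check that the field value begins with the query value.
    return field_value.startswith(query_value)

# Raw strings hold the paths as they appear on disk (no JSON escaping needed here).
hr_share = r"\\ACME\File share\HR"

print(prefix("2020", "2020-acme-profits.xlsx"))
print(prefix("2020", "Invoices 2020-01-01.docx"))
print(prefix(hr_share, r"\\ACME\File share\HR\new hires 2021\applicants"))
print(prefix(hr_share, r"\\ACME\File share\hr\new hires 2021\applicants"))
```

Because the check starts at the first character, a value that merely contains 2020 somewhere later in the name is not a hit.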