Common Query Types

There are many ways to customize the query you provide to the Cognitive Toolkit. Each query type searches data differently, so you can pick the one that fits your use case. Query types are nested inside the must, must_not, or should clause of your bool query (bool → must/must_not/should).

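When the value you are searching for contains forward slashes (“/”), backslashes (“\”), or periods (“.”), be sure to append .keyword to the name of the field you want to search in. Because queries are written in JSON, each backslash in the value must also be escaped as a double backslash (“\\”).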

Generally, you can mix and match query types to zero in on the desired data, and each type can be used with must, must_not, and should.

Example:

{
  "bool": {
    "must": [
      {
        "term": {
          "parent.keyword": "\\\\root\\folder\\subfolder\\leaf"
        }
      }
    ]
  }
}
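The doubled backslashes in the example above come from JSON string escaping. The following is a minimal Python sketch, assuming the query is assembled in a script before it is passed to the toolkit. The variable names are illustrative and not part of the toolkit.

```python
import json

# Path as it is stored in the index; a raw string keeps the backslashes literal.
raw_path = r"\\root\folder\subfolder\leaf"

query = {
    "bool": {
        "must": [
            {"term": {"parent.keyword": raw_path}}
        ]
    }
}

# json.dumps escapes every backslash, which yields the doubled form used in the query.
query_json = json.dumps(query, indent=2)
print(query_json)
```

Letting a JSON library do the escaping avoids counting backslashes by hand when paths are long.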

Compute Expense & Case Sensitivity

Each query type below lists its compute expense and whether it is case-sensitive.

A high compute expense doesn’t mean you shouldn’t use a query type! Sometimes finding specific data requires a bit more resource usage. The general idea is to be as efficient as possible to reduce crawl times while still capturing the correct data.

LOW

Very efficient. Minimal processing power needed to execute the query. Fastest.

MEDIUM

Average efficiency. Will require the use of the built-in analyzer which marginally increases query time.

HIGH

Not efficient. Requires heavy use of the built-in analyzer. Slowest.

Common Types:

match

Compute expense: MEDIUM
Case-sensitive: NO

match is used when you want to find any of the words in your query, regardless of their order or position. Another way to think about match is to imagine an OR between each word.

For example:

{
  "bool": {
    "must": [
      {
        "match": {
          "fullText": "space invaders"
        }
      }
    ]
  }
}

Will match

Any occurrence of:

space
invaders
space is vast and empty, to date there have been no earth invaders from space

Will not match

sp8ce
in vaders
spaceinvaders

The index engine will search for space or invaders and will return results even if space is ten pages away from invaders, or if only one of the two words is present.
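The OR behaviour of match can be pictured with a short Python sketch. The tokenizer here is only a rough stand-in for the index engine's analyzer (an assumption for illustration): it lowercases the text and splits on anything that is not a letter, digit, or underscore. The function names are hypothetical.

```python
import re


def tokenize(text):
    # Rough stand-in for the analyzer: lowercase, then split into word tokens.
    return re.findall(r"\w+", text.lower())


def match_any(field_text, query):
    """Return True if at least one query word appears as a whole token."""
    field_tokens = set(tokenize(field_text))
    return any(word in field_tokens for word in tokenize(query))


samples = [
    "space",
    "invaders",
    "space is vast and empty, to date there have been no earth invaders from space",
    "sp8ce",
    "in vaders",
    "spaceinvaders",
]

results = {text: match_any(text, "space invaders") for text in samples}
for text, hit in results.items():
    print(f"{hit!s:5} {text}")
```

The first three samples contain space or invaders as a whole word, so they are reported as hits; the last three do not.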

match_phrase

Compute expense: HIGH
Case-sensitive: NO

match_phrase is used when you want to find a string of words in the order you submit them. It works the same way as wrapping a search in double quotes (“ ”) in a search engine like Google: the words must appear together and in the order you submitted them. match_phrase does not match against symbols; punctuation such as a hyphen is treated as a break between words, so space-invaders still matches space invaders.

For example:

{
  "bool": {
    "must": [
      {
        "match_phrase": {
          "fullText": "space invaders"
        }
      }
    ]
  }
}

The index engine will search for the phrase “space invaders”, meaning the word space immediately followed by the word invaders.

Will match

space invaders
space-invaders
Space Invaders was released by Taito in 1978

Will not match

spaceinvaders
space is vast and empty, to date there have been no earth invaders from space
space_invaders
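The difference from match is that the words must sit next to each other, in order. The following Python sketch illustrates this, again using a simplified tokenizer as an assumed stand-in for the built-in analyzer (it splits on hyphens and spaces but keeps underscores inside a word). The function names are hypothetical.

```python
import re


def tokenize(text):
    # Simplified analyzer stand-in: lowercase word tokens; "_" stays inside a token.
    return re.findall(r"\w+", text.lower())


def match_phrase(field_text, phrase):
    """Return True if the phrase tokens appear consecutively and in order."""
    tokens = tokenize(field_text)
    wanted = tokenize(phrase)
    if not wanted:
        return False
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


samples = [
    "space invaders",
    "space-invaders",
    "Space Invaders was released by Taito in 1978",
    "spaceinvaders",
    "space is vast and empty, to date there have been no earth invaders from space",
    "space_invaders",
]

results = {text: match_phrase(text, "space invaders") for text in samples}
for text, hit in results.items():
    print(f"{hit!s:5} {text}")
```

Under these assumptions the first three samples are hits and the last three are not, which mirrors the lists above.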

term

Compute expense: LOW
Case-sensitive: YES

term is useful when you know exactly what you are looking for. term does not tokenize your search, so it works much like Find (CTRL + F) with case matching turned on. The field value must match your query exactly, and casing matters here!

For example:

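{
  "bool": {
    "should": [
      {
        "term": {
          "extension": "pdf"
        }
      },
      {
        "term": {
          "extension": "PDF"
        }
      }
    ]
  }
}

The index engine will search for pdf or PDF (different casing). should is used here because a file's extension cannot be both pdf and PDF at once; placing both term clauses under must would require both to match and would return nothing.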

Will match

filename.pdf
filename.PDF

Will not match

pdf.docx
filename.pdfx
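The exact, case-sensitive comparison that term performs can be sketched in Python as a plain equality check. This sketch assumes the extension field stores the file suffix without the leading dot; the helper names are hypothetical.

```python
from pathlib import PurePath


def extension_of(file_name):
    # Assumption: the "extension" field holds the suffix without the leading dot.
    return PurePath(file_name).suffix.lstrip(".")


def term_matches(field_value, query_value):
    # term compares the whole value exactly; no tokenizing, no case folding.
    return field_value == query_value


files = ["filename.pdf", "filename.PDF", "pdf.docx", "filename.pdfx"]

lower_hits = [f for f in files if term_matches(extension_of(f), "pdf")]
upper_hits = [f for f in files if term_matches(extension_of(f), "PDF")]

print(lower_hits)
print(upper_hits)
```

Each casing finds only its own file, which is why the query needs one term clause per casing to catch both.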

terms

Compute expense: LOW
Case-sensitive: YES

terms works exactly the same way as term, but it allows you to submit an array of multiple values. This can be useful if you have a small list of terms to match against.

For example:

{
  "bool": {
    "must": [
      {
        "terms": {
          "name.keyword": ["Invoice 2020-01-01.docx", "Invoice 2021-01-01.docx"]
        }
      }
    ]
  }
}

Note the use of “name.keyword” instead of just “name”. This is because there are spaces and periods in what is being searched for.

Will match

Invoice 2020-01-01.docx
Invoice 2021-01-01.docx

Will not match

Invoices 2020-01-01.docx
Invoice20200101.docx
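When the list of values follows a pattern, it can be generated rather than typed out. The following is a small Python sketch, assuming the query is built in a script; build_terms_query is a hypothetical helper, not a toolkit function.

```python
import json


def build_terms_query(field, values):
    # Wrap a list of exact values in a bool/must/terms query.
    return {"bool": {"must": [{"terms": {field: list(values)}}]}}


# Generate the two file names from the example above.
invoice_names = [f"Invoice {year}-01-01.docx" for year in (2020, 2021)]

query = build_terms_query("name.keyword", invoice_names)
print(json.dumps(query, indent=2))
```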

exists

Compute expense: LOW
Case-sensitive: NO

exists is as simple as it is useful. exists allows you to query if a field exists on an index item. This can only be used to check if a field exists, regardless of the value of the field.

Keep in mind that exists checks whether the field holds an indexed value, not what that value is. An item whose field contains an empty string (“”) will still be found. In Elasticsearch-style indexes, however, a field that is explicitly null or an empty array is not indexed, so exists treats it as missing.

For example:

{
  "bool": {
    "must": [
      {
        "exists": {
          "field": "hash"
        }
      }
    ]
  }
}

Will match

Any item in the specified index that has a value indexed for the field “hash”, even if that value is an empty string

Will not match

Any item in the specified index that does not have the field “hash”
Any item that only has a similarly named field, such as hashes, h ash, or hash-
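Because exists can sit under must_not as well as must, it is also handy for finding items that still lack a field. The following Python sketch assumes you want PDF items that do not have a hash field yet; the combination is illustrative and not taken from the toolkit itself.

```python
import json

query = {
    "bool": {
        # Keep only items whose extension is exactly "pdf" ...
        "must": [{"term": {"extension": "pdf"}}],
        # ... and drop every item where the "hash" field exists.
        "must_not": [{"exists": {"field": "hash"}}],
    }
}

print(json.dumps(query, indent=2))
```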

prefix

Compute expense: LOW
Case-sensitive: YES

prefix is very powerful when analyzing big data. prefix allows you to specify the characters a field value starts with. This can be useful when you want to target files in a specific directory (and its subdirectories) or if you want to find all documents that have a file name starting with “2020”.

Example 1:

{
  "bool": {
    "must": [
      {
        "prefix": {
          "name": "2020"
        }
      }
    ]
  }
}

Will match

2020-acme-profits.xlsx
2020843729402.cab

Will not match

Invoices 2020-01-01.docx
2 020-profits.xlsx

Example 2:

{
  "bool": {
    "must": [
      {
        "prefix": {
          "parent.keyword": "\\\\ACME\\File share\\HR"
        }
      }
    ]
  }
}

Note the use of .keyword, as the value being searched contains backslashes (escaped as double backslashes in the query).

Will match

\\ACME\File share\HR\new hires 2021\applicants
\\ACME\File share\HR\terminations 2021\processes\BBunny\incidents

Will not match

\\ACME\Fileshare\HR\new hires 2021 (no space between ‘File’ and ‘share’)
\\ACME\File share\hr\new hires 2021\applicants (hr is not UPPERCASE as indicated in the query)
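The behaviour in Example 2 can be approximated in Python with a case-sensitive starts-with check on the whole field value. This is a sketch of the idea only, not of how the index engine evaluates the query; prefix_matches is a hypothetical helper.

```python
def prefix_matches(field_value, prefix):
    # Case-sensitive comparison against the start of the whole field value.
    return field_value.startswith(prefix)


# Raw strings keep the backslashes literal (no JSON escaping needed here).
prefix = r"\\ACME\File share\HR"

paths = [
    r"\\ACME\File share\HR\new hires 2021\applicants",
    r"\\ACME\File share\HR\terminations 2021\processes\BBunny\incidents",
    r"\\ACME\Fileshare\HR\new hires 2021",
    r"\\ACME\File share\hr\new hires 2021\applicants",
]

hits = [p for p in paths if prefix_matches(p, prefix)]
for path in hits:
    print(path)
```

Only the first two paths are reported. Note that a starts-with check on this prefix would also accept a sibling folder whose name merely begins with HR; ending the prefix with a backslash would limit the results to items below the HR folder itself.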