Properly handling queries with repeated words

Hello guys,

What’s the best way to handle a use case where a query should return only documents that contain at least N instances of each specified word?

For example, suppose my database contains the following phrases:

"to be or not to be"
"to be human"

When I search for "to be", Elasticsearch correctly returns both phrases. However, if I search for "to be to be", I want only the first phrase to match, since it contains "to be" twice, as requested in the query. By default, ES returns both phrases because it treats the query "to be to be" the same as just "to be".

In other words, how can I make word frequency in the query affect the search results?

Hello @marc21

Welcome to the community.

POST test-index/_doc
{
    "id" : 1,
    "message" : "to be or not to be"
}

POST test-index/_doc
{
    "id" : 2,
    "message" : "to be human"
}

GET test-index/_search
{
    "size": 1,
    "track_total_hits": true,
    "query": {
        "match": {
            "message": "to be to be"
        }
    }
}

Output :

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.9168171,
    "hits": [
      {
        "_index": "test-index",
        "_id": "QgqWQ5cBpIOblOt4qNbY",
        "_score": 0.9168171,
        "_source": {
          "id": 1,
          "message": "to be or not to be"
        }
      }
    ]
  }
}

As the example above shows, hits are sorted by _score in descending order by default; here the top hit has "_score": 0.9168171.

If you are only interested in one record, size=1 returns only the best match for your query. Because "track_total_hits": true is set, the response also reports the total number of matching documents (2 here), of which only the one with the highest _score is returned.
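To make it clearer why the total is 2, here is a rough Python model of how a default match query decides which documents match. It assumes simple lowercase whitespace tokenization and the default OR operator, and it does not model scoring at all; the helper names are made up for illustration.

```python
def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for the analyzer.
    return text.lower().split()

def or_match(query, doc):
    # With the default OR operator, a document matches if it shares
    # at least one term with the query. Repeating a term in the query
    # does not change which documents match.
    return bool(set(tokenize(query)) & set(tokenize(doc)))

docs = ["to be or not to be", "to be human"]

hits = [d for d in docs if or_match("to be to be", d)]
print(len(hits))  # total number of matching documents
```

Both documents match in this model, which is why hits.total.value stays at 2 even though size=1 shows only one of them.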

Thanks!!

Thank you for the quick response. Unfortunately I cannot use _score sorting - we are using alphabetical sorting :frowning:

Also, when I search for "to be to be to be" against the previous dataset, I want no matching results at all: hits total should be 0. The totals I expect for each query are:

"be" - total hits 2
"to be" - total hits 2
"be be" - total hits 1
"to be to be" - total hits 1
"to be to be to be" - total hits 0
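The rule behind these totals can be sketched in Python: a document matches only if it contains every query word at least as many times as the query does. The second half shows one possible workaround, not a built-in Elasticsearch feature: number each repeated token on the client side (to_1, be_1, to_2, ...), index that string in a separate field, and run a match query with "operator": "and" against it. The field name message_counted and all helper names are hypothetical, and the field is assumed to use an analyzer that keeps tokens such as to_2 intact (for example the whitespace analyzer).

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for the analyzer.
    return text.lower().split()

def matches_with_counts(query, doc):
    # A document matches only if every query term occurs in it
    # at least as often as it occurs in the query.
    need = Counter(tokenize(query))
    have = Counter(tokenize(doc))
    return all(have[term] >= n for term, n in need.items())

def number_occurrences(text):
    # "to be to be" -> "to_1 be_1 to_2 be_2"
    seen = Counter()
    out = []
    for tok in tokenize(text):
        seen[tok] += 1
        out.append(f"{tok}_{seen[tok]}")
    return " ".join(out)

def build_query(text, field="message_counted"):
    # Hypothetical field holding number_occurrences(message) at index time.
    # "operator": "and" requires every numbered token to be present.
    return {
        "query": {
            "match": {
                field: {"query": number_occurrences(text), "operator": "and"}
            }
        }
    }

docs = ["to be or not to be", "to be human"]
for q in ["be", "to be", "be be", "to be to be", "to be to be to be"]:
    total = sum(matches_with_counts(q, d) for d in docs)
    print(f"{q!r}: total hits {total}")
```

Since this only decides which documents match and does not rely on _score, it should be compatible with alphabetical sorting. The cost is an extra field that has to be filled in by the indexing client or an ingest step.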
