Sorting Content Pulls by Score

You may use the dotCMS query results scoring feature to sort and filter content by relevance. This is a subtle but powerful feature which, when combined with Personas, Visitors, and Rules, allows you to automatically present personalized content that is most relevant to a visitor's interests.

Note: All example queries below may be run against the dotCMS starter site and demo site.

ElasticSearch Sorting

Queries in dotCMS are performed using the integrated open source ElasticSearch package. By default, when you pull content using an ElasticSearch query, the content items that match the query are not returned in a particular order (they're returned in the order the items are found in the database). You may add sorting to your query to ensure the items are returned in a particular order, for example sorted by Title, Publish Date, Category, etc.

In addition, dotCMS implements query results scoring which automatically sorts content by relevance so the results with the highest relevance (the greatest number of matches to the query terms) are displayed first. You may filter query results by score, ensuring only results which closely match the query terms are displayed.

How the Score is Calculated

The scoring system uses the terms of your query to determine how content items returned by the search are scored. The scoring system uses a number of factors to rank matching content; in basic terms, the following factors are used:

Factor | Description
Term Frequency | The more times each query term is found in a content item, the higher the score it receives.
Inverse Document Frequency | The fewer content items which match a query term, the higher the score of the content items where that query term is found.
Coordination | The more different query terms a content item matches, the higher the score it receives.
Field Length | When a query term matches a content item in a short field (like title), that content item receives a higher score than a content item where the same query term matched in a longer field (like body).
Boost | Weighting used to give certain search terms greater weight (see below).

For more information on scoring methods, please see the ElasticSearch Scoring documentation.
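
The way these factors interact can be illustrated with a small Python sketch. This is a toy model only, loosely inspired by classic TF/IDF scoring; the function name, its parameters, and the exact formulas are assumptions made for illustration, and it is not the calculation that ElasticSearch or dotCMS actually performs.

```python
import math

def toy_score(query_terms, doc_fields, doc_freq, num_docs, boosts=None):
    """Toy relevance score for one content item (illustrative only).

    query_terms: list of terms in the query
    doc_fields:  {field_name: [tokens]} for one content item
    doc_freq:    {term: number of content items containing the term}
    num_docs:    total number of content items in the index
    boosts:      optional {term: weight}; terms default to weight 1.0
    """
    boosts = boosts or {}
    score = 0.0
    matched = 0
    for term in query_terms:
        found = False
        for tokens in doc_fields.values():
            freq = tokens.count(term)
            if freq == 0:
                continue
            found = True
            tf = math.sqrt(freq)  # Term Frequency
            idf = 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))  # Inverse Document Frequency
            norm = 1.0 / math.sqrt(len(tokens))  # Field Length: shorter fields count more
            score += tf * idf * norm * boosts.get(term, 1.0)  # Boost
        if found:
            matched += 1
    # Coordination: reward items that match more of the query terms.
    coord = matched / len(query_terms) if query_terms else 0.0
    return score * coord

# Hypothetical index statistics and content items.
doc_freq = {"retirement": 3, "plan": 5}
in_title = toy_score(["retirement"], {"title": ["retirement", "plan"]}, doc_freq, 10)
in_body = toy_score(["retirement"], {"body": ["retirement"] + ["filler"] * 20}, doc_freq, 10)
print(in_title > in_body)  # True: the match in the shorter field scores higher
```

Real scores depend on the similarity model configured in the index, so treat the numbers from this sketch as relative indications rather than values you would see in dotCMS.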

Writing Queries for Scoring

How each query term will be used in scoring depends on the specific query terms used.

Query Terms Used to Match Content

An ElasticSearch Term Query returns only content that matches the exact term specified. Thus, when only "term" queries are used, only Term Frequency has a significant impact on scoring, since all content returned must contain the specified terms in the appropriate fields.

However, ElasticSearch Boolean Queries offer more options that allow you to control how individual query terms affect content matches. The following list shows the terms you may use in a Boolean query, and how each will be used to match content:

Query Term | Description
"must" | The query term must appear in the content.
"should" | The query term should appear in the content. Content which does not match the term may still be returned, but with a lower score.
"must_not" | The query term must not appear in the content.

Notes:

  • In a query with no must clauses, content will only be returned if it matches at least one should clause.
    • So for a content item to be returned, it must match at least 1 term in the query, even if the query only contains should clauses.
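
The matching rules above can be sketched in Python as follows. This is a simplified model under stated assumptions: the function name is hypothetical, each content item is reduced to a plain set of terms, and scoring is ignored entirely.

```python
def bool_matches(doc_terms, must=(), should=(), must_not=()):
    """Return True if a content item would be returned by the bool clauses.

    doc_terms is a simplified stand-in for the indexed content: the set of
    terms found anywhere in the content item.
    """
    doc_terms = set(doc_terms)
    # Every "must" term has to be present.
    if any(term not in doc_terms for term in must):
        return False
    # No "must_not" term may be present.
    if any(term in doc_terms for term in must_not):
        return False
    # With no "must" clauses, at least one "should" clause has to match.
    if should and not must:
        return any(term in doc_terms for term in should)
    # Otherwise "should" clauses are optional and only affect the score.
    return True

print(bool_matches({"news"}, must=["news"], should=["invest"]))  # True
print(bool_matches({"news"}, should=["invest", "retirement"]))   # False
```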

Query Terms Used to Filter Content

You may also use the following terms to filter the results, ensuring that only the content which most closely matches the search terms is returned by the query.

Query Term | Description | Included In
"min_score" | Content will not be returned unless its total score equals or exceeds this number. | query term
"minimum_should_match" | Content will not be returned unless it matches at least this number of should clauses in the bool query. | bool term

For examples of using these filter terms, please see Examples 2 and 3 in the Common Scoring Uses section, below.
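
If you assemble queries programmatically, the placement of the two filter terms is easy to get wrong. The following Python sketch builds a query body as a plain dictionary; the helper name and its parameters are hypothetical and are not part of any dotCMS or ElasticSearch API. It only shows that "minimum_should_match" belongs inside the bool term while "min_score" sits at the top level of the query.

```python
import json

def build_should_query(field, terms, minimum_should_match=None, min_score=None):
    """Build a bool/should query body as a plain dict (illustrative helper)."""
    bool_clause = {"should": [{"term": {field: term}} for term in terms]}
    if minimum_should_match is not None:
        # Placed inside the "bool" term.
        bool_clause["minimum_should_match"] = minimum_should_match
    body = {"query": {"bool": bool_clause}}
    if min_score is not None:
        # Placed at the top level, next to "query".
        body["min_score"] = min_score
    return body

query = build_should_query("_all", ["plan", "retirement", "stock"],
                           minimum_should_match=2, min_score=0.25)
print(json.dumps(query, indent=4))
```

Serializing with json.dumps also guarantees syntactically valid JSON, which avoids problems such as a stray trailing comma in a hand-edited query.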

Example:

The following example incorporates each matching and filtering term into a single query.

{
    "query": {
        "bool" : {
            "must" : {
                "term" : { "_all" : "news" }
            },
            "must_not" : {
                "range" : {
                    "commentscount" : { "from" : 1, "to" : 5 }
                }
            },
            "should" : [
                {
                    "term" : { "title" : "invest" }
                },
                {
                    "match" : {
                        "title" : { "query": "retirement" }
                    }
                }
            ],
            "minimum_should_match" : 1
        }
    },
    "min_score" : 0.25
}

In this example:

  • Content will only be returned if:
    • The term “news” appears in at least 1 field (any field) of the content item,
    • AND the “commentscount” field is less than 1 or greater than 5,
    • AND the “title” field includes either “invest” or “retirement” (or both),
    • AND the content item receives a total score of at least 0.25.
  • The score will be calculated based on how many times the word “news” appears in the content item fields, and the appearance of “invest” and/or “retirement” in the “title” field.

Adjusting Search Term Weights (boost)

By default, all search terms in your query are given equal scoring weight. So, for example, if your query returns all content items that match any of 3 different terms, a content item that matches the first and second search terms and a content item that matches the second and third search terms will both receive equal scores (since they both match 2 out of the 3 search terms).

However, you may use the ElasticSearch Boost feature to give different weights to each term in your ElasticSearch query, so that some terms have more weight than others. Thus, for example, if you gave double weight to your first search term, a content item that matched the first and second search terms would receive a higher score than a content item that matched the second and third search terms, even though both content items matched 2 out of 3 search terms.
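
The arithmetic behind this example can be sketched in a few lines of Python. The sketch deliberately ignores every scoring factor except the weight of each matched term, and the function and term names are hypothetical.

```python
def weighted_match_score(matched_terms, boosts):
    """Sum the weight of every matched term (terms default to weight 1.0)."""
    return sum(boosts.get(term, 1.0) for term in matched_terms)

item_a = {"first", "second"}   # matches the first and second search terms
item_b = {"second", "third"}   # matches the second and third search terms

# Default: all terms carry equal weight, so both items score the same.
print(weighted_match_score(item_a, {}), weighted_match_score(item_b, {}))  # 2.0 2.0

# Double weight on the first term: item_a now outranks item_b.
doubled = {"first": 2.0}
print(weighted_match_score(item_a, doubled), weighted_match_score(item_b, doubled))  # 3.0 2.0
```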

Excluding Query Terms from Scoring

Note: If you wish to include a term in the query, but not to have that term impact the scoring, set the boost value for that term to 0. For example, in the above query, to prevent matches of the word “news” from impacting the scoring, modify the query term that matches “news” as follows:

            "must" : {
                "match" : {
                    "_all" : {
                        "query" : "news",
                        "boost" : 0
                    }
                }
            },

Example

The following example modifies the previous example, ensuring that content which includes “retirement” in the “title” field receives a higher score than content which includes “invest” in the “title” field:

{
    "query": {
        "bool" : {
            "must" : {
                "term" : { "_all" : "news" }
            },
            "must_not" : {
                "range" : {
                    "commentscount" : { "from" : 1, "to" : 5 }
                }
            },
            "should" : [
                {
                    "term" : { "title" : "invest" }
                },
                {
                    "match" : {
                        "title" : {
                            "query": "retirement",
                            "boost": 5.0
                        }
                    }
                }
            ],
            "minimum_should_match" : 1
        }
    },
    "min_score" : 0.25
}

Notes:

For more information on Boosting ElasticSearch terms, please see the ElasticSearch Boost documentation.

Troubleshooting Scoring

Since the scoring calculation involves multiple factors which are weighted together for the final score, it can sometimes be difficult to understand why some documents score higher or lower than others (or than you expect).

To better understand how documents are scored for your query, you can add "explain" : true before the query, as in the following example:

{
    "explain": true,
    "query": {
        "bool" : {
            "must" : {
                "term" : { "_all" : "news" }
            },
            "must_not" : {
                "range" : {
                    "commentscount" : { "from" : 1, "to" : 5 }
                }
            },
            "should" : [
                {
                    "term" : { "title" : "invest" }
                },
                {
                    "match" : {
                        "title" : {
                            "query": "retirement",
                            "boost": 5.0
                        }
                    }
                }
            ],
            "minimum_should_match" : 1
        }
    },
    "min_score" : 0.25
}

Important: We strongly recommend against using explain in production queries. However, it may be helpful during development of your query terms and weights.

Viewing the Explanations

When "explain" is set, your query will return a detailed listing of the scores corresponding to each of the factors, along with the final score, for each returned content item. In the ElasticSearch Portlet, this information is displayed in the Raw field at the bottom of the search results.
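
The raw explanation can be hard to read because it is deeply nested. The following Python sketch flattens one explanation into indented lines. It assumes the explanation follows the usual ElasticSearch shape, in which each node has a "value", a "description", and an optional list of "details"; the sample data is invented for illustration and does not come from a real dotCMS query.

```python
def format_explanation(node, depth=0):
    """Return the explanation tree as indented 'value  description' lines."""
    lines = ["{}{:.4f}  {}".format("  " * depth, node["value"], node["description"])]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines

# Hypothetical explanation fragment; the values and descriptions are made up.
sample = {
    "value": 1.5,
    "description": "sum of:",
    "details": [
        {"value": 0.5, "description": "weight(_all:news)", "details": []},
        {"value": 1.0, "description": "weight(title:retirement)", "details": []},
    ],
}

print("\n".join(format_explanation(sample)))
```

Reading the tree from the top down shows which clause contributed most to the final score, which is usually the quickest way to decide where a boost should be adjusted.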

Figure: Viewing Explanations in the ElasticSearch Portlet

Common Scoring Uses

The following examples show how to use different ElasticSearch terms to perform common types of queries that sort and filter by score:

  • Example 1: Simple content pull, sorted by score
  • Example 2: Simple content pull, filtered by score
  • Example 3: Simple content pull, filtered by number of terms matched

Example 1: Simple content pull, sorted by score

The following query performs a simple content pull using two query terms. Because the results are sorted by score, content items which match both query terms are displayed before content items which match only one of the query terms.

{
    "query": {
        "bool" : {
            "should" : [
                {
                    "term" : { "_all" : "plan" }
                },
                {
                    "term" : { "_all" : "retirement" }
                }
            ]
        }
    }
}

Example 2: Simple content pull, filtered by score

The following query performs a simple content pull using three query terms, and filters the results so that only content items which have a score of at least 0.25 are returned.

{
    "query": {
        "bool" : {
            "should" : [
                {
                    "term" : { "_all" : "plan" }
                },
                {
                    "term" : { "_all" : "retirement" }
                },
                {
                    "term" : { "_all" : "stock" }
                }
            ]
        }
    },
    "min_score" : 0.25
}

Example 3: Simple content pull, filtered by number of terms matched

The following query performs a simple content pull using three query terms, and filters the results so that only content items which match at least 2 out of the 3 query terms are returned.

{
    "query": {
        "bool" : {
            "should" : [
                {
                    "term" : { "_all" : "plan" }
                },
                {
                    "term" : { "_all" : "retirement" }
                },
                {
                    "term" : { "_all" : "stock" }
                }
            ],
            "minimum_should_match" : 2
        }
    }
}