dotCMS content search uses Elasticsearch, which offers two different types of robust syntax to search and retrieve content from the content repository.
The most common search syntax used in dotCMS is an Elasticsearch query string, which is based on the powerful Lucene query syntax. This syntax is used for specifying the query string only, so it is simple to create a query string, but does not provide access to more advanced Elasticsearch features.
The “full” Elasticsearch syntax uses JSON format to specify both the query string (or individual query terms), and additional features and options which can modify the query. This syntax is more complex than the Lucene syntax, requiring more sophisticated formatting, but allows the use of powerful features such as aggregations and geolocation queries which are not available with the Lucene syntax.
Elasticsearch Query String Syntax (Lucene)
Many tools in dotCMS allow you to specify the Lucene query syntax, eliminating the need to use the more complex JSON syntax. When you use a tool in dotCMS that accepts Lucene syntax, dotCMS automatically creates the full Elasticsearch syntax needed to perform the query in the background. You can use the exact same Lucene query syntax in all of the following ways within dotCMS:
- Velocity code
- Back-end content search
- Back-end search tools
- The Java ContentletAPI
- The REST API
For more information on how to create Lucene queries, please see the following sections:
Searching Specific Fields
When creating Content Types, you can select which fields you would like to make searchable by Elasticsearch. dotCMS allows you to query the content repository using any field in a Content Type with the System Indexed option checked (note that the System Indexed option is also automatically selected if you check the User Searchable option). These fields can then be queried in dotCMS using a contentTypeName.[field] notation (for example, products.title).
System Fields Indexed in Elasticsearch
In addition to the fields selected in the Content Type definition, dotCMS automatically indexes all of the Elasticsearch system properties for every piece of content in the system. These include:
When querying these properties (or when sorting by these properties), you do not need to prefix the Content Type Variable. For example, to query for the word “Bond” in the “Title” field of the “Products” Content Type in the dotCMS starter site, you can use the query +title:bond (instead of +products.title:bond), and to sort by the last modification date, you can sort by just modDate (instead of products.modDate).
For more information on each of these system properties, please see the How Content is Mapped to Elasticsearch documentation.
Search terms consist of one or more words, separated by operators. Understanding how Elasticsearch parses search terms will help you select appropriate operators and ensure your queries work as intended.
Elasticsearch queries are case-insensitive; all query terms are converted to lower case both when the indexes are created and when queries are submitted.
Full Word Parsing
When you specify search terms in Lucene syntax, by default each term specifies a full word in the selected field.
For example, if you search for “ape” in a field, by default the search will return only content items with the full word “ape” in the selected field. However, the search will not return content in which “ape” is only part of a word in the field (such as part of the words “cape” or “aped”).
If you wish to specify that a search term should match part of a word as well, use Wildcards to specify how the term should be matched.
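The full-word behavior described above can be modeled with a short Python sketch. This is a simplified illustration only: `tokenize` and `matches_word` are hypothetical helper names, and splitting on non-alphanumeric characters is a crude stand-in for what a real Elasticsearch analyzer does.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into whole words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_word(text, term):
    """Return True only if `term` appears as a complete word in `text`."""
    return term.lower() in tokenize(text)

print(matches_word("The ape escaped", "ape"))      # True: "ape" is a full word
print(matches_word("A red cape", "ape"))           # False: "ape" is only part of "cape"
print(matches_word("He aped his brother", "APE"))  # False: "ape" is only part of "aped"
```

The lower-casing in `tokenize` mirrors the case-insensitive behavior mentioned earlier; a term that should match inside a longer word needs a wildcard.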
To match a combination of words in a specific order, enclose all words in double quotes. For example, if you search for the term "final exam" (including the double quotes), the query will only return content items where the word “exam” directly follows the word “final”, with no intervening words; content items which contain both words, but where the words are in reverse order or are separated by other words (such as “all exam grades are final”), will not be returned.
The following are some of the most common operators used in Lucene query strings; you can use these operators to build most Lucene queries. If you need to create more complex queries, consider using the full Elasticsearch syntax. For more information on Lucene query syntax, please see the official Lucene documentation.
|Search All Fields|
|Single Character Wildcard|
|Multiple Character Wildcard|
The required operator (+) only returns items which have the term after the “+” in the specified content field. If the required operator is not included, results may be returned which do not include the term, as long as they meet the other requirements of the query string (such as matching other specified fields or search terms). By using the required operator, you can ensure that search results exclude any items which do not include specific search terms.
For example, the following query will only return blog articles whose body includes the term “investment”. The search results may also have the term “JetBlue”. This means that search results which include both “investment” and “JetBlue” will receive higher scores (and thus may be sorted near the top of the results); but items which include “investment” but do not include “JetBlue” will still be returned, while items which include “JetBlue” but do not include “investment” will not be returned.
+contentType:Blog +Blog.body:(+"investment" "JetBlue")
The Prohibit (-) operator excludes documents that contain the term after the - symbol. For example, the following query returns all the employees whose first name begins with the letter “R”, but whose last name does NOT begin with the letter “A”:
+contentType:Employee +(Employee.firstName:R*) -(Employee.lastName:A*)
Note: For a description of the asterisk character (*), please see Multiple Character Wildcard, below.
Boolean Operators [AND (&&), OR (||), and NOT (!)]
The AND operator (&&) matches content where both terms exist in the content field(s) being queried. For example, the following query returns only the blogs where the words “business” and “Apple” BOTH appear in the blog body field:
+contentType:Blog +Blog.body:("business" && "Apple")
Note: The order in which the terms appear in the content does not matter for the AND operator. The example returns both results where the word “business” appears before the word “Apple” and results where the word “business” appears after the word “Apple”.
The OR operator (||) links two terms and finds a matching document if either of the terms exists in the content field(s) being queried. The OR operator is the equivalent of a union between sets. For example, the following query returns all items which include either the word “analysts” or the word “investment” in the body:
+contentType:Blog +Blog.body:("analysts" || "investment")
The NOT (!) operator excludes documents that contain the term after the ! symbol. For example, the following query returns all blog articles which have the word “investment” in the body, but do NOT have the term “JetBlue” in the body:
+contentType:Blog +Blog.body:("investment" !"JetBlue")
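Because the boolean operators behave like set operations, their effect can be sketched with plain Python sets. The blog IDs below are hypothetical data invented for the illustration; the sketch models only which items match and says nothing about scoring.

```python
# Hypothetical data: IDs of blogs whose body contains each word.
business = {1, 2, 3}
apple = {2, 3, 4}

both = business & apple      # AND (&&): items containing both terms
either = business | apple    # OR (||): items containing at least one term
without = business - apple   # NOT (!): items with "business" but not "Apple"

print(sorted(both))     # [2, 3]
print(sorted(either))   # [1, 2, 3, 4]
print(sorted(without))  # [1]
```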
Search All Fields
To search all fields in the content (rather than a specific field), use _all in the same way you would specify an individual field name. For example, the following query returns all items which have the word “fund” in any searchable field in the content:
+_all:fund
For more information on using the _all search term, please see the Search within All Content Fields documentation.
Use square brackets around a beginning and end date with the word “TO” between them to search for content dates that fall within the range. For example, the following query returns all News items with publish dates between January 1, 2011 and January 1, 2015:
+contentType:News +News.sysPublishDate:[01/01/2011 TO 01/01/2015]
Note: Searching date ranges via the Content API requires slightly different date formatting.
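When range queries are assembled in code, a small helper keeps the date formatting consistent. The following is a minimal Python sketch; `date_range_clause` is a hypothetical helper, and the MM/dd/yyyy format is simply copied from the example above, so it may not suit entry points that expect a different date format.

```python
from datetime import date

def date_range_clause(field, start, end):
    """Build a required Lucene range clause with MM/dd/yyyy dates."""
    fmt = "%m/%d/%Y"
    return f"+{field}:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"

query = "+contentType:News " + date_range_clause(
    "News.sysPublishDate", date(2011, 1, 1), date(2015, 1, 1)
)
print(query)
# +contentType:News +News.sysPublishDate:[01/01/2011 TO 01/01/2015]
```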
You may include wildcard characters to allow your query to return results which match some common portion of the search term, but which vary in other portions of the field.
Single Character Wildcard (?)
The question mark character (?) matches any single character, so any result which has any character in the position of the question mark will be returned as long as the rest of the query string matches. For example, the following query returns all employees whose first name consists of “Mari” followed by any single character. Search results will include employees with the first name “Maria” or “Marie” (or any other 5 letter first name which begins with “Mari”).
+contentType:Employee +Employee.firstName:Mari?
Multiple Character Wildcard (*)
The asterisk character (*) can be used before or after a text string in a query to allow a match of any characters (or no characters) before or after a match of the specified text. For example, the following query string returns all employees whose first name begins with the letter R:
+contentType:Employee +Employee.firstName:R*
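The two wildcard characters described above behave much like shell-style patterns, so their matching rules can be explored with Python's standard `fnmatch` module. This is only an approximation for illustration: `term_matches` is a hypothetical helper, and `fnmatch` also gives special meaning to square brackets, which Lucene wildcards do not.

```python
from fnmatch import fnmatchcase

def term_matches(term, pattern):
    """Compare one indexed term against a wildcard pattern, ignoring case."""
    return fnmatchcase(term.lower(), pattern.lower())

print(term_matches("Maria", "Mari?"))     # True: one character follows "Mari"
print(term_matches("Mari", "Mari?"))      # False: "?" needs exactly one character
print(term_matches("Marianne", "Mari?"))  # False: more than one extra character
print(term_matches("Robert", "R*"))       # True: "*" matches any run of characters
print(term_matches("R", "R*"))            # True: "*" also matches no characters
```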
You can search multiple fields and return all matches by including the search terms for all fields within parentheses. For example, the following query returns both employees whose first names begin with the letter R and those whose last names begin with the letter J:
+contentType:Employee +(Employee.firstName:R* Employee.lastName:J*)
- Since there is no operator between the Employee.firstName:R* and Employee.lastName:J* terms, the search returns items which match either search term (i.e. logical OR).
- To search for employees who have both a first name beginning with R AND a last name beginning with J, you would need to add an AND operator between the terms.
+contentType:Employee +(Employee.firstName:R* && Employee.lastName:J*)
dotCMS search queries support fuzzy searches based on the Levenshtein distance. To perform a fuzzy search, add the tilde symbol (~) to the end of a single-word term. An additional (optional) parameter specifies the required similarity; the value is between 0 and 1, and the closer the value is to 1, the more similar a term must be to match. The default is 0.5 if no parameter is passed. The first example below (without the parameter) returns blogs with “passes”, “passing”, “compassion”, “passenger”, “impasse”, “passed”, etc. The second example, however, returns only the blogs where “pass” specifically appears in the body.
+contentType:Blog +Blog.body:pass~
+contentType:Blog +Blog.body:pass~0.8 (optional)
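For readers unfamiliar with the Levenshtein distance, the sketch below computes it in Python using the standard dynamic-programming approach. It illustrates the metric only: the `levenshtein` function is written for this example, and how the search engine turns the similarity parameter into a permitted distance is not modeled here.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

for word in ("pass", "past", "passed", "passing"):
    print(word, levenshtein("pass", word))
# pass 0
# past 1
# passed 2
# passing 3
```

A smaller distance means the two terms are more alike, which is the intuition behind raising the similarity parameter to narrow the results.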
Queries can be sorted by score so that the search results which contain the query terms more often appear at the top of the list. Normally, when sorting by score, all query terms are treated equally, so any search result that includes a search term more times than another search result will appear closer to the top of the results list.
However, queries can be score weighted to give higher or lower listing priority to different search terms. Each term in a query can be weighted with a caret (^) followed by an integer to give the term a weight. When the query results are sorted by score, the search results that contain higher weighted terms will appear at the top of the results list, and the results with lower weighted terms will appear at the bottom of the list.
In the following example, “American” has a higher weight than “invest”. So when “Score” is used as a sort parameter, a search result that includes the term “American” will usually appear above a result that includes the term “invest”, even if the term “American” appears in the first result fewer times than the term “invest” appears in the second result.
+contentType:News +(News.title:*American*^10 News.title:*invest*^2)
Here is an example of code which uses this query and sorts by Score in a news content pull:
#foreach($news in $dotcontent.pull("+contentType:News +(News.title:*American*^10 News.title:*invest*^2)",10,"Score"))
    $velocityCount.$news.headline
#end
Lucene syntax does not support negative boost values. However, you can achieve a result similar to a negative boost by giving a positive boost to all content which does not match a given term. This can be accomplished by adding the term (*:* -[term])^[boost] to the query.
For example, the following code repeats the previous query, but reduces the score of all results which contain the term “retirement” (by boosting the score of all results which do not contain the term “retirement”):
#foreach($news in $dotcontent.pull("+contentType:News +(News.title:*American*^10 News.title:*invest*^2) (*:* -retirement)^99",10,"Score"))
    $velocityCount.$news.headline
#end
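If boosted queries are generated programmatically, the boost and "negative boost" patterns above can be wrapped in small helpers. The Python sketch below only assembles the query string; `boosted` and `demote` are hypothetical names, and nothing here calls dotCMS.

```python
def boosted(clause, boost):
    """Append a positive integer boost to a query clause."""
    if boost <= 0:
        raise ValueError("boost must be positive")
    return f"{clause}^{boost}"

def demote(term, boost):
    """Emulate a negative boost by boosting everything that does NOT match `term`."""
    return boosted(f"(*:* -{term})", boost)

title_terms = " ".join([
    boosted("News.title:*American*", 10),
    boosted("News.title:*invest*", 2),
])
query = f"+contentType:News +({title_terms}) {demote('retirement', 99)}"
print(query)
# +contentType:News +(News.title:*American*^10 News.title:*invest*^2) (*:* -retirement)^99
```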
Escape Special Characters
Certain characters have special meaning in a query string, so if you wish to search for these characters in your query, you must escape the characters to ensure they aren't interpreted by the query parser. Use a backslash before a special character to escape it in a search query. The following special characters must be escaped:
|Character / Term||Text||Escaped|
The && and || character pairs only need to be escaped if you wish to query for the two characters together.
- If you wish to search for a single & or | character, you do not need to escape it.
- However, it is always good practice to escape even the individual characters, to make it clearer to someone else what the query is intended to search for.
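When search text comes from user input, escaping can be automated. The Python sketch below puts a backslash before each character that the standard Lucene query parser treats as special. The character list is an assumption based on the standard Lucene syntax rather than on the table above, and `escape_query_text` is a hypothetical helper. Following the good-practice advice above, it escapes single & and | characters as well.

```python
import re

# Characters assumed to be special, per the standard Lucene query parser.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')

def escape_query_text(text):
    """Prefix every special character in `text` with a backslash."""
    return _SPECIAL.sub(r"\\\1", text)

print(escape_query_text("(1+1):2"))  # \(1\+1\)\:2
print(escape_query_text("AT&T"))     # AT\&T
```

Only the literal search text should be passed through such a helper; escaping a complete query would also neutralize the operators it is meant to contain.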
Full Elasticsearch Syntax (JSON)
Full Elasticsearch syntax uses JSON format which supports many different types of queries, features, and options. Depending on the type of query performed, full Elasticsearch syntax may require individual query terms or a query string; in queries that require a “query” term, the query string uses Lucene query syntax, as described above. For example:
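As a rough illustration of how a Lucene query string sits inside the JSON syntax, the Python sketch below builds a request body around Elasticsearch's `query_string` query type and prints it as JSON. The query text is reused from the required operator example, and the `size` value is an arbitrary choice for the illustration; how the body is submitted to dotCMS is not shown.

```python
import json

lucene_query = '+contentType:Blog +Blog.body:(+"investment" "JetBlue")'

# Request body using the Elasticsearch `query_string` query type.
body = {
    "query": {
        "query_string": {
            "query": lucene_query
        }
    },
    "size": 10,  # arbitrary result limit, for illustration only
}

print(json.dumps(body, indent=2))
```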
For complete documentation on Elasticsearch syntax (including examples for additional features and options), please see the official Elasticsearch documentation.