How Content is Mapped to Elasticsearch

Last Updated: Jun 19, 2020
documentation for the dotCMS Content Management System

All content has properties that are indexed and made searchable by Elasticsearch. Some dotCMS properties are shared by all Content Types, and some specific properties are indexed based on how a Content Type is configured (such as when you mark a field as System Indexed).

Please see the following sections for more information on how dotCMS content is mapped in Elasticsearch:

Index Contents and Format

Data Types

dotCMS maps the data from each content field to an Elasticsearch data type appropriate to the data stored in the field. For many field types the data is mapped as a string (text); for other field types it is mapped to a more specific data type. The following table shows how each field type is mapped:

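| dotCMS Field Type | Data Type Property | Indexable | ES Data Type | Raw Field | Sortable |
|---|---|---|---|---|---|
| Binary | N/A | Yes (1) | ????? | No | No |
| Category | N/A | Always (3) | Text | No | No |
| Checkbox | N/A | Yes | Text | Yes | No (5) |
| Constant Field | N/A | No | Text | ????? | ????? |
| Custom Field | N/A | Yes | Text | Yes | Through Raw Field (4) |
| Date | N/A | Yes | Date/Time | Yes | Yes |
| Date and Time | N/A | Yes | Date/Time | Yes | Yes |
| File | N/A | No | ????? | No | No |
| Key/Value | N/A | Yes (2) | Text | Yes | No |
| Hidden Field | N/A | No | Text | ????? | ????? |
| Image | N/A | No | ????? | No | No |
| Multiselect | N/A | Yes | Text | Yes | No (5) |
| Radio | N/A | Yes | Text | Yes | Through Raw Field (4) |
| Select | N/A | Yes | Text | Yes | Through Raw Field (4) |
| Site or Folder | N/A | Yes (2) | ????? | No | No |
| Tag | N/A | Always (3) | Keyword | No | No (4) |
| Text | Text | Yes | Text | Yes | Through Raw Field (4) |
| Text | Decimal | Yes | Float | Yes | Yes |
| Text | Whole Number | Yes | Integer | Yes | Yes |
| Textarea | N/A | Yes | Text | Yes | Through Raw Field (4) |
| Time | N/A | Yes | Date/Time | Yes | Yes |
| WYSIWYG | N/A | Yes | Text | Yes | Through Raw Field (4) |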

Key:

  1. Field is indexed when the User Searchable property is set to true.
  2. Category and Tag fields are always indexed.
  3. Fields which can contain multiple values can not be sorted (even if only one value in the field is selected).
  4. Only the first 8192 characters of Raw fields are indexed, and thus sorting is only performed based on the first 8192 characters for these fields.
  5. The values of Checkbox and Multiselect fields are stored as lists of selected values, in the order those values appear in the Value property of the field. Therefore, sorting on Checkbox and Multiselect fields with multiple selected values only makes sense if the list of selectable values (in the field Value property) is itself sorted.

Note: The Line Divider, Permissions, and Relationships Legacy fields do not contain any data, and are not indexed.

In addition, dotCMS always maps a “raw” version of each field, as well as a text version of each field which is not already stored in a string format. The raw and text versions of fields are always stored in string data format.


Note:

In previous dotCMS versions, all fields of all types of content were mapped as strings. However dotCMS now maps fields of type “Text”, “Date”, “Time”, and “Date/time” to appropriate data types for the type of data. This means that queries against these types of fields which were written in previous versions of dotCMS may no longer produce the same results, and may need to be rewritten. For more information, please see the Upgrading to dotCMS 5.0 documentation.


For more information on how data is mapped in Elasticsearch, please see the Elasticsearch documentation.

Case of Query Terms

All variable keys and content in string format are converted to lower case before they are added to the Elasticsearch index. So when you create an Elasticsearch query against a string field, all field names and values should be written in lower case.

For example, if searching for news items tagged with “Singapore”, the query term should not capitalize the tag name:

news.tags:singapore
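
The lower-casing rule can be applied mechanically when building query terms. The following is a minimal Python sketch, not part of dotCMS: the helper name lucene_term is hypothetical, and it only lower-cases the field name and value as described in this section, without escaping any special query characters.

```python
def lucene_term(field: str, value: str) -> str:
    """Build a `field:value` query term, lower-casing both parts.

    Hypothetical helper for illustration only; it does not escape
    special query characters.
    """
    return f"{field.lower()}:{value.lower()}"


# A tag entered as "Singapore" is queried in lower case.
print(lucene_term("news.tags", "Singapore"))
```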

Content Analysis

When indexing a field, Elasticsearch automatically analyzes the contents and changes the way the content of the field is stored to optimize searches within the index. For example, words within a multi-word field are automatically separated and indexed separately by Elasticsearch. Therefore, once a field has been indexed, queries against that field may not behave as expected:

  • It may not be possible to search the contents in certain ways or for certain specific terms or combinations of terms.
    • For example, since words are automatically separated and indexed separately, it is not possible to search for multiple words in a specific order within an indexed field.
  • Sorting may not work as expected.
    • If you attempt to sort by any indexed field, the query will succeed but the results may not be sorted as expected.

These issues can be resolved by querying or sorting on the Raw Fields instead.

Custom Field Mappings

You can create custom mappings for individual Content Type fields by adding a field variable named esCustomMapping to the field.

The esCustomMapping field variable accepts an Elasticsearch field mapping. Please see below for some simple examples, and please see the Elasticsearch documentation for more detailed information on Elasticsearch field mappings.

Example 1

The following mapping directs that the field should be indexed as a Keyword (rather than as normal text), and limits the size of the value stored in the index:

{
"type":  "keyword",
"ignore_above": 8191
}

Example 2

The following mapping specifies the Elasticsearch type to be used to index the field, that the field should not be stored, and to use the Russian language text analyzer when indexing the field:

{
"analyzer": "russian",
"store": "false",
"type": "text"
}

Any Elasticsearch analyzer can be specified for the field. For a list of the built-in Elasticsearch analyzers, please see the Elasticsearch documentation.
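
If you generate field variables from a script, it may help to build the mapping as a data structure and serialize it, so that the value stored in esCustomMapping is always well-formed JSON. The following Python sketch reproduces the mapping from Example 1; the variable names are illustrative, and how the resulting text is saved into the field variable is outside the scope of this sketch.

```python
import json

# Mapping from Example 1: index the field as a keyword and limit
# the size of the value stored in the index.
mapping = {"type": "keyword", "ignore_above": 8191}

# Serialize to the JSON text that would be used as the value of the
# esCustomMapping field variable.
es_custom_mapping = json.dumps(mapping, indent=2)
print(es_custom_mapping)

# Round-trip check: the text parses back to the same mapping.
assert json.loads(es_custom_mapping) == mapping
```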

Content Type Fields

Each Content Type contains both system properties (which exist in all Content Types) and fields specific to the Content Type (either as part of the Base Content Type or added to the Content Type). Only fields which are indexed can be queried using Elasticsearch.

Indexed and Non-Indexed Fields

All system properties and all Content Type fields with the System Indexed property set may be accessed from within an Elasticsearch query, and query results may be sorted by these fields.

Fields which are not System Indexed may be accessed from the objects returned in the search results (when using a content pull for example), but may not be included as part of the query or sort field. Attempts to query or sort by non-indexed fields will usually produce no search results.

Automatically Indexed Field Types

The following Field types are automatically indexed by dotCMS:

  • Category
  • Tag

There is no System Indexed checkbox to select when configuring these field types because dotCMS always adds them to the Elasticsearch index.

Non-Indexed Field Types

The following Field types are not indexed by dotCMS:

  • File
  • Image

There is no System Indexed checkbox to select when configuring these field types, and you can not perform Elasticsearch queries against the contents of these fields (the queries are allowed, but will return no results).

Indexing of Binary Fields

The Binary field is a special case, because only a single Binary field is indexed. The Binary field which will be indexed is the first Binary field with the System Indexed property set. When a Binary field is indexed, it is the metadata of the file that is indexed. If the contents of the file contain text (regardless of the specific MIME type), then the full text of the file is also indexed, in the content field of the metadata.

The file contents of Content Types which are built using the Base Content Types of File Asset and Asset are also indexed. These Base Content Types contain a Binary field by default, which is indexed the same as any other Binary field.

Required Indexing

Certain field properties require that the field be indexed for the property to work. Because of this, when you select one of these properties for a field, the field is automatically indexed (and the “System Indexed” checkbox, if available, will automatically be selected and disabled). The properties that require a field to be indexed are:

  • Show in Listing
  • Unique
  • User Searchable

Indirect Indexing

Some fields can not be explicitly indexed (there is no “System Indexed” field to check or uncheck). However these fields will still be indexed automatically if the “Required”, “Show in Listing” or “User Searchable” properties are selected for these fields:

Note, however, that since you can not explicitly un-index these fields, once you select any of the Required Indexing properties for a field of this type, the indexed property will be set, and the field will always remain indexed from that point forward (even if you un-check all the Required Indexing fields).

Sortable and Unsortable Fields


Numeric Fields

You may sort on all indexed numeric fields (including boolean and date fields), both when using Lucene queries and full Elasticsearch JSON queries.

Text Fields

Elasticsearch has two different ways of indexing fields that contain text. One of these, the “text” data type, is not sortable. The other, the “keyword” data type, is sortable.

All Text fields in dotCMS Content Types are indexed twice, once as a “text” field, and once as a “keyword” field.

  • Primary text fields (indexed using the Velocity variable name of the Content Type field) are indexed as “text” fields, and are not sortable.
  • “Raw” versions of text fields are indexed as “keyword” fields, and are sortable.

Therefore, when using Elasticsearch JSON queries, you must sort using the _dotraw version of a text field.

In addition, when you perform sorting using any of the built-in content pulls which include a sort parameter, the field names of any text fields you provide in the “sort” parameter are automatically modified so that it is the Raw version of the text field that is sorted on, ensuring that the sort will work as expected.

Other Field Types

Other types of fields are, in general, not sortable. In addition, it's important to note that some Content Type fields contain values in a format which does not make sense to sort. For example, “Site or Folder” fields save the inode of the selected Site or Folder, not the name or path; therefore even if such a field were sortable, the results of the sort would not make sense.

For these reasons, sorting is only recommended (and supported) on Content Type fields of the following types:

Directly Sortable Fields

  • Date
  • Date and Time
  • Time
  • Text (Decimal)
  • Text (Whole Number)

Sortable using the _dotraw Version of the Field

  • Radio
  • Select
  • Text (Text)
  • Textarea
  • WYSIWYG
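
The two lists above can be turned into a small lookup that picks the field name to sort on. This is a hedged Python sketch rather than dotCMS code: the function name sort_field and the field-type labels used as keys are assumptions made for illustration, and the suffix follows the _dotraw naming used for raw fields in this document.

```python
# Field types taken from the lists above; the helper itself is hypothetical.
DIRECTLY_SORTABLE = {"Date", "Date and Time", "Time",
                     "Text (Decimal)", "Text (Whole Number)"}
SORTABLE_VIA_RAW = {"Radio", "Select", "Text (Text)", "Textarea", "WYSIWYG"}


def sort_field(content_type: str, field_var: str, field_type: str) -> str:
    """Return the field name to sort on, or raise if sorting is unsupported."""
    name = f"{content_type}.{field_var}"
    if field_type in DIRECTLY_SORTABLE:
        return name
    if field_type in SORTABLE_VIA_RAW:
        # Text-like fields are sorted on their raw (keyword) version.
        return name + "_dotraw"
    raise ValueError(f"sorting is not supported for {field_type} fields")


print(sort_field("news", "byline", "Text (Text)"))
print(sort_field("calendarEvent", "startDate", "Date and Time"))
```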

System Properties

Properties Always Available

The following properties exist in all dotCMS Content Types. Each of these properties can be accessed (either in the query terms or the sort field) by specifying the property name only (e.g. “title”), without reference to the name of the content type.

| Field | Type | Description |
|---|---|---|
| modDate | date | Date the content was last saved (either the working or live version). |
| title | text | The Title field is automatically selected based on the properties of the Content Type fields. For details of how the Title is selected, please see below. |
| contenttype | text | Variable name for the Content Type. |
| basetype | int | Enumerated value of the Base Content Type. Content=1, Widget=2, Form=3, File=4, Page=5. |
| live | bool | True if the content has at least one version which has been published. Please see below for more information. |
| working | bool | True if the content has at least one version which has not been published. Please see below for more information. |
| deleted | bool | True if the content has been archived. Please see below for more information. |
| locked | bool | True if the content is locked. |
| languageId | int | Language id of the content. |
| identifier | text | Identifier of the content item. |
| inode | text | Inode of a specific version of the content item. |
| conhost | text | Identifier of the Site the content is on. |
| sysPublishDate | date | Date and time the content was published. |
| stInode | text | Identifier of the Content Type of the content. |
| owner | text | User ID of the owner (creator) of the content. |
| modUser | text | User ID of the last user to modify the content. |
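
Because these properties are referenced by name only, a query on them does not need a Content Type prefix. The following Python sketch builds such a query from the basetype values listed above. The BaseType enum and the function name are hypothetical, and the true/false literals used for the boolean properties are an assumption about query syntax rather than something stated in this document.

```python
from enum import IntEnum


class BaseType(IntEnum):
    """Enumerated basetype values listed in the table above."""
    CONTENT = 1
    WIDGET = 2
    FORM = 3
    FILE = 4
    PAGE = 5


def published_of_base_type(base_type: BaseType) -> str:
    """Query for published, non-archived content of one Base Content Type."""
    # Assumption: boolean properties are matched with true/false literals.
    return f"+basetype:{int(base_type)} +live:true +deleted:false"


print(published_of_base_type(BaseType.PAGE))
```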

Properties Which May be Available

The following properties are indexed for all content in dotCMS, but may not always contain valid values depending on the Content Type and the state of the content.

| Field | Type | Description |
|---|---|---|
| titleimage | text | Identifier of the first image field in the content (either an Image field or a Binary field containing an image). |
| confolder | text | Identifier of the folder the content is in. |
| parentpath | text | Path to the folder the content is in. |
| path | text | Path/URL to the content. |
| wfcreatedBy | text | The userid of the user who created the current Workflow. |
| wfassign | text | The id of the user or Role assigned the current Workflow. |
| wfstep | text | The guid of the Workflow step the content is currently in. |
| wfModDate | date | Date the Workflow was changed to the Workflow step the content is currently in. |
| pubdate | date | Date the content was published, formatted numerically as yyyyMMddHHmmss. |
| expdate | date | Date the content will expire, formatted numerically as yyyyMMddHHmmss. |
| urlmap | text | The URL map of the content (if any). |
| categories | text | The variable names of the Categories the content is assigned to. |

How the Title Field is Chosen

The Title field is selected automatically based on the values and properties of the fields in the Content Type, using the following rules (in order of priority, starting at the top):

  1. If there is a field with a Velocity Variable name of “title”, it will be used.
  2. If there is a field with a Velocity Variable name that starts with “title”, it will be used.
  3. If there are any Text fields with the Show in Listing property set to true, the first one will be used.
  4. If there is a Binary field with the System Indexed property set to true, the file name will be used.
  5. If the Base Content Type is fileAsset or dotAsset, the file name will be used.
  6. If none of the above conditions applies, the identifier of the content will be used.
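
The priority order above can be read as a chain of fallbacks. The following Python sketch illustrates that reading; it is not the dotCMS implementation. The Field class, its attributes, and the choose_title function are hypothetical, and the comparison of variable names is assumed to be case-sensitive.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    # Simplified, hypothetical field model for illustration only.
    variable: str
    field_type: str               # e.g. "Text" or "Binary"
    show_in_listing: bool = False
    indexed: bool = False
    value: Optional[str] = None   # for Binary fields: the file name


def choose_title(fields: List[Field], base_type: str, identifier: str,
                 file_name: Optional[str] = None) -> Optional[str]:
    """Apply the six rules in priority order and return the title."""
    # Rule 1: a field whose variable name is exactly "title".
    for f in fields:
        if f.variable == "title":
            return f.value
    # Rule 2: a field whose variable name starts with "title".
    for f in fields:
        if f.variable.startswith("title"):
            return f.value
    # Rule 3: the first Text field with Show in Listing set.
    for f in fields:
        if f.field_type == "Text" and f.show_in_listing:
            return f.value
    # Rule 4: an indexed Binary field supplies its file name.
    for f in fields:
        if f.field_type == "Binary" and f.indexed:
            return f.value
    # Rule 5: file-based Base Content Types use the file name.
    if base_type in ("fileAsset", "dotAsset"):
        return file_name
    # Rule 6: fall back to the content identifier.
    return identifier
```

For example, a Content Type whose only listed Text field is headline would take its title from that field, while a Content Type with no matching fields would fall back to the identifier.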

Content Type Specific Fields

You can query any fields which are part of the Base Content Type or any fields added to a Content Type using the pattern {content type variable name}.{field variable name} (for example, news.headline or news.tags).

  • Content Types are referenced within Elasticsearch queries by the variable name of the Content Type.
    • For example, when querying the “Event” Content Type, your query must reference “calendarEvent” (the variable name of the Content Type) rather than “Event” (the display name of the Content Type).
  • Similarly, the fields of a Content Type are referenced within Elasticsearch queries by the variable name of the field (not the field Label or Alias Name).
    • For example, the name used to reference the “Start Date” field of the “Event” Content Type is “startDate” (the variable name of the field), not “Start Date” (the Label of the field).
  • Finally, when accessing any field of a Content Type other than a System Property, you must preface the field name with the name of the content type and a period.
    • For example, when accessing the “Start Date” field of the “Event” Content Type, you must reference the field as “calendarEvent.startDate”.
    • If you fail to preface the field variable with the Content Type variable name, the query will not recognize the field (and may return no results).

All references to Content Types and fields must use this syntax, regardless of whether you are using Lucene syntax or full Elasticsearch JSON syntax. And you must use this syntax when specifying the sort field in dotCMS viewtools (such as content pulls and the Elasticsearch Viewtool).

Note: If a field is not indexed, you may still attempt to query it (the query will not return an error). However the query will not match any values for the non-indexed field, so the query will most likely return no results.

Key/Value Fields

Key/Value Fields have a special “flattened” mapping in Elasticsearch queries, to simplify reference to their data members. Using the pattern described above under content-specific fields, simply access the field directly, and then use the . accessor to reach the key_value:{key}_{value} property.

+{ContentType}.{keyValFieldVar}.key_value:{key}_{value}

Wildcard characters (*) are valid for either keys or values. For example, if you are trying to find the contentlet of Content Type MyObject with Key/Value field data1 that contains the pairs {key1=123, key2=456}, any of the following queries will include this contentlet in their results:

+MyObject.data1.key_value:key1_123
+MyObject.data1.key_value:key1_*
+MyObject.data1.key_value:key2_456
+MyObject.data1.key_value:key2_*
+MyObject.data1.key_value:*_123
+MyObject.data1.key_value:*_456
+MyObject.data1.key_value:*_*
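
The pattern is regular enough to generate. The following Python sketch builds Key/Value query terms with an optional wildcard on either side; the function name key_value_query is hypothetical and no escaping of special characters is attempted.

```python
def key_value_query(content_type: str, field_var: str,
                    key: str = "*", value: str = "*") -> str:
    """Build a Key/Value query term; "*" is a wildcard for either part."""
    return f"+{content_type}.{field_var}.key_value:{key}_{value}"


# Reproduces three of the example queries above.
print(key_value_query("MyObject", "data1", "key1", "123"))
print(key_value_query("MyObject", "data1", key="key2"))
print(key_value_query("MyObject", "data1", value="456"))
```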

Relationship Fields

When you create a Relationship between two Content Types, Elasticsearch indexes all instances of the Relationship (all Relationships between two individual content items) using the Relationship name as a field name and the identifier of the related content item as the value of the field. For more information, please see the following examples.

Note:

  • The following examples all demonstrate queries using Lucene Syntax
    • The same basic method (using the Relationship name as a field name and the content identifier as the value) can be used with Elasticsearch JSON Syntax.
  • All of these queries can be run against the dotCMS starter site or demo site.

Find all content which is part of a specific Relationship:

    +News-Comments:*

Find all content which is related to a specific content item through a specific Relationship:

    +News-Comments:7a3d042f-aae4-4e60-8385-7fc5320f572f

Find all content which is related to a specific content item through any Relationship:

  • Method 1: Specify the Names of All Relationships:
    +(News-Comments:7a3d042f-aae4-4e60-8385-7fc5320f572f Parent_News-Child_Media:7a3d042f-aae4-4e60-8385-7fc5320f572f Parent_News-Child_Youtube:7a3d042f-aae4-4e60-8385-7fc5320f572f)
    
  • Method 2: Search All Fields (and exclude the content item itself from the results):
    +catchall:7a3d042f-aae4-4e60-8385-7fc5320f572f -identifier:7a3d042f-aae4-4e60-8385-7fc5320f572f
    
    • Note that for some Content Types (such as Event), this method may pull additional content items which are related via another field which contains an identifier (other than a Relationship field).
      • In this case, additional query parameters must be added to filter out these other fields (e.g. -calendarEvent.disconnectedFrom:7a3d042f-aae4-4e60-8385-7fc5320f572f for the Event Content Type).
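
Both methods for finding content related through any Relationship can be generated from the identifier. The following Python sketch shows one way to do so; the function names are hypothetical, and the list of Relationship names must be supplied by the caller.

```python
from typing import Iterable


def related_through_any(identifier: str, relationship_names: Iterable[str]) -> str:
    """Method 1: one optional clause per Relationship name, grouped as required."""
    clauses = " ".join(f"{name}:{identifier}" for name in relationship_names)
    return f"+({clauses})"


def related_via_catchall(identifier: str) -> str:
    """Method 2: search all fields and exclude the content item itself."""
    return f"+catchall:{identifier} -identifier:{identifier}"


item = "7a3d042f-aae4-4e60-8385-7fc5320f572f"
print(related_through_any(item, ["News-Comments", "Parent_News-Child_Media"]))
print(related_via_catchall(item))
```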

Special Fields

The following fields do not exist in all Content Types, but have special uses within dotCMS. If you wish to use these fields, you must add them to a Content Type; but once you add them they enable the use of additional dotCMS features which rely on the existence of these fields.

| Field | Type | Description |
|---|---|---|
| tags | array of keywords | You must create a tag field to be able to associate tags with a piece of content. Note that the tag field is very important for many Personalization features. |
| latlong | geo_point | To enable geolocation queries on any Content Type, you must create a text field on that content type with the specific Velocity variable name latlong. Note: To do this, you must name the field “Latlong”; you may change the field name after you have saved it once using this name. This field takes a string value of latitude and longitude separated by a comma (e.g. “42.648899,-71.165497”). |
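
When the latlong value is produced by code, a small formatter helps keep the string in the expected "latitude,longitude" form. The following Python sketch is illustrative only: the function name format_latlong is hypothetical, and the range checks are an added safeguard based on standard coordinate ranges, not a documented dotCMS requirement.

```python
def format_latlong(latitude: float, longitude: float) -> str:
    """Format a coordinate pair as a comma-separated latitude,longitude string."""
    # Added safeguard (assumption): reject values outside standard ranges.
    if not -90.0 <= latitude <= 90.0:
        raise ValueError("latitude must be between -90 and 90")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude must be between -180 and 180")
    return f"{latitude},{longitude}"


print(format_latlong(42.648899, -71.165497))
```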

File Metadata

In dotCMS Enterprise edition, file contents and metadata are also indexed and searchable via Elasticsearch. However, dotCMS stores only the metadata fields listed in the INDEX_METADATA_FIELDS property, which is set in the dotmarketing-config.properties file via a static plugin. The default value of this property is:

INDEX_METADATA_FIELDS=width,height,contentType,author,keywords,fileSize,content

Storing ALL Metadata

All the metadata on files can be stored by setting INDEX_METADATA_FIELDS to a wildcard:

INDEX_METADATA_FIELDS=*

How to Query File Metadata

Metadata can be queried using the pattern +metaData.{fieldname}:*{value}*, for example:

 +metaData.contentType:*image/jpeg*
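
Since only the configured metadata fields are stored, it can be useful to check a field name against the configured list before building a query. The following Python sketch does this using the default property value quoted in this section; the function name metadata_query is hypothetical, and reading the live value of the property from dotCMS is not shown.

```python
# Default value of INDEX_METADATA_FIELDS, as quoted in this section.
DEFAULT_INDEX_METADATA_FIELDS = (
    "width,height,contentType,author,keywords,fileSize,content"
)


def metadata_query(field: str, value: str,
                   configured: str = DEFAULT_INDEX_METADATA_FIELDS) -> str:
    """Build a +metaData.{fieldname}:*{value}* query for a configured field."""
    allowed = [name.strip() for name in configured.split(",")]
    # A "*" entry means all metadata fields are stored.
    if "*" not in allowed and field not in allowed:
        raise ValueError(f"{field} is not listed in INDEX_METADATA_FIELDS")
    return f"+metaData.{field}:*{value}*"


print(metadata_query("contentType", "image/jpeg"))
```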

For more information on searching file metadata, please see the File Metadata documentation.

Raw Fields

Due to the way Elasticsearch indexes fields, once a field has been indexed it may not be possible to search the contents of the field in certain ways, or to properly sort based on the field. However for every field that is indexed and analyzed in Elasticsearch, dotCMS adds an additional field of the same name with _dotraw appended to it (for example, the raw version of the indexed news.byline field can be accessed as news.byline_dotraw). This _dotraw field, which is stored as a keyword datatype, keeps the “raw” value of the original field, which is not analyzed by Elasticsearch. This enables you to query, aggregate, and sort based on the exact text of the field, even after it has been indexed.
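
As an illustration, a full Elasticsearch JSON query can sort on the raw version of a text field. The following Python sketch builds such a request body as a dictionary and prints it as JSON. The query string and the choice of the news.byline field are assumptions for the example, and how the body is submitted to dotCMS is not shown.

```python
import json

# Hypothetical request body: match news items, sort by the raw byline.
query_body = {
    "query": {
        "query_string": {"query": "+contenttype:news"}
    },
    "sort": [
        # Sort on the unanalyzed keyword version of the field.
        {"news.byline_dotraw": {"order": "asc"}}
    ],
}

print(json.dumps(query_body, indent=2))
```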

Custom Sorting

Elasticsearch provides sophisticated methods to perform custom sorts based on almost any field or combination of fields. To learn more about Elasticsearch custom sorting capabilities, please see the Elasticsearch Sorting documentation.

In addition, you can perform customized sorting by creating custom fields which use Velocity code to construct customized search keys or tokens. This allows you to create relatively sophisticated custom sort capabilities using relatively simple Velocity code, without the need to learn more sophisticated Elasticsearch functionality. For more information on custom fields, please see the Custom Fields documentation.

Query Results and Permissions

dotCMS ensures that a user has appropriate permissions for any content returned by Elasticsearch query results; results which a user does not have permissions for are not returned to that user.

Note that this means different users running the same Elasticsearch query may receive different results. Therefore, when troubleshooting queries and query results, make sure to take the user's permissions into account.
