How Content is Mapped to Elasticsearch

All content has properties that are indexed and made searchable by ElasticSearch. Some dotCMS properties are shared by all Content Types, and some specific properties are indexed based on how a Content Type is configured (such as when you mark a field as System Indexed).

Please see the following sections for more information on how dotCMS content is mapped in ElasticSearch:

Index Contents and Format

Text Format

Whenever any content in dotCMS is indexed for Elasticsearch, the contents of each indexed field are indexed and searchable. However, it is important to understand that all fields are indexed as text only; non-text values (such as numbers stored in Text fields) are converted to text if necessary, and the text representation is stored in the index.

Case of Query Terms

In addition, all variable keys and content are converted to lower case before adding them to the ElasticSearch index. So when you create an ElasticSearch query, all field names and values should be written in lower case.

For example, if searching for news items tagged with “Singapore”, the query term should not capitalize the tag name:

news.tags:singapore
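
As a minimal sketch of the lower-casing rule above, the following Python helper builds a single query term. The function name `lucene_term` is hypothetical (it is not part of dotCMS), and the helper only handles single-word values.

```python
def lucene_term(field: str, value: str) -> str:
    """Build one `field:value` query term, lower-casing both sides.

    Hypothetical helper for illustration only; it does not escape
    special characters or quote multi-word values.
    """
    return f"{field.lower()}:{value.lower()}"


print(lucene_term("News.tags", "Singapore"))  # news.tags:singapore
```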

Content Analysis

When indexing a field, Elasticsearch automatically analyzes the contents and changes the way the content of the field is stored to optimize searches within the index. For example, words within a multi-word field are automatically separated and indexed separately by Elasticsearch. Therefore, once a field has been indexed, queries against that field may not behave as expected:

  • It may not be possible to search the contents in certain ways or for certain specific terms or combinations of terms.
    • For example, since words are automatically separated and indexed separately, it is not possible to search for multiple words in a specific order within an indexed field.
  • Sorting may not work as expected.
    • If you attempt to sort by any indexed field, the query will succeed but the results may not be sorted as expected.

These issues can be resolved by querying or sorting on the Raw Fields instead.
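
To make the effect of analysis more concrete, here is a rough Python illustration of how a multi-word value might be broken into separately indexed tokens. The `simple_analyze` function is a deliberately simplified stand-in written for this example; it is not the analyzer Elasticsearch or dotCMS actually uses.

```python
import re


def simple_analyze(text: str) -> list[str]:
    """Simplified stand-in for an analyzer (an assumption, not the real one):
    lower-case the text, then split it on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


for title in ["Zebra Crossing Rules", "apple harvest", "Market Update"]:
    # Each token is stored on its own, so the original full string
    # is no longer available as a single value on the analyzed field.
    print(title, "->", simple_analyze(title))
```

Because only the individual tokens are kept for the analyzed field, exact-text matching and sorting need a copy of the value that has not been split up.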

Content Type Fields

Each Content Type contains both system properties (which exist in all Content Types) and fields specific to the Content Type (either as part of the Base Content Type or added to the Content Type). Only fields which are indexed can be queried using Elasticsearch.

Indexed and Non-Indexed Fields

All system properties and all Content Type fields with the System Indexed property set may be accessed from within an Elasticsearch query, and query results may be sorted by these fields.

Fields which are not System Indexed may be accessed from the objects returned in the search results (when using a content pull for example), but may not be included as part of the query or sort field. Attempts to query or sort by non-indexed fields will usually produce no search results.

Automatically Indexed Field Types

The following Field types are automatically indexed by dotCMS:

There is no System Indexed checkbox to select when configuring these field types because dotCMS always adds them to the Elasticsearch index.

Non-Indexed Field Types

The following Field types can not be indexed by dotCMS:

There is no System Indexed checkbox to select when configuring these field types, and you can not perform Elasticsearch queries against the contents of these fields (the queries are allowed, but will return no results).

Required Indexing

Certain field properties require that the field be indexed for the property to work. Because of this, when you select one of these properties for a field, the field is automatically indexed (and the “System Indexed” checkbox, if available, will automatically be selected and disabled). The properties that require a field to be indexed are:

  • Show in Listing
  • Unique
  • User Searchable

Indirect Indexing

Some fields cannot be explicitly indexed (there is no “System Indexed” checkbox to check or uncheck). However, these fields will still be indexed automatically if the “Required”, “Show in Listing” or “User Searchable” properties are selected for these fields:

Note, however, that since you cannot explicitly un-index these fields, once you select any of the Required Indexing properties for a field of this type, the indexed property will be set, and the field will always remain indexed from that point forward (even if you un-check all of the Required Indexing properties).

System Properties

The following properties exist in all dotCMS Content Types. Each of these properties can be accessed (either in the query terms or the sort field) by specifying the property name only (e.g. “title”), without reference to the name of the content type.

| Field | Type | Description |
|-------|------|-------------|
| modDate | long | Date the content was last saved (either the working or live version). |
| title | string | If the Content Type has a field with a “title” variable name, this queries that field. Otherwise this queries the first text field in the Content Type with “Show in Listing” checked. |
| contenttype | string | Variable name for the Content Type. |
| basetype | int | Enumerated value of the Base Content Type. Content=1, Widget=2, Form=3, File=4, Page=5. |
| live | bool | True if the content is live (published) on your site. |
| working | bool | True if the content is a working (unpublished) version. |
| locked | bool | True if the content is locked. |
| deleted | bool | True if the content has been archived. |
| languageid | int | Language id of the content. |
| identifier | string | Identifier of the content item. |
| inode | string | Inode of a specific version of the content item. |
| conhost | string | Identifier of the Site the content is on. |
| confolder | string | Identifier of the folder the content is in. |
| parentpath | string | Path to the folder the content is in. |
| path | string | Path/URL to the content. |
| wfcreatedBy | string | The userid of the user who created the current Workflow. |
| wfassign | string | The id of the user or Role assigned the current Workflow. |
| wfstep | string | The guid of the Workflow step the content is currently in. |
| wfModDate | string | Date the Workflow was changed to the Workflow step the content is currently in. |
| pubdate | long | Date the content was published, formatted numerically as yyyyMMddHHmmss. |
| expdate | long | Date the content will expire, formatted numerically as yyyyMMddHHmmss. |
| urlmap | string | The URL map of the content (if any). |
| categories | string | The variable names of the Categories the content is assigned to. |
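
The pubdate and expdate properties use the numeric yyyyMMddHHmmss format. The Python sketch below shows one way to produce that format and build a date range term. The helper names are hypothetical, and the example assumes that standard Lucene range syntax (`[start TO end]`) can be applied to these properties.

```python
from datetime import datetime


def to_index_date(moment: datetime) -> str:
    """Format a datetime as yyyyMMddHHmmss (hypothetical helper)."""
    return moment.strftime("%Y%m%d%H%M%S")


def pubdate_range(start: datetime, end: datetime) -> str:
    """Build a required range term on pubdate.

    Assumes standard Lucene range syntax applies to this property.
    """
    return f"+pubdate:[{to_index_date(start)} TO {to_index_date(end)}]"


print(pubdate_range(datetime(2024, 1, 1), datetime(2024, 12, 31, 23, 59, 59)))
```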

Content Type Specific Fields

You can query any fields which are part of the Base Content Type or any fields added to a Content Type using the pattern {content type variable name}.{field variable name} (for example, news.headline or news.tags).

  • Content Types are referenced within Elasticsearch queries by the variable name of the Content Type.
    • For example, when querying the “Event” Content Type, your query must reference “calendarEvent” (the variable name of the Content Type) rather than “Event” (the display name of the Content Type).
  • Similarly, the fields of a Content Type are referenced within Elasticsearch queries by the variable name of the field (not the field Label or Alias Name).
    • For example, the name used to reference the “Start Date” field of the “Event” Content Type is “startDate” (the variable name of the field), not “Start Date” (the Label of the field).
  • Finally, when accessing any field of a Content Type other than a System Property, you must preface the field name with the name of the content type and a period.
    • For example, when accessing the “Start Date” field of the “Event” Content Type, you must reference the field as “calendarEvent.startDate”.
    • If you fail to preface the field variable with the Content Type variable name, the query will not recognize the field (and may return no results).

All references to Content Types and fields must use this syntax, regardless of whether you are using Lucene syntax or full Elasticsearch JSON syntax. And you must use this syntax when specifying the sort field in dotCMS viewtools (such as content pulls and the Elasticsearch Viewtool).
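
The naming rules above can be sketched in Python as follows. The `field_ref` helper and the `SYSTEM_PROPERTIES` set are hypothetical names introduced for this example; the set is simply the list of System Properties from this page, lower-cased for comparison.

```python
# System Properties listed on this page, lower-cased for comparison.
SYSTEM_PROPERTIES = {
    "moddate", "title", "contenttype", "basetype", "live", "working",
    "locked", "deleted", "languageid", "identifier", "inode", "conhost",
    "confolder", "parentpath", "path", "wfcreatedby", "wfassign", "wfstep",
    "wfmoddate", "pubdate", "expdate", "urlmap", "categories",
}


def field_ref(content_type_var: str, field_var: str) -> str:
    """Return the name to use for a field in a query or sort clause.

    System Properties are referenced by name only; every other field is
    prefixed with the Content Type variable name and a period.
    """
    if field_var.lower() in SYSTEM_PROPERTIES:
        return field_var
    return f"{content_type_var}.{field_var}"


print(field_ref("calendarEvent", "startDate"))  # calendarEvent.startDate
print(field_ref("calendarEvent", "title"))      # title
```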

Note: If a field is not indexed, you may still attempt to query it (the query will not return an error). However the query will not match any values for the non-indexed field, so the query will most likely return no results.

Relationship Fields

When you create a Relationship between two Content Types, Elasticsearch indexes all instances of the Relationship (all Relationships between two individual content items) using the Relationship name as a field name and the identifier of the related content item as the value of the field. For more information, please see the following examples.

Note:

  • The following examples all demonstrate queries using Lucene Syntax.
    • The same basic method (using the Relationship name as a field name and the content identifier as the value) can be used with Elasticsearch JSON Syntax.
  • All of these queries can be run against the dotCMS starter site or demo site.

Find all content which is part of a specific Relationship:

+News-Comments:*

Find all content which is related to a specific content item through a specific Relationship:

+News-Comments:7a3d042f-aae4-4e60-8385-7fc5320f572f

Find all content which is related to a specific content item through any Relationship:

  • Method 1: Specify the Names of All Relationships:
    +(News-Comments:7a3d042f-aae4-4e60-8385-7fc5320f572f Parent_News-Child_Media:7a3d042f-aae4-4e60-8385-7fc5320f572f Parent_News-Child_Youtube:7a3d042f-aae4-4e60-8385-7fc5320f572f)
    
  • Method 2: Search All Fields (and exclude the content item itself from the results):
    +_all:7a3d042f-aae4-4e60-8385-7fc5320f572f -identifier:7a3d042f-aae4-4e60-8385-7fc5320f572f
    
    • Note that for some Content Types (such as Event), this method may pull additional content items which are related via another field that contains an identifier (other than a Relationship field).
      • In this case, additional query parameters must be added to filter out these other fields (e.g. -calendarEvent.disconnectedFrom:7a3d042f-aae4-4e60-8385-7fc5320f572f for the Event Content Type).
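
The query patterns above can be assembled programmatically. The Python sketch below only builds the Lucene query strings shown in the examples; the function names are hypothetical, and nothing here sends a query to dotCMS.

```python
def related_via(relationship: str, identifier: str = "") -> str:
    """Content in a Relationship; any related item if no identifier is given."""
    return f"+{relationship}:{identifier or '*'}"


def related_via_named(relationships: list[str], identifier: str) -> str:
    """Method 1: list every Relationship name explicitly."""
    terms = " ".join(f"{name}:{identifier}" for name in relationships)
    return f"+({terms})"


def related_via_all_fields(identifier: str, exclude_fields: tuple[str, ...] = ()) -> str:
    """Method 2: search all fields, excluding the item itself and,
    optionally, other fields that also hold identifiers."""
    parts = [f"+_all:{identifier}", f"-identifier:{identifier}"]
    parts += [f"-{field}:{identifier}" for field in exclude_fields]
    return " ".join(parts)


item = "7a3d042f-aae4-4e60-8385-7fc5320f572f"
print(related_via("News-Comments"))
print(related_via_named(["News-Comments", "Parent_News-Child_Media"], item))
print(related_via_all_fields(item, ("calendarEvent.disconnectedFrom",)))
```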

Special Fields

The following fields do not exist in all Content Types, but have special uses within dotCMS. If you wish to use these fields, you must manually add them to a Content Type; once you add them, they enable additional dotCMS features which rely on the existence of these fields.

| Field | Type | Description |
|-------|------|-------------|
| tags | string | You must create a tag field to be able to associate tags with a piece of content. Note that the tag field is very important for many Personalization features. |
| latlong | string | To enable geolocation queries on any Content Type, you must create a text field on that content type with the specific Velocity variable name latlong. Note: To do this, you must name the field “Latlong”; you may change the field name after you have saved it once using this name. This field takes a string value of latitude and longitude separated by a comma (e.g. “42.648899,-71.165497”). |
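
As an illustration of the latlong value format, the following Python sketch parses and range-checks a "latitude,longitude" string before it is saved to the field. The `parse_latlong` helper is hypothetical, and the range checks are the usual geographic limits rather than documented dotCMS validation.

```python
def parse_latlong(value: str) -> tuple[float, float]:
    """Parse a "latitude,longitude" string into two floats.

    Raises ValueError if the value is not two comma-separated numbers
    or if either coordinate is outside the usual geographic range.
    """
    lat_text, lon_text = value.split(",")
    lat, lon = float(lat_text), float(lon_text)
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return lat, lon


print(parse_latlong("42.648899,-71.165497"))
```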

File Metadata

In dotCMS Enterprise edition, file contents and metadata are also indexed and searchable via ElasticSearch. Metadata can be queried using the pattern {content type variable name}.metadata.{metadata field}, for example fileasset.metadata.contenttype:image/jpeg.

For more information on searching file metadata, please see the File Metadata documentation.

Raw Fields

Due to the way Elasticsearch indexes fields, once a field has been indexed it may not be possible to search the contents of the field in certain ways, or to properly sort based on the field. However for every field that is indexed and analyzed in Elasticsearch, dotCMS adds an additional field of the same name with _dotraw appended to it (for example, the raw version of the indexed news.byline field can be accessed as news.byline_dotraw). This field stores the “raw” value of the field, which is not analyzed by Elasticsearch. This enables you to query and sort based on the exact text of the field, even after it has been indexed.
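
As a sketch of how a raw field might be used for sorting, the Python example below builds an Elasticsearch JSON request body that sorts on the _dotraw version of a field. The helper names are hypothetical; the body uses the standard Elasticsearch `query_string` query and `sort` clause, and it is an assumption that your dotCMS version accepts a body of this shape.

```python
import json


def raw_field(field: str) -> str:
    """Return the name of the raw (non-analyzed) version of an indexed field."""
    return f"{field}_dotraw"


def sorted_query(lucene_query: str, sort_field: str, order: str = "asc") -> dict:
    """Build a request body that runs a Lucene-syntax query and
    sorts the results on the raw version of sort_field."""
    return {
        "query": {"query_string": {"query": lucene_query}},
        "sort": [{raw_field(sort_field): {"order": order}}],
    }


print(json.dumps(sorted_query("+contenttype:news", "news.byline"), indent=2))
```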

Query Performance

Please note that queries against Raw fields are much less efficient than queries against indexed fields, and may negatively impact performance (especially when querying against a large content store). Therefore it is recommended that you limit queries against Raw fields, using Raw fields only when strictly necessary.

Custom Sorting

Elasticsearch provides sophisticated methods to perform custom sorts based on almost any field or combination of fields. To learn more about Elasticsearch custom sorting capabilities, please see the Elasticsearch Sorting documentation.

In addition, you can perform customized sorting by creating custom fields which use Velocity code to construct customized search keys or tokens. This allows you to create relatively sophisticated custom sort capabilities using relatively simple Velocity code, without the need to learn more sophisticated Elasticsearch functionality. For more information on custom fields, please see the Custom Fields documentation.

Query Results and Permissions

dotCMS checks that a user has appropriate permissions for any content returned by an Elasticsearch query; results which the user does not have permission to view are not returned to that user.

Note that this means different users running the same ElasticSearch query may receive different results. Therefore, when troubleshooting queries and query results, make sure to take the user's permissions into account.