Site Search documentation for the dotCMS Content Management System

 

dotCMS Site Search utilizes ElasticSearch to create independent indexes for different groups of content on your Site. Each of the indexing processes can be scheduled to run at different times, and any content can be indexed and searched in one or more separate or overlapping indices.


Note:

  • Site Search is an Enterprise Edition feature only.
    • If you are using dotCMS Community Edition, click on “Try Enterprise Now” link your dotCMS installation to preview/purchase the dotCMS Enterprise Edition.
  • Site Search indices are separate from the standard content index used to search for and access content with standard content pulls and through the dotCMS back-end.

What is and is not Included in a Site Search Index

The Site Search indexes all Pages, Files and URL Mapped content which match the specified locations for the Site Search indexing job. The Site Search does not crawl your site. Instead it searches the content repository on the speicified site(s) and serializes the content to make it searchable. The following limitations apply to Site Search indices:

  • Only content of a Content Types with a URL Map are added to the Site Search index.
  • Only text which is either displayed on a Page or within a content item or File OR included in the meta keywords HTML tag are indexed.
    • Text which is included within HTML tags or tag properties (e.g. “style” properties) are not indexed.
    • Content fields which are not displayed on a Page are not indexed.
      • This means, for example, that Tags applied to the content are not searcheable.
    • To make any undisplayed fields on content searchable with the Site Search index, you may either:
      • Explicitly display the value of the fields on the page, or
      • Add the appropriate text to the <meta name="keywords"> tag on the page where the content is displayed.
    • For URL Mapped content, any changes made to include additional content in the index must be done on the Detail Page for the Content Type.
  • Site Search results are NOT permissioned.
    • Site Search will only index Pages, Files or URL Mapped content which are viewable by the CMS Anonymous Role.
    • For this reason, Site Search cannot be used to index or search non-public pages (pages requiring a site login to access).

Meta Keywords Tag

You may specify keywords to be indexed by including them in the HTML <meta name="keywords"> tag, as in the following example:

<meta name="keywords" content="$keywords">

The keywords “content” property can contain any number of keywords, separated by spaces and/or punctuation. The contents of the “content” property will be parsed by Elasticsearch in the same way all other content fields are parsed (including indexing words separated by spaces as separate terms, regardless of any punctuation used).

When searching a Site Search index, you can explicitly reference any content included in the keywords meta field by specifying the keywords field in the query, as in the following example:

#set($searchresults = $sitesearch.search("+keywords:*invest*", 0, 10))

Creating a New Site Search Index

To create a new Site Search Index:

  1. Select Dev Tools -> Site Search.
  2. Select the Indices tab.
    The Indices tab shows you all current Site Search indexes
  3. Click CREATE SITE SEARCH INDEX. In the popup:
    • Enter an Alias (friendly name) for the index.
    • Enter the number of Shards for the index.
      • The number of shards is used to tune Elasticsearch performance. If you are not sure the number of Shards to use, leave the default value.
        The Site Search index popup allows you to set the Alias name and the Number of Shards
  4. Press the CREATE SITE SEARCH INDEX button.

Populating the Index

To run or schedule an indexing process, select the Job Scheduler tab.

The Job Scheduler tab shows all scheduled Site Search jobs

On this page a new job can be either run now or scheduled. Enter the following fields to create a new job:

PropertyDefault ValueDescription
RunScheduledA job set to run Now will cause content to be indexed immediately when the SCHEDULE button is clicked.
A Scheduled job will be set to run automatically and repeatedly on a schedule you specify via the Cron Expression property.
NameNoneThe name that will be displayed for the job in the View All Jobs tab.
HostsNoneThe host(s) to index.
  • Only live Sites can be indexed.
  • To index all Sites on your dotCMS instance, either add them all to the list, or check the Index All Sites checkbox.
  • Important: If you want to index Content Types that reside on the System Host and do not have a Site or Folder field, you must check Index all Hosts.
Alias NameCreate New Index and Make DefaultSelect the Alias of the index where the content will be added.
  • If you select Create New Index and Make Default, a new Site Search index will be created each time the job is run.
  • These automatically created indexes will have a new, randomly generated name each time they are created, but can be referenced by specifying the “default” index.
IncrementalUnchecked
  • If this box is checked then only changes made to content since the last index will be added to the index (including any new content and any edits to existing content).
  • If this box is not checked, then the index will be deleted and rebuilt from scratch each time (making the indexing take longer, but improving the search performance).
Important:
  • Incremental indexing will only remove content from the index if that content has been unpublished or archived.
  • If you have deleted content from your site and you select Incremental, the deleted content will not be removed from the index.
    • You must perform a full reindex to ensure deleted content is removed from the index.
LanguageNoneCheck the box for each content Language to include in the index.
  • You must include at least one Language, or the Site Search index will be empty.
Paths TypeAll 
Paths ListNoneComma separated list of paths to include or exclude.
  • This field will only be active if the Paths Type is set to Include or Exclude.
  • Paths can contain wildcards, e.g. /home/*
Cron ExpressionNoneSpecifies when and how often the job will run (and thus when and how often the Site Search index will be updated).
  • Any changes to your content, Pages, and Files will not be included in the Site Search index until the job is run, so frequency of the job will determine how up-to-date your Site Search index is.
  • Some examples of Cron Expressions for different job schedules and times are displayed below the field.

Viewing Scheduled Jobs

To view all currently scheduled Site Search index jobs, select the View All Jobs tab.

  • When a job is running, progress of the job will be displayed to the left of the Job Name.
  • If a job was created with the Run field set to Now, it will display in the jobs list with the name runningOnce.
    • You may only have one runningOnce job at a time.
      • If you run a second indexing job with the Run field set to Now before the previous runningOnce job completes, the previous job will be aborted and the indexing from the previous job will not be completed.
    • When the runningOnce job finishes it will disappear from the list.

The View All Jobs tab displays all jobs which are either scheduled to run in the future, or are currently running

Performing Searches Against a Site Search Index

Searching an Index in the Back-End

You may perform searches against a Site Search index in the dotCMS back-end to validate that contents of the index. To select an index to search, do one of the following:

  • Click on the index from the Indices tab.
  • Select the Search tab, and select the name of the index from the drop-down list.

To search the index, enter the search terms in the field to the right of the index name and click the SEARCH button.

The search results display the Score, Title, Author, Content Length, Url, Uri, Mime Type and Filename. At the bottom of the page, the time required to perform the search is displayed, to help you understand and tune the performance of your Site Search index.

For more information on how you can query within your Site Search indexes, please see the SiteSearch Viewtool documentation.

Front-end Site Searches

To perform searches on a Site Search index from the front-end of your site, you must use the SiteSearch viewtool, which allows you to specify the search terms and the Site Search index to search against.

Searches against the Site Search index from the front-end of a site are typically done by creating a form which allows the user to enter their search terms, and then uses the SiteSearch Viewtool to pass the user's search terms to the appropriate Site Search index.

Since you can choose which index to search for each form, you can create different Site Search indexes for different portions of your site. For example, you can have one site search index for the documentation section of your site, another for the support section, another for the products section, and yet another that includes content from all sections of your site in a single index.

For more information and examples of how to use the viewtool, please see the SiteSearch viewtool documentation.

Job Audit Data

The Job Audit Data tab can be used to help understand and tune the performance of your Site Search indexing jobs. It contains a drop-down list that includes all the jobs which will be executed at least one more time in the future (based on the job's Cron Expression).

Note: Jobs which are scheduled to run Now, jobs which have been deleted, and jobs which have a Cron Expression that will not ever run again in the future will not be displayed here. To view a job in this tab you must schedule the job as a repeating Scheduled job.

To display the audit data for a job, select it from the list and click the LOAD button. All the job information for previous runs of that job will be displayed, with one row for each previous job execution.

Choosing Which Content to Index on a Page

When a Site Search indexing job is run, dotCMS sets the User-Agent header in the HTTP request to DOTCMS-SITESEARCH, to distinguish the indexing job from normal site traffic. This enables you to change how your page is displayed and/or what code and tags are included on the page during Site Search indexing, which allows you to do things like exclude site search indexing from your site analytics and choose which content to display for the Site Search indexing for content whose display varies (such as content which is displayed differently based on Personalization).

The sample code below shows you how to set up a portion of your Page or content to be excluded when the Site Search index is running:

#if($request.getAttribute("User-Agent")!="DOTCMS-SITESEARCH")   
    ## Custom code...   
#end

For more information on how dotCMS sets the “User-Agent” property to enable you to change how your content is displayed in different circumstances, please see the Request, Response, and Session documentation.