The system provides two types of search indexes for the content of website pages:
Pages – indexes the structured content of pages in the content tree, which includes the following page data:
The indexed content does NOT include the following:
Recommendation: Use indexes of the Pages type for sections of the website where the important content is stored in page fields, such as products or structured articles.
Pages crawler – directly parses the HTML output of pages on the live site, which allows the search to find any text located on pages. Crawler indexes can provide more accurate searches of page content than Pages indexes. However, building and updating crawler indexes may require more time and resources, particularly in the case of large indexes and complex pages.
Note: The crawler indexes pages based on the structure of the content tree in Xperience. Any pages without a representation in the content tree are NOT included (e.g. pages served by custom routes implemented only on the side of the live site application).
You can create multiple indexes and use the most suitable type for different sections of your website. The implementation of your site's search functionality can combine indexes of any type.
Specifying page index content
Note: Page indexes only cover pages that are published on the live site.
To specify which pages an index covers, define allowed or excluded content. The general approach is the same for both Pages and Pages crawler indexes.
- Open the Smart search application.
- Select the Local indexes tab.
- Edit the index.
- Select the Indexed content tab.
- Click Add allowed content or Add excluded content.
- Open the Sites tab and assign the websites where you wish to use the index.
- Switch to the Cultures tab and select which language versions of the website's pages are indexed.
  - At least one culture must be assigned for the index to be functional.
Allowed content defines which of the website's pages are included in the index. Excluded content removes pages or entire website sections from the allowed content. Specify pages using a combination of the following options:
- Path – path expression identifying the allowed or excluded pages.
- Page types – allows you to limit which page types are included or excluded.
Includes or excludes all pages on the site.
Includes or excludes the /Partners page, without the child pages placed under it.
Includes or excludes all pages of the DancingGoat.Article page type on the entire site.
Includes or excludes all pages of the DancingGoat.Coffee and DancingGoat.Grinder page types found under the /Products section.
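The way allowed and excluded content combine can be sketched in a few lines. The following Python snippet is an illustration only, not Xperience code. It assumes that path expressions use % as a wildcard for any sequence of characters (so /% would cover the whole site and /Products/% everything under /Products), and that an empty page type list means all page types. The names Page, ContentRule, path_matches and is_indexed are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    path: str        # alias path in the content tree, e.g. "/Products/Kenya"
    page_type: str   # page type code name, e.g. "DancingGoat.Coffee"


@dataclass(frozen=True)
class ContentRule:
    path: str                             # path expression; "%" is assumed to be a wildcard
    page_types: frozenset = frozenset()   # empty set is assumed to mean "all page types"


def path_matches(expression: str, path: str) -> bool:
    """Match a path against an expression, treating '%' as 'any characters'."""
    pattern = ".*".join(re.escape(part) for part in expression.split("%"))
    return re.fullmatch(pattern, path) is not None


def rule_matches(rule: ContentRule, page: Page) -> bool:
    if not path_matches(rule.path, page.path):
        return False
    return not rule.page_types or page.page_type in rule.page_types


def is_indexed(page: Page, allowed: list, excluded: list) -> bool:
    """A page is covered if some allowed rule matches it and no excluded rule does."""
    return (any(rule_matches(r, page) for r in allowed)
            and not any(rule_matches(r, page) for r in excluded))


# Allow the whole site, but exclude the /Partners page itself (not its children)
# and all grinders under /Products.
allowed = [ContentRule("/%")]
excluded = [
    ContentRule("/Partners"),
    ContentRule("/Products/%", frozenset({"DancingGoat.Grinder"})),
]
```

Under these assumptions, a child page such as /Partners/Acme stays in the index because the /Partners exclusion has no wildcard, while a DancingGoat.Grinder page under /Products is removed.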
Excluding individual pages from all indexes
You can also exclude specific pages from all smart search indexing:
- Open the Pages application.
- Select the given page in the content tree.
- In Edit mode, open the Properties -> General tab.
- Enable the Exclude from search property.
- Click Save.
The system also allows additional configuration of the indexed content, depending on the page index type:
Configuring Pages indexes
When configuring Allowed content for Pages search indexes, you can add the following additional content to the index:
- Include attachment content – if selected, the index includes the text content of files attached to the specified pages. See Searching attachment files for more information.
- Include categories – if selected, the index stores the display names of Categories assigned to the specified pages. This allows users to find pages that belong to categories whose name matches the search expression.
Page type fields
Pages are often complex data structures with many different fields, and some fields may not be relevant to the search that you are implementing. The search settings of page types allow you to adjust how the system indexes specific fields. We recommend indexing only the necessary fields to keep your indexes as small (and fast) as possible.
Pages crawler search indexes directly index the HTML output of pages and are not affected by the search field settings.
To edit the search field settings for a page type:
- Open the Page types application.
- Edit a page type.
- Open the Search fields tab.
The options in the top part of the tab allow you to configure how the system displays pages of the given type in search results. Note that the final appearance of the results always depends on your search interface implementation.
- Title field – select the page field whose value is used for the title of search results.
- Content field – the field whose value is used for the content extract of search results.
- Image field – the field that contains the image displayed in search results.
- Date field – the field whose value is used for the date and time displayed in search results.
The grid in the bottom section of the tab determines how the smart search indexes the page type's fields (as defined on the Fields tab).
For locally stored search indexes, only the options under the Local and General sections of the grid apply (to learn about Azure Search page indexes, see Creating Azure Search indexes). You can set the following search options for individual fields:
Content
If selected, the content of the field is indexed and searchable in the standard way. Within search indexes, the values of all fields with the Content option enabled are combined into a system field named _content. This field is used to find or filter matching search items, but is NOT suitable for reading and displaying human-readable information such as search result extracts.
For the purposes of standard search, Content fields are automatically tokenized by the analyzer of the search index in use.
Searchable
If selected, the field is stored separately within indexes and its content can be searched using expressions in the following format:
<field code name>:<searched phrase>
See Smart search syntax for more information about field searches.
Fields must be set as Searchable to be usable in search result filtering or ordering conditions.
Tokenized
Relevant for Searchable fields. Indicates if the content of the field is processed by the analyzer when indexing. This allows the search to find results that match individual tokens (subsets) of the field's value. If disabled, the search only returns items if the full value of the field exactly matches the search expression.
If a field has both the Content and Searchable options enabled, the Tokenized option only affects the content used for field searches (content is always automatically tokenized for the purposes of standard search).
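The interaction of the Content, Searchable and Tokenized options can be illustrated with a toy index. The Python sketch below is not how Xperience stores its indexes: the tokenize function is a simplistic stand-in for a real analyzer, matching is assumed to be case-insensitive, and the field names ArticleTitle and ProductCode are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class FieldSettings:
    content: bool = False
    searchable: bool = False
    tokenized: bool = False


def tokenize(text):
    """Simplistic stand-in for an index analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def index_page(fields, settings):
    """Build a toy index entry from page field values and their search settings."""
    entry = {"_content": [], "fields": {}}
    for name, value in fields.items():
        s = settings.get(name, FieldSettings())
        if s.content:
            # Content fields are always tokenized and merged into _content.
            entry["_content"].extend(tokenize(value))
        if s.searchable:
            # Searchable fields are stored separately; Tokenized decides
            # whether the value is split into tokens or kept whole.
            entry["fields"][name] = tokenize(value) if s.tokenized else [value.lower()]
    return entry


def standard_search(entry, word):
    return word.lower() in entry["_content"]


def field_search(entry, field_name, phrase):
    return phrase.lower() in entry["fields"].get(field_name, [])


entry = index_page(
    {"ArticleTitle": "Brewing Coffee at Home", "ProductCode": "CF-100 Dark"},
    {
        "ArticleTitle": FieldSettings(content=True, searchable=True, tokenized=True),
        "ProductCode": FieldSettings(searchable=True, tokenized=False),
    },
)
```

In this sketch, a field search for a single word of ArticleTitle succeeds because the field is tokenized, while ProductCode only matches when the whole value is supplied, and it never matches a standard search because its Content option is disabled.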
Custom search name
Relevant for Searchable fields. The specified value is used as a substitute for the field code name in <field code name>:<searched phrase> search expressions.
Note: If you enter a Custom search name value, the original field code name can no longer be used in search expressions.
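As a small illustration of how a custom search name changes field search expressions, the Python helper below builds expressions in the <field code name>:<searched phrase> format. It is a hypothetical helper for a search interface, not part of the Xperience API, and the field names and the custom name "title" are made up for the example.

```python
def field_expression(field_code_name, phrase, custom_search_names):
    """Build a field search expression, preferring the custom search name if one is set."""
    name = custom_search_names.get(field_code_name, field_code_name)
    return f"{name}:{phrase}"


# Hypothetical mapping: ArticleTitle has the Custom search name "title".
custom_names = {"ArticleTitle": "title"}

print(field_expression("ArticleTitle", "coffee", custom_names))    # title:coffee
print(field_expression("ArticleSummary", "coffee", custom_names))  # ArticleSummary:coffee
```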
After you Save changes to the field settings, you need to Rebuild all indexes that cover pages of the given type.
When running searches using page indexes, the system returns results according to the field search settings of individual page types. The page type search settings are shared by all page indexes in the system.
SKU (product) and general page fields
To configure the field search settings for E-commerce SKUs (products):
Warning: It is highly recommended to modify only the settings of custom SKU fields. Changing the settings of the default fields may prevent the system from searching through products correctly.
- Open the Modules application.
- Edit the E-commerce module.
- Open the Classes tab.
- Edit the SKU class.
- Select the Search tab.
- Click Customize.
You can configure the search settings for fields just like for page types. The SKU fields are joined together with general page fields, such as fields that store the page name and metadata.
Important: The search settings of general fields affect all pages, even those that are not products.
Configuring Pages crawler indexes
Pages crawler search indexes read the content of pages while signed in under a user account. You can configure the user for every Pages crawler index (on the General tab of the index editing interface):
Sets the user account that the crawler uses to index pages. Reading pages under a user allows the crawler to:
If empty, the index uses the user account specified in Settings -> System -> Default user ID (or the default administrator user account if the setting is empty).
If you wish to assign a user to your search indexes, we recommend creating a dedicated service account with the appropriate permissions (not an account representing an actual live site user or editor).
Customizing how crawlers process page content (API)
By default, the system converts the HTML output of pages to plain text before saving it to Pages crawler indexes:
- Strips all HTML tags
- Converts all whitespace formatting to simple spaces
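The two default steps above can be approximated as follows. This Python snippet is a rough sketch of the described behavior, not the actual Xperience implementation. It assumes that each tag is replaced by a space so that text from adjacent elements does not run together, and the function name html_to_plain_text is hypothetical.

```python
import re


def html_to_plain_text(page_html):
    """Approximate the default conversion: strip tags, then normalize whitespace."""
    # Replace every tag with a space (assumption: keeps neighboring words apart).
    without_tags = re.sub(r"<[^>]+>", " ", page_html)
    # Collapse line breaks, tabs and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", without_tags).strip()


sample = "<h1>Coffee</h1>\n<p>Freshly   roasted\tbeans</p>"
print(html_to_plain_text(sample))  # Coffee Freshly roasted beans
```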
If you wish to index the content of any tags or exclude parts of the page output, you can customize how the crawlers process the HTML:
- Create a custom Class Library project (assembly).
- Assign a handler to the OnHtmlToPlainText event of the CMS.Search.SearchCrawler class.
  - This event occurs whenever a page search crawler processes the HTML output of a page.
  - To assign a handler method to the event, create a custom module class and override its OnInit method.
- Deploy the custom code to both your live site and Xperience administration projects. See Applying customizations in the Xperience environment.
For example, you can define the content of the class as shown below:
The OnHtmlToPlainText event provides the following string parameters to the handler:
- plainText – the page output already stripped of all tags and converted to plain text.
- originalHTML – the raw page HTML code without any modifications.
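To show what a handler might do with these two parameters, here is a language-neutral sketch of the logic in Python. It is not the Xperience handler itself, which would be a method in your custom module subscribed to the OnHtmlToPlainText event. The sketch assumes the handler returns the text that should be stored in the index, and it excludes the content of nav and footer elements as an arbitrary example. The regular expression is deliberately simple and does not handle nested elements of the same type.

```python
import re

# Elements whose content should not be indexed (an arbitrary choice for this example).
EXCLUDED_BLOCKS = re.compile(r"<(nav|footer)\b[^>]*>.*?</\1>", re.IGNORECASE | re.DOTALL)


def on_html_to_plain_text(plain_text, original_html):
    """Return the text to index for one page (assumed handler contract)."""
    if not EXCLUDED_BLOCKS.search(original_html):
        # Nothing to exclude: keep the default plain text conversion.
        return plain_text
    # Drop the excluded blocks from the raw HTML, then convert the rest to plain text.
    trimmed = EXCLUDED_BLOCKS.sub(" ", original_html)
    without_tags = re.sub(r"<[^>]+>", " ", trimmed)
    return re.sub(r"\s+", " ", without_tags).strip()


page = ("<nav><a href='/'>Home</a></nav>"
        "<main><p>Kenya  AA</p></main>"
        "<footer>Contact us</footer>")
print(on_html_to_plain_text("Home Kenya AA Contact us", page))  # Kenya AA
```

Starting from originalHTML rather than plainText is what makes this kind of exclusion possible, because the plain text no longer carries any information about which element a piece of text came from.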