Defining local page indexes

The system provides two types of search indexes for the content of website pages:

Pages

Indexes the structured content of pages in the content tree, which includes the following page data:

  • The values of page fields, according to the search field settings of each page type (including general page fields, such as the page name and metadata).
  • Optionally, the text content of page attachments and the names of assigned categories (based on the index's Allowed content settings).

The indexed content does NOT include the following:

  • Text added and displayed through page builder widgets.
  • Text that is displayed in the resulting output but is not stored within the indexed page (for example, content added in the page's code in the live site application, or content displayed from other Xperience pages or objects).

Recommendation: Use indexes of the Pages type for sections of the website where the important content is stored in page fields, such as products or structured articles.

Pages crawler

Directly parses the HTML output of pages on the live site, which allows the search to find any text located on pages. Crawler indexes can provide more accurate searches of page content than Pages indexes. However, building and updating crawler indexes may require more time and resources, particularly in the case of large indexes and complex pages.

Note: The crawler indexes pages based on the structure of the content tree in Xperience. Any pages without a representation in the content tree are NOT included (e.g. pages served by custom routes implemented only on the side of the live site application).

Recommendation: Use indexes of the Pages crawler type on sites that use content tree-based routing, particularly for sections with pages created using the page builder.

You can create multiple indexes and use the most suitable type for different sections of your website. The implementation of your site's search functionality can combine indexes of any type.

Specifying page index content

Note: Page indexes only cover pages that are published on the live site.

To specify which pages an index covers, define allowed or excluded content. The general approach is the same for both Pages and Pages crawler indexes.

  1. Open the Smart search application.
  2. Select the Local indexes tab.
  3. Edit the index.
  4. Select the Indexed content tab.
  5. Click Add allowed content or Add excluded content.
  6. Open the Sites tab and assign the websites where you wish to use the index.
  7. Switch to the Cultures tab and select which language versions of the website's pages are indexed.
    • At least one culture must be assigned in order for the index to be functional.

Specifying allowed or excluded content for a page index

Allowed content defines which of the website's pages are included in the index. Excluded content removes pages or entire website sections from the allowed content. Specify pages using a combination of the following options:

  • Path – path expression identifying the allowed or excluded pages.
  • Page types – allows you to limit which page types are included or excluded.

Examples

  • Path: /%, Page types: empty – Includes or excludes all pages on the site.
  • Path: /Partners, Page types: empty – Includes or excludes the /Partners page, without the child pages placed under it.
  • Path: /%, Page types: DancingGoat.Article – Includes or excludes all pages of the DancingGoat.Article page type on the entire site.
  • Path: /Products/%, Page types: DancingGoat.Coffee;DancingGoat.Grinder – Includes or excludes all pages of the DancingGoat.Coffee and DancingGoat.Grinder page types found under the /Products section.

Excluding individual pages from all indexes

You can also exclude specific pages from all smart search indexing:

  1. Open the Pages application.
  2. Select the given page in the content tree.
  3. In Edit mode, open the Properties -> General tab.
  4. Enable the Exclude from search property.
  5. Click Save.
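
If you need to exclude pages programmatically (for example, in bulk), the same property can be set through the API. The following is a minimal sketch, assuming a hypothetical page path, site name, and culture; the Exclude from search checkbox is stored in the page's DocumentSearchExcluded field:

using System.Linq;

using CMS.DocumentEngine;

public class SearchExclusionExample
{
	public static void ExcludePageFromSearch()
	{
		// Retrieves the page (the path, site, and culture values are placeholders)
		TreeNode page = DocumentHelper.GetDocuments()
			.Path("/Articles/Example-article")
			.OnSite("DancingGoat")
			.Culture("en-US")
			.TopN(1)
			.FirstOrDefault();

		if (page != null)
		{
			// Corresponds to the 'Exclude from search' property in the Pages application
			page.DocumentSearchExcluded = true;

			// Saves the change to the page
			page.Update();
		}
	}
}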

The system also allows additional configuration of the indexed content, depending on the page index type:

Configuring Pages indexes

When configuring Allowed content for Pages search indexes, you can add the following additional content to the index:

  • Include attachment content – if selected, the index includes the text content of files attached to the specified pages. See Searching attachment files for more information.
  • Include categories – if selected, the index stores the display names of Categories assigned to the specified pages. This allows users to find pages that belong to categories whose name matches the search expression.

Page type fields

Pages are often complex data structures with many different fields. Not all fields may be relevant to the search that you are implementing. Page types allow you to adjust how the system indexes specific fields. We recommend indexing only necessary fields to keep your indexes as small (and fast) as possible.

Pages crawler search indexes directly index the HTML output of pages and are not affected by the search field settings.

To edit the search field settings for page types:

  1. Open the Page types application.
  2. Edit a page type.
  3. Open the Search fields tab.

The options in the top part of the tab allow you to configure how the system displays pages of the given type in search results. Note that the final appearance of the results always depends on your search interface implementation.

  • Title field – select the page field whose value is used for the title of search results.
  • Content field – the field whose value is used for the content extract of search results.
  • Image field – the field that contains the image displayed in search results.
  • Date field – the field whose value is used for the date and time displayed in search results.

The grid in the bottom section of the tab determines how the smart search indexes the page type's fields (as defined on the Fields tab).

For locally stored search indexes, only the options under the Local and General sections of the grid apply (to learn about Azure Search page indexes, see Creating Azure Search indexes). You can set the following search options for individual fields:

Content

If selected, the content of the field is indexed and searchable in the standard way. Within search indexes, the values of all fields with the Content option enabled are combined into a system field named _content (this field is used to find or filter matching search items, but is NOT suitable for reading and displaying human-readable information such as search result extracts).

For the purposes of standard search, Content fields are automatically tokenized by the analyzer of the used search index.

Searchable

If selected, the field is stored separately within indexes and its content can be searched using expressions in the following format:

<field code name>:<searched phrase>

See Smart search syntax for more information about field searches.

Fields must be set as Searchable to be usable in search result filtering or ordering conditions.

Tokenized

Relevant for Searchable fields. Indicates if the content of the field is processed by the analyzer when indexing. This allows the search to find results that match individual tokens (subsets) of the field's value. If disabled, the search only returns items if the full value of the field exactly matches the search expression.

If a field has both the Content and Searchable options enabled, the Tokenized option only affects the content used for field searches (content is always automatically tokenized for the purposes of standard search).
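
For example, assuming a page type with a hypothetical Searchable field named ArticleSummary, expressions such as the following can be used:

articlesummary:espresso

+_content:coffee +articlesummary:espresso

The first expression matches items by the ArticleSummary field alone (by individual tokens or by exact value, depending on the field's Tokenized setting). The second combines a standard search over the _content field with a field condition.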

Custom search name

Relevant for Searchable fields. The specified value is used as a substitute for the field code name in <field code name>:<searched phrase> search expressions.

Note: If you enter a Custom search name value, the original field code name can no longer be used in field search expressions.

Configuring a page type's search field settings for locally stored indexes

After saving changes to the field settings, you need to Rebuild all indexes that cover pages of the given type.

When running searches using page indexes, the system returns results according to the field search settings of individual page types. The page type search settings are shared by all page indexes in the system.
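
The following is a minimal sketch of such a search through the API, based on the SearchParameters and SearchHelper classes of the smart search API. The index name, culture, and query values are placeholders:

using System;

using CMS.DataEngine;
using CMS.Membership;
using CMS.Search;

public class PageSearchExample
{
	public static void SearchPages()
	{
		// Prepares the search parameters
		SearchParameters parameters = new SearchParameters
		{
			// Searched against the combined _content field; field search expressions can be appended,
			// e.g. "coffee +articlesummary:espresso" (assuming a hypothetical Searchable field)
			SearchFor = "coffee",
			// Sorts the results by relevance
			SearchSort = "##SCORE##",
			Path = "/%",
			CurrentCulture = "en-US",
			CombineWithDefaultCulture = false,
			CheckPermissions = false,
			SearchInAttachments = false,
			User = MembershipContext.AuthenticatedUser,
			// Code name of the local page index (placeholder)
			SearchIndexes = "MySite.PagesIndex",
			StartingPosition = 0,
			DisplayResults = 20,
			NumberOfProcessedResults = 100,
			NumberOfResults = 0
		};

		// Runs the search
		SearchResult result = SearchHelper.Search(parameters);

		// Processes the results (Title and Content are taken from the fields mapped on the Search fields tab)
		foreach (SearchResultItem item in result.Items)
		{
			Console.WriteLine($"{item.Title}: {item.Content}");
		}
	}
}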

SKU (product) and general page fields

To configure the field search settings for E-commerce SKUs (products):

Warning: It is highly recommended to modify only the settings of custom SKU fields. Changing the settings of the default fields may prevent the system from searching through products correctly.

  1. Open the Modules application.
  2. Edit the E-commerce module.
  3. Open the Classes tab.
  4. Edit the SKU class.
  5. Select the Search tab.
  6. Click Customize.

You can configure the search settings for fields just like for page types. The SKU fields are joined together with general page fields, such as the fields that store the page name and metadata.

Important: The search settings of general fields affect all pages, even those that are not products.

Configuring Pages crawler indexes

Page crawler search indexes read the content of pages while signed in under a user account. You can configure the user for every page crawler index (on the General tab of the index editing interface):

User

Sets the user account that the crawler uses to index pages. Reading pages under a user allows the crawler to:

  • Load user-personalized content for the given user
  • Avoid indexing pages that the user is not allowed to access

If empty, the index uses the user account specified in Settings -> System -> Default user ID (or the default administrator user account if the setting is empty).

If you wish to assign a user to your search indexes, we recommend creating a dedicated service account with the appropriate permissions (not an account representing an actual live site user or editor).

Customizing how crawlers process page content (API)

By default, the system converts the HTML output of pages to plain text before saving it to page crawler indexes:

  • Strips all HTML tags
  • Removes the Head tag, Style tags and all JavaScript
  • Converts all whitespace formatting to simple spaces

If you wish to index the content of any tags or exclude parts of the page output, you can customize how the crawlers process the HTML:

  1. Create a custom Class Library project (assembly).
  2. Assign a handler to the OnHtmlToPlainText event of the CMS.Search.SearchCrawler class.
    • This event occurs whenever a page search crawler processes the HTML output of a page.
    • To assign a handler method to the event, create a custom module class and override its OnInit method.
  3. Deploy the custom code to both your live site and Xperience administration projects. See Applying customizations in the Xperience environment.

For example, you can define the content of the class as shown below:

using System.Web;

using CMS;
using CMS.DataEngine;
using CMS.Search;
using CMS.Helpers;

// Registers the custom module into the system
[assembly: RegisterModule(typeof(CustomSearchCrawlerModule))]

public class CustomSearchCrawlerModule : Module
{
	// Module class constructor, the system registers the module under the name "CustomSearchCrawler"
	public CustomSearchCrawlerModule()
		: base("CustomSearchCrawler")
	{
	}    

	// Contains initialization code that is executed when the application starts
	protected override void OnInit()
	{
		base.OnInit();

		// Assigns a handler for the OnHtmlToPlainText event
		SearchCrawler.OnHtmlToPlainText += new SearchCrawler.HtmlToPlainTextHandler(SearchHelper_OnHtmlToPlainText);
	}

	// Add your custom HTML processing actions and return the result as a string
	static string SearchHelper_OnHtmlToPlainText(string plainText, string originalHtml)
	{
		string outputResult = originalHtml;

		// Removes new line entities
		outputResult = outputResult.Replace("\n", " ");

		// Removes tab spaces
		outputResult = outputResult.Replace("\t", " ");

		// Removes JavaScript
		outputResult = HTMLHelper.RegexHtmlToTextScript.Replace(outputResult, " ");

		// Removes tags
		outputResult = HTMLHelper.RegexHtmlToTextTags.Replace(outputResult, " ");

		// Decodes HTML entities
		outputResult = HttpUtility.HtmlDecode(outputResult);

		return outputResult;
	}
}

The OnHtmlToPlainText event provides the following string parameters to the handler:

  • plainText – the page output already stripped of all tags and converted to plain text.
  • originalHtml – the raw page HTML code without any modifications.
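
For instance, if you only need to keep specific parts of the page output out of crawler indexes, the handler can strip marked regions from the raw HTML and then run a standard conversion. The sketch below assumes a hypothetical convention where templates wrap non-indexable markup in <!--NOINDEX--> ... <!--/NOINDEX--> comments, and uses the HTMLHelper.HtmlToPlainText helper for the conversion:

using System.Text.RegularExpressions;

using CMS.Helpers;

public static class CrawlerContentFilter
{
	// Matches content wrapped in the hypothetical NOINDEX comment markers
	private static readonly Regex NoIndexRegex = new Regex(
		"<!--NOINDEX-->.*?<!--/NOINDEX-->",
		RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

	// Handler compatible with the SearchCrawler.OnHtmlToPlainText event
	public static string RemoveMarkedRegions(string plainText, string originalHtml)
	{
		// Removes the marked regions from the raw HTML output
		string filteredHtml = NoIndexRegex.Replace(originalHtml, " ");

		// Converts the remaining HTML to plain text
		return HTMLHelper.HtmlToPlainText(filteredHtml);
	}
}

You would then assign the handler in your module's OnInit method, in the same way as in the example above: SearchCrawler.OnHtmlToPlainText += CrawlerContentFilter.RemoveMarkedRegions;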
