Defining document indexes

You can use two types of search indexes for the pages of websites (i.e. documents in the content tree):

Documents

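Index the data stored in the fields of documents, according to the search settings of the given document types (see Configuring search settings for document fields).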

Note: Documents indexes do NOT include the text of other documents or objects displayed through web parts (such as the content of News documents displayed through a Repeater web part).

Documents crawler

Directly parse the HTML output generated by documents, which allows the search to find any text located on pages. Crawler indexes provide more accurate searches of page content than standard Documents indexes. However, building and updating crawler indexes may require more time and resources, particularly in the case of large indexes and complex documents.

See also: Configuring document crawler indexes

Note: Documents indexes only cover documents that are published on the live site.

To define which documents an index covers, specify allowed or excluded content:

  1. Open the Smart search application.
  2. Edit the index.
  3. Select the Indexed content tab.
  4. Click Add allowed content or Add excluded content.
  5. Open the Sites tab and assign the websites where you wish to use the index.
  6. Switch to the Cultures tab and select which language versions of the website’s documents are indexed.
    • You must assign at least one culture in order for the index to be functional.
    • If you have a multi-site index, you can select the cultures separately for each site.

Specifying allowed or excluded content for a document index

Adding allowed content

Allowed content defines which of the website’s documents are included in the index. Specify documents using a combination of the following options:

  • Path - path expression identifying the documents that should be indexed.
  • Document types - allows you to limit which document types are included in the index.

The following properties define types of additional content that you can include in Documents search indexes. The settings are not available for Documents crawler indexes:

  • Include ad-hoc forums - includes the content of ad-hoc forums placed on the specified documents (if there are any).
  • Include blog comments - includes blog comments posted for blog post documents.
  • Include message boards - includes message boards placed on the specified documents.
  • Include attachment content - if checked, the index includes the text content of files attached to the specified documents. See Searching attachment files for more information.
  • Include categories - if checked, the index stores the display names of Categories assigned to the specified documents. This allows users to find documents that belong to categories whose name matches the search expression.

Examples

  • Path: /%
  • Document types: empty

Result: Indexes all documents on the site.

  • Path: /Partners
  • Document types: empty

Result: Only indexes the /Partners page, without the child pages placed under it.

  • Path: /%
  • Document types: CMS.News

Result: Indexes all documents of the CMS.News document type on the entire site.

  • Path: /Products/%
  • Document types: CMS.Smartphone;CMS.Laptop

Result: Indexes all documents of the CMS.Smartphone and CMS.Laptop document types found under the /Products section.
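The way a path expression and a document type list combine can be sketched in a few lines of C#. This is only an illustration of the rules shown in the examples, not the system's actual implementation. The AllowedContentSketch class and its methods are hypothetical names, and the sketch assumes that % stands for any sequence of characters and that an empty document type list means "all types".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the Kentico API.
public static class AllowedContentSketch
{
    // Returns true if the alias path matches the path expression.
    // Assumption: '%' stands for any sequence of characters (including none).
    public static bool PathMatches(string aliasPath, string pathExpression)
    {
        string pattern = "^" + Regex.Escape(pathExpression).Replace("%", ".*") + "$";
        return Regex.IsMatch(aliasPath, pattern);
    }

    // Returns true if a document with the given path and type is covered
    // by one "Path + Document types" setting.
    public static bool IsCovered(string aliasPath, string documentType,
                                 string pathExpression, string documentTypes)
    {
        if (!PathMatches(aliasPath, pathExpression))
        {
            return false;
        }

        // Assumption: an empty list places no restriction on the document type
        if (string.IsNullOrWhiteSpace(documentTypes))
        {
            return true;
        }

        return documentTypes
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Any(t => string.Equals(t.Trim(), documentType, StringComparison.OrdinalIgnoreCase));
    }
}
```

For example, IsCovered("/Partners/Gold", "CMS.MenuItem", "/Partners", "") returns false under these assumptions, because a path without the % wildcard only matches the page itself.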

Adding excluded content

Excluded content allows you to remove documents or entire website sections from the allowed content. For example, if you allow /% and exclude /Special-pages/% at the same time, the index will include all documents on the site except for the ones found under the /Special-pages node.

You can specify the following options:

  • Path - path expression identifying the documents that should be excluded.
  • Document types - allows you to limit which document types are excluded from the index.

Examples

  • Path: /Partners
  • Document types: empty

Result: Excludes the /Partners page from the index. Child pages are not excluded.

  • Path: /%
  • Document types: CMS.News

Result: Excludes all documents of the CMS.News document type from the index.

  • Path: /Products/%
  • Document types: CMS.Smartphone;CMS.Laptop

Result: Excludes all documents of the CMS.Smartphone and CMS.Laptop document types found under the /Products section from the index.
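The interaction between allowed and excluded content can be pictured as "matches at least one allowed rule and no excluded rule". The following C# sketch models that reading. It is not the system's implementation: ContentRule and IndexScopeSketch are hypothetical names, and the % wildcard is assumed to match any sequence of characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical types for illustration only; not part of the Kentico API.
public class ContentRule
{
    public string Path { get; set; }

    // Semicolon-separated list; empty is assumed to mean "all document types"
    public string DocumentTypes { get; set; }

    public bool Matches(string aliasPath, string documentType)
    {
        // Assumption: '%' in the path stands for any sequence of characters
        string pattern = "^" + Regex.Escape(Path).Replace("%", ".*") + "$";
        if (!Regex.IsMatch(aliasPath, pattern))
        {
            return false;
        }

        if (string.IsNullOrWhiteSpace(DocumentTypes))
        {
            return true;
        }

        return DocumentTypes.Split(';')
            .Any(t => t.Trim().Equals(documentType, StringComparison.OrdinalIgnoreCase));
    }
}

public static class IndexScopeSketch
{
    // A document is indexed if some allowed rule matches it and no excluded rule does
    public static bool IsIndexed(string aliasPath, string documentType,
                                 IEnumerable<ContentRule> allowed,
                                 IEnumerable<ContentRule> excluded)
    {
        return allowed.Any(r => r.Matches(aliasPath, documentType))
            && !excluded.Any(r => r.Matches(aliasPath, documentType));
    }
}
```

With an allowed rule of /% and an excluded rule of /Special-pages/%, the sketch reports /Special-pages/Hidden as not indexed and any page outside that section as indexed, which mirrors the example given at the start of this section.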

Excluding individual documents from all indexes

You can also exclude specific documents from all smart search indexes:

  1. Open the Pages application.
  2. Select the given document in the content tree.
  3. In Edit mode, open the Properties -> Navigation tab.
  4. Enable the Exclude from search property.
  5. Click Save.

Configuring search settings for document fields

Documents are often complex data structures with many different fields, and not every field is relevant to the search that you are implementing. Document types allow you to adjust how the system indexes specific fields. We recommend indexing only the necessary fields to keep your indexes as small (and fast) as possible.

Documents crawler search indexes directly index the HTML output of documents. As a result, crawler indexes are not affected by the field settings of document types.

To edit the field search settings for document types:

  1. Open the Document types application.
  2. Edit a document type.
  3. Open the Search fields tab.

In the top part of the tab, configure how the system displays documents of the given type in search results:

  • Title field - select the document field whose value is used for the title of search results.
  • Content field - the field whose value is used for the content extract of search results.
  • Image field - the field that contains the image displayed next to search results.
  • Date field - the field whose value is used for the date and time displayed in search results.

The table in the bottom section of the tab determines how the smart search indexes the document type’s fields (as defined on the Fields tab). You can set the following search options for individual fields:

Content

If selected, the content of the field is indexed and searchable in the standard way. For the purposes of standard search, Content fields are automatically tokenized by the analyzer of the used search index.

Searchable

If selected, the content of the field can be searched using expressions in the following format:

<field code name>:<searched phrase>

See Smart search syntax for more information about field searches.
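As a rough illustration, the following C# helper assembles a field search expression in the format above. FieldSearchSketch is a hypothetical name, and the field name NewsTitle in the usage note is only an example. Wrapping multi-word phrases in double quotes is an assumption based on Lucene-style query syntax, so check the Smart search syntax reference before relying on it.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of the Kentico API.
public static class FieldSearchSketch
{
    // Builds an expression in the format <field code name>:<searched phrase>
    public static string Build(string fieldName, string phrase)
    {
        if (string.IsNullOrWhiteSpace(fieldName))
        {
            throw new ArgumentException("A field name is required.", "fieldName");
        }

        if (string.IsNullOrWhiteSpace(phrase))
        {
            throw new ArgumentException("A searched phrase is required.", "phrase");
        }

        string trimmed = phrase.Trim();

        // Assumption: phrases containing whitespace are quoted so that
        // the whole phrase (not just the first word) applies to the field.
        // Quotes inside the phrase itself are not handled by this sketch.
        if (trimmed.Any(char.IsWhiteSpace))
        {
            trimmed = "\"" + trimmed + "\"";
        }

        return fieldName.Trim() + ":" + trimmed;
    }
}
```

For example, Build("NewsTitle", "release") produces NewsTitle:release. The field in the expression must be set as Searchable for the search to return results.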

Fields must be set as Searchable to be usable in Search filters and general search result filtering or ordering conditions (such as the Search condition and Search sort properties of Smart search result web parts).

Tokenized

Relevant for Searchable fields. Indicates if the content of the field is processed by the analyzer when indexing. This allows the search to find results that match individual tokens (subsets) of the field’s value. If disabled, the search only returns items if the full value of the field exactly matches the search expression.

If a field has both the Content and Searchable options enabled, the Tokenized option only affects the content used for field searches (content is always automatically tokenized for the purposes of standard search).
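The difference between a tokenized and a non-tokenized Searchable field can be modeled with a toy example. The sketch below does not use the real analyzers: it assumes a simplified stand-in that lowercases the value and splits it on whitespace and basic punctuation, and it treats the non-tokenized case as an exact, case-sensitive comparison. TokenizedFieldSketch is a hypothetical name.

```csharp
using System;
using System.Linq;

// Toy model for illustration only; real analyzers are more sophisticated.
public static class TokenizedFieldSketch
{
    // Simplified stand-in for an analyzer: lowercases the value and splits it into tokens
    public static string[] Tokenize(string value)
    {
        return value.ToLowerInvariant()
            .Split(new[] { ' ', '\t', ',', '.', ';' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Returns true if a field search for searchTerm would match the field value
    public static bool MatchesField(string fieldValue, string searchTerm, bool tokenized)
    {
        if (tokenized)
        {
            // Tokenized: any individual token of the value may match
            return Tokenize(fieldValue).Contains(searchTerm.ToLowerInvariant());
        }

        // Not tokenized: only the full, exact value matches
        return string.Equals(fieldValue, searchTerm, StringComparison.Ordinal);
    }
}
```

In this model, searching a field that stores "Kentico Smart Search" for "smart" succeeds only when the field is tokenized. Without tokenization, the search expression must be the complete value "Kentico Smart Search".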

Custom search name

Relevant for Searchable fields. The specified value is used as a substitute for the field code name in <field code name>:<searched phrase> search expressions.

Note: If you enter a Custom search name value, the original field code name can no longer be used in search expressions.

After you Save changes to the field settings, you need to Rebuild all indexes that cover documents of the given type.

When running searches using document indexes, the system returns results according to the field search settings of individual document types. The document type search settings are shared by all document indexes in the system.

Editing a document type - configuring search fields

SKU (product) and general document fields

To configure the field search settings for E-commerce SKUs (products):

Warning: It is highly recommended to modify only the settings of custom SKU fields. Changing the settings of the default fields may prevent the system from searching through SKUs correctly.

  1. Open the Modules application.
  2. Edit the E-commerce module.
  3. Open the Classes tab.
  4. Edit the SKU class.
  5. Select the Search tab.
  6. Click Customize.

You can configure the search settings for fields just like for document types. The SKU fields are joined together with general document fields, such as fields that store the content of editable regions on pages (DocumentContent) or the content of text widgets (DocumentWebParts).

Important: The search settings of general fields affect all documents, even those that are not products.

Configuring document crawler indexes

Document crawler search indexes read the content of pages while logged in under a user account. You can configure the following properties for every document crawler index (on the General tab of the index editing interface):

User

Sets the user account that the crawler uses to index pages. Reading pages under a user allows the crawler to:

  • Load user-personalized content for the given user
  • Avoid indexing documents that the user is not allowed to access

If empty, the index uses the default administrator user account.

On websites that use Windows authentication, you need to type the user name (including the Active Directory domain, in the format domain\username) and password. To guarantee that the crawler indexes under the specified Active Directory user, the covered pages must not be accessible by public users (i.e. Windows authentication must be required).

Domain

Sets the domain that the crawler uses when indexing sites. Enter the domain name without the protocol, for example: www.domain.com

If empty, the crawler automatically uses the main domain of the site where the indexed documents belong.

For example, you can set a custom domain for web farm servers that do not have access to the main domain.

By default, document crawlers also index documents that use redirection from the site’s main domain name to a domain alias. To only allow indexing for pages that use the website’s main domain, set the CMSCrawlerAllowSiteAliasRedirect key to false in your application’s web.config file:

<add key="CMSCrawlerAllowSiteAliasRedirect" value="false" />

The key applies to all document crawler indexes in the system.

Customizing how crawlers process page content (API)

By default, the system converts the HTML output of documents to plain text before saving it to document crawler indexes:

  • Strips all HTML tags
  • Removes the Head tag, Style tags and all JavaScript
  • Converts all whitespace formatting to simple spaces

If you wish to index the content of any tags or exclude parts of the page output, you can customize how the crawlers process the HTML. You need to implement your custom functionality in a handler of the OnHtmlToPlainText event of the CMS.Search.SearchCrawler class. This event occurs whenever a document search crawler processes the HTML output of a page.

To assign a method as the handler for the OnHtmlToPlainText event, add a new class to the App_Code folder of your web project (or CMSApp_AppCode -> Old_App_Code on web application installations). For example, you can define the content of the class as shown below:

using System;
using System.Web;

using CMS.Base;
using CMS.Search;
using CMS.Helpers;

[DocumentCrawlerContentLoader]
public partial class CMSModuleLoader
{
    /// <summary>
    /// Attribute class for assigning event handlers.
    /// </summary>
    private class DocumentCrawlerContentLoaderAttribute : CMSLoaderAttribute
    {
        /// <summary>
        /// Called automatically when the application starts.
        /// </summary>
        public override void Init()
        {
            // Assigns a handler for the OnHtmlToPlainText event
            SearchCrawler.OnHtmlToPlainText += new SearchCrawler.HtmlToPlainTextHandler(SearchHelper_OnHtmlToPlainText);
        }

        // Add your custom HTML processing actions and return the result as a string.
        // This example ignores the pre-converted plainText and processes originalHtml from scratch.
        static string SearchHelper_OnHtmlToPlainText(string plainText, string originalHtml)
        {
            string outputResult = originalHtml;

            // Replaces line breaks with spaces
            outputResult = outputResult.Replace("\n", " ");

            // Replaces tab characters with spaces
            outputResult = outputResult.Replace("\t", " ");

            // Removes JavaScript
            outputResult = HTMLHelper.RegexHtmlToTextScript.Replace(outputResult, " ");

            // Removes tags
            outputResult = HTMLHelper.RegexHtmlToTextTags.Replace(outputResult, " ");

            // Decodes HTML entities
            outputResult = HttpUtility.HtmlDecode(outputResult);

            return outputResult;
        }

    }
}


The OnHtmlToPlainText event provides the following string parameters to the handler:

  • plainText - the page output already stripped of all tags and converted to plain text
  • originalHtml - the raw page HTML code without any modifications
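To exclude part of the page output from crawler indexes, a handler can remove the unwanted markup from originalHtml before converting the rest to plain text. The following is a minimal sketch of such a method, written with plain .NET regular expressions so that it can be tried outside the CMS. The search-exclude CSS class is a hypothetical convention, not a system feature. The pattern only handles non-nested div elements, and CrawlerContentSketch is an illustrative name.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Illustrative sketch only; uses plain .NET classes instead of the CMS helper classes.
public static class CrawlerContentSketch
{
    // Matches <div class="... search-exclude ..."> ... </div> blocks (non-nested only).
    // The "search-exclude" class is a hypothetical marker chosen for this example.
    private static readonly Regex ExcludedBlocks = new Regex(
        "<div[^>]*class=\"[^\"]*search-exclude[^\"]*\"[^>]*>.*?</div>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches head, style and script elements including their content
    private static readonly Regex NonContentElements = new Regex(
        @"<(script|style|head)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    private static readonly Regex Tags = new Regex("<[^>]+>");
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Same parameters as the event handler: the pre-converted text and the raw HTML
    public static string HtmlToPlainText(string plainText, string originalHtml)
    {
        // Drops the blocks marked as excluded from search
        string result = ExcludedBlocks.Replace(originalHtml, " ");

        // Drops head, style and script elements
        result = NonContentElements.Replace(result, " ");

        // Strips the remaining tags and decodes HTML entities
        result = Tags.Replace(result, " ");
        result = WebUtility.HtmlDecode(result);

        // Collapses all whitespace into single spaces
        return Whitespace.Replace(result, " ").Trim();
    }
}
```

A method with this signature could be registered for the OnHtmlToPlainText event in the same way as the handler in the sample class above. Regular expressions are a fragile way to process HTML, so consider an HTML parser for pages with nested or irregular markup.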