Searching attachment files

The smart search allows users to search through the content of files uploaded as document attachments. The attachment search supports both types of file storage provided by Kentico (database and file system).

The attachment search only works for files that are connected to documents through one of the following methods:

  • Attachment files added to documents through fields with the Field type set to File or Document attachments in the document type definition
  • Attachments uploaded in the Pages application on the Properties -> Attachments tab of documents

Choosing which file types are searchable

By default, the attachment search supports the following file types:

  • txt
  • csv
  • pdf
  • docx
  • xlsx
  • pptx
  • xml
  • html
  • htm

Note: The search does NOT work for:

  • Legacy MS Office formats: doc, xls, ppt
  • Certain types of PDF files, including:
    • Encrypted files
    • Files using PDF version 1.5 or older

You can limit which of the file types are searchable for individual websites:

  1. Open the Settings application.
  2. Select the System -> Search category.
  3. Fill in the Allowed attachment file types setting.
    • Enter a list of allowed file extensions without dots, separated by semicolons (for example: pdf;docx;txt). If you leave the setting empty, the search works for all of the available file types.

  4. Click Save.
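
To illustrate the format of the Allowed attachment file types value, the following Python sketch shows one way such a list could be interpreted. It is not Kentico's implementation: the function names parse_allowed_extensions and is_searchable are hypothetical, and case-insensitive matching is an assumption.

```python
from pathlib import PurePath

# File types that the attachment search supports by default (from the list above).
SUPPORTED = {"txt", "csv", "pdf", "docx", "xlsx", "pptx", "xml", "html", "htm"}


def parse_allowed_extensions(setting_value):
    """Split a semicolon-separated value such as "pdf;docx;txt" into a set."""
    parts = (part.strip().lower() for part in setting_value.split(";"))
    return {part for part in parts if part}


def is_searchable(file_name, setting_value, supported=SUPPORTED):
    """Return True if the file's extension passes the allowed-types filter."""
    extension = PurePath(file_name).suffix.lstrip(".").lower()
    if extension not in supported:
        # Unsupported types (for example doc) are never searchable.
        return False
    allowed = parse_allowed_extensions(setting_value)
    # An empty setting means that all supported types are searchable.
    return not allowed or extension in allowed
```

With the value pdf;docx, a file named report.pdf passes the filter and notes.txt does not. With an empty value, both pass.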

If you wish to search one of the unsupported file types, you need to:


Enabling indexing for document attachments

The attachment search is part of standard Document indexes. It is NOT available for Pages crawler indexes, which directly index the HTML output of pages.

To set up the attachment search for your website:

  1. Open the Smart search application.
  2. Create or edit a Document search index.
  3. When defining the search content on the Indexed content tab, check Include attachment content for the index's allowed content.

    [Figure: Enabling attachment search for a document index]
  4. Click Save.
  5. Switch to the General tab and Rebuild the index.

While building the document index, the smart search processes the allowed documents, extracts the text of any attachment files and includes it in the content of the index (along with the other document data). When users perform a search using the index, the system returns results for documents whose attachments match the search expression.

Updating the search content of attachments (Upgrades and Hotfixes)

Kentico stores the text content extracted from document attachments in the database. When rebuilding document indexes, the search loads the "cached" attachment text from the database. The system only processes the file text directly for attachments that do not have any search content saved.

If you apply a hotfix or upgrade that changes how the search indexes attachment files, you need to clear the attachment search content:

  1. Open the System application.
  2. Select the Files -> Attachments tab.
  3. Click Clear attachment search cache.

You can then Rebuild your document indexes, which updates the attachment content according to the new functionality.
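
The caching behavior described in this section can be sketched in Python as follows. This is a conceptual model only, not Kentico's code: the dictionary keys data and search_content and both function names are hypothetical.

```python
def get_search_text(attachment, extract_text):
    """Return the stored search text, extracting it only when none is saved."""
    stored = attachment.get("search_content")
    if stored is None:
        # No saved search content: process the file directly and store the result.
        stored = extract_text(attachment["data"])
        attachment["search_content"] = stored
    return stored


def clear_search_cache(attachments):
    """Discard stored search text so that the next rebuild extracts it again."""
    for attachment in attachments:
        attachment["search_content"] = None
```

In this model, a changed extraction function has no effect on attachments that already have stored text. Only after clear_search_cache runs does the next call to get_search_text use the new extraction logic, which mirrors why the cache must be cleared after a hotfix or upgrade.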

Configuring the attachment search

You can adjust how the system indexes attachment files by adding keys to the appSettings section of your application's web.config file.

The indexed content always includes:

  • File metadata (title, tags, author name, etc.)
  • Comments (for example in MS Office files)

Limiting the maximum size of indexed files

Indexing very large files can be resource-intensive and have a negative impact on your website's performance. To prevent the system from indexing files larger than a certain size, add the CMSSearchMaxAttachmentSize key:

<add key="CMSSearchMaxAttachmentSize" value="10000" />

The key sets the maximum allowed file size in kB. The search ignores document attachments whose size exceeds the value.
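
As a rough illustration of the size check, the following Python sketch compares a file size in bytes against a limit in kB. The function name is hypothetical, and the conversion of 1 kB to 1024 bytes is an assumption, because the exact unit conversion used by the system is not stated here.

```python
BYTES_PER_KB = 1024  # Assumption: 1 kB is treated as 1024 bytes.


def exceeds_size_limit(size_in_bytes, max_size_kb):
    """Return True if a file is larger than the configured limit in kB."""
    return size_in_bytes > max_size_kb * BYTES_PER_KB
```

Under this assumption, the value 10000 from the example above corresponds to 10,240,000 bytes, so a file of exactly that size is still indexed and anything larger is skipped.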

Indexing of XML content

When indexing the content of XML files, the search does NOT include the following content by default:

  • Comments
  • The values of tag attributes

You can enable indexing for such content by adding the following web.config keys:

<add key="CMSSearchIndexXmlComments" value="true" />
<add key="CMSSearchIndexXmlAttributes" value="true" />
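
To show what the two keys change, the following Python sketch extracts text from an XML file with optional inclusion of comments and attribute values. It uses the standard xml.dom.minidom module and is only an approximation of the behavior described above, not Kentico's extractor. The function name and its parameters are hypothetical.

```python
from xml.dom import Node, minidom


def extract_xml_text(xml_source, index_comments=False, index_attributes=False):
    """Collect the text of an XML document, optionally with comments and attributes."""
    document = minidom.parseString(xml_source)
    parts = []

    def visit(node):
        if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(node.data)
        elif node.nodeType == Node.COMMENT_NODE:
            if index_comments:
                parts.append(node.data)
        elif node.nodeType == Node.ELEMENT_NODE and index_attributes:
            for i in range(node.attributes.length):
                parts.append(node.attributes.item(i).value)
        # Text and comment nodes have no children, so this loop only recurses
        # for the document node and for elements.
        for child in node.childNodes:
            visit(child)

    visit(document)
    return " ".join(part.strip() for part in parts if part.strip())


sample = '<book id="b1"><!-- draft --><title lang="en">Search</title></book>'
default_text = extract_xml_text(sample)
full_text = extract_xml_text(sample, index_comments=True, index_attributes=True)
```

For the sample file, default_text contains only the element text (Search), while full_text also contains the comment (draft) and the attribute values (b1 and en).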

Enabling character encoding detection for text files

By default, the search can read text files (txt and csv) that use one of the following character encodings:

  • UTF-8
  • The default Windows encoding (the operating system's current ANSI code page)

If you encounter problems when indexing text files with a different encoding type, you can enable automatic encoding detection:

<add key="CMSSearchDetectTextEncoding" value="true" />

The system then attempts to detect the encoding of each file and uses the detected encoding when reading the content during the indexing process.

Note: Correct encoding detection is not guaranteed for all files. Automatic detection also slightly increases the time required to index text files.
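
The difference between the default behavior and automatic detection can be sketched in Python as follows. The detection method that the system actually uses is not described here, so this example substitutes a simple byte order mark (BOM) check as a stand-in technique. The function name, its parameters, and the choice of cp1252 as the ANSI code page are all assumptions.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def read_text(data, detect_encoding=False, ansi_code_page="cp1252"):
    """Decode the bytes of a text file into a string."""
    if detect_encoding:
        # Stand-in detection: pick the encoding announced by a BOM, if any.
        for bom, encoding in BOMS:
            if data.startswith(bom):
                return data.decode(encoding)
    try:
        # Default behavior: try UTF-8 first.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Fall back to the ANSI code page (assumed to be cp1252 here).
        return data.decode(ansi_code_page, errors="replace")
```

In this sketch, a UTF-16 file is decoded correctly only when detect_encoding is enabled. Files without a BOM still fall through to the UTF-8 and ANSI attempts, which reflects why detection cannot be guaranteed for every file.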