Language detection for stemming

The language used for stemming can be defined for documents, for categories, or for both.

The text of a document is analyzed and stemmed in the language of the document set. If that language is not set, the language_code attribute of the document is used; if that attribute is not set either, the detected language is used. The result of the analysis is then compared with the evidence terms of the categories that have the same language or no defined language. Defining a language for a category acts as a filter: a document is never assigned to a category of a different language. Conversely, if no language is set for a taxonomy or a category, all documents, regardless of their detected language, can match the category.
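The fallback order and the category filter described above can be sketched as follows. This is only an illustration, not the product's implementation: the function names, the parameter names, and the use of None for "not set" are assumptions made for the example; only language_code comes from the text.

```python
from typing import Optional


def resolve_document_language(
    document_set_language: Optional[str],
    language_code: Optional[str],
    detected_language: Optional[str],
) -> Optional[str]:
    """Return the first language that is set, in the order described above.

    Hypothetical helper: None (or an empty string) stands for "not set".
    """
    for candidate in (document_set_language, language_code, detected_language):
        if candidate:
            return candidate
    return None


def category_is_candidate(
    document_language: Optional[str],
    category_language: Optional[str],
) -> bool:
    """Apply the language filter of a category.

    A category with no defined language accepts documents in any language;
    otherwise the document language must be the same as the category language.
    """
    if category_language is None:
        return True
    return document_language == category_language
```

For example, a document whose document set has no language, whose language_code is "fr", and whose detected language is "en" would be stemmed in French under this sketch, and it would only be compared with categories in French or with no defined language.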

You can set the language at different levels:

On the category side, you can define the language at different levels:

If the language of a category is not specified, the language of the taxonomy is used; the category does not inherit the language of its parent category, if any. When no language is defined, the evidence terms are stemmed in English.

If you do not set the language of a category, or if you set it to Any language, documents in different languages can be assigned to the category. Use the Any language option when you do not plan to activate stemming and the evidence terms are valid in any language, such as patterns for social security numbers or acronyms like EMC. Setting the language of a category to Any language disables stemming for the evidence terms of that category.
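As a rough illustration of the rules above for evidence terms, the following sketch picks the stemming language of a category. The function name, the ANY_LANGUAGE sentinel, and the "en" code for English are assumptions made for the example, and the case of a taxonomy set to Any language is not modeled.

```python
from typing import Optional

# Hypothetical sentinel standing for the "Any language" option.
ANY_LANGUAGE = "any"


def stemming_language_for_category(
    category_language: Optional[str],
    taxonomy_language: Optional[str],
) -> Optional[str]:
    """Return the language used to stem the evidence terms of a category.

    Returns None when stemming is disabled for the category.
    taxonomy_language is expected to be a language code or None.
    The parent category is deliberately not a parameter: its language
    is never inherited.
    """
    if category_language == ANY_LANGUAGE:
        # "Any language" disables stemming for the evidence terms.
        return None
    if category_language is not None:
        return category_language
    if taxonomy_language is not None:
        # An unspecified category language falls back to the taxonomy.
        return taxonomy_language
    # No language defined anywhere: evidence terms are stemmed in English.
    return "en"
```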

You can use stemming for documents in English, French, German, Spanish, Italian, Portuguese, Danish, Dutch, Norwegian, Swedish, Romanian, Russian, Finnish, Hungarian, or Turkish. When stemming is enabled, a language must be defined at the taxonomy or category level.
