The contents of the character map files are structured as follows:
lowercase value-set
This directive introduces the basic value set of the field type. The format is an ordered list (without spaces) of the characters which may occur in "words" of the given type. The order of the entries in the list determines the sort order of the index. In addition to single characters, the following combinations are legal:
A backslash introduces a three-digit octal representation of a single character, or a two-digit hexadecimal representation when the digits are preceded by x. In addition, the combinations \\, \r, \n, \t, and \s (space; remember that real space characters may not occur in the value definition) are recognized, with their usual interpretation.
Curly braces {} may be used to enclose ranges of single characters (possibly using the escape convention described in the preceding point), e.g. {a-z} for the standard range of lowercase ASCII letters. Note that the interpretation of such a range depends on the concrete representation in your local, physical character set.
Parentheses () may be used to enclose multi-byte characters, e.g. diacritics or special national combinations (such as Spanish "ll"). When found in the input stream (or a search term), these characters are viewed and sorted as a single character, with a sorting value depending on the position of the group in the value statement.
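The three conventions above can be illustrated with a small Python sketch. This is not the indexer's own parser: the function name expand_value_set is hypothetical, any escape not listed above is assumed to stand for the escaped character itself, and malformed input is only minimally checked. It expands a value-set string into an ordered list of tokens, where the position of each token is its sort rank.

```python
import re


def expand_value_set(spec):
    """Expand a value-set string into an ordered list of tokens.

    A token is a single character or, for a parenthesized group, a
    multi-character string that is treated as one unit.
    """

    def read_char(pos):
        """Read one (possibly escaped) character; return (char, next_pos)."""
        if spec[pos] != "\\":
            return spec[pos], pos + 1
        rest = spec[pos + 1:]
        if not rest:
            raise ValueError("dangling backslash")
        m = re.match(r"x([0-9A-Fa-f]{2})", rest)  # \xHH, two hex digits
        if m:
            return chr(int(m.group(1), 16)), pos + 4
        m = re.match(r"[0-7]{3}", rest)           # \OOO, three octal digits
        if m:
            return chr(int(m.group(0), 8)), pos + 4
        simple = {"\\": "\\", "r": "\r", "n": "\n", "t": "\t", "s": " "}
        # Assumption: an unlisted escape stands for the character itself.
        return simple.get(rest[0], rest[0]), pos + 2

    tokens = []
    i = 0
    while i < len(spec):
        if spec[i] == "{":                        # range, e.g. {a-z}
            lo, i = read_char(i + 1)
            if spec[i:i + 1] != "-":
                raise ValueError("expected '-' in range")
            hi, i = read_char(i + 1)
            if spec[i:i + 1] != "}":
                raise ValueError("expected '}' after range")
            i += 1
            tokens.extend(chr(n) for n in range(ord(lo), ord(hi) + 1))
        elif spec[i] == "(":                      # group, e.g. (ll)
            i += 1
            group = []
            while spec[i:i + 1] != ")":
                if i >= len(spec):
                    raise ValueError("unterminated group")
                ch, i = read_char(i)
                group.append(ch)
            i += 1
            tokens.append("".join(group))
        else:                                     # plain or escaped character
            ch, i = read_char(i)
            tokens.append(ch)
    return tokens


tokens = expand_value_set(r"{a-c}(ll)\x41\101\s")
print(tokens)  # ['a', 'b', 'c', 'll', 'A', 'A', ' ']

# The position in the list serves as the sort rank of each token.
rank = {tok: n for n, tok in enumerate(tokens)}
print(sorted(["ll", "c", "a"], key=rank.get))  # ['a', 'c', 'll']
```

Note that the range is expanded by code point here, which mirrors the caveat that the meaning of a range depends on the underlying character set.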
uppercase value-set
This directive introduces the upper-case equivalents of the value set (if any). The number and order of the entries in the list should be the same as in the lowercase directive.
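The reason the two lists must line up can be sketched in Python. The sketch assumes that each upper-case entry is folded onto the lowercase entry at the same position; the names build_case_fold and fold_text are hypothetical, and both lists are taken to be already expanded into individual entries.

```python
def build_case_fold(lowercase, uppercase):
    """Pair entries by position: uppercase[i] folds to lowercase[i]."""
    if len(lowercase) != len(uppercase):
        raise ValueError("uppercase and lowercase lists differ in length")
    return dict(zip(uppercase, lowercase))


# Hypothetical, already expanded value sets.
lower = ["a", "b", "c", "æ"]
upper = ["A", "B", "C", "Æ"]
fold = build_case_fold(lower, upper)


def fold_text(text):
    """Replace every known upper-case character; leave the rest untouched."""
    return "".join(fold.get(ch, ch) for ch in text)


print(fold_text("CAB Æ"))  # cab æ
```

If the entries were listed in a different order, the positional pairing would silently fold characters onto the wrong lowercase values, which is why the order matters as much as the count.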
space value-set
This directive introduces the characters that separate words in the input stream. Depending on the completeness mode of the field in question, these characters either terminate an index entry, or delimit individual "words" in the input stream. The order of the elements is not significant; otherwise the representation is the same as for the uppercase and lowercase directives.
map value-set target
This directive introduces a mapping from each member of the value-set on the left to the character on the right. The character on the right must occur in the value set (the lowercase directive) of the character set, but it may be a parenthesis-enclosed multi-octet character. This directive may be used to map diacritics to their base characters, or to map HTML-style character representations to their natural form, etc. The map directive can also be used to ignore leading articles in searching and/or sorting, and to perform other special transformations. See Section 3, “Ignoring leading articles”.
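The effect of a set of map directives on input text can be sketched in Python as follows. The function apply_maps and the rule table are hypothetical, the rules are given as already expanded source/target pairs, and trying the longest source first is an assumption of this sketch rather than documented behaviour.

```python
def apply_maps(text, rules):
    """Rewrite text using source -> target rules, longest source first."""
    # Empty sources are skipped so that the scan always advances.
    sources = sorted((s for s in rules if s), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for src in sources:
            if text.startswith(src, i):
                out.append(rules[src])
                i += len(src)
                break
        else:
            # No rule matches at this position: copy the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical rules: an HTML-style entity mapped to its natural form,
# and two accented characters mapped to their base character.
rules = {"&auml;": "ä", "é": "e", "è": "e"}
print(apply_maps("caf&auml; été", rules))  # cafä ete
```

Each target here would also have to appear in the value set introduced by the lowercase directive, as required above.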