GNI - Collating

Collating

NOTE: this document describes early OpenIsis proposals. An implementation of this shall become part of Malete

This document discusses the issues involved in collating entries for the index (inverted file).
Collating starts after preliminary steps of inverting like selecting data from a record (or other sources), including extraction of keywords and the like, which are described in Inverting .
Collating involves a string transformation, so that the resulting strings

are collation keys,
i.e. yield the desired ordering on binary comparision (a la strcmp).
are normalized,
i.e. some variants of the same contents are mapped to a canonical base

The same transformation is applied to both entries before written to the index and keys before searched in the index.

A desired property of this transformation (mapping) is reversibility to a legible representation of the original key, so that index contents can be listed. However, an alternative is to use the original record contents for listing, where this can be easily identified (e.g. from a thesaurus).
Since collating involves looking up characters, it can also detect word boundaries.

The traditional means provided by the ISISAC.TAB and ISISUC.TAB tables do not extend well to multibyte character sets like UNICODE, so we explore alternatives.

basic collation

A basic variant using ISO-C/POSIX means (ctype.h, string.h) would be

using toupper for normalization
using isalpha to detect word boundaries
using strxfrm to create collation keys

This fallback approach has several disadvantages:

normalization is very limited
it depends on a "locale" which is a global state of a process, so working with databases with different settings is not multithreading safe.
the dependency from the locale is not specified precisely
there is no standard way to customize settings
it is not revertible

full unicode collation

Using a library such as Big Blue's ICU , we can use the full unicode collation algorithm ( UCA ), including the customization options described there.

use the given encoding to convert input to unicode
uppercase and canonicalize input, especially decompose composite characters
remove most diacritics (but not the ring on the A when in sweden)
use the alphabetic character property to detect word boundaries
get a collation key

This approach is very powerful. It provides prepared data for virtually all scripts, encodings and locales without the need for customization. It provides for three or four levels of sorting and can even handle french accent ordering (where accents later in the otherwise same word have higher preceedence). Sooner or later, OpenIsis will provide a means to link to ICU.
However, this still has a couple of disadvantages

it requires a library that alone is several times larger than OpenIsis. The used algorithms are very complex, so it can not be expected to perform all too well on low hardware ressources.
customization for special normalizations is difficult
reverting a compressed collation key might not always be possible, at least it is a pretty non-trivial task (and not supported by ICU)

isis custom collation

NOTE: this is now superseded by a simpler and more robust definition in Malete

We devise a relatively simple scheme, that combines the mappings used for normalization, collation key creation and word boundary detection into one step. It does (by now) not provide multi-level ordering, since the discriminators for secondary levels (like case and diacritics) are typically discarded anyway during normalization.

Features are

efficiency
easy customization
independence of encoding
mapping of sequences

The mapping is specified by an isis record, that could be stored in an separate file or be part of embedded header data. In any way, the mappings are described using the actual characters, in the encoding of the database, rather than code numbers. This way, simple explicit mappings will survive a recoding (ranges, however, may be broken).

Fields used in the customization are mapping rules, containing one or more "items" separated by TAB characters. An item is

a single character
if it is a single character (sic!)
a literal sequence
if the first one is a quote '"' (34)
a range
is it consists of three characters, and the second one is a dash '-' (45)
an enumeration
else. A dash at second position or quote at first position can be avoided by permutation or by using multiple rules.

The fields in the mapping (record) map the second and following items to the first one. Let n the number of characters in the target (the first item), m those of the source, where a sequence is regarded as a single character.

if n >= m,
the source characters are mapped to the first m corresponding target characters
if n < m,
the first n source characters are mapped to the corresponding target characters, other source characters are mapped to the last character of the target (or a blank, if n=0). Sequences of characters mapping to the last one are squeezed as with the -s option of the "tr" tool, i.e. produce only one character.

All but "map" fields describe a final rule, which assigns a character class (letter,word,other,null) and one or a range of collation values to the listed items. Collation values are assigned in the order of rules (fields).

map
do NOT assign a collation value, but process the result recursively.
ignore
assigns character class null and no collation value. ignored characters are silently discarded and don't break words (like soft hyphen, combining diacritical marks).
letter
assigns character class letter, i.e. characters that are part of words.
word
assigns character class word, i.e. characters that are themselves words.
other
assigns character class other, i.e. characters that are not part of words and are discarded in word-split mode.

The builtin default rule maps all characters to a blank, which thus always has the lowest collation value.
In any case, the longest matching sequence is used, and map rules should (?) take preceedence over final rules. TODO: detailled preceedence of conflicting rules.

The actual collation key is a sequence of collation values, encoded as unsigned characters, if less than 256, or UTF-8 else. (A fixed-number-of-bits encoding could be used alternatively). The reverse mapping is simple: for each collation value, there is a corresponding final rule that produced this value or a range containing this value. The corresponding character (or sequence) in the first item of this rule is used.

Simple example for traditional spanish collation:

letter	A-L	a-l
letter	LL	ll
letter	M-Z	m-z
map	AEIOUN	ÁÉÍÓÚÑ	áéíóúñ