NOTE: this document describes early OpenIsis proposals.
An implementation of this shall become part of OpenIsis.
This document discusses the issues involved in collating entries
for the index (inverted file).
Collating starts after the preliminary steps of inverting, like
selecting data from a record (or other sources), including
extraction of keywords and the like, which are described elsewhere.
Collating involves a string transformation, so that the resulting strings
- are collation keys,
i.e. yield the desired ordering on binary comparison (a la strcmp).
- are normalized,
i.e. some variants of the same contents are mapped to a canonical base form.
The same transformation is applied both to entries before they are written
to the index and to keys before they are searched in the index.
A desired property of this transformation (mapping) is reversibility
to a legible representation of the original key,
so that index contents can be listed.
However, an alternative is to use the original record contents
for listing, where this can be easily identified (e.g. from a thesaurus).
Since collating involves looking up characters, it can also detect word boundaries.
The traditional means provided by the ISISAC.TAB and ISISUC.TAB tables
do not extend well to multibyte character sets like UNICODE,
so we explore alternatives.
| basic collation |
A basic variant using ISO-C/POSIX means (ctype.h, string.h) would be
- using toupper for normalization
- using isalpha to detect word boundaries
- using strxfrm to create collation keys
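These three steps might be sketched as follows (word boundary detection
via isalpha is omitted for brevity; the function name is made up for this sketch):

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* POSIX fallback sketch: normalize with toupper,
 * then build a collation key with strxfrm.
 * The result depends on the current LC_COLLATE locale.
 */
static char *posix_collation_key(const char *s)
{
    size_t i, len = strlen(s);
    char *norm = malloc(len + 1);
    char *key;
    size_t keylen;

    if (!norm)
        return NULL;
    for (i = 0; i <= len; i++)           /* uppercase, including the nul */
        norm[i] = (char)toupper((unsigned char)s[i]);

    keylen = strxfrm(NULL, norm, 0) + 1; /* size needed for the full key */
    key = malloc(keylen);
    if (key)
        strxfrm(key, norm, keylen);
    free(norm);
    return key;                          /* order keys with plain strcmp */
}
```

Keys produced this way compare with strcmp; in the default "C" locale
strxfrm is essentially a copy, so the key equals the uppercased input.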
This fallback approach has several disadvantages:
- normalization is very limited
- it depends on a "locale", which is global state of the process,
so working with databases that use different settings is not thread safe.
- the dependency on the locale is not specified precisely
- there is no standard way to customize settings
- it is not reversible
| full unicode collation |
Using a library such as Big Blue's ICU,
we can use the full Unicode collation algorithm (UCA),
including the customization options described there.
- use the given encoding to convert input to unicode
- uppercase and canonicalize input, especially decompose composite characters
- remove most diacritics (but not the ring on the A when in Sweden)
- use the alphabetic character property to detect word boundaries
- get a collation key
This approach is very powerful. It provides prepared data for virtually
all scripts, encodings and locales without the need for customization.
It provides for three or four levels of sorting
and can even handle French accent ordering
(where accent differences later in otherwise equal words take precedence).
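The French rule can be illustrated without ICU by a toy two-level key:
primary weights (the base letters), a low separator byte, then the
secondary weights (accents) appended in reverse order, so that an accent
difference near the end of the word wins. Representing accents as digit
codes in a parallel string is purely an assumption of this demo; with it,
the keys order cote < côte < coté < côté, as in French dictionaries.

```c
#include <string.h>

/* Build a toy French collation key.
 * base:    the letters of the word, e.g. "cote"
 * accents: one code per letter, '0' = none, '1' = acute, '2' = circumflex
 * out:     receives primary level + separator + reversed secondary level
 */
static void french_key(const char *base, const char *accents, char *out)
{
    size_t n = strlen(base), i;
    memcpy(out, base, n);            /* primary level: base letters */
    out[n] = 1;                      /* level separator, sorts low  */
    for (i = 0; i < n; i++)          /* secondary level, reversed   */
        out[n + 1 + i] = accents[n - 1 - i];
    out[2 * n + 1] = 0;
}
```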
Sooner or later, OpenIsis will provide a means to link to ICU.
However, this still has a couple of disadvantages
- it requires a library that alone is several times larger than OpenIsis.
The algorithms used are very complex, so they cannot be expected
to perform all too well on limited hardware resources.
- customization for special normalizations is difficult
- reverting a compressed collation key might not always be possible;
at least it is a pretty non-trivial task (and is not supported by ICU)
| isis custom collation |
NOTE: this is now superseded by a simpler and more robust definition elsewhere.
We devise a relatively simple scheme that combines the mappings used for
normalization, collation key creation and word boundary detection into one step.
It does not (as of now) provide multi-level ordering, since the discriminators
for secondary levels (like case and diacritics) are typically discarded anyway.
The scheme aims at
- easy customization
- independence of encoding
- mapping of sequences
The mapping is specified by an isis record,
that could be stored in a separate file or be part of embedded header data.
Either way, the mappings are described using the actual characters,
in the encoding of the database, rather than code numbers.
This way, simple explicit mappings will survive a recoding
(ranges, however, may be broken).
Fields used in the customization are mapping rules,
containing one or more "items" separated by TAB characters.
An item is
- a single character
if it is a single character (sic!)
- a literal sequence
if the first one is a quote '"' (34)
- a range
if it consists of three characters, and the second one is a dash '-' (45)
- an enumeration
else. A dash in second position or a quote in first position can be
avoided by permutation or by using multiple rules.
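A classifier for these four item forms might look like the following sketch
(byte strings only; multibyte characters in the database encoding would
need more care than shown here):

```c
#include <string.h>

enum item_kind { ITEM_SINGLE, ITEM_LITERAL, ITEM_RANGE, ITEM_ENUM };

/* Classify one TAB-separated item of a mapping rule,
 * testing the forms in the order given above.
 */
static enum item_kind item_kind(const char *item)
{
    size_t n = strlen(item);
    if (n == 1)
        return ITEM_SINGLE;                  /* a single character */
    if (item[0] == '"')
        return ITEM_LITERAL;                 /* a literal sequence */
    if (n == 3 && item[1] == '-')
        return ITEM_RANGE;                   /* e.g. A-Z */
    return ITEM_ENUM;                        /* an enumeration */
}
```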
The fields in the mapping (record) map the second and following items
to the first one. Let n be the number of characters in the target (the first item),
and m that of the source, where a sequence is regarded as a single character.
- if n >= m,
the m source characters are mapped to the first m target characters
- if n < m,
the first n source characters are mapped to the corresponding
target characters, other source characters are mapped to the last
character of the target (or a blank, if n=0).
Sequences of source characters mapping to the last target character are squeezed,
as with the -s option of the "tr" tool, i.e. they produce only one character.
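The n < m case with squeezing might be sketched like this (simplified in
that any run of the last target character is squeezed, whether mapped or
literal; the function name is an invention of this sketch):

```c
#include <stddef.h>
#include <string.h>

/* Apply one rule mapping the m source characters src[0..m-1]
 * to the n target characters tgt[0..n-1]: with n < m, surplus
 * sources all map to the last target (or to blank when n == 0),
 * and runs of that last character are squeezed like "tr -s".
 */
static void rule_apply(const char *src, size_t m,
                       const char *tgt, size_t n,
                       const char *in, char *out)
{
    char last = n ? tgt[n - 1] : ' ';
    char prev = 0;
    size_t i;
    for (; *in; in++) {
        char c = *in;
        for (i = 0; i < m; i++)
            if (src[i] == c) {             /* character covered by rule  */
                c = i < n ? tgt[i] : last; /* surplus maps to last target */
                break;
            }
        if (c == last && prev == last)
            continue;                      /* squeeze a repeated last target */
        *out++ = prev = c;
    }
    *out = 0;
}
```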
All but "map" fields describe a final rule,
which assigns a character class (letter, word, other, null)
and one or a range of collation values to the listed items.
Collation values are assigned in the order of rules (fields).
"map" rules do NOT assign a collation value, but process the result recursively.
A "null" rule assigns character class null and no collation value;
ignored characters are silently discarded and don't break words
(like soft hyphen, combining diacritical marks).
A "letter" rule assigns character class letter, i.e. characters that are part of words.
A "word" rule assigns character class word, i.e. characters that are themselves words.
An "other" rule assigns character class other, i.e. characters that are not part of words
and are discarded in word-split mode.
The builtin default rule maps all characters to a blank,
which thus always has the lowest collation value.
In any case, the longest matching sequence is used,
and map rules should (?) take precedence over final rules.
TODO: detailed precedence of conflicting rules.
The actual collation key is a sequence of collation values,
encoded as unsigned characters if all values are less than 256, or as UTF-8 otherwise.
(A fixed-number-of-bits encoding could be used alternatively.)
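Emitting one value might be sketched as below. UTF-8 has the convenient
property that byte-wise comparison of the encodings preserves the numeric
order of the values, so binary comparison of keys stays valid; the sketch
covers values below 0x10000 (three bytes).

```c
#include <stddef.h>

/* Append one collation value to a key buffer and return the
 * number of bytes written. bytes_mode means all values of the
 * mapping are below 256, so a value is a single raw byte;
 * otherwise values are written as UTF-8 multibyte sequences.
 */
static size_t put_value(unsigned char *out, unsigned value, int bytes_mode)
{
    if (bytes_mode) {                      /* all values < 256 */
        out[0] = (unsigned char)value;
        return 1;
    }
    if (value < 0x80) {                    /* 1-byte UTF-8 */
        out[0] = (unsigned char)value;
        return 1;
    }
    if (value < 0x800) {                   /* 2-byte UTF-8 */
        out[0] = (unsigned char)(0xC0 | (value >> 6));
        out[1] = (unsigned char)(0x80 | (value & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (value >> 12));
    out[1] = (unsigned char)(0x80 | ((value >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (value & 0x3F));
    return 3;                              /* 3-byte UTF-8, values < 0x10000 */
}
```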
The reverse mapping is simple: for each collation value,
there is a corresponding final rule that produced this value
or a range containing this value. The corresponding character
(or sequence) in the first item of this rule is used.
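Assuming single-byte collation values and a lookup table built from the
first items of the final rules (the table name rep is made up here), the
reverse mapping is a straight lookup:

```c
#include <string.h>

/* Reverse a nul-terminated collation key: for every value, emit
 * the character (or sequence) from the first item of the final
 * rule that produced it. rep[] maps value -> representative text.
 */
static void key_to_text(const unsigned char *key,
                        const char *rep[], char *out)
{
    for (; *key; key++) {
        const char *r = rep[*key];
        size_t n = strlen(r);
        memcpy(out, r, n);
        out += n;
    }
    *out = 0;
}
```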
A simple example for traditional Spanish collation:
letter A-L a-l
letter LL ll
letter M-Z m-z
map AEIOUN ÁÉÍÓÚÑ áéíóúñ
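As an illustration, a toy key builder for this table (lowercase ASCII
only, the accent map rule left out; the concrete value assignments
1..27 are an assumption of this sketch, only their order matters):

```c
#include <string.h>

/* Traditional Spanish toy keys: a-l get values 1..12, the
 * digraph "ll" gets 13 (its own letter after l), m-z get 14..27.
 * Keys are nul-terminated and compared with plain strcmp.
 */
static void spanish_key(const char *word, unsigned char *key)
{
    while (*word) {
        if (word[0] == 'l' && word[1] == 'l') {
            *key++ = 13;                       /* LL sorts after l, before m */
            word += 2;
        } else if (*word >= 'a' && *word <= 'l') {
            *key++ = (unsigned char)(1 + (*word++ - 'a'));
        } else {
            *key++ = (unsigned char)(14 + (*word++ - 'm'));
        }
    }
    *key = 0;
}
```

With these keys, "luz" sorts before "llama" (l before ll), and "llama"
before "madre" (ll before m), as traditional Spanish requires.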