GNI - Inverting

Inverting

This document discusses various issues of indexing (creating entries for the inverted file).
For updated information please refer to Malete FileFormats and the "index" method in the Protocol

the inverted file

The inverted file maps keys to (sorted) sequences of values. Although an application can store arbitrary information in values, in standard use, they denote a hit (where the key was found in a record). A hit is a structure of five numbers, specifying (in that order) a database number, a rowid (mfn), tag, occurence (of this tag in it's record) and position (of this "word" in it's field).
The inverted file is implemented as a B-Link-Tree (Lehmann/Yao) with variable length keys of up to 255 bytes. The length of values is fixed for a given index, but can be configured on index creation in the range 8 to 23 bytes (8+n, with 4 bits for n). The 8 byte minimum configuration (corresponding to the traditional IFP) uses

2 bytes pos
1 byte occ
2 bytes tag
3 bytes mfn

A ninth byte is used for the mfn, bytes 10 and 11 for a database number, additional bytes have no standard usage in hits (by now).
Occurence and position are typically used for "near" conditions in queries. Where this is not needed, they could also be used, for example, to apply weights to the hits.

creation of index entries

Traditional ISIS systems based indexing on

a "print format" to create text from a record
an "indexing technique" to extract entries. The various indexing techniques split the text at newlines, subfields, word boundaries, angle brackets or other delimiters.

OpenIsis uses a different approach based on an "index record", which is created either explicitly by application code or using a view. Most split up like extracting keywords from between angle brackets or other delimiters is done during creation of this record, containing, in the simplest case, one field per index entry.
The only additional transformation that may be applied is word split to be done during Collating , because the definition of which characters constitute words is configured together with the collation info.

In the index record, fields with a non-negative tag are treated as index entries according to the current mode. All fields with negative tags give processing instructions as follows (denoted with symbolic tag names):

index f[ields] [startocc]
an xmode field with value 'f' (or some prefix of 'fields') sets default indexing mode, where each value (of a non-negative field) is used as literal index entry. The occurence is set to startocc or 0, incremented for runs of fields with the same tag, and reset to 0 on tag change.
index w[ords] [startpos]
sets word mode, assuming fields to contain pre-split words. Instead of the occ, the pos is set, incremented and reset.
index s[plit]
sets word autosplit mode. Index entries are split into words according to collation info; each word increments the pos, each field resets the pos and increments or resets the occ as in fields mode. If the index has no collation info, all characters but the well-known ASCII non-letters are assumed to be word characters.
index a[dd]|d[el]
sets add (default) or del mode for index entries.
index m[fn] mfn
sets the mfn to use
index n
where n is a number, sets max key length to n (default is 30)
hit [+|-][[dbn.]mfn.]tag.[occ].[pos]|tag[.pos] key declares explicit hit data for key. With a leading + or -, explicitly add or del the key regardless of mode. Unspecified fields are determined as usual from the environment. Since search results can be sorted by increasing pos, this field can also be used to specify a weight (starting from 0 for a 100% match).
fst viewspec
declares a view line to create fields from the current data record, see Views

The index fields can also be used with a tag and value in their field value, in which case the given mode applies to that tag and value only. (TODO)
While view-based indexing is yet to be done (as of May 03), application controlled indexing is working and robust (see the save method in openIsis.tcl and sample.fsp).

special values

The entries created according to the index record are then subject to a collation transformation to create index keys.
Finally they are handed to the B-Tree for insertion. If the key is already there, it may have an associated special value, which does not denote a hit, but rather processing instructions. Special values are recognized by a mfn of 0.

a completely blank value of all zeroes marks a stopword. no other values for this key are added to the index. (This is implemented).
a value of mfn 0, first byte of tag 1, marks a ref. The rest of the value and some successive values contain an alternate key, under which the entry is made (and searched). Since this has some difficulties, it will probably take some time until this is implemented...

A multilingual (English, French, Italian, Spanish and Portuguese) stopword file can be loaded like

tr '[a-z]' '[A-Z]' <docum.txt | sort | sto/openisis -db db/test/test -swload

$Id: Inverting.txt,v 1.4 2004/06/10 12:52:29 kripke Exp $