Isi
Serialized
NOTE: this document describes early OpenIsis proposals. For updated information please refer to the Malete protocol

ISIS records serialized
Serialization means to convert the internal representation of an ISIS record to a sequence of bytes ("octets") suitable to be stored in a file or transferred via a network.
The serialization format described here is used by OpenIsis for both database master files and network communications.

design goals

The serialization format should be
  • easy to use
    for programmers and tool writers
  • efficient
    in execution time and space used
  • robust
    a broken masterfile should be fixed using a text editor
  • versatile
    can be used for a variety of applications using a variety of tools
  • without limits
    in number and size of records and fields


basic format

In general, a record is serialized by
  • serializing meta information
  • serializing the fields in order
  • appending a blank line

Fields are serialized as
  • the field tag printed using ASCII decimal digits (optionally preceeded by a minus sign, if negative tags are allowed)
  • a (horizontal) TAB character (ASCII value 9, ^I)
  • the field value
  • a newline character (ASCII value 10, ^J)

Metadata is serialized in the same way, using special tags according to the needs of the environment. Two situations are distinguished:
  • "soft" metadata, which may and should be accessible as part of the record. This is encoded by convention using tags < 100. An example of this is HTTP and other MIME-style communication, where the MIME headers like "User-agent" or "Date" are encoded in such a way, while content data like GET or POST parameters should be mapped to IDs starting from 100.
  • "hard" metadata, which must not interfere with the record contents in order for the environment to work properly. This is encoded using a single non-digit character instead of tag digits. An example of this is the MFN in a master file and information regarding record deletion or update.

The final blankline may be omitted, where only a single record is contained in an otherwise delimited byte sequence.
A reader should support a lazy mode, allowing the TAB to be omitted, where unambigous (the field value does not start with a digit or a TAB). Writers, however, are strongly urged to write the TAB.
Tags with leading zeros are allowed (typically with %03d) and must not be interpreted as octal using atoi.

newline conventions

The application must ensure that field values do not contain newline characters. Some easy common encodings are suggested to deal with newline characters:
  • in "field mode", replace newlines by spaces or tabs.
  • in "text mode", newlines are replaced with vertical tabs VT (ASCII 11, ^K). This maybe reversed to restore newline-separated lines if needed, but e.g. on printing the VT will have the desired effect.
  • in "binary mode", newlines are replaced as VT followed by a byte value 1, if the newline is followed by a byte value 0 or 1, else by a single VT. A VT is replaced by a VT and a 0 byte.

The advantages of text mode over binary mode are
  • it is slightly faster than the binary translation
  • the serialized records do not need more space (whereas the binary serialization might need twice the space)

The binary mode has the advantage of not loosing vertical tab characters that might have been contained in the original field values. It is fully transparent and can be used to store any binary data like images with an average overhead of 0.4% (as compared to +33% needed by BASE64 encoding).

masterfile format

A basic masterfile consists of a blank-line separated series of records.
The first record is the "controlling record", containing descriptive information such as the character encoding. All of this is optional; the masterfile might just start with a blank line.
The MFNs are then assigned implicitly in order, starting from 1. There is no distinction between empty (consecutive blanklines) and deleted records. There is absolutely no redundant information contained.
Masterfile compression creates this state (however, it may choose to use special meta lines for long ranges of deleted records, see below -- but those are inefficient in the Xref, anyway). Such a masterfile can be very easily created by any tool (like Perl and such). The Xref file can be easily, fast and reliably recreated.

When writing to the database, information is ALWAYS appended to the end. There is NO OVERWRITING of any data, ever, period. That way data CANT BE DESTROYED by any operation (one could advise the operating system to set a mandatory read lock on that), and all changes are easily traced using tail -f.

basic mode writing

In metalines, all numbers are given in decimal digits and multiple items are separated by TABs. The optional timestamps are an arbitrary prefix of generalized time format YYYYMMDDhhmmssttt... as of 'date +%Y%m%d%H%M' plus milliseconds ttt, optionally followed by up to 14 characters to create a unique request id.
Write operations use the following meta line (preceeding record data):
  • W mfn [oldpos [timestamp]]
    followed by new record data denotes a write of record mfn. Depending on the needs of the environment, the byte offset of the last version might be added in order to support access to old versions (e.g. for delayed index update). oldpos may be given as position[.length[.fields]]

A W with no data following is mostly equivalent to a delete. If an otherwise written mfn is higher than the highest known, the highest known (and thus the implicit counter) is set to this.
A lazy reader should not require meta lines to be followed by a blank line, where unambigous. For writers, however, the blank line is strongly recommended.

A reasonable size limit on metalines is 127(+newline), since
  • 22 = 1+1+20 for operator, tab and mfn (64 bits ~ 20 digits)
  • 43 = 1+20+1+10+1+10 for tab position.length.fields
  • 32 = 1+17+14 for tab, milliseconds time and id
totals to 97, so we have some room left.

advance mode

For advanced space efficiency at the cost of read increased access time, the following lines may be used:
  • D mfn [oldpos [timestamp]]
    a record entry consisting of only a D line denotes the deletion of record number mfn. Basically equivalent to a write with no data following, but a little bit more explicit.
  • I mfn [timestamp]
    (set id) override the implicit MFN counter, e.g. after a series of deleted rows or just to be explicit. Basically equivalent to a write with no old pos, but a little bit more explicit. Recommended when appending records after some deletes.
  • C mfn oldpos [timestamp]
    introduces a series of patch commands specifying how record mfn was changed.

Software writing the masterfile will typically choose to write full updates and MUST provide a switch to forcedly do so, in order to be compatible with basic mode readers.
However, the patch command language is particularily useful with server operations, to avoid the need for read-write sequences with some sort of locking.

the patch language

The patch commands are lines starting with special characters like +, -, ~ and so on, followed by (an optional TAB and) field addresses, TAB and field data.

The simplest case is the '+' command, meaning that it's data is to be appended to the record. The + and TAB may be omitted. In other words, the add command may look exactly like an ordinary field line.
A series of '=' commands works exactly like the set operation in an OpenIsis Tcl record. Especially, field indexes and subfields are supported. The '-' command resembles the del operation.
A detailled description of the "patch language" is to be done.

example
C	1234
=	24	foo
=	24	bar
25	baz
changes record 1234 by setting the first to occurences of field 24 to foo and baz, respectively, deleting any other occurences of field 24, and appending a field 25 with value baz.

the pointer file

The pointer file is an array of n-Byte (n >= 6) entries, the ith entry referencing mfn i, similar to the traditional .XRF (crossref).
The n=k+l+m bytes specify two or three numbers (in native byte order) of up to 8 bytes each:
  • the first k bytes (k >= 4) give the position of the record (or it's last update or change entry)
  • the next l bytes (l >= 2) give the length of the record (excluding the last field's terminating newline and following blank line)
  • the final m bytes (m >= 0) give the number of fields. If m is 0, or all bits in a field number are set (for large records), the reader has to determine the number fo fields by inspecting the record.

The first six bytes of the first entry describe the detailled layout. Four bytes are the "magic number" containing the ASCII characters "ISIX". Two bytes are the number (m*256 + l*16 + k) in native byte order.
The minimum case k=4, l=2 imposes the limits of traditional ISIS. Actually the lower limits are not enforced; in a very specialised application one might want to use k/l/m = 3/1/0. Recommended are at least eight bytes as k/l/m = 4/3/1 or 4/4/0 or, with large file support, 5/3/0.

However, when any of the limits is reached, or an unsupported combination or byte order is found, the Xref can easily be recreated with greater values. The 12 byte pointer with k/l/m = 6/4/2 will be enough for even gigantic databases of a quarter Petabyte (262.144 Gigabyte).
The number of fields is redundant, but as an optimization may make live a little bit easier for a reader. If the pointer structure has m>0 (typically m=1), a value of 0 must be stored if the number of fields exceeds the representable range and a reader should be prepared to figure it out itself in that case.


$Id: Serialized.txt,v 1.10 2004/06/10 12:52:29 kripke Exp $