GNI - PatchWork

PatchWork

principles of a patchworked database: how OO-style inheritance works for data

references

It is quite common for records in a database to refer in some way to other records. In fact, it seems to be such an interesting property, that the main classification of databases is along the lines of how they handle references:

in a hierarchical DB, every record has 1 or 0 parents, and any number of childs
in a networked DB, every record has a variable number of references (typically fix for a given type of record)
in a relational DB, every record has any number of references, which are determined by queries. References are constructed on the fly; if they are backed by some index and the query optimizer is able to figure that out, this works even for larger DBs.

While basically every model can be emulated on top of every other, it is a matter of performance and convenience (i.e. performance counted by hours of programming) how well a given model of data access supports a given pattern of data usage (that's why you should stop thinking that RDBMS solve all problems).

The magic working behind references is always the same: either a (more or less) physical key of the referred record or some logical key is stored in the referring record, the latter requiring translation by an index (typically B-Tree). In general, the key might as well be a RDF URI, resolved by a http or other server.

The "physical" key is known as "master file number" (MFN) in ISIS databases, as "ROWID" or similar in most RDBMS. Using a ROWID is a gross violation of the principles of relational databases (Codd's axioms), and most RDBMS cannot retain a ROWID reference via export/import (i.e. DBs using such references can neither be reliably exported nor backed up, short to a DB shutdown and file backup, as stated by a major vendor).

Hierarchical and networked DBs, on the other hand, have much stronger support for the fast physical key, but typically still allow for logical keys to reach via an index on any record. They just don't sport an elaborate and "standardized" query language like SQL giving convenient access to what does and doesn't work. It's up to the application programmer, rather than the query optimizer designer, to follow the paths of well supported references.

ISIS databases are clearly in the latter (older) branch of the family of databases, believing more in the application programmer's knowledge than in artificial intelligence. Besides the conceptually simple and ultra-fast MFN reference, it also has a very flexible indexing mechanism, allowing for references to even be based on fulltext.

We want to present a view on data, that has some interesting properties and (while it can be emulated on any DBMS) is particularily well supported on ISIS DBs:

the patch relation

We propose a relation between records that transcedes a mere reference (independent of whether the underlying reference is based on a physical or logical key):

there is a patch operator
that constructs a new record from referring and referred-to record
there is a diff operator
that constructs a new record from referring and referred-to record, in such a way that the patch operator, applied to the latter and the result, will construct the referring record

Basically, a very small record can tell a story like

I am related to that record
I am very similar to that record, but
I have these and those fields added/changed/deleted

Everybody who is accustomed to tools like RCS or CVS (based on RCS) will get the idea of such a relation: depending on what is most commonly needed, either one or the other version can be stored much more compactly as just the diff to the other one. Or, if both are referred to oftenly, you can store them both at full life size, but put only the differences in the index.

If the idea of applying diff and patch to database records seems strange, have a look at a plain text representation for ISIS records. This paper does include a means to store patch "scripts" even at database master file level, thereby providing very efficient support for cross-referring patch records.

However, the patch relation can as well very efficiently be implemented on top of a standard ISIS database. Some meta data entries (per field in the FDT) might be used to set convenient defaults for interpretation of references and patches:

to which database an entry in field x referres to
index tag and/or prefix to be used on lookup
whether field occurences in the patch record are additive or overriding

In the absence of more specific instructions, the default patch operation is used by reading the referring record as a series of set statements against it's base: every field (tag) which is present in the referring record overrides all fields of the same tag in the base. Should there be multiple bases defined, they are by default added (i.e. concatenated). A multiplikative operation is modeled by chaining inheritance.

In a way, every reference can be regarded as defining a patch relation: If you resolve the reference in a table join, you are actually creating a new virtual record, comprised of some fields of the referred and referring records. However, in the much more flexible ISIS data model, the creation of a new record is much more flexibel.

inheritance by patching

One way to use patching is to think of a patch relation as being an inheritance relation: not by class, but by object! A record can inherit data fields from one or several other records, which still exists as independent entities.

When the inheritance is resolved, some data from the parents is copied to the new virtual record, some is dropped or overwritten. An inheritance relation does not only apply to logically subordinate child records, it can also be used for versioning, with a successor inheriting from it's predecessor, while both are still available as independent entities.

example: the serial killer

Management of serials is known as a notoriuosly hard problem. So let's have a look at how we could make a data model using record inheritance.

We start out simple, with

one record for the serial,
one for each issue, inheriting from the serial
one for each copy, inheriting from the issue

The issue adds a number, date, table of contents and so on to the serial. The copy adds some id to the issue, and maybe a note that it's in a poor state, since somebody poured coffee over it.

But then things can get a little bit more complicated:

if the serial changes it's name or other attributes, a successor may inherit from the original one, and following issues inherit from the successor (while earlier issues still belong to the original)
if two serials are joined, a successor may inherit from both
sometimes two otherwise independent serials are printed together, so the actual issues inherit from both
an issue might have a reprint,
which inherits from the original issue, changing the date and adding a new preface
if somebody ripped an article out of a given copy, that copy might adjust it's issues table of contents
if copies of several issues are bound together, you actually have only one copy, inheriting from several issues

To summarize, a lot of complicated situations can be modelled in a quite natural fashion when thinking in terms of inheritance through patch relations, and these, in turn can be easily and efficiently be implemented based on ISIS record and database structures.
If you think of modelling that in a relational database, you will probably find that you end up with mimicking the ISIS record.

$Id: PatchWork.txt,v 1.2 2003/06/23 14:43:24 kripke Exp $