|
PatchWork
|
principles of a patchworked database:
how OO-style inheritance works for data
| references |
It is quite common for records in a database to refer in some way to other
records. In fact, it seems to be such an interesting property,
that the main classification of databases is along the lines of how
they handle references:
- in a hierarchical DB, every record has 1 or 0 parents,
and any number of childs
- in a networked DB, every record has a variable number of references
(typically fix for a given type of record)
- in a relational DB, every record has any number of references,
which are determined by queries. References are constructed on the fly;
if they are backed by some index and the query optimizer is able to
figure that out, this works even for larger DBs.
While basically every model can be emulated on top of every other,
it is a matter of performance and convenience (i.e. performance counted
by hours of programming) how well a given model of data access supports
a given pattern of data usage (that's why you should stop thinking that
RDBMS solve all problems).
The magic working behind references is always the same:
either a (more or less) physical key of the referred record
or some logical key is stored in the referring record,
the latter requiring translation by an index (typically B-Tree).
In general, the key might as well be a RDF URI,
resolved by a http or other server.
The "physical" key is known as "master file number" (MFN) in ISIS databases,
as "ROWID" or similar in most RDBMS. Using a ROWID is a gross violation
of the principles of relational databases (Codd's axioms),
and most RDBMS cannot retain a ROWID reference via export/import
(i.e. DBs using such references can neither be reliably exported nor backed up,
short to a DB shutdown and file backup, as stated by a major vendor).
Hierarchical and networked DBs, on the other hand, have much stronger support
for the fast physical key, but typically still allow for logical keys to
reach via an index on any record. They just don't sport an elaborate
and "standardized" query language like SQL giving convenient access
to what does and doesn't work. It's up to the application programmer,
rather than the query optimizer designer, to follow the paths of well
supported references.
ISIS databases are clearly in the latter (older) branch of the family of
databases, believing more in the application programmer's knowledge
than in artificial intelligence. Besides the conceptually simple and ultra-fast
MFN reference, it also has a very flexible indexing mechanism,
allowing for references to even be based on fulltext.
We want to present a view on data, that has some interesting properties
and (while it can be emulated on any DBMS) is particularily well supported
on ISIS DBs:
| the patch relation |
We propose a relation between records that transcedes a mere reference
(independent of whether the underlying reference is based on a physical
or logical key):
- there is a patch operator
that constructs a new record from referring and referred-to record
- there is a diff operator
that constructs a new record from referring and referred-to record,
in such a way that the patch operator, applied to the latter and the result,
will construct the referring record
Basically, a very small record can tell a story like
- I am related to that record
- I am very similar to that record, but
- I have these and those fields added/changed/deleted
Everybody who is accustomed to tools like RCS or CVS (based on RCS)
will get the idea of such a relation:
depending on what is most commonly needed, either one or the other version
can be stored much more compactly as just the diff to the other one.
Or, if both are referred to oftenly, you can store them both at full
life size, but put only the differences in the index.
If the idea of applying diff and patch to database records seems strange,
have a look at
a plain text representation for ISIS records.
This paper does include a means to store patch "scripts" even at database
master file level, thereby providing very efficient support for
cross-referring patch records.
However, the patch relation can as well very efficiently
be implemented on top of a standard ISIS database.
Some meta data entries (per field in the FDT) might be used to set
convenient defaults for interpretation of references and patches:
- to which database an entry in field x referres to
- index tag and/or prefix to be used on lookup
- whether field occurences in the patch record are additive or overriding
In the absence of more specific instructions,
the default patch operation is used by reading the referring record
as a series of set statements against it's base:
every field (tag) which is present in the referring record
overrides all fields of the same tag in the base.
Should there be multiple bases defined, they are by default added
(i.e. concatenated). A multiplikative operation is modeled
by chaining inheritance.
In a way, every reference can be regarded as defining a patch relation:
If you resolve the reference in a table join, you are actually creating
a new virtual record, comprised of some fields of the referred and referring
records. However, in the much more flexible ISIS data model,
the creation of a new record is much more flexibel.
| inheritance by patching |
One way to use patching is to think of a patch relation as being an
inheritance relation: not by class, but by object!
A record can inherit data fields from one or several other records,
which still exists as independent entities.
When the inheritance is resolved, some data from the parents is copied
to the new virtual record, some is dropped or overwritten.
An inheritance relation does not only apply to logically subordinate child
records, it can also be used for versioning, with a successor inheriting
from it's predecessor, while both are still available as independent entities.
| example: the serial killer |
Management of serials is known as a notoriuosly hard problem.
So let's have a look at how we could make a data model using
record inheritance.
We start out simple, with
- one record for the serial,
- one for each issue, inheriting from the serial
- one for each copy, inheriting from the issue
The issue adds a number, date, table of contents and so on to the serial.
The copy adds some id to the issue, and maybe a note that it's in a poor
state, since somebody poured coffee over it.
But then things can get a little bit more complicated:
- if the serial changes it's name or other attributes,
a successor may inherit from the original one,
and following issues inherit from the successor
(while earlier issues still belong to the original)
- if two serials are joined, a successor may inherit from both
- sometimes two otherwise independent serials are printed
together, so the actual issues inherit from both
- an issue might have a reprint,
which inherits from the original issue,
changing the date and adding a new preface
- if somebody ripped an article out of a given copy,
that copy might adjust it's issues table of contents
- if copies of several issues are bound together,
you actually have only one copy, inheriting from several issues
To summarize, a lot of complicated situations can be modelled in a quite
natural fashion when thinking in terms of inheritance through patch relations,
and these, in turn can be easily and efficiently be implemented based
on ISIS record and database structures.
If you think of modelling that in a relational database,
you will probably find that you end up with
mimicking
the ISIS record.
$Id: PatchWork.txt,v 1.2 2003/06/23 14:43:24 kripke Exp $
|
|