GNI - RdbConv

RdbConv

Representing ISIS data in a relational database. See RdbConv2 for representing tables from a relational database in ISIS.

If you are considering a table with columns "author", "title" and so on, don't do it. The following structure gives you at least some of the benefits of ISIS, because it follows the ISIS data model more closely.

table structure

It is not very difficult to think of a table structure for lossless representation of the contents of an ISIS masterfile:

CREATE TABLE Mst (
	mfn number,	-- master file number
	fld number,	-- counting fields in this masterfile record
	tag number,	-- the field tag
	val varchar(2048) -- field's content
)

Notes:

a row in this table does NOT hold an ISIS record, but one field of an ISIS record
each ISIS record is represented by all table rows with the same mfn
you may omit the column "fld" if you don't care about the order of the fields within a record
you may choose other datatypes for the field tag, for example number(3) or char(3), depending on the compatibility required (e.g. MARC)
you may choose other dimensioning for the val column. ISIS can hold up to 32KB (signed short) in one field, Oracle has a maximum of 4KB on the varchar type.
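As a rough illustration of the notes above, the following sketch stores one record as one row per field. It uses Python with SQLite purely for convenience; the sample record, the SQLite column types and the helper name store_record are assumptions made for this example, not part of any ISIS tool.

```python
import sqlite3

# Hypothetical sample record: (tag, value) pairs in their original field order.
record = [(24, "Twain"), (70, "Clemens, Samuel"), (69, "<humor><rivers>")]

conn = sqlite3.connect(":memory:")
# SQLite types stand in for the "number"/"varchar" types used above.
conn.execute("CREATE TABLE Mst (mfn INTEGER, fld INTEGER, tag INTEGER, val TEXT)")

def store_record(conn, mfn, fields):
    """Insert one row per field; fld records the field's position in the record."""
    conn.executemany(
        "INSERT INTO Mst (mfn, fld, tag, val) VALUES (?, ?, ?, ?)",
        [(mfn, fld, tag, val) for fld, (tag, val) in enumerate(fields, start=1)],
    )

store_record(conn, 123, record)

# All rows sharing mfn = 123 together make up the record.
rows = conn.execute(
    "SELECT fld, tag, val FROM Mst WHERE mfn = 123 ORDER BY fld"
).fetchall()
print(rows)
```

Dropping the fld column would still keep every field, but the original order could no longer be reconstructed.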

accessing the data

Retrieving record 123 is easily done with

SELECT tag, val FROM Mst WHERE mfn = 123 ORDER BY fld

It is suggested that you fill the val column with native ISIS data as of "MPL" proof mode. Then you have to extract subfields and perform other substitutions for output. This can be achieved for example using the v-method of the Field Java class (it's a static method of about 100 lines, which you may grab and port to your system).
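To give an idea of what extracting subfields involves, here is a minimal sketch. It assumes the usual ISIS convention that a subfield starts with '^' followed by a one-character code; it is not a port of the Field class mentioned above, and the function names are made up for this example.

```python
def split_subfields(val):
    """Split a raw field value into (code, text) pairs.

    Text before the first '^' is returned with the empty code ''.
    """
    parts = val.split("^")
    result = []
    if parts[0]:
        result.append(("", parts[0]))
    for part in parts[1:]:
        if part:  # skip empty pieces such as a trailing '^'
            result.append((part[0], part[1:]))
    return result

def subfield(val, code):
    """Return the text of the first subfield with the given code, or None."""
    for c, text in split_subfields(val):
        if c == code:
            return text
    return None

raw = "^aTwain, Mark^bHannibal"  # hypothetical field content
print(split_subfields(raw))
print(subfield(raw, "b"))
```

A real formatter has to handle more than this (repeated codes, the other substitutions mentioned above), so treat it as a starting point only.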

An alternative would be to pre-split the subfields into a structure like

CREATE TABLE Mst (
	mfn number,
	fld number,
	tag number,
	sub char(1),	-- subfield code
	val varchar(2048)	-- subfield's content
)

A typical keywords field like

 <deltas><humid zones><tropical zones>

should be split into three subfields using any subfield code, say '>'.
The advantage of the pre-split approach is that it makes both display and indexing easier. The disadvantage is that you'll lose some information, albeit not too much.
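A minimal sketch of such a split, in Python: each bracketed keyword becomes one row of the pre-split table, with '>' as the subfield code. The function names are hypothetical. Note that this version silently drops any text outside the angle brackets, which is one way information can get lost.

```python
import re

def split_keywords(val):
    """Return the keywords enclosed in <...>, in order of appearance."""
    return re.findall(r"<([^<>]*)>", val)

def presplit_rows(mfn, fld, tag, val):
    """Build (mfn, fld, tag, sub, val) rows, one per keyword, with sub = '>'."""
    return [(mfn, fld, tag, ">", kw) for kw in split_keywords(val)]

rows = presplit_rows(123, 3, 69, "<deltas><humid zones><tropical zones>")
for row in rows:
    print(row)
```

All three rows share the same mfn, fld and tag, so they can still be recognised as parts of one field.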

searching

Searching is also relatively easy, as long as you ask simple questions ;). To get a list of the mfns with a field 24 whose value is exactly Twain, use

SELECT DISTINCT(mfn) FROM Mst WHERE val = 'Twain' AND tag = 24

For this to work efficiently, you have to put an index on val. You may then search for 'Twain' with or without the restriction on a certain tag, just like in ISIS. You can also use right truncation and restrict on a couple of tags:

SELECT DISTINCT(mfn) FROM Mst WHERE val LIKE 'Twain%' AND tag IN (201,401)

Right truncation should still use the index on val, and thus be reasonably efficient (depending on how smart your RDB is). With the pre-split structure as above, this applies to any subfield. If you want to find Twain anywhere within the field's values, you need

SELECT DISTINCT(mfn) FROM Mst WHERE val LIKE '%Twain%'

This will perform a pretty costly full table scan, so you may go fetch yourself a coffee meanwhile.

While you may use SQL's OR operator to combine several terms as alternatives, the ISIS AND does NOT map to SQL's AND: each row holds only one field, so a condition like val = 'a' AND val = 'b' can never match. The ISIS AND maps to an INTERSECT of the mfn lists instead. That is pretty much what ISIS does internally (but more costly).
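The difference can be tried out with a small sketch like the following, again using Python and SQLite with made-up sample data. It compares the naive AND, the INTERSECT of two mfn lists, and a plain OR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Mst (mfn INTEGER, fld INTEGER, tag INTEGER, val TEXT)")
# Hypothetical data: three records with an author (24) and a keyword (69) each.
conn.executemany("INSERT INTO Mst VALUES (?, ?, ?, ?)", [
    (1, 1, 24, "Twain"),    (1, 2, 69, "humor"),
    (2, 1, 24, "Twain"),    (2, 2, 69, "travel"),
    (3, 1, 24, "Melville"), (3, 2, 69, "humor"),
])

# Naive AND: no single row has both values, so nothing is found.
naive = conn.execute(
    "SELECT DISTINCT mfn FROM Mst WHERE val = 'Twain' AND val = 'humor'"
).fetchall()

# ISIS-style AND: intersect the mfn lists of the two terms.
both = conn.execute("""
    SELECT mfn FROM Mst WHERE val = 'Twain'
    INTERSECT
    SELECT mfn FROM Mst WHERE val = 'humor'
""").fetchall()

# OR maps directly to SQL's OR.
either = conn.execute(
    "SELECT DISTINCT mfn FROM Mst WHERE val = 'Twain' OR val = 'humor' ORDER BY mfn"
).fetchall()

print(naive, both, either)
```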

building an index

Still, this is lightyears away from the possibilities of an ISIS db. The index on val does not do word splits, does not use stopwords, is not case insensitive, does not apply character or word conversion and so on.
To come anywhere near the features of ISIS, you have to do just what ISIS does: set up a separate "Inverted File".

CREATE TABLE "Index" (	-- quoted, because INDEX is a reserved word in SQL
	val varchar(30),	-- index entry
	mfn number,	-- record containing entry
	tag number,	-- tag of field containing entry
	occ number,	-- field is occ'th occurrence of field tag in record
	pos number	-- entry is pos'th word in field
)

To fill this table, you could of course go and reimplement all the features of an ISIS .FST in, say, Oracle's PL/SQL and recreate one record's entries from within an INSERT/UPDATE trigger.
Much easier, given you had a true ISIS database in the first place, is to load the .LK1 and .LK2 text files created during full index update. Some ISIS tools also allow you to create these files separately.
Searching is performed against the Index table (which of course needs an RDB index at least on val), not the Mst table. The latter is only used to retrieve the full content for a given mfn, or for full table scans.
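For those who want to experiment without ISIS link files, here is a very reduced sketch of filling such a table by hand, in Python with SQLite. It only does lower-casing, word splitting and a tiny stopword list; the stopword list, the sample fields and the function name index_entries are assumptions for this example, and it comes nowhere near what an .FST can express. The table name is quoted because INDEX is a reserved word.

```python
import re
import sqlite3
from collections import defaultdict

STOPWORDS = {"the", "of", "a", "and"}  # hypothetical stopword list

def index_entries(mfn, fields):
    """Yield (val, mfn, tag, occ, pos) for every non-stopword word.

    occ counts the occurrences of a tag within the record,
    pos counts the words within one field (stopwords included).
    """
    occ_count = defaultdict(int)
    for tag, val in fields:
        occ_count[tag] += 1
        words = re.findall(r"\w+", val.lower())
        for pos, word in enumerate(words, start=1):
            if word not in STOPWORDS:
                # cut to 30 characters, matching varchar(30) above
                yield (word[:30], mfn, tag, occ_count[tag], pos)

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Index" '
             "(val TEXT, mfn INTEGER, tag INTEGER, occ INTEGER, pos INTEGER)")
conn.execute('CREATE INDEX idx_val ON "Index" (val)')

fields = [(24, "The Adventures of Tom Sawyer"),
          (70, "Twain, Mark"), (70, "Clemens, Samuel")]
conn.executemany('INSERT INTO "Index" VALUES (?, ?, ?, ?, ?)',
                 list(index_entries(1, fields)))

# Searches go against the index table and are now case insensitive.
hits = conn.execute(
    'SELECT mfn, tag, occ, pos FROM "Index" WHERE val = ?', ("sawyer",)
).fetchall()
print(hits)
```

Whether stopwords should count towards pos is a design choice; here they do, so positions reflect the original text.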

so why then use ISIS in the first place?

There is a long list of reasons, here go a few of them:

You don't want to reimplement the .FST, the full formatting language, all the nifty details of the various indexing techniques and so on. Actually, if you DO want, you're very welcome in the OpenIsis team ;)
Not only the index building, but also data entry is done much more comfortably using ISIS tools.
Any query beyond the simple examples given above will not only be somewhat complex to write in SQL, but will also execute much slower than in a native ISIS implementation (consider publishing with OpenIsis).
Especially the NEAR operators can't easily be expressed in SQL. Although the index structure given above contains the necessary data in occ and pos, advanced SQL features like nested queries are needed in the appropriate SQL query. You need to have an advanced (read: expensive) RDBMS and a deep and thorough understanding of its query optimizer and hinting system for this to run in any reasonable time.
So why then use an RDBMS in the first place?
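For the curious, the NEAR point can be illustrated with a simplified proximity query on the index table: two terms must occur in the same field occurrence, at most a given number of words apart. This sketch (Python with SQLite, made-up data, hypothetical function name near) uses a self-join and does not reproduce the exact semantics of the ISIS operators; on a large table it is exactly the kind of query that needs careful tuning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Index" '
             "(val TEXT, mfn INTEGER, tag INTEGER, occ INTEGER, pos INTEGER)")
# Hypothetical index entries for three records.
conn.executemany('INSERT INTO "Index" VALUES (?, ?, ?, ?, ?)', [
    ("tom", 1, 24, 1, 4), ("sawyer", 1, 24, 1, 5),   # adjacent words
    ("tom", 2, 24, 1, 1), ("sawyer", 2, 24, 1, 9),   # same field, far apart
    ("tom", 3, 24, 1, 1), ("sawyer", 3, 70, 1, 2),   # different fields
])

def near(conn, a, b, distance):
    """Return mfns where a and b occur in the same field occurrence,
    at most `distance` word positions apart."""
    sql = '''
        SELECT DISTINCT x.mfn
        FROM "Index" x JOIN "Index" y
          ON x.mfn = y.mfn AND x.tag = y.tag AND x.occ = y.occ
        WHERE x.val = ? AND y.val = ? AND ABS(x.pos - y.pos) <= ?
        ORDER BY x.mfn
    '''
    return [row[0] for row in conn.execute(sql, (a, b, distance))]

print(near(conn, "tom", "sawyer", 1))
print(near(conn, "tom", "sawyer", 8))
```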

$Id: RdbConv.txt,v 1.5 2004/06/10 15:01:10 kripke Exp $