|
RdbConv
|
Representing ISIS data in a relational database. See
RdbConv2
for representing tables from a relational database in ISIS.
If you consider setting up a table with columns "author", "title"
and so on, don't do it. The following structure gives you at least
some of the benefits of ISIS, because it's shaped more closely
to the ISIS data model.
| table structure |
It is not very difficult to think of a table structure
for lossless representation of the contents of an ISIS masterfile:
CREATE TABLE Mst (
mfn number, -- master file number
fld number, -- counting fields in this masterfile record
tag number, -- the field tag
val varchar(2048) -- field's content
)
Notes:
- a row in this table does NOT hold an ISIS record,
but one field of an ISIS record
- each ISIS record is represented by all table rows
with the same mfn
- you may omit the column "fld" if you don't care for the ordering
- you may choose other datatypes for the field tag,
for example number(3) or char(3),
depending on the compatibility required (e.g. MARC)
- you may choose other dimensioning for the val column.
ISIS can hold up to 32KB (signed short) in one field,
Oracle has a maximum of 4KB on the varchar type.
| accessing the data |
Retrieving row 123 is easily done with
SELECT tag, val WHERE mfn = 123 ORDER BY fld
It is suggested that you fill the val column with native
ISIS data as of "MPL" proof mode. Then you have to extract
subfields and perform other substitutions for output.
This can be achieved for example using the v-method of the
Field
Java class (it's a static method of about 100 lines,
which you may grab and port to your system).
An alternative would be to pre-split the subfields
into a structure like
CREATE TABLE Mst (
mfn number,
fld number,
tag number,
sub char(1), -- subfield code
val varchar(2048) -- subfield's content
)
A typical keywords field like
<deltas><humid zones><tropical zones>
should be split into three subfields using any subfield code, say '>'.
The advantage of the pre-split approach is that it makes both
display and indexing easier.
The disadvantage is, that you'll loose some information,
albeit not too much.
| searching |
Searching is also relatively easy, as long as you ask simple questions ;).
To get a list of the mfns with a field 24 containing Twain, use
SELECT DISTINCT(mfn) FROM Mst WHERE val = 'Twain' AND tag = 24
For this to work efficiently, you have to put an index on val.
You may than search for 'Twain' with or without the restriction
on a certain tag, just like in ISIS.
You can also use right truncation and restrict on a couple of tags:
SELECT DISTINCT(mfn) FROM Mst WHERE val LIKE 'Twain%' AND tag IN (201,401)
Right truncation should still use the index on val,
thus be reasonable efficient (depending on how smart your RDB is).
With the pre-split structure as above, this applies to any subfield.
If you want to find Twain anywhere within the field's values, you need
SELECT DISTINCT(mfn) FROM Mst WHERE val LIKE '%Twain%'
This will perform a pretty costly full table scan,
so you may go fetch yourself a coffee meanwhile.
While you may use SQLs OR predicate to combine several terms as
alternatives, the ISIS AND does NOT map to SQLs AND,
but to an INTERSECTion on the mfn lists.
It's actually pretty much the same as ISIS does internally (but more costly).
| building an index |
Still, this is lightyears away from the possibilities of an ISIS db.
The index on val does not do word splits, does not use stopwords,
is not case insensitive, does not apply character or word conversion
and so on.
To come anywhere near the features of ISIS, you have to do just
what ISIS does: set up a separate "Inverted File".
CREATE TABLE Index (
val varchar(30), -- index entry
mfn number, -- record containing entry
tag number, -- tag of field containing entry
occ number, -- field is occ'th occurence of field tag in record
pos number -- entry is pos'th word in field
)
To fill this table, you could of course go and reimplement
all the features of an ISIS .FST in, say, Oracle's PL/SQL
and recreate one record's entries from within an INSERT/UPDATE trigger.
Much easier, given you had a true ISIS database in the first place,
is to load the .LK1 and .LK2 text files created during full index update.
Some ISIS tools also allow you to create these files separately.
Searching is performed against the Index table
(which of course needs an RDB index at least on val),
not the Mst table.
The latter is only used to retrieve the full content for a given mfn,
or for full table scans.
| so why then use ISIS in the first place? |
There is a long list of reasons, here go a few of them:
- You don't want to reimplement the .FST, the full formatting language,
all the nifty details of the various indexing techniques and so on.
Actually, if you DO want, you're very welcome in the OpenIsis team ;)
- Not only the index building, but also data entry is done much more
comfortable using ISIS tools.
- Any queries but the most simple examples given above will not only
be somewhat complex to write in SQL, but will also execute much slower
than in a native ISIS implementation (consider publishing with OpenIsis).
- Especially the NEAR operators can't easily be expressed in SQL.
Although the index structure given above contains the necessary data
in occ and pos, advanced SQL features like nested queries are needed
in the appropriate SQL query. You need to have an advanced (read:
expensive) RDBMS and a deep and thorough understanding of it's
query optimizer and hinting system for this to run in any
reasonable time.
- So why then use a RDBMS in the first place?
See also
a comparision of ISIS and other database systems
and
notes on the flexibility of the ISIS data model.
$Id: RdbConv.txt,v 1.5 2004/06/10 15:01:10 kripke Exp $
|
|