GNI - whatabout

what makes ISIS ISIS ?

Andrew Giles-Peters raised the important question "What is it about ISIS that makes it ISIS?"

So here are some thoughts on this topic from the OpenIsis team:

As a database used for bibliographic data (among others), ISIS must be able to store and retrieve records as exchanged via ISO2709 efficiently and with no or minimal loss of information.
Besides the ability to retrieve records by number, ISIS must support an indexing mechanism which is essentially "function based": index entries are not the immediate field values, but rather the values of a "view" derived from them by some computation.
ISIS must efficiently support the query elements commonly used on bibliographic databases, like looking up a value without regard for the field or in several fields at once, and specifying a distance within which search terms should occur.
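
To make the second and third requirement a bit more tangible, here is a minimal sketch in Python. It is not OpenIsis code: the record layout, the field tags 100 and 245, and the names words_view, build_index, search and near are all assumptions made up for illustration. The index holds the terms computed by a "view" function rather than the field values themselves, and each posting remembers the field and word position, so that any-field lookups, field-restricted lookups and distance queries can be answered.

```python
from collections import defaultdict

# Hypothetical records: MFN -> list of (field tag, value) pairs.
# The tags 100 (author) and 245 (title) are illustrative assumptions.
records = {
    1: [(100, "Eco, Umberto"), (245, "The Name of the Rose")],
    2: [(100, "Rose, Gillian"), (245, "Love's Work")],
}

def words_view(value):
    """Derive index terms from a field value: one upper-cased term per word."""
    cleaned = "".join(c if c.isalnum() else " " for c in value)
    return [w.upper() for w in cleaned.split()]

def build_index(records, view):
    """Map each derived term to postings of (mfn, tag, word position)."""
    index = defaultdict(list)
    for mfn, fields in records.items():
        for tag, value in fields:
            for pos, term in enumerate(view(value)):
                index[term].append((mfn, tag, pos))
    return index

def search(index, term, tags=None):
    """Return MFNs containing term in any field, or only in the given tags."""
    return sorted({mfn for mfn, tag, _ in index.get(term, [])
                   if tags is None or tag in tags})

def near(index, a, b, distance):
    """Return MFNs where a and b occur in the same field tag,
    at most `distance` word positions apart."""
    hits = set()
    for mfn_a, tag_a, pos_a in index.get(a, []):
        for mfn_b, tag_b, pos_b in index.get(b, []):
            if (mfn_a, tag_a) == (mfn_b, tag_b) and abs(pos_a - pos_b) <= distance:
                hits.add(mfn_a)
    return sorted(hits)

index = build_index(records, words_view)
```

With this toy data, search(index, "ROSE") finds both records (one by author, one by title), while search(index, "ROSE", tags={245}) finds only the title hit. The sketch ignores repeated occurrences of the same field; a real index would have to record the occurrence as well.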

Since these are minimal requirements, they would not stop anybody from adding tons of features on top. For example, it's relatively easy to store ISO2709 data in a relational database like Sybase (used by OCLC/Pica), each record covering several rows (mfn, field number, field occ, value), then compute a second similar table for the index and so on.
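
The relational layout mentioned above might look roughly like the following sketch. It uses Python's sqlite3 module purely for convenience (not Sybase), and the table names master and idx, the column names and the sample data are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per field occurrence: (mfn, field number, field occ, value).
conn.execute("""
    CREATE TABLE master (
        mfn   INTEGER NOT NULL,
        tag   INTEGER NOT NULL,
        occ   INTEGER NOT NULL,
        value TEXT    NOT NULL,
        PRIMARY KEY (mfn, tag, occ)
    )
""")
conn.executemany(
    "INSERT INTO master VALUES (?, ?, ?, ?)",
    [
        (1, 100, 1, "Eco, Umberto"),
        (1, 245, 1, "The Name of the Rose"),
        (1, 650, 1, "Monasteries"),
        (1, 650, 2, "Detective fiction"),
    ],
)

# Second, similar table for the index, computed from the first one.
conn.execute(
    "CREATE TABLE idx (term TEXT NOT NULL, mfn INTEGER NOT NULL, tag INTEGER NOT NULL)"
)
rows = conn.execute("SELECT mfn, tag, value FROM master").fetchall()
conn.executemany(
    "INSERT INTO idx VALUES (?, ?, ?)",
    [(word.upper(), mfn, tag) for mfn, tag, value in rows for word in value.split()],
)

def fetch_record(mfn):
    """Reassemble one record from its rows, in field and occurrence order."""
    cur = conn.execute(
        "SELECT tag, occ, value FROM master WHERE mfn = ? ORDER BY tag, occ", (mfn,)
    )
    return cur.fetchall()

def lookup(term):
    """Return the MFNs whose derived index terms include `term`."""
    cur = conn.execute(
        "SELECT DISTINCT mfn FROM idx WHERE term = ? ORDER BY mfn", (term.upper(),)
    )
    return [row[0] for row in cur]
```

Note that even this tiny record already takes four rows in the master table plus a number of index rows, which hints at where the costs discussed below come from.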

However, there is the word "efficiently", which practically turns out to put some restrictions on the feature-load, especially when combined with:

ISIS must be widely usable even in the face of *very* low budgets. Therefore, not only must the software itself be available for at most a nominal fee, but it also must not require very new, very powerful or otherwise expensive hardware and systems. Even very large catalogs should get by with moderate system costs.

The OCLC/Pica system for example requires one to spend hundreds of thousands of dollars for powerful Sun machines.

end of story ?

Still, it would be very nice if more areas of application could be explored for ISIS, both for the librarians in order to be able to use their favourite DB (i.e. ISIS) for a broader range of tasks and also to expand the user community, possibly leading to more support for everybody.
One important question is whether ISIS needs some fundamental changes deep in its guts, or whether it already has everything that's needed to build a broad range of sophisticated solutions on top of it. As you might expect, we are pretty well convinced of the latter.

file formats

Just as it doesn't harm a database much to be exported to and imported from ISO2709, there is not much of a problem with different file formats, as long as conversion tools exist. As you know, CISIS/Unix DBs are incompatible with WinIsis/DOS DBs, but may be converted via ISO files. As long as the basic data structures are the same, lossless conversion is just a matter of tools. It's even less of a problem if the software itself can read several file formats (like openisis does). You wouldn't care much whether your word processor is reading a .doc or .rtf file, would you? We did an interesting and very successful study implementing an ISIS-like DB in pure Java using a plaintext masterfile very similar to the Mbox mailfolder format (hope to be able to release the code soon). Likewise there is no reason why one should not be able to read directly from an ISO2709 file. Besides convertible masterfile formats, one might well use other formats for xref and index, which can always be reconstructed as needed. There are several reasons to do so, like improved performance or robustness. So I don't think ISIS is defined in terms of detailed file formats, but rather in terms of the basic data structures.
One problem that might come to mind when talking about file formats is the limits. While the maximum number of records per DB as well as the maximum total file sizes are bypassed relatively easily by logically joining several databases, the maximum record size of about 32K is a limit which might be unacceptable for some applications (although it can partly be resolved by deploying external files, like OCLC/Pica does to circumvent Sybase's varchar limits). Raising this limit would clearly restrict lossless conversion to one way, from small to large DB. Where a large DB model is needed, all parties developing ISIS software should agree on one format to allow for as-painless-as-possible interoperability.

so what kind of database is ISIS ?

Classical database theory basically distinguishes ISAM, network, hierarchical and relational database systems. ISIS is strongly related to ISAM DBs; however, its flexible indexing is rarely paralleled by any of these systems and its non-flat data model is targeted by hierarchical DBs only (in greater generality and at much higher cost).

Although direct joins by MFN shouldn't be too costly, ISIS is not the database of choice when several records typically need to be combined in queries or transactions. However, in many application cases, only one ISIS record is needed as opposed to several relational table rows. In such situations, ISIS is even an excellent and efficient transaction (OLTP) database (since safe writing of an ISIS record is much simpler than other DBs' undo/redo logs).
ISIS is not the database of choice when records are updated by the hour. However, where only about 10% of records are changed between two (monthly, weekly or daily) runs of backup and compactification, the space overhead is not a big problem. Where old versions of data need to be retained anyway (as often needed and supported, for example, by postgres history), you would hardly find a more efficient solution.
ISIS is not the database of choice when it comes to high volume online analytical processing (querying statistics on several dimensions, OLAP). However, after reading some database books and Oracle manuals, one learns that OLAP requires a well designed ("star schema") database separate from the transactional one, anyway.
ISIS does not, in itself, provide any concurrency control (actual implementations do, to some extent). This doesn't hurt when running a read-only multi-user catalogue, a stand-alone application, or in some insert-only situations. For distributed multi-client update, there are mechanisms based on timestamps or stored procedures that need to be supported by some ISIS server to come.

While these data models are strongly tied to the logical nature and physical organisation of the data, newer notions like that of an 'object oriented' or 'XML' database rather describe a way to use and access a database. Actually OO or XML DBs are usually based on one of the above mentioned systems (mostly relational ones). For the most part, using a DB as OO or XML storage requires nothing but some libraries and optionally precompilers for C++ or Java -- these can be built on top of existing ISIS without changing it, and ISIS will be an excellent choice for many applications. Some aspects of increased functionality and performance will require some sort of "stored procedures" running inside the database. In the case of an XML DB they are used, for example, to decompose structures; in the OO case they might need some sort of "magic switch" (method overriding) to perform differently for some records than for others. We believe that all this magic can be achieved based on ISIS. The concepts of an ISIS database server and a scripting language as an alternative to formatting exits are to be discussed elsewhere ...
First we want to shed some more light on the great flexibility the ISIS database system has by its very nature.

ISIS is a mail database

Looking at http://www.faqs.org/rfcs/rfc822.html (or its updates) one will find many similarities between ISO2709 records and internet mails, which are, after all, essentially a series of header names and values. After assigning numbers to the 100 or 200 most commonly used headers and choosing some sort of subfield encoding (e.g. "^nname^vvalue", "namevalue" or simply "name: value") to store other header lines with a special field number, mails are easily and very efficiently stored in an ISIS database. Given the enormous number of communication, groupware and workflow systems that are nowadays built upon standard plain internet mails (typically using a set of special mail headers), this is a very large area to be served by ISIS databases. The above mentioned Mbox-style implementation of ISIS tends in that direction, building upon the javax.mail standard. IMAP mail servers could greatly benefit from the powerful indexing and retrieval system of ISIS databases. If the mail sending application also allows selecting special headers from an entry form prepared by a skilled librarian with thesauri and systematics, an institution or company could really come to a new way of using mail as a system of qualified, living information.
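
The header mapping described above could be sketched as follows, using Python's standard email package. The tag numbers in HEADER_TAGS, the catch-all tag 99 and the function name mail_to_fields are assumptions invented for this example; only the "^nname^vvalue" encoding for unnumbered headers is taken from the text.

```python
from email import message_from_string

# Hypothetical tag assignments for a few common headers.
HEADER_TAGS = {"from": 1, "to": 2, "subject": 3, "date": 4}
# Hypothetical special field number for all other header lines.
OTHER_HEADER_TAG = 99

def mail_to_fields(raw):
    """Turn the headers of a raw mail into a list of (tag, value) fields."""
    msg = message_from_string(raw)
    fields = []
    for name, value in msg.items():
        tag = HEADER_TAGS.get(name.lower())
        if tag is not None:
            fields.append((tag, value))
        else:
            # Unnumbered header: keep name and value as subfields ^n and ^v.
            fields.append((OTHER_HEADER_TAG, f"^n{name}^v{value}"))
    return fields

raw_mail = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: ISIS and mail\n"
    "X-Keywords: databases\n"
    "\n"
    "Hello Bob\n"
)
fields = mail_to_fields(raw_mail)
```

Repeated headers (several Received lines, say) simply become repeated occurrences of the same field, which is exactly what ISIS records are good at.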

ISIS is a multimedia database

After all, a mail not only has headers, but also a body. A plaintext body of reasonable length (some KB, like sent by nice people) fits without problem in a field whose number means "body". A multipart body is easily decomposed into a series of body fields. Whether larger or non-plaintext bodies are stored within or outside the masterfile is a matter of the actual implementation and doesn't need to be discussed here; both approaches have their pros and cons. Anyway, the MIME standard, up and running since 1992, allows for storage and transmission of anything that uses bytes, and is easily integrated with ISIS databases (we partly did it, code to be released).
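
As a rough illustration of decomposing a multipart body into a series of body fields, here is a sketch built on Python's standard email package. The field number 500 and the subfield codes ^t (content type), ^b (text body) and ^s (size of a part kept outside the record) are assumptions of this example, as is the decision to keep only text parts inline.

```python
from email.message import EmailMessage

BODY_TAG = 500  # hypothetical field number meaning "body"

def body_fields(msg):
    """Return one (tag, value) body field per leaf part of the message."""
    fields = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        ctype = part.get_content_type()
        if part.get_content_maintype() == "text":
            # Plaintext parts are stored inline.
            fields.append((BODY_TAG, f"^t{ctype}^b{part.get_content()}"))
        else:
            # Other parts: record type and size only; the bytes would
            # live in an external file in this sketch.
            payload = part.get_payload(decode=True)
            fields.append((BODY_TAG, f"^t{ctype}^s{len(payload)}"))
    return fields

msg = EmailMessage()
msg["Subject"] = "report"
msg.set_content("See the attached file.")
msg.add_attachment(
    b"\x00\x01\x02",
    maintype="application",
    subtype="octet-stream",
    filename="data.bin",
)
parts = body_fields(msg)
```

A real implementation would of course have to escape the subfield delimiter inside the text and respect the maximum record length.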

ISIS is an XML database

Likewise XML, which basically is text, can be stored in an ISIS database (subject to the implementation's maximum record length). Add some formatting exits to address the XML node content via a DOM-style a.b.c notation as used in JavaScript, use them in your FST and you will for sure have one of the world's best indexed and fastest XML databases -- most others are using a relational DB as basis. So indexing, retrieving and displaying XML data is more or less simply a matter of some formatting functions.
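
What such a formatting function for the a.b.c notation might do is sketched below in Python. This is not an actual ISIS formatting exit; the function name xml_values and the sample document are assumptions for illustration. The function returns the text of every node matching the dotted path, which could then be handed to the index just like any other field value.

```python
import xml.etree.ElementTree as ET

def xml_values(xml_text, path):
    """Return the text of all nodes addressed by a dotted path
    such as 'book.author.name'; the first step names the root."""
    root = ET.fromstring(xml_text)
    steps = path.split(".")
    if steps[0] != root.tag:
        return []
    nodes = [root]
    for step in steps[1:]:
        # Descend one level, keeping every child with a matching tag.
        nodes = [child for node in nodes for child in node if child.tag == step]
    return [(n.text or "").strip() for n in nodes]

doc = (
    "<book><title>Love's Work</title>"
    "<author><name>Rose</name></author>"
    "<author><name>Eco</name></author></book>"
)
authors = xml_values(doc, "book.author.name")
```

Repeated nodes yield several values, which map naturally onto several index entries for the same record.
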
However, when thinking about data entry forms, for example, the dark side of the force shows up: Even with a very sophisticated database system with the ability to make sense out of XML DTDs, it is potentially much more complicated anyway. XML was meant to provide arbitrary complexity in the first place. And when it comes to DTDs like that of XHTML, which will carry just about the same content as any HTML page, one easily understands that reasonable automatic processing becomes nearly impossible -- that's the reason why HTML pages are largely beefed up with headers (Dublin Core and others). If you really desperately need it, it's good to have it, but otherwise using it might be looking for trouble.
When having to work with XML structures for one reason or another, typically because they should be imported or exported, one should think of a mapping between XML and ISIS structures. In many situations XML structures are shallow and can be ISIfied by simply mapping the first level of sibling nodes to ISIS fields and the second level to subfields (may require repeated subfield support). In other situations a closer look at the data structure may reveal that it is not well designed with regard to Ockham's razor but contains totally unnecessary depth which may be collapsed to the first case. Actually, during several years of work with XML structures as suggested by several "standards", I rarely found a reasonable structure which cannot be mapped to a field-subfield schema.
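
For the shallow case, the mapping could be sketched like this in Python. The function name xml_to_fields, the tag numbers and the subfield codes are hypothetical; a real mapping would be defined per schema.

```python
import xml.etree.ElementTree as ET

def xml_to_fields(xml_text, tag_numbers, subfield_codes):
    """Map first-level nodes to fields and second-level nodes to subfields."""
    root = ET.fromstring(xml_text)
    fields = []
    for child in root:
        tag = tag_numbers[child.tag]
        if len(child) == 0:
            # Leaf node: its text becomes the plain field value.
            fields.append((tag, (child.text or "").strip()))
        else:
            # Node with children: each child becomes a ^x subfield.
            parts = []
            for sub in child:
                text = (sub.text or "").strip()
                parts.append("^" + subfield_codes[sub.tag] + text)
            fields.append((tag, "".join(parts)))
    return fields

record_xml = (
    "<record><title>Love's Work</title>"
    "<author><last>Rose</last><first>Gillian</first></author></record>"
)
mapped = xml_to_fields(
    record_xml,
    tag_numbers={"title": 245, "author": 100},   # hypothetical field numbers
    subfield_codes={"last": "a", "first": "b"},  # hypothetical subfield codes
)
```

Anything deeper than two levels is silently ignored here; that is the point where one would either collapse the structure or store the XML "as is".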

But even if you really need XML structures "as is", they can be stored very efficiently in ISIS, with all the benefits of the flexible index (cf. the universal ISIS record). Anyway, Dublin Core metadata or other RDF (resource description framework) headers are conveniently stored in ISIS just like mail headers. Maybe, as this schema was created to suit the needs of the very old science of bibliographic knowledge management, much of that experience was built into it.
On the other hand, XML's ancestor SGML was conceived for a document's body, not the head, and I guess it still has its place there in spite of the programming industry's hype. The use of XML for structuring documents that are meant to be read by humans rather than machines is of course perfectly reasonable. Transparent access to file based data associated with a record and an XML add-on to the formatting language could aid in converting extracts of document contents to metadata accessible in the ISIS database and/or its index.

To wrap it up, I'd suggest looking at XML as an optional add-on to ISIS rather than an integral part. ISIS already has all the functionality needed to support any reasonable use of XML. ISIS data can much more efficiently contain XML structures than the other way round.

ISIS is a database for document/content management systems

It follows that ISIS may very well support the needs of systems for XML documents or website content in XML or HTML. With increasing experience with such systems, people tend to understand that content metadata should be organized according to bibliographic principles. (Not that surprising, is it?) In cooperation with oc4science.org there are projects at German universities to integrate publishing, document management and website CMS, based on an (Open)ISIS DB and directed by the librarian.

$Id: whatabout.txt,v 1.8 2003/02/14 17:30:33 kripke Exp $