GNI - Views

Views

NOTE: this document describes early OpenIsis proposals. We are working to reduce this to an easy-to-handle subset for implementation in Malete.

Using views in OpenIsis.
A "view", like a VIEW in SQL, creates new, typically temporary records based on existing ones by means of some transformation like selecting a subset of the available fields (a projection), retagging fields or manipulating field values.

As a general concept, a view can be implemented using any algorithm in any of the available programming languages to create new records (and need not refer only to record contents, but may also access other resources like files).
In a more narrow sense, however, a view is a special kind of transformation defined by a "view record". The fields of a view record have tags as they should appear in the target, typically some valid tags of the source plus, for example, index control tags, if the view describes indexing.

In the following, the term "alphanumeric" denotes any ASCII letter or digit, or any non-ASCII character. "Word character" denotes any alphanumeric, hyphen '-' or underscore '_'.

The value of a view record field can have one of several forms:

if it is empty,
the tag is passed to the source record's v command (see below).
if it starts with a %,
the rest of the value (without the %) is passed to the source record's v command. If the tag is not 0, '=tag;' is prepended.
if it starts with any word character,
it is used literally.
if it starts with a quote,
the rest of the value is used literally (without the quote). If the value's last character is a quote, it is discarded.
if it starts with an @,
the rest of the value names a view to be included
if it starts with an &,
the rest of the value is the name of an extension exit to call
if it starts with a {,
the rest of the value is a script to be executed in the host language (after stripping an optional } as last character)
any other form
(i.e. starting with other ASCII punctuation) is reserved for future use

Example: the view

24
70

is a simple projection selecting fields 24 and 70 from the source.
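
The value forms listed above can be sketched as a small classifier. This is only an illustration, not the OpenIsis implementation: the function names, the (kind, payload) return shape and the reading of "quote" as the double quote character are assumptions made for this example.

```python
def is_word_char(ch):
    # "Alphanumeric": ASCII letter or digit, or any non-ASCII character.
    # Word characters additionally include hyphen and underscore.
    return ch in "-_" or ord(ch) > 127 or ch.isalnum()


def classify_view_value(tag, value):
    """Return a hypothetical (kind, payload) pair for one view record field."""
    if value == "":
        return ("v_command", str(tag))          # the tag itself is the format
    first, rest = value[0], value[1:]
    if first == "%":
        prefix = "" if tag == 0 else "={};".format(tag)
        return ("v_command", prefix + rest)
    if is_word_char(first):
        return ("literal", value)
    if first == '"':                            # assumption: quote means "
        return ("literal", rest[:-1] if rest.endswith('"') else rest)
    if first == "@":
        return ("include", rest)
    if first == "&":
        return ("exit", rest)
    if first == "{":
        return ("script", rest[:-1] if rest.endswith("}") else rest)
    return ("reserved", value)                  # other ASCII punctuation


# The projection example: two fields with empty values.
view = [(24, ""), (70, "")]
formats = [classify_view_value(tag, value) for tag, value in view]
```

For the projection above, each field simply hands its own tag to the v command.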

the v command

The v command is described here as an abstract command. It is available in the C-API as well as from the language bindings, possibly with language-specific variations.
It resembles the core concepts of traditional formatting, including access to and looping over fields and subfields, selecting substrings and attaching optional literals. It is sort of the record's printf. Like printf, and unlike traditional formatting, it neither supports flow control nor screen rendering.

It takes a source and target record plus a string specifying a format. Depending on the language environment, the source and/or target may be implicit.
If the format starts with '=tag;', where tag is a tag, that tag is used in the target and as the default tag. Otherwise, tags from the source are used in the target and the default is *.
The first (next) character is then checked for an encoding mode, see below.

The format is a series of output specifications, consisting of a field tag (word characters, either numerical or by field name), selectors and modifiers. The special tag * selects all fields. Each spec may contain several subspecs, separated by commas, using the same child context (otherwise, specs and subspecs are the same). So the format is spec[;spec...], and a spec is subspec[,subspec...].

The general operation of the v command is to loop over the record until the last occurrence has been seen for all tags. In the nth repetition, for each tag in any spec, the (n+i)th occurrence of a field with this tag is used, where i is an offset given by an occurrence selector; it is also determined whether this is the last occurrence. For every iteration, a new output field is started, and the format is processed as follows:

loop over the (main) specifications
loop over children (or use the given field)
loop over subspecs
loop over subfields (or use the whole field)
apply decoding
apply substring
apply encoding
attach literals
append the result to the target record
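
The outer repetition described above (the nth pass using the (n+i)th occurrence of each tag) might be sketched as follows. The record model (a list of (tag, value) pairs), the function name and the use of None for an exhausted tag are assumptions for illustration only; offsets are assumed to be non-negative.

```python
def iterate_occurrences(record, tags, offsets=None):
    """Yield one dict per repetition, mapping tag -> value (or None)."""
    offsets = offsets or {}
    # Collect the occurrences of every requested tag in record order.
    by_tag = {t: [v for (ft, v) in record if ft == t] for t in tags}

    def index(tag, n):
        return n + offsets.get(tag, 0)

    rows = []
    n = 0
    # Continue until the last occurrence has been seen for all tags.
    while any(index(t, n) < len(by_tag[t]) for t in tags):
        row = {}
        for t in tags:
            k = index(t, n)
            row[t] = by_tag[t][k] if k < len(by_tag[t]) else None
        rows.append(row)   # each repetition starts a new output field
        n += 1
    return rows


record = [(24, "a"), (70, "x"), (70, "y")]
rows = iterate_occurrences(record, [24, 70])
```

Here the second repetition still runs because tag 70 has a second occurrence, while tag 24 contributes nothing to it.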

Each spec starts with an optional decoding mode, optionally followed by a tag, optionally followed by a child selector, optionally followed by a subfield selector, optionally followed by string modifiers, optionally intermingled with occurrence selectors and literals:

, starts a new subspec
; starts a new spec with default context reset to the last tag seen
. starts a child selector
^% start a subfield selector
([ start an occurrence selector
/~"'`|+ start a literal
: starts a substring selector
& calls an extension
{ evaluates a script

encoding mode

One of the following operators as first character of the format can select an output "encoding":

? outputs 1 if the selected entity exists, 0 otherwise
! the opposite of ?
& applies HTML encoding
% applies URL encoding

The test encodings ? and ! inhibit normal processing; they return immediately after checking the first occurrence of the first tag. For example, using a default of all tags (*), the format consisting solely of a '?' checks whether a record is empty.
More special characters (but not the '*') may be designated in the future, so a format should always start with a tag (possibly explicit *).
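
The four encoding operators can be illustrated with Python's standard library. This is a sketch under assumptions: the function name is hypothetical, a missing entity is modelled as None, and the exact HTML and URL escaping rules of the real implementation may differ from those of html.escape and urllib.parse.quote.

```python
import html
from urllib.parse import quote


def apply_encoding(mode, value):
    """Encode one selected value; value is None if the entity is absent."""
    if mode == "?":
        return "1" if value is not None else "0"   # existence test
    if mode == "!":
        return "0" if value is not None else "1"   # the opposite of ?
    if value is None:
        return ""
    if mode == "&":
        return html.escape(value)                  # HTML encoding
    if mode == "%":
        return quote(value, safe="")               # URL encoding
    return value                                   # no encoding mode given


encoded = apply_encoding("&", "a<b & c")
```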

decoding mode

An uppercase character before the tag may denote a decoding mode:

-	H heading mode:
^x is replaced by ';' for x=a, ',' for x=b..i, '.' for others
angle brackets are removed (>< replaced by '; '), <a> or <a=b> evaluates to a

-	D data mode:
in addition to heading mode, if there is no explicit literal after this field,
append '  ', if it ends in "punctuation", or '.  ' else.

-	X index mode
like heading, but <a> evaluates to nothing and <a=b> to b

-	M traditional
For compatibility, specs reading MHx or MDx (x = L or U) set heading
or data mode, resp., as default processing (before substringing).
The case directive is ignored.
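
A rough sketch of heading (H) and index (X) decoding, read literally from the rules above. Several points are assumptions: the function name, lowercase subfield codes, applying the replacement to a leading subfield delimiter as well, and using exactly the quoted replacement strings. Data mode is not covered.

```python
import re


def _subfield_punct(match):
    code = match.group(1)
    if code == "a":
        return ";"
    if "b" <= code <= "i":
        return ","
    return "."


def decode(value, mode="H"):
    """Apply heading ('H') or index ('X') decoding to a field value."""
    out = re.sub(r"\^(.)", _subfield_punct, value)   # ^x -> punctuation
    out = out.replace("><", ">; <")                  # adjacent brackets

    def bracket(match):
        left, _, right = match.group(1).partition("=")
        # Heading: <a> and <a=b> give a. Index: <a> gives nothing, <a=b> gives b.
        return right if mode == "X" else left

    return re.sub(r"<([^<>]*)>", bracket, out)


heading = decode("see <house=home><garden>", "H")
index = decode("see <house=home><garden>", "X")
```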

child selector

If a tag is immediately followed by a dot '.' and an optional tag, the field context is switched, for this spec and following subspecs separated by ',', to loop over the children with the given tag. The tag defaults to 0, selecting text nodes in the canonical XML representation. A * selects all children; a second . recursively selects all children.

subfield selectors

The primary subfield selector is the hat '^', followed by one character. It can produce multiple items, like repetitions of a subfield or keywords.
If the selector character is

alphanumeric
select the (repetitions of the) subfield tagged with this character.
an opening pairing brace
i.e. one of '(','{','[' or the angle bracket '<', words between pairs of this brace are selected (commonly keywords).
a *
selects the part up to the first subfield delimiter
a space
selects naive words, i.e. sequences of alphanumerics
a )
selects parts between TABs (array mode)
other punctuation
like / or | selects parts between pairs of this character
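
The hat selector cases above can be sketched as one function returning a list of items. The function name and the assumption that '^' is the subfield delimiter inside the stored value are illustrative only; nested or unbalanced braces are not handled.

```python
import re

PAIRS = {"(": ")", "{": "}", "[": "]", "<": ">"}
# ASCII letters and digits plus any non-ASCII character.
WORD = "[0-9A-Za-z\u0080-\U0010ffff]+"


def hat_select(value, sel):
    """Return the items selected by '^' followed by the character sel."""
    if sel == "*":                       # part up to the first delimiter
        return [value.split("^", 1)[0]]
    if sel == ")":                       # array mode: parts between TABs
        return value.split("\t")
    if sel == " ":                       # naive words
        return re.findall(WORD, value)
    if sel in PAIRS:                     # text between paired braces
        o, c = re.escape(sel), re.escape(PAIRS[sel])
        return re.findall(o + "([^" + c + "]*)" + c, value)
    if ord(sel) > 127 or sel.isalnum():  # subfield tagged with sel
        return [p[1:] for p in value.split("^")[1:] if p[:1] == sel]
    d = re.escape(sel)                   # other punctuation, e.g. / or |
    return re.findall(d + "([^" + d + "]*)" + d, value)


authors = hat_select("^aSmith^bJohn^aDoe", "a")
```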

The percent sign '%' (think printf) works basically like the hat, but

removes quotes surrounding values
by default treats the TAB as subfield delimiter
if followed by a punctuation character or space, treats this plus surrounding whitespace as delimiter, not separating within quotes.
if followed by a ),
(optionally after another punctuation) goes to array mode, that is, no subfield indicator is stripped from the values
if followed by multiple word characters (including '-' and '_', optionally after an initial punctuation), searches for subfields starting with that sequence followed by '=' or ':'

Examples:

'^)' splits at TABs
'%)' splits at TABs with quote removal
'%a' selects a sequence following a TAB and 'a'
'%,)' splits a line of comma separated values
'%;*' selects the primary value of a MIME property
'%;charset' selects the charset attribute of a MIME property

occurrence selector

By default, all occurrences of fields, children and subfields are used. One or multiple occurrences can be selected explicitly following a tag, child selector or subfield selector using brackets [] (counting from 1) or parentheses (counting from 0) like (i) or (i..j).

If i is omitted, it defaults to the first (1 or 0, resp.).
If j is omitted, it defaults to the last.
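
The two counting conventions map naturally onto list slices. A minimal sketch, assuming the selector is given as text such as "[2..3]" or "(1..)" and that the function name is free to choose; malformed selectors are not handled.

```python
def select_occurrences(items, selector):
    """Apply an occurrence selector like '[2]', '[2..3]' or '(1..)'."""
    base = 1 if selector[0] == "[" else 0    # [] counts from 1, () from 0
    body = selector[1:-1]
    if ".." in body:
        low, high = body.split("..", 1)
        start = int(low) - base if low else 0             # i omitted: first
        stop = int(high) - base + 1 if high else len(items)  # j omitted: last
    else:
        start = int(body) - base
        stop = start + 1
    return items[start:stop]


items = ["a", "b", "c", "d"]
middle = select_occurrences(items, "[2..3]")
```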

Alternatively, occurrences may be selected by content. The general format is an optional subfield selector, followed by a comparison operator, followed by a literal. Only occurrences where the field or the specified subfield matches the literal according to the comparison are selected. Parentheses select all such occurrences, while brackets select the first match and default to the first occurrence if none matches.
Operators are

= for equality
~ for contains
* for starts with
+ for ends with

The equality operator may be omitted where unambiguous. If some key subfield is known to occur at the start or end of the field, it is probably more efficient to test for +^zen than for ^z=en.
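
Selection by content can be sketched with a small operator table. The names are hypothetical, and the optional subfield selector is left out: the comparison is applied to the whole field value.

```python
OPERATORS = {
    "=": lambda value, literal: value == literal,          # equality
    "~": lambda value, literal: literal in value,          # contains
    "*": lambda value, literal: value.startswith(literal), # starts with
    "+": lambda value, literal: value.endswith(literal),   # ends with
}


def select_by_content(items, op, literal, first_only=False):
    """Parentheses: all matches. Brackets (first_only): first match,
    falling back to the first occurrence if nothing matches."""
    matches = [v for v in items if OPERATORS[op](v, literal)]
    if not first_only:
        return matches
    return matches[:1] if matches else items[:1]


fields = ["Hello^zen", "Hallo^zde", "Salut^zfr"]
german = select_by_content(fields, "+", "^zde")
```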

literals

Each tag, child or subfield selector may be followed by one or more literals. Every literal but the / extends to the next occurrence of the same special character by which it is introduced. This special character may be escaped using a backslash. A literal backslash may be escaped as two (but need not be, except at the end).
The special character governs when and where the literal is output:

" before the first occurrence
(of the entity in question; i.e. field, child or subfield)
' before each
` after each
| in between (after each but the last)
+ after the last
/ this single-character literal starts a new output field after each occurrence
~ this literal is used if the given entity does NOT occur

Literals are not subject to the string modifiers.
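
How the literal kinds combine around a list of occurrences can be illustrated as follows. The keyword names are invented for this sketch, the relative order of "after each" and "in between" literals is an assumption, and the / literal (which starts a new output field) is not modelled.

```python
def attach_literals(occurrences, first="", before="", after="",
                    between="", last="", absent=""):
    """Join occurrences with literals for the " ' ` | + and ~ kinds."""
    if not occurrences:
        return absent                    # ~ : entity does not occur
    out = [first]                        # " : before the first occurrence
    count = len(occurrences)
    for k, value in enumerate(occurrences):
        out.append(before)               # ' : before each
        out.append(value)
        out.append(after)                # ` : after each
        out.append(between if k < count - 1 else last)   # | or +
    return "".join(out)


line = attach_literals(["a", "b", "c"], first="Keys: ", between=", ", last=".")
```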

substring selector

Introduced by a colon ':', it has the form :l or :o.l, where o and l are integers denoting an offset and length to cut from the currently selected value.
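
A minimal sketch of the substring selector, assuming the offset counts characters from 0 and that a selector without an offset starts at the beginning of the value; the function name is hypothetical.

```python
def substring(value, selector):
    """Apply a selector of the form ':l' or ':o.l' to value."""
    body = selector[1:]                  # drop the leading colon
    if "." in body:
        offset_text, length_text = body.split(".", 1)
        offset, length = int(offset_text), int(length_text)
    else:
        offset, length = 0, int(body)
    return value[offset:offset + length]


head = substring("OpenIsis", ":4")
tail = substring("OpenIsis", ":4.4")
```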

extension exits

An exit is a C-function (i.e., using C calling convention) in a dynamic library. TODO: describe interface.

script evaluation

If a scripting environment like Tcl is available, a {} block may contain a script to be evaluated. TODO: describe interface.

$Id: Views.txt,v 1.4 2004/06/10 12:52:29 kripke Exp $