some notes on the use of charsets with ISIS
what are charsets?
Since computers can store nothing but numbers, but we want them to store
characters, there has to be a table telling which character is stored as which
number, or, vice versa, which number is to be displayed and printed as which
character. Such tables are called charsets.
Since the smallest unit of number storage is a byte, which can hold 256
different numbers from 0 to 255, many charsets are based on one
byte and thus can hold up to 256 characters. Such charsets are called
one-byte-charsets.
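To make the "table" idea concrete, here is a minimal sketch in Python (chosen only for illustration, it is not part of any ISIS tool). It looks up the same stored number, 228, in two one-byte-charsets, ISO-8859-1 (western Latin) and ISO-8859-7 (Greek), and prints the official names of the resulting characters.

```python
import unicodedata

raw = bytes([0xE4])  # one stored byte holding the number 228

# The same number looked up in two different charset tables.
as_latin1 = raw.decode("iso-8859-1")
as_greek = raw.decode("iso-8859-7")

print(unicodedata.name(as_latin1))  # LATIN SMALL LETTER A WITH DIAERESIS
print(unicodedata.name(as_greek))   # GREEK SMALL LETTER DELTA
```

The byte itself never changes, only the table used to interpret it.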
For many scripts, like the various versions of Latin, Greek, Cyrillic, Hebrew
and Arabic, 256 characters are more than enough.
For others, namely the Chinese, Japanese and Korean (CJK) scripts with several
thousand characters, it's not enough. The modern Vietnamese script is based on
Latin letters but needs a large number of accented letters, so 256 isn't
enough either. Those scripts don't get by with one byte per character,
so they need multi-byte-charsets, where two or more bytes are needed
to encode one character.
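As a rough illustration, the following Python sketch counts how many bytes one character takes in a few charsets. The sample characters and the choice of Big5, Shift_JIS and EUC-KR as representative CJK charsets are assumptions made for this example only.

```python
# (character, charset) pairs; the CJK charsets are just common examples.
samples = [
    ("A", "ascii"),
    ("ä", "iso-8859-1"),
    ("中", "big5"),        # Chinese
    ("あ", "shift_jis"),   # Japanese
    ("한", "euc_kr"),      # Korean
]

byte_counts = {}
for char, charset in samples:
    encoded = char.encode(charset)
    byte_counts[charset] = len(encoded)
    # Print only ASCII (charset name, length, hex dump of the bytes).
    print(charset, len(encoded), encoded.hex())
```

The Latin characters come out as one byte each, the CJK characters as two bytes each.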
what is UNICODE
UNICODE is a big multi-byte-charset designed to include all characters
needed in the world (over 40,000 by now), even for some ancient languages.
The problems with having several charsets are a) you have to know which charset
is used in a given text, b) computer systems need to be aware of all possible
charsets and c) it's not possible to have a text or database contain characters
which are encoded in different charsets. Having all text in unicode solves
those problems. Check out this sample page - with a 21st century browser like
Mozilla 5 (Netscape 6) you will see most or all of the letters.
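Problem c) can be sketched in a few lines of Python. The sample text and the helper name fits are made up for this illustration. The text mixes a German, a Greek and a Cyrillic letter, and the code checks which charsets can hold all of it.

```python
# Hypothetical sample text: German ä, Greek δ and Cyrillic ж in one string.
text = "ä δ ж"

def fits(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

charsets = ("iso-8859-1", "iso-8859-7", "iso-8859-5", "utf-8")
results = {charset: fits(text, charset) for charset in charsets}
print(results)
```

None of the three one-byte-charsets can store the whole text, while UTF-8 encoded unicode can.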
ASCII-compatible charsets and encodings
Many charsets use the numbers 0 to 127 in the same way: to represent the
basic set of Latin characters defined by ASCII. Whenever there's a byte with
a number in that range, this byte has the meaning of the corresponding
ASCII-character. For example, the number 43 is always a plus sign +, which is
important if a query expression is scanned for such characters.
All ISO-8859-x charsets are ASCII-compatible. Older Cyrillic charsets are NOT
compatible with ASCII. Some of the eastern multi-byte-charsets are, some are
not.
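Why this matters for scanning can be sketched as follows. The sample strings are invented for this example and are not real ISIS queries. Each is encoded in a different ASCII-compatible charset, and the code then looks for the byte 43 (the plus sign) without knowing anything about the charset.

```python
# Hypothetical sample expressions, one per ASCII-compatible charset.
samples = {
    "iso-8859-1": "Bär + Löwe",
    "iso-8859-7": "αλφα + βητα",
    "utf-8": "Bär + βητα",
}

plus_counts = {}
for charset, expression in samples.items():
    raw = expression.encode(charset)
    # Byte-level scan: no knowledge of the charset is needed here.
    plus_counts[charset] = raw.count(bytes([43]))
    print(charset, "plus signs found:", plus_counts[charset])
```

In every case the scan finds exactly the one plus sign, because none of the other characters produces a byte in the range 0 to 127.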
Some of the multi-byte-charsets have different encodings, that is,
there is only one table mapping numbers to letters, but distinct ways to use
multiple bytes to express such a number. Some of these use the numbers in
the ASCII-range only for ASCII characters, others don't. UNICODE has two widely
used encodings, UTF-8 and UTF-16 (an extension of the older UCS-2). UTF-8 is
ASCII-compatible, UTF-16 is not.
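The difference between the two unicode encodings shows up in a small Python sketch. The sample text is an assumption chosen because the Cyrillic letter Ы has the unicode number 0x042B, and 0x2B is 43, the ASCII plus sign.

```python
text = "AЫ"  # "A" followed by Cyrillic Ы (U+042B)

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")  # big-endian, without byte order mark

print(utf8.hex())   # 41d0ab
print(utf16.hex())  # 0041042b

# Does a byte-level scan see a plus sign or a zero byte in the encoded text?
plus_in_utf8 = b"+" in utf8
plus_in_utf16 = b"+" in utf16
zero_in_utf8 = b"\x00" in utf8
zero_in_utf16 = b"\x00" in utf16
print(plus_in_utf8, plus_in_utf16, zero_in_utf8, zero_in_utf16)
```

In UTF-8 all bytes of a non-ASCII character are 128 or above, so no false plus sign appears. In UTF-16 the second byte of Ы is 43, and even the plain "A" brings a zero byte with it.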
so what about ISIS
- the ISIS database format itself is capable of storing anything and
thus can store text in any charset/encoding.
tools like BIREME's mx can store and retrieve (by MFN) text in nearly any
encoding (although, depending on how the programming is done, UTF-16 may not
work because it can contain bytes with value 0).
- the ISIS query and formatting language depends on special ASCII-characters
having special meaning and therefore requires an ASCII-compatible
encoding. All the ISO-8859-x charsets will do, as will UTF-8 encoded
unicode (although some care must be taken that the multiple bytes representing
one character are not cut off in the middle). At least in theory, mx and
wwwisis are able to search for records in any ASCII-compatible
encoding including UTF-8 unicode (given careful web-programming).
- winisis doesn't know about the possibility of one character
having multiple bytes. It will work with any ASCII-compatible one-byte-charset,
as long as it doesn't have to know which charset that is. That is, whatever
charset your computer has installed as its preferred one, you will see all
characters displayed according to that charset, so a character entered as the
German ä could show up as a Greek delta :). There is no support for
multi-byte-charsets, and in particular none for unicode.
- Like any Java software, JavaISIS is - in theory - able to handle unicode
characters and even to do the transformation between unicode and most of the
other charsets. Some limitations may result from the underlying wwwisis. In
practice, version 3.5 claims to give "Multi-language encoding support", but
unfortunately it has been in beta since March 2001 (sources made available in
Feb 2002).
- openisis supports any charset and, with its Java-binding,
especially unicode and all the conversions. openisis alone can
do it on the web, and in combination with JavaISIS (once new sources are
available) also with a winisis-like interface.
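The care needed with UTF-8 when text is cut to a fixed length (see the remark on the query and formatting language above) can be sketched like this. The function name truncate_utf8 is hypothetical and not taken from any of the ISIS tools. It relies on the fact that in UTF-8 every byte after the first one of a character has the bit pattern 10xxxxxx.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut UTF-8 data to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    end = max(max_bytes, 0)
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut is inside a character,
    # so move back to the start of that character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]

field = "Bär".encode("utf-8")    # bytes 42 c3 a4 72: "ä" takes two bytes
naive = field[:2]                # plain cut: splits the "ä" in half
safe = truncate_utf8(field, 2)   # backs off to the character boundary

print(naive.hex(), safe.hex())   # 42c3 42
```

The naive cut leaves a stray first byte that is no longer valid UTF-8, while the careful cut drops the whole character.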
some other resources on unicode
To see all those characters, you need fonts to tell your display
or printer what they look like.
Here's a very fine page on how to acquire and install those fonts (and some
more advice).
If you for some reason have to waste your time with M$ products,
you may want to check out this page.
In particular, there's the one-size(23 MB)-fits-all fat font
Arial Unicode MS (TM, (c), ... expect the worst)
containing nearly all unicode glyphs, which is also included
with newer Windoze and/or Ophice versions.