some notes on the use of charsets with ISIS

what are charsets?

Since computers can store nothing but numbers, but we want them to store characters, there has to be a table telling which character is stored as which number or, vice versa, which number is displayed and printed as which character. Such tables are called charsets.
Since the smallest unit of number storage is a byte, which can hold 256 different numbers from 0 to 255, many charsets are based on one byte and thus can hold up to 256 characters. Such charsets are called one-byte-charsets.
For many scripts, like the various versions of Latin, Greek, Cyrillic, Hebrew and Arabic, 256 characters are more than enough.
For others, namely the Chinese, Japanese and Korean (CJK) scripts with several thousand characters each, 256 is not enough. The modern Vietnamese script is based on Latin letters but needs so many accented letters that 256 is not enough either. Such scripts can't get by with one byte per character, so they need multi-byte-charsets, where two or more bytes are needed to encode one character.
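
As a rough illustration of the difference, the following Python sketch asks a few charsets how many bytes they need for a given character. The helper name encoded_length is made up for this example, and the charset names are the codec names from Python's standard library, not anything specific to ISIS.

```python
def encoded_length(char, charset):
    """Return how many bytes `charset` needs for `char`, or None if the
    charset's table has no entry for that character."""
    try:
        return len(char.encode(charset))
    except UnicodeEncodeError:
        return None


# One Latin, one accented Latin, one Cyrillic and one Chinese character.
for char in ("A", "ä", "я", "中"):
    for charset in ("iso8859-1", "iso8859-5", "gb2312", "utf-8"):
        # ascii() keeps the output printable on any terminal.
        print(ascii(char), charset, encoded_length(char, charset))
```

The one-byte-charsets either need exactly one byte or cannot represent the character at all, while the multi-byte-charsets answer with two or three bytes for the Chinese character.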

what is UNICODE?

UNICODE is a big multi-byte-charset designed to include all characters needed in the world (over 40,000 by now), even for some ancient languages. The problems with having several charsets are: a) you have to know which charset is used in a given text, b) computer systems need to be aware of all possible charsets and c) a text or database cannot contain characters that are encoded in different charsets. Having all text in UNICODE solves those problems. Check out this sample page - with a 21st century browser like Mozilla 5 (Netscape 6) you will see most or all of the letters.
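
Problem a) can be made concrete with a small Python sketch: the same byte is read as three different letters depending on which charset the reader assumes. The byte value 228 is an arbitrary pick for illustration, and read_as is a helper name invented here.

```python
import unicodedata

raw = bytes([228])  # a single byte with the value 228 (hex E4)


def read_as(data, charset):
    """Interpret the bytes using the given charset's table."""
    return data.decode(charset)


for charset in ("iso8859-1", "iso8859-5", "iso8859-7"):
    letter = read_as(raw, charset)
    # Print the UNICODE name instead of the letter itself, so the
    # output is readable even on a terminal that lacks the glyphs.
    print(charset, unicodedata.name(letter))
```

Without knowing the charset, there is no way to tell from the byte alone whether a Latin, Cyrillic or Greek letter was meant.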

ASCII-compatible charsets and encodings

Many charsets use the numbers 0 to 127 in the same way: to represent the basic set of Latin characters defined by ASCII. Whenever there's a byte with a number in that range, this byte has the meaning of the corresponding ASCII character. For example, the number 43 is always a plus sign (+), which is important if a query expression is scanned for such characters.
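
A quick way to check this for a handful of charsets is to decode the single byte 43 with each of them. The list of charsets below is only a sample chosen for this sketch, using the codec names from Python's standard library; the helper names are made up.

```python
PLUS = 43  # ASCII number of the plus sign


def decodes_to_plus(charset):
    """True if the byte 43 means '+' in the given charset."""
    return bytes([PLUS]).decode(charset) == "+"


def split_on_plus(raw):
    """Split raw query bytes at every byte 43, without decoding them.
    This is only safe if the charset of `raw` is ASCII-compatible."""
    return raw.split(bytes([PLUS]))


for charset in ("ascii", "iso8859-1", "iso8859-5", "iso8859-7", "koi8-r", "utf-8"):
    print(charset, decodes_to_plus(charset))

# A query with accented letters, stored as ISO-8859-1 bytes.
print(split_on_plus("caf\u00e9+th\u00e9".encode("iso8859-1")))
```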
All ISO-8859-x charsets are ASCII-compatible. Older Cyrillic charsets are NOT compatible with ASCII. Some of the eastern multi-byte-charsets are, some are not.
Some of the multi-byte-charsets have several encodings: there is only one table mapping numbers to letters, but there are distinct ways to use multiple bytes to express such a number. Some encodings use the numbers in the ASCII range only for ASCII characters, others don't. UNICODE has two widely used encodings, UTF-8 and UTF-16 (an extension of the older UCS-2). UTF-8 is ASCII-compatible, UTF-16 is not.
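
The difference matters for byte-wise scanning. In the Python sketch below, the Cyrillic letter Ы (U+042B) is just a convenient example: its UTF-16 form happens to contain the byte 43, which a scanner working on raw bytes would mistake for a plus sign, while its UTF-8 form uses only bytes above 127. The helper name contains_plus_byte is invented for this example.

```python
PLUS = 43  # ASCII number of the plus sign (hex 2B)


def contains_plus_byte(text, encoding):
    """True if the encoded form of `text` contains a byte with value 43."""
    return PLUS in text.encode(encoding)


letter = "\u042b"  # CYRILLIC CAPITAL LETTER YERU; no plus sign involved

print(letter.encode("utf-8").hex())      # d0ab
print(letter.encode("utf-16-be").hex())  # 042b

print(contains_plus_byte(letter, "utf-8"))      # False
print(contains_plus_byte(letter, "utf-16-be"))  # True
```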

so what about ISIS?


some other resources on unicode

To see all those characters, you need fonts that tell your display or printer what they look like. Here's a very fine page on how to acquire and install those fonts (and some more advice). If for some reason you have to waste your time with M$ products, you may want to check out this page. In particular, there's the one-size(23 MB)-fits-all fat font Arial Unicode MS (TM, (c), ... expect the worst) containing nearly all unicode glyphs, which is also included with newer Windoze and/or Ophice versions.