Computer systems vary greatly in the sets of characters they make available for use in electronic documents. This variety enables users with widely different needs to find computer systems suitable to their work, but it also complicates the interchange of documents among systems; hence the need for a chapter on this topic in these Guidelines.
Three character-set problems arise for the encoder of electronic
texts:
No single character set is required for use in TEI-encoded documents.
Users may use any character set available to them. It is recommended
that the character set used be documented by a writing system
declaration.
In general, it is most convenient to use a character set readily
available on one's computer system, though for special purposes it may
be preferable to customize the character set using software specialized
for the purpose. Whether to use the usual character set or create a
custom set depends on the documents being encoded, the tools available
for customizing the character set, the user's technical facility, and
the perceived relative convenience of living with the existing character
set and modifying it to suit one's documents more closely. The choice must
be made by each individual according to individual circumstances; no
general recommendations are made here as to whether locally customized
character sets should be used. For local processing, encoders should
use whatever character set they find convenient.
When the characters in a text exist in the local character set, the
appropriate character codes should be used to represent them. Virtually
all computer systems provide at least the following characters (in
addition to the space character):
Other characters, such as Latin characters with diacritics (e.g.
ä or é) or non-Latin characters (e.g. Greek, Hebrew, Arabic,
Cyrillic, and Oriental scripts), are less universally provided. If the
local character set provides an
Full use of a local character set will require that the SGML
declaration define all the characters used as legal SGML characters.
For further information see chapter
Characters not available in the local character set should usually be
encoded using SGML entity references.
For example, the standard entity name for the character
Standard entity names have been defined for most characters used by
languages written in the Latin alphabet, and for some other alphabetic
scripts. A useful subset of these may be found in chapter
Where no standard entity name exists, or where the standard name is
felt unsuitable for some reason, the encoder may declare non-standard
entities, using the normal SGML syntax. If, for example, it is desired
to distinguish, in the transcription of a manuscript, among three
distinct forms of the letter
To ensure that the SGML output uses the same entity references for
them as the SGML input, for example, one could use the following
declarations.
For transcriptions in scripts not supported by the local character
set, entity references may prove unwieldy. In such cases, it is also
possible to transliterate the material from its original script into the
script of the local character set; like a customized local character
set, a transliteration scheme should be documented with a writing system
declaration. Transliteration schemes should be reversible (i.e. from
the transliteration it should be possible to reconstruct the original
writing exactly); where possible, standard schemes should be preferred
to ad hoc schemes.
Many documents contain material from more than one language: loan
words, quotations from foreign languages, etc. Since languages use a
variety of
Some languages use more than one writing system. Japanese may be
written in kanji, hiragana, katakana, or combinations of these. Hebrew
may be written with or without vowel points. Some languages may be
written either in the Latin or in the Cyrillic alphabet; or Cyrillic may
alternate with Arabic script. In such cases, each writing system must
be treated separately, as if a separate
It is recommended that each value used for the
Like any global attribute, the
Now experiments of this kind have one admirable
property and condition: they never miss or fail. ...
Electronic texts may be exchanged over electronic networks, through
exchange of magnetic media, or by other means. In every case except the
transmission of magnetic media (e.g. disk or tape) from one machine to
another machine of the same hardware type running the same operating
system, the electronic data is subject to translation and
interpretation, and hence to misinterpretation and distortion, by
utility software working somewhere on the interchange path. Network
gateways, tape-reading software, and disk utilities routinely translate
from one character set to another before passing the data on. If the
utility errs in identifying the character set, or if several utilities
translate back and forth among character sets using non-reversible
translations, the chances are good that characters will be
garbled and information lost.
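The two failure modes just described can be illustrated with a short sketch. It is not part of these Guidelines: Python's built-in codecs stand in for the utilities on an interchange path, and the particular character sets (ISO 8859-1, IBM code page 437, ASCII) and the sample string are assumptions chosen only for the demonstration.

```python
# Illustrative only: Python codecs stand in for the utility software on an
# interchange path. The character sets and sample text are assumptions.
original = "caf\u00e9 d\u00e9j\u00e0"  # "café déjà"

# Case 1: a utility errs in identifying the character set. The bytes were
# written as ISO 8859-1 but are read back as IBM code page 437.
raw = original.encode("latin-1")
misread = raw.decode("cp437")

# Case 2: a non-reversible translation. Characters with no equivalent in
# the target set are replaced by '?', so the original cannot be rebuilt.
lossy = original.encode("ascii", errors="replace").decode("ascii")

# Case 3: a reversible translation. Code page 437 contains every character
# used here, so translating there and back restores the text exactly.
round_trip = original.encode("cp437").decode("cp437")

print(repr(misread))           # accented letters come back as other characters
print(repr(lossy))             # accented letters come back as '?'
print(round_trip == original)  # the reversible path is lossless
```

In the first two cases the text that arrives has the same length as the text that was sent, so the damage is easy to overlook unless the content itself is checked.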
At this time (1992), the characters least susceptible to loss or
misinterpretation in transit among systems are those shown below, which
represent a subset of the characters in the international standard ISO 646
and may thus be called the
In interchange over any transmission link, the transmitted document
should contain only those characters which safely survive transmission
over the link; others should be represented with entity references, or
with transliterations, as described above.
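As a rough sketch of this recommendation, the following code replaces characters outside a safe set with entity references before transmission. Everything named here is an assumption made for illustration: `SAFE_CHARS` is only an approximation of a safe subset and should be replaced by the list of characters known to survive the link in question, `ENTITY_NAMES` is a two-entry sample table, and the function names are hypothetical.

```python
import string

# ASSUMPTION: an approximate safe subset (letters, digits, space, newline,
# and some punctuation). Substitute the characters that are actually known
# to survive the transmission link in question.
SAFE_CHARS = set(
    string.ascii_letters + string.digits + " \n" + "\"%&'()*+,-./:;<=>?_"
)

# ASSUMPTION: a tiny sample table mapping characters to entity names.
# A real table would cover every non-safe character in the document.
ENTITY_NAMES = {
    "\u00e4": "auml",    # a with diaeresis
    "\u00e9": "eacute",  # e with acute accent
}


def unsafe_characters(text):
    """Return the distinct characters of text that fall outside SAFE_CHARS."""
    return sorted(set(text) - SAFE_CHARS)


def to_entity_references(text, entity_names=ENTITY_NAMES):
    """Replace each unsafe character with an entity reference.

    Existing markup (including '&') is passed through untouched. A character
    that is neither safe nor in the table raises ValueError, so nothing is
    dropped silently.
    """
    pieces = []
    for ch in text:
        if ch in SAFE_CHARS:
            pieces.append(ch)
        elif ch in entity_names:
            pieces.append("&" + entity_names[ch] + ";")
        else:
            raise ValueError("no entity name known for %r" % ch)
    return "".join(pieces)


sample = "Caf\u00e9 H\u00e4ndel"
print(unsafe_characters(sample))
print(to_entity_references(sample))
```

Raising an error for an unmapped character, rather than substituting a placeholder, is a deliberate choice in this sketch: it forces the encoder to decide between declaring a new entity and transliterating.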
In blind interchange by means of magnetic media, it is recommended
that the document be encoded using some well documented and widely used
standard character set.
In blind interchange over networks, it is recommended that the
transmitted document contain only characters known to travel safely over
the networks involved. In the most general case, those characters are
the ISO 646 subset given above.
As from 3.2.4, with modifications.
This chapter describes the recommended solutions to these problems, in
enough detail to satisfy the needs of most users. More detail and more
technical information can be found in chapters