Terminological Databases

Terminological information generally resides in terminology databases (TDBs), but for SGML applications, these collections of data can be viewed as documents. A document containing terminological data is made up of terminological entries. Typically, a terminological entry treats a single concept and contains information on the assignment of single or multi-word terms to this concept. Bilingual and multilingual terminological entries deal with harmonized or very closely related concepts in two or more languages that are treated as functional equivalents in the context of a specific domain or subdomain. Terminological data can take the form of terminological databases (TDBs) or can be used to print hardcopy terminological documents, such as terminological dictionaries, technical vocabularies, or thesauri.

The TEI description of terminological data was originally designed primarily as a terminology interchange format (TIF) to allow users of terminology databases to exchange database records. In this guise it is called the Electronic Terminology Interchange Format (E-TIF). The exchange of database records is especially important in practice because the structure of terminological records varies considerably from TDB to TDB, reflecting differences of design and of user needs. Users of TDBs frequently need to interchange data in order to access expert information and to prevent the duplication of effort, but differences in software, hardware, and methodology complicate interchange. A universal interchange format is a crucial element in making interchange easier.

The tag set defined in this chapter may also be used to mark up documents for the purpose of printing terminological dictionaries and vocabularies, or exchanging them in electronic form. Printed terminological documents differ from terminological databases in that they are frequently divided into sections and subsections and include prose text in introductions, etc. When it is used for marking up printed documentation, the tag set defined here may be called a Print Terminology Interchange Format (P-TIF).

Because printed terminological dictionaries differ from terminological databases, problems may arise if one attempts to use the same electronic document both for printing and to exchange records among databases. A printed terminological dictionary may contain material not suitably encoded for introduction into database records. Domain and subdomain information may be implied by the arrangement of termEntrys rather than by explicit domain specifications within the individual entries.

Other interchange difficulties include differences between term entry styles used in prescriptive and descriptive terminology work and problems arising from differences in the degree of detail used to classify data elements in different databases. (The term data element is used by terminologists to refer to the smallest defined individual items of information, regardless of whether they are represented as SGML elements, SGML attributes, or fields or columns in a database. That is the usage followed here.) Procedures for addressing these various problems are treated in more detail in another document, the TEI / LISA / ISO - TIF --- Terminology Interchange Format --- A Tutorial (1993). This document is reprinted in TermNet News, no 40, 1993, pp 5-64; copies are also available from Infoterm, z.Hd. Herrn Dr. Gerhard Budin, Heinestraße 38, Postfach No. 130, A-1021 Vienna, Austria.

The Terminological Entry

The basic unit of terminology management is the terminological entry. A terminological entry documents information pertaining to a concept and generally speaking contains at least one term. In addition to the term, various kinds of descriptive and administrative data are recorded concerning the term, the concept to which it is assigned, and relationships to other terms and concepts. Administrative information supports the management of the terminology database or document.

A sample terminological entry consists of a series of entries like the following:

Tags for Terminological Data

The following sections define elements for use in tagging terminological data. The elements and attributes listed are based on empirical studies. The studies indicated the use of a wide variety of different data element types (data categories or database field types), but this variety can be reduced to a relatively small set of SGML elements and attributes expressing notions common to most, if not all, TDBs. Those elements and attributes are defined here. In addition, the global TEI attributes defined in section , and the elements and attributes defined in chapter , can all be used in terminological applications.

When tagging terminological data, three elements constitute the set of non-floating elements: term, otherForm, and descrip. All other elements function as floating elements, including: admin, note, gram, bibl, biblFull, date, table, formula, figure, and the linking elements (ptr, xptr, ref, and xref). The rules for combining floating with non-floating elements are spelled out below in section , and in section .

term: contains a single-word, multi-word or symbolic designation which is regarded as a technical term. Attributes include:
    type: classifies the term using some typology.

termEntry: contains a single complete entry for one concept expressed in one language and comprising one or more terms and their associated descriptive and administrative data, or, in bilingual and multilingual terminology work, two or more very closely related concepts comprising one or more terms in each language and their associated descriptive and administrative data. Attributes include:
    type: classifies the term entry using some typology, preferably the dictionary of data element types specified in ISO WD 12 620.

tig: within a termEntry element, contains information elements associated with a single term. Attributes include:
    type: classifies the tig using some typology, preferably the dictionary of data element types specified in ISO WD 12 620.

otherForm: contains an alternate designation for the concept treated by the term entry, such as a synonym. Attributes include:
    type: classifies the otherForm using some typology, preferably the dictionary of data element types specified in ISO WD 12 620.

ofig: within a tig element, contains information elements relating to a single otherForm. Attributes include:
    type: classifies the other-form information group according to some convenient typology, preferably the dictionary of data element types specified in ISO WD 12 620.

gram: within an entry in a dictionary or a terminological data file, contains grammatical information relating to a term, word, or form. Attributes include:
    type: classifies the grammatical information given according to some convenient typology --- in the case of terminological information, preferably the dictionary of data element types specified in ISO WD 12 620. Suggested values include:
        part of speech (any of the word classes to which a word may be assigned in a given language, based on form, meaning, or a combination of features, e.g. noun, verb, adjective, etc.)
        gender (formal classification by which nouns and pronouns, and often accompanying modifiers, are grouped and inflected, or changed in form, so as to control certain syntactic relationships)
        number (e.g. singular, plural, dual, ...)
        animate or inanimate
        proper noun or common noun

descrip: within a termEntry element, contains a definition, context or explanation used to explain or define the concept represented by a term or an otherForm. Attributes include:
    type: classifies the description using some convenient typology, preferably the dictionary of data element types specified in ISO WD 12 620. Suggested values include:
        The description provides all the information needed to differentiate one concept from all other related concepts in the given domain.

admin: within a termEntry element, contains administrative information pertaining to data management and documentation of the entry. Attributes include:
    type: identifies the administrative event or information using some typology, preferably the dictionary of data element types specified in ISO WD 12 620. Suggested values include:
        The admin element identifies the agency or individual responsible for the data element or entry.
        The admin element describes the creation of the data element or entry.
        The admin element describes the update or modification of the data element or entry.
        The admin element describes the final approval of the data element or entry.
        The element indicates the subject area to which a concept pertains.
        The element indicates the subdomain of the subject area to which the concept pertains.

As indicated, these elements all possess a type attribute, used to classify the generic elements so as to match the classifications used by TDBs. The type attributes allow specific items of information not defined in the DTD to be tagged as one of the defined elements with an appropriate type value. The possible values of type thus constitute a sizable open list.

At the time of publication, work is under way in ISO Technical Committee 37, Sub-Committee 3, Working Group 1 to compile an official dictionary of data element types (data categories) for use in terminology work, which will eventually provide the core for a complete list of type attribute values. This data element dictionary will appear as ISO 12 620. The attribute values that occur in the examples shown in this chapter represent a subset of those that will be defined in ISO 12 620.

The ofig and otherForm elements are not necessary if each potential otherForm element is recast as a term in its own tig. For example, a term could be placed in a tig type=synonym.
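The recasting described above can be sketched in Python. This is a minimal illustration, not a prescribed conversion: it assumes the entry is available in XML syntax (the chapter itself uses SGML), that each ofig sits inside a tig, and that the type of the otherForm becomes the type of the new tig. The function name `recast_other_forms` and the sample synonym "opaqueness" are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry in XML syntax; "opaqueness" is invented sample data.
ENTRY = """
<termEntry>
  <tig lang="en">
    <term>opacity</term>
    <gram>n</gram>
    <ofig>
      <otherForm type="synonym">opaqueness</otherForm>
      <gram>n</gram>
    </ofig>
  </tig>
</termEntry>
"""


def recast_other_forms(entry):
    """Rewrite each ofig as a sibling tig whose term is the old otherForm."""
    for tig in list(entry.findall("tig")):
        position = list(entry).index(tig) + 1
        for ofig in list(tig.findall("ofig")):
            other = ofig.find("otherForm")
            if other is None:
                continue  # nothing to promote; leave the group untouched
            tig.remove(ofig)
            new_tig = ET.Element("tig", {"type": other.get("type", "synonym")})
            if tig.get("lang"):
                new_tig.set("lang", tig.get("lang"))  # keep the language explicit
            ET.SubElement(new_tig, "term").text = other.text
            for child in ofig:
                if child is not other:
                    new_tig.append(child)  # gram, admin, note, ...
            entry.insert(position, new_tig)
            position += 1
    return entry


entry = recast_other_forms(ET.fromstring(ENTRY))
for tig in entry.findall("tig"):
    print(tig.get("type"), tig.find("term").text)
```

After the call, the entry contains no ofig or otherForm elements; the former synonym is an ordinary term in its own tig.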

When the base tag set described in this chapter is used, the following attributes are added to the set of global attributes:

group: indicates the group (term and related elements) to which this element should be associated by specifying a string matching the n attribute value on an appropriate element.

depend: indicates the parent element to which this element should be associated by specifying a string matching the n attribute value on an appropriate element.

grpPtr: indicates the group (term and related elements) to which this element should be associated by specifying its unique identifier, where this is available.

depPtr: indicates the parent element to which this element should be associated by specifying its unique identifier, where this is available.

For discussion of the usage of these attributes, see below, section .

Among the TEI core elements, the following are most likely to be found necessary in encoding terminological data; for fuller descriptions see the appropriate sections in chapter . In the case of the date element, it should be noted that the ISO format (YYYY-MM-DD) is preferred for terminology entries.

note: contains a note or annotation.

ref: defines a reference to another location in the current document, in terms of one or more identifiable elements, possibly modified by additional text or comment.

ptr: defines a pointer to another location in the current document in terms of one or more identifiable elements.

xref: defines a reference to another location in the current document, or an external document, using an extended pointer notation, possibly modified by additional text or comment.

xptr: defines a pointer to another location in the current document or an external document.

date: contains a date in any format.

bibl: contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged.

biblStruct: contains a structured bibliographic citation, in which only bibliographic subelements appear and in a specified order.

biblFull: contains a fully-structured bibliographic citation, in which all components of the TEI file description are present.

table: contains text displayed in tabular form, in rows and columns.

figure: indicates the location of a graphic, illustration, or figure.

formula: contains a mathematical or other formula.

Like all other elements defined in the TEI DTDs, all elements in the base tag set for terminology possess the following global attributes:

lang: indicates the language of the element content, usually using a two- or three-letter code from ISO 639.

n: gives a number (or other label) for an element, which is not necessarily unique within the document.

id: provides a unique identifier for the element bearing the ID value.

Using the tags defined here, the example given above in section might be tagged thus. (In this example, as in the others, white space has been liberally used for the sake of legibility; in practice most actual encodings would use less white space.)

[markup not preserved; element content only]
appearance of materials
opacity n
degree of obstruction to the transmission of visible light
Opazität n f
Maß für die Lichtdurchsichtigkeit
p. 383
opacité n f
rapport du flux lumineux incident au flux lumineux transmis ou réfléchi par un noircissement photographique

Both the ptr type='bibliographic' target='ASTM.E284' and ref type='bibliographic' target='HFdn1983' elements in the example indicate links to complete bibliographical entries included in the back matter element of the same document. HFdn1983 is a source reference code for a book, generated according to ISO/TC 37 WI 18, Coding of Bibliographic References in Terminology Work and Terminography (1991). Its full bibliographic record would be:

Henry G. Freeman. Wörterbuch technischer Begriffe mit 4300 Definitionen nach DIN. III. 703 pp. Berlin and Köln: Beuth Verlag GmbH, 1983. Compiled for the standards of the DIN (Deutsches Institut für Normung).

Further examples, including alternate encodings of this term entry, are given below in section , and section .

The formal definition of these elements depends on which style of markup is being used; for discussion of the two styles, see the following section, . For the formal declarations for the two styles, see sections , and .

Basic Structure of the Terminological Entry

A terminological entry is identified with the termEntry tag and contains one or more terms marked with the tag term, which may appear with associated SGML elements. A single term and its associated SGML elements (such as gram, descrip, admin) constitute a term information group, tig. A termEntry may be made up of one or more tigs.
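The containment just described (a termEntry holding tigs, each tig holding one term plus its associated elements) can be illustrated with a short Python sketch. It assumes an XML-syntax rendering of the opacity entry used in this chapter; the type values "domain" and "definition", the id value, and the function name `term_groups` are hypothetical choices made for the illustration.

```python
import xml.etree.ElementTree as ET

# XML-syntax stand-in for a nested entry; attribute values are assumptions.
NESTED = """
<termEntry id="opacity.entry">
  <descrip type="domain">appearance of materials</descrip>
  <tig lang="en">
    <term>opacity</term>
    <gram>n</gram>
    <descrip type="definition">degree of obstruction to the transmission of visible light</descrip>
    <ptr type="bibliographic" target="ASTM.E284"/>
  </tig>
  <tig lang="de">
    <term>Opazität</term>
    <gram>n f</gram>
    <descrip type="definition">Maß für die Lichtdurchsichtigkeit</descrip>
    <ref type="bibliographic" target="HFdn1983">p. 383</ref>
  </tig>
</termEntry>
"""


def term_groups(entry):
    """Return one (language, term, associated element names) record per tig."""
    groups = []
    for tig in entry.findall("tig"):
        term = tig.find("term")
        associated = [child.tag for child in tig if child is not term]
        groups.append((tig.get("lang"), term.text, associated))
    return groups


groups = term_groups(ET.fromstring(NESTED))
for lang, term, associated in groups:
    print(lang, term, associated)
```

Because each tig encloses everything that belongs to its term, no pointer attributes are needed to recover the grouping.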

There are two structural descriptions for termEntrys: nested termEntrys and flat termEntrys. The nested structure is preferred, especially for interchange with unknown partners. The flat structure provides an option that can be used between interchange partners whose systems exhibit fairly similar structures. The flat structure may also be used as an intermediate form for systems making the transition to the nested format.

Nested Term Entries

A nested termEntry uses SGML to represent the hierarchical relationships implicit in the terminological entry by utilizing the following principles of embedding and adjacency. Rule of embedding in nested term entries: Elements that constitute a part of another element are embedded inside the parent element. Rules of adjacency in nested term entries:

The conversion routine that creates the nested entry infers the language of the tig from the language of the term, a process that can be construed as upward inheritance from term to tig. Standard TEI downward inheritance applies for all the elements embedded in the tig: their language is that of the tig, unless this default value is overridden by stating a new value.
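The two directions of inheritance can be sketched as follows. This is only an illustration under stated assumptions: the entry is in XML syntax, the lang attribute is carried on the term rather than on the tig, and the English note is invented sample data. The function name `resolve_languages` is hypothetical.

```python
import xml.etree.ElementTree as ET

# XML-syntax stand-in; the note element and its text are invented.
ENTRY = """
<termEntry>
  <tig>
    <term lang="de">Opazität</term>
    <gram>n f</gram>
    <descrip>Maß für die Lichtdurchsichtigkeit</descrip>
    <note lang="en">hypothetical note written in English</note>
  </tig>
</termEntry>
"""


def resolve_languages(entry):
    """Return (element name, effective language) pairs for every tig child."""
    resolved = []
    for tig in entry.findall("tig"):
        term = tig.find("term")
        # Upward inheritance: the tig takes its language from the term
        # when it does not state one itself.
        tig_lang = tig.get("lang")
        if tig_lang is None and term is not None:
            tig_lang = term.get("lang")
        # Downward inheritance: children default to the tig's language
        # unless they override it with their own lang value.
        for child in tig:
            resolved.append((child.tag, child.get("lang") or tig_lang))
    return resolved


languages = resolve_languages(ET.fromstring(ENTRY))
print(languages)
```

The gram and descrip elements receive the language of the tig, while the note keeps the value it states explicitly.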

An example of a nested term entry was given in section .

Flat Term Entries Using Rules of Adjacency

The flat terminological entry does not use the tig element to enclose a term and its associated elements. Instead, it provides other mechanisms to express the relationships that occur within and among entries in a TDB, while at the same time allowing the different types of entries found in different source TDBs to be represented in very natural ways. The difference between the nested and flat terminological entries is that, while both can express the same information, the nested structure represents the logical hierarchy implicit within the entry by embedding elements in one another, while the flat entry does not represent the logical hierarchy within the entry in this way. Since many existing TDBs do not overtly indicate any hierarchical structure such as that represented in a nested entry, the flat entry may be more apt to reflect the organization of data elements within an entry found in the particular source TDB, whereas the nested entry more obviously characterizes an ideal abstract structure of the term entry. In flat entries, terms and their associated elements are grouped by means of the following rules of adjacency: Rules of adjacency in flat termEntrys:

Encoded using the flat style, the example given in section , might look like this:

[markup not preserved; element content only]
appearance of materials
opacity n
degree of obstruction to the transmission of visible light
Opazität n f
Maß für die Lichtdurchsichtigkeit
p. 383
opacité n f
rapport du flux lumineux incident au flux lumineux transmis ou réfléchi par un noircissement photographique

Flat Term Entries Using Group and Depend Attributes

In practice, there are term entries where elements are ordered in such a way that the rules of adjacency cannot be used. For instance, in Example 3 the ptr and ref linking elements refer to the immediately preceding descrip information. The admin type='responsibility' elements as represented here also refer to the descrip element. It may, however, be desirable for the bibliographic reference to refer not only to the quoted material in the descriptive element, but also to the term itself. Because the second rule of adjacency dictates that all floating elements following a non-floating element refer to that non-floating element, a mechanism is required to point to the term if the floating element depends on the term itself.
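The effect of the second rule of adjacency can be made concrete with a small Python sketch. Only that rule is taken from the text; the sketch additionally assumes (these are assumptions, not rules quoted from this chapter) that each term opens a new implicit group, that otherForm and descrip elements belong to the preceding term, and that anything before the first term applies to the whole entry. The entry is an XML-syntax stand-in, and `group_by_adjacency` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

NON_FLOATING = {"term", "otherForm", "descrip"}

# XML-syntax stand-in for a flat entry; type values are assumptions.
FLAT = """
<termEntry>
  <descrip type="domain">appearance of materials</descrip>
  <term lang="en">opacity</term>
  <gram>n</gram>
  <descrip type="definition">degree of obstruction to the transmission of visible light</descrip>
  <ptr type="bibliographic" target="ASTM.E284"/>
  <term lang="de">Opazität</term>
  <gram>n f</gram>
  <descrip type="definition">Maß für die Lichtdurchsichtigkeit</descrip>
  <ref type="bibliographic" target="HFdn1983">p. 383</ref>
</termEntry>
"""


def group_by_adjacency(entry):
    """Split a flat entry into entry-level elements and implicit groups.

    Each group member is recorded as (element name, name of the element
    it refers to).
    """
    entry_level, groups = [], []
    anchor = None  # nearest preceding non-floating element
    for el in entry:
        if el.tag == "term":
            groups.append({"term": el.text, "members": []})
            anchor = el
        elif not groups:
            entry_level.append(el.tag)  # assumed: applies to the whole entry
        elif el.tag in NON_FLOATING:
            groups[-1]["members"].append((el.tag, "term"))
            anchor = el
        else:
            # Second rule of adjacency: a floating element refers to the
            # non-floating element it follows.
            groups[-1]["members"].append((el.tag, anchor.tag))
    return entry_level, groups


entry_level, groups = group_by_adjacency(ET.fromstring(FLAT))
print(entry_level)
for group in groups:
    print(group["term"], group["members"])
```

In the result the ptr and ref elements are attached to the descrip they follow, not to the term, which is exactly the limitation that motivates an explicit pointing mechanism.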

There are also other exceptions to the adjacency rules: in some term entries elements are associated with a term other than the immediately preceding term. Such entries may be called discontiguous flat term entries, since the constituents of a term information group may not be adjacent. In such entries, information pertaining to the entire terminological entry may not always appear at the beginning of the entry (i.e., prior to the introduction of a term).

Such an entry might be encoded as follows:

[markup not preserved; element content only]
opacity n
Opazität n f
opacité n f
degree of obstruction to the transmission of visible light
Maß für die Lichtdurchsichtigkeit
rapport du flux lumineux incident au flux lumineux transmis ou réfléchi par un noircissement photographique
p. 383
appearance of materials

In the above example, depend attributes indicate that the material tagged with this attribute is related to the targeted element. The group attributes indicate that the information so marked is part of an implicit tig, i.e. that it pertains either to the term or to the entire implicit tig. Items linked to other elements by depend do not require the group attribute because they are associated with the group already by virtue of their relation to elements that are themselves associated with the group.

So as to describe appropriate relationships in discontiguous flat termEntrys, it is necessary to define a pointing mechanism that allows any non-adjacent element to be related to an implicit term information group and therefore to the term with which it is associated or to some other specific element.

Two methods are provided to represent this association. For terminology files in which unique identifiers for all term elements cannot be assumed (as will often be the case in interchange), the group and depend attributes should be used. For terminology files in which unique SGML identifiers can be provided, the grpPtr and depPtr attributes should be used. The two pairs of attributes have identical significance as far as the association of elements is concerned.

The group attribute associates an element with a specific term, or with an implicit term information group: its value must be the same as the n attribute on the term element being pointed to. During interchange, the group attribute would be used to extract and assemble all the elements related to a specific term information group from a discontiguous flat termEntry by matching them to the n attributes on the terms. The group pointer accounts for the kind of relationship represented by the principle of embeddedness within a tig in a nested term entry.

The depend attribute associates an element with some other specific element: its value must be the same as the n attribute on the element being pointed to. As shown in the last line of Example 4, the depend attribute can also point to the entire terminological entry by targeting a value of n indicated in the termEntry element. If for any reason the grammatical information pertaining to a term does not follow the term immediately, this information must be linked to the term with the depend attribute.

In terms of the extended pointer notation defined in chapter , the specification group=2 is synonymous with HERE ANCESTOR (1 TERMENTRY) DESCENDANT (1 TERM N 2), and the specification depend=3 is synonymous with HERE ANCESTOR (1 TERMENTRY) DESCENDANT (1 * N 3).

To summarize the behavior of group and depend, the group attribute identifies an implicit tig, whereas the depend attribute implies relatedness. If there is any ambiguity with respect to the rules of adjacency, one should use depend.
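The assembly step that an import routine would perform can be sketched in Python. This is a minimal sketch under stated assumptions: the entry is in XML syntax, n values are unique within the entry, and elements carrying neither attribute are treated as entry-level (the adjacency rules are ignored here). The termEntry label "e1" and the function name `assemble_groups` are hypothetical; the other n values follow the description of Example 4.

```python
import xml.etree.ElementTree as ET

# XML-syntax stand-in for a discontiguous flat entry.
DISCONTIGUOUS = """
<termEntry n="e1">
  <term n="1" lang="en">opacity</term>
  <gram group="1">n</gram>
  <term n="2" lang="de">Opazität</term>
  <gram group="2">n f</gram>
  <descrip n="endes1" group="1">degree of obstruction to the transmission of visible light</descrip>
  <descrip n="dedes1" group="2">Maß für die Lichtdurchsichtigkeit</descrip>
  <ptr depend="endes1" type="bibliographic" target="ASTM.E284"/>
  <ref depend="dedes1" type="bibliographic" target="HFdn1983">p. 383</ref>
  <descrip depend="e1">appearance of materials</descrip>
</termEntry>
"""


def assemble_groups(entry):
    """Collect element names per implicit tig, keyed by the term's n value."""
    by_n = {el.get("n"): el for el in entry.iter() if el.get("n")}
    groups = {term.get("n"): [] for term in entry.findall("term")}
    entry_level = []

    def owner(el, seen=()):
        """Return the n value of the term el belongs to, or None."""
        if el.tag == "term":
            return el.get("n")
        if el.get("group") in groups:
            return el.get("group")
        target = by_n.get(el.get("depend"))
        if target is None or target is entry or target in seen:
            return None  # unresolved, entry-level, or circular
        return owner(target, seen + (target,))

    for el in entry:
        if el.tag == "term":
            continue
        key = owner(el)
        if key is None:
            entry_level.append(el.tag)
        else:
            groups[key].append(el.tag)
    return groups, entry_level


groups, entry_level = assemble_groups(ET.fromstring(DISCONTIGUOUS))
print(groups)
print(entry_level)
```

Elements reached through depend join the group of their target, which mirrors the statement that items linked by depend do not need a group attribute of their own.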

In Example 4, the English term opacity is identified as n=1, and all other elements associated with this tig are marked as group=1; in German, the term and all its associated elements are identified as n=2 and group=2, respectively; in French, the term and its associated elements are identified as n=3 and group=3. Since the bibliographical references are displaced from the descriptive information with which they are associated, the descriptions are identified with n=endes1, n=dedes1, and n=frdes1, respectively. The ptr and ref elements are then identified with depend attributes that target the appropriate descriptions. Even if the elements in the entry were adjacent to each other in the entry, this convention would be essential if one wanted to indicate that the source applied to the term and hence to the entire tig, rather than just to the descrip element itself.

References between Term Entries

Terminology documents utilize a variety of cross-references between termEntrys, for instance to link to bibliographic entries or between equivalents in different languages, synonyms and related terms and concepts. These references are usually implemented using the TEI linking elements ptr and ref, together with a value of the attribute type. If, as is the case with the reference to ASTM E284, the total bibliographic source description is contained in the target element of the linking element, use ptr. If, on the other hand, a page number is included, this page number must appear as the content of a linking element introduced by the ref element.

Examples: <ptr type='bibliographic' target='ASTM.E284'> or <ref type='bibliographic' target='HFdn1983'>p. 383</ref>
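The choice between the two linking elements can be expressed as a small helper. This is a sketch only, producing XML-syntax elements; the function name `bibliographic_link` and its `locator` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def bibliographic_link(target, locator=None):
    """Build a ptr when the target says everything, a ref when a page
    number (or similar locator) must be carried as element content."""
    attributes = {"type": "bibliographic", "target": target}
    if locator is None:
        return ET.Element("ptr", attributes)
    ref = ET.Element("ref", attributes)
    ref.text = locator
    return ref


pointer = bibliographic_link("ASTM.E284")
reference = bibliographic_link("HFdn1983", "p. 383")
print(ET.tostring(pointer, encoding="unicode"))
print(ET.tostring(reference, encoding="unicode"))
```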

If the full bibliographical citation is included in the termEntry itself, linking elements are unnecessary and the citation can be marked using the bibl, biblStruct, or biblFull elements. For further discussion of bibliographic citations and references, see section .

Overall Structure of Terminological Documents

To enable the base tag set for terminology, a parameter entity TEI.terminology must be declared within the document type subset, the value of which is INCLUDE, as further described in section . A document using this base tag set and no other additional tag sets will thus begin with a document type declaration whose subset contains the declaration <!ENTITY % TEI.terminology 'INCLUDE'>. This declaration makes available all of the elements described in this chapter, in addition to the core elements described in chapter . The default structure for terminological documents is similar to that defined by chapter : within the TEI.2 element they contain a teiHeader and a text. The text element, in turn, contains as usual a body element, optionally preceded by a front and followed by a back. The body may contain a series of termEntry elements, which may optionally be grouped into sections tagged with the same elements (div, div0, div1, etc.) as defined in section .

text: contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample.

body: contains the whole body of a single unitary text, excluding any front or back matter.

div: contains a subdivision of the front, body, or back of a text.

div0: contains the largest possible subdivision of the body of a text.

div1: contains a first-level subdivision of the front, body, or back of a text (the largest, if div0 is not used, the second largest if it is).

div2: contains a second-level subdivision of the front, body, or back of a text.

div3: contains a third-level subdivision of the front, body, or back of a text.

div4: contains a fourth-level subdivision of the front, body, or back of a text.

div5: contains a fifth-level subdivision of the front, body, or back of a text.

div6: contains a sixth-level subdivision of the front, body, or back of a text.

div7: contains the smallest possible subdivision of the front, body or back of a text, larger than a paragraph.

In order to support both the flat and the nested styles of markup, three distinct DTD fragments for terminology are provided: teiterm2, teite2n, and teite2f.

In file teiterm2.dtd, the top-level elements for the terminology base are defined, and a subordinate parameter entity, termtags is defined and referred to. By default, this entity refers to file teite2n.dtd, which defines the DTD for nested markup; if the flat style of markup is to be used, the document's DTD subset should define termtags as referring to the file teite2f.dtd, as shown in the examples in section .

[DTD fragment not preserved; it refers to the parameter entities %TEI.structure.dtd; and %termtags;]

In file teiterm2.ent, terminology-specific extensions to the TEI element class system are defined, including the classes terminology, comp.terminology, terminologyInclusions, and terminologyMisc.

DTD Fragment for Nested Style

File teite2n.dtd contains the definitions of the elements used in the nested markup style.

DTD Fragment for Flat Style

File teite2f.dtd contains the definitions that provide support for the flat markup style.

Additional Examples of Term Entries

The tag set defined in this chapter is designed to accommodate the variety of structures that occur in TDBs; this section shows how the same information may be encoded in different ways, depending on local convenience or preferences. Example 5 gives an entry from an ISO terminological standard. Example 6 treats this English-French equivalent pair as a single nested terminological entry, whereas Example 7 splits the information into two nested entries with cross-references. Example 8 shows the same data as a flat terminological entry with adjacent elements, whereas Example 9 groups the elements according to element type, which requires the use of pointers in order to reconstruct the implicit terminological information group from discontiguous elements.

The interchange of terminological data between TDBs requires an export routine (to E-TIF) and an import routine (from E-TIF). For interchange between unknown partners, it may be desirable to normalize the encoding method rather than allow all the options presented in this section. The effect of normalization would be that import routines become easier to implement while export routines become more difficult to implement. At the time of this publication, work is under way in ISO Technical Committee 37, Subcommittee 3, Working Group 3 on a normalized version of E-TIF called ISO [DIS] 12 200. Some aspects of normalization under consideration are to use only the nested representation and avoid the use of the following options: divisions within the body, the otherForm element, the group and depend attributes, elements before the term element in a tig, inclusion exceptions other than ptr and xptr, and paragraph content other than #PCDATA in the elements admin and gram.

Example Term Entry from ISO 472

The following term entry is taken from ISO 472:1988, Plastics --- Vocabulary, Bilingual edition (Geneva: ISO, 1988), p. 84. The original uses typographic characteristics to represent different data element types within the term entry, not all of which have been retained in the reproduction of this sample. As prescribed by ISO layout guidelines (ISO 10241, Preparation and layout of international terminology standards, 1993), the original text is printed in Helvetica, with English and French information presented in two parallel columns; head terms appear in bold face, notes in a smaller font size than the main text, and terms referred to in the cross references are printed in italics.

The entirety of all deleterious chemical modifications of plastic at elevated temperature.

NOTE --- It is essential to report the temperature and other environmental conditions at which the phenomenon is studied.

See also ageing, degradation and deterioration.

Ensemble de toutes les modifications chimiques nuisibles d'un plastique à température élevée.

NOTE --- Il est essentiel d'indiquer la température et les autres conditions d'environnement dans lesquelles le phénomène est étudié.

Voir aussi vieillissement, dégradation et détérioration.

The Example Treated as a Single Term Entry in Nested Form

This treatment assumes that both the English and French terms are treated together in the same entry. The elements grouped together at the top of the term entry apply to the entire entry. Only the first of the three cross-referenced terms is included in this example; it is represented by a ptr link which targets a term entry (related concept) contained in the same document. The id values used here are purely arbitrary.

[markup not preserved; element content only]
plastics
p. 84
thermal degradation n
The entirety of all deleterious chemical modifications of plastic at elevated temperature.
It is essential to report the temperature and other environmental conditions at which the phenomenon is studied.
décomposition thermique n f
Ensemble de toutes les modifications chimiques nuisibles d'un plastique à température élevée.
Il est essentiel d'indiquer la température et les autres conditions d'environnement dans lesquelles le phénomène est étudié.
ageing ... vieillissement ...

The Example Treated as Two Separate Term Entries in Nested Form

This example takes cognizance of the fact that some TDBs treat each term in a single termEntry instead of grouping all the information for a single concept into a single termEntry. The rationale behind this approach is frequently that no two languages truly provide harmonized concepts, although in the case of standardized terminology it can generally be assumed that concepts have been harmonized. The significant difference in encoding that occurs in this type of system is that ptr linking elements are required more frequently to link to term equivalents and related terms in other entries in the same document. Since there is only one tig in each entry, the ptr element could come at the beginning, as shown in the previous example, or inside the tig as shown below.

[markup not preserved; element content only]
plastics
p. 84
The entirety of all deleterious chemical modifications of plastic at elevated temperature.
It is essential to report the temperature and other environmental conditions at which the phenomenon is studied.

plastics
p. 84
décomposition thermique n f
Ensemble de toutes les modifications chimiques nuisibles d'un plastique à température élevée.
Il est essentiel d'indiquer la température et les autres conditions d'environnement dans lesquelles le phénomène est étudié.
ageing ... vieillissement ...

The Example Treated as a Flat Term Entry Using Adjacency Rules

This version of Example 5 uses a flat style of encoding, following the pattern of many existing TDBs; elements associated with a given term follow it immediately:

[markup not preserved; element content only]
plastics
p. 84
thermal degradation n
The entirety of all deleterious chemical modifications of plastic at elevated temperature.
It is essential to report the temperature and other environmental conditions at which the phenomenon is studied.
décomposition thermique n f
Ensemble de toutes les modifications chimiques nuisibles d'un plastique à température élevée.
Il est essentiel d'indiquer la température et les autres conditions d'environnement dans lesquelles le phénomène est étudié.
ageing ... vieillissement ...

The Example Treated as a Flat Term Entry Not Using Adjacency Rules

Many translation-oriented terminologists who work with half-screen popup windows prefer the following layout because it enables them to see the various term options at the top part of their display window without having to scroll into the body of the termEntry. Note in this case that the ref element links the bibliographic information to the entire entry.

[markup not preserved; element content only]
thermal degradation n
décomposition thermique n f
The entirety of all deleterious chemical modifications of plastic at elevated temperature.
Ensemble de toutes les modifications chimiques nuisibles d'un plastique à température élevée.
It is essential to report the temperature and other environmental conditions at which the phenomenon is studied.
Il est essentiel d'indiquer la température et les autres conditions d'environnement dans lesquelles le phénomène est étudié.
plastics
p. 84
ageing ... vieillissement ...