NSGMLS(1)                                                            NSGMLS(1)

NAME
       nsgmls - a validating SGML parser

       An SGML System Conforming to International Standard ISO 8879 -- Standard Generalized Markup Language

SYNOPSIS
       nsgmls [ -deglprsuvx ] [ -alinktype ] [ -ffile ] [ -iname ] [ -mfile ] [ -tfile ] [ -wwarning_type ] [ filename... ]

DESCRIPTION
       Nsgmls parses and validates the SGML document entity in filename... and prints on the standard output a simple text representation of its Element Structure Information Set. (This is the information set which a structure-controlled conforming SGML application should act upon.) Note that the document entity may be spread among several files; for example, the SGML declaration, document type declaration and document instance set could each be in a separate file. If no filenames are specified, then nsgmls will read the document entity from the standard input.

       Each filename is actually interpreted as a system identifier. A command line filename of - can be used to refer to the standard input. (Normally in a system identifier, fd:0 is used to refer to standard input.)

       The following options are available:

       -alinktype
              Make link type linktype active. Not all ESIS information is output in this case: the active LPDs are not explicitly reported, although each link attribute is qualified with its link type name; there is no information about result elements; when there are multiple link rules applicable to the current element, nsgmls always chooses the first.

       -d     Warn about duplicate entity declarations.

       -e     Describe open entities in error messages. Error messages always include the position of the most recently opened external entity.

       -ffile Redirect errors to file. This is useful mainly with shells that do not support redirection of stderr.

       -g     Show the GIs of open elements in error messages.

       -iname Pretend that

                     <!ENTITY % name "INCLUDE">

              occurs at the start of the document type declaration subset in the SGML document entity.
              Since repeated definitions of an entity are ignored, this definition will take precedence over any other definitions of this entity in the document type declaration. Multiple -i options are allowed. If the SGML declaration replaces the reserved name INCLUDE then the new reserved name will be the replacement text of the entity. Typically the document type declaration will contain

                     <!ENTITY % name "IGNORE">

              and will use %name; in the status keyword specification of a marked section declaration. In this case the effect of the option will be to cause the marked section not to be ignored.

       -l     Output L commands giving the current line number and filename.

       -mfile Map public identifiers and entity names to system identifiers using the catalog entry file whose system identifier is file. Multiple -m options are allowed. Catalog entry files specified with the -m option will be searched before the defaults.

       -p     Parse only the prolog. Nsgmls will exit after parsing the document type declaration. Implies -s.

       -r     Warn about defaulted references.

       -s     Suppress output. Error messages will still be printed.

       -tfile Output to file the RAST result as defined by ISO/IEC 13673:1995 (actually this isn't quite an IS yet; this implements the Intermediate Editor's Draft of 1994/08/29, with changes to implement ISO/IEC JTC1/SC18/WG8 N1777). The normal output is not produced.

       -u     Warn about undefined elements: elements used in the DTD but not defined.

       -v     Print the version number.

       -wwarning_type
              Give warnings according to the value of warning_type:

              mixed  Warn about mixed content models that do not allow #pcdata anywhere.

              sgmldecl
                     Warn about various dubious constructions in the SGML declaration.

              should Warn about various recommendations made in ISO 8879 that the document does not comply with. (Recommendations are expressed with ``should'', as distinct from requirements, which are usually expressed with ``shall''.)

              default
                     Warn about defaulted references. (Same as -r.)
              duplicate
                     Warn about duplicate entity declarations. (Same as -d.)

              undefined
                     Warn about undefined elements: elements used in the DTD but not defined. (Same as -u.)

              all    Give all available warnings.

              Multiple -w options are allowed.

       -x     Suppress the check that for each ID reference value there is an element with that ID.

       -X     If the -t option is being used, do not give an error when a character that is not a significant character in the reference concrete syntax occurs in a literal in the SGML declaration. This may be useful in conjunction with certain buggy test suites.

   External entities
       An external entity resides in one or more storage objects, each of which contains a sequence of bytes. The entity manager component of nsgmls maps a sequence of storage objects into an entity as follows:

       1.     The bytes in each storage object are converted into characters, each represented by a single bit combination, according to the encoding translation associated with the storage object.

       2.     The characters in each storage object are concatenated.

       3.     The sequence of characters is treated as a sequence of lines, each terminated by a line terminator. The line terminator is either a line feed, a carriage return, or a carriage return followed by a line feed. Nsgmls determines which line terminator to use for a storage object according to which of the possible line terminators is used for the first line of the storage object. A record start is inserted at the beginning of each line, and a record end at the end of each line. If there is a partial line (a line that doesn't end with the line terminator) at the end of the entity, then a record start will be inserted before it but no record end will be inserted after it.

       An encoding translation defines a translation between the storage coding system and the entity coding system. The storage coding system represents characters by sequences of bytes; it can be variable width and stateful.
       The entity coding system represents each character by a single bit combination; it is fixed-width (but not limited to 8 bits) and stateless. Note that the SGML declaration describes the entity coding system, not the storage coding system.

   System identifiers
       A system identifier describes a sequence of storage objects, each optionally associated with an encoding translation. Nsgmls will attempt to interpret a system identifier as a keyword followed by a colon followed by a string, which is interpreted in a keyword-dependent way. Keywords are case-insensitive. The following keywords are recognized:

       file   The string is interpreted as a filename. The system identifier describes a single storage object that will be read from the named file.

       fd     The string is interpreted as a number. The system identifier describes a single storage object that will be read from the file descriptor with that number. For example, fd:0 will read the storage object from standard input.

       concat The string is treated as a list of substrings separated by + characters. Each of the substrings is in turn interpreted as a system identifier, and the sequences of storage objects that each denotes are concatenated. The concat system identifier describes the resulting sequence of storage objects.

       http   The string together with the http: prefix is treated as a URL. This is implemented only under Unix.

       utf8   The string is interpreted as a system identifier. Each storage object that it describes that is not associated with an encoding translation is associated with an encoding translation that translates UTF8 to a fixed-width encoding. Invalid multi-byte sequences are represented by the character 0xFFFD. This keyword is recognized only in the multi-byte version of nsgmls.

       replace
              The string is interpreted as a system identifier.
              Numeric character references using the SGML reference concrete syntax will be recognized and replaced within each storage object identifier occurring in the system identifier.

       ucs2   The string is interpreted as a system identifier. Each storage object that it describes that is not associated with an encoding translation is associated with an encoding translation that translates UCS2 to a fixed-width encoding. The more significant octet of each character always precedes the less significant octet irrespective of the system's native byte order. The codes 0xFFFE and 0xFEFF are not treated specially in any way. This keyword is recognized only in the multi-byte version of nsgmls.

       unicode
              The string is interpreted as a system identifier. Each storage object that it describes that is not associated with an encoding translation is associated with an encoding translation that translates the Unicode coding system to a fixed-width encoding. The Unicode coding system treats each pair of octets as a character in the system's byte order. If the first character is the byte order mark character (0xFEFF), it will be discarded. (This is necessary to avoid problems with the SGML document entity: a byte order mark before the SGML declaration would be a syntax error.) If the first character is the byte order mark character byte-swapped, it will be discarded and the remaining characters will be byte-swapped. This keyword is recognized only in the multi-byte version of nsgmls.

       ujis   The string is interpreted as a system identifier. Each storage object that it describes that is not associated with an encoding translation is associated with an encoding translation where the storage coding system is variable-width (packed) UJIS (EUC), and the entity coding system represents each character in the same way as the EUC complete two-byte format.
              In the entity coding system the code of characters in the G0 set (usually the Japanese version of ISO 646) is unchanged; the code of characters in the G1 set (usually JIS X 0208-1990) is ORed with 0x8080; the code of characters in the G2 set (usually half-width katakana from JIS X 0201-1986) is ORed with 0x0080; the code of characters in the G3 set (JIS X 0212-1990) is ORed with 0x8000. This keyword is recognized only in the multi-byte version of nsgmls.

       sjis   The string is interpreted as a system identifier. Each storage object that it describes that is not associated with an encoding translation is associated with an encoding translation where the storage coding system is Shift JIS and the entity coding system is the same as with the ujis encoding translation (except for characters in the G3 set, which are not representable using Shift JIS). This keyword is recognized only in the multi-byte version of nsgmls.

       identity
              The string is interpreted as a system identifier. Each storage object that it describes that is not associated with an encoding translation is associated with the identity encoding translation. The identity coding system converts bytes to characters by zero-extending each character.

       raw    The string is interpreted as a system identifier. No translation of line terminators onto RS and RE characters will be performed for each storage object that it describes. Error messages referring to these storage objects will not contain line numbers.

       cooked The string is interpreted as a system identifier. This undoes the effect of any earlier raw keyword.

       huge   This keyword is intended for use with huge files, for which the cost of keeping track of line boundaries (roughly one byte per line) is too large. The string is interpreted as a system identifier. For each storage object that it describes, nsgmls will not keep track of where line boundaries occur as it usually does.
              Error messages referring to these storage objects will not contain line numbers.

       If a system identifier does not contain a keyword or uses a keyword that is not recognized, then the system identifier will be treated as a filename.

       Note that the system identifier file:utf8:doc.sgm identifies the file named utf8:doc.sgm, but utf8:file:doc.sgm identifies the file named doc.sgm using the utf8 coding scheme.

       A relative filename in a system identifier is interpreted relative to the file in which the system identifier is specified, if any, and otherwise relative to the current directory. This applies both to system identifiers specified in SGML documents and to system identifiers specified in catalog entry files.

       If a system identifier does not specify the encoding translation, the encoding translation of the storage object in which the system identifier was specified will be used.

       The raw keyword will be implied for an NDATA entity and for a system identifier defined in a storage object that was raw. This can be overridden using the cooked keyword.

   System identifier generation
       If a system identifier is not specified, then the entity manager will attempt to generate one using catalog entry files in the format defined in the SGML Open Draft Technical Resolution on Entity Management. A catalog entry file contains a sequence of entries in one of the following forms:

       PUBLIC pubid sysid
              This specifies that sysid should be used as the system identifier if the public identifier is pubid. Sysid is a system identifier as defined in ISO 8879 and pubid is a public identifier as defined in ISO 8879.

       ENTITY name sysid
              This specifies that sysid should be used as the system identifier if the entity is a general entity whose name is name.

       ENTITY %name sysid
              This specifies that sysid should be used as the system identifier if the entity is a parameter entity whose name is name. Note that there is no space between the % and the name.
       DOCTYPE name sysid
              This specifies that sysid should be used as the system identifier if the entity is an entity declared in a document type declaration whose document type name is name.

       LINKTYPE name sysid
              This specifies that sysid should be used as the system identifier if the entity is an entity declared in a link type declaration whose link type name is name.

       OVERRIDE
              This specifies that system identifiers specified in the catalog should override system identifiers specified in the document. Normally, if an entity declaration in the document specifies a system identifier, the catalog is not consulted. If OVERRIDE is specified, then the catalog is searched first; the system identifier specified in the document is used only if no match is found in the catalog.

       SGMLDECL sysid
              This specifies that if the document does not contain an SGML declaration, the SGML declaration in sysid should be implied.

       The last four forms are extensions to the SGML Open format. The delimiters can be omitted from the sysid provided it does not contain any white space. Comments, delimited by -- as in SGML, are allowed between parameters. The environment variable SGML_CATALOG_FILES contains a list of catalog entry files. The list is separated by colons under Unix and by semicolons under MSDOS. These will be searched after any catalog entry files specified using the -m option. If this environment variable is not set, then a system-dependent list of catalog entry files will be used. A match in a catalog entry file for a PUBLIC entry will take precedence over a match in the same file for an ENTITY, DOCTYPE or LINKTYPE entry.
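The catalog lookup described above can be sketched in Python. This is a minimal illustration under stated assumptions, not nsgmls's actual implementation: the names parse_catalog and resolve are hypothetical, only one catalog is consulted, keywords are accepted only in the upper-case spelling shown above, and when an entry is repeated the first one is kept (an assumption; the text above does not say). The sample catalog and every identifier in it are made up.

```python
import re

# A token is a comment (-- ... --), a quoted literal, or a bare word.
_TOKEN = re.compile(r"--.*?--|\"[^\"]*\"|'[^']*'|\S+", re.S)

# Number of parameters that follow each keyword.
_ARITY = {"PUBLIC": 2, "ENTITY": 2, "DOCTYPE": 2, "LINKTYPE": 2,
          "SGMLDECL": 1, "OVERRIDE": 0}


def _unquote(tok):
    """Strip matching literal delimiters, if present."""
    if len(tok) >= 2 and tok[0] == tok[-1] and tok[0] in "\"'":
        return tok[1:-1]
    return tok


def parse_catalog(text):
    """Parse catalog text into a dict keyed by entry keyword (hypothetical helper)."""
    toks = [_unquote(t) for t in _TOKEN.findall(text) if not t.startswith("--")]
    cat = {"PUBLIC": {}, "ENTITY": {}, "DOCTYPE": {}, "LINKTYPE": {},
           "SGMLDECL": None, "OVERRIDE": False}
    i = 0
    while i < len(toks):
        kw = toks[i]
        if kw not in _ARITY:
            raise ValueError(f"unexpected token {kw!r}")
        args = toks[i + 1:i + 1 + _ARITY[kw]]
        if len(args) != _ARITY[kw]:
            raise ValueError(f"{kw} entry is incomplete")
        if kw == "OVERRIDE":
            cat["OVERRIDE"] = True
        elif kw == "SGMLDECL":
            cat["SGMLDECL"] = args[0]
        else:
            # Assumption of this sketch: the first entry for a key is kept.
            cat[kw].setdefault(args[0], args[1])
        i += 1 + _ARITY[kw]
    return cat


def resolve(cat, kind, name, pubid=None, sysid=None):
    """Pick a system identifier; kind is "ENTITY", "DOCTYPE" or "LINKTYPE"."""
    if sysid is not None and not cat["OVERRIDE"]:
        return sysid                     # catalog is not consulted
    if pubid is not None and pubid in cat["PUBLIC"]:
        return cat["PUBLIC"][pubid]      # PUBLIC match takes precedence
    if name in cat[kind]:
        return cat[kind][name]
    return sysid                         # fall back to the document's own sysid


EXAMPLE = '''
-- sample catalog; every identifier here is made up --
PUBLIC "-//Example//DTD Memo//EN" memo.dtd
ENTITY %chars "chars.ent"
DOCTYPE memo memo-by-name.dtd
SGMLDECL memo.dcl
'''

cat = parse_catalog(EXAMPLE)
print(resolve(cat, "DOCTYPE", "memo", pubid="-//Example//DTD Memo//EN"))
print(resolve(cat, "DOCTYPE", "memo"))
```

With the sample catalog, the first lookup is answered by the PUBLIC entry and the second by the DOCTYPE entry. Searching several catalog entry files in order (those given with -m, then SGML_CATALOG_FILES) would amount to repeating the lookup per file, which the sketch leaves out.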
   System declaration
       The system declaration for nsgmls is as follows:

                                   SYSTEM "ISO 8879:1986"
                                           CHARSET
       BASESET  "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
       DESCSET  0 128 0
       CAPACITY PUBLIC  "ISO 8879:1986//CAPACITY Reference//EN"
                                          FEATURES
       MINIMIZE DATATAG NO        OMITTAG  YES     RANK     YES   SHORTTAG YES
       LINK     SIMPLE  YES 65535 IMPLICIT YES     EXPLICIT YES 1
       OTHER    CONCUR  NO        SUBDOC   YES 100 FORMAL   YES
       SCOPE    DOCUMENT
       SYNTAX   PUBLIC  "ISO 8879:1986//SYNTAX Reference//EN"
       SYNTAX   PUBLIC  "ISO 8879:1986//SYNTAX Core//EN"
                                          VALIDATE
                GENERAL YES       MODEL    YES     EXCLUDE  YES   CAPACITY NO
                NONSGML YES       SGML     YES     FORMAL   YES
                                            SDIF
                PACK    NO        UNPACK   NO

       The limit for the SUBDOC parameter is memory dependent.

       Any legal concrete syntax may be used.

   SGML declaration
       The SGML declaration may be omitted; the following declaration will be implied:

       with the exception that characters 160 through 254 will be assigned to DATACHAR.

       A character in a base character set is described either by giving its number in a universal character set, or by specifying a minimum literal. The constraints on the choice of universal character set are that characters that are significant in the SGML reference concrete syntax must be in the universal character set and must have the same number in the universal character set as in ISO 646; that each character in the character set must be represented by exactly one number; and that character numbers in the ranges 0 to 31 and 127 to 159 are control characters (for the purpose of enforcing SHUNCHAR CONTROLS). It is recommended that ISO 10646 (Unicode) be used as the universal character set, except in environments where the normal document character sets are large character sets which cannot be compactly described in terms of ISO 10646. The public identifier of a base character set can be associated with an entity that describes it by using a PUBLIC entry in the catalog entry file.
       The entity must be a fragment of an SGML declaration consisting of the portion of a character set description following the DESCSET keyword; that is, it must be a sequence of character descriptions, where each character description specifies a described character number, the number of characters, and either a character number in the universal character set, a minimum literal or the keyword UNUSED. Character numbers in the universal character set can be as big as 99999999.

       In addition nsgmls has built-in knowledge of a few character sets. These are identified using the designating sequence in the public identifier. The following designating sequences are recognized:

       Designating     ISO            Minimum     Number
       Escape          Registration   Character   of
       Sequence        Number         Number      Characters   Description
       ------------------------------------------------------------------------------
       ESC 2/5 4/0     -              0           128          full set of ISO 646 IRV
       ESC 2/8 4/0     2              0           128          G0 set of ISO 646 IRV
       ESC 2/8 4/2     6              0           128          G0 set of ASCII
       ESC 2/1 4/0     1              0           32           C0 set of ISO 646

       The graphic character sets do not strictly include C0 and C1 control character sets. For convenience, nsgmls augments the graphic character sets with the appropriate control character sets.

       It is not necessary for every character set used in the SGML declaration to be known to nsgmls, provided that characters in the document character set that are significant both in the reference concrete syntax and in the described concrete syntax are described using known base character sets, and that characters that are significant in the described concrete syntax are described using the same base character sets or the same minimum literals in both the document character set description and the syntax reference character set description.

       The public identifier for a public concrete syntax can be associated with an entity that describes it by using a PUBLIC entry in the catalog entry file.
       The entity must be a fragment of an SGML declaration consisting of a concrete syntax description starting with the SHUNCHAR keyword as in an SGML declaration. The entity can also make use of the following extensions:

              An added function can be expressed as a parameter literal instead of a name.

              The replacement for a reference reserved name can be expressed as a parameter literal instead of a name.

              The LCNMSTRT, UCNMSTRT, LCNMCHAR and UCNMCHAR keywords may each be followed by more than one parameter literal. A sequence of parameter literals has the same meaning as a single parameter literal whose content is the concatenation of the content of each of the literals in the sequence. This extension is useful because of the restriction on the length of a parameter literal in the SGML declaration to 240 characters.

              The total number of characters specified for UCNMCHAR or UCNMSTRT may exceed the total number of characters specified for LCNMCHAR or LCNMSTRT respectively. Each character in UCNMCHAR or UCNMSTRT which does not have a corresponding character in the same position in LCNMCHAR or LCNMSTRT is simply assigned to UCNMCHAR or UCNMSTRT without making it the upper-case form of any character.

              A parameter literal following any of the LCNMSTRT, UCNMSTRT, LCNMCHAR and UCNMCHAR keywords may be followed by the name token ... and another parameter literal. This has the same meaning as the two parameter literals with a parameter literal in between containing in order each character whose number is greater than the number of the last character in the first parameter literal and less than the number of the first character in the second parameter literal. A parameter literal must contain at least one character for each ... to which it is adjacent.
              A number may be used as a parameter following the LCNMSTRT, UCNMSTRT, LCNMCHAR and UCNMCHAR keywords or as a delimiter in the DELIM section, with the same meaning as a parameter literal containing just a numeric character reference with that number.

              The parameters following the LCNMSTRT, UCNMSTRT, LCNMCHAR and UCNMCHAR keywords may be omitted. This has the same meaning as specifying an empty parameter literal.

              Within the specification of the short reference delimiters, a parameter literal containing exactly one character may be followed by the name token ... and another parameter literal containing exactly one character. This has the same meaning as a sequence of parameter literals, one for each character number that is greater than or equal to the number of the character in the first parameter literal and less than or equal to the number of the character in the second parameter literal.

       The public identifier for a public capacity set can be associated with an entity that describes it by using a PUBLIC entry in the catalog entry file. The entity must be a fragment of an SGML declaration consisting of a sequence of capacity names and numbers.

   Output format
       The output is a series of lines. Lines can be arbitrarily long. Each line consists of an initial command character and one or more arguments. Arguments are separated by a single space, but when a command takes a fixed number of arguments the last argument can contain spaces. There is no space between the command character and the first argument. Arguments can contain the following escape sequences.

       \\     A \.

       \n     A record end character.

       \|     Internal SDATA entities are bracketed by these.

       \nnn   The character whose code is nnn octal.

       A record start character will be represented by \012. Most applications will need to ignore \012 and translate \n into newline.

       The possible command characters and arguments are as follows:

       (gi    The start of an element whose generic identifier is gi.
              Any attributes for this element will have been specified with A commands.

       )gi    The end of an element whose generic identifier is gi.

       -data  Data.

       &name  A reference to an external data entity name; name will have been defined using an E command.

       ?pi    A processing instruction with data pi.

       Aname val
              The next element to start has an attribute name with value val, which takes one of the following forms:

              IMPLIED
                     The value of the attribute is implied.

              CDATA data
                     The attribute is character data. This is used for attributes whose declared value is CDATA.

              NOTATION nname
                     The attribute is a notation name; nname will have been defined using an N command. This is used for attributes whose declared value is NOTATION.

              ENTITY name...
                     The attribute is a list of general entity names. Each entity name will have been defined using an I, E or S command. This is used for attributes whose declared value is ENTITY or ENTITIES.

              TOKEN token...
                     The attribute is a list of tokens. This is used for attributes whose declared value is anything else.

       Dename name val
              This is the same as the A command, except that it specifies a data attribute for an external entity named ename. Any D commands will come after the E command that defines the entity to which they apply, but before any & or A commands that reference the entity.

       atype name val
              The next element to start has a link attribute with link type type, name name, and value val, which takes the same form as with the A command.

       Nnname Define a notation nname. This command will be preceded by a p command if the notation was declared with a public identifier, and by an s command if the notation was declared with a system identifier. A notation will only be defined if it is to be referenced in an E command or in an A command for an attribute with a declared value of NOTATION.

       Eename typ nname
              Define an external data entity named ename with type typ (CDATA, NDATA or SDATA) and notation nname.
              This command will be preceded by one or more f commands giving the filenames generated by the entity manager from the system and public identifiers, by a p command if a public identifier was declared for the entity, and by an s command if a system identifier was declared for the entity. nname will have been defined using an N command. Data attributes may be specified for the entity using D commands. An external data entity will only be defined if it is to be referenced in an & command or in an A command for an attribute whose declared value is ENTITY or ENTITIES.

       Iename typ text
              Define an internal data entity named ename with type typ (CDATA or SDATA) and entity text text. An internal data entity will only be defined if it is referenced in an A command for an attribute whose declared value is ENTITY or ENTITIES.

       Sename Define a subdocument entity named ename. This command will be preceded by one or more f commands giving the filenames generated by the entity manager from the system and public identifiers, by a p command if a public identifier was declared for the entity, and by an s command if a system identifier was declared for the entity. A subdocument entity will only be defined if it is referenced in a { command or in an A command for an attribute whose declared value is ENTITY or ENTITIES.

       ssysid This command applies to the next E, S or N command and specifies the associated system identifier.

       ppubid This command applies to the next E, S or N command and specifies the associated public identifier.

       ffilename
              This command applies to the next E or S command and specifies the effective system identifier. The effective system identifier is the system identifier generated by the system from the specified external identifier and other information about the entity.

       {ename The start of the SGML subdocument entity ename; ename will have been defined using an S command.

       }ename The end of the SGML subdocument entity ename.
       Llineno file
       Llineno
              Set the current line number and filename. The filename argument will be omitted if only the line number has changed. This will be output only if the -l option has been given.

       #text  An APPINFO parameter of text was specified in the SGML declaration. This is not strictly part of the ESIS, but a structure-controlled application is permitted to act on it. No # command will be output if APPINFO NONE was specified. A # command will occur at most once, and may be preceded only by a single L command.

       C      This command indicates that the document was a conforming SGML document. If this command is output, it will be the last command. An SGML document is not conforming if it references a subdocument entity that is not conforming.

ENVIRONMENT
       NSGMLS_CODE
              If this is set to the name of an encoding translation, then that encoding translation will be used as the default encoding translation for everything (including file input, file output, message output, filenames and command line arguments). Otherwise the identity encoding translation will be used. Setting this to ucs2 or unicode is unlikely to give reasonable results.

SEE ALSO
       The SGML Handbook, Charles F. Goldfarb
       ISO 8879 (Standard Generalized Markup Language), International Organization for Standardization

BUGS
       Not all ESIS information for LINK is reported.

AUTHOR
       James Clark (jjc@jclark.com).
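
EXAMPLES
       The line-oriented output described under "Output format" is straightforward to consume from another program. The Python sketch below is an illustration only, under stated assumptions: unescape and parse_esis are hypothetical names, the event tuples are an invention of this sketch, the sample lines are hand-written rather than captured from a real nsgmls run, and the sketch chooses to drop the \| brackets around SDATA text and to ignore record starts (\012), as the text above suggests most applications will. Attribute values are kept as a single string without splitting the value into its type-specific parts beyond the type keyword.

```python
import re

# Escape sequences listed under "Output format": \\  \n  \|  \nnn (octal).
_ESC = re.compile(r"\\(\\|n|\||[0-7]{3})")


def unescape(arg):
    """Decode the escape sequences in one output argument (hypothetical helper)."""
    def repl(match):
        esc = match.group(1)
        if esc == "\\":
            return "\\"
        if esc == "n":
            return "\n"              # record end becomes newline
        if esc == "|":
            return ""                # choice of this sketch: drop SDATA brackets
        if esc == "012":
            return ""                # record start is ignored
        return chr(int(esc, 8))
    return _ESC.sub(repl, arg)


def parse_esis(lines):
    """Yield simple event tuples from an iterable of output lines."""
    pending = {}                     # attributes for the next element to start
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        cmd, rest = line[0], line[1:]
        if cmd == "A":
            # "Aname val": val starts with a type keyword such as CDATA or IMPLIED.
            name, kind, *value = rest.split(" ", 2)
            if kind == "IMPLIED":
                pending[name] = None
            else:
                pending[name] = unescape(value[0] if value else "")
        elif cmd == "(":
            yield ("start", rest, pending)
            pending = {}
        elif cmd == ")":
            yield ("end", rest)
        elif cmd == "-":
            yield ("data", unescape(rest))
        elif cmd == "C":
            yield ("conforming",)
        else:
            yield ("other", cmd, rest)   # remaining commands are passed through


# Hand-written sample lines, not real nsgmls output.
sample = [
    "ATITLE CDATA Hello world",
    "ADRAFT IMPLIED",
    "(MEMO",
    r"-First line\n\012Second \\ line",
    ")MEMO",
    "C",
]

events = list(parse_esis(sample))
for event in events:
    print(event)
```

       A real consumer would typically read the lines from the standard output of an nsgmls process and would also act on the entity and notation commands (E, I, S, N, s, p, f), which this sketch only passes through.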