=========================================================================
Date:         Tue, 5 Feb 91 10:56:20 GMT
Reply-To:     Text Encoding Initiative public discussion list
Sender:       Text Encoding Initiative public discussion list
From:         DEL2@PHOENIX.CAMBRIDGE.AC.UK
Subject:      Unicode 1.0

I recently requested a copy of the draft spec of the Unicode 1.0 character
encoding. Although not able to give it all the time I'd have liked, my brief
look does raise a number of comments. I'm grateful to have the opportunity
to plug my comments into the general discussion (via TEI, HUMANIST and the
UNICODE team themselves: microsoft!asmusf@uunet.uu.net).

(a) There are a number of significant typos; is anyone keeping a master
record of these?

(b) Robin Cover has raised the question of why there are not separate
encodings for Hebrew SIN and SHIN. They are certainly at least as distinct
as, say, LATIN E followed by ACUTE and LATIN E ACUTE. I take it that the
reason the latter case has two encodings is previous ISO encodings; but
since those are in any case ASCII encodings (and Unicode is intended as a
replacement for ASCII), how relevant is that?

The question also raises a more fundamental problem in my mind. There are a
number of situations where a glyph (or conglomerate of glyphs) can
reasonably be encoded in alternative ways; HYPHEN (U+2010 = U+002d) would be
a case in point. We are told that some of these redundancies are there so
that natural pairing can be used "if desired" (page 6). However, these coded
pairs are not consistently provided (e.g. CAPITAL DOTTED I). What worries me
is that two encodings of an identical text may thus turn out to be very
different; and for anyone using computer comparison of texts this could be
quite problematic. So against those who complained that, e.g., separate
codings for GREEK ALPHA+GRAVE are not available, I would voice the opposite
disquiet: the encodings are too comprehensive.
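[A modern illustration of the comparison problem raised above. The Python
unicodedata module and the NFD normalization form are later developments
(they did not exist when this message was written) and are assumed here only
to show how alternative encodings of the same text can be reconciled before
comparison:]

```python
import unicodedata

# Two encodings of the same visible text: precomposed vs. decomposed.
precomposed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"        # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Naive code-for-code comparison reports the identical texts as different:
assert precomposed != decomposed

# Normalizing both sides to a canonical form makes them compare equal:
assert (unicodedata.normalize("NFD", precomposed)
        == unicodedata.normalize("NFD", decomposed))

# Canonical ordering also resolves accent-order ambiguity: combining marks
# with distinct combining classes are sorted into a fixed order, so
# differently ordered but equivalent sequences normalize identically.
a = "a\u0316\u0301"   # a + COMBINING GRAVE ACCENT BELOW + COMBINING ACUTE
b = "a\u0301\u0316"   # the same two marks, in the other order
assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```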
If ALL accentuation were added as a separate code, I think comparison of
texts would be easier. The ordering of the accents would then of course be
important, and I don't think the algorithm given (centre-out) is terribly
helpful; which is nearest the centre in GREEK ROUGH BREATHING+ACUTE+IOTA
SUBSCRIPT? Wouldn't an additional algorithm (clockwise starting at twelve
o'clock) be useful?

(c) While we're on Greek, I couldn't find a Greek semicolon (raised dot).
Maybe I just didn't look hard enough, but full punctuation would be useful.
But see my comment (e) below. I also failed to locate LATIN CAPITAL LETTER
WYNN.

(d) In general I approve of the policy that by adding the special Coptic
forms to the Greek alphabet one can generate Coptic text, with hard copy
generated by choosing an appropriate font. (And mutatis mutandis for other
languages.) However, there are some drawbacks to this policy; I foresee the
following problems:

(i) It may be necessary to indicate to someone (if only the compositor)
where to change font. Could a coding for change-of-language be incorporated?

(ii) In some Greek texts it may be important to indicate where ligatures are
used; there seems to be no way in this encoding to distinguish between GREEK
KAPPA + GREEK ALPHA + GREEK IOTA on the one hand and the ligature which
stood for "kai" on the other.

I am sometimes in the position of needing to say (as indeed the authors of
the manual were) something like "There are three possible forms of LATIN
SMALL LETTER G CEDILLA (U+0123) and they look like ...". How could I encode
my ellipsis? Could the whole of the manual as printed be sensibly encoded in
Unicode? Oddly, there are some forms which are exclusively graphic variants
(i.e. one would not find them together in a "natural" text) which do attract
separate codings; GREEK SMALL LETTER SCRIPT THETA, for instance. Perhaps
consistency is unattainable, but to me it is a desideratum.

(e) The encoding of special numerals seemed odd.
As well as a select group of fractions (thirds, quarters and eighths, I
think), there is the top half of fractional 1/nnn (U+215f). How is its use
envisaged? Wouldn't a generalised "fraction line" be better (let's call it
U+nnnn), so that nnnn is to be interpreted as a fraction? Similarly, Roman
12 (XII) is encoded as U+216b, but 13 (XIII) must be (presumably) U+2169
U+2162. Why not a single code for "roman numerals follow here" (or just use
ROMAN CAPITAL LETTER X &c)? If codes for general *modes* like "Greek font",
"roman numeral", "fraction" were included, then many ambiguities and
problems could be reduced. My Greek semicolon, for instance, could be
"GREEK FONT + ;".

This contribution could be better thought-out, but it was this or nothing.
If the latter seems preferable, please discard!

Sincerely, Douglas de Lacey.
=========================================================================
Date:         Tue, 5 Feb 91 12:52:13 HNE
Reply-To:     Text Encoding Initiative public discussion list
Sender:       Text Encoding Initiative public discussion list
From:         PADROUIN@LAVALVM1.BITNET
Subject:      CIL 92

This circular replaces the one published a few days ago.
cipl92@lavalvm1
========================================================================

XVe Congres international des linguistes
Quebec, Canada, 9-14 aout 1992
Organise par l'Universite Laval avec le concours de l'Association canadienne
de linguistique (ACL) et sous les auspices du Comite international permanent
des linguistes (CIPL)
1ere circulaire / Renseignements generaux

XVth International Congress of Linguists
Quebec City, Canada, August 9-14, 1992
Organized by Laval University in collaboration with the Canadian Linguistic
Association (CLA), under the auspices of the Permanent International
Committee of Linguists (PICL)
1st circular / General information

CIL92
Departement de Langues et Linguistique
Universite Laval
Quebec City, (Que.), G1K 7P4, CANADA
Telephone: (418) 656-5323   FAX: (418) 656-2019
E-Mail: CIPL92@LAVALVM1.BITNET

ANNOUNCEMENT

XVth International Congress of Linguists
Organized by Laval University in collaboration with the Canadian Linguistic
Association (CLA), under the auspices of the Permanent International
Committee of Linguists (PICL)
Quebec City, August 9-14, 1992

General theme of the Congress: "The Survival of Endangered Languages"

Honorary President: Michel Gervais, Rector of Laval University

Organizing Committee:
President: Pierre Auger, Department of Languages and Linguistics, Laval
University
Vice-President: Walter Hirtle, Department of Languages and Linguistics,
Laval University
General Secretary: Silvia Faitelson-Weiser, Department of Languages and
Linguistics, Laval University
Program: Marie Surridge, Past President of the CLA, Department of French
Studies, Queen's University, Kingston, Ontario
Local Arrangements: Jean-Louis Tremblay, Department of Languages and
Linguistics, Laval University
Publications: Conrad Ouellon, Director of CIRAL, Department of Languages and
Linguistics, Laval University

GENERAL INFORMATION

DATE AND LOCATION: August 9-14, 1992, Laval University, Quebec City, Canada

ACCOMMODATION: Hotels in all price ranges and limited
accommodation in university residence halls. For information on
accommodation, contact:

OFFICE DU TOURISME ET DES CONGRES DE LA COMMUNAUTE URBAINE DE QUEBEC
399, Saint-Joseph East Street
Quebec City, Quebec
Canada, G1K 8E2
Tel: (418) 522-3511

PASSPORTS AND VISAS: All visitors to Canada, except residents of the United
States, are required to have a valid passport. Citizens of some countries
are also required to have a visa. All enquiries should be addressed to the
closest Canadian embassy, consulate or high commission.

REGISTRATION FEES:
                        Participants      Accompanying      Students*
                                          Guests
Before 91/04/30:        $160.50 (U.S.)**  $80.25 (U.S.)     $160.50 (U.S.)
                        $187.25 (CAN.)    $107.00 (CAN.)    $187.25 (CAN.)
From 91/05/01
  to 91/12/31:          $214.00 (U.S.)    $107.00 (U.S.)    $160.50 (U.S.)
                        $251.45 (CAN.)    $133.75 (CAN.)    $187.25 (CAN.)
From 92/01/01
  to 92/08/09:          $285.00 (U.S.)    $142.50 (U.S.)    $171.00 (U.S.)
                        $342.00 (CAN.)    $171.00 (CAN.)    $199.50 (CAN.)

Congress fees may be paid by cheque to CIPL92, or by credit card (American
Express, Master Card and Visa). In the event of a cancellation, part of the
registration fee will be refunded (75% before February 28, 1992; 50% from
March 1, 1992 to May 31, 1992). There will be no reimbursement for
cancellations received at the Congress office after May 31, 1992. However,
the persons concerned will be sent Congress registration packets.

* Only participants with an official letter from their universities
certifying their student status will pay the student registration fee.
** All taxes included within the registration fees.
PROGRAM:

Sunday, August 9       Registration; Reception
Monday, August 10      Opening ceremony; Plenary session; Oral
                       presentations; Poster sessions; Panel discussions
Tuesday, August 11     Plenary session; Oral presentations; Poster sessions;
                       Panel discussions
Wednesday, August 12   Excursions
Thursday, August 13    Plenary session; Oral presentations; Poster sessions;
                       Panel discussions
Friday, August 14      Plenary session; Oral presentations; Poster sessions;
                       Panel discussions; Closing ceremony

OFFICIAL LANGUAGES: The languages of the Congress will be Canada's two
official languages, French and English.

PLENARY SESSIONS: As is customary, the topic of each plenary session will be
introduced by three or more speakers. This will be followed by a general
discussion. The topics of these sessions are:
1. Semantics, syntax, pragmatics
2. The word
3. Endangered languages
4. Theoretical approaches to language: the state of the art and prospects
   for the future

PAPERS: Conference papers may take the form of oral presentations or poster
sessions. Oral presentations are scheduled to last twenty minutes, including
a five-minute question period. Participants choosing the poster session will
be allowed two hours. The schedule of papers will be announced in the third
circular. The following is a provisional list of section topics:
1. Sounds, phonemes and intonation
2. The word (morphology, lexicology, lexicography, terminology)
3. The sentence (syntax, function, etc.)
4. Meaning (semantics, lexical meaning, grammatical meaning, etc.)
5. Spoken or written text (pragmatics, discourse analysis, etc.)
6. Language and society (sociolinguistics, linguistic variation, language
   and culture, etc.)
7. Language and the individual (psycholinguistics, neurolinguistics,
   language acquisition, etc.)
8. The history of language
9. Language planning
10. Language learning
11. Survival of endangered languages
12. Theories of language
13. Language and the computer
14. Pidgins and creoles
15. The history of linguistics
16. Methodology (data observation, corpus gathering and processing,
    experimentation)
17. Other (language and women, sign language, etc.)

Participants wishing to present a paper will be requested to send an
abstract before October 1, 1991. See the second circular (May 1991) for
details.

PANEL DISCUSSIONS: The Organizing Committee invites participants to propose
topics for panel discussions by April 1, 1991. Participants whose topic is
chosen will be responsible for organizing their panel discussion.

PRESENTING A PAPER: The second circular will be sent to those who complete
the enclosed answer card.

INFORMATION:
CIL92, Pierre Auger
Departement de langues et linguistique
Universite Laval
Quebec City, (Que.) G1K 7P4, CANADA
Telephone: (418) 656-5323   FAX: (418) 656-2019
E-Mail: CIPL92@LAVALVM1

Early Registration Form
XVth International Congress of Linguists

Name: Mr./Ms. ___________________________________________
Title: __________________________________________________
Institution or Agency: __________________________________
Address: ________________________________________________
_________________________________________________________
Tel.: _____________________  FAX: _______________________
E-Mail: ___________________

REGISTRATION
                  Before 91/04/30   From 91/05/01     From 92/01/01
                                    to 91/12/31       to 92/08/09
Regular           $160.50 (U.S.)    $214.00 (U.S.)    $285.00 (U.S.)
                  $187.25 (CAN.)    $251.45 (CAN.)    $342.00 (CAN.)
Students          $160.50 (U.S.)    $160.50 (U.S.)    $171.00 (U.S.)
                  $187.25 (CAN.)    $187.25 (CAN.)    $199.50 (CAN.)
Accompanying      $80.25 (U.S.)     $107.00 (U.S.)    $133.75 (U.S.)
Guests            $107.00 (CAN.)    $133.75 (CAN.)    $171.00 (CAN.)
PAYMENT
Cheque ___  Master Card ___  Visa ___  American Express ___
Expiration date: __________________  Signature: __________________

I would like to present a paper.  Yes ___  No ___
Chosen sections (in order of preference):
1. ______________________________________
2. ______________________________________
Preferred way to present a paper:  -oral  -poster session  -no preference

ANSWER CARD
(To be filled out by anyone wishing to receive the second circular)
Name: Mr./Ms. ___________________________________________
Address: ________________________________________________
_________________________________________________________
Tel: ____________________  FAX: _________________________
E-Mail: _________________________

CIL92, Pierre Auger
Departement de langues et linguistique
Universite Laval
Quebec City, (Que.) G1K 7P4, CANADA
Telephone: (418) 656-5323   FAX: (418) 656-2019
E-Mail: CIPL92@LAVALVM1.BITNET

+++++++++++++++++++++++++++++++++++++++++++++++

ANNONCE

XVe Congres international des linguistes
Organise par l'Universite Laval avec le concours de l'Association canadienne
de linguistique (ACL) et sous les auspices du Comite international permanent
des linguistes (CIPL)
Quebec, 9 au 14 aout 1992

Theme principal du Congres: "La survie des langues menacees"

President d'honneur: M. Michel Gervais, Recteur de l'Universite Laval

Comite d'organisation:
President: M. Pierre Auger, Departement de langues et linguistique,
Universite Laval
Vice-president: M. Walter Hirtle, Departement de langues et linguistique,
Universite Laval
Secretaire-generale: Mme Silvia Faitelson-Weiser, Departement de langues et
linguistique, Universite Laval
Programme: Mme Marie Surridge, Presidente sortante de l'ACL, Departement
d'Etudes Francaises, Universite Queen, Kingston, Ontario
Accueil: M. Jean-Louis Tremblay, Departement de langues et linguistique,
Universite Laval
Publications: M.
Conrad Ouellon, Directeur du CIRAL, Departement de langues et linguistique,
Universite Laval

RENSEIGNEMENTS GENERAUX

DATE ET LIEU DU CONGRES: 9 au 14 aout 1992, Universite Laval, Quebec, Canada

HEBERGEMENT: Hotels de differentes categories et nombre limite de logements
economiques dans les residences de l'Universite Laval. Toutes les demandes
concernant l'hebergement doivent etre acheminees a:

OFFICE DU TOURISME ET DES CONGRES DE LA COMMUNAUTE URBAINE DE QUEBEC
399, rue Saint-Joseph Est
Quebec, Quebec
Canada, G1K 8E2

PASSEPORTS ET VISAS: Tous les visiteurs entrant au Canada, sauf les
residents des Etats-Unis, doivent etre en possession d'un passeport valide.
Pour les ressortissants de certains pays, un visa est egalement requis.
Chacun des participants est encourage a consulter l'ambassade, le consulat
ou le haut-commissariat canadien le plus pres, pour verifier les conditions
qui s'appliquent a sa situation.

FRAIS D'INSCRIPTION:
                     Congressistes     Accompagnants     Etudiants*
Avant 91/04/30:      160.50$ (U.S.)**  80.25$ (U.S.)     160.50$ (U.S.)
                     187.25$ (CAN.)    107.00$ (CAN.)    187.25$ (CAN.)
Du 91/05/01
  au 91/12/31:       214.00$ (U.S.)    107.00$ (U.S.)    160.50$ (U.S.)
                     251.45$ (CAN.)    133.75$ (CAN.)    187.25$ (CAN.)
Du 92/01/01
  au 92/08/09:       285.00$ (U.S.)    142.50$ (U.S.)    171.00$ (U.S.)
                     342.00$ (CAN.)    171.00$ (CAN.)    199.50$ (CAN.)

Les frais d'inscription peuvent etre payes par cheque, a l'ordre de CIPL92,
ou par carte de credit (American Express, Master Card ou Visa). En cas
d'annulation, une partie des frais d'inscription sera remboursee, soit 75%
avant le 28 fevrier 1992, 50% du 1er mars au 31 mai 1992. Les annulations
effectuees apres le 31 mai 1992 ne seront pas remboursees; cependant, les
personnes inscrites recevront tout le materiel distribue pour la tenue de ce
congres.

* Seuls les participants munis d'une lettre de leur universite attestant
leur statut d'etudiant pourront beneficier du tarif etudiant.
** Ont ete ajoutees aux frais d'inscription les taxes federale et
provinciale sur les biens et les services.
PROGRAMME

Dimanche, 9 aout     Inscription; Soiree d'accueil
Lundi, 10 aout       Ceremonie d'ouverture; Session pleniere; Communications
                     orales; Communications par affiche; Tables-rondes
Mardi, 11 aout       Session pleniere; Communications orales; Communications
                     par affiche; Tables-rondes
Mercredi, 12 aout    Journee libre, excursions
Jeudi, 13 aout       Session pleniere; Communications orales; Communications
                     par affiche; Tables-rondes
Vendredi, 14 aout    Session pleniere; Communications orales; Communications
                     par affiche; Tables-rondes; Ceremonie de cloture

LANGUES DU CONGRES: Les deux langues du Congres seront les langues
officielles du Canada, soit le francais et l'anglais.

SESSIONS PLENIERES: Selon l'usage, le sujet de chaque session pleniere sera
presente par trois conferenciers ou plus. Une discussion generale suivra.
Les sujets de ces sessions seront les suivants:
1. Semantique, syntaxe, pragmatique
2. Le mot
3. Les langues menacees
4. Les approches theoriques: le present et l'avenir

COMMUNICATIONS: Des communications orales ou des communications par affiche
seront acceptees. Le temps alloue pour une communication orale et la
discussion qui suit est de vingt minutes. L'auteur d'une communication par
affiche beneficiera d'une periode de deux heures pour presenter sa
communication. L'horaire des communications figurera dans la troisieme
circulaire. La liste provisoire des sujets retenus pour les sections est la
suivante:
1. Les sons, les phonemes et l'intonation
2. Le mot (morphologie, lexicologie, lexicographie, terminologie, etc.)
3. La phrase (syntaxe, fonction, etc.)
4. Le sens (semantique, signification lexicale, signification grammaticale,
   etc.)
5. Le texte parle ou ecrit (pragmatique, analyse de discours, etc.)
6. Langage et societe (sociolinguistique, variation linguistique, langue et
   culture, etc.)
7. La langue et l'individu (psycholinguistique, neurolinguistique,
   acquisition et apprentissage des langues)
8. La langue dans le temps
9. L'amenagement linguistique
10.
Apprentissage des langues
11. La survie des langues menacees
12. Theorie du langage
13. Langage et informatique
14. Pidgins et creoles
15. Histoire de la langue
16. Methodologie (observation des donnees, constitution et traitement de
    corpus, experimentation, etc.)
17. Autres (la langue et les femmes, le langage par signes, etc.)

Les participants sont invites a proposer une communication dans une des
sections ci-haut mentionnees. Ils devront faire parvenir un resume de leur
communication avant le 1er octobre 1991 au bureau du Congres. Pour de plus
amples renseignements, consultez la deuxieme circulaire disponible en mai
1991.

TABLES-RONDES: Le Comite d'organisation invite les participants interesses
a organiser une table-ronde a soumettre des propositions de sujets, et ceci
avant le 1er avril 1991. Le participant dont le sujet sera accepte sera
responsable de l'organisation de la table-ronde.

PRESENTATION D'UNE COMMUNICATION: La deuxieme circulaire (mai 1991) sera
envoyee a ceux qui auront rempli la fiche de reponse ci-jointe. Veuillez la
retourner le plus tot possible.

INFORMATION:
CIL92, Pierre Auger
Departement de langues et linguistique
Universite Laval
Quebec, (Que.) G1K 7P4, CANADA
Telephone: (418) 656-5323   FAX: (418) 656-2019
E-mail: CIPL92@LAVALVM1.BITNET

Fiche de preinscription
XVe Congres international des linguistes

Nom: ___________________________  Prenom: M./Mme ________________________
Titre: ________________________________________________
Etablissement ou organisme: ___________________________
Adresse: ______________________________________________
_______________________________________________________
Tel.: (____)____________________  FAX: (____)__________
E-Mail: _______________________________________________

INSCRIPTION:
                Avant 91/04/30    Du 91/05/01       Du 92/01/01
                                  au 91/12/31       au 92/08/09
Regulier        160.50$ (U.S.)    214.00$ (U.S.)    285.00$ (U.S.)
                187.25$ (CAN.)    251.45$ (CAN.)    342.00$ (CAN.)
Etudiants       160.50$ (U.S.)    160.50$ (U.S.)    171.00$ (U.S.)
                187.25$ (CAN.)    187.25$ (CAN.)    199.50$ (CAN.)
Accompagnants   80.25$ (U.S.)     107.00$ (U.S.)    133.75$ (U.S.)
                107.00$ (CAN.)    133.75$ (CAN.)    171.00$ (CAN.)

MODE DE PAIEMENT:
Cheque ___  Master Card ___  Visa ___  American Express ___
Date d'expiration: ____________________  Signature: ____________________

Je desire presenter une communication.  Oui ___  Non ___
Sections choisies (par ordre de preference):
1. ___________________________________________________
2. ___________________________________________________
Mode de presentation choisi:  -oral  -par affiche  -aucune preference

Fiche de reponse
(a remplir par tous ceux qui veulent recevoir la deuxieme circulaire)
Nom: _________________________  Prenom: M./Mme _________________________
Adresse: _______________________________________________
________________________________________________________
Tel.: (____)___________________  FAX: (____)___________________
E-Mail: ______________________________

CIL 92, Pierre Auger
Departement de langues et linguistique
Universite Laval
Quebec, (Que.) G1K 7P4, CANADA
Telephone: (418) 656-5323   FAX: (418) 656-2019
E-Mail: CIPL92@LAVALVM1.BITNET

=========================================================================
Date:         Tue, 5 Feb 91 19:55:18 -0500
Reply-To:     Text Encoding Initiative public discussion list
Sender:       Text Encoding Initiative public discussion list
From:         Don Walker
Subject:      COLING-92 First Announcement & Call for Papers

Fourteenth International Conference on Computational Linguistics
COLING-92
23-28 July 1992, Nantes, France

FIRST ANNOUNCEMENT AND CALL FOR PAPERS

DATES: The conference will last five full days (not counting Sunday).
Pre-COLING tutorials will take place on 20-22 July (2-1/2 days).

ORGANIZERS: GETA and IMAG, Grenoble (F. Peccoud, Ch. Boitet, J. Courtin);
Palais des Congres, Nantes (M. Gillet); Universite de Nantes (M.H.
Jayez), EC2 (G. d'Aumale). PROGRAMME CHAIR: Prof. A. Zampolli, Universita di Pisa, ILC, via della Faggiola 32, I-56100 Pisa, ITALY (tel: +39.50.560481; fax: +39.50.589055). DEADLINES: Send six A4 or 8-1/2 by 11 inch copies of the full paper to Prof. Zampolli before 1 November 1991. Notifications of acceptance will be sent by 1 March 1992. Camera-ready copies of final papers conforming to the COLING-90 style sheet must reach GETA (GETA-IMAG, COLING-92, BP 53X, F-38041 Grenoble, FRANCE) by 1 May 1992. TOPICS: All topics in Computational Linguistics are acceptable. Papers concerning real applications will be especially welcome. A special session on language industry is planned. Please indicate main areas of papers using two-level categories: computational models and formalisms (in morphology, syntax, semantics, pragmatics, discourse, dialogue, . . .), methods (symbolic, numerical, statistical, neural, . . .), tools (specialized languages, environments), large-scale resources (textual, lexical, grammatical databases), applications (natural language interfaces, information retrieval, text generation, machine translation, machine aids to writing, translating, abstracting, learning, . . .), hypermedia and natural language processing (integration of text, speech, graphics, video), generic questions in language industry (engineering, ergonomics, legal aspects, normalization, . . .). TYPES OF PAPERS: Topical papers (maximum seven pages in final format) on crucial issues in Computational Linguistics, and project notes (maximum five pages). Only unpublished papers will be accepted. Papers should describe substantial and original work, especially new methodologies and applications. They should emphasize completed rather than intended work. PRELIMINARY SCHEDULE: Twelve 30-minute lecture slots daily (hopefully in only three parallel sessions) and three 30-minute demonstration slots during the lunch break (hopefully in at least ten parallel sessions). 
It should be possible to have lunch and go to two or even three demos. DEMONSTRATIONS: Demonstrations are strongly encouraged. A project note without a demo will have a lower probability of acceptance. With a demo, it will get three consecutive demo slots. A topical paper including a demo will be presented as a lecture and as a demo. LANGUAGES: One extra page will be allowed for a long abstract in English, if the paper is written in another language, or conversely (paper in English and long abstract in another language). Speakers not giving their talk in English are encouraged to use visual aids in English. EXHIBITION: An exhibition of language industry products will be organized in parallel by EC2, the well known organizer of the annual Avignon meetings on Expert Systems. Industrial firms are encouraged to present state-of-the-art NLP products. OTHER ACTIVITIES: A social programme will be proposed to participants and companions. Individual discovery is also possible, as Nantes and its region are culturally very active and full of picturesque places. Organized on behalf of the International Committee on Computational Linguistics Martin Kay, Palo Alto (President); Eva Hajicova, Prague (Vice President); Donald E. 
Walker, Morristown (Secretary General); Christian Boitet, Grenoble;
Nicoletta Calzolari, Pisa; Brian Harris, Ottawa; David Hays, New York
(Honorary); Kolbjorn Heggstad, Bergen; Hans Karlgren, Stockholm; Olga
Kulagina, Moscow; Winfried Lenders, Bonn; Makoto Nagao, Kyoto; Helmut
Schnelle, Bochum; Petr Sgall, Prague; Yorick Wilks, Las Cruces; Antonio
Zampolli, Pisa

=========================================================================
Date:         Tue, 5 Feb 91 19:57:12 -0500
Reply-To:     Text Encoding Initiative public discussion list
Sender:       Text Encoding Initiative public discussion list
From:         Don Walker
Subject:      ACL Applied Natural Language Processing Conference - Trento
              1992

CALL FOR PAPERS

3rd Conference on Applied Natural Language Processing
Trento, Italy, 1-3 April 1992
sponsored by the Association for Computational Linguistics

PURPOSE

The focus of this conference is on the application of natural language
processing techniques to real world problems. It will include invited and
contributed papers, tutorials, an industrial exhibition, and demonstrations.
A special video session is also being organised. The organizers want the
conference to be as international as possible, and to feature the best
applied natural language work presently available in the world. This
conference follows on from those held in Santa Monica, California in 1983,
and in Austin, Texas in 1988.

AREAS OF INTEREST

Original papers are being solicited in all areas of applied natural language
processing, including but not limited to: dialog systems; integrated speech
and natural language systems; machine translation; explanation and
generation; database interface systems; tool development; text and message
processing; grammar and style checking; corpus development; knowledge
acquisition; lexicons; language teaching aids; evaluation; adaptive systems;
multilanguage systems; multimedia systems; help systems; and other
applications.
Papers may discuss applications, evaluations, limitations, and general tools and techniques. Papers that critically evaluate a relevant formalism or processing strategy are especially welcome. REQUIREMENTS FOR SUBMISSION Authors should submit, by 10 September 1991, a) six copies of a full-length paper (min 9, max 18 double-spaced pages, minimum font size 12, exclusive of references); b) 16 copies of a 20-30 line abstract; c) a declaration that the paper has not been accepted nor is under review for a journal or other conference nor will it be submitted during the conference review period. Papers arriving after the deadline will be returned unopened. We regret that papers cannot be submitted electronically, or by fax. Papers should describe completed rather than intended work, identify distinctive aspects of the work, and clearly indicate the extent to which an implementation has been completed; vague or unsubstantiated claims will be given little weight. Both the paper and the abstract should include the title, the name(s) of the author(s), complete addresses and e-mail address. Papers from Europe and Asia should be sent to: Oliviero Stock (ANLP-3) phone: +39-461-814444 I.R.S.T. fax: +39-461-810851 38050 Povo (Trento), ITALY email: stock@irst.it Papers from America and other continents should be sent to: Madeleine Bates (ANLP-3) phone: +1-617-8733634 BBN Systems & Technologies fax: +1-617-8733776 10 Moulton Street email: bates@bbn.com Cambridge, MA 02138, USA Authors will be notified of acceptance or rejection by 30 November 1991. Full-length versions of accepted papers, prepared according to instructions, must be received, along with a signed copyright release statement, by 15 January 1992. All papers will be reviewed by members of the program committee, which is co-chaired by Madeleine Bates (BBN Systems & Technologies) and Oliviero Stock (IRST) and also includes: Robert Amsler, MITRE Kathy McKeown, Columbia Univ. Giacomo Ferrari, Univ. 
of Pisa; Sergei Nirenburg, Carnegie Mellon Univ.; Eduard Hovy, USC/ISI;
Makoto Nagao, Kyoto Univ.; Paul Jacobs, General Electric; Remko Scha, Univ.
of Amsterdam; Martin Kay, Xerox PARC; Karen Sparck Jones, Univ. of
Cambridge; Mark Liberman, Univ. of Pennsylvania; Henry Thompson, Univ. of
Edinburgh; Paul Martin, MCC; Wolfgang Wahlster, DFKI

VIDEOTAPES

Videotapes are sought that display interesting research on NLP applications
to real-world problems, even if presented as promotional videos (not
advertisements). An ongoing video presentation will be organized that will
demonstrate the current level of usefulness of NLP tools and techniques.
Authors should submit one copy of a videotape of at most 15 minutes
duration, accompanied by a submission letter giving permission to copy the
tape to a standard format, and two copies of a one to two page abstract that
includes: title; name, address and email or fax number of the authors; tape
format of the submitted tape (VHS, in any of NTSC, PAL or SECAM); and
duration. The final tape format provided by the authors should be one of
VHS, 3/4'' U-Matic or BVU, in any of NTSC, PAL or SECAM. Videotapes cannot
be returned. Tape submissions should be sent to the same address as the
papers (see above). The timetable for submissions, notification of
acceptance or rejection, and receipt of final versions is the same as for
the papers. See above for details. Tapes will be reviewed and selected for
presentation during the conference. Abstracts of accepted videos will appear
in the conference proceedings. We are also considering the possibility of
producing a collection of video proceedings, for those videotapes that
authors agree to distribute. A preliminary indication on this matter will be
appreciated.

DEMONSTRATIONS

Besides demonstrations to be carried out within a regular booth at the
industrial exhibition, there will be a program of demonstrations on standard
equipment available at the conference (SUNs, MACs, etc.).
Anyone wishing to present a demo should send a one-page description of the demo and a specification of the system requirements by 1 December 1991 to Carlo Strapparava phone: +39-461-814444 I.R.S.T. fax: +39-461-810851 38050 Povo (Trento), ITALY email: strappa@irst.it PRIZE A prize will be given for the best nonindustrial demonstration. TUTORIALS The meeting will be preceded by one or two days of tutorials by noted contributors to the field. Responsible for tutorials: Jon Slack phone: +39-461-814444 I.R.S.T. fax: +39-461-810851 38050 Povo (Trento), ITALY email: slack@irst.it WORKSHOPS Proposals for organizing workshops in Trento immediately after the conference can be addressed to Oliviero Stock at the above address. INDUSTRIAL EXHIBITION Facilities for exhibits will also be available. Persons wishing to arrange an exhibit should send a brief description together with a specification of physical requirements (space, power, telephone connections, table, etc.) by 1 September 1991 to Giampietro Carlevaro phone: +39-461-814444 I.R.S.T. fax: +39-461-810851 38050 Povo (Trento), ITALY email: carleva@irst.it GENERAL INFORMATION Local arrangements are being handled by Tullio Grazioli and Oliviero Stock phone: +39-461-814444 I.R.S.T. fax: +39-461-810851 38050 Povo (Trento), ITALY email: interne@irst.it For information on the ACL, contact Donald E. Walker (ACL) phone: +1-201-8294312 Bellcore, MRE 2A379 fax: +1-201-4551931 445 South Street, Box 1910 email: walker@flash.bellcore.com Morristown, NJ 07960, USA The conference is also supported by the European Coordinating Committee for Artificial Intelligence (ECCAI), the Italian Association for Artificial Intelligence (AI*IA) and Istituto Trentino di Cultura. 
========================================================================= Date: Wed, 6 Feb 91 06:37:33 CST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: "Eric Johnson DSU, Madison, SD 57042" Subject: Conference I C E B O L 5 Fifth International Conference on Symbolic and Logical Computing Dakota State University April 18-19, 1991 Madison, SD 57042 KEYNOTE SPEAKER Nancy M. Ide Professor and Chair, Computer Science Department, Vassar College. Author of _Pascal for the Humanities_ and articles on William Blake, artificial intelligence, and programming for the analysis of texts. FEATURED SPEAKER Ralph Griswold One of the creators of the Icon programming language and SNOBOL4. He is the editor of two newsletters, and the author of six books and dozens of articles on computer languages and string and list processing. He is Professor of Computer Science at the University of Arizona. ICEBOL5, the fifth International Conference on Symbolic and Logical Computing, is designed for teachers, scholars, and programmers who want to meet to exchange ideas about computer programming for non-numeric applications -- especially those in the humanities. In addition to a focus on SNOBOL4, SPITBOL, and Icon, ICEBOL5 will feature presentations on processing in a variety of programming languages such as Pascal, Prolog, C, and REXX. SCHEDULED TOPICS Music Score Recognition Automatic File Generation Predicate Logic Parallel Logic Programming Tools for Navajo Lexicography Expert System for Advising Computer Analysis of Poetry and Prose Simulating Neural Activity Parsing Texts Data Integrity Checking Digitized Voice Management Selecting Expert Systems Grammar and Machine Translation Editing Large Texts Logical Modeling of Complex Systems ACCOMMODATIONS Please make your own reservations. Lake Park Motel (Single $23);(Double $26) W. Hwy. 34 Phone (605) 256-3424 Super 8 (Single $26);(Double $32) W. Hwy. 
34 Phone (605) 256-6931 All major chains available in Sioux Falls, SD (50 miles from conference site)

- - - - - - - - - - - - - REGISTRATION FORM - - - - - - - - - - - - - -
FIFTH INTERNATIONAL CONFERENCE ON SYMBOLIC AND LOGICAL COMPUTING
April 18-19, 1991

Indicate the number for the following:                           Amount
______ Advance registration $150.00 (includes two lunches,
       coffee breaks, banquet, one copy of the proceedings);
       On-site registration $175.00                           $________
______ Additional copies of ICEBOL5 proceedings ($35.00 each) $________
______ Additional banquet tickets ($15.00)                    $________
______ Shuttle from Sioux Falls airport
       ($40.00 per passenger round trip)                      $________
       Arrival:   Flight________ Date ________ Time ________
       Departure: Flight________ Date ________ Time ________
       (Rental cars are available at the Sioux Falls, SD airport)

                                      Total Amount Enclosed   $________

Name_______________________________ College or Firm_____________________
Mailing address_________________________________________________________
_________________________________________________________
_________________________________________________________

Return this form to: Eric Johnson, ICEBOL5 Director, 114 Beadle Hall, Dakota State University, Madison, South Dakota 57042 USA
========================================================================= Date: Wed, 6 Feb 91 15:45:50 PST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Ken Whistler Subject: Reply to Douglas de Lacey, Re: Unicode 1.0 I sent the following reply letter to Mr. de Lacey's recent comments on Unicode 1.0. Since he posted his comments to TEI-L, I am also forwarding my reply to that list. (Ken Whistler) Mr. de Lacey, Asmus Freytag forwarded your comments to several of us who are currently working on the Unicode 1.0 draft. While formal resolution of commentary will await decisions by the Unicode Technical Committee, I thought it might prove useful to clarify a few things now. 
These are my own opinions, and do not necessarily reflect the decisions of the UTC. Many of the bizarre characteristics of the symbols area that you note (encoding of fractions, Roman numerals, etc.) are simply the price we have had to pay to preserve interconvertibility with other, important and already-implemented character encodings. We fully expect that any "smart" Unicode implementation will ignore most of the fraction hacks, for example, and encode fractions in a uniform and productive way. There is, in fact, a dual-argument fraction operator in Unicode (U+20DB) to support such implementations. The coexistence of composite Latin letters (e.g. E ACUTE) with productive composition using non-spacing diacritics is also forced by compromises between competing requirements for mapping to old standards and the implementation needs of the various parties which will use Unicode. While this has been (accurately) criticized as leading to non-unique encoding--in the sense that alternative, correct "spellings" of the "same text" can be generated--it is my considered opinion, after long arguments with proponents of other approaches, that uniqueness is not obtainable. In other words, we could design a scheme which could theoretically lead to unique encoding, but it would be unacceptable as a practical character encoding--so we wouldn't get it anyway. Unicode started out as you envision it--with only baseforms and non-spacing diacritics for Latin/Greek/Cyrillic, so that all accented letters would be composed. But that allowed for no acceptable evolutionary path from where we are to where we would like to be. The other approach, which tries to encode every single combination anyone could use (i.e. ISO DIS 10646), is necessarily incomplete, in that it refuses to acknowledge productivity in the application of diacritics (e.g. for IPA). So Unicode is admittedly a chimera--but a practical, real chimera that will be implemented, rather than an impractical and unimplementable one. 
You identify a problem which arises from non-uniqueness, namely: >two encodings of an identical text may thus turn out to be very >different; and for anyone using computer comparison of texts this could be >quite problematic. I would imagine this also disturbs the dreams of many who are working on the text encoding initiative. But again, I think there is no way to guarantee uniqueness. Furthermore, the entire notion of "identical text" requires rigorous definition before algorithmic comparisons by computer make any sense. Is a text on a Macintosh comparable to the "identical text" on an IBM PC? Well, perhaps, once considerations of several layers of hardware, software, and text formatting, together with character set mapping are resolved. Such comparisons involve appropriate filters, so that canonical forms are properly compared. All Unicode implementers I know of are fully aware of the problem of canonical form for text representation. (By the way, it might be fair to say that this is an order-of-magnitude more critical problem for corporate database implementors than it is for text analysis.) Another thing to keep distinct in understanding Unicode is that not everything which can appear on a page can be encoded in Unicode plain text. Changes of font, changes of language, or metatextual references to a particular glyph: >"There are three possible form of LATIN SMALL LETTER G CEDILLA (U+0123) >and they look like ..." require a higher level of text structure than simply a succession of characters one after another. Unicode is definitely not going to be defining a bunch of ESCAPE code sequences to be embedded into text with particular semantics such as "change font to...". Modern text editing, analyzing, and rendering software deals with such things by means of distinctions on a "plane above" the text itself. 
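Whistler's point about comparing texts only after filtering them to a canonical form can be made concrete. The normalization machinery that the Unicode standard eventually defined (NFC composes, NFD decomposes, and both put combining marks into a canonical order) postdates this 1991 exchange, so the following is only a modern illustrative sketch, using Python's standard `unicodedata` module, of the kind of filter he describes:

```python
import unicodedata

# Two "spellings" of the same text: a precomposed E ACUTE versus
# the base letter plus a non-spacing (combining) acute accent.
composed = "caf\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"  # 'e' + U+0301 COMBINING ACUTE ACCENT

# A naive code-point comparison sees two different strings.
print(composed == decomposed)            # False

# A canonical-form filter makes them comparable: normalize both
# sides to one form (NFC composes, NFD decomposes) before comparing.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True

# Normalization also addresses the ordering of stacked diacritics:
# marks with different combining classes are sorted into a canonical
# order, so acute (class 230) and cedilla (class 202) applied to 'e'
# in either order normalize to the same sequence.
a = unicodedata.normalize("NFD", "e\u0301\u0327")
b = unicodedata.normalize("NFD", "e\u0327\u0301")
print(a == b)                            # True
```

Both halves of the debate are visible here: de Lacey's worry (two encodings of an "identical text" compare unequal byte for byte) and Whistler's answer (run both through a canonical-form filter first, at which point the comparison succeeds).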
The plain answer to the question, "could the whole of the manual as printed be sensibly encoded in Unicode?", is clearly no, since it requires a layer of formatting and distinguishes multiple fonts. The particular case of the GREEK SMALL LETTER SCRIPT THETA is just baggage dragged along from mistakes made in earlier encodings (thus also the other admitted glyphs encoded separately in the Greek block). There is a scheme for indicating preferential rendering (where possible) using ligatures (such as Greek "kai"). The ZERO WIDTH JOINER (U+200D) and ZERO WIDTH NON-JOINER (U+200C) can be used as rendering hints for ligatures, as well as serving as an important part of the proper implementation of cursive scripts such as Arabic. I don't think there is a LATIN CAPITAL LETTER WYNN to be found. This is a good case for following the "How to Request Adding a Character to Unicode" guidelines. If you can provide clear textual evidence that wynn appears in regular use with a case distinction, then a capital form would be a good candidate for addition. The Greek semicolon was unified with MIDDLE DOT (U+00B7). The diacritic ordering algorithm (centre-out) is meant to apply independently to diacritics on top and to diacritics on the bottom. The issue of how to specify unambiguously side-by-side ordering within diacritics at the same vertical level is a good one, and I think it will have to be addressed in the final draft. I hope these clarifications are helpful. --Ken Whistler ========================================================================= Date: Thu, 7 Feb 91 07:50:34 EST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: "Robert A. Amsler" Subject: UNICODE 1.0 Draft I am still concerned over offering two (or more) encodings of characters WITHOUT any prescriptive guidance to those who have the option of selecting either for new texts to be encoded or translated. 
It is not necessary to forbid alternate encodings, only to discourage their continued use. Some guidance as to which encoding is preferred seems desirable. If UNICODE doesn't provide such guidance I would advocate the TEI adding its own recommendations. As Yaccov Choueka once pointed out to me, all the text that has been encoded up to now is but the tiniest fraction of the text that WILL BE encoded in the future. We owe the future a better chance to use more desirable encodings than we may have to put up with because of poor planning in the past. ========================================================================= Date: Thu, 7 Feb 91 09:56:00 EDT Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: John Lavagnino Subject: Against conference announcements I propose that conference organizers be discouraged from posting their announcements to this list, unless the conference has sessions devoted to SGML or the TEI. To my mind, there have been too many postings recently about linguistics conferences with no particular connection to this list's subject. If there's anybody who is a) interested in those conferences and b) heard about them only on this list, and not also on various linguistics and humanities lists, I would be surprised. John Lavagnino Department of English and American Literature, Brandeis University ========================================================================= Date: Thu, 7 Feb 91 17:31:12 +0100 Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Timothy.Reuter@MGH.BADW-MUENCHEN.DBP.DE Subject: Conference announcements I second John Lavagnino's request. 
I would prefer it if conference announcements on ALL lists took the form of a ten-line statement that details of such and such a conference can be obtained from LISTSERV@SOMEWHERE_OR_OTHER together with the closing date for papers and for registration - but perhaps that's too much to hope for. Timothy Reuter, MGH, Munich ========================================================================= Date: Thu, 7 Feb 91 13:12:00 EDT Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list Comments: Warning -- RSCS tag indicates an origin of VERONIS@VAXSAR From: Jean Veronis Subject: Re: Conference announcements It seems reasonable that conference announcements appear on TEI-L only if they are related to the TEI. More generally, I would enjoy reading more TEI-related discussion on the list, which has been lacking in recent months. It's difficult for me to understand why, since there seems to be a lot of activity within the TEI. Why so little on the list? Jean Veronis ========================================================================= Date: Thu, 7 Feb 91 17:11:07 CST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Don Goldhamer Subject: Re: Conference announcements VERY brief announcements of "relevant" conferences (the suggested 10 lines or less) with pointers to more information would seem most desirable. I would prefer to interpret "relevant" very liberally, so as to include those announcements we have recently received. Some of us are not subscribed to many lists. Donald H. 
Goldhamer - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Project Manager 1155 E 60th, Chicago IL 60637 Department of Academic & Public Computing dhgo@midway.UChicago.EDU University of Chicago Computing Organizations voice:(312)702-7166 ========================================================================= Date: Fri, 8 Feb 91 11:17:00 MDT Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: CHERYLL BALL Subject: Mainframe Letters I am looking for schools that have a good way of producing mass quantities of letters on the mainframe. We have an IBM 9121 running MVS/XA. Our users have data collection on IDMS. Some user departments have PCs, some are hardwired to the mainframe, and some use the network to get to their systems. We need a letter processing system to produce approx. 500-2000 letters a night, 300,000 letters a year. We are looking at purchased software like IBM's ASF with Document Writing Feature/Document Composition Feature. We are looking at IDMS/PC to retain the capability for users to use their PCs. Please let me know what you are doing at your site. I do not belong to this listserv group; please send replies to: Cheryll Ball CBALL@UNMB or CBALL@TRITON.UNM.EDU University of New Mexico Analyst/Programmer II ========================================================================= Date: Mon, 11 Feb 91 02:04:52 EST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Josh Hendrix Subject: Re: Unicode 1.0 Does anyone know where I could obtain a copy of the Unicode book? 
Josh Hendrix ========================================================================= Date: Mon, 11 Feb 91 18:14:25 +1100 Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: mr p paul Subject: Re: Unicode 1.0 Please let me know when you've found it. ========================================================================= Date: Mon, 11 Feb 91 09:38:55 DNT Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Hans Joergen Subject: AHC 91 Call for papers In-Reply-To: Message of Thu, 7 Feb 91 09:56:00 EDT from AHC-conference in Odense 28th to 30th August 1991 The sixth international conference of the AHC will be held in Odense, Denmark and arranged by the Danish Data Archives. In the program committee for the conference are Peter Denley, Westfield College, London, Stefan Fogelvik, Stockholms Historiska Databas, Daniel Greenstein, Glasgow University, Hans Jørgen Marker, Dansk Data Arkiv, Jan Oldervol, Universitetet i Tromsø, Kevin Schurer, Cambridge Group, Josef Smets, Montpellier, and Manfred Thaller, Max Planck Institut für Geschichte, Göttingen. Topics of the Conference The topics of the conference will as usual be a broad presentation of everything that is going on in history and computing. Papers are invited on substantial subjects as well as methodological questions. Among the expected topics are -Standardization and exchange of machine readable data in the historical disciplines -Data analysis and presentation -Event history analysis -Text analysis -Simulation and modelling -Computer aided teaching -Social and economic history -Quantitative methods At the forthcoming conference it would be natural, as the conference is located in Scandinavia, for demographic studies and large data collections to be a central issue. Furthermore, a number of workshops on methodological questions will be held in the spring of 1991. 
These workshops will present their results for further discussion in workshop sessions at the conference. One of these workshops is dedicated to the application of the TEI guidelines in the field of history. Papers are invited on all aspects of computing in history. The papers will be published in a proceedings volume from the conference provided that they are submitted in machine readable form (WordPerfect or ASCII). Info Further information on the conference will be obtainable from Hans Jørgen Marker Danish Data Archives Munkebjergvænget 48 5230 Odense M Denmark Phone +45 66 15 79 20 Fax +45 66 15 83 20 E-Mail (EARN): DDAHM @ NEUVM1 ========================================================================= Date: Mon, 11 Feb 91 09:54:08 EST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Susan_R._Harris@UM.CC.UMICH.EDU Subject: Re: Unicode 1.0 You can obtain a copy of the Unicode Final Review Document by sending mail to MICROSOFT!ASMUSF@UUNET.UU.NET. (This is what I heard in HUMANIST. I sent e-mail to this address last week, though, and haven't gotten a reply.) -Susan R. Harris ========================================================================= Date: Mon, 11 Feb 91 16:12:36 GMT Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: DEL2@PHOENIX.CAMBRIDGE.AC.UK Subject: Unicode For those of you interested in the Unicode (character-encoding) debate: for e-mail responses the deadline has been extended to 25 February. So order your copy now from microsoft!asmusf@uunet.uu.net, and make sure your voice is heard. Regards, Douglas de Lacey. ========================================================================= Date: Mon, 11 Feb 91 10:57:00 CST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: "Robin C. 
Cover" Subject: UNICODE DOCUMENT SOURCE Re: > Date: Mon, 11 Feb 91 02:04:52 EST > Subject: Re: Unicode 1.0 > Does anyone know where I could obtain a copy of the Unicode book? > Josh Hendrix Ask for the UNICODE "Draft Standard - Final Review Document." The comment period is to officially close Feb. 15th, and a technical committee meeting for review of comments is scheduled for Feb 28th; email comment open till Feb 25th. I would be interested in whether the TEI editors (or relevant sub-committees) have supplied any comment to the UNICODE Consortium reflecting the interests of various TEI constituencies. Unicode Final Review c/o Asmus Freytag Building 2/Floor 2 Microsoft Corporation One Microsoft Way Redmond, WA 98052-6399 USA Email: microsoft!asmusf@uunet.uu.net Tel: (1 206) 882-8080 FAX: (1 206) 883-8101 Telex: 160520 ========================================================================= Date: Mon, 11 Feb 91 19:39:00 GMT Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Lou Burnard Subject: TEI views on Unicode Briefly, and rather unsatisfactorily, in answer to Robin Cover's latest query: yes, the TEI working group on character set problems, chaired by Harry Gaylord, was specifically charged with an evaluation of the relevance of Unicode to TEI concerns (among other things) when it was set up in December. Its report is due in a few weeks but I have already asked Harry to post a comment here too as soon as possible. Be patient though -- like me, he's up to his eyebrows trying to get things ready for the various TEI events at the ACH/ALLC Conference in Tempe next month! 
Lou Burnard EuroEd TEI ========================================================================= Date: Mon, 11 Feb 91 14:03:31 CST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list Comments: "ACH / ACL / ALLC Text Encoding Initiative" From: Michael Sperberg-McQueen 312 996-2477 -2981 Subject: Reminder: deadline for ACH/ALLC conference registration > Remember also that a TEI workshop will be held at ACH/ALLC '91. > If you are one of the many who'd like to get better information on > how the TEI scheme is supposed to work in practice, be sure to come > to ACH/ALLC and to the workshop. -CMSMcQ This is a reminder that "early" registration for ACH/ALLC '91 must be in by February 12 to qualify for the discount and to be sure of space in the dormitory. If you would like a registration form or a copy of the program, contact ATDXB@ASUACAD. Daniel Brink, Associate Dean for Technology Integration College of Liberal Arts and Sciences Arizona State University, Tempe, AZ 85287-1701 602/965-7748/1441 fax -1093 ATDXB@ASUVM.INRE.ASU.EDU ========================================================================= Date: Mon, 11 Feb 91 16:55:00 CST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: FORTIER@UOFMCC.BITNET Subject: Final Critique

THE TEI GUIDELINES (VERSION 1.1 10/90)
A CRITIQUE
by the Literature Working Group

Background

This critique of version P1.1 of the TEI Guidelines was drafted by the five members of the Literature Texts Work Group. These people work with texts in four natural languages, several literary genres and periods from the Middle Ages to the present. 
Among them they have recorded several million words of text, directed the development of several software systems, and published several dozen articles and half a dozen books based on computer analyses of texts; the methodology of these publications varies from traditional literary history to advanced statistical analyses. Much of the following critique is based on the Survey of the needs of scholars in literature carried out by the Work Group, to which some forty interdisciplinary producing scholars responded. A copy of the results of this Survey is available from the TEI Project. A preliminary version of this critique was circulated to the Editors of the project, and Michael Sperberg-McQueen's responses to it have been extremely helpful in arriving at this final version.

1. Perspective

The Work Group is impressed by the finished character of the current version of the Guidelines document, and the almost total absence of typographic errors. As people who work with and generate texts on a daily basis, we recognize the amount of effort which such an achievement represents. We wish to begin by expressing high praise for the current Guidelines as the result of concentrated and efficacious work on a difficult problem. Michael and Lou should be particularly singled out for this praise.

The comments which follow are offered in a spirit of friendly collaboration, in the hope that they will make an impressive document even better and will bring it more closely into conformity with the needs and perspectives of scholars working with literature.

The Work Group understands that the TEI is proposing a coding system for interchange, not for entry of texts. We realize also that many things are suggested as options, not as requirements. 
It must however also be recognized that simple considerations of efficiency -- it is practical to have a locally standard code as close as possible to the interchange code -- will tend to foster the use of TEI codes at the local level; ASCII was originally proposed as an interchange code; it is now a standard for alphanumeric representation.

The very polished and comprehensive nature of the present Guidelines also means that there will be a tendency for them to become standards, both for interchange and local processing, and even data entry; this possibility must be faced and taken into account as they are drafted. By a similar process, optional codes, in the absence of a clear distinction between the optional and the required, will tend to be considered as recommended or required, in spite of occasional or implicit indications to the contrary.

Three of the Poughkeepsie principles bear on this matter.

A. The Poughkeepsie Principles

2. The Guidelines are also intended to suggest principles for the encoding of texts in the same format.

5. The Guidelines should include a minimal set of conventions for encoding new texts in the format.

9. Conversion of existing machine-readable texts to the new format involves the translation of their conventions into the syntax of the new format. No requirements will be made for the addition of information not already coded in the texts.

It is our opinion that these three principles are of particular importance to scholars in literature, and that they are not sufficiently reflected in the current version of the Guidelines. Our reasons for this opinion will become clear in the rest of this report.

B. The Perspective of the Literature Scholar

Like most practitioners of an intellectual discipline, Literature Scholars are accustomed to working from a methodological perspective. 
The Guidelines would profit greatly from a theoretical introduction, making clear what is meant by such terms as "text", "tag", "hierarchy", etc. The fragments of discussion of this topic found here and there in the Guidelines (e.g. p. 71) are not adequate for this purpose. We realise that generating such definitions will not be an easy task, given that in a printed text titles, footnotes, and variants are clearly tags to the text, but in a TEI text they are treated as text. How the nature of text and tag changes as a result of a change in medium is not at all clear.

Similarly, we in literature recognize in a single text a plethora of structures: physical (page and line breaks), formal (parts, chapters, paragraphs), grammatical, semantic, actantial, narrative, psychological, and so on. Each can be deemed hierarchical from certain perspectives. Do the Guidelines permit all of these structures to be defined as hierarchies? Do they require such definition for their manipulation? Do they allow them to be handled simultaneously so that their interrelations can be examined? The suggestions for treating parallel texts in 5.10.12 (pp. 122-3) and elsewhere are not very clear on these matters.

Literary texts usually aim at richness of expression and multiplicity of levels of possible meaning. Can SGML-based Guidelines integrate this basic characteristic of literature, or do they attempt to abolish it? We realise that these are vexed questions, recalcitrant to simple answers, particularly when one accepts -- as we do with high praise -- the principle enunciated by the linguists (p. 130) that all theoretical positions must be welcomed by the Guidelines, but no one must be given pride of place. On the other hand, we consider it crucial for the acceptance of the Guidelines by our constituency that a thoughtful discussion of these matters be found at the beginning of the Guidelines document. 
For instance, the discussion of highlighting on pp. 78 and 124 would seem, in the absence of such a discussion, to be based on the premise that authorial intention is discernible from the text; such a premise ceased being intellectually respectable in our field about fifty years ago.

The pragmatics of work on literature texts is also a source of concern in a number of areas. The scholar in literature typically works with large amounts of data, since computer processing is used mainly when it is not practical to commit a text to memory. These scholars are concerned mainly with inputting texts as rapidly and at as reasonable a cost as possible, verifying them as effectively and cheaply as possible, and getting on as quickly as possible with the analytic work which was their reason for working with the machine.

Except when they are generating a canonical text, literature scholars work with a specific edition of a text which is considered canonical in the sense that it is the one which is cited and quoted in serious professional work. Depending on the situation, this specific edition will be a critical edition, a prestigious edition, or a trade edition. They will want to refer easily to pages and lines in this text. That the electronic version of this text be stable, and not subject to change other than to correct errors, is also a requirement. This perspective is made perfectly clear in the responses to the Survey and in the practices of the great repositories of machine-readable texts, like the Tresor de la langue francaise.

Literature scholars are not interested in -- in fact many object vehemently to -- the prospect of obtaining texts which already contain, explicitly or implicitly, literary interpretations. The responses and comments elicited by the Survey bear eloquent witness to this. 
For these reasons we recommend that the Guidelines clearly distinguish between a minimal set of required tags and a wide range of optional tags to be used at the discretion of the text preparer. The present version of the Guidelines is not in harmony with our perspective. Some examples:

p. 1 (1.1.1) The statement is made that the Guidelines "are also intended to provide guidance to the scholar embarking on the creation of an electronic text, both as to what textual features should be captured and as to how they should be represented". We do not find such a claim appropriate in what is clearly becoming a technical manual, not a user's guide. We consider that such a claim constitutes a dangerous trap for the neophyte. It should be removed.

p. 4, para 3. States that full tags need not be entered by hand, and allusion is made to macros or parsers; no examples, names, or references are furnished. Here again we are concerned about the effect on the neophyte. If macros and parsers exist, examples of both should be provided here, and at least half of the examples in the rest of the document should show their use.

p. 15 (2.1.4) Recommends embedding a given interpretation into markup at the time of data capture or conversion in the form of a DTD. The Survey clearly indicates that most scholars of literature strongly oppose finding interpretation already in texts which they receive. To recommend embodying such interpretation in an interchange format is paradoxical, to say the least. It is recognized that all coding can be seen as a kind of interpretation, but a fundamental distinction must be made here. A certain character is or is not in italic; once the way of representing italic has been decided, a simple either-or decision carrying very little intellectual content will resolve the matter. 
Why a word is italicised is open to a number of interpretations; scholars legitimately may not agree on which one or ones are valid. This is interpretation in the usual sense, and is the domain of the scholar working on the completed text, not that of the coder inputting or converting the text. Recommendations overlooking this distinction will alienate the vast majority of literature people working with computers. The Survey has made this clear.

p. 16 (2.1.4.2) Minimisation rules are a good idea. Examples (note the plural) should be provided.

p. 23, the example. The coding is much too wordy; the poem, which is tiny, disappears under the mass of codes. Responses to the Survey and discussions on TEI-LIST have made clear the dismay of the scholarly community with this wordiness. Minimisation will have to be carried much further, and software will have to be developed with a feature similar to the reveal codes/hide codes function of many word processors. This is not a minor problem but points to an underlying reality. If structural features are indicated by format, this indication suffices. Those features which require explicit coding will be more complex, more prone to error, more difficult to enter consistently, and more difficult to verify and proofread. Scholars are not likely to undertake such onerous tasks whose results will be so fragile. It should be recalled that in the final analysis the success of the TEI standards will depend on their acceptance and use by the scholarly community. In general, the very wordy nature of the tags recalls an archaic period in computing, when the user was expected to specify everything to the machine. A more contemporary and user-friendly mode of tagging is expected by current users and must be sought, since few users can be expected to put up with such wordiness any more.

p. 28 (2.1.7) Entity Reference (string substitution). This is excellent. 
It must be stressed more, alluded to more, and shown frequently in examples.
- p. 55 (4.1.4) Since most scholarly work in literature is based on a canonical text, in which pagination and lineation frequently vary with the PRINTING, not just the edition, it is essential to identify the date of printing and the print shop in the header material of a machine-readable file of a text based on a printed edition. Reference back to the original, verification and proofreading are impossible without them.
- p. 62. We suggest putting print shop and date of printing between the information on the publication and that of the distribution. This would also be the appropriate place to identify the location and shelf mark for manuscripts and incunabula.
- p. 65 (4.5) The encoding declarations are of course the ideal place to put allusions to and/or explanations of the local coding conventions. Please stress this fact here. In fact, we recommend making it a condition of conformity to TEI standards that local coding for features not available on the keyboard used (font changes, accented letters, etc.) be documented in a header record.
- p. 71 (5.1), para 1. The definition of text, "an extended stretch of natural discourse, whether written or spoken", is not correct. Not all texts are extended. Spoken natural discourse is not text until transcribed in written form.
- p. 71 (5.1), para 5. Again, the ability to point to a unique place in the text of the original printed document is essential to the needs of literature scholars. This must be stressed here and shown in the examples. The Survey is eloquent on this matter.
- p. 77 (5.2.5) Colophon -- not a term everyone can be expected to know. Note that the Pleiade edition shows this as front matter.
Given the practical importance of the printing date and print shop information included here, we recommend that it be put at the beginning of the file, right after the publisher identification.
- p. 77 (5.3.1) Given their importance for locating a quoted or identified passage, line breaks should be mentioned here and their importance stressed. The Survey made this abundantly clear.
- p. 93 (5.6) A strong recommendation to code page breaks: EXCELLENT. Please put in an equally strong or stronger recommendation to code line breaks, i.e. always put them in unless there is a compelling reason not to do so, even in prose texts. To do otherwise would be to ignore the contribution of the scholars who participated in the Survey.
- pp. 125-6 (5.11.2) Information about the layout of the edition input (i.e. page and line breaks), which permits reference back to the original text being studied, is crucial to the needs of most literature scholars. To state that the "line-break" tag "is intended only for cases where lineation of a prose text is considered of importance in its own right" (p. 126) suggests that such reference is rare, whereas it is THE NORM. It MUST NOT be downplayed in this fashion.
  Our judgement, confirmed by the Survey, is that most scholars use electronic text in a fashion that requires the ability to make unambiguous reference back to a precise place in the canonical printed text on which it is based. Thus lineation of a prose text is always considered important a priori, unless, as in cases like the Bible, a clear case can be made for coding in a different fashion. In short, the suggestion that lineation can somehow not be important in a text runs counter to the needs and practices of scholars of literature.
- p. 177 (7.3.1.1) It is not necessary to specify the metre attribute in every line. That is the work of the analyst, not the archivist or the scanner corrector.
- p. 178 (7.3.1.2) Even for rhyme of type "aa" French prosody recognizes at least three types: rime suffisante (not necessarily the same as assonance), rime pauvre, and rime riche. Perhaps this should also be taken into account. BETTER, given the range of languages to which the Guidelines are to apply and the large number of prosodic systems in question, perhaps the Guidelines should not be so prescriptive. The Work Group expects to work on optional codes for such things, once more pressing requirements of literature scholars have been attended to.
- p. 200 Putting tags, entities and redefinitions in a separate file for calling up by many texts is an excellent idea. Unfortunately the example is not at all clear, and makes this seem much more complex and confusing than it is or need be.
- pp. 207-09. It is a trap for the unwary and an irritation to the experienced to show the suppression of typographical information (line breaks) in an extended example like this. The justification that the edition used wasn't very good -- "the edition being used is of little editorial interest in itself" (208) -- makes things worse; poor editions should not be converted to machine-readable form!
- pp. 219-33 (A.6) We agree that in the case of the Bible the older and more authoritative method of identifying passages should prevail.

2. Coding Levels

The Guidelines recommend three levels of coding:
- 1. Required in any TEI conformant document (e.g. title, author, etc.)
- 2. Required for interchange, but a more succinct local code is recommended (e.g. accented letters, non-roman alphabetics).
- 3. Optional.
It is not always easy to tell which is which from the present version of the document. This distinction must be made clear.
We recommend a very small number of required codes: just what is necessary to identify fully the edition and printing used and to find a given passage in it in terms of pages and lines, divisions into chapters, acts and scenes, cantos, or books, etc., the character set used, and the representation used for features in the text but not in the character set (i.e. accented letters, font changes). All other codes must be optional. Examples of optional codes should be furnished. We repeat that the distinction between the two types must be made abundantly clear even to the uninformed, casual or negligent reader.
  In our view, a possible method would be to separate out each type and group them as required, or optional. An alternate method would be to tag each heading with a parenthetical indication of which class each tag or tag type belongs to. The optimum method would be to do both.
- Further comments on coding levels follow:
- p. 1 (1.1.2) The Guidelines recommend the use of simpler and less wordy codes in a local environment, which codes are to be translated into full TEI coding for interchange. EXCELLENT!!! BRAVO!!! PLEASE DO MORE OF THIS! It should be made very clear that this is the RECOMMENDED approach. Examples of existing coding schemes upgradable to TEI level taken from existing archives should be given. Other examples (made up for the purpose) should be given. It must be made clear to the user that clean, clear and easy codes are to be the NORM for local use, and that the full TEI codes are for interchange and possibly archive purposes only.
- p. 4 para 5. Interchange format does not allow any tag reduction. This is legitimate. But it MUST be made clearer that local minimization is encouraged, as long as automatic upgrading to full TEI codes is possible from the local code.
- pp. 13-14 The examples are the perfect place to show a local code first, then the full TEI code.
- pp. 45-52 (3.2) Character Sets.
It MUST be made clear that this applies to interchange only. Local codes MUST be recommended and SHOWN which are easy to input and easy to use on a screen and printer of MAC, DOS and Mainframe machines (at least 2 sets of examples for each of the three). Preferably get some from existing databases and some from the various forms of 8859.
  The exclusion of such an important punctuation mark as the exclamation mark puts a needless coding burden on scholars. This exclusion should be removed. SGML should not take precedence over the needs of scholars. Similar arguments can be made in favour of the pound sign and square brackets.
- p. 58 (para 4) The exclusion of recording the names of the person or persons who actually did the recording work reveals an inappropriate class and/or gender bias. Please delete this paragraph.
- pp. 58-59 The examples provide an excellent opportunity to show both local codes and TEI codes.
- p. 59 (4.3.2) para 5. The changes listed, "corrections of mis-spellings of data, changes in the arrangements of the contents, changes in the output format", are not in fact minor. This paragraph contradicts p. 55 (4.1.6). Please clarify, or better still, choose.
- pp. 82-83 (5.3.6, 5.3.7) It MUST be made clear that these very wordy and error-prone features are optional. Please try to cut down their length. It is essential to warn the potential user of their complexity and of the difficulty of coding them accurately in a text of any size. Their optional nature MUST be made more clear. In their present state they are counter-productive, both because of their wordiness and because of the technical naivete which such wordiness embodies.
- pp. 84-6 (5.3.8) List handling is excessively wordy and takes too much for granted. There must be an example of a simplified local code as well as the full TEI code here.
- pp. 86-89 (5.3.11) Numbers: a perfect example here of a trap for the unwary. Only "may" on p. 87 shows that this extremely wordy coding is optional.
- pp. 89-90 (5.4) This is a good idea but for a post-input markup. This fact must be made clear and encouraged. Mention that this is a relatively rare occurrence.
- p. 93 (5.6.1) It is absolutely necessary to have an example here and to show both local and TEI formats.
- p. 94 (5.6.1) It is absolutely necessary here to have an example and to show both local and TEI coding. It is very doubtful that any scholar or critic will ever use this kind of coding. Something more straightforward and user-friendly is required.
- p. 97 (5.6.4) Seems to suggest only fully explicit coding in milestones. You really need to show brief local codes here, PLUS their expansion into TEI codes.
- p. 103 (5.8.1) Explicit tagging of sentences. This is overkill. This must be clearly indicated as optional, and another part needs to be added suggesting how to set up a local code permitting automatic conversion to this level of coding.
- pp. 110 ff (5.10.3) The examples from pp. 110 through 117 are prime candidates for examples of both local and full TEI codes. The Critical Edition example is particularly weak. The example is trivial. The only clear presentation is the uncoded one. The explicit and wordy recording of the lack of variants, and the use of "&zero.var" for omissions, are bizarre in the extreme and fly in the face of a millennium of scholarly practice. This attempt to reduce three parallel texts to a single linearly expressed notation is clearly defective. The text has been destroyed and converted into an unreadable list of real and potential variants. The prime function of any text is to be read. This conversion has destroyed the text as text. Reference must be made to experts in this domain and their advice must be followed. Here again we hope to work on this, once more fundamental questions have been resolved.
- p. 170 (7.2.1) The encoding declarations are an EXCELLENT idea and to be encouraged, indeed made required. They also foster the definition of local standards which can be converted automatically into TEI format.
- pp. 207-09 A perfect place for a two-step example, the first part showing local code, the second showing TEI code.

3. Coding Types

Here are discussed the two types of coding: Presentational (capital letters, line breaks, italics, etc.) and Descriptive (proper noun, italics showing irony, stress or a foreign word, etc.).
  Our perspective is that coding (inputting or converting text) is not the same as interpreting. Descriptive coding as presented in the Guidelines is squarely in the domain of interpretation. Most scholars do not want interpreted texts; they expect to do that job themselves. They made this abundantly clear in the Survey; we must not ignore them. When possible, scholars hire assistants to input texts, and do not expect these assistants to do the interpretation. This whole aspect needs to be brought into conformity with scholarly practice, otherwise the TEI standards will not be respected.
  To repeat: one-to-one conversion of typographical features is not controversial; it should be done as faithfully as possible. It must be a requirement in a TEI conformant text. Coding or interpretation in the sense of description of authorial "intention", or the choice among several alternatives on the basis of judgement, is a different matter, which is designated descriptive coding. It can be allowed but never recommended. The Guidelines are quite unclear on this matter, and seem to make conflicting suggestions in different places.
  Descriptive mark up can at the limit be made an option for those who feel they must do it. But it must be made clear that such tagging is OPTIONAL and NOT REQUIRED.
- Comments on details follow:
- p. 12 (2.1.2) Direct quotation, indirect quotation, indirect discourse, free indirect discourse, authorial comment, description or narration -- all of these aspects of a text can blend one into another. Which is which is open to interpretation and debate. It is ludicrous to tag them as if such distinctions could be made once and for all. Not only must the optional nature of such tagging be stressed, but potential users must be cautioned to exercise prudence in such coding, to define categories carefully, to test them by hand on small samples and shake them down on larger samples of electronic text, before undertaking the tagging of a full text.
- p. 71 (5.1) Presentational mark up is allowed here, as well as descriptive. NO! Presentational mark up should be recommended, with descriptive at most recognized as possible if one wants to use it, but with warnings against it. The examples will have to be revised.
- pp. 77-78, 88, etc. The concept of crystals (or the choice of term) is not made clear; the examples are difficult to follow. Revision seems in order.
- pp. 78-9 (5.3.2) This section is presented primarily in terms of descriptive mark up, which is wrong. The presentational should be recommended, if only because it avoids the excessive wordiness of the descriptive approach. The wordiness of the so-called presentational mark up must be reduced; for example "highlighted rendition=italic" can be replaced with "ital" without any loss of information. In fact, the longer form is more descriptive than presentational. The earlier examples of handling of the underlying features of italics require so subjective an interpretation that any scientific rigour in a text coded using them would be destroyed.
- pp. 79-81 (5.3.3) Do NOT recommend tagging of underlying features; just the opposite. Stick with the for open and close quotes, suggest something else for block quotes, e.g. .
Remind the user that she can use open and close quotes or guillemets (other things for embedded quotes) for a local code and have a conversion program take care of the rest.
  "Guillemets", by the way, is used in the plural. There is no such thing as a single guillemet. What you show as such are greater-than and less-than signs. What is the use of 66U, etc. when character set tables are in the appendix?
  The recommendation to use "rendition = unmarked" (p. 80) with "q" is bizarre in the extreme. Many readers, and some of the better software, can be expected to identify an item as unmarked without the aid of a specific tag.
- pp. 81-82 (5.3.4, 5.3.5) Perfect traps for the unwary. This is interpretation and dependent on time; it adds unnecessary work, confusion and possibility for error. Particularly true in the case of "croissant" (p. 81) and in the example on p. 82.
- p. 83 (5.3.7) If anyone in our community sees the bibliographic tagging on 83, the TEI is a dead letter. The issues of how to handle names, abbreviations in names, etc. are important and not easy for programmers to deal with, but if this level of coding has to be done at the capture or transmission stage, we assure you, no one will use TEI. (Sorry, archivists and programmers might, but no one who is putting text into machine readable form in order to do anything critical or scholarly with it will ever do this kind of hiding of information in layers and layers of codes.)
- p. 103 (5.8.1) Explicit tagging of sentences. This takes for granted that such can be known, which is not the case for numerous poets, and even novelists since the 1930s, cf. Celine, Simon, etc. in French. Here is an excellent example of why descriptive coding is wrong.
- p. 105 (last para) It is most questionable whether one should EVER remove an interpretable feature from a text and replace it by an interpretation.
Not only does this make verification of the data impossible (it has to be re-interpreted, not proofread), but it also involves the coder usurping the role of the scholar who does the interpretation.
- p. 123 (5.11) Here presentational mark up is described as exceptional and extraordinary; earlier it was presented as a valid alternative; consistent standards never hurt. More important, presentational mark up should BE the standard, with descriptive only an option which is allowed with cautions.
- p. 123 (5.11) Use of "descriptive" in line one and of "presentation" in line 4 shows the problem presented by the SGML approach. If presentational markup had been used from the start as the sine qua non, none of this would be a problem.
- p. 124 (5.11.1) The example. What edition was used? What are the page and line boundaries? Or was this all made up too? This example is a perfect demonstration of the weakness of descriptive mark up: "Anglice" is not found in the standard Latin dictionary (Lewis and Short). What are we dealing with here? Are the italics quotes, emphasis or ironic? Let the coder code and leave the interpretation to the scholar.
- p. 176 (7.3) First, according to certain schools of interpretation texts can and should be regarded in isolation, and it is not the place of the TEI to pass judgement on this question of literary theory. Second, presentational mark up is essential because the Guidelines deal with coding a text, not its interpretation. The role of a given textual feature is ALWAYS open to interpretation, so the function of a good coding scheme is to facilitate interpretation, not pre-empt it.
- p. 214 (bottom) The Hamlet example. The stage type describes only the first half of the stage direction; this is the problem with descriptive tagging. Someone should try to reduce the wordiness of this tagging, particularly in the case of the speaker distinctions.

4. Other

This section contains comments that do not fit easily into the categories used above.
- pp. 75-76 (5.2.4) Why use etc.? The names given to the sections by the author are the text. If the author chooses to use a number "I" or "2" surrounded by blank space, that is what SGML should do. If it cannot code blank lines and blanks, then we are in rather serious trouble as literature scholars. We will be forced to describe, when presentation is what we want to do. This whole section is really designed for programmers, not for people in our area -- this type of material will only frighten users away from the Guidelines; it is virtually incomprehensible and in the long run not even true. There are alternatives other than the one listed, using the facts of the text rather than any imposed divisions, large or small.
- p. 76 (5.2.4) The distinction between legal and illegal forms is not clear. In any case the legalistic terminology is not appropriate.
- p. 79 (line 7). The "second" sentence. TYPO. It is the only sentence in the example, unless the TEI standards have subtleties which escaped the committee.
- p. 88 (example 1) TYPO. must go after "seventy-seven" if you care to be consistent with the date coded earlier as 1977-06-12.
- p. 90 (example after ) "Dumb clucks": Belittling the reader in this fashion is not amusing; it is offensive. Remove it and find a real example from a real text.
- p. 95 Assumes exactly what we do not want to assume: "text has been entered without preserving pagination". No need for an artificial reference scheme; one already exists (the page numbers and carriage returns at the ends of the lines).
- p. 96 (4.6.2) What can it mean to mark as "absent" a piece of text that is not present? What exactly is there to be marked?
- p. 105 (5.8.2) Soft hyphens EXIST in source texts. Please suggest more clearly how to handle them when they occur.
- pp. 110 ff. (5.10.3) Find a real text for a real example here. The imaginary and "humourous" one trivialises what is being done.
- p. 129 (6.1) para 2. Trying to define forms with no reference to content is a mug's game. The whole concept of structure shows that form determines content and content determines form, in varying degrees according to the context, example, and interpretative perspective, of course. In other words, you must create unanimity among the community of scholars BEFORE you can define the forms they can use. Not a practicable enterprise.
- p. 130 (6.1) The principle for linguistics (welcome all theoretical positions, favour none) is EXCELLENT. We recommend the same thing for literature; this is the basic premise of most of the preceding comments.
- pp. 140-44 (6.2.4) Incredibly wordy and unreadable coding for linguistic features. If the linguists consider this a good idea, more power to them. We recommend not getting into this for literature texts.
- p. 169 (7.1) "verse, drama and narrative". Narrative is used in the sense of prose. Not all prose is narrative (cf. cookbooks, or the TEI Guidelines); not even all literary prose is narrative (some is descriptive). If you are going to try to dictate, or even make suggestions, to scholars in literature, you must get the technical language right, and "sermons, guidebooks, recipe books, etc." (p. 176) are NOT narratives, formal or otherwise, in any accepted sense of the word.
- p. 180 (7.3.2.1) Overkill if both speaker and speech tell that the speaker is Cordelia -- why not just say so once, by recognizing the abbreviation of the speaker's name that is in the text to be the "tag" that it is. The real problem in dealing with speech in plays is that the speaker's tag needs to appear with each sentence (or all the words) of long speeches. Identifying "Cor" as Cor. does not contribute to solving this problem.
- p. 180 (7.3.2.2) Excellent example of giving the simple tag, mentioning that some investigators may want to also encode this, that and the other, but not giving prescriptive examples.
- p. 181 (7.3.2.4) French texts of plays also show the date and place of the first production as well as the names of the actors. You should provide for this.
- p. 181 (7.3.3) Use PROSE, not narrative, to include the essay and free-form creations (cf. Butor's works).
- p. 207 -- the original of this example is a printed document, not scanner output. Please begin by showing the original, not an intermediary stage of processing.
- p. 215 The idea of removing speaker tags, then identifying the speakers as speaker 1 and speaker 2, but then actually giving them names in the speech tag that follows, is to say the least messy. Either the speaker is tagged in the text or is not.
- p. 215 Note "Mar.Marc" -- clearly a leftover fragment of a redundant tag.
- p. 270 Alternate Base for DTD for drama -- If this goes out to any public other than programmers, then the TEI standards will not be used. Give us one reason why anyone would want to.
- Place of insertion to be chosen: Concern was expressed in the Work Group about the integrity of electronic texts. Simply counting the size of a file in bytes does not guarantee that one can recognize modifications in it. Shareware exists which generates a unique number for a text, a number which will change if any modifications are made to it. Please look into the possibility of recommending such software, or better, recommending that it, and the number generated by it, be included with archived or shared texts.
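The integrity-number idea in the last paragraph above can be sketched in a few lines. This is an illustration only, not the shareware the Work Group refers to: any digest function with the same property (the number changes whenever the text changes) would serve; SHA-256 is used here purely as a modern stand-in.

```python
# Sketch of an integrity number for an electronic text: a digest that
# changes if ANY modification is made to the text, unlike a byte count,
# which can stay the same after an edit.
import hashlib

def text_fingerprint(data: bytes) -> str:
    """Return a hexadecimal digest that changes if the text changes."""
    return hashlib.sha256(data).hexdigest()

original = b"To be, or not to be, that is the question."
modified = b"To be, or not to be, that is the question!"  # same length!

print(text_fingerprint(original))
print(text_fingerprint(original) == text_fingerprint(modified))  # digests differ
```

Note that the two versions above have the same length in bytes, so a simple file-size check would miss the change while the fingerprint catches it.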
=========================================================================
Date: Mon, 11 Feb 91 18:47:31 MST
Reply-To: Text Encoding Initiative public discussion list
Sender: Text Encoding Initiative public discussion list
From: Daniel Brink
Subject: Re: Final Critique
In-Reply-To: Message of Mon, 11 Feb 91 16:55:00 CST

Since there is a "Critique" session planned for ACH/ALLC, will there be any representation from the authors of this "Final Critique" at the conference?

Daniel Brink, Associate Dean for Technology Integration
College of Liberal Arts and Sciences
Arizona State University, Tempe, AZ 85287-1701
602/965-7748/1441 fax -1093
ATDXB@ASUVM.INRE.ASU.EDU
=========================================================================
Date: Tue, 12 Feb 91 09:33:00 MDT
Reply-To: Text Encoding Initiative public discussion list
Sender: Text Encoding Initiative public discussion list
From: CHERYLL BALL
Subject: Mass Mailings

For all you folks out there who responded to me so RUDELY, I was given this list by someone on another list who thought this list might help. I am truly sorry for being so NAIVE as to believe everything that I am told by other people. I had no IDEA that I would be chastised so HARSHLY. Believe me when I say I WILL NEVER POST TO THIS LIST AGAIN.
=========================================================================
Date: Tue, 12 Feb 91 12:07:10 EST
Reply-To: Text Encoding Initiative public discussion list
Sender: Text Encoding Initiative public discussion list
From: var@IRIS.BROWN.EDU
Subject: TEI and HyTime

There is an ANSI meeting coming up soon that might be of interest to many participants in the Text Encoding Initiative. This meeting deals with the standardization of hypertext around an extension to SGML called HyTime. What follows is the announcement for that meeting. I am enclosing it here since the meeting is a little over a week away, and it would be nice to have a few TEI folks join the meeting even on such short notice.
Victor Riley
Institute for Research in Information and Scholarship (IRIS)
Brown University
PO Box 1646
Providence, RI 02912
var@iris.brown.edu

====8<====8<====8<====8<====

X3V1.8M MUSIC IN INFORMATION PROCESSING STANDARDS (MIPS) COMMITTEE
operating under the rules and procedures of the American National Standards Institute

X3V1.8M Secretariats:
The Computer Music Association, c/o Larry Austin, President, P. O. Box 1634, San Francisco, California 94101-1634 USA (817 566 2235; cma@dept.csci.unt.edu) (X3V1.8M document orders and service to the music technology community)
Graphic Communications Association, c/o Marion Elledge, Vice President, Information Technologies, 100 Daingerfield Road, Alexandria, Virginia 22314 USA (703 519 8160; Fax: 703 548-2867) (X3V1.8M participant mailings and service to the publishing systems community)

MEETING NOTICE and DRAFT AGENDA - FIFTEENTH MEETING

MEETING NOTICE:

Meeting times:
Saturday, February 23, 1991, 10:00 AM - 5:00 PM
Sunday, February 24, 1991, 9:30 AM - 5:30 PM
Monday, February 25, 1991, 9:30 AM - 5:30 PM
Tuesday, February 26, 1991, 9:30 AM - 5:30 PM
Wednesday, February 27, 1991, 9:30 AM - 1:00 PM

Meeting Host: Graphic Communications Association (GCA), Norman Scharpf, President; Marion Elledge, Vice President, Information Technologies. The meeting is being held in conjunction with the GCA's "TechDoc Winter '91" conference. TechDoc Winter '91 is subtitled "Interactive Electronic Documentation (IED)." Tutorial sessions will occur simultaneously with X3V1.8M meetings (in different rooms, of course) on February 25 and 26, while the TechDoc Winter '91 conference will take place from February 27 (a one-day overlap with X3V1.8M's meeting) to March 1. There will be a tutorial on HyTime during Tuesday afternoon, February 26, which the X3V1.8M committee may or may not choose to attend.

Meeting Location:
The Radisson Hotel
1600 N. Indian Avenue
Palm Springs, California 92262
619 327 8311

WRITTEN CONTRIBUTIONS

The usual mailing of papers contributed since the last mailing, together with the most recent revision of X3V1.8M/SD-7, the Journal of Development for the HyTime Hypermedia/Time-based Document Representation Language (eighth draft), will be mailed to participants of record toward the end of January, 1991. Papers should be received in camera-ready form by January 15, 1991 by X3V1.8M Vice Chairman Steven R. Newcomb, Center for Music Research, School of Music, Florida State University R-71, Tallahassee, Florida 32306-2098 USA. (Voice: 904 644 5786, 904 422 3574. Fax: 904 386 2562 or 904 644 6100. Internet: srn@cmr.fsu.edu.)

LODGING

Lodging at the Radisson Hotel is available for $119/night, which is a special rate available to those who mention on the phone that they are there in conjunction with the Graphic Communications Association's TechDoc Winter '91 Conference. The phone number for Radisson reservations is 619 327 8311. There is, of course, no requirement that X3V1.8M participants stay at the Radisson, but, since the meeting will be held there, the Radisson will be the most convenient (if probably not the least expensive) lodging.

TRAVEL

It is possible to travel directly to Palm Springs by air. It is generally less expensive to go to Orange County Airport and drive for a couple of hours to Palm Springs, particularly if you are renting a car anyway.

NOTES TO NEW PARTICIPANTS/OBSERVERS:

1. Prospective members and observers are welcome at any time to participate in the current technical work of the committee. (You can be most effective in conveying your viewpoint if you can present it in the context of the current work -- in other words, please be familiar with X3V1.8M/SD-6, SD-7 and SD-8. If you don't have these, they can be obtained for a nominal charge from the Computer Music Association's X3V1.8M Secretariat.)
New participants are also urged to obtain and read ISO 8879 (Standard Generalized Markup Language). ISO 8879 is obtainable from the Graphic Communications Association for $67.50 (156 pp.). You should also obtain International Standard ISO 8879:1986/Amendment 1 from the same organization.

2. As usual, a portion of the second day's meeting (Sunday) has been set aside for persons who wish to address the committee on topics of their own choosing, relating to the subject matter or methodology of the committee's work. Mr. Brian Caporlette of the U. S. Air Force's Human Resources Laboratory at Wright-Patterson AFB will be presenting the recent revisions to the Content Data Model (CDM) for Interactive Electronic Technical Manuals (IETMs) his organization has made in order to make the CDM conform to HyTime.

3. New participants are asked (but not required) to inform Charles Goldfarb (c/o Sue Orlando, IBM Almaden Research Center, 408/927-2578) or Steve Newcomb (Florida State University Center for Music Research, Tallahassee, FL 32306-2098, 904/644-5786) if they plan to attend.

DRAFT AGENDA:

Saturday
Administrative matters, including: opening, approval of agenda, introduction of new participants, and scheduling the sixteenth (and possibly the seventeenth) meeting(s). Technical work will include a review of the changes to SD-7 made as a result of work done at the fourteenth meeting.

Sunday
Continuation of review of SD-7, particularly the application of the "HyTime architectural form" idea to additional elements. Presentation by Mr. Caporlette on the AFHRL Content Data Model as revised to conform to the HyTime hyperlink and document location facilities. Reconsideration of the "endsets" idea, which would allow certain link end locations to be restricted to a given list of generic identifiers.

Monday
Continuation of SD-7 review, including the generalization of the time model to space and time.

Tuesday
Continuation of Monday's agenda.
Review of the operating model of a HyTime engine outlined at the thirteenth meeting. Possible adjournment to HyTime Tutorial in the afternoon, which will include a presentation of the proto-SD-9 document, "HyTime Review," by Messrs. Kipp and Newcomb. Wednesday Enumeration of instructions to the editors regarding revisions to the working draft of HyTime. Adjournment. {Revised 91/02/12} ========================================================================= Date: Thu, 14 Feb 91 18:19:18 CST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list Comments: "ACH / ACL / ALLC Text Encoding Initiative" From: Michael Sperberg-McQueen 312 996-2477 -2981 Subject: TEI progress report, February 1991 As a change from conference announcements, we thought readers of this list might like a brief summary of what has been happening in the TEI since the distribution of the first Draft Guidelines last fall. Sincere apologies to those who feel such a report is long overdue! 1. TEI Deliverables 1.a. Documents First, a brief recap on the project's overall timescale and objectives. What will the TEI deliver in June 1992, when the funding dries up? It seems clear that a single massive report (a revised and extended version of the current document TEI P1) will not be enough. The need for a brief introductory guide, setting out the basic TEI framework and philosophy, has been repeatedly pointed out to us, sometimes privately and often publicly, as has the pressing need for tutorial material, and for demonstrations of TEI encoded texts in action. No effort was put into producing these in the first cycle, for the good reason that we did not at that time know what exactly we would be providing an introductory guide to! Now that the basic TEI framework is a little less nebulous, it seems appropriate to address these problems. 
Preparations for the forthcoming TEI Workshop at Tempe will provide one important source of such materials, and input from the affiliated projects another. It's possible that readers of this list may also have prepared some summary or explanatory material which might be of use -- don't be shy about letting us know about it, if you have. (For starters, we were recently delighted to receive a translation into Hungarian of the four page `executive summary' of P1). 1.b. Software -- a non-deliverable After tutorial and introductory materials the most frequently expressed desire at present seems to be for TEI-conformant software: systems which behave like the analytic packages we all know and love, but can also take advantage of the new capabilities offered by SGML. As a first step, we need programs (filters, as they are known in the trade) to translate from the TEI encoding scheme to those required by the application programs we use, and back in the other direction. For rolling one's own software, the community needs generally available routines which can read and understand TEI documents and which can be built into software individuals or projects develop for themselves or others (TEI parsers). Equally important for the usability of the encoding scheme in the community at large will be TEI-aware data-entry software -- editors and word processors which can exploit the rich text structure provided by SGML, simple routines to allow TEI tags to be entered into a text with a keystroke or two instead of ten or twenty (or in extreme cases even more!), and other tools to help make new texts in the form recommended by the TEI. Approximations to some of these are already available, and we hope to be demonstrating some of them at the Tempe Workshop. 
As we have often said, the TEI is not in the business of software development: nevertheless, it's clear that when any opportunity of steering software developers into channels likely to benefit the TEI community presents itself, we'd be foolish not to take it. So far, only encouraging noises have been heard from most, but products like DynaText (from Electronic Book Technologies) are a clear indication of the kinds of software we should expect to be able to choose amongst by the time the project ends. The Metalanguage Committee has accepted a `watching brief' to monitor and report on the features of commercially available SGML software, and has already produced a preliminary working paper (ML P28) which lists several products of interest to the TEI community, as well as a revised and expanded version of Robin Cover's monumental bibliography of SGML related information (ML W14). (These are not yet publicly available; ML P28 is being revised to correct a slip or two, and ML W14 will be put on the TEI-L file server just as soon as we can sweet-talk the UIC system management into the necessary megabyte or so of disk space and move the data to Chicago from Kingston.) 1.c. And more documents Just as many people have asked for some description of TEI encoding less technical and formal than TEI P1, so also some have asked for a more formal treatment of the scheme, so that it would be easier to write the TEI-conformant software they'd like to develop. In this connection, some work is proceeding (slowly!) on a formal presentation of the subset of SGML required by the TEI; the Metalanguage committee is also working on a more explicit definition of the notion 'TEI conformance'; this concept was intentionally left vague in the first draft but it appears that such vagueness has less to recommend it than we thought. 2. 
TEI Workplans If we're not producing any software, and only grudgingly getting round to explaining the work done in the first cycle, what, you might reasonably enquire, are we in fact doing? The major objective during the second funding cycle will be to extend the scope and coverage of the Guidelines. Those who have read P1 closely will be aware, as we are, of the very large number of topics sketched out, adumbrated or downright neglected therein. We remain confident that P1 provides a good general framework for most forms of text-based scholarship, but we need to put this claim to the test in more (and more different) areas of specialisation than was possible during the first cycle. How will this be done? One way, as we've already indicated, will be through the testing of the Guidelines in a practical situation which the Affiliated Projects will carry out. The other will be through the setting-up of a number of small but tightly-focussed working groups to make recommendations in specified areas, either directly where an area is already well-defined, or indirectly by sketching out a problem domain and proposing other work groups which need to be set up within it. Each work group will be given a specific charge and will work to a specified deadline. 
So far, about a dozen such groups have been set up, most of which are due to report back by the end of March: a list of currently active work groups and their heads is given below: TR1: Character sets (Harry Gaylord, University of Groningen) TR2: Text criticism (Robert Kraft, University of Pennsylvania) TR3: Hypertext and hypermedia (Steven DeRose, EBT) TR4: Mathematical formulae and tables (Paul Ellison, University of Exeter) TR6: Language corpora (Douglas Biber, Northern Arizona University) AI1: General linguistics (Terry Langendoen, University of Arizona) AI2: Spoken texts (Stig Johansson, University of Oslo) AI3: Literary studies (Paul Fortier, University of Manitoba) AI4: Historical studies (Daniel Greenstein, University of Glasgow) AI5: Machine-readable dictionaries (Robert Amsler, Mitre Corporation) AI6: Computational lexica (Robert Ingria, BBN) Each group is formally assigned to one of the two major working committees of the TEI, depending on whether its work is primarily concerned with Text Representation (TR) or Text Analysis and Interpretation (AI). These two committees will then review and endorse the findings of each work group, though we expect that for some areas we will also seek expert outside reviewers, perhaps with the assistance of the Advisory Board. A number of other work group topics have already been identified, and are in the process of being set up: these include the following: TR5: Newspapers TR7: General reference works TR8: Physical description of manuscripts and incunabula TR9: Analytic bibliography AI7: Terminological data For some of these we have already identified suitably qualified members; for others (in particular the first two) * * * * * * * * * * * * * * * * * * * * * * * * * * we are soliciting volunteers or nominations. * * * * * * * * * * * * * * * * * * * * * * * * * * If there is an area of textual scholarship which you feel has been unjustly neglected by the current draft, please don't hesitate to let us know about it! 
Among other areas already proposed for consideration are
- version control and the gradual enrichment of machine-readable texts
- ephemera (tickets, matchbooks, advertising)
- fragmentary ancient media (potsherds, inscriptions etc.)
- emblems (both isolated and libri emblematum)
A meeting was held in Oxford in early December for the heads of all then-constituted workgroups, and some workgroups are already well advanced in their work. As reports become available, their existence will be publicized on this list and elsewhere. (You have already seen one working paper produced by the work group on literary studies.) In addition, of course, we will be making a full TEI progress report at the Tempe conference. 3. TEI Working Documents We are in the process of revising and making more accessible the TEI document register at Chicago, which holds information about all TEI-related working papers, reports and publications. Wherever possible, we will try to make sure that finalized reports of general interest are posted on this ListServ in the usual way. To find out what is currently available, send a note to LISTSERV@UICVM containing the line GET TEI-L FILELIST. Specific documents can be requested in the same way, or by contacting Wendy Plotkin (U49127@UICVM) who looks after the register. The one document most requested (P1 itself) is still, we regret, not available in electronic form -- we just haven't buckled down to the task of recoding its current rather esoteric markup. Please bear with us! 
However, the following documents are now or will soon be available (as are others of ephemeral or less general interest -- contact Wendy Plotkin for a full list), some tagged in TeX, some in (an extended form of) Waterloo or IBM GML, some without explicit tags in a form designed for reading onscreen or simple printing: TEI PC P1 The Preparation of Text Encoding Guidelines (closing statement of the planning meeting in Poughkeepsie, NY, November 1987 -- often referred to in TEI documents as the "Poughkeepsie Principles") TEI AB P1 Closing Statement of the Text Encoding Initiative Advisory Board Meeting, February 1989 (just what the title says) TEI J6 Welcome to TEI-L TEI J10 Guide to the Structure of the TEI (September 1989 -- now slightly out of date, since this document doesn't cover the work groups described above) TEI PO A1 List of Participating Organizations TEI ED P1 Design Principles for Text Encoding Guidelines (a statement of basic design goals for the TEI) TEI ED P3 Theoretical Stance and Resolution of Theory Conflict (possible outcomes in fields with competing theoretical approaches) TEI ED W5 Tags and Features (a stab at a basic taxonomy of tags and textual features, with the specification of a database record design for a database of tags; rather technical, has been described as unreadable by some readers, as fairly useful by others) TEI ML W13 Guidelines for TEI Use of SGML (virtually identical with section 2.2 of TEI P1; rather technical) TEI ML W14 SGML Bibliography (Barnard and Cover) (very large bibliography of work on SGML and text encoding; will be available soon electronically from TEI-L and as tech report from Queen's University, Ontario) TEI AI3 W4 Literature Needs Survey Results (responses to a survey on needs of literary scholars conducted by the work group for literary studies) TEI AI3 W5 The TEI Guidelines (Version 1.1): A Critique by the Literature Working Group (a detailed commentary on TEI P1 from the point of view of literary scholars) 
TEI AI1 W2 List of Common Morphological Features for Inclusion in TEI Starter Set of Grammatical-Annotation Tags (list of grammatical features and the values they may take, for the languages of the EEC and Russian; makes no concessions for the non-linguist and does not discuss the mechanisms required for abbreviating grammatical annotation) TEI AI1 W3 Feature System Declarations and the Interpretation of Feature Structures (technical treatment of problems arising in use of feature structures as defined in TEI P1 chapter 6, and proposal for a method of solving them with a specialized SGML document declaring the feature system in use. No concessions for lack of linguistic or SGML knowledge.) 4. A plea for help We've said it before and we'll say it again: the TEI will only succeed with the active critical participation of the community it aims to serve. If you have views on any of the topics addressed by the TEI we want to hear them. Post a note to this bulletin board, or to us directly: we may not respond as fully or as quickly as we might wish to, but be sure that your comments will be taken note of and forwarded to the appropriate technical committee or workgroup. We are committed to responding to and summarizing all comments on our proposals, and it is a commitment we take very seriously indeed. (A summary of comments received through November is in progress, as are formal replies to them.) At the very least, we want to hear from everyone who received a copy of TEI P1 -- so please don't forget to complete and send in the 'User Response and Comment' form that came with your copy, if you have one! 
Lou Burnard (LOU@VAX.OXFORD.AC.UK) Michael Sperberg-McQueen (U35395@UICVM.BITNET) ========================================================================= Date: Fri, 15 Feb 91 15:45:20 MET Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Harry Gaylord Subject: Unicode I have been asked by several people to say something about the implications of the arrival of Unicode for TEI. Several useful comments in general have appeared on Humanist, TEI-L, and 10646 about relevant issues. Yet it is difficult to say anything succinctly at this point. One thing is clear. No character set so far has tackled the problem of the need to encode the lang characteristic in texts. This was already pointed out in P1 and elsewhere. This, it seems to me, is very important regardless of which coded character set one uses. There are advantages and disadvantages to both Unicode and ISO 10646 as they are currently formulated. Hopefully they will be merged into one ISO standard. There is no need for two multi-byte standards to be used in different systems or, even worse, in single systems. Unicode, 10646, and the 8859 family of coded character sets have different understandings of what a character is and how it will be used. Unicode says nothing about the imaging of texts on a screen or printing on paper. In a Unicode file the Greek letter alpha + IOTA SUBSCRIPT + ROUGH BREATHING MARK + GRAVE ACCENT would be coded as four 16-bit units. The software used to image this text would have to recognize this combination of one spacing and three non-spacing characters and put the image on your screen. In ISO 10646 and the 8859 family the approach has been to have each combination as a different coded character. Therefore this combination would be one character in 10646. This would be one 32-bit unit if one were using the full 10646 set, or possibly 16 or 8 bits if one were using one of the compression techniques. 
The software running the system with Unicode would also have to know that when there are two accents above, they have to be positioned differently above the letter than when there is only one. On the other hand some languages have so many different combinations that it is common practice to use "floating accents" or graphic character combination encoding. An example of this is Hebrew, which has 23 consonants and 5 final forms. Its vowels and other signs are imaged in relation to the consonants. If one had a coded character for each possible combination, the resulting set would be enormous. Therefore present systems, e.g. Nota Bene SLS, and others encode these separately. This is also true of Unicode and 10646. It is uneconomical to do it otherwise. Two basic criticisms of the present proposals in 10646 are the very large number of wasted control character positions in it, and inadequate provision for graphic character combination encoding. On the latter point there is an appendix referring to the way this can be done under another ISO standard, but this appendix is not a required part of the standard itself. The TG on character sets is in contact with Unicode and ISO with our concerns for their work. We must remember that the final outcome of what is delivered is still very uncertain. The standards have to be formulated and then hardware manufacturers have to be convinced of the importance of them and implement them. This all takes time. It is also important to note that the big players have people working in the Unicode consortium and the ISO 10646 committee. One concern that I have is the need for representing text as it is contained in older books and manuscripts. Neither standard as far as I can see has the long s of English printing in earlier books. Yet we need it for many scholarly purposes. From the standpoint of both of these standards it would be classified as a "presentational variant" of s and be placed in a completely different section of the character set. 
This is even more true of letter shapes as they appear in manuscripts. There is room in each proposal for private use characters which can be used by agreement of two or more parties. Yet the more that is included in a standard as standard, the better off we are. There are currently attempts to combine the work of the Unicode consortium and the committee for 10646. Let's hope they are successful and that the results improve on both. Harry Gaylord ========================================================================= Date: Mon, 18 Feb 91 12:16:54 MST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: lexical@NMSU.EDU Subject: model lexica opportunity One of our new directives is the Consortium for Lexical Research, in which good models of lexica (and software to make them informative) could be very influential. Perhaps you'd like to help build up and use this resource archive. It's sponsored by the ACL, who see the need not only for standardization but also distribution, and it is funded and encouraged by DARPA.

The Consortium for Lexical Research
Rio Grande Research Corridor
Computing Research Laboratory
New Mexico State University
Box 30001, Las Cruces, NM 88003.
lexical@nmsu.edu (505) 646-5466 Fax: (505) 646-6218

Work in computational linguistics has reached the point where the performance of many natural language processing systems is limited by a "lexical bottleneck". That is, such systems could handle much more text and produce much more impressive application results were it not for the fact that their lexicons are too small. The Association for Computational Linguistics has established the Consortium for Lexical Research (CLR), and DARPA has agreed to fund this. 
It will be sited at the Computing Research Laboratory, New Mexico, USA, under its Director, Yorick Wilks, and an ACL committee consisting of Roy Byrd, Ralph Grishman, Mark Liberman and Don Walker.

The Consortium for Lexical Research will be an organization for sharing lexical data and tools used to perform research on natural language dictionaries and lexicons, and for communicating the results of that research. Members of the Consortium will contribute resources to a repository and withdraw resources from it in order to perform their research. There is no requirement that withdrawals be compensated by contributions in kind.

A basic premise of the proposal for cooperation on lexical research is that the research must be "precompetitive". That is, the CLR will not have as its goal the creation of commercial products. The goal of precompetitive research would be to augment our understanding of what lexicons contain and, specifically, to build computational lexicons having those contents.

The task of the CLR is primarily to facilitate research, making available to the whole natural language processing community certain resources now held only by a few groups that have special relationships with companies or dictionary publishers. The CLR would as far as is practically possible accept contributions from any source, regardless of theoretical orientation, and make them available as widely as possible for research. There is also an underlying theoretical assumption or hope: that the contents of major lexicons are very similar, and that some neutral, or "polytheoretic," form of the information they contain can be at least a research goal, and would be a great boon if it could be achieved. A major activity of the CLR will be to negotiate agreements with "providers" on terms reassuring and advantageous to both suppliers and researchers. 
Major funders of work in this area in the US have indicated interest in making participation in the CLR a condition for financial support of research. An annual fee will be charged for membership. It is intended that after an initial start-up period, the Consortium become self-supporting.

The Computing Research Lab (CRL) already has an active research program in computational lexicons, text processing, machine translation, etc., funded by DARPA and NSF, as well as a range of machines appropriate for advanced computing on dictionaries.

Resources and Services of the Consortium

The following lists of lexical data and tools seem to provide a reasonable starting content for the repository. We will continually solicit and encourage additions to this list.

Data
1. word lists (proper nouns, count/mass nouns, causative verbs, movement verbs, predicative adjectives, etc.)
2. published dictionaries
3. specialized terminology, technical glossaries, etc.
4. statistical data
5. synonyms, antonyms, hypernyms, pertainyms, etc.
6. phrase lists

Tools
1. lexical data base management tools
2. lexical query languages
3. text analysis tools (concordance, KWIC, statistical analysis, collocation analysis, etc.)
4. SGML tools (particularly tuned to dictionary encoding)
5. parsers
6. morphological analyzers
7. user interfaces to dictionaries
8. lexical workbenches
9. dictionary definition sense taggers

Services
Repository management will involve cataloging and storing material in disparate formats, and providing for their retransmission (with conversion, where appropriate tools exist). In addition, it will be necessary to maintain a library of documentation describing the repository's contents and containing research papers resulting from projects that use the material. A brief description of the services to be provided is as follows:
a. 
CRL will provide a catalog of, and act as a clearinghouse for, utilities programs that have been written for existing online lexical data.
b. CRL will compile a list of known mistakes, misprints, etc. that occur in each of the major published sources (dictionaries etc.).
c. CRL will set up a new memorandum series explicitly devoted to the lexical center.
d. CRL will also be a clearinghouse for preprints and hard-to-find reprints on machine-readable dictionaries.
e. CRL also expects to conduct workshops in this area, including an inaugural workshop in late 1991 or early 1992.
f. CRL would provide a catalog for access to repositories of corpus-manipulation tools held elsewhere.

We invite you to participate in the Consortium for Lexical Research. Anyone interested in participating even in principle as a provider or consumer of data, tools, or services should send a message to lexical@nmsu.edu or lexical@nmsu.bitnet, as should anyone who would like to be on our lexical information list. ========================================================================= Date: Mon, 18 Feb 91 14:39:39 EST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: myl@COMA.ATT.COM Subject: out of the office from 18-25 February I will be at the DARPA Speech and Natural Language workshop Asilomar Conference Center, 408-372-8016, 7227 Fax Pacific Grove, CA 93950 Your mail will be read when I return. You can reach the Penn Linguistics department at 215-898-6046. Regards, Mark Liberman ========================================================================= Date: Mon, 18 Feb 91 14:42:36 -0500 Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Don Walker Subject: Away from the office from 18 to 25 February I will be in California at the Speech and Natural Language Processing Workshop. 
For urgent matters, contact my secretary Elaine Molchan at em@flash.bellcore.com or (+1-201)829-4594 for information on how to reach me there. Don Walker ========================================================================= Date: Tue, 19 Feb 91 11:22:51 +0100 Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Timothy.Reuter@MGH.BADW-MUENCHEN.DBP.DE Subject: Unicode I think it's important to think about the general aspects as well as about whether Unicode does or does not have the variant form for this or that letter in this or that writing system. Some general points occur to me: a. Pace Harry Gaylord, Unicode seems to me to be biased towards display rather than other forms of data processing. Some semantic distinctions are observed, but roughly speaking, if things look substantially different, even though semantically substantially the same, they get different codes (e.g. medial and final sigma in Greek - or alphabets of Roman numerals or letters in circles - or the various forms of cedilla at U+0300 up). If on the other hand they look substantially the same, even though semantically different, they may well get the same code (e.g. hacek is considered to be identical with superscript v, and the overlaps are very acute in the mathematical symbol area). Digraphs only get in if they are in existing standards (German ss, Dutch ij, Slav Dz), i.e. since you can display, say, Spanish "ch" as "c" followed by "h" there is no provision for a code to mean "ch", though this might well be helpful in non-display contexts. b. "Unicode makes no pretense to correlate character encoding with collation or case" and indeed it doesn't. The basic setup (for those who haven't seen the draft) is that the high byte is used to indicate a kind of code page, which may contain one or more alphabets/syllabaries/symbol sets, etc. 
There's no attempt to use bit fields of non-byte width within the 16 bits, except in so far as sequences within existing eight-bit standards have done this. The difference between lc and uc can be 1, 32 or 48 (possibly others as well), while runs of letters can be interrupted by numerals and non-letters. Previous standards play a role here, but there seems to me to be no compelling reason if you're drawing up a 16-bit code to say that you will take over all existing standards on the basis of eight-bit code + fixed offset! It's an opportunity to eliminate rather than perpetuate things which in any case only originated because of restrictions which no longer apply. c. Diacritics are trailing "non-spacing" separate characters (actually they're backspacing). Diacritics modifying two letters follow the second one. The point has already been made that you can't really do it any other way (though in a 32-bit code you could probably do it with bit-fields). However, trailing diacritics seem to me undesirable, because you have to "maintain state" (something the Unicode people claim to eliminate) in any programming you do. If you're reading a file or a string sequentially you can't even send a character to the printer or the screen until you have checked the one after it to make sure it's not a trailing diacritic! For the user, the order of storage is irrelevant; for the programmer, preceding diacritics are much easier to handle in most contexts. The slavish take-over of existing eight-bit standards means that many diacritics are also codable as "static" single characters - as has been pointed out, this leads to potential ambiguities. Diacritics apart, there seem to be conflicts of interest between different applications, which *necessarily* lead to ambiguities or difficulties for someone. Take the "s" problem. 
Harry Gaylord says he needs long s as a code of its own; Unicode itself distinguishes between Greek medial and final sigma, and between "s" + "s" and German "szet", on the basis of existing standards. Any text containing these coding distinctions can be displayed more easily and more faithfully to its original than it can without them (though I would have thought there was no serious problem about identifying final sigma and acting accordingly). But other kinds of analysis become *more* difficult if such coding is used: regular expressions involving "s" are much more difficult to construct, as are collating and comparison sequences. This is an area where SGML-style entities are positively advantageous, simply because they announce their presence: if long s is always coded as &slong; in a base text, different applications can be fed with different translations. Precisely because Unicode puts so much emphasis on how things look rather than what they mean, it won't eliminate the need for such "kludges", as someone on HUMANIST thought it would. Timothy Reuter, Monumenta Germaniae Historica, Munich ========================================================================= Date: Tue, 19 Feb 91 02:27:57 PST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Rindfleisch@SUMEX-AIM.STANFORD.EDU Subject: Away from my Mail I will be gone and not reading my mail until Sunday, February 24. Your message regarding " Unicode" will be read when I return. If your message concerns something urgent, please contact Monica Wong (Wong@SUMEX-AIM) or phone my office at (415) 723-5569. Tom R. ========================================================================= Date: Tue, 19 Feb 91 08:38:29 EST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: "Robert A. Amsler" Subject: Presentation vs. 
Descriptive CHARACTER Markup Timothy Reuter's note on UNICODE suggests that we ought to be careful that the same guidelines that have led the TEI to select descriptive markup for text not be abandoned when we get to characters. The TEI's concern should first and foremost be whether a character representation represents the meaning of the characters to the authors, and not their presentation format. Likewise, this also means that how the representation is achieved is rather irrelevant to whether or not the markup captures the meaning of the character. I think it worth noting that there seems to be a need for two standards for characters: one to represent their meaning, the other to represent their print images. The print image representation has a LOT of things to take into account, and may in fact only be possible in some form such as the famous "Hershey fonts" released long ago by the US National Bureau of Standards. That is, the print images of characters and symbols may have to be accompanied by representations, as bit maps or as equations, of how to draw the characters within a specified rectangular block of space. Within the descriptive markup, there clearly are enough problems to solve without adding the burden of achieving consistent print representations on all display devices. For example, one descriptive issue is whether the representation is adequate for spoken or only written forms of the language. While the TEI has addressed the concerns of researchers in linguistics dealing with speech, there exists a need to address the concerns of ordinary text users concerned with representing spoken-language information in printed form. Some of this is a bit arcane, such as how to represent text dialogues to be spoken with a foreign accent, but representing EMPHASIS is a continual issue and emphasis can descend to the characteristics of individual letters. 
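Reuter's point earlier about SGML-style entities, and the meaning/print-image split argued for here, can be made concrete with a small sketch. The entity name &slong; is Reuter's; the mapping tables and function below are illustrative, not any existing TEI tool, and long s is mapped to the code point later assigned to it (U+017F).

```python
# A base text codes long s as the entity &slong;; because the entity
# announces its presence, each application supplies its own translation.
# Illustrative mappings only, not a real TEI or SGML toolkit.
ANALYSIS_MAP = {"&slong;": "s"}       # fold to plain s: regular expressions and collation work
DISPLAY_MAP  = {"&slong;": "\u017F"}  # long-s code point: faithful to the original printing

def translate(text, mapping):
    """Replace each entity in the text with its application-specific value."""
    for entity, replacement in mapping.items():
        text = text.replace(entity, replacement)
    return text

base = "Congre&slong;s"
print(translate(base, ANALYSIS_MAP))  # Congress
print(translate(base, DISPLAY_MAP))   # Congreſs
```

The same base text thus feeds both a search application and a display application without ambiguity, which is the advantage Reuter claims for entities over direct coding.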
========================================================================= Date: Tue, 19 Feb 91 15:06:49 MET Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: "E. van Konijnenburg" Subject: Re: model lexica opportunity In-Reply-To: ; from "lexical@NMSU.EDU" at Feb 18, 91 12:16 pm Hi. Please include me in your information list. Regards, Erik AND Software bv ------------------------------------------------------- Attn. E. van Konijnenburg Westersingel 108 Tel: +31 10 4367100 3015 LD ROTTERDAM Fax: +31 10 4367110 The Netherlands Email: ========================================================================= Date: Tue, 19 Feb 91 16:15:56 CST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: "Robin C. Cover" Subject: CHAR ENCODING AND TEXT PROCESSING A propos of recent comments by Timothy Reuter and Robert A. Amsler on the relationship between character encodings and (optimized) text processing, two notes: (1) Timothy writes that "Unicode seems to me to be biased towards display rather than other forms of data processing." We note that UNICODE indeed does contain algorithms for formatting right-to-left text and bi-directional text, but (as far as I know) it has no general support for indicating the language in which a text occurs. (2) On the matter of separating "form and function" (various two-level distinctions germane to character encoding and writing systems: character and graph; graph and image; language and script; writing system and script), the following article by Gary Simons may be of interest. (I do not know if it represents his current thinking in every detail.) Gary F. Simons, "The Computational Complexity of Writing Systems." Pp. 538-553 in _The Fifteenth LACUS Forum 1988_ (edited by Ruth M. Brend and David G. Lockwood). Lake Bluff, IL: Linguistic Association of Canada and the United States, 1989. 
In this article the author argues that computer systems, like their users, need to be multilingual. "We need computers, operating systems, and programs that can potentially work in any language and can simultaneously work with many languages at the same time." The article proposes a conceptual framework for achieving this goal. Section 1, "Establishing the baseline," focuses on the problem of graphic rendering and illustrates the range of phenomena which an adequate solution to computational rendering of writing systems must account for. These include phenomena like nonsequential rendering, movable diacritics, positional variants, ligatures, conjuncts, and kerning. Section 2, "A general solution to the complexities of character rendering," proposes a general solution to the rendering of scripts that can be printed typographically. (The computational rendering of calligraphic scripts adds further complexities which are not addressed.) The author first argues that the proper modeling of writing systems requires a two-level system in which a functional level is distinguished from a formal level. The functional level is the domain of characters (which represent the underlying information units of the writing system). The formal level is the domain of graphs (which represent the distinct graphic signs which appear on the surface). The claim is then made that all the phenomena described in section 1 can be handled by mapping from characters to graphs via finite-state transducers -- simple machines guaranteed to produce results in linear time. A brief example using the Greek writing system is given. Section 3, "Toward a conceptual model for multilingual computing," goes beyond graphic rendering to consider the requirements of a system that would adequately deal with other language-specific issues like keyboarding, sorting, transliteration, hyphenation, and the like.
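The character-to-graph mapping of Section 2 can be sketched in a few lines of Python: one rule of Greek rendering (sigma takes its final form at the end of a word) modeled with a single symbol of lookahead, in the spirit of Simons's finite-state transducers. The character names and word-boundary convention below are invented for illustration:

```python
# Functional level: characters.  Formal level: graphs (surface signs).
GRAPHS = {"lambda": "\u03bb", "omicron": "\u03bf", "gamma": "\u03b3",
          "sigma": "\u03c3"}          # medial/initial sigma
FINAL_SIGMA = "\u03c2"

def to_graphs(chars):
    """Map a character sequence to graphs, choosing the final-form
    sigma when no letter follows -- a toy finite-state rule."""
    out = []
    for i, c in enumerate(chars):
        at_word_end = (i + 1 == len(chars)) or chars[i + 1] == "space"
        if c == "space":
            out.append(" ")
        elif c == "sigma" and at_word_end:
            out.append(FINAL_SIGMA)   # positional variant chosen at the
        else:                         # formal level, not stored in the text
            out.append(GRAPHS[c])
    return "".join(out)

# One character "sigma" in the data; the renderer picks the graph.
print(to_graphs(["lambda", "omicron", "gamma", "omicron", "sigma"]))
```

The point of the two-level model is visible even in this toy: the stored text never has to say which sigma it means, and the mapping runs in a single left-to-right pass, i.e. in linear time.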
The author observes that every piece of textual data stored in a computer is expressed in a particular language, and it is the identity of that language which determines how the data should be rendered, keyboarded, sorted, and so on. He thus argues that a rendering-centered approach which simply develops a universal character set for all languages will not solve the problem of multilingual computing. Using examples from the world's languages, he goes on to define language, script, and writing system as distinct concepts and argues that a complete system for multilingual computing must model all three. Availability: Offprints of this article are available from the author at the following Internet address: gary@txsil.lonestar.org. The volume is available from LACUS, P.O. Box 101, Lake Bluff, IL 60044. Robin Cover BITNET: zrcc1001@smuvm1 INTERNET: robin@ling.uta.edu INTERNET: robin@txsil.lonestar.org ========================================================================= Date: Tue, 19 Feb 91 19:15:04 PST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Ken Whistler Subject: Re: CHAR ENCODING AND TEXT PROCESSING Dear Mr. Cover, I would like to respond to your recent note, and the implications of the abstract you have made from Gary Simons's article. (In this I am speaking personally, and my opinions do not necessarily represent those of the Unicode Technical Committee.) First of all, I want to make it clear that Unicode is not, nor does it purport to be, a text description language. It is a character encoding. We need to code the LATIN CAPITAL LETTER A and the ARABIC LETTER ALEF and the DEVANAGARI LETTER A in order for any text to be encoded, and for any textual process to be programmed to operate on that text.
However, assigning 16-bit values to those characters (0041, 0627, and 0905, respectively) does not, ipso facto, specify whether the LATIN CAPITAL LETTER A is being used in an English, Czech, or Rarotongan text, or the ARABIC LETTER ALEF in Arabic, Sindhi, or Malay, or the DEVANAGARI LETTER A in Hindi or Nepali. Trying to mix the character encoding with specification of textual language is guaranteed to mess up the character encoding; the appropriate place to handle this is at a metalevel of text/document description above the level of the character encoding. On the other hand, the bidirectional text problem is specifiable independent of any particular language--or even script, for that matter, since the generic problem is the same for Hebrew as it is for Arabic (scripts). The fundamental reason why Unicode is going to great lengths to include a bidirectional plain text model is that without an explicit statement of how to do this, the content of texts which contain both left-to-right and right-to-left scripts mixed can be compromised or corrupted when such texts are interchanged. If we do not come down squarely in favor of an implicit model (or an explicit model with direction-changing controls, or a visual order model), then bidirectional Unitext will regularly get scrambled, and no one will know how to interpret a number embedded in bidi text, etc., etc. Regarding form/function distinctions, I think you are preaching to the converted. I do not think you will be able to find another multilingual character encoding of this scope which has been developed with such a meticulous attention to the distinctions you mention: character vs. glyph (i.e. "graph" as you quote Simons) We have been educating people about this for years. Granted, there are glyphs encoded as characters in Unicode, too, but the main reason they got there is because Unicode has to be interconvertible to a lot of other "character" standards which couldn't distinguish the two. 
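Whistler's metalevel point can be made concrete in a few lines: the code value is invariant, and language is attached to spans of text by markup *above* the encoding. The tagging scheme below is invented purely for illustration:

```python
# The code value 0x0041 is LATIN CAPITAL LETTER A in every language.
assert chr(0x0041) == "A"

# Language is a property of a span of text, stated at a metalevel
# (e.g. as a markup attribute) -- not a property of the code value.
# (lang, text) pairs: an invented tagging scheme for illustration.
spans = [("en", "A table"),
         ("cs", "Adresa"),
         ("rar", "Aere")]

for lang, text in spans:
    # All three spans begin with the identical character 0x0041;
    # only the metalevel tag tells us how to sort or hyphenate them.
    print(lang, hex(ord(text[0])))
```

Mixing the two levels, as Whistler says, would mean a different "A" per language, which is exactly the explosion the character encoding must avoid.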
And why does Unicode have to be interconvertible? A) Because that is the only way to get it accepted and move into the future, and B) Because that serves the purpose of creating better software to handle text processing requirements for preexisting data. glyph vs. image Also clearly distinguished amongst our discussions. I think Unicoders are supportive of the concept of proceeding to develop a definitive registry of glyphs. This would be most helpful to font foundries and font vendors, but also would help the software makers in performing the correct operations to map characters (in particular language and script contexts) into glyphs for rendering as images. But registry of glyphs is a different task from encoding of characters. For one thing, the universe of glyphs is much larger than the universe of characters. Unicode 1.0 is aimed at completing the character encoding as expeditiously and correctly as possible, rather than at taking on the larger glyph registry problem. language vs. script Also clearly distinguished. Unicode characters, taken by blocks, can be assigned to scripts. Hence the characters from 0980 to 09F9 are all part of the Bengali script. But no one is confusing that with the fact that some subset of those is used in writing the Bengali language and another subset in writing Assamese. script vs. writing system Again, I think you will find us sympathetic and not unaware of the distinctions involved. For example, most of us have worked on or are currently working on implementations of the Japanese writing system for one product or another on computer. Anyone with a smattering of knowledge of Japanese knows that the writing system is a complicated mix of two syllabaries, Han characters (kanji), and an adapted form of European scripts which can be rendered either horizontally or rotated for vertical rendering.
It is a complicated writing system which is difficult to implement properly on computer--but that is a separate issue from how to encode the characters. You quote Gary Simons as stating that: "We need computers, operating systems, and programs that can potentially work in any language and can simultaneously work with many languages at the same time." I can guarantee you that this is the passionate concern of those who have been working on Unicode for the last two years. It is precisely because the character encoding alternatives (ISO 2022, ISO DIS 10646, various incomplete corporate multilingual sets, and font-based encodings which confuse characters and font-glyphs) are so dismal that we have worked so hard to design a multilingual character set with the correct attributes for support of multilingual operating systems, multilingual applications, multilingual text interchange and email, multilingual displays and printers, multilingual input schemes, and yes, multilingual text processing. Don't expect the holy grail by Tuesday, but if we really think all those things are worth aiming for, it is vitally important that those who build the operating systems, the networks, the low-level software components, and the high-level applications reach a reasonably firm consensus about the character encoding now. --Ken Whistler ========================================================================= Date: Tue, 19 Feb 91 20:52:12 PST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Ken Whistler Subject: Re: Unicode Dear Mr. Reuter, I addressed some of your concerns in my reply to Robin Cover, but I would like to respond to a few of the specific points which you have raised. (Disclaimer: These are personal opinions, and do not necessarily reflect the position of the Unicode Technical Committee.) Regarding your point a., that Unicode seems biased towards display rather than other forms of data processing. 
First of all, you must understand that Unicode has been visited with the sins of our fathers. The medial and final sigma are already distinguished in the Greek standard. We cannot unify them without Hellenic catastrophe. (In fact the Classicists inform us that there are good reasons why we must introduce a third sigma, the "lunate sigma", in order to have a correct and complete encoding.) Nobody likes the Roman numerals, or the parenthesized letters, or the squared Roman abbreviations, ... The general reaction has been Sheesh! But important Chinese, Japanese, and Korean standards which have to be interconvertible with Unicode have already encoded such stuff, and we are stuck with it. Why? Because the design goal of a perfect, de novo, consistent, and principled character encoding is unattainable (believe me, we tried), and because the higher goal of attaining a usable, implementable, and well-engineered character encoding in a finite time is greatly furthered by including as much as possible of the preexisting character encoding standards. You also noted that the semantic overlaps are very acute in the mathematical symbol area. Nobody can tell us how many distinct semantic usages there are for "tilde", for example. Should we encode 1, 3, 7, 16 of them?? We made what I think is the best compromise we could under the circumstances. The TILDE OPERATOR is encoded as a math operator (distinct from accents, whether spacing or non-spacing), but no further attempt is made to separate all the possible semantics applicable. Note that if we start trying to distinguish "difference" from "varies with" from "similar", from "negation", etc., we would be forcing applications (and users) to encode the correct semantic--even when they don't know or can't distinguish them. This has the potential for being WORSE for text processing, rather than better. Over-differentiation in encoding is just as bad as under-differentiation. 
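The interconvertibility requirement behind "the sins of our fathers" is, in effect, a round-trip guarantee: every code of a grandfathered standard must map to a distinct Unicode value and back without loss. A schematic check, using an invented three-character "legacy" table (the byte values and the miniature standard are made up for illustration):

```python
# An invented miniature legacy standard: byte value -> Unicode value.
LEGACY_TO_UNICODE = {0xA1: 0x03B1,   # alpha
                     0xA2: 0x03C3,   # medial sigma
                     0xA3: 0x03C2}   # final sigma -- must stay distinct in
                                     # Unicode because the legacy standard
                                     # already distinguishes the two
UNICODE_TO_LEGACY = {u: b for b, u in LEGACY_TO_UNICODE.items()}

def round_trips(byte):
    """True if a legacy code survives conversion to Unicode and back."""
    return UNICODE_TO_LEGACY[LEGACY_TO_UNICODE[byte]] == byte

# Interconvertibility demands that every legacy code round-trips --
# which is impossible if Unicode were to unify the two sigmas.
assert all(round_trips(b) for b in LEGACY_TO_UNICODE)
print("round trip preserved for", len(LEGACY_TO_UNICODE), "legacy codes")
```

This is why "unifying" medial and final sigma, however attractive in principle, would break conversion of existing Greek data.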
I don't understand your concern about not distinguishing hacek and superscript v. Unicode does not encode superscript v at all. Except for those superscripts grandfathered in from other standards (remember the sins of our [grand]fathers), superscript variants of letters are considered rendering forms outside the scope of Unicode altogether. If someone uses a font which has hacek rendered in a form which looks like a superscript v, that is a separate issue. From a Unicode point of view that would simply be mapping the character HACEK onto the glyph {LATIN SMALL V} in some particular typeface for rendering above some other glyph. A font vendor could do that. It might even be the correct thing to do, for example, in building a paleographic font for manuscript typesetting. Regarding your b. item concerns about the layout of Unicode: First of all, I am sensitive about your using the term "code page" in referring to the Unicode charts. "Code page" is properly applied to 8-bit (or to some double 8-bit) encodings which can be "swapped in" or "swapped out" to change the interpretation of a particular numeric value as a character. Unicode values are fixed, unambiguous, and unswappable for anything else. The charts are simply a convenient packaging unit for human visual consumption and education. The fact that we tried to align new scripts with high byte boundaries resulted from the implementation requirement that software have easy and quick tests for script identity. The subordering within script blocks does attempt to follow existing standards, where feasible. We tried the alternative of simply enumerating all the characters in a script and then packing them in next to each other in what would pass for the "best" alphabetic order, but that introduces other problems AND makes the relevant "owners" of that script gag at the introduction of a layout unfamiliar to them. In the end all such processes such as case folding, sorting, parsing, rendering, etc. 
depend on table lookup of attributes and properties. There is no hard-coded shortcut which will always work--even for 7-bit ASCII. The compromise which pleased the most competing interests (and which, by the way, got us to a conclusion on this issue) was to follow national standards orders as applicable. You might note that the one REALLY BIG case where we have to depart from this is in unifying 18,000+ Han characters. The only way to do this is to depart from ALL of the Asian standards--so nobody can convert from a Chinese, Japanese, or Korean standard to Unicode by a fixed offset! Believe me, that has occasioned much more grumbling (to put it mildly) than any ordering issue for Greek or Cyrillic! Concerning your point c., about diacritics being specified as following a baseform rather than preceding it: Clearly we had to come down on one side or the other. Not specifying it would be disastrous. So we made a choice. Granted, that having diacritics follow rather than precede baseforms favors rendering algorithms over parsing algorithms. To have made the opposite choice would have reversed the polarity of benefits and costs. It is a tradeoff with no absolutely right answer. Nevertheless, I think the choice made was the correct one. First, the rendering involved is not really as you have characterized it. "Non-spacing" diacritics are NOT backspacing. Such terminology is more properly applied to spacing diacritics (such as coded in ISO 8859-1 or ISO DIS 10646), which for proper rendering use require the sending of a BACKSPACE control code between a baseform and an accent. That's the way composite characters used to be printed on daisy-wheel printers, for example. But that is a defective rendering model which ignores the complex typographical relationship between baseforms and diacritics. The kind of rendering model we are talking about involves "smart" fonts with kerning pair tables. 
The "printhead" is not trundled back so that an accent can be overstruck; instead, a diacritic "draws itself" appropriately, in whatever medium, on a baseform in context. The technology for doing this is fairly well understood but quite complex. I think it would be fair to say that if I were writing a text processing program (and I have), I would rather have system support for such rendering and deal with the look-ahead problem than have to deal with font rendering problems in my program. Second, the "state" that has to be maintained in parsing diacritics is quite different from the "state" that Unicode claims to eliminate. Parse states have to be maintained for all kinds of things. If I am parsing Unicode which uses non-spacing diacritics, then I have to maintain a parse state to identify text elements; but even parsing for word boundaries, for example (an elementary operation in editing) has to maintain state to find boundaries which may depend on combinations of punctuation, or on ambiguous interpretation of some characters which can only be disambiguated in context, etc., etc. More complicated parsing often maintains elaborate parse trees with multiple states. The "statefulness" that Unicode is trying to eliminate is a state in which the interpretation of the bit pattern for a character changes, depending on which state you are in. This is the "code page sickness", where one time the 94 means "o-umlaut", and next time it means "i-circumflex", and next time it means "partial differential symbol", depending on what code page you are using, and what code page shift state you happen to be in. The two-byte encodings currently are horrible in this respect, since they may mix single-byte and two-byte interpretations in ways which may mean that figuring out what a particular byte is supposed to represent in any random location can be very difficult.
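The "code page sickness" is easy to demonstrate with a toy shift-based encoding (the shift byte and code pages below are invented; the byte 0x94 echoes Whistler's o-umlaut example): the same byte decodes differently depending on state, so no byte can be interpreted in isolation, whereas a fixed-width code value means one thing always.

```python
# A toy stateful encoding, invented for illustration: byte 0x94 means
# o-umlaut on the default page, but i-circumflex after the shift byte.
SHIFT = 0x0E
PAGE_A = {0x94: "\u00f6"}   # o-umlaut
PAGE_B = {0x94: "\u00ee"}   # i-circumflex

def decode_stateful(data):
    """Decode bytes whose meaning depends on which page is in force."""
    page, out = PAGE_A, []
    for b in data:
        if b == SHIFT:
            page = PAGE_B   # the *interpretation* of later bytes changes
        else:
            out.append(page[b])
    return "".join(out)

# The same byte, two meanings -- you must know the state to read it:
print(decode_stateful([0x94]))          # o-umlaut
print(decode_stateful([SHIFT, 0x94]))   # i-circumflex

# A fixed-width code has no such state: 0x00F6 is o-umlaut, always.
assert chr(0x00F6) == "\u00f6"
```

This is the kind of state Unicode eliminates; the parse state needed to find word boundaries or text elements is a different (and unavoidable) matter.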
You have to find an anchor position from which you can parse sequentially, maintaining state, until you get to the byte in question to find out what it means. Unicode eliminates THAT kind of state maintenance. I find myself agreeing with your statement that "there seem to be conflicts of interest between different applications, which *necessarily* lead to ambiguities or difficulties for someone." The way I would put it, following a distinction made elegantly by Joe Becker, is that there is no way that any encoding of CODE ELEMENTS (i.e. the "characters" assigned numbers in Unicode) will automatically result in one-to-one mappability to all the TEXT ELEMENTS which might ever be of interest to anyone or have to be processed as units for one application or another. Your mention of "ch" as a collation unit for Spanish is one obvious example. Fixing the CODE ELEMENTS of Unicode should not preclude efforts to identify appropriate TEXT ELEMENTS for various processes. Such TEXT ELEMENTS will have to be identified as to their appropriate domain of application--and that does include language as well as other factors. But it is not the job of the character encoding to do that work. The character encoding should be designed so as not to impede TEXT ELEMENT identification and processing--for example, it would be crazy to refuse to encode LATIN LETTER SMALL I because it could be composed of a dotless-i baseform and a non-spacing dot over! But character encoding cannot BE the TEXT ELEMENT encoding, however much we might desire a simpler world to work with. --Ken Whistler ========================================================================= Date: Wed, 20 Feb 91 15:31:17 -0500 Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Katharina Klemperer Subject: sgml editors I would like to know if anyone has any experience with Macintosh SGML editors. 
By this I mean a "word processor" that assists in the insertion of SGML tags into a document. I saw Author/Editor, from SoftQuad, Inc., demonstrated a couple of years ago, and it looked nice, but I would like to know if there are additional similar products in the marketplace, and what experiences people have had with them. Kathy Klemperer Dartmouth College Library ========================================================================= Date: Thu, 21 Feb 91 07:37:39 CST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: FEEM@QUCDN.BITNET Subject: mailing list Please exclude my name from your mailing list. Thank you. M. Fee, Strathy Language Unit, Queen's University Fleming Hall, Room 206, Kingston, Ont. (613) 545-2152 FEEM@Qucdn ========================================================================= Date: Thu, 21 Feb 91 10:42:27 PST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Lynne_Price.PARC@XEROX.COM Subject: Re: sgml editors In-Reply-To: <91Feb20.123652pst.16169@alpha.xerox.com> Kathy, Another SGML editor on the Mac is CheckMark from Software Exoterica 383 Parkdale Ave.
Suite 406 Ottawa, Ontario Canada K1Y 4R4 (613) 722-1700 I have used CheckMark fairly extensively and found it a valuable tool. I believe it supports a richer subset of the optional SGML features than does Author/Editor. In particular, it supports all markup minimization features. It can convert all or part of a document that uses minimization to a normalized form that does not. Furthermore, CheckMark can continue checking for additional SGML errors whether or not the user fixes the first problem detected. It has a special scroll bar for indicating where errors occur and allows the user to decide when to repair them. However, CheckMark is not a word processor or document formatter. It creates an SGML document, but has no provisions for displaying a formatted version of the document--for instance, it can't center or italicize certain elements. --Lynne Price ========================================================================= Date: Mon, 25 Feb 91 20:42:24 EST Reply-To: Text Encoding Initiative public discussion list Sender: Text Encoding Initiative public discussion list From: Brian Subject: Re: mailing list In-Reply-To: Message of Thu, 21 Feb 91 07:37:39 CST from Please take my name off the list. David Megginson will keep in touch on my behalf. Brian Merrilees, University of Toronto, MERRILEE@vm.epas.utoronto.ca>