Driver file for TEI P2, Corpora Workpapers of TR6 and AI1; rejected bits of 22; lots of hardwork by LB

Additional Tag Set for Language Corpora

The term language corpus is used to mean a number of rather different things. It may refer simply to any collection of linguistic data (written, spoken, or a mixture of the two), although many practitioners prefer to reserve it for collections which have been organized or collected with a particular end in view, generally to characterize a particular state or variety of one or more languages. Because opinions as to the best method of achieving this goal differ, various subcategories of corpora have also been identified. For our purposes however, the distinguishing characteristic of a corpus is that its components have been selected or structured according to some conscious set of design criteria.

These design criteria may be very simple and undemanding, or very sophisticated. A corpus may be intended to represent (in the statistical sense) a particular linguistic variety or sublanguage, or it may be intended to represent all aspects of some assumed core language. A corpus may be made of whole texts or of fragments or text samples. It may be a closed corpus, or an open or monitor corpus, the composition of which may change over time. However, since an open corpus is of necessity finite at any particular point in time, the only likely effect of its expansibility from the encoding point of view may be some increased difficulty in maintaining consistent encoding practices (see further section ). For simplicity, therefore, our discussion largely concerns ways of encoding closed corpora, regarded as single but composite texts.

Language corpora are regarded by these Guidelines as composite texts rather than unitary texts (on this distinction, see further ). This is because although each discrete sample of language in a corpus clearly has a claim to be considered as a text in its own right, it is also regarded as a subdivision of some larger object, if only for convenience of analysis. Corpora share a number of characteristics with other types of composite texts, including anthologies and collections. Most notably, different components of composite texts may exhibit different structural properties (for example, some may be composed of verse, and others of prose), thus potentially requiring elements from different TEI bases. Composite texts are thus especially likely to require the techniques for combining base tag sets described in chapter .

Aside from these high-level structural differences, and possibly differences of scale, the encoding of language corpora and the encoding of individual texts present identical sets of problems. Any of the encoding techniques and elements presented in other chapters of these Guidelines may therefore prove relevant to some aspect of corpus encoding and may be used in corpora. However, we do not repeat here the discussion of such fundamental matters as the representation of multiple character sets (see chapter ); nor do we attempt to summarize the variety of elements provided for encoding basic structural features such as quoted or highlighted phrases, cross references, lists, notes, editorial changes and reference systems (see chapter ). In addition to these general purpose elements, these Guidelines offer a range of more specialized sets of tags which may be of use in certain specialized corpora, for example those consisting primarily of verse (chapter ), drama (chapter ), transcriptions of spoken text (chapter ), letters and memoranda (chapter ) etc. Chapter should be reviewed for details of how these and other components of the Guidelines should be tailored to create a document type definition appropriate to a given application. In sum, it should not be assumed that only the matters specifically addressed in this chapter are of importance for corpus creators.

Though entitled Additional Tag Set for Language Corpora, this chapter also includes some other material relevant to corpora and corpus-building, for which no other location appeared suitable. It begins with a review of the distinction between unitary and composite texts, and of the different methods provided by these Guidelines for representing composite texts of different kinds (section ). Section describes a set of additional header elements provided for the documentation of contextual information, of importance largely though not exclusively to language corpora. This is the additional tag set for language corpora proper. Section discusses a mechanism by which individual parts of the TEI Header may be associated with different parts of a TEI-conformant text. Section reviews various methods of providing linguistic annotation in corpora, with some specific examples of relevance to current practice in corpus linguistics. Finally, section provides some general recommendations about the use of these Guidelines in the building of large corpora.

Parts of this chapter are derived from working papers of the Work Group on Corpora, chaired by Douglas Biber, and the Work Group on Linguistic Description, chaired by D. Terence Langendoen. The membership of these work groups is listed in the preface.

Varieties of Composite Text

Both unitary and composite texts may be encoded using these Guidelines; composite texts, including corpora, will typically make use of the following tags for their top-level organization.

TEI.corpus.2: contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI.2 elements, each containing a single text header and a text.
TEI.2: contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a TEI.corpus.2 element.
TeiHeader: supplies the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text. Its attributes include one that specifies the kind of document to which the header is attached (legal values: the header is attached to a single text; the header is attached to a corpus); one that indicates whether the header is new or has been substantially revised (sample values: the header is a new header; the header is an update, that is, has been revised); one that identifies the creator of the TEI Header; one that indicates when the first version of the header was created; and one that indicates when the current version of the header was created.
text: contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, or a corpus sample.
group: contains the body of a composite text, grouping together a sequence of distinct texts (or groups of such texts) which are regarded as a unit for some purpose, for example the collected works of an author, a sequence of prose essays, etc.

Full descriptions of these may be found in chapter (for TEI.corpus.2 and TEI.2), chapter (for TeiHeader), and chapter (for text and group); this section discusses their application to composite texts in particular.

In these Guidelines, the word text refers to any stretch of discourse, whether complete or incomplete, unitary or composite, which the encoder chooses (perhaps merely for purposes of analytic convenience) to regard as a unit. The term composite text refers to texts within which other texts appear; the following common cases may be distinguished: language corpora; collections or anthologies; poem cycles and epistolary works (novels or essays written in the form of collections or series of letters); and otherwise unitary texts, within which one or more subordinate texts are embedded. The tags listed above may be combined to encode each of these varieties of composite text in different ways.

In corpora, the component samples are clearly distinct texts, but the systematic collection, standardized preparation, and common markup of the corpus often make it useful to treat the entire corpus as a unit, too. Some corpora may become so well established as to be regarded as texts in their own right; the Brown and LOB corpora are now close to achieving this status.

The TEI.corpus.2 element is intended for the encoding of language corpora, though it may also be useful in encoding newspapers, electronic anthologies, and other disparate collections of material. The individual samples in the corpus are encoded as separate TEI.2 elements, and the entire corpus is enclosed in a TEI.corpus.2 element. Each sample has the usual structure for a TEI.2 document, comprising a TeiHeader followed by a text element. The corpus, too, has a corpus-level TeiHeader element, in which the corpus as a whole, and encoding practices common to multiple samples, may be described. The overall structure of a TEI-conformant corpus is thus a TEI.corpus.2 element containing a corpus-level TeiHeader followed by one or more TEI.2 elements, each comprising its own TeiHeader and text.
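As a rough illustration of the two-level structure just described, a corpus of two samples might be laid out as in the following sketch. The element nesting follows the prose; the type attribute and its values, and the number of samples, are assumptions made for illustration rather than declarations taken from this chapter.

```sgml
<!-- Sketch only: the type attribute values are assumed, not quoted. -->
<TEI.corpus.2>
  <TeiHeader type="corpus">
    <!-- information common to the whole corpus -->
  </TeiHeader>
  <TEI.2>
    <TeiHeader type="text">
      <!-- information specific to the first sample -->
    </TeiHeader>
    <text>
      <!-- first sample -->
    </text>
  </TEI.2>
  <TEI.2>
    <TeiHeader type="text">
      <!-- information specific to the second sample -->
    </TeiHeader>
    <text>
      <!-- second sample -->
    </text>
  </TEI.2>
</TEI.corpus.2>
```

Keeping shared declarations in the corpus-level header means each sample header need only record what is particular to that sample.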

Header information which relates to the whole corpus rather than to individual components of it should be factored out and included in the TeiHeader element prefixed to the whole. This two-level structure allows for contextual information to be specified at the corpus level, at the individual text level, or at both. Discussion of the kinds of information which may thus be specified is provided below, in section , as well as in chapter . Information of this type should in general be specified only once: a variety of methods are provided for associating it with individual components of a corpus, as further described in section .

In some cases, the design of a corpus is reflected in its internal structure. For example, a corpus of newspaper extracts might be arranged to combine all stories of one type (reportage, editorial, reviews, etc.) into some higher-level grouping, possibly with sub-groups for date, region, etc. The TEI.corpus.2 element provides no direct support for reflecting such internal corpus structure in the markup: it treats the corpus as an undifferentiated series of components, each tagged TEI.2.

If it is essential to reflect a single permanent organization of a corpus into sub- and sub-sub-corpora, then the corpus or the high-level subcorpora may be encoded as composite texts, using the group element described below and in section . The mechanisms for corpus characterization described in this chapter, however, are designed to reduce the need to do this. Useful groupings of components may easily be expressed using the text classification and identification elements described in section , and those for associating declarations with corpus components described in section . These methods also allow several different methods of text grouping to co-exist, each to be used as needed at different times. This helps minimize the danger of cross-classification and mis-classification of samples, and helps improve the flexibility with which parts of a corpus may be characterized for different applications.

Anthologies and collections are often treated as texts in their own right, if only for historical reasons. In conventional publishing, at least, anthologies are published as units, with single editorial responsibility and common front and back matter which may need to be included in their electronic encodings. The texts collected in the anthology, of course, may also need to be identifiable as distinct individual objects for study.

Poem cycles, epistolary novels, and epistolary essays differ from anthologies in that they are often written as single works, by single authors, for single occasions; nevertheless, it can be useful to treat their constituent parts as individual texts, as well as the cycle itself. Structurally, therefore, they may be treated in the same way as anthologies: both are texts whose body is composed largely of other texts.

The group element is provided to simplify the encoding of collections, anthologies, and cyclic works; as noted above, the group element can also be used to record the potentially complex internal structure of language corpora. For full description, see chapter .
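A minimal sketch of an anthology encoded with the group element might look as follows. The front and back element names and the two-member group are assumptions chosen for illustration; only text, group and body are named in this chapter.

```sgml
<!-- Sketch only: front and back are assumed names for common matter. -->
<text>
  <front>
    <!-- title page and preface common to the whole anthology -->
  </front>
  <group>
    <text>
      <body>
        <!-- first collected text -->
      </body>
    </text>
    <text>
      <body>
        <!-- second collected text -->
      </body>
    </text>
  </group>
  <back>
    <!-- back matter common to the whole anthology -->
  </back>
</text>
```

Because a group may itself contain groups, the same pattern could be nested to record sub-corpora, at the cost of fixing a single permanent organization.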

Some composite texts, finally, are neither corpora, nor anthologies, nor cyclic works: they are otherwise unitary texts within which other texts are embedded. In general, they may be treated in the same way as unitary texts, using the normal TEI.2 and body elements. The embedded text itself may be encoded using the text element, which may occur within quotations or between paragraphs or other chunk-level elements inside the sections of a larger text. For further discussion, see chapter .
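The embedding described above can be sketched as follows, with a subordinate text placed between two paragraphs of the enclosing one. The div and p element names and the sample wording are assumptions used only to make the nesting visible.

```sgml
<!-- Sketch only: div and p are assumed names for division and paragraph. -->
<text>
  <body>
    <div>
      <p>Paragraph of the enclosing text introducing the embedded one.</p>
      <text>
        <body>
          <p>The embedded text, for example a letter or a tale.</p>
        </body>
      </text>
      <p>The enclosing text resumes here.</p>
    </div>
  </body>
</text>
```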

All composite texts share the characteristic that their different component texts may be of structurally similar or dissimilar types. If the components may all be encoded using the same base tag set, then the composite text should use that same base. If the components require different base tag sets, however, the encoding should use the general base tag set or the mixed base tag set described in chapter .

Contextual Information

Contextual information is of particular importance for collections or corpora composed of samples from a variety of different kinds of text. Examples of such contextual information include: the age, sex and geographical origins of participants in a language interaction, or their socio-economic status; the cost and publication date of a newspaper story; the topic, register or factuality of an extract from a textbook. Such information may be of the first importance, whether as an organizing principle in creating a corpus (for example, to ensure that the range of values in such a parameter is evenly represented throughout the corpus, or represented proportionately to the population being sampled), or as a selection criterion in analysing the corpus (for example, to investigate the language usage of some particular vector of social characteristics).

Such contextual information is potentially of equal importance for unitary texts, and these Guidelines accordingly make no particular distinction between the kinds of information which should be gathered for unitary and for composite texts. In either case, the information should be recorded in the appropriate section of a TEI Header, as described in chapter . In the case of language corpora, such information may be gathered together in the overall corpus header, or split across all the component texts of a corpus, in their individual headers, or divided between the two. The association between an individual corpus text and the contextual information applicable to it may be made in a number of ways, as further discussed in section below.

Chapter , which should be read in conjunction with the present section, describes in full the range of elements available for the encoding of information relating to the electronic file itself, for example its bibliographic description and those of the source or sources from which it was derived (see section ); information about the encoding practices followed with the corpus, for example its design principles, editorial practices, reference system etc. (see section ); more detailed descriptive information about the corpus' creation and content, such as the languages used within it and any descriptive classification system used (see section ); and version information documenting any changes made in the electronic text (see section ).

In addition to the elements defined by chapter , several other elements can be used in the TEI header if the additional tag set defined by this chapter is invoked. These additional tags make it possible to characterize the social or other situation within which a language interaction takes place or is experienced, the physical setting of a language interaction, and the participants in it. Though this information may be relevant to, and provided for, unitary texts as well as for collections or corpora, it is more often recorded for the components of systematically developed corpora than for isolated texts, and thus the additional tag set is referred to as being for language corpora. Included in this tag set are the following elements:

textDesc: provides a description of a text in terms of its situational parameters.
particDesc: describes the identifiable speakers, voices or other participants in a linguistic interaction.
settingDesc: describes the setting or settings within which a language interaction takes place, either as a prose description or as a series of setting elements.

These elements form an optional extension to the profileDesc, defined in section , and are further described in the remainder of this section. They are formally defined as follows:

The additional tag set for language corpora will be invoked, thus enabling the use of these elements, if a parameter entity called TEI.corpus is declared with the value INCLUDE, somewhere within the DTD subset. If the document is structured as a TEI corpus (that is, using the TEI.corpus.2 element), its document type declaration will name TEI.corpus.2 as the document type and carry this entity declaration in its DTD subset.

The Text Description
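The textDesc element becomes available only once the additional tag set for language corpora has been invoked by declaring TEI.corpus as INCLUDE. A minimal sketch of such a document type declaration is given here; the system identifier tei2.dtd and the choice of TEI.prose as base tag set are assumptions for illustration, and only the TEI.corpus declaration is taken from the text.

```sgml
<!-- Sketch only: file name and base tag set entity are assumed. -->
<!DOCTYPE TEI.corpus.2 SYSTEM "tei2.dtd" [
  <!ENTITY % TEI.prose  "INCLUDE">
  <!ENTITY % TEI.corpus "INCLUDE">
]>
```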

The textDesc element provides a full description of the situation within which a text was produced or experienced, and thus characterizes it in a way relatively independent of any a priori theory of text-types. It is provided as an alternative or a supplement to the common use of descriptive taxonomies used to categorize texts, which is fully described in section , and section . The description is organized as a set of values and optional prose descriptions for the following eight situational parameters, each represented by one of the following eight elements:

channel: describes the medium or channel by which a text is delivered or experienced. For a written text, this might be print, manuscript, e-mail, etc.; for a spoken one, radio, telephone, face-to-face, etc. Attributes include one that specifies the mode of this channel with respect to speech and writing. Legal values are: spoken; written; spoken to be written (e.g. dictation); written to be spoken (e.g. a script); mixed modes; unknown or inapplicable.

constitution: describes the internal composition of a text or text sample, for example as fragmentary, complete, etc. Attributes include one that specifies how the text was constituted. Legal values are: a single complete text; a text made by combining several smaller items, each individually complete; a text made by combining several smaller, not necessarily complete, items; composition unknown or unspecified.

derivation: describes the nature and extent of indebtedness or derivativeness of this text with respect to others. Attributes include one that categorizes the derivation of the text. Sample values include: text is original; text is a revision of some other text; text is a translation of some other text; text is an abridged version of some other text; text is plagiarized from some other text; text has no obvious source but is one of a number derived from some common ancestor.

domain: describes the most important social context in which the text was realized or for which it is intended, for example private vs. public, education, religion, etc. Attributes include one that categorizes the domain of use. Sample values include: art and entertainment; domestic and private; religious and ceremonial; business and work place; education; government and law; other forms of public context.

factuality: describes the extent to which the text may be regarded as imaginative or non-imaginative, that is, as describing a fictional or a non-fictional world. Attributes include one that categorizes the factuality of the text. Legal values are: the text is to be regarded as entirely imaginative; the text is to be regarded as entirely informative or factual; the text contains a mixture of fact and fiction; the fiction/fact distinction is not regarded as helpful or appropriate to this text.

interaction: describes the extent, cardinality and nature of any interaction among those producing and experiencing the text, for example in the form of response or interjection, commentary etc. Attributes include the following three. The first specifies whether or not there is any interaction between active and passive participants in the text. Legal values are: no interaction of any kind, e.g. a monologue; some degree of interaction, e.g. a monologue with set responses; complete interaction, e.g. a face to face conversation; this parameter is inappropriate or inapplicable in this case. The second specifies the number of active participants (or addressors) producing parts of the text. Legal values are: a single addressor; many addressors; a corporate addressor; number of addressors unknown or unspecifiable. The third specifies the number of passive participants (or addressees) to whom a text is directed or in whose presence it is created or performed. Suggested values include: text is addressed to the originator, e.g. a diary; text is addressed to one other person, e.g. a personal letter; text is addressed to a countable number of others, e.g. a conversation in which all participants are identified; text is addressed to an undefined but fixed number of participants, e.g. a lecture; text is addressed to an undefined and indeterminately large number, e.g. a published book.

preparedness: describes the extent to which a text may be regarded as prepared or spontaneous. Attributes include a keyword characterizing the type of preparedness. Sample values include: spontaneous or unprepared; follows a script; follows a predefined set of conventions; polished or revised before presentation.

purpose: characterizes a single purpose or communicative function of the text. Attributes include the following two. The first specifies a particular kind of purpose. Suggested values include: didactic, advertising, propaganda etc.; self expression, confessional etc.; convey information, educate etc.; amuse, entertain etc. The second specifies the extent to which this purpose predominates. Legal values are: this purpose is predominant; this purpose is intermediate; this purpose is weak; extent unknown.

To make up a TEI-conformant text description, each of the above elements must be supplied in the order specified, though its value may indicate that it is not applicable. Except for the purpose element, which may be repeated to indicate multiple purposes, no element may appear more than once within a single text description. Each element may be empty, or may contain a brief qualification or more detailed description of the value expressed by its attributes. It should be noted that some texts, in particular literary ones, may resist unambiguous classification in some of these dimensions; in such cases, the situational parameter in question should be given the content not applicable or the equivalent.

Texts may be described along many dimensions, according to many different taxonomies. No generally accepted consensus as to how such taxonomies should be defined has yet emerged, despite the best efforts of many corpus linguists, text linguists, sociolinguists, rhetoricians, and literary theorists over the years. Rather than attempting the task of proposing a single taxonomy of text types (or the equally impossible one of enumerating all those which have been proposed previously), the closed set of situational parameters described above can be used in combination to supply useful distinguishing descriptive features of individual texts, without insisting on a system of discrete high-level text-types. Such text-types may however be used in combination with the parameters proposed here, with the advantage that the internal structure of each such text-type can be specified in terms of the parameters proposed. This approach has the following analytical advantages: it enables a relatively continuous characterization of texts (in contrast to discrete categories based on type or topic); it enables meaningful comparisons across corpora; it allows analysts to build and compare their own text-types based on the particular parameters of interest to them; and it is equally applicable to spoken and written texts.

Schemes similar to that proposed here were developed in the 1960s and 1970s by researchers such as Hymes, Halliday, and Crystal and Davy, but have rarely been implemented; one notable exception being the pioneering work on the Helsinki Diachronic Corpus of English, on which see M. Kytö and M. Rissanen, The Helsinki Corpus of English Texts, in Corpus Linguistics: hard and soft, ed. M. Kytö, O. Ihalainen, and M. Rissanen (Amsterdam: Rodopi, 1988).

Two alternative approaches to the use of these parameters are supported by these Guidelines. One is to use pre-existing taxonomies such as those used in subject classification or other types of text categorization. Such taxonomies may also be appropriate for the description of the topics addressed by particular texts. Elements for this purpose are described in section , and elements for defining or declaring such classification schemes in section . A second approach is to develop an application specific set of feature structures and an associated feature system declaration, as described in chapters and .

Where the organizing principles of a corpus or collection so permit, it may be convenient to regard a particular set of values for the situational parameters listed in this section as forming a text-type in its own right; this may also be useful where the same set of values applies to several texts within a corpus. In such a case, the set of text-types so defined should be regarded as a taxonomy. The mechanisms described in section may be used to define hierarchic taxonomies of such text-types, provided that the catDesc component of the category element contains a textDesc element rather than a prose description. Particular texts may then be associated with such definitions using the mechanisms described in sections .

Using these parameters, an informal domestic conversation might be characterized as follows: informal face-to-face conversation each text represents a continuously recorded interaction among the specified participants plans for coming week, local affairs mostly factual, some jokes ]]>

The following example demonstrates how the same situational parameters might be used to characterize a novel: print; part issues

The formal declarations for these elements are given below:

The Participants Description

The particDesc element in the profileDesc element provides additional information about the participants in a spoken text or, where this is judged appropriate, the persons named or depicted in a written text. Individual speakers or groups of speakers may be named or identified by a code which can then be used elsewhere within the encoded text, for example as the value of a who attribute. Demographic and descriptive information may be supplied about their individual characteristics and the relationships between them.

It should be noted that although the terms speaker or participant are used throughout this section, it is intended that the same mechanisms may be used to characterize fictional personae or voices within a written text, except where otherwise stated. For the purposes of analysis of language usage, the information specified here should be equally applicable to written and spoken texts.

The element particDesc contains one or more participant or participantGrp elements, followed by an optional particLinks element, as described below: describes a single participant in a language interaction. Attributes include: specifies the role of this participant in the group. specifies the sex of the participant. Sample values include: male female unknown or inapplicable specifies the age group to which the participant belongs. describes a group of individuals treated as a single participant. Attributes include: specifies the role of this group of participants in the interaction. specifies the sex of the participant group. Sample values include: male female unknown mixed specifies the age group of the participants. specifies the size or approximate size of the group. describes the relationships or social links existing between participants in a linguistic interaction.

Both participants and participant groups have the same substructure. This may be a prose description, or, more formally, a series of specialized subelements providing more specific details. Such details will vary enormously for different kinds of analysis; the set of demographic characteristics presented here as sub-elements should therefore be regarded as providing only an indication of the kinds of descriptive information which have been found to be generally useful, for example in socio-linguistics. Users of these Guidelines are free to extend or modify this set of demographic characteristics, by redefining the parameter entity m.demographics, associated with the class demographics, as further described in chapter . Where well-known classification schemes exist, e.g. for socio-economic class or occupation, these should be used and may be documented in the same way as for text classification (see section )

The following elements are the default members of the class demographics: contains a name of a person, or a reference to a person. contains information about a person's birth, such as its date and place. Attributes include: specifies the date of birth in an ISO standard form (yyyy-mm-dd). specifies the first language of a participant. contains an informal description of a person's competence in different languages, dialects, etc. describes a person's present or past places of residence. contains a brief prose description of the educational background of a participant. contains an informal description of a person's present or past affiliation with some organization, for example an employer or sponsor. contains an informal description of a person's trade, profession or occupation. Attributes include: identifies the classification system or taxonomy in use by supplying the identifier of a taxonomy element elsewhere in the header. identifies an occupation code defined within the classification system or taxonomy defined by the source attribute. contains an informal description of a person's perceived social or economic status. Attributes include: identifies the classification system or taxonomy in use. identifies a status code defined within the classification system or taxonomy defined by the source attribute.

For example, an individual might be described informally by the following participant element:

Female informant, well-educated, born in Shropshire UK, 12 Jan 1950, of unknown occupation. Speaks French fluently. Socio-Economic status B2 in the PEP classification scheme.

Provided that the PEP classification scheme has been defined elsewhere in the header (as a taxonomy element within the textClass element; see ), the same individual might more formally be described as follows: 12 Jan 1950 Shropshire, UK English French Long term resident of Hull University postgraduate Unknown

An identified character in a drama or a novel might be defined using a subset of the same tags as follows: Emma Woodhouse

It is particularly useful to define participants in a dramatic text in this way, since it enables the who attribute to be used to link sp elements to a definition for their speaker; see further section .

As noted above, the particLinks element is used to document personal or social relationships between individual participants, where this is felt to be of importance in the analysis. This may be done either as an informal prose description, or more formally using the special purpose relation element described in this section. describes any kind of relationship or linkage amongst a specified group of participants. Attributes include: categorizes the relationship in some respect, e.g. as social, personal or other. Suggested values include: relationship concerned with social roles relationship concerned with personal roles, e.g. kinship, marriage, etc. other kinds of relationship briefly describes the relationship. identifies the active participants in a non-mutual relationship, or all the participants in a mutual one. identifies the passive participants in a non-mutual relationship. indicates whether the relationship holds equally amongst all the participants. Legal values are: the relationship is mutual the relationship is directed

A relationship, as defined here, may be any kind of describable link between specified participants, for example a social relationship (such as employer/employee), a personal relationship (such as sibling, spouse, etc.) or something less precise such as possessing shared knowledge. A relationship may be mutual, in that all the participants engage in it on an equal footing (for example the sibling relationship); or it may not be if participants are not identical with respect to their role in the relationship (for example, the employer relationship). In the case of non-mutual relationships, only two kinds of role are supported by the present proposals, somewhat arbitrarily named active and passive. These names are chosen to reflect the fact that many non-mutual relations are directed, in the sense that they are most readily described by a transitive verb, or a verb phrase of the form [is] X [of] or [is] X [to]. The subject of the verb is classed as active, the direct object of the verb, or the object of the concluding preposition, as passive. Thus parents are active and children passive in the relationship parent (interpreted as [is] parent [of]); the employer is active, the employee passive, in the relationship employs. These relationships can be inverted: parents are passive and children active in the relationship [is] child [of]; similarly works for inverts the active/passive roles of employs.

For example: This example defines the following three relationships among participants P1 through P7: P1 and P2 are parents of P3 and P4. P1 and P2 are linked in a mutual relationship called spouse --- i.e. P2 is the spouse of P1, and P1 is the spouse of P2. P1 has the social relationship employer with respect to P3, P5, P6, and P7.
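The active/passive convention and the three relationships listed above can be modelled in a few lines of Python. This is only an illustrative sketch: the `Relation` class, its field names and its `holds` and `inverted` methods are hypothetical and are not part of the Guidelines or of any TEI software; only the participant codes P1 to P7 and the relationship names come from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Relation:
    """One relationship among participants (hypothetical model, not a TEI API)."""
    desc: str                          # brief description, e.g. "parent"
    type: str                          # "social", "personal" or "other"
    active: frozenset                  # active participants, or all of them if mutual
    passive: frozenset = frozenset()   # passive participants (non-mutual only)
    mutual: bool = False

    def holds(self, a: str, b: str) -> bool:
        """True if participant `a` stands in this relationship to participant `b`."""
        if a == b:
            return False
        if self.mutual:
            return a in self.active and b in self.active
        return a in self.active and b in self.passive

    def inverted(self, desc: str) -> "Relation":
        """Return the inverse relationship, e.g. parent -> child."""
        if self.mutual:
            # A mutual relationship is its own inverse; only the label changes.
            return replace(self, desc=desc)
        return replace(self, desc=desc, active=self.passive, passive=self.active)


# The three relationships described in the text.
parent = Relation("parent", "personal", frozenset({"P1", "P2"}), frozenset({"P3", "P4"}))
spouse = Relation("spouse", "personal", frozenset({"P1", "P2"}), mutual=True)
employer = Relation("employer", "social", frozenset({"P1"}),
                    frozenset({"P3", "P5", "P6", "P7"}))

# Inverting "parent" swaps the active and passive roles.
child = parent.inverted("child")
```

Under this model `parent.holds("P1", "P3")` is true while `parent.holds("P3", "P1")` is not, and `child.holds("P3", "P1")` is true, which mirrors the inversion of roles described in the preceding paragraph.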

The elements discussed in this section are formally defined as follows:

The Setting Description

The settingDesc element is used to describe the setting or settings in which language interaction takes place. It may contain a prose description, analogous to a stage description at the start of a play, stating in broad terms the locale, or a more detailed description of a series of such settings. Individual settings may be associated with particular participants by means of the optional who attribute if, for example, participants are in different places. This attribute identifies one or more individual participants or participant groups, as discussed earlier in section . If this attribute is not specified, the setting details provided are assumed to apply to all participants represented in the language interaction. The present proposals do not support the encoding of different settings for the same participant. This is a subject for further work.

Each distinct setting is described by means of a setting element, which contains either a prose description or a combination of the other elements listed below: describes one particular setting in which a language interaction takes place. Attributes include: supplies the identifiers of the participants at this setting. contains the name of, or a reference to, a place. contains a date in any format. Attributes include: gives the value of the date in some standard form, usually yyyy-mm-dd. contains a phrase defining a time of day in any format. Attributes include: gives the value of the time in a standard form. contains a brief informal description of the nature of a place for example a room, a restaurant, a park bench etc. contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything.

The following example demonstrates the kind of background information often required to support transcriptions of language interactions, first encoded as a simple prose narrative:

The time is early spring, 1989. P1 and P2 are playing on the rug of a suburban home in Bedford. P3 is doing the washing up at the sink. P4 (a radio announcer) is in a broadcasting studio in London.

The same information might be represented more formally in the following way: Bedford, UK early spring, 1989 rug of a suburban home playing Bedford, UK early spring, 1989 at the sink washing-up London, UK

The elements discussed in this section have the following formal definitions:

Associating Contextual Information with a Text

This section discusses the association of the contextual information held in the header with the individual elements making up a TEI text or corpus. Contextual information is held in elements of various kinds within the TEI header, as discussed elsewhere in this section and in chapter . Here we consider what happens when different parts of a document need to be associated with different contextual information of the same type, for example when one part of a document uses a different encoding practice from another, or where one part relates to a different setting from another. In such situations, there will be more than one instance of a header element of the relevant type.

The TEI DTDs allow for the following possibilities:

A given element may appear in the corpus header only, in the header of one or more texts only, or in both places.

There may be multiple occurrences of certain elements in either corpus or text header.

To simplify the exposition, we deal with these two possibilities separately in what follows; however, they may, of course, be combined as desired.

Combining Corpus and Text Headers

A TEI conformant document may have more than one header only in the case of a TEI corpus, which must have a header in its own right, as well as the obligatory header for each text. Every element specified in a corpus-header is understood as if it appeared within every text header in the corpus. An element specified in a text header but not in the corpus header supplements the specification for that text alone. If any element is specified in both corpus and text headers, the corpus header element is over-ridden for that text alone.

The titleStmt for a corpus text is understood to be prefixed by the titleStmt given in the corpus header. All other optional elements of the fileDesc should be omitted from an individual corpus text header unless they differ from those specified in the corpus header. All other header elements behave identically, in the manner documented below.

This facility makes it possible to state once for all in the corpus header each piece of contextual information which is common to the whole of the corpus, while still allowing for individual texts to vary from this common denominator.
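The combination rules just described (inheritance from the corpus header, supplementation and override by the text header, and prefixing of the title statement) can be sketched as a small merge function. The dictionary representation, the function name `effective_header` and the keys `encoding`, `languages` and `setting` are assumptions made purely for illustration; only `titleStmt` is a name taken from the text, and the title statement is modelled here as a list of strings.

```python
def effective_header(corpus_header: dict, text_header: dict) -> dict:
    """Combine a corpus header with one text header (illustrative sketch only)."""
    # Every corpus-header element is understood as present in each text header.
    effective = dict(corpus_header)
    for name, value in text_header.items():
        if name == "titleStmt" and "titleStmt" in corpus_header:
            # The text's title statement is prefixed by the corpus title statement.
            effective[name] = corpus_header[name] + value
        else:
            # Supplements the corpus header, or over-rides it for this text alone.
            effective[name] = value
    return effective


# Hypothetical headers: the values are invented placeholders.
CORPUS = {
    "titleStmt": ["A hypothetical corpus"],
    "encoding": "corpus-wide practice",
    "languages": "English",
}
TEXT2 = {
    "titleStmt": ["Text 2"],
    "encoding": "practice for text 2 only",
    "setting": "a kitchen",
}

merged = effective_header(CORPUS, TEXT2)
```

Here `merged` inherits `languages` from the corpus header, takes `setting` from the text header alone, and uses the text's own `encoding` in place of the corpus-wide one; the corpus header itself is left unchanged for the other texts.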

For example, the following schematic shows the structure of a corpus comprising three texts, the first and last of which share the same encoding declaration. The second one has its own encoding declaration: ... ... ... ... ... ... ... ... ... ... ...

Declarable Elements

Certain of the elements which can appear within a TEI Header are known as declarable elements. These elements have in common the fact that they may be linked explicitly with a particular part of a text or corpus by means of a decls attribute. This linkage is used to over-ride the default association between declarations in the header and a corpus or corpus text. The only header elements which may be associated in this way are those which would not otherwise be meaningfully repeatable. An alphabetically ordered list of declarable elements follows: describes scope of any analytic or interpretive information added to the text in addition to the transcription. contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged. contains a fully-structured bibliographic citation, in which all components of the TEI file description are present. contains a structured bibliographic citation, in which only bibliographic subelements appear and in a specified order. describes a broadcast used as the source of a spoken text. states how and under what circumstances corrections have been made in the text. provides details of editorial principles and practices applied during the encoding of a text. provides technical details of the equipment and media used for an audio or video recording used as the source for a spoken text. summarizes the way in which hyphenation in a source text has been treated in an encoded version of it. describes the languages, sublanguages, registers, dialects etc. represented within a text. contains a list of bibliographic citations of any kind. indicates the extent of normalization or regularization of the original source carried out in converting it to electronic form. describes the identifiable speakers, voices or other participants in a linguistic interaction. 
describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected. specifies editorial practice adopted with respect to quotation marks in the original. details of an audio or video recording event used as the source of a spoken text, either directly or from a public broadcast. contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection. contains a citation giving details of the script used for a spoken text. describes the principles according to which the text has been segmented, for example into sentences, tone-units, graphemic strata, etc. supplies a bibliographic description of the copy text(s) from which an electronic text was derived or generated. specifies the format used when standardized date or number values are supplied. groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc. provides a description of a text in terms of its situational parameters.

All of the above elements may be multiply defined within a single header, that is, there may be more than one instance of any declarable element type at a given level. When this occurs, the following rules apply:

every declarable element must bear a unique identifier

for each different type of declarable element which occurs more than once within the same parent element, exactly one element must be specified as the default

In the following example, an editorial declaration contains two possible correction policies, one identified as C1 and the other as C2. Since there are two, one of them (in this case C1) must be specified as the default: ... ...

...

... ]]> For texts associated with the header in which this declaration appears correction method C1 will be assumed, unless they explicitly state otherwise. Here is the structure for a text which does state otherwise: ... ... ... ... ... ]]> In this case, the contents of the divisions D1 and D3 will both use correction policy C1, and those of division D2 will use correction policy C2.
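The behaviour described for divisions D1, D2 and D3 amounts to a lookup that walks up from an element to its ancestors and falls back on the default declaration. The following sketch assumes a hypothetical representation in which `PARENT_OF` records each element's parent and `DECLS` records which elements carry an explicit association; the identifiers `T1` and `D2.p1` are invented for the illustration, and the inheritance by descendants follows the summary rules given later in this section.

```python
def applicable_correction(element, parent_of, decls, default):
    """Find the correction policy in force for `element` (illustrative sketch)."""
    node = element
    while node is not None:
        if node in decls:
            # The nearest explicit association wins.
            return decls[node]
        node = parent_of.get(node)   # move up to the parent, if any
    # No element on the path states otherwise: the header default applies.
    return default


# Hypothetical document tree: text T1 contains D1, D2 and D3; D2 contains a paragraph.
PARENT_OF = {"D1": "T1", "D2": "T1", "D3": "T1", "D2.p1": "D2"}

# Only D2 states otherwise, selecting correction policy C2.
DECLS = {"D2": "C2"}

policies = {
    name: applicable_correction(name, PARENT_OF, DECLS, default="C1")
    for name in ("D1", "D2", "D3", "D2.p1")
}
```

With these assumptions D1 and D3 resolve to C1, while D2 and the paragraph inside it resolve to C2.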

The decls attribute is defined for any element which is a member of the class declaring. This includes the major structural elements text, group, and div, as well as smaller structural units, down to the level of paragraphs in prose, individual utterances in spoken texts, or entries in dictionaries. However, TEI recommended practice is to limit the number of multiple declarable elements used by a document as far as possible, for simplicity and ease of processing.

The identifier or identifiers specified by the decls attribute are subject to two further restrictions:

An identifier specifying an element which contains multiple instances of one or more other elements should be interpreted as if it explicitly identified the elements identified as the default in each such set of repeated elements.

Each element specified, explicitly or implicitly, by the list of identifiers must be of a different type.

To demonstrate how these rules operate, we now expand our earlier example slightly: ...

...

...

... ]]>

This encoding description now has two editorial declarations, identified as ED1 (the default) and ED2. For texts not specifying otherwise, ED1 will apply. If ED1 applies, correction method C1a and normalization method N1 apply, since these are the specified defaults within ED1. In the same way, for a text specifying decls as ED2, correction C2a, sampling SAMP2 and normalization N2b will apply.
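The two restrictions on the decls attribute, together with the defaulting behaviour just described, can be expressed as a small resolver. This is a hedged sketch rather than a rendering of any TEI software: the `Decl` tuple, the `HEADER` table and the type labels `editorial`, `correction`, `normalization` and `sampling` are assumptions, and the table contains only the identifiers named in the surrounding prose, not the full content of the example. SAMP2, being the only element of its kind within ED2, is treated as a default.

```python
from typing import NamedTuple


class Decl(NamedTuple):
    kind: str              # type label (hypothetical, not a DTD element name)
    default: bool = False  # marked as the default among its siblings
    children: tuple = ()   # identifiers of contained declarable elements


HEADER = {
    "ED1": Decl("editorial", True, ("C1a", "N1")),
    "ED2": Decl("editorial", False, ("C2a", "C2b", "N2a", "N2b", "SAMP2")),
    "C1a": Decl("correction", True),
    "N1": Decl("normalization", True),
    "C2a": Decl("correction", True),
    "C2b": Decl("correction"),
    "N2a": Decl("normalization"),
    "N2b": Decl("normalization", True),
    "SAMP2": Decl("sampling", True),
}


def expand(ident):
    """Replace a container identifier by the defaults it contains (first restriction)."""
    decl = HEADER[ident]
    if not decl.children:
        return [ident]
    leaves = []
    for child in decl.children:
        if HEADER[child].default:
            leaves.extend(expand(child))
    return leaves


def resolve(decls):
    """Map each declaration type to the identifier selected by a decls value."""
    chosen = {}

    def claim(kind, ident):
        # Second restriction: at most one element of each type may be selected.
        if kind in chosen and chosen[kind] != ident:
            raise ValueError(f"two {kind} declarations: {chosen[kind]} and {ident}")
        chosen[kind] = ident

    for ident in decls.split():
        if HEADER[ident].children:
            claim(HEADER[ident].kind, ident)
        for leaf in expand(ident):
            claim(HEADER[leaf].kind, leaf)
    return chosen
```

Calling `resolve("ED2")` selects C2a, N2b and SAMP2, whereas a value that names two declarations of the same type, directly or through a container's defaults, raises `ValueError`.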

A finer grained approach is also possible. A text might specify text decls='C2b N2a', or even text decls='C1a N2a SAMP2', to mix and match declarations as required. A tag such as text decls='ED1 ED2' would (obviously) be illegal, since it includes two elements of the same type; a tag such as text decls='ED2 C1a' is also illegal, since in this context ED2 is synonymous with the defaults for that editorial declaration, namely SAMP2 C2a N2b, resulting in a list that identifies two correction elements (C1a and C2a).

Summary

The rules determining which of the declarable elements are applicable at any point may be summarized as follows:

If there is a single occurrence of a given declarable element in a corpus header, then it applies by default to all elements within the corpus.

If there is a single occurrence of a given declarable element in the text header, then it applies by default to all elements of that text irrespective of the contents of the corpus header.

Where there are multiple occurrences of declarable elements within either corpus or text header, each must have a unique value specified as the value of its id attribute; exactly one must bear a default attribute with the value YES.

It is a semantic error for an element to be associated with more than one occurrence of any declarable element.

Selecting an element which contains multiple occurrences of a given declarable element is semantically equivalent to selecting only those contained elements which are specified as defaults.

An association made by one element applies by default to all of its descendants.

Linguistic Annotation of Corpora

Language corpora often include analytic encodings or annotations, designed to support a variety of different views of language. The present Guidelines do not advocate any particular approach to linguistic annotation (or tagging); instead a number of general analytic facilities are provided which support the representation of most forms of annotation in a standard and self-documenting manner. Analytic annotation is of importance in many fields, not only in corpus linguistics, and is therefore discussed in general terms elsewhere in the Guidelines. See in particular chapters , , and . The present section presents informally some particular applications of these general mechanisms to the specific practice of corpus linguistics.

Levels of Analysis

By linguistic annotation we mean here any annotation determined by an analysis of linguistic features of the text, excluding as borderline cases both the formal structural properties of the text (e.g. its division into chapters or paragraphs) and descriptive information about its context (the circumstances of its production, its genre or medium). The structural properties of any TEI-conformant text should be represented using the structural elements discussed elsewhere in this chapter and in chapters , , and the various chapters of Part III (on base tag sets). The contextual properties of a TEI text are fully documented in the TEI Header, which is discussed in chapter , and in section of the present chapter.

Other forms of linguistic annotation may be applied at a number of levels in a text. A code (such as a word-class or part-of-speech code) may be associated with each word or token, or with groups of such tokens, which may be continuous, discontinuous or nested. A code may also be associated with relationships (such as cohesion) perceived as existing between distinct parts of a text. The codes themselves may stand for discrete non-decomposable categories, or they may represent highly articulated bundles of textual features. Their function may be to place the annotated part of the text somewhere within a narrowly linguistic or discoursal domain of analysis, or within a more general semantic field, or any combination drawn from these and other domains.

The manner by which such annotations are generated and attached to the text may be entirely automatic, entirely manual or a mixture. The ease and accuracy with which analysis may be automated may vary with the level at which the annotation is attached. The method employed should be documented in the analysis element within the encoding description of the TEI Header, as described in section . Where different parts of a corpus have used different annotation methods, the decls attribute may be used to indicate the fact, as further discussed in section .

An Extended Example

As one example of such types of analysis, consider the following sentence, taken from the Lancaster/IBM Treebank Project: The victim's friends told police that Krueger drove into the quarry and never surfaced.

See G. N. Leech and R. G. Garside, Running a Grammar Factory, in English Computer Corpora: Selected Papers and Research Guide, ed. S. Johansson and A.-B. Stenstrøm (Berlin: de Gruyter; New York: Mouton, 1991), pp. 15-32. This sentence and its analysis are reproduced by kind permission of the University of Lancaster's Unit for Computer Research on the English Language.

Our discussion focuses on the way that this sentence might be analysed using the Claws system developed at the University of Lancaster, but exactly the same principles may be applied to a wide variety of other systems. For the word-class tagging method used by Claws see I. Marshall, Choice of Grammatical Word Class without Global Syntactic Analysis: Tagging Words in the LOB Corpus, in Computers and the Humanities 17 (1983): 139-50. For an overview of the system see R. G. Garside, G. N. Leech, and G. R. Sampson, The Computational Analysis of English: a Corpus-Based Approach (Oxford: Oxford University Press, 1991). Output from the system consists of a segmented and tokenized version of the text, in which word class codes have been associated with each token. For our example sentence, we might conveniently represent these codes using entity references: The&AT victim&NN1 's&GEN friends&NN2 told&VVD police&NN2 that&CST Krueger&NP1 drove&VVD into&II the&AT quarry&NN1 and&CC never&RR surfaced&VVD The names used for these entity references have some significance for the human reader (AT for article, NN1 for singular noun, NN2 for plural noun, etc.), but their representation in the output from an SGML system processing the document may be chosen to suit the convenience of whatever analytic software is to be used. For example, if the SGML parser operating on this sentence uses a set of entity declarations in the following form: then the wordclass tags will simply disappear from the output. Alternatively, if the entity set in use follows the following pattern: then our sample sentence will be processed by an SGML-aware processor as if it began

More usefully, the replacement texts for each entity will be a code of some significance to a particular analysis program. In this case, it might be argued that some replacement should be chosen which will not obscure the fact that (for example) NN1 and NN2 have something in common, namely their noun-ness, which they do not share with (say) VVD. It is a matter for the analyst to decide whether or not the word class categories represented by these codes are decomposable into bundles of more primitive analytic features. In cases where they are, such analytic features should be defined as a set of feature structures, as described in chapters and . A suitable replacement for the word-class entity references above would then be an empty anchor element bearing a struct attribute (for which see chapter ). The victim 's friends told police that Krueger drove into the quarry never surfaced The anchor element simply marks a place in the text. The struct attribute associates that place with the relevant feature structure by specifying its unique identifier. This solution requires that there be somewhere a formal definition for what exactly is meant by analytic code NN1, in terms as simple or as complex as seems appropriate to the analyst. For example, we might represent the constituent features of the codes AT and NN1 using feature structure notation as follows: For a more detailed presentation of this method of representing the analytic features underlying these word class codes, refer to chapter .
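The effect of choosing different replacement texts for the word-class entity references can be imitated outside an SGML system with a simple substitution. The sketch below is not an SGML parser: the function `expand_entities` and the regular expression are illustrative assumptions, and the underscore replacement is a hypothetical choice standing in for whatever code an analysis program expects. Only the tagged sentence is taken from the text.

```python
import re

TAGGED = ("The&AT victim&NN1 's&GEN friends&NN2 told&VVD police&NN2 that&CST "
          "Krueger&NP1 drove&VVD into&II the&AT quarry&NN1 and&CC never&RR "
          "surfaced&VVD")

# An entity reference here is "&" plus a name, optionally closed by ";".
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")


def expand_entities(text, replacement):
    """Substitute each entity reference by replacement(name) (illustrative only)."""
    return ENTITY.sub(lambda match: replacement(match.group(1)), text)


# Empty replacement texts: the word-class tags disappear from the output.
plain = expand_entities(TAGGED, lambda name: "")

# A hypothetical replacement keeping the code, attached with an underscore.
coded = expand_entities(TAGGED, lambda name: "_" + name)
```

Because the replacement is supplied as a function, the same tagged text can feed several analysis programs without being edited, which is the point of encoding the codes as entity references in the first place.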

Although common practice, this method of presenting a token-level analysis requires application of an unstated convention about how the input stream is to be tokenized. The interpretive structures are associated not with a distinct stretch of text, but with a point in it, by convention immediately following the token to which they really apply. This convention makes it impossible to associate interpretive structures directly with sequences longer than a single token.

Neither of these inconveniences applies if the text is fully segmented, using the general purpose s element both to represent each token and the outer segment containing the whole sentence: The victim 's friends told police that Krueger drove into the quarry and never surfaced ]]> As this example demonstrates, the s element has scope and may thus be used to represent segmentation carried out at several levels, provided that the whole of the segmentation is describable by a single hierarchic tree. For example, the s element might be used to mark the boundaries of a high-level unit roughly corresponding with the traditional notion of sentence or s-unit, as discussed in section as well as its individual tokens (as above) together with a third, intermediate, level of analysis, corresponding with the traditional notion of phrase. It might also be used to indicate a lower-level morphological level of analysis. To distinguish amongst these various uses of the s element, the type attribute may be used, as in the following example: The victim 's friends told police ]]>

However, in projects where annotation is routinely carried out at multiple levels it will often be felt preferable to define an element appropriate to each level of analysis. For example, Claws provides the following constituent analysis of the sample sentence: [N [G The_AT victim_NN1 's_$ G] friends_NN2 N] [V told_VVD [N police_NN2 N] [Fn that_CST [N Krueger_NP1 N] [V [V& drove_VVD [P into_II [N the_AT quarry_NN1 N]P]V&] and_CC [V+ never_RR surfaced_VVD V+]V]Fn]V] ._.
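Before such an analysis can be re-expressed as nested elements, the bracketed notation has to be read into a tree. The following sketch assumes, from this one example only, that `[X` opens a constituent labelled X, that `X]` closes it, and that every other item is a `word_TAG` pair; it is not a description of the Claws output format in general, and the function names are hypothetical.

```python
import re

CLAWS = ("[N [G The_AT victim_NN1 's_$ G] friends_NN2 N] [V told_VVD "
         "[N police_NN2 N] [Fn that_CST [N Krueger_NP1 N] [V [V& drove_VVD "
         "[P into_II [N the_AT quarry_NN1 N]P]V&] and_CC "
         "[V+ never_RR surfaced_VVD V+]V]Fn]V] ._.")

# Alternatives, in order: opening bracket, closing bracket, tagged word.
TOKEN = re.compile(r"\[([^\s\[\]]+)|([^\s\[\]]+)\]|([^\s\[\]]+)")


def parse(text):
    """Build a tree of (label, children) nodes with (word, tag) leaves."""
    root = ("ROOT", [])
    stack = [root]
    for opened, closed, word in TOKEN.findall(text):
        if opened:
            node = (opened, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif closed:
            label, _ = stack.pop()
            if label != closed:
                raise ValueError(f"bracket mismatch: [{label} closed by {closed}]")
        else:
            form, _, tag = word.rpartition("_")
            stack[-1][1].append((form, tag))
    if len(stack) != 1:
        raise ValueError("unclosed constituent")
    return root


def leaves(node):
    """Return the (word, tag) pairs of a tree in text order."""
    found = []
    for child in node[1]:
        if isinstance(child[1], list):
            found.extend(leaves(child))   # a nested constituent
        else:
            found.append(child)           # a tagged word
    return found


tree = parse(CLAWS)
```

Each constituent in the resulting tree corresponds to one candidate element in the encoded text, and the labels checked on closing brackets guarantee that the analysis forms a single hierarchy.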

Using c for s type=constituent, this analysis of the structure of our example sentence might be represented as follows: The victim's friends told police that Krueger drove into the quarry and never surfaced

In the above analysis, the value borne by each struct attribute corresponds with one of a set of feature structures describing the constitutive function of each sequence of tokens. For example, N is a noun phrase, G a genitive, FN a noun clause, and V a verb phrase. The feature structures indicated will contain further information about the classification used, in the same way as for the word-level tagging described above. Alternatively, if no further specification for the labels attached to constituents is required, then a type attribute might be defined for the c element. In such a case, the significance of the values specified should be specified in the encodingDecl element of the header.

Non-nesting Structures

Each analytic segment so far discussed has been well-behaved with respect to the basic document hierarchy, having only a single parent. Moreover, the segmentation has been complete, in that each part of the text is accounted for by some segment at each level of analysis, without discontinuities or overlap. This happy state of affairs does not of course apply in all types of analysis, and these Guidelines therefore provide a number of mechanisms to support the representation of discontinuities or multiple analyses. A brief overview of these facilities is provided in chapter .

These mechanisms all depend to a greater or lesser degree on the ability to associate a unique identifier with any element in a TEI-conformant text, and then to specify that identifier as the target of a pointing element of some kind. The previous section included one example of such a linkage in its use of the next and prev attributes to represent the fact that the two conjunct verbal phrases in the example may be aggregated together. A similar technique might be used to represent anaphoric links, for example the fact that the victim and Krueger in this example both refer to the same person: The victim 's friends told police that Krueger

These attributes, and a number of other related topics, are further discussed in section and .

Further Examples

The same mechanisms may also be used to encode analyses of an entirely different kind. Here, for example, the feature structures being associated with the various segments of the text are concerned with their discoursal function: Can I have ten oranges and a kilo of bananas please? Yes, anything else? No thanks. That'll be dollar forty. Two dollars Sixty, eighty, two dollars. Thank you. ]]>

The associated feature structures will document any internal substructure the analyst wishes to associate with each category assigned to the various parts of the discourse above, or simply document them further; in this case specifying that SR is a sale request, SC a sale compliance, PC a purchase closure and so on. For further discussion of the u (utterance) element and other elements recommended for transcriptions of spoken language, see chapter .

Recommendations for the Encoding of Large Corpora

These Guidelines include proposals for the identification and encoding of a far greater variety of textual features and characteristics than is likely to be either feasible or desirable in any one language corpus, however large and ambitious. (The reasoning behind this catholic approach is further discussed in chapter .) For most large-scale corpus projects, it will therefore be necessary to determine a subset of TEI recommended elements appropriate to the anticipated needs of the project. Mechanisms for tailoring the TEI DTDs to implement such a subset are described in chapters and ; they include the ability to exclude selected element types, add new element types, and change the names of existing elements. A discussion of the implications of such changes for TEI conformance is provided in chapter .

Because of the high cost of identifying and encoding many textual features, and the difficulty in ensuring consistent practice across very large corpora, encoders may find it convenient to divide the set of elements to be encoded into the following three categories: