TEI U5: Encoding for Interchange: an introduction to the TEI

Not for publication or redistribution

TEI U1: An Introduction to TEI Tagging (derived from TEI ED W21: Living with the Guidelines)
Language: English; some examples of SGML use SGML as their lang value

Revision history:
9 September 1995, ed. CMSMcQ: fix Oxford links
8 June 1995, ed. CMSMcQ: install on TEI web server (changing DTD subset slightly)
7 June 1995, ed. CMSMcQ: Bring TeX and Script spelling corrections, etc. into SGML form.
2-3 June 1995, ed. CMSMcQ: Spellcheck, final (! ha!) changes, format, and print. Many changes made only in TeX and Script versions.
29-30 May 1995, LB: Last (ha!) pass. Cut down intro section. Moved divgen again. Revised interp and index sections extensively and generally hacked.
24-25 May 1995, CMSMcQ: changes as agreed with LB at ExCommittee meeting: interp section, rev. editorial tags, add def of TEI Lite, add section on Making It Work with software, resettle divGen and index, begin continuous pass through working from LB's notes
15 May 1995, CMSMcQ: begin last push prior to publication
1 Dec 94, LB: retagged using TEI Lite
23 Jun 94, LB: change to use ODD-style tagdescs
1993-07-20, draft, CMSMcQ: made file from old ED W21
TEI Lite: An Introduction to Text Encoding for Interchange
Lou Burnard
C. M. Sperberg-McQueen
Document No: TEI U 5
June 1995

This document provides an introduction to the recommendations of the Text Encoding Initiative (TEI), by describing a manageable subset of the full TEI encoding scheme. The scheme documented here can be used to encode a wide variety of commonly encountered textual features, in such a way as to maximize the usability of electronic transcriptions and to facilitate their interchange among scholars using different computer systems. It is also fully compatible with the full TEI scheme, as defined by TEI document P3, Guidelines for Electronic Text Encoding and Interchange, published in Chicago and Oxford in May 1994.

Copies of the current version of this text may be found via the World Wide Web at http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei and ftp://info.ox.ac.uk/pub/ota/TEI/doc/teiu5.tei, and at other sites mirroring these. The document is also available in HTML form at http://www-tei.uic.edu/orgs/tei/intros/teiu5.html and http://info.ox.ac.uk/~archive/teilite/teiu5.html. Copies of the formal SGML document type definition for the tag set described here may be found at the same locations, under the file name teilite.dtd: ftp://www-tei.uic.edu/orgs/tei/p3/dtd/teilite.dtd and ftp://info.ox.ac.uk/pub/ota/TEI/dtd/teilite.dtd

Introduction

The Text Encoding Initiative (TEI) Guidelines are addressed to anyone who wants to interchange information stored in an electronic form. They emphasize the interchange of textual information, but other forms of information such as images and sound are also addressed. The Guidelines are equally applicable in the creation of new resources and in the interchange of existing ones.

The Guidelines provide a means of making explicit certain features of a text in such a way as to aid the processing of that text by computer programs running on different machines. This process of making explicit we call markup or encoding. Any textual representation on a computer uses some form of markup; the TEI came into being partly because of the enormous variety of mutually incomprehensible encoding schemes currently besetting scholarship, and partly because of the expanding range of scholarly uses now being identified for texts in electronic form.

The TEI Guidelines use the Standard Generalized Markup Language (SGML) to define their encoding scheme. SGML is an international standard (ISO 8879), used increasingly throughout the information processing industries, which makes possible a formal definition of an encoding scheme, in terms of elements and attributes, and rules governing their appearance within a text. The TEI's use of SGML is ambitious in its complexity and generality, but it is fundamentally no different from that of any other SGML markup scheme, and so any general-purpose SGML-aware software is able to process TEI-conformant texts.

The TEI is sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. Funding has been provided in part by the U.S. National Endowment for the Humanities, Directorate General XIII of the Commission of the European Communities, the Andrew W. Mellon Foundation, and the Social Sciences and Humanities Research Council of Canada. Its Guidelines were published in May 1994, after six years of development involving many hundreds of scholars from different academic disciplines worldwide.

At the outset of its work, the overall goals of the TEI were defined by the closing statement of a planning conference held at Vassar College, N.Y., in November, 1987; these Poughkeepsie Principles were further elaborated in a series of design documents. The Guidelines, say these design documents, should: suffice to represent the textual features needed for research; be simple, clear, and concrete; be easy for researchers to use without special-purpose software; allow the rigorous definition and efficient processing of texts; provide for user-defined extensions; conform to existing and emergent standards.

The world of scholarship is large and diverse. For the Guidelines to have wide acceptability, it was important to ensure that: the common core of textual features be easily shared; additional specialist features be easy to add to (or remove from) a text; multiple parallel encodings of the same feature should be possible; the richness of markup should be user-defined, with a very small minimal requirement; adequate documentation of the text and its encoding should be provided.

The present document describes a manageable selection from the extensive set of SGML elements and recommendations resulting from those design goals, which is called TEI Lite.

In selecting from the several hundred SGML elements defined by the full TEI scheme, we have tried to identify a useful starter set, comprising the elements which almost every user should know about. Experience working with TEI Lite will be invaluable in understanding the full TEI DTD and in knowing which optional parts of the full DTD are necessary for work with particular types of text.

Our goals in defining this subset may be summarized as follows: it should include most of the TEI core tag set, since this contains elements relevant to virtually all text types and all kinds of text-processing work; it should be able to handle adequately a reasonably wide variety of texts, at the level of detail found in existing practice (as demonstrated in, for example, the holdings of the Oxford Text Archive); it should be useful for the production of new documents as well as encoding of existing ones; it should be usable with a wide range of existing SGML software; it should be derivable from the full TEI DTD using the extension mechanisms described in the TEI Guidelines; it should be as small and simple as is consistent with the other goals.

Readers may judge for themselves our success in meeting these goals. At the time of writing, our confidence that we have at least partially met them is borne out by the use of TEI Lite in practice for the encoding of real texts. The Oxford Text Archive uses TEI Lite when it translates texts from its holdings from their original markup schemes into SGML; the Electronic Text Centers at the University of Virginia and the University of Michigan have used TEI Lite to encode their holdings. The Text Encoding Initiative itself also uses TEI Lite in its current technical documentation, including this document.

Although we have tried to make this document self-contained, as suits a tutorial text, the reader should be aware that it does not cover every detail of the TEI encoding scheme. All of the elements described here are fully documented in the TEI Guidelines themselves, which should be consulted for authoritative reference information on these, and on the many others which are not described here. Some basic knowledge of SGML is assumed.

A Short Example

We begin with a short example, intended to show what happens when a passage of prose is typed into a computer by someone with little sense of the purpose of markup, or the potential of electronic texts. In an ideal world, such output might be generated by a very accurate optical scanner. It attempts to be faithful to the appearance of the printed text, by retaining the original line breaks, by introducing blanks to represent the layout of the original headings and page breaks, and so forth. Where characters not available on the keyboard are needed (such as the accented letter a in faàl or the long dash), it attempts to mimic their appearance.

CHAPTER 38 READER, I married him. A quiet wedding we had: he and I, the par- son and clerk, were alone present. When we got back from church, I went into the kitchen of the manor-house, where Mary was cooking the dinner, and John cleaning the knives, and I said -- 'Mary, I have been married to Mr Rochester this morning.' The housekeeper and her husband were of that decent, phlegmatic order of people, to whom one may at any time safely communicate a remarkable piece of news without incurring the danger of having one's ears pierced by some shrill ejaculation and subsequently stunned by a torrent of wordy wonderment. Mary did look up, and she did stare at me; the ladle with which she was basting a pair of chickens roasting at the fire, did for some three minutes hang suspended in air, and for the same space of time John's knives also had rest from the polishing process; but Mary, bending again over the roast, said only -- 'Have you, miss? Well, for sure!' A short time after she pursued, 'I seed you go out with the master, but I didn't know you were gone to church to be wed'; and she basted away. John, when I turned to him, was grinning from ear to ear. 'I telled Mary how it would be,' he said: 'I knew what Mr Ed- ward' (John was an old servant, and had known his master when he was the cadet of the house, therefore he often gave him his Christian name) -- 'I knew what Mr Edward would do; and I was certain he would not wait long either: and he's done right, for aught I know. I wish you joy, miss!' and he politely pulled his forelock. 'Thank you, John. Mr Rochester told me to give you and Mary this.' I put into his hand a five-pound note. Without waiting to hear more, I left the kitchen. In passing the door of that sanctum some time after, I caught the words -- 'She'll happen do better for him nor ony o' t' grand ladies.' And again, 'If she ben't one o' th' handsomest, she's noan faa\l, and varry good-natured; and i' his een she's fair beautiful, onybody may see that.' 
I wrote to Moor House and to Cambridge immediately, to say what I had done: fully explaining also why I had thus acted. Diana and 474 JANE EYRE 475 Mary approved the step unreservedly. Diana announced that she would just give me time to get over the honeymoon, and then she would come and see me. 'She had better not wait till then, Jane,' said Mr Rochester, when I read her letter to him; 'if she does, she will be too late, for our honey- moon will shine our life long: its beams will only fade over your grave or mine.' How St John received the news I don't know: he never answered the letter in which I communicated it: yet six months after he wrote to me, without, however, mentioning Mr Rochester's name or allud- ing to my marriage. His letter was then calm, and though very serious, kind. He has maintained a regular, though not very frequent correspond- ence ever since: he hopes I am happy, and trusts I am not of those who live without God in the world, and only mind earthly things.

This transcription suffers from a number of shortcomings: the page numbers and running titles are intermingled with the text in a way which makes it difficult for software to disentangle them; no distinction is made between single quotation marks and apostrophe, so it is difficult to know exactly which passages are in direct speech; the preservation of the copy text's hyphenation means that simple-minded search programs will not find the broken words; the accented letter in faàl and the long dash have been rendered by ad hoc keying conventions which follow no standard pattern and will be processed correctly only if the transcriber remembers to mention them in the documentation; paragraph divisions are marked only by the use of white space, and hard carriage returns have been introduced at the end of each line. Consequently, if the size of type used to print the text changes, reformatting will be problematic.

We now present the same passage, as it might be encoded using the TEI Guidelines. As we shall see, there are many ways in which this encoding could be extended, but as a minimum, the TEI approach allows us to represent the following distinctions: Paragraph divisions are now marked explicitly. Apostrophes are distinguished from quotation marks. Entity references are used for the accented letter and the long dash. Page divisions have been marked with an empty pb element alone. To simplify searching and processing, the lineation of the original has not been retained and words broken by typographic accident at the end of a line have been re-assembled without comment. If the original lineation were of interest, as it might be for an important printing, it could easily be recorded, though it has not been here. For convenience of proof reading, a new line has been introduced at the start of each paragraph, but the indentation is removed.

Reader, I married him. A quiet wedding we had: he and I, the parson and clerk, were alone present. When we got back from church, I went into the kitchen of the manor-house, where Mary was cooking the dinner, and John cleaning the knives, and I said ‐

Mary, I have been married to Mr Rochester this morning. The housekeeper and her husband were of that decent, phlegmatic order of people, to whom one may at any time safely communicate a remarkable piece of news without incurring the danger of having one's ears pierced by some shrill ejaculation and subsequently stunned by a torrent of wordy wonderment. Mary did look up, and she did stare at me; the ladle with which she was basting a pair of chickens roasting at the fire, did for some three minutes hang suspended in air, and for the same space of time John's knives also had rest from the polishing process; but Mary, bending again over the roast, said only ‐

Have you, miss? Well, for sure!

A short time after she pursued, I seed you go out with the master, but I didn't know you were gone to church to be wed; and she basted away. John, when I turned to him, was grinning from ear to ear. I telled Mary how it would be, he said: I knew what Mr Edward (John was an old servant, and had known his master when he was the cadet of the house, therefore he often gave him his Christian name) ‐ I knew what Mr Edward would do; and I was certain he would not wait long either: and he's done right, for aught I know. I wish you joy, miss! and he politely pulled his forelock.

Thank you, John. Mr Rochester told me to give you and Mary this.

I put into his hand a five-pound note. Without waiting to hear more, I left the kitchen. In passing the door of that sanctum some time after, I caught the words ‐

She'll happen do better for him nor ony o' t' grand ladies. And again, If she ben't one o' th' handsomest, she's noan faàl, and varry good-natured; and i' his een she's fair beautiful, onybody may see that.

I wrote to Moor House and to Cambridge immediately, to say what I had done: fully explaining also why I had thus acted. Diana and Mary approved the step unreservedly. Diana announced that she would just give me time to get over the honeymoon, and then she would come and see me.

She had better not wait till then, Jane, said Mr Rochester, when I read her letter to him; if she does, she will be too late, for our honeymoon will shine our life long: its beams will only fade over your grave or mine.

How St John received the news I don't know: he never answered the letter in which I communicated it: yet six months after he wrote to me, without, however, mentioning Mr Rochester's name or alluding to my marriage. His letter was then calm, and though very serious, kind. He has maintained a regular, though not very frequent correspondence ever since: he hopes I am happy, and trusts I am not of those who live without God in the world, and only mind earthly things.
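The element tags and entity references are not visible in the passage as displayed here, so the following minimal sketch shows how parts of it might look with the markup written out. It uses the p and pb elements named in this document. The q element for direct speech, the div1 wrapper for the chapter, the entity names mdash and agrave, and all attribute values are assumptions made for illustration, not a record of the original encoding.

```sgml
<pb n="474">
<div1 type="chapter" n="38">
<p>Reader, I married him. A quiet wedding we had: he and I, the parson
and clerk, were alone present. When we got back from church, I went
into the kitchen of the manor-house, where Mary was cooking the dinner,
and John cleaning the knives, and I said &mdash;</p>
<p><q>Mary, I have been married to Mr Rochester this morning.</q> The
housekeeper and her husband were of that decent, phlegmatic order of
people ...</p>
<!-- intervening paragraphs omitted from this sketch -->
<p><q>She'll happen do better for him nor ony o' t' grand ladies.</q>
And again, <q>If she ben't one o' th' handsomest, she's noan
fa&agrave;l, and varry good-natured; and i' his een she's fair
beautiful, onybody may see that.</q></p>
<p>I wrote to Moor House and to Cambridge immediately, to say what I
had done: fully explaining also why I had thus acted. Diana and
<pb n="475"> Mary approved the step unreservedly. ...</p>
</div1>
```

Marking speech with an element rather than with quotation marks is what lets the apostrophes in the dialect passages remain ordinary characters, and the empty pb element records the page boundary without interrupting the sentence it falls in.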

The decision to focus on Brontë's text, rather than on the printing of it in this particular edition, is one aspect of a fundamental encoding issue: that of selectivity. An encoding makes explicit only those textual features of importance to the encoder. It is not difficult to think of ways in which the encoding of even this short passage might readily be extended. For example: a regularized form of the passages in dialect could be provided; footnotes glossing or commenting on any passage could be added; pointers linking parts of this text to others could be added; proper names of various kinds could be distinguished from the surrounding text; detailed bibliographic information about the text's provenance and context could be prefixed to it; a linguistic analysis of the passage into sentences, clauses, words, etc., could be provided, each unit being associated with appropriate category codes; the text could be segmented into narrative or discourse units; systematic analysis or interpretation of the text could be included in the encoding, with potentially complex alignment or linkage between the text and the analysis, or between the text and one or more translations of it; passages in the text could be linked to images or sound held on other media.

The TEI-recommended way of carrying all of these out is described in the remainder of this document. The TEI scheme as a whole also provides for an enormous range of other possibilities, of which we cite only a few: detailed analysis of the components of names; detailed meta-information providing thesaurus-style information about the text's origins or topics; information about the printing history or manuscript variations exhibited by a particular series of versions of the text. For recommendations on these and many other possibilities, the full Guidelines should be consulted.

The Structure of a TEI Text

All TEI-conformant texts contain (a) a TEI header (marked up as a teiHeader element) and (b) the transcription of the text proper (marked up as a text element).

The TEI header provides information analogous to that provided by the title page of a printed text. It has up to four parts: a bibliographic description of the machine-readable text, a description of the way it has been encoded, a non-bibliographic description of the text (a text profile), and a revision history. The header is described in more detail elsewhere in this document.

A TEI text may be unitary (a single work) or composite (a collection of single works, such as an anthology). In either case, the text may have an optional front or back. In between is the body of the text, which, in the case of a composite text, may consist of groups, each containing more groups or texts.

A unitary text will be encoded using an overall structure like this: [ TEI Header information ] [ front matter ... ] [ body of text ... ] [ back matter ... ]
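With the element tags written out, that overall structure might look like the following sketch. The teiHeader, text, front, body and back elements are those named in this section; the outermost TEI.2 element is an assumption carried over from the full TEI P3 scheme and is not named in the text above.

```sgml
<TEI.2>
  <teiHeader> [ TEI Header information ] </teiHeader>
  <text>
    <front> [ front matter ... ] </front>
    <body>  [ body of text ... ] </body>
    <back>  [ back matter ... ]  </back>
  </text>
</TEI.2>
```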

A composite text also has an optional front and back. In between occur one or more groups of texts, each with its own optional front and back matter. A composite text will thus be encoded using an overall structure like this: [ header information for the composite ] [ front matter for the composite ] [ front matter of first text ] [ body of first text ] [ back matter of first text ] [ front matter of second text ] [ body of second text ] [ back matter of second text ] [ more texts or groups of texts here ] [ back matter for the composite ]
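A sketch of the composite structure with its tags made explicit may help show where each placeholder sits. It assumes the group element takes the place of body in the outer text, as this section describes, and again assumes TEI.2 as the outermost element.

```sgml
<TEI.2>
  <teiHeader> [ header information for the composite ] </teiHeader>
  <text>
    <front> [ front matter for the composite ] </front>
    <group>
      <text>
        <front> [ front matter of first text ] </front>
        <body>  [ body of first text ] </body>
        <back>  [ back matter of first text ] </back>
      </text>
      <text>
        <front> [ front matter of second text ] </front>
        <body>  [ body of second text ] </body>
        <back>  [ back matter of second text ] </back>
      </text>
      [ more texts or groups of texts here ]
    </group>
    <back> [ back matter for the composite ] </back>
  </text>
</TEI.2>
```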

It is also possible to define a composite of TEI texts, each with its own header. Such a collection is known as a TEI corpus, and may itself have a header: [header information for the corpus] [header information for first text] [first text in corpus] [header information for second text] [second text in corpus] It is not, however, possible to create a composite of corpora, that is, a number of teiCorpus elements combined together and treated as a single object. This is a restriction of the current version of the TEI Guidelines.
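The corpus structure might be sketched as follows. The element name teiCorpus is taken from the text above; the use of TEI.2 to wrap each member text is an assumption, and the exact generic identifiers should be checked against the DTD in use.

```sgml
<teiCorpus>
  <teiHeader> [ header information for the corpus ] </teiHeader>
  <TEI.2>
    <teiHeader> [ header information for first text ] </teiHeader>
    <text> [ first text in corpus ] </text>
  </TEI.2>
  <TEI.2>
    <teiHeader> [ header information for second text ] </teiHeader>
    <text> [ second text in corpus ] </text>
  </TEI.2>
</teiCorpus>
```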

In the remainder of this document, we discuss chiefly simple text structures. The discussion in each case consists of a short list of relevant TEI elements with a brief definition of each, followed by definitions for any attributes specific to that element. In most cases, short examples are also given.

Encoding the Body

As indicated above, a simple TEI document at the textual level consists of the following elements:
front: contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found before the start of a text proper.
group: contains a number of unitary texts or groups of texts.
body: contains the whole body of a single unitary text, excluding any front or back matter.
back: contains any appendixes, etc., following the main part of a text.
Elements specific to front and back matter are described below. In this section we discuss the elements making up the body of a text.

Text Division Elements

The body of a prose text may be just a series of paragraphs, or these paragraphs may be grouped together into chapters, sections, subsections, etc. In the former case, each paragraph is tagged using the p tag. In the latter case, the body may be divided either into a series of div1 elements, or into a series of div elements, either of which may be further subdivided, as discussed below:
p: marks paragraphs in prose.
div: contains a subdivision of the front, body, or back of a text.
div1: contains a first-level subdivision of the front, body, or back of a text (the largest, if div0 is not used, the second largest if it is).

When structural subdivisions smaller than a div1 are necessary, a div1 may be divided into div2 elements, a div2 into smaller div3 elements, etc., down to the level of div7. If more than seven levels of structural division are present, one must either modify the TEI tag set to accept div8, etc., or else use the unnumbered div element: a div may be subdivided by smaller div elements, without limit to the depth of nesting.
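As a minimal sketch of the unnumbered style, the following shows div elements nested three deep. The type and n values are hypothetical and chosen only to make the nesting easy to read.

```sgml
<body>
  <div type="book" n="1">
    <div type="chapter" n="1">
      <div type="section" n="1">
        <p> [ paragraphs of the section ] </p>
      </div>
    </div>
  </div>
</body>
```

With numbered divisions, the same structure would use div1, div2 and div3 in place of the three nested div elements.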

All these division elements take the following three attributes:
type: indicates the conventional name for this category of text division. Its value will typically be Book, Chapter, Poem, etc. Other possible values include Group for groups of poems, etc., treated as a single unit, Sonnet, Speech, and Song. Note that whatever value is supplied for the type attribute of the first div, div1, div2, etc., in a text is assumed to apply for all subsequent div, div1s (etc.) within the same body. This implies that a value must be given for the first division element of each type, or whenever the value changes.
id: specifies a unique identifier for the division, which may be used for cross references or other links to it, such as a commentary, as further discussed elsewhere in this document. It is often useful to provide an id attribute for every major structural unit in a text, and to derive the ID values in some systematic way, for example by appending a section number to a short code for the title of the work in question, as in the examples below.
n: specifies a mnemonic short name or number for the division, which can be used to identify it in preference to the ID. If a conventional form of reference or abbreviation for the parts of a work already exists (such as the book/chapter/verse pattern of Biblical citations), the n attribute is the place to record it.
The attributes id and n, indeed, are so widely useful that they are allowed on any element in any TEI DTD: they are global attributes. Other global attributes defined in the TEI Lite scheme are discussed elsewhere in this document.

The value of every id attribute must be unique within a document. One simple way of ensuring that this is so is to make it reflect the hierarchic structure of the document. For example, Smith's Wealth of Nations as first published consists of five books, each of which is divided into chapters, while some chapters are further subdivided into parts. We might define id values for this structure as follows: ... ... ... ... ... ... ... ...
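One way such a scheme might look is sketched below. The ID values (a short code WN followed by book, chapter and part numbers) and the type and n values are hypothetical, chosen to illustrate how an identifier can mirror the hierarchy.

```sgml
<body>
<div1 id="WN1" n="I" type="book">
  <div2 id="WN101" n="I.1" type="chapter"> ... </div2>
  <div2 id="WN102" n="I.2" type="chapter"> ... </div2>
  <!-- further chapters of the first book -->
  <div2 id="WN110" n="I.10" type="chapter">
    <div3 id="WN1101" n="I.10.1" type="part"> ... </div3>
    <div3 id="WN1102" n="I.10.2" type="part"> ... </div3>
  </div2>
</div1>
<div1 id="WN2" n="II" type="book">
  <div2 id="WN201" n="II.1" type="chapter"> ... </div2>
  <!-- further chapters of the second book -->
</div1>
</body>
```

Because each identifier embeds the identifiers of its ancestors, uniqueness within the document follows from the uniqueness of the numbering at each level.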

A different numbering scheme may be used for id and n attributes: this is often useful where a canonical reference scheme is used which does not tally with the structure of the work. For example, in a novel divided into books each containing chapters, where the chapters are numbered sequentially through the whole work, rather than within each book, one might use a scheme such as the following: ... ... ... ... Here the work has two volumes, each containing two chapters. The chapters are numbered conventionally 1 to 4, but the id values specified allow them to be regarded additionally as if they were numbered 1.1, 1.2, 2.1, 2.2.
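The arrangement just described might be sketched like this. The prefix NV and the resulting ID values are hypothetical; the n values carry the conventional sequential chapter numbers, while the id values encode volume and chapter position.

```sgml
<div1 type="volume" id="NV1" n="1">
  <div2 type="chapter" id="NV11" n="1"> ... </div2>
  <div2 type="chapter" id="NV12" n="2"> ... </div2>
</div1>
<div1 type="volume" id="NV2" n="2">
  <div2 type="chapter" id="NV21" n="3"> ... </div2>
  <div2 type="chapter" id="NV22" n="4"> ... </div2>
</div1>
```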

Headings and Closings

Every div, div1, div2, etc., may have a title or heading at its start, and (less commonly) a closing such as End of Chapter 1. The following elements may be used to transcribe them:
head: contains any heading, for example, the title of a section, or the heading of a list or glossary.
trailer: contains a closing title or footer appearing at the end of a division of a text.
Some other elements which may be necessary at the beginning or ending of text divisions are discussed below.

Whether or not headings and trailers are included in a transcription is a matter for the individual transcriber to decide. Where a heading is completely regular (for example Chapter 1) or has been given as an attribute value (e.g. div1 type='Chapter' n=1), it may be omitted; where it contains otherwise unrecoverable text it should always be included. For example, the start of Hardy's Under the Greenwood Tree might be encoded as follows: Mellstock-Lane

To dwellers in a wood almost every species of tree ...
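With the tags written out, this opening might be encoded along the following lines. The id, n and type values are assumptions made for illustration; the point of the sketch is that the regular headings are carried by attributes while the otherwise unrecoverable chapter title is kept in a head element.

```sgml
<div1 id="UGT1" n="Winter" type="Part">
<div2 id="UGT11" n="1" type="Chapter">
<head>Mellstock-Lane</head>
<p>To dwellers in a wood almost every species of tree ...</p>
</div2>
</div1>
```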

Prose, Verse and Drama

As noted above, the paragraphs making up a textual division should be tagged with the p tag. For example:

I fully appreciate Gen. Pope's splendid achievements with their invaluable results; but you must know that Major Generalships in the Regular Army, are not as plenty as blackberries.


A number of different tags are provided for the encoding of the structural components of verse and performance texts (drama, film, etc.):
l: contains a single, possibly incomplete, line of verse. Attributes include:
  part: specifies whether or not the line is metrically complete. Legal values are: F for the final part of an incomplete line, Y if the line is metrically incomplete, N if the line is complete, or if no claim is made as to its completeness, I for the initial part of an incomplete line, M for a medial part of an incomplete line.
lg: contains a group of verse lines functioning as a formal unit, e.g. a stanza, refrain, verse paragraph, etc.
sp: contains an individual speech in a performance text, or a passage presented as such in a prose or verse text. Attributes include:
  who: identifies the speaker of the part by supplying an ID.
speaker: contains a special form of heading or label, giving the name of one or more speakers in a performance text or fragment.
stage: contains any kind of stage direction within a performance text or fragment. Attributes include:
  type: indicates the kind of stage direction. Suggested values include entrance, exit, setting, delivery, etc.

Here, for example, is the start of a poetic text in which verse lines and stanzas are tagged:
I Sing the progresse of a deathlesse soule,
Whom Fate, with God made, but doth not controule,
Plac'd in most shapes; all times before the law
Yoak'd us, and when, and since, in this I sing.
And the great world to his aged evening;
From infant morne, through manly noone I draw.
What the gold Chaldee, of silver Persian saw,
Greeke brass, or Roman iron, is in this one;
A worke t'out weare Seths pillars, bricke and stone,
And (holy writs excepted) made to yeeld to none,
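A sketch of the tagging for this stanza, abbreviated to its first and last lines, might read as follows. It uses the lg and l elements listed above; the n value on lg is a hypothetical stanza number.

```sgml
<lg n="1">
<l>I Sing the progresse of a deathlesse soule,</l>
<!-- the intervening lines are each tagged with l in the same way -->
<l>A worke t'out weare Seths pillars, bricke and stone,</l>
<l>And (holy writs excepted) made to yeeld to none,</l>
</lg>
```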

Note that the l element marks verse, not typographic lines: the original lineation of the first few lines above has not therefore been made explicit by this encoding, and may be lost. The lb element, described elsewhere in this document, may be used to mark typographic lines if so desired.

Sometimes, particularly in dramatic texts, verse lines are split between speakers. The easiest way of encoding this is to use the part attribute to indicate that the lines so fragmented are incomplete, as in this example:
ACT I
SCENE I
Enter Barnardo and Francisco, two Sentinels, at several doors
Barn: Who's there?
Fran: Nay, answer me. Stand and unfold yourself.
Barn: Long live the King!
Fran: Barnardo?
Barn: He.
Fran: You come most carefully upon your hour.
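The following sketch suggests how the part attribute might be applied to this exchange. The division types and the type value on stage are assumptions; the part values follow the legal values listed earlier (Y for a line left metrically incomplete, then I, M and F for the initial, medial and final fragments of a single line shared between the speakers).

```sgml
<div1 type="Act" n="I"><head>ACT I</head>
<div2 type="Scene" n="1"><head>SCENE I</head>
<stage type="entrance">Enter Barnardo and Francisco, two Sentinels,
at several doors</stage>
<sp><speaker>Barn</speaker><l part="Y">Who's there?</l></sp>
<sp><speaker>Fran</speaker><l>Nay, answer me. Stand and unfold
yourself.</l></sp>
<sp><speaker>Barn</speaker><l part="I">Long live the King!</l></sp>
<sp><speaker>Fran</speaker><l part="M">Barnardo?</l></sp>
<sp><speaker>Barn</speaker><l part="F">He.</l></sp>
<sp><speaker>Fran</speaker><l>You come most carefully upon your
hour.</l></sp>
</div2>
</div1>
```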

The same mechanism may be applied to stanzas which are divided between two speakers:
First voice
But why drives on that ship so fast
Withouten wave or wind?
Second Voice
The air is cut away before.
And closes from behind.
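A sketch of this case, assuming that the part attribute may be given on lg in the same way as on l, might look like this. The type value is hypothetical.

```sgml
<sp><speaker>First voice</speaker>
<lg type="stanza" part="I">
<l>But why drives on that ship so fast</l>
<l>Withouten wave or wind?</l>
</lg></sp>
<sp><speaker>Second Voice</speaker>
<lg part="F">
<l>The air is cut away before.</l>
<l>And closes from behind.</l>
</lg></sp>
```

Here the two lg elements together make up one stanza: the first is marked as its initial part and the second as its final part.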

This example shows how dialogue presented in a prose work as if it were drama should be encoded. It also demonstrates the use of the who attribute to bear a code identifying the speaker of the piece of dialogue concerned: The reverend Doctor Opimian

I do not think I have named a single unpresentable fish. Mr Gryll

Bream, Doctor: there is not much to be s
