Message tei:96
From: BITNET list server at UICVM
To: Joseba Abaitua
Subject: File: "P3SA DOC"

Part 3

ADDITIONAL TAG SETS

Chapter 14

LINKING, SEGMENTATION, AND ALIGNMENT

This chapter discusses a number of ways in which encoders may represent analyses of the structure of a text which are not necessarily linear or hierarchic. In this chapter, tag sets and global attributes are provided for the following common requirements:

* to link disparate elements in a single document using the id attribute (section 14.1, "Pointers");

* to link disparate elements in a single document without using the id attribute or to link elements in different documents (section 14.2, "Extended Pointers");

* to segment text into elements convenient for the encoder and to mark arbitrary points within documents (section 14.3, "Segments and Anchors," on page 27);

* to represent correspondence or alignment among groups of text elements, both those with content and those which are empty (section 14.4, "Correspondence and Alignment," on page 28);(73)

* to synchronize elements of a text, that is to represent temporal correspondences and alignments among text elements (section 14.5, "Synchronization," on page 28) and also to align them with specific points in time (section xref target=SAsymp);

* to specify that one text element is identical to or a copy of another (section 14.6, "Identical Elements and Virtual Copies," on page 28);

* to aggregate possibly noncontiguous elements (section 14.7, "Aggregation," on page 29);

* to specify that different elements are alternatives to one another and to express preferences among the alternatives (section 14.8, "Alternation," on page 29);

* to associate segments of a text with interpretations or analyses of their significance (section 14.9, "Connecting Analytic and Textual Markup," on page 29).

These facilities all use the same basic set of techniques, which depend on the ability to point to an element which has some form of identifier. The most convenient such identifier, and that which is recommended by these Guidelines wherever possible, is provided by the global id attribute, as defined in section 3.5, "Global Attributes," on page 4. For elements which are located in different SGML documents, or to which identifiers cannot be attached (perhaps because they are held on read-only media), an extension to this mechanism is provided: the TEI extended pointer mechanism, described in section 14.2, "Extended Pointers."

For many of the topics discussed in this chapter, a choice of methods of encoding is offered, ranging from simple but less general ones, which use attribute values only, to more elaborate and more general ones, which use specialized elements.

The additional tag set discussed in the remainder of this chapter is organized as follows. The file teilink2.ent begins by declaring a set of additional attributes available globally when this tag set is enabled. This is followed by declarations for the attribute classes pointer and pointerGroup, to which most of the elements discussed in this chapter belong; these attributes are all further described in the remainder of the chapter. The element declarations for this tag set are contained in the file teilink2.dtd.

This tag set is made available by the mechanisms described in section 3.3, "Invocation of the TEI DTD," on page 4. This implies that the document type subset for a document using any of the tags or attributes described in this chapter must define a parameter entity TEI.linking with the value INCLUDE.
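
The requirement just stated can be illustrated with a short sketch of a document type declaration. Only the parameter entity TEI.linking and the value INCLUDE are taken from the text above; the public identifier, the system identifier tei2.dtd, and the entity name TEI.prose for the prose base are assumptions made for illustration.

```sgml
<!-- Sketch only: the public identifier, tei2.dtd and TEI.prose are assumed -->
<!DOCTYPE TEI.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN" "tei2.dtd" [
  <!ENTITY % TEI.prose   'INCLUDE'>  <!-- assumed name selecting the prose base -->
  <!ENTITY % TEI.linking 'INCLUDE'>  <!-- enables the tag set of this chapter -->
]>
```

Both entities are declared in the document type subset, which is where the text requires TEI.linking to be defined.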

For example, a document using this additional tag set and the prose base would begin with a document type declaration whose subset enables both, setting TEI.linking to INCLUDE as just described.

14.1 Pointers

We say that one element points to others if the first has an attribute whose value is a reference to the others: such an element is called a pointer element, or simply a pointer. Several pointer elements have already been introduced in these Guidelines. These elements all indicate an association between one place in the document (the location of the pointer itself) and one or more others (the elements whose identifiers are specified by the pointer's target attribute). This element set defines a variation on this basic kind of pointer, known as a link, which specifies both "ends" of an association. In addition, we define a syntax for representing locations in a document by a variety of means not dependent on the use of SGML identifiers.

14.1.1 Pointers and Links

In section 6.6, "Simple Links and Cross References," on page 12 we introduced the simplest pointer elements, <ptr> and <ref>. Here we introduce additionally the <link> element, which represents an association between two (or more) locations by specifying each location explicitly. Its own location is irrelevant to the intended linkage.

<ptr>: defines a pointer to another location in the current document in terms of one or more identifiable elements. Attributes include:

      target : specifies the destination of the pointer as one or more SGML identifiers

<ref>: defines a reference to another location in the current document, in terms of one or more identifiable elements, possibly modified by additional text or comment. Attributes include:

      target : specifies the destination of the reference as one or more SGML identifiers

<link>: defines an association or hypertextual link among elements or passages, of some type not more precisely specifiable by other elements.
Attributes include:

      targets : specifies the SGML identifiers of the elements or passages to be linked or associated.

The <ptr> element may be called a "pure pointer", because its primary function is simply to point. A pointer sets up a connection between an element (which, in the case of a pure pointer, can be thought of simply as a location in a document), and one or more others, known collectively as its target. The <ptr> and <ref> elements bear a target attribute (in the singular), because they point, conceptually, at a single target, even if that target may be discontinuous in the document. The <link> element bears a targets attribute, with a plural name, because it specifies at least two targets, each of which is a unitary object. It may be thought of as representing a double link between the objects specified.

As members of the class pointer, these elements share a common set of attributes:

type : categorizes the pointer in some respect, using any convenient set of categories.

resp : specifies the creator of the pointer.

crdate : specifies when the pointer was created.

targType : specifies the kinds of elements to which this pointer may point.

targOrder : where more than one identifier is supplied as the value of the target attribute, this attribute specifies whether the order in which they are supplied is significant. Legal values are:

      Y : Yes: the order in which IDREFs are specified as the value of a target attribute should be followed when combining the targeted elements.

      N : No: the order in which IDREFs are specified as the value of a target attribute has no significance when combining the targeted elements.

      U : Unspecified: the order in which IDREFs are specified as the value of a target attribute may or may not be significant.

evaluate : specifies the intended meaning when the target of a pointer is itself a pointer.
Sample values include:

      all : if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found which is not a pointer.

      one : if the element pointed to is itself a pointer, then its target (whether a pointer or not) is taken as the target of this pointer.

      none : no further evaluation of targets is carried out beyond that needed to find the element specified in the pointer's target.

The targType and targOrder attributes may be used to constrain the scope of a link to certain element types. For example, a link of type echo whose targets attribute names the identifiers P1 and P2, and which specifies neither targType nor targOrder, is completely unconstrained. It assumes only that there is an element with identifier P1 and another with identifier P2 somewhere in the current document. If the same link also lists element types in its targType attribute, without making their order significant, it is slightly more constrained: P1 and P2 must now each identify an element of one of the listed types, but there is no requirement as to which is which. (This may be useful if, as is often the case, different elements may participate in the same kind of link.) In a further variation, where targType lists two element types and targOrder makes their order significant, not only must the link targets be of those two types, but the element with identifier P1 must be of the first type, and that with identifier P2 must be of the second. Note that the present Guidelines provide no direct way of saying that P1 may identify an element of either of two types while P2 must identify an element of a third.

These attributes are most useful if applied to a group of links, when additional constraints may also be specified, as further discussed in section 14.1.3, "Groups of Links," below.

Double connection among elements could also be expressed by a combination of pointer elements, for example by two pointers of the same or of different types. All that is required is that the value of the target (or other pointing) attribute of the one be the value of the id attribute of the other. What the <link> element accomplishes is the handling of double connection by means of a single element. Thus, where one pointer has the identifier P1 and the target P2, and another has the identifier P2 and the target P1, P1 points to P2, and P2 points to P1. This is logically equivalent to the more compact encoding in which a single <link> element names both P1 and P2 in its targets attribute.

As noted above, all elements pointed to or linked by these elements must be identifiable using the global id attribute. This implies that they must be present in the same document, and that they must bear unique id values. Pointing or linking to external documents, and pointing or linking where SGML identifiers are not available, is implemented by the external pointing mechanisms discussed in section 14.2, "Extended Pointers," where the <xptr> and <xref> elements are discussed. External links and links involving elements without identifiers do not require a special element; they may be represented using the standard <link> element, but an intermediate element must be provided within the current document, to bear the id attribute used in the target of the link.

14.1.2 Using Pointers and Links

As an example of the use of these mechanisms which establish connections among elements, consider the practice (common in 18th century English verse and elsewhere) of providing footnotes citing parallel passages from classical authors.

[Figure 1: The original page of Pope's Dunciad which is discussed in the text.]

Such footnotes can of course simply be encoded using the <note> element (see section 6.8, "Notes, Annotation, and Indexing," on page 13) without a target attribute, placed adjacent to the passage to which the note refers:(74)

      (Diff'rent our parties, but with equal grace
      The Goddess smiles on Whig and Tory race,
            Virg. Æn. 10. Tros Rutulusve fuat; nullo discrimine habebo.
            —— Rex Jupiter omnibus idem.
      'Tis the same rope at sev'ral ends they twist,
      To Dulness, Ridpath is as dear as Mist)

This use of the <note> element can be called implicit pointing (or implicit linking). It relies on the juxtaposition of the note to the text being commented on for the connection to be understood. If it is felt that the mere juxtaposition of the note to the text does not make it sufficiently clear exactly what text segment is being commented on (for example, is it the immediately preceding line, or the immediately preceding two lines, or what?), or if it is decided to place the note at some distance from the text, then the pointing or the linking must be made explicit. We now consider various methods for doing that.

First, a <ptr> element might be placed at an appropriate point within the text to link it with the annotation:

      (Diff'rent our parties, but with equal grace
      The Goddess smiles on Whig and Tory race,
      'Tis the same rope at sev'ral ends they twist,
      To Dulness, Ridpath is as dear as Mist)

      Virg. Æn. 10. Tros Rutulusve fuat; nullo discrimine habebo.
      —— Rex Jupiter omnibus idem.
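
With its markup, the first method might look something like the sketch below. The use of <l> for each verse line and the position of the pointer at the end of the second line are assumptions; the identifier N3.284 and the rend value unmarked are those mentioned in the surrounding discussion.

```sgml
<l>(Diff'rent our parties, but with equal grace</l>
<!-- the empty <ptr> sits inside the line it annotates (assumed position) -->
<l>The Goddess smiles on Whig and Tory race,
   <ptr rend=unmarked target=N3.284></l>
<l>'Tis the same rope at sev'ral ends they twist,</l>
<l>To Dulness, Ridpath is as dear as Mist)</l>

<!-- elsewhere in the document: the note bears the identifier pointed at -->
<note id=N3.284>Virg. &AElig;n. 10.
  Tros Rutulusve fuat; nullo discrimine habebo.
  &mdash;&mdash; Rex Jupiter omnibus idem.</note>
```

Because the pointer is an empty element, it has no end-tag; its own position in the line is what ties the line to the note.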

The <note> element has been given an arbitrary identifier (N3.284) to enable it to be specified as the target of the pointer element. Because there is nothing in the text to signal the existence of the annotation, the rend attribute has been given the value unmarked.

Second, the target attribute of the <note> element can be used to point at its associated text, provided that an id attribute has been supplied for the associated text. Since, in this case, the note itself contains a pointer to the place in the text which it is annotating, this has also been encoded, using a <ref> element, which bears a target attribute of its own and contains a (slightly misquoted) extract from the text, itself marked up in its own right:

      Verse 283-84. ——. With equal grace
      Our Goddess smiles on Whig and Tory race.
      Virg. Æn. 10. Tros Rutulusve fuat; nullo discrimine habebo.
      —— Rex Jupiter omnibus idem.

Combining these two solutions gives us the following associations:

* a pointer within one line indicates the note

* the note indicates the line

* a pointer within the note indicates the line

Note that we do not have any way of pointing from the line itself to the note: the association is implied by containment of the pointer. We do not as yet have a true double link between text and note.

Thirdly, therefore, we supply identifiers for both verse line and annotation, and use a <link> element to associate the two. Note that the <ptr> element and the target attribute on the <note> may now be dispensed with:

      (Diff'rent our parties, but with equal grace
      The Goddess smiles on Whig and Tory race,
      'Tis the same rope at sev'ral ends they twist,
      To Dulness, Ridpath is as dear as Mist)

      Verse 283-84. ——. With equal grace
      Our Goddess smiles on Whig and Tory race.]
      Virg. Æn. 10. Tros Rutulusve fuat; nullo discrimine habebo.
      —— Rex Jupiter omnibus idem.

The targets attribute of the <link> element here bears the identifier of the note followed by that of the verse line.
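
Such a link might be written as in the following sketch. The line identifier L3.284 is hypothetical; N3.284 is the note identifier used above, and the type value imitation is the one used for these notes later in this chapter.

```sgml
<l id=L3.284>The Goddess smiles on Whig and Tory race,</l>

<!-- the note may now stand anywhere in the document -->
<note id=N3.284>Virg. &AElig;n. 10.
  Tros Rutulusve fuat; nullo discrimine habebo.</note>

<!-- note first, then verse line; the link's own position is irrelevant -->
<link type=imitation targets="N3.284 L3.284">
```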
The targType and targOrder attributes may be used to enable application programs to check that the identifiers in fact pick out a <note> element and an <l> element, in that order. If targOrder has the value N, then the elements indicated by the targets attribute have to be either <note> or <l> elements, but are otherwise unconstrained. If neither attribute is present, then the only constraint is that the identifiers given must apply to some element within the current document.

For completeness, we could also allocate an identifier to the reference within the note and encode the association between it and the verse line in the same way, with a second <link> element. Indeed, the two <link>s could be combined into a single <link> naming all three elements.

14.1.3 Groups of Links

Clearly, there are many reasons for which an encoder might wish to represent a link or association between different elements. For some of them, specific elements are provided in these Guidelines; some of these are discussed elsewhere in the present chapter. The <link> element is a general purpose element which may be used for any kind of association. The <linkGrp> element may be used to group links of a particular type together in a single part of the document; such a collection may be used to represent what is sometimes referred to in the literature of Hypertext as a web, a term introduced by the Brown University FRESS project in 1969.

<linkGrp>: defines a collection of associations or hypertextual links.

As a member of the class pointerGroup, this element shares the following attributes with other members of that class:

domains : optionally specifies the identifiers of the elements within which all elements indicated by the contents of this element lie.

targFunc : describes the function of each of the values of the targets attribute of the enclosed elements.

It is also a member of the pointer class, and therefore also carries the attributes specified in section 14.1.1, "Pointers and Links," above, in particular the type attribute:

type : categorizes the pointer in some respect, using any convenient set of categories.

The <linkGrp> element provides a convenient way of establishing a default for the type attribute on a group of links of the same type: by default, the type attribute on a <link> element has the same value as that given for type on the enclosing <linkGrp>.

Typical software might hide a web entirely from the user, but use it as a source of information about links, which are displayed independently at their referenced locations. Alternatively, software might provide a direct view of the link collection, along with added functions for manipulating the collection, as by filtering, sorting, and so on.

To continue our previous example, this text contains many other notes, of a kind similar to the one shown above. To avoid having to repeat the type=imitation on each <link>, we may specify it once for all on a <linkGrp> element containing all links of this type. The targType and targOrder attributes can also be specified for a <linkGrp> element:

      A place there is, betwixt earth, air and seas
      Where from Ambrosia, Jove retires for ease.
      ...
      Sign'd with that Ichor which from Gods distills.
      ...
      (Diff'rent our parties, but with equal grace
      The Goddess smiles on Whig and Tory race,
      'Tis the same rope at sev'ral ends they twist,
      To Dulness, Ridpath is as dear as Mist)

      Ovid Met. 12. Orbe locus media est, inter terrasq; fretumq; Cœlestesq; plagas —
      Alludes to Homer, Iliad 5
      Virg. Æn. 10. Tros Rutulusve fuat; nullo discrimine habebo. —— Rex Jupiter omnibus idem.

Additional information for applications that use <linkGrp> elements can be provided by means of special attributes. First, the domains attribute can be used to identify the text elements within which the individual targets of the links are to be found.
Suppose that the text under discussion is organized into one element containing the text of the poem and another containing the notes. Then the domains attribute can have as its value the identifiers of these two elements, to enable an application to verify that the link targets are in fact contained by appropriate elements, or to limit its search space. The verse passages and notes themselves are encoded exactly as in the preceding example.

Note that there must be a single parent element for each "domain"; if some notes are contained by a section with identifier dunnotes, and others by a section with identifier dunimits, an intermediate pointer must be provided (as described in section 14.1.4, "Intermediate Pointers") within the <linkGrp> and its identifier used instead.

Next, the targFunc attribute can be used to provide further information about the role or function of the various targets specified for each link in the group. The value of the targFunc attribute is a list of names (formally, SGML name tokens), one for each of the targets in the link; these names can be chosen freely by the encoder, but their significance should be documented in the encoding declaration in the header.(75) In the current example, we might think of the note as containing the source of the imitation and the verse line as containing the goal of the imitation.
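
Drawing together the attributes discussed in this section, a link group for these imitations might be sketched as follows. The identifiers dunpoem and L3.284 are hypothetical, as is the choice of listing the domains in the same order as the targets; dunnotes, N3.284, the type value imitation, and the role names source and goal come from the surrounding text.

```sgml
<!-- Sketch only: dunpoem and L3.284 are hypothetical identifiers -->
<linkGrp type=imitation
         targType="note l" targOrder=Y
         domains="dunnotes dunpoem"
         targFunc="source goal">
  <!-- each link names the note (source) first, then the verse line (goal) -->
  <link targets="N3.284 L3.284">
</linkGrp>
```

The enclosed <link> needs no type attribute of its own, since it takes the value given on the enclosing group.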
Accordingly, we can specify the <linkGrp> in the preceding example with a targFunc value that names these two functions, in the order in which the targets of each link are given.

The <link> and <linkGrp> elements are formally defined in the file teilink2.dtd.

14.1.4 Intermediate Pointers

In the preceding examples, we have shown various ways of linking an annotation and a single verse line. However, the example cited in fact requires us to encode an association between the note and a pair of verse lines (lines 284 and 285).

There are a number of possible ways of correcting this error: one could use the target and targetEnd attributes of the <note> element to delimit the span to which the note applies (see further section 6.8, "Notes, Annotation, and Indexing," on page 13). Alternatively one could create an element to encode the couplet itself and assign it an id attribute, which can then be linked to the <note> and <ref> elements. This could be done either explicitly by means of a <lg> element, as defined in section 6.11.1, "Core Tags for Verse," on page 14, or a <seg> element, as defined in section 14.3, "Segments and Anchors," on page 27, or implicitly, by means of the <join> element discussed in section 14.7, "Aggregation," on page 29.

A third possibility, however, is to use an "intermediate pointer": a <ptr> element whose target attribute specifies the identifiers of both lines of the couplet:

      (Diff'rent our parties, but with equal grace
      The Goddess smiles on Whig and Tory race,

When the target attribute of a <ptr> or <ref> element specifies more than one element, the indicated elements are always understood to be combined or aggregated in some way to produce the object of the pointer. In this example, the targOrder attribute should be specified to indicate that the order in which identifier values are supplied in the target attribute is significant. The id attribute of the intermediate pointer provides an identifier which can then be linked to the <note> and <ref> elements.

The evaluate=all attribute value is used on the <link> element to specify that any pointer encountered as a target of that element is itself evaluated. If evaluate had the value none, the link target would be the pointer itself, rather than the objects it points to.
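
The intermediate pointer and the link which uses it might be sketched as follows. The identifiers CPL1, CPL2, and CPL are hypothetical names chosen for illustration; N3.284 is the note identifier used earlier in this chapter.

```sgml
<l id=CPL1>(Diff'rent our parties, but with equal grace</l>
<l id=CPL2>The Goddess smiles on Whig and Tory race,</l>

<!-- intermediate pointer: aggregates the couplet, order significant -->
<ptr id=CPL targOrder=Y target="CPL1 CPL2">

<!-- evaluate=all: the link resolves through CPL to the two lines -->
<link evaluate=all targets="N3.284 CPL">
```

With evaluate=none on the same <link>, the second target would be the <ptr> element bearing the identifier CPL itself, not the couplet.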

Where a <linkGrp> element is used to group a collection of <link> elements, any intermediate pointer elements used by those elements should be included within the <linkGrp>.

Intermediate pointers of this kind are particularly important when extended pointers (discussed in the next section) are in use.

14.2 Extended Pointers

Where the object of a link or pointer element is not contained within the current document, or where it does not bear an id attribute, it is not possible to point at it with a <ptr> or <ref> element, nor to link it directly with a <link> element, because no IDREF value can be supplied for the target or targets attribute of these elements. In such cases, the encoder must indicate the intended element indirectly by means of the elements discussed in this section. These elements identify their target using a special TEI-defined extended pointer notation, defined in section 14.2.2, "Extended Pointer Syntax," below and designed for compatibility with HyTime.(76)

14.2.1 Extended Pointer Elements

To point or refer to locations in the current or some other document without requiring that the target bear an SGML identifier, the following elements should be used:

<xptr>: defines a pointer to another location in the current document or an external document.

<xref>: defines a reference to another location in the current document, or an external document, using an extended pointer notation, possibly modified by additional text or comment.

These elements are both members of the element class pointer, and therefore carry the same attributes as other members of that class, listed above (see section 14.1.1, "Pointers and Links"). They are also members of the class xPointer, from which they inherit the following attributes:

doc : specifies the document within which the desired location is to be found.

from : specifies the start of the destination of the pointer, as an expression in the TEI extended pointer notation described in section 14.2, "Extended Pointers."

to : specifies the endpoint of the destination of the pointer, as an expression in the TEI extended pointer notation.

Unlike the pointer elements discussed in the previous section, these elements do not specify their target by means of a target attribute. Instead these elements use one or both of the attributes from and to to delimit a portion of some document specified by the doc attribute. In all other respects, these elements correspond with the elements <ptr> and <ref> discussed in sections 6.6, "Simple Links and Cross References," on page 12, and 14.1, "Pointers." Note that there is no element corresponding with the <link> element; links can be made both within and between documents using the same syntax, as further discussed below.

The values of the from and to attributes on the <xptr> and <xref> elements indicate the point or passage being referred to by showing how to locate it, using one or more special keywords, as defined below in section 14.2.2, "Extended Pointer Syntax." Examples are given there.

The <xptr> and <xref> elements are formally defined in the file teilink2.dtd.

14.2.2 Extended Pointer Syntax

As noted above, the elements <xptr> and <xref> are used to represent a link between their own location (the "link origin") and some other location (the "destination"), which may or may not be in the same document. Software supporting intra- and inter-document links (e.g. hypertext systems) should provide access from the location of such an element to the destination. This section defines the allowable values for the attributes from, to, and doc of the <xptr> and <xref> elements.

An <xptr> or <xref> element with no attributes at all is, by definition, a link to the root (i.e. the document element -- by default, this is the <TEI.2> element) of the document in which it appears.

The doc attribute value must be the name of an entity declared in the SGML document type declaration. If only the doc attribute is given a value, then by definition the destination is the entire entity named by the doc value.
A more specific location within another entity must be specified with the from and the to attributes, as described below.

The from and the to attributes indicate the specific location pointed at, within the entity named by the doc attribute (or within the current document, if no doc value is given). Their values are referred to below as location pointer specifications. When both attributes are specified, the span pointed at by the element runs from the starting point of the span indicated by from to the ending point of the span indicated by to. If the latter precedes the former in the document, then the pointer is in error and fails. If only the from attribute is specified, the to attribute defaults to the same value; the effect is that the element as a whole points to the span indicated by the from attribute. It is a semantic error to specify a value for to but not for from.

14.2.2.1 Location Ladders

Each location pointer specification consists of a sequence of location terms, each of which consists of a keyword specifying a location type followed by one or more parenthesized parameter lists, each of which specifies a location value via a list of parameters. Location types and values, and the parameters within a location value, must be separated by white space characters.

Using terms borrowed from HyTime, we say that each TEI location term in a specification provides the location source for the next, and the entire specification is equivalent to a location ladder. By specifying the entire ladder in a single attribute value, the TEI extended pointer mechanism greatly reduces the syntactic and processing complexity of hypertextual pointers.

In formal terms:(77)

      ladder ::= locterm
               | ladder locterm

14.2.2.2 Location Terms

The keywords used in location terms are these; references to "the tree" mean the tree representing the SGML document hierarchy.

root        points at the root of the target document
here        points at the location of the pointer
id          points at an ID within the target document
ref         gives a "canonical reference" to a location in the target document
child       indicates an element found by descending one level in the tree
descendant  indicates an element found by descending one or more levels in the tree
ancestor    indicates an element found by ascending one or more levels in the tree
previous    indicates an element found by traversing the older siblings of the current location source
next        indicates an element found by traversing the younger siblings of the current location source
preceding   indicates an element found by traversing the entire portion of the document preceding the current location source
following   indicates an element found by traversing the entire portion of the document which follows the current location source
pattern     specifies a regular expression to be located within the existing location source
token       points at one or more tokens in the character content of the location source
str         points at one or more characters in the character content of the location source
space       points at a location using coordinates in some (application-defined) n-dimensional space
foreign     points at a location using some non-SGML method, and gives the name of the method
HyQ         points at a location using the HyQ query language defined by ISO 10744 (HyTime)
ditto       (in the to attribute only) points at the same span as was indicated by the from attribute

In formal terms:

      locterm ::= 'ROOT'                         // default first location
                | 'HERE'                         // location of the xptr
                | 'ID' '(' NAME ')'              // only one ID allowed.
                | 'REF' '(' characters ')'       // only one ref allowed
                | 'CHILD' steps
                | 'DESCENDANT' steps
                | 'ANCESTOR' steps
                | 'PREVIOUS' steps
                | 'NEXT' steps
                | 'PRECEDING' steps
                | 'FOLLOWING' steps
                | 'PATTERN' regs                 // mult patterns allowed
                | 'TOKEN' '(' range ')'
                | 'STR' '(' range ')'
                | 'SPACE' '(' NAME ')' pointpair
                | 'FOREIGN' parms
                | 'HYQ' parms
                | 'DITTO'                        // valid only in TO att.

Note that the keywords, though shown here quoted in uppercase, are not case sensitive.

Each location term specifies a location in the target document; this location may be a single point, more often a span of text (often the span of a single element) within the target document. The location ladder as a whole is interpreted from left to right, and each location term specifies a location relative to the location specified by the sequence prior to that point (i.e. to its location source). Unless here or id is specified as the first location term, the beginning location source is always root. An empty location sequence thus is the same as root and specifies the entire destination entity.

In general, the search for the location specified by a location term will be conducted only within its location source (i.e. within the location already identified by preceding location terms). There are however several exceptions. The terms root, here, and id all ignore the location source defined by any preceding terms and therefore make sense only as the first items in the ladder. The terms ancestor, next, and previous do not ignore the location source, but select a new span from the adjacent or enclosing portions of the text, and not from within the location source. Finally the location terms foreign, space, and HyQ are not defined fully here; they may or may not ignore the existing location source.

Some of the location terms make sense only in SGML documents; these are id, child, ancestor, descendant, previous, next, preceding, and following.
The latter six involve traversing the tree representing the SGML document hierarchy and are most easily understood when their location source is a single SGML element. If the location source is not a single SGML element, the tree-traversal keywords operate upon its beginning end-point, its "front end" (in English, this will be the leftmost point of the location source; in Arabic or Hebrew it will be the rightmost point). In this case child and descendant have no meaning, since character data has no descendants in the document tree; the first ancestor of such a location source is the element immediately containing the character data in question, and the siblings referred to by next and previous are the other children of that immediately containing element.

The details of each keyword are given below, along with definitions of their syntax and the semantics of their results. Examples are also provided. It is strongly recommended that when IDs are available, they should be used in preference to the other methods for pointing defined here.

For all keywords, the description assumes that the target document does in fact contain a span or element which matches the description; otherwise, the location term has no referent and is said to "fail". If any location term fails, the entire pointer fails. No backtracking or retrying is performed (and indeed for the most part the location terms are defined as having only one matching location, so backtracking would in most cases lead to no better result).

14.2.2.3 The ROOT Keyword

The location term root selects the root of the destination document tree; in SGML terms, this is the "document element". Since it ignores any existing location source, the root keyword makes sense only as the first location term in the ladder.
Since root is assumed as the implicit first term in any ladder, the following two location ladders have the same meaning:

     ROOT DESCENDANT (2 DIV1)
     DESCENDANT (2 DIV1)

14.2.2.4 The HERE Keyword

The keyword here designates the location at which the pointer element itself is situated; it allows extended pointers to select items like "the paragraph immediately preceding the one within which this pointer occurs". Since it ignores any existing location source, this keyword typically makes sense only as the first location term in a location specification.

To designate "the paragraph preceding the current one", the following location ladder could be used:

     HERE ANCESTOR (1 P) PREVIOUS (1 P)

(See below for descriptions of the keywords ancestor and previous.)

14.2.2.5 The ID Keyword

The resulting location is the element within the destination entity whose ID attribute has the value specified as the location value. The ID location type typically makes sense only as the first location term in a location specification, but there is no syntactic requirement that it be so.

For example, the location specification

     ID (a27)

chooses the necessarily unique element of the destination entity which has an attribute with a declared value of type ID whose value is a27.

14.2.2.6 The REF Keyword

The resulting location is an element which can be found by interpreting the location value in accordance with document-specific rules for a canonical reference. Such reference systems, particularly common in documents of interest to classical and biblical scholars, must also be defined in the TEI header, using the <refsDecl> element (see section 5.3.5, "The Reference System Declaration," on page 9). If more than one element matches the canonical reference, the first one encountered is chosen.
For example, the location specification

     REF (MT.2.1)

chooses the first element of the destination entity which is identified by the canonical reference MT.2.1.

14.2.2.7 The CHILD Keyword

The child location type specifies an element or span of character data in the document hierarchy using a location value which functions as a domain-style address. The value is a series of parenthesized steps, separated by white space. Each such step represents one level of the hierarchy within the location source. Each step may contain one or more parameters separated by white space and interpreted in order as follows:

  1. an instance indicator, which is a signed or unsigned integer or the special value ALL
  2. optionally, an expression matching an SGML generic identifier
  3. optionally, one or more pairs of expressions, the first matching an SGML attribute name and the second matching an SGML attribute value

In formal terms, the location value of child is a series of steps:

     steps   ::= '(' step ')'
               | steps '(' step ')'
     step    ::= instance
               | instance element
               | instance element avspecs
     avspecs ::= attribute value
               | avspecs attribute value

Location values of the same form are also used by the keywords descendant, ancestor, previous, next, preceding, and following; details of the interpretation may vary from keyword to keyword.

If an instance indicator alone is specified, as a number n, it selects the nth child of the location source. If the special value ALL is given, then all the children of the location source are selected. If the instance indicator is specified with following parameters, it selects all, or the nth, among those children of the location source which satisfy the other parameters. If a negative number is given, the nth child is counted from the last child of the location source to the first. The location source must contain at least n children;(78) if it does not, the child term fails.
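The instance-indicator rules just described can be sketched in Python. This is only an illustration and not part of these Guidelines: the function name pick_instance, the use of a plain list to stand for the candidate children of the location source, and the use of None to stand for a term which fails are all assumptions made for the example.

```python
def pick_instance(children, instance):
    """Apply an instance indicator to an ordered list of candidate children.

    `instance` is a non-zero integer or the string "ALL".
    Returns the list of selected items, or None when the term fails.
    """
    if instance == "ALL":
        # Assumption: ALL fails when there is nothing to select.
        return list(children) if children else None
    if instance == 0 or abs(instance) > len(children):
        return None                        # fewer than n children: the term fails
    if instance > 0:
        return [children[instance - 1]]    # nth child, counting from the first
    return [children[instance]]            # negative: counting from the last


paragraphs = ["p1", "p2", "p3", "p4"]
print(pick_instance(paragraphs, 2))        # ['p2']
print(pick_instance(paragraphs, -2))       # ['p3'], the next-to-last child
print(pick_instance(paragraphs, "ALL"))    # ['p1', 'p2', 'p3', 'p4']
print(pick_instance(paragraphs, 5))        # None: the term fails
```

When further parameters are given in the step, the same counting would be applied to the children which satisfy them rather than to all children.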
In formal terms, the first parameter of a step is an instance indicator, which in turn is either the special value ALL or a signed integer:

     instance ::= 'ALL' | signed
     signed   ::= NUMBER            // default sign is +
                | '+' NUMBER
                | '-' NUMBER

If a second parameter is given, it is interpreted as an SGML generic identifier, and only elements of the type indicated will be selected. For example, the location specification

     CHILD (3 DIV1) (4 DIV2) (29 P)

chooses the 29th paragraph of the fourth sub-division of the third major division of the initial location source. The location specification

     CHILD (3 DIV1) (4 DIV2) (-2 P)

chooses the next-to-last paragraph of the fourth <div2> of the third <div1> in the location source.

Constraint by generic identifier is strongly recommended, because it makes links more perspicuous and more robust. It is perspicuous because humans typically refer to things by type: as "the second section", "the third paragraph", etc. It is robust because it increases the chance of detecting breakage if (due to document editing) the target originally pointed at no longer exists.

The generic identifier may be specified as a normal SGML name, as a (parenthesized) regular expression, or using the reserved values #CDATA or *. Regular expressions take the form described below; the location term

     CHILD (3 (DIV[123]))

matches the third element which has a generic identifier of div1, div2, or div3. If the generic identifier is specified as *, any generic identifier is matched; this means that CHILD (2 *) is synonymous with CHILD (2).

If the second parameter is #CDATA, the location term selects only untagged sub-portions of an element having SGML mixed content. The location ladder

     CHILD (3 #CDATA)

thus chooses the third span of character data directly contained by the current location source. If the location source is a paragraph containing

  1. a sentence (A)
  2. an embedded quotation, marked as a <q>
  3. another sentence (B)
  4. an embedded note, marked as a <note>
  5. another sentence (C)
  6. a second embedded quotation, marked as a <q>

where the three sentences A, B, and C are character data enclosed by no element smaller than the paragraph itself, then CHILD (3 #CDATA) selects sentence C, while CHILD (3) selects sentence B.

If specified as a name (i.e. without parentheses), the generic identifier is case sensitive if and only if the SGML declaration specifies that generic identifiers are case sensitive (by default they are not). If specified as a regular expression, the expression given is always case sensitive; in the usual case this means the regular expression should be in uppercase, as in the examples here.

In formal terms the second parameter of a step is defined thus:

     element ::= NAME | '#CDATA' | '*' | '(' regular ')'

The third and fourth parameters, if given, are interpreted as an attribute-value pair, and only elements which match that pair in the way described below will be selected; the fifth and sixth parameters, and all following pairs of parameters, are interpreted in the same way. When more than one pair is given, all must be matched.

The third, fifth, seventh, etc., parameters are interpreted, if specified, as attribute names. Like generic identifiers, attribute names may be specified as * in location ladders in the (unlikely) event that an attribute value constitutes a constraint regardless of what attribute name it is a value for. The attribute name parameter may also be specified as a parenthesized regular expression. For example, the location term

     CHILD (1 * TARGET *)

selects the first child of the location source for which the attribute target has a value. The location term

     CHILD (1 * (TARGET(S?)) *)

will select the first child of the location source for which an attribute called either target or targets has a value.

As with generic identifiers, attribute names are case sensitive if and only if the SGML declaration says they are; regular expressions are always case sensitive and should usually be uppercased, as shown here.
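The matching rules for the generic-identifier parameter described above (which apply in the same way to attribute names) can be sketched in Python. This is a hedged illustration only: the names gi_matches and child, the representation of an element as a (gi, content) tuple and of character data as a plain string, and the element names Q and NOTE are all assumptions made for the example. Python's re syntax is used as a stand-in for the regular expressions of these Guidelines and agrees with them only for simple patterns such as those shown; a whole-name match is assumed.

```python
import re


def gi_matches(item, spec):
    """Test one child against the second parameter of a step.

    `item` is a (gi, content) tuple for an element, or a str for character data.
    `spec` is None, "*", "#CDATA", a plain name, or a parenthesized regular
    expression such as "(DIV[123])".
    """
    if spec is None or spec == "*":
        return True                        # the text treats CHILD (2 *) like CHILD (2)
    is_cdata = isinstance(item, str)
    if spec == "#CDATA":
        return is_cdata
    if is_cdata:
        return False                       # character data has no generic identifier
    gi = item[0]
    if spec.startswith("(") and spec.endswith(")"):
        # Regular expressions are always case sensitive.
        return re.fullmatch(spec[1:-1], gi) is not None
    # Plain names: assume the default SGML declaration, which folds case.
    return gi.upper() == spec.upper()


def child(children, n, spec=None):
    """Return the nth (positive n only) child satisfying spec, or None on failure."""
    matches = [c for c in children if gi_matches(c, spec)]
    return matches[n - 1] if 0 < n <= len(matches) else None


# The paragraph from the example: sentences A, B, C with embedded elements.
paragraph = ["sentence A", ("Q", "..."), "sentence B",
             ("NOTE", "..."), "sentence C", ("Q", "...")]

print(child(paragraph, 3))                 # sentence B
print(child(paragraph, 3, "#CDATA"))       # sentence C
print(gi_matches(("DIV2", ""), "(DIV[123])"))   # True
print(gi_matches(("DIV4", ""), "(DIV[123])"))   # False
```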
In formal terms, the attribute-name parameter of a tree-traversal step is defined thus:

     attribute ::= NAME | '*' | '(' regular ')'

If a fourth, sixth, eighth, etc., parameter is specified, it is interpreted as an attribute value, and only elements satisfying the other constraints and also bearing an attribute of the specified name and value will be selected. The attribute value may be specified exactly as in an SGML document; as a consequence, if the attribute value to be specified contains white space characters, it must be enclosed in quotation marks. The attribute value may also be specified as a regular expression, enclosed in parentheses, or using the two special values #IMPLIED and *.

For example, the location specification

     CHILD (1 * N 2) (1 * N 1)

chooses an element using the global n attribute. Beginning at the location source, the first child (whatever kind of element it is) with an n attribute having the value 2 is chosen; then that element's first direct sub-element having the value 1 for the same attribute is chosen.

The location specification

     CHILD (1 FS RESP ((lanc|LANC)(s|S|ashire|ASHIRE)))

selects the first child of the location source which is an <fs> element bearing a resp attribute with the value lancs, lancashire, LANCS, or LANCASHIRE (as well as other possible combinations which are left to the reader's ingenuity). If specified with quotation marks or as a regular expression, the attribute-value parameter is case-sensitive; otherwise not.

The location specification

     CHILD (1 FS RESP #IMPLIED)

selects the first child of the location source which is an <fs> element for which the resp attribute has been left unspecified.

The location ladder

     ROOT DESCENDANT (1 (DIV[01234567]) TYPE chapter N 2)

selects the second chapter of a text, regardless of which numbered text-division element (div0 through div7) is used to tag chapters. It does so by selecting the first text-division element in the document which is of type chapter and has the n value 2.

In formal terms, the attribute-value parameter of a tree-traversal step is defined thus:

     value ::= LITERAL             // i.e. quoted string.
             | NAME                // As for attribute values in
             | NUMBER              // document, NMTOKENs need not
             | NUMTOKEN            // be quoted
             | '#IMPLIED'          // No value specified, no default
             | '*'                 // Any value matches.
             | '(' regular ')'

14.2.2.8 The DESCENDANT Keyword

If the descendant keyword is used, the location term selects an element or character-data string which is a descendant of the current location source. Like child, descendant takes as a value a series of one or more parenthesized steps, which may contain the same four parameters described above. The set of elements and strings which may be selected, however, is the set of all descendants of the location source (i.e. the set of all elements contained by it), rather than only the set of immediate children.

The location specification

     ID (a23) DESCENDANT (2 TERM LANG DE)

thus selects the second <term> element with a lang of de occurring within the element with an id of a23. The search for matching elements occurs in the same order as the SGML data stream; in terms of the document tree, this amounts to a depth-first left-to-right search.

If the instance number is negative, the search is a depth-first right-to-left search, in which the right-most, deepest matching element is numbered -1, etc. The location specification

     DESCENDANT (-1 NOTE)

thus chooses the last <note> element in the document, that is, the one with the rightmost start-tag.

14.2.2.9 The ANCESTOR Keyword

The ancestor location term selects an element from among the direct ancestors of the location source in the document hierarchy. The location value is of the same form as defined for the child and descendant location types.
However, the ancestor keyword selects elements from the list of containing elements or "ancestors" of the location source, counting upwards from the parent of the location source (which is ancestor number 1) to the root of the document instance (which is ancestor number -1).

The location source must have at least as many ancestors as the absolute value of the instance number specified as the first parameter of the step. The ancestor type thus may not be specified as the first component of a location specification, because the initial location source in effect at that point is the root, which has no ancestors.

For example, the location term

     ANCESTOR (1 * N 1) (1 DIV)

first chooses the smallest element properly containing the location source and having attribute n with value 1, and then the smallest <div> element properly containing it. The location term

     ANCESTOR (1)

chooses the immediate parent of the location source, regardless of its type or attributes. The location term

     ANCESTOR (1 * LANG fr)

selects the smallest ancestor for which the lang attribute has the value fr. The term

     ANCESTOR (-1 * LANG fr)

selects the largest ancestor for which the lang attribute has the value fr. Without the attribute specification, the term

     ANCESTOR (-1)

selects the largest ancestor of the location source and is thus normally synonymous with the keyword ROOT. If the instance indicator is given as ALL, then all the ancestor elements which match the later parameters are selected; since the largest of these will necessarily include all the others, the value ALL is thus synonymous with the value (-1) when used with ANCESTOR. Finally, the term

     ANCESTOR (1 (DIV[0123456789]?))

chooses the smallest text-division
element of any level which contains the loca- tion source. 14.2.2.10 The PREVIOUS Keyword The previous keyword selects an element or character-data string from among those which precede the location source within the same containing element. We speak of the elements and character-data strings contained by the same parent element as siblings; those which precede a given ele- ment or string in the document are its elder siblings; those which fol- low it are its younger siblings. The instance number in the location value of a previous term desig- nates the nth elder sibling of the location source, counting from most recent to less recent. The location ladder ID (a23) PREVIOUS (1) thus designates the element immediately preceding the element with an id of a23. Negative instance numbers also designate elder siblings, count- ing from the eldest sibling to the youngest. The location source must have at least as many elder siblings as the absolute value of the instance number. If the location source has at least one elder sibling, then the location term PREVIOUS (-1) designates its eldest sibling and is thus synonymous with the ladder ANCESTOR (1) CHILD (1) The value ALL may be used to select the entire range of elder siblings of an element: the location ladder ID (a23) PREVIOUS (ALL) thus designates the set of elements which precede the element with an id of a23 and are contained by the same parent. 14.2.2.11 The NEXT Keyword The keyword next behaves like previous, but selects from the younger siblings of the location source, not the elder siblings. The location ladder ID (a23) NEXT (1) thus designates the element or string immediately following the element which has an id of a23. Negative instance numbers also designate young- er siblings, counting from the youngest sibling to the location source. The location source must have at least as many younger siblings as the absolute value of the instance number. 
If the location source has at least one younger sibling, then the location term NEXT (-1) designates its youngest sibling and is thus synonymous with the ladder

     ANCESTOR (1) CHILD (-1)

14.2.2.12 The PRECEDING Keyword

The preceding keyword selects an element or character-data string from among those which precede the location source, without being limited to the same containing element. The set of elements and strings which may be selected is the set of all elements and strings in the entire document which occur or begin before the location source. (For purposes of the keywords PRECEDING and FOLLOWING, elements are interpreted as occurring where their start-tag occurs.) The PRECEDING keyword thus resembles PREVIOUS but differs in searching a larger set of strings and elements; its result is not guaranteed to be a subset of its location source.

The instance number in the location value of a preceding term designates the nth element or character-data string preceding the location source, counting from most recent to less recent. The location ladder

     ID (a23) PRECEDING (5)

thus designates the fifth element or string before the element with an id of a23. Negative instance numbers also designate preceding elements or strings, counting from the eldest to the youngest; the ladder

     ID (a23) PRECEDING (-5)

thus selects the fifth element or string in the document overall, assuming that it precedes the element with an id of a23. It is thus normally synonymous with

     ROOT DESCENDANT (5)

differing only in that it fails if four items or fewer precede element A23.

There must be at least as many elements or strings preceding the location source as the absolute value of the instance number; otherwise, the preceding term fails. The value ALL may be used to select the entire portion of the document preceding the beginning of the location source: the location ladder

     ID (a23) PRECEDING (ALL)

designates the entire portion of the document preceding the start-tag for element A23.
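The counting rules shared by previous and preceding can be sketched in Python. This is an illustrative sketch under stated assumptions, not a definition: the function name pick_before is invented for the example, the candidate items are assumed to have been collected already into a list in document order (the elder siblings for previous, or every element and string that begins before the location source for preceding), and None stands for a term which fails.

```python
def pick_before(items, source_index, instance):
    """Select from the items that come before items[source_index].

    A positive instance counts backwards from the item nearest the source;
    a negative instance counts forwards from the earliest item; "ALL"
    selects every earlier item. Returns a list, or None if the term fails.
    """
    earlier = items[:source_index]
    if instance == "ALL":
        return earlier or None
    if instance == 0 or abs(instance) > len(earlier):
        return None                        # too few earlier items: the term fails
    if instance > 0:
        return [earlier[-instance]]        # nth most recent
    return [earlier[-instance - 1]]        # nth counting from the eldest


items = ["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8"]
source = items.index("e7")

print(pick_before(items, source, 1))       # ['e6'], immediately before the source
print(pick_before(items, source, 5))       # ['e2'], the fifth item before the source
print(pick_before(items, source, -5))      # ['e5'], the fifth item overall
print(pick_before(items, 3, 5))            # None: only three items precede 'e4'
```

The mirror-image selection for next and following would apply the same counting to the items after the source.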
14.2.2.13 The FOLLOWING Keyword

The keyword following behaves like preceding, but selects from the portion of the document following the location source, not that preceding it. The location ladder

     ID (a23) FOLLOWING (1)

thus designates the element or string immediately following the element which has an id of a23. Negative instance numbers select elements or strings counting from the end of the document to the location source. There must be at least as many elements or strings following the location source as the absolute value of the instance number. If the location source has at least one following element or string, then the location term FOLLOWING (-1) designates the youngest of these and is thus synonymous with the ladder

     ROOT DESCENDANT (-1)

14.2.2.14 The PATTERN Keyword

The pattern keyword selects the first place within the location source which matches a pattern-matching expression included as the location value. If more than one location matches that expression, there is no error, but the second and later matches are ignored. Matching is defined to be case-sensitive, i.e. abc is not the same as ABC.

The pattern is expressed as a regular expression in which the following characters have special meanings, similar to those of many Unix programs (such as grep) which handle regular expressions:

  .        match any single character (including white space characters).
  [ ... ]  match any character from the set enclosed within the brackets. If, however, the first enclosed character is ^, then match any character not from the set enclosed within the brackets. For example, [^aeiou] would match any character except a, e, i, o, or u.
  \        If the next character is a, d, n, or s, the expression matches any character from a pre-defined group, as shown below; otherwise, the next character is to be taken literally, even if it would otherwise have a special meaning. The special character classes are:
             \a   any alphabetic character (as defined in the writing system declaration)
             \d   any digit (0 through 9)
             \n   any line boundary
             \s   any white-space character (space, tab, record end, record start)
           Note that although \n for newline is provided, its use is discouraged.
  *        match zero or more occurrences of the previous regular expression.
  +        match one or more occurrences of the preceding regular expression.
  ?        match zero or one occurrences of the preceding regular expression.
  ^        match the following regular expression only at the beginning of the location source.
  $        match the preceding regular expression only at the end of the location source.
  |        match either the regular expression on the left, or the one on the right.
  (...)    match the regular expression within the parentheses. (Parentheses are used to control application of the *, ?, +, and | operators, etc.)

For example, the location specification

     PATTERN (Chapter.8)

chooses the first instance of the content string Chapter which is followed by any single character and then the digit 8, within the location source. Various elements which contain that location could be selected by following the pattern location term with one or more of other types such as ancestor (see above).

It is recommended practice to use structure-oriented location types to specify the destination element as narrowly as possible, and then to specify a pattern only within that element context. If element boundaries are encountered within the location source, however, they are ignored and have no effect on the pattern matching operation.

In formal terms, the location value of the pattern keyword is defined thus:

     regs    ::= '(' regular ')'
               | regs '(' regular ')'
     regular ::= character
               | '.'                           // match any character
               | '[^' || characters || ']'     // match any char not in list
               | '[' || characters || ']'      // match any char in list
               | '\a'                          // match any alphabetic
               | '\d'                          // match any digit 0-9
               | '\n'                          // match newline (&#RE;&#RS;)
               | '\s'                          // match any whitespace character
               | '\\'                          // match backslash (rev. solidus)
               | '\' || nonspecial             // match nonspecial character
               | regular || '*'                // match 0-n of 'regular'
               | regular || '+'                // match 1-n of 'regular'
               | regular || '?'                // match 0-1 of 'regular'
               | '^' || regular                // match at start of loc source
               | regular || '$'                // match at end of loc source
               | regular || regular            // match 1st, then 2d regular exp.
               | regular || '|' || regular     // match either 1st or 2d
               | '(' || regular || ')'         // use parentheses for grouping
     characters ::= /* empty string */
                  | characters character
     nonspecial ::= /* any character except a, d, n, or s */

14.2.2.15 The TOKEN Keyword

The token keyword selects a sequence of one or more tokens chosen from within the character content of the location source, where tokens are counted exactly as for the corresponding HyTime tokenloc form. The location value must be either a single positive integer, or a pair of positive integers separated by white space, representing the first and the last token numbers to be included in the resulting location. If two integers are specified, the second must not be less than the first. The location source must contain at least as many tokens as are specified in the location value.

This location type should not be used to count across element boundaries. It is recommended practice to use structure-oriented location types to specify the destination element as narrowly as possible, and then to specify a token location only within that element context. If element boundaries are encountered within the location source, they are ignored.
This location type behaves intuitively only for strings containing an alternating sequence of SGML name-characters and white space; this is the type of string found, for example, in SGML attribute values of type IDREFS, such as a21 z a13. For compatibility with the HyTime standard, all characters not included in the class of name characters by the cur- rent SGML declaration (by default this includes all punctuation other than the hyphen and full stop) are treated as white space characters. For example, the location specification ID (a27) TOKEN (3 5) chooses the 3rd, 4th, and 5th tokens from the content of the element whose identifier is a27. If this element contained the string This is _not_ a very good idea, the target selected would be not_ a very. In formal terms the location value of the token and str keywords is defined as a range: range ::= NUMBER | NUMBER NUMBER 14.2.2.16 The STR Keyword The str keyword identifies a sequence of one or more characters cho- sen from within the character content of the location source, where characters are counted exactly as for the HyTime dataloc form with quan- tum=str, which has a corresponding meaning and usage. The location val- ue must be either a single positive integer, or a pair of positive inte- gers separated by white space, indicating the first and the last characters to be included in the resulting location. If two integers are specified, the second must not be less than the first. The location source must have at least as many characters as are specified in the larger of the integers. This location type should not be used to count across element bound- aries. The recommended practice is to use structure-oriented location types to specify the destination element, and then to specify a charac- ter location only within that element context. If element boundaries are encountered, however, within the location source, they have no effect. 
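The range counting used by token and str can be sketched in Python. This is a minimal sketch under stated assumptions, not a definition of either keyword: the function names token_range and str_range are invented for the example, the name characters are assumed to be the ASCII letters and digits plus the hyphen and full stop, the content is assumed to be the character data as delivered by the SGML parser, and None stands for a term which fails.

```python
import re

# Assumption: name characters as in the default SGML declaration (ASCII only).
NAME_TOKEN = re.compile(r"[A-Za-z0-9.\-]+")


def token_range(text, first, last=None):
    """Return the span from the start of token `first` to the end of token `last`."""
    last = first if last is None else last
    spans = [m.span() for m in NAME_TOKEN.finditer(text)]
    if not (1 <= first <= last <= len(spans)):
        return None                        # too few tokens, or a reversed range
    return text[spans[first - 1][0]:spans[last - 1][1]]


def str_range(text, first, last=None):
    """Return characters `first` through `last`, counted from 1, inclusive."""
    last = first if last is None else last
    if not (1 <= first <= last <= len(text)):
        return None                        # too few characters, or a reversed range
    return text[first - 1:last]


content = "This is _not_ a very good idea"
print(token_range(content, 3, 5))          # not_ a very
print(repr(str_range(content, 3, 5)))      # 'is '
print(token_range(content, 7, 9))          # None: there are only seven tokens
```

Note that the leading underscore of _not_ is treated as white space and so falls outside the selected span, while the trailing one lies between the third and fifth tokens and is therefore included.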
Character offsets in an SGML document must be counted not from the original source file, but from the output of the SGML parser (the element structure information set or ESIS). This is because the rules of SGML allow certain characters to be deleted or expanded transparently.

For example, the location specification

     ID (a27) STR (3 5)

chooses the 3rd, 4th, and 5th characters of the content of the element having identifier a27. If this element contained the string "This turned out to be an even worse idea", the result would be the string is (i, s, and a space).

In multi-byte character sets it is characters which are counted, not bytes. However, in the case of diacritics coded by sequences of bit combinations rather than having separate code points for every combination of letter and diacritic, the diacritics are counted. This means that the following location ladder may retrieve different strings, depending on the system character set in use and on the entity declarations in effect:

     PATTERN (Wagner's\sGötterdämmerung) STR (10 24)

In some character sets, where ö and ä are encoded as single characters, it will select the string Götterdämmerung; in others, where they are encoded with distinct characters for umlaut, a, and o, it will select the string Götterdämmeru, truncating the last two letters. If a system-dependent definition is used (containing e.g. a printer escape sequence), the results are even less predictable. For this reason, the str keyword must be used with caution and should be avoided where possible.

14.2.2.17 The SPACE Keyword

The space location term applies to entities which represent graphical or spatio-temporal data; typically such entities are not encoded in SGML, but in one of many specialized graphical formats. SGML provides standard mechanisms (the NOTATION declaration and related constructs) for specifying what format such an entity uses.

The location value for space consists of two or three parenthesized parameter lists.
The first contains the name of the co-ordinate space in use. The second and third each consist of any number of signed inte- gers. The numbers in a parameter list represent locations along each dimension of a Cartesian co-ordinate space with all axes orthogonal; the length of the list equals the number of dimensions/axes of the space (usually, but not inevitably, 2, 3, or 4). If the third parameter list is not specified, the location is the single point in the co-ordinate space specified by the second parameter list. If all three parameter lists are specified, the location is the rectangular prism defined by treating corresponding items of the second and third lists as inclusive bounds along each dimension in turn. The mapping from co-ordinates to physical or display space, and the meaning and ordering of the axes, are not defined by these guidelines. They should be specified in the TEI header unless they can be determined by definition from the format in which the referenced entity is known to be encoded (for example, many graphics formats can only encode locations in units of pixels, counted in a 3 dimensional left-handed co-ordinate space). Time may be construed as an axis in addition to any others; when it is, it is TEI recommended practice that it be positioned last. The units used must be defined in the TEI header; it is acceptable in cer- tain media (such as videodiscs) to use frame numbers as a surrogate axis for time. For example, SPACE (2D) (0 0) (1 1) specifies the location of the unit square tangent to the origin in quad- rant 1 of a common graph. 
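The interpretation of the second and third parameter lists as inclusive bounds can be sketched in Python. This is only an illustration: the function names space_location and contains are invented for the example, the name of the co-ordinate space is left out, and the reordering of each pair of bounds with min and max is an assumption, since nothing above says which of the two lists must hold the smaller values.

```python
def space_location(second, third=None):
    """Return inclusive (low, high) bounds, one pair per axis.

    With only one list of numbers the location is a single point, represented
    here as a box whose low and high bounds coincide.
    """
    if third is None:
        third = second
    if len(second) != len(third):
        raise ValueError("both lists need one number per axis of the space")
    return [(min(a, b), max(a, b)) for a, b in zip(second, third)]


def contains(bounds, point):
    """True if `point` lies within the inclusive bounds on every axis."""
    return len(point) == len(bounds) and all(
        low <= x <= high for (low, high), x in zip(bounds, point))


unit_square = space_location((0, 0), (1, 1))   # SPACE (2D) (0 0) (1 1)
print(unit_square)                             # [(0, 1), (0, 1)]
print(contains(unit_square, (1, 0)))           # True: bounds are inclusive
print(contains(unit_square, (2, 0)))           # False
print(space_location((3, 4)))                  # [(3, 3), (4, 4)], a single point
```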
The location value for a space location term is a NAME enclosed in parentheses, followed by a point pair:

     pointpair ::= '(' numbers ')'
                 | '(' numbers ')' '(' numbers ')'
     numbers   ::= signed
                 | numbers signed

14.2.2.18 The FOREIGN Keyword

The foreign keyword takes any number of parenthesized parameter lists, and is terminated by the end of the attribute value, or by the next non-parenthesized token, whichever comes first.

The meaning of the foreign location term is not defined by these Guidelines. It is intended for use in pointing to special kinds of non-SGML, non-coordinate space data. That is, it should be used for making links to data which cannot be specified using the other mechanisms. The meaning of any foreign location types must be specified in the TEI header, as a series of paragraphs at the end of the <encodingDesc> element defined in section 5.3, "The Encoding Description," on page 8. If more than one such type is used, it is TEI recommended practice that the first parameter list to foreign be a name associated with the particular type by documentation in the TEI header.

For example, assume that some program uses a proprietary data format called XFORM, and that the program has supplied an identifier 06286208998 for some piece of data it owns. Then the location specification

     FOREIGN (XFORM) (06286208998)

would be one way of expressing a link to that piece of data.

14.2.2.19 The HYQ Keyword

The HyQ keyword takes a single parenthesized parameter list, which contains an expression in the HyQ query language defined by the HyTime standard. See documentation on HyTime and HyQ for definitions of HyQ expressions.

14.2.2.20 The DITTO Keyword

The ditto keyword is valid only as the first location term in a ladder, and only within the to attribute of an extended pointer element. It designates the location result of the from attribute on the same element. Thus in the pointer the from attribute designates the first occurrence of the string Wagnerian in the
containing the element with an id of a23. The to attribute designates the first occurrence of the string Liebestod which occurs after Wagnerian, within the same
. Without the ditto key- word, it would be necessary to repeat the entire location ladder of the from attribute in the to attribute, which would be error-prone for com- plex expressions. 14.2.3 Using Extended Pointers As noted above, when only the from attribute is specified, the or element points at the span indicated by from. When both from and to are specified, the element points at the span running from the beginning of the span indicated by the former to the end of the span indicated by the latter. To point at the second, third, and fourth paragraphs of the second chapter () in the body of the current document, therefore, one may specify either of the following: To point to "the occurring in the current with attribute n = 2", only the from attribute would be required: The following example demonstrates how elements from two different documents may be combined The first indicates the element in doc1 which has identifier d1.1. The second indicates the second subelement of the element in doc2 which has identifier d2.1. These two elements are pointed to as a sin- gle item by the element and given the identifier p1. This aggre- gation, finally, is linked with two other elements both in the current document, with identifiers s1 and s2. An extended pointer, as described above, may specify as its target only a single destination. Where the intended destination of a link is an aggregation or alignment of destinations, possibly in separate docu- ments, an intermediate pointer of some kind must be used, as described in section 14.1.4, "Intermediate Pointers," on page 26 elsewhere in this chapter. Like any other element, an and may be given a unique id within the document that contains them. This id value can then be supplied as one of the target values for an intermediate or element, to represent aggregation or linkage respectively. The element discussed in section 14.7, "Aggregation," on page 29 may also be used. 
For example, a modern commentary on an older text must frequently refer to that text, which might well be encoded in a separate SGML document. Some discussions will refer to a set of discrete passages in the older text, and will thus require multi-headed pointers. In such a case, the document type declaration must contain a declaration for an SGML entity containing the older text, which might look something like this:

In the commentary itself, reference will be made to this external document, using and elements. When the commentary refers to aggregates of discontiguous passages, elements are used to point to the individual passage, and a element may refer to these passages as a group by pointing to the s: ...

In the references to Theobald, Pope's satire characteristically ...

If the same discontiguous target is to be referred to repeatedly, it may be convenient to give it a single identifier, thus: ...

In the references to Theobald, Pope's satire characteristically ...

A hypertext web might associate passages of the text and notes with the individuals mentioned, the ancient authors imitated, or thematic content, thus: ...
Individuals Named in the Text A bookseller and publisher ... ... Attorney, active also as editor and reviewer ... ... ...
Ancient Authors Imitated in the Text Virgil Homer Ovid ... ... ... ...

14.3 Segments and Anchors

In this section, we define two general-purpose elements which may be used to mark and categorize both a span of text and a point within one. These elements have several uses, most notably to provide elements which can be given identifiers for use when aligning or linking to parts of a document, as discussed elsewhere in this chapter. They also provide a convenient way of extending the semantics of the TEI markup scheme in a theory-neutral manner.

<seg>: contains any arbitrary phrase-level unit of text (including other <seg> elements). Attributes include:
  subtype : provides a sub-categorization of the segment marked.
  part : specifies whether or not the segment is complete. Legal values are:
    Y : the segment is incomplete
    N : either the segment is complete, or no claim is made as to its completeness
    I : the initial part of an incomplete segment
    M : a medial part of an incomplete segment
    F : the final part of an incomplete segment

<anchor>: attaches an identifier to a point within a text, whether or not it corresponds with a textual element.

These elements are both members of the class seg, and inherit from it the following attributes:
  type : characterizes the type of segment.
  function : characterizes the function of the segment.

The <seg> element may be used at the encoder's discretion to mark almost any segment of the text of interest for processing. One use of the element is to mark text features for which no appropriate markup is otherwise defined, i.e. as a simple extension mechanism. Another use is to provide an identifier for some segment which is to be pointed at by some other element, i.e. to provide a target, or a part of a target, for a <ptr> or other similar element. Several examples of uses for the <seg> element are provided elsewhere in these Guidelines.
For example:

  * as a means of marking segments significant in a metrical or rhyming analysis (see section 9.4, "Rhyme and Metrical Analysis," on page 17)
  * as a means of marking typographic lines in drama (see section 10.2, "The Body of a Performance Text," on page 18) or title pages (see section 7.5, "Title Pages," on page 16)
  * as a means of marking prosody- or pause-defined units in transcribed speech (see section 11.3.1, "Segmentation," on page 20)
  * as a means of marking linguistic or other analyses in a theory-neutral manner (see chapter 15, "Simple Analytic Mechanisms," on page 29 passim)

In the following simple example, the <seg> element simply delimits the extent of a stutter, a textual feature for which no element is provided in these Guidelines.

Don't say I-I-I'm afraid, Melvin, just say I'm afraid.

The <seg> element is particularly useful for the mark-up of linguistically significant constituents such as the phrases that may be the output of an automatic parsing system. This example also demonstrates the use of the id attribute to carry an identifier which other parts of a document may use to point to, or align with:

Literate and illiterate speech in a language like English are plainly different.

As the above example shows, <seg> elements may be nested directly within one another, to any degree of analysis considered appropriate. This is taken a little further in the following example, where the type and subtype attributes have been used to further categorise each word of the sentence (the id attributes have been removed to reduce the complexity of the example):

Literate and illiterate speech in a language like English are plainly different .

(The example values shown are chosen for simplicity of comprehension, rather than verisimilitude). It should also be noted that specialized segment elements are defined in section 15.1, "Linguistic Segment Categories," on page 29 to facilitate this particular kind of analysis.
These allow for the explicit mark up of units called s-units, clauses, phrases, words, morphemes and characters, which may be felt preferable to the more generic approach typified by use of the <seg> element. Using these, the first phrase above might be encoded simply as

Literate and illiterate speech

Note the way in which the type attribute of these specialized elements now carries the value carried by the subtype attribute of the more general <seg> element. For an analysis not using these traditional linguistic categories however, the <seg> element provides a simple but powerful mechanism.

In language corpora and similar material, the <seg> element may be used to provide an end-to-end segmentation as an alternative to the more specific <s> element proposed in section 15.1, "Linguistic Segment Categories," on page 29 for the mark-up of orthographic sentences, or s-units. However, it may be more useful to use the <s> element for this purpose, since this means that the <seg> element can then be used to mark both features within s-units and segments composed of s-units, as in the following example:(79)

Sigmund, the son of Volsung, was a king in Frankish country. Sinfiotli was the eldest of his sons.

Like other elements, the <seg> tag must be properly enclosed within other elements. Thus, a single <seg> element can be used to group together words in different sentences only if the sentences are not themselves tagged. The first of the following two encodings is legal, but the second is not.

Give me a dozen. Or two or three.

Give me a dozen. Or two or three.

The part attribute may be used as one simple method of overcoming this restriction:

Give me a dozen. Or two or three.

Another solution is to use the <join> element discussed in section 14.7, "Aggregation," on page 29. This requires that each of the <seg> elements be given an identifier. For further discussion of this generic encoding problem see also chapter 31, "Multiple Hierarchies," on page 48.
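As a rough illustration of how an application might put back together a segment that has been split with the part attribute, the following Python sketch collects fragments marked I, M, and F into whole segments. The (part, text) tuple representation and the function name join_parts are assumptions made for this example only; they are not defined by these Guidelines, and the value Y is deliberately left unhandled.

```python
def join_parts(fragments):
    """Reassemble segments from (part, text) pairs given in document order.

    part is "N" (complete), "I" (initial), "M" (medial) or "F" (final).
    "Y" (incomplete, position unspecified) is not supported by this sketch.
    """
    segments = []
    pending = None  # texts of the segment currently being assembled
    for part, text in fragments:
        if part == "N":
            # A complete segment stands on its own.
            segments.append(text)
        elif part == "I":
            if pending is not None:
                raise ValueError("initial part before previous segment ended")
            pending = [text]
        elif part in ("M", "F"):
            if pending is None:
                raise ValueError("medial or final part without an initial part")
            pending.append(text)
            if part == "F":
                # The final part closes the segment.
                segments.append(" ".join(pending))
                pending = None
        else:
            raise ValueError("unsupported part value: %r" % (part,))
    if pending is not None:
        raise ValueError("segment started but never finished")
    return segments


# Hypothetical fragments: one segment split across two sentences.
example = [("N", "Give me"), ("I", "a dozen."), ("F", "Or two or three.")]
print(join_parts(example))
```

Raising an error on a missing initial or final part is one possible design choice; an application might equally prefer to report such fragments and continue.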
The <seg> element has the same content as a paragraph in prose: it can therefore be used to group together consecutive sequences of inter class elements, such as lists, quotations, notes, stage directions etc. as well as to contain sequences of phrase-level elements. It cannot however be used to group together sequences of paragraphs or similar text units such as verse lines; for this purpose, the encoder should use intermediate pointers, as described in section 14.1.4, "Intermediate Pointers," on page 26 or the methods described in section 14.7, "Aggregation," on page 29.

It is particularly important that the encoder provide a clear description of the principles by which a text has been segmented, and the way in which that segmentation is represented. This should include a description of the method used and the significance of any categorization codes. The description should be provided as a series of paragraphs within the <segmentation> element of the encoding description in the TEI header, as described in section 5.3.3, "The Editorial Practices Declaration," on page 8.

The remainder of this chapter contains a number of examples of the use of the <seg> element simply to provide an element to which an identifier may be attached, for example so that another segment may be linked or related to it in some way. We conclude this section by introducing the <anchor> element which serves an identical purpose, but has no content. It may be thought of as an empty <seg>, or as an artifice enabling an identifier to be attached to any position in a text. Like the <milestone> element discussed in section 6.9, "Reference Systems," on page 13, the <anchor> element is useful where multiple views of a document are to be combined, for example, when a logical view based on paragraphs or verse lines is to be mapped on to a physical view based on manuscript lines.
It differs from the milestone and related elements in that the <anchor> element should not be used to mark the start or end of an arbitrary zone within a text, but only to mark arbitrary points used for alignment.

For example, suppose that we wish to mark the end of the fifth word following each occurrence of some term in a particular text, perhaps to assist with some collocational analysis. This can most easily be done with the help of the <anchor> tag, as follows:

English language. Except for not very English at all at the time English was still full of flaws English. This was revised by young

In the next section we discuss ways in which these points can be used to represent an alignment, for example such as one might get in a keyword-in-context concordance.

These elements are formally defined as follows:

14.4 Correspondence and Alignment

In this section we introduce the notions of correspondence, expressed by the corresp attribute, and of alignment, which is a special kind of correspondence involving an ordered set of correspondences. Both cases may be represented using the <link> and <linkGrp> elements introduced in section 14.1, "Pointers," on page 26. We also discuss the special case of alignment in time or synchronization, for which special purpose elements are proposed in section 14.5, "Synchronization."

14.4.1 Correspondence

A common problem in text analysis is to determine correspondences between two or more parts of a single document, or between places in different documents. Provided that SGML elements are available to represent the parts or places to be linked, then the global linking attribute corresp may be used to encode such correspondence, once it has been identified.

  corresp : points to elements that correspond to the current element in some way.

This is one of the attributes made available by the mechanism described in the introduction to this chapter (chapter 14, "Linking, Segmentation, and Alignment," on page 26).
Correspondence can also be expressed by means of the <link> element introduced in section 14.1, "Pointers," on page 26. Where the correspondence is between spans, the <seg> element should be used, if no other element is available. Where the correspondence is between points, the <anchor> element should be used, if no other element is available.

The use of the corresp attribute with spans of content is illustrated by the following example:

Shirley, which made its Friday night debut only a month ago, was not listed on NBC's new schedule, although the network says the show still is being considered.

Here the anaphoric phrases the network and the show have been associated directly with the elements to which they refer by means of corresp attributes. This mechanism is simple to apply, but has the drawback that it is not possible to specify more exactly what kind of correspondence is intended. Where this attribute is used, therefore, encoders are encouraged to specify their intent in the associated encoding declarations in the TEI Header.

Essentially, what the corresp attribute does is to specify that the element that has the attribute and the element(s) the attribute points to are doubly linked.(80) Therefore, we can also use the <link> and <linkGrp> elements defined in section 14.1, "Pointers," on page 26 to indicate correspondence among elements. Moreover, the use of these elements provides a convenient place to indicate what kind of correspondence is intended as in the following retagging of the preceding example.

Shirley, which made its Friday night debut only a month ago, was not listed on NBC's new schedule, although the network says the show still is being considered.

In the following example, we use exactly the same mechanism to express a correspondence amongst the anchors introduced following the fifth word after English in a text:

English language. Except for not very English at all at the time English was still full of flaws English.
This was revised by young

14.4.2 Alignment of Parallel Texts

One very important application area for the alignment of parallel texts is multilingual corpora. Consider, for example, the need to align "translation pairs" of sentences drawn from a corpus such as the Canadian Hansard, in which each sentence is given in both English and French. Concerning this problem, Gale and Church write:(81)

    Most English sentences match exactly one French sentence, but it is possible for an English sentence to match two or more French sentences. The first two English sentences [in the example below] illustrate a particularly hard case where two English sentences align to two French sentences. No smaller alignments are possible because the clause "...sales...were higher..." in the first English sentence corresponds to (part of) the second French sentence. The next two alignments ... illustrate the more typical case where one English sentence aligns with exactly one French sentence. The final alignment matches two English sentences to a single French sentence. These alignments [which were produced by a computer program] agreed with the results produced by a human judge.

The alignment produced by Gale and Church's program can be expressed in four different ways. The encoder must first decide whether to represent the alignment in terms of points within each text (using the <anchor> element) or in terms of whole stretches of text, using the <seg> element. To some extent the choice will depend on the process by which the software works out where alignment occurs, and the intention of the encoder. Secondly, the encoder may elect to represent the actual encoding using either corresp attributes attached to the individual <anchor> or <seg> elements, or using a free standing <linkGrp> element.

We present first a solution using <anchor> elements bearing only corresp attributes:

According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. Cola drink manufacturers in particular achieved above-average growth rates. The higher turnover was largely due to an increase in the sales volume. Employment and investment levels also climbed. Following a two-year transitional period, the new Foodstuffs Ordinance for Mineral Water came into effect on April 1, 1988. Specifically, it contains more stringent requirements regarding quality consistency and purity guarantees.

Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d'adeptes. En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment. La progression des chiffres d'affaires résulte en grande partie de l'accroissement du volume des ventes. L'emploi et les investissements ont également augmenté. La nouvelle ordonnance fédérale sur les denrées alimentaires concernant entre autres les eaux minérales, entrée en vigueur le 1er avril 1988 après une période transitoire de deux ans, exige surtout une plus grande constance dans la qualité et une garantie de la pureté.

There is no requirement that the corresp attribute be specified in both English and French texts, since (as noted above) this attribute is defined as representing a mutual association. However, it may simplify processing to do so, and also avoids giving the impression that the English is translating the French, or vice versa. More seriously, this encoding does not make explicit the fact that it is in fact the entire stretch of text between the anchors which is being aligned, not simply the points themselves. If for example one text contained material omitted from the other, this approach would not be appropriate.

We now present the same passage using the alternative mechanism and marking explicitly the segments which have been aligned:

According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. Cola drink manufacturers in particular achieved above-average growth rates. The higher turnover was largely due to an increase in the sales volume. Employment and investment levels also climbed. Following a two-year transitional period, the new Foodstuffs Ordinance for Mineral Water came into effect on April 1, 1988. Specifically, it contains more stringent requirements regarding quality consistency and purity guarantees.

Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d'adeptes. En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment. La progression des chiffres d'affaires résulte en grande partie de l'accroissement du volume des ventes. L'emploi et les investissements ont également augmenté. La nouvelle ordonnance fédérale sur les denrées alimentaires concernant entre autres les eaux minérales, entrée en vigueur le 1er avril 1988 après une période transitoire de deux ans, exige surtout une plus grande constance dans la qualité et une garantie de la pureté.

Note that use of the <seg> element allows us to mark up the orthographic sentences in both languages independently of the alignment: the first translation pair in this example might be marked up as follows:

According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. Cola drink manufacturers in particular achieved above-average growth rates.

Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d'adeptes. En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment.

14.4.3 A Three-way Alignment

The preceding encoding of the alignment of parallel passages from two texts requires that those texts and the alignment all be part of the same SGML document. If the texts are in separate documents, then additional elements must be supplied, as discussed in section 14.2, "Extended Pointers," on page 26. These external pointers may appear anywhere within the document, but if they are created solely for use in encoding links, they may for convenience be grouped within the <linkGrp> (or other grouping element that uses them for linking).
Figure 2: The figure shows the page from the Orbis pictus of Comenius which is discussed in the text.

To demonstrate this facility, we consider how we might encode the alignments in an extract from Comenius' Orbis Sensualium Pictus. Each topic covered in this work has three parts: a picture, a prose text in Latin describing the topic, and a carefully-aligned translation of the Latin into English, German or some other vernacular. Key terms in the two texts are typographically distinct, and are linked to the picture by numbers, which appear in the two texts and within the picture as well.(82)

First, we present the text portions. The English and Latin portions have been encoded as distinct

elements. Identifiers have been attached to each typographic line, but no other encoding added, to simplify the example.
The Study The Study is a place where a Student, a part from men, sitteth alone, addicted to his Studies, whilst he readeth Books,
Muséum Museum est locus ubi Studiosus, secretus ab hominibus, solus sedet, Studiis deditus, dum lectitat Libros,
Next we assume that we have stored a digitized image of the picture itself in some external entity we will call com98 (for further discussion of the handling of external images and graphics, see section 22.3, "Specific Elements for Graphic Images," on page 40). We further assume that we can address portions of this image as a two-dimensional co-ordinate space. The SPACE location method of the element (discussed in section 6.6, "Simple Links and Cross References," on page 12 above) can now be used to point to the whole picture and to two portions of it, one containing the picture of a student and the other of a book, as follows:

Note that each external pointer has its own unique identifier, in addition to the n attribute, which last holds the visible label (or "explainer") used for this image portion in the original.

As printed, the text exhibits three kinds of alignment.

  1. The English and Latin portions are printed in two parallel columns, with corresponding phrases, (represented above by elements), more or less next to each other.
  2. Particular words or phrases are marked as terms in the two languages by a change of rendition: the English text, which otherwise uses black letter type throughout, has the words The Study, a Student, Studies, and Books in a roman font; in the Latin text, which is printed in roman, the corresponding words (Museum, Studiosus, Studiis, and Libros) are all in italic.
  3. Numbered labels appear within the text portions, linking keywords to each other and to sections of the picture. These labels, which have been left out of the above encoding, are attached to the first, third and last segment in each language quoted below, and also appear (rather indistinctly) within the picture itself. If it is desired to transcribe them in the text, they might be encoded using as elements, elements, or s to the picture; the number itself would be transcribed as the value of the n attribute (or as the content of the ).
The first kind of alignment might be represented by using the corresp attribute on the element. The second kind might be represented by using the and mechanism described in section 6.3.4, "Terms, Glosses, and Cited Words," on page 11. The third kind of alignment might be represented using pointers embedded within the texts, although this would involve some duplication. We choose however to use the element, since this provides an efficient way of representing the three-way alignment between English, Latin and picture without redundancy.

This map, of course, only aligns whole segments and image portions, since these are the only parts of our encoding which bear identifiers and can therefore be pointed to. To add to it the alignment between the typographically distinct words mentioned above, new elements must be defined, either within the text itself or externally by using the extended pointer mechanism. Encoding these word pairs as and , although intuitively obvious, requires a non-trivial decision as to whether the Latin text is glossing the English, or vice-versa. Tagging all the marked words as avoids the difficult decision, but might be thought by some encoders to convey the wrong information about the words in question. Simply tagging them as additional embedded elements with identifiers that can be aligned like the others is also a possibility.

All of these require the addition of further markup to the text. This may pose no problems, or it may be infeasible (e.g. if the text is held on a read-only medium). If it is not feasible to add more markup to the original text, the extended pointer mechanism is likely to be the best choice. For example, to indicate that the words Studies and Studiis correspond, two external pointers might be defined and aligned as follows:

14.5 Synchronization

In the previous section we discussed two particular kinds of alignment: alignment of parallel texts in different languages; and alignment of texts and portions of an image.
In this section we address another specialized form of alignment: synchronization. The need to mark the relative positions of text components with respect to time arises most naturally and frequently in transcribed spoken texts, but it may arise in any text in which quoted speech occurs, or events are described within a time frame. The methods described here are also generalizable for other kinds of alignment (for example, alignment of text elements with respect to space), and may thus be regarded as providing a simplified version of the HyTime system of finite space co-ordinates.

14.5.1 Aligning Synchronous Events

To mark synchronous elements, the synch attribute, which is one of the linking attributes that are available for all text elements, may be used.

  synch : points to elements that are synchronous with the current element.

Alternatively, the <link> and <linkGrp> elements may be used to make explicit the fact that the synchronous elements are aligned.

To illustrate the use of these mechanisms for marking synchrony, consider the following representation of a spoken text:

  B: The first time in twenty five years, we've cooked Christmas (unclear) for a blooming great load of people.
  A: So you're [1] (unclear) [2]
  B: [1] It will be [2] nice in a way, but, [3] be strange. [4]
  A: [3] Yeah [4], yeah, cos it, it's [5] the [6]
  B: [5] not [6]

This representation uses numbers in brackets to mark the points at which speakers overlap each other. For example, the [1] in A's first speech is to be understood as coinciding with the [1] in B's second speech.(83)

To encode this we use the base tag set for spoken texts, described in chapter 11, "Transcriptions of Speech," on page 19, together with the additional tag set described in the present chapter. First, we transcribe this text, marking the synchronous points with <anchor> elements, and providing a synch attribute on one of each of the pairs of synchronous anchors.
As noted in the example given above (section 14.4.2, "Alignment of Parallel Texts,"), correspondence, and hence synchrony, is a symmetric relation; therefore the attribute need only be specified on one of the pairs of synchronous anchors.
So you're It will be nice in a way, but, be strange. Yeah , yeah, cos it, its the not Next, we encode the same example using and elements to make the temporal alignment explicit; the id attributes are provided for the and elements for a reason that is given in the next section, 14.5.2, "Placing Synchronous Events in Time."
The first time in twenty five years, we've cooked Christmas for a blooming great load of people. So you're It will be nice in a way, but, be strange Yeah, yeah, cos it, it's the not
As with other forms of alignment, synchronization may be expressed between stretches of speech as well as between points. When complete utterances are synchronous, for example, if one person says What? and another No! at the same time, that can be represented without elements as follows.

What? No!

A simple way of expressing overlap (where one speaker starts speaking before another has finished) is thus to use the element to encode the overlapping portions of speech. For example,

So you're It will be nice in a way, but, be strange. Yeah , yeah, cos it, its the not

Note in this encoding how synchronization has been effected between an empty element and a , and between an entire element and another , using the synch attribute. Alternatively, a could be used in the same way as above.

14.5.2 Placing Synchronous Events in Time

A synchronous alignment specifies which points in a spoken text occur at the same time, and the order in which they occur, but does not say at what time those points actually occur. If that information is available to the encoder it can be represented by means of the <when> and <timeline> elements, whose description and attributes are the following:

<when>: indicates a point in time either relative to other elements in the same tag, or absolutely. Attributes include:
  absolute : supplies an absolute value for the time.
  interval : specifies the numeric portion of a time interval
  unit : specifies the unit of time corresponding to the interval value.
  since : identifies the reference point for determining the time of the current element, which is obtained by adding the interval to the time of the reference point.
  id : supplies an identifier, unique to the document, for each element.

<timeline>: provides a set of ordered points in time which can be linked to elements of a spoken text to create a temporal alignment of that text. Attributes include:
  origin : designates the origin of the timeline, i.e. the time at which it begins.
  interval : specifies the numeric portion of a time interval
  unit : specifies the unit of time corresponding to the interval value of the timeline or of its constituent points in time.

Each <when> element indicates a point in time, either directly by means of the absolute attribute, whose value is a string which specifies a particular time, or indirectly by means of the since attribute, which points to another <when>. If the since is used, then the interval and unit attributes should also be used to indicate the amount of time that has elapsed since the time specified by the element pointed to by the since attribute; the value -1 can be given to indicate that the interval is unknown.

If the <when> elements are uniformly spaced in time, then the interval and unit values need be given once in the <timeline>, and not repeated in any of the <when> elements. If the intervals vary, but the units are all the same, then the unit attribute alone can be given in the <timeline> element, and the interval attribute given in the <when> element.

The origin attribute in the <timeline> element points to a <when> element which specifies the reference or origin for the timings within the <timeline>; this must, of course, specify its position in time absolutely.

The following <timeline> might be used to accompany the marked up conversation shown in the preceding section:

The information in this <timeline> could now be linked to the information in the <linkGrp> which provides the temporal alignment (synchronization) for the text, as follows:

To avoid the need for two distinct link groups (one marking the synchronization of anchors with each other, and the other marking their alignment with points on the time line) it would be better to link the <when> elements with the synchronous points directly:

Finally, suppose that a digitized audio recording is also available. The extended pointer syntax described in section 14.2, "Extended Pointers," on page 26 could be used to address positions on or portions of this recording directly.
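Before turning to how such a recording might be linked in, the rule for relative times stated above (the time of a point is the time of its reference point plus the interval) can be sketched in Python. This is only an illustration under stated assumptions: the dictionary keys mirror the id, since, and interval attributes, all intervals are taken to share one unit, a point without since is treated as the origin at offset 0, and the function name resolve_offsets is hypothetical. An interval of -1 is read as "unknown", as described above, and makes that point and everything depending on it unknown (None).

```python
def resolve_offsets(points, default_interval=None):
    """Return {id: offset from the origin} for a list of timeline points.

    default_interval plays the role of an interval given once for the
    whole timeline and inherited by points that do not state their own.
    """
    by_id = {p["id"]: p for p in points}
    offsets = {}

    def offset(pid, seen=()):
        if pid in offsets:
            return offsets[pid]
        if pid in seen:
            raise ValueError("circular since reference at %s" % pid)
        point = by_id[pid]
        since = point.get("since")
        if since is None:
            result = 0  # assumption: a point without since is the origin
        else:
            interval = point.get("interval", default_interval)
            base = offset(since, seen + (pid,))
            if base is None or interval is None or interval == -1:
                result = None  # unknown interval propagates
            else:
                result = base + interval
        offsets[pid] = result
        return result

    for p in points:
        offset(p["id"])
    return offsets


# Hypothetical points; identifiers and values are invented for illustration.
points = [
    {"id": "w0"},
    {"id": "w1", "since": "w0", "interval": 4.5},
    {"id": "w2", "since": "w1", "interval": 1.5},
    {"id": "w3", "since": "w2", "interval": -1},
    {"id": "w4", "since": "w3", "interval": 2},
]
print(resolve_offsets(points))
```

Converting an offset into a clock time would additionally require parsing the absolute value of the origin, which this sketch leaves out because the text above describes that value only as a string.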
Assuming that elements with identifiers X1, X2, etc., have been defined to do this, these identifiers could also be included as a fourth component in each of the above elements, thus providing a synchronized audio track to complement the transcribed text.

For further discussion of this and related aspects of encoding transcribed speech, refer to chapter 11, "Transcriptions of Speech," on page 19.

The <when> and <timeline> elements are defined as follows:

14.6 Identical Elements and Virtual Copies

This section introduces the notion of a virtual element, that is, an element which is not explicitly present in a text, but the presence of which an application can infer from the encoding supplied. In this section, we are concerned with virtual elements made by simply cloning existing elements. In the next section (section 14.7, "Aggregation," on page 29), we discuss virtual elements made by aggregating existing elements.

It is useful to be able to represent the fact that one element of text is identical to others, for analytical purposes, or (especially if the elements have lengthy content) to obviate the need to repeat the content. For example, consider the repetition of the <date> element in the following material:

In small clumsy letters he wrote: April 4th, 1984.

He sat back. A sense of complete helplessness had descended upon him.

His small but childish handwriting straggled up and down the page, shedding first its capital letters and finally even its full stops: April 4th, 1984. Last night to the flicks.

Suppose now that we wish to encode the fact that the second ele- ment above has identical content to the first. The sameAs attribute is provided for this purpose. Using it, we can recode the last line of the above example as follows: April 4th, 1984. Last night to the flicks. The sameAs attribute may be used to document the fact that two ele- ments have identical content. It may be regarded as a special kind of link. It should only be attached to an element with identical content to that it indicates, or to one the content of which clearly designates it as a repetition, such as the word repeat or bis in the representation of the chorus of a song, the second time it is to be sung. The relation specified by the sameAs attribute is symmetric: if a chorus is repeated three times and each repetition bears a sameAs attribute indicating the first occurrence of the element concerned, it is implied that each cho- rus is identical, and there is no need for the first occurrence to spec- ify any of its copies. The copyOf attribute is used in a similar way to indicate that the content of the element bearing it is identical to that of another. The difference is that the content is not itself repeated. The effect of this attribute is thus to create a virtual copy of the element indicat- ed. Using this attribute, the repeated date in the first example above could be recoded as follows: An application program should replace whatever is the actual content of an element bearing a copyOf attribute with the content of the element specified by it. If the content of the element specified includes other elements, these will become embedded within the element bearing the attribute. Care must be taken to ensure that the document is a legal SGML document both before and after this embedding takes place. 
If, for example, the element bearing a copyOf attribute requires a mandatory sub-component, then this component must be present (though possibly emp- ty), even though it will be replaced by the content of the targetted element. The following example demonstrates how the copyOf attribute may be used in conjunction with the element to highlight the differences between almost identical repetitions: My object all sublime I shall achieve in time To let the punishment fit the crime, ; And make each pris'ner pent Unwillingly represent A source of innocent merriment, ! His He will For further examples of the use of this attribute, see chapters 21, "Graphs, Networks, and Trees," on page 38 and 16, "Feature Structures," on page 30, where it is used to reduce the complexity of formal analytic representations of structure. 14.7 Aggregation Because of the strict hierarchical organization of an SGML document, or for other reasons, it may not always be possible or desirable to include all the parts of a possibly fragmented text segment within a single element. In section 14.1.4, "Intermediate Pointers," on page 26 we introduced the notion of an intermediate pointer as a way of pointing to discontinuous segments of this kind. In this section we first describe another way of linking the parts of a discontinuous whole, using a set of linking attributes, which are made available for any tag by following the procedure described at the beginning of this chapter. We then describe how the element may be used to aggregate such segments, and finally introduce the element, which is a special- purpose linking element specifically for representing the aggregation of parts, and the for grouping tags. The linking attributes for aggregation are next and prev; each of these attributes has a single identifier as its value: next : points to the next element of a virtual aggregate of which the current element is part. 
prev : points to the previous element of a virtual aggregate of which the current element is part. The element is also a member of the class of pointer elements, and so may carry any of the attributes of that class; for the list, see section 14.1, "Pointers," on page 26. Here is the material on which we base our first illustration of the use of these mechanisms. Our problem is to represent the S-units iden- tified below as qs3 and qs4 as a single (but discontinuous) whole: Monsieur Paul, after he has taken equal parts of goose breast and the finest pork, and broken a certain number of egg yolks into them, and ground them very, very fine, cooks all with seasoning for some three hours. But, she pushed her face nearer, and looked with ferocious gloating at the pâté inside me, her eyes like X rays, he never stops stirring it! Figure to yourself the work of it — stir, stir, never stopping! Using the prev and next attributes, we can link the s-units with identifiers s1 and s2, either singly or doubly as follows: But, he never stops stirring it! But, he never stops stirring it! But, he never stops stirring it! Double linking of the two S-units, as illustrated by the last of these encodings, is equivalent to specifying a tag: Such a element must carry type=join attribute value to specify that the link is to be understood as joining its targets into a single aggregate. The element is equivalent to a element of type join; unlike a link, the default value for the targOrder attribute which this element also inherits from the pointer class is Y. Also unlike the element, the element can additionally specify information about the virtual element which it represents, by means of its result attribute. And finally, unlike the element, the position of a element within a text is significant: it must be supplied at a position where the element indicated by its result attribute would be contextually legal. 
: identifies a possibly fragmented segment of text, by pointing at the possibly discontiguous elements which compose it. Attributes include: result : specifies the name of an element which this aggregation may be understood to represent. targets : specifies the SGML identifiers of the elements or passages to be joined into a virtual element. targOrder : specifies whether or not the order in which components of the join are listed on the targets attribute is significant. Legal values are: Y : Yes: the order should be followed when combining the targeted elements. N : No: the order has no significance when combining the targeted elements. U : Unspecified: the order may or may not be significant. : groups a collection of elements and possibly point- ers. Attributes include: result : describes the result of the joins gathered in this collec- tion. To conclude the above example, we now use a element to represent the virtual sentence formed by the aggregation of s1 and s2: As a further example, consider the following list of authors' names. The object of the element here is to provide another list, com- posed of those authors from the larger list who happen to come from Hei- delberg: Authors Figge, Udo Heibach, Christiane Heyer, Gerhard Philipp, Bettina Samiec, Monika Schierholz, Stefan The following example shows how can be used to reconstruct a text cited in fragments presented out of order. The poem being remem- bered (an unusual translation of a well known poem by Basho) runs "When the old pond / gets a new frog, / it's a new pond."

How does it go? da-da-da gets a new frog ...

When the old pond ...

... It's a new pond.
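The reordering that an application would perform when resolving such an aggregate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers frog.l1, frog.l2 and frog.l3 are hypothetical (the identifiers in the original example are not reproduced here), the function name resolve_join is invented for the sketch, and falling back to document order when targOrder is N or U is one possible reading rather than a rule given in these Guidelines.

```python
def resolve_join(targets, elements, targ_order="Y"):
    """Return the content of the joined elements as a list of strings.

    targets    -- identifiers as they would be listed on a targets attribute
    elements   -- mapping of identifier -> content, kept in document order
    targ_order -- "Y", "N" or "U", mirroring the targOrder attribute
    """
    missing = [t for t in targets if t not in elements]
    if missing:
        raise KeyError(f"unknown target identifiers: {missing}")
    if targ_order == "Y":
        # The order given on the targets attribute is significant.
        ordered = list(targets)
    else:
        # Assumption of this sketch: with N or U, use document order.
        wanted = set(targets)
        ordered = [ident for ident in elements if ident in wanted]
    return [elements[ident] for ident in ordered]


# Hypothetical identifiers; the fragments appear in the order they are cited.
fragments = {
    "frog.l2": "gets a new frog",
    "frog.l1": "When the old pond",
    "frog.l3": "It's a new pond.",
}

poem = resolve_join(["frog.l1", "frog.l2", "frog.l3"], fragments)
as_cited = resolve_join(["frog.l1", "frog.l2", "frog.l3"], fragments,
                        targ_order="N")
```

With targ_order set to Y the fragments come back in the order of the remembered poem; with N they come back in the order in which they were cited.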

As with other forms of link, a grouping element is avail- able for use when a number of elements of the same kind co-occur. This avoids the need to specify the result attribute for each if they are all of the same type, and also allows us to restrict the domain within which their target elements are to be found, in the same way as for elements (see 14.1.3, "Groups of Links," on page 26). Like a , a may appear only where the elements represent- ed by its contents are legal. Thus if we had created many tags of the sort just described, we could group them together, and require that their components are all contained by an element with the identifi- er MFKFhungry as follows: The element is useful as a means of representing non- hierarchic structures (as further discussed in chapter 31, "Multiple Hierarchies," on page 48). It may also be used as a convenient way of representing a variety of analytic units, like the and elements discussed in chapter 15, "Simple Analytic Mechanisms." As an example, consider the following passage: Zui-Gan called out to himself every day, "Master." Then he answered himself, "Yes, sir." And then he added, "Become sober." Again he answered, "Yes, sir." "And after that," he continued, "do not be deceived by others." "Yes, sir; yes, sir," he replied. Suppose now that we wish to represent an interpretation of the above passage in which we distinguish between the various "voices" adopted by the character Zui-Gan. In the following encoding, the who attribute has been used for this purpose; id attributes have also been added:

Zui-Gan called out to himself every day, Master.

Then he answered himself, Yes, sir.

And then he added, Become sober.

Again he answered, Yes, sir.

And after that, he continued, do not be deceived by others.

Yes, sir; yes, sir, he replied.

The id values specified now allow us to link the material spoken by each voice:

Zui-Gan called out to himself every day, Master.

Then he answered himself, Yes, sir.

And then he added, Become sober.

Again he answered, Yes, sir.

And after that, he continued, do not be deceived by others.

Yes, sir; yes, sir, he replied.
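A chain of next pointers of this kind can be followed mechanically to assemble the speech of each voice. The Python sketch below is a rough illustration only: the identifiers zuiq1 to zuiq7, the dictionary layout, and the function names are all assumptions made for the example, not part of the encoding above.

```python
def follow_chain(start, next_of):
    """Follow next pointers from start and return the identifiers visited."""
    chain, seen, current = [], set(), start
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in next chain at {current!r}")
        seen.add(current)
        chain.append(current)
        current = next_of.get(current)
    return chain


def aggregate(start, next_of, content):
    """Concatenate the content of every element in the chain from start."""
    return " ".join(content[ident] for ident in follow_chain(start, next_of))


# Hypothetical identifiers for the quoted segments, in document order.
content = {
    "zuiq1": "Master.",
    "zuiq2": "Yes, sir.",
    "zuiq3": "Become sober.",
    "zuiq4": "Yes, sir.",
    "zuiq5": "And after that,",
    "zuiq6": "do not be deceived by others.",
    "zuiq7": "Yes, sir; yes, sir,",
}

# Single linking: each element names the next element of its aggregate.
next_of = {
    "zuiq1": "zuiq3", "zuiq3": "zuiq5", "zuiq5": "zuiq6",  # first voice
    "zuiq2": "zuiq4", "zuiq4": "zuiq7",                    # second voice
}

# Double linking adds the inverse prev pointers.
prev_of = {target: source for source, target in next_of.items()}

first_voice = aggregate("zuiq1", next_of, content)
second_voice = aggregate("zuiq2", next_of, content)
```

The prev mapping is derived here by inverting next, which reflects the idea that double linking states the same relation from both ends.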

However, by using the element, we can directly represent the complete speech attributed to each voice: Note the use of the global n attribute to supply a descriptive name to distinguish the two virtual elements represented by the elements; this is necessary because the current proposals do not allow for any way of specifying the attributes to be associated with a virtual element, and hence we cannot specify a who value for them. Suppose now that id attributes, for whatever reasons, are not avail- able. Then elements may be created using any of the methods described in section 14.2, "Extended Pointers," on page 26. The id attributes of these elements may now be specified by the targets attri- bute on the elements.

Zui-Gan called out to himself every day, Master.

Then he answered himself, Yes, sir.

And then he added, Become sober.

Again he answered, Yes, sir.

And after that, he continued, do not be deceived by others.

Yes, sir; yes, sir, he replied.

For a definition of the syntax used by the element, see sec- tion 14.2.2, "Extended Pointer Syntax," on page 26 above. The extended pointer with identifier rzuiq2 (for example) may be read as "the first in the first

, inside the sixth within the second element of the current document." As mentioned above, there is no need for the and elements to be held in the same SGML document as the text; indeed, if, for example, the text is held on a read-only medium, this may not be possible. The doc attribute of the element may be used to specify the name of the SGML entity within which its target is to be found. Here are the formal declarations of the and elements. 14.8 Alternation This section proposes elements for the representation of alternation. We say that two or more elements are in exclusive alternation if any of those elements could be present in a text, but one and only one of them is; in addition, we say that those elements are mutually exclusive. We say that the elements are in inclusive alternation if at least one (and possibly more) of them is present. The elements that are in alternation may also be called alternants. The need to mark exclusive alternation arises frequently in text encoding. A common situation is one in which it can be determined that exactly one of several different words appears in a given location, but it cannot be determined which one. One way to mark such an exclusive alternation is to use the linking attribute exclude. Having marked an exclusive alternation, it can sometimes later be determined which of the alternants actually appears in the given location. To preserve the fact that an alternation was posited, one can add the linking attribute select, pointing to the alternant which actually appears, to a tag which hierarchically encompasses the alternants. To assign responsibility and degree of certainty to the choice, one can use the tag described in chapter 17, "Certainty and Responsibility," on page 33. Also see that chapter for further discussion of certainty in general. The exclude and select attributes may be used with any element, provided that they have been declared following the procedure discussed in the introduction to this chapter.
exclude : points to elements that are in exclusive alternation with the current element. select : selects one or more alternants; if one alternant is selected, the ambiguity or uncertainty is marked as resolved. If more than one alternant is selected, the degree of ambiguity or uncertainty is mark- ed as reduced by the number of alternants not selected. A more general way to mark alternation, encompassing both exclusive and inclusive alternation, is to use the linking element . The description and attributes of this tag and of the associated grouping tag are as follows. These elements are also members of the pointer class and therefore have all the attributes associated with that class. : identifies an alternation or a set of choices among elements or passages. Attributes include: targets : specifies the SGML identifiers of the alternative elements or passages. weights : If mode=excl, each weight states the probability that the corresponding alternative occurs. If mode=incl each weight states the probability that the corresponding alternative occurs given that at least one of the other alternatives occurs. : groups a collection of elements and possibly point- ers. To take a simple hypothetical example, suppose in transcribing a spo- ken text, we encounter an utterance that we can understand either as We had fun at the beach today. or as We had sun at the beach today. We can represent the exclusive alternation of these two possibilities by means of the exclude attribute as follows.

We had fun at the beach today. We had sun at the beach today.
If it is then determined that the speaker said fun, not sun, the encoder could amend the text by deleting the alternant containing sun and the exclude attribute on the remaining alternant. Alternatively, the encoder could preserve the fact that there was uncertainty in the original transcription by retaining the alternants, and assigning the select=we.fun attribute value to the
tag that encompasses the alternants, as in:
We had fun at the beach today. We had sun at the beach today.
The above alternation (including the select attribute) could be recoded by assigning the exclude attributes to tags that enclose just the words or even the characters that are mutually exclusive, as in:(84)
We had fun sun at the beach today.
We had f s un at the beach today.
Now suppose that the transcriber is uncertain whether the first word in the utterance is We or Lee, but is certain that if it is Lee, then the other uncertain word is definitely fun and not sun. The three utterances that are in mutual exclusion can be encoded as follows.
We had fun at the beach today. We had sun at the beach today. Lee had fun at the beach today.
The preceding example can also be encoded with exclude attributes on the word segments We, Lee, fun and sun: We Lee had fun sun at the beach today. The value of the select attribute is defined as a list of identifiers (IDREFS); hence it can also be used to narrow down the range of alternants, as in:
We had fun at the beach today. We had sun at the beach today. Lee had fun at the beach today.
This is interpreted to mean that either the first or the third tag appears, and is thus equivalent to just the alternation of those two tags:
We had fun at the beach today. Lee had fun at the beach today.
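The narrowing effect of select can be modelled very simply. In the Python sketch below, we.fun is the identifier used earlier in this section; we.sun and lee.fun, together with the function names, are hypothetical and introduced only for illustration.

```python
def remaining_alternants(alternants, select=None):
    """Return the alternants still in play once select has been applied.

    alternants -- identifiers of the mutually exclusive elements
    select     -- identifiers listed on a select attribute, or None
    """
    if not select:
        return list(alternants)
    unknown = [ident for ident in select if ident not in alternants]
    if unknown:
        raise KeyError(f"select names unknown alternants: {unknown}")
    chosen = set(select)
    return [ident for ident in alternants if ident in chosen]


def is_resolved(alternants, select=None):
    """True when exactly one alternant remains, i.e. nothing is uncertain."""
    return len(remaining_alternants(alternants, select)) == 1


# we.sun and lee.fun are assumed identifiers for this sketch.
utterances = ["we.fun", "we.sun", "lee.fun"]

narrowed = remaining_alternants(utterances, ["we.fun", "lee.fun"])
still_open = is_resolved(utterances, ["we.fun", "lee.fun"])
settled = is_resolved(utterances, ["we.fun"])
```

Selecting two of the three alternants reduces the uncertainty without resolving it; selecting one resolves it.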
The exclude attribute can also be used in case there is uncertainty about the tag that appears in a certain position. For example, the occurrence of the word May in the S-unit Let's go to May can be inter- preted, in the absence of other information, either as a person's name or as a date. The uncertainty can be rendered as follows, using the exclude attribute. Let's go to May . Note the use of the copyOf attribute discussed in section 14.6, "Identical Elements and Virtual Copies," on page 28. This avoids having to repeat the content of the element whose correct tagging is in doubt. The copyOf and the exclude attributes also provide for a simple way of indicating uncertainty about exactly where a particular element occurs in a document.(85) For example suppose that a particular element appears either as the third and last of the elements within the first element in the body of a document, or as the first of the second . One solution would be to record the in its entirety in the first of these positions, and a virtual copy of it in the second, and mark them as excluding each other as fol- lows: In this case, the select attribute, if used, would appear on the tag. Mutual exclusion can also be expressed using a ; the first example in this section can be recoded by removing the exclude attri- butes from the tags, and adding a as follows:(86)
We had fun at the beach today. We had sun at the beach today.
Now we define the specialized linking element <alt>, making it a member of the pointer class of elements, and assigning it an excl (for mutually exclusive) attribute, which can have either of the values Y or N. Then the following equivalence holds: = It is in the nature of alternation that the order of the targets is irrelevant; hence the targOrder attribute of the <alt> defaults to the value N. The preceding <link> may therefore be recoded as the following <alt> tag. Other attributes that are defined specifically for the <alt> element are weights and percent. The weights attribute is to be used if one wishes to assign probabilistic weights to the targets (alternants). Its value is a list of numbers, corresponding to the targets, expressing the probability that each target appears. The percent attribute is used to indicate whether the weights are stated as percentages (percent=Y, the default) or as the actual probabilities (percent=N). If the alternants are mutually exclusive, then the weights must sum to 100% (or 1, if percent=N is specified). Suppose in the preceding example that it is equiprobable whether fun or sun appears. Then the <alt> that represents the alternation may be stated as follows: The assignment of a weight of 100% to one target (and weights of 0% to all the others) is equivalent to selecting that target. Thus the following encoding is equivalent to the second example at the beginning of this section.
We had fun at the beach today. We had sun at the beach today.
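The constraints on weights for mutually exclusive alternants can be checked mechanically. The following Python sketch is illustrative only: the function names are invented, the identifier we.sun is hypothetical, and the tolerance used for the floating-point comparison is an arbitrary choice of this sketch.

```python
import math


def to_probabilities(weights, percent="Y"):
    """Convert weights to probabilities, honouring the percent attribute."""
    scale = 100.0 if percent == "Y" else 1.0
    return [w / scale for w in weights]


def check_exclusive(weights, percent="Y"):
    """Validate weights for mutually exclusive alternants.

    Each weight must lie between 0 and 1 after scaling, and the weights
    must sum to 100% (or to 1 when percent is N).
    """
    probs = to_probabilities(weights, percent)
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("each weight must lie between 0% and 100%")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("weights of exclusive alternants must sum to 100%")
    return probs


def implied_selection(targets, weights, percent="Y"):
    """Return the target carrying the whole weight, or None if none does."""
    if len(targets) != len(weights):
        raise ValueError("one weight is needed for each target")
    probs = check_exclusive(weights, percent)
    for target, p in zip(targets, probs):
        if math.isclose(p, 1.0, abs_tol=1e-9):
            return target
    return None


targets = ["we.fun", "we.sun"]  # we.sun is an assumed identifier

equiprobable = check_exclusive([50, 50])
selected = implied_selection(targets, [100, 0])
undecided = implied_selection(targets, [0.5, 0.5], percent="N")
```

A weight list of 100 and 0 yields the first target as the implied selection, while equal weights leave the alternation undecided.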
The sum of the weights for tags ranges from 0% to (100 x k)%, where k is the number of targets. If the sum is 0%, then the alternation is equivalent to exclusive alternation; if the sum is (100 x k)%, then all of the alternants must appear, and the situation is better encoded without an tag. If it is desired, tags may be grouped together in an tag, and attribute values shared by the individual tags may be identified on the tag. The targFunc attribute defaults to the value 'first.alternant next.alternant'. Thus, specifying the extend- Targ=2 attribute value permits the alternants to be extended indefinite- ly. To illustrate, consider again the example of a transcribed utterance, in which it is uncertain whether the first word is We or Lee, whether the third word is fun or sun, but that if the first word is Lee, then the third word is fun. Now suppose we have the following additional information: if we occurs, then the probability that fun occurs is 50% and that sun occurs is 50%; if fun occurs, then the probability that we occurs is 40% and that Lee occurs is 60%. This situation can be encoded as follows. We Lee/seg> had fun sun at the beach today. From the information in this encoding, we can determine that the probability is about 28.5% that the utterance is "We had fun at the beach today", 28.5% that it is We had sun at the beach today, and 43% that it is Lee had fun at the beach today. Another very similar example is the following regarding the text of a Broadway song. In three different versions of the song, the same line reads "Her skin is tender as a leather glove," "Her skin is tender as a baseball glove," and "Her skin is tender as Dimaggio's glove."(87) If we wish to express this textual variation using the element, we can record our relative confidence in the readings Dimaggio's (with probability 50%), a leather (25%), and a baseball (25%). 
Let us extend the example with a further (imaginary) variation, sup- posing for the sake of the argument that the next line is variously giv- en as and she bats from right to left (with probability 50%) or now ain't that too damn bad (with probability 50%). Using the ele- ment, we can express the conviction that if the first choice for the second line is correct, then the probability that the first line con- tains Dimaggio's is 90%, and each of the others 5%; whereas if the sec- ond choice for the second line is correct, then the probability that the first line contains Dimaggio's is 10%, and each of the others is 45%. This can be encoded, with an tag containing a combination of exclusive and inclusive tags, as follows.
Her skin is tender as Dimaggio's a leather a baseball glove, and she bats from right to left. now ain't that too damn bad.
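The conditional weights given for this example are consistent with the overall confidence stated earlier for the three readings of the first line (50%, 25% and 25%). The Python sketch below checks that arithmetic; the short keys bats and bad for the two versions of the second line, and the function names, are labels made up for the illustration.

```python
def joint_probabilities(prior, conditional):
    """Return P(first-line reading, second-line reading) for every pair."""
    return {
        (reading, second): p_second * p_reading
        for second, p_second in prior.items()
        for reading, p_reading in conditional[second].items()
    }


def marginal_first_line(prior, conditional):
    """Sum the joint probabilities over the second-line readings."""
    totals = {}
    for (reading, _second), p in joint_probabilities(prior, conditional).items():
        totals[reading] = totals.get(reading, 0.0) + p
    return totals


# Probability of each version of the second line (hypothetical keys).
second_line = {"bats": 0.5, "bad": 0.5}

# Probability of each first-line reading, given the second line.
first_given_second = {
    "bats": {"Dimaggio's": 0.90, "a leather": 0.05, "a baseball": 0.05},
    "bad": {"Dimaggio's": 0.10, "a leather": 0.45, "a baseball": 0.45},
}

marginals = marginal_first_line(second_line, first_given_second)
```

Up to floating-point rounding, the marginals come out at one half for Dimaggio's and one quarter for each of the other two readings.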
Here are the formal declarations of the and elements. 14.9 Connecting Analytic and Textual Markup In chapters 15, "Simple Analytic Mechanisms," and 16, "Feature Struc- tures," on page 30 and elsewhere, provision is made for analytic and interpretive markup to be represented outside of textual markup, either in the same document or in a different document. The elements in these separate domains can be connected, either with the pointing attributes ana (for analysis) and inst (for instance), or by means of and elements. Numerous examples are given in these chapters, par- ticularly in sections 15.4, "Linguistic Annotation," on page 30, 16.3, "Feature, Feature-Structure and Feature-Value Libraries," on page 31 and 16.10, "Two Illustrations," on page 33. --------------------------------- (73) We use the term alignment as a special case of the more general notion of correspondence. Let A stand for an element with the attribute id=A, and suppose elements A1, A2 and A3 occur in that order and form one group, while elements B1, B2 and B3 occur in that order and form another group. Then a relation in which A1 corresponds to B1, A2 corresponds to B2 and A3 corresponds to B3 is an alignment. On the other hand, a relation in which A1 corre- sponds to B2, B1 to C2, and C1 to A2 is not an alignment. (74) The type attribute on the note is used to classify the notes using the typology established in the Advertisement to the work: "The Imitations of the Ancients are added, to gratify those who either never read, or may have forgotten them; together with some of the Parodies, and Allusions to the most excellent of the Moderns." In the source text, the text of the poem shares the page with two sets of notes, one headed "Remarks" and the other "Imitations". (75) No special element is provided for this purpose at present: the information should be supplied as a series of paragraphs at the end of the element described in section 5.3, "The Encod- ing Description," on page 8. 
(76) HyTime is an international standard (ISO 10744) built on SGML. It provides facilities for representing both static and dynamic infor- mation for processing and interchange by hypertext and multimedia applications. See ISO/IEC 10744 Information Technology -- Hypermedia/Time-based Structuring Language (HyTime) ([Geneva]: International Organization for Standardization, 1992). (77) The notation used for this formal grammar is that defined in chap- ter 39, "Formal Grammar for the TEI-Interchange-Format Subset of SGML," on page 77. (78) Strictly speaking, |n| (absolute value of n) children. (79) See section 15.3, "Spans and Interpretations," on page 30, where the text from which this fragment is taken is analyzed. (80) The corresp attribute is thus distinct from the target attribute in that it is understood to create a double, rather than a single, link. It is also distinct from the targets attribute in that the latter lists all the identifiers of the elements that are doubly linked, whereas the corresp doubly links the element that bears the attribute with the element(s) that make up the value of the attri- bute. (81) See William A. Gale and Kenneth W. Church, "Program for aligning sentences in bilingual corpora", Computational Linguistics 19 (1993): 75-102, from which the example in the text is taken. (82) Our example uses the English translation of Charles Hoole (1659), and is taken from John E. Sadler, ed., John Amos Comenius Orbis Pictus: a facsimile of the first English edition of 1659 (Oxford: Oxford University Press, 1968) (The Juvenile Library). (83) This sample is taken from a conversation collected and transcribed for the British National Corpus. (84) See section 15.1, "Linguistic Segment Categories," for discussion of the and tags that can be used in the following examples instead of the and tags. (85) An alternative way of representing this problem is discussed in chapter 17, "Certainty and Responsibility," on page 33. 
(86) In this example, we have placed the next to the tags that represent the alternants. It could also have been placed elsewhere in the document, perhaps within a . (87) The variant readings are found in the commercial sheet music, the performance score, and the Broadway cast recording.