Guidelines for SGML Text Mark-up at the Electronic Text Center
David Seaman, Electronic Text Center, University of Virginia
The TEI header
The TEI header is a vital part of any text we prepare. It is a
record of the print source for the electronic text, of the work we
have done on the electronic text, of the creation of the electronic
text, and it provides various date and keyword fields for our search
tools. It is also the source of the USMARC record that goes into our
online library catalog.
UVa text processors use a "fill-in-the-blanks" web form to create
TEI headers. This form reads in an SGML template and configures itself
to it, saving out valid TEI and an automatically-generated MARC record.
Below are some examples of the principal different types of printed and
manuscript materials for which we create headers:
The version of the TEI header that we use comprises four
major sections:
<teiHeader>
- <fileDesc>...</fileDesc>
- <encodingDesc>...</encodingDesc>
- <profileDesc>...</profileDesc>
- <revisionDesc>...</revisionDesc>
</teiHeader>
- The File Description -- <fileDesc> -- contains a
full bibliographical description of the computer file -- title,
author, creator of electronic version, publisher of electronic
version, the size of completed file, in KB -- along with
information about the printed source from which the electronic
text was derived (contained within the <sourceDesc>).
Notes: annotations about the electronic text go in the first <notesStmt>
field; notes about the physical object -- the book in hand -- go in the
<notesStmt> field in the <sourceDesc> field. It can be difficult
sometimes to determine which is which -- ask for help in this case. In
disputed cases, default to the <notesStmt> field in the
<sourceDesc>.
Editions, impressions, reprints: if in doubt about
what constitutes an edition and an impression, see David or Catherine.
As a rule of thumb, identical pagination and lineation between two
versions of a text means that they are different impressions of the same
edition -- they have been printed from the same physical printing
plates. Covers, illustrations, titlepage dates may of course be quite
different between two such impressions.
- The Encoding Description -- <encodingDesc> -- allows for
detailed description of whether (or how) the text was normalized
during transcription, how the encoder resolved ambiguities in the
source, what levels of encoding or analysis were applied, and so
on.
- The Text Profile Description -- <profileDesc> --
provides a detailed description of non-bibliographic aspects of
the text, specifically the languages used, the situation in which it
was produced, the participants, and their setting.
Note that the <date> field in the <creation> section is
vital; OpenText reads this when it constructs its "Centuries" document
structures. A missing or incorrect <date> here will result in
the work being left out or misplaced in the "Centuries" group.
The <keywords> fields should always include the following:
- either "fiction" or "non-fiction"
- always use at least one of the following:
"drama" ; "prose" ; "poetry". For drama, if in verse, add "verse".
- always "masculine" or "feminine"; if joint authorship, use both.
- when appropriate, use any of the following:
African American/Native American/American Civil War/Thomas
Jefferson/Women Writers/Young Readers/Literature in Translation/
Special Collections.
To get a feel for how we use these, see the subsets online under Modern
English.
- The Revision History Description -- <revisionDesc> --
allows present and future encoders to provide a history of changes
made during the development of the electronic text.
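As a rough illustration of the placement rules described in the list above, the fragment below sketches where the two kinds of notes go and how the <creation> date and <keywords> might be filled in for Jane Austen's Emma. The date, the note text, and the keyword choices are illustrative assumptions, not copied from an actual Center header. Giving each keyword its own <term> is also an assumption, and the other required <fileDesc> elements are left out for brevity.

```xml
<teiHeader type="aacr2">
<fileDesc>
<!-- titleStmt, extent, publicationStmt omitted in this sketch -->
<notesStmt>
<!-- first notesStmt: annotations about the electronic text -->
<note>Illustrations have been included from the print version.</note>
</notesStmt>
<sourceDesc>
<biblFull>
<!-- titleStmt, publicationStmt, etc. of the print source omitted -->
<notesStmt>
<!-- notesStmt inside sourceDesc: notes about the book in hand -->
<note>Hypothetical note about the physical copy.</note>
</notesStmt>
</biblFull>
</sourceDesc>
</fileDesc>
<profileDesc>
<creation>
<!-- read by OpenText to build the "Centuries" groups; value assumed -->
<date>1815</date>
</creation>
<textClass>
<keywords>
<term>fiction</term>
<term>prose</term>
<term>feminine</term>
<term>Women Writers</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
```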
The University of Virginia Etext Center Header: TEMPLATE
<teiHeader type="aacr2">
<fileDesc>
<titleStmt>
<title>
The work's title [a machine-readable transcription]</title>
<author>The work's author, last name first</author>
<respStmt>
<resp>Creation of machine-readable version: </resp>
<name>creator of electronic version</name>
<resp>Creation of digital images: </resp>
<name>creator of image(s)</name>
<resp>Conversion to TEI.2-conformant markup: </resp>
<name>University of Virginia Library Electronic Text Center.</name>
</respStmt>
</titleStmt>
<extent>ca. XXX kilobytes </extent>
<publicationStmt>
<publisher>University of Virginia Library.</publisher>
<pubPlace>Charlottesville, Va.</pubPlace>
<idno type="ETC">collection and ID, e.g. Modern English, AusEmma</idno>
<availability>
<p>Place where text can be found, e.g. Available from: Oxford Text Archive</p>
<p>URL: http://etext.lib.virginia.edu/modeng.browse.html</p>
<p>Available commercially from:</p>
</availability>
<date>Current year</date>
</publicationStmt>
<seriesStmt>
<p>Name of electronic series, if any</p>
</seriesStmt>
<notesStmt>
<note>Illustrations have been included from the print version. Note
about image, if needed; note, for instance, if source differs from
print source.</note>
<note>any other notes</note>
</notesStmt>
<sourceDesc>
<biblFull>
<titleStmt>
<title>The work's title</title>
<title level="a|m|j|s|u">The title of the physical volume, if different</title>
<author>The author's name, first name first</author>
<respStmt>
<resp>e.g. Editor / Translator / Annotator</resp>
<name></name>
</respStmt>
</titleStmt>
<editionStmt>
<p>Edition information, e.g. 1st Edition.</p>
</editionStmt>
<extent></extent>
<publicationStmt>
<publisher></publisher>
<pubPlace>place of publication</pubPlace>
<date>date of publication</date>
</publicationStmt>
<seriesStmt>
<p>Name of print series.</p>
</seriesStmt>
<notesStmt>
<note></note>
</notesStmt>
</biblFull>
</sourceDesc>
</fileDesc>
<encodingDesc>
<projectDesc>
<p>Prepared for the University of Virginia Library Electronic Text
Center.</p>
</projectDesc>
<editorialDecl>
<p>All quotation marks retained as data.</p>
<p>Spell-check and verification made
against printed text using WordPerfect spell checker.</p>
<p>All unambiguous end-of-line hyphens have been removed, and the
trailing part of a word has been joined to the preceding line.</p>
<p>The images exist as archived TIFF images, one or more JPEG versions
for general use, and thumbnail GIFs.</p>
<p id="ETC">Keywords in the header are a local Electronic Text Center scheme
to aid in establishing analytical groupings.</p>
</editorialDecl>
<refsDecl>
<p>ID elements are given for each page element and are composed of the text's
unique cryptogram and the given page number, as in AusEmma1 for page one of
Jane Austen's Emma.</p>
</refsDecl>
<classDecl>
<taxonomy id="LCSH">
<bibl>
<title>Library of Congress Subject Headings</title>
</bibl>
</taxonomy>
</classDecl>
</encodingDesc>
<profileDesc>
<creation>
<date>First published date</date>
</creation>
<langUsage>
<language id="">languages used in the text; use one
"language" pair of tags for each language, and for the id= value, use
an ISO 639 code</language>
</langUsage>
<textClass>
<keywords>
<term>fiction or non-fiction; poetry, prose, or drama</term>
</keywords>
<keywords scheme="LCSH">
<term>LCSH</term>
</keywords>
</textClass>
<textClass>
<keywords>
<term type="artist">name of illustrator, painter, etc.
</term>
<term type="visual work">engraving/painting/illustration, </term>
</keywords>
<keywords>
<term>24-bit color; 400 dpi [or variant]</term>
</keywords>
</textClass>
</profileDesc>
<revisionDesc>
<change>
<date>date of changes</date>
<respStmt>
<resp>corrector</resp>
<name>who made the changes</name>
</respStmt>
<item>what was done</item>
</change>
</revisionDesc>
</teiHeader>
The Tags Exclusive to the Header
The global attributes are as follows:
<teiHeader>
supplies the descriptive and declarative information making
up an "electronic title page" prefixed to every TEI-conformant text.
May contain: encodingDesc fileDesc profileDesc revisionDesc
Attributes: global plus the following:
type : specifies the kind of document to which the header is attached.
creator : identifies the creator of the teiHeader, using the name or
initials of the person or institution responsible.
status : indicates whether the header is new or has been substantially
revised. Legal values are: "new" or "update".
date.created : indicates when the first version of the header was created.
date.updated : indicates when the current version of the header was created.
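A minimal sketch of these attributes on the opening tag might look like the following. The type value is the one used in the template above; the creator value and the date are placeholders invented for illustration, and the date format shown is an assumption.

```xml
<!-- creator and date.created values are hypothetical -->
<teiHeader type="aacr2" creator="Etext" status="new" date.created="1996-01-15">
<!-- fileDesc, encodingDesc, profileDesc, revisionDesc go here -->
</teiHeader>
```

When a header is substantially revised, status would presumably change to "update" and date.updated would be added alongside date.created.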
<fileDesc>
contains a full bibliographic description of an electronic file including
statements of responsibility and a full bibliographic description for the
source or sources from which the electronic text was derived.
May contain: editionStmt extent notesStmt publicationStmt seriesStmt
sourceDesc titleStmt
Attributes: global
<titleStmt>
May contain: title author editor sponsor funder principal
respStmt
Attributes: global
<sponsor>
May occur within: titleStmt
May contain: #PCDATA ident code kw abbr address date name num
rs time add corr
del gap orig reg sic unclear emph foreign gloss hi mentioned
soCalled term title ptr ref
xptr xref anchor s seg gi formula
Attributes: global
<funder>
specifies the name of an individual, institution, or organization responsible
for the funding of a project or text. Funders provide
financial support for a project; they
are distinct from sponsors, who provide intellectual support and authority.
May occur within: titleStmt
May contain: #PCDATA abbr add address code corr date del emph foreign formula
gap gi gloss ident hi kw lang mentioned name num orig ref reg
rs s seg sic soCalled term
time title xptr xref
Attributes: global
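To make the distinction between sponsors and funders concrete, here is a small sketch of a <titleStmt> that records both. The title and author lines reuse the template's placeholders; the sponsor and funder contents are placeholders too, not real bodies.

```xml
<titleStmt>
<title>The work's title [a machine-readable transcription]</title>
<author>The work's author, last name first</author>
<!-- sponsor: provides intellectual support and authority -->
<sponsor>Name of sponsoring institution</sponsor>
<!-- funder: provides financial support -->
<funder>Name of funding body</funder>
</titleStmt>
```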
<principal>
May contain: #PCDATA ident code kw abbr address date name num
rs time add corr del gap orig reg sic unclear emph foreign gloss
hi mentioned soCalled term title ptr ref xptr xref anchor s seg
gi formula
Attributes: global
<editionStmt>
groups information relating to one edition of a text.
May contain: edition respStmt p
Attributes: global
Example:
<editionStmt>
<edition n=S2>Students' edition</edition>
<respStmt> <resp>Adapted by
</resp><name>Elizabeth Kirk</name>
</respStmt>
</editionStmt>
<edition>
describes the particularities of one edition
of a text.
May occur within: bibl editionStmt
May contain: #PCDATA abbr add address anchor code corr date
del emph foreign
formula gap gi gloss hi ident kw mentioned name num orig ptr
ref reg rs s seg sic
soCalled term time title xptr xref
Attributes: global
Example:
<edition>First edition <date>Oct 1990</date>
</edition>
<edition n=S2>Students' edition </edition>
<extent>
describes the approximate size of the electronic text as stored on
some carrier medium, specified in any convenient units.
May occur within: bibl biblFull fileDesc
May contain: #PCDATA abbr add address anchor code corr date
del emph foreign
formula gap gi gloss hi ident kw mentioned name num orig ptr
ref reg rs s seg sic
soCalled term time title xptr xref
Attributes: global
Example:
<extent>3200 sentences </extent>
<extent>ten 3.5 inch high density diskettes </extent>
<publicationStmt>
groups information concerning the publication or distribution of
an electronic or other text.
May occur within: biblFull fileDesc
May contain: address authority availability date distributor
idno p
publisher pubPlace
Attributes: global
Example:
<publicationStmt>
<publisher>Chadwyck Healey </publisher>
<pubPlace>Cambridge </pubPlace>
<availability>Available under licence only
</availability>
<date>1992 </date>
</publicationStmt>
<distributor>
supplies the name of a person or other agency responsible for the
distribution of a text.
May occur within: publicationStmt
May contain: #PCDATA abbr add address anchor code corr date
del emph foreign formula gap gi gloss hi ident kw mentioned name num orig ptr
ref reg rs s seg sic soCalled term time title xptr xref
Attributes: global.
<authority>
May contain: #PCDATA ident code kw abbr address date name num
rs time add corr del gap orig reg sic unclear emph foreign gloss hi mentioned
soCalled term title ptr ref xptr xref anchor s seg gi formula
Attributes: global
<idno>
supplies any standard or non-standard number
used to identify a bibliographic item.
May occur within: bibl publicationStmt seriesStmt
May contain: #PCDATA
Attributes: global plus the following:
type : categorizes the number, for example as an ISBN or
other standard series.
Value: A name or abbreviation indicating what type of
identifying number is given (e.g.
ISBN, LCCN).
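As a sketch of how the type attribute distinguishes identifiers, the fragment below pairs the Center's local "ETC" identifier from the template with a second, standard number. The ISBN line is a placeholder only; no real number is implied.

```xml
<publicationStmt>
<publisher>University of Virginia Library.</publisher>
<pubPlace>Charlottesville, Va.</pubPlace>
<!-- local identifier: collection and ID, as in the template -->
<idno type="ETC">Modern English, AusEmma</idno>
<!-- a standard number would use a type such as ISBN or LCCN -->
<idno type="ISBN">placeholder for the ISBN</idno>
</publicationStmt>
```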
<availability>
supplies information about the availability of a text, for example any
restrictions on its use or distribution, its copyright status, etc.
May occur within: publicationStmt
May contain: p
Attributes: global plus the following:
status : supplies a code (free, unknown, or restricted)
identifying the current availability of the text:
free : the text is freely available.
unknown : the status of the text is unknown.
restricted : the text is not freely available.
Example:
<availability status=restricted>
<p>Available for academic research purposes only.</p>
</availability>
<availability status=free>
<p>In the public domain.</p>
</availability>
<availability status=restricted>
<p>Available under licence from the publishers.</p>
</availability>
<seriesStmt>
groups information about the series, if any, to which a publication belongs.
May occur within: biblFull fileDesc
May contain: idno p respStmt title
Attributes: global
Example:
<seriesStmt>
<title>Machine-Readable Texts for the Study of Indian
Literature</title>
<respStmt>
<resp>ed. by</resp> <name>Jan Gonda</name>
</respStmt>
<idno type=vol>1.2</idno>
<idno type=ISSN>0 345 6789</idno>
</seriesStmt>
<notesStmt>
collects together any notes providing
information about a
text additional to that recorded in other parts of the
bibliographic description.
May occur within: biblFull fileDesc
May contain: note
Attributes: global
Example:
<notesStmt>
<note>OCR scanning done at University of Toronto</note>
</notesStmt>
<sourceDesc>
supplies a bibliographic description of
the copy text(s) from which an electronic text was derived or generated.
May occur within: biblFull fileDesc
May contain: bibl biblFull p
Attributes: global plus the following:
default : values YES | NO
Example:
<sourceDesc>
<p>No source: created in machine-readable form.</p>
</sourceDesc>
<encodingDesc>
documents the relationship between an electronic text and
the source or sources from which it was derived.
May contain: projectDesc samplingDecl editorialDecl tagsDecl
refsDecl classDecl
Attributes: global
<projectDesc>
May contain: p
Attributes: global plus the following:
default : values: YES | NO
<samplingDecl>
May contain: p
Attributes: global plus the following:
default: YES | NO
<editorialDecl>
May contain: p
Attributes: global plus the following:
default: YES | NO
<tagsDecl>
May contain: rendition tagUsage
Attributes: global
<tagUsage>
May contain: #PCDATA ident code kw abbr address date name num
rs time add corr del gap orig reg sic unclear emph foreign gloss hi mentioned
soCalled term title ptr ref xptr xref anchor s seg gi formula eg bibl biblFull cit q
label list listBibl note figure stage table text
Attributes: global plus the following:
gi
occurs
ident
render
<rendition>
May contain: #PCDATA ident code kw abbr address date name num
rs time add corr del gap orig reg sic unclear emph foreign gloss hi mentioned
soCalled term title ptr ref xptr xref anchor s seg gi formula eg bibl biblFull cit q
label list listBibl note figure stage table text
Attributes: global
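The entries for <tagsDecl>, <tagUsage>, and <rendition> list attributes without showing how they fit together, so a small sketch may help. The attribute meanings given in the comments reflect our reading of the TEI guidelines rather than anything stated in this document, and the counts and description are invented.

```xml
<tagsDecl>
<!-- a rendition describes how an element appears in the source -->
<rendition id="it">italic typeface</rendition>
<!-- gi: element name; occurs: total count; ident: how many carry an id;
     render: points at the rendition above. All counts are invented. -->
<tagUsage gi="hi" occurs="28" ident="2" render="it">Marks words
italicized in the print source.</tagUsage>
</tagsDecl>
```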
<refsDecl>
specifies how canonical references are constructed for this text.
May occur within: encodingDesc
May contain: p
Attributes: global plus the following:
doctype : identifies the document type within which this
reference declaration is used.
<classDecl>
May contain: taxonomy
Attributes: global
<taxonomy>
defines a typology used to classify texts
either implicitly, by means
of a bibliographic citation, or explicitly by a structured taxonomy.
May occur within: classDecl
May contain: bibl biblFull biblStruct category
Attributes: global
Example:
<taxonomy id=B>
<bibl>Brown Corpus</bibl>
<category id=B.A><catDesc>Press Reportage</catDesc>
<category id=B.A1><catDesc>Daily</catDesc></category>
<category id=B.A2><catDesc>Sunday</catDesc></category>
<category id=B.A3><catDesc>National</catDesc></category>
<category id=B.A4><catDesc>Provincial</catDesc></category>
<category id=B.A5><catDesc>Political</catDesc></category>
<category id=B.A6><catDesc>Sports</catDesc></category>
</category>
</taxonomy>
<category>
May contain: catDesc, category
Attributes: global
<catDesc>
May contain: #PCDATA ident code kw abbr address date name num rs time
add corr del gap orig reg sic unclear emph foreign gloss hi mentioned
soCalled term title ptr ref xptr xref anchor s seg gi formula
Attributes: global
<profileDesc>
provides a detailed description of non-bibliographic
aspects of a text, specifically the languages and sublanguages used,
the situation in which it was produced, the participants and their
setting.
May occur within: teiHeader
May contain: creation langUsage textClass
Attributes: global
<creation>
contains information about the creation
of a text. The <creation>
element may be used to record details of a text's creation,
e.g. the date and place it was
composed, if these are of interest; it should not be confused
with the
<publicationStmt> element, which records date and place of
publication.
May occur within: profileDesc
May contain: #PCDATA abbr add address anchor corr date del
emph foreign formula
gap gi gloss hi mentioned name num orig ptr ref reg rs s seg
sic soCalled term time
title xptr xref
Attributes: global
Example:
<creation><date>Before 1987</date></creation>
<creation><date value="1988-07-10">10 July 1988</date></creation>
<langUsage>
describes the languages, sublanguages,
registers, dialects etc.
represented within a text. May contain either a simple prose
description, or more
formally one or more <language> elements
May occur within: profileDesc
May contain: language p
Attributes: global
<language>
identifies the language being described
in the writing system
declaration.
May occur within: langUsage
May contain: #PCDATA
Attributes: global plus the following:
iso639 : gives the standard language code from ISO 639.
Value: any two- or three-letter code included in ISO 639; if the
language is not included in the list in ISO 639, the value should
be given as the empty string.
Example:
<language iso639=GRC>Classical Greek</language>
<textClass>
groups information which describes the nature or topic of a text in
terms of a standard classification scheme, thesaurus, etc.
Attributes: global
<keywords>
contains a list of keywords or phrases identifying the topic or
nature of a text.
May contain: list term
Attributes: global plus the following:
scheme : identifies the controlled vocabulary within which
the set of keywords concerned
is defined.
Example:
<keywords scheme=BL>
<list><item>Babbage, Charles</item>
<item>Mathematicians - Great Britain - Biography</item>
</list>
</keywords>
<classCode>
May contain: #PCDATA ident code kw abbr address date name num
rs time add corr
del gap orig reg sic unclear emph foreign gloss hi mentioned
soCalled term title ptr ref
xptr xref anchor s seg gi formula
Attributes: global, plus the following:
scheme IDREF #IMPLIED
<catRef>
Empty tag
Attributes: global, plus the following:
target
scheme
<revisionDesc>
summarizes the revision history for a file. Record changes
with most recent changes at the top of the list.
May occur within: teiHeader
May contain: change list
Attributes: global
Example:
<revisionDesc>
<change><date>11 Nov 91</date>
<respStmt><resp>corrector</resp><name>EB</name></respStmt>
<item>Deleted chapter 10</item>
</change>
</revisionDesc>
<change>
summarizes a particular change or correction made to a particular
version of an electronic text which is shared between several
researchers.
May occur within: revisionDesc
May contain: date item respStmt
Attributes: global
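The fragment below is a sketch of a revision history with two <change> entries, the most recent first, each built from the date, respStmt, and item elements listed above. The dates, names, and descriptions are placeholders invented for illustration.

```xml
<revisionDesc>
<!-- most recent change at the top -->
<change>
<date>12 March 1996</date>
<respStmt>
<resp>corrector</resp>
<name>Second encoder's name</name>
</respStmt>
<item>Corrected page-break tags in chapter 3</item>
</change>
<!-- earlier change below it -->
<change>
<date>2 November 1995</date>
<respStmt>
<resp>corrector</resp>
<name>First encoder's name</name>
</respStmt>
<item>Added TEI header</item>
</change>
</revisionDesc>
```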
<respStmt>
supplies a statement of responsibility for someone responsible for
the intellectual content of a text, edition, recording, or
series, where the specialized elements
for authors, editors, etc. do not suffice or do not apply.
May occur within: bibl change editionStmt series seriesStmt
titleStmt
May contain: name resp
Attributes: global
Example:
<respStmt><resp>transcribed from original ms</resp>
<name>Claus Huitfeldt</name>
</respStmt>
In addition, the TEI header includes the following tags,
described in the
longer list of general TEILITE tags:
<TEI.2>
<author>
<resp>
<name>
<extent>
<publisher>
<date>
<biblFull>
<title>
<term>