Guidelines for SGML Text Mark-up at the Electronic Text Center
David Seaman, Electronic Text Center, University of Virginia
The Basic Text-Processing Procedures at UVa
What follows is a step-by-step set of guidelines for processing texts at UVa's
Electronic Text Center. It largely assumes that the electronic text is derived from
a print or manuscript source; to date, this has been the case for the vast majority of the texts we have
processed. While the precise details of these procedures are specific to UVa, the
general process and assumptions should be easily duplicated elsewhere.
- Assuming a text passes an initial inspection, it will
be put in a to-do directory and assigned to a preparer.
The to-do directory is a holding place for texts waiting
to be processed.
Each text preparer works on files within his or her working directory.
- Create the seven-letter abbreviated name that will
be the text's unique ID, and add this abbreviation to the id=
attribute of the <text> tag. Whenever possible, the ID should
consist of three letters of the author's name and four of the title:
Jane Austen's Emma has an ID of AusEmma, for example.
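The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function name make_text_id is hypothetical, and it assumes the ID is built from the first three letters of the author's surname and the first four letters of the title, which matches the AusEmma example but may not suit every text.

```python
import re


def make_text_id(author_surname: str, title: str) -> str:
    """Build a seven-letter ID: three letters of the author, four of the title."""
    # Keep ASCII letters only, so the result is safe as an id= attribute value.
    author = re.sub(r"[^A-Za-z]", "", author_surname)
    work = re.sub(r"[^A-Za-z]", "", title)
    if len(author) < 3 or len(work) < 4:
        # Short names and titles need a hand-picked abbreviation.
        raise ValueError("name too short; choose the abbreviation by hand")
    return author[:3].capitalize() + work[:4].capitalize()


print(make_text_id("Austen", "Emma"))  # AusEmma
```

Uniqueness still has to be checked by hand against the IDs already in use; the sketch does not detect collisions.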
- Identify the source edition for the
electronic text, and obtain a copy of it (use Inter-Library Loan if
necessary). This identification may require you to contact the
creator of the text, if he or she is known. The printed source is
invaluable when checking the electronic document; we don't want to
be "correcting" things that look like errors but are actually
features of the printed text (British spelling, as an obvious example).
If no source edition is marked in the file, if the text's
initial creator cannot be found, and if
comparison with copies on the shelves of the Library yields no
further information, then we need to decide whether we proceed with
the text.
- Go to the TEI header webform template, and fill it in to the degree
that it can be completed.
- Check the accuracy of the electronic text.
You could, for example, run the Unix spell program to see if there are
many words that Unix does not recognise, and check to see if they
look like scanning or typing errors. A very corrupt text may need to
be abandoned. Don't assume that a text with tags in it already is
reliable in its content even if it is reliable in its markup -- there
are a number of texts in our Modern and Middle English sections that
came to us with TEI tags in place, but which had hundreds of
typographical errors when we processed them.
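Where the Unix spell program is not to hand, the same kind of first pass can be approximated in Python. The sketch below is an assumption-laden stand-in, not the Center's procedure: the function name unknown_words is hypothetical, and the word list is passed in by the caller rather than read from any particular system dictionary.

```python
import re
from collections import Counter


def unknown_words(text: str, known: set[str]) -> Counter:
    """Count words in `text` that are absent from the `known` word list."""
    # Blank out anything that looks like an SGML tag so tag names are not reported.
    untagged = re.sub(r"<[^>]*>", " ", text)
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", untagged)
    return Counter(w for w in words if w.lower() not in known)


sample = "<p>The colour of tbe sky was grey, and tbe wind rose.</p>"
known = {"the", "colour", "of", "sky", "was", "grey", "and", "wind", "rose"}
for word, count in unknown_words(sample, known).most_common():
    print(count, word)
```

Words the list does not recognise are only candidates for correction; each one still has to be checked against the printed source, since a British or archaic spelling may be a feature of the edition.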
- Check the structure of the electronic text.
Look for any structures that can be searched for and replaced with
TEI tags (existing word-processor codes, patterns of spacing and layout, etc.)
If the text contains no markup, look for repetitive patterns that can be
replaced with a tag (see Notes on Text Formatting).
For example, if five spaces at the beginning of a line always mark
a new paragraph, this pattern can be searched for and replaced with
</p><p>. Do not leave both the <p> marker and the five
spaces in the text -- think of the <p> like a TAB command in a
word processor.
If the text contains some existing markup other than
TEI, replace it. For example, if italics are marked with a pair of
# marks (on and off), these can be searched and replaced using a
routine that searches for #, replaces with <i>, goes to the next
#, and replaces with </i>. If the text is already marked up
with SGML tags (rare, at present), they may need to be converted to
our subset of TEI. Remember that the Unix search and replace
utility, sed, works one line at a time by default, so it cannot easily be used if the item (such as an italicized
phrase) is not all on one line. A Jove, Emacs, or WordPerfect
macro may be your best bet.
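The two replacements described in this step (five-space indents to paragraph tags, and paired # marks to italic tags) might be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that exactly five leading spaces always mean a new paragraph and that # is never used for anything except italics; both assumptions should be checked against the text first.

```python
import re


def mark_paragraphs(text: str) -> str:
    """Replace a five-space line indent with a paragraph boundary."""
    # The indent is consumed by the match, so the tag and the spaces never coexist.
    # The lookahead skips lines indented by more than five spaces.
    return re.sub(r"(?m)^ {5}(?=\S)", "</p><p>", text)


def mark_italics(text: str, marker: str = "#") -> str:
    """Turn alternating on/off markers into <i>...</i>, even across line breaks."""
    parts = text.split(marker)
    if len(parts) % 2 == 0:
        # An odd number of markers means an italic run was never switched off.
        raise ValueError("unbalanced italic markers; fix by hand")
    out = [parts[0]]
    for i, part in enumerate(parts[1:], start=1):
        out.append(("<i>" if i % 2 == 1 else "</i>") + part)
    return "".join(out)


sample = "     It was a #very\nlong# day.\n     She slept."
print(mark_italics(mark_paragraphs(sample)))
```

Because the whole text is handled as one string, an italicized phrase that runs over a line break is tagged correctly. The first paragraph comes out with a stray </p> in front of it and the last lacks a closing </p>; both ends need tidying by hand.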
- Look for the presence of line-end (and page-end) hyphenation.
Whenever possible, unambiguous line-end hyphenation is to be closed
up, as it interferes with one's ability to search for the hyphenated
word.
Line-end hyphenated words are considered to be unambiguous
when they are hyphenated only because they fall at the end of a
line. A word that might otherwise appear as one word, as a hyphenated phrase, or
as two words is not unambiguous. If in doubt, leave the line-end hyphenation alone.
If removing unambiguous line-end (or page-end) hyphenation, move
the second part of a word up to join its first part on the previous
line. During such checking, be alert for missing lines and passages.
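A cautious way to automate part of this step is to close up only those words that a person has already approved. The Python sketch below assumes the preparer supplies that approved list; the function name close_line_end_hyphens is hypothetical, and page-end hyphenation is not handled.

```python
import re


def close_line_end_hyphens(lines: list[str], unambiguous: set[str]) -> list[str]:
    """Join words split by a line-end hyphen when the joined form is approved."""
    result = list(lines)
    for i in range(len(result) - 1):
        head = re.search(r"([A-Za-z]+)-\s*$", result[i])
        tail = re.match(r"\s*([A-Za-z]+)(\S*)\s*", result[i + 1])
        if not head or not tail:
            continue
        joined = head.group(1) + tail.group(1)
        if joined.lower() not in unambiguous:
            continue  # if in doubt, leave the line-end hyphenation alone
        # Move the second part of the word (and any attached punctuation) up a line.
        result[i] = result[i][:head.start()] + joined + tail.group(2)
        result[i + 1] = result[i + 1][tail.end():]
    return result


lines = ["She was quite com-", "fortable, and the well-", "being of all was plain."]
for line in close_line_end_hyphens(lines, {"comfortable"}):
    print(line)
```

Here "com-" and "fortable" are joined because "comfortable" is on the approved list, while "well-" and "being" are left untouched for a person to decide.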
The items that you can search and replace with SGML codes may
not include the major text divisions (<div1>, <div2>, etc.), in
which case you will have to put these in manually. Remember that
the first level of major division is <div1>, the second level, nested inside it, is <div2>,
and so on. See A Practical Introduction to the Tag Set.
- Check for special characters, and convert to SGML
entity references (see Appendix: Special Characters).
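As a rough illustration of the conversion, the Python sketch below replaces each non-ASCII character with a named entity reference where the standard library knows a name, and with a numeric character reference otherwise. It assumes that the names in Python's html.entities table (which follow the HTML 4 entity set) match the entities declared in the DTD in use; that holds for common accented Latin letters but should be verified against the appendix before relying on it. The function name is hypothetical.

```python
from html.entities import codepoint2name


def to_entity_references(text: str) -> str:
    """Replace non-ASCII characters with entity or numeric character references."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)  # plain ASCII passes through unchanged
        elif code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")
        else:
            # No known name: fall back to a numeric character reference.
            out.append("&#" + str(code) + ";")
    return "".join(out)


print(to_entity_references("Bront\u00eb's caf\u00e9"))  # Bront&euml;'s caf&eacute;
```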
- Paginate. This not only makes the text easier to navigate and cite
from, but it also ensures its relative completeness, at least to the page
level (watch out for short pages that may indicate a passage has been
left out). If there are no page markers, and nothing to search for (two blank lines, for
example, or a control character), pagination has to be done manually.
In the Etext Center we have macros that insert the page markers and add the page numbers.
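Once page markers are in place, the check for suspiciously short pages can be partly automated. The Python sketch below is not the Center's macro: it assumes page markers take the form <pb n="..."> (the TEI page-break tag, which this section does not name explicitly), and it flags any page with fewer than half the median number of non-blank lines. The function name and the threshold are illustrative choices.

```python
import re
from statistics import median

# Assumed marker form; adjust the pattern to whatever the text actually uses.
PB = re.compile(r'<pb n="([^"]*)">')


def short_pages(text: str, ratio: float = 0.5) -> list[str]:
    """Return page numbers whose line count is below `ratio` times the median."""
    pieces = PB.split(text)
    # With one capture group, split() yields [before, n1, page1, n2, page2, ...].
    pages = [(pieces[i], pieces[i + 1]) for i in range(1, len(pieces) - 1, 2)]
    counts = [(n, sum(1 for line in body.splitlines() if line.strip()))
              for n, body in pages]
    if not counts:
        return []
    typical = median(c for _, c in counts)
    return [n for n, c in counts if c < ratio * typical]


text = ('<pb n="1">\n' + "line\n" * 30
        + '<pb n="2">\n' + "line\n" * 4
        + '<pb n="3">\n' + "line\n" * 29)
print(short_pages(text))  # ['2']
```

A flagged page is only a prompt to look at the printed source: the last page of a chapter is legitimately short.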
- Spellcheck, if practical. If the text is from a
source of electronic editions that we know to be generally
reliable, a full spellcheck may be unnecessary, given our time constraints;
instead, spellcheck a section and read through a section as a double-check. A huge file
may simply be too time-consuming to spellcheck fully. Record what
you do in the <teiHeader>.
- Make sure that there is a single space at the end of each line.
The space is necessary because the TEI-to-HTML filter does not retain a hard
return code, and therefore words run together if there is no line-end
space.
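The line-end rule is easy to enforce mechanically. The Python sketch below strips whatever whitespace ends each line and adds back exactly one space; leaving blank lines empty is a choice made here, not something the guidelines specify, and the function name is hypothetical.

```python
def normalize_line_ends(text: str) -> str:
    """Give every non-empty line exactly one trailing space."""
    fixed = []
    for line in text.split("\n"):
        stripped = line.rstrip(" \t\r")
        # Blank lines are left empty (an assumption of this sketch).
        fixed.append(stripped + " " if stripped else stripped)
    return "\n".join(fixed)


print(repr(normalize_line_ends("one\ntwo  \n\nthree \n")))
```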
- Double-check the information in the TEI header.
- Unless in the process of the steps above the text is revealed to be
irredeemably corrupt, it is now ready for parsing and indexing. Run
check the form of the tags.
To parse, use nsgmls (aliased to the command "parse" on the etext machine).