
Appendix I: Text Formatting in UNIX

Guidelines for SGML Text Mark-up at the Electronic Text Center
David Seaman, Electronic Text Center, University of Virginia

CAT

Concatenate: to join a series of files, in the order given, into a single file. We use it, for example, to add our TEI header template to the start of a text:

cat FILE-1 FILE-2 >FILE-1and2
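As a small illustration of the command above, the following sketch joins a header file and a text file. The file names and their contents are hypothetical stand-ins, not the Center's actual template. Note that the output name should differ from both inputs: redirecting into one of the input files would empty that file before cat reads it.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Hypothetical stand-ins for a TEI header template and a transcribed text.
printf '<teiHeader>\n</teiHeader>\n' > "$workdir/header.tpl"
printf '<text>\nChapter 1\n</text>\n' > "$workdir/emma.txt"

# The files are joined in the order given, into a new third file.
cat "$workdir/header.tpl" "$workdir/emma.txt" > "$workdir/emma.sgm"

cat "$workdir/emma.sgm"
```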

MOVE

To rename a file (the standard Unix command is mv):

mv OLDNAME NEWNAME

e.g.

mv Austen-Emma.txt AusEmma

To move a file into a directory:

mv FILENAME DIRECTORY-NAME

e.g.

mv AusEmma etext/Done


SPELL

The Unix spelling facility.

spell FILE-IN >FILE-OUT

In this case, FILE-OUT will be a list of words that are not part of spell's recognition vocabulary.


SED

For our purposes, a fast search and replace utility. Works only on text that is on the same line (unlike a WordPerfect or Jove macro).

A SED command can be issued from the command line, or saved to a file and run. The latter is much preferable for repeated uses.

The SED syntax is logical but visually a little awkward at first. The string consists of a "find this" section and a "replace with this" section separated by / marks:

s/FIND THIS/REPLACE WITH THIS/

s/hat/cat/

SED allows one to specify the position of an item, and to match patterns rather than fixed strings: for example, find any upper-case word of any length at the beginning of a line, or find a line that begins with a variable number of blank spaces and that contains a number of any size.
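The two kinds of pattern just described might be written as follows. This is only a sketch: the sample text is invented, and the <hi> tag and the choice to turn an indented number into a <pb n=""> tag are illustrative assumptions rather than part of these guidelines.

```shell
# Sample input: a heading in capitals, an indented page number, a plain line.
sample=$(printf 'CHAPTER one\n    42\nplain line\n')

# ^[A-Z][A-Z]* matches one or more capital letters at the start of a line.
# In the replacement, & stands for whatever text was matched.
caps=$(printf '%s\n' "$sample" | sed 's/^[A-Z][A-Z]*/<hi>&<\/hi>/')

# ^  *\([0-9][0-9]*\) matches one or more spaces, then a number of any length.
# \( \) remembers the digits; \1 puts them back in the replacement.
pages=$(printf '%s\n' "$sample" | sed 's/^  *\([0-9][0-9]*\)/<pb n="\1">/')

printf '%s\n' "$caps" "$pages"
```

Lines that do not match a pattern pass through unchanged, so each routine can be run over a whole file.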

Some suggested uses:

A text that uses five blank spaces for paragraph indents can be tagged with <p></p> codes by using sed to search for five blank spaces at the beginning of a line and to replace them with </p><p>.

cat FILE-IN | sed 's/^     /<\/p><p>/' >FILE-OUT

The backslash before the / in </p> is necessary because in sed, the / symbol performs a role, and one needs to tell the sed routine (using the \ backslash) to treat the / in </p> as literal.

Note: after such a "search and replace" sed routine you will need to delete the </p> for the first paragraph in the text, and add a </p> for the last paragraph in the text, but everything else is taken care of automatically.
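The two manual corrections mentioned in the note can also be left to sed, using a line number (1) and the last-line address ($). This is a sketch under two assumptions that will not hold for every text: that the first paragraph begins on line 1 of the file, and that the last line of the file belongs to the last paragraph. The sample text is invented.

```shell
# Sample text: two paragraphs, each indented with five spaces.
tagged=$(printf '     First paragraph\ncontinues here.\n     Second paragraph.\n' |
  sed -e 's/^     /<\/p><p>/' \
      -e '1s/^<\/p>//' \
      -e '$s/$/<\/p>/')

# First -e:  tag every paragraph start, as in the routine above.
# Second -e: on line 1 only, delete the stray leading </p>.
# Third -e:  on the last line only, append the missing </p>.
printf '%s\n' "$tagged"
```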

In this case you could perform the same simple search and replace operation with a word processor (JOVE, WP), but sed is faster and safer -- one does not need to open the document and run the risk of changing something accidentally. Remember, however, that SED only works on a line -- you cannot ask it to search for several blank lines.

The following sed routines will add verse-line tags (<l> and </l>) to the beginning and end of each line. These positions are represented by ^ and $ respectively:

cat FILE-IN | sed 's/^/<l>/' >FILE-OUT1

cat FILE-OUT1 | sed 's/$/<\/l>/' >FILE-OUT2

In this case, FILE-OUT1 (the output of the first routine) becomes the starting point for the second routine.

You can -- perhaps should -- run these two commands together. One way to do this is to create a file with jove, called perhaps "line.sed", which includes the two sed commands in the example above:

s/^/<l>/
s/$/<\/l>/

Then, you can run the commands in the line.sed file by typing

sed -f line.sed FILE-IN > FILE-OUT

Such a routine will put <l></l> on any blank lines too (appearing as <l> </l> if the blank line contained a space). Search for and delete those that are meaningless; some blank lines may mark a meaningful division such as page breaks (even if unnumbered), and one could replace these <l></l> with <pb n="">, so at least the page breaks are preserved.
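If you would rather not tag the blank lines in the first place, one possible variation is to give sed an address so that it acts only on lines containing at least one non-space character. This is a sketch, not the routine used at the Center; the sample verse is invented. Blank lines are left untouched, so you can still inspect them afterwards and mark any that stand for page breaks.

```shell
# Sample verse with one space-only line and one empty line.
poem=$(printf 'First line\n \nSecond line\n\nThird line\n' |
  sed '/[^[:space:]]/s/.*/<l>&<\/l>/')

# /[^[:space:]]/ selects lines that contain something other than white space.
# s/.*/<l>&<\/l>/ wraps the whole of each selected line in <l> and </l>.
printf '%s\n' "$poem"
```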

If the text lacks a blank space at the end of each line, type

cat FILE-IN | sed 's/$/ /' >FILE-OUT

A blank space at the end of each line is necessary because Lector does not obey the line-end character, and will run lines together. You can also do this search in Jove, searching for $ and replacing with a space.



TR

A translator, to turn one character into another (or into nothing).

A common use: to remove ^M line-end characters, a product of a binary transfer (such as the transfer done by PC-NFS):

tr -d '\015' <FILE-IN >FILE-OUT

Note: 015 is the octal value of the character ^M (carriage return). You can determine these numbers with an octal dump (see the on-line Unix man page for this -- type man od at the command line).

To remove line-feed characters -- perhaps to create a single line of text for a PERL routine:

tr -d '\012' <FILE-IN >FILE-OUT
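The following sketch puts the octal dump and the two tr commands together on a small invented file, so that you can see the byte counts change. The file names are hypothetical.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# A sample file in which each line ends with ^M followed by a line feed.
printf 'one\r\ntwo\r\n' > "$workdir/dos.txt"

# od -b prints every byte as a three-digit octal number:
# 015 is the carriage return (^M) and 012 is the line feed.
od -b "$workdir/dos.txt"

# Remove the ^M characters.
tr -d '\015' < "$workdir/dos.txt" > "$workdir/unix.txt"

# Remove the line feeds as well, leaving a single line of text.
tr -d '\012' < "$workdir/unix.txt" > "$workdir/oneline.txt"

# Compare the sizes: each step removes two bytes from this sample.
wc -c "$workdir/dos.txt" "$workdir/unix.txt" "$workdir/oneline.txt"
```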


MULTIDOCS

Part of PAT -- not public domain. A tool for checking the completeness of your tags. Multidocs does NOT parse against the DTD, however. Use nsgmls for that.

multidocs FILE-IN tags1 >FILE-OUT

There are two tag files, currently called tags1 and tags2, in etext/Done. Both must be run separately in order to check for tagging errors.

If an error does appear, there will be a byte offset number associated with it; this tells you the location of the error within the document. In order to find the byte offset position in the text, you have two choices:

  • 1) search for the byte offset number in Jove:
    • hit ESC
    • type in the number
    • hit CTRL-f

When you fix an error, it alters the byte offset numbers downstream of it, so you will have to compensate, and eventually re-run multidocs to generate a fresh set of error messages.
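For a quick look at the text around an offset without opening the file, a standard Unix tool such as dd can print a few bytes starting at a given position. This is only a sketch: the sample file and the offset are invented, and it assumes that offsets count from zero, which these guidelines do not state for multidocs. If the output seems to start one character off, adjust the number by one.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Invented sample; the second line has a tagging error (<p> instead of </p>).
printf '<p>first</p>\n<p>second<p>\n' > "$workdir/sample.sgm"

# Hypothetical offset, standing in for one reported in an error message.
offset=13

# Skip that many bytes, then print the next 9 bytes of the file.
context=$(dd if="$workdir/sample.sgm" bs=1 skip="$offset" count=9 2>/dev/null)
printf '%s\n' "$context"
```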



NSGMLS

A public-domain SGML parser. It will read a text against the DTD it declares itself to be obeying, and will report any errors. The language of the error reports is opaque at times, but the parser does report the line number, which makes the point of error easy to find. [Jove reminder: to have line numbers assigned in Jove, hit ESC and then X -- enter num at the prompt].

The simplest syntax is: nsgmls -s FILENAME

The -s switch suppresses the parser's normal output, so that only errors are reported (by default the parser also writes out a representation of everything it has parsed correctly).

