Optical Character Recognition: Some Sample ScansElectronic Text Center
University of Virginia
Charlottesville, VA 22904
These few examples show some typical results from scanning different types of printed texts.
Note: These test scans were made in May 1998 using OmniPage Pro, version 8. The results of each scan were exported directly as HTML except in the case of the Middle English text, which was exported as Rich Text Format and then converted to HTML.
A modern text, well printed and of good typesize.
G. Thomas Tanselle. "The Life and Work of Fredson Bowers." Studies in Bibliography 46, p. 1.
Very good results. Almost no errors; rule lines at top are ignored but cause no problem; the software misses the large initial letter, and an unnecessary hard return is inserted within the first paragraph. Small capitals in the first line are retained, as are italics throughout. Minimal post-OCR cleanup required. Text of this print quality should not pose any significant problems.
A modern mass-produced paperback. Good typesize.
E. Arnot Robertson. Ordinary Families. London: Virago, 1982. p. 15.
Good results. Very few errors: there is no space between "for, it" in the fourth line, and a hyphen has been erroneously inserted before "to" in the sixth line. Scans well, despite the lower quality of the printing in comparison to example 1 above (the good typesize makes up for the poorer print quality).
A printed, glossy brochure. Adequate typesize.
Special Collections of the University of Virginia Library. 1995.
Good results. Few errors: the software stumbles over the apostrophe in "University's" in two places, and it inserts erroneous umlauts over the letter u in "full" in the first paragraph and in "fund" in the third paragraph; in the fourth paragraph "F." is mistaken for "E".
Note: One can expect similar results from a clear printout from a laser printer or a good, clean photocopy with adequate typesize.
A Middle English text.
Frances E. Richardson, ed. Sir Eglamour of Artois. London, New York, Toronto: Oxford University Press, 1965.
Good results. The software represents words it does not know (i.e., that are not in its dictionary) in green; in this case most of the text is in green, which may appear alarming. However, an inspection of the text reveals no major errors in the poem, and only a few errors in the footnotes. Naturally the software sees the thorn (þ) as a lower-case p and the yogh as a numeral 3. However, it is possible to train OmniPage to recognize the thorn and yogh (choose Edit Training File... from the Tools menu) and to represent them with the abbreviated SGML entity references &t; and &y;. Note that after training the software it does a good job of recognizing the Middle English characters.
A 19th-century commercial printing.
Geoffrey Chaucer. The Complete Works of Geoffrey Chaucer. Rev. Walter W. Skeat, ed. Oxford: Clarendon Press, 1894.
Generally good results, as one typically finds with most 19th century printings (and some 18th century ones) of reasonable clarity, average typesize, and a font style that is not ornate or archaic. There are very few errors in the poem itself: the software fails to recognize the ë (e with umlaut) in line 10. There are several errors in the footnotes, as one may expect with a very small print size. Minimal cleanup required in the poem, but the footnotes would have to be checked carefully.
A 17th-century printing.
Plaine Description of the Barmudas. London: W. Stansby, 1613.
A complete disaster. The resulting text is riddled with errors. The preponderance of unfamiliar letter forms (the long S) and ligatures, and the broken type, causes an unacceptable error rate. This will be true even if you train the OCR software to recognize some of the ligatures, although training will cut the error rate somewhat.
The effort of scanning and correcting this type of text is greater than the effort of typing it in manually.