
Archival Digital Image Creation

A distinction is growing between preservation imaging and what we call here archival imaging. The preservation world relies heavily on high-speed, 1-bit (simple black and white) page images shot at 600 dpi and stored as Group 4 fax-compressed files. This gives an image reminiscent of microfilm. For a straightforward printed page with no graphics, 1-bit imaging preserves the ability to read the content, but it gives no sense of the page as an artifact -- no shading, no color, etc.

What we here call archival imaging assumes that one's needs are for high-quality images that replicate not simply the information on a page -- as a black-and-white image does, for typeset material at least -- but the experience and visual nuances of the original. A high-quality color image (24-bit) does this, and the value is not simply for specialist use but for general-purpose users too. Some of our most excited and emotional users are members of the general public and high school students who use the color images of rare manuscripts and books.

The following assumes one is scanning original documents on a flatbed scanner. Apart from a longer training period, and often a more complex set-up, the figures should be broadly comparable for a digital camera.

SCANNING AND FORMAT

For both current use and long-term viability, I suggest the following:

At the scanner

  • Scan at 400-600 dpi (we currently use 400 dpi by default). Your choice of dpi will vary depending on the amount of detail in the original, its physical size, and its predictable uses.

  • Scan at 24-bit color by default. Even greyscale book illustrations and engravings look much more realistic at 24-bit color than at 8-bit greyscale, and the JPEG file produced from a 24-bit original is typically smaller than one made from an 8-bit original [see comparisons below].

  • Create a TIFF file at the scanner -- an uncompressed format that is as close as we have to an archival form. The uncompressed archival TIFF is large (which means it carries a lot of information, which is good). Filesize should not be a deciding factor in image resolution or bit depth [having said that, it is principally filesize that keeps us scanning by default at 400 dpi / 24-bit color rather than 600 dpi / 24-bit color]. In our case, this off-line storage is currently on writable CD-ROMs; previously, we used a tape archiving system.

  • Use the scanner's automatic color and contrast balance. Do no additional color correction on the archival TIFFs: it is better to have them archived with a consistent and known bias -- the bias imposed by a particular device (e.g. a Hewlett-Packard flatbed scanner). We need to avoid unrecorded, ad hoc correction of the originals, especially as the best we can do is correct for a particular monitor. Including a standard color reference strip at the margin of each image is a very good idea (we don't do this currently, and regret it). Do whatever color correction is necessary on the JPEGs.
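The filesize pressure behind the 400 dpi default can be made concrete with a little arithmetic. A sketch, in Python -- the 8.5 x 11 inch page dimensions are an illustrative assumption:

```python
def uncompressed_size_mb(width_in, height_in, dpi, bits_per_pixel=24):
    """Estimate the uncompressed size of a scan in megabytes."""
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bits_per_pixel / 8 / 1_000_000

# An 8.5 x 11 inch page at the two resolutions discussed above:
at_400 = uncompressed_size_mb(8.5, 11, 400)  # roughly 45 MB
at_600 = uncompressed_size_mb(8.5, 11, 600)  # roughly 101 MB -- 2.25x larger
```

Moving from 400 to 600 dpi multiplies the pixel count (and so the uncompressed filesize) by (600/400)² = 2.25, which is why the step up is costly in storage.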

Post-Scanning Processing

  • Before the TIFF is archived off-line, create one or more JPEG images for current use -- you might decide on a high-detail (low-loss) and a low-detail (high-loss) version. The precise settings are determined by the type of image -- as a rule of thumb, aim to have the better copy come in at 300-500 KB and the poorer copy at under 100 KB. The better copy allows a lot of flexibility of use (details can be enlarged several times without pixelation); the poorer copy allows little flexibility, but is very usable at regular size and loads quickly even on low-end graphical systems.

    Example: The Booker Civil War Collection
    [http://etext.lib.virginia.edu/collections/civilwar/booker]
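One way to produce the two derivatives is sketched below, assuming the Pillow library; the quality settings are illustrative starting points to be adjusted per image type until the files land in the target ranges:

```python
import os

from PIL import Image  # Pillow -- an assumed choice; any image library will do


def make_derivatives(tiff_path, hi_quality=85, lo_quality=40):
    """Create a high-detail and a low-detail JPEG from an archival TIFF.

    Returns a dict of {jpeg_path: size_in_kb} so the results can be
    checked against the rule-of-thumb targets (300-500 KB / under 100 KB).
    """
    img = Image.open(tiff_path).convert("RGB")  # JPEG has no alpha channel
    base, _ = os.path.splitext(tiff_path)
    hi_path, lo_path = base + "-hi.jpg", base + "-lo.jpg"
    img.save(hi_path, "JPEG", quality=hi_quality)
    img.save(lo_path, "JPEG", quality=lo_quality)
    return {p: os.path.getsize(p) // 1024 for p in (hi_path, lo_path)}
```

The `-hi` / `-lo` filename suffixes are invented for the sketch; any local naming convention works, as long as it links the derivatives back to the archival TIFF.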


Data Control

I strongly encourage the use of a short tagged header template for each image or related group of images, saying how, when, and by whom it was created. The Etext Center uses the TEI header to fill this role; at worst, a set of HTML <meta> tags in the <head> section would be better than nothing.

I would also suggest embedding this header in the binary of the image file itself. The Etext Center now tries to do this routinely with book illustrations, manuscripts, etc.
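For TIFFs, one place to carry such a header inside the file is the ImageDescription tag (tag 270). A minimal sketch, again assuming Pillow; the header text shown is illustrative, not the Etext Center's actual template:

```python
from PIL import Image  # Pillow -- an assumed choice of library


def embed_header(tiff_in, tiff_out, header_text):
    """Copy a TIFF, writing the header into its ImageDescription tag (270)."""
    with Image.open(tiff_in) as img:
        img.save(tiff_out, "TIFF", description=header_text)


def read_header(tiff_path):
    """Read back whatever is stored in the ImageDescription tag."""
    with Image.open(tiff_path) as img:
        return img.tag_v2.get(270)


# Illustrative header fields only -- a real record would follow the TEI header.
header = (
    "Creator: Electronic Text Center\n"
    "Device: flatbed scanner, 400 dpi, 24-bit color"
)
```

Because the tag travels inside the binary, the provenance record survives even if the image is separated from its external documentation.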

For more information, see the Illustrations section in the Electronic Text Center Guide to Document Preparation.

This data control adds a few minutes per image to the creation time, but it means that we have a searchable record of the item and a bibliographical header for future cataloging, and that we keep track of what we have. We should think of ourselves as building a text database of our images as we create the images. For some groups of images, a single header may serve for all the images in a group -- you may not need a separate header for each specific image.
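That text database can start out very simply: a list of header records searched with ordinary string matching. A minimal sketch, with invented record IDs and field names:

```python
def search_headers(headers, term):
    """Return the IDs of records whose header text mentions the term
    (case-insensitive substring match)."""
    term = term.lower()
    return [h["id"] for h in headers if term in h["text"].lower()]


# Illustrative records -- one header may cover a whole group of images.
headers = [
    {"id": "booker-001",
     "text": "Booker Civil War Collection, letter, scanned at 400 dpi"},
    {"id": "ms-eng-042",
     "text": "Manuscript page, 24-bit color, flatbed scanner"},
]
```

Even this crude index gives the searchable record and the inventory function described above; a fuller system would search the structured TEI header fields instead.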



Postscript -- scanning from microfilm: commercial service bureaus will now do 8-bit greyscale scans from microfilm; for some types of document, these are a much better choice than 1-bit black-and-white scans.

Example greyscale scans from microfilm (10 MB TIFF files, represented here by their 100 KB JPEG derivatives):

DMS / Jan 95 / July 95 / July 96