Introduction to Etexts: About XML
May 11th and 14th, 2002
David Seaman
Director, Electronic Text Center, University of Virginia Library
for fun and profit
<div1 type="section" n="1">
<head> The Basic Scoop </head>
XML is a simple set of tagging instructions that allows one to make a set of tags in a web-friendly manner -- tags that can describe the structure of a document (like TEI does) rather than simply its appearance on-screen (like HTML does).
Q: Isn't that what SGML is supposed to do?
Yes -- XML is a simplified version of the SGML meta-language. Some items have been removed from the SGML rules and some syntax changed to create something that is almost as powerful in its descriptive abilities, but much more web-friendly, and much easier for software such as web browsers to deal with.
Q: So XML is "HTML on steroids!"
No! HTML is a tag set built from the tag-set building language SGML. XML is not a tagset at all -- rather, it is a more easily used form of a tag-set building language, derived from SGML. XML does not give you a set of tags to use, as HTML does. Rather, it gives you rules for building sets of tags that suit your purposes.
Q: So you can have an XML version of HTML, or EAD, or TEI?
Exactly. When the HTML rules are re-written to accommodate the XML features and assumptions, then you will have XML-compliant HTML. And similarly, you can have an XML version of TEI, EAD, CALS, and just about every other SGML tagset.
Q: If XML is a language for making tagsets, then I can use it to make my own, correct?
Yes.
Q: But why would I want to make up my own tagset? It sounds like hard work.
It is. Typically, you would try to match your document type to an existing set of tags -- EAD for archival guides, for example. Let someone else do the work, and benefit from discussion groups, software, and the inter-operability advantages of following an established norm. But you can "roll your own" if needs dictate.
<div1 type="section" n="2">
<head> The Nitty-Gritty </head>
Q: Okay, so what do I need to know about the form of XML?
- Every XML document must have a root element -- an outer-wrapper tag
(equivalent to the <html> </html> outer wrapper that is
supposed to enclose an HTML document).
- Every start tag must have a closing tag. You can no longer
"minimize" tags by exercising the SGML option (found in HTML) of omitting the
closing half of a pair of tags. Software has real trouble trying to intuit
where a pair of tags ends, in the absence of an explicit close tag.
- Tags must nest cleanly: so <tag1> <tag2>
</tag2> </tag1> is valid, but <tag1> <tag2>
</tag1> </tag2> is not.
- Empty tags (such as <br>) have a different form to make it clear that these are tags with no closing tag. Again, this is designed to make it clear to software when it should expect a closing tag and when not. The new form of empty tags is the following: <br/>
Caveat: you can now also deal with empty tags by supplying a close tag: <br></br>
- All attribute values must be in quotation marks.
- Tags are case-sensitive and must match: <author></author> is valid, but
<author></AUTHOR> is not. Using lowercase only or uppercase only
for your tag names and attribute names will simplify your life.
- XML documents need a declaration at the top to signal that this is what
they are:
<?xml version="1.0"?>
and if you are conforming to and parsing against a DTD, the DTD needs to be declared there as well:
<!DOCTYPE UVALIB SYSTEM "uvalib.dtd">
These rules will make you XML-conformant.
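The rules above can be tried out with any XML parser. What follows is a minimal sketch in Python, assuming the standard library's non-validating parser is an acceptable stand-in for an XML browser; the sample documents and the function name is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples, each exercising one of the rules listed above.
samples = {
    "clean nesting": "<tag1><tag2></tag2></tag1>",
    "bad nesting": "<tag1><tag2></tag1></tag2>",
    "empty tag, XML form": "<p>line one<br/>line two</p>",
    "empty tag, HTML form": "<p>line one<br>line two</p>",
    "quoted attribute": '<div type="chapter"></div>',
    "unquoted attribute": "<div type=chapter></div>",
    "mismatched case": "<author></AUTHOR>",
    "no single root": "<a></a><b></b>",
}

for label, text in samples.items():
    print(label, "->", is_well_formed(text))
```

Only the clean nesting, XML-form empty tag, and quoted attribute samples should be accepted; the parser rejects the rest.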
<div1 type="section" n="3">
<head> The Validity of Well-Formedness </head>
Q: Parsed? DTD? I don't like the sound of this.
A DTD (document type definition) is a bedrock principle of SGML, the meta-language from which XML was formed. A DTD is a text file that lays out in a machine-readable syntax the list of tags that belong to a given tagset, and any usage rules that govern the use of those tags. You can learn to build and read a DTD if you wish, but you don't want to hang out with folks who do it for a hobby (trust me on this).
The information that the DTD holds in a shorthand fashion will be available typically as a tag library document -- a guide more palatable for humans. A parser is a piece of software that reads a DTD, and checks your file against it. An XML browser will include a parser in it, or you can use standalone parsers.
Q: Why should I care to parse a document?
To make sure that you have done the things that you thought you were doing in the tagging. For example, if you wanted to include a date and copyright field in every document, the DTD will tell the parser to check to see that every file has these required elements, and will report any that do not. Parsing against a DTD makes for much more predictable files, which makes them work better when they are en masse -- in an etext collection for example -- and makes them easier to mix in with files created elsewhere that also parse against the same DTD.
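To give a feel for the kind of check described here, the sketch below imitates one small thing a validating parser does: reporting required elements that are missing. It is not DTD validation (Python's standard-library parser does not validate), and the element names date and copyright, the sample documents, and the function name missing_required are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical names for the required fields; a real DTD would declare them.
REQUIRED = ("date", "copyright")


def missing_required(text, required=REQUIRED):
    """List the required element names that never occur in the document."""
    root = ET.fromstring(text)
    present = {element.tag for element in root.iter()}
    return [name for name in required if name not in present]


complete = "<etext><date>2002</date><copyright>UVA</copyright><body/></etext>"
incomplete = "<etext><date>2002</date><body/></etext>"

print(missing_required(complete))    # nothing missing
print(missing_required(incomplete))  # reports the absent copyright element
```

A real DTD goes further than this: it also governs where each element may appear and how often.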
Q: But I don't have to do that, right?
No: an XML document must be "well-formed" but does not have to be "valid".
Well-formed: the file obeys the requirements for an XML file (see section 2 above). Every XML document must be well-formed, otherwise the browser will spit it back.
Valid: A valid XML document is well-formed, but it also parses against a DTD -- its use of tags has been validated against a copy of the tagging rules that you think you are following.
<div1 type="section" n="4">
<head>Stylish XSL </head>
Q: So once I have made my document using XML rules, how will it work on the web or in XML software?
With HTML browsers such as Netscape, the browser knows everything about a single tagset -- HTML -- and when it sees those tags in a document it knows what to do with them. If you sent Netscape a document tagged in, say, TEI, it would not have a clue how to render the non-HTML tags. This is why services such as the Etext Center provide "on-the-fly" HTML conversion -- we can keep our data in rich forms of SGML, but still get HTML generated automatically for net delivery.
Q: So how's an XML browser going to know the tags in my document, especially if I have made them up myself?
The principal difference with XML browsers will be that the browser software will not have -- cannot have -- a hard-coded knowledge of the tags that will be in a document that it will be expected to display. Instead, you send the browser the document and a pointer to a stylesheet -- a file that accompanies the document and that defines your tags, and how you would like them to be rendered. So, the browser does not know everything about one tagset as an HTML browser does; rather, it knows nothing about specific tags, but can learn real quick, by reading your stylesheet.
The stylesheet can be in one of two forms: Cascading Style Sheets (CSS) and Extensible Stylesheet Language (XSL).
CSS is already in place, supported by Netscape and Internet Explorer, and often used for HTML documents. XSL is informed by the earlier and difficult DSSSL and is more powerful than CSS. It is now in two parts -- the XSL stylesheet language for specifying the display attributes of a document, and XSLT, "which is a language for transforming XML documents into other XML documents." [http://www.w3.org/Style/XSL/]. This from a recent Microsoft article: June 7, 2000: "Transforming XML: Copying, Deleting, and Renaming Elements," by Bob DuCharme. "As XML becomes more popular, and the dreams of shared DTDs often prove unrealistic, a quick and easy way to convert documents that conform to your DTD into documents that conform to my DTD becomes very valuable. This is especially so if you and I want to do business together without going to the trouble of authoring a DTD that we can both agree on. An XSLT style sheet specifies how to transform a set of elements."
See the following for more information:
http://www.w3.org/Style/ and http://www.w3.org/Style/css/#learn
In brief, a CSS stylesheet command contains a Selector and a Declaration:
H1 { color: green; text-align: center }
In this case the selector H1 has attached to it the display declaration that it should be in green and centered.
XSL is somewhat more complex and powerful, and still developing. It includes CSS functions, is itself written in XML tags, and not only adds descriptive layout information but can also re-arrange a document or convert it from one tagset to another.
Here's a simple XSL instruction to process a document with a tag called HEADLINE and to convert it to the HTML tag H1:
<xsl:template match="HEADLINE">
<H1><xsl:apply-templates/></H1>
</xsl:template>
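The effect of such a conversion, turning every HEADLINE into an H1, can be imitated without an XSL processor. The following is a rough sketch in Python that renames elements directly; it stands in for the transformation rather than running any XSL, and the sample document and the function name rename_elements are invented for illustration.

```python
import xml.etree.ElementTree as ET


def rename_elements(root, old, new):
    """Rename every element called `old` to `new`, keeping its content."""
    # Collect the matches first so the tree is not changed mid-iteration.
    for element in list(root.iter(old)):
        element.tag = new
    return root


# Hypothetical source document using a made-up tagset.
source = "<STORY><HEADLINE>XML Arrives</HEADLINE><PARA>Text.</PARA></STORY>"

root = rename_elements(ET.fromstring(source), "HEADLINE", "H1")
print(ET.tostring(root, encoding="unicode"))
```

A stylesheet has the advantage that the conversion rules live in a separate file that travels with the document, instead of being buried in a program.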
<div1 type="section" n="5">
<head>X-link [XLL] </head>
The linking component of XML -- XLink -- is in active development and describes a range of hypertextual links within and across documents. It is divided into three sections: XPath, XPointer, and XLink. See http://www.oasis-open.org/cover/xll.html
In brief, XML will support the same simple form of linking that we are used to in HTML, where one can link from one document to another, but will give us much more powerful linking possibilities.
For example, an XML link can point to a part of another document and the browser can retrieve just that part, and not the whole file; links will be bi-directional; a link can point to multiple points in multiple documents; and they can be absolute or relative.
The latter point is in some ways the most exciting: an absolute link is to a place that is explicitly marked in the target text -- perhaps with an ID tag. But a relative link has nothing specific in the target file to say, "hey, link to me here!". Instead, you can express an XML link as a description of a place -- "go to the third poem tag and then link to the second stanza". The weakness here of course is that if the target document changes then the link is either invalid or misleading, and the absolute links are much safer, but the relative links can be made to files in which you have no write permission.
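The difference between the two kinds of link can be sketched with the small subset of XPath that Python's standard library understands. This is only an illustration of addressing by ID versus addressing by position, not an XLink or XPointer implementation; the poems document, the id value s32, and the function name follow are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical target document: three poems, the last with two stanzas.
TARGET = """<poems>
  <poem><stanza>one-one</stanza></poem>
  <poem><stanza>two-one</stanza></poem>
  <poem><stanza>three-one</stanza><stanza id="s32">three-two</stanza></poem>
</poems>"""

ABSOLUTE = ".//stanza[@id='s32']"  # relies on a marker placed in the target
RELATIVE = "poem[3]/stanza[2]"     # third poem, then its second stanza


def follow(root, path):
    """Return the text the path points at, or None if nothing is there."""
    found = root.find(path)
    return None if found is None else found.text


root = ET.fromstring(TARGET)
before = (follow(root, ABSOLUTE), follow(root, RELATIVE))

# The target document changes: a new poem is added at the front.
root.insert(0, ET.Element("poem"))
after = (follow(root, ABSOLUTE), follow(root, RELATIVE))

print("before:", before)
print("after: ", after)
```

Before the change both paths reach the same stanza. Afterwards the absolute path still finds it, while the relative path no longer does, which is the weakness described above.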
Related to linking, but technically somewhat different, is the notion of inclusion (xinclude). In XML, as in SGML, a document can contain an entity in this form &inclusion; where "inclusion" (or whatever name you choose) is a file (or part of a file) that is incorporated into the document at the point at which the entity occurs -- so it is a shorthand way of building a text from existing pieces even when the existing pieces exist in separate files.
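The entity form can be seen at work in a few lines of Python. As a simplifying assumption, the sketch below declares the entity inside the document itself rather than pointing it at a separate file, because the standard-library parser used here expands internal entities but does not fetch external ones; the entity's replacement text is made up.

```python
import xml.etree.ElementTree as ET

# The DOCTYPE declares an entity called "inclusion"; the body refers to it.
document = """<!DOCTYPE text [
  <!ENTITY inclusion "words kept in one place and reused">
]>
<text><p>Before, &inclusion;, after.</p></text>"""

root = ET.fromstring(document)

# The parser has already swapped the entity reference for its content.
print(root.find("p").text)
```

With an external entity the declaration would name a file instead of quoting the text, and a parser that supports external entities would read that file in at the same point.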
For a discussion, see "XInclude, Anyone", by Chris Lovett, in Microsoft's Extreme XML online magazine.
<div1 type="section" n="6">
Q: Wait a minute -- say all that again?
- XML is a simplified version of SGML
- XML allows you to have structural information in your files (metadata), rather than simply typesetting information.
- XML separates the description of structure in a document <title> from the information about its appearance -- please make all titles centered and in italics. The former is in the document, while the latter is in a stylesheet that accompanies the document. This makes it much easier to have -- and to change -- a common appearance across a set of documents, or to have different display instructions for different types of audience.
- XML structure (like SGML structure) allows XML search and browsing software to address and deliver a part of a document rather than a whole file (unlike HTML).
- XML allows you to create tags if you need to.
- Much more powerful linking possibilities, and the ability to "read in" one file into another.
- HTML will remain with us, in its current form but also as XML-conformant HTML
- XML is not harder to learn than HTML, if your needs are only to create simple documents.
<div1 type="section" n="7">
<head> Selected XML Resources </head>
The SGML/XML Web Page
"The SGML/XML Web Page is a comprehensive online database containing reference information and software pertaining to the Standard Generalized Markup Language (SGML) and its subset, the Extensible Markup Language (XML). The database features an SGML/XML news column "What's New?" and a cumulative annotated bibliography with thousands of entries. The collection contains documents explaining and illustrating the application of the SGML/XML family of standards, including HyTime, DSSSL, XSL, XLL, XLink, XPointer, SPDL, CGM, ISO-HTML, and several others."
XML Developer Center (from Microsoft)
Microsoft's XML site contains a wealth of information, including guidelines for authoring and displaying XML documents, and an overview of XML support in Internet Explorer. Extensive information about XSL, including an online tutorial, is also available here.
Maintained jointly by Seybold Publications and O'Reilly & Associates, XML.COM is a rich source of articles and other features about XML, with an emphasis on commercial applications. The site includes the Annotated XML Specification.