What is Extensible Markup Language (XML)?

"The question is not an easy one to answer.  On one level, XML is a protocol for containing and managing information.  On another level, it's a family of technologies that can do everything from formatting documents to filtering data.  And on the highest level, it's a philosophy for information handling that seeks maximum usefulness and flexibility for data by refining it to its purest and most structured form."

—Erik T. Ray, Learning XML

Section 1: The Basic Scoop

XML is a simple set of rules for defining tags in a web-friendly manner -- tags that can describe the hierarchical structure of a document, as the Text Encoding Initiative (TEI) tag-set does, rather than simply its appearance on-screen, as HTML does.

Q: Isn't that what the Standard Generalized Markup Language (SGML) does?

Yes.  XML is skinny SGML. Some items have been removed from the SGML rules and some syntax changed to create something that is as powerful in its descriptive abilities, but much more web-friendly, and much easier for software such as web browsers to deal with.

Q: So XML is "HTML on steroids"?

No! XML does not give you a set of tags to use, as HTML does. XML is NOT a markup language in and of itself; rather XML is a set of rules for building markup languages that suit specific purposes.

Q: So you can have XML versions of HTML, or EAD, or TEI?

Exactly. When the HTML rules are re-written to accommodate the XML features and assumptions, then you will have XML-compliant HTML. You can have an XML version of just about every other SGML tag-set.

Q: If XML is a language for making tag-sets, then I can use it to make my own, correct?

Exactly--that is the basis of XML's "extensibility."

Q: But why would I want to make up my own tag-set? It sounds like hard work.

It is. Typically, you would try to match your document type to an existing set of tags--EAD for archival guides, for example. Let someone else do the work, and benefit from discussion groups, software, and the inter-operability advantages of following an established norm. But you can "roll your own" if needs dictate.
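
As a minimal sketch of "rolling your own," the Python snippet below parses a small document marked up with an invented tag-set. The tag names (recipe, title, ingredient) are hypothetical, made up purely for illustration; the parser is Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# A made-up tag-set for recipes; none of these tag names come from a standard.
RECIPE_XML = """<recipe>
  <title>Plain Toast</title>
  <ingredient>bread</ingredient>
  <ingredient>butter</ingredient>
</recipe>"""


def list_ingredients(xml_text):
    """Return the text of every <ingredient> child of the root, in document order."""
    root = ET.fromstring(xml_text)
    return [item.text for item in root.findall("ingredient")]


print(list_ingredients(RECIPE_XML))
```

Because the tags describe what the content is rather than how it looks, software can pull out exactly the pieces it needs.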

Section 2: Writing Well-formed XML

Section 3: "Valid" XML

Q: Parsed? DTD? XML Schema? I don't like the sound of this.

A document type definition (DTD) is a bedrock principle of SGML, the meta-language from which XML was formed. A DTD is a text file that lays out in a machine-readable syntax the list of tags that belong to a given tag-set, and any usage rules that govern the use of those tags.

A parser is a piece of software that reads your file and, if it is a validating parser, also reads the DTD and checks your file against it. An XML browser will include a parser, or you can use a standalone parser.

Q: Why should I care to parse a document?

To make sure that you have done the things that you thought you were doing in the tagging. For example, if you wanted to include a date and copyright field in every document, the DTD will tell the parser to check that every file has these required elements, and the parser will report any that do not. Parsing against a DTD makes for much more predictable files, which makes those files work better when they are used en masse -- in an etext collection, for example -- and makes them easier to mix in with files created elsewhere that also parse against the same DTD.
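
The kind of check described above can be sketched by hand in Python. This is not DTD validation (Python's standard library does not validate against a DTD); it is a hand-written stand-in that only looks for required child elements. The element names etext, date, and copyright are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of child elements every document must contain.
REQUIRED = ("date", "copyright")


def missing_required(xml_text, required=REQUIRED):
    """Return the required tag names that are not children of the root element."""
    root = ET.fromstring(xml_text)
    return [name for name in required if root.find(name) is None]


complete = "<etext><date>1999</date><copyright>Example</copyright></etext>"
incomplete = "<etext><date>1999</date></etext>"

print(missing_required(complete))
print(missing_required(incomplete))
```

A real validating parser does this and much more from the DTD alone, with no per-document code to write.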

Q: But I don't have to do that, right?

Right, you do not have to: an XML document must be "well-formed" but does not have to be "valid".

Well-formed: the file obeys the four requirements for an XML file (see section 2 above). Every XML document must be well-formed, otherwise the browser will refuse to reveal the document's content.

Valid: A valid XML document is well-formed and parses against a DTD; a document's tag-structure has been validated against the tagging rules defined in the DTD.
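
The well-formedness half of this distinction is easy to try out. The sketch below uses Python's standard xml.etree.ElementTree parser, which checks well-formedness only and does not validate against a DTD; the function name is_well_formed and the sample poem markup are invented for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<poem><line>text</line></poem>"
bad = "<poem><line>text</poem>"  # <line> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```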

Q: What about XML Schema?

XML Schema, like a DTD, is a way to describe the structural rules for a specific tag-set.  However, unlike a DTD, an XML Schema is actually written in XML syntax.  Unlike a DTD, XML Schema allows you to specify types of content that correspond to certain elements.  For instance, if you have an element, <phone_num> </phone_num>, you can actually specify in an XML Schema that the content inside this tag be written with three numbers in parentheses (for area code), three numbers followed by a hyphen, and four concluding digits. This: <phone_num>(804) 924-3230</phone_num> is valid whereas <phone_num>924-3230</phone_num> is not.
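
To make the phone-number rule concrete, here is a rough Python equivalent using a regular expression. It is only an analogy: a real XML Schema would state the constraint inside the schema file, and a validating parser would enforce it. The names PHONE_PATTERN and is_valid_phone are hypothetical, and the single space after the area code is assumed from the example above.

```python
import re

# Three digits in parentheses, a space, three digits, a hyphen, four digits.
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def is_valid_phone(text):
    """Return True only if the whole string matches the phone-number pattern."""
    return PHONE_PATTERN.fullmatch(text) is not None


print(is_valid_phone("(804) 924-3230"))
print(is_valid_phone("924-3230"))
```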

Section 4: Displaying XML

Q: So once I have made my document using XML rules, how will it work on the web or in XML software?

With HTML browsers such as Netscape, the browser knows everything about a single tag-set -- HTML -- and when it sees those tags in a document, it knows what to do with them. If you sent Netscape a document tagged in, say, TEI, it would not have a clue how to render the non-HTML tags. This is why services such as the Etext Center provide "on-the-fly" HTML conversion -- we can keep our data in rich forms of SGML, but still get HTML generated automatically for net delivery.

Q: So how's an XML browser going to know the tags in my document, especially if I have made them up myself?

A browser will not and cannot have a hard-coded knowledge of the tags that will be in a document you want to display.  Instead, you send the browser the document and a stylesheet.  A stylesheet is a file that accompanies the document (either internally or externally) and that defines your tags and how you would like them to be rendered.  So, while the browser knows nothing about specific tags, a stylesheet can help it learn those tags real quick.  The stylesheet can be in one of two forms: Cascading Style Sheets (CSS) or the Extensible Stylesheet Language (XSL).

In brief, a CSS stylesheet command contains a Selector and a Declaration:

H1 {color: green; text-align: center;}

In this case the selector H1 has attached to it the display declaration that it should be in green and centered.

--XSL specifies the display attributes of a document and

--XSLT "is a language for transforming XML documents into other XML documents" [see http://www.w3.org/Style/XSL/].

XSL is somewhat more complex and powerful, and is still developing. It includes CSS functions, is itself written in XML syntax, and not only adds descriptive layout information but can re-arrange a document and convert from one tag-set to another: we'll get to this later in the workshop.
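
The idea of converting from one tag-set to another can be previewed without XSLT. The sketch below is not XSLT and not how an XSL processor works internally; it is a hand-written Python stand-in that simply renames tags. The poem/title/line tag-set and the mapping to HTML tags are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a made-up poetry tag-set to HTML tags.
TAG_MAP = {"poem": "div", "title": "h1", "line": "p"}


def convert_tags(xml_text, tag_map=TAG_MAP):
    """Rename every element found in tag_map; leave all other tags alone."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


source = "<poem><title>Spring</title><line>First line</line></poem>"
print(convert_tags(source))
```

A real XSLT stylesheet expresses rules like these declaratively, and can also reorder, filter, and restructure the content.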

Section 5: X-link [XLL]

The linking component of XML -- XLink -- is in active development and describes a range of hypertextual links within and across documents. It is divided into three sections: XPath, XPointer, and XLink [see http://www.oasis-open.org/cover/xll.html].

In brief, XML will support the same simple form of linking that we are used to in HTML, where one can link from one document to another, but will give us much more powerful linking possibilities.  For example, an XML link can point to a part of another document and the browser can retrieve just that part, and not the whole file; links will be bi-directional; a link can point to multiple points in multiple documents; and they can be absolute or relative.

The latter point is in some ways the most exciting: an absolute link is to a place that is explicitly marked in the target text -- perhaps with an ID tag. But a relative link has nothing specific in the target file to say, "hey, link to me here!" Instead, you can express an XML link as a description of a place -- "go to the third poem tag and then link to the second stanza."
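
The "third poem, second stanza" description can be tried out with the small subset of XPath that Python's xml.etree.ElementTree supports. This is a sketch of the addressing idea only, not an XLink or XPointer implementation; the anthology document and the function name resolve are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up target document with no IDs or anchors anywhere in it.
ANTHOLOGY = """<anthology>
  <poem><stanza>1a</stanza></poem>
  <poem><stanza>2a</stanza></poem>
  <poem><stanza>3a</stanza><stanza>3b</stanza><stanza>3c</stanza></poem>
</anthology>"""


def resolve(xml_text, path):
    """Return the text of the first element matching path, or None if no match."""
    root = ET.fromstring(xml_text)
    target = root.find(path)
    return None if target is None else target.text


# "Go to the third poem tag and then to the second stanza."
print(resolve(ANTHOLOGY, "poem[3]/stanza[2]"))
```

Nothing in the target file had to be marked in advance; the path alone identifies the place.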

Related to linking, but technically somewhat different, is the notion of inclusion (xinclude). In XML, as in SGML, a document can contain an entity in this form -- &inclusion; -- where "inclusion" (or whatever name you choose) is a file (or part of a file) that is incorporated into the document at the point at which the entity occurs.  It is a shorthand way of building a text from existing pieces even when the existing pieces exist in separate files.
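
Entity expansion can be seen in a small Python sketch. To keep the example self-contained, the entity here is declared inside the document itself rather than pointing at a separate file, which is an assumption of this example and a simpler case than the file inclusion described above. The entity name "inclusion" follows the text; the replacement text and the function name are made up.

```python
import xml.etree.ElementTree as ET

# The DOCTYPE declares an internal entity; the body refers to it as &inclusion;
DOCUMENT = """<!DOCTYPE text [
  <!ENTITY inclusion "text kept in one place and reused">
]>
<text><p>&inclusion;</p></text>"""


def expanded_paragraph(xml_text):
    """Parse the document and return the <p> content after entity expansion."""
    root = ET.fromstring(xml_text)
    return root.find("p").text


print(expanded_paragraph(DOCUMENT))
```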

Section 6: Summary

Section 7: Selected XML Resources

The SGML/XML Web Page
http://www.oasis-open.org/cover/sgml-xml.html

"The SGML/XML Web Page is a comprehensive online database containing reference information and software pertaining to the Standard Generalized Markup Language (SGML) and its subset, the Extensible Markup Language (XML). The database features an SGML/XML news column "What's New?" and a cumulative annotated bibliography with thousands of entries. The collection contains documents explaining and illustrating the application of the SGML/XML family of standards, including HyTime, DSSSL, XSL, XLL, XLink, XPointer, SPDL, CGM, ISO-HTML, and several others."


XML Developer Center (from Microsoft)
http://msdn.microsoft.com/xml/

Microsoft's XML site contains a wealth of information, including guidelines for authoring and displaying XML documents, and an overview of XML support in Internet Explorer. Extensive information about XSL, including an online tutorial, is also available here.

XML.COM
http://www.xml.com/

Maintained jointly by Seybold Publications and O'Reilly & Associates, XML.COM is a rich source of articles and other features about XML, with an emphasis on commercial applications. The site includes the Annotated XML Specification.