
April 20, 2004
PURPOSE
To introduce programmers to the past, present, and future of XML technology in an interactive, tutorial-based fashion. This Web site is intended as a resource for the following people:
- Students of DevelopMentor's Programming / Guerrilla XML courses
- Anyone who desires a quick introduction to the world of XML
After successfully completing this tutorial, you should understand:
- The history of markup languages and why they are so popular today
- When to use XML in your application designs
- The relationships between the various XML specifications
- How to create basic XML documents
- Where to go to learn more
PREREQUISITES
This tutorial is designed for developers who are completely new to XML.
TOOLS
Some of the provided tools/samples require:
AUTHORS
Aaron Skonnard (principal author),
Martin Gudgin (principal author),
Keith Brown
Introduction to XML
This tutorial introduces the past, present, and future of XML technologies.
This web-based tutorial introduces you to the history of markup languages, the evolving family of XML technologies, and the basics of XML syntax. This tutorial is not intended to provide complete or comprehensive coverage of each XML specification. On the contrary, it is geared specifically towards the programmer who needs a quick introduction to XML technology before diving into the complexity that lurks beneath the surface of the various XML specifications.
The ever-increasing number of XML specifications, acronyms and terms, and related tools and products can be very intimidating for those new to the technology and even for those who have been working with it for some time. The intention of this tutorial is to clarify the overall XML picture by covering only the core XML topics that developers need to be familiar with today.
This tutorial differs from other traditional XML training in that it focuses more on the needs of distributed application developers than on those of traditional document-centric SGML'ers (for example, electronic publishing, HTML designers, etc.). Other XML-related topics not covered in this tutorial are either premature at this point or not central to the needs of distributed application developers. Developers that focus on the XML technologies presented herein will find themselves well prepared to leverage XML to its full potential over the next few years.
XML is a universally accepted language for layering type and structure over information--it's simple and flexible.
So why is the world so excited about XML?
XML is just another data serialization format like any other you've worked with in the past. Figure 1 describes what XML does in a single sentence.

Figure 1: What is XML?
Not only is XML tremendously simple, it is also very extensible and flexible. XML is built on the precepts of openness and freedom. As part of that freedom, XML does not mandate a particular API or processing model (although some are recommended). Developers are completely free to decide how they want to program against the XML documents in their systems. There are also several other XML-related specifications that provide additional layers of functionality in a platform-neutral manner.
In short, the simplicity and flexibility of XML make it the perfect universal standard for data serialization. Developers faced with bridging heterogeneous applications in this new distributed world can turn to XML as the de facto solution. XML makes it possible to cross the boundaries that have traditionally limited the reach of the systems we build.
History of XML
The history of markup languages goes back to the late 1960s. Since then, several markup languages have been developed including GML, SGML, HTML, and finally XML.
The Beginning of Markup
Markup languages were first introduced in the 1960s to simplify text-processing systems.
Since the beginning of Computer Science, developers have searched for better ways to structure information. Virtually all applications (for example, databases, word processors, personal information managers, financial software, etc.) use custom binary formats to structure their information. Because of this, information emitted by one application cannot be consumed by another application without some type of data transformation.
All information stored on computer devices is structured in one way or another. Back in the 1960s, computers were being heavily used to simplify text-processing systems. Originally, copy editors and typesetters performed text processing manually. A copy editor would go through the physical document and "mark it up" with common symbols to indicate the desired formatting. The typesetter would then go through the physical document and translate the copy editor's "markup" into the appropriate machine format. This process was tedious and time-consuming, and as a result it begged for computer automation.
The groups of symbols used by copy editors and typesetters became known as markup languages early on. There were two classes of markup languages: specific markup, related to formatting specifics; and generic markup, which focused more on document structure. Specific markup is very similar to the type of handwritten markup performed by copy editors as it was completely focused on the resulting presentation format. The rich text format (RTF) is a modern example of specific formatting as illustrated in Figure 2.
{\rtf1\ansi\ansicpg1252\deff0\deflang1033
{\fonttbl{\f0\fswiss\fcharset0 Arial;}}
\viewkind4\uc1\pard\b\f0\fs24 Contacts\fs20\par
\par
Bob Smith\b0\par
801-555-1212\par
\i bob@smith.com\i0\par
\i\par
\b\i0 Jennifer Adams\b0\par
801-555-2121\par
\i jen@adams.com\i0\par
\par
}
Figure 2
Figure 3 illustrates what this document looks like when opened in Microsoft Wordpad.

Figure 3: RTF Document in Wordpad
Generic markup doesn't attempt to define presentation details, but rather the document structure (the presentation would be defined later through a transformation). The benefit of generic markup is that a document's structure can be codified into a common format that facilitates computer processing. It appeared that a type of generic markup language was the key to automating text processing systems towards the end of the 1960s.
Click here to view the markup language timeline.
Generalized Markup Language
Generalized Markup Language (GML) was one of the earliest formalized generic markup languages to separate document structure from presentation.
Most give credit to William Tunnicliffe, chairman of the Graphic Communications Association (GCA), for introducing the concept of a markup language that separates document structure from presentation in 1967. The GCA continued working on his ideas by forming the GenCode working group to begin formalizing what might eventually become an industry standard. Around the same time, a book designer named Stanley Rice was promoting similar ideas in conference presentations. Both of these men seemed to influence many in the area of text processing.
Charles Goldfarb, a lawyer turned IBM developer, was one of the first to put these ideas to work. He, along with Edward Mosher and Raymond Lorie, invented the Generalized Markup Language (GML) in 1969 for use in IBM's integrated text processing systems (notice that GML also matches the inventors' initials; according to Goldfarb, this is not a coincidence).
GML made it possible for developers to create markup vocabularies (a set of markup tags) that described different document types without ambiguity. Figure 4 provides an example of GML markup.
:h1.Example of GML
:p.Here is the first paragraph
and here is an ordered list
:ol.
:li.Item A
:li.Item B
:li.Item C
:eol.
Figure 4
GML represented a revolutionary step towards automated text processing and laid a solid foundation for future markup languages.
Standard Generalized Markup Language
The work of Goldfarb and the GCA eventually evolved into an international standard: SGML (ISO 8879:1986).
Nearly a decade after GML was first introduced, ANSI formed the Computer Languages for the Processing of Text committee with a charter to create a standard markup language based on the work of Goldfarb and GenCode. Goldfarb was asked to join and later to lead the committee, which was highly supported by the GCA. The committee produced their first working draft in 1980 and called their work the Standard Generalized Markup Language (SGML). The sixth SGML working draft was released in 1983 and was quickly adopted by a few U.S. government agencies (Internal Revenue Service and the Department of Defense).
In 1984, the SGML working group was reorganized under both ANSI and ISO, and work resumed towards making SGML an international standard. The final SGML international standard was published in 1986 (ISO 8879:1986). If you would like to read more about Goldfarb's personal recollection of this history, click here.
SGML, like GML, provides a powerful, yet flexible, language for defining custom markup vocabularies for different document types. For example, an application working with contact information could use SGML to define a contact information document type definition (DTD) that codifies what specific contact instances should look like (see Figure 5).

Figure 5: Document Type Definitions vs. Document Instances
Figure 6 illustrates what the instance document might look like (assuming a DTD named contacts.dtd exists).
<!DOCTYPE contacts SYSTEM "contacts.dtd">
<contacts>
<contact>
<name>Bob Smith</name>
<phone>801-555-1212</phone>
<email>bob@smith.com</email>
</contact>
<contact>
<name>Jennifer Adams</name>
<phone>801-555-2121</phone>
<email>jen@adams.com</email>
</contact>
</contacts>
Figure 6
Problems with SGML
SGML proved to be powerful and flexible, but also too complex.
It took nearly 20 years from the time the idea was first conceived for SGML to become a standard. SGML serves a very broad international audience with a wide variety of needs. Because of this, SGML contains many options and features that aren't needed by the majority of applications; this added unneeded complexity to the language. In fact, it was SGML's complexity that prevented it from becoming ubiquitously adopted throughout the industry.
As illustrated in Figure 6, SGML applications are required to have a DTD associated with each instance document. The DTD can be used for sophisticated document validation or simply as documentation. The DTD specifies which markup tags can be used in the document and how they must be structured.
Although the DTD for this example would be trivial, developing functional DTDs in general proved to be too difficult, time-consuming, and expensive for mass industry consumption. There were several application domains that highly leveraged SGML from day one, such as electronic publishing systems, but most developers didn't give it consideration.
Another negative side effect of SGML's complexity was a major lack of tool support. Developers that did decide to use SGML within their applications had very few tools available to assist them. Some feel that SGML was ahead of its time and after becoming familiar with everything the language has to offer, you'd probably agree.
HTML
Tim Berners-Lee created a specific SGML document type called the Hypertext Markup Language (HTML) in 1989.
The creation of SGML set the stage for one of the most important inventions of our time: the World Wide Web. A few years after SGML was standardized, Tim Berners-Lee created the Hypertext Markup Language (HTML) at CERN. HTML was originally defined as an SGML document type. HTML offers a specific markup vocabulary for formatting or structuring documents as well as linking documents together. Figure 7 illustrates how the contact information shown earlier could be structured using HTML.
<html>
<head>
<title>Contacts</title>
</head>
<body>
<h1>Contacts</h1>
<b>Bob Smith</b><br>
801-555-1212<br>
<i>bob@smith.com</i><br>
<br>
<b>Jennifer Adams</b><br>
801-555-2121<br>
<i>jen@adams.com</i><br>
<a href="others.htm">Other Contacts</a>
</body>
</html>
Figure 7
Notice at the end of the document there is a link to another HTML document (others.htm). The ability to link an HTML document to other documents offered the potential to create a web of inter-linked documents over the Internet. Berners-Lee named his invention the World Wide Web (WWW).
Software agents capable of viewing and processing HTML documents (for example, following inter-document links) were quickly developed and called "web browsers". By 1993, NCSA released their Mosaic browser on all major platforms and a year later Netscape was formed. Once web browsers became prevalent, interest in the WWW began to explode.
Thanks to the simplicity of HTML it was universally adopted and the number of available Web pages began growing exponentially. Suffice it to say that HTML has become the most successful and prevalent document format ever used.
Berners-Lee formed the World Wide Web Consortium (W3C) in 1994 at MIT/CERN. The W3C is a business consortium that openly develops and recommends Web-related specifications to help lead the technical evolution of the World Wide Web. Any company or individual can join the W3C and participate in its various working group activities by paying the yearly membership dues. The W3C continues to play a dominant and central role in developing today's Web-related technologies.
For a good read on the past, present, and future of the World Wide Web and the W3C, check out Berners-Lee's book entitled Weaving the Web.
Problems with HTML
It quickly became apparent that HTML was too simple for extended application needs.
The biggest problem with HTML was the opposite of SGML's--it was too simple for general application needs. HTML was great for formatting documents for viewing on the Web, but it broke down when applications had extended markup needs.
For example, consider the situation where one application needs to send contact information to another heterogeneous application for consumption. In the previous HTML example it's hard to find where the first contact ends and the second one begins because the markup doesn't denote such a thing. HTML only defines a limited markup vocabulary related to formatting and linking, which is not extensible. Because of this, HTML could not be used effectively outside the world of Web browsers.
In general, HTML was too limited because developers couldn't introduce custom tags to describe their application-specific data types. What the world really needed was a compromise between SGML's flexibility and HTML's simplicity.
Enter XML
The Extensible Markup Language (XML) was proposed to the W3C in 1996 as a major simplification of SGML.
The Extensible Markup Language (XML) was proposed to the W3C in 1996 as a major simplification of SGML. Jon Bosak of Sun Microsystems is considered by many the "father of XML" because he organized the effort and led the initiative. In February of 1998, XML was approved as a W3C Recommendation, the final stage in the W3C technical specification process. It took a relatively short period of time to finish XML in comparison to the SGML timetable. This was mostly because XML completely leveraged nearly 30 years of markup evolution. XML was not a revolutionary new concept, rather a simplification of SGML.
The major goal of XML was to create a language that offered similar capabilities to those offered by SGML but that was much simpler for developers to learn and use. This meant removing many of the options and features that were introduced into SGML for what some would consider "niche" areas. It also meant removing anything from the language that introduced significant complexity for tool implementations. In other words, simplicity was considered an absolute requirement for this new markup language to become universally accepted on the WWW.
XML works very much like SGML in that it provides syntax for developing custom markup vocabularies. Figure 8 illustrates that XML is merely a simplification of SGML for defining markup vocabularies.

Figure 8: XML vs. SGML
The fundamental difference between the two is that SGML requires DTDs while XML does not. Although XML documents can indeed leverage DTDs just like SGML, they are no longer required to do so. Figure 9 contains marked-up contact information, which is completely acceptable XML.
<contacts>
<contact>
<name>Bob Smith</name>
<phone>801-555-1212</phone>
<email>bob@smith.com</email>
</contact>
<contact>
<name>Jennifer Adams</name>
<phone>801-555-2121</phone>
<email>jen@adams.com</email>
</contact>
</contacts>
Figure 9
This is a very significant difference because it allows developers to begin using XML without getting bogged down in DTD development. Some developers would argue that it's unsafe to work with documents that don't come with DTDs because there is no way for them to be validated. While this is a valid concern, many situations on the Web simply don't require or want DTDs in the picture. Whether or not this is a safe programming practice is left for those developers to deal with. When XML documents do leverage DTDs, the DTDs are very similar to those of SGML, only there aren't as many options or features to choose from (thank goodness).
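The difference between a DTD-less document and a validated one can be sketched in a few lines of Java. This is only an illustration under stated assumptions: it uses the JAXP DocumentBuilder API, the class name DtdOptionalDemo is made up, and the DTD is a hypothetical one written as an internal subset for this example (the tutorial does not show the contents of contacts.dtd).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdOptionalDemo {
    // Contact markup with no DTD at all.
    static final String CONTACTS =
        "<contacts><contact><name>Bob Smith</name>"
      + "<phone>801-555-1212</phone><email>bob@smith.com</email>"
      + "</contact></contacts>";

    // Hypothetical DTD (an assumption of this sketch), supplied as an internal subset.
    static final String DTD =
        "<!DOCTYPE contacts ["
      + "<!ELEMENT contacts (contact*)>"
      + "<!ELEMENT contact (name, phone, email)>"
      + "<!ELEMENT name (#PCDATA)>"
      + "<!ELEMENT phone (#PCDATA)>"
      + "<!ELEMENT email (#PCDATA)>"
      + "]>";

    // Returns true if the text parses; with validate=true it must also satisfy its DTD.
    static boolean parses(String xml, boolean validate) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setValidating(validate);
            DocumentBuilder b = f.newDocumentBuilder();
            // Turn validity errors into failures; by default they are only reported.
            b.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            b.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String missingEmail = "<contacts><contact><name>Bob Smith</name>"
                            + "<phone>801-555-1212</phone></contact></contacts>";
        System.out.println(parses(CONTACTS, false));          // well-formed, no DTD needed
        System.out.println(parses(DTD + CONTACTS, true));     // satisfies the DTD
        System.out.println(parses(DTD + missingEmail, true)); // breaks the content model
    }
}
```

The first call succeeds without any DTD, which is the freedom XML adds over SGML. The other two show what a DTD buys when it is present: the contact without an email element is rejected.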
For a more detailed explanation of the differences between XML and SGML, see this W3C Note. Also, there are several other articles on this topic listed on Robin Cover's XML Cover Pages.
The XML working group consisted of many SGML'ers as well as many Web experts. This interesting combination of experience led to an amazingly successful specification that was quickly adopted by the entire industry. Not long after XML became a W3C Recommendation there was already a wide range of tool support available for most programming environments. Today, only a few years later, XML has become so commonplace that it's hard to find a situation where the tool support seems lacking. Within the same time frame, several other technical specifications have even been layered over XML to offer higher levels of functionality and productivity.
XML was kept simple, developed quickly, and was widely supported early on: now it's quickly being adopted by software vendors, big and small. And all this without sacrificing flexibility--XML allows developers to describe any type of application data with a simple text-based markup language. XML promises to make interoperability headaches a thing of the past. As long as your software consumes and emits XML, any other piece of software that understands XML, regardless of platform or programming language, can integrate with yours.
Although XML owes its existence to those early advances in document-centric systems, it has evolved beyond that to address a much wider range of distributed application challenges. Today XML is commonly used in a wide variety of distributed applications that require high levels of interoperability. This "data-centric" view of XML requires looking at the technology from a different perspective that focuses on XML's abstract data model and type system. The remainder of this tutorial offers precisely this perspective.
XML Overview
XML is a set of specifications from the W3C that has achieved industry-wide acceptance. XML adds type and structure to information.
What Is XML?
XML adds type and structure to information.
XML is no different than any other serialization format of the past except for the impressive fact that it has been universally accepted throughout the software industry. XML offers a simple text-based syntax for adding type and structure to information. The main difference between XML and other formats is that XML is trivial to generate and consume within applications. Although XML is more verbose than most binary data formats, the interoperability benefits that stem from its acceptance make that verbosity an acceptable cost.
Figure 10 illustrates how an instance of a C struct can be mapped to XML. As you can see, XML uses tags to define the structure of the information. In this example, the spot element contains two children, x and y. Application type definitions can also be layered on top of this structure through additional metadata constructs as illustrated by the type attribute.

Figure 10: C struct to XML
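As a rough sketch of the mapping, the following Java class stands in for the C struct and writes it out as the spot element described above. The Point class, the toXml method, and the assumption that the struct holds two integers named x and y are illustrative only; the values 20 and 40 are the ones used in the spot example later in this tutorial.

```java
class SpotToXml {
    // Java stand-in for the C struct (assumed here to hold two integer fields).
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Writes one child element per field; the type attribute carries the application type name.
    static String toXml(String elementName, Point p) {
        return "<" + elementName + " type='POINT'>"
             + "<x>" + p.x + "</x>"
             + "<y>" + p.y + "</y>"
             + "</" + elementName + ">";
    }

    public static void main(String[] args) {
        // prints <spot type='POINT'><x>20</x><y>40</y></spot>
        System.out.println(toXml("spot", new Point(20, 40)));
    }
}
```

Integer fields need no escaping, which keeps the sketch short. Text fields would have to escape characters such as < and & before being written.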
Because of this simplicity and today's rich tool support, XML continues to grow in popularity as the de facto standard for transferring information between heterogeneous systems. For that reason, developers commonly refer to XML as a transfer syntax.
XML is also commonly used to aggregate information from multiple sources into a single unit of information (known as an XML document), that can be stored anywhere on the Internet. Such a document can then be accessible to any Internet-capable system.
Each piece of information contained in an XML document has an XML-specific structure as well as an XML-specific type.
XML Protocol Stack
XML consists of several layered W3C specifications.
A few years ago when people referred to XML they were probably referring to a single specification: XML 1.0. Today, however, XML consists of much more than that first specification produced back in 1998. When people refer to XML today they are typically referring to an entire family of layered specifications.
The specifications are considered "layered" because most newer XML specifications are defined in terms of other XML specifications. This is similar to the way the original network protocol stack was designed in terms of isolated layers. Figure 11 illustrates what the XML protocol stack looks like.

Figure 11: The XML Protocol Stack
Notice that, like the network protocol stack, the XML protocol stack also has seven distinct layers. At the hardware layer XML, like everything else, consists of sectors and bitstreams. Operating systems and network protocols represent the next layer because they offer abstractions (files and packets) that encapsulate the complexities of working with hardware devices.
The next layer, as defined by XML 1.0, defines how physical XML documents are constructed from distinct entities (files and packets). XML 1.0 and the Namespaces in XML Recommendation together help define XML documents in terms of elements and attributes.
The next layer, defined by the Infoset, is significant because it defines XML in terms of an abstract data model that focuses on the structural items of logical XML documents. In other words, the Infoset doesn't mention anything about the serialization format that should be used to model a "logical" XML document--XML 1.0 is just one of many possible serialization formats for this purpose.
The topmost layers define XML's type system in terms of the underlying abstract data model as well as how XML types map to application specific classes and objects.
Most developers prefer to work with XML at the topmost layers (Infoset and above).
W3C Process
The W3C process helps you determine the stability of a specification.
When learning XML, it's important to understand the W3C process. The XML specifications you encounter are typically given a W3C status (for example, Recommendation, Working Draft, Note, etc.). Understanding the difference between these can help you figure out where to focus your energy in the broad XML landscape. Figure 12 represents how an official W3C specification evolves from a W3C Note into a Recommendation.

Figure 12: The W3C Process
As you can see, when an idea is conceived and submitted to the W3C, the document is labeled "Note". Anyone can submit a Note to the W3C and have it published on their site. If there is enough interest in a Note, a working group (WG) is formed along with a "Requirements" document, which acts as the seed for the WG activity.
During the process of developing the particular specification, snapshots of the WG activity are published and labeled "Working Drafts". The WG publishes the "Last Call Working Draft" when they've finished and are ready for industry comments.
After last call, the specification becomes a "Candidate Recommendation", which basically means it's time for developers to implement the specification in order to find implementation-specific issues/concerns. The next step in the process, "Proposed Recommendation", is the last chance to complain about anything in the specification before the W3C promotes the specification to "Recommendation" status. Once a specification becomes a Recommendation, nothing can be changed without repeating the entire process over again.
Understanding this process helps significantly when making decisions based on technology maturity and specification stability.
Note: A W3C Recommendation is different from other industry standards in that it's "officially" nothing more than a recommended direction for the Web that is sponsored by a large business consortium. Nevertheless, they seem to receive the same levels of respect.
XML Dependencies
Certain XML specifications are defined in terms of others.
Figure 13 illustrates how the different XML specifications are layered in terms of specification dependencies.

Figure 13: XML Specification Dependencies
Green indicates a Recommendation, yellow a Candidate/Proposed Recommendation, blue a Working Draft, and purple a Note.
Notice that XML 1.0 relies on the Uniform Resource Identifier (URI) specification, which defines how to create unique identifiers on the Web. URIs were originally created for HTML, but today they are used for a much wider range of resource identifiers. XML uses URIs to identify many different things, the most notable of which are namespaces.
XML 1.0 again was quickly defined as a simplification of SGML. Because speed was a necessity (the Web was growing faster than we could keep up), many things were left out of the core specification but were always planned for additional layers. One of the first and most significant of these layers was the Namespaces in XML Recommendation, which defines how to assign elements and attributes to a unique namespace just like we do with custom type definitions in most traditional programming languages.
Without namespaces in XML, it would be impossible to distinguish between semantically different elements or attributes that happen to share the same name. Because name collisions are all but guaranteed in the XML space (especially within the same application domain), namespaces play a crucial role. As a result, many consider namespaces an addendum to the original specification and always refer to them together as "XML 1.0 + Namespaces".
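A small sketch makes the collision problem concrete. It uses the JAXP and DOM APIs; the class name NamespaceDemo, the order document, and the two urn:example URIs are made up for illustration. Both elements are named title, and only the namespace URI tells them apart.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceDemo {
    // Two "title" elements with different meanings (hypothetical vocabulary).
    static final String XML =
        "<order xmlns:b='urn:example:books' xmlns:p='urn:example:people'>"
      + "<b:title>Weaving the Web</b:title>"
      + "<p:title>Dr.</p:title>"
      + "</order>";

    // Returns the text of the first "title" element that belongs to the given namespace.
    static String titleIn(String namespaceUri) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // JAXP parsers ignore namespaces unless asked
            Document doc = f.newDocumentBuilder()
                            .parse(new InputSource(new StringReader(XML)));
            return doc.getElementsByTagNameNS(namespaceUri, "title")
                      .item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleIn("urn:example:books"));  // prints Weaving the Web
        System.out.println(titleIn("urn:example:people")); // prints Dr.
    }
}
```

The prefixes b and p are only shorthand inside the document. The lookup is done by namespace URI and local name, so a document that picked different prefixes would behave the same way.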
These first three specifications were finalized by January 1999, when the last of them (Namespaces in XML) became a Recommendation. Since that time a tremendous effort has been made towards layering additional functionality over these core specifications. As this process evolved, it became clear that many of the specification authors were inferring different things about XML's abstract data model (implied by XML 1.0 + Namespaces) and incompatibilities between specifications began to arise.
At this point, a formal specification was created called the XML Information Set (Infoset) that attempts to codify XML's abstract data model once and for all. Although the Infoset is still a working draft, most of the additional specifications are defined (formally or informally) in terms of its data model. This shields the higher-level specifications from the complexities of XML 1.0 + Namespaces and allows them to focus on the abstract view of an XML document.
There are other dependencies between the higher-level specifications as well. For example, XSLT relies on XPath, XPointer on XML Base (and XPath), and SOAP and XML Query on XML Schema. Understanding the relationships between the different specifications can help you determine where to begin your focus.
XML Specifications
The core XML specifications are XML 1.0 + Namespaces, the Infoset, and various programming APIs.
Concrete Syntax
XML 1.0 + Namespaces
XML's concrete syntax is defined by two specifications: XML 1.0 and Namespaces in XML. Both of these specifications focus heavily on the lexical and binary representation of the information in an XML document as shown here:
<spot type='POINT'>
<x>20</x>
<y>40</y>
</spot>
Specifically, they deal with issues like the syntax for naming elements/attributes (for example, angle brackets); the syntax for structuring information using elements/attributes; the syntax for other XML constructs that model information (for example, processing instructions, comments, etc.); how to compose logical documents from physical entities; what character set is used in the document; how to use different character encodings at serialization time; how to attach binary resources to a document (for example, audio, images, etc.); and much more.
Although it's important to be familiar with XML at this level, most developers find themselves working at higher levels of abstraction. In the next section, we'll walk you through the basic details of the XML 1.0 + Namespaces syntax to help get you started.
XML Information Set
The abstract data model for XML is defined by the Infoset.
The Infoset offers a layer of abstraction that most developers prefer to interface with over XML 1.0 + Namespaces. The Infoset codifies the abstract data model in XML by providing a UML-esque view of well-formed XML documents. Click here to view a UML definition of the Infoset.
The Infoset defines the different constructs that make up an XML document and their relationship to each other. For example, the Infoset specifies that every document must contain a document information item that contains a single child element information item. Each element information item can have zero or more ordered children (of many different types) as well as a set of unordered attributes. It also specifies that each element and attribute contain two pieces of name information: a local name and a namespace URI. See the Infoset's UML as well as the official specification for more details.
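These items can be inspected through an API. The sketch below uses the DOM as one possible view of the Infoset and reports the document element's namespace URI and local name, one of its attributes, and its child elements in document order. The class name InfosetWalk and the urn:example:shapes namespace are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class InfosetWalk {
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // needed to see local names and namespace URIs
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Summarizes the single document element: {namespace URI}local name,
    // its type attribute, and its child elements in document order.
    static String describe(String xml) {
        Element root = parse(xml).getDocumentElement();
        StringBuilder sb = new StringBuilder();
        sb.append('{').append(root.getNamespaceURI()).append('}').append(root.getLocalName());
        sb.append(" type=").append(root.getAttributeNS(null, "type"));
        sb.append(" children=");
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                sb.append(n.getLocalName()).append(' ');
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // prints {urn:example:shapes}spot type=POINT children=x y
        System.out.println(describe(
            "<s:spot xmlns:s='urn:example:shapes' type='POINT'>"
          + "<s:x>20</s:x><s:y>40</s:y></s:spot>"));
    }
}
```

Nothing in the summary mentions the s prefix or the quoting style used in the text. Those are details of the serialization, not of the abstract model.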
There is nothing in the Infoset that deals with XML syntax or serialization details; it's completely abstract. In fact, XML 1.0 + Namespaces is just one of many possible serialization formats for the Infoset. If developers always write code against Infoset abstractions, they'll be able to adopt future serialization formats (think binary) without problems.
In other words, XML 1.0 + Namespaces is a projection of the Infoset onto a bytestream. Likewise, the different XML APIs are projections of the Infoset onto programmatic types and interfaces. The Infoset represents the core of XML. Figure 14 illustrates this relationship.

Figure 14: XML Information Set
There is an isomorphic relationship between the Infoset and serialization formats (for example, XML 1.0) and between the Infoset and APIs (for example, DOM/SAX).
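One practical consequence is that two different byte streams can carry the same Infoset items. The following sketch, again using the DOM as the API projection, reads two serializations that differ only in how they declare the urn:demos namespace and compares the element names they produce. The class and method names are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SameInfoset {
    // Lists every element as {namespace URI}local name, in document order.
    static List<String> names(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            NodeList all = f.newDocumentBuilder()
                            .parse(new InputSource(new StringReader(xml)))
                            .getElementsByTagNameNS("*", "*"); // "*" matches any namespace and any name
            List<String> out = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Node n = all.item(i);
                out.add("{" + n.getNamespaceURI() + "}" + n.getLocalName());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String prefixed  = "<d:foo xmlns:d='urn:demos'><d:bar/></d:foo>";
        String defaulted = "<foo xmlns='urn:demos'><bar></bar></foo>";
        System.out.println(names(prefixed));                          // prints [{urn:demos}foo, {urn:demos}bar]
        System.out.println(names(prefixed).equals(names(defaulted))); // prints true
    }
}
```

Code written against local names and namespace URIs, rather than against prefixes or raw text, is unaffected by which of the two forms it receives.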
Programming XML
XML does not mandate an API.
As previously mentioned, one of the keys to the success of XML is that it doesn't mandate an API--XML can be programmed in a variety of ways. Several XML APIs have been inferred from the Infoset but the two most common are the Simple API for XML (SAX) and the Document Object Model (DOM).
Both of these APIs are projections of the Infoset onto suites of programmatic interfaces. Most software today builds on these interface suites in order to isolate its code from implementation-specific dependencies.
Simple API for XML
SAX offers a streaming interface for Infosets.
First, SAX is different from most of the other specifications we'll discuss in that it wasn't created and isn't currently supported by any standards body. It was collaboratively developed on the XML-DEV mailing list by a group of developers led by David Megginson. It was originally defined in Java but is currently being mapped to other language bindings. For more details on SAX, point your browser to David Megginson's SAX Web site.
SAX offers a streaming interface for working with Infosets. SAX models an XML document through a sequence of method calls. Because of the streaming model, SAX is like a forward-only, read-only cursor. This model makes SAX very resource friendly and efficient, especially when working with large XML documents.
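The "sequence of method calls" can be made visible with a handler that does nothing but record the calls it receives. This is a minimal sketch, assuming the SAX parser that ships with the Java platform (obtained through JAXP's SAXParserFactory) and the DefaultHandler convenience class; the class name SaxEventLog is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventLog {
    // Parses the text and returns the element events in the order the parser delivered them.
    static String events(String xml) {
        final StringBuilder log = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                log.append("start:").append(qName).append(' ');
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                log.append("end:").append(qName).append(' ');
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
        return log.toString().trim();
    }

    public static void main(String[] args) {
        // prints start:spot start:x end:x start:y end:y end:spot
        System.out.println(events("<spot type='POINT'><x>20</x><y>40</y></spot>"));
    }
}
```

Each event is delivered once and in document order. The handler cannot go back to an earlier element, which is what makes the model forward-only and cheap in memory.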
SAX defines a suite of interfaces that model different aspects of an XML document. Click here to view a UML definition of the SAX interfaces. Applications that want to receive an XML document implement the SAX interfaces, while applications that want to send an XML document consume them (see Figure 15).

Figure 15: SAX Programming Model
As Figure 15 illustrates, SAX implies a "push" model, in which the application implements the interfaces and waits for some other application (potentially a SAX parser) to send it a document. This programming model is more resource friendly but at the same time more complex, because it requires the application to implement finite-state machines to keep track of relevant state.
ContentHandler is the main SAX interface that models most of the core Infoset. The following code fragment contains the beginning of a class implementation that implements ContentHandler.
class Emitter implements ContentHandler {
    // CharacterStream stands in for any character sink with a write(String) method
    CharacterStream out;
    public void startElement(String namespaceURI,
            String localName, String rawName,
            org.xml.sax.Attributes atts) {
        // declare the element's namespace as the default namespace
        out.write("<" + localName
            + " xmlns=\"" + namespaceURI + "\">");
    }
    public void endElement(String namespaceURI,
            String localName, String rawName) {
        out.write("</" + localName + ">");
    }
    // other ContentHandler methods omitted...
}
This implementation simply serializes the document back out as XML 1.0 + Namespaces. The following code fragment illustrates how a client might consume this implementation.
void emitDocument(ContentHandler ch) throws SAXException {
    // no attributes on any element, so one empty list is reused
    Attributes atts = new AttributesImpl();
    ch.startElement("urn:demos","foo","d:foo",atts);
    ch.startElement("urn:demos","bar","d:bar",atts);
    ch.startElement("urn:demos","baz","d:baz",atts);
    ch.endElement("urn:demos","baz","d:baz");
    ch.endElement("urn:demos","bar","d:bar");
    ch.endElement("urn:demos","foo","d:foo");
}
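If the previous code were executed, the Emitter class would output a document equivalent to the following. (The Emitter shown above declares urn:demos as the default namespace on each element instead of using the d prefix, and writes an explicit end tag for baz, but the namespace-qualified element names are the same.)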
<d:foo xmlns:d='urn:demos'>
<d:bar>
<d:baz/>
</d:bar>
</d:foo>
As you can see, SAX is simple and straightforward, but fairly low level. It requires your application to keep track of any contextual information that it might need later. See Megginson's Web site for more details.
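As a minimal, runnable sketch of the push model described above, the following Java program uses the JAXP SAX parser that ships with the JDK. The class name SaxDepthDemo and its DepthHandler are hypothetical names chosen for illustration; they show the kind of state (here, the current nesting depth) that a SAX application has to track for itself.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of SAX itself.
class SaxDepthDemo {
    // The handler keeps its own state: SAX does not remember context for us.
    static class DepthHandler extends DefaultHandler {
        int depth = 0;
        int maxDepth = 0;
        final StringBuilder names = new StringBuilder();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            depth++;
            maxDepth = Math.max(maxDepth, depth);
            // Record the expanded name as {namespace URI}local-name.
            names.append('{').append(uri).append('}')
                 .append(localName).append(' ');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
        }
    }

    // Parses the given XML text and returns the handler that received the events.
    static DepthHandler parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // report namespace URI and local name
            SAXParser parser = factory.newSAXParser();
            DepthHandler handler = new DepthHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<d:foo xmlns:d='urn:demos'><d:bar><d:baz/></d:bar></d:foo>";
        DepthHandler h = parse(xml);
        System.out.println("max depth: " + h.maxDepth);
        System.out.println("elements: " + h.names.toString().trim());
    }
}
```

The parser calls startElement and endElement as it reads the stream; nothing is kept in memory beyond what the handler chooses to store.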
Document Object Model
DOM offers a traversal interface for Infosets.
The Document Object Model is a W3C Recommendation that defines another XML API for creating, consuming, and traversing XML documents. DOM is different from SAX in that it offers a "traversal" interface for working with Infosets. DOM models an XML document through a hierarchical graph of typed nodes. For example, consider the following XML document:
<?xml version="1.0"?>
<?order alpha ascending?>
<!-- renaissance art period -->
<period name="Renaissance">
<artist>Leonardo da Vinci</artist>
<artist>Michelangelo</artist>
<artist>Donatello</artist>
</period>
The DOM models this document using the hierarchy of nodes shown in Figure 16. Notice that there is almost a one-to-one mapping between the Infoset and the DOM tree model, aside from a few optimizations.

Figure 16: DOM Tree Model
If SAX is like a fire-hose cursor, DOM is more like a dynamic cursor because it allows applications to traverse and update documents. But this flexibility comes at a price because typical in-memory implementations are more expensive than streaming alternatives.
The DOM defines an interface hierarchy for interacting with the different node types in the tree. The Node interface is the base interface for every node type, which makes it possible to perform basic tree-related tasks. For example, the following code fragment illustrates how to recursively traverse the entire tree using the Node interface.
void traverseTree(Node current) {
// process current node here
for (Node child=current.getFirstChild();
child != null;
child = child.getNextSibling())
traverseTree(child);
}
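To try the traversal on a real tree, the following sketch parses the sample document with the JDK's built-in DOM parser and counts nodes by type using the same first-child/next-sibling recursion. The class name DomTraversalDemo and the count helper are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class built around the traverseTree pattern.
class DomTraversalDemo {
    // The sample document, written without whitespace between elements
    // so that no whitespace-only text nodes appear in the tree.
    static final String SAMPLE =
          "<?order alpha ascending?>"
        + "<!-- renaissance art period -->"
        + "<period name='Renaissance'>"
        + "<artist>Leonardo da Vinci</artist>"
        + "<artist>Michelangelo</artist>"
        + "<artist>Donatello</artist>"
        + "</period>";

    // Recursively counts the nodes of the given type at or below current.
    static int count(Node current, short nodeType) {
        int total = (current.getNodeType() == nodeType) ? 1 : 0;
        for (Node child = current.getFirstChild();
             child != null;
             child = child.getNextSibling()) {
            total += count(child, nodeType);
        }
        return total;
    }

    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("elements: " + count(doc, Node.ELEMENT_NODE));
        System.out.println("text nodes: " + count(doc, Node.TEXT_NODE));
        System.out.println("comments: " + count(doc, Node.COMMENT_NODE));
        System.out.println("processing instructions: "
            + count(doc, Node.PROCESSING_INSTRUCTION_NODE));
    }
}
```

Attributes are not children of their element in the DOM, so the name attribute is not visited by this traversal.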
The DOM also makes it possible to safely generate well-formed XML documents as illustrated by the following example.
void createArtDocument(Document doc) {
Node e,c,t;
e = doc.createElementNS("urn:art:periods",
"a:book");
doc.appendChild(e);
c = doc.createComment(
"renaissance art period");
doc.insertBefore(c, e);
Element el = (Element)e;
el.setAttributeNS("urn:art:periods",
"a:name", "Renaissance");
e = doc.createElementNS("urn:art:periods",
"a:artist");
t = doc.createTextNode("Leonardo");
e.appendChild(t);
doc.getDocumentElement().appendChild(e);
}
The document that the previous example generates could be serialized as follows:
<!--renaissance art period-->
<a:book a:name='Renaissance'
xmlns:a='urn:art:periods'>
<a:artist>Leonardo</a:artist>
</a:book>
The DOM is a very complete and powerful API that most developers choose to use first. However, when developers are forced to work with larger document streams, the overhead of most DOM implementations can be extremely hard to deal with, and SAX becomes a more attractive alternative.
It's important to note that while SAX and DOM offer wildly different programming models, they are both simply different projections of the Infoset onto programmatic types. SAX is better suited for working with large documents where not much context is required (for example, 1-pass searches), while DOM is better suited for working with smaller documents where complete contextual information is desired.
What's Next?
Additional layered specifications simplify working with XML.
You could begin using XML quite effectively using only the specifications and technologies discussed in this section. However, as illustrated in the previous section, the family of XML technologies also includes several higher-level specifications that make XML much easier to work with. The most important of these specifications include XPath, XSLT, XML Schema, and SOAP, all of which are discussed in the next section.
Other XML Specifications
XML is a set of specifications from the W3C that has achieved industry wide acceptance. XML adds type and structure to information.
XML Path Language
XPath defines a uniform syntax for identifying Infoset subsets.
The Infoset defines XML's abstract data model but it stops short of defining a uniform syntax for addressing and identifying subsets of its data model (e.g., give me all the child elements named bob whose id attribute is not id-xyz). The XML Path Language 1.0 (XPath) defines a path-based language for this purpose, which happens to be very intuitive.
For example, consider the following XPath expression:
/foo/*/bar[@id = 'baz']
This expression identifies all of the bar elements that have an id attribute value of 'baz', and that are grandchildren elements of the root foo element. When combined with the DOM/SAX APIs or other higher-level languages like XSLT, XPath can greatly simplify the tedious and mundane tasks of identifying document subsets for processing.
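The expression above can be tried from Java with the javax.xml.xpath package in the JDK. The following is a small sketch, not a definitive implementation; the class name XPathDemo, the countMatches helper, and the sample document (including the element names a and b) are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class for evaluating an XPath expression.
class XPathDemo {
    // Made-up sample: two matching grandchildren, one non-matching
    // grandchild, and one bar that is a child (not grandchild) of foo.
    static final String SAMPLE =
          "<foo>"
        + "<a><bar id='baz'/><bar id='other'/></a>"
        + "<b><bar id='baz'/></b>"
        + "<bar id='baz'/>"
        + "</foo>";

    // Returns how many nodes the expression selects in the given document.
    static int countMatches(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches(SAMPLE, "/foo/*/bar[@id = 'baz']"));
    }
}
```

Only the bar elements two levels below foo with id 'baz' are selected; the bar that sits directly under foo is skipped because the path requires an intermediate element.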
To begin experimenting with XPath, check out our online XPath expression builder as shown in Figure 17 (requires IE5/MSXML 3.0). This tool illustrates how higher-level specifications can completely encapsulate the details of working with the underlying byte-streams. XPath is simple and intuitive enough to figure most of it out as you go.

Figure 17: Online XPath Expression Builder
XML Schema Languages
XML Schemas are used to define an application-specific type system in terms of the Infoset.
The Infoset defines the structural types that make up an XML document. The Infoset type system, however, is not extensible. The XML Schema definition language (XSD) offers a meta-language for layering application types over the Infoset. Figure 18 illustrates how an application type definition can be expressed using XSD.

Figure 18: C struct to schema definition
This kind of type information makes it possible to perform validation that not only constrains the structure of an XML document but also the type's value space. Henry Thompson, one of the editors of the XSD specification, has provided a Web-based XSD validator that you can use as you begin investigating the language. There is also a command-line version of the validator that you can install locally. As you begin using XSD, you'll realize that it's far more sophisticated than traditional DTDs.
XSD also facilitates more powerful programming concepts, including dynamic type discovery and code generation. If enough XML type information is available to our applications, the infrastructure we build can automate mapping application-specific types to and from XML documents. Tools are already emerging in both the Windows and Java spaces for generating code and dynamic objects based on schema definitions, and the reverse is also true.
Today, XSD is a Candidate Recommendation, which is broken into two parts: structures and datatypes. Because these specifications are so large and complex, the working group has also provided an overview of the language full of examples that is much easier to digest.
Most major tool vendors are scrambling to add support for XSD in their products. As an example, the new .NET architecture from Microsoft leverages XSD as the standard for type information. There are many other XML systems and tools that are also in the process of adding support for XSD (for example, Apache, Oracle, etc.). If you pay attention to the XML community, you're sure to hear more about future advances.
XSL Transformations
XSLT is a language for defining transformations between schemas.
Because developers will never agree on one true schema for a specific application domain, interoperability can still be a problem even when using XML. XML, however, provides a solution to its own interoperability problem by including the XSL Transformations language (XSLT).
XSLT is an XML vocabulary that happens to have the semantics of a declarative programming language. In its simplest form, XSLT is similar to the exemplar paradigm used by both ASP and JSP. But at its core, XSLT is more like functional programming languages such as Lisp or Scheme. XSLT makes it possible to combine static content with dynamic instructions that use XPath expressions to address the input document during the transformation process. This makes XSLT perfect for translating XML documents into any other type of text-based output.
For example, the following XSLT program translates an XML document containing contact information into an HTML document.
<html
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xsl:version='1.0'>
<body>
<h1>Contact Information</h1>
<table>
<xsl:for-each select='/contacts/contact'>
<tr>
<td>
<xsl:value-of select='name'/>
</td>
<td>
<xsl:value-of select='phone'/>
</td>
</tr>
</xsl:for-each>
</table>
</body>
</html>
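A transformation like this can be run from Java through the javax.xml.transform API in the JDK. The sketch below is an assumption-laden illustration rather than the tutorial's own tool: the class name XsltDemo is hypothetical, the contact data is made up, and the stylesheet is a reduced variant that produces plain text instead of HTML so the result is easy to inspect.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class for applying an XSLT 1.0 stylesheet.
class XsltDemo {
    // Reduced stylesheet: same for-each/value-of pattern, text output.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='/contacts/contact'>"
        + "<xsl:value-of select='name'/>"
        + "<xsl:text>: </xsl:text>"
        + "<xsl:value-of select='phone'/>"
        + "<xsl:text>;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Made-up input document.
    static final String CONTACTS =
          "<contacts>"
        + "<contact><name>Ann</name><phone>555-0100</phone></contact>"
        + "<contact><name>Bob</name><phone>555-0101</phone></contact>"
        + "</contacts>";

    // Applies the stylesheet to the XML text and returns the result as a string.
    static String transform(String stylesheet, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)),
                new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(STYLESHEET, CONTACTS));
    }
}
```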
To begin investigating XSLT, check out our online interactive tool (requires IE5/MSXML 3.0). Also, for a complete and thorough reference for both XPath/XSLT, see Michael Kay's XSLT Programmer's Reference.
XML Messaging
SOAP is an XML messaging specification.
As previously stated, XML is commonly used in distributed applications to transmit information between systems. Messaging systems in general benefit from using XML as a message format because it opens up the doors to any other system that supports XML. Instead of trying to make your system support every platform or language combination, you can simply define XML interfaces to your system and rest assured that any other system in the world (that understands XML) can integrate with yours.
Over the past few years many developers have defined custom message vocabularies for accessing remote services and data. These message vocabularies were designed to replace common remote procedure call (RPC) mechanisms and component technologies (for example, DCOM, IIOP, etc.), as well as other platform specific technologies. While this did open the doors to outside systems, every system did it differently, making it difficult to integrate a wide range of systems and reuse code.
The Simple Object Access Protocol (SOAP) is a messaging protocol that codifies this practice once and for all. The benefit of standardizing on a messaging protocol is the future promise of improved infrastructure that offers complete interoperability between disparate systems in a transparent manner.
SOAP is a very simple protocol built on existing technologies (XML and HTTP) that are completely language, platform, and vendor neutral. SOAP essentially defines two things: 1) a message serialization format based on XSD, and 2) a mapping to a specific transport protocol (for example, HTTP).
For example, suppose a Web service exists for adding up two Points. The following represents the SOAP request for invoking the service:
<?xml version='1.0' ?>
<SOAP-ENV:Envelope
xmlns:SOAP-ENV='uri-for-soap' >
<SOAP-ENV:Body>
<m:AddRequest xmlns:m='uri-add-points'>
<pt1>
<x>10</x>
<y>20</y>
</pt1>
<pt2>
<x>100</x>
<y>200</y>
</pt2>
</m:AddRequest>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
And the following represents the SOAP response returned by the service:
<SOAP-ENV:Envelope
xmlns:SOAP-ENV='uri-for-soap'>
<SOAP-ENV:Body>
<pt:AddResponse xmlns:pt='uri-add-points'>
<ptret>
<x>110</x>
<y>220</y>
</ptret>
</pt:AddResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
SOAP can be used to invoke any service on any platform. Any system that supports XML and the appropriate transport protocol (for example, HTTP) can leverage SOAP to make its services available to the rest of the world.
For more information on SOAP, check out the current work of the W3C SOAP Working Group, DevelopMentor's SOAP Web site, and the SOAP mailing list. You can also begin experimenting with SOAP by investigating some of these live SOAP demos, the Microsoft SOAP Toolkit, and the Apache SOAP4J Java project.
SOAP is one of the first significant applications of XML that distributed application developers can appreciate. SOAP will continue to become a more integral and transparent part of future operating systems, component technologies, and development environments. Microsoft's new .NET architecture is one of the best illustrations of this, but most other major vendors are pursuing similar goals.
How Do I Get Up to Speed?
Developers get up to speed by starting with the basics.
The XML technologies that we've covered up to this point are what developers should focus on today. We've only given you a brief introduction to each of these technologies and have in no way provided a thorough description. So where do you go from here?
As you've learned, the more powerful specifications like XML Schema and SOAP rely heavily on many other layered specifications. To understand these specifications, you must understand the underlying specifications thoroughly (for example, XML 1.0 + Namespaces and the Infoset). Therefore, we recommend starting with the basics and working through the specifications in a layered fashion. As you do, you'll become more and more confident in your ability to leverage XML to its fullest potential.
To help you get started, we've provided a brief introduction to the syntax of XML 1.0 + Namespaces, which makes up the remainder of this tutorial. From that point, the rest is up to you. See the list of XML references at the end of the tutorial for additional resources and reading material on all of these topics.
XML Element Syntax
XML has a regular, easy to parse, text-based serialization syntax. It is this serialization format that has fostered the widespread adoption of XML.
Well Formed
All XML documents are well formed.
By definition all XML documents must be well formed. A well-formed XML document is one that follows all of the syntactical rules described throughout the remainder of these sections. If a document contains an error, it's no longer considered an "XML document," but just a bunch of characters mixed together.
Unlike HTML, XML syntax is very strict and must be followed exactly. We've provided a simple online tool (requires IE5/MSXML) for checking whether a document is well formed. Feel free to use this tool to test your understanding of these various concepts as you move through the rest of the tutorial.
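Well-formedness can also be checked locally. The following sketch relies on the fact that the JDK's DOM parser raises a SAXException when it meets a document that is not well formed; the class name WellFormedDemo and the isWellFormed helper are hypothetical names used for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class for a well-formedness check.
class WellFormedDemo {
    // Returns true if the text parses as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (Exception e) {
            throw new IllegalStateException("could not run the parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b></b></a>")); // properly nested
        System.out.println(isWellFormed("<a><b></a></b>")); // overlapping tags
        System.out.println(isWellFormed("<A></a>"));        // case mismatch
    }
}
```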
Start and End Tags
Elements are used to model structured data and are encoded using start and end tags.
Elements typically make up the majority of the content of an XML document. Elements can have children. These children might themselves be elements or might be processing instructions, comments, CDATA sections, or characters. The children of an element are ordered.
Elements are serialized as a pair of tags: an open tag and a close tag. The syntax for an open tag is the tag name enclosed in angle brackets, for example, <sometagname>. The syntax for a close tag is similar, the only difference being that a forward slash precedes the tag name, for example, </sometagname>.
The children of an element are serialized between the open and close tags of their parent. For example, the following is an element called 'someelement' with a single child element called 'somechild' that has no children of its own: <someelement><somechild></somechild></someelement>.
In cases like that of 'somechild', where the element has no children, the element is said to be 'empty'. A shorthand syntax can be used for empty elements. The syntax is similar to that for the open tag, the only difference being that a forward slash precedes the closing angle bracket. The previous example can be amended to use the shorthand syntax thusly: <someelement><somechild/></someelement>.
Tag Names
XML does not provide any predefined element names; however, it defines some rules for the generation of element names.
XML does not have a fixed vocabulary of predefined element names, rather it allows vocabularies of elements to be invented. When designing new vocabularies, element names that convey some semantic meaning to human readers should be used. While the majority of XML documents will be generated and consumed by machines, having element names that mean something to humans aids in acceptance and helps during the development phase, especially when debugging.
Element names in XML are case sensitive and must begin with a letter or an underscore (_). The initial character can be followed by any number of letters, digits, periods (.), hyphens (-), underscores (_), or colons (:).
The following elements all have valid names:
<foo />
<_foo />
<foo_bar />
<foo_bar-quux />
<_foo-bar />
<foo.bar />
<foo.bar.quux />
<foo1 />
<foo_2 />
The following elements all have invalid names:
<-foo /> <!-- begins with a hyphen -->
<1foo /> <!-- begins with a digit -->
<.foo /> <!-- begins with a period -->
<:foo /> <!-- begins with a colon -->
<?foo /> <!-- begins with a question mark -->
<foo? /> <!-- contains a question mark -->
<foo* /> <!-- contains an asterisk -->
Element names beginning with the three letters x, m, and l are reserved by the XML specification for future use. Variations in capitalization are also reserved.
The following element names are therefore reserved and should not be used:
<xmlfoo />
<XMLfoo />
<XmlFoo />
Namespaces and Namespace Declarations
In order to distinguish between element names in different XML vocabularies, the Namespaces in XML Recommendation provides rules for defining and using namespaces to disambiguate names in XML documents.
Because XML was designed to allow creation of new vocabularies, the possibility of name collision is quite high. This makes it difficult for software (and humans) to know deterministically how to process a given XML document. The question that has to be asked is "Is this XML really in the form I expect or is it just coincidence?"
Consider a (contrived) example: Software developer A decides to model a person in XML and needs to store the name, age, and height of the person, and to use element names of name, age, and height for those data items. An example instance of such a person would look like this:
<Person>
<name>Martin</name>
<age>32</age>
<height>64</height>
</Person>
Software developer B decides to model a person in XML and needs to store the same data items and settles on the same element names. However, when describing the same person as in the previous example, B's instance looks like this:
<Person>
<name>Martin</name>
<age>20</age>
<height>162</height>
</Person>
The element names are the same, but in the case of age and height the values contained between the open and close tags are different. The reason is that A used base 10 for age and gave height in inches, whereas B used base 16 for age and gave height in centimeters. Unfortunately, given a Person instance with an age of 28 and a height of 68, there is no deterministic way to know whether it represents a very short person who is 40 years old or a 5'8" person in their late 20s. Heuristically a human or software program can decide that the latter case is more likely, but this is not a reliable way to exchange information. Fortunately, XML namespaces provide a way to deterministically distinguish between different XML vocabularies.
Using namespaces in XML involves associating element names with a Uniform Resource Identifier (URI). This URI serves as a unique string and forms the namespace name. The namespace name acts as a scope for all elements that are associated with the namespace. An element is associated with a namespace through a combination of a namespace declaration and a namespace prefix. The namespace declaration defines the prefix that represents the namespace URI. The prefix is then added to the front of the element name. Namespace declarations take the form 'xmlns:prefix="URI"' and appear inside the element start tag. For example, the following XML shows a Person element in the 'http://example.org/People' namespace.
<pp:Person
xmlns:pp='http://example.org/People'></pp:Person>
The prefix in the previous example is 'pp'. Note that the closing tag must use the same name, including the prefix, as the opening tag. Note also that the prefix used is totally arbitrary and so the following example is semantically equivalent to the previous one.
<pre:Person
xmlns:pre='http://example.org/People'/>
The name of namespace-qualified elements is made up of two parts: the prefix and the local name. For the previous example, the prefix is 'pre' and the local name is 'Person'. This prefix:local name construction is known as a Qualified Name, or more commonly a QName. Each part of a QName is a non-colonized name, or NCName, meaning a name without a colon in it. Because of the use of colons to separate the namespace prefix from the local name, element names containing colons should be avoided.
It is possible to omit the prefix and the colon, yet still associate an element with a namespace. Providing a default namespace declaration accomplishes this. This takes the form 'xmlns="URI"'. For example,
<Person xmlns='http://example.org/People'/>
defines an element that is semantically equivalent to the previous two examples.
Scope of Namespace Declarations
Namespace declarations come into scope at the defining element and apply to all descendants unless overridden by a namespace declaration on a descendant.
All namespace declarations have a scope (that is a set of elements to which they apply). A namespace declaration is in scope for the element it is declared at and all of that element's descendants. Given the following example,
<pre:People
xmlns:pre='http://example.org/People'>
<pre:Person>
<d:age
xmlns:d='urn:base10'>28</d:age>
<imp:height
xmlns:imp='urn:imperial'>68</imp:height>
</pre:Person>
</pre:People>
both the People and Person elements are in the 'http://example.org/People' namespace, while the age and height elements are in the namespaces 'urn:base10' and 'urn:imperial', respectively.
The in-scope mapping of a prefix to a namespace URI can be overridden by a new mapping on a descendant element. The previous example could be rewritten thusly,
<pre:People
xmlns:pre='http://example.org/People'>
<pre:Person>
<pre:age
xmlns:pre='urn:base10'>28</pre:age>
<pre:height
xmlns:pre='urn:imperial'>68</pre:height>
</pre:Person>
</pre:People>
Note that all the QNames in the aforementioned example begin with the prefix "pre" but the namespace URIs for the elements are exactly as they were for the previous example.
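The equivalence of the two documents can be checked with a namespace-aware DOM parse. In the following sketch the class name NamespaceScopeDemo and the expandedNames helper are hypothetical; the helper lists every element as {namespace URI}local-name, ignoring prefixes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class comparing the two People documents.
class NamespaceScopeDemo {
    // First form: a distinct prefix for each namespace.
    static final String DISTINCT_PREFIXES =
          "<pre:People xmlns:pre='http://example.org/People'><pre:Person>"
        + "<d:age xmlns:d='urn:base10'>28</d:age>"
        + "<imp:height xmlns:imp='urn:imperial'>68</imp:height>"
        + "</pre:Person></pre:People>";

    // Second form: the prefix pre is rebound on descendant elements.
    static final String REBOUND_PREFIX =
          "<pre:People xmlns:pre='http://example.org/People'><pre:Person>"
        + "<pre:age xmlns:pre='urn:base10'>28</pre:age>"
        + "<pre:height xmlns:pre='urn:imperial'>68</pre:height>"
        + "</pre:Person></pre:People>";

    // Lists all elements in document order as {namespace URI}local-name.
    static String expandedNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace lookups
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagNameNS("*", "*");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < all.getLength(); i++) {
                Node n = all.item(i);
                if (i > 0) sb.append(' ');
                sb.append('{').append(n.getNamespaceURI()).append('}')
                  .append(n.getLocalName());
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expandedNames(DISTINCT_PREFIXES));
        System.out.println(expandedNames(REBOUND_PREFIX));
    }
}
```

Both calls report the same expanded names, which is why software should compare namespace URIs and local names rather than prefixes.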
Elements that are not associated with a namespace, because they have no prefix and no default namespace declaration is in scope, are said to be in "no namespace". The namespace URI with which such elements are associated is the empty string, "".
XML Attribute Syntax
XML elements can be annotated with attributes. Attributes are name/value pairs where the value contains text-only data.
Attributes
Elements can be annotated with name/value pairs known as attributes. Attributes are serialized inside the element start tag.
Elements can be annotated with attributes. Attributes are typically used to encode metadata; that is they provide extra information about the content of the element they appear on. The attributes for a given element are serialized inside the start tag for that element. Attributes appear as name/value pairs separated by an equals sign (=).
Attribute names follow the same construction rules as element names. Attribute values are textual in nature and must appear either in single quotes (') or double quotes ("). Attribute values may not contain literal less-than (<) or ampersand (&) characters.
If the attribute value is enclosed in single quotes, the single quote character literal (') may not appear in the attribute value. If the attribute value is enclosed in double quotes, the double quote character literal ( " ) may not appear in the attribute value.
Here are some examples of attributes:
<age base='16'>20</age>
<age base='10'>32</age>
<height units='inches'>68</height>
<height units='cm'>140</height>
Attributes and XML Namespaces
Attribute names, like element names, are defined by the designer of a given XML vocabulary. Consequently they might need disambiguation with respect to namespace.
The author of an XML vocabulary chooses attribute names in the same way as element names are chosen so the potential for collision between attribute names exists just as it does for elements. Attributes can be associated with a namespace in the same way as elements, by specifying a prefix to the attribute name and mapping that prefix to a namespace URI.
The only difference between elements and attributes with respect to namespaces is that the default namespace declaration does not apply to attributes. So attributes without prefixes are always in 'no namespace' even if a default namespace declaration is in scope.
The following example shows namespace-qualified attributes:
<Person xmlns='http://example.org/People'
xmlns:b='http://example.org/People/base'
xmlns:u='http://example.org/units' >
<name>Martin</name>
<age b:base='10'>32</age>
<height u:units='inches'>64</height>
</Person>
The following example shows attributes that are in no namespace:
<Person xmlns='http://example.org/People'>
<name>Martin</name>
<age base='10'>32</age>
<height units='inches'>64</height>
</Person>
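The rule that default namespace declarations do not apply to attributes can be observed through the DOM. The sketch below combines a prefixed and an unprefixed attribute in one document; the class name AttributeNamespaceDemo and the attributeNamespace helper are hypothetical names introduced for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class for attribute namespaces.
class AttributeNamespaceDemo {
    static final String PEOPLE_NS = "http://example.org/People";

    // One unprefixed attribute (base) and one prefixed attribute (u:units).
    static final String SAMPLE =
          "<Person xmlns='http://example.org/People'"
        + " xmlns:u='http://example.org/units'>"
        + "<age base='10'>32</age>"
        + "<height u:units='inches'>64</height>"
        + "</Person>";

    // Returns the namespace URI of the named attribute on the first element
    // with the given local name in the People namespace (null = no namespace).
    static String attributeNamespace(String xml, String elementLocalName,
                                     String attributeQName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element element = (Element) doc
                .getElementsByTagNameNS(PEOPLE_NS, elementLocalName).item(0);
            Attr attr = element.getAttributeNode(attributeQName);
            return attr.getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalStateException("lookup failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attributeNamespace(SAMPLE, "height", "u:units"));
        System.out.println(attributeNamespace(SAMPLE, "age", "base"));
    }
}
```

The age element itself is in the People namespace through the default declaration, yet its unprefixed base attribute reports no namespace (null in the DOM).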
XML Text Syntax
Certain character literals are illegal inside element and attribute content. XML provides several standard character entities for encoding these characters along with character references and CDATA sections.
Prohibited Character Literals
Five character literals (<, >, &, ', and ") have certain limitations in terms of where they can legally appear in an XML document.
Certain characters cause problems when used as element content or inside attribute values. Specifically, the less-than character (<) cannot appear either as a child of an element or inside an attribute value as it is interpreted as the start of an element. The same restrictions apply to the ampersand character (&) although for different reasons. If the less-than (<) or ampersand (&) characters need to be encoded as element children or inside an attribute value then a character entity must be used.
Entities begin with the ampersand character (&) and end with a semicolon (;). Between the two is where the name of the entity appears.
XML defines entities for the less-than character (<) and the ampersand character (&). The entity for the less-than character (<) is &lt; while the entity for the ampersand character (&) is &amp;.
The following example shows how to encode an ampersand into the children of an element.
<IceCream>
<name>Cherry Garcia</name>
<manufacturer>Ben &amp; Jerry</manufacturer>
</IceCream>
The resulting XML when displayed or parsed would look like this:
<IceCream>
<name>Cherry Garcia</name>
<manufacturer>Ben & Jerry</manufacturer>
</IceCream>
The apostrophe (') and quote (") characters might also need to be encoded as entities when used in attribute values. If the delimiter for the attribute value is the apostrophe, the quote character (") is legal but the apostrophe character (') is not, as it would signal the end of the attribute value. If an apostrophe is needed, the character entity &apos; should be used; likewise, &quot; stands in for the quote character inside a value delimited by quotes. For example:
<sayhello word='&apos;Hi&apos;' />
would result in the following being displayed or parsed:
<sayhello word=''Hi'' />
A fifth character entity, &gt;, is also provided for the greater-than character (>). While strictly speaking such characters seldom need to be escaped, many people prefer to escape them for consistency with the less-than character (<). For example, someone wanting to show what an XML element looks like in the text content of an XML document could write &lt;someelement/>, but many people prefer to escape the greater-than sign (>) as well, thus: &lt;someelement/&gt;. Both examples would produce the following textual content: <someelement/>. An example of when the greater-than character (>) must be escaped is shown in the discussion of CDATA sections.
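The five entities can be applied mechanically when generating XML from code. The following is a minimal sketch, assuming text is escaped character by character before being placed in element content or attribute values; the class name XmlEscapeDemo and both helpers are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class for the five predefined entities.
class XmlEscapeDemo {
    // Replaces <, >, &, ' and " with their predefined entity references.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '\'': sb.append("&apos;"); break;
                case '"':  sb.append("&quot;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Parses a one-element document and returns its text as the parser sees it.
    static String parsedText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String escaped = escape("Ben & Jerry");
        System.out.println(escaped);
        // The parser expands the entity back into the original character.
        System.out.println(
            parsedText("<manufacturer>" + escaped + "</manufacturer>"));
    }
}
```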
CDATA Sections
CDATA sections allow markup characters to appear as literals without being interpreted as markup.
Using entities in place of less-than (<), greater-than (>), ampersand (&), apostrophe ('), and quote (") characters can become tedious and error-prone if a significant number of those characters appear in textual data. XML provides a construct called a CDATA section that allows such characters to appear as literals.
A CDATA section begins with the character sequence <![CDATA[ and ends with the character sequence ]]>. Between the two character sequences an XML processor ignores all markup characters such as less-than (<), greater-than (>), and ampersand (&). The only markup an XML processor recognizes inside a CDATA section is the closing character sequence (]]>). The following XML shows a CDATA section containing literal less-than (<), greater-than (>), ampersand (&), apostrophe (') and quote (") characters.
<sometext>
<![CDATA[ They're saying "x < y" &
that "z > y" so I guess that means that
z > x ]]>
</sometext>
Note that characters such as less-than (<) and ampersand (&) appear as literals in the textual content.
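To see what an application receives from a CDATA section, the following sketch parses a one-line variant of the example with the JDK's DOM parser and reads the element's text. The class name CdataDemo and the textOf helper are hypothetical names for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class for reading CDATA content.
class CdataDemo {
    // One-line variant of the sometext example.
    static final String SAMPLE =
        "<sometext><![CDATA[They're saying \"x < y\" & that \"z > y\"]]></sometext>";

    // Returns the text content of the document element.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // The markup characters arrive as ordinary text, with no escaping.
        System.out.println(textOf(SAMPLE));
    }
}
```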
In order to make XML compatible with SGML, the character sequence that ends a CDATA section (]]>) must not appear inside element content. Instead the closing greater-than character (>) must be escaped using the appropriate entity (&gt;). For example, the following is illegal in XML:
<badtext>CDATA sections end with
the ]]> character sequence</badtext>
Rather it would need to be written thusly:
<badtext>CDATA sections end with
the ]]&gt; character sequence</badtext>
The close CDATA character sequence (]]>) can appear in attribute content.
CDATA sections cannot be nested so the following markup,
<nestedtext>
<![CDATA[Here is some text with <![CDATA[
a nested CDATA ]]> section ]]>
</nestedtext>
will not parse as the second CDATA section start sequence (<![CDATA[) is not interpreted as markup. Consequently the first CDATA section end sequence (]]>) is interpreted as closing the first (and only) CDATA section and the second end sequence is interpreted as illegal character data.
Comments
Comments are used to communicate information to humans about the content of an XML document. They often contain documentation about the structure or content of the XML they are found in.
XML also supports comments that are used to provide information to humans about the actual XML content. They are not used to encode actual data.
Comments can appear as children of elements or of the document. They begin with <!-- and are terminated with -->. Textual data is serialized between the two constructs. For compatibility with SGML the character sequence -- cannot appear inside a comment. Other markup characters such as less-than (<), greater-than (>) and ampersand (&) can appear inside comments, but are not treated as markup. If the textual content of the comment ends with the hyphen (-) character, there must be some whitespace (space, tab, carriage-return, or line-feed) between the hyphen and the close comment character sequence (-->).
The following are all valid comments:
<!-- This is a comment about how to
open (<![CDATA[) and close (]]>)
CDATA sections -->
<!-- I really like having elements
called <fred> in my markup -->
<!-- Comments can contain all sorts of
character literals including &, <, >,
', and ". Note that if entities are
used inside comments (&lt; for example)
they are not expanded. -->
The following are all invalid comments:
<!-- Don't put -- inside a comment -->
<!-- don't allow a comment to end
with a hyphen --->
<!-- You can't <!-- nest --> comments -->
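To check the comment rules mechanically, each comment can be wrapped in an element and handed to a parser. This is a sketch under the assumption that Python's standard xml.etree.ElementTree module is available; the helper name accepts and the wrapper element root are hypothetical, and the valid samples are shortened versions of the ones above.

```python
import xml.etree.ElementTree as ET

def accepts(comment):
    """Return True if a document containing the comment is well-formed."""
    try:
        ET.fromstring("<root>" + comment + "</root>")
        return True
    except ET.ParseError:
        return False

valid = [
    "<!-- open (<![CDATA[) and close (]]>) CDATA sections -->",
    "<!-- elements called <fred> in my markup -->",
    "<!-- literals such as &, <, >, ', \" and &lt; -->",
]

invalid = [
    "<!-- Don't put -- inside a comment -->",
    "<!-- don't allow a comment to end with a hyphen --->",
    "<!-- You can't <!-- nest --> comments -->",
]

for comment in valid + invalid:
    print(accepts(comment), comment)
```

All three invalid samples fail for the same underlying reason: the sequence -- appears somewhere other than immediately before the closing >.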
XML Processing Instructions
Processing instructions in XML provide a way to pass processing hints through to an application. They contain textual data.
Processing Instructions
Processing instructions are instructions to the application that is using the XML processor.
Processing instructions are used in XML to pass processing information through to the application. XML does not define any standard processing instructions; applications are free to define their own. It is the application, not the XML parser, that determines how to interpret a given processing instruction.
Processing instructions are composed of two parts: the target or name of the processing instruction and the data or information. The target is preceded by the character sequence <?. The target is followed by whitespace (any number of space, tab, carriage-return, or line-feed characters) then the data portion of the processing instruction. The data portion is textual and can contain whitespace. The processing instruction is terminated with the character sequence ?>.
Apart from the termination character sequence (?>) all markup is ignored in processing instruction content. Processing instructions defined by organizations other than the W3C cannot have targets that begin with the character sequence xml or any recapitalization thereof.
The following are all valid processing instructions:
<?display table-view?>
<?sort alpha-ascending?>
<?textinfo whitespace is allowed ?>
<?elementnames <fred>, <bert>, <harry> ?>
Processing instructions can appear as children of elements. They can also appear as top-level constructs (children of the document) either before or after the document element.
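The split between target and data, and the two places a processing instruction can appear, can be seen by walking a parsed document. The sketch below assumes Python's standard xml.dom.minidom module; the helper name collect and the element name root are hypothetical.

```python
from xml.dom import minidom, Node

# One processing instruction before the document element, two inside it.
doc = minidom.parseString(
    "<?display table-view?>"
    "<root>"
    "<?sort alpha-ascending?>"
    "<?elementnames <fred>, <bert>, <harry> ?>"
    "</root>"
)

def collect(node, found):
    """Gather (target, data) pairs for every processing instruction under node."""
    for child in node.childNodes:
        if child.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            found.append((child.target, child.data))
        collect(child, found)
    return found

pis = collect(doc, [])
for target, data in pis:
    print(target, "|", data)
```

The parser reports the target and data as two strings and does nothing further with them. Note that the data for elementnames still contains the angle brackets, since markup is not recognized inside a processing instruction.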
Processing Instructions and Namespaces
Processing instructions are not covered by XML namespaces. Consequently there is the possibility of collision of processing instruction targets.
Processing instructions are not covered by the Namespaces in XML recommendation, so the target portion of a processing instruction is not part of any in-scope namespace. This raises the possibility of collision among processing instruction targets between applications. Namespace-qualified elements can be used instead of processing instructions to avoid such collisions.
XML Declaration Syntax
XML documents can begin with an optional XML declaration. The XML declaration specifies the version of XML being used along with the character encoding and whether the document is self-contained or not.
The XML Declaration
The XML declaration specifies the version of XML being used in the document. It can also specify the character encoding used and whether the document is a standalone document.
XML documents can contain an XML declaration, which, if present, must be the first construct in the document. The XML declaration begins with the character sequence <?xml version='1.0'. Quotes (") can be used instead of apostrophes (') around the version number. The only version number supported is 1.0, and if an XML declaration is present in an XML document, the version information must appear.
While the version information is mandatory, the other two items that can appear in the XML declaration are optional. The character encoding used for the document content can be specified through the encoding='encoding goes here' construct.
The XML recommendation defines several possible values for the encoding: UTF-8, UTF-16, ISO-10646-UCS-2, and ISO-10646-UCS-4 all refer to Unicode/ISO 10646 encodings, while ISO-8859-1, ISO-8859-2, etc. refer to 8-bit Latin character encodings. Encodings for other character sets including Chinese, Japanese and Korean (CJK) are also supported. It is recommended that encodings be referred to using the encoding names registered with the Internet Assigned Numbers Authority (IANA).
The following are all valid XML declarations that specify a variety of character encodings:
<?xml version='1.0' encoding='US-ASCII' ?>
<?xml version='1.0' encoding='UTF-8' ?>
<?xml version='1.0' encoding='UTF-16' ?>
<?xml version='1.0' encoding='ISO-10646-UCS-2' ?>
<?xml version='1.0' encoding='ISO-10646-UCS-4' ?>
<?xml version='1.0' encoding='ISO-8859-1' ?>
<?xml version='1.0' encoding='ISO-8859-2' ?>
<?xml version='1.0' encoding='Shift_JIS' ?>
An XML document can use only one encoding; it is not possible to 'redefine' the encoding part-way through. The XML declaration is mandatory if the encoding of the document is anything other than UTF-8 or UTF-16. In practice, this means documents encoded using US-ASCII can also omit the XML declaration, as US-ASCII overlaps entirely with UTF-8.
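The effect of the encoding declaration can be demonstrated with a document that contains one non-ASCII character. This is a minimal sketch assuming Python's standard xml.etree.ElementTree module; the element name and its content are invented for the example.

```python
import xml.etree.ElementTree as ET

# A small document containing one non-ASCII character (e with acute accent).
body = "<name>caf\u00e9</name>"

# Same bytes on disk, with and without a declaration naming the encoding.
declared = ("<?xml version='1.0' encoding='ISO-8859-1'?>" + body).encode("iso-8859-1")
undeclared = body.encode("iso-8859-1")

declared_text = ET.fromstring(declared).text
print(declared_text)

try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError as err:
    # With no declaration the parser assumes UTF-8, and the byte 0xE9
    # on its own is not a valid UTF-8 sequence.
    undeclared_ok = False
    print("rejected:", err)

# Pure US-ASCII content needs no declaration: the bytes are also valid UTF-8.
ascii_text = ET.fromstring("<name>cafe</name>".encode("ascii")).text
print(ascii_text)
```

The bytes in the first two cases are identical apart from the declaration, so the declaration is the only thing telling the parser how to turn them into characters.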
If an XML document can be read with no reference to external sources, it is said to be a standalone document. Such documents can be annotated with standalone='yes' in the XML declaration. If an XML document requires the resolution of external sources in order to parse correctly or construct the entire data tree, it is not standalone. Such documents can be marked standalone='no', but as this is the default, such an annotation rarely appears in XML documents.
Conclusion
Where Are We?
You've just finished your introduction to the world of XML.
This tutorial has provided you with a complete introduction to the world of XML. You have looked at the history of XML, the core XML specifications/technologies, the most important layered specifications, and the basic XML 1.0 + Namespaces syntax. At this point you should feel comfortable with what XML is, when you would use it, and the basics of where to begin.
We realize that you're not going to walk away from this tutorial having become experts in XML technology. But, hopefully, at this point you have a clearer understanding of what XML means to distributed application developers.
Where Do You Go from Here?
XML requires personal study and hands-on experience.
XML, like any other technology you've encountered, requires personal study and hands-on experience. We encourage you at this point to get hold of an XML processor for your favorite programming language and begin experimenting with it. If you're a Windows developer, you probably want to start with Microsoft's current MSXML processor. If you're a Java developer, you might want to start with Xerces from the Apache Software Foundation. If you're neither of these, check out www.xmlsoftware.com for everything else that is currently available.
You should also pick up a book or two on the areas of XML that most interest you. Essential XML provides an introduction to XML targeted specifically towards distributed application developers. There are several other good XML books and articles listed in the references section.
If you're looking for fast-track, in-depth XML courses, DevelopMentor, Inc.
offers Essential XML and
Guerrilla XML courses. Essential XML is four days of hands-on training
(banker's hours, 9-5), while Guerrilla XML is an intense five-day course that
goes from 9am-9pm, making it possible to cover a wider range of topics in more
depth. Several instructors from DevelopMentor have personal
sites that offer plenty of sample code and other resources.
To keep up with what's going on in the XML space, you should do one or more
of the following: 1) subscribe to the XML-DEV
mailing list, 2) check out xmlhack.com
regularly, 3) visit xml.com often, and 4) keep
posted on the activities of the W3C.
XML is an exciting area to be part of right now...good luck!
Glossary
abstraction
a powerful programming paradigm used by developers to identify the
distinguishing characteristics of a class or object in a way that encapsulates
complex and useless detail
ANSI
American National Standards Institute
attribute
a name/value pair associated with an element (e.g., in <foo a='b'/>,
'a' is an attribute)
CDATA section
A CDATA section begins with the character sequence <![CDATA[ and ends with
the character sequence ]]>. Between the two character sequences an XML
processor ignores all markup characters.
CERN
European Laboratory for Particle Physics
character encoding
an algorithm for mapping a sequence of character codes to a byte stream
character set
a mapping between characters (glyphs) and character codes (numbers)
document
a logical unit of information
DOM
Document Object Model: a traversal API for XML documents
DTD
Document Type Definition: an SGML-ism carried into XML for constraining the
physical structure of a given document type
element
The main building block of an XML document. Elements consist of a start tag,
the element content, and an end tag (e.g., <foo>bar</foo>)
exemplar
an example document template
GCA
Graphic Communications Association
GML
Generalized Markup Language: the first "general" markup language
that separated content from presentation - created by Goldfarb et al. at IBM in
the late 1960s
HTML
Hypertext Markup Language: an SGML-like document type that is used on the
World Wide Web to describe documents and links between documents
HTTP
Hypertext Transfer Protocol: an application-level protocol used to request
resources (via a URI) on the Web
ISO
International Organization for Standardization
lexical
character sequence
local name
The name of an element/attribute excluding the namespace prefix (e.g., in
<f:foo xmlns:f='uri-foo'/>, 'foo' is the local name)
markup language
a set of annotations that can be used in text to provide additional metadata
messaging
the process of sending messages between systems
metadata
data about data (self-describing data)
namespace
a unique set of names
namespace declaration
A special construct used to associate a namespace URI with a namespace prefix
or to declare a default namespace URI for unqualified elements (e.g., <f:foo
xmlns:f='uri-foo'/>)
namespace prefix
An abbreviation for a namespace URI that precedes the element/attribute name,
delimited by a colon (e.g., in <f:foo xmlns:f='uri-foo'/>, 'f' is the
namespace prefix)
namespace URI
A unique identifier for a namespace
NCName
No-colon name. A name that does not contain a colon.
processing instruction
Processing instructions are used in XML to pass processing information
through to an application (e.g., <?target data data data?>)
protocol
a set of semantic and syntactic rules that determine the behavior of
functional units in achieving communication
QName
The fully-qualified name of an element/attribute consisting of two parts: the
namespace prefix and the local name (e.g., in <f:foo xmlns:f='uri-foo'/>,
'f:foo' is the QName)
SAX
Simple API for XML: a stream-based API for XML documents
schema
a codified set of element/attribute names
serialization
the process of mapping information to a byte stream
SGML
Standard Generalized Markup Language: a sophisticated language for defining
other markup languages - via document type definitions (DTDs)
SOAP
Simple Object Access Protocol: an XML messaging protocol for invoking
services
specification
a document that outlines the syntax and semantics of a particular
language/technology
standalone
indicates whether any external source must be resolved in order to correctly
interpret the document
syntax
the lexical representation of a command/language that specifies all possible
constructs
tag
Denotes the beginning (e.g., <foo>) and end (e.g., </foo>) of an
element
tag name
The name used in the start and end tags of an element
TCP/IP
Transmission Control Protocol/Internet Protocol - represents the family of
networking protocols used to exchange information on the Internet today
vocabulary
a set of element/attribute names
W3C
World Wide Web Consortium: a neutral organization whose goal is to lead the
technical evolution of the Web - a consortium of businesses, each of which pays
a membership fee and participates in the various activities
WWW
World Wide Web: the set of all information accessible using computers and
networking, each unit of information identified by a URI (as defined by its
inventor, Tim Berners-Lee)
XML
Extensible Markup Language: a restricted form of SGML designed for ease of
use and implementation
XML Base
XML Base: a language for overriding the [base URI] property of an element
XML declaration
specifies the version of XML syntax and the character encoding used to
serialize the document (e.g., <?xml version='1.0' encoding='utf-8'?>) -
although it looks like a processing instruction, it's not considered one so it's
not visible at the Infoset level
XML Query
XML Query Language: an XML-based language that provides query facilities (a
la databases)
XML Schema
XML Schema Definition Language (XSD): an XML-based language for defining
application-defined type systems
XPath
XML Path Language 1.0: a language for describing intra-document addressing
XPointer
XML Pointer Language 1.0: a language for describing inter-document
addressing, points, and ranges
XSLT
XSL Transformations Language 1.0: an XML-based programming language for
describing document transformations
References
The following is a list of additional XML resources that will assist you
in becoming more familiar with the topics introduced in this tutorial.
Lessons from the Component Wars: an XML Manifesto
Don Box, MSDN (1999-09-01)
The Family of XML Specifications
Aaron Skonnard, MSDN Magazine (2000-05-01)
SAX, the Simple API for XML
Aaron Skonnard, MSDN Magazine (2000-11-01)
Addressing Infosets with XPath
Aaron Skonnard, MSDN Magazine (2000-07-01)
XSL Transformations
Don Box, Aaron Skonnard, John Lam, MSDN Magazine (2000-08-01)
Inside SOAP
Don Box, XML.com (2000-02-09)
A Young Person's Guide to SOAP
Don Box, MSDN Magazine (2000-03-01)