Basics of XHTML

Submitted by Richard Lawrence March 13, 2008

More people are starting to use Couplet to manage their title information for online communication, both with their trading partners and with Publishers' Assistant Web services-based web sites. In Couplet, you can enter promotional text, reviews, and sample content for a title as plain text or with a markup language, such as HTML (the "hypertext markup language" commonly used on the Web). This article is about using XHTML as the markup language for data in these fields. XHTML is the next-generation markup language for the Web, and it is slightly different from HTML. You may want to learn more about it because your distributors and retailers require it, or because it's the default language for PAWeb-based sites, or for your own education as you maintain a web site or other data. This article will tell you the basics you need to keep in mind when writing XHTML.



SGML-Based Markup Languages


Some publishers may already be familiar with SGML, a language used to define markup languages for documents. SGML is the common ancestor of HTML, XML, and XHTML: HTML was defined using SGML, XML is a simplified subset of SGML, and XHTML is HTML redefined using XML. If you've never used any of these languages before, the most important thing for you to know is that they describe how text should be formatted using enclosing tags, which are delimited by "<" and ">". Tags have names (the first word after the "<") and, sometimes, attributes (data of the form property="value"). For example:

<ThisIsATag id="This is the 'id' attribute of the tag">
  This is some text surrounded by the tag.
</ThisIsATag>

The tag names and attributes describe the formatting to be applied; a computer program (such as a web browser) will render the text between tags with the formatting they describe. For example, the HTML "<b>" tag makes text bold, so

<b>This text is bold,</b> but this text is not

displays in a web browser with the words "This text is bold," in boldface and the rest of the sentence in normal type.

As you can see from these examples, tags have a start or opening tag as well as an end or closing tag. The closing tag is written the same way, except that it contains a "/" before the tag name. In XML, XHTML, and HTML, closing tags do not have attributes.
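
To see how a program reads these parts, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The variable names are only illustrative, and the markup is the made-up tag from the first example, written out with its closing tag.

```python
import xml.etree.ElementTree as ET

# The made-up example tag, with an opening tag, enclosed text, and a closing tag.
markup = (
    '<ThisIsATag id="This is the \'id\' attribute of the tag">'
    "This is some text surrounded by the tag."
    "</ThisIsATag>"
)

element = ET.fromstring(markup)

tag_name = element.tag        # the first word inside the "<"
attributes = element.attrib   # a dictionary of property="value" pairs
enclosed_text = element.text  # the text between the opening and closing tags

print(tag_name)
print(attributes)
print(enclosed_text)
```

A parser like this one only works on XML-style markup; it would reject much of the looser HTML discussed in the next section.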


The relationship between HTML, XML, and XHTML

HTML was the original markup language of the Web, and it is still the predominant language for web pages. It contains a whole vocabulary of tags that are useful for writing documents which browsers can display to users. XML, the "eXtensible markup language," is really a meta-language for creating custom markup languages, because it allows users to define their own tags. XML was designed to make it easy to exchange data between different kinds of systems (such as between databases on servers connected to the Internet).

HTML has some shortcomings. Most significantly, it does not require that every opening tag has a corresponding closing tag. Some tags, like the <b> tag, do require closing tags (otherwise, all the text after them will be rendered in bold), but others, such as the <br> tag, do not require closing tags (although it is not an error to close them). This means that programs which read HTML, parse it, and render it (mostly web browsers) must be aware of all of these idiosyncrasies, which makes those programs complex to implement. By contrast, XML has a much stricter syntax. It is designed to be easy for programs to parse. For this reason, XML requires that every start tag has a closing tag, that tags are strictly nested and do not overlap, that every document has a single "root" tag in which all others are enclosed, and so on.

XHTML is "a markup language that has the same depth of expression as HTML, but also conforms to XML syntax." You can think of it as a version of HTML that is easier for programs to parse. Writing XHTML requires a little more discipline than writing HTML, but it also means that the documents you write will be less likely to display differently in different browsers, more easily transformed into other formats, and cleaner for human editors to read.


Writing XHTML

There are several things to keep in mind when writing XHTML. In order to obtain the advantages of cross-browser compatibility and easy parsing and transformation, you must write XHTML that is both well-formed and valid. For a document to be "well-formed," it must conform to XML syntax. For it to be valid, it must also conform to the semantics of the XHTML document type, which means (among other things) that you must not use any undefined tags.



The important rules of writing well-formed XML and XHTML are as follows:

  1. Every tag must be closed. You may not use unclosed <br> tags, for example, as you can in HTML. If a tag encloses other data, it must have separate opening and closing tags, such as this paragraph tag:

     <p>This is a paragraph with some text in it.  Note how it has both an opening and a closing tag.</p>

    If a tag is empty (i.e., it does not enclose any other data, such as a line-break tag), it can be opened and closed in the same tag, using the special trailing-"/" format, e.g., <br />.

    Thus, the following are strictly equivalent:

     <br></br>
     <br />

    Both will produce a line-break in an XHTML document.
  2. Tags must not overlap. If you open one tag after opening another, you must close the second tag before you close the first. In other words, tags must be strictly nested. For example, this is well-formed:

     <b><i>This text is bold and italic.  Note how the "i" tag is closed before the "b" tag.</i></b>
    But this markup is not:
     <b>Inside the bold tag <i>inside both bold and italic</b> WRONG: can't close the bold tag before the italic</i>
  3. Special characters must be escaped. Because they are meaningful in the markup language, the less than and greater than characters ("<" and ">"), ampersand ("&"), and single and double quotes should be represented using the following strings in their place:

    Character    Replacement string
    <            &lt;
    >            &gt;
    &            &amp;
    "            &quot;
    '            &apos;

    A note about editors: You should use a good plain-text editor to create your XHTML markup, preferably one that is capable of saving text files in a variety of character encodings. You should not use a "what you see is what you get" (WYSIWYG) editor like Microsoft Word. Word in particular is prone to replacing the usual quotation marks (" and ') with "smart quotes," which are different characters with different character codes, and to saving files in an encoding other than UTF-8. This will make it difficult to replace these characters with the correct escape string, and it can cause validation errors if your document declares itself to be UTF-8 (the default) but was actually saved in another encoding. Using a good plain text editor, such as GNU Emacs, will allow you to easily replace the above characters, and it will give you control over which characters and which encoding end up in your file.

  4. Attribute values must be quoted. For example,

     <td colspan="3">This table cell spans three columns.</td>
     <td colspan='2'>Using single quotes is also OK.</td>
    But this is incorrect:
     <td colspan=3>Incorrect</td>
  5. There must be a single root tag. XML documents represent hierarchical data, and computer programs parse them into data trees. For XML data to be well-formed, it must have a single "root" — a starting place where parsing begins when the tag is opened, and ends when the tag is closed. It is not normally a problem to remember this when using XHTML, because (as with HTML), the entire document is always enclosed in an <html> tag.
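
If you are comfortable running a small script, the rules above can be tried out with a short Python sketch like the following. It is only an illustration: the function names is_well_formed and escape_text are made up for this example, and the check relies on Python's standard XML parser, which tests well-formedness but knows nothing about the XHTML vocabulary.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML (well-formedness only)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def escape_text(text: str) -> str:
    """Replace <, >, &, and both kinds of quotes with their escape strings."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


# One passing and one failing example for each rule.
examples = {
    "closed tags": "<p>Line one<br />line two</p>",
    "unclosed tag": "<p>Line one<br>line two</p>",
    "nested tags": "<b><i>bold and italic</i></b>",
    "overlapping tags": "<b>bold <i>both</b> italic</i>",
    "escaped text": "<p>" + escape_text('Smith & Sons say "2 < 3"') + "</p>",
    "unescaped ampersand": "<p>Smith & Sons</p>",
    "quoted attribute": '<td colspan="3">cell</td>',
    "unquoted attribute": "<td colspan=3>cell</td>",
    "single root": "<div><p>one</p><p>two</p></div>",
    "two roots": "<p>one</p><p>two</p>",
}

for label, markup in examples.items():
    print(f"{label}: {is_well_formed(markup)}")
```

Note that escape_text is meant for plain text that will be placed between tags or inside attribute values; running it over text that already contains markup would escape the tags themselves.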



In addition to being well-formed, an XHTML document must conform to the particular semantics of the XHTML standard. This is too big a topic to cover in this introductory article, but if you know HTML, you already know a good deal of the semantics of XHTML. The tag names are essentially the same; in fact, XHTML is supposed to be backward-compatible with HTML, so that old browsers which do not specifically have XHTML parsers can still render it as HTML.

One important difference between XHTML and HTML, though, is that XHTML markup is case-sensitive. In XHTML, all markup, including tag names and attribute names, must be in lower case. HTML is case-insensitive; the "<p>" tag is the same as the "<P>" tag. In XHTML, there is no "<P>" tag. Also, in XHTML 1.0 Strict, there are no "<center>", "<u>", or "<applet>" tags, as there were in older versions of HTML.
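
When converting old HTML, upper-case names are easy to overlook. As a rough sketch (the function name non_lowercase_names is hypothetical, and the markup must already be well-formed for the parser to accept it), a few lines of Python can list the tag and attribute names that are not in lower case:

```python
import xml.etree.ElementTree as ET
from typing import List


def non_lowercase_names(markup: str) -> List[str]:
    """List tag and attribute names that contain upper-case letters."""
    root = ET.fromstring(markup)
    offenders = []
    for element in root.iter():
        # Check the tag name first, then each attribute name on that tag.
        if element.tag != element.tag.lower():
            offenders.append(element.tag)
        for name in element.attrib:
            if name != name.lower():
                offenders.append(name)
    return offenders


print(non_lowercase_names('<P Class="intro">Hello <b>world</b></P>'))
print(non_lowercase_names('<p class="intro">Hello <b>world</b></p>'))
```

This only reports the names; it does not tell you whether a lower-case name is actually part of XHTML.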

You must also declare at the beginning of an XHTML document what sort of document it is, and which character encoding it uses, so that conforming XML parsers can understand it. Before the <html> root tag, you should have two lines like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

(Explaining this cryptic text is also beyond the scope of this article, but it is required in a complete XHTML document. You can learn more about XHTML document types here, and document type definitions generally here. If you're using XHTML to manage your title data for a PAWeb-based site, don't worry! You don't need to enter a document type into Couplet, because the layout engine of your site will take care of that for you. You only need to enter the markup for the data in the field you're editing, such as a title's promotional text; the rest of the XHTML document is generated for you.)

If you would like to learn more about the semantics of XHTML, you should read the World Wide Web Consortium's specification of XHTML 1.0. In general, though, if you follow the rules for writing well-formed XHTML, and you use the tags you already know from HTML, you'll be well on your way to writing valid XHTML.


Checking your markup

Whether you edit your XHTML by hand, use a GUI editor such as Dreamweaver, or only create partial "snippets" of XHTML for inclusion in other documents — such as markup in promotional text which is displayed dynamically on a PAWeb-based site — you can check that your XHTML is valid using the W3C's validation service. Simply upload a document as a file, or enter a URL, and the service will check your XHTML for compliance with the standard. It will give very specific feedback about your markup, and point out errors you may have missed.
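
If you only write snippets, one possible approach is to wrap a snippet in a minimal XHTML document before checking it. The sketch below is an assumption about how you might do this yourself, not a description of how Couplet or PAWeb works; the names XHTML_TEMPLATE, wrap_snippet, and check_snippet are made up for the example. The local check uses Python's standard XML parser, so it catches only well-formedness errors; the wrapped document can then be saved to a file and uploaded to the validation service for a full check.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# A minimal XHTML 1.0 Strict document with a placeholder for the snippet.
XHTML_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Snippet check</title></head>
<body>
{snippet}
</body>
</html>
"""


def wrap_snippet(snippet: str) -> str:
    """Place a snippet inside the body of a complete XHTML document."""
    return XHTML_TEMPLATE.format(snippet=snippet)


def check_snippet(snippet: str) -> Optional[str]:
    """Return None if the wrapped snippet is well-formed, else the parser's message."""
    document = wrap_snippet(snippet)
    try:
        # Parse as bytes so the encoding named in the XML declaration applies.
        ET.fromstring(document.encode("utf-8"))
    except ET.ParseError as error:
        return str(error)
    return None


print(check_snippet("<p>A <b>bold</b> claim.</p>"))
print(check_snippet("<p>A <b>bold claim.</p>"))
```

Because the template adds several lines before the snippet, any line number in the parser's message refers to the wrapped document, not to the snippet alone.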

This can be a frustrating process at first, especially if you are converting documents from HTML, because documents which might look unproblematic can have large numbers of errors. As you get used to XHTML, though, you will find yourself making these errors less frequently. Remember, it's important to have standards-compliant markup if you want it to display correctly in different browsers, and if you want your trading partners' systems to be able to understand and use it, so learn to check your work often!