Composing Good HTML

This document attempts to address stylistic points of HTML composition, both at the document and the web level. It is available on the Web at http://www.cs.cmu.edu/~tilt/cgh/ (if you are reading this via a mirror, you may want to check the original to make sure you're seeing an up-to-date version).

Disclaimer: This document is neither finished, nor is it even all that current. It reflects my thinking during the summer of 1995, which is the last time that I had the luxury of spending three months thinking about these kinds of things. The web changes rapidly; in some ways this document is now hopelessly out of date. In other ways, though, it still retains some valuable insights (at least, in my humble opinion). You'll have to be the judge, as the reader, because I have neither the time nor the inclination these days to try to keep this document updated. This text is still largely the same text I wrote that summer, with occasional grammatical and spelling fixes. Caveat emptor!

New: This is version 2.0.16 (the old version 1 is no longer available, as of Nov 17 1997). Now that Web Weaving is on shelves near you, it seems appropriate for me to get off my duff and feed all of the changes back into this document. See "Some History," below, for more information on what the heck I'm talking about.

This document is divided into two main sections. The first section discusses the document -- it should be recognizable as the revised version of the original CGH. It discusses good practices to follow in creating your documents, common errors and things to avoid when composing HTML, and finally, a brief treatment of style sheets, which provide a mechanism for greater control over how a document is rendered. The second section is brand new -- it discusses style issues regarding your Web as a whole: how it is divided and organized, and how it is interlinked and intertwined.

This is not a beginner's guide; check the "For More Information" section for pointers to more basic works, as well as for more advanced references and tutorials. It is designed for the HTML author who has learned the basics, and is ready to start thinking about the more advanced aspects of Web document design.

Note: I'm not finished spiffing up this new version yet, but it's good enough to be presentable, and I'd rather have the information available than have it languish for lack of final polishing. At the very least, I still need to:

Make some of the larger figures into a more manageable size
Provide rendered versions of the HTML examples
Add in some more useful links to other resources (suggestions appreciated!)
Break this into single and multipart versions by preparing multiview source documents

Unfortunately, the life of a grad student is not all cheese and wine (very little of it, in fact), so these will have to come at a later date. Besides, with the publication of Web Weaving (see the History section below for background), it seems an appropriate time to also re-update this document, so I won't let a little thing like a busy schedule stand in my way.

Some History

I wrote the first version of "Composing Good HTML" in January of 1994. At this point, the Web was just starting to explode, and Mosaic was the browser on the tip of everyone's mouse. Being one of the strange few who used Lynx as well as Mosaic (as well as Emacs-W3, when I was feeling cocky), I noticed that different browsers dealt with incorrect usage of HTML with varying degrees of success. When I pointed this out, the solution suggested to me was to write a "lint" for HTML that would point out common errors in documents. In preparation for this, I started making a list of common errors, and turned that list into a human-readable document. That document became "Composing Good HTML."

About that time the semester started, so I made the document publicly available, and asked for comments and criticism. I got both, in spades! I corrected errors (including a plethora of spelling and grammatical errors), added some new sections, and revised pieces of existing sections. But, all in all, CGH didn't really change much, even though things like Netscape and HTML 3.0 (let alone Java and VRML) have snuck up in the meantime.

In January of 1995, Carl Steadman, Tyler Jones, and I got together with the idea of writing a book about the Web (this was before the current explosion of the market, so you'll pardon our naivete). Rather than writing a book about HTML, we decided to write a book about creating and maintaining an entire site -- including the stylistic points in CGH as a starting point. The book is called Web Weaving, and it appeared on bookshelves on December 18th, 1995. The book is published by Addison-Wesley.

The side effect of all of this is that it gave me a reason to revise CGH to reflect current practices for inclusion in Web Weaving. And now that we've finally finished our book, this also means that the changes in CGH are getting fed right back into the online version. Which, I'm proud to say, is still freely available (and better than ever, I'd like to think). What you see here is, by and large, Chapters 11 and 12 from Web Weaving, edited so that they stand alone better. While I'd certainly recommend you read Web Weaving for a full treatment on all the issues involved in building and maintaining your Web site (and because every author hopes that his words will be read), Composing Good HTML remains (I hope!) a useful resource for HTML authors (and now Web designers) who want a slightly more sophisticated treatment of the stylistic issues involved in, well, weaving your web.

I never did get around to writing that "lint" program, though.

Document Style Considerations

The World Wide Web has been a wildly successful experiment. It has filled a need for both information users and for information providers: a tool which allows information to be deployed to a wide variety of people over wide geographic distances, regardless of what kind of computer they may be running. All that is required to publish information is any one of a number of Web servers, and all that is required to view that information is any one of a number of Web clients. This is both an opportunity and a challenge. This document discusses the ways in which you construct your markup so that it is readable and usable for a wide range of browsers.

HTML provides a device-independent way of describing information. The elements of HTML describe what your information is, not how it should be displayed. This is a subtle point, and perhaps the most important one presented here. HTML will let you describe this piece of information as a header, or that piece of information as an address. It will not let you describe this text as being in 24-point Helvetica, right justified. Your challenge is to provide professional page layout and design without using the traditional tools of professional page layout and design. Sound like a paradox? Not really. All it involves is a bit of trust.

The trust you must have can be summarized by the following rule:

if you mark up a document so that your information is labeled as what it is instead of as how it should be displayed,
then browsers will render it in a way that is appropriate and professional-looking.

With the current diversity of clients for the Web (and we can only expect to see more), it has become important to write HTML that will look good on any client, and not just on the specific client which the author may have access to. You must trust your markup. There is no way to anticipate how every browser will (differently) render your HTML. If you follow this rule you will get the best possible rendering with all browsers, instead of for just one browser.
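As a rough illustration of the rule (the page content below is invented for the example), the first fragment labels each piece of text as what it is, while the second only tries to dictate how it should look:

```html
<!-- Descriptive markup: says what each piece of information is -->
<H1>Department of Bovine Studies</H1>
<ADDRESS>Room 101, Example Hall</ADDRESS>

<!-- Presentational markup: says only how the text should look -->
<B>Department of Bovine Studies</B><BR>
<I>Room 101, Example Hall</I>
```

A browser that cannot display bold or italic text can still do something sensible with the first fragment, because it knows it is dealing with a heading and an address.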

To this end, there are a few solutions. One approach is software based -- a "lint"-like program for catching semantic errors in HTML, and perhaps even correcting them. Two good examples of this are WebTech's HTML Validation Service and WebLint. Another approach is the one taken by this document -- a style guide which points out common errors one might make in the composition of HTML, and recommends good practices to follow.

Bear in mind when following these guidelines that your document may not end up looking the best it possibly can on a particular browser. However, it also will not look ugly on any browser, which is the risk you take by disregarding these recommendations and tweaking your markup code for, say, Netscape. Unfortunately, Netscape may render things differently from Lynx which may render things differently from Mosaic, and so on and so forth -- and even within a particular browser, a user may have chosen font or style preferences different from the ones which you might assume. What these guidelines should do, if followed, is make for a better presentation for the most browsers (instead of the best presentation for only one) -- and ensure that your documents reach the widest audience possible.

Good Practices

Things contained in this section are good practices for the generation of any HTML document. Specifically, this would include anything which should routinely be done in the creation of documents for the benefit of both reader and author.

How to Use Non-Standard HTML

There are at least three major flavors of HTML currently in practice as this is being written: HTML 2.0, HTML 3.0, and the Netscape extensions to HTML 2.0. HTML 2.0 is the closest thing to current practice that is available, and can be assumed to be "safe" for all browsers.

On the other hand, HTML 3.0 and the Netscape extensions are not widely implemented, let alone standardized. Under most circumstances, this would be a good reason not to use them until they were more widely available, but there is the mitigating circumstance that all of the Netscape extensions (and some of HTML 3.0, most notably tables) are supported by one of the most popular Web browsers ... Netscape!

What should be done about this? Many Web authors take the approach that, since most people use Netscape, it's acceptable to use the Netscape elements, even if it is to the detriment of people using other browsers. Others take the approach that nothing more than HTML 2.0 should ever be used, which means that any benefit which might be derived from these enhancements is lost.

The best road is a middle approach. Two good rules of thumb are:

If two or more popular browsers support the extension, it's probably fine to use. For instance, both Netscape and Mosaic (and Arena) now support tables, so any tables you use will be available to most of your audience.
If the extension is not widely supported, but it will not adversely affect your document if it is missing, it's probably fine to use. For instance, the FONT element changes the font size of text in the Netscape Navigator, but not in some other clients (when I first wrote this, no other client supported it; these days, several others do as well, including IBM WebExplorer and Microsoft Internet Explorer). However, other clients will simply ignore tags they do not understand, so the text in the FONT element will still be readable. On the other hand, if the MATH element is ignored by a browser, the browser will display gibberish.

In general, try to think about the effect that the non-standard elements will have if they are not recognized. These elements can be used intelligently, and on browsers that recognize them, can dramatically enhance the presentation of your page. If it is not possible to use the elements in such a way that rendering is still good on all clients, think about providing multiple copies of the document (for instance, providing a version of the table using the PRE element), and possibly using content-negotiation on the server to provide the reader with the correct version of the document.
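As a sketch of the "multiple copies" idea, here is the same small table written twice: once with the TABLE element, and once as preformatted text for clients that do not understand tables. The data is a placeholder, and the two fragments are meant to live in two separate versions of the document, not side by side in one:

```html
<!-- Version for browsers that support tables -->
<TABLE BORDER>
<TR><TH>Item</TH><TH>Count</TH></TR>
<TR><TD>Apples</TD><TD>12</TD></TR>
<TR><TD>Pears</TD><TD>7</TD></TR>
</TABLE>

<!-- Alternate version of the same data using PRE -->
<PRE>
Item      Count
------    -----
Apples       12
Pears         7
</PRE>
```

The PRE version loses the borders and cell structure, but the columns still line up in any browser that honors preformatted text.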

A final thought on the subject: try to avoid banners in your document that claim that your document is "Enhanced for Netscape" or "Enhanced for HTML 3.0" (or the rapidly more prevalent "Enhanced for Microsoft's Internet Explorer." Ugh.) Rather, try to build your document so that if a reader reads it in (for example) Netscape, it will be obvious that it uses the new elements to good effect ... and if a reader reads it in another browser, they can remain blissfully unaware of what they cannot see, and still be impressed by what they do see.

(Opinion Alert: a general comment, that may or may not place me on Bill Gates' hit list -- while I have a healthy disregard for the cavalier attitude in which most "extensions" are made de facto by overwhelming will of places like Netscape, I still have a healthy respect for those extensions which attempt to solve an important problem in a useful way. Many of the Netscape extensions, especially those involving tables, fit this bill, and while they did also provide many duds as well, they have also supported the valid HTML 3.0 alternatives that mirror their extensions. However, in my opinion, every single one of the "Microsoft extensions" is of dubious merit, and of certain incompatibility with any evolving HTML 3.0 specification. Given the well developed state of HTML 3.0, introducing new and incompatible methods of doing the same thing is irresponsible at the least. I highly recommend simply disregarding the extensions introduced with Internet Explorer. Please note that I have the highest respect for many of Microsoft's products; I even used Word and Internet Assistant to compose this edition of this document [although I edited the HTML afterward]. And, dear reader, this paragraph in particular is highly opinion-ridden, so you must take it with a grain of salt as you see fit. On with the useful stuff:)

Signing and time-stamping documents

One problem which faces anyone trying to find information using the Internet is the question of "authoritativeness." The relative ease with which WWW servers can be set up and populated with information means that the traditional checks of the publishing process cannot act to filter out information which is inaccurate or misleading. In addition, it can often be hard to tell how current information found online is, or how actively it is maintained and updated.

One thing which you can do to assist Web users is to sign and date all documents in your infostructure, so that people viewing the documents can form some impression of the authority of the document (i.e., how recent it is, and how reliable the information provider is). This is not a complete solution, but it is a large step forward.

For example:

<HR>
Last modified: March 6, 1995
<ADDRESS>
<A HREF="http://cs.cmu.edu/~tilt/">James Eric Tilton</A><BR>
<A HREF="mailto:tilt@cs.cmu.edu">tilt@cs.cmu.edu</A>
</ADDRESS>

Some notes about this example:

The date is given in an unambiguous format: "March 6, 1995". Why is this better than the more economical "3/6/95"? One reason is that for some of your audience, especially those from Europe, this means "June 3, 1995".
A link to a home page is provided. If a reader is interested, she can follow it to find more information by this author. This provides a consistent centering function which helps keep a reader from becoming disoriented (See Main Roads and Scenic Paths, below).
A mailto: link anchors the document to the mail address of its creator. The mailto: URL specifies an e-mail address. Most browsers support this, allowing the reader to send e-mail to the address specified. This can be a useful way to get feedback. In addition, the mailto: link is separated from the link to the home page by a <BR>, so that the two links can be easily distinguished.

Another option for signing a document is to encode information about the author in the document's header information. You can do this by including a LINK element of type made in your HEAD element. For example:

<HEAD>
<TITLE>This is my Title</TITLE>
<LINK REV="made" HREF="mailto:author@some.site.org">
</HEAD>

This example uses the LINK element, which may be unfamiliar to you. This element is equivalent to the A element; that is, it provides a link to some other object. However, since it is part of the HEAD information (which is information about the document, rather than part of the document itself), this is a link from the entire document to another object. (Anchors, on the other hand, are links from some small subset of the document, like a word or a phrase, to another document). This link, like most other HEAD information, is typically not displayed by a browser, or followable by a reader.

The fact that it is not displayed does not make it useless, however. Many browsers, such as Lynx, supply a "reply to author" function. The information about who the author is comes from using the LINK as above. Other applications which can make use of the information include Web spiders and other maintenance tools, which can benefit from having authority information in machine readable format.

The format of the LINK element is the same as that of the A element. Notice the use of the REV attribute, which describes this relationship as a REVerse relationship of the type made. This means that this document was made by the object at the other end of the anchor.

Device independence through better printing

One promise of the widespread availability of personal computers has been the lessening of our reliance on paper. In some ways, this promise has been realized; many trees (and municipal landfills) are no doubt grateful that many of us are now committing our words to e-mail instead of to a handwritten or typewritten letter or memo. On the other hand, until video display technology produces results indistinguishable from paper, we will no doubt continue to print things out. It's hard to curl up with a notebook at night, especially if it has a coaxial cable jutting out the back of it. Because of this, many people will want to print out the documents which you have provided electronically. In effect, they will want to take the document you have woven into a part of a web, and make it into a standalone document.

Fortunately, HTML is well-suited to this. A document in HTML can theoretically be rendered in many more formats besides simply on a screen. Print is one obvious alternative, although speech and Braille are also possible and desirable. We bring this up because it is important to consider ways other than on-screen that a reader may encounter your documents. Given that, thinking about your document as something that might be printed can be a very useful tool for creating documents that aren't tied to the specific requirements of a browser or display hardware.

Taking advantage of prose

One of the advantages of the World Wide Web over similar infosystems, like Gopher, is that the Web makes no distinction between what is a menu and what is a document. For instance, in Gopher, a document is "dead" -- it can't lead anywhere, and, in order to continue exploration, a reader must return one step back to a menu. In the same vein, a Gopher menu provides only limited information about where links lead to: often a menu item must be retrieved and explored before any sense can be made of whether it is appropriate to what a reader seeks.

On the other hand, a Web document is "live" -- there's no clear dividing line between a menu container and its contents. This is a liberating distinction, as a document can now be as verbose as necessary in providing context for links. Consider the difference between these two documents in Figures 1 and 2.

[Figure 1: A menu list without context (Lynx)]

[Figure 2: A prose description of resources (Lynx)]

The second example is much more satisfying, because it is more than simply a list of pointers. Instead, an effort has been made to integrate the list into prose that is (presumably) better tied into the subject of the document as a whole.
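The figures are not reproduced here, so the following is only a sketch of the kind of markup involved; the URLs and wording are invented for illustration. The first fragment is a bare menu, and the second works the same links into prose:

```html
<!-- A bare menu: the reader learns little before following a link -->
<UL>
<LI><A HREF="http://www.example.org/cows/">Cows</A>
<LI><A HREF="http://www.example.org/sheep/">Sheep</A>
</UL>

<!-- The same links worked into prose that supplies context -->
<P>Our farm pages begin with an
<A HREF="http://www.example.org/cows/">introduction to dairy cows</A>,
and continue with
<A HREF="http://www.example.org/sheep/">notes on raising sheep for wool</A>.
```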

This is not to say that it is always preferable to force what is more naturally a menu into prose for the sake of prose. If you are creating a document that serves as a jumping-off point to other resources, your readers might not want to get into the thick of text to find the resource they're searching for. In this case, a definition list may be more appropriate, as shown in Figure 3. This is a nice compromise, giving context without becoming buried in a forest of words.

[Figure 3: A menu using a definition list (Lynx)]
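A definition-list menu of this kind might be marked up roughly as follows (again with invented URLs and descriptions). Each DT carries the link, and each DD carries a sentence of context:

```html
<DL>
<DT><A HREF="http://www.example.org/cows/">Cows</A>
<DD>An introduction to dairy breeds and their care.
<DT><A HREF="http://www.example.org/sheep/">Sheep</A>
<DD>Notes on raising sheep for wool.
</DL>
```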

Meaningless link text

When creating documents, make sure that your links are meaningful -- that is, that they avoid online-specific references, and that they don't detract from readability. The text of your links should flow well in the context of the rest of your text, and your text should also be able to stand alone as a printable document. You should at all costs avoid the "Click Here" syndrome, as shown in Figure 4.

[Figure 4: The "Click Here" Syndrome (Arena)]

Figure 4 is also bad because it refers to "clicking", which assumes that everyone is using a mouse with their browser, which is not always the case. A much better alternative is demonstrated in figure 5.

[Figure 5: Meaningful Link Text (Arena)]

Another point to consider about the choice of words selected for link text ("information about cows", in this example), is that often this link text may be what is used as information for a reader's bookmark or hotlist entry. When the word "here" is used as link text, the hotlist may become cluttered with entries that read only, "here", instead of with information about what the link is actually about.
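In markup terms, the difference between the two styles might look something like this (the URL and surrounding sentences are made up for the example; only the link text "information about cows" comes from the figure):

```html
<!-- Avoid: the link text says nothing on its own -->
<P>For information about cows, click
<A HREF="http://www.example.org/cows.html">here</A>.

<!-- Prefer: the link text describes its destination -->
<P>The farm also provides
<A HREF="http://www.example.org/cows.html">information about cows</A>.
```

The second version reads naturally when printed, and it produces a useful hotlist entry.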

Organization through outlining

Headers provide a useful way to provide an outline for your document. Headers of level 1 (H1) indicate major points, while headers of level 2 (H2) provide sub-topics to those points, and so on and so forth. It is important to remember that the purpose of these headers is not to provide specific kinds of fonts or layout, but rather to organize a document into sections. To that end, here are some recommendations about heading usage:

A heading should not be more than one level below the heading which preceded it. That is, an H3 element should not follow an H1 element directly.
Also, one version of the HTML specification declares that "a heading element implies all the font changes, paragraph breaks before and after, and white space (for example) necessary to render the heading". Extra highlighting elements, like EM or B, are discouraged within the header.
Do not mark up text as H2 or H3 merely because it appears to provide the correct size and bolding of fonts on the browsers used by local readers. On another browser, that same text may be incredibly grotesque and large, not providing the desired effect at all. Figures 6 and 7 demonstrate this effect.

[Figure 6: Expected headline rendering (Arena)]

[Figure 7: Unexpected headline rendering (Netscape)]
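A minimal sketch of headings used as an outline, with invented section titles, might look like this. Each heading is at most one level below the one before it, and none contains extra highlighting:

```html
<H1>Raising Cows</H1>
<P>Introductory text.
<H2>Feeding</H2>
<P>Text about feeding.
<H3>Winter Feed</H3>
<P>Text about winter feed.
<H2>Housing</H2>
<P>Text about housing.

<!-- Avoid: jumping from H1 straight to H3 to get a smaller font,
     or wrapping heading text in B or EM -->
```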

Physical versus logical character emphasis

Since HTML (and also SGML) is designed to be a device independent language for describing the content of documents, most of the elements within it aren't intended to give direct control to the author over how the final page layout will look. The major exceptions to this are in the character highlighting elements.

There are two types of character highlighting elements -- physical and logical. The physical styles involve things like "italic font", and "boldface"; while the logical styles are things like "emphasis", "citation", and "strong." It is strongly recommended that you employ the logical styles rather than the physical styles in your documents. Using the I element to render text in italics will only be effective on those browsers which are capable of displaying italics -- which all browsers are not guaranteed to be able to do. It is far better to encode semantic content -- to describe things in terms of logical styles -- and then allow the browser to display that semantic structure as best it can, given its display capabilities.

So, instead of

<I>italics</I>

you might use

<EM>emphasized</EM> or <CITE>citation</CITE>

and instead of

<B>bold</B>

you might use

<STRONG>strong</STRONG>

This also leaves the possibilities open in the future for more sophisticated uses of these semantic encodings, which have much more inherent meaning than font styles like bold or italic. For example, the Lycos indexing system can take advantage of semantic encoding to create abstracts of documents.

Note: Before you stop using B and I altogether, here's another viewpoint to consider. One argument against logical character styles is that it turns out to be a bottomless pit, a fruitless attempt to define logical styles for every possibility. Physical styles, combined with the context of the text in which they are placed, seem to provide a much richer set without a huge number of tags. Consider the large space of context that can be implied with only the typographical conventions of bold or italic. The only problem is that that contextual space needs to have a human being to interpret it, which would make some kinds of computer-based rendering difficult, if not impossible (e.g. speech synthesis).

A picture is worth a thousand words (which is why it takes a thousand times longer to load...)

The title of this section is somewhat facetious, but only somewhat. It's more and more obvious from current Web development efforts that the main attraction of the Web is not hypertext, and it's not an easy interface; the main attraction is the flashy graphics and the alluring promise of multimedia. We shall heroically refrain from commenting on whether this is a good or a bad thing, for the fact remains that online multimedia is here to stay. What we will comment on are the issues that must be considered to use multimedia to best effect.

The first set of issues revolves around the faux sense of page design one can get by using inline images. An early example of this was one of the early commercial forays into the Web, a graphic design house which advertised professional layout services for online brochures. They spent quite a bit of time designing graphic images of the proper width so that they could achieve page-layout effects like right justification and centering, and created a page which was fairly well-designed. However, they got bitten because this design relied on a browser's window being the default width for X Mosaic. With a wider window, the carefully aligned logo in the upper right corner was immediately followed by the image that should have been left justified on the following line.

Current browsers implement some better forms of layout control for images. For example, an author can specify the way in which text will flow around an image with the ALIGN attribute of the IMG element. Figures 8 and 9 exemplify this; the former has no text-flow information, and the latter does. This is not perfect, as using the ALIGN attribute can cause strange stair-stepping effects if there is not enough text separating two images, as figure 10 illustrates. If the desired effect is of images with captions, a table is probably the best approach for layout purposes (Figure 11).

[Figure 8: IMG without the ALIGN attribute (Netscape)]

[Figure 9: IMG with the ALIGN attribute (Netscape)]

[Figure 10: Stair-stepping due to ALIGN (Netscape)]

[Figure 11: Using TABLE for layout (Netscape)]
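The markup behind figures like these is not reproduced here, but a sketch might look like the following. The file names are placeholders, and ALIGN=LEFT and BR CLEAR=ALL are the Netscape-style forms, so they fall under the caveats about non-standard HTML discussed earlier:

```html
<!-- Text flows down the right-hand side of a left-aligned image -->
<P><IMG SRC="cow.gif" ALT="[Photograph of a cow]" ALIGN=LEFT>
This paragraph wraps beside the image instead of continuing
from its bottom edge.
<BR CLEAR=ALL>

<!-- Images with captions, laid out in a table instead -->
<TABLE>
<TR><TD><IMG SRC="cow.gif" ALT="[Cow]"></TD>
    <TD><IMG SRC="sheep.gif" ALT="[Sheep]"></TD></TR>
<TR><TD>A cow</TD><TD>A sheep</TD></TR>
</TABLE>
```

The BR with CLEAR=ALL is one way to ask the browser to resume below the image before the next one starts, which may help with stair-stepping on browsers that support it.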

Another consideration is unnecessary duplication of effort. Many authors swear by colored bullets and colorful horizontal rules, implementing both effects by using inlined images rather than the structural markup. Doing this can leave the portion of your audience which is unable (or unwilling) to view inlined images out of the loop, and can also negate some of the benefits provided by structural markup. There is also an unexpected side effect to using many small images: the current way in which Web clients retrieve documents requires that a separate connection to a Web server be initiated for each image. The time involved in negotiating this connection may actually be larger than the time involved in retrieving the image itself. Consider whether the effect achieved by the "enhanced" layout justifies the cost.
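To make the comparison concrete, here is a sketch of both approaches; the icon file names are hypothetical. The first fragment imitates a list and a rule with inline images, and the second uses the structural elements directly:

```html
<!-- Decorative images standing in for structural markup -->
<IMG SRC="/icons/redball.gif" ALT="*"> First point<BR>
<IMG SRC="/icons/redball.gif" ALT="*"> Second point<BR>
<IMG SRC="/icons/rainbow-line.gif" ALT="----------">

<!-- The structural equivalent: a real list and a real rule -->
<UL>
<LI>First point
<LI>Second point
</UL>
<HR>
```

If you do use the image version, the ALT text at least keeps it legible for readers who are not loading images.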

Another concern is the size of images. With the increasing home popularity of the Internet, more and more users are purchasing dial-up connections of one sort or another. This may be of the strict "shell-account" variety, which means that your readers will not see images at all, or they may be of the SLIP/PPP variety, which means that your readers will have an average of only 14,400 bits of information per second sent to them. This is not a large number, and huge images can take minutes to load. Bear this in mind when selecting images; will the image take so long to load that your reader will go somewhere else rather than wait?

The image size issue can be alleviated in several ways. First, the increasing popularity of the JPEG format means that images can be compressed to much smaller sizes, which provides dramatic speed-up in image load time. Even better results can be achieved by using fewer colors (gray scale, rather than full 24-bit color, for example). Another approach is to use a small set of navigational icons which appear on every page in your Web. Most browsers now cache documents and images; using the same icons (and using the same URL to refer to them with, perhaps by maintaining an /icons directory on your Web server) means that the reader will only incur the cost of downloading once.

Also, when using the IMG element, don't forget to also use the ALT attribute. The ALT attribute allows alternate text to be specified for an inlined image. This is especially useful for images that have specific meaning (and provide a link to other documents), as that meaning can be lost on those who do not have images loaded. For example:

<IMG SRC="http://www.miskatonic.edu/icons/next.gif">

can be better represented with the addition of the following ALT attribute:

<IMG SRC="http://www.miskatonic.edu/icons/next.gif" ALT="[Next Page]">

as shown in figures 12 through 16.

[Figure 12: The Document As Expected (Netscape)]

[Figure 13: Inlined Images Off/No ALT Tag (Netscape)]

[Figure 14: Text Browser/No ALT Tag (Lynx)]

[Figure 15: Inlined Images Off/ALT Tag Supplied (Netscape)]

[Figure 16: Text Browser/ALT Tag Supplied (Lynx)]

Finally, don't rely entirely on image maps and graphic logos to build your site. There are a few sites which have almost no textual content whatsoever; when visited by readers who do not (or cannot) load images, there is no information available. This is not to say that image maps must be avoided altogether. Instead, provide alternative means of navigation which supplement the image map, such as explanatory text which follows your map.
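One way to supplement an image map is sketched below. The image map URL is an assumption (the details depend on how your server handles image maps), and the page names are invented; the point is only that the same destinations are also reachable through ordinary text links:

```html
<A HREF="/cgi-bin/imagemap/navbar"><IMG SRC="/icons/navbar.gif"
   ALT="[Navigation bar]" ISMAP></A>

<!-- Text links that duplicate the destinations in the image map -->
<P>
<A HREF="index.html">Home</A> |
<A HREF="toc.html">Contents</A> |
<A HREF="next.html">Next Page</A>
```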

Common Errors

This section details common errors in HTML composition that may lead to documents which are not fully device-independent. The behaviors of these errors are undefined, so certain browsers may render them as intended but not all browsers are guaranteed to do so. Therefore, these mistakes should be avoided, even if your browser of choice renders your documents correctly.

These errors are, for the most part, artifacts of "raw" HTML authoring. Web development has suffered from a lack of good authoring tools, a situation which is only now beginning to be rectified. Many of these errors involve typos or simple mistakes, although others deal with more fundamental conceptual problems.

Paragraph element errors

The use of the paragraph element (P) can be confusing. When HTML was first introduced, <P> served as a paragraph separator, not as an end-of-paragraph; a confusion which originally prompted this document. However, more recent versions of the specification (HTML 2.0 and later) have changed this behavior.

The current recommended use of the P element is to be placed at the beginning of paragraphs; for example:

<P> In this paragraph, our hero discovers that he really likes
baloney sandwiches. He also listens to some disco, and has a
lovely beverage. Ah, if only all paragraphs were this exciting!

This is in contrast to previous usage, where the <P> was usually placed at the end of the paragraph.

Still, in certain contexts, use of <P> should be avoided, such as directly before any other element which already implies a paragraph break.

To wit, the <P> element should not be placed before the headings, HR, ADDRESS, BLOCKQUOTE, or PRE.

It should also not be placed immediately before a list element of any stripe. That is, a <P> should not be used to mark the end-of-text for <LI>, <DT> or <DD>. These elements already imply paragraph breaks.
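To make the rule concrete, here is a small before-and-after sketch. The heading and list contents are invented for illustration:

```html
<!-- Avoid: redundant P tags before a heading and at the end
     of list items, which already imply paragraph breaks. -->
<P>
<H2>Lunch Options</H2>
<UL>
<LI>Baloney sandwich<P>
<LI>Lovely beverage<P>
</UL>

<!-- Prefer: let the heading and list elements supply their
     own breaks, and keep P for actual paragraphs. -->
<H2>Lunch Options</H2>
<UL>
<LI>Baloney sandwich
<LI>Lovely beverage
</UL>
<P>Our hero chooses the sandwich.
```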

Caveats

Some clarifications on the above might be in order. One is the difficulty of rendering appropriate white space by a browser. While it is true that all of the elements mentioned above imply a paragraph break, this only occasionally means that they also imply white space between sections -- this depends on the browser. So, while you might feel inclined to add a <P> in order to fix white space problems, please think twice and avoid it if you can.

Also, when using the glossary list (DL), please try to avoid using multiple DDs (definitions of terms) in order to provide multiple entries for a term (DT). Instead, use a <P> tag between paragraphs in a definition.
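A minimal sketch of a glossary entry with a two-paragraph definition might look like this (the term and its definition are placeholders):

```html
<DL>
<DT>Baloney sandwich
<!-- One DD per term; the second paragraph is introduced by a
     P tag rather than by a second DD. -->
<DD>A sandwich our hero discovers he really likes.
<P>It goes well with disco and a lovely beverage.
</DL>
```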

All clear now?

Character and entity reference errors

Simply put, a character reference and an entity reference are ways to represent information that might otherwise be interpreted as a markup tag. For example, consider the rendered HTML document in figure 17.

[Figure 17: Properly escaping character entities (Arena)]

The source which produces this document, which uses entities, looks like:

In order to represent the &quot;&lt;P&gt;&quot; in this text, I had to use &amp;lt;P&amp;gt; in my raw HTML.

In this example, the &lt; becomes "<", the &gt; becomes ">", the &quot; becomes a quotation mark, and the &amp; becomes "&" (which is needed in order to represent the text &lt; in the document without the text being turned into "<"). There are currently four entities for this purpose in HTML, as well as several entities which allow encoding of the ISO Latin-1 Character Set.

The most common error in the use of entities is to leave off the trailing semicolon. Also, no additional spaces are needed before or after the entity/character reference. Here are some examples of incorrect usage:

Doug &amp Chris went out for a walk.
A paragraph break can be represented with
&quote; &lt; P &gt; &quote;

Can you spot the errors in the above examples? They are:

In the first line, "&amp" needs to have a semicolon after it.
In the third line, "&quote;" should be "&quot;" (this is subtle and annoying, much like the Unix system call, creat()).
There should be no spaces in the third line, which should read: "&quot;&lt;P&gt;&quot;".
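With all three fixes applied, the raw HTML for the examples above would read:

```html
Doug &amp; Chris went out for a walk.
A paragraph break can be represented with
&quot;&lt;P&gt;&quot;
```

A browser should then display the second sentence as ending with "<P>", quotation marks included.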

URL errors

Another misunderstood aspect of Web document composition is the creation of URLs.

Directory reference errors

One grey area involves references to directories. It is possible to request an index of a directory from an HTTP server. The typical response from the server is to either return a pre-generated index document (which is often the document "index.html" in the referenced directory), or to construct an HTML document on the fly which contains a listing of all files in the directory. However, when making such a directory reference, it is important to make sure to have a trailing slash on the URL. That is, if you were to request the index of my home page, you would want to refer to it as http://www.cs.cmu.edu/~tilt/, not as http://www.cs.cmu.edu/~tilt.

Many servers are able to catch these errors, and provide redirection to the proper URL, but it's best to get the URL right in the first place -- notably because not all browsers support transparent redirection. Also, getting this correct the first time means it will take less time for the page to be loaded; your readers won't have to wait through the time needed to open two (or more) HTTP connections.
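In an anchor, the difference is a single character. A quick sketch, using the home page URL mentioned above:

```html
<!-- Avoid: no trailing slash, so many servers must answer with a
     redirect before the index can be fetched. -->
<A HREF="http://www.cs.cmu.edu/~tilt">my home page</A>

<!-- Prefer: the trailing slash names the directory directly. -->
<A HREF="http://www.cs.cmu.edu/~tilt/">my home page</A>
```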

Not using fully qualified domain names

Problems can arise when the hostnames in URLs aren't fully qualified. Within a local network, a machine can often be simply referred to by its host name. For example, the domain miskatonic.edu might have in it a WWW server with the host name www. Readers within that domain can refer to the machine by this name. However, the server's fully qualified domain name is www.miskatonic.edu. This fully qualified domain name provides enough information that any host, anywhere on the Internet, can find this particular machine.

What happens is that an HTML author might construct a link that looks like this:

<A HREF="http://www/~tilt/metanoia/">Metanoia -- A Change In Spirit</A>

which produces a link to "Metanoia -- A Change In Spirit" that will only work for people on the same local network as that machine. A correct link would look like this, instead:

<A HREF="http://www.cs.cmu.edu/~tilt/metanoia/">Metanoia -- A Change In Spirit</A>

which would allow all of the readers who are interested in Metanoia -- even those living in Freedonia -- to actually follow the link.

Along those same lines, be careful in using URLs of the scheme "file:". It's possible to have a reference to file://localhost/some/file/pathname. Such a URL refers to the named file on the local host of whoever is browsing the document, which is why a reference to <A HREF="file://localhost/etc/motd">the message of the day</A> will display the message of the day on your machine, not the message of the day on my machine. However, this makes several assumptions about your reader's local machine and network which you probably shouldn't be making. Unless you know what you are doing (and probably even then), references of this type will really mess up your Web.

Missing quotes in start tags

One common error, especially with the current lack of widely available and useful authoring tools, is to leave off the closing quote of an attribute value. For example, this reference to the euphonium, king of instruments, should look like:

<A HREF="http://www.cs.cmu.edu/~tilt/euphonium/">

but people composing "raw" HTML from a text editor will often instead type

<A HREF="http://www.cs.cmu.edu/~tilt/euphonium/>

It's likely that by the end of that huge URL, the author had forgotten it was supposed to be quoted. The behavior of browsers upon encountering this varies -- some display a proper link, but you can't follow it, while others actually eat up huge portions of the following text, treating everything up to the next quotation mark as part of the URL.

Missed end tags

Many of the HTML elements contain information within them. For example, <EM>emphasized text</EM> would be rendered as emphasized text. There is a start tag (<EM>), some content (which may include text, and in some cases, other nested elements), and an end tag (</EM>, indicated by the </). A common mistake is to miss the / in the end tag. Apart from empty elements (below) and the few elements, such as P and LI, whose end tags the specification makes optional, all elements must be terminated by an end tag -- otherwise, undefined behavior may occur.

Some HTML elements may be empty, such as <BR> and <HR> (the HTML 2.0 specification provides more information about element content). If this is the case, there is no need for an end tag.
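A short sketch of the difference, using the EM element (the sentences themselves are just filler):

```html
<!-- Wrong: the slash is missing, so a second EM is opened and
     the emphasis is never closed. -->
This point is <EM>important<EM> to remember.

<!-- Right: start tag, content, end tag. -->
This point is <EM>important</EM> to remember.

<!-- Empty elements have no content and take no end tag. -->
First line<BR>
Second line
<HR>
```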

Using white space around element tags

In general, the use of white space around element tags should be avoided. For example, if white space immediately follows a start tag, the style changes implied by that element may be applied to the initial space as well. For instance,

You really should
<A HREF="http://www.cs.cmu.edu/~tilt/"> CZeCh THIZ 0uT </A> !

would be rendered in Netscape as shown in figure 18, and in Lynx as shown in figure 19.

[Figure 18: Improper use of whitespace (and spelling and punctuation, too) (Netscape)]

[Figure 19: Improper use of whitespace (Lynx)]

On some browsers, there may be white space around the anchor, which adds unwanted unsightliness to the rendering, and may lessen the impact of the document. (This comment really applies to white space immediately following start tags, and immediately preceding end tags.)
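For comparison, here is a cleaned-up sketch of the same link, with the white space moved outside the anchor (and the spelling repaired while we're at it):

```html
You really should
<A HREF="http://www.cs.cmu.edu/~tilt/">check this out</A>!
```

The line break before the start tag still supplies the space between "should" and the link, but nothing inside the anchor begins or ends with white space.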

Stylesheets

The point has probably been well made by now that HTML is not a very good vehicle for providing specific information about layout and presentation. There are no mechanisms for an author to specify how she wants specific elements rendered, or to control aspects of page layout. While one of the strengths of HTML is this very independence from presentation details, it has become clear that some form of presentation control is needed.

Stylesheets are the answer to this problem. They provide the other half of the equation, the half that is currently not provided by HTML. While HTML provides information about content, stylesheets will provide information about how to render specific elements.

Unfortunately, while several mechanisms for providing stylesheets are under development, there is no clear standard at the time of this writing. We cannot tell you what stylesheet mechanism(s) will become standard, but we can tell you about the current contenders. Keep your hopes up, though: because of the importance of stylesheets, it is highly likely that a usable standard will emerge within the next year.

Some Stylesheet Proposals

In these proposals, the stylesheets contain information about how elements should be rendered, whether this is font information, justification information, etc. At the time of this writing, the syntax for these stylesheets has not yet been fully designed.

Arena/Cascading Style Sheets

The Arena browser is currently the only browser which supports a stylesheet mechanism, and that mechanism is currently only very limited and very experimental. The mechanism involves "cascading style sheets," which means that several different style sheets, each with a different order of importance, are combined in order of importance to create a presentation style. The reader can specify her own preferences for rendering, as can document authors, and these preferences are merged to produce the final document.

DSSSL/DSSSL Lite

DSSSL is the Document Style Semantics and Specification Language, which has emerged from the SGML community as a potential stylesheet mechanism. Because it is complex, work is being done to create "DSSSL Lite," a modified subset of DSSSL which can be easily implemented by client programmers, and easily used by HTML authors.

Alternatives to Stylesheets

While stylesheets are not currently usable, there are alternatives in existing specifications, which can be used with existing browsers. While the HTML 3.0 enhancements below are not yet widely propagated, it is likely that they will be soon; and the Netscape enhancements are already available (and are likely to be integrated into the evolving HTML 3.0 specification).

HTML 3.0

While HTML 3.0 does include the STYLE element for supporting whatever mechanism is eventually deployed for stylesheets, HTML 3.0 also provides some new elements for greater control over presentation. These elements include BANNER, BIG, SMALL, TABLE, MATH, and TAB.

The BANNER element provides a means of defining a banner of HTML that will always remain on the screen. This might be a copyright notice, a toolbar, or any other content which should always be available.

The BIG and SMALL elements allow for rendering text as bigger or smaller, as compared to the default text size.

The TABLE and MATH elements provide for a more sophisticated means of layout. The TABLE element allows the author to specify a spreadsheet-style arrangement, with cells that can contain text, images, and even input elements for FORMs. The MATH element allows for the description and rendering of complex mathematical formulae.

The TAB element allows the author to specify tab stops within the document.

In addition, some entities have been added, such as "&emsp;", to provide finer control over spacing.

For more information about these additional elements and entities, see the HTML 3.0 specification.
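As a small taste of these elements, here is a sketch using TABLE, BIG, and SMALL. The table contents are invented, and a browser that doesn't know these elements will simply run the text together:

```html
<!-- A two-column table; the data is placeholder text. -->
<TABLE BORDER>
<TR><TH>Lunch</TH><TH>Verdict</TH></TR>
<TR><TD>Baloney sandwich</TD><TD><BIG>Exciting!</BIG></TD></TR>
<TR><TD>Lovely beverage</TD><TD><SMALL>Merely fine</SMALL></TD></TR>
</TABLE>
```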

Netscape

(This section is really no longer very Netscape-specific, since many other browsers have implemented parts of Netscape's functionality in the so-called "browser wars". Since I can't keep up with them all, and since the point of this whole discussion is that there are still some browsers that don't do everything, I won't bother to enumerate them all here.)

The Netscape approach cannot be called a "style sheet," per se. Rather, as of the 1.1 release of Netscape Navigator, Netscape has provided several "enhanced" elements to help control presentation. These elements include FONT, BASEFONT, IMG, and BODY.

The FONT and BASEFONT elements allow changing the size of the font within a document. The IMG element, on the other hand, has been enhanced to provide text flow around images in documents.

The BODY element now allows control over the background. The author is allowed to provide a background color or image for their document. In addition, the author can specify different colors for hypertext links, in case the default colors do not have sufficient contrast to the new background color.
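A brief sketch of these extensions in use follows. The image file name and the particular sizes and colors are arbitrary choices for illustration:

```html
<!-- BGCOLOR sets a background color; BACKGROUND="file.gif"
     would tile an image instead. -->
<BODY BGCOLOR="#FFFFFF">

<!-- BASEFONT sets the default size (1 through 7, normally 3);
     FONT changes size for a span of text, here relative to it. -->
<BASEFONT SIZE="4">
<P>This is <FONT SIZE="+2">considerably</FONT> larger.

<!-- ALIGN="LEFT" floats the image so text flows around it.
     logo.gif is a hypothetical file. -->
<P><IMG SRC="logo.gif" ALT="[Logo]" ALIGN="LEFT">
This text wraps along the right-hand side of the image.

</BODY>
```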

If you would like more information, Netscape Communications has provided documentation of their HTML extensions online (both for the Netscape HTML 2.0 extensions and the Netscape HTML 3.0 extensions).

Note: Be careful when changing colors for hypertext links. Most browsers take the approach of using a bright color (such as bright blue), which has high contrast to the default page background, for links which have not yet been followed; and of using a dull color (such as dark blue), which has less contrast to the default page background, for links which have already been followed. Readers have become used to this high-contrast/low-contrast visual cue, and changing the link colors can confuse readers.

The best approach is, first, not to change the link colors unless you have to. With most background colors, the defaults should still be fine. If you do need to change the link colors, use a color that is bright, and high-contrast to the background color, for links to pages which have not yet been visited. Use a duller version of that same color for links that have already been followed.
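If you do change the colors, a sketch along these lines keeps the familiar cue intact. The specific color values are only one possible choice, assuming a white background:

```html
<!-- LINK:  unvisited links, bright blue, high contrast on white.
     VLINK: visited links, a duller, darker blue.
     TEXT:  ordinary body text. -->
<BODY BGCOLOR="#FFFFFF" TEXT="#000000"
      LINK="#0000FF" VLINK="#000080">
<P>An <A HREF="http://www.cs.cmu.edu/~tilt/cgh/">example link</A>.
</BODY>
```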

Netscape Frames

Given the proliferation of Netscape's frames, it seems appropriate to at least add in a paragraph or so commenting on proper usage. Frames allow you to break the browser's window into separate subwindows, with different documents in different windows. This provides even greater control for the author in terms of what the end document actually looks like (and, granted, can be used to very good effect), but, as with all things, must be used with care.

Some gotchas with frames include:

Navigational: This has more to do with Netscape's current implementation, but may be more fundamentally related to the issues involved in providing frame-style mechanisms. Currently, when a reader encounters a space structured with frames, any further navigation they do does not make it onto the history stack. This means that the next time they hit the "back" arrow, they pop right out of the entire space, possibly going back several link selections. This can be jarring, to say the least. What this boils down to is that you must be even more careful to prepare a good navigational structure for your corpus of documents. (In fairness, Netscape has recognized the frame problem, and the 3.x version of Navigator addresses it.)
Layout: Many sites have poorly laid-out frames; when a reader with a browser window of unexpected shape or size shows up, some of the frames are not completely readable. I don't understand enough about frames to know why this happens, yet, so all I can do is to warn you to watch out.

In general, the gotchas revolve around the fact that more control is removed from the reader in a medium where the reader expects to have a good deal of control. This doesn't mean don't use frames; it means that you must carefully analyze why you are using them, and make sure that their use is justified.

Another note: there is a NOFRAMES element which can be used to give alternate text for those browsers which do not support frames; use it.
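A bare-bones sketch of a framed document with alternate content follows. Note that the element is spelled NOFRAMES in Netscape's frame syntax; the file names and frame names are hypothetical:

```html
<HTML>
<HEAD><TITLE>A Framed Document</TITLE></HEAD>
<!-- Two columns: a narrow contents frame and a main frame. -->
<FRAMESET COLS="25%,*">
  <FRAME SRC="contents.html" NAME="contents">
  <FRAME SRC="main.html" NAME="main">
  <!-- Shown only by browsers that do not support frames. -->
  <NOFRAMES>
  <BODY>
  <P>This document is laid out with frames. You can read the
  same material starting from the
  <A HREF="contents.html">table of contents</A>.
  </BODY>
  </NOFRAMES>
</FRAMESET>
</HTML>
```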

More on this subject as I become more familiar with frames.

Web Style Considerations

A quick plug: Chapter 5 of Web Weaving discusses many of the issues you should take into account in planning and administering your Web (in fact, the entire book revolves around the subject in great detail). Here we will also address that subject, considering the architecture of your infostructure.

Organization

When organizing your infostructure, there are several important issues to consider. These issues include:

Presenting a clear ordering of information by subject (table of contents), or some other form of reasonable entry into the infostructure. Some useful forms are:

Table of Contents
Searchable Index
What's New (with the organic nature of online documents, a time-oriented ordering will help the infonaut quickly orient herself with what is new and/or changed in otherwise familiar territory)

The reader needs to be able to find what they are looking for, and a good overview that allows the reader to quickly find a particular topic or document is invaluable.

Only making a document as long as it needs to be. If a document can be logically decomposed into more than one file, do so, but only decompose a document if the narrative branches from the linear structure of the current document. An example of this is breaking a book-length work up into chapters, and further breaking those chapters up into sections. Because of the length of time involved in retrieving documents, making the document available in readable chunks means that the reader can use the information without being overwhelmed by loading times and by the correspondingly large amount of information presented in a single, huge, scrolling document.

Correspondingly, make sure a document is richly cross-referenced, so that if the reader wants to ask, "Why?", she can. If you can split up supplementary information into separate documents, do so. This allows the reader to follow a main flow of narrative, while still being able to look up evidence and additional related stories and information as necessary. But don't put in so many links that the reader gets lost trying to follow them all.

Providing a clear, consistent navigation structure. Readers should always be able to navigate easily to all documents which immediately relate, but they should also always be able to get to any other document in the infostructure with a minimum of fuss. Always provide access to the original table of contents, or its equivalent. This is especially important for when others create links to documents in your Web, but do not necessarily create links to your main entry points; readers can find themselves in the middle of what is obviously a larger document, but without any means of finding additional information. See Main Roads and Scenic Paths, below.

Design Goals

Importance of content

Anyone working with HTML for any length of time will soon realize that the markup language is composed of containers, which label content. It should be obvious, then, that your web should be primarily about this content, whatever it may be.

That's not to say that content only lies between HTML tags: content is also found in other media types, of course, and, depending upon the type of information you provide, sounds or images may be more important to both you and your readers than other types of media.

Web sites, however, should be driven by content, not by vanity or the need or desire to make a buck. Whatever your background, you have real "content" -- information, discussion, narrative, ideas -- to publish on the Web. People will visit your site to find this content. Provide it. Focus your site around it.

The largest threat to the Web is that as it becomes insanely popular, instead of becoming a world-wide information repository, as its founders and proponents have hoped, it becomes a large intertwined mass of self-referential sites unwittingly involved in meta-discussions on the nature of the Web: home pages which say little more than "This is my home page" (or "our home page", in the case of the corporate or organizational "presence"), with a collection of links which (virtually) point to the same collections of sites as the last page you visited did.

Main Roads and Scenic Paths: Issues of Navigability

As readers attempt to sail the seas of your infostructure, it is important that you provide useful ways for them to move around in your infostructure. Many readers complain about the proliferation of links in documents, providing so many choices that it becomes impossible to decide where to go next. The blessings of hypertext -- leaving control in the hands of the reader -- can also be a curse, as the original thrust of the narrative becomes awash in side tracks and dead ends.

A means of approaching this problem is to use the metaphor of "main roads" and "scenic paths." This means categorizing the kinds of links you include into two major groups: those which are recommended next destinations, and those which lead off into explanatory side-trails and divergences. As an example, a main path through a hypertext version of a book would be a linear progression from first chapter to the last. A side trail, on the other hand, would be a reference from (for example) Chapter 6's description of CGI functionality in various HTTP servers to Chapter 8's extended discussion of CGI scripting.

This is not to say that there is a single main path through a document -- there can be several (just as there are several ways to read a book, including as a linear narrative, and as a random-access reference). And side trails include references outside of the immediate document, such as bibliographic references. In addition, side trails can become main paths if the trail leads to another document instead of to a self-contained explanation.

The point, however, is that a document (in the extended sense of several HTML pages collected and interlinked) should contain one or more author-defined main paths through the text, in order to provide a guidepost for those exploring the information. These main paths should take the form of "next" and "previous" anchors, links back to the table of contents and index from any point within the document, and pointers to alternate main paths which are available (where appropriate).
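One way such a main path might be marked at the foot of each page is sketched below, for a hypothetical Chapter 6 of a hypertext book. All file names are placeholders:

```html
<HR>
<!-- Main road: linear progression through the book. -->
[ <A HREF="chapter5.html">Previous</A>
| <A HREF="chapter7.html">Next</A>
<!-- Always-available ways back to the overview. -->
| <A HREF="contents.html">Table of Contents</A>
| <A HREF="book-index.html">Index</A> ]
```

Scenic paths, by contrast, would usually appear as ordinary links within the running text.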

Although hypertext is based on notions of non-linear text, readers do make it linear as they read through it. And it doesn't hurt to provide at least one sensible linear pathway through the document for readers who aren't interested in wandering around in hyperspace.

Consistency

Consistency is what brings your site together so that it feels like a cohesive whole -- it can unite otherwise disparate topics or content areas, and it can be used to give your site a distinctive feel in comparison to other sites, or a sense of personality. Consistency also eases the maintenance of a site -- if you have a certain way of doing things site-wide, it becomes much easier to make significant site-wide changes without putting a great deal of time into it. You can achieve site-wide consistency a number of ways:

Headers and footers

A standard site-wide graphical banner or text-based header can be used to easily identify the site or sponsoring organization. Your header doesn't necessarily need to be static across the site; you can easily share dimensions and a primary graphic element across banners while making each one relate specifically to the content at hand.

Footers can be used in the same way; a standard method to sign documents and/or a standard text-based or graphical menu bar can easily pull a site together, not only as a design element, but also as an easy way to always navigate to the table of contents or index of a site.

Server-side includes, supported by most HTTP servers, can simplify some of this work, allowing you to create generic headers and footers which can be modified once and included in all of your documents.
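As a sketch, assuming a server that understands NCSA-style include directives and has been configured to parse this document for them, a shared footer might be pulled in like so. The /includes/footer.html path is hypothetical, and the exact setup varies by server:

```html
<BODY>
<H1>Mission Statement</H1>
<P>The content particular to this page goes here.

<!-- The server replaces the directive below with the contents
     of the named file before sending the page to the reader. -->
<!--#include virtual="/includes/footer.html" -->
</BODY>
```

Editing the one footer file then changes the footer on every page that includes it.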

Graphic elements

A unifying theme for graphic elements throughout the site easily pulls it together into a whole. A shared motif, such as bubbles, sign posts, or a corporate logo, works, as does a site-wide color scheme or page backgrounds. You can rely on sizing and positioning of graphic elements or textual elements, as well, to achieve a unified feel.

Personality and style

Beyond images and design elements, sites come together because of personality and style. A consistent feel or attitude for a site, conveyed across textual and graphic elements, can not only make each piece feel as if it's part of a larger whole, it can also attract readers who share the same attitude or outlook (or are fascinated by yours). The best sites on the Web aren't necessarily the most polished, but those that pull readers back again and again not only because of informational content but also because of the voice with which that content is presented.

For documents which should have a personality all their own, such as user home pages, you can still pull all these different personalities and outlooks together by presenting a common theme or launching point. All the users of a particular Internet service provider, for example, have something in common by the sheer fact of their being there -- and by the mere fact of providing a top page view to user-maintained areas, the service provider has begun to form a community around which a commonality can develop.

Persistent URLs

Although Uniform Resource Names, or URNs, are being developed in order to provide a naming system similar to the domain naming system for URLs, at this point it remains desirable to use URLs as if they refer to the same resource persistently through time.

As a content provider, you can help those who make links which point to your site by developing a file structure which will allow you to manage content as it grows and develops.

If your Web space is based on a hierarchical filing system, you can avoid major reorganization of that file system by

thinking not only about organizing your current content, but how you plan on developing and expanding that content in the future
creating a file space which is neither too shallow nor too deep for your content.

An example might be an organization which has just created a new division, Foobar. Currently, there's little information to publish about Foobar on the Web: Foobar has a mission statement and little else. Though it might logically follow to create a file, "foobar.html", to hold the mission statement, and to store it in the same directory as your main organization's web, it might be wiser to create a subdirectory named foobar which could then contain foobar.html and other files, as Foobar expands. This way, links don't have to be changed or redirected down the road when Foobar adds additional files and perhaps chooses to design and administer its own web space. If part of Foobar's mission statement is to spin off into its own organization, you might even create a directory on the same level as the parent organization's, to signify within the URL path the relative autonomy of the division and its future direction.

Another way to manage URLs is to only publicize a few well-known entry points to your Web: for example, the top view, or table of contents page, and perhaps an index page, or a FAQ page.

When URLs do change, it's important that you not only provide links from the old URLs to the new ones (or redirect the URLs to the new ones), but you also make an attempt to notify those that have links into your Web space, through general announcements or by contacting directly those who have well-known links to your documents (such as Yahoo or Lycos).
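The simplest form of a link from an old URL to a new one is a small stub document left at the old location. A sketch, in which the new address is a made-up example:

```html
<HTML>
<HEAD><TITLE>This Document Has Moved</TITLE></HEAD>
<BODY>
<H1>This Document Has Moved</H1>
<!-- http://www.example.org/foobar/ is a placeholder address. -->
<P>The Foobar mission statement now lives at
<A HREF="http://www.example.org/foobar/">http://www.example.org/foobar/</A>.
Please update your links.
</BODY>
</HTML>
```

Where your server supports it, a server-level redirect spares the reader the extra step, but the stub works everywhere.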

Seamlessness

Your web space should not only be consistent with itself internally, it should make references between the site and the outside world appear seamless.

A good case in point is the corporate site which has made its product information available via the Web, but, under the link for Ordering Information, only provides an 800 number in order to purchase the advertised commodity. Or the home page for a band which doesn't provide any audio clips of the band's songs, but just a thumbnail image of the cover art from their most recent album, available through some obscure indie label. Or the online newspaper which provides news coverage, but doesn't push the envelope and provide a real way to participate in the political process.

Seamlessness is about bridging the gap between the world you create within your web and the world outside it. Often, this means not carrying over from traditional broadcast media restrictions or limitations that fail to make sense in interactive media.

Macrocosms and Microcosms

The big picture: entire server structure

A site-wide strategy to organize information is never easy to invent, but it is vitally important to your site's success as a place where information is retrieved and used, versus simply being an area in which content is stored.

Finding a metaphor

Of course, there's no single recipe or structuring mechanism which you can apply to all types of content to give you a well-designed web site. That comes from thinking about the nature of your site and your content, and the logical divisions that your content can be organized around. However, finding an existing metaphor which you can work within, while also pushing its boundaries, can be an effective way to plan for the organization of a site.

There are many obvious metaphors upon which to base a web site: thinking of your content as being organized like a book, building, or branching tree.

The book metaphor: pages of content

Books lend themselves easily to the Web; in fact, many books have been "ported" to the Web, for better and for worse. Books have tables of contents and indices, for quickly locating information; parts, chapters, sections, and sub-sections, for organizing content; and footnotes, endnotes, and bibliographies, for displaying links to other content. Collections of books become "libraries", complete with card catalogs and help desks.

However, books also have pages which display content statically, while computer displays have a single, dynamic screen. A book metaphor quickly falls apart when applied to the Web on a page level: you could choose to consider a single HTML document a "page", causing you to break up content into arbitrarily small, hard to manage, and difficult to navigate pieces; or you could think of whatever text and graphics are currently displayed on a screen as a "page", which could easily drown the user in a sea of text without the benefit of traditional navigational tools such as page breaks and numbering of pages. The screen is not a page.

The building metaphor: content as artifice

Sites can also be managed as being housed in a building, a collection of buildings, or along some other spatial metaphor. The information you hope to store and manage is divided for the user along content areas, which are housed in different "buildings", which can then be further subdivided into "rooms". Obviously, this can be effective for some types of content, such as a large corporate site with many divisions, or a museum or gallery: basically, any information which can be mapped into a spatial plane consistently lends itself to this sort of view.

At the same time, a spatial metaphor in a largely text-driven medium, as the Web is today, is often hard to pull off convincingly. VRML (Virtual Reality Modeling Language) and other such developments will allow for the creation of virtual spaces; even then, the connecting points between rooms or buildings -- hallways and walkways -- need to be considered thoughtfully. It's also the case that, at many sites, the metaphor is dropped too quickly: you're asked to select a content area based upon a clickable map-based view, but then you're dropped into pages of descriptive text. Not only can this be disconcerting for a user, it points out the fact that oftentimes resources aren't allocated wisely across a Web site, with too much attention and time spent on the top page of a site in comparison to the remainder of the site.

The branching metaphor: regimented growth

A third way of thinking about a site as a whole is to use a branching metaphor, where all content springs from a common root and then branches out into many divisions and content areas. This is an obvious metaphor to use for web sites built atop file systems, since most file systems share this organization of directories (or folders) branching into subdirectories (or subfolders), and so on.

A branching metaphor shouldn't be pursued over the linear flow of information, however: too many branches can be confusing or frustrating for a user, especially if navigating those branches requires repeated jumps to a monolithic top structure.

In general, there are some key issues you should keep in mind when organizing a site on a macro level, including:

Providing a main entry point, or top view, which makes it easy for users to find the content which they're most interested in. At times, you'll know exactly what a user is looking for: if you run a site which provides audio clips of theme songs from popular cartoon series of the '70s, users probably expect to find a listing of available audio samples or a link to such a listing from your site's top page. Other times, you can't be expected to know: for a site covering a wide diversity of subjects, it may be necessary to provide a search mechanism or user-customizable top view in order for users to navigate your site comfortably.

Offering multiple paths to the same content. Not all readers seek the same information in the same way. A good glossary or index will cross-reference information: for example, you may be told to look under "automobiles" if you look up "cars". That same information could probably be found by looking through a table of contents. With hypertext links, you can refer to the same information in many ways. Do so, where it helps the user without overwhelming her.

Keep in mind, too, that a site, whether it be a file system or a database, need not be organized as the user sees it: the underlying structure doesn't have to be identical to the structure which the user navigates. However, a close relationship between the two can make it easier to maintain a site, as content is revised and expanded. A change in one part of your web space can have an impact on other parts of your site which share links or other references: the easier it is for you to see these relationships while maintaining these underlying documents, the more likely it becomes that your site as a whole is kept up-to-date and cohesive.
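
One way to see these relationships is to turn the links around: for each document, list the documents that point at it. What follows is only a minimal sketch in Python, not a description of any particular tool. The function name `backlinks`, the page names, and the idea that you already have a table of outgoing links for each document are all assumptions made for illustration.

```python
def backlinks(outgoing):
    """Invert a map of page -> pages it links to into page -> pages linking to it."""
    # Start with every known page so that unlinked pages show up with an empty list.
    incoming = {page: [] for page in outgoing}
    for page, links in outgoing.items():
        for target in links:
            incoming.setdefault(target, []).append(page)
    return {page: sorted(sources) for page, sources in incoming.items()}


# Hypothetical site: the page names are made up for this example.
outgoing = {
    "index.html": ["songs/list.html", "about.html"],
    "about.html": ["index.html"],
    "songs/list.html": ["index.html", "about.html"],
}

for page, sources in backlinks(outgoing).items():
    print(page, "<-", ", ".join(sources))
```

Before renaming or rewriting about.html, a maintainer could consult its entry to see which other documents would need a second look.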

The little picture: a document corpus

Many of the decisions you make on a site-wide level to organize content carry over to the management of "documents", whether they be single pages of HTML, or a collection of such pages which cover a single topic. These things include such obvious carry-overs as having an overview of the information presented within the document available to the reader at the "top" page, or expected entry point; making links available at appropriate points (usually, at the tops or the bottoms of pages) to bring the reader back to the overview for the document; and keeping your collection of documents uniform in terms of both content and form.

Much of the management of documents, though, is the management of links. Hypertext is all about links -- this should be patently obvious to most. But producing hypertext is all about managing links from the perspective of your potential reader. Too often, Web documents fail by failing to manage links effectively -- either by delivering screenfuls and screenfuls of ever-scrolling text, or by providing index-card-sized groupings of hypertext which link in a myriad of directions to other index-card-sized groupings of hypertext. Neither end of the spectrum allows the user to navigate the content presented easily: in one case, one becomes disoriented in a sea of text; in the other, in an ocean of links. Worse yet, documents can become so overseasoned with random and senseless connections to every possible place that the reader becomes lost in a sea of text and links!

The key to managing links in your documents (besides simply verifying that they are correct) is to organize them into classifications, and to employ links of various classifications in a reasonable and intelligent way. The next few sections describe some of the various classifications of links.
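
The "simply verifying that they are correct" part can be partly automated. The following is a rough sketch in Python, using the standard library's html.parser module, that checks only one narrow thing: whether each in-page link (an HREF beginning with "#") has a matching ID or NAME somewhere in the same document. The names `LinkCollector` and `dangling_fragments` are made up for this example, and treating fragment names as exact, case-sensitive strings is an assumption of the sketch, not a statement about how every browser behaves. It also counts any NAME attribute as an anchor, which is a simplification.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect in-page link targets (HREF="#x") and anchors (ID or NAME)."""

    def __init__(self):
        super().__init__()
        self.targets = []     # fragment names that links point at
        self.anchors = set()  # fragment names the document defines

    def handle_starttag(self, tag, attrs):
        # The parser hands us attribute names already lowercased.
        for name, value in attrs:
            if value is None:
                continue
            if name == "href" and value.startswith("#"):
                self.targets.append(value[1:])
            elif name in ("id", "name"):
                self.anchors.add(value)


def dangling_fragments(html_text):
    """Return in-page link targets that no ID or NAME attribute defines."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return [t for t in collector.targets if t not in collector.anchors]


sample = (
    '<P>See the <A HREF="#fn1">lemmings</A> and the <A HREF="#fn2">pies</A>.'
    '<FN ID="fn1">Lemmings: small rodents.</FN>'
)
print(dangling_fragments(sample))
```

Here the second link has nowhere to land, so it is the one reported. Links to other documents need a different kind of check (actually fetching them), which this sketch does not attempt.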

Footnotes

There are two traditional purposes for footnotes: for bibliographic references, and for further commentary and/or elaboration of points within the main text. Links to short explanatory text within a hypertext document can be useful to readers, if it's clear from context that the link is a digression.

Within your documents, the "footnote" style of link should be regarded as an explanatory link which elaborates on the current discussion without drawing the reader away from the main text. A footnote will draw the reader away temporarily, explain something, and then allow the reader to return to the main flow of text. While a footnote might offer further links to further explanations of greater depth, the footnote itself is usually nothing more than a brief explanation or glossary-style definition.

You can achieve this effect by context, by linking from a phrase (as in the lemming example below) to a short explanation or parenthetical remark that explains the text in question. If you are trying to achieve a more traditional effect, you can also use numbered note references, either by using a number surrounded by brackets ([1]), or by using the SUP element in HTML 3 to render the number as a superscript.

HTML 3 also defines the FN element for use in footnotes, which, "when practical, [should be] rendered as pop-up notes":

<P>Nothing is certain about the <A HREF="#fn1">lemmings</A>,
other than that they left as they came, with nothing but a silly grin and
some lemon pies.

<FN ID="fn1">Lemmings: Small rodents that like to leap off of
cliffs if necessary for retrieving a really nice lemon pie.</FN>

Whole documents

Where the footnote provides brief elaboration, the link to a "whole document" (whether it be to a single document, or to the entry point for a collection of documents) provides a whole new potential area of exploration. This is the most common sort of link, which provides a connection between your document and the outside world.

This sort of link should be used with care. It has the potential to draw your reader completely away from your document, by providing supplementary information that takes longer to read than the original document. It is better to use footnote-style links for explanation and elaboration, and from there to use links to outside documents to provide further reference information for the curious (and insatiable) reader. Another danger is that of peppering your document with random hypertext links that a reader feels she must follow, without actually providing further explanations or further reading that's germane to the context or the point of your own document.

On the other hand, if you are referring directly to another on-line document, this is the kind of link to use. By providing direct access to supplementary material for your readers, you can give them as much or as little detail as they are willing to plow through.

Indices

Another form of link is the index. Unlike the previous two classifications, which provide further information for the reader as she advances through the text, the index allows the reader to enter the text from whatever point she desires, so that she can get right to the meat of what she is interested in. An index allows the reader to cut through the author's pre-designed tour of the information, and get right to that vital information on the wildebeest's dietary habits.

There are several variations on this. The most popular is the full-text searchable index, which allows readers to query a database of keywords and retrieve those portions of your text which contain those keywords. Several software packages provide full-text searching capability, and the WN server has searching built-in.

Another variation is often found in books: an enumerated list of keywords. This differs from an index where the reader supplies the keywords in that the author can provide a selection of keywords that are particularly useful for finding information. This is important: picking proper keywords can be an arcane art, sometimes requiring intimate knowledge of the contents of the collection being searched. Especially if the collection is a large one, most keywords will return a large number of documents which may be only partially related to what the reader had in mind.

Yet another variation provides even more refinement and selection: the table of contents. A table of contents is a form of index, organized by broad topic. Consider providing not just one, but multiple tables of contents for your documents, especially if there is more than one reasonable way in which to read the information.
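
To make the enumerated-keyword variation concrete, here is a small Python sketch of how such an index might be assembled from keywords the author has picked by hand. Everything specific in it is an assumption for illustration: the function name `build_keyword_index`, the document paths, and the keywords themselves.

```python
from collections import defaultdict


def build_keyword_index(documents):
    """Map each author-chosen keyword to the sorted list of documents that carry it.

    `documents` maps a document path to the keywords its author picked for it.
    """
    index = defaultdict(set)
    for path, keywords in documents.items():
        for keyword in keywords:
            # Fold case so "Diet" and "diet" end up under one entry.
            index[keyword.lower()].add(path)
    return {keyword: sorted(paths) for keyword, paths in sorted(index.items())}


# Hypothetical collection: paths and keywords are made up for this example.
documents = {
    "wildebeest/diet.html": ["wildebeest", "diet", "grazing"],
    "wildebeest/migration.html": ["wildebeest", "migration"],
    "lemmings/pies.html": ["lemmings", "diet"],
}

for keyword, paths in build_keyword_index(documents).items():
    print(keyword, "->", ", ".join(paths))
```

Because the author chooses the keywords, each entry can stay short and relevant, which is exactly the advantage over a full-text search that returns every document mentioning a word.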

Portability Between Server Platforms

One of the advantages of HTML, which most Web documents consist of, is that HTML is based upon a number of other clearly defined, widely supported, non-proprietary formats, such as ISO Latin-1 and Internet Media Types (itself based on MIME). This approach makes it much more likely that, a decade from now, your documents will not be part of some legacy system which is, at best, difficult to maintain and expand.

If your documents do have that kind of lifespan, however, it's probable that they will reside on multiple hosts in that timeframe: perhaps concurrently, in the case of popular sites which are mirrored. A little attention to the requirements of different filesystems during the initial planning of your site could save a lot of time spent renaming files and links in the future.

About filesystems: some make the argument that Web servers should sit atop databases, instead of filesystems; databases certainly allow non-hierarchical relationships between pieces of content and make it easier to provide "dynamic" documents (documents which alter their appearance or content based upon the user accessing the data or other conditions) than traditional filesystem-based approaches. By the time this book sees print, there will certainly be several HTTP-serving database systems which address many of the issues raised here "automatically".

There are some very compelling reasons for using a database over a file system. A database-oriented system might be utilized to maintain linkages as documents move and change; to track documents as they grow old, alerting maintainers to update the documents periodically so that they do not suffer "bit-rot"; and to generate multiple representations of a collection of information dynamically (allowing your readers to order your document collections in ways that make sense to them). However, a database approach is not required to get some of this functionality; other tools exist that also do these sorts of things (Chapter 7 of Web Weaving covers these sorts of tools in more detail; examples include MOMspider and the HTML Validation Service).
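
As one illustration of getting a piece of this without a database, here is a minimal Python sketch that walks a directory tree and lists files that have not been modified for some number of days. It is not how MOMspider or any other tool mentioned here works; the function name `stale_documents`, the 180-day threshold, and the use of file modification time as a stand-in for "growing old" are all assumptions of the sketch. The demonstration builds a throwaway directory so that it can be run anywhere.

```python
import os
import tempfile
import time

SECONDS_PER_DAY = 24 * 60 * 60


def stale_documents(root, max_age_days, now=None):
    """Return paths under `root` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for directory, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(directory, filename)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)


# Demonstration on a temporary directory with one old file and one fresh one.
with tempfile.TemporaryDirectory() as root:
    old_path = os.path.join(root, "old.html")
    new_path = os.path.join(root, "new.html")
    for path in (old_path, new_path):
        with open(path, "w") as handle:
            handle.write("<P>placeholder")
    a_year_ago = time.time() - 365 * SECONDS_PER_DAY
    os.utime(old_path, (a_year_ago, a_year_ago))  # backdate the first file
    print([os.path.basename(p) for p in stale_documents(root, 180)])
```

A modification date says nothing about whether the content is still accurate, of course; a list like this is only a prompt for a maintainer to go and look.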

But this automation may not come cheap: there will always be a learning curve to mastering any system, proprietary or non-proprietary, and the skills learned from managing a proprietary system are not easily transferred to other systems. You, as an information provider, must rely on your database solutions vendor to understand your needs and continue to build the feature-set of the system to satisfy them as you develop and grow. You may be risking the future of your documents -- by marrying your content to a single-vendor methodology -- for some short-term gains in manageability and ease of publishing content.

Please keep these sorts of considerations in mind: a fear of ours is that the Web, as it moves forward almost exponentially, may lose any sense of history as links fail and documents drop out of view because the cost of maintenance and "keeping up" has grown too great. Pick simple solutions over complex ones.

Naming Space

Historically, most Web servers have been Unix-based, and have used the naming space associated with that operating system. Many servers have since been developed for other platforms, however, so it's prudent, as you create documents, not to tie yourself so closely to one platform's naming space that your documents become difficult to move to another.

Some filesystems have naming spaces which are case-sensitive. Unix is a good example of an OS which would consider "document.html" a different file from "Document.html", while other file systems, such as the Mac OS, make no such distinction -- both names would refer to the same file. For the sake of portability, it's probably best to keep all the file and directory names within your web structure lowercase. An added benefit is that this makes your URLs much more human-communicable: it's much easier to read an all-lowercase URL over the phone than one which contains both uppercase and lowercase characters, when case is significant.
Some filesystems require file extensions to properly type files. Servers running under the Mac OS could serve up files with proper Content-type headers based upon the creator and file type codes stored in the file's Finder information; other filesystems use extensions to do this typing. It's always wise to use the appropriate file extension for the content type -- such as .gif for GIF files -- whenever possible.
Some filesystems are restricted to a limited number of significant characters. DOS and Windows, of course, only allow eight characters, plus three characters for the file extension (although Windows 95 and NT eliminate this restriction). Generally, filenames under 32 characters should be fairly cross-platform, except on DOS and older versions of Windows. If you think your files may ever need to live on a DOS or Windows server, you may need to restrict yourself to 8 + 3 character filenames.
Almost all filesystems define special characters. Almost all operating systems allow certain special characters in filenames, while disallowing others; the Mac OS, for example, allows slashes in file names, while Unix doesn't. It's best to avoid all characters but for the letters a through z, the numbers 0 through 9, and the underscore, hyphen, and period.
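
The naming guidelines above can be gathered into a quick check. This is a sketch in Python, under stated assumptions, rather than a definitive portability test: the function name `portability_problems` and its `dos_compatible` switch are made up for this example, and the "known content type" test simply asks Python's standard mimetypes module, whose table is not the same as any particular Web server's.

```python
import mimetypes
import os
import re

# Characters recommended above: lowercase a-z, digits, underscore, hyphen, period.
SAFE_NAME = re.compile(r"[a-z0-9_.-]+")


def portability_problems(filename, dos_compatible=False):
    """Return a list of reasons `filename` might not survive a move between platforms."""
    problems = []
    lowered = filename.lower()
    if filename != lowered:
        problems.append("contains uppercase letters")
    if not SAFE_NAME.fullmatch(lowered):
        problems.append("contains characters outside a-z, 0-9, '_', '-' and '.'")
    if len(filename) >= 32:
        problems.append("is 32 characters or longer")
    stem, extension = os.path.splitext(lowered)  # extension keeps its leading "."
    if not extension:
        problems.append("has no file extension")
    elif mimetypes.guess_type(lowered)[0] is None:
        problems.append("has an extension with no known content type")
    # Optional stricter check for servers limited to 8 + 3 character names.
    if dos_compatible and (len(stem) > 8 or len(extension) > 4 or "." in stem):
        problems.append("does not fit the 8 + 3 pattern")
    return problems


for name in ("logo.gif", "Document.html", "lemon pie recipe.html"):
    print(name, "->", portability_problems(name) or "looks portable")
```

Note that with `dos_compatible=True` even index.html is flagged, since its extension runs to four characters; that is the price of 8 + 3 compatibility.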

Developing Content

Uniqueness

Uniqueness may not be seen as an important design goal at first glance: after all, uniqueness -- not duplicating efforts by creating or compiling same or similar content -- may appear to be more of a community issue than an organizational one.

Providing a unique resource, however, increases traffic to your site, and adds to the authoritativeness of your content (see below). It will also require support, and a popular, unique resource can have a spill-over effect on the other content you provide on your site, especially if your site has a consistent feel and character.

In addition, redoing what has already been done elsewhere can add to frustration on the part of readers. Providing yet another list of exciting online resources means that there is simply more of the same sort of content available, which readers must then evaluate and compare to other such resources. Providing a unique resource (or a resource in short supply) means that you are adding to the content of the network, instead of duplicating it.

How to check for uniqueness of content? There are many search mechanisms on the Web, such as Lycos. You can also check in relevant newsgroups and mailing lists. (Chapter 10 of Web Weaving covers these sorts of issues in more detail.)

You can also produce your content so that it leans towards providing unique, value-added content: instead of simply providing a list of poetry sites, say, you could provide a list of poetry resources which you find particularly compelling, with descriptions of why you think they are compelling. Adding value and content means that you are being a good network citizen, leaving the community with more than you found it with.

Authoritativeness

Authoritativeness has always been a fallacy, except when read as author-itativeness; whatever claims to authority you or your organization have ultimately boil down to status and reputation within the community. One becomes a reputable source not by being non-refutable, but by putting a stamp on what you write; by claiming authorship, and, thereby, author-ity.

This means that readers must take greater responsibility for critically analyzing the documents they come across. But it also means that you must be responsible in establishing credentials for what you claim, providing source material and raw data to justify your conclusions.

In some sense, this is the end result of all of the things we discuss here (and in Web Weaving). In building and maintaining your infostructure, what you are aiming for is authoritativeness: creating documents which are well thought out and well designed; which do not become stale or inaccurate; and which remain both internally and externally consistent. Your mission now is to use the tools we have provided you with to place the stamp of authority and relevance on your own works, and to truly create infostructures on the Web which are compelling and creative. Good luck!

For More Information

There already exist documents on the Web which address this same topic, and perhaps in more detail. For definitive reference information you may wish to check the HTML specifications from the World Wide Web Consortium (W3C). For a more detailed discussion of HTML composition style, you should also check the Style Guide (especially the section on device-independent formatting), which is also from the W3C.

If you're looking for a good document for learning the basics of HTML, you will want to check out the Beginner's Guide to HTML, from NCSA.

Also useful is the Bibliography from Web Weaving, from Addison-Wesley (as soon as this is placed on-line, I'll put a link to it here).

Finally, the somewhat creatively-minded among you can draw inspiration from this page's evil twin, Composing Evil HTML. Officially, I don't endorse any of these techniques. Unofficially ... well, let's just say someday I intend to buy Andrew several beers.

Acknowledgements

I'd like to thank all of you who have visited this document and commented on it, suggesting fixes, clarification, and even new sections. You know who you are (even if I managed to lose your addresses in the flood of information)! It is, in some senses, always a work in progress and is always amenable to suggestion, modification, and repair. I appreciate your help!

We (the authors of Web Weaving) would especially like to thank the folks at Addison-Wesley, for helping us turn all of this into much more than I, at least, ever thought it would be. There's something just so satisfying about actually holding a book, hypertext be damned.

Copyright © 1994, 1995, 1996, 1997 by Eric Tilton. Permission is granted for individual use and reproduction provided that this document remains intact, with this copyright message clearly visible. Commercial use and reproduction rights are held by Addison-Wesley, and this document may not be resold or redistributed for compensation of any kind without prior written permission from Addison-Wesley -- contact me for details. Parts of this document appear in a revised form in Web Weaving (ISBN 0-201-48959-7), a book by Eric Tilton, Carl Steadman, and Tyler Jones, published by Addison-Wesley in 1996.

The upshot is, this document has always been meant as a public service, and will remain a public service. I hope you've found it to be useful; I've had fun providing it for your use.

Last modified: Nov 17, 1997

James "Eric" Tilton, HTML Guru Wannabee and Occasional Author, tilt@cs.cmu.edu

(and with most of the Web style considerations contributed by Carl Steadman, Guy Who Doesn't Suck, carl@freedonia.com)