Chapter 2: Understanding HTML

Typography and HTML


Old-fashioned word processors inherited a model from the golf-ball electric typewriter. When you wanted to emphasize some text, you'd take out the print ball and drop in a new one with italic characters on it. When you finished with the emphasis, you'd put the old (roman) font ball back in the typewriter and carry on working. By adding underlined, bold, and a few other attributes, the word processors could achieve a variety of effects in a way that was easy to understand.
However, typeset text isn't like that. The font may come in a variety of sizes, and a number of weights, with a wide range of slant attributes. And you can change its colour. Although such fine control over appearance only appeared on high-end or specialized computers at first, it spread rapidly during the 1980s. Today, just about any PC or Macintosh with the right software can surpass the typesetting capability of an expensive Linotronic from the beginning of the 1980s.
Given the versatility of computer typesetting and rendering tools today, it should come as no surprise to learn that the focus of attention shifted towards the semantic representation of information. If I write a line of text in italics, what does it mean? You can, of course, refer back to the "typographical conventions" part of this book. But how do you know what the italics signify if you didn't write the document, you don't know who did, and it comes with no such blurb describing its author's intentions?
We rely on the appearance of text, as much as on its content, to convey information. The appearance of typeset text is usually based on its meaning, determined in accordance with a style sheet. By noticing the visual appearance of a passage of text, we can infer something about the author's conception of its significance.
Two schools of thought appeared during the early development of HTML:
The "what I say is what you get" school, who stuck to the conventional idea of highlighting text as boldface or italic, and emphasized the author's direct typographical control over the visual presentation of their data.
The "I tell you what it is and you format it accordingly" school, who wanted to see a formal style sheet incorporated into HTML. A style sheet maps tagged information in a document to its visual appearance. Such tags can be relatively abstract: rather than designating text as "bold" or "italic", we can tag text as "HTML listing" or "internet address", and rely on a style-sheet driven formatter to work out that one of those pieces of text should be rendered in boldface, and the other in italic.
Most people agreed that the primitive, typewriter-like control afforded by tags like <I> (for italic) and <B> (for boldface) was inadequate. But as soon as the developers tried codifying a set of built-in style tags for HTML, they ran into huge problems - simply put, there is no such thing as a universal style sheet that can cover all contingencies. A novel requires different text styles from an automobile maintenance manual, which in turn does not resemble a poem, a letter, or a paper submitted to a refereed academic journal. If tags to cover all contingencies were incorporated into HTML, HTML would cease to be simple; however, if tags to cover all contingencies were not incorporated into HTML, it would fail in its goal of providing a useful way of delivering a wide variety of documents over the net.
Consequently we're stuck with a hybrid compromise; a tag set that contains both primitive formatting commands and a few (common) style elements. The style elements are interpreted arbitrarily by the browsers; the primitive formatting commands are a bit more coercive insofar as they specify a specific attribute of a typeface, but they're still open to interpretation.
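As a rough illustration of the two approaches, the same sentence can be marked up with primitive formatting tags or with style tags. The sentence and the file name in it are invented for this sketch:

```html
<!-- Primitive formatting: the author dictates the typeface attributes -->
<P>Type <TT>cat notes.txt</TT> and <B>do not</B> press return yet.

<!-- Style tags: the author says what the text is; the browser picks the look -->
<P>Type <KBD>cat notes.txt</KBD> and <STRONG>do not</STRONG> press return yet.
```

Many browsers will probably display the two paragraphs in much the same way; the difference lies in what the markup says about the text.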
Later on, we'll see that there are several solutions to this problem. The one that most clearly represents the typographic-control school is a commercial system, Adobe's Acrobat, while the markup school proffers the powerful but not yet commonly available (or understood) DSSSL, the Document Style Semantics and Specification Language.

Basic formatting tags

Here are the basic formatting tags:

<B>Boldface</B>
text is enclosed in the <B> ... </B> tags.
<I>Italic</I>
text is enclosed in the <I> ... </I> tags. According to the HTML 2.0 specification, if a browser doesn't understand italics it should try to render this as a slanted font.
<TT>Typewriter</TT>
text is enclosed in the <TT> ... </TT> tags; it should be rendered in fixed-width font.
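A minimal sketch of the three basic formatting tags used together in one paragraph (the wording is made up for the example):

```html
<P>The <B>warning</B> light is labelled <I>caution</I> on older models;
the display reads <TT>ERR 42</TT> when it is lit.

<!-- Tags can be nested, provided they are closed in reverse order -->
<P>This phrase is <B><I>both bold and italic</I></B>.
```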

And here are the style tags:

<ADDRESS>A mail address</ADDRESS>
is rendered in italics, and is separated by a paragraph break from the text above and below it. This is typically used for indicating where the author of a file can be found.
<CITE>The HTML 2.0 specification</CITE>
defined a Citation style, for referring to the titles of other documents. It is typically rendered in italic.
<CODE>Example source code</CODE>
is reserved for bits of software source code used in examples. It is usually rendered in a monospace font.
<EM>Emphasis</EM>
provides typographic emphasis; unlike the <B> or <I> tags, it leaves the precise form of emphasis up to the browser. It is typically rendered in italics, but might come out in flashing pink neon if the browser feels like it.
<KBD>Keyboard</KBD>
tags are used to identify text typed by a user, and are usually rendered in a monospaced font. (For example, you might use this tag to show the commands typed during a login session via FTP or telnet.)
<SAMP>Samples</SAMP>
of literal characters are usually rendered in a monospaced font. This is another example style, like <KBD> and <CODE>.
<STRONG>Strong</STRONG>
typographic emphasis is provided as an alternative to <EM> emphasis; it is typically rendered in bold, but again the final decision on presentation belongs to the browser.
<VAR>Variable</VAR>
names are typically rendered in italics.
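A short sketch of several style tags working together; the text, the name, and the mail address are placeholders invented for the example:

```html
<P>Press the reset button <EM>once</EM>. <STRONG>Do not hold it down.</STRONG>
The console should print <SAMP>READY</SAMP> within a few seconds.

<ADDRESS>A. N. Author, author@example.com</ADDRESS>
```

How each element actually looks depends on the browser, as described above.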

Finally, sometimes it is necessary to present raw text, preformatted with spaces, knowing exactly where the lines will break. The <PRE> ... </PRE> tags effectively switch off word-wrapping and pagination within the designated body of text. If you separate two words by five space characters, five space characters will appear between the words when you view them in a preformatted block; normally, a web browser will eat all but one of the spaces.
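For instance, a small table laid out with spaces keeps its columns only inside a preformatted block. The file names and figures below are invented for this sketch:

```html
<PRE>
Name        Size   Modified
index.html  2104   12 Mar
logo.gif    8812   03 Feb
</PRE>
```

Outside the <PRE> tags, the same text would be run together into a single wrapped paragraph.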
Now we can have a look at an example of a formatted file. This one is a manual page for a short program (part of the UNIX system):

<HTML>
<HEAD>
<TITLE>cat - catenate and print</TITLE>
</HEAD>
<BODY>
<H1>NAME</H1>
cat - catenate and print
<H2>SYNOPSIS</H2>
<CODE>cat [ -benstuv ]</CODE> <VAR>file ...</VAR>
<H2>DESCRIPTION</H2>
Cat reads each file in sequence and displays it on the standard output. Thus
<P>
<KBD>cat file</KBD>
<P>
displays the file on the standard output, and
<P>
<KBD>cat file1 file2 &gt;file3</KBD>
<P>
concatenates the first two files and places the result on the third.
<P>
If no input file is given, or if the argument `-' is encountered, cat reads from the standard input file. Output is buffered in the block size recommended by <CITE>stat(2)</CITE> unless the standard output is a terminal, when it is line buffered. The -u option makes the output completely unbuffered.
<H2>SEE ALSO</H2>
<UL>
<LI>cp(1)
<LI>ex(1)
<LI>more(1)
<LI>pr(1)
<LI>tail(1)
</UL>
<H2>BUGS</H2>
Beware of <CODE>cat a b &gt; a</CODE> and <CODE>cat a b &gt;b</CODE>, which destroy the input files before reading them.
</BODY>
</HTML>

A few points need making about this example.
The <CODE>..</CODE> tags are used here to denote commands executed by the UNIX shell. However, the <KBD>..</KBD> tags are also used to denote commands typed by the user. Any of the commands tagged as <CODE> could equally well be typed by a user. Furthermore, the tag <VAR> is used to denote filenames (which are variable parameters to the program). But are these really variables? Or would they be better represented using some other tag? This highlights one of the problems of style tags; how do you determine the correct context in which to use a given tag?
The entity &gt; is an alternative to the greater-than sign (>). A naked greater-than encountered in an HTML document may be mistaken for part of a tag. So when entering literal greater-than or less-than (<, written &lt;) symbols, it is essential to use the appropriate character entity.
The <UL>..</UL> section is an unnumbered list; this is explained in the next section.
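To illustrate the point about character entities, here is a sketch of a line of source code written with entities. The code fragment is invented for the example, and &amp;amp; (the entity for an ampersand) is used here although it is not discussed above:

```html
<P>In C, the test <CODE>if (a &lt; b &amp;&amp; b &gt; c)</CODE> compares three values.
```

A browser should display the test as if (a < b && b > c).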
It should be obvious that once you begin using complex markup or character entities in text, the text becomes hard to read with the naked eye. So here's what it looks like in Netscape:
[[Diagram: CH2-Example2.pict]]

Lists and structured documents

The basic tags introduced above can be used to give us basic structured documents. But many documents don't simply contain paragraphs of text that flow from one to another. It's pretty important to be able to list items, using bulleted or numbered lists.
HTML provides a number of lists:

The unordered list

An unordered list is simply a list of items, preceded by bullets. It looks like this:

<UL>
<LI>First item
<LI>Second item
<LI>Third item
</UL>

The <LI> tag introduces a new list item. Each new item starts on a new line, and is usually preceded by a bullet (generated by the browser), as in the example above.

The menu list

A menu list is like an unordered list, but displayed more compactly. Use <MENU> ... </MENU> instead of <UL> ... </UL> when you want to reduce the amount of white space above and below the list. For example:

<MENU>
<LI>Apples
<LI>Bananas
<LI>Oranges
</MENU>

(Note that the menu list isn't commonly used. Indeed, some browsers treat menu lists identically to unnumbered lists.)

The ordered list

An ordered list is like an unordered list, except that instead of a preceding bullet each item is numbered. For example:

<TITLE>An idiot's guide to cooking beans on toast</TITLE>
<H1>An idiot's guide to cooking beans on toast</H1>
<OL>
<LI>Get at the beans
  <OL>
  <LI>take the tin can out of the cupboard
  <LI>close the jaws of the can-opener around one edge of the lid
  <LI>rotate the handle until the lid comes off
  </OL>
<LI>Make some toast
  <OL>
  <LI>take the bread out of the cupboard
  <LI>remove a slice of bread from the loaf and put it in the toaster
  <LI>ensure toaster is plugged in
  <LI>remove bread from toaster before it burns, not using fingers, thumbs, or metal implements
  <LI>put toast on a plate
  </OL>
<LI>Pour beans onto toast
<LI>Put plate in microwave oven
<LI>Turn oven on and cook for sixty seconds
</OL>

Note that you don't have to specify the numbering of items in the list; the web browser does it for you. You can also have nested sub-lists; these are each numbered starting from one. This example comes out like so:
[[ picture: CH2-Example3.pict]]
According to the HTML specification, you can add the optional COMPACT attribute to an ordered list (that is, <OL COMPACT>); this causes some browsers to pack it into a smaller space. (This doesn't seem to make much difference in Netscape, however.)
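A minimal sketch of a compact ordered list; the items are invented for the example:

```html
<OL COMPACT>
<LI>Boil the kettle
<LI>Warm the pot
<LI>Add tea and water
</OL>
```

A browser that ignores COMPACT should simply display an ordinary numbered list.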

The definition list

The lists described above are designed to contain entries denoted by a small tag in the left margin - a bullet or a number. A definition list, in contrast, is designed to present more complex tags; for example, a glossary of terms. In such a list, the tag is printed flush with the left margin, while the associated definition is indented below it.
For example:

<DL>
<DT>HTML<DD>HyperText Markup Language
<DT>HTTP<DD>HyperText Transfer Protocol
<DT>MIME<DD>Multipurpose Internet Mail Extensions
</DL>

A browser renders this with each term flush against the left margin and its definition indented on the line below.
The <DT> tag introduces a definition term. The <DD> tag introduces the associated definition for the preceding term. You can have two or more definition terms in a row, for example for synonyms:

<DL>
<DT>Red
<DT>Blue
<DT>Green
<DD>These are all primary colours
</DL>

A definition list usually leaves a fair amount of white space around items.
If you want to close everything up, use the COMPACT attribute:
<DL COMPACT>
This reduces the space for definition terms and closes up white space generally. For example:
[[ graphic: CH2-Example4.pict]]
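As a sketch, a complete compact definition list might look like this. The terms and definitions are chosen only for illustration:

```html
<DL COMPACT>
<DT>ls<DD>list the contents of a directory
<DT>cp<DD>copy a file
<DT>rm<DD>remove a file
</DL>
```

Short terms like these suit a compact list, since a browser that honours COMPACT has room to close up the space around them.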

The directory list

A directory list is an unordered, highly compact list, typically used for listing large numbers of items up to 20 characters long (such as filenames). It is enclosed in the <DIR> ... </DIR> tags, and items can be arranged in columns (typically 24 characters wide). Some web browsers can balance column widths, so that the columns are left-justified and packed to an appropriate left margin. For example:

<DIR>
<LI>file1.html
<LI>file2.html
<LI>file3.html
<LI>file4.html
</DIR>

Note, however, that some browsers render this as a standard unnumbered list (for example, Netscape). This mechanism for arranging data in columnar order is also arguably obsolescent, given the availability of tables (described below).
Using these formatting tags it is possible to write moderately complex structured documents that have a moderately rich visual appearance. But if you've used the web for any length of time, by now you'll be scratching your head; where are the embedded graphics and the hyperlinks to other documents or other places within the same document?
We'll deal with references next.
