Chapter 2: Understanding HTML

Links and Anchors


An anchor is a special tag that does not determine the appearance of text; instead, it indicates a connection between the current document and some other entity (which might be another anchor in the same document, or an entirely different document on another server).
There are several types of anchor, but all of them use the <A> ... </A> tag (with optional parameters). Notably, there are anchor tags that define a target for a hypertext jump, and tags that define a hypertext reference.
In general, to create a link between two places in a document or web, you need to use two anchors: one to name a cross-reference target, and one to insert a hypertext reference into the text. The hypertext reference appears in the text as a highlighted hot spot (a word or phrase that is often coloured, and usually underlined). When you click on the hot spot, the browser uses the information in the anchor tag to locate and load the target entity.
To put it another way: when you insert a cross-reference target anchor into a document, you give it a name. Thereafter, you can use cross-reference pointers that refer to that name.

The simplest kind of anchor is used to navigate within a long document.
For example, suppose we have an academic paper structured like this:

<HTML>
<HEAD>
</HEAD>
<BODY>
<H1>Title</H1>
<H2>Introduction</H2>
<H2>A subsection</H2>
<H2>Another subsection</H2>
<H2>Yet another ...</H2>
<H2>Conclusions</H2>
<H2>References</H2>
</BODY>
</HTML>

It is reasonable in the concluding section to refer back to text in the first three sections. It is also useful in the first three sections to be able to refer forward to the references. Therefore we need to provide a target anchor with each reference, which can be jumped to from the body of the text. We also need to provide target anchors under each <H2> heading, for use in the Conclusions.
The format for a target is:
<A NAME="target_name">hot spot text</A>
target_name is a symbolic name that can be referenced in other anchors; the hot spot text is displayed by a web browser in some highlighted format. (It is usually omitted from NAME anchors, because clicking on it doesn't take you anywhere.)
The format for a pointer, within the same document, is:
<A HREF="#target_name">hot spot text</A>
The HREF attribute means "hypertext reference"; the hash sign (#) means that it is an internal reference, to an anchor placed within the current document. The text "hot spot text" is displayed in the document. Clicking on it takes you to the target identified by target_name.

We can take the paper described above and add internal target anchors to the HTML copy:

<HTML>
<HEAD>
</HEAD>
<BODY>
<H1>Title</H1> <A NAME="title"></A>
<H2>Introduction</H2> <A NAME="intro"></A>
<H2>A subsection</H2> <A NAME="sect1"></A>
<H2>Another subsection</H2> <A NAME="sect2"></A>
<H2>Yet another ...</H2> <A NAME="sectn"></A>
<H2>Conclusions</H2> <A NAME="conclusions"></A>
Well, you've read the article. As <A HREF="#sect1">Section 1</A> shows, there is a crying need for further research into this field. As <A HREF="#sect2">Section 2</A> shows, all previously-used methodologies are suspect. And as <A HREF="#sectn">Section N</A> shows, we weren't any more successful.
<P>
Draw your own conclusions.
<HR>
<H2>References</H2> <A NAME="refs"></A>
</BODY>
</HTML>

The "Conclusions" section contains three pointers, to the targets named sect1, sect2, and sectn. The text between the <A> and </A> tags is presented as a hypertext link that takes the reader to the appropriate section if they click on it.
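
Because every pointer relies on a target with a matching name, it can be worth checking a document for pointers that lead nowhere. The following is only a minimal sketch in Python, using the standard html.parser module; the class name AnchorCollector, the function dangling_pointers, and the sample text are all invented for this illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect NAME targets and internal (#...) HREF pointers from <A> tags."""

    def __init__(self):
        super().__init__()
        self.targets = set()
        self.pointers = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "name" and value:
                self.targets.add(value)
            elif name == "href" and value and value.startswith("#"):
                # Strip the leading "#" to get the target name.
                self.pointers.add(value[1:])


def dangling_pointers(html_text):
    """Return internal pointers that have no matching target, sorted."""
    collector = AnchorCollector()
    collector.feed(html_text)
    return sorted(collector.pointers - collector.targets)


# Hypothetical fragment: "sectn" is pointed to but never defined.
sample = """
<H2><A NAME="sect1">A subsection</A></H2>
<H2><A NAME="sect2">Another subsection</A></H2>
<H2><A NAME="conclusions">Conclusions</A></H2>
As <A HREF="#sect1">Section 1</A> shows ... and as
<A HREF="#sectn">Section N</A> shows ...
"""

print(dangling_pointers(sample))  # prints ['sectn']
```

The sketch only looks at pointers beginning with "#", so references to other files are ignored.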
The construct:
<H2>Another subsection</H2>
<A NAME="sect2"></A>
is a little clumsy. You can nest some tags; an equally valid construct is:
<H2><A NAME="sect2">Another subsection</A></H2>
This makes the association between the anchor and the heading a little more explicit.
You can also apply the usual text highlighting tags to text in an anchor, although it's probably not sensible to do so without good reason. Note, however, that you should not enclose header tags in an anchor, or mess around with the order in which you start and end tags: the following is illegal, and guaranteed to screw up some badly-behaved browsers:
<A NAME="sect2"><H2>Another subsection</A></H2>

Links between files

Now that we've seen the basics of anchors, we can consider how to create hotlinks between separate files. For the time being, we're going to deal with files in the same web - in fact, in the same directory (or folder, for Macintosh users).
Let's look again at the academic paper we've been writing. Structurally, it's going to look something like this:

<HEAD> <TITLE>My dissertation</TITLE> </HEAD>
<BODY>
<H1>Introduction</H1>
[pointers to sect1 .. sectn, conclusion, refs]
Sect1.html    <H2>Section 1</H2>
Sect2.html    <H2>Section 2</H2>
Sectn.html    <H2>Section N</H2>
Concl.html    <H2>Conclusions</H2>
refs.html     <H2>References and Bibliography</H2>
</BODY>

It has obviously grown so huge, cumbersome and complex that it needs to be broken down; instead of having separate headers in one big file, we need several separate files:

[Figure: The dissertation web]

This only shows the headers in each file; the other document structure information is omitted. Note also that this view assumes that the operating system can handle long filenames with mixed-case characters. This isn't available under MS-DOS (which is limited to uppercase-only, eight characters, a period, then three characters), but most modern operating systems are sufficiently flexible to allow filenames like the above. (Naming constraints and portability to DOS-based servers will be discussed later.)
NOTE: Pathnames on an HTTP server do not work in exactly the same way as pathnames on a normal UNIX system. See below for details.

We need to point to other files, rather than to markers in the same file. The same HREF notation is used, but instead of "#target_name" we use "filename" as a parameter. For example:
for a full discussion of this, see
<A HREF="section3.html">my conclusions</A>.
The leading hash symbol "#" in front of an anchor name indicates a local target, embedded in the current file; in the absence of the "#", the browser assumes that the target is a file name and looks for it in the same directory as the current document.
You can also play mix and match. For example, if you have target anchors for each subheading in a document, you can reference specific sections of that file by using an HREF of the form:
<A HREF="filename.html#target">go here</A>
The file designated by filename.html is loaded, and the named anchor "target" is jumped to.
Nor are you limited to your current directory. Suppose you keep the subsections in a subdirectory called sections, beneath the directory that holds dissertation.html.
You can place a reference in dissertation.html like this:
<A HREF="sections/section1.html">Section 1</A>
or
<A HREF="sections/section1.html#part1">Part 1 of Section 1</A>
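
To see how a browser might turn these references into a document to load and an anchor to jump to, here is a rough sketch in Python using the standard urllib.parse module. The base address http://www.example.com/thesis/dissertation.html is purely hypothetical, as is the helper name resolve; only the HREF values come from the examples above.

```python
from urllib.parse import urljoin, urldefrag

# Hypothetical location of the top-level document; any base URL would do.
BASE = "http://www.example.com/thesis/dissertation.html"


def resolve(href, base=BASE):
    """Return (document URL, anchor name) for an HREF found in `base`."""
    absolute = urljoin(base, href)           # apply the relative reference
    url, fragment = urldefrag(absolute)      # split off the "#target" part
    return url, fragment


for href in ("#sect2", "section3.html", "sections/section1.html#part1"):
    print(href, "->", resolve(href))
```

A bare "#sect2" resolves to the current document with the anchor sect2; a bare file name resolves to a sibling file with no anchor; the mixed form yields both.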

Understanding directories

Pathnames in HTML are specified in UNIX format.
A path consists of a list of directories that must be entered in order to reach a file. (Directories contain other directories, and/or files.)
A pathname is the name of a file, appended to the path necessary to reach it through the directory structure of the filesystem.
In a path, directories and files are separated by a forward-slash "/". The ".." operator means, "go up a directory level". For example, foo/bar/../quux is equivalent to foo/quux. A leading "/" in front of a pathname means that the path is to be constructed from the very top of the accessible directory hierarchy (that is, an "absolute" path); if there is no leading slash, the path is constructed from the current default directory (that is, it is a "relative" path).
On a web server, the root directory "/" in a URL is not the same as the root directory of the server's own filesystem. See Chapter 5 for details.
These semantics are identical to those used by MS-DOS, Windows, and OS/2, except for the substitution of the "/" character in place of "\". They differ from the Macintosh programming pathname semantics; the Macintosh directory separator is the colon ":". Two colons "::" are equivalent to "..", three colons ":::" are equivalent to "../..", and so on. Also, confusingly, a relative path on a Macintosh begins with a directory separator character, while an absolute pathname does not. So "Macintosh HD:MacHTTP:public_html:hello.html" is an absolute pathname, but ":public_html:examples:something.html" is a relative pathname. VAX/VMS is even stranger, and IBM mainframe operating system semantics are worse. Luckily there is little sign of the web being modified to conform to VM/CMS file naming conventions ...
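
The UNIX-style path rules described in this section can be tried out with Python's standard posixpath module. This is a small sketch rather than a description of what any particular browser or server does; the helper names normalise and from_dos are invented here, and from_dos deliberately ignores DOS drive letters.

```python
import posixpath


def normalise(path):
    """Collapse '..' and '.' components of a UNIX-style path."""
    return posixpath.normpath(path)


def from_dos(path):
    """Swap DOS-style backslashes for the forward slashes HTML expects.

    Assumption: the path has no drive letter such as C: in front of it.
    """
    return path.replace("\\", "/")


print(normalise("foo/bar/../quux"))              # prints foo/quux
print(posixpath.isabs("/gif/picture.gif"))       # prints True (absolute)
print(posixpath.isabs("../gif/picture.gif"))     # prints False (relative)
print(from_dos("sections\\section1.html"))       # prints sections/section1.html
```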

Pictures

It is possible to include references to graphics in HTML files. Images do not come as part of the HTML language, but you can link external graphics files into HTML using a specialized tag resembling an anchor.
Pictures should ideally be saved as Graphics Interchange Format files (.GIF); virtually all graphical browsers can handle this format, which incorporates a degree of data compression.

Graphics file formats

The web was originally designed to link text documents. It rapidly became apparent to the developers at CERN that scientific papers needed illustrations; so they initially created a mechanism for including XBM files (X Windows bitmaps).
However, the XBM file format is extremely large and wasteful; it doesn't incorporate any compression mechanism. Although it is easy to work with, it is grossly inefficient to transport over slow, low-bandwidth networks.
The next graphics file format to be supported was the CompuServe GIF (Graphics Interchange Format). This comes in two flavours, GIF87a and GIF89a; generally, browsers support both varieties. GIF was designed for encoding graphics that would be transmitted over slow links, and incorporates the fairly efficient, loss-free LZW compression algorithm. In addition, GIFs can be interlaced, so that the scan-lines are stored out of order in several interleaved passes; while this does not speed the process of decompression, it does trick the eye by beginning to fill in the whole area of an image rapidly, rather than working down it top to bottom.
GIF files are subject to two problems: they are only intended to store images containing up to 8 bits of colour information per pixel, and the LZW compression algorithm they use turns out to have been patented by Unisys under an obscure filing in the mid-1980s. In addition, more efficient compression algorithms are available if some loss of picture quality is acceptable.
JPEG files implement a more sophisticated, but computationally expensive, compression mechanism. They can contain greater numbers of bits per pixel, but tend to lose (or obscure) some information during the compression process - an expanded bitmap derived from a JPEG image is not identical to the bitmap that was compressed to produce that image.
To insert a graphics file at the current location in an HTML document, use the IMG tag:
<IMG SRC="../gif/picture.gif">
This causes the source file picture.gif to be inserted at that point in the text.
For example:

<P> This is a paragraph of text before one with an inline image.
<P> This paragraph contains <IMG SRC="../gif/test.gif"> an image.
<P> The previous paragraph contained an image.

Not all browsers can display graphics. For example, the Lynx browser (used on UNIX terminal systems) cannot display any images at all. Some browsers support a restricted range of image file types; Netscape Navigator, unlike other browsers, permits the use of inlined JPEG images (a more efficient compressed format than GIF), but these will not show up under NCSA Mosaic. And there is always a possibility that your images will not be delivered to the browser for some reason or other. Therefore, when including inline images in text, you should provide an alternative textual representation:
<IMG SRC="../gif/test.gif" ALT="floppy disk icon">
The ALT parameter specifies that if "test.gif" cannot be displayed, the browser should display the text "floppy disk icon" instead.
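
Since it is easy to forget the ALT text, a page can be checked for images that lack it. The following is a minimal sketch in Python using the standard html.parser module; the names AltChecker and images_without_alt, and the file disk.gif in the usage line, are made up for this example.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Record the SRC of every <IMG> tag that has no (or an empty) ALT."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive in lower case.
        if tag == "img":
            attributes = dict(attrs)
            if not attributes.get("alt"):
                self.missing.append(attributes.get("src"))


def images_without_alt(html_text):
    """Return the SRC values of images that offer no textual alternative."""
    checker = AltChecker()
    checker.feed(html_text)
    return checker.missing


page = ('<IMG SRC="../gif/test.gif"> '
        '<IMG SRC="../gif/disk.gif" ALT="floppy disk icon">')
print(images_without_alt(page))  # prints ['../gif/test.gif']
```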
When you insert a graphic into text, it breaks the flow of the page - the browser inserts it with its baseline aligned with the text to either side of it, and shifts the line it is on down to make space. However, you can change how it is aligned relative to the surrounding line of text using the ALIGN parameter:
<IMG SRC="../gif/test.gif" ALT="floppy disk icon" ALIGN="MIDDLE">
In this case, the ALIGN="MIDDLE" option ensures that the middle of the image is aligned with the baseline of the text line it is inserted in.
(You can also specify "ALIGN=TOP" and "ALIGN=BOTTOM", which behave as you might expect.)
Note, however, that ALIGN directives in HTML 2.0 do not cause text to flow round the graphic; only the first line after the graphic is affected, so an ALIGN=TOP directive will look rather odd next to a large graphic. The ALIGN directives are extended considerably by Netscape and HTML 3.0; this is discussed later.
Finally, you can put an inline image wherever you would expect to put body text in an HTML document. You can't put them in the <HEAD> section, but it is perfectly valid to do something like:

<A HREF="sections/sect1.html#part1"> <IMG SRC="../gif/button.gif" ALT="Press me"> </A>

This inserts "button.gif" into the flow of the text, as a clickable link to the anchor "part1" in the file "sect1.html". If the button cannot be displayed, the alternative text "Press me" is rendered in its place.

Some comments about graphics

Without doubt, graphics can enhance the appearance of any page. However, there are some major issues to be aware of.
Firstly, the old adage that "a picture says a thousand words" is no longer true in HTML. A thousand words occupy about 6Kb of disk space. In contrast, quite a small graphic will overflow that space - a full-page GIF in 256 colours (using 8 bits to represent the colours of each pixel) will typically occupy 250-300Kb. Thus, graphics should be minimized - they're expensive in terms of bandwidth. If you place a 750Kb graphic in a page that someone accesses over a dial-up modem connection, it will take them at least five minutes to download the page. If they bother.
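
The arithmetic behind these figures is easy to reproduce. The sketch below, in Python, assumes modem speeds of 14,400 and 28,800 bits per second and ignores protocol overhead and modem compression; those speeds, and the function name download_seconds, are assumptions made for illustration and do not come from the text.

```python
def download_seconds(size_kb, bits_per_second):
    """Estimate the time to move `size_kb` kilobytes over a serial link.

    Assumes 1Kb = 1024 bytes of 8 bits each, with no protocol overhead
    and no compression by the modem, so real transfers will differ.
    """
    return size_kb * 1024 * 8 / bits_per_second


# Assumed line speeds in bits per second (not taken from the text).
for speed in (14400, 28800):
    text = download_seconds(6, speed)       # about a thousand words
    picture = download_seconds(750, speed)  # the 750Kb graphic
    print(f"{speed} bps: text {text:.1f}s, picture {picture / 60:.1f} minutes")
```

On these assumptions the thousand words arrive in a few seconds at either speed, while the graphic takes roughly seven minutes at the lower speed and three and a half at the higher one.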
Secondly, not everyone will be able to see them. People using line-mode browsers won't get the full effect. People working over a dial-up modem line frequently switch image auto-loading off in their browsers, to speed access. And an 8-bit graphic can look poor when viewed on a high-performance graphics workstation capable of displaying 24-bit colour images.
Thirdly, it's dangerous to go image-happy. Images are secondary files, linked into your core HTML document. A document that depends heavily on images for effect is going to look poor to a user who cannot download or view images, or who has switched image loading off. And in some extreme cases, images can severely damage the usability of a web page.
In general, inline images are best used to emphasize the appearance of text (for example, as coloured icons to replace the bullets in a list), and as "thumbnails" for larger files. For example, suppose you wish to put a 300Kb self-portrait in your personal "who-am-I" web page. The sensible way to do this is to embed a 20Kb icon of the self-portrait, within an anchor pointing to the real thing - and to add a warning to the effect that "this is a big picture!".
There is a tendency among some designers to use inline graphics to get around the typographical limitations of the HTML medium. By building complex images of control panels and widgets in the form of graphics with clickable maps, an HTML document can be made to look quite similar to a glossy printed magazine page, or to a multimedia presentation system. However, this is in opposition to the original design purpose of the web, which is to provide a simple, universal means of transferring textual information. By choosing to make their pages graphically complex, these designers ensure that their documents cannot be downloaded rapidly, are not readable by all browsers, and are not searchable.
In extreme cases, I've seen design-trained authors using one-pixel-wide transparent GIF images to control leading (the space between lines) in their text on screen. While that kind of control is normal in typesetting systems, HTML makes no provision for it. This technique works, after a fashion, but has some problematic consequences: a user who has turned image loading off is going to see a horrible mess on screen, and a user who is trying to search the document for a string of words (using a search tool) is probably not going to find what they're looking for.
These problems are discussed in more detail in Chapters 4 and 7.
Big pictures are probably best presented as JPEG files. JPEG is a lossy compression technique that potentially provides much better compression than GIF; it is most suitable for extremely big images, or those with a large number of colours, such as photographs of natural objects. However, few web browsers (other than Netscape) can render inlined JPEG images. Thus, it is common practice to use a small thumbnail GIF as the hot-spot for a hyperlink to the JPEG image; the browser loads the image file and passes it to an external "helper application" which then displays it.
As a first principle, never include any graphics in the top page of a web. The top page - the usual point of entry for those users who discover a reference to your web somewhere else - performs the same function as the flyleaf of a book. It's not there to look pretty or convey huge numbers of facts; it's simply there to inform the reader of the nature of the document they are looking at, and to orient them. If you need to include large graphics in pages below that one, all well and good - but make sure you warn your readers beforehand. As a general rule of thumb, a page that takes more than thirty seconds to load over a slow line (without warning) may make your audience reluctant to continue; and if they are using an ancient V.22bis modem (2400 bits per second), as little as 7Kb of text might clog up their system for that long.
To get the best out of the compression built into the GIF format, use drawings in preference to scanned photographs, and minimize the number of colours in the images. Big blocks of a uniform colour occupy far less space in an image file than graduated washes of colour. The LZW data compression algorithm used in GIF compressors works by locating sequences of bits in an image raster, and replacing them with pointers into a table containing a list of identified (repeating) sequences. As more patterns are added to the table, the size of the pointers required to index them grows. Thus, images consisting mainly of runs of uniform colours are represented by relatively short pointers into the index table. Gradual colour changes act to defeat this compression scheme; the number of bit-patterns to be encoded in the compressed file rises, and the file grows larger. The upshot of this is that it is possible to create small, fast GIF images for web pages, if you:

minimize the number of colours in the images, ideally by sticking to fewer than four, or fewer than sixteen, primary colours
use drawings prepared on-line, rather than scanned images or photographs
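
The effect of uniform colour on a table-based compressor can be demonstrated with a toy encoder. The sketch below, in Python, is a bare-bones LZW encoder working on bytes; it is not the encoder used by any real GIF software, which additionally packs codes into variable-width bit fields and resets its table from time to time. The names lzw_encode, flat, and wash are invented for this illustration.

```python
def lzw_encode(data):
    """Return the list of LZW table codes for a byte string (toy version)."""
    # Start with one table entry for every possible single byte.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            # Keep extending the run while it is already in the table.
            current = candidate
        else:
            # Emit the longest known sequence, then remember the new one.
            codes.append(table[current])
            table[candidate] = next_code
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


flat = bytes([7]) * 256   # one scan-line of a single uniform colour
wash = bytes(range(256))  # one scan-line of a gradual colour wash

print(len(lzw_encode(flat)), "codes for the uniform line")
print(len(lzw_encode(wash)), "codes for the graduated line")
```

The uniform line collapses into a couple of dozen codes, while the graduated line needs one code for every pixel, which is the behaviour the advice above is designed to exploit.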

Of course, if you know your audience will only ever see your web on a local system, or over a high-speed network, you don't need to pay any attention to this advice. But if you have any doubt at all - or if your pages are destined for public access - it is best not to ignore those slow, old modems. They're out there, there are millions of them, and their owners are unlikely to invest in an expensive upgrade in order to make life easier for you.
Finally, note well: there is no direct equivalent of the <IMG SRC> tag for text. You cannot insert a tag that says: <TXT SRC="somefile.html"> and have the contents of somefile.html magically appear in your document. This is a shame, as there are many occasions (as we will see later) when including a standardized boilerplate file would be useful. However, there are three mechanisms that give us the ability to do something similar, and we will come to these in Chapter 6.
