Chapter 2: Understanding HTML

HTML Document Structure



A word of warning. HTML is still evolving, and new features are being added to the language. Wherever possible, this section takes a pragmatic approach, but sometimes it is necessary to describe two alternative syntaxes for different levels of HTML ...

The general structure of an HTML 1 document is as follows:

<TITLE>The title of the document</TITLE>

<!-- some text goes here -->

The only mandatory tag in a basic HTML document is the TITLE tag. It says what the document is; most web browsers display it as a label on the window containing the document.

The more recent HTML 3 document specification has a more complex structure:

<HTML>

<HEAD>

<TITLE>The title of the document</TITLE>

<!-- Any other "head" information goes here-->

</HEAD>

<BODY>

<!-- Body of document goes here-->

</BODY>

</HTML>

The document is enclosed in <HTML> .... </HTML> tags, which tell the browser what it's reading; that is, that it's an HTML document and not something else.

Internally, the document consists of two sections: the head and the body. The head section contains various tags that provide meta-information about the file, such as its title and its relationship to other documents on the web. (These, and other entities that refer to inter-document links, are described later.)

Document HEAD sections describe the document; the BODY section is the document. Under some circumstances, you can retrieve and examine the HEAD of a file without looking at the body.

Note the comment text. The tag <!-- --> is a comment (containing a single space); text inside it, <!-- like this -->, will not be displayed. It's often useful to scatter explanatory comments through your documents: notes about the document that you don't want your users to see. (Be aware, however, that the comments are downloaded across the net when someone reads your document, so don't place confidential information in them.)
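As a brief illustration, the fragment below sketches a comment sitting between two lines of body text. The title, the note, and the sentences are invented for this example; a browser should display only the two sentences.

```html
<TITLE>Comment example</TITLE>

This sentence is displayed.

<!-- Reviewer's note (hypothetical): check this paragraph before publishing -->

This sentence is displayed too.
```

Anyone who views the source of the document can still read the note, which is why comments are the wrong place for anything confidential.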

The <HEAD> and <BODY> sections are, technically, optional. Your HTML documents can safely exist without them - but it's good practice to use them, for reasons which will be discussed in subsequent chapters.

The main text of the document, known as the body text, is pretty similar to any other word-processed document. It consists of text in the ISO Latin-1 codeset, also known as ISO 8859-1; this is an 8-bit superset of ASCII (originally a 7-bit codeset), the characters above 127 being reserved for various accented glyphs. Latin-1 was chosen by CERN because it can be used to represent English and all the main Western European languages. It is possible that future versions of HTML will support other codesets defined by the ISO 8859 committee, or possibly multibyte codesets such as Unicode, but at the time of writing support for non-European languages is not part of any accepted standard for HTML.

If this business about character sets worries you, relax. Latin-1 is virtually identical to ASCII; any text editor capable of outputting ASCII files can write acceptable HTML.

Some characters (notably < and >) have special meanings within HTML.

In addition, not all keyboards can produce all the characters in Latin-1. Consequently, HTML provides two ways of referring to characters (in addition to interpreting them literally). You can enter a character by its numeric position in the Latin-1 codeset; for example, you can specify a < symbol by typing &#60; and it will be interpreted as a < in the text, rather than as the beginning of a tag. You can also specify a character by its symbolic name. Because character index numbers are hard to remember, many characters are assigned a name; the "<" (less-than) symbol's name is "lt" (and you can insert a literal less-than symbol in your HTML by typing &lt;).
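A minimal sketch of both notations, assuming a document that needs to show markup characters literally. The title and sentences are made up for illustration; &#60; and &lt; should both display as <, and gt is the corresponding name for >.

```html
<TITLE>Escaping markup characters</TITLE>

Written by number: 2 &#60; 3

Written by name: 2 &lt; 3

A tag shown as text: &lt;TITLE&gt;
```

In the last line the browser should print the characters of the tag instead of treating them as markup.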

HTML entities (as opposed to tags) are started by an ampersand (&) character and terminated by a semicolon (;). If an entity name begins with a # and is followed by a decimal number in the range 32-126 or 161-255, it is replaced with the character at that numerical position in the Latin-1 character set; otherwise it is replaced by the character associated with the named symbol.
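The same rule can be sketched for an accented character. This example assumes the named entity eacute from the Latin-1 entity set; position 233 in Latin-1 holds the same character (e with an acute accent), so the first two lines below should display identically. The sample words are invented for illustration.

```html
<TITLE>Entity example</TITLE>

By number: caf&#233;

By name: caf&eacute;

A literal ampersand: fish &amp; chips
```

Because the ampersand starts every entity, it has a name of its own (amp), used in the last line to place a literal & in the text.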
