Chapter 2: Understanding HTML

Hypertext Markup Language




If you've used the World Wide Web, you've seen documents written in HTML (short for Hypertext Markup Language). This chapter gives an accelerated overview of standard HTML and of how to create it.

There's nothing particularly mysterious about hypertext files written in HTML. Like the input to any word processor, an HTML file consists of text interspersed with special formatting commands. Unlike a word processor document, an HTML file is not intended to be printed on paper; it is intended to be formatted (rendered) on a computer screen by a piece of software called a browser. The text is rendered in accordance with the formatting commands (tags) embedded in it.
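As a rough illustration of "text interspersed with formatting commands", the following Python sketch uses the standard library's html.parser module to separate a small fragment into its tags and its text. The sample markup and the names TagAndTextLister and list_events are made up for this example; it is a toy, not a description of how any real browser is built.

```python
from html.parser import HTMLParser

# A tiny, hypothetical HTML fragment: ordinary text with tags mixed in.
SAMPLE = "<h1>Welcome</h1><p>This is <em>marked-up</em> text.</p>"


class TagAndTextLister(HTMLParser):
    """Record each start tag, end tag, and run of text in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def list_events(markup):
    """Return the (kind, value) pairs found in a string of markup."""
    parser = TagAndTextLister()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for kind, value in list_events(SAMPLE):
        print(kind, repr(value))
```

Running it lists the tags and the text runs in the order they appear, which is all a browser has to work with when it decides how to draw the page.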

Most word processing or typesetting packages don't show you their embedded formatting commands; they hide them in arcane storage formats, preferring to show you instead an approximation of the way the page will appear when it is printed. HTML, however, is naked. You can compose HTML using virtually any word processor or text editor; its markup tags are readable characters. There are some specialized, syntax-directed HTML editors, and WYSIWYG HTML authoring systems such as Netscape Navigator Gold, but the fact remains that you don't have to use one to edit HTML, whereas you must use a copy of Microsoft Word to edit a Word file.

HTML is a simple application of a complex document description language called SGML (Standard Generalized Markup Language, defined in the standards document ISO 8879:1986, issued by the International Organization for Standardization). However, it's not necessary to understand SGML to use HTML, so I'll describe SGML and its place in the scheme of things later.

When you view an HTML file using a WWW browser, you are doing one of two things:

1) You tell the browser to open a file stored locally. The browser loads the file and shows you what it looks like.

2) You click on a hot-spot in a document (or tell your browser to load a file stored remotely). The web browser works out where the file is stored from the URL (uniform resource locator) associated with the hot-spot text. It opens an HTTP connection to the server that the file is stored on and requests the file. The server sends the file (if it exists), and the browser loads the file and shows you what it looks like.
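The second case can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the URL is a made-up example, describe_fetch is a hypothetical helper, and the request shown is a bare HTTP/1.0 GET with none of the headers a real browser would add. Nothing is sent over the network; the sketch just shows how the server name and the file path fall out of the URL.

```python
from urllib.parse import urlsplit

# Hypothetical URL, as might be attached to a hot-spot in a document.
EXAMPLE_URL = "http://www.example.com/docs/chapter2.html"


def describe_fetch(url):
    """Work out which server to contact and what to ask it for.

    Returns (host, port, request_text). No connection is opened.
    """
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80          # 80 is the default port for HTTP
    path = parts.path or "/"         # an empty path means the server root
    if parts.query:
        path += "?" + parts.query
    # A minimal HTTP/1.0 request: one request line, then a blank line.
    request_text = f"GET {path} HTTP/1.0\r\n\r\n"
    return host, port, request_text


if __name__ == "__main__":
    host, port, request_text = describe_fetch(EXAMPLE_URL)
    print("connect to", host, "on port", port)
    print(request_text.strip())
```

A real browser would then open a connection to that host and port, send the request, and read back either the file or an error saying it doesn't exist.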

The last step, "the browser loads the file and shows you what it looks like", is the one that concerns us here. The file contains HTML mark-up tags that tell the browser how to display it. But HTML files have to be viewed by a multitude of different browsers running on different computers, with different display capabilities. So HTML does not describe how the information is to be presented; it merely indicates the type of information that is being displayed.

A second point needs to be made at this time. You can click on a hot-spot in a document, and your web browser will load the target document associated with that hot-spot. But there is no permanent connection between the two files; they are not part of some overarching structure. A file can go away by being deleted, renamed, or moved, and nothing updates the links that point to it. Thus, hyperlinks in HTML are unreliable.

HTML tags tell the browser how to render text components of a document (entities) by declaring the entities in the document to be of a certain type; the browser is presumed to know how to deal with different types of entity. For example, graphical browsers usually display top-level headings in a larger point size than subordinate headings. However, how the browser treats the entities is ultimately the browser's decision. For example, the CERN line-mode browser (which runs on dumb terminals) can only print the text in an HTML file; it detects tags that direct it to underline or emphasize text, but it does not apply any special highlighting to the text they mark.

This is diametrically opposed to the philosophy of document presentation formats such as Adobe's PostScript, which rigidly enforces a description of the appearance of a page in such detail that the only things the display device has control of are its resolution, and whether to display the file in colour or black and white. This difference is an important feature of HTML, and we'll revisit it later. HTML is not a page description language (like PostScript) that describes the layout of information on a page. Rather, it describes the relationship between entities in a document, and leaves the browser to take care of displaying it.
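To see what it means for the browser, rather than the document, to decide on appearance, here is a toy sketch in Python. It feeds one made-up fragment of markup to two imaginary "browsers": one that highlights headings and emphasis, and one that, like a line-mode browser on a dumb terminal, just prints the text. ToyRenderer, render, EMPHATIC, and PLAIN are all names invented for the example, and the styling rules are arbitrary.

```python
from html.parser import HTMLParser

# Hypothetical markup: it says what each piece of text is, not how it looks.
SAMPLE = "<h1>Status</h1><p>All systems <em>nominal</em>.</p>"


class ToyRenderer(HTMLParser):
    """Collect text, styling it according to rules this 'browser' chose."""

    def __init__(self, styles):
        super().__init__()
        self.styles = styles   # maps a tag name to a function str -> str
        self.stack = []        # tags currently open, innermost last
        self.output = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        if tag in ("h1", "p"):
            self.output.append("\n")   # headings and paragraphs end a line

    def handle_data(self, data):
        innermost = self.stack[-1] if self.stack else None
        style = self.styles.get(innermost, str)   # default: leave text alone
        self.output.append(style(data))


def render(markup, styles):
    """Render markup to plain text using one browser's style rules."""
    renderer = ToyRenderer(styles)
    renderer.feed(markup)
    renderer.close()
    return "".join(renderer.output).strip()


# One browser shouts headings and stars emphasis; the other does nothing.
EMPHATIC = {"h1": str.upper, "em": lambda text: "*" + text + "*"}
PLAIN = {}

if __name__ == "__main__":
    print(render(SAMPLE, EMPHATIC))
    print("---")
    print(render(SAMPLE, PLAIN))
```

The markup is identical in both runs; only the rules held by the renderer differ. A page description language would have fixed the appearance in the file itself and left neither renderer any choice.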

