Chapter 2: Understanding HTML

Entering the WWW


We have seen anchor tags that point to other anchors within local files, and to other files on the same server. An extension of the same syntax is used to designate resources elsewhere on the internet; the mechanism used is the Uniform Resource Locator (URL), as described in the RFC 1630 standards document and specified in full in RFC 1738. (RFC stands for "Request for Comments", the document series in which internet standards are proposed and published.) A URL is actually one kind of a broader mechanism (the URI, or Uniform Resource Identifier, which RFC 1630 calls a Universal Resource Identifier); URIs also encompass URNs (Uniform Resource Names, described in Chapter 6, "Structuring a Large Web").
A URL identifies the protocol used to obtain the remote resource, the host from which the resource is available, and the location of the resource on the host. By using a URL, a web browser can retrieve a resource. For example:
http://www.yahoo.com/computers/world_wide_web/
Means, "use the HTTP protocol to contact the host www.yahoo.com, and retrieve the object /computers/world_wide_web/".
The components of a URL are discussed below, under "Understanding URLs".
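The three-way split described above can be tried out with a few lines of Python. This is only a sketch: it relies on the standard library's urllib.parse module, and the function name describe is made up for this illustration.

```python
from urllib.parse import urlsplit

def describe(url):
    """Return the (protocol, host, location) triple that a URL encodes."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path

protocol, host, location = describe("http://www.yahoo.com/computers/world_wide_web/")
print(protocol)  # http
print(host)      # www.yahoo.com
print(location)  # /computers/world_wide_web/
```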
A URN is a unique, permanent name for a resource. A central repository keeps track of the current location of the resource - for example, a file. When the file is moved to another host, its owner need only notify the central repository; thereafter, all requests for the URN are redirected to the new location. This is in contrast to a URL, which is "hard wired" to look at a specific place on the internet - if the file is moved, the web browser will be unable to locate it.
For example:
http://www.antipope.org/charlie/misc/bio/whoami.html
Refers to the entity charlie/misc/bio/whoami.html on www.antipope.org. (On UNIX systems, the notation ~username refers to the home, or public, directory owned by the user username. In this case, the web server is configured so that each user on the system has their own home directory.)
However, if I rename the file whoami.html to biography.html, the URL above will no longer work.
Because of the plastic nature of the World Wide Web, everyone agrees that a URN mechanism is in principle superior to the URL - but it has not yet been formally proposed, and will probably not be phased in before 1996.
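To make the contrast concrete, here is a toy model of the indirection a URN repository would provide. Everything in it is hypothetical: the name urn:example:charlie-bio, the registry dictionary and the two functions are invented for this illustration and do not correspond to any real resolution service.

```python
# Hypothetical central repository: permanent name -> current location (a URL).
registry = {
    "urn:example:charlie-bio":
        "http://www.antipope.org/charlie/misc/bio/whoami.html",
}

def resolve(urn):
    """Look up where the named resource currently lives."""
    return registry[urn]

def move(urn, new_url):
    """The owner notifies the repository; the name itself never changes."""
    registry[urn] = new_url

before = resolve("urn:example:charlie-bio")
move("urn:example:charlie-bio",
     "http://www.antipope.org/charlie/misc/bio/biography.html")
after = resolve("urn:example:charlie-bio")

print(before)  # the old location, ending in whoami.html
print(after)   # the new location, ending in biography.html
```

Anyone who stored the name keeps working after the rename; anyone who stored the old URL does not.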

Understanding URLs

A URL (to a web page or file) consists of three parts:

the service protocol
the destination site
the path to the resource

The service protocol, followed by a colon:

HTTP:
HyperText Transfer Protocol (used for HTML)
Gopher:
an alternative textual transport mechanism used for plain text
News:
usenet news
Mailto:
electronic mail
FTP:
file transfer protocol (used for bulk file transfers)
WAIS:
Wide Area Information Servers (based on Z39.50, used for accessing large text databases)
Telnet/TN3270/rlogin:
remote login access to other computer systems
The destination:

FTP, HTTP, WAIS, Gopher:
a hostname, preceded by two slashes: for example, //www.demon.co.uk. The hostname is the name by which the computer is known on the internet; alternatively, the computer's IP address (four decimal numbers separated by periods) can be used.

Most of the time, these protocols are handled by a server running on a "well known port" (a TCP port number assigned to the protocol by convention; well known ports are numbered below 1024). If the protocol is running on a different port, the port number is appended to the hostname, separated by a colon. For example: http://www.demon.co.uk:8080 indicates an HTTP request directed to host www.demon.co.uk, on port 8080 (instead of port 80, the "well known port" for HTTP).

News:
a usenet newsgroup name: for example, comp.infosystems.www.misc.
Mailto:
a person's electronic mail address in the format specified by RFC 822: for example, charlie@antipope.demon.co.uk
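The difference between the two kinds of destination shows up clearly when the URLs are parsed. The sketch below uses Python's standard urllib.parse module; the helper name destination is an invention for this example.

```python
from urllib.parse import urlsplit

def destination(url):
    """Return the destination part of a URL.

    For protocols written with '//' this is the hostname; for news and
    mailto it is whatever follows the colon.
    """
    parts = urlsplit(url)
    if parts.netloc:            # a "//host" section was present
        return parts.hostname
    return parts.path           # news:group or mailto:address

print(destination("http://www.demon.co.uk:8080"))          # www.demon.co.uk
print(destination("news:comp.infosystems.www.misc"))       # comp.infosystems.www.misc
print(destination("mailto:charlie@antipope.demon.co.uk"))  # charlie@antipope.demon.co.uk
```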
The path to the resource:

HTTP:
The path to the resource is the pathname of the HTML file to retrieve, relative to the web server's document root directory. Slashes "/" are used to separate directory and file name components. A leading slash is always used to separate the pathname from the name of the host that stores the file. If the last component of the pathname is a directory, the response depends on how the server is configured: typically the server returns the default file for that directory or, if there is none, a listing of the files in the directory.
FTP:
Follows the same model as HTTP. It is a different protocol, however, and does not retrieve a default HTML file. (It also tends to be slower at establishing a connection, but faster at retrieving binary data.)
The other protocols all follow their own models, and have a rather different syntax for retrieving files. These will not be discussed in depth here, except by example where necessary, and they are documented in full in the RFC.

For the time being, all we need to understand are HTTP URLs. Let's look at one:
<A HREF="http://www.w3.org/pub/WWW/index.html"> press here </A>
Once you strip away the HTML anchor, you are left with:
http://www.w3.org/pub/WWW/index.html
This means (reading from left to right): using the HTTP protocol, go to the host www.w3.org, enter the directory pub, then its subdirectory WWW, and retrieve the file index.html.
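Both steps, stripping away the anchor and reading the path from left to right, can be sketched in Python with the standard html.parser and urllib.parse modules. The class and function names here (HrefCollector, extract_hrefs, path_components) are made up for the illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class HrefCollector(HTMLParser):
    """Collect the HREF value of every anchor tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)

def extract_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs

def path_components(url):
    """Split the path of a URL into its directory and file names."""
    return urlsplit(url).path.strip("/").split("/")

anchor = '<A HREF="http://www.w3.org/pub/WWW/index.html"> press here </A>'
url = extract_hrefs(anchor)[0]
print(url)                   # http://www.w3.org/pub/WWW/index.html
print(path_components(url))  # ['pub', 'WWW', 'index.html']
```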
HTTP uses port 80 by default (under TCP/IP, port numbers run from 0 to 65535). We could equally well write:
http://www.w3.org:80/pub/WWW/index.html
It means the same.
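That an explicit :80 changes nothing can be checked mechanically. In this sketch the DEFAULT_PORTS table and the function effective_port are assumptions made for the example; the port numbers themselves are the standard assignments for these protocols.

```python
from urllib.parse import urlsplit

# Assumed lookup table of well known ports for a few protocols.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "gopher": 70, "telnet": 23}

def effective_port(url):
    """Return the port a client would connect to for this URL."""
    parts = urlsplit(url)
    if parts.port is not None:          # an explicit ":port" was given
        return parts.port
    return DEFAULT_PORTS[parts.scheme]  # otherwise fall back to the default

print(effective_port("http://www.w3.org/pub/WWW/index.html"))     # 80
print(effective_port("http://www.w3.org:80/pub/WWW/index.html"))  # 80
print(effective_port("http://www.demon.co.uk:8080"))              # 8080
```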
The file pathname bears some description. The path /pub/WWW/index.html does not start from the root directory of the computer hosting the web server. Rather, it starts from a sub-directory designated by the server as the root of the visible filesystem (the document root). So in actual fact, the file's pathname on the server might very well be /usr/local/etc/httpd/docs/pub/WWW/index.html (if it is a standard NCSA httpd installation); as far as other systems are concerned, the directory /usr/local/etc/httpd/docs is simply the root directory, /.
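The mapping from a URL path to a file on the server might be sketched as follows. This is a simplified illustration, not how any particular server is implemented: the function name server_side_path is invented, and the document root is the NCSA httpd location mentioned above.

```python
import posixpath
from urllib.parse import urlsplit

DOCUMENT_ROOT = "/usr/local/etc/httpd/docs"  # location taken from the text above

def server_side_path(url, document_root=DOCUMENT_ROOT):
    """Translate the path in a URL into a pathname below the document root."""
    url_path = urlsplit(url).path
    # Normalise first so that "." and ".." cannot climb out of the root.
    clean = posixpath.normpath("/" + url_path.lstrip("/"))
    return posixpath.join(document_root, clean.lstrip("/"))

print(server_side_path("http://www.w3.org/pub/WWW/index.html"))
# /usr/local/etc/httpd/docs/pub/WWW/index.html
print(server_side_path("http://www.w3.org/index.html"))
# /usr/local/etc/httpd/docs/index.html
```

Normalising the path before joining it keeps a request such as /../../etc/passwd inside the document root, which is the point of having a visible root in the first place.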

Weird URLs: cgi-bin queries

Not all HTTP URLs are as easy to read as the ones above. Sometimes you will come across complex forms that contain buttons to send information back to an HTTP server. You send information to a server by submitting an unusual URL. One might look something like this:
http://odd.host.com/cgi-bin/search?arg1+arg2+arg3
What does this mean?
Firstly, note that the request is going to the cgi-bin directory. CGI is an acronym for Common Gateway Interface. CGI scripts are programs which are executed by the HTTP server, in accordance with a standard calling protocol. In this instance, the program is called search. The text following the question mark is passed to search as arguments; the arguments are separated by "+" symbols. (If you need to send a literal plus sign or question mark as an argument, it should be sent as a hexadecimal escape, that is, a percent sign followed by the character's two-digit hexadecimal ISO 8859/1 code: %2B or %3F respectively. In fact, if you need to send any unusual characters, they need to be specially encoded; see Chapter 5 for details.)
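As a rough sketch of how such a query is taken apart and put together, the following Python uses the standard urllib.parse module. It assumes URL escapes of the form %XX (a percent sign and two hexadecimal digits, so %2B for a plus sign and %3F for a question mark); the function names cgi_arguments and build_query are invented for the example.

```python
from urllib.parse import quote, unquote, urlsplit

def cgi_arguments(url):
    """Return the list of arguments that follow the question mark."""
    query = urlsplit(url).query
    if not query:
        return []
    # Split on the "+" separators first, then undo any %XX escapes.
    return [unquote(arg) for arg in query.split("+")]

def build_query(args):
    """Join arguments with "+", escaping any special characters in them."""
    return "+".join(quote(arg, safe="") for arg in args)

print(cgi_arguments("http://odd.host.com/cgi-bin/search?arg1+arg2+arg3"))
# ['arg1', 'arg2', 'arg3']
print(build_query(["1+1", "why?"]))
# 1%2B1+why%3F
```

Splitting before unescaping matters: an escaped plus sign inside an argument must not be mistaken for a separator.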
CGI scripts are explained in detail in subsequent chapters. Note that not all web servers use a cgi-bin directory to contain their scripts; some (notably the CERN/W3O server) keep scripts just about anywhere.
