Web Development Tools on Linux




Linux is the most popular operating system for web servers, clocking up nearly 29% of all the servers on the internet (according to a recent survey of operating systems). If you group the various BSD UNIX derivatives with Linux (which they resemble quite strongly, being open source UNIX-type operating systems that are built on the GNU toolchain), this figure rises to more than 45%, showing a convincing lead over Microsoft (on 24%) and all the competing commercial UNIXes (all but 5% of the rest).

Part of the reason for this lead is the open source Apache web server which, with 60% of the web server market, has successfully embarrassed all its commercial competitors. Apache and Linux are a natural match.

Despite this, Linux isn't widely known as a web development platform. Most web sites are still designed and edited using tools running on Windows or MacOS, because these desktop systems are mostly familiar to the people who put the web content together. But a wide range of tools exist to make site development on Linux easy: in this feature I'm going to discuss items from each of a number of categories.

HTML and Web Site Editors

You can create HTML files (HTML being the hypertext markup language, the file format used on the web) using any text editor -- vim or emacs are up to the job and provide basic assistance in the form of syntax colourising, so that you can tell where HTML tags end and text begins. In fact, almost all text editors for Linux can cope with HTML. But many of us don't like editing HTML by hand: the markup gets in the way of the text, and also forces you to laboriously test everything by hand.

Probably the commonest, and easiest to use, HTML editor on Linux is part of Netscape Communicator: the Composer component should need little or no introduction. As Netscape comes with all graphically-enabled Linux distributions today, it's reasonable to take this as the lowest common denominator of Linux-based web editors. Communicator has been criticised over the quality of the HTML code it produces, and this is being addressed in Mozilla (also known as Netscape Communicator 6.0), the HTML editor of which is a considerable improvement. Unlike the older Netscape Composer, Mozilla's editor provides several editing modes. You can use it as a WYSIWYG editor or see and edit the raw HTML it generates. There's a preview mode, in which it behaves just like the browser, and there's a most useful "show all tags" mode, in which Mozilla shows you the text of the document, with tags replaced by icons (with a popup property editor to let you adjust them). This is possibly the most useful mode insofar as it minimises the complexity of the document markup while at the same time making it possible to see it when necessary. (If you want to experiment with beta releases of Mozilla, see Mozillazine.)

There are some other office tools that can produce HTML fairly easily. Sun's StarOffice has a comprehensive HTML editor built in, and supplies many of the features of Microsoft Frontpage. Unfortunately StarOffice is a remarkably cumbersome package, and the quality of HTML it produces is if anything more erratic than Netscape 4.7's. Separate text attributes (font face, font colour) are applied in separate tags, there are unquoted percent marks in HTML attributes (a strict violation of SGML's rules), and so on. This doesn't mean that you can't produce HTML documents using StarOffice as an editor, but anyone with a claim to be a web author would have a fair job tidying the results up. (Extra tags take extra time to load and can give web browsers indigestion. They also make it difficult to maintain documents.)

Tools like Microsoft Frontpage attempt to remove the complexity of HTML production by giving a word-processor like WYSIWYG interface (hiding the underlying document structure) and providing tools to organise web pages -- letting you create links between documents in a collection by dragging and dropping, rather than by working out the relative pathname between two files. Similar tools are beginning to appear on Linux, starting with the rather remarkable Screem, the Site Creation and Editing Environment.
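The relative-pathname bookkeeping that such tools hide can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how Frontpage or Screem work internally; the function name relative_link and the file paths are made up for the example.

```python
import posixpath

def relative_link(from_page: str, to_page: str) -> str:
    """Return the href that links from_page to to_page within one site tree.

    Both arguments are site-relative paths using forward slashes.
    """
    # Links are resolved relative to the directory holding the source page.
    start = posixpath.dirname(from_page) or "."
    return posixpath.relpath(to_page, start)

# A page two directories down linking to an image near the top of the site.
print(relative_link("products/widgets/index.html", "images/logo.png"))
```

Moving a page changes its directory, and therefore every relative link into and out of it, which is why having a tool recalculate these paths is so convenient.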

Screem is a GNOME-based Microsoft Frontpage work-alike. It's open source, and although it is not yet finished, it is already pretty usable. You can use it as an HTML editor, although that's not where its real strength lies; it is at its most useful when used to process a whole site. A site (for Screem's purposes) is a collection of HTML documents and associated files (images, style sheets, and so on) that live in a directory somewhere. Screem integrates with the CVS version control system (fairly ubiquitous in the open source world) to keep track of revisions to the site, and can publish a site to a web server via FTP or WebDAV. It has wizards for managing links, images, style sheets, and tables; it also lets you build sites where all the documents share a common template, making for a uniform look and feel. It's capable of tracing the links between HTML files and seeing if any are broken. Screem is also scriptable. Like many GNU applications, it uses the Guile scripting language. Guile, the GNU extension language, is an implementation of Scheme, itself a dialect of Lisp; scriptlets placed in a "plugin" directory in the Screem home directory appear under a "scripts" menu and can be executed by the editor.
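To make the idea of tracing links and spotting broken ones concrete, here is a minimal sketch in Python using only the standard library. It is not Screem's code (Screem is scripted in Guile); the names LinkCollector and broken_links, and the in-memory dictionary standing in for a site directory, are assumptions made for the illustration.

```python
from html.parser import HTMLParser
import posixpath

class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def broken_links(site: dict) -> list:
    """site maps site-relative paths to HTML text.

    Returns (page, href) pairs whose target is not a file in the site.
    External URLs, mailto: links and fragment-only links are skipped.
    """
    broken = []
    for page, html in site.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            if "://" in href or href.startswith(("#", "mailto:")):
                continue
            # Resolve the link relative to the directory of the linking page.
            target = posixpath.normpath(
                posixpath.join(posixpath.dirname(page), href.split("#")[0]))
            if target not in site:
                broken.append((page, href))
    return broken
```

Called on a dictionary of pages, broken_links returns an empty list when every internal link resolves to a file in the site.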

Screem is a work in progress. As of version 0.27, lots of stuff doesn't work yet -- notably some of the wizards. Nevertheless, even at this level it's possible to get productive work done in it, and the automatic link updating facilities (to keep links between files in a site pointing at each other even when a file is moved) are a boon. This is one open source project that is really crying out for backing: demand for it will be enormous when Linux finally colonises the desktop, and to think that it's the product of a small team of volunteers is remarkable.

Graphics tools

Linux isn't a platform known for its graphics tools; however it has more than enough facilities for preparing web site logos and buttons. UNIX was used extensively from the late 1980s onwards in image analysis; this has left a legacy of tools such as the pbm utilities (command line programs for manipulating bitmaps) and the famous image viewer xv (which can be used for file format translation, cropping, and HSV manipulation, among other things). More recently, the demands of desktop use of Linux have spawned two additional categories of utility -- software toolkits for generating images dynamically, and the PhotoShop work-alike, the GIMP (and commercial rivals such as Corel Photopaint).

The former category deserves some explanation. In general, graphics on web sites are static images -- the web author produces them, then they just sit there until someone loads a page that sources them in. However, this isn't the only way of doing graphics. Just as you can produce dynamic HTML content by writing CGI programs, you can produce dynamic graphical content. This might come in handy if, for example, you're writing a network monitoring system and want to display pie charts of available disk capacity or server load: instead of having lots of canned GIFs of pies with slices cut into them, you write a program that, given the appropriate percentages of used/unused space, generates a GIF image on the fly and sends it to the user's web browser when it loads an image through an <img> tag whose src attribute names the program rather than a static file.
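As a rough sketch of the technique, the Python below builds an image in memory from a percentage and wraps it in the headers a CGI program would send. To keep it self-contained it makes some simplifying assumptions: it draws a two-colour horizontal bar rather than a pie chart, writes PNG rather than GIF, and uses only the standard library rather than a graphics toolkit. The function names and colours are invented for the example.

```python
import struct
import zlib

def _chunk(tag: bytes, data: bytes) -> bytes:
    """Wrap data in a PNG chunk: length, tag, payload, CRC."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)

def usage_bar_png(percent_used: float, width: int = 200, height: int = 20) -> bytes:
    """Return a PNG showing used space (red) against free space (green)."""
    pct = max(0.0, min(100.0, percent_used))
    filled = round(width * pct / 100.0)
    used_px, free_px = b"\xcc\x33\x33", b"\x33\xaa\x33"
    # Each scanline starts with a filter-type byte; 0 means "no filter".
    row = b"\x00" + used_px * filled + free_px * (width - filled)
    # IHDR: width, height, 8 bits per sample, colour type 2 (RGB),
    # default compression and filtering, no interlacing.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (b"\x89PNG\r\n\x1a\n"
            + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(row * height))
            + _chunk(b"IEND", b""))

def cgi_image_response(png: bytes) -> bytes:
    """Prefix the image with the headers a CGI program writes to stdout."""
    header = "Content-Type: image/png\r\nContent-Length: %d\r\n\r\n" % len(png)
    return header.encode("ascii") + png
```

A real CGI script would write the result of cgi_image_response() to its standard output in binary mode; the web server passes it on to the browser, which neither knows nor cares that the image did not come from a file.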

One of the oldest tools designed to support drawing web graphics on the fly is Thomas Boutell's GD graphics library. GD isn't an application -- it's a toolkit for programmers. It allows you to create colour bitmaps using graphics primitives, then emit the drawing as a GIF image file. Or at least, it used to emit GIFs -- now it produces JPEG, WBMP or PNG images. In addition to the basic drawing primitives, GD can optionally incorporate TrueType fonts into image output, and supports a variety of fill types, brush strokes, and line styles.

Drivers for GD are available in Perl, Tcl, Pascal, C, and ML, and there are command-line interpreters that let you bolt GD scripts into shell scripts or other programming languages.

GD is not, however, a user-friendly program; it's a tool best suited to the needs of CGI programmers, who can use it to spice up the output of otherwise text-only CGI applications. If what you want to do is work on images interactively, the number one choice on Linux is The GIMP.

The GIMP (GNU Image Manipulation Program) is an extremely powerful PhotoShop-like graphics tool. In fact, in use it feels rather like Adobe PhotoShop; it provides a palette of image manipulation tools, the ability to work with multiple layers and channels, and a host of image conversion facilities. It can work on images from files or it can capture images from the screen and, with suitable drivers, from scanner or camera. If you work with video, it has a basic ability to split video files into frames so that they can be worked on manually (although it isn't a video editing suite). It also has a powerful scripting facility, script-fu, which lets you write GIMP programs that manipulate images. Arguably, script-fu is on the way to replacing GD as a tool for developing dynamic image content on the web; with interfaces available from programming languages like Perl, and some basic paint capabilities to go with the image manipulation tools, GIMP is capable of being fully automated. Like PhotoShop, GIMP supports plug-ins, and a wide range are available for free.

The one area where the GIMP falls badly behind the commercial competition is support for proprietary CMYK colour matching schemes. For example, the GIMP simply doesn't do Pantone colour matching. It can't; Pantone colours are strictly controlled intellectual property, and no open source program is going to be licensed to use the scheme without forking out a lot of money. (There are some rumours about commercial support for the GIMP being used to fund the process of acquiring these licenses, but this hasn't materialised at the time of writing.)

The point of mentioning this is that you don't need CMYK and Pantone colour matching if you're doing web graphics; outside of the very high end graphics industry, computer monitors simply don't do true colour reproduction, and the GIMP is perfectly satisfactory for preparing images for publication on the web. (It's also a whole lot better than PhotoShop at compressing JPEG images -- on the web, every extra byte adds to your download times, whereas the paper-oriented product doesn't worry so much about this.)

Incidentally, the GIMP is a big beast of a program, like PhotoShop; if you want to use it effectively you'll need to put some effort into learning it. To that end, one of the developers has written a book titled "Grokking the GIMP"; you can find the online edition at http://gimp-savvy.com/BOOK/index.html (along with a form for ordering the more convenient paper edition).

On the commercial side, Corel Photopaint 9 for Linux is available for free download from http://linux.corel.com/. Photopaint is a large image processing package aimed slightly down-market from Photoshop; easier to use but less capable, it is still suitable for bitmap editing for web graphics.

Web servers

Contemplating the area of web servers, the phrase "there can only be one" springs to mind. Apache is without a shadow of a doubt the preferred option for a web server on Linux. It is descended from the NCSA HTTPD that was the state of the art in 1993; the Apache Software Foundation -- independent web publishers who had an interest in maintaining the open source branch of the NCSA server source tree -- has added functionality, turning it into the number one server on the net today. Apache powers more web sites than all other web servers combined.

However, there are other options.

First, a brief explanation of why Apache is so popular. For starters, it's an open source program that runs on both UNIX-type operating systems and 32-bit Windows. It was the first server to provide extensive support for hosting virtual servers -- a bunch of domain names that all point to the same computer, so that each address corresponds to a different web document tree somewhere on that computer's filesystem. This naturally made it very popular with ISPs as they moved into hosting customer websites -- instead of one computer per customer, they could host hundreds or thousands of sites on a single beefy server. It has also been designed to scale well under heavy load. Rather than forking a child process for each query, or spawning a separate thread, Apache maintains a pool of cooperating child servers, controlled by a parent that supervises them and ensures that there's always a spare to handle any new incoming request.
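The parent's job of keeping spare children on hand can be sketched as a single decision function. The Python below is a simplified illustration, not Apache's actual algorithm: the parameter names are loosely modelled on Apache's MinSpareServers, MaxSpareServers and MaxClients configuration directives, and the default values and the function name pool_adjustment are assumptions made for the example.

```python
def pool_adjustment(idle: int, busy: int, min_spare: int = 5,
                    max_spare: int = 10, max_children: int = 150) -> int:
    """Return how many children the parent should start (positive) or
    retire (negative) so the idle count stays between min_spare and max_spare.
    """
    if idle < min_spare:
        # Start enough children to restore the spare margin, but never
        # exceed the overall ceiling on child processes.
        room = max_children - (idle + busy)
        return max(0, min(min_spare - idle, room))
    if idle > max_spare:
        # Too many children sitting around: retire the surplus.
        return -(idle - max_spare)
    return 0

# With 2 idle and 20 busy children, the parent should start 3 more.
print(pool_adjustment(idle=2, busy=20))
```

Because spare children already exist when a request arrives, the cost of creating a process is paid ahead of time rather than while a visitor is waiting.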

While Apache isn't the fastest server out there in terms of raw, blistering pages served per second, it strikes a good compromise between speed and features: it provides access to CGI programs for generating dynamic content, and permits developers to write plug-in modules to serve specific types of request even more efficiently. It also provides MultiViews -- the ability to automatically deliver content in the appropriate language for the reader, if separate document trees are present with the web content available and translated into the requested tongue.
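The language selection behind this kind of feature can be sketched as follows. The browser sends an Accept-Language header listing the languages the reader prefers, optionally weighted with q values, and the server picks the best variant it has. This Python is a simplified illustration rather than Apache's negotiation algorithm (it ignores the "*" wildcard, for instance); the function names and the variant file names are assumptions.

```python
def parse_accept_language(header: str) -> list:
    """Return the language tags from an Accept-Language header, best first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        q = 1.0  # a tag without an explicit weight counts as q=1
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:
            # Sort by descending weight, then by position in the header.
            prefs.append((-q, position, tag))
    return [tag for _, _, tag in sorted(prefs)]

def choose_variant(header: str, available: dict, default: str) -> str:
    """available maps language tags to file names; return the best match."""
    for tag in parse_accept_language(header):
        if tag in available:
            return available[tag]
        primary = tag.split("-")[0]  # fall back from "fr-ch" to "fr"
        if primary in available:
            return available[primary]
    return available[default]

variants = {"en": "index.html.en", "fr": "index.html.fr", "de": "index.html.de"}
print(choose_variant("fr-CH, fr;q=0.9, en;q=0.8", variants, "en"))
```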

Apache can be configured to work as a caching proxy server, to conserve bandwidth on a leased line; if your company or organisation installs an Apache server, telling it to also run as a proxy, and configuring all the web browsers to use it as such, will ensure that documents will only be fetched from off-site when they've actually changed. (And while it's doing this it can still carry on acting as your main organisational web server.)

And, of course, it sets a standard for logging request details -- most other servers today imitate the Apache logfile format, which is essential if you want to gather statistics about the load on your server or the people using it.
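Gathering statistics from such a log is mostly a matter of splitting each line into fields. The sketch below parses the widely used Common Log Format (host, identity, user, timestamp, request line, status, size) with a regular expression. The sample log line is made up, and the function names are assumptions; a production log analyser would need to cope with more variations than this does.

```python
import re

# host ident user [timestamp] "request line" status size
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line: str):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    m = CLF.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    # A size of "-" means no body was sent.
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

def hits_per_status(lines) -> dict:
    """Count requests by HTTP status code, skipping unparseable lines."""
    counts = {}
    for line in lines:
        rec = parse_log_line(line)
        if rec:
            counts[rec["status"]] = counts.get(rec["status"], 0) + 1
    return counts

sample = '192.0.2.7 - alice [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(parse_log_line(sample)["request"])
```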

Why, then, do some sites not use Apache?

There are two features Apache doesn't provide. Firstly, some specialised web servers may have to field enormously high demand levels. This, however, doesn't come off-the-shelf; to field millions of hits per day it is usually necessary to purchase special hardware, arrange for high-bandwidth hosting, and do considerable performance tuning.

The second feature that doesn't come standard with Apache is support for SSL -- the secure sockets layer. SSL provides an encrypted tunnel for HTTP protocol requests between a web browser and a server; it's at the heart of web security, because SSL-encrypted sessions (commonly known as HTTPS, HTTP over SSL) are very difficult to listen in on. SSL encryption was not routinely approved for export from the USA until very recently, and Apache wasn't even licensed to incorporate "hooks" for third-party encryption products. Moreover, one of the encryption algorithms used in SSL is patented by RSA Data Security (in the USA), and you will require a digitally signed site certificate before your server will be trusted by clients. Consequently, if you want to run a secure web server (for example, to take credit cards over the net), you will need to either buy an SSL-enabled Apache derivative (such as Stronghold) and an SSL site certificate (from Verisign), or obtain a different, SSL-enabled web server such as Roxen or one of the iPlanet servers. (iPlanet is the joint venture between Netscape and Sun that markets Netscape's web server technology.)

Macro processors

Just what is a macro processor, and why should it interest web developers?

At its simplest, a macro is a symbol in a file which, when the macro processor sees it, is replaced by something else. The "something else" might be a block of text, or the result of executing a program, or even another macro.

Apache provides a simple built-in macro processor that executes so-called "server side includes". If you switch on this facility, then when Apache receives a request for a file of type "shtml" it will process it and replace any server-side includes it finds in the file with something else -- typically another file (named in the macro), but possibly the output from running a program. The SSI facility predates the CGI programming interface, and was originally used to do things like append standard footers to files (typically listing the time when the file was last updated).
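The include mechanism is easy to imitate, which makes it a good illustration of what a macro processor does. The Python sketch below expands include directives of the form <!--#include file="..." --> (or virtual="...") by substituting the named file's contents, recursively. It handles only that one directive, takes its "files" from a dictionary rather than the filesystem, and the function name and depth limit are assumptions made for the example.

```python
import re

INCLUDE = re.compile(r'<!--#include\s+(?:file|virtual)="([^"]+)"\s*-->')

def expand_includes(text: str, files: dict, depth: int = 0) -> str:
    """Replace each include directive in text with the named file's contents.

    Included files may contain further includes; the depth limit guards
    against a file that (directly or indirectly) includes itself.
    """
    if depth > 10:
        raise RecursionError("includes nested too deeply")

    def replace(match):
        name = match.group(1)
        body = files.get(name, "[an error occurred while processing this directive]")
        return expand_includes(body, files, depth + 1)

    return INCLUDE.sub(replace, text)

files = {
    "footer.html": '<hr>Last updated: <!--#include file="date.txt" -->',
    "date.txt": "1 May 2000",
}
page = '<p>Hi</p><!--#include virtual="footer.html" -->'
print(expand_includes(page, files))
```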

Server side includes are only the beginning, though. A fully functional macro processor can let you tailor web content to the requirements of a specific reader, pull information out of a relational database, or even run an e-commerce storefront: and probably the most popular currently used on Linux is PHP.

PHP describes itself as a "server-side, cross-platform, HTML embedded scripting language." You install the PHP interpreter on your web server, typically as an Apache module (or a CGI script, on other web servers): when files of the appropriate type are requested by a browser they are first fed through PHP. Here's a simple (canonical) example of what a PHP file looks like:

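<html>
<head>
<title>Example</title>
</head>
<body>
<?php echo "Hi, I'm a PHP script!"; ?>
</body>
</html>

Yes, this is HTML -- with a difference. Tags starting with a question mark are interpreted as PHP commands; in the example file above, the <?php ... ?> tag is replaced by the output from the command:

echo "Hi, I'm a PHP script!"

So that what gets sent to the user's browser is this:

<html>
<head>
<title>Example</title>
</head>
<body>
Hi, I'm a PHP script!
</body>
</html>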

PHP is essentially a lightweight programming language that can be embedded in web pages and interpreted by the web server itself. (This differs from JavaScript/ECMAScript, which is interpreted by the web browser.) It can talk directly to a range of databases including InterBase, dBase, MySQL, Oracle, PostgreSQL, and Unix dbm files.

The PHP programming language is a not-very-strongly typed language that provides data types reminiscent of Perl or Python -- albeit with a twist; integers and floating point numbers are treated differently (and can be typecast), Python programmers will feel at home with the multi-dimensional array and associative array types (and the absence of references for constructing complex data structures), and all variable names begin with a dollar symbol (which will confuse the hell out of Perl programmers). Variable scope is handled fairly simple-mindedly (with global or local declarations relative to the function in which a variable is declared). All in all, it looks rather familiar to a UNIX programmer with C, shell, Perl or Python experience. It even provides a fairly simple object-oriented syntax for creating and using classes and handling inheritance.

The power of PHP comes from its tight integration with the web server. For example, there's no need to jump through the hoops CGI imposes in order to get at variables passed in an HTML form, the action of which points to a PHP page. If the form contains a field called "foo", then when the target PHP page is executed a variable called $foo will be available, containing whatever the user entered in that field.
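For comparison, here is roughly what the hoop-jumping looks like when you do it yourself. A form submission reaches the server as a URL-encoded string such as foo=hello+world&bar=42, which has to be split and decoded before the program can use it. The sketch below does this in Python with the standard library; the function name form_variables and the "last value wins" rule for repeated fields are assumptions of the example, not part of any standard.

```python
from urllib.parse import parse_qs

def form_variables(query_string: str) -> dict:
    """Decode a URL-encoded form submission into a name -> value dict.

    When a field is repeated, the last value wins.
    """
    parsed = parse_qs(query_string, keep_blank_values=True)
    return {name: values[-1] for name, values in parsed.items()}

# A field called "foo" ends up under the key "foo", much as PHP
# would make it available in a variable called $foo.
fields = form_variables("foo=hello+world&bar=42")
print(fields["foo"])
```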

PHP has a slew of additional functions. Its file-open function, fopen(), isn't limited to local files -- it can open URLs via FTP or HTTP. It doesn't have a uniform database abstraction layer like Perl's DBI, but provides separate functions for talking to each of its targets -- on the other hand, it's easier to learn than Perl for these purposes. It's difficult to do more than scratch the surface of PHP's power in a short description like this, but in general it's worth noting that PHP is a general purpose replacement for CGI development that promises the ability to rapidly assemble server-side applications that are easy to maintain. The disadvantage of PHP, if it has one, is that it doesn't work away from a web server -- Perl and Python are general purpose languages, and although you can write in a dialect like embPerl (embedded in a page of HTML, much as PHP is), you aren't restricted to only working with a web server as a front end: it's possible to develop command-line or GUI applications in the more general programming languages.

Web application frameworks

Websites fall into several different categories. There are simple collections of documents (like most personal sites and basic company brochure sites). There are sites with some dynamic content -- typically forms that capture customer information (for example, to subscribe to a mailing list or send feedback). There are publication sites that regularly add new content, which is delivered with additional navigation menus and bells and whistles (such as banner ads), which may be integrated via a macro system like PHP. But at the opposite end there are sophisticated sites which provide the user interface to an underlying application.

In the old days before PCs, computing was dominated by mainframes. Mainframe terminals presented the user with a form. The user filled in some fields, hit a "send" button, and the form would be despatched back to the mainframe, which would process it through a program and generate a new form which would be sent to the user's terminal. Web applications follow this same 1960s timeshare mainframe model, with the proviso that today's users expect at least the illusion of interactivity: something on the web server needs to keep track of the state of the user's session, so that each successive screen they see corresponds to the successive state of a program they're interacting with.
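The session-tracking part of that model can be sketched in a few lines. In the Python below, the server hands each new visitor a random session identifier (in practice carried in a cookie or a hidden form field) and keeps that visitor's state in a table keyed by the identifier. Everything here, including the class name SessionStore, the step counter and the "Screen N" output, is a made-up illustration rather than any particular framework's design.

```python
import secrets

class SessionStore:
    """Keep per-visitor state on the server, keyed by a random identifier."""

    def __init__(self):
        self._sessions = {}

    def open(self, session_id=None):
        """Return (session_id, state); unknown or missing ids start afresh."""
        if session_id not in self._sessions:
            session_id = secrets.token_hex(16)
            self._sessions[session_id] = {"step": 0, "fields": {}}
        return session_id, self._sessions[session_id]

def handle_request(store, session_id, form):
    """Process one submitted form and return (session_id, next screen)."""
    sid, state = store.open(session_id)
    state["fields"].update(form)   # remember what the user has entered so far
    state["step"] += 1             # advance the "program" by one screen
    return sid, "Screen %d" % state["step"]

store = SessionStore()
sid, page = handle_request(store, None, {"name": "Ann"})
sid, page = handle_request(store, sid, {"city": "Leeds"})
print(page)
```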

At their simplest, web application frameworks are designed to support this model of operation. Typically there will be an underlying relational database (for example, DB2 in the case of IBM's WebSphere Application Server, or MySQL in the case of Zope). There will be some sort of object request broker that maintains persistent software objects (possibly by saving them in the database), allows them to call one another, and allows them to call on external services. There may be a "middleware" layer like IBM's MQSeries, a message queueing system that lets programs on different types of operating system exchange messages in a common format. (This provides access to non-web aware databases or old inventory and accounting systems.) There will be a programming language that can be called directly from the web server in which the application framework is embedded, and which can talk to the database by way of the ORB (IBM's WebSphere uses Java; Zope uses Python). And there will be a potload of classes that can be used to rapidly construct applications that present their graphical user interface to the public by way of a web server.

(An optional extra is an integrated development environment. IBM uses VisualAge for Java to provide an IDE for their WebSphere range of products; Zope doesn't yet have an IDE, but work is in progress on modifying the open source Mozilla browser's page editor to support Zope directly, adding Python editing and menus for accessing the Zope configuration database. The tool, ZopeStudio, should be in a usable condition within the next year.)

IBM's WebSphere and Digital Creations' Zope are two of the more obvious web application frameworks available for Linux. WebSphere is a bit of a beast. Consisting of more than a dozen different components, and hosted on numerous platforms, WebSphere is intended as "glue" that allows any existing application to present a face to the world through web pages. It is extensible in Java (support for Enterprise JavaBeans is included) and uses the MQSeries middleware to communicate with a wide range of legacy systems. WebSphere Business Components provide a wide range of Java classes suitable for assembling e-commerce sites, and WebSphere Edge Server provides clustering and fail-over for very high volume sites. In fact, WebSphere is intimidatingly large: you could get lost in there. (For a chance to experience the dizzying array of possibilities, see IBM's home for WebSphere.)

Zope is a more approachable package -- and it's open source, too. When you install Zope, rather than running as an Apache module, you are installing ZServer -- a separate network server that handles Zope requests. This is linked to the Zope Core, an ORB that provides access to Zope's classes, and to the Z Object Database, where persistent objects exist between calls to the Zope server. ZServer can also talk to relational databases and the Linux filesystem.

In use, you interact with Zope via a web-based management interface. This gives you an outline view of your Zope website -- including directories and files, and other types of objects (such as SQL queries and scripts). You can navigate around the site, adding folders and files (and other objects such as access rules, and searchable catalogs). It's all a little confusing at first; the best way to approach Zope is by installing it and then working through the tutorial in the Zope Content Manager's Guide (which explains how content is added and administered under Zope).

The commonest task in Zope is creating DTML files (document template markup language). DTML resembles an HTML macro language like PHP, but rather than executing external commands or being interpreted, commands trigger methods in external objects stored in the Zope system.

Once you're confident with adding content (HTML and DTML commands), it's worth looking into DTML methods. DTML methods let you store XML or HTML or DTML information in an object which can be incorporated elsewhere in your website. It's also possible to write and link in external methods, which are essentially blobs of Python code, the output of which is directly accessible to Zope objects.

A lot of the learning curve in Zope is getting to grips with its terminology. Perhaps the best way to think of it is that it's a class hierarchy; each folder is a class, and DTML pages within the folder are instances of that class. They can also draw upon associated methods defined within the class (which is where the DTML methods and external methods come in), and inherit attributes from higher up the hierarchy.
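The idea of inheriting attributes from higher up the folder hierarchy can be illustrated with a toy model. The Python below is not Zope's API or implementation; the Folder class, its lookup method and the attribute names are all invented to show the lookup rule: if a folder doesn't define something, ask its parent, and so on up to the root.

```python
class Folder:
    """A toy container that inherits attributes from its enclosing folders."""

    def __init__(self, name, parent=None, **attributes):
        self.name = name
        self.parent = parent
        self.attributes = dict(attributes)

    def lookup(self, key):
        """Find key here or in the nearest enclosing folder that defines it."""
        folder = self
        while folder is not None:
            if key in folder.attributes:
                return folder.attributes[key]
            folder = folder.parent
        raise KeyError(key)

root = Folder("root", site_title="Example Site", footer="Standard footer")
news = Folder("news", parent=root, footer="News footer")
archive = Folder("archive", parent=news)

# The archive folder defines nothing itself: it picks up the footer from
# "news" and the site title from the root.
print(archive.lookup("footer"), "/", archive.lookup("site_title"))
```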

Scripting tools

When building a web site that delivers dynamic content, ultimately you will need to do some programming to glue everything together. While it's possible to use a web application framework like Zope to automate most of the workflow on a site, you will still need some elements of customisation: and for this, you need to use a real programming language.

The so-called scripting languages -- probably better described as VHLLs, or very high level languages -- are the tools of choice for web automation. While you can work in server-side Java or C++ or whatever, the VHLLs are typically semantically dense (you get a lot of work done per line of code), are well-suited to the string manipulation and operating system interactions typically required by web applications, and are easy to get started with.

On Linux, right now there are two serious contenders for website automation: Perl, and the less well known but equally powerful Python.

Perl is the duct tape of the internet, for good reason; it's one of the most commonly-used tools for constructing dynamic web sites and custom web-applications. Its name is an acronym for Practical Extraction and Report Language -- the management-friendly euphemism for Pathologically Eclectic Rubbish Lister (or UNIX swiss army chainsaw). Perl is a big, complex semi-compiled language with hooks for interfacing to relational database engines, GUI creation toolkits, and the CGI API (CGI is an acronym for Common Gateway Interface; it's a standard for allowing web servers to invoke external programs, passing them parameters returned from HTML forms). If you've worked with C or C++ or the UNIX shells, Perl's syntax should look passingly familiar; this isn't an accident. Perl was designed by a linguist to be a kind of UNIX creole -- a mixture of the good features of all the other languages that make UNIX such a stew of scripting systems. Perl supports object-oriented programming, and some very large applications get written in it -- but it's also easy to write one-liners for automating simple tasks.

One of Perl's main strengths is the existence of a huge repository of reusable software modules called CPAN (the Comprehensive Perl Archive Network). You can find your nearest CPAN site by way of www.perl.com. CPAN includes a hugely useful module called CGI.pm (the .pm suffix means "perl module") that automates many of the more annoying aspects of writing Perl programs that get data from HTML forms by way of CGI, and produce output in HTML that gets sent to a web browser. A number of other modules are also indispensable to the toolkit of a Perl web developer. LWP (Lib-WWW-Perl) supplies tools for getting files over the web. Libnet provides a more generalised interface to other internet services delivered via TCP/IP, and Mail::Tools lets you parse, assemble, and send email messages from Perl. There's a whole slew of HTML modules including tools to parse HTML (and XML) documents, and tools to assemble them from scratch. And a lot of off-the-shelf CGI programs are already available, written in Perl (although I'd strongly recommend getting an expert to vet any third-party CGI script you plan to deploy on your own server, as the quality may vary).

Python -- named after Monty Python -- is a newer language, and a couple of years behind Perl in terms of adoption and uptake. It's a VHLL like Perl, albeit of somewhat more regular syntax: coming to Python from Perl, the most striking thing about it is how clean it looks. Perl uses syntactic sugar to denote variable types, and uses references (pointers) to construct complex data structures. Python doesn't use pointers, but permits complex data types to be built up in a simple and elegant manner. Rather than being designed as a creole, Python set out to invent something cleaner and better than the existing scripting languages, and to a very large extent it has succeeded. You can find out a lot more at python.org.

Like Perl, Python is a semi-interpreted language. Unlike Perl, Python has a mature bytecode compiler and can also emit bytecode that runs on any Java virtual machine and calls upon Java class libraries. While Perl has some ability to interoperate with Java, Python's Java integration looks extremely tight.

Where Perl provides a plethora of ways of slicing and dicing strings (including BASIC constructs like substr(), and regular expression concepts stolen from awk and Icon), Python simply allows you to treat a string as a subscriptable array of characters (and provides a regular expression parser in an external module, in case you need that additional power). Python's powerful and general module system supports object oriented programming more elegantly than Perl; in the forthcoming Python 2.0 release, all data types will be first-class objects, from strings upward, and even today it looks as if it's easier to maintain a large software project implemented in Python than one in Perl.
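A short example shows both styles side by side: plain subscripting and slicing for simple jobs, and the re module when pattern matching is needed. The URL is just a sample value, and the snippet is written for a current Python interpreter rather than the releases available when this was written.

```python
import re

url = "http://www.example.org/docs/index.html"

# A string behaves like a subscriptable sequence of characters.
first_char = url[0]      # a single character
scheme = url[:4]         # the first four characters
extension = url[-5:]     # the last five characters

# Regular expressions live in a separate module, imported when needed.
match = re.match(r"(\w+)://([^/]+)(/.*)", url)
host, path = match.group(2), match.group(3)

print(scheme, extension, host, path)
```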

To compare Perl and Python is perhaps a trifle unfair, but to stretch a metaphor, if Perl was ANSI C, Python would be a version of Pascal like Borland's Turbo Pascal 5, which addressed the core problems of Pascal (like the lack of a string data type, the lack of variable length parameter lists, the lack of unions, and so on) while leaving in place Pascal's innately superior features (procedure/subroutine definitions that can be nested with scope limited to the enclosing procedure, for example).

If you're commissioning work on a strongly interactive, program-driven website, Python might well be a superior technology to go with in terms of clean design, maintainability, and the ability to glue it seamlessly into a web application framework like Zope. However, there are ten times as many Perl programmers out there (some of them good, some of them catastrophically bad), so Perl tends to be the de-facto choice for big websites.

