The Case For Cruft


The world we know has quickly changed from one of 10-digit phone numbers to significantly longer URIs, URLs for us common folk. For some members of the web community, especially the blogging and web standards aficionados, this outgrowth of the URL as a universal identifier has led to a new feeling of urgency to protect the sanctity of the URI.

What is Cruft?

Generally speaking, cruft refers to "Excess; superfluous junk" (see Dictionary.com). In the web purist's world, however, a slightly more specific definition has been applied -- "all that junk the average web user doesn't care about that makes URL's long and annoying" (as defined by Mark Pilgrim, a respected gladiator for the cause).

To make this somewhat more relevant, let's take a quick look at an example of a crufty URL and an un-crufted one:

Crufty

http://cnn.com/news/2004/0004.html

Cruft-Free

http://cnn.com/news/2004/4/

Why Cruft Matters

I tried to formulate my own respectable list of reasons why Cruft was a bad thing, but I found a valuable, clear definition over at Oli Studholme's blog, Boblet.

An important but overlooked aspect of websites is the URL, or Uniform Resource Locator. This is the 'web address', usually of a page on the internet. While it seems minor, this is part of the page's interface, and some simple rules can make a big difference in ease of use. Important ease-of-use elements for people are the URL should be:

easy to type

easy to remember

short if possible

'hackable' ie predictable enough to guess

and permanent

Source: Boblet

Why Cruft Doesn't Matter (IMHO)

All this being said, I understand the merits of the process, but I certainly don't understand the need for it from a strictly "consumer" point of view. Let's look at the different aspects of a truly cruft-less URL.

Easy To Type

In general, the less someone has to type, the better. This is true for many very good reasons, not least our generally atrocious spelling habits and our tendency toward other forms of error.

Truly easy to type URLs only matter as far as the domain name itself is concerned. Why? Simple -- most people don't type URLs, they click on them. Don't believe me? Consider all the following tools that allow us to NOT type URLs:

Anti-Typing Techniques

Copy / Paste
Auto-Complete
Drag-and-Drop

Anti-Typing Tools and Technologies

Search Engines
Bookmarks
Aggregators (E-mail, NNTP, RSS, etc.)
Contact Managers (VCard, Plaxo, etc.)

If you can honestly tell me that the average person doesn't spend the majority of their computer- and Internet-related time utilizing or operating one of these techniques or applications then I will concede that an easy-to-type URL is important.

Easy To Remember

You should note that above, I made a specific distinction between an easy-to-remember URL and an easy-to-remember domain. Unlike a fully-qualified URL, a domain name is very important -- it's the entrance sign to the highway that is your site.

Again, if you doubt that the domain is all that matters, consider this list of terms:

Domain
Domain Name
Address
Web Address
Web Site
URL

I have heard people, both technical and non-technical, use every one of the above terms to refer to a web site: not any specific internal page, just the plain home page that greets them. You'll notice that URL is listed as one of those terms.

The honest truth is that the most important thing for a person to remember is the way to the front of your site. A well thought out Information Architecture, Site Search, and Site Map should be able to lead people to essentially everything within the site itself.

Short If Possible

The length of the URL is certainly a matter to be concerned about. Why? Browsers cannot handle an infinitely long URL. In fact, just based on Internet Explorer (a safe assumption purely on browser penetration), the maximum length is 2,083 characters.

Before we go further, let's put that limit in perspective. A well-written English sentence should be approximately 20 words in length. Assuming 5 characters per word, that's only 100 characters. In short, 2,083 characters is a huge length when you really think about it: room for about 20 such sentences.

Some might argue that URLs need to be short to make them easier to type or remember, but we’ve already seen why that's really not relevant. Others might argue that it is more efficient to store a shorter URL than a longer one. And they are absolutely right. BUT, those people tend to forget two incredibly unstoppable trends in the computer industry: BROADBAND and MASS STORAGE. The technology industry is delivering Terabyte drives to consumers' desktops and living rooms and ultra-high speed connections through their satellites and walls. Exactly what are we saving up for by collapsing our URLs?

Search engines, and search in general, have changed the landscape. It is no wonder that one of the primary interfaces to information (going back to the initial library systems through to Google and beyond) is search. We write things down, store them away, and index them because we openly admit that we can't remember things and acknowledge that memory is collective far more than individual. History books are full of lessons and compromises of consensus. And search is only getting more and more detailed by the second.

Advocates of the Cruft-Free URL often make time and date stamps an integral part of the URL schema. Unfortunately, people are very imprecise with time. Consider the profound degree of tardiness in your average business day (an environment ruled by time) and your average person's inability to remember what they did last week. Is building this information into one's URL really making it any better? Doesn't the other meta-data associated with time-related content suffice (Date Posted, Date Last Modified, File Creation and Modification times, Server Logs, etc.)?

Some argue that URLs need to be short so they print well. Who prints anymore anyway? See the points relating to "Easy To Type". When it comes to URL length, size truly does not matter.

Hackable (i.e., Predictable Enough To Guess)

Of all the reasons provided for Cruft-Free URLs, I think this is the only one that I agree with whole-heartedly. Of course, I would contest that the removal of cruft is not required to accomplish this. Imagine this scenario:

Requested URL

http://socialtwister.com/news/whathappened.html

The problem with the above URL is that it may or may not exist. Note the .html is crufty, by definition. How can this be resolved and remain hackable? Simple -- use a search-sensitive error handler. What does that mean?

All web servers can be configured to handle 404 errors either with a standard page or by processing those errors in a more useful manner. In the example above, two scenarios are possible:

/news exists:

In this scenario, the user could be directed to the default page or template for the news section, which immediately gives them alternatives.

/news does not exist:

In the event that a file was requested and the directory does not exist, the remaining parts of the missing URL, from the DOMAIN name onwards, can quickly and easily be converted into a search string such as “news + whathappened”. This is not only painless for the user, but leaves them with more information without requiring detailed preparation and management of hackable pages down the URL string.
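To make the two scenarios concrete, here is a minimal sketch in Python of what such a handler might decide. It is an illustration only: the function name handle_missing and the known_dirs set are my own assumptions, and a real site would wire this logic into its web server's 404 configuration and its own search page.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def handle_missing(url, known_dirs):
    """Decide how to respond to a URL that produced a 404.

    known_dirs is a hypothetical set of directory paths that exist on the
    site, e.g. {"/news"}. Returns an (action, value) pair.
    """
    path = unquote(urlsplit(url).path)
    parent = posixpath.dirname(path.rstrip("/"))

    # Scenario 1: the directory exists, so send the user to its default page.
    if parent != "/" and parent in known_dirs:
        return ("redirect", parent + "/")

    # Scenario 2: the directory does not exist, so turn everything after the
    # domain name into search terms, dropping file extensions such as .html.
    terms = []
    for segment in path.split("/"):
        if segment:
            stem, _extension = posixpath.splitext(segment)
            terms.append(stem)
    return ("search", " + ".join(terms))


missing = "http://socialtwister.com/news/whathappened.html"
print(handle_missing(missing, {"/news"}))  # /news exists
print(handle_missing(missing, set()))      # /news does not exist
```

With /news present the sketch returns a redirect to /news/; without it, the same request becomes the search string "news + whathappened", matching the example above.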

Permanent

Nothing in this world is permanent by any means. The assumption of most content publishers is that their content and archival system will survive all changes for the near future. It is noble to attempt to create truly universal, permanent resources on the Internet. However, it requires a truly deep financial and technical commitment to make this a reality -- something most people do not possess.

The additional argument for permanence is what is called "Future-Proofing" URLs. The notion here is that since web technologies and developers' obsessions change constantly over time, any information stored in a URL that identifies either the application server or server state is both meaningless and potentially harmful.

In the example provided just before, I noted that a .html extension was seen as cruft. This is the reason. If in 3 months the site changed and attempted to utilize ColdFusion (.cfm), PHP (.php), JavaServer Pages (.jsp), or some other technology, all references to the .html file would be broken. In theory, this is sound reasoning. Unfortunately for that argument, there are plenty of ways to keep this from becoming a larger problem, ironically enough broached by the same people moving away from crufty URLs.

Three immediate solutions are available for the above scenario:

Solution 1: URL Rewriting redirects every request for a .html file to the same path with the new file extension.
Solution 2: Dynamic Error Handling as mentioned above.
Solution 3: Application Server Mapping allows, in many cases, for an application server to be mapped to an additional file extension. For example, ColdFusion can be assigned to process not only .cfm but also .html files.
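The first option in the list above, URL rewriting, can be sketched in a few lines of Python. Treat this as a rough illustration rather than a recipe: in practice the rule would normally live in the web server's own rewrite configuration, and both the rewrite_legacy_url name and the choice of .php as the new extension are assumptions made purely for the example.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def rewrite_legacy_url(url, old_ext=".html", new_ext=".php"):
    """Return the redirect target for a legacy URL, or None if no rule applies.

    new_ext=".php" is a hypothetical choice; any new extension works.
    """
    parts = urlsplit(url)
    stem, extension = posixpath.splitext(parts.path)
    if extension.lower() != old_ext:
        return None  # not a legacy URL, leave it alone
    # Swap only the extension; host, query string, and fragment are kept.
    return urlunsplit(parts._replace(path=stem + new_ext))


print(rewrite_legacy_url("http://socialtwister.com/news/whathappened.html"))
print(rewrite_legacy_url("http://cnn.com/news/2004/4/"))
```

The old .html links keep working because every request for them is answered with a redirect to the new location, while URLs without the old extension are left untouched.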

Granted, these techniques are not going to resolve the issues relating to passing application state and configuration data around in the URL, but that is generally frowned upon as a development practice and quite possibly ignorable from an application development point of view.

Summarizing

In general, the transition of a system from chaos to order is both desirable and fruitful. The efforts opposing cruft are worthy, and worth considering as part of the design and development of a URL-rich application. However, no system of organization should compromise or otherwise stifle innovation when other forgiving and flexible alternatives exist.