
Cruft Redux


It seems that my recent rant on Cruft has generated some feedback, which is certainly welcome. Oli Studholme has posted a thoughtful response to the points I laid out relating to Cruft. Note that Oli's entry is conveniently located at http://oli.boblet.net/2004/01/29/cruft, a URL I agree is both readable and self-describing.

Before I address Oli's points directly, I think it's important to make clear that I am not opposed to developing Cruft-Free URLs from a systematic point of view. I am, however, addressing the consumer side of the matter: specifically, the individual end user's concern about long URLs, and their supposed "ability" to digest these matters well enough to generate naming conventions on their own.

So on to Oli's issues regarding my initial assertions:

Cruft Example

Oli points out that the examples of De-Crufting I used were not the most appropriate, and he has posted some much more interesting examples of what the De-Cruft gang is really trying to achieve. I completely agree that my example was sparse, to say the least, but it was not meant to mislead.

re: re: Short if Possible

Oli makes a very valid point regarding the "E-mail Problem":

If you send a URL via email and it’s longer than about 78 characters, it’ll get split over more than one line. For some reason email clients are really bad at working out where the URL should end, and often stop it at the end of the first line. This makes a broken link that’ll lead nowhere.

I concede this is an issue, NOW. My question is what the future holds in this arena. It is likely that E-mail clients, as well as other server-side processors like those used in blog comments, will become MORE intelligent in the future and be able to quickly and easily discern a URL from a random SPAM identifier. I'll also raise the specter of HTML E-mail now as well, which, as everyone knows, is much more adept at properly displaying long links. Of course, most people who are opposed to Cruft also hate HTML E-mail, but that's an entirely different issue.
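To see the problem concretely, here is a minimal sketch of what a naive 78-column hard wrap does to a long, crufty URL. The message and URL are made up for illustration:

```python
# A minimal sketch of the e-mail wrap problem: a hard wrap at 78
# columns splits a long URL across lines, and a client that
# linkifies only up to the line break produces a dead link.
# The URL here is invented for illustration.
import textwrap

body = ("Check this out: http://example.com/store/catalog.cfm"
        "?category=desk-lamps&sort=price&page=3&session=ABCDEF123456")

for line in textwrap.wrap(body, width=78):
    print(line)
# The URL ends up split across two lines; clicking the first
# fragment leads nowhere.
```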

Greg also cites broadband and big hard disk in relation to URL (byte) size, which I think is a non-issue as URLs are so tiny anyway. Still I wonder if he’d be surprised that in the USA 58% of users are still using ‘narrowband’ (source: Nielsen/Netratings 2003.11 pdf)? His claim that a $1,200USD hard disk is a ‘consumer’ product is also just funny (maybe in a few years ;-)

I completely agree that size (in bytes) is a meaningless factor and wanted to make that clear. My goal in pointing out the LaCie drive was to show that the cost per megabyte has dropped considerably, and that trend will continue. A few years ago, I never thought about a Terabyte desktop drive, but it's here. As for the widespread adoption of broadband, the 58% is not a surprise at all. The important thing to remember is that we are still most likely in the Early Adopter stage for broadband, and that already shows roughly 40% uptake. If you couple this with the long-term consumer goals of mega-industries like cable and film, you can expect to see a great surge in the subsidization of broadband specifically to deliver their media.

Greg says search engines are another reason for not fearing long URLs. However many search engines are ‘unwilling’ to index URLs generated by scripts (the main long URL culprits), usually visible by “&” and “?”. This is because the script could generate a huge number of URLs, for example MapQuest URLs.

I'm actually glad that Oli mentioned this. This was definitely a problem in the past and, to be truthful, it CAN still be a problem now. A quick search on the matter shows, however, that Google will indeed index dynamic pages. In fact, some of those SEO folks are well aware of how to deal with it.

Of course, to clarify, the ? and & characters are relatively easy to avoid if one knows what one is doing. Developers of dynamic web pages have known for quite some time that it is possible to create "Search Engine-Safe" URLs. To summarize the process, all ? and & characters are converted to / on the way out and then parsed back out of the path on the server. Sure, it's work, but at least it's not me, the end user, doing it.
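To make the process concrete, here is a minimal sketch of the idea in Python; the function names are mine, not any particular framework's:

```python
# A minimal sketch of "Search Engine-Safe" URLs: query-string
# parameters are folded into the path on the way out, and parsed
# back into a dict on the way in. Names are illustrative only.

def make_safe(script, params):
    """Turn page.cfm?id=42&cat=lamps into page.cfm/id/42/cat/lamps."""
    parts = []
    for key, value in params.items():
        parts.extend([key, str(value)])
    return script + "/" + "/".join(parts)

def parse_safe(path):
    """Recover the script name and parameters from a safe URL path."""
    script, *rest = path.split("/")
    params = dict(zip(rest[0::2], rest[1::2]))
    return script, params

print(make_safe("page.cfm", {"id": 42, "cat": "lamps"}))
# -> page.cfm/id/42/cat/lamps
print(parse_safe("page.cfm/id/42/cat/lamps"))
# -> ('page.cfm', {'id': '42', 'cat': 'lamps'})
```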

The last point regarding this is that, in the long run, it is not hard to imagine even better algorithms for determining when to pursue a dynamic set of pages for indexing and when not to. Blacklists of sites already exist, though not necessarily overtly, and there is room for a gray list that applies special rules. The growing competition for the "Top Search Engine" spot will surely drive more and more search engines to absorb and index an even wider range of web sites. Currently, only a small fraction of the web's total pages are properly indexed and accounted for, and that fraction only shrinks if search engines do not start dealing more effectively with the architecture of the web and its exponential growth.

Greg wonders why using a date-stamp in URLs would make them better. The main reason is the time you publish a story is one of the only things that won’t change, so it’s a good thing to make a unique filename from.

[...]

A URL like /archives/2004/01/24/filename is more usable than /archives/000001.html because you can tell the date of publication before you even see the page, and you (hopefully) can access the day, month and year indexes just by deleting parts of the URL.

Yup. I do wonder. The reason I wonder is simple: I don't believe the "consumer" of fully qualified URLs is human at all. As I outlined previously, systems are being designed to index and analyze the content of those individual nodes to the point that the final URL almost seems not to matter. Another trend, which I failed to mention, is the increasing move to web services, Flash remoting, aggregation, and rich Internet applications. All of these technologies also serve to further hide the "URL" associated with things and attempt to unify the experience for the end user through other, more intuitive means.

Further to the point -- who exactly hacks URLs? I'm admittedly a geek, and I know that I do hack URLs from time to time for a number of different reasons, from testing for errors to exploring for new content. But is this the typical behavior of the non-geek? I can't really substantiate that claim, but being inside one circle often makes it difficult to understand the point of view of another. Could we all be misled?
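For the record, here is a minimal sketch of what "hacking" a URL by truncation looks like; the URL is Oli's example, but the helper function is my own invention:

```python
# A minimal sketch of URL "hacking": trimming path segments one at
# a time to walk up a date-based archive. The helper name is
# illustrative, not from any library beyond urllib.
from urllib.parse import urlsplit, urlunsplit

def parent_urls(url):
    """Yield each ancestor of a URL by deleting trailing path segments."""
    scheme, host, path, _query, _frag = urlsplit(url)
    segments = [s for s in path.split("/") if s]
    while segments:
        segments.pop()
        new_path = "/" + "/".join(segments) + ("/" if segments else "")
        yield urlunsplit((scheme, host, new_path, "", ""))

for url in parent_urls("http://oli.boblet.net/2004/01/29/cruft"):
    print(url)
# http://oli.boblet.net/2004/01/29/
# http://oli.boblet.net/2004/01/
# http://oli.boblet.net/2004/
# http://oli.boblet.net/
```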

Besides, do we all publish "stories"? Definitely not.

re: re: Hackable

Greg suggests using a search-sensitive error handler to duplicate the hackable nature of non-crufty URLs, to avoid detailed preparation and management of hackable pages down the URL string.

Oli's point seems to stem mostly from his use of Movable Type. As he notes, MT already generates these hackable pages for you. Movable Type can be used on a number of different platforms -- this blog runs on Linux, but it could be running on anything, really -- and MT has been deployed to generate not only .html but also .php and pages for a number of other application servers and languages.

Oli asks that I present something that makes this "easy". I can attest that this is very simple to build if someone has the desire to do so. I've implemented it for ColdFusion sites, and it takes literally a few lines of code to handle the 404 and redirect. The question is this -- why let the software limit you? Clearly there are other technologies available that make this sort of behavior possible; why not use one if the current one doesn't fit?
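For illustration, here is a minimal sketch of the idea in Python rather than ColdFusion; the handler and the /search endpoint are hypothetical, not any particular site's:

```python
# A minimal sketch of a search-sensitive 404 handler: the missing
# URL's path segments become search terms, and the visitor is
# redirected to a site search instead of a dead end. The /search
# endpoint is an assumption for illustration.
from urllib.parse import quote_plus

def handle_404(missing_path):
    """Turn a missing URL's path segments into a site-search redirect."""
    # e.g. /archives/2004/01/24/ -> search terms "archives 2004 01 24"
    terms = " ".join(s for s in missing_path.split("/") if s)
    return "302 Found", {"Location": "/search?q=" + quote_plus(terms)}

status, headers = handle_404("/archives/2004/01/24/")
print(status, headers["Location"])
# 302 Found /search?q=archives+2004+01+24
```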

404 errors are definitely not painless for the user, but Cruft-Free URLs don't provide a solution in any way; in fact, it's the work behind generating hackable URLs that actually does the heavy lifting. Of course, to my mind there's no excuse for a 404 error at all if you can prevent it (not that I don't have them on my own sites). I'd also say that if you're trying to "sell" anything, be it product or idea, it's your responsibility to always deliver something useful.

But like I said, hacking really has nothing to do with Cruft.

re: re: Permanent

I think Oli missed my point about permanence. The goal of highlighting the problems with permanence was to stress that it's not "automatic" and certainly not "free". I have no issue with servers handling tasks like URL rewriting and dynamic 404s. The issue I see with Oli's dismissal that "This raises the bar about as high as having a website to begin with" is that the whole point of bothering to make things permanent and future-proof is the assumption of some social responsibility to the web as a whole -- a responsibility to ensure that no node fails or disappears.
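As a small illustration of what that upkeep looks like in practice, here is a sketch of a hand-maintained redirect map for retired URLs; the new path is invented for the example:

```python
# A minimal sketch of what "permanence" actually costs: a mapping
# from retired URLs to their new homes, maintained by hand and
# served as 301 redirects. The target path is made up.
LEGACY_REDIRECTS = {
    "/archives/000042.html": "/archives/2004/01/29/cruft-redux/",
}

def redirect_for(path):
    """Return a 301 target for a retired URL, or None if unknown."""
    return LEGACY_REDIRECTS.get(path)

print(redirect_for("/archives/000042.html"))
# -> /archives/2004/01/29/cruft-redux/
```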

re: re: Summary

I still can’t see how creating non-crufty URLs prevents ‘innovation’, or anything else except (hopefully) link rot and user confusion. Also, while I agree that alternatives are available, none of them are as easy as simply setting a weblog up with a non-crufty URL scheme from the start.

I think the 5 reasons for non-crufty URLs that I gave are basically all about usability. Instead of saying “non-crufty URLs” maybe we should talk about “usable URLs”. I personally think a URL like http://oli.boblet.net/2004/01/29/cruft is more usable than http://socialtwister.com/archives/000042.html. Before even looking at the page I know:

  • when it was published
  • a little about the page’s topic
  • and (generally) a little about the structure of the website

There are a few problems with Oli's assumptions here:

First, I don't actually believe that such URLs are necessarily easy to create. It sure seems like more work than I should be bothered with. If I have a personal site and am making a diary-like entry, do I really want to sit back and reflect on what term best describes my two-line note about how crappy the weather is? And what happens when I post frequently throughout the day? Say I make one post on blogging, then another: I'd have to remember all the terms I've already used to avoid collisions, which would otherwise result in overwriting an old entry.

More to the point, what I mean by stopping innovation is that if I have to stop and think before I can simply jot something down, then the naming scheme is not something that aids the process but rather something that hinders it. I can send an e-mail from my cell phone to my blog and have it post, and that works precisely because I don't have to configure names and URLs to make it happen. If the desired effect is to have intelligence in the URL, then the software that generates the URLs needs to be smarter, not the people using it.
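If the software were doing the naming for you, it might look something like this minimal sketch; the slug rules and the collision-suffix scheme are my own assumptions, not any particular blog engine's behavior:

```python
# A minimal sketch of software doing the naming: derive a slug
# from the entry title and append a numeric suffix on collision,
# so the author never has to track past names. Rules are assumed.
import re

existing = set()  # slugs already used on the site

def make_slug(title):
    """Lowercase, replace punctuation runs with hyphens, de-collide."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "entry"
    candidate, n = slug, 2
    while candidate in existing:
        candidate = "%s-%d" % (slug, n)
        n += 1
    existing.add(candidate)
    return candidate

print(make_slug("Cruft Redux"))   # -> cruft-redux
print(make_slug("Cruft Redux!"))  # -> cruft-redux-2
```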

Second, Oli states that his decisions are based on usability. I am extremely reluctant to say that all web authors are usability experts, or that they should be required to be. At the same time, I don't even know what "usable" means here, since that's completely subjective. And is it really clear what the topic is? I would argue that, without context, it's far easier to come to the wrong conclusion about a specific topic or keyword's meaning than the right one. Since your domain name and other URL identifiers give me no context, the term "cruft" could just as well be referring to the dirt you found between your keyboard's keys.

Third, Oli assumes that URLs unto themselves need to be readable, mostly, I think, because he assumes URLs are presented without context. Intuitively, this does not seem to be the case -- URLs are hardly ever sent without context. When my aggregator pulls the latest feeds, it sends a description. When I search on Google for something, I get an abstract. When someone mails me a link, whether I know them or not, they usually include a note that tells me what it's about. I don't need to analyze the URL to know this; it's told to me. Nor do I need to assume anything (most likely wrong anyway) about the structure, since the site itself exists to tell me about the content, far more so than the URLs do. And for those rare circumstances when I do get a link without context, it's almost always from a trusted source -- and that trust compels me to visit the link, regardless of the URL. When I get links from untrusted sources, they have to "sell" me on why I should visit; they don't count on me digesting the URL -- even more context.

Fourth, Cruft applies to all URLs, regardless of purpose. Most of these discussions center on blogs and posts, but that's a vanishingly small part of the web we know today. I am not sure how much most people have had to deal with non-blog-like uses, but from a programmer's perspective, there's nothing quite so elegant as a unified scheme for locating resources. Do we focus too much on what the end user sees, URL-wise? Of course not, because the other part of our job is to present data to them intuitively -- and that almost always means NOT showing them anything about the URL. When John Q. comes to our site looking for desk lamps, we don't really want him to type in domain.com/desklamps. We want him on the front page, where we provide him with a number of different mechanisms (ads, directories, search) to locate anything and everything he wants.

In all, the arguments for Cruft and against it are really moot. I think that if humans desire more readable resource locators, then they need to work with software that can systematically generate those addresses. This process need not be unattended, and should definitely allow for augmentation and further modification by the author. Computer systems are designed to chew through these types of resources, and the tools and applications we make and interact with are meant to ease our use of them. Arguments for fully self-administered readability in URIs seem to assume a much higher commitment to the mechanics than to the content -- but I think the content is what matters to people; let the machines sort out the rest.