Susan Boyle is 5 times more famous than Elaine Paige

If you still haven’t heard about Susan Boyle, you truly are, dear sir, living under a rock. Here’s the video that started it all: http://www.youtube.com/watch?v=9lp0IWv8QZ - in which Susan Boyle desires to be more popular than Elaine Paige. Following an incredible debut on an American Idol-esque show in Britain, Susan became an instant YouTube celebrity. A little bit of analysis (code below) shows Susan’s lovely voice has now reached over 26M ears, whereas Elaine Paige, with over 100 videos on YouTube, has only managed to woo an audience of 5M.

#!/usr/bin/perl -w
use LWP::Simple;

# 7 pages of Elaine Paige's videos
my @elaine = qw(
    http://www.youtube.com/results?search_type=&search_query=elaine+page&aq=f
    http://www.youtube.com/results?search_query=elaine+page&page=2
    http://www.youtube.com/results?search_query=elaine+page&page=3
    http://www.youtube.com/results?search_query=elaine+page&page=4
    http://www.youtube.com/results?search_query=elaine+page&page=5
    http://www.youtube.com/results?search_query=elaine+page&page=6
    http://www.youtube.com/results?search_query=elaine+page&page=7
);

# One page of Susan Boyle's videos
my @susan = qw(
    http://www.youtube.com/results?search_type=&search_query=Susan+Boyle&aq=f
);

sub total {
    my (@u) = @_;
    my @views;
    my $sum = 0;
    for my $uri (@u) {
        push @views, get($uri) =~ /([\d\,]+) views/g;
    }
    for (@views) {
        $_ =~ s/,//g;
        $sum += $_;
    }
    $sum;
}

print total(@elaine) . "\n";
print total(@susan) . "\n";

#  4,916,720
# 26,863,481

How I cleaned up 8k+ spam comments from my Wordpress blog in less than 30 minutes

I have the most permissive of the options that Wordpress supports for authentication of comments - “Comment author must fill out name and email” (neither of which is verified) - and as a result had 8k+ comments, most of which were spam, to approve this evening. Wordpress has some built-in heuristics to sideline comments for approval that seem to work fairly well, but still too many got approved. The sidelining/approval admin UI of Wordpress is quite nice, but only for a small number of comments. Sifting through 8k+ messages, 25 to a page, is not a decent way to spend an evening.

So, I learnt a few things about cleaning up a Wordpress blog. First, I installed Akismet. Akismet is a hosted anti-spam solution that comes in the form of a Wordpress plugin. It checks every comment against the Akismet server, which applies undisclosed methods to make a spam/non-spam determination. It’s simple to install, plugs in transparently and is highly effective at detecting spam comments. However, the false positive rate on my blog was high enough to necessitate manual oversight. More on the accuracy of Akismet later.

Wordpress keeps comments in the wp_comments table of the blog’s MySQL database. Here’s the definition of this table:

$ mysql wordpress
mysql> desc wp_comments;
+----------------------+---------------------+------+-----+---------------------+----------------+
| Field                | Type                | Null | Key | Default             | Extra          |
+----------------------+---------------------+------+-----+---------------------+----------------+
| comment_ID           | bigint(20) unsigned | NO   | PRI | NULL                | auto_increment |
| comment_post_ID      | int(11)             | NO   | MUL | 0                   |                |
| comment_author       | tinytext            | NO   |     |                     |                |
| comment_author_email | varchar(100)        | NO   |     |                     |                |
| comment_author_url   | varchar(200)        | NO   |     |                     |                |
| comment_author_IP    | varchar(100)        | NO   |     |                     |                |
| comment_date         | datetime            | NO   |     | 0000-00-00 00:00:00 |                |
| comment_date_gmt     | datetime            | NO   | MUL | 0000-00-00 00:00:00 |                |
| comment_content      | text                | NO   |     |                     |                |
| comment_karma        | int(11)             | NO   |     | 0                   |                |
| comment_approved     | varchar(20)         | NO   | MUL | 1                   |                |
| comment_agent        | varchar(255)        | NO   |     |                     |                |
| comment_type         | varchar(20)         | NO   |     |                     |                |
| comment_parent       | bigint(20)          | NO   |     | 0                   |                |
| user_id              | bigint(20)          | NO   |     | 0                   |                |
+----------------------+---------------------+------+-----+---------------------+----------------+
15 rows in set (0.00 sec)

The field comment_approved has three values: 1 means approved, 0 means awaiting moderation and ‘spam’ means marked as spam. The comment_content field contains the content of the comment, and as long as you don’t have hundreds of thousands of comments, you can use substring matching with LIKE to mark comments containing overwhelmingly spammy terms such as viagra, free slots or casino. I didn’t want to blindly mark everything containing spammy terms as spam, and found that the Akismet approval UI serves as a useful visual verification tool.

First, I checked all comments sidelined by Akismet and deleted them. Once the Akismet queue was empty, I marked various classes of comments as ’spam’ in MySQL. These appeared in Akismet UI immediately, where I verified them, kept any false positives and deleted the rest. Unlike the Wordpress approval UI, Akismet’s has a helpful “delete all” button and shows 50 to a page, which makes the visual approval process a lot quicker. Here are some SQL statements I issued to mark spam:

UPDATE wp_comments SET comment_approved = 'spam' WHERE comment_content LIKE '%coupon%';
UPDATE wp_comments SET comment_approved = 'spam' WHERE comment_content LIKE '%viagra%';
UPDATE wp_comments SET comment_approved = 'spam' WHERE comment_content LIKE '%Tramadol%';

I also noticed the use of geocities URLs and various URLs with the term “forum” in them, which I marked with:

UPDATE wp_comments SET comment_approved = 'spam' WHERE comment_author_url LIKE '%geocities%';
UPDATE wp_comments SET comment_approved = 'spam' WHERE comment_author_url LIKE '%forum%';

After running each of these, I eyeballed the Akismet approval list and mostly deleted stuff. I was done in about 30 minutes, and was quite certain I didn’t delete any real comments by mistake.
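If you want a number before you eyeball the queue, it helps to count what each pattern matches as you mark it. Here is a rough sketch of that idea in Python. It uses the built-in sqlite3 module and a toy table as a stand-in for the real MySQL database, and the sample rows and the helper name preview_and_mark are made up for illustration.

```python
import sqlite3

ALLOWED_COLUMNS = ("comment_content", "comment_author_url")

def preview_and_mark(conn, column, term):
    """Count not-yet-marked comments whose column contains term, then mark them as spam."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError("unexpected column: " + column)
    pattern = "%" + term + "%"
    where = f"{column} LIKE ? AND comment_approved != 'spam'"
    (count,) = conn.execute(
        f"SELECT COUNT(*) FROM wp_comments WHERE {where}", (pattern,)
    ).fetchone()
    conn.execute(
        f"UPDATE wp_comments SET comment_approved = 'spam' WHERE {where}",
        (pattern,),
    )
    conn.commit()
    return count

# Toy stand-in for the wp_comments table (only the columns used here).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE wp_comments (
           comment_ID INTEGER PRIMARY KEY,
           comment_author_url TEXT NOT NULL DEFAULT '',
           comment_content TEXT NOT NULL DEFAULT '',
           comment_approved TEXT NOT NULL DEFAULT '1')"""
)
conn.executemany(
    "INSERT INTO wp_comments (comment_author_url, comment_content, comment_approved)"
    " VALUES (?, ?, ?)",
    [
        ("http://example.com", "Nice post, thanks!", "0"),
        ("http://example.com", "Cheap viagra here", "0"),
        ("http://geocities.example/x", "hello", "0"),
        ("http://example.org", "VIAGRA coupon inside", "0"),
    ],
)

viagra_hits = preview_and_mark(conn, "comment_content", "viagra")
coupon_hits = preview_and_mark(conn, "comment_content", "coupon")
geocities_hits = preview_and_mark(conn, "comment_author_url", "geocities")
(remaining,) = conn.execute(
    "SELECT COUNT(*) FROM wp_comments WHERE comment_approved != 'spam'"
).fetchone()
print(viagra_hits, coupon_hits, geocities_hits, remaining)
```

The column name is checked against a short whitelist because it is spliced into the SQL text, while the search term goes through a bound parameter. A pattern that matches far more rows than expected is a hint to look closer before deleting anything.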

Akismet also installs a little button in the Wordpress comment approval UI - “check for spam”. When you click this, Akismet goes through all comments waiting for approval and marks the ones it considers bad. The marked messages are moved from the Wordpress approval list to the Akismet approval list. Once I’d deleted all spam, I wanted to see how many of the remaining 89 legit comments Akismet would mark as spam. I set comment_approved on all comments to 0 and clicked “check for spam”. 10 of 89 were detected as spam by Akismet, which works out to a false positive rate of 11.2% - too high to be trusted blindly for cleaning up an existing archive. It does do an excellent job of detecting spam, so as long as I continue to inspect the approval queue occasionally, it’s a great service, especially at the $0 price tag.

(Thanks to @ferric and @floatingatoll for various tips and pointers.)

Information Asymmetry in the Gaza / Israel conflict

Last week, after the Israel/Gaza hostilities began, and the world was rather confused about where to direct its outrage, a message from Israel’s president Shimon Peres was published on the Huffington Post. It said:

“It is the first time in the history of Israel that we, the Israelis, cannot understand the motives or the purposes of the ones who are shooting at us. It is the most unreasonable war, done by the most unreasonable warriors.”

I have little knowledge of what’s really going on in Gaza but I was more than bewildered to read this. Perverse as it may be, there’s always logic behind war. The information asymmetry around the conflict seemed immense (and still is), because Israel has severely limited foreign journalists’ access to Gaza. I asked a friend who has covered Gaza in the past and spent time on the strip to comment on this statement and the nodding support it was receiving from the media. My friend’s name has to be withheld, but I am sharing the comment, which I found to be enlightening:

This is such bullshit, it’s hard for me to believe that the guy SAID
this publicly and that people are POSTING it online. This is how we lead
to a severely misinformed world. It amazes me that people with “press”
credentials — bloggers or not — do not feel a responsibility for
balanced reporting.

Peres says: “Still I have not heard until now a single person who could
explain to us reasonably: why are they firing rockets against Israel?
What are the reasons? What is the purpose?” — Oh yes, he has. We all
have. Hamas has been very vocal about “why.”

I’m not defending Hamas here — I do think they have become too
fundamentalistic. And precisely because of that, they are doomed. But
the points Peres makes about Gaza being “free” are exactly the things I
reported on while I was there. Gaza wasn’t “free” then and things
haven’t changed much since. I keep in touch with dozens of people who
live there, and the only new thing between my return to the U.S. and now
is a new after-school program there.

Yes, Israel pulled out of Gaza. But no, Gaza is _not_ free. A state (?)
with no economic independence, no healthcare independence, no travel
independence is not free.

Peres says: “Everything can come again to normalcy. Passages: open;
economic life: free; no Israeli intervention; no Israeli participation
in any of the turnarounds in Gaza.” The word “again” perplexes me –
there has never been normalcy. Passages have not been open (they’ve been
opened and closed depending on Israel’s whims and fancies). The most
recent Hamas attacks, which allegedly began this senseless war, was
because Israel closed off one of the only two passages through which
Gaza can send out or bring in goods.

There are only two crossings in and out of the Strip — Erez and Rafah.
Israel controls both. There is only ONE for people to move in and out –
Erez. Israel can stop anyone from crossing Erez — and stops most (I say
“most” because of lack of specific stats at this point, but I can get
you those too). People stopped range from the mother of a 4-year-old who
needs his brain tumor removed to college students accepted into a
college in Israel, or medical school. (No med schools in the strip!)

Gazans have no national identity. Heck, Palestinians have no national
identity. And of course, they have no country. Besides, the West Bank
remains largely occupied. East Jerusalem remains occupied.

I do realize that things are very complicated at both ends and
black-and-white morality is easier said than done in that part of the
world. And I don’t know if any one side is responsible for the latest
series of attacks. It’s become a chicken-and-egg issue. But my point is
a simple one: Gazans need to be free. First, they need to be FREE and
then, they need HELP in becoming what they can. Passages will be opened,
Peres says. Why should free people be subject to passages that are
opened and closed by someone else??

Additionally, Israel occupied Gaza and made sure it remained dependent
on it for everything - for decades. So Gaza has very little for
self-sustenance. The airport was bombed out in the last Intifada and the
seaport is controlled by Israel. So Gaza needs to be free - the West
Bank needs to be free — the Wall needs to be redone respecting
internationally recognized boundaries — East Jerusalem needs to be
free — Palestine needs to become its own nation — and it needs help to
grow as an independent nation after being dependent for so long (and
surrounded by its occupiers!).

I also find the state of journalism in that part of the world sad. This
parachuting-in of journalists at times of conflict does _not_ work. They
stand at the border, do their glorious war-reporting and come back.
Outlets that do have bureaus there have their reporters stationed in
Jerusalem — not ONE correspondent is based in Gaza. The only folks
reporting from inside Gaza are locals — one local English service corro
for Reuters and one for APTV. That’s it. Those reporters — because they
are locals — aren’t encouraged to do any initiative reporting for fear
of bias. So the only news stories you really see come out in mainstream
(and most other?) media is about the attacks. And then you have
irresponsible bloggers like Karin Kloosterman who include “a
word-by-word transcript of the statement” and think “it will be useful
(especially) for those so removed from the conflict, yet so eager to
debate about it.”

Uh, irresponsibly posting one side of the story that is blatantly
dishonest is neither helpful to creating an informed world nor to stir
healthy debate. What IS helpful to people and for debate is in-depth
reporting of the issues from the ground. In this case, that would be
from inside the Strip.

Redhat perl. What a tragedy.

At my new startup, Slaant, we use a lot of perl. We use perl for parsing massive amounts of HTML/XML documents, we’ve written a homegrown RDF store in perl and we have a set of web applications built on Catalyst. We use OS X boxes for development and Centos 5.2 for production. Last week we deployed new hardware - over 150 cores of CPU - expecting much higher data processing throughput along the pipeline of 30 or so perl sub-systems - but the performance boost seemed marginal and perl stood out as the culprit. This was over a year of work, running on production scale hardware for the first time, and it wasn’t measuring up. We almost threw perl out of the window.

We have one lone FreeBSD box in our production environment and I happened to notice that a perl program that read JSON structures from disk files was over 100x faster on this box compared to our Centos boxes. It should’ve been bottlenecked on disk I/O, but strace showed it was burning userland CPU. Surprised, I ran the fantastic Devel::NYTProf and discovered the most expensive call, by a big margin, was a “bless”. bless!?! Perl will happily do millions of blesses a second on my 2GHz MacBook. And this was a dual 2.5GHz quad-core server. What the hell?

Some investigation revealed that there’s a long standing bug in Redhat Perl that causes *severe* performance degradation on code that uses the bless/overload combo.  The thread on this is here: https://bugzilla.redhat.com/show_bug.cgi?id=379791

In the thread, ritz posted the following snippet.  Try it.  It should take under a second if the perl is not broken and a lot longer if it is.

#!/usr/bin/perl
use overload q(<) => sub {};
my %h;
for (my $i=0; $i<50000; $i++) {
    $h{$i} = bless [ ] => 'main';
    print STDERR '.' if $i % 1000 == 0;
}

There isn’t an official fix yet, but there’s a patch in the thread. We applied the patch. However, it did not make the problem go away, just delayed it - perl processes using bless/overload start slowing down (and continue to do so exponentially) after a while. At this point, I decided to recompile perl from source. The bug was gone. And the difference was appalling. Everything got seriously fast. CPUs were chilling at a loadavg below 0.10 and we were processing data 100x to 1000x faster! I was giddy. This was insane. We’d given up on one of the processes - to parse about 25M HTML documents using an HTML::TreeBuilder::XPath parser - because calculations showed that it would take over a year to parse them all. We assumed that Tree::XPathEngine was somehow intrinsically slow - so we’d rewrite our parsers using regexen at some point. With the new perl, we parsed these documents in 2 days. 2 days, instead of 365 days.
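To put the 2 days versus 365 days figure in perspective, here is the back-of-the-envelope arithmetic as a small Python sketch. It only uses the numbers quoted above (about 25M documents, 365 days before, 2 days after). Since the original estimate was “over a year”, the computed speedup is a lower bound.

```python
DOCS = 25_000_000                 # roughly 25M HTML documents
SECONDS_PER_DAY = 24 * 60 * 60

def docs_per_second(days):
    """Average throughput needed to get through all documents in the given number of days."""
    return DOCS / (days * SECONDS_PER_DAY)

before = docs_per_second(365)     # broken distribution perl (estimate)
after = docs_per_second(2)        # perl recompiled from source
speedup = after / before

print(f"before:  {before:.2f} docs/s")
print(f"after:   {after:.1f} docs/s")
print(f"speedup: {speedup:.1f}x")
```

That works out to a bit under one document per second before and roughly 145 per second after, a speedup of about 180x, which sits inside the 100x to 1000x range we saw across the pipeline.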

Rather massively blown away by this, I started sending the snippet above to various companies and projects I am involved with that use a lot of perl on Redhat or related distributions. It turns out many of them are running the broken perl, and some of them had spent considerable amounts of money and time optimizing their perl code and infrastructure to work around the performance issue. It also turns out the issue exists in the perl that comes with Fedora 9 - even if you compile perl from the Fedora 9 source package.

So, wow.  How many people might be affected by this?

I realized that anyone running perl code with the distribution perl interpreter on Redhat 5.2, Centos 5.2 or Fedora 9 is likely a victim. Yes, even if your code doesn’t use the fancy bless/overload idiom, many CPAN modules do! This google search shows 1500+ modules use the bless/overload idiom and they include some really popular ones like URI and JSON.

According to this google trends analysis Redhat, Centos and Fedora make up the majority of linux distributions used in production. All these have a broken perl. How much time and money has been lost because of this?  I have a sinking feeling that it is a staggering number.  I also have a sinking feeling that many people have moved away from perl to python/ruby/java/C because this bug caused them to assume “perl is slow”.  I am hoping this issue will get more visibility because it’s silently killing perl’s reputation and resulting in some very serious wastage of resources.

August 26, 2008: Nicholas Clark, perl core developer, explains the background and points out that fixes have been available since November 2007, they just haven’t made it into RedHat packages.

September 8, 2008: Karanbir Singh has published a fixed perl 5.8.8 RPM for Centos 5.2. See his post here with details on how to upgrade.

September 17, 2008: Redhat has released RPMs that fix this issue! The details and upgrade instructions are available in the bug fix advisory, RHBA-2008:0876-3.

January 20, 2009: This is the official Redhat fix - RHBA-2009:0117-3.

Reading: Automatic Metadata Generation using Associative Networks

This paper describes a way to attribute metadata to metadata-poor documents based on their association with metadata-rich documents. The two methods described are creating association networks and using spreading activation algorithms to transfer metadata across them. I am a fan of network methods for classification and information retrieval tasks - for large problems they fare a lot better than text classification algorithms, are independent of language and work on non-linguistic data - audio / video / numbers. The methods described in this paper are super useful for solving all manner of problems, especially for applying the increasing amount of social structure available on the web to build information retrieval systems that have been out of reach so far.

The basic notion is to first build a set of association networks from a collection of documents. The authors use the high-energy physics bibliographic db as their document collection and select properties like authorship and citation for deriving occurrence and co-occurrence association networks. Occurrence association networks are those in which documents refer to each other directly on a property - eg: the property of citation forms an occurrence association network. Co-occurrence networks are those in which documents share a property - eg: the property of keywords can be used to build a co-keyword co-occurrence association network. A simple and robust formula for deriving edge weight in an occurrence network is 1 / n, where n is the number of documents the source refers to - that is, if A cites 100 papers and B is one of them, then the weight on the edge (A,B,citation) = 1 / 100. The weights in co-occurrence association networks can be derived by taking a simple intersection - if papers A and B share 3 out of 6 total keywords, then the weight on the edge (A,B,keyword) = 3 / 6 = 1 / 2.
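The two weighting schemes are easy to sketch. The Python below is my own illustration, not code from the paper. The function names are made up, and I am assuming that “3 out of 6 total keywords” means the shared keywords divided by all distinct keywords across both papers.

```python
def occurrence_weight(references, target):
    """Weight of the edge (source, target): 1 / number of distinct items the source refers to."""
    refs = set(references)
    if target not in refs:
        return 0.0
    return 1.0 / len(refs)

def cooccurrence_weight(props_a, props_b):
    """Shared properties divided by all distinct properties of the two documents."""
    a, b = set(props_a), set(props_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# A cites 100 papers and B is one of them.
cited_by_a = ["B"] + [f"paper-{i}" for i in range(99)]
citation_weight = occurrence_weight(cited_by_a, "B")

# A and B share 3 keywords out of 6 distinct keywords in total.
keywords_a = {"k1", "k2", "k3", "k4"}
keywords_b = {"k1", "k2", "k3", "k5", "k6"}
keyword_weight = cooccurrence_weight(keywords_a, keywords_b)

print(citation_weight, keyword_weight)
```

Both weights land in the range 0 to 1, which makes them convenient to multiply along a path when propagating metadata.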

Once association networks are computed, the authors apply a variation on the “particle swarming algorithm” by propagating certain properties (eg: co-citation) to discover other properties (eg: journal and keywords). Particle swarming is a discrete form of spreading activation - it’s a method to walk the association network, visiting each node and carrying over the metadata from the visited to the visiting node while decaying the weight at every hop (also referred to as losing particle energy or reducing recommendation influence).
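To make the decay-per-hop idea concrete, here is a deliberately simplified Python sketch. It is a deterministic spreading activation over a weighted graph, not the paper’s particle swarm variant. The decay factor, the hop limit, the node names and the function name propagate are all assumptions for illustration.

```python
from collections import defaultdict

def propagate(edges, seeds, decay=0.5, max_hops=3):
    """Spread labels from seed nodes along weighted edges, losing energy at every hop.

    edges: {node: {neighbor: weight}}
    seeds: {node: {label: initial_energy}}
    Returns {node: {label: accumulated_energy}} for every node reached.
    """
    scores = defaultdict(lambda: defaultdict(float))
    frontier = [
        (node, label, energy)
        for node, labels in seeds.items()
        for label, energy in labels.items()
    ]
    for _ in range(max_hops):
        next_frontier = []
        for node, label, energy in frontier:
            for neighbor, weight in edges.get(node, {}).items():
                passed = energy * weight * decay
                if passed <= 0:
                    continue
                scores[neighbor][label] += passed
                next_frontier.append((neighbor, label, passed))
        frontier = next_frontier
    return {node: dict(labels) for node, labels in scores.items()}

# One metadata-rich document connected to three metadata-poor ones.
edges = {
    "rich": {"poor1": 0.5, "poor2": 0.5},
    "poor1": {"poor3": 1.0},
}
seeds = {"rich": {"hep-th": 1.0}}
result = propagate(edges, seeds, decay=0.5, max_hops=2)
print(result)
```

The hop limit keeps the walk bounded even when the graph has cycles, and the accumulated energy per label can be thresholded to decide which metadata is worth attaching to a node.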

This is expensive, of course (the paper provides a very good analysis of cost), but achievable at scale in architectures like MapReduce and triple stores. This method can be generalized to any RDF graph such that an occurrence or co-occurrence network is created on a selected predicate and other predicates are propagated on the network. It should be possible to write a library to do this which is completely independent of the specific RDF vocabulary.

A neat project would be to take the dmoz rdf, fetch all the web pages in it and then fetch all the pages these pages link to - you can take this out a few levels. Use this to build a citation occurrence network (your basic link graph) and then propagate DMOZ categories to it. The same can be done with tags on del.icio.us to tag the rest of the web. Security is also a great application for these methods - eg: you can use a social graph (like Facebook), mark the few known “fake identities” and then propagate the “fake identity” property through the social graph.

The fallacy of corpus anti-spam evaluation

I get asked to review papers on anti-spam by various technical journals and I am continuously surprised by the insidiousness of text classification methods in anti-spam research. For instance, a lot of researchers are now using the TREC spam corpus to justify the effectiveness of their anti-spam technique and journal editors are insisting on analysis based on this corpus. This is horribly broken. Text classification research has relied on standard corpora to evaluate the effectiveness of new methods - the Reuters 21578 corpus, now the Reuters Corpus @ NIST, has virtually been a standard for this - and it stands to reason. If a classifier trains on 10% of stories about soccer and is able to detect the remaining 90% correctly, we can be quite certain that it will perform well on future soccer coverage. This method of testing reveals a classifier’s resilience to drift in vocabulary and topics, which is inherent to culture and the evolution of language. But imagine a team of sports reporters whose job is not to accurately report the highlights of a soccer game but to wordsmith their story to read like a business section editorial. Their goal is to fool a text classifier trained on samples of soccer coverage, which they achieve easily by eschewing the colorful soccer lingo of offside and red cards and free kicks, and using instead the mundane vocabulary of supply-demand curves, human resources and NASDAQ.

A text classifier that is immune to such wickedness is no longer passively modeling topic drift; rather, it’s trying to predict all the ways in which these fallen journalists will attempt to fool it. In other words, the classifier cannot rely on a corpus from the past to be meaningfully correlated with a corpus from the future. This is like the warning associated with stock trading (past performance is not a reliable indicator of future performance), except it is worse. The spam corpus of today is a function of anti-spam systems of today; it is a direct result of spammers trying to defeat the anti-spam systems that are deployed. When a new anti-spam system is deployed, the nature of the corpus changes, in direct proportion to the nature of the anti-spam system. This is entirely unlike text classification research: you are always training on the wrong corpus!
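The soccer thought experiment is small enough to run. The Python below is a toy of my own making: a word-overlap classifier trained on four invented headlines, not a real spam filter and not any published method. It shows the same classifier getting an honest report right and a disguised one wrong.

```python
from collections import Counter

def train(samples):
    """samples: list of (text, label). Returns per-label word counts."""
    counts = {}
    for text, label in samples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def classify(counts, text):
    """Pick the label whose training vocabulary overlaps the text the most."""
    words = text.lower().split()
    return max(sorted(counts), key=lambda label: sum(counts[label][w] for w in words))

# Invented training corpus from "the past".
counts = train([
    ("striker scored after offside call and a red card", "soccer"),
    ("late free kick wins the derby", "soccer"),
    ("nasdaq falls as supply and demand shift", "business"),
    ("human resources budget cut", "business"),
])

# An honest match report, and one reworded to dodge the soccer vocabulary.
honest = "red card and a free kick decided the match"
disguised = "the team adjusted supply and demand of human resources"

print(classify(counts, honest))
print(classify(counts, disguised))
```

Retraining on the disguised story would fix this one example, but the adversary simply moves again, which is the whole problem with scoring a filter against a frozen corpus.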

There are no easy solutions to this, just like there are no easy solutions to predicting the stock market. One way to use corpora more meaningfully is to classify them in a taxonomy that represents the type of anti-spam technique they were meant to attack. The corpora that attack Naive Bayesian classifiers should be distinct from corpora that attack Fingerprint classifiers, which should be distinct from corpora that attack Network based classifiers. The researcher should assess what technique is closest to their proposed system and evaluate their technique against those corpora. If the proposed system is entirely novel, researchers should acknowledge the inscrutability of their method against existing corpora and use simulation and predictive methods (and good old reasoning) to determine how their method will measure up to an active adversary.

While a new method must always do well against old corpora, the fact that it does is not a guarantee that it will do well against future corpora. This is known as overfitting, and dependence on corpus-based evaluation of spam filters results in overfitting on known attack strategies. Strategic overfitting has disastrous effects in security, both electronic and real-world, and I really hope anti-spam research does not get bogged down by poor methodology.

Humble Beginnings and other Non-Sequiturs

For the last few years, I’ve been doing a whole bunch of social media analysis - extracting social structure and information dispersal patterns from large archives of blog posts, social bookmarks, user comments and such. A bunch of us have been working on this rather exciting project, that we call Slaant, where, among other things, we discover networks of influence and micro-zeitgeists in social media. I’ve been poring over the XML, RDF and HTML generated by tools like Movable Type and I figured it was high time that I installed it and started using the front end to write that blog I’ve been meaning to. So here it is. Maybe I’ll even write another post one day.

What am I reading?

I read a bunch of super cool books recently. On Intelligence by Jeff Hawkins and Sandra Blakeslee is a fascinating theory of the neocortex - the seat of human intelligence. Jeff and Sandra present a detailed and credible account of the mechanics of the neocortex and show how (and argue why) their predictive memory model results in intelligence. Their theory is elegant and, even though far from complete, I am thoroughly convinced that it is a decent approximation of what goes on in the brain. Biological machinery is extremely simple; it only seems fantastically complex due to properties that emerge at scale. The authors build on the premise that every cell in the neocortex runs the same algorithm and intelligence is an emergent product of this neocortical algorithm. Read this book - it will no doubt be considered one of the most important works of our time. If you have a machine learning background, check out Dileep George’s paper on invariant representations in the visual cortex. Dileep is a colleague of Jeff’s (and they recently founded a company together called Numenta) who is already converting insights from the research into efficient algorithms.

The Wisdom of Crowds by James Surowiecki is another kick-ass book. The thesis of this book is why (and in what situations) collective decision making surpasses expert knowledge. Since my work on spam filtration has been an exercise in collective decision making, I really appreciated the depth of analysis and ingenuity in formalization of the topic. The book is extremely content rich (and well written to boot!); I learned all sorts of stuff from it - the design of the Iowa Electronic Markets, how TV show ratings and advertisement pricing are computed, different kinds of decision biases (cooperator bias and confirmation bias), and the many nuances of group dynamics.

Schild’s Ladder by Greg Egan is brilliant. I also read through most of Joel on Software and I violently agreed with half of it and mildly disagreed with the rest. Maus is memorable in a haunting way and Damnation Alley was a perfect companion on a three-hour, mega turbulent flight.

The cycle of complex elegance to inelegant complexity (and back)

The grand unified theory of the universe is a search for elegance. Mastering the art of winning a combinatorial game is a search for elegance. Discovering the attitude and philosophy of everyday living is a search for elegance. Good software design is a search for elegance. In the last 10 years of doing software design, the part I’ve come to enjoy most is achieving that palpable state of elegance. I can feel an elegant design as soon as it emerges, and what feels elegant at the outset tends to withstand the deep logical scrutiny that invariably follows. Elegance scales. Elegance embraces newness; newness that wasn’t considered or even known at the time of original contemplation. That, in fact, turns out to be the only objective test for elegance: how well it embraces and adapts to additions in specifications, to new information.

If you design a system over the long term (years) and incrementally adapt it to growing needs, there comes a time when the elegance breaks down. To retain the semblance of balance you expend tremendous amounts of energy, rethink everything, restructure, rewrite, redesign just to meet new requirements in a consistent manner.

However, another piece of information challenges this new design and you give in and accept the unwieldiness of evolution, the kludginess of forms plastered onto an elegant system, turning it into a complexity that, while powerful, is aesthetically unpleasant. The system works, but it leaves you on shaky ground, especially with respect to the future.

There are umpteen accessible examples of this in the process of “designing” theoretical frameworks of natural phenomena. Physicists, biologists, chemists and geologists have tried, time and again, to fit new observations to existing theories, by proposing extensions that explain new observations in the framework of the existing theory. Apollonius of Perga’s epicycles are probably the most famous example. In order to accommodate troubling observations of varying planetary brightness and retrograde motion into the geocentric model of the solar system, Apollonius came up with the idea of epicycles: planets did not circle the earth in their concentric orbits, rather they were attached to circles (epicycles) within the concentric orbit. This model held for a while, but fell to further observations. Ptolemy, then, proposed the idea of epicycles within epicycles to “solve” for new observational data. Eventually, it all came tumbling down when Copernicus’s observation of the star Aldebaran could not be solved with any permutations of the epicycle-equipped geocentric model. Copernicus proposed the heliocentric model and Kepler refined it to remove epicycles altogether. Elegance reigned once again.

Much like models (designs) from Aristotle (complex elegance) to Ptolemy (inelegant complexity) to Kepler (complex elegance), software design goes through the cycle as new requirements surface. We’ve all witnessed our beautiful software turn into a dangerous concoction, and then resolve into a higher level of elegance that continues to scale effortlessly (for a while).

The basic problem is that of a limited granularity of perception. We design for the requirements (or capture reality in an elegant model) based on the understanding of the world as it stands today. We use our predictive faculties to adapt our designs and models to what might appear tomorrow. But we can only see so far in the future. The universe is larger than our brains. This implies that any process of modeling or design will inevitably go through the cycle of “complex elegance to inelegant complexity” and, hopefully, back. If we formalize this notion, we can learn to discern when it occurs and modify our design processes to accommodate it. For instance, one strategy that emerges from this knowledge is that if a new requirement doesn’t fit our design, we should not waste effort trying to fit it in the current model; we should just special case it and integrate the special case into an elegant whole when new requirements show the way to a higher level of elegance. As Brooks pointed out in The Mythical Man-Month, “plan to throw one away; you will, anyhow.” Another strategy is to accumulate multiple requirements before hitting the design board, as multiple requirements are more likely to expose holes in the current design and hint at a better one.

Not all designs are elegant. If new requirements break a design, it is not necessarily an honest mistake. There are multiple ways to skin a cat, and it’s quite possible that you happened to pick the wrong one. How is one to know if a design is the best possible representation of the requirements? One way to examine a design is to see how tightly it is coupled to the requirements. The tighter the coupling, the poorer the design. If the design was created only to solve the particular requirements, it is more likely that it will break when new requirements arise. A design that is based on abstraction of the requirements tends to be more robust. In fact, the designer’s goal should be to abstract as much as possible from the requirements, and then design to the final abstractions. Is there a danger of being too abstract? Yes and no. This danger can be thought of as a trade-off with the time of implementation. A more abstract design will require the implementer to first create the abstraction framework and then specialize it to meet specific requirements. The advantage of an abstract design is that it leaves “space” to elegantly fit future requirements. It’s time expended upfront, but saved in future. We usually make this trade-off based on the importance, longevity, and other constraints.

Another way to assess the elegance of a design is to look at what happens when it is replaced with a newer design. If major elements of the original design were carried over into the subsequent one, then the original design could be thought of as elegant. The percentage of design elements borrowed is a good barometer of elegance. In the case of the Solar System models, the element that was carried over from Aristotle’s model to Kepler’s was the idea of bodies circling each other in a concentric fashion. The only two mistakes were the choice of the central body and the shape of the orbits. In that sense, Aristotle’s design was quite elegant. When Relativity and QM replaced Newtonian Mechanics, it continued to hold true; it just became a special case of the overall picture. This, in my mind, is the best case scenario. If a design is retained as is, as a special case of the newer level of complex elegance, I believe the original to be a perfectly elegant solution for the time and circumstances of its creation.

The World’s Worst Server Rooms

I keep losing the links to this year-old story, so I am cataloging ‘em here. Here’s the original, the second followup and the final one. If you manage a hosted service or happen to have a ton of computers at home, the photographs in this story will stir up something deep inside you, and make you crack up so hard that you’ll be on your floor, enmeshed in the wires going from your machine to the wall, trying very very hard to forget what you just saw.

Enjoy. And, oh, register++
