Complexity, Requirements and The Perfect Cup Of Tea

September 28, 2011 ~ briandrought ~ 1 Comment

The words any software developer dreads hearing are “Can you just change this one thing for this one user/client/company”. The requests usually have perfectly valid reasons, but it’s sometimes hard to explain to the person asking for the feature why it’s more work than just the code. Most of the time it’s not the coding that’s terribly hard, it’s making it just so for that particular user, testing it, maintaining it and then you have the nightmare two years down the line when you have to change code that touches it.

Much has been written about system complexity but it tends to read like this:

Great for your PhD but terrible as a demonstration/case study.

Hopefully, the following little story can serve as a way of visualising the pitfalls when a system ends up in an ever increasing cycle of complexity.

In The Beginning: A Village Fete Tea Stand

Agnes, Beatrice, Clive and Doris decide to run a small tea stand at the village fete in their home village of Systemton in the Yorkshire Dales. They do a quick run to Costco for a giant box of Yorkshire Tea and some huge cartons of milk, Agnes borrows a tea urn from the village hall, Clive pops into IKEA for 50 cheap tea cups and a handful of teapots and they’re all set. They decide on a nice round 20p per cup to make charging customers very simple. Doris will brew the tea, Beatrice will take the cash, Clive will collect the used cups and Agnes will wash them up. They knock up a quick sign that says “Tea, 20p” and hang it outside.

Very soon, they’re running flat out and things aren’t running smoothly. To cut down on the wait they’re pre-making the tea, but it’s hard to keep track of how long each pot has been brewing. Some customers are complaining it’s stewed and some are having lightly coloured water. Also, some customers are complaining it’s too milky. It seems that milk is being added to make the colour the same, but that means lightly brewed tea is only getting a splash, whereas over-brewed tea is getting far too much milk.

Being an ex RAF engineer, Clive decides it’s time for a system. He labels the teapots A, B, C, D and E and knocks up a chart. Using his watch he carefully notes the time each pot is filled and removes the teabags after exactly 5 minutes. They can adapt to demand by having between 1 and 5 pots on the go at once meaning the tea stays hot. He also finds a measuring cup that looks about right and milk is poured into that first before going into the cup, so each cup has exactly the same amount.

After 30 minutes or so Agnes overhears some friends say it was a nice refreshing cuppa, but perhaps a bit on the strong side. She feeds this back to Doris who adjusts the brewing time down to 4 minutes. Feedback seems to be much better from everyone and they settle on 4 minutes as the ideal brew time. They now have a system for the perfect cup of Yorkshire, word gets around about how good the tea is, and they collect 20p per cup from happy customers all day long. In fact most customers not only have 20p ready, most of them pop it in the box themselves making life very easy indeed for Beatrice so she can help Doris with the tea making.

Lessons

Take what you do and systemize it
Refine the system with feedback and improve your product

Year 2: More Options

It’s fete time again, and this year Beatrice accompanies Clive on the Costco run. As well as the many jugs of semi-skimmed and the box of Yorkshire Tea bags, Beatrice puts a box of her preferred PG Tips in the trolley. “Good idea” says Clive.

Fete day comes, the stall is running like clockwork with Doris using Clive’s system from last year. She also had a brainwave and brought along 5 kitchen timers and she sets them off when each pot is filled. When the timer beeps, she takes the teabag out of the relevant pot and bingo, the perfect cup of tea.

Except there’s a snag. They have both boxes of tea on display behind the stall, and both are being offered to customers. Doris tries to adapt to the new option by having pots of Yorkshire and a pot of PG Tips on the go, she just makes pots A, B, C and D Yorkshire and E is brewed with the less popular PG Tips. Everything goes fine until Beatrice has a cup of tea herself and chooses her favourite PG. Yuck.. over brewed. She feeds back to Doris that PG Tips only takes 3 minutes to brew. Doris adjusts her countdown timer for pot E to 3 minutes and they now have perfect cups of tea in two flavours. 20p’s are collected all afternoon and word gets out about the perfect cuppas.

Late in the afternoon after most locals had left, a coach stops off at the fete. As they’re paying their entrance fee, Agnes has the idea to get all the teapots brewing in preparation. Four pots of Yorkshire and a pot of PG on the go. The first customers appear and order PG. As do the other 50 from the coach. Oh dear. This coach was from Lancashire and they don’t want the Yorkshire. Doris has to suspend brewing of the 4 pots of Yorkshire, throw them away and restart with PG. Halfway through she remembers that the 4 timers were set for 4 mins, not 3. Hastily she changes the process so they don’t stew.

Still, crisis largely averted, the coach load pay their 20p’s, our four friends collect the money and sit down for a well earned rest!

Lessons

Extra requirements can sometimes require unique, or altered processes further downstream.
Even slightly more complexity makes it harder to react to small changes in the marketplace.
Choice can work against you. If only one tea was on offer they’d likely still have sold as many cups.

Year 3: Demand Increases and Scaling is Required.

The four friends dust off their IKEA mugs and teapots, do a Costco run again, pick up the milk, the Yorkshire and the PG.

Fete day arrives and they’re blessed with glorious weather, so much so that the village is overrun with visitors from the surrounding area eager to see the ever expanding Systemton Fete. Agnes, Beatrice and Clive are rushed off their feet collecting/washing cups to re-use whilst Doris makes the tea. The system of multiple teapots works well in general and they sort of adapt by having a split of 4-1 or 3-2 depending on demand, not forgetting to change the timers each time of course!

Lunchtime comes and the queue is out the door, there’s simply not enough cups or teapots to scale to the required throughput. Agnes asks her friend Edith to rustle up a load more cups from people’s houses as well as two more teapots. Brewing speeds up again and the queue decreases. However, not all is rosy. Customers from years before have come to expect the perfect cup of tea and it seems some are getting overly milky or stewed tea again. What on earth is going on? They have a quick internal management meeting and the problem is obvious. In their attempt to scale to demand they introduced different sized teapots and cups. The 4min/3min times and the measure of milk no longer work. The system is broken. Their only options now are:

Scale back down, make customers wait for the perfect cup.
Supply poor quality tea to a percentage of customers, but quickly.
Try to modify the process with some different variables for the different sized pots and cups which will require extra man power.

Clive stops collecting cups and takes one of the IKEA teapots and the two ‘new’ ones (therefore massively reducing production whilst they’re offline). He measures the volume of each and he decides that because they’re 50% and 150% the size of the IKEA ones, brew times should be:

	Yorkshire	PG Tips
Ikea Pot	4 mins	3 mins
Large Pot	6 mins	4.5 mins
Small Pot	2 mins	1.5 mins

He does a quick test with each pot using PG (as it takes less time than to test Yorkshire) and it all seems ok.
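Clive's rule of thumb amounts to multiplying the standard brew time by the pot's size relative to the IKEA pot. A minimal sketch of that arithmetic, assuming brew time really does scale linearly with volume (that is Clive's guess, not an established fact about tea), with hypothetical names throughout:

```python
# Baseline brew times in minutes for the standard IKEA pot (from the story).
BASE_BREW_MINUTES = {"Yorkshire": 4.0, "PG Tips": 3.0}

# Pot volume relative to the IKEA pot, as measured by Clive (150% and 50%).
POT_SIZE_RATIO = {"Ikea": 1.0, "Large": 1.5, "Small": 0.5}


def brew_minutes(tea: str, pot: str) -> float:
    """Scale the baseline brew time linearly with pot volume (an assumption)."""
    return BASE_BREW_MINUTES[tea] * POT_SIZE_RATIO[pot]


if __name__ == "__main__":
    for pot in POT_SIZE_RATIO:
        times = ", ".join(
            f"{tea}: {brew_minutes(tea, pot)} mins" for tea in BASE_BREW_MINUTES
        )
        print(f"{pot} pot -> {times}")
```

Every new tea or pot size adds another entry to one of the two lookups, and every combination of the two is something Doris has to remember on the day.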

He jots the times down and hands it to Doris who rolls her eyes at him. She asks about the milk measure and the different sized cups. They look around at their customers… there’s the 50 identical cups and 50 more borrowed cups, each of which is a completely different size to one another. Whilst possible, the work involved to measure and test each cup would be horrendous, not to mention from an operational perspective doing different measures. Doris has more than enough to do without different milk measures for 51 different types of cup. If they choose to go down this path they’ll need a dedicated person just for milk measures (who, incidentally, won’t be needed when it gets quieter and they can drop back to the 50 original cups).

Milk measures aside, they’re now running at a huge scale, supplying at least twice as much tea as before. They’ve had to rope in Edith to help Doris of course as managing the various combinations of tea type/pot size is unmanageable for one person. In fact, it’s getting too much for the pair of them when demand changes and they need to swap to say 3 PG Tips pots, but Beatrice can help out as she doesn’t need to be looking after the payment full time as it’s simply 20p per customer and 9 out of 10 customers put a coin in the box themselves anyway.

Lessons

Scaling can itself add more variables and complexity
Scaling is hard when the system is already complex
Adding more options when the system has already been scaled is just as hard

Year 4: Requirements By Committee and Homemade Cake

This year, Mavis, the head of the village committee, decides she likes the tea stand and it should have her input. She elects herself into the team as Tea Stand CEO, and decides that they need to be offering a few more exotic teas such as Darjeeling, Assam and Jasmine. And biscuits.

The news is broken to the rest of the group on the Friday before the fete. “Oh dear”, thinks Clive as off he goes in search of tea. He can get PG and Yorkshire (and milk and biscuits) from Costco, but the others require a search. Eventually he finds the others but at a much much higher cost. “Hmmm, not sure Mavis thought about the cost of these fancy teabags, but we’ll have to work something out”

He gets home and calls the others. In preparation for tomorrow they decide not to get caught out this time and they test the new teas beforehand and they work out what tastes right. All testing is carried out with the standard IKEA pot.

Tea	Brew Time
Yorkshire	4 mins
PG	3 mins
Darjeeling	2 mins
Assam	3.5 mins
Jasmine	5 mins

Great. Except they’ve not had any budget to increase their capacity of IKEA teapots (though fortunately, they did manage to buy 100 more cups so at least those are now standard) so there’s still this issue of the two other pots. Using the same method as last time they do some guesstimates and hope the values are right.

	Yorkshire	PG Tips	Darjeeling	Assam	Jasmine
Ikea Pot	4 mins	3 mins	2 mins	3.5 mins	5 mins
Large Pot	6 mins	4.5 mins	3 mins	5.25 mins	7.5 mins
Small Pot	2 mins	1.5 mins	1 min	1.75 mins	2.5 mins

This chart is written out, and they then turn their discussion to pricing. Working on the cost per teabag it’s decided that Yorkshire and PG are 20p/cup and the other 3 should be 30p/cup as 20p is just too low to make any money on the more expensive exotics. Of course, this is assuming the customer wants milk. 99% of PG/Yorkshire customers have milk, it’s pretty optional with Assam, but a no-no with Jasmine and Darjeeling. It’s decided that it’s unfair to charge all customers for milk, so actually the milk should be charged at 2p/measure, so the prices are adjusted to 18p and 28p with a 2p extra if you have milk.

Simple.

Of course there’s also the biscuits which are charged at an easy 10p each regardless of biscuit (some will end up paying over the odds for very small biscuits from the selection box, but that’s life)
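To see what Beatrice is now up against at the till, here is a rough sketch of the year 4 price list as a function. The prices come from the story; the function and constant names are made up for illustration.

```python
# Base price per cup in pence, before milk (from the story).
BASE_PRICE_PENCE = {
    "Yorkshire": 18,
    "PG Tips": 18,
    "Darjeeling": 28,
    "Assam": 28,
    "Jasmine": 28,
}
MILK_PENCE = 2      # charged per measure of milk
BISCUIT_PENCE = 10  # flat price, whatever the biscuit


def order_total_pence(tea: str, with_milk: bool, biscuits: int = 0) -> int:
    """Total for one cup of tea plus any biscuits, in pence."""
    total = BASE_PRICE_PENCE[tea]
    if with_milk:
        total += MILK_PENCE
    return total + biscuits * BISCUIT_PENCE


if __name__ == "__main__":
    print(order_total_pence("Yorkshire", with_milk=True))           # the old 20p cup
    print(order_total_pence("Assam", with_milk=True, biscuits=2))   # tea, milk, two biscuits
```

In year 1 the whole of this was the constant 20, which is why most customers could drop a coin in the box without asking anybody anything.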

They throw away their sign from year 1 and make a new one.

Fete day arrives again, and it’s quickly apparent there’s a problem. Customers are asking lots of questions before placing the order. Some are unsure about the different tastes of the fancy teas. With Doris and Edith brewing, Beatrice on the till, Clive collecting, Agnes washing and Mavis doing…. CEO things, there’s nobody to answer customer queries, so they enlist the help of Florence who is given some training in Tea and put at the front of the stall to field questions.

Within half an hour, brewing operations have fallen apart as well. Beatrice is too busy now with the more complicated money side of things and the serving of biscuits to help out like last year, leaving Doris and Edith not only having to deal with the Yorkshire/PG in different sized pots problem, but they have the other teas. Problem is, those teas aren’t being ordered in sufficient quantity to justify a pot on the go so they’re being made in the cups. Reacting to the change and altering the process by brewing in cups means of course the previously well-researched pot times are all wrong for the special teas, so they have to rapidly re-test them in cups while in full production. It also means someone has to monitor the times of each cup. Doris and Edith decide they’ll look after the pots and they draft in Edith’s husband Graham to do the individual teas using the old style chart and watch method of timing the brew according to the pre-set brew times (there’s nowhere near enough kitchen timers). The gang muddle through, each working flat out to keep the production line running.

At lunchtime, Mavis starts chatting to her friend Olive who is running a home made cake stall. Olive lets on that she’s having a slow day. Ever one to help out a friend and maybe do some business along the way, Mavis has an idea. Why not re-sell Olive’s cakes on the tea stand. They’ll take the money and give Olive a 30% cut. Everyone’s a winner baby.

Mavis brings over a tray of each of Olive’s 20 different cakes complete with the price label on each cake, along with some plates and cutlery and places them on the stand. Unfortunately, this just serves to confuse customers more with all this choice and they soon turn to Florence to ask about the cakes as well as the tea. Florence, though, being diabetic, knows nothing about the cakes as it’s just not her area of expertise (she’s now a tea maestro, though). She answers as best she can until Mavis grabs Olive’s daughter Heather to answer cake queries.

The end of the day arrives and our exhausted team take the proceeds, calculate Olive’s cut by using the sales ledger, split the remaining proceeds 9 ways between them and go home.

Lessons

Ever increasing complexity at the core of the business has now required dedicated sales people, increased the work in acquiring the raw materials, increased the work in billing, marketing and stock control. These are processes both upstream and downstream of the main brewing core of the business.
If you keep expanding your options, at some point your process will have to split and you’ll have multiple processes with different teams. You can only scale vertically so far.
Specialist products require specialist knowledge, and when it comes to scaling, you have to build each team rather than just the one. The overheads are now vastly higher than in year one.
System integration often requires more reporting than you’d imagine to make it work for both parties. Without detailed reports of cake sales, Olive won’t be able to run her cake production optimally.
Can you imagine scaling this to deal with double the customers?
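One crude way to put a number on the growth is to count how many brew times the team has to know each year: one per combination of tea and pot size. The sketch below uses only the counts from the story and ignores the milk measures, pricing and brew-in-the-cup times, so if anything it understates the problem. The names are hypothetical.

```python
# (teas on offer, pot sizes in use) for each year, taken from the story.
YEARS = {1: (1, 1), 2: (2, 1), 3: (2, 3), 4: (5, 3)}


def brew_chart_entries(teas: int, pot_sizes: int) -> int:
    """Number of distinct brew times: one per tea and pot size combination."""
    return teas * pot_sizes


if __name__ == "__main__":
    for year, (teas, pot_sizes) in YEARS.items():
        entries = brew_chart_entries(teas, pot_sizes)
        print(f"Year {year}: {entries} brew times to manage")
```

Each new option multiplies the chart rather than adding one line to it, which is why two "small" additions in a row hurt so much more than either would alone.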

Year 5: The Final Chapter, The Team Go To India!

Word has finally got out about the original perfect cup of Yorkshire from year one, the product that got them their reputation. The team is invited to the World Tea Championships high up in the Indian Himalaya. The team do their shopping, buy some more standard teapots (no messing about this time) and pack their gear. The week before, they get together all 9 of them and revisit all the brewing times of the various teas, all in pots this time. After a lot of focus grouping amongst external friends and past customers they arrive at some very specific, perfect times, slightly improving on the original Yorkshire and the latter PG tips times as well as the fancy teas.

	Yorkshire	PG Tips	Darjeeling	Assam	Jasmine
Ikea Pot	4.1 mins	3.2 mins	2.4 mins	3.1 mins	4.3 mins

Arriving in the Indian Himalaya the intrepid tea-makers trek high up into the hills to the venue for the competition. They now have a finely honed system, perfect timings, standardised cups/teapots, a dedicated workforce, many options of tea, Fox’s biscuits and Olive’s home made cakes. There’s no way they can lose this.

The big day arrives and the scene is hectic. There’s thousands of tea lovers from all over the world and our team get brewing. The first few customers don’t seem massively impressed. Still, undeterred they continue serving from the various pots, keen to bring Yorkshire and PG Tips to the world. Except all the customers are hating it. Could it be they’re not used to it? Our team try a cup of their finest Yorkshire….. and it’s terrible. Something has broken and the brew is off. Way off. The altitude of the venue is lowering the boiling point of water to 85°C, not even close to what they’re used to.

Our crew can either try to re-engineer their processes by trialling new brew times for each of the 5 teas, buying more pots (since brewing is longer, they’ll need more pots on the go, and the extended brew time makes it even harder to react to demand as the lead time is longer) and maybe train yet another staff member in brewing to keep up with capacity now that brewing is the bottleneck.

-OR-

They simplify.

They ditch everything apart from their beloved Yorkshire, they trial a new brew time at the lower boiling point. They use ALL their teapots for Yorkshire, they ditch the separate charge for milk and put the cup price back to 20p. They average out the price on the cakes to another 20p to simplify billing. They give away biscuits to customers who bring friends along rather than trying to charge and complicate the pricing structure. All day long, they serve the perfect cup of tea to hundreds of happy customers, not one of whom asks them if they can have Jasmine.

Lessons

A complex system makes a change in a key ingredient, or a change in market or environment a potential disaster.
Staying simple doesn’t mean not evolving. You can add more products, just do it in a streamlined way that minimizes branches in your processes.
A simplified process makes it much easier to adapt and scale. All hands moved to brewing and with almost all the team on the case they coped with demand.
Remember your core business.
Doing a simple thing very well pleases more people than lots of things done just above average.
This super efficient tea business can now be replicated to any country and with just one test brew to check for altitude and water changes the whole thing can be copied with minimal training, low overheads per outlet and a simple method of quality control. A Perfect Cup Of Yorkshire(tm) will be next to every Starbucks in the world.

Conclusion

Adding increasing numbers of options, features, widgets and conditions leads to an ever increasing burden of complexity, not just in building a system, but in designing, testing, supporting and (don’t forget) operating it. It’s not always the best experience for the customer either.

Our team didn’t quite end up where they started though. They added biscuits and delicious homemade cakes. They also productionized the tea making and refined the brew time. They also learnt a lot along the way.

Now, time to put the kettle on…

Cute email from eBuyer.com

September 14, 2011 ~ briandrought ~ Leave a comment

Thought the copy on this was wonderful… also, the coffee cup stain smiley face? Genius!

Mini mouse review – Microsoft Sidewinder X8

September 14, 2011 ~ briandrought ~ Leave a comment

My old stalwart Logitech cordless mouse has been slowly getting worse over the last few months. It was struggling with the wood grain on my desk, and even on a better surface the pointer would occasionally jump.

Looking around at mice I wanted something with super high DPI for accuracy, a couple of buttons for doing ‘other things’ and it needed to be cordless. Getting the wire wrapped around your keyboard is so 1990s.

Seemingly, the problem with wireless mice is a vast majority now seem to be aimed at laptop users, so there’s no charge cradle, they just run on an AA battery or two and have an off switch. Great until you leave it on overnight and have a flat mouse in the morning.

On the DPI front, most regular mice seem to top out at 2000DPI or so, but who do we know that always needs vastly over the top kit? Yes… gamers!

After a hasty 10 minutes research I located this puppy on Amazon:

http://www.amazon.co.uk/dp/B001DCELH2

Yes, it looks ridiculous. The odd styling touches, bizarre logos and occasional lights make it the Citroen DS3 Racing of the mouse world, but for someone looking for a responsive wireless mouse it’s perfect.

Have used it for a week now and the accuracy over the old Logitech is very useful. Also useful are the adjustable DPI buttons on the top, so I can have Fast/Normal/Slow But Accurate settings direct from the mouse. Designed for gamers doing stupid things in first person shooters, but incredibly useful to be able to slow it right down for masking fiddly details in photo software.

Oh, and the bit I like the most though I’ve not needed it yet, there’s no cradle. Instead, there’s a magnetic puck on the end of a thin cable so you can charge it whilst you use it like a corded mouse. My old Logitech required you to pop it in the cradle and twiddle your thumbs for a few hours if the battery ran out.

All in all, a top product at a reasonable price. I just wish it didn’t look quite so Halfords.

Managing clients patched calls with a single click

September 21, 2010 ~ briandrought ~ 1 Comment

The status quo of call patching

With our pureJAM service you can have calls patched to you. You simply set your call instructions for one of 4 statuses. You can change this status via Twitter/SMS/Web page so you have pretty good flexibility about how we handle your calls. You can have us try to patch to your mobile, your landline, both, or neither for each status. See the below example for what a call instruction looks like when being set up.

In the above example, when the operator takes the call for me and I’m ‘Busy’, they’ll politely put the caller on hold, call me on 02072070007 and see if I want to take the call. They’ll tell me who they have on the line, and if I want to take the call they’ll put them through, if I don’t they’ll explain to the caller that I’m unavailable and take a message.

A better solution

Wouldn’t it be great though, if you could have this ‘discussion’ with our operators much much more quickly and less intrusively. A lot of us spend our time behind a computer, but that doesn’t always mean you’re available to take calls. Maybe you’re on another call, maybe you’re on a video chat or maybe you just want to get some work done.

So, I’ve developed a very prototype system for doing just that. It uses a desktop client that sits in your taskbar (and allows you to change your status, gives you indication of unread messages etc). You must be a pureJAM client (otherwise there’s no point as you can’t have calls patched!), have a PC, and you must have the .NET Framework 3.5 or later. It’s a Windows ClickOnce application so it’s easily installed from a URL. It also connects to our systems on port 80 and uses standard HTTP so it’ll work behind any firewall/NAT. If you want the download link then you’ll need to register for the Beta program (details at the end of the post).

Step 1: Instruction Setup

Once you’ve installed the desktop client and logged in and out of your online portal you’ll see the desktop client logs in as well (using a rather neat way of linking a browser to a Windows application). Then, if you go into your contact instructions you’ll see a new menu option under ‘Patch To’:

You then get a choice of action if we don’t hear back from you i.e. you’re not at your computer (incidentally, you can install copies of this desktop client on multiple computers and it’ll send the request to any that are logged in)

And this is what the status summary now looks like:

Step 2: A caller comes through to us

When our operator gets a call for you, they’ll get the basic details from the caller, and then they’ll click the (See If They Want This Call) button.

Step 3: We let you know about the call

Within a few milliseconds, a packet of data is sent to your PC (via our really cool comms system that you don’t need to care about) containing the basic call information that our operator got from the caller. You’ll get a notification bleep and you’ll then have a choice of what to do with the call.

Step 4: Our operator gets your response

Using the long polling techniques I developed in Project Totem, your response is instantly pushed back to the operator’s screen and they can handle the call how you want them to.
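The decision the operator ends up with can be sketched in a few lines. This is only an illustration of the flow described in the steps above: the response values ("accept", "decline") and the function name are invented here and are not the real pureJAM protocol.

```python
from typing import Optional


def operator_action(client_response: Optional[str], no_reply_action: str) -> str:
    """Decide what the operator does once the desktop client has (or hasn't) replied.

    client_response is None when no logged-in desktop client answered in time,
    in which case the action chosen during instruction setup applies.
    """
    if client_response is None:
        return no_reply_action
    if client_response == "accept":  # hypothetical response value
        return "put the caller through"
    return "take a message"


if __name__ == "__main__":
    print(operator_action("accept", no_reply_action="take a message"))
    print(operator_action(None, no_reply_action="try the mobile"))
```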

Beta testers required

If you’re a pureJAM client and want to test this system out you’ll need to be a PC user (no Macs I’m afraid) and you’ll need the .NET Framework 3.5 or later. Most Vista/Windows 7 PCs should work fine. XP will work fine provided it’s been kept up to date! To register for the Beta program, please log in to your pureJAM account and send a U2U to your Account Manager entitled something like “Desktop Client Beta”.

viewmessages.com Architecture

July 30, 2010 ~ briandrought ~ 13 Comments

OK, so we’re not the biggest site in the world but we have a fair amount of data, a fair amount of users and speed is very important to me so it’s important everything is as fast as possible. A few people have asked what our architecture is and I thought it’d make an interesting post. As is always the way with these things it’s easier to describe with a diagram:

Web Servers

content.viewmessages.com

First of all, we serve images/bulky JavaScript and CSS from Amazon Cloudfront CDN which is an incredibly cheap way of offloading those things to the Amazon infrastructure. It also makes the platform much, much snappier for our American users. Even if you only have a basic website it’s worth looking into using Cloudfront, if only because it gives you a second domain to pull your data from, which allows the browser to parallelise more downloads.

totem.viewmessages.com

Totem is my own long polling server I developed to allow instant communication to the user’s browser. This allows things like instant new message notification. In short, your browser uses jQuery to request a script from Totem. If there’s no new messages, Totem will sit there for 40 seconds and return nothing. Your browser will then re-request the script and wait for another 40 seconds. If you get sent a U2U, for example, 5 seconds into the 40 seconds, the web server/background server dealing with the U2U sends a notification to Totem, which creates a bit of JavaScript to display the U2U notification and sends it back as the response to the original request that was made 5 seconds previously. For more on Totem, read my Project Totem blog post.
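The hold-then-respond behaviour can be sketched with a blocking queue. This is not Totem's actual code (Totem's internals aren't described here); it is a minimal Python illustration of the long polling idea, with a hypothetical class name, and only the 40 second timeout taken from the description above.

```python
import queue
import threading

POLL_TIMEOUT_SECONDS = 40  # how long the server holds a request open


class LongPollChannel:
    """One user's pending notifications, delivered via held-open requests."""

    def __init__(self) -> None:
        self._pending: "queue.Queue[str]" = queue.Queue()

    def notify(self, script: str) -> None:
        """Called by a web/background server when something happens for the user."""
        self._pending.put(script)

    def poll(self, timeout: float = POLL_TIMEOUT_SECONDS) -> str:
        """Handle one browser request: block until a notification or the timeout."""
        try:
            return self._pending.get(timeout=timeout)
        except queue.Empty:
            return ""  # nothing happened; the browser simply asks again


if __name__ == "__main__":
    channel = LongPollChannel()
    # Simulate a U2U arriving shortly after the browser's request was made.
    threading.Timer(0.5, channel.notify, args=["showU2UNotification();"]).start()
    print(repr(channel.poll(timeout=5)))
```

The browser side is just a loop: request, act on whatever comes back (possibly nothing), request again.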

static.viewmessages.com

Because we use a web cluster to serve the main HTML we need a central server for avatars and other central data that we don’t push to Cloudfront. The only challenge here was getting content to it. Security in IIS from the main web cluster meant I couldn’t access the machine directly, so I had to use a SQL database as a proxy.

http://www.viewmessages.com

The main web serving is done by a cluster of IIS machines. These are cheap commodity machines in the Google style. 2GB Ram/2GHz Dual Core CPU/80GB drive. Nothing fancy or expensive. By using multiple cheap machines instead of one big expensive one we get vastly better availability (they can be brought offline for updating), far better performance (if you add up the total computing power) at less cost. It’s a win-win other than it makes the software development slightly more complex at times.

Each machine runs a copy of SQL Express to write access logs to (Which are then copied to the main SQL box when things are quiet) and to store a whole bunch of reasonably static information (such as configuration) to reduce the load on the main SQL box. Each machine can do front end web serving, back end task processing or both. As we need more capacity we can simply add more machines. The load balancers will send the users request to a particular web server using a session cookie. If the server goes down, the failover happens within 10 seconds and you’ll be transparently placed onto a different server.
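For anyone unfamiliar with cookie-based stickiness, here is a rough sketch of the routing behaviour described above. It is not how our load balancers are implemented (they're off-the-shelf kit); the class, the server names and the CRC-based choice of server are all assumptions made for illustration.

```python
import zlib
from typing import Dict, Iterable, List, Set


class StickyBalancer:
    """Keep each session on one web server, moving it only if that server dies."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.servers: List[str] = list(servers)
        self.healthy: Set[str] = set(self.servers)
        self.assignments: Dict[str, str] = {}  # session cookie -> server

    def mark_down(self, server: str) -> None:
        self.healthy.discard(server)

    def route(self, session_id: str) -> str:
        server = self.assignments.get(session_id)
        if server not in self.healthy:  # new session, or its server went down
            candidates = [s for s in self.servers if s in self.healthy]
            if not candidates:
                raise RuntimeError("no healthy web servers")
            index = zlib.crc32(session_id.encode("utf-8")) % len(candidates)
            server = candidates[index]
            self.assignments[session_id] = server
        return server


if __name__ == "__main__":
    balancer = StickyBalancer(["web1", "web2", "web3"])
    first = balancer.route("session-abc")
    balancer.mark_down(first)
    print(first, "->", balancer.route("session-abc"))
```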

The back end task processing is something I’m particularly pleased with as it allows the processing load to be distributed across as many machines as we need. At the moment these are the same machines that serve the front end web stuff but at a later date will be split off into a dedicated back end cluster. All the back end processing is done by requesting webpages from a queue. If you want to read about how we process background tasks, here’s my blog post about it.

Background Servers

Background / Offline Processing

As mentioned above, this is done using queues of webpages and is processed by the main web cluster.
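The idea of a queue of webpages is simple enough to sketch: each task is a URL, and processing a task means requesting that URL from whichever web server picks it up. The sketch below is an illustration only; the URLs are hypothetical and the fetch function is passed in so the example runs without a network.

```python
from collections import deque
from typing import Callable, Deque, List


def drain_task_queue(task_urls: Deque[str], fetch: Callable[[str], int]) -> List[str]:
    """Request each queued task URL in order; return the ones that failed.

    fetch takes a URL and returns an HTTP status code. In a real system it
    would make an HTTP request to the web cluster; here it is injected.
    """
    failed: List[str] = []
    while task_urls:
        url = task_urls.popleft()
        if fetch(url) != 200:
            failed.append(url)  # a real system might re-queue these
    return failed


if __name__ == "__main__":
    # Hypothetical task URLs, for illustration only.
    tasks = deque(["/tasks/deliver-u2u?id=1", "/tasks/reindex?id=2"])

    def fake_fetch(url: str) -> int:
        return 200 if "deliver" in url else 500

    print(drain_task_queue(tasks, fake_fetch))
```

Because a task is just a page request, any machine that can serve the site can also do background work, which is what lets the same cluster do both jobs.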

Main SQL Store

Nothing interesting here really I’m afraid. Just a reasonably beefy Dell machine with data replicated to a hotspare backup.

Solr Server

I’m now using Solr to generate the data for the new Message Analytics Feature. I’ll do a blog post about it at some point in the future but it’s incredibly fast compared to using XML data with SQL. Doing a ‘Group By’ on an XML value in SQL was taking around 1200ms for a particular data set (with an unloaded server). Using Solr on a *much* less powerful machine took 20ms. It’s an incredible piece of software, if slightly tricky to use.

Memcached

The staple of every high performance website. Memcached is a memory based data store. I don’t use it to store reasonably static data as that’s done in the ASP.Net cache object (which is 10x quicker due to it being on the machine itself), but I use Memcached to store precompiled data that’s used across machines. For example, if you get sent a U2U it’s a background task that ‘delivers’ it to your inbox. This task puts the message in your inbox, adds it to the search database, then takes the most recent 10 U2U’s for you and recompiles the HTML you see in your ‘Recent U2U Messages’ widget on your homepage and inserts it into Memcached. The background task then notifies the Totem server about the U2U, Totem notifies your browser, your browser requests the new HTML blob back from the webserver and guess what? It’s already been generated and the webserver just grabs it from Memcached. The beauty of using it over the ASP.Net cache is that cached objects can be shared across machines.
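The precompile-on-write pattern described above looks roughly like this. A plain dictionary stands in for Memcached so the sketch runs on its own, and the key format and function names are invented for the example rather than taken from our code.

```python
import html
from typing import Dict, List

# Stand-in for the shared Memcached instance (an assumption for this sketch).
shared_cache: Dict[str, str] = {}


def render_recent_u2us(inbox: List[str]) -> str:
    """Build the HTML blob for the 'Recent U2U Messages' widget (10 newest)."""
    items = "".join(f"<li>{html.escape(message)}</li>" for message in inbox[:10])
    return f"<ul>{items}</ul>"


def deliver_u2u(user_id: int, inbox: List[str], message: str) -> None:
    """Background task: store the message, then precompile the widget HTML."""
    inbox.insert(0, message)  # newest first
    shared_cache[f"recent_u2u:{user_id}"] = render_recent_u2us(inbox)


def recent_u2u_widget(user_id: int, inbox: List[str]) -> str:
    """Web server: serve the precompiled blob, rebuilding only on a cache miss."""
    key = f"recent_u2u:{user_id}"
    cached = shared_cache.get(key)
    if cached is None:
        cached = render_recent_u2us(inbox)
        shared_cache[key] = cached
    return cached


if __name__ == "__main__":
    inbox: List[str] = []
    deliver_u2u(7, inbox, "Fancy a cuppa?")
    print(recent_u2u_widget(7, inbox))
```

The expensive work happens once, when the message arrives, rather than every time the homepage is viewed.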

Memcached is a great bit of software and we’ve had absolutely zero issues with it. The current stats from our memcached instance are below:

STAT uptime 28625365 (nearly a year)
STAT time 1280509032
STAT pointer_size 32
STAT curr_items 30626 (It’s 100,000 or so during busy periods)
STAT total_items 10108777
STAT bytes 9450225
STAT curr_connections 17
STAT total_connections 10040
STAT connection_structures 24
STAT cmd_get 39701711
STAT cmd_set 10108777
STAT get_hits 33158267 (It’s saved a LOT of SQL reads!)
STAT get_misses 6543444
STAT bytes_read 3126086257
STAT bytes_written 821258193 (It’s served around 800MB!)
STAT limit_maxbytes 524288000
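A few of these counters are easier to appreciate once converted into something human. The quick sketch below just does the arithmetic on the figures listed above; the variable names are mine.

```python
# Selected counters copied from the stats listing above.
STATS = {
    "uptime": 28625365,        # seconds
    "cmd_get": 39701711,
    "get_hits": 33158267,
    "get_misses": 6543444,
    "bytes_written": 821258193,
}

hit_ratio = STATS["get_hits"] / STATS["cmd_get"]
uptime_days = STATS["uptime"] / 86400              # 86400 seconds in a day
megabytes_written = STATS["bytes_written"] / 1_000_000

if __name__ == "__main__":
    print(f"Hit ratio: {hit_ratio:.1%}")
    print(f"Uptime: {uptime_days:.0f} days")
    print(f"Bytes written: {megabytes_written:.0f} MB")
```

That works out to a hit ratio of a little over 83%, an uptime of about 331 days, and roughly 821 MB written as reported by this counter.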

And to think, I was almost tempted to use Velocity instead. You can read why I didn’t.

Summary

By applying a bit of thought and leveraging the right technology for each part of the puzzle we’ve got a platform that *way* outperforms a traditional single big webserver setup. We also have minimal load on the main SQL box by using quite aggressive caching (In memory on the local webserver, in Memcached and in SQL Express on the local webserver).

Hot air extraction – more efficient server room cooling

May 28, 2010 ~ briandrought ~ 1 Comment

In addition to the power for servers, a huge cost we have is cooling them. 6kW of servers is going to require some chilling. Our data room has air conditioning and it works very hard for a living, particularly in summer when the heat differential on the aircon exchanger outside is lower. In a big data centre you’d pump chilled air into a ‘cold aisle’ in front of a load of racks, and then have a ‘hot aisle’ behind them where you suck the air back into the A/C. Unfortunately our building wasn’t designed with this in mind so we simply have a wall mounted unit that cools the whole room. The problem with cooling the whole room, though, is that there’s no way of making sure the servers see chilled air; they might get air that has come directly out of the back of the rack and been sucked back round again.

Whilst doing some tidying up I spotted one of our old extraction fans from years gone by. When we were a much smaller company, air was drawn into the room at one end and extracted at the other end. It kept things cool enough until we started to need more equipment and then A/C was the only option.

Anyway, below, you can see the unused fan and our main server rack to the left.

There are probably some very expensive hot air extraction systems on the market, but I figured there was no point in spending a lot of cash to trial it out. B&Q to the rescue for some gaffa tape and guttering pipe. Add in the old box from my Herman Miller ‘Mirra’ chair, and an hour of creativity and we have a working hot air extraction system…..

I simply made a baffle in front of the fan and added ducting that goes down behind the server. It’s not pretty, but it does work:

There… proof it works! We dropped the temp measured at the top of the rack by a degree. Air intake temps on the servers lowered even more. Our SQL server was drawing in 24 degree air previously, and is now a lot more chilled. (21 degrees!). The UPS unit on the floor beside the rack had a similar drop from 25deg to 22deg.

We’ve massively reduced the strain on our air con unit for the grand sum of about £50 and the overhead of a 100W fan. (which is more than offset by the potential savings in air con for that room)

The next thing to try is adding curtains from the side of the rack to the wall to force the hot air into the extracted area.

Quick book review: Leaving Microsoft to Change the World

May 19, 2010 ~ briandrought ~ Leave a comment

http://www.amazon.co.uk/Leaving-Microsoft-Change-World-Entrepreneurs/dp/0007237030/

My rating: 9/10

A great read overall… not quite Three Cups Of Tea, but inspiring nonetheless. Unlike Greg Mortenson, John Wood started out from a very strong position as a senior exec at Microsoft. It’s fascinating to see how he uses lessons from his past life working with highly driven people like Steve Ballmer to create a non profit that has improved education for more than 4 million children in Bangladesh, Cambodia, India, Laos, Nepal, South Africa, Sri Lanka, Vietnam and Zambia.

As well as making you want to jack it all in for something more meaningful, it’s got some half decent business lessons in there.

If you had to make the choice, I’d go with Three Cups of Tea every time, but this is still a cracker.

Re-inventing the spell checker

May 5, 2010 ~ briandrought ~ 1 Comment

Background

Our system does a ‘review’ of messages after our operators save them. It checks for things like fields not being filled in where they normally are, but most importantly it checks for spelling mistakes and typos.

We used to use the Telerik Radspell spell checker component in a back end web service. It worked adequately but it had a limited dictionary, didn’t know the Queen’s English (it uses American spellings) and the suggested corrections were often a bit…… random as you can see from the screenshot below. The word column contains the supplied misspelling and the subsequent columns are the suggestions (in order).

How does an average spell checker work?

It’s pretty simple to make a crude spell checker and all you need is a dictionary of correct words. You take each word and check if it exists in the dictionary. If not, you then loop through the dictionary seeing how different each correct word is to the supplied misspelled word. There’s a well used algorithm for seeing how different words are. This is called the Levenshtein distance, or ‘edit distance’. Each addition/subtraction/substitution counts as an ‘edit’. For instance, take the misspelling of ‘hosspitle’:

hosspitle -> hospitle
hospitle -> hospitae
hospitae -> hospital

That’s an edit distance of 3. The lower the Levenshtein distance the more the words are alike.
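
To make the counting concrete, here is a minimal sketch of the standard dynamic programming version of the Levenshtein distance. It is written in Python purely for illustration (the system described here used SQL and VB.Net), and the function name `levenshtein` is just a label for this example.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("hosspitle", "hospital"))
```

For ‘hosspitle’ and ‘hospital’ this gives 3, matching the three steps listed above.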

There’s a slight snag with doing it this way though. If you have even a small dictionary of say 10,000 words you’d need to compare each of the 10,000 words to your misspelling. There’s no real way of pre-computing it as you can’t possibly cater for all misspellings. It’d be quite a costly computational exercise. We can get a much, much smaller subset of words to compare by selecting them based on a phonetic algorithm, the most common of which is soundex. This way we can pre-compute the soundex code for all of our known good words.

For example, you can get the soundex value for ‘hosspitle’ using SQL2008 by doing select soundex('hosspitle'). This gives a value of H213. If I check the soundex of ‘hospital’ I also get H213. This means that the correct result would be in the subset which is a good start!
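
For anyone without a SQL box to hand, the sketch below shows the classic American Soundex rules in Python and how a pre-computed index narrows the candidates. It is an illustration only: SQL Server’s built-in SOUNDEX can differ from the textbook rules in some edge cases, and the names `soundex` and `build_index` are invented for this example.

```python
from collections import defaultdict

# Letter groups used by classic American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(word):
    """Return the four-character Soundex code, e.g. 'H213'."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    last_code = _CODES.get(letters[0], "")
    for letter in letters[1:]:
        code = _CODES.get(letter, "")
        if code and code != last_code:
            result += code
            if len(result) == 4:
                break
        # H and W do not separate letters that share a code; vowels do.
        if letter not in "HW":
            last_code = code
    return result.ljust(4, "0")


def build_index(words):
    """Pre-compute soundex code -> list of dictionary words."""
    index = defaultdict(list)
    for word in words:
        index[soundex(word)].append(word)
    return index


index = build_index(["hospital", "hexapod", "houseboat", "theme", "driven"])
print(soundex("hosspitle"), index[soundex("hosspitle")])
```

Only the words sharing the misspelling’s code need to be compared by edit distance, which is the whole point of pre-computing the codes.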

Spell Check 2.0…….

Why?

Because crappy looking messages with spelling mistakes and typos don’t give the client a sense of professionalism. The previous spelling corrector called wolf a bit too often and I found that a lot of operators would get used to ignoring it. Also, if it didn’t list the correct suggestion first time around it took them a while to go back into the message, correct the word and resend… there was a temptation to just ignore the mistake and send it anyway.

Improvement #1 – Junking the Telerik engine.

First step is to reproduce what the Telerik spell checker does so I can start to develop my own system. This turned out to be pretty easy. Just find a dictionary of English words on the tinterweb, upload to SQL, create a column for a Soundex field and use the built in SQL Soundex function to pre-compute the soundex’s by doing “update englishwords set soundexvalue = soundex(word)”.

You can then select your word subset back by doing something like “select * from englishwords where soundexvalue = soundex(@mywrongword)”. Using the ‘hosspitle’ example, my dictionary gives me 47 records including

hagbut hasped hispid hackbut hagbuts hawkbit hexapod hackbuts hawkbits hexapods hexapody hiccuped hospital hagbuteer hagbutter hiccupped hispidity hospitals hospitium houseboat housebote hospitalizing hospitableness hospitalization

(No, I don’t know what half of those words are either!)

Anyway, once I get our subset I order it by Edit Distance, so hospital will come amongst the first few search results. This gave me exactly the same results as the Telerik engine and therefore a decent baseline to work from…..

Improvement #2 – Using a bigger dictionary

Bigger is better most of the time, and I needed more words. Searching the internet for a while I found some decent sized CSV’s which had lists of words along with the number of occurrences that word has been seen (this will be useful later). I uploaded this in exactly the same way as improvement 1, into a table called BigDictionary but with a field for the occurrences. The system now uses the previous English words dictionary just to check if it’s a valid word. If I don’t see it in the table I then use the BigDictionary table to retrieve a list of possibilities.

Improvement #3 – Learn words itself

If the spelling corrector uses a fixed dictionary, it doesn’t have a hope of keeping track with the modern world. For example, just looking at the ‘wrong’ words being flagged up by the system as it was at stage 2 I could see it was probably annoying operators. It had flagged up words such as Skype, Mercedes, Google, Ferrari, Bosch, Microsoft, Nokia etc. I wrote a small routine to go through two years worth of messages, separate out each word and upload it to a table. If the word was already in the table, I incremented an ‘occurrences’ field (again, this will be useful later!). I set the system to gate the results so uncommon words don’t appear in the suggestions. This helps to stop any misspellings being learnt as valid words.

I check the BigDictionary table for suggestions, then the LearntWords table and aggregate the suggestions before sorting by edit distance.

Improvement #4 – Double Metaphone

The soundex algorithm is pretty basic and it’s totally reliant on the first letter being correct. This meant that the pre-computed subset was often a bit limited and wouldn’t contain the correct result. After doing a bit of research into phonetic algorithms it seemed like Double Metaphone was a good bet and a fair bit more advanced than soundex. I created a Primary and Secondary Metaphone field for all of my dictionary tables so far (including the learnt words table) and made a script to calculate the primary and secondary metaphone values for every word. After an hour of it grinding away I had precomputed values for metaphone as well as soundex. I changed my SQL queries to something like Select * from dictionary where (pm = @pm or sm=@sm or soundex=@soundex). This instantly made the results set bigger and it seemed to get a few more hits particularly if the typo was early on in the word.

Improvement #5 – Weight by Frequency as well as Edit Distance

If you look at the screenshot of the initial results you’ll see the Telerik checker suggested ‘darvon’, ‘driven’, ‘thriven’ for the typo ‘dirven’. This is because it has no idea how common a word is, and it just so happens that ‘darvon’ has the same edit distance from ‘dirven’ as ‘driven’. I have absolutely no idea what a darvon is, and I suspect neither would our callers. Fortunately, in the BigDictionary and my LearntWords tables I have an integer field essentially telling me how common that word is. I decided against simply using the count as a multiplier of a ‘relevancy’ as some words are hugely more common than others and would overwhelm the edit distance… for example if you put ‘thene’ instead of ‘theme’, you’d find that it’d suggest ‘the’ as it’s vastly more common than theme, or even them and then. Instead, I used the word’s ‘position’ in the frequency-ordered results as the multiplier, so my SQL became something like:

Select * from dictionary where (pm = @pm or sm=@sm or soundex=@soundex) order by wordcount desc

I then take the results in and do something like:

For Each Result
    Position += 1
    Score = Position * EditDistance(Result,Word)
Next

The lower the score, the more relevant the results.
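
The loop above can be sketched end to end as follows. This is a Python illustration rather than the production code; the word counts in the example are invented, and `rank_by_position` is a hypothetical name.

```python
def levenshtein(a, b):
    """Plain Levenshtein edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def rank_by_position(misspelling, candidates):
    """candidates is a list of (word, occurrences) pairs.

    Mirrors 'order by wordcount desc' followed by
    Score = Position * EditDistance; the lowest score comes first.
    """
    by_frequency = sorted(candidates, key=lambda c: c[1], reverse=True)
    scored = []
    for position, (word, _count) in enumerate(by_frequency, start=1):
        scored.append((position * levenshtein(word, misspelling), word))
    return sorted(scored)


# Occurrence counts below are made up for the example.
print(rank_by_position("dirven",
                       [("darvon", 3), ("driven", 5000), ("thriven", 40)]))
```

With these made-up counts ‘driven’ is first in the frequency ordering, so its score stays at its edit distance of 2, while the rarer words have their distances multiplied by 2 and 3.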

Improvement #6 – Learn from our mistakes

Looking through the log of mistakes and corrections it seemed to be that the same ones were coming up again and again, for example ‘Plesae’ being changed to ‘Please’. It’s pretty obvious, but the system should look at what’s been corrected for that same mistake and bring up the correction in the results. To recap, the process we’re now doing is:

Check EnglishWords table to see if it’s a common word
Check LearntMistakes to see if we’ve seen the mistake before, if so, load in the corrections into an array of suggestions
Search LearntWords by Soundex and Double Metaphone to see any soundalike word we’ve seen before in a previous message
Search BigDictionary by Soundex and Double Metaphone to see any soundalike words that are in the dictionary
Score all suggestions retrieved by Edit Distance and Position

Improvement #7 – Weight by source

Now we’re pulling in previous corrections, it’s pretty obvious that some sources are more relevant than others. For example, if I’ve seen ‘plesae’ changed to ‘please’ 80 times, it’s a fair bet when I next see ‘plesae’ they didn’t mean ‘police’, ‘palace’ etc. So, our array of suggestions that is being filled by our LearntMistakes, LearntWords and BigDictionary suggestions now gains a source column, and our weighting code becomes something like:

For Each Result

    Select Case Source
       Case PreviousCorrections
          SourceWeight = 10
       Case LearntWords
          SourceWeight = 15
       Case BigDictionary
          SourceWeight = 20
    End Select

    Position += 1
    Score = Position * EditDistance(Result,Word) * SourceWeight
Next
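
As a runnable illustration of the same idea, the Python sketch below applies the three source weights from the pseudocode. It assumes the suggestions arrive already ordered as the loop would see them and with their edit distances worked out; the dictionary keys and the `score_suggestions` name are inventions for this example.

```python
# Weights copied from the pseudocode above: lower means more trusted.
SOURCE_WEIGHT = {
    "PreviousCorrections": 10,
    "LearntWords": 15,
    "BigDictionary": 20,
}


def score_suggestions(suggestions):
    """Score = Position * EditDistance * SourceWeight, lowest first.

    Each suggestion is a dict with 'word', 'source' and 'distance'
    (the edit distance from the misspelling, computed elsewhere).
    """
    scored = []
    for position, suggestion in enumerate(suggestions, start=1):
        score = (position
                 * suggestion["distance"]
                 * SOURCE_WEIGHT[suggestion["source"]])
        scored.append((score, suggestion["word"]))
    return sorted(scored)


# Suggestions for 'plesae'; distances are plain Levenshtein values.
suggestions = [
    {"word": "please", "source": "PreviousCorrections", "distance": 2},
    {"word": "police", "source": "BigDictionary", "distance": 4},
    {"word": "palace", "source": "BigDictionary", "distance": 4},
]
print(score_suggestions(suggestions))
```

Here ‘please’ scores 1 x 2 x 10 = 20, well ahead of ‘police’ at 2 x 4 x 20 = 160 and ‘palace’ at 3 x 4 x 20 = 240.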

Improvement #8 – Learn words by client not just globally

Some of our clients have industry specific words, for example, if someone phones up to book a car with Supercar Experiences and we see the typo Miserati it’s pretty likely the operator meant Maserati and not Miserable. When I processed the previously seen words from the last few years, I actually created two tables. One that was global across all clients, and one that had a client code on each row, i.e. treat the learnt words separately per company. I use a much lower threshold on this table so the system is quicker to allow learnt words into the suggestions than on the global table. This is purely because any wrong words that get learnt will only appear in suggestions for that company and won’t poison the global dictionaries.

Improvement #9 – Treat transpositions differently

One snag with the Levenshtein distance algorithm is that it has no way of detecting transpositions, so ‘ditsance’ is an edit distance of 2 from ‘distance’. Changing to the Damerau–Levenshtein distance algorithm changes that and seemed to massively improve results where it was just a transposition.
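
A minimal Python sketch of the difference is below. It implements the restricted form of Damerau–Levenshtein (often called optimal string alignment), which is an assumption on my part as the post doesn’t say which variant was used; `osa_distance` is a name chosen for this example.

```python
def osa_distance(a, b):
    """Edit distance where swapping two adjacent characters costs 1.

    This is the 'optimal string alignment' variant of
    Damerau-Levenshtein: insert, delete, substitute, transpose.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent characters swapped, e.g. 'ts' versus 'st'.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ditsance", "distance"))
print(osa_distance("dirven", "driven"), osa_distance("dirven", "darvon"))
```

‘ditsance’ is now 1 edit from ‘distance’, and ‘dirven’ is 1 from ‘driven’ but still 2 from ‘darvon’, so a simple swap no longer ties with a genuinely different word.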

Improvement #10 – Context

This is my favourite part……! By now the system is getting pretty smart and the number of messages going out with mistakes is falling rapidly (I re-analyse every message that’s sent after the review process so I can count word errors) but there’s still something missing and sometimes it seems a bit woeful compared to the human brain. We’re pretty good at reading typos and half the time our brain has corrected the word without us noticing.. this is because we know what word to expect. The computer however, doesn’t.

Consider the following sentence: “sending info regarding meeting she had witrh you last month”. We can see they clearly meant with, but the computer has no idea and has to evaluate it without context.

I fed the system a made up message with words in context that it had previously struggled on. The message was “you itnerested in. off hlaf way. some ifno on. refused to leavr number. was looking to spk with accounts. her leter box. meeting he had witrh you last week. llease call regarding”

You can see in the screenshot below that the primary suggestion in word2 field was pretty rotten most of the time:

What if, we had a massive database of text……? Lucky really, we do.

I wrote a routine to go back through our previous messages and split every sentence into three word groups, so the sentence “Wanted to follow up on the meeting he had with you last week” would give us:

Word1	Word2	Word3
wanted	to	follow
to	follow	up
follow	up	on
up	on	the
on	the	meeting
the	meeting	he
meeting	he	had
he	had	with
had	with	you
with	you	last
you	last	week

So there we have it… context. Whizzing through our database of past messages gave me around a million different three word phrases. Again, I used a ‘count’ so if it was a common phrase such as “please call back” I’d just increment the count if it was already in the database.
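
Splitting sentences into three word groups and counting repeats takes very little code. The sketch below is Python for illustration only; the tokenising rule (lower-case letters and apostrophes) and the names `three_word_groups` and `phrase_counts` are assumptions, not the original routine.

```python
import re
from collections import Counter


def three_word_groups(sentence):
    """Return every run of three consecutive words as a tuple."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return list(zip(words, words[1:], words[2:]))


# The Counter plays the role of the 'count' column: a phrase seen
# again just has its count incremented.
phrase_counts = Counter()
for message in (
    "Wanted to follow up on the meeting he had with you last week",
    "Please call back regarding the meeting he had with you",
):
    phrase_counts.update(three_word_groups(message))

print(phrase_counts[("he", "had", "with")])
```

The first sentence yields the same eleven rows as the table above, and phrases that appear in both messages, such as ‘he had with’, end up with a count of 2.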

Then, I added another stage to the spell check, which was find words in context. If I came across an unknown word, I’d simply look in my table of the word phrases by using the surrounding words. For example, if I have ‘please ca regarding’ I’d simply search for any row where word1=please and word3=regarding. Here are some example results:

Please call regarding

Please email regarding

Please contact regarding

I then load all the returned middle words into my array, giving them a low source weighting so they rank highly (remember, the lower the score, the better).

This context method gives the engine a much better idea of what the word could be than previously. Without context, the ‘please ca’ example the suggestion would likely be ‘please can’ which obviously makes no sense if the following word is ‘regarding’ but would make a lot of sense if word3 was ‘you’.
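
The context lookup itself is just a filter on the first and third words. Below is a small Python sketch of it; the phrase counts are invented sample data and `middle_words` is a hypothetical helper name.

```python
from collections import Counter

# Invented sample of (word1, word2, word3) -> times seen.
PHRASE_COUNTS = Counter({
    ("please", "call", "regarding"): 120,
    ("please", "email", "regarding"): 35,
    ("please", "contact", "regarding"): 12,
    ("please", "can", "you"): 200,
})


def middle_words(phrase_counts, word1, word3):
    """Middle words seen between word1 and word3, most common first."""
    matches = Counter()
    for (first, middle, third), count in phrase_counts.items():
        if first == word1 and third == word3:
            matches[middle] += count
    return [word for word, _count in matches.most_common()]


print(middle_words(PHRASE_COUNTS, "please", "regarding"))
print(middle_words(PHRASE_COUNTS, "please", "you"))
```

With this sample data ‘please ca regarding’ brings back call, email and contact, while ‘can’ only turns up when the following word is ‘you’.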

This screenshot shows how much better the results are with an idea of context:

Stage #11 – Always learning

Goes without saying really, but the system continuously learns words and three word phrases from each new message.

Stage #12 – Wrong words

The danger with the system learning is that it could learn wrong words. I have a block process and once a week I check for any words that it’s learnt that are above or near the inclusion thresholds to appear in the results. With a single click I can either delete the word from the tables, or delete and block the word from ever being learnt by adding it to a BlackListedWords table.

Summary

The process we’re now doing is:

Check EnglishWords table to see if it’s a common word
Check LearntMistakes to see if we’ve seen the mistake before, if so, load in the corrections into an array of suggestions
Check ThreeWordPhrases using context to see what the word could be
Search LearntWords by Soundex and Double Metaphone to see any soundalike word we’ve seen before in a previous message for this client
Search LearntWords by Soundex and Double Metaphone to see any soundalike word we’ve seen before in any previous message (higher threshold)
Search BigDictionary by Soundex and Double Metaphone to see any soundalike words that are in the dictionary
Score all suggestions retrieved by Edit Distance, Position and a Source weighting

Conclusion

Has it made any difference? Yes!!

As I mentioned before, I re-analyse every message that’s sent after the review process. To make it fair, I re-analysed the past 3 months worth and did some stats. The number of spelling mistakes and typos was never really very high as we have a very strict QC policy but in percentage terms, going on the two weeks the new system has been in place, the number of mistakes sent out to clients has dropped by 85%. It also speeds up the operators as if they spot a mistake it used to take a while to correct if the suggestions were poor.

All in all, a very worthwhile exercise and a great learning project… I ended up learning linguistics, re-learning probability and reading some ‘challenging’ research papers!

Monitoring Electricity Usage

February 25, 2010 ~ briandrought ~ Leave a comment

Simple one, this one…

We wanted to see exactly how much power we were using, and wanted to be able to display this information to staff.

First off, you need a monitoring device. I opted for the CurrentCost Envi with the optional data lead (and two more sensors as we’re on three phase!)

Next, you download the driver from the CurrentCost site. Then you plug the monitor into your USB port. In theory it’s now pumping data into COM3 at 57600 baud. Ace.

A quick check with HyperTerminal (you have to go hunting for this, it died with XP!) and sure as hell… we have some data coming in. The Envi pumps in the current readings at 6 second intervals. Cool.

Now, a teeny tiny bit of code in VB.Net gets the data into your app. With .NET 3.5 you get a nice SerialPort control. Drag one of those onto your form, and then add this code:

Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
    ' Open the COM port that the CurrentCost driver exposes
    SerialPort1.PortName = "COM3"
    SerialPort1.BaudRate = 57600
    SerialPort1.Handshake = IO.Ports.Handshake.None
    SerialPort1.Open()
End Sub

Private Sub SerialPort1_DataReceived(ByVal sender As Object, ByVal e As System.IO.Ports.SerialDataReceivedEventArgs) Handles SerialPort1.DataReceived
    ' Read up to the next newline and dump it to the debug output
    Dim datain As String = SerialPort1.ReadLine()
    System.Diagnostics.Debug.Print(datain)
End Sub

Tada!!!! You now have live electricity readings from within your app, coming in nice XML blobs like this:

<?xml version="1.0" encoding="utf-8" ?>

<msg>
  <src>CC128-v0.12</src>
  <dsb>00001</dsb>
  <time>12:38:19</time>
  <tmpr>18.5</tmpr>
  <sensor>0</sensor>
  <id>00077</id>
  <type>1</type>
  <ch1>
    <watts>02330</watts>
  </ch1>
</msg>

A bit of XML jiggery pokery and you have a reasonably accurate data feed of your power readings in your SQL server.
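
The ‘jiggery pokery’ isn’t shown here, so the following is only a sketch of one way to pull the useful fields out of a message, written in Python with the standard library rather than the VB.Net used above. The `parse_reading` name and the shape of the returned dictionary are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# One message in the format shown above, without the XML declaration.
SAMPLE = (
    "<msg><src>CC128-v0.12</src><dsb>00001</dsb><time>12:38:19</time>"
    "<tmpr>18.5</tmpr><sensor>0</sensor><id>00077</id><type>1</type>"
    "<ch1><watts>02330</watts></ch1></msg>"
)


def parse_reading(xml_text):
    """Extract time, temperature, sensor and per-channel watts."""
    msg = ET.fromstring(xml_text)
    watts = {}
    for child in msg:
        # Channels appear as <ch1>, <ch2>, ... each holding <watts>.
        if child.tag.startswith("ch") and child.find("watts") is not None:
            watts[child.tag] = int(child.findtext("watts"))
    return {
        "time": msg.findtext("time"),
        "temperature_c": float(msg.findtext("tmpr")),
        "sensor": int(msg.findtext("sensor")),
        "watts": watts,
        "total_watts": sum(watts.values()),
    }


print(parse_reading(SAMPLE))
```

From there each reading is a handful of plain values that can be written to a table with whatever database library you already use.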

To see what I did with the data, have a look at the JAM Blog

Project Totem – A Long Polling server (Part 1)

November 25, 2009 ~ briandrought ~ 2 Comments

Normal Polling

Let’s start with normal polling. The browser simply runs some Javascript on a timer that repeatedly checks for new data on the server. The problem with this is that there’s a trade-off between latency and bandwidth. If you were to use a timer that ran every minute your server load would be minimal…. but there’d also be up to a minute before the user saw the changed data. You could drop it to a very short interval but you’d have a LOT of requests to your site.

The server setup is unchanged from a normal web server setup:

Long Polling

If we don’t want the bandwidth/latency trade off there is another way. You can use the timeout function of most AJAX libraries (I use jQuery) to perform ‘long polling’. Instead of the conversation between the browser and the server going like this:

“Anything new…………………..? Anything new…………………..? Anything new…………………..? Anything new…………………..?Anything new…………………..? Anything new…………………..?Anything new…………………..? Anything new…………………..? Anything new…………………..?”

It goes more like this:

“Tell me if anything new comes along in the next 20 seconds ………………………………………… ………………………………………………………………….

Nothing? OK, let’s try again…

Tell me if anything new comes along in the next 20 seconds ………………………………………… …………………………………………………………………."

That’s a much more efficient use of bandwidth, but here’s the double bubble bonus. So long as the server it’s asking returns the new data and closes the connection there’s actually less latency. With long polling it doesn’t matter when the new data arrives, whereas with standard polling you have to wait until the next poll.

In flowchart form, long polling is super simple:

So we’re all sorted right? Not quite.

The problem with long polling using a regular web server is, it’s not very efficient. You end up with a LOT of open connections, and other than having IIS sit there spinning on each ‘poll’ page waiting for new data to come in, there’s not really a nice notification structure either. Apache is even worse on this front as it really dislikes connections being held open. Another minor snag is that you don’t want to query the original hostname for the data. Most browsers only allow you 2 connections per site, so if you tie up one on the polling there’s only one left to actually fetch data.

So, the answer is a dedicated polling server.

These things exist in the *nix world, most notably CometD, but it’s a lot to learn just to do something simple.

After 10 minutes of pontificating, I decided to do the obvious. Make my own! Project Totem is born. ( because a Totem is a ‘long pole’ and also as a nod to my friend Sam who runs Totem Development )

In essence it’s a very simple Windows Sockets application that just pushes Javascript back to the browser. The browser then executes that script and gets the data from the original web server.

The server generates a GUID that’s sent to each page in the polling JavaScript. The server also tells Totem that it’s served that GUID, and that page needs to know about changes to data sets A, B and C.

The browser then polls Totem using the GUID, and if there’s nothing new the request will just time out after 20 seconds. It then polls again, and repeats polling using a 20 second timeout. The very millisecond that Totem gets a notification from the webtier that say data set A has changed, it returns the appropriate Javascript back to the browser and shuts the connection. The browser then does whatever you want to go get the data etc.
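
To show the shape of that register / poll / notify cycle, here is a small in-process simulation in Python. It is not Totem’s code (Totem is a Windows Sockets application and its internals are covered in part 2); the `LongPollHub` class, its method names and the use of threads in place of network connections are all assumptions made for the sketch.

```python
import threading
import uuid


class LongPollHub:
    """Holds each 'request' open until data changes or it times out."""

    def __init__(self):
        self._condition = threading.Condition()
        self._subscriptions = {}  # guid -> data sets the page cares about
        self._pending = {}        # guid -> data sets changed since last poll

    def register(self, data_sets):
        """Web tier: record a newly served page and hand back its GUID."""
        guid = str(uuid.uuid4())
        with self._condition:
            self._subscriptions[guid] = set(data_sets)
            self._pending[guid] = []
        return guid

    def notify(self, data_set):
        """Web tier: a data set changed, so wake any waiting polls."""
        with self._condition:
            for guid, wanted in self._subscriptions.items():
                if data_set in wanted:
                    self._pending[guid].append(data_set)
            self._condition.notify_all()

    def poll(self, guid, timeout=20.0):
        """Browser: block until something changes or the timeout passes.

        Returns the changed data sets, or an empty list on timeout.
        """
        with self._condition:
            self._condition.wait_for(lambda: self._pending[guid],
                                     timeout=timeout)
            changed, self._pending[guid] = self._pending[guid], []
            return changed
```

A page would call `register(["A", "B", "C"])` once, then loop on `poll(guid)`: an empty list means the 20 seconds ran out and it should simply poll again, while a non-empty list tells it which data to go and fetch from the web server.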

I’ll explain more about how I’m tracking keys/scripts and GUIDs etc in part 2 🙂

Tea	Brew Time
Yorkshire	4 mins
PG	3 mins
Darjeeling	2 mins
Assam	3.5 mins
Jasmine	5 mins