Building a scalable, fault tolerant background processing system (part 2)

After having decided how I’ll roll my batch jobs up (see Part 1) I now needed a ‘thingy’ ™ to fire the jobs off.

This ‘thingy’ should probably be a .NET desktop app rather than a service so I can touch it and see it. This ‘thingy’ simply needs to look in a SQL table to fetch the ‘queue’ and then it needs to execute the jobs. But what if it crashes? Or that machine goes down? Simple.. we go multi server. We simply run the ‘thingy’ (let’s call it QueueProcessor from now on?) on each of the web servers in our web farm. We’ll put the ‘job’ web pages on each machine as well so they can simply call http://localhost/backgroundservices/dosomething.aspx

OK, but what do we do about polling? Since this is designed for scalability and resilience over outright speed/low latency you probably only need to check the queue once per second (and let’s be honest, if you wanted to be more responsive than that you wouldn’t poll at all, you’d find a way of pushing data instead). If it was a single server it’d be a piece of cake, but multi server we need to be cleverer.

We need to poll the queue every second, but from alternating machines

A few caveats though:

The local times on the machines will be different.
The network latency to the server can be different from machine to machine (if some are located elsewhere)

If the local times are out of sync you can’t possibly hope to get 2 servers to ping every second. If one server is 0.9 seconds out you’d get checks at 0.000, 0.100, 1.000, 1.100. Not ideal.

So I use SQL time as a central time, i.e. return GetDate() as part of the queue check. Then… I also work out our network latency by figuring out the response time (we assume the query took 0ms on the server). Once I’ve done that, I then have an offset between local time and server time. I then have a timer than runs every 10ms, and just does a simple calculation using local time + the offset, and the number of running machines (grabbed by looking at how many machines have ‘checked in’) to determine if it should do anything.

Simple.