Dealing With Worker Role Failovers by Scheduling Duplicate Jobs

August 21, 2013 — 2 Comments


This question came from Rajat on a post I did about using Quartz.Net to schedule jobs in Windows Azure Worker roles.

When we are using quartz with multiple instances. If one instance that schedules the job goes down, how will the other instances maintain it. Because if one instance goes down then the triggers of that instance also goes down. [source]

Back then I didn’t have a great answer for him and I’m sorry about that.

Since then, I ran into the same problem. While one of my long-running jobs was executing, my worker role was updated by the Windows Azure Fabric Controller. As a result, the job never completed and left the system with an incomplete data import.

To resolve this issue, I decided to track job completions in Windows Azure Table Storage. When a job runs to completion, it logs the job’s name along with the current date and time. Each job is scheduled twice, and the second instance checks that log before doing any work. If the first attempt never recorded a completion, the second instance runs the job; otherwise it terminates itself.
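The check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from my worker role: `CompletionLog` is a hypothetical in-memory stand-in for the Table Storage table, and `run_job` and `window_start` are names made up for this example.

```python
from datetime import datetime, timedelta, timezone


class CompletionLog:
    """Hypothetical in-memory stand-in for a Table Storage completion table."""

    def __init__(self):
        self._completed = {}  # job name -> UTC time of the last completed run

    def mark_completed(self, job_name, when):
        self._completed[job_name] = when

    def completed_since(self, job_name, window_start):
        last = self._completed.get(job_name)
        return last is not None and last >= window_start


def run_job(job_name, work, log, window_start, now=None):
    """Run `work` unless the job already completed at or after `window_start`.

    Returns True if the work ran to completion, False if the run was skipped.
    """
    if log.completed_since(job_name, window_start):
        return False  # the first attempt finished, so the duplicate exits
    work()
    # Only reached when work() returns normally. If the role is recycled or
    # the job throws, nothing is logged and the duplicate will run the job.
    log.mark_completed(job_name, now or datetime.now(timezone.utc))
    return True
```

Both scheduled instances would call `run_job` with the same `window_start` (for example, the first instance's scheduled time), so whichever one finishes first causes the other to skip.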

This solution is great for jobs that run once or twice a day. This may not be the best solution for jobs that run every hour. When jobs run on a tight schedule, it may be preferable not to schedule duplicates, but to drop the failed jobs and wait for the next job instance to run.

Scheduling duplicate jobs isn’t natural, but it’s a necessity on Windows Azure. Failovers happen regularly, whether for scheduled updates or critical failures, and we need to equip ourselves to deal with these interruptions.

For this to be effective, you will need to calculate a time offset using the average time a job takes to run to completion. Add a slight buffer to this average and schedule the second job using the time from the first job plus the resulting offset. This keeps the second job from starting before a typical first run has had time to complete. Size the buffer generously if your run times vary a lot.
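As a rough sketch of that calculation, again in Python: `second_job_start` and `buffer_ratio` are hypothetical names for this example, and the 20% default is an assumption, not a recommended value.

```python
from datetime import datetime, timedelta
from statistics import mean


def second_job_start(first_start, past_durations, buffer_ratio=0.2):
    """Return the start time for the duplicate job.

    offset = average past duration + buffer, where the buffer is a fraction
    (buffer_ratio, an assumed tuning knob) of that average.
    """
    avg_seconds = mean(d.total_seconds() for d in past_durations)
    offset = timedelta(seconds=avg_seconds * (1 + buffer_ratio))
    return first_start + offset
```

For example, with past runs of 40, 50 and 60 minutes the average is 50 minutes, a 20% buffer adds 10 minutes, and a first job scheduled for 02:00 gets its duplicate at 03:00.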

Try it out and let me know how it turns out in your Cloud Services.

More on Windows Azure Table Storage Service

2 responses to Dealing With Worker Role Failovers by Scheduling Duplicate Jobs

  1. 

    Why don’t you use a queue for that? The only thing I do when I run a time-based scheduled job is to put a message in a queue to be processed elsewhere. That way if the job failed to complete I know the message will be visible again in the queue and be re-processed. I know that the worker could still be offline when the trigger kicks in so I guess keeping track of job’s completion is a good idea.

    Also with a queue I can manually trigger the job by sending a message if necessary.
