This question came from Rajat on a post I did about using Quartz.Net to schedule jobs in Windows Azure Worker roles.
When we are using quartz with multiple instances. If one instance that schedules the job goes down, how will the other instances maintain it. Because if one instance goes down then the triggers of that instance also goes down. [source]
Back then I didn’t have a great answer for him and I’m sorry about that.
Since then, I ran into the same problem. While one of my long-running jobs was executing, my worker role was updated by the Windows Azure Fabric Controller. Consequently, my job never completed and left the system with an incomplete data import.
To resolve this issue, I decided to track job completions in Windows Azure Table Storage. When a job runs to completion, it logs the job’s name along with the current date and time. Each job is scheduled twice, and every job instance starts by checking that log. If the first attempt completed, the second attempt terminates itself. If the first attempt failed and left no completion record, the second attempt runs.
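The check described above can be sketched roughly as follows. This is a minimal, language-agnostic illustration written in Python rather than the Quartz.Net code I actually run. `CompletionLog` is a hypothetical in-memory stand-in for the Table Storage table, and `run_attempt` and `window_start` are names I made up for this example.

```python
from datetime import datetime, timedelta, timezone


class CompletionLog:
    """Hypothetical in-memory stand-in for a Table Storage table."""

    def __init__(self):
        self._completed = {}  # job name -> time of the last completed run

    def record(self, job_name, when):
        self._completed[job_name] = when

    def completed_since(self, job_name, window_start):
        last = self._completed.get(job_name)
        return last is not None and last >= window_start


def run_attempt(log, job_name, window_start, work, now=None):
    """Run `work` unless an earlier attempt already completed in this window.

    Returns True if the work ran, False if this attempt terminated itself.
    """
    if log.completed_since(job_name, window_start):
        return False  # the first attempt succeeded, nothing left to do
    work()  # if this raises, no completion record is written
    log.record(job_name, now or datetime.now(timezone.utc))
    return True
```

Both scheduled attempts call `run_attempt` with the same `window_start`, for example the time the first attempt was scheduled for. Because the completion record is only written after `work` returns, an attempt that is interrupted leaves nothing behind and the second attempt picks up the job.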
This solution is great for jobs that run once or twice a day. This may not be the best solution for jobs that run every hour. When jobs run on a tight schedule, it may be preferable not to schedule duplicates, but to drop the failed jobs and wait for the next job instance to run.
Scheduling duplicate jobs isn’t natural, but it’s a necessity on Windows Azure. Failovers happen regularly, whether for scheduled updates or critical failures, and we need to equip ourselves to deal with these interruptions.
For this to be effective, you will need to calculate a time offset using the average time a job takes to run to completion. Add a slight buffer to this average and schedule the second job at the first job’s start time plus the resulting offset. This makes it unlikely that the second job will try to execute before the first job has run to completion.
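As a rough illustration of that arithmetic, here is a small Python sketch. The function name `second_attempt_time`, its parameters, and the ten-minute default buffer are assumptions made for this example. Pick a buffer that suits how much your own job durations vary.

```python
from datetime import datetime, timedelta


def second_attempt_time(first_start, past_durations, buffer=timedelta(minutes=10)):
    """Return when to schedule the second attempt.

    first_start    -- scheduled start time of the first attempt
    past_durations -- timedeltas of previous runs that completed
    buffer         -- slack added on top of the average duration
    """
    if not past_durations:
        raise ValueError("need at least one past duration to average")
    average = sum(past_durations, timedelta()) / len(past_durations)
    return first_start + average + buffer
```

For example, if past runs took 40, 50 and 60 minutes, the average is 50 minutes. With a 10 minute buffer, a first attempt scheduled for 02:00 gets its second attempt scheduled for 03:00.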
Try it out and let me know how it turns out in your Cloud Services.