In recent months, Microsoft has accelerated the release cadence of Windows Azure Guest OS updates. Cloud Services benefit greatly from these frequent updates, but it also means that your services will reboot more often. For Microsoft to back your Windows Azure Cloud Services with its SLA, it’s recommended that you run a minimum of 2 instances of each Role. On the other hand, if your Cloud Service is built to tolerate sporadic interruptions, you shouldn’t have to worry about over-provisioning Roles, because it probably doesn’t require the extra uptime guarantees provided by the elevated SLA.
Windows Azure Guest OS updates may affect some Cloud Services that require specific configurations to work properly. On Feb 6, 2014, the Windows Azure team released a survey to find out how they could serve us better. They intend to use this data to make decisions on future Guest OS policies, including retirement, disablement and expiration periods. Go to the Windows Azure Guest OS Survey to take the survey. It will be available until Feb 21, 2014.
More information about the Windows Azure Guest OS can be found on the Windows Azure Guest OS Releases and SDK Compatibility Matrix.
The Continuous Service that I use to collect community interest through Twitter is hosted on a single Worker Role, which is responsible for absorbing a tremendous number of Tweets about Windows Azure. It then indexes them and tries to group them in order to identify what we’re really interested in.
The Worker Role instance is set to remain on the latest version of the Windows Azure Guest OS. Normally, it recovers from maintenance reboots without any issues, but recently it stopped receiving Tweets. I only noticed after 24 hours, by which point I had lost a full day’s worth of Tweets.
This brings me to the true message I wish to convey through this short but important post. We must continuously monitor the health of our Cloud Services. Whether we track it by regularly pinging endpoints or through other mechanisms, we must evaluate whether we’re monitoring the right metrics. When a Cloud Service does not power a Web interface, we need to observe other metrics to determine whether it is healthy. In my case, monitoring the Role’s uptime is not enough: the Role was up the whole time, yet no Tweets were coming in. Therefore, I will need to check whether Tweets are being recorded within a normal interval of time. To get the right numbers, I will need to look at the past 2 months of data and determine what normal means.
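As a rough illustration of that last check, here is a minimal Python sketch. It is not the code running in my Worker Role. The function names (`staleness_threshold`, `is_stale`) and the `safety_factor` parameter are made up for this example, and using the longest historical gap as the baseline is only one possible definition of "normal". It assumes you can load the timestamps of previously recorded Tweets from wherever they are stored.

```python
from datetime import datetime, timedelta
from typing import Sequence


def staleness_threshold(timestamps: Sequence[datetime],
                        safety_factor: float = 2.0) -> timedelta:
    """Derive an alert threshold from historical record timestamps.

    Takes the longest gap observed between consecutive records and
    multiplies it by a safety factor (an assumption; tune it to taste).
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps to measure a gap")
    ordered = sorted(timestamps)
    # Time elapsed between each record and the one that follows it.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps) * safety_factor


def is_stale(last_recorded: datetime, now: datetime,
             threshold: timedelta) -> bool:
    """Return True when nothing has been recorded for longer than the threshold."""
    return now - last_recorded > threshold


# Hypothetical history: one Tweet recorded every 5 minutes.
history = [datetime(2014, 2, 1, 12, 0) + timedelta(minutes=5 * i)
           for i in range(10)]
threshold = staleness_threshold(history)

# A short pause is within the normal range; a full day of silence is not.
print(is_stale(history[-1], history[-1] + timedelta(minutes=3), threshold))
print(is_stale(history[-1], history[-1] + timedelta(hours=24), threshold))
```

In practice the history would cover the past 2 months rather than ten sample points, and a check like this would run on a schedule outside the Role being monitored, so that it keeps working when the Role itself is the thing that has stopped.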
Think about the Cloud Services that you maintain. Are there mechanisms in place that allow you to verify that they’re working as expected?