This is a lesson that took me a while to learn: providing your team with as much information as possible saves time and money.
Debugging a distributed system that uses the Windows Azure Queue Storage service to pass commands between compute nodes and services can be challenging. It can look like magic! And it can scare the best of your team’s wizards!
Most of the time, the biggest difficulty I run into when building this kind of service is traceability. Where should events be logged, what events should be logged, and what kind of logging mechanism should be used? These are a few questions that need to be answered early in the application’s life cycle.
Once these questions are answered, it’s absolutely crucial that everyone on the team knows what goes into the diagnostics and how they’re persisted. Without this knowledge, team members won’t know what to make of the collected diagnostics, and the diagnostics won’t serve their intended purpose.
Whenever an exception occurs while a queue message is being processed, I log the exception stack trace along with the original queue message and a human-friendly description of what might have gone horribly wrong. This gives me the extra information required to replay the failing process in a controlled environment.
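To make that concrete, here’s a minimal sketch in Python using the azure-storage-queue SDK. The process handler, the "commands" queue name, and the connection string are placeholders for whatever your worker actually does; the point is what gets logged when the handler blows up:

```python
import logging

from azure.storage.queue import QueueClient

logger = logging.getLogger("queue_worker")


def process(raw_message: str) -> None:
    """Hypothetical command handler; raises when a command cannot be executed."""
    ...


def drain_queue(connection_string: str) -> None:
    queue = QueueClient.from_connection_string(connection_string, queue_name="commands")

    for message in queue.receive_messages():
        try:
            process(message.content)
            queue.delete_message(message)  # only remove messages that were handled
        except Exception:
            # Log the stack trace *and* the original payload, plus a human-friendly
            # hint, so the failure can be replayed later in a controlled environment.
            logger.exception(
                "Failed to execute command; keeping the original payload for replay. "
                "message_id=%s payload=%s",
                message.id,
                message.content,
            )
```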
Using the original queue message, I’m usually able to target my efforts on a precise region of the code base, which reduces the time needed to pinpoint the source and cause of the exception.
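Once the payload is in the log, replaying the failure is as simple as feeding the exact same payload back into the handler, for example from a unit test or under a debugger. The payload below is, of course, made up, and process stands in for the same hypothetical handler as in the sketch above:

```python
import json


def process(raw_message: str) -> None:
    """Same hypothetical handler as in the sketch above."""
    command = json.loads(raw_message)
    ...


def test_replay_failing_command() -> None:
    # Payload copied verbatim from the diagnostics entry logged in production.
    captured_payload = '{"command": "resize-image", "blob": "photos/42.png"}'

    # Re-run the exact input that failed, instead of guessing at the
    # reproduction steps from memory or from a vague bug report.
    process(captured_payload)
```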
In the end, it all comes down to this: the extra information provided through the application’s diagnostics reduces the overall cost of fixing, testing, and redeploying the solution.