Archives For November 30, 1999


Never would have imagined that the laws of physics would be so important in a world where virtualization is the new normal.

Data Locality is Important

Data Locality refers to the ability to move the computation close to the data. This is important because when performance is key, IO quickly becomes our number one bottleneck. Data access times vary from milliseconds to seconds depending on factors like hardware specifications and network capabilities.

Let’s explore Data Locality through the following scenario. I have eight files containing data about multiple trucks, and I need to identify trips. A trip consists of many segments, including short stops. So if the driver stops for coffee and starts again, this is still considered the same trip. The strategy depicted below is to read each file and group the data points by truck. This can be referred to as mapping the data. Then we compute the trips for each group in parallel over multiple threads. This can be referred to as reducing the data. And finally, we merge the results into a single CSV file so that we can easily import it into other systems like SQL Server and Power BI.
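Here is a minimal sketch of that map/reduce strategy. The DataPoint and Trip shapes, the "truckId,timestamp" file layout and the 15-minute stop threshold are all assumptions made for the example, not details from the original post.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public class DataPoint { public string TruckId; public DateTime Timestamp; }
public class Trip { public string TruckId; public DateTime Start; public DateTime End; }

public static class TripAnalysis
{
    public static void Run(IEnumerable<string> filePaths, string outputCsvPath)
    {
        // Map: read every file and group the data points by truck.
        var groups = filePaths
            .SelectMany(File.ReadLines)
            .Select(line => line.Split(','))
            .Select(parts => new DataPoint { TruckId = parts[0], Timestamp = DateTime.Parse(parts[1]) })
            .GroupBy(point => point.TruckId);

        // Reduce: compute the trips of each truck in parallel over multiple threads.
        var trips = groups
            .AsParallel()
            .SelectMany(g => ComputeTrips(g.Key, g.OrderBy(p => p.Timestamp)))
            .ToList();

        // Merge: write a single CSV that can be imported into SQL Server or Power BI.
        var lines = new[] { "TruckId,Start,End" }
            .Concat(trips.Select(t => string.Format("{0},{1:o},{2:o}", t.TruckId, t.Start, t.End)));
        File.WriteAllLines(outputCsvPath, lines);
    }

    static IEnumerable<Trip> ComputeTrips(string truckId, IEnumerable<DataPoint> orderedPoints)
    {
        // A short stop (e.g. a coffee break) stays inside the same trip; only a gap
        // longer than the threshold starts a new one.
        var stopThreshold = TimeSpan.FromMinutes(15);
        Trip current = null;

        foreach (var point in orderedPoints)
        {
            if (current == null || point.Timestamp - current.End > stopThreshold)
            {
                if (current != null) yield return current;
                current = new Trip { TruckId = truckId, Start = point.Timestamp, End = point.Timestamp };
            }
            else
            {
                current.End = point.Timestamp;
            }
        }

        if (current != null) yield return current;
    }
}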

Single Machine

The single machine configuration results were promising, so I decided to break the process apart and distribute it across many Task Virtual Machines (TVMs). Azure Batch is the perfect service to schedule these jobs. Continue Reading…


Has Something Gone Wrong?

Generally, we choose to leverage Read-Access Geo-Redundant Azure Storage Accounts (RA-GRS) because we can use them as part of our disaster recovery (DR) plan. And sometimes, we forget that the devil is in the details. Disaster recovery (DR) plans are rarely tested and can cause headaches when they are. So let’s relieve some of those headaches.

Headache…

“Geo Replication Lag” for GRS and RA-GRS Accounts is the time it takes for data stored in the Primary Region of the storage account to replicate to the Secondary Region of the storage account. Because GRS and RA-GRS Accounts are replicated asynchronously to the Secondary Region, data written to the Primary Region of the storage account will not be immediately available in the Secondary Region. Customers can query the Geo Replication Lag for a storage account, but Microsoft does not provide any guarantees as to the length of any Geo Replication Lag under this SLA.
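The last sync time can be queried with the storage client library. Here is a minimal sketch using the classic Microsoft.WindowsAzure.Storage SDK; the connectionString variable is assumed to point at an RA-GRS account.

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.RetryPolicies;

var account = CloudStorageAccount.Parse(connectionString);
var client = account.CreateCloudBlobClient();

// Service stats (including the geo replication status and last sync time) are served
// by the secondary endpoint, so the request must be allowed to target it.
client.DefaultRequestOptions.LocationMode = LocationMode.SecondaryOnly;

var stats = client.GetServiceStats();

// LastSyncTime is our current Recovery Point: writes made before this time have
// reached the secondary region.
var recoveryPoint = stats.GeoReplication.LastSyncTime;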

The Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are the first items that come up in DR discussions. When we use RA-GRS we control the RTO, because we decide when to start reading from the secondary location. The RPO is a bit different, because it varies with physics and load. The best way to get the current Recovery Point (RP) is to get the last sync time of the RA-GRS account in question. This post is all about getting the right information when we need it, because we need facts to make the right decisions. Continue Reading…


Using StartsWith to Filter on RowKeys

There are many scenarios where filtering on partial RowKeys makes sense. One of these scenarios is Azure Diagnostics log analysis, where events are partitioned by time-based PartitionKeys and by compound RowKeys. This allows us to filter and find information effectively.

Event RowKeys are composed of deployment IDs, role names, instance names, categories and other information:

8637d014bcf94452a1e48f393a11674b___Brisebois.WorkerRole___Brisebois.WorkerRole_IN_0___0000000001652031520___WADLogsLocalQuery

Querying WADLogsTable Effectively

The following example shows how to target a specific table partition and filter events based on a StartsWith pattern.

var storageAccount = Microsoft.WindowsAzure.Storage.CloudStorageAccount.Parse(connectionString);
var client = storageAccount.CreateCloudTableClient();

var table = client.GetTableReference("WADLogsTable");

// Querying Windows Azure Diagnostics by Partition for a partial RowKey
var query = new FindWithinPartitionStartsWithByRowKey("0635204061600000000", "8637d014bcf94452a");
var result = query.Execute(table);
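Table storage has no native StartsWith operator, so a query like this can be expressed as a lexical range over the RowKey. The sketch below illustrates the technique; it is not necessarily the code behind FindWithinPartitionStartsWithByRowKey, and it assumes a non-empty prefix.

using Microsoft.WindowsAzure.Storage.Table;

public static TableQuery<DynamicTableEntity> StartsWithQuery(string partitionKey, string rowKeyPrefix)
{
    // Compute the smallest RowKey that is lexically greater than every key with the prefix.
    var lastIndex = rowKeyPrefix.Length - 1;
    var nextChar = (char)(rowKeyPrefix[lastIndex] + 1);
    var rowKeyUpperBound = rowKeyPrefix.Substring(0, lastIndex) + nextChar;

    var filter = TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey),
        TableOperators.And,
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, rowKeyPrefix),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThan, rowKeyUpperBound)));

    return new TableQuery<DynamicTableEntity>().Where(filter);
}

// Usage: table.ExecuteQuery(StartsWithQuery("0635204061600000000", "8637d014bcf94452a"));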

Continue Reading…


Don’t Frown on CSVs

In Microsoft Azure (Azure), CSV and Avro can help you deal with unpredictable amounts of data.

CSV files are surprisingly compact. They compress really well and allow us to work with datasets that do not fit in RAM. This low-tech solution is often overlooked and frowned upon by developers who don’t get the opportunity to work with very large datasets.

Root cause analysis scenarios have led me to comb through several days’ worth of logs. More often than not, this represents gigabytes of data. By exporting application logs to CSV files, I was able to parse and analyze them with minimal resources.
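Here is a minimal sketch of that low-tech approach. File.ReadLines streams one line at a time, so the file never has to fit in RAM; the "timestamp,level,message" layout is an assumption made for the example.

using System;
using System.IO;
using System.Linq;

// Count error entries in a multi-gigabyte log export without loading it into memory.
var errorCount = File.ReadLines("application-logs.csv")
    .Skip(1)                               // skip the header row
    .Select(line => line.Split(','))
    .Count(fields => fields[1] == "Error");

Console.WriteLine("Errors found: {0}", errorCount);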

With these two options available to us, why should we consider using CSVs? Well, Avro is still fairly new and unsupported by most systems. CSVs can be imported into databases, Azure Table Storage, Hadoop (HDInsight), ERPs… and a slew of other systems with minimal effort. Heck, you can even open CSV files in Microsoft Excel! Continue Reading…


Working with Microsoft Azure Resources

On September 22, 2014, I had the pleasure of speaking to the MSDEVMTL community about working with Microsoft Azure Resources.

Microsoft Azure Resources include Blob Storage, Table Storage, Queue Storage, Service Bus, Virtual Machines, Cloud Services and SQL Database. During my talk, I introduced a couple of tools that allow us to work with these resources. Some tools are built by Microsoft; others are built by companies like Cerebrata, CloudBerry and Zudio. Continue Reading…


#Azure Storage Tables – DateTime.MinValue is not Within the Supported DateTime Range

Azure Table Storage is a NoSQL key/value store that is part of the Microsoft Azure Storage services. It’s really good at absorbing massive amounts of data. It’s really good at massively parallel operations on small amounts of data. But it’s horrible when it comes time to extract large amounts of data in a serial manner. We’ll get to that topic in a future post.

DateTime is a type that is used to represent time in .NET. We use it very liberally without giving it much thought. But when we start playing with Azure Table Storage, we have to think about the supported DateTime range. It’s also important to note that the local Azure Storage Emulator runs on various database flavors like SQL Server and LocalDB. These databases do not impose the same limitations on DateTime values. Therefore, this bug will only show up on Microsoft Azure.

In order to limit headaches, I’m providing the following table to help identify what is supported and what isn’t.

Common Language Runtime type and details:

byte[]: An array of bytes up to 64 KB in size.
bool: A Boolean value.
DateTime: A 64-bit value expressed as Coordinated Universal Time (UTC). The supported DateTime range begins from 12:00 midnight, January 1, 1601 A.D. (C.E.), UTC. The range ends at December 31, 9999.
double: A 64-bit floating point value.
Guid: A 128-bit globally unique identifier.
Int32 or int: A 32-bit integer.
Int64 or long: A 64-bit integer.
String: A UTF-16-encoded value. String values may be up to 64 KB in size.
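As a defensive measure, values can be clamped to the supported minimum before an entity is written. The helper below is a sketch of my own, not part of the storage SDK; the minimum value matches the documented range above.

using System;

public static class TableDateTime
{
    // Table storage supports dates from January 1, 1601 (UTC) onwards.
    public static readonly DateTime Min = new DateTime(1601, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    public static DateTime Clamp(DateTime value)
    {
        var utc = value.ToUniversalTime();
        return utc < Min ? Min : utc;
    }
}

// Usage: entity.LastProcessedOn = TableDateTime.Clamp(DateTime.MinValue); // stores 1601-01-01 instead of failing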

Find out More about Azure Storage


Using Time-based Partition Keys in #Azure Table Storage

In a previous post about storing Azure Storage Table entities in descending order, I combined a time-based key with a GUID in order to create a unique key. This is practical when you need to use combined keys for the Row Key or Partition Key. But it’s not practical for logs.

A better solution for logs is to generate a Partition Key based on time. This allows you to query for logs by time period. There are many ways to generate time-based partitions, so I will cover the two that I use the most. Continue Reading…
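As a rough illustration, here are two time-based Partition Keys derived from the same timestamp. The minute-level rounding and the "d19" formatting are my own choices for the example, not necessarily the two approaches covered in the post.

using System;

// Round the timestamp down to the minute so that all logs from the same minute
// land in the same partition.
var timestamp = DateTime.UtcNow;
var minute = new DateTime(timestamp.Year, timestamp.Month, timestamp.Day,
                          timestamp.Hour, timestamp.Minute, 0, DateTimeKind.Utc);

// 1) Ascending: oldest partitions sort first.
var ascendingKey = minute.Ticks.ToString("d19");

// 2) Descending: subtract from DateTime.MaxValue.Ticks so the newest partitions sort first.
var descendingKey = (DateTime.MaxValue.Ticks - minute.Ticks).ToString("d19");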


Patterns like the Materialized View are essential to Cloud Computing and Scalability. The Patterns & Practices team describes this pattern as generating pre-populated views over data in one or more data stores when the data is formatted in a way that does not favor the required query operations. This pattern can help improve application performance by supporting efficient querying of data.
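As a rough illustration of the pattern, the sketch below keeps a small, query-friendly summary up to date as source data is written. The Order and SalesByProductView shapes are invented for the example and are not taken from the article.

using System.Collections.Generic;

public class Order
{
    public string ProductId;
    public decimal Amount;
}

public class SalesByProductView
{
    // ProductId -> total sales, already shaped for the "sales per product" query.
    public Dictionary<string, decimal> Totals = new Dictionary<string, decimal>();

    // Called whenever an order is written to the source store, so reads never have
    // to aggregate the raw data.
    public void Apply(Order order)
    {
        decimal total;
        Totals.TryGetValue(order.ProductId, out total);
        Totals[order.ProductId] = total + order.Amount;
    }
}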

Continue Reading…

This week I was faced with an odd scenario. I needed to track URIs that are stored in Windows Azure Table Storage. Since I didn’t want to use the actual URIs as row keys, I tried to find a way to create a consistent hash compatible with Windows Azure Table Storage Partition and Row Keys. This is when I came across an answer on Stack Overflow about converting URIs into GUIDs.

The "correct" way (according to RFC 4122 §4.3) is to create a name-based UUID. The advantage of doing this (over just using a MD5 hash) is that these are guaranteed not to collide with non-named-based UUIDs, and have a very (very) small possibility of collision with other name-based UUIDs. [source]

Using the code referenced in this answer, I was able to put together an IdentityProvider whose job is to generate GUIDs based on strings. In my case, I use it to create GUIDs based on URIs.
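For reference, a name-based GUID generator typically looks like the sketch below: hash a namespace GUID and the name with SHA-1, then stamp the version and variant bits. This is an illustration of the RFC 4122 §4.3 technique, not the exact IdentityProvider from the post.

using System;
using System.Security.Cryptography;
using System.Text;

public static class DeterministicGuid
{
    public static Guid Create(Guid namespaceId, string name)
    {
        // Convert the namespace GUID to network byte order before hashing (RFC 4122 §4.3).
        var namespaceBytes = namespaceId.ToByteArray();
        SwapByteOrder(namespaceBytes);

        var nameBytes = Encoding.UTF8.GetBytes(name);

        byte[] hash;
        using (var sha1 = SHA1.Create())
        {
            sha1.TransformBlock(namespaceBytes, 0, namespaceBytes.Length, null, 0);
            sha1.TransformFinalBlock(nameBytes, 0, nameBytes.Length);
            hash = sha1.Hash;
        }

        // Take the first 16 bytes of the hash and stamp the version (5) and variant bits.
        var newGuid = new byte[16];
        Array.Copy(hash, 0, newGuid, 0, 16);
        newGuid[6] = (byte)((newGuid[6] & 0x0F) | (5 << 4));
        newGuid[8] = (byte)((newGuid[8] & 0x3F) | 0x80);

        // Convert back to .NET's little-endian GUID layout.
        SwapByteOrder(newGuid);
        return new Guid(newGuid);
    }

    static void SwapByteOrder(byte[] guid)
    {
        Swap(guid, 0, 3); Swap(guid, 1, 2); Swap(guid, 4, 5); Swap(guid, 6, 7);
    }

    static void Swap(byte[] bytes, int left, int right)
    {
        var temp = bytes[left];
        bytes[left] = bytes[right];
        bytes[right] = temp;
    }
}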

Continue Reading…


I used to create my own logging mechanisms for my Windows Azure Cloud Services. For a while this was the perfect solution for my requirements. But it had a downside: it required cleanup routines and a bit of maintenance.

In recent months I changed my mind about Windows Azure Diagnostics, and if you’re not too adventurous and don’t need your logs available every 30 seconds, I strongly recommend using them. They’ve come such a long way since the first versions that I’m now willing to wait the full minute for my application logs to get persisted to table storage.
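For context, here is a minimal sketch of turning on that behaviour with the classic Windows Azure Diagnostics configuration API: application logs are transferred to table storage once a minute. The role class and log level filter are assumptions made for the example.

using System;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        var config = DiagnosticMonitor.GetDefaultInitialConfiguration();

        // Transfer application logs to table storage (WADLogsTable) once a minute.
        config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1.0);
        config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Information;

        DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);

        return base.OnStart();
    }
}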

The issues I had with Windows Azure Diagnostics were half because of my ignorance and half because of irritating issues that used to exist.

Continue Reading…