Eventual Consistency – a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, all accesses to that item will eventually return the last updated value.
Denormalization – the process of attempting to improve the read performance of a database by adding redundant data or by grouping data.
In earlier blog posts I wrote about Azure SQL Database with regard to normalization, and about our false perception that all our data is relational. Back then I urged my peers to put their databases on a diet and exploit Azure Blob Storage and Azure Table Storage. Recently, I came to new conclusions drawn from the first version of “This Day on #Windows Azure“. This solution relied entirely on Azure Table Storage, and it was doing great until I started to accumulate a moderate amount of data. That’s when I realized that my design was drastically flawed: one of my processes was pulling down the entire contents of a table, and as the table grew, my solution slowed to a crawl.
This is where patterns like the Materialized View come into play and shine. The Patterns & Practices team describes this pattern as generating pre-populated views over data in one or more data stores when that data is formatted in a way that doesn’t favor query operations. This pattern can help improve application performance by supporting efficient querying of data.
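To make the pattern concrete, here is a minimal sketch in Python. The dicts stand in for the source table and the view store (the real solution used Azure Table Storage and Blob Storage), and the row shape is a hypothetical simplification:

```python
import json
from collections import defaultdict

def build_daily_views(source_rows):
    """Group raw rows by day and pre-serialize each group into a single
    query-friendly document: the materialized view."""
    by_day = defaultdict(list)
    for row in source_rows:
        by_day[row["date"]].append(row["text"])
    # Each view is one JSON document, readable in a single request
    # instead of a scan over the source table.
    return {day: json.dumps(tweets) for day, tweets in by_day.items()}

# Hypothetical raw rows as they might land in the source table
rows = [
    {"date": "2014-03-01", "text": "first tweet"},
    {"date": "2014-03-01", "text": "second tweet"},
    {"date": "2014-03-02", "text": "third tweet"},
]
views = build_daily_views(rows)
```

The point is that the expensive grouping work happens once, at write time, rather than on every query.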
In my case, I needed Azure Table Storage to act as a buffer for daily Tweets. This gave my service the scalability required to absorb the massive volumes of data that come from Twitter streams. But because the data was not in the right format for my processes, it took forever to pull it out of Azure Table Storage, and my analysis processes ground to a crawl. This is when I remembered that on previous projects I had used Azure Blob Storage to store data, and I decided to materialize views from the data stored in Azure Table Storage.
Duplicating data isn’t an issue in itself. It requires more discipline and logistics to maintain eventual consistency, but the payoff is far greater than the initial investment. Having data duplicated across multiple data stores allows us to scale with demand. It allows us to partition views and data so that even while we’re updating one section of our application, the rest of the application remains operational. Furthermore, services like Azure Blob Storage allow us to version data. Think about it for a second: versioning is actually quite complicated to implement at the database level and can be quite time-consuming. Using Blobs could save us quite a bit of time…
Imagine updating our solutions with minimal downtime. With Azure Blob Storage this is possible because we can run our services in read-only mode by consuming Blob snapshots. We can then update the blobs, and when the application is deployed, it switches over to the updated blobs or the latest snapshots. If something goes horribly wrong, we can roll back to a previous Blob version.
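The snapshot-then-update-then-maybe-roll-back flow can be sketched with a toy class. This is not the Azure SDK; it is an in-memory model of the behavior Blob snapshots give us:

```python
class VersionedBlob:
    """Toy model of a blob with snapshots: uploads replace the current
    content, snapshots freeze a copy we can serve from or roll back to."""

    def __init__(self, content):
        self.content = content
        self.snapshots = []

    def snapshot(self):
        """Freeze the current content and return a snapshot id."""
        self.snapshots.append(self.content)
        return len(self.snapshots) - 1

    def upload(self, new_content):
        self.content = new_content

    def rollback(self, snapshot_id):
        """Restore a previously frozen version."""
        self.content = self.snapshots[snapshot_id]

blob = VersionedBlob("v1 view data")
sid = blob.snapshot()           # readers keep serving this snapshot
blob.upload("v2 view data")     # deploy the updated view
blob.rollback(sid)              # something went horribly wrong: restore v1
```

In the real service, the snapshot id would be a blob snapshot timestamp, and "serving the snapshot" means pointing read-only consumers at it while the base blob is rewritten.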
By materializing views from the data stored in Azure Table Storage, I was able to reduce the number of network requests required to fetch a day’s worth of Tweets, which led me to utilize my Cloud Service’s resources more efficiently.
Following the realization that I needed materialized views, I started updating my design (note: the original design can be found in a previous post). I used Azure Table Storage for what it’s really good at, namely handling massive amounts of parallel inserts, and queried materialized views instead of the source table. Then I used a slower asynchronous process to pull down a day’s worth of data into Azure Blob Storage, which can be seen as a sort of cache that my service can exploit. The overall effect on my solution was very positive: it went from crumbling under network latency to being snappy and responsive. Azure Table Storage became the master dataset, and Azure Blob Storage provided optimized, pre-processed views for my analysis processes.
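Putting the pieces together, the revised design boils down to two paths: a slow asynchronous step that materializes a day of data into one blob, and a fast read path that fetches that single blob. A sketch, again with dicts as hypothetical stand-ins for the table and the blob container:

```python
import json

# Hypothetical stand-ins: the master dataset (partitioned by day)
# and the blob container holding materialized views.
source_table = {
    "2014-03-01": [{"text": "tweet A"}, {"text": "tweet B"}],
}
blob_container = {}

def materialize_day(day):
    """Slow, asynchronous step: pull one day's rows from the master
    dataset and write them out as a single pre-processed blob."""
    rows = source_table.get(day, [])
    blob_container[f"tweets/{day}.json"] = json.dumps(rows)

def read_day(day):
    """Fast path used by the analysis process: one request, no table scan."""
    return json.loads(blob_container[f"tweets/{day}.json"])

materialize_day("2014-03-01")
day_view = read_day("2014-03-01")
```

The analysis process never touches the source table; it always reads the pre-processed view, so a day’s worth of Tweets costs one request instead of many.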