Recently, I started working on a Windows Azure project that involves working with large datasets. The original design used a normalized database, and the application would pull a few megabytes' worth of data every time it had to accomplish a task. Needless to say, this wasn't the most efficient way to work with all this data.
I first looked into caching, then turned to Windows Azure Table Storage, but nothing seemed to fully satisfy the application's requirements.
Working with Windows Azure SQL Database, I rapidly discovered its throttling behavior, which left me with the challenge of inserting large amounts of data in a short amount of time. This is when I thought about the ADO.NET DataTable.
Recently, the DataTable was added to the accepted SqlCommand parameter types. It is passed as a Table-Valued Parameter, which maps directly to a User-Defined Table Type, greatly simplifying how we can work with sets of data and SQL Server stored procedures.
Be sure that the DataTable column order and types match the User-Defined Table Type exactly.
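Here is a minimal sketch of the pattern. The table type, stored procedure, and column names (dbo.MeasurementTableType, dbo.InsertMeasurements, Id, Value) are placeholders for illustration, not names from my project:

```csharp
// Assumed T-SQL, created once on the database (names are placeholders):
//
//   CREATE TYPE dbo.MeasurementTableType AS TABLE
//       (Id INT NOT NULL, Value DECIMAL(18,4) NOT NULL);
//
//   CREATE PROCEDURE dbo.InsertMeasurements
//       @rows dbo.MeasurementTableType READONLY
//   AS
//       INSERT INTO dbo.Measurements (Id, Value)
//       SELECT Id, Value FROM @rows;

using System.Data;
using System.Data.SqlClient;

public static class BulkInsert
{
    public static void InsertRows(string connectionString, DataTable rows)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("dbo.InsertMeasurements", connection))
        {
            command.CommandType = CommandType.StoredProcedure;

            // The whole DataTable travels as a single Table-Valued Parameter.
            SqlParameter parameter = command.Parameters.AddWithValue("@rows", rows);
            parameter.SqlDbType = SqlDbType.Structured;
            parameter.TypeName = "dbo.MeasurementTableType";

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
```

The DataTable itself is built with its columns in the same order and of the same types as the table type:

```csharp
var rows = new DataTable("Measurements");
rows.Columns.Add("Id", typeof(int));        // must match the table type's column order and types
rows.Columns.Add("Value", typeof(decimal));
rows.Rows.Add(1, 42.5m);
```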
At this point I had a way to insert a few hundred rows of data into SQL Database on Windows Azure without being throttled. Thinking about the DataTable used to insert data into the database, it occurred to me that I could also use the same DataTables as my data source. Because the application regularly queries for the same data for graphs and analysis, I decided to serialize the query results into DataTables and store them in the Windows Azure Blob Service. This effectively removed the load from the SQL Database and allowed for greater scalability.
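A minimal sketch of that round trip, assuming the Windows Azure storage client library (the container and blob names are up to you). Note that WriteXml requires the DataTable to have a TableName:

```csharp
using System.Data;
using System.IO;
using Microsoft.WindowsAzure.Storage.Blob;

public static class DataTableBlobStore
{
    public static void Save(CloudBlobContainer container, string blobName, DataTable table)
    {
        using (var stream = new MemoryStream())
        {
            // Serialize the table, schema included, so it can be rebuilt as-is.
            table.WriteXml(stream, XmlWriteMode.WriteSchema);
            stream.Position = 0;
            container.GetBlockBlobReference(blobName).UploadFromStream(stream);
        }
    }

    public static DataTable Load(CloudBlobContainer container, string blobName)
    {
        using (var stream = new MemoryStream())
        {
            container.GetBlockBlobReference(blobName).DownloadToStream(stream);
            stream.Position = 0;
            var table = new DataTable();
            table.ReadXml(stream);
            return table;
        }
    }
}
```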
The role of the SQL Database suddenly shifted from being the sole source of data for the entire application to being the reporting data source. It is now used only to answer questions and is relieved of tasks like returning large sets of data. Queries against it are also limited to small result sets; whenever a result exceeds the prescribed limit, the user is asked to be more specific about what they are looking for.
Benefits of using the DataTable
– It has been around since the early days of .NET and has been tested thoroughly
– It can be used to repopulate the database
– It has change tracking through the DataRow RowState property
– It is a two-dimensional structure with individually typed columns
– It can be consumed using LINQ (see the sketch below)
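A quick illustration of the last two points, change tracking and LINQ (the latter via AsEnumerable() from System.Data.DataSetExtensions):

```csharp
using System;
using System.Data;
using System.Linq;

class Demo
{
    static void Main()
    {
        var table = new DataTable("Readings");
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Value", typeof(decimal));
        table.Rows.Add(1, 10.5m);
        table.Rows.Add(2, 99.9m);
        table.AcceptChanges();            // all rows are now Unchanged

        table.Rows[1]["Value"] = 100.0m;  // this row's RowState becomes Modified

        // Query the rows with LINQ through AsEnumerable().
        var modifiedIds = table.AsEnumerable()
                               .Where(r => r.RowState == DataRowState.Modified)
                               .Select(r => r.Field<int>("Id"));

        foreach (int id in modifiedIds)
            Console.WriteLine(id);        // prints: 2
    }
}
```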
By moving the sets of data to the Windows Azure Blob Service, the worker role responsible for performing analysis can scale indefinitely without being affected by the limitations imposed by SQL Database; each instance simply downloads the DataTables it needs from the Blob Service. The Blob Service also offers various options that help with security, versioning, scalability, availability, backup, and durability, and it is quite cost effective!
This technique has many benefits, but it also has one important drawback: whenever a DataTable needs to be modified, you need to make sure that concurrent worker roles are not trying to update the same DataTable. Persisting a modified DataTable from two or more worker roles at the same time will result in data loss.
The benefits of working this way greatly outnumber the possible problems. A sound design can limit and even eliminate them, leaving you with a very scalable solution.
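One way to get that sound design is to serialize writers, for example with a blob lease. This is a sketch under the same storage client assumption as above; while one role holds the lease, a competing writer's AcquireLease call fails instead of silently overwriting data:

```csharp
using System;
using System.Data;
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class SafeBlobUpdate
{
    public static void Update(CloudBlockBlob blob, Action<DataTable> modify)
    {
        // Hold an exclusive 30-second lease for the read-modify-write cycle.
        string leaseId = blob.AcquireLease(TimeSpan.FromSeconds(30), null);
        try
        {
            var table = new DataTable();
            using (var stream = new MemoryStream())
            {
                blob.DownloadToStream(stream);
                stream.Position = 0;
                table.ReadXml(stream);
            }

            modify(table);

            using (var stream = new MemoryStream())
            {
                table.WriteXml(stream, XmlWriteMode.WriteSchema);
                stream.Position = 0;
                // The lease id must accompany the write while the lease is active.
                blob.UploadFromStream(stream, AccessCondition.GenerateLeaseCondition(leaseId));
            }
        }
        finally
        {
            blob.ReleaseLease(AccessCondition.GenerateLeaseCondition(leaseId));
        }
    }
}
```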
The DataTable has proven to be extremely effective! It's fast and has many built-in features that I simply would not want to re-implement. The DataTable can also be serialized and compressed to reduce network transfer time; the CPU time spent decompressing the data is usually much shorter than the time required to move large amounts of uncompressed data over the network.
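A sketch of that compression step, using GZipStream from the base class library around the DataTable's own XML serialization:

```csharp
using System.Data;
using System.IO;
using System.IO.Compression;

public static class DataTableCompression
{
    // Serialize the DataTable as XML (schema included) and gzip the result.
    public static byte[] Compress(DataTable table)
    {
        using (var output = new MemoryStream())
        {
            // Leave the MemoryStream open so we can read the bytes afterwards.
            using (var gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                table.WriteXml(gzip, XmlWriteMode.WriteSchema);
            }
            return output.ToArray();
        }
    }

    public static DataTable Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        {
            var table = new DataTable();
            table.ReadXml(gzip);
            return table;
        }
    }
}
```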
I truly recommend playing around with this technique and putting it to the test. I would also greatly appreciate your comments and feedback from your own experiences with it.