Recently, I started working on a Windows Azure project that involves working with large datasets. The original design used a normalized database, and the application would pull a few megabytes of data every time it had to accomplish a task. Needless to say, this wasn’t the most efficient way to go about working with all this data.
I first looked into caching, then turned to Windows Azure Table Storage. But nothing seemed to fully satisfy the application’s requirements.
Working with Windows Azure SQL Database, I rapidly discovered its throttling behavior, which left me with the challenge of inserting large amounts of data in a short amount of time. This is when I thought about the ADO.Net DataTable.
Recently, the DataTable was added to the accepted SQL command parameter types. It is passed as a Table-Valued Parameter, which maps directly to a User-Defined Table Type, greatly simplifying how we can work with sets of data and SQL Server stored procedures.
Be sure that the DataTable’s column order and types match the User-Defined Table Type exactly.
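To make this concrete, here is a minimal sketch of how a DataTable can be passed as a Table-Valued Parameter. The dbo.MeasurementTableType type, the dbo.InsertMeasurements stored procedure and the column layout are made-up names for illustration; only the Structured parameter type and the TypeName property are standard ADO.Net.

    // Assumes a hypothetical User-Defined Table Type and stored procedure along these lines:
    //   CREATE TYPE dbo.MeasurementTableType AS TABLE (Id INT, Value FLOAT, TakenOn DATETIME)
    //   CREATE PROCEDURE dbo.InsertMeasurements @rows dbo.MeasurementTableType READONLY AS ...
    using System.Data;
    using System.Data.SqlClient;

    public static class TvpInsert
    {
        public static void InsertRows(string connectionString, DataTable rows)
        {
            // The DataTable's columns must be declared in the same order
            // and with the same types as the User-Defined Table Type.
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand("dbo.InsertMeasurements", connection))
            {
                command.CommandType = CommandType.StoredProcedure;

                // The whole DataTable is sent as a single Table-Valued Parameter.
                var parameter = command.Parameters.AddWithValue("@rows", rows);
                parameter.SqlDbType = SqlDbType.Structured;
                parameter.TypeName = "dbo.MeasurementTableType";

                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }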
At this point I had a way to insert a few hundred rows of data into SQL Database on Windows Azure without being throttled. Thinking about the DataTable used for inserting data into the database, it occurred to me that I could also use the same DataTables as my data source. Because the application regularly queries for the same data for graphs and for analysis, I decided to serialize and maintain the query results in DataTables and store them in Windows Azure Blob Service. This effectively removed the load from the SQL Database and allowed for greater scalability.
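As a rough sketch of that step, assuming the Windows Azure Storage client library (CloudBlobContainer / CloudBlockBlob) and a blob name of my own choosing, the serialize-compress-upload flow looks something like this:

    using System.Data;
    using System.IO;
    using System.IO.Compression;
    using Microsoft.WindowsAzure.Storage.Blob;

    public static class DataTableBlobStore
    {
        public static void Upload(CloudBlobContainer container, string blobName, DataTable table)
        {
            // WriteXml requires the DataTable to have a name.
            if (string.IsNullOrEmpty(table.TableName))
            {
                table.TableName = "QueryResults";
            }

            using (var buffer = new MemoryStream())
            {
                // Serialize the DataTable (schema included) and compress it before uploading.
                using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
                {
                    table.WriteXml(gzip, XmlWriteMode.WriteSchema);
                }

                buffer.Position = 0;
                CloudBlockBlob blob = container.GetBlockBlobReference(blobName);
                blob.UploadFromStream(buffer);
            }
        }
    }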
The role of the SQL Database suddenly shifted from being the sole source of data for the entire application to being the reporting data source. The SQL Database is now only used to answer questions and is relieved of other tasks like returning large sets of data. Queries to the SQL Database are also limited to small sets of data; whenever the size of the resulting set exceeds the prescribed limit, the user is asked to be more specific about what they are looking for.
Benefits of using the DataTable
– It has been around since the early days of .Net and has been tested thoroughly
– Can be used to repopulate the database
– Has change tracking through the DataRow RowState property
– Is a two-dimensional structure with individually typed columns
– Can be consumed using LINQ (see the sketch after this list)
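For example, with a reference to System.Data.DataSetExtensions a DataTable can be queried like any other sequence; the column names below are hypothetical:

    using System;
    using System.Data;
    using System.Linq;

    public static class DataTableQueries
    {
        // Hypothetical query: average of the "Value" column for rows taken after a given date.
        public static double AverageRecentValue(DataTable table, DateTime since)
        {
            return table.AsEnumerable()
                        .Where(row => row.Field<DateTime>("TakenOn") > since)
                        .Average(row => row.Field<double>("Value"));
        }
    }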
By moving the sets of data to the Windows Azure Blob Service, the worker role responsible for performing the analysis can scale indefinitely without being affected by the limitations imposed by SQL Database. It can download the DataTables it needs from the Blob Service, ensuring optimal performance. The Windows Azure Blob Service also offers various options that help with security, versioning, scalability, availability, backup and durability, and it is quite cost effective!
This technique has many benefits, but it also has one important drawback: whenever a DataTable needs to be modified, you need to make sure that concurrent worker roles are not trying to update the same DataTable. Trying to persist a modified DataTable from two or more worker roles at the same time will result in data loss.
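One way to guard against this, and this is an assumption on my part rather than something the Blob Service does for you automatically, is optimistic concurrency based on the blob’s ETag: remember the ETag when you download the DataTable and only allow the upload to succeed if the blob has not changed since. A minimal sketch with the Windows Azure Storage client library:

    using System.IO;
    using System.Net;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    public static class ConcurrencySafeUpload
    {
        public static bool TryUpload(CloudBlockBlob blob, Stream content, string etagWhenDownloaded)
        {
            try
            {
                // Only overwrite the blob if it still has the ETag we saw when we downloaded it.
                var condition = AccessCondition.GenerateIfMatchCondition(etagWhenDownloaded);
                blob.UploadFromStream(content, condition);
                return true;
            }
            catch (StorageException e)
            {
                if (e.RequestInformation.HttpStatusCode == (int)HttpStatusCode.PreconditionFailed)
                {
                    // Another worker role updated the blob first; reload, reapply the changes and retry.
                    return false;
                }
                throw;
            }
        }
    }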
The benefits of working this way greatly outnumber the possible problems. A sound design can limit and even eliminate them, leaving you with a very scalable solution.
The DataTable has proven to be extremely effective! It’s fast and has many built-in features which I simply would not want to re-implement. The DataTable can be serialized and compressed to reduce network latency; the CPU time spent decompressing the data is usually much shorter than the time required to move large amounts of uncompressed data over the network.
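The read side is symmetric; here is a sketch under the same assumptions as the upload example above:

    using System.Data;
    using System.IO;
    using System.IO.Compression;
    using Microsoft.WindowsAzure.Storage.Blob;

    public static class DataTableBlobReader
    {
        public static DataTable Download(CloudBlockBlob blob)
        {
            using (var buffer = new MemoryStream())
            {
                blob.DownloadToStream(buffer);
                buffer.Position = 0;

                // Decompress and rebuild the DataTable, including its schema.
                var table = new DataTable();
                using (var gzip = new GZipStream(buffer, CompressionMode.Decompress))
                {
                    table.ReadXml(gzip);
                }
                return table;
            }
        }
    }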
I truly recommend playing around with this technique and putting it to the test. I would also greatly appreciate your comments and feedback from your own experiences with it.
Hmm, I would really like to see your conclusion. Could you post the rest?… :)
Yeah, I always hit the wrong button; the rest is coming soon =)
I posted the rest of the post; I would greatly appreciate your feedback.
Best regards,
Alexandre Brisebois
I wouldn’t use ADO for all needs, as ORMs work fine for standard CRUD scenarios in detail and search forms. When you dive into batching and analytics though, you usually need architects and DBAs to figure out the optimal solution. It’s not always a question of technology (as I think you’re proposing here), but rather one of pre-calculation.
Without adding some kind of analytics service (SSAS) to your technology stack, you can keep denormalized information in the database when inserting single rows, and pre-calculate values that would take time later on, such as entity validation aggregates (SUM, MAX, etc.) for batch inserts or reporting. The challenge becomes identifying the hinge points in the performance-challenged SQL queries and replacing them with pre-calculated table checks.
Indeed, when batch-inserting you need to validate the data first for the whole set of records, with a single DB hit to get the data to validate the batch against, and then hit the DB again only once with a TVP-enabled proc call using the correct UDTT type. Standard ADO.Net is enough, as ORMs are not usually built for batching.
I would not recommend validating the data directly in the database, as you cannot reuse the business logic objects used for single-entity validations (in detail forms). You would end up duplicating that logic in both the DB and the app.
You also need to watch out for your data timestamps to keep the data from being corrupted, which MSSQL could handle at batch insert time. You can return a resultset with the content that was not inserted to warn the user of such cases.
Cheers!
I completely agree about ORMs and not using ADO for everything. In my current situation, I needed to deal with the limitations imposed by Windows Azure SQL Database without using Federations. By moving the data to blob storage I was able to pre-calculate all analysis results in a parallel workflow and then insert them into the database using batching. The number of machines executing the analysis is only limited by the amount I’m willing to invest. Having the data source in blob storage allows me to execute the analysis without impacting the database or the users who are querying it. My suggestion of using the DataTable as a data structure is totally about convenience at this point.
I like your mention of timestamps; they play a big role in concurrency management. Returning the records which weren’t inserted is a great idea and I will definitely look into it.
Thanks for your valuable feedback!
Cheers
That’s cool, glad to be of some help!
And happy holidays btw, there’s nothing quite like a Christmas software engineering discussion… :)