Ingesting Massive Amounts of Relational Data With Windows Azure SQL Database (70+ Million records/day)
February 18, 2013 1 Comment
In a previous post I discussed using Table Valued Parameters and Stored Procedures to insert large amounts of data into Windows Azure SQL Database with reasonable throttling by the SQL Database. Not long after that post, I rapidly came to the conclusion that it was a great solution for reasonable amounts of data. Then I wrote about using SqlBulkCopy to overcome the limitations imposed by the stored procedure approach and provided the ReliableBulkWriter.
This article will show how to organize a database to enable the insertion of large amounts of relational data. To illustrate the changes that were required I will be using the following model.
Imagine that you have a requirement to insert/update close to 70 million records/day. The import needs to occur without overly affecting the end users and without downtime. Furthermore, Packages and TrackingEventLogs need to be inserted in parallel.
- The key to ingesting relational data in parallel, is to break the relational constraints between the tables. To accomplish this without affecting the current ORM read model, I replace the current dependent table by a View. In order to facilitate the creation of the View, I denormalize the dependent read table so that I have enough information to rebuild the relationship using a JOIN statement. This JOIN statement is used to define the final View.
This process has a few benefits. First of all, it allows me to push data into both tables at the same time without having to query to find foreign key values for the dependent table records. Records will become visible to the end users when the JOIN conditions is met. The JOIN in this example is performed on the TrackingNumber and the interesting thing to notice here, is that a Package does not need to exist before we insert it’s TrackingEventLogs. In other words, we can have two services, operating independently from each other, who can push data into the Package table or into the TrackingEventLogReadTable table without affecting each other.
This architecture greatly simplifies maintenance. Truncating the read tables and reimporting the data becomes an easy task. Without direct relational constraints data becomes easy to work with.
Better performance can be achieved by having write tables with less indexes than are required for the read tables. The process of getting data into SQL Database using SqlBulkCopy will benefit greatly from this approach. Once the data is in SQL Database you can incrementally merge data into the read tables. Doing so will help maintain an acceptable load on the SQL Database.
In the following diagram, the red arrows show the tables used to create the View and the yellow arrows show how data is merged into read tables.
To Insert/Update data I used the following dataflow
- Split your data into reasonably sized DataTables
- Store the data tables as blobs in Windows Azure Blob Storage Service
- Use SqlBulkCopy to insert data is into write tables
- Once you have reached reasonable a amount of records in your write tables, merge the records into your read tables using reasonably sized batches. Depending on the complexity and indexes/triggers present on the read tables, batches should be of about 100000 to 500000.
- Before merging each batch, be sure to remove duplicates by keeping the most recent records only.
- Once a batch has been merged remove the data from the write table. Keeping this table reasonably small is quite important.
- Once your data has been merged, be sure to check up on your index fragmentation.