Ingesting Massive Amounts of Relational Data With Windows Azure SQL Database (70+ Million records/day)

February 18, 2013 — 1 Comment

In a previous post I discussed using Table Valued Parameters and Stored Procedures to insert large amounts of data into Windows Azure SQL Database with reasonable  throttling by the SQL Database. Not long after that post, I rapidly came to the conclusion that it was a great solution for reasonable amounts of data. Then I wrote about using SqlBulkCopy to overcome the limitations imposed by the stored procedure approach and provided the ReliableBulkWriter.

This article will show how to organize a  database to enable the insertion of large amounts of  relational data. To illustrate the changes that were required I will be using the following model.

2013-02-18_17h47_19 

Imagine that you have a requirement to insert/update close to 70 million records/day. The import needs to occur without overly affecting the end users and without downtime.  Furthermore, Packages and TrackingEventLogs need to be inserted in parallel.

    The key to ingesting relational data in parallel, is to break the relational constraints between the tables. To  accomplish this without affecting the current ORM read model, I replace the current dependent table by a View. In order to facilitate the creation of the View, I denormalize the dependent read table so that I have enough information to rebuild the relationship using a JOIN statement. This JOIN statement is used to define the final View.
    2013-02-18_17h50_30

This process has a few benefits. First of all, it allows me to push data into both tables at the same time without having to query to find foreign key values for the dependent table records. Records will become visible to the end users when the JOIN conditions is met. The JOIN in this example is performed on the TrackingNumber and the interesting thing to notice here, is that a Package does not need to exist before we insert it’s TrackingEventLogs. In other words, we can have two services, operating independently from each other, who can push data into the Package table or into the TrackingEventLogReadTable table without affecting each other.

This architecture greatly simplifies maintenance. Truncating the read tables and reimporting the data becomes an easy task. Without direct relational constraints data becomes easy to work with.

Better performance can be achieved by having write tables with less indexes than are required for the read tables. The process of getting data into SQL Database using SqlBulkCopy will benefit greatly from this approach. Once the data is in SQL Database you can incrementally merge data into the read tables. Doing so will help maintain an acceptable load on the SQL Database.

In the following diagram, the red arrows show the tables used to create the View and the yellow arrows show how data is merged into read tables.

2013-02-18_18h12_52

To Insert/Update data I used the following dataflow

Trackbacks and Pingbacks:

  1. Dew Drop – February 19, 2013 (#1,501) | Alvin Ashcraft's Morning Dew - February 19, 2013

    [...] Ingesting Massive Amounts of Relational Data With Windows Azure SQL Database (70+ Million records/da… (Alexandre Brisebois) [...]

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s