Size Matters!

March 19, 2013 — 1 Comment


Surprisingly, Size Matters in The Cloud

The size of the data you store and process in the cloud really matters! In this post I will use a common scenario to illustrate just how much size matters.

The optimization described in the following scenario augments the Worker Role’s throughput by roughly 12x.


Scenario

Imagine a Worker Role whose responsibility is to produce charts based on data stored in Windows Azure Blob Storage Service. The blobs vary in size from 500 kilobytes to a few megabytes. Charts generated by the Worker Role are rendered and saved in Windows Azure Blob Storage Service as PNG files.

Based on this scenario, developers have built a queue processor that reads instructions from a queue about which chart to render using data stored in a specific blob.

Since the rendering process doesn’t require much RAM or CPU, the developers have configured the processor to read messages from the queue in batches of 32. The messages are processed in parallel. Once all the messages have been processed, it reads 32 more messages from the queue.
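A minimal sketch of that batch loop, in Python for illustration (the real Worker Role would be .NET); `download_blob`, `render_chart`, and `upload_png` are hypothetical stand-ins for the actual storage and rendering calls:

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 32

# Hypothetical stand-ins for the real blob/queue operations.
def download_blob(blob_name):
    return b"chart data for " + blob_name.encode()

def render_chart(data, chart_type):
    return b"PNG bytes"           # the real role would render an image here

def upload_png(chart_name, png):
    pass                          # the real role would write to blob storage

def process_message(message):
    data = download_blob(message["blob_name"])
    png = render_chart(data, message["chart_type"])
    upload_png(message["chart_name"], png)

def run_batches(get_messages):
    # Read up to 32 messages, process them in parallel,
    # then wait for the whole batch before reading more.
    while True:
        batch = get_messages(BATCH_SIZE)
        if not batch:
            break
        with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
            list(pool.map(process_message, batch))
```

Note that the whole batch must finish before the next 32 messages are fetched, which is exactly what makes the transfer time per batch the dominant factor below.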

Let’s Do the Math

Given that an Extra Small virtual machine has a bandwidth of 5 Mbps, let’s take a look at how many queue messages can be processed in 1 hour. The scenario states that the average blob size for data is roughly 3 megabytes and that a generated chart PNG is on average 100 kilobytes.

Available Bandwidth

1 Extra Small virtual machine instance can transfer a total of 2.146 gigabytes of data in one hour. Consequently, the virtual machine is capable of transferring up to 36.62 megabytes per minute.
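These figures can be reproduced if we assume the conversion treats kilobytes as binary (1024 bytes) but megabytes as decimal (1000 KB) — a quick sanity check of that reading:

```python
# 5 Mbps, taken as 5,000,000 bits per second.
bits_per_second = 5_000_000
bytes_per_second = bits_per_second / 8           # 625,000 B/s
kb_per_second = bytes_per_second / 1024          # ~610.35 KB/s (binary KB)
mb_per_minute = kb_per_second * 60 / 1000        # ~36.62 MB/min
gb_per_hour = mb_per_minute * 60 / 1024          # ~2.146 GB/hour
```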

Time to Process a Batch of Messages

A Worker Role is potentially capable of processing a maximum of 704 messages per hour.

32 messages per cycle * 22 cycles = 704 messages/hour

Details

Given that the processor processes batches of 32 messages in parallel and that each data source is on average 3 megabytes in size, the virtual machine downloads 96 megabytes per batch.

Downloading 96 megabytes takes 157.29 seconds, which is roughly 2 minutes 37 seconds.

Each batch would produce 32 new PNG files. The combined size of the charts is roughly 3.125 megabytes. Uploading the charts to Windows Azure Blob Storage Service would take about 5.12 seconds.

Assuming that the CPU time is negligible, each batch of messages would require a minimum of 2.706775 minutes to process. Therefore, the Worker Role would be able to process 22 batches per hour. In other words, a Worker Role is potentially capable of processing 22 cycles times 32 messages resulting in 704 messages per hour.
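The arithmetic behind these numbers, using the 36.62 MB/min transfer rate established above:

```python
MB_PER_MIN = 36.62           # Extra Small instance transfer rate

download_mb = 32 * 3         # 32 blobs at ~3 MB each = 96 MB
upload_mb = 3.125            # 32 PNGs at ~100 KB each

download_s = download_mb / MB_PER_MIN * 60    # ~157.29 s
upload_s = upload_mb / MB_PER_MIN * 60        # ~5.12 s
batch_s = download_s + upload_s               # ~162.41 s (~2.71 min)

batches_per_hour = int(3600 / batch_s)        # 22
messages_per_hour = batches_per_hour * 32     # 704
```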

Increasing Throughput

Perhaps the end user’s needs exceed the previously calculated throughput. Generating 704 charts per hour might not be sufficient. In this case, there are a couple of options available to the developers.

The first and simplest option is Scaling Out: adding a second instance effectively doubles the processor’s throughput. Alternatively, they can Scale Up, which requires the application to be redeployed. Choosing the right virtual machine size can be challenging, and I strongly recommend exploring other options before scaling in any way. Both Scaling Out and Scaling Up incur additional operational costs that affect the profit margins.

Consider the following options before attempting to scale the application:

  • Compress your blobs
  • Use Json (JavaScript Object Notation) with shortened property names
    • human readable/editable
    • can be parsed without knowing schema in advance
    • excellent browser support
    • less verbose than XML
  • Use Protobuf (Protocol Buffers)
    • very dense data (small output)
    • hard to robustly decode without knowing the schema
    • very fast processing
    • not intended for human eyes (dense binary)
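As an illustration of the shortened-property-name trick (the field names here are made up, not from the scenario):

```python
import json

# The same records with descriptive names versus single-character names.
verbose = [{"timestamp": i, "temperature": 20.5, "humidity": 40} for i in range(1000)]
compact = [{"t": i, "p": 20.5, "h": 40} for i in range(1000)]

verbose_size = len(json.dumps(verbose).encode("utf-8"))
compact_size = len(json.dumps(compact).encode("utf-8"))

# Every property name is repeated in every record, so shortening
# the names shrinks the serialized payload noticeably.
print(verbose_size, compact_size)
```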

The following table shows size differences between data formats. All files contain the same data.

Format                                      File Size   Compressed File Size
ADO.NET DataTable                           3.59 MB     168 KB
Json                                        1.70 MB     117 KB
Json with single character property names   1.05 MB     110 KB
Protobuf                                    374 KB      103 KB

The results from this experiment are quite interesting. Notice that the compressed file sizes are quite similar. This may seem quite appealing, but remember that decompressing will consume CPU cycles. In this scenario, decompressing the Protobuf file may take fewer CPU cycles. But keep in mind that, most of the time, decompression takes less time than moving the original file across the network.
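This kind of comparison is easy to reproduce with the standard library (the data here is synthetic, so the exact ratios will differ from the table above):

```python
import gzip
import json

# Synthetic, repetitive records compress very well -- much like
# the tabular chart data in this scenario.
records = [{"timestamp": i, "value": i * 0.5} for i in range(10_000)]
raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

print(len(raw), len(compressed))  # compressed is a fraction of the raw size
```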

By reducing file sizes, you are effectively augmenting the Worker Role’s throughput. Consequently, the CPU will probably become the new bottleneck.

Let’s Do the Math Using Compressed Protobuf

Using the same bandwidth, an Extra Small virtual machine is capable of transferring up to 36.62 megabytes per minute.

A Worker Role is potentially capable of processing a maximum of 8544 messages per hour.

32 messages per cycle * 267 cycles = 8544 messages/hour

Details

Given that the processor processes batches of 32 messages in parallel and that each data source is on average 100 kilobytes in size, the virtual machine downloads 3.125 megabytes per batch.

Downloading 3.125 megabytes takes 5.12 seconds.

To be fair, let’s add 0.1 seconds per file for decompression. For a batch of 32 files, this comes out to 3.2 seconds.

Each batch would produce 32 new PNG files. The combined size of the charts is roughly 3.125 megabytes. Uploading the charts to Windows Azure Blob Storage Service would take about 5.12 seconds.

Assuming that the remaining CPU time is negligible, each batch of messages would require a minimum of 13.44 seconds to process. Therefore, the Worker Role would be able to process 267 batches per hour. In other words, a Worker Role is potentially capable of processing 267 cycles times 32 messages, resulting in 8544 messages per hour.
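The same arithmetic, this time with the compressed Protobuf sizes and the decompression allowance:

```python
MB_PER_MIN = 36.62           # Extra Small instance transfer rate

download_mb = 3.125          # 32 blobs at ~100 KB each
upload_mb = 3.125            # 32 PNGs at ~100 KB each
decompress_s = 32 * 0.1      # 0.1 s of decompression per file

download_s = download_mb / MB_PER_MIN * 60       # ~5.12 s
upload_s = upload_mb / MB_PER_MIN * 60           # ~5.12 s
batch_s = download_s + decompress_s + upload_s   # ~13.44 s

batches_per_hour = int(3600 / batch_s)           # 267
messages_per_hour = batches_per_hour * 32        # 8544
```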

Why Does Size Matter?

The first question to be asked is ‘the size of what?’ Well, to be honest, the size of your virtual machine, the size of your bandwidth, the size of your data, the size of your CPU and the size of your RAM all matter! The goal is to strike the right balance between your budget, your target performance and the utilization of the available resources.

In this scenario, I demonstrated how changing the size of data stored in blobs affected the overall performance of my application. Reducing the size of the data being consumed by the processor augmented its throughput by close to 12 times. Furthermore, it reduced the costs related to storing the data in Windows Azure Blob Storage Service. If the blobs are served to clients outside the cloud, reducing the size of each blob will also reduce your monthly bandwidth costs.

To reach the same throughput by Scaling Out, 11 additional instances would have been required, bringing the total to 172.80$ USD/Month! The current scenario costs 14.40$ USD/Month.

Augmenting throughput by Scaling Up to a Small virtual machine would have cost 86.40$ USD/Month and would potentially yield a throughput of about 14,648 messages/hour.

Cutting costs on Windows Azure isn’t always about compromise; it’s about organization and using the available resources to their maximum.

Trackbacks and Pingbacks:

  1. Reading Notes 2013-03-25 | Matricis - March 25, 2013

    […] Size Matters! – Of course size matters, but the size of what? Excellent post that explains how to maximize the use of your available resources. […]

