Archives For Data


Never would have imagined that the laws of physics would be so important in a world where virtualization is the new normal.

Data Locality is Important

Data Locality, refers to the ability to move the computation close to the data. This is important because when performance is key, IO quickly becomes our number one bottleneck. Data access times vary from milliseconds to seconds because of many factors like hardware specifications and network capabilities.

Let’s explore Data Locality through the following Scenario. I have eight files containing data about multiple trucks, and I need to Identify trips. A trip consists of many segments, including short stops. So if the driver stops for coffee and starts again, this is still considered the same trip. The strategy depicted below is to read each file and to group data points by truck. This can be referred to as mapping the data. Then we can compute the trips for each group in parallel over multiple threads. This can be referred to as reducing the data. And finally, we merge the results in a single CSV file so that we can easily import it to other systems like SQL Server and Power BI.

Single Machine

The single machine configuration results were promising. So I decided to break it apart and distribute the process across many task Virtual Machines (TVM). Azure Batch is the perfect service to schedule jobs. Continue Reading…


The Challenge

As developers, we are up against odds that push us to make trade-offs in order to go into production on time. More often than not, it’s a race where security becomes an afterthought.

Securing Azure SQL Databases

Security mechanisms come in many flavors. It is a requirement that needs to be defined and implemented on day 1. These rituals (policies and practices), must become natural in your application life cycle management. Consider these as a starting point from which you can develop your own security practices.

  • Do not use the default user for development, testing or for deployments
    • Create a user specifically for deployments (can perform schema alterations)
    • Create a user on a per application basis (cannot alter schema and has limited write access)
    • Create a user for support investigations (this should be read-only)
    • Create individual accounts for members of DevOps who will need to act upon the database. (these accounts should have limited write access)
  • Reference data should be read-only (immutable versions) and should only be updated through deployments. This type of data can be stored in NoSQL data services to augment the overall scalability of your application.
  • Enable Auditing for Azure SQL Database, this feature will give you deep insight in how the database is manipulated and about how it is used.
  • Use SQL Database Projects to design, build, version and deploy
  • Use schemas to segregate tenants, reference data, activity data and resouce (shared) data.
  • Use schemas to keep track of ownership chaining
  • Encrypt connection string passwords at rest
  • Use strong passwords
  • Set Trusted_Connection=False in the connection string. This forces server certificate validation
  • Set Encrypt=True in the connection string to force the client to use SSL
  • Ensure that you are covered against SQL Injection
  • SQL Database Firewall rules should block everything except the consuming applications

Continue Reading…


Don’t Frown on CSVs

In Microsoft Azure (Azure) CSV and Avro can help you deal with unpredictable amounts of data.

CSV files are surprisingly compact. They compresses really well and allows us to work with datasets that do not fit in RAM. This low-tech solution is often overlooked and frowned upon by developers who don’t get the opportunity to work with very large datasets.

Root cause analysis scenarios have led me to comb through several days’ worth of logs. More often than not, this represents gigabytes worth of data. Exporting application logs to a CSV files, I was able to parse and analyze them with minimal resources.

With these two options available to us, why should we consider using the CSVs? Well, Avro is still fairly new and unsupported by most systems. CSVs can be imported into Databases, Azure Table Storage, Hadoop (HDInsight), ERPs… And a slew of other systems with minimal effort. Heck, you can even open CSV files in Microsoft Excel! Continue Reading…