Don’t Frown on CSVs

In Microsoft Azure (Azure), the CSV and Avro formats can help you deal with unpredictable amounts of data.

CSV files are surprisingly compact. They compress really well and allow us to work with datasets that do not fit in RAM. This low-tech solution is often overlooked and frowned upon by developers who don’t get the opportunity to work with very large datasets.

Root cause analysis scenarios have led me to comb through several days’ worth of logs. More often than not, this represents gigabytes worth of data. By exporting application logs to CSV files, I was able to parse and analyze them with minimal resources.
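
As a rough illustration of that kind of low-footprint processing, the sketch below streams a gzip-compressed CSV log line by line, so memory usage stays flat regardless of the archive’s size. The file layout and the error filter are hypothetical, not part of the original scenario.

using System;
using System.IO;
using System.IO.Compression;

public static class LogScanner
{
    // Streams a compressed CSV log one line at a time,
    // keeping only a single line in memory at any point.
    public static int CountErrorLines(string gzipCsvPath)
    {
        var errorCount = 0;
        using (var file = File.OpenRead(gzipCsvPath))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Hypothetical filter: count lines whose level column says "Error".
                if (line.Contains(",Error,"))
                    errorCount++;
            }
        }
        return errorCount;
    }
}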

With these two options available to us, why should we consider using CSVs? Well, Avro is still fairly new and unsupported by most systems. CSVs can be imported into databases, Azure Table Storage, Hadoop (HDInsight), ERPs, and a slew of other systems with minimal effort. Heck, you can even open CSV files in Microsoft Excel!

CSVs are used for electronic data interchange (EDI) between companies. Of course, this requires consumers to know the columns and their intended data types. XML was a great idea: it brought data, structure and semantics together. Although the idea was good, it resulted in huge files.

Then came Avro. Avro is interesting because, like XML, it packs structure and semantics with the data. It’s a very compact serialization format, and even though it’s fairly new, it is natively supported by HDInsight.

So how do you choose between CSV and Avro? It all depends on intent. How will the files be consumed? Are the files destined for Big Data workloads, or are they used to share datasets with partner companies? Who will consume the dataset?

Notice that I haven’t mentioned JSON. That’s because JSON isn’t as compact as CSV or Avro. I use JSON and XML for service responses, object graphs and documents. I use CSV and Avro for archives, batching and cold storage.

Playing Around

Using CsvHelper, I wrote the same object to a CSV file 1 million times. I’m using a Solid State Drive (SSD), and my results show that the same operation using Avro isn’t much slower.

For both tests I used the following entity:

[DataContract]
public class Data
{
    [DataMember]
    public int Id { get; set; }
    [DataMember]
    public string Name { get; set; }
    [DataMember]
    public string Email { get; set; }
    [DataMember]
    public double Lat { get; set; }
    [DataMember]
    public double Long { get; set; }
    [DataMember]
    public string Description { get; set; }
}

And the following instance:

static readonly Data DataInstance = new Data
{
    Id = 1000,
    Description = "Established in 1985, Microsoft Canada Inc. " +
                  "is the Canadian subsidiary of Microsoft Corp." +
                  " (Nasdaq \"MSFT\"), the worldwide leader in software," +
                  " services and solutions that help people and" +
                  " businesses realize their full potential.",
    Email = "example@microsoft.com",
    Lat = 45.502399d,
    Long = -73.573653d,
    Name = "Microsoft, Montreal Canada"
};

To write the entities using CsvHelper, I used the following mapping:

public class DataCsvMap : CsvClassMap<Data>
{
    public DataCsvMap()
    {
        Map(data => data.Id).Index(0)
                            .TypeConverterOption(NumberStyles.Integer);
        Map(data => data.Name).Index(1)
                              .TypeConverterOption(CultureInfo.InstalledUICulture);
        Map(data => data.Description).Index(2)
                                     .TypeConverterOption(CultureInfo.InstalledUICulture);
        Map(data => data.Lat).Index(3)
                             .TypeConverterOption(NumberStyles.AllowDecimalPoint);
        Map(data => data.Long).Index(4)
                              .TypeConverterOption(NumberStyles.AllowDecimalPoint);
        Map(data => data.Email).Index(5)
                               .TypeConverterOption(CultureInfo.InstalledUICulture);
    }
}

I then created an empty file and executed the following to fill it:

private static void WriteCsv(string filePath)
{
    using (var streamWriter = new StreamWriter(File.Create(filePath)))
    using (var writer = new CsvWriter(new CsvSerializer(streamWriter)))
    {
        writer.Configuration.HasHeaderRecord = true;
        writer.Configuration.RegisterClassMap(new DataCsvMap());

        writer.WriteHeader<Data>();
        for (var i = 0; i < 1000000; i++)
            writer.WriteRecord(DataInstance);
    }
}

The next part of my little experiment was reading all the entities sequentially from the file:

private static void ReadCsv(string filePath)
{
    using (var file = File.Open(filePath, FileMode.Open, FileAccess.Read))
    using (var reader = new CsvReader(new StreamReader(file)))
    {
        while (reader.Read())
        {
            var data = reader.GetRecord<Data>();
        }
    }
}

My results were surprising!

Writing 1 000 000 records to: C:\archive.csv

Archive Size 292 MB

CSV Write Completed in 192 milliseconds

CSV Read Completed in < 1 milliseconds

Then I used the Microsoft .NET Library for Avro to conduct the same experiment.

Using the following code and the entity instance from the CSV experiment, I created and filled a new archive file:

private static void WriteAvro(string filePath)
{
    using (var file = File.Open(filePath, FileMode.Create, FileAccess.Write))
    {
        var container = AvroContainer.CreateWriter<Data>(file, Codec.Null);
        for (var i = 0; i < 1000000; i++)
        {
            var avroWriterBlock = container.CreateBlock();
            avroWriterBlock.Write(DataInstance);
            container.WriteBlock(avroWriterBlock);
        }
        container.Close();
        file.Flush();
    }
}

Then I read all the entities sequentially from the file in order to observe possible latency.

private static void ReadAvro(string filePath)
{
    using (var file = File.Open(filePath, FileMode.Open, FileAccess.Read))
    {
        var container = AvroContainer.CreateReader<Data>(file);
        while (container.MoveNext())
        {
            var data = container.Current.Objects;
        }
    }
}

Once again my results caught me by surprise. I had expected Avro to take more time to complete the serialization.

Writing 1 000 000 records to: C:\archive.avro

Archive Size 297 MB

Avro Write Completed in 211 milliseconds

Avro Read Completed in < 1 milliseconds
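
The post doesn’t show the measurement harness itself; something like the following, dropped into the same class as the methods above, is enough to reproduce the experiment. The file paths are placeholders, and a Release build run outside the debugger gives more representative numbers.

private static void RunExperiment()
{
    const string csvPath = @"C:\archive.csv";
    const string avroPath = @"C:\archive.avro";

    // Time each of the four operations with a simple Stopwatch.
    var watch = System.Diagnostics.Stopwatch.StartNew();
    WriteCsv(csvPath);
    Console.WriteLine("CSV Write Completed in {0} milliseconds", watch.ElapsedMilliseconds);

    watch.Restart();
    ReadCsv(csvPath);
    Console.WriteLine("CSV Read Completed in {0} milliseconds", watch.ElapsedMilliseconds);

    watch.Restart();
    WriteAvro(avroPath);
    Console.WriteLine("Avro Write Completed in {0} milliseconds", watch.ElapsedMilliseconds);

    watch.Restart();
    ReadAvro(avroPath);
    Console.WriteLine("Avro Read Completed in {0} milliseconds", watch.ElapsedMilliseconds);
}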

Although Avro performed admirably, I believe the two formats address different scenarios. CSVs are a wonderful low-tech solution to a common challenge. Avro is emerging and still limited in terms of adoption. Choosing the right format comes down to how you plan to consume the resulting dataset.

CSVs on Azure

When you develop Cloud Services on Azure, CSVs can help keep operational costs under control by reducing the resources required to process large datasets. Imagine a service whose purpose is to parse logs in order to extract telemetry. There are two approaches to this challenge.

The first is to load the dataset in memory and extract metrics. This can require a substantial amount of memory. Luckily, Microsoft announced the G-series (Godzilla) Virtual Machines, which will provide more memory and more local Solid State Drive (SSD) storage. The largest Godzilla VM will offer 448 GB of RAM and 6.5 TB of local SSD storage. But think of the price tag that comes with these reserved resources…

As developers, we often forget about the Virtual Machine’s local storage (temporary storage). This is where the second approach comes into play. Using CSVs, we can stream large amounts of loosely structured data from Azure Table Storage to local storage, then consume the dataset line by line with a very small memory footprint. When you take a step back and think about it, this approach resembles MapReduce (a sketch of the streaming step follows below). And because of this small memory footprint, we can get away with a small A1 (1 CPU, 1.75 GB RAM, 224 GB HDD) VM for about $63.27/mo. If the analysis requires more IOPS, we can use a D1 (1 CPU, 3.5 GB RAM, 50 GB SSD) VM for about $133.93/mo.
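
To make that streaming step concrete, here is a minimal sketch built on the WindowsAzure.Storage table client: it pages through a table one segment at a time, appends each entity as a line to a CSV file on the VM’s temporary drive, and never holds more than a single segment in memory. The table name, connection string and the “Message” column are assumptions for illustration, not part of the original experiment.

using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public static class TableToCsv
{
    // Streams a table to a CSV file on the VM's local (temporary) disk,
    // one segment at a time, to keep the memory footprint small.
    public static string ExportLogs(string connectionString)
    {
        var table = CloudStorageAccount.Parse(connectionString)
                                       .CreateCloudTableClient()
                                       .GetTableReference("logs"); // hypothetical table name

        var csvPath = Path.Combine(Path.GetTempPath(), "logs.csv");
        using (var writer = new StreamWriter(csvPath))
        {
            var query = new TableQuery<DynamicTableEntity>();
            TableContinuationToken token = null;
            do
            {
                var segment = table.ExecuteQuerySegmented(query, token);
                token = segment.ContinuationToken;
                foreach (var entity in segment.Results)
                {
                    // Hypothetical columns: keys plus a "Message" property.
                    writer.WriteLine("{0},{1},{2}",
                        entity.PartitionKey,
                        entity.RowKey,
                        entity.Properties["Message"].StringValue);
                }
            } while (token != null);
        }
        return csvPath;
    }
}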

As developers, we must strive to use resources responsibly. Finding ways to reduce the resources necessary to support our applications means that we can take on more load without adding more resources. This is when things get interesting for everyone: by handling more load with the same amount of resources, you are effectively raising profit margins and creating new business opportunities.
