Scaling an FTP Ingestion Service
Making an FTP Ingestion Service Highly Available (HA) can be a challenge. On Azure, we can take advantage of the Microsoft Azure Traffic Manager to direct users to the closest FTP Server. In this specific scenario, we assume that all FTP Servers are configured the same way and that users only have write access. When a document is uploaded over FTP, it is moved to a Microsoft Azure Storage Account that is used as persistent storage.
Note: Ingress Bandwidth is free on Azure. Sending a document from one Data Center to another is considered Egress Bandwidth from the origin. So be sure to take this into consideration when you estimate operational costs for this kind of scenario.
The Communication Flow
- Using the CNAME, the FTP Client contacts Microsoft Azure Traffic Manager and is provided with the IP of a healthy Endpoint (see the name-resolution sketch after this list)
- The FTP Client connects directly with the healthy Endpoint
- The FTP Client uploads a document
- The application moves the document to Microsoft Azure Storage
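To make the DNS-based routing concrete, here is a hedged sketch of what the name resolution looks like from a client machine. The names ftp.example.com and ftp-ingest.trafficmanager.net are hypothetical placeholders, and Resolve-DnsName requires Windows 8 / Windows Server 2012 or later.

# ftp.example.com is a CNAME pointing at the Traffic Manager domain (hypothetical names).
Resolve-DnsName ftp.example.com

# The answer chain resolves the CNAME through ftp-ingest.trafficmanager.net down to the
# cloudapp.net address of the closest healthy deployment, which the FTP Client then contacts.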
The Setup
To implement this scenario, we must start by creating four Virtual Machines. Then we configure each FTP Server to use Passive Mode. On Azure, we need to activate Stateful FTP Filtering on the firewall of each Virtual Machine so that traffic is allowed to reach the FTP Server.
The following is a command line that can be used to enable Stateful FTP Filtering.
netsh advfirewall set global StatefulFtp enable
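Depending on how the Virtual Machine's firewall is configured, the FTP Service itself may also need an allow rule. Assuming the built-in IIS FTP Service (ftpsvc) is used, a command along these lines should work:

netsh advfirewall firewall add rule name="FTP Service" action=allow service=ftpsvc protocol=TCP dir=in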
Configure FTP Server users on each Virtual Machine with write-only permissions. This ensures that communication is one-way and that transferred documents remain inaccessible from the outside. Furthermore, because the users are the same on all FTP Servers, clients can be load-balanced without any side effects.
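As a sketch, assuming the FTP Server is the IIS FTP Server, the site is named "FTP Ingest" and the account is a local user named ftpuser (all placeholder names), a write-only authorization rule could be scripted with the WebAdministration module:

Import-Module WebAdministration

# Placeholder site and user names; grant Write but not Read so uploads remain one-way.
Add-WebConfiguration "/system.ftpServer/security/authorization" -PSPath "IIS:\" -Location "FTP Ingest" `
    -Value @{ accessType = "Allow"; users = "ftpuser"; roles = ""; permissions = "Write" }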
Create the Microsoft Azure Storage Account that is used to collect the documents. Be sure to create the Storage Account in the Data Center that hosts the application that processes the transferred documents. Before we exit our Azure Management Environment (web portal or PowerShell), let’s make sure that we have created Input Endpoints for ports 80 and 21. These ports will be used by IIS and by the FTP Server.
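For reference, here is a hedged sketch of these steps with the classic Azure PowerShell cmdlets. The account, service and Virtual Machine names below are placeholders.

# Create the Storage Account in the Data Center that hosts the processing application.
New-AzureStorageAccount -StorageAccountName "ftpingestdocs" -Location "West Europe"

# Open Input Endpoints for IIS (80) and the FTP Server (21) on each Virtual Machine.
Get-AzureVM -ServiceName "ftp-ingest-we" -Name "ftp-vm-1" |
    Add-AzureEndpoint -Name "HTTP" -Protocol tcp -LocalPort 80 -PublicPort 80 |
    Update-AzureVM

Get-AzureVM -ServiceName "ftp-ingest-we" -Name "ftp-vm-1" |
    Add-AzureEndpoint -Name "FTP" -Protocol tcp -LocalPort 21 -PublicPort 21 |
    Update-AzureVM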
Now that we have a Storage Account, we can deploy the application that moves documents to Microsoft Azure Storage and configure it to start automatically.
Configure Internet Information Services (IIS) on each Virtual Machine. This service hosts a static HTML page that is used by the Microsoft Azure Traffic Manager health probes. This enables the Traffic Manager to identify degraded nodes and to redirect traffic to healthy nodes.
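The probe page can be as simple as a static file dropped in the site's root. A minimal sketch, assuming the default IIS web root and a page named health.htm (both placeholders):

# Create a static page for the Traffic Manager health probe.
New-Item -Path "C:\inetpub\wwwroot\health.htm" -ItemType File -Value "OK" -Force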
Create a Microsoft Azure Traffic Manager profile and add an Endpoint for each machine. Provide FTP Clients with the Traffic Manager URL.
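As a hedged sketch using the classic Azure PowerShell cmdlets (the profile name, domain names and probe path are placeholders that match the earlier examples):

# Create a Performance profile that probes the IIS health page over port 80.
$profile = New-AzureTrafficManagerProfile -Name "ftp-ingest" `
    -DomainName "ftp-ingest.trafficmanager.net" -LoadBalancingMethod Performance `
    -Ttl 30 -MonitorProtocol Http -MonitorPort 80 -MonitorRelativePath "/health.htm"

# Add each Virtual Machine's Cloud Service as an Endpoint, then save the profile.
$profile = Add-AzureTrafficManagerEndpoint -TrafficManagerProfile $profile `
    -DomainName "ftp-ingest-we.cloudapp.net" -Type CloudService -Status Enabled
$profile | Set-AzureTrafficManagerProfile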
More on Azure Traffic Manager
Azure Traffic Manager gives us a lot of flexibility. Because Traffic Manager Profiles can be nested up to 10 levels deep, we can compose complex routing strategies. We can add and remove Endpoints from Traffic Manager, and we can use weighted distribution of network traffic for scenarios like Testing in Production (A/B Testing).
In the following scenario, the first Profile is used to find the closest deployment to the connected client. Then, using a Round Robin Profile, we distribute FTP Clients evenly over the Virtual Machines available in the selected region. Earlier I spoke of flexibility; in this scenario, it gives me the possibility of adding more FTP Server instances in some regions in order to scale with demand. Adding and removing FTP Server instances can be achieved without downtime. Find out more by visiting the Traffic Manager Overview.
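A hedged sketch of this nesting with the classic cmdlets (all names are placeholders): a Round Robin child profile per region is referenced as a TrafficManager Endpoint of the Performance parent.

# Child profile for one region: distribute FTP Clients evenly over its Virtual Machines.
$we = New-AzureTrafficManagerProfile -Name "ftp-we" -DomainName "ftp-we.trafficmanager.net" `
    -LoadBalancingMethod RoundRobin -Ttl 30 -MonitorProtocol Http -MonitorPort 80 -MonitorRelativePath "/health.htm"
$we = Add-AzureTrafficManagerEndpoint -TrafficManagerProfile $we `
    -DomainName "ftp-we-vm1.cloudapp.net" -Type CloudService -Status Enabled
$we = Add-AzureTrafficManagerEndpoint -TrafficManagerProfile $we `
    -DomainName "ftp-we-vm2.cloudapp.net" -Type CloudService -Status Enabled
$we | Set-AzureTrafficManagerProfile

# Parent profile: route each client to the closest region, then hand off to that region's child profile.
$parent = New-AzureTrafficManagerProfile -Name "ftp-global" -DomainName "ftp-global.trafficmanager.net" `
    -LoadBalancingMethod Performance -Ttl 30 -MonitorProtocol Http -MonitorPort 80 -MonitorRelativePath "/health.htm"
$parent = Add-AzureTrafficManagerEndpoint -TrafficManagerProfile $parent `
    -DomainName "ftp-we.trafficmanager.net" -Type TrafficManager -Status Enabled
$parent | Set-AzureTrafficManagerProfile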

Moving Files to Azure Storage
The missing piece in this scenario is a program that moves documents from each Virtual Machine to Microsoft Azure Storage. The following sample uses TPL Dataflow (Task Parallel Library) to control the maximum degree of parallelism (the number of documents that can be uploaded in parallel).
The console application starts by creating a Blob Container. Then it queues move tasks for each document present in the monitored directory. The ActionBlock takes care of processing each task to completion. Once all documents are moved, the application waits for a few seconds and starts over.
using System;
using System.Configuration;
using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

internal class Program
{
    private static void Main(string[] args)
    {
        // Read the maximum number of concurrent uploads from configuration.
        var mdop = ConfigurationManager.AppSettings["maxDegreeOfParallelism"];
        var maxDegreeOfParallelism = Convert.ToInt32(mdop);

        // Connect to the Storage Account and create the Blob Container if it does not exist.
        var cs = ConfigurationManager.AppSettings["connectionString"];
        var account = CloudStorageAccount.Parse(cs);
        var containerName = ConfigurationManager.AppSettings["container"];
        var c = account.CreateCloudBlobClient()
                       .GetContainerReference(containerName);
        c.CreateIfNotExists();

        while (true)
        {
            // The ActionBlock limits how many documents are uploaded in parallel.
            var block = new ActionBlock<FileInfo>(
                info => TryMoveFile(account, containerName, info),
                new ExecutionDataflowBlockOptions
                {
                    MaxDegreeOfParallelism = maxDegreeOfParallelism
                });

            // Queue a move task for each document found in the monitored directory.
            var path = ConfigurationManager.AppSettings["directoryPath"];
            var directory = new DirectoryInfo(path);
            var fileInfos = directory.GetFiles();
            foreach (var info in fileInfos)
                block.Post(info);

            // Wait for all queued documents to be processed, then pause and start over.
            block.Complete();
            block.Completion.Wait();
            Task.Delay(TimeSpan.FromSeconds(5)).Wait();
        }
    }

    private static void TryMoveFile(CloudStorageAccount account, string containerName, FileInfo info)
    {
        try
        {
            MoveFile(account, containerName, info);
        }
        catch (Exception e)
        {
            // A failed upload is logged and retried on the next pass.
            Console.WriteLine(e.ToString());
        }
    }

    private static void MoveFile(CloudStorageAccount account, string containerName, FileInfo info)
    {
        var client = account.CreateCloudBlobClient();
        var container = client.GetContainerReference(containerName);
        var fileName = Path.GetFileName(info.FullName);
        var blobReference = container.GetBlockBlobReference(fileName);

        // Upload the document to Blob Storage, then remove the local copy.
        using (var fileStream = info.Open(FileMode.Open))
        {
            Console.WriteLine("Moving " + info.FullName);
            blobReference.UploadFromStream(fileStream);
        }
        info.Delete();
        Console.WriteLine("Moved " + info.FullName);
    }
}
The application’s configuration settings are listed below. Let’s take a moment to review them.
- The directory path is the folder where documents are placed once they are received by the FTP Server.
- The maximum degree of parallelism is the number of concurrent uploads. Be sure to keep this number reasonable. If the number is too high, it will have a negative impact on the overall solution.
- The connection string is the Microsoft Azure Storage connection string.
- The container is the place where document blobs are stored.
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <startup>
    <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.5" />
  </startup>
  <appSettings>
    <add key="directoryPath" value="C:\ftp-files"/>
    <add key="maxDegreeOfParallelism" value="4"/>
    <add key="connectionString" value="[connection string]"/>
    <add key="container" value="ftp"/>
  </appSettings>
</configuration>
In order to start this application, we are presented with a few options. First we can configure it to execute when the Virtual Machine starts up. Secondly we can convert this to a Windows Service and install it on the Virtual Machine. And finally, we can start the application through Azure Automation and PowerShell. Which of these options should be implemented? I leave this choice to you.
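For the first option, a hedged sketch using the ScheduledTasks cmdlets (Windows Server 2012 or later; the executable path and task name are placeholders) registers a task that runs the mover at startup:

# Run the document mover as SYSTEM every time the Virtual Machine starts.
$action  = New-ScheduledTaskAction -Execute "C:\FtpMover\FtpMover.exe"
$trigger = New-ScheduledTaskTrigger -AtStartup
Register-ScheduledTask -TaskName "FtpMover" -Action $action -Trigger $trigger -User "SYSTEM" -RunLevel Highest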
Summary
The solution described in this post covers the ingestion of documents. It does not cover what you do with the documents once they are persisted in a Microsoft Azure Storage Account; I leave that to your creativity and to future adventures. The scenario described above uses Microsoft Azure Traffic Manager to scale out an FTP Ingestion Service. The FTP Servers are configured the same way across all instances, and users only have write access. When a document is uploaded over FTP, it is moved to a Microsoft Azure Storage Account that is used as persistent storage.
This is a common scenario when we exchange information between companies. The FTP protocol is still quite relevant in 2015. It’s a trusted technology that many understand fairly well. Most legacy systems integrate with it, and it can be deployed rather quickly.
Scaling out FTP Servers works well when clients are not reading from the FTP Server. It’s especially useful when you need to receive a large quantity of files from your partners.
Remember, Microsoft Azure Traffic Manager is a DNS-based solution. That means it works with TCP traffic as well as HTTP traffic. Feel free to test out your scenarios and to share your experiences using the comments section below.
Hi there. Really great post! I find that there’s lots of information on geo replicating web apps, VMs etc. but very little on Azure storage, SQL etc.
Given that the file uploaded via FTP still has to make the trip to the central storage account, did you find much improvement in speed compared to just uploading directly to storage from the various traffic-managed VMs? My feeling is that this would be a costly solution (even for small VMs) and that there would not be much of a speed improvement in getting the file to its final destination.
Hi, great question. The post was written with the FTP protocol in mind. I absolutely agree that working directly with Azure storage is best for reasonably sized documents. If we are talking about large volumes of data, then it may be time to look at more specialized services and solutions. Working with this FTP scenario, having an FTP server close to the uploader was key because of latency. The secondary data movement was not as important in terms of speed. Bringing data into Azure was the key to success even if the primary storage account was inaccessible. Managing application-level replication is also possible, where data could be replicated in each region instead of having a central copy. Cost is definitely something to have on the radar and will impact your final architectural choices.
Also, what kind of criteria have you set for scaling?
At the time of writing this post, I had not done any work around scalability. I was mostly relying on the fact that I had instances in multiple geolocations. This would definitely be something to explore in the future.
Azure now has an FTP VM in the marketplace that fully sets up FileZilla FTP Server and configures the required ports to allow external access to your public IP: https://azuremarketplace.microsoft.com/en-gb/marketplace/apps/cloud-infrastructure-services.filezilla-ftp-server
Setup instructions: https://cloudinfrastructureservices.co.uk/install-filezilla-secure-ftp-server-on-azure-server-2016/