Reducing Azure cost at Intergreatme

August 16, 2023 - Reading time: 5 minutes

At the beginning of this year I gave myself three core objectives at Intergreatme:

  1. Focus on building products beyond our Enterprise Know Your Customer (KYC) platform;
  2. Reduce overall spending; and
  3. Create new processes to enhance the overall KYC process.

Intergreatme has always had a focus on how we protect people's personal information, which is why practically everything is encrypted in our databases. This gives us "encryption-at-rest", which is a fancy way of saying that the information you give us is encrypted prior to being saved to disk (or database).

Of course, protecting information goes beyond simply storing it in a secure way; we also need to ensure that the information is retrievable. To do this, the original solution made use of a small Cassandra cluster as an object storage engine, along with some metadata about the stored files kept in PostgreSQL.

Now, we haven't really looked at optimising storage since the company's inception, largely because some problems are ones you can throw money at, like increasing disk space when necessary. Up until now, it has been a question of opportunity cost: either we look to reduce storage to save money, or we focus on money-making activities such as building new products.

Last year we were thrown a real curveball by Microsoft Azure when they gave us notice that they were increasing Azure costs by 15%. Given we have quite the cluster of VMs to ensure we have scalability with our homebrew AI that checks every transaction, along with a cluster of CockroachDB and Cassandra boxes... this naturally became a significant price concern.

I worked with the Dev team to look through the resources we have available to us, and we saw some low-hanging fruit that allowed us to reduce the size of some VMs and optimise disk space. These initial changes were enough to negate the 15% increase in Azure costs, and it only took about a month for us to realise these savings*.

In January the Intergreatme mobile app suddenly stopped working for iOS users. I was again at a point where there was this opportunity cost thing lurching out at me: spend time focusing on trying to get the iOS app working again, or focus on product? More importantly, I also did a revenue analysis of the app vs. spend vs. estimated time to fix (and the cost of labour associated there) and the result to me was clear: discontinue the app.

I put forward a recommendation to the Board, together with a detailed proposal of what we would do technologically and how we would communicate these changes to our app users while still complying with POPIA. After about a month of development, we released our comms to users and gave them the ability to download their data, delete it, and opt out of any further communication.

With this new process, it was decided that we would not store the information in Cassandra, but rather in blob storage. Once complete, we decommissioned the various servers and databases associated with the app. Sadly, this gave us only about an 8% reduction in Azure costs (though cumulatively, by this time, we had saved 23% on spend).

I was not quite satisfied with our spend, and started looking at ways to reduce it further. Our two main cost areas are storage, followed by virtual machines. Storage is expensive because we have several clusters of machines storing the same encrypted information, and encryption inflates the size of each item we store. So I looked into the cost of using a different model for storing our documents. I looked at three main options: Amazon S3 buckets, Cloudflare R2, and Azure Blob storage.

Given we already use Azure (for now), it made sense to use Azure Blob storage. Objects are first loaded into hot storage, as they are required frequently for the AI to analyse documents, as well as being available to our verification portal and customer insight platform (not to mention Enterprise clients that decide to get a copy of these documents).

Azure provides a way to move these objects to cold storage automatically, so after a month, they are automatically moved, resulting in a further cost saving.
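
As a rough illustration of the automatic tiering described above, Azure Storage lifecycle management policies are written as JSON rules. The sketch below builds one such rule in Python. The rule name, the `documents/` container prefix, and the 30-day threshold are assumptions for illustration rather than our actual configuration, and it uses the `tierToCool` action on the assumption that "cold storage" here means the cool access tier.

```python
import json


def build_tiering_policy(days: int = 30, prefix: str = "documents/") -> dict:
    """Build a lifecycle policy that moves block blobs to the cool tier.

    The rule name and the container prefix are hypothetical placeholders.
    """
    return {
        "rules": [
            {
                "enabled": True,
                "name": f"moveToCoolAfter{days}Days",
                "type": "Lifecycle",
                "definition": {
                    # Only block blobs under the given container/prefix are affected.
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": [prefix],
                    },
                    # Tier a blob down once it has gone unmodified for `days` days.
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {
                                "daysAfterModificationGreaterThan": days
                            }
                        }
                    },
                },
            }
        ]
    }


if __name__ == "__main__":
    # Print the policy as JSON so it can be saved to a file and reviewed.
    print(json.dumps(build_tiering_policy(), indent=2))
```

The resulting JSON would then be applied to the storage account through the Azure portal or CLI. Once the rule is in place, nothing in the application needs to know about tiers.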

Fortunately, we only have a few APIs surrounding file access (upload or download), so there wasn't too much to change in converting from Cassandra to Blob storage. Every object is still encrypted by these services. I didn't want us to change too much in the system, to avoid side effects, so we kept using our same encryption methodology even though most modern storage options now also automatically encrypt the data stored in them.

The next phase involved stopping new images from being saved to Cassandra, and instead saving them to Blob storage. Finally, we needed to move historic items from Cassandra to Blob storage - an effort that took several days to complete, I might add (it's terabytes of data).
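
The backfill of historic items can be pictured as a resumable copy loop. This is only a sketch, not the code we ran: `SourceStore`, `TargetStore`, `InMemoryStore`, and `migrate` are hypothetical names, and the in-memory class stands in for both Cassandra and Blob storage so the example runs on its own. Objects are treated as opaque bytes, since they are already encrypted by our services.

```python
from typing import Dict, Iterable, Optional, Protocol


class SourceStore(Protocol):
    def keys(self) -> Iterable[str]: ...
    def read(self, key: str) -> bytes: ...


class TargetStore(Protocol):
    def exists(self, key: str) -> bool: ...
    def write(self, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for either store so the sketch runs without any services."""

    def __init__(self, items: Optional[Dict[str, bytes]] = None) -> None:
        self._items: Dict[str, bytes] = dict(items or {})

    def keys(self) -> Iterable[str]:
        return list(self._items)

    def read(self, key: str) -> bytes:
        return self._items[key]

    def exists(self, key: str) -> bool:
        return key in self._items

    def write(self, key: str, data: bytes) -> None:
        self._items[key] = data


def migrate(source: SourceStore, target: TargetStore) -> Dict[str, int]:
    """Copy every object from source to target, skipping ones already copied.

    Skipping existing keys means an interrupted run can simply be restarted.
    """
    copied = 0
    skipped = 0
    for key in source.keys():
        if target.exists(key):
            skipped += 1
            continue
        # The payload is already encrypted, so it is copied byte for byte.
        target.write(key, source.read(key))
        copied += 1
    return {"copied": copied, "skipped": skipped}


if __name__ == "__main__":
    old = InMemoryStore({"doc-1": b"\x01\x02", "doc-2": b"\x03\x04"})
    new = InMemoryStore()
    print(migrate(old, new))  # first pass copies everything
    print(migrate(old, new))  # second pass finds nothing left to copy
```

Because new uploads were already going to Blob storage by this point, a loop like this only ever has a fixed set of historic keys to work through.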

This migration of data resulted in an obvious cost increase for that period, with the influx of data into blob storage, but this is only a temporary cost as it is a once-off exercise. Once the data had shifted to Azure Blob storage, the team started to decommission Cassandra, remove data disks that were no longer necessary, and resize virtual machines.

The impact of these changes is clear: our latest invoice reflects a cost decrease of 29.41%, and the changes ensure our future spend also remains low.

* I would say it is nearly impossible to really gauge the Azure spend forecasts, as they constantly change and seem to run from month to month without considering what your billing date is. So I've never been able to read the forecasts properly until I receive an invoice.