Looking for a cloud storage solution to meet my needs, I was shocked to find that Microsoft OneDrive, when purchased as part of a Microsoft 365 Family subscription, is quite the bargain. In the U.S., OneDrive costs less than $0.02 per GB (gigabyte) per year. AWS S3, on the other hand, costs over 20x that rate!
Let me share my experience researching cloud storage with you so you can hopefully narrow down the options and find an affordable solution to meet your own needs.
Storing at home is no longer an option
My data workflows typically involve downloading CSV or PDF files, doing some light cleanup and transformation, and then loading them into a database in the cloud for further transformation, analysis, and sharing. With my paltry upload speeds of less than 10 Mbps, I often found myself waiting and waiting for uploads to complete. But since I was only uploading data a few times a month, I decided to put solving the problem on the back burner.
However, a couple of months ago my ISP, Starlink, introduced data caps and throttling. Since my usage for the first month exceeded their new limit, I was inspired to improve my data workflow, with the dual benefits of less waiting and a reduced risk of getting throttled.
My goals
I would say my goals are fairly modest:
store up to 1,000 GB of data (for now)
at least 10 Mbps data transfer speeds
for the least amount of money
My 1,000 GB of data
I mentioned that my workloads involve extracting data from CSV and PDF files and pushing it into a database, so why do I need to store data that is already in a database?
Data Provenance
If you’re not familiar with data provenance, think of it as a chain of custody for data. In my case, if I or someone else finds something interesting while analyzing the data in the database, I want to be able to trace that data back to the source file so I can validate that the findings are accurate and ensure that the data didn’t get accidentally modified during transformation and analysis. My goal is to ensure my data and analyses are beyond reproach.
To that end, I store two files for each set of data:
Original files - unaltered copies of source files
Corrected files - copies of the original files with minimal corrections.
I use dbt to transform the data once it’s in SQL, but some things need to be fixed before loading the data, and other one-off issues with a particular file are way easier to fix manually in the file than to code for in SQL. The one-off issues include incorrect column names, non-UTF-8 characters, etc. In addition, the corrected files are renamed to match a standard pattern.
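To make the "corrected files" step concrete, here is a minimal sketch of the kind of one-off fix I'm describing. Everything in it is hypothetical: the source file name, the typo'd header, and the "source-a_YYYY-MM" naming pattern are all invented for illustration.

```shell
# Hypothetical one-off correction: re-encode a Windows-1252 source file to
# UTF-8, fix a bad column name, and save it under a standard file name.
mkdir -p original corrected

# Simulate a downloaded source file with a typo'd header and a non-UTF-8 byte
printf 'Totl Sales,Caf\351\n12,34\n' > 'original/report (Jan).csv'

# Re-encode to UTF-8, fix the header on line 1 only, rename to the pattern
iconv -f WINDOWS-1252 -t UTF-8 'original/report (Jan).csv' \
  | sed '1s/Totl Sales/Total Sales/' \
  > 'corrected/source-a_2024-01.csv'
```

The original file stays untouched in `original/`, which is what preserves the provenance chain.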
My needs and wants
To find an optimal solution, I realized I needed to know how much data I planned to store, how much data would be transferred in and out of storage, and the size of the files I’ll be storing.
Need: Store 1,000 GB of data, but average 500 GB month-to-month
I’ll freely admit, the 1,000 GB and 500 GB requirements are very rough estimates. Currently I have 80 GB of files from 12 months of data from a single source. Eventually, I could have up to 50 sources and some of them produce more data per month than the current one, but it’s likely to take me a couple years to onboard all of those sources. So, I’m just going with some round numbers that will make it easier to compare pricing across service providers.
Additional considerations:
Some service providers, like AWS, Azure, and Google Cloud, are consumption-based: they charge you only for the storage you consume and let you grow as needed. Others, like Microsoft OneDrive and Google Drive, charge you for what you think you will need in large chunks of 100 GB or more, which means you are paying for storage you are not yet using, and you need to monitor your usage to prevent running out of space.
I’m storing my data in CSV files, which are about as inefficient as you can get. If/when I get to the point where storage costs are prohibitive, I’ll probably look at compressing the files.
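For a sense of how much headroom compression would buy, here is a quick sketch using a toy CSV (the file name and contents are made up). Repetitive, text-heavy CSVs typically compress very well with gzip.

```shell
# Generate a toy CSV (~100k rows of repetitive text) and gzip-compress it
seq 1 100000 | awk '{print $1 ",widget," $1*2}' > sample.csv

# -c writes compressed output to stdout, leaving the original in place
gzip -c sample.csv > sample.csv.gz

# Compare the on-disk sizes
ls -l sample.csv sample.csv.gz
```

On data like this, the compressed file comes out at a small fraction of the original size, which would directly shrink the per-GB bill.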
Need: Transfer up to 100 GB per month
As you may know, some service providers charge you not only for storing the data but for moving it as well, especially if you are moving it to or from the internet. These charges go by different names: transfer, ingress, egress, etc. If you are moving lots of data, transfer costs can really add up.
In addition to the data I send to storage every month (ingress), I am considering providing users the ability to download (egress) the original and corrected files. I don’t have a great way to anticipate that volume, so for my comparison I am using 100 GB because that allows for 2 GB per potential data source, because that is what AWS Free Tier provides each month, and because transfer costs for most services are very minimal.
A final note on transfer costs: in cases where a service provider charges for transferring data to or from the internet, I understand that I could reduce a portion of that cost by hosting my data pipeline and database with the same service provider, but that would only make a difference with just one provider, so I’m not factoring it into the comparison.
Want: S3-compatible object storage (the HTTP of storage)
S3 is the name of AWS’s object storage service, and because of its popularity there are many tools and services that are S3-compatible, making S3 the de facto standard for object storage in my eyes.
The tools I’m currently using don’t require S3 compatibility, so it’s not a hard requirement, but I would like to keep my options open for the future so I prefer a service that supports S3 if all other factors are the same.
In addition, I don’t care whether my storage is technically “object” storage; I just need to store files. But since object storage seems to be cheaper than other forms among providers that offer multiple options, and because of S3’s ubiquity, if a service provider offers S3-compatible storage, that’s the one I’ll use to keep the comparison as close to apples-to-apples as possible.
Need: Max file size at least 5 GB
Most of my files are currently 1 GB or less uncompressed. But I anticipate some sources providing files at least twice that size. So, I’m looking for a service that will provide some headroom and let me store 5 GB files to prevent an urgent need to introduce compression or file splitting.
Want: Located closest to U.S. Pacific Northwest
I live and work in the Pacific Northwest of the United States, and most of the current consumers of my solution are in the same region. So, if a service provider has different pricing for various regions, I will choose the region closest to where I’m located to keep latency to a minimum and limit the amount of analysis I need to do. And if a provider has multiple locations in the Pacific Northwest, I will obviously choose the more affordable one.
Here’s what I found
The table below contains most of the data that informed my decision, and I’ve highlighted my choice and another interesting finding in green.
I believe it’s important to let you know that there is a potential conflict of interest: I work for Microsoft. However, my data project is not part of my day job, and I don’t believe my employer influenced my decision. In fact, when I started looking at storage options, I didn’t include what I call personal storage solutions, and I fully expected to find that data transfer costs would be a larger factor, which would mean finding a provider where I could host my storage, my database, and my web site all in one region to minimize transfer costs. Given that, I thought I would likely end up choosing a smaller provider like DigitalOcean or Linode, which I expected to be more affordable than the “big” providers.
With that said, I chose Microsoft OneDrive, which is included with Microsoft 365. As you can see in the chart, the “Total annual cost per GB - all included GB” is 2-10 times cheaper than most of the others, which makes it look like an easy decision, but that’s not the whole story.
While OneDrive does include a total of 6,000 GB for $99 per year, that storage is spread across 6 included accounts with 1,000 GB each. If I don’t want to move files around manually, it’s going to take a bit more thought and effort to make use of all that storage than if it were in a single account. But I don’t need all that storage, at least not for now, and I believe I can use one account as the “parent” and share folders from the remaining “child” accounts with the parent account to make the separate accounts transparent.
Also, purchasing OneDrive via Microsoft 365 has a killer feature that’s not mentioned in the table - Microsoft 365 includes the Microsoft Office suite of applications, which I use almost every day! And, since I don’t need all 6,000 GB of storage, I can share one of those accounts with my mom, so she gets her own OneDrive and Office for free. Who can put a price on that?
OneDrive is not S3-compatible, and I thought that might be a deal breaker. You see, I know that OneDrive works well with Windows because I use it dozens of times a day on my Windows desktop. But for my data projects I’m planning to run an ELT tool called Meltano in a container deployed to the cloud, so I need to connect the container to OneDrive. Thankfully, I found an open-source tool called rclone that lets you “manage files on cloud storage,” and rclone has an extensive list of cloud storage services that it works with, including OneDrive. After quite a few hours of testing and pulling my hair out, I got rclone to mount my OneDrive storage inside a Docker container, and my decision was final… for now.
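The mount setup can be sketched roughly as follows. This is not my exact configuration, just the general shape of the approach: the remote name `onedrive`, the `data` folder, and the mount path are assumptions, and `rclone config` is an interactive step that authorizes the OneDrive remote before anything else will work.

```shell
# One-time, interactive: create and authorize an rclone remote for OneDrive
# (I'm calling the remote "onedrive" here; the name is arbitrary.)
rclone config

# Mount the remote inside a container using rclone's official Docker image.
# FUSE mounts need extra privileges when run inside Docker, hence the
# --device and --cap-add flags; the config dir is passed in as a volume.
docker run --rm \
  --device /dev/fuse --cap-add SYS_ADMIN \
  -v ~/.config/rclone:/config/rclone \
  rclone/rclone mount onedrive:data /mnt/onedrive \
  --vfs-cache-mode writes
```

The `--vfs-cache-mode writes` flag buffers writes locally before uploading, which rclone recommends for backends (like OneDrive) that don't support partial writes.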
Honorable mention - IDrive e2
As you can see under the “Cost for first 500 GB of storage” heading in the table, most of the “Cost per GB” numbers are quite close, $0.20 - $0.28, but IDrive e2 was substantially cheaper - a quarter to a fifth the price of some of the others!
IDrive the company has been around for a while and their Personal offering is about the same price per GB as OneDrive, but I was not familiar with their e2 service and only stumbled across it while researching for this post. If I didn’t need Microsoft Office and was confident that all my files would be smaller than the 2 GB limit, I would definitely consider switching! And I will probably try it out for a future project.
Surprisingly expensive - AWS
I was shocked at how much more expensive S3 was than the competition. In fact, I went back through the AWS Pricing Calculator multiple times to make sure I hadn’t done something wrong. I wasn’t that surprised that their storage cost was a bit higher, though it is over 21% higher than Azure’s, but the transfer cost of over 1/3 the storage cost per year blew me away. All I can say is, if you are using S3 storage, make sure you are looking at your transfer costs closely. If those costs are high, consider moving your workloads into the same region if possible, or even consider a hybrid cloud approach if your application can handle the latency and your IT department can manage the security.
What do you think?
If I missed something important, or you have found this information helpful, I would love to hear from you!