Let me share the story of my experience trying a new, cool in-process database, DuckDB, to reduce cost and improve performance over PostgreSQL in my data analytics solution.
How I started
I built my data pipeline to Extract some data from websites and CSV files, Load it into my database, and Transform it into a reporting-ready schema. I used Python and Pandas to extract and load some of the data and Meltano to load some additional supporting data. All of that data went into a PostgreSQL database hosted in the cloud on Azure, where I then used dbt to create data models in the database optimized for reporting. Finally, I used Metabase to visualize the data. (whew! that's a lot of moving parts!)
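For illustration, the extract-and-load step looked roughly like the sketch below; the file name, table name, and connection details are placeholders rather than my real ones.

```python
# A minimal sketch of the extract-and-load step (illustrative names only).
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical Azure PostgreSQL connection string
engine = create_engine("postgresql://user:password@myserver.postgres.database.azure.com/analytics")

df = pd.read_csv("raw_data_2022.csv")    # Extract from a downloaded CSV
df.to_sql(                               # Load into a raw schema for dbt to transform
    "raw_data_2022",
    engine,
    schema="raw",
    if_exists="replace",
    index=False,
)
```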
It was quite nice to work on and performance was reasonably good. Also, data size and cloud costs for the database were reasonable with the data from a single, small US state (1M people). It was costing me about $15/month for a solution I don't ever anticipate making money on.
But it didn't scale
When I used the same pipeline and data warehouse to analyze a lot more data for a larger state I quickly ran into challenges. Cost increased 4x, data loading time went up at least 8x, and reporting performance in Metabase was unacceptable. And that was still not the largest of the 50 states that I wanted to support.
My options
I read about the benefits of columnar data storage for reporting data and considered using the Citus extension for PostgreSQL, Microsoft SQL Server, and Snowflake - each of which supports columnar storage. But because I am extremely cost-conscious, I decided to give the open-source DuckDB database a shot. Some simple tests showed me performance would likely be significantly faster for most of my queries, and I could reduce the $60/month cloud cost for the PostgreSQL database to just $5/month in blob/object storage.
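For the curious, my "simple tests" amounted to little more than timing an aggregate against a local file with DuckDB, something like the sketch below (the file name and columns are made up for illustration).

```python
# Rough timing test: run an aggregate over a local Parquet export in DuckDB.
# The file name and columns are illustrative, not my actual data set.
import time
import duckdb

start = time.time()
result = duckdb.sql("""
    SELECT county, count(*) AS row_count, avg(value) AS avg_value
    FROM 'raw_data_2022.parquet'
    GROUP BY county
    ORDER BY row_count DESC
""").df()
print(f"{len(result)} groups in {time.time() - start:.2f}s")
```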
The migration
Switching was very educational, but certainly not as quick as I had hoped. I'm a jack-of-all-trades and my core skills are learning and problem solving, not coding or application design. I spent 2-3 hours a day for 3 weeks migrating to DuckDB. It's only now, when my data is nestled snugly under DuckDB's wing, that I realize there are a few additional prototypes I should have tested, because they would have saved me a lot of time.
What I learned
It wasn't that surprising, but I confirmed that DuckDB, which has not reached its 1.0 release yet, is not PostgreSQL and does not support everything PostgreSQL does. (duh!) One example - I felt my SQL transformation code was pretty simple but didn't count on how long it would take to re-write a Postgres `cross join lateral` for DuckDB. After way too many hours invested, I gave up and moved that core piece of my transformation code from SQL to Python and Pandas, where it is much easier to understand.
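To give a flavor of the rewrite (my real query is more involved), here is a hypothetical "top-N per group" example - the kind of thing `cross join lateral` handles nicely in Postgres - expressed in Pandas instead. The data and column names are made up for illustration.

```python
# Hypothetical stand-in for the kind of "top-N per group" logic that a
# Postgres CROSS JOIN LATERAL ... LIMIT N often expresses.
import pandas as pd

counties = pd.DataFrame({
    "state":      ["WA", "WA", "WA", "OR", "OR"],
    "county":     ["King", "Pierce", "Spokane", "Multnomah", "Lane"],
    "population": [2_250_000, 920_000, 540_000, 815_000, 380_000],
})

# "For each state, take the 2 most populous counties":
# sort once, then take the first 2 rows of each group.
top2 = (
    counties.sort_values("population", ascending=False)
            .groupby("state")
            .head(2)
)
print(top2)
```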
I also found that while I can use a simple DuckDB database in Metabase, I get errors when running even simple `select *` queries against the DuckDB database with my full data set in it. (I still need to open an issue for this...)
I found that since DuckDB has not reached v1, I can (and did) run into problems where the version of the DuckDB Python package I was using on my machine was creating files that could not be read by the version of the community-supported DuckDB driver for Metabase. Thankfully, the developer of the driver released a new version almost immediately in response to my issue.
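My takeaway was to check, and pin, the DuckDB version on the machine that writes the database file so it stays in step with what the Metabase driver bundles. A small sketch of the check I now run (purely illustrative):

```python
# Check which DuckDB version is installed locally so it can be pinned
# (e.g. in requirements.txt) to match what the Metabase driver supports.
import duckdb

print("duckdb package:", duckdb.__version__)
print(duckdb.sql("PRAGMA version").fetchall())  # library_version, source_id
```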
To minimize that file version compatibility risk, I took some advice from folks who know dbt and DuckDB and explored writing the data to .parquet files (which are not directly supported by Metabase) so I could then create views in the DuckDB database that looked like `create view raw_data_2022 as select * from 'raw_data_2022.parquet'`. This would make the DuckDB database very small and easy to update and deploy as versions changed. However, while this worked great on my machine, when I connected the DuckDB database in Metabase and went to explore the data, I got errors telling me that the raw_data_2022.parquet file could not be found. It seems that perhaps DuckDB is looking for the file in the directory where the Metabase executable is running, not where the DuckDB database is located. (Issue recently filed.)
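For reference, the approach looked roughly like the sketch below (file and view names are illustrative). If my hunch about the working directory is right, building the views with absolute paths may be a workaround worth trying.

```python
# Sketch of the parquet-plus-views approach, assuming the Parquet file sits
# next to the .duckdb file. The relative path in the view appears to be resolved
# when the view is queried, which would explain why it worked on my machine
# but not when Metabase (running from a different directory) opened the file.
import duckdb

con = duckdb.connect("reporting.duckdb")
con.sql("""
    CREATE OR REPLACE VIEW raw_data_2022 AS
    SELECT * FROM 'raw_data_2022.parquet'
""")

# Possible workaround (untested): bake an absolute path into the view instead.
# abs_path = "/data/raw_data_2022.parquet"
# con.sql(f"CREATE OR REPLACE VIEW raw_data_2022 AS SELECT * FROM '{abs_path}'")
```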
Where do I go from here?
While I could dig deeper, debug these issues, learn languages like Clojure, and submit PRs, I feel I've hit the end of my rope. I'm certainly a bit frustrated to have spent all this time, but I love all the knowledge I have gained and the wonderful people I have met and collaborated with in the process.
A positive and unexpected thing I learned during my exploration was that transforming the data on my local machine using Python and Pandas and storing the dataset temporarily in DuckDB may be a winning strategy from a performance perspective. DuckDB loads data from my raw CSV files very quickly, and it acts like a super-fast cache while I am doing the transformation - faster than my local PostgreSQL database - since the data is in memory and can be shared between DuckDB and Pandas without having to create copies.
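In practice, the local transform flow looks something like this sketch (file, table, and column names are placeholders for my real ones):

```python
# DuckDB as a fast local staging area: read the raw CSV quickly, move the
# tricky transform into Pandas, then hand the result back to DuckDB.
# All names below are illustrative.
import duckdb

con = duckdb.connect()  # in-memory database used only during the transform
con.sql("CREATE TABLE raw AS SELECT * FROM 'raw_data_2022.csv'")

# DuckDB -> Pandas for the transform that was painful to express in SQL
df = con.sql("SELECT * FROM raw WHERE value > 0").df()
df["value_per_capita"] = df["value"] / df["population"]

# Pandas -> DuckDB: the local DataFrame can be queried by name directly
con.sql("CREATE TABLE reporting AS SELECT * FROM df")
```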
I also learned that for my data set, columnar storage, at least with DuckDB, is very fast and efficient. So, I may look at the Citus extension for PostgreSQL again, though because I'm a cheapskate I will need to run the database server myself in a VM or Docker container; the cloud-hosted version of Citus, Azure Cosmos DB for PostgreSQL (formerly called Hyperscale), exceeds my budget (for now). And for completeness, I would love to try Snowflake, but it doesn't appear to fit within my budget either.
Finally, I learned that loading my raw data into the database differently and doing the transform on my local machine rather than on the database server can save a ton of space - perhaps up to 80%. So, I may be able to get my cloud database costs back down to $30 or perhaps $15 per month, even with the addition of the larger state's data.
Conclusion
For now, I'm going back to PostgreSQL, but I'm certainly not opposed to trying DuckDB again on this project (once my issues are resolved) or on other projects in the future.
What a journey!