The Zero Cost Stack
Intro
This post details the main tools and services used in the stock-advisor stack. It includes the requirements and reasoning behind each choice, as well as a short evaluation of whether I was satisfied with it. I have intentionally kept code to a minimum – limiting it to a few small, illustrative sketches – but I’ve tried to include links to relevant parts in the repository throughout the post. For a more general, less technical introduction to the stock-advisor product, see the first post in this series [1].
The post will roughly follow the structure of the components in the flowchart below: Extract and Load, Storage, Orchestration and Dashboard.
Each section has the same structure: first, I’ll outline the Requirements, then describe the Tools and Services used and considered, and finish with some Cost Conclusions. After the Tools and Services summary, there are two sub-sections: Used and Considered. These explain the reasoning and experiences behind the “stack decisions”, in relation to the Requirements. I’ve put them in expandable sections, as they can be considered supplemental material. Each section ends with Cost Conclusions, summarising how the used tools and services aligned with the goal of keeping costs at a minimum.
Is there anything noteworthy about this stack? There’s no rocket science involved, but I think it’s cool that the stack works very well and – at the same time – costs absolutely nothing to keep running 24/7, 365 days a year. If this project were for my employer, I might not have bothered saving a few hundred dollars per year to keep a stack like this running. However, as a hobby project for a hobby investor, keeping costs to a minimum was an important constraint when developing this product.
Extract and Load
The quintessential data engineering task involves copying data from a source to a destination: the data pipeline. While a one-off “copy data from A to B” is often trivial, the challenge lies in building a pipeline that never fails. Aiming for zero failures is unrealistic; therefore, we should aim to make a pipeline that:
- Is robust. We should strive for minimal failures within reasonable timeframes. In practice, this involves building in numerous retries for network transmissions, using reliable compute with high SLAs, and having thorough tests to handle common errors.
- Is easily maintainable. When failures occur, it should be easy to identify the problem and implement a fix. This requires good error handling, logging, monitoring, and comprehensive testing to allow for confident code modifications. It should also be easy to contribute improvements and new features to the code base.
- Supports fast and easy backfills. When a pipeline is fixed or historical data changes, it should be straightforward to re-ingest the data and make it available to downstream consumers.
And that’s it! Although not rocket science, I find that building robust and maintainable data pipelines presents an interesting challenge. It is usually harder to get the design right than one might think at first.
Requirements
Here are some further requirements for the Extract and Load pipeline, building on the points above. It should:
- Follow the E(t)LT pipeline design. We should extract data from the source, apply a minimal transformation, (t), if necessary, and then load it into the target destination. More complex transformations, T, can be performed later, providing flexibility, as we preserve the original data in the `raw` or `bronze` layer. This improves on the traditional ETL design by separating the Extract and Load steps from the transformations, which, when “tightly coupled”, can become difficult to maintain.
- Be idempotent. This is crucial for good pipeline design. Running the pipeline multiple times with no source changes should produce the same result. If the source changes, it should either keep only the latest state or preserve a snapshot of each run’s results.
- Be easy to add and run automated tests. This should include unit, integration, and end-to-end tests. Ideally, all three layers of testing should be implemented, using suitable testing tools. However, good tests take time [2], and pragmatism is important. Most data pipelines do not serve critical data to environments like google.com [3]. Testing rigour should reflect the SLA with downstream consumers; failing once a month with a 10-minute fix could be acceptable. Nevertheless, data engineers should certainly try to engineer their way out of being “firefighters”; there is always plenty to do, and nobody wants to spend a sizeable amount of their time fixing broken pipelines.
- Be fast. Performance is a feature, and good performance makes for a great developer and user experience. If it is possible to make the pipelines run fast, one should try to do so. In this case, fast entails concurrency [4] and swift data processing and ingestion.
- Have plenty of retries. APIs and services can be unstable, so retries with exponential backoff should be standard for all network operations (see the sketch after this list).
- Be portable. The pipelines should be easy to test and run both locally and on the “production runner” [5], with identical results.
- Be easy to debug. When the pipeline fails, it should take me little more than finding the application logs to identify the issue.
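The retry requirement is simple to illustrate. The actual pipelines handle this in Go with the `go-retryablehttp` package (more on that under Tools and Services), but the pattern itself is language-agnostic. Here is a minimal, hypothetical Python sketch of retries with exponential backoff and jitter – the function name, attempt count and delays are illustrative only:

```python
import random
import time
import urllib.error
import urllib.request


def fetch_with_retries(url: str, max_attempts: int = 5, base_delay: float = 1.0) -> bytes:
    """Fetch a URL, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller/logs
            # Exponential backoff: 1s, 2s, 4s, ... plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
    raise AssertionError("unreachable")
```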
I don’t remember the exact number of US stocks that I found relevant for my use case, but it was slightly less than 10,000. Most of the constraints in the individual POWER plan did not affect me much, but the hourly limit of 10,000 requests could seemingly be an issue. Getting data for a single stock/ticker/symbol counts as a single request, and there is no way to “bundle” multiple symbols into one request. What if a pipeline failed after making thousands of requests, for example – would I then need to automate some “sleep until next clock hour” logic, or something? Nasty.
| Tiingo API | STARTER ($0/month) | POWER ($30/month) |
|---|---|---|
| Unique symbols per month | 500 | 95,541 |
| Max requests per hour | 50 | 10,000 |
| Max requests per day | 1,000 | 100,000 |
| Max bandwidth per month | 1 GB | 40 GB |
However, it turned out that Tiingo has another, seemingly undocumented, API endpoint that returns end-of-day data for the last trading day’s changes in a single request: https://api.tiingo.com/tiingo/daily/prices. This endpoint helped alleviate the 10,000-requests-per-hour limitation, and I managed to build a robust enough ingestion using it. It has a weakness, however: it is time-dependent. It does not take a date parameter; it always returns the last trading day’s results. This means that if my pipeline were to fail for a single day, I’d have a gap in end-of-day data for all stocks. For now, I run this ingest twice a day, with the second run being fully redundant if the first worked as expected. I have noticed some instability with this endpoint, and on more than one occasion it was only the second ingest that completed successfully. For all stocks where `splitFactor != 1` or `divCash > 0`, I make a full backfill to get correct adjusted prices.
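To make the backfill logic above concrete, here is a rough Python sketch of such an ingest. The real pipeline is written in Go (see Tools and Services below), and the token query-parameter auth and exact response field names should be treated as assumptions; the point is only to show the `splitFactor`/`divCash` check:

```python
import os

import requests  # assuming the requests library for brevity; the real pipeline is Go

# The (seemingly undocumented) bulk endpoint: end-of-day rows for the last
# trading day for all symbols, in a single request.
URL = "https://api.tiingo.com/tiingo/daily/prices"

resp = requests.get(URL, params={"token": os.environ["TIINGO_TOKEN"]}, timeout=60)
resp.raise_for_status()
rows = resp.json()

# Splits and dividends change historical *adjusted* prices, so those symbols
# need a full per-symbol backfill rather than just the latest row.
needs_backfill = [
    row["ticker"]  # field name assumed
    for row in rows
    if row.get("splitFactor", 1) != 1 or row.get("divCash", 0) > 0
]
print(f"{len(rows)} rows fetched; {len(needs_backfill)} symbols flagged for backfill")
```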
Tools and Services
I use Go, running on scheduled GitHub Actions workflows, to extract financial data from Tiingo APIs and ingest it into DuckDB or MotherDuck. The minimal transformation pipelines are fast and have rigorous end-to-end testing. This includes mocking API responses, performing all transformations, and ingesting into their respective table destinations in ephemeral DuckDB instances. Idempotency is achieved by using primary keys and INSERT OR REPLACE INTO when ingesting to DuckDB tables. The setup has yet to fail me and is easy to update and add new features to.
Here is the code for EtL, and here is an example pipeline run for end-of-day pricing data in `production`.
Go
I chose Go as the primary tool for extracting and loading data to DuckDB/MotherDuck. I’ve had positive experiences using Go for extracting data from APIs previously, and I thought it would be a great fit here, given my requirements. After spending time learning and using Go in recent years, I’ve become quite fond of it. It has many good traits, but here are the most relevant for this particular E(t)L task:
- Easy to run on any platform or service. Go compiles so quickly that I often run it as an interpreted language for smaller codebases like this: I just use `go run main.go`, rather than managing cross-compiled binaries. (Go’s cross-compilation support is, by the way, great! I can easily compile binaries that run without issues on Ubuntu GitHub Actions runners from my ARM-based MacBook. However, I tend to prefer just installing the Go runtime on the target platform and simply triggering `go run`, which compiles and runs the binary. And in this particular case, the DuckDB driver depends on Go’s C API, which complicates cross-compilation.)
- Go is fast and easily enables concurrency. Go is among the fastest languages, and certainly faster than Python (which would be the most natural alternative for this task). Go provides great abstractions for concurrent code via its API and lightweight goroutines. Unlike Rust, the compiler provides little help in avoiding race conditions, so I prefer using higher-level abstractions like `MapErr` from the `conc` package, rather than implementing the “low-level” concurrent code myself.
- Great standard library, including `testing` and `http`. While installing `pytest` when using Python isn’t much overhead, it’s great that Go has excellent testing capabilities built in. The `net/http/httptest` package makes it easy to mock APIs and requests, which is great for an E(t)L workload like this. Go is also excellent for “cloud stuff”, like making requests, which is a good fit for this task. I used the `go-retryablehttp` package for handling retry mechanisms.
- Error handling. Go’s error handling is often criticised, and “Error handling” tends to be the most popular answer to the question “What is the biggest challenge you personally face using Go today?” in the Go developer surveys. But, I have to admit, I like it! Unlike in Python, Go’s error-handling design forces me to explicitly handle every possible failure mode, which is a good thing when building resilient systems. Spending half the development time, or more, on handling errors is often necessary if you want systems to run for years with minimal downtime and maintenance.
- It is a simple, minimal language. Why mention this? I find that Go is a language where I get a significant productivity boost from LLM-assisted coding. This may be because Go uses a small set of keywords and idiomatic patterns. Suggestions by Sonnet-3.5 are usually very good and often work as prompted on the first try. The code also tends to be similar to what I’d have written myself. I feel very productive developing with Go this way, and my workflow now involves reviewing LLMs’ suggestions as much as coding myself. It should have been Go that had the Zen proverb “There should be one – and preferably only one – obvious way to do it”, and not Python – which is a much more expressive language, with commonly many obvious ways to do things.
All the pipelines are triggered via a simple CLI; you can see the docs here.
GitHub Actions
After discovering that GitHub Actions usage is free for public repositories with 4-vCPU, 16 GiB memory runners, this choice was obvious. I even considered bringing my old Linux computer back to life, but was happy to avoid that (it would have been far more fiddly, less reliable, and more expensive due to electricity costs). Having a powerful VM available when I need it, either by trigger or schedule, close to my codebase, turned out to be ideal. Thank you GitHub (and, ehh, Microsoft)!
Python
This is, of course, the most obvious option. While I love Python and use it elsewhere in `stock-advisor`, it’s not always the best fit. See the arguments for Go under the Used tab for why I preferred it. The strongest argument for Python would be to keep the `stock-advisor` stack consolidated and “mono-lingual”. I am still looking for an opportunity to try dlt, which seems to have great abstractions for consuming data from APIs. I really like this saying:
Python is not the best language for anything, but it is the second best language for everything.
Redpanda Connect
I am a fan of Redpanda Connect (formerly Benthos) for simple, scheduled batch ingest scenarios. However, I needed full control and testing capabilities, which the YAML-config-first approach can’t provide. I have several times ended up jamming myself into a corner with a too-complex Redpanda Connect YAML file, wishing I had started with Go or Python instead. Having said that, I have plenty of success stories with Redpanda Connect too, where a simple 50-line YAML config works extremely well and reliably.
DuckDB’s `httpfs` extension
One of DuckDB’s great features is how easy it is to import data with it. With the `httpfs` extension, you can even read CSV or JSON responses directly from APIs. Even though using only DuckDB could have been possible, I quickly dismissed this idea because – well, yeah – SQL. SQL is great for certain things, like simpler table manipulation tasks, but it is not a good choice for creating robust pipelines. Imagine looping over every ticker symbol in the Tiingo API with a recursive CTE; I get shivers down my spine just thinking about it.
Cost conclusions
Using open-source Go with scheduled, free GitHub Actions runners provided a satisfactory “extract and load” stack, and it costs zero to keep running for years to come.
Storage
To be able to transform, train on and visualise the financial data, I needed a storage solution where I managed the data myself [6].
Requirements
Ideally, the storage solution for the `stock-advisor` product should:
- Provide fast queries and aggregations. Again, performance is a feature, and slow data transformations in a pipeline or a dashboard are best avoided. Since fast aggregations are a priority (more important than fast lookups), a column-oriented OLAP database is likely the way to go.
- Be simple to set up for idempotent ingests. While most storage systems support idempotent designs, even immutable ones like blob storage, the ease of implementation can vary significantly.
- Be simple to set up for local and CI/CD testing. Automated tests for ingest pipelines and data transformations using the storage solution should be easy to create and configure. Importantly, this includes that tests can run locally and on any CI/CD runner without network requests to a managed service (which can make tests slow and expensive).
- Be cheap or free. There is no such thing as a free lunch, is there?
- Be hosted and fully managed. If a storage solution meets all the above requirements and is fully managed, my time can be spent fully on building the product rather than on the nitty-gritty of storage.
The solution I went with – MotherDuck + DuckDB – meets all of these criteria and has been a pleasure to use.
Tools and Services
I use DuckDB for development storage and MotherDuck for production workloads. This combination has proven ideal: lightweight, fast, fully managed, and well-suited for both ingestion and dashboards. I can’t think of any technical solution that would be a better fit for this product, and remarkably, this stack currently costs me nothing. MotherDuck stores data reliably, and its serverless backend responds instantly. What more could I ask from a free service?
DuckDB
Since I first tested DuckDB a couple of years ago, I’ve been a big fan. They’ve since reached version 1.0 with a stable storage system and continually added useful features to both the SQL dialect and via extensions. DuckDB is a fast, column-oriented database that can reside in memory or as a file on any file system. It is often described as the OLAP equivalent of the OLTP database SQLite (known as the world’s most widely used database). Having a local database available via a file is excellent but doesn’t meet the requirement of being hosted and fully managed.
Initially, I considered “sending the database file around,” allowing reads and writes to the same, latest file copy across my local machine, GitHub Actions, and the dashboard. However, this approach would involve significant orchestration and be unreliable. Dismissed ❌
I also considered storing the master data as Parquet files in blob storage (possibly with Iceberg on top for improved metadata management). This could probably work quite well for hosting (blob storage is very cheap), but it would not be fully managed. This would require setup and maintenance, which I’d rather avoid for this project. Dismissed ❌
Then I discovered MotherDuck.
MotherDuck
MotherDuck provides, simply put, a hosted and fully managed DuckDB database. It has almost full feature parity with DuckDB, which makes switching between the two databases a simple config toggle.
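In practice, that config toggle is little more than swapping the connection string. A minimal Python sketch, assuming MotherDuck’s `md:` connection prefix and the `motherduck_token` environment variable, with hypothetical database names:

```python
import os

import duckdb


def get_connection(use_motherduck: bool) -> duckdb.DuckDBPyConnection:
    """Return either a local DuckDB connection or a MotherDuck connection."""
    if use_motherduck:
        # The md: prefix targets MotherDuck; auth is read from motherduck_token.
        assert os.environ.get("motherduck_token"), "motherduck_token must be set"
        return duckdb.connect("md:stock_advisor")  # database name is illustrative
    return duckdb.connect("stock_advisor.duckdb")  # local, file-based database


# Everything downstream is identical regardless of which backend is used.
con = get_connection(use_motherduck=False)
print(con.execute("SELECT 42 AS answer").fetchall())
```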
As of January 2025, MotherDuck offers a free tier of 10 compute unit hours per month + 10 GB of storage. At the start of this project, it was difficult to assess whether this would be sufficient. But I gave it a shot, and it turned out that my needs were indeed covered by this free tier. Currently, I use less than 5 GB for storing all financial data and CatBoost model results – the compression in DuckDB columns is very impressive!
MotherDuck + DuckDB
Since these two products are so closely related and work so well in tandem, I think it makes sense to discuss the following topics under this two-headed duck?!
The easy switch between a local DuckDB database (in-memory or file-based) and MotherDuck makes creating end-to-end tests for data pipelines simple. The tests begin by setting up an ephemeral DuckDB database with the desired state; then the pipelines run from start to finish with requests to the mocked `httptest.Server` endpoints, transforming (little `t`, remember) and ingesting the data with correct types, edge-case handling, etc. This setup has been very pleasant to use and ensures a high degree of confidence that the `production` pipelines will work correctly. What if unexpected inconsistencies appear in the API, causing the pipelines to fail? While it’s impossible to anticipate all potential data quirks, the second-best approach is to add these surprising API responses to the `httptest.Server` mocks and then address the downstream error handling appropriately. It’s a great example of TDD in action!
Idempotency was also surprisingly simple to implement: I assigned primary keys to one or more columns in each table, like here, and then used INSERT OR REPLACE INTO for data ingestion. This results in upsert semantics, which ensure that only the latest data is retained. (Primary keys in DuckDB work similarly to row-oriented databases like Postgres, enforcing uniqueness per row. This contrasts with column-oriented databases like BigQuery and Snowflake, where you cannot enforce uniqueness constraints on tables.)
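A tiny, self-contained demonstration of these upsert semantics (the table and values are made up; the real ingestion happens from the Go pipelines):

```python
import duckdb

con = duckdb.connect()  # ephemeral in-memory database

# A toy end-of-day price table with a composite primary key.
con.execute("""
    CREATE TABLE eod_prices (
        ticker VARCHAR,
        date   DATE,
        close  DOUBLE,
        PRIMARY KEY (ticker, date)
    )
""")

# First ingest.
con.execute("INSERT OR REPLACE INTO eod_prices VALUES ('AAPL', '2025-01-02', 243.1)")
# Re-running the pipeline (same key, possibly corrected value) replaces the row
# instead of failing or duplicating it – i.e. upsert semantics.
con.execute("INSERT OR REPLACE INTO eod_prices VALUES ('AAPL', '2025-01-02', 243.5)")

print(con.execute("SELECT * FROM eod_prices").fetchall())
# Only the latest row for ('AAPL', 2025-01-02) is kept.
```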
One of the things I learned during this project is that MotherDuck (or, more precisely, the DuckDB backend) serves well as a direct data source for the dashboarding and visualisation layer (at least when there is little concurrent load, a.k.a. only me 😄). A Streamlit dashboard that consumes data directly from MotherDuck is surprisingly snappy. As discussed in Tools and Services under Dashboard, this snappiness, combined with the ease of connecting directly to the master data, was one of the reasons I pivoted from Observable Framework as my dashboarding tool.
This one was relatively easy, so I didn’t spend much time considering alternatives. However, for those unfamiliar, I can briefly go through the alternatives and explain why I found DuckDB + MotherDuck to be a better fit.
BigQuery, Snowflake, Databricks, Redshift, Fabric
These “cloud data warehouse” services are what most companies use – at least in Norway – for their analytical data storage and operations. They offer straightforward management of large data volumes, support complex analytical queries, and handle multiple concurrent users and workloads, all while minimising operational overhead. Moreover, they all support `dbt`, the industry standard for managing data transformations in the “modern data stack”.
These services are great, and tend to work well for what they’re designed for. However, all designs come with compromises, and for me there were several reasons for not picking any one of these services for this project:
- I don’t need scale. I don’t need to handle large data volumes, nor concurrent users or queries. I prefer something snappy over something scalable.
- These services are not optimised for CRUD (Create, Read, Update, Delete) transactions. While my workloads for this product are nothing like a CRUD service, I just find BigQuery et al. frustratingly slow for simple Read operations. In a dashboard, you typically need to support a combination of simple aggregations and Read operations (larger transformations should be completed and “cached” as a table before being used from a dashboard). Using, for example, BigQuery as the backend for a dashboard – unless severely optimised – just doesn’t give you the snappy feel you want to provide your end users.
- There are no local versions. Imagine being able to spin up a local Snowflake database on your laptop in milliseconds! This could open the door for many new use cases, like distributed analytics (compute is served by the user’s computer, not the cloud) and running unit tests (on data pipeline integrations and transformation code) locally and on any CI/CD runner. (For Databricks, which is based on Spark, you might be able to run a non-feature-parity version locally using PySpark.)
- Limited support for enforced primary keys and upserts. While some of these warehouses have added primary key support, they typically don’t enforce uniqueness constraints in the same way as traditional databases, nor do they support native upsert operations (like DuckDB’s INSERT OR REPLACE INTO). This means handling duplicates often requires additional compute and complexity, either through post-ingestion transformations or external tools like Apache Beam. Over time, I’ve noticed this can create overhead in terms of SQL complexity, storage of obsolete data, and repeated deduplication in downstream transformations. (There might be one benefit to “append only, deal with duplicates later” semantics: you keep a snapshot of all ingested data. In practice, though, I’ve rarely seen anyone interested in this outdated data; I guess it would only be useful for debugging state changes in the data over time. However, there are usually better purpose-built solutions for this use case, like dbt’s snapshots.)
- And finally, costs. These services are generally cheap to use with the small data amounts I have in this project, but it’s hard to assess up-front if costs would be zero. Costs are commonly unpredictable, and with slightly suboptimal table configurations, like incorrect partition sizes, you might end up with some unpleasant surprises on your bills.
Transactional/OLTP databases
I suppose I could have used PostgreSQL or SQLite for this product. These databases wouldn’t be fast for analytical workloads, but the few transformation queries I ended up with might have performed adequately. Supabase would have been the natural choice for a hassle-free hosted version of Postgres, but their free tier offers only 500 MB of storage, which wouldn’t have been sufficient. I’ve never tested Turso but their hosted SQLite product (more precisely, a fork of SQLite) looks very interesting, and their free tier offers up to 9 GB of storage, which would have been enough. However, as DuckDB + MotherDuck have many similarities to SQLite + Turso but are built for analytical workloads, there was little reason to test both options.
Cost conclusions
DuckDB is fully open source under the permissive MIT licence, making it completely free. MotherDuck has a generous free tier – as of January 2025, 10 compute unit hours per month plus 10 GB of storage – which proved sufficient for my needs. This was, however, difficult to assess up front, but I took the gamble, accepting the “worst case” scenario of switching to MotherDuck’s $25-per-month paid subscription. With a focus on cost by, for example, not storing any duplicates, only transforming data a few times a week, and only storing key results from the transformations, I ended up being well within the free tier. I did experience MotherDuck compute throttling on days with much development activity, but that was not a big deal, since I could easily switch back to the local DuckDB instance the few times it happened. Also worth noting is the excellent compression in DuckDB, which minimises the storage footprint on disk.
Orchestration
I needed a tool to manage planned data transformations (the significant ‘T’ in E(t)LT) and machine learning workloads. I realised early on that this product’s orchestration needs wouldn’t be complex, and simply “running some Python functions in order” could have worked just fine. However, in case the size and complexity of the product grows, it is beneficial to already be using a tool that is designed to handle large, complex orchestration needs. Furthermore, the learning experience of trying out new tools is also part of the fun of developing products like this.
Requirements
The orchestration tool should:
- Be flexible. In this context, that means the tool should allow for writing transformations in Python and calling CatBoost from Python. Consequently, an orchestration tool written in Python that could easily be imported as a library would be the natural choice. Using Python also simplifies testing, such as including unit tests for data transformations (which I’ve found particularly tricky in `dbt`).
- Have a command line interface. A somewhat flexible CLI is needed to allow for easy execution in GitHub Actions workflows.
- Run locally. Devex is so much better if the orchestration tool has good support for local development.
- Store logging and pipeline run results. I need some sort of storage to save the results of all pipeline runs. Ideally, this storage should integrate with the chosen orchestrator, enabling easy exploration and debugging of pipeline results using a graphical user interface.
- Support MLOps requirements. While the need for managing MLOps, at least at the start, is limited, it would be nice to pick a tool that allows for easily storing things like training run metrics and metadata, and potentially setting up alerts when metrics exceed predefined threshold values.
Tools and Services
I use Dagster for orchestrating the single transformation and CatBoost training pipeline. Transformations are written in SQL, and I created a `CatBoostTrainer` class with a simple API to use from Dagster, which provides predictions and SHAP values (feature explanations) for each stock prediction. Below, in the blue expansion, is an image of the Dagster pipeline DAG, if you prefer a visual representation.
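The actual `CatBoostTrainer` lives in the repository; the core idea is captured by this hypothetical sketch, where a regressor is trained and CatBoost itself computes the SHAP values (the toy data and hyperparameters are purely illustrative):

```python
import numpy as np
from catboost import CatBoostRegressor, Pool

# Toy data standing in for the real feature table (fundamentals, prices, ...).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=500)  # synthetic return target

train_pool = Pool(X, y)
model = CatBoostRegressor(iterations=200, depth=6, verbose=False)
model.fit(train_pool)

predictions = model.predict(X)

# CatBoost can return SHAP values directly: one row per sample, one column per
# feature plus a final column holding the expected value (the baseline).
shap_values = model.get_feature_importance(train_pool, type="ShapValues")
print(predictions[:3])
print(shap_values.shape)  # (500, 9)
```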
Dagster
After some research, I decided to give Dagster a shot, and it has proven to be an excellent choice. The main reasons for this include:
- Great developer experience. The `dagster dev` command makes it very easy to spin up the Dagster UI and run pipelines locally. This simplifies development and testing compared to, for example, Airflow, which typically requires pushing code to an Airflow server (strictly speaking, you can run Airflow locally via something like Docker Compose, but that is much more overhead). In addition, I found the declarative, asset-based abstractions in Dagster very nice to work with. Abstractions that are easy to reason about lead to cleaner and more maintainable code.
- Sufficiently flexible CLI. To trigger a Dagster job/pipeline via the CLI, I found that dagster job execute works well enough for my use cases. Although not as extensive as `dbt`’s CLI, the ability to incorporate the necessary flexibility into the triggered Python code made it perfectly suitable.
- Decent documentation. I found Dagster’s documentation somewhat verbose and confusing, but it certainly did the job.
- Full control over metadata. By returning one of Dagster’s types that support metadata, like `Output`, one can easily store metadata for each pipeline step, like the CatBoost model’s test set RMSE (which should not increase significantly over time). See the sketch after this list.
- Direct embedding of `dbt` models as nodes in pipeline DAGs. Although I didn’t use `dbt` in this project, it was beneficial to know that the orchestration service could extend to `dbt` models if required.
- Run results stored in SQLite or Postgres. The Dagster config makes it easy to configure run states to be stored in a dedicated database. I started with SQLite, which worked well, before switching to a Supabase-managed Postgres instance. This allowed me to share the same state between runs on GitHub Actions and my laptop. Supabase’s 500 MB free tier storage limit is sufficient for this use case.
That last bullet point – run results stored in SQLite or Postgres – is worth expanding on. Many orchestration tools are marketed as hosted, “always-on” services. Managed Dagster and Prefect cloud offerings, for example, provide both a web UI and pipeline compute, available 24/7. But do you really need that? That depends, of course. With the ambition of keeping costs low, I was hoping to instead run my orchestration jobs as scheduled GitHub Actions workflows. With the 4-vCPU, 16 GiB runner and cron scheduler offered by GitHub Actions, all I needed was a CLI to run it from and some way to save the state of pipeline runs. So, rather than having a continuously running Dagster service, I simply trigger a Dagster job whenever needed. After configuring Supabase for storing pipeline runs, which works flawlessly, I didn’t miss anything from an “always-on” Dagster service (I use `dagster dev` locally to view `production` pipeline runs stored in Supabase). And, again, costs remained at zero.
GitHub Actions
Perhaps repeating myself here, but it may be worth emphasising that the only infrastructure required for `production` orchestration workloads was running Dagster on scheduled GitHub Actions workflows. Here is the workflow I currently run twice a week, which orchestrates transformations and trains three CatBoost models with 12-, 24-, and 36-month prediction horizons.
Prefect
Although I’ve heard good things about Prefect, I opted for Dagster after doing a bit of research. Overall sentiment on Reddit seemed more positive towards Dagster, and the declarative, asset-based abstractions, along with native support for `dbt` models, ultimately tipped my decision in favour of Dagster. But maybe I’ll try Prefect next time?
dbt-duckdb
There is a community-maintained `dbt` adapter for DuckDB, `dbt-duckdb`, that could have worked well enough for my needs. While `dbt` isn’t strictly an orchestrator, my requirements were simple: a tool to create a scheduled DAG pipeline involving DuckDB transformations, training CatBoost models, and writing results back to the database. `dbt-duckdb` allows for running Python in the same process as `dbt` itself (see here for more), which probably would have worked just fine for the model training. However, after using `dbt` for many years at work, I wanted to try something else. And, to be completely honest, the idea of using a Python-first orchestration framework rather than Jinja SQL was appealing; on numerous occasions I’ve been pulling my hair out over the constraints of the `dbt` CLI plus Jinja SQL. Creating good abstractions as `dbt` macros is both limited and results in code that’s hard to read and maintain.
Malloy
While not an orchestration framework, I did spend some time testing Malloy as a candidate for the primary transformation language (to use in big ‘T’). I like the concept of combining transformations and a semantic layer in a declarative, easier-than-SQL language, but it proved insufficiently flexible for my transformation needs. I needed the full flexibility of SQL (or Polars), and among SQL dialects I must say that DuckDB’s is my favourite. And using Malloy just for a semantic layer is simply overkill for this product.
Kestra (or any other declarative, non-Python orchestrator)
Unpopular opinion: I like YAML. Using an expressive language like Python for pipeline configuration offers great flexibility, but also introduces potential bugs and maintenance complexity. A declarative configuration language, like the one used by Kestra, presents different trade-offs, notably reducing the risk of accidental bugs such as off-by-one errors.
So, to me, tools that have a declarative configuration-first approach – like Kestra, Redpanda Connect, Terraform – are appealing and may be a good choice in certain situations. However, it just did not make sense for this project, as both DuckDB and CatBoost have a tight integration with Python and dealing with the JVM runtime is something I try to avoid unless strictly necessary.
Cost conclusions
Running open-source Dagster with GitHub Actions as the pipeline runner and Supabase as metadata storage does not cost anything.
Dashboard
Finally, to the last, but very important part: visualisation, discovery and decision-making. The results of this product would be of little value if they were not easy to explore, learn from and compare. Creating a good dashboard felt like the natural approach, as exploring “with code only”, such as in a Jupyter Notebook or the MotherDuck UI, simply does not offer an ideal UI/UX for aiding the research and decision-making process of picking stocks.
Requirements
- Great devex for developing. It should be easy to wrangle data and create useful tables, visualisations and reports. As having a hosted solution was not an absolute requirement, this is the most important requirement. Ideally, it should be quick and easy to transform my ideas – both good and bad – into a useful and snappy UI.
- Great devex for deployment and hosting. Even though I could run this dashboard from my local computer, it is much nicer to have it hosted and reachable from any computer or mobile. And, as I will elaborate on in the Used and Considered sections below, different dashboarding “backends” can significantly impact how easy it is to get a snappy dashboard with always up-to-date data up and running.
- Cheap or free hosting. If possible, free, but I could spend some pennies for cheap hosting as well (admittedly making the punchline of this post fall flat!).
- Simple authentication setup. If I want this hosted, it should at least have some easy-to-enable password log-in. (This is necessary as the Tiingo data has a personal license; otherwise, I wouldn’t mind making the dashboard public [7].)
Tools and Services
I use Streamlit for creating a dashboard with multiple tabs, spanning historical development together with predictions, the “Stock Picker”, and Jupyter Notebook reports. Streamlit has many versatile dashboard components and building blocks with good documentation, which allows me to easily compose a nice dashboard. Using outputs like `dataframe` and `altair_chart`, inputs like `date_input` and `multiselect`, and composition elements like `columns` and `navigation`, a decent dashboard for insights and exploration was completed in little time. I use Streamlit Community Cloud for free hosting, which works well. Here is the code for the dashboard.
Using the data stored in MotherDuck directly as the data source was surprisingly fast for both reads/lookups and simpler aggregations. See the GIF in the blue expansion below for a demonstration of dashboard usage and its speed when calculating, on the fly, the relative returns for all stocks selected.
Streamlit
Streamlit was not top of my list when starting to explore dashboarding tools. Partly because I’ve used it before – and been very happy, by the way – but also because I was intrigued by the newly released Observable Framework. Nevertheless, I ended up coming back to Streamlit; it is simply a great dashboarding tool that ticked off all the requirements listed above.
One of the main criticisms of Streamlit is that it renders the entire page on every user input (such as selecting a menu item). Streamlit doesn’t offer “real reactivity,” where different page parts can render independently without requiring a full refresh. This is, at the same time, one of the great strengths of Streamlit in my opinion: abstracting away one of the most difficult parts of web development – frontend state and reactivity – allows any Python developer to be productive in no time. Streamlit has a single strategy for working around the limitations of this (simple and brilliant) design: caching. And, I’ve found that caching works effectively for most scenarios, at least the ones I’ve encountered in this case. With some clever use of caching in the right places, you can get surprisingly snappy dashboards consuming data directly from MotherDuck.
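As a minimal sketch of what “caching in the right places” can look like – the table name, TTL and tickers are hypothetical, and the connection assumes the MotherDuck setup described under Storage:

```python
import duckdb
import pandas as pd
import streamlit as st


@st.cache_data(ttl=3600)  # cache query results for an hour per unique argument set
def load_prices(tickers: tuple[str, ...]) -> pd.DataFrame:
    con = duckdb.connect("md:stock_advisor")  # hypothetical database name
    placeholders = ",".join("?" for _ in tickers)
    return con.execute(
        f"SELECT ticker, date, close FROM eod_prices WHERE ticker IN ({placeholders})",
        list(tickers),
    ).df()


selected = st.multiselect("Tickers", ["AAPL", "MSFT", "NVDA"], default=["AAPL"])
if selected:
    df = load_prices(tuple(selected))  # the selection becomes part of the cache key
    st.line_chart(df.pivot(index="date", columns="ticker", values="close"))
```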
And, to emphasise the primary requirement – Great devex for developing – I’ve never used a dashboarding tool in which I am so productive. Streamlit’s built-in components usually provide all the flexibility I need, enabling me to achieve the desired behaviour and user experience quickly. Kudos!
I researched various Streamlit hosting options, but ended up using the free tier offered by Streamlit Community Cloud. It was easy to set up CI/CD directly from the public repo, and even with limited compute and memory on the free tier, I haven’t experienced any runtime issues. The app is public but requires a password to log in.
MotherDuck Web UI
Although not a dashboarding tool, I feel it’s worth mentioning. I spent considerable time developing my SQL pipeline and dashboard transformations in the MotherDuck Web UI. While perhaps not as much as locally using VSCode + DuckDB, sometimes it’s just convenient to log into a UI and type away to get some immediate results.
The UI is quite basic, reminiscent of Jupyter Notebooks, with cells arranged in a notebook layout. It provides nice summary statistics. I like it, but it certainly feels like early days.
Go MotherDuck, I am cheering for you! 👏
Observable Framework
Here’s a brief review of Observable Framework, which I ended up not using. I recommend Simon Willison’s perspectives as an introduction; that post convinced me to try it out. These were the ideas I found most appealing:
- Everything in the frontend. Frontend capabilities are rapidly advancing, largely due to WebAssembly combined with powerful consumer hardware. This also applies to DuckDB, which can be compiled to a relatively small WebAssembly binary. This opens the door to building dashboards entirely in the frontend using JavaScript + in-browser DuckDB (or MotherDuck Wasm). The advantage? No backend is required, enabling you to build a dashboard as a static website that supports an unlimited number of concurrent users. Since the data and UI are all in the frontend, you can achieve instant reactivity. Imagine a dashboard with near-zero lag on updates (provided you don’t overload the Wasm engine with complex queries).
- Everything in Markdown files. I’m a big fan of Markdown; it’s my preferred medium for both short and long writing (like this blog post). The idea of seamlessly combining text with interactive visualisations, all within the frontend, was very appealing.
However, it turned out that these two main selling points were also Observable Framework’s biggest shortcomings.
With everything in the frontend, you need some way to copy your data into the frontend as well. Observable’s solution is data loaders, which allow you to easily fetch data at deploy time and package it into the static website. For use with DuckDB, the natural format to build the data into is Parquet (a columnar format that DuckDB has good support for reading). However, this design presents some challenges. Firstly, how do you ensure your data is always up to date? Observable suggests scheduling regular builds (e.g., on GitHub Actions), where you regularly download the latest data and rebuild the site. This sounds simple enough, but it gave me a design headache: to avoid the dashboard’s page load time becoming too long (remember, the browser must download all data embedded in the static site), I would likely need to use a subset of the Tiingo financial data and CatBoost results. But which data should I exclude? I have 9,000 stocks available, and ideally, I’d like to explore all of them when needed, accessing all relevant data and prediction results, even though I might only check on a few regularly. I found this constraint – what data to exclude – difficult to deal with and an overhead I’d rather skip if possible.
And the simple way to do that, of course, is to load the data from MotherDuck when needed, instead of cramming it all into the frontend (5 GB of compressed data is simply too much). But here, the idea of everything in the frontend poses another challenge: how can I safely request data from MotherDuck using, say, a frontend `fetch`, without exposing my API token? Sure, you may point out that this dashboard is for me only, and that having the token exposed in the frontend would not be a big deal. And you may be right, but it still feels like poor design; if I can avoid security malpractice, I’ll try to do so. After pondering this for a while, I concluded that getting data behind authentication at runtime – in a secure way – is much simpler with a backend.
Then, to my second gripe with it, related to everything in Markdown files. No matter how much of a fan I am of Markdown, I found that writing JavaScript in it was surprisingly difficult. I am no JavaScript wiz, that’s for sure, but I like to think that with some help from an IDE/language server I can still get things to work pretty much how I like, with some trial and error. But in Markdown, at least at the time I played with Observable Framework (some months ago), I simply got no help from VSCode whatsoever. I did not realise how dependent I’ve become on a decent language server and, potentially, some LLM assistance when coding, and how much more productive they make me. (Yes, I know, Observable Framework can import TypeScript/JavaScript from files, but that doesn’t give me the devex I was hoping for either. Still, this is what I quickly realised I had to do to be even somewhat productive in this tool.)
So, when I switched to Streamlit, it was a revelation how quickly I could get things working as I wanted. This also highlights the advantages of Python over (frontend) JavaScript. I stumble into too many frontend quirks of JavaScript without understanding what’s going on, whereas Python in the backend with Streamlit is dead simple.
NiceGUI
NiceGUI was also on my shortlist, developed as a direct response to the limitations of Streamlit already discussed under Used. I didn’t end up trying it for this project, as I found Streamlit could quickly accomplish what I needed. However, I did use NiceGUI to build the website for my employer, Dataist, and I highly recommend it. With high-level components from Quasar and the utility-first approach of Tailwind CSS, you can quickly build user interfaces or entire websites that look and behave just fine.
Cost conclusions
The free tier offered by Streamlit Community Cloud has been sufficient [8], and its automatic deployment from the `main` branch of the public repo has worked seamlessly.
Conclusions
I did spend a considerable amount of time discovering and researching the tools and services used in the `stock-advisor` stack. In review, that time seems well spent, as I’d say I mostly followed the “happy path” during development, with minimal time spent on detours or pivots. Even the few exploratory paths I took, such as experimenting with Observable Framework and Malloy, turned out to be valuable learning experiences.
I am very happy with how the stack turned out and how reliably it operates every day. It strikes a good balance between simplicity and flexibility, making it easy to maintain and extend if needed. However, as I mentioned in the previous post, this product is only possible by “standing on the shoulders of giants” – the abundance of high-quality, free tools and services available today is just amazing!
Achieving a completely free, zero-cost stack was a goal I wasn’t sure I could meet for most of the development process, so I’m especially pleased that it worked out in the end. This project has been a rewarding experience, both in terms of the technical challenges and the satisfaction of creating something functional and cost-effective.
If you have any questions or thoughts, feel free to share them in the comments below – I’d be happy to discuss or help others explore similar approaches.
1. If you’re not interested in data engineering, this post is probably not for you 😀
2. Helping me create tests and test boilerplate is one of the most noticeable speedups I’ve gained from LLM-assisted coding.
3. Which should literally never fail.
4. Primarily concurrent requests to the Tiingo API to avoid being I/O bound, but possibly also concurrent or parallel data processing.
5. In this case, on GitHub Actions’ virtual machines.
6. Even though I am curious about the latest zero-copy trend, consuming data directly from the API “on demand” would probably be difficult and a bad design, if even possible. I would also quickly exceed the Tiingo API’s usage limits.
7. Ignoring any scaling issues, of course, as having many users visiting the site would have used up the free-tier compute from MotherDuck in no time.
8. Streamlit Community Cloud is not fully transparent about the hardware stack one gets when deploying services there; at least, I could not find the reference for it. In some places, I read 1 GB of memory, in others a bit more, so I am not entirely sure what you get. It might be dynamic, and that’s why they’re deliberately not explicit about it, but that’s just speculation.