# The Zero Cost Stack


## Intro

This post details the main tools and services used in the [stock-advisor](https://github.com/rasnes/stock-advisor) stack. It includes the requirements and reasoning behind each choice, as well as a short evaluation of whether I was satisfied with it. I have intentionally not included any code, but I've tried to include links to relevant parts in the [repository](https://github.com/rasnes/stock-advisor) throughout the post. For a more general, less technical introduction to the `stock-advisor` product, see [the first post](/posts/stock-advisor-intro) in this series[^de].

The post will roughly follow the structure of the components in the flowchart below: [Extract and Load](#extract-and-load), [Storage](#storage), [Orchestration](#orchestration) and [Dashboard](#dashboard).

{{< image src="https://github.com/user-attachments/assets/a1c1c2d4-3995-4366-a0ea-1b79855b2216" caption="The [stock-advisor](https://github.com/rasnes/stock-advisor) stack" cacheRemote="true" >}}

Each section has the same structure: first, I'll outline the _Requirements_, then describe the _Tools and Services_ used and considered, and finish with some _Cost Conclusions_. After the _Tools and Services_ summary, there are two sub-sections: _Used_ and _Considered_. These explain the reasoning and experiences behind the "stack decisions", in relation to the _Requirements_. I’ve put them in expandable sections, as they can be considered supplemental material. Each section ends with _Cost Conclusions_, summarising how the used tools and services aligned with the goal of keeping costs at a minimum.

Is there anything noteworthy about this stack? There's no rocket science involved, but I think it’s cool that the stack works very well and – at the same time – **costs absolutely nothing** to keep running 24/7, 365 days a year. If this project were for my employer, I might not have bothered saving a few hundred dollars per year to keep a stack like this running. However, as a hobby project for a hobby investor, keeping costs to a minimum was an important constraint when developing this product.

[^de]: If you're not interested in data engineering, this post is probably not for you 😀

## Extract and Load

The quintessential data engineering task involves copying data from a source to a destination, the _data pipeline_. While a one-off “copy data from A to B" often is trival, the challenge lies in building a pipeline that never fails. Aiming for zero failures is unrealistic; therefore, we should aim to make a pipeline that:

1. **Is robust.** We should strive for minimal failures within reasonable timeframes. In practice, this involves building in numerous retries for network transmissions, using reliable compute with high SLAs, and having thorough tests to handle common errors.
2. **Is easily maintainable.** When failures occur, it should be easy to identify the problem and implement a fix. This requires good error handling, logging, monitoring, and comprehensive testing to allow for confident code modifications. It should also be easy to contribute improvements and new features to the code base.
3. **Supports fast and easy backfills.** When a pipeline is fixed or historical data changes, it should be straightforward to re-ingest the data and make it available to downstream consumers.

And that's it! Although not rocket science, I find building robust and maintainable data pipelines presents an interesting challenge. It is usually harder to get the design right than one might think at first.

### Requirements

Here are some new points that build on the ones mentioned above for the Extract and Load pipeline. It should:

-  **Follow the [E(t)LT](https://double.cloud/blog/posts/2023/03/etlt-the-tech-that-is-transforming-data-processing/#what-is-etlt?) pipeline design.** We should extract data from the source, apply a minimal transformation, _(t)_, if necessary, and then load it into the target destination. More complex transformations, _T_, can be performed later, providing flexibility, as we preserve the original data in the `raw` or `bronze` layer. This improves on the traditional ETL design by separating the Extract and Load steps from the transformations, which, when “tightly coupled”, can become difficult to maintain.
-  **Be idempotent.** This is crucial for good pipeline design. Running the pipeline multiple times with no source changes should produce the same result. If the source changes, it should either keep the latest state only or preserve a snapshot of each run's results.
-  **Be easy to add and run automated tests.** This should include unit, integration, and end-to-end tests. Ideally, all [three layers of testing](https://martinfowler.com/articles/practical-test-pyramid.html) should be implemented, using suitable testing tools. However, good tests take time[^llm-for-tests], and pragmatism is important. Most data pipelines do not serve critical data to environments like google.com[^google]. Testing rigour should reflect the SLA with downstream consumers; failing once a month with a 10 minute fix could be acceptable. Nevertheless, data engineers should certainly try to engineer their way out of being "firefighters"; there is always plenty to do, and nobody wants to spend a sizeable amount of their time fixing broken pipelines.
-  **Be fast.** Performance is a feature. Performance is great developer and user experience. If it is possible to make the pipelines run fast, one should try to do so. In this case, fast entails concurrency[^concurrency] and swift data processing and ingestion.
-  **Have plenty of retries.** APIs and services can be unstable, so retries with exponential backoff should be standard for all network operations.
-  **Be portable.** The pipelines should be easy to test and run both locally and on the "production runner"[^gh-actions], with identical results.
-  **Be easy to debug.** When the pipeline fails, it should take me little more than finding the application logs to identify the issue.

[^llm-for-tests]: Helping me create tests and test boilerplate is one of the most noticeable speedups I've gained from LLM-assisted coding.
[^google]: Which should literally _never_ fail.
[^concurrency]: Primarily concurrent requests to the Tiingo API to avoid being [I/O bound](https://en.wikipedia.org/wiki/I/O_bound), but possibly also concurrent or parallel data processing.
[^gh-actions]: In this case, on GitHub Actions' virtual machines.

{{< admonition type=tip title="Tiingo API Constraints" open=false >}}
I don't remember the exact number of US stocks that I found relevant for my use case, but it was slightly less than 10,000. Most of the constraints in the [individual POWER plan](https://www.tiingo.com/about/pricing) did not affect me much, but the hourly limitation of 10k requests could seemingly be an issue. Getting data from a single stock/ticker/symbol counted as a single request, and there is no way to "bundle" a request to get multiple symbols in a single request. What if a pipeline failed after making thousands of requests, for example, would I then need to automate some "sleep until next clock hour" logic, or something? Nasty.

| Tiingo API               | STARTER (\$0/month) | POWER (\$30/month) |
| ------------------------ | :----------------: | :---------------: |
| Unique symbols per month | 500                | 95,541            |
| Max requests per hour    | 50                 | 10,000            |
| Max requests per day     | 1000               | 100,000           |
| Max bandwith per month   | 1 GB               | 40 GB             |

However, it turned out that Tiingo has another, seemingly undocumented, API endpoint that returns end-of-day data for the last trading day's changes in a _single_ request: https://api.tiingo.com/tiingo/daily/prices. This endpoint helped alleviate the 10k requests per hour limitation, and I managed to make an ingestion robust enough using it. It has a weakness, however, which is that it is time dependent. It does not take a date parameter; it always gives the last day's results. Meaning that if my pipeline were to fail for a single day, I'd have a gap for end-of-day data for all stocks. For now, I run this ingest [_twice_ a day](https://github.com/rasnes/stock-advisor/blob/38dbd1d9ef9fa573ef276378eccc95bee20f29c7/.github/workflows/prod__append-last-trading-day.yaml#L5), with the second time being fully redundant if the first ingest worked as expected. I have noticed some instability with this endpoint, and on more than one occasion it was only the second ingest that completed successfully. For all stocks where `splitFactor != 1` or `divCash > 0`, I make a full backfill to get correct _adjusted_ prices.
{{< /admonition >}}

### Tools and Services

I use Go, running on scheduled GitHub Actions workflows, to extract financial data from Tiingo APIs and ingest it into [DuckDB or MotherDuck](#tools-and-services-1). The minimal transformation pipelines are fast and have rigorous end-to-end testing. This includes mocking API responses, performing all transformations, and ingesting into their respective table destinations in ephemeral DuckDB instances. [Idempotency](https://dagster.io/glossary/data-idempotency) is achieved by using primary keys and [INSERT OR REPLACE INTO](https://duckdb.org/docs/sql/statements/insert.html#insert-or-replace-into) when ingesting to DuckDB tables. The setup has yet to fail me and is easy to update and add new features to.

Here is [the code for EtL](https://github.com/rasnes/stock-advisor/tree/main/EtL), and here is [an example pipeline run for end-of-day pricing data](https://github.com/rasnes/stock-advisor/actions/runs/12824300825/job/35760203823) in `production`.

{{< admonition type=success title=titles.used open=false >}}

<h3>Go</h3>

I chose Go as the primary tool for extracting and loading data to DuckDB/MotherDuck. I've had positive experiences using Go for extracting data from APIs previously, and I thought it would be a great fit here, given my requirements. After spending time learning and using Go in recent years, I've become quite fond of it. It has many good traits, but here are the most relevant for this particular `E(t)L` task:

- **Easy to run on any platform or service.** Go compiles so quickly that I often run it as an interpreted language for smaller codebases like this. I just use `go run main.go`, rather than managing cross-compiled binaries.{{% footnote "go-crosscompilation" %}}Go's cross-compilation support is, by the way, great! I can easily compile binaries that run without issues on Ubuntu GitHub Actions runners from my ARM-based MacBook. However, I tend to prefer just installing the Go runtime on the target platform and simply triggering `go run` (which compiles and runs the binary). And in this particular case, the [DuckDB driver](https://github.com/marcboeker/go-duckdb) depends on Go's C API, which complicates cross-compilation{{% /footnote %}}
- **Go is fast and easily enables concurrency.** Go is among the fastest languages, and certainly faster than Python (which would be the most natural alternative for this task). Go provides great abstractions for concurrent code via its API and lightweight goroutines. Unlike Rust, the compiler provides little help in avoiding race conditions, so I prefer using higher-level abstractions like [`MapErr`](https://pkg.go.dev/github.com/sourcegraph/conc/iter#Map) from the [`conc` package](https://github.com/sourcegraph/conc), rather than implementing the "low-level" concurrent code myself.
- **Great standard library, including `testing` and `http`**. While installing `pytest` when using Python isn't much overhead, it's great that Go has excellent testing capabilities built-in. The `net/http/httptest` package makes it easy to mock APIs and requests, which is great for an E(t)L workload like this. Go is also excellent for "cloud stuff", like making requests, which is a good fit for this task. I used the [`go-retryablehttp`](https://github.com/hashicorp/go-retryablehttp) package for handling retry mechanisms.
- **Error handling.** Go's error handling is often criticised, and `Error handling` tends to be the most popular answer to the question

  > What is the biggest challenge you personally face using Go today?

  in the Go developer surveys. But, I have to admit, I like it! Unlike in Python, Go's error-handling design _forces_ me to explicitly handle every possible failure mode, which is a good thing when building _resilient_ systems. Spending half the development time, or more, on handling errors is often necessary if you want systems to run for years with minimal downtime and maintenance.
- **It is a simple, minimal language.** Why mention this? I find that Go is a language where _I get a significant productivity boost from LLM-assisted coding_. This may be because Go uses a small set of keywords and idiomatic patterns. Suggestions by Sonnet-3.5 are usually very good and often work as prompted on the first try. The code also tends to be similar to what I'd have written myself. I feel very productive developing with Go this way, and my workflow now involves reviewing LLMs' suggestions as much as coding myself. It should have been Go that had the zen proverb

  > There should be one--and preferably only one--obvious way to do it.

  and not Python -- which is a much more expressive language, with commonly many obvious ways to do things.

All the pipelines are triggered via a simple CLI; you can see the docs [here](https://github.com/rasnes/stock-advisor/blob/38dbd1d9ef9fa573ef276378eccc95bee20f29c7/EtL/cli.md).

<h3>GitHub Actions</h3>

After discovering that GitHub Actions usage is free for public repositories with [4-vCPU, 16GiB memory](https://github.blog/news-insights/product-news/github-hosted-runners-double-the-power-for-open-source/) runners, this choice was obvious. I even considered bringing my old Linux computer back to life, but was happy to avoid that (it would have been much more fiddle, less reliable and more expensive due to electricity costs). Having a powerful VM available when I need it, either by trigger or schedule, close to my codebase, turned out to be ideal. Thank you GitHub (and, ehh, Microsoft)!

{{< /admonition >}}

{{< admonition type=failure title=titles.considered open=false >}}

<h3>Python</h3>

This is, of course, the most obvious option. While I love Python and use it elsewhere in `stock-advisor`, it's not always the _best_ fit. See the arguments for Go under the _Used_ tab for why I preferred it. The strongest argument for Python would be to keep the `stock-advisor` stack consolidated and "mono-lingual". I am still looking for an opportunity to try [dlt](https://dlthub.com/docs/intro), which seems to have great abstractions for consuming data from APIs. I really like this saying:

> Python is not the best language for anything, but it is the second best language for everything.

<h3>Redpanda Connect</h3>

I am a fan of [Redpanda Connect](https://www.redpanda.com/connect) (formerly Benthos) for simple, scheduled batch ingest scenarios. However, I needed _full control and testing capabilities_, which the YAML-config-first approach can’t provide.{{% footnote "rpk" %}}I have several times ended up jamming myself into a corner with a too-complex Redpanda Connect YAML file, where I wished I would have started with Go or Python instead. Having said that, I have plenty of success stories with Redpanda Connect too, where a simple 50-line YAML config works extremely well and reliably.{{% /footnote %}}

<h3>DuckDB's <code>https</code> extension</h3>

One of DuckDB's great features is how easy it is to import data with it. With the `https` extension, you can even read CSV or JSON responses directly from APIs. Even though using only DuckDB could have been possible, I quickly dismissed this idea because -- well, yeah -- SQL. SQL is great for certain things, like simpler table manipulation tasks, but it is not a good choice for creating robust pipelines. Imagine looping over every ticker symbol in the Tiingo API with a [recursive CTE](https://duckdb.org/docs/sql/query_syntax/with.html#recursive-ctes); I get shivers down my spine just thinking about it.

<!-- <h3>Rust</h3>

Just kidding.{{% footnote "rust" "up" %}}Why is this a joke? Rust could have worked for this but is not the right level of abstraction. I would have spent time on details that don't matter to the result. It would have been fun to learn Rust for this, but not productive.{{% /footnote %}} -->

{{< /admonition >}}

### Cost conclusions

Using open-source Go with scheduled, free GitHub Actions runners provided a satisfactory "extract and load" stack, and it costs zero to keep running for years to come.

## Storage

To be able to transform, train on and visualise the financial data, I needed a storage solution where I managed the data myself[^API-directly].

[^API-directly]: Even though I am curious about the latest zero-copy trend, consuming data directly from API "on demand" would probably be difficult and a bad design, if even possible. I would also quickly exceed the Tiingo API's usage limits.

### Requirements

Ideally, the storage solution for the `stock-advisor` product should:

-   **Provide fast queries and aggregations.** Again, performance is a feature, and slow data transformations in a pipeline or in a dashboard is best avoided. Since fast aggregations are a priority (more important than fast lookups), a [column-oriented OLAP database](https://en.wikipedia.org/wiki/Data_orientation#Column-oriented) is likely the way to go.
-   **Be simple to set up for idempotent ingests.** While most storage systems support idempotent designs, even immutable ones like blob storage, the ease of implementation can vary significantly.
-   **Be simple to set up for local and CI/CD testing.** Automated tests for ingest pipelines and data transformations _using the storage solution_ should be easy to create and configure. Importantly, this includes that tests can run locally and on any CI/CD runner without network requests to a managed service (which can make tests slow and expensive).
-   **Be cheap or free.** There is no such thing as a free lunch, is there?
-   **Be hosted and fully managed.** If a storage solution meets all the above requirements _and_ is fully managed, my time can be spent fully on building the product itself and not bothering with the nitty-gritty of storage itself.

The solution I went with -- [MotherDuck](https://motherduck.com/) + [DuckDB](https://duckdb.org/) -- meets _all_ of these criteria and has been a pleasure to use.

### Tools and Services

I use [DuckDB](https://duckdb.org/) for development storage and [MotherDuck](https://motherduck.com/) for production workloads. This combination has proven ideal: lightweight, fast, fully managed, and well-suited for both ingestion and dashboards. I can't think of any _technical_ solution that would be a better fit for this product, and remarkably, this stack currently costs me nothing. MotherDuck stores data reliably, and its serverless backend responds instantly. What more could I ask from a free service?

{{< admonition type=success title=titles.used open=false >}}

<h3>DuckDB</h3>

Since I first tested [DuckDB](https://duckdb.org/) a couple of years ago, I've been a big fan. They've since [reached version 1.0 with a stable storage system](https://duckdb.org/2024/06/03/announcing-duckdb-100.html) and continually added useful features to both [the SQL dialect](https://duckdb.org/2023/08/23/even-friendlier-sql.html) and via [extensions](https://duckdb.org/2024/03/22/dependency-management.html). DuckDB is a fast, column-oriented database that can reside in memory or as a file on any file system. It is often described as the OLAP equivalent of the OLTP database SQLite (known as the [world's most widely used database](https://www.sqlite.org/mostdeployed.html)). Having a local database available via a file is excellent but doesn't meet the requirement of being hosted and fully managed.

Initially, I considered "sending the database file around," allowing reads and writes to the same, latest file copy across my local machine, GitHub Actions, and the dashboard. However, this approach would involve significant orchestration and be unreliable. Dismissed ❌

I also considered storing the master data as Parquet files in blob storage (possibly with [Iceberg](https://iceberg.apache.org/) on top for improved metadata management). This could probably work quite well for _hosting_ (blob storage is very cheap), but it would not be fully managed. This would require setup and maintenance, which I'd rather avoid for this project. Dismissed ❌

Then I discovered MotherDuck.

<h3>MotherDuck</h3>

[MotherDuck](https://motherduck.com/) provides, simply put, a hosted and fully managed DuckDB database. It has almost full feature parity with DuckDB, which makes switching between the two databases a simple [config toggle](https://github.com/rasnes/stock-advisor/blob/11c3a46ef8b62e5c4f0437a34c610571c83d4d83/EtL/config.prod.yaml#L1).

As of January 2025, MotherDuck offers a free tier of 10 compute unit hours per month + 10 GB of storage. At the start of this project, it was difficult to assess whether this would be sufficient. But I gave it a shot, and it turned out that my needs were indeed covered by this free tier. Currently, I use less than 5 GB for storing all financial data and CatBoost model results -- the [compression](https://duckdb.org/2022/10/28/lightweight-compression.html) in DuckDB columns is very impressive!

<h3>MotherDuck + DuckDB</h3>

Since these two products are so closely related and work so well in tandem, I think it makes sense to discuss the following topics under this two-headed duck?!

<img width="1194" alt="Image" src="https://github.com/user-attachments/assets/f77b0f87-e412-4a57-91ad-7a5d62e0cc87" />

The easy switch between a local DuckDB database (in-memory or file-based) and MotherDuck makes creating [end-to-end tests](https://github.com/rasnes/stock-advisor/blob/11c3a46ef8b62e5c4f0437a34c610571c83d4d83/EtL/pipeline/pipelines_test.go) for data pipelines simple. The tests begin by setting up an ephemeral DuckDB database with the desired state, then the pipelines run from start to finish with requests to the mocked `httptest.Server` endpoints, transforming (little `t`, remember) and ingesting the data with correct types, edge case handling, etc. This setup has been very pleasant to use and ensures a high degree of confidence that the `production` pipelines will work correctly. What if unexpected inconsistencies appear in the API, causing the pipelines to fail? While it's impossible to anticipate all potential data quirks, the second-best approach is to add these surprising API responses to the `httptest.Server` mocks and then address the downstream error handling appropriately. It's a great example of [TDD](https://quii.gitbook.io/learn-go-with-tests#learn-test-driven-development-with-go) in action!

[Idempotency](https://dagster.io/glossary/data-idempotency) was also surprisingly simple to implement: I assigned primary keys to one or more columns in each table, like [here](https://github.com/rasnes/stock-advisor/blob/11c3a46ef8b62e5c4f0437a34c610571c83d4d83/EtL/sql/table__daily_adjusted.sql#L9), and then I used [INSERT OR REPLACE INTO](https://duckdb.org/docs/sql/statements/insert.html#insert-or-replace-into) for data ingestion. This results in upsert semantics, which ensure that _only_ the latest data is retained.{{% footnote "primary-key-duckdb" %}} Primary keys in DuckDB work similarly to row-oriented databases like Postgres, enforcing uniqueness per row. This contrasts with column-oriented databases like BigQuery and Snowflake, where you cannot enforce uniqueness constraints on tables.{{% /footnote %}}

One of the things I learned during this project is that MotherDuck (or, more precisely, the DuckDB backend) serves well as a direct data source for the dashboarding and visualisation layer (at least when there is little concurrent load, a.k.a. only me 😄). A Streamlit dashboard that consumes data directly from MotherDuck is surprisingly snappy. As discussed in [Tools and Services under Dashboard](#tools-and-services-3), this snappiness, combined with the ease of directly connecting to the master data directly, was one of the reasons I pivoted from [Observable Framework](https://observablehq.com/framework/) as my dashboarding tool.

{{< /admonition >}}

{{< admonition type=failure title=titles.considered open=false >}}

This one was relatively easy, so I didn’t spend much time considering alternatives. However, for those unfamiliar, I can briefly go through the alternatives and explain why I found DuckDB + MotherDuck to be a better fit.

<h3>BigQuery, Snowflake, Databricks, Redshift, Fabric</h3>

These "cloud data warehouse" services are what most companies use -- at least in Norway -- for their analytical data storage and operations. They offer straightforward management of large data volumes, support complex analytical queries, handle multiple concurrent users and workloads, all while minimising operational overhead. Moreover, they all support `dbt`, the industry standard for managing data transformations in the "modern data stack".

These services are great, and tend to work well for what they're designed for. However, all designs come with compromises, and for me there were several reasons for _not_ picking any one of these services for this project:

- **I don't need scale.** I don't need to handle large data volumes, nor concurrent users or queries. I prefer something snappy over something scalable.

- **These services are not optimised for CRUD (Create, Read, Update, Delete) transactions.** While my workloads for this product is nothing like a CRUD service, I just find BigQuery et al. frustratingly slow for operations like simple Read operations. In a dashboard, you typically need to support a combination of simple aggregations and Read operations (larger transformations should be completed and "cached" as a table _prior_ to being used from a dashboard). Using, for example, BigQuery as the backend for a dashboard—unless severely optimized—just doesn't give you the snappy feel you want to provide your end users.

- **There are no local versions.** Imagine being able to spin up a local Snowflake database on your laptop in milliseconds! This could open the door for many new use cases, like distributed analytics (compute is served by the user's computer, not the cloud) and running unit tests (on data pipeline integrations and transformation code) locally and on any CI/CD runner.{{< footnote "databricks-local" >}} For Databricks, which is based on Spark, you might be able to run a non-feature parity version locally using PySpark.{{< /footnote >}}
- **Limited support for enforced primary keys and upserts.** While some of these warehouses have added primary key support, they typically don't enforce uniqueness constraints in the same way as traditional databases, nor do they support native upsert operations (like DuckDB's INSERT OR REPLACE INTO). This means handling duplicates often requires additional compute and complexity, either through post-ingestion transformations or external tools like Apache Beam. Over time, I've noticed this can create overhead in terms of SQL complexity, storage of obsolete data, and repeated deduplication in downstream transformations.{{% footnote "snapshots" %}}There might be one benefit with "append only, deal with duplicates later" semantics, and that is that you keep a snapshot of all ingested data. In practice, though, I've rarely seen anyone interested in this outdated data; I guess it would only be useful for debugging state changes in the data over time. However,there are usually better purpose-built solutions for this use case, like dbt's [snapshots](https://docs.getdbt.com/docs/build/snapshots).{{% /footnote %}}
- **And finally, costs.** These services are generally cheap to use with the small data amounts I have in this project, but it's hard to assess up-front if costs would be _zero_. Costs are commonly unpredictable, and with slightly suboptimal table configurations, like incorrect partition sizes, you might end up with some unpleasant surprises on your bills.

<h3>Transactional/OLTP databases</h3>

I suppose I _could_ have used PostgreSQL or SQLite for this product. These databases wouldn't be fast for analytical workloads, but the few transformation queries I ended up with might have performed adequately. Supabase would have been the natural choice for a hassle-free hosted version of Postgres, but their free tier offers only 500 MB of storage, which wouldn't have been sufficient. I've never tested [Turso](https://turso.tech/) but their hosted SQLite product (more precisely, a fork of SQLite) looks very interesting, and their free tier offers up to 9 GB of storage, which would have been enough. However, as DuckDB + MotherDuck have many similarities to SQLite + Turso but are built for analytical workloads, there was little reason to test both options.
{{< /admonition >}}

### Cost conclusions

DuckDB is fully open source and has the generous MIT Licence, making it completely free. Motherduck has a generous free tier—as of January 2025, 10 compute unit hours per month + 10 GB of storage -- which proved sufficient for my needs. This was, however, difficult to assess up front, but I took the gamble, accepting the "worst case" scenario of switching to MotherDuck's $25 per month paid subscription. With a focus on cost by, for example, not storing any duplicates, only transforming data a few times a week, and only storing key results from the transformations, I ended up being well within the free tier. I did experience MotherDuck compute throttling on days with much development activity, but that was not a big deal since I could easily switch back to the local DuckDB instance the few times this happened. Also worth noting is the excellent [compression](https://duckdb.org/2022/10/28/lightweight-compression.html) in DuckDB that minimises the storage footprint on disk.

## Orchestration

I needed a tool to manage planned data transformations (the significant 'T' in E(t)LT) and machine learning workloads. I realised early on that this product's orchestration needs wouldn't be complex, and simply "running some Python functions in order" could have worked just fine. However, in case the size and complexity of the product grows, it is beneficial to already be using a tool that is designed to handle large, complex orchestration needs. Furthermore, the learning experience of trying out new tools is also part of the fun of developing products like this.

### Requirements

The orchestration tool should:

- **Be flexible.** In this context, that means that the tool should allow for making transformations in and calling CatBoost from Python. Consequently, an orchestration tool written in Python that could be easily imported as a library would be the natural choice. Using Python also simplifies testing, such as including unit tests for data transformations (which I've found particularly tricky in `dbt`).
- **Have a command line interface.** A somewhat flexible CLI is needed to allow for easy execution in GitHub Actions workflows.
- **Run locally.** Devex is so much better if the orchestration tool has good support for local development
- **Store logging and pipeline run results.** I need some sort of storage to save the results of all pipeline runs. Ideally, this storage should integrate with the chosen orchestrator, enabling easy exploration and debugging of pipeline results using a graphical user interface.
- **Support MLOps requirements.** While the need for managing MLOps, at least at the start, is limited, it would be nice to pick a tool that allows for easily storing things like training run metrics and metadata and potentially setting up alerts when metrics exceed e.g. exceed predefined threshold values.

### Tools and Services

I use [Dagster](https://dagster.io/) for orchestrating [the single transformation and CatBoost training pipeline](https://github.com/rasnes/stock-advisor/blob/38dbd1d9ef9fa573ef276378eccc95bee20f29c7/transformations/src/dagster_catboost.py). Transformations are [written in SQL](https://github.com/rasnes/stock-advisor/tree/38dbd1d9ef9fa573ef276378eccc95bee20f29c7/transformations/src/sql) and I created a [CatBoostTrainer Class](https://github.com/rasnes/stock-advisor/blob/38dbd1d9ef9fa573ef276378eccc95bee20f29c7/transformations/src/catboost_trainer.py) with a simple API to use from Dagster, which provides predictions and [SHAP values](https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/Catboost%20tutorial.html) (feature explanations) for each stock prediction. Below, in the blue expansion, is an image of the Dagster pipeline DAG, if you prefer a visual representation.

{{< admonition type=abstract title="Dagster DAG flowchart image" open=false >}}

<img width="549" alt="Image" src="https://github.com/user-attachments/assets/a94c4101-9aed-435a-97e9-487c8d255f65" />

{{< /admonition >}}

{{< admonition type=success title=titles.used open=false >}}

<h3>Dagster</h3>

After some research, I decided to give [Dagster](https://dagster.io/) a shot, and it has proven to be an excellent choice. The main reasons for this include:

- **Great developer experience.** The `dagster dev` command makes it very easy to spin up the Dagster UI and run pipelines locally. This simplifies development and testing compared to, for example, Airflow, which typically requires pushing code to an Airflow server (strictly speaking, you can run Airflow locally via something like Docker-compose, but this is much more overhead). In addition, I found the [declarative, asset based, abststractions](https://dagster.io/blog/declarative-scheduling) in Dagster to be very nice to work with. Abstractions that are easy to reason about leads to cleaner and more maintainable code.
- **Sufficiently flexible CLI.** To trigger a Dagster job/pipeline via CLI, I found that [dagster job execute](https://docs.dagster.io/concepts/ops-jobs-graphs/job-execution#command-line) works well enough for my use cases. Although not as extensive as `dbt`'s CLI, the ability to incorporate necessary flexibility into the triggered Python code made it perfectly suitable.
- **Decent documentation.** I found Dagster's documentation to be somewhat verbose and confusing, but it certainly did the job.
- **Full control over metadata.** By returning one of Dagster's types that support `metadata`, like [Output](https://github.com/rasnes/stock-advisor/blob/38dbd1d9ef9fa573ef276378eccc95bee20f29c7/transformations/src/dagster_catboost.py#L162), one can easily store metadata for each pipeline step, like the CatBoost model's test set RMSE (which should not increase significantly over time).
- **Direct embedding of `dbt` models as nodes in pipeline DAGs.** Although I didn’t use `dbt` in this project, it was beneficial to know that [the orchestration service could extend to `dbt` models](https://docs.dagster.io/integrations/dbt) if required.
- **Run results stored on SQLite or Postgres.** The Dagster config makes it easy to configure run states to be stored in a dedicated database. I started with SQLite, which worked well, before switching to a Supabase-managed Postgres instance. This allowed me to *share the same state* between runs on GitHub Actions and my laptop. Supabase's 500 MB free tier storage limit is sufficient for this use case.

That last point is worth expanding on. Many orchestration tools are marketed as hosted, "always-on" services. Managed Dagster and Prefect cloud offerings, for example, provide both a web UI and pipeline compute, available 24/7. But do you really _need_ that? That depends, of course. With the ambition of keeping costs low, I was hoping to instead run my orchestration jobs as scheduled GitHub Actions workflows. With the 4vCPU 16GiB runner and cron scheduler offered by GitHub Actions, all I needed was a CLI to run it from and some way to save the state of pipeline runs. So, rather than having a continuously running Dagster service, I simply _trigger a Dagster job whenever needed_. After configuring Supabase for storing pipeline runs, which works flawlessly, I didn't miss anything from an "always-on" Dagster service (I use `dbt dagster dev` locally to view `production` pipeline runs stored in Supabase). And, again, costs remained at zero.

<h3>Github Actions</h3>

Perhaps repeating myself here, but it may be worth emphasising that the only infrastructure required for `production` orchestration workloads was running Dagster on scheduled GitHub Action workflows. [Here is the workflow](https://github.com/rasnes/stock-advisor/blob/38dbd1d9ef9fa573ef276378eccc95bee20f29c7/.github/workflows/prod__run-dagster.yaml) I currently run twice a week, which orchestrates transformations and trains three CatBoost models with 12, 24, and 36 months prediction horizons.

{{< /admonition >}}

{{< admonition type=failure title=titles.considered open=false >}}

<h3>Prefect</h3>

Although I've heard good things about [Prefect](https://www.prefect.io/), I opted for Dagster after doing a bit of research. Overall sentiment on Reddit seemed more positive towards Dagster, and the [declarative, asset based, abstractions](https://dagster.io/blog/declarative-scheduling), along with native support for `dbt` models, ultimately tipped my decision in favour of Dagster. But maybe I'll try Prefect next time?

<h3>duckdb-dbt</h3>

There is a community maintained `dbt` version for DuckDB that could have worked well enough for my needs. WWhile `dbt` isn't strictly an orchestrator, my requirements were simple: a tool to create a scheduled DAG pipeline involving DuckDB transformations, training CatBoost models, and writing results back to the database. `duckdb-dbt` allows for running Python on the same Python process as `dbt` runs on, see [here](https://github.com/duckdb/dbt-duckdb?tab=readme-ov-file#python-support) for more, which probably would have worked just fine for the model training. However, after using `dbt` for many years at work, I wanted to try something else. And, to be completely honest, the idea of using a Python-first orchestration framework rather than [Jinja SQL](https://docs.getdbt.com/docs/build/jinja-macros) was appealing; on numerous occasions I've been pulling my hair out due to the constraints `dbt` CLI plus Jinja SQL offers. Creating good abstractions as `dbt` macros is both limited and results in code that’s hard to read and maintain.

<h3>Malloy</h3>

While not an orchestration framework, I did spend some time testing [Malloy](https://www.malloydata.dev/) as as a candidate for the primary transformation language (to use in big 'T'). I like the concept of combining transformations and a semantic layer in a declarative, easier-than-SQL language, but it proved insufficiently flexible for my transformation needs. I needed the full flexibility of SQL (or Polars), and among SQL dialects I must say that DuckDB is [my favourite](https://duckdb.org/2023/08/23/even-friendlier-sql.html). And using Malloy just forg a semantic layer is simply overkill for this product.

<h3>Kestra (or any other declarative, non-Python orchestrator)</h3>

Unpopular opinion: I like YAML. Using an expressive language like Python for pipeline configuration offers great flexibility, but also introduces potential bugs and maintenance complexity. A declarative configuration language, like used by [Kestra](https://kestra.io/features/declarative-data-orchestration), presents different trade-offs, notably reducing the risk of accidental bugs such as [off-by-one errors](https://stackoverflow.com/questions/2939869/what-is-an-off-by-one-error-and-how-do-i-fix-it).

So, to me, tools that have a declarative configuration-first approach -- like Kestra, Redpanda Connect, Terraform -- are appealing and may be a good choice in certain situations. However, it just did not make sense for this project, as both DuckDB and CatBoost have a tight integration with Python and dealing with the JVM runtime is something I try to avoid unless strictly necessary.

{{< /admonition >}}

### Cost conclusions

Running open source Dagster with Github Actions as pipeline runner and Supabase as metadata storage does not cost anything.

## Dashboard

Finally, to the last, but very important part: visualisation, discovery and decision-making. The results of this product would be of little value if they were not easy to explore, learn from and compare. Creating a good dashboard felt like the natural approach, as exploring "with code only", such as in a Jupyter Notebook or the MotherDuck UI, simply does not offer an ideal UI/UX for aiding the research and decision-making process of picking stocks.

### Requirements

- **Great devex for developing.** It should be easy to wrangle data and create useful tables, visualisations and reports. As having a hosted solution was not an absolute requirement, this is the most important requirement. Ideally, it should be quick and easy to transform my ideas -- both good and bad -- into a useful and snappy UI.
- **Great devex for deployment and hosting.** Even though I could run this dashboard from my local computer, it is much nicer to have it hosted and allow it to be reached from any computer or mobile. An, as I will elaborate on in the _Used_ and _Considered_ sections below, different dashboarding "backends" can significantly impact how easy it is to get a snappy dashboard with always up-to-date data up and running.
- **Cheap or free hosting.** If possible, free, but I could spend some pennies for cheap hosting as well (admittedly making the punchline of this post fall flat!).
- **Simple authentication setup.** If I want this hosted, it should at least have some easy-to-enable password log-in. (This is necessary as the Tiingo data has a personal license; otherwise, I wouldn't mind making the dashboard public[^scaling].)

[^scaling]: Ignoring any scaling issues of course, as having many users visiting the site would have used up the free tier compute from MotherDuck in no time.

### Tools and Services

I use [Streamlit](https://streamlit.io/) for creating a dashboard with multiple tabs, spanning contents from historical development together with predictions, the "Stock Picker" and Jupyter Notebook reports. Streamlit has many versatile dashboard components and building blocks with good documentation, which allows me to easily compose a nice dashboard. Using outputs like [dataframe](https://docs.streamlit.io/develop/api-reference/data/st.dataframe) and [altair_chart](https://docs.streamlit.io/develop/api-reference/charts/st.altair_chart), inputs like [date_input](https://docs.streamlit.io/develop/api-reference/widgets/st.date_input) and [multiselect](https://docs.streamlit.io/develop/api-reference/widgets/st.multiselect) and composition elements like [columns](https://docs.streamlit.io/develop/api-reference/layout/st.columns) and [navigation](https://docs.streamlit.io/develop/api-reference/navigation/st.navigation), a decent dashboard for insights and exploration was completed in little time. I use [Streamlit Community Cloud](https://streamlit.io/cloud) for free hosting, which works well. Here is the [code for the dashboard](https://github.com/rasnes/stock-advisor/tree/main/dashboard).

Using the data stored in MotherDuck directly as the data source was surprisingly fast for both reads/lookups and simpler aggregations. See the GIF in the blue expansion below for a demonstration of dashboard usage and its speed when calculating, on the fly, the relative returns for all stocks selected.

{{< admonition type=note title="GIF with dashboard interactivity" open=false >}}

![Streamlit dashboard interactivity](https://github.com/user-attachments/assets/18e94085-06ae-40a0-9cc3-23e079d476d1)

{{< /admonition >}}

{{< admonition type=success title=titles.used open=false >}}

<h3>Streamlit</h3>

Streamlit was not top of my list when starting to explore dashboarding tools. Partly because I've used it before -- and been very happy, by the way -- but also because I was intrigued by the newly released [Observable Framework](https://observablehq.com/framework/). Nevertheless, I ended up coming back to Streamlit; it is simply a great dashboarding tool that ticked off all the requirements listed above.

One of the main criticisms of Streamlit is that it renders the _entire page_ on every user input (such as selecting a menu item). Streamlit doesn't offer "real reactivity," where different page parts can render independently without requiring a full refresh. This is, at the same time, one of the great strengths of Streamlit in my opinion: abstracting away one of the most difficult parts of web development -- frontend state and reactivity -- allows any Python developer to be productive in no time. Streamlit has a single strategy for working around the limitations of this (simple and brilliant) design: caching. And, I've found that caching works effectively for most scenarios, at least the ones I've encountered in this case. With some clever use of caching in the right places, you can get surprisingly snappy dashboards consuming data directly from MotherDuck.

And, to emphasise the primary requirement -- _Great devex for developing_ -- I've never used a dashboarding tool in which I am so productive. Streamlit's built-in components usually provide all the flexibility I need, enabling me to achieve the desired behaviour and user experience quickly. Kudos!

I researched various Streamlit hosting options, but ended up using the free tier offered Streamlit Community Cloud. It was easy to set up CI/CD directly from the public repo, and even with limited compute and memory on the free tier, I haven't experienced any runtime issues. The app is public but requires a password to log in.

<h3>MotherDuck Web UI</h3>

Although not a dashboarding tool, I feel it's worth mentioning. I spent considerable time developing my SQL pipeline and dashboard transformations in the MotherDuck Web UI. While perhaps not as much as locally using VSCode + DuckDB, sometimes it's just convenient to log into a UI and type away to get some immediate results.

The UI is quite basic, reminiscent of Jupyter Notebooks with cells in a notebook. It provides nice summary statistics. I like it, but it certainly feels like early days.

Go MotherDuck, I am cheering for you! 👏

{{< /admonition >}}

{{< admonition type=failure title=titles.considered open=false >}}

<h3>Observable Framework</h3>

Here's a brief review of [Observable Framework](https://observablehq.com/framework/), which I ended up not using. I recommend [Simon Willison's perspectives as an introduction](https://simonwillison.net/2024/Mar/3/interesting-ideas-in-observable-framework/), that post convinced me to try it out. These were the ideas I found most appealing:

- **Everything in the frontend.** Frontend capabilities are rapidly advancing, largely due to WebAssembly combined with powerful consumer hardware. This also applies to DuckDB, which can be compiled to a relatively small [WebAssembly binary](https://duckdb.org/2021/10/29/duckdb-wasm.html). This opens up for building dashboards entirely in the frontend using JavaScript + in-browser DuckDB (or [MotherDuck Wasm](https://motherduck.com/docs/key-tasks/data-apps/wasm-client/)). The advantage? No backend is required, enabling you to build a dashboard as a static website that supports a unlimited number of concurrent users. Since the data and UI are all in the frontend, you can achieve instant reactivity. Imagine a dashboard with near-zero lag on updates (provided you don't overload the Wasm engine with complex queries).
- **Everything in Markdown files.** I'm a big fan of Markdown; it's my preferred medium for both short and long writing (like this blog post). The idea of seamlessly combining text with interactive visualisations, all within the frontend, was very appealing.

However, it turned out that these two main selling points also were Observable Framework's biggest shortcomings.

With everything in the frontend, you need some way to _copy your data_ into the frontend as well. Observable's solution is [data loaders](https://observablehq.com/framework/data-loaders), which allow you to easily fetch data at *deploy time* and package it into the static website. For use with DuckDB, the natural choice for the format to build the data into in the frontend is Parquet (a columnar format that DuckDB has good support reading from). However, this design presents some challenges. Firstly, how do you ensure your data is always up to date? Observable suggests scheduling regular builds (e.g., on GitHub Actions), where you regularly download the latest data and rebuild the site. This sounds simple enough, but it gave me a design headache: to avoid the dashboard's page load time becoming too long (remember the browser must download all data embedded in the static site), I would likely need to use a subset of the Tiingo financial data and CatBoost results. But which data should I exclude? I have 9,000 stocks available, and ideally, I'd like to explore all of them when needed, accessing all relevant data and prediction results, even though I might only check on a few regularly. I found this constraint – what data to exclude – difficult to deal with and an overhead I'd rather skip if possible.

And the simple way to do that, of course, is to load the data from MotherDuck when needed, instead of cramming it all into the frontend (5GB of compressed data is simply too much). But here, the idea of *everything in the frontend* poses another challenge: how can I safely request data from MotherDuck using, say, frontend `fetch`, without exposing my API token? Sure, you may point out that this dashboard is for me only, and then having the token exposed in the frontend would not be a big deal. And you may be right, but it still feels like poor design; if I can avoid security malpractice, I'll try to do so. After pondering on this for while, I concluded that getting data behind authentication at runtime -- in a secure way -- is much simpler with a backend.

Then, to my second gripe with it, related to _everything in Markdown files_. No matter how much of a fan I am of Markdown, I found that writing JavaScript in it was surprisingly difficult. I am no JavaScript wiz, that's for sure, but I like to think that with some help from an IDE/language server I can still get things to work pretty much how I like with some trial and error. But in Markdown, at least at the time I played with Observable Framework (some months ago), I simply got _no_ help from VSCode whatsoever. I did not realise how dependent I've become on both a decent language server and, potentially, some LLM assistance when coding, and how much more productive it makes me. (Yes, I know, Observable Framework can import TypeScript/JavaScript from files, but that doesn't either give me the devex I was hoping for. But this is what I realised quite quickly that I _had_ to do to be even somewhat productive in this tool.)

So, when I switched to Streamlit, it was a revelation how quickly I could get things working as I wanted. This also highlights the advantages of Python over (frontend) JavaScript. I stumble into too many frontend quirks of JavaScript without understanding what's going on, whereas Python in the backend with Streamlit is dead simple.

<h3>NiceGUI</h3>

[NiceGUI](https://nicegui.io/) was also on my shortlist, developed as a [direct response to the limitations of Streamlit](https://nicegui.io/#why) already discussed under _Used_. I didn't end up trying it for this project, as I found Streamlit could quickly accomplish what I needed. However, I did use NiceGUI to build the website for my employer, [Dataist](https://dataist.no/), and I highly recommend it. With high-level components from [Quasar](https://quasar.dev/) and the utility-first approach of [Tailwind CSS](https://quasar.dev/), you can quickly build user interfaces or entire websites that look and behave just fine.

{{< /admonition >}}

### Cost conclusions

The free tier offered by [Streamlit Community Cloud](https://streamlit.io/cloud) has been sufficient[^streamlit-cloud-free], and its automatic deployment from the `main` branch of the [public repo](https://github.com/rasnes/stock-advisor) has worked seamlessly.

[^streamlit-cloud-free]: Streamlit Community Cloud is not fully transparent about the hardware stack one gets when deploying services there; at least, I could not find the reference for it. In some places, I read 1GB of memory, in others a bit more, so I am not entirely sure what you get. It might be dynamic, and that's why they're deliberately not explicit about it, but that's just speculation.

## Conclusions

I did spend a considerable amount of time discovering and researching the tools and services used in the `stock-advisor` stack. In review, that time seems well spent, as I'd say I mostly followed the "happy path" during development, with minimal time spent on detours or pivots. Even the few exploratory paths I took, such as experimenting with [Observable Framework](https://observablehq.com/framework/) and [Malloy](https://www.malloydata.dev/), turned out to be valuable learning experiences.

I am very happy with how the stack turned out and how reliably it operates every day. It strikes a good balance between simplicity and flexibility, making it easy to maintain and extend if needed. However, as I mentioned in [the previous post](/posts/stock-advisor-intro), this product is only possible by "standing on the shoulders of giants" -- the abundance of high-quality, free tools and services available today is just amazing!

Achieving a completely free, zero-cost stack was a goal I wasn't sure I could meet for most of the development process, so I’m especially pleased that it worked out in the end. This project has been a rewarding experience, both in terms of the technical challenges and the satisfaction of creating something functional and cost-effective.

If you have any questions or thoughts, feel free to share them in the comments below -- I’d be happy to discuss or help others explore similar approaches.

