My Personal Stock Advisor

Tiingo -> MotherDuck -> CatBoost -> Streamlit

2025-01-15 15 minutes

Series - Stock Advisor

Contents

Summary

I’ve created a model that predicts the longer-term (1, 2, and 3 years) development of all US stocks relative to the S&P500 index. The model is a CatBoost regression model with uncertainty trained on historical price and fundamentals¹ data from Tiingo. Data ingestion pipelines are written in Go, production data is stored in MotherDuck, and Dagster orchestrates data transformations and the CatBoost training pipeline. The production stack has zero operational cost, with Go code and Dagster running at scheduled intervals on free-tier Github Actions workflows, and data storage and compute requirements are within the free-tier limits offered by MotherDuck².

My take is that the model produces reasonable predictions that may be useful when assessing whether to buy a stock for longer-term investments. However, more interesting than the predictions themselves – given the inherent uncertainty in predicting stock prices – are the interpretations of the model features. These help me understand why a certain stock is recommended (or not), allowing me to quickly assess if a stock seems, for example, overvalued or high risk. In the Streamlit dashboard, I can easily compare stocks’ predictions and feature contributions to identify promising stocks or assess recommendations from other sources.

In this series of posts, I aim to:

describe why I bothered doing this and how I approached the problem (this post).
detail the resulting tech stack (the engineering perspective).
detail the data and machine learning modelling (the data science perspective).

All code is available in the public repository stock-advisor and is free to use under the permissive MIT license. Note that the data from Tiingo is subject to a restrictive personal license, so you’d need to connect the code with your own API key to get started. With the exception of the initial historical data backfill, forking the repository and enabling the same Github Actions I use for scheduled data ingestion and model training runs should be straightforward³.

These posts will serve as the primary documentation for this product and the code in the repository. They represent a snapshot of the stock-advisor product as of January 2025 and may not be updated with new functionality. However, I will try to create dedicated posts (or repository documentation) outlining any major changes.

Motivation

I’ve been on and off the stock market with some small savings for the last 12 years or so. I’ve had the chance to – and been very close to – jump on many to-be good trades (e.g. AAPL in 2015, NVDA in 2017, TSLA in 2021⁴) during this predominantly bull period. However, besides holding some index ETFs⁵ that have produced a decent return, I’ve been missing out on the largest gainers. Why did I not take the chance and commit even when I had a good hunch for a specific stock?

I guess I could break down my reluctance that held me back into three distinct reasons:

I am a person that prefers to do proper research into whatever product (or stock) I may be interested in buying before making a purchase. I don’t know how to read financial statements nor to separate the signal from the noise in the deluge of news articles covering the US stock market. I simply could not properly convince myself that committing to a single stock for the longer term was worth the risk, no matter how good a hunch I had.
It was in a pretty risk-averse period in my life since my savings were there for a single purpose: to be able to buy a larger apartment or house in the not-so-distant future. Hence, my risk appetite was not high, as losing a significant portion of my savings could have had implications for the next stop in the Oslo housing market.
The great book The Intelligent Investor, which I read in the early 2010s, convinced me that beating the index is close to impossible, so I might just as well stick with index ETFs.

In hindsight, of course, I could have had a much larger return on my savings if I had committed long-term to some of the mentioned stocks. Does it feel like a big mistake? Not really, as the risk exposure of saving in ETFs was probably the appropriate risk level at the time, but it has certainly yielded better returns than putting my savings in a high-interest rate savings account. Now, however, I am in a period in my life where I live in the house I was saving for, and, even though I have a sizeable mortgage, I have the privilege to administer some savings on the side that I’d like to try to maximize returns for. I am comfortable taking some higher-risk bets now, accepting the worst-case scenario that my savings could end up depreciating.

So, you might wonder, am I challenging the notion “you can’t beat the index”? Not really. Mathematically, when excluding commission fees, this has to be correct on average. I do, however, believe one can take higher risks in the stock market, resulting in larger fluctuations “around average”. So, I believe that with a combination of shrewd analysis, some hypotheses that might turn out to become true, and a bit of luck, you might end up catching some of the large gainers (5-10x over some years). And, one should try to keep emotions out of the mix as far as possible, which is easier said than done⁶.

Therefore, I need some tool or some advisor or something that could help me make qualified, risk-considerate assessments of stocks. Being a “recovered data scientist”, I started considering and investigating if I could make something like that on my own. Both because there was a slim chance of making something useful but also as a nice side project to work on. My ambition was not to make a tool that I’d blindly trust, but rather in combination with other advisory services, like Motley Fool, I could be able to make more objective and balanced assessments when picking stocks and building a portfolio.

I started on this project in the spring of 2024, and I quickly estimated that it would take me at least 100 hours of work or more, at least when including the time spent on researching and evaluating which tools and services to use.

Do I find picking stocks fun?

Not really. Ideally, I want to spend my time on other things. So, part of the motivation for building this product is to automate more of the stock-picking process, minimising the time I spend on it. Pick some carefully selected long-term investments, stick with them for years, and only check in occasionally.

Can it be done within time, cost and data quality constraints?

To take on a project like this, I naturally had to start with investigating if it is feasible within my time and quality constraints and my hobbyist budget⁷.

Data requirements

Primary requirement: high quality, historical and continuously updated financial data.
Secondary requirement: it should be free or at least relatively cheap.

My stance is that I’d rather pay something for high quality data over getting mediocre data for free. I simply don’t have time to clean data, or worse, deal with hard-to-detect model bugs that are caused by data inconsistencies. The data needs to comprise historical price data of daily frequency + fundamentals data in an easy-to-work-with, standardised format (e.g. not parsing PDFs). Ideally, I would have liked to obtain such data for all European stock markets (in particular the Norwegian one), but I quickly realized that I should be content if I managed to get US stocks data within said requirements.

Stack requirements

I wanted a stack that fetches and stores data and trains the prediction model(s) regularly – with minimal maintenance – and a simple and flexible dashboard as the serving layer. And, ideally, with zero operational costs. Requirements were along these lines:

Runner. Some sort of server or VM that could run simple workloads at scheduled intervals. Workloads include fetching financial data from an API and ingesting it into some storage or a database, transforming that data into a format suitable for training a statistical model, and training that statistical model and storing the results⁸.
Storage. Some blob storage or database that could store financial data and model results.
Visualization and insights. Some dashboarding tool that could display the results and allow for easy investigation and comparison of stocks’ predictions and feature interpretations.

Model requirements

The modelling framework of choice should have the following characteristics:

Offer a regression model. There is very little signal in historical stock prices, so modelling this data via time series frameworks was dismissed immediately. I need fundamentals data as model features, else I am certain that the model predictions will not be worth much.
Be fast to train. As the model is to be retrained regularly on a free or cheap VM, it is important that training time is not excessive.
Produce accurate results with minimal feature engineering and tuning. The less feature engineering and knobs to tune, the less work and maintenance to get a decent model up and running.
Include uncertainty intervals for predictions. As the stock market is impossible to predict accurately, uncertainty intervals per prediction helps for assessing a model’s confidence for its maximum likelihood estimate. This is very useful when making long-term stock purchase decisions.
Feature interpretations. The goal was not to create a model that could automate trading for me, i.e., some sort of trading bot. Instead, the goal was to create a model that could produce valuable, interpretable results that could help me become more confident when making long-term bets on certain stocks.

I considered Bayesian methods for a couple of minutes, but quickly dismissed that idea due to requirements 2 and 3 above. Bayesian methods tend to be a lot of fuss and experimentation with assumptions and priors and to be compute intensive.

My thoughts quickly drifted towards CatBoost, which I knew had an implementation of the NGBoost algorithm which enables decent uncertainty estimates for little compute. CatBoost also has the other nice properties of being fast to train, easily handles categorical variables and missing values, does not require feature scaling or other preprocessing, and can produce interpretations via e.g. feature importances and SHAP values.

Feasibility assessment

I started thinking about this project in the spring of 2024, and I quickly realized that implementing this product properly⁹ would easily take more than 100 hours, even with LLM-assisted coding that has made me much more productive for (certain) coding tasks.

One thing in particular that I didn’t like was that I would need to invest a significant amount of time in fetching and transforming the required data before being able to test if my idea for a “stock-advisor” model was worth anything more than reading tea leaves. Ideally, I would have wanted to test if the model had any merit to it as early as possible, and if so, only then would I have bothered to spend dozens of hours getting the data pipelines and transformations up and running properly. This time, in contrast, I needed to take a leap of faith and jump off the waterfall and hope that the model would produce some valuable output. I ended up giving it a shot, and – puh! – the model results came out useful enough 🙌

I had several days where I concluded that this project was probably not worth it. Nevertheless, I came back to it and managed to pull it through. Which I am very happy about today.

The product materializing

When developing complex products like this, the path towards completion is never straightforward. Agile workflows come naturally in such a process: make a plan for progress, develop and test, pivot or possibly rewind if X did not turn out to be a good product fit. In review, I did not take that many detours or dead ends, so I felt pretty productivity along the way working towards to final product vision.

The data source

As all good models start with good data, getting high-quality historical data from a reliable API at a reasonable cost was a strict requirement. After a bit of research, I found Tiingo. It had a reasonable pricing model for both end-of-day and fundamentals data for US stocks¹⁰ with a personal license. I tested their API a bit, they have a generous enough free tier to get started with assessing the data formats and quality, which in turn made me optimistic about using – and paying – for data from Tiingo. Besides a backfill for historical financial statements data which required a single month’s subscription at $50, the running price of a Tiingo subscription with pricing and fundamentals data is $40 per month (or $400 annually). Not an insignificant amount, but having spent a lot of time cleaning data previously in my career, I decided that I could live with this cost as long as the data quality and reliability of Tiingo’s endpoints was high. In hindsight, besides some small quirks, I’ve been very happy with Tiingo both with respect to API reliability, data quality, and customer service¹¹.

The stack

After some research and testing, I ended up with these components:

Runner. I pondered a bit on this one, even considering bringing my old stationary back from hibernation. However, I discovered that Github offers a generous free tier for Github Action VMs if the repository is public, with as much usage as you’d like of a VM with 4 vCPUs and 16 GB memory. That should be plenty. Github Actions supports scheduled runs, and for pipelines requiring some orchestration, I could simply run the entire pipeline service on Github Actions as well. Making the repository public was an easy decision, as there is nothing proprietary or IP worth thinking about (the only overhead it created was that I needed to be careful not leaking any privately licensed Tiingo data via commits or so).
Storage. This one turned out to be easy. I prefer an OLAP database for this kind of workload and I’ve become a huge fan of DuckDB. But I needed something hosted, and while managing a DuckDB file reliably via some blob storage provider could have worked, I gave MotherDuck a try. Their free tier has served me very well, and it turned out that developing with a local DuckDB database instance locally and simply switching to MotherDuck for stage and production workloads is just a dream to work with!
Visualization and insights. The initial plan for the dashboarding tool was to use Observable Framework. It is a nice idea to use data loaders and to put everything (data and code) in the frontend. Even though I found that design enticing, I pivoted to using Streamlit instead after testing it a bit (more about this pivot in the second post). The Streamlit dashboard is hosted on Streamlit Community Cloud, which has worked perfectly fine with its free tier offering.

For the Extract and Load from Tiingo to MotherDuck I coded the solution in Go. I’ve used Python for orchestration via Dagster, training of CatBoost models, and creating the Streamlit dashboard. SQL was used for data modelling and data transformations. For more about these choices, see the second post.

The model

As already mentioned, the model I had in mind when developing this project was a CatBoost regression model with the RMSEWithUncertainty loss-function. I tested some other gradient boosted trees models, like XGBoost and LightGBM, but they did not seem to produce more accurate predictions, were more fiddly to work with and do not produce uncertainty intervals out of the box. I ended up only using CatBoost, and it has served me well.

Product summarized in a flowchart

I am happy with the stack, and I’ve so far had no reliability issues with it. On that note, I need to give credit to all the tools – and the people behind them – for the myriad of incredible open-source and free tools that are available. This product is standing on the shoulders of giants, it would not have been possible to complete if it were not for the great tools that an engineer can easily pick and stitch together these days. If I were to give myself a pat on the back, it would be for finding the right tools and abstractions and being able to stitch them together reliably in a relatively short period of time.

https://github.com/user-attachments/assets/a1c1c2d4-3995-4366-a0ea-1b79855b2216 — The stock-advisor stack

This is not another model based solely on technical analysis or time series data. Instead, it aligns more with “Warren Buffett-style” value investing, where metrics from companies’ quarterly statements are the most important features. ↩︎
10 compute unit hours per month + 10 GB of storage as of January 2025. ↩︎
Minimal familiarity with Go, Python, and Github Actions is recommended, but a good LLM could get a non-techie pretty far as well, I guess. ↩︎
Of course, buying a stock at what later seems like a great time doesn’t mean that I would have managed to HODL. ↩︎
Exchange Traded Fund, essentially a fund that you can buy and sell in the same way as a stock on an exchange. ↩︎
My best tip for this is to NOT check your stocks frequently. Instead, make a confident bet and stick with it like an ostrich hiding their head in the sand. ↩︎
A Bloomberg Terminal famously costs about $20,000 per user per year, which is, of course, way beyond what I could meaningfully spend to obtain financial data as a hobby investor. Financial data, at least targeted toward professionals, tends to be expensive and big business. ↩︎
Even though one could possibly run all of this on a private laptop, it is not a good choice for something that should work reliably every day for many years. A manual, human-in-the-loop setup like this would break on regular intervals and quickly turn into a maintenance nightmare. ↩︎
Something that will run reliably for a decade with high reliability and minimal interventions and maintenance. Besides having the possibility to occasionally fix bugs and add new features, I am hoping that maintenance is limited to some rare, manual version bumps of Python, Go, and dependencies. ↩︎
I could not find anything similar for Norwegian or European stock data; if you’re aware of a site or service offering data like this, please let me know in a comment below. ↩︎
To be clear, I have no affiliation with Tiingo, I found them “organically” when doing research for this project. ↩︎