The Model Explained
Introduction
All models are wrong, but some are useful.
George Box
Before diving into the model’s details, let’s state the obvious: it’s no panacea for picking US stocks. The model is potentially useful for a certain, limited set of investment strategies.
Specifically, it’s designed to aid decision-making/stock picking for longer-term strategies. It is not well-suited for automated algorithmic trading or short-term investment recommendations. The investment horizon should be years rather than weeks or months. For more on the motivation behind this model, see the first post in this series. The model is retrained twice a week; see the second post for engineering considerations and more on data management and pipelines.
Concretely, there are currently three distinct aspects that I use when picking stocks:
- The predicted uncertainty for the index-relative returns for each stock. I deliberately list this metric before the predicted mean itself, as I find it so important. It gives me good indications of how risky an asset is, and makes it easy to identify predictions that shouldn’t be trusted. In general, the predicted uncertainty is high for most stocks.1
- The predicted index-relative returns for 1, 2 and 3-year investment horizons. Since this metric is impossible to predict with much confidence, I tend to bin predictions into three distinct categories:
  - Significantly above 1. The relatively few stocks in this category may warrant extra consideration. But each stock with a high predicted score should be treated and assessed carefully. I don’t blindly trust these recommendations.
  - At around 1. Predictions in this category range from the model being “fairly confident that this stock will perform similarly to the index” (low uncertainty) to it being “very uncertain about the stock; it could go either way” (high uncertainty).
  - Significantly below 1. “Historically, stocks with these features tend to underperform the broad index.” This doesn’t rule out the stock, but it’s valuable input to my decision-making, as the model could be flagging a high-risk asset.
- Each prediction’s SHAP values. When investigating and making decisions, a key question is: why does this stock get its observed mean and uncertainty predictions? Reviewing the most important “prediction drivers” and comparing with competing candidates is a natural part of this process. The model produces “objective” explanatory metrics – the SHAP values – for each prediction, which allows for assessing prediction quality and identifying key features.
The aim of this post is to outline how the data and regression modelling were performed – complemented by the thinking and assumptions made along the way – and then assess whether the results are useful. I’ve tried to critically assess the model results and list ideas for improvement, but please comment below if you find any critical flaws or have suggestions.
Modelling approach
In brief, this model could be summarised as:
The machine learning model itself is simple, more or less a “stock” CatBoost regression model with uncertainty estimates. However, it’s trained on (seemingly) high-quality historical data that’s carefully transformed to avoid look-ahead bias and fit a regression prediction scenario.
This section is explained “top down”. I’ll start by showing what the model and transformed data look like during training. Then, I’ll outline the most important steps to get the data into that shape, linking to relevant code in the repository. Finally, I will conclude by highlighting the key underlying assumptions.
Training data composition
The target variable
In this model, the target variable, y, is the natural logarithm of the ratio between a given stock’s relative development and the S&P 500 index’s relative development, over the given prediction horizon.
Here are some examples:
- For a given prediction horizon (e.g. 12 months), if stock FOO returns 0.8 (-20%) and the index returns 1.1 (+10%), the index-adjusted return for this stock is $\ln(0.8/1.1)=-0.32$.
- For the same horizon, if stock BAR returns 1.5 (+50%) and the index returns 1.1 (+10%), the index-adjusted return for this stock is $\ln(1.5/1.1)=0.31$.
Transforming the ratio to its logarithm makes sense, as it aligns the distribution of y much more closely with the Normal distribution assumption made by the CatBoostRegressor with the RMSEWithUncertainty loss function. Here’s the DuckDB macro for this calculation, and here it’s applied2 to all stocks for the different time periods.
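For illustration, such a macro could look roughly like this – a sketch with made-up names, run here through DuckDB’s Python client; the linked repository code is the authoritative version:

```python
import duckdb

con = duckdb.connect()

# A sketch of what such a macro could look like (not the repository's actual macro).
con.execute("""
    CREATE MACRO ln_index_relative(stock_return, index_return) AS
        ln(stock_return / index_return);
""")

# The FOO and BAR examples from the text.
print(con.execute("""
    SELECT ln_index_relative(0.8, 1.1) AS foo_12m,   -- ~ -0.32
           ln_index_relative(1.5, 1.1) AS bar_12m    -- ~  0.31
""").fetchall())
```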
The feature variables
Instead of writing too much, I’ll just show the data used for training – how it looks just before being fed to the CatBoost model. Due to Tiingo’s licensing constraints, these training examples contain fake data (random values in approximately the correct orders of magnitude). Here they are:
This is data from 2020 onwards for four stocks: AAPL (Apple), MSFT (Microsoft), CAT (Caterpillar) and GS (Goldman Sachs). The full training set contains data back to 1995 for about 6600 stocks (including delisted stocks). The data has about 75 columns of feature variables:
- date and ticker: The stock and date for which the features on the right apply. These aren’t included during model training, but are included here for context.
- y_ln_12m, y_ln_24m, y_ln_36m: The target variables to predict. The three prediction horizons – 12, 24 and 36 months – are trained as three individual models. As you can see, there are many empty fields, which is natural, as we don’t have the value for y_ln_36m for AAPL on 2024-11-18; this is the predicted value we’re interested in. When training the 36-month model, data from the most recent three years is removed from the training set.
- balanceSheet_*, cashFlow_*, incomeStatement_*: These are the quarterly statements fundamentals data, prefixed by their statementType. I initially modelled with all statements data included, but removed quite a few with seemingly little predictive value.
- Some daily fundamentals data, including peRatio, pbRatio and operating_efficiency.
- Some descriptive categorical data, including sector (e.g., Technology), industry (e.g., Consumer Electronics), and location (e.g., California; USA).
- tech_*: Finally, some technical indicators, like moving averages and historical volatility.
If you’re curious about the different quarterly statement fields, I’ve included Tiingo’s documentation in a table:
In short, I’m not familiar with many of the model’s features. My thinking is more “I’ll let the model figure it out.”
As you might have noticed, there are only four entries/rows per stock per year. This is essentially one per quarterly statement, and that’s no coincidence:
- The quarterly statements are assumed to be the most important model features (and obviously only change four times a year).
- Much of each stock’s data is highly correlated from day to day, week to week, and even month to month. As a simple measure to decorrelate the data, I simply drop the rows in between statement dates.
- Fewer examples mean faster training, which is nice for devex and avoids long training in the scheduled GitHub Actions workflows.
In total, there are about 300,000 training examples, from 1995 to today, for 6600 different stocks.
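To make the setup concrete, here is a minimal, self-contained sketch of training such a model – synthetic data and made-up column names, not the repository’s actual training code:

```python
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor, Pool

# Tiny synthetic stand-in for the real training frame (hypothetical columns).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "sector": rng.choice(["Technology", "Industrials", "Financials"], n),
    "peRatio": rng.normal(25, 10, n),
    "pbRatio": rng.normal(5, 2, n),
    "tech_volatility_12m": rng.normal(0.3, 0.1, n),
})
y = rng.normal(0.0, 0.5, n)  # log index-relative return, as described above

train_pool = Pool(X[:400], y[:400], cat_features=["sector"])
eval_pool = Pool(X[400:], y[400:], cat_features=["sector"])

model = CatBoostRegressor(
    loss_function="RMSEWithUncertainty",  # each prediction is a (mean, variance) pair
    iterations=200,
    learning_rate=0.05,
    verbose=False,
)
model.fit(train_pool, eval_set=eval_pool)

preds = model.predict(Pool(X[400:], cat_features=["sector"]))
mean, std = preds[:, 0], np.sqrt(preds[:, 1])  # column 0 = mean, column 1 = variance
```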
Data transformations
To get from raw data (Tiingo’s “CSV schema” is more or less ingested into Motherduck as is) to the training data exemplified above, four concrete transformation steps are performed. Most of what happens there is straightforward and “self-documented” in the code3. However, at least one step requires more detailed explanation and could benefit from scrutiny: how I join pricing data with quarterly statements data.
When joining these datasets – stock prices and quarterly statements – it’s vital to avoid look-ahead bias. We need to join each stock’s pricing data with statements data only from the day the quarterly statement was published. Joining price data with statements data before publication would be cheating, as the stock’s price wouldn’t yet have “priced in” the statement’s contents.
Here’s what I did. Tiingo’s financial statements provide the fiscal dates for each company. Each company must file its quarterly statements with the SEC4 within 45 days of the fiscal date. I don’t care much about joining price and statements data at the exact publication date (I simply don’t have that data point); as long as I’m sure the quarterly statement was released on or a few days before the join date, I’m happy enough. This is a conservative approach, as some companies might file before the 45-day deadline, but it should guarantee that the statement information is “priced in” at the join date. In code, I add 45 days to the fiscal date and run an as-of join to avoid look-ahead bias.
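To illustrate the idea – with made-up table and column names, not the repository’s actual transformation – here’s a self-contained DuckDB sketch of the 45-day shift plus as-of join:

```python
import duckdb

con = duckdb.connect()

# Hypothetical tables standing in for the real price and statements data.
con.execute("""
    CREATE TABLE prices AS
        SELECT * FROM (VALUES
            ('AAPL', DATE '2024-02-20', 181.6),
            ('AAPL', DATE '2024-05-20', 191.0)
        ) AS t(ticker, date, close);
""")
con.execute("""
    CREATE TABLE statements AS
        SELECT * FROM (VALUES
            ('AAPL', DATE '2023-12-30', 2.18),
            ('AAPL', DATE '2024-03-30', 1.53)
        ) AS t(ticker, fiscal_date, eps);
""")

# Shift each statement 45 days forward (the SEC filing deadline), then take,
# for every price row, the latest statement that is guaranteed to be public.
result = con.execute("""
    WITH shifted AS (
        SELECT ticker,
               fiscal_date + INTERVAL 45 DAY AS assumed_public_date,
               eps
        FROM statements
    )
    SELECT p.ticker, p.date, p.close, s.eps, s.assumed_public_date
    FROM prices p
    ASOF JOIN shifted s
        ON p.ticker = s.ticker
       AND p.date >= s.assumed_public_date
""").df()
print(result)
```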
To autoregress or not
As shown above, the training data is transformed to fit a regression scenario where all examples (rows) are assumed to be independent and identically distributed. This isn’t strictly true, but it seems close enough for adequate results. Another way to frame this is that all examples learn from each other, meaning that when making future predictions we’re essentially asking the model:
Given a stock’s current input data, how did stocks with similar features perform historically?
Of course, the model does more, as it should have learned to generalize and be able to make reasonable predictions for unseen data too. But clear recommendations (with low uncertainty) are likely because the predicted stock’s features are somewhat familiar to the model.
As predicting a stock’s future is nearly impossible, can we hope for useful results from such a model? It’s still an open question, but as discussed below, the model seems to detect some signal for some stocks. The large uncertainty estimates, also discussed below, suggest a low signal-to-noise ratio for this prediction scenario.
As a stock’s price is obviously time series data, why deliberately not model this as a time series problem? There are several reasons:
- A stock’s historical price development has very little predictive value. An entire discipline called “technical analysis” aims to predict a stock’s development solely from its recent historical price. I’m not a believer, but what do I know?
- The most important features aren’t autoregressive, but fundamental data combined with company-specific data (industry, sector, location, etc.). Even though models like dynamic regression allow inclusion of both autoregressive and “regular” features, I’ve found them hard to work with compared to a gradient-boosted tree model like CatBoost. Given how little predictive value historic price contains, I decided it was easier to choose a “regular” regression model5 and include some derived “technical features” from the historical price data.
- By predicting index-relative returns instead of stock prices, historical prices have less predictive power. Even though a stock’s absolute value (its price) three months ago is usually predictive of its price today, this data point isn’t very valuable for predicting its index-relative return.
- The model is designed to learn from all examples, i.e., all other stocks, to identify buy or sell signals. This contrasts with technical methods, where one stock’s data is used to predict its continued trajectory.
However, I do include several “technical features” based on each stock’s history, like longer-term price moving averages, price variance and volume variance (a rough sketch of how such features can be computed follows the list below). The rationale here is:
- Historical variances should provide predictive value to (at least) the uncertainty predictions, and possibly suggest high future volatility.
- Even though I’m sceptical about historical development predicting a stock’s future, I believe somewhat in momentum. There are two reasons. First, if a stock is mispriced, it takes time to reach its “correct” price; there’s a lag. Second, there’s human psychology. Until fear grips the markets, humans tend to think that something going up will continue to go up. This effect is reinforcing: if enough people believe a stock should continue to rise, that might be enough to keep driving its price up. This is my personal belief and may be controversial; the underlying cause (of momentum) shouldn’t matter anyway. Regardless, there’s little harm in including some longer-term simple moving averages as model features. If these features contain no signal, the model should figure that out.
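Here’s the sketch referred to above – hypothetical feature names, windows and definitions, not the actual ones used by the model:

```python
import pandas as pd

def technical_features(close: pd.Series, index_close: pd.Series,
                       volume: pd.Series) -> pd.DataFrame:
    """Rough sketch of the kind of technical features described above."""
    daily_ret = close.pct_change()
    feats = pd.DataFrame(index=close.index)
    # Historical volatility: rolling standard deviation of daily returns.
    feats["tech_volatility_250d"] = daily_ret.rolling(250).std()
    # Volume variance over a trailing year.
    feats["tech_volume_var_250d"] = volume.rolling(250).var()
    # "Relative SMA development": how the stock sits versus its own SMA,
    # compared with how the index sits versus its SMA (one possible definition).
    stock_vs_sma = close / close.rolling(250).mean()
    index_vs_sma = index_close / index_close.rolling(250).mean()
    feats["tech_rel_sma_12m"] = stock_vs_sma / index_vs_sma
    return feats
```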
Model evaluation
Test set performance
During training, I split the data randomly into a training set, an evaluation set, and a small test set used only after training to evaluate the model on unseen data. As the RMSE is computed on log-transformed targets (a “log RMSE”), it’s a bit hard to interpret, but I’ll try my best.
The test set log RMSE values – pretty stable on each training run – are 0.47, 0.56 and 0.59 for the 12m, 24m, and 36m models respectively. Let’s break that down:
A log RMSE of 0.47 (12m model) means the average prediction error corresponds to:
$$e^{0.47} \approx 1.60$$
This implies predictions deviate from actual index-adjusted returns by ~60% in multiplicative terms. For the 36m model (RMSE 0.59), this grows to ~80% deviation. In other words, the actual index-relative returns typically differ from the predictions by a factor between 1.6 and 1.8 (for 12m and 36m predictions, respectively). If a stock actually delivers 2× the index return, the model’s predictions would typically fall between 1.25× (2/1.6) and 3.2× (2×1.6) relative performance, though individual predictions can deviate even further.
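The same arithmetic for all three horizons:

$$e^{0.47} \approx 1.60, \qquad e^{0.56} \approx 1.75, \qquad e^{0.59} \approx 1.80$$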
This sounds like a high error rate, and it is! However, I don’t think it renders the model results useless; there’s still some signal to be obtained. Here’s why:
Horizon vs accuracy pattern. The decreasing RMSE with shorter horizons (0.59 → 0.56 → 0.47) aligns with expectations – shorter-term predictions generally have less uncertainty, suggesting the model captures some time-dependent signal.
RMSE is sensitive to outliers. For uncertainty estimates, I have to use the RMSEWithUncertainty loss function in the CatBoostRegressor, which is sensitive to extreme outliers. Given stocks’ possible wild fluctuations, a few outliers (stocks tanking or going +5x) can significantly impact the test set log RMSE values. A loss function less sensitive to outliers could possibly improve the test set evaluation metric results significantly.

The values of the predicted index-relative returns don’t matter. Even though this model is framed as a regression model (i.e., predict the exact index-relative return), the results are used in a “trichotomy”:
- Is the stock predicted to perform better than the index?
- Is the stock predicted to perform about as well as the index?
- Is the stock predicted to perform worse than the index?
Since predicting a stock’s actual development is impossible anyway, we care about signals that can aid decision-making. This three-level categorisation, combined with other information sources, helps objectively assess a stock’s potential and risk. I might consider buying stocks in all three categories, but I’d be wary that the stocks in the last category come with a “high risk” stamp from the model.
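As a sketch of how such a trichotomy could be derived from the model’s log-space output (the ±10% threshold is a made-up illustration, not what I actually use):

```python
import numpy as np

def categorise(mean_log: float, threshold: float = 0.1) -> str:
    """Bin a predicted log index-relative return into the three buckets above."""
    ratio = np.exp(mean_log)  # back from log space to an index-relative ratio
    if ratio > 1 + threshold:
        return "significantly above index"
    if ratio < 1 - threshold:
        return "significantly below index"
    return "around index"

# Example: a prediction of +0.25 in log space is a ratio of ~1.28.
print(categorise(0.25))   # significantly above index
print(categorise(-0.02))  # around index
```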
Even though the average log RMSE is high, uncertainty isn’t high for all predictions. Some stocks have significantly lower predicted uncertainty estimates than others, indicating the model is more confident. This is great for decision-making, helping us identify worthless predictions (very high uncertainty estimates) and possible investment targets.
Negative prediction power. Creating a model that picks only stock winners is impossible, but it might be possible to create a model that helps decide which stocks to avoid – or at least label as high risk. If you’re evaluating stocks in the Motley Fool’s Top 10 stocks for a given month, for example, the model could help in flagging high-risk assets that I then may be more cautious about committing to.
SHAP values and feature importances provide value in themselves. Being able to drill down to assess why predictions are what they are can be insightful, for both seemingly reasonable and “way off” predictions.
To wrap up, here’s a log RMSE classification scheme for stocks proposed by the brilliant Deepseek R1 model. The classification makes intuitive sense, but, when asked for sources, the model wasn’t able to find anything. So, take it with a pinch of salt; it may be completely made up to fit my prompt (parts of this blog post). At least R1 and I seem to agree that the model captures more than just noise 😁
For stock prediction models:
- <0.4 log RMSE would indicate strong predictive power
- 0.4-0.6 suggests moderate signal detection
- >0.6 implies mostly noise capture
Your results (0.47-0.59) sit in the moderate range, suggesting:
- The model identifies some predictive patterns
- Significant unexplained variance remains (expected in equity markets)
- Fundamental data contains partial signal about future performance
Prediction examples
Above is a screenshot from the Streamlit dashboard with results as of 5 February 2025. The predicted means are the round dots, and the bands represent model-estimated Normal distribution quantiles: the thicker uncertainty band corresponds to 1σ (~0.68 probability) and the thinner band to 2σ (~0.95 probability), with different colours representing different stock tickers (a sketch of how such bands follow from the model output comes after the list below). Things worth noticing:
- The model is only bullish on Apple (AAPL) and Lam Research (LRCX). It has relatively low uncertainty for both, but Apple is one of the few stocks consistently predicted to outperform the index by an entire standard deviation above 1.
- Alphabet (GOOGL), Marvell (MRVL), and Pure Storage (PSTG) land in the neutral category; they’re predicted to fare approximately similarly to the index. However, they have markedly different uncertainty, with Alphabet having the least, i.e., the model is confident that Alphabet will perform similarly to the index. Pure Storage has wide uncertainty bands, which makes intuitive sense; the company is much smaller and the probability of larger index-deviating fluctuations is higher. Marvell sits between them regarding uncertainty, but should be considered rather risky (the 12m and 24m models have much higher uncertainty than the 36m model, indicating a volatile stock in the short term but perhaps less so long term).
- It’s bearish on Nvidia (NVDA) and Reddit (RDDT). It seems rather confident (narrow uncertainty bands) that Nvidia is overpriced, but I wouldn’t read too much into this. However, running the model without the last 4 years of data (only up to 2021) produced a clear recommendation for Nvidia in 2021 (I wish I had this model running back then 😀). Regarding Reddit – which I love and use daily – I wouldn’t put much emphasis on the prediction; the uncertainty bands are so large that the model is basically admitting it has no clue.
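Here’s the sketch referred to above of how such bands follow from the model’s log-space mean/variance output (illustrative only; the dashboard code may differ):

```python
import numpy as np

def index_relative_bands(mean_log: float, variance: float):
    """Turn a (mean, variance) prediction in log space into 1-sigma and
    2-sigma index-relative bands like those drawn in the dashboard-style plot."""
    std = np.sqrt(variance)
    bands = {}
    for k in (1, 2):
        bands[f"{k}sigma"] = (np.exp(mean_log - k * std), np.exp(mean_log + k * std))
    return np.exp(mean_log), bands

# Example: mean 0.1 and variance 0.09 (std 0.3) in log space.
point, bands = index_relative_bands(0.1, 0.09)
print(point)            # ~1.11x the index
print(bands["1sigma"])  # roughly (0.82, 1.49)
```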
Feature importance (SHAP analysis)
Below is a so-called beeswarm plot of the 36-month prediction horizon model’s SHAP values. There are many insights to be made from this plot; I’ll just highlight a few.
- The top four most important predictors are industry, book value per share, location, and price-to-book ratio6.
- Book value per share shows some surprising results: lower values contribute positively while higher values contribute negatively to the index-relative return estimates. This seems to contradict traditional value investing principles, and I’m unsure why. It could reflect the US market’s consistent preference for companies with lighter asset structures and growth potential over asset-heavy businesses throughout the analysed period (1995-present).
- High volatility is usually considered bad. Interestingly, because of CatBoost’s nonlinearity, low volatility sometimes contributes negatively to a stock’s predicted development. The model claims that for certain stocks, given all the other feature values, low volatility might not be so good. Also note that the grey dots for this metric – indicating a missing value, which happens for recently listed stocks – are usually interpreted as positive.
- Low enterprise value is generally positive.
- The P/E ratio seems like a metric with a sweet spot; it shouldn’t be too high or too low. The same can be said for earnings per share diluted (incomeStatement_epsDil).
- The “relative SMA development”, a feature I created to capture momentum by comparing a stock’s SMA (simple moving average) to the S&P 500 SMA, yields different predictive behaviour depending on the lookback period. For the 12-month SMA, the model finds that a low trajectory compared to the index is generally good, whereas for the 36-month lookback it seems to be the opposite: an SMA that develops well compared to the index SMA contributes to more positive predictions (capturing momentum?).
Feature attribution (SHAP values per prediction)
Below is a screenshot of the Stock Picker module in the dashboard, where I can easily compare stocks’ prediction results. I’ve shaded some feature values due to the data’s personal license.
On top is a summary table with the stocks picked for comparison; below are three tables with each stock prediction’s SHAP values, in descending absolute SHAP value order.
SHAP values represent the contribution of a feature to the difference between the actual prediction and the average prediction. For example, for the 36m prediction horizon, the largest contributions to Nvidia’s prediction (relative to the average prediction) come from the features ev_to_sales, overview_bvps and overview_roa.
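For per-prediction attribution, CatBoost can compute SHAP values directly. A rough sketch, reusing the hypothetical model and X from the training sketch earlier; with RMSEWithUncertainty the exact output shape may differ, hence the defensive reshaping:

```python
import numpy as np
import pandas as pd
from catboost import Pool

# Reusing `model` and `X` from the training sketch earlier in the post.
shap_matrix = np.asarray(
    model.get_feature_importance(Pool(X, cat_features=["sector"]), type="ShapValues")
)
# For a single-output model the shape is (n_rows, n_features + 1), where the last
# column is the expected (average) prediction. With RMSEWithUncertainty there may be
# an extra output dimension; if so, select the mean component first (assumed layout).
if shap_matrix.ndim == 3:
    shap_matrix = shap_matrix[:, 0, :]

contributions = pd.DataFrame(shap_matrix[:, :-1], columns=X.columns, index=X.index)

# Top "prediction drivers" for one stock, sorted by absolute SHAP value.
row = contributions.iloc[0]
print(row.reindex(row.abs().sort_values(ascending=False).index).head(10))
```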
Critique
Predictions are inaccurate and uncertainty is high
As discussed above, the model’s prediction accuracy is not high. Accordingly, the estimated uncertainties around the model’s predicted means (most likely estimates) are usually very high. In one sense I’m content that they are indeed large, as low uncertainty when predicting future stock gains would be a model smell (too good to be true). On the other hand, the uncertainty is currently so high – for many stocks – that it raises a question: are these results valuable at all? If the model tends to predict that a stock will perform on par with the index – say between 0.9 and 1.1 – with much uncertainty, are such results useful for decision-making?
In the model’s defence, the uncertainty varies quite a lot, which is a sign of a healthy model. It’s rare for the model to combine a high probability of beating the index with relatively low uncertainty (standard deviation at ~0.3), but there are some examples (e.g., Apple and Lam Research in December 2024). The SHAP values for a stock’s prediction can also hint at why its uncertainty is low or high (SHAP values mainly measure attribution to the mean prediction, but they still give clues about what drives high uncertainty).
Using only quarterly financial statements
Does it make sense to predict a stock’s development 1, 2 or 3 years ahead from quarterly statements? The model’s fundamentals data is just a snapshot from the latest available quarterly financial report. There’s no historical context, like the trajectory of key metrics in recent years, which could be relevant. It might be possible to add features for this, such as data from the latest annual statement or simple moving averages of key metrics from past quarterly statements. The risk is adding more redundant features and making the training examples more correlated, so it might not add much predictive value.
No backtesting
I haven’t backtested the model results, like simulating how a portfolio of the top 5 stock picks would have performed 3 years ahead. If you think I’m putting too much confidence in a model that hasn’t been backtested, you have a point. However:
- The model has been evaluated on an unseen randomly selected test set. That random test set should contain test data for all years from 1995 until today minus the prediction horizon. So, the model has arguably already been tested on historical data.
- Backtesting would require an automated buy-and-sell strategy – building something on top of the model results that they weren’t intended for. The ambition is not to perform automated trades, but to aid manual, long-term stock decisions.
Regardless, I think I could have spent more time evaluating historical results, to better understand the model’s strengths and weaknesses. While testing on randomly selected historical data helps check basic accuracy, simulating real-world use – where models make predictions year-by-year using only past data – could better assess performance over time.
Ideas for improvement
LLM agent on top of data and model results
The model results are pretty coarse and require careful evaluation to be useful. “Sprinkling” some AI on top could help automate that process. LLM agents could, for example, include web search results and data from Tiingo’s news API to augment analyses and recommendations, producing things like “This month’s top 10 picks, given criteria X, Y and Z”, where X, Y and Z could be provided with prompts. Creating something useful and trustworthy like this would probably not be trivial, especially while keeping costs at zero and not breaching licensing terms7.
Feature engineering
One could always come up with ideas for more features or refine existing ones. I’m sure there are some – unknown to me – really good features. However, I also think that adding more features will show diminishing returns, so it might not be worth spending a lot of time on. Predicting stock trajectories will operate in the low signal-to-noise realm regardless of how many great features the model is trained on. Below is one concrete idea I’ve been thinking about.
Add historical news sentiments as model features
Tiingo provides a seemingly rich set of historical news articles per ticker in my subscription. These could be run through LLMs for sentiment analysis, with a prompt like “On a scale from 1 to 5, how positive is this article for buying stock X long term?”, supplemented with some examples and enforcing structured outputs. The results could be added to the model alongside existing fundamental and technical features, as a one-off job for historical articles plus a weekly batch job on GitHub Actions to keep the data up to date.
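A rough sketch of what such a scoring step could look like – the llm_score function is a stand-in for whichever LLM client would be used, and the prompt and schema are illustrative only:

```python
from dataclasses import dataclass

PROMPT_TEMPLATE = (
    "On a scale from 1 to 5, how positive is this article for buying stock {ticker} "
    "long term? Answer with a single integer.\n\nArticle:\n{article}"
)

def llm_score(prompt: str) -> str:
    """Stand-in for a structured-output call to an LLM provider (hypothetical)."""
    return "3"  # dummy neutral score so the sketch runs end to end

@dataclass
class SentimentScore:
    ticker: str
    published: str
    score: int  # 1 (very negative) .. 5 (very positive)

def score_article(ticker: str, published: str, article: str) -> SentimentScore:
    raw = llm_score(PROMPT_TEMPLATE.format(ticker=ticker, article=article))
    return SentimentScore(ticker=ticker, published=published, score=int(raw))

print(score_article("AAPL", "2025-02-05", "Apple posts record services revenue ..."))
```

Scores like these could then be aggregated per ticker and quarter before being joined onto the training frame like any other feature.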
Decay weighting and removing irregular times
Arguably, older financial data, let’s say prior to 2010, is less valuable for predicting today’s stock trajectories. A decay function for modelling a recency bias would be simple to add with the set_weight parameter.
Training data during irregular times, e.g., around the dot-com bubble and the 2008 crash, should possibly be removed or downweighted. Even though the model predicts index-adjusted returns – which should mitigate some fluctuations during times of fear – these were times of high volatility and many stocks tanked completely. It’s questionable how valuable training examples from these periods are, as the data is so noisy.
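A sketch of what such weighting could look like – the half-life and downweighting factor are made-up example values – applied via CatBoost’s Pool.set_weight:

```python
import numpy as np
import pandas as pd

def recency_weights(dates: pd.Series, half_life_years: float = 10.0) -> np.ndarray:
    """Exponential-decay sample weights, with irregular periods downweighted further."""
    dates = pd.to_datetime(dates)
    age_years = (pd.Timestamp.today() - dates).dt.days / 365.25
    weights = 0.5 ** (age_years / half_life_years)
    # Downweight known irregular periods (dot-com bubble, 2008 crash).
    irregular = dates.dt.year.isin([2000, 2001, 2008, 2009])
    return np.where(irregular, weights * 0.5, weights)

# e.g. train_pool.set_weight(recency_weights(train_df["date"]))  # hypothetical names
```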
Conclusions
So much talk and no concrete outcomes! I can’t round off without revealing whether I’ve made any purchase decisions based on the model results. I plan to run this product for years and jump on good trades if/when they appear; I’m trying to stay calm and not rush into decisions I’ll find hard to stick with. My portfolio is still in the making. However, I’ve bought into three stocks:
- Lam Research (LRCX): This is the only stock purchased mainly because it was recommended by the model. It’s predicted to outperform the index in all three prediction horizons, and has relatively low uncertainty (standard deviation ~0.3). Comparing it with other similar stocks and assessing its SHAP values, the model seems to be suggesting it is undervalued.
- Pure Storage (PSTG): This tip came from other sources, but I used the model to assess its viability. The model puts it in the “perform about the same as the index” category with not too high uncertainty. This, plus reading up on the company, convinced me to try to hold this longer term.
- Marvell (MRVL): This tip came from other sources and has a notably high P/E ratio. Still, the model placed it in the “perform about the same as the index” category, which made me more confident that it isn’t too risky while still having a big potential upside.
Admittedly, it’s an AI-biased “portfolio”, so I’m hoping the AI bubble still has some way to go before bursting. I’ve committed long term and will try to sit steady for years and see how it goes. I might add more stocks if I find promising candidates.
Remember: If it goes well, the model shouldn’t get all the credit; if it goes poorly, it shouldn’t get all the blame. Either way, it will be exciting to see how it goes!
Throughout this text, I might use “uncertainty” and “standard deviation” interchangeably, but they refer to the same thing. Strictly, uncertainty is modelled as a Normal distribution, where standard deviation measures “how wide” the uncertainty distribution is. ↩︎
Note that the index, or the SPY ETF, is repeated for all stocks to simplify this calculation. ↩︎
To the extent SQL can be considered readable, of course. Sorry. ↩︎
The Securities and Exchange Commission. ↩︎
That is fast to train, can detect non-linear relationships and provides uncertainty estimates. ↩︎
I possibly should have excluded industry when including sicIndustry, but considering that they don’t contain the same values and that boosted tree models like CatBoost tend to be good at handling redundant features, I decided to leave it in. ↩︎
I’m unsure if sharing data with a privacy-respecting LLM provider via Openrouter would breach Tiingo’s personal license; I’d have to investigate. ↩︎