# The Model Explained


## Introduction

> All models are wrong, but some are useful.
>
> _George Box_

Before diving into the model's details, let's state the obvious: it's no panacea for picking US stocks. The model is potentially _useful_ for a certain, limited set of investment strategies.

Specifically, it's designed to aid decision-making/stock picking for longer-term strategies. It is _not_ well-suited for automated algorithmic trading or short-term investment recommendations. The investment horizon should be years rather than weeks or months. For more on the motivation behind this model, see the [first post](/posts/stock-advisor-intro) in this series. The model is retrained twice a week; see the [second post](/posts/stock-advisor-stack) for engineering considerations and more on data management and pipelines.

Concretely, there are currently three distinct aspects that I use when picking stocks:

1.  The predicted uncertainty for the index-relative returns for each stock. I deliberately list this metric before the predicted mean itself, as I find it so important. It gives me good indications of how risky an asset is, and makes it easy to identify predictions that shouldn't be trusted. In general, the predicted uncertainty is high for most stocks.[^uncertainty]
2.  The predicted index-relative returns for 1, 2 and 3-year investment horizons. Since this metric is impossible to predict with much confidence, I tend to bin predictions into three distinct categories:
    1.  Significantly above 1. The relatively few stocks in this category may warrant extra consideration. But each stock with a high predicted score should be treated and assessed carefully. I don't blindly trust these recommendations.
    2.  At around 1. Predictions in this category range from the model being "fairly confident that this stock will perform similarly to the index" (low uncertainty) to it being "very uncertain about the stock; it could go either way" (high uncertainty).
    3.  Significantly below 1. "Historically, stocks with these features tend to underperform the broad index." This doesn't rule out the stock, but it's valuable input to my decision-making, as the model could be flagging a high-risk asset.
3.  Each prediction's SHAP values. When investigating and making decisions, a key question is: _why_ does this stock get its observed mean and uncertainty predictions? Reviewing the most important "prediction drivers" and comparing with competing candidates is a natural part of this process. The model produces "objective" explanatory metrics -- the [SHAP values](https://christophm.github.io/interpretable-ml-book/shap.html) -- for each prediction, which allows for assessing prediction quality and identifying key features.

The aim of this post is to outline _how_ the data and regression modelling were performed -- complemented by the thinking and assumptions made along the way -- and then assess whether the results are useful. I've tried to critically assess the model results and list ideas for improvement, but please comment below if you find any critical flaws or have suggestions.

[^uncertainty]: Throughout this text, I use "uncertainty" and "standard deviation" interchangeably; they refer to the same thing. Strictly, uncertainty is modelled as a Normal distribution, where the standard deviation measures "how wide" that distribution is.

## Modelling approach

In brief, this model could be summarised as:

The machine learning model itself is simple, more or less a "stock" CatBoost regression model with uncertainty estimates. However, it's trained on (seemingly) high-quality historical data that's carefully transformed to avoid [look-ahead bias](https://analyzingalpha.com/look-ahead-bias) and fit a regression prediction scenario.

This section is explained "top down". I'll start by showing _what_ the model and transformed data look like during training. Then, I'll outline the most important steps to get the data into that shape, linking to relevant code in [the repository](https://github.com/rasnes/stock-advisor). Finally, I will conclude by highlighting the key underlying assumptions.

### Training data composition

<h4>The target variable</h4>

In this model, the target variable, `y`, is the natural logarithm of the ratio between a given stock's relative development and the S&P 500 index's relative development over the given prediction horizon.

Here are some examples:

- For a prediction horizon (e.g. 12 months), if stock `FOO` returns 0.8 (-20%) and the index returns 1.1 (+10%), the index-adjusted return for this stock is $\ln(0.8/1.1) \approx -0.32$.
- For the same prediction horizon, if stock `BAR` returns 1.5 (+50%) and the index returns 1.1 (+10%), the index-adjusted return for this stock is $\ln(1.5/1.1) \approx 0.31$.

Taking the logarithm of this ratio makes sense, as it brings the distribution of `y` much closer to the Normal distribution assumed by the [CatBoostRegressor](https://catboost.ai/docs/en/concepts/python-reference_catboostregressor) with the [RMSEWithUncertainty](https://catboost.ai/docs/en/references/uncertainty) loss function. [Here's the DuckDB macro](https://github.com/rasnes/stock-advisor/blob/06aa210c2e8a71e6002dbc0f80448abb33743016/transformations/src/sql/macros.sql#L26) for this calculation, and [here it's applied](https://github.com/rasnes/stock-advisor/blob/06aa210c2e8a71e6002dbc0f80448abb33743016/transformations/src/sql/4_excess_returns.sql#L20-L23)[^SPY] to all stocks for the different time periods.
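
For readers who prefer Python to SQL, here is a minimal sketch of the same calculation. It is not the DuckDB macro linked above; the function name `index_relative_log_return` is made up for illustration, and the inputs are the gross returns from the `FOO` and `BAR` examples.

```python
import math


def index_relative_log_return(stock_return: float, index_return: float) -> float:
    """Natural log of the stock's gross return divided by the index's gross return."""
    if stock_return <= 0 or index_return <= 0:
        raise ValueError("gross returns must be positive")
    return math.log(stock_return / index_return)


foo = index_relative_log_return(0.8, 1.1)  # stock -20%, index +10%
bar = index_relative_log_return(1.5, 1.1)  # stock +50%, index +10%
print(round(foo, 2), round(bar, 2))

# math.exp() undoes the transform and recovers the plain ratio.
print(round(math.exp(foo), 3), round(math.exp(bar), 3))
```

A value of 0 means the stock matched the index; positive values mean it beat the index, negative values that it lagged.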

[^SPY]: Note that the index, or the `SPY` ETF, is repeated for all stocks to simplify this calculation.

<h4>The feature variables</h4>

Instead of writing too much, I'll just show the data used for training -- how it looks just before being fed to the CatBoost model. Due to Tiingo's licensing constraints, these training examples contain fake data (random values in approximately the correct orders of magnitude). Here they are:

{{< tabulator-table
    id="train-table-fake"
    height="500px"
    hozAlign="right"
    csv="/csv/train-table-fake.csv">}}

This is data from 2020 onwards for four stocks: `AAPL` (Apple), `MSFT` (Microsoft), `CAT` (Caterpillar) and `GS` (Goldman Sachs). The full training set contains data back to 1995 for about 6600 stocks (including delisted stocks). The data has about 75 columns of feature variables:

- `date` and `ticker`: The stock and date for which the features on the right apply. These aren't included during model training, but are included here for context.
- `y_ln_12m`, `y_ln_24m`, `y_ln_36m`: The target variable to predict. The three prediction horizons -- 12, 24 and 36 months -- are trained as three individual models. As you can see, there are many empty fields, which is natural, as we don't have the value for `y_ln_36m` for `AAPL` on `2024-11-18`; this is the predicted value we're interested in. When training the 36-month model, rows from the most recent three years are removed from the training set, since their target isn't known yet.
- `balanceSheet_*`, `cashFlow_*`, `incomeStatement_*`: These are the quarterly statements fundamentals data, with a `statementType` prefix. I initially modelled with all statements data included, but removed quite a few with seemingly little predictive value.
- Some daily fundamentals data, including `peRatio`, `pbRatio`, `operating_efficiency`.
- Some descriptive categorical data, including `sector` (e.g., _Technology_), `industry` (e.g., _Consumer Electronics_), and `location` (e.g., _California; USA_).
- `tech_*`: Finally, some technical indicators, like moving averages and historical volatility.

If you're curious about the different quarterly statement fields, I've included Tiingo's documentation in a table:

{{< tabulator-table
    id="tiingo_definitions"
    height="400px"
    hozAlign="left"
    wrapText="description"
    csv="/csv/tiingo_definitions_reordered.csv">}}

In short, I'm not familiar with many of the model's features. My thinking is more "I'll let the model figure it out."

As you might have noticed, there are only _four_ entries/rows per stock per year. This is essentially one per quarterly statement, and that's no coincidence:

- The quarterly statements are assumed to be the most important model features (and obviously only change four times a year).
- Much of each stock's data is highly correlated, per day, week, and even month. As a simple measure to decorrelate the data, I just drop the rows in between.
- Fewer examples mean faster training, which is nice for devex and avoids long training in the scheduled GitHub Actions workflows.

In total, there are about 300,000 training examples, from 1995 to today, for 6600 different stocks.

### Data transformations

To get from raw data (Tiingo's "CSV schema" is more or less ingested to Motherduck as is) to the training data exemplified above, [4 concrete transformation steps](https://github.com/rasnes/stock-advisor/blob/06aa210c2e8a71e6002dbc0f80448abb33743016/transformations/src/sql) are performed. Most of the stuff happening there is straightforward and "self-documented" in the code[^sql_readable]. However, at least one step requires more detailed explanation and could benefit from scrutiny: how I join pricing data with quarterly statements data.

When joining these datasets -- stock prices and quarterly statements -- it's vital to avoid [look-ahead bias](https://analyzingalpha.com/look-ahead-bias). We need to join each stock's pricing data with statements data **only from the day the quarterly statement was published**. Joining price data with statements data before publication would be cheating, as the stock's price wouldn't yet have "priced in" the statement's contents.

Here's what I did. Tiingo's financial statements provide the _fiscal dates_ for each company. Each company must file their quarterly statements with the SEC[^sec] _within_ 45 days of the fiscal date. I don't care much about joining price and statements data at the exact publication date (I simply don't have that data point), but as long as I'm sure the quarterly statement was released on or a few days before the join date, I'm happy enough. This is a conservative approach, as some companies might file before the 45-day deadline. But it should guarantee that the statement information is "priced in" at the join date. In code, I add 45 days to the fiscal date and run an [as of join](https://github.com/rasnes/stock-advisor/blob/06aa210c2e8a71e6002dbc0f80448abb33743016/transformations/src/sql/1_wide_statements.sql#L101) to avoid look-ahead bias.
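
To make the join logic concrete, here is a small pure-Python sketch of an as-of lookup with the 45-day lag. It is not the SQL from the repository: the `statements` data, the `eps` field and the function name `asof_statement` are hypothetical, and it assumes the statements are already sorted by fiscal date.

```python
from bisect import bisect_right
from datetime import date, timedelta

FILING_LAG = timedelta(days=45)  # filing deadline used in the text

# Hypothetical statements for one ticker, sorted by fiscal date.
statements = [
    (date(2023, 3, 31), {"eps": 1.0}),
    (date(2023, 6, 30), {"eps": 1.2}),
]


def asof_statement(price_date, statements):
    """Latest statement assumed public on price_date, or None if there is none."""
    # A statement only becomes usable 45 days after its fiscal date.
    available_from = [fiscal + FILING_LAG for fiscal, _ in statements]
    idx = bisect_right(available_from, price_date) - 1
    return statements[idx][1] if idx >= 0 else None


for day in (date(2023, 5, 14), date(2023, 5, 15), date(2023, 8, 14)):
    print(day, asof_statement(day, statements))
```

The day before the lag expires, the price row still sees the previous statement (or nothing at all), which is the behaviour that keeps look-ahead bias out of the training data.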

[^sec]: The Securities and Exchange Commission.
[^sql_readable]: To the extent SQL can be considered readable, of course. Sorry.

### To autoregress or not

As shown above, the training data is transformed to fit a regression scenario where all examples (rows) are assumed to be [independent and identically distributed](https://xgboosting.com/xgboost-assumes-data-is-iid-i.i.d./). This isn't strictly true, but it seems close enough for adequate results. Another way to frame this is that _all examples learn from each other_, meaning that when making future predictions we're essentially asking the model:

> Given a stock's current input data, how did stocks with similar features perform historically?

Of course, the model does more, as it should have learned to generalize and be able to make reasonable predictions for unseen data too. But clear recommendations (with low uncertainty) are likely because the predicted stock's features are somewhat familiar to the model.

As predicting a stock's future is nearly impossible, can we hope for useful results from such a model? It's still an open question, but as [discussed below](#test-set-performance), the model seems to detect some signal for some stocks. The large uncertainty estimates, also [discussed below](#predictions-are-inaccurate-and-uncertainty-is-high), suggest a low signal-to-noise ratio for this prediction scenario.

As a stock's price is obviously time series data, why deliberately _not_ model this as a time series problem? There are several reasons:

- A stock's historical price development has very little predictive value. An entire discipline called "technical analysis" aims to predict a stock's development solely from its recent historical price. I'm not a believer, but what do I know?
- The most important features aren't [autoregressive](https://en.wikipedia.org/wiki/Autoregressive_model), but fundamental data combined with company-specific data (industry, sector, location, etc.). Even though models like [dynamic regression](https://otexts.com/fpp3/dynamic.html) allow inclusion of both autoregressive and "regular" features, I've found them hard to work with compared to a gradient-boosted tree model like CatBoost. Given how _little_ predictive value historic price contains, I decided it was easier to choose a "regular" regression model[^catboost] and include some derived "technical features" from the historical price data.
- By predicting index-relative returns instead of stock prices, historical prices have less predictive power. Even though a stock's _absolute value_ (its price) three months ago is usually predictive of its price today, this data point isn't very valuable for predicting its index-relative return.
- The model is designed to learn from all examples, i.e., all other stocks, to identify buy or sell signals. This contrasts with technical methods, where one stock's data is used to predict its continued trajectory.

However, I do include several "technical features" based on each stock's history, like longer-term price moving averages, price variance and volume variance. The rationale here is:

- Historical variances should provide predictive value to (at least) the uncertainty predictions, and possibly suggest high future volatility.
- Even though I'm sceptical about historical development predicting a stock's future, I believe somewhat in _momentum_. There are two reasons. First, if a stock is mispriced, it takes time to [reach its "correct" price](https://en.wikipedia.org/wiki/Efficient-market_hypothesis); there's a lag. Second, there's human psychology. Until fear grips the markets, humans tend to think that something going up will continue to go up. This effect is reinforcing: if enough people _believe_ a stock should continue to rise, that might be enough to keep driving its price up. This is my personal belief and may be controversial; the underlying cause (of momentum) shouldn't matter anyway. Regardless, there's little harm in including some longer-term simple moving averages as model features. If these features contain no signal, the model should figure that out.

[^catboost]: That is fast to train, can detect non-linear relationships and provides uncertainty estimates.

## Model evaluation

### Test set performance

During training, I split the data randomly into a training set, an evaluation set, and a small test set used only after training to evaluate the model on unseen data. As the RMSE is a logarithmic value (log RMSE), it's a bit hard to interpret, but I'll try my best.

The test set log RMSE values -- pretty stable on each training run -- are 0.47, 0.56 and 0.59 for the 12m, 24m, and 36m models respectively. Let's break that down:

A log RMSE of **0.47** (12m model) means the average prediction error corresponds to:

$$e^{0.47} \approx 1.60$$

This implies predictions deviate from actual index-adjusted returns by **~60%** in multiplicative terms. For the 36m model (RMSE 0.59), this grows to **~80% deviation**. In other words, the actual index-relative returns typically differ from the predictions by a factor between 1.6 and 1.8 (for 12m and 36m predictions, respectively). If a stock actually delivers 2× the index return, the model's predictions would typically fall between 1.25× (2/1.6) and 3.2× (2×1.6) relative performance, though individual predictions can deviate even further.
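
The arithmetic above is easy to reproduce. This is just a sketch that uses the three log RMSE values quoted in the text; the helper name `rmse_factor` is mine.

```python
import math


def rmse_factor(log_rmse: float) -> float:
    """Convert a log-space RMSE into a multiplicative error factor."""
    return math.exp(log_rmse)


for horizon, log_rmse in {"12m": 0.47, "24m": 0.56, "36m": 0.59}.items():
    print(horizon, round(rmse_factor(log_rmse), 2))

# Typical prediction range for a stock that actually delivered 2x the index,
# using the 12m model's error factor.
actual = 2.0
factor = rmse_factor(0.47)
low, high = actual / factor, actual * factor
print(round(low, 2), round(high, 2))
```

Because the error is multiplicative, the range is asymmetric around the actual value: there is more room above 2× than below it.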

{{< admonition type=example title="Test set log RMSE from training runs (new random test set each time)" open=false >}}

**12m:** <img width="896" alt="Image" src="https://github.com/user-attachments/assets/e120ee4b-055d-4dae-b6fa-8a3f441cfdcc" />

**24m:** <img width="899" alt="Image" src="https://github.com/user-attachments/assets/aec1cec3-26db-4dd1-9425-551770704c03" />

**36m:** <img width="900" alt="Image" src="https://github.com/user-attachments/assets/3d58ef3b-c876-404e-b4d1-ae9764f870cf" />

{{< /admonition >}}

This sounds like a high error rate, and it is! However, I don't think it renders the model results useless; there's still some signal to be obtained. Here's why:

- **Horizon vs accuracy pattern.** The decreasing RMSE with shorter horizons (0.59 → 0.56 → 0.47) aligns with expectations -- shorter-term predictions generally have less uncertainty, suggesting the model captures some time-dependent signal.

- **RMSE is sensitive to outliers.** For uncertainty estimates, I _have to_ use the `RMSEWithUncertainty` loss function in the `CatBoostRegressor`, which is sensitive to extreme outliers. Given stocks' possible wild fluctuations, a few outliers (stocks tanking or going +5x) can significantly impact the test set log RMSE values. A loss function less sensitive to outliers could possibly improve the test set evaluation metric results significantly.

- **The _values_ of the predicted index-relative returns don't matter.** Even though this model is framed as a regression model (i.e., predict the exact index-relative return), the results are used in a "trichotomy":

  1. Is the stock predicted to perform better than the index?
  2. Is the stock predicted to perform about as well as the index?
  3. Is the stock predicted to perform worse than the index?

  Since predicting a stock's actual development is impossible anyway, we care about signals that can aid decision-making. This three-level categorisation, combined with other information sources, helps objectively assess a stock's potential and risk. I might consider buying stocks in all three categories, but I'd be wary that the stocks in the last category come with a "high risk" stamp from the model.

- **Even though the _average_ log RMSE is high, uncertainty isn't high for _all_ predictions.** Some stocks have significantly lower predicted uncertainty estimates than others, indicating the model is more confident. This is great for decision-making, helping us identify worthless predictions (very high uncertainty estimates) and possible investment targets.

- **Negative prediction power.** Creating a model that picks only stock winners is impossible, but it might be possible to create a model that helps decide which stocks to avoid -- or at least label as high risk. If I'm evaluating stocks in the Motley Fool's Top 10 stocks for a given month, for example, the model could help flag high-risk assets that I may then be more cautious about committing to.

- **SHAP values and feature importances provide value in themselves.** Being able to drill down to assess _why_ predictions are what they are can be insightful, for both seemingly reasonable and "way off" predictions.
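
The three-way reading of predictions described in the list above can be sketched in a few lines. Treat everything here as an assumption: the cut-offs 0.9 and 1.1 and the 0.3 uncertainty limit are placeholders for illustration, not values taken from the model or the dashboard, and `categorise` is a hypothetical helper.

```python
def categorise(predicted_ratio, sigma, lower=0.9, upper=1.1, max_sigma=0.3):
    """Return (category, uncertainty_label) for one prediction.

    predicted_ratio is the predicted index-relative return (1.0 = same as index);
    sigma is the predicted standard deviation. All thresholds are placeholders.
    """
    if predicted_ratio > upper:
        category = "above index"
    elif predicted_ratio < lower:
        category = "below index"
    else:
        category = "about index"
    uncertainty = "low" if sigma <= max_sigma else "high"
    return category, uncertainty


print(categorise(1.3, 0.25))
print(categorise(1.0, 0.8))
print(categorise(0.7, 0.3))
```

Keeping the uncertainty label next to the category matters most for the middle bucket, where "about index" with low uncertainty and "about index" with high uncertainty mean very different things.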

To wrap up, here's a log RMSE classification scheme for stocks proposed by the brilliant Deepseek R1 model. The classification makes intuitive sense, **but** when asked for sources, R1 wasn't able to find any. So, take it with a pinch of salt; it may be completely made up to fit my prompt (parts of this blog post). R1 and I at least seem to agree that the model captures more than just noise 😁

{{< admonition type=quote title="Deepseek R1's made-up log RMSE classification for stock predictions." >}}

For stock prediction models:

- **<0.4 log RMSE** would indicate strong predictive power
- **0.4-0.6** suggests moderate signal detection
- **>0.6** implies mostly noise capture

Your results (**0.47-0.59**) sit in the moderate range, suggesting:

1. The model identifies some predictive patterns
2. Significant unexplained variance remains (expected in equity markets)
3. Fundamental data contains partial signal about future performance

{{< /admonition >}}

### Prediction examples

![Prediction examples](https://github.com/user-attachments/assets/ce80f9a6-126e-424e-a068-0587758bc79b)

Above is a screenshot from the Streamlit dashboard with results as of 5 February 2025. The predicted means are the round dots, and bands represent model-estimated Normal distribution quantiles. The thicker uncertainty band corresponds to 1σ ([~0.68 probability](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule)), and the thinner band corresponds to 2σ (~0.95 probability), with different colours representing different stock tickers. Things worth noticing:

- The model is only bullish on Apple ([AAPL](https://finance.yahoo.com/quote/AAPL/)) and Lam Research ([LRCX](https://finance.yahoo.com/quote/LRCX/)). It has relatively low uncertainty for both, but Apple is one of the few stocks consistently predicted to outperform the index by an entire standard deviation above 1.
- Alphabet ([GOOGL](https://finance.yahoo.com/quote/GOOGL/)), Marvell ([MRVL](https://finance.yahoo.com/quote/MRVL/)), and Pure Storage ([PSTG](https://finance.yahoo.com/quote/PSTG/)) land in the neutral category; they're predicted to fare approximately similarly to the index. However, they have markedly different uncertainty, with Alphabet having the least uncertainty, i.e., the model is confident that Alphabet will perform similarly to the index. Pure Storage has wide uncertainty bands, which makes intuitive sense; the company is much smaller and the probability of larger index-deviating fluctuations is higher. Marvell sits between them regarding uncertainty, but should be considered rather risky (the 12m and 24m models have much higher uncertainty than the 36m model, indicating a volatile stock in the short term but perhaps less so long term).
- It's bearish on Nvidia ([NVDA](https://finance.yahoo.com/quote/NVDA/)) and Reddit ([RDDT](https://finance.yahoo.com/quote/RDDT/)). It seems rather confident (narrow uncertainty bands) that Nvidia is overpriced, but I wouldn't put too much into this. However, running the model without the last 4 years of data (only up to 2021) produced a clear _recommendation_ for Nvidia in 2021 (I wish I had this model running back then 😀). Regarding Reddit -- which I love and use daily -- I wouldn't put much emphasis on the prediction. The uncertainty bands are so large that the model is basically admitting it has no clue.
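
The band probabilities quoted above follow directly from the Normal assumption, and a short sketch shows how such bands could be computed. Note the assumption: I'm treating the model output as a mean and standard deviation in log space and exponentiating to get back to index-relative ratios. Whether the dashboard does exactly this is not stated in this post, and `ratio_bands` is a hypothetical helper.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Probability mass inside 1 and 2 standard deviations of the mean.
p_1sigma = std_normal.cdf(1) - std_normal.cdf(-1)
p_2sigma = std_normal.cdf(2) - std_normal.cdf(-2)
print(round(p_1sigma, 4), round(p_2sigma, 4))


def ratio_bands(mu_log, sigma_log, k):
    """(low, high) index-relative ratios for a k-sigma band around a log-space mean."""
    return math.exp(mu_log - k * sigma_log), math.exp(mu_log + k * sigma_log)


# Hypothetical prediction: on par with the index, standard deviation 0.3.
print(ratio_bands(0.0, 0.3, k=1))
print(ratio_bands(0.0, 0.3, k=2))
```

Under this assumption the bands are symmetric in log space but skewed in ratio space, so the upper edge sits further from 1 than the lower edge.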

### Feature importance (SHAP analysis)

Below is a so-called [beeswarm plot](https://christophm.github.io/interpretable-ml-book/shap.html#shap-summary-plot) of the 36-month prediction horizon model's SHAP values. There are many insights to be made from this plot; I'll just highlight a few.

![Model SHAP values](https://github.com/user-attachments/assets/c263e28d-e746-4abf-8877-eba8c0078f7c)

- The top four most important predictors are industry, book value per share, location, and price-to-book ratio[^industry].
- Book value per share shows some surprising results: lower values contribute positively while higher values contribute negatively to the index-relative return estimates. This seems to contradict traditional value investing principles, and I'm unsure why. It could reflect the US market's consistent preference for companies with lighter asset structures and growth potential over asset-heavy businesses throughout the analysed period (1995-present).
- High volatility is usually considered bad. Interestingly, because of CatBoost's nonlinearity, low volatility sometimes contributes negatively to a stock's predicted development. The model claims that for certain stocks, given all the other feature values, low volatility might not be so good. Also note the grey dots for this metric: they indicate a missing value, which happens for recently listed stocks, and the model usually interprets it as positive.
- Low enterprise value is generally positive.
- The P/E ratio seems like a metric with a sweet spot; it shouldn't be too high or too low. The same can be said for earnings per share diluted (incomeStatement_epsDil).
- The "relative SMA development", a feature I created to catch momentum by comparing a stock's SMA (simple moving average) to the S&P 500 SMA, yields different predictive behaviour depending on the lookback period. For 12-month SMA, the model finds that a low trajectory compared to the index is generally good, whereas for the 36-month lookback it seems the opposite, that an SMA that performs well compared to the index SMA contributes to more positive predictions (capturing momentum?).

[^industry]: I possibly should have excluded `industry` when including `sicIndustry`, but considering that they don't contain the _same_ values and that boosted tree models like CatBoost tend to be good at handling redundant features, I decided to leave it in.

### Feature attribution (SHAP values per prediction)

Below is a screenshot of the `Stock Picker` module in the dashboard, where I can easily compare stocks' prediction results. I've shaded some feature values due to the data's personal license.

On top is a summary table with the stocks picked for comparison; below are three tables with each stock prediction's [SHAP values](https://christophm.github.io/interpretable-ml-book/shap.html), in descending absolute SHAP value order.

![Stock SHAP values](https://github.com/user-attachments/assets/5b28bf2e-4221-4f60-8a91-52247c293576)

SHAP values represent _the contribution of a feature to the difference between the actual prediction and the average prediction_. For example, Nvidia for the 36m prediction horizon has the most contribution, compared to the average prediction, from the features `ev_to_sales`, `overview_bvps` and `overview_roa`.
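
That definition implies a handy sanity check: the average prediction plus the sum of a row's SHAP values should reproduce the row's prediction. The sketch below illustrates this with made-up numbers; the feature names come from this post, but the values are hypothetical and not real model output.

```python
# Hypothetical numbers for illustration only, not real model output.
base_value = 0.02  # average prediction over the training data (log space)
shap_values = {
    "ev_to_sales": -0.12,
    "overview_bvps": 0.08,
    "overview_roa": -0.05,
    "peRatio": 0.01,
}

# SHAP values are additive: base value + contributions = this row's prediction.
prediction = base_value + sum(shap_values.values())
print(round(prediction, 2))

# Rank features the way the dashboard tables do: by absolute contribution.
ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ranked:
    print(f"{name:>15} {value:+.2f}")
```

Sorting by absolute value puts the strongest drivers first regardless of direction, so a large negative contribution is not buried below small positive ones.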

## Critique

### Predictions are inaccurate and uncertainty is high

As [discussed above](#test-set-performance), the model's prediction accuracy is not high. Accordingly, the estimated uncertainty around the model's predicted mean (most likely estimate) is usually _very high_. In one sense I am content that it is indeed large, as low uncertainty for predicting future stock gains would be a model smell (too good to be true). On the other hand, the uncertainty is currently so high -- for many stocks -- that it raises a question: are these results valuable at all? If the model tends to predict that a stock will perform on par with the index -- say between 0.9 and 1.1 -- with much uncertainty, are such results useful for decision-making?

In the model's defence, the uncertainty varies quite a lot, which is a sign of a healthy model. It's rare for the model to combine a high probability of beating the index _and_ relatively low uncertainty (standard deviation at ~0.3), but there are some examples (e.g., Apple and Lam Research in December 2024). There are possible insights from the stock prediction's SHAP values highlighting _why_ a given stock has low/high uncertainty (SHAP values mainly measure attribution to the mean prediction, but still give clues about what causes high uncertainty).

### Using only quarterly financial statements

Does it make sense to predict a stock's future development 1, 2, 3 years ahead from _quarterly_ statements? The model's fundamentals data is just a snapshot from the latest available _quarterly_ financial report. There's no historical context, like the trajectory of key metrics in recent years, which could be relevant. It might be possible to add features for this, including data from the latest _annual_ statement or calculating simple moving averages for key metrics from past quarterly statements. The risk is that this adds more redundant features and makes the training examples more correlated, so it might not add much predictive value.

### No backtesting

I haven't backtested the model results, like simulating how a portfolio of the top 5 stock picks would have performed 3 years ahead. If you think I'm putting too much confidence in a model that hasn't been backtested, you have a point. However:

- The model has been evaluated on an unseen randomly selected test set. That random test set should contain test data for all years from 1995 until today minus the prediction horizon. So, the model has arguably already been tested on historical data.
- Backtesting would require creating an automated buy and sell strategy, creating something on top of the model results that they weren't intended for. The ambition is not to perform automated trades, but to aid manual, long-term stock decisions.

Regardless, I think I could have spent more time evaluating historical results, to better understand the model's strengths and weaknesses. While testing on randomly selected historical data helps check basic accuracy, simulating real-world use -- where models make predictions year-by-year using only past data -- could better assess performance over time.
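
A year-by-year evaluation like that could start from a split generator along these lines. This is only a sketch of the idea, not something implemented in the repository: `walk_forward_splits` and its parameters are hypothetical, and it works at year granularity for simplicity.

```python
def walk_forward_splits(years, horizon_years, first_test_year):
    """Yield (train_years, test_year) pairs that only use past information.

    A training row from year y has a target that is realised during year
    y + horizon_years, so it is only allowed when that year lies strictly
    before the test year.
    """
    years = list(years)
    for test_year in years:
        if test_year < first_test_year:
            continue
        train_years = [y for y in years if y + horizon_years < test_year]
        if train_years:
            yield train_years, test_year


splits = list(walk_forward_splits(range(1995, 2022), horizon_years=3, first_test_year=2018))
for train_years, test_year in splits:
    print(f"train {train_years[0]}-{train_years[-1]} -> test {test_year}")
```

Each fold would retrain the model on `train_years` and score it on `test_year`, which mimics how the model is actually used: predicting forward with nothing but data available at the time.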

## Ideas for improvement

### LLM agent on top of data and model results

The model results are pretty coarse and require careful evaluation to be useful. "Sprinkling" some AI on top could help automate that process. LLM agents could, for example, include web search results and data from Tiingo's news API to augment analyses and recommendations, producing things like "This month's top 10 picks, given criteria X, Y and Z", where X, Y and Z could be provided with prompts. Creating something useful and trustworthy like this would probably not be trivial, especially while keeping costs at zero and not breaching licensing terms[^openrouter].

[^openrouter]: I'm unsure if sharing data to a privacy-respecting LLM provider via Openrouter would breach Tiingo's personal license; I'd have to investigate.

### Feature engineering

One could always come up with ideas for more features or refine existing ones. I'm sure there are some -- unknown to me -- really good features. However, I also think that adding more features will show diminishing returns, so it might not be worth spending a lot of time on. Predicting stock trajectories will operate in the low signal-to-noise realm regardless of how many great features the model is trained on. Below is one concrete idea I've been thinking about.

<h4>Add historical news sentiments as model features</h4>

Tiingo provides a seemingly rich set of historical news articles per ticker in my subscription. These could be run through LLMs for sentiment analysis, with a prompt like "On a scale from 1 to 5, how positive is this article for buying stock X long term?" supplemented with some examples and enforcing structured outputs. These results could be added to the model alongside existing fundamental and technical features, as a one-off job for historical articles and a weekly batch job on GitHub Actions for keeping data up to date.

### Decay weighting and remove irregular times

Arguably, older financial data, let's say prior to 2010, is less valuable for predicting today's stock trajectories. A decay function for modelling a recency bias would be simple to add by passing per-example weights through the [set_weight](https://catboost.ai/docs/en/concepts/python-reference_pool_set_weight) method.
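
As a rough illustration, an exponential decay with a half-life is one simple way to produce such weights. The 10-year half-life and the function name `recency_weight` are assumptions chosen for the example; the resulting list is the kind of per-row weight vector one would hand to CatBoost's `Pool`.

```python
from datetime import date


def recency_weight(example_date, today, half_life_years=10.0):
    """Weight that halves for every half_life_years of age (1.0 for today's rows)."""
    age_years = (today - example_date).days / 365.25
    return 0.5 ** (age_years / half_life_years)


today = date(2025, 1, 1)
example_dates = [date(1995, 1, 1), date(2005, 1, 1), date(2015, 1, 1), date(2025, 1, 1)]
weights = [recency_weight(d, today) for d in example_dates]

for d, w in zip(example_dates, weights):
    print(d, round(w, 3))
```

A shorter half-life makes the model forget old regimes faster, at the cost of effectively shrinking the training set, so it would be worth treating as a tunable hyperparameter.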

Training data during irregular times, e.g., around the dot-com bubble and the 2008 crash, should possibly be removed or downweighted. Even though the model predicts index-adjusted returns -- which should mitigate some fluctuations during times of fear -- these were times of high volatility and many stocks tanked completely. It's questionable how valuable training examples from these periods are, as the data is so noisy.

## Conclusions

So much talk and no concrete outcomes! I can't round off without revealing if I've made any purchase decisions based on the model results. I plan to run this product for years and jump on good trades if/when they appear; I'm trying to be calm and not rush into decisions I'll find it hard to stick with. My portfolio is still in the making. However, I've made purchases into three stocks:

1. **Lam Research ([LRCX](https://finance.yahoo.com/quote/LRCX/))**: This is the only stock purchased mainly because it was _recommended_ by the model. It's predicted to outperform the index in all three prediction horizons, and has relatively low uncertainty (standard deviation ~0.3). By comparing with other similar stocks and assessing its SHAP values, it seems like the model is suggesting it is undervalued.
2. **Pure Storage ([PSTG](https://finance.yahoo.com/quote/PSTG/))**: This tip came from other sources, but I used the model to assess its viability. The model puts it in the "perform about the same as the index" category with not too high uncertainty. This, plus reading up on the company, convinced me to try to hold this longer term.
3. **Marvell ([MRVL](https://finance.yahoo.com/quote/MRVL/))**: This tip came from other sources and has a notably high P/E ratio. Still, the model predicted it to be in the "perform about the same as the index" category, which made me more confident it wouldn't be too risky but still have a big potential upside.

Admittedly, it's an AI-biased "portfolio", so I'm hoping the AI bubble isn't bursting just yet. I've committed long term and will try to sit steady for years and see how it goes. I might add more stocks if I find promising candidates.

Remember: If it goes well, the model shouldn't get all the credit; if it goes poorly, it shouldn't get all the blame. Either way, it will be exciting to see how it goes!

