The Model Explained

Series - Stock Advisor

All models are wrong, but some are useful.

George Box

Before diving into the model’s details, let’s state the obvious: it’s no panacea for picking US stocks. The model is potentially useful for a certain, limited set of investment strategies.

Specifically, it’s designed to aid decision-making/stock picking for longer-term strategies. It is not well-suited for automated algorithmic trading or short-term investment recommendations. The investment horizon should be years rather than weeks or months. For more on the motivation behind this model, see the first post in this series. The model is retrained twice a week; see the second post for engineering considerations and more on data management and pipelines.

Concretely, there are currently three distinct aspects that I use when picking stocks:

  1. The predicted uncertainty for the index-relative returns for each stock. I deliberately list this metric before the predicted mean itself, as I find it so important. It gives me good indications of how risky an asset is, and makes it easy to identify predictions that shouldn’t be trusted. In general, the predicted uncertainty is high for most stocks.1
  2. The predicted index-relative returns for 1, 2 and 3-year investment horizons. Since this metric is impossible to predict with much confidence, I tend to bin predictions into three distinct categories:
    1. Significantly above 1. The relatively few stocks in this category may warrant extra consideration. But each stock with a high predicted score should be treated and assessed carefully. I don’t blindly trust these recommendations.
    2. At around 1. Predictions in this category range from the model being “fairly confident that this stock will perform similarly to the index” (low uncertainty) to it being “very uncertain about the stock; it could go either way” (high uncertainty).
    3. Significantly below 1. “Historically, stocks with these features tend to underperform the broad index.” This doesn’t rule out the stock, but it’s valuable input to my decision-making, as the model could be flagging a high-risk asset.
  3. Each prediction’s SHAP values. When investigating and making decisions, a key question is: why does this stock get its observed mean and uncertainty predictions? Reviewing the most important “prediction drivers” and comparing with competing candidates is a natural part of this process. The model produces “objective” explanatory metrics – the SHAP values – for each prediction, which allows for assessing prediction quality and identifying key features.
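The binning described above can be sketched in a few lines. This is only an illustration: it assumes the model returns a mean and a variance for y in log space, and the `band` threshold and the function name `summarize_prediction` are hypothetical, not taken from the repository.

```python
import math

def summarize_prediction(mean_log: float, variance_log: float, band: float = 0.10):
    """Turn a log-space prediction into a ratio, a rough 1-sigma range and a bin.

    `band` is a hypothetical threshold: ratios within 1 +/- band count as "around 1".
    """
    sigma = math.sqrt(variance_log)
    ratio = math.exp(mean_log)  # predicted index-relative return
    low, high = math.exp(mean_log - sigma), math.exp(mean_log + sigma)
    if ratio > 1 + band:
        label = "above index"
    elif ratio < 1 - band:
        label = "below index"
    else:
        label = "around index"
    return ratio, (low, high), label

# Made-up prediction: slightly positive mean, fairly large uncertainty.
ratio, (low, high), label = summarize_prediction(mean_log=0.05, variance_log=0.09)
print(f"{ratio:.2f} [{low:.2f}, {high:.2f}] {label}")
```

A wide range around a ratio near 1 corresponds to the "it could go either way" case, while a narrow one corresponds to "will perform similarly to the index".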

The aim of this post is to outline how the data and regression modelling were performed – complemented by the thinking and assumptions made along the way – and then assess whether the results are useful. I’ve tried to critically assess the model results and list ideas for improvement, but please comment below if you find any critical flaws or have suggestions.

In brief, this model could be summarised as:

The machine learning model itself is simple, more or less a “stock” CatBoost regression model with uncertainty estimates. However, it’s trained on (seemingly) high-quality historical data that’s carefully transformed to avoid look-ahead bias and fit a regression prediction scenario.

This section is explained “top down”. I’ll start by showing what the model and transformed data look like during training. Then, I’ll outline the most important steps to get the data into that shape, linking to relevant code in the repository. Finally, I will conclude by highlighting the key underlying assumptions.

The target variable

In this model, the target variable, y, is the natural logarithm of the ratio between a given stock’s relative development and the S&P 500 index’s relative development over the given prediction horizon. Equivalently, it is the difference between the two log returns.

Here are some examples:

  • For a prediction horizon (e.g. 12 months), if stock FOO returns 0.8 (-20%) and the index returns 1.1 (+10%), the index-adjusted return for this stock is ln(0.8/1.1) ≈ -0.32.
  • For a prediction horizon, if stock BAR returns 1.5 (+50%) and the index returns 1.1 (+10%), the index-adjusted return for this stock is ln(1.5/1.1) ≈ 0.31.

Taking the logarithm of this ratio makes sense, as it brings the distribution of y much closer to the Normal distribution assumed by the CatBoostRegressor with the RMSEWithUncertainty loss function. Here’s the DuckDB macro for this calculation, and here it’s applied2 to all stocks for the different time periods.
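As a minimal sketch of the calculation in the examples above (the actual implementation is a DuckDB macro; the Python function name here is hypothetical):

```python
import math

def index_relative_log_return(stock_return: float, index_return: float) -> float:
    """Log of the stock's gross return relative to the index's gross return.

    Both inputs are gross returns over the same horizon, e.g. 0.8 for -20%.
    """
    if stock_return <= 0 or index_return <= 0:
        raise ValueError("gross returns must be positive")
    return math.log(stock_return / index_return)

foo = index_relative_log_return(0.8, 1.1)  # stock FOO from the example
bar = index_relative_log_return(1.5, 1.1)  # stock BAR from the example
print(round(foo, 2), round(bar, 2))
```

A value of 0 means the stock matched the index, and values of equal size but opposite sign are symmetric: outperforming by a given factor and underperforming by the same factor end up equally far from 0.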

The feature variables

Instead of writing too much, I’ll just show the data used for training – how it looks just before being fed to the CatBoost model. Due to Tiingo’s licensing constraints, these training examples contain fake data (random values in approximately the correct orders of magnitude). Here they are:

| date | ticker | y_ln_12m | y_ln_24m | y_ln_36m | exchange | balanceSheet_acctRec | balanceSheet_assetsNonCurrent | balanceSheet_cashAndEq | balanceSheet_debt | balanceSheet_debtCurrent | balanceSheet_debtNonCurrent | balanceSheet_equity | balanceSheet_intangibles | balanceSheet_inventory | balanceSheet_investments | balanceSheet_investmentsNonCurrent | balanceSheet_liabilitiesCurrent | balanceSheet_ppeq | balanceSheet_retainedEarnings | balanceSheet_totalLiabilities | cashFlow_depamor | cashFlow_freeCashFlow | cashFlow_investmentsAcqDisposals | cashFlow_issrepayEquity | cashFlow_payDiv | cashFlow_sbcomp | incomeStatement_ebitda | incomeStatement_epsDil | incomeStatement_grossProfit | incomeStatement_netIncComStock | incomeStatement_opex | incomeStatement_opinc | incomeStatement_revenue | incomeStatement_rnd | incomeStatement_sga | incomeStatement_taxExp | overview_bookVal | overview_bvps | overview_currentRatio | overview_debtEquity | overview_epsQoQ | overview_grossMargin | overview_longTermDebtEquity | overview_profitMargin | overview_revenueQoQ | overview_roa | overview_roe | overview_rps | enterpriseVal | peRatio | pbRatio | operating_efficiency | growth_investment_intensity | asset_productivity | margin_stability | working_capital_days | ev_to_ebitda | ev_to_fcf | ev_to_sales | financial_health_score | sector | industry | sicSector | sicIndustry | location | tech_relative_sma_12m | tech_relative_sma_36m | tech_volatility_12m | tech_volatility_36m | tech_SMA_volume_12m | tech_volume_volatility_12m |
| 2024-11-18 | AAPL | | | | NASDAQ | 30814755073 | 185823925786 | 23633483040 | 107557400786 | 11700545014 | 97180871714 | 80981822447 | 0.0000 | 7177598856 | 157943919981 | 102502776608 | 159781661013 | 39207403708 | -10766921023 | 304118715492 | 2713647190 | 15274026930 | -11372896319 | -19948928813 | -3707717416 | 2320484370 | 27367821445 | 1.9542 | 46857509119 | 18033835008 | 11634159741 | 31570614179 | 74728979307 | 4471072288 | 5086343359 | 9103754453 | 78808700024 | 4.9599 | 1.3231 | 1.7682 | 0.0680 | 0.4369 | 1.7634 | 0.4243 | 0.3970 | 0.2335 | 0.9850 | 5.4428 | 3342308914480 | 27.58 | 33.03 | 0.1209 | 0.0471 | 2.6352 | 0.0104 | 2.4911 | 121.24 | 124.84 | 20.30 | 0.5428 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0672 | 1.0411 | 0.0283 | 0.0251 | 57539512 | 0.5287 |
| 2024-08-19 | AAPL | | | | NASDAQ | 37327722008 | 216885687825 | 23671972728 | 111714247675 | 14466155182 | 100060445402 | 86306947824 | 0.0000 | 3764422490 | 155724460169 | 121295376442 | 126282397875 | 38544093747 | 16028038995 | 293589081442 | 2662530092 | 31027948594 | 2840539731 | -20264538122 | -3403072558 | 2599754045 | 34011153027 | 2.0805 | 26443247897 | 19845918528 | 11124542616 | 26246950061 | 119506675951 | 6874808393 | 4893272621 | 6252804407 | 71155251227 | 4.2018 | 1.5958 | 2.1013 | 0.3495 | 0.3899 | 1.7907 | 0.3895 | 0.0581 | 0.2374 | 1.4009 | 4.6417 | 3127955020569 | 24.54 | 26.59 | 0.1086 | 0.0492 | 2.4216 | 0.0136 | 5.1457 | 69.65 | 71.70 | 40.80 | 0.6436 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0399 | 1.0395 | 0.0308 | 0.0289 | 60311871 | 0.4124 |
| 2024-05-17 | AAPL | | | | NASDAQ | 44390173699 | 181316081999 | 32379198500 | 121437691452 | 20849876327 | 87694176972 | 88848852393 | 0.0000 | 4304640164 | 141518137282 | 124000451197 | 95567251084 | 38590035324 | -16060101105 | 256760338881 | 2948435109 | 23511118608 | 592765911 | -18925576706 | -3703074210 | 2606321849 | 34796677044 | 1.0419 | 24505047433 | 25354749985 | 13057001987 | 34011516841 | 109517002093 | 7656130740 | 6010229916 | 11876129718 | 58342528864 | 3.1550 | 1.2090 | 1.2690 | -0.2011 | 0.4005 | 1.8516 | 0.4042 | 0.2812 | 0.2437 | 1.1337 | 7.4382 | 1567053600133 | 24.25 | 34.89 | 0.1256 | 0.0511 | 2.1789 | 0.0093 | 3.2960 | 120.65 | 77.54 | 15.49 | 0.5189 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0203 | 1.0265 | 0.0344 | 0.0323 | 58445401 | 0.3203 |
| 2024-02-20 | AAPL | | | | NASDAQ | 45962166475 | 219295629587 | 31567254945 | 121849746935 | 18612069572 | 93791918934 | 61683465125 | 0.0000 | 3806309344 | 127621353149 | 110703929693 | 109149116961 | 40121258750 | 15063320841 | 256089812089 | 2750883317 | 38395622473 | -1402343470 | -18691769868 | -3710164100 | 2511885851 | 43574342032 | 1.8777 | 29803564754 | 25486244821 | 14317001134 | 14991063269 | 69982507616 | 5428277546 | 5177367909 | 7642695174 | 80794690609 | 3.7764 | 0.9330 | 2.2716 | 0.2657 | 0.4007 | 1.8875 | 0.4621 | 0.4579 | 0.2016 | 1.4596 | 4.7559 | 2075085074954 | 26.45 | 23.10 | 0.1565 | 0.0388 | 3.1052 | 0.0232 | 15.62 | 122.82 | 86.89 | 38.18 | 1.2682 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0731 | 1.0337 | 0.0164 | 0.0344 | 56941502 | 0.2793 |
| 2023-11-17 | AAPL | -0.0905 | | | NASDAQ | 42682043101 | 230067238334 | 28180092932 | 113665836427 | 15453258934 | 95303394880 | 62999857774 | 0.0000 | 4879043870 | 154445308436 | 116626362569 | 105646833612 | 36192676279 | -2422743364 | 266521386474 | 2968950887 | 32030510466 | -6023565705 | -16744132202 | -3452343579 | 1926360117 | 32986869778 | 1.5495 | 37521102682 | 13898050410 | 9959652518 | 31465183700 | 105146233461 | 6942345940 | 6239952111 | 4009913882 | 77457074425 | 4.0800 | 1.0819 | 2.1336 | -0.0148 | 0.4215 | 1.5028 | 0.4463 | 0.4459 | 0.2615 | 1.1621 | 3.7714 | 3142015409508 | 28.45 | 21.31 | 0.1226 | 0.0296 | 2.8540 | 0.0041 | 3.1523 | 94.82 | 66.16 | 22.18 | 0.7140 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0457 | 1.0371 | 0.0269 | 0.0341 | 62320800 | 0.2973 |
| 2023-08-17 | AAPL | 0.0032 | | | NASDAQ | 55363798267 | 195931268124 | 29873730306 | 103561704134 | 16779813133 | 102169387945 | 67725294499 | 0.0000 | 4755407638 | 132994985430 | 107205603994 | 122960449487 | 38872598394 | -17560609668 | 252923098599 | 3038180938 | 13252324548 | -7812438851 | -17090095709 | -3408072510 | 1762965831 | 27628622695 | 1.2689 | 45018275804 | 28934762560 | 13105198891 | 26115656626 | 110946620740 | 7920061915 | 5220046865 | 3009811160 | 81112590411 | 4.8368 | 0.8789 | 1.8373 | 0.3209 | 0.4192 | 1.3972 | 0.4472 | 0.4894 | 0.2647 | 1.1806 | 7.4388 | 3232322830908 | 26.67 | 31.90 | 0.1398 | 0.0203 | 1.8306 | 0.0088 | 5.6126 | 76.39 | 95.33 | 35.40 | 0.8580 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0615 | 1.0546 | 0.0266 | 0.0327 | 70264755 | 0.3372 |
| 2023-05-17 | AAPL | -0.1574 | | | NASDAQ | 55530617376 | 198635170868 | 35231233668 | 116544069417 | 12135150187 | 103150013141 | 84827918816 | 0.0000 | 3564701407 | 138519315850 | 117186136088 | 124747016305 | 37894405895 | 17017171363 | 287389819985 | 2989903103 | 41632049287 | -12019646235 | -18216888912 | -3770696987 | 2131249312 | 26387622599 | 0.7558 | 49600858496 | 23199190046 | 11245303735 | 17227563698 | 73477714541 | 4673286839 | 6293791022 | 4922316469 | 89322263470 | 3.2193 | 1.3157 | 2.2087 | -0.0743 | 0.3929 | 1.1846 | 0.3978 | 0.0291 | 0.1797 | 1.2363 | 7.2434 | 1895834713493 | 29.42 | 18.76 | 0.1352 | 0.0279 | 2.6081 | 0.0164 | 14.92 | 82.47 | 84.66 | 18.90 | 1.1952 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 0.9978 | 1.0569 | 0.0222 | 0.0328 | 75814482 | 0.3050 |
| 2023-02-17 | AAPL | -0.0345 | | | NASDAQ | 46979479665 | 191390741204 | 38315352611 | 115797230132 | 22635996831 | 89974121497 | 72984813676 | 0.0000 | 5767784542 | 154840342020 | 93001261385 | 174021112748 | 41991567648 | 14407292741 | 303880739916 | 3017776491 | 20087075052 | 7471457871 | -18227549315 | -3755133243 | 1928451448 | 40680263539 | 1.9105 | 49157043927 | 29411525565 | 13799200748 | 16715429237 | 65974979219 | 4751257095 | 5249024621 | 8199306809 | 62327231572 | 4.7347 | 1.5916 | 2.0545 | 0.8293 | 0.4541 | 1.6558 | 0.3999 | 0.3721 | 0.2118 | 1.5241 | 4.2109 | 2966460500328 | 28.24 | 20.10 | 0.0982 | 0.0387 | 2.7888 | 0.0078 | -1.8254 | 115.47 | 142.86 | 39.06 | 0.4906 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 0.9557 | 1.0444 | 0.0194 | 0.0398 | 85052531 | 0.2736 |
| 2022-11-17 | AAPL | 0.0857 | -0.0048 | | NASDAQ | 37794207203 | 217983475412 | 27676628763 | 108971759429 | 23808136756 | 107907100919 | 61438915867 | 0.0000 | 4836434474 | 142791340475 | 127333471019 | 173655293428 | 36143872645 | 38726310668 | 259848007810 | 2974954665 | 34473546812 | -2496317962 | -21926431978 | -3719262500 | 1807485000 | 23940049171 | 1.2415 | 43595430018 | 20431334835 | 12007476473 | 35618656093 | 103546721872 | 4668121772 | 5139992990 | 5084374492 | 82309914649 | 5.0068 | 1.0605 | 2.3149 | 0.0706 | 0.3967 | 1.6532 | 0.4599 | 0.1237 | 0.2084 | 0.8101 | 7.4543 | 2046976387914 | 33.73 | 33.91 | 0.1284 | 0.0462 | 2.7075 | 0.0102 | -6.1717 | 107.15 | 77.84 | 39.27 | 0.4858 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0057 | 1.0679 | 0.0147 | 0.0495 | 91690744 | 0.2822 |
| 2022-08-17 | AAPL | -0.0359 | -0.0328 | | NASDAQ | 43755582094 | 231097641104 | 27647829363 | 124598535761 | 16070909471 | 101355865320 | 81688283560 | 0.0000 | 5230294491 | 154846379786 | 118290773236 | 165019253542 | 44844750233 | 22311799795 | 303664298204 | 2941886056 | 43372587575 | -4833129776 | -17497733966 | -3499982581 | 1946789917 | 30079035849 | 1.5911 | 32705553314 | 26398897031 | 12491993461 | 31944276293 | 86878898148 | 7248741252 | 4942147386 | 10744460352 | 73635234500 | 3.1725 | 1.2212 | 1.5000 | 0.7053 | 0.4387 | 1.5959 | 0.3947 | 0.1516 | 0.2036 | 0.6718 | 7.0975 | 2985705173979 | 34.23 | 46.16 | 0.1094 | 0.0240 | 2.4776 | 0.0113 | 11.10 | 78.56 | 134.00 | 19.15 | 0.9239 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0189 | 1.0761 | 0.0233 | 0.0587 | 88091148 | 0.3016 |
| 2022-05-17 | AAPL | 0.1182 | -0.0392 | | NASDAQ | 32250387106 | 187574929074 | 40175083625 | 106379381526 | 15351191864 | 98905783721 | 89406304971 | 0.0000 | 6883943385 | 142216257204 | 138951864745 | 125220057627 | 37442224187 | 29897712089 | 247184665931 | 2974766756 | 28940246875 | -5673098153 | -16409960308 | -3701971400 | 1847948830 | 21312296002 | 0.6555 | 46506604286 | 27061726732 | 13065900134 | 25660598088 | 114477960504 | 5063214723 | 5792259512 | 7250725176 | 83538671787 | 3.1591 | 1.4336 | 1.2555 | -0.2049 | 0.4368 | 1.8682 | 0.4480 | 0.4970 | 0.2198 | 1.1228 | 3.6130 | 3252811772261 | 35.71 | 18.02 | 0.1629 | 0.0324 | 1.7091 | 0.0030 | 10.38 | 117.63 | 103.47 | 35.67 | 1.1417 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0626 | 1.0932 | 0.0276 | 0.0675 | 87658908 | 0.3033 |
| 2022-02-17 | AAPL | -0.0412 | -0.0756 | | NASDAQ | 59500753308 | 209410967219 | 40544626634 | 124175487372 | 21381790346 | 86461434283 | 61086093926 | 0.0000 | 7462101442 | 161366809657 | 111007918924 | 175342570606 | 37215207704 | 33565289042 | 275664966745 | 2795215303 | 43880855365 | 3782862347 | -19337632763 | -3585441161 | 2518958936 | 40055636450 | 1.5246 | 42155248583 | 26379462977 | 10090146916 | 21036730655 | 97919378797 | 5233624434 | 6743510207 | 14158975929 | 51661071092 | 4.4951 | 1.1174 | 2.2905 | -0.1701 | 0.4512 | 1.3549 | 0.4234 | -0.0248 | 0.1971 | 1.0997 | 3.5537 | 2053729328421 | 24.42 | 58.19 | 0.1625 | 0.0168 | 2.7192 | 0.0208 | -0.2789 | 72.84 | 64.77 | 32.75 | 0.8597 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0779 | 1.1210 | 0.0251 | 0.0712 | 89460820 | 0.3194 |
| 2021-11-17 | AAPL | 0.1443 | 0.2300 | 0.1395 | NASDAQ | 40755850215 | 181890148015 | 31498438645 | 110575492113 | 12134126904 | 86600794653 | 70302433240 | 0.0000 | 5015466736 | 145838515178 | 92973149723 | 135955639253 | 39145154126 | 28938445526 | 243132589554 | 2757766508 | 35258042001 | -1125592982 | -18091283518 | -3413391865 | 1742472457 | 22852088970 | 1.1030 | 35873357550 | 12333351485 | 14086084846 | 29214978496 | 68402213849 | 7278876503 | 6142362843 | 14110196774 | 80843801394 | 3.7432 | 0.8738 | 1.9827 | 0.7198 | 0.4567 | 1.2965 | 0.4546 | 0.5032 | 0.2555 | 1.5326 | 5.9204 | 1675876913453 | 36.77 | 24.52 | 0.1312 | 0.0540 | 2.2905 | 0.0116 | 11.04 | 95.60 | 147.70 | 20.08 | 1.2551 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0643 | 1.1006 | 0.0166 | 0.0730 | 89555576 | 0.3200 |
| 2021-08-17 | AAPL | 0.1817 | 0.1458 | 0.1489 | NASDAQ | 59599300166 | 222714817472 | 21028537716 | 123328525967 | 19126870138 | 86389608044 | 67782118777 | 0.0000 | 4829886035 | 144228350102 | 140805514883 | 145989286960 | 36426757197 | -16213936649 | 282449657241 | 3050624296 | 35704042660 | -1973609106 | -18864238429 | -3864234882 | 2842689981 | 22645665887 | 1.9649 | 25286192668 | 26027882598 | 12740759316 | 14471369711 | 69507519317 | 6795558856 | 5910031683 | 12361597119 | 76392106143 | 4.9350 | 1.5514 | 1.8423 | 0.2540 | 0.4411 | 1.3241 | 0.4157 | 0.3155 | 0.1984 | 0.7591 | 5.8908 | 1965796484451 | 30.37 | 47.47 | 0.1165 | 0.0317 | 3.0738 | 0.0075 | -6.3005 | 120.53 | 150.22 | 24.61 | 0.9286 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0983 | 1.1032 | 0.0192 | 0.0720 | 110525980 | 0.4382 |
| 2021-05-17 | AAPL | 0.1771 | 0.2953 | 0.1380 | NASDAQ | 54384055546 | 226268775265 | 38862309536 | 107894431023 | 20806692016 | 98177824297 | 51942402206 | 0.0000 | 7131565673 | 127631982953 | 98853223533 | 116581973906 | 42673662290 | -756911135 | 288039511309 | 2661890042 | 21699326053 | -3390647149 | -22990015167 | -3458045526 | 2519477614 | 38672683903 | 1.3438 | 40193374202 | 13612099446 | 9921760754 | 18402246853 | 103134547084 | 5033791354 | 6527878315 | 8376949074 | 72774833485 | 3.5860 | 1.2778 | 2.1479 | 0.8600 | 0.4424 | 1.9194 | 0.4645 | 0.3771 | 0.2767 | 1.4128 | 6.6279 | 2440905204205 | 25.95 | 23.12 | 0.1368 | 0.0480 | 2.2017 | 0.0177 | -9.3863 | 72.18 | 125.60 | 22.94 | 1.2430 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.1401 | 1.1023 | 0.0269 | 0.0710 | 127994190 | 0.4108 |
| 2021-02-17 | AAPL | 0.1400 | 0.0988 | 0.0643 | NASDAQ | 41991819737 | 183628830992 | 37111665351 | 102443984539 | 22520251168 | 95979233456 | 68162914861 | 0.0000 | 5279382325 | 133556978927 | 136912624515 | 155914409487 | 45057351673 | 13982508994 | 281114407655 | 2831693079 | 23064100987 | 10634256377 | -20950791333 | -3547077920 | 1928835388 | 38382535238 | 1.9783 | 46338998176 | 12760936822 | 11373030733 | 17451285862 | 119066498610 | 5998075785 | 6066343326 | 7561891687 | 59110697770 | 3.8633 | 0.9071 | 1.9957 | 0.8905 | 0.4189 | 1.8000 | 0.4618 | -0.0515 | 0.1934 | 1.3225 | 5.9597 | 1621592934637 | 27.14 | 59.01 | 0.1278 | 0.0202 | 1.8534 | 0.0217 | 12.84 | 94.07 | 146.74 | 29.94 | 1.3131 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.1533 | 1.1174 | 0.0433 | 0.0693 | 154336483 | 0.4653 |
| 2020-11-17 | AAPL | -0.0176 | 0.1267 | 0.2124 | NASDAQ | 33147661044 | 209911682958 | 26900705304 | 102850784980 | 23329466686 | 91139878553 | 76922468625 | 0.0000 | 7310295166 | 163152380916 | 112609502275 | 112932564005 | 36736918314 | 29692291247 | 274833897493 | 3047721783 | 18736231694 | -10629647229 | -20839200497 | -3504612395 | 1777066594 | 38604205151 | 1.8746 | 41061174987 | 15676044689 | 10282687794 | 30344145229 | 98401823331 | 5222895433 | 5229995689 | 2388268705 | 65727595857 | 3.7867 | 1.4212 | 2.0083 | 0.7179 | 0.4325 | 1.6245 | 0.4355 | 0.1162 | 0.2459 | 0.6482 | 4.4219 | 1601639903594 | 31.10 | 17.38 | 0.1649 | 0.0347 | 2.7716 | 0.0085 | 17.78 | 80.10 | 107.69 | 30.00 | 1.0382 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.2117 | 1.1247 | 0.0480 | 0.0605 | 158482061 | 0.4418 |
| 2020-08-17 | AAPL | -0.0112 | 0.1705 | 0.1345 | NASDAQ | 43511664087 | 227358232674 | 35910041716 | 123499390263 | 20723669511 | 90293368603 | 79723408791 | 0.0000 | 6163330788 | 146468326331 | 118562734272 | 107460302611 | 45407931770 | 25982031210 | 266135138188 | 2695248565 | 13305891056 | 9729604692 | -22627700478 | -3467170141 | 2800616267 | 42331559640 | 0.8216 | 37716756955 | 26760272692 | 10904722341 | 31296856181 | 114002812226 | 5645368653 | 4923566357 | 3518393077 | 60436320972 | 4.3827 | 1.1342 | 1.4370 | 0.9780 | 0.4617 | 1.6558 | 0.3821 | 0.1136 | 0.2548 | 1.4694 | 7.5900 | 2542149544689 | 33.05 | 59.95 | 0.1286 | 0.0597 | 2.8672 | 0.0044 | 18.28 | 104.93 | 104.17 | 24.81 | 0.5819 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.1793 | 1.0964 | 0.0410 | 0.0441 | 145083962 | 0.4760 |
| 2020-05-18 | AAPL | 0.1205 | 0.2976 | 0.4158 | NASDAQ | 61093797126 | 207161718744 | 29889878846 | 124318080291 | 16300262594 | 100521746494 | 79801950865 | 0.0000 | 4658671020 | 151823659679 | 124017026780 | 152036755719 | 43213447593 | 10642016787 | 251436332360 | 2795527654 | 22806595131 | 9618081123 | -23950360030 | -3593103303 | 2568851164 | 32370002558 | 1.7739 | 47771625093 | 18984667423 | 14378989484 | 27712782979 | 103237794357 | 5513725251 | 5613446956 | 4014095493 | 75784182515 | 4.7013 | 1.1381 | 1.9167 | 0.9236 | 0.4057 | 1.4478 | 0.4620 | 0.2267 | 0.2026 | 1.4579 | 7.2549 | 3228514277884 | 32.93 | 22.25 | 0.1341 | 0.0309 | 1.8592 | 0.0146 | -8.7141 | 103.44 | 106.98 | 18.54 | 0.9819 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.0993 | 1.0630 | 0.0337 | 0.0310 | 134809794 | 0.5106 |
| 2020-02-18 | AAPL | 0.3310 | 0.4709 | 0.4297 | NASDAQ | 52926420313 | 187759333177 | 38104464700 | 117086595500 | 17372555799 | 97224328387 | 82384451004 | 0.0000 | 6066960589 | 151589427786 | 126454534796 | 154515649595 | 37522160071 | 9911727928 | 273444079525 | 2915473284 | 30034024623 | 9785250420 | -21501084550 | -3534849902 | 2449357270 | 32585614057 | 0.8599 | 48402735220 | 34094886661 | 10359345076 | 17483773634 | 116423763602 | 6534752519 | 6434848328 | 3566660395 | 65530282331 | 4.0381 | 1.1415 | 1.2825 | 0.8044 | 0.4085 | 1.5191 | 0.4366 | 0.4941 | 0.2261 | 0.7897 | 6.9044 | 2848010784616 | 28.64 | 57.85 | 0.1654 | 0.0445 | 1.7193 | 0.0041 | 15.00 | 84.30 | 77.68 | 28.20 | 0.8726 | Technology | Consumer Electronics | Manufacturing | Electronic Computers | California; USA | 1.1667 | 1.0846 | 0.0272 | 0.0231 | 111579841 | 0.3510 |

This is data from 2020 onwards for four stocks: AAPL (Apple), MSFT (Microsoft), CAT (Caterpillar) and GS (Goldman Sachs). The full training set contains data back to 1995 for about 6600 stocks (including delisted stocks). The data has about 75 columns of feature variables:

  • date and ticker: The stock and date for which the features on the right apply. These aren’t included during model training, but are included here for context.
  • y_ln_12m, y_ln_24m, y_ln_36m: The target variable to predict. The three prediction horizons – 12, 24 and 36 months – are trained as three individual models. As you can see, there are many empty fields, which is natural, as we don’t have the value for y_ln_36m for AAPL on 2024-11-18; this is the predicted value we’re interested in. When training the 36-month model, examples from the most recent three years are removed from the training set, since their targets aren’t known yet.
  • balanceSheet_*, cashFlow_*, incomeStatement_*: These are the quarterly statements fundamentals data, with a statementType prefix. I initially modelled with all statements data included, but removed quite a few with seemingly little predictive value.
  • Some daily fundamentals data, including peRatio, pbRatio, operating_efficiency.
  • Some descriptive categorical data, including sector (e.g., Technology), industry (e.g., Consumer Electronics), and location (e.g., California; USA).
  • tech_*: Finally, some technical indicators, like moving averages and historical volatility.
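The rule that a row can only be used for training once its horizon has elapsed can be sketched as follows. This is a simplified illustration, not the repository’s code: the function name is hypothetical, and a month is approximated as 365.25/12 days.

```python
from datetime import date, timedelta

def has_known_target(row_date: date, horizon_months: int, today: date) -> bool:
    """True if the horizon has fully elapsed, so the target can be observed."""
    # Assumption: approximate a month as 365.25 / 12 days;
    # the real pipeline may use calendar months instead.
    horizon = timedelta(days=round(horizon_months * 365.25 / 12))
    return row_date + horizon <= today

today = date(2024, 11, 18)
rows = [date(2024, 11, 18), date(2023, 11, 17), date(2021, 11, 17)]

trainable_12m = [d for d in rows if has_known_target(d, 12, today)]
trainable_36m = [d for d in rows if has_known_target(d, 36, today)]
print(trainable_12m, trainable_36m)
```

With these dates, the 2023-11-17 row can train the 12-month model but not the 36-month one, which matches the pattern of empty target fields in the example data.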

If you’re curious about the different quarterly statement fields, I’ve included Tiingo’s documentation in a table:

| statementType | dataCode | name | description | units |
|---|---|---|---|---|
| overview | rps | Revenue Per Share | Revenue per share | $ |
| overview | roa | Return on Assets ROA | Net Income/Total Assets | % |
| overview | assetTurnover | Asset Turnover | Revenue over assets | |
| overview | bookVal | Book Value | Book value of the share, assets - liabilities | $ |
| overview | bvps | Book Value Per Share | Book Value per each share | $ |
| incomeStatement | revenue | Revenue | Revenue | $ |
| incomeStatement | epsDil | Earnings Per Share Diluted | EPS for diluted shares | $ |
| incomeStatement | netinc | Net Income | Net income | $ |
| overview | profitMargin | Profit Margin | This field is marked for DEPRECATION. Please use grossMargin instead. Profit Margin is calculated by the (Revenue - COGS)/Revenue | % |
| overview | revenueQoQ | Revenue QoQ Growth | Revenue Quarter-over-Quarter Growth rate | % |
| overview | debtEquity | Debt to Equity Ratio | Debt/Equity ratio | |
| overview | grossMargin | Gross Margin | The margin of good sold, basically how much of the money the company keeps after selling their goods and services. (Rev.-Cost of Rev.)/Rev. | % |
| overview | roe | Return on Equity ROE | Return on Shareholder's equity; ROE=Net Income/Shareholder's Equity | % |
| overview | currentRatio | Current Ratio | Ability for a company to pay off its short-term liabilities, Current Assets/Current Liabilities | |
| overview | fxRate | FX Rate | The exchange rate used for the conversion of foreign currency to USD for non-US companies that do not report in USD. | |
| balanceSheet | sharesBasic | Shares Outstanding | Outstanding shares | |
| overview | piotroskiFScore | Piotroski F-Score | 0-9 point scale to determine strength of company's financial position | |
| overview | longTermDebtEquity | Long-term Debt to Equity | Long term debt to equity | |
| overview | opMargin | Operating Margin | Operating margin, or how much money a company makes on each dollar of revenue. In other words Operating Income/Revenue | % |
| overview | epsQoQ | Earnings Per Share QoQ Growth | Earnings Per Share Quarter-over-Quarter Growth rate | % |

In short, I’m not familiar with many of the model’s features. My thinking is more “I’ll let the model figure it out.”

As you might have noticed, there are only four entries/rows per stock per year. This is essentially one per quarterly statement, and that’s no coincidence:

  • The quarterly statements are assumed to be the most important model features (and obviously only change four times a year).
  • Much of each stock’s data is highly correlated, per day, week, and even month. As a simple measure to decorrelate the data, I just drop the rows in between and keep one per quarter.
  • Fewer examples mean faster training, which is nice for devex and avoids long training in the scheduled GitHub Actions workflows.

In total, there are about 300,000 training examples, from 1995 to today, for 6600 different stocks.

To get from raw data (Tiingo’s “CSV schema” is ingested into MotherDuck more or less as is) to the training data exemplified above, four concrete transformation steps are performed. Most of what happens there is straightforward and “self-documented” in the code3. However, at least one step requires more detailed explanation and could benefit from scrutiny: how I join pricing data with quarterly statements data.

When joining these datasets – stock prices and quarterly statements – it’s vital to avoid look-ahead bias. We need to join each stock’s pricing data with statements data only from the day the quarterly statement was published onwards. Joining price data with statements data before publication would be cheating, as the stock’s price wouldn’t yet have “priced in” the statement’s contents.

Here’s what I did. Tiingo’s financial statements provide the fiscal dates for each company. Each company must file its quarterly statements with the SEC4 within 45 days of the fiscal date. I don’t care much about joining price and statements data at the exact publication date (I simply don’t have that data point), but as long as I’m sure the quarterly statement was released on or a few days before the join date, I’m happy enough. This is a conservative approach, as some companies might file before the 45-day deadline. But it should guarantee that the statement information is “priced in” at the join date. In code, I add 45 days to the fiscal date and run an as-of join to avoid look-ahead bias.
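To make the idea concrete, here is a minimal pure-Python sketch of an as-of lookup with a 45-day lag. The real pipeline does this as an as-of join in SQL; the function name, data layout and values below are made up for illustration.

```python
from bisect import bisect_right
from datetime import date, timedelta

FILING_LAG = timedelta(days=45)  # filing deadline assumed in the text

def latest_available_statement(statements, price_date):
    """Return the most recent statement assumed public on `price_date`, else None.

    `statements` is a list of (fiscal_date, payload) tuples sorted by fiscal_date.
    """
    available_from = [fiscal + FILING_LAG for fiscal, _ in statements]
    # Number of statements whose availability date is on or before price_date.
    idx = bisect_right(available_from, price_date)
    return statements[idx - 1] if idx else None

statements = [
    (date(2024, 3, 31), {"revenue": 100}),  # made-up values
    (date(2024, 6, 30), {"revenue": 110}),
]

# 2024-08-01 is after the Q2 fiscal date but before Q2 + 45 days,
# so only the Q1 statement may be joined.
early = latest_available_statement(statements, date(2024, 8, 1))
late = latest_available_statement(statements, date(2024, 8, 19))
print(early[0], late[0])
```

The point of the lag is visible in the first lookup: without it, the Q2 figures would be attached to a price from a date when they may not yet have been public.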

As shown above, the training data is transformed to fit a regression scenario where all examples (rows) are assumed to be independent and identically distributed. This isn’t strictly true, but it seems close enough for adequate results. Another way to frame this is that all examples learn from each other, meaning that when making future predictions we’re essentially asking the model:

Given a stock’s current input data, how did stocks with similar features perform historically?

Of course, the model does more, as it should have learned to generalize and be able to make reasonable predictions for unseen data too. But clear recommendations (with low uncertainty) most likely arise because the predicted stock’s features are somewhat familiar to the model.

As predicting a stock’s future is nearly impossible, can we hope for useful results from such a model? It’s still an open question, but as discussed below, the model seems to detect some signal for some stocks. The large uncertainty estimates, also discussed below, suggest a low signal-to-noise ratio for this prediction scenario.

As a stock’s price is obviously time series data, why deliberately not model this as a time series problem? There are several reasons:

  • A stock’s historical price development has very little predictive value. An entire discipline called “technical analysis” aims to predict a stock’s development solely from its recent historical price. I’m not a believer, but what do I know?
  • The most important features aren’t autoregressive, but fundamental data combined with company-specific data (industry, sector, location, etc.). Even though models like dynamic regression allow inclusion of both autoregressive and “regular” features, I’ve found them hard to work with compared to a gradient-boosted tree model like CatBoost. Given how little predictive value historic price contains, I decided it was easier to choose a “regular” regression model5 and include some derived “technical features” from the historical price data.
  • By predicting index-relative returns instead of stock prices, historical prices have less predictive power. Even though a stock’s absolute value (its price) three months ago is usually predictive for its price today, this data point isn’t very valuable for predicting its index-relative return.
  • The model is designed to learn from all examples, i.e., all other stocks, to identify buy or sell signals. This contrasts with technical methods, where one stock’s data is used to predict its continued trajectory.

However, I do include several “technical features” based on each stock’s history, like longer-term price moving averages, price variance and volume variance. The rationale here is:

  • Historical variances should provide predictive value to (at least) the uncertainty predictions, and possibly suggest high future volatility.
  • Even though I’m sceptical about historical development predicting a stock’s future, I believe somewhat in momentum. There are two reasons. First, if a stock is mispriced, it takes time to reach its “correct” price; there’s a lag. Second, there’s human psychology. Until fear grips the markets, humans tend to think that something going up will continue to go up. This effect is reinforcing: if enough people believe a stock should continue to rise, that might be enough to keep driving its price up. This is my personal belief and may be controversial; the underlying cause (of momentum) shouldn’t matter anyway. Regardless, there’s little harm in including some longer-term simple moving averages as model features. If these features contain no signal, the model should figure that out.
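As a rough sketch of what such technical features can look like: the definitions below are assumptions for illustration (latest price relative to a simple moving average, and the standard deviation of log returns), not necessarily how the tech_* columns are computed in the repository.

```python
import math
import statistics

def relative_sma(prices):
    """Latest price divided by the simple moving average of the window."""
    return prices[-1] / statistics.fmean(prices)

def log_return_volatility(prices):
    """Sample standard deviation of period-to-period log returns."""
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    return statistics.stdev(log_returns)

prices = [100.0, 102.0, 101.0, 105.0, 110.0]  # made-up closing prices
rel = relative_sma(prices)
vol = log_return_volatility(prices)
print(f"relative SMA: {rel:.4f}, volatility: {vol:.4f}")
```

A relative SMA above 1 means the stock currently trades above its recent average, which is one simple way to express momentum as a feature.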

During training, I split the data randomly into a training set, an evaluation set, and a small test set used only after training to evaluate the model on unseen data. As the RMSE is a logarithmic value (log RMSE), it’s a bit hard to interpret, but I’ll try my best.
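A random three-way split of this kind can be sketched as below. The fractions and the seed are hypothetical; the text only says that the test set is small.

```python
import random

def random_split(rows, eval_frac=0.15, test_frac=0.05, seed=42):
    """Shuffle rows and cut them into train, evaluation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_frac)
    n_eval = int(len(shuffled) * eval_frac)
    test = shuffled[:n_test]
    eval_ = shuffled[n_test:n_test + n_eval]
    train = shuffled[n_test + n_eval:]
    return train, eval_, test

train, eval_, test = random_split(list(range(100)))
print(len(train), len(eval_), len(test))
```

Using a different seed on each run gives a new random test set each time, which is one way to check how stable the test metric is across training runs.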

The test set log RMSE values – pretty stable on each training run – are 0.47, 0.56 and 0.59 for the 12m, 24m, and 36m models respectively. Let’s break that down:

A log RMSE of 0.47 (12m model) means the average prediction error corresponds to:

e^{0.47} \approx 1.60

This implies predictions deviate from actual index-adjusted returns by ~60% in multiplicative terms. For the 36m model (RMSE 0.59), this grows to ~80% deviation. In other words, the actual index-relative returns typically differ from the predictions by a factor between 1.6 and 1.8 (for 12m and 36m predictions, respectively). If a stock actually delivers 2× the index return, the model’s predictions would typically fall between 1.25× (2/1.6) and 3.2× (2×1.6) relative performance, though individual predictions can deviate even further.
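The arithmetic above is easy to reproduce. The snippet below only uses the log RMSE values quoted in the text; the variable names are arbitrary.

```python
import math

log_rmse = {"12m": 0.47, "24m": 0.56, "36m": 0.59}  # values quoted in the text

# exp() converts an error in log space into a multiplicative factor.
factors = {horizon: math.exp(value) for horizon, value in log_rmse.items()}

actual = 2.0  # a stock that delivers 2x the index return
typical_range_12m = (actual / factors["12m"], actual * factors["12m"])

for horizon, factor in factors.items():
    print(f"{horizon}: x{factor:.2f}")
print(f"12m range around 2x: {typical_range_12m[0]:.2f} to {typical_range_12m[1]:.2f}")
```

Because the error lives in log space, the range is symmetric as a factor (divide or multiply by about 1.6), not as an absolute difference.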

[Figures: test set log RMSE from training runs (new random test set each time), one chart each for the 12m, 24m and 36m models]

This sounds like a high error rate, and it is! However, I don’t think it renders the model results useless; there’s still some signal to be obtained. Here’s why:

  • Horizon vs accuracy pattern. The decreasing RMSE with shorter horizons (0.59 → 0.56 → 0.47) aligns with expectations – shorter-term predictions generally have less uncertainty, suggesting the model captures some time-dependent signal.

  • RMSE is sensitive to outliers. For uncertainty estimates, I have to use the RMSEWithUncertainty loss function in the CatBoostRegressor, which is sensitive to extreme outliers. Given stocks’ possible wild fluctuations, a few outliers (stocks tanking or going +5x) can significantly impact the test set log RMSE values. A loss function less sensitive to outliers could possibly improve the test set evaluation metric results significantly.

  • The exact values of the predicted index-relative returns matter less than their category. Even though this model is framed as a regression model (i.e., predict the exact index-relative return), the results are used in a “trichotomy”:

    1. Is the stock predicted to perform better than the index?
    2. Is the stock predicted to perform about as well as the index?
    3. Is the stock predicted to perform worse than the index?

    Since predicting a stock’s actual development is impossible anyway, we care about signals that can aid decision-making. This three-level categorisation, combined with other information sources, helps objectively assess a stock’s potential and risk. I might consider buying stocks in all three categories, but I’d be wary that the stocks in the last category come with a “high risk” stamp from the model.

  • Even though the average log RMSE is high, uncertainty isn’t high for all predictions. Some stocks have significantly lower predicted uncertainty estimates than others, indicating the model is more confident. This is great for decision-making, helping us identify worthless predictions (very high uncertainty estimates) and possible investment targets.

  • Negative prediction power. Creating a model that picks only stock winners is impossible, but it might be possible to create a model that helps decide which stocks to avoid – or at least label as high risk. If I’m evaluating stocks in the Motley Fool’s Top 10 stocks for a given month, for example, the model could help flag high-risk assets that I may then be more cautious about committing to.

  • SHAP values and feature importances provide value in themselves. Being able to drill down to assess why predictions are what they are can be insightful, for both seemingly reasonable and “way off” predictions.
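The three-way reading of the predictions described in the list above, combined with the uncertainty check, could be sketched roughly as follows. The thresholds (0.9 and 1.1) and the uncertainty cut-off are assumptions for illustration, not values used by the model or the dashboard.

```python
def categorise(pred_mean: float, pred_std: float,
               lower: float = 0.9, upper: float = 1.1,
               max_std: float = 0.6) -> str:
    """Bin a predicted index-relative return into one of three categories.

    All thresholds, including the uncertainty cut-off, are hypothetical.
    """
    if pred_std > max_std:
        return "too uncertain to use"
    if pred_mean > upper:
        return "better than the index"
    if pred_mean < lower:
        return "worse than the index (high risk)"
    return "about the same as the index"


# Hypothetical (mean, standard deviation) pairs, not real model output.
examples = {"A": (1.35, 0.30), "B": (1.02, 0.25), "C": (0.70, 0.35), "D": (1.20, 0.90)}
for ticker, (mean, std) in examples.items():
    print(ticker, categorise(mean, std))
```

Checking the uncertainty first means a prediction with very wide bands is set aside before its mean is read at all.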

To wrap up, here’s a log RMSE classification scheme for stocks proposed by the brilliant DeepSeek R1 model. The classification makes intuitive sense, but when asked for sources, it wasn’t able to find any. So, take it with a pinch of salt; it may be completely made up to fit my prompt (parts of this blog post). R1 and I seem to agree that the model captures more than just noise, at least 😁

DeepSeek R1's made-up log RMSE classification for stock predictions.

For stock prediction models:

  • <0.4 log RMSE would indicate strong predictive power
  • 0.4-0.6 suggests moderate signal detection
  • >0.6 implies mostly noise capture

Your results (0.47-0.59) sit in the moderate range, suggesting:

  1. The model identifies some predictive patterns
  2. Significant unexplained variance remains (expected in equity markets)
  3. Fundamental data contains partial signal about future performance

Prediction examples

Above is a screenshot from the Streamlit dashboard with results as of 5 February 2025. The predicted means are the round dots, and bands represent model-estimated Normal distribution quantiles. The thicker uncertainty band corresponds to 1σ (~0.68 probability), and the thinner band corresponds to 2σ (~0.95 probability), with different colours representing different stock tickers. Things worth noticing:

  • The model is only bullish on Apple (AAPL) and Lam Research (LRCX). It has relatively low uncertainty for both, but Apple is one of the few stocks consistently predicted to outperform the index by an entire standard deviation above 1.
  • Alphabet (GOOGL), Marvell (MRVL), and Pure Storage (PSTG) land in the neutral category; they’re predicted to perform roughly in line with the index. However, they have markedly different uncertainty, with Alphabet having the least uncertainty, i.e., the model is confident that Alphabet will perform similarly to the index. Pure Storage has wide uncertainty bands, which makes intuitive sense; the company is much smaller and the probability of larger index-deviating fluctuations is higher. Marvell sits between them regarding uncertainty, but should be considered rather risky (the 12m and 24m models have much higher uncertainty than the 36m model, indicating a volatile stock in the short term but perhaps less so long term).
  • It’s bearish on Nvidia (NVDA) and Reddit (RDDT). It seems rather confident (narrow uncertainty bands) that Nvidia is overpriced, but I wouldn’t read too much into this. However, running the model without the last 4 years of data (only up to 2021) produced a clear recommendation for Nvidia in 2021 (I wish I had this model running back then 😀). Regarding Reddit – which I love and use daily – I wouldn’t put much emphasis on the prediction. The uncertainty bands are so large that the model is basically admitting it has no clue.
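As a rough illustration of how the 1σ and 2σ bands relate to a predicted mean and standard deviation, the sketch below computes both intervals and the implied probability of beating the index (a value above 1). It assumes the prediction is a Normal distribution on the same scale as the plotted values, and the numbers are made up rather than taken from the dashboard.

```python
from statistics import NormalDist


def bands(mean: float, std: float) -> dict[str, tuple[float, float]]:
    """Return the 1-sigma and 2-sigma intervals around a predicted mean."""
    return {
        "1 sigma (~68%)": (mean - std, mean + std),
        "2 sigma (~95%)": (mean - 2 * std, mean + 2 * std),
    }


def prob_beats_index(mean: float, std: float) -> float:
    """Probability that the index-relative return exceeds 1 (Normal assumption)."""
    return 1.0 - NormalDist(mu=mean, sigma=std).cdf(1.0)


# Hypothetical prediction: mean 1.3 with a standard deviation of 0.3.
mean, std = 1.3, 0.3
for label, (low, high) in bands(mean, std).items():
    print(f"{label}: {low:.2f} to {high:.2f}")
print(f"P(beats index) = {prob_beats_index(mean, std):.2f}")
```

With these numbers the lower edge of the 1σ band sits at 1, which corresponds to a stock predicted to outperform the index by a full standard deviation.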

Below is a so-called beeswarm plot of the 36-month prediction horizon model’s SHAP values. There are many insights to be made from this plot; I’ll just highlight a few.

Model SHAP values

  • The top four most important predictors are industry, book value per share, location, and price-to-book ratio6.
  • Book value per share shows some surprising results: lower values contribute positively while higher values contribute negatively to the index-relative return estimates. This seems to contradict traditional value investing principles, and I’m unsure why. It could reflect the US market’s consistent preference for companies with lighter asset structures and growth potential over asset-heavy businesses throughout the analysed period (1995-present).
  • High volatility is usually considered bad. Interestingly, because of CatBoost’s nonlinearity, low volatility sometimes contributes negatively to a stock’s predicted development. The model claims that for certain stocks, given all the other feature values, low volatility might not be so good. Also note that the grey dots for this metric, which indicate a missing value (as happens for recently listed stocks), are usually interpreted as positive.
  • Low enterprise value is generally positive.
  • The P/E ratio seems like a metric with a sweet spot; it shouldn’t be too high or too low. The same can be said for earnings per share diluted (incomeStatement_epsDil).
  • The “relative SMA development”, a feature I created to catch momentum by comparing a stock’s SMA (simple moving average) to the S&P 500 SMA, yields different predictive behaviour depending on the lookback period. For 12-month SMA, the model finds that a low trajectory compared to the index is generally good, whereas for the 36-month lookback it seems the opposite, that an SMA that performs well compared to the index SMA contributes to more positive predictions (capturing momentum?).

Below is a screenshot of the Stock Picker module in the dashboard, where I can easily compare stocks’ prediction results. I’ve shaded some feature values due to the data’s personal license.

On top is a summary table with the stocks picked for comparison; below are three tables with each stock prediction’s SHAP values, in descending absolute SHAP value order.

Stock SHAP values

SHAP values represent the contribution of a feature to the difference between the actual prediction and the average prediction. For example, Nvidia for the 36m prediction horizon has the most contribution, compared to the average prediction, from the features ev_to_sales, overview_bvps and overview_roa.
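To illustrate that definition, the sketch below uses made-up SHAP values for the three features named above and shows the additive property: the average prediction (the base value) plus the sum of a prediction's SHAP values gives that prediction, on whatever scale the model predicts. The numbers are hypothetical, and all other features are assumed to contribute zero to keep the example short.

```python
# Hypothetical SHAP values for one prediction; not real model output.
base_value = 1.00  # average prediction over the training data (assumed)
shap_values = {
    "ev_to_sales": -0.21,
    "overview_bvps": 0.12,
    "overview_roa": 0.08,
}

# SHAP values are additive: base value + contributions = prediction.
prediction = base_value + sum(shap_values.values())

# Rank features by absolute contribution, as in the dashboard tables.
ranked = sorted(shap_values.items(), key=lambda item: abs(item[1]), reverse=True)

print(f"prediction = {prediction:.2f}")
for feature, value in ranked:
    print(f"{feature:>15}: {value:+.2f}")
```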

As discussed above, the model’s prediction accuracy is not high. Accordingly, the estimated uncertainty around the model’s predicted mean (most likely estimate) is usually very high. In one sense I am content that it is indeed large, as low uncertainty for predicting future stock gains would be a model smell (too good to be true). On the other hand, the uncertainty is currently so high – for many stocks – that it raises a question: are these results valuable at all? If the model tends to predict that a stock will perform on par with the index – say between 0.9 and 1.1 – with much uncertainty, are such results useful for decision-making?

In the model’s defence, the uncertainty varies quite a lot, which is a sign of a healthy model. It’s rare for the model to combine a high probability of beating the index with relatively low uncertainty (standard deviation at ~0.3), but there are some examples (e.g., Apple and Lam Research in December 2024). The SHAP values for a given stock’s prediction may also offer insights into why it has low or high uncertainty (SHAP values mainly measure attribution to the mean prediction, but still give clues about what causes high uncertainty).

Does it make sense to predict a stock’s future development 1, 2, 3 years ahead from quarterly statements? The model’s fundamentals data is just a snapshot from the latest available quarterly financial report. There’s no historical context, like the trajectory of key metrics in recent years, which could be relevant. It might be possible to add features for this, including data from the latest annual statement or calculating simple moving averages for key metrics from past quarterly statements. The risk is that this adds more redundant features and makes the training examples more correlated, so it might not add much predictive value.

I haven’t backtested the model results, like simulating how a portfolio of the top 5 stock picks would have performed 3 years ahead. If you think I’m putting too much confidence in a model that hasn’t been backtested, you have a point. However:

  • The model has been evaluated on an unseen randomly selected test set. That random test set should contain test data for all years from 1995 until today minus the prediction horizon. So, the model has arguably already been tested on historical data.
  • Backtesting would require creating an automated buy and sell strategy, creating something on top of the model results that they weren’t intended for. The ambition is not to perform automated trades, but to aid manual, long-term stock decisions.

Regardless, I think I could have spent more time evaluating historical results, to better understand the model’s strengths and weaknesses. While testing on randomly selected historical data helps check basic accuracy, simulating real-world use – where models make predictions year-by-year using only past data – could better assess performance over time.
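A year-by-year evaluation of that kind could be set up roughly as below. This is a generic walk-forward split, not something the current pipeline does; the function name and the `horizon_years` gap are assumptions.

```python
from typing import Iterator


def walk_forward_splits(years: list[int], first_test_year: int,
                        horizon_years: int = 3) -> Iterator[tuple[list[int], int]]:
    """Yield (training years, test year) pairs using only past data.

    Training stops `horizon_years` before the test year so that every
    training target is already known when the test prediction is made.
    """
    for test_year in sorted(years):
        if test_year < first_test_year:
            continue
        train_years = [y for y in years if y <= test_year - horizon_years]
        if train_years:
            yield train_years, test_year


for train, test in walk_forward_splits(list(range(1995, 2022)), first_test_year=2018):
    print(f"train {train[0]}-{train[-1]} -> test {test}")
```

The gap matters because a training example from year Y with a 3-year horizon only has a known outcome in year Y + 3; including it earlier would leak future information into training.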

The model results are pretty coarse and require careful evaluation to be useful. “Sprinkling” some AI on top could help automate that process. LLM agents could, for example, include web search results and data from Tiingo’s news API to augment analyses and recommendations, producing things like “This month’s top 10 picks, given criteria X, Y and Z”, where X, Y and Z could be provided with prompts. Creating something useful and trustworthy like this would probably not be trivial, especially while keeping costs at zero and not breaching licensing terms7.

One could always come up with ideas for more features or refine existing ones. I’m sure there are some – unknown to me – really good features. However, I also think that adding more features will show diminishing returns, so it might not be worth spending a lot of time on. Predicting stock trajectories will operate in the low signal-to-noise realm regardless of how many great features the model is trained on. Below is one concrete idea I’ve been thinking about.

Add historical news sentiments as model features

Tiingo provides a seemingly rich set of historical news articles per ticker in my subscription. These could be run through LLMs for sentiment analysis, with a prompt like “On a scale from 1 to 5, how positive is this article for buying stock X long term?” supplemented with some examples and enforcing structured outputs. These results could be added to the model alongside existing fundamental and technical features, as a one-off job for historical articles and a weekly batch job on GitHub Actions for keeping data up to date.

Arguably, older financial data, let’s say prior to 2010, is less valuable for predicting today’s stock trajectories. A decay function modelling a recency bias would be simple to add through sample weights, for example with the set_weight method on CatBoost’s Pool.
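One possible shape for such a decay is an exponential half-life on the age of each training example. The sketch below only computes the weights; the function name and the 10-year half-life are arbitrary assumptions, and the resulting list would then be handed to CatBoost as per-example sample weights.

```python
def recency_weights(years: list[int], reference_year: int,
                    half_life_years: float = 10.0) -> list[float]:
    """Exponential decay: an example loses half its weight every `half_life_years`."""
    return [0.5 ** ((reference_year - year) / half_life_years) for year in years]


# Hypothetical report years for four training examples.
years = [1995, 2005, 2015, 2025]
weights = recency_weights(years, reference_year=2025)
for year, weight in zip(years, weights):
    print(year, round(weight, 3))
```

A shorter half-life discounts the 1990s and 2000s more aggressively; whether that helps would have to be checked against the test set metrics.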

Training data during irregular times, e.g., around the dot-com bubble and the 2008 crash, should possibly be removed or downweighted. Even though the model predicts index-adjusted returns – which should mitigate some fluctuations during times of fear – these were times of high volatility and many stocks tanked completely. It’s questionable how valuable training examples from these periods are, as the data is so noisy.

So much talk and no concrete outcomes! I can’t round off without revealing if I’ve made any purchase decisions based on the model results. I plan to run this product for years and jump on good trades if/when they appear; I’m trying to be calm and not rush into decisions I’ll find it hard to stick with. My portfolio is still in the making. However, I’ve made purchases into three stocks:

  1. Lam Research (LRCX): This is the only stock purchased mainly because it was recommended by the model. It’s predicted to outperform the index in all three prediction horizons, and has relatively low uncertainty (standard deviation ≈ 0.3). Comparing it with similar stocks and assessing its SHAP values, the model seems to be suggesting it is undervalued.
  2. Pure Storage (PSTG): This tip came from other sources, but I used the model to assess its viability. The model puts it in the “perform about the same as the index” category with not too high uncertainty. This, plus reading up on the company, convinced me to try to hold this longer term.
  3. Marvell (MRVL): This tip came from other sources and has a notably high P/E ratio. Still, the model predicted it to be in the “perform about the same as the index” category, which made me more confident it wouldn’t be too risky but still have a big potential upside.

Admittedly, it’s an AI-biased “portfolio”, so I’m hoping the AI bubble won’t burst just yet. I’ve committed long term and will try to sit steady for years and see how it goes. I might add more stocks if I find promising candidates.

Remember: If it goes well, the model shouldn’t get all the credit; if it goes poorly, it shouldn’t get all the blame. Either way, it will be exciting to see how it goes!


  1. Throughout this text, I use “uncertainty” and “standard deviation” interchangeably; they refer to the same thing here. Strictly, uncertainty is modelled as a Normal distribution, where standard deviation measures “how wide” the uncertainty distribution is. ↩︎

  2. Note that the index, or the SPY ETF, is repeated for all stocks to simplify this calculation. ↩︎

  3. To the extent SQL can be considered readable, of course. Sorry. ↩︎

  4. The Securities and Exchange Commission. ↩︎

  5. That is fast to train, can detect non-linear relationships and provides uncertainty estimates. ↩︎

  6. I possibly should have excluded industry when including sicIndustry, but considering that they don’t contain the same values and that boosted tree models like CatBoost tend to be good at handling redundant features, I decided to leave it in. ↩︎

  7. I’m unsure if sharing data to a privacy-respecting LLM provider via Openrouter would breach Tiingo’s personal license; I’d have to investigate. ↩︎