Sample-size-dependent Click-Through Rate ranker

Series - News Article Ranking
Warning
The dashboard only works on desktop size screens, not your phone. Sorry.

This post presents an interactive dashboard designed to facilitate the exploration and understanding of a potential model for calculating a Click-Through Rate (CTR)1 based ranking score. The CtrScore is part of the combined article ranker Score, as explained in the other post in this series. A Bayesian approach is used for modelling the CTR, with the Binomial likelihood and the Beta distribution as the conjugate prior and posterior distributions. Obtaining probability distributions for the CTRs enables us to better handle variation due to small sample sizes and to compare and rank CTRs for articles with different sample sizes more fairly – or at least according to an explicit strategy. The idea is inspired by David Robinson’s great write-ups on empirical Bayes for baseball statistics; however, using the Beta quantile function/qbeta for ranking articles might be “novel” (at least I have not seen it anywhere else).

In the combined ranker Score we have the TimeDecay ranker that accounts for the “time since published”, which intuitively makes sense to include when ranking news articles since their relevance is sensitive to how “fresh” the news content is. Another natural-to-include non-personalized ranker “signal” is popularity: how popular, among our users, is the news article over the last n minutes. A metric that can be used to measure article popularity is Click-Through Rate2, which is simply the number of article teaser clicks divided by article teaser impressions3, or $n_{clicks}/n_{impressions}$. A simple and intuitive metric for article popularity, in my opinion. But we’re not all the way there:

  • When an article is published (on the frontpage) it has 0 clicks and 0 impressions, and in the first minutes afterwards4 it has a relatively small sample size for the CTR calculation. This can lead to large fluctuations in the CtrScore for some time before it closes in on its true CTR, if not handled properly. In addition, it is not straightforward to compare CTRs of different sample sizes fairly, since we have more confidence in CTRs with large sample sizes than in those with small ones.
  • How can we convert this fraction to a ranker score (the CtrScore)?

Here is one concrete way to do this:

  1. Use the Beta distribution as the conjugate prior for the Binomial likelihood, which provides a probability distribution for the Binomial success probability, i.e. the CTR. By obtaining a probability distribution rather than a (frequentist) point estimate, we can quantify the uncertainty in the CTR, which allows for systematic approaches to comparing CTR scores across different sample sizes. In the dashboard I suggest using the qbeta metric (the Beta quantile function). See below the dashboard for a more detailed explanation.
  2. The qbeta “score” determines the order of the articles; we still need to convert that order into concrete CtrScores. In the Svelte dashboard a simple “linear scoring” logic has been used: if there are 3 articles to rank in the order a, b, c, then a gets a score of 1.0, b gets 0.5 and c gets 0 (a minimal sketch of both steps follows this list).
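To make these two steps concrete, here is a minimal R sketch (not the dashboard’s actual code) that orders articles by the Beta quantile of their posterior CTR distribution and then applies the linear scoring. The default quantile probability and the Beta(1, 1) prior are assumptions of mine; the dashboard itself uses a flat 0, 0 “prior”.

```r
# Minimal sketch: rank articles by the qbeta ordering metric, then map the
# resulting order to a linear CtrScore in [0, 1] (best = 1, worst = 0).
rank_ctr <- function(clicks, impressions, p = 0.8,
                     prior_shape1 = 1, prior_shape2 = 1) {
  # Posterior Beta parameters: prior pseudo-counts + observed successes/failures
  shape1 <- prior_shape1 + clicks
  shape2 <- prior_shape2 + impressions - clicks

  # The Beta quantile at probability p is the ordering metric
  q <- qbeta(p, shape1, shape2)

  # Linear scoring: with 3 articles ordered a, b, c the scores are 1.0, 0.5, 0
  ord <- rank(-q, ties.method = "first")
  n <- length(q)
  ctr_score <- if (n > 1) (n - ord) / (n - 1) else 1

  data.frame(clicks, impressions, qbeta = q, CtrScore = ctr_score)
}

rank_ctr(clicks = c(10, 100, 1000), impressions = c(50, 500, 5000))
```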

The dashboard facilitates understanding of step 1 above and allows for tuning the qbeta parameter to explore different ranking strategies. Here it is, made with R Shiny and hosted on GCP Cloud Run:

How to use and interpret the dashboard

On the left-hand side you have a free text field, where each line holds a new set of ID, Clicks and Impressions. You can enter values directly in this field, but I recommend composing your examples in a notepad (due to the field’s trigger-happy reactivity) and then copy-pasting them into it. You may enter as many examples as you’d like, but I suggest keeping it below 10.

On the right you have a corresponding plot and table. The plot draws the posterior Beta distributions and annotates in light blue the “CTR quantile” obtained by evaluating the Beta quantile function at probability $p$; this quantile is the metric used for ordering the articles. The table contains the corresponding input and output data points.

To get an intuition for why I think the Beta quantile could be a good metric for determining article order, consider a list of articles with the same MLE5 for the CTR but different sample sizes (ID, Clicks, Impressions):

a, 10, 50
b, 100, 500
c, 1000, 5000
d, 2000, 10000

I argue there is no single correct value of $p$ – the probability at which the Beta quantile is evaluated – for ordering articles by CTR; it depends on your ranking strategy. If you prefer a ranker or frontpage where recently published content is prioritized, you should consider using a value of $p$ greater than 0.5. In this case, articles with smaller sample sizes will be ordered above articles with a similar MLE for the CTR but larger sample sizes. The opposite is true for $p$ below 0.5. That allows for a more cautious strategy that prioritizes articles that have obtained larger sample sizes, i.e. more evidence; a recently published article would need to accumulate more trials/data before the frontpage “believes” it actually is a better performer than older articles. I encourage you to play around with different examples and different values of $p$ to get a feel for it. For more about why I think the quantile function may be a good choice for ranking, see Why the Beta quantile function? below.
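As a quick illustration of this (again just a sketch, mirroring the dashboard’s flat 0, 0 “prior”), the four example articles above reverse their order completely when $p$ moves from above 0.5 to below it:

```r
# The four example articles share the same MLE (CTR = 0.2) but differ in sample size.
clicks      <- c(a = 10, b = 100, c = 1000, d = 2000)
impressions <- c(a = 50, b = 500, c = 5000, d = 10000)

# Flat 0, 0 "prior", as in the dashboard: posterior shapes are just the counts.
shape1 <- clicks
shape2 <- impressions - clicks

ctr_quantile <- function(p) setNames(qbeta(p, shape1, shape2), names(clicks))

# "Optimistic" strategy (p > 0.5): smaller samples get the benefit of the doubt.
sort(ctr_quantile(0.8), decreasing = TRUE)  # order: a, b, c, d

# Cautious strategy (p < 0.5): more evidence is rewarded.
sort(ctr_quantile(0.2), decreasing = TRUE)  # order: d, c, b, a
```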

Note
There is a major weakness with the dashboard in its current design: there is no option to set the prior values for the Beta distribution. The ability to set a common prior for all CTRs should be part of this tool. This holds regardless of how you obtain the prior – it could come from empirical Bayes, for example, or you could pick a reasonable prior that aligns with your ranking strategy. The prior values for all IDs are currently 0, 0 for shape1 and shape2, which provides no regularization. If you insist on using this dashboard with priors – I am sorry to say that it will never get updated 😜 – you could add the prior values to each ID line manually yourself.

The Binomial distribution can be used to model a case where you have a trial with two distinct outcomes, either success or failure. This is pretty close to the scenario of an article teaser on a newspaper frontpage: each time an article is displayed to a user (triggering an impression6 tracking event) we have a trial going on, with success being an article teaser click and failure being a “no-click”. However, the assumptions required for using a Binomial distribution are strict, and our newspaper frontpage case does not comply with all of them. Let’s go through them and assess each one:

  1. Fixed Number of Trials $n$: The number of trials is known and fixed in advance.
    • Assessment: This is not true; we don’t know in advance how many trials an article teaser will get. It is probably not a big deal though, so we assume this one is okay. ✅
  2. Binary Outcomes: Each trial has only two possible outcomes, commonly referred to as “success” and “failure.”
    • Assessment: ✅
  3. Independent Trials: The outcome of any given trial is independent of the outcomes of the other trials.
    • Assessment: Whether this is true depends on how you use the data collected from user frontpage behaviour. If we define the trials as unique(clicks)/unique(imps) per user (we count at most one impression and one click per user), we should be pretty close to having independent trials (a small sketch of this aggregation follows the list). If, however, we define each trial as every time a user sees the teaser, the trials are not independent. For example, if a user comes to a frontpage 4 times during a day and sees the same teaser 3 times, these 3 trials are not independent. This latter way of using the data is likely the most common one, but it has issues like sampling bias7. ❌
  4. Constant Probability of Success $p$: The probability of success on a single trial is constant across all trials.
    • Assessment: This might be close to true for evergreen8 articles, but not for more ephemeral content; $p$ likely depends on article freshness/time-since-published. ❌
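For the “unique per user” trial definition in assumption 3, a possible aggregation could look like the sketch below, using dplyr and tidyr. The event log format (user_id, article_id, event) is hypothetical; the point is simply that each user contributes at most one impression and one click per article.

```r
library(dplyr)

# Hypothetical event log: one row per tracking event.
events <- data.frame(
  user_id    = c("u1", "u1", "u1", "u2", "u2", "u3"),
  article_id = c("a",  "a",  "a",  "a",  "a",  "a"),
  event      = c("impression", "impression", "click",
                 "impression", "click", "impression")
)

events |>
  distinct(user_id, article_id, event) |>   # at most one data point per user
  count(article_id, event) |>
  tidyr::pivot_wider(names_from = event, values_from = n, values_fill = 0) |>
  mutate(ctr = click / impression)          # unique(clicks) / unique(imps)
```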

So, given that we’re violating at least two out of four assumptions, should we just discard the Binomial distribution and find another model? I’d argue no, as I think we’re still close enough for our modelling purposes. We’re not using the Binomial distribution here to predict a value of some sort (which would make the assumptions more important), but as a mathematical expression that is close to our metric at hand, CTR. There is also the principle of Occam’s razor: we should select the simplest among otherwise equivalent models. I think it is likely that better/more correct models than the Binomial exist for this case, but the Binomial is so simple that it is elegantly easy to understand and communicate to stakeholders, which is a great pro in itself. And, of course, we have my favourite stats quote:

All models are wrong, but some are useful.

George Box

In my more junior stats days, I strove to create the perfect model for the problems I worked on. I think the time spent on modelling with that objective was a great learning experience, but these days I am clear about preferring simple, incorrect models in a great product over great models in a mediocre product. There is also a chronology here that matters from a product perspective: in general, start thinking about your product and end users from the start and put your major effort there first. If the product shows potential and traction, that’s when you should consider investing in making a great (i.e. time-consuming) model to power it.

While I try to be open-minded when selecting statistical or machine learning models for specific problems, I generally prefer Bayesian models to frequentist ones. Bayesian models offer several notable advantages, such as being explicit about assumptions, allowing for greater modelling flexibility, incorporating prior knowledge, and providing a more effective way of assessing uncertainty through random variables rather than relying solely on point estimates and variances. However, the Achilles heel, if you’d like, of Bayesian modelling is the amount of compute required. For most non-trivial problems formulated as Bayesian models, there is no closed-form or analytical answer, which means we typically need to opt for compute-intensive methods like MCMC or variational inference. This might be fine for a single experiment in research or an A/B test, but it doesn’t scale well for live products with hundreds of “experiments” and new data arriving continuously, like a newspaper frontpage.

There exists one algebraic convenience, though, that one may use if the problem at hand is simple enough, and it is called a conjugate prior. Mathematicians have shown that when the prior is chosen such that the posterior belongs to the same probability distribution family, you can “get away with” closed-form expressions for calculating the posterior distribution. This dramatically decreases the compute requirements, which opens up the possibility of using such (Bayesian) models in low-latency production settings.

In our case, with the Binomial likelihood for article teaser trials, the Beta distribution conjugate prior (and, hence, posterior) is a great fit for our use case. In Table 1 below there is a summary of this model. For more about this, I recommend David Robinson’s great write-ups on empirical Bayes for baseball statistics or his book.

Table 1: Binomial likelihood's conjugate prior distribution (from Wikipedia). Clicks correspond to successes, $\alpha$. Impressions correspond to number of trials, $m = \alpha + \beta$.
| Likelihood | Model parameters | Conjugate prior distribution | Prior hyperparameters | Posterior hyperparameters | Interpretation of hyperparameters |
|---|---|---|---|---|---|
| Binomial with known number of trials, $m$ | $p$ (probability) | Beta | $\alpha,\ \beta \in \mathbb{R}$ | $\alpha + \sum_{i=1}^{n} x_i,\quad \beta + \sum_{i=1}^{n} N_i - \sum_{i=1}^{n} x_i$ | $\alpha$ successes, $\beta$ failures |
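To make the closed-form update in Table 1 concrete, here is a tiny sketch of the posterior calculation for a single article. The prior values are made-up placeholders, not numbers from the post:

```r
# Hypothetical prior: 3 pseudo-clicks and 97 pseudo-no-clicks (prior mean CTR of 3%).
prior_alpha <- 3
prior_beta  <- 97

clicks      <- 10
impressions <- 50

# Closed-form Beta posterior, no simulation needed (see Table 1).
post_alpha <- prior_alpha + clicks                # alpha + sum(x_i)
post_beta  <- prior_beta + impressions - clicks   # beta + sum(N_i) - sum(x_i)

post_alpha / (post_alpha + post_beta)             # posterior mean CTR
qbeta(0.8, post_alpha, post_beta)                 # the qbeta ordering metric at p = 0.8
```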

Why the Beta quantile function?

Why do I suggest using the Beta quantile function for determining the order of the Beta distributions? Two primary reasons:

  • It is fast to calculate. Evaluating the quantile of a Beta distribution (e.g. via qbeta) for hundreds of articles is cheap and should work well in a production setting where one would like to update article ranker Scores frequently.
  • It is a single parameter that allows for picking a ranking strategy. As mentioned in How to use and interpret the dashboard above, I argue that in a ranking scenario like this it is more important to have a parameter that can be optimized than to try to calculate the “correct” order. For example, there is a way to calculate the probability that Beta distribution A will “beat” Beta distribution B, which is certainly a more mathematically correct way to order two distributions. However, it does not allow for adapting the ranking strategy, since there are no parameters to tune9. I argue we should prefer an approach that allows us to tweak and adapt the ranking strategy to the business optimization objectives, like “minimize subscriber churn”, “maximize article teaser clicks”, “maximize subscription sales”, “maximize ads revenue”, etc. (see the sketch after this list).
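As a rough illustration of the trade-off (a sketch under the same flat 0, 0 “prior” assumption as earlier), the quantile approach is a single cheap qbeta call per article with $p$ as the strategy knob, whereas estimating “A beats B” is pairwise and exposes nothing to tune:

```r
# Posterior shapes for two articles with the same MLE but different sample sizes.
a <- c(shape1 = 10,  shape2 = 40)    # 10 clicks / 50 impressions
b <- c(shape1 = 100, shape2 = 400)   # 100 clicks / 500 impressions

# Quantile-based ordering: one qbeta call per article, with p as the strategy knob.
qbeta(0.8, a["shape1"], a["shape2"]) > qbeta(0.8, b["shape1"], b["shape2"])

# "Probability that A beats B": Monte Carlo estimate, pairwise, and no knob to tune.
set.seed(42)
mean(rbeta(1e5, a["shape1"], a["shape2"]) > rbeta(1e5, b["shape1"], b["shape2"]))
```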

  1. Click-Through Rate. The dashboard has the heading VTR, which stands for View-Through Rate, which is the same metric. ↩︎

  2. Click-Through Rate. The dashboard has the heading VTR, which stands for View-Through Rate, which is the same metric. ↩︎

  3. A teaser impression is an exposure of the teaser to a user. I.e. if a user sees the entire article teaser, usually a tag line with or without an image, this can be tracked as an impression. ↩︎

  4. This depends on where on the frontpage the article is first published. If at the top of the frontpage, it should get sufficient clicks and impressions in a matter of seconds on a reasonably popular frontpage; if at the lower half of the frontpage, it may take several minutes before the sample size is “sufficient” to be comparable to older articles. ↩︎

  5. Maximum Likelihood Estimator. ↩︎

  6. A teaser impression is an exposure of the teaser to a user. I.e. if a user sees the entire article teaser, usually a tag line with or without an image, this can be tracked as an impression. ↩︎

  7. Every newspaper frontpage will have a diverse set of users, from the frequent visitor (many times a day) to the infrequent visitor (once or twice a week). Running A/B tests on frontpages where each article teaser trial is considered independent (not using uniques per user) will inevitably have a substantial sampling bias towards frequent visitors. This is because they run lots of (dependent) trials, which “waters out” the few trials of infrequent users in the metrics tested for. For subscription revenue models, this might be an important aspect to think about: infrequent users who subscribe may pay just as much as the “news junkie”, and they should probably be given at least as much emphasis in the frontpage optimization strategy, since highly engaged users are less likely to churn than infrequent visitors. ↩︎

  8. Articles that are relevant for a long period – months or years – are often referred to as evergreens. ↩︎

  9. This approach does not scale computationally for hundreds of ranker items either. ↩︎