The elephant in the analytics room: Uncertainty

Posted on July 16th, 2019   

Analytics in football is no longer a novelty. The poster child, Expected Goals, has crept out of blogs and into use within clubs and in post-game analysis on television. More generally, clubs routinely use data in their performance analysis and recruitment activities. For now, the adoption of data is going in only one direction. We at Dectech would certainly like to see this continue, so that the use of objective information becomes the norm when judgements are made at football clubs.

For this trajectory to continue, clubs must keep seeing the value of using analytics in their decision making. This in turn means being honest about what the analytics are providing. To that end, in this blog piece we are going to discuss uncertainty.

We do not typically see uncertainty quoted alongside analytics. This is perhaps understandable, but it does hide some potentially crucial information from the decision maker. If, for example, you tell someone that a player is in the 70th percentile for some attribute and leave it at that, the person will assume you are sure that is where the player sits. The reality, depending on what is being measured, could be that the player is anywhere from the 50th to the 90th percentile on that attribute. This is important information. Without it, a decision maker could put too much weight on your analysis of that attribute. It is quite possible that highly uncertain analytics could lead to worse decisions than no analytics at all. Ultimately, over a long enough time period, this could lead to a distrust of analytics. That’s a bold statement, so we need to do some analysis to get a flavour of the uncertainty that is out there.


Detail vs Noise

We are going to focus on a very simple metric here, and we’re going to use that to investigate where uncertainty can come from. However, before doing that, let’s consider the wider landscape.

Let’s imagine we are concerned with understanding a player’s contribution to his team’s efforts. That contribution can be rated as positive or negative overall, based on the various models and metrics used to evaluate it. But you can go a bit deeper and look separately at its defensive and attacking impacts. Or go even deeper and evaluate passing, shooting, tackling, etc. This could go on and on, as all these splits help a manager identify the strengths and weaknesses of his players in detail. But there is a pitfall to overdoing this: uncertainty.

You don’t need to be a statistician or data analyst to understand that the more information you gather on something, the more confident you can be about the results of any analysis applied. It may be desirable to be able to rank a player in as many categories as possible, but the observed data might be limited, so that some rarely observed categories need to be grouped together in order to contain enough data to construct meaningful metrics. In other words, there is a trade-off between the precision of the analytics and their interpretability.
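To put a rough number on that intuition, here is a minimal Python sketch. It assumes each pass is an independent success-or-failure event, so the standard error of an observed success rate p over n attempts is sqrt(p(1 - p) / n). The function name and the 80% example rate are ours, chosen purely for illustration, and are not figures from the data set.

```python
import math


def success_rate_standard_error(success_rate, attempts):
    """Standard error of an observed success rate.

    Assumes independent passes (a binomial model), which is a simplification.
    """
    return math.sqrt(success_rate * (1.0 - success_rate) / attempts)


# The same 80% success rate, observed over fewer and fewer passes.
for attempts in (2000, 500, 50):
    se = success_rate_standard_error(0.80, attempts)
    # 1.96 standard errors gives an approximate 95% margin.
    margin = 100 * 1.96 * se
    print(f"{attempts:>5} passes: 80% +/- {margin:.1f} percentage points")
```

Every time a category is split, each resulting sub-category holds fewer passes per player, so the margin on its success rate can only widen.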


Passing: A Case Study

We’ll take passing, the most common action in a football match, as our case study. Not all passes are of the same difficulty or importance. Their success rate and value can vary a lot depending on the situation. To demonstrate our point, we look at a single season of Premier League matches, focusing on midfielders because they typically attempt the widest range of pass types. We require that a player has attempted at least 500 passes to be included in the analysis.

Our metric is a simple one: the pass success rate. The idea of the experiment is also simple. We start by calculating an overall pass success rate for each player. This results in a certain ranking of the players, which we convert to a percentile ranking. Then we start decomposing our metric by partitioning our data according to various pass types: Passes Within vs Outside the Final Third of the pitch, Long vs Short, and finally Open Play vs Set Piece. Each different data partition naturally leads to a different ranking of the players. For example, while initially we have a single metric (overall pass success rate), after partitioning the data into Within and Outside the Final Third, we have Final Third pass success rate and Outside Final Third pass success rate. After the introduction of the next partitioning layer (Long vs Short) we have four categories: Final Third-Long, Final Third-Short, Outside Final Third-Long, and Outside Final Third-Short, and so on.
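The decomposition can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the code behind the analysis: the `PassEvent` record, the function names and the tiny hand-made data set are all hypothetical, the third layer (Open Play vs Set Piece) is left out for brevity, and the percentile is computed with a mid-rank convention that we picked for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class PassEvent(NamedTuple):
    """Hypothetical record for one attempted pass."""
    player: str
    in_final_third: bool
    is_long: bool
    succeeded: bool


def category(event, layers):
    """Label a pass using the first `layers` partitioning layers (0, 1 or 2)."""
    parts = []
    if layers >= 1:
        parts.append("Final Third" if event.in_final_third else "Outside Final Third")
    if layers >= 2:
        parts.append("Long" if event.is_long else "Short")
    return "-".join(parts) if parts else "Overall"


def success_rates(events, layers):
    """Return {category: {player: pass success rate}}."""
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for event in events:
        tally = tallies[category(event, layers)][event.player]
        tally[0] += int(event.succeeded)  # completed passes
        tally[1] += 1                     # attempted passes
    return {
        cat: {player: made / tried for player, (made, tried) in players.items()}
        for cat, players in tallies.items()
    }


def percentile_ranks(rates):
    """Mid-rank percentile: players below, plus half of those tied, out of all."""
    n = len(rates)
    ranks = {}
    for player, rate in rates.items():
        below = sum(1 for other in rates.values() if other < rate)
        tied = sum(1 for other in rates.values() if other == rate)
        ranks[player] = 100.0 * (below + 0.5 * tied) / n
    return ranks


# Tiny made-up data set, only to show the mechanics.
events = [
    PassEvent("A", True, False, True), PassEvent("A", False, True, False),
    PassEvent("A", False, False, True), PassEvent("B", True, False, False),
    PassEvent("B", False, False, True), PassEvent("B", False, True, True),
    PassEvent("B", False, False, True), PassEvent("C", True, True, False),
    PassEvent("C", False, False, False), PassEvent("C", False, False, True),
]

for layers in (0, 1, 2):
    for cat, rates in sorted(success_rates(events, layers).items()):
        print(layers, cat, percentile_ranks(rates))
```

Note how quickly the categories thin out: with two layers, some categories in this toy data contain a single pass from a single player.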

The final step of our experiment is the uncertainty calculation. The method we use to determine uncertainty is called the bootstrap, and it involves repeatedly re-sampling with replacement from the original data set. For every sample, we estimate the pass success rates and calculate the percentile rankings of the players in each category. We can then use the samples we have collected to measure the uncertainty. We do this by seeing how the percentiles collected for each player vary across the samples.
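A bare-bones version of this procedure, for a single category, might look like the Python sketch below. Several details are our assumptions rather than a description of the actual analysis: we resample each player's passes separately (so every player keeps his original number of attempts), we use a mid-rank percentile, and the player names and success probabilities are synthetic.

```python
import random


def percentile_ranks(rates):
    """Mid-rank percentile of each player's rate among all players."""
    n = len(rates)
    values = list(rates.values())
    return {
        player: 100.0 * (sum(v < rate for v in values)
                         + 0.5 * sum(v == rate for v in values)) / n
        for player, rate in rates.items()
    }


def bootstrap_percentiles(outcomes, n_samples=1000, seed=0):
    """Bootstrap the percentile ranking.

    outcomes: {player: [1, 0, 1, ...]} with 1 for a completed pass.
    Returns {player: [percentile in sample 1, percentile in sample 2, ...]}.
    """
    rng = random.Random(seed)
    collected = {player: [] for player in outcomes}
    for _ in range(n_samples):
        rates = {}
        for player, results in outcomes.items():
            # Draw as many passes as the player attempted, with replacement.
            resample = rng.choices(results, k=len(results))
            rates[player] = sum(resample) / len(resample)
        for player, percentile in percentile_ranks(rates).items():
            collected[player].append(percentile)
    return collected


# Synthetic players with slightly different true success probabilities.
data_rng = random.Random(1)
outcomes = {
    f"player_{i}": [int(data_rng.random() < 0.70 + 0.02 * i) for _ in range(500)]
    for i in range(10)
}
samples = bootstrap_percentiles(outcomes, n_samples=200)
for player, percentiles in samples.items():
    print(player, min(percentiles), max(percentiles))
```

The spread between the lowest and highest percentile each player receives across the samples is the raw material for the uncertainty measure.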

Figure: Example Percentile Distribution

We will call this uncertainty the Percentile Uncertainty Range (PUR), which is based on the standard error of the sample estimates. The higher the PUR, the higher the uncertainty of the percentiles.



Given a player’s estimated percentile (say 80th) and the associated PUR (say ±10), one can expect his actual percentile to lie within this range (i.e. between the 70th and 90th percentiles) with about 95% probability.
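The exact formula is not spelled out here, so the following Python sketch is only one plausible reading. It assumes the PUR is 1.96 times the standard deviation of a player's bootstrap percentiles (the bootstrap estimate of the standard error, with a normal approximation for the roughly 95% coverage), and that a category's overall PUR is the average over its players. The function names and sample numbers are made up for illustration.

```python
import statistics

Z_95 = 1.96  # normal-approximation multiplier for ~95% coverage (our assumption)


def percentile_uncertainty_range(bootstrap_percentiles):
    """Half-width of the ~95% range: 1.96 x the spread of the bootstrap percentiles."""
    return Z_95 * statistics.stdev(bootstrap_percentiles)


def average_pur(samples_by_player):
    """Overall PUR for one category: the mean of the per-player PURs."""
    purs = [percentile_uncertainty_range(s) for s in samples_by_player.values()]
    return statistics.mean(purs)


# Made-up bootstrap percentiles for two hypothetical players.
samples_by_player = {
    "steady_player": [78, 80, 79, 81, 82],
    "volatile_player": [55, 90, 70, 40, 85],
}
for player, samples in samples_by_player.items():
    print(f"{player}: PUR = +/-{percentile_uncertainty_range(samples):.1f}")
print(f"category average PUR = +/-{average_pur(samples_by_player):.1f}")
```

A player whose percentile barely moves between samples gets a small PUR, while one whose percentile swings widely gets a large one, even if both have the same average percentile.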

The following dendrogram displays the overall PUR for the various categories of the decomposed metric, as described earlier.

According to the dendrogram above, the minimum percentile rank variability is achieved when there is no decomposition at all. With each added decomposition layer, the PUR increases, and this pattern is seen consistently.

In the “No Decomposition” case, the average PUR is ±10; that is, if a player falls in, say, the 70th percentile, we can expect him with about 95% probability to be between the 60th and 80th percentiles. Already with the first split, the average PUR increases from 10 to 14 (In Final Third) and 18 (Out of Final Third). In some cases the difference is extreme, for example when splitting Out of Final Third passes into Long and Short: the PUR jumps from 18 to 22 (Long) and 39 (Short). A PUR of ±39 can cover up to 78 of the 100 percentile points, so the metric is essentially as good as no information at all.
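To make these figures concrete, the short Python sketch below takes the average PUR values quoted above and works out how much of the 0 to 100 percentile scale the resulting range covers. The helper name is ours, and clipping the range at 0 and 100 is our own simplification.

```python
# Average PUR values as quoted in the text above.
reported_pur = {
    "No Decomposition": 10,
    "In Final Third": 14,
    "Out of Final Third": 18,
    "Out of Final Third, Long": 22,
    "Out of Final Third, Short": 39,
}


def interval_width(percentile, pur):
    """Width of the percentile +/- PUR range, clipped to the 0-100 scale."""
    low = max(0, percentile - pur)
    high = min(100, percentile + pur)
    return high - low


for name, pur in reported_pur.items():
    width = interval_width(50, pur)
    print(f"{name}: a 50th-percentile player could sit anywhere in a "
          f"{width}-point band of the 100-point scale")
```

For a mid-ranked player, the band grows from a fifth of the scale with no decomposition to more than three quarters of it for short passes outside the final third.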


Closing Remarks

It is up to the analyst who designs the metrics to strike the best compromise between uncertainty and interpretability. We have seen examples within the analytics industry of performance metrics with very fine splits. While it’s understandably tempting to do this, we don’t believe that the metrics are necessarily meaningful at that level. Precision matters. As the adage goes: “Just because you can doesn’t mean you should”. The question to ask when presented with a metric, or when you drill into your data yourself, is therefore: have I ended up with too little data to draw reliable conclusions? Or, to put it another way, am I simply tossing a coin without realising it?