Richard Crooks's Website
I was on Twitter looking for accounts to pre-emptively block (this is a good strategy; you should not waste your time arguing with bots or people who are arguing in bad faith) and discovered one such account claiming to be a statistician, and also using a particularly good example of a poor and quite deceptive graph (Figure 1). I decided that this graph was so bad that it needed to be written about, as it has a number of features of deceptive graphs. This graph relates to Covid-19 (about which there is a large amount of misinformation), so it’s topical, but it is not the only example of this sort of deceptive graph.
The Bad Graph
Figure 1: The original graph posted on Twitter. This graph purports to show that over the duration of the Covid-19 pandemic the proportion of people who died from Covid-19 who were over 80 years of age at time of death increased. The original data is not available, but eyeballing the extreme datapoints of the trend suggests that the percentage increases from n% to n% between 2020-nn-nn (day 0) and 2021-nn-nn (day n). The linear trendline has the formula y = mx+c.
This graph (Figure 1) was used to argue that the basis of the lockdown strategy used in the UK, i.e. to minimise the spread of Covid-19 and thus protect elderly and other clinically vulnerable people, was not working. The reasoning was that during the course of the pandemic (and the lockdowns), the proportion of the deaths from Covid-19 that were in people over 80 had increased, so a “herd immunity and shielding” strategy would have been better. Not only is this a flawed interpretation of the trend and its significance for government policy (the policy aims to reduce the total number of deaths, not the proportion of deaths that occur in people over 80), it’s also a very bad graph. I shall demonstrate two problems with it.
First of all, I don’t know if the actual data (NHS England, 2021) was used to produce this graph, but I will assume that the account did use real data in good faith. Since the start and finish points of the trend were not provided, I will eyeball them, rather than taking the actual data and risking uncovering that even the data used was fraudulent. The actual values are not important for the problems with this graph or its interpretation; rather, these problems are general concepts that apply to any graph and were not considered when producing this one. I shall reproduce the graph (Figure 2), re-plotting it in R, and changing the dates to days since the first death.
Why the Y Axis Matters
Figure 2: The effect on reader perception of plotting the same data on y axes of different ranges: a 0-100% axis (A) and a 52-54% axis (B). With a narrower axis range, the same graph appears to show a much stronger trend, even though the same data is used in both.
The first problem is that the Y axis matters. As you can see (Figure 2), you can make a trend look much steeper than it is by narrowing the Y axis. Obviously you want the Y axis to cover a fairly narrow range to show the trend and make the most use of the graphing space, but this means that if your trend is weak, you can easily manipulate viewers into thinking that a strong trend exists when it doesn’t really. This trend is weak: the percentage of fatalities in people over 80 rises by 1.7 percentage points over the course of 9 months, and the slope is 0.0058 per day (that’s rounded to 4 decimal places!). It’s fairly apparent to most people what a percentage means, so this isn’t an enormous problem for this particular graph for anyone who pays attention to the scale, but where more obscure data is being plotted, and the average reader is not going to know the significance of different values on an axis, this can be a huge problem.
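To put a rough number on the axis effect, here is a minimal sketch in Python (not the R used for the figures). It only uses the 1.7 percentage point rise and the two axis ranges from Figure 2; the helper name `plot_height_fraction` is made up for illustration.

```python
def plot_height_fraction(rise, y_min, y_max):
    """Fraction of the vertical plotting space that a given rise occupies."""
    return rise / (y_max - y_min)

rise = 1.7  # percentage-point increase over the whole period, from the text

full_axis = plot_height_fraction(rise, 0, 100)    # panel A: 0-100% axis
narrow_axis = plot_height_fraction(rise, 52, 54)  # panel B: 52-54% axis

print(f"0-100% axis: trend spans {full_axis:.1%} of the plot height")
print(f"52-54% axis: trend spans {narrow_axis:.1%} of the plot height")
print(f"Visual exaggeration: {narrow_axis / full_axis:.0f}x")
```

On the full axis the line climbs through under 2% of the plot height; on the narrow axis the same line climbs through 85% of it, which is 50 times steeper to the eye with no change in the data.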
Why Noise Matters
The second problem with this graph is noise. You and I both know that this trend is not linear day to day. There simply haven’t been enough Covid-19 deaths to give such a clean linear trend, so the daily values will deviate from the line. But just how much day to day variation is needed for this apparent trend to be buried within noise? I simulated the intervening data points, adding increasing amounts of noise (Figure 3). As you can see, you don’t have to add much noise (2% is what 1 death in 50 contributes) before the trend becomes meaningless. Something else this noise means is that you can choose other days to get different trends. With 1% noise I can (at least for my randomly generated dataset) choose, for example, the period between day 64 and day 121 and see that the percentage of the Covid-19 deaths which occurred in the over 80s falls from 54.87% to 53.27% during this period! This is a far steeper slope than the alleged trend, and a negative one!
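The kind of simulation described here can be sketched as follows. This is a Python stand-in, not the R script behind Figure 3, and several things are assumptions: the noise is taken to be Gaussian with a standard deviation in percentage points, the starting value of 52.2% and the series length of 290 days are hypothetical, and the 60-day window is an arbitrary choice. Only the slope of 0.0058 per day comes from the text.

```python
import random

def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

SLOPE = 0.0058     # percentage points per day, from the text
INTERCEPT = 52.2   # hypothetical starting percentage
DAYS = 290         # hypothetical series length, roughly 9 months
WINDOW = 60        # arbitrary sub-period length in days

rng = random.Random(1)  # fixed seed so a run can be repeated
days = list(range(DAYS))
trend = [INTERCEPT + SLOPE * d for d in days]

for noise_sd in (0.1, 1, 2, 5):
    # Add assumed Gaussian noise to every point on the trendline
    noisy = [y + rng.gauss(0, noise_sd) for y in trend]
    # Refit the slope over every possible WINDOW-day sub-period
    window_slopes = [
        fit_slope(days[s:s + WINDOW], noisy[s:s + WINDOW])
        for s in range(DAYS - WINDOW)
    ]
    print(f"noise sd {noise_sd}: full-series slope {fit_slope(days, noisy):.4f}, "
          f"{WINDOW}-day slopes from {min(window_slopes):.4f} "
          f"to {max(window_slopes):.4f}")
```

The spread of the windowed slopes is the thing to look at: the wider it is relative to 0.0058, the easier it is to pick a start and end date that tells whichever story you like.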
Figure 3: The effect of different amounts of noise on the reliability of the trendline. The original graph shows only the line of best fit through an underlying dataset that has noise. Noise at 0.1%, 1%, 2% and 5% was generated. As the noise increased, the ability of the trendline to fit the data better than a simple mean decreased. R² values for the trendline through each of these noisy datasets were 0.1% = 0, 1% = -1.45, 2% = -6.06, 5% = -31.23.
A metric that’s used to measure how well a trend (including this linear trend) fits the raw data is R², also known as the coefficient of determination. A perfect model produces an R² of 1, a model that is no better than using the average of the observed data has an R² of 0, and models that are worse than simply using the mean value of the observed data have negative R² values. As you can see, if there’s more than 0.1% variance (AKA 1 death in 1000) from the linear fit, the R² is less than 0, and the fit is no better than guessing!
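For reference, R² is one minus the ratio of the squared error of the model to the squared error of simply predicting the mean. The Python sketch below shows the three cases just described; the four observed values are made up purely for illustration and are not taken from the graph.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

observed = [52.0, 54.0, 53.0, 55.0]          # made-up percentages
mean_value = sum(observed) / len(observed)   # 53.5

perfect = r_squared(observed, observed)                      # predictions match exactly
mean_model = r_squared(observed, [mean_value] * 4)           # always predict the mean
bad_model = r_squared(observed, [55.0, 52.0, 55.0, 52.0])    # worse than the mean

print(perfect, mean_model, round(bad_model, 2))  # → 1.0 0.0 -4.2
```

A negative value is only possible when the line being scored was not fitted to the data it is scored against, as with a fixed trendline compared against noisy points.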
None of this, of course, is surprising. The trend line has minimal slope, so the fit really is no different to the mean of the data (which is also a linear model, with a slope of 0), and the magnitude of the trend is far less than the magnitude of any natural variation in the data. So overall we have an attempt to use graphs and “objective data” to make a case against lockdowns to protect against Covid-19 that, like the rest of them, falls apart under scrutiny.
If you wish to repeat my analyses in R, feel free to download the R script.
NHS England 2021. COVID-19 Daily Deaths. [Online]. [Accessed 2 April 2021]. Available from: https://www.england.nhs.uk/statistics/statistical-work-areas/covid-19-daily-deaths/