The problem with Benford's law
I watched a documentary last night that covered a mathematical concept called "Benford's law", and I couldn't get over the near mysticism associated with it.
I'm sure there is more that I am missing, and it certainly DOES seem like a powerful tool. But the applications are limited.
Words bandied about included "predictions". But it isn't really a predictive measurement, unless of course you're analyzing data and attempting to predict that there is an issue with that data.
Basically, Benford's law states that for groups of numbers to which it applies, when you count the occurrences of the leftmost digit of each value, you will find a pattern where 1 is the most frequent leading digit (roughly 30% in base 10) and each higher digit is less frequent than the one before it, down to 9 at under 5%.
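For the curious, the usual statement of the law is that a leading digit d shows up with probability log(1 + 1/d), taken in whatever base you are working in. Here is a minimal sketch of that formula; the function name `benford_expected` is just something I made up for illustration.

```python
import math

def benford_expected(base=10):
    """Expected share of each leading digit 1..base-1 under Benford's law."""
    # P(d) = log_base(1 + 1/d), written with natural logs so any base works.
    return {d: math.log(1 + 1 / d) / math.log(base) for d in range(1, base)}

# Print the expected share for each leading digit in base 10.
for digit, share in benford_expected().items():
    print(f"{digit}: {share:.1%}")
```

The shares add up to 1 and shrink with every step from 1 to 9, which is the decreasing pattern described above.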
This sounds like it must be wrong. It sounds unintuitive. But it actually makes a lot of sense, especially when you have a somewhat normal distribution with a lot of data points which cross multiple orders of magnitude.
You see, with a normal distribution, there are a LOT of values in the middle and fewer and fewer as you reach out towards the limits of the data, whatever those limits may be. And that means that within the uppermost order of magnitude, you will see more low leading digits than high ones (more values starting with 1 than with 2, more with 2 than with 3, and so on).
The lower end of the spectrum doesn't factor in as much because there are fewer numbers in each bucket. For instance, among the numbers below 10, there is only 1 number that starts with a 1, only 1 that starts with a 2, and so on. But if your upper bound is in the trillions, then the range from 1 trillion up to just under 2 trillion alone holds 1 trillion numbers which start with 1. That is why data sets which span several orders of magnitude tend to exhibit this property more readily.
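To put numbers on that bucket argument, here is a small sketch that counts how many whole numbers from 1 up to some upper bound start with a given digit. The function name `count_leading_digit` and the bound of 1999 are my own picks for illustration, nothing more.

```python
def count_leading_digit(d, upper):
    """Count the integers in 1..upper whose leading digit is d (1 <= d <= 9)."""
    count = 0
    power = 1  # 1, 10, 100, ...: one pass per order of magnitude
    while d * power <= upper:
        low = d * power                          # e.g. 100 for d=1, power=100
        high = min((d + 1) * power - 1, upper)   # e.g. 199, clipped to the bound
        count += high - low + 1
        power *= 10
    return count

print(count_leading_digit(1, 9))     # only the number 1 itself
print(count_leading_digit(1, 1999))  # 1 + 10 + 100 + 1000
print(count_leading_digit(9, 1999))  # 1 + 10 + 100
```

With an upper bound of 1999, that is 1111 numbers starting with 1 against 111 starting with 9, and almost all of the difference comes from the top order of magnitude.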
And, if you take a normally distributed set of numbers, the values with low leading digits in each order of magnitude at the upper end of the distribution sit closer to the center of the distribution and are thus more frequent.
At a large enough scale, I would say it could even happen when there are only 2 orders of magnitude, say ones and tens. But the smaller the distance between the upper and lower bounds, the larger the data set you are likely to need.
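One way to poke at how much the spread matters is a quick simulation. This is only a sketch under my own assumptions: I'm drawing from a log-normal distribution (not the plain normal curve discussed above) because it makes it easy to control how many orders of magnitude the values cover, and the names `leading_digit` and `share_of_ones`, the center of 5, and the two spreads are all made up for the example.

```python
import math
import random

def leading_digit(x):
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{x:.17e}"[0])

def share_of_ones(mu, sigma, n, seed=0):
    """Fraction of n log-normal samples whose leading digit is 1."""
    rng = random.Random(seed)  # fixed seed so reruns give the same numbers
    ones = sum(leading_digit(rng.lognormvariate(mu, sigma)) == 1 for _ in range(n))
    return ones / n

# Hypothetical settings: both centered near 5, one tightly packed, one very wide.
narrow = share_of_ones(mu=math.log(5), sigma=0.1, n=20_000)
wide = share_of_ones(mu=math.log(5), sigma=3.0, n=20_000)
print(f"narrow: {narrow:.3f}  wide: {wide:.3f}  Benford: {math.log10(2):.3f}")
```

The tightly packed sample barely leaves the 4 to 6 range, so leading 1s hardly appear at all, while the wide one covers many orders of magnitude and should land near the Benford share for 1.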
So, what Benford's law really supplies is a way of predicting whether or not a data set representing such a distribution of numbers is likely to be reliable. It can predict (which is a weird word for a retroactive evaluation of the data) whether the data has been tampered with. Whoever modifies the data may try to maintain certain properties, like ensuring that it still follows a normal distribution, but may overlook Benford's law. Or it may simply not be possible for the person or algorithm doing the tampering to keep the data adhering to Benford's law at the same time.
But it is still JUST a predictor and doesn't actually prove anything one way or the other.
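As a rough sketch of what such a check could look like, here is a chi-square goodness-of-fit test on leading digits. To be clear, the chi-square test is my own choice of technique and not something the documentary described; the function names are hypothetical, it only handles whole numbers, and 15.507 is the standard 5% critical value for 8 degrees of freedom (nine digits minus one).

```python
import math
from collections import Counter

# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CHI2_CRITICAL_5PCT_8DF = 15.507

def benford_chi_square(values):
    """Chi-square statistic of the leading digits of nonzero integers vs. Benford."""
    digits = [int(str(abs(v))[0]) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford's expected count for digit d
        stat += (counts[d] - expected) ** 2 / expected
    return stat

def looks_suspicious(values):
    """True when the leading digits stray from Benford at the 5% level."""
    return benford_chi_square(values) > CHI2_CRITICAL_5PCT_8DF

powers_of_two = [2 ** k for k in range(1, 1001)]  # a sequence known to follow Benford
flat_digits = list(range(100, 1000))              # every leading digit equally common
print(looks_suspicious(powers_of_two), looks_suspicious(flat_digits))
```

A flag from something like this only says the leading digits look unusual for Benford-style data. It can't tell you why, and plenty of honest data sets never follow the law in the first place.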
In my opinion, it is an absolutely fabulous discovery about numbers. But its value, as I see it, is on par with using a normal curve to detect anomalous values. It may detect some data modifications which don't show up there, and vice versa. It is simply a property of certain data sets.