Statistics.
What’s the difference between Mathematics and Statistics?
Statistics has a sort of funny and peculiar relationship with mathematics. In a lot of university departments,
they’re lumped together and you have a “Department of Mathematics and Statistics”. Other times, it’s grouped
as a branch in applied math. Pure mathematicians tend to either think of it as an application of probability
theory, or dislike it because it’s “not rigorous enough”.
After having studied both, I feel it’s misleading to say that statistics is a branch of math. Rather, statistics
is a separate discipline that uses math, but differs in fundamental ways from other branches of math, like
combinatorics or differential equations or group theory. Statistics is the study of uncertainty, and this
uncertainty permeates the subject so much that mathematics and statistics are fundamentally different modes of
thinking.
Above: if pure math and statistics were like games
Definitions and Proofs
Math always follows a consistent definition-theorem-proof structure. No matter what branch of
mathematics you’re studying, whether it be algebraic number theory or real analysis, the structure of a
mathematical argument is more or less the same.
You begin by defining some object, let’s say a wug. After defining it, everybody can look at the definition and
agree on which objects are wugs and which objects are not wugs.
Next, you proceed to prove interesting things about wugs, using marvelous arguments like proof by
contradiction and induction. At every step of the proof, the reader can verify that indeed, this step follows
logically from the definitions. After several of these proofs, you now understand a lot of properties of wugs and
how they connect to other objects in the mathematical universe, and everyone is happy.
In statistics, it’s common to define things with intuition and examples, so “you know it when you see it”;
things are rarely so black-and-white like in mathematics. This is born out of necessity: statisticians work with
real data, which tends to be messy and doesn’t lend itself easily to clean, rigorous definitions.
Take for example the concept of an “outlier”. Many statistical methods behave badly when the data contains
outliers, so it’s a common practice to identify outliers and remove them. But what exactly constitutes an outlier?
Well, that depends on many criteria, like how many data points you have, how far it is from the rest of the
points, and what kind of model you’re fitting.
In the above plot, two points are potentially outliers. Should you remove them, or keep them, or maybe remove
one of them? There’s no correct answer, and you have to use your judgment.
For another example, consider p-values. Usually, when you get a p-value under 0.05, it can be considered
statistically significant. But this value is merely a guideline, not a law –– it’s not like 0.048 is definitely
significant and 0.051 is not.
Now let’s say you run an A/B-test and find that changing a button to blue results in higher clicks, with p-value
of 0.059. Should you recommend to your boss that they make the change? What if you get 0.072, or 0.105? At
what point does it become not significant? There is no correct answer, you have to use your judgment.
Take another example: heteroscedasticity. This is a fancy word that means the variance is unequal for different
parts of your dataset. Heteroscedasticity is bad because a lot of models assume that the variance is constant, and
if this assumption is violated then you’ll get wrong results, so you need to use a different model.
Is this data heteroscedastic, or does it only look like the variance is uneven because there are so few points to
the left of 3.5? Is the problem serious enough that fitting a linear model is invalid? There’s no correct answer,
you have to use your judgment.
Another example: consider a linear regression model with two variables. When you plot the points on a graph,
you should expect the points to roughly lie on a straight line. Not exactly on a line, of course, just roughly
linear. But what if you get this:
There is some evidence of non-linearity, but how much “bendiness” can you accept before the data is definitely
not “roughly linear” and you have to use a different model? Again, there’s no correct answer, and you have to
use your judgment.
I think you see the pattern here. In both math and statistics, you have models that only work if certain
assumptions are satisfied. However, unlike math, there is no universal procedure that can tell you whether your
data satisfies these assumptions.
Here are some common things that statistical models assume:
A random variable is drawn from a normal (Gaussian) distribution
Two random variables are independent
Two random variables satisfy a linear relationship
Variance is constant
Your data is not going to exactly fit a normal distribution, so all of these are approximations. A common saying
in statistics goes: “all models are wrong, but some are useful”.
On the other hand, if your data deviates significantly from your model assumptions, then the model breaks
down and you get garbage results. There’s no universal black-and-white procedure to decide if your data is
normally distributed, so at some point you have to step in and apply your judgment.
Aside: in this article I’m ignoring Mathematical Statistics, which is the part of statistics that tries to justify
statistical methods using rigorous math. Mathematical Statistics follows the definition-theorem-proof pattern
and is very much like any other branch of math. Any proofs you see in a stats course likely belongs in this
category.
Classical vs Statistical Algorithms
You might be wondering: without rigorous definitions and proofs, how do you be sure anything you’re doing is
correct? Indeed, non-statistical (i.e. mathematical) and statistical methods have different ways of
judging “correctness”.
Non-statistical methods use theory to justify their correctness. For instance, we can prove by induction that
Dijkstra’s algorithm always returns the shortest path in a graph, or that quicksort always arranges an array in
sorted order. To compare running time, we use Big-O notation, a mathematical construct that formalizes
runtimes of programs by looking at how they behave as their inputs get infinitely large.
Non-statistical algorithms focus primarily on worst-case analysis, even for approximation and randomized
algorithms. The best known approximation algorithm for the Traveling Salesman problem has an approximation
ratio of 1.5 –– this means that even for the worst possible input, the algorithm gives a path that’s no more than
1.5 times longer than the optimal solution. It doesn’t make a difference if the algorithm performs a lot better
than 1.5 for most practical inputs, because it’s always the worst case that we care about.
A statistical method is good if it can make inferences and predictions on real-world data. Broadly
speaking, there are two main goals of statistics. The first is statistical inference: analyzing the data to
understand the processes that gave rise to it; the second is prediction: using patterns from past data to predict
the future. Therefore, data is crucial when evaluating two different statistical algorithms. No amount of theory
will tell you whether a support vector machine is better than a decision tree classifier –– the only way to find
out is by running both on your data and seeing which one gives more accurate predictions.
Above:
the winning neural network architecture for ImageNet Challenge 2012. Currently, theory fails at explaining
why this method works so well.
In machine learning, there is still theory that tries to formally describe how statistical models behave, but it’s far
removed from practice. Consider, for instance, the concepts of VC dimension and PAC learnability. Basically,
the theory gives conditions under which the model eventually converges to the best one as you give it more and
more data, but is not concerned with how much data you need to achieve a desired accuracy rate.
This approach is highly theoretical and impractical for deciding which model works best for a particular
dataset. Theory falls especially short in deep learning, where model hyperparameters and architectures are
found by trial and error. Even with models that are theoretically well-understood, the theory can only serve as a
guideline; you still need cross-validation to determine the best hyperparameters.
Modelling the Real World Both mathematics and statistics are tools we use to model and understand the world, but they do so in
very different ways. Math creates an idealized model of reality where everything is clear and deterministic;
statistics accepts that all knowledge is uncertain and tries to make sense of the data in spite of all the
randomness. As for which approach is better –– both approaches have their advantages and disadvantages.
Math is good for modelling domains where the rules are logical and can be expressed with equations. One
example of this is physical processes: just a small set of rules is remarkably good for predicting what happens in
the real world. Moreover, once we’ve figured out the mathematical laws that govern a system, they are
infinitely generalizable — Newton’s laws can accurately predict the motion of celestial bodies even if we’ve
only observed apples falling from trees. On the other hand, math is awkward at dealing with error and
uncertainty. Mathematicians create an ideal version of reality, and hope that it’s close enough to the real thing.
Statistics shines when the rules of the game are uncertain. Rather than ignoring error, statistics embraces
uncertainty. Every value has a confidence interval where you can expect it to be right about 95% of the time,
but we can never be 100% sure about anything. But given enough data, the right model will separate the signal
from the noise. This makes statistics a powerful tool when there are many unknown confounding factors, like
modelling sociological phenomena or anything involving human decisions.
The downside is that statistics only works on the sample space where you have data; most models are bad at
extrapolating past the range of data that it’s trained on. In other words, if we use a regression model with data of
apples falling from trees, it will eventually be pretty good at predicting other apples falling from trees, but it
won’t be able to predict the path of the moon. Thus, math enables us to understand the system at a deeper, more
fundamental level than statistics.
Math is a beautiful subject that reduces a complicated system to its essence. But when you’re trying to
understand how people behave, when the subjects are not always rational, learning from data is the way to go.