## 2020-12-31

### Data science 1

8.3k words, including equations (about 40 minutes)

This is an overview of fundamental ideas in data science, mostly based on Damon Wischik's excellent data science course at Cambridge (if using these notes for revision for that course, be aware that I don't cover all examinable things and cover some things that aren't examinable; the criterion for inclusion is interestingness, not examinability).

The basic question is this: we're given data; what can we say about the world based on it?

These notes are split into two parts due to length. In part 1:

• Notation

• A few results in probability, including a look at Bayes' theorem leading up to an understanding of the continuous form.

• Model-fitting

• Maximum likelihood estimation
• Supervised & unsupervised learning
• Linear models (fitting them and interpreting them)
• Empirical distributions (with a note on KL divergence)

In part 2:

• Monte Carlo methods
• A few theorems that let you bound probabilities or expectations.
• Bayesianism & frequentism
• Probability systems (specifically basic results about Markov chains).

## Probability basics

The kind of background you want to have to understand this material:

• The basic maths of probability: reasoning about sample spaces, probabilities summing to one, understanding and working with random variables, etc.

• The ideas of expected value and variance.

• Some idea of the most common probability distributions:

• normal/Gaussian,
• binomial,
• Poisson,
• geometric,
• etc.
• What continuous and discrete distributions are.

• Understanding probability density/mass functions, and cumulative distribution functions.

### Notation

First, a few minor points:

• It's easy to interpret $Y = f(X)$, where $X$ and $Y$ are random variables, to mean "generate a value of $X$, then apply $f$ to it, and this is $Y$". But $Y=f(X)$ is maths, not code; we're stating something is true, not saying how the values are generated. If $f$ is an invertible function, then $Y=f(X)$ and $X=f^{-1}(Y)$ are both equally good and equally true mathematical statements, and neither of them tells you what causes what.

• Indicator functions are a useful trick when bounds are unknown; for example, write $1_{x \geq y}$ (or $1[x\geq y]$) to denote 1 if $x \geq y$ and 0 in all other cases.

• They also let you express logical AND as multiplication: $1_{f(x)} \cdot 1_{g(x)}$ , where $f$ and $g$ are boolean functions, is the same as $1_{f(x) \wedge g(x)}$.
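
As a small sketch of the indicator trick in NumPy (the arrays and the "is even" condition are made-up illustration values, not anything from the course), comparisons give boolean arrays that act as 0/1 indicators, and multiplying two of them behaves like logical AND:

```python
import numpy as np

x = np.array([1, 4, 2, 7])
y = np.array([3, 3, 3, 3])

# 1[x >= y]: 1 where the condition holds, 0 in all other cases
ind_ge = (x >= y).astype(int)

# 1[x is even] * 1[x >= y] should equal 1[(x is even) AND (x >= y)]
ind_even = (x % 2 == 0).astype(int)
product = ind_even * ind_ge
conjunction = ((x % 2 == 0) & (x >= y)).astype(int)

print(ind_ge)
print(product, conjunction)
```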

#### Likelihood notation

Discrete and continuous random variables are fundamentally different. In the discrete case, you deal with probability mass functions where there's a probability attached to each event; with the continuous case, you only get a probability density function that doesn't mean anything real and needs to be integrated to give you a probability. Many results apply to both discrete and continuous random variables though, and we might switch between continuous and discrete models in the same problem, so it's cumbersome to have to deal with the separate notation and semantics of them.

Enter likelihood notation: write $\Pr_X(x)$ to mean $P(X=x)$ if the distribution is discrete and $f(x)$ if the distribution of $X$ is continuous with probability density function $f$.

#### Python & NumPy

Python is a good choice for writing code, for various reasons:

• found almost everywhere;
• easy to install if it isn't already installed;
• not Java;

but particularly because it has excellent science/maths libraries:

• NumPy for vectorised calculations, maths, and stats;
• SciPy for, uh, science;
• Matplotlib for graphing;
• Pandas for data.

NumPy is a must-have.

To use it, the big thing to understand is the idea of vectorised calculations. Otherwise, you'll see code like this:

    import numpy

    xs = numpy.array([1, 2, 3])
    ys = xs ** 2 + xs  # applied to each element separately


and wonder how we're adding and squaring arrays (we're not; the operations are implicitly applied to each element separately – and all of this runs in C so it's much faster than doing it natively in Python).

### Computation vs maths

Today we have computers. Statistics was invented before computers, though, and this affected the field; work was directed to all the areas and problems where progress could be made without much computation. The result is an excellent theoretical mathematical underpinning, but modern statistics can benefit a lot from a computational approach – running simulations to get estimates and so on. For the simple problems there's an (imprecise) computational method and a (precise) mathematical method; for complex problems you either spend all day doing integrals (provided they're solvable at all) or switch to a computer.

In this post, I will focus on the maths, because the maths concepts are more interesting than the intricacies of NumPy, and because if you understand them (and programming, especially in a vectorised style), the programming bit isn't hard.

### Some probability results

#### The law of total probability

Here's something intuitive: if we have a sample space (e.g. outcomes of a die roll) and we partition it into non-overlapping events $E_1$ to $E_N$ that cover every possible outcome (e.g. showing the numbers 1, 2, ..., 6, and losing the die under the carpet), and we have some other event $A$ (e.g. a player gets mad), then

$$P(A) = \sum_{n=1}^{N} P(A|E_n) P(E_n).$$

That is, if we know the probability of $A$ given each event $E_n$, we can find the total probability of $A$ by summing up the probabilities of each $E_n$, weighted by the conditional probability that $A$ also happens. Visually, where the height of the red bars represents each $P(A|E_n)$, and the area of each segment represents the different $P(E_n)$s, we see that the total red area corresponds to the sum above.

(Diagram caption: You say this diagram is "messy and unprofessional"; I say it has an "informal aesthetic".)

This is called the law of total probability; a fancy name to pull out when you want to use this idea.
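
A minimal sketch of the calculation, with entirely made-up numbers for the die example (the probability of losing the die and the player's temper are assumptions for illustration only):

```python
from fractions import Fraction as F

# Hypothetical numbers: the die is lost with probability 1/10,
# otherwise each face is equally likely.
p_event = {face: F(9, 10) * F(1, 6) for face in range(1, 7)}
p_event["lost"] = F(1, 10)
assert sum(p_event.values()) == 1  # the events partition the sample space

# Hypothetical P(A | E_n): the player may get mad on a 1,
# and certainly gets mad if the die is lost.
p_mad_given = {face: F(0) for face in range(1, 7)}
p_mad_given[1] = F(1, 2)
p_mad_given["lost"] = F(1)

# Law of total probability: P(A) = sum over n of P(A | E_n) * P(E_n)
p_mad = sum(p_mad_given[e] * p_event[e] for e in p_event)
print(p_mad)
```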

#### The law of the unconscious statistician

Another useful law doesn't even sound like a law at first, which is why it's called the law of the unconscious statistician.

Remember that the expected value, in case of a discrete distribution for the random variable $X$, is

$$E(X) = \sum_i x_i P(X = x_i).$$

Now say we're not interested in the value of $X$ itself, but rather some function $f$ of it. What is the expected value of $f(X)$? Well, the values $x_i$ are the possible values of $X$, so let's just replace the $x_i$ above with $f(x_i)$:

$$E(f(X)) = \sum_i f(x_i) P(X = x_i)$$

... and we're done – but for the wrong reasons. This result is actually more subtle than this; to prove it, consider a random variable $Y$ for which $Y=f(X)$. By the definition of expected value,

$$E(Y) = \sum_i y_i P(Y = y_i).$$

Uh oh – suddenly the connection between the obvious result and what expected value is doesn't seem so obvious. The problem is that the mapping between the $y_i$ and $x_i$ could be anything – many $x_i$, thrown into the blackbox $f$, might produce the same $y_i$ – and we have to untangle this while keeping track of all the corresponding probabilities.

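For a start, we might notice that the probability that $Y$ takes the value $y_i$ is the sum of the probabilities of all the values $x_j$ of $X$ for which $f(x_j) = y_i$. So we might write

$$E(Y) = \sum_i y_i \sum_{j \,:\, f(x_j) = y_i} P(X = x_j)$$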

to sum over each possible value of $f(X)$, and then within that, also loop over the possible values of $X$ that might have generated that $f(X)$. We've managed to switch a term involving the probability that $Y$ takes some values to one about $X$ taking a specific value – progress!

Next, we realise that $y_i$ is the same for everything in the inner sum; $y_i = f(x_1) = f(x_2) = ... = f(x_j)$. So we don't change anything if we write

$$E(Y) = \sum_i \sum_{j \,:\, f(x_j) = y_i} f(x_j) P(X = x_j)$$

instead. Now we just have to see that the above is equivalent to iterating once over all the $j$s.

A diagram: The yellow area is the expected value of $Y = f(X)$. By the definition of expected value, we can sum up the areas of the yellow rectangles to get $E(f(X))$. What we've now done is "reduced" this to a process like this: pick $y_1$, look at the $x_i$ that map to it with $f$ ($x_1$ and $x_2$ in this case), find their probabilities, and multiply them by $f(x_1)=f(x_2)=y_1$. So we add up the rectangles in the slots marked by the dotted lines, and we do it with this weird double-iteration of looking first at $y_i$s and then at $x_i$s.

But once we've put it this way, it's simple to see we get the same result if we iterate over the $x_i$s, get the corresponding rectangle slice for each, and add it all up. This corresponds to the formula we had above (summing $f(x_i) P(X=x_i)$ over all possible $i$).
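
To check the two routes agree, here is a small sketch with a made-up distribution for $X$ and a non-invertible $f$ (squaring), so that several $x_i$ map to the same $y_i$. The names `p_x`, `p_y` and `f` are just illustrative:

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical discrete distribution for X
p_x = {-2: F(1, 8), -1: F(1, 4), 0: F(1, 4), 1: F(1, 4), 2: F(1, 8)}

def f(x):
    return x * x  # not invertible: -1 and 1 both map to 1

# Route 1 (definition): build the distribution of Y = f(X),
# then sum y * P(Y = y)
p_y = defaultdict(F)
for x, p in p_x.items():
    p_y[f(x)] += p
e_via_y = sum(y * p for y, p in p_y.items())

# Route 2 (law of the unconscious statistician): sum f(x) * P(X = x)
e_via_x = sum(f(x) * p for x, p in p_x.items())

print(e_via_y, e_via_x)
```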

#### Bayes' theorem (odds ratio and continuous form)

Above is a Venn diagram of a sample space (the box), with the probabilities of event $B$ and event $R$ marked by blue and red areas respectively (the hatched area represents that both happen).

By the definition of conditional probability,

$$P(B|R) = \frac{P(B \cap R)}{P(R)}, \qquad P(R|B) = \frac{P(B \cap R)}{P(B)}.$$

Bayes' theorem is about answering questions like "if we know how likely we are to be in the red area given that we're in the blue area, how likely are we to be in the blue area if we're in the red?" (Or: "if we know how likely we are to have symptoms if we have covid, how likely are we to have covid if we have symptoms?").

Solving both of the above equations for $P(B \cap R)$ and equating them gives

$$P(B|R) P(R) = P(R|B) P(B),$$

which is the answer – just divide out by either $P(B)$ or $P(R)$ to get, for example,

$$P(B|R) = \frac{P(R|B) P(B)}{P(R)}.$$

Let's say the red area $R$ represents having symptoms. Let's say we split the blue area $B$ into $B_1$ and $B_2$ – two different variants of covid, say. Now instead of talking about probabilities, let's talk about odds: let's say the odds ratios that a random person has no covid, has variant 1, and has variant 2 are 40:2:1, and that symptoms are, compared to the no-covid population, ten times as likely in variant 1 and twenty times as likely in variant 2 (in symbols: $P(R| \neg B_1 \cap \neg B_2) = P(R|B_1) / 10 = P(R|B_2) / 20$). Now we learn that we have symptoms and want to calculate posterior probabilities, to use Bayes-speak.

To apply Bayes' rule, you could crank out the formula exactly as above: convert odds to probabilities, divide out by the total probability of no covid or having variant 1 or 2, and then get revised probabilities for your odds of having no covid or a variant. This is equivalent to keeping track of the absolute sizes of the intersections in the diagram below:

But this is unnecessary. When we learned we had symptoms, we zoomed in to the red blob; that is our sample space now, so blob size compared to the original sample space no longer interests us.

So let's take our odds ratios directly, and only focus on relative probabilities. Let's imagine each scenario fighting over a set amount of probability space, with the starting allocations determined by prior odds ratios:

Now Bayes' rule says to multiply each prior probability $P(B_i)$ by $P(R|B_i)$. To adjust our prior odds ratio 40:2:1 by the ratios 1:10:20 telling us how many times more likely we are to see $R$ (symptoms) given no covid or $B_1$ or $B_2$, just multiply term-by-term to get 40:20:20, or 2:1:1. You can imagine each outcome fighting it out with their newly-adjusted relative strengths, giving a new distribution of the sample space:

Now if we want to get absolute probabilities again, we just have to scale things right so that they add up to 1. This tiny bit of cleanup at the end (if we want to convert to probabilities again) is the only downside of working with odds ratios.
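
The odds-ratio version of the update is short enough to sketch directly, using the 40:2:1 prior and 1:10:20 likelihood ratios from the example (the variable names are mine):

```python
from fractions import Fraction as F

prior_odds = [40, 2, 1]         # no covid : variant 1 : variant 2
likelihood_ratio = [1, 10, 20]  # relative likelihood of symptoms in each case

# Bayes in odds form: multiply term by term
posterior_odds = [p * l for p, l in zip(prior_odds, likelihood_ratio)]

# Cleanup step: rescale so that the entries sum to 1
total = sum(posterior_odds)
posterior_probs = [F(o, total) for o in posterior_odds]

print(posterior_odds)
print(posterior_probs)
```

Note that the denominator of Bayes' theorem only shows up as `total`, in the final rescaling.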

This gives us an idea about how to use Bayes when the sample space is continuous rather than discrete. For example, let's say the sample space is between 0 and 100, representing the blood oxygenation level $X$ of a coronavirus patient. We can imagine an approximation where we write an odds ratio that includes every integer from 0 to 100, and then refine that until, in the limit, we've assigned odds to every real number between 0 and 100. Of course, at this point the odds ratio interpretation starts looking a bit weird, but we can switch to another one: what we have is a probability distribution, if only we scale it so that the entire thing integrates to one.

The same logic applies as before, even though everything is now continuous. Let's say we want to calculate a conditional probability like the probability of $X$ (the random variable for the patient's blood oxygenation) taking the value $x$. At first we have no information, so our best guess is the prior across all patients, $\Pr_X(x)$. Say we now get some piece of evidence, like the patient's age, and know the likelihood ratios of the patient being that age given each blood oxygenation level. To get our updated belief distribution, we can just go through and multiply the prior likelihoods of each blood oxygenation level by the ratios given the new piece of evidence.

Above, the red line is the initial distribution of blood oxygenation $x$ across all patients. The yellow line represents the relative likelihoods of the patient's actual known age $a$ given a particular $x$. The green line at any particular $x$ is the product of the yellow and red function at that same $x$, and it's our relative posterior. To interpret it as a probability distribution, we have to scale it vertically so that it integrates to 1 (that's why we have a proportionality sign rather than an equals sign).

Now let's say more evidence comes in: the patient is unconscious (which we'll denote $U=\text{"yes"}$). We can repeat the same process of multiplying out relative likelihoods and the prior, this time with the prior being the result in the previous step:

We can see that in this case the blue line varies a lot more depending on $x$, and hence our distribution for $x$ (the purple line) changes more compared to our prior (the green line). Now let's say we have a very good piece of evidence: the result $m$ of a blood oxygenation meter $M$.

There's some error on the oxygenation measurement, so our final belief (that $x$ is distributed according to the black line) is very clearly a distribution of values rather than a single value, but it's clustered around a single point.

So to think through Bayes in practice, the lesson is this: throw out the denominator in the law. It's a constant anyways; if you really need it you can go through some integration at the end to find it. But it's not the central point of Bayes' theorem. Remember instead: prior times likelihood ratio gives posterior.
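
Here is a rough numerical sketch of the continuous update on a grid of $x$ values. All the curves are made-up bell shapes (the centres and widths are assumptions for illustration, not real medical data), and the helper names `normalise` and `gaussian_bump` are mine:

```python
import numpy as np

# Grid over possible blood oxygenation levels x (0 to 100)
x = np.linspace(0, 100, 1001)
dx = x[1] - x[0]

def normalise(density):
    """Scale a non-negative curve so it integrates to 1 over the grid."""
    return density / (density.sum() * dx)

def gaussian_bump(x, centre, width):
    """Unnormalised bell curve, used here only as a made-up shape."""
    return np.exp(-0.5 * ((x - centre) / width) ** 2)

# Made-up prior and relative likelihood curves
prior = normalise(gaussian_bump(x, 95, 5))
lik_age = gaussian_bump(x, 90, 15)   # relative likelihood of the age, per x
lik_meter = gaussian_bump(x, 88, 2)  # relative likelihood of the reading, per x

# Each update: posterior is proportional to prior times likelihood;
# the rescaling is the "denominator" we otherwise ignore
posterior = normalise(prior * lik_age)
posterior = normalise(posterior * lik_meter)

mode = x[np.argmax(posterior)]
print(mode)
```

The broad age curve barely moves the prior, while the narrow meter curve pulls the result close to the reading, which matches the qualitative story above.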

## Fitting models

A probability model tries to tell you how likely things are. Fitting a probability model to data is about finding one that is useful for given data.

Above, we have two axes representing whatever, and the intensity of the red shading is the probability attributed to a particular pair of values.

The model on the left is simply bad. The one in the middle is also bad, though; it assigns no probability to many of the data points that were actually seen.

Choosing which distribution to fit – or whether to do something else entirely – is sometimes obvious, sometimes not. Complexity is rarely good.

### Maximum likelihood estimation (MLE)

Let's say we do have a good idea of what the distribution is; the weight of stray cats in a city depends on a lot of small factors pushing both ways (when it last caught a mouse, the temperature over the past week, whether it was loved by its mother, etc.), so we should expect a normal distribution. Well, probably.

Let's say we have a dataset of cat weights, labelled $x_1$ to $x_n$ because we're serious maths people. How do we fit a distribution?

Step 1 is Wikipedia. Wikipedia tells us that a normal distribution has two parameters, $\mu$ (the mean) and $\sigma$ (the standard deviation), and that the likelihood (not probability! see above) that a normally distributed random variable $X$ with those parameters takes a value $x$ is

$$\Pr_X(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2}$$

Oh dear.

After a moment's thought, we can interpret it more clearly:

So it's just an exponential that decays in both directions from $\mu$, and that's squeezed by $\sigma$.

(Why are there constants then? Because it's a probability distribution, and must therefore integrate to 1 over its entire range or else all hell will break loose.)

Step 2 is philosophising. What does it really mean to get the best fit of a distribution?

The first thing we can notice is that there are only two dials we can adjust: the values of $\mu$ and $\sigma$. For this particular problem at least, we've reduced the massive problem of picking the best model to one of finding the best spot in a 2D space (well, half of 2D space, since $\sigma$ must be greater than zero).

The second thing we can notice is that the only tool we have at our disposal here to tell us about the fit to the distribution is the likelihood function, and, well, as the saying goes: when all you have is a likelihood function ...

A good fit will give high likelihoods to the points in the data set (we can't get an arbitrarily good fit by giving everything a lot of likelihood, because there's only so much likelihood to go around – the probabilities that the likelihood function assigns across its domain must sum to 1).

Let's define the likelihood of the data, given some model, to be the likelihood that we get that specific data set by independently generating samples from the model until we have the same number as in the data set (if we have a lot of data points, the likelihood of any particular set of them will usually be very low, since it's the product of the likelihood of a lot of individual points). And let's go ahead and try to tune the model so that the likelihood of our data is maximised.

(Remember, likelihood is probability, except for continuous random variables like our normal distribution, where we can't talk about the probability of a dataset (only about something like the probability of getting a dataset at least as close as [some metric] to the dataset).)

Step 3 is algebra. So what is the likelihood of all our data? Using basic probability, it's the product of the likelihoods of each data point (just like the probability of getting a set of independent events is the product of the probabilities of each event). Returning to our normal distribution with cat data $x_1$ to $x_n$, the likelihood of the data given distribution $X$ with mean $\mu$ and standard deviation $\sigma$ is

$$\prod_{i=1}^n \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left(\frac{x_i - \mu}{\sigma}\right)^2}$$

Oh dear. Maximising this is a pain.

Thankfully, there's a trick. We don't care about the likelihood, only that we set $\mu$ and $\sigma$ so that the likelihood is maximised. We can apply any monotonically increasing function to the likelihood, maximise that, and we'll have the $\mu$ and $\sigma$ that maximise the original mess.

Which monotonically increasing function? Logarithms are generally best, because they convert the products you get from calculating the likelihood of a dataset into sums (and in this case they're especially nice, because they'll also take out the exponentials in our distribution's likelihood function).

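In fact, throw away the previous calculation and note that

$$\log(\Pr_X(x_i)) = -\log(\sigma) - \log(\sqrt{2\pi}) - \frac{(x_i - \mu)^2}{2\sigma^2}$$

from which we can throw away the $\log(\sqrt{2\pi})$ because it's the same in each term, and then sum all the rest up to get a total log likelihood of

$$-n \log(\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$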

Call this $f$; the values of $\mu$ and $\sigma$ that maximise it are those where $\frac{\partial f}{\partial \mu} = 0$ and $\frac{\partial f}{\partial \sigma} = 0$; that's when we've found our peak on the 2D space of possible $(\mu, \sigma)$ pairs (technically this condition only tells us it's a stationary point, but it turns out to be the maximum, as you can prove by taking more derivatives).

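So the maximum satisfies

$$\frac{\partial f}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0, \qquad \frac{\partial f}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (x_i - \mu)^2 = 0.$$

The first condition gives

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i,$$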

in other words that $\hat{\mu}$, our best estimator function for the value of $\mu$, is the average of the values in the data set.

From the second condition, we can do algebra to get

$$\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}$$

We need to be careful here, though. When writing out the conditions, $\mu$ and $\sigma$ stood for specific values of the parameters of the normal distribution $X$. We don't know these values; the best we can do is estimate them with estimators, which are technically not values but functions that take a data set and return an estimated value (and denoted by $\hat{\text{hats}}$). We can't have unknown values in our definition of $\hat{\sigma}$, as we currently do with the $\mu$ in it; we have to replace it with the estimator for $\mu$ like this:

$$\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2}$$

– making sure that the estimator $\hat{\mu}$ does not depend on $\hat{\sigma}$, since that would again make things undefined – or by writing out the $\hat{\mu}$ estimator like this:

$$\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^n \left(x_i - \frac{1}{n} \sum_{j=1}^n x_j\right)^2}$$

which at least makes it very clear that the $x_i$s and their number $n$ define $\hat{\sigma}$.

When you're done defining your estimators, you should have a clear diagram in your head of how to pour data into the functions you've written down and come out with concrete numbers, with no dangling inputs anywhere – you're not done if you have any.
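
As a sketch of "pouring data into the functions", the following computes the two estimators (the data average for $\hat{\mu}$, and the root mean squared deviation from that average for $\hat{\sigma}$) on simulated cat weights. The "true" mean of 4.0 kg and standard deviation of 0.8 kg are made up for the simulation, and the nudge check at the end is only a crude sanity test, not a proof of optimality:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated "cat weights" in kg; the true mean and sd are made up
weights = rng.normal(loc=4.0, scale=0.8, size=1000)

# The estimators: data in, numbers out, no dangling inputs
mu_hat = weights.sum() / len(weights)
sigma_hat = np.sqrt(((weights - mu_hat) ** 2).sum() / len(weights))

def log_likelihood(mu, sigma, xs):
    """Total log likelihood of xs under a normal(mu, sigma) model."""
    return np.sum(-np.log(sigma) - 0.5 * np.log(2 * np.pi)
                  - (xs - mu) ** 2 / (2 * sigma ** 2))

# Sanity check: nudging either parameter should not increase the log likelihood
best = log_likelihood(mu_hat, sigma_hat, weights)
for d_mu, d_sigma in [(0.05, 0), (-0.05, 0), (0, 0.05), (0, -0.05)]:
    assert log_likelihood(mu_hat + d_mu, sigma_hat + d_sigma, weights) <= best

print(mu_hat, sigma_hat)
```

Note that $\hat{\sigma}$ here divides by $n$, which is what `numpy.std` does by default.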

### Supervised and unsupervised learning

There are two main types of fancy model fitting we can do:

1. Supervised learning, where we have a set of pairs (of numbers or anything else) and we try to design a system to predict one element from the other. For example, maybe we measure the length and weight of some stray cats, but get bored of trying to get them to stay on the scale long enough, so we want to ditch the weighing and predict a weight from the length alone – how well can we do this?
2. Unsupervised learning, where we have our data (as a set of tuples of associated data, like cat lengths, weights, and locations), and we try to fit a model to it so we can generate similar items; maybe we want to fake a larger stray cat population in our data than actually exists but not get caught by the statistics bureau. (This category also includes things like trying to identify clusters to interpret the data.) Fitting a distribution is perhaps the simplest example: using our one-dimensional cat weight database discussed in the MLE section, we can "generate" new cats by sampling from it, though the "cat" will just be the weight number. The more interesting case is when we have to generate a lot of associated data; for example, this website offers you a new face every time you reload it. Behind it is a probability distribution for a human face in some crazy-dimensional variable space that's detailed enough that sampling it gives you all the data needed to figure out the colours of each pixel in a photorealistic face picture.

The unifying idea is maximum likelihood estimation (MLE). Clearly, something like MLE is needed if you want to fit a distribution to data for unsupervised learning; we're going to need to generate something eventually, so we better have a probability model. It's less clear that supervised learning has anything to do with MLE though, and tempting to think of it as defining some random loss function to measure how bad a fit is, and then minimising that. It's possible to think of supervised learning this way, but then you'll end up with a lot of detail about loss functions in your head, all of which will seem to be pulled out of thin air.

Instead, think of supervised learning as MLE too. We specify a probability model, which will take in some parameters (e.g. the exponent $a$ and constant $b$ in a cat length/weight model like $\text{weight} = b \times \text{length}^a + \epsilon$, where $\epsilon$ is a normally distributed error term with mean 0 and some standard deviation we either know already or else ask the fitting procedure to find for us), and the value of the predictor variable(s) (e.g. the cat's length), and spit out its prediction of the variable(s) of interest.

(Note that often the variable of interest is not numerical, but a label: "spam", "tumour", "Eurasian oystercatcher", etc.)

In fact, seen from the MLE perspective, it can almost be hard to see the difference – if so, good. Just look at the processes:

1. Unsupervised learning:

    1. Get your dataset $x = (x_1, x_2, ..., x_n)$.
    2. Decide on a probability model (e.g. a simple distribution) $X$ with a parameter set $\theta = (\theta_1, \theta_2, ..., \theta_m)$.
    3. Find the $\theta$ that maximises $\Pr_X(x_1; \theta) \times ... \times \Pr_X(x_n; \theta)=\Pr_X(x;\theta)$,* since assuming our data points are drawn independently, this is the likelihood of the dataset.

2. Supervised learning:

    1. Get your dataset of pairs of the form (thing to predict, thing to predict from): $((y_1, x_1), (y_2, x_2), ..., (y_n, x_n))$.
    2. Decide on a probability model $Y$ that relies on a parameter set $\theta = (\theta_1, \theta_2, ..., \theta_m)$, and also on $x_i$, to predict $y_i$.
    3. Find the $\theta$ that maximises $\Pr_Y(y_1;x_1, \theta) \times ... \times \Pr_Y(y_n; x_n, \theta) = \Pr_Y(y_1, ..., y_n; x_1, ..., x_n, \theta)$.*

*(We write $\Pr_X(x_i;\theta)$ to mean the likelihood that $X$ takes the value $x_i$ if the parameters are $\theta$; we avoid writing it as a conditional probability $\Pr_X(x \, |\, \theta)$ because interpreting this as a conditional probability is technically only valid with a Bayesian interpretation.)
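
To make the supervised recipe concrete, here is a rough sketch for the cat model $\text{weight} = b \times \text{length}^a + \epsilon$. The data is simulated with made-up values ($a = 3$, $b = 30$, noise standard deviation 0.2, assumed known), and the grid search is a deliberately crude stand-in for a proper optimiser:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated data; the "true" a = 3, b = 30 and noise sd 0.2 are made up
length = rng.uniform(0.3, 0.6, size=200)             # metres
weight = 30 * length ** 3 + rng.normal(0, 0.2, 200)  # kg
sigma = 0.2                                          # assumed known here

def log_likelihood(a, b):
    """Log likelihood of the weights given the lengths and parameters (a, b)."""
    residual = weight - b * length ** a
    return np.sum(-np.log(sigma) - 0.5 * np.log(2 * np.pi)
                  - residual ** 2 / (2 * sigma ** 2))

# Crude maximisation: evaluate the log likelihood on a grid of (a, b) values
a_grid = np.linspace(2, 4, 81)
b_grid = np.linspace(10, 60, 201)
ll = np.array([[log_likelihood(a, b) for b in b_grid] for a in a_grid])
i, j = np.unravel_index(np.argmax(ll), ll.shape)
a_hat, b_hat = a_grid[i], b_grid[j]

print(a_hat, b_hat)
```

With $\sigma$ fixed, the only part of the log likelihood that depends on $(a, b)$ is minus the sum of squared residuals, so this particular MLE picks the same parameters as a squared-error loss would; the loss function falls out of the probability model rather than out of thin air.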

### Linear models

You can invent any model you choose. As always, simplicity pays though, and it turns out that there's a class of probability models which are easy to work with and reason about, for which general algorithms and mathematical tools exist, and which is often good enough: linear models.

The word "linear" immediately brings to mind straight lines. That's not what it means in this context. The linearity in linear models is because the output is a linear combination of "features" (predictor variables).

The general form is

$$\hat{y_i} = c_1 e_{1,i} + c_2 e_{2,i} + ... + c_n e_{n,i}$$

where $\hat{y_i}$ is the predicted value, $c_1$ through $c_n$ are constants, and $e_{1,i}$ through $e_{n,i}$ are the features describing the $i$th set of data. In the simplest case, a feature might be a value we measure directly, but in general it can be any function of data we measure. Ideally, we want the true value to satisfy $y_i \approx c_1 e_{1,i} + ... + c_n e_{n,i}$.

In the above diagram, we see we measure the data $x_i$ (note that it can be a tuple of values rather than a single value), pass it through some blackbox function to generate features, and take the prediction $\hat{y_i}$ to be the sum of multiplying together each feature by the weight assigned to it.

Note that the linear model above is a prediction-maker but not a probability model because it doesn't assign likelihoods. The probability model for a linear model is often taken to be

$$y_i = c_1 e_{1,i} + ... + c_n e_{n,i} + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)$$

that is, there's an error term $\epsilon$ that we assume to be a normal distribution with standard deviation $\sigma$ (which may be known, or finding it may be part of fitting the model).

The above is also an equation for predicting one specific output ($y_i$) from one specific set of features, which in turn are determined by one specific input (e.g. a single data point). More generally we can write it in vector form:

$$\pmb{y} \approx c_1 \pmb{e_1} + c_2 \pmb{e_2} + ... + c_n \pmb{e_n}$$

where $\pmb{y}=(y_1, y_2, ..., y_{n})$, and likewise $\pmb{e_j}$ is a vector whose $i$th position corresponds to the $j$th feature of the $i$th data item.

Note that we can read this equation in two ways: as a vector equation about data, as just described, that's fitted to give $\pmb{y}$ from its features, or as a prediction, saying that the value of a particular $y_i$ will be roughly this.

There's a set of standard tricks to use in linear modelling:

• "One-hot coding": using a function that is 0 unless the input data satisfies some condition (having a label, exceeding a value, etc.).
• If we have the data point $x_i$, using the features $e_{0,i} = 1$, $e_{1,i} = x_i$, and $e_{2,i} = x_i^2$ to fit a quadratic (if you fit a polynomial of degree higher than 2 without a very solid reason, you're probably overfitting).
• We often have a pattern with a known period $T$ (days, years, etc.), and some non-zero starting phase $\phi$. Therefore we'd want a feature like $\sin((2\pi/T)x+\phi)$, where $x$ is an input, to fit this pattern to. If $\phi$ is known, we don't have a problem, but if we want to fit the phase, it doesn't work: the model is not linear in $\phi$. To fix this, use a trig angle addition identity; the above becomes $\sin(\phi) \cos((2\pi/T)x) + \cos(\phi) \sin((2\pi/T)x)$, where $\sin(\phi)$ and $\cos(\phi)$ are just constants so can be forgotten about because the fitting model will determine the constants of our features. (Recovering $\phi$ from the final constants will take a bit of maths; note that the constant of the cosine and sine terms in the fitted model will have the amplitude mixed in, in addition to $\phi$.)
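
The phase trick in the last bullet can be sketched with `numpy.linalg.lstsq`. The period, amplitude, phase, and noise level below are made-up values for a simulated signal; the "bit of maths" for recovering them is that the fitted cosine coefficient is $A\sin(\phi)$ and the sine coefficient is $A\cos(\phi)$:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 24.0                       # known period (e.g. hours); made-up example
true_amp, true_phi = 2.0, 0.7  # made-up values the fit should recover
x = np.linspace(0, 72, 300)
y = true_amp * np.sin(2 * np.pi / T * x + true_phi) + rng.normal(0, 0.1, x.size)

# Features: cos and sin at the known frequency (linear in their coefficients)
features = np.column_stack([np.cos(2 * np.pi / T * x),
                            np.sin(2 * np.pi / T * x)])
(c_cos, c_sin), *_ = np.linalg.lstsq(features, y, rcond=None)

# c_cos = A sin(phi) and c_sin = A cos(phi), so:
amp_hat = np.hypot(c_cos, c_sin)
phi_hat = np.arctan2(c_cos, c_sin)

print(amp_hat, phi_hat)
```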

Here's an annotated linear model with parameter interpretation:

The features in this model:

• $e_1=x$.
• $e_2$ is 0 if $x < A$ and 1 otherwise.
• $e_3$ is 0 if $x < A$ and $x$ otherwise.

(If we want to fit the best value of $A$, we'll have to do some maths and reconfigure the model. Right now $A$ is a constant that's defined in the functions that calculate the features from the input data.)

The interpretation of the constants:

• $c_0$ is the prediction for $x=0$.
• $c_1$ is the base slope.
• $c_2$ is the difference between the prediction for $x=0$ (the $y$-intercept of the $x < A$ line) and the $y$-intercept of the $x>A$ line.
• $c_3$ is how much the slope changes after $x=A$.

We could have chosen different features (for example, letting $e_1 = 0$ for $x > A$), and then gotten perhaps more readable constants ($c_3$ would become just the slope, not the difference in slope). We could also have added a feature like $e_4 = x^2$, and then the model would no longer look like just straight lines. But whatever we do, we need to be careful to interpret the constants we get correctly, especially when the model gets complicated.
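
A sketch of fitting this kind of piecewise model with least squares, on simulated data. The breakpoint $A = 5$ and the two "true" lines are made up, and I include a constant feature so that $c_0$ has something to multiply:

```python
import numpy as np

rng = np.random.default_rng(3)
A = 5.0                                # breakpoint, fixed in advance
x = rng.uniform(0, 10, 400)
# Made-up ground truth: y = 1 + 2x for x < A, and y = -4 + 3.5x for x >= A
y = np.where(x < A, 1 + 2 * x, -4 + 3.5 * x) + rng.normal(0, 0.2, x.size)

e0 = np.ones_like(x)                   # constant feature, weight c_0
e1 = x
e2 = (x >= A).astype(float)            # 0 if x < A, 1 otherwise
e3 = np.where(x >= A, x, 0.0)          # 0 if x < A, x otherwise

features = np.column_stack([e0, e1, e2, e3])
c, *_ = np.linalg.lstsq(features, y, rcond=None)

print(c)
```

With these made-up lines, the fit should land near $c_0 = 1$, $c_1 = 2$, $c_2 = -5$ (the change in $y$-intercept), and $c_3 = 1.5$ (the change in slope), which is the interpretation given in the list above.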

For our cat weight prediction example, we might expect weight $W$ and length $L$ to have a relation like $W \approx c L^3$, where $c$ is a constant that the model will fit. If we want to ask questions about whether a cubic relation really is the best, take logs and fit something like $\log(W) = c_1 + c_2 \log(L)$; then $c_2$ tells us the exponent.
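
A sketch of the log-log fit on simulated cats. The "true" relation $W = 30 L^3$ and the multiplicative noise are assumptions made for the simulation (multiplicative noise is what makes the error additive after taking logs):

```python
import numpy as np

rng = np.random.default_rng(4)
L = rng.uniform(0.3, 0.6, 300)                         # lengths in metres
# Made-up ground truth: W = 30 L^3, with multiplicative noise
W = 30 * L ** 3 * np.exp(rng.normal(0, 0.05, L.size))  # weights in kg

# Linear model in log space: log(W) = c1 + c2 log(L)
features = np.column_stack([np.ones_like(L), np.log(L)])
(c1, c2), *_ = np.linalg.lstsq(features, np.log(W), rcond=None)

print(np.exp(c1), c2)  # estimated constant and exponent
```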

#### Feature spaces and fitting linear models

The main benefit of linear models is that by talking about linear combinations of data vectors we reduce the maths of fitting parameters to linear algebra. Linear algebra is about transformations of space and the vectors in it, so it also allows for a visual interpretation of everything.

Let's say we have a model like this:

$$\pmb{y} \approx c_1 \pmb{e_1} + c_2 \pmb{e_2}$$

Here, $\pmb{y}$ is the actual measured data, and $\pmb{e_i}$ are functions of the (also measured) predictor variables. Let's say $\pmb{y} = (y_1, y_2, y_3)$ – i.e., we have three data points. We can imagine $\pmb{y}$ as a vector pointing somewhere in 3D space, with $y_1$, $y_2$, and $y_3$ the distances along the $x$, $y$, and $z$ axes. Likewise, $\pmb{e_1}$ and $\pmb{e_2}$ can be thought of as 3D vectors encoding some (function of the) data we've measured.

Now the only dials a linear model gives us to adjust are the weights of $\pmb{e_1}$ and $\pmb{e_2}$: $c_1$ and $c_2$. There's a 2D space of them (since there are two constants to adjust – $c_1$ and $c_2$), and as it happens, there's a nice geometric interpretation: each pair $(c_1, c_2)$ corresponds to a point on the plane spanned by $\pmb{e_1}$ and $\pmb{e_2}$ (specifically, the point you get to if you move $c_1$ times along $\pmb{e_1}$ and then $c_2$ times along $\pmb{e_2}$).

So what are the best values of $c_1$ and $c_2$? The intuitive answer is that we want to get as close as possible to $\pmb{y}$:

In this case, the closest to $\pmb{y}$ that we can reach on the plane spanned by $\pmb{e_1}$ and $\pmb{e_2}$ is the green vector, and the black vector is the difference between the predicted data vector and actual data vector.

Mathematically, what are we doing here? We're minimising the distance between the vector $\hat{\pmb{y}} = c_1 \pmb{e_1} + c_2 \pmb{e_2}$ (where $c_1$ and $c_2$ can be varied) and $\pmb{y}$; this distance is given by

$$|\pmb{y} - \hat{\pmb{y}}| = \sqrt{\sum_i (y_i - \hat{y_i})^2}$$

Previously we simplified optimisation by applying a logarithm (a monotonically increasing function) and optimising that; this time we do the same by applying the squaring function (which is monotonically increasing for positive numbers, which our distance is limited to). This means that the quantity to minimise is

$$\sum_i (y_i - \hat{y_i})^2$$

In other words, we minimise the sum of squared errors ("least squares estimation" is the most common phrase).

If we have more than three data points, then we can't picture it, but the idea is exactly the same. Fitting an $n$-dimensional dataset to a linear model of $m$ features boils down to moving as close as possible in $n$D space to the observed data vector, while limited to the $m$-dimensional (at most; see below) space spanned by the features.

(Above, $n=3$ and $m=2$. Generally $n$ is huge because datasets can be huge, while $m$ is much smaller since it's the number of features we've written down into the model.)
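
As a rough illustration of the geometry above, here is a minimal Python sketch for $n=3$ and $m=2$. It is not from the course: the data values and the helper names `dot` and `fit_two_features` are made up. It relies on the standard fact that at the closest point on the plane, the error vector is perpendicular to both feature vectors, which gives two linear equations (the "normal equations") for $c_1$ and $c_2$.

```python
# Least squares for n = 3 data points and m = 2 features, standard library only.
# All numbers are invented for illustration.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def fit_two_features(y, e1, e2):
    """Return (c1, c2) minimising the squared distance |y - c1*e1 - c2*e2|^2.

    Solves the 2x2 normal equations with Cramer's rule. Assumes e1 and e2
    are linearly independent (otherwise the determinant is zero).
    """
    a11, a12, a22 = dot(e1, e1), dot(e1, e2), dot(e2, e2)
    b1, b2 = dot(e1, y), dot(e2, y)
    det = a11 * a22 - a12 * a12
    c1 = (b1 * a22 - a12 * b2) / det
    c2 = (a11 * b2 - a12 * b1) / det
    return c1, c2

y = [1.0, 2.0, 2.0]    # observed data vector
e1 = [1.0, 1.0, 1.0]   # a constant feature
e2 = [0.0, 1.0, 2.0]   # a measured predictor

c1, c2 = fit_two_features(y, e1, e2)
y_hat = [c1 * a + c2 * b for a, b in zip(e1, e2)]   # closest point on the plane
residual = [yi - yh for yi, yh in zip(y, y_hat)]    # observed minus predicted

print(c1, c2)
print(dot(residual, e1), dot(residual, e2))  # both zero up to rounding error
```

With these numbers the fit is $c_1 = 7/6$ and $c_2 = 1/2$. For real datasets you would normally hand this to a linear algebra routine such as `numpy.linalg.lstsq` rather than solving the equations by hand.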

A maths lecturer is giving a lecture about 5-dimensional geometry.

A student asks a question: "I can follow the algebra just fine, but it would be helpful if I could visualise it. Is there any way to do that?"

The lecturer replies: "Oh, it's easy. Just imagine everything in $n$ dimensions, and then let $n=5$."

(variants of this joke are common; see for example here.)

##### Linear independence

A set of vectors is linearly dependent if there exists a vector in it that can be written as a linear combination of the other vectors. If your feature vectors are linearly dependent, you will get the same predictions out of your model, but you can't interpret the coefficients.

(For visual intuition: two vectors in 2D are linearly dependent if they lie on the same line, three vectors in 3D are linearly dependent if they lie on the same plane (a superset of the case that they lie on the same line), and so on.)

An easy way to make this mistake is if you're doing one-hot coding of categories. Let's say you're fitting a linear model to estimate student exam grades $y$ based on their university, with a model that looks like this:

$$y = \alpha + \beta 1_{\text{Oxford}} + \gamma 1_{\text{Cambridge}}$$

using indicator function notation. Whatever linear fitting routine you use will happily give you coefficient values, and the predictions it gives will be sensible, but you won't be able to interpret the coefficients. To see what's happening, consider an Oxford student: their predicted grade $y$ is $\alpha + \beta$. What are $\alpha$ and $\beta$? Good question – we can only assign meaning to their combination. If instead we eliminate one university and write

$$y = \alpha + \beta 1_{\text{Cambridge}}$$

when we now fit the coefficients, $\alpha$ will be the predicted grade for Oxford students, and $\alpha+\beta$ the predicted grade for Cambridge students, so we can interpret $\alpha$ as the Oxford average, and $\beta$ as the difference between Oxford and Cambridge. (The predictions given by the model won't change though.)

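The vector interpretation is that if our dataset contains, say, 3 Oxford students followed by 2 Cambridge students, the (5D) data vectors in the first model will be

$$\pmb{1} = (1, 1, 1, 1, 1), \quad \pmb{1}_{\text{Oxford}} = (1, 1, 1, 0, 0), \quad \pmb{1}_{\text{Cambridge}} = (0, 0, 0, 1, 1)$$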

But these vectors aren't linearly independent: the last two vectors sum up to the first one, and therefore there will be many triplets $(\alpha, \beta, \gamma)$ that give identical predictions.
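
To make this concrete, here is a small sketch with the 3 Oxford and 2 Cambridge students from above. The grade averages (60 and 65) and the function name `predict` are hypothetical; the point is only that several different triplets $(\alpha, \beta, \gamma)$ give exactly the same predictions.

```python
# One-hot features for 3 Oxford students followed by 2 Cambridge students.
const     = [1, 1, 1, 1, 1]
oxford    = [1, 1, 1, 0, 0]
cambridge = [0, 0, 0, 1, 1]

# The dependence: the two indicator vectors sum to the constant vector.
dependent = all(o + k == c for o, k, c in zip(oxford, cambridge, const))

def predict(alpha, beta, gamma):
    """Predicted grade for each of the five students."""
    return [alpha * c + beta * o + gamma * k
            for c, o, k in zip(const, oxford, cambridge)]

# Three different coefficient triplets (hypothetical averages: Oxford 60, Cambridge 65).
first = predict(0, 60, 65)
second = predict(60, 0, 5)
third = predict(100, -40, -35)

print(dependent)              # True
print(first, second, third)   # the same list three times
```

The second triplet is the one the reduced model would give you: $\alpha = 60$ is the Oxford average and the Cambridge coefficient $5$ is the difference between the two universities.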

#### Linear fitting and MLE

We talked about MLE being the holy grail of model fitting, and then about linear models and how fitting them comes down to a geometry problem. As it turns out, MLE lurks behind least squares estimation as well.

I mentioned earlier that linear models often assume a normal distribution for errors. Let's assume that, and do MLE.

Our model is that

$$y = c_1 e_1 + \dots + c_n e_n + \epsilon$$

where $\epsilon \sim N(0,\sigma^2)$ (i.e. follows a normal distribution with mean zero and standard deviation $\sigma$).

A useful property of normal distributions is that if we add a constant $c$ to a normal distribution with mean $\mu$, the result has a normal distribution with mean $\mu + c$ and the same standard deviation (this isn't true of all distributions!). Therefore we can write the above as

$$y \sim N(c_1 e_1 + \dots + c_n e_n, \sigma^2)$$

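The likelihood for getting $y$ is

$$\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - (c_1 e_1 + \dots + c_n e_n))^2}{2\sigma^2}\right)$$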

once again copying out the likelihood function for normal distributions.

Now remember that we just want to fit $c_1$ through $c_n$. These only occur in the exponent, so we can ignore all the constants out front, and also we can see that since there's a negative in the exponent, maximising it is equivalent to minimising the stuff in the exponent. Taking out $\sigma$ and constants, the relevant stuff to minimise is

$$(y - (c_1 e_1 + \dots + c_n e_n))^2$$

where we can see that the thing we subtract from $y$ is our model's prediction of $y$ (one component of what we previously denoted $\hat{\pmb{y}}$). Once again, we can see we're minimising a square of the error. Of course, we have many $y$-values to fit; to see that it's the sum of these that we minimise, rather than some other function of them, just note that if we take a logarithm we'll get a term like the above (times constants) for each data point we're using to fit.

So least-squares fitting comes from MLE and the assumption of normally distributed errors.
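
A quick numerical check of this claim, as a sketch rather than a proof: for a few candidate coefficient pairs, rank them once by sum of squared errors and once by Gaussian log-likelihood, and the two rankings agree. The data, the candidate coefficients, and the value of `sigma` are all made up, and `sigma` is assumed known.

```python
import math

def predictions(coeffs, features):
    """Model prediction for each data point: sum over j of c_j * e_j."""
    n_points = len(features[0])
    return [sum(c * e[i] for c, e in zip(coeffs, features)) for i in range(n_points)]

def sum_squared_errors(coeffs, features, y):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions(coeffs, features)))

def log_likelihood(coeffs, features, y, sigma):
    """Log-likelihood of y if each y_i is independently N(prediction_i, sigma^2)."""
    total = 0.0
    for yi, pi in zip(y, predictions(coeffs, features)):
        total += -0.5 * math.log(2 * math.pi * sigma ** 2)
        total += -((yi - pi) ** 2) / (2 * sigma ** 2)
    return total

y = [1.0, 2.0, 2.0]
features = [[1.0, 1.0, 1.0], [0.0, 1.0, 2.0]]
sigma = 0.5  # assumed known; it rescales the log-likelihood but not the ranking

candidates = [(1.0, 1.0), (7 / 6, 0.5), (0.0, 1.0), (1.0, 0.5)]

# Smallest squared error first, and largest log-likelihood first.
by_sse = sorted(candidates, key=lambda c: sum_squared_errors(c, features, y))
by_loglik = sorted(candidates, key=lambda c: -log_likelihood(c, features, y, sigma))

print(by_sse == by_loglik)  # True
print(by_sse[0])            # the least-squares pair (7/6, 0.5) comes first
```

The reason is visible in `log_likelihood`: the total is a constant minus the sum of squared errors divided by $2\sigma^2$, so a larger likelihood always means a smaller squared error.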

(Are errors normally distributed? Often yes. Remember though that our features are functions of things we measure; even if $x$ has normally-distributed errors, after we apply an arbitrary function to it to generate feature $e$, the resulting $e$ might not have normally distributed errors (but for many simple functions it still will). We could be more fancy, and devise other fitting procedures, but often least squares is good enough.)

### Empirical distributions

What's the simplest probability model we can fit to a dataset? It's tempting to think of an answer like "a normal distribution", or "a linear model with one linear feature". But we can be even more radical: treat the dataset itself as a distribution.

On the left, we've plotted the number of data points that take different values of $x$ (this is a discrete distribution; for a continuous distribution, the probability that any two samples drawn are equal is infinitesimal). On the right, all we've done is normalised the distribution, by rescaling the vertical axis so that the heights of all the bars sum to one. Once we've done that, we can go ahead and call it a probability distribution, and assign the meaning that the height of the bar at $x$ is the probability that the distribution $X$ that we've just defined takes the value $x$. This is called an empirical distribution.

Sampling from an empirical distribution is easy – just pick a value at random from the dataset. (Of course, the likelihood such a distribution assigns to any value not in the dataset is zero, which can be a problem for many use cases.)

In fact, you've probably already dealt with empirical distributions, at least implicitly. When you calculate the mean and variance of a dataset, you can interpret this as calculating the properties of the empirical distribution given by that dataset. An empirical distribution as an abstract thing apart from your dataset may seem ad hoc, but it's not any less defined than a normal distribution.
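
Here is a minimal sketch of these ideas using only Python's standard library. The dataset and the helper name `empirical_pmf` are made up for illustration.

```python
import random
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 5]  # invented discrete dataset

def empirical_pmf(data):
    """Map each observed value to the fraction of data points equal to it."""
    n = len(data)
    return {value: count / n for value, count in Counter(data).items()}

pmf = empirical_pmf(data)

# Mean and variance of the empirical distribution, computed from the pmf.
mean = sum(v * p for v, p in pmf.items())
variance = sum((v - mean) ** 2 * p for v, p in pmf.items())

# Sampling: pick values uniformly at random from the dataset, with replacement.
rng = random.Random(0)
samples = [rng.choice(data) for _ in range(5)]

print(pmf)
print(mean, variance)
print(pmf.get(4, 0.0))  # 0.0: a value that is not in the dataset gets probability zero
```

Note that the variance computed this way divides by $n$ (the "population" variance of the dataset), not by $n-1$.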

The standard way to illustrate an empirical distribution is by plotting its cumulative distribution function (cdf); an empirical one is known as an ecdf. This is almost necessary for continuous variables. In general, the ecdf of a dataset is a very useful and general way to visualise it: it saves you from the pains of histograms (how large to make the bins? if you take logs or squares first, do you take them before or after binning? etc. etc.), and is also complete in the sense of technically displaying every point in the dataset.

The ecdf for the above distribution would look something like this:

(Like any ecdf, it takes the value 0 up until the first data point and the value 1 from the last data point onwards.)
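
Computing an ecdf takes only a few lines. The sketch below uses a made-up dataset and a hypothetical helper `make_ecdf`; the value of the ecdf at $t$ is the fraction of data points less than or equal to $t$.

```python
from bisect import bisect_right

def make_ecdf(data):
    """Return a function F where F(t) is the fraction of data points <= t."""
    ordered = sorted(data)
    n = len(ordered)

    def ecdf(t):
        # bisect_right counts how many sorted values are <= t.
        return bisect_right(ordered, t) / n

    return ecdf

data = [1, 2, 2, 3, 3, 3, 5]  # invented dataset
F = make_ecdf(data)

for t in [0, 1, 2, 3, 4, 5, 10]:
    print(t, F(t))
```

The function is a staircase: it jumps by $1/n$ at each data point (more where values repeat) and stays flat in between, which is why it never needs a choice of bin size.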

If we now fit any parametric (i.e. non-empirical) distribution, comparing its cdf to the ecdf is a good test of how good the fit is.

#### Measuring the goodness of a model fit with KL divergence

The empirical distribution is the best possible fit to a given dataset, and therefore it's a good benchmark to measure the fit of a proposed model against.

Let's say our data is $x=x_1, ... ,x_n$, and the empirical distribution is $X^*$. The likelihood of drawing $x$ from $X^*$ is (under the assumption of each $x_i$ being drawn independently)

$$\Pr_{X^*}(x) = \prod_{i=1}^{n} \Pr_{X^*}(x_i)$$

Now $\Pr_{X^*}(x_i)$ is just the fraction of how many $x_j$ in $x$ are equal to $x_i$. Writing $N_{x_i}$ to mean the number of values equal to $x_i$ in the data, we can write

$$\Pr_{X^*}(x) = \prod_{i=1}^{n} \frac{N_{x_i}}{n}$$

Taking logs, and writing $q_v = N_{v} / n = \Pr_{X^*}(v)$, the above product for the likelihood becomes the sum, over possible values $v$ of $x_i$, for the log likelihood:

$$\log \Pr_{X^*}(x) = \sum_{v} N_v \log q_v$$

Now we'll do one last trick, which is to scale by $1/n$; otherwise, the term in front of the log will tend to be bigger if we have more data points, while we want something that means the same regardless of how many data points there are. After we do that, we notice a nice symmetry:

$$\frac{1}{n} \log \Pr_{X^*}(x) = \sum_{v} \frac{N_v}{n} \log q_v = \sum_{v} q_v \log q_v$$

This is a good baseline to compare any other model to. For example, let's say we fit to this a (discrete) distribution $X$ (with the same sample space as $X^*$) with parameters $\theta$. Write $p_v = \Pr_X(v; \theta)$, and we can express the log likelihood of the dataset as

$$\log \Pr_X(x; \theta) = \sum_{v} N_v \log p_v$$

Normalising by $1/n$ as before, we get

$$\frac{1}{n} \log \Pr_X(x; \theta) = \sum_{v} q_v \log p_v$$

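Now to get a measure of fit goodness, just subtract, and do some algebra on top if you feel like it:

$$\sum_{v} q_v \log q_v - \sum_{v} q_v \log p_v = \sum_{v} q_v \log \frac{q_v}{p_v} = \sum_{v} \Pr_{X^*}(v) \log \frac{\Pr_{X^*}(v)}{\Pr_X(v; \theta)}$$

(In the last step, I've just expanded out our earlier definitions of $p_v$ and $q_v$.)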

This is called the Kullback-Leibler divergence (KL divergence). If $X=X^*$, then it comes out to 0; for worse fits, the value becomes greater.
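
As a sketch of how this number might be computed in practice (with a made-up dataset, a hypothetical helper `kl_from_data`, and a uniform model chosen purely for illustration):

```python
import math
from collections import Counter

def kl_from_data(data, model_pmf):
    """Sum over observed values v of q_v * log(q_v / p_v), in nats.

    q_v is the empirical probability of v and model_pmf(v) returns p_v.
    Assumes p_v > 0 for every observed v; otherwise the divergence is infinite.
    Values with q_v = 0 contribute nothing, so only observed values are summed.
    """
    n = len(data)
    total = 0.0
    for v, count in Counter(data).items():
        q = count / n
        total += q * math.log(q / model_pmf(v))
    return total

data = [1, 2, 2, 3, 3, 3, 5]  # invented dataset
empirical = {v: c / len(data) for v, c in Counter(data).items()}

def empirical_model(v):
    """The empirical distribution itself, used as the 'model'."""
    return empirical[v]

def uniform_model(v):
    """Hypothetical model: uniform over the five values 1 to 5."""
    return 1 / 5

kl_self = kl_from_data(data, empirical_model)
kl_uniform = kl_from_data(data, uniform_model)

print(kl_self)     # zero: the empirical distribution fits itself perfectly
print(kl_uniform)  # positive: the uniform model is a worse fit
```

Using `math.log` gives the answer in nats; swapping in `math.log2` gives bits instead.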

There's a nice information theoretic interpretation of this result. $- \sum_{v} q_v \log_2(p_v)$ is the average number of bits needed to most efficiently represent a value randomly drawn from the dataset, using a coding scheme optimised for the distribution $X$

## 2020-12-17

### Review: Foragers, Farmers, and Fossil Fuels

Book: Foragers, Farmers, and Fossil Fuels: How Human Values Evolve, by Ian Morris (2015)

This post has also been published here.

Two hundred years ago, most people lived in societies that considered slavery, war, and discrimination based on class, ethnicity, and gender to be justifiable. Today, most people live in societies that hold the opposite beliefs.

What changed? A simple and tempting narrative is that we have simply become wiser; that various Enlightenment philosophers, thoughtful activists, and other principled people figured out that the pre-industrial moral order is wrong and managed to persuade everyone to change.

It is true that many smart and principled people had good ideas and that this was a big proximate driver of better values. But is it a coincidence that this change in values happened around the same time as the industrial revolution?

What about the previous economic revolution, the agricultural one? Did that also coincide with a change in the values that people held? The evidence says yes – foraging societies tend to be more accepting of violence and far less accepting of hierarchy than farming ones.

The argument of Ian Morris' Foragers, Farmers, and Fossil Fuels is that these timings are not a coincidence. Societies that change their main method of getting energy also change their values, because some sets of values give greater success for a certain type of society. Farming societies that stick to anti-hierarchical forager attitudes won't survive competition with farming societies that learn to believe in hierarchies (maybe they won't be economically competitive and won't be able to field as big an army to defend themselves as the god-king next door can field to conquer them). Likewise, industrial societies that stick to inflexible hierarchies and elite-focused economies can't compete with more equal democracies that don't squander the talents of the non-elite, and maintain a well-looked-after middle-class of rich consumers and educated workers.

We can contrast two ways of trying to explain the history of values. The first says that the history of values is a history of ideas; a battle of ideas against other ideas, waged in the minds of people. The second says that the history of values is a history of what works best. The battle is between the benefits conferred by believing in certain ideas and those conferred by other ones, and it is waged out in the real world, where empires fall or rise based on whether they value the things that will lead them to success.

It is clear that neither style of explanation is enough on its own. No matter how persuasive it can be made, a sufficiently destructive idea – as an extreme example, that everyone should commit suicide – will not find its adherents in charge of the future (or coming from the opposite direction: why do you think many religions are so big on the "be fruitful and multiply" point?). On the other hand, no matter how practically useful a certain idea is, someone has to have the idea and persuade other people to adopt it as a value before it has a chance of spreading because of its practical benefits.

The question, then, is just how far we can push the deterministic account, where the methods of energy capture constrain values. In Ian Morris' telling, the answer is surprisingly far, and if his account of the history of values is correct, I agree with him (in particular, the similarities of farming society values across continents are hard to explain otherwise). However, I think Morris, along with most people who advance or accept similar arguments, goes too far with the moral pragmatism that these ideas may be thought to imply.

But first: what values did foragers, farmers, and fossil fuel users actually hold, and what is Morris' energy-based explanation of the changes between them?

### Foragers

Everyone has some idea of what a forager or hunter-gatherer is, but since we want to deal with differences between foragers and farmers, we want a clear idea of where the line is. Morris cites a good definition by Catherine Panter-Brick: foragers are people who "exercise no deliberate alteration of the gene pool of exploited resources". If you plant and harvest a few naturally occurring plants, you're still a forager, but when you start refining the crops generation by generation or breeding the animals, that's the point when you become a farmer.

Of course, there is a vast amount of variance in culture, lifestyle, and values between different forager bands. To almost every generalisation about foragers, there exists some tribe that does the opposite. However, Morris argues that for each main type of human society (foraging/farming/industrial), it is useful to talk about the average set of values such societies held or tended to develop towards, at least in terms of the broad categories of tolerance of political/economic/gender hierarchy and propensity to violence. This covers up lots of important questions – different societies may have justified violence under different circumstances, or had different reasons for why economic inequality was acceptable, but such differences are sucked up into one category and ignored in this sort of analysis. That this makes sense will become apparent once we see that foragers, farmers, and fossil fuel users can be sensibly compared and contrasted even at this very general level.

In some ways, forager values are familiar. Even among foragers, possession and ownership are big deals, with every item generally having an owner. In other ways, they're surprisingly different.

Take violence. Though it's very difficult to come up with exact figures for anything to do with foragers (ancient foragers left behind only bones and tools, and modern foragers only live in places that farmers didn't want, so might not be a representative sample), the chance of dying by murder may have been around 10% in an average forager tribe, compared to 0.7% today, 1-2% across the 1900s (including all wars), roughly 5% in your average farming society or in the most murderous countries of today, and 20% for Poland during World War II.

This was not recognised by anthropologists until the 1990s or so because, as Morris explains:

"[T]he social scale imposed by foraging is so small that even high rates of murder are difficult for outsiders to detect. If a band with a dozen members has a 10% rate of violent death, it will suffer roughly one homicide every twenty-five years, and since anthropologists rarely stay in the field for even twenty-five months, they will witness very few violent deaths."

This is why Elizabeth Marshall Thomas' !Kung ethnography was called "The Harmless People", even though "their murder rate was much the same as what Detroit would endure at the peak of its crack cocaine epidemic".

Foragers are also extremely averse to hierarchy. Perhaps the best summary is given by a !Kung San forager asked about the absence of chiefs:

"Of course we have headmen! In fact we’re all headmen … Each one of us is headman over himself!"

It's not just that foragers don't have strict hierarchies and this behaviour falls out naturally as a result; they are actively opposed to any sort of hierarchy or inequality. Material inequality is considered morally wrong, and fairness essential. Pressure to share spoils is applied liberally. And as in any group of humans, you'll have upstarts who try to achieve greatness and power, but such people usually have opposition groups immediately form to hold them back. Anthropologist Christopher Boehm calls these "reverse dominance hierarchies"; Morris translates this as "coalitions of losers".

The one sort of inequality that foragers aren't opposed to is gender inequality, with the dominant role in politics and violence generally falling to men (as an example of this attitude, Morris cites a forager of the Ona people (also known as the Selk'nam or Onawo) saying "the men are all captains and the women are sailors"). However, the gender inequality in forager societies is still on a different level from the extreme gender inequality and regimentation of farmer societies, and attitudes about sex were looser too. Morris writes that "abused wives regularly just walk away [...] without much fuss or criticism, and attitudes towards marital fidelity and premarital virginity tend to be quite relaxed".

### Farmers

As with foragers, Morris lumps together farming societies into one ideal type, labelled Agraria by Ernest Gellner. As before, this covers up a lot of variation (in particular, he identifies horticulturalists, city states like classical Athens or medieval Venice, and proto-industrial nations like Qing dynasty China, Mughal India, Ottoman Turkey, and Enlightenment Western Europe as the three extremes of Agraria), but Morris argues "the exceptions and sub-categories should not be allowed to obscure the reality of an ideal type representing in abstract terms the core features of peasant farming society". He cites Robert Redfield:

"[I]f a peasant from [any one of widely separated farming societies] could have been transported by some convenient genie to any one of the others and equipped with a knowledge of the language in the village to which he had been moved, he would very quickly come to feel at home. And this would be because the fundamental orientations of life would be unchanged. The compass of his career would continue to point to the same moral north."

So what is the moral north of farming societies? Perhaps surprisingly, it's almost as hard to make definite conclusions about what anyone other than the elite thought in agrarian societies as it is to make conclusions about foragers.

While the elite read and wrote a lot, they didn't care much about what the peasants thought, and peasants were not literate. The most literate ancient societies – for example Athens in the 4th and 5th centuries BCE – had a rudimentary literacy rate of 10%, so one person in ten might be able to glean some meaning from words, but how well they could set down their thoughts on moral values is a different question. To get higher literacy rates, you have to move in time to the early second millennium, and in space to urban China or western Europe. Morris writes that "genuine mass literacy, with half or more of the population able to read simple sentences, belongs to the age of fossil fuels", and because of this, most of "our evidence for peasant experience comes from archaeology and accounts by twentieth-century anthropologists, rural sociologists, and development economists." If history is the written record of the past, then the majority of the population lived their lives outside history until the past century or two. (Perhaps we might even say that history in this sense only began with the internet age, when the private lives of everyone began being set down.)

Before going into the trickier question of values, we can compare foragers and farmers in some simple ways. First, farmers' energy consumption was higher. Foragers, like all humans, need to eat about eight and a half megajoules (2000 kilocalories) of energy as food per person per day to stay alive. Add cooking, and total energy consumption roughly doubles. The energy use of agrarian societies starts out at a forager level of around 20 MJ/person/day (5000 kcal), and goes up to the 100-150 MJ/person/day level (compare to 500 MJ/person/day (120 000 kcal), plus/minus a factor of two or so, for modern rich industrial nations).

Second, farming societies had, very roughly, perhaps half as many violent deaths as foragers, due to the existence of governments that at least occasionally kept the peace.

However, their life wasn't better on most metrics. In contrast to the literature (both then and now) full of "tales of vagabonds, wandering minstrels, and young men striking out to make their fortunes", "most farmers lived in worlds much smaller than most foragers had done, and never went much more than a day or two’s walk from the villages they were born in". Not only this, but:

"Excavated skeletons suggest that ancient farmers tended to suffer more than foragers from repetitive stress injuries; their teeth were often terrible, thanks to restricted diets heavy on sugary carbohydrates; and their stature, which is a fairly good proxy for overall nutrition, tended to fall slightly with the onset of agriculture, not increasing noticeably until the twentieth century AD."

No farming society ever managed to escape the repeating cycles of population growth and starvation that foragers were also prone to, despite having more direct control over their food supplies. Populations would increase to keep pace with the good times until all farmers were slaving away to stay at subsistence levels given the crowdedness and quality of the land. Then many would starve to death when the bad times came.

Another trend across the history of farming societies is three things coinciding: energy consumption rises above 40 MJ/person/day (twice the minimum agrarian level and the typical forager level), towns grow past 10 000 people, and a few people take charge and start bossing around the others with their governments.

In farming societies, widespread respect and reverence for hierarchy was internalised by everyone. Morris writes that "[f]arming society often seemed obsessed with the symbolism of rank", and twentieth century anthropologists "regularly found that having a healthy respect for authority – knowing your place – was a key part of their informants’ sense of themselves as good people". This often came, and still comes, as a surprise to non-farmers:

"[W]hen European reformers began venturing outside their urban enclaves into the countryside in the eighteenth century, they were often astonished that instead of complaining about inequality and demanding the redistribution of property, peasants largely took it as right and proper that most people were poor and weak while a few were rich and strong."

Especially revered was the "Old Deal", Morris' term for the generalised social contract between classes in agrarian societies: that some have the duty to be commanders (or "shepherds of the people", in the preferred phrasing of many a king), others to obey those commands, and if everyone follows this script then things work fine.

Even when the powerful were questioned, the questioning didn't go as far as the Old Deal itself. In fact it rarely reached the king. "The tsar is good but the boyars [aristocrats] are bad", goes a Russian saying; even those who protested the powerful assumed that the highest levels of power must be good and holy, and the problems came from their will being incorrectly carried out by lesser lords. Even when the king himself came under fire, the Old Deal itself, or the inequality it entailed, were not questioned. The most common sort of rebellion against a king took what Morris calls a "good-old-days form": the justification was that the king had broken the Old Deal (or been abandoned by the gods or lost the Mandate of Heaven) and the urgent need was to restore the days when the right dictator was in charge, not abolish the dictatorship in the first place.

There were exceptions – in the 1640s some Chinese peasants called themselves "Levelling Kings" and went around questioning who gave their rulers the right to call them serfs, and of course there's the gradual English case and the rather more abrupt French case – but these only came when the societies in question started hitting energy consumptions of 150 MJ/person/day, the very highest end that agrarian societies could achieve without a full-on industrial revolution.

(Morris implies that the energy consumption is the cause. This seems backwards; an explanation running through the institutions and organisation needed to sustain this energy level seems much more reasonable. In general, perhaps when Morris talks about "energy consumption", you should read "the societal factors that enable higher energy consumption" in its place.)

Given how anti-hierarchy foragers were, how did this come to be? Were the peasants all forced into a rigid hierarchy by ruthless elites?

'“You may fool all the people some of the time; you can even fool some of the people all the time; but you can’t fool all the people all the time,” Abraham Lincoln is supposed to have said (unless it was P. T. Barnum). But Korsgaard and Seaford apparently think that Lincoln/Barnum was wrong, and that for ten thousand years everyone in Agraria was led by the nose—women by men, poor by rich, everyone by priests—and robbed blind. This I just cannot credit. Humans are the cleverest animals on the planet (for all we know, the cleverest in the whole universe). We have worked out the answers to almost every problem we have ever encountered. So how, if farming values were really just a trick perpetrated by wicked elites, did they survive for ten millennia? Most of the farmers I have met have been canny folk; so why could farmers in the past not figure out what was going on behind the wizard’s veil?

The answer, in my opinion, is that there was no veil. The veil is a figment of modern academics’ imaginations, made necessary by the assumption that only a tiny elite could possibly have thought that hierarchy was a good thing. In reality, farmers had farming values not because they fell for a trick but because they had common sense.'

It is clearly a mistake to think that farmers participated in farming societies and their values through gritted teeth. However, I don't think it was so much farmers' common sense that made them adopt farming values. Societies that brainwashed their members into sincerely accepting farming-era hierarchies did better, and eventually all farming societies mastered this art.

#### Specific inequalities: forced labour and patriarchy

In addition to the general extreme hierarchy of farming societies, there are two specific types of inequality that are both interesting in their causes and tragic in their consequences.

The first is slavery, and forced labour more generally. Both are almost entirely absent in foraging bands, which might take captives from other tribes but usually eventually integrate them into the tribe rather than keeping them forever as slaves. In contrast, some form of forced labour is found in almost every agrarian society.

Why? Because financial institutions weren't strong enough. Markets for labour existed almost everywhere, but there was a problem: "anyone who had enough land to support a family preferred to make a living by working it rather than by selling labor", because, without reliable banks for everyone, keeping a good farm was the only robust way to accumulate and maintain wealth, especially for your children. When it was time for a big construction project (maybe the pharaoh died and you need a pyramid to bury him in), even wealthy employers like the state couldn't always hire enough workers. Often they resorted to violence to lower the costs of labour. Violence, after all, came cheap.

The second specific kind of inequality was male domination and strict gender roles. Morris offers a two-pronged explanation. First, farmer men had more reason than forager men to keep farmer/forager women under control:

“The main reason that male foragers generally care less than male farmers about controlling women [...] is that foragers have much less to inherit than farmers. [...] [Q]uestions about the legitimacy of children matter a lot less than they do when only legitimate offspring will inherit land and capital.”

(We might ask why farming societies were so strict about only legitimate offspring inheriting property, but perhaps this is a case of biological values limiting the space of cultural variation.)

Second, gender roles became more regimented out of necessity. Agricultural work – plowing, manuring, and irrigation – relies on brute upper body strength, which favours males. Farmers worked harder in general than foragers, so more male-specific strength-based work also pushed everything else – home upkeep (which foragers didn't need to do) and food processing – onto women. As early as 7000 BCE, skeletons from Syria suggest that both genders regularly carried heavy loads, but only women had an arthritic condition caused by kneeling and footwork, probably as a result of grinding grain.

Finally, child bearing is obviously restricted to women. With the advent of farming, the doubling time for populations fell by a factor of five, from ten thousand to two thousand years. Infant mortality seems not to have changed, so this is due to increased birth rates alone.

Morris writes that this decision on gender norms seems so obvious that "no farming society that moved beyond horticulture ever seems to have decided anything else". According to him, "if we sit theorizing in our fossil-fuel studies" we might imagine an alternative where women had the upper hand, "sending otherwise-useless men out to labor for them in the fields, but in reality, the organizational needs of farming societies gave men the means to inflict devastating economic pain on faithless wives while also raising the costs for men of failing to deter women from bringing cuckoos back to the nest". The empirical correlation between gender inequality and farming societies seems strong and Morris' arguments are plausible, but whether they're the final word is less clear.

Of course, you can't hold everyone down all the time. Morris lists many historical cases of people who were slaves and/or women, but nevertheless defied expectations and attained great success. For example, Morris tells the story of an Athenian slave banker called Pasion, who did so well that he was eventually not only able to buy his own freedom but also the bank itself.

(Interestingly, Wikipedia tells the story slightly differently, saying he was manumitted as a reward for his work, and inherited the bank after his former owners retired, rather than by buying it outright. Wikipedia cites the 1971 Athenian Propertied Families by J. K. Davies; Morris cites Edward Cohen's Athenian Economy and Society and Jeremy Trevett's Apollodoros the Son of Pasion, both from 1992. I don't know who to believe, or whether a consensus exists.)

Morris' harsh conclusion is that both forced labour and patriarchy were "functionally necessary to farming societies that generated more than 10k kcal/cap/day [42 MJ/cap/day]".

### Fossil-fuel users

Many places underwent the agricultural revolution independently of each other, because farming spread slowly enough that distant people could invent it on their own before the waves of someone else's discovery of farming reached them. In contrast, the industrial revolution happened in north-west Europe fast enough, and gave big enough advantages, that no other region had an independent industrial revolution.

The culture and values of the post-industrial West – democracy, human rights, individualism, market-orientedness, and so on – are often labelled Western. In some sense this is a tautology; by definition, these are the values that Western countries have at the moment. The label is also used in a deeper sense, to mean that there is some kernel of Westernness in these values that makes them the logical conclusion of pre-industrial Western thought, and perhaps incompatible with different cultural bases.

One consequence of Morris' arguments is that this perspective is wrong. What we might call Western values are no more Western values than farming-era values are Sumerian values (or Indus Valley values or Mesoamerican values or ...); the reason Western values are called Western values but farming values aren't called Sumerian values is that the industrial revolution spread faster than the agricultural one. To explain Western values we should look not at ancient Greek philosophers and whatnot but at the demands of industrialised societies.

This does not mean that every industrialised society will approach the West in its values, only that the pressures are there (and wily enough dictators or future technological trends may be enough to avoid them). It might also be that the reason that Europe underwent an industrial revolution while other societies at the edges of agrarian achievement did not is that, by accidents of history and geography, pre-industrial north-west European values were closer to modern industrial values than those of the other societies that have stood at the cusp of industrialisation.

But the overall conclusion remains: "Western" values are the universal values that industrialised societies tend towards. The conflict between Boko Haram or the Taliban and the West, to use two of Morris' examples, is not so much a conflict of culture versus culture, but of era versus era; a last stand of the hierarchy- and patriarchy-obsessed farming values that were held by everyone (except a forager here or there) until a few hundred years ago. On a more granular level, the steady retreat of discrimination and formality from Western societies is simply the gradual acceptance that these vestiges of the farming era are no longer useful.

As with the transition to farming society, there's the question of how people eventually reached stances almost opposite to what their ancestors had believed. Unlike with the agricultural revolution, the question is especially pressing because the timescale of the changes is so short. But once again, a lot of it was driven by economics.

The first step was people moving from countryside farming to factory jobs:

"Nineteenth-century sources make it very clear that entering the wage-labor market could be a traumatic experience, requiring workers to submit to strict time discipline and factory conditions unlike anything they had known in the countryside; and yet millions chose to do so, because the alternative—hunger—was worse.

So eager were poor farmers for dirty, dangerous factory jobs that British employers only needed to increase wages by 5 percent (in real terms) between 1780 and 1830, although output per worker grew by 25 percent. Wage increases accelerated only in the 1830s, and even then only for urban workers. The great motor was productivity, which was now rising so high that employers began finding it cheaper to share some of their profits with their workers than to try to break strikes. (In another great irony, by the time that Dickens, Marx, and Engels were writing, wages were rising faster than ever before in history.) For the next fifty years, wages rose as fast as productivity; after 1880, they rose even faster. By then, incomes were beginning to rise in the countryside too.”

One resulting value change was the abolition of forced labour:

“By making wage labour attractive enough to draw in millions of free workers, higher wages made forced labor less necessary, and because impoverished serfs and slaves—unlike the increasingly prosperous wage labourers—could rarely buy the manufactured goods being churned out by factories, forced labour increasingly struck business interests as an obstacle to growth (especially when it was competitors who were using it).”

The farmer-era justifications for gender hierarchy also broke down. First, industrialised societies had less need for brute strength and more need for organisational work, in which there is no gender disparity. Second, birth rates eventually went down, reducing the amount of time women spent on children. As a result, almost universal male dominance during the farming era has given way to a world where 81% of people say gender equality is important, including 98% in Britain but also over 90% of Indonesians and Turks and even 78% of Iranians (India, with a very low 60% and a huge population, is probably the biggest drag on the average).

Morris offers a great summary of the principles of success in agrarian versus industrial societies:

“Agraria had worked by drawing lines, not just between elite and mass or men and women, but also between believers and nonbelievers, pure and defiled, free and slave, and countless other categories. Each group was assigned its place in a complex hierarchy of mutual obligations and privileges, tied together by the Old Deal and guaranteed by the gods and the threat of violence. Fossil-fuel societies, however, work best by erasing lines. The more a group replaces the rigid structure of figure 3.6 with the anti-structure of figure 4.7—a completely empty box, made up of interchangeable citizens—the bigger and more efficient its markets will be and the better it will function in the fossil-fuel world.”

The most successful agrarian societies have a social structure like the one above; the most successful industrial societies look like this instead:

This, in a nutshell, is why agrarian societies tend towards extreme hierarchy while industrial societies tend towards a social structure of interchangeable mobile individuals, free to do what they want and incentivised to slot themselves wherever they create the most value (at least economically).

With industrialisation, we've managed to roll back the discrimination and hierarchy of the farming age. We've even gone back to valuing fairly flat political hierarchies like the foragers (though we maintain them through democratic institutions rather than "coalitions of losers"), and become more egalitarian about gender than the foragers were, all the while living in societies far less violent than the average hunter-gatherer band.

There is one area where we're more tolerant of hierarchy than foragers, though: economic inequality. Once again the reason is practical:

"[...] Industria can flourish only if it has affluent middle and working classes that create effective demand for all the goods and services that fossil-fuel economies generate, but on the other, it also needs a dynamic entrepreneurial class that expects material rewards for providing leadership and management. In response, fossil-fuel values have evolved across the last two hundred years to favor government intervention to reduce wealth equality—but not too much.”

However, even then we still abhor the farmer-era standard of seeing it as fair when the elite extract as much as they can from everyone under them. In fact, the mere fact that calling elites extractive has become a good political weapon shows how far we've come – as discussed in the farming section, farming-era people saw ruthlessly extractive elites as part of a fair social contract.

### A summary of value evolution?

We've just gone over a lot of detail about forager, farmer, and fossil-fuel user values, and some reasons why values might have developed in the way they did. Is this a story of a random path through the stages of technological development, with harsh selection pressures making sure that societal values are dragged along for the ride? Or is there some pattern to the madness?

Morris' summary table does a good job of summing up the "what" of it:

Two things leap out from this table, especially if we plot it graphically: when it comes to attitudes towards hierarchy, fossil-fuel users are much closer to foragers than farmers are to anyone, and violence has gone down all along. (Slide from a talk I gave at EA Cambridge)

Other people have noticed this; economist and futurist Robin Hanson has written about the modern conservative-liberal axis mapping onto how willing people are to abandon farming ways and revert to more forager-like lifestyles and values as societies grow richer (as some people inexplicably prefer writing in digestible chunks rather than monolithic book-length blog posts, it's hard to give just one or two key links, but see for example here, here, here, and here).

Perhaps we can tell a story like this: in the beginning there were foragers. They tended to live as people tend to do, and value the things that evolution had crafted people to want. Humans being humans, there was a lot of politicking, and with no institutions to restrain it, a fair amount of violence. The outside world was harsh and outside anyone's control.

Then the agricultural revolution slowly crept across the world. At first people lived as before, but generation by generation it turned out that the societies that managed to best persuade people to accept a bit more hierarchy – to show a bit more obedience to the chiefs, grant a bit less non-reproductive status to women – did a bit better than the others. Over millennia, such societies either had their tricks independently discovered or copied by others, or else went on the warpath to subjugate other societies to their rule – and, of course, preach their values, which (given human adaptability) they held sincerely, and with no idea that they thought differently from their distant ancestors. Eventually, the big tricks – organised religion and the god-kings keeping power by letting their henchmen extract as much as they could from their subjects – became almost universal. They also lowered the level of violence by imposing some amount of internal order and perhaps a culture promoting peaceful conflict resolution, if only to spare more strength to throw at neighbouring societies.

Then came the industrial revolution, and suddenly what mattered was how well a society could harness the talents of its members and establish efficient, competitive markets to drive innovation. This created pressures to democratise and erase lines between people. Technology and wealth also increased people's ability to control their lives. Rich and comfortable industrialised people no longer needed to abide by strict farming-era social rules to survive, and so slowly gave up on them, reverting to more forager-like ways, though with the added advantages of unprecedented peace and material wellbeing.

### How selection pressures change values

The reasons why societies tend to adopt pragmatic values are subtle; it's not as if people go around cynically holding the values that will best contribute to their tribe's or society's long-term success. As a result, Morris' descriptions of how selection pressures do their work are worth quoting at length.

First, here's how farmers ended up dominating the world in the first place:

“The first farmers had free will, just like us. As their families grew, their landscapes filled up. […] For all we know, some foragers in the Jordan Valley ten thousand years ago [chose to remain foragers]. The problem, though, was that they were not making a one-time choice. Tens of thousands of other people were asking the same question, and each family had to revisit the decision of whether to intensify or go hungry multiple times every year. Most important of all, each time one family chose to work harder and intensify its management of plants and animals, the payoffs from sticking with the old ways declined a little further for everyone else. Every time cultivators started thinking of the plants and animals on which they lavished care and attention as their personal gardens and flocks, not part of a common stock, hunting and gathering would become that much more difficult for those who stuck to it. Foragers who clung stubbornly and/or heroically to the old ways were doomed because the odds kept tilting against them.”

But how did this result in a world of dictator kings? Morris:

“We should probably assume that people tried lots of different ways to solve the collective action problem of how to create larger, more integrated societies with more complex divisions of labor as they moved from foraging to farming, but almost everywhere, it seems that the solution that worked best was the idea of the godlike king.”

Morris isn't very clear on why godlike kings, out of all possible forms of social organisation, worked best. We can imagine that it's hard to coordinate big armies for defence or offence without one, or that the symbolism of a godlike figurehead is the most reliable way to unite masses in a largely illiterate society, or vaguely gesture like Morris at the challenges of managing complex societies, but there doesn't seem to be much hard evidence or reason for a precise mechanism one way or the other, at least in Foragers, Farmers, and Fossil Fuels.

In general, collective action problems are important in any large organisation, and the simplest solution is complete centralisation; effectively reducing collective action problems back into individual action problems. Of course, this comes with all the cruelties and inefficiencies of real-world non-omnibenevolent, non-omniscient centralised decision-making. Given this, was the centralisation-vs-decentralisation tradeoff really so simple in the farming era that "godlike kings everywhere" was the only effective answer? Perhaps the tradeoffs really were that one-sided in the farming age, and this became a trickier question only in the industrial age when nurturing human talent and prosperity became key societal goals, and we created effective decentralised institutions like free markets and democracy. Or maybe there was a high but not extreme level of optimal centralisation, but the greed of individual rulers often pushed their societies past this level despite selection pressures working in favour of more responsibly led societies, and it was only with the industrial age that these pressures became high enough to force the world away from the godlike king model.

Morris also describes the rise of capitalism:

“Capitalism took off in early-modern Western Europe because practical people figured out that this was the most effective way to get things done in an increasingly energy-rich world. Other people disagreed, and did things differently. Conflicts and compromises ensued as the competitive logic of cultural evolution went to work and drove the less effective ways extinct.”

Once again, I think the concept of selection pressures is a powerful lens, but the details of what drives the relationship are missing. What exactly was it about an energy-rich environment that made capitalism ideal? Even by Morris' own account, it seems the methods (e.g. complex manufacturing chains, mature financial institutions, etc.) required to most effectively extract and use energy given a particular technology level are what matter, not the raw total of joules consumed per person per day.

### Respondents

Foragers, Farmers, and Fossil Fuels originated from the Tanner Lectures at Princeton. As part of the format, the book includes four responses to Morris' arguments, by Richard Seaford, Jonathan Spence, Christine Korsgaard, and Margaret Atwood.

On the whole, these responses don't add much to the book, though they are helpful in making Morris elaborate on his arguments in the final chapter (cheekily entitled "My Correct Views on Everything").

Seaford and Spence provide short chapters that seem to be more about their own interests than Morris' arguments, and have the tone of questions asked by professors who slept through the talk but are still trying to say something insightful at the questions session.

Atwood, of The Handmaid's Tale fame, brings an arsenal of literary flair to bear on the task. She manages to make some good points (what about horse-riding pastoralists, who may have been the first large-scale war-makers?), along with some ridiculous statements:

“Several billion years ago, marine algae produced the atmosphere that allows us to breathe, and these algae continue to produce from 60 to 80 percent of our oxygen. Without marine algae, we ourselves cannot survive. During the Vietnam War, huge vats of Agent Orange were being shipped across the Pacific. Should they have sunk and leaked, we would not be having this conversation today.”

Let's do some very rough calculations. If all the Agent Orange deployed in Vietnam had been uniformly distributed across the Pacific, the mass concentration of its component acids (making the highest assumptions about what concentration it was sprayed at) would have been lower than one part in tens of trillions, a hundred thousand times lower than the mass concentrations of either lead or mercury already in the oceans. I couldn't find any study of what happens to algae in oceans if you dump Agent Orange on them, but one article about using algaecide in swimming pools says applying one ten-thousandth of the pool volume is typical. Another article mentions 5-10% as a common concentration, giving an algae-killing active ingredient concentration of maybe 1 in 100 000 in water. Agent Orange would need to kill algae at ten million times lower concentrations in oceans than commercial algaecide does in swimming pools for the Pacific's oxygen production to be destroyed.
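The last step of that comparison can be sketched in a few lines of Python. This is only a rough check using the figures quoted above; the ocean concentration of 1e-13 is my assumed reading of "one part in tens of trillions" (treated as an upper bound), and the function and variable names are purely illustrative.

```python
# Rough check of the swimming-pool comparison, using only the figures quoted above.

POOL_DOSE_FRACTION = 1e-4              # algaecide product added, as a fraction of pool volume
ACTIVE_FRACTION_RANGE = (0.05, 0.10)   # active ingredient share of the product (5-10%)

# Assumption: "one part in tens of trillions" is read as an upper bound of
# one part in ten trillion; the true figure would be lower still.
OCEAN_CONCENTRATION_UPPER_BOUND = 1e-13


def pool_active_concentration(dose_fraction, active_fraction):
    """Concentration of active ingredient in the pool water."""
    return dose_fraction * active_fraction


def required_potency_ratio(pool_concentration, ocean_concentration):
    """How many times more dilute the ocean dose is than the pool dose."""
    return pool_concentration / ocean_concentration


# Pool concentrations at the low and high end of the active-ingredient range.
pool_low, pool_high = (
    pool_active_concentration(POOL_DOSE_FRACTION, fraction)
    for fraction in ACTIVE_FRACTION_RANGE
)

# Corresponding ratios between pool and (upper-bound) ocean concentrations.
ratio_low = required_potency_ratio(pool_low, OCEAN_CONCENTRATION_UPPER_BOUND)
ratio_high = required_potency_ratio(pool_high, OCEAN_CONCENTRATION_UPPER_BOUND)

print(f"pool active-ingredient concentration: {pool_low:.1e} to {pool_high:.1e}")
print(f"required potency ratio: {ratio_low:.1e} to {ratio_high:.1e}")
```

With these inputs the required ratio comes out at tens of millions or more, so "ten million times" is, if anything, a conservative way of putting it.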

(Or maybe Atwood means the literal sense that, because of various butterfly effects, any such change in history makes any present event, including this conversation, unlikely?)

By far the most substantive response comes from the philosopher Christine Korsgaard. She also has the idea that the farming era was an aberration, with a fresh interpretation:

“Instead of thinking that values are determined by modes of energy capture, perhaps we should think that as human beings began to be in a position to amass power and property in the agricultural age, forms of ideology set in that distorted real moral values [i.e. the values a society should hold], distortions that we are only now, in the age of science and extensive literacy, beginning to overcome.”

More significantly, she makes a distinction between the values a society holds and values that should be held (“positive values” and “real moral values” respectively), in contrast to Morris' arguments that such a distinction is meaningless and the only real distinction is between biological values and the form they take in a given society. Her response manages to pick away at Morris' nonchalant bulldozing of all philosophical subtleties.

Responding to this in the last chapter, Morris quotes, and then dismisses, Ernest Gellner's response to a social theory presentation at an archaeology conference: "They tell me you're a good archaeologist, so why are you trying to be a bad philosopher?". Perhaps he should have taken the question more to heart.

### The future

The experiment of how to switch from foraging to farming was run many times. Forager bands in many places adopted farming techniques. Some of them had good ideas about how to structure their now-farming societies and succeeded, while others had bad ideas and perished, or were forced to copy techniques from the more successful.

In contrast, today the entire world has been thrust into the industrial age in the space of a few hundred years. There is only one experiment going on, and only one chance to get it right. There's no one to copy from to see what we should do, and no one to pick up the job if our attempt fails.

A successful transition to the industrial world, and whatever we might mark as the next step after that, is therefore less certain than the successful transition from foragers to farmers. The values that industrial life imposes on us might be better than those of the farming age, but it is not yet clear if they will become as universal as hierarchies and kings once were.

(Better by which standard? I think humans are similar enough that there is a context-independent universal human ethical framework.)

Morris' arguments also lead to the question of how values might change in the future. Will the set of values that a society tends towards continue to improve as technology and wealth increase, or is the cuddliness of industrial values (compared to farming ones) a fluke?

The significance of Foragers, Farmers, and Fossil Fuels for this question is that we won't necessarily be the ones deciding. Over a span of years or decades, we can maintain our values through argument and education. Over a span of centuries, though, we can argue all we like, just as countless luddites and aristocrats railed against industrial/Western values, but if the game has changed and someone else's values make them play it better, it won't be enough. The harsh logic of evolution-like selection pressures can't be resisted forever; those that are best at spreading themselves into the future will eventually claim it.

Yuval Noah Harari, author of Sapiens, says that once we can engineer desires, the question is not "what do we want to become?", but "what do we want to want?". Morris counters that the real question is instead "what are we going to want, whether we want it or not?", and his answer is bleak yet pragmatic: "each age gets the thought it needs" ("needs" referring to "survival needs").

I don't think we need to be either nihilistic (in thinking that every set of societal values is as good as any other; some do a better job of serving universal human wants) or pessimistic (in thinking that we can't do anything about a slide to worse values; we've never had more control over the future of our world).

Morris writes:

“Trying to imagine people who are somehow divorced from the demands of capturing energy and then speculating about what their moral values would be is an odd activity.”

I disagree. Of course we can imagine people living without being constrained by energy needs. How many science fiction writers or futurists haven't imagined a post-scarcity society?

In fact, aren't we well on our way towards such a world? Forager and farmer lives were significantly shaped by the need to get food, water, light, and warmth. Today in developed countries, these aren't free, but our lives aren't shaped by worrying about them. Sure, you need to work a job, but what you worry about in the job is likely very far separated from survival needs, and provided you have one and aren't massively wasteful, the water and light flow exactly as you want them. Technological progress removes difficulty and scarcity. Ultimately, there's no physical limit stopping us from removing scarcity considerations from our lives (or, more precisely, making them trivial enough that we don't need to worry about them; nothing is ever entirely free in this universe).

Once we've done so, we no longer have to make compromises between what we should do and what we as a society are forced to value in order to survive. And so I think it is reasonable to imagine humans whose values aren't warped by survival needs; in fact such values might be good ones to aim for.

(Or maybe the need to focus at least a bit on survival is the one anchor to objective reality that prevents societies from losing themselves entirely to petty politicking and status games.)

Of course, there's always the problem of competition. What happens to our happy post-scarcity society when the people next door ratchet up the competition, say by throwing off all the safeguards around capitalism, or developing AIs or nanomachines or Robin Hanson's emulated minds, and then outcompeting us by adopting values more suitable to exploiting those technologies? Even if we ourselves don't suffer – say we have a big enough wall – in the long run we'd give up the rest of the world (or solar system or galaxy) to the pragmatic-valued competitors. At best, the long-term future looks like an oasis of human flourishing, surrounded by a galaxy-spanning alien economy with weird but morally neutral ways. (Imagine a forager tribe considering the massive and weird industrialised world around them; now imagine we're the foragers.) At worst, any good in our oasis would be outweighed by the morally bad machinations that fuel the endless growth of that weird galaxy-spanning alien economy.

So will we be forced to compromise ever more and more to avoid being outrun by those with fewer scruples about changing their values? Or can we build a world where human values are a winning strategy?

Looking at our track record, I think we have a chance.

Related:
Growth and civilisation