Strata of the World: Information theory 3: channel coding<p style="text-align: center;"><span style="font-size: x-small;">7.9k words, including equations (~41 minutes)</span> <br /></p><p> </p><p>We've looked at basic information theory concepts <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">here</a>, and at source coding (i.e. compressing data without caring about noise) <a href="https://www.strataoftheworld.com/2022/06/information-theory-2-source-coding.html">here</a>. Now we turn to channel coding.</p><p>The purpose of channel coding is to make information robust against any possible noise in the channel.</p><h2 id="noisy-channel-model">Noisy channel model</h2><p>The noisy channel model looks like the following:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVptKjd3xnPIq2F_8kfByNH3C96QL3mz0C3z-bXyTNEECMZHNxQwqHRusw6Mw5jxrNbT9k9L6OC8qFQuYkLr72mSoJiti9072A9B_HT6twHNku1gxJFIJ45WcEtJy7WuNMcr4MQNVZ7gi_KzuscQq9kcsTKQnbs9oAKN0oViBImC74qaxLxB273U_log/s1094/ArcoLinux_2022-06-25_19-19-53.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="239" data-original-width="1094" height="140" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVptKjd3xnPIq2F_8kfByNH3C96QL3mz0C3z-bXyTNEECMZHNxQwqHRusw6Mw5jxrNbT9k9L6OC8qFQuYkLr72mSoJiti9072A9B_HT6twHNku1gxJFIJ45WcEtJy7WuNMcr4MQNVZ7gi_KzuscQq9kcsTKQnbs9oAKN0oViBImC74qaxLxB273U_log/w640-h140/ArcoLinux_2022-06-25_19-19-53.png" width="640" /></a></div><p>The channel can be anything: electronic signals sent down a wire, messages sent by post, or the
passage of time. What's important is that it is discrete (we will look at the continuous case later), and there are some transition probabilities from every symbol that can go into the channel to every symbol that can come out. Often, the set of symbols of the inputs is the same as the set of symbols of the outputs.</p><p>The capacity $$C$$ of a noisy channel is defined as $$$ C = \max_{p_X} I(X;Y) = \max_{p_X} \big(H(Y) - H(Y|X)\big). $$$ It's intuitive that this definition involves the mutual information $$I$$ (see <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">the first post for the definition and explanation</a>), since we care about how much information $$X$$ transfers to $$Y$$, and how much $$Y$$ tells us about $$X$$. What might be less obvious is why we take the maximum over possible input probability distributions $$p_X$$. This is because the mutual information $$I(X;Y)$$ depends on the probability distributions of $$X$$ and $$Y$$. We can only control what we send - $$X$$ - so we want to adjust that to maximise the mutual information. Intuitively, if you're typing on a keyboard with all keys working normally except the "i" key results in a random character being inserted, shifting your typing away from using the "i" key is good for information transfer. Better to wr1te l1ke th1s than to not be able to reliably transfer information.</p><p>However, the only real way to understand why this definition makes sense is to look at the noisy channel coding theorem. This theorem tells us, among other things, that for any rate (measured in bits per symbol) smaller than the capacity $$C$$, for a large enough code length we can get a probability of error as small as we like.</p><p>With noisy channels, we often work with <i>block codes</i>. The idea is that you encode some shorter sequence of bits as a longer sequence of bits, and if you've designed this well, it adds redundancy.
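As a quick sanity check of the capacity definition, here's a small numerical sketch (the grid search and function names are mine, not from any library) that finds the capacity of a binary symmetric channel with crossover probability 0.1 by maximising $$I(X;Y) = H(Y) - H(Y|X)$$ over the input distribution:

```python
import numpy as np

def h2(p):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def bsc_capacity(pe, grid=10001):
    """Grid-search C = max over p_X of I(X;Y) for a binary symmetric
    channel with crossover probability pe, where q = P(X=1)."""
    q = np.linspace(0, 1, grid)
    p_y1 = q * (1 - pe) + (1 - q) * pe   # P(Y=1) for each candidate q
    mutual_info = h2(p_y1) - h2(pe)      # I(X;Y) = H(Y) - H(Y|X)
    best = np.argmax(mutual_info)
    return q[best], mutual_info[best]

q_star, C = bsc_capacity(0.1)
# The maximum lands at the uniform input q = 0.5, giving C = 1 - h2(0.1),
# about 0.53 bits per symbol.
```

For this particular channel the answer is known in closed form ($$C = 1 - H_2(p_e)$$), so the grid search is overkill - but the same loop works for any transition matrix where no closed form is handy.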
An $$(n,k)$$ block code is one that replaces chunks of $$k$$ bits with chunks of $$n$$ bits.</p><h2 id="hamming-coding">Hamming coding</h2><p>Before we look at the noisy channel theorem, here's a simple code that is robust to errors: transmit every bit 3 times. Instead of sending 010, send 000111000. If the receiver receives 010111000, they can tell that bit 2 probably had an error, and should be a zero. The problem is that you triple your message length.</p><p>Hamming codes are a method for achieving the same - the ability to detect and correct single-bit errors, and the ability to detect but not properly correct two-bit errors - while sending a number of excess bits that grows only logarithmically with message length. For long enough messages, this is very efficient; if you're sending over 250 bits, it only costs you a 3% longer message to insure them against single-bit errors.</p><p>The catch is that the probability of having only one or fewer errors in a message declines exponentially with message length, so this is less impressive than it might sound at first.</p><p>The basic idea of most error correction codes is a parity bit. A parity bit $$b$$ is typically the XOR (exclusive-or) of a bunch of other bits $$b_1, b_2, \ldots$$, written $$b = b_1 + b_2 + \ldots$$ (we use $$+$$ for XOR because doing addition in base-2 while throwing away the carry is the same as taking the XOR). A parity bit over a set of bits $$B = \{b_1, b_2, \ldots\}$$ is 1 if the set of bits contains an odd number of 1s, and otherwise 0 (hence the word "parity").</p><p>Consider sending a 3-bit message where the first two bits are data and the third is a parity bit. If the message is 110, we check that, indeed, there's an even number of 1s among the data bits, so it checks out that the parity bit is 0.
If the message were 111, we'd know that something had gone wrong (though we wouldn't be able to fix it, since it could have started out with any of 011, 101, or 110 and suffered a one-bit flip - and note that we can never entirely rule out that 000 flipped to 111, though since error probability is generally small in any case we're interested in, this would be extremely unlikely).</p><p>The efficiency of Hamming codes comes from the fact that we have parity bits that check other parity bits.</p><p>A $$(T, D)$$ Hamming code is one that sends $$T$$ bits in total of which $$D$$ are data bits and the remaining $$T - D$$ are parity bits. There exists a $$(2^m - 1, 2^m - m - 1)$$ Hamming code for positive integer $$m$$. Note that $$m$$ is the number of parity bits.</p><p>The default way to construct a Hamming code is that the $$m$$th parity bit is in position $$2^{m-1}$$, and is set such that the parity of bits whose position's binary representation has a 1 in the $$m$$th-last position is zero.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9rZ67sMsoFtVbSeQGWwqYtMO6tRF2T5cEw5RsqrtKog1ZTH4gbMk8QT79EeOQPwVTwaIXX3-9795wAGNmTrqH4v9lN4poBgxkbrod7stbG3-BTEHiNbslFs1Zje-6ox1_5kn0G9Wq3e3pu9dV5tOG2JaTjZs2asrT0ju_Ee5RkcnE7edyM7pQVIw-sw/s1033/ArcoLinux_2022-06-25_18-47-44.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="364" data-original-width="1033" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9rZ67sMsoFtVbSeQGWwqYtMO6tRF2T5cEw5RsqrtKog1ZTH4gbMk8QT79EeOQPwVTwaIXX3-9795wAGNmTrqH4v9lN4poBgxkbrod7stbG3-BTEHiNbslFs1Zje-6ox1_5kn0G9Wq3e3pu9dV5tOG2JaTjZs2asrT0ju_Ee5RkcnE7edyM7pQVIw-sw/w640-h226/ArcoLinux_2022-06-25_18-47-44.png" width="640" /></a></div><p>(Above, you see bits 1 through 15, with parity bits in positions 1, 2, 4, and 8.
Underneath each bit, for every parity bit there is a 0 if that bit is not included in the parity set of that parity bit, and otherwise a 1. For example, since <code>b4</code>'s parity set is bits 8-15, <code>b4</code> is set to 1 if there's an odd number of 1s in bits 9-15 inclusive and otherwise 0, so that the parity over the whole set is zero. Note that the columns spell out the numbers 1 through 15 in binary.)</p><p>For example, a $$(7,4)$$ Hamming code for the 4 bits of data 0101 would first become $$$ \texttt{ b1 b2 0 b3 1 0 1} $$$ and then we'd set $$b_1 = 0$$ to make there be an even number of 1s across the 1st, 3rd, 5th, and 7th positions, set $$b_2 = 1$$ to do the same over the 2nd, 3rd, 6th, and 7th positions, and then finally set $$b_3 = 0$$ to do the same over the 4th, 5th, 6th, and 7th positions.</p><p>To correct errors, we have the following rule: sum up the positions of the parity bits that do not match. For example, if parity bit 3 is set wrong relative to the rest of the message, you flip that bit; everything will be fine after we clear this false alarm. But if parity bit 2 is also set wrong, then you take their positions, 2 (for parity bit 2) and 4 (for parity bit 3), and add them to get 6, and flip the sixth bit to correct the error. This makes sense because the sixth bit is the only bit covered by both parity bits 2 and 3, and only parity bits 2 and 3.</p><p>Though the above scheme is elegant and extensible, it's possible to design other Hamming codes. The length requirements remain - the code is a $$(2^m - 1, 2^m - m - 1)$$ code if we allow $$m$$ parity bits - but we can assign any "domain" over the bits to each parity bit as long as each bit belongs to the domains of a unique set of parity bits.</p><h2 id="noisy-channel-coding-theorem">Noisy channel coding theorem</h2><p>We can measure any noisy channel code we choose based on two numbers. The first is its probability of error (call it $$p_e$$). The second is its rate: how many bits of information are transferred for each symbol sent.
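To make the mechanics concrete, here is a minimal sketch of a $$(7,4)$$ encoder and decoder (function names are mine; the bit layout b1 b2 d1 b3 d2 d3 d4 is the one from the worked example above):

```python
def hamming74_encode(d):
    """Encode 4 data bits as 7 bits, with parity bits at (1-indexed)
    positions 1, 2, 4, i.e. the layout b1 b2 d1 b3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4              # even parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4              # even parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4              # even parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(r):
    """Correct at most one flipped bit, then return the 4 data bits."""
    s1 = r[0] ^ r[2] ^ r[4] ^ r[6]   # parity check over positions 1, 3, 5, 7
    s2 = r[1] ^ r[2] ^ r[5] ^ r[6]   # parity check over positions 2, 3, 6, 7
    s3 = r[3] ^ r[4] ^ r[5] ^ r[6]   # parity check over positions 4, 5, 6, 7
    # Summing the positions of the failed parity checks gives the
    # (1-indexed) position of the flipped bit, or 0 if none failed.
    syndrome = s1 + 2 * s2 + 4 * s3
    r = list(r)
    if syndrome:
        r[syndrome - 1] ^= 1
    return [r[2], r[4], r[5], r[6]]
```

Encoding 0101 gives the codeword 0100101 from the example, and flipping any single bit of it still decodes back to 0101.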
The three parts of the theorem combine to divide that space up into a possible and impossible region:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaR-atWmnPttu0ZXFtS2-0y3wxiPkw0DmZcP4S1U9KLhuz7Iw7SGCn_NNggZFpNKc5OBFkFL7eB29jIB3GXy7kMFVOncmVp1tTNafSdOGgDvYpf-GoOaMTDyjA5k0-RmbiwMeRitQJAR9IYWAqejEtnBXrtC1a-6a6gxzQr-JgyqsERmXXvPI-rhpMQ/s676/ArcoLinux_2022-06-25_18-49-59.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="483" data-original-width="676" height="458" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaR-atWmnPttu0ZXFtS2-0y3wxiPkw0DmZcP4S1U9KLhuz7Iw7SGCn_NNggZFpNKc5OBFkFL7eB29jIB3GXy7kMFVOncmVp1tTNafSdOGgDvYpf-GoOaMTDyjA5k0-RmbiwMeRitQJAR9IYWAqejEtnBXrtC1a-6a6gxzQr-JgyqsERmXXvPI-rhpMQ/w640-h458/ArcoLinux_2022-06-25_18-49-59.png" width="640" /></a></div><p>The first part of the theorem says that the region marked "I" is possible. Now there are points of this region that are more interesting than others. Yes, we can make a code that has a rate of 0 and a very high error rate; just send the same symbol all the time. This is point (a), and we don't care about it.</p><p>What's more interesting, and perhaps not even intuitively obvious at all, is that we can get to a point (b): an arbitrarily low error rate, despite the fact that we're sending information. The maximum information rate we can achieve while keeping the error probability very low turns out to be the capacity, $$C = \max_{p_X} I(X;Y)$$.</p><p>The second part of the theorem gives us a lower bound on error rate if we dare try for a rate that is greater than the capacity.
It tells us we can make codes that achieve point (c) on the graph.</p><p>Finally, the third part of the theorem proves that we can't get to points like (x), which have an error rate that is too low given how much over the channel capacity their rate is.</p><p>We started the <a href="https://www.strataoftheworld.com/2022/06/information-theory-2-source-coding.html">proof of the source coding theorem</a> by considering a simple construction (the $$\delta$$-sufficient subset) first for a single character and then extending it to blocks. We're going to do something similar now.</p><h3 id="noisy-typewriters">Noisy typewriters</h3><p>A noisy typewriter over the alphabet $$\{0, \ldots, n-1\}$$ is a device where if you press the key for $$i$$, it outputs one of the following with equal probability:</p><ul><li>$$i - 1 \mod n$$</li><li>$$i \mod n$$ </li><li>$$i + 1 \mod n$$</li></ul><p>With a 6-symbol alphabet, we can illustrate its transition probability matrix as a heatmap:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwc4Mgk4fp9WF6Aal37OpOl_PgfcM8JwarOSEoWOaoaXR9xF_lG3b0Y6jCjZDFu6eaYZDX2FgVxiM-8Mg0xwNParK7KK50qDPhRd4swroReHON_8C1myzfe4Xobx7RHBMipN-SAHqVoDtjB32Nd8HW1wADPLJFUGJgAq624SeD-H4A5w69T0qQwjVQag/s1107/ArcoLinux_2022-06-25_18-52-30.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1065" data-original-width="1107" height="385" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwc4Mgk4fp9WF6Aal37OpOl_PgfcM8JwarOSEoWOaoaXR9xF_lG3b0Y6jCjZDFu6eaYZDX2FgVxiM-8Mg0xwNParK7KK50qDPhRd4swroReHON_8C1myzfe4Xobx7RHBMipN-SAHqVoDtjB32Nd8HW1wADPLJFUGJgAq624SeD-H4A5w69T0qQwjVQag/w400-h385/ArcoLinux_2022-06-25_18-52-30.png" width="400" /></a></div><p>The colour scale is blue (low) to yellow (high).
The reading order is meant to be that each column represents the probability distribution of output symbols given an input symbol.</p><p>First, can we transmit information without error at all? Yes: choose a code where you only send the symbols corresponding to the second and fifth columns. Based on the heatmap, these can map to symbols number 1-3 and 4-6 respectively; there is no possibility of confusion. The cost is that instead of being able to send one of six symbols, or $$\log 6$$ bits of information per symbol, we can now only send one of two, or $$\log 2 = 1$$ bit of information per symbol.</p><p>The capacity is $$\max_{p_X} \big( H(Y) - H(Y|X) \big)$$. Now if $$p_X$$ is the distribution we considered above - assigning half the probability to 2 and half to 5 - then by the transition matrix we see that $$Y$$ will be uniformly distributed, so $$H(Y) = \log 6$$. $$H(Y|X)$$ is $$\log 3$$ in our example code, because we see that if we always send either symbol 2 or 5, then in both cases $$Y$$ is restricted to a set of 3 values. With some more work you can show that this is in fact an optimal choice of $$p_X$$. The capacity turns out to be $$\log 6 - \log 3 = \log 2$$ bits. The error probability is zero. We see that we can indeed transfer information without error even if we have a noisy channel.</p><p>But hold on, the noisy typewriter has a very specific type of error: there's an absolute certainty that if we transmit a 2 we can't get symbols 3-6 out, and so on. Intuitively, here we can partition the space of channel outputs such that there is no overlap in the sets of channel inputs that each channel output could have come from. It seems like with a messier transition matrix that doesn't have this nice property, this just isn't true.
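A tiny simulation of this zero-error scheme (a sketch; symbols are numbered 0-5 here, so the two codewords used are symbols 1 and 4):

```python
import random

def typewriter_channel(i, n=6):
    """Noisy typewriter: input symbol i comes out as i-1, i, or i+1
    (mod n), each with equal probability."""
    return (i + random.choice([-1, 0, 1])) % n

# Zero-error code: send symbol 1 for bit 0 and symbol 4 for bit 1.
# Symbol 1 can only come out as 0, 1, or 2, and symbol 4 only as
# 3, 4, or 5, so the two sets of possible outputs never overlap.
encode = {0: 1, 1: 4}

def decode(y):
    return 0 if y in (0, 1, 2) else 1

bits = [random.randint(0, 1) for _ in range(1000)]
decoded = [decode(typewriter_channel(encode[b])) for b in bits]
assert decoded == bits  # every bit survives the noisy channel
```

One bit per channel symbol, which is exactly the $$\log 2 = 1$$ bit rate computed above.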
For example, what if we have a binary symmetric channel, with a transition matrix like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVw3TO5N1kZxnd4FvUVlFlix9C5meSAM-tqAk1E0CdfdZwJtcuEg0gucDW3Fhy34Ho4y7UXyJ_Qb8RjHSAm9TyW8wIH75eaOC5CSbhaqoxKLBSGOcUQpoy6fllPcbjufiPXJ2MX3cYWCKKEAWRuUFJQ3O7OHun6t_kHgJPxuEFPQzUXnqQ7J24eFZLeg/s1129/ArcoLinux_2022-06-25_18-54-33.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1059" data-original-width="1129" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVw3TO5N1kZxnd4FvUVlFlix9C5meSAM-tqAk1E0CdfdZwJtcuEg0gucDW3Fhy34Ho4y7UXyJ_Qb8RjHSAm9TyW8wIH75eaOC5CSbhaqoxKLBSGOcUQpoy6fllPcbjufiPXJ2MX3cYWCKKEAWRuUFJQ3O7OHun6t_kHgJPxuEFPQzUXnqQ7J24eFZLeg/s320/ArcoLinux_2022-06-25_18-54-33.png" width="320" /></a></div><p>Unfortunately the blue = lowest, yellow = highest color scheme is not very informative; the transition matrix looks like this, where $$p_e$$ is the probability of error: $$$ \begin{bmatrix} 1 - p_e & p_e \\ p_e & 1 - p_e \end{bmatrix} $$$ Here nothing is certain: a 0 can become a 1, and a 1 can become a zero.</p><p>However, this is what we get if we use this transition probability matrix on every symbol in a string of length 4, with the strings going in the order 0000, 0001, 0010, 0011, ..., 1111 along both the top and left side of the matrix:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7BcpxB7n2poMYg7-ptGda6MU4emL-BJSL83_TWfZYwbnaTdJQ0uB29zZwxjHfrieOKPatRc-Sry9P9QbWzBfo-zah1-LFd93v1KZOoaEfobiS_Pq4yiJXE-XoTVdJ01jXMGhhHHV-RSFwREWf0I86nJcxi-4Y7WeJOOVF9bYaBCm_sCuUnW1ugeIdAQ/s1024/ArcoLinux_2022-06-25_18-56-35.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1022" data-original-width="1024" height="399"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7BcpxB7n2poMYg7-ptGda6MU4emL-BJSL83_TWfZYwbnaTdJQ0uB29zZwxjHfrieOKPatRc-Sry9P9QbWzBfo-zah1-LFd93v1KZOoaEfobiS_Pq4yiJXE-XoTVdJ01jXMGhhHHV-RSFwREWf0I86nJcxi-4Y7WeJOOVF9bYaBCm_sCuUnW1ugeIdAQ/w400-h399/ArcoLinux_2022-06-25_18-56-35.png" width="400" /></a></div><p>For example, the second column shows the probabilities (blue = low, yellow = high) for what you get in the output channel if 0001 is sent as a message. The highest value is for the second entry, 0001, because we have $$p_e < 0.5$$ so $$p_e < 1 - p_e$$ so the single likeliest outcome is for no changes, which has probability $$(1-p_e)^4$$. The second highest values are for the first (0000), fourth (0011), sixth (0101), and tenth (1001) entries, since these all involve one flip and have probability $$p_e (1-p_e)^3$$ individually and probability $${4 \choose 1} p_e (1-p_e)^3 = 4 p_e (1 - p_e)^3$$ together.</p><p>If we dial up the message length, the pattern becomes clearer; here's the equivalent diagram for messages of length 8:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjClJ6kMdUI2632p44ktR3yiQ-2mhzmRTNUohwUx1EBmTT4LNpzpIgtMgOoY3GgOLhs4IOdocUJbpz7Ep-Dm_0kZLATn_O_haiYViwEOD9JlhbYjv2jI7qZnWvesb0-el-eip6h42z47ALSveLDWhrglzQnMGBBNQ3Zp7wEWoAmbDwVnAD90tQ-CcSWug/s1022/ArcoLinux_2022-06-25_18-57-06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1020" data-original-width="1022" height="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjClJ6kMdUI2632p44ktR3yiQ-2mhzmRTNUohwUx1EBmTT4LNpzpIgtMgOoY3GgOLhs4IOdocUJbpz7Ep-Dm_0kZLATn_O_haiYViwEOD9JlhbYjv2jI7qZnWvesb0-el-eip6h42z47ALSveLDWhrglzQnMGBBNQ3Zp7wEWoAmbDwVnAD90tQ-CcSWug/w640-h638/ArcoLinux_2022-06-25_18-57-06.png" width="640" /></a></div><h3 id="the-return-of-the-typical-set">The Return of the Typical Set</h3><p>There are two key points.</p><p>The
first is that more and more of the probability is concentrated along the diagonal (plus some other diagonals further from the main diagonal). We can technically have any transformation - even 11111111 to 00000000 - when we send a message through the channel, but most of these transformations are extremely unlikely. The transition matrix starts looking more and more like the noisy typewriter, where for each message only one subset of received messages has non-tiny likelihood.</p><p>The second key point is that it is time for ... the <i>return of the typical set</i>. Recall from the <a href="https://www.strataoftheworld.com/2022/06/information-theory-2-source-coding.html">second post in this series</a> that the $$\epsilon$$-typical set of length-$$n$$ strings over an alphabet $$A$$ is defined as $$$ T_{n\epsilon} = \left\{x^n \in A^n \text{ such that } \left|-\frac{1}{n} \log p(x^n) - H(X)\right| \le \epsilon\right\}. $$$ $$-\frac{1}{n} \log p(x^n)$$ is equal to $$-\frac{1}{n} \sum_{i=1}^n \log p(x_i)$$ by independence, and this in turn is an estimator for $$\mathbb{E}[-\log p(X)] = H(X)$$.
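We can check this estimator numerically - a sketch with a made-up three-symbol source whose true entropy is 1.5 bits:

```python
import math
import random

# A made-up source distribution with H(X) = 1.5 bits.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
H = -sum(q * math.log2(q) for q in p.values())

def empirical_entropy(n):
    """Draw x^n i.i.d. from p and return -(1/n) log2 p(x^n)."""
    xs = random.choices(list(p), weights=list(p.values()), k=n)
    return -sum(math.log2(p[x]) for x in xs) / n

# For small n this fluctuates a lot; for large n it is almost always
# within a small epsilon of H(X) = 1.5 - i.e. x^n lands in the typical set.
```
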
You can therefore read $$-\frac{1}{n}\log p(x^n)$$ as the "empirical entropy"; it's what we'd guess the (per-symbol) entropy of $$X$$ to be if we did a slightly weird thing of estimating the entropy while knowing the probability model but only using it to determine the information content $$-\log p$$, and estimating the $$p_i$$s in $$-\sum_i p_i \log p_i$$ instead by only using how often they occur in $$x^n$$ (rather than the probability model).</p><p>Now the big result about typical sets was that as $$n \to \infty$$, the probability $$P(x^n \in T_{n \epsilon}) \to 1$$ for $$x^n \sim X^n$$, and therefore for large $$n$$, most of the probability mass is concentrated in the approximately $$2^{nH(X)}$$ strings of probability approximately $$2^{-nH(X)}$$ that lie in the typical set.</p><p>We can define a similar notion of jointly $$\epsilon$$-typical sets, denoted $$J_{n\epsilon}$$ and defined by analogy with $$T_{n\epsilon}$$ as $$$ J_{n\epsilon} = \left\{ (x^n, y^n) \in A^n \times A^n \text{ such that } \left| - \frac{1}{n} \log p(x^n) - H(X)\right| \le \epsilon, \; \left| - \frac{1}{n} \log p(y^n) - H(Y)\right| \le \epsilon, \text{ and } \left| - \frac{1}{n} \log P(x^n, y^n) - H(X, Y)\right| \le \epsilon \right\}. $$$ (Note the two extra marginal conditions: each sequence must also be typical on its own. They are what lets us bound the probabilities $$p(x^n)$$ and $$p(y^n)$$ of jointly typical pairs.) Like typical sets, jointly typical sets give us nice properties:</p><ol><li><p>If $$x^n, y^n$$ are drawn from the joint distribution (e.g. you first draw an $$x^n$$, then apply the transition matrix probabilities to generate a $$y^n$$ based on it), then the probability that $$(x^n, y^n) \in J_{n \epsilon}$$ goes to 1 as $$n \to \infty$$. The proof is almost the same as the corresponding proof for typical sets (hint: law of large numbers).</p></li><li><p>The number $$|J_{n\epsilon}|$$ of jointly typical sequence pairs $$(x^n, y^n)$$ is about $$2^{nH(X,Y)}$$, and specifically is upper-bounded by $$2^{n(H(X,Y) + \epsilon)}$$. The proof is the same as for the typical set case.</p></li><li><p>If $$x^n$$ and $$y^n$$ are <i>independently drawn</i> from the distributions $$p_X$$ and $$p_Y$$, the probability that they are jointly typical is about $$2^{-nI(X;Y)}$$.
The specific upper bound is $$2^{-n(I(X;Y) - 3 \epsilon)}$$, and can be shown straightforwardly (remembering some of the identities in <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">post 1</a>) from $$$ P((x^n, y^n) \in J_{n \epsilon}) = \sum_{(x^n, y^n) \in J_{n\epsilon}} p(x^n) p(y^n)$$$ $$$\le |J_{n\epsilon}| 2^{-n(H(X) - \epsilon)} 2^{-n(H(Y) - \epsilon)}$$$ $$$ \le 2^{n(H(X,Y) + \epsilon)} 2^{-n(H(X) - \epsilon)} 2^{-n(H(Y) - \epsilon)}$$$ $$$= 2^{n(H(X,Y) - H(X) - H(Y) + 3 \epsilon)}$$$ $$$= 2^{-n(I(X;Y) - 3 \epsilon)} $$$</p></li></ol><p>Armed with this definition, we can now interpret what was happening in the diagrams above: as we increase the length of the messages, more and more of the probability mass is concentrated in jointly typical sequences, by the first property above. The third property tells us that if we ignore the dependence between $$x^n$$ and $$y^n$$ - picking a square roughly at random in the diagrams above - we are, however, extremely unlikely to pick a square corresponding to a jointly typical pair.</p><p>Here is the noisy typewriter for 6 symbols, for length-4 messages coming in and out of the channel:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghBh70zAlC9if1MiSrMZk9WRLa4SV5wbv9jyoc9ZeoTNb4r1lBPo_B7Usu_QFsRIAt53ktS-ep3_LJjvTs3fUWt9Ztcow4xfxo6sLFjj_oiT6HT_2imaW2s-FcjgIFL3SN5gXB4Gwvsch87akGUI9ipQbAqYrvdDXHXL07_iS_mUCmn0qWlbciAFjifA/s1030/ArcoLinux_2022-06-25_18-59-15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1030" data-original-width="1027" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghBh70zAlC9if1MiSrMZk9WRLa4SV5wbv9jyoc9ZeoTNb4r1lBPo_B7Usu_QFsRIAt53ktS-ep3_LJjvTs3fUWt9Ztcow4xfxo6sLFjj_oiT6HT_2imaW2s-FcjgIFL3SN5gXB4Gwvsch87akGUI9ipQbAqYrvdDXHXL07_iS_mUCmn0qWlbciAFjifA/w638-h640/ArcoLinux_2022-06-25_18-59-15.png" width="638"
/></a></div><p>(As a reminder of the interpretation: each column represents the probability distribution, shaded blue to yellow, for one input message, and the $$6^4 = 1296$$ possible messages we have with this message length (4) and alphabet size (6) are ranked in alphabetical order along both the top and left side of the grid.)</p><p>The highest probability is still yellow, but you can barely see it. Most of the probability mass is in the medium-probability sequences (our jointly typical set), forming a small subset of the possible channel outputs for each input.</p><p>In the limit, therefore, the block transition probability matrix built from an arbitrary symbol transition probability matrix looks a lot like the noisy typewriter. This suggests a decoding method: if we see $$y^n$$, we decode it as $$x^n$$ if $$(x^n, y^n)$$ are in the jointly typical set, and there is no other $${x'}^n$$ such that $$({x'}^n, y^n)$$ are also jointly typical. As with the noisy typewriter example, we have to discard a lot of the $$x^n$$, so that the set of $$x^n$$ that a given $$y^n$$ could've come from hopefully contains only a single element, so we match the second condition in the decoding rule.</p><h3 id="theorem-outline">Theorem outline</h3><p>Now we will state the exact form of the noisy channel coding theorem. It has three parts:</p><ol><li><p>A discrete memoryless channel has a non-negative capacity $$C$$ such that for any $$\varepsilon > 0$$ and $$R < C$$, for large enough $$n$$ there's a block code of length $$n$$ and rate $$\geq R$$ and a decoder such that error probability is $$< \varepsilon$$.</p><p>We will see that this follows from the points about jointly typical sets and the decoding scheme based on them that we discussed above. The only thing really missing is an argument that the error rate of jointly typical coding can be made arbitrarily low as long as $$R < C$$.
We will see that Shannon used perhaps the most insane trick in all of 20th century applied maths to side-step having to actually think of a specific code to prove this.</p></li><li><p>If error probability per bit $$p_e$$ is acceptable, rates up to $$$ R(p_e) = \frac{C}{1 - H_2(p_e)} $$$ are possible. We will prove this by bolting a lossy compressor, which is allowed a per-bit error rate of $$p_e$$, onto a capacity-achieving code.</p></li><li><p>For any $$p_e$$, rates $$> R(p_e)$$ are not possible.</p></li></ol><p>As we saw earlier, these three parts together divide up the space of possible rate-and-error combinations for codes into three regions: </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaR-atWmnPttu0ZXFtS2-0y3wxiPkw0DmZcP4S1U9KLhuz7Iw7SGCn_NNggZFpNKc5OBFkFL7eB29jIB3GXy7kMFVOncmVp1tTNafSdOGgDvYpf-GoOaMTDyjA5k0-RmbiwMeRitQJAR9IYWAqejEtnBXrtC1a-6a6gxzQr-JgyqsERmXXvPI-rhpMQ/s676/ArcoLinux_2022-06-25_18-49-59.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="483" data-original-width="676" height="458" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaR-atWmnPttu0ZXFtS2-0y3wxiPkw0DmZcP4S1U9KLhuz7Iw7SGCn_NNggZFpNKc5OBFkFL7eB29jIB3GXy7kMFVOncmVp1tTNafSdOGgDvYpf-GoOaMTDyjA5k0-RmbiwMeRitQJAR9IYWAqejEtnBXrtC1a-6a6gxzQr-JgyqsERmXXvPI-rhpMQ/w640-h458/ArcoLinux_2022-06-25_18-49-59.png" width="640" /></a></div><h3 id="proof-of-part-i-turning-noisy-channels-noiseless">Proof of Part I: turning noisy channels noiseless</h3><p>We want to prove that we can get an arbitrarily low error rate if the rate (bits of information per symbol) is smaller than the channel capacity, which we've defined as $$C = \max_{p_X} I(X;Y)$$.</p><p>We could do this by thinking up a code and then calculating the probability of error per length-$$n$$ block for it.
This is hard though.</p><p>Here's what Shannon did instead: he started by considering a random block code, and then proved stuff about its average error.</p><p>What do we mean by a "random block code"? Recall that an $$(n,k)$$ block code is one that encodes length-$$k$$ messages as length-$$n$$ messages. Since the rate $$r = \frac{k}{n}$$, we can talk about $$(n, nr)$$ block codes.</p><p>What the encoder is doing is mapping length-$$k$$ strings to length-$$n$$ strings. In the general case, it has some lookup table, with $$2^k = 2^{nr}$$ entries, each of length $$n$$. A "random code" means that we generate the entries of this lookup table from the distribution $$P(x^n) = \prod_{i=1}^n p(x_i)$$. We will refer to the encoder as $$E$$.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHLyOUVQSZXYKQMKosrwxek9NEqDIwpKXBiwIkB1rbKnacw8EjbCGc9mOp-C6c9U7wgb-w62IkzI3O64FKyqUlsyRPb9Asb7aJ3nvzUZF_-Ga6G65GV4iuYOmdl6xRbRhg5Nn8ilbCRrTitQ2O2BuwbWHDlPef24B1IwnbOIq08oAF1_q656BvGH3x5g/s905/ArcoLinux_2022-06-25_19-08-10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="445" data-original-width="905" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHLyOUVQSZXYKQMKosrwxek9NEqDIwpKXBiwIkB1rbKnacw8EjbCGc9mOp-C6c9U7wgb-w62IkzI3O64FKyqUlsyRPb9Asb7aJ3nvzUZF_-Ga6G65GV4iuYOmdl6xRbRhg5Nn8ilbCRrTitQ2O2BuwbWHDlPef24B1IwnbOIq08oAF1_q656BvGH3x5g/w640-h314/ArcoLinux_2022-06-25_19-08-10.png" width="640" /></a></div><p>(In the above diagram, the dots in the column represent probabilities of different outputs given the $$x^n$$ that is taken as input.
Different values of $$w^k$$ would be mapped by the encoder to different columns $$x^n$$ in the square.)</p><p>Richard Hamming (yes, the Hamming codes person) mentions this trick in his famous talk <a href="https://www.cs.virginia.edu/~robins/YouAndYourResearch.pdf">"You and Your Research"</a>:</p><blockquote><p><i>Courage is one of the things that Shannon had supremely. You have only to think of his major theorem. He wants to create a method of coding, but he doesn't know what to do so he makes a random code. Then he is stuck. And then he asks the impossible question, "What would the average random code do?" He then proves that the average code is arbitrarily good, and that therefore there must be at least one good code. Who but a man of infinite courage could have dared to think those thoughts?</i></p></blockquote><p>Perhaps it doesn't quite take infinite courage, but it is definitely one hell of a simplifying trick - and the remarkable thing is that it works.</p><p>Here's how: let the average probability of error in decoding one of our blocks be $$\bar{p}_e$$. If we have a message $$w^k$$, the steps that happen are:</p><ol><li>We use the (randomly-constructed) encoder $$E$$ to map it to an $$x^{n}$$ using $$x^n = E(w^k)$$. Note that the set of values that $$E(w^k)$$ can take, $$\text{Range}(E)$$, is a subset of the set of values of all possible $$x^n$$.</li><li>$$x^n$$ passes through the channel to become a $$y^n$$, according to the probabilities in a block transition probability matrix like the ones pictured above.</li><li>We guess that $$y^n$$ came from the $$x'^n \in \text{Range}(E)$$ such that the pair $$(x'^n, y^n)$$ is in the jointly typical set $$J_{n\epsilon}$$.<ol><li>If there isn't such an $$x'^n$$, we fail. In the diagram below, this happens if we get $$y_3$$, since $$\text{Range}(E) = \{x_1, x_2, x_3, x_4\}$$ does not contain anything jointly-typical with $$y_3$$.</li><li>If there is at least one wrong $$x'^n$$, we fail.
In the diagram below, this happens if we get $$y_2$$, since both $$x_2$$ and $$x_3$$ are codewords the encoder might use that are jointly typical with $$y_2$$, so we don't know which one was originally transmitted over the channel.</li></ol></li><li>We use the decoder, which is simply the inverse of the encoder, to map to our guess $$\bar{w}^k$$ of what the original string was. Since $$x'^n \in \text{Range}(E)$$, the inverse of the encoder, $$E^{-1}$$, must be defined at $$x'^n$$. (Note that there is a chance, but a negligibly small one as $$n \to \infty$$, that in our encoder generation process we created the same codeword for two different strings, in which case the decoder can't be deterministic. We can say either: we don't care about this, because the probability of a collision goes to zero, or we can tweak the generation scheme to regenerate if there's a repeat; $$n \ge k$$ so we can always construct a repeat-free encoder.)</li></ol><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhsPj6X9tFdhcyWVSmCNXlAprBCMGK88hFKeTTC257jSp8XFjYf0Fgk-O6YWhEXC0BvG337MCkBQF1KIodnrWmX3iqSSWVGhkBkVUReJUfWg1f4G-6S--2iW5ydJZGHxU5HHo1gVOZUe6iWjDmUSz6sd6ugaISVnCpjWcUswvHq9OK7nusuDlaeb4LGg/s731/ArcoLinux_2022-06-25_19-12-49.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="715" data-original-width="731" height="626" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhsPj6X9tFdhcyWVSmCNXlAprBCMGK88hFKeTTC257jSp8XFjYf0Fgk-O6YWhEXC0BvG337MCkBQF1KIodnrWmX3iqSSWVGhkBkVUReJUfWg1f4G-6S--2iW5ydJZGHxU5HHo1gVOZUe6iWjDmUSz6sd6ugaISVnCpjWcUswvHq9OK7nusuDlaeb4LGg/w640-h626/ArcoLinux_2022-06-25_19-12-49.png" width="640" /></a></div><p>Therefore the two sources of error that we care about are:</p><ul><li><p>On step 3, we get a $$y^n$$ that is not jointly typical with the original $$x^n$$. 
Since $$P\big((x^n, y^n) \in J_{n\epsilon}\big) \geq 1 - \delta$$ for some $$\delta$$ that we can make arbitrarily small by increasing $$n$$, we can upper-bound this error probability with $$\delta$$.</p></li><li><p>On step 3, we get a $$y^n$$ that is jointly typical with at least one wrong $$x'^n$$. We saw above that one of the properties of the jointly typical set is that if $$x^n$$ and $$y^n$$ are selected independently rather than together, the probability that they are jointly typical is only $$2^{-n(I(X;Y) - 3 \epsilon)}$$. Therefore we can upper-bound this error probability by summing the probability of "accidental" joint-typicality over the $$2^k - 1$$ possible messages that are not the original message $$w^k$$. This sum is $$$ \sum_{w'^k \ne w^k} 2^{-n(I(X;Y) - 3 \epsilon)}$$$ $$$\le (2^{k} - 1) 2^{-n(I(X;Y) - 3 \epsilon)}$$$ $$$\le 2^{nr}2^{- n (I(X;Y) - 3 \epsilon)}$$$ $$$= 2^{nr - n(I(X;Y) - 3 \epsilon)} $$$</p></li></ul><p>We have the probabilities of two events, so the probability of at least one of them happening is smaller than or equal to their sum: $$$ \bar{p}_e \le \delta + 2^{nr - n(I(X;Y) - 3 \epsilon)} $$$ We know we can make $$\delta$$ however small we want. We can see that if $$r < I(X;Y) - 3 \epsilon$$, then the exponent is negative and increasing $$n$$ can also make the second term negligible. This is almost Part I of the theorem, which was:</p><blockquote><p>A discrete memoryless channel has a non-negative capacity $$C=\max_{p_X} I(X;Y)$$ such that for any $$\varepsilon > 0$$ and $$R < C$$, for large enough $$n$$ there's a block code of length $$n$$ and rate $$\geq R$$ and a decoder such that error probability is $$< \varepsilon$$.</p></blockquote><p>First, to put a bound involving only one constant on $$\bar{p}_e$$, let's arbitrarily say that we increase $$n$$ until $$2^{nr - n(I(X;Y) - 3 \epsilon)} \le \delta$$.
Then we have $$$ \bar{p}_e \le 2 \delta $$$ Second, we don't care about average error probability over codes, we care about the existence of a single code that's good. We can realise that if the average error probability is $$\le 2 \delta$$, there must exist at least one code, call it $$C^*$$, with average error probability $$\le 2 \delta$$.</p><p>Third, we don't care about average error probability over messages, but maximal error probability, so that we can get the strict $$< \varepsilon$$ error probability in the theorem. This is trickier to bound, since $$C^*$$ might somehow have very low error probability with most messages, but some insane error probability for one particular message.</p><p>However, here again Shannon jumps to the rescue with a bold trick: throw out half the codewords, specifically the ones with highest error probability. Since the average error probability is $$\le 2 \delta$$, every codeword in the best half of codewords must have error probability $$\le 4 \delta$$, because otherwise the worse half of codewords, each with error probability at least as high, would contribute more than $$\frac{1}{2} \times 4 \delta = 2 \delta$$ to the average error on their own.</p><p>What about the effect on our rate of throwing out half the codewords? Previously we had $$2^k = 2^{nr}$$ codewords; after throwing out half we have $$2^{nr - 1}$$, so our rate has gone from $$\frac{k}{n} = r$$ to $$\frac{nr - 1}{n} = r - \frac{1}{n}$$, a negligible decrease if $$n$$ is large.</p><p>What we now have is this: as $$n \to \infty$$, we can get any rate $$R < I(X;Y) - 3 \epsilon$$ with maximal error probability $$\le 4 \delta$$, and both $$\delta$$ and $$\epsilon$$ can be decreased arbitrarily close to zero by increasing $$n$$. Since we can set the distribution of $$X$$ to whatever we like (this is why it matters that we construct our random encoder by sampling from $$X$$ repeatedly), we can make $$I(X;Y) = \underset{p_X}{\max} I(X;Y)$$.</p><p>This is the first and most involved part of the theorem.
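</p><p>Before moving on, we can watch the averaging argument work numerically. The sketch below is illustrative only: it uses minimum-Hamming-distance decoding as a stand-in for joint-typicality decoding, and all names and parameters are my own. It draws one random code for a binary symmetric channel at a rate far below capacity and estimates its block error rate.</p>

```python
import random

def simulate(n=60, k=6, p=0.05, trials=200, seed=0):
    """Estimate the block error rate of one random code over a BSC(p).

    The rate k/n = 0.1 is far below the BSC capacity 1 - H2(0.05) ~ 0.71,
    so decoding errors should be very rare.
    """
    rng = random.Random(seed)
    # Random encoder: each of the 2^k messages gets a random n-bit codeword.
    code = [[rng.randint(0, 1) for _ in range(n)] for _ in range(2 ** k)]
    errors = 0
    for _ in range(trials):
        w = rng.randrange(2 ** k)
        # Pass the codeword through the binary symmetric channel.
        y = [bit ^ (rng.random() < p) for bit in code[w]]
        # Decode to the nearest codeword (a stand-in for typical-set decoding).
        guess = min(range(2 ** k),
                    key=lambda m: sum(a != b for a, b in zip(code[m], y)))
        errors += (guess != w)
    return errors / trials

print(simulate())
```

<p>With these (seeded) parameters the estimated error rate comes out at essentially zero, as the argument predicts.</p><p>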
It is also remarkably lazy: at no point do we have to go and construct an actual code, we just sit in our armchairs and philosophise about the average error probability of random codes.</p><h3 id="proof-of-part-ii-achievable-rates-if-you-accept-non-zero-error">Proof of Part II: achievable rates if you accept non-zero error</h3><p>Here's a simple code that achieves a rate higher than the capacity of a noiseless binary channel:</p><ol><li>The sender maps each length-$$nr$$ block to a block of length $$n$$ by cutting off the last $$nr - n$$ symbols.</li><li>The receiver reads $$n$$ symbols with error probability $$0$$, and then guesses the remaining $$nr - n$$ with bit error probability $$\frac{1}{2}$$ for each symbol. (Note: we're concerned with bit error here, unlike block error in the previous proof.)</li></ol><p>An intuition you should have is that if the probability of anything is concentrated in a small set of outcomes, you're not maximising the entropy (remember: <i>entropy is maximised by a uniform distribution</i>) and therefore also not maximising the information transfer. The above scheme concentrates high probability of error on a small number of bits, while transmitting the rest with zero error - we should be able to do better.</p><p>It's not obvious how we'd start doing this. We're going to take some wisdom from the old proverb about hammers and nails, and note that the main hammer we've developed so far is a proof that we can send through the channel at a negligible error rate by increasing the size of the message. Let's turn this hammer upside down: we're going to use the decoding process to encode and the encoding process to decode. Specifically, to map from length-$$n$$ strings to the smaller length-$$k$$ strings, we use the decoding process from before:</p><ol><li>Given an $$x^n$$ to encode, we find the $$x'^n \in \text{Range}(E)$$ such that the pair $$(x^n, x'^n)$$ is in the jointly typical set $$J_{n\epsilon}$$.
(Jointly typical with respect to what joint distribution? That of length-$$n$$ strings before and after being passed through the channel (here we're assuming that the input and output alphabets are the same). However, note that nothing actually has to pass through a channel for us to use this.)</li><li>We use the inverse of the encoder, $$E^{-1}$$, to map $$x'^n$$ to a length-$$k$$ string $$w^k$$ ($$x'^n \in \text{Range}(E)$$ so this is defined).</li></ol><p>To decode, we use the encoder $$E$$ to get $$\bar{x}^n = E(w^k)$$.</p><p>We'll find the per-bit error rate, not the per-block error rate, so we want to know how many bits are changed on average under this scheme. We're still working with the assumption of a noiseless channel, so we don't need to worry about the noise in the channel, only the error coming from our lossy compression (which is based on a joint probability distribution coming from assuming some channel, however). </p><p>Assume our channel has error probability $$p$$ when transmitting a symbol. Fix an $$x^n$$ and consider pairs $$(x^n, y^n)$$ in the jointly typical set. Most of the $$y^n$$ will differ from $$x^n$$ in approximately $$np$$ bits. Intuitively, this comes from the fact that for a binomial distribution, most of the probability mass is concentrated around the mean at $$np$$, and therefore the typical set contains mostly sequences with a number of errors close to this mean. Therefore, on average we should expect $$np$$ errors between the $$x^n$$ we put into the encoder and the $$x'^n$$ that it spits out. Since we assume no noise, the $$w^k = E^{-1}(x'^n)$$ we send through the channel comes back unchanged, and we can do $$E(w^k) = E(E^{-1}(x'^n)) = x'^n$$ to perfectly recover $$x'^n$$. Therefore the only error is the $$np$$ wrong bits, and therefore our per-bit error rate is $$p$$.</p><p>Assume that, used the right way around, we have a code that can achieve a rate of $$R' = k/n$$.
This rate is $$$ R' = \max_{p_X} I(X;Y) = \max_{p_X} \big[ H(Y) - H(Y|X) \big]$$$ $$$= 1 - H_2(p) $$$ assuming a binary code and a binary symmetric channel, and where $$H_2(p)$$ is the entropy of a two-outcome random variable with probability $$p$$ of the first outcome, or $$$ H_2(p) = - p \log p - (1 - p) \log (1 - p). $$$ Now since we're using it backward, we map from $$n$$ to $$k$$ bits rather than $$k$$ to $$n$$ bits, and this code has rate $$$ \frac{1}{R'} = \frac{n}{k} = \frac{1}{1 - H_2(p)} $$$ What we can now do is make a code that works like the following:</p><ol><li>Take a length-$$n$$ block of input.</li><li>Use the compressor (i.e. the typical set decoder) to map it to a smaller length-$$k$$ block.</li><li>Use some noiseless channel code with capacity $$C$$.</li><li>Use the decompressor (i.e. the typical set encoder) to map the recovered length-$$k$$ blocks back to length-$$n$$ blocks.</li></ol><p>In step 4, we will on average see that the recovered input differs in $$np$$ places, for a bit error probability of $$p$$. And what is our rate? We assumed the standard noiseless channel code in the middle that transmits our compressed input had the maximum rate $$C$$. However, it is transmitting strings that have already been compressed by a factor of $$\frac{k}{n}$$, so the true rate is $$$ R = \frac{C}{1 - H_2(p)} = \frac{C}{1 + p \log p + (1 - p) \log (1 - p)} $$$ This gives us the second part of the theorem: at bit error probability $$p$$, we can transmit at any rate $$R$$ satisfying $$R \le C / (1 - H_2(p))$$ - the more error we accept, the further past the capacity we can push the rate.</p><p>(Note that effectively $$0 \le p < 0.5$$, because if $$p > 0.5$$ we can just flip the labels on the channel and change $$p$$ to $$1 - p$$, and if $$p = 0.5$$ we're transmitting no information.)</p><h3 id="proof-of-part-iii-unachievable-rates">Proof of Part III: unachievable rates</h3><p>Note that the pipeline is a Markov chain (i.e.
each step depends only on the previous step):</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVptKjd3xnPIq2F_8kfByNH3C96QL3mz0C3z-bXyTNEECMZHNxQwqHRusw6Mw5jxrNbT9k9L6OC8qFQuYkLr72mSoJiti9072A9B_HT6twHNku1gxJFIJ45WcEtJy7WuNMcr4MQNVZ7gi_KzuscQq9kcsTKQnbs9oAKN0oViBImC74qaxLxB273U_log/s1094/ArcoLinux_2022-06-25_19-19-53.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="239" data-original-width="1094" height="140" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVptKjd3xnPIq2F_8kfByNH3C96QL3mz0C3z-bXyTNEECMZHNxQwqHRusw6Mw5jxrNbT9k9L6OC8qFQuYkLr72mSoJiti9072A9B_HT6twHNku1gxJFIJ45WcEtJy7WuNMcr4MQNVZ7gi_KzuscQq9kcsTKQnbs9oAKN0oViBImC74qaxLxB273U_log/w640-h140/ArcoLinux_2022-06-25_19-19-53.png" width="640" /></a></div><p>Therefore, the data processing inequality applies (for more on that, search for "data" <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">here</a>). With one application we get $$$ I(w^k; \bar{w}^k) \le I(w^k; y^n) $$$ and with another $$$ I(w^k; y^n) \le I(x^n; y^n) $$$ which combine to give $$$ I(w^k; \bar{w}^k) \le I(x^n; y^n). $$$ By the definition of channel capacity, $$I(x^n; y^n) \le nC$$ (remember that the definition is about mutual information between $$X$$ and $$Y$$, so <i>per-symbol</i> information), and so given the above we also have $$I(w^k; \bar{w}^k) \le nC$$.</p><p>With a rate $$R$$, we send over $$nR$$ bits of information, but if the per-bit error probability is $$p$$, each received bit is worth only $$1 - H_2(p)$$ bits, so recovering the message to within that error still requires $$I(w^k; \bar{w}^k) \ge nR(1 - H_2(p))$$. Combining this with $$I(w^k; \bar{w}^k) \le nC$$ gives $$$ nR(1-H_2(p)) \le nC $$$ which rearranges to $$$ R \le \frac{C}{1 - H_2(p)}. $$$ Any rate above this bound would contradict the data processing inequality, and is therefore unachievable.
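</p><p>The trade-off in Parts II and III can be tabulated: with channel capacity $$C$$ and an accepted bit error probability $$p$$, the best achievable rate is $$C / (1 - H_2(p))$$. A quick sketch (Python; the function names are mine):</p>

```python
from math import log2

def H2(p):
    """Binary entropy in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def max_rate(C, p):
    """Highest achievable rate at bit error probability p."""
    return C / (1 - H2(p))

# For a capacity-1 channel, tolerating more bit errors buys extra rate;
# the bound blows up as p -> 0.5, where nothing gets through at all.
for p in (0.01, 0.1, 0.25):
    print(p, max_rate(1.0, p))
```

<p>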
</p><h2 id="gaussian-channels">Continuous entropy and Gaussian channels</h2><p>And now, for something completely different.</p><p>We've so far talked only about the entropy of discrete random variables. However, there is a very common case of channel coding that deals with continuous random variables: sending a continuous signal, like sound.</p><p>So: forget our old boring discrete random variable $$X$$, and bring in a brand-new continuous random variable that we will call ... $$X$$. How much information do you get from observing $$X$$ land on a particular value $$x$$? You get infinite information, because $$x$$ is a real number with an endless sequence of digits; alternatively, the Shannon information is $$- \log p(x)$$, and the probability of $$X=x$$ is infinitesimally small for a continuous random variable, so the Shannon information is $$-\log 0$$ which is infinite. Umm.</p><p>Consider calculating the entropy for a continuous variable, which we will denote $$h(X)$$ to distinguish it from the discrete case, and define in the obvious way by replacing sums with integrals: $$$ h(X) = -\int_{-\infty}^\infty f(x) \log f(x) d x $$$ where $$f$$ is the probability density function. If we derived this as the limit of the discrete entropy of $$X$$ discretised into bins of width $$\Delta$$, we would pick up an extra $$-\log \Delta$$ term that goes to infinity as the bins shrink.</p><p>As principled mathematicians, we might be concerned about this. But we can mostly ignore it, especially as the main thing we want is $$I(X;Y)$$, and $$$ I(X;Y) = h(Y) - h(Y|X) = -\int f_Y(y) \log f_Y(y) \mathrm{d}y + \iint f_{X,Y}(x,y) \log f_{Y|X=x}(y) \mathrm{d}x \mathrm{d}y $$$</p><p>where <i>mumble mumble</i> the infinities cancel out <i>mumble</i> opposite signs <i>mumble</i>.</p><h3 id="signals">Signals</h3><p>With discrete random variables, we generally had some fairly obvious set of values that they could take. With continuous random variables, we usually deal with an unrestricted range - a radio signal could technically be however low or high.
However, step down from abstract maths land, and you realise reality isn't as hopeless as it seems at first. Emitting a radio wave, or making noise, takes some input of energy, and the source has only so much power.</p><p>For waves (like radio waves and sound waves), power is proportional to the square of the amplitude of a wave. The variance $$\mathbb{V}(X) = \mathbb{E}[(x-\mathbb{E}[x])^2] = \int f(x) (x - \mathbb{E}[X])^2 \mathrm{d}x$$ of a continuous random variable $$X$$ with probability density function $$f$$ is just the expected squared difference between the value and its mean. Both of these quantities are squaring a difference. It turns out that the power of our source and the variance of the random variable that represents it are proportional.</p><p>Our model of a continuous noisy channel is one where there's an input signal $$X$$, a source of noise $$N$$, and an output signal $$Y = X + N$$. As usual, we want to maximise the channel capacity $$C = \max_{p_X} I(X;Y)$$, which is done by maximising $$$ I(X;Y) = h(Y) - h(Y|X). $$$ Because noise is generally the sum of a bunch of small contributing factors in each direction, the noise follows a normal distribution (by the central limit theorem) with variance $$\sigma_N^2$$. Because the only source of uncertainty is $$N$$ and this has the same distribution regardless of $$X$$, $$h(Y|X)$$ depends only on $$N$$ and not at all on $$X$$, so the only thing we can affect is $$h(Y)$$.</p><p>Therefore, the question of how you maximise channel capacity turns into a question of how to maximise $$h(Y)$$ given that $$Y = X + N$$ with $$N \sim \mathcal{N}(0, \sigma_N^2)$$. If we were working without any power/variance constraints, we'd already know the answer: just make $$X$$ such that $$Y$$ is a uniform distribution (which in this case would mean making $$Y$$ a uniform distribution over all real numbers, something that's clearly a bit wacky).
However, we have a constraint on power and therefore on the variance of $$X$$.</p><p>If we were to do some algebra involving Lagrangian multipliers, we would eventually find that we want the distribution of $$X$$ to be a normal distribution. A key property of normal distributions is that if $$X \sim \mathcal{N}(0, \sigma_X^2)$$ (assume the mean is 0; note you can always shift your scale) and $$N \sim \mathcal{N}(0, \sigma_N^2)$$, then $$X + N \sim \mathcal{N}(0, \sigma_X^2 + \sigma_N^2)$$. Therefore the basic principle behind efficiently transmitting information using a continuous signal is that you want to transform your input to follow a normal distribution.</p><p>If you do, what do you get? Start with $$$ I(X;Y) = h(Y) - h(Y|X) $$$ and now use the "standard" integral that $$$ \int f(z) \log f(z) \mathrm{d}z = -\frac{1}{2} \log (2 \pi e \sigma^2) $$$ if $$z$$ is drawn from a distribution $$\mathcal{N}(0, \sigma^2)$$ with density $$f$$, and therefore $$$ \max I(X;Y) = C = \frac{1}{2} \log (2 \pi e (\sigma_X^2 + \sigma_N^2)) - \frac{1}{2} \log (2 \pi e \sigma_N^2) $$$ using the fact that $$h(Y|X) = h(N)$$ since the information content of the noise is all that is unknown about $$Y$$ if we're given $$X$$, and the property of normal distributions mentioned above. We can do some algebra to get the above into the form $$$ C = \frac{1}{2} \log \left(\frac{2 \pi e (\sigma_X^2 + \sigma_N^2)}{2 \pi e \sigma_N^2}\right) = \frac{1}{2} \log \left( 1 + \frac{\sigma_X^2}{\sigma_N^2}\right) $$$ The variance is proportional to the power, so this can also be written in terms of power as $$$ C = \frac{1}{2} \log \left( 1 + \frac{S}{N}\right) $$$ if $$S$$ is the power of the signal and $$N$$ is the power of the noise. The units of capacity for the discrete case were bits per symbol; here they're bits per channel use (that is, per sample of the signal).
A sanity check is that if $$S = 0$$, we transmit $$\frac{1}{2} \log (1) = 0$$ bits, which makes sense: if your signal power is 0, it has no effect, and no one is going to hear you.</p><p>An interesting consequence here is that increasing signal power only gives you a logarithmic improvement in how much information you can transmit. If you shout twice as loud, you can distinguish peaks and troughs in the amplitude of your voice about twice as finely. However, this helps surprisingly little.</p><p>If you want to communicate at a really high capacity, there are better things you can do than shouting very loudly. You can decompose a signal into frequency components using the Fourier transform. If your signal consists of many different frequency levels, you can effectively transmit a different amplitude on each of them at once. The range of frequencies that your signal can span over is called the bandwidth and is denoted $$W$$. If you can make use of multiple frequencies, the capacity equation changes to $$$ C = W \log \left(1 + \frac{S}{N}\right) $$$ (a bandwidth of $$W$$ gives you $$2W$$ independent samples per second, each carrying $$\frac{1}{2} \log \left(1 + \frac{S}{N}\right)$$ bits, so this is a rate in bits per second). Therefore if you want to transmit information, transmitting across a broad range of frequencies is much more effective than shouting loudly.
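</p><p>To put rough numbers on this, here is a small sketch (Python; the power values are illustrative) of how capacity grows as you pour in more signal power:</p>

```python
from math import log2

def capacity(S, N):
    """Bits per channel use for a Gaussian channel: C = 1/2 log2(1 + S/N)."""
    return 0.5 * log2(1 + S / N)

# Doubling the signal power gives diminishing returns: once S >> N,
# each doubling adds only half a bit per channel use.
noise = 1.0
for S in (1, 2, 4, 8, 16, 32):
    print(S, round(capacity(S, noise), 3))
```

<p>The bandwidth, by contrast, multiplies the whole expression, so doubling it doubles the capacity outright.</p><p>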
There's a metaphor here somewhere.</p><h2 id="information-theory-2-source-coding">Information theory 2: source coding</h2><p style="text-align: center;"><span style="font-size: x-small;">6.9k words, including equations (~36min)</span> <br /></p><p> </p><p>In <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">the previous post</a>, we saw the basic information theory model:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJXTmanOlPkocQ5FGq2OL6Tpxe-qlKxs3MIQ0zBnKTk2JvkkLshofA86XeZiNoaa64veATnEBMIfChv5OcUAD6QTPZEpRmtV2b_jhSb_8XDs9PYBAcOdAYmnKDrrrcAxbuXthKVax_gAacxX360xcDRrsLbxGEZdGKaHo24f7itvDpI9k-cbPBYoKHoQ/s1104/ArcoLinux_2022-06-02_12-57-01.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="205" data-original-width="1104" height="118" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJXTmanOlPkocQ5FGq2OL6Tpxe-qlKxs3MIQ0zBnKTk2JvkkLshofA86XeZiNoaa64veATnEBMIfChv5OcUAD6QTPZEpRmtV2b_jhSb_8XDs9PYBAcOdAYmnKDrrrcAxbuXthKVax_gAacxX360xcDRrsLbxGEZdGKaHo24f7itvDpI9k-cbPBYoKHoQ/w640-h118/ArcoLinux_2022-06-02_12-57-01.png" width="640" /></a></div><br /><p>If we have no noise in the channel, we don't need channel coding.
Therefore the above model simplifies to</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1KF_NS28wo86Zclsq33a5hrIUMddggoRFHqCEAwffiTunltbEaON-d4I11qhEUFiu8ChkRqdkKC5f75leUGSkLq-Ysv7R_O2-QRIt-NMO42HyC13pVaMnninN6qyZMr4yIicxO5Iy9962Fmlt-Cczhh5tb2ye5rJPgOQNOdECo0LbnuzBNgjRnr1bzg/s716/ArcoLinux_2022-06-02_12-57-44.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="167" data-original-width="716" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1KF_NS28wo86Zclsq33a5hrIUMddggoRFHqCEAwffiTunltbEaON-d4I11qhEUFiu8ChkRqdkKC5f75leUGSkLq-Ysv7R_O2-QRIt-NMO42HyC13pVaMnninN6qyZMr4yIicxO5Iy9962Fmlt-Cczhh5tb2ye5rJPgOQNOdECo0LbnuzBNgjRnr1bzg/w640-h150/ArcoLinux_2022-06-02_12-57-44.png" width="640" /></a></div><p>and the goal is to minimise $$n$$ - that is, minimise the number of symbols we need to send - without needing to worry about being robust to any errors.</p><p>Here's one question to get started: imagine we're working with a compression function $$f_e$$ that acts on length-$$n$$ strings (that is, sequences of symbols) with some arbitrary alphabet size $$A$$ (that is, $$A$$ different types of symbols). Is it possible for $$f_e$$ to compress every possible input? Clearly not; imagine that it took every length-$$n$$ string to a length-$$m$$ string using the same alphabet, with $$m < n$$. Then we'd have $$A^m$$ different available codewords that would need to code for $$A^n > A^m$$ different messages. By the pigeonhole principle, there must be at least one codeword that codes for more than one message.
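</p><p>The counting argument can be made concrete in a couple of lines (Python; the truncating "compressor" is a hypothetical stand-in for any such $$f_e$$):</p>

```python
from itertools import product

# A "compressor" from 3-bit strings to 2-bit strings: just truncate.
encode = lambda s: s[:2]

messages = ["".join(bits) for bits in product("01", repeat=3)]  # 2^3 = 8 messages
codewords = {encode(m) for m in messages}                       # at most 2^2 = 4 codewords

print(len(messages), len(codewords))  # prints: 8 4
```

<p>Eight messages squeezed into four codewords: some codeword has to stand for two different messages, whichever encoding function we pick.</p><p>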
But that means that if we see this codeword, we can't be sure what it codes for, so we can't recover the original with certainty.</p><p>Therefore, we have a choice: either:</p><ul><li>do <i>lossy compression</i>, where every message shrinks in size but we can't recover information perfectly; or</li><li>do <i>lossless compression</i>, and hope that more messages shrink in size than expand in size.</li></ul><p>This is obvious with lossless compression, but applies to both: if you want to do them well, you generally need a probability model for what your data looks like, or at least something that approximates one.</p><h2 id="terminology">Terminology</h2><p>When we talk about a "code", we just mean something that maps messages (the $$Z$$ in the above diagram) to a sequence of symbols. A code is <b>nonsingular</b> if it associates every message with a unique code. </p><p>A <b>symbol code</b> is a code where each symbol in the message maps to a codeword, and the code of a message is the concatenation of the codewords of the symbols that it is made of.</p><p>A <b>prefix code</b> is a code where no codeword is a prefix of another codeword. They are also called <b>instantaneous codes</b>, because when decoding, you can decode to a symbol immediately once the symbols read so far form a codeword.</p><h2 id="useful-basic-results-in-lossless-compression">Useful basic results in lossless compression</h2><h3 id="kraft-s-inequality">Kraft's inequality</h3><p>Kraft's inequality states that a prefix code with an alphabet of size $$D$$ and code words of lengths $$l_1, l_2, \ldots, l_n$$ satisfies $$$ \sum_{i=1}^n D^{-l_i} \leq 1, $$$ and conversely that if there is a set of lengths $$\{l_1, \ldots, l_n\}$$ that satisfies the above inequality, there exists a prefix code with those codeword lengths.
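</p><p>We can check the inequality on concrete examples (Python; the codeword lengths are my own):</p>

```python
def kraft_sum(lengths, D=2):
    """Left-hand side of Kraft's inequality: sum over codewords of D^(-l)."""
    return sum(D ** -l for l in lengths)

# The binary prefix code {0, 10, 110, 111} has lengths 1, 2, 3, 3
# and meets the bound exactly:
print(kraft_sum([1, 2, 3, 3]))  # prints: 1.0

# No binary prefix code can have lengths 1, 1, 2 -- the sum exceeds 1:
print(kraft_sum([1, 1, 2]))  # prints: 1.25
```

<p>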
We will only prove the first direction: that all prefix codes satisfy the above inequality.</p><p>Let $$l = \max_i l_i$$ and consider the tree with branching factor $$D$$ and depth $$l$$. This tree has $$D^l$$ nodes on the bottom level. Each codeword $$x_1x_2...x_c$$ corresponds to the node in this tree that you get to by choosing the $$d_i$$th branch on the $$i$$th level, where $$d_i$$ is the index of symbol $$x_i$$ in the alphabet. Since it must be a prefix code, no node that is a descendant of a node that is a codeword can be a codeword. We can define our "budget" as the $$D^l$$ nodes on the bottom level of the tree, and define the "cost" of each codeword as the number of nodes on the bottom level of the tree that are descendants of the node. A codeword of length $$l$$ has cost 1, and in general a codeword at level $$l_i$$ has cost $$D^{l - l_i}$$. Prefix-freeness means no codeword is an ancestor of another, so these sets of bottom-level descendants are disjoint and the total cost can't exceed the budget: $$$ \sum_i D^{l - l_i} \leq D^l $$$ which becomes the inequality when you divide both sides by $$D^l$$.</p><h3 id="gibbs-inequality">Gibbs' inequality</h3><p>Gibbs' inequality states that for any two probability distributions $$p$$ and $$q$$, $$$ -\sum_i p_i \log p_i \leq - \sum_i p_i \log q_i $$$ which can be written using the relative entropy $$D$$ (also known as the KL distance/divergence) as $$$ \sum_i p_i \log \frac{p_i}{q_i} = D(p||q) \geq 0. $$$ This can be proved using the <a href="https://en.wikipedia.org/wiki/Log_sum_inequality">log sum inequality</a>. The proof is boring.</p><h3 id="minimum-expected-length-of-a-symbol-code">Minimum expected length of a symbol code</h3><p>We want to minimise the expected length of our code $$C$$ for each symbol that $$X$$ might output. The expected length is $$L(C,X) = \sum_i p_i l_i$$. Now one way to think of what a length $$l_i$$ means is using the correspondence between prefix codes and binary trees discussed above.
Given the prefix requirement, the higher the level in the tree (and thus the shorter the length of the codeword), the more other options we block out in the tree. Therefore we can think of the collection of lengths we assign to our codewords as specifying a rough probability distribution that assigns probability in proportion to $$2^{-l_i}$$. What we'll do is introduce a variable $$q_i$$ that measures the "implied probability" in this way (note the division by a normalising constant): $$$ q_i = \frac{2^{-l_i}}{\sum_i 2^{-l_i}} = \frac{2^{-l_i}}{z} $$$ where in the second step we've just defined $$z$$ to be the normalising constant. Now $$l_i = - \log zq_i = -\log q_i - \log z$$, so $$$ L(C,X) = \sum_i (-p_i \log q_i) - \log z $$$ Now we can apply Gibbs' inequality to know that $$\sum_i(- p_i \log q_i) \geq \sum_i (-p_i \log p_i)$$ and Kraft's inequality to know that $$\log z = \log \big(\sum_i 2^{-l_i} \big) \leq \log(1)=0$$, so we get $$$ L(C,X) \geq -\sum_i p_i \log p_i = H(X). $$$ Therefore the entropy (with base-2 $$\log$$) of a random variable is a lower bound on the expected length of a codeword (in a 2-symbol alphabet) that represents the outcome of that random variable. (And more generally, entropy with base-$$d$$ logarithms is a lower bound on the length of a codeword for the result in a $$d$$-symbol alphabet.)</p><h2 id="huffman-coding">Huffman coding</h2><p>Huffman coding is a very pretty concept.</p><p>We saw above that if you're making a random variable for the purpose of gaining the most information possible, you should set it up to have a uniform probability distribution.
This is because entropy is maximised by a uniform distribution, and the entropy of a random variable is the average amount of information you get by observing it.</p><p>The reason why, say, encoding English characters as 5-bit strings (A = 00000, B = 00001, ..., Z = 11001, and then use the remaining 6 codes for punctuation or cat emojis or whatever) is not optimal is that some of those 5-bit strings are more likely than others. On a symbol-by-symbol level, whether the first symbol is a 0 or a 1 is not equiprobable. To get an ideal code, each symbol we send should have equal probability (or as close to equal probability as we can get).</p><p>Robert Fano, of <a href="https://en.wikipedia.org/wiki/Fano%27s_inequality">Fano's inequality</a> fame, and Claude Shannon, of everything-in-information-theory fame, had tried to find an efficient general coding scheme in the early 1950s. They hadn't succeeded. Fano set it as an alternative to taking the final exam for his information theory class at MIT. David Huffman tried for a while, and had almost given up and started studying instead, when he came up with Huffman coding and quickly proved it to be optimal.</p><p>We want the first code symbol (a binary digit) to divide the space of possible message symbols (the English letters, say) into two equally-likely parts, the first two symbols to divide it into four, the first three into eight, and so on. Now some message symbols are going to be more likely than others, so the codes for some symbols have to be longer. We don't want it to be ambiguous when we get to the end of a codeword, so we want a prefix-free code.
Prefix-free codes with a size-$$d$$ alphabet can be represented as trees with branching factor $$d$$, where each leaf is one codeword:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5Yb7bY9lY4ykPbwxKQurPGfN2KW51vlyHu1c-1MWUNuUYkCXWRev6uCSZOKooetoenZPkNvf6O1Ygk-l3at3Gt4iBgfQJeyhx-XR_5t4ZmY5HUWYUh47CrBB5ka5WieNK4_ANcRPcdXRsAt8o3D1TNsZBQGQkuW_9J62iZ9hr41bi8T2961-xXCIvxQ/s788/ArcoLinux_2022-06-25_18-09-46.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="506" data-original-width="788" height="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5Yb7bY9lY4ykPbwxKQurPGfN2KW51vlyHu1c-1MWUNuUYkCXWRev6uCSZOKooetoenZPkNvf6O1Ygk-l3at3Gt4iBgfQJeyhx-XR_5t4ZmY5HUWYUh47CrBB5ka5WieNK4_ANcRPcdXRsAt8o3D1TNsZBQGQkuW_9J62iZ9hr41bi8T2961-xXCIvxQ/w640-h410/ArcoLinux_2022-06-25_18-09-46.png" width="640" /></a></div><p>Above, we have $$d=2$$ (i.e. binary), and six items to code for (<code>a</code>, <code>b</code>, <code>c</code>, <code>d</code>, <code>e</code>, and <code>f</code>), and six code words with lengths of between 1 and 4 characters in the codeword alphabet.</p><p>Each codeword is associated with some probability. We can define the weight of a leaf node to be its probability (or just how many times it occurs in the data) and the weight of a non-leaf node to be the sum of the weights of all leaves that are downstream of it in the tree. For an optimal prefix-free code, all we need to do is make sure that each node has children that are as equally balanced in weight as possible.</p><p>The best way to achieve this is to work bottom-up. Start without any tree, just a collection of leaf nodes representing the symbols you want codewords for.
Then repeatedly build a node uniting the two least-likely parentless nodes in the tree, until the tree has a root.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeWVP1zFZw9DdXXLhr5FiCaTY_sLjPbbEKv6w--pr3gYntbL-xWziuCOb8gz8u92YxSpF3EZWQ22-_3NSzzfen0a0qeO3rouiUPsvoOs_rvQQYFUj5DyANdtJtAPTUELkmIcuIc3lPaJegCj0ydkOV9gur3mKxIw9YxWwiOnMBXCWeWFfPTRHYPEJm1g/s729/ArcoLinux_2022-06-25_18-12-46.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="538" data-original-width="729" height="472" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeWVP1zFZw9DdXXLhr5FiCaTY_sLjPbbEKv6w--pr3gYntbL-xWziuCOb8gz8u92YxSpF3EZWQ22-_3NSzzfen0a0qeO3rouiUPsvoOs_rvQQYFUj5DyANdtJtAPTUELkmIcuIc3lPaJegCj0ydkOV9gur3mKxIw9YxWwiOnMBXCWeWFfPTRHYPEJm1g/w640-h472/ArcoLinux_2022-06-25_18-12-46.png" width="640" /></a></div><p>Above, the numbers next to the non-leaf nodes show the order in which the node was created. This set of weights on the leaf nodes creates the same tree structure as in the previous diagram.</p><p>(We could also try to work top-down, creating the tree from the root to the leaves rather than from the leaves to the root, but this turns out to give slightly worse results. Also the algorithm for achieving this is less elegant.)</p><h2 id="arithmetic-coding">Arithmetic coding</h2><p>The Huffman code is the best symbol code - that is, a code where every symbol in the message gets associated with a codeword, and the code for the entire message is simply the concatenation of all the codewords of its symbols.</p><p>Symbol codes aren't always great, though. Consider encoding the output of a source that has a lot of runs like "<code>aaaaaaaaaahaaaaahahahaaaaa</code>" (a source of such messages might be, for example, a transcription of what a student says right before finals).
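</p><p>As an aside, the bottom-up construction described above fits in a few lines of code. A minimal sketch (Python, using the standard-library <code>heapq</code>; the example weights are mine):</p>

```python
import heapq
from itertools import count

def huffman(weights):
    """Build a binary Huffman code from {symbol: weight} -> {symbol: codeword}."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Repeatedly unite the two least-likely parentless nodes.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

print(huffman({"a": 0.45, "b": 0.25, "c": 0.15, "d": 0.15}))
```

<p>Likelier symbols end up closer to the root and so get shorter codewords, and the codeword lengths satisfy Kraft's inequality with equality. On a two-symbol alphabet like the run-heavy message above, though, the best it can do is one bit per symbol.</p><p>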
The Huffman coding for this message is, for example, that "a" maps to 0 and "h" maps to 1, and you have achieved a compression of exactly 0%, even though intuitively those long runs of "a"s could be compressed.</p><p>One obvious thing you could do is run-length encoding, where long blocks of a character get compressed into a code for the character plus a code for how many times the character is repeated; for example the above might become "<code>10a1h5a1h1a1h1a1h5a</code>". However, this is only a good idea if there are lots of runs, and requires a bunch of complexity (e.g. your alphabet for the codewords must either be something more than binary, or else you need to be able to express things like lengths and counts in binary unambiguously, possibly using a second layer of encoding with a symbol code).</p><p>Another problem with Huffman codes is that the code is based on assuming an unchanging probability model across the entire length of the message that is being encoded. This might be a bad assumption if we're encoding, for example, long angry Twitter threads, where the frequency of exclamation marks and capital letters increases as the message continues. We could try to brute-force a solution, such as splitting the message into chunks and fitting a Huffman code separately to each chunk, but that's not very elegant. Remember how elegant Huffman codes feel as a solution to the symbol coding problem? We'd rather not settle for less.</p><p>The fundamental idea of arithmetic coding is that we send a number representing where on the cumulative probability distribution of all messages the message we want to send lies. This is a dense statement, so we will unpack it with an example. Let's say our alphabet is $$A = \{a, r, t\}$$. To establish an ordering, we'll just say we consider the alphabet symbols in alphabetic order. 
Now let's say our probability distribution for the random variable $$X$$ looks like the diagram on the left; then our cumulative probability distribution looks like the diagram on the right:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjywjIVu-7eMsCaKyb1GdbLQWnmArUkgLMlhXAxdUleRkJynbd2RErOQkH7Dm3h1Dcb0Q6ynn1G36oTJP-58fj_9Kkd5ryBn0AMThBKSqADP42dkEjPB6ln-lv-wLJ-pUYZIxn6V3zXBAK6zJQIAd-zPOWxvf1aI2nvMVVnse1QCc-WWwM3XQJ__JQeuw/s1038/ArcoLinux_2022-06-21_21-42-25.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="397" data-original-width="1038" height="244" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjywjIVu-7eMsCaKyb1GdbLQWnmArUkgLMlhXAxdUleRkJynbd2RErOQkH7Dm3h1Dcb0Q6ynn1G36oTJP-58fj_9Kkd5ryBn0AMThBKSqADP42dkEjPB6ln-lv-wLJ-pUYZIxn6V3zXBAK6zJQIAd-zPOWxvf1aI2nvMVVnse1QCc-WWwM3XQJ__JQeuw/w640-h244/ArcoLinux_2022-06-21_21-42-25.png" width="640" /></a></div><p>One way to specify which of $$\{a, r, t\}$$ we mean is to pick a number $$0 \leq c \leq 1$$, and then look at which range it corresponds to on the $$y$$-axis of the right-hand figure; $$0 \leq c < 0.5$$ implies $$a$$, $$0.5 \leq c < 0.7$$ implies $$r$$, and $$0.7 \leq c < 1$$ implies $$t$$. We don't need to send the leading 0 because it is always present, and for simplicity we'll transmit the following decimals in binary; 0.0 becomes "0", 0.5 becomes "1", 0.25 becomes "01", and 0.875 becomes "111". </p><p>Note that at this point we've almost reinvented the Huffman code. $$a$$ has the most probability mass and can be represented in one symbol. $$r$$ happens to be representable in one symbol ("1" corresponds to 0.5 which maps to $$r$$) as well even though it has the least probability mass, which is definitely inefficient but not too bad. $$t$$ takes 2: "11".</p><p>The real benefit begins when we have multi-character messages. 
The way we can do it is like this, recursively splitting the number range between 0 and 1 into smaller and smaller chunks:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNxucz-lsT05H5PjrTHMM3NELAQILz9mDGgl4ED_sVaBcqbEoBTTfjMvzYrcUMNywDcT-OzlniqA4RkS-toShHZBNYhFzn744YFxx0oPYVj-FOJKRsLlU28RvU5bID5019UBvjQwVzmgqlpOHbmC-fN2bvTfqhj81PN0w5qIDzKEkbsjpr4e6Rye56bA/s969/ArcoLinux_2022-06-21_21-43-17.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="349" data-original-width="969" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNxucz-lsT05H5PjrTHMM3NELAQILz9mDGgl4ED_sVaBcqbEoBTTfjMvzYrcUMNywDcT-OzlniqA4RkS-toShHZBNYhFzn744YFxx0oPYVj-FOJKRsLlU28RvU5bID5019UBvjQwVzmgqlpOHbmC-fN2bvTfqhj81PN0w5qIDzKEkbsjpr4e6Rye56bA/w640-h230/ArcoLinux_2022-06-21_21-43-17.png" width="640" /></a></div><p>We see possible numbers encoding "art", "rat", and "tar". Not only that, but we see that a single number can encode a message of unbounded length, as we can just keep going down, adding more and more letters. At first this might seem like a great deal - send one number, get infinite symbols transmitted for free! However, there's a real difference between "art" and "artrat", so we want to be able to know when to stop as well.</p><p>A simple answer is that the message also includes some code encoding how many symbols to decode for. A more elegant answer is that we can keep our message as just one number, but extend our alphabet to include an end-of-message token. Note that even with this end-of-message token, it is still true that many characters of the message can be encoded by a single symbol of output, especially if some outcome is much more likely. 
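The recursive interval-splitting itself can be sketched with exact rationals in Python (a minimal sketch using the probabilities implied by the figures above, i.e. $$p_a = 0.5$$, $$p_r = 0.2$$, $$p_t = 0.3$$; the function name is mine, and the end-of-message token is left out here):

```python
from fractions import Fraction

# symbol probabilities from the example, in alphabetical order
PROBS = {"a": Fraction(1, 2), "r": Fraction(1, 5), "t": Fraction(3, 10)}

def message_interval(message, probs=PROBS):
    """Narrow [0, 1) down to the sub-interval for `message`: any number
    inside the returned interval starts with this message when decoded."""
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        cum = Fraction(0)  # cumulative probability of symbols before `sym`
        for s, p in probs.items():
            if s == sym:
                break
            cum += p
        low += width * cum      # descend into the chunk for `sym`
        width *= probs[sym]     # the chunk shrinks by a factor of p_sym
    return low, low + width

low, high = message_interval("art")
```

Longer messages always land inside the interval of their prefixes, which is exactly the "recursively splitting into smaller and smaller chunks" picture.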
For example, in the example below we need only one bit ("1", for the number 0.5) to represent the message "aaa" (followed by the end-of-message character):</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYblnLiDN2hhrmMo7XMRQtEuDiOl4TAO53XUxdao9FxGjuINwDQOj-YT7YU3Q857Vgj6_gxi9UHEHvMQGkgpKpxDiHRO06z1FF8zkbbaqdUtG-BhwmD_0Qv77pnPMTyh2w8YpVvyZWN_AJ7vpPxAJB9w46bNlYmaFebm9mW9-DgcVZMDnVn-1DpludLQ/s942/ArcoLinux_2022-06-21_21-44-10.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="327" data-original-width="942" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYblnLiDN2hhrmMo7XMRQtEuDiOl4TAO53XUxdao9FxGjuINwDQOj-YT7YU3Q857Vgj6_gxi9UHEHvMQGkgpKpxDiHRO06z1FF8zkbbaqdUtG-BhwmD_0Qv77pnPMTyh2w8YpVvyZWN_AJ7vpPxAJB9w46bNlYmaFebm9mW9-DgcVZMDnVn-1DpludLQ/w640-h222/ArcoLinux_2022-06-21_21-44-10.png" width="640" /></a></div><p>There are still two ways in which this code is underspecified.</p><p>The first is that we need to choose how much of the probability space to assign to our end-of-message token. The optimal value for this clearly depends on how long messages we will be sending.</p><p>The second is that even with the end-of-message token, each codeword is still represented by a range of values rather than a single number. Any of these are valid numbers to send, but we want to minimise the length, so therefore we will choose the number in this range that has the shortest binary representation.</p><p>Finally, what is our probability model? With the Huffman code, we either assume a probability model based on background information (e.g. 
we have the set of English characters, and we know the rough probabilities of them by looking at some text corpus that someone else has already compiled), or we fit the probability model based on the message we want to send - if 1/10th of all letters in the message are $$a$$s, we set $$p_a = 0.1$$ when building the tree for our Huffman code, and so on.</p><p>With arithmetic coding, we can also assume static probabilities. However, we can also do adaptive arithmetic coding, where we change the probability model as we go. A good way to do this is for our probability model to assume that the probability $$p_x$$ of the symbol $$x$$ after we have already processed text $$T$$ is $$$ p_x = \frac{\text{Count}(x, T) + 1}{\sum_{y \in A} \big(\text{Count}(y, T) + 1\big)}$$$ $$$= \frac{\text{Count}(x, T) + 1}{\sum_{y \in A} \big(\text{Count}(y, T)\big) + |A|} $$$ where $$A$$ is the alphabet, and $$\text{Count}(a, T)$$ simply returns the count of how many times the character $$a$$ occurs in $$T$$. Note that if we didn't have the $$+1$$ in the numerator and in the sum in the denominator, we would assign a probability of zero to anything we haven't seen before, and be unable to encode it.</p><p>(We can either say that the end-of-message token is in the alphabet $$A$$, or, more commonly, assign "probabilities" to all $$x$$ using the above formula and some probability $$p_{EOM}$$ to the end of message, and then renormalise by dividing all $$p_x$$ by $$1 + p_{EOM}$$.)</p><p>How do we decode this? At the start, the assumed distribution is simply uniform over the alphabet (except maybe for $$p_{EOM}$$). We can decode the first symbol using that distribution, then update the distribution and decode the next, and so on. It's quite elegant.</p><p>What isn't elegant is implementing this with standard number systems in most programming languages. 
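As a quick aside, the count-based update rule above is a one-liner to sketch (a minimal sketch; `adaptive_probs` is a hypothetical name, and the end-of-message handling is left out):

```python
from collections import Counter

def adaptive_probs(processed, alphabet):
    """p_x = (Count(x, T) + 1) / (sum_y Count(y, T) + |A|),
    the add-one rule from the formula above."""
    counts = Counter(processed)
    denom = len(processed) + len(alphabet)
    return {x: (counts[x] + 1) / denom for x in alphabet}

# after processing "aara": a seen 3 times, r once, t never
probs = adaptive_probs("aara", "art")
```

The unseen symbol `t` still gets probability $$1/7$$ rather than zero, which is the whole point of the $$+1$$ terms.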
For any non-trivial message length, arithmetic coding is going to need very precise floating point numbers, and you can't trust floating point precision very far. You'll need some special system, likely an arbitrary-precision arithmetic library, to actually implement arithmetic coding.</p><h3 id="prefix-free-arithmetic-coding">Prefix-free arithmetic coding</h3><p>The above description of arithmetic coding is not a prefix-free code. We generally want prefix-free codes, in particular because it means we can decode a message symbol by symbol as it comes in, rather than having to wait for the entire message to come through. Note also that often in practice it is uncertain whether or not there are more bits coming; consider a patchy internet connection with significant randomness between packet arrival times.</p><p>The simple fix for this is that instead of encoding a number as <i>any</i> binary string that maps onto the right segment of the number line between 0 and 1, you impose an additional requirement on it: <i>whatever binary bits you append to the number, it must still lie within the range</i>.</p><h2 id="lempel-ziv-coding">Lempel-Ziv coding</h2><p>Huffman coding integrated the probability model and the encoding. Arithmetic coding still uses an (at least implicit) probability model to encode, but in a way that makes it possible to update as we encode. Lempel-Ziv encoding, and its various descendants, throw away the entire idea of having any kind of (explicit) probability model. We will look at the original version of this algorithm.</p><h3 id="encoding">Encoding</h3><p>Skip all that Huffman coding nonsense of carefully rationing the shorter codewords for the most likely symbols, and simply decide on some codeword length $$d$$ and give every character in the alphabet a codeword of that length. 
If your alphabet is again $$\{a, r, t, \text{EOM}\}$$ (we'll include the end-of-message character from the start this time), and $$d = 3$$, then the codewords you define are literally as simple as $$$ a \mapsto 000 $$$ $$$ r \mapsto 001 $$$ $$$ t \mapsto 010 $$$ $$$ \text{EOM} \mapsto 011 $$$ If we used this code, it would be a disaster. We have four symbols in our alphabet, so the maximum entropy of the distribution is $$\log_2 4 = 2$$ bits, and we're spending 3 bits on each symbol. With this encoding, we increase the length by at least 50%. Instead of your compressed file being uploaded in 4 seconds, it now takes 6.</p><p>However, we selected $$d=3$$, meaning we have $$2^3 = 8$$ slots for possible codewords of our chosen constant length, and we've only used 4. What we'll do is follow these steps as we scan through our text:</p><ol><li>Read one symbol <i>past</i> the longest match between the following text and a codeword we've defined. Therefore what we now have is a string $$Cx$$, where we have a code for $$C$$ already of length $$|C|$$, $$x$$ is a single character, and $$Cx$$ is a prefix of the remaining text.</li><li>Add $$C$$'s codeword to the code we're forming, to encode for the first $$|C|$$ characters of the remaining text.</li><li>If there is space among the $$2^d$$ possible codewords we have available: let $$n$$ be the binary representation of the smallest possible codeword not yet associated with a code, and define $$Cx \mapsto n$$ as a new codeword.</li></ol><p>Here is an example of the encoding process, showing the emitted codewords on the left, the original definitions on the top, the new definitions on the right, and the message down the middle:</p><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi33XVwUveeGSql8K9VJi_y7bZGa0TZ3UAKdPkkxbnXNZmakweKcmjdGHOBN1oPGSj0fxi3xtQcVDD-FT-XBEW6u18eKbVcZurVB9unqL3tHsSyYKb0mvpfpBRkDZttA1l9OgLlF2I0OFHawK8D2LnQP3M6cJZPHeJOTnSF0lV53ueCYE5m65t6h4U_4w/s892/ArcoLinux_2022-06-21_21-48-30.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="892" data-original-width="727" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi33XVwUveeGSql8K9VJi_y7bZGa0TZ3UAKdPkkxbnXNZmakweKcmjdGHOBN1oPGSj0fxi3xtQcVDD-FT-XBEW6u18eKbVcZurVB9unqL3tHsSyYKb0mvpfpBRkDZttA1l9OgLlF2I0OFHawK8D2LnQP3M6cJZPHeJOTnSF0lV53ueCYE5m65t6h4U_4w/w522-h640/ArcoLinux_2022-06-21_21-48-30.png" width="522" /></a></div><h3 id="decoding">Decoding</h3><p>A boring way to decode is to send the codeword list along with your message. The fun way is to reason it out as you go along, based on your knowledge of the above algorithm and a convention that lets you know which order the original symbols were added to the codeword list (say, alphabetically, so you know the three bindings in the top-left). 
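The three-step encoding loop can be sketched in Python (a minimal sketch of the scheme described above, not the exact tables in the figures; it assumes every character of the input is in the alphabet, and uses "#" to stand in for the end-of-message symbol):

```python
def lz_encode(text, alphabet, d):
    """Emit fixed-width d-bit codewords; after each emission, define a new
    codeword for (longest match C) + (one following character x)."""
    book = {sym: format(i, f"0{d}b") for i, sym in enumerate(alphabet)}
    out, i = [], 0
    while i < len(text):
        # step 1: find the longest prefix C of the remaining text in the book
        # (every prefix of a defined string is defined, so a greedy scan works)
        j = i + 1
        while j <= len(text) and text[i:j] in book:
            j += 1
        match = text[i:j - 1]
        out.append(book[match])                 # step 2: emit the code for C
        if j <= len(text) and len(book) < 2 ** d:
            book[text[i:j]] = format(len(book), f"0{d}b")  # step 3: define Cx
        i += len(match)
    return "".join(out)

# alphabet a, r, t plus "#" standing in for the end-of-message symbol
encoded = lz_encode("aaaa", "art#", d=3)
```

On "aaaa" this emits 000 (for "a", defining "aa"), then 100 (for "aa", defining "aaa"), then 000 again: three codewords for four characters, and the savings grow as runs get longer.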
An example of decoding the above message:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigucPM68ZDJpTmvknmmUEQkRAdUye2kMM3pMp_Ucgqg1xL6UDKXGWFyW4FCb-V_K4dMRlTrypPUZJQ6KFkU9pVU80bG2vy7oDBJP4H-Nq_WGu9WKjnZy8EhqVuMhpW4g8bVeQVTwRQ2xNAxydawUm9kACV9ADKU6OUcUcn59jcphwa5p8zx-2iHiywDA/s1073/ArcoLinux_2022-06-21_21-48-58.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="476" data-original-width="1073" height="284" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigucPM68ZDJpTmvknmmUEQkRAdUye2kMM3pMp_Ucgqg1xL6UDKXGWFyW4FCb-V_K4dMRlTrypPUZJQ6KFkU9pVU80bG2vy7oDBJP4H-Nq_WGu9WKjnZy8EhqVuMhpW4g8bVeQVTwRQ2xNAxydawUm9kACV9ADKU6OUcUcn59jcphwa5p8zx-2iHiywDA/w640-h284/ArcoLinux_2022-06-21_21-48-58.png" width="640" /></a></div><h2 id="source-coding-theorem">Source coding theorem</h2><p>The source coding theorem is about lossy compression. It is going to tell us that if we can tolerate a probability of error $$\delta$$, and if we're encoding a message consisting of a lot of symbols, unless $$\delta$$ is very close to 0 (lossless compression) or 1 (there is nothing but error), it will take about $$H(X)$$ bits per symbol to encode the message, where $$X$$ is the random variable according to which the symbols in the message have been drawn. Since it means that entropy turns up as a fundamental and surprisingly constant limit when we're trying to compress our information, this further justifies the use of entropy as a measure of information.</p><p>We're going to start our attempt to prove the source coding theorem by considering a silly compression scheme. Observe that English has 26 letters, but the bottom 10 (Z, Q, X, J, K, V, B, P, Y, G) are slightly less than 10% of all letters. Why not just drop them? 
Everthn is still comprehensile without them, and ou can et awa with, for eample, onl 4 inary its per letter rather than 5, since ou're left with ust 16 letters.</p><p>Given an alphabet $$A$$ from which our random variable $$X$$ takes values, define the $$\delta$$-sufficient subset $$S_\delta$$ of $$A$$ to be the smallest subset of $$A$$ such that $$P(x \in S_\delta) \geq 1 - \delta$$ for $$x$$ drawn from $$X$$. For example, if $$A$$ is the English alphabet, and $$\delta = 0.1$$, then $$S_\delta$$ is the set of all letters except Z, Q, X, J, K, V, B, P, Y, and G, since the other letters have a combined probability of over $$1 - 0.1 = 0.9$$, and any other subset containing more than $$0.9$$ of the probability mass must contain more letters. </p><p>Note that $$S_\delta$$ can be formed by adding elements from $$A$$, in descending order of probability, into a set until the sum of probabilities of elements in the set exceeds $$1 - \delta$$.</p><p>Next, define the essential bit content of $$X$$, denoted $$H_\delta(X)$$, as $$$ H_\delta(X) = \log_2 |S_\delta|. $$$ In other words, $$H_\delta(X)$$ is the answer to "how many bits of information does it take to point to one element in $$S_\delta$$ (without being able to assume the distribution is anything better than uniform)?". $$H_\delta(X)$$ for $$\text{English alphabet}_{0.1}$$ is 4, because $$\log_2 |\{E, T, A, O, I, N, S, H, R, D, L, U, C, M, W, F\}| = \log_2 16 = 4$$. 
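These two definitions translate directly into code (a minimal sketch; the function names are mine):

```python
from math import log2

def delta_sufficient_subset(probs, delta):
    """Smallest subset of the alphabet with total probability >= 1 - delta:
    greedily add symbols in descending order of probability."""
    subset, total = [], 0.0
    for sym, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        if total >= 1 - delta:
            break
        subset.append(sym)
        total += p
    return subset

def essential_bit_content(probs, delta):
    """H_delta(X) = log2 |S_delta|."""
    return log2(len(delta_sufficient_subset(probs, delta)))

# a toy 4-symbol distribution: tolerating delta = 0.25 of error
# halves the set we must point into, saving a whole bit
h = essential_bit_content({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}, 0.25)
```

With $$\delta = 0$$ the same distribution needs all four symbols and hence 2 bits.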
It makes sense that this is called "essential bit content".</p><p>We can graph $$H_\delta(X)$$ against $$\delta$$ to get a pattern like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlGh7Wm3vmOUAG49ZlIxkZgz2q7ApF-QM7addOeJ5uEqx2P9kzx8sAGF_4BBp4o6me9Pg6NqzGgivCir-VKdWB-E2hdLzAYx6cOgd9v2-BQr8Emaat6joRPkDFPtEZcjnGvNVvegvOvaRVJCQaYZGI_WCjZkwoY356mqGwpVlmzHZWAPT-eO2yviId_A/s850/ArcoLinux_2022-06-21_22-01-58.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="636" data-original-width="850" height="478" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlGh7Wm3vmOUAG49ZlIxkZgz2q7ApF-QM7addOeJ5uEqx2P9kzx8sAGF_4BBp4o6me9Pg6NqzGgivCir-VKdWB-E2hdLzAYx6cOgd9v2-BQr8Emaat6joRPkDFPtEZcjnGvNVvegvOvaRVJCQaYZGI_WCjZkwoY356mqGwpVlmzHZWAPT-eO2yviId_A/w640-h478/ArcoLinux_2022-06-21_22-01-58.png" width="640" /></a></div><p>Where it gets more interesting is when we extend this definition to blocks. Let $$X^n$$ denote the random variable for a sequence of $$n$$ independent identically distributed samples drawn from $$X$$. We keep the same definitions for $$S_\delta$$ and $$H_\delta(X)$$; just remember that now $$S$$ is a subset of $$A^n$$ (where the exponent denotes Cartesian product of a set with itself; i.e. $$A^n$$ is all possible length-$$n$$ strings formed from that alphabet). 
In other words, we're throwing away the least common length-$$n$$ letter strings first; ZZZZ is out the window first if $$n = 4$$, and so on.</p><p>We can plot a similar graph as above, except we're plotting $$\frac{1}{n} H_\delta(X^n)$$ on the vertical axis to get per-symbol entropy, and there's a horizontal line around the entropy of English letter frequencies:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1TCFB5fGrb7BNMPfvRwGOtzbahwMwd1yAkZoAUGoP66dWZCj9VkwySIMiwUBETqS6PE1Ob7jvtA3ex0sS02vn_UXgQOD5NLWcp1czRfB55SWHswGlcn3zNeeb4w8n5usdVT6NJZZ52JoU5So4qf6HNzcMNbZcH-IUnU4TZ4k6sGD9zmv4aTfsJCdJxw/s883/ArcoLinux_2022-06-21_22-02-44.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="646" data-original-width="883" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1TCFB5fGrb7BNMPfvRwGOtzbahwMwd1yAkZoAUGoP66dWZCj9VkwySIMiwUBETqS6PE1Ob7jvtA3ex0sS02vn_UXgQOD5NLWcp1czRfB55SWHswGlcn3zNeeb4w8n5usdVT6NJZZ52JoU5So4qf6HNzcMNbZcH-IUnU4TZ4k6sGD9zmv4aTfsJCdJxw/s320/ArcoLinux_2022-06-21_22-02-44.png" width="320" /></a></div><p>(Note that the entropy per letter of English drops to only about 1.3 bits if we stop modelling each letter as drawn independently from the others around it, and instead have a model with a perfect understanding of which letters occur together.)</p><p>The graph above shows the plot of $$\frac{1}{n}H_\delta(X^n)$$ against $$\delta$$ for a random variable $$X^n$$ for $$n=1$$ (blue), $$n=2$$ (orange), and $$n=3$$ (green). We see that as $$n$$ increases, the lines become flatter, and the middle portions approach the black line that shows the entropy of the English letter frequency distribution. 
What you'd see if we continued plotting this graph for larger values of $$n$$ (which might happen for example if you bought me a beefier computer) is that this trend continues; specifically, that there is a value $$n$$ large enough that the graph of $$\frac{1}{n}H_\delta(X^n)$$ is as close as we want to the black line for the entire length of it, except for an arbitrarily small part near $$\delta = 0$$ and $$\delta = 1$$. Mathematically, for any $$\epsilon > 0$$ and any $$0 < \delta < 1$$, there exists a positive integer $$n_0$$ such that for all $$n \geq n_0$$, $$$ \left| \frac{1}{n}H_\delta(X^n) - H(X) \right| \leq \epsilon. $$$ Now remember that $$\frac{1}{n}H_\delta(X^n)=\frac{1}{n}\log |S_\delta|$$ was the essential bit content per symbol, or, in other words, the number of bits we need per symbol to represent $$X^n$$ (with error probability $$\delta$$) in the simple coding scheme where we assign an equal-length binary number to each element in $$S_\delta$$ (but hold on: aren't there better codes than ones where all elements in $$S_\delta$$ get an equal-length representation? yes, but we'll see soon that not by very much). Therefore what the above equation is saying is that we can encode $$X^n$$ with error chance $$\delta$$ using a number of bits per symbol that differs from the entropy $$H(X)$$ by only a small constant $$\epsilon$$. This is the source coding theorem. It is a big deal, because we've shown that entropy is related to the number of bits per symbol we need to do encoding in a lossy compression scheme.</p><p>(You can get to a similar result with lossless compression schemes where, instead of throwing away the ability to encode all sequences not in $$S_\delta$$ and just accepting the inevitable error, you instead have an encoding scheme where you reserve one bit to indicate whether or not an $$x^n$$ drawn from $$X^n$$ is in $$S_\delta$$, and if it is you encode it like above, and if it isn't you encode it using $$\log |A|^n$$ bits. 
Then you'll find that the probability of having to do the latter step is small enough that $$\log |A|^n > \log |S_\delta|$$ doesn't matter very much.)</p><h3 id="typical-sets">Typical sets</h3><p>Before going into the proof, it is useful to investigate what sorts of sequences $$x^n$$ we tend to pull out from $$X^n$$ for some $$X$$. The basic observation is that most $$x^n$$ are going to be neither the least probable nor the most probable out of all $$x^n$$. For example, "ZZZZZZZZZZ" would obviously be an unusual set of letters to draw at random if you're selecting them from English letter frequencies. However, so would "EEEEEEEEEE". Yes, this individual sequence is much more likely than "ZZZZZZZZZZ" or any other sequence, but there is only one of them, so getting it would still be surprising. To take another example, the typical sort of result you'd expect from a coin loaded so that $$P(\text{"heads"}) = 0.75$$ isn't runs of only heads, but rather an approximately 3:1 mix of heads and tails. </p><p>The distribution of letter counts follows a multinomial distribution (the generalisation of the binomial distribution). Therefore (if you think about what a multinomial distribution is, or if you know that the mean is $$n p_{x_i}$$ for the $$i$$th variable) in $$x^n$$ we'd expect roughly $$np_e$$ of the letter e, $$np_z$$ of the letter z, and so on - and $$np_e \ll n$$ even though $$p_e > p_L$$ for every other letter $$L$$ in the alphabet. Slightly more precisely (if you happen to know this fact), the variance of variable $$x_i$$ is $$np_{x_i}(1-p_{x_i})$$, implying that the standard deviation grows only in proportion to $$\sqrt{n}$$, so for large $$n$$ it is very rare to get an $$x^n$$ with counts of $$x_i$$ that differ wildly from the expected count $$np_{x_i}$$. </p><p>Let's define a notion of "typicality" for a sequence $$x^n$$ based on this idea of it being unusual if $$x^n$$ is either a wildly likely or wildly unlikely sequence. 
The median sequence has $$np_{x_i}$$ of each symbol $$x_i$$, so has probability $$$ P(x^n) = p_{x_1}^{np_{x_1}}p_{x_2}^{np_{x_2}} \ldots p_{x_{|A|}}^{np_{x_{|A|}}} $$$ which in turn has a Shannon information content of $$$ -\log P(x^n) = -\sum_i np_{x_i} \log p_{x_i} = n H(X). $$$ Oh look, entropy pops up again. How surprising.</p><p>Now we make the following definition: a sequence $$x^n$$ is $$\epsilon$$-typical if its information content per symbol is $$\epsilon$$-close to $$H(X)$$, that is $$$ \left| - \frac{1}{n}\log{P(x^n)} - H(X) \right| < \epsilon. $$$ Define the typical set $$T_{n\epsilon}$$ to be the set of length-$$n$$ sequences (drawn from $$X^n$$) that are $$\epsilon$$-typical.</p><p>$$T_{n\epsilon}$$ is a small subset of the set $$A^n$$ of all length-$$n$$ sequences. We can see this through the following reasoning: for any $$x^n \in T_{n\epsilon}$$, $$-\frac{1}{n} \log P(x^n) \approx H(X)$$ which implies that $$$ P(x^n) \approx 2^{-nH(X)} $$$ and therefore that there can only be roughly $$2^{nH(X)}$$ such sequences; otherwise their probability would add up to more than 1. In comparison, the number of possible sequences $$|A^n| = 2^{n \log |A|}$$ is significantly larger, since $$H(X) \leq \log |A|$$ for any random variable $$X$$ with alphabet / outcome set $$A$$ (with equality if $$X$$ has a uniform distribution over $$A$$).</p><h3 id="the-typical-set-contains-most-of-the-probability">The typical set contains most of the probability</h3><p>Chebyshev's inequality states that $$$ P((X-\mathbb{E}[X])^2 \geq a) \leq \frac{\sigma^2}{a} $$$ where $$\sigma^2$$ is the variance of the random variable $$X$$, and $$a > 0$$. It is proved <a href="http://www.strataoftheworld.com/2021/01/data-science-2.html">here</a> (search for "Chebyshev").</p><p>Earlier we defined the $$\epsilon$$-typical set as $$$ T_{n\epsilon} = \left\{ x^n \in A^n \,\text{ such that } \, \left| -\frac{1}{n}\log P(X^n) - H(X) \right| < \epsilon \right\}. 
$$$ Note that $$$ \mathbb{E}\left[-\frac{1}{n}\log P(X^n)\right] = -\frac{1}{n} \sum_i \mathbb{E}[\log P(X_i)]$$$ $$$ = -\mathbb{E}[\log P(X)]$$$ $$$ = H(X) $$$ by using independence of the $$X_i$$ making up $$X^n$$ (so that $$\log P(X^n) = \sum_i \log P(X_i)$$) together with linearity of expectation in the first step, and the fact that all $$X_i$$ are independent draws of the same random variable $$X$$ in the second.</p><p>Therefore, we can now rewrite the typical set definition equivalently as $$$ T_{n\epsilon} = \left\{ x^n \in A^n \,\text{ such that } \, \left( -\frac{1}{n}\log P(x^n) - H(X) \right)^2 < \epsilon^2 \right\}$$$ $$$= \left\{ x^n \in A^n \,\text{ such that } \, \left( Y - \mathbb{E}[Y] \right)^2 < \epsilon^2 \right\} $$$ for $$Y = -\frac{1}{n} \log P(X^n)$$, which is in the right form to apply Chebyshev's inequality to get a probability of belonging to this set, except for the fact that the inequality points the wrong way. Very well - we'll instead consider the set of sequences $$\bar{T}_{n\epsilon} = A^n - T_{n\epsilon}$$ (i.e. all length-$$n$$ sequences that are not typical), which can be defined as $$$ \bar{T}_{n \epsilon} = \left\{ x^n \in A^n \,\text{ such that } \, (Y - \mathbb{E}[Y])^2 \geq \epsilon^2 \right\} $$$ and use Chebyshev's inequality to conclude that $$$ P((Y - \mathbb{E}[Y])^2 \geq \epsilon^2) \leq \frac{\sigma_Y^2}{\epsilon^2} $$$ where $$\sigma_Y^2$$ is the variance of $$Y= -\frac{1}{n} \log P(X^n)$$. This is exciting - we have a bound on the probability that a sequence is not in the typical set - but we want to link this probability to $$n$$ somehow. Let $$Z = -\log P(X)$$, and note that $$Y$$ can be written as the average of many draws from $$Z$$. 
Therefore $$$ \mathbb{E}[Y] = \mathbb{E}\left[\frac{1}{n} \sum_i Z_i\right] = \frac{1}{n} \sum_i \mathbb{E}[Z_i] = \mathbb{E}[Z] $$$ and since $$Y = \frac{1}{n} \sum_i Z_i$$ for independent $$Z_i$$, the variance of $$Y$$, $$\sigma_Y^2$$, is equal to $$\frac{1}{n} \sigma_Z^2$$ (a basic law of how variance works that is often used in statistics). We can substitute this into the expression above to get $$$ P((Y-\mathbb{E}[Y])^2 \geq \epsilon^2) \leq \frac{\sigma_Z^2}{n\epsilon^2}. $$$ The probability on the left-hand side is identical to $$P((-\frac{1}{n} \log P(X^n) - H(X) )^2 \geq \epsilon^2)$$, which is the probability of the condition that $$X^n$$ is <i>not</i> in the $$\epsilon$$-typical set $$T_{n\epsilon}$$, which gives us our grand result $$$ P(X^n \in T_{n\epsilon}) \ge 1 - \frac{\sigma_Z^2}{n\epsilon^2}. $$$ $$\sigma_Z^2$$ is the variance of $$Z = -\log P(X)$$; it depends on the particulars of the distribution and is probably hell to calculate. However, what we care about is that if we just crank up $$n$$, we can make this probability as close to 1 as we like, regardless of what $$\sigma_Z^2$$ is, and regardless of what we set as $$\epsilon$$ (the parameter setting how wide the probability range for the typical set is).</p><p>The key idea is this: asymptotically, as $$n \to \infty$$, more and more of the probability mass of possible length-$$n$$ sequences is concentrated among those that have a probability of between $$2^{-n(H(X)+\epsilon)}$$ and $$2^{-n(H(X) - \epsilon)}$$, regardless of what (positive real) $$\epsilon$$ you set. 
This is known as the "asymptotic equipartition property" (it might be more appropriate to call it an "asymptotic approximately-equally-partitioning property" because it's not really an "equipartition", since depending on $$\epsilon$$ these can be very different probabilities, but apparently that was too much of a mouthful even for the mathematicians).</p><h3 id="finishing-the-proof">Finishing the proof</h3><p>As a reminder of where we are: we stated without proof $$$ \left| \frac{1}{n}H_\delta(X^n) - H(X) \right| < \epsilon. $$$ and noted that this is an interesting result that also gives meaning to entropy, since we see that it's related to how many bits it takes for a naive coding scheme to express $$X^n$$ (with error probability $$\delta$$).</p><p>Then we went on to talk about typical sets, and ended up finding that the probability that an $$x^n$$ drawn from $$X^n$$ lies in the set $$$ T_{n \epsilon} =\left\{ x^n \in A^n \,\text{ such that } \, \left| -\frac{1}{n}\log P(X^n) - H(X) \right| < \epsilon \right\}. $$$ approaches 1 as $$n \to \infty$$, despite the fact that $$T_{n\epsilon}$$ has only approximately $$2^{nH(X)}$$ members, which, for distributions of $$X$$ that are not very close to the uniform distribution over the alphabet $$A$$, is a small fraction of the $$2^{n \log |A|}$$ possible length-$$n$$ sequences.</p><p>Remember that $$H_\delta(X^n) = \log |S_\delta|$$, and $$S_\delta$$ was the smallest subset of $$A^n$$ such that it contains sequences whose probability sums to at least $$1 - \delta$$. This is a bit like the typical set $$T_{n\epsilon}$$, which also contains sequences making up most of the probability mass. Note that $$T_{n\epsilon}$$ is less efficient; $$S_\delta$$ optimally contains all sequences with probability greater than some threshold, whereas $$T_{n\epsilon}$$ generally omits the highest-probability sequences (settling instead for sequences of the same probability as most sequences that are drawn from $$X^n$$). 
Therefore $$$ H_\delta(X^n) \leq \log |T_{n\epsilon}| $$$ for an $$n$$ that depends on what $$\delta$$ and $$\epsilon$$ we want. Now we can get an upper bound on $$H_\delta(X^n)$$ if we can upper-bound $$|T_{n\epsilon}|$$. Looking at the definition, we see that the probability of a sequence $$x^n \in T_{n\epsilon}$$ must obey $$$ 2^{-n(H(X) + \epsilon)} < P(x^n) < 2^{-n(H(X) - \epsilon)}. $$$ $$T_{n\epsilon}$$ has the largest number of elements if all elements have the lowest possible probability $$p = 2^{-n(H(X)+\epsilon)}$$, and if that is the case it has at most $$1/p$$ of such lowest-probability elements since the probabilities cannot add to more than one, which implies $$|T_{n\epsilon}| < 2^{n(H(X)+\epsilon)}$$. Therefore $$$ H_\delta(X^n) \leq \log |T_{n\epsilon}| < \log(2^{n(H(X)+\epsilon)}) = n(H(X) + \epsilon) $$$ and we have a bound $$$ H_\delta(X^n) < n(H(X) + \epsilon). $$$ If we can now also find the bound $$n(H(X) - \epsilon) < H_\delta(X^n)$$, we've shown $$|\frac{1}{n} H_\delta(X^n) - H(X)| < \epsilon$$ and we're done. The proof of this bound is a proof by contradiction. Imagine that there is an $$S'$$ such that $$$ \frac{1}{n} \log |S'| \leq H(X) - \epsilon $$$ but also $$$ P(X^n \in S') \geq 1 - \delta. $$$ We want to show that $$P(X^n \in S')$$ can't actually be that large. For the other bound, we used our typical set successfully, so why not use it again? Specifically, write $$$ P(X^n \in S') = P(X^n \in S' \cap T_{n\varepsilon}) + P(X^n \in S' \cap \bar{T}_{n\varepsilon}) $$$ where $$\bar{T}_{n\varepsilon}$$ is again $$A^n - T_{n\varepsilon}$$, and noting that our constant $$\varepsilon$$ for $$T$$ is not the same as our constant $$\epsilon$$ in the bound. We want to set an upper bound on this probability; for that to hold, we need to make the terms on the right-hand side as large as possible. For the first term, this is if $$S' \cap T_{n\varepsilon}$$ is as large as it can be based on the bound on $$|S'|$$, i.e. 
$$2^{n(H(X)-\epsilon)}$$, and each sequence in it has the maximum probability $$2^{-n(H(X)-\varepsilon)}$$ of sequences in $$T_{n\varepsilon}$$. For the second term, this is if $$S' \cap \bar{T}_{n\varepsilon}$$ is restricted only by $$P(X^n \in \bar{T}_{n\varepsilon}) \leq \frac{\sigma^2}{n\varepsilon^2}$$, which we showed above. (Note that you can't have both of these conditions holding at once, but this does not matter since we only want to show a non-strict inequality.) Therefore we get $$$ P(X^n \in S') \leq 2^{n(H(X) - \epsilon)} 2^{-n(H(X)-\varepsilon)} + \frac{\sigma^2}{n\varepsilon^2} \ = 2^{-n(\epsilon - \varepsilon)} + \frac{\sigma^2}{n\varepsilon^2} $$$ and we see that since we are free to pick $$\varepsilon$$ with $$0 < \varepsilon < \epsilon$$, and as we're dealing with the case where $$n \to \infty$$, this probability is going to go to zero in the limit. But we had assumed $$P(X^n \in S') \geq 1 - \delta$$, so we have a contradiction, meaning no such $$S'$$ can exist, and therefore $$$ n(H(X) - \epsilon) < H_\delta(X^n). $$$ Combining this with the previous bound, we've now shown $$$ H(X) - \epsilon < \frac{1}{n} H_\delta(X^n) < H(X) + \epsilon $$$ which is the same as $$$ \left|\frac{1}{n}H_\delta(X^n) - H(X)\right| < \epsilon $$$ which is the source coding theorem that we wanted to prove.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-91311962977510142802022-06-20T16:27:00.004+01:002022-06-25T21:06:15.560+01:00Information theory 1<div id="information-theory-1" style="text-align: center;"><span style="font-size: x-small;"><i>5044 words, including equations (~30min)</i></span><br /></div><p>This is the first in a series of posts about information theory. A solid understanding of basic probability (random variables, probability distributions, etc.) is assumed.
This post covers:</p><ul><li>what information and entropy are, both intuitively and axiomatically</li><li>(briefly) the relation of information-theoretic entropy to entropy in physics</li><li>conditional entropy</li><li>joint entropy</li><li>KL distance (also known as relative entropy)</li><li>mutual information</li><li>some results involving the above quantities</li><li>the point of source coding and channel coding</li></ul><p>Future posts cover source coding and channel coding in detail.</p><h2 id="what-is-information-">What is information?</h2><p>How much information is there in the number 14? What about the word "information"? Or this blog post? These don't seem like questions with exact answers.</p><p>Imagine you already know that someone has drawn a number between 0 and 15 from a hat. Then you're told that the number is 14. How much additional information have you learned? A first guess at a definition for information might be that it's the number of questions you need to ask to become certain about an answer. We don't want arbitrary questions though; "what is the number?" is very different from "is the number zero?". So let's say that it has to be a yes-no question.</p><p>You can represent a number within some specific range as a series of yes-no questions by writing it out in base-2. In base-2, 14 is 1110. Four questions suffice: "is the leftmost base-2 digit a 0?", etc. The number of base-$$B$$ digits required to distinguish between $$n$$ equally likely possibilities is $$\lceil\log_B n\rceil$$, where $$\lceil x \rceil$$ means the smallest integer greater than or equal to $$x$$ (i.e., rounding up). Now maybe there should be some sense in which we can allow pointing at a number in the range 0 to 16 to have a bit more information than pointing at a number from 0 to 15, even though we can't literally ask 4.09 yes-no questions.
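The counting above can be sanity-checked in a few lines of Python (a throwaway sketch, nothing from the text):

```python
import math

# 14 in base-2 is 1110: four yes-no questions pin it down
assert format(14, "b") == "1110"

# distinguishing among n equally likely outcomes takes ceil(log2 n) questions
assert math.ceil(math.log2(16)) == 4   # numbers 0..15
assert math.ceil(math.log2(17)) == 5   # numbers 0..16 need a fifth question

# ...even though the "true" information content is only a little over four bits
print(round(math.log2(17), 2))  # 4.09
```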
So we might try to define our information measure as $$\log n$$ (in whatever base, since changing the base of the logarithm only changes the answer by a constant factor; let's say base-2 to maintain the correspondence to yes-no questions), where $$n$$ is the number of outcomes that the thing we now know was selected from.</p><p>Now let's say there's a shoe box we've picked up from a store. There are a gazillion things that could be inside the box, so $$n$$ is something huge. However, it seems that if we open the box and find a new pair of sneakers, we are less surprised than if we open the box and find the Shroud of Turin. We'd like some types of contents to carry quantitatively more information than others.</p><p>The standard sort of thing you do in this kind of situation is that you bring in probabilities. With drawing a number out of a hat, we have a uniform distribution where the probability for each outcome is $$p = 1/ n$$. So we might as well have written that information content is $$\log \frac{1}{p}$$, and gotten the same answer in that case. Since presumably the probability of your average shoe box containing sneakers is higher than the probability of it containing the Shroud of Turin, with this revised definition we now sensibly get that the latter gives us more information (because $$\log \frac{1}{p}$$ is a decreasing function of $$p$$). Note also that $$\log \frac{1}{p}$$ is the same as $$- \log p$$; we will usually use the latter form. This is called the Shannon information. To be precise:</p><blockquote><p><i>The (Shannon) information content of seeing a random variable $$X$$ take a value $$x$$ is $$$-\log p_x$$$ where $$p_x$$ is the probability that $$X$$ takes value $$x$$.
</i></p><p><i>We can see the behaviour of the information content of an event as a function of its probability here: </i></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyEA9S6CzMLwf4YkmXK0-pfFqZto-WTbTCjtxixK2QafIvHmbnmlKqPSWZN6Yj0fEMJBGAUclmullfxWV-9TChejDut2HmzT5-Y0WbvziC_5kWb0WrGSRHvOcG00bTsj3WC76FeyawjTIqEa0gOth87yiimQqg00FL80dB8LMQ5VbUAPgG25YlBpb9Rw/s866/ArcoLinux_2022-05-31_21-27-57.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="648" data-original-width="866" height="478" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyEA9S6CzMLwf4YkmXK0-pfFqZto-WTbTCjtxixK2QafIvHmbnmlKqPSWZN6Yj0fEMJBGAUclmullfxWV-9TChejDut2HmzT5-Y0WbvziC_5kWb0WrGSRHvOcG00bTsj3WC76FeyawjTIqEa0gOth87yiimQqg00FL80dB8LMQ5VbUAPgG25YlBpb9Rw/w640-h478/ArcoLinux_2022-05-31_21-27-57.png" width="640" /></a></div><br /><p><br /></p></blockquote><h3 id="axiomatic-definition">Axiomatic definition</h3><p>The above derivation was so hand-wavy that it wasn't even close to being a derivation.</p><p>When discovering/inventing the concept of Shannon information, Shannon started from the idea that the information contained in seeing an event is a function of that event's probability (and nothing else). Then he required three further axioms to hold for this function:</p><ul><li>If the probability of an outcome is 1, it contains no information. This makes sense - if you already know something with certainty, then you can't get more information by seeing it again.</li><li>The information contained in an event is a decreasing function of its probability of happening. Again, this makes sense: seeing something you think is very unlikely is more informative than seeing something you were pretty certain was already going to happen.</li><li>The information contained in seeing two independent events is the sum of the information of seeing them separately. 
We don't want to have to apply some obscure maths magic to figure out how much information we got in total from seeing one dice roll and then another.</li></ul><p>The last one is the big hint. The probability of seeing random variable (RV) $$X$$ take value $$x$$ and RV $$Y$$ take value $$y$$ is $$p_x p_y$$ if $$X$$ and $$Y$$ are independent. We want a function, call it $$f$$, such that $$f(p_x p_y) = f(p_x) + f(p_y)$$. This is the most important property of logarithms. You can do some more maths to demonstrate that logarithms (to some base) are the only functions that fit these requirements, or you can just guess that it's a $$\log$$ and move on. We'll do the latter.</p><h3 id="entropy">Entropy</h3><p>Entropy is the flashy term that comes up in everything from chemistry to .zip files to the fundamental fact that we're all going to die. It is often introduced as something like "[mumble mumble] a measure of information [mumble mumble]".</p><p>It is important to distinguish between information and entropy. Information is a function of an outcome (of a random variable), for example the outcome of an experiment. Entropy is a function of a random variable, for example an experiment before you see the outcome. Specifically,</p><blockquote><p><i> The <b>entropy</b> $$H(X)$$ is the expected information gain from a random variable $$X$$: $$$ H(X) = \underset{x_i \sim X}{\mathbb{E}}\Big[-\log P(X=x_i)\Big] \ = -\sum_i p_{x_i} \log p_{x_i} $$$ ($$\underset{x_i \sim X}{\mathbb{E}}$$ means the expected value when value $$x_i$$ is drawn from the distribution of RV $$X$$. $$P(X=x_i)$$, alternatively denoted $$p_{x_i}$$ when $$X$$ is clear from context, is the probability of $$X$$ taking value $$x_i$$.)</i></p></blockquote><p>(Why is entropy denoted with an $$H$$? I don't know.
Just be thankful it wasn't a random <i>Greek</i> letter.)</p><p>Imagine you're guessing a number between 0 and 15 inclusive, and the current state of your beliefs is that it is as likely to be any of these numbers. You ask "is the number 9?". If the answer is yes, you've gained $$-\log_2 \frac{1}{16} = \log_2 16 = 4$$ bits of information. If the answer is no, you've gained $$-\log_2 \frac{15}{16} = \log_2 16 - \log_2 15 \approx 0.093$$ bits of information. The probability of the first outcome is 1/16 and the probability of the second is 15/16, so the entropy is $$\frac{1}{16} \times 4 + \frac{15}{16} \times 0.093 \approx 0.337$$ bits.</p><p>In contrast, if you ask "is the number smaller than 8?", you always get $$-\log_2 \frac{8}{16} = \log_2{2} = 1$$ bit of information, and therefore the entropy of the question is 1 bit.</p><p>Since entropy is expected information gain, whenever you prepare a random variable for the purpose of getting information by observing its value, you want to maximise its entropy.</p><p>The closer a probability distribution is to a uniform distribution, the higher its entropy.
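The worked example above is easy to verify numerically; here is a small Python sketch (the helper function is mine, not from the text):

```python
import math

def entropy_bits(probs):
    """Entropy in bits of a discrete distribution, given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# "Is the number 9?": yes with probability 1/16, no with probability 15/16
h_point = entropy_bits([1/16, 15/16])
print(round(h_point, 3))  # 0.337

# "Is the number smaller than 8?": a 50/50 split
h_half = entropy_bits([8/16, 8/16])
print(h_half)  # 1.0

# and a uniform guess over all 16 numbers has the maximum entropy, log2 16
print(entropy_bits([1/16] * 16))  # 4.0
```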
The maximum entropy of a distribution with $$n$$ possible outcomes is the entropy of the uniform distribution $$U_n$$, which is $$$ H(U_n) = -\sum_i p_{u_i} \log p_{u_i} = -\sum_i \frac{1}{n} \log \frac{1}{n} \ = -\log \frac{1}{n} = \log n $$$ (This can be proved easily once we introduce some additional concepts.)</p><p>A general and very helpful principle to remember is that RVs with uniform distributions are most informative.</p><p>The above definition of entropy is sometimes called Shannon entropy, to distinguish it from the older but weaker concept of entropy in physics.</p><h4 id="entropy-in-physics">Entropy in physics</h4><p>The physicists' definition of entropy is a constant times the logarithm of the number of possible states that correspond to the observable macroscopic characteristics of a thermodynamic system: $$$ S=k_B \ln W $$$ where $$k_B$$ is the Boltzmann constant, $$\ln$$ is used instead of $$\log_2$$ because physics, and $$W$$ is the number of microstates. (Why do physicists denote entropy with the letter $$S$$? I don't know. Just be glad it wasn't a random <i>Hebrew</i> letter.)</p><p>In plain language: it is proportional to the Shannon entropy of finding out the exact configuration of bouncing atoms of the hot/cold/whatever box you're looking at, out of all the ways the atoms could be bouncing inside that box given that the box is hot/cold/whatever, assuming that all those ways are equally likely. It is less general than the information theoretic entropy in the sense that it assumes a uniform distribution.</p><p>Entropy, either the Shannon or the physics version, seems abstract; random variables, numbers of microstates, what? However, $$S$$ as defined above has very real physical consequences.
There's an important thermodynamics equation relating a change in entropy $$\delta S$$, a change in heat energy $$\delta Q$$, and temperature $$T$$ for a reversible process: $$T\delta S = \delta Q$$. This sets a lower bound on how much energy you need to discover information (i.e., reduce the number of microstates that might be behind the macrostate you observe). Getting one bit of information means that $$\delta S$$ is $$k_B \ln 2$$ (from the definition of $$S$$), so at temperature $$T$$ kelvins we need $$k_B T \ln 2 \approx 9.6 \times 10^{-24} \times T$$ joules. This prevents arbitrarily efficient computers, and saves us from problems like Maxwell's demon. (Maxwell's demon is a thought experiment in physics: couldn't you violate the principle of increasing entropy (a physics thing) by building a box with a wall cutting it in half with a "demon" (some device) that lets slow particles pass left-to-right only and fast particles right-to-left, thus separating particles by temperature and reducing the number of microstates corresponding to the configuration of atoms inside the box? No, because the demon needs to expend energy to get information.)</p><p>Finally, is there an information-theoretic analogue of the second law of thermodynamics, which states that the entropy of a system always increases? You have to make some assumptions, but you can get to something like it, which I will sketch out in <i>very</i> rough detail and without explaining the terms (see Chapter 4 of <i>Elements of Information Theory</i> for the details). Imagine you have a probability distribution on the state space of a Markov chain. Now it is possible to prove that given any two such probability distributions, the distance between them (as measured using relative entropy; see below) is non-increasing.
Now assume it also happens to be the case that the stationary distribution of the Markov chain is uniform (the stationary distribution is the probability distribution over states such that if every state sends out its probability mass according to the transition probabilities, you get back to the same distribution). We can consider an arbitrary probability distribution over the states, and compare it to the unchanging uniform one, and use the result that the distance between them is non-increasing to deduce that an arbitrary probability distribution will tend towards the uniform (= maximal entropy) one.</p><p>Reportedly, von Neumann (a polymath whose name appears in any mid-1900s mathsy thing) advised Shannon thus:</p><blockquote><p><i>"You should call [your concept] entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage."</i></p></blockquote><h3 id="intuition">Intuition</h3><p>We've snuck in the assumption that all information comes in the form of:</p><ol> <li>You first have some <i>quantitative</i> uncertainty over a <i>known set</i> of possible outcomes, which you specify in terms of a random variable $$X$$.</li><li>You find out the value that $$X$$ has taken.</li></ol><p>There's a clear random variable if you're pulling numbers out of a hat: the possible values of $$X$$ are the numbers written on the pieces of paper in the hat, and they all have equal probability. But where is the random variable when the piece of information you get is, say, the definition of information? (I don't mean here the literal characters on the screen - that's a more boring question - but instead the knowledge about information theory that is now (hopefully) in your brain). 
The answer would have to be something like "the random variable representing all possible definitions of information" (with a probability distribution that is, for example, skewed towards definitions that include a $$\log$$ somewhere because you remember seeing that before).</p><p>This is a bit tricky to think about, but we see that even in this kind of weird case you can specify some kind of set and probabilities over that set. Fundamentally, knowledge (or its lack) is about having a probability distribution over states. Perfect knowledge means you have probability $$1.00$$ on exactly one state of how something could be. If you're very uncertain, you have a huge probability distribution over an unimaginably large set of states (for example, all possible concepts that might be a definition of information). If you've literally seen nothing, then you're forced to rely on some guess for the prior distribution over states, like all those pesky Bayesian statisticians keep saying.</p><h2 id="more-quantities">More quantities</h2><h3 id="conditional-entropy">Conditional entropy</h3><p>Entropy is a function of the probability distribution of a random variable. We want to be able to calculate the entropies of the random variables we encounter.</p><p>A common combination of random variables we see is $$X$$ given $$Y$$, written $$X | Y$$. The definition is $$$ P(X = x \, |\, Y = y) = \frac{P(X = x \,\land\, Y = y)}{P(Y=y)}. $$$ It is a common mistake to think that $$H(X|Y) = -\sum_i P(X = x_i | Y = y) \log P(X = x_i | Y = y)$$. What is it then? 
Let's just do the algebra: $$$ H(X|Y) = -\underset{x \sim X|Y, y \sim Y}{\mathbb{E}} \big( \log P(X=x|Y=y) \big) $$$ from the definition of the entropy as the expectation of the Shannon information content, and then by algebra: $$$ H(X|Y) = -\underset{x \sim X|Y, y \sim Y}{\mathbb{E}} \big[ \log P(X=x|Y=y) \big]$$$ $$$ = -\sum_{y \in \mathcal{Y}} P(Y=y) \sum_{x \in \mathcal{X}} P(X=x | Y=y) \log P(X=x \,|\, Y = y)$$$ $$$ = -\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}}P(X=x\,\land\, Y = y) \log P(X=x \,|\, Y = y) $$$ where $$\mathcal{X}$$ and $$\mathcal{Y}$$ are simply the sets of possible values of $$X$$ and $$Y$$ respectively. In a trick beloved of bloggers everywhere tired of writing up equations as $$\LaTeX$$, the above is often abbreviated $$$ -\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log p(x|y) $$$ where we use $$p$$ as a generic notation for "probability of whatever; random variables left implicit".</p><blockquote><p><i>The <b>conditional entropy</b> of a random variable $$X$$ given the value of another random variable $$Y$$ is written $$H(X|Y)$$ and defined as $$$ H(X|Y) = - \sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log p(x|y) $$$ which is lazier notation for $$$ -\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}}P(X=x\,\land\, Y = y) \log P(X=x \,|\, Y = y), $$$ and also equal to $$$ -\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log \frac{p(x, y)}{p(y)}. $$$ It is most definitely not equal to $$-\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x | y) \log p(x | y)$$.</i></p></blockquote><p>Conditional entropy is a measure of how much information we expect to get from a random variable assuming we've already seen another one. If the RVs $$X$$ and $$Y$$ are independent, the answer is that $$H(X|Y) = H(X)$$. If the value of $$Y$$ implies a value of $$X$$ (e.g.
"percentage of sales in the US" implies "percentage of sales outside the US"), then $$H(X|Y) = 0$$, since we can work out what $$X$$ is from seeing what $$Y$$ is.</p><h3 id="joint-entropy">Joint entropy</h3><p>Now if $$H(X|Y)$$ is how much expected surprise there is left in $$X$$ after you've seen $$Y$$, then $$H(X|Y) + H(Y)$$ would sensibly be the total expected surprise in the combination of $$X$$ and $$Y$$. We write $$H(X,Y)$$ for this combination. If we do the algebra, we see that $$$ H(X,Y) = H(X|Y) + H(Y) $$$ $$$ = -\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log \frac{p(x, y)}{p(y)} - \sum_{y \in \mathcal{Y}} p(y) \log p(y) $$$ $$$= -\left(\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log p(x, y)\right) + \left( \sum_{y \in \mathcal{Y}, \,x\in \mathcal{X}} p(x,y) \log p(y)\right) -\left( \sum_{y \in \mathcal{Y}} p(y) \log p(y) \right) $$$ $$$= -\left(\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log p(x, y)\right) = H(Z) $$$ if $$Z$$ is the random variable formed of the pair $$(X, Y)$$ drawn from the joint distribution over $$X$$ and $$Y$$.</p><h3 id="kullback-leibler-divergence-aka-relative-entropy">Kullback-Leibler divergence, AKA relative entropy</h3><p>"Kullback-Leibler divergence" is a bit of a mouthful. It is also called KL divergence, KL distance, or relative entropy. Intuitively, it is a measure of the distance between two probability distributions. For probability distributions represented by functions $$p$$ and $$q$$ over the same set $$\mathcal{X}$$, it is defined as $$$ D(p\,||\,q) = \sum_{x \in \mathcal{X}} p(x) \log \left(\frac{p(x)}{q(x)}\right). $$$ It's not a very good distance function: it is non-negative (and zero exactly when $$p = q$$), but it is not symmetric (i.e. $$D(p \,||\, q) \ne D(q \,||\, p)$$, as you can see from the definition, especially considering how it breaks when $$q(x) = 0$$ but not if $$p(x) = 0$$), and it does not satisfy the triangle inequality.
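The asymmetry (and the $$q(x) = 0$$ breakage) can be seen concretely; here is a small Python sketch with made-up example distributions:

```python
import math

def kl_bits(p, q):
    """D(p || q) in bits; infinite if q rules out an outcome that p allows."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # 0 log 0 = 0 by convention
        if qi == 0:
            return math.inf   # p puts mass where q puts none
        total += pi * math.log2(pi / qi)
    return total

p = [0.5, 0.4, 0.1]
q = [1/3, 1/3, 1/3]
print(round(kl_bits(p, q), 4))  # 0.224
print(round(kl_bits(q, p), 4))  # 0.2963 - a different number: not symmetric

# an outcome that is impossible under the second argument makes D infinite,
# while the other direction stays finite
r = [0.5, 0.5, 0.0]
print(kl_bits(r, p))  # finite
print(kl_bits(p, r))  # inf
```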
However, it has a number of cool interpretations, including how many bits you expect to lose on average if you build a code assuming a probability distribution $$q$$ when it's actually $$p$$, and how many bits of information you get in a Bayesian update from distribution $$q$$ to distribution $$p$$. It is also a common loss function in machine learning. The first argument $$p$$ is generally some better or true model, and we want to know how far away $$q$$ is from it.</p><h3 id="why-the-uniform-distribution-maximises-entropy">Why the uniform distribution maximises entropy</h3><p>The KL divergence gives us a nice way of proving that the uniform distribution maximises entropy. Consider the KL divergence of an arbitrary probability distribution $$p$$ from the uniform probability distribution $$u$$: $$$ D(p \,||\, u ) = \sum_{x \in \mathcal{X}} p(x) \log \left(\frac{p(x)}{u(x)}\right) $$$ $$$= \sum_{x \in \mathcal{X}} \big( p(x) \log p(x)\big) - \sum_{x \in \mathcal{X}} \big(p(x) \log u(x) \big) $$$ $$$= -H(X) - \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{|\mathcal{X}|} $$$ $$$= -H(X) + \log |\mathcal{X}| = H(U) - H(X) $$$ where $$\mathcal{X}$$ is the set of values over which $$p$$ and $$u$$ have non-zero values, $$X$$ is a random variable distributed according to $$p$$, and $$U$$ is a random variable distributed according to $$u$$ (i.e. uniformly). This is the same thing as $$$ H(X) = H(U) - D(p \,||\,u) $$$ which implies that we can write the entropy of a random variable as the entropy of a uniform random variable over a set of the same size, minus the KL distance between the distribution of $$X$$ and the distribution of the uniform random variable.
Also, since the KL divergence is guaranteed to be non-negative, this implies that $$$ H(X) \leq H(U) $$$ and therefore that the uniform random variable has at least as high an entropy as any other random variable over the same number of outcomes.</p><h3 id="mutual-information">Mutual information</h3><p>Earlier, we saw that $$H(X, Y) = H(X|Y) + H(Y) = H(X) + H(Y|X)$$. As a picture:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrGbqLctRwUw-setCmJDZ_tzd-1XQKwulod9TLgyVehtBNgORCubRlgNWfEhnIpAwYtohX8c3pDV6r3PIpma1YEUxSAiBo4E6qrt1mNoGPS7eGFXPJ7fwNeCnN3XKZMxPT9G0TvS4FftNrmdrlmBh3vdv4s3LFrZYjOqdr304iS8N4xGxyLmX-MNZdYw/s762/ArcoLinux_2022-05-31_22-25-08.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="348" data-original-width="762" height="183" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrGbqLctRwUw-setCmJDZ_tzd-1XQKwulod9TLgyVehtBNgORCubRlgNWfEhnIpAwYtohX8c3pDV6r3PIpma1YEUxSAiBo4E6qrt1mNoGPS7eGFXPJ7fwNeCnN3XKZMxPT9G0TvS4FftNrmdrlmBh3vdv4s3LFrZYjOqdr304iS8N4xGxyLmX-MNZdYw/w400-h183/ArcoLinux_2022-05-31_22-25-08.png" width="400" /></a></div><br /><p>There's an overlapping region, representing the information you get no matter which of $$X$$ or $$Y$$ you look at. We call this the mutual information, a refreshingly sensible name, and denote it $$I(X;Y)$$, somewhat less sensibly. One way to find it is $$$ I(X;Y) = H(X,Y) - H(X|Y) - H(Y|X)$$$ $$$= - \sum_{x,y} p(x,y) \log p(x,y) \,+\, \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)} \,+\, \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)}$$$ $$$= \sum_{x,y} p(x,y) \big( \log p(x,y) - \log p(x) - \log p(y) \big)$$$ $$$= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}. $$$ Does this look familiar? Recall the definition $$$ D(p\,||\,q) = \sum_{x \in \mathcal{X}} p(x) \log \left(\frac{p(x)}{q(x)}\right).
$$$ What we see is that $$$ I(X;Y) = D(p(x, y) \, || \, p(x) p(y)), $$$ or in other words that the mutual information between $$X$$ and $$Y$$ is the "distance" (as measured by KL divergence) between the probability distributions $$p(x,y)$$ - the joint distribution between $$X$$ and $$Y$$ - and $$p(x) p(y)$$, the joint distribution that $$X$$ and $$Y$$ would have if $$x$$ and $$y$$ were drawn independently.</p><p>If $$X$$ and $$Y$$ are independent, then these are the same distribution, and their KL divergence is 0.</p><p>If the value of $$Y$$ can be determined from the value of $$X$$, then the joint probability distribution of $$X$$ and $$Y$$ is a table where for every $$x$$, there is only one $$y$$ such that $$p(x,y) > 0$$ (otherwise, there would be a value $$x$$ such that there is uncertainty about $$Y$$). Let the function mapping an $$x$$ to the singular $$y$$ such that $$p(x,y) > 0$$ be $$f$$. Then $$$ I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$$ $$$= \sum_y p(y) \sum_{x | f(x) = y} p(x|y) \log \frac{p(x, f(x))}{p(x)p(y)}. $$$ Now $$p(x, f(x)) = p(x)$$, because there is no $$y \ne f(x)$$ such that $$p(x, y) \ne 0$$. Therefore we get that the above is equal to $$$ \sum_y p(y) \sum_{x | f(x) = y} p(x|y) \log \frac{p(x)}{p(x)p(y)}\ = - \sum_y p(y) \sum_{x | f(x) = y} p(x|y) \log p(y), $$$ and since $$\log p(y)$$ does not depend on $$x$$, we can sum out the probability distribution to get $$$ -\sum_y p(y) \log p(y) = H(Y). $$$ In other words, if $$Y$$ can be determined from $$X$$, then the expected information that $$X$$ gives about $$Y$$ is the same as the expected information given by $$Y$$. 
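Both facts - that the KL form and the entropy form of $$I(X;Y)$$ agree, and that $$I(X;Y) = H(Y)$$ when $$Y$$ is a function of $$X$$ - can be checked numerically. A Python sketch with made-up joint distributions:

```python
import math

def H(probs):
    """Entropy in bits; zero-probability outcomes are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution given as a table of p(x, y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return H(px) + H(py) - H([p for row in joint for p in row])

# a generic joint distribution: the KL form D(p(x,y) || p(x)p(y)) matches
joint = [[0.3, 0.1],
         [0.1, 0.5]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
kl_form = sum(pxy * math.log2(pxy / (px[i] * py[j]))
              for i, row in enumerate(joint)
              for j, pxy in enumerate(row) if pxy > 0)
print(abs(mutual_information(joint) - kl_form) < 1e-12)  # True

# when Y is a deterministic function of X, I(X;Y) = H(Y)
det = [[0.2, 0.0], [0.3, 0.0], [0.0, 0.5]]
py_det = [sum(col) for col in zip(*det)]
print(abs(mutual_information(det) - H(py_det)) < 1e-12)  # True
```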
</p><p>We can graphically represent the relations between $$H(X)$$, $$H(Y)$$, $$H(X|Y)$$, $$H(Y|X)$$, $$H(X,Y)$$, and $$I(X;Y)$$ like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIfRf2gkkFvGyjcbKP1ae7_GVlagYfW5Mz6vxVP_BmjGWMkrG53xsdW_b1vXH9dgXWoVlTl6Ic8TGJIyQ3WXSecZ8J4MlEvMoY3NgubTGjDIicpywUD7xLht0GuipBnS4DYOmmAEH6J7Fb39HMKoePq6yDFJNZHCaMSwtsUaI8wTZ49E93yLD8OZRg9g/s756/ArcoLinux_2022-05-31_22-25-34.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="447" data-original-width="756" height="378" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIfRf2gkkFvGyjcbKP1ae7_GVlagYfW5Mz6vxVP_BmjGWMkrG53xsdW_b1vXH9dgXWoVlTl6Ic8TGJIyQ3WXSecZ8J4MlEvMoY3NgubTGjDIicpywUD7xLht0GuipBnS4DYOmmAEH6J7Fb39HMKoePq6yDFJNZHCaMSwtsUaI8wTZ49E93yLD8OZRg9g/w640-h378/ArcoLinux_2022-05-31_22-25-34.png" width="640" /></a></div><br /><p><br /></p><p>Having this image in your head is the single most valuable thing you can do to improve your ability to follow information theoretic maths. 
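All of the relations encoded in that picture can be checked numerically for any small joint distribution; a quick Python sketch (the joint table below is arbitrary):

```python
import math

def H(probs):
    """Entropy in bits of a distribution given as a flat list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

joint = [[0.25, 0.05],
         [0.10, 0.60]]   # an arbitrary joint distribution p(x, y)
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

H_xy = H([p for row in joint for p in row])
H_x_given_y = -sum(joint[i][j] * math.log2(joint[i][j] / py[j])
                   for i in range(2) for j in range(2) if joint[i][j] > 0)
H_y_given_x = -sum(joint[i][j] * math.log2(joint[i][j] / px[i])
                   for i in range(2) for j in range(2) if joint[i][j] > 0)
I = sum(joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
        for i in range(2) for j in range(2) if joint[i][j] > 0)

assert abs(H_xy - (H(px) + H_y_given_x)) < 1e-12   # H(X,Y) = H(X) + H(Y|X)
assert abs(H_xy - (H_x_given_y + I + H_y_given_x)) < 1e-12
assert abs(H(px) - (I + H_x_given_y)) < 1e-12      # H(X) = I(X;Y) + H(X|Y)
print("all diagram relations hold")
```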
Just to spell it out, here are some of the results you can read out from it: $$$H(X,Y) = H(X) + H(Y|X) $$$ $$$H(X,Y) = H(X|Y) + H(Y) $$$ $$$H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X) $$$ $$$H(X,Y) = H(X) + H(Y) - I(X;Y) $$$ $$$H(X) = I(X;Y) + H(X|Y)$$$ This diagram is also sometimes drawn with Venn diagrams:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggLIC4G6xVL1Ye-9BSkZLZCOyJc-7dI_SI-Ce1leWZvOj_H9F5wFsNz28g1GXe7pnJnfIYO-unAm2uHj3LzYcwVArgeTzuoX26d0tZ4GT6YBTR5iYA4d03t65q0z8THahQFfuGjpX066pe5n1r81dq7DZNLoE7hEpOdCeHnFpr_JWSoaHZupsV76mESw/s396/ArcoLinux_2022-05-31_22-26-00.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="396" data-original-width="382" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggLIC4G6xVL1Ye-9BSkZLZCOyJc-7dI_SI-Ce1leWZvOj_H9F5wFsNz28g1GXe7pnJnfIYO-unAm2uHj3LzYcwVArgeTzuoX26d0tZ4GT6YBTR5iYA4d03t65q0z8THahQFfuGjpX066pe5n1r81dq7DZNLoE7hEpOdCeHnFpr_JWSoaHZupsV76mESw/w386-h400/ArcoLinux_2022-05-31_22-26-00.png" width="386" /></a></div><br /><p><br /></p><h3 id="data-processing-inequality">Data processing inequality</h3><p>A Markov chain is a series of random variables such that the $$(n+1)$$th is only directly influenced by the $$n$$th. If $$X \to Y \to Z$$ is a Markov chain, it means that all effects $$X$$ has on $$Z$$ are through $$Y$$.</p><p>The data processing inequality states that if $$X \to Y \to Z$$ is a Markov chain, then $$$ I(X; Y) \geq I(X; Z).
$$$ This should be pretty intuitive, since the mutual information $$I(X;Y)$$ between $$X$$ and $$Y$$, which have a direct causal link between them, shouldn't be lower than that between $$X$$ and the more-distant $$Z$$, which $$X$$ can only influence through $$Y$$.</p><p>A special case is the Markov chain $$X \to Y \to f(Y)$$, where $$X$$ is, say, what happened in an abandoned parking lot at 3am, $$Y$$ is the security camera footage, and $$f$$ is some image enhancing process (more generally: any deterministic function of the data $$Y$$). The data processing inequality tells us that $$$ I(X; Y) \geq I(X; f(Y)). $$$ In essence, this means that any function you try to apply to some data $$Y$$ you have about some event $$X$$ cannot increase the information about the event that is available. Any enhancing function can only make it easier to spot some information about the event that is <i>already present</i> in the data you have about it (and the function might very plausibly destroy some). If all you have are four pixels, no amount of image enhancement wizardry will let you figure out the perpetrator's eye colour.</p><p>The proof (for the general case of $$X \to Y \to Z$$) goes like this: consider $$I(X; Y,Z)$$ (that is, the mutual information between knowing $$X$$ and knowing both $$Y$$ and $$Z$$).
Now consider the different values in Venn diagram form:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG5o6kK-DxPU0tQRHkIgEevjypvmeikAoGX5qDjDOYKvc8C0azyFVA1evw5iBoD-a5jPBSamLrdOePJSPvwzV-MhKsdnwFYv6pRXnN2wL5BkXyCS5Lehg8QVL4-ZysYIHhPo56LyzLyTnescNUIZSuO5dOcHUm6EhdiAETvC0gtXQ4JR0XSSN3aaqaDg/s846/ArcoLinux_2022-05-31_22-59-32.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="688" data-original-width="846" height="325" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG5o6kK-DxPU0tQRHkIgEevjypvmeikAoGX5qDjDOYKvc8C0azyFVA1evw5iBoD-a5jPBSamLrdOePJSPvwzV-MhKsdnwFYv6pRXnN2wL5BkXyCS5Lehg8QVL4-ZysYIHhPo56LyzLyTnescNUIZSuO5dOcHUm6EhdiAETvC0gtXQ4JR0XSSN3aaqaDg/w400-h325/ArcoLinux_2022-05-31_22-59-32.png" width="400" /></a></div><br /><p><br /></p><p>$$I(X; Y, Z)$$ corresponds to all areas within the circle representing $$X$$ that are also within at least one of the circle for $$Y$$ or $$Z$$. If we knew both $$Y$$ and $$Z$$, this "bite" is how much would be taken out of the uncertainty $$H(X)$$ of $$X$$.</p><p>We see that the red lined area is $$I(X; Y|Z)$$ (the information shared between $$X$$ and the part of $$Y$$ that remains unknown if you know $$Z$$), and likewise the green hatched area is $$I(X; Y; Z)$$ and the blue dotted area is $$I(X;Z|Y)$$. 
Since the red-lined and green-hatched areas together are $$I(X;Y)$$, and the green-hatched and blue-dotted areas together are $$I(X;Z)$$, we can write both $$$ I(X; \,Y,Z) = I(X;\,Y) + I(X;\,Z|Y)$$$ $$$I(X; \,Y,Z) = I(X;\,Z) + I(X;\,Y|Z) $$$ But hold on - $$I(X;Z|Y)=0$$ by the definition of a Markov chain, since no influence can pass from $$X$$ to $$Z$$ without going through $$Y$$, meaning that if we know everything about $$Y$$, nothing more we can learn about $$Z$$ will tell us anything more about $$X$$.</p><p>Since that term is zero, we have $$$ I(X; \; Y) = I(X; \; Z) + I(X; \, Y|Z) $$$ and since mutual information must be non-negative, this in turn implies $$$ I(X;Y) \geq I(X;Z). $$$</p><h2 id="two-big-things-source-channel-coding">Two big things: source & channel coding</h2><p>Much of information theory concerns itself with one of two goals.</p><p>Source coding is about data compression. It is about taking something that encodes some information, and trying to make it shorter without losing the information.</p><p>Channel coding is about error correction. 
It is about taking something that encodes some information, and making it longer to try to make sure the information can be recovered even if some errors creep in.</p><p>The basic model that information theory deals with is the following:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsS1fWWqne7wyRiNIS8jjsFxOPCb5B54n-xNiEKD4PYt3jwml1VCOrkX6JxhEXHuCFd7wv9Kr5Vvu1VCSi-PP74LtoZSto9IsYzZcKMJ3uzam7_JfrVUcerg51rWIdZzCQaxjLezaVeepV8TcaxudzUHzRTnCNqfWe-Ju4icIFHKqd5swb879emq1NTQ/s1220/ArcoLinux_2022-05-31_22-52-21.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="270" data-original-width="1220" height="142" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsS1fWWqne7wyRiNIS8jjsFxOPCb5B54n-xNiEKD4PYt3jwml1VCOrkX6JxhEXHuCFd7wv9Kr5Vvu1VCSi-PP74LtoZSto9IsYzZcKMJ3uzam7_JfrVUcerg51rWIdZzCQaxjLezaVeepV8TcaxudzUHzRTnCNqfWe-Ju4icIFHKqd5swb879emq1NTQ/w640-h142/ArcoLinux_2022-05-31_22-52-21.png" width="640" /></a></div><br /><p>We have some random variable $$Z$$ - the contents of a text message, for example - which we encode under some coding scheme to get a message consisting of a sequence of symbols that we send over some channel - the internet, for example - and then hopefully recover the original message. 
The channel can be noiseless, meaning it transmits everything perfectly and can be removed from the diagram, or noisy, in which case there is a chance that for some $$i$$, the $$X_i$$ sent into the channel differs from the $$Y_i$$ you get out.</p><p>Source coding is about trying to minimise how many symbols you have to send, while channel coding is about trying to make sure that $$\hat{Z}$$, the estimate of the original message, really ends up being the original message $$Z$$.</p><p>A big result in information theory is that for the above model, it is possible to separate the source coding and the channel coding, while maintaining optimality. The problems are distinct; regardless of source coding method, we can use the same channel coding method and still do well, and vice versa. Thanks to this result, called the source-channel separation theorem, source and channel coding can be considered separately. Therefore, our model can look like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKVksPpXEHyJeOtvGWW5rKPZYkJe-TxHDqeH4liLtrRak9ybQvb1MoaEkIDgLHc4yy7rXDra5JOQjswbiH2ZZtpEk4egOiXP3uk_bBk87FJS1Zl0d4bSbjo2uso2lSXrwIPJa-4DyMTnFtxCFI-8t5buk0NHxWPHGUEsX_6YcxZqc5MssJflmom7g_gQ/s1367/ArcoLinux_2022-05-31_22-52-43.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="246" data-original-width="1367" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKVksPpXEHyJeOtvGWW5rKPZYkJe-TxHDqeH4liLtrRak9ybQvb1MoaEkIDgLHc4yy7rXDra5JOQjswbiH2ZZtpEk4egOiXP3uk_bBk87FJS1Zl0d4bSbjo2uso2lSXrwIPJa-4DyMTnFtxCFI-8t5buk0NHxWPHGUEsX_6YcxZqc5MssJflmom7g_gQ/w640-h116/ArcoLinux_2022-05-31_22-52-43.png" width="640" /></a></div><p><br /></p><p>(We use $$X^n$$ to refer to a random variable representing a length-$$n$$ sequence of symbols.)</p><p>Both source and channel coding consist of:</p><ul><li>a central but tricky theorem giving theoretical bounds and 
motivating some definitions</li><li>a bunch of methods that people have invented for achieving something close to those theoretical bounds in practice</li></ul>Next see <a href="https://www.strataoftheworld.com/2022/06/information-theory-2-source-coding.html">the source coding post</a> and <a href="https://www.strataoftheworld.com/2022/06/information-theory-3-channel-coding.html">the channel coding post</a>. <br /><div><ul></ul><p></p><p></p></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-41260904924096790052021-10-17T23:14:00.000+01:002021-10-17T23:14:19.101+01:00Death is bad<p style="text-align: center;"> <span style="font-size: x-small;">3.5k words (about 12 minutes)<br /></span></p><p>Sometime in the future, we might have the technology to extend lifespans indefinitely and make people effectively immortal. When and how this might happen is a complicated question that I will not go into. Instead, I will take heed of Ian Malcolm in <i>Jurassic Park</i>, who complains that "your scientists were so preoccupied with whether or not they could that they didn't stop to think if they should".</p><p>This is (in my opinion rather surprisingly) a controversial question. </p><p>The core of it is this: should people die?</p><p>Often the best way to approach a general question is to start by thinking about specific cases. Imagine a healthy ten-year old child; should they die? The answer is clearly no. What about yourself, or your friends, or the last person you saw on the street? Wishing for death for yourself or others is almost universally a sign of a serious mental problem; acting on that desire even more so.</p><p>There are some exceptions. Death might be the best option for a sick and pained 90-year-old with no hope of future healthy days. It may well be (as I've seen credibly claimed in several places) that the focus on prolonging lifespan even in pained terminally ill people is excessive. 
"Prolong life, whatever the cost" is a silly point of view; maximising heartbeats isn't what we really care about.</p><p>However, now imagine a pained, dying, sick person who has a hope of surviving to live many healthy, happy days – say a 40-year-old suffering from cancer. Should they die? No. You would hope that they get treatment, even if it's nauseating, fatiguing, painful chemotherapy for months on end. If there is no cure, you'd hope that scientists somewhere invent it. Even if it does not happen in time for that particular person, at least it will save others in the future, and eliminate one more horror of the world. It would be a great and celebrated human achievement.</p><p>What's the difference between the terminally ill 90-year-old and the 40-year-old with a curable cancer? The difference is technology. We have the technology to cure some cancers, but we don't have the technology to cure the many ageing-related diseases. If we did, then even if the treatment is expensive or difficult, we would hope – and consider it a moral necessity – for both of them to get it, and hope that they both go on living for many more years.</p><p>No one dies of time. You are a complex process running on the physical hardware of your brain, which is kept running by the machine that is the rest of your body. You die when that machine breaks. There is no poetic right time when you close your eyes and get claimed by time, there is only falling to one mechanical fault or another.</p><p>People (or conscious beings in general) matter, and their preferences should be taken seriously – this is the core of human morality. What is wrong in the world can be fixed – this is the guiding principle of civilisation since the Enlightenment.</p><p>So, should people die? 
Not if they don't want to, which (I assume) for most people means not if they have a remaining hope of happy, productive days.</p><h2 id="counterarguments">Counterarguments</h2><p>The idea that death is something to be defeated, like cancer, poverty, or smallpox, is not a common one. Perhaps there's some piece of the puzzle that is missing from the almost stupidly simple argument above?</p><p>One of the most common counterarguments is overpopulation (perhaps surprisingly; environmentalist concerns have clearly penetrated very deep into culture despite not being much of a thing before the 1970s). The argument goes like this: if we solve death, but people keep being born, there will be too many people on Earth, leading to environmental problems, and eventually low quality of life for everyone.</p><p>The object-level point (I will return to what I consider more important meta-level points later) is that demographic predictions have a tendency to be wrong, especially about the future (as the <a href="https://quoteinvestigator.com/2013/10/20/no-predict/">Danish (?) saying goes</a>). Malthus figured out pre-industrial demographics just as they came to an end with the industrial revolution. In the 1960s, there were <a href="https://en.wikipedia.org/wiki/The_Population_Bomb">warnings</a> of a population explosion, which fizzled out when it turned out that the <a href="https://en.wikipedia.org/wiki/Demographic_transition">demographic transition</a> (falling birth rates as countries develop) is a thing. Right now the world population is expected to stabilise at less than 1.5x the current size, and many developed countries are dealing with problems caused by shrinking populations (which they strangely refuse to fix through immigration).</p><p>Another concern is the effect of having a lot of old people around. 
What about social progress – how would the development of women's rights have been realised if you had a bunch of 19th century misogynists walking around in their top hats? What sort of power imbalances and Gini coefficients would we reach if Franklin Delano Roosevelt could continue cycling through high-power government roles indefinitely, or Elon Musk had time to profit from the colonisation of Mars? What happens to science when it can no longer advance (as Max Planck said) one funeral at a time?</p><p>(There is even an argument that life extension technology is problematic because the rich will get it first. This is an entirely general and therefore entirely worthless argument, since it applies to all human progress: the rich got iPhones first – clearly smartphones are a problematic technology, etc., etc. If you're worried about only the rich having access to it for too long, the proper response is to subsidise its development so that the period when not everyone has access to it is as short as possible.)</p><p>These are valid concerns that will definitely test the abilities of legislators and voters in the post-death era. However, they can probably be overcome. I think people can be brought around surprisingly far on social and moral attitudes without killing anyone. Consider how pre-2000 almost anyone's opinions would have made them a near-pariah today; many of those people still exist and it would be hard to write them off as a total loss. Maybe some minority of immortal old people couldn't cope with all the Pride Parades – or whatever the future equivalent is – marching past their windows and they go off to start some place of their own with sufficient top hat density; then again, most countries have their own conservative backwater region already. 
If they start going for nukes, that's more of an issue, but not more so than Iran.</p><p>As for imbalances of power and wealth, it might require a few more taxes and other policies (the expansion of term limits to more jobs?), but given the strides that equalising policy-making has made it seems hard to argue there is a fundamental impossibility.</p><p>And what about all the advantages? A society of the undying might well be far more long-term oriented, mitigating one of the greatest human failures. After all, how often do people bemoan that 70-year-old oil executives just don't care because they won't be around to see the effects of climate change?</p><p>What about all the collective knowledge that is lost? Imagine if people in 2050 could hear World War II veterans reminding them of what war really is. Imagine if John von Neumann could have continued casually inventing fields of maths at a rate of about two per week instead of dying at age 53 (while <a href="https://en.wikipedia.org/wiki/John_von_Neumann#Illness_and_death">absolutely terrified of his approaching death</a>). Imagine if we could be sure to see George R. R. Martin finish <i>A Song of Ice and Fire</i>.</p><p>Also, concerns like overpopulation and Elon Musk's tax plan just seem small in comparison to the <i>literal eradication of death</i>.</p><p>Imagine proposing a miracle peace plan to the cabinets of the Allied countries in the midst of World War II. The plan would end the war, install liberal governments in the Axis powers, and no one even has to nuke a Japanese city. (If John von Neumann starts complaining about not getting to test his implosion bomb design, give him a list of unsolved maths problems to shut him up.) 
Now imagine that the reaction is somewhere between hesitance and resistance, together with comments like "where are we going to put all the soldiers we've trained?", "what about the effects on the public psyche of a random abrupt end without warning?", and "how will we make sure that the rich industrialists don't profit too much from all the suddenly unnecessary loans that they've been given?" At this point you might be justified in shouting: "this war is killing fifteen million people per year, we need to end it now".</p><p>The situation with death is similar, except it's over fifty million per year rather than fifteen. (See <a href="https://ourworldindata.org/grapher/annual-number-of-deaths-by-cause?country=~OWID_WRL">this chart</a> for breakdown by cause – you'll see that while currently-preventable causes like infectious diseases kill millions, ageing-related ones like heart disease, cancer, and dementia are already the majority.)</p><h3 id="thought-experiments">Thought experiments</h3><p>To make the question more concrete, we can try thought experiments. Imagine a world in which people don't die. Imagine visitors from that world coming to us. Would they go "ah yes, inevitable oblivion in less than a century, this is exactly the social policy we need, thanks – let us go run back home and implement it"? Or would they think of our world like we do of a disease-stricken third-world country, in dire need of humanitarian assistance and modern technology?</p><p>It's hard to get into the frame of mind of people who live in a society that doesn't hand out automatic death sentences to everyone at birth. 
Instead, to evaluate whether raising life expectancies to 200 makes sense even given the environmental impacts, we can ask whether a policy of killing people at age 50 to reduce population pressures would be even better than the current status quo – if both an increase and a decrease in life expectancy are bad, this is suspicious because it implies we're at the optimum by chance. Or, since the abstract question (death in general) is always harder than more concrete ones, imagine withholding a drug that manages heart problems in the elderly on overpopulation grounds.</p><p>You might argue that current life expectancies are optimal. This is a hard position to defend. It seems like a coincidence that the lifespan achievable with modern technology is exactly the "right" one. Also, neither you nor society should make that choice for other people. Perhaps some people get bored of life and readily step into coffins at age 80; many others want nothing more than to keep living. People should get what they want. Forcing everyone to conform to a certain lifespan is a specific case of forcing everyone to conform to a certain lifestyle; much moral progress in the past century has consisted of realising that this is bad.</p><p>I think it's also worth emphasising one common thread in the arguments against solving death: they are all arguments about societal effects. It is absolutely critical to make sure that your actions don't cause massive negative externalities, and that they also don't amount to defecting in <a href="https://en.wikipedia.org/wiki/Prisoner%27s_dilemma">prisoner's dilemma</a> or <a href="https://en.wikipedia.org/wiki/Tragedy_of_the_commons">the tragedy of the commons</a>. However, it is also absolutely critical that people are happy and aren't forced to die, because people and their preferences/wellbeing are what matters. Society exists to serve the people who make it up, not the other way around. 
Some of the worst moral mistakes in history come from emphasising the collective, and identifying good and harm in terms of effects on an abstract collective (e.g. a nation or religion), rather than in terms of effects on the individuals that make it up. Saying that everyone has to die for some vague pro-social reason is the ultimate form of such cart-before-the-horse reasoning.</p><h2 id="why-care-about-the-death-question">Why care about the death question?</h2><p>There are several features that make the case against death, and people's reactions to it, particularly interesting.</p><h3 id="failure-of-generalisation">Failure of generalisation</h3><p>First: generalisation. I started this post using specific examples before trying to answer the more general question. I think the popularity of death is a good example of how bad humans are at generalising.</p><p>When someone you know dies, it is very clearly and obviously a horrible tragedy. The scariest thing that could happen to you is probably either your own death, the death of people you care about, or something that your brain associates with death (the common fears: heights, snakes, ... clowns?).</p><p>And yet, make the question more abstract – think not about a specific case (which you feel in your bones is a horrible tragedy that would never happen in a just world), but about the general question of whether people should die, and it's like a switch flips: a person who would do almost anything to save themselves or those they care about, who cares deeply about suffering and injustice in the world, is suddenly willing to consign five times the death toll of World War I to permanent oblivion every single year.</p><p>Stalin reportedly said that a single death is a tragedy, but a million is only a statistic. Stalin is wrong. A single death is a tragedy, and a million deaths is a million tragedies. 
Tragedies should be stopped.</p><h3 id="people-these-days">People These Days</h3><p>Second: today, we're pretty good at ignoring and hiding death. This wasn't always the case. If you're a medieval peasant, death is never too far away, whether in the form of famine or plague or Genghis Khan. Death was like an obnoxious dinner guest: not fun, but also just kind of present in some form or another whether you invited them or not, so out of necessity involved in life and culture.</p><p>Today, unexpected death is much rarer. Child mortality globally has declined from <a href="https://ourworldindata.org/child-mortality">over 40% (i.e. almost every family had lost a child) in 1800 to 4.5% in 2015</a>, and <a href="https://ourworldindata.org/grapher/the-decline-of-child-mortality-by-level-of-prosperity-endpoints?time=latest&country=SWE~GBR~JPN~FRA~FIN~European+Union~KOR~ESP">below 0.5%</a> in developed countries. Famines have gone from something everyone lives through to something that the developed world is free from. War and conflict have gone from <a href="https://ourworldindata.org/war-and-peace#the-past-was-not-peaceful">common to uncommon</a>. Far more serious diseases and accidents can be successfully treated. As a result of all these positive trends, death is less present in people's minds.</p><p>As I don't have my culture critic license yet, I won't try to make some fancy overarching points about how People These Days Just Don't Understand and how our Materialistic Culture fails to prepare people to deal with the Deep Questions and Confront Their Own Mortality. I will simply note that (a) death is bad, (b) we don't like thinking about bad things, and (c) sometimes not wanting to think about important things causes perverse situations.</p><h3 id="confronting-problems">Confronting problems</h3><p>Why do people not want to think that death is bad? I think one central reason is that death seems inevitable. 
It's tough to accept bad things you can't influence, and much easier to try to ignore them. If at some point you have to confront them anyway, one of the most reassuring stories you can tell is that it has a point. Imagine if over two hundred thousand years, generation after generation of humans, totalling some one hundred billion lives, was born, grew up, developed a rich inner world, and then had that world destroyed forever by random failures, evolution's lack of care for what happens after you reproduce, and the occasional rampaging mammoth. Surely there must be some purpose for it, some reason why all that death is not just a tragedy? Perhaps we aren't "meant" to live long, whatever that means, or perhaps it's all for the common good, or perhaps "death gives meaning to life". Far more comforting to think that than to acknowledge that a hundred billion human lives and counting really are gone forever because they were unlucky enough to be born before we eradicated smallpox, or invented vaccines, or discovered antibiotics, or figured out how to reverse ageing.</p><p>Assume death is inevitable. Should you still recognise the wrongness of it?</p><p>I think yes, at least if you care about big questions and doing good. I think it's important to be able to look at the world, spot what's wrong about it, and acknowledge that there are huge things that should be done but are very difficult to achieve.</p><p>In particular, it's important to avoid the narrative fallacy (Nassim Taleb's term for the human tendency to want to fit the world to a story). In a story, there's a start and an end and a lesson, and the dangers are typically just small enough to be defeated. Our universe <a href="https://www.lesswrong.com/posts/sYgv4eYH82JEsTD34/beyond-the-reach-of-god">has no writer, only physics</a>, and physics doesn't care about hitting you with an unsolvable problem that will kill everyone you love. 
If you want to increase the justness of the world, recognising this fact is an important starting point.</p><h2 id="taxes">Taxes</h2><p>Is death inevitable? In considering this question, it's important once again to remember that death is not a singular magical thing. Your death happens when something breaks badly enough that your consciousness goes permanently offline.</p><p>Things, especially complex biological machines produced by evolution, can break in very tricky ways. But what can break can be fixed, and people who declare technological feats impossible have a bad track record. The problem might be very hard: maybe we have to wait until we have precision nano-bots that can individually repair the telomeres on each cell, or maybe there is no effective general solution to ageing and we face an endless grind of solving problem after problem to extend life/health expectancies from 120 to 130 to 140 and so forth. Then again, maybe someone leaves out a petri dish by accident in a lab and comes back the next day to the fountain of youth, or maybe by the end of the century no one is worrying about something as old-fashioned as biology.</p><p>There's also the possibility of stopgap solutions, like cryonics (preserving people close to death by <a href="https://en.wikipedia.org/wiki/Cryopreservation#Vitrification">vitrifying</a> them and hoping that future technology can revive them). Cryonics is currently in a very primitive state – no large animal has yet been successfully put through it – but there's a research pathway of testing on increasingly complex organs and then increasingly large animals that might eventually lead to success if someone bothered to pour resources into it.</p><p>There is no guarantee of when, or whether, this will happen. If civilisation is destroyed by an engineered pandemic or nuclear war before then, it will never happen.</p><p>Of course, in the very long run we face more fundamental problems, like the heat death of the universe. 
Literally infinite life is probably physically impossible; maybe this is reassuring.</p><h2 id="predictions-and-poems">Predictions and poems</h2><p>I will make three predictions about the eventual abolition of death.</p><p>First, many people will resist it. They might see it as conflicting with their religious views or as exacerbating inequality, or just as something too new and weird or unnatural.</p><p>Second, when the possibility of extending their lifespan stops being an abstract topic and becomes a concrete option, most people will seize it for themselves and their families.</p><p>This is a common path for technologies. Lightning rods and vaccines were first seen by some as affronts to God's will, but eventually it turned out that people like not burning to death and not dying of horrible diseases more than they like fancy theological arguments. Most likely future generations will discover that they like not ageing more than they like appreciating the meaning of life by definitely not having one past age 120.</p><p>Finally, future people (if they exist) will probably look back with horror on the time when everyone died against their will within about a century.</p><p>Edgar Allan Poe wrote a poem called <a href="https://www.poetryfoundation.org/poems/48633/the-conqueror-worm">"The Conqueror Worm"</a>, about angels crying as they watch a tragic play called "Man", whose (anti-)hero is a monstrous worm that symbolises death. If we completely ignore what Poe intended with this, we can misinterpret one line to come to a nice interpretation of our own. The poem declares that the angels are watching this play in the "lonesome latter years". Clearly this refers to a future post-scarcity, post-death utopia, and the angels are our wise immortal descendants reflecting on the bad old days, when people were "mere puppets [...] who come and go / at the bidding of vast formless things" like famine and war and plague and death. 
The "circle [of life] ever returneth in / To the self same spot [= the grave]", and so the "Phantom [of wisdom and fulfilled lives] [is] chased for evermore / By a crowd that seize it not".</p><p>Death is a very poetic topic, and other poems need less (mis)interpretation. <a href="https://www.poetryfoundation.org/poems/52773/dirge-without-music">Edna St. Vincent Millay's "Dirge Without Music"</a> is particularly nice, while Dylan Thomas gives away the game in the title: <a href="https://poets.org/poem/do-not-go-gentle-good-night">"Do not go gentle into that good night"</a>.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-9283111801107102512021-09-30T21:23:00.001+01:002021-09-30T21:26:05.349+01:00Short reviews: biographies<p style="text-align: center;"><span style="font-size: x-small;">Books reviewed (all by Walter Isaacson):<i><br />The Code Breaker: Jennifer Doudna, Gene Editing, and the Future of the Human Race </i>(2021)<br /><i>Steve Jobs: The Exclusive Biography </i>(2011)<i><br />Benjamin Franklin: An American Life </i>(2004)<br /><i></i></span></p><p style="text-align: center;"><span style="font-size: x-small;">3.5k words (about 12 minutes)</span> </p><p style="text-align: center;"><br /></p><p>Why read biographies? If you want stories of people and interesting characters, fiction is better. If you want general, big truths, then you're probably better off reading the many non-fiction books that are about abstract truths and far-ranging concepts rather than the particulars of a single person's life.</p><p>Consider, for a moment, designing an algorithm for a problem. The classic way to do this is to think hard about the problem, and then write down a specific series of steps that take you from inputs to (hopefully the correct) outputs. In contrast, the machine learning method is to use statistical methods on a long list of examples to make a model that (hopefully) approximates the mapping between inputs and outputs. 
</p><p>Reading explicit abstract arguments is like the first method. Like explicit algorithm design, it comes with some nice properties – it's very clear exactly how it generalises and when it's applicable – to the point where it's easy to scoff at the less explicit methods: "it's just a black box that our pile of statistics spits out" / "it's just anecdotes about someone's life".</p><p>However, much like machine learning methods can extract subtle lessons from a long list of examples, I think there is implicit knowledge contained in the long list of detail about someone's life that you find in a biography (at least if you read about people who did interesting things in their life – but then again, if there's a biography of someone ...). Once you've read the details of how CRISPR was invented, Apple jump-started, or compromises reached at the 1787 American Constitutional Convention, I think your model of how science, business, and politics work in the real world is improved in many subtle ways.</p><p>(Note that this argument also applies to reading history.)</p><p>And of course, since biographies deal strongly with character, there is an element of the novel-like thrill of watching things happen to people.</p><h2 id="walter-isaacsons-biographies">Walter Isaacson's biographies</h2><p>I've read four of Walter Isaacson's biographies. Their subjects are Albert Einstein, Jennifer Doudna, Steve Jobs, and Benjamin Franklin.</p><p>The Einstein one I read years ago, and don't remember much detail about. It did earn a 6 out of 7 on my books spreadsheet though.</p><p>The <a href="https://en.wikipedia.org/wiki/Jennifer_Doudna">Jennifer Doudna</a> biography is the weakest. 
The main reason is that we don't get too much insight into Doudna herself or the way she carried out her scientific work, leaving Isaacson to spend many pages on other things: overviews of other players in the development of the <a href="https://en.wikipedia.org/wiki/CRISPR">gene-editing tool CRISPR</a> that are more journalistic than biographical, and descriptions of the biology that are limited by Isaacson's lack of biological expertise (at least when compared to the best popular biology writing, like Richard Dawkins' in <i>The Selfish Gene</i>). Hand-wringing over <a href="https://en.wikipedia.org/wiki/James_Watson">James Watson's</a> controversies takes up an alarming amount of space that is only partly justified by Watson's role as a childhood inspiration for Doudna. There's also a long section about the struggles behind the allocation of the CRISPR Nobel Prize (awarded in 2020) that is clearly balanced and thoroughly researched, but simply less interesting to me than similar segments in the Jobs or Franklin biographies, where the stakes are the fate of companies or nations, rather than who gets a shiny medal.</p><p>My guess is that these faults stem mainly from the more limited material Isaacson had access to. Albert Einstein and Benjamin Franklin are both among the most researched individuals in history. To the extent that Steve Jobs is behind, the interviews Isaacson personally conducted seem to have plugged the gap.</p><p>Doudna is still an inspiring person. She also has the enviable advantage of not being dead, and therefore may yet do even more and become the subject of further biographies. If you're interested in biotech, including the business side, or scientific careers that may one day win Nobel Prizes, the biography may well be worth reading. 
</p><h2 id="steve-jobs">Steve Jobs</h2><p>A god-like experimenter who wants to figure out what traits make tech entrepreneurs succeed may proceed something like this: create a bunch of people with extreme strengths in some areas and extreme weaknesses in others, release them into the world to start companies, and see which extreme strengths can balance out which extreme weaknesses. Such an experiment might well create Steve Jobs.</p><p>Take one weakness: Jobs's emotional volatility and, for lack of a better word, general nastiness in some circumstances, including everything from extremely harsh criticism of employees' work to horrible table manners at restaurants. This isn't unique to Jobs either: look at the Wikipedia pages for <a href="https://en.wikipedia.org/wiki/Bill_Gates#Management_style">Bill Gates</a> and <a href="https://en.wikipedia.org/wiki/Jeff_Bezos#Leadership_style">Jeff Bezos</a>, and you'll find that they brighten their subordinates' work days with such productive witticisms as "that's the stupidest thing I've ever heard" and "why are you ruining my life?" respectively.</p><p>Does this show that behaviour up to and including verbal abuse is a forgivable flaw, or even beneficial, in tech CEOs?</p><p>First, though verbal abuse is neither productive nor right, a culture of vigorous debate is a distinct thing with incredible benefits, and the idea that it serves only to hurt and marginalise is not just a misguided generalisation but sometimes diametrically wrong. 
The best example is Daniel Ellsberg recounting an anecdote from his early times at RAND Corporation in <i>The Doomsday Machine</i> (an unrelated book; my review <a href="https://strataoftheworld.blogspot.com/2020/04/review-doomsday-machine.html">here</a>):</p><blockquote><p><i>Rather than showing irritation or ignoring my comment [that he made at the first meeting], Herman Kahn, brilliant and enormously fat, sitting directly across the table from me, looked at me soberly and said, "You're absolutely wrong."</i></p><p><i>A warm glow spread through my body. This was the way my undergraduate fellows on the editorial board of the Harvard Crimson (mostly Jewish, like Herman and me) had spoken to each other; I hadn't experienced anything like it for six years. At King's College, Cambridge, or in the Society of Fellows, arguments didn't remotely take this gloves-off, take-no-prisoners form. I thought, "I've found a home."</i></p></blockquote><p>Steve Jobs admittedly goes overboard with this. For example, people who worked with him had to learn that "this is shit" meant "that's interesting, could you elaborate and make the case for your idea further?". This is not just unnecessarily rude, but also unclear communication. The general impression that Isaacson gives is also not that Jobs was combative as a thought-out strategy, but rather that this was just his style of interaction.</p><p>I suspect that the famous combativeness of many tech CEOs is not itself a useful trait, but is instead adjacent to several other traits that are, in particular disagreeableness (in the sense of being willing to disagree with others without feeling pressure to conform) and perhaps also caring deeply about the product.</p><p>Consider another extreme Jobs trait: strange diets and, in his youth, a belief that he didn't need to shower because of his dieting.
This went so far that of the people Isaacson interviews about Jobs's youth, including those who hadn't seen him for decades, almost every one mentions something like "yeah, he stank". Yet while some leap to defend and (worse yet) emulate Jobs's verbal nastiness, presumably on grounds of its correlation with his success, far fewer do the same for his dieting and showering habits. (What conformists!)</p><p>I think the more general lesson is that Jobs was extreme in a lot of ways, including in the strength of his opinions and beliefs, and in not having a filter between them and his actions. He gets into Eastern mysticism and goes off to India to become a monk. He gets into dieting and starts eating only fruit rather than just reading lifestyle magazines and half-heartedly trying diets for a week like most people might. He gets it into his head that the corner of a Mac isn't rounded enough and declares that in no uncertain terms. </p><p>So is that the key then: have firm convictions? We've gone from a maladaptive cliché to a trite one – and still not a very helpful one. Steve Jobs, with his "reality distortion field", may have been an expert at persuading people, but even he can't persuade reality to be another way. Even slightly wrong convictions tend to have nasty collisions with reality.</p><p>(It's worth noting that rather than being a stickler for one position or solution, Jobs tended to yo-yo back and forth between extremes, only slowly converging on a decision – something that often confused others at Apple until they learned to use a rolling average of his recent positions.)</p><p>The critical part, of course, was that Steve Jobs was right about a lot of things, despite several serious missteps (especially in regard to making over-expensive computers that no one wants to pay for). I think Jobs's success provides evidence that even in aesthetic matters, success has a surprisingly strong component of <i>being actually right</i>.
And Jobs, who was all-around very bright despite not being a master of the technical side, seems to have mastered this.</p><p>Of course, the story of Jobs's success – which came in spite of his emotional volatility, and his tendency to wish away problems rather than facing them – does not entirely fit the idea that success comes in large part from having well-calibrated beliefs about the world and going about achieving your goals in reasonable and rational ways.</p><p>I think there are three things worth keeping in mind.</p><p>First, it may well be that most successful people are successful "at random" (i.e. without having a rational strategy for achieving what they want to achieve), but that the probability of achieving your goals given that you have well-calibrated beliefs and a rational reality-accommodating plan is still very much higher than the probability of achieving them given any other strategy. That is, if <script type="math/tex">S</script> is the event of being very successful (by some definition), <script type="math/tex">R</script> the event that you follow a rational strategy and maintain well-calibrated beliefs and generally practice thought patterns that won't get you downvoted on LessWrong, and <script type="math/tex">\neg R</script> the complement of that event, then <script type="math/tex">P(\neg R|S)</script> can be high (i.e. most successful people became successful in not particularly smart ways), while <script type="math/tex">P(S|R)</script> can be much higher than <script type="math/tex">P(S|\neg R)</script> (following a rational strategy still gives you by far the best chances of success).</p><p>Second, Jobs's life illustrates the principle that you only have to be very right a small number of times – just like in general most of the return, especially in anything risky, comes from a small number of bets.
He failed at managing, even when working under another CEO who had been brought in specifically to babysit him, to the extent that he was kicked out of his own company. He failed to build successful hardware after founding NeXT. However, he was really right about product design, and that was enough.</p><p>Third, though he did get away with ignoring many uncomfortable truths by simply willing them away, eventually reality hit back. He delayed dealing with the cancer threat when he was first told of it, and he trusted alternative treatments. The combination may well have killed him.</p><p> </p><h2 id="benjamin-franklin">Benjamin Franklin</h2><p>Benjamin Franklin was a newspaper publisher, writer, postmaster, ambassador, political leader, and scientist. He invented the lightning rod and realised that electric charge came in both a positive and negative form (and gave those names to them, as temporary ones until "[English] philosophers give us better").</p><p>He was one of the first or most influential pioneers of many other things as well; to take a random example, he thought up the idea of matched funding for a charitable project (and was quite proud of it too: "I do not remember any of my political maneuvers the success of which gave me at the time more pleasure, or that in after thinking about it I more easily excused myself for having made use of cunning").</p><p>More generally, he clearly enjoyed numbers and detail:</p><blockquote><p><i>[...H]e loved immersing himself in minutiae and trivia in a manner so obsessive that it might today be described as geeky. He was meticulous in describing every technical detail of his inventions, be it the library arm, stove, or lightning rod. In his essays, ranging from his arguments against hereditary honors to his discussions of trade, he provided reams of detailed calculations and historical footnotes. 
Even in his most humorous parodies, such as his proposal for the study of farts, the cleverness was enhanced by his inclusion of mock-serious facts, trivia, calculations, and learned precedents.</i></p></blockquote><p>Do-gooders with time machines could do worse than giving him access to a spreadsheet program.</p><p>One of the best descriptions of Franklin's personality comes from Isaacson's comparison of him with John Adams (when they were both in Paris, late in Franklin's life):</p><blockquote><p><i>Adams was unbending and outspoken and argumentative, Franklin charming and taciturn and flirtatious. Adams was rigid in his personal morality and lifestyle, Franklin famously playful. Adams learned French by poring over grammar books and memorizing a collection of funeral orations; Franklin (who cared little about the grammar) learned the language by lounging on the pillows of his female friends and writing them amusing little tales. Adams felt comfortable confronting people, whereas Franklin preferred to seduce them, and the same was true of the way they dealt with nations.</i></p></blockquote><p>One striking thing when reading about 18th century events is the informality and nepotism. For example, to become postmaster of the colonies, Franklin spent significant money on having a friend lobby on his behalf in London, and upon obtaining the position gave out cushy jobs to his son, brothers, brother's stepson, sister's son, and two of his wife's relatives.</p><p>Not only that, but the border between truth and fiction was also hazy in the press. Articles could be, without any differentiating label, either factual, obviously satirical, satirical in a way that takes a clever reader to spot, or outright hoaxes.
Likewise Franklin often wrote and published letters to his own newspaper under pseudonyms, with various levels of disguise ranging from clearly transparent to purposefully anonymous (this, however, was normal, as it was often seen as unworthy of gentlemen to write such letters under their own names).</p><p>In other ways, the 18th century, and 18th century Franklin in particular, were surprisingly modern and liberal. Franklin took a very reasonable and liberal stance on the freedom of press:</p><blockquote><p><i>“It is unreasonable to imagine that printers approve of everything they print. It is likewise unreasonable what some assert, That printers ought not to print anything but what they approve; since […] an end would thereby be put to free writing, and the world would afterwards have nothing to read but what happened to be the opinions of printers.”</i></p></blockquote><p>He still exercised judgement over what he printed. When deciding whether to print something that violated his principles for money, he (reportedly) went through a process that many modern newspaper editors and Facebook engineers could well take to heart:</p><blockquote><p><i>To determine whether I should publish it or not, I went home in the evening, purchased a twopenny loaf at the baker’s, and with the water from the pump made my supper; I then wrapped myself up in my great-coat, and laid down on the floor and slept till morning, when, on another loaf and a mug of water, I made my breakfast. From this regimen I feel no inconvenience whatever. Finding I can live in this manner, I have formed a determination never to prostitute my press to the purposes of corruption and abuse of this kind for the sake of gaining a more comfortable subsistence.</i></p></blockquote><p>The 18th century offers some perspective about hostile politics too. 
After describing an extremely personal and angry election campaign (which Franklin lost), Isaacson writes:</p><blockquote><p><i>Modern election campaigns are often criticized for being negative, and today’s press is slammed for being scurrilous. But the most brutal of modern attack ads pale in comparison to the barrage of pamphlets in the 1764 [Pennsylvania] Assembly election. Pennsylvania survived them, as did Franklin, and American democracy learned that it could thrive in an atmosphere of unrestrained, even intemperate, free expression. As the election of 1764 showed, American democracy was built on a foundation of unbridled free speech. In the centuries since then, the nations that have thrived have been those, like America, that are most comfortable with the cacophony, and even occasional messiness, that comes from robust discourse.</i></p></blockquote><p>Isaacson points out that Franklin's popularity has come and gone, and explains this by making him the symbol of one side of a cultural and political dichotomy: tolerance and compromise rather than dogmatism and crusading, pragmatism rather than romanticism, social mobility rather than class and hierarchy, and secular material success over religious salvation. Thus, while immensely popular in the latter part of his life and after his death, once the Romantic Era got underway, he came to be seen as shallow, thrifty, and lacking in passion. For example, Franklin appears in Herman Melville's novel <i>Israel Potter</i>, a work that sounds like the most confusing Harry Potter fan-fiction of all time, as a precursor to today's shallow self-help gurus.</p><p>A perfect example of the type of cunning that made some people call him shallow comes from his time as a frontier commander. To get soldiers to attend worship services, he had the chaplain give out the daily rum rations right after the service.
"Never were prayers more generally and punctually attended", Franklin proudly wrote.</p><p>Or: at the signing of the Declaration of Independence, John Hancock solemnly declared "There must be no pulling different ways; we must all hang together". Franklin reportedly responded, with a wit, though not a solemnity, worthy of the historic occasion: "Yes, we must, indeed, all hang together, or most assuredly we shall all hang separately".</p><p>This oscillation between romantically-minded eras finding him shallow and business-minded eras finding him the godfather of all self-help gurus and thrifty entrepreneurs has continued to this day. It is true that his aphorism collections, as documented in his famous Poor Richard's Almanac, are more clever than insightful; that he was no moral philosopher; and that his virtue-cultivating efforts were often patchy. However, they are part of a crucial process: the separation of morality from theology during the Enlightenment, of which "Franklin was [the] avatar". Franklin's foundational personal maxim, which he often repeated, is perhaps the single sentence that pre-modern religious countries most need to hear: “The most acceptable service to God is doing good to man".</p><p>The romanticists' criticisms are based on truths. Though sociable, founding and participating in many societies, his personal relationships tended to be intellectual but distant. Interestingly, despite his vast achievements, Franklin does not show signs of a deep unyielding inner ambition; he seems to have been driven by vague instincts to be useful, a sense of pride (which he tried to dull throughout his life), curiosity, and a delight in tinkering, planning, and organising. To his sister in 1771 he wrote "[...] I am much disposed to like the world as I find it, and to doubt my own judgment as to what would mend it" – a remarkable sentiment from the pen of someone who, not many years later, would be playing a key role in a revolution.
And though even past the age of 75 he achieved a few minor things, like being instrumental in securing France's alliance with America, signing the peace treaty between the US and Britain, shaping the US Constitution, and being the head of Pennsylvania's government, he happily whiled away many of his later days playing cards with only the occasional twinge of guilt. He specifically justified this in part based on a belief in the afterlife: "You know the soul is immortal; why then should you be such a niggard of a little time, when you have a whole eternity before you?"</p><p>However, even these traits seem to have made him exactly what America needed. He was a skilled diplomat in France partly because of his easy-going nature and lack of naked ambition. At the Constitutional Convention of 1787, he often hosted the (much younger) other leading revolutionaries at his house to talk about things in a less formal setting and soften their stances, and generally advocated tolerance and compromise. Isaacson cleverly summarises:</p><blockquote><p><i>Compromisers may not make great heroes, but they do make democracies.</i></p></blockquote><p>Perhaps the best known summary of Franklin's life is Turgot's epigram that "he snatched lightning from the sky and the sceptre from tyrants". Franklin himself had a go at this: he wrote an autobiography – then a rare form of book – and also proposed a cheeky epitaph for himself, including an exhortation to wait for a "new and more elegant edition [of him], revised and corrected by the Author".</p><p>He didn't just summarise himself, though. He also unwittingly wrote perhaps the pithiest summary of the spirit of the entire Enlightenment project, and consequently of the driving spirit of human progress since then.
It was in a letter Franklin wrote to his wife, after narrowly escaping a shipwreck on the English coast in 1757:</p><blockquote><p><i>Were I a Roman Catholic, perhaps I should on this occasion vow to build a chapel to some saint; but as I am not, if I were to vow at all, it should be to build a lighthouse.</i></p></blockquote>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-14457107463493519742021-04-25T21:52:00.002+01:002022-03-31T22:58:16.256+01:00Lambda calculus<p style="text-align: center;"><i><span style="font-size: x-small;">7.8k words, including equations (about 30 minutes)</span></i></p><p style="text-align: center;"><i><span style="font-size: x-small;"> </span></i></p><p style="text-align: center;"><i><span style="font-size: x-small;">This post has also been published <a href="https://www.lesswrong.com/posts/D4PYwNtYNwsgoixGa/intro-to-hacking-with-the-lambda-calculus">here</a>. </span></i><br /></p><p> </p><p>This post is about lambda calculus. The goal is not to do maths with it, but rather to build up definitions within it until we can express non-trivial algorithms easily. At the end we will see a lambda calculus interpreter written in the lambda calculus, and realise that we're most of the way to Lisp.</p><p>But first, why care about lambda calculus? 
Consider four different systems:</p><ul><li><p>A <b>Turing machine</b> – that is, a machine that:</p><ul><li><p>works on an infinite tape of cells from which a finite set of symbols can be read and written, and always points at one of these cells;</p></li><li><p>has some set of states it can be in, some of which are termed "accepting" and one of which is the starting state; and</p></li><li><p>given a combination of current state and current symbol on the tape, always does an action consisting of three things:</p><ul><li>writes some symbol on the tape (possibly the same that was already there),</li><li>transitions to some state (possibly the same it is already in), and</li><li>moves one cell left or right on the tape.</li> </ul></li> </ul></li><li><p>The <b>lambda calculus</b> (<script type="math/tex">\lambda</script>-calculus), a formal system that has expressions that are built out of an infinite set of variable names using <script type="math/tex">\lambda</script>-terms (which can be thought of as anonymous functions) and applications (analogous to function application), and a few simple rules for shuffling around the symbols in these expressions.</p></li><li><p>The <b>partial recursive functions</b>, constructed by function composition, primitive recursion (think bounded for-loops), and minimisation (returning the first value for which a function is zero) on three basic sets of functions:</p><ul><li>the zero functions, that take some number of arguments and return 0;</li><li>a successor function that takes a number and returns that number plus 1; and</li><li>the projection functions, defined for all natural numbers <script type="math/tex">a</script> and <script type="math/tex">b</script> such that <script type="math/tex">a \geq b</script> as taking in <script type="math/tex">a</script> arguments and returning the <script type="math/tex">b</script>th one.</li> </ul></li><li><p><b>Lisp</b>, a human-friendly axiomatisation of computation that accidentally
became an extremely good and long-lived programming language.</p></li> </ul><p>The big result in theoretical computer science is that these can all do the same thing, in the sense that if you can express a calculation in one, you can express it in any other.</p><p>This is not an obvious thing. For example, the only thing lambda calculus lets you do is create terms consisting of symbols, single-argument anonymous functions, and applications of terms to each other (we'll look at the specifics soon). It's an extremely simple and basic thing. Yet no matter how hard you try, you can't make something that can compute more things, whether it's by inventing programming languages or building fancy computers.</p><p>Also, if you try to make something that does some sort of calculation (like a new programming language), then unless you keep it stupidly simple and/or take great care, it will be able to compute anything (at least in la-la-theory-land, where memory is infinite and you don't have to worry about practical details, like whether the computation finishes before the sun goes nova).</p><p>Physicists search for their theory of everything. The computer scientists already have many, even though they've been at it for a lot less time than the physicists have: everything computable can be reduced to one of the many formalisms of computation. (One of the main reasons that we can talk about "computability" as a sensible universal concept is that any reasonable model makes the same things computable; the threshold is easy to hit and impossible to exceed, so computable versus not is an obvious thing to pay attention to.)</p><p>To talk about the theory of computation properly, we need to look at at least one of those models. The most well-known is the Turing machine.
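</p><p>As a concrete illustration of the Turing machine definition above, here is a minimal simulator sketch in Python (my own code, not from the post; I use the common variant where the machine halts as soon as no transition is defined, and <code>'_'</code> as the blank symbol):</p>

```python
def run_tm(delta, start, accepting, tape, steps=10_000):
    """Simulate a simple Turing machine.

    delta maps (state, symbol) -> (symbol_to_write, next_state, move),
    where move is -1 (left) or +1 (right). Halts when no transition
    applies; returns (accepted?, non-blank tape contents).
    """
    cells = dict(enumerate(tape))   # sparse tape: position -> symbol
    state, pos = start, 0
    for _ in range(steps):
        sym = cells.get(pos, '_')
        if (state, sym) not in delta:
            break                   # no action defined: halt
        write, state, move = delta[(state, sym)]
        cells[pos] = write
        pos += move
    out = ''.join(cells[i] for i in sorted(cells)).strip('_')
    return state in accepting, out

# A toy machine that flips every bit of its input, then accepts
# when it first reads a blank.
flip = {
    ('s', '0'): ('1', 's', +1),
    ('s', '1'): ('0', 's', +1),
    ('s', '_'): ('_', 'acc', +1),
}
```

<p>For example, <code>run_tm(flip, 's', {'acc'}, '0110')</code> accepts and leaves <code>1001</code> on the tape.</p><p>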
Turing machines have several points in their favour:</p><ul><li>They are the easiest to imagine as a physical machine.</li><li>They have clear and separate notions of time (steps taken in execution) and space (length of tape used).</li><li>They were invented by Alan Turing, who contributed to breaking the Enigma code during World War II, before being unjustly persecuted for being gay and tragically dying of cyanide poisoning at age 41.</li> </ul><p>In contrast, compare the lambda calculus:</p><ul><li>It is an abstract formal system arising out of a failed attempt to axiomatise logic.</li><li>There are many execution paths for a non-trivial expression.</li><li>It was invented by Alonzo Church, who lived a boringly successful life as a maths professor at Princeton, had three children, and died at age 92.</li> </ul><p>(Turing and Church worked together from 1936 to 1938, Church as Turing's doctoral advisor, after they independently proved the impossibility of the halting problem. At the same time and also working at Princeton were Albert Einstein, Kurt Gödel, and John von Neumann (who, if he had had his way, would've hired Turing and kept him from returning to the UK).)</p><p>However, the lambda calculus also has advantages. Its less mechanistic and more mathematical view of computation is arguably more elegant, and it has fewer moving parts: instead of states, symbols, and a tape, the current state is just a term, and the term also represents the algorithm.
It abstracts more nicely – we will see how we can, bit by bit, abstract out elements and get something that is a sensible programming language, a project that would be messier and longer with Turing machines.</p><p>Turing machines and lambda calculus are the foundations of imperative and functional programming respectively, and the situation between these two programming paradigms mirrors that between TMs and <script type="math/tex">\lambda</script>-calculus: one is more mechanistic, more popular, and more useful when dealing with (stateful) hardware; the other more mathematical, less popular, and neater for abstraction-building.</p><h3>Lambda trees</h3><p>Now let's define exactly what a lambda calculus term is.</p><p>We have an infinite set of variables <script type="math/tex">x_1, x_2, x_3, ...</script>, though for simplicity we will use any lowercase letter to refer to them. Any variable is a valid term. Note that variables are just symbols – despite the word "variable", there is no value bound to them.</p><p>We have two rules for building new terms:</p><ul><li><script type="math/tex">\lambda</script>-terms are formed from a variable <script type="math/tex">x</script> and a term <script type="math/tex">M</script>, and are written <script type="math/tex">(\lambda x. M)</script>.</li><li>Applications are formed from two terms <script type="math/tex">M</script> and <script type="math/tex">N</script>, and are written <script type="math/tex">(M N)</script>.</li> </ul><p>These terms, like most things, are trees. 
I will mostly ignore the convention of writing out horrible long strings of <script type="math/tex">\lambda</script>s and variables, only partly mitigated by parenthesis-reducing rules, and instead draw the trees.</p><p>(When it appears in this post, the standard notation appears slightly more horrible than usual because, for simplicity, I neglect the parenthesis-reducing rules (they can be confusing at first).)</p><p>Here are a few examples of terms, together with standard representations:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-UlDDkX52Ra0/YIXTvhmzH-I/AAAAAAAACvg/2hsLntnO5rkBekTYnalMEBAzIgWjZJkxACLcBGAsYHQ/terms.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="528" data-original-width="1138" height="296" src="https://lh3.googleusercontent.com/-UlDDkX52Ra0/YIXTvhmzH-I/AAAAAAAACvg/2hsLntnO5rkBekTYnalMEBAzIgWjZJkxACLcBGAsYHQ/w640-h296/terms.png" width="640" /></a></div><p></p><p>This representation makes it clear that we're dealing with a tree where nodes are either variables, lambda terms where the left child is the argument and the right child is the body, or applications. 
(I've circled the variables to make clear that the argument variable in a <script type="math/tex">\lambda</script>-term has a different role than a variable appearing elsewhere.)</p><p>It's not quite right to say that a <script type="math/tex">\lambda</script>-term is a function; instead, think of <script type="math/tex">\lambda</script>-terms as one representation of a (mathematical) function, when combined with the reduction rule we will look at soon.</p><p>If we interpret the above terms as representations of functions, we might rewrite them (in Pythonic pseudocode) as, from left to right:</p><ul><li><code>lambda x -> x</code> (i.e., the identity function) (<code>lambda</code> is a common keyword for an anonymous function in programming languages, for obvious reasons).</li><li><code>(lambda f -> f(y))(lambda x -> x)</code> (apply a function that takes a function and calls that function on <code>y</code> to the identity function as an argument).</li><li><code>x(y)</code></li> </ul><h2>Reduction</h2><p>Execution in lambda calculus is driven by something that is called <script type="math/tex">\beta</script>-reduction, presumably because Greek letters are cool. 
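</p><p>The tree structure just described translates directly into code. Here is a minimal sketch in Python (my own illustration with hypothetical class names, not code from the post) of the three node types and the fully parenthesised standard notation:</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:          # a variable, e.g. x
    name: str

@dataclass(frozen=True)
class Lam:          # a λ-term (λx. M): argument variable plus body
    arg: str
    body: object

@dataclass(frozen=True)
class App:          # an application (M N)
    fn: object
    val: object

def show(t):
    """Render a term in the fully parenthesised standard notation."""
    if isinstance(t, Var):
        return t.name
    if isinstance(t, Lam):
        return f"(λ{t.arg}. {show(t.body)})"
    return f"({show(t.fn)} {show(t.val)})"

identity = Lam("x", Var("x"))                             # (λx. x)
middle = App(Lam("f", App(Var("f"), Var("y"))), identity)
```

<p>Here <code>show(middle)</code> gives <code>((λf. (f y)) (λx. x))</code>, the middle example from the figure above.</p><p>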
The basic idea of <script type="math/tex">\beta</script>-reduction is this:</p><ul><li>Pick an application (which I've represented by orange circles in the tree diagrams).</li><li>Check that the left child of the application node is a <script type="math/tex">\lambda</script>-term (if not, you have to reduce it to a <script type="math/tex">\lambda</script>-term before you can make that application).</li><li>Replace the variable in the left child of the <script type="math/tex">\lambda</script>-term with the right child of the application node wherever it appears in the right child of the <script type="math/tex">\lambda</script>-term, and then replace the application node with the right child of the <script type="math/tex">\lambda</script>-term.</li> </ul><p>In illustrated form, on the middle example above, using both tree diagrams and the usual notation:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-WMZ34VP_SSc/YIXT1f79EaI/AAAAAAAACvk/TrjxXNYOrGUb2Gt_22VQaSArx_-TkNyiQCLcBGAsYHQ/reduction1.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="634" data-original-width="1196" height="340" src="https://lh3.googleusercontent.com/-WMZ34VP_SSc/YIXT1f79EaI/AAAAAAAACvk/TrjxXNYOrGUb2Gt_22VQaSArx_-TkNyiQCLcBGAsYHQ/w640-h340/reduction1.png" width="640" /></a></div><p></p>(The notation <script type="math/tex">M[N/x]</script> means substitute the term <script type="math/tex">N</script> for the variable <script type="math/tex">x</script> in the term <script type="math/tex">M</script>; the general rule for <script type="math/tex">\beta</script>-reduction is that given <script type="math/tex">((\lambda x. M) N)</script>, you can replace it with <script type="math/tex">M[N/x]</script>, subject to some details that we will mostly skip over shortly.)
<p>In our example, we end up with another application term, so we can reduce it further:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-uExQDerRBfo/YIXT5Q7Ve6I/AAAAAAAACvo/_qQYEhP3HZEFfQdPoddyKGdbPZJbZVaIQCLcBGAsYHQ/reduction2.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="666" data-original-width="1000" height="426" src="https://lh3.googleusercontent.com/-uExQDerRBfo/YIXT5Q7Ve6I/AAAAAAAACvo/_qQYEhP3HZEFfQdPoddyKGdbPZJbZVaIQCLcBGAsYHQ/w640-h426/reduction2.png" width="640" /></a></div><p></p><p>In our Pythonic pseudocode, we might represent this as an execution trace like the following:</p><pre><code>(lambda f -> f(y))(lambda x -> x)</code></pre><pre><code> --></code></pre><pre><code>(lambda x -> x)(y)</code></pre><pre><code> --></code></pre><pre><code>y<br /></code></pre><p>Reduction is not always so simple, even if there's only a single choice of what to reduce. You have to be careful if the same variable appears in different roles, and rename if necessary. The core rule is that within the tree rooted at a <script type="math/tex">\lambda</script>-term that takes an argument <script type="math/tex">x</script>, the variable <script type="math/tex">x</script> always means whatever was given to that <script type="math/tex">\lambda</script>-term, and never anything else. An <script type="math/tex">x</script> bound in one <script type="math/tex">\lambda</script>-term is distinct from an <script type="math/tex">x</script> bound in another <script type="math/tex">\lambda</script>-term.</p><p>The simplest way to get around problems is to make your first variable <script type="math/tex">x_1</script> and, whenever you need a new one, call it <script type="math/tex">x_i</script> where <script type="math/tex">i</script> is one more than the maximum index of any existing variable. 
Unfortunately humans aren't good at remembering the difference between <script type="math/tex">x_9</script> and <script type="math/tex">x_{17}</script>, and humans like conventions (like using <script type="math/tex">x</script> for generic variables, <script type="math/tex">f</script> for things that will be <script type="math/tex">\lambda</script>-terms, and so forth). Therefore we sometimes have to think about name collisions.</p><p>The principle that lets us out of name collision problems is that you can rename variables as you want (as long as distinct variables aren't renamed to the same thing). The name for this is <script type="math/tex">\alpha</script>-equivalence (more Greek letters!); for example <script type="math/tex">(\lambda x .x)</script> and <script type="math/tex">(\lambda y. y)</script> are <script type="math/tex">\alpha</script>-equivalent.</p><p>There are, of course, detailed rules for how to deal with name collisions when doing <script type="math/tex">\beta</script>-reductions, but you should be fine if you think about how variable scoping should sensibly work to preserve meaning (something you've already had to reason about if you've ever programmed). 
(A helpful concept to keep in mind is the difference between free variables and bound variables – starting from a variable and following the path up the tree to the parent node, does it run through a <script type="math/tex">\lambda</script>-node with that variable as an argument?)</p><p>An example of a name collision problem is this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-3CsGljEB1Po/YIXUBXYsiaI/AAAAAAAACvw/jREPl0dgL7ANsQN0D-XDyyBZpqRa0ff5wCLcBGAsYHQ/wrongreduction.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="766" data-original-width="1094" height="448" src="https://lh3.googleusercontent.com/-3CsGljEB1Po/YIXUBXYsiaI/AAAAAAAACvw/jREPl0dgL7ANsQN0D-XDyyBZpqRa0ff5wCLcBGAsYHQ/w640-h448/wrongreduction.png" width="640" /></a></div><p></p><p>We can't do this because the <script type="math/tex">x</script> in the innermost <script type="math/tex">\lambda</script>-term on the left must mean whatever was passed to it, and the <script type="math/tex">y</script> whatever was passed to the outer <script type="math/tex">\lambda</script>-term. However, our reduction leaves us with an expression that applies its argument to itself. 
We can solve this by renaming the <script type="math/tex">x</script> within the inner <script type="math/tex">\lambda</script>-term:</p><p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-rbqp3oh23-c/YIXUJWBcHgI/AAAAAAAACv8/P7arn5R2eE88z7YxIansO7TtbozLuBDhQCLcBGAsYHQ/wrongreductionfix.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="870" data-original-width="1186" height="470" src="https://lh3.googleusercontent.com/-rbqp3oh23-c/YIXUJWBcHgI/AAAAAAAACv8/P7arn5R2eE88z7YxIansO7TtbozLuBDhQCLcBGAsYHQ/w640-h470/wrongreductionfix.png" width="640" /></a></div></div><p></p><p>The general way to think of lambda calculus terms is that they are partitioned in two ways into equivalence classes:</p><ul><li>The first, rather trivial, set of equivalence classes is treating all <script type="math/tex">\alpha</script>-equivalent terms as the same thing. "Equivalent" and <script type="math/tex">\alpha</script>-equivalent are usually the same thing when we're talking about the lambda calculus; it's the "structure" of a term that matters, not the variable names.</li><li>The second set of equivalence classes is treating everything that can be <script type="math/tex">\beta</script>-reduced into the same form as equivalent. This is less trivial – in fact, it's undecidable in the general case (as we will see in the post about computation theory).</li> </ul><h2>That's it</h2><p>Yes, really, that's all you need. There exists a lambda calculus term that beats you in chess.</p><p>You might ask: but hold on a moment, we have no data – no numbers, no pairs, no lists, no strings – how can we input chess positions into a term or get anything sensible as an answer? We will see later that it's possible to encode data as lambda terms. 
The chess-playing term would accept some massive mess of <script type="math/tex">\lambda</script>-terms encoding the board configuration as an input, and after a lot of reductions it would become a term encoding the move to make – eventually checkmate, against you.</p><p>Before we start abstracting out data and more complex functions, let's make some simple syntax changes and look at some basic facts about reduction.</p><h2>Some syntax simplifications</h2><p>The pure lambda calculus does not have <script type="math/tex">\lambda</script>-terms that take more than one argument. This is often inconvenient. However, there's a simple mapping between multi-argument <script type="math/tex">\lambda</script>-terms and single-argument ones: instead of a two-argument function, say, just have a function that takes in an argument and returns a one-argument function that takes in an argument and returns a result using both arguments.</p><p>(In programming language terms, this is currying.)</p><p>In the standard notation, <script type="math/tex">(\lambda x.(\lambda y. M))</script> is often written <script type="math/tex">(\lambda xy.M)</script>. 
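</p><p>(In Python, the same trick looks like this; a sketch, with addition standing in for an arbitrary two-argument body:)</p><pre><code class="language-python" lang="python"># A two-argument function...
add2 = lambda x, y: x + y
# ...and its curried form: a one-argument function that returns
# another one-argument function which remembers x.
add = lambda x: lambda y: x + y

print(add2(2, 3))  # 5
print(add(2)(3))   # 5, via two single-argument applications
</code></pre><p>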
Likewise, we can do similar simplifications on our trees, remembering that this is a syntactic/visual difference, rather than introducing something new to the lambda calculus:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-cVpDLaEazpQ/YIXUPbu9LwI/AAAAAAAACwA/q-lIAh_fh0AHGGS3t4sQOWYZJTNq-uxEQCLcBGAsYHQ/simplersyntax.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="670" data-original-width="880" height="305" src="https://lh3.googleusercontent.com/-cVpDLaEazpQ/YIXUPbu9LwI/AAAAAAAACwA/q-lIAh_fh0AHGGS3t4sQOWYZJTNq-uxEQCLcBGAsYHQ/w400-h305/simplersyntax.png" width="400" /></a></div><p></p><p>Once we've done this change, the next natural simplification to make is to allow one application node to apply many arguments to a <script type="math/tex">\lambda</script>-term with "many arguments" (remember that it actually stands for a bunch of nested normal single-argument <script type="math/tex">\lambda</script>-terms):</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-bgLtQrFG_c0/YIXUSDSEaXI/AAAAAAAACwE/JdfsNQC21cAhHgCLDkimLoQhTRwq_Q_xgCLcBGAsYHQ/simplersyntax2.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="600" data-original-width="1100" height="350" src="https://lh3.googleusercontent.com/-bgLtQrFG_c0/YIXUSDSEaXI/AAAAAAAACwE/JdfsNQC21cAhHgCLDkimLoQhTRwq_Q_xgCLcBGAsYHQ/w640-h350/simplersyntax2.png" width="640" /></a></div><p></p><p>(The corresponding simplification in the standard syntax is that <script type="math/tex">(M \, A \, B\, C)</script> means <script type="math/tex">(((M \, A)\, B)\, C)</script>. In a standard programming language, this might be written <code>M(A)(B)(C)</code>; that is, applying <code>A</code> to <code>M</code> to get a function that you apply to <code>B</code>, yielding another function that you apply to <code>C</code>. 
Sanity check: what's the difference between <script type="math/tex">((M \, A) \, B)</script> and <script type="math/tex">(M \, (A \, B))</script>?)</p><p> </p><h2>Some facts about reduction</h2><h3><script type="math/tex">\beta</script>-normal forms</h3><p>A <script type="math/tex">\beta</script>-normal form can be thought of as a "fully evaluated" term. More specifically, it is one where this configuration of nodes does not appear in the tree (after multi-argument <script type="math/tex">\lambda</script>s and applications have been compiled into single-argument ones), where <script type="math/tex">M</script> and <script type="math/tex">N</script> are arbitrary terms:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-VdVQqLwIZlE/YIXUZa0HY8I/AAAAAAAACwM/ROTS3CjzNEwSaySEpQAroZJQ69Q5S9F2QCLcBGAsYHQ/normal.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="342" data-original-width="448" height="240" src="https://lh3.googleusercontent.com/-VdVQqLwIZlE/YIXUZa0HY8I/AAAAAAAACwM/ROTS3CjzNEwSaySEpQAroZJQ69Q5S9F2QCLcBGAsYHQ/normal.png" width="314" /></a></div><p></p><p>Intuitively, if such a term does appear, then the reduction rules allow us to reduce the application (replacing this part of the tree with whatever you get when you substitute <script type="math/tex">N</script> in place of <script type="math/tex">x</script> within <script type="math/tex">M</script>), so our term is not fully reduced yet.</p><h3>Terms without a <script type="math/tex">\beta</script>-normal form</h3><p>Does every term have a <script type="math/tex">\beta</script>-normal form? If you've seen computation theory stuff before, you should be able to answer this immediately without considering anything about the lambda calculus itself.</p><p>The answer is no, because reducing to a <script type="math/tex">\beta</script>-normal form is the lambda calculus equivalent of an algorithm halting. 
Lambda calculus has the same expressive power as Turing machines or any other model of computation, and some algorithms run forever, so there must exist lambda calculus terms that you can keep reducing without ever getting a <script type="math/tex">\beta</script>-normal form.</p><p>Here's one example, often called <script type="math/tex">\Omega</script>: </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-vkqpIhKyJXY/YIXUcgli2oI/AAAAAAAACwQ/40CTgJilizggNn99lXI0-4YHDetNgNbZgCLcBGAsYHQ/omega.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="706" data-original-width="894" height="316" src="https://lh3.googleusercontent.com/-vkqpIhKyJXY/YIXUcgli2oI/AAAAAAAACwQ/40CTgJilizggNn99lXI0-4YHDetNgNbZgCLcBGAsYHQ/w400-h316/omega.png" width="400" /></a></div><p></p><p>Note that even though we use the same variable <script type="math/tex">x</script> in both branches, the variable means a different thing: in the left branch it's whatever is passed as an input to the left <script type="math/tex">\lambda</script>-term – one reduction step onwards, that <script type="math/tex">x</script> stands for the entire right branch, which has its own <script type="math/tex">x</script>. 
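</p><p>(As a quick aside: <script type="math/tex">\Omega</script> transcribes directly into Python, and since Python keeps evaluating the self-application, the endless reduction shows up as a recursion error; a sketch:)</p><pre><code class="language-python" lang="python"># omega corresponds to (lambda x. x x); applying it to itself gives
# the term that reduces to itself forever.
omega = lambda x: x(x)

try:
    omega(omega)
except RecursionError:
    print("no normal form: Python hit its recursion limit")
</code></pre><p>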
In fact, before we start reducing, we will do an <script type="math/tex">\alpha</script>-conversion on the right branch (a pretentious way of saying that we will rename the bound variable).</p><p>Now watch:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-osYjKlbs2f0/YIXUfOxA5pI/AAAAAAAACwY/WUdsWRTXmkYfLCcyEeHvZnNqV7zFVNmqQCLcBGAsYHQ/omegareduction.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="526" data-original-width="998" height="338" src="https://lh3.googleusercontent.com/-osYjKlbs2f0/YIXUfOxA5pI/AAAAAAAACwY/WUdsWRTXmkYfLCcyEeHvZnNqV7zFVNmqQCLcBGAsYHQ/w640-h338/omegareduction.png" width="640" /></a></div><p></p><p>After one reduction step, we end up with the same term (as usual, we are treating <script type="math/tex">\alpha</script>-equivalent terms as equivalent; the variable could be <script type="math/tex">x</script> or <script type="math/tex">y</script> or <script type="math/tex">å</script> for all we care).</p><h3>Ambiguities with reduction</h3><p>Does it matter how we reduce, or does every reduction path eventually lead to a <script type="math/tex">\beta</script>-normal form, assuming that one exists in the first place? 
If you haven't seen this before, you might want to have a go before reading on.</p><p>Here's one example of a tricky term:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-p-OCKUqUvUI/YIXUjbM7gII/AAAAAAAACwg/BkJUhbr62GclfyCoxAbGIKcI-1-IMU4jgCLcBGAsYHQ/normalorder1.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="596" data-original-width="900" height="212" src="https://lh3.googleusercontent.com/-p-OCKUqUvUI/YIXUjbM7gII/AAAAAAAACwg/BkJUhbr62GclfyCoxAbGIKcI-1-IMU4jgCLcBGAsYHQ/normalorder1.png" width="320" /></a></div><p></p><p>Imagine that <script type="math/tex">M</script> has a <script type="math/tex">\beta</script>-normal form, and <script type="math/tex">\Omega</script> is as defined above and therefore can be reduced forever. If we start by reducing the application node, in a moment <script type="math/tex">\Omega</script> and all its loopiness gets thrown away, and we're left with just <script type="math/tex">M</script>, since the <script type="math/tex">\lambda</script>-term takes two arguments and returns the first. However, if we start by reducing <script type="math/tex">\Omega</script>, or are following a strategy like "evaluate the arguments before the application", we will at some point reduce <script type="math/tex">\Omega</script> and get stuck in an infinite loop.</p><p>We can take a broader view here. In any programming language – I will use Lisp notation because it's the closest to lambda calculus – if we have a function like <code>(define func (lambda (x y) [FUNCTION BODY]))</code>, and a function call like <code>(func arg1 arg2)</code>, the evaluator has a choice of what it does. 
The simplest strategies are to either:</p><ul><li>Evaluate the arguments – <code>arg1</code> and <code>arg2</code> – first, and then inside the function <code>func</code> have <code>x</code> and <code>y</code> bound to the results of evaluating <code>arg1</code> and <code>arg2</code> respectively. This is called call-by-value, and is used by most programming languages.</li><li>Bind <code>x</code> and <code>y</code> inside <code>func</code> to be the unevaluated expressions <code>arg1</code> and <code>arg2</code>, and evaluate <code>arg1</code> and <code>arg2</code> only upon encountering them in the process of evaluating <code>func</code>. This is called call-by-name. It's rare to see it in programming languages (an exception being that it's possible with Lisp macros), but functional languages like Haskell often have a variant, call-by-need or "lazy evaluation", where <code>arg1</code> and <code>arg2</code> are only evaluated when needed, but once evaluated the results are memoized so that the evaluation only needs to happen once.</li> </ul><p>Call-by-value reduces what you can express. 
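</p><p>(Python is a call-by-value language, and a short sketch shows the consequence: an argument is evaluated even when the function never uses it. Wrapping arguments in lambdas – "thunks" – delays evaluation, which is essentially call-by-name done by hand:)</p><pre><code class="language-python" lang="python">first = lambda x, y: x  # uses only its first argument

# Call-by-value: 1 / 0 is evaluated before `first` even runs.
try:
    first(1, 1 / 0)
except ZeroDivisionError:
    print("the unused argument was evaluated anyway")

# Thunks delay evaluation: here the division never happens.
first_lazy = lambda x, y: x()
print(first_lazy(lambda: 1, lambda: 1 / 0))  # 1
</code></pre><p>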
Imagine trying to define your own if-function in a language with call-by-value:</p><pre><code class="language-scheme" lang="scheme">(define IF<br /> (lambda (predicate consequent alternative)<br /> (if predicate<br /> consequent <span style="color: #999999;">; if predicate is true, do this</span><br /> alternative))) <span style="color: #999999;">; if predicate is false, do this instead</span><br /></code></pre><p>(note that <code>IF</code> is the new if-function that we're trying to define, and <code>if</code> is assumed to be a language primitive.)</p><p>Now consider:</p><pre><code class="language-scheme" lang="scheme">(define factorial<br /> (lambda (n)<br /> (IF (= n 0)<br /> 1<br /> (* n<br /> (factorial (- n 1))))))<br /></code></pre><p>You call <code>(factorial 1)</code>, and for the first call the program evaluates the arguments to <code>IF</code>:</p><ul><li><code>(= 1 0)</code></li><li><code>1</code></li><li><code>(* 1 (factorial 0))</code></li> </ul><p>The last one needs the value of <code>(factorial 0)</code>, so we evaluate the arguments to the <code>IF</code> in the recursive call:</p><ul><li><code>(= 0 0)</code></li><li><code>1</code></li><li><code>(* 0 (factorial -1))</code></li> </ul><p>... and so on. We can't define <code>IF</code> as a function, because in call-by-value the <code>alternative</code> gets evaluated as part of the function call even if <code>predicate</code> is false.</p><p>(Most languages solve this by giving you a bunch of primitives and making you stick with them, perhaps with some fiddly mini-language for macros built in (consider C/C++). 
In Lisp, you can easily write macros that use all of the language features, and therefore extend the language by essentially defining your own primitives that can escape call-by-value or any other potentially limiting language feature.)</p><p>It's the same issue with our term <script type="math/tex">((\lambda xy.x) \, M \, \Omega)</script> above: call-by-value goes into a silly loop because one of the arguments isn't even "meant to" be evaluated (from our perspective as humans with goals looking at the formal system from the outside).</p><p>Lambda calculus does not impose a reduction/"evaluation" order, so we can do what we like. However, this still leaves us with a problem: how do we know whether our algorithm has gone into an infinite loop, or we just reduced terms in the wrong order?</p><h3>Normal order reduction</h3><p>It turns out that always doing the equivalent of call-by-name – reducing the leftmost, outermost term first – saves the day. If a <script type="math/tex">\beta</script>-normal form exists, this strategy will lead you to it.</p><p>Intuitively, this is because with call-by-name, there is no "unnecessary" reduction. If some arguments in some call are never used (like in our example), they never reduce. If leftmost/outermost-first reduction does reduce something, that reduction must have been standing between us and a successful reduction to <script type="math/tex">\beta</script>-normal form.</p><p>Formally: ... the proof is left as an exercise for the reader.</p><h3>Church-Rosser theorem</h3><p>The Church-Rosser theorem is the thing that guarantees we can talk about unique <script type="math/tex">\beta</script>-normal forms for a term. 
It says that:</p><blockquote><p>Letting <script type="math/tex">\Lambda</script> be the set of terms in the lambda calculus, <script type="math/tex">\rightarrow_\beta</script> the <script type="math/tex">\beta</script>-reduction relation, and <script type="math/tex">\twoheadrightarrow_\beta</script> its reflexive transitive closure (i.e. <script type="math/tex">M \twoheadrightarrow_\beta N</script> iff <script type="math/tex">M</script> can be turned into <script type="math/tex">N</script> by zero or more <script type="math/tex">\beta</script>-reduction steps), then:</p><p><b>For all <script type="math/tex">M \in \Lambda</script>, <script type="math/tex">M \twoheadrightarrow_\beta A</script> and <script type="math/tex">M \twoheadrightarrow_\beta B</script> implies that there exists <script type="math/tex">X \in \Lambda</script> such that <script type="math/tex">A \twoheadrightarrow_\beta X</script> and <script type="math/tex">B \twoheadrightarrow_\beta X</script>.</b></p></blockquote><p>Visually, if we have reduction chains like the black part, then the blue part must exist (a property known as confluence or the "diamond property"):</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-EvzNpdkHns0/YIXUwuCD5vI/AAAAAAAACws/a55xmnExm7kPIsTOfeB7yBMGD0TiGdpegCLcBGAsYHQ/churchrosser.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="574" data-original-width="998" height="230" src="https://lh3.googleusercontent.com/-EvzNpdkHns0/YIXUwuCD5vI/AAAAAAAACws/a55xmnExm7kPIsTOfeB7yBMGD0TiGdpegCLcBGAsYHQ/w400-h230/churchrosser.png" width="400" /></a></div><p></p><p>Therefore, even if there are many reduction paths, and even if some of them are non-terminating, for any two different starting <script type="math/tex">\beta</script>-reductions we can make, we will not lose the existence of a reduction path to any <script 
type="math/tex">X</script>. If <script type="math/tex">X</script> is some <script type="math/tex">\beta</script>-normal form reachable from <script type="math/tex">M</script>, we know that any other reduction path that reaches a <script type="math/tex">\beta</script>-normal form must have reached <script type="math/tex">X</script>.</p><h2>The fun begins</h2><p>Now we will start making definitions within the lambda calculus. These definitions do not add any capabilities to the lambda calculus, but are simply conveniences to save us from having to draw huge trees repeatedly when we get to doing more complex things.</p><p>There are two big ideas to keep in mind:</p><ol><li>There are no data primitives in the lambda calculus (even the variables are just placeholders for terms to get substituted into, and don't even have consistent names – remember that we work within <script type="math/tex">\alpha</script>-equivalence). As a result, the general idea is that you encode "data" as actions: the number 4 is represented by a function that takes a function and an input and applies the function to the input 4 times, a list might be encoded by a description of how to iterate over it, and so on.</li><li>There are no types. Nothing in the lambda calculus will stop you from passing a number to a function that expects a function, or vice versa. There exist <a href="https://en.wikipedia.org/wiki/Typed_lambda_calculus">typed lambda calculi</a>, but they prevent you from doing some of the cool things with combinators that we'll see later in this post.</li> </ol><h3>Pairs</h3><p>We want to be able to associate two things into a pair, and then extract the first and second elements. 
In other words, we want things that work like this:</p><pre><code>(fst (pair a b)) == a<br />(snd (pair a b)) == b<br /></code></pre><p>The simplest solution starts like this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-Elx8o61jQWM/YIXU1Gk5PQI/AAAAAAAACw0/DQAu3ZCQ_dQ3DlPYCWDPaKTrhOb8oWLTwCLcBGAsYHQ/pairs.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1044" data-original-width="820" height="400" src="https://lh3.googleusercontent.com/-Elx8o61jQWM/YIXU1Gk5PQI/AAAAAAAACw0/DQAu3ZCQ_dQ3DlPYCWDPaKTrhOb8oWLTwCLcBGAsYHQ/w315-h400/pairs.png" width="315" /></a></div><p></p><p>Now we can get the first of a pair by doing <code>((pair x y) first)</code>. If we want the exact semantics above, we can define simple helpers like </p><pre><code class="language-scheme" lang="scheme">fst = (lambda p<br /> (p first))<br /></code></pre><p>(i.e. <script type="math/tex">\text{fst} = (\lambda p. (p \, \text{first}))</script>), and </p><pre><code class="language-scheme" lang="scheme">snd = (lambda p<br /> (p second))<br /></code></pre><p>since now <code>(snd (pair x y))</code> reduces to <code>((pair x y) second)</code> reduces to <code>y</code>.</p><h3>Lists</h3><p>A list can be constructed from pairs: <code>[1, 2, 3]</code> will be represented by <code>(pair 1 (pair 2 (pair 3 False)))</code> (we will define <code>False</code> later). 
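</p><p>(Both the pair encoding and the pair-based list transcribe directly into Python; a sketch, using the string <code>"False"</code> as a stand-in terminator since we only define <code>False</code> as a lambda term later:)</p><pre><code class="language-python" lang="python">pair = lambda a: lambda b: lambda f: f(a)(b)
first = lambda x: lambda y: x   # selector: keep the first argument
second = lambda x: lambda y: y  # selector: keep the second argument

fst = lambda p: p(first)
snd = lambda p: p(second)

p = pair(1)(2)
print(fst(p), snd(p))  # 1 2

# The list [1, 2, 3] as nested pairs:
l = pair(1)(pair(2)(pair(3)("False")))
print(fst(l))       # head of the list: 1
print(fst(snd(l)))  # head of the rest: 2
</code></pre><p>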
If <script type="math/tex">l_1</script>, <script type="math/tex">l_2</script>, and <script type="math/tex">l_3</script> are the list items, a three-element list looks like this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-l5IiHdFAAVo/YIXU6O3V_3I/AAAAAAAACw4/M_EcXX0HyssTbmFvQG3QVfYQ6_4eLqa9QCLcBGAsYHQ/list.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1000" data-original-width="1060" height="378" src="https://lh3.googleusercontent.com/-l5IiHdFAAVo/YIXU6O3V_3I/AAAAAAAACw4/M_EcXX0HyssTbmFvQG3QVfYQ6_4eLqa9QCLcBGAsYHQ/w400-h378/list.png" width="400" /></a></div><p></p><p>We might also represent the same list like this instead:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-dRZUdY06P_E/YIXU-TNdfwI/AAAAAAAACw8/rKDaQ5adAYw2nhh6xj-tADnzoS-2n-FawCLcBGAsYHQ/listvar.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="858" data-original-width="970" height="354" src="https://lh3.googleusercontent.com/-dRZUdY06P_E/YIXU-TNdfwI/AAAAAAAACw8/rKDaQ5adAYw2nhh6xj-tADnzoS-2n-FawCLcBGAsYHQ/w400-h354/listvar.png" width="400" /></a></div><p></p><p>This second representation makes it trivial to define things like a <code>reduce</code> function: <code>([1, 2, 3] 0 +)</code> would return 0 plus the sum of the list <code>[1, 2, 3]</code>, if <code>[1, 2, 3]</code> is represented as above. 
However, this representation would also make it harder to do other list operations, like getting all but the first element of a list, whereas our pair-based lists can do this trivially (<code>(snd l)</code> gets you all but the first element of the list <code>l</code>).</p><h3>Numbers & arithmetic</h3><p>Here is how the numbers work (using a system called Church numerals):</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-NuPgzLWknX4/YIXVCapZnHI/AAAAAAAACxE/cozGKFi3rVgsM6juTckj1SJSTo8utUlMgCLcBGAsYHQ/numbers.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="616" data-original-width="1200" height="328" src="https://lh3.googleusercontent.com/-NuPgzLWknX4/YIXVCapZnHI/AAAAAAAACxE/cozGKFi3rVgsM6juTckj1SJSTo8utUlMgCLcBGAsYHQ/w640-h328/numbers.png" width="640" /></a></div><p></p><p>Since giving a function <script type="math/tex">f</script> to a number <script type="math/tex">n</script> (also a function) gives a function that applies <script type="math/tex">f</script> to its input <script type="math/tex">n</script> times, a lot of things are very convenient. Say you have this function to add one, which we'll call <code>succ</code> (for "successor"):<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-g04afKMD8cw/YIXVFFyIZjI/AAAAAAAACxI/af_y0P4lIX4q1h6A4Fb9Sf8t69VkBLEJgCLcBGAsYHQ/succ.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="700" data-original-width="1000" height="280" src="https://lh3.googleusercontent.com/-g04afKMD8cw/YIXVFFyIZjI/AAAAAAAACxI/af_y0P4lIX4q1h6A4Fb9Sf8t69VkBLEJgCLcBGAsYHQ/w400-h280/succ.png" width="400" /></a></div><p></p><p>(Considering the above definition of numbers: why does it work?) <br /></p><p>Now what is <code>(42 succ)</code>? It's a function that takes an argument and adds <code>42</code> to it. 
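</p><p>(These claims are easy to check by transcribing the encoding into Python; a sketch, where the helper <code>to_int</code> – our own addition, not part of the encoding – converts a Church numeral back to a Python integer by counting applications:)</p><pre><code class="language-python" lang="python">zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

# Convert a Church numeral to an int by applying "+1" to 0.
to_int = lambda n: n(lambda k: k + 1)(0)

three = succ(succ(succ(zero)))
print(to_int(three))               # 3

# (3 succ) applied to the numeral 3 adds 3 to it:
print(to_int(three(succ)(three)))  # 6
</code></pre><p>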
More generally, <code>((n succ) m)</code> gives you <code>m+n</code>. However, there's also a more straightforward way to represent addition, which you can figure out from noticing that all we have to do to add <code>m</code> to <code>n</code> is to compose the "apply <code>f</code>" operation <code>m</code> more times to <code>n</code>, something we can do simply by calling <code>(m f)</code> on <code>n</code>, once we've "standardised" <code>n</code> to have the same <code>f</code> and <code>x</code> as in the <script type="math/tex">\lambda</script>-term that represents <code>m</code> (that is why we have the <code>(n f x)</code> application, rather than just <code>n</code>):</p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-k4huLoOTr60/YIXVRqp4GsI/AAAAAAAACxU/RwfI1uA2p9IooMCmvyQVTh2TxfyFYEFHgCLcBGAsYHQ/add.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="668" data-original-width="968" height="276" src="https://lh3.googleusercontent.com/-k4huLoOTr60/YIXVRqp4GsI/AAAAAAAACxU/RwfI1uA2p9IooMCmvyQVTh2TxfyFYEFHgCLcBGAsYHQ/w400-h276/add.png" width="400" /></a></div><p></p><p>Now, want multiplication? One way is to see that we can define <code>(mult m n)</code> as <code>((n (adder m)) 0)</code>, assuming that <code>(adder m)</code> returns a function that adds <code>m</code> to its input. 
As we saw, that can be done with <code>(m succ)</code>, so:</p><pre><code class="language-scheme" lang="scheme">(mult m n) =<br />((n (m succ))<br /> 0)<br /></code></pre><p>There's a more standard way too:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-yy7qimS9E4Q/YIXVU0q5hQI/AAAAAAAACxc/EKbvrXKEIQ8Idi23u7vDt9y2zO5sKk2MQCLcBGAsYHQ/mult.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="564" data-original-width="796" height="284" src="https://lh3.googleusercontent.com/-yy7qimS9E4Q/YIXVU0q5hQI/AAAAAAAACxc/EKbvrXKEIQ8Idi23u7vDt9y2zO5sKk2MQCLcBGAsYHQ/w400-h284/mult.png" width="400" /></a></div><br /><p></p> <p>The idea here is simply that <code>(n f)</code> gives a <script type="math/tex">\lambda</script>-term that takes an input and applies <code>f</code> to it <script type="math/tex">n</script> times, and when we call <code>m</code> with that as its first argument, we get something that does the <script type="math/tex">n</script>-fold application <script type="math/tex">m</script> times, for a total of <script type="math/tex">mn</script> times, and now all that remains is to pass the <code>x</code> to it.</p><p>A particularly neat thing is that exponentiation can be this simple:<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-yMefuUdBrmY/YIXVbeKlCVI/AAAAAAAACxk/Izg3I42x73k_tdatLE0Ty6beLmzIn9HTgCLcBGAsYHQ/exp.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="536" data-original-width="790" height="217" src="https://lh3.googleusercontent.com/-yMefuUdBrmY/YIXVbeKlCVI/AAAAAAAACxk/Izg3I42x73k_tdatLE0Ty6beLmzIn9HTgCLcBGAsYHQ/exp.png" width="320" /></a></div><p></p><p>Why? I'll let the trees talk. 
First, using the definition of <code>n</code> as a Church numeral (which I will underline in the trees below), and doing one <script type="math/tex">\beta</script>-reduction, we have:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-PV4bT_1_ZSE/YIXVdvJX2gI/AAAAAAAACxo/xgbezro8juoIPyGv6Lc9wXi7DEWx5DtPQCLcBGAsYHQ/expe1.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="946" data-original-width="1528" height="396" src="https://lh3.googleusercontent.com/-PV4bT_1_ZSE/YIXVdvJX2gI/AAAAAAAACxo/xgbezro8juoIPyGv6Lc9wXi7DEWx5DtPQCLcBGAsYHQ/w640-h396/expe1.png" width="640" /></a></div><p></p><p>This does not look promising – a number needs to have two arguments, but we have a <script type="math/tex">\lambda</script>-term taking in one. However, we'll soon see that the <code>x</code> in the tree on the right actually turns out to be the first argument, <code>f</code>, in the finished number. In fact, we'll make that renaming right away (since we're working under <script type="math/tex">\alpha</script>-equivalence), and continue reducing (below we've taken the bottom-most <code>m</code> and expanded it into its Church numeral definition): </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-82QyX18WZMI/YIXVgQWkf5I/AAAAAAAACxs/6lNfk11Iz3gzl1oWMX8NKQeqZ9FkjSefgCLcBGAsYHQ/expe2.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="906" data-original-width="1200" height="483" src="https://lh3.googleusercontent.com/-82QyX18WZMI/YIXVgQWkf5I/AAAAAAAACxs/6lNfk11Iz3gzl1oWMX8NKQeqZ9FkjSefgCLcBGAsYHQ/w640-h483/expe2.png" width="640" /></a></div><p></p><p>At this point, the picture gets clearer: the next thing we'd reduce is the lambda term at the bottom applied to <code>m</code>, but that's just going to do the lambda term (which applies <code>f</code> <script type="math/tex">m</script> times) <script 
type="math/tex">m</script> more times. We'll have done 2 steps, and gotten up to <script type="math/tex">m^2</script> nestings of <code>f</code>. By the time we've done the remaining <script type="math/tex">n-1</script> steps, we'll have the representation of <script type="math/tex">m^n</script>; the <script type="math/tex">n-1</script> more applications between our bottom-most and topmost lambda term will reduce away, while the stack of applications of <code>f</code> increases by a factor of <script type="math/tex">m</script> each time.</p><p>What about subtraction? It's a bit complicated. Okay, how about just subtraction by <i>one</i>, also known as the <code>pred</code> (predecessor) function? Also tricky (and a good puzzle if you want to think about it). Here's one way:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-qNJvAtfo9UI/YIXVjnrMdBI/AAAAAAAACxw/b1n7LTuSA4Ye_bAeOs3eZ2PVAwesypyAACLcBGAsYHQ/pred.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="876" data-original-width="1146" height="489" src="https://lh3.googleusercontent.com/-qNJvAtfo9UI/YIXVjnrMdBI/AAAAAAAACxw/b1n7LTuSA4Ye_bAeOs3eZ2PVAwesypyAACLcBGAsYHQ/w640-h489/pred.png" width="640" /></a></div><p></p><p>Church numerals make it easy to add, but not subtract. So instead, here's what we do. First (box 1), we make a pair like <code>[0 0]</code>. Next (polygon 2), we have a function that takes a pair <code>p=[a b]</code> and creates a new pair <code>[b (succ b)]</code>, where <code>succ</code> is the successor function (one plus its input). Repeated application of this function on the pair in box 1 looks like this: <code>[0 0]</code>, <code>[0 1]</code>, <code>[1 2]</code>, <code>[2 3]</code>, and so on. 
Thus we see that if we start from <code>[0 0]</code> and apply the function in polygon 2 <script type="math/tex">n</script> times (box 3), the first element of the pair is (the Church numeral for) <script type="math/tex">n-1</script>, and the second element is <script type="math/tex">n</script>, and we can simply call <code>fst</code> to get that first element.</p><p>As we saw before, we can define subtraction as repeated application of <code>pred</code>:</p><pre><code class="language-scheme" lang="scheme">(minus m n) =<br />((n pred) m)<br /></code></pre><p>There's an alternative to Church numerals that's found in the more general <a href="https://crypto.stanford.edu/~blynn/compiler/scott.html">Scott encoding</a>. The advantages of Church vs Scott numerals, and their relative structures, are similar to the relative merits and structures of the two types of lists we discussed: one makes many operations natural by exploiting the fact that everything is a function, but also makes "throwing off a piece" (taking the rest/<code>snd</code> of a list, or subtracting one from a number) much harder.</p><h3>Booleans, if, & equality</h3><p>You might have noticed that we've defined <code>second</code> as <script type="math/tex">(\lambda x y. y)</script>, and <code>0</code> as <script type="math/tex">(\lambda f x. x)</script>. These two terms are a variable-renaming away from each other, so they are <script type="math/tex">\alpha</script>-equivalent. In other words, <code>second</code> and <code>0</code> are the same thing. Because we don't have types, which is which depends only on our interpretation of the context it appears in.</p><p>Now let's define <code>True</code> and <code>False</code>. <code>False</code> is kind of like <code>0</code>, so let's just say they're also the same thing. The opposite of <script type="math/tex">(\lambda x y. y)</script> is <script type="math/tex">(\lambda x y.
x)</script>, so let's define that to be <code>True</code>.</p><p>What sort of muddle have we landed ourselves in now? Quite a good one, actually. Let's define <code>(if p c a)</code> to be <code>(p c a)</code>. If the predicate <code>p</code> is <code>True</code>, we select the consequent <code>c</code>, because <code>(True c a)</code> is exactly the same as <code>(first c a)</code>, which is clearly <code>c</code>. Likewise, if <code>p</code> is <code>False</code>, then we evaluate the same thing as <code>(second c a)</code> and end up with the alternative <code>a</code>.</p><p>We will also want to test whether a number is <code>0</code>/<code>False</code> (equality in general is hard in the lambda calculus, so what we end up with won't be guaranteed to work with things that aren't numbers). A simple way is:</p><pre><code class="language-scheme" lang="scheme">eq0 =<br />(lambda x<br /> (x (lambda y<br /> False)<br /> True))<br /></code></pre><p>If <code>x</code> is <code>0</code>, it's the same as <code>second</code> and will act as a conditional and pick out <code>True</code>. If it's not zero, we assume that it's some number <script type="math/tex">n</script>, and therefore will be a function that applies its first argument <script type="math/tex">n</script> times. Applying <script type="math/tex">(\lambda y.\text{False})</script> any non-zero number of times to anything will return <code>False</code>.</p><h2>Fixed points, combinators, and recursion</h2><p>The big thing missing from the definitions we've put on top of the lambda calculus so far is recursion. Every lambda term represents an anonymous function, so there's no name within a <script type="math/tex">\lambda</script>-term that we can "call" to recurse.</p><p>Rather than jumping in straight to recursion, we're going to start with Russell's paradox: does a set that contains all elements that are not in the set contain itself?
Phrased mathematically: what the hell is <script type="math/tex">R = \{x \,|\,x\notin R\} </script>?</p><p>In computation theory, sets are often specified by a characteristic function: a function that is always defined if the set is computable, and returns true if an element is in the set and false otherwise.</p><p>In the lambda calculus (which was originally supposed to be a foundation for logic), here's a characteristic function for the Russell set <script type="math/tex">R</script>:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/--fAw5h7j9LM/YIXVofSsV-I/AAAAAAAACx4/s-qCYIIqZ-A4ZGdu3bjaBdRLzhYs1ijfwCLcBGAsYHQ/russell.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="678" data-original-width="800" height="339" src="https://lh3.googleusercontent.com/--fAw5h7j9LM/YIXVofSsV-I/AAAAAAAACx4/s-qCYIIqZ-A4ZGdu3bjaBdRLzhYs1ijfwCLcBGAsYHQ/w400-h339/russell.png" width="400" /></a></div><p></p><p>(where <code>not</code> can be straightforwardly defined on top of our existing definitions as <code>(not b) = (b False True)</code>).</p><p>This <script type="math/tex">\lambda</script>-term takes in an element <code>x</code>, assumes that <code>x</code> is (the characteristic function for) the set itself, and asks: is it the case that <code>x</code> is <i>not</i> in the set?
Call this term <code>R</code>, and consider <code>(R R)</code>: the left <code>R</code> is working as the (characteristic function of) the set, and the right <code>R</code> as the element whose membership of the set we are testing.</p><p>Evaluating:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-5USU6gUr_xU/YIXVrF-huCI/AAAAAAAACyA/iOL_s4hxuXsbYjeKp4LKjmOxoxLSyDKpgCLcBGAsYHQ/russell2.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="612" data-original-width="1200" height="326" src="https://lh3.googleusercontent.com/-5USU6gUr_xU/YIXVrF-huCI/AAAAAAAACyA/iOL_s4hxuXsbYjeKp4LKjmOxoxLSyDKpgCLcBGAsYHQ/w640-h326/russell2.png" width="640" /></a></div><p></p><p>So we start out saying <code>(R R)</code>, and in one <script type="math/tex">\beta</script>-reduction step we end up saying <code>(not (R R))</code> (just as, with Russell's paradox, it first seems that the set must contain itself, because the set is not in itself, but once we've added the set to itself then suddenly it shouldn't be in itself anymore). One more step and we get, from <code>(R R)</code>, <code>(not (not (R R)))</code>. This is not ideal as a foundation for logic.</p><p>However, you might realise something: the <code>not</code> here doesn't play any role. We can replace it with any arbitrary <code>f</code>. 
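</p><p>The same endless unfolding is easy to witness in Python, using the built-in <code>not</code> as a stand-in (purely an illustration I've added): evaluating <code>(R R)</code> just recurses until the interpreter gives up.</p>

```python
# The characteristic "function" of the Russell set, as a Python lambda.
R = lambda x: not x(x)

# (R R) -> (not (R R)) -> (not (not (R R))) -> ... : no value ever comes back.
try:
    R(R)
except RecursionError:
    print("(R R) never settles on a value")
```

<p>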
In fact, let's do that, and create a simple wrapper <script type="math/tex">\lambda</script>-term around it that lets us pass in any <code>f</code> we want:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-AMgahMRA2JE/YIXVtdgUYmI/AAAAAAAACyE/Tb4b7v2-emM5q5fl8Xicb8TQIvpxHgu2gCLcBGAsYHQ/Y.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1104" data-original-width="1076" height="400" src="https://lh3.googleusercontent.com/-AMgahMRA2JE/YIXVtdgUYmI/AAAAAAAACyE/Tb4b7v2-emM5q5fl8Xicb8TQIvpxHgu2gCLcBGAsYHQ/w390-h400/Y.png" width="390" /></a></div><p></p><p>Now let's look at the properties that <script type="math/tex">Y</script> has:</p><div cid="n1079" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n1079" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-363" type="math/tex; mode=display">(Y \, f) \rightarrow_\beta (f \, (Y \, f)) \rightarrow_\beta (f \, (f \, (Y \, f))) \rightarrow_\beta ...</script></div></div><p><script type="math/tex">Y</script> is called the Y combinator ("combinator" is a generic term for a lambda calculus term with no free variables). It is part of the general class of fixed-point combinators: combinators <script type="math/tex">X</script> such that <script type="math/tex">(X \, f) = (f \, (X\,f))</script>. (Turing invented another one: <script type="math/tex">\Theta = (A \, A)</script>, where <script type="math/tex">A</script> is defined as <script type="math/tex">(\lambda x y. (y \,(x\, x\, y)))</script>.)</p><p>A fixed-point combinator gives us recursion. 
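</p><p>As an aside (my own illustration, not part of the original development): under eager evaluation, <script type="math/tex">Y</script> as defined loops forever, because <script type="math/tex">(Y \, f)</script> keeps expanding before <script type="math/tex">f</script> is ever applied. In a language like Python, the standard fix is the <script type="math/tex">\eta</script>-expanded variant, usually called the Z combinator:</p>

```python
# Z is a fixed-point combinator that works under eager (call-by-value)
# evaluation: the inner "lambda v" delays the self-application until needed.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# A factorial with the recursive call left as a free parameter, as in the text.
F = lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1)

factorial = Z(F)
```

<p><code>factorial(5)</code> evaluates to 120, by exactly the kind of step-by-step unfolding traced for <code>((Y F) 2)</code> in the text.</p><p>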
Imagine we've almost written a recursive function, say for a factorial, except we've left a free function parameter for the recursive call:</p><pre><code class="language-scheme" lang="scheme">(lambda f x<br /> (if (eq0 x)<br /> 1<br /> (mult x<br /> (f (pred x)))))<br /></code></pre><p>(Also, take a moment to appreciate that we can already do everything necessary except for the recursion with our earlier definitions.)</p><p>Call the previous recursion-free factorial term <code>F</code>, and consider reducing <code>((Y F) 2)</code> (where <code>-BETA-></code> stands for one or more <script type="math/tex">\beta</script>-reductions):</p><pre><code class="language-scheme" lang="scheme">((Y F)<br /> 2)<br /><br />-BETA-><br /><br />((F (Y F))<br /> 2)<br /><br />-BETA-><br /><br />((lambda x<br /> (if (eq0 x)<br /> 1<br /> (mult x<br /> ((Y F) (pred x)))))<br /> 2)<br /><br />-BETA-><br /><br />(if (eq0 2)<br /> 1<br /> (mult 2<br /> ((Y F) (pred 2))))<br /><br />-BETA-><br /><br />(mult 2<br /> ((Y F)<br /> 1))<br /><br />-BETA-><br /><br />(mult 2<br /> ((F (Y F))<br /> 1))<br /><br />-BETA-><br /><br />(mult 2<br /> ((lambda x<br /> (if (eq0 x)<br /> 1<br /> (mult x<br /> ((Y F) (pred x)))))<br /> 1))<br /><br />-BETA-><br />...<br />-BETA-><br /><br />(mult 2<br /> (mult 1<br /> 1))<br /><br />-BETA-><br /><br />2<br /></code></pre><p>It works! Get a fixed-point combinator, and recursion is solved.</p><h3>Primitive recursion</h3><p>The definition of the partial recursive functions (one of the ways to define computability, mentioned at the beginning) involves something called primitive recursion. Let's implement that, and along the way look at fixed-point combinators from another perspective.</p><p>Primitive recursion is essentially about implementing bounded for-loops / recursion stacks, where "bounded" means that the depth is known when we enter the loop. 
Specifically, there's a function <script type="math/tex">f</script> that takes in zero or more parameters, which we'll abbreviate as <script type="math/tex">\overline{P}</script>. At 0, the value of our primitive recursive function <script type="math/tex">h</script> is <script type="math/tex">f(\overline{P})</script>. At any integer <script type="math/tex">x+1</script> for <script type="math/tex">x \geq 0</script>, <script type="math/tex">h(\overline{P}, x+1)</script> is defined as <script type="math/tex">g(\overline{P}, x, h(\overline{P}, x))</script>: in other words, the value at <script type="math/tex">x+1</script> is given by some function of:</p><ul><li>fixed parameter(s) <script type="math/tex">\overline{P}</script>,</li><li>how many more steps there are in the loop before hitting the base case (<script type="math/tex">x</script>), and</li><li>the value at <script type="math/tex">x</script> (the recursive part).</li> </ul><p>For example, in our factorial example there are no parameters, so <script type="math/tex">f</script> is just the constant function 1, and <script type="math/tex">g(x, r) = (x + 1) \times r</script>, where <script type="math/tex">r</script> is the recursive result for one less, and we have <script type="math/tex">x+1</script> because (for a reason I can't figure out – ideas?) 
<script type="math/tex">g</script> takes, by definition, not the current loop index but one less.</p><p>Now it's pretty easy to write the function for primitive recursion, leaving the recursive call as an extra parameter (<code>r</code>) once again, and assuming that we have <script type="math/tex">\lambda</script>-terms <code>F</code> and <code>G</code> for <script type="math/tex">f</script> and <script type="math/tex">g</script> respectively:</p><pre><code class="language-scheme" lang="scheme">(lambda r P x<br /> (if (eq0 x)<br /> (F P)<br /> (G P (pred x) (r P (pred x)))))<br /></code></pre><p>Slap a <script type="math/tex">Y</script> in front, and we take care of the recursion and we're done.</p><h3>The fixed point perspective</h3><p>However, rather than viewing this whole "slap in the <script type="math/tex">Y</script>" business as a hack for getting recursion, we can also interpret it as a fixed point operation.</p><p>A fixed point of a function <script type="math/tex">f</script> is a value <script type="math/tex">x</script> such that <script type="math/tex">x = f(x)</script>. The fixed points of <script type="math/tex">f(x)=x^2</script> are 0 and 1. In general, fixed points are often useful in maths stuff and there's a lot of deep theory behind them (for which you will have to look elsewhere).</p><p>Now <script type="math/tex">Y</script> (or any other fixed point combinator) has the property that <script type="math/tex">(Y f) =_\beta (f \, (Y\, f))</script> (remember that the equivalent of <script type="math/tex">f(x)</script> is written <script type="math/tex">(f \,x)</script> in the lambda calculus). 
In other words, <script type="math/tex">Y</script> is a magic wand that takes a function and returns its fixed point (albeit in a mathematical sense that is not very useful for explicitly finding those fixed points).</p><p>Taking once again the example of defining primitive recursion, we can consider it as the fixed point problem of finding an <script type="math/tex">h</script> such that <script type="math/tex">h = \Phi_{f,g}(h)</script>, where <script type="math/tex">\Phi_{f,g}</script> is a function like the following, where <code>F</code> and <code>G</code> are the lambda calculus representations of <script type="math/tex">f</script> and <script type="math/tex">g</script> respectively:</p><pre><code class="language-scheme" lang="scheme">(lambda h<br /> (lambda P x<br /> (if (eq0 x)<br /> (F P)<br /> (G P (pred x) (h P (pred x)))))))<br /></code></pre><p>That is, <script type="math/tex">\Phi_{f,g}</script> takes in some function <code>h</code>, and then returns a function that does primitive recursion – <i>under the assumption</i> that <code>h</code> is the right function for the recursive call.</p><p>Imagine it like this: when we're finding the fixed point of <script type="math/tex">f(x)= x^2</script>, we're asking for <script type="math/tex">x</script> such that <script type="math/tex">x=x^2</script>. We can imagine reaching into the set of values that <script type="math/tex">x</script> can take (in this case, the real numbers), plugging them in, and seeing that in most cases the equation <script type="math/tex">x=x^2</script> is false, but if we pick out a fixed point it becomes true. 
Similarly, solving <script type="math/tex">h=\Phi_{f,g}(h)</script> is the problem of considering all possible functions <script type="math/tex">h</script> (and it turns out all computable functions can be enumerated, so this is, if anything, less crazy than considering all possible real numbers), and requiring that plugging in <script type="math/tex">h</script> into <script type="math/tex">\Phi_{f,g}</script> gives back <script type="math/tex">h</script>. For almost any function that we plug in, this equation will be nonsense: instead of doing primitive recursion, on the first call to <code>h</code> <script type="math/tex">\Phi_{f,g}</script> will do some crazy call that might loop forever or calculate the 17th digit of <script type="math/tex">\pi</script>, but if it's picked just right, <script type="math/tex">h</script> and <script type="math/tex">\Phi_{f,g}(h)</script> will happen to be the same thing. Unlike in the algebraic case, it's very difficult to iteratively improve on your guess for <script type="math/tex">h</script>, so it's hard to think of how to use this weird way of defining the problem of finding <script type="math/tex">h</script> to actually find it.</p><p>Except hold on – we're working in the lambda calculus, and fixed point combinators are easy: call <script type="math/tex">Y</script> on a function and we have its fixed point, and, by the reasoning above, that is the recursive version of that function.</p><h2>The lambda calculus in lambda calculus</h2><p>There's one final powerful demonstration of a computation model's expressive power that we haven't looked at: being able to express itself. 
The most well-known case is the <a href="https://en.wikipedia.org/wiki/Universal_Turing_machine">universal Turing machine</a>, and those crop up a lot when you're thinking about computation theory.</p><p>Now there exists a trivial universal lambda term: <script type="math/tex">(\lambda \,f\,a\,.\,(f \,a))</script> takes <script type="math/tex">f</script>, the lambda representation of some function, and an argument <script type="math/tex">a</script>, and returns the lambda calculus representation of <script type="math/tex">f</script> applied to <script type="math/tex">a</script>. However, this isn't exactly fair, since we've just forwarded all the work onto whatever is interpreting the lambda calculus. It's like noting that an <code>eval</code> function exists in a programming language, and then writing on your CV that you've written an evaluator for it.</p><p>Instead, a "fair" way to define a universal lambda term is to build on the data specifications we have to define a representation of variables, lambda terms, and application terms, and then writing more definitions within the lambda calculus until we have a <code>reduce</code> function.</p><p>This is what I've done in <a href="https://github.com/LRudL/lambda-engine">Lambda Engine</a>. The definitions specific to defining the lambda calculus within the lambda calculus start about halfway down <a href="https://github.com/LRudL/lambda-engine/blob/main/definitions.rkt">this file</a>. I won't walk through the details here (see the code and comments for more detail), but the core points are:</p><ul><li>We distinguish term types by making each term a pair consisting of an identifier and then the data associated with it. The identifier for variables/<script type="math/tex">\lambda</script>s/applications is a function that takes a triple and returns the 1st/2nd/3rd member of it (this is simpler than tagging them with e.g. Church numerals, since testing numerical equality is complicated). 
The data is either a Church numeral (for variables) or a pair of a variable and a term (<script type="math/tex">\lambda</script>-terms) or a term and a term (applications).</li><li>We need case-based recursion, where we can take in a term, figure out what it is, and then perform a call to a function to handle that term and pass on the main recursive function to that handler function (for example, because when substituting in an application term, we need to call the main substitution function on both the left and right child of the application). The case-based recursion functions (different ones for the different number of arguments required by substitution and reduction) take a triple of functions (one for each term type) and exploit the fact that the identifier of a term is a function that picks some element from the triple (in this case, we call the identifier on the handler function triple to pick the right one).</li><li>We have helper functions to build our term types, extract out their parts, and test for whether something is a <script type="math/tex">\lambda</script>-term (exploiting the fact that the first element of the pair representing a lambda term is the "take the 2nd thing from a triple" function).</li><li>With the above, we can define substitution fairly straightforwardly. Note that we need to test Church numeral equality, which requires a generic Church numeral equality tester, which is a slow function (because it needs to recurse and take a lot of predecessors).</li><li>For reduction, the main tricky bit is doing it in normal order. This means that we have to be able to tell whether the left child in an application term is reducible before we try to reduce the right child (e.g. the left child might eventually reduce to a function that throws away its argument, and the right child might be a looping term like <script type="math/tex">\Omega</script>).
We define a helper function to check whether something reduces, and then can write <code>reduce-app</code> and therefore <code>reduce</code>. For convenience we can define a function <code>n-reduce</code> that calls <code>reduce</code> on an expression <code>n</code> times, simply by exploiting how Church numerals work (<code>((2 reduce) x)</code> is <code>(reduce (reduce x))</code>, for example).</li> </ul><p>What we don't have:</p><ul><li>Variable renaming. We assume that terms in this lambda calculus are written so that a variable name (in this case, a Church numeral) is never reused.</li><li>Automatically reducing to <script type="math/tex">\beta</script>-normal form. This could be done fairly simply by writing another function that calls itself with the <code>reduce</code> of its argument until our checker for whether something reduces is false. </li><li>Automatically checking whether we're looping (e.g. we've typed in the definition of <script type="math/tex">\Omega</script>).</li> </ul><p>The lambda calculus interpreter in <a href="https://github.com/LRudL/lambda-engine/blob/main/interpreter.rkt">this file</a> has all three features above. You can play with it, and the lambda-calculus-in-lambda-calculus, by downloading <a href="https://github.com/LRudL/lambda-engine">Lambda Engine</a> (and a <a href="https://racket-lang.org/">Racket interpreter</a> if you don't already have one) and using one of the evaluators in <a href="https://github.com/LRudL/lambda-engine/blob/main/main.rkt">this file</a>.</p><h2>Towards Lisp</h2><p>Let's see what we've defined in the lambda calculus so far:</p><ul><li><code>pair</code></li><li>lists</li><li><code>fst</code></li><li><code>snd</code></li><li><code>True</code></li><li><code>False</code></li><li><code>if</code></li><li><code>eq0</code></li><li>numbers</li><li>recursion<br /></li> </ul><p>This is most of <a href="http://languagelog.ldc.upenn.edu/myl/ldc/llog/jmc.pdf">what you need in a Lisp</a>.
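</p><p>As a sanity check on that list, here it is rendered once more as Python lambdas (an illustration I've added; the uppercase names and the <code>to_bool</code> decoder are my own labels):</p>

```python
# The toolkit from the list above, written as Python lambdas.
PAIR  = lambda a: lambda b: lambda s: s(a)(b)
FST   = lambda p: p(lambda a: lambda b: a)
SND   = lambda p: p(lambda a: lambda b: b)

TRUE  = lambda x: lambda y: x          # the same term as first
FALSE = lambda x: lambda y: y          # the same term as second and 0
IF    = lambda p: lambda c: lambda a: p(c)(a)

ZERO  = FALSE
SUCC  = lambda n: lambda f: lambda x: f(n(f)(x))
EQ0   = lambda n: n(lambda y: FALSE)(TRUE)

to_bool = lambda b: b(True)(False)     # decode into a native boolean
```

<p><code>IF(EQ0(ZERO))("yes")("no")</code> picks out <code>"yes"</code> and <code>FST(PAIR(1)(2))</code> is 1, just as the definitions above promise, with the caveat that Python's eager evaluation computes both branches of <code>IF</code> before choosing, so it cannot guard a recursive call the way normal-order reduction can.</p><p>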
Lisp was invented in 1958 by John McCarthy. It was intended as an alternative axiomatisation for computation, with the goal of not being too complicated to define while still being human friendly, unlike the lambda calculus or Turing machines. It borrows notation (in particular the keyword <code>lambda</code>) from the lambda calculus and its terms are also trees, but it is not directly based on the lambda calculus.</p><p>Lisp was not intended as a programming language, but Steve Russell (no relation to Bertrand Russell ... I'm pretty sure) realised you could write machine code to evaluate Lisp expressions, and went ahead and did so, making Lisp the second-oldest programming language. Despite its age, Lisp is arguably the most elegant and flexible programming language (modern dialects include <a href="https://clojure.org/">Clojure</a> and <a href="https://racket-lang.org/">Racket</a>).</p><p>One way to think of what we've done in this post is that we've started from the lambda calculus – an almost stupidly simple theoretical model – and made definitions and syntax transformations until we got most of the way to being able to emulate Lisp, a very usable and practical programming language. 
The main takeaway is, hopefully, an intuitive sense of how something as simple as the lambda calculus can express any computation expressible in a higher-level language.</p><h1>Nuclear power is good</h1><p style="text-align: left;"><span style="font-size: large;">(Alternative title: burning things considered harmful)</span></p><p style="text-align: center;"><i><span style="font-size: x-small;">2021-03-27, 5k words (about 17 minutes)</span></i></p><p>If you want usable energy, you need to use the forces between particles.</p><p>The weakest force is gravity, but if you happen to be near a gigantic amount of material (e.g. the Earth) with an uneven surface that has stuff flowing down it (e.g. water in a river), we can still use it to generate power. This insight gives us hydropower, which delivers about 16% of the world's electricity. The main downside is that because of how weak gravity is, dams have to be large and environmentally disruptive to generate useful power.</p><p>Moving to stronger forces, we have chemical interactions between atoms. In the form of burning fossil fuels, rearranging chemical bonds produces 66% of the world's electricity. The main downside is how weak chemical bonds are, and therefore how much matter has to be processed (i.e. burned) to produce energy. A lot of matter means a lot of waste products. Despite decades of work on possible safe waste-management strategies (e.g.
carbon capture and storage), we still outrageously keep dumping over thirty billion tons of carbon dioxide into the atmosphere every year, with massive effects on the climate that will potentially last thousands of years, while also producing a long list of other harmful waste products that kill <a href="https://ourworldindata.org/air-pollution">a lot of people</a> per year.</p><p>Thankfully, atoms aren't atomic: we can rearrange atoms and get energy densities that blow puny chemistry out of the water. Currently 11% of the world's electricity comes from directly doing this. We're still playing catch-up to God, who, in His infinite wisdom, saw fit to create a universe where just about 100% of energy production is nuclear.</p><p>Our nearest God-sanctioned nuclear reactor is the sun. Harnessing the sun's light and heat gives us another 1% of the world's electricity; a slightly more indirect route where we first wait for the sun's heat to stir up the air gives us another 3.5%. An even more indirect route is letting the sun's light fall on plants so that they create chemical bonds that we can burn for power; this gives us another 2%. The most indirect route of all is to use the chemical bonds created by sunlight that fell on extinct plants hundreds of millions of years ago, which is what we're really doing when we burn fossil fuels. So actually it's all nuclear, with the only difference being how many hoops you jump through first.</p><p>The current state of nuclear power is that we can harness only fission (splitting atoms) for controlled energy production. Fusion (combining atoms) is potentially an even better technology: it requires less exotic materials, produces less dangerous waste, and is literally star-power. However, it takes extreme energies to get power out of fusion, and the only way we've found to do that is to blow up a (fission-based) nuclear bomb in a very controlled way that squeezes the stuff we want to fuse to create an even bigger bang.
Technically we could use this for power – say, we build a massive underground chamber where we set off hydrogen bombs (the common name for a bomb that uses nuclear fusion) every once in a while to vaporise vast amounts of water into steam and then drive a generator – but let's just say there would be some difficulties. (Though, surprisingly, mostly economic and political ones rather than technical ones – this idea was seriously studied in the 1970s as <a href="https://en.wikipedia.org/wiki/Project_PACER">Project Pacer.</a>)</p><p>Controlled fusion power is in the works, but it's the poster child for technologies that are always twenty years away. At the moment scientists are playing around with <a href="https://en.wikipedia.org/wiki/National_Ignition_Facility">lasers that have 25 times the power of the entire world's electricity generation</a> (though only for a few picoseconds at a time) and <a href="https://en.wikipedia.org/wiki/ITER">magnets almost strong enough to levitate a frog</a>* to bring it about, but don't expect commercial fusion power in the next decade at least.</p><p>(*Levitating a frog takes a field of about 16 Teslas, according to research that won an <a href="https://www.improbable.com/ig-about/winners/#ig2000">Ig Nobel Prize in 2000</a>, compared to ITER's 13 Tesla field.)</p><p>Fusion is definitely a technology that we should develop. However, as J. Storrs Hall writes in <i>Where is my flying car?</i> (my review <a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html">here</a>):</p><blockquote><p><i>"As a science fiction and technology fan, for most of my life I had been squarely in the “just you wait until we get fusion” camp. Then I was forced to compare the expected advantages fusion would bring to the ones we already had with fission. Fuel costs are already negligible. The process is already clean, with no emissions. 
Even though the national [US] waste repository at Yucca Mountain has been blocked by activists since it was designated in 1987 and never opened, fission produces so little waste that all our power plants have operated the entire period by basically sweeping it into the back closet."</i></p></blockquote><p>We have already invented a miracle clean power source. And, surprise surprise, we should really use it.</p><p> </p><h2>The human case for nuclear power</h2><p>Every year, <a href="https://ourworldindata.org/grapher/number-of-deaths-by-risk-factor?tab=chart&stackMode=absolute&region=World">there are almost five million deaths attributable to air pollution</a>, a bit less than 1 in 10 of all deaths in the world, or one every six seconds. Since it's a bit tricky to know what counts as an "attributable death" in the case of some risk factor, here's another measure: <a href="https://ourworldindata.org/grapher/disease-burden-by-risk-factor">almost 150 million years of health-weighted life are lost every year because of air pollution</a>. The health effects of air pollution are right up there with the other biggest killers like high blood pressure, smoking, and obesity.</p><p>The biggest causes of air pollution are energy generation, traffic, and (especially in poor countries) heating. Getting global averages for power generation deadliness is hard, but doing some very rough estimation, more than one-tenth but less than one-third of air pollution deaths are directly related to power generation, for a total number in the hundreds of thousands per year. Imagine three Chernobyl-scale disasters a week, and you're in the right ballpark.</p><p>(There is major disagreement over the actual Chernobyl death toll. When making comparisons in this post, I use the number 4000. 
About 30 people died directly during the disaster; several thousand may die in the long run according to the best consensus estimates, though if you assume the contested <a href="https://en.wikipedia.org/wiki/Linear_no-threshold_model">linear no-threshold model</a> (which seems to be the main crux of the debate) you can get numbers in the tens of thousands. If you want to be maximally pessimistic, you can multiply Chernobyl impact comparisons by 10, but you'll find this doesn't materially change the conclusions.)</p><p>Which power sources cause these deaths? There's some disagreement over the exact numbers, but <a href="https://ourworldindata.org/grapher/death-rates-from-energy-production-per-twh?tab=chart&time=earliest..latest&region=World">here's</a> a chart for European energy production from Our World in Data:</p><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-vCz-ZbmYHOc/YF-vXMf_9QI/AAAAAAAAChg/wLsN8TpD8wMObwH87OyC0hJ42fb3CpPRgCLcBGAsYHQ/s1744/deathspertwh.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1128" data-original-width="1744" height="414" src="https://1.bp.blogspot.com/-vCz-ZbmYHOc/YF-vXMf_9QI/AAAAAAAAChg/wLsN8TpD8wMObwH87OyC0hJ42fb3CpPRgCLcBGAsYHQ/w640-h414/deathspertwh.png" width="640" /></a></div><p>(One terawatt-hour (3.6 petajoules) is roughly the annual energy consumption of 20 000 Europeans.)</p><p>The chart above has European numbers. 
In particular for fossil fuel sources, there's a lot of country-specific variation due to environmental regulations and population density: the <a href="https://www.sciencedirect.com/science/article/pii/S0140673607612537?casa_token=r5LmpCZ6G8YAAAAA:aW4wjfZ3PENq0mvbNTLXF27WEkLRuAsE0wGTXSrC1R3OgNLg9a7RdoedMRKZ20sBoUwuClxm#bib32">paper</a> that the above chart is largely based on mentions 77 deaths/TWh as a reasonable figure for a regulation-compliant Chinese coal plant, while <a href="http://www.forbes.com/sites/jamesconca/2012/06/10/energys-deathprint-a-price-always-paid/">this article</a> says that 280 deaths/TWh is possible for coal.</p><p>Why do solar and wind produce any deaths at all? Both occasionally involve dangerous construction work (rooftop solar / tall wind turbines). In fact, if you look at recent decades (i.e., not including Chernobyl) and use the low-end estimates, solar and wind are deadlier than nuclear.</p><p>The estimates for hydropower can also swing a bit depending on whether or not you include the deadliest electricity generation disaster in history: the <a href="https://en.wikipedia.org/wiki/1975_Banqiao_Dam_failure">1975 Banqiao Dam failure</a>, which may have killed hundreds of thousands of people. Since 1965, hydropower has produced about 130 000 TWh; depending on which death toll estimate you believe, Banqiao single-handedly raises the deaths per TWh for hydropower by between 0.2 and 2. 
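</p><p>As a sanity check on that range (the death-toll bounds of roughly 26 000 to 240 000 are an assumption of mine, taken from commonly cited estimates):</p>

```python
# Rough check of the Banqiao deaths-per-TWh range quoted above.
# The death-toll bounds (~26 000 to ~240 000) are an assumption taken
# from commonly cited estimates; 130 000 TWh is the hydropower total
# since 1965 given in the text.
hydro_twh = 130_000
for toll in (26_000, 240_000):
    print(f"{toll:>7} deaths -> +{toll / hydro_twh:.2f} deaths/TWh")
```

<p>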
Compare this with nuclear power, which has produced about 92 000 TWh over the same timeframe; the long-term death estimates for Chernobyl add 0.04 to the deaths/TWh count for nuclear.</p><p>(The total generation numbers are based on the raw data behind <a href="https://ourworldindata.org/grapher/modern-renewable-energy-consumption?time=earliest..latest">this</a> and <a href="https://ourworldindata.org/grapher/nuclear-energy-generation?tab=chart&stackMode=absolute&time=earliest..latest&country=~OWID_WRL&region=World">this</a> graph, which you can download from the links. The nuclear number in the above chart is based on <a href="https://www.sciencedirect.com/science/article/pii/S0140673607612537?casa_token=r5LmpCZ6G8YAAAAA:aW4wjfZ3PENq0mvbNTLXF27WEkLRuAsE0wGTXSrC1R3OgNLg9a7RdoedMRKZ20sBoUwuClxm">this paper</a>, which Our World in Data says already includes Chernobyl, though I can't see where they add that in.)</p><p>The bottom line is that hydropower accidents are <a href="https://en.wikipedia.org/wiki/List_of_hydroelectric_power_station_failures">more common, more deadly, and higher variance</a> than nuclear accidents, even though both power sources have produced comparable amounts of energy in recent decades.</p><p>Okay, actually that isn't the real bottom line. The real bottom line is this: <i>when it comes to the human impacts of electricity generation, there are things that involve burning (fossil fuels & biomass), and then there is everything else, and the latter category is much much better</i>. Also, if you absolutely must burn something, <i>do not burn coal</i>.</p><p>What has nuclear specifically done so far? <a href="https://pubs.acs.org/doi/abs/10.1021/es3051197?source=cen&">One study</a> finds that it has saved 1.8 million lives by reducing air pollution, or about 4 years of the world's current malaria death rate.</p><p>What could it have done? Until the mid-1970s, the adoption of nuclear power was accelerating. 
Assume this trend had continued until today, and nuclear had replaced fossil fuels only (an optimistic assumption, but one that doesn't change the numbers much because renewables are a pretty small percentage). Under these assumptions, <a href="https://www.mdpi.com/1996-1073/10/12/2169/htm">one study</a> estimates that nuclear would now account for over half of the world's energy production, and a total of 9.5 million deaths would have been avoided – as much as if you saved everyone who would otherwise have died of cancer in the past year. Even if nuclear adoption had only been linear, 4.2 million deaths could have been avoided, the same number as saving everyone who has died in war since 1970 (the war deaths number is from the raw data behind <a href="https://ourworldindata.org/grapher/battle-related-deaths-in-state-based-conflicts-since-1946-by-world-region">this chart</a>).</p><p>Therefore: <i>in terms of the number of lives saved, keeping the nuclear power industry growing would have very likely been at least as good as achieving world peace in 1970.</i></p><p>Since these numbers are enormous, and involve difficult-to-estimate unknowns, here's something more concrete: Germany's decision in 2011 to get rid of nuclear is costing an average of 1100 lives per year (<a href="https://www.nber.org/system/files/working_papers/w26598/w26598.pdf">working paper</a>; <a href="https://grist.org/energy/the-cost-of-germany-going-off-nuclear-power-thousands-of-lives/">article</a>).</p><h2>The environmental case for nuclear power</h2><p>Climate change is a big problem, but the scale of it as an environmental problem is better known than the scale of air pollution as a health problem, so I won't go into the statistics on its impact.</p><p>Nuclear power is obviously good for the climate. 
Here's a chart, based on <a href="https://www.ipcc.ch/site/assets/uploads/2018/02/ipcc_wg3_ar5_annex-iii.pdf#page=7">this</a>, which is summarised in a more readable format <a href="https://en.wikipedia.org/wiki/Life-cycle_greenhouse_gas_emissions_of_energy_sources#2014_IPCC,_Global_warming_potential_of_selected_electricity_sources">here</a>:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/--BtWucJqwIw/YF-tqOJatuI/AAAAAAAAChI/EmTccYV2zp0JhqoLAahh9JWhI7rxbLUlgCLcBGAsYHQ/co2eqpertwh.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="830" data-original-width="1232" height="432" src="https://lh3.googleusercontent.com/--BtWucJqwIw/YF-tqOJatuI/AAAAAAAAChI/EmTccYV2zp0JhqoLAahh9JWhI7rxbLUlgCLcBGAsYHQ/w640-h432/co2eqpertwh.png" width="640" /></a></div> <p></p><p>The black bars span the range between the minimum and maximum numbers. The red dot is the median.</p><p>I've converted the numbers from the traditional grams of CO2 equivalent per kWh to tons of CO2 equivalent per TWh, to be consistent with the death rates graph above, and for easier conversion to national/international CO2 statistics (which are generally expressed in tons of CO2 – unless it's tons of carbon, in which case you divide by the ratio of carbon's mass in CO2, which is 12/44 or about 0.27).</p><p>(If you're wondering where hydropower is: its median is right around concentrated solar, but in some cases, especially in tropical climates, the <a href="https://en.wikipedia.org/wiki/Environmental_impact_of_reservoirs#Greenhouse_gases">reservoirs created by dams can release a lot of methane</a>, making the maximum CO2-equivalent emissions for hydropower over twice as bad as coal and, more importantly, completely ruining my pretty chart.)</p><p>So far, the use of nuclear power is estimated to have <a 
href="https://blogs.scientificamerican.com/the-curious-wavefunction/nuclear-power-may-have-saved-1-8-million-lives-otherwise-lost-to-fossil-fuels-may-save-up-to-7-million-more/">reduced cumulative CO2 emissions to date by 64 billion tons</a>, a bit less than two years of the world's <i>total</i> CO2 emissions at current rates. The <a href="https://www.mdpi.com/1996-1073/10/12/2169/htm">same study</a> linked in the previous section estimates that, had nuclear power grown at a steady linear rate, this number would be doubled, and if the accelerating trend in nuclear power adoption had continued, there would be 174 billion tons less CO2 in the atmosphere. That would have saved more emissions than making every car in the world emission-free since 1990.</p><p> </p><h2>The problems</h2><p>In <i>Enlightenment Now</i> (my review <a href="http://strataoftheworld.blogspot.com/2018/08/review-enlightenment-now-steven-pinker.html">here</a>), Steven Pinker writes:</p><blockquote><p><i>"It’s often said that with climate change, those who know the most are the most frightened, but with nuclear power, those who know the most are the least frightened."</i></p></blockquote><p>So why aren't the arguments against nuclear power enough to frighten those who know about it?</p><p>The short version: more nuclear power would save millions of lives from air pollution and be a big help in solving climate change. When these are the benefits, you need a hell of a drawback before the scales start tilting the other way.</p><p>The long version:</p><h3>Radiation & accidents</h3><p>(Radiation units are confusing. Activity, straightforwardly defined as the number of atoms that undergo decay per second, is measured in becquerels (Bq). The amount of radiation energy absorbed per kilogram of matter is measured in grays (Gy), which therefore have units of joules per kilogram. Measuring biological effects is harder, because the type of radiation and what tissue it hits both matter. 
If you adjust for the type of radiation by multiplying the absorbed dose in grays by some factor (scaled so that gamma rays have a factor 1), you get something called <a href="https://en.wikipedia.org/wiki/Equivalent_dose">equivalent dose</a>, which is measured in sieverts (Sv). If you also adjust for which tissue type was hit by multiplying by more estimated factors, you get <a href="https://en.wikipedia.org/wiki/Effective_dose_(radiation)">effective dose</a>, which is also measured in sieverts. If you want to get a sense of scale for radiation dose numbers, <a href="https://xkcd.com/radiation/">here's a good chart</a> and <a href="https://en.wikipedia.org/wiki/Sievert#Dose_examples">here's a good table</a>.)</p><p>In normal operation, a <a href="https://www.scientificamerican.com/article/coal-ash-is-more-radioactive-than-nuclear-waste/">nuclear power plant produces significantly less radiation than a coal power plant</a> (this is because everything radioactive is contained in a nuclear power plant, while coal power plants pump <a href="https://en.wikipedia.org/wiki/Fly_ash">fly ash</a> into the air). Neither is a significant dose.</p><p>In accidents, nuclear power plants can release insane amounts of radioactivity. Insane amounts of radiation are dangerous. 
However, the reaction to radiation risks is often out of proportion to the true risk – the Fukushima evacuations are considered excessive in hindsight, as argued in <a href="https://www.sciencedaily.com/releases/2017/11/171120085453.htm">this study</a>, though you probably don't need a study to guess it from <a href="https://ourworldindata.org/grapher/estimated-mortality-from-fukushima-nuclear-disaster">this chart</a>:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-EQAwvcAiSt8/YF-tw2NnKII/AAAAAAAAChM/LPjJmcIJ7NceUiiSso759uNq4fKJgTBQwCLcBGAsYHQ/fukushima.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1172" data-original-width="1736" height="432" src="https://lh3.googleusercontent.com/-EQAwvcAiSt8/YF-tw2NnKII/AAAAAAAAChM/LPjJmcIJ7NceUiiSso759uNq4fKJgTBQwCLcBGAsYHQ/w640-h432/fukushima.png" width="640" /></a></div><p></p><p>(In the long run, some more cancer deaths are expected to trickle in.)</p><p>It is critically important to remember the above statistics on health effects, and not let yourself be biased by <a href="https://en.wikipedia.org/wiki/Chernobyl_(miniseries)">vivid stories</a> about horrible individual events. The fear of nuclear accidents is similar to the fear of flying rather than driving: statistically one is much safer, but the other is much easier to fear because when things go wrong, they go wrong in more story-worthy packages.</p><p>In particular: it is <i>not</i> the case that nuclear power is safer only because accidents are rare and therefore get left out of statistics; nuclear power would be overwhelmingly safer than fossil fuels even if there were a Chernobyl going off every year. 
As I said above, <a href="https://en.wikipedia.org/wiki/List_of_hydroelectric_power_station_failures">hydropower accidents</a> are more common, more deadly, and higher variance, so any argument based on disaster risk that bans nuclear would also ban hydropower.</p><h3>Nuclear proliferation</h3><p>Nuclear power is good, but <a href="https://strataoftheworld.blogspot.com/2020/04/review-doomsday-machine.html">nuclear weapons are bad</a>. It would be bad if the spread of civilian nuclear power technology led to nuclear proliferation. There is some overlap in technology, but neither civilian materials nor technologies automatically lead to weapons. The uranium used in power plants is typically only enriched to 3-5%, compared to more than 85% for weapons-grade uranium and 0.7% in natural uranium (though if you have uranium enrichment infrastructure, you can run it for more cycles than usual and let the enrichment levels slowly creep up – Iran has done this). There are also international agreements that restrict enrichment, and alternative nuclear technologies, like using thorium instead of uranium, with less weapon potential. Finally, a country trying to build nuclear weapons probably won't be stopped by a lack of a civilian industry; consider North Korea.</p><h3>Terrorism and war risks</h3><p>Another risk to consider is that nuclear power plants might be targeted by terrorists, or even by hostile nations, potentially leading to Chernobyl-scale disasters. This is a risk, but it's an acceptable one. Consider what it would mean if "hundreds or thousands of people could be killed if a determined and resourceful hostile actor targeted this piece of infrastructure" were a reason to not build some piece of infrastructure – we'd have to ban skyscrapers, airplanes, dams, water treatment plants, and so forth. 
Also considering the security that's (rightfully) present at nuclear power plants, it would probably take a 9/11 level of execution to do it, and the observed rate for 9/11-level events over a time interval of length T is, well, 1/T if the interval includes 9/11 and otherwise 0.</p><p>It is true that a complex civilisation has a lot of fragile points and someone should be thinking hard about minimising this kind of risk, and that nuclear power plants are a good example because the effects are expensive and long-lasting if an attack is successful. But as an argument against nuclear power, <a href="https://slatestarcodex.com/2013/04/13/proving-too-much/">it proves too much</a>.</p><h3>Nuclear waste</h3><p>Nuclear waste is awkward to deal with, but it's far from the worst sort of industrial waste we deal with – consider the over thirty billion tons of carbon dioxide we've dumped into the atmosphere over the past year, or the various horrible things that coal plants spew out that cause dozens of Chernobyl-equivalents per year.</p><p>Nuclear waste is not some miracle substance that effortlessly seeps everywhere and kills whatever it touches. Until 1993, countries (mostly the USSR and UK) were dumping nuclear waste into the ocean. This is rightly banned these days, but you can observe that we still have oceans; in fact, the <a href="https://en.wikipedia.org/wiki/Ocean_disposal_of_radioactive_waste#Environmental_impact">environmental impacts</a> have so far been negligible except for somewhat higher concentrations of some nasty isotopes exactly at the site.</p><p>In general, nuclear waste is a serious problem that has to be solved somehow, but solutions exist (currently, Finland's <a href="https://en.wikipedia.org/wiki/Onkalo_spent_nuclear_fuel_repository">Onkalo repository</a> is the closest to being operational). 
Though the timescale is long, it is not different in principle from some existing disposal methods for nasty things like mercury and arsenic.</p><p>Is it responsible to leave behind dangerous waste for future generations? It's far more responsible than leaving them with the vast CO2 emissions that even a single kilogram of uranium can prevent.</p><p>Future people looking back at our century won't despair about a few warm rocks deep underground. They'll despair at all the silent air pollution deaths, at how far we let climate change get, and at how much sooner we could've reached their living standards had we made better use of our technology. Then they'll travel on nuclear-powered airplanes to distant hiking grounds, and tell scare stories around an (artificial!) campfire about the barbarian past when we burned things for energy and piped the waste products straight into the atmosphere.</p><h3>Uranium is limited</h3><p>First, we have <a href="https://www.scientificamerican.com/article/how-long-will-global-uranium-deposits-last/">200 years' worth of economically accessible uranium reserves</a>. This is <a href="https://ourworldindata.org/grapher/years-of-fossil-fuel-reserves-left">more than for fossil fuels</a>, with the additional benefit that burning through the remaining uranium won't wreck the climate and kill millions.</p><p>Second, we have alternatives to uranium, like thorium.</p><p>Third, there are hundreds of times more uranium dissolved in the oceans than there is on land (and this uranium exists in equilibrium, so if you take it out, more will leach out of the seabed to replace it, a fact that might lead a pedant to call nuclear power renewable). Even though the concentrations are tiny, because of the energy density of uranium, at modern reactor efficiencies there's still half a megajoule of usable nuclear energy in the uranium in a single cubic metre of seawater, enough to power the lightbulb in my room for over five hours. 
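</p><p>The seawater claim checks out under reasonable assumptions (both input figures below are mine, not from the text: about 3.3 mg of uranium per cubic metre of seawater, and about 150 GJ of usable energy per kilogram of natural uranium in a modern once-through reactor):</p>

```python
# Check the "half a megajoule per cubic metre" seawater-uranium claim.
# Assumed inputs: ~3.3 mg of uranium per m^3 of seawater, and ~150 GJ of
# usable energy per kg of natural uranium at modern reactor efficiency --
# rough published figures, not from the post itself.
uranium_kg_per_m3 = 3.3e-6
usable_j_per_kg = 150e9
energy_j = uranium_kg_per_m3 * usable_j_per_kg   # roughly 0.5 MJ
bulb_watts = 25   # a small incandescent bulb
hours = energy_j / (bulb_watts * 3600)
print(f"{energy_j / 1e6:.2f} MJ per cubic metre, ~{hours:.1f} bulb-hours")
```

<p>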
As a result, extracting it is a project that is <a href="https://www.forbes.com/sites/jamesconca/2016/07/01/uranium-seawater-extraction-makes-nuclear-power-completely-renewable/?sh=1b4b0f19159a">taken surprisingly seriously, and is surprisingly close to being economically viable</a>, though <a href="http://large.stanford.edu/courses/2017/ph241/jones-j2/docs/epjn150059.pdf">some people are very skeptical</a>.</p><h3>Nuclear power is unnatural</h3><p>Wrong: about two billion years ago <a href="https://www.scientificamerican.com/article/ancient-nuclear-reactor/">a spontaneous natural nuclear reactor</a> ran for a few hundred thousand years under what is now Gabon.</p><p>Using the best estimates for its running time and power output, even if this is the only natural reactor that ever formed, the energy it produced is several times higher than that of all human civilian nuclear power to date (both numbers are in the hundreds of petajoules range). Of sustained nuclear fission energy in our planet's history, more has been natural than artificial.</p><p> </p><h2>Nuclear is overpowered, so where is it?</h2><p>Nuclear power is an almost overpowered technology. The reason why comes down to physics: an energy source based on nuclear reactions has extreme power density, and, all else being equal, the higher your power density, the less fuel you need, the less waste you produce, and the cleaner your power plant is overall. Not surprisingly, nuclear power turns out to be – along with solar and wind – the cleanest and safest power source we have.</p><p>In <a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html"><i>Where is My Flying Car?</i></a>, J. 
Storrs Hall gives some vivid facts to demonstrate the power and efficiency of nuclear: a wind turbine uses more lubricating oil per energy generated than a nuclear power plant uses uranium, and while the 7.5 TJ of energy a Boeing 747 burns through during a flight weighs 200 tons and costs a third of a million dollars when delivered as chemical fuel, getting the equivalent energy from nuclear takes 100 <i>grams</i> of reactor-grade uranium and costs 10 dollars.</p><p>So where is it? The simple reason is that it's either illegal (like in Italy), being phased out (like in Germany), or highly regulated and/or expensive. It wasn't always so:</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://lh3.googleusercontent.com/-IIf3MZJbgac/YF-t3NSrZRI/AAAAAAAAChQ/Tc4JrsRjwX0ZXkARslwmLS_LTpYIVYbfwCLcBGAsYHQ/nuclearcost.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1074" data-original-width="1412" height="304" src="https://lh3.googleusercontent.com/-IIf3MZJbgac/YF-t3NSrZRI/AAAAAAAAChQ/Tc4JrsRjwX0ZXkARslwmLS_LTpYIVYbfwCLcBGAsYHQ/w400-h304/nuclearcost.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p>Source: <i>Where is my Flying Car?</i>, by J. Storrs Hall.</p></td></tr></tbody></table><p></p> <p>The above graph shows the price per kilowatt of US nuclear power plants. The green line is the trend line before the Department of Energy was established in 1977. Note also that the Three Mile Island accident was in 1979, and, despite no one being hurt, this was a turning point for the US nuclear industry.</p><p>When the price of a technology starts increasing, it's not the natural learning curve of the technology at work. It's a regulatory choice. And while you obviously should regulate nuclear power, we're not doing it right.<br /></p><p>J. 
Storrs Hall explains the cost increases:</p><blockquote><p><i>"Nuclear power is probably the clearest case where regulation clobbered the learning curve. Innovation is strongly suppressed when you’re betting a few billion dollars on your ability to get a license to operate the plant. Besides the obvious cost increases due to direct imposition of rules, there was a major side effect of forcing the size of plants up (fewer licenses); fewer plants were built and fewer ideas tried. That also meant a greater cost for transmission (about half the total, according to my itemized bill), since plants are further from the average customer."</i></p></blockquote><p>There is some hope that the tide is turning. New startups like <a href="https://en.wikipedia.org/wiki/NuScale_Power">NuScale</a> are working on small modular reactors that might greatly reduce prices. Of course, in addition to difficulties with funding, and the not-so-easy task of building a literal nuclear reactor, they've spent years clearing regulatory hurdles and are not expected to produce power until 2029. So-called fourth-generation reactors are also being worked on, and there's always the hope we eventually get fusion.</p><p>But we're not going to get the benefits of cheap and plentiful nuclear power unless we stop treating it like it's the Antichrist.</p><p>Hall, never one to pass up the opportunity for a dramatic touch, quotes John Steinbeck's <i>The Grapes of Wrath</i> to sum up the sadness of our attitude to nuclear power:</p><blockquote><p><i>“And men with hoses squirt kerosene on the oranges, and they are angry at the crime, angry at the people who have come to take the fruit. A million people hungry, needing the fruit—and kerosene sprayed over the golden mountains.</i></p><i></i><p><i>[...]</i></p><i></i><p><i>There is a crime here that goes beyond denunciation. There is a sorrow here that weeping cannot symbolize. There is a failure here that topples all our success. 
The fertile earth, the straight tree rows, the sturdy trunks, and the ripe fruit. And children dying of pellagra must die because a profit cannot be taken from an orange. And coroners must fill in the certificate—died of malnutrition—because the food must rot, must be forced to rot.”</i></p></blockquote><p>More generally, <a href="https://strataoftheworld.blogspot.com/2021/03/technological-progress.html">human civilisation needs to get better at making decisions about technology</a>. We shouldn't deny ourselves safe clean energy, but we should start working on mitigating the harms from actually scary technologies, like nuclear weapons, and make sure that new technologies like biotech and AI are used safely. Oh, and have I mentioned that burning things is bad for climate and health, and we should stop doing it?</p><h2>A metaphor</h2><p>I mentioned earlier that nuclear power and fossil fuels are like flying and driving. One of them is obviously safer, but the other seems scarier because the lizard-derived part of our brains can't multiply. Objecting to nuclear power on safety grounds but tolerating fossil fuels is like texting about how scared you are to board a plane while driving yourself to the airport. Let's make this metaphor more concrete, and hopefully create a memorable image.</p><p>The world consumes about 20 000 TWh per year as electricity (about one-eighth of total energy use – lots is used directly for transportation and heat). Let's compare this to making a drive across Europe that starts in Lisbon and ends in Tallinn. Each kilometre we travel represents a bit less than 5 TWh of energy towards our 20 000 TWh goal. 
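</p><p>For reference, here's the arithmetic behind the trip, with two assumed inputs of my own: a Lisbon-Tallinn driving distance of roughly 4 150 km, and the approximate European death rates from the chart earlier in the post (about 24.6 deaths/TWh for coal and 2.8 for gas):</p>

```python
# Arithmetic behind the road-trip metaphor. Assumed inputs: a driving
# distance of ~4150 km Lisbon-Tallinn, and European death rates of
# ~24.6 deaths/TWh (coal) and ~2.8 deaths/TWh (gas) -- approximate
# values read off the chart earlier in the post.
world_twh = 20_000
trip_km = 4_150
twh_per_km = world_twh / trip_km      # ~4.8, "a bit less than 5"
for fuel, deaths_per_twh in [("gas", 2.8), ("coal", 24.6)]:
    deaths_per_km = deaths_per_twh * twh_per_km
    print(f"{fuel}: one death every {1000 / deaths_per_km:.0f} metres")
```

<p>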
Let's say walking is wind/solar/geothermal, biking is hydropower, flying is nuclear, and driving is fossil fuels.</p><p>(The numbers for fossil-fuel-related deaths below are significant underestimates of the global average, because, like the chart above, they're based on the European data in <a href="https://www.sciencedirect.com/science/article/pii/S0140673607612537?casa_token=r5LmpCZ6G8YAAAAA:aW4wjfZ3PENq0mvbNTLXF27WEkLRuAsE0wGTXSrC1R3OgNLg9a7RdoedMRKZ20sBoUwuClxm#bib32">this study</a>. Regulations are looser and population densities higher in many developing countries that make up most of the world's air pollution deaths. I was not able to find a good estimate of the global average, and besides, these numbers are terrifying enough as they are.)</p><p>First we walk some 450 km, ending north-west of Madrid, and then bike 650 km, just barely taking us into France. We're a bit careless and somehow we've managed to shove a hundred people off wind turbines along the way. Oops.</p><p>By this point we're getting tired of walking and biking, but thankfully there's a flight to Paris. The pilot has a bad day and lands on top of a crowd, flattening another hundred people.</p><p>We really hate flying, so we refuse all the other offers that the airline companies try to sell us. Instead we step out of the Paris airport, rent a car, and start carelessly careening down the remaining 2600 km.</p><p>Gas takes us approximately to Berlin, a distance of about 1000 km. During this entire distance we run over a pedestrian at every block (roughly 1 per 80 metres), killing some 10 000 people in total.</p><p>We're in a real hurry to get to Poland, where the traffic rules get even more lenient and we can start <a href="https://www.independent.co.uk/climate-change/news/climate-change-poland-cop24-coal-air-pollution-global-warming-fossil-fuels-a8672481.html">burning coal</a>. 
The final leg of the journey from Berlin to the Polish border is powered by oil and isn't long, but still results in as many lethal hit-and-runs as the entire journey before it.</p><p>At the Polish border, we reach coal. From this point on, we text about the dangers of nuclear waste as we mow down one pedestrian every 8 metres for the entire rest of the coal-powered trip to Estonia (also burning <a href="https://en.wikipedia.org/wiki/Narva_Power_Plants">some other nasty things too</a>). Driving at a reckless 120 km/h on whatever road we're on, we run through four pedestrians a second – you'll hear a rapid thwack-thwack-thwack-thwack noise as the bodies hit the windshield – but it still takes 13 hours to make the trip. By the time we reach the Lithuanian border, the bodies of our victims, packed as tightly as possible, fill four Olympic swimming pools. Each of the three Baltic countries we drive through before reaching Tallinn fills another one.</p><p>Oh, and every kilometre driven in our car had fifty times the environmental impact of flying.</p><p>Thank god we didn't fly: imagine how horrible it would be if another pilot had had a bad day.</p><p>The world makes this trip every year to meet our growing energy needs. We're getting fitter and walking a bit longer every year, as we should. But whenever someone suggests flying instead of driving, our collective response is: "What?! 
But that's so risky!"</p><p>Let's fly.</p><p><br /></p><p style="text-align: center;"><b>RELATED:</b></p><p></p><ul style="text-align: left;"><li><a href="https://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">A similar situation exists with GMOs</a></li><li><a href="https://strataoftheworld.blogspot.com/2021/03/technological-progress.html">Technological progress</a></li><li><a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html">Review: Where is my Flying Car?</a></li><li><a href="https://strataoftheworld.blogspot.com/2018/10/review-energy-and-civilization-history.html">Review: Energy and Civilisation</a> <br /></li></ul><p></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-46931608061489153172021-03-25T16:12:00.005+00:002021-03-27T22:51:38.091+00:00Technological progress<p style="text-align: center;"><span style="font-size: x-small;"><i>4k words (about 13 minutes)</i></span> <br /></p><p>In this post, I've collected some thoughts on:</p><ul><li>why technological progress probably matters more than you'd immediately expect;</li><li>what models we might try to fit to technological progress;</li><li>whether technological progress is stagnating; and</li><li>what we should hope future technological progress to look like.</li> </ul><p> </p><h2>Technological progress matters</h2><p>The most obvious reason why technological progress matters is that it is the cause for the increase in human welfare after the industrial revolution, which, in moral terms at least, is the most important thing that's ever happened. <a href="http://lukemuehlhauser.com/three-wild-speculations-from-amateur-quantitative-macrohistory/">"Everything was awful for a long time, and then the industrial revolution happened"</a> isn't a bad summary of history. 
It's tempting to think that technology was just one factor working with many others, like changing politics and moral values, but there are strong cases to be made that a changed technological environment, and <a href="https://strataoftheworld.blogspot.com/2019/09/growth-and-civilisation.html">the economic growth it enabled</a>, were <a href="http://strataoftheworld.blogspot.com/2020/12/review-foragers-farmers-and-fossil-fuels.html">the reasons for political and moral changes in the industrial era</a>. Given this history, we should expect that more technological progress will be important for increasing human welfare in the future too (though not enough on its own – see below). This applies both to people in developed countries – we are not at <a href="https://nickbostrom.com/utopia.html">utopia</a> yet, after all – and to those in developing countries, who are already seeing vast benefits from information technology making development cheaper, and would especially benefit from decreases in the price of sustainable energy generation.</p><p>Then there are more subtle reasons to think that technological progress doesn't get the attention it deserves.</p><p>First, it works over long time horizons, so it is especially subject to all the kinds of short-termism that plague human decision-making.</p><p>Second, lost progress isn't visible: if the Internet hadn't been invented, very few would realise what they're missing out on, but try taking it away now and you might well spark a war. This means that stopping technological progress is politically cheap, because likely no one will realise the cost of what you've done.</p><p>Finally, making the right decisions about technology is going to decide whether or not the future is good. Debates about technology often become debates about whether we should be pessimistic or optimistic about the impacts of future technology. 
This is rarely a useful framing, because the only direct impact of technology is to let us make more changes to the world. Technology shouldn't be understood as a force automatically pulling the distribution of future outcomes in a good or bad direction, but as a force that <i>blows up the distribution</i> so that it spans all the way from an engineered super-pandemic that kills off humanity ten years from now to an interstellar civilisation of trillions of happy people that lasts until the stars burn down. Where on this distribution we end up depends in large part on the decisions we collectively make about technology. So, how about we get those decisions right?</p><p>But first, how should we even think about technological progress?</p><p> </p><h2>Modelling technological progress</h2><p>Some people think <a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html">that technological progress is stagnating relative to historical trends, and that, for example, we should have flying cars by now</a>. To be able to answer this question, we need some model of what technological progress should be like. I can think of three general ones.</p><p>The first one I'll name the Kurzweilian model, after futurist <a href="https://en.wikipedia.org/wiki/Ray_Kurzweil#The_Law_of_Accelerating_Returns">Ray Kurzweil</a>, who's made a big deal about how <a href="https://www.kurzweilai.net/the-law-of-accelerating-returns">the intuitive linear model of technological progress is wrong, and history instead shows technological progress is exponential</a> – the larger your technological base, the easier it is to invent new technologies, and hence a graph of anything tech-related should be a hockey-stick curve shooting into the sky.</p><p>The second I'll call the fruit tree model, after the metaphor that once the "low-hanging fruit" are picked off, progress gets harder.
The strongest case for this model is in science; the physics discoveries you can make by watching apples fall down have (very likely) long since been picked off. However, it's not clear similar arguments should apply to technology. Perhaps we can model inventing a technology as finding a clever way to combine a number of already known parts into a new thing, and hence the number of possible inventions would be an increasing function of the number of things already invented, since this gives more combinations. For example, even if progress in pure aviation is slow, when we invent new things like lightweight computers we can combine the two to get drones. I haven't seen anyone propose a model to explain why the fruit tree model makes sense for technology in particular.</p><p>The third model is that technological change is mostly random. Any particular technological base satisfies the prerequisites for some set of inventions. Once invented, a new technology goes through an S-curve of increasing adoption and development, before reaching widespread adoption and a mature form. Sometimes there are many inventions just within reach, and you get an innovation burst, like the mid-20th century one when television, cars, passenger aircraft, nuclear weapons, birth control pills, and rocketry were all simultaneously going through the rapid improvement and adoption phase. Sometimes there are no plausible big inventions for very long periods of time, for example in medieval times.
</p><p>Here's an Our World in Data graph (<a href="https://ourworldindata.org/grapher/technology-adoption-by-households-in-the-united-states?tab=chart&stackMode=absolute&country=Automobile~Cellular%20phone~Computer~Dryer~Electric%20power~Flush%20toilet~Household%20refrigerator~Microwave~Refrigerator~Washing%20machine&region=World">source and interactive version here</a>) showing more-or-less-S-curves for the adoption of a bunch of technologies:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-NYRF1GhXMHc/YFy0_l5nykI/AAAAAAAACgU/4pDNWQllx4YVAaqXCizmI0srH-5DGMy4wCLcBGAsYHQ/adoption.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1116" data-original-width="1754" height="407" src="https://lh3.googleusercontent.com/-NYRF1GhXMHc/YFy0_l5nykI/AAAAAAAACgU/4pDNWQllx4YVAaqXCizmI0srH-5DGMy4wCLcBGAsYHQ/adoption.png" width="640" /></a></div><p></p><p>(One can try to imagine an even more general model to unify the three models above, though we're getting to fairly extreme abstraction levels. Nevertheless, for the fun of it: let's model each technology as a set of prerequisite technologies, and assume there's a subset of technology-space that makes up the sensible technologies, and some cost function that describes how hard it is to go from a set of technologies to a given new technology (so infinity if all prerequisites of the new one aren't contained in the known set). Then slow progress would be modelled as the set of sensible ideas and the cost function being such that from any particular set of known technologies, there are only a few sensible ideas with prerequisites only in the known set, and these have high costs. Fast progress is the opposite. 
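The prerequisite-set abstraction above can be made concrete with a toy simulation. Everything below – the random generation scheme, the greedy cheapest-first development rule, and all the numbers – is my own illustrative assumption for the sake of the sketch, not something taken from this post or from Hall:

```python
import random

# Toy sketch of the prerequisite-set model: each technology is a
# (prerequisites, cost) pair, and a technology is "within reach" of a
# known set K if all of its prerequisites are already in K.
random.seed(0)

def make_tech_space(n_techs, n_base, max_prereqs):
    """Generate a random technology space 0..n_techs-1, where
    techs 0..n_base-1 are primitives with no prerequisites."""
    space = {}
    for t in range(n_techs):
        if t < n_base:
            space[t] = (frozenset(), 1.0)
        else:
            k = random.randint(1, max_prereqs)
            prereqs = frozenset(random.sample(range(t), k))
            space[t] = (prereqs, random.uniform(1.0, 10.0))
    return space

def within_reach(space, known):
    """Unknown technologies whose prerequisites are all known."""
    return {t: cost for t, (prereqs, cost) in space.items()
            if t not in known and prereqs <= known}

def develop(space, known, budget):
    """Greedily invent the cheapest reachable technology until the
    budget runs out; returns the expanded set of known technologies."""
    known = set(known)  # copy, so the caller's set is untouched
    while True:
        frontier = within_reach(space, known)
        if not frontier:
            break
        t = min(frontier, key=frontier.get)
        if frontier[t] > budget:
            break
        budget -= frontier[t]
        known.add(t)
    return known

space = make_tech_space(n_techs=200, n_base=5, max_prereqs=3)
known = develop(space, known={0, 1, 2, 3, 4}, budget=100.0)
print(f"invented {len(known)} of {len(space)} technologies")
```

Slow progress then corresponds to a space where the frontier is sparse and expensive; fast progress to one where it is dense and cheap.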
In the Kurzweilian model, the subspace of sensible ideas is in some sense uniform, so that the fraction of the <script type="math/tex">2^{|K|}</script> possible prerequisite combinations for a known technology set <script type="math/tex">K</script> that are contained within the sensible set does not go down with the cardinality of <script type="math/tex">K</script>, and also we require the cost function to not increase too rapidly as the complexity of the technologies grows. In the fruit tree model, the cost function increases, and possibly the frequency of sensible technologies goes down as you get into the more complex parts of technology-space. In the random model, the cost function has no trend, and a lot of the advancements happen when a "key technology" is discovered that is the last unknown prerequisite for a lot of sensible technologies in technology-space.)</p><p>(Question: has anyone drawn up a dependency tree of technologies across many industries (or even one large one), or some other database where each technology is linked to a set of prerequisites? That would be an incredible dataset to explore.)</p><p>In <a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html"><i>Where is my Flying Car?</i></a>, J. Storrs Hall introduces his own abstraction of a civilisation's technology base that he calls the "technium": imagine some high-dimensional space representing possible technologies, and imagine a blob in this space representing existing technology. This blob expands as our technological base expands, but not uniformly: imagine some gradient in this space representing how hard it is to make progress in a given direction from a particular point, which you can visualise as a "terrain" which the technium has to move along as it expands.
Some parts of the terrain are steep: for example, given technology that lets you make economical passenger airplanes moving at near the speed of sound, it takes a lot to progress beyond that because crossing the speed of sound is difficult. Hence the "aviation cliffs" in the image below; the technium is pressing against it, but progress will be slow:</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://lh3.googleusercontent.com/-ag8j7955xik/YFy1EHc4riI/AAAAAAAACgY/4wNJq5pUIsoqhemZFxha1HGl4aRtb7epQCLcBGAsYHQ/technium1.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1070" data-original-width="1906" height="360" src="https://lh3.googleusercontent.com/-ag8j7955xik/YFy1EHc4riI/AAAAAAAACgY/4wNJq5pUIsoqhemZFxha1HGl4aRtb7epQCLcBGAsYHQ/w640-h360/technium1.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">(Image source: my own slides for an EA Cambridge talk.)</span><br /></td></tr></tbody></table><p></p><p>In other cases, there are valleys, where once the technium gets a toehold in it, progress is fast and the boundaries of what's possible gush forwards like a river breaking a dam. 
The best example is probably computing: figure out how to make transistors smaller and smaller, and suddenly a lot of possibilities open up.</p><p>We can visualise the three models above in terms of what we'd expect the terrain to look like as the technium expands further and further:</p><p></p><div class="separator" style="clear: both; text-align: center;"><div style="text-align: center;"><a href="https://lh3.googleusercontent.com/-Q8uW_-U9r6A/YFy1RHGhKOI/AAAAAAAACgk/kH9VI82-L-I_sV8llkVBOwoyzFob5mH5gCLcBGAsYHQ/techniumterrain.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="810" data-original-width="1854" height="280" src="https://lh3.googleusercontent.com/-Q8uW_-U9r6A/YFy1RHGhKOI/AAAAAAAACgk/kH9VI82-L-I_sV8llkVBOwoyzFob5mH5gCLcBGAsYHQ/w640-h280/techniumterrain.png" width="640" /></a></div></div><p></p><p>(Or maybe a better model would be one where the gradient is always positive, with 0 gradient meaning effortless progress?)</p><p>In the Kurzweilian model, the terrain gets easier and easier the further out you go; in the fruit tree model it's the opposite; if there is no pattern, then we should expect cliffs and valleys and everything in between, with no predictable trend.</p><p>Hall comes out in favour of what I've called the random model, even going as far as to speculate that the valleys might follow a <a href="https://en.wikipedia.org/wiki/Zipf%27s_law">Zipf's law</a> distribution. He concisely summarises the major valleys of the past and future:</p><blockquote><p><i>"The three main phases of technology that drove the Industrial Revolution were first low-pressure steam engines, then machine tools, and then high-pressure engines enabled by the precision that the machine tools made possible. High-pressure steam had the power-to-weight ratios that allowed for engines in vehicles, notably locomotives and steamships.
The three major, interacting, and mutually accelerating technologies in the twenty-first century are likely to be nuclear, nanotech (biotech is the “low-pressure steam” of nanotech), and AI, coming together in a synergy I have taken to calling the Second Atomic Age."</i></p></blockquote><p>Personally, my views have shifted away from somewhat Kurzweilian ones and towards the random model, with the main factors being that the technological stagnation debate has made me less certain that the historical data fits a Kurzweilian trend, and that since there are no clear answers to whether there is a general pattern, it's sensible to shift the distribution of my beliefs towards the model that doesn't require assuming the truth of a general pattern. However, given some huge valleys that seem to be out there – AI is the obvious one, but also nanotechnology, which might bring physical technology to Moore's-law-like growth rates – it is possible that the difference between the Kurzweilian and random models looks largely academic in the next century.</p><p> </p><h2>Is technology stagnating?</h2><p>Now that we have some idea of how to think about technological progress, we are better placed to answer the question of whether it has stagnated: if the fruit tree model is true we should expect a slowdown, whereas if the extreme Kurzweilian model is true, a single trend line that's not going to break past the top of the figure in the next decade is a failure. Even so, this question is very confusing; economists debate about total factor productivity (a debate I will stay out of), and in general it's hard to know what could have been.</p><p>However, it does seem true that compared to the mid-20th century, the post-1970 era has seen breakthroughs in fewer categories of innovation.
Consider:</p><ul><li><p>1920-1970:</p><ul><li>cars</li><li>radio</li><li>television</li><li>antibiotics</li><li>the green revolution</li><li>nuclear power</li><li>passenger aviation</li><li>chemical space travel</li><li>effective birth control</li><li>radar</li><li>lasers</li> </ul></li><li><p>1970-2020:</p><ul><li>personal computers</li><li>mobile phones</li><li>GPS</li><li>DNA sequencing</li><li>CRISPR</li><li>mRNA vaccines</li> </ul></li> </ul><p>Of course, it's hard to compare inventions and put them in categories – is lumping everything computing-related as largely the same thing really fair? – but <a href="https://rootsofprogress.org/technological-stagnation">some people are persuaded by such arguments</a>, and a general lack of big breakthroughs in big physical technologies does seem true. (Though this might soon change, since the clean energy, biotech, and space industries are making rapid progress.)</p><p>Why is this? If we accept the fruit tree model, there's nothing to be explained. If we accept the random one, we can explain it as a fluke of the shape of the idea space terrain that the technium is currently pressing into. To quote Hall again:</p><blockquote><p><i>"The default [explanation for technological stagnation] seems to have been that the technium has, since the 70s, been expanding across a barren high desert, except for the fertile valley of information technology. I began this investigation believing that to be a likely explanation."</i></p></blockquote><p>This, I think, is a pretty common view, and is a sensible null hypothesis for the lack of other evidence. We can also imagine variations, like the existence of a huge valley in the form of computing drawing all the talent that would otherwise have gone into pushing the technium forwards in other places. However, Hall rather dramatically concludes that this</p><blockquote><p><i>"[...] is wrong.
As the technium expanded, we have passed many fertile Gardens of Eden, but there has always been an angel with a flaming sword guarding against our access in the name of some religion or social movement, or simply bureaucracies barring entry in the name of safety or, most insanely, not allowing people to make money."</i></p></blockquote><p>Is this ever actually the case? I think there is a case where a feasible (and economic, environmental, and health-improving) technology has been blocked: nuclear power, as I discuss <a href="http://strataoftheworld.blogspot.com/2021/03/nuclear-power-is-good.html">here</a>. We should therefore amend our model of the technium: not only does it have to contend with the cliffs inherent in the terrain, but sometimes someone comes along and builds a big fat wall on the border, preventing either development, deployment, or both.</p><p>In diagram form:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-Ro1kfHXErQw/YFy1cWclgvI/AAAAAAAACgs/hDiZla9SwnUa9Ym5IAXf5Y-zi0BdSAUaQCLcBGAsYHQ/technium2.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1102" data-original-width="1884" height="374" src="https://lh3.googleusercontent.com/-Ro1kfHXErQw/YFy1cWclgvI/AAAAAAAACgs/hDiZla9SwnUa9Ym5IAXf5Y-zi0BdSAUaQCLcBGAsYHQ/w640-h374/technium2.png" width="640" /></a></div><p></p><p>Are there other cases? Yes – GMOs, as I discuss in <a href="http://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">this review</a>. There have also been some harmful technologies that have been controlled; for example biological and chemical weapons of mass destruction are more-or-less kept under control by two treaties (the <a href="https://en.wikipedia.org/wiki/Biological_Weapons_Convention">Biological Weapons Convention</a> and the <a href="https://en.wikipedia.org/wiki/Chemical_Weapons_Convention">Chemical Weapons Convention</a>). 
However, such cases seem to be the exception, since the overall history is one of technology adoption steamrolling the luddites, from the literal <a href="https://en.wikipedia.org/wiki/Luddite">Luddites</a> to George W. Bush's attempts to <a href="https://en.wikipedia.org/wiki/Stem_cell_laws_and_policy_in_the_United_States#Timeline">limit stem cell research</a>.</p><p>There are also cases where we put a lot of effort into expanding the technium in a specific direction (German subsidies for solar power are one successful example). We might think of this as adding stairs to make it easier to climb a hill.</p><p>How much of the technium's progress (or lack thereof) is determined by the terrain's inherent shape, and how much by the walls and stairs that we slap onto it? I don't know. The examples above show that as a civilisation we sometimes do build important walls in the technium terrain, but arguments like those Hall presents in <i>Where is my Flying Car?</i> are not strong enough to make me update my beliefs to thinking that this is the main factor determining how the technium expands. If I had to make a very rough guess, I'd say that though there is variation based on area (e.g. nuclear and renewable energy have a lot of walls and stairs respectively; computing has neither), overall the inherent terrain has at least several times the effect size on the decadal timescale. The power balance seems heavily dependent on the timescale too – George W. Bush can hold back stem cells for a few years, but imagine the sort of measures it would have taken to delay steam engines for the past few hundred years.</p><p> </p><h2>How should we guide technological progress?</h2><p>How much should we try to guide technological progress?</p><p>A first step might be to look at how good we've been at it in the past, so that we get a reasonable baseline for likely future performance. Our track record is clearly mixed. 
On one hand, chemical and biological weapons of mass destruction have so far been largely kept under control, though under a rather shoestring system (Toby Ord likes to point out that <a href="https://www.bbc.com/future/article/20200923-the-hinge-of-history-long-termism-and-existential-risk">the Biological Weapons Convention has a smaller budget than an average McDonald's</a>), and subsidies have helped solar and wind to become mature technologies. On the other hand, there are <a href="https://en.wikipedia.org/wiki/List_of_states_with_nuclear_weapons#Statistics_and_force_configuration">over ten thousand nuclear weapons in the world</a> and they don't seem likely to go away anytime soon (in particular, while <a href="https://en.wikipedia.org/wiki/New_START">New START</a> was recently extended, Russia has a <a href="https://en.wikipedia.org/wiki/RS-28_Sarmat">new ICBM</a> coming into service this year and the US is probably going to go ahead with their <a href="https://en.wikipedia.org/wiki/Ground_Based_Strategic_Deterrent">next-generation ICBM project</a>, almost ensuring that ICBMs – the most strategically volatile nuclear weapons – continue existing for decades more). We've mostly stopped ourselves benefiting from safe and powerful technologies like nuclear power and GMOs for no good reason. 
More recently, we've failed to allow <a href="https://en.wikipedia.org/wiki/Human_challenge_study">human challenge trials</a> for covid vaccines, despite massive net benefits (vaccine safety could be confirmed months faster, and the risk to healthy participants is lower than <a href="https://www.bls.gov/charts/census-of-fatal-occupational-injuries/civilian-occupations-with-high-fatal-work-injury-rates.htm">a year at some jobs</a>), <a href="https://www.1daysooner.org/">an army of volunteers</a>, and <a href="https://pubmed.ncbi.nlm.nih.gov/33334616/">broad public support</a>.</p><p>Imagine your friend was really into picking stocks, and sure, they once bought some AAPL, but often they've managed to pick the Enrons and Lehman Brothers of the world. Would your advice to them be more like "stay actively involved in trading" or "you're better off investing in an index fund and not making stock-picking decisions"?</p><p>Would things be better if we had tried to steer technology less? We'd probably be saving money and the environment (and <a href="https://en.wikipedia.org/wiki/Golden_rice">third-world children</a>) by eating far more genetically engineered food, and air pollution would've claimed <a href="https://www.mdpi.com/1996-1073/10/12/2169/htm">millions fewer lives</a> because nuclear power would've done more to displace coal. Then again, we'd probably have significantly less solar power. (Also, depending on what counts as steering technology rather than just reacting to its misuses, we might include the eventual bans on lead in gasoline, DDT, and chlorofluorocarbons as major wins.)
And maybe without the Biological Weapons Convention becoming effective in 1975, the Cold War arms race would've escalated to developing even more bioweapons than the <a href="https://en.wikipedia.org/wiki/Soviet_biological_weapons_program">Soviets already did</a> (for more depth, read <a href="https://www.amazon.com/Dead-Hand-Untold-Dangerous-Legacy/dp/0307387844">this</a>), and an accidental leak might've released a civilisation-ending super-anthrax.</p><p>So though we haven't been particularly good at it so far, can we survive without steering technological progress in the future? I made the point above that technology increases the variance of future outcomes, and this very much includes in the negative direction. Maybe <a href="https://en.wikipedia.org/wiki/Boost-glide">hypersonic glide vehicles</a> make the nuclear arms race more unstable and eventually result in war. Maybe technology lets Xi Jinping achieve his dream of permanent dictatorship, and this model turns out to be easily exportable and usable by authoritarians in every country. Maybe we don't solve the AI alignment problem before someone goes ahead and builds one, and the result is straight from Nick Bostrom's nightmares. And what exactly is the stable equilibrium in a world where a 150€ device that Amazon will drone-deliver to anyone in the world within 24 hours can take a genome and print out bacteria and viruses that have it?</p><p>This fragility is highlighted in a <a href="https://www.nickbostrom.com/existential/risks.html">2002 paper by Nick Bostrom</a>, who shares the view that the technium can't be reliably held back, at least to the extent that some dangerous technologies might require:</p><blockquote><p><i>"If a feasible technology has large commercial potential, it is probably impossible to prevent it from being developed. 
At least in today’s world, with lots of autonomous powers and relatively limited surveillance, and at least with technologies that do not rely on rare materials or large manufacturing plants, it would be exceedingly difficult to make a ban 100% watertight. For some technologies (say, ozone-destroying chemicals), imperfectly enforceable regulation may be all we need. But with other technologies, such as destructive nanobots that self-replicate in the natural environment, even a single breach could be terminal."</i></p></blockquote><p>The solution is what he calls differential development:</p><blockquote><p><i>"[We can affect] the rate of development of various technologies and potentially the sequence in which feasible technologies are developed and implemented. Our focus should be on what I want to call differential technological development: trying to retard the implementation of dangerous technologies and accelerate implementation of beneficial technologies, especially those that ameliorate the hazards posed by other technologies." [Emphasis in original]</i></p></blockquote><p>(See <a href="https://forum.effectivealtruism.org/posts/XCwNigouP88qhhei2/differential-progress-intellectual-progress-technological">here</a> for more elaboration on this concept and variations.)</p><p>For example:</p><blockquote><p><i>"In the case of nanotechnology, the desirable sequence would be that defense systems are deployed before offensive capabilities become available to many independent powers; for once a secret or a technology is shared by many, it becomes extremely hard to prevent further proliferation. In the case of biotechnology, we should seek to promote research into vaccines, anti-bacterial and anti-viral drugs, protective gear, sensors and diagnostics, and to delay as much as possible the development (and proliferation) of biological warfare agents and their vectors. 
Developments that advance offense and defense equally are neutral from a security perspective, unless done by countries we identify as responsible, in which case they are advantageous to the extent that they increase our technological superiority over our potential enemies. Such “neutral” developments can also be helpful in reducing the threat from natural hazards and they may of course also have benefits that are not directly related to global security."</i></p></blockquote><p>One point to emphasise is that the dangerous technology probably can't be held back indefinitely. One day, if humanity continues advancing (as it should), it will be easy to create deadly diseases, build self-replicating nanobots, or spin up a superintelligent computer program in the way that you'd spin up a Heroku server today. The only thing that will save us is if the defensive technology (and infrastructure, and institutions) are in place by then. In <i>The Diamond Age</i>, Neal Stephenson imagines a future where there are defensive nanobots in the air and inside people that are constantly on patrol against hostile nanobots. I can't help but think that this is where we're heading. (It's also the strategy our bodies have already adopted to fight off organic nanobots like viruses.)</p><p>This is not how we've done technology harm mitigation in the past. Guns are kept in check through regulation, not by everyone wearing body armour. Sufficiently tight rules on, say, what gene sequences you can put into viruses or what you can order your nanotech universal fabricator to produce will almost certainly be part of the solution and go a long way on their own. However, a gun can't spin out of control and end humanity; an engineered virus or self-replicating nanobot might.
And as we've seen, our ability to regulate technology isn't perfect, so maybe we should have a backup plan.</p><p>The overall picture therefore seems to be that our civilisation's track record at tech regulation is far from perfect, but the future of humanity may soon depend on it. Given this, perhaps it's better that we err on the side of too much regulation – not because it's probably going to be beneficial, but because it's a useful training ground to build up the institutional competence we're going to need to tackle the actually difficult tech choices that are heading our way. Better to mess up regulating Facebook and – critically – learn from it, than to make the wrong choices about AI.</p><p>It won't be easy to make the leap from a civilisation that isn't building much nuclear power despite being in the middle of a climate crisis to one that can reliably ensure we survive even when everyone and their dog plays with nanobots. However, an increase in humanity's collective competence at making complex choices about technology is something we desperately need.</p><p><br /></p><p style="text-align: center;"><b>RELATED:</b></p><p></p><ul style="text-align: left;"><li style="text-align: left;"><a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html">Review: Where is my Flying Car?</a></li><li><a href="http://strataoftheworld.blogspot.com/2021/03/nuclear-power-is-good.html">Nuclear power is good</a></li><li><a href="https://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">Review: Seeds of Science</a> – GMOs are also good</li></ul><p></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-57949953287693226422021-03-21T16:52:00.005+00:002021-04-25T22:19:18.152+01:00Review: Where is my Flying Car?<p style="text-align: center;"><span style="font-size: x-small;"> Book: <i>Where is my Flying Car?: A Memoir of Future Past</i>, by J. 
Storrs Hall (2018)<br />Words: 9.3k (about 31 minutes)</span></p><p style="text-align: center;"><br /></p><p>In the 50s and 60s, predictions of the future were filled with big physical technical marvels: spaceships, futuristic cities, and, most symbolically, flying cars. The lack of flying cars has become a cliche, whether as a point about the unpredictability of future technological progress, or a joke about hopeless techno-optimism.</p><p>For J. Storrs Hall, flying cars are not a joke. They are a feasible technology, as demonstrated by many historical prototypes that are surprisingly close to futurists' dreams, and practical too: likely to be more expensive than cars, yes, but providing many times more value to owners.</p><p>So, where are they?</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://lh3.googleusercontent.com/-l_dFD3WFGCM/YFd1zawk5BI/AAAAAAAACfE/jI3VAWTGstsxaGRCnrRB0BfLMEvEQQvVACLcBGAsYHQ/flyingcar.png" style="margin-left: auto; margin-right: auto;"><img data-original-height="1012" data-original-width="1310" height="309" src="https://lh3.googleusercontent.com/-l_dFD3WFGCM/YFd1zawk5BI/AAAAAAAACfE/jI3VAWTGstsxaGRCnrRB0BfLMEvEQQvVACLcBGAsYHQ/w400-h309/flyingcar.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Above: not a joke. <i>(Public domain, <a href="https://commons.wikimedia.org/wiki/File:ConvairCar_Model_118.jpg">original here</a>)</i></td><td class="tr-caption" style="text-align: center;"><br /></td></tr></tbody></table><p></p> <p>The central motivating force behind <i>Where is my Flying Car?</i> is the disconnect between what is physically possible with modern science, and what our society is actually achieving. 
The immediate objection to such points is to say: "well, of course some engineer can imagine a world where all this fancy technology is somehow economically feasible and widespread, but in the real world everything is more complicated, and once you take these complications into account there's no surprising failure".</p><p>Hall's objection is that everything was going fine until 1970 or so.</p><p>Many people complain that technological progress has slowed. Flying cars, of course, but also: airliner cruising speeds have stagnated, the space age went on hiatus, cities are still single-level flat designs with traffic, nuclear power stopped replacing fossil fuels, and nanotechnology (in the long run, the most important technology for building anything) is growing slowly. <a href="https://www.newyorker.com/magazine/2011/11/28/no-death-no-taxes">Peter Thiel</a> sums this up by saying "we wanted flying cars, instead we got 140 characters".</p><p>It's not just technology. There's an <a href="https://wtfhappenedin1971.com/">entire website devoted to throwing graphs at you about trends that changed around 1970</a> (and selling you Bitcoin on the side), and, while a bunch of it is <a href="https://tylervigen.com/spurious-correlations">Spurious Correlations material</a>, they include enough important things, like a stagnation in median wages, that it's worth thinking about.</p><p>Perhaps the most fundamental indicator is that the energy available per person in the United States was increasing exponentially (a trend Hall names the Henry Adams curve), until, starting around 1970, it just wasn't:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-lzBlmEZg8Yo/YFd2QE4JA2I/AAAAAAAACfM/Decg_7IdIvEQrT4I1txpyTqRYtVVg5DcQCLcBGAsYHQ/adamscurve.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="650" data-original-width="1522" height="274" 
src="https://lh3.googleusercontent.com/-lzBlmEZg8Yo/YFd2QE4JA2I/AAAAAAAACfM/Decg_7IdIvEQrT4I1txpyTqRYtVVg5DcQCLcBGAsYHQ/w640-h274/adamscurve.png" width="640" /></a></div><br /><p></p>Is this just because the United States is an outlier in energy use statistics? No; other developed countries have plateaued too, with the exception of Iceland and Singapore: <p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://lh3.googleusercontent.com/-wpdE2TYobyg/YFd2XDadn7I/AAAAAAAACfQ/FpdBl5tJtdsWZPwwMhuLepjVu7FvhBNKgCLcBGAsYHQ/energycapita.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1096" data-original-width="1748" height="402" src="https://lh3.googleusercontent.com/-wpdE2TYobyg/YFd2XDadn7I/AAAAAAAACfQ/FpdBl5tJtdsWZPwwMhuLepjVu7FvhBNKgCLcBGAsYHQ/w640-h402/energycapita.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p>(Source: <a href="https://ourworldindata.org/">Our World in Data</a>, one of the best websites on the internet. You can play around with an interactive version of this chart <a href="https://ourworldindata.org/grapher/per-capita-energy-use?tab=chart&time=earliest..latest&country=DEU~JPN~SGP~SWE~TWN~GBR~USA~ISL&region=World">here</a>.)</p></td></tr></tbody></table><p></p><div class="separator" style="clear: both; text-align: center;"></div> <p></p><p>Hall tries to estimate what percentage of future predictions in some technical area have come true as a function of the energy intensity of the technology, and finds a strong inverse correlation: in less energy intensive areas (e.g. mobile phones) we've over-achieved relative to futurists' predictions, while the opposite is true with energy intensive big machines (e.g. flying cars).
(This is necessarily very subjective, but Hall at least says he did not change any of his estimates after seeing the graph.)</p><p>Of course, we have to contrast the stagnation in some areas with enormous advancements during the same time. The most obvious example is computing, something that futurists generally missed. In biotechnology, the price of DNA sequencing has dropped exponentially and in just the past few years we've gotten powerful tools like CRISPR and mRNA vaccines. Meanwhile, the average person is now twice as rich as in 1970, and life expectancy has increased by 15 years (and the numbers are not much lower if we restrict our attention just to developed countries).</p><p>Perhaps we should be content; maybe Peter Thiel should stop complaining now that we have <a href="https://www.bbc.com/news/technology-41900880">280 characters</a>? After all, the problem is not that things are failing, but that they <i>might</i> be improving slower than they could be. That hardly seems like the end of the world. So why should we focus on technological progress? Has it really slowed? And how can we model it? <a href="https://strataoftheworld.blogspot.com/2021/03/technological-progress.html">I discuss these questions in another post</a>. In this post, however, I will move straight on to Hall's favourite topic.</p><p> </p><h2>Cool technology</h2><h3>Flying cars</h3><p>You might assume the case for flying cars looks something like this:</p><ol><li>You get to places very fast.</li><li>Very cool.</li> </ol><p>However, there's a deeper case to be made for flying cars (or rapid transportation in general), and it starts with the observation that barefoot-walkers in Zambia tend to spend an hour or so a day travelling. Why is this interesting? 
Because this is the same as the average duration in the United States (of course Hall's other example is the US) or any other society.</p><p>Flying cars aren't about the speed – they're about the distance that this speed allows, given universal human preferences for daily travel duration. Cars on the road do about 60 km/h on average for any trip ("you might think that you could do better for a long trip where you can get on the highway and go a long way fast", Hall writes, but "the big highways, on the average, take you out of your way by an amount that is proportional to the distance you are trying to go"). A flying car that goes five times faster lets you travel within twenty-five times the area, potentially opening up a lot of choice.</p><p>Hall goes through some calculations about the utilities of different time-to-travel versus distance functions, given empirical results from travel theory, to produce this chart (which I've edited to improve the image quality and convert units) as a summary:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-q2Xx89wLBH0/YFd2isMPIJI/AAAAAAAACfU/8ILMwLRxm4EngBLIBDewJSLuQBDGlMq_ACLcBGAsYHQ/valueofvehicle.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1388" data-original-width="2002" height="278" src="https://lh3.googleusercontent.com/-q2Xx89wLBH0/YFd2isMPIJI/AAAAAAAACfU/8ILMwLRxm4EngBLIBDewJSLuQBDGlMq_ACLcBGAsYHQ/w400-h278/valueofvehicle.png" width="400" /></a></div><p></p><p>(The overhead time means how long it takes to transition into flying mode, for example if you have to attach wings to it, or drive to an airport to take off.)</p><p>Even a fairly lame flying car would easily be three times more valuable than a regular car, mainly by giving you more choice and therefore letting you visit places that you like more.</p><p>In terms of what a flying car would actually look like, you have several options. 
Helicopters are obvious, but they are about ten times the price of cars, mechanically complex (and with very low manufacturing tolerances), and limited by aerodynamics (the advancing blade pushes against the sound barrier, and the retreating one pushes against generating too little lift due to how slowly it moves) to a speed of 250 km/h or so.</p><p>Historically, many promising flying car designs that actually flew were <a href="https://en.wikipedia.org/wiki/Autogyro">autogyros</a>, which generate thrust with a propeller but lift through an unpowered freely-rotating helicopter-like rotor. They generally can't take off vertically, but can land in a very small space.</p><p>Another design is a VTOL (vertical take-off and landing) aircraft. Some have been built and used as fighter jets, but they've gained limited use because they're slower and less manoeuvrable than conventional fighters and have less room for weapons. However, Hall notes that one experimental VTOL aircraft in particular – the <a href="https://en.wikipedia.org/wiki/Ryan_XV-5_Vertifan">XV-5</a> – would "have made one hell of a sports car" and its performance characteristics are recognisable as those of a hypothetical utopian flying car. It flew in 1964, but was cancelled because the Air Force wanted something as fast and manoeuvrable as a fighter jet, rather than "one hell of a sports car".</p><p>Of current flying car startups, Hall mentions <a href="https://en.wikipedia.org/wiki/Terrafugia">Terrafugia</a> and <a href="https://en.wikipedia.org/wiki/AeroMobil_s.r.o._AeroMobil">AeroMobil</a>, which produce traditional gasoline-powered vehicles (both with fuel economies comparable in litres/km to ordinary cars). 
There's also <a href="https://en.wikipedia.org/wiki/Volocopter">Volocopter</a> and <a href="https://en.wikipedia.org/wiki/EHang">EHang</a>, both of which produce electric vehicles with constrained ranges.</p><p>Hall divides the roadblocks (or should I say <a href="https://en.wikipedia.org/wiki/NOTAM">NOTAMs</a>?) for flying cars into four categories.</p><p>The first is that flying is harder than driving. To test this idea, Hall learned to fly a plane, and concluded that it is considerably harder, but not insurmountably so. Besides, we're not far from self-driving; commercial passenger flights are close to self-piloting already, the existing Volocopter is only "optionally piloted", and the EHang 184 flies itself. </p><p>The second is technological. The main challenges here are flying low and slow without stalling (you want to be able to land in small places, at least in emergencies), and reducing noise to manageable levels.</p><p>The third is economic. Even though the technology theoretically exists, it may be that we're not yet at a stage where personal flying machines are economically feasible. To some extent this is true; Hall admits that even on the pre-1970 trends in private aircraft ownership, the US private aircraft market would only be something like 30 000–40 000 aircraft per year (compared to the 2 000 or so that it currently is), about a hundredth of the number of cars sold. The economics means we should expect that the adoption curve is shallow, but not that it's necessarily non-existent.</p><p>The final reason is simple: even if you could make a flying car, you wouldn't be allowed to. Everything in aviation is heavily regulated, pushing up costs in a way that, Hall says, leads private pilots to joke about "hundred-dollar burgers". 
Of course, flying is hard, so you want standards high enough that at the very least you don't have to dodge other people's home-made flying motorbikes as they rain down from the sky, but in Hall's opinion the current balance is wrong.</p><p>And it's not just that the balance is wrong, but that the regulations are messed up. For example, making aircraft in the light sports aircraft category would be a great way to experiment with electric flight, but the FAA forbids them from being powered by anything other than a single internal combustion piston engine.</p><p>In particular, the FAA "has a deep allergy to people making money with flying machines". If you own a two-seat private aircraft, you can't charge a passenger you take on a flight more than half of the fuel cost, so no air Uber. Until the FAA stopped dragging its feet on <a href="https://en.wikipedia.org/wiki/Unmanned_aerial_vehicle#Commercial_use">drone regulation</a> in 2016, drones were operated under model aircraft rules, and therefore could not be used for anything other than hobby or recreational purposes. Similar rules still apply to ultralights, with one suspicious exception: a candidate for a federal, state, or local election is allowed to pay for a flight.</p><p>(And of course, to all these rules it's usually possible to apply for a waiver – so if you're a big company with an army of lawyers, do what you want, but if you're two people in a garage, good luck.)</p><p>There's no clear smoking gun of one piece of regulation specifically causing significant harm to flying car innovation. However, the harms of regulation are often a death-by-a-thousand-cuts situation, where a million rules each clip away at what is permissible and each add a small cost. 
Hall's conclusion is harsh: "It’s clear that if we had had the same planners and regulators in 1910 that we have now, we would never have gotten the family car at all."</p><p>One particular effect of flying cars would be to weaken the pull of cities, another topic to which Hall brings a lot of opinions.</p><h3>City design</h3><blockquote><p><i>"Designing a city whose transportation infrastructure consists of the flat ground between the boxes is insane."</i></p></blockquote><p>This is true. Most traffic problems would go away if you could add enough levels. However, "[e]ven the recent flurry of Utopia-building projects are still basically rows of boxes sitting on the dirt plus built-in wifi so the self-driving cars can talk to each other as they sit in automated traffic jams".</p><p>As usual, Hall spies some sinister human factors lurking behind the scenes, delaying his visions of techno-utopia:</p><blockquote><p><i>"There is a perverse incentive for bureaucrats and politicians to force people to interact as much as possible, and indeed to interact in contention, as that increases the opportunities for control and the granting of favors and privileges. 
This is probably one of the major reasons that our cities have remained flat, one-level no-man’s-lands where pedestrians (and beggars and muggers) and traffic at all scales are forced to compete for the same scarce space in the public sphere, while in the private sphere marvels of engineering have leapt a thousand feet into the sky, providing calm, safe, comfortable environments with free vertical transportation."</i></p></blockquote><p>This is an interesting idea, and I've <a href="https://www.elephantinthebrain.com/">read enough Robin Hanson</a> to not discount such perverse explanations immediately, but once again I'm not convinced how important this factor is, and Hall, as usual, is happy to paint only in broad strokes.</p><p>However, he makes a strong point here:</p><blockquote><p><i>"Densification proponents often point to an apparent paradox: removing a highway which crosses a community often does not increase traffic on the remaining streets, as the kind of hydraulic flow models used by traffic planners had assumed that it would. On the average, when a road is closed, 20% of the traffic it had handled simply vanishes. Traffic is assumed to be a bad thing, so closing (or restricting) roads is seen as beneficial. Well duh. If you closed all the roads, traffic would go to zero. If you cut off everybody’s right foot and forced them to use crutches, you’d get a lot less pedestrian traffic, too."</i></p></blockquote><p>Hall takes a liberal stance, strongly in favour of giving people choice, arguing that the goal of city design and transportation infrastructure should be to maximise how far people can travel quickly, rather than trying to ensure that they don't need to travel anywhere other than the set of choices the all-seeing, all-knowing urban designer saw fit to place nearby. 
Of course, once again flying cars are the best:</p><blockquote><p><i>"The average American commute to work, one way by car, ranges from 20 minutes to half an hour (the longer times in denser areas). This gives you a working radius of about 15 miles [= 24 km], or [1800 square kilometres] around home to find a workplace (or around work to find a home). With a fast VTOL flying car, you get a [240-kilometre] radius or [180 thousand square kilometres] of commutable area. Cars, trucks, and highways were clearly one of the major causes of the postwar boom. It isn’t perhaps realized just how much the war on cars contributed to the great stagnation—or how much flying cars could have helped prolong the boom."</i></p></blockquote><h3>Nuclear power</h3><p>I discuss nuclear power at length in <a href="http://strataoftheworld.blogspot.com/2021/03/nuclear-power-is-good.html">another post</a>.</p><h3>Space travel?</h3><p>What about the classic example of supposedly stalled innovation – we were on the moon in 1969, and won't return until <a href="https://en.wikipedia.org/wiki/Artemis_program">at least 2024</a>?</p><blockquote><p><i>"With space travel, there’s a pretty straightforward answer: the Apollo project was a political stunt, albeit a grand and uplifting one; there was no compelling reason to continue going to the moon given the cost of doing so."</i></p></blockquote><p>The general curve of space progress seems to be over-achievement relative to technological trends in the 60s, followed by stagnation, not because the technology is impossible – we did go to the moon after all – but because it just wasn't economical. 
Only now, with private space companies like SpaceX and Rocket Lab actually making a business out of taking things to space outside the realm of <a href="https://aozerov.com/research/lvmarket.pdf">cosy cost-plus government contracts</a>, is innovation starting to pick up again.</p><p>(In the past ten years, we've seen the first commercial crewed spacecraft, reuse of rocket stages, the first methane-fuelled rocket engine ever flown, the first full-flow staged-combustion rocket engine ever flown, and the first liquid-fuelled air-launched orbital rocket, just to pick some examples.)</p><p>Hall has some further comments about space. First, in this passage he shows an almost-religious deference to trend lines:</p><blockquote><p><i>"As you can see from the airliner cruising speed trend curve, we shouldn’t have expected to have commercial passenger space travel yet, even if the Great Stagnation hadn’t happened."</i></p></blockquote><p>I don't think it makes sense to take a trend line for atmospheric flight speeds and use that to estimate when we should have passenger space travel; the physics is completely different, and in particular speeds are very constrained in orbit (you need to go 8 km/s to stay in orbit, and you can't go faster around the Earth without constant thrusting to stop yourself from flying off – something Hall clearly understands, as he explains it more than once).</p><p>Secondly, he is of course in favour of everything high-energy and nuclear.</p><p>For example: <a href="https://en.wikipedia.org/wiki/Project_Orion_(nuclear_propulsion)">Project Orion</a> was an American plan for a spacecraft powered (potentially from the ground up, rather than just in space) by throwing nuclear bombs out the back and riding the plasma from the explosions. This is a good contender for the stupidest-sounding idea that actually makes for a solid engineering plan; it's a surprisingly feasible way of getting sci-fi performance characteristics from your spacecraft. 
Other feasible methods have either far lower thrust (like ion engines, meaning that you can't use them to take off or land) or far lower exhaust velocity (which means much more of your spacecraft needs to be fuel). The obvious argument against Orion, at least for atmospheric launch, is the fallout, but Hall points out it's actually not <i>that</i> bad – the number of additional expected cancer deaths from radiation per launch is "only" in the single digits, and that's under a very conservative linear no-threshold model of radiation dangers, which is likely wrong. (The actual reasons for cancellation weren't related to radiation risks, but instead the prioritisation of Apollo, the <a href="https://en.wikipedia.org/wiki/Partial_Nuclear_Test_Ban_Treaty">Partial Test Ban Treaty of 1963</a> that banned atmospheric nuclear tests, and the fact that no one in the US government had a particularly pressing need to put a thousand tons into orbit.) Hall also mentions an interesting fact about Orion that I hadn't seen before: "the total atmospheric contamination for a launch was roughly the same no matter what size the ship; so that there would be an impetus toward larger ones" – perhaps Orion would have driven mass space launch.</p><p>A more controlled alternative to bombing yourself through space is to use a nuclear reactor to heat up propellant in order to expel it out the back of your rocket at high speeds, pushing you forwards. The main limit with these designs is that you can't turn the heat up too much without your reactor blowing up. Hall's favoured solution is a direct fission-to-jet process, where the products of your nuclear reaction go straight out the engine without all this intermediate fussing around with heating the propellant. 
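The exhaust-velocity trade-off above drops straight out of the Tsiolkovsky rocket equation, ∆V = v<sub>e</sub> ln(m<sub>wet</sub>/m<sub>dry</sub>). A minimal sketch (the ~3.8 km/s chemical exhaust velocity is my assumption for a good chemical engine, not a figure from the book; it happens to reproduce the ~400 m/s number quoted below):

```python
import math

def delta_v(exhaust_velocity, wet_mass, dry_mass):
    """Tsiolkovsky rocket equation: total velocity change achievable."""
    return exhaust_velocity * math.log(wet_mass / dry_mass)

# A 10-ton craft carrying 1 ton of fuel, i.e. a mass ratio of 10/9.
wet, dry = 10.0, 9.0

chemical = delta_v(3_800, wet, dry)          # ~3.8 km/s: good chemical engine (assumed)
fission_jet = delta_v(20_000_000, wet, dry)  # 20 Mm/s: the p + Li-7 exhaust Hall describes

print(f"chemical:    {chemical:,.0f} m/s")     # roughly 400 m/s
print(f"fission jet: {fission_jet:,.0f} m/s")  # roughly 2 000 km/s
```

The second number is what makes the two-day 1 g burn in the Mars example possible: thrusting at 1 g for two days costs about 1.7 × 10⁶ m/s of ∆V, comfortably inside the ~2.1 × 10⁶ m/s budget.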
A reaction that converts a proton and a lithium-7 atom into 2 helium nuclei would give an exhaust velocity of 20 Mm/s (7% of the speed of light), which is insane.</p><p>To give some perspective: let's say your design parameters are that you have a 10 ton spacecraft, of which 1 ton can be fuel. With chemical rocket technology, this gives you a little toy with a total ∆V of some 400 m/s, meaning that if you light it up and let it run horizontally along a frictionless train track, it'll break the sound barrier by the time it's out of fuel, but it can't take you from an Earth-to-moon-intercept trajectory to a low lunar orbit even with optimal trajectories. With the proton + lithium-7 process Hall describes, your 10% fuel, 10-ton spaceship can accelerate at 1G for two days. If you want to go to Mars, instead of this whole modern business of waiting for the orbital alignment that comes once every 26 months and then doing a 9-month trip along the lowest-energy orbit possible, you can almost literally point your spaceship at Mars, accelerate yourself to a speed of 1 000 km/s over a day (for comparison, the speeds of the inner planets in their orbits are in the tens of kilometres per second range), coast for maybe a day at most, and then decelerate for another day. For most of the trip you get free artificial gravity because your engine is pushing you so hard. This would be technology so powerful even Hall feels compelled to tack on a safety note: "watch out where you point that exhaust jet".</p><h3>Nanotechnology!</h3><p>Imagine if machine pieces could not be made on a scale smaller than a kilometre. Want a gear? Each tooth is a 1km x 1km x 1km cube at least. Want to build something more complicated, say an engine? If you're in a small country, it may well be a necessarily international project, and also better keep it fairly flat or it won't fit within the atmosphere. Want to cut down a single tree? 
Good luck.</p><p>This is roughly the scale at which modern technology operates compared to the atomic scale. Obviously this massively cuts down on what we can do. Having nanotechnology that lets us rearrange atoms on a fine level, instead of relying on astronomically blunt tools and bulk chemical reactions, could put the capabilities of physical technology on the kind of exponential Moore's law curve we've seen in information technology.</p><p>There are some problems in the way. As you get to smaller and smaller scales:</p><ul><li>matter stops being continuous and starts being discrete (and therefore for example oil-based lubrication stops working);</li><li>the impact of gravity vanishes but the impact of adhesion increases massively;</li><li>heat dissipation rates increase;</li><li>everything becomes springy and nothing is stiff anymore; and</li><li>hydrogen atoms (other atoms are too heavy) can start doing weird quantum stuff like tunnelling.</li> </ul><p>Also, how do we even get started? If all we have are extremely blunt tools, how do we make sharp ones?</p><p>There are two approaches. The first, the top-down approach, was suggested <a href="https://en.wikipedia.org/wiki/There%27s_Plenty_of_Room_at_the_Bottom">in a 1959 talk</a> by Richard Feynman, which is credited with introducing the concept of nanotechnology. First, note that we currently have an industrial tool-base at human scales that is, in a sense, self-replicating: it requires human inputs, but we can draw a graph of the dependencies and see that we have tools to make every tool. Now we take this tool-base, and create an analogous one at one-fourth the scale. We also create tools that let us transfer manipulations – the motions of a human engineer's hands, for example – to this smaller-scale version (today we can probably also automate large parts of it, but this isn't crucial). 
Now we have a tool-base that can produce itself at a smaller scale, and we can repeat the process again and again, making adjustments in line with the above points about how the engineering must change. If each step is one-fourth the previous, 8 iterations will take us from a millimetre-scale industrial base to a tens-of-nanometres-scale one.</p><p>The other approach is bottom-up. We already have some ability to manipulate things on the single-digit nanometre scale: the smallest features on today's chips are in this range, we have <a href="https://en.wikipedia.org/wiki/Atomic_force_microscopy">atomic-scale microscopes that can also manipulate atoms</a>, and of course we're surrounded by massively complicated nanotechnology called organic life that comes with pre-made nano-components. Perhaps these tools let us jump straight to making simple nano-scale machines, and a combination of these simple machines and our nano-manipulation tools lets us eventually build the critical self-sustaining tool-base at the atomic level.</p><h3>Weather machines?!</h3><p>Here's one thing you could do with nanotechnology: make 5 quintillion 1 cm controllable hydrogen balloons with mirrors, release them into the atmosphere, and then set sunlight levels to be whatever you want (without nanotechnology, this might also be doable, but nanotechnology lets you make very thin balloons and therefore removes the need to strip-mine an entire continent for the raw materials).</p><p>Hall calls this a weather machine, and it is exactly what it says on the tin, both on a global and local level. He estimates that it would double global GDP by letting regions set optimal temperatures, since "you could make land in lots of places on the earth, such as Northern Canada and Russia, as valuable as California". 
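Is 5 quintillion centimetre-scale balloons even enough mirror area to control all sunlight? A back-of-the-envelope check of my own, assuming each 1 cm balloon can shade at most its cross-sectional disc:

```python
import math

N = 5e18          # balloons ("5 quintillion")
mirror_d = 0.01   # 1 cm balloon/mirror diameter, in metres
earth_r = 6.371e6 # Earth's mean radius, in metres

total_mirror_area = N * math.pi * (mirror_d / 2) ** 2  # ~3.9e14 m^2 of mirror
sunlit_disc_area = math.pi * earth_r ** 2              # ~1.3e14 m^2 Earth cross-section

print(total_mirror_area / sunlit_disc_area)  # ~3: the fleet could tile the sunlit disc with margin
```

So the fleet could in principle intercept all sunlight reaching Earth about three times over, which is the sort of margin you would want to allow for overlap, steering, and partial transparency.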
Of course, this is assuming that we don't care about messing up every natural ecosystem and weather pattern on the planet, but if the machine is powerful enough we might choose to keep the still-wild parts of the world as they are. I don't know if this would work, though; sunlight control alone can do a lot to the weather, but perhaps you'd need something different to avoid, for example, the huge winds from regional temperature differences? However, with a weather machine, the sort of subtle global modifications needed to offset the roughly 1 watt per square metre of radiative forcing that anthropogenic emissions have caused would be trivial. </p><p>Weather machines are scary, because we're going to need very good institutions before that sort of power can be safely wielded. Hall thinks they're coming by the end of the century, if only because of the military implications: not only could you destroy agriculture wherever you want, but the mirrors could also focus sunlight onto a small spot. You could literally smite your enemies with the power of the sun.</p><p>Don't want things in the atmosphere, but still want to control the climate? Then put sunshades up in orbit, at the same time incentivising the development of a large-scale orbital launch infrastructure that we can afterwards use to settle Mars or whatever. As a bonus, put solar panels on your sunshade satellites, and you can generate more power than humanity currently uses.</p><p>As always, nothing is too big for Hall. He goes on to speculate about a weather machine <a href="https://en.wikipedia.org/wiki/Dyson_sphere">Dyson sphere</a> at half the width of the Earth's orbit. Put solar panels on it, and it would generate enormous amounts of power. Use it as a telescope, and you could see a phone lying on the ground on <a href="https://en.wikipedia.org/wiki/Proxima_Centauri_b">Proxima Centauri b</a>. 
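The phone-spotting claim actually survives a diffraction-limit check. A rough sketch, assuming 500 nm visible light and treating the sphere as a single aperture of 1 AU (half the ~2 AU width of Earth's orbit):

```python
import math

wavelength = 500e-9        # visible light, metres (assumed)
aperture = 1.496e11        # 1 AU aperture diameter, metres
proxima = 4.25 * 9.461e15  # distance to Proxima Centauri, metres

theta = 1.22 * wavelength / aperture  # Rayleigh criterion: angular resolution in radians
resolution = theta * proxima          # smallest resolvable feature at that distance

print(f"{resolution:.2f} m")  # prints about 0.16 m: phone-sized
```

A resolvable feature of ~16 cm at 4.25 light-years, so "a phone lying on the ground" is, remarkably, the right order of magnitude.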
Or, if the Proxima Centaurians try to invade, you can use it as a weapon and "pour a quarter of the Sun’s power output, i.e. 100 trillion terawatts, into a [15-centimetre] spot that far away, making outer space safe for democracy."</p><h3>Flying cities?!?</h3><p>And because why the hell not: imagine a 15-kilometre airplane shaped like a manta ray and with a thickness of a kilometre (so the <a href="https://en.wikipedia.org/wiki/Burj_Khalifa">Burj Khalifa</a> fits inside), with room for 10 million people inside. It takes 200 GW of power to stay flying – equivalent to 4 000 Boeing 747s – which could be provided by a line of nuclear power plants every 100 metres or so running along the back. This sounds like a lot, but Hall helpfully points out the reactors would only be 0.01% of the internal volume, so you could still cluster Burj Khalifas inside to your heart's content, and the energy consumption comes out to only 20 kW per person, about where we'd be today if energy use had continued growing on pre-1970s trends.</p><p>If you don't want to go to space but still want to leave the Earth untouched, this is one solution, as long as you don't mind a lot of very confused birds.</p><h2>Technology is possible, but has risks</h2><p>I worry that <i>Where is my Flying Car?</i> easily leaves the impression that everything Hall talks about is part of some uniform techno-wonderland, which, depending on your prior about technological progress, is somewhere between certainly going to happen and permanently relegated to the dreams of mad scientists. Hall does not work to dispel this impression: he goes back and forth between talking about how practical flying cars are and speculating about exotic nuclear spacecraft, or between reasonable ideas about traffic layout in cities and far-off speculation about city-sized airplanes. 
Credible world-changing technologies like nanotechnology easily seem like just another crazy thought Hall sketched out on the back of an envelope and could not stop being enthusiastic about.</p><p>So should we take Hall's more grounded speculation seriously and ignore the nano-nuclear-space-megapolises? I think this would be the wrong takeaway. First, I'm not sure Hall's crazy speculation is crazy enough to capture possible future weirdness within it; he restricts himself mainly to physical technologies, and thus leaves out potentially even weirder things like a move to virtual reality or the creation of superhuman intelligence (whether AI or augmented humans).</p><p>Second, Hall does have a consistent and in some ways realist perspective: if you look at the world – not at the institutions humans have built, or whatever our current tech toolbox contains, but at the physical laws and particles at our disposal – what do you come up with?</p><p>After all, our world is ultimately not one of institutions and people and their tools. The "strata" go deeper, until you hit the bedrock of fundamental physics. We spend most of our time thinking about the upper layers, where the underlying physics is abstracted out and the particles partitioned into things like people and countries and knowledge. This is for good reason, because most of the time this is the perspective that lets you best think about things important to people. Occasionally, however, it's worth taking a less parochial perspective by looking right down to the bedrock, and remembering that anything that can be built on that is possible, and something we may one day deal with.</p><p>This perspective should also make clear another fact. The things we care about (e.g. people) exist many layers of abstraction up from the fundamental physics, and are therefore fragile, since they depend on the correct configuration of all levels below. 
If your physical environment becomes inhospitable, or an engineered virus prevents your cells from carrying out their function, the abstraction of you as a human with thoughts and feelings will crash, just like a program crashes if you fry the circuits of the computer it runs on.</p><p>So there are risks, new ones will appear as we get better at configuring physics, and stopping civilisation from accidentally destroying itself with some new technology is not something we're automatically guaranteed to succeed at.</p><p>Hall does not seem to recognise this. Despite all his talk about nanotechnology, the <a href="https://en.wikipedia.org/wiki/Gray_goo">grey goo scenario</a> of self-replicating nanobots going out of control and killing everyone doesn't get a mention. As far as I'm aware, there's no strong theoretical reason for this to be impossible – nanobots good at configuring carbon/oxygen/hydrogen atoms are a very reasonable sort of nanobot, and I can't help noticing that my body is mainly carbon, oxygen, and hydrogen atoms. "What do you replace oil lubrication with for your atomic-scale machine parts" is a worthwhile question, as Hall notes, but I'd like to add that so is the problem of not killing everyone.</p><p>Hall does mention the problem of AI safety:</p><blockquote><p><i>"The latest horror-industry trope is right out of science fiction [...]. People are trying to gin up worries that an AI will become more intelligent than people and thus be able to take over the world, with visions of Terminator dancing through their heads. Perhaps they should instead worry about what we have already done: build a huge, impenetrably opaque very stupid AI in the form of the administrative state, and bow down to it and serve it as if it were some god."</i></p></blockquote><p>What's this whole thing with arguments of the form "people worry about AI, but the <i>real</i> AI is X", where X is whatever institution the author dislikes? 
<a href="https://www.buzzfeednews.com/article/tedchiang/the-real-danger-to-civilization-isnt-ai-its-runaway">Here's another example</a> from a different political perspective (by sci-fi author Ted Chiang, whose <a href="http://strataoftheworld.blogspot.com/2020/05/short-reviews-fiction.html">fiction I enjoy</a>). I don't think this is a useless perspective – there is an analogy between institutions that fail because their design optimises for the wrong thing, and the more general idea of powerful agents accidentally designed to optimise for the wrong thing – but at the end of the day, surprise surprise, the real AI is a very intelligent computer program.</p><p>Hall also mentions he "spent an entire book (<i><a href="https://www.amazon.com/Beyond-AI-Creating-Conscience-Machine/dp/1591025117">Beyond AI</a></i>) arguing that if we can make robots smarter than we are, it will be a simple task to make them morally superior as well." This sounds overconfident – morality is complicated, after all – but I haven't read it.</p><p>As for climate change, Hall acknowledges the problem but justifies largely dismissing it by citing “[t]he actual published estimates for the IPCC’s worst case scenario, RCP8.5, [which] are for a reduction in GDP of between 1% and 3%". <a href="https://science.sciencemag.org/content/sci/356/6345/1362.full.pdf">This is true</a> ... if you only consider the United States! (The EU is in the same range but the global estimates range up to 10%, because of a disproportionate effect on poor tropical countries.) As the authors of that very report also note, these numbers don't take into account non-market losses. If Hall wants to make an argument for techno-optimistic capitalism, he should consider taking more care to distinguish himself from the strawman version.</p><p> </p><h2>It's <i>not</i> the technology, stupid!</h2><p>Hall does not think that we'd have all the technologies mentioned above if only technological progress had not "stagnated". 
The things he expects could've happened by now given past trends are:</p><ul><li>The technological feasibility of flying cars would be demonstrated and sales would be on the rise; Hall goes as far as to estimate the private airplane market in the US could have been selling 30k-40k planes per year (a fairly tight confidence interval for something this uncertain); compare with the actual US market today, which sells around 16 million cars and a few thousand private aircraft per year.</li><li>Demonstrated examples of multi-level cities and floating cities.</li><li>Chemical spacecraft technology would be about where it is now, but some chance that government funding would have resulted in <a href="https://en.wikipedia.org/wiki/Project_Orion_(nuclear_propulsion)">Project Orion</a>-style nuclear launch vehicles.</li><li>Nanotechnology: basic things like ammonia fuel cells might exist, but not fancier things like cell repair machines or universal fabricators.</li><li>Nuclear power would generate almost all electricity, and hence there would be a lot less CO2 in the atmosphere (<a href="https://www.mdpi.com/1996-1073/10/12/2169/htm">this study</a> estimates 174 billion fewer tons of CO2 had reasonable nuclear trends continued, but Hall optimistically gives the number as 500 billion tons).</li><li>AI and computers at the same level as today.</li><li>A small probability that something unexpected along the lines of cold fusion would have turned out to work and been commercialised.</li><li>A household income several times larger than today.</li> </ul><p>So what went wrong? Hall argues:</p><blockquote><p>"The faith in technology reflected in Golden Age SF and Space Age America wasn’t misplaced.
What they got wrong was faith in our culture and bureaucratic arrangements."</p></blockquote><p>He gives two broad categories of reasons: concrete regulations, and a more general cultural shift from hard technical progress to worrying and signalling.</p><h3>Regulation ruins everything?</h3><p>Hall does not like regulation. He estimates that had regulation not grown as it did after 1970, the increased GDP growth might have been enough to make household incomes 1.5 to 2 times higher than they are today in the US. I can find some studies saying similar things – <a href="https://www.sciencedirect.com/science/article/abs/pii/S1094202520300223">here</a> is one claiming 0.8% lower GDP growth per year since 1980 due to regulation, which would imply today's economy would be about 1.3 times larger had this drag on growth not existed. As far as I can tell, these estimates also don't take into account the benefits of regulation, which are sometimes massive (e.g. banning lead in gasoline). However, I think most people agree that regardless of how much regulation there should be, it could be a lot smarter. </p><p>Hall's clearest case for regulation having a big negative impact on an industry is private aviation in the United States, which crashed around 1980 after more stringent regulations were introduced. The number of airplane shipments per year dropped something like six-fold and never recovered.</p><p>A much bigger example is nuclear power, which I will discuss in an upcoming post, and which Hall also has plenty to say about.</p><p>Strangely, Hall misses perhaps the most obvious case in modern times: GMOs pointlessly being almost regulated out of existence, a story told well in Mark Lynas' <i>Seeds of Science</i> (my review <a href="http://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">here</a>).
Perhaps this is because of Hall's focus on hard sciences, or his America-centrism (GMO regulation is worse in the EU than in the United States).</p><p>And speaking of America-centrism, the biggest question I had is why, even if the US is bad at regulation, no other country decides to do better and become the flying car capital of the world. Perhaps good regulation is hard enough that no one gets it right? Hall makes no mention of this question, though. </p><p>He does, however, throw plenty of shade at anything involving centralisation. For example:</p><blockquote><p><i>"Unfortunately, the impulse of the Progressive Era reformers, following the visions of [H. G.] Wells (and others) of a “Scientific Socialism,” was to centralize and unify, because that led to visible forms of efficiency. They didn’t realize that the competition they decried as inefficient, whether between firms or states, was the discovery procedure, the dynamic of evolution, the genetic algorithm that is the actual mainspring of innovation and progress."</i></p></blockquote><p>He brings some interesting facts to the table. For example, an OECD survey found a 0.26 correlation between private spending on research & development and economic growth, but a -0.37 correlation between public R&D and growth. Here's Hall's once again somewhat dramatic explanation:</p><blockquote><p><i>“Centralized funding of an intellectual elite makes it easier for cadres, cliques, and the politically skilled to gain control of a field, and they by their nature are resistant to new, outside, non-Ptolemaic ideas.
The ivory tower has a moat full of crocodiles.”</i></p></blockquote><p>He backs this up with his personal experience of how US government spending on nanotechnology led to a flurry of scientists trying to claim that their work counted as nanotechnology (up to and including medieval stained glass windows) as well as trying to discredit anything that actually was nanotechnology, to make sure that the nanotechnologists wouldn't steal more federal funding in the future.</p><p>Studies, not surprisingly, find that the issue is more complicated (see for example <a href="https://link.springer.com/article/10.1007/s10645-019-09331-3">here</a>, which includes a mention of the specific survey Hall references).</p><p>Hall also includes a graph of economic growth vs the Fraser Institute's economic freedom score in the United States. I've created my own version below, including some more information than Hall does:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-CTgmRx_DaVY/YFd3KNL9TpI/AAAAAAAACfg/1Joe0sC5KjM0CJwKMvaCWkRtG9L68XDogCLcBGAsYHQ/gdpef.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="980" data-original-width="1550" height="405" src="https://lh3.googleusercontent.com/-CTgmRx_DaVY/YFd3KNL9TpI/AAAAAAAACfg/1Joe0sC5KjM0CJwKMvaCWkRtG9L68XDogCLcBGAsYHQ/gdpef.png" width="640" /></a></div><p></p>In general, it seems sensible to expect economic freedom to increase GDP: the more a person's economic choices are limited, the more likely the limitations are to prevent them from taking the optimal action (the main counterexample being if optimal actions for an individual create negative externalities for society). We can also see that this is empirically the case – developed countries tend to have high economic freedom. However, in using this graph as clear evidence, I think Hall is once again trying to make too clear a case on the basis of one correlation. 
<p>Effective decentralised systems, whether markets or democracy, are always prone to attack by people who claim that things would be better if only we let them make the rules. Maybe it takes something of Hall's engineer mindset to resist this impulse and see the value of bloodless systems and of general design principles like feedback and competition. (And perhaps Hall should apply this mindset more when evaluating the strength of evidence for his economic ideas.)</p><p>As for what the future of societal structure looks like, Hall surprisingly manages to avoid proposing flying-car-ocracy:</p><blockquote><p><i>"[It] may well be possible to design a better machine for social and economic control than the natural marketplace. But that will not be done by failing to understand how it works, or by adopting the simplistic, feedback-free methods of 1960s AI programs. And if ever it is done, it will be engineers, not politicians, who do it."</i></p></blockquote><p>He goes further:</p><blockquote><p><i>"As a futurist, I will go out on a limb and make this prediction: when someone invents a method of turning a Nicaragua into a Norway, extracting only a 1% profit from the improvement, they will become rich beyond the dreams of avarice and the world will become a much better, happier, place. Wise incorruptible robots may have something to do with it."</i></p></blockquote><h3>Risk perception and signalling</h3><p>Hall's second reason for us not living up to expectations for technological progress is cultural. He starts with the idea of risk homeostasis in psychology: everyone has some tolerance for risk, and will seek to be safer when they perceive current risk to be higher, and take more risks when they perceive current risk to be lower. In developed countries, risks are of course ridiculously low compared to historical levels, so most people feel safer than ever. 
Some start skydiving in response, but Hall suggests there's another effect that happens when an entire society finds itself living below its risk tolerance:</p><blockquote><p><i>"One obvious way [to increase perceived risk] is simply to start believing scare stories, from Corvairs to DDT to nuclear power to climate change. In other words, the Aquarian Eloi became phobic about everything specifically because we were actually safer, and needed something to worry about."</i></p></blockquote><p>I know what you're thinking – what the hell are "Aquarian Eloi"? Hall likes to come up with his own terms for things; in this case he is referencing H. G. Wells' <i>The Time Machine</i> – in which descendants of humanity live out idle and dissolute lives (modelled on England's idle rich of the time) – to label what he claims is the modern zeitgeist. Yes, this book is weird at times.</p><p>Another cultural idea he touches on is increased virtue signalling. Using the idea of <a href="https://en.wikipedia.org/wiki/Maslow%27s_hierarchy_of_needs">Maslow's hierarchy of needs</a>, he explains that as more and more of the population is materially well-off, more people invest more effort into self-actualisation. Some of this is productive, but, humans being humans, a lot of this effort goes into trying to signal how virtuous you are. 
Of course, there's nothing inherently wrong with that, as long as your virtue signalling isn't preventing other people from climbing up from lower levels of Maslow's hierarchy – or, Hall would probably add, from building those flying cars.</p><h3>Environmentalism vs Greenism</h3><p>A particular sub-case of cultural change that Hall has a lot to say about is the "Green religion", something he distinguishes (though sometimes with not enough care) from perfectly reasonable desires "to live in a clean, healthy environment and enjoy the natural world".</p><p>This ideological, fear-driven and generally anti-science faction within the environmentalist movement is much the same thing as what Steven Pinker calls "Greenism", which I talked about in <a href="http://strataoftheworld.blogspot.com/2018/08/review-enlightenment-now-steven-pinker.html">my review of <i>Enlightenment Now</i></a> (search for "Greenism") and also features in <a href="http://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">my review of Mark Lynas' <i>Seeds of Science</i></a> (search for "torpedoes"). Unlike Lynas or even Pinker, Hall does not hold back when it comes to criticising this particular strand of environmentalism. He explains it as an outgrowth of the risk-averseness and virtue signalling trends described above. The "Green religion", he claims, is now the "default religion of western civilization, especially in academic circles", and "has developed into an apocalyptic nature cult". To explain its resistance to progress and improving the human condition, he writes:</p><blockquote><p><i>"It seems likely that the fundamentalist Greens started with the notion that anything human was bad, and ran with the implication that anything that was good for humans was bad. In particular, anything that empowered ordinary people in their multitudes threatened the sanctity of the untouched Earth. The Green catechism seems lifted out of classic Romantic-era horror novels. 
Any science, any engineering, the “acquirement of knowledge,” can only lead to “destruction and infallible misery.” We must not aspire to become greater than our nature."</i></p></blockquote><p>There are troubling tendencies in ideological Greenism (as there are with anything ideological), but I think "apocalyptic nature cult" takes it too far, and as a substitute religion for the west, it has some formidable competitors. Hall is right to point out the tension between improving human welfare and Greenist desires to limit humans, but I'd bet that the driving factor isn't direct disdain for humans, but rather the sort of sacrificial attitudes that are common in humans (consider <a href="https://www.britannica.com/topic/flagellants">the people</a> who went around whipping themselves during the Black Death to try to atone for whatever God was punishing them for). Probably there's some part of human psychology or our cultural heritage that makes it easy to jump to sacrifice, disparaging ourselves (or even all of humanity), and repentance as the answer to any problem. While this is a nobly selfless approach, it's just less effective than, and sometimes in opposition to, actually building things: developing new technologies, building clean power plants, and so on.</p><p>Hall also goes too far in letting the Greenists tar his view of the entire environmentalist movement. Not only is climate change a more important problem than the 1-3% estimated GDP loss for the US suggests, but you'd think that the sort of big technical innovation that is happening with clean tech would be exactly the sort of progress Hall would be rooting for.</p><p>Hall does have an environmentalist proposal, and of course it involves flying cars:</p><blockquote><p><i>"The two leading human causes of habitat destruction are agriculture and highways—the latter not so much by the land they take up, but by fragmenting ecosystems. 
One would think that Greens would be particularly keen for nuclear power, the most efficient, concentrated, high-tech factory farms, and for ... flying cars. "</i></p><p><i>[Ellipsis in original]</i></p></blockquote><h3>Energy matters!</h3><p>Despite being partly blinded by his excessive anti-Greenism, Hall makes one especially important correction to some strands of environmentalist thinking: cheap energy really matters and we need more of it (and energy efficiency won't save the day).</p><p>Above, I used the stagnation in energy use per capita as an example of things going wrong. This may have raised some eyebrows; isn't it good that we're not consuming more and more energy? Don't we want to reduce our energy consumption for the sake of the environment?</p><p>First, it is obviously true that we need to reduce the environmental impact of energy generation. Decoupling GDP growth from CO2 emissions is one of the great achievements of western countries over the past decades, and we need to massively accelerate this trend.</p><p>However, our goal, if we're liberal humanists, should be to give people choices and let them lead happy lives (while applying the same considerations to any sentient non-human beings, and ideally not wrecking irreplaceable ecosystems). In our universe, this means energy. Improvements in the quality of life over history are, to a large extent, improvements in the amount of energy each person has access to. This is very true:</p><blockquote><p><i>“Poverty is ameliorated by cheap energy. Bill Gates, nowadays perhaps the world’s leading philanthropist, puts it, “If you could pick just one thing to lower the price of—to reduce poverty—by far you would pick energy.”"</i></p></blockquote><p>Even in the United States, "[e]nergy poverty is estimated to kill roughly 28,000 people annually in the US from cold alone, a toll that falls almost entirely on the poor". 
</p><p>Climate change cannot be solved by reducing energy consumption, because there are six billion people in the world who have not reached western living standards and who should be brought up to them as quickly as possible. This will take energy. What we need is to simultaneously massively increase the amount of energy that humanity uses, while also switching over to clean energy. If you think only one of these is enough, you have either failed to understand the gravity of the world's poverty situation or the gravity of its environmental one.</p><p>(Energy efficiency matters, because all else being equal, it reduces operating costs. It is near-useless for solving emissions problems, however, because the more efficiently we can use energy, the more of it we will use. Hall illustrates this with a thought experiment of a farmer who uses a truck to carry one crate of tomatoes at a time from their farm to a customer, and whose only expense is fuel for the truck. Double its fuel efficiency, and it's economical to drive twice as far, and hence service four times as many customers (assuming customer number is proportional to reachable area), plus each trip is twice as long on average. The net result is that the 2x increase in efficiency leads to 8x more kilometres driven and hence 4x higher fuel consumption. The general case is called <a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons paradox</a>.)</p><p>So yes, we need energy, most urgently in developing countries, but the more development and deployment of new energy sources there is, the cheaper they will be for everyone – consider Germany's highly successful subsidies for solar power – so developed countries have a role to play as well. (Also, are we sure there would be no human benefits to turning the plateauing in developed country energy use back into an increase?)</p><p>You'd think this is obvious. Unfortunately it isn't. 
In a section titled "AAUGHH!!", Hall presents these quotes:</p><blockquote><p><i>“The prospect of cheap fusion energy is the worst thing that could happen to the planet. —Jeremy Rifkin</i></p><i></i><p><i>Giving society cheap, abundant energy would be the equivalent of giving an idiot child a machine gun. —Paul Ehrlich</i></p><i></i><p><i>It would be little short of disastrous for us to discover a source of clean, cheap, abundant energy, because of what we might do with it. —Amory Lovins”</i></p></blockquote><p>They are what leads Hall to say, perhaps with too much pessimism:</p><blockquote><p><i>"Should [a powerful new form of clean energy] prove actually usable on a large scale, they would be attacked just as viciously as fracking for natural gas, which would cut CO2 emissions in half, and nuclear power, which would eliminate them entirely, have been."</i></p></blockquote><p>It is good to give people the choice to do what they want, and therefore good to give them as much energy as possible to play with, whether they want it to power the construction of their dream city or their flying car trips to Australia (I do draw the line at Death Stars, though).</p><p>Right now we're limited by the wealth of our societies, which caps us at about 10 kW/capita in developed countries, and by the unacceptable externalities of our polluting technology. The right goal isn't to enforce limits on what people can do (except indirectly through the likes of taxes and regulation to correct externalities), but to bring about a world where these limits are higher.</p><p>If energy is expensive, people are cheap – lives and experiences are lost for want of a few watts. This is the world we have been gradually dragging ourselves out of since the industrial revolution, and progress should continue. Energy should be cheap, and people should be dear.</p><p> </p><h2>Don't panic; build</h2><p><i>Where is my Flying Car?</i> is a weird book.</p><p>First of all, I'm not sure if it has a structure. 
Hall will talk about flying cars, zoom off to something completely different until you think he's said all he has to say on them, and just when you least expect it: more flying cars. The same pattern of presentation repeats with other topics. Also, sections begin and sometimes end with a long selection of quotes, including no less than three from Shakespeare.</p><p>Second, the ideas. There are the hundred speculative examples of crazy (big, physical) future technologies, the many often half-baked economic/political arguments, the unstated but unmissable America-centrism, and witty rants that wander the border between insightful social critique and intellectualised versions of stereotypical boomer complaints about modern culture.</p><p>Also, the cover is this:</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://lh3.googleusercontent.com/-w8hBExP7z7U/YFd3-oDl_8I/AAAAAAAACfo/tINZwzIMi04AtmLrMfLRGae6DY9qtEpXACLcBGAsYHQ/cover.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1012" data-original-width="624" height="400" src="https://lh3.googleusercontent.com/-w8hBExP7z7U/YFd3-oDl_8I/AAAAAAAACfo/tINZwzIMi04AtmLrMfLRGae6DY9qtEpXACLcBGAsYHQ/w247-h400/cover.png" width="247" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Above: ... a joke?<br /></td></tr></tbody></table><p></p><div class="separator" style="clear: both; text-align: center;"></div><p></p> <p>However, I think overall there's a coherent and valuable perspective here. First, Hall is against pointless pessimism. 
He makes this point most clearly when talking about dystopian fiction, but I think it generalises:</p><blockquote><p><i>"Dystopia used to be a fiction of resistance; it’s become a fiction of submission, the fiction of an untrusting, lonely, and sullen twenty-first century, the fiction of fake news and infowars, the fiction of helplessness and hopelessness. It cannot imagine a better future, and it doesn’t ask anyone to bother to make one. It nurses grievances and indulges resentments; it doesn’t call for courage; it finds that cowardice suffices. Its only admonition is: Despair more."</i></p></blockquote><p>Hall's answer to this pessimism is to point out ten billion cool tech things that we could do one day. He veers too much to the techno-optimistic side by not acknowledging any risks, but overall this is an important message. Visions of the future are often dominated by the negatives: no war, no poverty, no death. Someone needs to fill in the positives, and while Hall focuses more on the "what" of it than the "how does it help humans" part, I think a hopeful look at future technologies is a good start.</p><p>In addition to being against pessimism about human capabilities, Hall also takes, at least implicitly, a liberal stand by being against pessimism about humans. His answer to "what should we do?" 
is to give people choice: let them travel far and easily, let them live where they want, let them command vast amounts of energy.</p><p>Hall also identifies two ways to keep a civilisation on track in terms of making technological progress and not getting consumed by signalling and politics: growing, and having a frontier.</p><p>On the topic of growth, he makes basically the same point as my <a href="https://strataoftheworld.blogspot.com/2019/09/growth-and-civilisation.html">post on growth and civilisation</a>:</p><blockquote><p><i>"One of the really towering intellectual achievements of the 20th Century, ranking with relativity, quantum mechanics, the molecular biology of life, and computing and information theory, was understanding the origins of morality in evolutionary game theory. The details are worth many books in themselves, but the salient point for our purposes is that the evolutionary pressures to what we consider moral behavior arise only in non-zero-sum interactions. In a dynamic, growing society, people can interact cooperatively and both come out ahead. In a static no-growth society, pressures toward morality and cooperation vanish; you can only improve your situation by taking from someone else. The zero-sum society is a recipe for evil."</i></p></blockquote><p>Secondly, the idea of a frontier: something outside your culture that your society presses against (ideally nature, but I think this would also apply to another competing society). This is needed because "[w]ithout an external challenge, we degenerate into squabbling [and] self-deceiving".</p><blockquote><p><i>"But on the frontier, where a majority of one’s efforts are not in competition with others but directly against nature, self-deception is considerably less valuable. 
A culture with a substantial frontier is one with at least a countervailing force against the cancerous overgrowth of largely virtue-signalling, cost-diseased institutions.”</i></p></blockquote><p>Frontiers often relate to energy-intensive technologies:</p><blockquote><p><i>"High-power technologies promote an active frontier, be it the oceans or outer space. Frontiers in turn suppress self-deception and virtue signalling in the major institutions of society, with its resultant cost disease. We have been caught to some extent in a self-reinforcing trap, as the lack of frontiers foster those pathologies, which limit what our society can do, including exploring frontiers. But by the same token we should also get positive feedback by going in the opposite direction, opening new frontiers and pitting our efforts against nature."</i></p></blockquote><p>Finally, Hall's book is a reminder that an important measure to judge a civilisation against is its capacity to do physical things. Even if the bulk of progress and value is now coming from less material things, like information technology or designing ever fairer and more effective institutions, there are important problems – covid vaccinations, solving climate change, and building infrastructure, for example – that depend heavily on our ability to actually go out and move atoms in the real world. 
Let's make sure we continue to get better at that, whether or not it leads to flying cars.</p><div><br /></div><div style="text-align: center;"><b>RELATED:</b></div><div><ul style="text-align: left;"> <li><a href="http://strataoftheworld.blogspot.com/2021/03/nuclear-power-is-good.html">Nuclear power is good</a></li> <li><a href="https://strataoftheworld.blogspot.com/2021/03/technological-progress.html">Technological progress</a></li><li><a href="https://strataoftheworld.blogspot.com/2018/08/review-enlightenment-now-steven-pinker.html">Review: Enlightenment Now</a></li><li><a href="https://strataoftheworld.blogspot.com/2018/10/review-energy-and-civilization-history.html">Review: Energy and Civilisation</a></li> </ul></div><p> </p><h2>Data science 2 (2021-01-22)</h2><p style="text-align: center;"><span style="font-size: x-small;"><i>6.4k words, including equations (about 30 minutes)</i></span> <br /></p><p>See the <a href="http://strataoftheworld.blogspot.com/2020/12/data-science-1.html">first post</a> for an introduction.</p><h2>Monte Carlo methods</h2><p>In the late 1940s, Stanislaw Ulam was trying to work out the probability of winning in a solitaire variant. After cranking out combinatorics equations for a while, he had the idea that simulating a large number of games starting from random starting configurations with the "fast" computers that were becoming available could be a more convenient method.</p><p>At the time, Ulam was working on nuclear weapons at Los Alamos, so he had the idea of using the same principle to solve some difficult neutron diffusion problems, and went on to develop such methods further with John von Neumann (no mid-20th century maths idea is complete without von Neumann's hand somewhere on it). 
Since this was secret research, it needed a codename, and a colleague suggested "Monte Carlo" after the casino in Monaco. (This group of geniuses managed to break rule #1 of codenames, which is "don't reveal the basic operating principle of your secret project in its codename".)</p><p>Ulam used this work to help himself become (along with Edward Teller) the father of the hydrogen bomb. Our purposes here will be a bit more modest.</p><p>The basic idea of Monte Carlo methods is just repeated random sampling. Have a way to generate a random variable <script type="math/tex">X</script>, but not to generate fancy maths stats like <script type="math/tex">P(X \in S)</script>, where <script type="math/tex">S</script> is some subset of the sample space? Fear not – let <script type="math/tex">f(x)</script>, for values <script type="math/tex">x</script> that <script type="math/tex">X</script> can take, be 1 if <script type="math/tex">x \in S</script> and 0 otherwise. Then <script type="math/tex">E(f(X))</script> is <script type="math/tex">P(f(X) = 1) = P(X \in S)</script> and we've solved the problem if we can estimate <script type="math/tex">P(f(X)=1)</script>. 
If we can randomly sample values from <script type="math/tex">X</script> (and calculate the function <script type="math/tex">f</script>), then this is easy, because we simply sample many values and calculate for what fraction of them <script type="math/tex">f(X) = 1</script>.</p><p>In general,</p><div cid="n354" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n354" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-1" type="math/tex; mode=display">E(f(X)) \approx \frac{1}{n} \sum_{i=1}^n f(x_i)</script></div></div><p>for large <script type="math/tex">n</script> and with <script type="math/tex">x_i</script> drawn independently at random from <script type="math/tex">X</script>, a result that comes from the law of the unconscious statistician (discussed in <a href="http://strataoftheworld.blogspot.com/2020/12/data-science-1.html">part 1</a>) once you realise that as <script type="math/tex">n</script> increases the fraction of <script type="math/tex">x_i</script>s in the sample approaches <script type="math/tex">P(X=x_i)</script>.</p><p>We can also do integration in a Monte Carlo style. The standard way to integrate a function <script type="math/tex">f</script> is to sample it at uniform points, multiply each sampled value by the distance between the uniform points, and then add everything up. 
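Before moving on to integration, the sampling recipe above can be sketched in a few lines of Python (my example, not code from the post: I take <script type="math/tex">X</script> to be standard normal and <script type="math/tex">S = (1, \infty)</script>, so <script type="math/tex">f</script> is the indicator of <script type="math/tex">x > 1</script>):

```python
import random

random.seed(0)  # fixed seed so the estimate is reproducible

def mc_probability(sample, f, n=200_000):
    """Estimate E(f(X)) = P(X in S) by averaging the indicator f over n samples of X."""
    return sum(f(sample()) for _ in range(n)) / n

# X ~ N(0, 1), S = (1, infinity): f(x) is 1 when x > 1 and 0 otherwise.
est = mc_probability(lambda: random.gauss(0.0, 1.0), lambda x: 1 if x > 1 else 0)
# est should land close to the true value 1 - Phi(1) ≈ 0.159
```

Averaging an indicator is exactly the <script type="math/tex">\frac{1}{n} \sum_{i=1}^n f(x_i)</script> estimator above; its standard error shrinks like <script type="math/tex">1/\sqrt{n}</script>, so each extra digit of accuracy costs about a hundred times more samples.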
There's nothing special about uniformity though – as the number of samples increases, as long as we make sure to multiply each by the distance to the next sample, the result will converge to the integral.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-mbiuy7znBrs/YAq_cWeYE1I/AAAAAAAACUE/kEnKGqk6Xj8XRCp6Owfq108NPny0xPUrQCLcBGAsYHQ/mcint.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="418" data-original-width="998" height="268" src="https://lh3.googleusercontent.com/-mbiuy7znBrs/YAq_cWeYE1I/AAAAAAAACUE/kEnKGqk6Xj8XRCp6Owfq108NPny0xPUrQCLcBGAsYHQ/w640-h268/mcint.png" width="640" /></a></div><p></p>Above on the left, we see standard integration, with undershoot in pink and overshoot in orange, and Monte Carlo integration, with random samples, on the right. <p>Sometimes a lot of the interesting stuff (e.g. expected value, area in the integral, etc.) comes from a part of the function's domain that's low-probability when values in the domain are generated via <script type="math/tex">X</script>. If this happens, you either crank up the <script type="math/tex">n</script> in your Monte Carlo, or get smart about how exactly you sample (this is called importance sampling). 
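In Python, plain Monte Carlo integration with uniform samples might look like the following sketch (my example, not code from the post: integrating <script type="math/tex">x^2</script> over <script type="math/tex">[0, 3]</script>, whose true value is exactly 9):

```python
import random

random.seed(0)  # fixed seed for reproducibility

def mc_integrate(f, a, b, n=100_000):
    # Average f at n uniform random points in [a, b], then scale by the
    # interval length: on average each sample "covers" (b - a) / n of the axis.
    total = sum(f(random.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

est = mc_integrate(lambda x: x * x, 0.0, 3.0)  # true integral is 9
```

The error of this estimator also shrinks like <script type="math/tex">1/\sqrt{n}</script> regardless of dimension, which is why Monte Carlo integration shines for high-dimensional integrals where grid-based methods blow up.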
If we're smart about this, our randomised integration can be faster than the standard method.</p><p>We will look at examples of using Monte Carlo-style random simulation to do both Bayesian and frequentist statistics below.</p><p> </p><h2>Confidence</h2><p>In addition to providing a best-guess estimate of something (the probability a coin comes up heads, say), useful statistics should be able to tell us how confident we should be in a particular guess – the best estimate of the probability a coin lands heads after observing 1 head in 2 throws or 50 heads in 100 throws is the same, but the second one still allows us to say more.</p><p>The question of how to quantify confidence leads into the question of what probability is.</p><p>The frequentist approach is to say that probabilities are observed relative frequencies across many trials, and if you don't have many trials to look at, then you imagine some hypothetical set of trials that an event might be seen as being drawn from.</p><p>The Bayesian approach is that probabilities quantify the state of your own knowledge, and if you don't have data to look at, you should still be able to draw a probability distribution representing your knowledge.</p><h3>Bayesianism</h3><p>Bayesianism is the idea that you represent uncertainty in beliefs about the world using numbers, which come from starting out with some prior distribution, and then shifting the distribution back and forth as evidence comes in. These numbers follow the axioms of probability, and so we might as well call them probabilities.</p><p>(Why should these numbers follow the axioms of probability? Because if you do otherwise and base decisions on those beliefs, you will do stupid things. As a simple example, making bets consistent with a probability model where the probabilities do not sum to 1 makes you exploitable. 
Let's say you're buying three options, each of which pays out 100€ if the winner of the 2036 US presidential election is EterniTrump, <a href="https://en.wikipedia.org/wiki/GPT-3">GPT</a>-7, or Xi Jinping respectively, and you pay 40€ for each (consistent with assigning a probability of greater than 0.4 to each event occurring). You pay 120€ in total, but at most one of the options can pay out, so you're sure to be down at least 20€ that you could've spent on underground bunkers instead.)</p><p>In Bayesian statistics, you don't perform arcane statistical tests to reject hypotheses. Your subjective beliefs about something are a probability distribution (or at least they should be, if you want to reason perfectly). Once you've internalised the idea of what a probability distribution means, and know how to reason about updates to that probability distribution rather than in black-and-white terms of absolute truth or falsehood, Bayesianism is intuitive and will make your reasoning about probabilistic things (i.e., everything except pure maths) better.</p><p>(Why is Bayesianism named after Bayes? Bayes invented Bayes' theorem but not Bayesianism; however, Bayesian updating using Bayes' theorem is the core part of ideal Bayesian reasoning.)</p><p>There's one tricky part of Bayesianism, and it's a consequence of the Bayesian insistence that subjective uncertainty is represented by a probability distribution, and hence quantified. It's this: you always need to start with a quantified probability distribution (called a prior), even before you've seen any data.</p><p>There's a clear regress here, at least philosophically. 
Sure, you might be able to come up with a sensible prior for how effective masks are against a respiratory disease, but ask a baby for <script type="math/tex">P(\frac{P(\text{covid} | \text{mask})}{P(\text{covid}|\neg \text{mask})} = r)</script> and you're not likely to get a coherent answer (and remember that your current prior should come from baby-you's prior in an unbroken series of Bayesian updates) – let alone if we're imagining some hypothetical platonic being existing beyond time and space who has never seen any data, or the <a href="https://www.theguardian.com/world/2020/apr/07/face-masks-cannot-stop-healthy-people-getting-covid-19-says-who">World Health Organisation</a>.</p><p>In practice, however, I don't think this is very worrying. Priors formalise the idea that you can apply background knowledge even when you don't have data for the specific case in front of you. Reject the use of priors, and you'll fall into another regress: "study suggests mask-wearing effective against the coronavirus variant in 40-60 year-old European females in green t-shirts; no information yet on 40-60 year-old European females in red t-shirts ..."</p><h4>Computational Bayes</h4><p>In general, the scenario we have when doing a Bayesian calculation is that there's some model <script type="math/tex">X</script> that depends on parameter(s) <script type="math/tex">\theta</script>, and we want to find what those parameters are given some sample <script type="math/tex">x</script> from <script type="math/tex">X</script> (since this is Bayesian, we have to assume that <script type="math/tex">\theta</script> itself is a value of the random variable <script type="math/tex">\Theta</script> describing the probabilities of each possible <script type="math/tex">\theta</script>). 
Now we could do this mathematically by calculating</p><div cid="n375" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n375" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-2" type="math/tex; mode=display">\Pr_\Theta(\theta \, | \, X=x) = c \Pr_X(x | \Theta = \theta) \Pr_\Theta(\theta),</script></div></div><p>and then finding the constant <script type="math/tex">c</script> with integration, by the rule that probabilities must sum (or integrate) to 1. (Remember the interpretation of these terms: <script type="math/tex">\Pr_\Theta(\theta)</script> is the prior distribution we assume for <script type="math/tex">\Theta</script> before seeing evidence; <script type="math/tex">\Pr_\Theta(\theta \, | \, X=x)</script> is the posterior distribution after seeing the data; see the <a href="http://strataoftheworld.blogspot.com/2020/12/data-science-1.html">previous post</a> for some intuition on Bayes if these aren't clear to you.)</p><p>However, maybe some part of this (especially the integration) would be tricky, or you just happen to have a Jupyter notebook open on your computer. In any case, we can go about things in a different way, as long as we have a way to generate samples from our prior distribution and re-weight them appropriately.</p><p>The first thing we do is represent the prior distribution of <script type="math/tex">\Theta</script> by sampling it many times. We don't need an equation for it, just some function (in the programming sense) that pulls from it.</p><p>Next, consider the impact of our data on the estimates. 
We can imagine each sample we took as a representation of a tiny blob of probability mass corresponding to some particular <script type="math/tex">\theta_i</script>, and imagine rescaling it in the same way that we rescaled the odds of various outcomes when talking about the odds ratio form of Bayes' rule in the first post. How much do we rescale it by? By the likelihood of observing <script type="math/tex">x</script> if <script type="math/tex">\Theta=\theta_i</script>: this is the <script type="math/tex">\Pr_X(x|\Theta=\theta)</script> term in the above equation.</p><p>Finally, we need to do the scaling. Thankfully, this doesn't take integration, since we can calculate the sum of our re-weighted likelihoods and just divide all our scaled values by that – boom, we have (an approximation of) a posterior probability distribution.</p><p>To make things concrete, let's write code and visualise a simple case: estimating the probability that a coin lands heads. The first step in Bayesian calculations is usually the trickiest: we need a prior. For simplicity, let's say our prior is that the coin has an equal chance of having every possible probability (so the real numbers 0 to 1) of coming up heads.</p><p>(The fact that the thing we're estimating is itself a probability doesn't matter; don't be confused by the fact that we have two sorts of probability – our knowledge about the coin's probability of coming up heads, represented as a probability distribution, and the probability that the coin comes up heads (an empirical fact you can measure by throwing it many times). 
Equally well we might have talked about some non-probabilistic feature of the coin, like its diameter, but that would be a lot more boring.)</p><p>To write this out in actual Python, the first step (after importing NumPy for vectorised calculation and Matplotlib for the graphing we'll do later) is some way to generate samples from this distribution:</p><pre><code class="language-python" lang="python">import numpy as np<br />import matplotlib.pyplot as plt<br /><br />def prior_sample(n):<br /> return np.random.uniform(size=n)<br /></code></pre><p>(<code>np.random.uniform(size=n)</code> returns <code>n</code> samples from a uniform distribution over the range 0 to 1.)</p><p>To calculate the posterior:</p><pre><code class="language-python" lang="python">def posterior(sample, throws, heads):<br /> """ This function calculates an approximation of the<br /> posterior distribution after seeing the coin<br /> thrown a certain number of times;<br /> sample is a sample of our prior distribution,<br /> throws is how many times we've thrown the coin,<br /> heads is how many times it has come up heads."""<br /> # The number of times the coin lands heads follows a binomial distribution.<br /> # Thus, below we reweight using a binomial pdf:<br /> # (note that we drop the throws-Choose-heads term because it's a constant<br /> # and we rescale at the end anyways)<br /><br /> weighted_sample = sample ** heads * (1 - sample) ** (throws - heads)<br /><br /> # Divide by the sum of every element in the weighted sample to normalise:<br /><br /> return weighted_sample / np.sum(weighted_sample)<br /></code></pre><p>(Remember that the calculation of <code>weighted_sample</code> is done on every term in the <code>sample</code> array separately, in the standard vectorised way.)</p><p>Now we can generate a sample to model the prior distribution, and plot it as a histogram:</p><pre><code class="language-python" lang="python">N = 100000<br />throws = 100<br />heads = 20<br /><br />sample = 
prior_sample(N) # model the prior distribution<br /><br /># Plot a histogram:<br />plt.hist(sample,<br /> # split the range 0-1 into 50 bins for the histogram:<br /> np.linspace(0, 1, 50), <br /> # weight each item by the likelihood:<br /> weights=posterior(sample, throws, heads))<br /></code></pre><p>The result will look something like this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-x0LpfUiUwCA/YAq_z2nq0vI/AAAAAAAACUM/aLkZe2ZhF9sDGHa5JDOfXPb0hhZOllRswCLcBGAsYHQ/postex.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="496" data-original-width="802" height="396" src="https://lh3.googleusercontent.com/-x0LpfUiUwCA/YAq_z2nq0vI/AAAAAAAACUM/aLkZe2ZhF9sDGHa5JDOfXPb0hhZOllRswCLcBGAsYHQ/w640-h396/postex.png" width="640" /></a></div><br /><p></p><p>This is an approximation of the posterior probability distribution after seeing 100 throws and 20 heads. We see that most of the probability mass is clustered around a probability of 0.2 of landing heads; the chance of it being a fair coin is negligible.</p><p>What if we had a different prior? Let's say we're reasonably sure it's roughly a standard coin, and model our prior for the probability of landing heads as a normal distribution with mean 0.5 and standard deviation 0.1. 
To visualise this prior, here's a histogram of 100k samples from it:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-zkfbbObxwQE/YArADhq3MPI/AAAAAAAACUc/mAd0l3HIix4AXh1YABNpYKd3lhS8ubpAQCLcBGAsYHQ/normex.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="488" data-original-width="800" height="244" src="https://lh3.googleusercontent.com/-zkfbbObxwQE/YArADhq3MPI/AAAAAAAACUc/mAd0l3HIix4AXh1YABNpYKd3lhS8ubpAQCLcBGAsYHQ/w400-h244/normex.png" width="400" /></a></div><br /><p></p>The posterior distribution looks almost identical to our previous posterior: <p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-ehLqscBpblA/YAq_5uJ6WAI/AAAAAAAACUU/3oBLg53kBPIsWj1arn0eRUpf6fJmsJH2gCLcBGAsYHQ/postex2.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="498" data-original-width="1000" height="318" src="https://lh3.googleusercontent.com/-ehLqscBpblA/YAq_5uJ6WAI/AAAAAAAACUU/3oBLg53kBPIsWj1arn0eRUpf6fJmsJH2gCLcBGAsYHQ/w640-h318/postex2.png" width="640" /></a></div><br /><p></p><p>There's simply so much data (a hundred throws) that even very different priors will have converged on what the data indicates.</p><p>A normal distribution might not be a very good model, though. Say we think there's a 49.5% chance the coin is fair, a 49.5% chance it's been rigged to come up tails with a probability arbitrarily close to 1, and the remaining 1% is spread uniformly between 0 and 1 (be very careful about assigning zero probability to something!). 
Then our prior distribution might be coded like this:</p><pre><code class="language-python" lang="python">def prior_sample_3(n):<br /> m = n // 100  # 1% of samples: uniform between 0 and 1<br /> return np.concatenate((np.random.uniform(size=m),<br /> np.zeros((n - m) // 2),  # 49.5%: rigged, never lands heads<br /> np.ones(n - m - (n - m) // 2) / 2),  # 49.5%: exactly fair<br /> axis=0)<br /></code></pre><p>and 100k samples might be distributed like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-EISq1Nh8hJg/YArAHGUFB9I/AAAAAAAACUg/MdtcMGCDeVc9lIapC4Bw0yNQ8wbGFLy7gCLcBGAsYHQ/priorex.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="494" data-original-width="800" height="396" src="https://lh3.googleusercontent.com/-EISq1Nh8hJg/YArAHGUFB9I/AAAAAAAACUg/MdtcMGCDeVc9lIapC4Bw0yNQ8wbGFLy7gCLcBGAsYHQ/w640-h396/priorex.png" width="640" /></a></div><br /><br /><p></p>Let's also say we have less data than before – the coin has come up heads 8 times out of 40, say. Now our posterior distribution looks like this: <p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-fqUm2qOhyvI/YArAKABG7-I/AAAAAAAACUw/lyX3ZPq95vcgCKYW77h6ToDS1GXkpfQkACLcBGAsYHQ/postex3.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="496" data-original-width="800" height="396" src="https://lh3.googleusercontent.com/-fqUm2qOhyvI/YArAKABG7-I/AAAAAAAACUw/lyX3ZPq95vcgCKYW77h6ToDS1GXkpfQkACLcBGAsYHQ/w640-h396/postex3.png" width="640" /></a></div><br /><p></p>We've ruled out that the coin is rigged (a single heads was enough to nuke the likelihood of a completely rigged coin to zero – be very careful about assigning a probability of zero to something!), and most of the probability mass has shifted to a probability of landing heads of around 20%, as before, but because our prior was different, a noticeable chunk of our expectation is still that the coin is exactly fair. 
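<p>One nice feature of representing the posterior as a weighted sample is that summary statistics fall straight out as weighted averages. Here's a sketch reusing the uniform prior and the 100-throw, 20-head data from the code above (with a uniform prior the exact posterior is a Beta(21, 81) distribution, whose mean is 21/102 ≈ 0.206):</p>

```python
import numpy as np

np.random.seed(0)

N = 100_000
throws, heads = 100, 20

# Sample from the uniform prior:
sample = np.random.uniform(size=N)

# Reweight each prior sample by the binomial likelihood and normalise
# (the constant throws-choose-heads factor cancels in the normalisation):
weights = sample ** heads * (1 - sample) ** (throws - heads)
weights /= np.sum(weights)

# The posterior mean is then just a weighted average of the samples;
# with a uniform prior the exact answer is (heads + 1) / (throws + 2).
posterior_mean = np.sum(weights * sample)
```

<p>(This is importance sampling with the prior as the proposal distribution: fine here, but if the prior and posterior barely overlap, most weights end up near zero and you need a much larger <code>N</code>.)</p>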
<p>As a final example, here's a big flowchart showing how the probability you should assign to different odds of the coin coming up heads shifts as you get data (red = tails, green = heads) up to 5 coin throws, assuming a prior that's the uniform distribution:</p><p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-mfCWYY9GDj4/YArAh75Q0pI/AAAAAAAACVI/3pLFbKdE5zIMLqtQWWydmsOfbB6b_l79wCLcBGAsYHQ/bayesucompressed.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1667" data-original-width="600" src="https://lh3.googleusercontent.com/-mfCWYY9GDj4/YArAh75Q0pI/AAAAAAAACVI/3pLFbKdE5zIMLqtQWWydmsOfbB6b_l79wCLcBGAsYHQ/s16000/bayesucompressed.png" /></a></div></div><p></p>Two questions to think about, one simple and on-topic, the other open-ended and off-topic: <ul><li>What is the simple function giving, within a constant, the posterior distribution after <script type="math/tex">n</script> heads and 0 tails? What about for <script type="math/tex">n</script> tails and 0 heads?</li><li>Doesn't the coin-throwing diagram look like Pascal's triangle? What's the connection between normal distributions, Pascal's triangle, and the central limit theorem (i.e., that the sum of enough independent copies of any random variable is distributed roughly normally)? What extensions of Pascal's triangle can you think of, possibly with probabilistic interpretations?</li> </ul><h3>Frequentism</h3><p>Frequentists try to banish subjectivity from probability. The probability of event <script type="math/tex">E</script> is not a statement about subjective belief, but an empirical fact: given <script type="math/tex">n</script> trials, what is the fraction of times that <script type="math/tex">E</script> comes up, in the limit as <script type="math/tex">n \rightarrow \infty</script>? 
And ditch the Bayesian idea of doing nothing but shifting around the probability mass we assign to different beliefs; once you've done a statistical test, you either reject or fail to reject the null hypothesis.</p><p>A standard frequentist tool is hypothesis testing with a <script type="math/tex">p</script>-value. The procedure looks like this:</p><ol start=""><li>Pick a null hypothesis (usually denoted <script type="math/tex">H_0</script>). (For example, <script type="math/tex">H_0</script> could be that a coin is fair; that is, that the probability <script type="math/tex">h</script> of it coming up heads is 0.5.)</li><li>Pick a test statistic: a function <script type="math/tex">t</script> from the dataset <script type="math/tex">x</script> to a number. (For example, the maximum likelihood estimator for <script type="math/tex">h</script>, using the fact that we expect the number of heads to follow a binomial distribution with parameters for the number of throws and the probability <script type="math/tex">h</script>.)</li><li>Figure out a model for, or a way to sample from, the distribution of possible datasets given that <script type="math/tex">H_0</script> is true. (For example, we might write code to generate synthetic datasets <script type="math/tex">X^*</script> of the same size as <script type="math/tex">x</script> based on <script type="math/tex">h=0.5</script>.)</li><li>Find the probability of the test statistic <script type="math/tex">t</script> returning a result that is as extreme or more extreme than <script type="math/tex">t(x)</script>. We might do this using fancy maths that gives us cumulative distribution functions based on the model from the previous step, or by having our code generate many synthetic datasets <script type="math/tex">X^*</script>, calculate <script type="math/tex">t(X^*)</script> for each of them, and seeing how <script type="math/tex">t(x)</script> compares – what percentile of extremeness is it in? 
The answer is called the <script type="math/tex">p</script>-value.</li> </ol><p>(What is "more extreme"? That depends on our null hypothesis. If both low and high values of <script type="math/tex">t(x)</script> are evidence against <script type="math/tex">H_0</script> – as in our example – then we use a two-tailed test; if <script type="math/tex">t(x)</script> is in the 90th percentile of the <script type="math/tex">t(X^*)</script> distribution, values in both the top and bottom 10% are at least as extreme as the value we got, and <script type="math/tex">p=0.2</script>. If only low or high values are evidence against <script type="math/tex">H_0</script>, then we use a one-tailed test. Say only high values are evidence against <script type="math/tex">H_0</script> and <script type="math/tex">t(x)</script> is in the 90th percentile; then <script type="math/tex">p=0.1</script>.)</p><p>Here's some example code to calculate a <script type="math/tex">p</script>-value, using random simulation:</p><pre><code class="language-python" lang="python"># Import NumPy and graphing library:<br />import numpy as np<br />import matplotlib.pyplot as plt<br /><br /># Define our null hypothesis:<br />h0_h = 0.5 # the value of h under the null hypothesis<br /><br /># Define the data we've gotten:<br />throws = 50<br />heads = 20<br /># Generate an array for it:<br />data = np.concatenate((np.zeros(throws - heads), np.ones(heads)), axis = 0)<br /><br />def t(x): # test statistic function<br /> return np.mean(x)<br /> # ^ this is the MLE for h under the binomial model<br /><br />def synth_x(n, p):<br /> # Create a synthetic dataset of some size n, assuming some p<br /> return np.random.binomial(1, p, size=n)<br /><br /># Take a lot of samples from the distribution of t(X*)<br /># (where X* is a synthetic dataset):<br />t_sample = np.array([t(synth_x(throws, h0_h)) for _ in range(100000)])<br /><br /># Calculate the p-value, using a two-tailed test:<br />p1 = 
np.mean(t_sample >= t(data))<br />p2 = np.mean(t_sample <= t(data))<br />p = 2 * min(p1, p2)<br /><br /># Display p-value<br />print(f"p-value is {p}") # about 0.20 in this case<br /><br /># Plot a histogram:<br />plt.hist(t_sample, bins=50, range=[0,1])<br />plt.axvline(x=t(data), color='black') # draw a line to show where t(data) falls<br /></code> </pre><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-NjudyfjYfjM/YArArM_CD3I/AAAAAAAACVM/BC5sMDJx5YQ2pFIgR-CBQCL78sUx7iIhACLcBGAsYHQ/pval.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="508" data-original-width="800" height="406" src="https://lh3.googleusercontent.com/-NjudyfjYfjM/YArArM_CD3I/AAAAAAAACVM/BC5sMDJx5YQ2pFIgR-CBQCL78sUx7iIhACLcBGAsYHQ/w640-h406/pval.png" width="640" /></a></div><br /><p></p>The main tricky part in the code is the calculation of the <script type="math/tex">p</script>-value. A neat way to do it is the following: a two-tailed <script type="math/tex">p</script>-value is twice the smaller of two fractions – the fraction of synthetic datasets with a test statistic at most <script type="math/tex">t(x)</script> (the relevant one when the observation ended up on the lower side of the distribution of synthetic datasets), and the fraction with a test statistic at least <script type="math/tex">t(x)</script>. <p>Now, what exactly is a <script type="math/tex">p</script>-value? It's tempting to think of the <script type="math/tex">p</script>-value as the probability that the null hypothesis is correct: that is, that <script type="math/tex">p=0.05</script> means there's only a 5% chance the null hypothesis is true. However, what a <script type="math/tex">p</script>-value actually tells you is this: assuming that your null hypothesis is true (and you can correctly model the distribution of data you'd get if it is), what is the probability of getting a result at least as extreme as your data? 
In maths: </p><div cid="n434" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n434" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-3" type="math/tex; mode=display">p\text{-value} \ne P(H_0 \text{ is correct}), (!!)</script></div></div><p>but instead</p><div cid="n436" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n436" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-4" type="math/tex; mode=display">p\text{-value} = P(t(X^*) \geq t(x)),</script></div></div><p>for a right-tailed test (flip the <script type="math/tex">\geq</script> for a left-tailed test), where <script type="math/tex">X^*</script> is assumed drawn from the distribution resulting from assuming the null hypothesis <script type="math/tex">H_0</script>, or</p><div cid="n438" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n438" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-5" type="math/tex; mode=display">P(|t'(X^*)| \geq |t'(x)|),</script></div></div><p>for a two-tailed test, where <script type="math/tex">t'</script> is the test statistic function, but shifted so that the median <script type="math/tex">H_0</script> value is 0, so that we can just take absolute value to get an extremeness measure (for example, in the code above we'd subtract 0.5 from the current definition of <code>t(x)</code>, since this is the median for the null hypothesis that the probability of heads is 
one-half).</p><h2>Probability bounds</h2><p>Sometimes it's useful to be able to quickly estimate a bound on some probability or expectation. Here are some examples, with quick proofs.</p><h4>Markov's inequality</h4><p>For <script type="math/tex">a > 0</script>, if <script type="math/tex">X</script> is a random variable taking non-negative values,</p><div cid="n444" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n444" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-212" type="math/tex; mode=display">P(X \geq a) \leq \frac{E(X)}{a}.</script> </div></div><p>Why?</p><p><b>Short proof</b>: Given <script type="math/tex">X \geq 0</script>, <script type="math/tex">X \geq 1_{X \geq a} \cdot a</script> (can be seen by considering cases <script type="math/tex">X < a</script>, <script type="math/tex">X=a</script>, and <script type="math/tex">X > a</script>), so, rearranging, <script type="math/tex">1_{X \geq a} \leq X/ a</script>. Taking the expectation on both sides we get <script type="math/tex">E(1_{X \geq a}) \leq E(X) / a</script>, and <script type="math/tex">E(1_{X \geq a}) = P(X \geq a)</script>. <script type="math/tex">\square</script></p><p><b>Intuitive proof</b>: let's say you want to draw a probability density function to maximise <script type="math/tex">P(X \geq a)</script>, given some value of the expectation <script type="math/tex">E(X)</script> (and given that <script type="math/tex">X</script> only takes non-negative values). Any probability density assigned to values greater than <script type="math/tex">a</script> is more expensive in terms of expectation increase than assigning value exactly at <script type="math/tex">a</script>, and has an identical effect on <script type="math/tex">P(X \geq a)</script>. 
So to maximise <script type="math/tex">P(X \geq a)</script>, assign as much probability density as you can to <script type="math/tex">a</script>, and none to values greater than <script type="math/tex">a</script>. Given the restriction that <script type="math/tex">X</script> can only take positive values, the lowest value you can assign any probability to (to balance out the expectation if <script type="math/tex">a > E(X)</script>) is 0. If we allocate <script type="math/tex">p_1</script> to <script type="math/tex">X=0</script> and <script type="math/tex">p_2</script> to <script type="math/tex">X=a</script>, then to match the expectation <script type="math/tex">E(X)</script> we must have</p><div cid="n447" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n447" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-7" type="math/tex; mode=display">p_1 \cdot 0 + p_2 \cdot a = E(X),</script></div></div><p>or</p><div cid="n449" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n449" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-8" type="math/tex; mode=display">p_2 = P(X\geq a) = \frac{E(X)}{a}</script></div></div><p>in the maximal scenario; any other pdf we draw must have <script type="math/tex">P(X \geq a)</script> smaller.</p><p>The above equation can also be interpreted as saying that the fraction of values greater than <script type="math/tex">k=a/E(X)</script> times the average in a dataset of positive values can be at most <script type="math/tex">1/k</script> (i.e. <script type="math/tex">E(X)/a</script>). 
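</p>

<p>A quick numerical sanity check of the bound (a sketch, assuming an exponential distribution with mean 1, which takes only non-negative values as Markov's inequality requires):</p>

```python
import numpy as np

np.random.seed(0)

# X ~ Exponential(mean 1), so E(X) = 1 and X >= 0.
x = np.random.exponential(scale=1.0, size=1_000_000)

a = 3.0
empirical = np.mean(x >= a)      # estimate of P(X >= a); exact value is e^-3
markov_bound = np.mean(x) / a    # estimate of E(X) / a, roughly 1/3

# The bound holds, though here it is far from tight:
assert empirical <= markov_bound
```

<p>(For the exponential the true <script type="math/tex">P(X \geq a)</script> is <script type="math/tex">e^{-a}</script>, which decays much faster than the <script type="math/tex">1/a</script> of the Markov bound – Markov gives a worst-case guarantee, not a good approximation.)</p><p>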
For example, at most half of people can have at least twice the average income.</p><h4>Chebyshev's inequality</h4><p>(An extension of Markov's inequality.)</p><p>Let <script type="math/tex">X</script> be a random variable with variance <script type="math/tex">\sigma^2</script> and expected value <script type="math/tex">\mu</script>. Then</p><div cid="n455" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n455" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-9" type="math/tex; mode=display">P(|X-\mu| \geq x) \leq \frac{\sigma^2}{x^2},</script></div></div><p>since if <script type="math/tex">Y = (X-\mu)^2</script> then, by Markov's inequality,</p><div cid="n457" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n457" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-10" type="math/tex; mode=display">P(Y \geq x^2) \leq \frac{\mathbb{E}(Y)}{x^2} = \frac{\sigma^2}{x^2},</script></div></div><p>by the definition of variance as <script type="math/tex">\mathbb{E}((X - \mu)^2)</script>. Finally, taking the square root inside the probability expression, <script type="math/tex">P(Y \geq x^2)=P(|X-\mu| \geq x)</script>. 
<script type="math/tex">\square</script></p><h4>Jensen's inequality</h4><p>Consider a concave function <script type="math/tex">f</script> and the values <script type="math/tex">E(f(X))</script> and <script type="math/tex">f(E(X))</script>, where <script type="math/tex">X</script> is (once again) a random variable.</p><p>Since <script type="math/tex">f</script> is concave, if we plot <script type="math/tex">y=f(x)</script> and the tangent line to <script type="math/tex">f</script> at some <script type="math/tex">x_0</script>, the tangent is an upper bound on <script type="math/tex">f(x)</script> for all <script type="math/tex">x</script>.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-hnijlFUfZQk/YArAtTu4jTI/AAAAAAAACVQ/UdB1Y570UFcJNEUmO8cvPvDhENQaXwP3ACLcBGAsYHQ/jensen.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="834" data-original-width="1000" height="333" src="https://lh3.googleusercontent.com/-hnijlFUfZQk/YArAtTu4jTI/AAAAAAAACVQ/UdB1Y570UFcJNEUmO8cvPvDhENQaXwP3ACLcBGAsYHQ/w400-h333/jensen.png" width="400" /></a></div><br /><p></p><p>Let <script type="math/tex">E(X) = \mu</script>, and let the tangent line to <script type="math/tex">y=f(x)</script> at <script type="math/tex">x=\mu</script> be <script type="math/tex">y=mx+b</script>. We have that <script type="math/tex">f(x) \leq mx+b</script> for all <script type="math/tex">x</script>, and in particular <script type="math/tex">f(X) \leq mX+b</script>. Taking the expectation on both sides,</p><div cid="n464" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n464" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-11" type="math/tex; mode=display">E(f(X)) \leq m \mu + b.</script></div></div><p>What is <script type="math/tex">m\mu +b</script>? 
It's the value of the tangent when it touches <script type="math/tex">f(x)</script> at <script type="math/tex">x=\mu</script>, and therefore it is also the value of <script type="math/tex">f</script> at <script type="math/tex">\mu</script>. Thus we can say</p><div cid="n466" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n466" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-12" type="math/tex; mode=display">E(f(X)) \leq f(E(X)). \square</script></div></div><p> </p><h2>Probability systems</h2><h3>Causal diagrams</h3><p>The <a href="https://en.wikipedia.org/wiki/Perseverance_(rover)"><i>Perseverance</i></a> rover is due to land on Mars on February 18th, 2021, carrying a small helicopter called <a href="https://en.wikipedia.org/wiki/Mars_Helicopter_Ingenuity"><i>Ingenuity</i></a>, which will likely become the first aircraft to make a powered flight on a planet that's not Earth.</p><p>Imagine that <i>Perseverance</i> is currently known to be in a position <script type="math/tex">X</script> (where <script type="math/tex">X</script> is some random variable, as is any capital letter). <i>Ingenuity</i> has completed its first flight, starting from the location of <i>Perseverance</i> (which we know to a high degree of accuracy), but because of a Martian sandstorm we only have inaccurate readings of <i>Ingenuity</i>'s current location and need to locate it quickly to know if it's in a place where it's going to run out of power due to dust blocking its solar panels unless we do a risky manoeuvre with its propellers. 
Specifically, we have two in-flight readouts of its position, <script type="math/tex">R_1</script> and <script type="math/tex">R_2</script>, which are its true positions <script type="math/tex">Y_1</script> and <script type="math/tex">Y_2</script> at those times plus some random error modelled as a <script type="math/tex">\text{Normal}(0,\sigma_1^2)</script> distribution, and, similarly, a more accurate readout <script type="math/tex">R_f</script> of its final position <script type="math/tex">Y_f</script>, this time with the error following <script type="math/tex">\text{Normal}(0, \sigma_2^2)</script>. We also model <script type="math/tex">Y_1</script> as being generated from <script type="math/tex">X</script> with a parameter <script type="math/tex">h_1</script> representing its starting heading and velocity (e.g. <script type="math/tex">h_1</script> is a vector and the model could be <script type="math/tex">Y_1 = X + h_1 + \epsilon</script>, where <script type="math/tex">\epsilon</script> is another normally distributed error term), and likewise we have parameters <script type="math/tex">h_2</script> and <script type="math/tex">h_f</script> that influence how <script type="math/tex">Y_2</script> and <script type="math/tex">Y_f</script> are generated from the preceding positions. We know that its initial battery level was <script type="math/tex">b_0</script>, and the battery level when it was at each of <script type="math/tex">Y_1</script>, <script type="math/tex">Y_2</script>, and <script type="math/tex">Y_f</script> is <script type="math/tex">B_1</script>, <script type="math/tex">B_2</script>, and <script type="math/tex">B_f</script>, where each of those is generated from the previous and the heading/velocity parameters <script type="math/tex">h_1</script>, <script type="math/tex">h_2</script>, and <script type="math/tex">h_f</script> (e.g. 
<script type="math/tex">B_2 = B_1 - (1 + \epsilon) |h_1|</script> – the amount of power lost is the speed <script type="math/tex">|h_1|</script> scaled by 1 plus a normal error term). We need to find the probability that the next battery level <script type="math/tex">B_n</script>, a random variable generated from <script type="math/tex">B_f</script> (the previous level) and depending on <script type="math/tex">Y_f</script> (since storm intensity varies with position; say we have a function <script type="math/tex">s</script> that takes in positions and returns how much the dust will decrease power output and hence battery level at a particular position, then we might have <script type="math/tex">B_n = B_f - s(Y_f)</script>), is below a critical threshold <script type="math/tex">c</script>, given the starting <script type="math/tex">X</script>, and the position readings <script type="math/tex">R_1</script>, <script type="math/tex">R_2</script>, and <script type="math/tex">R_f</script>. Also the administrator of NASA is breathing down your neck because this is a 2 billion dollar mission, so you'd better work fast and not make mistakes.</p><p>This problem seems almost intractably complicated. A handy way of making complex probability questions less unapproachable is to draw out a causal diagram: what are the key parameters, and which random variables are generated from which other ones? 
Here's an example for the above problem:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-S4_pBsH_GnA/YArAwGsYO4I/AAAAAAAACVU/fffClocsKQM77RJleFmwdd7UQuacYaqxgCLcBGAsYHQ/causaldiagram.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="874" data-original-width="1280" height="438" src="https://lh3.googleusercontent.com/-S4_pBsH_GnA/YArAwGsYO4I/AAAAAAAACVU/fffClocsKQM77RJleFmwdd7UQuacYaqxgCLcBGAsYHQ/w640-h438/causaldiagram.png" width="640" /></a></div><br /><br /><p></p><p>Arrows indicate random variables being generated from others; dotted lines note important parameters (note that some parameters are missing – those of <script type="math/tex">X</script>, for example). The probability we were asked about is <script type="math/tex">P(B_n < c | X = x, R_1 = r_1, R_2 = r_2, R_f = r_f)</script>; it doesn't look so complicated when you have the causal relations visualised in front of you.</p><p>The rest of the solution is left as an exercise for the reader. 
Please be in touch with NASA in late February to get the values <script type="math/tex">x</script>, <script type="math/tex">r_1</script>, <script type="math/tex">r_2</script>, and <script type="math/tex">r_f</script>.</p><h3>Markov chains</h3><p>A Markov chain has the following causal diagram:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-HjDYzntkpaE/YArAxfn2HMI/AAAAAAAACVc/Md-EoAChqWsT5VHijY88E2xyceBy_mlkwCLcBGAsYHQ/markov.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="210" data-original-width="1000" height="84" src="https://lh3.googleusercontent.com/-HjDYzntkpaE/YArAxfn2HMI/AAAAAAAACVc/Md-EoAChqWsT5VHijY88E2xyceBy_mlkwCLcBGAsYHQ/w400-h84/markov.png" width="400" /></a></div><br /><p></p>In words: the <script type="math/tex">n</script>th state of a Markov chain is generated from the <script type="math/tex">(n-1)</script>th state. <p>This might seem very restrictive. For example, the simplest text-generation Markov chain would just generate, say, one character based on the previous one, probably based on data for how often a letter follows another. It might tend to do some moderately reasonable things, like following "t" by "h" fairly often (assuming it was trained on English), but good luck getting anything too sensible out of it.</p><p>However, we can do a trick: generate letter <script type="math/tex">n</script> from the previous <script type="math/tex">k</script> letters. This seems like it's not a Markov chain; letter <script type="math/tex">X_n</script> depends on <script type="math/tex">X_{n-k}</script> through <script type="math/tex">X_{n-1}</script>. 
But we can define <script type="math/tex">Y_0=(X_0, X_1, ..., X_{k-1})</script>, <script type="math/tex">Y_1 = (X_1, X_2, ..., X_k)</script>, and so on, and now <script type="math/tex">Y_n</script> can be generated entirely from <script type="math/tex">Y_{n-1}</script>, and so the <script type="math/tex">Y</script>s form a Markov chain.</p><p>So on the one hand, we can do these sorts of tricks to use Markov chains even when it seems like the problem is too complex for them. But perhaps even more importantly, if you reduce something to a Markov chain, you can immediately apply a lot of nice mathematical results.</p><p>A Markov chain can be visualised with a state diagram. Here's one for a Markov chain representing traffic light transitions:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-I2pMAfptnW0/YArA0CBA-wI/AAAAAAAACVg/3BFSDkq4w0EkvZWm5KWYH2kyM4kSIvWbwCLcBGAsYHQ/trafficlights1.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1098" data-original-width="1000" height="400" src="https://lh3.googleusercontent.com/-I2pMAfptnW0/YArA0CBA-wI/AAAAAAAACVg/3BFSDkq4w0EkvZWm5KWYH2kyM4kSIvWbwCLcBGAsYHQ/w365-h400/trafficlights1.png" width="365" /></a></div><br /><p></p><p>The same information can be described with a transition matrix, showing the probability of each transition happening:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-wjCwsRjL2rY/YArA1n3ionI/AAAAAAAACVk/Sf1PBxgqBJAJ_EGnGqxB4rf2XBbdrp8rgCLcBGAsYHQ/trafficmatrix1.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="830" data-original-width="960" height="345" src="https://lh3.googleusercontent.com/-wjCwsRjL2rY/YArA1n3ionI/AAAAAAAACVk/Sf1PBxgqBJAJ_EGnGqxB4rf2XBbdrp8rgCLcBGAsYHQ/w400-h345/trafficmatrix1.png" width="400" /></a></div><br /><p></p><p>Note that this is a very boring Markov chain, because it's not probabilistic – every 
link has probability 1. Thankfully, our traffic light engineer is willing to add some randomness for the sake of making the system more mathematically interesting. For example, they might change the system to look like this (showing both the state diagram and transition matrix):</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-i1AKagcYfcI/YArA3XFfxyI/AAAAAAAACVo/e3ypWZKEntgt1VMi_geYYtdz-7KidI4qQCLcBGAsYHQ/traffic2.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="690" data-original-width="1278" height="346" src="https://lh3.googleusercontent.com/-i1AKagcYfcI/YArA3XFfxyI/AAAAAAAACVo/e3ypWZKEntgt1VMi_geYYtdz-7KidI4qQCLcBGAsYHQ/w640-h346/traffic2.png" width="640" /></a></div><p></p><p>Now there's a 10% chance that the yellow light before red is skipped, and a 40% chance that red-yellow moves back to red instead of going green.</p><p>The key property with Markov chain calculations is memorylessness: <script type="math/tex">X_n</script> depends only on <script type="math/tex">X_{n-1}</script>. If you can use this property, you can work out a lot of Markov chain problems. For example, let's say that <script type="math/tex">X_0 = \text{R}</script> (we'll use <script type="math/tex">\text{R, RY, G, Y}</script> to denote the states), and we want to find the probability that you'll actually get to drive in two state transitions from now – that is, <script type="math/tex">\mathbb{P}(X_2 = \text{G} \, | \, X_0 = \text{R})</script> (I use <script type="math/tex">\mathbb{P}</script> here to differentiate a probability expression from the transition matrix <script type="math/tex">P</script>). 
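</p><p>Before doing any algebra, we can sanity-check such a probability by simulation. Here's a minimal sketch; the state order and the exact matrix entries are my reading of the traffic-light diagram above (using the 10% and 40% probabilities just described):</p>

```python
import numpy as np

# Transition matrix for the probabilistic traffic lights, with state
# order R, RY, G, Y. The entries are read off the diagram above:
# RY falls back to R with probability 0.4, and G skips Y (the yellow
# before red) with probability 0.1.
states = ["R", "RY", "G", "Y"]
P = np.array([
    [0.0, 1.0, 0.0, 0.0],   # R  -> RY
    [0.4, 0.0, 0.6, 0.0],   # RY -> R or G
    [0.1, 0.0, 0.0, 0.9],   # G  -> R (yellow skipped) or Y
    [1.0, 0.0, 0.0, 0.0],   # Y  -> R
])

rng = np.random.default_rng(0)

def step_chain(start, steps):
    """Run the chain forward `steps` transitions from state `start`."""
    state = states.index(start)
    for _ in range(steps):
        state = rng.choice(len(states), p=P[state])
    return states[state]

# Estimate P(X_2 = G | X_0 = R) by sampling.
samples = [step_chain("R", 2) for _ in range(10_000)]
estimate = sum(s == "G" for s in samples) / len(samples)
print(estimate)
```

<p>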
Doing some straightforward algebra, you can figure out that this probability is <script type="math/tex">P_{\text{R},\text{RY}} \cdot P_{\text{RY},\text{G}}</script> (where <script type="math/tex">P_{a,b}</script> is the entry of the matrix with row label (i.e. start state) <script type="math/tex">a</script> and column label (i.e. end state) <script type="math/tex">b</script>).</p><p>(Note that each row of the transition matrix is a probability distribution for the next state, starting from the state the row is labelled with. Writing it as a matrix is a trick for expressing the probability distribution from each state in the same mathematical object.)</p><p>More generally: for any transition matrix, <script type="math/tex">P_{a,b}</script> is <script type="math/tex">\mathbb{P}(X_n = b \, | X_{n-1} = a)</script>. Now consider entry <script type="math/tex">a,b</script> of <script type="math/tex">P^2</script>: by matrix multiplication, it is</p><div cid="n504" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n504" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-13" type="math/tex; mode=display">\sum_i P_{a,i}P_{i,b},</script></div></div><p> but by the definition of the transition matrix, this is the same as</p><div cid="n509" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n509" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-14" type="math/tex; mode=display">\sum_i \mathbb{P}(X_{1} = i \,|\, X_{0} = a) \mathbb{P}(X_{2} = b \,|\, X_{1} = i),</script></div></div><p>which is just summing up the probabilities of all paths through the state space that start at <script 
type="math/tex">a</script>, go to some <script type="math/tex">i</script>, and then end up at <script type="math/tex">b</script>; in other words, it is the probability that if you're at <script type="math/tex">a</script>, you end up at <script type="math/tex">b</script> after two state transitions.</p><p>You should be able to see that this extends more generally:</p><div cid="n516" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n516" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-15" type="math/tex; mode=display">\mathbb{P}(X_n = b \,|\,X_0 = a) = P^n_{a,b}.</script></div></div><p>Linear algebra comes to the rescue yet again; we've reduced the problem of finding the probability of going between any two states in a Markov chain's state space in <script type="math/tex">n</script> steps into the problem of raising a matrix to the <script type="math/tex">n</script>th power and looking up one entry in it.</p><h4>Finding the stationary distribution</h4><p>Given a starting state in a Markov chain, we can't say for sure what state it will be in after <script type="math/tex">n</script> transitions (unless it's entirely deterministic, like our initial boring traffic light model), but we can calculate exactly what the probability distribution over the states will be. 
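</p><p>Computing that distribution is just a matrix power in NumPy. A quick sketch, with the matrix entries being my reading of the traffic-light diagram above (state order R, RY, G, Y):</p>

```python
import numpy as np

# Traffic-light transition matrix, state order R, RY, G, Y
# (entries read off the diagram earlier in the post).
P = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.4, 0.0, 0.6, 0.0],
    [0.1, 0.0, 0.0, 0.9],
    [1.0, 0.0, 0.0, 0.0],
])

# P(X_n = b | X_0 = a) is entry (a, b) of P^n.
P2 = np.linalg.matrix_power(P, 2)
print(P2[0, 2])  # P(X_2 = G | X_0 = R) = 0.6

# Row a of P^n is the whole distribution over states n steps after a.
print(np.linalg.matrix_power(P, 5)[0])
```

<p>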
This is usually denoted as a vector <script type="math/tex">\pi</script>, with <script type="math/tex">\pi_a</script> being the probability we're in state <script type="math/tex">a</script>.</p><p>Here's something we might want to know: what is the stationary distribution; that is, how can we allocate probability mass amongst the different states in such a way that the total amount of probability mass in each state remains constant after a state transition?</p><p>Here's something you might ask: why is it interesting to know this? Perhaps most importantly, the stationary distribution of a Markov chain is the long-run fraction of time spent in each state (exercise: prove that this is the case); if you want to know how much time our probabilistic traffic lights will spend being green over a long period of time, you need to find the stationary distribution.</p><p>Now given our distribution <script type="math/tex">\pi</script> (note: it's a row vector, not a column vector) and transition matrix <script type="math/tex">P</script>, we can express the stationary distribution as the <script type="math/tex">\pi</script> that satisfies two conditions. First,</p><div cid="n539" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n539" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-16" type="math/tex; mode=display">\pi = \pi P.</script></div></div><p>This is the condition that <script type="math/tex">\pi</script> must remain unchanged when transformed by our transition matrix <script type="math/tex">P</script> during a state transition. You might have expected the transformation to be written <script type="math/tex">P \pi</script>; usually we'd express a matrix transforming a vector in this order. 
However, because of the way we've defined <script type="math/tex">P</script> – start states on the vertical axis, end states on the horizontal – we need to do it this way. Here's a visualisation, with the result vector in red:</p><p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-B-R8mo5PwZU/YArA7o-HT0I/AAAAAAAACVw/wG7W-En9uys9G1JnvYJh4qhtRwo2xHMkwCLcBGAsYHQ/mmult.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="608" data-original-width="958" height="254" src="https://lh3.googleusercontent.com/-B-R8mo5PwZU/YArA7o-HT0I/AAAAAAAACVw/wG7W-En9uys9G1JnvYJh4qhtRwo2xHMkwCLcBGAsYHQ/w400-h254/mmult.png" width="400" /></a></div></div><p></p><p>(Alternatively, we could take <script type="math/tex">\pi</script> as a column vector, flip the meanings of the rows and columns in <script type="math/tex">P</script>, and write <script type="math/tex">P\pi</script> – equivalent to transposing both of the current definitions of <script type="math/tex">\pi</script> and <script type="math/tex">P</script>.)</p><p>The second condition (can you see why it's necessary?), where <script type="math/tex">\pmb{1}</script> is a vector <script type="math/tex">(1,1,...,1,1)</script> of the required length, is</p><div cid="n556" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n556" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-17" type="math/tex; mode=display">\pi \cdot \pmb{1} = 1.</script></div></div><p>We can also write this as matrix multiplication, as long as we're clear about column and row vectors and transposing things as required. 
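</p><p>Concretely, one way to find the stationary distribution in NumPy is to stack the two conditions into a single least-squares problem. A sketch, with the matrix entries again being my reading of the traffic-light diagram (state order R, RY, G, Y):</p>

```python
import numpy as np

# Traffic-light transition matrix, state order R, RY, G, Y
# (entries read off the diagram earlier in the post).
P = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.4, 0.0, 0.6, 0.0],
    [0.1, 0.0, 0.0, 0.9],
    [1.0, 0.0, 0.0, 0.0],
])
n = P.shape[0]

# pi P = pi  <=>  (P^T - I) pi^T = 0; stack the normalisation
# condition 1 . pi = 1 underneath and solve by least squares.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)                       # long-run fraction of time in each state
print(np.allclose(pi @ P, pi))  # check stationarity
```

<p>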
We can also be clever and write a single matrix that expresses both of these constraints, and then getting NumPy's linear algebra libraries to give us the answer becomes a single line of code.</p><p>(The second constraint is just the condition that any probability distribution sums to 1.) </p><h5>Uniqueness of the stationary distribution</h5><p>Now for another question: when does a unique stationary distribution exist? You should be able to think of a state diagram for which there are an infinite number of stationary distributions.</p><p>For example:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-JSL7Ybk7Xoo/YArA91YfljI/AAAAAAAACV0/XZEBmDiuS5snRKG1QccdYPa7wSvi1d1gwCLcBGAsYHQ/stationary.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="784" data-original-width="1280" height="392" src="https://lh3.googleusercontent.com/-JSL7Ybk7Xoo/YArA91YfljI/AAAAAAAACV0/XZEBmDiuS5snRKG1QccdYPa7wSvi1d1gwCLcBGAsYHQ/w640-h392/stationary.png" width="640" /></a></div><p></p><p>The states <script type="math/tex">C</script>, <script type="math/tex">B</script>, and <script type="math/tex">D</script> (in the dotted red circle) and <script type="math/tex">E</script>, <script type="math/tex">F</script>, <script type="math/tex">G</script>, and <script type="math/tex">H</script> (in the dotted blue circle) are "independent", in the sense that you can never get from one set of states to the other. Imagine that for the state set <script type="math/tex">\{C, B, D\}</script>, we have a stationary distribution over only those states <script type="math/tex">\pmb{\pi}</script>, and another stationary distribution <script type="math/tex">\pmb{\rho}</script> over <script type="math/tex">\{E,F,G,H\}</script>. 
(Let each of these vectors have a slot for every state, but let it be zero for states outside the corresponding state set – <script type="math/tex">\pmb{\pi} = (0, \pi_b, \pi_c, \pi_d, 0, 0, 0, 0)</script>, for example.) Now, because there can be no probability mass flow between these two sets, we can see that any distribution <script type="math/tex">\pmb{\sigma} = a \pmb{\pi} + b \pmb{\rho}</script> is also a stationary distribution, provided that <script type="math/tex">a</script> and <script type="math/tex">b</script> are chosen such that <script type="math/tex">\pmb{\sigma} \cdot \pmb{1} = 1</script> (probability distributions sum to one!).</p><p>It turns out that for any state set where each state is theoretically reachable from all the others – i.e., if we represent the state diagram as a directed graph, the graph is strongly connected – there does exist a unique stationary distribution.</p><h5>Detailed balance</h5><p>Sometimes it doesn't take matrix calculations to find a stationary distribution. In the general case, the condition is that the probability mass flow into a state, from all other states, must equal the outflow to all other states. The simplest way this can happen is when, for any pair of states <script type="math/tex">a</script> and <script type="math/tex">b</script>, <script type="math/tex">a</script> sends as much probability mass to <script type="math/tex">b</script> upon a state transition as <script type="math/tex">b</script> sends to <script type="math/tex">a</script>. If we can ensure that this is true "locally" for each pair of states, then we don't have to do complex "global" optimisation over all states.</p><p>This condition is known as detailed balance. 
Mathematically, letting <script type="math/tex">\pi</script> be a distribution of probability mass over states and <script type="math/tex">P</script> be the transition matrix, we can express it as</p><div cid="n607" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n607" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-902" type="math/tex; mode=display">\pi_a P_{ab} = \pi_b P_{ba}, \text{ for all states } a \text{ and } b,</script> </div></div><p>something that should be clear if you remember the interpretation of the transition matrix element <script type="math/tex">P_{ab}</script> as the probability of an <script type="math/tex">a \rightarrow b</script> transition.</p><p>A final fun question: say we have an undirected graph and we consider a random walk over it (i.e., if we're at a given vertex, we take any edge going from it with equal probability). What is the stationary distribution over the states (i.e. 
the vertices of the graph)?</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-58263582811746502072020-12-31T14:43:00.019+00:002021-02-19T22:08:20.635+00:00Data science 1<center><p><span style="font-size: x-small;"><i>8.3k words, including equations (about 40 minutes)</i></span></p></center><p>This is an overview of fundamental ideas in data science, mostly based on <a href="https://www.cl.cam.ac.uk/teaching/2021/DataSci/materials.html">Damon Wischik's excellent data science course at Cambridge</a> (if using these notes for revision for that course, be aware that I don't cover all examinable things and cover some things that aren't examinable; the criterion for inclusion is interestingness, not examinability).</p><p>The basic question is this: we're given data; what can we say about the world based on it?</p><p>These notes are split into two parts due to length. In part 1:</p><ul><li><p>Notation</p></li><li><p>A few results in probability, including a look at Bayes' theorem leading up to an understanding of the continuous form.</p></li><li><p>Model-fitting</p><ul><li>Maximum likelihood estimation</li><li>Supervised & unsupervised learning</li><li>Linear models (fitting them and interpreting them)</li><li>Empirical distributions (with a note on KL divergence)</li> </ul></li> </ul><p>In <a href="http://strataoftheworld.blogspot.com/2021/01/data-science-2.html">part 2</a>:</p><ul><li>Monte Carlo methods</li><li>A few theorems that let you bound probabilities or expectations.</li><li>Bayesianism & frequentism</li><li>Probability systems (specifically basic results about Markov chains).</li> </ul><p> </p><h2>Probability basics</h2><p>The kind of background you want to have to understand this material:</p><ul><li><p>The basic maths of probability: reasoning about sample spaces, probabilities summing to one, understanding and working with random variables, etc.</p></li><li><p>The ideas of expected value and variance.</p></li><li><p>Some idea of 
the most common probability distributions:</p><ul><li>normal/Gaussian,</li><li>binomial,</li><li>Poisson,</li><li>geometric,</li><li>etc.</li> </ul></li><li><p>What continuous and discrete distributions are.</p></li><li><p>Understanding probability density/mass functions, and cumulative distribution functions.</p></li> </ul><h3>Notation</h3><p>First, a few minor points:</p><ul><li><p>It's easy to interpret <script type="math/tex">Y = f(X)</script>, where <script type="math/tex">X</script> and <script type="math/tex">Y</script> are random variables, to mean "generate a value of <script type="math/tex">X</script>, then apply <script type="math/tex">f</script> to it, and this is <script type="math/tex">Y</script>". But <script type="math/tex">Y=f(X)</script> is maths, not code; we're stating something is true, not saying how the values are generated. If <script type="math/tex">f</script> is an invertible function, then <script type="math/tex">Y=f(X)</script> and <script type="math/tex">X=f^{-1}(Y)</script> are both equally good and equally true mathematical statements, and neither of them tells you what causes what.</p></li><li><p>Indicator functions are a useful trick when bounds are unknown; for example, write <script type="math/tex">1_{x \geq y}</script> (or <script type="math/tex">1[x\geq y]</script>) to denote 1 if <script type="math/tex">x \geq y</script> and 0 in all other cases.</p><ul><li>They also let you express logical AND as multiplication: <script type="math/tex">1_{f(x)} \cdot 1_{g(x)}</script>, where <script type="math/tex">f</script> and <script type="math/tex">g</script> are boolean functions, is the same as <script type="math/tex">1_{f(x) \wedge g(x)}</script>.</li> </ul></li> </ul><h4>Likelihood notation</h4><p>Discrete and continuous random variables are fundamentally different. 
In the discrete case, you deal with probability mass functions where there's a probability attached to each event; with the continuous case, you only get a probability density function that doesn't mean anything real and needs to be integrated to give you a probability. Many results apply to both discrete and continuous random variables though, and we might switch between continuous and discrete models in the same problem, so it's cumbersome to have to deal with the separate notation and semantics of them.</p><p>Enter likelihood notation: write <script type="math/tex">\Pr_X(x)</script> to mean <script type="math/tex">P(X=x)</script> if the distribution is discrete and <script type="math/tex">f(x)</script> if the distribution of <script type="math/tex">X</script> is continuous with probability density function <script type="math/tex">f</script>.</p><h4>Python & NumPy</h4><p>Python is a good choice for writing code, for various reasons:</p><ul><li>easy to read;</li><li>found almost everywhere;</li><li>easy to install if it isn't already installed;</li><li>not Java;</li> </ul><p>but particularly because it has excellent science/maths libraries:</p><ul><li>NumPy for vectorised calculations, maths, and stats;</li><li>SciPy for, uh, science;</li><li>Matplotlib for graphing;</li><li>Pandas for data.</li> </ul><p>NumPy is a must-have.</p><p>To use it, the big thing to understand is the idea of vectorised calculations. Otherwise, you'll see code like this:</p><pre><code class="language-python" lang="python">import numpy<br />xs = numpy.array([1, 2, 3])<br />ys = xs ** 2 + xs<br /></code></pre><p>and wonder how we're adding and squaring arrays (we're not; the operations are implicitly applied to each element separately – and all of this runs in C so it's much faster than doing it natively in Python).</p><h3>Computation vs maths</h3><p>Today we have computers. 
Statistics was invented before computers, though, and this affected the field; work was directed to all the areas and problems where progress could be made without much computation. The result is an excellent theoretical mathematical underpinning, but modern statistics can benefit a lot from a computational approach – running simulations to get estimates and so on. For the simple problems there's an (imprecise) computational method and a (precise) mathematical method; for complex problems you either spend all day doing integrals (provided they're solvable at all) or switch to a computer.</p><p>In this post, I will focus on the maths, because the maths concepts are more interesting than the intricacies of NumPy, and because if you understand them (and programming, especially in a vectorised style), the programming bit isn't hard.</p><p> </p><p> </p><h3>Some probability results</h3><h4>The law of total probability</h4><p>Here's something intuitive: if we have a sample space (e.g. outcomes of a die roll) and we partition it into non-overlapping events <script type="math/tex">E_1</script> to <script type="math/tex">E_N</script> that cover every possible outcome (e.g. showing the numbers 1, 2, ..., 6, and losing the dice under the carpet), and we have some other event <script type="math/tex">A</script> (e.g. 
a player gets mad), then</p><div cid="n97" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n97" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-2" type="math/tex; mode=display">P(A) = \sum_{n=1}^{N} P(A | E_n)P(E_n);</script></div></div><p>if we know the probability of <script type="math/tex">A</script> given each event <script type="math/tex">E_n</script>, we can find the total probability of <script type="math/tex">A</script> by summing up the probabilities of each <script type="math/tex">E_n</script>, weighted by the conditional probability that <script type="math/tex">A</script> also happens. Visually, where the height of the red bars represents each <script type="math/tex">P(A|E_n)</script>, and the area of each segment represents the different <script type="math/tex">P(E_n)</script>s, we see that the total red area corresponds to the sum above:</p><p> <br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-I4KzIEDAk9s/X-3g3GmeiaI/AAAAAAAACII/eGAcc3GBZ-cTkeJkFqrF3d7UQ9F-ti0nACLcBGAsYHQ/s1280/ltp.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="506" data-original-width="1280" height="252" src="https://1.bp.blogspot.com/-I4KzIEDAk9s/X-3g3GmeiaI/AAAAAAAACII/eGAcc3GBZ-cTkeJkFqrF3d7UQ9F-ti0nACLcBGAsYHQ/w640-h252/ltp.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>You say this diagram is "messy and unprofessional"; I say it has an "informal aesthetic".</i><br /></td></tr></tbody></table><br /><p>This is called the law of total probability; a fancy name to pull out when you want to use this idea.</p><h4>The law of the 
unconscious statistician</h4><p>Another useful law doesn't even sound like a law at first, which is why it's called the law of the unconscious statistician.</p><p>Remember that the expected value, in the case of a discrete distribution for the random variable <script type="math/tex">X</script>, is</p><div cid="n104" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n104" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-3" type="math/tex; mode=display">E(X)=\sum_i x_iP(X=x_i).</script></div></div><p>Now say we're not interested in the value of <script type="math/tex">X</script> itself, but rather some function <script type="math/tex">f</script> of it. What is the expected value of <script type="math/tex">f(X)</script>? Well, the values <script type="math/tex">x_i</script> are the possible values of <script type="math/tex">X</script>, so let's just replace the <script type="math/tex">x_i</script> above with <script type="math/tex">f(x_i)</script>:</p><div cid="n106" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n106" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-4" type="math/tex; mode=display">E(f(X)) = \sum_i f(x_i) P(X=x_i)</script></div></div><p>... and we're done – but for the wrong reasons. This result is actually more subtle than it looks; to prove it, consider a random variable <script type="math/tex">Y</script> for which <script type="math/tex">Y=f(X)</script>. 
By the definition of expected value,</p><div cid="n108" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n108" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-5" type="math/tex; mode=display">E(Y)=\sum_i y_i P(Y=y_i).</script></div></div><p>Uh oh – suddenly the connection between the obvious result and what expected value is doesn't seem so obvious. The problem is that the mapping between the <script type="math/tex">y_i</script> and <script type="math/tex">x_i</script> could be anything – many <script type="math/tex">x_i</script>, thrown into the black box <script type="math/tex">f</script>, might produce the same <script type="math/tex">y_i</script> – and we have to untangle this while keeping track of all the corresponding probabilities. </p><p>For a start, we might notice that the probability of <script type="math/tex">Y</script> taking the value <script type="math/tex">y_i</script> is the sum of the probabilities of all the values <script type="math/tex">x_j</script> of <script type="math/tex">X</script> that <script type="math/tex">f</script> maps to <script type="math/tex">y_i</script>. So we might write</p><div cid="n111" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n111" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-6" type="math/tex; mode=display">E(Y)=\sum_i \Big( y_i \sum_{j \,|\, f(x_j)=y_i} P(X=x_j) \Big),</script></div></div><p>to sum over each possible value of <script type="math/tex">f(X)</script>, and then within that, also loop over the possible values of <script type="math/tex">X</script> that might have generated that <script type="math/tex">f(X)</script>. 
We've managed to switch a term involving the probability that <script type="math/tex">Y</script> takes some values to one about <script type="math/tex">X</script> taking a specific value – progress!</p><p>Next, we realise that <script type="math/tex">y_i</script> is the same for everything in the inner sum; <script type="math/tex">y_i = f(x_1) = f(x_2) = ... = f(x_j)</script>. So we don't change anything if we write</p><div cid="n114" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n114" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-7" type="math/tex; mode=display">E(Y)=\sum_i \Big( \sum_{j \,|\, f(x_j)=y_i} f(x_j) P(X=x_j) \Big)</script></div></div><p>instead. Now we just have to see that the above is equivalent to iterating once over all the <script type="math/tex">j</script>s.</p><p>A diagram:</p><a href="https://1.bp.blogspot.com/-nCfUHjudl2o/X-3gwrn3kuI/AAAAAAAACIE/na11fzX5DvExTnBkTKr1MSBHwqdU8rlbgCLcBGAsYHQ/s1280/lotus.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="770" data-original-width="1280" height="384" src="https://1.bp.blogspot.com/-nCfUHjudl2o/X-3gwrn3kuI/AAAAAAAACIE/na11fzX5DvExTnBkTKr1MSBHwqdU8rlbgCLcBGAsYHQ/w640-h384/lotus.png" width="640" /></a><p>The yellow area is the expected value of <script type="math/tex">f(x) = Y</script>. By the definition of expected value, we can sum up the areas of the yellow rectangles to get <script type="math/tex">E(f(X))</script>. 
What we've now done is "reduced" this to a process like this: pick <script type="math/tex">y_1</script>, looking at the <script type="math/tex">x_i</script> that map to it with <script type="math/tex">f</script> (<script type="math/tex">x_1</script> and <script type="math/tex">x_2</script> in this case), and find these probabilities and multiply them by <script type="math/tex">f(x_1)=f(x_2)=y_1</script>. So we add up the rectangles in the slots marked by the dotted lines, and we do it with this weird double-iteration of looking first at <script type="math/tex">y_i</script>s and then at <script type="math/tex">x_i</script>s.</p><p>But once we've put it this way, it's simple to see we get the same result if we iterate over the <script type="math/tex">x_i</script>s, get the corresponding rectangle slice for each, and add it all up. This corresponds to the formula we had above (summing <script type="math/tex">f(x_i) P(X=x_i)</script> over all possible <script type="math/tex">i</script>).</p><h4>Bayes' theorem (odds ratio and continuous form)</h4><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-96QLqGNFQHM/X-3g8n4CgvI/AAAAAAAACIM/eYMaWIjkoQs9US3zrgBj8mVlTMPN9DIEQCLcBGAsYHQ/s1280/bayes.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1280" height="450" src="https://1.bp.blogspot.com/-96QLqGNFQHM/X-3g8n4CgvI/AAAAAAAACIM/eYMaWIjkoQs9US3zrgBj8mVlTMPN9DIEQCLcBGAsYHQ/w640-h450/bayes.png" width="640" /></a></div><br />Above is a Venn diagram of a sample space (the box), with the probabilities of event <script type="math/tex">B</script> and event <script type="math/tex">R</script> marked by blue and red areas respectively (the hatched area represents that both happen). 
<p>By the definition of conditional probability,</p><div cid="n124" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n124" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display"></div><script id="MathJax-Element-8" type="math/tex; mode=display">P(R|B)=\frac{P(B \cap R)}{P(B)}, \text{ and} \\ P(B|R)=\frac{P(B \cap R)}{P(R)}.</script></div></div><p>Bayes' theorem is about answering questions like "if we know how likely we are to be in the red area given that we're in the blue area, how likely are we to be in the blue area if we're in the red?" (Or: "if we know how likely we are to have symptoms if we have covid, how likely are we to have covid if we have symptoms?").</p><p>Solving both of the above equations for <script type="math/tex">P(B \cap R)</script> and equating them gives</p><div cid="n127" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n127" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-9" type="math/tex; mode=display">P(R|B) P(B) = P(B|R) P(R),</script></div></div><p>which is the answer – just divide out by either <script type="math/tex">P(B)</script> or <script type="math/tex">P(R)</script> to get, for example,</p><div cid="n129" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n129" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-10" type="math/tex; mode=display">P(B|R) = \frac{P(R|B)P(B)}{P(R)}.</script></div></div><p>Let's say the red area <script type="math/tex">R</script> represents having symptoms. 
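These identities are easy to sanity-check numerically. A quick sketch with made-up numbers (none of them come from the post: assume 5% of people have covid, 80% of covid cases have symptoms, and so do 10% of non-cases):

```python
# Hypothetical numbers, purely for illustration: B = "has covid", R = "has symptoms"
p_B = 0.05
p_R_given_B = 0.80
p_R_given_notB = 0.10

# Law of total probability (previous section): P(R) = P(R|B)P(B) + P(R|¬B)P(¬B)
p_R = p_R_given_B * p_B + p_R_given_notB * (1 - p_B)

# Bayes' theorem: P(B|R) = P(R|B)P(B) / P(R), about 0.30 with these numbers
p_B_given_R = p_R_given_B * p_B / p_R

# Cross-check the identity in the text: P(R|B)P(B) = P(B|R)P(R)
assert abs(p_R_given_B * p_B - p_B_given_R * p_R) < 1e-12
```

So with these (invented) numbers, symptoms push the probability of covid from the 5% prior up to roughly 30%.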
Let's say we split the blue area <script type="math/tex">B</script> into <script type="math/tex">B_1</script> and <script type="math/tex">B_2</script> – two different variants of covid, say. Now instead of talking about probabilities, let's talk about odds: let's say the odds ratios that a random person has no covid, has variant 1, and has variant 2 are 40:2:1, and that symptoms are, compared to the no-covid population, ten times as likely in variant 1 and twenty times as likely in variant 2 (in symbols: <script type="math/tex">P(R| \neg B_1 \cap \neg B_2) = P(R|B_1) / 10 = P(R|B_2) / 20</script>). Now we learn that we have symptoms and want to calculate posterior probabilities, to use Bayes-speak.</p><p>To apply Bayes' rule, you could crank out the formula exactly as above: convert the odds to probabilities, multiply each by the corresponding likelihood of symptoms, divide by the total probability of symptoms, and get revised probabilities for having no covid or either variant. This is equivalent to keeping track of the absolute sizes of the intersections in the diagram below:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Jhos5ZJkCv8/X-3hH0LugyI/AAAAAAAACIY/Cbqmi8yd3e8NyCwPsqBOJFf4fZG-zD28wCLcBGAsYHQ/s1000/bayes2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="622" data-original-width="1000" height="398" src="https://1.bp.blogspot.com/-Jhos5ZJkCv8/X-3hH0LugyI/AAAAAAAACIY/Cbqmi8yd3e8NyCwPsqBOJFf4fZG-zD28wCLcBGAsYHQ/w640-h398/bayes2.png" width="640" /></a></div><br /><p>But this is unnecessary. When we learned we had symptoms, we've already zoomed in to the red blob; that is our sample space now, so blob size compared to the original sample space no longer interests us.</p><p>So let's take our odds ratios directly, and only focus on relative probabilities. 
Let's imagine each scenario fighting over a set amount of probability space, with the starting allocations determined by prior odds ratios:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-d3Qk2fpGrT4/X-3hOBKv60I/AAAAAAAACIg/bsEq3MDex-wUlCQonWZ8DZ8Gl4clp6KUwCLcBGAsYHQ/s1280/odds1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="84" data-original-width="1280" height="42" src="https://1.bp.blogspot.com/-d3Qk2fpGrT4/X-3hOBKv60I/AAAAAAAACIg/bsEq3MDex-wUlCQonWZ8DZ8Gl4clp6KUwCLcBGAsYHQ/w640-h42/odds1.png" width="640" /></a></div><br /><p>Now Bayes rule says to multiply each prior probability <script type="math/tex">P(B_i)</script> by <script type="math/tex">P(R|B_i)</script>. To adjust our prior odds ratio 40:2:1 by the ratios 1:10:20 telling us how many times more likely we are to see <script type="math/tex">R</script> (symptoms) given no covid or <script type="math/tex">B_1</script> or <script type="math/tex">B_2</script>, just multiply term-by-term to get 40:20:20, or 2:1:1. You can imagine each outcome fighting it out with their newly-adjusted relative strengths, giving a new distribution of the sample space:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-N8kRYPSO3bs/X-3hTdn_XLI/AAAAAAAACIk/F77jAGwyouYemA1udKnaLy1O_G7lVEIZACLcBGAsYHQ/s1282/odds2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="102" data-original-width="1282" height="50" src="https://1.bp.blogspot.com/-N8kRYPSO3bs/X-3hTdn_XLI/AAAAAAAACIk/F77jAGwyouYemA1udKnaLy1O_G7lVEIZACLcBGAsYHQ/w640-h50/odds2.png" width="640" /></a></div><br /><p>Now if we want to get absolute probabilities again, we just have to scale things right so that they add up to 1. 
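In code, this whole update is one elementwise multiplication and a normalisation; a minimal NumPy sketch with the numbers from the example:

```python
import numpy as np

# Prior odds for (no covid) : (variant 1) : (variant 2), from the example above
prior_odds = np.array([40.0, 2.0, 1.0])
# How many times more likely symptoms are under each scenario, relative to no covid
symptom_likelihoods = np.array([1.0, 10.0, 20.0])

posterior_odds = prior_odds * symptom_likelihoods  # 40 : 20 : 20, i.e. 2 : 1 : 1
posterior = posterior_odds / posterior_odds.sum()  # scale so it sums to 1
# posterior is now [0.5, 0.25, 0.25]
```

The 0.5 : 0.25 : 0.25 split is just the 2:1:1 posterior odds rescaled to sum to 1.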
This tiny bit of cleanup at the end (if we want to convert to probabilities again) is the only downside of working with odds ratios.</p><p>This gives us an idea about how to use Bayes when the sample space is continuous rather than discrete. For example, let's say the sample space is between 0 and 100, representing the blood oxygenation level <script type="math/tex">X</script> of a coronavirus patient. We can imagine an approximation where we write an odds ratio that includes every integer from 0 to 100, and then refine that until, in the limit, we've assigned odds to every real number between 0 and 100. Of course, at this point the odds ratio interpretation starts looking a bit weird, but we can switch to another one: what we have is a probability distribution, if only we scale it so that the entire thing integrates to one.</p><p>The same logic applies as before, even though everything is now continuous. Let's say we want to calculate a conditional probability like the probability of <script type="math/tex">X</script> (the random variable for the patient's blood oxygenation) taking the value <script type="math/tex">x</script>. At first we have no information, so our best guess is the prior across all patients, <script type="math/tex">\Pr_X(x)</script>. Say we now get some piece of evidence, like the patient's age, and know the likelihood ratios of the patient being that age given each blood oxygenation level. 
To get our updated belief distribution, we can just go through and multiply the prior likelihoods of each blood oxygenation level by the ratios given the new piece of evidence.</p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-yRdyWvJJFqo/X_BQQ0ev9EI/AAAAAAAACK0/Lvw36rxuPy03EK5CODRgEmcaOZebQP4mACLcBGAsYHQ/s1280/odds3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="688" data-original-width="1280" height="344" src="https://1.bp.blogspot.com/-yRdyWvJJFqo/X_BQQ0ev9EI/AAAAAAAACK0/Lvw36rxuPy03EK5CODRgEmcaOZebQP4mACLcBGAsYHQ/w640-h344/odds3.png" width="640" /></a></div></div><p>Above, the red line is the initial distribution of blood oxygenation <script type="math/tex">x</script> across all patients. The yellow line represents the relative likelihoods of the patient's actual known age <script type="math/tex">a</script> given a particular <script type="math/tex">x</script>. The green line at any particular <script type="math/tex">x</script> is the product of the yellow and red functions at that same <script type="math/tex">x</script>, and it's our relative posterior. To interpret it as a probability distribution, we have to scale it vertically so that it integrates to 1 (that's why we have a proportionality sign rather than an equals sign).</p><p>Now let's say more evidence comes in: the patient is unconscious (which we'll denote <script type="math/tex">U=\text{"yes"}</script>). 
We can repeat the same process of multiplying out relative likelihoods and the prior, this time with the prior being the result in the previous step:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-d_e-q5ctBIs/X_BQdmzEaoI/AAAAAAAACK4/gFbLZQb96N4AaWGOJy70ILariHxQNja1gCLcBGAsYHQ/s1278/odds4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="928" data-original-width="1278" height="464" src="https://1.bp.blogspot.com/-d_e-q5ctBIs/X_BQdmzEaoI/AAAAAAAACK4/gFbLZQb96N4AaWGOJy70ILariHxQNja1gCLcBGAsYHQ/w640-h464/odds4.png" width="640" /></a></div><p></p><p>We can see that in this case the blue line varies a lot more depending on <script type="math/tex">x</script>, and hence our distribution for <script type="math/tex">x</script> (the purple line) changes more compared to our prior (the green line). Now let's say we have a very good piece of evidence: the result <script type="math/tex">m</script> of a blood oxygenation meter <script type="math/tex">M</script>.</p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-YXGK8RMc6Oc/X_BQjsdKWfI/AAAAAAAACLA/zvLSoosiv408XSUcAXp_uRQwh54Nfx1xACLcBGAsYHQ/s1280/odds5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="858" data-original-width="1280" height="428" src="https://1.bp.blogspot.com/-YXGK8RMc6Oc/X_BQjsdKWfI/AAAAAAAACLA/zvLSoosiv408XSUcAXp_uRQwh54Nfx1xACLcBGAsYHQ/w640-h428/odds5.png" width="640" /></a></div>There's some error on the oxygenation measurement, so our final belief (that <script type="math/tex">x</script> is distributed according to the black line) is very clearly a distribution of values rather than a single value, but it's clustered around a single point.<p></p><p>So to think through Bayes in practice, the lesson is this: throw out the denominator in the law. 
It's a constant anyways; if you really need it you can go through some integration at the end to find it. But it's not the central point of Bayes' theorem. Remember instead: prior times likelihood ratio gives posterior.</p><p> </p><h2>Fitting models</h2><p>A probability model tries to tell you how likely things are. Fitting a probability model to data is about finding one that is useful for given data.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-zWW-18QBIOE/X-3ht9sKyeI/AAAAAAAACJM/tge61Rkj8sYl5NGP720Pu2FooLBPjI4OgCLcBGAsYHQ/s1280/probmodels.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="366" data-original-width="1280" height="184" src="https://1.bp.blogspot.com/-zWW-18QBIOE/X-3ht9sKyeI/AAAAAAAACJM/tge61Rkj8sYl5NGP720Pu2FooLBPjI4OgCLcBGAsYHQ/w640-h184/probmodels.png" width="640" /></a></div><p>Above, we have two axes representing whatever, and the intensity of the red shading is the probability attributed to a particular pair of values.</p><p>The model on the left is simply bad. The one in the middle is also bad, though; it assigns no probability to many of the data points that were actually seen.</p><p>Choosing which distribution to fit – or whether to do something else entirely – is sometimes obvious, sometimes not. Complexity is rarely good.</p><h3>Maximum likelihood estimation (MLE)</h3><p>Let's say we do have a good idea of what the distribution is; the weight of stray cats in a city depends on a lot of small factors pushing both ways (when it last caught a mouse, the temperature over the past week, whether it was loved by its mother, etc.), so <a href="https://en.wikipedia.org/wiki/Bean_machine">we should expect a normal distribution</a>. Well, probably.</p><p>Let's say we have a dataset of cat weights, labelled <script type="math/tex">x_1</script> to <script type="math/tex">x_n</script> because we're serious maths people. 
How do we fit a distribution?</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-gjN98iorrHU/X-3hznFtCLI/AAAAAAAACJQ/HncfCOzHt3wE7LZUxkeWDAc_27ZvZOH-gCLcBGAsYHQ/s800/cats.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="610" data-original-width="800" height="488" src="https://1.bp.blogspot.com/-gjN98iorrHU/X-3hznFtCLI/AAAAAAAACJQ/HncfCOzHt3wE7LZUxkeWDAc_27ZvZOH-gCLcBGAsYHQ/w640-h488/cats.png" width="640" /></a></div><br /><p><br /></p><p>Step 1 is Wikipedia. Wikipedia tells us that a normal distribution has two parameters, <script type="math/tex">\mu</script> (the mean) and <script type="math/tex">\sigma</script> (the standard deviation), and that the likelihood (not probability! see above) that a normal distribution <script type="math/tex">X</script> with those parameters takes a value <script type="math/tex">x</script> is</p><div cid="n164" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n164" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-11" type="math/tex; mode=display">\Pr_X(x)= \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\big( \frac{x-\mu}{\sigma} \big)^2}.</script></div></div><p>Oh dear.</p><p>After a moment's thought, we can interpret it more clearly:</p><div cid="n167" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n167" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-12" type="math/tex; mode=display">\Pr_X(x) = \frac{\text{blah}}{\sigma \text{ blah}} \text{blah}^{\text{-blah} {\big(\frac{x-\mu}{\sigma}\big)^2}}.</script></div></div><p>So it's just an exponential that 
decays in both directions from <script type="math/tex">\mu</script>, and that's squeezed by <script type="math/tex">\sigma</script>.</p><p>(Why are there constants then? Because it's a probability distribution, and must therefore integrate to 1 over its entire range or else all hell will break loose.)</p><p>Step 2 is philosophising. What does it really mean to get the best fit of a distribution?</p><p>The first thing we can notice is that there are only two dials we can adjust: the values of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>. For this particular problem at least, we've reduced the massive problem of picking the best model to one of finding the best spot in a 2D space (well, half of 2D space, since <script type="math/tex">\sigma</script> must be greater than zero).</p><p>The second thing we can notice is that the only tool we have at our disposal here to tell us about the fit to the distribution is the likelihood function, and, well, as the saying goes: when all you have is a likelihood function ...</p><p>A good fit will give high likelihoods to the points in the data set (we can't get an arbitrarily good fit by giving everything a lot of likelihood, because there's only so much likelihood to go around – the likelihood function must integrate to 1 across its domain).</p><p>Let's define the likelihood of the data, given some model, to be the likelihood that we get that specific data set by independently generating samples from the model until we have the same number as in the data set (if we have a lot of data points, the likelihood of any particular set of them will usually be very low, since it's the product of the likelihood of a lot of individual points). 
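To make "very low" concrete, here's a quick sketch with hypothetical cat weights (fifty draws from a normal distribution; the specific numbers are made up). It multiplies fifty individual likelihoods together, and also takes the logarithm of the result, which turns the vanishing product into a comfortably-sized sum:

```python
import math
import random

# Hypothetical cat weights in kg: 50 draws around 4 kg, purely for illustration
random.seed(0)
weights = [random.gauss(4.0, 0.5) for _ in range(50)]

def normal_pdf(x, mu, sigma):
    """Likelihood of a normal distribution with parameters mu, sigma at the value x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 4.0, 0.5

# Likelihood of the whole dataset: a product of one smallish number per point ...
likelihood = 1.0
for w in weights:
    likelihood *= normal_pdf(w, mu, sigma)

# ... which shrinks towards zero very fast as the dataset grows, whereas the
# log-likelihood stays a manageable (negative) number.
log_likelihood = sum(math.log(normal_pdf(w, mu, sigma)) for w in weights)
```

Even at fifty points the product is already astronomically small; with a few thousand points it would underflow to exactly zero in floating point, while the log-likelihood would still be fine.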
And let's go ahead and try to tune the model so that the likelihood of our data is maximised.</p><p>(Remember, likelihood is probability, except for continuous random variables like our normal distribution, where we can't talk about the probability of a dataset (only about something like the probability of getting a dataset at least as close as [some metric] to the dataset).)</p><p>Step 3 is algebra. So what is the likelihood of all our data? Using basic probability, it's the product of the likelihoods of each data point (just like the probability of getting a set of independent events is the product of the probabilities of each event). Returning to our normal distribution with cat data <script type="math/tex">x_1</script> to <script type="math/tex">x_n</script>, the likelihood of the data given distribution <script type="math/tex">X</script> with mean <script type="math/tex">\mu</script> and standard deviation <script type="math/tex">\sigma</script> is</p><div cid="n177" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n177" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display"></div><script id="MathJax-Element-13" type="math/tex; mode=display">\Pr_X(x_1) \cdot \Pr_X(x_2) \cdot ... \cdot \Pr_X(x_n) \\ = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\big( \frac{x_1-\mu}{\sigma} \big)^2} \cdot ... \cdot \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\big( \frac{x_n-\mu}{\sigma} \big)^2} \\ = \left(\frac{1}{\sigma \sqrt{2 \pi}} \right)^n e^{-\frac{1}{2}\big( \big( \frac{x_1 - \mu}{\sigma} \big)^2 + ... + \big(\frac{x_n - \mu}{\sigma} \big)^2 \big)}.</script></div></div><p>Oh dear. Maximising this is a pain.</p><p>Thankfully, there's a trick. We don't care about the likelihood, only that we set <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> so that the likelihood is maximised. 
We can apply any monotonically increasing function to the likelihood, maximise that, and we'll have the <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> that maximise the original mess.</p><p>Which monotonically increasing function? Logarithms are generally best, because they convert the products you get from calculating the likelihood of a dataset into sums (and in this case they're especially nice, because they'll also take out the exponentials in our distribution's likelihood function).</p><p>In fact, throw away the previous calculation, note that</p><div cid="n182" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n182" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display"></div><script id="MathJax-Element-14" type="math/tex; mode=display">\log\Pr_X(x) = -\log(\sigma \sqrt{2 \pi}) - \frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2 \\ = -\log(\sqrt{2 \pi}) - \log(\sigma) - \frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2, \\</script></div></div><p>from which we can throw away the <script type="math/tex">\log(\sqrt{2\pi})</script> because it's the same in each term, and then sum all the rest up to get a total log likelihood of</p><div cid="n184" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n184" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-15" type="math/tex; mode=display">-n\log(\sigma) - \sum_{i=1}^n \Big( \frac{1}{2} \left(\frac{x_i-\mu}{\sigma}\right)^2 \Big).</script></div></div><p>Call this <script type="math/tex">f</script>; the values of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> that maximise it are where <script type="math/tex">\frac{\partial 
f}{\partial \mu} = 0</script> and <script type="math/tex">\frac{\partial f}{\partial \sigma} = 0</script>; that's when we've found our peak on the 2D space of possible <script type="math/tex">(\mu, \sigma)</script> pairs (technically this condition only tells us it's a stationary point, but it turns out to be the maximum, as you can prove by taking more derivatives).</p><p>So the maximum satisfies</p><div cid="n187" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n187" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display"></div><script id="MathJax-Element-16" type="math/tex; mode=display">\frac{\partial f}{\partial \mu} = \sum_{i=1}^n \Big( \frac{x_i-\mu}{\sigma^2} \Big) = 0, \text{ and} \\ \frac{\partial f}{\partial \sigma} = -\frac{n}{\sigma} + \sum_{i=1}^n \left( \frac{(x_i - \mu)^2}{\sigma^3} \right) = 0.</script></div></div><p>The first condition gives</p><div cid="n189" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n189" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-17" type="math/tex; mode=display">\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i,</script></div></div><p>in other words that <script type="math/tex">\hat{\mu}</script>, our best estimator function for the value of <script type="math/tex">\mu</script>, is the average of the values in the data set.</p><p>From the second condition, we can do algebra to get</p><div cid="n192" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n192" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script 
id="MathJax-Element-18" type="math/tex; mode=display">\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^n(x_i-\mu)^2}.</script></div></div><p>We need to be careful here, though. When writing out the conditions, <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> stood for specific values of the parameters of the normal distribution <script type="math/tex">X</script>. We don't know these values; the best we can do is estimate them with <i>estimators</i>, which are technically not values but functions that take a data set and return an estimated value (and denoted by <script type="math/tex">\hat{\text{hats}}</script>). We can't have unknown values in our definition of <script type="math/tex">\hat{\sigma}</script>, as we currently do with the <script type="math/tex">\mu</script> in it; we have to replace it with the estimator for <script type="math/tex">\mu</script> like this:</p><div cid="n194" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n194" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-19" type="math/tex; mode=display">\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^n(x_i-\hat{\mu})^2}</script></div></div><p>– making sure that the estimator <script type="math/tex">\hat{\mu}</script> does not depend on <script type="math/tex">\hat{\sigma}</script> , since that would again make things undefined – or then by writing out the <script type="math/tex">\hat{\mu}</script> estimator like this:</p><div cid="n196" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n196" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-20" type="math/tex; 
\hat{\sigma}">
mode=display">\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^n \left(x_i-\frac{1}{n}\sum_{j=1}^n x_j\right)^2},</script></div></div><p>which at least makes it very clear that the <script type="math/tex">x_i</script>s and their number <script type="math/tex">n</script> define <script type="math/tex">\hat{\sigma}</script>. </p><p>When you're done defining your estimators, you should have a clear diagram in your head of how to pour data into the functions you've written down and come out with concrete numbers, with no dangling inputs anywhere – you're not done if you have any.</p><h3>Supervised and unsupervised learning</h3><p>There are two main types of fancy model fitting we can do:</p><ol start=""><li>Supervised learning, where we have a set of pairs (of numbers or anything else) and we try to design a system to predict one element from the other. For example, maybe we measure the length and weight of some stray cats, but get bored of trying to get them to stay on the scale long enough, so we want to ditch the weighing and predict a weight from the length alone – how well can we do this?</li><li>Unsupervised learning, where we have our data (as a set of tuples of associated data, like cat lengths, weights, and locations), and we try to fit a model to it so we can generate similar items; maybe we want to fake a larger stray cat population in our data than actually exists but not get caught by the statistics bureau. (This category also includes things like trying to <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">identify clusters</a> to interpret the data.) Fitting a distribution is perhaps the simplest example: using our one-dimensional cat weight database discussed in the MLE section, we can "generate" new cats by sampling from it, though the "cat" will just be the weight number. 
The more interesting case is when we have to generate a lot of associated data; for example, <a href="https://thispersondoesnotexist.com/">this website</a> offers you a new face every time you reload it. Behind it is a probability distribution for a human face in some crazy-dimensional variable space that's detailed enough that sampling it gives you all the data needed to figure out the colours of each pixel in a photorealistic face picture.</li> </ol><p>The unifying idea is maximum likelihood estimation (MLE). Clearly, something like MLE is needed if you want to fit a distribution to data for unsupervised learning; we're going to need to generate something eventually, so we'd better have a probability model. It's less clear that supervised learning has anything to do with MLE though, and it's tempting to think of it as defining some arbitrary loss function to measure how bad a fit is, and then minimising that. It's possible to think of supervised learning this way, but then you'll end up with a lot of detail about loss functions in your head, all of which will seem to be pulled out of thin air.</p><p>Instead, think of supervised learning as MLE too. We specify a probability model, which will take in some parameters (e.g. the exponent <script type="math/tex">a</script> and constant <script type="math/tex">b</script> in a cat length/weight model like <script type="math/tex">\text{weight} = b \times \text{length}^a + \epsilon</script>, where <script type="math/tex">\epsilon</script> is a normally distributed error term with mean 0 and some standard deviation we either know already or ask the fitting procedure to find for us), and the value of the predictor variable(s) (e.g. 
the cat's length), and spit out its prediction of the variable(s) of interest.</p><p>(Note that often the variable of interest is not numerical, but a label: "spam", "tumour", "Eurasian oystercatcher", etc.)</p><p>In fact, seen from the MLE perspective, it can almost be hard to see the difference – if so, good. Just look at the processes:</p><ol start=""><li><p>Unsupervised learning:</p><ol start=""><li>Get your dataset <script type="math/tex">x = (x_1, x_2, ..., x_n)</script>.</li><li>Decide on a probability model (e.g. a simple distribution) <script type="math/tex">X</script> with a parameter set <script type="math/tex">\theta = (\theta_1, \theta_2, ..., \theta_m)</script>.</li><li>Find the <script type="math/tex">\theta</script> that maximises <script type="math/tex">\Pr_X(x_1; \theta) \times ... \times \Pr_X(x_n; \theta)=\Pr_X(x;\theta)</script>,* since, assuming our data points are drawn independently, this is the likelihood of the dataset.</li> </ol></li><li><p>Supervised learning:</p><ol start=""><li>Get your dataset of pairs of the form (thing to predict, thing to predict from): <script type="math/tex">((y_1, x_1), (y_2, x_2), ..., (y_n, x_n))</script>.</li><li>Decide on a probability model <script type="math/tex">Y</script> that relies on parameter set <script type="math/tex">\theta = (\theta_1, \theta_2, ..., \theta_m)</script>, and also <script type="math/tex">x_i</script>, to predict <script type="math/tex">y_i</script>.</li><li>Find the <script type="math/tex">\theta</script> that maximises <script type="math/tex">\Pr_Y(y_1;x_1, \theta) \times ... 
\times \Pr_Y(y_n; x_n, \theta) = \Pr_Y(y_1, ..., y_n; x_1, ..., x_n, \theta)</script>.*</li> </ol></li> </ol><p>*(We write <script type="math/tex">\Pr_X(x_i;\theta)</script> to mean the likelihood that <script type="math/tex">X</script> takes the value <script type="math/tex">x_i</script> if the parameters are <script type="math/tex">\theta</script>; we avoid writing it as a conditional probability <script type="math/tex">\Pr_X(x \, |\, \theta)</script> because interpreting this as a conditional probability is technically only valid with a Bayesian interpretation.)</p><h3>Linear models</h3><p>You can invent any model you choose. As always, simplicity pays though, and it turns out that there's a class of probability models which are easy to work with and reason about, for which general algorithms and mathematical tools exist, and which is often good enough: linear models.</p><p>The word "linear" immediately brings to mind straight lines. That's not what it means in this context. The linearity in linear models is because the output is a linear combination of "features" (predictor variables).</p><p>The general form is</p><div cid="n234" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n234" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-21" type="math/tex; mode=display">\hat{y_i} = c_1 e_{1,i} + c_2 e_{2,i} + ... +c_n e_{n,i},</script></div></div><p>where <script type="math/tex">\hat{y_i}</script> is the predicted value, <script type="math/tex">c_1</script> through <script type="math/tex">c_n</script> are constants, and <script type="math/tex">e_{1,i}</script> through <script type="math/tex">e_{n,i}</script> are the features describing the <script type="math/tex">i</script>th set of data. 
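Numerically, the general form is just a dot product between the constants and the features; a minimal sketch (all numbers made up for illustration):

```python
import numpy as np

# Hypothetical fitted constants and features for a single data item i.
c = np.array([0.5, 2.0, -1.0])    # constants c_1, c_2, c_3
e_i = np.array([1.0, 3.0, 0.25])  # features e_{1,i}, e_{2,i}, e_{3,i}

# The prediction is c_1*e_{1,i} + c_2*e_{2,i} + c_3*e_{3,i}, i.e. a dot product.
y_hat_i = c @ e_i
print(y_hat_i)  # 0.5*1.0 + 2.0*3.0 + (-1.0)*0.25 = 6.25
```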
In the simplest case, a feature might be a value we measure directly, but in general it can be any function of data we measure. Ideally, we want the true value to satisfy <script type="math/tex">y_i \approx c_1 e_{1,i} + ... + c_n e_{n,i}</script>.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-joWzH265vss/X-3h7Ef5y7I/AAAAAAAACJU/4LG8rb-vc4Mtno2KGVHfWipxb41EnWuFACLcBGAsYHQ/s1278/linearmodel.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="678" data-original-width="1278" height="340" src="https://1.bp.blogspot.com/-joWzH265vss/X-3h7Ef5y7I/AAAAAAAACJU/4LG8rb-vc4Mtno2KGVHfWipxb41EnWuFACLcBGAsYHQ/w640-h340/linearmodel.png" width="640" /></a></div><p>In the above diagram, we measure the data <script type="math/tex">x_i</script> (note that it can be a tuple of values rather than a single value), pass it through some black-box function to generate features, and take the prediction <script type="math/tex">\hat{y_i}</script> to be the sum of each feature multiplied by the weight assigned to it. </p><p>Note that the linear model above is a prediction-maker but not a probability model because it doesn't assign likelihoods. The probability model for a linear model is often taken to be</p><div cid="n239" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n239" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-22" type="math/tex; mode=display">y_i = c_1 e_{1,i} + c_2 e_{2,i} + ... 
+c_n e_{n,i} + \epsilon</script></div></div><p>that is, there's an error term <script type="math/tex">\epsilon</script> that we assume to be a normal distribution with standard deviation <script type="math/tex">\sigma</script> (which may be known, or finding it may be part of fitting the model).</p><p>The above is also an equation for predicting one specific output (<script type="math/tex">y_i</script>) from one specific set of features, which in turn are determined by one specific input (e.g. a single data point). More generally we can write it in vector form:</p><div cid="n242" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n242" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-23" type="math/tex; mode=display">\pmb{y} \approx c_1 \pmb{e_1} + ... + c_n \pmb{e_n},</script></div></div><p>where <script type="math/tex">\pmb{y}=(y_1, y_2, ..., y_{n})</script>, and likewise <script type="math/tex">\pmb{e_j}</script> is a vector whose <script type="math/tex">i</script>th position corresponds to the <script type="math/tex">j</script>th feature of the <script type="math/tex">i</script>th data item.</p><p>Note that we can read this equation in two ways: as a vector equation about data, as just described, that's fitted to give <script type="math/tex">\pmb{y}</script> from its features, or as a prediction, saying that the value of a particular <script type="math/tex">y_i</script> will be roughly this.</p><p>There's a set of standard tricks to use in linear modelling:</p><ul><li>"One-hot coding": using a function that is 0 unless the input data satisfies some condition (having a label, exceeding a value, etc.).</li><li>If we have the data point <script type="math/tex">x_i</script>, using the features <script type="math/tex">e_{0,i} = 1</script>, <script type="math/tex">e_{1,i} 
= x_i</script>, and <script type="math/tex">e_{2,i} = x_i^2</script> to fit a quadratic (if you fit a polynomial of degree higher than 2 without a very solid reason, you're probably overfitting).</li><li>We often have a pattern with a known period <script type="math/tex">T</script> (days, years, etc.), and some non-zero starting phase <script type="math/tex">\phi</script>. Therefore we'd want a feature like <script type="math/tex">\sin((2\pi/T)x+\phi)</script>, where <script type="math/tex">x</script> is an input, to fit this pattern. If <script type="math/tex">\phi</script> is known, we don't have a problem, but if we want to fit the phase, it doesn't work: the model is not linear in <script type="math/tex">\phi</script>. To fix this, use a trig angle addition identity; the above becomes <script type="math/tex">\sin(\phi) \cos((2\pi/T)x) + \cos(\phi) \sin((2\pi/T)x)</script>, where <script type="math/tex">\sin(\phi)</script> and <script type="math/tex">\cos(\phi)</script> are just constants, so they can be absorbed into the constants that the fitting procedure determines for our features. 
(Recovering <script type="math/tex">\phi</script> from the final constants will take a bit of maths; note that the constant of the cosine and sine terms in the fitted model will have the amplitude mixed in, in addition to <script type="math/tex">\phi</script>.)</li> </ul><p>Here's an annotated linear model with parameter interpretation:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-UEm7eq7MsB8/X-3iV7GgJaI/AAAAAAAACJk/cnJ48BClH54YIOQoHLDGGDfkrDx8_8gjwCLcBGAsYHQ/examplelinear.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1022" data-original-width="1280" height="510" src="https://lh3.googleusercontent.com/-UEm7eq7MsB8/X-3iV7GgJaI/AAAAAAAACJk/cnJ48BClH54YIOQoHLDGGDfkrDx8_8gjwCLcBGAsYHQ/w640-h510/examplelinear.png" width="640" /></a></div><br /><p></p><p>The features in this model:</p><ul><li><script type="math/tex">e_1=x</script>.</li><li><script type="math/tex">e_2</script> is 0 if <script type="math/tex">x < A</script> and 1 otherwise.</li><li><script type="math/tex">e_3</script> is 0 if <script type="math/tex">x < A</script> and <script type="math/tex">x</script> otherwise.</li> </ul><p>(If we want to fit the best value of <script type="math/tex">A</script>, we'll have to do some maths and reconfigure the model. 
Right now <script type="math/tex">A</script> is a constant that's defined in the functions that calculate the features from the input data.)</p><p>The interpretation of the constants:</p><ul><li><script type="math/tex">c_0</script> is the prediction for <script type="math/tex">x=0</script>.</li><li><script type="math/tex">c_1</script> is the base slope.</li><li><script type="math/tex">c_2</script> is the difference between the prediction for <script type="math/tex">x=0</script> (the <script type="math/tex">y</script>-intercept of the <script type="math/tex">x < A</script> line) and the <script type="math/tex">y</script>-intercept of the <script type="math/tex">x>A</script> line.</li><li><script type="math/tex">c_3</script> is how much the slope changes after <script type="math/tex">x=A</script>.</li> </ul><p>We could have chosen different features (for example, letting <script type="math/tex">e_1 = 0</script> for <script type="math/tex">x > A</script>), and then gotten perhaps more readable constants (<script type="math/tex">c_3</script> would become just the slope, not the difference in slope). We could also have added a feature like <script type="math/tex">e_4 = x^2</script>, and then the model would no longer look like just straight lines. But whatever we do, we need to be careful to interpret the constants we get correctly, especially when the model gets complicated.</p><p>For our cat weight prediction example, we might expect weight <script type="math/tex">W</script> and length <script type="math/tex">L</script> to have a relation like <script type="math/tex">W \approx c L^3</script>, where <script type="math/tex">c</script> is a constant that the model will fit. 
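Since this model is linear in the constant <script type="math/tex">c</script> (with <script type="math/tex">L^3</script> as the single feature), ordinary least squares can fit it directly; a sketch on hypothetical cat measurements (all data made up):

```python
import numpy as np

# Hypothetical cat data: lengths in cm, weights in kg.
lengths = np.array([40.0, 45.0, 50.0, 55.0])
weights = np.array([3.1, 4.5, 6.3, 8.1])

# One feature per cat: e_i = L_i^3. Least squares then fits the single
# constant c in  W ≈ c * L^3.
E = (lengths ** 3).reshape(-1, 1)  # feature matrix, one column
(c,), *_ = np.linalg.lstsq(E, weights, rcond=None)

print(c)              # fitted constant, around 5e-5 kg/cm^3 for this data
print(c * 48.0 ** 3)  # predicted weight of a hypothetical 48 cm cat
```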
If we want to ask questions about whether a cubic relation really is the best, take logs and fit something like <script type="math/tex">\log(W) = c_1 + c_2 \log(L)</script> – <script type="math/tex">c_2</script> tells us the exponent.</p><h4>Feature spaces and fitting linear models</h4><p>The main benefit of linear models is that by talking about linear combinations of data vectors we reduce the maths of fitting parameters to linear algebra. Linear algebra is about transformations of space and the vectors in it, so it also allows for a visual interpretation of everything.</p><p>Let's say we have a model like this:</p><div cid="n279" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n279" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-24" type="math/tex; mode=display">\pmb{y} \approx c_1 \pmb{e_1} + c_2 \pmb{e_2}.</script></div></div><p>Here, <script type="math/tex">\pmb{y}</script> is the actual measured data, and <script type="math/tex">\pmb{e_i}</script> are functions of the (also measured) predictor variables. Let's say <script type="math/tex">\pmb{y} = (y_1, y_2, y_3)</script> – i.e., we have three data points. We can imagine <script type="math/tex">\pmb{y}</script> as a vector pointing somewhere in 3D space, with <script type="math/tex">y_1</script>, <script type="math/tex">y_2</script>, and <script type="math/tex">y_3</script> the distances along the <script type="math/tex">x</script>, <script type="math/tex">y</script>, and <script type="math/tex">z</script> axes. 
Likewise, <script type="math/tex">\pmb{e_1}</script> and <script type="math/tex">\pmb{e_2}</script> can be thought of as 3D vectors encoding some (function of the) data we've measured.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-58VTEODl7IM/X-3ifzkSRZI/AAAAAAAACJo/7YBx6KjGeUg_K0t2q-HXR0F2ozxDOLP3gCLcBGAsYHQ/3d.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="920" data-original-width="1278" height="288" src="https://lh3.googleusercontent.com/-58VTEODl7IM/X-3ifzkSRZI/AAAAAAAACJo/7YBx6KjGeUg_K0t2q-HXR0F2ozxDOLP3gCLcBGAsYHQ/w400-h288/3d.png" width="400" /></a></div><br /><p></p><p>Now the only dials a linear model gives us to adjust are the weights of <script type="math/tex">\pmb{e_1}</script> and <script type="math/tex">\pmb{e_2}</script>: <script type="math/tex">c_1</script> and <script type="math/tex">c_2</script>. There's a 2D space of them (since there are two constants to adjust – <script type="math/tex">c_1</script> and <script type="math/tex">c_2</script>), and as it happens, there's a nice geometric interpretation: each pair <script type="math/tex">(c_1, c_2)</script> corresponds to a point on the plane spanned by <script type="math/tex">\pmb{e_1}</script> and <script type="math/tex">\pmb{e_2}</script> (specifically, the point you get to if you move <script type="math/tex">c_1</script> times along <script type="math/tex">\pmb{e_1}</script> and then <script type="math/tex">c_2</script> times along <script type="math/tex">\pmb{e_2}</script>).</p><p>So what are the best values of <script type="math/tex">c_1</script> and <script type="math/tex">c_2</script>? 
The intuitive answer is that we want to get as close as possible to <script type="math/tex">\pmb{y}</script>:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-Xvj6zojy2LM/X-3ikHHj4xI/AAAAAAAACJs/Wyd-NiMwNA8d8vzruWVZYL6524HXL6aNwCLcBGAsYHQ/featurespace.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1080" data-original-width="1280" height="541" src="https://lh3.googleusercontent.com/-Xvj6zojy2LM/X-3ikHHj4xI/AAAAAAAACJs/Wyd-NiMwNA8d8vzruWVZYL6524HXL6aNwCLcBGAsYHQ/w640-h541/featurespace.png" width="640" /></a></div><p></p><p>In this case, the closest to <script type="math/tex">\pmb{y}</script> that we can reach on the plane spanned by <script type="math/tex">\pmb{e_1}</script> and <script type="math/tex">\pmb{e_2}</script> is the green vector, and the black vector is the difference between the predicted data vector and actual data vector.</p><p>Mathematically, what are we doing here? We're minimising the distance between the vector <script type="math/tex">\hat{\pmb{y}} = c_1 \pmb{e_1} + c_2 \pmb{e_2}</script> (where <script type="math/tex">c_1</script> and <script type="math/tex">c_2</script> can be varied) and <script type="math/tex">\pmb{y}</script>; this distance is given by</p><div cid="n287" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n287" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-25" type="math/tex; mode=display">\sqrt{(\hat{y_1} - y_1)^2 + (\hat{y_2} - y_2)^2 + (\hat{y_3} - y_3)^2 }.</script></div></div><p>Previously we simplified optimisation by applying a logarithm (a monotonically increasing function) and optimising that; this time we do the same by applying the squaring function (which is monotonically increasing for positive numbers, which our 
distance is limited to). This means that the quantity to minimise is</p><div cid="n289" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n289" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-26" type="math/tex; mode=display">(\hat{y_1} - y_1)^2 + (\hat{y_2} - y_2)^2 + (\hat{y_3} - y_3)^2.</script></div></div><p>In other words, we minimise the sum of squared errors ("least squares estimation" is the most common phrase).</p><p>If we have more than three data points, then we can't picture it, but the idea is exactly the same. Fitting an <script type="math/tex">n</script>-dimensional dataset to a linear model of <script type="math/tex">m</script> features boils down to moving as close as possible in <script type="math/tex">n</script>D space to the observed data vector, while limited to the <script type="math/tex">m</script>-dimensional (at most; see below) space spanned by the features.</p><p>(Above, <script type="math/tex">n=3</script> and <script type="math/tex">m=2</script>. Generally <script type="math/tex">n</script> is huge because datasets can be huge, while <script type="math/tex">m</script> is much smaller since it's the number of features we've written down into the model.)</p><blockquote><p><i>A maths lecturer is giving a lecture about 5-dimensional geometry.</i></p><i></i><p><i>A student asks a question: "I can follow the algebra just fine, but it would be helpful if I could visualise it. Is there any way to do that?"</i></p><i></i><p><i>The lecturer replies: "Oh, it's easy. 
Just imagine everything in <script type="math/tex">n</script> dimensions, and then let <script type="math/tex">n=5</script>."</i></p><i></i><p><i> </i></p><i></i><p><i>(variants of this joke are common; see for example <a href="http://www.personal.psu.edu/sxt104/mathjoke1.html">here.</a>)</i></p></blockquote><h5>Linear independence</h5><p>A set of vectors is linearly dependent if there exists a vector in it that can be written as a linear combination of the other vectors. If your feature vectors are linearly dependent, many different coefficient settings give exactly the same predictions, so you can't interpret the individual coefficients.</p><p>(For visual intuition: two vectors in 2D are linearly dependent if they lie on the same line, three vectors in 3D are linearly dependent if they lie on the same plane (a superset of the case that they lie on the same line), and so on.)</p><p>An easy way to make this mistake is if you're doing one-hot coding of categories. Let's say you're fitting a linear model to estimate student exam grades <script type="math/tex">y</script> based on their university, with a model that looks like this:</p><div cid="n301" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n301" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-27" type="math/tex; mode=display">y \approx \alpha + \beta \cdot 1_{\text{Oxford}}+\gamma\cdot1_{\text{Cambridge}}+...,</script></div></div><p>using indicator function notation. Whatever linear fitting routine you do will happily give you coefficient values and the predictions it gives will be sensible, but you won't be able to interpret the coefficients. To see what's happening, consider an Oxford student: their predicted grade <script type="math/tex">y</script> is <script type="math/tex">\alpha + \beta</script>. 
What are <script type="math/tex">\alpha</script> and <script type="math/tex">\beta</script>? Good question – we can only assign meaning to their combination. If instead we eliminate one university and write</p><div cid="n303" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n303" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-28" type="math/tex; mode=display">y \approx \alpha + \beta \cdot 1_{\text{Cambridge}} + ...,</script></div></div><p>when we now fit the coefficients, <script type="math/tex">\alpha</script> will be the predicted grade for Oxford students, and <script type="math/tex">\alpha+\beta</script> the predicted grade for Cambridge students, so we can interpret <script type="math/tex">\alpha</script> as the Oxford average, and <script type="math/tex">\beta</script> as the difference between Oxford and Cambridge. 
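The dependence is easy to check numerically via the rank of the design matrix; a sketch with a tiny hypothetical dataset (two Oxford and two Cambridge students):

```python
import numpy as np

# Hypothetical design matrix for 2 Oxford + 2 Cambridge students.
# Columns: intercept, is-Oxford, is-Cambridge.
X_full = np.array([[1.0, 1.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0],
                   [1.0, 0.0, 1.0]])
X_dropped = X_full[:, [0, 2]]  # same model with the Oxford indicator dropped

print(np.linalg.matrix_rank(X_full))     # 2: three columns, but linearly dependent
print(np.linalg.matrix_rank(X_dropped))  # 2: full column rank, coefficients identifiable

# Different coefficient triples give identical predictions in the full model:
print(X_full @ np.array([0.0, 1.0, 2.0]))  # [1. 1. 2. 2.]
print(X_full @ np.array([1.0, 0.0, 1.0]))  # [1. 1. 2. 2.]
```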
(The predictions given by the model won't change though.)</p><p>The vector interpretation is that if our dataset contains, say, 3 Oxford students followed by 2 Cambridge students, the (5D) data vectors in the first model will be</p><div cid="n306" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n306" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-29" type="math/tex; mode=display">\alpha \begin{pmatrix}1 \\ 1 \\ 1 \\ 1 \\ 1\end{pmatrix} + \beta \begin{pmatrix}1 \\ 1 \\ 1 \\ 0 \\ 0\end{pmatrix} + \gamma \begin{pmatrix}0 \\ 0 \\ 0 \\ 1 \\ 1\end{pmatrix}.</script></div></div><p>But these vectors aren't linearly independent: the last two vectors sum up to the first one, and therefore there will be many triplets <script type="math/tex">(\alpha, \beta, \gamma)</script> that give identical predictions.</p><h4>Linear fitting and MLE</h4><p>We talked about MLE being the holy grail of model fitting, and then about linear models and how fitting them comes down to a geometry problem. As it turns out, MLE lurks behind least squares estimation as well.</p><p>I mentioned earlier that linear models often assume a normal distribution for errors. Let's assume that, and do MLE.</p><p>Our model is that</p><div cid="n312" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n312" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-30" type="math/tex; mode=display">Y_i = c_1 e_{1,i} + ... + c_n e_{n,i} + \epsilon,</script></div></div><p>where <script type="math/tex">\epsilon \sim N(0,\sigma^2)</script> (i.e. 
follows a normal distribution with mean zero and standard deviation <script type="math/tex">\sigma</script>).</p><p>A useful property of normal distributions is that if we add a constant <script type="math/tex">c</script> to a normal distribution with mean <script type="math/tex">\mu</script>, the result has a normal distribution with mean <script type="math/tex">\mu + c</script> and the same standard deviation (this isn't true of all distributions!). Therefore we can write the above as</p><div cid="n315" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n315" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-31" type="math/tex; mode=display">Y_i \sim N(c_1 e_{1,i} + ... + c_n e_{n,i}, \sigma^2).</script></div></div><p>The likelihood for getting <script type="math/tex">y_i</script> is</p><div cid="n317" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n317" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-32" type="math/tex; mode=display">\Pr_Y(y_i;c_1...c_n, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2} \left( \frac{y_i - (c_1 e_{1,i} + ... + c_n e_{n,i})} {\sigma} \right)^2},</script></div></div><p>once again copying out the likelihood function for normal distributions.</p><p>Now remember that we just want to fit <script type="math/tex">c_1</script> through <script type="math/tex">c_n</script>. These only occur in the exponent, so we can ignore all the constants out front, and also we can see that since there's a negative in the exponent, maximising it is equivalent to minimising the stuff in the exponent. 
Taking out <script type="math/tex">\sigma</script> and constants, the relevant stuff to minimise is</p><div cid="n320" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n320" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-33" type="math/tex; mode=display">(y_i-(c_1 e_{1,i} + ... + c_n e_{n,i}))^2,</script></div></div><p>where we can see that the thing we subtract from <script type="math/tex">y_i</script> is our model's prediction of <script type="math/tex">y_i</script> (one component of what we previously denoted <script type="math/tex">\hat{\pmb{y}}</script>). Once again, we can see we're minimising a square of the error. Of course, we have many <script type="math/tex">y</script>-values to fit; to see that it's the sum of these that we minimise, rather than some other function of them, just note that if we take a logarithm we'll get a term like the above (times constants) for each data point we're using to fit.</p><p>So least-squares fitting comes from MLE and the assumption of normally distributed errors.</p><p>(Are errors normally distributed? Often yes. Remember though that our features are functions of things we measure; even if <script type="math/tex">x</script> has normally-distributed errors, after we apply an arbitrary function to it to generate feature <script type="math/tex">e</script>, the resulting <script type="math/tex">e</script> might not have normally distributed errors (but for many simple functions it still will). We could be more fancy, and devise other fitting procedures, but often least squares is good enough.)</p><h3>Empirical distributions</h3><p>What's the simplest probability model we can fit to a dataset? It's tempting to think of an answer like "a normal distribution", or "a linear model with one linear feature". 
But we can be even more radical: treat the dataset itself as a distribution.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-U2MRZRorb0c/X-3is6acicI/AAAAAAAACJ0/9LgERk6tfJA86hT_pXcQWbxDS_phNjPoQCLcBGAsYHQ/epdf.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="686" data-original-width="1278" height="344" src="https://lh3.googleusercontent.com/-U2MRZRorb0c/X-3is6acicI/AAAAAAAACJ0/9LgERk6tfJA86hT_pXcQWbxDS_phNjPoQCLcBGAsYHQ/w640-h344/epdf.png" width="640" /></a></div><p></p><p>On the left, we've plotted the number of data points that take different values of <script type="math/tex">x</script> (this is a discrete distribution; for a continuous distribution, the probability that any two samples drawn are equal is infinitesimal). On the right, all we've done is normalised the distribution, by rescaling the vertical axis so that the heights of all the bars sum to one. Once we've done that, we can go ahead and call it a probability distribution, and assign the meaning that the height of the bar at <script type="math/tex">x</script> is the probability that the distribution <script type="math/tex">X</script> that we've just defined takes the value <script type="math/tex">x</script>. This is called an empirical distribution.</p><p>Sampling from an empirical distribution is easy – just pick a value at random from the dataset. (Of course, the likelihood such a distribution assigns to any value not in the dataset is zero, which can be a problem for many use cases.)</p><p>In fact, you've probably already dealt with empirical distributions, at least implicitly. When you calculate the mean and variance of a dataset, you can interpret this as calculating the properties of the empirical distribution given by that dataset. 
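As a concrete sketch (dataset made up), building an empirical distribution is just counting and normalising, sampling from it is drawing a data point at random, and its mean is exactly the dataset mean:

```python
import random
from collections import Counter

data = [2, 2, 3, 3, 3, 5, 7, 7]  # hypothetical discrete dataset
n = len(data)

# Empirical distribution: Pr(v) = (count of v) / n.
empirical = {v: count / n for v, count in Counter(data).items()}
print(empirical)  # {2: 0.25, 3: 0.375, 5: 0.125, 7: 0.25}

# Sampling from it is just picking a data point uniformly at random.
sample = random.choice(data)

# Its mean equals the ordinary mean of the dataset.
mean = sum(v * p for v, p in empirical.items())
print(mean)  # same as sum(data) / n = 4.0
```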
An empirical distribution as an abstract thing apart from your dataset may seem ad hoc, but it's not any less defined than a normal distribution.</p><p>The standard way to illustrate an empirical distribution is by plotting its cumulative distribution function (cdf); an empirical one is known as an ecdf. This is almost necessary for continuous variables. In general, the ecdf of a dataset is a very useful and general way to visualise it: it saves you from the pains of histograms (how large to make the bins? if you take logs or squares first, do you take them before or after binning? etc. etc.), and is also complete in the sense of technically displaying every point in the dataset.</p><p>The ecdf for the above distribution would look something like this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-rDGG7QbrniA/X-3iwUtC67I/AAAAAAAACJ8/_chL29192uMytLKCYpPueOhoJtQdOrDNACLcBGAsYHQ/ecdf.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="530" data-original-width="1000" height="340" src="https://lh3.googleusercontent.com/-rDGG7QbrniA/X-3iwUtC67I/AAAAAAAACJ8/_chL29192uMytLKCYpPueOhoJtQdOrDNACLcBGAsYHQ/w640-h340/ecdf.png" width="640" /></a></div><p></p>(Like any cdf, it takes the value 0 up until the first data point and the value 1 after the last data point.) <p>If we now fit any parametric (i.e. non-empirical) distribution, comparing its cdf to the ecdf is a good test of how good the fit is.</p><h4>Measuring the goodness of a model fit with KL divergence</h4><p>The empirical distribution is the best possible fit to a given dataset, and therefore it's a good benchmark to measure the fit of a proposed model against.</p><p>Let's say our data is <script type="math/tex">x=x_1, ... ,x_n</script>, and the empirical distribution is <script type="math/tex">X^*</script>. 
The likelihood of drawing <script type="math/tex">x</script> from <script type="math/tex">X^*</script> is (under the assumption of each <script type="math/tex">x_i</script> being drawn independently)</p><div cid="n338" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n338" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-34" type="math/tex; mode=display">\Pr_{X^*}(x_1) \cdot ... \cdot \Pr_{X^*}(x_n).</script></div></div><p>Now <script type="math/tex">\Pr_{X^*}(x_i)</script> is just the fraction of the <script type="math/tex">x_j</script> in <script type="math/tex">x</script> that are equal to <script type="math/tex">x_i</script>. Writing <script type="math/tex">N_{x_i}</script> to mean the number of values equal to <script type="math/tex">x_i</script> in the data, we can write</p><div cid="n340" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n340" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-421" type="math/tex; mode=display">\Pr_{X^*}(x_i) = \frac{N_{x_i}}{n}.</script> </div></div><p>Taking logs, and writing <script type="math/tex">q_v = N_{v} / n = \Pr_{X^*}(v)</script>, the product above becomes a sum over the possible values <script type="math/tex">v</script> of the <script type="math/tex">x_i</script>, giving the log likelihood:</p><div cid="n342" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n342" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-430" 
type="math/tex; mode=display">\sum_{v} N_{v} \log(q_v).</script> </div></div><p>Now we'll do one last trick, which is to scale by <script type="math/tex">1/n</script>; otherwise, the term in front of the log will tend to be bigger if we have more data points, while we want something that means the same regardless of how many data points there are. After we do that, we notice a nice symmetry:</p><div cid="n344" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n344" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-404" type="math/tex; mode=display">\sum_{v} q_v \log(q_v).</script> </div></div><p>This is a good baseline to compare any other model to. For example, let's say we fit to this a (discrete) distribution <script type="math/tex">X</script> (with the same sample space as <script type="math/tex">X^*</script>) with parameters <script type="math/tex">\theta</script>. 
Write <script type="math/tex">p_v = \Pr_X(v; \theta)</script>, and we can express the log likelihood of the dataset as</p><div cid="n346" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n346" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-433" type="math/tex; mode=display">\sum_{v} N_{v} \log(p_v).</script> </div></div><p>Normalising by <script type="math/tex">1/n</script> as before, we get</p><div cid="n348" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n348" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-435" type="math/tex; mode=display">\sum_{v} q_v \log(p_v).</script> </div></div><p>Now to get a measure of goodness of fit, just subtract, and do some algebra on top if you feel like it:</p><div cid="n350" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n350" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display"></div><script id="MathJax-Element-437" type="math/tex; mode=display">\sum_{v} q_v \log(q_v) - \sum_{v} q_v \log(p_v) \\ = \sum_{v} q_v \log(q_v/p_v) \\ = \sum_{v} \Pr_{X^*}(v) \log\left(\frac{\Pr_{X^*}(v)}{\Pr_X(v;\theta)}\right).</script> </div></div><p>(In the last step, I've just expanded out our earlier definitions of <script type="math/tex">p_v</script> and <script type="math/tex">q_v</script>.)</p><p>This is called the Kullback-Leibler divergence (KL divergence). 
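As a sketch of the computation (function and variable names are mine; this assumes the model assigns nonzero probability to every value that appears in the data):

```python
import math
from collections import Counter

def kl_from_empirical(data, model_pmf):
    """KL divergence from the empirical distribution of `data` to the
    model pmf, in nats (use log base 2 for bits)."""
    n = len(data)
    # q_v = N_v / n, the empirical probabilities.
    q = {v: count / n for v, count in Counter(data).items()}
    # sum over values v of q_v * log(q_v / p_v)
    return sum(q_v * math.log(q_v / model_pmf[v]) for v, q_v in q.items())
```

For instance, a fair-coin model is a perfect fit to a perfectly balanced dataset, so `kl_from_empirical([0, 1, 0, 1], {0: 0.5, 1: 0.5})` comes out to 0; skew the data while keeping the same model and the divergence becomes positive.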
If <script type="math/tex">X=X^*</script>, then it comes out to 0; for worse fits, the value becomes greater.</p><p>There's a nice information theoretic interpretation of this result. <script type="math/tex">- \sum_{v} q_v \log_2(p_v)</script> is the average number of bits needed to most efficiently represent a value randomly drawn from the dataset, using a coding scheme optimised for the distribution <script type="math/tex">X</script>. </p><p> </p><p style="text-align: center;"><a href="http://strataoftheworld.blogspot.com/2021/01/data-science-2.html">Next post</a> <br /></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-66658972369188355342020-12-17T08:24:00.006+00:002022-03-31T22:59:01.819+01:00Review: Foragers, Farmers, and Fossil Fuels<p style="text-align: center;"><i><span style="font-size: x-small;">Book: Foragers, Farmers, and Fossil Fuels: How Human Values Evolve,</span><span style="font-size: x-small;"> by Ian Morris (2015)</span><span style="font-size: x-small;"><br />7.8k words (about 26 minutes)</span></i></p><p style="text-align: center;"><i><span style="font-size: x-small;"> </span></i></p><p style="text-align: center;"><i><span style="font-size: x-small;"> This post has also been published <a href="https://www.lesswrong.com/posts/nsFpCGPJ6dfk9uFkR/review-foragers-farmers-and-fossil-fuels">here</a>.</span></i><br /></p><p style="text-align: center;"> </p><p style="text-align: left;">Two hundred years ago, most people lived in societies that considered slavery, war, and discrimination based on class, ethnicity, and gender to be justifiable. Today, most people live in societies that hold the opposite beliefs.</p><p style="text-align: left;">What changed? 
A simple and tempting narrative is that we have simply become wiser; that various Enlightenment philosophers, thoughtful activists, and other principled people figured out that the pre-industrial moral order is wrong and managed to persuade everyone to change.</p><p style="text-align: left;">It is true that many smart and principled people had good ideas and that this was a big proximate driver of better values. But is it a coincidence that this change in values happened around the same time as the industrial revolution?</p><p style="text-align: left;">What about the previous economic revolution, the agricultural one? Did that also coincide with a change in the values that people held? The evidence says yes – foraging societies tend to be more accepting of violence and far less accepting of hierarchy than farming ones.</p><p style="text-align: left;">The argument of Ian Morris' <i>Foragers, Farmers, and Fossil Fuels</i> is that these timings are not a coincidence. Societies that change their main method of getting energy also change their values, because some sets of values give greater success for a certain type of society. Farming societies that stick to anti-hierarchical forager attitudes won't survive competition with farming societies that learn to believe in hierarchies (maybe they won't be economically competitive and won't be able to field as big an army to defend themselves as the god-king next door can field to conquer them). Likewise, industrial societies that stick to inflexible hierarchies and elite-focused economies can't compete with more equal democracies that don't squander the talents of the non-elite, and maintain a well-looked-after middle-class of rich consumers and educated workers.</p><p style="text-align: left;">We can contrast two ways of trying to explain the history of values. The first says that the history of values is a history of ideas; a battle of ideas against other ideas, waged in the minds of people. 
The second says that the history of values is a history of what works best. The battle is between the benefits conferred by believing in certain ideas and those conferred by other ones, and it is waged out in the real world, where empires fall or rise based on whether they value the things that will lead them to success.</p><p style="text-align: left;">It is clear that neither style of explanation is enough on its own. No matter how persuasive it can be made, a sufficiently destructive idea – as an extreme example, that everyone should commit suicide – will not find its adherents in charge of the future (or coming from the opposite direction: why do you think many religions are so big on the "be fruitful and multiply" point?). On the other hand, no matter how practically useful a certain idea is, someone has to have the idea and persuade other people to adopt it as a value before it has a chance of spreading because of its practical benefits.</p><p style="text-align: left;">The question, then, is just how far can we push the deterministic account, where the methods of energy capture constrain values. In Ian Morris' telling, the answer is surprisingly far, and if his account of the history of values is correct, I agree with him (in particular, the similarities of farming society values across continents is hard to explain otherwise). 
However, I think Morris, along with most people who advance or accept similar arguments, goes too far with the moral pragmatism that these ideas may be thought to imply.</p><p style="text-align: left;">But first: what values did foragers, farmers, and fossil fuel users actually hold, and what is Morris' energy-based explanation of the changes between them?</p><p style="text-align: left;"> </p><h3 style="text-align: left;">Foragers</h3><p style="text-align: left;">Everyone has some idea of what a forager or hunter-gatherer is, but since we want to deal with differences between foragers and farmers, we want a clear idea of where the line is. Morris cites a good definition by Catherine Panter-Brick: foragers are people who "exercise no deliberate alteration of the gene pool of exploited resources". If you plant and harvest a few naturally occurring plants, you're still a forager, but when you start refining the crops generation by generation or breeding the animals, that's the point when you become a farmer.</p><p style="text-align: left;">Of course, there is a vast amount of variance in culture, lifestyle, and values between different forager bands. To <a href="https://condor.depaul.edu/~mfiddler/hyphen/humunivers.htm">almost</a> every generalisation about foragers, there exists some tribe that does the opposite. However, Morris argues that for each main type of human society (foraging/farming/industrial), it is useful to talk about the average set of values such societies held or tended to develop towards, at least in terms of the broad categories of tolerance of political/economic/gender hierarchy and propensity to violence. This covers up lots of important questions – different societies may have justified violence under different circumstances, or had different reasons for why economic inequality was acceptable, but such differences are sucked up into one category and ignored in this sort of analysis. 
That this makes sense will become apparent once we see that foragers, farmers, and fossil fuel users can be sensibly compared and contrasted even at this very general level.</p><p style="text-align: left;">In some ways, forager values are familiar. Even among foragers, possession and ownership are big deals, with every item generally having an owner. In other ways, they're surprisingly different.</p><p style="text-align: left;">Take violence. Though it's very difficult to come up with exact figures for anything to do with foragers (ancient foragers left behind only bones and tools, and modern foragers only live in places that farmers didn't want, so might not be a representative sample), the chance of dying by murder may have been around 10% in an average forager tribe, compared to 0.7% today, 1-2% across the 1900s (including all wars), roughly 5% in your average farming society or in the most murderous countries of today, and 20% for Poland during World War II.</p><p style="text-align: left;"> This was not recognised by anthropologists until the 1990s or so because, as Morris explains:</p><blockquote style="text-align: left;"><p><i>"[T]he social scale imposed by foraging is so small that even high rates of murder are difficult for outsiders to detect. If a band with a dozen members has a 10% rate of violent death, it will suffer roughly one homicide every twenty-five years, and since anthropologists rarely stay in the field for even twenty-five months, they will witness very few violent deaths."</i></p></blockquote><p style="text-align: left;">This is why Elizabeth Marshall Thomas' !Kung ethnography was called "The Gentle People", even though "their murder rate was much the same as what Detroit would endure at the peak of its crack cocaine epidemic".</p><p style="text-align: left;">Foragers are also extremely averse to hierarchy. 
Perhaps the best summary is given by a !Kung San forager asked about the absence of chiefs:</p><blockquote style="text-align: left;"><p><i>"Of course we have headmen! In fact we’re all headmen … Each one of us is headman over himself!"</i></p></blockquote><p style="text-align: left;">It's not just that foragers don't have strict hierarchies and this behaviour falls out naturally as a result; they are actively opposed to any sort of hierarchy or inequality. Material inequality is considered morally wrong, and fairness essential. Pressure to share spoils is applied liberally. And as in any group of humans, you'll have upstarts who try to achieve greatness and power, but such people usually have opposition groups immediately form to hold them back. Anthropologist Christopher Boehm calls these "reverse dominance hierarchies"; Morris translates this as "coalitions of losers".</p><p style="text-align: left;">The one sort of inequality that foragers aren't opposed to is gender inequality, with the dominant role in politics and violence generally falling to men (as an example of this attitude, Morris cites a forager of the Ona people (also known as the Selk'nam or Onawo) saying "the men are all captains and the women are sailors"). However, the gender inequality in forager societies is still on a different level from the extreme gender inequality and regimentation of farmer societies, and attitudes about sex were looser too. Morris writes that "abused wives regularly just walk away [...] without much fuss or criticism, and attitudes towards marital fidelity and premarital virginity tend to be quite relaxed".</p><p style="text-align: left;"> </p><h3 style="text-align: left;">Farmers</h3><p style="text-align: left;">As with foragers, Morris lumps together farming societies into one ideal type, labelled Agraria by Ernest Gellner. 
As before, this covers up a lot of variation (in particular, he identifies horticulturalists, city states like classical Athens or medieval Venice, and proto-industrial nations like Qing dynasty China, Mughal India, Ottoman Turkey, and Enlightenment Western Europe as the three extremes of Agraria), but Morris argues "the exceptions and sub-categories should not be allowed to obscure the reality of an ideal type representing in abstract terms the core features of peasant farming society". He cites Robert Redfield:</p><blockquote style="text-align: left;"><p><i>"[I]f a peasant from [any one of widely separated farming societies] could have been transported by some convenient genie to any one of the others and equipped with a knowledge of the language in the village to which he had been moved, he would very quickly come to feel at home. And this would be because the fundamental orientations of life would be unchanged. The compass of his career would continue to point to the same moral north."</i></p></blockquote><p style="text-align: left;">So what is the moral north of farming societies? Perhaps surprisingly, it's almost as hard to make definite conclusions about what anyone other than the elite thought in agrarian societies as it is to make conclusions about foragers.</p><p style="text-align: left;">While the elite read and wrote a lot, they didn't care much about what the peasants thought, and peasants were not literate. The most literate ancient societies – for example Athens in the 4th and 5th centuries BCE – had a <i>rudimentary</i> literacy rate of 10%, so one person in ten might be able to glean some meaning from words, but how well they could set down their thoughts on moral values is a different question. To get higher literacy rates, you have to move in time to the early second millennium, and in space to urban China or western Europe. 
Morris writes that "genuine mass literacy, with half or more of the population able to read simple sentences, belongs to the age of fossil fuels", and because of this, most of "our evidence for peasant experience comes from archaeology and accounts by twentieth-century anthropologists, rural sociologists, and development economists." If history is the written record of the past, then the majority of the population lived their lives outside history until the past century or two. (Perhaps we might even say that history in this sense only began with the internet age, when the private lives of everyone began being set down.)</p><p style="text-align: left;">Before going into the trickier question of values, we can compare foragers and farmers in some simple ways. First, farmers' energy consumption was higher. Foragers, like all humans, need to eat about eight and a half megajoules (2000 kilocalories) of energy as food per person per day to stay alive. Add cooking, and total energy consumption roughly doubles. The energy use of agrarian societies starts out at a forager level of around 20 MJ/person/day (5000 kcal), and goes up to the 100-150 MJ/person/day level (compare to 500 MJ/person/day (120 000 kcal), plus/minus a factor of two or so, for modern rich industrial nations).</p><p style="text-align: left;">Second, farming societies have very roughly perhaps half as many violent deaths as foragers, due to the existence of governments that at least occasionally kept the peace.</p><p style="text-align: left;">However, their life wasn't better on most metrics. In contrast to the literature (both then and now) full of "tales of vagabonds, wandering minstrels, and young men striking out to make their fortunes", "most farmers lived in worlds much smaller than most foragers had done, and never went much more than a day or two’s walk from the villages they were born in". 
Not only this, but:</p><blockquote style="text-align: left;"><p><i>"Excavated skeletons suggest that ancient farmers tended to suffer more than foragers from repetitive stress injuries; their teeth were often terrible, thanks to restricted diets heavy on sugary carbohydrates; and their stature, which is a fairly good proxy for overall nutrition, tended to fall slightly with the onset of agriculture, not increasing noticeably until the twentieth century AD."</i></p></blockquote><p style="text-align: left;">No farming society even managed to escape the repeating cycles of population growth and starvation that foragers were also prone to, despite having more direct control over their food supplies. Populations would increase to keep pace with the good times until all farmers were slaving away to stay at subsistence levels given the crowdedness and quality of the land. Then many would starve to death when the bad times came.</p><p style="text-align: left;">Another trend across the history of farming societies is three things coinciding: energy consumption rises above 40 MJ/person/day (twice the minimum agrarian level, which is also the typical forager level), towns grow past 10 000 people, and a few people take charge and start bossing around the others with their governments.</p><p style="text-align: left;">In farming societies, widespread respect and reverence for hierarchy was internalised by everyone. Morris writes that "[f]arming society often seemed obsessed with the symbolism of rank", and twentieth-century anthropologists "regularly found that having a healthy respect for authority – knowing your place – was a key part of their informants’ sense of themselves as good people". 
This often came, and still comes, as a surprise to non-farmers:</p><blockquote style="text-align: left;"><p><i>"[W]hen European reformers began venturing outside their urban enclaves into the countryside in the eighteenth century, they were often astonished that instead of complaining about inequality and demanding the redistribution of property, peasants largely took it as right and proper that most people were poor and weak while a few were rich and strong."</i></p></blockquote><p style="text-align: left;">Especially revered was the "Old Deal", Morris' term for the generalised social contract between classes in agrarian societies: that some have the duty to be commanders (or "shepherds of the people", in the preferred phrasing of many a king), others to obey those commands, and if everyone follows this script then things work fine.</p><p style="text-align: left;">Even when the powerful were questioned, the questioning didn't go as far as the Old Deal itself. In fact it rarely reached the king. "The tsar is good but the boyars [aristocrats] are bad", goes a Russian saying; even those who protested the powerful assumed that the highest levels of power must be good and holy, and the problems came from their will being incorrectly carried out by lesser lords. Even when the king himself came under fire, neither the Old Deal itself nor the inequality it entailed was questioned. 
The most common sort of rebellion against a king took what Morris calls a "good-old-days form": the justification was that the king had broken the Old Deal (or been abandoned by the gods or lost the Mandate of Heaven) and the urgent need was to restore the days when the <i>right</i> dictator was in charge, not to abolish dictatorship altogether.</p><p style="text-align: left;">There were exceptions – in the 1640s some Chinese peasants called themselves "Levelling Kings" and went around questioning who gave their rulers the right to call them serfs, and of course there's the gradual English case and the rather more abrupt French case – but these only came when the societies in question started hitting energy consumptions of 150 MJ/person/day, the very highest end that agrarian societies could achieve without a full-on industrial revolution.</p><p style="text-align: left;">(Morris implies that the energy consumption is the cause. This seems backwards; an explanation running through the institutions and organisation needed to sustain this energy level seems much more reasonable. In general, perhaps when Morris talks about "energy consumption", you should read "the societal factors that enable higher energy consumption" in its place.)</p><p style="text-align: left;">Given how anti-hierarchy foragers were, how did this come to be? Were the peasants all forced into a rigid hierarchy by ruthless elites?</p><blockquote style="text-align: left;"><p><i>'“You may fool all the people some of the time; you can even fool some of the people all the time; but you can’t fool all the people all the time,” Abraham Lincoln is supposed to have said (unless it was P. T. Barnum). But Korsgaard and Seaford apparently think that Lincoln/Barnum was wrong, and that for ten thousand years everyone in Agraria was led by the nose—women by men, poor by rich, everyone by priests—and robbed blind. This I just cannot credit. 
Humans are the cleverest animals on the planet (for all we know, the cleverest in the whole universe). We have worked out the answers to almost every problem we have ever encountered. So how, if farming values were really just a trick perpetrated by wicked elites, did they survive for ten millennia? Most of the farmers I have met have been canny folk; so why could farmers in the past not figure out what was going on behind the wizard’s veil?</i></p><p><i>The answer, in my opinion, is that there was no veil. The veil is a figment of modern academics’ imaginations, made necessary by the assumption that only a tiny elite could possibly have thought that hierarchy was a good thing. In reality, farmers had farming values not because they fell for a trick but because they had common sense.'</i></p></blockquote><p style="text-align: left;">It is clearly a mistake to think that farmers participated in farming society and its values through gritted teeth. However, I don't think it was so much farmers' common sense that made them adopt farming values. Societies that brainwashed their members into sincerely accepting farming-era hierarchies did better, and eventually all farming societies mastered this art. </p><p style="text-align: left;"> </p><h4 style="text-align: left;">Specific inequalities: forced labour and patriarchy</h4><p style="text-align: left;">In addition to the general extreme hierarchy of farming societies, there are two specific types of inequality that are both interesting in their causes and tragic in their consequences.</p><p style="text-align: left;">The first is slavery, and forced labour more generally. Both are almost entirely absent in foraging bands, which might take captives from other tribes but usually eventually integrate them into the tribe rather than keeping them forever as slaves. In contrast, some form of forced labour is found in almost every agrarian society.</p><p style="text-align: left;">Why? 
Because financial institutions weren't strong enough. Markets for labour existed almost everywhere, but there was a problem: “anyone who had enough land to support a family preferred to make a living by working it rather than by selling labor”, because, without reliable banks for everyone, keeping a good farm was the only robust way to accumulate and maintain wealth, especially for your children. When it was time for a big construction project (maybe the pharaoh died and you need a pyramid to bury him in), even wealthy employers like the state couldn't always hire enough workers. Often they resorted to violence to lower the costs of labour. Violence, after all, came cheap.</p><p style="text-align: left;">The second specific kind of inequality was male domination and strict gender roles. Morris offers a two-pronged explanation. First, farmer men had more reason than forager men to keep farmer/forager women under control:</p><blockquote style="text-align: left;"><p><i>“The main reason that male foragers generally care less than male farmers about controlling women [...] is that foragers have much less to inherit than farmers. [...] [Q]uestions about the legitimacy of children matter a lot less than they do when only legitimate offspring will inherit land and capital.” </i></p></blockquote><p style="text-align: left;">(We might ask why farming societies were so strict about only legitimate offspring inheriting property, but perhaps this is a case of biological values limiting the space of cultural variation.)</p><p style="text-align: left;">Second, gender roles became more regimented out of necessity. Agricultural work – plowing, manuring, and irrigation – relies on brute upper body strength, which favours males. Farmers worked harder in general than foragers, so more male-specific strength-based work also pushed everything else – home upkeep (which foragers didn't need to do) and food processing – onto women. 
As early as 7000 BCE, skeletons from Syria suggest that both genders regularly carried heavy loads, but only women had an arthritic condition caused by kneeling and footwork, probably as a result of grinding grain.</p><p style="text-align: left;">Finally, childbearing is obviously restricted to women. With the advent of farming, the doubling time for populations fell by a factor of five, from ten thousand to two thousand years. <a href="https://ourworldindata.org/child-mortality-in-the-past">Infant mortality</a> seems not to have changed, so this is due to increased birth rates alone.</p><p style="text-align: left;">Morris writes that this decision on gender norms seems so obvious that "no farming society that moved beyond horticulture ever seems to have decided anything else". According to him, "if we sit theorizing in our fossil-fuel studies" we might imagine an alternative where women had the upper hand, "sending otherwise-useless men out to labor for them in the fields, but in reality, the organizational needs of farming societies gave men the means to inflict devastating economic pain on faithless wives while also raising the costs for men of failing to deter women from bringing cuckoos back to the nest". The empirical correlation between gender inequality and farming societies seems strong and Morris' arguments are plausible, but whether they're the final word is less clear.</p><p style="text-align: left;">Of course, you can't hold everyone down all the time. Morris lists many historical cases of people who were slaves and/or women, but nevertheless defied expectations and attained great success. 
For example, Morris tells the story of an Athenian slave banker called Pasion, who did so well that he was eventually able to buy not only his own freedom but also the bank itself.</p><p style="text-align: left;">(Interestingly, <a href="https://en.wikipedia.org/wiki/Pasion">Wikipedia</a> tells the story slightly differently, saying he was manumitted as a reward for his work, and inherited the bank after his former owners retired, rather than by buying it outright. Wikipedia cites the 1971 <i>Athenian Propertied Families</i> by J. K. Davies; Morris cites Edward Cohen's <i>Athenian Economy and Society</i> and Jeremy Trevett's <i>Apollodoros Son of Pasion</i>, both from 1992. I don't know who to believe, or whether a consensus exists.)</p><p style="text-align: left;">Morris' harsh conclusion is that both forced labour and patriarchy were "functionally necessary to farming societies that generated more than 10k kcal/cap/day [42 MJ/cap/day]".</p><p style="text-align: left;"> </p><h3 style="text-align: left;">Fossil-fuel users</h3><p style="text-align: left;">Many places underwent the agricultural revolution independently of each other, because farming spread slowly enough that distant people could invent it on their own before the waves of someone else's discovery of farming reached them. In contrast, the industrial revolution happened in north-west Europe fast enough, and gave big enough advantages, that no other region had an independent industrial revolution.</p><p style="text-align: left;">The culture and values of the post-industrial West – democracy, human rights, individualism, market-orientedness, and so on – are often labelled Western. In some sense this is a tautology; by definition, these are the values that Western countries have at the moment. 
The label is also used in a deeper sense, to mean that there is some kernel of Westernness in these values that makes them the logical conclusion of pre-industrial Western thought, and perhaps incompatible with different cultural bases.</p><p style="text-align: left;">One consequence of Morris' arguments is that this perspective is wrong. What we might call Western values are no more Western values than farming-era values are Sumerian values (or Indus Valley values or Mesoamerican values or ...); the reason Western values are called Western values but farming values aren't called Sumerian values is that the industrial revolution spread faster than the agricultural one. To explain Western values we should look not at ancient Greek philosophers and whatnot but at the demands of industrialised societies. </p><p style="text-align: left;">This does not mean that every industrialised society will approach the West in its values, only that the pressures are there (and wily enough dictators or future technological trends may be enough to avoid them). It might also be that the reason that Europe underwent an industrial revolution while other societies at the edges of agrarian achievement did not is that, by accidents of history and geography, pre-industrial north-west European values were closer to modern industrial values than those of the other societies that have stood at the cusp of industrialisation.</p><p style="text-align: left;">But the overall conclusion remains: <a href="https://slatestarcodex.com/2016/07/25/how-the-west-was-won/">"Western" values are the universal values</a> that industrialised societies tend towards. The conflict between Boko Haram or the Taliban and the West, to use two of Morris' examples, is not so much a conflict of culture versus culture, but of era versus era; a last stand of the hierarchy- and patriarchy-obsessed farming values that were held by everyone (except a forager here or there) until a few hundred years ago. 
On a more granular level, the steady retreat of discrimination and formality from Western societies is simply the gradual acceptance that these vestiges of the farming era are no longer useful.</p><p style="text-align: left;">As with the transition to farming society, there's the question of how people eventually came to hold stances almost opposite to what their ancestors had believed. The question is even more pressing than with the agricultural revolution, because the timescale of the changes is so short. But once again, a lot of it was driven by economics.</p><p style="text-align: left;">The first step was people moving from countryside farming to factory jobs:</p><blockquote style="text-align: left;"><p><i>“Nineteenth-century sources make it very clear that entering the wage-labor market could be a traumatic experience, requiring workers to submit to strict time discipline and factory conditions unlike anything they had known in the countryside; and yet millions chose to do so, because the alternative—hunger—was worse.</i></p><p><i>So eager were poor farmers for dirty, dangerous factory jobs that British employers only needed to increase wages by 5 percent (in real terms) between 1780 and 1830, although output per worker grew by 25 percent. Wage increases accelerated only in the 1830s, and even then only for urban workers. The great motor was productivity, which was now rising so high that employers began finding it cheaper to share some of their profits with their workers than to try to break strikes. (In another great irony, by the time that Dickens, Marx, and Engels were writing, wages were rising faster than ever before in history.) For the next fifty years, wages rose as fast as productivity; after 1880, they rose even faster.
By then, incomes were beginning to rise in the countryside too.”</i></p></blockquote><p style="text-align: left;">One resulting value change was the abolition of forced labour:</p><blockquote style="text-align: left;"><p><i>“By making wage labour attractive enough to draw in millions of free workers, higher wages made forced labor less necessary, and because impoverished serfs and slaves—unlike the increasingly prosperous wage labourers—could rarely buy the manufactured goods being churned out by factories, forced labour increasingly struck business interests as an obstacle to growth (especially when it was competitors who were using it).”</i></p></blockquote><p style="text-align: left;">The farmer-era justifications for gender hierarchy also broke down. First, industrialised societies had less need for brute strength and more need for organisational work, in which there is no gender disparity. Second, birth rates eventually went down, reducing the amount of time women spent on children. As a result, almost universal male dominance during the farming era has given way to a world where 81% of people say gender equality is important, including 98% in Britain but also over 90% of Indonesians and Turks and even 78% of Iranians (India, with a very low 60% and a huge population, is probably the biggest drag on the average).</p><p style="text-align: left;">Morris offers a great summary of the principles of success in agrarian versus industrial societies:</p><blockquote style="text-align: left;"><p><i>“Agraria had worked by drawing lines, not just between elite and mass or men and women, but also between believers and nonbelievers, pure and defiled, free and slave, and countless other categories. Each group was assigned its place in a complex hierarchy of mutual obligations and privileges, tied together by the Old Deal and guaranteed by the gods and the threat of violence. Fossil-fuel societies, however, work best by erasing lines. 
The more a group replaces the rigid structure of figure 3.6 with the anti-structure of figure 4.7—a completely empty box, made up of interchangeable citizens—the bigger and more efficient its markets will be and the better it will function in the fossil-fuel world.”</i></p></blockquote><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-iv_1_ZRRHZE/X9sUmDg8p0I/AAAAAAAACAA/dcLobQYU-0UJZGKgb7zk5yfTtgEi_0Z5wCLcBGAsYHQ/s1074/agraria.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1036" data-original-width="1074" height="617" src="https://1.bp.blogspot.com/-iv_1_ZRRHZE/X9sUmDg8p0I/AAAAAAAACAA/dcLobQYU-0UJZGKgb7zk5yfTtgEi_0Z5wCLcBGAsYHQ/w640-h617/agraria.png" width="640" /></a></div><br /><p style="text-align: left;">The most successful agrarian societies have a social structure like the one above; the most successful industrial societies look like this instead:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Od3tKKJX0BE/X9sU5tjHsRI/AAAAAAAACAY/fL1F4M5b1hoz658MhCCf3Av-oI6mb1K7gCLcBGAsYHQ/s952/industria.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="948" data-original-width="952" height="638" src="https://1.bp.blogspot.com/-Od3tKKJX0BE/X9sU5tjHsRI/AAAAAAAACAY/fL1F4M5b1hoz658MhCCf3Av-oI6mb1K7gCLcBGAsYHQ/w640-h638/industria.png" width="640" /></a></div>This, in a nutshell, is why agrarian societies tend towards extreme hierarchy while industrial societies tend towards a social structure of interchangeable mobile individuals, free to do what they want and incentivised to slot themselves wherever they create the most value (at least economically). <p style="text-align: left;">With industrialisation, we've managed to roll back the discrimination and hierarchy of the farming age.
We've even gone back to valuing fairly flat political hierarchies like the foragers (though we maintain them through democratic institutions rather than "coalitions of losers"), and become more egalitarian about gender than the foragers were, all the while living in societies far less violent than the average hunter-gatherer band.</p><p style="text-align: left;">There is one area where we're more tolerant of hierarchy than foragers, though: economic inequality. Once again the reason is practical: </p><blockquote style="text-align: left;"><p><i>“[...] Industria can flourish only if it has affluent middle and working classes that create effective demand for all the goods and services that fossil-fuel economies generate, but on the other, it also needs a dynamic entrepreneurial class that expects material rewards for providing leadership and management. In response, fossil-fuel values have evolved across the last two hundred years to favor government intervention to reduce wealth equality—but not too much.”</i></p></blockquote><p style="text-align: left;">However, even then we still abhor the farmer-era standard of seeing it as fair when the elite extract as much as they can from everyone under them. Indeed, the mere fact that calling elites extractive has become a good political weapon shows how far we've come – as discussed in the farming section, farming-era people saw ruthlessly extractive elites as part of a fair social contract.</p><p style="text-align: left;"> </p><h3 style="text-align: left;">A summary of value evolution?</h3><p style="text-align: left;">We've just gone over a lot of detail about forager, farmer, and fossil-fuel-user values, and some reasons why values might have developed in the way they did. Is this a story of a random path through the stages of technological development, with harsh selection pressures making sure that societal values are dragged along for the ride?
Or is there some pattern to the madness?</p><p style="text-align: left;">Morris' summary table does a good job of summing up the "what" of it:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-pgvDLzeIs0Q/X9sVC_1IOnI/AAAAAAAACAc/dA_v71VzNyYr5MPqzX32kq2akNKhZ1EQwCLcBGAsYHQ/s1118/summarytable.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="622" data-original-width="1118" height="357" src="https://1.bp.blogspot.com/-pgvDLzeIs0Q/X9sVC_1IOnI/AAAAAAAACAc/dA_v71VzNyYr5MPqzX32kq2akNKhZ1EQwCLcBGAsYHQ/w640-h357/summarytable.png" width="640" /></a></div><p style="text-align: left;">Two things leap out from this table, especially if we plot it graphically: when it comes to attitudes towards hierarchy, fossil-fuel users are much closer to foragers than farmers are to anyone, and violence has gone down all along.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-Pk02LU-xBAU/X9sVKg_tCoI/AAAAAAAACAk/zsRk0_bGIxoWxhcWYxB97t_oVebrlKgewCLcBGAsYHQ/s1778/graph.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="994" data-original-width="1778" height="358" src="https://1.bp.blogspot.com/-Pk02LU-xBAU/X9sVKg_tCoI/AAAAAAAACAk/zsRk0_bGIxoWxhcWYxB97t_oVebrlKgewCLcBGAsYHQ/w640-h358/graph.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">(Slide from a talk I gave at EA Cambridge)<br /></td></tr></tbody></table><p style="text-align: left;"> </p><p style="text-align: left;">Other people have noticed this; economist and futurist Robin Hanson has written about the modern conservative-liberal axis mapping onto how willing people are to abandon farming ways and revert to more forager-like lifestyles and values as societies grow richer (as some people inexplicably prefer
writing in digestible chunks rather than monolithic book-length blog posts, it's hard to give just one or two key links, but see for example <a href="https://www.overcomingbias.com/2012/05/forager-vs-farmer-morality.html">here</a>, <a href="https://www.overcomingbias.com/2017/08/forager-v-farmer-elaborated.html">here</a>, <a href="https://www.overcomingbias.com/2015/08/specific-vs-general-foragers-farmers.html">here</a>, and <a href="https://www.overcomingbias.com/2010/10/two-types-of-people.html">here</a>). </p><p style="text-align: left;">Perhaps we can tell a story like this: in the beginning there were foragers. They tended to live as people tend to do, and value the things that evolution had crafted people to want. Humans being humans, there was a lot of politicking, and with no institutions to restrain it, a fair amount of violence. The outside world was harsh and outside anyone's control.</p><p style="text-align: left;">Then the agricultural revolution slowly crept across the world. At first people lived as before, but generation by generation it turned out that the societies that managed to best persuade people to accept a bit more hierarchy – to show a bit more obedience to the chiefs, grant a bit less non-reproductive status to women – did a bit better than the others. Over millennia, such societies either had their tricks independently discovered or copied by others, or else went on the warpath outright to subjugate other societies to their rule – and, of course, preach their values, which (given human adaptability) they held sincerely, and with no idea that they thought differently from their distant ancestors. Eventually, the big tricks – organised religion and the god-kings keeping power by letting their henchmen extract as much as they could from their subjects – became almost universal.
They also lowered the level of violence by imposing some amount of internal order and perhaps a culture promoting peaceful conflict resolution, if only to spare more strength to throw at neighbouring societies.</p><p style="text-align: left;">Then came the industrial revolution, and suddenly what mattered was how well a society could harness the talents of its members and establish efficient, competitive markets to drive innovation. This created pressures to democratise and erase lines between people. Technology and wealth also increased people's ability to control their lives. Rich and comfortable industrialised people no longer needed to abide by strict farming-era social rules to survive, and so slowly gave up on them, reverting to more forager-like ways, though with the added advantages of unprecedented peace and material wellbeing. </p><p style="text-align: left;"> </p><h3 style="text-align: left;">How selection pressures change values</h3><p style="text-align: left;">The reasons why societies tend to adopt pragmatic values are subtle; it's not as if people go around cynically holding the values that will best contribute to their tribe's or society's long-term success. As a result, Morris' descriptions of how selection pressures do their work are worth quoting at length.</p><p style="text-align: left;">First, here's how farmers ended up dominating the world in the first place:</p><blockquote style="text-align: left;"><p><i>“The first farmers had free will, just like us. As their families grew, their landscapes filled up. […] For all we know, some foragers in the Jordan Valley ten thousand years ago [chose to remain foragers]. The problem, though, was that they were not making a one-time choice. Tens of thousands of other people were asking the same question, and each family had to revisit the decision of whether to intensify or go hungry multiple times every year.
Most important of all, each time one family chose to work harder and intensify its management of plants and animals, the payoffs from sticking with the old ways declined a little further for everyone else. Every time cultivators started thinking of the plants and animals on which they lavished care and attention as their personal gardens and flocks, not part of a common stock, hunting and gathering would become that much more difficult for those who stuck to it. Foragers who clung stubbornly and/or heroically to the old ways were doomed because the odds kept tilting against them.”</i></p></blockquote><p style="text-align: left;">But how did this result in a world of dictator kings? Morris:</p><blockquote style="text-align: left;"><p><i>“We should probably assume that people tried lots of different ways to solve the collective action problem of how to create larger, more integrated societies with more complex divisions of labor as they moved from foraging to farming, but almost everywhere, it seems that the solution that worked best was the idea of the godlike king.”</i></p></blockquote><p style="text-align: left;">Morris isn't very clear on why godlike kings, out of all possible forms of social organisation, worked best. 
We can imagine that it's hard to coordinate big armies for defence or offence without one, or that the symbolism of a godlike figurehead is the most reliable way to unite masses in a largely illiterate society, or vaguely gesture, like Morris does, at the challenges of managing complex societies, but there doesn't seem to be much hard evidence or argument for a precise mechanism one way or the other, at least in <i>Foragers, Farmers, and Fossil Fuels</i>.</p><p style="text-align: left;">In general, <a href="https://en.wikipedia.org/wiki/Collective_action_problem">collective action problems</a> are important in any large organisation, and the simplest solution is complete centralisation, which effectively reduces collective action problems back into individual action problems. Of course, this comes with all the cruelties and inefficiencies of real-world non-omnibenevolent, non-omniscient centralised decision-making. Given this, was the centralisation-vs-decentralisation tradeoff really so simple in the farming era that "godlike kings everywhere" was the only effective answer? Perhaps the tradeoffs really were that one-sided in the farming age, and this became a trickier question only in the industrial age when nurturing human talent and prosperity became key societal goals, and we created effective decentralised institutions like free markets and democracy.
Or maybe there was a high but not extreme level of optimal centralisation, but the greed of individual rulers often pushed their societies past this level despite selection pressures working in favour of more responsibly led societies, and it was only with the industrial age that these pressures became high enough to force the world away from the godlike king model.</p><p style="text-align: left;">Morris also describes the rise of capitalism:</p><blockquote style="text-align: left;"><p><i>“Capitalism took off in early-modern Western Europe because practical people figured out that this was the most effective way to get things done in an increasingly energy-rich world. Other people disagreed, and did things differently. Conflicts and compromises ensued as the competitive logic of cultural evolution went to work and drove the less effective ways extinct.”</i></p></blockquote><p style="text-align: left;">Once again, I think the concept of selection pressures is a powerful lens, but the details of what drives the relationship are missing. What exactly was it about an energy-rich environment that made capitalism ideal? Even by Morris' own account, it seems the methods (e.g. complex manufacturing chains, mature financial institutions, etc.) required to most effectively extract and use energy given a particular technology level are what matter, not the raw total of joules consumed per person per day.</p><p style="text-align: left;"> </p><h3 style="text-align: left;">Respondents</h3><p style="text-align: left;"><i>Foragers, Farmers, and Fossil Fuels</i> originated from the Tanner Lectures at Princeton.
As part of the format, the book includes four responses to Morris' arguments, by Richard Seaford, Jonathan Spence, Christine Korsgaard, and Margaret Atwood.</p><p style="text-align: left;">On the whole, these responses don't add much to the book, though they are helpful in making Morris elaborate on his arguments in the final chapter (cheekily entitled "My Correct Views on Everything"). </p><p style="text-align: left;">Seaford and Spence provide short chapters that seem to be more about their own interests than Morris' arguments, and have the tone of questions asked by professors who slept through the talk but are still trying to say something insightful during the question session.</p><p style="text-align: left;">Atwood, of <i>The Handmaid's Tale</i> fame, brings an arsenal of literary flair to bear on the task. She manages to make some good points (what about horse-riding pastoralists, who may have been the first large-scale war-makers?), along with some ridiculous statements:</p><blockquote style="text-align: left;"><p><i>“Several billion years ago, marine algae produced the atmosphere that allows us to breathe, and these algae continue to produce from 60 to 80 percent of our oxygen. Without marine algae, we ourselves cannot survive. During the Vietnam War, huge vats of Agent Orange were being shipped across the Pacific. Should they have sunk and leaked, we would not be having this conversation today.”</i></p></blockquote><p style="text-align: left;">Let's do some very rough calculations. If all the Agent Orange deployed in Vietnam had been uniformly distributed across the Pacific, the mass concentration of its component acids (making the highest assumptions about what concentration it was sprayed at) would have been lower than one part in tens of trillions, a hundred thousand times lower than the mass concentrations of either lead or mercury already in the oceans.
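For concreteness, the dilution estimate can be redone in a few lines. The round figures below (tonnage sprayed and Pacific volume) are my own assumptions from commonly cited numbers, not figures from the book or from Atwood:

```python
# Back-of-envelope check of the dilution estimate, using assumed round numbers.
agent_orange_kg = 5e7   # assumption: ~45 million litres sprayed, density ~1 kg/L
pacific_kg = 7e20       # assumption: Pacific volume ~7e17 m^3 at ~1000 kg/m^3

# Mass fraction if all of it were mixed uniformly into the Pacific.
mass_fraction = agent_orange_kg / pacific_kg
print(f"{mass_fraction:.0e}")  # ~7e-14, i.e. below one part in ten trillion
```

Even if the tonnage is rounded up generously, the result stays below one part in ten trillion, consistent with the "lower than one part in tens of trillions" figure.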
I couldn't find any study of what happens to algae in oceans if you dump Agent Orange on them, but one article about using algaecide in swimming pools says applying one ten-thousandth of the pool volume is typical. Another article mentions 5-10% as a common active-ingredient concentration in algaecide, giving an algae-killing concentration of maybe 1 in 100 000 in the pool water. Agent Orange would need to kill algae at ten million times lower concentrations in oceans than commercial algaecide does in swimming pools for the Pacific's oxygen production to be destroyed.</p><p style="text-align: left;"> (Or maybe Atwood means the literal sense that, because of various butterfly effects, any such change in history makes any present event, including this conversation, unlikely?)</p><p style="text-align: left;">By far the most substantive response comes from the philosopher Christine Korsgaard. She also has the idea that the farming era was an aberration, with a fresh interpretation:</p><blockquote style="text-align: left;"><p><i>“Instead of thinking that values are determined by modes of energy capture, perhaps we should think that as human beings began to be in a position to amass power and property in the agricultural age, forms of ideology set in that distorted real moral values [i.e. the values a society should hold], distortions that we are only now, in the age of science and extensive literacy, beginning to overcome.”</i></p></blockquote><p style="text-align: left;">More significantly, she makes a distinction between the values a society holds and values that should be held (“positive values” and “real moral values” respectively), in contrast to Morris' arguments that such a distinction is meaningless and the only real distinction is between biological values and the form they take in a given society.
Her response manages to pick away at Morris' nonchalant bulldozing of all philosophical subtleties.</p><p style="text-align: left;">Responding to this in the last chapter, Morris quotes, and then dismisses, Ernest Gellner's response to a social theory presentation at an archaeology conference: "They tell me you're a good archaeologist, so why are you trying to be a bad philosopher?". Perhaps he should have taken the question more to heart.</p><p style="text-align: left;"> </p><h3 style="text-align: left;">The future</h3><p style="text-align: left;">The experiment of how to switch from foraging to farming was run many times. Forager bands in many places adopted farming techniques. Some of them had good ideas about how to structure their now-farming societies and succeeded, while others had bad ideas and perished, or were forced to copy techniques from the more successful.</p><p style="text-align: left;">In contrast, today the entire world has been thrust into the industrial age in the space of a few hundred years. There is only one experiment going on, and only one chance to get it right. There's no one to copy from to see what we should do, and no one to pick up the job if our attempt fails.</p><p style="text-align: left;">A successful transition to the industrial world, and whatever we might mark as the next step after that, is therefore less certain than the successful transition from foragers to farmers. The values that industrial life imposes on us might be better than those of the farming age, but it is not yet clear if they will become as universal as hierarchies and kings once were.</p><p style="text-align: left;">(Better by which standard? I think humans are similar enough that there is a <a href="https://strataoftheworld.blogspot.com/2020/08/ea-ideas-4-utilitarianism.html">context-independent universal human ethical framework</a>.)</p><p style="text-align: left;">Morris' arguments also lead to the question of how values might change in the future.
Will the set of values that a society tends towards continue to improve as technology and wealth increase, or is the cuddliness of industrial values (compared to farming ones) a fluke?</p><p style="text-align: left;">The significance of <i>Foragers, Farmers, and Fossil Fuels</i> for this question is that we won't necessarily be the ones deciding. Over a span of years or decades, we can maintain our values through argument and education. Over a span of centuries, though, we can argue all we like, just as countless Luddites and aristocrats railed against industrial/Western values, but if the game has changed and someone else's values make them play it better, it won't be enough. The harsh logic of evolution-like selection pressures can't be resisted forever; those that are best at spreading themselves into the future will eventually claim it.</p><p style="text-align: left;">Yuval Noah Harari, author of <i>Sapiens</i>, says that once we can engineer desires, the question is not "what do we want to become?", but "what do we want to want?". Morris counters that the real question is instead "what are we going to want, whether we want it or not?", and his answer is bleak yet pragmatic: "each age gets the thought it needs" ("needs" referring to "survival needs").</p><p style="text-align: left;">I don't think we need to be either nihilistic (in thinking that every set of societal values is as good as any other; some do a better job of serving universal human wants), or pessimistic (in thinking that we can't do anything about a slide to worse values; we've never had more control over the future of our world).</p><p style="text-align: left;">Morris writes:</p><blockquote style="text-align: left;"><p><i>“Trying to imagine people who are somehow divorced from the demands of capturing energy and then speculating about what their moral values would be is an odd activity.”</i></p></blockquote><p style="text-align: left;">I disagree.
Of course we can imagine people living without being constrained by energy needs. How many science fiction writers or futurists <i>haven't</i> imagined a post-scarcity society?</p><p style="text-align: left;">In fact, aren't we well on our way towards such a world? Forager and farmer lives were significantly shaped by the need to get food, water, light, and warmth. Today in developed countries, these aren't free, but our lives aren't shaped by worrying about them. Sure, you need to work a job, but what you worry about in the job is likely far removed from survival needs, and provided you have one and aren't massively wasteful, the water and light flow exactly as you want. Technological progress removes difficulty and scarcity. Ultimately, there's no physical limit stopping us from removing scarcity considerations from our lives (or, more precisely, making them trivial enough that we don't need to worry about them; nothing is ever entirely free in this universe).</p><p style="text-align: left;">Once we've done so, we no longer have to make compromises between what we should do and what we as a society are forced to value in order to survive. And so I think it is reasonable to imagine humans whose values aren't warped by survival needs; in fact such values might be good ones to aim for.</p><p style="text-align: left;">(Or maybe the need to focus at least a bit on survival is the one anchor to objective reality that prevents societies from losing themselves entirely to petty politicking and status games.)</p><p style="text-align: left;">Of course, there's always the problem of competition.
What happens to our happy post-scarcity society when the people next door ratchet up the competition, say by throwing off all the safeguards around capitalism, or developing AIs or nanomachines or <a href="https://slatestarcodex.com/2016/05/28/book-review-age-of-em/">Robin Hanson's emulated minds</a>, and then outcompeting us by adopting values more suitable to exploiting those technologies? Even if we ourselves don't suffer – say we have a big enough wall – in the long run we'd give up the rest of the world (or solar system or galaxy) to the pragmatic-valued competitors. At best, the long-term future looks like an oasis of human flourishing, surrounded by a galaxy-spanning alien economy with weird but morally neutral ways. (Imagine a forager tribe considering the massive and weird industrialised world around them; now imagine we're the foragers.) At worst, any good in our oasis would be outweighed by the morally bad machinations that fuel the endless growth of that weird galaxy-spanning alien economy.</p><p style="text-align: left;">So will we be forced to compromise ever more and more to avoid being outrun by those with fewer scruples about changing their values? 
Or can we build a world where human values are a winning strategy?</p><p style="text-align: left;">Looking at our <a href="http://strataoftheworld.blogspot.com/2018/08/review-enlightenment-now-steven-pinker.html">track record</a>, I think we have a chance.</p><p style="text-align: left;"> </p><p style="text-align: center;"><i><b>Related:</b></i><i><a href="http://strataoftheworld.blogspot.com/2019/09/growth-and-civilisation.html"><br />Growth and civilisation</a></i><br /></p>EA ideas 4: utilitarianism (published 2020-08-10)<p style="text-align: center;"><font size="2"><i>4.9k words (≈17 minutes)</i></font></p><p style="text-align: center;"><font size="2"><i><span style="font-size: small;"> Posts in this series:</span><br /></i><a href="https://strataoftheworld.blogspot.com/2020/07/ea-ideas-1-rigour-and-opportunity-in.html">FIRST</a> | <a href="https://strataoftheworld.blogspot.com/2020/07/ea-ideas-3-uncertainty.html">PREVIOUS</a> | NEXT<i><br /></i></font></p><p>Many ideas in effective altruism (EA) do not require a particular moral theory. However, while there is no common EA moral theory, much EA moral thinking leans consequentialist (i.e. morality is fundamentally about consequences), and often specifically utilitarian (i.e. wellbeing and/or preference fulfilment are the consequences we care about).</p> <p>Utilitarian morality can be thought of as rigorous humanism, where by “humanism” I mean the general post-Enlightenment secular value system that emphasises caring about people, rather than upholding, say, religious rules or the honour of nations. Assume that the welfare of a conscious mind matters.
Assume that our moral system should be impartial: that wellbeing/preferences should count the same regardless of who has them, and also in the sense of being indifferent to whose perspective it is being wielded from (for example, a moral system that says to only value yourself would give you different advice than it gives me). The simplest conclusion you can draw from these assumptions is to consider welfare to be good and seek to increase it.</p> <p>I will largely ignore differences between the different types of utilitarianism. Examples of divisions within utilitarianism include preference vs hedonic/classical utilitarianism (do we care about the total satisfied preferences, or the total wellbeing; how different are these?) and act vs rule utilitarianism (is the right act the one with the greatest good as its consequence, or the one that conforms to a rule which produces the greatest good as its consequences – and, once again, are they different?).</p><p><br /></p> <h3>Utilitarianism is decisive</h3> <p>We want to do things that are “good”, so we have to define what we mean by it. But once we’ve done this, this concept of good is of no help unless it lets us make decisions on how to act. I will refer to the general property of a moral system being capable of making non-paradoxical decisions as decisiveness.</p> <p>Decisiveness can fail if a moral system leads to contradiction. Imagine a deontological system with the rules “do not lie” and “do not take actions that result in someone dying”. Now consider the classic thought experiment of what such a deontologist would do if the Gestapo knocked on their door and asked if they’re hiding any Jews. A tangle of absolute rules almost ensures the existence of some case where they cannot all be satisfied, or where following them strictly will cause immense harm.</p> <p>Decisiveness fails if our system allows circular preferences, since then you cannot make a consistent choice.
Imagine you follow a moral system that says volunteering at a soup kitchen is better than helping old people across the street, collecting money for charity is better than soup kitchen volunteering, and helping old people across the street is better than collecting money. You arrive at the soup kitchen and decide to immediately walk out to go collect money. You stop collecting money to help an old person across the street. Halfway through, you abandon them and run off back to the soup kitchen.</p> <p>Decisiveness fails if there are tradeoffs our system cannot make. Imagine highway engineers deciding whether to bulldoze an important forest ecosystem or a historical monument considered sacred. If your moral system cannot weigh the environment against historical artefacts (and economic growth, and the time of commuters, and …), it is not decisive. </p> <p>So for any two choices, a decisive moral system must be able to compare them, and the comparisons it makes cannot form circular preferences. This implies a ranking: X is better than Y translates to X is before Y in the ranking list.</p> <p>(If we allow circular preferences, we obviously can’t make a list, since the graph of “better-than” relations would include cycles. If there are tradeoffs we can’t make – X and Y such that X is neither better than, equal to, nor worse than Y – we can generate a ranking list but not a unique one (in set theory terms, we have a partial order rather than a total order).)</p> <p>Decisiveness also fails if our system can’t handle numbers. It is better to be happy for two minutes than one minute than fifty-nine seconds. More generally, to practically any good we can either add or subtract a bit: one more happy thought, one less bit of pain.</p> <p>Therefore a decisive moral system must rank all possible choices (or actions or world states or whatever), with no circular preferences, and with arbitrarily many notches between each ranking.
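The two failure modes can be made concrete in a few lines of code: a cyclic "better-than" relation admits no consistent ranking at all, while numeric utilities can always be sorted into one. This is a minimal sketch; the option names are hypothetical stand-ins for the soup-kitchen example:

```python
from itertools import permutations

# Hypothetical labels for the soup-kitchen example. These three pairwise
# judgments form a cycle: collect > soup > help > collect.
cyclic_judgments = {("collect", "soup"), ("soup", "help"), ("help", "collect")}

def consistent_rankings(options, better):
    """All orderings (best first) that agree with every (a, b) = 'a is better than b'."""
    return [order for order in permutations(options)
            if all(order.index(a) < order.index(b) for a, b in better)]

# A cyclic "better-than" relation admits no ranking at all:
print(consistent_rankings(["collect", "soup", "help"], cyclic_judgments))  # []

# Numeric utilities, by contrast, always yield a consistent ranking: just sort.
utility = {"collect": 3.0, "soup": 2.0, "help": 1.0}
print(sorted(utility, key=utility.get, reverse=True))  # ['collect', 'soup', 'help']
```

(With an incomparable pair left out of `better`, `consistent_rankings` would return several orderings rather than none, matching the partial-order case.)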
It sounds like what we need is numbers: if we can assign a number to choices, then there must exist a non-circular ranking (you can always sort numbers), and there’s no problem with handling the quantitativeness of many moral questions.</p> <p>There can’t be one axis to measure the value of pleasure, one to measure meaning, and another for art. Or there can – but at the most basic level of moral decision-making, we must be able to project everything onto the same scale, or else we’re doomed to have important moral questions where we can only shrug our shoulders. This leads to the idea of all moral questions being decidable by comparing how the alternatives measure up in terms of “utility”, the abstract unit of the basic value axis.</p> <p>You might say that requiring this extreme level of decisiveness may sometimes be necessary in practice, but it’s not what morality is about; perhaps moral philosophy should concern itself with high-minded philosophical debates over the nature of goodness, not ranking the preferability of everything. Alright, have it your way. But since being able to rank tricky “ought”-questions is still important, we’ll make a new word for this discipline: fnergality. You can replace “morality” or “ethics” with “fnergality” in the previous argument and in the rest of this post, and the points will still stand.</p><p><br /></p> <h3>What is utility?</h3> <p>So far, we have argued that a helpful moral system is decisive, and that this implies it needs a single utility scale for weighing all options. </p> <p>I have not specified what utility is. Without this definition, utilitarianism is not decisive at all.</p> <p>How you define utility will depend on which version of utilitarianism you endorse. 
The basic theme across all versions of utilitarianism is that utility is assigned without prejudice against arbitrary factors (like location, appearance, or being someone other than the one who is assigning utilities), and is related to ideas of welfare and preference.</p> <p>A hedonic utilitarian might define the utility of a state of the world as total wellbeing minus total suffering across all sentient minds. A preference utilitarian might ascribe utility to each instance of a sentient mind having a preference fulfilled or denied, depending on the weight of the preference (not being killed is likely a deeper wish than hearing a funny joke), and the sentience of the preferrer (a human’s preference is generally more important than a cat’s). Both would likely want to maximise the total utility that exists over the entire future.</p> <p>These definitions leave a lot of questions unanswered. For example, take the hedonic utilitarian definition. What is wellbeing? What is suffering? Exactly how many wellbeing units are being experienced per second by a particular jogger blissfully running through the early morning fog?</p> <p>The fact that we can’t answer “4.7, ±0.5 depending on how runny their nose is” doesn’t mean utilitarianism is useless. First, we might say that an answer exists in principle, even if we can’t figure it out. For example, a hedonic utilitarian might say that there is some way to calculate the net wellbeing experienced by any sentient mind. Maybe it requires knowing every detail of their brain activity, or a complete theory of what consciousness is. But – critically – these are factual questions, not moral ones. There would be moral judgements involved in specifying exactly how to carry out this calculation, or how to interpret the theory of consciousness. 
There would also be disagreements, in the same way that preference and hedonic utilitarians disagree today (and it is a bad idea to specify one Ultimate Goodness Function and declare morality solved forever). But in theory and given enough knowledge, a hedonic utilitarian theory could be made precise.</p> <p>Second, even if we can only approximate utilities, doing so is still an important part of difficult real-world decision-making.</p> <p>For example, Quality- and Disability-Adjusted Life Years (<a href="https://en.wikipedia.org/wiki/Quality-adjusted_life_year">QALYs</a> and <a href="https://en.wikipedia.org/wiki/Disability-adjusted_life_year">DALYs</a>) try to put a number on the value of a year of life with some disease burden. Obviously this is not an easy judgement to make (usually it is made by having a lot of people answer carefully designed questions on a survey), and the results are far more imprecise than the 3-significant-figure numbers in the table <a href="https://www.who.int/healthinfo/statistics/GlobalDALYmethods_2000_2011.pdf">on page 17 here</a> would suggest. However, the principle that we should ask people and do studies to try to figure out how much they’re suffering, and then make the decisions that reduce suffering the most across all people, seems like the most fair and just way to make medical decisions.</p> <p>Using QALYs may seem coldly numerical, but if you care about reducing suffering, not just as a lofty abstract statement but as a practical goal, you will care about every second. It can also be hard to accept QALY-based judgements, especially if they favour others over people close to you. 
However, taking an impartial moral view, it is hard not to accept that the greatest good is better than a lesser good that includes you.</p> <p>(Using opposition to QALYs as an example, Robin Hanson <a href="https://www.overcomingbias.com/2019/05/simplerules.html">argues with his characteristic bluntness</a> that people favour discretion over mathematical precision in their systems and principles “as a way to promote an informal favoritism from which they expect to benefit”. In addition to the ease of sounding just and wise while repeating vague platitudes, this may be a reason why the decisiveness and precision of utilitarianism become disadvantages on the PR side of things.)</p><p><br /></p> <h3>Morality is everywhere</h3> <p>By achieving decisiveness, utilitarianism makes every choice a moral one.</p> <p>One possible understanding of morality is that it splits actions into three planes. There are rules for what to do (“remember the sabbath day”). There are rules for what not to do (“thou shalt not kill, and if thou doest, thou goest to hell”). And then there’s the earthly realm, of questions like whether to have sausages for dinner, which – thankfully – morality, god, and your local preacher have nothing to say about.</p> <p>Utilitarianism says sausages are a moral issue. Not a very important one, true, but the happiness you get from eating them, your preferences one way or the other, and the increased risk of heart attack thirty years from now can all be weighed under the same principles that determine how much effort we should spend on avoiding nuclear war. This is not an overreach: a moral theory is a way to answer “ought”-questions, and a good one should cover all of them.</p> <p>This leads to a key strength of utilitarianism: it scales, and this matters, especially when you want to apply ethics to big uncertain things. 
But first, a slight detour.</p><p><br /></p> <h3>Demandingness</h3> <p>A common objection to utilitarianism is that it is too demanding.</p> <p>First of all, I find this funny. Which principle of meta-ethics is it, exactly, that guarantees your moral obligations won’t take more than the equivalent of a Sunday afternoon each week?</p> <p>However, I can also see why consequentialist ethics can seem daunting. For someone who is used to thinking of ethics in terms of specific duties that must always be carried out, a theory that paints everything with some amount of moral importance and defines good in terms of maximising something vague and complicated can seem like too much of a burden. (I think this is behind the misinterpretation that utilitarianism says you have a duty to calculate that each action you take is the best one possible, which is neither utilitarian nor an effective way to achieve anything.)</p> <p>Utilitarianism is a consequentialist moral theory. Demands and duties are not part of it. It settles for simply defining what is good.</p> <p>(As it should. The definition is logically separate from the implications and the implementation. Good systems, concepts, and theories are generally <a href="https://www.lesswrong.com/posts/yDfxTj9TKYsYiWH5o/the-virtue-of-narrowness">narrow</a>.)</p><p><br /></p> <h3>Scaling ethics to the sea</h3> <p>There are many moral questions that are, in practice, settled. All else being equal, it is good to be kind, have fun, and help the needy.</p> <p>To make an extended metaphor: we can imagine that there is an island of settled moral questions; ones that no one except psychopaths or philosophy professors would think to question.</p> <p>This island of settled moral questions provides a useful test for moral systems. A moral system that doesn’t advocate kindness deserves to go in the rubbish. 
But though there is important intellectual work to be done in figuring out exactly what grounds this island (the geological layers it rests on, if you will), the real problem of morality in our world is how we extrapolate from this island to the surrounding sea.</p> <p>In the shallows near the island you have all kinds of conventional dilemmas – for example, consider our highway engineers in the previous example weighing nature against art against economy. Go far enough in any direction and you will encounter all sorts of perverse thought experiment monsters dreamt up by philosophers, which try to tear apart your moral intuitions with analytically sharp claws and teeth.</p> <p>You might think we can keep to the shallows. That is not an option. We increasingly need to make moral decisions about weird things, due to the increasing strangeness of the world: complex institutions, new technologies, and the sheer scale of there being over seven billion people around.</p> <p>A moral system based on rules for everyday things is like a constant-sized knife: fine for cutting up big fish (should I murder someone?), but clumsy at dealing with very small fish (what to have for dinner?), and often powerless against gargantuan eldritch leviathans from the deep (existential risk? mind uploading? insect welfare?).</p> <p>Utilitarianism scales both across sizes of questions and across different kinds of situations. This is because it isn’t based on rules, but on a concept (preference/wellbeing) that manages to turn up whenever there are morally important questions. This gives us something to aim for, no matter how big or small. 
It also makes us value preference/wellbeing wherever it turns up, whether in people we don’t like, the mind of a cow, or in aliens.</p><p><br /></p> <h3>Utilitarianism and other kinds of ethics</h3> <p>Utilitarianism, and consequentialist ethics more broadly, lacks one property that is a common social (if not philosophical) use of morality.</p> <p>Consider confronting a thief robbing a jewellery store. A deontological argument is “stealing is wrong; don’t do it”. A utilitarian argument would need to spell out the harms: “don’t steal, because you will cause suffering to the owner of the shop”. But the thief may well reply: “yes, but the wellbeing I gain from distributing the proceeds to my family is greater, so my act is right”. And now you’d have to point out that the costs to the shop workers who will lose their jobs if the shop goes bankrupt, plus more indirect costs like the effect on people’s trust in others or feelings of safety, outweigh these benefits – if they even do. Meanwhile the thief makes their escape.</p> <p>By making moral questions depend heavily on facts about the world, utilitarianism does not admit smackdown moral arguments (you can always be wrong about the facts, after all). This is a feature, not a bug. Putting people in their place is sometimes a necessary task (as in the case of law enforcement), but in general it is the province of social status games, not morality.</p> <p>Of course, nations need laws and people need principles. The insight of utilitarianism is that, important as these things are, their rightness is not axiomatic. There is a notion of good, founded on the reality of minds doing well and fulfilling their wishes, that cuts deeper than any arbitrary rule can. It is an uncomfortable thought that there are cases where you should break any absolute moral rule. But would it be better if there were rules for which we had to sacrifice anything?</p> <p>Recall the example of the Gestapo asking if you’re hiding Jews in your house. 
Given an extreme enough case, whether or not a moral rule (e.g. “don’t lie”) should be followed does depend on the effects of an action.</p> <p>At first glance, while utilitarianism captures the importance of happiness, selflessness, and impartiality, it doesn’t say anything about many other common moral topics. We talk about human rights, but consequentialism admits no rights. We talk about good people and bad people, but utilitarianism judges only consequences, not the people who bring them about. In utilitarian morality, good intentions alone count for nothing.</p> <p>First, remember that utilitarianism is a set of axioms about what the most fundamental definition of good is. Just like simple mathematical axioms can lead to incredible complexity and depth, if you follow utilitarian reasoning down to daily life, you get a lot of subtlety and complexity, including a lot of common-sense ethics.</p> <p>For example, knowledge has no <i>intrinsic</i> value in utilitarianism. But having an accurate picture of what the world is like is so important for judging what is good that, in practice, you can basically regard accurate knowledge as a moral end in itself. (I think that unless you never intend to be responsible for others or take actions that significantly affect other people, when deciding whether to consider something true you should care only about its literal truth value, and not at all about whether it will make you feel good to believe it.)</p> <p>To take another example: integrity, in the sense of being honest and keeping commitments, clearly matters. This is not obvious if you look at the core ideas of utilitarianism, in the same way that the Chinese Remainder Theorem is not obvious if you look at the axioms of arithmetic. 
That doesn’t somehow make it un-utilitarian; for some examples of arguments, see <a href="https://forum.effectivealtruism.org/posts/CfcvPBY9hdsenMHCr/integrity-for-consequentialists">here</a>.</p> <p>See also <a href="https://www.lesswrong.com/posts/K9ZaZXDnL3SEmYZqB/ends-don-t-justify-means-among-humans">this article</a> for ideas on why strictly following rules can make sense even for strict consequentialists, given only the fact that human brains are fallible in predictable ways.</p> <p>As a metaphor, consider scientists. They are (in some idealised hypothetical world) committed only to the pursuit of truth: they care about nothing except the extent to which their theories precisely explain the world. But the pursuit of this goal in the real world will be complicated, and involve things – say, wild conjectures, or following hunches – that might even seem to go against the end goal. In the same way, real-world utilitarianism is not a cartoon caricature of endlessly calculating consequences and compromising principles for “the greater good”, but instead a reminder of what really matters in the end: the wishes and wellbeing of minds. Rights, duties, justice, fairness, knowledge, and integrity are not the most basic elements of (utilitarian) morality, but that doesn’t make them unimportant.</p><p><br /></p> <h3>Utilitarianism is horrible</h3> <p>Utilitarianism may have countless arguments on its side, but one fact remains: it can be pretty horrible.</p> <p>Many thought experiments show this. The most famous is the <a href="https://en.wikipedia.org/wiki/Trolley_problem">trolley problem</a>, where the utilitarian answer requires diverting a trolley from a track containing 5 people to one containing only a single person (an alternative telling is doctors killing a random patient to get the organs to save five others). 
Another is the <a href="https://en.wikipedia.org/wiki/Mere_addition_paradox">mere addition paradox</a>, also known as the repugnant conclusion: we should consider a world of a few people living very good lives as worse than one of sufficiently many people living mediocre lives.</p> <p>Of course, the real world is never as stark as philosophers’ thought experiments. But a moral system should still give an answer – the right one – to every moral dilemma.</p> <p>Many alternatives to utilitarianism seem to fail at this step; they are not decisive. It is always easier to wallow in platitudes than to make a difficult choice.</p> <p>If a moral system gives an answer we find intuitively unappealing, we need to either reject the moral system or reject our intuitions. The latter is obviously dangerous: get carried away by abstract morals, and you might find yourself denying common-sense morals (the island in the previous metaphor). However, particularly when dealing with things that are big or weird, we should expect our moral intuitions to occasionally fail.</p> <p>As an example, I think the repugnant conclusion is correct: for any quantity of people living extremely happy lives, there is some larger quantity of people living mediocre lives that would be a better state for the world to be in.</p> <p>First, rejecting the repugnant conclusion means rejecting total utilitarianism: the principle that you sum up individual utilities to get total utility (for example, you might average utilities instead). Rejecting total utilitarianism implies weird things, like the additional moral worth of someone’s life depending on how many people are already in the world. Why should a happy life in a world with ten billion people be worth less than one in a world with a thousand people?</p> <p>Alternatives also bring up their own issues. 
To take a simple example, if you value average happiness instead, eliminating everyone who is less happy than the average is a good idea (in the limit, every world of more than one person should be reduced to a world of one person).</p> <p>Finally, there is a specific bias that explains why the repugnant conclusion seems so repugnant. Humans tend to show <a href="https://en.wikipedia.org/wiki/Scope_neglect">scope neglect</a>. If our brains were built differently, and assigned due weight to the greater quantity of life in the “repugnant” choice, I think we’d find it the intuitive one.</p> <p>However, population ethics is both notoriously tricky and a fairly new discipline, so there is always the chance there exists a better alternative <a href="http://users.ox.ac.uk/~mert2255/papers/population_axiology.pdf">population axiology</a> than totalism.</p><p><br /></p> <h3>Is utilitarianism complete and correct?</h3> <p>I’m not sure what evidence or reasoning would let us say that a moral system is complete and correct.</p> <p>I do think the basic elements of utilitarianism are fairly solid. First, I showed above how requiring decisiveness leads to most of the utilitarian character of the theory (quantitativeness, the idea of utility). The reasons are similar to the ones for using expected value reasoning: if you don’t, you either can’t make some decisions, or introduce cases where you make stupid ones. Second, ideas of impartiality and universality seem like fundamental moral ideas. I’d be surprised if you could build a consistent, decisive, and humane moral theory without the ideas of quantified utility and impartiality.</p> <p>Though this skeleton may be solid, the real mess lies with defining utility.</p> <p>Do we care about preferences or wellbeing? It seems that if we define either in a broad enough way to be reasonable, the ideas start to converge. 
Is this a sign that we’re on the right track because the two main variants of utilitarianism talk about a similar thing, or that we’re on the wrong track and neither concept means much at all?</p> <p>Wellbeing as pleasure leaves out most of what people actually value. Sometimes people prefer to feel sadness; we have to include this. How? Notice the word I used – “prefer”. It seems like this broad-enough “wellbeing” concept might just mean “what people prefer”. But try defining the idea of preference. Ideal preferences should be sincere and based on perfect information – after all, if you hear information that changes your preference, it’s your estimate of the consequences that changed, not the morally right action. So when we talk about preference, we need complete information, which means trying to answer the question “given perfect information about what you will experience (or even the entire state of the universe, depending on what preferences count) in option A and in option B, which do you prefer?” Now how is this judgement made? Might there be something – wellbeing, call it – which is what a preferrer always prefers?</p> <p>Capturing any wellbeing/preference concept is difficult. Some things are very simple: a healthy life is preferable to death, for example, and given the remaining horribleness in the real world (e.g. sixty million people dying each year) a lot of our important moral decisions are about the simple cases. Even the problem of assigning QALY values to disease burdens has proven tractable, if not easy or uncontroversial. But solving the biggest problems is only the start.</p> <p>An important empirical fact about human values is that they’re complex. Any simple utopia is a dystopia. Maybe the simplest way to construct a dystopia is to imagine a utopia and remove one subtle thing we care about (e.g. 
variety, choice, or challenge).</p> <p>On one hand, we have strong theoretical reasons why we need to reduce everything to utilities to make moral decisions. On the other, we have the empirical fact that what counts as utility to people is very complex and subtle.</p> <p>I think the basic framework of utilitarian ideas gives us a method, in the way that the ruler and compass gave the Greeks a method to begin toying with maths. Thinking quantitatively about how all minds everywhere are doing is probably a good way to start our species’ serious exploration of weird and/or big moral questions. However, modern utilitarianism may be an approximation, like Newton’s theory of gravity (except with a lot more ambiguity in its definitions), and the equivalent of general relativity may be centuries away. It also seems certain that most of the richness of the topic still eludes us.</p><p><br /></p> <h3>Indirect arguments: what people think, and the history of ethics</h3> <p>In addition to the theoretical arguments above, we can try to weigh utilitarianism indirectly.</p> <p>First, we can see what people think (we are talking about morality after all – if everyone hates it, that’s cause for concern). On one hand, out of the friends I’ve talked about these topics with (the median example being an undergraduate STEM student), basically everyone favours some form of utilitarianism. On the other hand, <a href="https://philpapers.org/surveys/results.pl">a survey</a> of almost a thousand philosophers found only a quarter accepting or leaning towards consequentialist ethics (slightly lower than the number of deontologists, and fewer than the largest group, the third of respondents who chose “other”). (However, two thirds endorse the utilitarian choice in the trolley problem, compared to only 8% saying not to switch; the rest were undecided.) 
My assumption is that a poll of everyone would find a significant majority against utilitarianism, but I think this would be largely because of the negative connotations of the word.</p> <p>Second, we can look at history. A large part of what we consider moral progress can be summarised as a move to more utilitarian morality.</p> <p>I am not an expert in the history of ethics (though I’d very much like to hear from one), but the general trend from rule- and duty-based historical morality to welfare-oriented modern morality seems clear. Consider what is perhaps the standard argument in favour of gay marriage: it’s good for some people and it hurts no one, so why not? Arguments do not get much more utilitarian. (Though of course, other arguments can be made with different starting points, for example a natural right to various freedoms.) In contrast, the common counter-argument – that it violates the law of nature or god or at least social convention – is rooted in decidedly non-utilitarian principles. Whereas previously social disapproval was a sufficient reason to deny people happiness, today we assume a heavy, even insurmountable, burden of proof for any custom or rule that increases suffering on net.</p> <p>A second trend in moral attitudes is often summarised as an “expanding moral circle”: granting moral significance to more and more entities. The view that only particular people of particular races, genders, or nationalities count as <a href="https://concepts.effectivealtruism.org/concepts/moral-patienthood/">moral patients</a> has come to be seen as wrong, and the expansion of moral patienthood to non-humans is already underway.</p> <p>A concern for anything capable of experiencing welfare is built into utilitarianism. 
Utilitarianism also ensures that this process will not blow up into absurdities: rather than blindly granting rights to every ant, utilitarianism allows for the fact that the welfare of some entities deserves greater weight, and assures us there’s no need to worry about rocks.</p> <p>It would be a mistake to say that our moral progress has been driven by explicit utilitarianism. Abolitionists, feminists, and civil rights activists had diverse moral philosophies, and the deontological language of rights and duties has played a big role. But consider carefully why today we value the rights and duties that we do, rather than those of past eras, and I think you’ll find that the most concise way to summarise the difference is that we place more value on welfare and preferences. In short, we are more utilitarian.</p> <p>Two of the great utilitarian philosophers were <a href="https://en.wikipedia.org/wiki/Jeremy_Bentham#Animal_rights">Jeremy Bentham</a> and <a href="https://en.wikipedia.org/wiki/John_Stuart_Mill">John Stuart Mill</a>, who died in the early and late 1800s respectively (today we have <a href="https://en.wikipedia.org/wiki/Peter_Singer">Peter Singer</a>). On the basis of his utilitarian ethics, Bentham advocated for the abolition of slavery and capital punishment, for gender equality, and for decriminalising homosexuality (in an essay so radical for its time that it went unpublished until over a hundred years after Bentham’s death), and is especially known as one of the first defenders of animal rights. Mill also argued against slavery, and is especially known as an early advocate of women’s rights. Both were also important all-around liberals.</p> <p>Nineteenth century utilitarians were good at holding moral views that were ahead of their time. 
I would not be surprised if the same were true today.</p> <br /><center><b>EA ideas 3: uncertainty</b></center><center><i><font size="2">2.0k words (7 minutes), published 2020-07-26</font></i></center><center><span style="font-size: small;"><i><font>Posts in this series:</font></i></span></center><center><font size="2"><a href="https://strataoftheworld.blogspot.com/2020/07/ea-ideas-1-rigour-and-opportunity-in.html">FIRST</a> | <a href="https://strataoftheworld.blogspot.com/2020/07/ea-ideas-2-expected-value-and-risk.html">PREVIOUS</a> | <a href="https://strataoftheworld.blogspot.com/2020/08/ea-ideas-4-utilitarianism.html">NEXT</a></font></center><p><a href="https://concepts.effectivealtruism.org/concepts/moral-uncertainty/">Moral uncertainty</a> is uncertainty over the definition of good. For example, you might broadly accept utilitarianism, but still have some credence in deontological principles occasionally being more right.</p> <p>Moral uncertainty is different from epistemic uncertainty (uncertainty about our knowledge, its sources, and our degree of uncertainty about these things). In practice the two often mix – uncertainty over an action can easily involve both moral and epistemic uncertainty – but since is-ought confusions are a common trap in any discussion, it is good to keep these ideas firmly separate.</p><p><br /></p> <h2>Dealing with moral uncertainty</h2> <p>Thinking about moral uncertainty quickly gets us into deep philosophical waters.</p> <p>How do we decide which action to take? 
One approach is called “My Favourite Theory” (MFT), which is to act entirely in accordance with the moral theory you think is most likely to be correct. There are a number of counterarguments, many of which revolve around the problem of how we draw boundaries between theories: if you have 0.1 credence in each of 8 consequentialist theories and 0.2 credence in a deontological theory, should you really be a strict deontologist? (More fundamentally: say we have some credence in a family of moral systems with a continuous range of variants – say, differing by arbitrarily small differences in the weights assigned to various forms of happiness. Does MFT require us to reject this family of theories in favour of ones that vary only discretely, since in the former case the probability of a particular variant being correct is infinitesimal?) For a defence of MFT, see <a href="http://johanegustafsson.net/papers/in-defence-of-my-favourite-theory.pdf">this paper</a>.</p> <p>If we reject MFT, when making decisions we have to somehow compare the recommendations of different moral systems. Some regard this as nonsensical; others write <a href="http://commonsenseatheism.com/wp-content/uploads/2014/03/MacAskill-Normative-Uncertainty.pdf">theses</a> on how to do it (some of the same ground is covered in a much shorter space <a href="http://amirrorclear.net/files/why-maximize-expected-choiceworthiness.pdf">here</a>; this paper also discusses the same concerns with MFT that I mentioned in the last paragraph, as well as problems with switching to “My Favourite Option” – acting according to the option that is most likely to be correct, summed over all moral theories you have credence in).</p> <p>Another less specific idea is the <a href="http://www.overcomingbias.com/2009/01/moral-uncertainty-towards-a-solution.html">parliamentary model</a>. 
Imagine that all moral theories you have some credence in send delegates to a parliament, who can then negotiate, bargain, and vote their way to a conclusion. We can imagine delegates for a low-credence theory generally being overruled, but, on the issues most important to that theory, being able to bargain their way to changing the result.</p> <p>(In a nice touch of subtlety, the authors take care to specify that though the parliament acts according to a typical 50%-to-pass principle, the delegates act as if they believe that the percent of votes for an action is the probability that it will happen, removing the perverse incentives generated by an arbitrary threshold.)</p> <p>As an example of other sorts of meta-ethical considerations, Robin Hanson compares the process of fitting a moral theory to our moral intuitions to fitting a curve (the theory) to a set of data points (our moral intuitions). <a href="http://www.overcomingbias.com/2009/05/minimal-morals.html">He argues</a> that there’s enough uncertainty over these intuitions that we should take heed of a basic principle of curve-fitting: keep it simple, or otherwise you will overfit, and your curve will veer off in one direction or another when you try to extrapolate.</p><p><br /></p> <h2>Mixed moral and epistemic uncertainty</h2> <h3>Cause X</h3> <p>We are probably committing a moral atrocity without being aware of it.</p> <p>This is argued <a href="https://www.docdroid.net/0BDABfb/the-possibility-of-an-ongoing-moral-catastrophe-pdf">here</a>. The first argument is that past societies have been unaware of serious moral problems and we don’t have strong enough reasons to believe ourselves exempt from this rule. 
The second is that there are many sources of potential moral catastrophe – there are very many ways of being wrong about ethics or being wrong about key facts – so though we can’t point to any specific likely failure mode with huge consequences, the probability that at least one exists isn’t low.</p> <p>In addition to an ongoing moral catastrophe, it could be that we are overlooking an opportunity to achieve a lot of good for cheap. In either case there would be a cause, dubbed <a href="https://www.effectivealtruism.org/moral-progress-and-cause-x/">Cause X</a>, which would be a completely unknown but extremely important way of improving the world.</p> <p>(In either case, the cause would likely involve both moral and epistemic failure: we’ve both failed to think carefully enough about ethics to see what it implies, and failed to spot important facts about the world.)</p> <p>“Overlooked moral problem” immediately invites everyone to imagine their pet cause. That is not what Cause X is about. Imagine a world where every cause you support triumphed. What would still be wrong about this world? Some starting points for answering this are presented <a href="https://www.effectivealtruism.org/articles/three-heuristics-for-finding-cause-x/">here</a>.</p> <p>If you say “nothing”, consider MacAskill’s anecdote in the previous link: Aristotle was smart and spent his life thinking about ethics, but still thought <a href="https://en.wikipedia.org/wiki/Natural_slavery#Aristotle's_discussion_on_slavery">slavery made sense</a>.</p><p><br /></p> <h2>Types of epistemic uncertainty</h2><div>I use the term "epistemic uncertainty" because the concept is broader than just uncertainty over facts. For example, our brains are flawed in predictable ways, and dealing with this is different from dealing with being wrong or having incomplete information about a specific fact.</div><div><br /></div> <h3>Flawed brains</h3> <p>A basic cause for uncertainty is that human brains make mistakes. 
Especially important are biases, which consistently make our thinking wrong in the same way. This is a big and important topic; the classic book is Kahneman’s <a href="https://www.goodreads.com/book/show/11468377-thinking-fast-and-slow"><i>Thinking, Fast and Slow</i></a>, but if you prefer sprawling and arcane chains of blog posts, you’ll find plenty <a href="https://www.readthesequences.com">here</a>. I will only briefly mention some examples.</p> <p>The most important bias to avoid when thinking about EA may be <a href="https://en.wikipedia.org/wiki/Scope_neglect">scope neglect</a>. In short, people don’t automatically multiply. It is the image of a starving child that counts in your brain, and your brain gives this image the same weight whether the number you see on the page has three zeros or six after it. Trying to reason about any big problem without being very mindful of scope neglect is like trying to captain a ship that has no bottom: you will sink before you move anywhere.</p> <p>Many biases are difficult to counter, but occasionally someone thinks of a clever trick. Status quo bias is a preference for keeping things as they are. It can often be spotted through the <a href="https://en.wikipedia.org/wiki/Reversal_test">reversal test</a>. For example, say you argue that we shouldn’t lengthen human lifespans further. Ask yourself: should we then decrease life expectancy? If you think that we should have neither more nor less of something, you should also have a good reason for why it just so happens that we have an optimum amount already. 
What are the chances that the best possible lifespan for humans also happens to be the highest one that present technology can achieve?</p><p><br /></p> <h3>Crucial considerations</h3> <p>A <a href="https://www.effectivealtruism.org/articles/crucial-considerations-and-wise-philanthropy-nick-bostrom/">crucial consideration</a> is something that flips (or otherwise radically changes) the value of achieving a general goal.</p> <p>For example, imagine your goal is to end raising cows for meat, because you want to prevent suffering. Now say there’s a fancy new brain-scanner that lets you determine that even though the cow ends up getting chucked into a meat grinder, on average the cow’s happiness is above the threshold for when non-existence is preferable to existence (assume this is a well-defined concept in your moral system). Your morals are the same as before, but now they’re telling you to raise more cows for meat.</p> <p>An example of a chain of crucial considerations is whether or not we should develop some breakthrough but potentially dangerous technology, like AI or synthetic biology. We might think that the economic and personal benefits make it worth the expense, but a potential crucial consideration is the danger of accidents or misuse. There might be another crucial consideration that it’s better to have the technology developed internationally and in the open, rather than have advances made by rogue states.<br /></p> <p>There are probably many crucial considerations that are either unknown or unacknowledged, especially in areas that we haven’t thought about for very long.</p><p><br /></p> <h3>Cluelessness</h3> <p>The idea of cluelessness is that we are extremely uncertain about the impact of every action. For example, making a car stop as you cross the street might affect a conception later that day, and might make the difference between the birth of a future Gandhi or Hitler later on. 
(Note that many non-consequentialist moral systems seem even more prone to cluelessness worries – William MacAskill points this out in <a href="https://globalprioritiesinstitute.org/wp-content/uploads/2019/MacAskill_Mogensen_Paralysis_Argument.pdf">this paper</a>, and argues for it more informally <a href="https://80000hours.org/podcast/episodes/will-macaskill-paralysis-and-hinge-of-history/#the-paralysis-argument-01542">here</a>.)</p> <p>I’m not sure I fully understand the concerns. I’m especially confused about what the practical consequences of cluelessness should be on our decision-making. Even if we’re mostly clueless about the consequences of our actions, we should base them on the small amount of information we do have. However, at the very least it’s worth keeping in mind just how big uncertainty over consequences can be, and there are a bunch of philosophy paper topics here.</p> <p>For more on cluelessness, see for example:</p> <ul> <li><a href="https://philiptrammell.com/static/simplifying_cluelessness.pdf">Simplifying Cluelessness</a> (an argument that cluelessness is an important and real consideration)</li> <li><a href="http://users.ox.ac.uk/~mert2255/papers/cluelessness.pdf">an in-depth look at different forms of cluelessness</a></li> <li><a href="https://80000hours.org/podcast/episodes/hilary-greaves-global-priorities-institute/">the author of the previous paper discussing related ideas in a podcast</a> (a transcript is available).</li></ul><div><br /></div> <h3>Reality is underpowered</h3> <p>Imagine we resolve all of our uncertainties over moral philosophy, iron out the philosophical questions posed by cluelessness, confidently identify Cause X, avoid biases, find all crucial considerations, and all that remains is the relatively down-to-earth work of figuring out which interventions are most effective. 
You might think this is simple: run a bunch of randomised controlled trials (RCTs) on different interventions, publish the papers, and maybe wait for a meta-analysis to combine the results of all relevant papers before concluding that the matter is solved.</p> <p>Unfortunately, it’s often the case that <a href="https://forum.effectivealtruism.org/posts/jSPGFxLmzJTYSZTK3/reality-is-often-underpowered">reality is underpowered</a> (<a href="https://en.wikipedia.org/wiki/Power_of_a_test">in the statistical sense</a>): we can’t run the experiments or collect the data that we’d need to answer our questions.</p> <p>To take an extreme example, there are many different factors that affect a country’s development. To really settle the issue, we might make groups of, say, a dozen countries each, give them different amounts of the development factors (holding everything else fairly constant), watch them develop over 100 years, run a statistical analysis of the outcomes, and then draw conclusions about how much the factors matter. But try finding hundreds of identical countries with persuadable national leaders (and at least one country must have a science ethics board that lets this study go forwards).</p> <p>To make a metaphor with a <a href="https://en.wikipedia.org/wiki/Resolving_power">different sort of power</a>: the answers to our questions (on what effects are the most important in driving some phenomenon, or which intervention is the most effective) exist, sharp and clear, but the telescopes with which we try to see them aren’t good enough. The best we can do is interpret the smudges we do see, inferring as much as we can without the brute force of an RCT.</p> <p>This is an obvious point, but an important one to keep in mind to temper the rush to say we can answer everything if only we run the right study.</p><p><br /></p> <h2>Conclusions?</h2> <p>All this uncertainty might seem to imply two conclusions. 
I support one of them but not the other.</p> <p>The first conclusion is that the goal of doing good is complicated and difficult (as is the subgoal of having accurate beliefs about the world). This is true, and important to remember. It is tempting to forget analysis and fall back on feelings of righteousness, or to switch to easier questions like “what feels right?” or “what does society say is right?”</p> <p>The second conclusion is that this uncertainty means we should try less. This is wrong. Uncertainties may rightly redirect efforts towards more research, and reducing key uncertainties is probably one of the best things we can do, but there’s no reason why they should make us reduce our efforts.</p> <p>Uncertainty and confusion are properties of minds, not reality; they exist on the map, not the territory. To every well-formed question there is an answer. We need only find it.</p><p> </p><p style="text-align: center;"><a href="https://strataoftheworld.blogspot.com/2020/08/ea-ideas-4-utilitarianism.html">Next post</a> <br /></p> <h1>EA ideas 2: expected value and risk neutrality</h1><p style="text-align: center;"><i>Published 2020-07-25</i></p><center><i><font size="2">2.6k words (9 minutes)</font></i></center><center><i><font size="2"> </font></i></center><center><span style="font-size: small;"><i><font>Posts in this series:<br /></font></i><font><a href="https://strataoftheworld.blogspot.com/2020/07/ea-ideas-1-rigour-and-opportunity-in.html">PREVIOUS</a> | <a href="https://strataoftheworld.blogspot.com/2020/07/ea-ideas-3-uncertainty.html">NEXT </a></font></span><i><font size="2"><br /></font></i></center><div><br /></div><div>The expected value (EV) of an event / choice / random variable is the sum, over all possible outcomes, of {value of outcome} times {probability of that outcome} (if all outcomes are equally likely, it is the average; if they’re not, it’s the
probability-weighted average).</div> <p>In general, a rational agent makes decisions that maximise the expected value of the things they care about. However, EV reasoning involves more subtleties than its mathematical simplicity suggests, in both the real world and in thought experiments.</p> <p>Is a 50% chance of 1000€ exactly as good as a certain gain of 500€ (that is, are we risk-neutral?)? And is either as good as a 50% chance of gaining 2000€ combined with a 50% chance of losing 1000€? (All three options have an EV of 500€.)</p> <p>Not necessarily. A bunch of research (and common sense) says people put decreasing value on an additional unit of money: the thousandth euro is worth more than the ten-thousandth. For example, average happiness scales roughly logarithmically with per-capita GDP. The thing to maximise in a monetary tradeoff is not the money, but the value you place on money; with a logarithmic relationship, the diminishing returns mean that more certain bets are better than naive EV-of-money reasoning implies. A related reason is that people <a href="https://en.wikipedia.org/wiki/Loss_aversion">weight losses more than gains</a>, which makes the third case look worse than the first even if you don’t assume a logarithmic money-to-value function.</p> <p>However, a (selfish) rational agent will still maximise EV in such decisions – not of money, but of what they get from it.</p> <p>(If you’re not selfish and live in a world where money can be transferred easily, the marginal benefit curve of efficiently targeted donations is essentially flat for a very long time – a single person will hit quickly diminishing returns after getting some amount of money, but there are enough poor people in the world that enormous resources are needed before you need to worry about everyone reaching the point of very low marginal benefit from more money.
To fix the old saying, albeit with some hit to its catchiness: “money can buy happiness only (roughly) logarithmically for yourself, but (almost) linearly in the world at large, given efficient targeting”.)</p> <p>In some cases, we don’t need to worry about wonky thing->value functions. Imagine the three scenarios above, but instead of euros we have lives. Each life has the same value; there’s no reasonable argument for the thousandth life being worth less than the first. Simple EV reasoning is the right tool.</p><p><br /></p> <h3>Why expected value?</h3> <p>This conclusion easily invites a certain hesitation. Any decision involving hundreds of lives is a momentous one; how can we be sure of exactly the right way to value these decisions, even in simplified thought experiments? What’s so great about EV?</p> <p>A strong argument is that maximising EV is the strategy that leads to the greatest good over many decisions. In a single decision, a risky but EV-maximising choice can backfire – you might take a 50-50 bet of saving 1000 lives and lose, in which case you’ll have done much worse than picking an option of certainly saving 400. However, it’s a mathematical fact that given enough such choices, the <i>actual</i> average value will tend towards the EV. So maximising EV is what results in the most value in the long run.</p> <p>You might argue that we’re not often met with dozens of similar momentous decisions. Say that we’re reasonably confident the same choice will never pop up again, and certainly not many times; doesn’t the above argument no longer apply? Take a slightly broader view though, and consider which strategy gets you the most value across all decisions you make (of which there will realistically be many, even if no single decision occurs twice): the answer is still EV maximisation. 
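</p><p>The long-run argument can be checked with a quick simulation. This is my own sketch, not from the post, using the illustrative numbers above (a certain save of 400 lives versus a 50-50 gamble on saving 1000, i.e. an EV of 500):</p>

```python
import random

def total_saved(strategy, n_decisions=10_000, seed=0):
    """Total lives saved over many independent decisions.

    "safe"  -> certainly save 400 each time.
    "risky" -> 50-50 gamble on saving 1000 (EV 500 per decision).
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_decisions):
        if strategy == "safe":
            total += 400
        else:
            total += 1000 if rng.random() < 0.5 else 0
    return total

safe = total_saved("safe")
risky = total_saved("risky")
# Per-decision averages: exactly 400 for the safe strategy,
# close to the EV of 500 for the risky one.
print(safe / 10_000, risky / 10_000)
```

<p>Over enough decisions, the EV-maximising strategy reliably saves more lives in total, even though it loses many individual bets.</p><p>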
We could go on to construct crazier thought experiments – toy universes in which only one decision ever occurs, for example – and then the argument really begins to break down (though you might try to save it by some wild scheme of imagining many hypothetical agents faced with the same choice and considering a Kantian / rule-utilitarian principle of deciding by answering the question of which strategy would be right if it were the one adopted across all countless hypothetical instances of this decision).</p> <p>There are other arguments too. Imagine 1000 people are about to die of a disease, and you have to decide between a cure that will certainly save 400 versus an experimental one that will either save everyone or save no-one. Imagine you are one of these people. In the first scenario, you have a 40% chance of living; in the second, a 50% chance. Which would you prefer?</p> <p>On a more mathematical level, von Neumann (an <a href="https://en.wikipedia.org/wiki/List_of_things_named_after_John_von_Neumann">all-around</a> polymath) and Morgenstern (co-founder of game theory with von Neumann) have <a href="https://en.wikipedia.org/wiki/Von_Neumann%E2%80%93Morgenstern_utility_theorem">proved</a> that under fairly basic assumptions about what counts as rational behaviour, a rational agent acts as if they’re maximising the EV of some preference function.</p><p><br /></p> <h3>Problems with EV</h3> <p>Diabolical philosophers have managed to dream up many challenges for EV reasoning. For example, imagine there are two dollars on the table. You toss a coin; if it’s heads you take the money on the table, if it’s tails the money on the table doubles and you toss again. You have a 1/2 chance of winning 2 dollars, 1/4 chance of winning 4, 1/8 chance of winning 8, and so on, for a total EV of 1/2 x 2 + 1/4 x 4 + … = 1 + 1 + … . The sequence diverges to infinity.</p> <p>Imagine a choice: one game of the “St. Petersburg lottery” described above, or a million dollars.
You’d be crazy not to pick the latter.</p> <p>Is this a challenge to the principle of maximising EV? Not in our universe. We know that whatever casino we’re playing at can’t have an infinite amount of money, so we’re wise to intuitively reject the St. Petersburg lottery. (<a href="https://en.wikipedia.org/wiki/St._Petersburg_paradox#Finite_St._Petersburg_lotteries">This section on Wikipedia</a> has a very nice demonstration of why, even if the casino is backed by Bill Gates’s net worth, the EV of the St. Petersburg game is less than $40.)</p> <p>The St. Petersburg lottery isn’t the weirdest EV paradox by half, though. In the <a href="http://www.colyvan.com/papers/pasadena.pdf">Pasadena game</a>, the EV is undefined (see the link for a definition, analysis, and an argument that such scenarios are points against EV-only decision-making). Nick Bostrom writes about the problems of consequentialist ethics in an infinite universe (or a universe that has a finite probability of being infinite) <a href="https://nickbostrom.com/ethics/infinite.html">here</a>.</p> <p>There’s also the classic: Pascal’s wager, the idea that even if the probability of god existing is extremely low, the benefits (an eternity in heaven) are great enough that you should seek to believe in god and live a life of Christian virtue.</p> <p>Unlike even Bostrom’s infinite ethics, Pascal’s wager is straightforwardly silly. We have no reason to privilege the hypothesis of a Christian god over the hypothesis – equally probable given the evidence we have – that there’s a god who punishes us exactly for what the Christian god rewards us for, or that god is a chicken and condemns all chicken-eaters to an eternity of hell. 
So even if you accept the mathematically dubious multiplication of infinities, Pascal’s wager doesn’t let you make an informed decision one way or another.</p> <p>However, the general format of Pascal’s wager – big values multiplied by small probabilities – is the cause of much EV-related craziness, and dealing with such situations is a good example of how naive EV reasoning can go wrong. The more general case is often referred to as <a href="https://nickbostrom.com/papers/pascal.pdf">Pascal’s mugging</a>, and is exemplified by the scenario (see link) where a mugger threatens to torture an astronomical number of people unless you give them a small amount of money.</p><p><br /></p> <h3>Tempering EV extremeness with Bayesian updating</h3> <p>Something similar to Pascal’s mugging easily happens if you calculate EVs by multiplying together very rough guesses involving small probabilities and huge outcomes.</p> <p>The best and most general approach to these sorts of issues is laid out <a href="https://blog.givewell.org/2011/08/18/why-we-cant-take-expected-value-estimates-literally-even-when-theyre-unbiased/">here</a>.</p> <p>The key insight is to remember two things. First, every estimate is a probability distribution: if you measure a nail or estimate the effectiveness of a charity, the result isn’t just your best-guess value, but also the uncertainty surrounding it.
Second, <a href="https://en.wikipedia.org/wiki/Bayesian_inference">Bayesian updating</a> is how you change your estimates when given new evidence (and hence you should pay attention to your prior: the estimate you have before getting the new information).</p> <p>Using some maths detailed <a href="https://blog.givewell.org/attachments/worms.pdf">here</a>, it can be shown that if your prior and measurement both follow normal distributions, then your new (Bayesian) estimate will be another normal distribution, with a mean (=expected value) that is an average of the prior and measurement means, weighted by the inverse variance of the two distributions. (Note that the link does it with log-normal distributions, but the result is the same; just switch between variables and their logarithms.)</p> <p><a href="https://www.desmos.com/calculator/tc5ybxvyzq">Here’s an interactive graph that lets you visualise this</a>.</p> <p>The results are pretty intuitive. Let’s say our prior for the effectiveness of some intervention has a mean of zero. If we take a measurement with low variance, our updated estimate probability distribution will shift most of the way towards our new measurement, and its variance will decrease (it will become narrower):</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-MnACAmO3SQo/Xxv4cW9pU4I/AAAAAAAABdI/3aiW9PmDoJUYqygzx66GH41iVWqB7K4mACLcBGAsYHQ/s1280/EV%2B1.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="684" data-original-width="1280" height="343" src="https://1.bp.blogspot.com/-MnACAmO3SQo/Xxv4cW9pU4I/AAAAAAAABdI/3aiW9PmDoJUYqygzx66GH41iVWqB7K4mACLcBGAsYHQ/w640-h343/EV%2B1.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>Red is the probability distribution of our prior estimate. Green is our measurement. 
Black is our new belief, after a Bayesian update of our prior with the measurement. Dotted lines show the EV (=average, since the distributions are symmetrical) for each probability distribution. You can imagine the x-axis as either a linear or log scale.</i></td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"><br /></div> <p>If the same measurement has greater variance, our estimates shift less:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-uQmvpedlbYY/Xxv4cB6NE0I/AAAAAAAABdM/Ncf5c-bUWu0gE-zBP_Naenzbiqu0gdZXgCPcBGAYYCw/s1280/EV%2B2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="704" data-original-width="1280" height="352" src="https://1.bp.blogspot.com/-uQmvpedlbYY/Xxv4cB6NE0I/AAAAAAAABdM/Ncf5c-bUWu0gE-zBP_Naenzbiqu0gdZXgCPcBGAYYCw/w640-h352/EV%2B2.png" width="640" /></a></div> <p><br /></p><p>And if we have a very imprecise measurement – for example, we’ve multiplied a bunch of rough guesses together – the estimate barely shifts even if the estimate is high:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-M3_TvSBj1eQ/Xxv4cLzwPCI/AAAAAAAABdQ/rMYIo8_RLzUYryw5_JlPAM_HN9KAoVEYgCPcBGAYYCw/s1280/EV%2B3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="428" data-original-width="1280" height="214" src="https://1.bp.blogspot.com/-M3_TvSBj1eQ/Xxv4cLzwPCI/AAAAAAAABdQ/rMYIo8_RLzUYryw5_JlPAM_HN9KAoVEYgCPcBGAYYCw/w640-h214/EV%2B3.png" width="640" /></a></div><p><br /></p> <p>Of course, we can argue about what our priors should be – perhaps, for many of the hypothetical scenarios with potentially massive benefits (for instance concerning potential space colonisation in the future), the variance of our prior should be very large, in which case even highly uncertain guesses will shift our best-guess EV a lot. 
But the overall point still stands: if you go to your calculator, punch in some numbers, and conclude you’ve discovered something massively more important than anything else, it’s time to think very carefully about how much you can really conclude.</p><p>Overall, I think this is a good example of how a bit of maths can knock quite a few teeth off a philosophical problem.</p> <p>(<a href="https://blog.givewell.org/2014/06/10/sequence-thinking-vs-cluster-thinking/">Here’s a wider look at the pitfalls of overly simple EV reasoning with a different framing</a>, by the same author as <a href="https://blog.givewell.org/2011/08/18/why-we-cant-take-expected-value-estimates-literally-even-when-theyre-unbiased/">this earlier link</a>. And <a href="https://arxiv.org/ftp/arxiv/papers/0810/0810.5515.pdf">here</a> is another exploration of the special considerations involved with low-probability, high-stakes risks.)</p><p><br /></p> <h3>Risk neutrality</h3> <p>An implication of EV maximisation as a decision framework is risk neutrality: when you’ve measured things in units of what you actually care about (e.g. converting money to the value it has for you, as discussed above), you should be neutral about the choice between a 10% chance of 10 value units and a 100% chance of 1, and you really should prefer a 10% chance of 11 “value units” over a 100% chance of 1 “value unit”, or a 50-50 bet between losing 10 and gaining 20 (an EV of 5) over a certain gain of 4.</p> <p>This is not an intuitive conclusion, but I think we can be fairly confident in its correctness.
Not only do we have robust theoretical reasons for using EV, but we can point to specific bugs in our brains that make us balk at risk-neutrality: biases like <a href="https://en.wikipedia.org/wiki/Scope_neglect">scope neglect</a>, which makes humans underestimate the difference between big and small effects, or <a href="https://en.wikipedia.org/wiki/Loss_aversion">loss aversion</a>, which makes losses more salient than gains, or a preference for certainty.</p><p>$$$%%IF YOU SEE DOLLAR SIGNS IN THE NEXT SECTION, EQUATION RENDERING VIA MATHJAX IS NOT WORKING IN YOUR BROWSER$$$</p> <h3>Stochastic dominance (an aside)</h3> <p>Risk neutrality is not necessarily specific to EV maximisation. There’s a far more lenient, though also far more incomplete, principle of rational decision making that goes under the clumsy name of “<a href="https://en.wikipedia.org/wiki/Stochastic_dominance">stochastic dominance</a>”: given options $$A$$ and $$B$$, if the probability of a payoff of $$X$$ or greater is at least as great under option $$A$$ as under option $$B$$ for all values of $$X$$ (and strictly greater for some), then $$A$$ “stochastically dominates” option $$B$$ and should be preferred. It’s very hard to argue against stochastic dominance.</p> <p>Consider a risky and a safe bet; to be precise, call them option $$A$$, with a small probability $$p$$ of a large payoff $$L$$, and option $$B$$, with a certain small payoff $$S$$. Assume that $$pL > S$$, so EV maximising says to take option $$A$$.
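</p><p>To make this concrete, here is a small sketch with illustrative numbers of my own choosing ($$p = 0.1$$, $$L = 100$$, $$S = 5$$, so $$pL = 10 > S$$), comparing the two payoff-survival functions $$P(\text{payoff} \ge v)$$:</p>

```python
p, L, S = 0.1, 100, 5  # assumed illustrative numbers; EV of A is 10, EV of B is 5

def surv_A(v):
    """P(payoff of A >= v): A pays L with probability p, else 0."""
    return 1.0 if v <= 0 else (p if v <= L else 0.0)

def surv_B(v):
    """P(payoff of B >= v): B pays S for certain."""
    return 1.0 if v <= S else 0.0

for v in [1, 5, 20, 100]:
    print(v, surv_A(v), surv_B(v))
# B's curve is higher for v <= 5, A's is higher for 5 < v <= 100:
# the curves cross, so neither option stochastically dominates,
# even though A has twice the EV.
```

<p>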
However, we don’t have stochastic dominance: the probability of getting at least a small amount of value $$v$$ ($$v \le S$$) is greater with $$B$$ than with $$A$$, whereas the probability of getting at least a large amount of value ($$S < v \le L$$) is greater with option $$A$$.</p> <p>The insight of <a href="https://philarchive.org/archive/TAREES">this paper</a> (summarised <a href="https://ethicalhaydonism.wordpress.com/2020/03/27/in-this-universe-you-have-to-be-risk-neutral/">here</a>) is that if we care about the total amount of value in the universe, are sufficiently uncertain about this total amount, and make some assumptions about its distribution, then stochastic dominance alone implies a high level of risk neutrality.</p> <p>The argument goes as follows: we have some estimate of the probability distribution $$U$$ of value that might exist in the universe. We care about the entire universe, not just the local effects of our decision, so what we consider is $$A + U$$ and $$B + U$$ rather than $$A$$ and $$B$$. Now consider an amount of value $$v$$. The probability that $$A + U$$ exceeds $$v$$ is the probability that $$U > v$$, plus the probability that $$(v - L) < U < v$$ <i>and</i> $$A$$ pays off $$L$$ (we called this probability $$p$$ earlier). The probability that $$B + U$$ exceeds $$v$$ is the probability that $$U > v - S$$.</p> <p>Is the first probability greater? This depends on the shape of the distribution of $$U$$ (to be precise, we’re asking whether $$P(U > v) + p P(v - L < U < v) > P(U > v - S)$$, which clearly depends on $$U$$). If you do a bunch of maths (which is present in the paper linked above; I haven’t looked through it), it turns out that this is true for all $$v$$ – and hence we have stochastic dominance of $$A$$ over $$B$$ – <i>if</i> the distribution of $$U$$ is wide enough and has a fat tail (i.e. trails off slowly as $$v$$ increases).</p> <p>What’s especially neat is that this automatically excludes Pascal’s mugging.
The smaller the probability $$p$$ of our payoff is, the more stringent the criteria get: we need a wider and wider distribution of $$U$$ before $$A$$ stochastically dominates $$B$$, and at some point even the most stringent Pascalian must admit $$U$$ can’t plausibly have that wide of a distribution.</p> <p>It’s far from clear what $$U$$’s shape is, and hence how strong this reasoning is (see the links above for that). However, it is a good example of how easily benign background assumptions introduce risk neutrality into the problem of rational choice.</p><p><br /></p> <h3>Implications of risk neutrality: hits-based giving</h3> <p>What does risk neutrality imply about real-world altruism? In short, that we should be willing to take risks.</p> <p>A good overview of these considerations is given <a href="https://www.openphilanthropy.org/blog/hits-based-giving">in this article</a>. The key point:</p> <blockquote><p><i>[W]e suspect that, in fact, much of the best philanthropy is likely to fail.</i></p></blockquote> <p>For example, <a href="https://blog.givewell.org/2016/07/26/deworming-might-huge-impact-might-close-zero-impact/">GiveWell thinks that Deworm the World Initiative probably has low impact</a>, but still recommends them as one of their top charities because there’s a chance of massive impacts.</p> <p>Hits-based giving comes with its own share of problems. As <a href="https://www.openphilanthropy.org/blog/hits-based-giving">the article linked above</a> notes, it can provide a cover for arrogance and make it harder to be open about decision-making. 
However, just as high-risk high-reward projects make up a disproportionate share of successes in scientific research and entrepreneurship, we shouldn’t be surprised if the bulk of returns on charity comes from a small number of risky bets.</p><p> </p><p style="text-align: center;"><a href="https://strataoftheworld.blogspot.com/2020/07/ea-ideas-3-uncertainty.html">Next post</a> <br /></p> <h1>EA ideas 1: rigour and opportunity in charity</h1><p style="text-align: center;"><i>Published 2020-07-25</i></p><center><font size="2"><i>2.2k words (8 minutes)<br /><br /></i></font></center> <p>Effective altruism (EA) is about trying to carefully reason how to do the most good. On the practical side, EA has inspired the donation of hundreds of millions of dollars to impactful charities, and led to many new organisations focused on important causes. On the theoretical side, it has led to rigorous and precise thought on ethics and how to apply it in the real world.</p> <p>The intellectual work that has come out of EA is valuable, especially in two ways.</p> <p>First, much EA work is exceptional in the breadth and weight of the matters it considers. It is interdisciplinary, including everything from meta-ethics to interpreting studies on the effectiveness of vaccination programs in developing countries. Because of its motivation – finding and exploring the most important problems – it zeroes in on the weightiest issues in any particular area. EA work is a goldmine of interesting writing, particularly if you find yourself drawn in a discipline-agnostic way to all the biggest questions.</p> <p>Second, EA brings a scientific precision of argument that is often missing from discussions of abstract things (e.g. meta-ethics) or emotionally charged issues (e.g.
saving lives).</p> <p>This post explains the motivations behind EA, and has a table of contents for this post series.</p><p><br /></p> <h3>Altruism, impartial welfarist good, and cause neutrality</h3> <p>I will have more to say in a later post about specific philosophical issues in defining what is moral. For now I will hope that the idea of an impartial welfare-oriented definition of good is sufficiently defensible that I will not be mauled to death by moral philosophers before that post (though if it doesn’t happen by then, it will certainly happen afterwards).</p> <p>Impartial (in the sense of considering everyone fairly, and giving the same answer regardless of who’s doing the judging) and welfare-oriented (in the sense of valuing happiness, meaning, fulfilment of preferences, and the absence of suffering) good is an intuitive and fairly unobjectionable idea. Yet if we take it as a goal, it points towards a different idea of charity than the current norm.</p> <p>Most charities are single-issue charities. This generally makes sense: better to have one organisation be really good at distributing malaria nets and one really good at advocating for taking nuclear weapons off high alert, than to have one organisation doing a mediocre job at both (malaria net delivery via ICBM?).</p><p>But the siloing of causes often goes further. If the effectiveness of an intervention is considered, it is often after choosing a cause area. To weigh cause areas against each other, to judge the needs of African children against, say, factory farmed pigs, seems like a faux pas at best, and a sin at worst (for a particularly incendiary tirade on the topic, see <a href="https://ssir.org/articles/entry/the_elitist_philanthropy_of_so_called_effective_altruism">this</a> article). </p> <p>However, if we hold ourselves to an impartial welfarist idea of good, this judgement must be made. 
An artist might choose what to paint based on how they want to express themselves or on a sudden flash of inspiration. A would-be altruist refusing to weigh causes against each other and instead selecting them on the basis of passion or inspiration is acting like our artist. In the artist’s case it doesn’t matter, but the altruist, in doing so, implicitly values their own choice and/or self-expression over the good that their actions might do. This is not altruism by our definition of good.</p> <p>Of course, people differ in their knowledge and talents, and these tend to align with inspiration. In the real world, it may well be that your greater ability, drive, and/or knowledge in one area outweighs the greater efficiency at which results convert to goodness in some other area. We will also see arguments for not placing all our bets on the same cause, and explore the enormous uncertainties that come in trying to compare causes. But the idea of <i>cause-neutrality</i> – that causes are comparable, and that making these comparisons is an important part of the job of any would-be altruist – remains.</p><p><br /></p> <h3>Effectiveness</h3> <p>Focusing on the idea of impartial welfarist good also makes it clear that, in trying to do good, we should focus on the good our actions result in. This may seem like an obvious statement, but it is not true of much charitable work.</p> <p>For example, we tend to emphasise the sacrifices of the donor over the benefits of the recipients. Consider old tales of people like <a href="https://en.wikipedia.org/wiki/Francis_of_assisi">Francis of Assisi</a>. Their claim to virtue (and sainthood) comes from giving away all their possessions, but the question of how much good this did to the beggars doesn’t come up. This attitude continues in the many modern charity evaluators that focus on metrics like percentage of money spent on overhead costs. 
Paying big salaries to recruit the best management and administration may genuinely be a cost-effective way of increasing the total good done, but it conflicts with our stereotype of self-sacrificing do-gooders. Of course, there is virtue in selfless sacrifice, but we should remember that the goal of charity is to make recipients better off, not to rank donors.</p> <p>As with many things humans do, acts of charity often aren't based on rational calculation. Some consider this a good thing: altruistic acts should come from hearts, not spreadsheets. This is wrong – if you care about impartial welfarist good.</p> <p>It is a fact about our world that <a href="https://blog.givewell.org/2011/06/11/why-we-should-expect-good-giving-to-be-hard/">good charity is hard</a>, and that charities <a href="https://www.cgdev.org/sites/default/files/1427016_file_moral_imperative_cost_effectiveness.pdf">have vast differences in cost-effectiveness</a>. When one charity results in ten or a hundred times more healthy years of life per dollar spent than another, boring details of statistical effectiveness become important moral facts. (This is true not just of charities, but most kinds of projects that might impact many people – government policy, activism, and so on.)</p> <p>When the difference in effectiveness between interventions is often greater than the difference between the less effective intervention and doing nothing at all, and when these differences are often measured in lives, effectiveness considerations are critical in any attempt to do good.</p> <p>There is a role for simple, comforting altruism, but this role isn’t making big decisions over how to benefit others. These decisions deserve more than goodwill. They deserve to be made right. 
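To make the arithmetic concrete, here is a minimal sketch in Python. All the numbers are made up for illustration (real cost-effectiveness estimates are far more uncertain), but they are in the spirit of the ~10–100x spreads reported in the cost-effectiveness literature:

```python
# Toy illustration (all numbers hypothetical): why the choice *between*
# charities can matter more than the choice of whether to give at all.

def healthy_years(donation, cost_per_healthy_year):
    """Healthy years of life bought by a donation, at a given cost per year."""
    return donation / cost_per_healthy_year

donation = 1000  # dollars

# Hypothetical costs per healthy year of life for two charities.
typical_charity = 5000  # $ per healthy year
top_charity = 50        # $ per healthy year (100x more cost-effective)

gain_typical = healthy_years(donation, typical_charity)  # 0.2 years
gain_top = healthy_years(donation, top_charity)          # 20.0 years

# The gap between the two charities (19.8 years) dwarfs the gap between
# the typical charity and not donating at all (0.2 years).
print(gain_typical, gain_top)
```

Under these (hypothetical) numbers, picking the better charity matters a hundred times more than the decision to donate in the first place – which is the sense in which statistical details become moral facts.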
<br /></p><p><br /></p> <h3>Opportunity</h3> <p>Debates over charitable giving often centre on questions of moral duty and obligation (a good example is <a href="https://www.utilitarian.net/singer/by/1972----.htm">Famine, Affluence, and Morality</a>, Peter Singer’s classic paper that laid some of the foundations of what later became EA).</p> <p>Another framing is to think of it as an opportunity. To someone who cares about impartial welfarist good, altruistic acts are not a burden but an opportunity to achieve valuable things. In particular, there are many reasons to think that we (as in developed-world humans of the early 21st century) have an exceptionally large opportunity to do good.</p> <p>First, our values are better than those of people in preceding eras. This statement implies many philosophically contentious points, but for the time being I will not defend them, instead appealing to what I hope to be a common sense conviction that human morality isn’t so relative that it is impossible to differentiate modern secular humanist values from values that support war, slavery, and boundaries on personhood that exclude most people.</p> <p>(Of course, this statement also suggests that our current moral views are far from perfect too. This is important, very likely true, and will be discussed at length in future posts. The fact that this is increasingly recognised is hopefully a hint that we are at least on the right track.)</p> <p>Second, we have more resources than people in previous eras. There is also large variation in global income, meaning that if you happen to live in a rich country, you can help many others for cheap. 
A 2-adult, 1-child UK household with a total income of £30,000 is in the <a href="https://howrichami.givingwhatwecan.org/?income=30000&countryCode=GBR&household[adults]=2&household[children]=1">top 10% of the world income distribution and 7 times richer than the median global household</a>.</p> <p>Third, knowledge on what is effective has increased and technology makes it easier to apply this knowledge. Today <a href="https://givewell.org">GiveWell</a>’s thorough charity research can multiply the impact of giving. Twenty years ago, there was no GiveWell. Two hundred years ago, donation guidance, if it existed, might have consisted of the church telling you to donate to it so it could convert people and push its social values.</p> <p>Fourth, we may have an unprecedented ability to affect where civilisation is headed (for thoughts on this topic, see for example <a href="https://forum.effectivealtruism.org/posts/XXLf6FmWujkxna3E6/are-we-living-at-the-most-influential-time-in-history-1">this link</a>). The steepness of technical advancement increases the variance of possible future outcomes: in the next few decades we might nuke each other or engineer a pandemic – or we can set ourselves on a trajectory towards becoming a sustainable civilisation with billions of happy inhabitants that lasts until the stars burn down. 
Past eras didn’t have similar power, and if the future goes well humanity will no longer be as vulnerable to catastrophe as we are today, so people living roughly today might have exceptional leverage.</p><p><br /></p> <h3>Common EA cause areas</h3> <p>The cause areas most frequently seen as important, and most specific to EA relative to what other charities focus on, are:</p> <ul> <li>Global poverty, because the developing world is big, poor, and has many tractable problems with well-researched solutions.</li> <li>Animal welfare, because it is largely ignored, and potentially huge in scope (depending on how much animal lives are valued).</li> <li>Existential risk: focusing on avoiding human extinction or other irrevocable civilisational collapses, because new technologies (AI and biotech in particular) make them scarily plausible. (Sometimes this is motivated even more strongly by long-termism: specifically caring about the overwhelming number of happy future lives that may come to exist over the long-term future if we don't mess things up).</li></ul> <p>These are far from the only cause areas discussed in EA. Many EA-affiliated people argue either against some of the above, for the overwhelming importance of one relative to the other, or for entirely different causes.</p><p><br /></p> <h3>Effective altruism in practice</h3> <p>In practice, EA can seem weird and theoretical.</p> <p>The main reason for EA weirdness is that it casts a wide net. Everyone agrees that international peacekeeping is an important project, and also a serious one: it doesn’t get much more serious than world leaders intervening to get men with big guns to have big talks about their big disputes. On the other hand, the colonisation of space is important, but seems to have very little gravitas indeed; it’s something out of a science fiction novel. 
However, just as it’s a brute fact about the world that there are lots of violent people with big guns, it’s also a brute fact that space is big; both of these facts should be taken seriously when considering the long-run future. There might be a clear line between sci-fi and current affairs in a bookshop, but reality doesn't care about genre.<br /></p> <p>More generally, it’s important to keep in mind that every moral advance started out as a weird idea (for example, it was once considered crazy to suggest that women should get to vote).</p> <p>Parts of EA are very theoretical. This, too, is by design. Future posts will show many cases where the way we resolve a very abstract issue has a big impact on what the right practical action is – and in many of these cases it is unclear what the right resolution is. Finding out clearly matters.<br /></p> <p>If EA seems too theoretical or mathematical to you, consider two points. First, whatever the field, doing complex things in the real world tends to involve (or be built on) theoretical heavy lifting. Second, most charity efforts don’t pay much attention to theoretical issues; EA is at the very least a helpful counterweight, and likely to uncover missed opportunities. </p> <p>Whenever the goal is to do good, it is easy to be overwhelmed by feelings of righteousness and forget theoretical scruples. Unfortunately we don’t live in the simple world where what feels right is the same as what is right.</p> <p>The core of effective altruism is not any particular moral theory or cause area, but a conviction that doing good is both important and difficult, and hence worthy of thought.</p><p><br /></p> <h3>This post series:</h3> <ol> <li>Rigour and opportunity in charity: this post.</li><li><a href="http://strataoftheworld.blogspot.com/2020/07/ea-ideas-2-expected-value-and-risk.html">Expected value and risk neutrality</a>: a rational agent maximises the expected value of what it cares about. 
Expected value reasoning is not free of problems, but, outside extreme thought experiments and applied carefully, it clears most of them, including "Pascal's mugging" (high-stakes, low-probability situations). Expected value reasoning implies risk neutrality. The most effective charity may often be a risky one, and gains from giving may be dominated by a few risky bets.</li><li><a href="http://strataoftheworld.blogspot.com/2020/07/ea-ideas-3-uncertainty.html">Uncertainty</a>: we are uncertain about both what is right and what is true (being mindful of the difference is often important). Moral uncertainty raises the question of how we should act when we have credence in more than one moral theory. Uncertainty about truth has many sources, including ones broader than uncertainty about specific facts, such as our biases or the difficulty of confirming some facts. These uncertainties suggest we are unaware of huge problems and opportunities.</li><li><a href="https://strataoftheworld.blogspot.com/2020/08/ea-ideas-4-utilitarianism.html">Utilitarianism</a>: while not a necessary part of EA thinking, utilitarianism is the most successful description of the core of human ethics so far. In principle (if not practice, due to the complexity of defining utility), it is capable of deciding every moral question, an important property for a moral system. Our moral progress over the past few centuries can be summarised as a transition to more utilitarian morality.</li></ol><div>(More coming)<br /></div> Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-45241767924271643672020-05-09T22:29:00.003+01:002020-08-04T10:38:21.618+01:00Short reviews: fiction<h3><i>Cryptonomicon</i> (Neal Stephenson)</h3><br /><i>Cryptonomicon</i> is a hard novel to summarise. 
It is about World War II code-breakers and 1990s tech entrepreneurs, but also manages to concern itself with most other things as well.<br /><br />I first read <i>Cryptonomicon</i> over two years ago. However, it is a massive book, and since it happens in the same universe as <a href="http://strataoftheworld.blogspot.com/2018/06/review-baroque-cycle-neal-stephenson.html"><i>The Baroque Cycle</i></a>, I assumed reading it again would reveal many new things. I was not wrong.<br /><br />Neal Stephenson has a humorously extravagant (baroque?) writing style that is always entertaining to read, but in <i>Cryptonomicon</i> it is taken to an extreme. Stephenson turns mundane activities like writing a business plan, eating cereal, taking a car ride in the Philippines, and visiting a dentist into lengthy but hilarious tangents. Do they contribute to the plot? Who cares!<br /><br />As this is a Neal Stephenson novel, certain vices will also be present. A printed version of the book, dropped from a bomber, would punch a hole through the deck of a Japanese warship. The plot meanders to an extent that puts most rivers to shame. And some things are just plain weird.<br /><br />But overall, <i>Cryptonomicon</i> makes for a great read for anyone with the time to spare, and an interest in codebreaking, history, war, mathematics, the Internet, the financial industry, or technology.<br /><br /><br /><h3><i>Exhalation</i> (Ted Chiang)</h3><br />“Exhalation”, this short story collection’s titular work, is the greatest short story I have ever read. (You may read it online for free – and legally, as far as I can tell – <a href="https://www.lightspeedmagazine.com/fiction/exhalation/">here</a>). The careful setup builds to a beautiful and intuitive analogy that makes the philosophical points at the end hit hard.<br /><br />Based on the strengths of “Exhalation” (the short story), I bought <i>Exhalation</i> (the short story collection). 
None of the other stories surpass “Exhalation”, though they are mostly good and sometimes excellent.<br /><br />Reading a Ted Chiang story is like watching an eerily intricate machine in action, or listening to a Bach fugue: the feeling is one of orderliness and precision combined with an almost casual ease. The premise of each story is fundamentally a thought experiment; a “what-if” question knocks down one domino and the story follows its consequences all the way down the chain. Nothing is wasted or in excess, and the beats of the pacing arrive with metronomic regularity. In the best of the stories, these beats are almost undetectable at first, gradually building up into dawning revelation as the pieces fall together and the story reaches its climax.<br /><br />Aside from “Exhalation”, there are two stories that stand out.<br /><br />“The Truth of Fact, the Truth of Feeling” is a thoughtful exploration of the effect of the medium on what is seen as true (a <a href="http://strataoftheworld.blogspot.com/2019/08/review-amusing-ourselves-to-death.html">topic</a> that Neil Postman would feel right at home with). The story cleverly draws a parallel between a person in an African village being introduced to literacy in the past, and a person in the future grappling with the consequences of technology that records everything people see. In a world of cautionary tales about technology stealing our identities, destroying our communities, or letting dinosaurs loose in the park, Chiang’s take on this issue is surprisingly forward-looking.<br /><br />In “Omphalos” (an Ancient Greek word for “navel”, as in the expression “navel of the world”), the what-if question is: what if creationism were true, but humanity was a side-effect rather than the pinnacle of creation? The story is told in the form of prayers to god. 
Chiang takes the reader on a tour of what the scientific facts of this world look like: old trees with no growth rings in the middle, mummified people without navels, and so on, until finally a physics discovery, while confirming without doubt the existence of miracles, also leads inevitably to the conclusion that we are not the purpose of god’s creation. All this takes place in parallel with the emotional arc of the central character, which is told in a sympathetic and realistic manner.<br /><br /><br /><h3><i>Summerland</i> (Hannu Rajaniemi)</h3><br />The year is 1938. The Spanish Civil War rages on, Europe braces for war, Queen Victoria reigns from the afterlife, and the Soviets are merging souls into a godlike overmind, starting with Lenin’s.<br /><br />In the alternative universe of <i>Summerland</i>, Marconi discovered more than he bargained for when working with radio transmission, and soon enough ectotanks and other supernatural weaponry were being deployed in World War I. Since then much of early-1900s spiritualism has been proven right.<br /><br />Most significant is Summerland, an afterlife where souls can lodge themselves (provided they have a ticket) and even interact to a limited extent with the living.<br /><br />In terms of plot, <i>Summerland</i> is a fairly straightforward spy novel. 
This is executed well (though my judgement may not be representative of those who know more about spy novels), but the premise is what makes <i>Summerland</i> special.<br /><br />(Rajaniemi is best known for his far-future science fiction trilogy, which starts with <i>The Quantum Thief</i>; this is also recommended.)<br /><br /><br /><h3><i>The Curse of Chalion</i> (Lois McMaster Bujold)</h3><br />At the time of writing, the <a href="https://en.wikipedia.org/wiki/The_Curse_of_Chalion#Reception">“Reception” section</a> of the Wikipedia page for this book tells me nothing but “The book has received a number of reviews”.<br /><br />This rather underwhelming (though doubtlessly accurate) statement does not do the book justice. <i>The Curse of Chalion</i> shines not through outstanding excellence in one respect, but rather by bringing a variety of good elements together: characters that feel like real people, an atmospheric setting, and above all a hard-to-pin-down tastefulness where nothing is in excess.<br /><br />If I had to critique something, some of the turning points in the plot are rather deus ex machina. However, overall the book is a great example of fantasy built on literary merits rather than genre props, and makes for a very enjoyable story to get lost in.<br /><br />(The introduction of the Wikipedia article, however, is little but a list of all the awards the book has won.)<br /><br /><br /><h3><i>Unsong</i> (Scott Alexander)</h3><br />In the beginning God created the heavens and the Earth. For a while, everything was fine. Then Thamiel, the left hand of God, appears in the centre of the Earth and corrupts a third of the angelic host. A war begins between the angels and demons, in which the demons gain the upper hand. Their victory is averted only when the mathematically talented archangel Uriel initiates his backup plan: switching the world from running on divine light to running on mathematical laws. 
Angels and demons both are reduced to mere metaphors, and the world is saved.<br /><br />Saved, that is, until humans get very good at harnessing those laws and send Apollo 8 on a trip around the moon in 1968. Unfortunately all space beyond the moon is simply an illusion to make the universe seem consistent with the physics that now reigns on Earth. Instead of looping around the moon, Apollo 8 crashes into the edge of the world, damaging the delicate celestial machinery that Uriel put into place to maintain his conversion.<br /><br />Various glitches start to show up in the working of the world. Angels and demons begin returning: Uriel reappears in a hurricane in the Mexican Gulf, from where he plays the role of an overworked sysadmin issuing a constant stream of patches to prevent physics from crashing, while demons spring up from Lake Baikal and start invading Russia.<br /><br />The backstory of <i>Unsong</i>, told in various short excerpts throughout the book, continues with a very clever account of how the world reacts to this turn of events. Cold War politicking continues; for example, at one point Henry Kissinger successfully convinces President Nixon to ally with Hell in order to keep the Russians in check.<br /><br />The main plot line begins in 2017. In this universe, kabbalah works. In particular, it makes possible the discovery of Names of God – words which have magical powers, but whose distribution is controlled by strict copyright laws. The main character, Aaron Smith-Teller, is a gifted kabbalist, but works a low-paid job helping a company find Names: he reads potential Names off a computer screen all day long, and if he finds a Name, gives it over to the company. 
The process cannot be automated because computers lack a soul and hence can’t detect which words are Names, necessitating this sort of low-skill work.<br /><br /><i>Unsong</i> is remarkable not just for its crazy premise, but for the consistency and ruthlessness of its internal logic (which characters do not fail to exploit). Imagine you stumble across a Name that grants souls to inanimate objects. What do you do? That’s obvious: use it on a computer, have it start searching for new Names at superhuman speed, sell the Names for profit, buy more computers, and continue in this vein until you have magic powers beyond your dreams and can take over the world. If the Bible is literally true, what is the overriding moral priority? Simple: end the existence of hell; countless people suffering eternal torture for vague reasons cannot be part of a just universe.<br /><br />The central question that many of <i>Unsong</i>’s characters grapple with is the problem of theodicy: why would a good god create a world with so much evil? This question does not have direct relevance to our own world, but it leads to other interesting questions (as well as giving the author a chance to flaunt their ingenuity; the book actually has a plausible answer). Together with characters who are often both idealistic and ruthless – I’m particularly fond of Jalaketu West, AKA “The Comet King” – this makes the book a good exploration of many moral themes.<br /><br />Be warned, though: <i>Unsong</i> is about a universe where words, rather than equations, are the building blocks of reality. This leads to a lot of perverse verbal ingenuity, including more puns than can possibly be healthy. If you don’t want to read about characters who protest at the World’s Fair by waving signs saying “No it isn’t!”, or how atheists also include a leviathan in their mythology by calling the whole world a giant fluke, stay away.<br /><br /><i>Unsong</i> was published online, chapter by chapter. 
This means two things, one bad and one good. First, it is a bit less polished than a published novel might be. Second, <a href="http://unsongbook.com/">you can read it for free online</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-68017470764452389212020-05-09T22:14:00.001+01:002020-07-25T08:06:46.302+01:00Short reviews: non-fiction<h3><i>The Feynman Lectures on Physics</i> (Richard Feynman)</h3><br /><i>The Feynman Lectures on Physics</i> (FLOP) is an incredible resource on basic physics. Feynman has an inimitable style: he is always clear, never the slightest bit pretentious, and has an eerie ability to cut through tangles of models, assumptions, and equations to get at the fundamental point. Often you can feel Feynman’s infectious enthusiasm through the page.<br /><br />There are some issues with trying to learn physics from FLOP. There are no exercises, so you cannot test your understanding very easily.<br /><br />Another issue is that the easy flow and elegant arguments make it less structured. If a typical textbook is like taking an official tour through a city, methodically exploring everything there is to see, the general feel of FLOP is more of chasing after a boundlessly enthusiastic tour guide as he zips from place to place using various shortcuts, leaving you with the nagging feeling that, while it was certainly very fun, you might not be able to retrace the route afterwards.<br /><br />Feynman has a remarkable ability to introduce just enough background to pull off some proof or argument. This makes for some brilliant arguments that are fun to follow, but, particularly when it comes to mathematical tricks, it left me with the feeling that if I didn’t have more background than Feynman introduces, I would be lost.<br /><br />(Given a solid understanding of calculus and complex numbers, there are no great leaps required to follow the mathematics in volume 1. 
Volume 2 deals mainly with electromagnetism, which relies on vector calculus; at the time I was reading it, I didn’t have a solid grasp on that and this made parts difficult to follow. I still haven’t had a chance to read through everything in the second half of volume 2, and have read nothing from volume 3 and so cannot comment on it.)<br /><br />Overall, FLOP is a brilliant resource. Perhaps it works best as a reference volume; there are many arguments that I do not remember off the top of my head, but which I remember are presented with extreme clarity in FLOP. Of course, without reading through at least once, how will you know what’s in it?<br /><br /><br /><h3><i>The Character of Physical Law</i> (Richard Feynman)</h3><br /><i>The Character of Physical Law</i> is based on another series of lectures that Feynman gave. It attempts to squeeze out maximum understanding and reflection about what physics is about from a minimum of abstruse maths.<br /><br />It succeeds.<br /><br />The focus is not on what the laws themselves are, but rather on the common themes in many of them: conservation principles, symmetry, and, of course, maths. The combination of clear explanation and reflection without pretence or overstretched philosophy is unbeatable.<br /><br />If you read one popular physics book, make it this one. It is as close to the heart of physics as you can get without heavy mathematics.<br /><br />If you are serious about physics, you will of course have to dive into the maths. But read this book anyway.<br /><br /><br /><h3><i>Origin Story: A Big History of Everything</i> (David Christian)</h3><br />I have rarely agreed with the purpose of a book as much as I do with the purpose of <i>Origin Story</i>.<br /><br />The idea is that an origin story – an account of where the world came from and of humanity’s place in it – has been a foundational part of most human cultures in history. 
Ironically, just as our civilisation is now figuring out the real answers to these questions, a collective understanding of our “origin story” is missing. This is the gap that <i>Origin Story</i> – and the field of big history in general – aims to plug.<br /><br /><br /><h3><i>The Great Leveler: Violence and the History of Inequality</i> (Walter Scheidel)</h3><br />Inequality is a trendy topic. Coherent insights into its history and how to quantify it are notably less trendy.<br /><br /><i>The Great Leveler</i> provides both in spades. Optimism is in somewhat shorter supply. Scheidel identifies “Four Horsemen of Leveling” that have historically driven large decreases in inequality: total war, violent revolution, state collapse, and pandemics. If, as Scheidel cautions, welfare democracies probably won’t buck this trend, it looks like coronavirus is our only chance.<br /><br /><br /><h3><i>The Strategy of Conflict </i>(Thomas Schelling)<br /></h3><br />You are in a car, driving directly towards another car. You will soon crash. The rules of <a href="https://en.wikipedia.org/wiki/Chicken_(game)">the game</a> are simple: the first one to swerve loses. How do you win? You close your eyes, throw away the steering wheel – basically, anything that both removes your ability to act and credibly signals this to your opponent.<br /><br /><i>The Strategy of Conflict</i> is all about delightfully – and sometimes scarily – counterintuitive problems in game theory, in particular conflict of the nuclear sort. The general theme is that reducing your ability to make choices and committing to irrational acts can be the most powerful tools at your disposal. 
If you can commit to something in advance, regardless of whether it is in your rational interest to do it when the time comes, you can change the payoffs for your opponent, and hence possibly change what they calculate their best action to be.<br /><br /><br /><h3><i>The Doomsday Machine</i> (Daniel Ellsberg)</h3><br />My brief notes on this book snowballed into a full review, which you can find <a href="http://strataoftheworld.blogspot.com/2020/04/review-doomsday-machine.html">here</a>. If you’re getting tired of the coronavirus pandemic, why not put things into perspective by reading about nuclear war?<br /><br /><br /><h3><i>Founders at Work </i>(Jessica Livingston)<br /><i></i></h3><br /><i>Founders at Work</i> is a collection of interviews with startup founders. The book doesn’t try to be anything fancy, or make any deep conclusions about how the technology industry works. Its main value – and this is not a trivial thing – is as a source of “virtual experience” that you can download into your brain. Reading dozens of founders reflecting on their experiences with the guidance of a knowledgeable interviewer is the second-best thing to having that experience yourself.<br />Perhaps the two most basic and recurring themes are:<br /><ol><li>In a (good) startup, everything is as barebones, minimalist, and plain as possible. The working place might be the stereotypical garage, someone’s apartment, or there might not even be one. Money is saved in endlessly creative ways. At most, you occasionally might have to dress up or pretend to have a normal office to impress investors. 
This theme is summarised by a story told in the introduction: some people tried to figure out how to make a sports car go faster, and eventually realised the key was to remove everything that makes it look like it goes fast.</li><li>In the early stages, no one has any idea what they’re doing.</li></ol><br /><br /><h3><i>Security Engineering </i>(Ross Anderson)</h3><br />The lecturer for my current software & security engineering course is publishing the third edition of his security engineering textbook online chapter by chapter (“like Dickens’ novels”, as he describes it). The textbook is extremely readable, and many of the case studies are both illuminating and funny. Read it <a href="https://www.cl.cam.ac.uk/~rja14/book.html">here</a>.<br /><br />Be warned that most of the chapters will disappear from the website for several years after the book is published. However, they will return afterwards, and the same page linked above also has all the chapters from the second edition, which has already passed this period and is free online forever.<br /><br /><br /><h2>Review: The Doomsday Machine (2020-04-23)</h2><div style="text-align: center;"> <span style="font-size: x-small;">Book: <i>The Doomsday Machine: Confessions of a Nuclear War Planner</i>, </span><span style="font-size: x-small;">by Daniel Ellsberg (2012).</span></div><div style="text-align: center;"><span style="font-size: x-small;">This review is 4.6k words (≈15 minutes).</span></div><div style="text-align: center;"><span style="font-size: x-small;"> </span></div><div style="text-align: center;"><span style="font-size: x-small;"><a href="https://forum.effectivealtruism.org/posts/surPpSDrbnoxwreEd/book-review-the-doomsday-machine-1">This post has also been published on the Effective Altruism Forum </a><br /></span></div><br />Here’s what former RAND Corporation nuclear strategy analyst 
(and later, Pentagon Papers leaker) Daniel Ellsberg and his colleague thought about the movie <i>Doctor Strangelove</i> – a dark and brilliant comedy about accidental nuclear war – after watching it in 1964:<br /><blockquote><i>"We came out into the afternoon sunlight, dazed by the light and the film, both agreeing that what we had just seen was, essentially, a documentary."</i></blockquote><b><br /></b><b>Doctor Ellsberg, or: How I Learned to Start Worrying and Hate the Bomb</b><br /><b><br /></b>The age of mass slaughter of civilians as war strategy did not start with Hiroshima, but rather years before with British and, later, American bombing campaigns. No new moral or strategic choice was made in the decision to drop the atomic bombs on Japan; it was the natural outgrowth of the policies that had already incinerated Dresden and Tokyo.<br /><br />Of course, nuclear technology meant an escalation of its scale. A single plane carrying an atomic bomb is more efficient at delivering mass death than a bomber fleet. Hydrogen weapons, in which the atomic bomb is a mere detonating cap for a fusion reaction, scale up the destructive power a thousandfold. Thanks to missiles that can strike anywhere on Earth within an hour and the insistence of many nuclear countries on keeping weapons on high alert, each nuclear power has a loaded gun trained on the civilian population of the others.<br /><br />The perverse logic of this hostage situation leads to the sorts of insanities that make Ellsberg call <i>Doctor Strangelove</i> a documentary.<br /><br />For example, the lack of any way to recall bombers was a true and deliberate part of the US nuclear response mechanism. 
The fictional horror scenario of unintentionally launched bombers continuing towards their targets while the rest of the world spends its final hours waiting powerlessly was at most fifteen minutes from becoming reality throughout the early Cold War.<br /><br />(Thankfully the switch from bombers to faster missiles later removed this anxiety-inducing pre-Armageddon wait.)<br /><br />Why? Presumably because a recall code could be stolen by the enemy and used to falsely recall an attack. The logic of mutually assured destruction demands certain response without delay.<br /><br />When the US Air Force was told to place electronic locks on Minuteman missiles to prevent unauthorised launch, they decided that the unlock code would always be set to 0000 0000, so that a launch would never be blocked because the code was missing (or because a nervous launch officer couldn’t punch in anything more complicated).<br /><br />Delegating the authority to launch weapons is another way of ensuring launch readiness. If the president, vice president, and everyone else in the line of succession right down to the White House chef are nuked into oblivion by a surprise strike, this can’t be allowed to interfere with the ability to retaliate, or else that’s exactly what the Soviets would immediately start planning to do. And so (as Ellsberg carefully investigated) Eisenhower discreetly gave the admiral of the US Indo-Pacific Command the right to start the nuclear war plan on his own initiative. Communication links across the Pacific can be unreliable, so the admiral further delegated launch authority down the chain. To Ellsberg it is unclear if even the president was aware of the further delegation, but perfectly clear that this is insanity: all it would take is a geopolitical crisis, some bad weather over Hawaii, and suddenly some over-eager general on a distant Pacific island thinks that nuclear war has broken out and it is their duty to join the fun.<br /><br />This is not the end of bureaucratic madness. 
Ellsberg recounts his surprise after learning that the US had no war plan involving just the Soviet Union. Any nuclear attack would hit China as well. The admirals Ellsberg asked about this were incensed, leading Ellsberg to conclude:<br /><blockquote><i>"Thus, if the president gave an order to attack only Soviet targets, CINCPAC [US Indo-Pacific Command, now called USINDOPACOM] forces, having destroyed Vladivostok and a few other minor targets in eastern Russia, would essentially have to sit out the war as observers—“on the sidelines,” as they thought of it—during the big game."</i></blockquote>This was something that the admirals thought intolerable.<br /><br />It gets worse:<br /><blockquote><i>"[I]t had long been clear to me that if the highest authorities did give [an order that excluded China] it would be virtually impossible to implement that order quickly in the Pacific. That was true for technical as well as bureaucratic reasons. CINCPAC planners were working extremely hard, around the clock each year, just to produce one single plan for nuclear war against the Sino-Soviet bloc, and they simply didn’t have the ability to produce a second plan for war with the Soviet Union alone."</i></blockquote>Why was it so difficult to create a nuclear war plan? A major reason was the enormous number of calculations needed to schedule the bombers so that they wouldn’t be swatted out of the sky by nukes dropped by other bombers:<br /><blockquote><i>"Plans specified that a particular explosion would go off at time-over-target, or TOT (for example, 117 minutes and 32 seconds after the Execute order), and then a nearby explosion would go off 2 minutes and 12 seconds later, and so forth. 
If everything went according to plan, no plane would be struck down by the explosion from a bomb dropped by another plane; no “fratricide” would occur."</i></blockquote>Practical inconveniences, like the fact that not all planes would manage to get themselves in the air equally fast, or the existence of weather, were ignored.<br /><blockquote><i>'I pointed these two problems out to a planner once.</i> </blockquote><blockquote><i>“Yes, I’ve thought of these problems before,” he said.</i> </blockquote><blockquote><i>“Well, doesn’t that make you question the value of making all these calculations and plans?”</i> </blockquote><blockquote><i>“These men are risking their lives flying out there. We’ve got to do what we can to save their lives.”</i> </blockquote><blockquote><i>“But it doesn’t seem that this plan has any chance to save any lives at all. It would save lives only if the execution followed the plan down to the second, and there’s not even the remotest possibility of that happening.”</i> </blockquote><blockquote><i>“Well, we’re ordered to make these calculations, so that’s what we do.”'</i></blockquote>No sane person can think to themselves “let’s plan to kill three hundred million Chinese peasants in order to protect the egos of a few admirals, and because someone wants to make really detailed spreadsheets”. A badly designed bureaucracy won’t even blink.<br /><blockquote><i>"How to describe that, other than insanity? Should the Pentagon officials and their subordinates have been institutionalized? But that was precisely the problem: they already were. Their institutions not only promoted this insanity, they demanded it. And still do. As do comparable institutions in Russia."</i></blockquote><b><br /></b><b>Cuban roulette</b><br /><br />Ellsberg’s account of the Cuban missile crisis is particularly haunting. 
Both Kennedy and Khrushchev were eager to avoid war, more cautious than many of their advisors, and willing to dial back the bravado even at steep political cost.<br /><br />Yet at the peak of the crisis on October 27th, 1962, there were two occasions when nuclear war was averted by chance. The first occasion was when the captain of a Soviet submarine being hounded by American destroyers decided to launch a nuclear torpedo at the destroyers. On most submarines the agreement of the captain and political officer would have sufficed, but flotilla commander Vasili Arkhipov happened to be onboard this particular submarine, and had the authority to overrule the captain and the political officer.<br /><br />Had Arkhipov been stationed on a different submarine:<br /><blockquote><i>"The source of [the explosion caused by the nuclear torpedo] would have been mysterious to other commanders in the [US] Navy and officials on the ExComm, since no submarines known to be in the region were believed to carry nuclear warheads. The clear implication on the cause of the nuclear destruction of this antisubmarine hunter-killer group would have been a medium-range missile from Cuba whose launch had not been detected. That is the event that President Kennedy had announced on October 22nd would lead to a full-scale nuclear attack on the Soviet Union."</i></blockquote>Perhaps Kennedy would have decided to take back his red line, and maybe the conflict might have deescalated even then. But the odds would have been long.<br /><br />The second time was above Siberia and the Bering Sea. An American U-2 spy plane had wandered off-course into Soviet airspace. MiGs were scrambled to intercept it (perhaps believing it to be a reconnaissance plane for a larger attack), and American F-102As scrambled in turn to intercept the MiGs before they could get to the U-2. 
The F-102As were armed only with nuclear air-to-air weapons, since they were meant to be used against Russian nuclear bomber formations.<br /><br />Secretary of Defense Robert McNamara reportedly ran out of a Pentagon meeting hysterically yelling “this means war with the Soviet Union” upon hearing the news. Kennedy, however, was calmer:<br /><blockquote><i>"In a panic, [the chief of the Bureau of Intelligence and Research] rushed in to tell the president there was a U-2 over Russia being pursued by MiGs. Kennedy, very cool, responded from his rocking chair (as Hilsman reported) with an old Navy joke: 'There’s always some son-of-a-bitch who didn’t get the word.'"</i></blockquote>Even assuming leaders who would rather lose face than commit genocide, their control over events is not perfect. Government bureaucracies, trigger-happy generals, and your generic sons-of-bitches who don’t get the message have a lot of inertia, which any central organising force will struggle to halt. Combined with the hair-trigger launch capability demanded by deterrence through mutually assured destruction, this means nuclear war cannot be removed from the realm of the possible.<br /><br /><br /><b>’Tis but a scratch</b><br /><b><br /></b>How bad is nuclear war, really?<br /><br />Ellsberg comes with a prepackaged answer: the nuclear weapons currently deployed by the US and Russia (all other arsenals combined are less than 10% of the total) are equivalent to a doomsday machine which, if activated, would result in a nuclear winter that ends human civilisation.<br /><br />As far as I can tell, the research is not nearly as clear-cut as Ellsberg suggests. 
Ellsberg writes of the “recent scientific confirmation of the thirty-year-old nuclear winter ‘hypothesis’”, but I’m not sure what this is meant to refer to.<br /><br /><i>The Doomsday Machine</i> is about nuclear history and policy, not the effects of nuclear war, so it makes sense for Ellsberg to omit a detailed analysis of what exactly we think might happen to the atmosphere. However, as best I can tell from other sources, Ellsberg’s claim of a scientific consensus for civilisation-ending nuclear winter following a war waged with post-Cold War nuclear stockpiles is simply too strong given the current evidence. This is a shame. Nuclear winter is a serious threat, and serious threats do not need exaggeration.<br /><br />So what is our current understanding of nuclear winter? In a word: complicated.<br /><br />Some older models of nuclear winter were challenged when burning oil wells in Kuwait during the 1991 Gulf War failed to cause global or even continental cooling. Later papers suggest that sufficiently large burning areas, such as entire cities, might lift smoke much higher than isolated burning oil wells and hence cause greater effects. Others argue that modern cities are not very likely to become firestorms. A <a href="https://www.pnas.org/content/117/13/7071" title="recent study">recent study</a> estimated 3-17% losses in various crops from a limited Indo-Pakistani war alone. 
Still others claim they’re being <a href="https://en.wikipedia.org/wiki/Nuclear_winter#Critical_response_to_the_more_modern_papers">stigmatised as “closet Doctor Strangeloves”</a> for their criticism of the nuclear winter hypothesis.<br /><br />If we assume that the more aggressive nuclear winter models are not totally off the mark, <a href="https://forum.effectivealtruism.org/posts/pMsnCieusmYqGW26W/how-bad-would-nuclear-winter-caused-by-a-us-russia-nuclear">this analysis</a> estimates that billions of people might plausibly starve to death following a modern US-Russia nuclear exchange. However, to arrive at such estimates involves a long chain of assumptions.<br /><br />I think all we can say for sure are two things: first, that this is not an experiment we ever want to try, and second, that there exists at least one foolproof solution to global warming.<br /><br />Regardless of a hypothetical nuclear winter, any nuclear war is bad.<br /><br />Consider the greatest disasters in human history. Events like World War II, the Black Death, the Great Chinese Famine, and the Spanish flu all had a death toll between 10 and 100 million people (though the high-end estimates for the Black Death go to twice that number).<br /><br />In the early 1960s, the US Joint Chiefs of Staff estimated that a US first strike on the USSR and China would kill 275 million people immediately, and another 50 million within those countries over the next six months due to injuries and fallout. Attacks on Warsaw Pact countries would kill another 100 million. Collateral damage on neutral countries from fallout would depend on which way the wind blows, but likely add at least another 100 million across nearby countries like Finland, Japan, Sweden, and Afghanistan.<br /><br />Ellsberg recounts his reaction:<br /><blockquote><i>"I remember what I thought when I first held the single sheet with the graph on it. I thought, This piece of paper should not exist. It should never have existed. 
Not in America. Not anywhere, ever. It depicted evil beyond any human project ever. There should be nothing on earth, nothing real, that it referred to."</i></blockquote>The scale is an order of magnitude above any other disaster. In terms of human life lost, it is as if all of World War II (from the Holocaust to Hiroshima to Dresden to Leningrad), the Black Death, the Great Chinese Famine, and the Mongol conquests all happened in a day, followed by World War I, the Spanish flu, and every famine the British ever caused in India over the next few months. Finally, add in some risk of a nuclear winter that slowly kills a significant chunk of the rest through starvation. And this is what happens if we assume that the Soviets don’t hit back.<br /><br />An argument in favour of nuclear weapons is that they help maintain peace between great powers. This is true, but inadequate.<br /><br />When asked to put a number on the probability of the Cuban missile crisis escalating into a total nuclear war, Kennedy said “between one in three and even”. Assume this is right, and that the alternative to a nuclear standoff was another worldwide military conflict on the scale of World War II (fought with conventional weapons only). The harsh logic of expected value tells us that the crisis was still a bad deal: we shouldn’t gamble 500-1000 million lives on a coin flip to avoid a 50-100 million death conflict.<br /><br />Thankfully, a nuclear war today might be less damaging.<br /><br />First, the number of nuclear weapons has gone down. The US arsenal peaked at 30 000 weapons in the 1960s and the Soviet/Russian one at 40 000 in the 1980s; both have since fallen to 6000-7000. At the same time, the accuracy of missiles has improved, and smaller, more accurate weapons have replaced huge multi-megaton bombs that can wipe out a city even if they miss by a few kilometres.<br /><br />Second, increased accuracy allows for strategies to change, at least among the most advanced nuclear powers. 
Countervalue targeting, where the aim is to inflict maximum damage to an enemy by hitting cities, can be swapped for counterforce targeting, in which the enemy’s military is targeted. The US might plausibly carry out a counterforce attack, but the same is not currently true of China, let alone Pakistan or India. Of course, with nuclear weapons it is impossible to avoid collateral damage, and the extent to which countervalue targeting has been swapped out is hard to tell given the secrecy of current nuclear war plans.<br /><br />By <a href="https://forum.effectivealtruism.org/posts/FfxrwBdBDCg9YTh69/how-many-people-would-be-killed-as-a-direct-result-of-a-us">one estimate</a>, even a limited counterforce scenario for a US-Russia nuclear war would lead to 10 million immediate deaths each in the US, Russia, and (if it’s involved) western Europe. Add in countervalue targeting, and that’s another 100 million across the US and Russia (an estimate for western Europe is not given). By weighing the probabilities of each level of countervalue targeting, the author of this estimate came up with a mean death toll of 50 million direct deaths.<br /><br />So what can we expect for a modern nuclear war? Any nuclear exchange will likely earn a place near the top of Wikipedia’s “<a href="https://en.wikipedia.org/wiki/List_of_wars_and_anthropogenic_disasters_by_death_toll" title="list of wars and anthropogenic disasters by death toll">list of wars and anthropogenic disasters by death toll</a>” page. A total one between large nuclear powers will instantly shoot to first place from the number of direct deaths alone. It would, entirely literally, be the worst thing ever.<br /><br />Then there’s the possibility of nuclear winter. It might be uncertain, but its potential scale means its contribution to the expected number of deaths is considerable. 
Every 1% increase in the chance of half the world’s population starving is, in expectation, another Canada gone.<br /><br />Understanding exactly how bad nuclear war would be is important, both to guide policy and to judge its importance relative to other causes. Right now there does not yet seem to be a consensus about nuclear winter risks; if we draw a graph of number of deaths versus probability of it happening, the distribution would be very wide, with most of the expected harm coming from the tail end: scenarios of low probability, but involving billions of deaths. Hopefully the immense efforts rightly spent on modelling climate change will have spillover benefits for nuclear winter research, and allow us to be more certain in our predictions.<br /><br /><br /><b>Institutional insanity, then and now </b><br /><b><br /></b>Ellsberg does not say much about US war plans after his time working with them, likely because after he leaked the Pentagon Papers there was no going back to his job at RAND.<br /><br />(In fact, the Pentagon Papers were just half of the secret material Ellsberg had copied. Ellsberg decided to release the Vietnam papers first, fearing that if he also released the nuclear planning papers, the Vietnam stuff would be forgotten. His plan to release the nuclear papers later was derailed due to a complex chain of events including letting his friend hide them in a dumpster and flooding from a near-hurricane. Much of the material has since been declassified, however.)<br /><br />Some things have gotten better. The insanity of a single war plan hitting both the USSR and China must have ended as the Sino-Soviet split progressed (or so I assume). Permissive Action Links (PALs) are now often, but not always, used to make unauthorised nuclear weapon use harder. 
Nuclear brinksmanship is (for now) less common than during the Cold War.<br /><br />However, as Ellsberg cautions, to think that the modern nuclear situation is much saner than the one he witnessed in the 1950s and 60s would be a mistake.<br /><br />A key point to understand about nuclear war is that, if it happens, the reason for it will be stupid.<br /><br />Nuclear weapons are meant to be used. Their intended use is not as explosives, though. As Ellsberg points out:<br /><blockquote><i>"[…] they have been used in the precise way that a gun is used when you point it at someone’s head in a direct confrontation, whether or not the trigger is pulled. For a certain type of gun owner, getting their way in such situations without having to pull the trigger is the best use of the gun. It is why they have it, why they keep it loaded and ready to hand."</i></blockquote>The world has fallen into a <a href="https://en.wikipedia.org/wiki/Tragedy_of_the_commons">tragedy-of-the-commons</a>-type situation. The commons, in this case, is the <i>absence</i> of nuclear weapons. Such a world is ideal, but the equilibrium is unstable: the first country to get them can threaten others, and so over time things degenerate until most (big) countries develop them. The commons has become exhausted, no nuclear power is better off relative to the others, and they are all paying the cost: upkeep of weapons, delivery systems, and infrastructure, as well as a small but ever-present risk of accidental mass murder.<br /><br />(This is a simplification, of course. 
Nuclear weapons allow some countries to better their position relative even to other nuclear powers; for example, both Pakistan and India have nukes, but Pakistan comes off better in the deal since its weapons help offset its disadvantages in population, resources, and territory (as does the asymmetry of their nuclear policies - India has adopted a no-first-use policy, but Pakistan refuses to).)<br /><br />No sane person starts a nuclear war. If a nuclear weapon detonates, it has failed its purpose. The real risk both during and after the Cold War is that of accidental nuclear war – technical glitches, inadvertent escalation, Kennedy’s “sons-of-bitches who don’t get the word”.<br /><br />It is possible to imagine a world where the delicate balance of nuclear deterrents can be maintained with the millimetre precision required to ensure that the expected value of harm remains low. In this world, nuclear weapons may even be a net positive, paying back the costs of their upkeep and probability of accident by reducing the likelihood of non-nuclear conflict.<br /><br />Is this our world?<br /><br />If all curtains of secrecy were stripped from US nuclear planning, would we see a rational government carefully shouldering its Atlas-like burdens? What guarantee is there that China and Russia, both of which either already have or are soon likely to have dictators for life, will give appropriate weight to impartial concern for human welfare in their nuclear strategies? Didn’t Narendra Modi <a href="https://www.bbc.com/news/world-asia-47366718">order air strikes on another nuclear power</a> to increase his reelection odds just last year?<br /><br />Building institutions that carry out complex tasks reliably is a very difficult problem.<br /><br />Consider some of the greatest institutions humanity has come up with. Democracy promises that if you let people vote for their leaders, there’s some chance your country won’t slide into authoritarianism or dysfunction. 
Free markets boil down to the realisation that you can get away with making surprisingly few decisions about the economy. Scientific publishing allows for the mound of human knowledge to continuously expand, except occasionally there’s a replication crisis and the floor falls in for half a field.<br /><br />Such institutions are among the greatest achievements of human organisation and intelligence. Yet I still wouldn’t bet my life on a breakthrough study replicating, the stock market updating on a building but predictable global pandemic in its early stages, or a European democracy never sliding into dictatorship. Ask me to bet tens of millions of lives on US, Russian, Chinese, French, British, Pakistani, Indian, and Israeli secret military institutions all working reliably over a time horizon of decades, and I start to wonder when the first ship leaves for Mars.<br /><br />I am not generally fond of arguments about human hubris (too often they boil down to vague complaints that using our ingenuity to improve life would somehow be bad). This time, however, there is truth to them. For the governments of the United States and Russia to believe that they can wield a thousand-weapon arsenal responsibly is pretence. Our current institution-building abilities are not up to the task. (We don’t yet know whether we can even build institutions that guarantee human flourishing in the long run, but unlike the nuclear problem, this problem we have no choice but to attempt.)<br /><br /><br /><b>What should we do?</b><br /><b><br /></b>What are the most effective things we can do to lower the expected harm of nuclear war – that is, reduce both its probability and the damage it would cause if it happened?<br /><br />Ellsberg rightly recommends downsizing the US and Russian arsenals (together over 90% of the world total) as a first step, since this would reduce any risk of a catastrophic nuclear winter. 
For example, the US could start by unilaterally getting rid of its land-based missiles, which would reduce the time pressures on making a launch decision (land-based missiles will likely be the first targets of any attack and hence will be lost unless launched soon after a warning), and deprive Russian missiles of their first targets, making it more justifiable for Russia to cut back on its own arsenal.<br /><br />Another step is for countries to take weapons off hair-trigger launch alert. Constantly being two mistakes and ten minutes away from nuclear launch is not sustainable in the long run. Deterrence could be maintained through a focus on hardier nuclear forces like submarines, and less reliance on sitting ducks like land-based missile silos.<br /><br />To make both of these steps more likely, diplomatic efforts towards nuclear arms control treaties should be increased. This has not been happening. The US suspended the Intermediate-Range Nuclear Forces Treaty in 2019 due to perceived Russian violations and because it didn’t cover China. The remaining major US-Russia treaty, New START (STrategic Arms Reduction Treaty), will likely expire in February 2021 unless Trump reverses course and negotiations happen very quickly. Treaties for old weapons aren’t enough, either; new technologies, like <a href="https://en.wikipedia.org/wiki/Boost-glide#Existing_or_in_development">hypersonic gliders that can fly lower than ballistic missiles and perform evasive manoeuvres</a>, might destabilise nuclear deterrence.<br /><br />In the long run, the aim should be to either abolish weaponised nukes entirely, or, failing that, at least reach a state where only a few accountable states wield small numbers of weapons. Biological and chemical weapons of mass destruction have been reined in by treaties and mostly abolished. 
Nuclear weapons should be next.<br /><br /><br /><b>Surviving by design</b><br /><b><br /></b>During the Cold War, it was easy to construct a compelling narrative around nuclear war: the climactic showdown between the forces of capitalism and communism and between democracy and totalitarianism, to be waged for infinite stakes with the ultimate fruits of modern science. With the end of the Cold War, the narrative was lost. Nuclear war was still possible – the only time a “nuclear briefcase” was ever activated was in <a href="https://en.wikipedia.org/wiki/Norwegian_rocket_incident">1995</a> – but it was largely relegated to the realm of technical glitches and accidents; not things that make for a good story.<br /><br />As Ellsberg points out, the biggest nuclear threat never was, and still isn’t, an intentional conflict (or rogue states or terrorists). It is the potential for failure in the institutions, people, and machines that control the biggest nuclear arsenals.<br /><br />If nuclear war starts, it won’t be grave geopolitical considerations that trigger it. It will be a country <a href="https://en.wikipedia.org/wiki/1961_Goldsboro_B-52_crash">dropping a nuclear weapon on itself</a>, someone accidentally <a href="https://blog.ucsusa.org/david-wright/nuclear-false-alarm-950">inserting a nuclear war training tape into an operational computer</a>, radar equipment getting <a href="https://en.wikipedia.org/wiki/List_of_nuclear_close_calls#1950s_and_1960s">confused by the moon</a>, or <a href="https://en.wikipedia.org/wiki/Norwegian_rocket_incident#Prior_notification">Russian bureaucracy being Russian bureaucracy</a>. (Each of these happened; see links.)<br /><br />Stories are nice. It’s tempting to demand a certain narrative coherence from the world; to think that sufficiently bad things, for sufficiently dumb reasons, aren’t allowed to happen.<br /><br />But we do not live in the world of narrative coherence. 
We live in the world where civilisation is indefinitely on hold because of bat soup in China. The greatest risks we face aren’t wrapped up in compelling narratives, and they do not come from commensurate causes.<br /><br />In particular, it is important to be aware that there are no safeguards. A world without us may seem pointless, but the laws of physics will bring it about given the right chain of cause and effect. If we want protection against catastrophe, we must build it ourselves.<br /><br />We did not survive the Cold War by design. We survived it by accident: because none of the close calls quite managed to escalate to full-blown war, and because of heroes like <a href="https://en.wikipedia.org/wiki/Vasily_Arkhipov_(vice_admiral)">Vasili Arkhipov</a> and <a href="https://en.wikipedia.org/wiki/Stanislav_Petrov">Stanislav Petrov</a> who decided not to press the button.<br /><br />We should survive the 21st century by design, not accident. This is not a given; with ever greater technology comes an ever greater number of efficient ways to kill a lot of people.<br /><br /><a href="https://80000hours.org/problem-profiles/nuclear-security/">Making nuclear war less likely and less disastrous</a> is an important part of achieving this. 
It is not <a href="https://80000hours.org/problem-profiles/#global-catastrophic-risks">the only part, nor necessarily the most urgent</a>, but I will say this: if civilisation has to be severely damaged by some apocalyptic scenario, we might as well make sure that it’s something fancy and trendy like unfriendly AI, not something straight from a 1960s comedy film.Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-1697673368059564013.post-79439555527259393292019-12-30T16:03:00.001+00:002019-12-30T16:13:23.589+00:00Classical physics$$%This post uses MathJax to render LaTeX-typeset math in a browser.$$<br />$$%If you're seeing this text, it's not working.$$<br />In chapter 18 of volume 2 of <i>The Feynman Lectures on Physics</i> (FLoP), there is a <a href="https://www.feynmanlectures.caltech.edu/II_18.html#Ch18-T1">chart</a>. The chart takes up about half a page. It contains every fundamental law of classical physics.<br /><br />Of course, it takes a lot of work to get from the fundamental laws to an explanation of something interesting. Classical physics is also neither complete nor contradiction-free: extrapolating its consequences leads to a few minor problems, such as <a href="https://en.wikipedia.org/wiki/Ultraviolet_catastrophe">every object instantly radiating away all of its energy</a> (quantum physics fixes this).<br /><br />It is still very striking to see, in one glance, the fundamental laws, as known in 1900. Given sufficient time, you can deduce from these laws almost any phenomenon you see in the world.<br /><br />So, what are the laws?<br /><br />Specifically, what is the "shape" or the "character" of the laws? It is one thing to state the equations, and quite another to see how they behave. It is also an entirely different thing to describe them only qualitatively, without even hinting at what the underlying mathematics is like.<br /><br />In this post, I will try to summarise what the laws are about and how they work. I will not avoid the maths.
However, I will also try to demonstrate the flavour of the laws in a qualitative way.<br /><br /><h2>Force law</h2>This is everyone's favorite:<br />$$$<br />\boldsymbol{F} = m \boldsymbol{a}<br />$$$<br />Doesn't look too bad either, right? All it means is that mass times acceleration gives you force.<br /><br />(The only complication is that if we're not restricting ourselves to motion along a line, then force and acceleration are both vectors. But a vector is just three numbers, one for each dimension. Vectors are written in bold.)<br /><br />Rearranging a bit, $$\boldsymbol{a} = \boldsymbol{F} / m$$; that is, if you apply a force of $$\boldsymbol{F}$$ to a mass $$m$$, then the acceleration will be in the same direction as the force, with a size equal to the size of the force divided by the mass. This tells you how much you have to push something to accelerate it at a certain rate.<br /><br />Of course, we run into a problem of definitions. What is force? We've just defined it. It's the product of mass and acceleration. Alright, what's mass? Mass is resistance to acceleration; it's the property of an object you get by dividing the force you apply to the object by its rate of acceleration. However, since we're doing physics rather than philosophy, we can just say that mass and force are these quantities that we measure in such-and-such a way, and be done with it.<br /><br />What about acceleration? Here we can go a bit deeper: it is the rate of change of velocity with time. Velocity, in turn, is the rate of change of position with time. So acceleration is the rate of change (with time) of rate of change (with time) of position, and we can restate our law this way, using whichever notation we prefer.
For example, assuming $$\boldsymbol{x}$$ is position and $$t$$ is our time variable:<br /><br />\begin{align*}<br />\boldsymbol{F} &= m \boldsymbol{a} \\<br />\boldsymbol{F} &= m \ddot{\boldsymbol{x}} \\<br />\boldsymbol{F} &= m \frac{d}{dt} (\frac{d}{dt} \boldsymbol{x}) \\<br />\boldsymbol{F} &= m \frac{d^2}{dt^2} \boldsymbol{x}<br />\end{align*}<br /><br />(Each of these means the same thing.)<br /><br />We can also state the law in a slightly different way, which often turns out to be more convenient.<br /><br />What we do is we define a new quantity, somewhat more abstract than "force" or "mass", but not demonstrably less "real" and certainly not useless. Call it momentum, denote it $$\boldsymbol{p}$$ (note: it's a vector), and let it be the product of mass and velocity: $$\boldsymbol{p} = m \boldsymbol{v}$$.<br /><br />Now: what is the rate of change (with time) of momentum? Since the rate of change (with time) of velocity is acceleration, and mass is constant with time, the rate of change of momentum is simply mass times the time derivative of velocity, or mass times acceleration. So we've managed to connect force to momentum. Force is just the rate of change (with time) of momentum:<br /><br />\begin{align*}<br />\boldsymbol{F} &= \frac{d}{dt} \boldsymbol{p}<br />\end{align*}<br /><br /><h2>Gravity</h2>The law of gravity, also discovered by Newton, states that the force pulling two objects together is proportional to the mass of both objects, and inversely proportional to the square of the distance between the objects.
Letting $$G$$ be the constant that makes our experiments check out, $$m_1$$ and $$m_2$$ the masses of the two objects, and $$r$$ the distance between them, we can write that the strength $$F$$ of the force is:<br /><br />$$$<br />F = \frac{G m_1 m_2}{r^2}<br />$$$<br /><br />(We could write this equation in vector form, but the force is always attractive so we know its direction.)<br /><br />Now, from $$F = ma$$, we can calculate the acceleration that bodies exert on each other. Let's say we want to know how much the mass $$m_1$$ of object 1 accelerates object 2. The acceleration is the force $$G m_1 m_2 / r^2$$ divided by the mass $$m_2$$ of object 2. The $$m_2$$ term cancels out, and we're left with $$G m_1 / r^2$$. So this law can also be phrased as the statement that an object of mass $$M$$ causes every other object in the universe to accelerate towards it at a rate<br /><br />$$$<br />a = \frac{G M}{r^2},<br />$$$<br /><br />where $$r$$ is the distance between them. We are saved from total chaos only by the little $$2$$ that tells us to square the distance. This ensures that, though the force has infinite range (as far as we know), its strength drops off fast: every doubling of distance means a four-fold reduction in force; every 10-fold increase in distance means a hundred-fold reduction in force.<br /><br /><h3>Gravitational potential and the gravitational force field</h3><div class="separator" style="clear: both; text-align: left;">We can also express the law of gravitation in a different way: instead of defining a law for the force, we define a law for the gravitational potential, and construct a force field from this.</div><br />Imagine we have some contraption of mass $$m$$, and there's some object of mass $$M$$ that we're moving directly away from in a straight line (also let's assume that our velocity is constant and low, so there are no changes in kinetic energy). 
The force pulling us backwards, as a function of distance $$r$$ from the object's center, is<br /><br />$$$<br />F(r) = \frac{GMm}{r^2}<br />$$$<br /><br />Now consider a small time interval during which we move a distance $$ds$$. The work we have to do (in other words, the energy we have to expend) against the force of gravity is the force against us times the distance we move.<br /><br />(Why do we define work/energy this way? Mainly because, if we do, it has a bunch of interesting properties, such as being conserved. This is the story of most quantities in physics – either they're things we can straightforwardly measure, or someone figures out that if we define a more abstract quantity based on some simpler ones, this new quantity has properties that make it useful enough to bother calculating.)<br /><br />Thus, for each small unit of distance $$ds$$, the work we do is $$F(r) ds$$. Note the word small – if $$ds$$ is too large, this is a poor approximation, since $$F(r)$$ and $$F(r + ds)$$ are going to be noticeably different: the force will have changed a lot between the beginning and end of the step. So if we want to figure out the work needed to push something from, say, $$r_0$$ to a far away point $$r_1$$ through a gravitational field, we have to add up a lot of small pieces: $$F(r_0)ds + F(r_0 + ds)ds + F(r_0 + 2ds)ds + ... + F(r_1 - 2ds)ds + F(r_1 - ds)ds + F(r_1)ds$$.<br /><br />For small distances, force is practically invariant and the energy expenditure can be calculated simply by multiplying force and distance. For example, if a crane lifts a weight of 1000 kilograms a distance of 50 metres from the surface of the Earth, the difference in gravitational force at the beginning and end of the lift is on the order of 0.15 Newtons, or the force a 15-gram weight exerts on your hand.
The total energy spent on the lift (ignoring inefficiencies) is practically identical to 50 metres times 1000 kilograms times the gravitational acceleration 9.8 metres per second squared.<br /><br />But for longer distances, we have to add up a lot of small pieces. This is done by integration of the work done at each small step over the distance travelled: <br /><br />$$$<br />W = \int_{r_0}^{r_1} F(r) dr,<br />$$$<br /><br />where $$W$$ is the work done (= the energy spent), $$r_0$$ is the distance we start from, $$r_1$$ the distance we end at, and $$F(r)$$ the force gravity exerts at distance $$r$$.<br /><br />This is valid only for one-dimensional motion. A more general presentation requires vector notation. In the general case, work is not the magnitude of the force times the magnitude of the distance, but the dot product of the force and the displacement vector. This can be visualised as the length of the projection of the force vector onto a unit vector in the direction of motion, multiplied by the distance moved.<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-GfBLgLGe0rM/XgZVPcGvFNI/AAAAAAAABDM/3MD3ZxroHiI9V2KENSBTxrD4NQUNiY9nwCEwYBhgL/s1600/Screenshot%2B2019-12-27%2Bat%2B11.06.27.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1076" data-original-width="1164" height="294" src="https://1.bp.blogspot.com/-GfBLgLGe0rM/XgZVPcGvFNI/AAAAAAAABDM/3MD3ZxroHiI9V2KENSBTxrD4NQUNiY9nwCEwYBhgL/s320/Screenshot%2B2019-12-27%2Bat%2B11.06.27.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">To find the work done by a force on an object: take a unit vector along the axis the object moves in ($$\hat{\boldsymbol{s}}$$), and measure the projection of the force vector ($$\boldsymbol{F}$$) in this direction (the dotted blue lines).
In this case, the work is negative, since the force is acting more against the direction of motion than along it. To keep the object moving at the same speed, we would therefore have to expend energy.</td></tr></tbody></table>So if we're moving in a direction perpendicular to the force – for example, horizontally over the ground – gravity does no work.<br /><br />In general, then, given motion along a line $$L$$, the work done is the sum of $$\boldsymbol{F} \cdot \boldsymbol{dl}$$, where $$\cdot$$ is the dot product operator, $$\boldsymbol{F}$$ is the force vector, and $$\boldsymbol{dl}$$ runs over each small element of the path. In integral notation we write this<br /><br />$$$<br />W = \int_L \boldsymbol{F} \cdot \boldsymbol{dl}<br />$$$<br /><br />We can see that with the simple case of straight-line motion, in the direction of the gravitational field, the result is going to be positive; call it $$W_0$$. What this means is that gravity pushes us along, so work is done by gravity on us. If we moved the other way, we would have to do work against gravity. The total amount of work we would have to do is exactly $$W_0$$, since the path is the same, the gravitational force field is the same; only the sign is flipped from a plus to a minus for each step we add, because at each point along the path the step we take is now in the opposite direction.<br /><br />Imagine we move along the line $$L$$ first in one direction, then the other. First we get an energy $$W_0$$ from the gravitational field, which we can think of as having to expend $$-W_0$$ units of energy. Returning the other way, we have to spend $$W_0$$.<br /><br />So far we've been assuming that our path is a straight line directly away from the mass. The path doesn't matter, though. If we have any path near a point mass, we can break it down into radial and tangential components. The work done moving along any tangential component is zero, since the force is at right angles to the direction of motion.
The sum of the work done moving along the radial components is the same as the sum of the work done moving along our straight line path with the same end-points, since the same outward/inward distance must be covered.<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-TYizCGwzK5U/XgoYNI626HI/AAAAAAAABE4/F0PgyyCHoLMn42BTkLz0_yIY1Hbd684wACEwYBhgL/s1600/Screenshot%2B2019-12-30%2Bat%2B17.30.50.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1098" data-original-width="1280" height="548" src="https://1.bp.blogspot.com/-TYizCGwzK5U/XgoYNI626HI/AAAAAAAABE4/F0PgyyCHoLMn42BTkLz0_yIY1Hbd684wACEwYBhgL/s640/Screenshot%2B2019-12-30%2Bat%2B17.30.50.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The work done in moving a small step inwards from a distance $$r_1 + dr$$ to a distance $$r_1$$ from a point mass (along the black line) can be broken down into the work done in moving along the tangential component (blue), plus the work done moving along the radial component (red). 
But the force acts in a perpendicular direction as we move along the blue path, so no work is done against gravity as we move along the tangential path, and hence the work done moving along the black path and the work done moving along the red path are equal.</td></tr></tbody></table> Therefore we know that, given points A and B:<br /><ul><li>The energy it takes to go from A to B is the same as the energy we gain from travelling from B to A.</li><li>The energy it takes to travel between the two points is independent of the path taken.</li></ul>(Since the gravitational forces of each mass are simply added together to get the net gravitational force, we know that the work done in total when we have multiple masses is just the sum of the work done against each mass independently, and hence the above result applies not only when moving near point masses, but when moving near any configuration of masses at all.) <br />Now imagine that we choose a point X as our reference point. We call the gravitational potential relative to X the amount of work, per unit mass, that we have to do against the gravitational field to move from X to any other point in space (we consider work per unit mass, since otherwise the answer would depend on how big a mass we're moving).
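The path-independence result above can be checked numerically. A minimal sketch in Python (the values are illustrative: roughly Earth's mass, and distances of a few thousand kilometres; the helper name is mine): it sums $$\boldsymbol{F} \cdot \boldsymbol{dl}$$ along a straight path and along a deliberately roundabout one, and compares both against the analytic answer $$GMm(1/r_A - 1/r_B)$$.

```python
import numpy as np

G = 6.674e-11   # gravitational constant, m^3 kg^-1 s^-2
M = 5.972e24    # attracting mass at the origin (roughly Earth's mass), kg
m = 1.0         # test mass, kg

def work_against_gravity(path):
    """Sum F . dl along a path (an (N, 3) array of points), with the
    attracting mass at the origin; returns the work done against gravity."""
    mids = 0.5 * (path[:-1] + path[1:])           # midpoint of each small step
    steps = path[1:] - path[:-1]                  # the dl vectors
    r = np.linalg.norm(mids, axis=1)
    forces = -G * M * m * mids / r[:, None]**3    # gravity points towards the origin
    return -np.sum(np.sum(forces * steps, axis=1))  # minus sign: work done *against* the force

A = np.array([7.0e6, 0.0, 0.0])   # start, 7000 km from the centre
B = np.array([0.0, 8.0e6, 0.0])   # end, 8000 km from the centre
C = np.array([0.0, 0.0, 9.0e6])   # waypoint for a roundabout path

t = np.linspace(0.0, 1.0, 100_001)[:, None]
straight = A + t * (B - A)                              # straight line A -> B
detour = np.concatenate([A + t * (C - A), C + t * (B - C)])  # A -> C -> B

analytic = G * M * m * (1 / np.linalg.norm(A) - 1 / np.linalg.norm(B))
for path in (straight, detour):
    assert abs(work_against_gravity(path) - analytic) < 1e-3 * abs(analytic)
```

Both paths give the same (positive) work, matching the analytic result, even though the detour is much longer: the extra tangential wandering contributes nothing.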
Since we know the path taken does not matter, to find the gravitational potential between A and B we just take the potential from X to B and subtract the potential from X to A (since potential from X to A is the negative of the potential from A to X, and we want to add up the potential changes along two segments of the path: A to X, and then X to B).<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-rMXzRMF0y4Y/XgZWm63rgZI/AAAAAAAABDY/7gouHBXL45cM0ZIH2wEMBHROuTrJcSbEACLcBGAsYHQ/s1600/Screenshot%2B2019-12-27%2Bat%2B20.57.30.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="928" data-original-width="958" height="308" src="https://1.bp.blogspot.com/-rMXzRMF0y4Y/XgZWm63rgZI/AAAAAAAABDY/7gouHBXL45cM0ZIH2wEMBHROuTrJcSbEACLcBGAsYHQ/s320/Screenshot%2B2019-12-27%2Bat%2B20.57.30.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The potential difference between A and B is independent of the path taken. Thus, the total work done in moving a mass from A to B against gravity is the same as the total work required to move it from A to X, and then X to B. The sum of work done over any loop must be zero, so it follows that the work done in moving from A to X is the negative of the work done in moving from X to A.</td></tr></tbody></table><br /><br />You can think of it this way. We have some two-dimensional plane representing a piece of space, and the height of the terrain above each point is the gravitational potential (so near masses, the terrain would dip downwards). We choose the height at some arbitrary point X to be the "sea level" relative to which we measure the height of other points. Once we know the height of every point relative to X, we know the height difference for each pair of points.
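The "sea level" analogy can be made concrete with a short sketch (Python; it uses the standard point-mass potential per unit mass, $$V(r) = -GM/r$$, and illustrative distances — the function names are mine). Two different choices of sea level give the same height difference between any pair of points:

```python
G = 6.674e-11    # gravitational constant, m^3 kg^-1 s^-2
M = 5.972e24     # point mass (roughly Earth's mass), kg

def V(r):
    # Standard point-mass potential per unit mass: tends to zero far away,
    # increasingly negative near the mass (the "dip" in the terrain).
    return -G * M / r

def height(r, r_X):
    # "Height" of a point at distance r, measured relative to the
    # "sea level" chosen at distance r_X from the mass.
    return V(r) - V(r_X)

a, b = 7.0e6, 4.2e7                             # two distances from the mass, metres
diff_1 = height(b, 1.0e7) - height(a, 1.0e7)    # height difference, one sea level
diff_2 = height(b, 5.0e8) - height(a, 5.0e8)    # same difference, another sea level
assert abs(diff_1 - diff_2) < 1e-9 * abs(diff_1)
```

The reference term cancels in the subtraction, which is exactly why the choice of X is free.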
The choice of X is arbitrary (though, for reasons of mathematical simplicity, the gravitational potential is usually taken to tend to zero far away from any mass, and to be increasingly negative near masses).<br /><br />(To be more accurate, you should visualise a 3D space, with the potential being "height" into a fourth dimension. In the likely event that you cannot visualise 4D space, visualising potential as height along the third dimension above a 2D space usually gives the necessary intuition anyway.)<br /><br />We can define a gravitational potential function, call it $$V$$, that takes a value at every point in space, and from which we can work out the work we must do against gravity in moving between any two points just by subtracting the value of $$V$$ at the start from the value of $$V$$ at the end. Mathematically, the work $$W$$ per mass $$m$$ in moving from a point with position vector $$\boldsymbol{a}$$ to $$\boldsymbol{b}$$ is<br /><br />$$$<br />\frac{W}{m} = V(\boldsymbol{b}) - V(\boldsymbol{a}).<br />$$$<br /><br />Now what if we want to find the force?
We found the work (and hence potential) by integrating force with respect to distance; therefore, we find force again by differentiating with respect to distance.<br /><br />The intuitive picture is that for every point in our potential-versus-location "terrain height" picture of gravitational potential, we figure out the magnitude of the force vector by looking at how great the slope of the potential is, and the direction by making it point in the direction of greatest decrease of potential.<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-cVBUhgviIZQ/XgZXjLfMC7I/AAAAAAAABDk/qxFYAzMcKiMnXZ8o70Uzkcn4QBiO6JUUgCLcBGAsYHQ/s1600/Screenshot%2B2019-12-27%2Bat%2B13.46.30.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1124" data-original-width="1188" height="603" src="https://1.bp.blogspot.com/-cVBUhgviIZQ/XgZXjLfMC7I/AAAAAAAABDk/qxFYAzMcKiMnXZ8o70Uzkcn4QBiO6JUUgCLcBGAsYHQ/s640/Screenshot%2B2019-12-27%2Bat%2B13.46.30.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The background color represents gravitational potential: the darker the color, the lower the potential (imagine the darker regions as being lower, and the lighter regions as higher). The red force vectors come from the gradient of the potential field: they are always in the direction of the greatest decrease of the potential, and have a magnitude proportional to the rate of this decrease.</td></tr></tbody></table><br />Mathematically, the gravitational potential field $$V$$ is a scalar field (one number for every point in space). 
We find the gravitational force field $$\boldsymbol{g}$$ by taking the gradient of $$V$$:<br /><br />$$$<br />\boldsymbol{g} = -\nabla V<br />$$$<br /><br />($$\nabla$$ is a vector calculus operator; $$\nabla S$$ is the notation for the gradient of a scalar field $$S$$. The minus sign comes from the fact that we've defined gravitational potential to decrease near a mass, but we still want the force vectors to point towards nearby masses.)<br /><br /><br /><h3>Working out the gravitational field directly</h3>We can also define a gravitational field directly. If our position vector is $$\boldsymbol{r_0}$$, that of a point mass of mass $$m$$ is $$\boldsymbol{r}$$, $$G$$ is the gravitational constant, and the gravitational field vector is $$\boldsymbol{g}$$, then<br /><br />$$$<br />\boldsymbol{g}(\boldsymbol{r}) = Gm \frac{\boldsymbol{r} - \boldsymbol{r_0}}{|\boldsymbol{r} - \boldsymbol{r_0}|^3},<br />$$$<br /><br />Let's see where the pieces come from.<br /><br />It is a law that gravitational force is proportional to $$Gm$$. Next, we want the force to point towards the mass; we get such a vector by subtracting from the mass's position $$\boldsymbol{r}$$ our position $$\boldsymbol{r_0}$$ (we have the mass M, us at P, and the origin of whatever coordinate system we're using at O; we want a vector from P to M, so we add the vector from P to O and the vector from O to M). So we put in a $$\boldsymbol{r} - \boldsymbol{r_0}$$.<br /><br />Finally, we want to ensure proportionality to the inverse square of distance from the mass.
Note however that the $$\boldsymbol{r} - \boldsymbol{r_0}$$ vector in the numerator already scales with the first power of distance between us and the mass, so we have to divide by the third power of this distance to get a dependence on the minus second power.<br /><br />Alternatively, we can let $$\boldsymbol{u}$$ be a vector of length 1 pointing towards the mass, and write<br /><br />$$$<br />\boldsymbol{g}(\boldsymbol{r}) = Gm \frac{\boldsymbol{u}}{|\boldsymbol{r} - \boldsymbol{r_0}|^2},<br />$$$<br /><br />If we have many point masses, the force vector at any point is simply the (vector) sum of the contributions of each point mass. If we have a continuous distribution of mass – so we know the density of mass at each point in space, rather than having individual point masses – we would integrate over all of space to add up the contributions of each individual bit of mass.<br /><br />The intuitive picture is that every bit of mass influences the force vector at all other points (dragging it towards itself), but that the strength of this influence drops quickly with distance. The gravitational force vector at a point is the sum of the gravitational influences of every mass in the universe.<br /><br />If a mass $$m$$ is at a point where the field has the value $$\boldsymbol{g}$$ (note that it's a vector), then the gravitational force $$\boldsymbol{F_g}$$ can be written simply as <br /><br />$$$<br />\boldsymbol{F_g} = m \boldsymbol{g}(\boldsymbol{r})<br />$$$<br /><br /><h3>Electric and magnetic force law</h3>We first expressed the gravitational force law as<br /><br />$$$<br />F = \frac{G m_1 m_2}{r^2}.<br />$$$<br /><br />There exists a similar law for the electric force between two particles:<br /><br />$$$<br />F = \frac{k_e q_1 q_2}{r^2}<br />$$$<br /><br />Here, $$k_e$$ is just a constant (like $$G$$), and $$q_1$$ and $$q_2$$ are the charges of the two particles in question.
The main difference is that charge can be positive or negative (rather than just positive like mass), and hence the electric force can switch from being attractive to repulsive depending on the signs of the charges on the particles.<br /><br />It turns out that this is not the best way to reason about electromagnetic forces in general.<br /><br />With electromagnetism, the behavior of the electric field $$\boldsymbol{E}$$ and the magnetic field $$\boldsymbol{B}$$ is rather complicated. The simplest way to write down the force law is not directly in terms of charges and distances and whatever, but directly in terms of the fields themselves (in the same way that writing the gravitational force in terms of a vector field allowed us to write it simply as $$\boldsymbol{F_g} = m \boldsymbol{g}$$).<br /><br />So here is the law: given a charge $$q$$ moving at velocity $$\boldsymbol{v}$$ at a point in space where the electric field is $$\boldsymbol{E}$$ and the magnetic field $$\boldsymbol{B}$$, the force experienced by the charge due to electromagnetic forces is<br /><br />$$$<br />\boldsymbol{F_e} = q \boldsymbol{E} + q \boldsymbol{v} \times \boldsymbol{B}<br />$$$<br /><br />We see that the first bit is exactly like the gravitational case, except with charge instead of mass, and the electric field instead of the gravitational field. But the second bit is new.<br /><br />(Here, $$\times$$ refers not to multiplication, but to the cross product of two vectors. Briefly, the cross product of $$\boldsymbol{a}$$ and $$\boldsymbol{b}$$ is a vector that points perpendicular to both $$\boldsymbol{a}$$ and $$\boldsymbol{b}$$, and with a magnitude that is greatest when $$\boldsymbol{a}$$ and $$\boldsymbol{b}$$ are perpendicular to each other, and 0 when they are parallel. 
Note there are two directions perpendicular to any pair of vectors – which one the cross product returns is determined by the right-hand rule.)<br /><br />The electric field and gravitational field are simple to understand. If you visualise them as vectors in space, those vectors tell you in which direction the force tugs at a charge or a mass passing through that space (though in the case of the electric field, the force can be in the opposite direction, depending on the sign of the charge).<br /><br />The magnetic field, however, exerts a force in a direction that is perpendicular both to the vectors of the field, and to the direction in which the particle moves.<br /><br />It's obvious that electric and gravitational fields can do work: they can make something move that wasn't moving before, accelerating something along a straight line, and so on. A magnetic field can't move a stationary charge, though. In fact, it can do no work at all.<br /><br />We have already seen that the work done by a constant force $$\boldsymbol{F}$$ acting across a distance $$\boldsymbol{s}$$ is the dot product of the force and distance vector, or $$\boldsymbol{F} \cdot \boldsymbol{s}$$. The rate at which work is done – the power – is the rate of change of work with time, or $$P = \frac{dW}{dt} = \boldsymbol{F} \cdot \boldsymbol{v}$$, since we assume force is constant with time, and the rate of change of the position vector $$\boldsymbol{s}$$ with time is the velocity vector $$\boldsymbol{v}$$. <br /><br />Now let $$\boldsymbol{F} = q \boldsymbol{v} \times \boldsymbol{B}$$. Since $$P = \boldsymbol{F} \cdot \boldsymbol{v}$$, it follows that $$P = (q \boldsymbol{v} \times \boldsymbol{B}) \cdot \boldsymbol{v}$$. The part in parentheses is a constant (the charge $$q$$) times the cross product of the velocity and the magnetic field. Therefore it's a vector that points perpendicular to the velocity.
Now we take the dot product with the velocity, essentially asking: if we have a vector perpendicular to the velocity, what is its projection onto the velocity vector? The answer is zero. And so the magnetic component of the electromagnetic force cannot do work.<br /><br />This doesn't mean that it has no effect, of course. Imagine a particle moving upwards on the screen, and a magnetic field is switched on, going into the screen. The magnetic force will be to the left, and the particle's path will bend leftwards. But as it bends, the force also keeps bending to remain always perpendicular. The result is that the particle is now traveling in a circle, the radius of which is determined by the particle's mass (increases radius), the strength of the field (decreases radius), and the speed at which it is moving (increases radius). Just like a planet in a circular orbit around the sun, no work is done, because the force is always exactly perpendicular to the direction of travel. But it still influences the path that the object takes.<br /><br /><h2>Electric and magnetic fields: Maxwell's equations</h2>Maxwell's equations are scary. They are written in the language of vector calculus, so understanding them requires an understanding of divergence, flux, circulation, and curl. There are also two equivalent forms, which look completely different, but which are straightforwardly equivalent if you grasp the vector calculus concepts.<br /><br />The best introduction to these concepts is <a href="https://betterexplained.com/articles/category/math/vector-calculus/">here</a>. There are exceptionally lucid articles on MathInsight, for instance <a href="https://mathinsight.org/curl/_idea">on curl</a>.<br /><br />My aim here will be to try to convey, very concisely, the gist of what the key concepts are, in just enough detail to show why they are connected the way they are, and hence why the two forms of Maxwell's equations are equivalent. 
After that, I will (mostly qualitatively) describe the effects of each equation in turn.<br /><br /><h3>Flux and divergence</h3>Flux is about the amount of [whatever the field measures] passing through a surface. If you imagine a vector field as a bunch of arrows in 3D space, flux is approximated by counting how many arrows pass through a 2D surface, and seeing how closely they are perpendicular to the surface.<br /><br />In a uniform field of strength $$F$$ that is exactly perpendicular to a surface of area $$A$$, the total flux through the surface is $$FA$$. If the field were to tilt to an angle $$\theta$$ relative to the surface, the flux would decrease in proportion to $$\sin{\theta}$$. If the field were parallel to the surface, the field travels along the surface rather than through it, and the flux would be zero.<br /><br />More generally, flux is the sum, over each infinitesimally small piece of a surface, of the dot product of the field with a perpendicular vector to the surface (with a magnitude that represents the area of that bit of the surface; that is, $$\boldsymbol{dS}$$ is a <a href="https://farside.ph.utexas.edu/teaching/302l/lectures/node4.html">vector area</a> for an infinitesimal surface component). If the surface is $$S$$, the field is $$\boldsymbol{F}$$, and $$\boldsymbol{dS}$$ is the vector area of each surface bit, then<br /><br />$$$<br />\iint_S \boldsymbol{F} \cdot \boldsymbol{dS}<br />$$$<br /><br />is the flux.<br /><br />We can take the flux through an open surface like a rectangle, or a closed one like the surface of a sphere. If the vector field represents the motion of gas particles, and the flux is going through a sphere to the inside, then the average density of gas enclosed by the sphere must be increasing.<br /><br />Divergence is the flux through a closed surface per unit of enclosed volume, in the limit as the size of the volume the surface encloses goes to zero.
Think of it as describing, for every point in space, its tendency to act as a source or a sink of [whatever the field describes]. The divergence of a field $$\boldsymbol{F}$$, for reasons I will not get into, is denoted by $$\nabla \cdot \boldsymbol{F}$$ (yes, that is – in some sense – the dot product).<br /><br />The connection between divergence and flux is given by something variously called the divergence theorem, Gauss's theorem, or (presumably only by masochists and Ukrainians) Ostrogradsky's theorem.<br /><br />Despite the confusing names, it is an intuitive result. For some volume $$V$$ bounded by surface $$S$$, the total amount of flux passing through $$S$$ is the total amount of divergence throughout the volume (by which we mean the sum of the divergences at every infinitesimal bit of volume in $$V$$). You can imagine an incompressible liquid: if it's coming out of a volume (there is flux through the surface enclosing the volume), then inside that volume there must be some place that acts as a source of liquid.<br /><br />Mathematically,<br /><br />$$$<br />\iiint_V (\nabla \cdot \boldsymbol{F}) dV = \phi_S,<br />$$$<br /><br />where $$\phi_S$$ is the flux through $$S$$.<br /><br /><br /><h3>Flux and divergence in Maxwell's equations</h3>The first of Maxwell's equations can be given in the form<br /><br />$$$<br />\nabla \cdot \boldsymbol{E} = \rho / \epsilon_0<br />$$$<br /><br />Here $$\epsilon_0$$ is just a constant (the vacuum permittivity); you can ignore it. In general, any constants have no bearing on this discussion, and are included only for the sake of accuracy.<br /><br />The key bit is the charge density $$\rho$$, which is the amount of charge per unit volume at each point in space. Wherever you have a positive charge, there will be a region of space where the amount of positive charge per unit volume of space is positive.
The above equation says that this point will act as a source of electric field vectors; using the "arrows in space" visualisation, there will be arrows pointing away from this point. Likewise a negative charge will be a "sink" of electric field vectors; think of arrows pointing in from the surrounding space into the spots where we have negative charges.<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-ppyoPH7lKug/XgZYXcsytfI/AAAAAAAABDw/-LwlssZ5ZWw-TUFfCXVF0OoK5n1YS8RBgCLcBGAsYHQ/s1600/Screenshot%2B2019-12-27%2Bat%2B11.46.44.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="760" data-original-width="784" height="310" src="https://1.bp.blogspot.com/-ppyoPH7lKug/XgZYXcsytfI/AAAAAAAABDw/-LwlssZ5ZWw-TUFfCXVF0OoK5n1YS8RBgCLcBGAsYHQ/s320/Screenshot%2B2019-12-27%2Bat%2B11.46.44.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The electric field around a positive charge.</td></tr></tbody></table><br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-iqNWay9Kiho/XgZYLrxmbvI/AAAAAAAABDs/nXoxJISJaZcNDBIbujxfXNAzrBbJDPQ0wCLcBGAsYHQ/s1600/Screenshot%2B2019-12-27%2Bat%2B20.27.54.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="910" data-original-width="724" height="640" src="https://1.bp.blogspot.com/-iqNWay9Kiho/XgZYLrxmbvI/AAAAAAAABDs/nXoxJISJaZcNDBIbujxfXNAzrBbJDPQ0wCLcBGAsYHQ/s640/Screenshot%2B2019-12-27%2Bat%2B20.27.54.png" width="507" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The electric field around a positive (blue) and 
negative (red) charge in close proximity.</td></tr></tbody></table><br />Now let's stop squinting at tiny bits of space and instead consider an entire volume $$V$$. The above discussion on divergence and flux tells us how to do this. If we integrate the divergence over a volume, we get the flux through the enclosing surface (call it $$S$$ again). So we have that the flux of the electric field through the enclosing surface of our volume, $$\phi_{ES}$$, will be <br /><br />$$$<br />\phi_{ES} = \frac{1}{\epsilon_0} \iiint_V \rho dV.<br />$$$<br /><br />The integral of the charge density over a volume is just the total charge within that volume; call it $$Q_V$$, so we can write simply that the electric flux through a closed surface is (a constant times) the total charge enclosed within that surface:<br /><br />$$$<br />\phi_{ES} = \left( \frac{1}{\epsilon_0} \right) Q_V<br />$$$<br /><br />Let's take a simple case of applying this law, and see where we end up.<br /><br />The simplest sort of closed surface we can have is a sphere. The simplest charge distribution we can have inside a sphere is a point charge in the centre. But note that, no matter how large the sphere is, the electric flux $$\phi_{ES}$$ through it has to be the same. The area of the sphere grows with the square of its radius, so it follows that electric field strength has to decrease with the square of distance from a point charge to keep the sum of the field through the entire sphere's surface constant. Electric field strength in turn is proportional to the force per unit charge the field exerts. 
Therefore electric forces exerted by a point charge decrease in proportion to the inverse square of distance from the charge.<br /><br />If we were to carry out the above line of reasoning while taking a bit more care with the constants, we would wind up with our original electric force law:<br /><br />$$$<br />F = \frac{k_e q_1 q_2}{r^2}.<br />$$$<br /><br />Another of Maxwell's equations has a differential form that states $$\nabla \cdot \boldsymbol{B} = 0$$; that is, no point in space is a source or sink of magnetic fields. It follows that no volume in space can be a source or sink either, and hence that the magnetic flux through a closed surface $$S$$, call it $$\phi_{BS}$$, must always be zero. This gives us the integral form of this law.<br /><br />An immediate consequence of this law is that there are no magnetic "charges" – no magnetic monopoles. Magnetic field lines do not start or stop, but always form loops.<br /><br /><br /><h3>Circulation and curl</h3>A key concept with vector fields is that of a line integral.<br /><br />Consider taking a hike through hilly terrain. You know your path, and you have a map that gives you the direction and magnitude of the slope at each point. How do you find how far up or down you travelled?<br /><br />We can represent each step you take as a vector $$\boldsymbol{s}$$: basically a line from where you were before the step to where you are after the step. Assume that the size of the step is small enough and the terrain gentle enough that the slope does not change appreciably between one step and the next. Let the slope at that point be given by the vector $$\boldsymbol{G}$$, which points in the direction of maximum increase of terrain height, with units of distance moved up divided by distance moved sideways (note that $$\boldsymbol{G}$$ always points along the plane perpendicular to the up-down direction). 
If you step directly along $$\boldsymbol{G}$$, then the distance you move up is the length of the step, times the magnitude of $$\boldsymbol{G}$$; you can verify this by looking at the units: distance moved sideways times distance moved up per distance moved sideways gives distance moved up.<br /><br />Stepping the same distance in the opposite direction would result in moving down by the same distance. Stepping perpendicular to $$\boldsymbol{G}$$'s axis would result in no change in height (if you're unconvinced, note that a small enough sloping region can be approximated by a rectangular plane). In the general case, the amount you move up or down is the projection of one vector onto the unit vector in the direction of the other, or the dot product: $$\boldsymbol{G} \cdot \boldsymbol{s}$$.<br /><br />If you take a lot of steps, you add up the contribution from each one. Let the size of the steps decrease to zero, and we can work out the total change in height as an integral along your path $$P$$: just add up the dot product of $$\boldsymbol{G}$$ with each small vector $$\boldsymbol{dl}$$ pointing along your path for every segment of your path.<br /><br />Consider now the problem of finding the work $$W$$ done on a particle as it moves along some path. We know that for a constant force $$\boldsymbol{F}$$ and a straight-line movement along $$\boldsymbol{s}$$, $$W = \boldsymbol{F} \cdot \boldsymbol{s}$$. In the standard calculus way, if we want to find the total work over a curving path, we write the integral<br /><br />$$$<br />\int_P \boldsymbol{F} \cdot \boldsymbol{dl},<br />$$$<br /><br />to find the sum of the contributions of each infinitesimal step $$\boldsymbol{dl}$$ along the smooth path $$P$$ along which we travel.<br /><br />Now consider a similar line integral, but a closed one: one where the path we take returns to the starting point at the end.<br /><br />In the case of the terrain height example, the result is obvious. 
The net change in height when we travel from point A to point A is zero, regardless of the path we take. The same is true of work in a gravitational field, because we can write the gravitational force field as the gradient of a gravitational potential field in the same way we write the slope of a terrain as the gradient of the terrain's height.<br /><br />(In general, if a vector field $$\boldsymbol{F} = \nabla \phi$$ for some scalar field $$\phi$$, then a closed line integral in that vector field must be 0. Such a vector field is termed "conservative".)<br /><br />But consider the vector field representing the motion of water in a whirlpool. We go around the whirlpool once, and at every point along the way, the water is pushing in the direction of our travel: the line integral of the vector field along our closed path is positive.<br /><br />Such a closed line integral is, like flux, a quantity we can calculate given a vector field and some shape in space (in this case, a loop). It is called circulation.<br /><br />In the case of flux, we found a way to determine it by looking only at divergence, which is a quantity that takes a value not for some large shape in space, but for each individual point of a vector field. We'd now like to do something similar with circulation. 
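(As a numerical aside: the conservative and whirlpool cases are easy to check directly. The sketch below is my own illustration, not part of the original argument; it approximates the closed line integral by summing $$\boldsymbol{F} \cdot \boldsymbol{dl}$$ over many short steps around the unit circle, for two made-up 2D fields matching the examples above.)

```python
import numpy as np

# Approximate a closed line integral: sum F . dl over many short
# steps around the unit circle.
def circulation(field, n=100_000):
    s = np.linspace(0.0, 2 * np.pi, n + 1)
    pts = np.stack([np.cos(s), np.sin(s)], axis=1)   # points along the loop
    mid = 0.5 * (pts[:-1] + pts[1:])                 # midpoint of each step
    dl = pts[1:] - pts[:-1]                          # step vectors
    return float(np.sum(field(mid) * dl))            # sum of dot products

# Conservative field: the gradient of phi(x, y) = x^2 + y^2.
gradient_field = lambda r: 2 * r
# Whirlpool field: F(x, y) = (-y, x), circling the origin.
whirlpool = lambda r: np.stack([-r[:, 1], r[:, 0]], axis=1)

print(circulation(gradient_field))   # ~0: closed loop in a conservative field
print(circulation(whirlpool))        # ~2*pi: non-zero circulation
```

(The conservative loop integral comes out to zero up to floating-point noise, while the whirlpool's comes out to $$2\pi$$.)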
This is possible, once again, with a very intuitive and visual argument.<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-S2L2UEqVu8Q/Xgob_iGpfCI/AAAAAAAABFE/3TWtMqKjML4Su3JGhz2to3NmYc-YkRCtQCLcBGAsYHQ/s1600/Screenshot%2B2019-12-29%2Bat%2B23.27.41.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="950" data-original-width="1000" height="380" src="https://1.bp.blogspot.com/-S2L2UEqVu8Q/Xgob_iGpfCI/AAAAAAAABFE/3TWtMqKjML4Su3JGhz2to3NmYc-YkRCtQCLcBGAsYHQ/s400/Screenshot%2B2019-12-29%2Bat%2B23.27.41.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The circulation over the black loop is the circulation in the red loop plus the circulation in the green loop, since the net contribution of the middle part is zero.</td></tr></tbody></table><br />Consider a loop $$L$$ in space, enclosing a surface $$S$$. Split it into two loops, $$L_1$$ and $$L_2$$. There is a segment that the two loops share, but when we add the circulation along $$L_1$$ and that along $$L_2$$, the contribution of this segment cancels out because the sign is reversed for $$L_1$$ compared to $$L_2$$ (for this segment, the infinitesimal path direction vectors $$\boldsymbol{dl}$$ point in opposite directions). 
Therefore the circulation of $$L$$ is that of $$L_1$$ plus that of $$L_2$$, or in other words,<br /><br />$$$<br />\oint_L \boldsymbol{F} \cdot \boldsymbol{dl} = \oint_{L_1} \boldsymbol{F} \cdot \boldsymbol{dl} + \oint_{L_2} \boldsymbol{F} \cdot \boldsymbol{dl}<br />$$$<br /><br />($$\boldsymbol{dl}$$ is what we will call the infinitesimal segments of $$L$$, $$L_1$$, and $$L_2$$.)<br /><br />We can continue recursively splitting up the surface $$S$$ into smaller and smaller segments, always assured that, if we just add up all of them, we still get the circulation along $$L$$. In the limit, we have infinitesimally small segments – in a loose sense, one for each point in $$S$$. Integrate the value of each of these microscopic circulations over all of $$S$$, and you will get the circulation along $$L$$:<br /><br />$$$<br />\iint_S (\nabla \times \boldsymbol{F}) \cdot \boldsymbol{dS} = \oint_L \boldsymbol{F} \cdot \boldsymbol{dl}<br />$$$<br /><br />($$\nabla \times \boldsymbol{F}$$ is how we denote curl.)<br /><br />The main complexity is that, since the expressions we're integrating are vectors, to get a scalar result we're integrating the left-hand side with $$\cdot \boldsymbol{dS}$$ – that is, the dot product of the expression with the vector area of each bit of surface – rather than simply with a scalar area element $$dS$$. <br /><br />This infinitesimal equivalent of circulation is called curl. Imagine the vector field as a fluid, and a microscopic sphere at some point in it. The curl at that point can be visualised as the vector that represents the axis along which the fluid makes the sphere turn (which way along this axis the vector points is given by the right hand rule).<br /><br />(It turns out that, for an infinitesimal square, it is possible to find an expression for the circulation around it in terms of the rates of change of the $$x$$, $$y$$, and $$z$$ components of the vector field with respect to the $$x$$, $$y$$, and $$z$$ axes. 
This allows for a definition of curl that is not in terms of the line integral of anything.)<br /><br /><br /><h3>Circulation and curl in Maxwell's equations</h3>The first two of (the differential form of) Maxwell's equations form a pair: one for the divergence of an electric field, the other for that of a magnetic field. The final two form another pair, this time dealing with the curl of the fields.<br /><br />The curl of an electric field is the negative rate of change with time of the magnetic field at that point:<br /><br />$$$<br />\nabla \times \boldsymbol{E} = -\frac{\partial \boldsymbol{B}}{\partial t}<br />$$$<br /><br />(We write $$\partial$$ instead of $$d$$ in the derivative operator because we're taking a partial derivative: changing time $$t$$ while holding the space coordinates, along which the magnetic field also varies, constant.)<br /><br />Armed with the result established previously for translating statements about curl at a point to statements about the circulation along a closed path, we can express the law in a different way. We simply pick a surface $$S$$, and integrate both sides of the above equation over this surface. There's only one detail: we can't integrate with respect to scalar area elements $$dS$$, since curl is a vector, and then we'd get a vector for the integral of the left-hand side. So we'll integrate, once again, with the vector areas $$\boldsymbol{dS}$$.<br /><br />The integral along the surface $$S$$ of the left-hand side ($$\nabla \times \boldsymbol{E}$$) is, by the circulation-curl result, the line integral of $$\boldsymbol{E}$$ along the closed line $$L$$ that bounds $$S$$. Denoting the circulation of $$\boldsymbol{E}$$ around $$L$$ by $$C_{EL}$$, we have:<br /><br />$$$<br />C_{EL} = - \iint_S \frac{\partial \boldsymbol{B}}{\partial{t}} \cdot \boldsymbol{dS}<br />$$$<br /><br />We integrate with respect to area and differentiate with respect to time, and area and time don't change relative to each other, so it's all the same which way around we do it. 
Thus we can just as well write<br /><br />$$$<br />C_{EL} = - \frac{d}{dt} \iint_S \boldsymbol{B} \cdot \boldsymbol{dS}<br />$$$<br /><br />Now the integral looks familiar – it's just the definition of flux, for the case of finding the flux through surface $$S$$ for the magnetic field $$\boldsymbol{B}$$. Denoting the flux of $$\boldsymbol{B}$$ through the surface $$S$$ by $$\phi_{BS}$$, we arrive at the final version of the integral form of the law:<br /><br />$$$<br />C_{EL} = - \frac{d}{dt} \phi_{BS}<br />$$$<br /><br />To put it in words: the circulation of an electric field around a closed path is the negative rate of change with time of the magnetic flux through the surface enclosed by the path.<br /><br />This means that whenever we have magnetic fields changing, the electric field circulates. An electric field in which there exist closed paths of non-zero circulation is a powerful thing. We can, in theory, take a charge, move it along such a path, return it back to where it was before, and have a positive amount of work done on the charge. (Remember that in the gravitational case, the line integral of work done around any closed loop comes to 0.)<br /><br />This principle is how electric generators work. You have coils of wire, and in the middle, a changing magnetic field. This creates an electric field pushing along the wire, which makes the electrons in the wire move.<br /><br />Of course, a magnetic field cannot get stronger without limit, so it's difficult to do much with a uniformly increasing (or decreasing) magnetic field. But if the magnetic flux varies, from positive to zero to negative to zero within some bounded range, then most of the time it will be changing (except when it's at a minimum or a maximum), and you can get the electric charges in wires to oscillate back and forth, and extract work from this motion.<br /><br />The final law is the most complex one. 
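(Before looking at it, a quick numerical sanity check of the Faraday relation above. This sketch is my own, with made-up numbers: for a uniform magnetic field through the plane growing at rate $$k$$, the field $$\boldsymbol{E} = -\frac{k}{2}(-y, x)$$ is a standard symmetric solution, and its circulation around a circle of radius $$r$$ should come out to $$-d\phi_{BS}/dt = -k \pi r^2$$.)

```python
import numpy as np

# Uniform B through the plane, growing at rate k, with the symmetric
# induced field E = -(k/2) * (-y, x). Faraday's law demands that the
# circulation of E around a circle of radius r equal -k * pi * r**2.
k, r, n = 0.7, 2.0, 200_000

s = np.linspace(0.0, 2 * np.pi, n + 1)
pts = r * np.stack([np.cos(s), np.sin(s)], axis=1)   # loop of radius r
mid = 0.5 * (pts[:-1] + pts[1:])                     # step midpoints
dl = pts[1:] - pts[:-1]                              # step vectors
E = -(k / 2) * np.stack([-mid[:, 1], mid[:, 0]], axis=1)
circulation = float(np.sum(E * dl))                  # sum of E . dl

assert np.isclose(circulation, -k * np.pi * r**2, rtol=1e-6)
```

(The sign is the interesting part: the induced field circulates the "negative" way around the growing flux, which is Lenz's law.)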
In differential form, it is:<br /><br />$$$<br />\nabla \times \boldsymbol{B} = \mu_0 \boldsymbol{j} + \mu_0 \epsilon_0 \frac{\partial \boldsymbol{E}}{\partial t}<br />$$$<br /><br />Once again we have some constants ($$\mu_0$$ and $$\epsilon_0$$, the permeability and permittivity of a vacuum respectively) which have no bearing on this discussion.<br /><br />We also have a new symbol: $$\boldsymbol{j}$$, the current density. In the same way we previously referred to charge density $$\rho$$ instead of charge directly, we now talk about how much current flows per unit of area it passes through. Note that it's a vector: we care not just about how much current we have, but also about which direction it's flowing in.<br /><br />Originally, this final of Maxwell's equations was only half-complete. When Ampère first wrote down this law, he wrote down this:<br /><br />$$$<br />\nabla \times \boldsymbol{B} = \mu_0 \boldsymbol{j}<br />$$$<br /><br />(Or rather, he wrote down something that, in modern vector notation, might be written as the above.)<br /><br />Looking at only this half, let's see what we get. Just as before, we use the result relating curl to circulation, which gives<br /><br />$$$<br />\iint_S (\nabla \times \boldsymbol{B}) \cdot \boldsymbol{dS}<br />\equiv \oint_L \boldsymbol{B} \cdot \boldsymbol{dl}<br />= \mu_0 \iint_S \boldsymbol{j} \cdot \boldsymbol{dS}.<br />$$$<br /><br />Using the incomplete version of the law, what we find is that the integral-form version of it states: the line integral of the magnetic field around a closed path $$L$$ is (a constant times) the current flux through the surface $$S$$ enclosed by $$L$$.<br /><br />And this is what Ampère observed. 
If you take a wire, with some amount of current going through it, then you will always get a magnetic field around the wire, with the property that the total circulation of the field around a loop is proportional to the current flow and independent of the shape or size of the loop (of course, it takes some ingenuity to deduce from physically measurable quantities that the abstract magnetic field behaves this way).<br /><br />(In the same way as our divergence law for the electric field leads to an inverse-square law for the strength of the electric field of a point charge, this law leads to an inverse law for the strength of a magnetic field with distance from a wire; for a circular loop, the length is proportional to the radius, so to maintain constant circulation along the loop regardless of its size, magnetic field strength must go down inversely with radius.)<br /><br />The incomplete version of the law had some difficulties. These can be illustrated through theoretical considerations, but the most concrete demonstration I've seen is a thought experiment in the Feynman Lectures. Consider a charged central blob that emits charged particles uniformly in all directions. Imagine a sphere around this blob, and draw a circle on the sphere. There are particles flying through this circle, so the incomplete version of the law requires there to be a magnetic circulation around our circle. But the situation is symmetric: we have no reason to prefer, say, a counterclockwise circulation of the magnetic field over a clockwise one in our circle. There cannot reasonably be any circulation of the magnetic field on this sphere.<br /><br />If we were to try to invent our way out of this mess, we might note that this thought experiment involves a bunch of electric charges flying away, and thus the electric flux is constantly changing. 
Indeed, the solution involves adding a term relating to the rate of change of the electric field: this is the $$\mu_0 \epsilon_0 \frac{\partial \boldsymbol{E}}{\partial t}$$ part of the equation.<br /><br />(The very determined reader may wish to investigate how this saves us in the sphere-around-escaping-charges thought experiment. The (slightly less) determined reader can find the answer <a href="https://www.feynmanlectures.caltech.edu/II_18.html#Ch18-S2">here</a>).<br /><br />The effect of this additional term is that, instead of the circulation of the magnetic field around a loop being equal to one area integral, it will be equal to the sum of two. Using the same procedure as before, we eventually find that the integral form of<br /><br />$$$<br />\nabla \times \boldsymbol{B} = \mu_0 \boldsymbol{j} + \mu_0 \epsilon_0 \frac{\partial \boldsymbol{E}}{\partial t}<br />$$$<br /><br />is<br /><br />$$$<br />\oint_L \boldsymbol{B} \cdot \boldsymbol{dl} =<br />\mu_0 \iint_S \boldsymbol{j} \cdot \boldsymbol{dS}<br />+ \mu_0 \epsilon_0 \frac{d}{dt} \iint_S \boldsymbol{E} \cdot \boldsymbol{dS}.<br />$$$<br /><br />We recognise the left-hand side term as a circulation, and the two right-hand terms as fluxes. To make the conceptual relationships here clearer than the above mess of integrals, let $$C_{BL}$$ be the circulation of $$\boldsymbol{B}$$ around $$L$$, and let $$\phi_{jS}$$ and $$\phi_{ES}$$ be the fluxes of the current and of the electric field through the surface $$S$$ bounded by $$L$$; then we have that<br /><br />$$$<br />C_{BL} = \mu_0 \phi_{jS} + \mu_0 \epsilon_0 \frac{d (\phi_{ES})}{dt}.<br />$$$<br /><br />Therefore the circulation of a magnetic field around a loop $$L$$ is (a constant times) the current flux through the enclosed surface, plus (a constant times) the rate of change of electric flux through the same surface.<br /><br />To visualise this: imagine a string of electric charges passing through an imaginary circle that we draw around their path. 
When the charges are passing through this circle, the first term on the right-hand side means that there is a circulation of the magnetic field along our imaginary circle. Even after the charges have passed by, there will be some circulation of the magnetic field. The charges cause there to be electric flux through the circle, and as they move further and further away, the flux decreases (though the rate of decrease decreases as the flux tends to zero). This changing flux keeps the magnetic field circulating even after the charges have physically passed by.<br /><br />(A similar situation allows us to note another paradox with Ampère's original incomplete version of the law. Note that the derivation of the curl/circulation relationship given above does not require the circle to be flat - the surface "enclosed" by the loop could be, for instance, shaped like a cylindrical hat, with the rim being our circle. The time at which the charges have finished passing through the surface therefore depends on what surface we choose our loop to enclose. If the magnetic circulation depended only on charges passing through this surface, then changing where we draw an imaginary surface would change how the magnetic field behaves! In the real world, any such change of surface shape would also change the flux through the surface, in such a way that we always agree about what the circulation is regardless of any imaginary shapes.)<br /><br />For the curl/circulation of an electric field, we found that it depends on the rate of change of the magnetic field. 
Analogously, the curl/circulation of a magnetic field depends on the rate of change of the electric field.<br /><br />But why the extra term?<br /><br />Let's take the differential form of the equation, and take the divergence of both sides:<br /><br />$$$<br />\nabla \cdot (\nabla \times \boldsymbol{B}) = \nabla \cdot (\mu_0 \boldsymbol{j} + \mu_0 \epsilon_0 \frac{\partial \boldsymbol{E}}{\partial t})<br />$$$<br /><br />There is a vector calculus identity that $$\nabla \cdot (\nabla \times \boldsymbol{F}) = 0$$ for any vector field $$\boldsymbol{F}$$.<br /><br />(Why? Consider integrating $$\nabla \cdot (\nabla \times \boldsymbol{F})$$ over a volume $$V$$, bounded by the closed surface $$S$$. From the divergence-flux result, we know that this is equal to the flux of $$\nabla \times \boldsymbol{F}$$ through $$S$$. From the curl-circulation result, we know that this is equal to the line integral of $$\boldsymbol{F}$$ along the loop that bounds $$S$$. But there can be no such loop, since $$S$$ must be closed in order to enclose a volume; if $$S$$ were almost but not quite closed around $$V$$, the bounding loop would be very small, and since $$S$$ is fully closed, the loop has size zero. Therefore the integral must always be zero, implying that the expression itself must be zero. 
(If any mathematician challenges the technical details of this proof, I will be on the next plane to New Zealand))<br /><br />Using this identity and dividing by the constant $$\mu_0$$, we have that<br /><br />$$$<br />0 = \nabla \cdot \boldsymbol{j} + \epsilon_0 (\nabla \cdot \frac{\partial \boldsymbol{E}}{\partial t})<br />$$$<br /><br />We can reshuffle the order in which we take the derivatives (the divergence operator is essentially a derivative) and move one term to the other side to get:<br /><br />$$$<br />\nabla \cdot \boldsymbol{j} = -\epsilon_0 \frac{\partial}{\partial t} (\nabla \cdot \boldsymbol{E})<br />$$$<br /><br />The first of Maxwell's equations that we discussed tells us that $$\nabla \cdot \boldsymbol{E} = \rho / \epsilon_0$$. Substituting this into the above yields the final result:<br /><br />$$$<br />\nabla \cdot \boldsymbol{j} = -\frac{\partial \rho}{\partial t},<br />$$$<br /><br />where $$\boldsymbol{j}$$ is the current density vector and $$\rho$$ is the charge density.<br /><br />What does this result mean? The left-hand side is the divergence of the current density, or, in other words, the tendency of a point in space to act as a source or a sink of current. The right-hand side is the negative rate of change of charge density with time.<br /><br />Let's say we have current coming out of a point. This equation tells us that the charge density at that point must then be going down. 
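(This bookkeeping can be checked numerically. The sketch below is my own, with a made-up one-dimensional example: a Gaussian blob of charge drifting at speed $$v$$ has $$\rho(x, t) = g(x - vt)$$ and current density $$j = v\rho$$, and the continuity equation should hold at every point.)

```python
import numpy as np

# 1D check of the continuity equation div j = -d(rho)/dt for a
# charge blob drifting at speed v: rho(x, t) = exp(-(x - v t)^2), j = v rho.
v, dx, dt = 1.3, 1e-3, 1e-5
x = np.arange(-5, 5, dx)

rho = lambda t: np.exp(-(x - v * t) ** 2)
j = lambda t: v * rho(t)

drho_dt = (rho(dt) - rho(-dt)) / (2 * dt)   # central difference in time
div_j = np.gradient(j(0.0), dx)             # spatial derivative of j

# div j + d(rho)/dt should vanish everywhere (up to discretisation error).
assert np.max(np.abs(div_j + drho_dt)) < 1e-4
```

(Any charge configuration that merely moves around, rather than appearing or disappearing, satisfies the equation in this way.)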
If current goes into a point, the charge density must go up.<br /><br />In short, this is the law of the conservation of charge.<br /><br />If we do the same thing without the extra rate-of-change-of-the-electric-field piece in the equation, we find that<br /><br />$$$<br />\nabla \cdot (\nabla \times \boldsymbol{B}) = \nabla \cdot (\mu_0 \boldsymbol{j})<br />$$$<br /><br />and therefore that <br /><br />$$$<br />0 = \nabla \cdot \boldsymbol{j},<br />$$$<br /><br />which would imply that there can be no source or sink of current: current would be like an incompressible fluid (or a magnetic field), flowing in loops but never "piling up" in or "emptying out" of one place.<br /><br />Maxwell's equations aren't really Maxwell's. Individually, they don't even have Maxwell's name; there's Gauss's law, Gauss's law for magnetism, Faraday's law, and Ampère's law (with Maxwell's addition). What Maxwell did was, first, put them all together, and second, add one piece - the $$+ \mu_0 \epsilon_0 (\partial \boldsymbol{E}) / (\partial t)$$ bit - to Ampère's law. This one piece, however, is a pretty significant one: not only does it resolve the contradictions in Ampère's original law, but it straightforwardly implies the conservation of charge, and (in a somewhat less straightforward way) the behaviour of light as a wave. I'm happy to let Maxwell have his name on the equations.<br /><br /><br /><h3>Finding the electric and magnetic fields</h3>Maxwell's equations don't really tell you how to figure out what actually happens when you have a bunch of charges moving around. 
Sure, you can deduce (from the two equations about curl) that they imply disturbances in the field spread out as waves consisting of an electric and a magnetic part oscillating together, or that you will only get magnetic fields when electric charges are in motion, or what values the field takes in simple cases like a current-carrying wire or a point charge.<br /><br />But if we have an arbitrary collection of moving charges, what do we do? "The circulation of the electric field must be this and this", says Faraday's law – but this just sets a constraint, without directly telling us how to find the field that fulfils it.<br /><br />In the case of gravitational fields, we were able to present essentially a complete solution. Once we've placed our masses, we know exactly how to find the gravitational field, and that tells us how the masses interact with each other. You could write a computer simulation to work it out based on the preceding discussion.<br /><br />Solving Maxwell's equations is more difficult. I will not present a derivation here, but the general outline is as follows.<br /><br />In the gravitational case, since the circulation of the gravitational (vector) field was always zero, we could express it as the gradient of a (scalar) potential field. Both magnetic and electric fields can have non-zero circulation, however, so though we can define potentials, they will not take the form of a simple scalar potential whose gradient equals the field.<br /><br />In the same way that a zero-curl (and hence zero-circulation) field can be expressed as the gradient of something, a zero-divergence vector field can be expressed as the curl of something. The divergence of the magnetic field is zero, so we define the magnetic vector potential to be the vector field whose curl is the magnetic field.<br /><br />If we have no moving charges, we have no changing magnetic fields, and hence no curl in the electric field. 
In such a case, the electric field is simply the gradient of the electric potential, which we define in a way exactly analogous to the gravitational potential.<br /><br />However, if we have moving charges, and hence changing magnetic fields, we have circulating electric fields and hence the field cannot be the gradient of something. In the general case, the expression for the electric field involves the rate of change of the magnetic vector potential. <br /><br />Given these definitions, we get a relation between the magnetic vector potential and current density, and a similar relation between the (scalar) electric potential and the charge density (this relation takes the form of the wave equation in places where there are no currents or charges). (See <a href="https://www.feynmanlectures.caltech.edu/II_18.html#Ch18-S6">chapter 18 of FLoP</a> for the details.)<br /><br />Finally, from this we can show, given an arbitrary charge and current distribution, how to find the electric potential and magnetic vector potential for each point in space. The case of finding the electric potential is exactly analogous to the gravitational potential case. The magnetic potential works similarly (though the constants are different), except we don't integrate a scalar like charge/mass, but the current density vector (over all space, scaled in inverse proportion to the distance to the point whose potential we're finding, just like with the electric and gravitational cases).<br /><br />To summarise: the gravitational/electric/magnetic potential at a point is a sum of the influences of masses/charges/currents elsewhere, weighted based on how far away they are, and how much mass/charge/current there is.<br /><br />However, we have to take into account that electromagnetic influences don't travel instantaneously (neither do gravitational ones, but classical physics does not account for that). 
The electric potential at time $$t$$ is affected not by the charge density a distance $$r$$ away at time $$t$$, but by what the charge density was at time $$t - r / c$$, where $$c$$ is the speed of light. Whenever the charge or current density distribution changes, the effects of the change on the electromagnetic potentials spread out at the speed of light.<br /><br />(See <a href="https://www.feynmanlectures.caltech.edu/II_21.html#Ch21-S3">chapter 21 of FLoP</a> for the details of the derivation.)<br /><br />Now for the equations. Let the charge density at position $$\boldsymbol{R}$$ and time $$T$$ be $$\rho(\boldsymbol{R}, T)$$, and likewise the current density at an arbitrary place and time be $$\boldsymbol{j}(\boldsymbol{R}, T)$$. Then the electric potential $$\phi$$ and the magnetic vector potential $$\boldsymbol{A}$$ at a point with position vector $$\boldsymbol{r}$$ at time $$t$$ are<br /><br />\begin{align*}<br />& \phi(\boldsymbol{r}, t) = \frac{1}{4 \pi \epsilon_0} \iiint<br />\frac{\rho( \boldsymbol{r_{dV}}, t - r / c)}{r} dV, \\<br />& \boldsymbol{A}(\boldsymbol{r}, t) = \frac{1}{4 \pi \epsilon_0 c^2} \iiint<br />\frac{\boldsymbol{j}( \boldsymbol{r_{dV}}, t - r / c)}{r} dV,<br />\end{align*}<br /><br />where $$\boldsymbol{r_{dV}}$$ is a position vector that always points to whatever infinitesimal piece of volume the integral is running over, and $$r$$ is the distance between $$\boldsymbol{r}$$ (where we're finding the potential) and $$\boldsymbol{r_{dV}}$$ (so $$r = | \boldsymbol{r} - \boldsymbol{r_{dV}}|$$). 
Note that we let each integral run over all of space.<br /><br />To find the electric field $$\boldsymbol{E}$$ and the magnetic field $$\boldsymbol{B}$$ from these potentials, we have to do something a bit more complicated than just taking the gradient:<br /><br />\begin{align*}<br />& \boldsymbol{E} = -\nabla \phi - \frac{\partial \boldsymbol{A}}{\partial t} \\<br />& \boldsymbol{B} = \nabla \times \boldsymbol{A}<br />\end{align*}<br /><br /><h3>Example: visualising the solution to Maxwell's equations</h3>To get an intuitive picture of what the above solutions really mean, let's think through a simple example.<br /><br />Imagine a long series of positive charges moving upwards (in reality, if you had nothing but positive charges in close proximity, they would repel each other and fly away, but for the sake of simplicity let's say we've managed to arrange the situation in such a way that talking about just a string of positive charges is a good model for the electromagnetic effects).<br /><br />Now let's consider the electric and magnetic potentials outside the wire.<br /><br />Along this wire of charges, we have some charge density. Therefore there will be electric potential around it, and this potential will decrease with distance from the wire.<br /><br />(The potential will not, however, decrease in proportion to the inverse of distance. We can find the exact way it decreases by doing the integrals, but in this case it's simpler to reason from the fields to the potentials rather than the other way around. By Gauss's law, we know that the electric flux through a cylinder we place around our wire is proportional to the charge inside. By the symmetry of our setup, the flux through the top of the cylinder cancels out the flux through the bottom; the net flux is the flux through the sides, and must be directed radially outwards from the wire. If we fix the height of our cylinder, the charge inside is a constant. 
If we now increase the radius of the cylinder, the area of its side will increase in proportion to the radius, and hence, to keep total flux constant, the electric field must decline in inverse proportion to the radius. Potential can be found by integrating the field, and the integral of $$1/r$$ with respect to $$r$$ is a natural logarithm. So the relationship between distance from the wire and potential is actually logarithmic.)<br /><br />Since the charges are moving, the current density vectors in the region of space occupied by the wire are non-zero, directed up in the direction of motion. And since we have non-zero current density, we will have magnetic potential. The magnetic potential vectors will point upwards, run parallel to the wire, and have length proportional both to the amount of charge moving and to the speed of the charges.<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-Bl17HntHboY/XgnsiRTOjOI/AAAAAAAABEY/Ts1-g1wxYRQnLqH4Gv44KzvmzyKL8mOHQCEwYBhgL/s1600/Screenshot%2B2019-12-30%2Bat%2B14.24.28.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="724" data-original-width="1280" height="225" src="https://1.bp.blogspot.com/-Bl17HntHboY/XgnsiRTOjOI/AAAAAAAABEY/Ts1-g1wxYRQnLqH4Gv44KzvmzyKL8mOHQCEwYBhgL/s400/Screenshot%2B2019-12-30%2Bat%2B14.24.28.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The magnetic vector potential (blue) around a current (current density vectors in black).</td></tr></tbody></table><br />(The exact relationship between distance and the length of the vectors is logarithmic, as in the electric case.)<br /><br />If we look at a portion of the wire that is far from the ends, neither potential will be changing (you can imagine that, at successive time 
steps, each charge moves to take the place of the one before it). Therefore the rate of change of the magnetic vector potential is 0, and the electric field is simply the (negative) gradient of the potential.<br /><br />For the magnetic field, we know it must be the curl of the magnetic potential field.<br /><br />You might think: how can a field consisting just of vectors pointing in the same direction have any curl? Curl is a more subtle concept than "vectors in loop-like arrangements". The intuitive idea here is to remember the sphere-in-a-fluid analogy. If we imagine the magnetic potential field as describing the flow of a fluid, and place a sphere in it, it will spin, since the "flow" closer to the wire is stronger than that further away, even though the direction of the flow is the same in both cases. The curl is along the axis of this spin, so it is directed tangentially to a circle around the wire.<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-hqHmlkL7X0A/Xgnxd98c5SI/AAAAAAAABEk/svEiebI5ev8GV2PIjBu08u9dsp6nbrjgwCLcBGAsYHQ/s1600/Screenshot%2B2019-12-30%2Bat%2B14.45.23.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1076" data-original-width="1280" height="538" src="https://1.bp.blogspot.com/-hqHmlkL7X0A/Xgnxd98c5SI/AAAAAAAABEk/svEiebI5ev8GV2PIjBu08u9dsp6nbrjgwCLcBGAsYHQ/s640/Screenshot%2B2019-12-30%2Bat%2B14.45.23.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The black dotted circle indicates the current coming out of the screen. The blue dotted circles indicate the magnetic vector potential vectors, also directed out of the screen. The magnetic field vectors are drawn in red for a few selected distances from the wire. 
The curl of the magnetic potential can be visualised by imagining which direction the green sphere would roll in if the magnetic vector potential represented the flow of a fluid. The direction of fastest decline of the magnetic potential is as we move from the wire along the dotted line, so the axis along which the sphere rolls is the solid line (which is parallel to the magnetic field vectors at that angle). The right-hand rule gives the counterclockwise direction for the magnetic vectors.</td></tr></tbody></table><br /><br />In this way we recover what we had already deduced from Ampère's law: a wire with current creates a circulation of the magnetic field around it (that declines in strength with the inverse distance from the wire, as can be seen, for example, from taking the derivative of our logarithmic potential-with-distance function).<br /><br />We can now consider the behaviour of another positive charge near this wire. If it's stationary, it will be repelled radially outward from the wire by electric forces. If it is somehow kept moving in a circle around the wire, it will feel no magnetic force since the magnetic field will always be parallel to its velocity vector. If it moves in some other way, it will experience a force perpendicular to both its velocity and the magnetic field (which of the two perpendicular directions it moves in is given by the right-hand rule); for instance, moving parallel to the wire, the magnetic force will draw it towards the wire.<br /><br />All is well, then?<br /><br /><h3>Hints of relativity</h3>Not quite. Imagine that, instead of sitting stationary next to the wire, we move with the charges, so that from our perspective the charges are stationary. 
Classical physics says the electric field is exactly as before, but the charges in the wire aren't moving, so they do not create a magnetic field around them (nor can magnetic fields exert forces on them).<br /><br />If (for simplicity) the additional charge near the wire was previously moving at the same speed as the charges in the wire, it will also be stationary in our new reference frame. Instead of feeling a repulsive electric force that is partly cancelled by an attractive magnetic force, it now feels only that same repulsive electric force, and would therefore be accelerated outwards at a greater rate.<br /><br />By following the laws of classical physics, we have changed how a system behaves by looking at it from a different reference frame.<br /><br />This isn't necessarily a paradox. In Newtonian mechanics, force always depends on acceleration and never on velocity, so it must always predict the same forces, and hence the same consequences, for a system regardless of how fast we move relative to the system. But might this invariance with respect to the velocity from which we view a system be just a quirk that happens to be a valid approximation in mechanics, but fails when we get to deeper physics like electromagnetism? The electromagnetic force law certainly includes velocity in it.<br /><br />As it turns out, electromagnetism is correct, and the principle of relativity (the principle that physics works the same way in every reference frame moving at constant velocity) holds. With these assumptions, the above situation really is a paradox. What needs changing is our ideas of time and space instead, as described by the theory of special relativity.<br /><br />The relativistic resolution of the above paradox is that fast-moving objects contract along the direction of their travel. In the stationary case with no magnetic fields, we have only electric repulsion between the wire and our charge. 
As we shift into the reference frame in which the charges do move, the wire and the charge are compressed, which increases charge density, and thus makes the electric repulsion increase in strength in such a way that, when the magnetic attraction is added, the net force is exactly the same as in the stationary case.<br /><br /><h3>Summary of Maxwell's equations</h3>This list summarises Maxwell's equations and their key implications when considered independently of each other.<br /><br /><ul><li><b>Gauss's law.</b></li><ul><li><b>Differential form: </b><br />$$$<br /> \nabla \cdot \boldsymbol{E} = \frac{\rho}{\epsilon_0}<br /> $$$ </li><li><b>Interpretation: </b>The tendency of a point in space to act as a source/sink of the electric field is (a constant times) the charge density at that point.</li><li><b>Integral form: </b> $$$<br /> \oint \oint_S \boldsymbol{E} \cdot \boldsymbol{dS}<br /> = \frac{1}{\epsilon_0} \iiint_V \rho dV<br /> $$$<br /><span style="font-size: x-small;">(NOTE: Technical problems prevent proper rendering of the closed surface integral symbol. Here, $$\oint \oint_S$$ refers to an integral over the closed surface $$S$$, which is usually denoted by a single circle drawn across both integral signs.)</span></li><li><b>Interpretation: </b>The electric flux out of a volume is proportional to the total charge contained within that volume.</li><li><b>Key consequence: </b>The strength of the electric field decays in proportion to the inverse square of distance from a point charge.</li></ul><li><b>Gauss's law for magnetism.</b></li><ul><li><b>Differential form: </b><br />$$$<br /> \nabla \cdot \boldsymbol{B} = 0<br /> $$$</li><li><b>Interpretation: </b>No point in space acts as a source or sink of magnetic flux.</li><li><b>Integral form: </b> $$$<br /> \oint \oint_S \boldsymbol{B} \cdot \boldsymbol{dS}<br /> = 0<br /> $$$<br /><span style="font-size: x-small;">(NOTE: Technical problems prevent proper rendering of the closed surface integral symbol. 
Here, $$\oint \oint_S$$ refers to an integral over the closed surface $$S$$, which is usually denoted by a single circle drawn across both integral signs.)</span></li><li><b>Interpretation: </b>Given any volume, any magnetic flux passing into it must be equalled by magnetic flux passing out.</li><li><b>Key consequence: </b>There are no magnetic charges / magnetic monopoles.</li></ul><li><b>Faraday's law.</b></li><ul><li><b>Differential form: </b><br />$$$<br /> \nabla \times \boldsymbol{E} = - \frac{\partial \boldsymbol{B}}{\partial t}<br /> $$$</li><li><b>Interpretation: </b>The curl of the electric field at a point in space is the negative rate of change of the magnetic field at that point.</li><li><b>Integral form: </b> $$$<br /> \oint \boldsymbol{E} \cdot \boldsymbol{dl}<br /> = - \frac{d}{dt} \iint_S \boldsymbol{B} \cdot \boldsymbol{dS}<br /> $$$</li><li><b>Interpretation: </b>The circulation of the electric field along a closed loop is the negative rate of change of magnetic flux through the surface enclosed by the loop.</li><li><b>Key consequence: </b>Changing magnetic fields lead to a non-conservative electric field that can do net work e.g. 
on charges moving in loops.</li></ul><li><b>Ampère's law with Maxwell's addition.</b></li><ul><li><b>Differential form: </b><br />$$$<br /> \nabla \times \boldsymbol{B}<br /> = \mu_0 \boldsymbol{j} + \mu_0 \epsilon_0 \frac{\partial \boldsymbol{E}}{\partial t}<br /> $$$</li><li><b>Interpretation: </b>The curl of the magnetic field at a point in space depends on the current density and the rate of change of the electric field at that point.</li><li><b>Integral form: </b> $$$<br /> \oint_L \boldsymbol{B} \cdot \boldsymbol{dl} =<br /> \mu_0 \iint_S \boldsymbol{j} \cdot \boldsymbol{dS}<br /> + \mu_0 \epsilon_0 \frac{d}{dt} \iint_S \boldsymbol{E} \cdot \boldsymbol{dS}<br /> $$$</li><li><b>Interpretation: </b>The circulation of the magnetic field along a closed loop is (a constant times) the current flux plus (another constant times) the rate of change of the electric flux through the enclosed surface.</li><li><b>Key consequence: </b>Conservation of charge.</li></ul></ul><h2>The shape of classical physics</h2>Given an arbitrary collection of moving objects with known masses, charges, and velocities, we can predict what happens (according to classical physics) like this:<br /><ul><li>For each point in space, calculate the gravitational, electric, and magnetic vector potential.</li><li>From these potentials, find the gravitational, electric, and magnetic field at each point by applying some sort of differential operator (the (negative) gradient in the gravitational case, an expression involving a gradient and the rate of change of the magnetic vector potential for the electric case, and the curl operator in the magnetic case.)</li><li>Calculate the total force $$\boldsymbol{F}$$ on every object by adding together the gravitational force ($$\boldsymbol{F_g} = m \boldsymbol{g}$$) and the electromagnetic force ($$\boldsymbol{F_e} = q (\boldsymbol{E} + \boldsymbol{v} \times \boldsymbol{B})$$).</li><li>Let the velocity of each object change at a rate of $$\boldsymbol{F} / m$$ (Newton's second law).</li></ul><br /><i>All diagrams 
created with Schematica, a diagram-drawing program that is currently under development. You can try out the experimental version <a href="https://lrudl.github.io/Schematica/">here.</a></i><br /><br /><b>Growth and civilisation</b> <span style="font-size: x-small;">(2019-09-27)</span><br /><div style="text-align: center;"><span style="font-size: x-small;"><i>3.0k words (≈ 12 minutes)</i></span> </div><br />It is often said that continuous exponential economic growth cannot be sustainable in the long run. This may well be so. But are our values sustainable without growth?<br /><br /><br /><b>The zero-sum world</b><br /><br />Game theorists distinguish between zero-sum games and non-zero-sum (positive-sum or negative-sum) games. In a zero-sum game, one player’s gain is another’s loss, and vice versa. The sum of the players’ gains is zero; it is impossible for the world at large to gain.<br /><br />A world without growth is a zero-sum game. If the resources available at time $$T_2$$ are the same as those available at time $$T_1$$, the only way to increase your share of those resources is to take them from someone else.<br /><br />For most of human history, the world was largely zero-sum. Before the industrial revolution, economic and technological progress were generally slow enough that major increases in resources (or human power more generally) did not happen over an individual’s lifespan.<br /><br />A well-managed estate or a hard-working farmer could, of course, beat the averages without hurting others. However, if you sought to become rich, creating value was a bad bet; you were far better off trying to become friends with the powerful. The powerful had only so many resources at their disposal, so this generally meant – directly or indirectly – worsening someone else’s access to riches. 
If you were a king seeking to make your nation great, you were probably better off seeking control over the resources of other nations (whether through royal marriage, warfare, or other means) than figuring out how best to create wealth within your nation. In a world of slow growth, the first strategy might net you France; the second strategy might mean that your descendants see agricultural efficiency improve by 10%.<br /><br />Land was essential in premodern societies. Populations generally grew to the maximum density that the land would support, so in the long run land also meant people. Land is an inherently zero-sum game – very little productive land was unoccupied (even historically) and you can’t make more, so gains in land for one party are always losses for another.<br /><br />Look at premodern societies through a modern lens, and the zero-sum thinking inherent in them is striking. If you were a member of the elite, you squeezed as much value out of the land and labour you controlled as you could; there was no reason to invest in the future, because productivity would not change much anyway. The ultimate institution in a zero-sum world is the military, because that is how you grab value from others and stop others from grabbing it from you. Hence military culture was venerated.<br /><br /><b><u><i>A note on the above historical claims</i></u></b><br /><i>All of these things are, of course, vast generalisations to which there are innumerable exceptions and which, in a more thorough piece, would require plenty of asterisks. Below I’ve gestured at data that supports the general gist of the points made above (feel free to skip this section):</i><br /><ul><i></i><li><i>The transition from a zero- to positive-sum world is indisputable. 
Consider for instance <a href="https://ourworldindata.org/economic-growth#from-poverty-to-prosperity-the-uk-over-the-long-run">English per capita GDP over the past 700-and-some years</a>: from 1270 to 1800, wealth per person rose about 3-fold, for an average growth rate of 0.2% per year, compared to an average 1.1% since then. Over a 70-year life starting in the year 1400, you’d observe average income dip a few percent; over the same life starting in 1900, you’d see it almost triple. Note that such charts don’t measure money; they measure wealth, including the value of home-grown food, etc. See <a href="https://ourworldindata.org/extreme-history-methods">this excellent write-up</a> for more on the methodology.</i></li><i></i><li><i>Importance of land: There is a very nice graph I once saw showing, for some roughly medieval historical period, almost no correlation between arability of land and per capita wealth but a strong correlation between arability and population density. I was unable to locate this graph, but be assured it exists (at least in my imagination). Nevertheless, I hope you will agree that 1) pre-industrial agrarian societies had a rather Malthusian relationship with land, thus 2) land was dreadfully important, and thus 3) there was a lot of non-value-creating politicking and fighting over land. The issue of land has not stopped being important (or divisive), but today lack thereof is no longer nearly as much of a cap on economic power.<br /><b>EDIT [2020]: I have found the graph! 
Behold:<br /> </b><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-1vW0l7p2S3c/YCfVQ1rCUdI/AAAAAAAACao/SEmYIDalf4MW7BNyyVIAwDRGUlv_5RjLACLcBGAsYHQ/s1808/Screenshot%2B2021-02-13%2Bat%2B15.32.58.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1072" data-original-width="1808" height="380" src="https://1.bp.blogspot.com/-1vW0l7p2S3c/YCfVQ1rCUdI/AAAAAAAACao/SEmYIDalf4MW7BNyyVIAwDRGUlv_5RjLACLcBGAsYHQ/w640-h380/Screenshot%2B2021-02-13%2Bat%2B15.32.58.png" width="640" /></a></div><b>The source, as usual, is the excellent website Our World in Data. Original <a href="https://ourworldindata.org/economic-growth#the-economy-before-economic-growth-the-malthusian-trap">here.</a></b><br /><br /><br /> </i></li><i></i><li><i>Military values: I was unable to find quantitative data on this, but the general pattern seems to be that the military played a more central role in pre-industrial societies than today, and that military values like bravery, martial prowess, discipline, and aggression have declined in importance since the industrial revolution.</i></li><i></i><li><i>Tendency towards exploitation: Historical data on GINI coefficients suggests that they were often about as high as they could get (in societies with average wealth close to the subsistence level, inequality is limited by the fact that you can’t take very much from people before they start starving to death, and when the poorest no longer exist, inequality goes down; the wealthier a society, the higher the rate of inequality that is “sustainable” in this sense). The Great Leveler by Walter Scheidel provides a good summary of this data. 
A summary of the summary might be the following fact: in 28 pre-industrial societies (including places like 1290s England, Byzantium in the year 1000, 1730s Holland, 1860s Chile), the <a href="https://books.google.fi/books?id=CD1hDwAAQBAJ&printsec=frontcover&dq=the+great+leveler&hl=en&sa=X&ved=0ahUKEwjn35eXyuzkAhVi1qYKHVlxDlkQ6AEIKzAA#v=onepage&q=%22often%20about%20as%20unequal%20as%20they%20could%20be%22&f=false">average extraction rate was 77% of the theoretical maximum</a> (for comparison, today’s OECD countries are roughly in the 20-40% range). I consider this strong evidence for a general tendency towards maximum extraction of resources by the elite in a zero-growth world. However, it’s clear that the causes of any shift are likely more complex than just the zero- to positive-sum transition (for instance, democracy makes ruthless exploitation of the masses harder, and knowledge work is less amenable to forceful extraction than agricultural work).</i></li><i></i><li><i>Corruption as the best get-rich-scheme in pre-industrial societies: In the same book (in fact, on the same page I linked above), Scheidel states that pre-industrial fortunes were usually extremely closely tied to political power, to an extent far greater than today. 
</i></li></ul><br /><b>Things change</b><br /><br />The industrial revolution was the first time in human history during which the world saw prolonged economic growth at a rate fast enough to be obvious over a single human life.<br /><br />If we step back and look at the grand sweep of human economic history, we see something like this:<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-UocjeSCgbtk/XY3behzHvTI/AAAAAAAABBA/bUxV8O3tbtwkRYH5gA6Fd7Ld24NTLhpXQCLcBGAsYHQ/s1600/Screenshot%2B2019-09-26%2Bat%2B20.41.04.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1056" data-original-width="1580" height="425" src="https://1.bp.blogspot.com/-UocjeSCgbtk/XY3behzHvTI/AAAAAAAABBA/bUxV8O3tbtwkRYH5gA6Fd7Ld24NTLhpXQCLcBGAsYHQ/s640/Screenshot%2B2019-09-26%2Bat%2B20.41.04.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure taken from <a href="https://ourworldindata.org/economic-growth#the-world-economy-over-the-last-two-millennia">this page</a> on the phenomenal website <a href="https://ourworldindata.org/">Our World in Data.</a></td></tr></tbody></table><br />Of course, there is much more to life than economics. However, the past few hundred years have also been ones of immense ethical change. Since the industrial revolution, we have gone from a world where war, slavery, racism, sexism, and religious intolerance were the norm and even celebrated to one where all of these things are rightly condemned.<br /><br />A large part of this is because prosperous people living comfortable lives tend to care a lot more about others than poor people in bad conditions. Thus, even if growth were to suddenly stop, a large part of the moral gains we have made would likely remain. 
It is also true that the effect is not one way – in fact, <a href="https://advances.sciencemag.org/content/4/7/eaar8680">one study</a> found that secularisation often preceded economic growth.<br /><br />However, there is a case to be made that, regardless of the level of prosperity, whether wealth is increasing or not is an important factor for what sort of attitudes prevail in the long run.<br /><br />Intuitively, this makes sense. It’s much easier to be altruistic and tolerant when the ceiling of human capacity keeps rising. Economic troubles are among the first explanations cited by political pundits as a cause of the recent rise in intolerant populism. Whether the world is stagnant or growing also has an effect on what sort of strategies make sense.<br /><br />We can capture this intuition with a thought experiment.<br /><br /><br /><b>Blue vs red strategies</b><br /><br />A shift from positive- to zero-sum games is also a shift in what sort of strategies are successful, and hence what sort of strategies will govern society in the long run.<br /><br />Consider two different starting scenarios with the same players, one in an (almost) zero-sum world and the other in a strongly positive-sum world. 
Imagine, in each, three different factions, each following a specific strategy:<br /><ul><li>Blue invests in future growth to create value.</li><li>Red tries to capture value from others.</li><li>Green sits around being captured by Red.</li></ul><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-uzotm7Oj5lA/XY3cCdqJ33I/AAAAAAAABBI/6tPF-qxXq5YNuNrwpf1eLJFkbe2-q_TiACLcBGAsYHQ/s1600/Screenshot%2B2019-09-26%2Bat%2B20.43.34.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="612" data-original-width="1600" height="244" src="https://1.bp.blogspot.com/-uzotm7Oj5lA/XY3cCdqJ33I/AAAAAAAABBI/6tPF-qxXq5YNuNrwpf1eLJFkbe2-q_TiACLcBGAsYHQ/s640/Screenshot%2B2019-09-26%2Bat%2B20.43.34.png" width="640" /></a></div><br /><br />In a positive-sum world like our current one, the future might unfold something like the graph on the right side in the image above. Red captures a bit of Green, but Blue makes enormous gains.<br /><br />In a zero-sum world, like our past, or a hypothetical no-growth future, the future might unfold more like in the graph on the left. Blue succeeds in creating some value, but its gains are dwarfed by Red’s gains from conquering Green.<br /><br />The key point is this: <u>in the long run and in a positive-sum world, the Blue strategy will dominate, and Blue players – individuals, companies, institutions, governments, whatever – are the ones who dictate what the future looks like. In the long run and in a zero-sum world, the Red strategy will dominate, and Red players will have the most say in what the future looks like.</u><br /><br />Thus, when the industrial revolution made the world economy shift from a zero- to a positive-sum game, a shift from Red to Blue strategies inevitably followed. 
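The thought experiment above can be made concrete with a toy simulation. The growth and capture rates below are arbitrary illustrative numbers, not anything derived from data:

```python
# Toy model of the Blue/Red/Green thought experiment: Blue grows its
# value at rate g per step, Red captures a fixed fraction of Green's
# value each step, and Green does nothing.

def simulate(g, capture=0.05, steps=50):
    blue, red, green = 1.0, 1.0, 1.0
    for _ in range(steps):
        blue *= 1 + g              # Blue creates new value
        taken = capture * green    # Red captures value from Green
        red += taken
        green -= taken
    return blue, red, green

# (Almost) zero-sum world: Blue barely grows, so Red ends up ahead.
print(simulate(g=0.001))
# Positive-sum world: Blue's compounding growth dwarfs Red's capture.
print(simulate(g=0.05))
```

With near-zero growth, Red finishes with the largest share; with 5% growth per step, Blue's compounding wins by a wide margin, mirroring the two graphs above.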
The fact that society was wired for a zero-sum world slowed the spread of Blue strategies, but in the long run existing zero-sum values and customs were often swept aside by the greater success of the Blue strategy at capturing future value. Given a sufficiently long time scale, it is hard to resist this kind of harsh evolutionary logic.<br /><br />In medieval Europe, there certainly were people who believed in peaceful cooperation and investing in the future. Unfortunately, in that time and place, this was not the strategy that maximised its adherents’ share of future power, and so these people were largely trampled underfoot by those who followed a Red strategy of capturing value from others.<br /><br />To take another example: today, war is no longer the best way to make your nation greater. This doesn’t just mean that peaceful, tolerant, growth- and future-investing nations are the winners – it also means that, because they are the winners, they get a lot of say in how the world works. After all, it is human nature to spread your values to others. No surprise, then, when the post-industrial world order gradually shifts from one where war is simply politics by other means, to one where it is rare and condemned. Things like treaties, international organisations, and cross-border trade now dominate international politics. Ease-of-doing-business indices matter more than troop numbers.<br /><br />Not everyone got the memo; some of those who didn’t even ended up in charge of big nations and started a few world wars, before being crushed by the Allies’ economic superiority. Being defeated in war forced Japan and Germany to become even more peaceful and growth-oriented than the rest, and now they’re among the richest countries in the world. Nowadays no serious up-and-coming nation even considers going on the warpath. 
Instead they compete to hit double-digit GDP growth, usually by first trying to build products for everyone else and then worrying a lot about things like investing in education to maximise the human potential of their citizens.<br /><br />The transition is far from absolute. Win-win cooperation and future investment were never entirely absent, just as zero-sum fights are still very much part of our world. However, I’d argue that a shift in which type of interaction tends to have more power over the long run has happened.<br /><br /><br /><b>Zero-sum thinking - a mistake?</b><br /><br />Many foolish mistakes we now scorn are only mistakes because we live in a positive-sum world. For example, Donald Trump thinks in zero-sum terms: China gains a lot from trade, therefore that trade must be hurting someone, and most likely that someone is the United States, China’s largest trade partner; immigrants are moving into the country, they consume resources and take jobs when they live there, and therefore they must be a net drain on Americans; and so on. The critical mistake in all such lines of reasoning is that they ignore the fact that trade and immigration are often positive-sum situations. Trump’s suspicion of win-win cooperation would be a perfectly reasonable attitude in a negative- or zero-sum world.<br /><br /><a href="https://en.wikipedia.org/wiki/Zero-sum_thinking">A tendency for zero-sum thinking</a> seems partly innate to humans. This is because a strongly positive-sum world has existed for less than two centuries, and is not the one our brains evolved to deal with. Many of the worst tendencies that zero-sum thinking brings with it are kept at bay only because (for the time being) growth is now a regular part of our world.<br /><br />If the world turns back into a zero-sum world (or society turns zero-sum for a large enough section of the population), the danger isn’t just that zero-sum thinkers will be the winners. 
The danger is that they’ll also be right.<br /><br /><br /><b>Sustainability vs values?</b><br /><br />The idea that there is a serious contradiction between the ever-accelerating growth of human civilisation and the finite resources of our planet has become mainstream.<br /><br />This view is broadly correct. A civilisation powered by fossil fuels cannot even maintain our current prosperity level without causing serious environmental issues (the finiteness of fossil fuels might eventually be a problem, but only long after the impacts on the climate have become catastrophic). It is also true that being naively optimistic about technological solutions is not wise.<br /><br />Thus the early-21st-century dream for the future might look something like a prosperous sustainable planetary civilisation that has outgrown its hubristic drive towards ever greater capabilities, inhabited by people who coexist peacefully and hold on to altruistic liberal values.<br /><br />However, like most dreams, something is off about this vision. We should not expect a stagnant, zero-sum world to be one where openness, altruism, and a future-oriented outlook are winning strategies.<br /><br />This is not to say that a zero-sum world would revert back to medieval levels of warfare and violence. However, in the long run value-capturing players will gain at the expense of others. If history is any guide, a world where it is difficult to create value will tend towards one where connections and loyalty are everything, and those without are increasingly exploited. Most likely this would manifest more as politicking than outright bloodshed: a steadily rising tide of influence struggles, political dynasties, and moralising about who deserves what.<br /><br />But even if we want to ensure that growth continues, what can we do about it? 
Environmental limits are very real, and a stagnant future is better than no future at all.<br /><br />The only solution is to think bigger.<br /><br />The physical limits are a lot further out than they may seem. Humanity’s energy consumption is about $$2 \times 10^{13}$$ watts (20 trillion joules per second). Harvesting 1% of the solar radiation that falls on Earth would net us on the order of $$10^{15}$$ watts (a thousand trillion joules per second). Relying only on this small sliver of solar energy, we can keep up a growth in energy consumption of 2% per year for the next 200 years, roughly as long as humanity has been making significant use of fossil fuels. After we reach this limit, we will have captured an infinitesimal slice of the energy output of one star in a galaxy of hundreds of billions.<br /><br />(Ultimately, however, exponential growth is impossible. Physics sets an upper limit on the maximum density of <a href="https://en.wikipedia.org/wiki/Bremermann%27s_limit">computation</a>, and presumably we need computation to create value – most fundamentally, you can't experience anything without computation going on somewhere (e.g. a brain). The finite speed of light means that the volume of space we can influence from the present grows in proportion to the cube of elapsed time. In the extremely long run, we are limited to cubic growth, which is polynomial, not exponential.)<br /><br />There’s no guarantee that we will ever have the technology (or the will) to harness such power. However, it’s important to understand that the problems standing in the way are not fundamental physical limits. We do not lack energy – we lack the organisation, will, and ingenuity needed to harness the right energy sources. 
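The arithmetic behind these figures is easy to check; here is a quick back-of-the-envelope script (my own illustration, using the wattage and growth numbers quoted above):

```python
# How many years can energy use grow at 2%/year before hitting
# ~1% of the solar radiation falling on Earth? (Figures from the text.)
current_use_w = 2e13    # humanity's current energy consumption, ~2 * 10^13 W
solar_budget_w = 1e15   # ~1% of incident solar radiation, ~10^15 W
growth_rate = 0.02      # 2% growth per year

use, years = current_use_w, 0
while use < solar_budget_w:
    use *= 1 + growth_rate
    years += 1

print(years)  # 198 -- roughly the 200 years claimed above
```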
Given enough of these elements, the capacities of future humans may be as far removed from us as ours are from hunter-gatherers.<br /><br />In the shorter run, the most critical task is transitioning to a sustainable civilisation, because what is not sustainable must eventually end, and certainly cannot grow without limit.<br /><br />I think we should also make a greater effort to recognise and promote the non-zero-sumness of our world. Some problems genuinely are zero-sum, but many only seem that way because of our cognitive biases.<br /><br />We must also make sure that the right variables are positive-sum. It is of little use if GDP keeps growing, but the benefits accrue only to a small number or are outweighed by non-economic costs. Growth in indicators like <a href="https://en.wikipedia.org/wiki/Green_gross_domestic_product">Green GDP</a> or the <a href="https://en.wikipedia.org/wiki/Genuine_progress_indicator">Genuine Progress Indicator</a> is likely a far better measure of the type of positive-sumness discussed here than raw GDP growth figures.<br /><br />Finally, I want to draw attention to a simplification made in this discussion. I’ve written about zero- or positive-sumness as if they were immutable properties of the world that have a one-way causal effect on what happens. In reality there’s no magical ceiling on growth that constrains human activity. Human wealth increases when people go out and make things – life-saving medicines, time-saving devices, whatever.<br /><br />Of course, different societies in different times can be more or less hospitable to growth. A peasant in medieval Europe would have a hard time making a significant contribution to human capacities. 
The industrial revolution relied on a critical mass of scientific understanding and Enlightenment values to get going.<br /><br />Today, we have this immense legacy to thank for our ability to (on average) raise living standards by a few percent each year and keep the self-improving loops of both technology and values going.<br /><br />The best future is not a stagnant one, but a growing one: a world where human capabilities stretch a bit further every year, and where the winners are those who create value rather than those who take it from others.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-3210472956435580012019-09-08T11:09:00.005+01:002021-03-27T22:50:01.085+00:00Review: Structure and Interpretation of Computer Programs<h1></h1><div style="text-align: center;"> <span style="font-size: x-small;">Book: <i>Structure and Interpretation of Computer Programs</i>,</span><br /><span style="font-size: x-small;">by Harold Abelson, Gerald Jay Sussman, and Julie Sussman (1996, 2nd ed.)</span></div><div style="text-align: center;"><span style="font-size: x-small;">2.7k words (≈10 minutes) </span></div><p> </p><p>Many regard <i>Structure and Interpretation of Computer Programs </i>(SICP) as the bible of programming. For good reason, as it turns out.<br /><br /><br /><b>Beware the wizards</b><br /><b><br /></b><br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/9/9d/SICP_cover.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="579" data-original-width="400" height="400" src="https://upload.wikimedia.org/wikipedia/commons/9/9d/SICP_cover.jpg" width="276" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The Wizard Book. 
(Credit: MIT Press)</td></tr></tbody></table><b><br /></b>SICP is sometimes called the “Wizard Book”, because there’s a wizard on the cover (if your job is making an interesting cover for a programming book, what would <i>you</i> do?). However, this does not mean that the book has anything to do with –<br /><blockquote><i>“[L]earning to program is considerably less dangerous than learning sorcery, because the spirits we deal with are conveniently contained in a secure way.”</i></blockquote>Um. Okay, I rest my case. Proceed with caution.<br /><br /><br /><b>Contrarian SICP</b><br /><b><br /></b>For most subjects there is a standard way to present it that most books, lectures, etc. will follow.<br /><br />For programming, the standard way seems to be to take some “mainstream” language, show how to print “Hello, World!” onto the screen, then start introducing things like assigning values to variables, conditionals, and so on. Pretty soon you can be doing some pretty impressive things.<br /><br />SICP does not follow this route.<br /><br /><br /><b>Why Lisp?</b><br /><b><br /></b>The first thing that might strike you about SICP is that the programming language of choice is Scheme, a dialect of Lisp (short for “LISt Processor”), which is commonly known as that obscure language invented in 1958 that wears down the parentheses keys on your keyboard.<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://imgs.xkcd.com/comics/lisp_cycles.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="211" data-original-width="640" height="130" src="https://imgs.xkcd.com/comics/lisp_cycles.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Comic by Randall Munroe of <a href="https://xkcd.com/">xkcd</a>. 
This comic can be found <a href="https://xkcd.com/297/">here.</a> </td></tr></tbody></table><p>However, the authors are not just being contrarian here; there are many good arguments for using Lisp in a book like this.<br /><br />First, Lisp is the closest a programming language can get to having no syntax. You don’t have to learn where curly brackets are used, or which operators/functions follow which type of syntax, or a multitude of special characters that perform arcane pointer logic (I’m looking at you, C++). </p><p>If you have an expression in parentheses, the first thing inside the parentheses is the name of the function that is being called. Everything after it is an argument to be passed to that function. Something not in parentheses either represents just itself (e.g. a string, number, or boolean) or is the name of a variable that in turn represents something.<br /><br />For example: <code>(+ 1 (* 2 3) var)</code> evaluates to the sum of the number 1, the product of 2 and 3, and whichever number the variable <code>var</code> has been set to.<br /><br />Now you know approximately 90% of Lisp syntax (there are also a few other things, like a special syntax that stands in for an unnamed function, and some shortcuts for things you’d otherwise have to type out repeatedly).<br /><br />If you follow along with SICP, Lisp is self-explanatory.<br /><br />The second point in favour of Lisp follows immediately from the first: the near-absence of syntax means you don’t have to think about it. Once you get used to it, writing in Lisp feels almost like transcribing pure thought into code.<br /><br />When a language implements various special syntaxes, it generally privileges certain design patterns and ways of thinking; if for-loops are unavoidable, the programmer will think in for-loops. A near-absence of syntax means neutrality. Some might call it blandness; fair enough, but Lisp’s blandness is very powerful when used right. 
It makes it a very useful language for a book like SICP, which tries to teach you (for example) many different ways of abstracting data, rather than the one that is made most convenient by a language’s syntax.<br /><br />The third point in favour of Lisp is that what little syntax it has was chosen carefully, namely in such a way that Lisp code is also Lisp data. The example function call <code>(+ 1 (* 2 3) var)</code> given above is just a list of the elements <code>+</code>, <code>1</code>, the list of the elements <code>*</code>, <code>2</code>, and <code>3</code>, and <code>var</code>. This means that it’s very easy to write Lisp code that operates on Lisp code, something that comes in handy when SICP walks through the operation of a Lisp interpreter (in more practical situations, it also enables Lisp’s powerful macro system). To put it another way, introspection is easier in Lisp than in other languages.<br /><br />Finally, as the (perhaps biased) authors write: “Above and beyond these considerations, programming in Lisp is great fun.”<br /><br /><br /><b>Executable math</b><br /><b><br /></b>Once you’ve gotten over all the parentheses, the second thing you’ll notice about SICP is the order in which topics are presented.<br /><br />The first chapter is entirely devoted to creating abstractions by defining functions. Only function (and variable) definition and function calling are used – no mention is made of data structures or changing the values of variables.<br /><br />If you think it’s impossible to do anything interesting by just calling functions, you are wrong, and SICP will prove it.<br /><br />The chapter runs through the very basics of function application, variable definitions, and the substitution model of how to apply functions (this last point will later be amended). It discusses iterative and recursive processes, and how iterative processes can be described by recursive functions.<br /><br />A lot of the things you can do by just calling functions are quite math-y. 
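To give a flavour of this functions-only style in more widely familiar notation, here is a rough Python rendering of the sort of thing the chapter builds up to (my own sketch, not code from the book): a fixed-point finder built from nothing but function definition and application, reused via a higher-order function to compute square roots.

```python
# Chapter-1-flavoured "executable math": no mutation, no data
# structures -- only defining and applying functions.

def fixed_point(f, guess, tolerance=1e-9):
    """Iterate f from guess until successive values are close."""
    nxt = f(guess)
    if abs(nxt - guess) < tolerance:
        return nxt
    return fixed_point(f, nxt, tolerance)

def average_damp(f):
    """Higher-order function: average f's output with its input."""
    return lambda y: (y + f(y)) / 2

def sqrt(x):
    """sqrt(x) is a fixed point of y -> x/y, damped so iteration converges."""
    return fixed_point(average_damp(lambda y: x / y), 1.0)

print(sqrt(9))  # ~3.0
```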
SICP does not shy away from this: Newton’s method for square roots, numerical integration, and finding fixed points of (mathematical) functions are prominent examples. No prior knowledge about the math is assumed, but this may still put off many readers because it’s abstract and not directly relevant to most real-world problems. “Executable math” is a pretty good summary of what most of this chapter is about.<br /><br />However, the chapter really is striking. Using just one type of abstraction (defining functions) and not too many pages, SICP scales from the very basics to solving fairly involved problems with techniques, like extensive use of higher-order functions, that would be left for much later in a more conventional work.<br /><b><br /></b><b><br /></b><b>Finally: data!</b><br /><br />Only in the second chapter does SICP turn to data structures. Once again the format is the same: introduce exactly one type of abstraction, and systematically introduce examples of how it’s useful and what can be done with it.<br /><br />The basic Lisp data structure is a cell that links together two values. The primitive function for creating one is <code>cons</code>. If we want to chain together many values, for instance to create a list of the elements 1, 2, and 3, we can do this with <code>(cons 1 (cons 2 (cons 3 null)))</code> (of course, there’s also a function – <code>list</code> – that creates lists like this automatically).<br /><br />Additionally, Lisp provides primitive functions for accessing the first and second element in a <code>cons</code> cell. For historical reasons, these functions are called <code>car</code> (returns the first element) and <code>cdr</code> (returns the second element). This means that the <code>cdr</code> of a list defined in the same way as above would be all but the first element of the list.<br /><br />But what is data? Or do we even care? 
After all, all that interests us about <code>cons</code>, <code>car</code>, and <code>cdr</code> is that if we define, say, <code>x</code> as <code>(cons 1 2)</code>, then <code>(car x)</code> should be 1 and <code>(cdr x)</code> should be 2.<br /><br />One clever way of implementing this – and one that will likely seem both weird and ingenious the first time you see it – is the following:<br /><br /></p><pre><code>(define (cons x1 x2)   ; define cons as a function on two inputs<br />  (define (dispatch n) ; define a function inside cons<br />    (if (= n 1)<br />        x1             ; return x1 if n = 1<br />        x2))           ; else, return x2<br />  dispatch)            ; the cons function returns the dispatch function<br /><br />(define (car x)<br />  (x 1))<br /><br />(define (cdr x)<br />  (x 2))<br /></code></pre><br />What’s happening is this: <code>cons</code> returns the function <code>dispatch</code>. Let’s say <code>x</code> is a <code>cons</code> cell that we have made with <code>cons</code>, consisting of the elements <code>x1</code> and <code>x2</code>.<br /><br />Now we’ve defined the <code>car</code> of <code>x</code> to be whatever you get when you call the function <code>x</code> with 1 as the argument. <code>x</code> is what the <code>cons</code> function returned, in other words the <code>dispatch</code> function, and when we call that with 1 as the argument, it will return <code>x1</code>. Likewise, when we call <code>x</code> with the argument 2, the <code>dispatch</code> function that <code>x</code> represents will return <code>x2</code>. We have satisfied all the properties that we wanted <code>cons</code>, <code>car</code>, and <code>cdr</code> to have.<br /><br />Is this how any reasonable Lisp implementation actually works? No.<br /><br />(If you’re confused about the previous example: note that we’ve snuck in an assumption about how variable scoping works in functions. 
When the <code>dispatch</code> function is created inside the <code>cons</code> function, the variables <code>x1</code> and <code>x2</code> are obviously bound to whatever values we inputted into <code>cons</code>. What’s not obvious is that <code>dispatch</code> can access these values when it’s called later – after all, <code>x1</code> and <code>x2</code> were local variables for a function call that will have ended by then (and the <code>cons</code> function might have been called many times, meaning many <code>x1</code>s and <code>x2</code>s). However, in Lisp the environment in which a function is created is bound to that function. When that function is later called, any local variables like <code>x1</code> and <code>x2</code> present in the parent function in which it was defined will remain accessible. This type of thing is called a closure.)<br /><br />Mutability (changing variable values after they’ve been defined) is only introduced in the third chapter; up until then, the book focuses purely on functional programming.<br /><br />The third chapter is the culmination of the first half of the book: now that functions, data abstraction, and mutability have all been discussed, the authors introduce many examples of the structures that are now possible.<br /><b><br /></b><b><br /></b><b>The <i>what</i> evaluator?</b><br /><br />SICP walks the reader through the process of writing a Lisp evaluator in Lisp, something that is called a “metacircular evaluator”. <br /><br />Writing a Lisp evaluator in Lisp might seem pointless, but remember that a programming language, especially one like Lisp, is just as much a language for setting down our thoughts about procedures as it is something to be executed by computers. A Lisp-to-Lisp interpreter has the advantage that it is one of the simplest interpreters that it is possible to write. 
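To make the "simplest possible interpreter" point concrete, the core of an evaluator for the tiny Lisp subset sketched earlier fits in a few lines. Here is an illustrative toy of my own in Python (SICP's metacircular evaluator is in Scheme and handles far more), with expressions written as nested lists to mirror Lisp's code-is-data property:

```python
# Toy evaluator: an expression is a number (evaluates to itself),
# a string (a variable name to look up), or a list whose first
# element evaluates to the function applied to the rest.
import operator
from functools import reduce

GLOBAL_ENV = {
    "+": lambda *args: sum(args),
    "*": lambda *args: reduce(operator.mul, args, 1),
}

def evaluate(expr, env):
    if isinstance(expr, (int, float)):  # numbers are just themselves
        return expr
    if isinstance(expr, str):           # names are looked up in the environment
        return env[expr]
    # otherwise: evaluate operator and operands, then apply
    fn, *args = [evaluate(sub, env) for sub in expr]
    return fn(*args)

# (+ 1 (* 2 3) var) with var bound to 4:
print(evaluate(["+", 1, ["*", 2, 3], "var"], {**GLOBAL_ENV, "var": 4}))  # 11
```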
Interpreters for Lisp benefit greatly from the simplicity of Lisp’s syntax, while interpreters written in Lisp benefit from the expressiveness and flexibility of the language. Thus, with our Lisp-to-Lisp interpreter, the essence of an evaluator is laid about as bare before our eyes as it can be.<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-Hyjsm-mYCwk/XXTQL4Jf2HI/AAAAAAAAAco/zsxFFyMCP9Qaohat5E8IhBj1f-MaafvbQCLcBGAs/s1600/Screenshot%2B2019-09-05%2Bat%2B12.09.35.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="937" data-original-width="1600" height="374" src="https://1.bp.blogspot.com/-Hyjsm-mYCwk/XXTQL4Jf2HI/AAAAAAAAAco/zsxFFyMCP9Qaohat5E8IhBj1f-MaafvbQCLcBGAs/s640/Screenshot%2B2019-09-05%2Bat%2B12.09.35.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">If you're willing to forget about code readability and leave out some syntactic sugar like <code>cond</code> expressions, you can literally behold the metacircular evaluator at a glance. (Note the <code>#lang sicp</code> line – <a href="https://racket-lang.org/">DrRacket</a> has a package that implements the exact version of Scheme used in SICP.) </td></tr></tbody></table><br />The authors write:<br /><blockquote><i>“It is no exaggeration to regard this as the most fundamental idea in programming:</i> </blockquote><blockquote><i> </i><br /><blockquote class="tr_bq"><i>‘The evaluator, which determines the meaning of expressions in a programming language, is just another program.’</i> </blockquote></blockquote><blockquote><i> </i><br /><i>To appreciate this point is to change our images of ourselves as programmers. 
We come to see ourselves as designers of languages, rather than only users of languages designed by others.” </i></blockquote>After presenting the metacircular evaluator (and an optimisation), the authors go on to discuss three “variations on a <i>Scheme</i>” (haha …):<br /><ol><li>Making the evaluator lazier. More precisely, delaying the evaluation of an expression until it is needed (“lazy evaluation”). This allows, for example, the convenient representation of infinite lists (“streams”), and more flexibility in creating new conditionals.</li><li>Non-deterministic computing, in which the language has built-in capabilities to handle statements like “pick one of these three items”, or “search through these options until some permutation matches this condition”. With such a language, some logic puzzles can be solved by simply stating the requirements and pressing enter.</li><li>A logic programming language, which can process queries about data.</li></ol>Programming often involves wanting to do something, and then taking that task and “building down” to the demands of whatever programming language is used. A powerful alternative method is to also build up the language to customise it for the needs of the task at hand. The boundary between language and program blurs.<br /><br />It’s almost as if … <br /><blockquote><i>“The evaluator, which determines the meaning of expressions in a programming language, is just another program.”</i></blockquote><b><br /></b><b>What do we say to compilers? 
Not today</b><br /><br />There’s a fifth chapter to SICP, in which a register machine simulator is constructed, and then used to implement – surprise surprise – a Lisp compiler.<br /><br />In a way, this completes the loop: the first three chapters show what kinds of things various programming abstractions allow, the fourth shows how these abstractions can be used to implement themselves, and the fifth looks “under the hood” of Lisp itself to consider how it can be implemented with elements simpler than itself. Of course, the question of how the simpler register machine itself can be implemented is left unanswered, but this is already starting to bring us into the realm of hardware, for which <a href="http://strataoftheworld.blogspot.com/2019/08/review-from-nand-to-tetris.html">another book</a> might be better suited.<br /><br />For the first four chapters I did perhaps half of the exercises; for the last, I just read the main text. The chapter feels more theoretical than the previous ones. Even though the Lisp-to-Lisp evaluator of the fourth chapter is purely academic, I found it more interesting (and also more practical, since I recently wrote an interpreter for a project) than the construction of a compiler from simulated versions of very restrictive components. Hopefully I will return to the chapter at a later point, but for now a more thorough reading will have to wait.<br /><br /><br /><b>First Principles of Computer Programming</b><br /><b><br /></b>SICP is a rather unconventional programming book. 
I think this is largely because the authors seem to have started from first principles and asked “what should a good book on deep principles in high-level programming languages look like?”, rather than making all the safest choices.<br /><br />Therefore, Lisp.<br /><br />Therefore, presenting one element at a time (functions, data abstraction, mutability) with care and depth, rather than the (admittedly faster and more practical) approach of introducing all the simplest things first.<br /><br />Therefore, spending a lot of time hammering in the point that what evaluates/compiles your program is just another program.<br /><br />SICP is not about showing you the fastest route to making an app. Unless you’re of a theoretical bent, it might not even be a particularly good introduction to programming in general (on the other hand, on several occasions I was slowed down by prior misconceptions; those with a fresher perspective may avoid some difficulties).<br /><br />However, it excels as a deep dive into the principles of programming. Especially if you have experience with programming but haven't yet read a systematic treatment of the topic, SICP will be invaluable in straightening out and unifying many concepts.<br /><br /><br /><b>Links & resources</b><br /><ul><li>SICP is available for free online in a variety of formats: <ul><li><a href="https://mitpress.mit.edu/sites/default/files/sicp/index.html">MIT’s official web version and other SICP stuff</a></li><li>PDF and EPUB conversions are available <a href="https://github.com/sarabander/sicp">here</a></li></ul></li><li>The SICP lectures are <a href="https://www.youtube.com/watch?v=-J_xL4IGhJA&list=PLE18841CABEA24090">on YouTube</a>.</li></ul>I’m not aware of an official SICP solution set, but you will find many on the internet. <a href="http://community.schemewiki.org/?sicp-solutions">This one</a> seems to be the most complete, often featuring many solutions to a given exercise. 
<br /><br /><br /><b>How to Design Programs: an alternative book</b><br /><b><br /></b>A similar, first-principles-driven, Lisp-based book on programming called <i>How to Design Programs</i> (HTDP) also exists (I have not read it). This book was consciously designed to emulate what is good in SICP while fixing what is bad, particularly in the context of being used as an introduction to programming (the authors of HTDP have written an article called <i><a href="https://www2.ccs.neu.edu/racket/pubs/jfp2004-fffk.pdf">The Structure and Interpretation of the Computer Science Curriculum</a></i> in which they summarise their case).<br /><br />Incredibly, <a href="https://htdp.org/2003-09-26/">HTDP is also available for free online</a>. Either MIT Press has been overrun by communists, or the people who write good programming books are far more charitable than the average textbook writer.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-58424044141383380412019-08-20T12:36:00.001+01:002019-09-27T11:19:10.491+01:00Review: The Pleasures of Counting<div style="text-align: center;"> <span style="font-size: x-small;">Book: <i>The Pleasures of Counting</i>, by Thomas William Körner (1996)</span></div><div style="text-align: center;"><span style="font-size: x-small;">1.8k words (≈6 minutes) </span> </div><br /><br />On his <a href="https://www.dpmms.cam.ac.uk/~twk/">website,</a> T. W. Körner introduces <i>The Pleasures of Counting</i> as follows:<br /><blockquote><i>Longer than “With Rod and Line Through the Gobi Desert”, funnier than “The Wit and Wisdom of the German General Staff” and with <a href="https://en.wikipedia.org/wiki/A_Brief_History_of_Time#Publication">more formulae</a> than “A Brief History of Time” [The Pleasures of Counting] was voted Book of the Year by a panel consisting of Mrs E. Körner, Mrs W. Körner, Miss K. Körner and Dr A. Altman (née Körner).</i></blockquote><i>The Pleasures of Counting</i> is hard to categorise. 
On one hand, its flow and lucidity match the best works of general non-fiction, even though the book features more tangents than an intro to derivatives course. On the other hand, in contrast to books that merely tell about math, Körner has the gall to make the reader do exercises.<br /><br />The result is 500 pages of insights, proofs, exercises, and real-world applications about topics ranging from cholera to submarine warfare to weather prediction, all delivered in Körner’s personable style and with a generous heaping of witty anecdotes and the occasional bit of verse. And it is glorious.<br /><br /><br /><b>Warning: may contain math – but don’t worry</b><br /><br />A first question about any book involving math is how much you need to know beforehand for it to be comprehensible.<br /><br />Most mathematical arguments in <i>The Pleasures of Counting</i> can be followed with straightforward algebra. This does not mean the results themselves are straightforward – some, for instance the derivation of the Lorentz transformation or an outline of Shannon’s theorem, require careful thought and are easy to get lost in. Some exercises also either require or benefit greatly from prior exposure to calculus. However, in general <i>The Pleasures of Counting</i> manages to be both accessible and fairly deep. A casual reader can skip exercises and tricky arguments while still getting the gist, while other readers will find much to dig into in the more intricate proofs and exercises. 
All notation used is explained in an appendix.<br /><br />Most people can gain something from this book, and given the breadth of the material, I expect very few will encounter nothing new.<br /><br /><br /><b>The pleasures of everything under the sun</b><br /><br />Körner discusses many common examples of mathematical reasoning and results, including special relativity, Galileo’s arguments about motion, Enigma machines, Turing’s work, fractals, sorting algorithms, and the effects of scaling on biology, though always with his own spin on each topic.<br /><br />I particularly enjoyed Körner’s discussion of dimensional analysis in physics – a fancy way of saying that you figure out what variables some quantity should depend on, fiddle with them until you get an equation where the units (mass, length, time) check out, and then go design bridges with it.<br /><br />This is an example of the “dangerous but fascinating pastime” of what Körner calls science “in a darkened room” – trying to derive scientific facts from pure thought alone. Science requires both reason and observation; relying on one alone is like trying to walk with one leg. That’s not to say it’s impossible to go places by hopping with one leg: Körner relishes showing how you can start from small, abstract assumptions and hop over to interesting conclusions, such as why helicopters have long blades or <a href="https://en.wikipedia.org/wiki/Lorentz_transformation">how spacetime works.</a><br /> <br />The most unique and refreshing parts of <i>The Pleasures of Counting</i> are Körner’s presentation of the works of several somewhat less well-known scientific figures, such as <a href="https://en.wikipedia.org/wiki/G._I._Taylor">G. I. Taylor</a>, <a href="https://en.wikipedia.org/wiki/Lewis_Fry_Richardson">Lewis Fry Richardson</a>, and <a href="https://en.wikipedia.org/wiki/Patrick_Blackett">Patrick Blackett</a>. 
I have a feeling Körner’s pick of figures to examine is not random – all three are British mathematicians, physicists, or mathematical physicists who lived from the late 1800s to the mid/late-1900s, worked on war-related issues (Blackett was a major advisor on military strategy and operational research in World War II, Taylor participated in the Manhattan Project, and Richardson was an ardent pacifist who was a conscientious objector during World War I and later attempted a mathematical analysis of the causes of war), and studied/taught at Cambridge, like Körner. The timelines make it possible that Taylor and Blackett could have worked in Cambridge at the same time as Körner studied there, though I cannot recall Körner mentioning any personal knowledge of them in <i>The Pleasures of Counting</i>.<br /><br /><br /><b>The pleasures of digression</b><br /><br />Körner does not restrict himself to purely mathematical matters. At one point a Socratic dialogue on the axioms of number theory segues into a discussion on the purpose of university:<br /><blockquote><i>TEACHER: […] When Mill wrote On The Subjection of Women </i>[alright, the dialogue may have been going on a slight tangent even before the university stuff]<i>, he was consciously following Plato in this, and, still more importantly, in his view that everything is open to question and that positive good may come from rational discussion.</i> </blockquote><blockquote><i>STUART [a student]: And that is what university is all about.</i> </blockquote><blockquote><i>TEACHER: Not really.</i> </blockquote><blockquote><i>STUART: But that is what university ought to be all about.</i> </blockquote><blockquote><i>TEACHER: So you think the taxpayer is parting with large sums of money so that young ladies and gentlemen can sit around discussing life, the universe and everything. 
You are here to learn mathematics and more mathematics – not to row, play bridge, act or even to find yourselves – and that is what I am going to teach you.</i> </blockquote><blockquote><i>STUART: But, even if that is what the taxpayers want, is it what they ought to get? A university which just trains technicians is not a university; it is a technical college.</i> </blockquote><blockquote><i>TEACHER: Better a good technical college than a corrupt university. What ought you to learn at university besides mathematics?</i> </blockquote><blockquote><i>STUART: Students learn to question received opinions.</i> </blockquote><blockquote><i>TEACHER: So, after I have made you write out 100 times: ‘I must not accept authority’, what do we do next?</i> </blockquote><blockquote><i>ELEANOR [another student]: That’s simple. You make us write out: ‘I really, really must not accept authority.’</i> </blockquote><blockquote><i>TEACHER: Besides which, asking questions is the easy bit. It’s finding good answers which is hard. A university is at least as much a repository for the accumulation of human experience and an instrument for passing it on as it is a device for adding to it.</i> </blockquote><blockquote><i>STUART: But just teaching mathematics is not enough. A lot of us will go on to be engineers and managers and will have to take moral decisions. So why don’t you teach us ethics?</i> </blockquote><blockquote><i>TEACHER: But would you actually go to lectures on ethics?</i> </blockquote><blockquote><i>STUART: If the lecturer was good, yes.</i> </blockquote><blockquote><i>TEACHER: But anybody would go to hear Sir Isaiah Berlin lecturing on how to watch paint dry. The question is, would you go listen to your ordinary lecturers talking about ethics?</i> </blockquote><blockquote><i>ELEANOR: Not unless it was for examinations.</i> </blockquote><blockquote><i>STUART: So why not examine it?</i> </blockquote><blockquote><i>TEACHER: What would the examination questions look like? 
‘Is it wrong to steal from widows and orphans? Answer yes or no and give brief reasons.’</i> </blockquote><blockquote><i>STUART: There are lots of difficult and interesting moral problems.</i> </blockquote><blockquote><i>TEACHER: Yes, but the problems of the human race are not those of finding the answer to moral problems in hard cases but of acting on the answer in simple ones. American law schools now include courses on ethics, but the only observable result is that the defence in cases of fraud now begins ‘My client’s behaviour has throughout been not merely legal but ethical.’ […] If wisdom were teachable it would surely be our duty to teach it. Since it is not, we simply try to teach mathematics.</i></blockquote>Opinionated? Yes. Controversial? Perhaps. Does he have a point? Definitely.<br /><br />Körner also discusses how to persuade bureaucratic committees (and when to give up), the principles of successful small talk, and the philosophical issue of whether and how we should discount future values.<br /><br />And then, after you’ve been nodding along at one of these digressions for a while, you snap out of a Körner-induced trance, realise you’re halfway through a proof, and that you’ve been enjoying it all the way.<br /><br /><br /><b>The idle mathematician of an empty day</b><br /><br />Ultimately, <i>The Pleasures of Counting</i> is not about the usefulness or applicability of mathematics, but the joy of it. Deriving truths from other truths, or looking at the messiness of the real world and capturing its broad strokes with a few symbols is not just a means to an end but also an art form, a way of thinking, and a purpose in itself.<br /><br />Körner closes <i>The Pleasures of Counting</i> with the prologue of William Morris’s <i>The Earthly Paradise</i>. This poem is perhaps the best (and certainly most poetic) argument for the importance of “useless” endeavours like math or poetry. 
My idle blogging cannot beat Morris’s verse, so here is the poem in full:<br /><blockquote><i>Of Heaven and Hell I have no power to sing,</i><br /><i>I cannot ease the burdens of your fears,</i><br /><i>Or make quick-coming death a little thing,</i><br /><i>Or bring again the pleasures of past years,</i><br /><i>Nor for my words shall ye forget your tears,</i><br /><i>Or hope again for aught that I can say,</i><br /><i>The idle singer of an empty day.</i> </blockquote><blockquote><i>But rather, when aweary of your mirth,</i><br /><i>From full hearts still unsatisfied ye sigh,</i><br /><i>And, feeling kindly unto all the earth,</i><br /><i>Grudge every minute as it passes by,</i><br /><i>Made the more mindful that the sweet days die –</i><br /><i>– Remember me a little then I pray,</i><br /><i>The idle singer of an empty day.</i> </blockquote><blockquote><i>The heavy trouble, the bewildering care</i><br /><i>That weighs us down who live and earn our bread,</i><br /><i>These idle verses have no power to bear;</i><br /><i>So let me sing of names remembered,</i><br /><i>Because they, living not, can ne’er be dead,</i><br /><i>Or long time take their memory quite away</i><br /><i>From us poor singers of an empty day.</i> </blockquote><blockquote><i>Dreamer of dreams, born out of my due time,</i><br /><i>Why should I strive to set the crooked straight?</i><br /><i>Let it suffice me that my murmuring rhyme</i><br /><i>Beats with light wings against the ivory gate,</i><br /><i>Telling a tale not too importunate</i><br /><i>To those who in the sleepy region stay,</i><br /><i>Lulled by the singer of an empty day.</i> </blockquote><blockquote><i>Folk say, the wizard to a northern king,</i><br /><i>At Christmas-tide such wondrous things did show,</i><br /><i>That through one window men beheld the spring,</i><br /><i>And through another saw the summer glow,</i><br /><i>And through a third the fruited vines a-row,</i><br /><i>While still unheard, but in its wonted way,</i><br 
/><i>Piped the drear wind of that December day.</i> </blockquote><blockquote><i>So with this Earthly Paradise it is,</i><br /><i>If ye read aright and pardon me,</i><br /><i>Who strives to build a shadowy isle of bliss</i><br /><i>Midmost the beatings of the steely sea,</i><br /><i>Where tossed about all hearts of men must be,</i><br /><i>Whose ravenous monsters mighty men shall slay,</i><br /><i>Not the poor singer of an empty day.</i></blockquote>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-85622864394512007142019-08-15T20:50:00.001+01:002021-03-21T16:53:23.455+00:00Review: From Nand to Tetris<div style="text-align: center;"><span style="font-size: x-small;">Book: <i>The Elements of Computing Systems</i>, by Noam Nisan and Shimon Schocken (2005)</span></div><div style="text-align: center;"><span style="font-size: x-small;">5.9k words (≈30 minutes) </span> </div><br /><br /><b>A brief rant about the title</b><br /><br />“From Nand to Tetris” (Nand2Tetris for short) is the name of the course, website, and overall project that the book <i>The Elements of Computing Systems</i> is part of. It’s an excellent name – catchy, concise, and expertly captures the content.<br /><br />However, apparently it’s a law that a textbook must have a stodgy title consisting of a reference to the subject matter (bonus points for ostentatious circumlocution, like writing “computing system” instead of “computer”), perhaps attached to a generic word like “concepts” or “elements” that doesn’t mean much.<br /><br />For the rest of this post, I will pointedly ignore the name “<i>The Elements of Computing Systems</i>” and refer to it as “<i>From Nand to Tetris</i>” or “Nand2Tetris” instead.<br /><br /><br /><b>You’re a wizard</b><br /><br />At first glance, computers are basically magic.<br /><br />Science fiction author Arthur C. 
Clarke once said:<br /><blockquote><i>Any sufficiently advanced technology is indistinguishable from magic.</i></blockquote>A more accurate phrase might be “any sufficiently advanced technology <i>appears to be</i> indistinguishable from magic”. Of magic, all you can say is it just works. With technology (or, for that matter, anything in our world), there is always a reason.<br /><br />The goal of Nand2Tetris is to take computers from the realm of magic into the realm of understanding.<br /><br />This is a difficult task, since the technological stack connecting the physical world to a desktop operating system is perhaps the highest and most abstract technological stack humans have created. The functioning of computers is also a topic split into many layers, each of them its own broad field, from chip design to compilers to programming languages to operating systems.<br /><br />What Nand2Tetris does is present one example of a path from logic gates to chips to a machine language to virtual machines to high-level languages to an operating system. The aim is not so much to explore every nook and cranny of the computational jungle, or even to provide a map, but instead to demonstrate that such a path is even possible.<br /><br /><h2>A path through the jungle</h2><h3>Logic gates</h3><b>Boolean logic and basic gates</b><br /><br />Most of the function of a computer can be constructed from just two pieces of hardware. 
The first is the NAND gate.<br /><br />The only thing we need to know about the NAND gate is that it takes two inputs, each of which takes a binary value (we call the values 0 and 1), and produces a 1 as an output <i>except</i> when both inputs are 1, in which case the output is a 0.<br /><br />We will not discuss the implementation of the NAND gate, but instead assume that such a device can be implemented by electrical engineers who are clever enough (we have to start somewhere).<br /><br />In fact, it is barely relevant that the NAND gate is a physical device. We can instead think of it – and other logic gates – as a mathematical function, which maps some set of 0s and 1s to an output value.<br /><br />In the case of a NAND gate, it takes two inputs and maps them to one output in the manner specified by the following table:<br /><br />0, 0 -> 1<br />0, 1 -> 1<br />1, 0 -> 1<br />1, 1 -> 0<br /><br />(The name “NAND” is an abbreviation of “not and”, since the NAND of A and B is true <i>except</i> when the AND of A and B is true)<br /><br />Such tables are called truth tables. <i>From Nand to Tetris</i> provides a handy list of all 2^4 = 16 two-argument boolean functions:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-M_9pmuZgyiU/XVWt-RrSrlI/AAAAAAAAAaA/mdeRfYwac7gMwovTTvctDybiGa34nS40QCLcBGAs/s1600/boolean%2Bfunctinos.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="944" data-original-width="878" height="400" src="https://1.bp.blogspot.com/-M_9pmuZgyiU/XVWt-RrSrlI/AAAAAAAAAaA/mdeRfYwac7gMwovTTvctDybiGa34nS40QCLcBGAs/s400/boolean%2Bfunctinos.png" width="371" /></a></div><br /><br />If you have studied boolean logic before, you have doubtlessly spent time manipulating sets of 0s and 1s (or truths and falsities) linked together by AND, OR, and NOT operators. Why these three? 
In addition to them being straightforward to understand, it turns out that it is possible to specify any boolean function with AND, OR, and NOT operators.<br /><br />(How? If we have a truth table of the function (if we don’t or can’t make one, then we haven’t properly specified the function!), we can simply take every row for which the value of a function is a 1, build an expression for identifying that row out of ANDs, and then chain together all of these expressions with some ORs. For example, if we want the input sequence a=1, b=0, and c=1 to map onto a 1, the expression (a AND ((NOT b) AND c)) will be true for this and only this sequence. If we have a bunch of such expressions, say expressions w, x, y, and z, and we want to figure out if at least one of them is true, we can do so with the expression (w OR (x OR (y OR z))). If we have AND and OR functions/gates that can take more than two arguments, this becomes simpler, since we don’t have to nest ANDs and ORs and can instead write something like OR(AND(a, (NOT b), c), expression2, …).)<br /><br />It turns out that the NAND function itself is sufficient for defining AND, OR, and NOT (NOR – the negation of OR, in the same way that NAND is the negation of AND – has the same property). This implies that if we have a logic gate that implements the NAND function, we can use it – and it alone – to build chips that implement AND, OR, and NOT functions, and hence any boolean function.<br /><br />How? (NOT x) can be implemented as (x NAND x). Using this definition, we can write (x AND y) as (NOT (x NAND y)), and (x OR y) as ((NOT x) NAND (NOT y)). 
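These NAND-only constructions are easy to check mechanically. Here is a quick Python sketch of my own (treating gates as functions, as suggested above – this is not the book's HDL):

```python
def nand(a, b):
    """NAND: outputs 0 only when both inputs are 1."""
    return 0 if (a == 1 and b == 1) else 1

def not_(x):
    # (NOT x) = (x NAND x)
    return nand(x, x)

def and_(x, y):
    # (x AND y) = (NOT (x NAND y))
    return not_(nand(x, y))

def or_(x, y):
    # (x OR y) = ((NOT x) NAND (NOT y))
    return nand(not_(x), not_(y))

# Verify against the standard truth tables for every input combination.
for x in (0, 1):
    assert not_(x) == 1 - x
    for y in (0, 1):
        assert and_(x, y) == (x & y)
        assert or_(x, y) == (x | y)
print("NAND-built NOT/AND/OR match the truth tables")
```

Running the loop exhaustively over all four input pairs is the software analogue of reading off a truth table.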
Using logic gate symbols, we can represent these gates and some others as follows:<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-5tBwfOrTNhw/XV4ssUeCMsI/AAAAAAAAAcQ/FR-By2JqboYlx4Nw50V5mIDvkjz5EqkEgCLcBGAs/s1600/chip%2Bdiagrams%2BFIXED.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1258" data-original-width="809" src="https://1.bp.blogspot.com/-5tBwfOrTNhw/XV4ssUeCMsI/AAAAAAAAAcQ/FR-By2JqboYlx4Nw50V5mIDvkjz5EqkEgCLcBGAs/s1600/chip%2Bdiagrams%2BFIXED.png" /></a></div><br /><br />(Warning: the book makes you design all of these.)<br /><br />Note that the demultiplexor has two outputs. Hence “demultiplexing” is not a function (a function has only one output), and we call it a chip rather than a logic gate.<br /><br />Note that the concerns of designing a chip out of NAND gates are not precisely the same as that of specifying the boolean function out of NAND operations. For instance, since we defined (NOT x) as (x NAND x) and (x AND y) as (NOT (x NAND y)), the NAND representation of (x AND y) is ((x NAND y) NAND (x NAND y)). There are three NAND operations, so it looks like we need 3 NAND gates – but no, we can split wires as in the above diagram and do it with two. Similar concerns apply to the implementation of the XOR gate in the above diagram.<br /><br />There are many subtleties in optimising chips, which we (following the example of Nand2Tetris) will skip in order to get on with our journey.<br /><br /><br /><b>Multi-bit and multi-way chips</b><br /><br />A multi-way version of a basic gate allows for applying the function of the gate to many inputs at once. A multi-way AND outputs true if and only if every input is a 1; a multi-way OR outputs true if at least one input is a 1. 
The implementation is simple: for an 8-bit AND, for instance, take the AND of the first two inputs (call it A), then the AND of A and the third (call it B), then the AND of B and the fourth, and so on.<br /><br />A multi-bit version of a chip is basically many of those chips in parallel, applying their function to every piece of the input.<br /><br />For example, a 4-bit AND chip fed 1101 and 0100 as its inputs will output 0100 – the first output is the AND of the first digit of the two inputs, and so on. The implementation is even simpler: send bit 1 of inputs A and B through one AND gate, bit 2 of both through another, and so on.<br /><br />It gets a bit more complicated when dealing with multiplexors that are both multi-way and multi-bit, but the basic principle is the same: we have a bunch of binary values that we want to group together (perhaps they represent a number), and so we build chips that allow us to deal with them together.<br /><br />In addition, we don’t want to deal with the wires representing each binary digit of a number individually, so we group them into “buses” that transfer many bits at once from component to component (basically just a clump of wires, as far as I understand). 
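Incidentally, the multi-bit idea can be mimicked with Python's integer bitwise operators, which also act on all bit positions in parallel (a sketch for intuition only, not hardware):

```python
def multibit_and(a, b, width=4):
    """Per-bit AND of two width-bit values, returned as a binary string.
    Python's & operator already behaves like a bank of parallel AND gates."""
    return format(a & b, f"0{width}b")

# The example from the text: 1101 AND 0100 = 0100
print(multibit_and(0b1101, 0b0100))  # -> 0100
```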
On diagrams, a bus looks like a wire, except with a slash through it.<br /><br /><br /><b>Arithmetic</b><br /><br />Now that we have constructed our basic gates, we can begin doing something interesting – at least if addition counts as interesting.<br /><br />Since our chips are built of gates that deal with binary values – true or false, 1 and 0, whatever – any reasonable low-level implementation of arithmetic will be confined to base-2 rather than our standard base-10 number system.<br /><br />(To convert binary to decimal, just remember that the value of each digit goes up by a factor of 2 rather than 10 as you move from right to left; for example, 1011 (base 2) = 1 x <b>1</b> + 2 x <b>1</b> + 4 x <b>0</b> + 8 x <b>1</b> = 11 (base 10))<br /><br />This turns out to make things much simpler. The algorithm for addition is the same (add corresponding digits, output a result and a carry, take the carry into account when adding the next two corresponding digits), but we have far fewer cases:<br /><ul><li>0 + 0 --> 0, carry 0</li><li>0 + 1 --> 1, carry 0</li><li>1 + 0 --> 1, carry 0</li><li>1 + 1 --> 0, carry 1</li></ul>We can see that the result bit has the same truth table as the XOR function, and the carry bit has the same truth table as the AND function. 
Hence, for the simple purpose of determining the result digit and carry bit of two binary digits, the following chip (called a half-adder) is sufficient:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-VI4N5-j7pIo/XVWuzx9KqvI/AAAAAAAAAaU/pnQHIQGR_vM15sqHbyjlZk0OSw_xBIF8QCLcBGAs/s1600/Screenshot%2B2019-08-15%2Bat%2B22.13.10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="160" data-original-width="666" height="152" src="https://1.bp.blogspot.com/-VI4N5-j7pIo/XVWuzx9KqvI/AAAAAAAAAaU/pnQHIQGR_vM15sqHbyjlZk0OSw_xBIF8QCLcBGAs/s640/Screenshot%2B2019-08-15%2Bat%2B22.13.10.png" width="640" /></a></div><br /><br />Now we have to figure out how to chain such chips to create a multi-bit adder that can deal with carry bits. Observing that, at most, we will have two input bits and one carry bit to deal with to determine the resulting bit, let’s construct a chip that takes three bits as input and outputs the result and the carry bit. 
If we add a 0, a 1, and a 1, the corresponding result digit is a 0 and the carry is a 1; if we add a 1, a 1, and a 1, the result bit and the carry bit are both a 1.<br /><br />The result, called a full-adder, can be constructed like so:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-juy2y92ICLM/XVWvBCNP_ZI/AAAAAAAAAaY/SacBqozAxykSoCJPW2lAOdcVWRPaQ6FIQCLcBGAs/s1600/Screenshot%2B2019-08-15%2Bat%2B22.14.04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="410" data-original-width="1512" height="172" src="https://1.bp.blogspot.com/-juy2y92ICLM/XVWvBCNP_ZI/AAAAAAAAAaY/SacBqozAxykSoCJPW2lAOdcVWRPaQ6FIQCLcBGAs/s640/Screenshot%2B2019-08-15%2Bat%2B22.14.04.png" width="640" /></a></div><br /><br />Now that we have a full-adder, we can construct a multi-bit addition chip by simply chaining them together, feeding the carry bit from the previous full-adder as one of the three inputs into the next one.<br /><br />The only complication is that we have to connect a constant-0 input to the first full-adder to fill its third input, and the final carry bit goes nowhere.<br /><br />A 4-bit adder looks like this: <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-mBYH4mEX4aI/XVWvS-vmsVI/AAAAAAAAAak/2GA39HkyE9EUPQih8MGpvozqpPglIEeZwCLcBGAs/s1600/adder.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="525" data-original-width="537" height="390" src="https://1.bp.blogspot.com/-mBYH4mEX4aI/XVWvS-vmsVI/AAAAAAAAAak/2GA39HkyE9EUPQih8MGpvozqpPglIEeZwCLcBGAs/s400/adder.png" width="400" /></a></div><br /><br />This is pretty cool – after all this messing around with logic and logic gates, we have finally built a piece of hardware that does something real.<br /><br />We still have to consider some issues. The most important is how we represent negative numbers. 
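Before tackling that, the half-adder/full-adder chain just described can be sanity-checked in a few lines of Python (a simulation sketch of mine mirroring the diagrams – XOR for the sum bit, AND/OR for the carry – not the book's HDL):

```python
def half_adder(a, b):
    """Add two bits: sum is XOR, carry is AND."""
    return a ^ b, a & b

def full_adder(a, b, c):
    """Add three bits via two chained half-adders."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, c)
    return s2, c1 | c2  # at most one of c1, c2 can be 1

def ripple_add(a_bits, b_bits):
    """Add two equal-length bit lists (least significant bit first).
    The final carry is dropped, as in the 4-bit adder diagram."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out

# 6 (0110) + 3 (0011) = 9 (1001); bit lists are written LSB first
print(ripple_add([0, 1, 1, 0], [1, 1, 0, 0]))  # -> [1, 0, 0, 1]
```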
If we are smart enough about it, this also comes with a bonus: we don’t have to construct a separate chip to handle subtraction, but can instead subtract A from B by converting B to the representation of -B and then adding it to A.<br /><br />The standard method is called the two’s complement method, and it says: in an n-bit system, represent -x as the binary representation of $$2^n - x$$.<br /><br />For example, let’s say our system is 8-bit. 0000 0000 is 0 (space inserted for readability), 0000 0001 is 1, 0000 0010 is 2, 0000 0011 is 3, and so on. -0 is $$2^8 - 0$$ = 1 0000 0000, but we keep only 8 bits so this turns into 0000 0000 as well. -1 is $$2^8 - 1$$ = 255 = 1111 1111. -2 is $$2^8 - 2$$ = 254 = 1111 1110. And so on.<br /><br />Another example: <i>From Nand to Tetris</i> presents the complete table for a 4-bit system:<br /><i> </i><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-1FLR30LSZhg/XVWvnH4_BcI/AAAAAAAAAaw/TsXO-Dck52gz7Mc-GAxcaPeMFETm1RveQCLcBGAs/s1600/twos%2Bcomplement.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="714" data-original-width="1424" height="320" src="https://1.bp.blogspot.com/-1FLR30LSZhg/XVWvnH4_BcI/AAAAAAAAAaw/TsXO-Dck52gz7Mc-GAxcaPeMFETm1RveQCLcBGAs/s640/twos%2Bcomplement.png" width="640" /></a></div><br />The most important consequence of this is that our addition circuit can properly add negative and positive numbers together. -1 + 1 would be represented as inputting two buses, one containing 1111 1111 and the other 0000 0001, to our addition circuit. Adding these up gives 0000 0000 (since our adder only deals with the first 8 bits), or 0, just as intended. -1 + -1 becomes 1111 1110, which is -2. The reader can verify other examples if they so wish.<br /><br />Another consequence of this is that the largest number we can represent in our n-bit system is $$2^n / 2 - 1$$, and the most negative number is $$-(2^n) / 2$$. 
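The 8-bit examples above can be checked with a short Python sketch of my own, where masking to 8 bits plays the role of the dropped carry:

```python
BITS = 8
MASK = (1 << BITS) - 1  # 0b11111111: keep only the low 8 bits

def twos_complement(x):
    """Representation of -x in an n-bit system: 2^n - x, keeping n bits."""
    return ((1 << BITS) - x) & MASK

def add(a, b):
    """8-bit addition: any carry out of bit 8 is simply dropped."""
    return (a + b) & MASK

minus_one = twos_complement(1)
print(format(minus_one, "08b"))                  # -> 11111111
print(format(add(minus_one, 1), "08b"))          # -> 00000000, i.e. -1 + 1 = 0
print(format(add(minus_one, minus_one), "08b"))  # -> 11111110, i.e. -1 + -1 = -2
```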
Our 8-bit system only gave us the integers -128 to 127 to play with, but the growth is exponential. In a 16-bit system, we can represent the numbers -32768 to 32767; in a 32-bit system, -2 147 483 648 to 2 147 483 647; in a 64-bit system, -9 223 372 036 854 775 808 to 9 223 372 036 854 775 807.<br /><br />(Of course, when necessary we can implement logic for handling even larger numbers at a higher level in the computer, just as we can implement logic for handling any other data type).<br /><br />Yet another consequence is that if we add together positive numbers that exceed the limit, the result will be a negative number (large enough negative numbers will also add to a positive). This feature has led to countless incidents, with some of the more notable ones (ranging from exploding rockets to nuke-obsessed Gandhis in the <i>Civilization</i> game series) listed in <a href="https://en.wikipedia.org/wiki/Integer_overflow#Examples">this Wikipedia article</a>. As always: <a href="https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/">beware leaky abstractions.</a> <br /><br /><b>The Arithmetic Logic Unit</b><br /><br />The final piece of our journey that relies on logic gates alone is the construction of an arithmetic logic unit (ALU). Though it has a fancy name, all it does is implement some basic operations we will need: adding, ANDing, negating, and zeroing input buses.<br /><br />The ALU constructed in Nand2Tetris operates on two 16-bit buses, and also takes 4 bits to configure the inputs (to determine whether to negate and/or zero the x and y inputs), 1 bit to change the function (switches between ANDing and adding the 16-bit input buses), and 1 bit to determine whether or not to negate the output. 
In addition to outputting a 16-bit bus with the result of the computation, it also outputs 1 bit saying whether or not the output is zero, and another saying whether or not it’s a negative number.<br /><br />(Note that “negating” a binary bus means flipping each bit (e.g. 0101 --> 1010), and is a different process from switching the sign of a number (e.g. 0101 (5) --> 1011 (-5) with the two’s complement method))<br /><br />My implementation of such an ALU looks like this (for clarity, I have first defined a separate “ALUPreP” ALU pre-processor chip to perform the negating/zeroing of the x/y inputs):<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-aviwADPSEQA/XVWwVPWpN2I/AAAAAAAAAa4/moFJwdIjMospF5_jPjm5mX3LwEdKfw58gCLcBGAs/s1600/ALU%2Bpre-processor.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="410" data-original-width="780" height="336" src="https://1.bp.blogspot.com/-aviwADPSEQA/XVWwVPWpN2I/AAAAAAAAAa4/moFJwdIjMospF5_jPjm5mX3LwEdKfw58gCLcBGAs/s640/ALU%2Bpre-processor.png" width="640" /></a></div><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-048arPRJ4Zg/XVWxSsNtaZI/AAAAAAAAAbU/N8tdFrInl1ow1wrZddmflwg1I7E__BQ8ACLcBGAs/s1600/Screenshot%2B2019-08-15%2Bat%2B22.23.43.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="613" data-original-width="1600" height="244" src="https://1.bp.blogspot.com/-048arPRJ4Zg/XVWxSsNtaZI/AAAAAAAAAbU/N8tdFrInl1ow1wrZddmflwg1I7E__BQ8ACLcBGAs/s640/Screenshot%2B2019-08-15%2Bat%2B22.23.43.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Slashes on lines indicate a 16-bit bus, not a single wire. 
The 16s on various bits indicate that they are 16-bit versions of the chip (the AND-gate marked MW 16 is instead a multiway chip; see above discussion on multibit vs multiway chips). The splitting of the first bit from the result bus is a bit questionable as a chip design element, but it works since in the 2's complement method all negative numbers begin with a 1.</td></tr></tbody></table><br /> <br />Having a bunch of bits to zero and negate outputs and inputs and whatever may seem pointless. However, such a design allows us to compute many different functions in only one chip, including x + y, x AND y, x OR y, -x, x+1, y-x, and so on. <i>From Nand to Tetris</i> provides a table (where the notation & = AND, | = OR, and ! = NOT is used):<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Dgj0I5rUSPs/XVWwc7K6mnI/AAAAAAAAAa8/JtRVylYzi5kuPVP3-qgwgFXCOadl4CHwACLcBGAs/s1600/ALU%2Bfuncs.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1312" data-original-width="1600" height="524" src="https://1.bp.blogspot.com/-Dgj0I5rUSPs/XVWwc7K6mnI/AAAAAAAAAa8/JtRVylYzi5kuPVP3-qgwgFXCOadl4CHwACLcBGAs/s640/ALU%2Bfuncs.png" width="640" /></a></div><br /><br />Remember that the only piece of hardware needed to implement all of this is the humble NAND gate.<br /><br />(To give a sense of scale: by my count, my ALU design above requires 768 NAND gates if we simply substitute in the NAND-only versions of other gates (768 happens to be 2^9 + 2^8, but this is just a coincidence).)<br /><br /><h3>Flip-flops</h3>I mentioned earlier that only two pieces of hardware are required to implement most of our computer. 
In the previous section, we examined what we can do with NAND gates; now, we will turn to flip-flops (no, not <a href="https://en.wikipedia.org/wiki/Flip-flops"><i>that</i></a> type of flip-flop).<br /><br />NAND gates allow us to perform any feat of (binary) logic that we wish, but they do not allow for memory.<br /><br />The way in which the Nand2Tetris computer implements memory is with a data flip-flop (DFF). The principle of a DFF, like a NAND gate, is simple: its output at one “tick” of the computer’s central clock is its input at the previous tick.<br /><br />Thus, to add DFFs to our computer, we need to assume the existence of some type of clock, which broadcasts a signal to all DFFs. This allows us to divide time into discrete chunks.<br /><br />Real electronics always involves a delay in passing the signal from one component to another. Thus, when we pass inputs to our ALU, there’s a brief moment before the ALU stabilises to the true result. Inputs arriving from different parts of the computer also do not arrive simultaneously. 
Dividing time into ticks allows us to abstract away all such concerns (as long as the ticks are long enough for everything to stabilise); all we care about is the state of the computer at each tick, not what happens in between two ticks while the state is transitioning.<br /><br />A DFF and a multiplexor (a logic gate with two inputs and one selector bit, outputting the first input if the selector bit is 0 and the second if the selector bit is 1) can be combined to create a 1-bit register:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-zthjPHCU9TU/XVWx8M_jSvI/AAAAAAAAAbc/Nn8HfiSi27Q6_5gqAECJlF0jv80gTNudgCLcBGAs/s1600/1-bit%2Bregister.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="220" data-original-width="737" height="190" src="https://1.bp.blogspot.com/-zthjPHCU9TU/XVWx8M_jSvI/AAAAAAAAAbc/Nn8HfiSi27Q6_5gqAECJlF0jv80gTNudgCLcBGAs/s640/1-bit%2Bregister.png" width="640" /></a></div><br /><br />The operation of this register is as follows:<br /><ul><li>If the selector bit is 1, the DFF’s output at time <i>t</i> is the input value at time <i>t-1</i>.</li><li>If the selector bit is 0, the DFF’s output at time <i>t</i> is its output at time <i>t-1</i>.</li></ul>Hence, we can set a value (either a 0 or a 1) into the 1-bit register, and it will keep outputting that value until we tell it to change to a new value (by sending it the new value and sending a “1” as the input to the multiplexor’s selector bit).<br /><br />Of course, a storage of 1 bit doesn’t allow for very many funny cat GIFs, so clearly there’s still some work to be done.<br /><br />The first thing we do is we make the registers bigger, simply by adding many 1-bit registers in parallel. Most useful elements on which we do computations (numbers, letters (which are stored as numbers), etc.) 
take more than one bit to specify, so it’s useful to split memory into chunks – 16 bit chunks in the case of the Nand2Tetris computer.<br /><br />Next, let’s take many of these bigger registers, and put them in parallel with each other. The problem now is accessing and setting registers in the memory independently of each other. We can add a series of address bits as inputs to our memory chip and then build some circuitry so that the output will be the contents of the memory with the address specified by the address bits, and if we load a new input, the input will be loaded into the register with the address that is being inputted.<br /><br />A simple memory unit of four 16-bit registers, each uniquely identified by 2 address bits (addresses are 00, 01, 10, and 11), and its control logic can be implemented as follows:<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-F5JUpZ6GDrg/XVWyK5enpXI/AAAAAAAAAbg/0MunIGp2RTUKQnvytvM0_WjTXtzsCGgGgCLcBGAs/s1600/4-word%2BRAM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="665" data-original-width="695" height="612" src="https://1.bp.blogspot.com/-F5JUpZ6GDrg/XVWyK5enpXI/AAAAAAAAAbg/0MunIGp2RTUKQnvytvM0_WjTXtzsCGgGgCLcBGAs/s640/4-word%2BRAM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Less complicated than it looks. The input is split so that it reaches every register. If load=1, the three multiplexors on the left route it to one of the registers, and the input is loaded into that register (if load=0, nothing is meant to be loaded this tick and all load?-inputs into the registers are 0, so nothing happens). 
The register output buses are passed through a series of multiplexors to select which one's output is finally sent out of the memory chip.</td></tr></tbody></table><br />To construct larger memory chips, all we need to do is add registers and expand our address access logic. If we want a memory chip with, say, 64 registers, we need log2(64) = 6 bits to specify which address we are talking about, and hence 6 address bits (n address bits gives you 2^n unique addresses, which is why computer memory sizes are usually powers of 2: 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, etc.).<br /><br />Since we can access and set any part of the memory at the same speed and in any order, we call this “random access memory” (RAM). RAM is not permanent memory – maintaining state requires DFFs constantly passing looping signals to each other, which in turn requires power. Turn off power, and you lose memory contents.<br /><br />(RAM based on DFFs is called static RAM or SRAM, and is faster but more expensive and power-hungry than the alternative capacitor-based DRAM design. Hence SRAM is mainly used for small CPU caches (in the kilobyte or single-digit megabyte range), while the main system memory – what people generally think of when they hear “RAM” – uses DRAM, with capacities in the one or two-digit gigabytes.)<br /><br />Nand2Tetris does not examine the functioning of hard disk drives (HDDs) or solid state drives (SSDs) or other more permanent data storage devices.<br /><br /><h3>Instruction sets & machine language</h3>So far, using nothing but NAND gates and DFFs (and, well, wires and buses and clocks and so on), we have figured out how to:<br /><ul><li>perform arbitrary logic operations on binary data (and hence also perform basic arithmetic in base-2), and</li><li>store arbitrary binary data in memory of arbitrary size.</li></ul>All this arbitrariness gives us a lot of power. 
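The memory side of that summary – DFF-style clocked state, registers, and address-selected RAM – can be sketched in Python, with a `tick()` method standing in for the clock (an illustration of mine, not the book's chip interface):

```python
class Register:
    """A clocked register: its output changes only at a tick, and only
    if load was 1 (mimicking the DFF + multiplexor design above)."""
    def __init__(self):
        self.out = 0
        self._next = 0

    def set_inputs(self, value, load):
        # Multiplexor: pass the new value through if load=1, else loop back.
        self._next = value if load else self.out

    def tick(self):
        # DFF: output at time t is the input at time t-1.
        self.out = self._next

class RAM:
    """n registers behind address-select logic."""
    def __init__(self, n):
        self.regs = [Register() for _ in range(n)]

    def set_inputs(self, address, value, load):
        # Only the addressed register sees load=1; the rest hold their state.
        for i, r in enumerate(self.regs):
            r.set_inputs(value, load if i == address else 0)

    def read(self, address):
        return self.regs[address].out

    def tick(self):
        for r in self.regs:
            r.tick()

ram = RAM(4)                  # four registers, 2 address bits
ram.set_inputs(0b10, 42, load=1)
ram.tick()
print(ram.read(0b10))         # -> 42
print(ram.read(0b01))         # -> 0 (untouched register)
```

The two-phase `set_inputs`/`tick` split is exactly the abstraction the central clock buys us: nothing changes state until the tick.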
Let’s put it to use.<br /><br />The next thing we want to implement is the ability to give our computer instructions. We have already shown that it is possible to build chips that carry out arithmetic on binary numbers, so if we encode instructions as binary strings, we can identify and handle them just fine (though the control logic is complex).<br /><br />In Nand2Tetris, two basic types of instructions are implemented, each 16 bits long. I list them here to give you an impression of what they’re like:<br /><ul><li>If the first bit is a 0, the next 15 bits are interpreted as a value (typically the address of one of the 2^15 = 32768 words in memory), and this value is loaded into a special address register inside our CPU.</li><li>If the first bit is a 1, then: <ul><li>Bits 2 and 3 do nothing.</li><li>Bit 4 determines whether the second input we will pass to the ALU is the contents of the address register, or the contents of the memory location that the address register points to (the first input to the ALU is always the contents of a special data register in our CPU).</li><li>Bits 5-10 are the 6 bits passed to the ALU to determine which function it will compute on its inputs (see ALU table above).</li><li>Bits 11, 12, and 13 determine whether or not to send the output of the ALU to, respectively: i) the address register, ii) the data register, and/or iii) the memory location that the address register points to.</li><li>Bits 14, 15, and 16 determine which of 8 (=2^3) possible jump conditions applies. If all three bits are zero, do not jump (the next instruction executed is the next one in the program); the remaining 7 possibilities encode things like “jump if ALU output is negative”, “jump if ALU output is zero”, “jump regardless of ALU output”, and so on. 
The destination of the jump is always the instruction in instruction memory* with the address currently in the address register.</li></ul></li></ul>(*For simplicity, the computer design in Nand2Tetris features a separate memory space for the instructions that make up the program and for the data the program operates on, an arrangement known as a Harvard architecture.) <br /><br />The Nand2Tetris CPU instruction set is a rather minimalist one, but even so it allows for real computation.<br /><br />If you need convincing that this is true, consider a program that adds all numbers from 1 to 100. An annotated instruction sequence that achieves this is given below, interspersed with what the code might look like in a more readable language:<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-eXT37ig1oig/XVWzPkTEagI/AAAAAAAAAbw/VuRMX7odHQgO310yAc1rjb8roR1N6pRHACLcBGAs/s1600/Screenshot%2B2019-08-15%2Bat%2B22.32.04.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="958" data-original-width="1400" height="436" src="https://1.bp.blogspot.com/-eXT37ig1oig/XVWzPkTEagI/AAAAAAAAAbw/VuRMX7odHQgO310yAc1rjb8roR1N6pRHACLcBGAs/s640/Screenshot%2B2019-08-15%2Bat%2B22.32.04.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">D refers to the data register, A to the address register, and M[A] to the memory contents that the address register points to. Before each machine language segment are one or two lines of higher-level code which may translate into the machine code underneath. 
The choice of M[0] and M[1] as the places where we store the two variables is arbitrary.</td><td class="tr-caption" style="text-align: center;"><br /></td></tr></tbody></table><br />Such a list of instructions is called machine language.<br /><br />With machine language, we have finally risen from the abyss of hardware to the surface world of software. Having achieved this, all that remains (<a href="https://www.goodreads.com/quotes/285-now-at-this-very-moment-i-knew-that-the-united">to misquote Winston Churchill</a>) is the proper application of overwhelming abstraction.<br /><br /><h3>Assemblers & virtual machines</h3>Machine language, though powerful, suffers from a significant flaw: no one wants to write machine language.<br /><br />Thankfully (these days), practically no one has to.<br /><br />The first layer we can add on top of machine language is ditching the ones and zeroes for something marginally more readable, while keeping a one-to-one mapping between statements in our programming language and machine instructions. For example, instead of “0000 0000 0010 1011” to load the value 43 into the address register, we write “LOAD 43”, and use a program that converts such statements to the machine language equivalents (if such a program does not exist yet, we have to do the conversion manually).<br /><br />We can also write programs that let us define variables as stand-ins for memory addresses or data values, and then convert these to the corresponding memory locations for us. Another massive benefit for the programmer comes from ditching the insistence on a one-to-one correspondence between lines and machine instructions: a single statement in a high-level language can translate into many machine language instructions.<br /><br />Programming languages that retain a strong correspondence between statements and the computer’s machine language are termed assembly languages. 
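(To make the one-to-one mapping concrete, here is a toy translator in Python for the hypothetical “LOAD n” statement used as an example above. This is my own illustration, not course code, and real assemblers must also handle many instruction types, labels, and variables.)

```python
# Toy assembler sketch: translate the hypothetical statement "LOAD n" into
# its 16-bit machine word - a leading 0 bit (marking an address-loading
# instruction) followed by n written out in 15-bit binary.
def assemble_load(statement):
    op, operand = statement.split()
    if op != "LOAD":
        raise ValueError("this toy assembler only knows LOAD")
    n = int(operand)
    if not 0 <= n < 2 ** 15:
        raise ValueError("value must fit in 15 bits")
    return format(n, "016b")  # 16 digits; the top bit comes out as 0

print(assemble_load("LOAD 43"))  # -> 0000000000101011
```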
The program that performs the work of converting an assembly language into machine language is called an assembler.<br /><br />In general, a program that converts a program written in language A into a version that runs in language B is called a compiler. Running any program eventually ends with it being compiled, often through many intermediate steps, into machine language.<br /><br /><h3>Virtual machines</h3>Often, we want our programs to work not just on one processor type and one computer, but on many computers