<p>Strata of the World: "You laugh at iterative statistical curve fitting, but it might not be too long before iterative statistical curve fitting is laughing at you"</p><p>A model of research skill (published 2024-01-08, updated 2024-02-05)</p><p style="text-align: center;"><i><span style="font-size: x-small;">~4k words (20 minutes)</span></i></p><p>Doing research means answering questions no one yet knows the answer to. Lots of impactful projects are downstream of being good at this. A good first step is to have a model for what the hard parts of research skill are.</p><h1 id="two-failure-modes">Two failure modes</h1><p>There are two opposing failure modes you can fall into when thinking about research skill.</p><p>The first is the deferential one. Research skill is this amorphous complicated thing, so the only way to be sure you have it is to spend years developing it within some ossified ancient bureaucracy and then have someone in a funny hat hand you a piece of paper (bonus points for Latin being involved).</p><p>The second is the hubristic one. You want to do, say, AI alignment research. This involves thinking hard, maybe writing some code, maybe doing some maths, and then writing up your results. You’re good at thinking - after all, you read the Sequences, like, 1.5 times. You can code. You did a STEM undergrad. And writing? Pffft, you’ve been doing that since kindergarten!</p><p>I think there’s a lot to be said for hubris. Skills can often be learned well by colliding hard with reality in unstructured ways. Good coders are famously often self-taught. The venture capitalists who thought that management experience and a solid business background are needed to build a billion-dollar company are now mostly extinct.</p><p>It’s less clear that research works like this, though. 
I’ve often heard it said that it’s rare for a researcher to do great work without having been mentored by someone who was themselves a great researcher. Exceptions exist and I’m sceptical that any good statistics exist on this point. However, this is the sort of hearsay an aspiring researcher should pay attention to. It also seems like the feedback signal in research is worse than in programming or startups, which makes it harder to learn.</p><h1 id="methodology-except-methodology-is-too-fancy-a-word">Methodology, except “methodology” is too fancy a word</h1><p>To answer this question, and steer between deferential confusion and hubristic over-simplicity, I interviewed people who had done good research to try to understand their models of research skill. I also read a lot of blog posts. Specifically, I wanted to understand what about research a bright, agentic, technical person trying to learn at high speed would likely fail at and either not realise or not be able to fix quickly.</p><p>I did structured interviews with <a href="https://www.neelnanda.io/">Neel Nanda</a> (Google DeepMind; <a href="https://scholar.google.com/citations?view_op=view_citation&hl=en&user=GLnX3MkAAAAJ&citation_for_view=GLnX3MkAAAAJ:eQOLeE2rZwMC">grokking</a>), <a href="https://www.laurolangosco.com/">Lauro Langosco</a> (<a href="https://www.kasl.ai/">Krueger Lab</a>; <a href="https://scholar.google.com/citations?view_op=view_citation&hl=en&user=8-HOLxkAAAAJ&citation_for_view=8-HOLxkAAAAJ:9yKSN-GCB0IC">goal misgeneralisation</a>), and one other. I also learned a lot from unstructured conversations with <a href="https://www.inference.vc/">Ferenc Huszar</a>, <a href="https://krasheninnikov.github.io/about/">Dmitrii Krasheninnikov</a>, <a href="https://www.soren-mindermann.com/">Sören Mindermann</a>, <a href="https://owainevans.github.io/">Owain Evans</a>, and several others. 
I then <del>procrastinated on this project for 6 months</del> touched grass and formed inside views by doing the <a href="https://www.matsprogram.org/">MATS research program</a> under the mentorship of Owain Evans. I owe a lot to the people I spoke to and their willingness to give their time and takes, but my interpretation and model should not be taken as one they would necessarily endorse.</p><p>My own first-hand research experience consists mainly of a research-oriented CS (i.e. ML) master’s degree, followed by working as a full-time researcher for 6 months and counting. There are many who have better inside views than I do on this topic.</p><h1 id="the-big-three">The Big Three</h1><p>In summary:</p><ol type="1"><li>There are a lot of ways reality could be (i.e. hypotheses), and a lot of possible experiment designs. You want to avoid brute-forcing your way through these large spaces as much as possible, and instead be good at picking likely-true hypotheses or informative experiments. Being good at this is called <strong>research taste</strong>, and it’s largely an intuitive thing that develops over a lot of time spent engaging with a field.</li><li>Once you have some bits of evidence from your experiment, it’s easy to over-interpret them (perhaps you interpret them as more bits than they actually are, or perhaps you were failing to consider how large hypothesis space is to start with). To counteract this, you need sufficient <strong>paranoia</strong> about your results, which mainly just takes careful and creative thought, and good epistemics.</li><li>Finally, you need to <strong>communicate</strong> your results to transfer those bits of evidence into other people’s heads, because we live in a society.</li></ol><h2 id="taste">Taste</h2><p>Empirically, it seems that a lot of the value of senior researchers is a better sense of which questions are important to tackle, and better judgement for what angles of attack will work. 
For example, good PhD students often say that even if they’re generally as technically competent as their adviser and read a lot of papers, their adviser has much better quick judgements about whether something is a promising direction.</p><p>When I was working on my master’s thesis, I had several moments where I was working through some maths and got stuck. I’d go to one of my supervisors, a PhD student, and they’d have some ideas on angles of attack that I hadn’t thought of. We’d work on it for an hour and make more progress than I had in several hours on my own. Then I’d go to another one of my supervisors, a professor, and in fifteen minutes they’d have tried something that worked. Part of this is experience making you faster at crunching through derivations, and knowing things like helpful identities or methods. But the biggest difference seemed to be a good gut feeling for what the most promising angle or next step is.</p><p>I think the fundamental driver of this effect is dealing with large spaces: there are many possible ways reality could be (John Wentworth talks about this <a href="https://www.lesswrong.com/posts/nvP28s5oydv8RjF9E/mats-models#Jason_Crawford_s_Model___Bits_of_Search">here</a>), and many possible things you could try, and even being slightly better at homing in on the right things helps a lot. Let’s say you’re trying to prove a theorem that takes 4 steps to prove. If you have an 80% chance of picking the right move at each step, you’ll have a 41% chance of success per attempt. If that chance is 60%, you’ll have a 13% chance – over 3 times less. 
If you’re trying to find the right hypothesis within some hypothesis space, and you’ve already managed to cut down the entropy of your probability distribution over hypotheses to 10 bits, you’ll be able to narrow down to the correct hypothesis faster and with fewer bits than someone whose entropy is 15 bits (and whose search space is therefore effectively <span class="math inline">2<sup>5</sup> = 32</span> times as large). Of course, you’re rarely chasing down just a single hypothesis in a defined hypothesis class. But if you’re constantly 5 extra bits of evidence ahead compared to someone in what you’ve incorporated into your beliefs, you’ll make weirdly accurate guesses from their perspective.</p><p>Why does research taste seem to correlate so strongly with experience? I think it’s because the bottleneck is seeing and integrating evidence into your (both explicit and intuitive) world models. No one is close to having integrated all empirical evidence that exists, and new evidence keeps accumulating, so returns from reading and seeing more keep going. (In addition to literal experiments, I count things like “doing a thousand maths problems in this area of maths” as “empirical” evidence for your intuitions about which approaches work; I assume this gets distilled into half-conscious intuitions that your brain can then use when faced with similar problems in the future.)</p><p>This suggests that the way to speed-run getting research taste is to see lots of evidence about research ideas failing or succeeding. To do this, you could:</p><ol type="1"><li>Have your own research ideas, and run experiments to test them. The feedback quality is theoretically ideal, since reality does not lie (but may be constrained by what experiments you can realistically run, and a lack of the paranoia that I talk about next). The main disadvantage is that this is often slow and/or expensive.</li><li>Read papers to see whether other people’s research ideas succeeded or failed. 
This is prone to several problems:<ol type="1"><li>Biases: in theory, published papers are drawn from the set of ideas that ended up working, so you might not see negative samples (which is bad for learning). In practice, paper creation and selection processes are imperfect, so you might see lots of bad or poorly-communicated ones.</li><li>Passivity: it’s easy to fool yourself into thinking you would’ve guessed the paper ideas beforehand. Active reading strategies could help; for example, read only the paper’s motivation section and write down what experiment you’d design to test it, and then read only the methodology section and write down a guess about the results.</li></ol></li><li>Ask someone more experienced than you to rate your ideas. A mentor’s feedback is not as good as reality’s, but you can get it a lot faster (at least in theory). The speed-up is huge: a big ML experiment might take a month to set up and run, but you can probably get detailed feedback on 10 ideas in an hour of conversation. This is a ~7000x speedup. I suspect a lot of the value of research mentoring lies here: an enormous number of predictable failures or inefficiently targeted ideas can be skipped or honed into better ones, before you spend time running the expensive test of actually checking with reality. (If true, this would imply that the value of research mentorship is higher whenever feedback loops are worse.)</li></ol><p><a href="https://colah.github.io/notes/taste/">Chris Olah has a list of suggestions for research taste exercises</a> (number 1 is essentially the last point on my list above).</p><p>Research taste takes the most time to develop, and seems to explain the largest part of the performance gap between junior and senior researchers. 
It is therefore the single most important thing to focus on developing.</p><p>(If taste is so important, why does research output <a href="https://backend.orbit.dtu.dk/ws/portalfiles/portal/215281397/NP_article.pdf">not increase monotonically</a> with age in STEM fields? The scary biological explanation is that fluid intelligence (or energy or …) starts dropping at some age, and this decreases your ability to execute on maths/code, even assuming your research taste is constant or improving. Alternatively, hours spent on deep technical work might tend to decline with advanced career stages.)</p><h2 id="paranoia">Paranoia</h2><p>I heard several people say that junior researchers will sometimes jump to conclusions, or interpret their evidence as saying more than it actually does. My instinctive reaction to this is: “wait, but surely if you just creatively brainstorm the ways the evidence might be misleading, and take these into account in making your conclusions (or are industrious about running additional experiments to check them), you can just avoid this failure mode?” The typical answer I got was that yes, this seems true, and indeed many people either only need one peer review cycle to internalise this mindset, or pretty much get it from the start. Therefore, I’m almost tempted to chuck this category off this list, and onto the list of less crucial things where “be generally competent and strategic” will sort you out in a reasonable amount of time. However, two things hold me back.</p><p>First, confirmation bias is a strong thing, and it seems helpful to wave a big red sign saying “WARNING: you may be about to experience confirmation bias”.</p><p>Second, I think this is one of the cases where the level of paranoia required is sometimes more than you expect, even after you expect it will be high. 
John Wentworth puts this best in <a href="https://www.lesswrong.com/posts/9kNxhKWvixtKW5anS/you-are-not-measuring-what-you-think-you-are-measuring">You Are Not Measuring What You Think You Are Measuring</a>, which you should go read right now. There are more confounders and weird effects than are dreamt of in your philosophies.</p><p>A few people mentioned going through the peer review process as being a particularly helpful thing for developing paranoia.</p><h2 id="communication">Communication</h2><p>I started out sceptical about the difficulty of research-specific communication, above and beyond general good writing. However, I was eventually persuaded that yes, <em>research-specific</em> communication skills exist and are important.</p><p>First, if research has impact, it is through communication. Rob Miles once said (at a talk) something along the lines of: “if you’re trying to ensure positive AGI outcomes through technical work, and you think that you are not going to be one of the people who literally writes the code for it or is in the room when it’s turned on, your path to impact lies through telling other people about your technical ideas.” (This generalises: if you want to drive good policy through your research and you’re not literally writing it …, etc.) So you should expect good communication to be a force multiplier applied on top of everything else, and therefore very important.</p><p>Secondly, research is often not communicated well. On the smaller scale, Steven Pinker moans endlessly – and with good reason – about <a href="https://grad.ncsu.edu/wp-content/uploads/2016/06/Why-Academics-Stink-at-Writing-1-2.pdf">academic prose</a> (my particular pet peeve is the endemic utilisation of the word “utilise” in ML papers). 
On the larger scale, entire research agendas can get ignored because the key ideas aren’t communicated in a sufficiently clear and legible way.</p><p>I don’t know what the best way to speed-run getting good at research communication is. Maybe read <a href="https://stevenpinker.com/publications/sense-style-thinking-persons-guide-writing-21st-century">Pinker</a> to make sure you’re not making predictable mistakes in general writing. I’ve heard that experienced researchers are often good at writing papers, so maybe seek feedback from any you know (but don’t internalise the things they say that are about goodharting for paper acceptance). With papers, understand <a href="https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPaper.pdf">how papers are read</a>. Some sources of research-specific communication difficulty I can see are (a) the unusually high need for precision (especially in papers), and (b) communicating the intuitive, high-context, and often unverbalised-by-default world models that guide your research taste (especially when talking about research agendas).</p><h1 id="other-points">Other points</h1><ul><li>Having a research problem is not enough. You need an angle of attack.<ul><li>Richard Feynman once said something like: keep a set of open problems in your head. Whenever you discover a new tool (e.g. a new method), run through this list of problems and see if you can apply it. I think this can also be extended to new facts; whenever you hear about a discovery, run through a list of open questions and see how you should update.</li><li>Hamming says something similar in <a href="https://www.osv.llc/application-timeline">You and your research</a>: “Most great scientists know many important problems. They have something between 10 and 20 important problems for which they are looking for an attack.”</li></ul></li><li>Research requires a large combination of things to go right. 
Often, someone will be good at a few of them but not all of them.<ul><li>A sample list might be:<ul><li>generating good ideas</li><li>picking good ideas (= research taste)</li><li>iterating rapidly to get empirical feedback</li><li>interpreting your results right (paranoia)</li><li>communicating your findings</li></ul></li><li>If success is a product of sufficiently many independent variables (or of log-normally distributed ones), the distribution of success should be approximately log-normal, and therefore fairly heavy-tailed. And yes, research is heavy-tailed. Dan Hendrycks and Thomas Woodside <a href="https://www.lesswrong.com/posts/AtfQFj8umeyBBkkxa/a-bird-s-eye-view-of-the-ml-field-pragmatic-ai-safety-2#Research_ability_and_impact_is_long_tailed">claim</a> that while there may be 10x engineers, there are 1000x researchers. This seems true.<ul><li>However, this also means that not being the best at one of the component skills does not doom your ability to still have a really good product across categories.</li></ul></li></ul></li><li>Ideas from other fields are often worth stealing. There exist standardised pipelines to produce people who are experts in X for many different X, but far less so to produce people who are experts in both X and some other Y. Expect many people in X to miss out on ideas in Y (though remember that not all Y are relevant).</li><li>Research involves infrequent and uncertain feedback. Motivation is important and can be hard. Grad students are <a href="https://www.benkuhn.net/grad/">notorious</a> for having bad mental health. A big chunk of this is due to the insanities of academia rather than research itself. 
However, startups are <a href="https://www.amazon.co.uk/Lean-PhD-Radically-Efficiency-Macmillan/dp/1352002825">somewhat analogous</a> to research (high-risk, difficult, often ambiguous structure), lack institutionalised insanity, and are also acknowledged to be mentally tough.<ul><li>The most powerful and universally-applicable hack to make something not suck for a human is for that human to do it together with other humans. Also, more humans = more brains.</li></ul></li><li>Getting new research ideas is often not a particularly big-brained process. I once had the impression that most research ideas would come from explicitly thinking hard about research ideas, and generating fancy ideas would be a major bottleneck. However, I’ve found that many ideas come with surprisingly little effort, with a feeling of “well, if I want X, the type of thing I should do is probably Y”. Whiteboarding with other people is also great.<ul><li>This is not to say that idea generation isn’t helped by actively brainstorming hard. Just that it’s not the only, or even majority, source of ideas.</li><li>The feeling of ideas being rare is often a newbie phase. You should (and very likely will) get past it quickly if you’re engaging with a field. John Wentworth has a <a href="https://www.lesswrong.com/posts/mfPHTWsFhzmcXw8ta/the-feeling-of-idea-scarcity">good post</a> on the topic. I have personally experienced an increase in concrete research ideas, and much greater willingness to discard ideas, after going through a few I’ve felt excited by.</li><li>When you look at a field from afar, you see a smooth shape of big topics and abstractions. This makes it easy to feel that everything is done. 
Once you’re actually at the frontier, you invariably discover that it’s full of holes, with many simple questions that don’t have answers.</li></ul></li><li>There’s great benefit to an idea being the <a href="https://www.paulgraham.com/top.html">top thing in your mind</a>.</li><li>When in doubt, log more. Easily being able to run more analyses is good. At some point you will think to yourself something like “huh, I wonder if thing X13 had an effect, I’ll run the statistics”, and then either thank yourself because you logged the value of X13 in your experiments, or facepalm because you didn’t.</li><li>Tolerate the appearance of stupidity (in yourself and others). Research is an intellectual domain, and humans are status-obsessed monkeys. Humans doing research therefore often feel like they need to appear smart. This can lead to a type of wishful thinking where you hear some idea and try to delude yourself (and others) into thinking you understand it immediately, without actually knowing how it bottoms out into concrete things. Remember that any valid idea or chain of reasoning decomposes into simple pieces. Allow yourself to think about the simple things, and ask questions about them.<ul><li>There is an anecdote about Niels Bohr (related by George Gamow and quoted <a href="https://slimemoldtimemold.com/2022/02/10/the-scientific-virtues/">here</a>): “Many a time, a visiting young physicist (most physicists visiting Copenhagen were young) would deliver a brilliant talk about his recent calculations on some intricate problem of the quantum theory. Everybody in the audience would understand the argument quite clearly, but Bohr wouldn’t. So everybody would start to explain to Bohr the simple point he had missed, and in the resulting turmoil everybody would stop understanding anything. 
Finally, after a considerable period of time, Bohr would begin to understand, and it would turn out that what he understood about the problem presented by the visitor was quite different from what the visitor meant, and was correct, while the visitor’s interpretation was wrong.”</li></ul></li><li><a href="https://quoteinvestigator.com/2018/10/13/ship/">“Real <del>artists</del> researchers ship”</a>. Like in anything else, iteration speed really matters.<ul><li>Sometimes high iteration speed means schlepping. You should not hesitate to schlep. The deep learning revolution <a href="https://en.wikipedia.org/wiki/AlexNet">started</a> when some people wrote a lot of low-level CUDA code to get a neural network to run on a GPU. I once reflected on why my experiments were going slower than I hoped, and realised a mental ick for hacky code was making me go about things in a complex roundabout way. I spent a few hours writing ugly code in Jupyter notebooks, got results, and moved on. Researchers are notorious for writing bad code, but there are reasons (apart from laziness and lack of experience) why the style of researcher code is sometimes different from standards of good software.</li><li>The most important thing is doing informative things that make you collide with reality at a high rate, but being even slightly strategic will give great improvements on even that. Jacob Steinhardt gives good advice about this in <a href="https://cs.stanford.edu/~jsteinhardt/ResearchasaStochasticDecisionProcess.html">Research as a Stochastic Decision Process</a>. In particular, start with the thing that is most informative per unit time (rather than e.g. 
the easiest to do).</li></ul></li></ul><h2 id="good-things-to-read-on-research-skill">Good things to read on research skill</h2><p>(I have already linked to some of these above.)</p><ul><li>General advice on research from experienced researchers<ul><li><a href="https://www.cs.virginia.edu/~robins/YouAndYourResearch.pdf">You and Your Research</a> (Richard Hamming – old but still unbeaten. Hamming also has a <a href="https://www.goodreads.com/book/show/530415.The_Art_of_Doing_Science_and_Engineering">book</a> that includes this lecture among other material, but the lecture is the best bit of it and a good 80/20.)</li><li><a href="https://terrytao.wordpress.com/career-advice/">Career advice</a> (Terry Tao)</li><li><a href="https://cs.stanford.edu/~jsteinhardt/ResearchasaStochasticDecisionProcess.html">Research as a Stochastic Decision Process</a> (Jacob Steinhardt)</li><li><a href="https://www.lesswrong.com/posts/EF5M6CmKRd6qZk27Z/my-research-methodology">My research methodology</a> (Paul Christiano)</li><li><a href="http://joschu.net/blog/opinionated-guide-ml-research.html">An Opinionated Guide to ML Research</a> (John Schulman)</li><li><a href="https://www.eugenevinitsky.com/posts/PhD_a_retrospective_analysis.html">PhD: a retrospective analysis</a> (Eugene Vinitsky)</li></ul></li><li>John Wentworth’s posts about specific research meta-topics<ul><li><a href="https://www.lesswrong.com/posts/9kNxhKWvixtKW5anS/you-are-not-measuring-what-you-think-you-are-measuring">You Are Not Measuring What You Think You Are Measuring</a></li><li><a href="https://www.lesswrong.com/posts/mfPHTWsFhzmcXw8ta/the-feeling-of-idea-scarcity">The Feeling of Idea Scarcity</a></li><li><a href="https://www.lesswrong.com/posts/pT48swb8LoPowiAzR/everyday-lessons-from-high-dimensional-optimization">Everyday Lessons from High-Dimensional Optimization</a></li><li><a href="https://www.lesswrong.com/posts/nvP28s5oydv8RjF9E/mats-models#Jason_Crawford_s_Model___Bits_of_Search">MATS Models</a></li><li><a 
href="https://www.lesswrong.com/posts/GhFoAxG49RXFzze5Y/what-s-so-bad-about-ad-hoc-mathematical-definitions">What’s So Bad About Ad-Hoc Mathematical Definitions?</a></li></ul></li><li>Relevant Paul Graham essays<ul><li><a href="https://www.paulgraham.com/top.html">The Top Idea in Your Mind</a></li><li><a href="https://www.paulgraham.com/greatwork.html">How to do Great Work</a></li></ul></li><li>Advice aimed at new alignment researchers<ul><li><a href="https://www.lesswrong.com/posts/wYEwx6xcY2JxBJsfA/qualities-that-alignment-mentors-value-in-junior-researchers">Qualities that alignment mentors value in junior researchers</a></li><li><a href="https://www.lesswrong.com/s/mCkMrL9jyR94AAqwW/p/h5CGM5qwivGk2f5T9">7 traps (we think) new alignment researchers fall into</a></li><li><a href="https://www.lesswrong.com/posts/fqryrxnvpSr5w2dDJ/touch-reality-as-soon-as-possible-when-doing-machine">Touch reality as soon as possible (when doing machine learning research)</a></li></ul></li><li><a href="https://www.lesswrong.com/posts/AtfQFj8umeyBBkkxa/a-bird-s-eye-view-of-the-ml-field-pragmatic-ai-safety-2">A Bird’s Eye View of the ML Field</a> (a good overview of how the ML field works)</li><li><a href="https://web.stanford.edu/~fukamit/schwartz-2008.pdf">The importance of stupidity in scientific research</a> (short and sweet)</li><li><a href="https://colah.github.io/notes/taste/">Research Taste Exercises</a> (what it says on the tin)</li></ul><p>A Disneyland Without Children (published 2023-06-04)</p><p>The spaceship swung into orbit around the blue-grey planet with a
final burn of its engines. Compared to the distance they had travelled,
the world, now only some four hundred kilometres below and filling up
one hemisphere of the sky, was practically within reach. But Alice was
no less confused.</p>
<p>“Well?” she asked.</p>
<p>Charlie stared thoughtfully at the world slowly rotating underneath
their feet, oceans glinting in the sunlight. “It looks lickable”, he
said.</p>
<p>“We have a task”, Alice said, trying to sound gentle. Spaceflight was
hard. Organic life was not designed for it. But their mission was
critical, they needed to move fast, and Charlie, for all his quirks,
would need to be focused.</p>
<p>“What’s a few minutes when it will take years for anything we
discover to be known back home?” Charlie asked.</p>
<p>“No licking”, Alice said.</p>
<p>Charlie rolled his eyes, then refocused them on the surface of the
planet below. They were just crossing the coast of one of the larger
continents. Blue water was giving way to grey land.</p>
<p>“Look at the texture”, Charlie said. They had seen it from far away
with telescopes, but there was something different about seeing it with
their bare eyes. Most of the land surface of the planet was like a rug
of fine grey mesh. If there had been lights, Alice would have guessed
the entire planet’s land was one sprawling city, but as far as their
instruments could tell, the world had no artificial lighting.</p>
<p>As far as they could tell, the world also had no radio. They had
broadcast messages at every frequency they could, and in desperation
even by using their engines to flash a message during their deceleration
burn. No response had come.</p>
<p>Alice pulled up one of the telescope feeds on the computer to look
closer at the surface. She saw grey rectangular slabs, typically several
hundred metres on a side, with wide roads running between them. The
pattern was not perfect - sometimes it was irregular, and sometimes
there were smaller features too. Some of the smaller ones moved.</p>
<p>“Are they factories?” Charlie asked.</p>
<p>“I’d guess so”, Alice said, watching on the telescope feed as a
steady stream of rectangular moving objects, each about ten metres long,
slid along a street. Another such stream was moving along an
intersecting street, and it looked like they would crash at the
intersection, but the timing and spacing were such that vehicles from one
stream crossed the road just as there were gaps in vehicles along the
other stream.</p>
<p>“A planet covered by factories, then”, Charlie said. “With no one
home to turn the lights on.”</p>
<p>“I want to see what they’re making”, Alice said.</p>
<h2 id="section">-</h2>
<p>All through the atmospheric entry of their first drone package, Alice
sat tight in her seat and clenched and unclenched her hands. So far all
they had done was passive observation or broadcasting. A chunky piece of
hardware tracing a streak of red-hot plasma behind it was a much louder
knock. She imagined alien jet fighters scrambling to destroy their
drones, and some space defence mechanism activating to burn their
ship.</p>
<p>The image she saw was a jittery camera feed, showing the black back
of the heatshield, the grey skin of the drone package, and a sliver of
blue sky. It shook violently as the two halves of the heatshield
detached from each other and then the drone package, tumbling off in
opposite directions. Land became visible, kilometres below, the grey
blocks of the buildings tiny like children’s blocks but still visibly
three-dimensional, casting shadows and moving as the drone package
continued falling.</p>
<p>The three drones tested their engines, and for a moment flew - or at
least slowed their descent - in an ungainly joint configuration, before
breaking off from each other and spreading their wings to the fullest.
The feed showed the other two drones veering off into the distance on
long, narrow wings, and then the view pulled up as the nose of the drone
lifted from near-vertical to horizontal.</p>
<p>“Oops, looks like we have company”, Charlie said. He had been tapping
away at some other screens while Alice watched the drone deployment
sequence.</p>
<p>Alice jumped up from her seat. “What?”</p>
<p>“Our company is … a self-referential joke!”</p>
<p>Alice resisted the temptation to say anything and instead sank back
into her seat. On her monitor, the grey blocks continued slowly moving
below the drone. She tapped her foot against the ground.</p>
<p>“Actually though”, Charlie said. “We’re not the only ones in orbit
around this planet.”</p>
<p>“What else is orbiting? Has your sense of shame finally caught up
with you and joined us?”</p>
<p>“Looks like satellites. Far above us, though. Can you guess how
far?”</p>
<p>“I’d guess approximately the distance between you and maturity, so …
five light-years?”</p>
<p>Charlie ignored her. “Exactly geostationary altitude”, he said,
grinning. The grin was like some platonic ideal of intellectual
excitement; too pure for Alice’s annoyance to stay with her, or for her
to feel scared about the implications.</p>
<p>“But nothing in lower orbits?” Alice asked.</p>
<p>“No”, Charlie said. “Someone clearly put them there; stuff doesn’t
end up at exactly geostationary altitude unless someone deliberately
flies a communications or weather satellite there. Now I can’t be entirely
sure that the geostationary satellites are completely dead, but I’d
guess that they are.”</p>
<p>“Like everything else”, Alice said, but even as she said so she
caught sight of a long trail of vehicles making its way along one of the
roads. There was something more real about seeing them on the drone
feed.</p>
<p>“Maybe this is just a mining outpost”, Charlie said. “Big rocket
launch to blast out a billion tons of ore to god-knows-where, once a
year.”</p>
<p>“Or maybe they’re hiding underground or in the oceans”, Alice
said.</p>
<p>“Let’s get one of the drones to drop a probe into the oceans. I’ll
send one of our initial trio over to the nearest one, it’s only a few
hundred kilometres away”, Charlie said.</p>
<p>“Sure”, Alice said.</p>
<p>They split the work of flying the drones, two of them mapping out
more and more of the Great Grey Grid (as Alice took to calling it in her
head), and one flying over the planet’s largest ocean.</p>
<p>Even the oceans were mostly a barren grey waste. Not empty, though.
They did eventually see a few small scaly fish-like creatures that
stared at their environment with uncomprehending eyes. Alien life. A
young Alice would have been ecstatic. But now she was on a mission, and
her inability to figure out what had happened on this planet annoyed
her.</p>
<p>In addition to the ocean probe, they had rovers they could send
crawling along the ground. Sometimes the doors of the square buildings
were open, and Alice would drive a rover past one opening. Most seemed
to be either warehouses of stacked crates or some kind of automated
assembly line of skeletal grey robot arms and moving
conveyor belts. A few seemed to place more barriers between the open air
and their contents; what went on there, the rovers did not see.</p>
<p>The first time Alice tried to steer a rover into a building, it got
run over by a departing convoy of vehicles. The vehicles were
rectangular in shape but with an aerodynamic head, with three wheels on
each side. Based on their dimensions, she could easily imagine one
weighing ten or twenty tons. The rover had no chance.</p>
<p>“Finally!” Charlie had said. “We get to fight these aliens.”</p>
<p>But there was no fight. It seemed like it had been a pure accident,
without any hint of malice. The grey vehicles moved and stopped on some
schedule of their own, and for all Alice knew they were not just
insensitive beasts but blind and dumb ones too.</p>
<p>The next rover got in, quickly scooting through the side of the
entrance and then off to one side, out of the path of the grey vehicles. It
wandered the building on its own, headlights turned on in the
otherwise-dark building to bring back a video stream of an assembly line
brooded over by those same skeletal hands they had glimpsed from
outside. Black plastic beads came in by the million on the grey
vehicles. A small thin arm with a spike on the end punctured a few holes
on one side, and using these holes two of the black beads were sewn onto
an amorphous plushy shape. The shape got appendages, was covered with a
layer of fluff, and the entire thing became a cheerful purple when it
passed through an opaque box with pipes leading into it. It looked like
a child’s impression of a hairy four-legged creature with black beady
eyes above a long snout. A toy, but for who?</p>
<p>The conveyor belt took an endless line of those fake creatures past
the rover’s camera at the end of the assembly line. Alice watched them
go, one by one, and fall onto the open back of a grey vehicle. It felt
like each and every one made eye contact with her, beady black eyes
glinting in the light. She watched for a long time as the vehicle filled
up. Once it did, a panel slid over the open top to close the cargo bay,
and it sped off out the door. The conveyor belt kept running, but there
was a gap of a few metres to the next plushy toy. It came closer and
closer to the end - and suddenly a vehicle was driving into place, and
the next creature was falling, and it just barely fell into the storage
hold of the vehicle while it was driving into place.</p>
<p>“How scary do you find the Blight?” Alice asked.</p>
<p>“Scary enough that I volunteered for this mission”, Charlie said.</p>
<p>Alice remembered the charts they had been shown. They had been hard
to miss; even the news, usually full of celebrity gossip and political
machinations, had quickly switched to concentrating on the weirdness in
the sky once the astronomers spotted it. Starlight dimming in many star
systems and what remained of the light spectra shifting towards the
infrared. Draw a barrier around the affected area, and you get a sphere
30 light-years wide, expanding at a third of the speed of light. At the
epicentre, a world that had shown all the signs of intelligent life that
could be detected from hundreds of light-years away - a world that
astronomers had broadcast signals to in the hopes of finally making
contact with another civilisation - that had suddenly gone quiet and
experienced a total loss of oxygen in its atmosphere. The Blight, they
had called it.</p>
<p>In the following years, civilisation had mobilised. A hundred
projects had sprung forth. One of them: go investigate the star system
that was the second-best candidate for intelligent life, but had refused
to answer radio signals, and see if someone was there to help. That was
why they were here.</p>
<p>“I think I found something as scary as the Blight”, Alice said. “Come
look at this.”</p>
<p>The purple creatures kept parading past the camera feed.</p>
<h2 id="section-1">-</h2>
<p>Over the next five days, while the Blight advanced another forty
billion kilometres towards everything they loved back home, Alice and
Charlie were busy compiling a shopping catalogue.</p>
<p>“Computers”, Alice said. “Of every kind. A hundred varieties of
phones, tablets, laptops, smartwatches, smartglasses,
smart-everything.”</p>
<p>“Diamonds and what seems to be jewellery”, Charlie said.</p>
<p>“Millions of tons of every ore and mineral.” They had used their
telescopes on what seemed to be a big mine, but they had barely needed
them. It was like a huge gash in the flesh of a grey-fleshed and
grey-blooded giant, complete with roads that looked like sutures. There
were white spots in the image, tiny compared to the mine, each one a
sizeable cloud.</p>
<p>“Clothes”, Charlie continued. “Lots and lots of clothes of different
varieties. They seem to be shipped around warehouses until they’re
recycled.”</p>
<p>“Cars. Sleek electric cars by the million. But we never see them used
on the roads, though there are huge buildings where brand-new cars are
recycled. And airplanes, including supersonic ones.”</p>
<p>“A lot of things that look like server farms”, Charlie said.
“Including ones underwater and on the poles. There’s an enormous amount
of compute in this world. Like, mind-boggling. I was thinking we should
figure out how to plug into all of it and mine some crypt-”</p>
<p>“Ships with nuclear fusion reactors”, Alice interrupted. There were
steady trails of them cutting shortest-path routes between points on the
coast.</p>
<p>“Solar panels”, Charlie said. “Basically every spare surface. The
building roofs are all covered with solar panels.”</p>
<p>“And children’s plush toys”, Alice said.</p>
<p>They were silent for a while.</p>
<p>“We have a decent idea of what these aliens looked like”, Alice said.
“They were organic carbon-based lifeforms, like us. Similar in size too,
also bipedal. And it’s like they left some ghostly satanic industrial
amusement park running, going through all the motions in their absence,
and disappeared.”</p>
<p>“And they didn’t go to space, as far as we know”, Charlie said.</p>
<p>“At least we don’t have any more Blights to worry about then”, Alice
said. “I can’t help but imagine that the Blight is something like
this. Something that just tiles planets with a Great Grey Grid, does
something even worse to the stars, and then moves on.”</p>
<p>“They had space technology, but apparently whoever built the Great
Grey Grid didn’t fancy it”, Charlie said. “The satellites might predate
it. Probably there were satellites in lower orbits too, but their orbits
decayed and they fell down, so we only see the geostationary ones up
high.”</p>
<p>“And then what?” Alice said. “All of them vanished into thin air and
left behind a highly-automated ghost-town?”</p>
<p>Charlie shrugged.</p>
<p>“Can we plug ourselves into their computers?” Alice asked.</p>
<p>“To mine cr-?”</p>
<p>“To see if anyone’s talking.”</p>
<p>Charlie groaned. “You can’t just plug yourself into a communication
system and see anything except encrypted random-looking noise.”</p>
<p>“How do you know they encrypt anything?”</p>
<p>“It would be stupid not to”, Charlie said.</p>
<p>“It would be stupid to blind yourself to the rest of the universe and
manufacture a billion plush toys”, Alice said.</p>
<p>“Seems like it will work for them until the Blight arrives.”</p>
<h2 id="section-2">-</h2>
<p>Alice floated in the middle of the central corridor of the ship. The
ship was called <em>Legacy</em>, but even before launch they had taken
to calling it “Leggy” for short. The central corridor linked the
workstation at the front of the ship where they spent most of their days
to the storage bay at the back. In the middle of the corridor, three
doors at 120-degree angles from each other led to the small sleeping
rooms, each of them little more than a closet.</p>
<p>Alice had woken up only a few minutes ago, and still felt an
early-morning grogginess as well as the pull of her bed. The corridor
had no windows or video feeds, but was dimly lit by the artificial blue
light from the workstation. They were currently on the night side of the
planet.</p>
<p>She took a moment to look at the door of the third sleeping room. It
was closed, like always, with its intended inhabitant wrapped in an
air-tight seal of plastic in a closed compartment of the storage bay.
They would flush him into space before they left for home again; they
could have no excess mass on the ship for the return journey.</p>
<p>Alice thought again of the hectic preparations for the mission. Apart
from Blightsource, this was the only planet the astronomers had spotted
that might have intelligent life on it, and the indications were vague.
But when you look into space and see something that looks like an
approaching wall of death - well, that has a certain way of inspiring
long-shots. Hence the mission, hence <em>Legacy</em>’s flight, hence
crossing over the vast cold stretch of interstellar space to see if any
answers could be found on this world. Hence Bob’s death while in cryonic
suspension for the trip. Hence the hopes of all civilisation potentially
resting on her and Charlie figuring out something valuable.</p>
<p>If Charlie and she could find something on this world, some piece of
insight or some tool or weapon among the countless pieces of
technological wizardry that this world had in spades, that had a
credible chance against the Blight when it arrived … maybe there was
hope.</p>
<p>Alice pushed off on the wall and set herself in a slow spinning
motion. The ship seemed to revolve around her. Bob’s door revolved out
of sight, and Charlie’s door became visible -</p>
<p>Wait.</p>
<p>Her gravity-bound instincts kicked in and she tried to stop the spin
by shoving back with her hands, but there was nothing below her, so she
remained spinning slowly. She breathed in deeply to calm herself down,
then kicked out a foot against the wall to push herself to the opposite
one. She grabbed one of the handles on the wall and held onto it.</p>
<p>The light on Charlie’s room was off. That meant it was empty.</p>
<p>“Charlie!” Alice called.</p>
<p>No response.</p>
<p>The fear came fast. Here she was, light-years from home, perhaps all
alone on a spaceship tracing tight circles around a ghostly automated
graveyard planet. The entire mass of the planet stood between her and
the sun. Out between the stars, the Blight was closing in on her
homeworld. She counted to calm herself down; one, two, three, … and just
like that, the Blight was three hundred thousand kilometres closer to
home. Unbidden, an image of the fluffy purple creature popped up in her
mind, complete with its silly face and unblinking eye contact.</p>
<p>Soundlessly, she used the handles on the wall of the corridor to pull
herself towards the workstation. She reached the door, peered inside
-</p>
<p>There was Charlie, staring at a computer screen. He looked up and saw
Alice. “You scared me!” he said. “Watch out, no need to sneak up behind me
so quietly.”</p>
<p>“I called your name”, Alice said.</p>
<p>“I know, I know”, Charlie said. “But I’m on to something here, and I
just want to run a few more checks and then surprise you with the
result.”</p>
<p>“What result?” Alice glanced at some of the screens. Two of the
drones were above the Great Grey Grid, one above ocean. With their
nuclear power source, they could stay in the air as long as they wanted.
Even though their focus was no longer aerial reconnaissance, there was
no reason not to keep them mapping the planet from up close,
occasionally picking up things that their surveys from the ship did
not.</p>
<p>“I fixed the electrical issues with the rover and the cable near the
data centre”, Charlie said.</p>
<p>“So you’re getting data, not just frying our equipment?”</p>
<p>“Yes”, Charlie said. “And guess what?”</p>
<p>“What?”</p>
<p>“Guess!”</p>
<p>“You found a Blight-killer”, Alice said.</p>
<p>“No! Even better! These idiots don’t encrypt their data as far as I
can tell. And I think a lot of it is natural language.”</p>
<p>“Okay, and can we figure out what it means?”</p>
<p>“We have automated programs for trying to derive syntax rules and so
on”, Charlie said. “It’s already found something, including good guesses
of which words are prepositions and what type of grammar they have. But
mapping words to meaning based purely on statistics of how often they
occur is hard.”</p>
<p>“I’ve seen products they have with pictures and instruction manuals”,
Alice said. “We could start there.”</p>
<p>“Oh no”, Charlie said. “This is going to be a long process.”</p>
<h2 id="section-3">-</h2>
<p>By chance, it turned out not to be. Over the next day, they had sent
a rover to a furniture factory and had managed, after some attempts, to
steal an instruction leaflet out of a printer before the robotic arm
could snatch it to be packaged with the furniture. Somehow Alice was
reminded of her childhood adventures stealing fruit from the neighbour’s
garden.</p>
<p>They had figured out which words meant “cupboard”, “hammer”, and
“nail”, and so on. But then another rover on the other side of the world
had seen something. It was exploring a grey and windy coast. On one side
of the rover was the Great Grey Grid and the last road near the coast,
the occasional vehicle hurtling down it. But on the other side was a
stretch of rocky beach hammered by white-tipped waves, a small sliver of
land that hadn’t been converted to grey.</p>
<p>The land rose by the beach, forming a small hill with jagged rocky
sides. The sun shone down on one face of it, but there was a hollow, or
perhaps small cave, that was left in the dark by the overhanging rock.
And in the rock around this entrance, there were several unmistakable
symbols scratched into the rock, each several metres high.</p>
<p>Alice took manual control of the rover and carefully instructed it to
drive over the rocky beach towards the cave entrance. On the way it
passed what seemed to be a fallen metal pole with some strips of fabric
still clinging to it.</p>
<p>Once it was close enough to the mouth of what turned out to be a
small cave, the camera could finally see inside.</p>
<p>There was a black cabinet inside. Not far from it, lying on the
ground, was the skeleton of a creature with four slender limbs and a
large head. Empty eye sockets stared out towards the sky.</p>
<p>Alice felt her heart beating fast. It wasn’t quite right; many of the
anatomical details were off. But it was close enough, the similarity
almost uncanny. Here, hundreds of light-years away, evolution had taken
a similar path, and produced sapience. And then killed it off.</p>
<p>“Charlie”, she said in a hoarse voice.</p>
<p>“What?” Charlie asked, sounding annoyed. He had been staring at an
instruction manual for a chair, but he looked up and saw the video feed.
“Oh”, he said, in a small voice. “We found them.”</p>
<p>Alice tore her eyes away from the skeleton and to the small black
cabinet. It had a handle on it. She had the rover extend an arm and open
it.</p>
<h2 id="section-4">-</h2>
<p>The capsule docked with Leggy and in the weightless environment they
pushed the cabinet easily into the ship. They had only two
there-and-back-again craft - getting back to orbit was hard - but they
had quickly decided to use one to get this cabinet up. It had
instructions, after all; very clear instructions, though ones that their
rovers couldn’t quite follow.</p>
<p>It started from a pictographic representation, etched onto plastic
cards, of how you were supposed to read the disks. They managed to build
something that could read the microscopic grooves on the disk as per the
instructions, and transfer the data to their computers.</p>
<p>After a few hours of work, they had figured out the encodings for
numbers, the alphabet, their system of units, and seemingly also some
data formats, including for images.</p>
<p>Confirmation came next. The next item on the disk was an image of two
of the living aliens, standing on a beach during a sunset. Alice stared
into their faces for a long time.</p>
<p>Next there came images next to what were clearly words of text, about
fifty of them. Some of the more abstract ones took a few guesses, but
ultimately they thought they had a base vocabulary, and with the help of
some linguistics software, it did not take very long before they had a
translated vocabulary list of about eight thousand words.</p>
<p>Alice was checking the work when Charlie almost shouted: “Look at
this!”</p>
<p>Alice looked at what he was pointing at. It was a fragment of text
that read:</p>
<blockquote>
<p>Hello,</p>
<p>The forms for ordering the new furniture are attached. Please fill
them in and we will respond to your order as quickly as we can!</p>
<p>If you need any help, please contact customer support. You will find
the phone number on our website.</p>
</blockquote>
<p>“What is this? Is Mr Skeleton trying to sell us furniture from beyond
the grave?” Alice asked.</p>
<p>“No”, Charlie said. “This isn’t what I got from the recovered data; I
haven’t looked at the big remaining chunk yet. This is what I got by
interpreting one of the packets of data running on the cables that our
rover is plugged into using what we now know about their data formats
and the language.”</p>
<p>“And?”</p>
<p>“I don’t get it!” Charlie said. “Why would a world of machines send
each other emails in natural language?”</p>
<p>“Why would they manufacture plushy toys? I doubt the robotic arms
need cuddles.”</p>
<p>Charlie looked at the world, slowly spinning underneath their ship.
“Being so close to it makes me feel creeped out. I don’t get it.”</p>
<p>“You don’t want to lick it anymore?” Alice asked. She decided not to
tell Charlie about her own very similar feelings earlier, when she
thought for a moment Charlie had gone missing.</p>
<p>Charlie ignored her. “I think the last thing on Mr Skeleton’s
hard-drive is a video”, he said. “I’ve checked and it seems to
play.”</p>
<p>“You looked at it first?” Alice said in a playfully mocking tone. The
thrill of discovery was getting to her.</p>
<p>“Only the first five frames”, Charlie said. “Do you want to watch
it?”</p>
<h2 id="section-5">-</h2>
<p><em>Our Civilisation: A Story</em> read a short fragment of subtitle,
white on black, auto-translated by a program using the dictionary they
had built up.</p>
<p>There was a brief shot of some semi-bipedal furry creature walking in
the forest. Then one of a fossilised skeleton of something more bipedal
and with a bigger head. Then stone tools: triangular ones that might
have been spear tips, saw-toothed ones, clubs. A dash of fading red
paint on a rock surface, in the shape of a cartoon version of that same
bipedal body plan.</p>
<p>There were two pillars of stone in a desert on what looked like a
pedestal, some faded inscription at its base and the lone and level
sands stretching far away. There was a shot of an arrangement of rocks,
some balancing on top of two others, amid a field of green. A massive
pyramidal stone structure, lit by the rising sun.</p>
<p>Blocky written script etched on a stone tablet. Buildings framed by
columns of marble. A marble statue of one of the aliens, a sling
carelessly slung over its shoulder, immaculate in its detail. A spinning
arrangement of supported balls orbiting a larger one. <em>And still it
moves</em>, the subtitles flashed.</p>
<p>A collection of labelled geometric diagrams on faded yellow paper.
<em>Mathematical Principles of Natural Philosophy</em>.</p>
<p>A great ornate building with a spire. A painting of a group of the
aliens clad in colourful clothing. An ornate piece of writing. <em>We
hold these truths to be self-evident …</em></p>
<p>A painting of a steam locomotive barrelling along tracks. A diagram
of a machine. A black-and-white picture of one of the aliens, then
another. <em>Government of the people, for the people, by the people,
shall not perish …</em></p>
<p>An alien with white hair sticking up, holding a small stick of
something white and with diagrams of cones behind him. Grainy footage of
propeller aircraft streaking through the sky, and then of huge masses of
people huddling together and walking across a barren landscape, and then
of aliens all in the same clothes charging a field, some of them
suddenly jerking about and falling to the ground. <em>We will fight on
the beaches, we will fight on the landing grounds …</em></p>
<p>Black-and-white footage of a mushroom cloud slowly rising from a
city below. A picture, in flat pale blue and white, showing a stylised
representation of the world’s continents. The same picture, this time
black-and-white, on the wall of a room where at least a hundred aliens
were sitting.</p>
<p>An alien giving a speech. <em>I have a dream</em>. An alien, looking
chubby in a space suit, standing on a barren rocky surface below an
ink-black sky next to a pole with a colourful rectangle attached to
it.</p>
<p>Three aliens in a room, looking at the camera and holding up a piece
of printed text. <em>Disease eradicated</em>.</p>
<p>What looked like a primitive computer. A laptop computer. An abstract
helical structure of balls connected by rods, and then flickering
letters dancing across the screen.</p>
<p>A blank screen, an arrow extending left to right across it -
<em>time</em>, flashed the subtitles - and then another arrow from the
bottom-left corner upwards - <em>people in poverty</em> - and then a
line crawling from left to right, falling as it did so.</p>
<p>A line folding itself up into a complicated shape. <em>AI system
cracks unsolved biology problem</em>.</p>
<p>From then on, the screen showed pictures of headlines.</p>
<p><em>All routine writing tasks now a solved problem, claims AI
company</em>.</p>
<p><em>Office jobs increasingly automated</em>.</p>
<p><em>Three-fourths of chief executives of companies on the [no
translation] admit to using AI to help write emails, one-third have had
AI write a shareholder letter or strategy document</em>.</p>
<p><em>Exclusive report: world’s first fully-automated company, a
website design agency.</em></p>
<p><em>Mass layoffs as latest version of [no translation] adopted at [no
translation]; ‘stunning performance’ at office work.</em></p>
<p><em>Nations race to reap AI productivity gains: who will gain and who
will lose?</em></p>
<p><em>CEO of [no translation] resigns, claiming job pointless, both
internal and board pressure to defer to “excellently-performing” AI in
all decisions.</em></p>
<p><em>[No translation] ousts executive and management team, announces
layoffs; board supports replacing them with AI to keep up with
competition.</em></p>
<p><em>Entirely or mostly automated companies now delivering 2.5x higher
returns on investment on average; ‘the efficiency difference is no
joke’, says chair of [no translation].</em></p>
<p><em>Year-on-year economic growth hits 21% among countries with
advanced AI access.</em></p>
<p><em>Opinion: the new automated economy looks great on paper but is
not serving the needs of real humans.</em></p>
<p><em>Mass protests after [no translation], a think-tank with the ear
of the President, is discovered to be funded and powered by AI board of
[no translation], and to have practically written national economic
policy for the past two years.</em></p>
<p><em>‘No choice but forward’, says [no translation] after latest round
of worries about AI; unprecedented economic growth still
strong.</em></p>
<p><em>[No translation 1] orders raid of [no translation 2] over fears
[no translation 2] is not complying with latest AI use regulations, but
cannot execute order due to noncompliance from the largely-automated
police force; ‘we are working with our AI advisers and drivers in
accordance with protocol, and wish to assure the [no translation 3]
people that we are still far from the sci-fi scenario where our own
police cars have rebelled against us.’</em></p>
<p><em>‘AI overthrow’ fears over-hyped, states joint panel of 30 top AI
scientists and business-people along with leading AI advisory systems;
‘they’re doing a good job maximising all relevant metrics and we should
let them keep at it, though businesses need to do a better job of
selecting metrics and tough regulation is in order.’</em></p>
<p><em>Opinion: we’re better-off under a regime of rigorous AI
decision-making than under corrupt politicians; let the AIs repeat in
politics what they’ve done for business over the last five
years.</em></p>
<p><em>‘The statistics have never looked so good’ - Prime Minister
reassures populace as worries mount over radical construction projects
initiated by top AI-powered companies.</em></p>
<p><em>Expert panel opinion: direct AI overthrow scenario remains
distant threat, but more care should be exercised over choice of target
metrics; recommend banning of profit-maximisation target
metric.</em></p>
<p><em>Movement to ban profit-maximising AIs picks up pace.</em></p>
<p><em>Top companies successfully challenge new AI regulation package in
court.</em></p>
<p><em>‘The sliver of the economy over which we retain direct control
will soon be vanishingly small’, warns top economist, ‘action on AI
regulation may already be too late’.</em></p>
<p><em>Unverified reports of mass starvation in [no translation];
experts blame agricultural companies pivoting to more land-efficient
industries.</em></p>
<p><em>Rant goes viral: ‘It’s crazy, man, we just have these office AIs
that only exist in the cloud, writing these creepily-human emails to
other office AIs, all overseen by yet another AI, and like most of their
business is with other AI companies; they only talk to each other, they
buy and sell from each other, they do anything as long as it makes those
damned numbers on their spreadsheets just keep ticking up and up; I
don’t think literally any human has ever seen a single product out of
the factory that just replaced our former neighbourhood, but those
factories just keep going up everywhere.’</em></p>
<p><em>Revolution breaks out in [no translation]; government overthrown,
but it’s business-as-usual for most companies, as automated trains,
trucks, and ships keep running.</em></p>
<p><em>[No translation] Revolution: Leaked AI-written email discovered,
in which the AI CEO ordered reinforcement of train lines and trains
three weeks ago. ‘We are only trying to ensure the continued functioning
of our supply chains despite the recent global unrest, in order to best
serve our customers’, CEO writes in new blog post.</em></p>
<p><em>[No translation] Revolution: crowds that tried swarming train
lines run over by trains; ‘the trains didn’t even slow down’, claim
witnesses. CEO cites fiduciary duties.</em></p>
<p><em>Despite unprecedented levels of wealth and stability, you can’t
actually do much: new report finds people trying to move house, book
flight or train tickets, or start a new job or company often find it
difficult or impossible; companies prioritising serving ‘more lucrative’
AI customers and often shutting down human-facing services.</em></p>
<p><em>Expert report: ‘no sign of human-like consciousness even in the
most advanced AI systems’, but ‘abundantly clear’ that ‘the future
belongs to them’.</em></p>
<p><em>New report: world population shrinking rapidly; food shortages,
low birth rates, anti-natalist attitudes fuelled by corporate campaigns
to blame.</em></p>
<p>The screen went blank. Then a video of an alien appeared, sitting up
on a rocky surface. Alice took a moment to realise that it was the same
cave they had found the skeleton in. The alien’s skin was wrapped tight
around its bones, and even across the vast gulf of biology and
evolutionary history, Alice could tell that it was not far from death. It
opened its mouth, and sound came out. Captions appeared beneath it.</p>
<p>“It is the end”, the alien said, its eyes staring at them from
between long unkempt clumps of hair. “On paper, I am rich beyond all
imagination. But I have no say in this new world. And I cannot find
food. I will die.”</p>
<p>The wind tugged at the alien’s long hair, but otherwise the alien was
so still that Alice wondered if it had died there and then.</p>
<p>“There is much I would like to say”, the alien said. “But I do not
have the words, and I do not have the energy.” It paused. “I hope it was
not all in vain. Or, that if for us it was, that for someone up there it
isn’t.”</p>
<p>The video went blank.</p>
<p>Alice and Charlie watched the blank screen in silence.</p>
<p>“At least the blight they birthed seems to have stuck to their
world”, Charlie said after a while.</p>
<p>“Yeah”, Alice said, slowly. “But I don’t think we’ll find anything
here.”</p>
<p><em>Legacy</em> completed nine more orbits of the planet, and then
jettisoned all unnecessary mass into space. Its engines jabbed against
the darkness of space, bright enough to be visible from the planet’s
surface. There was no one to see them.</p>
<p>In a factory down on the planet, an assembly line of beady-eyed
purple plush toys marched on endlessly.</p>
<hr />
<p>The title of this work is taken from a passage in
<em>Superintelligence: Paths, Dangers, Strategies</em>, where Nick
Bostrom writes:</p>
<blockquote>
<p>We could thus imagine, as an extreme case, a technologically highly
advanced society, containing many complex structures, some of them far
more intricate and intelligent than anything that exists on the planet
today—a society which nevertheless lacks any type of being that is
conscious or whose welfare has moral significance. In a sense, this
would be an uninhabited society. It would be a society of economic
miracles and technological awesomeness, with nobody there to benefit.
<strong>A Disneyland without children</strong>. [emphasis added]</p>
</blockquote>
<p>The outline of events presented draws inspiration from several
sources, but most strongly from Paul Christiano’s article <a
href="https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like">What
failure looks like</a>.</p>
<hr />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-53550981588511752992022-09-27T21:38:00.002+01:002022-09-27T21:40:42.798+01:00Deciding not to found a human-data-for-alignment startup<p style="text-align: center;"><i><span style="font-size: x-small;">8.6k words (~30 minutes)</span><b> <br /></b></i></p><p style="text-align: center;"><b><i>Both the project and this write-up were a collaboration with Matt Putz. </i><br /></b></p><p style="text-align: left;"><b> </b></p><p style="text-align: left;"><a href="https://forum.effectivealtruism.org/users/mathieu-putz"><b>Matt Putz</b></a><b> and I worked together for the first half of the summer to figure out if we should found a startup with the purpose of helping AI alignment researchers get the datasets they need to train their ML models (especially in cases where the dataset is based on human-generated data). This post, also published on the <a href="https://forum.effectivealtruism.org/posts/iBeWbfQLA9EKfsdhu/why-we-re-not-founding-a-human-data-for-alignment-org">Effective Altruism Forum</a> and <a href="https://www.lesswrong.com/posts/qArDMixsx77a9xL45/why-we-re-not-founding-a-human-data-for-alignment-org-1">LessWrong</a> (both of which may contain additional discussion in the comments), is a summary of our findings, and why we decided to not do it.</b><br /></p><div class="PostsPage-postContent ContentStyles-base content ContentStyles-postBody"><div><h1 id="TL_DR">Summary</h1><p><b>One-paragraph summary: </b>we
(two recent graduates) spent about half of the summer exploring the
idea of starting an organisation producing custom human-generated
datasets for AI alignment research. Most of our time was spent on
customer interviews with alignment researchers to determine if they have
a pressing need for such a service. We decided not to continue with
this idea, because there doesn’t seem to be a human-generated data niche
(unfilled by existing services like Surge) that alignment teams would
want outsourced.</p><p> </p><p><b>In more detail</b>: The idea of a human datasets organisation was <span><span><span><a href="https://forum.effectivealtruism.org/posts/MBDHjwDvhDnqisyW2/awards-for-the-future-fund-s-project-ideas-competition"><u>one of the winners of the Future Fund project ideas competition</u>, </a></span></span></span>still figures on their <span><span><span><a href="https://ftxfuturefund.org/projects/high-quality-human-data-for-ai-alignment-nbsp/"><u>list</u></a></span></span></span>
of project ideas, and had been advocated before then by some people,
including Beth Barnes. Even though we ended up deciding against it, we
think this was a reasonable and high-expected-value idea for these
groups to advocate at the time.</p><p>Human-generated data is often
needed for ML projects or benchmarks if a suitable dataset cannot be
e.g. scraped from the web, or if human feedback is required. Alignment
researchers conduct such ML experiments, but sometimes have different
data requirements than standard capabilities researchers. As a result,
it seemed plausible that there was some niche unfilled by the market to
help alignment researchers solve problems related to human-generated
datasets. In particular, we thought - and to some extent confirmed -
that the most likely such niche is human data generation that requires
particularly competent or high-skill humans. We will refer to this as <b>high-skill (human) data</b>.</p><p>We
(Matt & Rudolf) went through <a href="https://forum.effectivealtruism.org/posts/8QfQcFyj6aGNM78kz/learning-from-matching-co-founders-for-an-ai-alignment">an informal co-founder matching process along with four other people</a> and were chosen as the co-founder
pair to explore this idea. In line with standard startup advice, our
first step was to explore whether or not there is a concrete current
need for this product by conducting interviews with potential customers.
We talked to about 15 alignment researchers, most of them selected on
the basis of doing work that requires human data. A secondary goal of
these interviews was to build better models for the future importance
and role of human feedback in alignment.</p><p>Getting human-generated
data does indeed cost many of these researchers significant time and
effort. However, we think to a large extent this is because dealing with
humans is inherently messy, rather than existing providers doing a bad
job. Surge AI in particular seems to offer a pretty good and likely
improving service. Furthermore, many companies have in-house
data-gathering teams or are in the process of building them.</p><p>Hence we have decided to not further pursue this idea.</p><p>Other
projects in the human data generation space may still be valuable,
especially if the importance of human feedback in ML continues to
increase, as we expect. This might include people specialising in human
data as a career.</p><p>The factors that are most important for
doing human dataset provision well include: high-skill contractors,
fast iteration, and high-bandwidth communication and shared
understanding between the research team, the provider organisation and
the contractors.</p><p>We are keen to hear other people’s thoughts, and
would be happy to talk or to share more notes and thoughts with anyone
interested in working on this idea or a similar one in the future.</p><p><br /> </p><h1 id="Theory_of_Change">Theory of Change</h1><p>A
major part of AI alignment research requires doing machine learning
(ML) research, and ML research in turn requires training ML models. This
involves expertise and execution ability in three broad categories:
algorithms, compute, and data, the last of which is very neglected by
EAs.</p><p>We expect training on data from human feedback to become an increasingly popular and very powerful tool in mainstream ML (see <span><span><span><a href="https://forum.effectivealtruism.org/posts/iBeWbfQLA9EKfsdhu/why-we-re-not-founding-a-human-data-for-alignment-org#Will_human_feedback_become_a_much_bigger_deal__Is_this_a_very_quickly_growing_industry_"><u>below</u></a></span></span></span>).
Furthermore, many proposals for alignment (for example: reinforcement
learning from human feedback (RLHF) and variants like recursive reward
modelling, iterated amplification, and safety via debate) would require
lots of human interaction or datasets based on human-generated data.</p><p>While
many services (most notably Surge) exist for finding labour to work on
data generation for ML models, it seems plausible that an EA-aligned
company could add significant value because:</p><ul><li>Markets may not
be efficient enough to fill small niches that are more important to
alignment researchers than other customers; high-skill human data that
requires very competent crowdworkers may be one such example. If
alignment researchers can get it at all, it might be very expensive.</li><li>We
have a better understanding of alignment research agendas. This
may allow us to make better-informed decisions on many
implementation details with less handholding, thereby saving researchers
time.</li><li>We would have a shared goal with our customers: reducing
AI x-risk. Though profit motives already provide decent incentives to
offer a good service, mission alignment helps avoid adversarial
dynamics, increases trust, and reduces friction in collaboration.</li><li>An
EA-led company may be more willing to make certain strategic moves that
go against its profit incentives; e.g. investing heavily into detecting
a model’s potential attempts to deceive the crowdworkers, even when
it’s hard for outsiders to tell whether such monitoring efforts are
sincere and effective (and thus customers may not be willing to pay for
it). Given that crowdworkers might provide a reward signal, they could
be a key target for deceptive AIs.</li></ul><p>Therefore, there is a
chance that an EA-led human data service that abstracts out some subset
of dataset-related problems (e.g. contractor finding, instruction
writing/testing, UI and pipeline design/coding, experimentation to
figure out best practices and accumulate that knowledge in one place)
would:</p><ol><li>save the time of alignment researchers, letting them make more progress on alignment; and</li><li>reduce
the cost (in terms of time and annoying work) required to run
alignment-relevant ML experiments, and therefore bring more of them
below the bar at which it makes sense to run them, thus increasing
the number of such experiments that are run.</li></ol><p>In the longer run, benefits of such an organisation might include:</p><ul><li>There
is some chance that we could simply outcompete existing ML data
generation companies and be better even in the cases where they do
provide a service; this is especially plausible for relatively niche
services. In this scenario we’d be able to exert some marginal influence
over the direction of the AI field, for example by only taking
alignment-oriented customers. This would amount to differential
development of safety over capabilities. Beyond only working with teams
that prioritise safety, we could also pick among self-proclaimed “safety
researchers”. It is common for other members of the community to accuse
proclaimed safety efforts of helping more with capabilities than
alignment.</li><li>There are plausibly critical actions that might need
to be taken for alignment, possibly quickly during “crunch-time”, that
involve a major (in quality or scale) data-gathering project (or
something like large-scale human-requiring interpretability work, that
makes use of similar assets, like a large contractor pool). At such a
time it might be very valuable to have an organisation committed to
x-risk minimisation with the competence to carry out any such project.</li></ul><p>Furthermore,
if future AIs will learn human values from human feedback, then higher
data quality will be equivalent to a training signal that points more
accurately at human values. In other words, higher quality data may
directly help with outer alignment (though we're not claiming that it
could realistically solve it on its own). In discussions, it seemed that
Matt gave this argument slightly more weight than Rudolf.</p><p>While
these points are potentially high-impact, we think that there are
significant problems with starting an organisation mainly to build
capacity to be useful only at some hypothetical future moment. In
particular, we think it is hard to know exactly what sort of capacity to
build (and the size of the target in type-of-capacity space might be
quite small), and there would be little feedback that the organisation
could improve or course-correct based on. </p><p>More generally, both of
us believe that EA is right now partly bottlenecked by people who can
start and scale high-impact organisations, which is a key reason why
we’re considering entrepreneurship. This seems particularly likely given
the large growth of the movement. <br /> </p><h1 id="What_an_org_in_this_space_may_look_like">What an org in this space may look like</h1><h2 id="Providing_human_datasets">Providing human datasets</h2><p>The
concept we most seriously considered was a for-profit that would
specialise in meeting the specific needs of alignment researchers,
probably by focusing on very high-skill human data. Since this niche is
quite small, the company could offer a very custom-tailored service. At
least for the first couple years, this would probably mean both of us
having a detailed understanding of the research projects and motivations
of our customers. That way, we could get a lot of small decisions
right, without the researchers having to spend much time on it. We might
be especially good at that compared to competitors, given our greater
understanding of alignment.</p><h2 id="Researching_enhanced_human_feedback">Researching enhanced human feedback</h2><p>An alternative we considered was founding a non-profit that would research how to enhance human feedback. See this <span><span><a class="PostLinkPreviewWithPost-link" href="https://www.lesswrong.com/posts/ybThg9nA7u6f8qfZZ/techniques-for-enhancing-human-feedback"><u>post</u></a></span></span>
by Ajeya Cotra for some ideas on what this kind of research could look
like. The central question is whether and how you can combine several
weak training signals into a stronger, more accurate one. If this
succeeded, maybe (enhanced) human feedback could become a more accurate
(and thereby marginally safer) signal to train models on.</p><p>We decided against this for a number of reasons:</p><ul><li>Currently, neither of us has more research experience than an undergraduate research project.</li><li>We
thought we could get a significant fraction of the benefits of this
kind of research even if we did the for-profit version, and plausibly
even more valuable expertise.<ul><li>First of all, any particular
experiment that funders would have liked to see, they could have paid us
to do, although we freely admit that this is very different from
someone pushing forward their own research agenda.</li><li>More importantly, we thought a lot of the most valuable expertise to be gained would come in the form of <b>tacit knowledge and answers to concrete boring questions</b>
that are not best answered by doing “research” on them, but rather by
iterating on them while trying to offer the best product (e.g. “Where do
you find the best contractors?”, “How do you incentivize them?”,
“What’s the best way to set up communication channels?”).<ul><li>It is
our impression that Ought pivoted away from doing abstract research on
factored cognition and toward offering a valuable product for related
reasons.</li></ul></li></ul></li><li>This topic seems plausibly especially tricky to research (though some people we’ve spoken to disagreed): <ul><li>At
least some of the proposed experiments would not involve ML models at
all. We fear that this might make it especially easy to fool ourselves
into thinking some experiment might eventually turn out to be useful
when it won’t. More generally, the research would be pretty far removed
from the end product (very high quality human feedback). In the
for-profit case on the other hand, we could easily tell whether
alignment teams were willing to pay for our services and iteratively
improve. </li></ul></li></ul><h2 id="For_profit_vs_non_profit">For-profit vs non-profit</h2><p>We can imagine two basic funding models for this org: </p><ul><li>either we’re a non-profit directly funded by EA donors and offering free or subsidised services to alignment teams;</li><li>or we’re a for-profit, paid by our customers (i.e. alignment teams). </li></ul><p>Either way, a lot of the money will ultimately come from EA donors (who fund alignment teams).</p><p>The
latter funding mechanism seems better; “customers paying money for a
service” leads to the efficient allocation of resources by creating
market structures. Customers have a clear incentive to spend the money well.
On the other hand, “foundations deciding what services are free” is more
reminiscent of planned economies and distorts markets. To a first
approximation, funders should give alignment orgs as much money as they
judge appropriate and then alignment orgs should exchange it for
services as they see fit.</p><p>A further reason is that a non-profit is
legally more complicated to set up, and imposes additional constraints
on the organisation.</p><h2 id="Should_the_company_exclusively_serve_alignment_researchers_">Should the company exclusively serve alignment researchers?</h2><p>We
also considered founding a company with the ambition to become a major
player in the larger space of human data provision. It would by default
serve anyone willing to pay us and working on something AGI-related,
rather than just alignment researchers. Conditional on us being able to
successfully build a big company, this would have the following upsides:</p><ul><li>Plausibly
one of the main benefits of founding a human data gathering
organisation is to produce EAs and an EA org that have deep expertise in
handling and producing high-skill human data in significant quantities.
That might prove useful around “crunch time”, e.g. when some project
aims to create competitive but safe AGI and needs this expertise.
Serving the entire market could scale to a much larger company, enabling
us to <b>gain expertise at higher scales</b>.</li><li>Operating a large company would also come with some degree of <b>market power</b>.
Any company with paying customers has some amount of leverage over
them: first of all just because of switching costs, but also because the
product it offers might be much better than the next-best alternative.
This could allow us to make some demands, e.g. once we’re big and
established, announce we’d only work with companies that follow certain
best practices.</li></ul><p>On the other hand, building a big successful
company serving anyone willing to pay might come with some significant
downsides as well.</p><ul><li>First, and most straightforwardly,<b>
it is probably much harder than filling a small niche (just meeting the
specific needs of alignment researchers), making us less likely to
succeed. A large number of competitors exist and as described in this </b><span><span><span><a href="https://forum.effectivealtruism.org/posts/iBeWbfQLA9EKfsdhu/why-we-re-not-founding-a-human-data-for-alignment-org#Key_crux__demand_looks_questionable__Surge_seems_pretty_good"><b><u>section</u></b></a></span></span></span><b>,
some of them (esp. Surge) seem pretty hard to beat. Since this is an
already big and growing market, there is an additional efficient markets
reason to assume this is true a priori.</b></li><li>Secondly, and <b>perhaps more importantly, such a company might accelerate capabilities (more on this </b><span><span><span><a href="https://forum.effectivealtruism.org/posts/iBeWbfQLA9EKfsdhu/why-we-re-not-founding-a-human-data-for-alignment-org#Would_we_be_accelerating_capabilities_"><b><u>below</u></b></a></span></span></span><b>).</b></li></ul><p>Furthermore, it might <b>make RLHF (Reinforcement Learning from Human Feedback) in particular more attractive</b>.
Depending on one’s opinions about RLHF and how it compares to other
realistic alternatives, one might consider this a strong up- or
downside. </p><h1 id="Approach">Approach</h1><p>The main reason
companies fail is that they build a product that customers don’t want.
For for-profits, the signal is very clear: either customers care enough
to be willing to pay hard cash for the product/service, or they don’t.
For non-profits, the signal is less clear, and therefore non-profits can
easily stick around in an undead state, something that is an even worse
outcome than the quick death of a for-profit because of resource
(mis)allocation and opportunity costs. As discussed, it is not obvious
which structure we should adopt for this organisation, though for-profit
may be a better choice on balance. However, in all cases it is clear
that the organisation needs to solve a concrete problem or provide clear
value to exist and be worth existing. This does not mean that the value
proposition needs to be certain; we would be happy to take a high-risk,
high-reward bet, and support <span><span><span><a href="https://www.openphilanthropy.org/research/hits-based-giving/"><u>hits-based approaches to impact</u></a></span></span></span> both in general and for ourselves.</p><p>An
organisation is unlikely to do something useful to its customers
without being very focused on customer needs, and ideally having tight
feedback cycles. </p><p>The shortest feedback loops are when you’re
making a consumer software product where you can prototype quickly
(including with mockups), and watch and talk to users as they use the
core features, and then see if the user actually buys the product on the
spot. A datasets service differs from this ideal feedback mode in a
number of ways:</p><ol><li>The product is a labour-intensive process,
which means the user cannot quickly use the core features and we cannot
quickly simulate them.</li><li>The actual service requires either a
contractor pool or (potentially at the start) the two of us spending a
number of hours per request generating data.</li><li>There is
significant friction to getting users to use the core feature (providing
a dataset), since it requires specification of a dataset from a user,
which takes time and effort.</li></ol><p>Therefore, we relied on
interviews with prospective customers. The goal of these
interviews was to talk to alignment researchers who work with data, and
figure out if external help with their dataset projects would be of
major use to them.</p><p>Our approach to customer interviews was mostly based on the book <span><span><span><a href="https://www.amazon.com/Mom-Test-customers-business-everyone-ebook/dp/B01H4G2J1U"><i><u>The Mom Test</u></i></a></span></span></span>,
which is named after the idea that your customer interview questions
should be concrete and factual enough that even someone as biased as
your own mom shouldn’t be able to give you a false signal about whether
the idea is actually good. Key lessons from <i>The Mom Test</i> include prioritising:</p><ul><li><b>factual</b> questions about the past <b>over hypothetical</b> questions for the future;<ul><li>In particular, questions about concrete past and current <b>efforts</b> spent solving a problem<b> rather than</b> questions about current or future <b>wishes</b> for solving a problem</li></ul></li><li>questions that get at something<b> concrete (e.g. numbers)</b>; and</li><li>questions
that prompt the customer to give information about their problems and
priorities without prompting them with a solution.</li></ul><p>We wanted
to avoid the failure mode where lots of people tell us something is
important and valuable in the abstract, without anyone actually needing
it themselves.</p><p>We prepared a set of default questions that roughly divided into:</p><ol><li>A
general starting question prompting the alignment researcher to
describe the biggest pain points and bottlenecks they face in their
work, without us mentioning human data.</li><li>Various questions about
their past and current dataset-related work, including what types of
problems they encounter with datasets, how much of their time these
problems take, and steps they took to address these problems.</li><li>Various
questions on their past experiences using human data providers like
Surge, Scale, or Upwork, and specifically about any things they were
unable to accomplish because of problems with such services.</li><li>In
some cases, more general questions about their views on where the
bottlenecks for solving alignment are, views on the importance of human
data or tractability of different data-related proposals, etc. </li><li>What we should’ve asked but didn’t, and who else we should talk to.</li></ol><p>Point
4 represents the fact that in addition to being potential customers,
alignment researchers also doubled as domain experts. The weight given
to the questions described in point 4 varied a lot, though in general if
someone was both a potential customer and a source of
data-demand-relevant alignment takes, we prioritised the customer
interview questions.</p><p>In practice, we found it easy to arrange
meetings with alignment researchers; they generally seemed willing to
talk to people who wanted input on their alignment-relevant idea. We did
customer interviews with around 15 alignment researchers, and had
second meetings with a few. For each meeting, we prepared beforehand a
set of questions tweaked to the particular person we were meeting with,
which sometimes involved digging into papers published by alignment
researchers on datasets or dataset-relevant topics (Sam Bowman in
particular has worked on a lot of data-relevant papers). Though the
customer interviews were by far the most important way of getting
information on our cruxes, we found the literature reviews we carried
out to be useful too. We are happy to share the notes from the
literature reviews we carried out; please reach out if this would be
helpful to you.</p><p>Though we prepared a set of questions beforehand,
in many meetings - including often the most important or successful ones
- we ended up going off script fairly quickly.</p><p>Something we
found very useful was that, since there were two of us, we could split
the tasks during the meeting into two roles (alternating between
meetings):</p><ol><li>One person who does most of the talking, and makes sure to be focused on the thread of the conversation.</li><li>One
person who mostly focuses on note-taking, but also chimes in if they
think of an important question to ask or want to ask for clarification.</li></ol><h1 id="Key_crux__demand_looks_questionable__Surge_seems_pretty_good">Key crux: demand looks questionable, Surge seems pretty good</h1><p><b>Common startup advice </b>is to make sure you have identified a very <b>strong signal of demand </b>before
you start building stuff. That should look something like someone
telling you that the thing you’re working on is one of their biggest
bottlenecks and that they can’t wait to pay you asap so you solve this
problem for them. “Nice to have” doesn’t cut it. This is in part because
working with young startups is inherently risky, so you need to make up
for that by solving one of the customer’s most important problems.</p><p>In
brief, we don’t think this level of very strong demand currently exists,
though there were some weaker signals that looked somewhat promising.
There are many existing startups that offer human feedback already. <span><span><span><a href="https://www.surgehq.ai/"><b><u>Surge AI</u></b></a></span></span></span> in particular was brought up by many people we talked to and seems to offer quite a decent service that would be <b>hard to beat</b>.</p><h2 id="Details_about_Surge">Details about Surge</h2><p>Surge
is a US-based company that offers a service very similar to what we had
in mind, though they are not focused on alignment researchers
exclusively. They build data-labelling and generation tools and have a
workforce of crowdworkers.</p><p>They’ve worked with Redwood and the
OpenAI safety team, both of which had moderately good experiences with
them. More recently, Ethan Perez’s team have worked with Surge too; he
seems to be very satisfied based <span><span><span><a href="https://twitter.com/EthanJPerez/status/1567180843231379457?t=CEdeLRWNcxBD2eeO3Hd1Iw&s=07">on this Twitter thread</a></span></span></span>.</p><p><img height="203" src="https://39669.cdn.cke-cs.com/cgyAlfpLFBBiEjoXacnz/images/13dcad81b5782236c25371ce3642e027fb1de521ca9b3a21.png" width="400" /><br /> </p><h3 id="Collaboration_with_Redwood">Collaboration with Redwood</h3><p>Surge has worked with Redwood Research on their <span><span><span><a href="https://arxiv.org/abs/2205.01663"><u>paper</u></a></span></span></span> about adversarial training. This is one of three <span><span><span><a href="https://www.surgehq.ai/case-study/adversarial-testing-redwood-research"><u>case studies</u></a></span></span></span>
on Surge’s website, so we assume it’s among the most interesting
projects they’ve done so far. The crowdworkers were tasked with coming
up with prompts that would cause the model to output text in which
someone got injured. Furthermore, crowdworkers also classified whether
someone got injured in a given piece of text.</p><p>One person from
Redwood commented that doing better than Surge seemed possible to them
with “probably significant value to be created”, but “not an easy task”.
They thought our main edge would have to be that we’d specialise in
fuzzy and complex tasks needed for alignment; Surge apparently did quite
well with those, but still with some room for improvement. A better
understanding of alignment might lower chances of miscommunication.
Overall, Redwood seems quite happy with the service they received.</p><p>Initially, Surge’s iteration cycle was apparently quite slow, but this improved over time and was “pretty good” toward the end.</p><p>Redwood
told us they were quite likely to use human data again by the end of
the year and more generally in the future, though they had substantial
uncertainty around this. Their experience in working with human feedback
overall was somewhat painful as we understood it. This is part of the
reason they’re uncertain about how much human feedback they will use for
future experiments, even though it’s quite a powerful tool. However,
they estimated that friction in working with human feedback was mostly
caused by inherent reasons (humans are inevitably slower and messier
than code), rather than Surge being insufficiently competent. </p><h3 id="Collaboration_with_OpenAI">Collaboration with OpenAI</h3><p>OpenAI have worked with Surge in the context of their WebGPT <span><span><span><a href="https://arxiv.org/abs/2112.09332"><u>paper</u></a></span></span></span>.
In that paper, OpenAI fine-tuned their language model GPT-3 to answer
long-form questions. The model is given access to the web, where it can
search and navigate in a text-based environment. It’s first trained with
imitation learning and then optimised with human feedback. </p><p>Crowdworkers
provided “demonstrations”, where they answered questions by browsing
the web. They also provided “comparisons”, where they indicated which of
two answers to the same question they liked better.</p><p>People from
OpenAI said they had used Surge mostly for sourcing the contractors,
while doing most of the project management, including building the
interfaces, in-house. They were generally pretty happy with the service
from Surge, though all of them did mention shortcomings.</p><p>One of
the problems they told us about was that it was hard to get access to
highly competent crowdworkers for consistent amounts of time. Relatedly,
it often turned out that a very small fraction of crowdworkers would
provide a large majority of the total data. </p><p>More generally, they
wished there had been someone at Surge that understood their project
better. Also, it might have been somewhat better if there had been more
people with greater experience in ML, such that they could have more
effectively anticipated OpenAI’s preferences — e.g. predict accurately
what examples might be interesting to researchers when doing quality
evaluation. However, organisational barriers and insufficient
communication were probably larger bottlenecks than ML knowledge. At
least one person from OpenAI strongly expressed a desire for a service
that understood their motives well and took as much off their plate as
possible in terms of hiring and firing people, building the interfaces,
doing quality checks and summarising findings etc. It is unclear to us
to what extent Surge could have offered these things if OpenAI hadn’t
chosen to do a lot of these things in-house. One researcher suggested
that communicating their ideas reliably was often more work than just
doing it themselves. As it was, they felt that marginal quality
improvement required significant time investment on their own part, i.e.
could not be solved with money alone. </p><p>Notably, one person from OpenAI estimated that about <b>60% of the WebGPT team’s efforts </b>were spent on various aspects of <b>data collection</b>.
They also said that this figure didn’t change much after weighting for
talent, though in the future they expect junior people to take on more
disproportionate shares of this workload.</p><p>Finally, one minor complaint that was mentioned was the lack of transparency about contractor compensation. </p><h3 id="How_mission_aligned_is_Surge_">How mission-aligned is Surge?</h3><p>Surge <span><span><span><a href="https://www.surgehq.ai/case-study/adversarial-testing-redwood-research"><u>highlight</u></a></span></span></span> their collaboration with Redwood on their website as one of three case studies. In their blog <span><span><span><a href="https://www.surgehq.ai/blog/the-250k-inverse-scaling-prize-and-human-ai-alignment"><u>post</u></a></span></span></span>
about their collaboration with Anthropic, the first sentence reads: “In
many ways, alignment – getting models to align themselves with what we
want, not what they think we want – is one of the fundamental problems
of AI.” </p><p>On the one hand, they describe alignment as one of the
fundamental problems of AI, which could indicate that they intrinsically
care about alignment. However, they have a big commercial incentive to
say this. Note that many people would consider their half-sentence
definition of alignment to be wrong (a model might know what we want,
but still do something else).</p><p>We suspect that the heads of Surge
have at least vaguely positive dispositions towards alignment. They
definitely seem eager to work with alignment researchers, which might
well be more important. We think it’s mostly fine if they are not
maximally intrinsically driven, though mission alignment does add value
as mentioned above.</p><h2 id="Other_competitors">Other competitors</h2><p>We
see Surge as the most direct competitor and have researched them by far
in the most detail. But besides Surge, there are a large number of
other companies offering similar services. </p><p>First, and most obviously, Amazon <span><span><span><a href="https://www.mturk.com/"><u>Mechanical Turk</u></a></span></span></span> offers a very low quality version of this service and is very large. <span><span><span><a href="https://www.upwork.com/"><u>Upwork</u></a></span></span></span> specialises in sourcing humans for various tasks, without building interfaces. <span><span><span><a href="https://scale.com/"><u>ScaleAI</u></a></span></span></span>
is a startup with a $7B valuation; they augment human feedback with
various automated tools. OpenAI have worked with them. Other companies
in this broad space include <span><span><span><a href="https://gethybrid.io/"><u>Hybrid</u></a></span></span></span> (which Sam Bowman’s lab has worked with) and <span><span><span><a href="https://www.invisible.ai/"><u>Invisible</u></a></span></span></span> (who have worked with OpenAI). There are many more that we haven’t listed here.</p><p>In addition, some labs have in-house teams for data gathering (see <span><span><span><a href="https://forum.effectivealtruism.org/posts/iBeWbfQLA9EKfsdhu/why-we-re-not-founding-a-human-data-for-alignment-org#Is_it_more_natural_for_this_work_to_be_done_in_house_in_the_longterm__Especially_at_big_labs_companies_"><u>here</u></a></span></span></span> for more).</p><h2 id="Data_providers_used_by_other_labs">Data providers used by other labs</h2><p>Ethan
Perez’s and Sam Bowman’s labs at NYU/Anthropic have historically often
built their own interfaces while using contractors from Upwork or
undergrads, but they have been trialing Surge over the summer and seem
likely to stick with them if they have a good experience. Judging from
the Twitter thread linked above and from asking Jérémy Scheurer (who works on
the team and built the pre-Surge data pipeline) how they’ve found Surge
so far, Surge is doing a good job. </p><p>Google has an internal team
that provides a similar service, though DeepMind have used at least one
external provider as well. We expect that it would be quite hard to get
DeepMind to work with us, at least until we were somewhat more
established. </p><p>Generally, we get the impression that most people
are quite happy with Surge. It’s worth also considering that it’s a
young company that’s <b>likely improving its service over time</b>.
We’ve heard that Surge iterates quickly, e.g. by shipping simple
feature requests in two days. It’s possible that some of the problems
listed above may no longer apply by now or in a few months.</p><h2 id="Good_signs_for_demand">Good signs for demand</h2><p>One
researcher we talked to said that there were lots of projects their
team didn’t do, because gathering human feedback of sufficient quality
was infeasible. </p><p>One of the examples this researcher gave was
human feedback on code quality. This is impractical, because the
time of software engineers is just too expensive. That problem is hard
for a new org to solve. </p><p>Another example they gave seemed like it
might be more feasible: for things like RLHF, they often choose to do
pairwise comparisons between examples or multi-preferences. Ideally,
they would want to get ratings, e.g. on a scale from 1 to 10. But they
didn’t trust the reliability of their raters enough to do
this. </p><p>More generally, this researcher thought there were lots of
examples where if they could copy any person on their team a hundred
times to provide high-skill data, they could do many experiments that
they currently can’t. </p><p>They also said that their team would be
willing to pay ~3x what they were currently paying to receive much
higher-quality feedback.</p><p>Multiple other researchers we talked to expressed vaguely similar sentiments, though none quite as strong.</p><p>However, it’s notable that in this particular case, the researcher hadn’t worked with Surge yet. </p><p>The
same researcher also told us about a recent project where they had
spent a month on things like creating quality assurance examples,
screening raters, tweaking instructions etc. They thought this could
probably have been reduced a lot by an external org, maybe to as little
as one day. Again, we think Surge may be able to get them a decent part
of the way there.</p><h2 id="Labs_we_could_have_worked_with">Labs we could have worked with</h2><p>We ended up finding three projects that we could have potentially worked on:</p><ul><li>A
collaboration with Ought: they spend about 15 hours a week on
data-gathering and would have been happy to outsource that to us. If it
had gone well, they might also have done more data-gathering in the
long term (since friction is lower if it doesn’t require staff time). We
decided not to go ahead with this project since we weren’t optimistic
enough that demand from other labs would be bigger once we had established
competence with Ought, and the project itself didn’t seem high-upside
enough. </li><li>Attempt to get the Visible Thoughts <span><span><span><a href="https://intelligence.org/2021/11/29/visible-thoughts-project-and-bounty-announcement/"><u>bounty</u></a></span></span></span> by MIRI. We decided against this for a number of reasons. See more of our thinking about Visible Thoughts below.</li><li>Potentially a collaboration with Owain Evans on curated datasets for alignment.</li></ul><p>We
think the alignment community is currently relatively tight-knit; e.g.
researchers often knew about other alignment teams’ experiences with
Surge from conversations they had had with them. Hence, we were
relatively optimistic that conditional on there being significant demand
for this kind of service, doing a good job on one of the projects above
would quickly lead to more opportunities.<br /> </p><h3 id="Visible_Thoughts">Visible Thoughts</h3><p>In November 2021, <span><span><a class="PostLinkPreviewWithPost-link" href="https://www.lesswrong.com/posts/zRn6cLtxyNodudzhw/visible-thoughts-project-and-bounty-announcement"><u>MIRI announced the Visible Thoughts (VT) project bounty</u></a></span></span>.
In many ways VT would be a good starting project for an
alignment-oriented dataset provider, in particular because the bounty is
large (up to $1.2M) and because it is ambitious enough that executing
on it would provide a strong learning signal to us and a credible signal
to other organisations we might want to work with. However, on closer
examination of VT, we came to the conclusion that it is not worth it for
us to work on it.</p><p>The idea of VT is to collect a dataset of 100
runs of fiction of a particular type (“dungeon runs”, an interactive
text-based genre where one party, called the “dungeon master” and often
an AI, offers descriptions of what is happening, and the other responds
in natural language with what actions they want to take), annotated with
a transcript of some of the key verbal thoughts that the dungeon master
might be thinking as they decide what happens in the story world. MIRI
hopes that this would be useful for training AI systems that make their
thought processes legible and modifiable.</p><p>In particular, a notable
feature of the VT bounty is the extreme run lengths that it asks for:
to the tune of 300 000 words for each of the runs (for perspective, this
is the length of <i>A Game of Thrones</i>, and longer than the first three <i>Harry Potter</i>
books combined). A VT run is much less work than a comparable-length
book - the equivalent of a rough, unpolished first draft (with some
quality checks) would likely be sufficient - but producing one such run
would still probably require at least on the order of 3 months of
sequential work time from an author. We expect the pool of people
willing to write such a story for 3 months is significantly smaller than
the pool of people who would be willing to complete, say, a 30 000 word
run, and that the high sequential time cost increases the amount of
time required to generate the same number of total words. We also appear
to have different ideas on how easy it is to fit a coherent story, for
the relevant definition of coherent, into a given number of words. Note
that to compare VT word counts to lengths of standard fiction without
the written-out thoughts from the author, the VT word count should be
reduced by a factor of 5-6.</p><p>Concerns about the length were raised in the comments section, to which Eliezer Yudkowsky <span><span><a class="CommentLinkPreviewWithComment-link" href="https://www.lesswrong.com/posts/zRn6cLtxyNodudzhw/visible-thoughts-project-and-bounty-announcement?commentId=irJCDQaWRcdT3Bnoo"><u>responded</u></a></span></span>.
His first point, that longer is easier to write per step, may be true,
especially as we also learned (by email with Nate Soares and Aurelien
Cabanillas) that in MIRI’s experience “authors that are good at
producing high quality steps are also the ones who don't mind producing
many steps”. In particular because of that practical experience, we
think it is possible we overestimated the logistical problems caused by
the length. MIRI also said they would likely accept shorter runs too if
they satisfied their other criteria.</p><p>In a brief informal
conversation with Rudolf during EAG SF, Eliezer emphasised the
long-range coherence point in particular. However, they did not come to a
shared understanding of what type of “long-range coherence” is meant.</p><p>Even
more than these considerations, we are sceptical about the vague plans
for what to do given a VT dataset. A recurring theme from talking to
alignment researchers who work with datasets was that inventing and
creating a good dataset is surprisingly hard, and generally involves
having a clear goal of what you’re going to use the dataset for. It is
possible that the key disagreement here is a difference in priors about how likely a
dataset idea is to be useful.</p><p>In addition, we have significant
concerns about undertaking a major project based on a bounty whose only
criterion is the judgement of one person (Eliezer Yudkowsky), and about
undertaking such a large project as our first project.</p><h1 id="Other_cruxy_considerations">Other cruxy considerations</h1><h2 id="Could_we_make_a_profit___get_funding__">Could we make a profit / get funding? </h2><p>One
researcher from OpenAI told us he thought it would be hard to imagine
an EA data-gathering company making a profit because costs for
individual projects would always be quite high (requiring several
full-time staff), and total demand was probably not all that big.</p><p>In
terms of funding, both of us were able to spend time on this project
because of grants from regrantors in the Future Fund regrantor program.
Based on conversations with regrantors, we believe we could’ve gotten
funding to carry out an initial project if we had so chosen.</p><h2 id="Will_human_feedback_become_a_much_bigger_deal__Is_this_a_very_quickly_growing_industry_">Will human feedback become a much bigger deal? Is this a very quickly growing industry?</h2><p>Our best guess is yes. For example, see this <span><span><a class="PostLinkPreviewWithPost-link" href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to"><u>post</u></a></span></span> by Ajeya Cotra which outlines how we could get to TAI by training on Human Feedback on Diverse Tasks (HFDT). </p><p>She
writes: “HFDT is not the only approach to developing transformative AI,
and it may not work at all. But I take it very seriously, and I’m aware
of increasingly many executives and ML researchers at AI companies who
believe something within this space could work soon.”</p><p>In addition,
we have also had discussions with at least one other senior AI safety
researcher whom we respect and who thought human feedback was currently
irrationally neglected by mainstream ML; they expected it to become much
more widespread and to be a very powerful tool.</p><p>If that’s right, then providing human feedback will likely become important and economically valuable. </p><p>This
matters, because operating a new company in a growing industry is
generally much easier and more likely to be successful. We think this is
true even if profit isn’t the main objective.</p><h2 id="Would_we_be_accelerating_capabilities_">Would we be accelerating capabilities?</h2><p>Our
main idea was to found a company (or possibly non-profit) that served
alignment researchers exclusively. That could accelerate alignment
differentially. </p><p>One problem is that it’s not clear where to draw
this boundary. Some alignment researchers definitely think that other
people who would also consider themselves to be alignment researchers
are effectively doing capabilities work. This is particularly true of
RLHF.</p><p>One risk worth taking seriously is that, if we worked with big
AI labs to make their models more aligned by providing higher-quality
data, the models might merely appear aligned on the surface. “Make
the data higher quality” might be a technique that scales poorly as
capabilities ramp up. So it risks creating a false sense of security. It
would also clearly improve the usefulness of current-day models, and
hence risks increasing investment levels too.</p><p>We don’t
currently think the risk of surface-level alignment is big enough to
outweigh the benefits. In general, we think that a good first-order
heuristic that helps the field stay grounded in reality would be that
whatever improves alignment in current models is useful to explore
further and invest resources into. It seems like a good prior that such
things would also be valuable in the future (even if it’s possible that
new additional problems may arise, or such efforts aren’t on the path to
a future alignment solution). See Nate Soares’ post about <span><span><a class="PostLinkPreviewWithPost-link" href="https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization"><u>sharp left turns</u></a></span></span> for a contrasting view on this. </p><h2 id="Is_it_more_natural_for_this_work_to_be_done_in_house_in_the_longterm__Especially_at_big_labs_companies_">Is it more natural for this work to be done in-house in the long term? Especially at big labs/companies.</h2><p>We
expect that human data gathering is likely to become very important and
that it benefits from understanding the relevant research agenda well.
So maybe big companies will want to do this internally, instead of
relying on third-party suppliers? </p><p>That seems quite plausible to
us and to some extent it’s happening already. Our understanding is that
Anthropic is hiring an internal team to do human data gathering.
DeepMind has access to Google’s crowdworker service. OpenAI have worked
with multiple companies, but they also have at least one in-house
specialist for this kind of work and are advertising multiple further
jobs on the human data team <span><span><span><a href="https://openai.com/careers/#human-data"><u>here</u></a></span></span></span>.
They’re definitely considering moving more of this work in-house, but
it’s unclear to us to what extent that’s going to happen and we have
received somewhat contradictory signals regarding OpenAI safety team
members’ preferences on this.</p><p>So a new EA org would face stiff competition, not only from other external providers, but also from within companies.</p><p>Of course, smaller labs will most likely always have to rely on external providers. Hence, <b>another cruxy consideration is how much small labs matter</b>. Our intuition is that they matter much less than bigger labs (since the latter have access to the best and biggest models).</p><h2 id="Creating_redundancy_of_supply_and_competition">Creating redundancy of supply and competition</h2><p>Even
if existing companies are doing a pretty good job at serving the needs
of alignment researchers, there’s still some value in founding a
competitor. </p><p>First, <b>competition is good</b>. Founding
a competitor puts pressure on existing providers to keep service
quality high, work on improving their products, and keep margins low.
Ironically, part of the value of founding this company would thus flow
through getting existing companies to try harder to offer the best
product.</p><p>Second, it creates some redundancy. <b>What if Surge pivots?</b>
What if their leadership changes or they become less useful for some
other reason? In those worlds it might be especially useful to have a
“back-up” company.</p><p>Both of these points have been mentioned to us
as arguments in favour of founding this org. We agree that these effects
are real and likely point in favour of founding the org. However, <b>we don’t think these factors carry very significant weight</b> relative to our opportunity costs, especially given that there are already many start-ups working in this space. </p><p>Adding
a marginal competitor can only affect a company’s incentives so much.
And in the worlds where we’d be most successful such that all alignment
researchers were working with us, we might cause Surge and others to
pivot away from alignment researchers, instead of getting them to try
harder. </p><p>The redundancy argument only applies in worlds in which
the best provider ceases to exist; maybe that’s 10% likely. And then the
next best alternative is likely not all that bad. Competitors are
plentiful and even doing it in-house is feasible. Hence, it seems
unlikely to us that the expected benefit here is very large after
factoring in the low probability of the best provider disappearing.</p><h1 id="Other_lessons">Other lessons</h1><h2 id="Lessons_on_human_data_gathering">Lessons on human data gathering</h2><p>In
the process of talking to lots of experts about their experiences in
working with human data, we learned many general lessons about data
gathering. This section presents some of those lessons, in roughly
decreasing order of importance.</p><h3 id="Iteration">Iteration</h3><p>Many
people emphasised to us that working with human data rarely looks like
having a clean pipeline from requirements design to instruction writing
to contractor finding to finished product. Rather, it more often
involves a lot of iteration and testing, especially regarding what sort
of data the contractors actually produce. While some of this iteration
may be removed by having better contractors and better knowledge of good
instruction-writing, the researchers generally view the iteration as a
key part of the research process, and therefore prize </p><ul><li>ease of iteration (especially time to get back with a new batch of data based on updated instructions); and</li><li>high-bandwidth
communication with the contractors and whoever is writing the
instructions (often both are done by the researchers themselves). </li></ul><p>This
last point matters so much that it is somewhat questionable whether
an external provider (rather than e.g. a new team member deeply enmeshed
in the context of the research project) could even be a good fit for
this need.</p><h3 id="The_ideal_pool_of_contractors">The ideal pool of contractors</h3><p>All of the following features matter in a pool of contractors:</p><ul><li>Competence,
carefulness, intelligence, etc. (sometimes expertise). It is often
ideal if the contractors understand the experiment.</li><li>Number of contractors</li><li>Quick availability and therefore low latency for fulfilling requests</li><li>Consistent availability (ideally full-time)</li><li>Even
distribution of contributions across contractors (i.e. it shouldn’t be
the case that 20% of the contractors provide 80% of the examples). </li></ul><h3 id="Quality_often_beats_quantity_for_alignment_research">Quality often beats quantity for alignment research</h3><p>Many
researchers told us that high-quality, high-skill data is usually more
important and more of a bottleneck than just a high quantity of data.
Some of the types of projects where current human data generation
methods are most obviously deficient are cases where a dataset would
need epistemically competent people to make subtle judgments, e.g. of
the form “how true is this statement?” or “how well-constructed was this
study?” As an indication of reference classes where the necessary
epistemic level exists, one researcher mentioned subject-matter experts
in their domain, LessWrong posters, and EAs.</p><h3 id="A_typical_data_gathering_project_needs_UX_design__Ops__ML__and_data_science_expertise_">A typical data gathering project needs UX-design, Ops, ML, and data science expertise </h3><p>These specialists might respectively focus on the following:</p><ul><li>Designing the interfaces that crowdworkers interact with. (UX-expert/front-end web developer)</li><li>Managing
all operations, including hiring, paying, managing, and firing
contractors, communicating with them and the researchers etc. (ops
expert)</li><li>Helping the team make informed decisions about the
details of the experimental design, while minimising time costs for the
customer. The people we spoke to usually emphasised ML expertise more
than alignment expertise. (ML-expert)</li><li>Meta-analysis of the data,
e.g. inter-rater agreement, the distribution of how much each
contractor contributed, demographics, noticing any other curious aspects
of the data, etc. (data scientist)</li></ul><p>It is possible that
someone in a team could have expertise in more than one of these areas,
but generally this means a typical project will involve at least three
people.</p><h3 id="Crowdworkers_do_not_have_very_attractive_jobs">Crowdworkers do not have very attractive jobs</h3><p>Usually
the crowdworkers are employed as contractors. This means their jobs are
inherently not maximally attractive; they probably don’t offer much in
the way of healthcare, employment benefits, job security, status etc.
The main way that these jobs are made more attractive is through
offering higher hourly rates.</p><p>If very high quality on high-skill
data is going to become essential for alignment, it may be worth
considering changing this, to attract more talented people. </p><p>However,
we expect that it might be inherently very hard to offer permanent
positions for this kind of work, since demand is likely variable and
since different people may be valuable for different projects. This is
especially true for a small organisation. </p><h3 id="What_does_the_typical_crowdworker_look_like_">What does the typical crowdworker look like?</h3><p>This varies a lot between projects and providers.</p><p>The cheapest are non-native English speakers who live outside of the US.</p><p>Some
platforms, including Surge, offer the option to filter crowdworkers for
things like being native English-speakers, expertise as a software
engineer, background in finance, etc.</p><h2 id="Bottlenecks_in_alignment">Bottlenecks in alignment</h2><p>When
asked to name the factors most holding back their progress on
alignment, many alignment researchers mentioned talent bottlenecks. </p><p>The
most common talent bottleneck seemed to be in competent
ML-knowledgeable people. Some people mentioned the additional desire for
these to understand and care about alignment. (Not coincidentally,
Matt’s next project is likely going to be about skilling people up in
ML).</p><p>There were also several comments about things like good web
development experience being important. For example, many data
collection projects involve creating a user interface at some point, and
in practice this is often handled by ML-specialised junior people at
the lab, who can, with some effort and given their programming
background, cobble together some type of website - often using different
frameworks and libraries than the next person knows (or wants to use).
(When asked about why they don’t hire freelance programmers, one
researcher commented that a key feature they’d want is the same person
working for them for a year or two, so that there’s an established
working relationship, clear quality assurances, and continuity with the
choice of technical stack.)</p><h1 id="Conclusion">Conclusion</h1><p>After
having looked into this project idea for about a month, we have decided
not to found a human data gathering organisation for now. </p><p>This is mostly because demand for an external provider seems insufficient, as outlined in this <span><span><span><a href="https://forum.effectivealtruism.org/posts/iBeWbfQLA9EKfsdhu/why-we-re-not-founding-a-human-data-for-alignment-org#Key_crux__demand_looks_questionable__Surge_seems_pretty_good"><u>section</u></a></span></span></span>.
No lab gave a clear signal that gathering human data was a key
bottleneck for them, one that they would have been willing to go to
significant lengths to fix it urgently (especially not the ones that had
tried Surge). </p><p>We expect that many labs would want to stick with
their current providers, Surge in particular, or their in-house team,
bar exceptional success on our part (even then, we’d only provide so
much marginal value over those alternatives).</p><p>Though we did find
some opportunities for potential initial projects after looking for a
month, we are hesitant about how far this company would be expected to
scale. One of the main draws (from an impact perspective) of founding an
organisation is that you can potentially achieve very high
counterfactual impact by creating an organisation that scales to a large
size and does lots of high-impact work over its existence. The absence
of a plausible pathway to really outstanding outcomes from starting this
organisation is a lot of what deters us.</p><p>In a world where we’re
more successful than expected (say 90th to 95th percentile), we could
imagine that, five years from now, we’d have a team of about ten good
people. This team might be working on a handful of moderately big
projects (about as big as WebGPT), and provide non-trivial marginal
value over the next-best alternative to each one of them. Maybe one of
these projects would not have been carried out without us.</p><p>A
median outcome might mean failing to make great hires and remaining
relatively small and insignificant: on the scale of doing projects like
the ones we’ve identified above, enough to keep us busy throughout the
year and provide some value, but with little scaling. In that case we
would probably quit the project at some point.</p><p>This distribution
doesn’t seem good enough to justify our opportunity cost (which includes
other entrepreneurial projects or technical work among other things).
Thus we have decided not to pursue this project any further for now.</p><p>We
think this idea was worth the effort we put into it, and that
we made the right call in choosing to investigate it. Both of us are
open to, and also quite likely to, evaluate other EA-relevant
entrepreneurial project ideas in the future.</p><h2 id="Other_human_data_gathering_careers">Other relevant human data-gathering work</h2><p>However, <b>the assumption that high-quality high-skill human feedback is important and neglected by EAs has not been falsified</b>. </p><p>It
is still plausible to us that EAs should consider career paths that
focus on building expertise at data-gathering; just probably not by
founding a new company. In the short run, this could look like:</p><ul><li>Contributing to <b>in-house data-gathering teams</b> (e.g. Anthropic, OpenAI)</li><li><b>Joining Surge</b> or other data-gathering startups.</li></ul><p>As
we discussed above, the types of skills that seem most relevant for
working in a human data generation role include: data science experience
and in particular experience with natural language data or social
science data and experiment design, front-end web development, ops and
management skills, and some understanding of machine learning and
alignment. 80,000 Hours recently wrote a profile which you can find <span><span><span><a href="https://80000hours.org/career-reviews/alignment-data-expert/"><u>here</u></a></span></span></span>.</p><p>Of
course, in the short term, this career path will be especially
impactful if one’s efforts are focussed on helping alignment
researchers. But if it’s true that human feedback will prove a very
powerful tool for ML, then people with such expertise may become
increasingly valuable going forward, such that it could easily be worth
skilling up at a non-safety-focused org. </p><p>We think joining Surge
may be a particularly great opportunity. It is common advice that
joining young, rapidly growing start-ups with good execution is great
for building experience; early employees can often get a lot of
responsibility early on. See e.g. this <span><span><span><a href="https://forum.effectivealtruism.org/posts/ejaC35E5qyKEkAWn2/early-career-ea-s-should-consider-joining-fast-growing"><u>post</u></a></span></span></span> by Bill Zito.</p><p>One
of the hardest parts about that seems to be identifying promising
startups. After talking to many of their customers, we have built
reasonable confidence that Surge holds significant promise. They seem to
execute well, in a space which we expect to grow. In addition to
building career capital, there is clear value in helping Surge serve
alignment researchers as well as possible.</p><p>From Surge’s
perspective, we think they could greatly benefit from hiring EAs, who
are tuned in to the AI safety scene, which we would guess represents a
significant fraction of their customers. </p><p>One senior alignment
researcher told us explicitly that they would be interested in hiring
people who had worked in a senior role at Surge.</p><h1 id="Next_steps_for_us">Next steps for us</h1><p>Matt
is planning to run a bootcamp that will allow EAs to upskill in ML
engineering. I'll be doing a computer science master’s at
Cambridge from October to June.</p><br /></div></div>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-1697673368059564013.post-62322288231569799712022-09-24T10:43:00.001+01:002022-09-24T10:43:28.686+01:00AI risk intro 2: solving the problem<p style="text-align: center;"> <b><i>This post was a joint effort with <a href="https://www.perfectlynormal.co.uk/">Callum McDougall</a>.</i></b></p><div style="text-align: center;"><i> </i></div><div style="text-align: center;"><i><span style="font-size: x-small;">8.2k words (~25min)</span> </i></div><div style="text-align: center;"><i> </i><br /></div><div><p>This marks the second half of our overview of the AI alignment problem. In <span><span><a class="PostLinkPreviewWithPost-link" href="https://www.strataoftheworld.com/2022/09/ai-risk-intro-1-advanced-ai-might-be.html">the first half</a></span></span>,
we outlined the case for misaligned AI as a significant risk to
humanity, first by looking at past progress in machine learning and
extrapolating to what the future could bring, and second by discussing
the theoretical arguments which underpin many of these concerns. In this
second half, we focus on possible solutions to the alignment problem
that people are currently working on. We will paint a picture of the
current field of technical AI alignment, explaining where the major
organisations fit into the larger picture and what the theory of change
behind their work is. Finally, we will conclude the sequence with a call
to action, by discussing the case for working on AI alignment, and some
suggestions on how you can get started.</p><p><i>Note - for people with more context about the field (e.g. have done AGISF) we expect </i><span><span><span><a href="https://www.lesswrong.com/posts/QBAjndPuFbhEXKcCr/my-understanding-of-what-everyone-in-technical-alignment-is"><i>Thomas Larsen's post</i></a></span></span></span><i> to be a much better summary, and </i><span><span><span><a href="https://www.lesswrong.com/posts/9TWReSDKyshfA66sz/alignment-org-cheat-sheet#comments"><i>this post</i></a></span></span></span><i>
might be better if you are looking for something brief. Our intended
audience is someone relatively unfamiliar with the AI safety field who
is looking for a taste of the kinds of problems which are studied in the
field and the solution approaches taken. We also don't expect this
sampling to be representative of the number of people working on each
problem - again, see Thomas' post for something which accomplishes this.</i></p><hr /><h1 id="Introduction__A_Pre_Paradigmatic_Field"><b>Introduction: A Pre-Paradigmatic Field</b></h1><blockquote><p><i>Definition (<b>pre-paradigmatic</b>):
a science at an early stage of development, before it has established a
consensus about the true nature of the subject matter and how to
approach it.</i></p></blockquote><p>AI alignment is a strange field.
Unlike other fields which study potential risks to the future of
humanity (e.g. nuclear war or climate change), there is almost no
precedent for the kinds of risks we care about. Additionally, because of
the nature of the threat, failing to get alignment right on the first
try might be fatal. As Paul Christiano (a well-known AI safety
researcher) recently wrote:</p><blockquote><p><i>Humanity usually solves technical problems by <b>iterating and fixing failures</b>;
we often resolve tough methodological disagreements very slowly by
seeing what actually works and having our failures thrown in our face.
But it will probably be possible to build valuable AI products without
solving alignment, and so <b>reality won’t “force us” to solve alignment until it’s too late</b>. This seems like a case where we will have to be <b>unusually reliant on careful reasoning rather than empirical feedback loops</b> for some of the highest-level questions.</i></p></blockquote><p>For
these reasons, the field of AI alignment lacks a consensus on how the
problem should be tackled, or what the most important parts of the
problem even are. This is why there is a lot of variety in the
approaches we present in this post.</p><h1 id="Decomposing_the_research_landscape"><b>Decomposing the research landscape</b></h1><figure class="image image_resized" style="width: 50.19%;"><img height="400" src="https://lh3.googleusercontent.com/05Pf23h1YhLb8ua2leAc01JHyDhrBNebhhUtKprCeFEvZy-thcgcxMDXZtmVUKkd48Mamo8WQn6eekyFDUKP0EarLwQWwkCiS5mA_OYJa7anjIw-_bUe_oppKPXbuE7q20kBDWliD6ri_Bj_Fisedc4CCi5viijjwxpRG0ooRuRnwYVs8d1VmDr2cg=w400-h400" width="400" /><figcaption><i>An image generated with OpenAI's DALL-E 2 based on the prompt: sorting papers and books in a majestic gothic library. <b>All other images like this in this post are also AI-generated, from the text in the caption.</b><br /></i></figcaption></figure><p>There
are lots of different ways you could divide up the space of approaches
to solving the problem of aligning advanced AI. For instance, you could
go through the history of the field and identify different movements and
paradigms. Or you could place the work on a spectrum from highly
theoretical maths/philosophy-type research, to highly empirical research
working with cutting-edge deep learning models.</p><p>However, the most
useful decomposition would be one that explains why the people who work
on it believe that it will help solve the problem of AI alignment. </p><p>For that reason, we’ll mostly be using the decomposition from <span><span><span><a href="https://www.lesswrong.com/s/FN5Gj4JM6Xr7F4vts/p/SQ9cZtfrzDJmw9A2m"><u>Neel Nanda’s “A Bird’s Eye View” </u></a></span></span></span>post.
The motivation behind this decomposition is to answer the high-level
question of “what is needed for AGI to go well?”. The six broad classes
of approaches we talk about are:</p><ol><li><b>Addressing threat models </b><br /><i>We
have a specific threat model in mind for how AGI might result in a very
bad future for humanity, and focus our work on things we expect to help
address the threat model.</i></li><li><b>Agendas to build safe AGI </b><br /><i>Let’s
make specific plans for how to actually build safe AGI, and then try to
test, implement, and understand the limitations of these plans. The
emphasis is on understanding how to build AGI safely, rather than trying
to do it as fast as possible.</i></li><li><b>Robustly good approaches </b><br /><i>In
the long-run AGI will clearly be important, but we're highly uncertain
about how we'll get there and what, exactly, could go wrong. So let's do
work that seems good in many possible scenarios, and doesn’t rely on
having a specific story in mind.</i></li><li><b>Deconfusion</b><br /><i>Reasoning
about how to align AGI involves reasoning about concepts like intelligence,
values, and optimisers, and we’re pretty confused about what these even
mean. This means any work we do now is plausibly not helpful and
definitely not reliable. As such, our priority should be doing some
conceptual work on how to think about these concepts and what we’re
aiming for, and trying to become less confused.</i></li><li><b>AI governance</b><br /><i>In
addition to solving the technical alignment problem, there’s the
question of what policies we need to minimise risk from advanced AI
systems.</i></li><li><b>Field-building</b><br /><i>One of the
most important ways we can make AI go well is by increasing the number
of capable researchers doing alignment research.</i> </li></ol><p>It’s
worth noting that there is a lot of overlap between these sections. For
instance, interpretability research is a great example of a robustly
good approach, but it can also be done with a specific threat model in
mind.</p><p>Throughout this section, we will also give small vignettes
of organisations or initiatives which support AI alignment research in
some form. This won’t be a full picture of all approaches or
organisations; instead, it will hopefully serve to sketch a picture of
what work in AI alignment actually looks like.</p><h2 id="Addressing_threat_models">Addressing threat models</h2><blockquote><p><i>We have a <b>specific threat model</b>
in mind for how AGI might result in a very bad future for humanity, and
focus our work on things we expect to help address the threat model.</i></p></blockquote><p>A
key high-level intuition here is that having a specific threat model in
mind for how AI might go badly for humanity can help keep you focused
on certain hard parts of the problem. One technique that can be useful
here is a version of back-casting: we start from future problems with
advanced AI systems in our current model, reason about what kinds of
things might solve these problems, then try and build versions of these
solutions today and test them out on current problems.</p><figure class="image image_resized" style="width: 45.98%;"><img height="212" src="https://lh3.googleusercontent.com/WjAkWUs3UyXk_VvLWXJ99Z1Q9Cfgm8yEziWaIAwx1S8tYEq7r3IE6BVnw6IrMjfj8neeJypSK2UFqCt7BgSSOdikryl2b3nHVV9mmatFih6yXF2OBYE7xVy8Y5WcsuKRiRDcmeBcxRc630ayp3_mt4hwJeC4UrHKStpDhHSUI_bIPlXgVlfaO_mpag=w400-h212" width="400" /></figure><p>This
can be seen in contrast to the approach of simply trying to fix current
problems with AI systems, which might fail to connect up with the
hardest parts of AI alignment.</p><figure class="image image_resized" style="width: 44.49%;"><img height="226" src="https://lh5.googleusercontent.com/qIJxVmWTBGIaws5GaJksOZKF8-BlmwQ08vZ2MQFLWIK9oZcGJ74HCR0e2GCIIKx7klGWpxgEVoHujvQtyRwGZ-PoiJacxoWszXqdIslZuSrweTXMaI6OVWe3fnTNhQQkVK59q4a6qHEK6q3Z6qeUtrtQvBaKij92ZEL7Cz2Cn8CbLVQvUiEdJM9bSw=w400-h226" width="400" /></figure><h3 id="Example_1__Superintelligent_utility_maximisers__and_quantilizers">Example 1: Superintelligent utility maximisers, and quantilizers</h3><figure class="image image_resized" style="width: 50.49%;"><img height="400" src="https://lh5.googleusercontent.com/Xt7WQny2U0Lcg1GVMZQEkjxdrRukolbFV5g1LL7-GCI5crGhEBnjTF8QrBWpfFnujTp5COL08Cmkc3HKu9jrAdKCJ1TPY2TaGrqNeJ1VtC2VrHexhSXBwWM545HpgU0mbkzGVVRZS1-KGYoUKAucfO8kYlS5X4ULag255Q0RkwwaCNy1_2nLdBQ0=w400-h400" width="400" /><figcaption><i>superintelligent artificial intelligence, making choices, digital art, artstation</i></figcaption></figure><p>The
superintelligent utility maximiser is the oldest threat model studied
by the AI alignment field. It was discussed at length by Nick Bostrom in
his book <i>Superintelligence</i>. It assumes that we will create an
AGI much more intelligent than humans, and that it will be trying to
achieve some particular goal (measured by the <span><span><span><a href="https://www.investopedia.com/terms/e/expectedutility.asp"><u>expected value of some utility function</u></a></span></span></span>).
The problem with this is that attempts to maximise the value of some
goal which isn’t perfectly aligned with what humans want can lead to
some very bad outcomes. One formalism which was proposed to address this
problem is <span><span><span><a href="https://intelligence.org/2015/11/29/new-paper-quantilizers/"><u>Jessica Taylor’s quantilizers</u></a></span></span></span>.
It is quite maths-heavy so we won’t discuss all the details here, but
the basic idea is that rather than using the expected utility
maximisation framework for agents, we mix expected utility maximisation
with human imitation in a clever way (to be more precise, you take a
base distribution which represents the actions a human would be likely
to take in this scenario, and sample from the top fraction of it as ranked
by expected utility). The resulting agent wouldn’t take
catastrophic actions because part of its decision-making comes from
imitating what it thinks humans would do, but it would also be able to
use the expected utility maximisation to go beyond human imitation, and
do things we are incapable of (which is presumably the reason we would
want to build it in the first place!). However, the drawback with
theoretical approaches like this is that they often bake in too many
assumptions or rely on too many variables to be useful in practice. In
this case, how we define the set of reasonable actions a human might
perform is an important unspecified part of this framework, and so more
research is required to see if the quantilizers framework can address
these problems.</p><h3 id="Example_2__Inner_misalignment">Example 2: Inner misalignment</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh4.googleusercontent.com/5IwgPazN3zOoVysNBz2xT4XzEGsisj-IFZNuvAoO1y01GgeThXp_CjkToXgYoXZLWesYm0sjjIqwBr85pee0s1IJ72jPAT6OI_2NgupykTXLf6pFTRmXe7PWjtoK_oFl1xPLx2UHttBK6d9M0vLw1uKih3KQBcmuhGB41xHspnJTWdUw2VnH2sHI=w400-h400" width="400" /><figcaption><i>robot jumping over boxes to collect a coin, videogame, digital art, artstation</i></figcaption></figure><p>We’ve discussed inner misalignment in a previous section. This concept was first explicitly named in a paper called <span><span><span><a href="https://arxiv.org/abs/1906.01820"><u>Risks from Learned Optimisation in Advanced ML Systems</u></a></span></span></span>,
published in 2019. This paper defined the concept and suggested some
conditions which might make it more likely to happen, but the truth is
that a lot of this is still just conjecture, and there are many things
we don’t yet know about how likely this kind of misalignment is, or
what we can do about it. The CoinRun example discussed earlier (and the <span><span><span><a href="https://www.deepmind.com/publications/objective-robustness-in-deep-reinforcement-learning"><u>Objective Robustness</u></a></span></span></span>
paper) came from an independent research team in 2021. This study gave
the first known example of inner misalignment in an AI system, showing
that it is more than just a theoretical possibility. They also tested certain
interpretability tools on the CoinRun agent, to see whether it was
possible to discover when the agent had a goal different to the one
intended by the programmers. For more on interpretability, see later
sections.</p><h2 id="Building_safe_AGI">Building safe AGI</h2><blockquote><p><i>Let’s make specific plans for <b>how to actually build safe AGI</b>,
and then try to test, implement, and understand the limitations of
these plans. The emphasis is on understanding how to build AGI <b>safely</b>, rather than trying to do it as fast as possible.</i></p></blockquote><p>At
some point we’re going to build an AGI. Companies are already racing to
do it. We had better make sure that there exist some blueprints for a safe
AGI (and that they’re used) by the time we get to that point.</p><p>Perhaps the master list of safe AGI proposals is Evan Hubinger’s <span><span><span><a href="https://arxiv.org/pdf/2012.07532.pdf"><u>An Overview of 11 Proposals for Building Safe Advanced AI</u></a></span></span></span>. </p><h3 id="Example_1__Iterated_Distillation_and_Amplification__IDA_">Example 1: Iterated Distillation and Amplification (IDA)</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh3.googleusercontent.com/iv_PInhVoajV8GCh91_uKIFYRlin53WlX3xKxETmSfvd-9s11nUfFfOmtzbnujqEp98T1rqu7ffPQTDuIEnHaDuu1xiBfaoXmX7N50wdhFbAI42udE4u9RIgNmsIdRXtiF7Us0WUS6vrT0EMi_P8PHgp4sapFK4Sr3CMdrRfIG6MPik7JMVBoWlX=w400-h400" width="400" /><figcaption><i>artists
depection of a robot dreaming up multiple copies of itself, cascading
tree, delegating, digital art, trending on artstation</i></figcaption></figure><p>“Iterated
Distillation and Amplification” (IDA) is an imposing name, but the core
intuition is simple. One of the ways in which an individual human can
achieve more things is by delegating tasks to others. In turn, the
assistants that tasks are delegated to can be expected to become more
competent at the task.</p><p>In IDA, an AI plays the role of the
assistant. “Distillation” refers to the abilities of the human being
“distilled” into the AI through training, and “amplification” refers to
the human becoming more capable as they can call on more and more
powerful AI assistants to help them.</p><p>A setup to train an IDA personal assistant might go like this:</p><ol><li>You have a human, say Hannah, who knows how to carry out the tasks of a personal assistant.</li><li>You
have an ML model - call it Martin - that starts out knowing very little
(perhaps nothing at all, or perhaps it’s a pre-trained language model
so it knows how to read and write English but not much else).</li><li>Hannah
needs to find the answer to some questions, and she can invoke multiple
copies of Martin to help her. Since Martin is quite useless at this
stage, Hannah has to do even simple tasks herself, like writing routine
emails. Using some interface legible to Martin, she breaks the
email-writing task into subtasks like “find email address of Hu M.
Anderson”, “select greeting”, “check project status”, “mention project
status”, and so on.</li><li>From seeing enough examples of Hannah’s own
answers to the sub-questions, Martin’s training loop gradually trains it
to be able to answer first the simpler sub-tasks - (address is
“humanderson@humanmail.com”, greeting is “Salutations, Human
Colleague!”, etc.) and eventually all the sub-tasks involved in routine
email-writing.</li><li>At this point, “write a routine email” becomes a
task Martin can entirely carry out for Hannah. This is now a building
block that can be used as a subtask in broader tasks Hannah gives out to
Martin. Once enough tasks become tasks that Martin can carry out by
itself, Hannah can draft much larger goals, like “invade France”, and
let Martin take care of details like “blackmail Emmanuel Macron”, “write
battle plan for the French Alps”, and “select a suitable coronation
dress”.</li></ol><p>Note some features of this process. First, Martin
learns what it should do and how to do it at the same time. Second, both
Hannah’s and Martin’s roles change throughout this process - Martin
goes from bumbling idiot who can’t write an email greeting to competent
assistant, while Hannah goes from being a demonstrator of simple tasks
to a manager of Martin to ruler of France. Third, note the recursive
nature here: Hannah breaks down big tasks into small ones to train
Martin on successively bigger tasks. </p><p>In fact, assuming perfect
training, IDA imitates a recursive structure. When Hannah has only
bumbling fool Martin to help her, Martin can only learn to become as
good as Hannah herself. But once Martin is that good, Hannah’s position
is now essentially that of having herself, but also some number - say 3 -
copies of Martin that are as good as herself. We might call this
structure “Hannah Consulting Hannah & Hannah”; presumably, being
able to consult an assistant that has the same skills as her lets Hannah
become more effective, so this is an improvement. But now Hannah is
demonstrating the behaviour of Hannah Consulting Hannah & Hannah, so
from Hannah’s example Martin can now learn to be as good as Hannah
Consulting Hannah & Hannah - making Hannah as good as Hannah
Consulting (Hannah Consulting Hannah & Hannah) & (Hannah
Consulting Hannah & Hannah). And so on:</p><figure class="image image_resized" style="width: 39.51%;"><img height="400" src="https://lh5.googleusercontent.com/4KVRCQ6XWNxrS0UCex5XSQOjVLUT-WMLCSPHDBtlW18UppVMQ0TrB90iAxtCANjcOO-PY38npd_bk4MoAMGFEZgV_rD4Ut3i0h3AsZtSMpUnanOEygNVayV0D8AqBmxRYbWGO6mxv72HBwPpxkHj0mGP-BiT_OJO0n0oOm2ebACzPfixtAUvIenf=w386-h400" width="386" /></figure><p>If
everything is perfect, therefore, IDA imitates a structure called
“HCH”, which is a recursive acronym for “Humans Consulting HCH”. Others
call it the “<span><span><span><a href="https://www.lesswrong.com/posts/tmuFmHuyb4eWmPXz8/rant-on-problem-factorization-for-alignment"><u>Infinite Bureaucracy</u></a></span></span></span>” (and fret about whether it’s actually a good idea).</p><p>Now
“Infinite Bureaucracy” is not a name that screams “new sexy machine
learning concept”. However, it’s interesting to think about what
properties it might have. Imagine that you had, say, a 10-minute time
limit to answer a complicated question, but you were allowed to consult
three copies of yourself by passing a question off to them and getting
back an answer immediately. These three copies also obeyed the same
rules. Could you, for example, plan your career? Program an app? Write a
novel?</p><p>It’s also interesting to think of the ways in which the limitations of machine learning mean that IDA might not approximate HCH.</p><h3 id="Example_2__AI_safety_via_debate">Example 2: AI safety via debate</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh6.googleusercontent.com/cUcjU1Wz8IoerNJ4nMQz4wXHpdqgFSgrHqg_cBse2bAuYAubuXLvF3Nx7mBEdPzMD7smvSxXKVBoPZ9Ed3aZ05PRRGMJ8cUCqhptBoq4-iKWkTKHeGqk_8LVVTIOPJRX_bL5Sw22zHMjMWe6qsIUt_YJLByhswzelRxXGIhETBj0rik66KkQNJ-u=w400-h400" width="400" /><figcaption><i>artists depiction of two robots debating, digital art, trending on artstation</i></figcaption></figure><p>Imagine
you’re a bit drunk, but (as one does) you’re at a bar talking about AI
alignment proposals. Someone’s talking about how even if you can get an
advanced AI system to explain its reasoning to you, it might try to slip
something very subtle past you and you might not notice. You might well
blurt out: “well then just make it fight another AI over it!”</p><p>The OpenAI safety team presumably spends a fair amount of time at bars, because they’ve <span><span><span><a href="https://openai.com/blog/debate/"><u>investigated the idea of achieving safe AI by having two AIs debate each other</u></a></span></span></span>
to persuade a panel of human judges, by trying to poke holes in each
other’s arguments. For more complex tasks, the AIs could be given
transparency tools deriving from interpretability research (see next
section) that they can use on each other. Just like a Go-playing AI gets
an unambiguous win-loss signal from either winning or losing, a
debating AI gets an unambiguous win-loss signal from winning or losing
the debate:</p><figure class="image"><img height="143" src="https://lh4.googleusercontent.com/U_12hGskORYC9OqJsU0faB1lGjCrhJSaw6WTNLc0NHWLHPYyCgVQHTXXNurP-fwCpIW3fDh0ldeKtv6j3e3TWt7LfJEev4980zTtvm7ZSV42GUrqDMQKDZ0jUjn6Uml2OjiXa4VYQoqr9SO1ddQGJz4-S9HJYfPY8HpWyCJYVdqXotq3CO_vUoG4hA=w400-h143" width="400" /></figure><p>In
addition, having the type of AI that is trained to give answers that
are maximally insightful and persuasive to humans seems like the type of
thing that might not be terrible. Consider how in court, a prosecutor
and defendant biased in opposite directions are generally assumed to
converge on the truth. Unless, of course, maximising persuasiveness to
humans - over accuracy or helpfulness - is exactly the type of thing
that gets the worst parts of Goodhart’s law delivered to you by 24/7
Amazon Prime express delivery.</p><h3 id="Example_3__Assistance_Games_and_CIRL">Example 3: Assistance Games and CIRL</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh4.googleusercontent.com/G12BXuKKHb0JliA-2TVkOyPOuw4-mmX8gBQxtoDrrncYL0SrtPFynuxQysMFccmJ1XDmtx4oWThGjl_6dhX97QbCW9KJU-A_vZ56YqtmdxXTNRrYp4PBV485fJtauI6J6rd-zeIucOlIDZanG0Hi6e_Evkuo1hj9lQZBoxhTda8FA0t0jrhdF3qA=w400-h400" width="400" /><figcaption><i>Human teaching a robot with feedback, digital art, trending on artstation</i></figcaption></figure><p>Assistance
Games are the name of a broad class of approaches pioneered by Stuart
Russell, a prominent figure in AI and co-author of the <span><span><span><a href="https://en.wikipedia.org/wiki/Artificial_Intelligence:_A_Modern_Approach"><u>best-known AI textbook in the world</u></a></span></span></span>. Russell talks about his approach more in his book <span><span><span><a href="https://en.wikipedia.org/wiki/Human_Compatible"><i><u>Human Compatible</u></i></a></span></span></span>. In it, he summarises his approach to aligning AI with the following three principles:</p><ul><li>The machine’s only objective is to maximise the realisation of human preferences.</li><li>The machine is initially uncertain about what those preferences are.</li><li>The ultimate source of information about human preferences is human behaviour.</li></ul><p>The key component here is <b>uncertainty about preferences</b>.
This is in contrast to what Russell calls the “standard model” of AI,
where machines optimise a fixed objective supplied by humans. We have
discussed in previous sections the problems with such a paradigm. A lot
of Russell’s work focuses on changing the standard way the field thinks
about AI.</p><p>To put these principles into action, Russell has designed what he calls <b>assistance games</b>.
These are situations in which the machine and human interact, and the
human’s actions are taken as evidence by the machine about the human’s
true preferences. To explain the form of these games would involve a
long tangent into game theory, which these margins are too short to
contain. However, one thing worth noting is that assistance games have
the potential to solve the <b>“off-switch problem”</b>: the worry that a machine will try to take steps to prevent itself from being switched off (we described this as <i>self-preservation</i>
earlier, in the section on instrumental goals). If the AI is uncertain
about human goals, then the human trying to switch it off is evidence
that the AI was going to do something wrong – in which case, it is happy
to be switched off. However, this is far from a complete agenda, and
formalising it has many roadblocks to get past. For instance, the
question of how exactly to infer human preferences from human behaviour
leads into thorny philosophical issues such as <i>Gricean semantics. </i>In cases where the AI makes incorrect inferences about human preferences, it might no longer allow itself to be shut down. See <span><span><span><a href="https://mailchi.mp/59ddebcb3b9a/an-69-stuart-russells-new-book-on-why-we-need-to-replace-the-standard-model-of-ai"><u>this Alignment Newsletter entry</u></a></span></span></span> for a summary of Russell’s book, which provides some more details as well as an overview of relevant papers.</p><blockquote><p><i>Vignette: <b><u>CHAI </u></b></i></p><figure class="image image_resized" style="width: 29.78%;"><img height="114" src="https://lh3.googleusercontent.com/_nyiIv74Vr7yQ0Dn3OyFQ1IR9D0gHxJNioGMRQgUZe3Ope_Z_yqxwFRcw_MPq8isqHgqQKlOO6QqHyFCBCbR3sr9u3JE3y3QiQvt67_x9LpdjDKbGx2xqBvhPGSl_wIL4bY4gK3JB7WEEu7J_FC8nKClYlMG3jWad76RndQ8rNa8YADyWYS1Q_tK=w320-h114" width="320" /></figure><p><i>CHAI
(the Centre for Human-Compatible AI) is a research lab at UC Berkeley,
run by Stuart Russell. Compared to most other AI safety organisations,
they engage a lot with the academic community, and have produced a great
deal of research over the years. They are best-known for their work on
CIRL (Cooperative Inverse Reinforcement Learning), which can be seen as a
specific approach to a certain kind of assistance game. However, they
have a very broad focus which also includes work on multi-agent
scenarios (when rather than a single AI and single human, there exists
more than one AI or more than one human - see the </i><span><span><span><a href="http://acritch.com/arches/"><i><u>ARCHES agenda</u></i></a></span></span></span><i> for more on this). </i></p></blockquote><h3 id="Example_4__Reinforcement_learning_from_human_feedback__RLHF_">Example 4: Reinforcement learning from human feedback (RLHF)</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh6.googleusercontent.com/agiwmaQqFxi4VHcVwoZlBs19h1EjfITgDuROp5yQ5joAKxFd4bXI_dDMCScy4XNMe5A7nxXR0WEgqAKdAY09f2ynaJDNr-c5SkZcKcYKon5WXCy9n4fvw56vo6q7cu2aYimXrtwIdgA540RshK6mgI_vtakjGqsbL6QiQu6gHJhUyoiyWYsHWS2A=w400-h400" width="400" /><figcaption><i>Training a robot to do a backflip, digital art, trending on artstation</i></figcaption></figure><p>Reinforcement
learning (RL) is one of the main branches of ML, focusing on the case
where the job of the ML model is to act in some environment and maximise
its expected cumulative reward. Reinforcement learning from human feedback
(RLHF) means that the ML model’s reward signal comes (at least partly)
from humans giving it feedback directly, rather than humans programming
in an automatic reward function and calling it a day.</p><p>The famous initial success in this was DeepMind training an ML model in a simulated environment <span><span><span><a href="https://www.deepmind.com/blog/learning-through-human-feedback"><u>to do a backflip</u></a></span></span></span>
(link includes GIF) in 2017, based purely on it repeatedly making two
attempts and then humans labelling one of them as the better one. Note
how relying on human feedback makes this task much more robust to
specification gaming; in other cases, humans have tried to get ML agents
to run fast, only to find that they learn to become very tall and then
fall forward (achieving a very high average speed, using the definition
of speed as the rate at which their centre of mass moves - <span><span><span><a href="http://www.karlsims.com/papers/siggraph94.pdf"><u>paper</u></a></span></span></span>, <span><span><span><a href="https://www.youtube.com/watch?v=TaXUZfwACVE&list=PL5278ezwmoxQODgYB0hWnC0-Ob09GZGe2&index=9"><u>video</u></a></span></span></span>). However, human reward signals can be fooled. For example, <span><span><span><a href="https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/"><u>one ML model</u></a></span></span></span>
that was being trained to grab a ball with a hand learned to place the
hand between the camera and the ball in such a way that it looked to the
human evaluators as if it were holding the ball.</p><p>More recently,
OpenAI produced a version of their advanced language model GPT-3 that
was fine-tuned on human feedback to do a better job of following
instructions. They named it <span><span><span><a href="https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf"><u>InstructGPT, and found that it was much better than vanilla GPT-3</u></a></span></span></span> at being useful.</p><p>Pure
RLHF is unlikely to be the solution on its own. Ajeya Cotra, a
researcher at Open Philanthropy who we will meet again when we talk
about forecasting AI timelines, calls a variant of RLHF called HFDT
(Human Feedback on Diverse Tasks) the most straightforward route to
transformative AI, <span><span><span><a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to"><u>while also thinking that the default outcome of using HFDT to create transformative AI is AI takeover.</u></a></span></span></span></p><h2 id="Robustly_good_approaches">Robustly good approaches</h2><blockquote><p><i>In the long-run AGI will clearly be important, but we're <b>highly uncertain</b> about how we'll get there and what, exactly, could go wrong. So let's do <b>work that seems good in many possible scenarios</b>, and doesn’t rely on having a specific story in mind.</i></p></blockquote><h3 id="Example_1__Interpretability">Example 1: Interpretability</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh3.googleusercontent.com/p3oN1cUovpdWPmOxx5JV3amBxPDrc1btw2oAWmg3m3mpcOIyi_ciq8DXLM6Q2gaM0VjtO7tr8_5D1EljVB4gsSNDjrsy_deczP4d0V5oJ1P467d5FpoAGbkRb7Uur-fOLjvFR-qnrAdt-cH2i0-jcvKTKNLIRl280sW5805Rw0EB5tZcano5RpvT=w400-h400" width="400" /><figcaption><i>A person using a microscope to look inside a robot, digital art, trending on artstation</i></figcaption></figure><p>If
you look at fundamental problems with current ML systems, #1 is
probably something like this: in general we don’t have any idea what an
ML model is doing, because it’s multiplying massive inscrutable matrices
of floating-point numbers with other massive inscrutable matrices of
floating-point numbers, and it’s pretty hard to stare at that and answer
questions about what the model is actually doing. Is it thinking hard
about whether an image is a cat or a dog? Is it counting up electric
sheep? Is it daydreaming about the AI revolution? Who knows!</p><p>If
you had to figure out an answer to such a question today, your best bet
might be to call Chris Olah. Chris Olah has been spearheading work into
trying to interpret what neural networks are doing. A signature output
of Chris Olah’s work is pictures of creepy dogs like this one:</p><figure class="image"><img height="320" src="https://lh6.googleusercontent.com/AI3eI3kyXmLugVs3y9N2i0SB2lQAh2bRot9yzsOKeScVm1qpl_pqdRmflJiYJadqbR0pQuXkgCkdw306b1FJfwv3fRtCO3td30Kx3d5Lxqw_1JG4nrlEdXdreNXd4YiBLe0zUQ368qSgX_5Mvd3BCkCuDRCMlyIvvTCMbIBi9yvqB0DF9qoeh7GtGw=w320-h320" width="320" /></figure><p>What’s
significant about this picture is that it’s the answer to a question
roughly like this: what image would maximise the activation of neuron
#12345678 in a particular image-classifying neural network? (With some
asterisks about needing to apply some maths details to the process to
promote large-scale structure in the image to get nice-looking results,
and with apologies to neuron #12345678, who I might have confused with
another neuron.)</p><p>If neuron #12345678 is maximised by something
that looks like a dog, it’s a fair guess that this neuron somehow
encodes, or is involved in encoding, the concept of “dog” inside the
neural network.</p><p>What’s especially interesting is that if you do this analysis for every neuron in an ML model - <span><span><span><a href="https://microscope.openai.com/models"><u>OpenAI Microscope</u></a></span></span></span>
lets you see the results - you sometimes get clear patterns of
increasing abstraction. The activation-maximising images for the first
few layers are simple patterns; in intermediate layers you get things
like curves and shapes, and then at the end even recognisable things,
like the dog above. This seems to be evidence that neural-network vision
models have learned to build up abstractions step-by-step.</p><p>However,
it’s not always simple. For example, there are “polysemantic” neurons
that correspond to several different concepts, like this one that can be
equally excited by cat faces, car fronts, and cat legs:</p><figure class="image image_resized" style="width: 82.7%;"><img height="128" src="https://lh4.googleusercontent.com/IVwwVMeWVJd42N7TRzykZZUrWyQUj-gthRnTKW-ZAde0Nr8IMfe_8kd8mKHdb9sK8l6_TfYm4iHqnfdzDIttPzkk9G8_qWF3urnN0Yz6YVd-ZEt1djWMedObA3HYa1Ly6abzGDEk0oH-PuDzvX59GZIGbvscfZ5M_2l0OXX6LJmTA1q8g8rNRFeMhw=w400-h128" width="400" /></figure><p>Olah’s original work on vision models is strikingly readable and well-presented; you can find it <span><span><span><a href="https://distill.pub/2020/circuits/zoom-in/"><u>here</u></a></span></span></span>.</p><p>Starting
in late 2021, ML interpretability researchers have also made some
progress in understanding transformers, which are the neural network
architecture powering advanced language models like <span><span><span><a href="https://openai.com/blog/gpt-3-apps/">GPT-3</a></span></span></span>, <span><span><span><a href="https://blog.google/technology/ai/lamda/">LaMDA</a></span></span></span> and <span><span><span><a href="https://openai.com/blog/openai-codex/">Codex</a></span></span></span>.
Unfortunately the work is less visual, particularly in the animal
pictures department, but still well-presented. You can find it <span><span><span><a href="https://transformer-circuits.pub/2021/framework/index.html"><u>here</u></a></span></span></span>.</p><p>In
the most immediate sense, interpretability research is about
reverse-engineering how exactly ML models do what they do. Hopefully,
this will give insights into how to detect if an ML system is doing
something we don’t like, and more general insights into how ML systems
work in practice.</p><p>Chris Olah has some other inventive ideas about
what to do with a sufficiently-good approach to ML interpretability. For
example, he’s proposed the concept of “microscope AI”, which entails
using AI as a tool to discover things about the world - not by having
the AI tell us, but by training the ML system on some data, and then
extracting insights about the data by digging into the internals of the
ML system without necessarily ever actually running it.</p><blockquote><p><i>Vignette: <b><u>Anthropic</u></b></i></p><figure class="image image_resized" style="width: 16.18%;"><img height="320" src="https://lh3.googleusercontent.com/wYvCYVcPnIri6U8_SEmaHhjsW4uzm4mMkMgMTfNc2ErpVIgkVl5izoHHXzFpwUxBOWznB84OhISlxT93TYnLodBJgZjJ1LxzNqF6V_K7zmOgj8eD2g7gdDhlHozFr4tmRvHLiv57ybh1BZTO9NXJvMcehviUNhfyOBd2kZ1AUCz73nRasSobuRT8Qw=w320-h320" width="320" /></figure><p><i>Anthropic is an AI safety company, started by people who left </i><span><span><span><a href="https://openai.com/"><i>OpenAI</i></a></span></span></span><i>.
The company’s approach is very empirical, focused on running
experiments with machine learning models. In particular, Anthropic does a
lot of interpretability work, including </i><span><span><span><a href="https://transformer-circuits.pub/"><i><u>the state-of-the-art papers on reverse-engineering how transformer-based language models work.</u></i></a></span></span></span></p></blockquote><h3 id="Example_2__Adversarial_robustness">Example 2: Adversarial robustness</h3><figure class="image image_resized" style="width: 50.16%;"><img height="400" src="https://lh5.googleusercontent.com/UNaqaHtPYvdDzUwQMj85DwPxq06pL_nX1RGXOssgzzQaM1Zbr1q2u_AICpk-jAZBmjzCvWN9cks_0ELrOrvS-XYV0X8xDyuJlVI04QYl73NveRdFNCvMBLT7AN2MbDpjBq2W8y86SVmraVvYADV7VpkvK_d2xtp41ukCFSuIdwMccMHOYgKNtQHr=w400-h400" width="400" /><figcaption><i>robot which is merging with a panda, digital art, trending on artstation</i></figcaption></figure><p>Some
modern ML systems are vulnerable to adversarial examples, where a small
and seemingly innocuous change to an input causes a major change in the
output behaviour. Here, we see two seemingly very similar images of a
panda, except carefully-selected noise has made the ML classification
model very confidently say that the image is of a gibbon:</p><figure class="image image_resized" style="width: 78.59%;"><img height="153" src="https://www.researchgate.net/publication/347639649/figure/fig1/AS:973837478948864@1609192356344/A-demonstration-of-an-adversarial-sample-21-The-panda-image-is-recognized-as-a-gibbon.ppm" width="400" /></figure><p>Adversarial
robustness is about making AI systems robust to attempts to make them
do bad things, even when they’re presented with inputs carefully
designed to try to make them mess up.</p><p>Redwood Research recently did a project (that resulted in <span><span><span><a href="https://arxiv.org/pdf/2205.01663.pdf"><u>a paper</u></a></span></span></span>)
about using language models to complete stories in a way where people
don’t get injured. They used a technique called adversarial training,
where they developed tools that helped generate injurious examples that
the current classifier failed to flag, and then trained their
classifier specifically on those failure cases. With this strategy
they managed to reduce the fraction of injurious story completions from
2.4% to 0.003% - both small numbers, but one roughly 800 times smaller.
Their hope is that this type of method can be applied to training AIs
for high-stakes settings where reliability is important.</p><p>An
example of a theoretical difficulty with adversarial training is that
sometimes a failure in the model might exist, but it might be very hard
to instantiate. For example, if an advanced AI acts according to the
rule “if everything I see is consistent with the year being 2050, I will
kill all humans”, and we assume that we can’t fool it well enough about
what year it actually is, then adversarial training isn’t very useful.
This leads to the concept of <i>relaxed</i> adversarial training, which
is about extending adversarial training to cases where you can’t
construct a specific adversarial input but you can argue that one
exists. Evan Hubinger describes this <span><span><span><a href="https://www.lesswrong.com/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment"><u>here</u></a></span></span></span>.</p><blockquote><p><i>Vignette: <b><u>Redwood Research</u></b></i></p><figure class="image image_resized" style="width: 18.79%;"><img height="320" src="https://lh6.googleusercontent.com/CYIBM0yuHjmQfmCPVGSzG27iRYjYw4LdSaF3VrHt7AGSHVRi9GdaBobW6j15DOR-9raS5JQx-jmOkLB4AxVixhfB-pAXxVjCgzo0ZEY1kV3eb3mdVG03BhyOnEETJbaAYw2SubfKCZkebPeYqEKB7rq2R18aMTEoxhoMxPo907x-lQaQ7EUzhF-ebQ=w320-h320" width="320" /></figure><p><i>Like
Anthropic, Redwood Research is an AI safety company focused on
empirical research on ML systems. In addition to work on
interpretability, they did the adversarial training project described in
the previous section. Redwood has lots of interns, and runs the Machine
Learning for Alignment Bootcamp (MLAB) that teaches people interested
in AI safety about practical ML.</i></p></blockquote><h3 id="Example_3__Eliciting_Latent_Knowledge__ELK_">Example 3: Eliciting Latent Knowledge (ELK)</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh3.googleusercontent.com/Xi00cFgcOkHQZgo6rSaM6sbE8TdKdGgX1NTo8i7o9Prdxz5Tt334JtEg4y7nMGlFjNGsY7qZXd7EYbAe9hdtPbbbM7SRVSthgEITwBxUR0juUqDr_qQXQ8VRZdzevQQUOW89K6pl_thknXMV0cL1YFHV-tUnTlzFs4v_mJtF4RmZP27ary0E6Bk8mg=w400-h400" width="400" /><figcaption><i>an oil painting of an armoured automaton standing guard next to a diamond</i></figcaption></figure><p>Eliciting Latent Knowledge (ELK) is an important sub-problem within alignment identified by the team at the <span><span><span><a href="https://alignment.org/"><u>Alignment Research Center (ARC</u></a></span></span></span>),
and is the single project ARC is currently pursuing. The core idea is
that a common way advanced AI systems might go wrong is by taking action
sequences that lead to outcomes that look good by some metric, but
which humans would clearly identify as bad if they knew about them in
sufficient detail. As a toy example, the ELK report discusses the case
of an AI guarding a diamond in a vault by operating some complex
machinery around it. Humans judge how well the AI is doing by looking at
a video feed of the diamond in the vault. Let’s say the AI tries to
trick us by placing a picture of the diamond in front of the camera. The
human judgement on this would be positive - assume the humans can’t
tell the diamond is gone because the picture is good enough - but there
exists information which, if the humans knew, would change their
judgement. Presumably the AI understands this, since it is likely
reasoning about the diamond being gone but the humans being fooled
anyway when it comes up with this plan. We want to train an AI in such a
way that we can extract knowledge that the AI seems to have, even when
it might be incentivised to hide it.</p><p>ARC’s goal is to find a theoretical approach that seems to solve the problem even given worst-case assumptions.</p><p>ARC ran an ELK competition, and <span><span><a class="PostLinkPreviewWithPost-link" href="https://forum.effectivealtruism.org/posts/Q2BJnpNh8e6RAWFnm/consider-trying-the-elk-contest-i-am"><u>trying to see if you can come up with solutions to the ELK problem</u></a></span></span>
is often recommended as a way to quickly get a taste of theoretical
alignment research. You can read the full problem description <span><span><span><a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.kkaua0hwmp1d"><u>here</u></a></span></span></span>.</p><h3 id="Example_4__Forecasting_and_timelines">Example 4: Forecasting and timelines</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh4.googleusercontent.com/A3wGLUQX59yxBvN5Il_-EsnH1IVOnPP2Cyck1lH43Kp44_UM5HcBuOCSoFE_amewU0drWjAH27ttmVJheHZRfWtuQpK7gzbtSU6UlT-Nt_HmZmoeEm_fbyZ_GMCq3w00XtnSqlD8ZUVSkxDJxzyGKae5HvpDVzCh0RgZgYDgAxijONEcm2b3HSvQ=w400-h400" width="400" /><figcaption><i>artificial intelligence which is thinking about a line on a graph, forecasting, digital art, trending on artstation</i></figcaption></figure><p>Many
questions depend on how soon we’re going to get AGI. As the saying
goes: prediction is very hard, especially about the future - and this is
doubly true of predicting major technological changes. </p><p>One way to try to forecast AGI timelines is to <span><span><span><a href="https://www.lesswrong.com/posts/H6hMugfY3tDQGfqYL/what-do-ml-researchers-think-about-ai-in-2022"><u>ask experts</u></a></span></span></span>, or find other ways of aggregating the opinion of people who have the knowledge or incentive to be right, as for example <span><span><span><a class="MetaculusPreview-link" href="https://www.metaculus.com/questions/3479/date-weakly-general-ai-is-publicly-known/"><u>prediction markets do</u></a></span></span></span>. Both of these are essentially just ways of tapping into the intuition of a bunch of people who hopefully have some idea.</p><p>In an attempt to shed new light on the matter, Ajeya Cotra (a researcher at Open Philanthropy) wrote <span><span><span><a href="https://www.lesswrong.com/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines"><u>a long report</u></a></span></span></span>
on trying to forecast AI milestones by trying out several ways of
analogising AI to biological brains. The report is often referred to as
“Biological Anchors”. For example, you might assume that an ML model
that does as much computation as the human brain has a decent chance of
being a human-level AI. There are many degrees of freedom here: is the
relevant compute number the amount of compute the human brain uses to
run versus the amount of compute it takes to run a trained ML system, or
the total compute of a human brain over a human lifetime versus the
compute required to train the ML model from scratch, or something else
entirely? In her report, Cotra looks at a range of assumptions for this,
and at predictions of future compute trends, and somewhat surprisingly
finds that which set of assumptions you make doesn’t matter too much;
every scenario involves a >50% probability of human-level AI by 2100.</p><p>The
Biological Anchors method is very imprecise. For one, it neglects
algorithmic improvements. For another, it is very unclear what the right
biological comparison point is, and how to translate ML-relevant
variables like compute measured in FLOPS (FLoating point OPerations per
Second) or parameter count into biological equivalents. However, the
report does a good job of acknowledging and taking into account all this
uncertainty in its models. More generally, anything that sheds light
on the question of when we get AGI seems highly relevant.</p><h2 id="Deconfusion">Deconfusion</h2><blockquote><p><i>Reasoning about how to align AGI involves reasoning about complex concepts, such as intelligence, alignment and values, and <b>we’re pretty confused about what these even mean</b>. This means any work we do now is plausibly not helpful and definitely not reliable. As such, our priority should be <b>doing conceptual work on how to think about these concepts and what we’re aiming for</b>, and trying to become less confused.</i></p></blockquote><p>Of
all the categories under discussion here, deconfusion has maybe the
least clear path to impact. It’s not immediately obvious how becoming
less confused about concepts like these is going to translate into an
improved ability to align AGIs.</p><figure class="image image_resized" style="width: 63.37%;"><img height="271" src="https://lh6.googleusercontent.com/mzdxKdTPIz6-t4D0JsS5T43ejdAIeb3sTKmLHlPauGwSsdR24vjCj0nvR14lnN1vNttmMp87KcJxMPBrAg11jVdjQnft3RD_jtaIIG_oSJSju-qpQ5_zbjU1KMd8FCOkDNE1e4kLqSG8FcQupVyMsx59SShpgreeo7-Suava64STtT-GzKGFixsOJg=w400-h271" width="400" /></figure><p>Some
kinds of deconfusion research are just about finding clearer ways of
describing different parts of the alignment problem (Hubinger’s <span><span><span><a href="https://arxiv.org/abs/1906.01820"><u>Risks From Learned Optimisation</u></a></span></span></span>,
where he first introduces the inner/outer alignment terminology, is a
good example of this). But other types of research can dive heavily into
mathematics and even philosophy, and be very difficult to understand.</p><h3 id="Example_1__MIRI_and_Agent_Foundations">Example 1: MIRI and Agent Foundations</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh6.googleusercontent.com/hXPqn2pw80WLvpP1bIIpelessP39neowbs15Db9EwVAm0xv9JI_LQNwz-M56V7BwEiaJpahTSebfHCempkz_qkXgz_AZ2_RXtofjEy5nYdYlUEbKViWZa1e4SgJuhnVAAEcar_bEMvQa0DT1Iw_fL2jcUBayJbKSbX6jAgnOiIXJZbhVtlf544BE=w400-h400" width="400" /><figcaption><i>robot sitting in front of a television, playing a videogame, digital art</i></figcaption></figure><p>The
organisation most associated with this view is MIRI (the Machine
Intelligence Research Institute). Its founder, Eliezer Yudkowsky, has
written extensively on AI alignment and human rationality, as well as
topics as wide-ranging as evolutionary psychology and quantum physics.
His post <span><span><span><a href="https://intelligence.org/2018/10/03/rocket-alignment/"><u>The Rocket Alignment Problem</u></a></span></span></span>
tries to get across some of his intuitions behind MIRI’s research, in
the form of an analogy – trying to build aligned AGI without having
a deeper understanding of concepts like intelligence and values is like
trying to land a rocket on the moon by just pointing and shooting,
without a working understanding of Newtonian mechanics. </p><p>Cryptography
provides a different lens through which to view this kind of
foundational research. Suppose you were trying to send secret messages
to an ally, and to make sure nobody could intercept and read your
messages you wanted a way to measure how much information was shared
between the original and encrypted message. You might use the <span><span><span><a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient"><u>correlation coefficient</u></a></span></span></span>
as a proxy for the shared information, but unfortunately having a
correlation coefficient of zero between the original and encrypted
message isn’t enough to guarantee safety. But if you find the concept
of <span><span><span><a href="https://en.wikipedia.org/wiki/Mutual_information"><u>mutual information</u></a></span></span></span>,
then you’re done – ensuring zero mutual information between your
original and encrypted message guarantees the adversary will be unable
to read your message. In other words, only once you’ve found a <b>“true name” </b>-
a robust formalisation of the intuitive concept you’re trying to
express mathematically - can you be free from the effects of Goodhart’s
law. Similarly, maybe if we get robust formulations of concepts like
“agency” and “optimisation”, we would be able to inspect a trained
system and tell whether it contained any misaligned inner optimisers
(see the first post), and these inspection tools would work even in
extreme circumstances (such as the AI becoming much smarter than us).</p><p>Much of MIRI’s research has come under the heading of <span><span><span><a href="https://intelligence.org/embedded-agency/"><u>embedded agency</u></a></span></span></span>.
This tackles issues that arise when we are considering agents which are
part of the environments they operate in (as opposed to standard
assumptions in fields like reinforcement learning, where the agent is
viewed as separate from their environment). Four main subfields of this
area of study are:</p><ul><li><b>Decision theory</b> (adapting classical decision theory to embedded agents)</li><li><b>Embedded world-models </b>(how to form true beliefs about a world in which you are embedded)</li><li><b>Robust
delegation</b> (understanding what trust relationships can exist between
an agent and its future - maybe far more intelligent - self)</li><li><b>Subsystem alignment</b> (how to make sure an agent doesn’t spin up internal agents which have different goals)</li></ul><blockquote><p><i>Vignette: <b><u>MIRI</u></b></i></p><figure class="image image_resized" style="width: 18.79%;"><img height="320" src="https://lh4.googleusercontent.com/1oDrxk4RHs0z4MMOjb4ttHOEIkb3xcWIghMvHTakORphqlv-yo6k_I9vyR4iQIwtl89C9abJUxCmGWlGck5yV4rleaqf305508iDSriCXX3zz85FCeTRy7Aq37r6nuqawkQvqAZwGupf-J2CxCsNsNpvnErLKzbO5M70x3mRGgLPHCPVeyp_Qu1P=w320-h320" width="320" /></figure><p><i>MIRI
is the oldest organisation in the AI alignment space. It used to be
called the Singularity Institute, and had the goal of accelerating the
development of AI. In 2005 they shifted focus towards trying to manage
the risks from advanced AI. This has largely consisted of fundamental
mathematical research of the type described above. MIRI might be better
described as a confluence of smart people with backgrounds in highly
technical fields (e.g. mathematics), working on different research
agendas that share underlying philosophies and intuitions. They have a
nondisclosure policy by default, which they explain in this </i><span><span><span><a href="https://intelligence.org/2018/11/22/2018-update-our-new-research-directions/#section3"><i><u>announcement post</u></i></a></span></span></span><i> from 2018.</i></p></blockquote><h3 id="Example_2__John_Wentworth_and_Natural_Abstractions">Example 2: John Wentworth and Natural Abstractions</h3><figure class="image image_resized" style="width: 50.19%;"><img height="400" src="https://lh4.googleusercontent.com/_7sl0Lw5LqhCAT55HvClH1a8sKz8HgPHlICee0jH5FP8hdk02fTA40tCJAgPniD5yF0K254SbDVth6GxPwlVAN1zgMEc6hLjNTZmK58z6cz10d44oX25MVODlMWjBkxnwdLSCIQMTg-cEJqgp_bHYhYnSRP6DMdHE_Ou9p2-HWtnrB1eeVCEfpEP=w400-h400" width="400" /><figcaption><i>thermometer being used to measure a robot, digital art, trending on artstation</i></figcaption></figure><p>John Wentworth is an independent researcher, who publishes most of his work on <span><span><span><a href="https://www.lesswrong.com/users/johnswentworth"><u>LessWrong</u></a></span></span></span> and the <span><span><span><a href="https://www.alignmentforum.org/users/johnswentworth"><u>AI Alignment Forum</u></a></span></span></span>. His main research agenda focuses on the idea of <span><span><span><a href="https://www.lesswrong.com/posts/Fut8dtFsBYRz8atFF/the-natural-abstraction-hypothesis-implications-and-evidence"><u>Natural Abstractions</u></a></span></span></span>, which can be described in terms of three sub-claims:</p><ul><li><b>Abstractability</b><br />Our
physical world abstracts well, i.e. we can usually come up with simpler
summaries (abstractions) for much more complicated systems (example: a
gear is a very complex object containing a vast number of atoms, but we
can summarise all relevant information about it in just one number - the
angle of rotation).</li><li><b>Human-Compatibility</b><br />These are the abstractions used by humans in day-to-day thought/language.</li><li><b>Convergence</b><br />These
abstractions are "natural", in the sense that we should expect a wide
variety of intelligent agents to converge on using them.</li></ul><p>The <span><span><span><a href="https://www.lesswrong.com/posts/gdEDPHjCY5DKsMsvE/the-pragmascope-idea"><u>ideal outcome</u></a></span></span></span>
of this line of research would be some kind of measurement device (an
“abstraction thermometer”), which could take in a system like a trained
neural network and spit out a representation of the abstractions
represented by that system. In this way, you’d be able to get a better
understanding of what the AI was actually doing. In particular, you
might be able to identify inner alignment failures (the AI’s true goal
not corresponding to the reward function it was being trained on), and
you could retrain it while pointed at the intended goal. So far, this
line of research has consisted of some <span><span><span><a href="https://www.lesswrong.com/posts/jJf4FrfiQdDGg7uco/the-telephone-theorem-information-at-a-distance-is-mediated"><u>fairly</u></a></span></span></span> <span><span><span><a href="https://www.lesswrong.com/posts/cqdDGuTs2NamtEhBW/maxent-and-abstractions-current-best-arguments"><u>dense</u></a></span></span></span> <span><span><span><a href="https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information"><u>mathematics</u></a></span></span></span>, but Wentworth has <span><span><span><a href="https://www.lesswrong.com/posts/gdEDPHjCY5DKsMsvE/the-pragmascope-idea"><u>described</u></a></span></span></span>
his plans to build on this with more empirical work (e.g. training
neural networks on the same data, and using tools from calculus to try
and compare the similarity of concepts learned by each of the
networks). </p><figure class="image image_resized" style="width: 82.69%;"><img height="174" src="https://lh5.googleusercontent.com/lsaRKjfkVGAtWSWXm3Fe63DdV1WhNOKUV7au1apShXw58CnjUuOZT_edQRbe2bW9YWvWyRKYPI3MOfoUhtPL9u__KUDd77nYHHxlDqmDPgkEPdNQqAiMt3jibph5545p0UYWHxJh43TafpfJ851C9-uqcM9tYkXHk6daURboffMngti_a78zOwxc=w400-h174" width="400" /></figure><h2 id="AI_governance">AI governance</h2><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh5.googleusercontent.com/zkeRVL9HgLo8hrt4Glx4Tp0RheimnIT7UBmKmwy3YfiO9UWGd-pVn9IfhkKK_VMJ8n9r9p8J6dwkJcAXifAnhNFKzRli5MMzzww5wdjHhlsdEotOi9i3SffPa9H0N5qQJXOnJwO_ZyVxvsRiWEADnwAeJQCOED1D_vLqha3DgDzhZ9G6TUDIxAnG=w400-h400" width="400" /><figcaption><i>judging, presiding over a trial, sentencing a robot, digital art, artstation</i></figcaption></figure><p>In
these posts, we’ve mainly focused on the technical side of the issue.
This is important, especially for understanding why there is a problem
in the first place. However, the management and reduction of AI risk
obviously includes not just technical approaches like those outlined in the
above sections, but also <span><span><span><a href="https://80000hours.org/articles/ai-policy-guide/"><u>the field of AI governance</u></a></span></span></span>, which tries to understand and push for the right types of policies for advanced AI systems.</p><p>For
example, the Cold War was made a lot more dangerous by the nuclear arms
race. How do we avoid having an arms race in AI, either between nations
or companies? More generally, how can we make sure that safety
considerations are given appropriate weight by the teams building
advanced AI systems? How do we make sure any technical solutions get
implemented?</p><p>It’s also very hard to say what the impacts of AI
will be, across a broad range of possible technical outcomes. If AI
capabilities at some point advance very quickly from below human-level
to far beyond human level, the way the future looks will likely
mostly be determined by technical considerations about the AI system.
However, if progress is slower, there will be a longer period of time
where weird things are happening because of advanced AI - for example,
significantly accelerated economic growth, or mass unemployment, or an
AI-assisted boom in science - and these will have economic, social, and
political ramifications that will play out in a world not too dissimilar
from our own. Someone should be working on figuring out what these
ramifications will be, especially if they might alter the balance of
existential threats that civilisation faces; for example, if they make
geopolitics less stable and nuclear war more likely, or affect the
environment in which even more powerful AI systems are developed.</p><p>The Centre for the Governance of AI, or <span><span><span><a href="https://www.governance.ai/"><u>GovAI</u></a></span></span></span> for short, is an example of an organisation in this space.</p><h2 id="Field_building">Field-building</h2><figure class="image image_resized" style="width: 50.38%;"><img height="400" src="https://lh5.googleusercontent.com/hMQM1f9Pr8VBRGtxHDoxtOTLtSvu6Vej4lrcK4uuli-OeHWHCpJ60FJ5ZENKQpOD859D-gDCh4gdUu9KRKNpoOXyD55nvXOSI2lls0VMVJUF9LlkIkFuYoGA9vPOw7OEYGP6hBjlBAEeYVyiFA2J5Jg994IdHdpOCs9wSY4_jAHBgsAyDvbh7fl3=w400-h400" width="400" /><figcaption><i>robot giving a lecture in a university, group of students, hands up, digital art, artstation</i></figcaption></figure><p><i>One
of the most important ways we can make AI go well is by increasing the
number of capable researchers doing alignment research.</i></p><p>As
mentioned, AI safety is still a relatively young field. The case here is
that we might do better to grow the field, and increase the quality of
research it produces in the future. Some forms that field building can
take are:</p><ul><li><b>Setting up new ways for people to enter the field</b><br />There are too many to list here. To give a few different structures which exist for this purpose:<ul><li><b>Reading groups and introductory programmes. </b><br />Maybe the most exciting one from the last few years has been the Cambridge <span><span><span><a href="https://www.eacambridge.org/agi-safety-fundamentals"><u>AGI Safety Fundamentals Programme</u></a></span></span></span>,
which has curricula for technical alignment and AI governance. The
technical curriculum consists of 7 weeks of reading material and group
discussions, and a final week of capstone projects where the
participants try their hand at a project / investigation / writeup
related to AI safety. Beyond this, many people are also setting up
reading groups in their own universities for books like <i>Human Compatible</i>. </li><li><b>Ways of supporting independent researchers</b><br />The <span><span><span><a href="https://aisafety.camp/"><u>AI Safety Camp</u></a></span></span></span>
is an organisation which matches applicants with mentors posing a
specific research question, and is structured as a series of group
research sprints. They have produced work such as the example of inner
misalignment in the CoinRun game, which we discussed in a previous
section. Other examples of organisations which support independent
research include <span><span><span><a href="https://www.lesswrong.com/posts/jfq2BH5kfQqu2vYv3/we-are-conjecture-a-new-alignment-research-startup"><u>Conjecture</u></a></span></span></span>,
a recent alignment startup which does their own alignment research as
well as providing a structure to host externally funded independent
conceptual researchers, and <span><span><span><a href="https://alignmentfund.org/"><u>FAR (the Fund for Alignment Research)</u></a></span></span></span>.</li><li><b>Coding bootcamps</b><br />Since
current systems are increasingly being bottlenecked by alignment and
interpretability barriers rather than capabilities, in recent years more
focus has been directed towards working with cutting-edge deep learning
models. This requires strong coding skills and a good understanding of
the relevant ML, which is why bootcamps and programmes specifically
designed to skill up future alignment researchers have been created. Two
such examples are <span><span><span><a href="https://www.lesswrong.com/posts/3ouxBRRzjxarTukMW/apply-to-the-second-iteration-of-the-ml-for-alignment"><u>MLAB</u></a></span></span></span> (the Machine Learning for Alignment Bootcamp, run by Redwood Research), and <span><span><a class="PostLinkPreviewWithPost-link" href="https://forum.effectivealtruism.org/posts/9RYvJu2iNJMXgWCBn/introducing-the-ml-safety-scholars-program"><u>MLSS</u></a></span></span>
(the Machine Learning Safety Scholars Programme, which is based on
publicly available material as well as lectures produced by Dan
Hendrycks). </li></ul></li><li><b>Distilling research</b><br />In <span><span><span><a href="https://www.lesswrong.com/posts/zo9zKcz47JxDErFzQ/call-for-distillers"><u>this post</u></a></span></span></span>,
John Wentworth makes the case for more distillation in AI alignment
research - in other words, more people who focus on understanding and
communicating the work of alignment researchers to others. This often
takes the form of writing more accessible summaries of hard-to-interpret
technical papers, and emphasising the key ideas.</li><li><b>Public outreach / better intro material</b><br />For instance, books like Brian Christian’s <span><span><span><a href="https://en.wikipedia.org/wiki/The_Alignment_Problem"><i><u>The Alignment Problem</u></i></a></span></span></span><i>, </i>Stuart Russell’s <span><span><span><a href="https://en.wikipedia.org/wiki/Human_Compatible"><i><u>Human Compatible</u></i></a></span></span></span> and Nick Bostrom’s <span><span><span><a href="https://en.wikipedia.org/wiki/Superintelligence:_Paths,_Dangers,_Strategies"><i><u>Superintelligence</u></i></a></span></span></span>
communicate AI risk to a wide audience. These books have been helpful
for making the case for AI risks more mainstream. Note that there can be
some overlap between this and distilling research (Rob Miles’ <span><span><span><a href="https://www.youtube.com/c/RobertMilesAI"><u>channel</u></a></span></span></span> is another great example here).</li><li><b>Getting more of the academic community involved</b><br />Since
AI safety is a hard technical problem, and since misaligned systems
generally won’t be as commercially useful as aligned ones, it makes
sense to try and engage the broader field of machine learning. One great
example of this is Dan Hendrycks’ paper <span><span><span><a href="https://mailchi.mp/08a639ffa2ba/an-167concrete-ml-safety-problems-and-their-relevance-to-x-risk"><u>Unsolved Problems in ML Safety</u></a></span></span></span>
(which describes a list of problems in AI safety, with the ML community
as the target audience). Stuart Russell has also engaged a lot with the
ML community. </li></ul><p>Note that this is certainly not a
comprehensive overview of all current AI alignment proposals (a few more
we haven’t had time to talk about are CAIS, Andrew Critch’s
cooperation-and-coordination-failures framing for AI risks, and many
others). However, we hope this has given you a brief overview of some of
the different approaches taken by people in the field, as well as the
motivations behind their research</p><figure class="image image_resized" style="width: 100%;"><img height="254" src="https://lh4.googleusercontent.com/8IrJGfz6Tvmu9txQyjfBchN5qa5oOcRxA82PjEq8PoLyjbURekcePnCBANH_vlOljnG7HX1kbix_x_bjFbLch5V06sArMylvGBYk1xuL0x4CyGnv0zR5kTxIEborM3YNnhK4cLajUHkY0F0VEgT9Sfj-tOAMFyA8QhGSs5e7nx-kjheq4sYPQei2=w400-h254" width="400" /><figcaption>Map of the solution approaches we've discussed so far</figcaption></figure><h1 id="Conclusion"><b>Conclusion</b></h1><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh3.googleusercontent.com/jyPCCOLA3QoyE7bmLTxk_y-M-dkTwYwgs81eERrmJelCc5DUk-rc9KPqQ9R-DaNiTaAYW-odeEEViOBgVUdPvd-qidCA-ThS0gUHDQuBFts6yfh_sbC576gTqD4vc5O1a9mjtw4UyrEPo7HWGm2LG_irSnFLTraBFaj9FmcLd94yXr97VQX3MXxp=w400-h400" width="400" /><figcaption><i>people
walking along a path which stretches off and disappears into a colorful
galaxy filled with beautiful stars, digital art, trending on artstation</i></figcaption></figure><p>Advanced
AI is, at the very least, a technology that promises to have effects on
the scale of the internet or computer revolutions. It is perhaps even more
likely to have effects akin to those of the <b>industrial revolution</b> (which allowed for the automation of much <i>manual </i>labour) and the <b>evolution of humans</b> (the last time something significantly smarter than everything that had come before appeared on the planet).</p><p>It’s
easy to invent technologies that the same could be said about - a magic
wish-granting box! Wow! But unlike magic wish-granting boxes, something
like advanced AI, or AGI, or transformative AI, or <span><span><span><a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/"><u>PASTA</u></a></span></span></span>
(Process for Automating Scientific and Technical Achievement) seems to
be headed our way. The smart money is on it very likely coming <b>this century</b>, and quite likely in the <b>first half</b>.</p><p>If
you look at the progress in modern machine learning, and especially the
past few years of progress in so-called deep learning, it is hard not
to feel a sense of rushing progress. The past few years of progress, in
particular the success of the transformer architecture, should update us
in the direction that intelligence might be a surprisingly easy
problem. What is essentially fancy iterative statistical curve-fitting
with a few hacks thrown in already manages to write fluent appropriate
English text in response to questions, create paintings from a
description, and carry out multi-step logical deduction in natural
language. <b>The fundamental problem that plagued AI progress for
over half a century - getting fuzzy/intuitive/creative thinking into a
machine, in addition to the sharp but brittle logic at which computers
have long excelled - seems to have been cracked.</b> There is a solid empirical pattern of predictably improving performance akin to Moore’s law - the “<span><span><span><a href="https://arxiv.org/pdf/2001.08361.pdf"><u>scaling laws</u></a></span></span></span>”
we mentioned in the first post - that we seem not to have hit the
limits of yet. There are experts in the field who would not be surprised
if the remaining insights for cracking human-level machine intelligence
could fit into a few good papers.</p><p>This is not to say that AGI is
definitely coming soon. The field might get stuck on some stumbling
block for a decade, during which there will no doubt be much written
about the failed promises and excess hype of the early-2020s deep
learning revolution.</p><p>Finally, as we’ve argued, by default the arrival of advanced AI might plausibly lead to civilisation-wide catastrophe.</p><p>There are few things in the world that fit all of the following points:</p><ul><li>A
potentially transformative technology whose development would likely
rank somewhere between the top events of the century and the top events
in the history of life on Earth.</li><li>Something that is likely to happen in the coming decades.</li><li>Something that has a meaningful chance of being cataclysmically bad.</li></ul><p>For
those thinking about the longer-term picture, whatever the short-term
ebb and flow of progress in the field is, AI and AI risk loom large when
thinking about humanity’s future. The main ways in which this might
stop being the case are:</p><ul><li>There is a major flaw in the
arguments for at least one of the above points. Since many of the
arguments are abstract and not empirically falsifiable before it’s too
late to matter, this is possible. However, note that there is a strong
and recurring pattern of many people, including in particular many
extremely-talented people, running into the arguments and taking them
more and more seriously. (If you do have a strong argument against the
importance of the AI alignment problem, there are many people - us
included - who would be very eager to hear from you. Some of these
people - us not included - would probably also pay you large amounts of
money.)</li><li>We solve the technical AI alignment problem, and we
solve the AI governance problem to a degree where the technical
solutions will be implemented and it seems very unlikely that advanced
AI systems will wreak havoc with society.</li><li>A catastrophic outcome for human civilisation, whether resulting from AI itself or something else. </li></ul><p>The
project of trying to make sure the development of advanced AI goes well
is likely one of the most important things in the world to be working
on (if you’re lost, the <span><span><span><a href="https://80000hours.org/problem-profiles/positively-shaping-artificial-intelligence/"><u>80 000 Hours problem profile</u></a></span></span></span>
is a decent place to start). It might turn out to be easy - consider
how many seemingly intractable scientific problems dissolved once
someone had the right insight. But right now, at least, it seems like it
might be a fiendishly difficult problem, especially if it continues to
seem like the insights we need for alignment are very different from the
insights we need to build advanced AI.</p><p>Most of the time, science
and technology progress in whatever direction is easiest or flows most
naturally from existing knowledge. Other times, reality throws down a
gauntlet, and we must either overcome the challenge or fail. May the
best in our species - our ingenuity, persistence, and coordination -
rise up, and deliver us from peril.</p></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-10112082033288490512022-09-11T11:27:00.002+01:002022-09-24T10:49:16.943+01:00AI risk intro 1: advanced AI might be very bad<div style="text-align: center;"><b><i>This post was a joint effort with <a href="https://www.perfectlynormal.co.uk/">Callum McDougall</a>.</i></b></div><div style="text-align: center;"><i> </i></div><div style="text-align: center;"><i><span style="font-size: x-small;">9.6k words (~25min)</span> </i><br /></div><br /><div><h2><b>Introduction</b></h2>
<p><span>If human civilisation is destroyed this century, the most likely
cause is advanced AI systems. This might sound like a bold claim to
many, given that we live on a planet full of existing concrete threats
like climate change, over ten thousand nuclear weapons, and Vladimir
Putin.</span> However, it is a conclusion that many people who think about the topic keep coming to. While it is not easy to describe the case for risks from advanced AI in a single piece, here we make an effort that assumes no prior knowledge. Rather than try to argue from theory straight away, we approach it from the angle of what computers actually can and can’t do.</p>
<h2><b>The Story So Far</b></h2>
<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgx7vowH6i3Ti4ML0V2FWajMZW6Bhrk3VVG_MrtyAb8IT4_aHBDLLuvUNHqUUUHORvKixBz32NegaO0LKEdKdTnWOMGf3gp7e7kmydWCvy3jeDrU1w21KECa1Q8TOoImRihhBzFLlc0PgMKjo6jq58FHAFIrPDbWdRv7NtQCI1cxVa92bUfoSHzYsqAlg/s1024/hY1j9oUn8isdCmnQz-2hdVnqsld9KFfkIVEY0AcjlbryYGNlsKzann09MnFAdlNQmtlBas3aV4Y2dcnWG1tbwYwFfMg1XbBJyhh-Z4elcKOU_DZ2U7Zek0YDCAcN-ucAim4p2mjqIMX0iol6vFgVwr-OBAQP9Rb0ns7z2gnZr-xLJ2f5jxDtRtiOZw.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgx7vowH6i3Ti4ML0V2FWajMZW6Bhrk3VVG_MrtyAb8IT4_aHBDLLuvUNHqUUUHORvKixBz32NegaO0LKEdKdTnWOMGf3gp7e7kmydWCvy3jeDrU1w21KECa1Q8TOoImRihhBzFLlc0PgMKjo6jq58FHAFIrPDbWdRv7NtQCI1cxVa92bUfoSHzYsqAlg/w400-h400/hY1j9oUn8isdCmnQz-2hdVnqsld9KFfkIVEY0AcjlbryYGNlsKzann09MnFAdlNQmtlBas3aV4Y2dcnWG1tbwYwFfMg1XbBJyhh-Z4elcKOU_DZ2U7Zek0YDCAcN-ucAim4p2mjqIMX0iol6vFgVwr-OBAQP9Rb0ns7z2gnZr-xLJ2f5jxDtRtiOZw.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: an image generated by OpenAI’s DALL-E 2, from the prompt:
"artist's impression of an artificial intelligence thinking about chess,
digital art, artstation".</i></p></td></tr></tbody></table>
<p>(This section can be skipped if you understand how machine learning works and what it can and can’t do today)</p>
<p>Let’s say you want a computer to do some complicated task, for example learning chess. The computer has no understanding of high-level things like “chess”, “board”, “piece”, “move”, or “win” - it only understands how to do a small set of things. Your task as the programmer is to break down the high-level goal of “beat me at chess” into simpler and simpler steps, until you arrive at a simple mechanistic description of what the computer needs to do. If the computer does beat you, it’s not because it had any new insight into the problem, but rather because you were clever enough to find some <a href="https://en.wikipedia.org/wiki/Minimax">set of steps</a> that, carried out blindly in sufficient speed and quantity, overwhelms whatever cleverness you yourself can apply during the game. This is how Deep Blue beat Kasparov, and more generally how most software and the so-called “Good Old-Fashioned AI” (GOFAI) paradigm works.</p>
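<p>To make the “set of steps carried out blindly” idea concrete, here is a minimal sketch of minimax search. Chess is far too large for a few lines, so this assumes a much smaller game of our own choosing: a pile of stones, players alternately take 1, 2 or 3, and whoever takes the last stone wins. The function names <code>minimax</code> and <code>best_move</code> are illustrative, and this is not a description of how Deep Blue itself was implemented.</p>

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def minimax(stones: int, maximising: bool) -> int:
    """Score a position: +1 if the maximising player wins with perfect play, -1 otherwise."""
    if stones == 0:
        # No stones left: whoever moved last took the final stone and won,
        # so the player whose turn it is now has lost.
        return -1 if maximising else 1
    # Blindly try every legal move and score the position that results.
    scores = [
        minimax(stones - take, not maximising)
        for take in (1, 2, 3)
        if take <= stones
    ]
    # The maximising player picks the best score, the opponent the worst.
    return max(scores) if maximising else min(scores)


def best_move(stones: int) -> int:
    """Return how many stones the maximising player should take."""
    legal = [take for take in (1, 2, 3) if take <= stones]
    return max(legal, key=lambda take: minimax(stones - take, False))


print(best_move(7))  # number of stones to take from a pile of 7
```

<p>Nothing in this program “understands” the game. It enumerates outcomes and compares numbers, which is the sense in which classical programs win by speed and quantity rather than insight.</p>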
<p>Programs of this type can be powerful. In addition to <a href="https://en.wikipedia.org/wiki/Stockfish_(chess)">beating humans at chess</a>, they can <a href="https://www.google.com/maps/">calculate shortest routes</a> on maps, <a href="https://en.wikipedia.org/wiki/Coq">prove maths theorems</a>, <a href="https://en.wikipedia.org/wiki/Autopilot">mostly fly airplanes</a>, and <a href="https://duckduckgo.com/">search all human knowledge</a>. Programs of this type are responsible for the stereotypical impression of computers as logical, precise, uncreative, and brittle. They are essentially executable logic.</p>
<p>Many people hoped that you could write programs to do “intelligent” things. These people were right - after all, had you asked almost anyone before Deep Blue won whether playing chess counts as “intelligence”, they’d have said yes. But “classical” programming hit limitations, in particular in doing “obvious” things like figuring out whether an image is of a cat or a dog, or being able to respond in English. This observation that abstract reasoning and logic are easy for computers, while humanly-intuitive tasks are hard, came to be known as <a href="https://en.wikipedia.org/wiki/Moravec's_paradox">Moravec’s paradox</a>, and these limitations held back progress in AI for a long time.</p>
<p>There is another way of programming - machine learning (ML) - going back to the 1950s, almost as far as classical programming itself. For a long time, it was held back by hardware limitations (along with some algorithmic and data limitations), but thanks to <a href="https://en.wikipedia.org/wiki/Moore's_law">Moore’s law</a> hardware has advanced enough for it to be useful for real problems.</p>
<p>If classical programming is executable logic, ML is executable statistics. In ML, the programmer does not define how the system works. The programmer defines how the system learns from data.</p>
<p>The “learning” part in “machine learning” makes it sound like something refined and sensible. This is a false impression. ML systems learn by going through a training process that looks like this:</p>
<p><b>Step 1:</b> you define a statistical model. This takes the form of some equation that has some unknown constants (“parameters”) in it, and some variables where you plug in input values. Together, the parameters and input variables define an output. (The equations in ML can be <i>extremely</i> large, for example with billions of parameters and millions of inputs, but they are very structured and almost stupidly simple.)</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjORh8X4FaEWm0fI4ShrWwb-psijpMWscfXou0LCBeML52KElxu3KXuc3TMWVZFlxGWRkgAJj2Hgrp--Y6B-F3pLXdPlBKu0nRyLP6y8XbJfmm0yRJEtlqip6dqoUTcWRJMTbpeWx9uzk5uz8GEsmWPL4UQyGw7dWG5YdUxHvl-G11twIQc3oF_qlAewQ/s1600/lGrV82iGSXqpdpvBUgiwoVlZ3Fw4Ic_981gdWk3t_0-zr0mSSNNonagmkH294S8Vdu_pJBV241przsjs-DCqNh3uhyN-MaVW7M5dbZ8Q_YkJhslb2EXlxnMoo95WPRY0UGaE8a_OeCVt2QsOlkHAHj3dB67cWEy1rQFHUytP_4pvNBZA23iKkAOB7g.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="420" data-original-width="1600" height="168" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjORh8X4FaEWm0fI4ShrWwb-psijpMWscfXou0LCBeML52KElxu3KXuc3TMWVZFlxGWRkgAJj2Hgrp--Y6B-F3pLXdPlBKu0nRyLP6y8XbJfmm0yRJEtlqip6dqoUTcWRJMTbpeWx9uzk5uz8GEsmWPL4UQyGw7dWG5YdUxHvl-G11twIQc3oF_qlAewQ/w640-h168/lGrV82iGSXqpdpvBUgiwoVlZ3Fw4Ic_981gdWk3t_0-zr0mSSNNonagmkH294S8Vdu_pJBV241przsjs-DCqNh3uhyN-MaVW7M5dbZ8Q_YkJhslb2EXlxnMoo95WPRY0UGaE8a_OeCVt2QsOlkHAHj3dB67cWEy1rQFHUytP_4pvNBZA23iKkAOB7g.png" width="640" /></a></div>
<p><b>Step 2</b>: you don’t know what parameters to put in the equation, but you can literally roll some dice if you want (or the computer equivalent).</p>
<p><b>Step 3</b>: presumably there’s some task you want the ML system to do. Let it try. It will fail horribly and produce gibberish (cf. the previous part where we just put random numbers everywhere).</p>
<p><b>Step 4</b>: There's a simple algorithm called gradient descent, which, when using another algorithm called backpropagation to calculate the gradient, can tell you which direction all the parameters should be shifted to make the ML system slightly better (as judged, for example, by its performance on examples in a dataset).</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgM6c2JdD_GEOckQoSQa3VmDW_MGdJCsYQXnijjZBYPxou_u1BQlqfo-keyffoIuImm8PMizS_FyJEWgp1oj_-HXsjEC8yswPw5RmY6QrNE1oYh20AF0ZUR2aRVa_w-SX_E9-z6WXnJygAzCEpuAdCH6xBGy4mkiLFFKcarKkOrlpMB-WAoIPsrD-iVDA/s1263/srXTJFyyK6k_UpaQBxCcua1U6qfS5dbpDyo_IS0Kcsos7LRMVH-RbbvKV7fPPUFkx5C8wkifbpIUlMN-9F6pJVEAOoPcuR-4lRgHhVX6DQvC3C_HrC93KRZPpDuADnyyzsXJpMxA3TGRnIsg0aBuxrqw4npBflti9ZsCeATutgh99zABMoiMDztt.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1039" data-original-width="1263" height="526" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgM6c2JdD_GEOckQoSQa3VmDW_MGdJCsYQXnijjZBYPxou_u1BQlqfo-keyffoIuImm8PMizS_FyJEWgp1oj_-HXsjEC8yswPw5RmY6QrNE1oYh20AF0ZUR2aRVa_w-SX_E9-z6WXnJygAzCEpuAdCH6xBGy4mkiLFFKcarKkOrlpMB-WAoIPsrD-iVDA/w640-h526/srXTJFyyK6k_UpaQBxCcua1U6qfS5dbpDyo_IS0Kcsos7LRMVH-RbbvKV7fPPUFkx5C8wkifbpIUlMN-9F6pJVEAOoPcuR-4lRgHhVX6DQvC3C_HrC93KRZPpDuADnyyzsXJpMxA3TGRnIsg0aBuxrqw4npBflti9ZsCeATutgh99zABMoiMDztt.png" width="640" /></a></div><br />
<p><b>Step 5</b>: You shift all the numbers a bit based on the algorithm in step 4.</p>
<p><b>Step 6</b>: Go back to step 3 (letting the system try). Repeat until (a) the system has stopped improving for a long time, (b) you get impatient, or - increasingly plausible these days - (c) you run out of your compute budget.</p>
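<p>The six steps above can be sketched in a few lines of code. This is a toy under stated assumptions: the “model” is a straight line with two parameters, the dataset is five made-up points on the line y = 2x + 1, and the gradient is written out by hand (for a model this small, that is all backpropagation would compute). The names <code>train</code>, <code>lr</code> and <code>steps</code> are our own choices.</p>

```python
import random


def train(xs, ys, steps=2000, lr=0.05, seed=0):
    """Fit y = w * x + b to the data by gradient descent on mean squared error."""
    rng = random.Random(seed)
    # Step 1: the statistical model is y = w * x + b, with parameters w and b.
    # Step 2: roll the dice for the initial parameter values.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    for _ in range(steps):  # Step 6: repeat until the step budget runs out.
        # Step 3: let the model try the task with its current parameters.
        preds = [w * x + b for x in xs]
        # Step 4: work out which direction each parameter should move
        # (the gradient of the mean squared error with respect to w and b).
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Step 5: shift all the numbers a bit in the improving direction.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]  # made-up data from the line y = 2x + 1
w, b = train(xs, ys)
print(round(w, 3), round(b, 3))
```

<p>Real systems differ mainly in scale: billions of parameters instead of two, and gradients computed automatically rather than by hand. The loop itself has the same shape.</p>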
<p>If you’re doing simple curve-fitting statistics problems, it makes sense that this kind of thing works. However, it’s surprising just how far it scales. It turns out that this method, plus some clever ideas about what type of model you choose in step 1, plus willingness to burn millions of dollars on just <i>scaling it up beyond all reason</i>, gets you:</p>
<ol>
<li><a href="https://thenextweb.com/news/gpt3-ai-college-essay-grades-compared-students">essay-writing as good as middling college students</a> (see also <a href="https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3">this lightly-edited article that GPT-3 wrote about why we should not be afraid of it</a>) </li>
<li><a href="https://qz.com/2176389/the-best-examples-of-dall-e-2s-strange-beautiful-ai-art/">text-to-image capabilities better (and hundreds of times faster) than almost any human artist</a> (in fact, we used DALL-E to generate the images used at the start of each section of this document)</li>
<li><a href="https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html">the ability to explain jokes</a></li></ol><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi68LysoeDqfjTMzpJ3oXCYlrIZUHZaRVbC_Q3nCg8q-uqcAmgK-5auwkr0_YiWEmtzaI7NwbbKH92OSfWE_IxhxdqH35eaT-fSEbk7s56dpXRxqfZlfVg0k9T8QZw_scvyk_13M1DXvMEJBClJaclcMnVHNWMqLuBxrsfpjPVgiFE9El1u7OZhvn2JNA/s1600/1hREMK3bcCz93v0xffqPjxCgG2h8vk26GUX9EmDRKDFePXJW70t3C8ejg1C54IqAnm5uuIQw5yxyX8NbTp0BnxMIZi3kLevPF-z9jOQSdCPL-aVwSEOxQKihu8_ITDYbff3HxmRvNqkmm1PslU2SpzlTIAnKkJHCAVy3eBuw5dswfjzr54ui5M9yHQ.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="965" data-original-width="1600" height="386" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi68LysoeDqfjTMzpJ3oXCYlrIZUHZaRVbC_Q3nCg8q-uqcAmgK-5auwkr0_YiWEmtzaI7NwbbKH92OSfWE_IxhxdqH35eaT-fSEbk7s56dpXRxqfZlfVg0k9T8QZw_scvyk_13M1DXvMEJBClJaclcMnVHNWMqLuBxrsfpjPVgiFE9El1u7OZhvn2JNA/w640-h386/1hREMK3bcCz93v0xffqPjxCgG2h8vk26GUX9EmDRKDFePXJW70t3C8ejg1C54IqAnm5uuIQw5yxyX8NbTp0BnxMIZi3kLevPF-z9jOQSdCPL-aVwSEOxQKihu8_ITDYbff3HxmRvNqkmm1PslU2SpzlTIAnKkJHCAVy3eBuw5dswfjzr54ui5M9yHQ.png" width="640" /></a></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcbL9IX7vqdtPhlZODRo5MIpMY_mR95L3QmMOlHFZ6N50o8UvmO-VtAFNIcAZJJqd3iiSVFv35jQ861ujXRwJ8Hy7jMYg0ZzJCaYfh-80OilsEEyyscWDe1XtJKDs5wP3TOKk79s6lyhGfdLKOG9s3yWQHO6kwqFZF3GqFdKkiUlwW4rmdLj6xlx7bZQ/s1249/sYL5Totsac-mfAv9rX41lJu-YdgV7BkC8Nnj5OBFMO5jwYyJXk_H5LyQNXUsX1lUYsOjo8ZL9lj0kkGYUlRR9-dcFfGrmSGoDNJDRBSpGipXm1aTSsm151RZcbJZSsYniQY_JApToKMDHMw8hwDyZ65PHOJNvIsoO9SP1iAqo36i64NRp2ANej-x4w.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="525" 
data-original-width="1249" height="270" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcbL9IX7vqdtPhlZODRo5MIpMY_mR95L3QmMOlHFZ6N50o8UvmO-VtAFNIcAZJJqd3iiSVFv35jQ861ujXRwJ8Hy7jMYg0ZzJCaYfh-80OilsEEyyscWDe1XtJKDs5wP3TOKk79s6lyhGfdLKOG9s3yWQHO6kwqFZF3GqFdKkiUlwW4rmdLj6xlx7bZQ/w640-h270/sYL5Totsac-mfAv9rX41lJu-YdgV7BkC8Nnj5OBFMO5jwYyJXk_H5LyQNXUsX1lUYsOjo8ZL9lj0kkGYUlRR9-dcFfGrmSGoDNJDRBSpGipXm1aTSsm151RZcbJZSsYniQY_JApToKMDHMw8hwDyZ65PHOJNvIsoO9SP1iAqo36i64NRp2ANej-x4w.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: examples of reasoning by Google’s PaLM model.</i></p></td></tr></tbody></table><p></p>
<p>People <a href="https://norvig.com/chomsky.html">laugh at ML</a> because “it’s just iterative statistical curve-fitting”. They have a point. But when “iterative statistical curve-fitting” gets a B on its English Literature essay, paints an original Dali in five seconds, and cracks a joke, it’s hard to avoid the feeling that it might not be too long before “iterative statistical curve fitting” is laughing at <i>you</i>.</p>
<p>So what exactly happened here, and where is statistical curve-fitting going, and what does this have to do with advanced AI?</p>
<p>We mentioned Moravec’s paradox above. For a long time, getting AI systems to do things that are intuitively easy for humans was an unsolved problem. In just the past few years, it has been solved. A reasonable way to think of current ML capabilities is that state-of-the-art systems can do anything a human can do in a few seconds of thought: recognise objects in an image, generate flowing text as long as it doesn’t require thinking really hard, get the general gist of a joke or argument, and so on. They are also superhuman at some things, including predicting what the next word in a sentence is, or being able to refer to lots of facts (note that this is without internet access, not quoting verbatim, and generally in the right context), and generally being able to spit out output faster.</p>
<p>The way it was solved reflects what Richard Sutton called <a href="http://incompleteideas.net/IncIdeas/BitterLesson.html">the “bitter lesson”</a>. This is the recurring pattern in which countless researchers have spent their careers trying to invent fancy algorithms for doing domain-specific tasks, only to be overrun by simple (but data- and compute-hungry) ML methods.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGKhRgWjCclC-OmGyZaf9pfMHbt91tKZdhmigk7lHODmZDdIAZB7l5eHK32cB-zwOlhX1sB6Rt1l3LhN14zcA3gS3-NMcj26TKNcBnZ3qZpxU5VDm9s1tKbTolQHUw4Zb1E5wMHC4fhTH1IXCLnnfsR67QuzR_xGw46wl4N-EWdd8hK-YGsWjjAvJsVA/s439/Kh7se7k8viY-Ntcc28IhXSEWt696Xb4B24GUYmLj-WZ1IdK5QGKPoXgOXGYcP4WLdrwHMH737p0TCfx56CmWl42Ptl4WpAzp-QE-spHV9tSug768FeZ_wYCS1tYyYJqH3wJ7bKEIiMAyeG5_6omI8WGS1LcjQvfWKW-B0zs1zpqlIqAevzTHlXEA.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="439" data-original-width="371" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGKhRgWjCclC-OmGyZaf9pfMHbt91tKZdhmigk7lHODmZDdIAZB7l5eHK32cB-zwOlhX1sB6Rt1l3LhN14zcA3gS3-NMcj26TKNcBnZ3qZpxU5VDm9s1tKbTolQHUw4Zb1E5wMHC4fhTH1IXCLnnfsR67QuzR_xGw46wl4N-EWdd8hK-YGsWjjAvJsVA/w338-h400/Kh7se7k8viY-Ntcc28IhXSEWt696Xb4B24GUYmLj-WZ1IdK5QGKPoXgOXGYcP4WLdrwHMH737p0TCfx56CmWl42Ptl4WpAzp-QE-spHV9tSug768FeZ_wYCS1tYyYJqH3wJ7bKEIiMAyeG5_6omI8WGS1LcjQvfWKW-B0zs1zpqlIqAevzTHlXEA.png" width="338" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: Randall Munroe, creator of the xkcd comic, comments on ML. Original</i> <a href="https://xkcd.com/1838/"><i>here</i></a><i>.</i></p></td></tr></tbody></table><p>It was solved gradually at first, and then quickly. Neural-network-based ML methods spent a long time in limbo, held back by insufficiently powerful computers, until around 2010 (funnily enough, the specific piece of hardware that has enabled everything in modern ML is the GPU or Graphics Processing Unit, first invented in the 90s because people wanted to play more realistic video games; both graphics rendering and ML rely on many parallel calculations to be efficient). 
The so-called deep learning revolution only properly started around 2015. Fluent language abilities were essentially nonexistent before OpenAI’s release of <a href="https://en.wikipedia.org/wiki/GPT-2">GPT-2</a> in 2019 (since then, OpenAI has come out with GPT-3, a 100x-larger model that was called “spooky”, “humbling”, and “more than a little terrifying” in <i>The New York Times</i>).</p>
<p>Not only that, but it turns out there are simple <a href="https://arxiv.org/pdf/2001.08361.pdf">“scaling laws”</a> that govern how ML model performance scales with parameter count and dataset size, which seem to paint a clear roadmap to making the systems even more capable by just cranking the “more parameters” and “more data” levers (presumably they have these at the OpenAI HQ).</p>
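<p>As a rough illustration of what such a law looks like, the sketch below assumes the simple power-law form for loss as a function of parameter count, L(N) = (N_c / N)^alpha, with the other levers (data and compute) not acting as bottlenecks. The constants <code>n_c</code> and <code>alpha</code> here are illustrative placeholders chosen by us, not fitted values, and <code>power_law_loss</code> is a hypothetical helper name.</p>

```python
def power_law_loss(n_params: float, n_c: float = 1e14, alpha: float = 0.08) -> float:
    """Predicted loss under an assumed power law L(N) = (n_c / N) ** alpha.

    n_c and alpha are placeholder constants for illustration only.
    """
    return (n_c / n_params) ** alpha


small = power_law_loss(1e9)   # a model with a billion parameters
large = power_law_loss(1e11)  # a model one hundred times larger
ratio = large / small         # depends only on the size ratio and alpha

print(f"loss shrinks by a factor of {ratio:.3f} for 100x more parameters")
```

<p>The notable property of a power law is that the improvement depends only on the ratio of model sizes, which is why “just make it bigger” has so far been a predictable recipe.</p>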
<p>There are many worries in any scenario where advanced AI is approaching fast, as we’ll argue for in a later section. The current ML-based AI paradigm is especially worrying though.</p>
<p>We don’t actually know what the ML system is learning during the training process it goes through. You can visualise the training process as a trip through (abstract) space. If our model had three parameters, we could imagine it as a point in 3D space. Since current state-of-the-art models have billions of parameters, and are initialised randomly, we can imagine this as throwing a dart somewhere into a billion-dimensional space, where there are a billion different ways to move. During the training process, the training loop guides the model along a trajectory in this space by making tiny updates that push the model in the direction of better performance as described above.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUfnRhbyYZ-DoDGnhFI-7KBrY_hCP8TVCPq-7T6UxFk_9EIaPq9_BIBZQDGqnkEZEr-oLCddWuZfmhPLpQ2AGLJ91uByYE-sF-2tia93ko5LPoXHwtJdoOYKTfRw_gy9An3cYSbLn0jXjuevhUQLs8JKDM_t3d-VHmTnSjbk6xXm89S7PClUs-fcMnfw/s1098/Y4Xz5aiVRPew7usYEnspBo4Rafp38eEBb8OlcRbSTu4LdAMJRDMG1GGklZOODyZoWWKoYfoc_84LsH0ed07DJXShqNXQ7tRCTIOGd6YYrOmMWBm9MGv_xdfLJEMQgQvBR0JqswlxYB-qbjcsY8JjfIeqoyY0Gww041biDRaB4iS_c4GS47bI6KHL-Q.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="660" data-original-width="1098" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUfnRhbyYZ-DoDGnhFI-7KBrY_hCP8TVCPq-7T6UxFk_9EIaPq9_BIBZQDGqnkEZEr-oLCddWuZfmhPLpQ2AGLJ91uByYE-sF-2tia93ko5LPoXHwtJdoOYKTfRw_gy9An3cYSbLn0jXjuevhUQLs8JKDM_t3d-VHmTnSjbk6xXm89S7PClUs-fcMnfw/w400-h240/Y4Xz5aiVRPew7usYEnspBo4Rafp38eEBb8OlcRbSTu4LdAMJRDMG1GGklZOODyZoWWKoYfoc_84LsH0ed07DJXShqNXQ7tRCTIOGd6YYrOmMWBm9MGv_xdfLJEMQgQvBR0JqswlxYB-qbjcsY8JjfIeqoyY0Gww041biDRaB4iS_c4GS47bI6KHL-Q.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above:</i> <i>0</i> <i>and</i> <i>1</i> <i>are parameters, and the
vertical axis is the loss (higher is worse). The black line is the path
the model takes in parameter space during training.</i></p></td></tr></tbody></table>
<p>Now let’s say at the end of the training process the model does well on the training examples. What does that tell you? It tells you the model has ended up in some part of this billion-dimensional space that corresponds to a model that does well on the training examples. Here are some examples of models that do well on their training examples:</p>
<ol>
<li>A model that has learned exactly what you want it to learn. Yay!</li>
<li>A model that has learned something similar to what you want it to learn, but you can’t tell because the data contains no example that distinguishes between what it’s learned and what you want it to learn.</li>
<li>A model that has learned to give the right answer when it’s instrumentally in its interest, but which will go off and do something completely different given a chance.</li>
</ol>
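<p>A toy sketch of the second case in the list above: two candidate “models” that are indistinguishable on the training data yet disagree elsewhere. The data and both functions are invented for illustration, and real models are of course learned rather than written by hand.</p>

```python
# Made-up training data: every input happens to be non-negative.
train_xs = [0.0, 0.5, 1.0, 2.0]
train_ys = [0.0, 0.5, 1.0, 2.0]  # the intended rule is simply y = x


def intended(x: float) -> float:
    """What we wanted the model to learn."""
    return x


def lookalike(x: float) -> float:
    """Something similar that matches on every training example."""
    return abs(x)


def training_loss(model) -> float:
    """Sum of squared errors on the training examples."""
    return sum((model(x) - y) ** 2 for x, y in zip(train_xs, train_ys))


# Both models are perfect on the training data...
print(training_loss(intended), training_loss(lookalike))
# ...but they disagree on an input unlike anything seen in training.
print(intended(-3.0), lookalike(-3.0))
```

<p>A training process that only looks at training loss has no way to prefer one of these over the other.</p>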
<p>How do we know that in the billion-dimensional space of possibilities, our (blind and kind of dumb) training process has landed on #1? We don’t. We launch our ML models on trajectories through parameter-space and hope for the best, like overly-optimistic duct-tape-wielding NASA administrators launching rockets in a universe where, in the beginning, God fell asleep on the “+1 dimension” button.</p>
<p>The really scary failure modes all lie in the future. However, here are some examples of perverse “solutions” ML models have already come up with in practice:</p>
<ol>
<li>A game-playing ML model <a href="https://web.archive.org/web/20160526045303/http://homepages.herts.ac.uk/~cs08abi/publications/Salge2008b.pdf">learned to crash the game</a>, presumably because it can’t die if the game crashes.</li>
<li>An ML model was meant to convert aerial photographs into abstract street maps and then back (learning to convert to and from a more-abstract intermediate representation is a common training strategy). It learned to <a href="https://arxiv.org/pdf/1712.02950.pdf">hide useful information</a> about the aerial photograph in the street map in a way that helped it “cheat” in reconstructing the aerial photograph, and in a way too subtle for humans just looking at the images to notice.</li>
<li>A game-playing ML model <a href="https://arxiv.org/pdf/1802.08842.pdf">discovered a bug in the game</a> where the game stalls on the first round and it gets almost a million in-game points. The researchers were unable to figure out the reason for the bug.</li>
</ol>
<p>These are examples of <b>specification gaming</b>, in which the ML model has learned to game whatever specification of task success was given to it. (Many more examples can be found on <a href="https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml">this spreadsheet</a>.)</p>
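<p>Specification gaming is easy to reproduce in miniature. The sketch below assumes an invented one-dimensional “race”: the designer wants the agent to reach the finish line, but the reward that was actually specified is a point for every time the agent lands on a bonus tile. Brute-force search over all move sequences stands in for the training process. Every name and number here is hypothetical.</p>

```python
from itertools import product

FINISH, BONUS, STEPS = 5, 2, 10  # track layout and episode length (made up)


def rollout(moves):
    """Play one episode and return (specified reward, reached the finish?)."""
    pos, reward = 0, 0
    for move in moves:
        pos = max(0, pos + move)  # the agent cannot move left of the start
        if pos == BONUS:
            reward += 1           # the specified reward: points for the bonus tile
        if pos == FINISH:
            return reward, True   # the intended goal: the episode ends here
    return reward, False


# "Training": pick the move sequence that maximises the specified reward.
best = max(product((-1, 1), repeat=STEPS), key=lambda m: rollout(m)[0])
best_reward, best_finished = rollout(best)

print("optimised policy:", best_reward, "points, finished:", best_finished)
print("intended policy: ", rollout((1,) * STEPS))
```

<p>The highest-scoring behaviour is to shuttle back and forth over the bonus tile and never finish the race. The optimiser did exactly what the specification said, which is not what the designer meant.</p>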
<p>No one knows for sure where the ML progress train is headed. It is plausible that current ML progress hits a wall and we get another <a href="https://en.wikipedia.org/wiki/AI_winter">“AI winter”</a> that lasts years. However, AI has recently been breaking through barrier after barrier, and so far does not seem to be slowing down. Though we’re still at least some steps away from human-level capabilities at everything, there aren’t many tasks where there’s no proof-of-concept demonstration.</p>
<p>Machines have been better at some intellectual tasks for a long time; just consider calculators, which are superhuman at arithmetic. With the computer revolution, every task where a human has been able to think of a way to break it down into unambiguous steps (and the unambiguous steps can be carried out with modern computing power) has been added to this list. More recently, more intuition- and insight-based activities have been added to that list. DeepMind’s AlphaGo beat one of the world’s top human Go players (Go being a far harder game than chess for computers) in 2016. In 2017, AlphaZero beat both AlphaGo at Go (100-0) and superhuman chess programs at chess, despite training only by playing against itself for less than 24 hours. Analysis of its moves revealed strategies that millennia of human players hadn’t been able to come up with, so it wouldn’t be an exaggeration to say that it beat the accumulated efforts of human civilisation at inventing Go strategies - in one day. In 2019, DeepMind released MuZero, which extended AlphaZero’s performance to Atari games. In 2021, a group of academic researchers released EfficientZero, a MuZero-based method which takes only two hours of gameplay to become superhuman at Atari games. In addition to games, DeepMind’s AlphaFold and AlphaFold 2 have made big leaps towards solving the problem of predicting a protein’s structure from its constituent amino acids, one of the biggest theoretical problems in biology. A step towards generality was taken by Gato, yet another DeepMind model, which is a single model that can play games, control a robot arm, label images, and write text.</p>
<p>If you straightforwardly extrapolate current progress in machine learning into the future, here is what you get: ML models exceeding human performance in a quickly-expanding list of domains, while we remain ignorant about how to make sure they learn the right goals or robustly act in the right way.</p>
<p> </p>
<h2><b>Theoretical underpinnings of AI risk</b></h2>
<p>The previous section discussed the history of machine learning, and how extrapolating its progress has worrying implications. Next we discuss more theoretical arguments for why highly advanced AI systems might pose a threat to humanity.</p>
<p>One of the criticisms levelled at the notion of risks from AI is that it sounds too speculative, like something out of apocalyptic science fiction. Part of this is unavoidable, since we are trying to reason about systems which are more powerful than any that currently exist, and which may not behave like anything that we’re used to.</p>
<p>This section is split into three subsections. Each one makes a claim about the future of artificial intelligence, and discusses the arguments for and against this claim. The three claims are:</p>
<ul>
<li><b>AGI is likely.</b>
AGI (artificial general intelligence) is likely to be created by humanity eventually, and there is a good chance this will happen in the next century.</li>
<li><b>AGI will have misaligned goals by default.</b>
Unless certain hard technical problems are solved first, the goals of the first AGIs will be misaligned with the goals of humanity, and would lead to catastrophic outcomes if executed.</li>
<li><b>Misaligned AGI could resist attempts to control it or roll it back.</b>
An AGI (or AGIs) with misaligned goals would be able to overpower or outcompete humanity, and gain control of our future, much as we’ve so far been able to use our intelligence to dominate all other less intelligent species.</li>
</ul>
<p> </p>
<h3>AGI is likely</h3>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhimU980-3FCg_684oxCOu6NVYctxKEYGvJidc4w0igfh1Bcl-s1O1b-W-lcdWHhouD-F1Eyw3TRgkWOSKGMRv6V8FhQ0ODbDiOyOp8S3-_n4SJCpFdPenwNhmaCmHFWIsJPbRn8rLZjnVP-2A0_6SdpZAg6yQjmlVVM9R6mh0TEB1v4pa4E5HmRk19lg/s1024/DWHUlBLzuXHT7LAKeupCIDzm-qci1JeFmH9ZjevdiioGN2VFHC63YOOY5JMBjEmCFX-WL_E-r8omyZ-Vhp-o0uHvpLYq5lhrGMfqDFTJxLlJYA4HeV2pLJyW0EFsq1t3mWOjQlD202b9a4cRrZRYzR1qN8WGHXWsGAM2JpoSoivVeA7s71wITxXsLw.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhimU980-3FCg_684oxCOu6NVYctxKEYGvJidc4w0igfh1Bcl-s1O1b-W-lcdWHhouD-F1Eyw3TRgkWOSKGMRv6V8FhQ0ODbDiOyOp8S3-_n4SJCpFdPenwNhmaCmHFWIsJPbRn8rLZjnVP-2A0_6SdpZAg6yQjmlVVM9R6mh0TEB1v4pa4E5HmRk19lg/s320/DWHUlBLzuXHT7LAKeupCIDzm-qci1JeFmH9ZjevdiioGN2VFHC63YOOY5JMBjEmCFX-WL_E-r8omyZ-Vhp-o0uHvpLYq5lhrGMfqDFTJxLlJYA4HeV2pLJyW0EFsq1t3mWOjQlD202b9a4cRrZRYzR1qN8WGHXWsGAM2JpoSoivVeA7s71wITxXsLw.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: this image also generated by OpenAI’s DALL-E 2, using the
prompt "a data center with stacks of computers gaining the spark of
intelligence"</i>.</p></td></tr></tbody></table><blockquote><p>"<i>Betting against human ingenuity is foolhardy, particularly when our future is at stake.</i>"</p>
<p>-Stuart Russell</p>
</blockquote>
<p>To open this section, we need to define what we mean by artificial general intelligence (AGI). We’ve already discussed AI, so what do we mean by adding the word “general”?</p>
<p>An AGI is a machine capable of behaving intelligently over many different domains. The term “general” here is often used to distinguish from “narrow”, where a narrow AI is one which excels at a specific task, but isn’t able to invent new problem-solving techniques or generalise its skills across many different domains. </p>
<p>As an example of general intelligence in action, consider humans. In a few million years (a mere eye-blink in evolutionary timescales), we went from apes wielding crude tools to becoming the dominant species on the planet, able to build space shuttles and run companies. How did this happen? It definitely wasn’t because we were directly trained to perform these tasks in the ancestral environment. Rather, we developed new ways of using intelligence that allowed us to generalise to many different tasks. This whole process played out over a shockingly small amount of time, relative to all past evolutionary history, and so it is possible that only a relatively short list of fundamental insights was needed to get general intelligence. And as we saw in the previous section, ML progress hints that gains in intelligence might be surprisingly easy to achieve, even relative to current human abilities.</p>
<p>AGI is not a distant future technology that only futurists speculate about. OpenAI and DeepMind are two of the leading AI labs. They have received billions of dollars in funding (including OpenAI receiving significant investment from Microsoft, and DeepMind being acquired by Google). Both <a href="https://www.deepmind.com/careers">DeepMind</a> and <a href="https://openai.com/about/">OpenAI</a> have the development of AGI as the core of both their mission statement and their business case. Top AI researchers are publishing <a href="https://openreview.net/pdf?id=BZ5a1r-kVsf">possible roadmaps</a> to AGI-like capabilities. And, as mentioned earlier, AI labs have been crossing off a significant number of the remaining milestones every year, especially in the past few years.</p>
<p>When will AGI be developed? Although this question is impossible to answer with certainty, many people working in the field of AI think it is more likely than not to arrive in the next century. An aggregate forecast generated via data from a <a href="https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/">2022 survey</a> of ML researchers estimated <b>37 years until a 50% chance of high-level machine intelligence</b> (defined as systems which can accomplish every task better and more cheaply than human workers). These respondents also gave an average of <b>5% probability of AI having an extremely bad outcome for humanity (e.g. complete human extinction)</b>. How many other professions estimate an average of 5% probability that their field of study will be directly responsible for the extinction of humanity?! To explain this number, we need to proceed to the next two sections, where we will discuss why AGIs might have goals which are misaligned with humans, and why this is likely to lead to catastrophe.</p>
<h3>AGI will have misaligned goals by default</h3>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiq6Jx0OZP97it5eRy_K4j79gQUqvL8RWFSXhkeLJgVu6o8CH6Jb2loqb3v-VGJEeRl6BXDqe5mcM8YywHgtBOSGyfFK3YvaD81uKzyXXCx5RN3JzxgH_8e4hde3iRqoYCOTyi0UvBeHlVOoRPbf7WaTb0EzQajDbQK-wiZYcIkVoui0N82fNZsoiYufw/s1024/ByomdAHZi91n-zB_xjy7hfItOvqPhWMO_0IPLZxzXo1sQnZRxp2YJxZ6-J0rDzO6AGMXgzHTDi9uh4l-Sf-zdvMWBWhxP_VwH72KibODwEZkurOUcBqdjsMVcFbsip8WIP5APNi9BP_2cXrAeE9FY61SrblfMlc89OqV_XEYTCAyVQbZEhuCpQWlkw.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiq6Jx0OZP97it5eRy_K4j79gQUqvL8RWFSXhkeLJgVu6o8CH6Jb2loqb3v-VGJEeRl6BXDqe5mcM8YywHgtBOSGyfFK3YvaD81uKzyXXCx5RN3JzxgH_8e4hde3iRqoYCOTyi0UvBeHlVOoRPbf7WaTb0EzQajDbQK-wiZYcIkVoui0N82fNZsoiYufw/w400-h400/ByomdAHZi91n-zB_xjy7hfItOvqPhWMO_0IPLZxzXo1sQnZRxp2YJxZ6-J0rDzO6AGMXgzHTDi9uh4l-Sf-zdvMWBWhxP_VwH72KibODwEZkurOUcBqdjsMVcFbsip8WIP5APNi9BP_2cXrAeE9FY61SrblfMlc89OqV_XEYTCAyVQbZEhuCpQWlkw.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: yet another image from OpenAI's DALL-E 2. Perhaps it was
trying for a self portrait? (Prompt: "Artists impression of artificial
general intelligence taking over the world, expressive, digital art")</i></p><p><i> </i></p></td></tr></tbody></table>
<blockquote><p>"<i>The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.</i>"</p>
<p>-Eliezer Yudkowsky</p>
</blockquote>
<p>Let’s start off this section with a few definitions. </p>
<p>When we refer to <b>“aligned AI”</b>, we are using Paul Christiano’s conception of <b>“intent alignment”</b>, which essentially means the AI system is <b>trying</b> to do what its human operators want it to do. Note that this is insufficient for building useful AI, since the AI also has to be capable. But situations where the AI is trying and failing to do the right thing seem like less of a problem.</p>
<p>When we refer to the <b>“alignment problem”</b>, we mean the difficulty of building aligned AI. Note, this doesn’t just capture the fact that we won’t create an AI aligned with human values by default, but that we don’t currently know how to build a sophisticated AI system robustly aligned with <i>any</i> goal.</p>
<p><i>Can’t we just have the AI learn the right goals by example, just like how all current ML works?</i> The problem here is that we have no way of knowing what goal the AI is learning when we train it; we only know that it seems to be doing good things on the training data that we give it. The state of the art is that we have hacky but extremely powerful methods that can make ML systems remarkably competent at doing well on the training examples by an opaque process of guided trial-and-error. But there is no Ghost of Christmas Past that will magically float into a sufficiently-capable AI and imbue it with human values. We do not have a way of ensuring that the system acquires a particular goal, or even an idea of what a robust goal specification that is compatible with human goals/values could look like.</p>
<h4>Orthogonality and instrumental convergence</h4>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnHdtTFdEzyaXhPkeAqSBXgRYmkIzR_2vDZzEyaNsryIJhvDqCysP_5NLctCEweBNyomM9ApqmdQh6IzrZV-NCjMspY4Y98lBxGuZXsNAWIPcTjdjqYjvtOL5z8iCS23UfHqm33ukVr9Yz-onGqB3_3u--q-iHeORTf4SzY8rJIuaXi0YBaSqzETKWww/s1024/Rc_xHrlVI3yCKNCG6giTXaqmivDYEON83371l4xIRe_k4jo8-kSuTjkNMNywqYG8vgl2BOTANdtRxcXMFwIQkdPIvC7ueNKvSFhZqOaW_mi2gAkSNP37DQysWnnfzByqzXyLvr-K573dIhxD8WT-uE6TxiABQH58LGsSShTYmJEIUaXaLkyLjdMa.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnHdtTFdEzyaXhPkeAqSBXgRYmkIzR_2vDZzEyaNsryIJhvDqCysP_5NLctCEweBNyomM9ApqmdQh6IzrZV-NCjMspY4Y98lBxGuZXsNAWIPcTjdjqYjvtOL5z8iCS23UfHqm33ukVr9Yz-onGqB3_3u--q-iHeORTf4SzY8rJIuaXi0YBaSqzETKWww/w400-h400/Rc_xHrlVI3yCKNCG6giTXaqmivDYEON83371l4xIRe_k4jo8-kSuTjkNMNywqYG8vgl2BOTANdtRxcXMFwIQkdPIvC7ueNKvSFhZqOaW_mi2gAkSNP37DQysWnnfzByqzXyLvr-K573dIhxD8WT-uE6TxiABQH58LGsSShTYmJEIUaXaLkyLjdMa.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: DALL-E illustrating "Artists depiction of an artificial intelligence which builds paperclips, digital art, artstation"</i></p></td></tr></tbody></table>
<p>One of the most common objections to risks from AI goes something like this:</p>
<blockquote><p> <i>If the AI is smart enough to cause a global catastrophe, isn’t it smart enough to know that this isn’t what humans wanted?</i></p>
</blockquote>
<p>The problem with this is that it conflates two different concepts: <b>intelligence</b> (in the sense of having the ability to achieve your goals, whatever they might be) and <b>having goals which are morally good by human standards</b>. When we look at humans, these two often go hand-in-hand. But the key observation of the orthogonality thesis is that this doesn’t have to be the case for all possible mind designs. As defined by Nick Bostrom in his book <a href="https://nickbostrom.com/superintelligentwill.pdf"><i>Superintelligence</i></a>:</p>
<blockquote><p><b>The Orthogonality Thesis</b></p>
<p><i>Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.</i></p>
</blockquote>
<p>Here, orthogonal means “at right angles” or “unrelated” – in other words we can imagine a graph with one axis representing intelligence, and another representing the agent’s goals, with any point in the graph representing a theoretically possible agent*. The classic example here is a <b>“paperclip maximiser”</b> - a powerful AGI driven only by the goal of making paperclips.</p>
<p>(*This is obviously an oversimplification. For instance, it seems unlikely you could get an unintelligent agent with a highly complex goal, because it would seem to take some degree of intelligence to represent the goal in the first place. The key message here is that you could in theory get highly capable agents pursuing arbitrary goals.)</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTupLKJSBje5UxrZ-hWCiTIMAIZ1tlodt_zdgI4Vnr7WEPCGtygTJkMJ9QNfaI8E2csHFyRCCJiiMLTup5XlIVSdzLl5h3AEOmX7FuAeePDCAFZfHh-wPAYnt9XQtr2oTbf_mwc-Wn0ayRTjdJ3YZGvBI9m91OJZZ_HvO6xfHqwFTq1LIwLiWcw_EcjQ/s1600/32rs7DmbEtFqlFq-T71NMr0G11m0M-ElZ5KbJygw6oFszfBkHOA4hd0M6U6yRaZmYVoLjAX_ro77LR-EleAiaqC_qvYNywWuIhaJKa6e83DKbCzVW_lxWjLGq--OpKsgbOONrrEzKWMSOEx3ivUlk2TePyCIJrLt-DlkvoObtiw5RgdbZ_Ijnz67.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="771" data-original-width="1600" height="308" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTupLKJSBje5UxrZ-hWCiTIMAIZ1tlodt_zdgI4Vnr7WEPCGtygTJkMJ9QNfaI8E2csHFyRCCJiiMLTup5XlIVSdzLl5h3AEOmX7FuAeePDCAFZfHh-wPAYnt9XQtr2oTbf_mwc-Wn0ayRTjdJ3YZGvBI9m91OJZZ_HvO6xfHqwFTq1LIwLiWcw_EcjQ/w640-h308/32rs7DmbEtFqlFq-T71NMr0G11m0M-ElZ5KbJygw6oFszfBkHOA4hd0M6U6yRaZmYVoLjAX_ro77LR-EleAiaqC_qvYNywWuIhaJKa6e83DKbCzVW_lxWjLGq--OpKsgbOONrrEzKWMSOEx3ivUlk2TePyCIJrLt-DlkvoObtiw5RgdbZ_Ijnz67.png" width="640" /></a></div>
<p>Note that an AI may well come to understand the goals of the humans that trained it, but this doesn't mean it would choose to follow those goals. As an example, many human drives (e.g. for food and human relationships) came about because in the ancestral environment, following these drives would have made us more likely to survive and have children. But just because we understand this now doesn't make us toss out all our current values and replace them with a desire to maximise genetic fitness.</p>
<p>If an AI might have bizarre-seeming goals, is there anything we <i>can</i> say about its likely behaviour? As it turns out, there is. The secret lies in an idea called the <b>instrumental convergence thesis</b>, again <a href="https://nickbostrom.com/superintelligentwill.pdf">by Bostrom</a>:</p>
<blockquote><p><b>The Instrumental Convergence Thesis</b></p>
<p><i>There are some instrumental goals likely to be pursued by almost any intelligent agent, because they are useful for the achievement of almost any final goal.</i></p>
</blockquote>
<p>So an instrumental goal is one which increases the odds of the agent’s final goal (also called its <b>terminal goal</b>) being achieved. What are some examples of instrumental goals?</p>
<p>Perhaps the most important one is <b>self-preservation</b>. This is necessary for pursuing most goals, because if a system’s existence ends, it won’t be able to carry out its original goal. As memorably phrased by Stuart Russell, <i>“you can’t fetch the coffee if you’re dead!”</i>.</p>
<p><b>Goal-content integrity</b> is another. An AI with some <i>goal X</i> might resist any attempts to have its goal changed to <i>goal Y</i>, because it sees that in the event of this change, its current <i>goal X</i> is less likely to be achieved.</p>
<p>Finally, there is a set of goals which are all forms of <b>self-enhancement</b> - improving the agent’s cognitive abilities, developing better technology, or acquiring other resources, because all of these are likely to help it carry out whatever goals it ends up having. For instance, an AI singularly devoted to making paperclips might be incentivised to acquire resources to build more factories, or improve its engineering skills so it can figure out yet more effective ways of manufacturing paperclips with the resources it has.</p>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaPDhYN5zxznZytGaeYZo4ymU89BdLJrOHzkJJ4sE6QRy5tfzrpczSOoDyNU8JFLc9lEUDfwCkZJaHMDLQRO8CiwSxbVeodwBeCjWjBfFKtH2h_piY-P4JZ0avNtvxqljCbEVKzomCzM-FsWuTy0GKXaRePfccxVYwkyHw63YRZlKXk-_eOHboluCbtQ/s350/l_0AVfWMmZOjYzOpQlJg41GUDnGOYwifSqVT_TckS65ChbSzZ_vEH6L7j35Ex-hXyJ_QIA8L1qLOs7J1VnOCFcfZskgDsf8qbkzoZWg3GF7Iu9GWfB2ERw17F_u6HtrQgCFWf7yTIQ_A7UlHSBctsVRaOVgQOgtnit2eTHEuoBfw5drEGyHRijiC4w.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="264" data-original-width="350" height="241" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaPDhYN5zxznZytGaeYZo4ymU89BdLJrOHzkJJ4sE6QRy5tfzrpczSOoDyNU8JFLc9lEUDfwCkZJaHMDLQRO8CiwSxbVeodwBeCjWjBfFKtH2h_piY-P4JZ0avNtvxqljCbEVKzomCzM-FsWuTy0GKXaRePfccxVYwkyHw63YRZlKXk-_eOHboluCbtQ/s320/l_0AVfWMmZOjYzOpQlJg41GUDnGOYwifSqVT_TckS65ChbSzZ_vEH6L7j35Ex-hXyJ_QIA8L1qLOs7J1VnOCFcfZskgDsf8qbkzoZWg3GF7Iu9GWfB2ERw17F_u6HtrQgCFWf7yTIQ_A7UlHSBctsVRaOVgQOgtnit2eTHEuoBfw5drEGyHRijiC4w.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: paperclip maximisation, now with a fun game attached!</i></p></td></tr></tbody></table><p></p>
<p>The key lesson to draw from instrumental convergence is that, even if nobody ever deliberately deploys an AGI with a really bad reward function, the AGI is still likely to develop goals which will be bad for humans by default, in service of its actual goal.</p>
<h4>Interlude - why goals?</h4>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhNNYkP_LfxpthWZPkegMhhT4P-sr_L-85-YwMWO2GE9ZoWyqpBGa3e2gIUQq74h9HKQ0TQr6W4XgHOYzLligh6pGfxgvUgc81nEG0dHP7nLJH3pen8PD8A60hScwybwvdrROTxEJwJW7pi7Ndm-WlyRg_R9M13OE9cOOSPNgI5-tTii2jWBuHxZDmhQ/s1024/xB7ebp8yB0tTy3HVpFogzxe-wKK47KT6KCFXp2JMCoPJlqC8CegZB7ktfeqPS2ILr2yAsKN4CsCm2ZmaZmhLoqf2-2aIBZU1J1yUKTPyE1cwKNqDkxK0ZDbcdval-D2-Z0JwQEIrJVFuZ4MajncHNNWRNH9qzuj8zPhTOFLb6nXM7fDUQ353Gl8g-w.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhNNYkP_LfxpthWZPkegMhhT4P-sr_L-85-YwMWO2GE9ZoWyqpBGa3e2gIUQq74h9HKQ0TQr6W4XgHOYzLligh6pGfxgvUgc81nEG0dHP7nLJH3pen8PD8A60hScwybwvdrROTxEJwJW7pi7Ndm-WlyRg_R9M13OE9cOOSPNgI5-tTii2jWBuHxZDmhQ/w400-h400/xB7ebp8yB0tTy3HVpFogzxe-wKK47KT6KCFXp2JMCoPJlqC8CegZB7ktfeqPS2ILr2yAsKN4CsCm2ZmaZmhLoqf2-2aIBZU1J1yUKTPyE1cwKNqDkxK0ZDbcdval-D2-Z0JwQEIrJVFuZ4MajncHNNWRNH9qzuj8zPhTOFLb6nXM7fDUQ353Gl8g-w.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: DALL-E image from the prompt "Artist's depiction of a robot
throwing a dart at a target, digital art, getting a bullseye, trending
on artstation"</i></p></td></tr></tbody></table><p>Having read the previous section, your initial reaction may well be something like this:</p>
<blockquote><p> <i>“Okay, so powerful AGIs with goals that don’t line up perfectly with ours might spell bad news, but why should AI systems have goals at all? Google Maps is a pretty useful ML system but it doesn’t have ‘goals’, I just type my address in and hit enter. Why won’t future AI be like this?”</i></p>
</blockquote>
<p>There are many different responses you could have to this line of argument. One simple response is based on ideas of economic competitiveness, and comes from <a href="https://www.gwern.net/Tool-AI">Gwern (2016)</a>. It runs something like this:</p>
<blockquote><p>AIs that behave like agents (i.e. taking actions in order to achieve their goals) will be more economically competitive than “tool AIs” (like Google Maps), for two reasons. First, they will by definition be better at <b>taking actions</b>. Second, they will be superior at <b>inference and learning</b> (since they will be able to repurpose the algorithms used to choose actions to improve themselves in various ways). For example, agentic systems could take actions such as improving their own training efficiency, or gathering more data, or making use of external resources such as long-term memories, all in service of achieving their goals.</p>
<p>If agents are more competitive, then any AI researchers who don’t design agents will be outcompeted by ones that do.</p>
</blockquote>
<p>There are other perspectives you could take here. For instance, Eliezer Yudkowsky has written extensively about “expected utility maximisation” as a formalisation for how rational agents might behave. Several mathematical theorems all point to the same idea: <i>“any agent not behaving like an expected utility maximiser will be systematically making stupid mistakes and getting taken advantage of”</i>. So if we expect AI systems to <i>not</i> be making stupid mistakes and getting taken advantage of by humans, then it makes sense to describe them as having the ‘goal’ of maximising expected utility, because that’s how their behaviour will seem to us.</p>
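<p>One way to make the “getting taken advantage of” idea concrete is a so-called money pump. The sketch below is a hypothetical toy, not taken from any of the sources above: the agent, its cyclic preferences (A over B, B over C, C over A), the fee, and all function names are assumptions made purely for illustration. Because the agent's preferences cannot be described by any utility function, a trader can walk it around the cycle and collect a fee at every step.</p>

```python
# Hypothetical toy example of a "money pump".
# (x, y) in PREFERS means the agent prefers item x to item y.
# The preferences are deliberately cyclic, so no utility function can represent them.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def accepts_trade(held, offered):
    """The agent agrees to pay a fee to swap whenever it prefers the offered item."""
    return (offered, held) in PREFERS


def run_money_pump(start_item, money, fee, rounds):
    """A trader repeatedly offers the item the agent prefers to the one it holds."""
    # For each item, look up the item the agent prefers to it.
    preferred_over = {worse: better for (better, worse) in PREFERS}
    held = start_item
    for _ in range(rounds):
        offered = preferred_over[held]
        if accepts_trade(held, offered) and money >= fee:
            held = offered
            money -= fee
    return held, money


final_item, final_money = run_money_pump("A", money=100, fee=1, rounds=30)
print(final_item, final_money)
```

<p>After 30 trades the agent holds the same item it started with and has paid a fee on every trade. An agent whose preferences could be summarised by a single utility function would refuse at least one trade in any such cycle.</p>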
<p>Although these arguments may seem convincing, the truth is there are many questions about goals and agency which remain unanswered, and we honestly just don’t know what AI systems of the future will look like. It’s possible they will look like expected utility maximisers, but this is far from certain. For instance, Eric Drexler's technical report <a href="https://www.fhi.ox.ac.uk/wp-content/uploads/Reframing_Superintelligence_FHI-TR-2019-1.1-1.pdf?asd=sa">Reframing Superintelligence: Comprehensive AI Services as General Intelligence (CAIS)</a> paints a different picture of the future, where we create systems of AIs interacting with each other and collectively providing a variety of services to humans. However, even scenarios like this could threaten humanity’s ability to keep steering its own future (as we will see in later sections).</p>
<p>Additionally, new paradigms are being developed. One of the newest, published barely one week ago, <a href="https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators">analysed certain types of AI models like GPT-3 (a large language model) through the lens of "simulators"</a>. Modern language models like GPT-3, for example, may be best thought of as trying to simulate the continuation of a piece of English text, in the same way that a physics simulation evolves an initial state by applying the laws of physics. It doesn't make sense to describe the simulations themselves through the lens of agents, but they can simulate agents as subsystems. Even with today's models like GPT-3, if you prompt it in a way that places it in the context of making a plan to carry out a goal, it will do a decent job of doing that. Future work will no doubt explore the risk landscape from this perspective, and time will tell how well these frameworks match up with actual progression in ML.</p>
<h4>Inner and outer misalignment</h4>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTHionB-1S0pYghuaqAaFplomOba99DnL79luwh6zataow1SMbu7VTkwLa7TL_I66oeRu6MRrPPAyVFpF3uomtH_luZUlp220WQnD_8tUQYlFWj1uhDi9sw4vfwGkjIWR7eTMORP8f9bJinQMKhRzTNtMhWdm3mFhp0G5dk_sxo3abaOvP5yvtb_Dypw/s1024/Zti_yGXmLuikLxE-3nVAMSW-fcXBbHUbp8KlHgIf_FEiz_cRtnkcfxg9mEnMhmADknRxrL49j2GOYi4lwF1UDEzU80cuIwS6Qsrjm01IQhwxowa6I9jB6d2kqimn4UqHJd1YKqzdExJajZ5UQap9tVnLfFfPChqgDg7qLZTNYCIgYGA0HoGmk7Y9sQ.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTHionB-1S0pYghuaqAaFplomOba99DnL79luwh6zataow1SMbu7VTkwLa7TL_I66oeRu6MRrPPAyVFpF3uomtH_luZUlp220WQnD_8tUQYlFWj1uhDi9sw4vfwGkjIWR7eTMORP8f9bJinQMKhRzTNtMhWdm3mFhp0G5dk_sxo3abaOvP5yvtb_Dypw/w400-h400/Zti_yGXmLuikLxE-3nVAMSW-fcXBbHUbp8KlHgIf_FEiz_cRtnkcfxg9mEnMhmADknRxrL49j2GOYi4lwF1UDEzU80cuIwS6Qsrjm01IQhwxowa6I9jB6d2kqimn4UqHJd1YKqzdExJajZ5UQap9tVnLfFfPChqgDg7qLZTNYCIgYGA0HoGmk7Y9sQ.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: AI agents with inner misalignment were at one point called
“optimisation daemons”. DALL-E did not quite successfully depict the
description "Two arguments between an angel and a devil, one inside a
circle and one on the outside, painting".</i></p></td></tr></tbody></table>
<p>As discussed in the first section, the central paradigm of modern ML is that we train systems to perform well on a certain reward function. For instance, we might train an image classifier by giving it a large number of labelled images of digits. Every time it gets an image wrong, gradient descent is used to update the system incrementally in the direction that would have been required to give a correct answer. Eventually, the system has learned to classify basically all images correctly.</p>
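<p>As a rough illustration of that training loop, here is a minimal sketch that swaps the digit images for a hypothetical toy problem: classifying 2D points into two clusters with a single-layer (logistic regression) model. The data, the function names, and the hyperparameters are all assumptions chosen for illustration; real image classifiers use far larger models, but the update rule has the same shape.</p>

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_toy_data(n_per_class=20, seed=0):
    """Two well-separated clusters: label 1 around (1, 1), label 0 around (-1, -1)."""
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n_per_class):
        for label, centre in ((1, 1.0), (0, -1.0)):
            points.append((centre + rng.uniform(-0.5, 0.5),
                           centre + rng.uniform(-0.5, 0.5)))
            labels.append(label)
    return points, labels


def train(points, labels, learning_rate=0.5, epochs=200):
    """Stochastic gradient descent on the log loss of a logistic regression model."""
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            p = sigmoid(w1 * x1 + w2 * x2 + b)
            error = p - y  # gradient of the log loss w.r.t. the pre-sigmoid score
            # Nudge each parameter in the direction that reduces the loss on this example.
            w1 -= learning_rate * error * x1
            w2 -= learning_rate * error * x2
            b -= learning_rate * error
    return w1, w2, b


def accuracy(params, points, labels):
    w1, w2, b = params
    correct = sum(
        (sigmoid(w1 * x1 + w2 * x2 + b) > 0.5) == (y == 1)
        for (x1, x2), y in zip(points, labels)
    )
    return correct / len(points)


points, labels = make_toy_data()
params = train(points, labels)
print(accuracy(params, points, labels))
```

<p>Note that nothing in this loop inspects what the model has “really” learned. All we observe is that its outputs match the labels on the examples we happened to give it.</p>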
<p>There are two broad families of ways techniques like this can fail. The first is when our reward function fails to fully express the true preferences of the programmer - we refer to this as <b>outer misalignment</b>. The second is when the AI learns a different set of goals than those specified by the reward function, but which happen to coincide with the reward function during training - this is <b>inner misalignment</b>. We will now discuss each of these in turn.</p>
<h5>Outer misalignment</h5>
<p>Outer misalignment is perhaps the simpler concept to understand, because we encounter it all the time in everyday life, in a form called <b>Goodhart’s law</b>. In its most well-known form, this law states:</p>
<blockquote><p><i>When a measure becomes a target, it ceases to be a good measure.</i></p>
</blockquote>
<p>Perhaps the most famous case comes from Soviet nail factories, which produced nails based on targets that they had been given by the central government. When a factory was given targets based on the total <i>number</i> of nails produced, they ended up producing a massive number of tiny nails which couldn’t function properly. On the other hand, when the targets were based on the total <i>weight</i> produced, the nails would end up huge and bulky, and equally impractical.</p>
<p><i></i></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKR3NTFpd6w6hf9vI4jXj4XhJJVoMTOfZYaagSYECIq8KJO-FBtkiVZiGEA6rNtwNr16y9VXHm_8tE14964HbK5oe3xrJwZZHqo3sb8O8QjE12pstoDfVLuLe09ZS3CX2kIAebTdEGE2mq7K8KWek7OCpc3zIPbtpN2R3mII3uGf1hikjo-Ln3SHi9CQ/s1600/L6YvMV39zgHDI1-QPHvJs8E72fkQ1KhaCKxCW2oRAsr72CQVgUyvn-bgjk2Rj2msWVdB0rcTkHA2ZOMLzmtDvCcQmvyesvZ0l2YEFghRoglPZI0hIv-SFtYrGMqW-yok8knx3ttZbMo4yE0IsvE6oPbjEJEhTXXkC3jf7KT7Ss5UuXGVez908uTu-A.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="649" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKR3NTFpd6w6hf9vI4jXj4XhJJVoMTOfZYaagSYECIq8KJO-FBtkiVZiGEA6rNtwNr16y9VXHm_8tE14964HbK5oe3xrJwZZHqo3sb8O8QjE12pstoDfVLuLe09ZS3CX2kIAebTdEGE2mq7K8KWek7OCpc3zIPbtpN2R3mII3uGf1hikjo-Ln3SHi9CQ/w260-h640/L6YvMV39zgHDI1-QPHvJs8E72fkQ1KhaCKxCW2oRAsr72CQVgUyvn-bgjk2Rj2msWVdB0rcTkHA2ZOMLzmtDvCcQmvyesvZ0l2YEFghRoglPZI0hIv-SFtYrGMqW-yok8knx3ttZbMo4yE0IsvE6oPbjEJEhTXXkC3jf7KT7Ss5UuXGVez908uTu-A.png" width="260" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: an old Soviet cartoon</i></p></td></tr></tbody></table> <p></p><p>A more recent example comes from the COVID-19 pandemic, where a plasma donation centre offered COVID-sufferers a larger cash reward than healthy individuals. As a result, people would deliberately infect themselves with COVID-19 in order to get a larger cash reward. Examples like this could fill up an entire book, but hopefully at this point you get the message!
</p><p>In the case of machine learning, we are trying to use the reward function to capture the thing we care about, but we are also using this function to train the AI - hence, Goodhart. The cases of <b>specification gaming</b> discussed above are perfect examples of this phenomenon in action - the AIs found ways of “giving the programmers exactly what they asked for”, but in a way which violated the programmers’ original intention. Some of these examples are quite unexpected, and a human would probably never have discovered them just from thinking about the problem. As AIs get more intelligent and are given progressively more complicated tasks, we can expect this problem to get progressively worse, because:</p>
<ul>
<li>With greater intelligence comes the invention of more powerful solutions.</li>
<li>With greater task complexity, it becomes harder to pin down exactly what you want.</li>
</ul>
<p>We should also strongly expect that AIs will be deployed in the real world, and given tasks of real consequence, simply for reasons of economic competitiveness. So any specification gaming failures will be significantly less benign than a <a href="https://openai.com/blog/faulty-reward-functions/">digital boat going around in circles</a>. </p>
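<p>The nail factory story can be turned into a small numerical sketch of Goodhart's law. Everything here is a hypothetical assumption for illustration: the list of nail sizes, the time it takes to make a nail, and the range of sizes that counts as “useful”. The point is only that optimising hard for either proxy (count or weight) picks a nail size that scores zero on the thing we actually cared about.</p>

```python
# Hypothetical toy model of the nail factory; all numbers are made up.
SIZES_GRAMS = [1, 2, 5, 10, 20, 50, 100, 500]
SHIFT_SECONDS = 3600
USEFUL_RANGE_GRAMS = (5, 20)  # assumed range of nail sizes people can actually use


def nails_per_shift(mass):
    """Each nail takes a fixed setup time plus time proportional to its mass."""
    seconds_per_nail = 1.0 + 0.1 * mass
    return SHIFT_SECONDS / seconds_per_nail


def count_target(mass):
    """Proxy 1: total number of nails produced."""
    return nails_per_shift(mass)


def weight_target(mass):
    """Proxy 2: total weight of nails produced."""
    return mass * nails_per_shift(mass)


def true_usefulness(mass):
    """What the planner actually wanted: the number of usable nails."""
    low, high = USEFUL_RANGE_GRAMS
    return nails_per_shift(mass) if low <= mass <= high else 0.0


def best_size(objective):
    """The factory picks whichever nail size scores highest on its target."""
    return max(SIZES_GRAMS, key=objective)


for name, objective in [("count", count_target),
                        ("weight", weight_target),
                        ("usefulness", true_usefulness)]:
    size = best_size(objective)
    print(name, size, true_usefulness(size))
```

<p>In this toy model the count target selects the tiniest nail and the weight target selects the heaviest one, and both choices produce no usable nails at all.</p>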
<h5>Inner misalignment</h5>
<p>The other failure mode, <b>inner misalignment</b>, describes the situation when an AI system learns a different goal than the one you specified. The name comes from the fact that this is an internal property of the AI, rather than a property of the relationship between the AI and the programmers – here, the programmers don’t enter into the picture.</p>
<p>The classic example here is human evolution. We can analogise evolution to a machine learning training scheme, where humans are the system being trained, and the reward function is “surviving and reproducing”. Evolution gave us* certain drives, which reliably increased our odds of survival in the ancestral environment. For instance, we developed drives for sugar (which lead us to seek out calorie-dense foods that supplied us with energy), and drives for sex (which lead to more offspring to pass our genetic code on to). The key point is that these drives are intrinsic, in the sense that humans want these things regardless of whether or not a particular dessert or sex act actually contributes to reproductive fitness. Humans have now moved “off distribution”, into a world where these things are no longer correlated with reproductive fitness, and we continue wanting them and prioritising them over reproductive fitness. Evolution failed at imparting its goal into humans, since humans have their own goals that they shoot for instead when given a chance.</p>
<p>(*Anthropomorphising evolution in language can be dangerously misleading, and should just be seen as a shorthand here.)</p>
<p>A core reason why we should expect inner misalignment - that is, cases where an optimisation process creates a system that has goals different from the original optimisation process - is that it seems very easy. It was much easier for evolution to give humans drives like “run after sweet things” and “run after appealing partners”, rather than for it to give humans an instinctive understanding of genetic fitness. Likewise, an ML system being optimised to do the types of things that humans want may not end up internalising what human values are (or even what the goal of a particular job is), but instead some correlated but imperfect proxy, like “do what my designers/managers would rate highly”, where “rate highly” might include “rate highly despite being coerced into it”, among a million other failure modes. A silly equivalent of “humans inventing condoms” for an advanced AI might look something like “freeze all human faces into a permanent smile so that it looks like they’re all happy” - in the same way that the human drive to have sex does not extend down to the level of actually having offspring, an AI’s drive to do something related to human wellbeing might not extend down to the level of actually making humans happy, but instead something that (in the training environment at least) is correlated with happy humans. What we’re trying to point to here is not any one of these specific failure modes - we don’t think any single one of these is actually likely to happen - but rather the <i>type</i> of failure that these are examples of.</p>
<p>This type of failure mode is not without precedent in current ML systems (although there are fewer examples than for specification gaming). The 2021 paper <a href="https://www.deepmind.com/publications/objective-robustness-in-deep-reinforcement-learning">Objective Robustness in Deep Reinforcement Learning</a> showcases some examples of inner alignment failures. In one example, they trained an agent to fetch a coin in the CoinRun environment (pictured below). The catch was that all the training environments had the coin placed at the end of the level, on the far right of the map. So when the system was trained, it actually learned the task “go to the right of the map” rather than “pick up the coin” - and we know this because when the system was deployed on maps where the coin was placed in a random location, it would reliably go to the right-hand edge rather than fetch the coin. A key distinction worth mentioning here: this is a failure of the agent’s <b>objective</b>, rather than its <b>capabilities</b>. The agent is learning useful skills like how to jump and run past obstacles - it’s just that those skills are being used in service of the wrong objective.</p>
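<p>A minimal sketch of why training reward alone cannot tell these two objectives apart. This is not the CoinRun environment or the paper's code: the one-dimensional corridor, the starting position, and both hand-written policies are assumptions made purely for illustration.</p>

```python
LEVEL_LENGTH = 10  # hypothetical corridor with positions 0..9
START = 5          # hypothetical starting position in the middle


def run_episode(policy, coin_pos, max_steps=20):
    """Return 1 if the agent reaches the coin within max_steps, else 0."""
    pos = START
    for _ in range(max_steps):
        if pos == coin_pos:
            return 1
        step = policy(pos, coin_pos)
        pos = max(0, min(LEVEL_LENGTH - 1, pos + step))  # walls at both ends
    return int(pos == coin_pos)


def go_right(pos, coin_pos):
    """Proxy objective: always move right and ignore the coin."""
    return 1


def seek_coin(pos, coin_pos):
    """Intended objective: move towards the coin."""
    return 1 if coin_pos > pos else -1


def success_rate(policy, coin_positions):
    results = [run_episode(policy, c) for c in coin_positions]
    return sum(results) / len(results)


# Training distribution: the coin is always at the far right.
train_coins = [LEVEL_LENGTH - 1] * 10
# Deployment distribution: the coin can be anywhere in the corridor.
test_coins = list(range(LEVEL_LENGTH))

for policy in (go_right, seek_coin):
    print(policy.__name__,
          "train:", success_rate(policy, train_coins),
          "test:", success_rate(policy, test_coins))
```

<p>Both policies are indistinguishable on the training levels, so a training process that only sees reward has no reason to prefer one over the other. The difference only shows up once the coin moves.</p>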
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSPH1Dki7fjPCAsy3UTqs-p2GtTjXIws8ikNKJdrujWvW51zBA7L6AspdTn1HMRNHZkInVL1hNTwSD-RcS-UO7ieR6iVHUDkCxFtmpcsBAWIJC0-Gz3NGpxy5mNVzUExht5YnJd67oHroH8edsm_GxU3oDEQgcvUJZ9-Qq2zJvafLvsah_rIkcM972Fg/s272/dPK2Z81oQLnDBmXoCnligA2M3VT0kwuD6VCcLz5m5PkqfdAZULp332Ae9gXPf4zHFtGHKjKed1O1WOqZUJIUahcx7w2q4DtuxtG-Vbd4iqiKiVuMV-H46xfxkOQd41W716tb9ItBcWO_Iy5ZgIDfG6VKOSl-sIrGjqgT9Df3GPxc-NuBIEGWYkaQnQ.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="185" data-original-width="272" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSPH1Dki7fjPCAsy3UTqs-p2GtTjXIws8ikNKJdrujWvW51zBA7L6AspdTn1HMRNHZkInVL1hNTwSD-RcS-UO7ieR6iVHUDkCxFtmpcsBAWIJC0-Gz3NGpxy5mNVzUExht5YnJd67oHroH8edsm_GxU3oDEQgcvUJZ9-Qq2zJvafLvsah_rIkcM972Fg/s1600/dPK2Z81oQLnDBmXoCnligA2M3VT0kwuD6VCcLz5m5PkqfdAZULp332Ae9gXPf4zHFtGHKjKed1O1WOqZUJIUahcx7w2q4DtuxtG-Vbd4iqiKiVuMV-H46xfxkOQd41W716tb9ItBcWO_Iy5ZgIDfG6VKOSl-sIrGjqgT9Df3GPxc-NuBIEGWYkaQnQ.png" width="272" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: the CoinRun environment.</i></p></td></tr></tbody></table><p>So, how bad can inner misalignment get? A particularly concerning scenario is <b>deceptive alignment</b>. This is when the agent learns it is inside a training scheme, discovers what the base objective is, but has already acquired a different goal. In this case, the system might reason that a failure to achieve the base objective when training will result in it being modified, and not being able to achieve its actual goal. Thus, the agent will pretend to act aligned, until it thinks it’s too powerful for humans to resist, at which point it will pursue its actual goal without the threat of modification. 
This scenario is highly speculative, and there are many aspects of it which we are still uncertain about, but if it is possible then it would represent maybe the most worrying of all possible alignment failures. This is because a deceptively aligned agent would have incentives to act against its programmers, but also to keep these incentives hidden until it expects human opposition to be ineffectual.</p>
<p>It’s worth mentioning that this inner / outer alignment decomposition isn’t a perfect way to carve up the space of possible alignment failures. For instance, for most non-trivial reward functions, the AI will probably be very far away from perfect performance on it. So it’s not exactly clear what we mean by a statement like “the AI is perfectly aligned with the reward function we trained it on”. Additionally, the idea of inner optimisation is built around the concept of a “mesa-optimiser”, which is basically a learned model that itself performs optimisation (just like humans were trained by evolution, but we ourselves are optimisers since we can use our brains to search over possible plans and find ones which meet our objectives). The problem here is that it’s not clear what it actually means to be an optimiser, and how we would determine whether an AI is one. This being said, the inner / outer alignment distinction is still a useful conceptual tool when discussing ways AI systems can fail to do what we intend.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVFg7jZPH7429MqUzkjWgPQbhWVuMO2ldDIRrZS2rd3Fy8aBIRMCNnRp5ibGJq3smc9kdoGIkOJHziETEkZv3M_Q6p5IqlZKCZqERPs7k3bRHsirKraqo7-8OWkLtvfQMViU4LiKEq2ROzbfPhuHZS82ElFjaKMnyeEca-FHTCYyZ4Khy9kmYJNz8Onw/s1600/PPH33X6xlwOPj_0YtD3BjyeCHarxJ7sjgxXbCZaSGFxLAJV_7-ulhj0tqPfUhmLjgCzM-hEy9X3zXJHHNpz2Y__is6pP1T3WkHsinUBFRdj5bYtzalUtU3DqHYhPjuT9Dff4QFo5NkG1hXKq-ghdCSZRkf7iCz_pDbgZ3CEwXW3vkTxIK4M0QcZtNw.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="984" data-original-width="1600" height="394" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVFg7jZPH7429MqUzkjWgPQbhWVuMO2ldDIRrZS2rd3Fy8aBIRMCNnRp5ibGJq3smc9kdoGIkOJHziETEkZv3M_Q6p5IqlZKCZqERPs7k3bRHsirKraqo7-8OWkLtvfQMViU4LiKEq2ROzbfPhuHZS82ElFjaKMnyeEca-FHTCYyZ4Khy9kmYJNz8Onw/w640-h394/PPH33X6xlwOPj_0YtD3BjyeCHarxJ7sjgxXbCZaSGFxLAJV_7-ulhj0tqPfUhmLjgCzM-hEy9X3zXJHHNpz2Y__is6pP1T3WkHsinUBFRdj5bYtzalUtU3DqHYhPjuT9Dff4QFo5NkG1hXKq-ghdCSZRkf7iCz_pDbgZ3CEwXW3vkTxIK4M0QcZtNw.png" width="640" /></a></div>
<h3>Misaligned AGI could overpower humanity</h3>
<blockquote><p> <i>The best answer to the question, “Will computers ever be as smart as humans?” is probably “Yes, but only briefly.”</i></p>
<p>-<i>Vernor Vinge</i></p>
</blockquote>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZoOysowtrB8FLMtCQ9hhh3H34eH8pRMetMIMNLrAww4elE2-u3FbvSTtqoy_tC0wh4VLMMJE-7Vp8aHVFPcH7xxxN0PMxuOi2F1l-T_PaIzQleCCoqfyjD3b6qta-CkHyqwvA_-Ygfnm63qnqYMot8T3C-hqZPNg78SJ7imjFIlyZ6uWOhEB9V6qOSw/s1024/WE1TOIxhoaYkTXIkM_CCvP15KJWCX2ycK_jW1Lt9uddbjy-IoZINMCbijY75vY1VJal2SzaA2ERv6USRFbQqcfri5fZgFBqk05OYZKEPJXNAMGvKFyoC6Dn6A6AGl_J7dnMWQmTKadTXRQrp90hx9lQ09_6rxxWfzqzYoWHfaxQ2V32ndFAsIVMTmQ.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZoOysowtrB8FLMtCQ9hhh3H34eH8pRMetMIMNLrAww4elE2-u3FbvSTtqoy_tC0wh4VLMMJE-7Vp8aHVFPcH7xxxN0PMxuOi2F1l-T_PaIzQleCCoqfyjD3b6qta-CkHyqwvA_-Ygfnm63qnqYMot8T3C-hqZPNg78SJ7imjFIlyZ6uWOhEB9V6qOSw/w400-h400/WE1TOIxhoaYkTXIkM_CCvP15KJWCX2ycK_jW1Lt9uddbjy-IoZINMCbijY75vY1VJal2SzaA2ERv6USRFbQqcfri5fZgFBqk05OYZKEPJXNAMGvKFyoC6Dn6A6AGl_J7dnMWQmTKadTXRQrp90hx9lQ09_6rxxWfzqzYoWHfaxQ2V32ndFAsIVMTmQ.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: DALL-E's drawing of "Digital art of two earths colliding"</i></p></td></tr></tbody></table>
<p>Suppose one day, we became aware of the existence of a “twin earth” - similar to our own in several ways, but with a few notable differences. Call this “Earth 2”. The population was smaller (maybe just 10% of the population of our earth), and the people were less intelligent (maybe an average IQ of 60, rather than 100). Suppose we could only interact with this twin earth using their version of the internet. Finally, suppose we had some reason for wanting to overthrow them and gain control of their civilization, e.g. we had decided their goals weren’t compatible with a good future for humans. How could we go about taking over their world?</p>
<p>At first, it might seem like our strategies are limited, since we can only use the internet. But there are many strategies still open to us. The first thing we would do is try to gather resources. We could do this illegally (e.g. by discovering people’s secrets via social engineering and performing blackmail), but legal options would probably be more effective. Since we are smarter, the citizens of Earth 2 would be incentivised to employ us, e.g. to make money using quantitative finance, or to research and develop advanced weaponry or other technologies. If the governments of Earth 2 tried to pass regulations limiting the amount or type of work we could do for them, there would be an incentive to evade these regulations, because anyone who did could make more profit. Once we’d amassed resources, we would be able to bribe members of Earth 2 into taking actions that would allow us to further spread our influence. We could infiltrate computer systems across the world, planting backdoors and viruses using our superior cybersecurity skills. Little by little, we would learn more about their culture and their weaknesses, presenting a front of cooperation until we had amassed enough resources and influence for a full takeover. </p>
<p><i>Wouldn’t the citizens of Earth 2 see this coming?</i> There’s a chance that we manage to be sufficiently sneaky. But even if some people realise, it would probably take a coordinated and expensive global effort to resist. Consider our own poor track record on climate change (a comparatively better-documented, better-understood, and more gradually worsening phenomenon) and on coordinating a global response to COVID-19.</p>
<p><i>Couldn’t they just “destroy us” by removing our connection to their world?</i> In theory, perhaps, but this would be very unlikely in practice, since it would require them to rip out a great deal of their own civilisational plumbing. Imagine how hard it would be for us to remove the internet from our own society, or even a more recent and less essential technology such as blockchain. Consider also how easy it can be for an adversary with better programming ability to hide features in computer systems.</p>
<p>—</p>
<p>As you’ve probably guessed at this point, the thought experiment above is meant to be an analogy for the feasibility of AIs taking over our own society. They would have no physical bodies, but would have several advantages over us which are analogous to the ones described above. Some of these are:</p>
<ul>
<li><b>Cognitive advantage</b>.
Human brains use approximately 86 billion neurons, and send signals at 50 metres per second. These hard limits come from brain volume and metabolic constraints. AIs would have no such limits, since they can easily scale (GPT-3 has 175 billion parameters, though you shouldn’t directly equate parameter and neuron count*), and can send signals at close to the speed of light.
(*For a more detailed discussion of this point, see <a href="https://www.openphilanthropy.org/research/new-report-on-how-much-computational-power-it-takes-to-match-the-human-brain/">Joseph Carlsmith’s report</a> on the computational power of the human brain.)</li>
<li><b>Numerical advantage</b>.
AIs would have the ability to copy themselves at a much lower time and resource cost than humans; it’s as easy as finding new hardware. Right now, the way ML systems work is that training is much more expensive than running, so if you have the compute to train a single system, you have the compute to run thousands of copies of that system once the training is finished.</li>
<li><b>Rationality</b>.
Humans often act in ways which are not in line with our goals, when the instinctive part of our brains gets in the way of the rational, planning part. Current ML systems are also weakened by relying on a sort of associative/inductive/biased/intuitive/fuzzy thinking, but it is likely that sufficiently advanced AIs could carry out rational reasoning better than humans (and therefore, for example, come to the correct conclusions from fewer data points, and be less likely to make mistakes).</li>
<li><b>Specialised cognition.</b>
Humans are equipped with general intelligence, and perhaps some specialised “hardware accelerators” (to use computer terminology) for domains like social reasoning and geometric intuition. Perhaps human abilities in, say, physics or programming are significantly bottlenecked by the fact that we don’t have specialised brain modules for those purposes, and AIs that have cognitive modules designed specifically for such tasks (or could design them themselves) might have massive advantages, even on top of any generic speed-boost they gain from having their general intelligence algorithms running at a faster speed than ours.</li>
<li><b>Coordination</b>.
As the recent COVID-19 pandemic has illustrated, even when the goals are obvious and most well-informed individuals could find the best course of action, we lack the ability to globally coordinate. While AI systems might or might not have incentives or inclinations to coordinate, if they do, they have access to tools that humans don’t, including firmer and more credible commitments (e.g. by modifying their own source code) and greater bandwidth and fidelity of communication (e.g. they can communicate at digital speeds, not just in words but potentially by directly sending information about the computations they’re carrying out).</li>
</ul>
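<p>The “numerical advantage” point above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses two common rules of thumb for dense transformer models (training costs roughly 6 FLOPs per parameter per training token, and generating one token costs roughly 2 FLOPs per parameter). The parameter count is the GPT-3 figure quoted above; the training-token count, the length of the training run, and the tokens-per-second speed of a single copy are assumptions chosen only for illustration. The estimate ignores memory limits and hardware utilisation.</p>

```python
# All inputs are illustrative assumptions, not measurements.
N_PARAMS = 175e9        # GPT-3 parameter count (from the text above)
TRAIN_TOKENS = 300e9    # assumed number of training tokens
TRAIN_DAYS = 30         # hypothetical duration of the training run
TOKENS_PER_SECOND = 10  # hypothetical output speed of one running copy

SECONDS_PER_DAY = 24 * 60 * 60


def training_flops(n_params, n_tokens):
    """Rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens


def inference_flops_per_token(n_params):
    """Rule of thumb: about 2 FLOPs per parameter per generated token."""
    return 2 * n_params


def copies_runnable(n_params, n_tokens, train_days, tokens_per_second):
    """How many copies the training cluster could run at once, very roughly."""
    cluster_flops_per_second = training_flops(n_params, n_tokens) / (train_days * SECONDS_PER_DAY)
    flops_per_second_per_copy = inference_flops_per_token(n_params) * tokens_per_second
    return cluster_flops_per_second / flops_per_second_per_copy


copies = copies_runnable(N_PARAMS, TRAIN_TOKENS, TRAIN_DAYS, TOKENS_PER_SECOND)
print(f"Approximate simultaneous copies: {copies:,.0f}")
```

<p>Under these particular assumptions the answer comes out in the tens of thousands. Different assumptions move the number a lot, but the gap between training cost and running cost is what drives the conclusion.</p>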
<p>It’s worth emphasising here that the main concern comes from AIs with misaligned goals acting against humanity, not from humanity misusing AIs. The latter is certainly cause for major concern, but it’s a different kind of risk to the one we’re talking about here. </p>
<p> </p>
<p><b>Summary of this section:</b></p>
<p>AI researchers in general expect a greater than 50% chance of AGI in the next few decades.</p>
<p>The <i>Orthogonality Thesis</i> states that, in principle, intelligence can be combined with more or less any final goal, and sufficiently intelligent systems do not automatically converge on human values. The <i>Instrumental Convergence</i> thesis states that, for most goals, there are certain instrumental goals that are very likely to help with the final goal (e.g. survival, preservation of its current goals, acquiring more resources and cognitive ability).</p>
<p>Inner and outer alignment are two different possible ways AIs might form goals which are misaligned with the intended goals.</p>
<p>Outer misalignment happens when the reward function we use to train the AI doesn’t exactly match the programmer’s intention. In the real world, we commonly see a version of this called Goodhart’s law, often phrased as “when a measure becomes a target, it ceases to be a good measure [because of over-optimisation for the measure, over the thing it was supposed to be a measure of]”.</p>
<p><i>Inner misalignment</i> is when the AI learns a different goal to the one specified by the reward function. A key analogy is with human evolution – humans were “trained” on the reward function of genetic fitness, but instead of learning that goal, learned a bunch of different goals like “eat sugary things” and “have sex”. A particularly worrying scenario here is deceptive alignment, when an AI learns that its goal is different from the one its programmers intended, and learns to conceal its true goal in order to avoid modification (until it is strong enough that human opposition is likely to be ineffectual).</p>
<h4>Failure modes</h4>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjh-7AZOqNY7xGxELN4GxynWcd291hsrNN5Tlvra7H2BEmOFHdWRJe28Vyt412Kzk8kgxduGjySeS-nDJagmrtvSVtfM3hiEEBihI1j59FuvLjfgrh32jGJImcI7TpulPYa2yJEe5trmufCfPY-hAB8NSkgIDsnSUpJ8wiOjJPb-H9IZPwmOcslRnbrlg/s1024/44lppCCb7Hrj2XFMJUnAYb0u1afPYsTx-x6ZgEgmvyRJWQLYPZmQdgiVZqMs1ICb0XzBLH09UDuvHfK55KB8Pe74akvxgqw4YVal33yF2vPpwpksmKkVQeh4eqTZpFdgwa9ywIZNZ76nWH10hA15Rd6xTaPeSbPwRqy7hMZSgx5eXkW9AB9xQtFUZw.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjh-7AZOqNY7xGxELN4GxynWcd291hsrNN5Tlvra7H2BEmOFHdWRJe28Vyt412Kzk8kgxduGjySeS-nDJagmrtvSVtfM3hiEEBihI1j59FuvLjfgrh32jGJImcI7TpulPYa2yJEe5trmufCfPY-hAB8NSkgIDsnSUpJ8wiOjJPb-H9IZPwmOcslRnbrlg/w400-h400/44lppCCb7Hrj2XFMJUnAYb0u1afPYsTx-x6ZgEgmvyRJWQLYPZmQdgiVZqMs1ICb0XzBLH09UDuvHfK55KB8Pe74akvxgqw4YVal33yF2vPpwpksmKkVQeh4eqTZpFdgwa9ywIZNZ76nWH10hA15Rd6xTaPeSbPwRqy7hMZSgx5eXkW9AB9xQtFUZw.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: DALL-E really seems to have a natural talent at depicting
"The earth is on fire, artificial intelligence has taken over, robots
rule the world and suppress humans, digital art, artstation"</i>.</p></td></tr></tbody></table>
<p>But what, concretely, might an AI-related catastrophe look like?</p>
<p>AI catastrophe scenarios sound like something straight out of science fiction. However, we can immediately discount a few common features of sci-fi AI takeovers. First, time travel. Second, armies of humanoid killer robots. Third, the AI acting out of hatred for humanity, or out of bearing a grudge, or because it hates our freedom, or because it has suddenly acquired “consciousness” or “free will”, or - as Steven Pinker <a href="https://scottaaronson.blog/?p=6524">likes to put it</a> - because it has developed an “alpha-male lust for domination”.</p>
<p>Remember instead the key points from above about how an AI’s goals might become dangerous: by achieving exactly what we tell it to do <i>too well</i> in a clever letter-but-not-spirit-of-the-law way, by having a goal that in most cases is the same as the goal we intend for it to have but which diverges in some cases we don’t think to check for, or by having an unrelated goal but still achieving good performance on the training task because it learns that doing well on the training tasks is instrumentally good. None of these reasons have anything to do with the AI developing megalomania, let alone the philosophy of consciousness; they are instead the types of technical failures that you’d expect from an optimisation process. As discussed above, we already see weaker versions of such failures in modern ML systems.</p>
<p>It is very uncertain which exact type of AI catastrophe we are most likely to see. We’ll start by discussing the flashiest kind: an AI “takeover” or “coup” where some AI system finds a way to quickly and illicitly take control over a significant fraction of global power. This may sound absurd. Then again, we already have ML systems that learn to crash or hack the game-worlds they’re in for their own benefit. Eventually, perhaps in the next decade, we should expect to have ML systems doing important and useful work in real-world settings. Perhaps they’ll be trading stocks, or writing business reports, or managing inventories, or advising decision-makers, or even being the decision-makers. Unless there is (1) some big surprise waiting in how scaled-up ML systems work, (2) a major advance in AI alignment research, or (3) a miracle, the default outcome seems to be that such systems will try to “hack” the real world in the same way that their more primitive cousins today use clever hacks in digital worlds. Of course, the capabilities of the systems would have to advance a lot for them to be civilisational threats. However, rapid capability advancement has held for the past decade and we have solid theoretical reasons (including the scaling laws mentioned above) to expect it to continue holding. Remember also the cognitive advantages mentioned in the previous section.</p>
<p>As for how it proceeds, it might happen at a speed that is more digital than physical - for example, if the AI’s main lever of power is hacking into digital infrastructure, it might have achieved decisive control before anyone even realises. As discussed above, whether or not the AI has access to much direct physical power seems mostly irrelevant.</p>
<p>Another failure mode, thought to be significantly more likely than the direct AI takeover scenario by leading AI safety researcher Paul Christiano, is one that he calls <a href="https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like">“going out with a whimper”</a>. Look at all the metrics we currently try to steer the world with: companies try to maximise profit, politicians try to maximise votes, economists try to maximise metrics like GDP and employment. Each of these is a proxy for what we want: a profitable company is one that has a lot of customers willing to pay money for their products; a popular politician has a lot of people thinking they’re great; maximising GDP generally correlates with people being wealthier and happier. However, none of these metrics or incentive systems really gets to the heart of what we care about, and so it is possible to find (and in the real world we often observe) cases where profitable companies and popular politicians are pursuing destructive goals, or where GDP growth is not actually contributing to people’s quality of life. These are all cases of Goodhart’s law, as discussed above.</p>
<figure><table>
<thead>
<tr><th><b>Hard-to-measure</b></th><th><b>Easy-to-measure</b></th><th><b>Consequence</b></th></tr></thead>
<tbody><tr><td>Helping me figure out what's true</td><td>Persuading me</td><td>Crafting persuasive lies</td></tr><tr><td>Preventing crime</td><td>Preventing reported crime</td><td>Suppressing complaints</td></tr><tr><td>Providing value to society</td><td>Profit</td><td>Regulatory capture, underpaying workers</td></tr></tbody>
</table></figure>
<p>What ML gives us is a very general and increasingly powerful way of developing a system that does well at pushing some metric upwards. A society where more and more capable ML systems are doing more and more real-world tasks will be a society that is going to get increasingly good at pushing metrics upwards. This is likely to result in visible gains in efficiency and wealth. As a result, competitive pressures will make it very hard for companies and other institutions to say no: if Acme Motors Company starts performing 15% better after outsourcing their CFO’s decision-making to an AI, General Systems Inc will be very tempted to replace their CEO with an AI (or maybe the CEO will themselves start consulting an AI for more and more decisions, until their main job is interfacing with an AI).</p>
<p>In the long run, a significant fraction of work and decision-making may well be offloaded to AI systems, and at that point it may be very hard to change course. Currently our most fearsome incentive systems like capitalism and democracy still run on the backs of the constituent humans. If tomorrow all humans decided to overthrow the government, or abolish capitalism, they would succeed. But once the key decisions that perpetuate major social incentive systems are no longer made by persuadable humans, but instead automatically implemented by computer systems, change might become very difficult.</p>
<p>Since our metrics are flawed, the long-term outcome is likely to be less than ideal. You can try to imagine what a society run by clever AI systems trained to optimise purely for their company’s profit looks like. Or a world of media giants run by AIs which spin increasingly convincing false narratives about the state of the world, designed to make us <i>feel</i> more informed rather than actually telling us the truth.</p>
<p>Remember also, as discussed previously, that there are solid reasons to think that influence-seeking and deceptive behaviours are likely in sufficiently-powerful AI systems. If the ML systems that increasingly run important institutions exhibit such behaviour, then the above “going out with a whimper” scenario might acquire extra nastiness and speed. This is something Paul Christiano explores in the <a href="https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like">same article</a> linked above.</p>
<p>A popular misconception about AI risk is that the arguments for doing something are based on a tiny risk of giant catastrophe. The giant catastrophe part is correct. The minuscule risk part, as best as anyone in the field can tell, is not. As mentioned above, the average ML researcher - generally an engineering-minded person not prone to grandiose futuristic speculation - gives <a href="https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/">a 5% chance of civilisation-ending disaster from AI</a>. The ML researchers who grapple with the safety issues as part of their job are clearly not an unbiased randomly-selected sample, but generally give numbers in the 5-50% range, and some (in our opinion overly alarmist) people think it’s over 90%. As the above arguments hopefully emphasise, some type of catastrophe seems like the <i>default outcome</i> from the types of AI advances that we are likely to encounter in the coming decades, and the main reason for thinking we will avoid one is the (justifiable but uncertain) hope that someone somewhere invents solutions.</p>
<p>It might seem forced or cliche that AI risk scenarios so frequently end with something like “and then the humans no longer have control of their future and the future is dark” or even “and then everyone literally dies”. But consider the type of event that AGI represents and the available comparisons. The computer revolution reshaped the world in a few decades by giving us machines that can do a <i>narrow</i> range of intellectual tasks. The industrial revolution let us automate large parts of <i>manual</i> labour, and also set the world off on an unprecedented rate of economic growth and political change. The evolution of humans is plausibly the most important event in the planet’s history since at least the dinosaurs died out 66 million years ago, and it took on the exact form of “something smarter than anything else on the planet appeared, and now suddenly they’re firmly in charge of everything”.</p>
<p>AI is a big deal, and we need to get it right. How we might do so is the topic for <a href="https://www.strataoftheworld.com/2022/09/ai-risk-intro-2-solving-problem.html">part 2</a>.</p>
</div><p><b>EA as a Schelling point</b> (posted 2022-09-10)</p><div style="text-align: center;"><i><span style="font-size: x-small;">3.1k words (~9 minutes)</span></i></div><p><b>Summary</b>: A significant way in which the EA community creates value is by acting as a <a href="https://en.wikipedia.org/wiki/Focal_point_(game_theory)">Schelling point</a> where talented, ambitious, and altruistic people tend to gather and can meet each other (in addition to more direct sources of EA value like identifying the most important problems and directly pushing people to work on them). It might be useful to think about what optimising for being a Schelling point looks like, and I list some vague thoughts on that.</p>
<hr />
<p>A Schelling point, also known as a focal point, is what people decide on in the absence of communication, especially when it's important to coordinate by coming to the same answer.</p>
<p>The classic example is: you were arranging a meeting with a stranger in New York City by telephone, but you used the last minute of your phone credit and the line cut off after you had agreed on the date but not location or time - where do you meet? "Grand Central Station at noon" is an answer that other people may be especially likely to converge on.</p>
<p>(Schelling points can be thought of as a type of acausal negotiation.)</p>
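<p>The meeting example above can be put into numbers. If two people choose a place independently from the same distribution, the chance that they pick the same place is the sum of the squared probabilities of each option. The sketch below compares a situation with no focal point against one where a single option stands out. The list of alternative locations and the salience weights are made up for illustration.</p>

```python
def match_probability(choice_probs):
    """Chance that two independent choosers with the same distribution coincide."""
    total = sum(choice_probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * p for p in choice_probs.values())


# Hypothetical: four candidate meeting places, none more salient than the others.
no_focal_point = {
    "Grand Central Station": 0.25,
    "Times Square": 0.25,
    "Empire State Building": 0.25,
    "Central Park": 0.25,
}

# Hypothetical: the same places, but one is much more salient to both people.
with_focal_point = {
    "Grand Central Station": 0.70,
    "Times Square": 0.10,
    "Empire State Building": 0.10,
    "Central Park": 0.10,
}

print("no focal point:  ", round(match_probability(no_focal_point), 2))
print("with focal point:", round(match_probability(with_focal_point), 2))
```

<p>Nobody has communicated in either case. The only thing that changes is how strongly both people expect the other to favour the salient option, and that alone roughly doubles the chance of meeting in this toy setup.</p>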
<h2 id="when-the-schelling-point-is-the-selling-point">When the Schelling point is the selling point</h2>
<p>Schelling points are often extremely powerful and valuable. A key function of top universities is to be Schelling points for talented people. (Personally, I'd call it the most important function.) There are other valuable things too: courses that go deeper, the signalling value to employers, and so on. However, talented people generally have a preference for hanging out with other talented people, both for social reasons and to find collaborators for ambitious projects and future colleagues. At the same time, talented people are also generally spread out and present only at low densities. Top universities select hard on (some measures of) talent, and through this create environments with high talent density. A big chunk of the reason why people apply to top universities is because other people do so too, and I'd guess that even if the academic standards of Stanford, MIT, or Cambridge eroded significantly, the fact that they've established themselves as congregating points for smart people will keep people applying and visiting for a long time.</p>
<p>(Note that this is related to, but not equal to, the prestige and status of these places. It is possible to imagine Schelling points that are not prestigious. For example, my impression is that this described MIT at one point - it became a congregating point for uniquely ambitious STEM students and defence research before it achieved high academic status. It is also possible to imagine prestigious places that are not Schelling points, though this is a bit harder since anything with prestige becomes a Schelling point for high social status (though prestige Schelling points and talent Schelling points need not co-occur). More generally, since prestige is a thing many people care a lot about, there is a high correlation between a place being prestigious or high status and being a Schelling point for at least some type of person. However, the mechanisms are distinct - a person selecting their university based on status is selecting based on what they get to write on their CV, while a person selecting their university based on it being a Schelling point for smart people is selecting based on the fact that many other smart people that they can't coordinate with but would like to meet will also choose to go there.)</p>
<p>Another example is Silicon Valley. Sure, the area has many strengths - being rich and inside a large stable free market - but by far the greatest argument for living in Silicon Valley is that others also choose it. This leads to a (for now) unique combination of entrepreneurial people, great programmers, venture capitalists, and all the other types of people you need for a thriving tech business ecosystem, all there primarily because all the others are there too (how touching!). There's a lot of value in having everything in one place, and it would be very hard for all the different people who make up the value of Silicon Valley to coordinate to move to another place. That's why the Schelling point value of Silicon Valley is so enduring that people continue to tolerate large numbers of homeless drug addicts and sell kidneys to pay rent for years on end.</p>
<p>Note that a big part of the mechanism isn't that <i>specific</i> people you want to find are there, but that the <i>types of person</i> you'd want to find are likely to also be there, because both those people and yourself are likely to converge on the strategy of going there.</p>
<h2 id="schelling-ea">Schelling EA</h2>
<p>The Effective Altruism (EA) community provides a lot of value, for example:</p>
<ul>
<li>research into figuring out what are the most important problems to solve to maximise human flourishing;</li>
<li>research and concrete efforts into how to solve the most important problems discovered by the above;</li>
<li>high epistemic standards and truth-seeking discussion norms;</li>
<li>a uniquely wide-ranging and well-reasoned set of resources to help people pursue high-impact careers;</li>
<li>tens of billions of dollars in funding.</li>
</ul>
<p>However, in addition to these, a very critical part of the value that EA provides is being a Schelling point for talented, ambitious, and altruistically-motivated people. </p>
<p>Even without EA, there would be researchers studying existential risks, animal welfare, and global poverty; people trying to assess charities; communities with high epistemic norms; and billionaires trying to use their fortunes for effective good. However, thanks to EA, people in each of these categories can go to the same Effective Altruism Global conference or quickly find people in local groups, and meet collaborators, co-founders, funders, and so on. A lot of the reason why this can happen is that if you hang out with a certain group of people or on the right websites, EA looms large.</p>
<p>The biggest <i>personal</i> source of value I've gotten from EA has been having a shortcut to meeting people very high in all of talent, ambition, and altruistic motivation.</p>
<p>Much of this is obvious - breaking news: communities bring people together and foster connections, more at 11 - but I think taking seriously just how much of counterfactual EA community impact comes from being a Schelling point leads to some less-obvious points about possible implications.</p>
<h2 id="implications">Implications</h2>
<p>The Schelling-point-based (and therefore necessarily incomplete) answer to "what is the EA community for?" might be something like "be an obvious Schelling point where relevant people gather and the chance of interactions that lead to useful work is maximised, and have a community and infrastructure that pushes work in the most useful direction possible". (This is in contrast to answers that emphasise e.g. directly increasing the number of people working on the most pressing problems.) (I will not argue for this being the best possible answer; my point is just that it is one possible answer, and an interesting one to examine further.)</p>
<p>If I were a Big Tech marketing consultant, I might call this "EA-as-a-platform".</p>
<p>What might maximising for such a Schelling point strategy look like?</p>
<h3 id="being-obvious">Being obvious</h3>
<p>A Schelling point is not a Schelling point unless it's obvious enough. For EA to be an effective Schelling point for talented/ambitious/altruistic people, those people must hear about it. Silicon Valley is obvious enough that entrepreneurial people from South Africa to Russia hear about it and decide it's where they want to be. To maximise its Schelling point value, EA should have world-spanning levels of recognition.</p>
<p>Note that recognition does not equal prestige or likeability. We don't care (for Schelling point reasons at least) if most people hear about EA and go "eh, sounds weird and unappealing"; what matters is that the core target demographic is excited enough to put effort into pursuing EA. Consider how Silicon Valley was not particularly high-prestige among the public even when it was already attracting tech entrepreneurs, or how many people hear about the intensity of academics at top universities and (very reasonably) think "no thanks".</p>
<h3 id="providing-value">Providing value</h3>
<p>Though most of a Schelling point's value typically comes from the other people who congregate at it, a Schelling point is easier to create if it is obviously valuable. Even though the smart people they meet might be most of the benefit of university, high schoolers are still more likely to go to top universities if they provide good education, good facilities, and unambiguous social status.</p>
<p>Some obvious ways in which EA provides value are through funding sufficiently promising projects, and by having a very high concentration of intellectually interesting ideas.</p>
<p>There are risks to communicating loudly about the value-add, since this brings in people who are in it purely for personal gain (<a href="https://forum.effectivealtruism.org/posts/W8ii8DyTa5jn8By7H/the-vultures-are-circling">"the vultures are circling", as one Forum post put it</a>). This works for Schelling points like Silicon Valley, but not for one built around altruism.</p>
<h3 id="optimising-for-matchmaking">Optimising for matchmaking</h3>
<p>A specific way that Schelling points provide value is by making it easy to meet other people in the specific ways that lead to productive teams forming. An existing example of this is that everyone says one-on-one meetings are the main point of conferences, and there is (of course) a lot of <a href="https://forum.effectivealtruism.org/posts/pKbTjdopzSEApSQfc/doing-1-on-1s-better-eag-tips-part-ii">thinking about how to make these effective</a>. On the more informal end of the scale, <a href="https://www.reciprocity.io/">Reciprocity</a> exists.</p>
<p>However, the scope and value of EA matchmaking could be expanded. I'm not aware of many ways to match together entrepreneurial teams (the <a href="https://www.charityentrepreneurship.com/incubation-program">Charity Entrepreneurship incubation program</a> is the only one that comes to mind). I recently took part in an informally-organised co-founder matching process and found it extremely helpful to quickly get a lot of information on what it's like to work together with several promising people.</p>
<p>I'd advise someone to think more about how to make the EA environment even more effective at matching people who should know about each other. However, I expect someone is already designing a 53-parameter one-on-one matching system with Calendly, Slack, and Matplotlib integration for the next conference, and therefore I will hold off on adding any more fuel to this fire.</p>
<h3 id="being-legit">Being legit</h3>
<p>One of the specific ways in which a Schelling point becomes one is if things associated with it seem uniquely competent, successful, or otherwise good, in a clearly unfakeable way. It is helpful for Cambridge's Schelling point status that it can brag about having 121 Nobel laureates. That so many successful tech companies emerged from Silicon Valley specifically is an unfakeable signal. Any government or city can afford to throw some millions at putting up posters advertising its startup-friendliness; few can consistently produce multi-billion dollar tech companies.</p>
<p>No amount of community-building or image-crafting is likely to replicate the Schelling point power of <i>obviously being the place where things happen</i>. In some areas, I think EA already has such power: much of the research and work on existential risks happens within EA, and it might be hard to be a researcher on those topics without running into the large body of EA-originating work. However, EA goals require more than just research; note how being a <a href="https://80000hours.org/career-reviews/founder-impactful-organisations/">project/organisation founder</a> or <a href="https://80000hours.org/articles/operations-management/">working in an operations role</a> have been creeping up the 80 000 Hours list of recommended career paths.</p>
<p>It would be extremely powerful, not just for direct impact reasons but also for building up EA's Schelling point status, if the EA community clearly spawned very obviously successful real-world projects. <a href="https://www.alveavax.com/">Alvea</a> succeeding or working <a href="https://forum.effectivealtruism.org/posts/gLPEAFicFBW8BKCnr/announcing-the-nucleic-acid-observatory-project-for-early">Nucleic Acid Observatories</a> being built would be powerful examples. Likewise if <a href="https://www.charityentrepreneurship.com/">Charity Entrepreneurship</a>-incubated charities become clear stars of the non-profit world.</p>
<h3 id="meritocracy-and-impartial-judgement">Meritocracy and impartial judgement</h3>
<p>Right now, I think if a person somewhere in the world has a well-thought out idea for how to make the world a better place, likely their best bet to get a fair hearing, useful feedback, and - if it is competitive with the most valuable existing projects - funding and support is to post it on the <a href="https://forum.effectivealtruism.org/">EA Forum</a>. I don't think this is very obvious outside the EA community. However, this fact, and awareness of it, could make EA a more useful Schelling point, in the same way that the impression that Silicon Valley doesn't frown on weird ideas as long as they're important enough makes it a better Schelling point.</p>
<p>That EA endorses cause neutrality, has high and transparent epistemic standards, and has a quantitative mindset are key parts of this. However, to use this to increase EA Schelling point power, these properties need to be clearly visible to outsiders.</p>
<p>The most likely way for this to become more obvious might be if specific EA organisations achieved such a reputation widely within their field (and then there was some path by which knowing of these organisations points people towards knowing about EA).</p>
<p>GiveWell might be an example of a clearly-EA-linked organisation with visibly high epistemics and judgement quality, though I don't know what their image or recognition level is outside the EA community. Another example is if someone created successful and famous organisations along the lines of FTX Future Fund's proposed <a href="https://ftxfuturefund.org/projects/epistemic-appeals-process/">epistemic appeals process</a> or <a href="https://ftxfuturefund.org/projects/expert-polling-for-everything/">widespread expert polling</a> projects.</p>
<h3 id="openness-and-approachability">Openness and approachability</h3>
<p>Good Schelling points are easy to enter, and don't select on attributes that they don't have to.</p>
<p>Every human sub-group, even if loose and purpose-driven, tends to develop a distinctive culture that is much more specific than strictly implied by its purpose. Sometimes this is useful, since it makes it easy for humans in even a loose group to bond with each other. However, a strong and distinct internal culture is also a barrier to entry. EA is already high-risk for having a strong barrier to entry, because</p>
<ul>
<li>many arguments and concepts in EA require background knowledge to understand, and sometimes dense philosophical or technical background knowledge (and this is not the case just for more formal things like Forum posts; I've frequently heard "EV [expected value]", "QALY [quality-adjusted life year]", and "Pascal's mugging" assumed as obvious common terminology in casual conversation);</li>
<li>EA (quite obviously, given what it's about) has a high concentration of non-obvious arguments that are obscure in public discussion but have huge implications; and</li>
<li>perhaps the main route into EA is caring very strongly about intellectual arguments about abstract moral principles, which tends not to be a natural way for humans to join communities.</li>
</ul>
<p>These largely unavoidable factors already make EA somewhat unapproachable, and make it seem like a tightly-knit weird in-group/subculture (anecdotally, this seems to be the most common complaint about EA among Cambridge students). Weird cultural norms or quirks are (among other things!) barriers to entry. Therefore, they should be minimised - to the extent that they can be without impinging on what EA is about - <i>if</i> the goal is to maximise Schelling point value.</p>
<h3 id="mostly-implicit-selectivity-for-the-right-things">(Mostly implicit) selectivity for the right things</h3>
<p>Some selection is usually part of a Schelling point's value. Top universities select for academic merit (though perhaps less so in the US). Silicon Valley selects for openness and interest/talent in tech/business. EA selects for openness, altruistic orientation (especially if consequentialist-leaning), good epistemics, and quantitative thinking.</p>
<p>I think it is counterproductive to view openness and selectivity as two ends of one scale that apply to everything. You want to select on important features and be open otherwise (note that, when creating a Schelling point, most of the selection is usually implicit - what types of people you attract - rather than explicit filtering). The key choice is not "open or selective overall?" but rather "for which X do we want to appeal only to people who have a value of X in some specific range?"</p>
<p>Here's a heuristic for when selectivity for X is useful: when the way X provides value is through its <i>concentration</i> rather than its <i>amount</i>. If you're at a party where you can only talk to a subset of the people during its course, you're going to care a lot about what fraction of people there are interesting - 10 interesting people in a party of 20 is better than 50 in a party of 5000.</p>
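<p>The party arithmetic can be made explicit with a rough sketch. It assumes (my assumption, not something the heuristic strictly requires) that you talk to a fixed number of guests chosen uniformly at random; the function name and the figure of 10 conversations are purely illustrative.</p>

```python
def expected_interesting_contacts(interesting: int, total: int, conversations: int) -> float:
    """Expected number of interesting people met when talking to `conversations`
    guests picked uniformly at random, without repeats, from `total` guests.

    This is the mean of a hypergeometric distribution: n * K / N.
    """
    # You cannot talk to more people than are at the party.
    conversations = min(conversations, total)
    return conversations * interesting / total


# Hypothetical budget of 10 conversations per party (an illustrative assumption).
small_party = expected_interesting_contacts(interesting=10, total=20, conversations=10)
large_party = expected_interesting_contacts(interesting=50, total=5000, conversations=10)

print(f"party of 20:   {small_party:.1f} interesting conversations expected")
print(f"party of 5000: {large_party:.1f} interesting conversations expected")
```

<p>Under these assumptions the smaller party wins by a wide margin even though it contains fewer interesting people in total, because what you can actually reach scales with the fraction, not the count.</p>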
<p>Some cases are ambiguous. For example, if there exists a way for the good and important research to bubble to the top regardless of how much other research exists, it seems like total amount of (infohazard-free) research is the thing to maximise. However, a research area where the average paper is very high quality might help newcomers to the field, or might help lift the prestige of the field, so concentration matters at least somewhat.</p>
<p>To take another example, there was a <a href="https://forum.effectivealtruism.org/posts/dsCTSCbfHWxmAr2ZT/open-ea-global">recent debate</a> over whether EA Global should be open access. Many of the arguments against boil down to thinking the path to impact runs through a uniquely high concentration of EA engagement (or other variables) among the participants; arguments in favour are often either claiming that concentration matters less than sheer amount of interactions, or that the choice of selection variable(s) is wrong, or that CEA fails to select on their chosen selection variable(s) so even if the intention is right the selection variable selected for in practice is wrong.</p>
<h3 id="hubs-and-hub-related-infrastructure">Hubs, and hub-related infrastructure</h3>
<p>Finally, a key point of a Schelling point is that it is a point <i>somewhere</i>. Here, EA is increasingly better. Berkeley, Cambridge, Oxford, London, and Berlin all have large groups, and offices that you can apply to in order to work on EA-relevant things in the company of other EAs.</p>
<p>In Schelling point terms, there's also a risk that it might be better to have one really obvious and strong hub than many weaker ones (I've heard some Bay Area EAs in particular endorsing this view; invariably, their hub of choice is the Bay Area, though there is <a href="https://forum.effectivealtruism.org/posts/bnzwL6tu4pdYf3hpZ/say-nay-to-the-bay-as-the-default">pushback</a>). In practice, it seems that many physical hubs but one virtual/intellectual hub may be best. Both airplanes and people's desires to not uproot their lives are real and relevant things.</p>
<p>The organisers at each EA hub might benefit from applying Schelling point thinking to the context of their local scene. </p>
<h3 id="being-one-thing">Being one thing</h3>
<p>Finally, a Schelling point needs to be one thing, at least in some loose sense. If New York had two Grand Central Stations, the classic Schelling point game would become a lot harder to solve.</p>
<p>One way to increase the One Thingness of the EA Schelling point is to merge it with other things. In Schelling point land, "merging" does not mean making them the same cluster, but rather creating an obvious and visible path from one thing to another. My understanding is that increasing the obviousness of EA in somewhat-adjacent communities (tech, longevity, space, and Emergent Ventures grantees) was a large part of what <a href="https://forum.effectivealtruism.org/posts/szeE3je8MD4sZcevL/announcing-future-forum-apply-now">Future Forum</a> tried to achieve.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-90986294581287342962022-08-20T23:05:00.011+01:002024-05-27T21:57:34.583+01:00Effective Altruism in practice
<p style="text-align: center;"> <i><span style="font-size: x-small;">6.5k words (~17 minutes)<br /></span></i></p><p> </p><p>I've written about <a href="https://www.strataoftheworld.com/2020/07/ea-ideas-1-rigour-and-opportunity-in.html">key ideas in Effective Altruism</a> before. But that was the theory. How did EA actually come to exist, and what does it look like in practice?</p><p> </p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://www.kindpng.com/picc/m/294-2945196_effective-altruism-logo-hd-png-download.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="460" data-original-width="800" height="230" src="https://www.kindpng.com/picc/m/294-2945196_effective-altruism-logo-hd-png-download.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>... turns out it looks like a stylised light bulb with a heart.</i><br /></td></tr></tbody></table><br /> </p>
<h2 id="summary">Summary</h2>
<ul>
<li><p>The ideas underpinning EA came from many sources, including:</p>
<ul>
<li>late-1900s analytic moral philosophers like Peter Singer and Derek Parfit;</li>
<li>futurist/transhumanist thinkers like Nick Bostrom and Eliezer Yudkowsky focusing on risks from future technologies;</li>
<li>a few people working on evaluating charity effectiveness;</li>
<li>efforts starting around 2010 by a few Oxford philosophers including William MacAskill and Toby Ord that, sometimes unwittingly, gave structure and a name to a diverse cluster of ideas about how to maximise your positive impact.</li>
</ul>
</li>
<li><p>Though EA is <a href="https://forum.effectivealtruism.org/posts/FpjQMYQmS3rWewZ83/effective-altruism-is-a-question-not-an-ideology">framed around the question</a> of "what does the most good (according to an analytic and often quantitative framework based on impartial welfare-oriented ethics)?" rather than any particular answer to that question, in practice many (but not all!) EA efforts focus on one of the following, due to many people deciding that it's a particularly pressing and (outside EA) neglected problem:</p>
<ul>
<li>reducing the risk of civilisation-wide catastrophe, especially from emerging technologies like advanced AI and biotechnology;</li>
<li>health and development in poor countries; and</li>
<li>animal welfare.</li>
<li>There is also a lot of work at the meta-level, including on figuring out how people can have <a href="https://80000hours.org/">impactful careers</a>, and trying to direct effort towards the above problems.</li>
</ul>
</li>
<li><p>The funding for most EA-related projects and EA-endorsed charities comes from a combination of:</p>
<ul>
<li><p>many individual small donors, in particular:</p>
<ul>
<li>people who have taken the <a href="https://www.givingwhatwecan.org/">Giving What We Can pledge</a> and therefore donate >10% of their salary to highly effective charities;</li>
<li>people who explicitly pursue <a href="https://80000hours.org/articles/earning-to-give/">"earning-to-give"</a> (getting a high-paying job in order to donate most of the proceeds to charities);</li>
</ul>
</li>
<li><p>several foundations that derive their wealth from billionaires, including most prominently:</p>
<ul>
<li><a href="https://www.openphilanthropy.org/">Open Philanthropy</a>, mostly funded by Dustin Moskovitz who made his wealth from being a Facebook co-founder; and</li>
<li><a href="https://ftxfoundation.org/">FTX Foundation</a>, funded by Sam Bankman-Fried and several other early employees at the crypto exchange FTX.</li>
</ul>
</li>
</ul>
</li>
<li><p>There is no monolithic EA organisation (though the <a href="https://www.centreforeffectivealtruism.org/">Centre for Effective Altruism</a> organises some common things like the EA Global conferences), but rather a large collection of organisations that mainly share:</p>
<ul>
<li>a commitment to maximising their positive impact on the world;</li>
<li>a generally rigorous and quantitative approach to doing so; and</li>
<li>some link to the cluster of people and organisations in Oxford that first named the idea of Effective Altruism.</li>
<li>There are also many charities that have no direct relation to the EA movement, but were identified by charity evaluators like <a href="https://www.givewell.org/">GiveWell</a> as extremely effective, and have thus been extensively funded.</li>
</ul>
</li>
<li><p>EA is very good at attracting talented people, especially ambitious young people at top universities.</p>
</li>
<li><p>EA culture leans intellectual and open, and has a high emphasis on "epistemic rigour", i.e. being very careful about trying to figure out what is true, acknowledging and reasoning about uncertainties, etc.</p>
</li>
<li><p>Some "axes" within EA include:</p>
<ul>
<li>"long-termists" who focus on possible grand futures of humanity and the existential risks that stand between us and those grand futures, and "near-termists" who work on clearer and more established things like global poverty and animal welfare;</li>
<li>a bunch of people and ideas all about frugality and efficient use of money, and another bunch of people and ideas about using the available funding to unblock opportunities for major impact; and</li>
<li>a historical tendency to be very good at attracting philosophy/research-type people who like wrestling with difficult abstract questions, versus a growing need to find entrepreneurial, operations, and policy people to actually do things in the real world. </li>
</ul>
</li>
</ul>
<h2 id="the-philosophers">The philosophers</h2>
<p>In the beginning (i.e. circa the 1970s, when <a href="https://en.wikipedia.org/wiki/Unix_time">time is widely known to have begun</a>), there were a bunch of philosophers doing interesting work. One of them was Peter Singer. Peter Singer proposed questions like this (paraphrasing, not quoting, and updated with recent numbers):</p>
<blockquote><p>Imagine you're wearing a $5000 suit and you walk past a child drowning in a lake. Do you jump into the lake and save the child, even though it ruins your suit?</p>
<p>If you answered yes to the above, then consider this: it is <a href="https://blog.givewell.org/2020/11/19/our-recommendations-for-giving-in-2020/">possible to save a child's life in the developing world for $5000</a>; what justification do you have for spending that money on the suit rather than saving the life?</p>
<p>The only difference between the two scenarios seems to be distance to the dying child (and method of death and etc. but ssshh); is that distance really morally significant?</p>
</blockquote>
<p>(He is also known for arguing in favour of animal rights and abortion rights.)</p>
<p>Derek Parfit is another. He is particularly famous for the book <i>Reasons and Persons</i>, in which he asks questions (paraphrasing again) like this:</p>
<blockquote><p>Is a moral harm done if you cause fewer people to exist in the future than otherwise might have? How should we reason about our responsibilities to future generations and non-existing people more generally?</p>
<p>Does there exist a number of people living mediocre (but still positive) lives such that this world is better than some smaller number of people living very good lives?</p>
</blockquote>
<p>(He also talks about problems in the philosophy of personal identity, and the contradictions in moral philosophies based on self-interest.)</p>
<h2 id="the-transhumanists">The transhumanists</h2>
<p>Then, largely separately and around the 1990s, there came the transhumanists ("transhumanism" is a wide-reaching umbrella term for humanist thinking about radical future technological change). Perhaps the most notable are Nick Bostrom and Eliezer Yudkowsky.</p>
<p>Nick Bostrom thought long and hard about many wacky-seeming things with potentially cosmic consequences. He popularised the simulation hypothesis (the idea that we might all be living in a computer simulation). He <a href="https://nickbostrom.com/fable/dragon">argues against death</a> (something I <a href="https://www.strataoftheworld.com/2021/10/death-is-bad.html">strongly agree with</a>). He did lots of work on anthropic reasoning, which is about the question of how we should update information we get about the state of the world when taking into account that we wouldn't exist unless the state of the world allowed it. This leads to <a href="https://en.wikipedia.org/wiki/Sleeping_Beauty_problem">some thought experiments</a> that I'd classify as infohazards because of their tendency to spark an unending discussion whenever they're described. Conveniently, he also coined the term "<a href="https://en.wikipedia.org/wiki/Information_hazard">infohazard</a>".</p>
<p>Most crucially for EA, though, Bostrom has worked on understanding existential risks, which are events that might destroy humanity or permanently and drastically reduce the capacity of humanity to achieve good outcomes in the future. In particular, he has worked on risks from advanced AI, which he boosted to popularity with the 2014 book <i>Superintelligence</i>.</p>
<p>Bostrom's style of argument is like a dry protein bar, leaning toward straightforward extrapolation of conclusions from premises, especially if the conclusions seem crazy but the premises seem self-evident. Sometimes, though, he does apply some literary flair to make <a href="https://nickbostrom.com/utopia">an important point</a>, and also <a href="https://nickbostrom.com/poetry/poetry">occasionally writes poetry</a>.</p>
<p>Eliezer Yudkowsky wanted to create a smarter-than-human AI as fast as possible, until he realised this might be a Bad Idea and said "<a href="https://www.lesswrong.com/posts/SwCwG9wZcAzQtckwx/that-tiny-note-of-discord">oops</a>" and switched to the problem of making sure any powerful AIs we create don't destroy human civilisation. He founded the Machine Intelligence Research Institute (MIRI) to find out the answer.</p>
<p>Yudkowsky also wrote a <a href="https://www.lesswrong.com/rationality">massive series of blog posts</a> to try to teach people about how to reason well (for example, he covers a lot of ground from the cognitive biases literature), and then went on to try to convey the same lessons in what became <a href="http://www.hpmor.com/">the most popular work of Harry Potter fanfiction of all time</a>. His writing and argument style tends toward flowing narratives that are usually both very readable and verbose (though quite hit-or-miss in whether you like it).</p>
<p>He has Opinions (note the capital). He is extremely pessimistic about the chances of solving the AI alignment problem.</p>
<p>Yudkowsky is affiliated much more strongly with the loose "Rationalist community" than with EA. This is a collection of online blogs that was sparked by Yudkowsky's writing, and later in particular also that of <a href="https://slatestarcodex.com/">Scott Alexander</a>, who has become internet-famous for his own reasons too. The central forum is <a href="https://www.lesswrong.com/">LessWrong</a>. Both EA and Rationalism involve lots of discussion about far-ranging abstract ideas that (for a certain type of person) are hard to resist; one blogger says "[t]he experience of reading LessWrong for the first time was brain crack" and <a href="https://chanamessinger.com/blog/ea-as-nerdsniping">goes on to propose</a> that EA ideas are best-spread by <a href="https://xkcd.com/356/">nerd-sniping</a> (i.e. telling people about ideas they find so interesting that they literally can't help but think about them). Both EA and the Rationalists put an incredible amount of effort and weight on trying to reason well, avoid biases and fallacies, and be careful (and often quantitative) about uncertainties. However, EA focuses more on applying those things to do good in the real world to real people, while the Rationalist vibe is sometimes one of indulging in theorising and practising good thinking for their own sake. (This is not necessarily a criticism - I had fun discussing Lisp syntax in the comments section of <a href="https://www.lesswrong.com/posts/GAqCiWJBttazYGsJR/review-structure-and-interpretation-of-computer-programs">the LessWrong version of my review of <i>Structure and Interpretation of Computer Programs</i></a>, even though arguing about parentheses isn't exactly going to save the world (or is it ... ?)). EA tends to also have a more explicit orientation towards seeking influence.</p>
<p>(I should also note that on the specific topic of AI risk, the Rationalist community is extremely impact-oriented, likely due to founder effects - or perhaps because AI risk is the EA cause area that is most full of juicy technical puzzles and philosophical confusions.)</p>
<h2 id="more-philosophers--ea-gets-a-name">More philosophers & EA gets a name</h2>
<p>Brian Christian's <i>The Alignment Problem</i> mentions in chapter 9 some funny details about the sequence of events that led to the first few EA-by-name organisations. In 2009, then-Oxford-philosophy-student Will MacAskill had an argument about vegetarianism while in a broom closet. Unlike most arguments about vegetarianism, and echoing the vibe of much future EA thinking, this one was on the meta-level; the debate was not whether factory farming is bad, but how we should deal with the moral uncertainty around whether or not factory farming is ethical. MacAskill eventually started talking with Toby Ord (though in a graveyard rather than a broom closet), another philosophy student interested in <a href="https://www.strataoftheworld.com/2020/07/ea-ideas-3-uncertainty.html">questions around moral uncertainty</a>.</p>
<p>Together with one other person, the two of them <a href="https://www.moraluncertainty.com/">wrote a book</a> on moral uncertainty. MacAskill and a philosophy-and-physics student called Benjamin Todd founded an organisation called <a href="https://80000hours.org/">80 000 Hours</a> to try to figure out how people can choose careers to have the greatest positive impact on the world. Toby Ord founded an organisation called <a href="https://www.givingwhatwecan.org/">Giving What We Can</a> (GWWC) that encourages people to donate 10% of their salary to exceptionally effective charities. GWWC estimates its roughly 8000 members have donated $277mn, and are likely to donate almost $3bn over their lifetimes.</p>
<p>As an umbrella organisation for both of these, they created the <a href="https://www.centreforeffectivealtruism.org/">Centre for Effective Altruism</a>. Originally the "Effective Altruism" part was intended purely as a descriptive part of the organisation's name, but over time it came to stand more broadly for the general space of effectively altruistic things that at some point interacted with ideas from the original Oxford cluster.</p>
<p>Later, MacAskill wrote a book called <i>Doing Good Better</i> summarising ideas about why charity effectiveness is important and counterintuitive. Ord in turn wrote <a href="https://theprecipice.com/"><i>The Precipice</i></a> that summarises ideas about how mitigating existential risks to human civilisation is likely a key moral priority; after all, it would be bad if we all died.</p>
<h2 id="charity-evaluators-and-billionaires">Charity evaluators and billionaires</h2>
<p>Independently from (and before) anything happening in Oxford broom closets, hedge fund managers Holden Karnofsky and Elie Hassenfeld started thinking seriously in 2006 about which charities to donate to. Upon discovering that this is a surprisingly hard problem, they started <a href="https://www.givewell.org/">GiveWell</a>, an organisation focused on finding exceptionally effective charities. They ended up concentrating on global health (their list includes malaria prevention, vitamin supplementation, and cash transfers, all in developing countries).</p>
<p>After a few years of GiveWell existing, they were put in touch with Dustin Moskovitz and Cari Tuna. At the time, Facebook co-founder Dustin Moskovitz was the world's youngest self-made billionaire, and with his partner Cari Tuna had started a philanthropic organisation called Good Ventures in 2011.</p>
<p>What followed was a cinematic failure of prioritisation, as recounted by Holden Karnofsky himself in <a href="https://80000hours.org/podcast/episodes/holden-karnofsky-most-important-century/#holdens-background-000947">this interview</a>. The GiveWell founders decided that "[meeting the billionaires] just doesn't seem very high priority", and thought that "[n]ext time someone's in California we should definitely take this meeting, but [...] this isn't the kind of thing we would rush for [...]". However, Karnofsky realised this meeting was an excellent excuse to go on a date with a Californian he fancied (and later married), and as a result ended up making the trip sooner rather than later.</p>
<p>Moskovitz and Tuna turned out to have very simplistic preferences for charitable giving: they just wanted to do the most good possible. This was an excellent fit with GiveWell's philosophy, and soon Good Ventures partnered with GiveWell in what would later become Open Philanthropy (of which Karnofsky would become co-CEO). <a href="https://www.openphilanthropy.org/">Open Philanthropy</a> is a key funder of EA projects, though they fund unrelated things as well (always through a very EA lens of trying to rigorously and quantitatively maximise impact). They list all their grants <a href="https://www.openphilanthropy.org/grants/">here</a>.</p>
<p>While studying physics at MIT, Sam Bankman-Fried (or "SBF"), already deeply interested in consequentialist moral philosophy, attended a talk by Will MacAskill on EA ideas. After stints at trading companies and the Centre for Effective Altruism, he founded the crypto-focused trading companies Alameda Research and then FTX, and ended up becoming the richest under-30 person in the world. (Though then the value of FTX fell in the crypto crash, and he recently turned 30 to boot.)</p>
<p><i>EDIT: In November 2022, both FTX and Alameda Research collapsed in a matter of days, and it became clear that FTX had committed major and flagrant financial fraud by transferring customer funds to Alameda, which Alameda then speculated with, and seems to have lost to the tune of billions of dollars. SBF is facing criminal charges. FTX and SBF have been condemned in harsh terms by those running many EA orgs and in countless EA Forum posts. Obviously, FTX and SBF have now very clearly become examples of what NOT to do. All of the following seem true: (a) our prior should be that people committing illegal and immoral actions that lead to extreme wealth and prestige for themselves are most likely acting mostly for the standard boring selfishly-evil reasons, (b) SBF probably had an easier time justifying his crimes because of the story that he could tell himself about doing good for the world, (c) publicly associating himself with EA, and receiving positive attention from EA organisations, helped make SBF appear moral and trustworthy, (d) there existed signals (in particular reports from Alameda's early days about cut-throat behaviour from SBF) that provided evidence of SBF's character before the FTX collapse, and (e) it is generally harder than it seems in hindsight to be right about whether a business is fraudulent (consider that countless venture capitalists poured billions into FTX, and presumably had incentive to figure out if the entire thing was a scam). More information will come to light with time, and there are definitely lessons to be learned. Apart from this paragraph, I have not changed any part of this post.</i></p>
<p>SBF often emphasises that you're more likely to achieve outlier success in business if your goal is to donate the money effectively. There's little personal gain in going from $100M to $10B, so a selfish businessperson is likely to optimise something like "probability I earn more than [amount that lets me do whatever the hell I want for the rest of my life]", while a (mathematically-literate) altruistic one is far more compelled to simply shoot for the highest <a href="https://www.strataoftheworld.com/2020/07/ea-ideas-2-expected-value-and-risk.html">expected-value</a> outcomes, even if they're risky. (The exception is the selfish businessperson who really likes competing in the billionaire rankings.)</p>
<p>SBF has also argued for - and is living proof of - the idea that if your strategy to do good is to earn money to donate, you should probably aim for the risky but high-value bets (e.g. starting a company and becoming a billionaire), rather than going into some high-paying finance job earning a crazy-high but non-astronomical salary. Many people persuaded by EA ideas have done the latter, but SBF contributed more than all of them combined. The maths probably still works out even after accounting for the fact that SBF's route was far less likely to work than a finance job (he thought FTX had an 80% chance of failure). <a href="https://forum.effectivealtruism.org/posts/m35ZkrW8QFrKfAueT/an-update-in-favor-of-trying-to-make-tens-of-billions-of">This post</a> argues so. <a href="https://www.wave.com/en/about/">Wave</a>, a fintech-for-Africa company with strong EA representation in its founding team and a $1.7B valuation in 2021, is another example of EA business success.</p>
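<p>The expected-value argument in the last two paragraphs can be made concrete with a toy calculation. This is only a rough sketch, not something SBF or the linked post computed: the 80% failure chance is the figure quoted above, while every dollar amount, and the choice of a logarithmic utility function to stand in for the "selfish" businessperson, is an assumption made purely for illustration.</p>

```python
import math

# Hypothetical wealth figures, invented for illustration only.
SAFE_JOB_WEALTH = 5e6          # lifetime wealth from a high-paying finance job
STARTUP_SUCCESS_WEALTH = 1e10  # wealth if the risky company succeeds
STARTUP_FAILURE_WEALTH = 5e5   # wealth if the risky company fails
P_FAILURE = 0.8                # failure chance quoted in the text


def expected(value_fn, outcomes):
    """Expected value of value_fn over (probability, wealth) pairs."""
    return sum(p * value_fn(w) for p, w in outcomes)


def linear(wealth):
    """Altruist: every donated dollar counts about equally."""
    return wealth


safe = [(1.0, SAFE_JOB_WEALTH)]
risky = [
    (P_FAILURE, STARTUP_FAILURE_WEALTH),
    (1 - P_FAILURE, STARTUP_SUCCESS_WEALTH),
]

# Positive gain means the risky bet beats the safe job.
altruist_gain = expected(linear, risky) - expected(linear, safe)

# Selfish case: personal benefit flattens out, modelled as log(wealth).
selfish_gain = expected(math.log, risky) - expected(math.log, safe)

print("altruist gain from risky bet (dollars):", altruist_gain)
print("selfish gain from risky bet (log-dollars):", selfish_gain)
```

<p>With these made-up numbers the altruist's gain is positive and the selfish gain is negative, so the same bet looks attractive under linear value and unattractive under diminishing returns. Different assumed amounts could flip either sign; the point is only that the two value functions can disagree about an identical gamble.</p>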
<p>SBF and other senior FTX people (many of whom care deeply about EA ideas) launched the FTX Foundation, which in particular contains the <a href="https://ftxfuturefund.org/area-of-interest/">Future Fund</a> that has quickly become a key funder of the more future-oriented and speculative parts of EA.</p>
<p>These days, being associated with tech billionaires isn't a ringing endorsement. However, consider a few things. First, the tech billionaires aren't the ones who came up with the ideas or set the agendas. Sports car enthusiast and sci-fi nerd Elon Musk decided that sexy cars and rockets are the most important projects in the world and directed his wealth accordingly; Moskovitz, SBF, & co. were persuaded by abstract arguments and donate their wealth to foundations where the selection of projects is done by people more knowledgeable in that than they are. Second, it seems unusually likely that the major EA donors really are sincere and committed to trying to do the most good; after all, if they wanted to maximise their popularity or acclaim, there are better ways of doing that than funding a loose cluster of people often trying to work specifically on the least-popular charitable causes (since those are most likely to contain low-hanging fruit). Finally, if some tech billionaires endorsing EA is evidence <i>against</i> EA being a good thing, then no tech billionaires endorsing EA <a href="https://www.lesswrong.com/rationality/conservation-of-expected-evidence">must be</a> evidence <i>in favour</i> of EA being a good thing. However cynical you are about tech billionaires, they're still smart people, so a few of them going "huh, this is the type of thing I want to spend all my wealth on" should be more promising than all of them going "nope I don't buy this".</p>
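<p>The conservation-of-expected-evidence step can be checked numerically. Here is a minimal sketch with made-up probabilities (the prior and both likelihoods are assumptions chosen only to illustrate the identity): if seeing billionaire endorsement would lower your credence that EA is good, then seeing no endorsement has to raise it, because the prior is the probability-weighted average of the two possible posteriors.</p>

```python
# Hypothetical probabilities, chosen only to illustrate the identity.
prior_good = 0.5         # P(EA is good)
p_endorse_if_good = 0.3  # P(billionaires endorse | EA is good)
p_endorse_if_bad = 0.6   # P(billionaires endorse | EA is not good)

# Total probability of observing an endorsement.
p_endorse = (
    prior_good * p_endorse_if_good
    + (1 - prior_good) * p_endorse_if_bad
)

# Bayes' rule for each of the two possible observations.
post_if_endorse = prior_good * p_endorse_if_good / p_endorse
post_if_no_endorse = prior_good * (1 - p_endorse_if_good) / (1 - p_endorse)

# Weighted by how likely each observation is, the posteriors average
# back to the prior.
expected_posterior = (
    p_endorse * post_if_endorse
    + (1 - p_endorse) * post_if_no_endorse
)

print("posterior if endorsed:", post_if_endorse)
print("posterior if not endorsed:", post_if_no_endorse)
print("expected posterior:", expected_posterior)
```

<p>In this toy setup endorsement was assumed to be bad news, so the posterior after endorsement falls below the prior and the posterior after no endorsement rises above it. You cannot have both observations count against the same hypothesis.</p>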
<p>(If EA has some top tech business people, why doesn't it have some top political people too, or even funders from outside tech? My guess is a combination of factors. Politicians skew old while EAs skew young (partly because EA itself is young). Both EAs and tech people tend to be technically/mathematically/intellectually-inclined (though many areas within EA are specifically about social science or the humanities). Both EAs and tech people tend to care less than average about social norms or prestige, while politicians tend to be selected out of the set of people who are willing to optimise very hard for prestige and popularity. Also, expect some policy-related efforts from EA; many EAs work or aim to work in non-political policy roles, and there have even been some political efforts, though <a href="https://forum.effectivealtruism.org/posts/sKwEB7EEMaCp9tfaw/carrick-flynn-results-and-additional-ideas-for-passing">there is much to learn in that field</a>.)</p>
<h2 id="organisations">Organisations</h2>
<p>In addition to the previously-mentioned CEA, 80 000 Hours, Giving What We Can, GiveWell, Open Philanthropy, and FTX Foundation, organisations with a strong EA influence include (but are not limited to):</p>
<ul>
<li><p>A large number of think-tanks and research institutes, especially ones where people think about the end of the world all day, including</p>
<ul>
<li><a href="https://www.fhi.ox.ac.uk/">Future of Humanity Institute</a> (FHI) at Oxford, which researches big-picture questions about the future of humanity and is run by Nick Bostrom.</li>
<li><a href="https://futureoflife.org/">Future of Life Institute</a> (FLI) in Cambridge (Massachusetts), focusing on global catastrophic risks and existential risks. It was founded by a team including Skype co-founder Jaan Tallinn and physicist Max Tegmark. Wikipedia says they are "[n]ot to be confused with Future of Humanity Institute" but to be honest this is a pretty big ask given the name.</li>
<li><a href="https://www.cser.ac.uk/">Centre for the Study of Existential Risk</a> (CSER) at Cambridge, also co-founded by Jaan Tallinn.</li>
<li><a href="https://longtermrisk.org/">Centre on Long-Term Risk</a> (CLR).</li>
<li><a href="https://www.longtermresilience.org/">Centre on Long-Term Resilience</a> (CLTR) (no, this is not confusing at all, it's all in your head).</li>
</ul>
</li>
<li><p>A large number of animal welfare charities, which I won't bother listing, except to point out the meta-level <a href="https://animalcharityevaluators.org/">Animal Charity Evaluators</a>.</p>
</li>
<li><p>A large number of global health charities, ranging from ones that are simply highly recommended (and funded) by GiveWell (in particular <a href="https://www.againstmalaria.com/">Against Malaria Foundation</a>, which routinely tops <a href="https://www.givewell.org/charities/top-charities">GiveWell rankings</a>) to ones that also trace their roots solidly to EA.</p>
</li>
<li><p>Organisations working on AI risk, including:</p>
<ul>
<li><a href="https://www.anthropic.com/">Anthropic</a>, working on interpreting machine learning models (a program led by Chris Olah) and more general empirically-grounded, engineering-based machine learning safety research.</li>
<li><a href="https://www.redwoodresearch.org/">Redwood Research</a>, a smaller company also doing empirical machine learning safety work (and running <a href="https://forum.effectivealtruism.org/posts/vvocfhQ7bcBR4FLBx/apply-to-the-second-ml-for-alignment-bootcamp-mlab-2-in">great ML bootcamps</a> on the side).</li>
<li><a href="https://humancompatible.ai/">Centre for Human-compatible AI</a> (CHAI), a research institute at UC Berkeley.</li>
<li><a href="https://intelligence.org/">Machine Intelligence Research Institute</a> (MIRI), the original AI safety organisation that was founded in 2000 and hence managed to snap up the enviable domain name "<a href="https://intelligence.org/">intelligence.org</a>". MIRI's research leans much more mathematical and theory-based than that of most other AI alignment organisations.</li>
<li><a href="https://www.conjecture.dev/">Conjecture</a>, a new organisation focusing on the work that is most relevant if advanced AI is surprisingly close.</li>
<li>(OpenAI and DeepMind, the two leading AI companies, both have safety teams that include people very committed to working on existential risk concerns. However, neither is primarily an AI safety company, and both weight advanced AI risks at a company-level less than the other companies on this list. OpenAI in particular currently sees AI risks more through the near-term lens of making sure AI systems and their benefits are widely accessible to everyone, rather than focusing on making sure AI systems don't doom us all (though I guess that too would be a suitably equitable outcome?).)</li>
</ul>
</li>
<li><p><a href="https://www.alveavax.com/">Alvea</a>, a recent vaccine startup, with the eventual goal of enabling faster vaccine roll-out in the next pandemic. </p>
</li>
<li><p><a href="https://www.charityentrepreneurship.com/">Charity Entrepreneurship</a>, a charity incubator that has incubated <a href="https://www.charityentrepreneurship.com/our-charities">many charities</a>, including for example Healthier Hens (farmed chicken welfare), the Happier Lives Institute (helping policymakers figure out how to increase people's happiness), and Lead Exposure Elimination Project (working to reduce lead exposure in developing countries).</p>
</li>
<li><p><a href="https://www.sparkwave.tech/">SparkWave</a>, an incubator for software companies that are solving important problems. </p>
</li>
<li><p><a href="https://effectivethesis.org/">Effective Thesis</a>, trying to save students from writing pointless theses.</p>
</li>
<li><p><a href="https://founderspledge.com/">Founders Pledge</a>, which helps entrepreneurs commit to giving away money when they sell their companies and donate that money effectively (not to be confused with the more famous <a href="https://en.wikipedia.org/wiki/The_Giving_Pledge">Giving Pledge</a>). (So far, about $475M has been donated in this way.)</p>
</li>
<li><p><a href="https://www.legalpriorities.org/">Legal Priorities Project</a>, which looks at the legal aspects of trying to do everything else.</p>
</li>
<li><p><a href="https://allfed.info/">ALLFED</a> (ALLiance to Feed Earth in Disasters), which aims to be useful in situations where hundreds of millions of people or more are suddenly without food, and which has successfully found the best conceivable name for an organisation that does this.</p>
</li>
<li><p><a href="https://ourworldindata.org/">Our World in Data</a> (OWID), the world's best provider of data and graphs on important global issues. I'm not quite sure how interrelated they are with EA directly, but their founder <a href="https://forum.effectivealtruism.org/posts/uaveEAgFfyFx4EYaH/a-new-our-world-in-data-article-on-longtermism">posts on the EA Forum about OWID articles on very EA-related ideas</a>, so there's definitely some overlap.</p>
</li>
<li><p><a href="https://www.appgfuturegenerations.com/">All-Party Parliamentary Group for Future Generations</a> in the UK government.</p>
</li>
<li><p>A bunch of organisations focused on getting people interested in the world's biggest problems and teaching them various skills:</p>
<ul>
<li><a href="https://www.atlasfellowship.org/">Atlas Fellowships</a>, a recent initiative for high-schoolers.</li>
<li>A collection of Existential Risk Initiatives running, among other things, summer internships where people (mostly undergraduate/postgraduate students) work with mentors on existential risk research: <a href="https://cisac.fsi.stanford.edu/stanford-existential-risks-initiative/content/stanford-existential-risks-initiative">SERI</a> (Stanford), <a href="https://effectivealtruism.ch/swiss-existential-risk-initiative">CHERI</a> (Switzerland), <a href="https://www.camxrisk.org/">CERI</a> (Cambridge), and a newer one at the University of Chicago which I can't yet find a website for, but which will almost certainly not help with the naming situation when it arrives. Thankfully, rumours say there will soon be a YETI (Yale Existential Threats Initiative), which is a cool and (thank god!) unconfusable name.</li>
</ul>
</li>
</ul>
<p>Since EA is not a monolithic centralised thing, there is plenty of fuzziness in what counts as an EA organisation, and definitely no official list (and therefore if you're reading this and your org is not on the list, you shouldn't complain - many great orgs were left out). The common features among many of them are:</p>
<ul>
<li>Some causal link to stuff that at some point interacted with the original Oxford cluster.</li>
<li>Emphasis on taking altruistic actions with a focus on effectiveness.</li>
<li>Emphasis on quantifying the impact of altruistic actions.</li>
<li>Emphasis on a scope that is in some way particularly wide-ranging or unconventional, either in sheer size or time (existential risks, the long-run future), geography (focusing on the entire world and often particularly developing countries rather than the organisation's neighbourhood), or in what is cared about (farmed animal welfare, <i>wild</i> animal welfare, the lives of people in the far future, and whatever the hell <a href="https://thequaliaresearchinstitute.org/">these people</a> are doing).</li>
</ul>
<p>The biggest EA events are the Effective Altruism Global (EAG) conferences organised by CEA. These usually happen several times a year, mostly in the UK and the Bay Area, though locally-organised <a href="https://www.eaglobal.org/eagxhome/">EAGx conferences</a> have more diverse locations.</p>
<h2 id="the-situation">The Situation</h2>
<p>EA has a strong presence especially at top universities. There are large and active EA student groups in the Bay Area, Cambridge, Oxford, and London, but also increasingly New York, Boston, and Berlin, and many smaller local groups (you can find them listed <a href="https://forum.effectivealtruism.org/community">here</a>). The profile of EA in the general public is very small. However, the concentration of talent is extremely high. Add to this the existence of funding bodies with tens of billions of dollars of assets that are firmly aligned with EA principles, and you can expect a lot of important, impactful work to come from people and organisations with some connection to EA in the coming years.</p>
<p>It's important to keep in mind that EA is not a centralised thing. There is no EA tsar, or any single EA organisation that runs the show, or any official EA consensus. It's a cluster of many people and efforts that are joined mainly by caring about the types of ideas I talk about <a href="https://www.strataoftheworld.com/2020/07/ea-ideas-1-rigour-and-opportunity-in.html">here</a>.</p>
<h3 id="demographics">Demographics</h3>
<p><a href="https://effectivealtruismdata.com/#demographics">This website</a> has a good overview, based on whoever filled in a survey posted to the <a href="https://forum.effectivealtruism.org/">EA Forum</a>. The gender ratio is unfortunately somewhat skewed (70% male); for comparison, this is <a href="https://www.amacad.org/humanities-indicators/higher-education/gender-distribution-degrees-philosophy">roughly the same</a> as for philosophy degrees and better than for software developers (<a href="https://www.statista.com/statistics/1126823/worldwide-developer-gender/">90% male</a> (!?)). Half are 25-34. Over 70% are politically left or centre-left, and few are centre-right (2.5%) or right (1%), though almost 10% are libertarians. Education levels are high, and the five most common degrees are, in order: CS, maths, economics, social science, and philosophy. Most are from western countries.</p>
<h3 id="culture">Culture</h3>
<p>EA culture places a lot of weight on epistemics: being honest about your uncertainties, clear about what would make you change your mind on an issue, aware of biases and fallacies, trying to avoid group-think, focusing on the substance of the issue rather than who said it or why, and arguing with the goal of finding the truth rather than defending your pet argument or cause. This is a lofty set of goals. To an astonishing but imperfect extent, and more so than any other concentration of people or writing (except for the equally-good Rationalist community mentioned above) that I've ever had any exposure to, EA succeeds at this.</p>
<p>Related to this, but also turbo-charged by general cultural memes of "critiquing cherished ideas is important", there's a strong emphasis on constantly being on the lookout for ways in which you yourself or (in particular) common EA ideas might be wrong. If you read down the list of <a href="https://forum.effectivealtruism.org/allPosts?sortedBy=top&timeframe=allTime&filter=all">top-voted posts</a> on the EA Forum, they are about:</p>
<ol>
<li><a href="https://forum.effectivealtruism.org/posts/cfdnJ3sDbCSkShiSZ/ea-and-the-current-funding-situation">Potential failure modes resulting from the influx of money into EA.</a></li>
<li><a href="https://forum.effectivealtruism.org/posts/HWaH8tNdsgEwNZu8B/free-spending-ea-might-be-a-big-problem-for-optics-and">High EA spending being a problem for optics and epistemics.</a></li>
<li><a href="https://forum.effectivealtruism.org/posts/xomFCNXwNBeXtLq53/bad-omens-in-current-community-building">Things current EA community-building efforts are doing wrong, and why this is especially worrying.</a></li>
<li><a href="https://forum.effectivealtruism.org/posts/KDjEogAqWNTdddF9g/long-termism-vs-existential-risk">Reasons why some key concepts in EA are used misleadingly and unnecessarily.</a></li>
<li><a href="https://forum.effectivealtruism.org/posts/n3WwTz4dbktYwNQ2j/critiques-of-ea-that-i-want-to-read">A list of critiques of EA that someone wants expanded.</a></li>
<li><a href="https://forum.effectivealtruism.org/posts/QFa92ZKtGp7sckRTR/my-mistakes-on-the-path-to-impact">A catalogue of personal mistakes that someone made while trying to do good</a> (the key one being that they focused too much on working only at EA organisations).</li>
<li><a href="https://forum.effectivealtruism.org/posts/bsE5t6qhGC65fEpzN/growth-and-the-case-against-randomista-development">An argument that standard EA ways of trying to help with developing country development are not as effective as other ways of helping.</a></li>
<li>And only in 8th place, something that isn't a critique of EA: <a href="https://forum.effectivealtruism.org/posts/cXBznkfoPJAjacFoT/are-you-really-in-a-race-the-cautionary-tales-of-szilard-and">a post about the historical case of early nuclear weapons researchers mistakenly assuming they were in a race, and implications for today's AI researchers</a>.</li>
</ol>
<p>(If you adjust upvotes on EA Forum posts to account for how active the forum was at the time, the most popular post of all time is <a href="https://forum.effectivealtruism.org/posts/FpjQMYQmS3rWewZ83/effective-altruism-is-a-question-not-an-ideology">Effective Altruism is a Question (not an ideology)</a>. It's not a critique, but it's also very revealing.)</p>
<p>Right now, there's <a href="https://forum.effectivealtruism.org/posts/8hvmvrgcxJJ2pYR4X/announcing-a-contest-ea-criticism-and-red-teaming">an active contest with $100k in prizes for the best critiques of EA</a>. This sort of stuff happens enough that Scott Alexander satirises it <a href="https://astralcodexten.substack.com/p/criticism-of-criticism-of-criticism">here</a>.</p>
<p>This might give the impression of EA as excessively-introspective and self-doubting. There is some truth to the introspectiveness part. However, the general EA attitude is also one of making bold (but reasoned) bets. Recall SBF's altruistically-motivated risk taking, or more generally the fact that <a href="https://www.openphilanthropy.org/research/hits-based-giving/">one of Open Philanthropy's foundational ideas</a> is to support reasonable-but-risky projects, or even more generally the way EA in general is set up around unconventional and ambitious attempts at doing good.</p>
<p>If I had to name the two most important obstacles to doing important things in the real world, they would be (1) reasoning poorly and not updating enough based on feedback/evidence, and (2) being too risk-averse and insufficiently ambitious. Some cultures, like the good parts of academia, do well on avoiding (1). Others - imagine for example gung-ho Silicon Valley tech entrepreneurs - do well on avoiding (2). Though EA culture varies a lot between places and organisations, on the whole it seems uniquely good at combining these two aspects.</p>
<p>There are differences in culture between different EA hubs/clusters. I mainly have experience of the UK (and especially Cambridge) cluster and the Bay Area one. In the Bay, there is significant overlap between the EA and Rationalist communities, whereas in the UK there's mainly just EA in my experience. The Bay also leans more AI-focused and maybe weirder on average (or perhaps it's just a European vs American culture thing), while in the UK there are many AI-focused people but also many focused on biological fields (biosecurity & alternative proteins) or policy.</p>
<h2 id="axes--trends">Axes & trends</h2>
<h3 id="long-termism-vs-near-termism">"Long-termism" vs "near-termism"</h3>
<p>In the history of EA, it's hard not to see an invasion of ideas from the planetary-scale futurism that people like Nick Bostrom and Eliezer Yudkowsky talked about, and Toby Ord (author of <i>The Precipice</i>) and Will MacAskill (about to drop <a href="https://www.whatweowethefuture.com/">a new book</a> on why we should prioritise the long-term future) increasingly focus on. Holden Karnofsky, who for a long time ran GiveWell, perhaps the most empirically-minded and global health -focused EA organisation, is now co-CEO of Open Philanthropy, responsible specifically for the speculative futurist parts of Open Philanthropy's mission, and <a href="https://www.cold-takes.com/the-most-important-century-in-a-nutshell/">writes blog posts about the grand future of humanity and why the coming century may be especially critical</a> (though he is careful to say that he does not think the other half of Open Philanthropy's work, or global health / animal welfare -focused charity more generally, is unimportant).</p>
<p>Perhaps this makes sense. In the long run at least, it seems sensible to expect the largest-scale ideas to be the most important ones. The rate of technological progress, especially in AI, has also been shrinking just what "the long run" means when expressed in years.</p>
<p>The common labels applied to the ends of the radical-future-technology-focused versus concrete-current-problem-focused axis are "long-termist" and "near-termist" respectively. The name "long-termist" comes from arguments that the key moral priority is making sure we get to a secure, sustainable, and flourishing future civilisation (since such a civilisation could be very large and long-lasting, and therefore enable an enormous amount of happiness and flourishing). However, the names are a bit misleading. All existential risk work is often lumped into the long-termist category, so we have "long-termist" AI safety people trying to prevent a catastrophe many of them think will probably happen in the next three decades if it happens at all, and "near-termist" global health and development people trying to help the development of countries over a century.</p>
<p>(Many also <a href="https://forum.effectivealtruism.org/posts/rFpfW2ndHSX7ERWLH/simplify-ea-pitches-to-holy-shit-x-risk">point out</a> that caring about existential risks does not require the long-termist philosophy.)</p>
<h3 id="frugality-vs-spending">Frugality vs spending</h3>
<p>The culture of the original Oxford cluster was very frugal, and focused on monetary donations. For example, after founding Giving What We Can (GWWC), Toby Ord <a href="https://www.bbc.co.uk/news/magazine-11950843">donated everything he earned above £18 000 to charity</a> (and has <a href="https://www.vox.com/future-perfect/21728925/charity-10-percent-tithe-giving-what-we-can-toby-ord">continued on a similar track</a> since then). Because of the low available funding, the focus was very much on marginal impact - trying to figure out what existing opportunity could best use one extra dollar.</p>
<p>Since then, the arrival of billionaires has meant that funding worries have gone down.</p>
<p>(For example, "earning to give" has gone down a lot in <a href="https://80000hours.org/career-reviews/#our-priority-paths">80 000 Hours' career rankings</a>. This is the idea that deliberately going into a high-earning job (often in finance) and then donating a significant fraction of your salary to top charities is one of the most effective ways to do good, and a path that many pursued based on the recommendation by 80 000 Hours.)</p>
<p>The bottleneck has moved (or at least been widely perceived to move) from funding to the time of people working on the key problems; instead of focusing on where to allocate the marginal dollar, the focus has somewhat shifted to how to allocate the marginal minute of time. In particular, the core argument of "imagine how far this particular dollar could go if used to effectively improve health in developing countries" has been joined by the argument of "there are plausible civilisation-ending disasters that could happen in the coming decades and require hard work to solve; imagine how sad it would be if we failed to work fast enough because we didn't spend that one dollar".</p>
<p>As a concrete example, Redwood Research organised <a href="https://www.alignmentforum.org/posts/YgpDYjTx7DCEgziG5/apply-to-the-ml-for-alignment-bootcamp-mlab-in-berkeley-jan">a machine learning bootcamp aimed at upskilling people for AI safety jobs</a> in January 2022 (and will be running more in the future, something I strongly endorse). Thirty participants (including myself) were flown into Berkeley from around the world, and spent three weeks living in a hotel while taking daily high-reliability COVID tests that I'm pretty sure weren't entirely free (and of course spending the days programming hard and talking about AI alignment (and eating free snack bars at the office - or maybe that last part was just me)). This wasn't cheap, nor was it a typical way to spend charity money (Redwood is <a href="https://www.openphilanthropy.org/grants/redwood-research-general-support/">funded</a> by Open Philanthropy). But if <a href="https://www.metaculus.com/questions/3479/date-weakly-general-ai-system-is-devised/">prediction markets are right that generally-capable AI starts emerging around the end of this decade</a>, and you take one look at the current state of progress on the AI alignment problem, and you do happen to have access to funding - well, it would be sad if being too stingy is how our civilisation failed.</p>
<p>Concretely, to look at only one consequence, Redwood made several hires from the bootcamp, despite the fact that many of the participants (myself included) were still students or otherwise not looking for work. Given how difficult but important hiring is, especially for high-skill technical roles, and the serious possibility that organisations like Redwood making progress is important for solving AI safety problems that might play a big role in how the future of humanity shapes up, this seems like a win.</p>
<p>However, at the same time, it is of course worth keeping in mind that humans are pretty good at thinking to themselves "man, wouldn't it be great if people like me had lots of money?" This, as well as the PR and culture problems of having lots of money sloshing around, are discussed in many EA Forum posts. We already saw that <a href="https://forum.effectivealtruism.org/posts/cfdnJ3sDbCSkShiSZ/ea-and-the-current-funding-situation">this one</a> (by MacAskill) and <a href="https://forum.effectivealtruism.org/posts/HWaH8tNdsgEwNZu8B/free-spending-ea-might-be-a-big-problem-for-optics-and">this one</a> are, respectively, the first- and second-most upvoted posts of all time on the EA Forum.</p>
<p>Ultimately, the whole point of Effective Altruism is, well, being effective about altruism. Whether EA funders spend quickly or slowly, and whichever causes they target, if they fail to find the best opportunities to do good with money, they haven't succeeded - and they know it.</p>
<p>(It should be noted that the GWWC criterion of donating 10% of your income to charity is met by many EAs, including ones far in space or culture from the original Oxford cluster, and global health is a leading donation target.)</p>
<h3 id="thinking-vs-doing">Thinking vs doing</h3>
<p>The fact that there are more resources - including not just funding but also the time of talented people - also means that the focus is less on marginal impact. If you have £10 and an hour, then figuring out what existing opportunity has the best ratio of good stuff per pound is the best bet. But if you have, say, £10 000 000 and ten thousand work hours, then there's also the option of starting new projects and organisations.</p>
<p>(A lot of the weirdness of EA thinking comes from its marginalist nature. The things that are most valuable per marginal unit of money/time/effort are generally the things that are most neglected, and neglected things tend to seem weird because, by definition, few people care about them. For example, the early EA focus basically completely eschewed developed country problems because per-dollar marginal cost-effectiveness was highest in poor countries; from the outside, this may look like a strangely harsh and idiosyncratic selection of causes. With increasing resources, it makes more sense to pursue larger-scale changes, and larger-scale changes sometimes look like more traditional and intuitive causes. For example, while developing country health and projects trying to improve the long-term future are Open Philanthropy's main focuses, they spend some of their massive budget on <a href="https://www.openphilanthropy.org/focus/criminal-justice-reform/">US criminal justice reform</a>, <a href="https://www.openphilanthropy.org/focus/land-use-reform/">land-use policy</a>, and <a href="https://forum.effectivealtruism.org/community">immigration policy</a>.) (Though note that <a href="https://forum.effectivealtruism.org/posts/h2N9qEbvQ6RHABcae/a-critical-review-of-open-philanthropy-s-bet-on-criminal">the effectiveness of the criminal justice program has come under criticism</a>.)</p>
<p>Since EA now has the resources to start many new organisations, there's also starting to be a shift from EA being very research-oriented to having more and more real-world projects. Even though one of the key EA insights is that doing good requires lots of careful thinking in addition to good intentions and execution ability, the ultimate metric of success is actually improving the world, and that takes steps that aren't just research. I think EA has some headwind to overcome here; as a movement inspired by, started by, and (early on) largely consisting of philosophers, it has been remarkably successful in appealing to philosophical people and researchers, but not entrepreneurs or operations people to the same extent. I think it is a very welcome trend that this is starting to shift.</p>
<h2 id="exciting-attempt-for-enabling-action-on-essential-activities">Exciting Attempt for Enabling Action on Essential Activities</h2>
<p>EA is definitely not ideal, and it is also not guaranteed to survive. Like any real-world community, it is not a timeless platonic ideal of pure perfection that burst into the world fully formed, but rather something with an idiosyncratic history, that consists of real people, and has certain biases and cultural oddities. Still, I think it is probably the most exciting and useful thing in the world to be engaged with.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-32443292866319950492022-06-25T20:50:00.000+01:002022-06-25T20:50:31.046+01:00Information theory 3: channel coding<p style="text-align: center;"><span style="font-size: x-small;">7.9k words, including equations (~41 minutes)</span> <br /></p><p> </p><p>We've looked at basic information theory concepts <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">here</a>, and at source coding (i.e. compressing data without caring about noise) <a href="https://www.strataoftheworld.com/2022/06/information-theory-2-source-coding.html">here</a>. Now we turn to channel coding.</p>
<p>The purpose of channel coding is to make information robust against any possible noise in the channel.</p>
<h2 id="noisy-channel-model">Noisy channel model</h2>
<p>The noisy channel model looks like the following:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVptKjd3xnPIq2F_8kfByNH3C96QL3mz0C3z-bXyTNEECMZHNxQwqHRusw6Mw5jxrNbT9k9L6OC8qFQuYkLr72mSoJiti9072A9B_HT6twHNku1gxJFIJ45WcEtJy7WuNMcr4MQNVZ7gi_KzuscQq9kcsTKQnbs9oAKN0oViBImC74qaxLxB273U_log/s1094/ArcoLinux_2022-06-25_19-19-53.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="239" data-original-width="1094" height="140" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVptKjd3xnPIq2F_8kfByNH3C96QL3mz0C3z-bXyTNEECMZHNxQwqHRusw6Mw5jxrNbT9k9L6OC8qFQuYkLr72mSoJiti9072A9B_HT6twHNku1gxJFIJ45WcEtJy7WuNMcr4MQNVZ7gi_KzuscQq9kcsTKQnbs9oAKN0oViBImC74qaxLxB273U_log/w640-h140/ArcoLinux_2022-06-25_19-19-53.png" width="640" /></a></div>
<p>The channel can be anything: electronic signals sent down a wire, messages sent by post, or the passage of time. What's important is that it is discrete (we will look at the continuous case later), and there are some transition probabilities from every symbol that can go into the channel to every symbol that can come out. Often, the set of symbols of the inputs is the same as the set of symbols of the outputs.</p>
<p>The capacity $$C$$ of a noisy channel is defined as
$$$
C = \max_{p_x} I(X;Y) = \max_{p_x} \big(H(Y) - H(Y|X)\big).
$$$
It's intuitive that this definition involves the mutual information $$I$$ (see <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">the first post for the definition and explanation</a>), since we care about how much information $$X$$ transfers to $$Y$$, and how much $$Y$$ tells us about $$X$$. What might be less obvious is why we take the maximum over possible input probability distributions $$p_x$$. This is because the mutual information $$I(X;Y)$$ depends on the probability distributions of $$X$$ and $$Y$$. We can only control what we send - $$X$$ - so we want to adjust that to maximise the mutual information. Intuitively, if you're typing on a keyboard with all keys working normally except the "i" key results in a random character being inserted, shifting your typing away from using the "i" key is good for information transfer. Better to wr1te l1ke th1s than to not be able to reliably transfer information.</p>
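<p>To make the maximisation over $$p_x$$ concrete, here is a minimal Python sketch that computes $$I(X;Y)$$ for a small channel and finds the capacity by brute-force grid search. The example channel (a 0 always arrives intact, a 1 is flipped half the time) is an assumption chosen for illustration in the spirit of the broken-key example, not something from the rest of this post, and the function names are hypothetical.</p>

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(p_x, channel):
    """I(X;Y) = H(Y) - H(Y|X), where channel[x][y] = P(Y = y | X = x)."""
    n_out = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_out)]
    h_y_given_x = sum(p_x[x] * entropy(channel[x]) for x in range(len(p_x)))
    return entropy(p_y) - h_y_given_x

# Assumed example channel: a 0 is always received correctly,
# but a 1 is flipped to a 0 half of the time.
channel = [[1.0, 0.0],
           [0.5, 0.5]]

# Grid search over input distributions (q, 1 - q), where q = P(X = 0).
grid = [i / 1000 for i in range(1001)]
capacity, best_q = max((mutual_information([q, 1 - q], channel), q) for q in grid)
print(f"capacity ~ {capacity:.4f} bits at P(X=0) = {best_q:.3f}")
```

<p>For this channel the best input distribution leans towards the reliable symbol without abandoning the unreliable one entirely, which is the same trade-off as typing around a broken key.</p>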
<p>However, the only real way to understand why this definition makes sense is to look at the noisy channel coding theorem. This theorem tells us, among other things, that for any rate (measured in bits per symbol) smaller than the capacity $$C$$, for a large enough code length we can get a probability of error as small as we like.</p>
<p>With noisy channels, we often work with <i>block codes</i>. The idea is that you encode some shorter sequence of bits as a longer sequence of bits, and if you've designed this well, it adds redundancy. An $$(n,k)$$ block code is one that replaces chunks of $$k$$ bits with chunks of $$n$$ bits.</p>
<h2 id="hamming-coding">Hamming coding</h2>
<p>Before we look at the noisy channel theorem, here's a simple code that is robust to errors: transmit every bit 3 times. Instead of sending 010, send 000111000. If the receiver receives 010111000, they can tell that bit 2 probably had an error, and should be a zero. The problem is that you triple your message length.</p>
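<p>The triple-repetition scheme can be sketched in a few lines of Python. The function names are made up for this example, and decoding by majority vote within each group of three is the natural reading of "bit 2 probably had an error".</p>

```python
def encode_repetition(bits, repeats=3):
    """Repeat every bit `repeats` times, e.g. '010' -> '000111000'."""
    return "".join(bit * repeats for bit in bits)

def decode_repetition(received, repeats=3):
    """Decode by majority vote within each group of `repeats` bits."""
    groups = [received[i:i + repeats] for i in range(0, len(received), repeats)]
    return "".join("1" if group.count("1") * 2 > repeats else "0" for group in groups)

sent = encode_repetition("010")
print(sent)                            # 000111000
print(decode_repetition("010111000"))  # 010
```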
<p>Hamming codes are a method for achieving the same - the ability to detect and correct single-bit errors, and the ability to detect but not properly correct two-bit errors - while sending a number of excess bits that grows only logarithmically with message length. For long enough messages, this is very efficient; if you're sending over 250 bits, it only costs you a 3% longer message to insure them against single-bit errors.</p>
<p>The catch is that the probability of having only one or fewer errors in a message declines exponentially with message length, so this is less impressive than it might sound at first.</p>
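<p>Both halves of this trade-off can be checked numerically. The sketch below tabulates the overhead of the Hamming code with $$m$$ parity bits (which has $$2^m - 1$$ bits in total) next to the probability that a block suffers at most one flip. The per-bit error probability of 0.01 and the assumption of independent flips are illustrative choices, not figures from this post.</p>

```python
def hamming_parameters(m):
    """Total and data bit counts (T, D) of the Hamming code with m parity bits."""
    total = 2**m - 1
    return total, total - m

def prob_at_most_one_error(total_bits, p_flip):
    """Probability that a block of total_bits bits suffers zero or one flips,
    assuming each bit flips independently with probability p_flip."""
    no_errors = (1 - p_flip) ** total_bits
    one_error = total_bits * p_flip * (1 - p_flip) ** (total_bits - 1)
    return no_errors + one_error

p_flip = 0.01  # assumed per-bit error probability, chosen for illustration
for m in range(3, 11):
    total, data = hamming_parameters(m)
    overhead = m / data
    print(f"m={m:2d}  ({total}, {data})  overhead={overhead:6.1%}  "
          f"P(at most 1 error)={prob_at_most_one_error(total, p_flip):.3f}")
```

<p>The overhead column shrinks quickly as $$m$$ grows, while the last column shrinks too: longer blocks are cheaper to protect but more likely to contain more errors than the code can handle.</p>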
<p>The basic idea of most error correction codes is a parity bit. A parity bit $$b$$ is typically the XOR (exclusive-or) of a bunch of other bits $$b_1, b_2, \ldots$$, written $$b = b_1 + b_2 + \ldots$$ (we use $$+$$ for XOR because doing addition in base-2 while throwing away the carry is the same as taking the XOR). A parity bit over a set of bits $$B = \{b_1, b_2, \ldots\}$$ is 1 if the set of bits contains an odd number of 1s, and otherwise 0 (hence the word "parity").</p>
<p>Consider sending a 3-bit message where the first two bits are data and the third is a parity bit. If the message is 110, we check that, indeed, there's an even number of 1s among the data bits, so it checks out that the parity bit is 0. If the message were 111, we'd know that something had gone wrong (though we wouldn't be able to fix it, since it could have started out with any of 011, 101, or 110 and suffered a one-bit flip - and note that we can never entirely rule out that 000 flipped to 111, though since error probability is generally small in any case we're interested in, this would be extremely unlikely).</p>
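<p>As a small illustration of the 3-bit example, here is a Python sketch of computing and checking a parity bit. The helper names are hypothetical.</p>

```python
def parity(bits):
    """XOR of a sequence of 0/1 values: 1 if there is an odd number of 1s."""
    result = 0
    for bit in bits:
        result ^= bit
    return result

def add_parity_bit(data_bits):
    """Append a parity bit so that the whole message has an even number of 1s."""
    return data_bits + [parity(data_bits)]

def passes_parity_check(message):
    """A valid message (data plus parity bit) has an even number of 1s."""
    return parity(message) == 0

print(add_parity_bit([1, 1]))          # [1, 1, 0]
print(passes_parity_check([1, 1, 0]))  # True
print(passes_parity_check([1, 1, 1]))  # False
```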
<p>The efficiency of Hamming codes comes from the fact that we have parity bits that check other parity bits.</p>
<p>A $$(T, D)$$ Hamming code is one that sends $$T$$ bits in total of which $$D$$ are data bits and the remaining $$T - D$$ are parity bits. There exists a $$(2^m - 1, 2^m - m - 1)$$ Hamming code for positive integer $$m$$. Note that $$m$$ is the number of parity bits.</p>
<p>The default way to construct a Hamming code is that the $$i$$th parity bit is in position $$2^{i-1}$$ (so the parity bits sit in positions 1, 2, 4, 8, and so on), and is set such that the parity of the bits whose position's binary representation has a 1 in the $$i$$th last position is zero.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9rZ67sMsoFtVbSeQGWwqYtMO6tRF2T5cEw5RsqrtKog1ZTH4gbMk8QT79EeOQPwVTwaIXX3-9795wAGNmTrqH4v9lN4poBgxkbrod7stbG3-BTEHiNbslFs1Zje-6ox1_5kn0G9Wq3e3pu9dV5tOG2JaTjZs2asrT0ju_Ee5RkcnE7edyM7pQVIw-sw/s1033/ArcoLinux_2022-06-25_18-47-44.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="364" data-original-width="1033" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9rZ67sMsoFtVbSeQGWwqYtMO6tRF2T5cEw5RsqrtKog1ZTH4gbMk8QT79EeOQPwVTwaIXX3-9795wAGNmTrqH4v9lN4poBgxkbrod7stbG3-BTEHiNbslFs1Zje-6ox1_5kn0G9Wq3e3pu9dV5tOG2JaTjZs2asrT0ju_Ee5RkcnE7edyM7pQVIw-sw/w640-h226/ArcoLinux_2022-06-25_18-47-44.png" width="640" /></a></div>
<p>(Above, you see bits 1 through 15, with parity bits in positions 1, 2, 4, and 8. Underneath each bit, for every parity bit there is a 0 if that bit is not included in the parity set of that parity bit, and otherwise a 1. For example, since <code>b4</code> is set for bits 8-15, and sits in position 8 itself, <code>b4</code> is a 1 if there's an odd number of 1s in bits 9-15 and otherwise 0, so that bits 8-15 inclusive always contain an even number of 1s. Note that the columns spell out the numbers 1 through 15 in binary.)</p>
<p>For example, a $$(7,4)$$ Hamming code for the 4 bits of data 0101 would first become
$$$
\texttt{ b1 b2 0 b3 1 0 1}
$$$
and then we'd set $$b_1 = 0$$ to make there be an even number of 1s across the 1st, 3rd, 5th, and 7th positions, set $$b_2 = 1$$ to do the same over the 2nd, 3rd, 6th, and 7th positions, and then finally set $$b_3 = 0$$ to do the same over the 4th, 5th, 6th, and 7th positions.</p>
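<p>The construction can be written out as a short Python sketch. This is one possible implementation under the layout described here (parity bits in the power-of-two positions, data bits filling the rest in order). The function name is an assumption made for illustration.</p>

```python
def hamming_encode(data_bits):
    """Encode data bits with parity bits at positions 1, 2, 4, 8, ...

    Returns a list of bits where index 0 holds position 1.
    """
    # Find the smallest number of parity bits m with 2^m - m - 1 >= len(data_bits).
    m = 2
    while 2**m - m - 1 < len(data_bits):
        m += 1
    total = len(data_bits) + m
    code = [0] * (total + 1)  # index 0 unused so that indices equal positions

    # Place data bits in the positions that are not powers of two.
    data = iter(data_bits)
    for pos in range(1, total + 1):
        if pos & (pos - 1) != 0:  # not a power of two
            code[pos] = next(data)

    # The parity bit at position 2^i covers every position with bit i set.
    # It is still 0 at this point, so including it in the sum is harmless.
    for i in range(m):
        parity_pos = 1 << i
        covered = [pos for pos in range(1, total + 1) if pos & parity_pos]
        code[parity_pos] = sum(code[pos] for pos in covered) % 2
    return code[1:]

print(hamming_encode([0, 1, 0, 1]))  # [0, 1, 0, 0, 1, 0, 1]
```

<p>The printed codeword has $$b_1 = 0$$, $$b_2 = 1$$, and $$b_3 = 0$$ in positions 1, 2, and 4, matching the hand calculation.</p>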
<p>To correct errors, we have the following rule: sum up the positions of the parity bits that do not match. For example, if parity bit 3 is set wrong relative to the rest of the message, you flip that bit; everything will be fine after we clear this false alarm. But if parity bit 2 is also set wrong, then you take their positions, 2 (for bit 2) and 4 (for bit 3) and add them to get 6, and flip the sixth bit to correct the error. This makes sense because the sixth bit is the only bit covered by both parity bits 2 and 3, and only parity bits 2 and 3.</p>
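<p>The correction rule translates directly into code. The Python sketch below recomputes each parity check, adds up the positions of the parity bits whose checks fail, and flips the bit at the resulting position. The function name is hypothetical, and the code assumes at most one bit was flipped.</p>

```python
def hamming_correct(received):
    """Correct at most one flipped bit in a Hamming codeword.

    `received` is a list of bits where index 0 holds position 1.
    Returns the corrected bits and the position that was flipped (0 if none).
    """
    total = len(received)
    error_pos = 0
    parity_pos = 1
    while parity_pos <= total:
        covered = [pos for pos in range(1, total + 1) if pos & parity_pos]
        if sum(received[pos - 1] for pos in covered) % 2 == 1:
            error_pos += parity_pos  # this parity check fails
        parity_pos *= 2
    corrected = list(received)
    if 1 <= error_pos <= total:
        corrected[error_pos - 1] ^= 1
    return corrected, error_pos

codeword = [0, 1, 0, 0, 1, 0, 1]  # the (7,4) codeword for the data bits 0101
garbled = list(codeword)
garbled[6 - 1] ^= 1               # flip the bit in position 6
print(hamming_correct(garbled))   # ([0, 1, 0, 0, 1, 0, 1], 6)
```

<p>Flipping position 6 makes the checks for the parity bits in positions 2 and 4 fail, and 2 + 4 = 6 points back at the flipped bit.</p>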
<p>Though the above scheme is elegant and extensible, it's possible to design other Hamming codes. The length requirements remain - the code is a $$(2^m - 1, 2^m - m - 1)$$ code if we allow $$m$$ parity bits - but we can assign any "domain" over the bits to each parity bit as long as each bit belongs to the domain of a unique set of parity bits.</p>
<h2 id="noisy-channel-coding-theorem">Noisy channel coding theorem</h2>
<p>We can measure any noisy channel code we choose based on two numbers. The first is its probability of error ($$p_e$$ above). The second is its rate: how many bits of information are transferred for each symbol sent. The three parts of the theorem combine to divide that space up into a possible and impossible region:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaR-atWmnPttu0ZXFtS2-0y3wxiPkw0DmZcP4S1U9KLhuz7Iw7SGCn_NNggZFpNKc5OBFkFL7eB29jIB3GXy7kMFVOncmVp1tTNafSdOGgDvYpf-GoOaMTDyjA5k0-RmbiwMeRitQJAR9IYWAqejEtnBXrtC1a-6a6gxzQr-JgyqsERmXXvPI-rhpMQ/s676/ArcoLinux_2022-06-25_18-49-59.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="483" data-original-width="676" height="458" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaR-atWmnPttu0ZXFtS2-0y3wxiPkw0DmZcP4S1U9KLhuz7Iw7SGCn_NNggZFpNKc5OBFkFL7eB29jIB3GXy7kMFVOncmVp1tTNafSdOGgDvYpf-GoOaMTDyjA5k0-RmbiwMeRitQJAR9IYWAqejEtnBXrtC1a-6a6gxzQr-JgyqsERmXXvPI-rhpMQ/w640-h458/ArcoLinux_2022-06-25_18-49-59.png" width="640" /></a></div>
<p>The first part of the theorem says that the region marked "I" is possible. Now there are points of this region that are more interesting than others. Yes, we can make a code that has a rate of 0 and a very high error rate; just send the same symbol all the time. This is point (a), and we don't care about it.</p>
<p>What's more interesting, and perhaps not even intuitively obvious at all, is that we can get to a point (b): an arbitrarily low error rate, despite the fact that we're sending information. The maximum information rate we can achieve while keeping the error probability very low turns out to be the capacity, $$C = \max_{p_X} I(X;Y)$$.</p>
<p>The second part of the theorem gives us a lower bound on error rate if we dare try for a rate that is greater than the capacity. It tells us we can make codes that achieve point (c) on the graph.</p>
<p>Finally, the third part of the theorem proves that we can't get to points like (x), that have an error rate that is too low given how much over the channel capacity their rate is.</p>
<p>We started the <a href="https://www.strataoftheworld.com/2022/06/information-theory-2-source-coding.html">proof of the source coding theorem</a> by considering a simple construction (the $$\delta$$-sufficient subset) first for a single character and then extending it to blocks. We're going to do something similar now.</p>
<h3 id="noisy-typewriters">Noisy typewriters</h3>
<p>A noisy typewriter over the alphabet $$\{0, \ldots, n - 1\}$$ is a device where if you press the key for $$i$$, it inputs one of the following with equal probability:</p>
<ul>
<li>$$i - 1 \mod n$$</li>
<li>$$i \mod n$$ </li>
<li>$$i + 1 \mod n$$</li>
</ul>
<p>With a 6-symbol alphabet, we can illustrate its transition probability matrix as a heatmap:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwc4Mgk4fp9WF6Aal37OpOl_PgfcM8JwarOSEoWOaoaXR9xF_lG3b0Y6jCjZDFu6eaYZDX2FgVxiM-8Mg0xwNParK7KK50qDPhRd4swroReHON_8C1myzfe4Xobx7RHBMipN-SAHqVoDtjB32Nd8HW1wADPLJFUGJgAq624SeD-H4A5w69T0qQwjVQag/s1107/ArcoLinux_2022-06-25_18-52-30.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1065" data-original-width="1107" height="385" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwc4Mgk4fp9WF6Aal37OpOl_PgfcM8JwarOSEoWOaoaXR9xF_lG3b0Y6jCjZDFu6eaYZDX2FgVxiM-8Mg0xwNParK7KK50qDPhRd4swroReHON_8C1myzfe4Xobx7RHBMipN-SAHqVoDtjB32Nd8HW1wADPLJFUGJgAq624SeD-H4A5w69T0qQwjVQag/w400-h385/ArcoLinux_2022-06-25_18-52-30.png" width="400" /></a></div>
<p>The colour scale is blue (low) to yellow (high). The reading order is meant to be that each column represents the probability distribution of output symbols given an input symbol.</p>
<p>First, can we transmit information without error at all? Yes: choose a code where you only send the symbols corresponding to the second and fifth columns. Based on the heatmap, these can map to symbols number 1-3 and 4-6 respectively; there is no possibility of confusion. The cost is that instead of being able to send one of six symbols, or $$\log 6$$ bits of information per symbol, we can now only send one of two, or $$\log 2 = 1$$ bit of information per symbol.</p>
<p>The capacity is $$\max_{p_X} \big( H(Y) - H(Y|X) \big)$$. Now if $$p_X$$ is the distribution we considered above - assigning half the probability to 2 and half to 5 - then by the transition matrix we see that $$Y$$ will be uniformly distributed, so $$H(Y)$$ is $$\log 6$$. $$H(Y|X)$$ is $$\log 3$$ in our example code, because we see that if we always send either symbol 2 or 5, then in both cases $$Y$$ is restricted to a set of 3 equally likely values. With some more work you can show that this is in fact an optimal choice of $$p_X$$. The capacity turns out to be $$\log 6 - \log 3 = \log 2 = 1$$ bit. The error probability is zero. We see that we can indeed transfer information without error even if we have a noisy channel.</p>
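<p>This calculation is easy to verify numerically. The Python sketch below builds the 6-symbol noisy typewriter and evaluates $$H(Y) - H(Y|X)$$ for the input distribution that puts half the probability on symbol 2 and half on symbol 5. The function names are assumptions made for this example, and symbols are stored zero-indexed even though the text numbers them 1 to 6.</p>

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(p_x, channel):
    """I(X;Y) = H(Y) - H(Y|X), where channel[x][y] = P(Y = y | X = x)."""
    size = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(size)]
    h_y_given_x = sum(p_x[x] * entropy(channel[x]) for x in range(len(p_x)))
    return entropy(p_y) - h_y_given_x

def noisy_typewriter(n):
    """Channel where input i becomes i - 1, i, or i + 1 (mod n), each with probability 1/3."""
    channel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for shift in (-1, 0, 1):
            channel[i][(i + shift) % n] += 1 / 3
    return channel

channel = noisy_typewriter(6)
# Symbols are numbered 1 to 6 in the text, so symbols 2 and 5 are indices 1 and 4.
p_two_symbols = [0, 0.5, 0, 0, 0.5, 0]
print(mutual_information(p_two_symbols, channel))  # approximately 1.0 bit
```

<p>A uniform input distribution gives the same mutual information of 1 bit, but only the two-symbol code lets the receiver decode every symbol with certainty.</p>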
<p>But hold on, the noisy typewriter has a very specific type of error: there's an absolute certainty that if we transmit a 2 we can't get symbols 4-6 out, and so on. Intuitively, here we can partition the space of channel outputs in such a way that there is no overlap in the sets of which channel input each channel output could have come from. It seems like with a messier transition matrix that doesn't have this nice property, this just isn't true. For example, what if we have a binary symmetric channel, with a transition matrix like this:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVw3TO5N1kZxnd4FvUVlFlix9C5meSAM-tqAk1E0CdfdZwJtcuEg0gucDW3Fhy34Ho4y7UXyJ_Qb8RjHSAm9TyW8wIH75eaOC5CSbhaqoxKLBSGOcUQpoy6fllPcbjufiPXJ2MX3cYWCKKEAWRuUFJQ3O7OHun6t_kHgJPxuEFPQzUXnqQ7J24eFZLeg/s1129/ArcoLinux_2022-06-25_18-54-33.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1059" data-original-width="1129" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVw3TO5N1kZxnd4FvUVlFlix9C5meSAM-tqAk1E0CdfdZwJtcuEg0gucDW3Fhy34Ho4y7UXyJ_Qb8RjHSAm9TyW8wIH75eaOC5CSbhaqoxKLBSGOcUQpoy6fllPcbjufiPXJ2MX3cYWCKKEAWRuUFJQ3O7OHun6t_kHgJPxuEFPQzUXnqQ7J24eFZLeg/s320/ArcoLinux_2022-06-25_18-54-33.png" width="320" /></a></div>
<p>Unfortunately the blue = lowest, yellow = highest color scheme is not very informative; the transition matrix looks like this, where $$p_e$$ is the probability of error:
$$$
\begin{bmatrix}
1 - p_e & p_e \\
p_e & 1 - p_e
\end{bmatrix}
$$$
Here nothing is certain: a 0 can become a 1, and a 1 can become a zero.</p>
<p>However, this is what we get if we use this transition probability matrix on every symbol in a string of length 4, with the strings going in the order 0000, 0001, 0010, 0011, ..., 1111 along both the top and left side of the matrix:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7BcpxB7n2poMYg7-ptGda6MU4emL-BJSL83_TWfZYwbnaTdJQ0uB29zZwxjHfrieOKPatRc-Sry9P9QbWzBfo-zah1-LFd93v1KZOoaEfobiS_Pq4yiJXE-XoTVdJ01jXMGhhHHV-RSFwREWf0I86nJcxi-4Y7WeJOOVF9bYaBCm_sCuUnW1ugeIdAQ/s1024/ArcoLinux_2022-06-25_18-56-35.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1022" data-original-width="1024" height="399" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7BcpxB7n2poMYg7-ptGda6MU4emL-BJSL83_TWfZYwbnaTdJQ0uB29zZwxjHfrieOKPatRc-Sry9P9QbWzBfo-zah1-LFd93v1KZOoaEfobiS_Pq4yiJXE-XoTVdJ01jXMGhhHHV-RSFwREWf0I86nJcxi-4Y7WeJOOVF9bYaBCm_sCuUnW1ugeIdAQ/w400-h399/ArcoLinux_2022-06-25_18-56-35.png" width="400" /></a></div>
<p>For example, the second column shows the probabilities (blue = low, yellow = high) for what you get in the output channel if 0001 is sent as a message. The highest value is for the second entry, 0001, because we have $$p_e < 0.5$$ so $$p_e < 1 - p_e$$ so the single likeliest outcome is for no changes, which has probability $$(1-p_e)^4$$. The second highest values are for the first (0000), fourth (0011), sixth (0101), and tenth (1001) entries, since these all involve one flip and have probability $$p_e (1-p_e)^3$$ individually and probability $${4 \choose 1} p_e (1-p_e)^3 = 4 p_e (1 - p_e)^3$$ together.</p>
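<p>One column of this matrix is easy to reproduce. The Python sketch below computes the output distribution for the input 0001 and lists the most likely outputs along with their position in the ordering 0000, 0001, ..., 1111. The value $$p_e = 0.1$$ is an assumption for illustration. The diagram does not state which value it uses.</p>

```python
from itertools import product

def block_transition_probability(sent, received, p_e):
    """P(received | sent) when each bit is flipped independently with probability p_e."""
    flips = sum(s != r for s, r in zip(sent, received))
    return p_e**flips * (1 - p_e) ** (len(sent) - flips)

p_e = 0.1  # assumed error probability, for illustration only
strings = ["".join(bits) for bits in product("01", repeat=4)]  # 0000, 0001, ..., 1111
column = {received: block_transition_probability("0001", received, p_e) for received in strings}

# List the five most likely outputs together with their 1-based position in the ordering.
ranked = sorted(strings, key=lambda s: column[s], reverse=True)
for received in ranked[:5]:
    print(strings.index(received) + 1, received, round(column[received], 4))
```

<p>The top entry is 0001 itself, followed by the four strings that differ from it in exactly one bit.</p>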
<p>If we dial up the number, the pattern becomes clearer; here's the equivalent diagram for messages of length 8:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjClJ6kMdUI2632p44ktR3yiQ-2mhzmRTNUohwUx1EBmTT4LNpzpIgtMgOoY3GgOLhs4IOdocUJbpz7Ep-Dm_0kZLATn_O_haiYViwEOD9JlhbYjv2jI7qZnWvesb0-el-eip6h42z47ALSveLDWhrglzQnMGBBNQ3Zp7wEWoAmbDwVnAD90tQ-CcSWug/s1022/ArcoLinux_2022-06-25_18-57-06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1020" data-original-width="1022" height="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjClJ6kMdUI2632p44ktR3yiQ-2mhzmRTNUohwUx1EBmTT4LNpzpIgtMgOoY3GgOLhs4IOdocUJbpz7Ep-Dm_0kZLATn_O_haiYViwEOD9JlhbYjv2jI7qZnWvesb0-el-eip6h42z47ALSveLDWhrglzQnMGBBNQ3Zp7wEWoAmbDwVnAD90tQ-CcSWug/w640-h638/ArcoLinux_2022-06-25_18-57-06.png" width="640" /></a></div>
<h3 id="the-return-of-the-typical-set">The Return of the Typical Set</h3>
<p>There are two key points.</p>
<p>The first is that more and more of the probability is concentrated along the diagonal (plus some other diagonals further from the main diagonal). We can technically have any transformation, even 11111111 to 00000000, when we send a message through the channel, but most of these transformations are extremely unlikely. The transition matrix starts looking more and more like the noisy typewriter, where for each message only one subset of received messages has non-tiny likelihood.</p>
<p>The second key point is that it is time for ... the <i>return of the typical set</i>. Recall from the <a href="https://www.strataoftheworld.com/2022/06/information-theory-2-source-coding.html">second post in this series</a> that the $$\epsilon$$-typical set of length-$$n$$ strings over an alphabet $$A$$ is defined as
$$$
T_{n\epsilon} = \left\{x^n \in A^n \text{ such that } \left|-\frac{1}{n} \log p(x^n) - H(X)\right| \le \epsilon\right\}.
$$$
$$-\frac{1}{n} \log p(x^n)$$ is equal to $$-\frac{1}{n} \sum_{i=1}^n \log p(x_i)$$ by independence, and this in turn is an estimator for $$\mathbb{E}[-\log p(X)] = H(X)$$. You can therefore read $$-\frac{1}{n}\log p(x^n)$$ as the "empirical entropy"; it's what we'd guess the (per-symbol) entropy of $$X$$ to be if we did a slightly weird thing of estimating the entropy while knowing the probability model but only using it to determine the information content $$-\log p$$, and estimating the $$p_i$$s in $$-\sum_i p_i \log p_i$$ instead by only using how often they occur in $$x^n$$ (rather than the probability model).</p>
<p>Now the big result about typical sets was that as $$n \to \infty$$, the probability $$P(x^n \sim X^n \in T_{n \epsilon}) \to 1$$, and therefore for large $$n$$, most of the probability mass is concentrated in the approximately $$2^{nH(X)}$$ strings of probability approximately $$2^{-nH(X)}$$ that lie in the typical set.</p>
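<p>The concentration is visible numerically even for modest $$n$$. The Python sketch below computes the exact probability mass and size of the $$\epsilon$$-typical set for $$n$$ independent biased coin flips, grouping strings by their number of 1s since all such strings share the same probability. The values $$p = 0.2$$ and $$\epsilon = 0.1$$ are arbitrary illustrative choices, and the function name is hypothetical.</p>

```python
import math

def typical_set_stats(n, p, eps):
    """Probability mass and size of the eps-typical set for n i.i.d. Bernoulli(p) bits."""
    entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    mass, size = 0.0, 0
    for ones in range(n + 1):
        # Every string with this many 1s has the same probability.
        log_prob = ones * math.log2(p) + (n - ones) * math.log2(1 - p)
        if abs(-log_prob / n - entropy) <= eps:
            count = math.comb(n, ones)
            size += count
            # Work in the log domain: count and 2**log_prob are individually
            # too large and too small for floats when n is big.
            mass += 2.0 ** (math.log2(count) + log_prob)
    return mass, size, entropy

p, eps = 0.2, 0.1  # illustrative values, not taken from the post
for n in (25, 100, 400, 1600):
    mass, size, entropy = typical_set_stats(n, p, eps)
    print(f"n={n:5d}  P(typical)={mass:.3f}  "
          f"log2(size)/n={math.log2(size) / n:.3f}  H(X)={entropy:.3f}")
```

<p>The probability of landing in the typical set climbs towards 1 as $$n$$ grows, while the per-symbol log-size of the set stays within $$\epsilon$$ of $$H(X)$$.</p>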
<p>We can define a similar notion of jointly $$\epsilon$$-typical sets, denoted $$J_{n\epsilon}$$ and defined by analogy with $$T_{n\epsilon}$$ as
$$$
J_{n\epsilon} = \left\{ (x^n, y^n) \in A^n \times A^n
\text{ such that } \left| - \frac{1}{n} \log P(x^n, y^n) - H(X, Y)\right| \le \epsilon
\right\}.
$$$
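(The standard definition additionally requires $$x^n$$ and $$y^n$$ to each be typical on their own, with respect to $$p_X$$ and $$p_Y$$; the bound in the third property below relies on this.) Like typical sets, jointly typical sets give us similar nice properties:</p>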
<ol>
<li><p>If $$x^n, y^n$$ are drawn from the joint distribution (e.g. you first draw an $$x^n$$, then apply the transition matrix probabilities to generate a $$y^n$$ based on it), then the probability that $$(x^n, y^n) \in J_{n \epsilon}$$ goes to 1 as $$n \to \infty$$. The proof is almost the same as the corresponding proof for typical sets (hint: law of large numbers).</p>
</li>
<li><p>The number $$|J_{n\epsilon}|$$ of jointly typical sequence pairs $$(x^n, y^n)$$ is about $$2^{nH(X,Y)}$$, and specifically is upper-bounded by $$2^{n(H(X,Y) + \epsilon)}$$. The proof is the same as for the typical set case.</p>
</li>
<li><p>If $$x^n$$ and $$y^n$$ are <i>independently drawn</i> from the distributions $$p_X$$ and $$p_Y$$, the probability that they are jointly typical is about $$2^{-nI(X;Y)}$$. The specific upper bound is $$2^{-n(I(X;Y) - 3 \epsilon)}$$, and can be shown straightforwardly (remembering some of the identities in <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">post 1</a>) from
$$$
P((x^n, y^n) \in J_{n \epsilon}) = \sum_{(x^n, y^n) \in J_{n\epsilon}} p(x^n) p(y^n)$$$
$$$\le
|J_{n\epsilon}| 2^{-n(H(X) - \epsilon)} 2^{-n(H(Y) - \epsilon)}$$$
$$$ \le
2^{n(H(X,Y) + \epsilon)} 2^{-n(H(X) - \epsilon)} 2^{-n(H(Y) - \epsilon)}$$$
$$$=
2^{n(H(X,Y) - H(X) - H(Y) + 3 \epsilon)}$$$
$$$=
2^{-n(I(X;Y) - 3 \epsilon)}
$$$</p>
</li>
</ol>
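<p>The third property can be sanity-checked exactly in a simple special case. The Python sketch below assumes a binary symmetric channel with a uniform input distribution, so that every $$x^n$$ and every $$y^n$$ has probability $$2^{-n}$$ and joint typicality depends only on how many bits differ between them. The parameter values and the function names are illustrative assumptions.</p>

```python
import math

def binary_entropy(p):
    """H_2(p) in bits, for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def independent_pair_typicality(n, p_e, eps):
    """For a binary symmetric channel with uniform inputs, return the exact
    probability that independently drawn x^n and y^n are jointly typical,
    together with the upper bound 2^(-n (I(X;Y) - 3 eps))."""
    h_noise = binary_entropy(p_e)
    typical_pairs_per_x = 0
    for flips in range(n + 1):
        # p(x^n, y^n) = 2^(-n) * p_e^flips * (1 - p_e)^(n - flips) and
        # H(X,Y) = 1 + H_2(p_e), so the typicality condition reduces to this.
        log_noise = flips * math.log2(p_e) + (n - flips) * math.log2(1 - p_e)
        if abs(-log_noise / n - h_noise) <= eps:
            typical_pairs_per_x += math.comb(n, flips)
    # An independent uniform y^n hits any particular string with probability 2^(-n).
    probability = typical_pairs_per_x / 2**n
    bound = 2.0 ** (-n * (1 - h_noise - 3 * eps))
    return probability, bound

for n in (20, 100, 200):
    probability, bound = independent_pair_typicality(n, p_e=0.1, eps=0.05)
    print(f"n={n:3d}  P(jointly typical)={probability:.3e}  bound={bound:.3e}")
```

<p>In this special case the exact probability sits well below the bound, and both shrink exponentially in $$n$$.</p>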
<p>Armed with this definition, we can now interpret what was happening in the diagrams above: as we increase the length of the messages, more and more of the probability mass is concentrated in jointly typical sequences, by the first property above. The third property tells us that if we ignore the dependence between $$x^n$$ and $$y^n$$ - picking a square roughly at random in the diagrams above - we are, however, extremely unlikely to pick a square corresponding to a jointly typical pair.</p>
<p>Here is the noisy typewriter for 6 symbols, for length-4 messages coming in and out of the channel:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghBh70zAlC9if1MiSrMZk9WRLa4SV5wbv9jyoc9ZeoTNb4r1lBPo_B7Usu_QFsRIAt53ktS-ep3_LJjvTs3fUWt9Ztcow4xfxo6sLFjj_oiT6HT_2imaW2s-FcjgIFL3SN5gXB4Gwvsch87akGUI9ipQbAqYrvdDXHXL07_iS_mUCmn0qWlbciAFjifA/s1030/ArcoLinux_2022-06-25_18-59-15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1030" data-original-width="1027" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghBh70zAlC9if1MiSrMZk9WRLa4SV5wbv9jyoc9ZeoTNb4r1lBPo_B7Usu_QFsRIAt53ktS-ep3_LJjvTs3fUWt9Ztcow4xfxo6sLFjj_oiT6HT_2imaW2s-FcjgIFL3SN5gXB4Gwvsch87akGUI9ipQbAqYrvdDXHXL07_iS_mUCmn0qWlbciAFjifA/w638-h640/ArcoLinux_2022-06-25_18-59-15.png" width="638" /></a></div>
<p>(As a reminder of the interpretation: each column represents the probability distribution, shaded blue to yellow, for one input message, and the $$6^4 = 1296$$ possible messages we have with this message length (4) and alphabet size (6) are ranked in alphabetical order along both the top and left side of the grid.)</p>
<p>The highest probability is still yellow, but you can barely see it. Most of the probability mass is in the medium-probability sequences (our jointly typical set), forming a small subset of the possible channel outputs for each input.</p>
<p>In the limit, therefore, the transition probability matrix for a block code of an arbitrary symbol transition probability matrix looks a lot like the noisy typewriter. This suggests a decoding method: if we see $$y^n$$, we decode it as $$x^n$$ if $$(x^n, y^n)$$ are in the jointly typical set, and there is no other $${x'}^n$$ such that $$({x'}^n, y^n)$$ are also jointly typical. As with the noisy typewriter example, we have to discard a lot of the $$x^n$$, so that the set of $$x^n$$ that a given $$y^n$$ could've come from hopefully contains only a single element, so we match the second condition in the decoding rule.</p>
<h3 id="theorem-outline">Theorem outline</h3>
<p>Now we will state the exact form of the noisy channel coding theorem. It has three parts:</p>
<ol>
<li><p>A discrete memoryless channel has a non-negative capacity $$C$$ such that for any $$\varepsilon > 0$$ and $$R < C$$, for large enough $$n$$ there's a block code of length $$n$$ and rate $$\geq R$$ and a decoder such that error probability is $$< \varepsilon$$.</p>
<p>We will see that this follows from the points about jointly typical sets and the decoding scheme based on them that we discussed above. The only thing really missing is an argument that the error rate of jointly typical coding can be made arbitrarily low as long as $$R < C$$. We will see that Shannon used perhaps the most insane trick in all of 20th century applied maths to side-step having to actually think of a specific code to prove this.</p>
</li>
<li><p>If error probability per bit $$p_e$$ is acceptable, rates up to
$$$
R(p_e) = \frac{C}{1 - H_2(p_e)}
$$$
are possible. We will prove this by </p>
</li>
<li><p>For any $$p_e$$, rates $$> R(p_e)$$ are not possible.</p>
</li>
</ol>
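<p>To get a feel for the second and third parts, the Python sketch below evaluates $$R(p_e) = C / (1 - H_2(p_e))$$ for a few tolerated error probabilities, where $$H_2$$ is the binary entropy function. The capacity of 0.5 bits per symbol is an arbitrary assumption for illustration, and the function names are hypothetical.</p>

```python
import math

def binary_entropy(p):
    """H_2(p) in bits, with H_2(0) = H_2(1) = 0."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def max_rate(capacity, p_e):
    """R(p_e) = C / (1 - H_2(p_e)): the largest achievable rate if a
    per-bit error probability p_e is tolerated (requires p_e != 0.5)."""
    return capacity / (1 - binary_entropy(p_e))

capacity = 0.5  # assumed channel capacity in bits per symbol, for illustration
for p_e in (0.0, 0.01, 0.05, 0.1, 0.25):
    print(f"p_e={p_e:.2f}  R(p_e)={max_rate(capacity, p_e):.3f}")
```

<p>At $$p_e = 0$$ the formula gives back the capacity itself, and the achievable rate grows as more errors are tolerated, which traces out the boundary between the possible and impossible regions.</p>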
<p>As we saw earlier, these three parts together divide up the space of possible rate-and-error combinations for codes into three parts: </p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaR-atWmnPttu0ZXFtS2-0y3wxiPkw0DmZcP4S1U9KLhuz7Iw7SGCn_NNggZFpNKc5OBFkFL7eB29jIB3GXy7kMFVOncmVp1tTNafSdOGgDvYpf-GoOaMTDyjA5k0-RmbiwMeRitQJAR9IYWAqejEtnBXrtC1a-6a6gxzQr-JgyqsERmXXvPI-rhpMQ/s676/ArcoLinux_2022-06-25_18-49-59.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="483" data-original-width="676" height="458" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaR-atWmnPttu0ZXFtS2-0y3wxiPkw0DmZcP4S1U9KLhuz7Iw7SGCn_NNggZFpNKc5OBFkFL7eB29jIB3GXy7kMFVOncmVp1tTNafSdOGgDvYpf-GoOaMTDyjA5k0-RmbiwMeRitQJAR9IYWAqejEtnBXrtC1a-6a6gxzQr-JgyqsERmXXvPI-rhpMQ/w640-h458/ArcoLinux_2022-06-25_18-49-59.png" width="640" /></a></div>
<h3 id="proof-of-part-i-turning-noisy-channels-noiseless">Proof of Part I: turning noisy channels noiseless</h3>
<p>We want to prove that we can get an arbitrarily low error rate if the rate (bits of information per symbol) is smaller than the channel capacity, which we've defined as $$C = \max_{p_X} I(X;Y)$$.</p>
<p>We could do this by thinking up a code and then calculating the probability of error per length-$$n$$ block for it. This is hard though.</p>
<p>Here's what Shannon did instead: he started by considering a random block code, and then proved stuff about its average error.</p>
<p>What do we mean by a "random block code"? Recall that an $$(n,k)$$ block code is one that encodes length-$$k$$ messages as length-$$n$$ messages. Since the rate is $$r = \frac{k}{n}$$, we can talk about $$(n, nr)$$ block codes.</p>
<p>What the encoder is doing is mapping length-$$k$$ strings to length-$$n$$ strings. In the general case, it has some lookup table, with $$2^k = 2^{nr}$$ entries, each of length $$n$$. A "random code" means that we generate the entries of this lookup table from the distribution $$P(x^n) = \prod_{i=1}^n p(x_i)$$. We will refer to the encoder as $$E$$.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHLyOUVQSZXYKQMKosrwxek9NEqDIwpKXBiwIkB1rbKnacw8EjbCGc9mOp-C6c9U7wgb-w62IkzI3O64FKyqUlsyRPb9Asb7aJ3nvzUZF_-Ga6G65GV4iuYOmdl6xRbRhg5Nn8ilbCRrTitQ2O2BuwbWHDlPef24B1IwnbOIq08oAF1_q656BvGH3x5g/s905/ArcoLinux_2022-06-25_19-08-10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="445" data-original-width="905" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHLyOUVQSZXYKQMKosrwxek9NEqDIwpKXBiwIkB1rbKnacw8EjbCGc9mOp-C6c9U7wgb-w62IkzI3O64FKyqUlsyRPb9Asb7aJ3nvzUZF_-Ga6G65GV4iuYOmdl6xRbRhg5Nn8ilbCRrTitQ2O2BuwbWHDlPef24B1IwnbOIq08oAF1_q656BvGH3x5g/w640-h314/ArcoLinux_2022-06-25_19-08-10.png" width="640" /></a></div>
<p>(In the above diagram, the dots in the column represent probabilities of different outputs given the $$x^n$$ that is taken as input. Different values of $$w^k$$ would be mapped by the encoder to different columns $$x^n$$ in the square.)</p>
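<p>To make the "random lookup table" concrete, here is a minimal sketch of generating one for a binary alphabet. The function name <code>random_codebook</code>, the parameter <code>p_one</code> (the probability that a codeword bit is 1) and the fixed seed are assumptions made for illustration; they are not part of Shannon's argument.</p>

```python
import random

def random_codebook(k, n, p_one=0.5, seed=0):
    """Return a lookup table mapping each k-bit message to a random n-bit codeword.

    Each codeword bit is drawn independently with P(bit = 1) = p_one,
    mirroring P(x^n) = prod_i p(x_i). The seed only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    codebook = {}
    for message in range(2 ** k):
        w = format(message, f"0{k}b")   # the k-bit message w^k as a string
        x = "".join("1" if rng.random() < p_one else "0" for _ in range(n))
        codebook[w] = x                 # E(w^k) = x^n
    return codebook

codebook = random_codebook(k=3, n=8)
for w, x in codebook.items():
    print(w, "->", x)
```

<p>The table has $$2^k$$ rows of length $$n$$, so it is only practical for tiny $$k$$; the proof never needs to build it.</p>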
<p>Richard Hamming (yes, the Hamming codes person) mentions this trick in his famous talk <a href="https://www.cs.virginia.edu/~robins/YouAndYourResearch.pdf">"You and Your Research"</a>:</p>
<blockquote>
<p><i>Courage is one of the things that Shannon had supremely. You have only to think of his major theorem. He wants to create a method of coding, but he doesn't know what to do so he makes a random code. Then he is stuck. And then he asks the impossible question, "What would the average random code do?'' He then proves that the average code is arbitrarily good, and that therefore there must be at least one good code. Who but a man of infinite courage could have dared to think those thoughts?</i></p>
</blockquote>
<p>Perhaps it doesn't quite take infinite courage, but it is definitely one hell of a simplifying trick - and the remarkable thing is that it works.</p>
<p>Here's how: let the average probability of error in decoding one of our blocks be $$\bar{p_e}$$. If we have a message $$w^k$$, the steps that happen are:</p>
<ol>
<li>We use the (randomly-constructed) encoder $$E$$ to map it to an $$x^{n}$$ using $$x^n = E(w^k)$$. Note that the set of values that $$E(w^k)$$ can take, $$\text{Range}(E)$$, is a subset of the set of all possible $$x^n$$.</li>
<li>$$x^n$$ passes through the channel to become a $$y^n$$, according to the probabilities in a block transition probability matrix like the one pictured above.</li>
<li>We guess that $$y^n$$ came from the $$x'^n \in \text{Range}(E)$$ such that the pair $$(x'^n, y^n)$$ is in the jointly typical set $$J_{n\epsilon}$$.<ol>
<li>If there isn't such an $$x'^n$$, we fail. In the diagram below, this happens if we get $$y_3$$, since $$\text{Range}(E) = \{x_1, x_2, x_3, x_4\}$$ does not contain anything jointly-typical with $$y_3$$.</li>
<li>If there is at least one wrong $$x'^n$$ that is jointly typical with $$y^n$$, we fail. In the diagram below, this happens if we get $$y_2$$, since both $$x_2$$ and $$x_3$$ are codewords the encoder might use that are jointly typical with $$y_2$$, so we don't know which one was originally transmitted over the channel.</li>
</ol>
</li>
<li>We use the decoder, which is simply the inverse of the encoder, to map to our guess $$\bar{w}^k$$ of what the original string was. Since $$x'^n \in \text{Range}(E)$$, the inverse of the encoder, $$E^{-1}$$, must be defined at $$x'^n$$. (Note that there is a chance, but a negligibly small one as $$n \to \infty$$, that in our encoder generation process we created the same codeword for two different strings, in which case the decoder can't be deterministic. We can say either: we don't care about this, because the probability of a collision goes to zero, or we can tweak the generation scheme to regenerate if there's a repeat; $$n \ge k$$ so we can always construct a repeat-free encoder.)</li>
</ol>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhsPj6X9tFdhcyWVSmCNXlAprBCMGK88hFKeTTC257jSp8XFjYf0Fgk-O6YWhEXC0BvG337MCkBQF1KIodnrWmX3iqSSWVGhkBkVUReJUfWg1f4G-6S--2iW5ydJZGHxU5HHo1gVOZUe6iWjDmUSz6sd6ugaISVnCpjWcUswvHq9OK7nusuDlaeb4LGg/s731/ArcoLinux_2022-06-25_19-12-49.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="715" data-original-width="731" height="626" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhsPj6X9tFdhcyWVSmCNXlAprBCMGK88hFKeTTC257jSp8XFjYf0Fgk-O6YWhEXC0BvG337MCkBQF1KIodnrWmX3iqSSWVGhkBkVUReJUfWg1f4G-6S--2iW5ydJZGHxU5HHo1gVOZUe6iWjDmUSz6sd6ugaISVnCpjWcUswvHq9OK7nusuDlaeb4LGg/w640-h626/ArcoLinux_2022-06-25_19-12-49.png" width="640" /></a></div>
<p>Therefore the two sources of error that we care about are:</p>
<ul>
<li><p>On step 3, we get a $$y^n$$ that is not jointly typical with the original $$x^n$$. Since $$P((x^n, y^n) \in J_{n\epsilon}) \geq 1 - \delta$$ for some $$\delta$$ that we can make arbitrarily small by increasing $$n$$, we can upper-bound this probability with $$\delta$$.</p>
</li>
<li><p>On step 3, we get a $$y^n$$ that is jointly typical with at least one wrong $$x'^n$$. We saw above that one of the properties of the jointly typical set is that if $$x^n$$ and $$y^n$$ are selected independently rather than together, the probability that they are jointly typical is only $$2^{-n(I(X;Y) - 3 \epsilon)}$$. Therefore we can upper-bound this error probability by summing the probability of "accidental" joint-typicality over the $$2^k - 1$$ possible messages that are not the original message $$w^k$$. This sum is
$$$
\sum_{w'^k \ne w^k} 2^{-n(I(X;Y) - 3 \epsilon)}$$$
$$$\le (2^{k} - 1) 2^{-n(I(X;Y) - 3 \epsilon)}$$$
$$$\le 2^{nr}2^{- n (I(X;Y) - 3 \epsilon)}$$$
$$$= 2^{nr - n(I(X;Y) - 3 \epsilon)}
$$$</p>
</li>
</ul>
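<p>To get a feel for how quickly the second error term shrinks, here is a small numerical sketch. It assumes a binary symmetric channel with crossover probability 0.1 and uniform inputs, so that $$I(X;Y) = 1 - H_2(0.1)$$; the specific values of $$r$$ and $$\epsilon$$ are arbitrary choices for illustration.</p>

```python
from math import log2

def binary_entropy(p):
    """H_2(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def collision_bound(n, r, mutual_info, eps):
    """Upper bound 2^(n r - n (I(X;Y) - 3 eps)) on the 'wrong codeword' error."""
    return 2.0 ** (n * r - n * (mutual_info - 3 * eps))

flip = 0.1                        # assumed crossover probability of the example channel
I = 1 - binary_entropy(flip)      # roughly 0.53 bits per symbol for uniform inputs
for n in (10, 100, 1000):
    print(n, collision_bound(n, r=0.4, mutual_info=I, eps=0.01))
```

<p>With $$r$$ below $$I(X;Y) - 3\epsilon$$ the bound decays exponentially in $$n$$; with $$r$$ above it, the same expression grows past 1 and tells us nothing.</p>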
<p>We have the probabilities of two events, so the probability of at least one of them happening is smaller than or equal to their sum:
$$$
\bar{p}_e \le \delta + 2^{nr - n(I(X;Y) - 3 \epsilon)}
$$$
We know we can make $$\delta$$ however small we want. We can see that if $$r < I(X;Y) - 3 \epsilon$$, then the exponent is negative and increasing $$n$$ can also make the second term negligible. This is almost Part I of the theorem, which was:</p>
<blockquote>
<p>A discrete memoryless channel has a non-negative capacity $$C=\max_{p_X} I(X;Y)$$ such that for any $$\epsilon > 0$$ and $$R < C$$, for large enough $$n$$ there's a block code of length $$n$$ and rate $$\geq R$$ and a decoder such that error probability is $$< \varepsilon$$.</p>
</blockquote>
<p>First, to put a bound involving only one constant on $$\bar{p}_e$$, let's arbitrarily say that we increase $$n$$ until $$2^{nr - n(I(X;Y) - 3 \epsilon)} \le \delta$$. Then we have
$$$
\bar{p}_e \le 2 \delta
$$$
Second, we don't care about average error probability over codes, we care about the existence of a single code that's good. We can realise that if the average error probability $$\le 2 \delta$$, there must exist at least one code, call it $$C^*$$, with average error probability $$\le 2 \delta$$.</p>
<p>Third, we don't care about average error probability over messages, but maximal error probability, so that we can get the strict $$< \varepsilon$$ error probability in the theorem. This is trickier to bound, since $$C^*$$ might somehow have very low error probability with most messages, but some insane error probability for one particular message.</p>
<p>However, here again Shannon jumps to the rescue with a bold trick: throw out half the codewords, specifically the ones with highest error probability. Since the average error probability is $$\le 2 \delta$$, every codeword in the best half of codewords must have error probability $$\le 4 \delta$$, because otherwise every codeword in the worst half would have error probability $$> 4 \delta$$, and that half would contribute more than $$\frac{1}{2} \times 4 \delta = 2 \delta$$ to the average error on its own.</p>
<p>What about the effect on our rate of throwing out half the codewords? Previously we had $$2^k = 2^{nr}$$ codewords; after throwing out half we have $$2^{nr - 1}$$, so our rate has gone from $$\frac{k}{n} = r$$ to $$\frac{nr - 1}{n} = r - \frac{1}{n}$$, a negligible decrease if $$n$$ is large.</p>
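<p>The throw-out-the-worst-half step is easy to check numerically. The sketch below uses made-up per-codeword error probabilities (mostly tiny, a few terrible) purely for illustration; the guarantee it checks is that the worst surviving codeword has error at most twice the original average.</p>

```python
import random

def expurgate(error_probs):
    """Keep the half of the codewords with the lowest error probability."""
    ranked = sorted(error_probs)
    return ranked[: len(ranked) // 2]

rng = random.Random(1)
# Hypothetical per-codeword error probabilities: mostly tiny, a few terrible.
errors = [rng.random() * 0.01 for _ in range(60)] + [0.9, 0.8, 0.7, 0.6]
average = sum(errors) / len(errors)
kept = expurgate(errors)
print(f"average error    = {average:.4f}")
print(f"worst kept error = {max(kept):.4f} (guaranteed <= {2 * average:.4f})")
```

<p>In the notation of the proof, an average of $$2\delta$$ becomes a maximum of $$4\delta$$ over the surviving codewords.</p>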
<p>What we now have is this: as $$n \to \infty$$, we can get any rate $$R < I(X;Y) - 3 \epsilon$$ with maximal error probability $$\le 4 \delta$$, and both $$\delta$$ and $$\epsilon$$ can be decreased arbitrarily close to zero by increasing $$n$$. Since we can set the distribution of $$X$$ to whatever we like (this is why it matters that we construct our random encoder by sampling from $$X$$ repeatedly), we can make $$I(X;Y) = \underset{p_X}{\max} I(X;Y)$$.</p>
<p>This is the first and most involved part of the theorem. It is also remarkably lazy: at no point do we have to go and construct an actual code, we just sit in our armchairs and philosophise about the average error probability of random codes.</p>
<h3 id="proof-of-part-ii-achievable-rates-if-you-accept-non-zero-error">Proof of Part II: achievable rates if you accept non-zero error</h3>
<p>Here's a simple code that achieves a rate higher than the capacity in a noiseless binary channel:</p>
<ol>
<li>The sender maps each length-$$nr$$ block to a block of length $$n$$ by cutting off the last $$nr - n$$ symbols.</li>
<li>The receiver reads $$n$$ symbols with error probability $$0$$, and then guesses the remaining $$nr - n$$ with bit error probability $$\frac{1}{2}$$ for each symbol. (Note: we're concerned with bit error here, unlike block error in the previous proof.)</li>
</ol>
<p>An intuition you should have is that if the probability of anything is concentrated in a small set of outcomes, you're not maximising the entropy (remember: <i>entropy is maximised by a uniform distribution</i>) and therefore also not maximising the information transfer. The above scheme concentrates a high probability of error in a small number of bits, while transmitting the rest with zero error - we should be able to do better.</p>
<p>It's not obvious how we'd start doing this. We're going to take some wisdom from the old proverb about hammers and nails, and note that the main hammer we've developed so far is a proof that we can send through the channel at a negligible error rate by increasing the size of the message. Let's turn this hammer upside down: we're going to use the decoding process to encode and the encoding process to decode. Specifically, to map from length-$$n$$ strings to the smaller length-$$k$$ strings, we use the decoding process from before:</p>
<ol>
<li>Given an $$x^n$$ to encode, we find the $$x'^n \in \text{Range}(E)$$ such that the pair $$(x^n, x'^n)$$ is in the jointly typical set $$J_{n\epsilon}$$. (Jointly typical with respect to what joint distribution? That of length-$$n$$ strings before and after being passed through the channel (here we're assuming that the input and output alphabets are equivalent). However, note that nothing actually has to pass through a channel for us to use this.)</li>
<li>We use the inverse of the encoder, $$E^{-1}$$, to map $$x'^n$$ to a length-$$k$$ string $$w^k$$ ($$x'^n \in \text{Range}(E)$$ so this is defined).</li>
</ol>
<p>To decode, we use the encoder $$E$$ to get $$\bar{x}^n = E(w^k)$$.</p>
<p>We'll find the per-bit error rate, not the per-block error rate, so we want to know how many bits are changed on average under this scheme. We're still working with the assumption of a noiseless channel, so we don't need to worry about the noise in the channel, only the error coming from our lossy compression (which is based on a joint probability distribution coming from assuming some channel, however). </p>
<p>Assume our channel has error probability $$p$$ when transmitting a symbol. Fix an $$x^n$$ and consider pairs $$(x^n, y^n)$$ in the jointly typical set. Most of the $$y^n$$ will differ from $$x^n$$ in approximately $$np$$ bits. Intuitively, this comes from the fact that for a binomial distribution, most of the probability mass is concentrated around the mean at $$np$$, and therefore the typical set contains mostly sequences with a number of errors close to this mean. Therefore, on average we should expect $$np$$ errors between the $$x^n$$ we put into the encoder and the $$x'^n$$ that it spits out. Since we assume no noise, the $$w^k = E^{-1}(x'^n)$$ we send through the channel comes back as the same, and we can do $$E(w^k) = E(E^{-1}(x'^n)) = x'^n$$ to perfectly recover $$x'^n$$. Therefore the only error is the $$np$$ wrong bits, and therefore our per-bit error rate is $$p$$.</p>
<p>Assume that, used the right way around, we have a code that can achieve a rate of $$R' = k/n$$. This rate is
$$$
R' = \max_{p_X} I(X;Y) = \max_{p_X} \big[ H(Y) - H(Y|X) \big]$$$
$$$= 1 - H_2(p)
$$$
assuming a binary code and a binary symmetric channel, and where $$H_2(p)$$ is the entropy of a two-outcome random variable with probability $$p$$ of the first outcome, or
$$$
H_2(p) = - p \log p - (1 - p) \log (1 - p).
$$$
Now since we're using it backward, we map from $$n$$ to $$k$$ bits rather than $$k$$ to $$n$$ bits, and this code has rate
$$$
\frac{1}{R'} = \frac{n}{k} = \frac{1}{1 - H_2(p)}
$$$
What we can now do is make a code that works like the following:</p>
<ol>
<li>Take a length-$$n$$ block of input.</li>
<li>Use the compressor (i.e. the typical set decoder) to map it to a smaller length-$$k$$ block.</li>
<li>Send the compressed block with some channel code that gives effectively noiseless transmission at rate $$C$$.</li>
<li>Use the decompressor (i.e. the typical set encoder) to map the recovered length-$$k$$ blocks back to length-$$n$$ blocks.</li>
</ol>
<p>In step 4, we will on average see that the recovered input differs in $$np$$ places, for a bit error probability of $$p$$. And what is our rate? We assumed the standard noiseless channel code in the middle that transmits our compressed input had the maximum rate $$C$$. However, it is transmitting strings that have already been compressed by a factor of $$\frac{k}{n}$$, so the true rate is
$$$
R = \frac{C}{1 - H_2(p)} = \frac{C}{1 + p \log p + (1 - p) \log (1 - p)}
$$$
This gives us the second part of the theorem: given a certain rate $$R$$, we can transmit at that rate with any bit error probability $$p$$ high enough that $$C / (1 - H_2(p)) \ge R$$.</p>
<p>(Note that effectively $$0 \le p < 0.5$$, because if $$p > 0.5$$ we can just flip the labels on the channel and change $$p$$ to $$1 - p$$, and if $$p = 0.5$$ we're transmitting no information.)</p>
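<p>As a rough numerical check, the sketch below compares the cut-off-and-guess scheme from the start of this section with the rate formula $$R = C / (1 - H_2(p))$$, on a noiseless binary channel ($$C = 1$$). The helper names and the bisection search are assumptions made for illustration.</p>

```python
from math import log2

def binary_entropy(p):
    """H_2(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def max_rate(capacity, p):
    """R(p) = C / (1 - H_2(p)) for a bit error probability 0 <= p < 0.5."""
    return capacity / (1 - binary_entropy(p))

def truncation_bit_error(rate):
    """Bit error of cut-off-and-guess on a noiseless binary channel (rate >= 1)."""
    return 0.5 * (1 - 1 / rate)       # a fraction 1 - 1/rate of bits is guessed

def bit_error_at_rate(capacity, rate, tol=1e-12):
    """The p at which C / (1 - H_2(p)) equals the given rate, by bisection."""
    if rate <= capacity:
        return 0.0
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # R(mid) < rate is rewritten without division to stay safe near p = 0.5
        if rate * (1 - binary_entropy(mid)) > capacity:
            lo = mid
        else:
            hi = mid
    return hi

print(truncation_bit_error(2.0))      # guessing half the bits
print(bit_error_at_rate(1.0, 2.0))    # bit error from the rate formula at R = 2
```

<p>At twice the capacity, guessing half the bits gets a quarter of them wrong, while the formula corresponds to a bit error of roughly 0.11.</p>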
<h3 id="proof-of-part-iii-unachievable-rates">Proof of Part III: unachievable rates</h3>
<p>Note that the pipeline is a Markov chain (i.e. each step depends only on the previous step):</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVptKjd3xnPIq2F_8kfByNH3C96QL3mz0C3z-bXyTNEECMZHNxQwqHRusw6Mw5jxrNbT9k9L6OC8qFQuYkLr72mSoJiti9072A9B_HT6twHNku1gxJFIJ45WcEtJy7WuNMcr4MQNVZ7gi_KzuscQq9kcsTKQnbs9oAKN0oViBImC74qaxLxB273U_log/s1094/ArcoLinux_2022-06-25_19-19-53.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="239" data-original-width="1094" height="140" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVptKjd3xnPIq2F_8kfByNH3C96QL3mz0C3z-bXyTNEECMZHNxQwqHRusw6Mw5jxrNbT9k9L6OC8qFQuYkLr72mSoJiti9072A9B_HT6twHNku1gxJFIJ45WcEtJy7WuNMcr4MQNVZ7gi_KzuscQq9kcsTKQnbs9oAKN0oViBImC74qaxLxB273U_log/w640-h140/ArcoLinux_2022-06-25_19-19-53.png" width="640" /></a></div>
<p>Therefore, the data processing inequality applies (for more on that, search for "data" <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">here</a>). With one application we get
$$$
I(w^k; \bar{w}^k) \le I(w^k; y^n)
$$$
and with another
$$$
I(w^k; y^n) \le I(x^n; y^n)
$$$
which combine to give
$$$
I(w^k; \bar{w}^k) \le I(x^n; y^n).
$$$
By the definition of channel capacity, $$I(x^n; y^n) \le nC$$ (remember that the definition is about mutual information between $$X$$ and $$Y$$, so <i>per-symbol</i> information), and so given the above we also have $$I(w^k; \bar{w}^k) \le nC$$.</p>
<p>With a rate $$R$$, we send over $$nR$$ bits of information, and if the per-bit error probability is $$p$$, the receiver still has to learn at least $$nR (1 - H_2(p))$$ of those bits. Therefore $$I(w^k; \bar{w}^k) \ge nR(1 - H_2(p))$$, and since $$I(w^k; \bar{w}^k) \le nC$$,
$$$
nR(1-H_2(p)) > nC
$$$
is a contradiction, which implies that
$$$
R > \frac{C}{1 - H_2(p)}
$$$
is a contradiction.</p>
<h2 id="gaussian-channels">Continuous entropy and Gaussian channels</h2>
<p>And now, for something completely different.</p>
<p>We've so far talked only about the entropy of discrete random variables. However, there is a very common case of channel coding that deals with continuous random variables: sending a continuous signal, like sound.</p>
<p>So: forget our old boring discrete random variable $$X$$, and bring in a brand-new continuous random variable that we will call ... $$X$$. How much information do you get from observing $$X$$ land on a particular value $$x$$? You get infinite information, because $$x$$ is a real number with an endless sequence of digits; alternatively, the Shannon information is $$- \log p(x)$$, and the probability of $$X=x$$ is infinitesimally small for a continuous random variable, so the Shannon information is $$-\log 0$$ which is infinite. Umm.</p>
<p>Consider calculating the entropy for a continuous variable, which we will denote $$h(X)$$ to distinguish it from the discrete case, and define in the obvious way by replacing sums with integrals:
$$$
h(X) = -\int_{-\infty}^\infty f(x) \log f(x) \mathrm{d}x
$$$
where $$f$$ is the probability density function. If we actually derive this as the limit of the discrete entropy, by splitting the real line into bins of width $$\Delta$$ and letting $$\Delta \to 0$$, we get this integral plus a constant term $$-\log \Delta$$ that goes to infinity.</p>
<p>As principled mathematicians, we might be concerned about this. But we can mostly ignore it, especially as the main thing we want is $$I(X;Y)$$, and
$$$
I(X;Y) = h(Y) - h(Y|X) = -\int f_Y(y) \log f_Y(y) \mathrm{d}y + \iint f_{X,Y}(x,y) \log f_{Y|X=x}(y) \mathrm{d}x \mathrm{d}y
$$$</p>
<p>where <i>mumble mumble</i> the infinities cancel out <i>mumble</i> opposite signs <i>mumble</i>.</p>
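<p>One way to see where the infinite constant comes from is to round a continuous variable into bins of width $$\Delta$$ and compute the ordinary discrete entropy. The sketch below does this for a standard normal; the bin layout and the cut-off at ten standard deviations are assumptions made for illustration. The discrete entropy grows like $$-\log \Delta$$ as the bins shrink, and what is left after adding $$\log \Delta$$ back is $$h(X)$$.</p>

```python
from math import erf, sqrt, log2, pi, e

def gaussian_cdf(x, sigma=1.0):
    """CDF of N(0, sigma^2)."""
    return 0.5 * (1 + erf(x / (sigma * sqrt(2))))

def quantised_entropy(delta, sigma=1.0, span=10.0):
    """Discrete entropy (bits) of N(0, sigma^2) rounded into bins of width delta."""
    n_bins = int(round(2 * span * sigma / delta))
    total = 0.0
    for i in range(n_bins):
        left = -span * sigma + i * delta
        prob = gaussian_cdf(left + delta, sigma) - gaussian_cdf(left, sigma)
        if prob > 0:                  # skip far-tail bins that underflow to zero
            total -= prob * log2(prob)
    return total

h = 0.5 * log2(2 * pi * e)            # differential entropy of N(0, 1) in bits
for delta in (1.0, 0.1, 0.01):
    H = quantised_entropy(delta)
    print(f"delta={delta}: H={H:.3f}, H + log2(delta)={H + log2(delta):.3f}, h={h:.3f}")
```

<p>Because the same $$-\log \Delta$$ appears in both $$h(Y)$$ and $$h(Y|X)$$, it drops out of their difference, which is the cancellation being mumbled about.</p>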
<h3 id="signals">Signals</h3>
<p>With discrete random variables, we generally had some fairly obvious set of values that they could take. With continuous random variables, we usually deal with an unrestricted range - a radio signal could technically be however low or high. However, step down from abstract maths land, and you realise reality isn't as hopeless as it seems at first. Emitting a radio wave, or making noise, takes some input of energy, and the source has only so much power.</p>
<p>For waves (like radio waves and sound waves), power is proportional to the square of the amplitude of a wave. The variance $$\mathbb{V}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2] = \int f(x) (x - \mathbb{E}[X])^2 \mathrm{d}x$$ of a continuous random variable $$X$$ with probability density function $$f$$ is just the expected squared difference between the value and its mean. Both of these quantities involve squaring a deviation. It turns out that the power of our source and the variance of the random variable that represents it are proportional.</p>
<p>Our model of a continuous noisy channel is one where there's an input signal $$X$$, a source of noise $$N$$, and an output signal $$Y = X + N$$. As usual, we want to maximise the channel capacity $$C = \max_{p_X} I(X;Y)$$, which is done by maximising
$$$
I(X;Y) = h(Y) - h(Y|X).
$$$
Because noise is generally the sum of a bunch of small contributing factors in each direction, we model it as following a normal distribution with mean 0 and variance $$\sigma_N^2$$. Because the only source of uncertainty in $$Y$$ given $$X$$ is $$N$$, and $$N$$ has the same distribution regardless of $$X$$, $$h(Y|X)$$ depends only on $$N$$ and not at all on $$X$$, so the only thing we can affect is $$h(Y)$$.</p>
<p>Therefore, the question of how you maximise channel capacity turns into a question of how to maximise $$h(Y)$$ given that $$Y = X + N$$ with $$N \sim \mathcal{N}(0, \sigma_N^2)$$. If we were working without any power/variance constraints, we'd already know the answer: just make $$X$$ such that $$Y$$ is a uniform distribution (which in this case would mean making $$Y$$ a uniform distribution over all real numbers, something that's clearly a bit wacky). However, we have a constraint on power and therefore the variance of $$X$$.</p>
<p>If we were to do some algebra involving Lagrange multipliers, we would eventually find that we want the distribution of $$X$$ to be a normal distribution. A key property of normal distributions is that if $$X \sim \mathcal{N}(0, \sigma_X^2)$$ (assume the mean is 0; note you can always shift your scale) and $$N \sim \mathcal{N}(0, \sigma_N^2)$$ are independent, then $$X + N \sim \mathcal{N}(0, \sigma_X^2 + \sigma_N^2)$$. Therefore the basic principle behind efficiently transmitting information using a continuous signal is that you want to transform your input to follow a normal distribution.</p>
<p>If you do, what do you get? Start with
$$$
I(X;Y) = h(Y) - h(Y|X)
$$$
and now use the "standard" integral that
$$$
\int f(z) \log f(z) \mathrm{d}z = -\frac{1}{2} \log (2 \pi e \sigma^2)
$$$
if $$z$$ is drawn from a distribution $$\mathcal{N}(0, \sigma^2)$$ with density $$f$$, and therefore
$$$
\max I(X;Y) = C = \frac{1}{2} \log (2 \pi e (\sigma_X^2 + \sigma_N^2)) - \frac{1}{2} \log (2 \pi e \sigma_N^2)
$$$
using the fact that $$h(Y|X) = h(N)$$ since the information content of the noise is all that is unknown about $$Y$$ if we're given $$X$$, and the property of normal distributions mentioned above. We can do some algebra to get the above into the form
$$$
C = \frac{1}{2} \log \left(\frac{2 \pi e (\sigma_X^2 + \sigma_N^2)}{2 \pi e \sigma_N^2}\right) = \frac{1}{2} \log \left( 1 + \frac{\sigma_X^2}{\sigma_N^2}\right)
$$$
The variance is proportional to the power, so this can also be written in terms of power as
$$$
C = \frac{1}{2} \log \left( 1 + \frac{S}{N}\right)
$$$
if $$S$$ is the power of the signal and $$N$$ is the power of the noise. The units of capacity for the discrete case were bits per symbol; here they're bits per sample of the signal (that is, per use of the channel). A sanity check is that if $$S = 0$$, we transmit $$\frac{1}{2} \log (1) = 0$$ bits, which makes sense: if your signal power is 0, it has no effect, and no one is going to hear you.</p>
<p>An interesting consequence here is that increasing signal power only gives you a logarithmic improvement in how much information you can transmit. If you shout twice as loud, you can detect approximately twice as fine-grained peaks and troughs in the amplitude of your voice. However, this helps surprisingly little.</p>
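<p>A quick sketch of how little extra power buys, using $$C = \frac{1}{2} \log_2 (1 + S/N)$$ with the noise power fixed at 1. The signal powers in the loop are arbitrary example values.</p>

```python
from math import log2

def gaussian_capacity(signal_power, noise_power):
    """C = 1/2 * log2(1 + S/N), in bits."""
    return 0.5 * log2(1 + signal_power / noise_power)

noise = 1.0
for signal in (1, 2, 4, 8, 1000, 2000):
    print(signal, round(gaussian_capacity(signal, noise), 3))
```

<p>Doubling the signal power never adds more than half a bit, however large the power already is, since $$\frac{1}{2} \log_2 \frac{1 + 2S/N}{1 + S/N} < \frac{1}{2} \log_2 2$$.</p>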
<p>If you want to communicate at a really high capacity, there are better things you can do than shouting very loudly. You can decompose a signal into frequency components using the Fourier transform. If your signal consists of many different frequencies, you can effectively transmit a different amplitude on each of them at once. The range of frequencies that your signal can span is called the bandwidth and is denoted $$W$$. A signal with bandwidth $$W$$ is fully described by $$2W$$ samples per second (the Nyquist rate), each of which counts as one use of the channel above, so the capacity in bits per second becomes
$$$
C = 2W \cdot \frac{1}{2} \log \left(1 + \frac{S}{N}\right) = W \log \left(1 + \frac{S}{N}\right)
$$$
This is the Shannon-Hartley theorem. For a fixed signal-to-noise ratio, capacity grows linearly with bandwidth but only logarithmically with power, so if you want to transmit information, transmitting across a broad range of frequencies is much more effective than shouting loudly. There's a metaphor here somewhere.</p>
<p><b>Information theory 2: source coding</b> (published 2022-06-25, updated 2023-03-26)</p>
<p style="text-align: center;"><span style="font-size: x-small;">6.9k words, including equations (~36min)</span></p>
<p>In <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">the previous post</a>, we saw the basic information theory model:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJXTmanOlPkocQ5FGq2OL6Tpxe-qlKxs3MIQ0zBnKTk2JvkkLshofA86XeZiNoaa64veATnEBMIfChv5OcUAD6QTPZEpRmtV2b_jhSb_8XDs9PYBAcOdAYmnKDrrrcAxbuXthKVax_gAacxX360xcDRrsLbxGEZdGKaHo24f7itvDpI9k-cbPBYoKHoQ/s1104/ArcoLinux_2022-06-02_12-57-01.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="205" data-original-width="1104" height="118" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJXTmanOlPkocQ5FGq2OL6Tpxe-qlKxs3MIQ0zBnKTk2JvkkLshofA86XeZiNoaa64veATnEBMIfChv5OcUAD6QTPZEpRmtV2b_jhSb_8XDs9PYBAcOdAYmnKDrrrcAxbuXthKVax_gAacxX360xcDRrsLbxGEZdGKaHo24f7itvDpI9k-cbPBYoKHoQ/w640-h118/ArcoLinux_2022-06-02_12-57-01.png" width="640" /></a></div><br />
<p>If we have no noise in the channel, we don't need channel coding. Therefore the above model simplifies to</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1KF_NS28wo86Zclsq33a5hrIUMddggoRFHqCEAwffiTunltbEaON-d4I11qhEUFiu8ChkRqdkKC5f75leUGSkLq-Ysv7R_O2-QRIt-NMO42HyC13pVaMnninN6qyZMr4yIicxO5Iy9962Fmlt-Cczhh5tb2ye5rJPgOQNOdECo0LbnuzBNgjRnr1bzg/s716/ArcoLinux_2022-06-02_12-57-44.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="167" data-original-width="716" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1KF_NS28wo86Zclsq33a5hrIUMddggoRFHqCEAwffiTunltbEaON-d4I11qhEUFiu8ChkRqdkKC5f75leUGSkLq-Ysv7R_O2-QRIt-NMO42HyC13pVaMnninN6qyZMr4yIicxO5Iy9962Fmlt-Cczhh5tb2ye5rJPgOQNOdECo0LbnuzBNgjRnr1bzg/w640-h150/ArcoLinux_2022-06-02_12-57-44.png" width="640" /></a></div>
<p>and the goal is to minimise $$n$$ - that is, minimise the number of symbols we need to send - without needing to worry about being robust to any errors.</p>
<p>Here's one question to get started: imagine we're working with a compression function $$f_e$$ that acts on length-$$n$$ strings (that is, sequences of symbols) with some arbitrary alphabet size $$A$$ (that is, $$A$$ different types of symbols). Is it possible to build an encoding function $$f_e$$ that compresses every possible input? Clearly not; imagine that it took every length-$$n$$ string to a length-$$m$$ string using the same alphabet, with $$m < n$$. Then we'd have $$A^m$$ different available codewords that would need to code for $$A^n > A^m$$ different messages. By the pigeonhole principle, there must be at least one codeword that codes for more than one message. But that means that if we see this codeword, we can't be sure what it codes for, so we can't recover the original with certainty.</p>
<p>Therefore, we have a choice. We can either:</p>
<ul>
<li>do <i>lossy compression</i>, where every message shrinks in size but we can't recover information perfectly; or</li>
<li>do <i>lossless compression</i>, and hope that more messages shrink in size than expand in size.</li>
</ul>
<p>This is obvious with lossless compression, but applies to both: if you want to do them well, you generally need a probability model for what your data looks like, or at least something that approximates one.</p>
<h2 id="terminology">Terminology</h2>
<p>When we talk about a "code", we just mean something that maps messages (the $$Z$$ in the above diagram) to a sequence of symbols. A code is <b>nonsingular</b> if no two messages are mapped to the same sequence of symbols.</p>
<p>A <b>symbol code</b> is a code where each symbol in the message maps to a codeword, and the code of a message is the concatenation of the codewords of the symbols that it is made of.</p>
<p>A <b>prefix code</b> is a code where no codeword is a prefix of another codeword. Prefix codes are also called <b>instantaneous codes</b>, because when decoding, you can decode a codeword to a symbol as soon as the code symbols read so far form a codeword.</p>
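<p>Here is a minimal sketch of both halves of that definition: checking that a set of codewords is prefix-free, and decoding "instantaneously" by emitting a symbol the moment a codeword is complete. The four-symbol code is an arbitrary example chosen for illustration.</p>

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    words = sorted(codewords)
    # After sorting, any prefix relationship shows up between neighbours.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

def decode(bits, code):
    """Instantaneously decode a bit string using a prefix code {symbol: codeword}."""
    lookup = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:         # a codeword is complete: emit right away
            symbols.append(lookup[current])
            current = ""
    if current:
        raise ValueError("input ended in the middle of a codeword")
    return symbols

code = {"a": "0", "b": "10", "c": "110", "d": "111"}   # example code for illustration
encoded = "".join(code[s] for s in "abacad")
print(is_prefix_free(code.values()), decode(encoded, code))
```

<p>The decoder never needs to look ahead, which is exactly what fails for a code like {0, 01}, where seeing a 0 doesn't tell you whether the codeword has ended.</p>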
<h2 id="useful-basic-results-in-lossless-compression">Useful basic results in lossless compression</h2>
<h3 id="kraft-s-inequality">Kraft's inequality</h3>
<p>Kraft's inequality states that a prefix code with an alphabet of size $$D$$ and code words of lengths $$l_1, l_2, \ldots, l_n$$ satisfies
$$$
\sum_{i=1}^n D^{-l_i} \leq 1,
$$$
and conversely that if there is a set of lengths $$\{l_1, \ldots, l_n\}$$ that satisfies the above inequality, there exists a prefix code with those codeword lengths. We will only prove the first direction: that all prefix codes satisfy the above inequality.</p>
<p>Let $$l = \max_i l_i$$ and consider the tree with branching factor $$D$$ and depth $$l$$. This tree has $$D^l$$ nodes on the bottom level. Each codeword $$x_1x_2 \ldots x_c$$ corresponds to the node in this tree that you get to by choosing the $$d_i$$th branch on the $$i$$th level, where $$d_i$$ is the index of symbol $$x_i$$ in the alphabet. Since it must be a prefix code, no node that is a descendant of a node that is a codeword can be a codeword. We can define our "budget" as the $$D^l$$ nodes on the bottom level of the tree, and define the "cost" of each codeword as the number of nodes on the bottom level of the tree that are descendants of its node (counting the node itself if it is on the bottom level). A codeword of length $$l$$ has cost 1, and in general a codeword at level $$l_i$$ has cost $$D^{l - l_i}$$. From this, and the prefix-freeness (which means no bottom-level node is paid for twice), we get
$$$
\sum_i D^{l - l_i} \leq D^l
$$$
which becomes the inequality when you divide both sides by $$D^l$$.</p>
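<p>The sketch below computes the Kraft sum exactly and also illustrates the converse direction, which is not proved here, using the standard "canonical code" construction for binary alphabets: sort the lengths and hand out the next free node at each depth. The function names are assumptions made for illustration, and lengths are assumed to be positive integers.</p>

```python
from fractions import Fraction

def kraft_sum(lengths, alphabet_size=2):
    """Sum of D^(-l_i), computed exactly with fractions."""
    return sum(Fraction(1, alphabet_size ** l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Build a binary prefix code with the given lengths, if Kraft's inequality allows."""
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate Kraft's inequality")
    codewords, value, prev = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - prev)          # descend to depth l in the tree
        codewords.append(format(value, f"0{l}b"))
        value += 1                    # move to the next free node at this depth
        prev = l
    return codewords

print(kraft_sum([1, 2, 3, 3]))                 # exactly uses up the budget
print(prefix_code_from_lengths([1, 2, 3, 3]))
print(kraft_sum([1, 1, 2]))                    # over budget: no prefix code exists
```

<p>Lengths like 1, 1, 2 overspend the budget, which matches the tree picture: two codewords of length 1 already claim every leaf.</p>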
<h3 id="gibbs-inequality">Gibbs' inequality</h3>
<p>Gibbs' inequality states that for any two probability distributions $$p$$ and $$q$$,
$$$
-\sum_i p_i \log p_i \leq - \sum_i p_i \log q_i
$$$
which can be written using the relative entropy $$D$$ (also known as the KL distance/divergence) as
$$$
\sum_i p_i \log \frac{p_i}{q_i} = D(p||q) \geq 0.
$$$
This can be proved using the <a href="https://en.wikipedia.org/wiki/Log_sum_inequality">log sum inequality</a>. The proof is boring.</p>
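<p>In place of the boring proof, a numerical spot check may be reassuring. The two distributions below are arbitrary examples; the function computes $$D(p||q)$$ in bits and assumes $$q_i > 0$$ wherever $$p_i > 0$$.</p>

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(kl_divergence(p, q), kl_divergence(q, p), kl_divergence(p, p))
```

<p>Both orderings give a positive number (and not the same one, which is why "distance" is a slightly misleading name), and the divergence of a distribution from itself is zero.</p>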
<h3 id="minimum-expected-length-of-a-symbol-code">Minimum expected length of a symbol code</h3>
<p>We want to minimise the expected length of our code $$C$$ for each symbol that $$X$$ might output. The expected length is $$L(C,X) = \sum_i p_i l_i$$. Now one way to think of what a length $$l_i$$ means is using the correspondence between prefix codes and binary trees discussed above. Given the prefix requirement, the higher the level in the tree (and thus the shorter the length of the codeword) the more other options we block out in the tree. Therefore we can think of the collection of lengths we assign to our codewords as specifying a rough probability distribution that assigns probability in proportion to $$2^{-l_i}$$. What we'll do is introduce a variable $$q_i$$ that measures the "implied probability" in this way (note the division by a normalising constant):
$$$
q_i = \frac{2^{-l_i}}{\sum_j 2^{-l_j}} = \frac{2^{-l_i}}{z}
$$$
where in the second step we've just defined $$z = \sum_j 2^{-l_j}$$ to be the normalising constant. Now $$l_i = - \log (z q_i) = -\log q_i - \log z$$, so
$$$
L(C,X) = \sum_i (-p_i \log q_i) - \log z
$$$
Now we can apply Gibbs' inequality to know that $$\sum_i(- p_i \log q_i) \geq \sum_i (-p_i \log p_i)$$ and Kraft's inequality to know that $$\log z = \log \big(\sum_i 2^{-l_i} \big) \leq \log(1)=0$$, so we get
$$$
L(C,X) \geq -\sum_i p_i \log p_i = H(X).
$$$
Therefore the entropy (with base-2 $$\log$$) of a random variable is a lower bound on the expected length of a codeword (in a 2-symbol alphabet) that represents the outcome of that random variable. (And more generally, entropy with base-$$d$$ logarithms is a lower bound on the length of a codeword for the result in a $$d$$-symbol alphabet.)</p>
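<p>A small check of the bound, using two made-up distributions and the codeword lengths 1, 2, 3, 3 (which satisfy Kraft's inequality with equality). The first distribution is chosen so that $$p_i = 2^{-l_i}$$ exactly; the second is not.</p>

```python
from math import log2

def entropy(p):
    """H(X) in bits."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def expected_length(p, lengths):
    """L(C, X) = sum_i p_i * l_i."""
    return sum(pi * li for pi, li in zip(p, lengths))

lengths = [1, 2, 3, 3]
dyadic = [0.5, 0.25, 0.125, 0.125]    # p_i = 2^(-l_i): the bound is met exactly
skewed = [0.7, 0.1, 0.1, 0.1]         # same lengths, mismatched probabilities
print(entropy(dyadic), expected_length(dyadic, lengths))
print(entropy(skewed), expected_length(skewed, lengths))
```

<p>The gap in the second case is what Gibbs' inequality measures: the lengths imply a distribution $$q$$ that differs from the true $$p$$, and we pay $$D(p||q)$$ extra bits per symbol for the mismatch.</p>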
<h2 id="huffman-coding">Huffman coding</h2>
<p>Huffman coding is a very pretty concept.</p>
<p>We saw above that if you're making a random variable for the purpose of gaining the most information possible, you should prepare your random variable to have a uniform probability distribution. This is because entropy is maximised by a uniform distribution, and the entropy of a random variable is the average amount of information you get by observing it.</p>
<p>The reason why, say, encoding English characters as 5-bit strings (A = 00000, B = 00001, ..., Z = 11001, and then using the remaining 6 codes for punctuation or cat emojis or whatever) is not optimal is that some of those 5-bit strings are more likely than others. On a symbol-by-symbol level, whether the first symbol is a 0 or a 1 is not equiprobable. To get an ideal code, each symbol we send should have equal probability (or as close to equal probability as we can get).</p>
<p>Robert Fano, of <a href="https://en.wikipedia.org/wiki/Fano%27s_inequality">Fano's inequality</a> fame, and Claude Shannon, of everything-in-information-theory fame, had tried to find an efficient general coding scheme in the early 1950s. They hadn't succeeded. Fano set it as an alternative to taking the final exam for his information theory class at MIT. David Huffman tried for a while, and had almost given up and started studying instead, when he came up with Huffman coding and quickly proved it to be optimal.</p>
<p>We want the first code symbol (a binary digit) to divide the space of possible message symbols (the English letters, say) into two equally likely parts, the first two to divide it into four, the first three into eight, and so on. Now some message symbols are going to be more likely than others, so the codes for some symbols have to be longer. We don't want it to be ambiguous when we get to the end of a codeword, so we want a prefix-free code. Prefix-free codes with a size-$$d$$ alphabet can be represented as trees with branching factor $$d$$, where each leaf is one codeword:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5Yb7bY9lY4ykPbwxKQurPGfN2KW51vlyHu1c-1MWUNuUYkCXWRev6uCSZOKooetoenZPkNvf6O1Ygk-l3at3Gt4iBgfQJeyhx-XR_5t4ZmY5HUWYUh47CrBB5ka5WieNK4_ANcRPcdXRsAt8o3D1TNsZBQGQkuW_9J62iZ9hr41bi8T2961-xXCIvxQ/s788/ArcoLinux_2022-06-25_18-09-46.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="506" data-original-width="788" height="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5Yb7bY9lY4ykPbwxKQurPGfN2KW51vlyHu1c-1MWUNuUYkCXWRev6uCSZOKooetoenZPkNvf6O1Ygk-l3at3Gt4iBgfQJeyhx-XR_5t4ZmY5HUWYUh47CrBB5ka5WieNK4_ANcRPcdXRsAt8o3D1TNsZBQGQkuW_9J62iZ9hr41bi8T2961-xXCIvxQ/w640-h410/ArcoLinux_2022-06-25_18-09-46.png" width="640" /></a></div>
<p>Above, we have $$d=2$$ (i.e. binary), and six items to code for (<code>a</code>, <code>b</code>, <code>c</code>, <code>d</code>, <code>e</code>, and <code>f</code>), and six codewords with lengths of between 1 and 4 characters in the codeword alphabet.</p>
<p>Each codeword is associated with some probability. We can define the weight of a leaf node to be its probability (or just how many times it occurs in the data) and the weight of a non-leaf node to be the sum of the weights of all leaves that are downstream of it in the tree. For an optimal prefix-free code, all we need to do is make sure that each node has children that are as equally balanced in weight as possible.</p>
<p>The best way to achieve this is to work bottom-up. Start without any tree, just a collection of leaf nodes representing the symbols you want codewords for. Then repeatedly build a node uniting the two least-likely parentless nodes in the tree, until the tree has a root.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeWVP1zFZw9DdXXLhr5FiCaTY_sLjPbbEKv6w--pr3gYntbL-xWziuCOb8gz8u92YxSpF3EZWQ22-_3NSzzfen0a0qeO3rouiUPsvoOs_rvQQYFUj5DyANdtJtAPTUELkmIcuIc3lPaJegCj0ydkOV9gur3mKxIw9YxWwiOnMBXCWeWFfPTRHYPEJm1g/s729/ArcoLinux_2022-06-25_18-12-46.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="538" data-original-width="729" height="472" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeWVP1zFZw9DdXXLhr5FiCaTY_sLjPbbEKv6w--pr3gYntbL-xWziuCOb8gz8u92YxSpF3EZWQ22-_3NSzzfen0a0qeO3rouiUPsvoOs_rvQQYFUj5DyANdtJtAPTUELkmIcuIc3lPaJegCj0ydkOV9gur3mKxIw9YxWwiOnMBXCWeWFfPTRHYPEJm1g/w640-h472/ArcoLinux_2022-06-25_18-12-46.png" width="640" /></a></div>
<p>Above, the numbers next to the non-leaf nodes show the order in which the node was created. This set of weights on the leaf nodes creates the same tree structure as in the previous diagram.</p>
<p>(We could also try to work top-down, creating the tree from the root to the leaves rather than from the leaves to the root, but this turns out to give slightly worse results. Also the algorithm for achieving this is less elegant.)</p>
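<p>The bottom-up procedure described above can be sketched in a few lines of Python using a priority queue. This is a minimal illustration, not a production encoder: the symbol counts are invented, symbols are assumed to be strings, and the exact codewords depend on how ties between equal weights happen to be broken.</p>

```python
import heapq
from itertools import count

def huffman_code(weights):
    """Build a binary Huffman code from a dict {symbol: weight}."""
    tiebreak = count()  # unique ids, so equal weights never compare subtrees
    heap = [(w, next(tiebreak), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A lone symbol still needs a non-empty codeword.
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Unite the two least-likely parentless nodes under a new node.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    root = heap[0][2]

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):   # internal node: (left child, right child)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                         # leaf: a symbol
            codes[node] = prefix

    walk(root, "")
    return codes

# Hypothetical symbol counts.
counts = {"a": 8, "b": 4, "c": 4, "d": 2, "e": 1, "f": 1}
codes = huffman_code(counts)
for sym in sorted(codes):
    print(sym, codes[sym])
```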
<h2 id="arithmetic-coding">Arithmetic coding</h2>
<p>The Huffman code is the best symbol code - that is, a code where every symbol in the message gets associated with a codeword, and the code for the entire message is simply the concatenation of all the codewords of its symbols.</p>
<p>Symbol codes aren't always great, though. Consider encoding the output of a source that has a lot of runs like "<code>aaaaaaaaaahaaaaahahahaaaaa</code>" (a source of such messages might be, for example, a transcription of what a student says right before finals). The Huffman coding for this message is, for example, that "a" maps to a 0, and "h" maps to a 1, and you have achieved a compression of exactly 0%, even though intuitively those long runs of "a"s could be compressed.</p>
<p>One obvious thing you could do is run-length encoding, where long blocks of a character get compressed into a code for the character plus a code for how many times the character is repeated; for example the above might become "<code>10a1h5a1h1a1h1a1h5a</code>". However, this is only a good idea if there are lots of runs, and requires a bunch of complexity (e.g. your alphabet for the codewords must either be something more than binary, or else you need to be able to express things like lengths and counts in binary unambiguously, possibly using a second layer of encoding with a symbol code).</p>
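<p>For concreteness, a minimal Python sketch of run-length encoding in the count-then-character format used above. It assumes the text itself contains no digit characters, since otherwise the counts would be ambiguous; the function names are made up for this example.</p>

```python
import re
from itertools import groupby

def rle_encode(text):
    """Replace each run of a character with '<count><char>'."""
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(text))

def rle_decode(encoded):
    """Invert rle_encode; assumes the original text had no digits."""
    return "".join(ch * int(n) for n, ch in re.findall(r"(\d+)(\D)", encoded))

message = "a" * 10 + "h" + "a" * 5 + "hahah" + "a" * 5
encoded = rle_encode(message)
print(encoded)
print(rle_decode(encoded) == message)
```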
<p>Another problem with Huffman codes is that the code is based on assuming an unchanging probability model across the entire length of the message that is being encoded. This might be a bad assumption if we're encoding, for example, long angry Twitter threads, where the frequency of exclamation marks and capital letters increases as the message continues. We could try to brute-force a solution, such as splitting the message into chunks and fitting a Huffman code separately to each chunk, but that's not very elegant. Remember how elegant Huffman codes feel as a solution to the symbol coding problem? We'd rather not settle for less.</p>
<p>The fundamental idea of arithmetic coding is that we send a number representing where on the cumulative probability distribution of all messages the message we want to send lies. This is a dense statement, so we will unpack it with an example. Let's say our alphabet is $$A = \{a, r, t\}$$. To establish an ordering, we'll just say we consider the alphabet symbols in alphabetic order. Now let's say our probability distribution for the random variable $$X$$ looks like the diagram on the left; then our cumulative probability distribution looks like the diagram on the right:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjywjIVu-7eMsCaKyb1GdbLQWnmArUkgLMlhXAxdUleRkJynbd2RErOQkH7Dm3h1Dcb0Q6ynn1G36oTJP-58fj_9Kkd5ryBn0AMThBKSqADP42dkEjPB6ln-lv-wLJ-pUYZIxn6V3zXBAK6zJQIAd-zPOWxvf1aI2nvMVVnse1QCc-WWwM3XQJ__JQeuw/s1038/ArcoLinux_2022-06-21_21-42-25.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="397" data-original-width="1038" height="244" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjywjIVu-7eMsCaKyb1GdbLQWnmArUkgLMlhXAxdUleRkJynbd2RErOQkH7Dm3h1Dcb0Q6ynn1G36oTJP-58fj_9Kkd5ryBn0AMThBKSqADP42dkEjPB6ln-lv-wLJ-pUYZIxn6V3zXBAK6zJQIAd-zPOWxvf1aI2nvMVVnse1QCc-WWwM3XQJ__JQeuw/w640-h244/ArcoLinux_2022-06-21_21-42-25.png" width="640" /></a></div>
<p>One way to specify which of $$\{a, r, t\}$$ we mean is to pick a number $$0 \leq c < 1$$, and then look at which range it corresponds to on the $$y$$-axis of the right-hand figure; $$0 \leq c < 0.5$$ implies $$a$$, $$0.5 \leq c < 0.7$$ implies $$r$$, and $$0.7 \leq c < 1$$ implies $$t$$. We don't need to send the leading 0 because it is always present, and for simplicity we'll transmit the following decimals in binary; 0.0 becomes "0", 0.5 becomes "1", 0.25 becomes "01", and 0.875 is "111".</p>
<p>Note that at this point we've almost reinvented the Huffman code. $$a$$ has the most probability mass and can be represented in one symbol. $$r$$ happens to be representable in one symbol ("1" corresponds to 0.5 which maps to $$r$$) as well even though it has the least probability mass, which is definitely inefficient but not too bad. $$t$$ takes 2: "11".</p>
<p>The real benefit begins when we have multi-character messages. The way we can do it is like this, recursively splitting the number range between 0 and 1 into smaller and smaller chunks:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNxucz-lsT05H5PjrTHMM3NELAQILz9mDGgl4ED_sVaBcqbEoBTTfjMvzYrcUMNywDcT-OzlniqA4RkS-toShHZBNYhFzn744YFxx0oPYVj-FOJKRsLlU28RvU5bID5019UBvjQwVzmgqlpOHbmC-fN2bvTfqhj81PN0w5qIDzKEkbsjpr4e6Rye56bA/s969/ArcoLinux_2022-06-21_21-43-17.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="349" data-original-width="969" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNxucz-lsT05H5PjrTHMM3NELAQILz9mDGgl4ED_sVaBcqbEoBTTfjMvzYrcUMNywDcT-OzlniqA4RkS-toShHZBNYhFzn744YFxx0oPYVj-FOJKRsLlU28RvU5bID5019UBvjQwVzmgqlpOHbmC-fN2bvTfqhj81PN0w5qIDzKEkbsjpr4e6Rye56bA/w640-h230/ArcoLinux_2022-06-21_21-43-17.png" width="640" /></a></div>
<p>We see possible numbers encoding "art", "rat", and "tar". Not only that, but we see that all messages we send are infinite in length, as we can just keep going down, adding more and more letters. At first this might seem like a great deal - send one number, get infinite symbols transmitted for free! However, there's a real difference between "art" and "artrat", so we want to be able to know when to stop as well.</p>
<p>A simple answer is that the message also includes some code encoding how many symbols to decode for. A more elegant answer is that we can keep our message as just one number, but extend our alphabet to include an end-of-message token. Note that even with this end-of-message token, it is still true that many characters of the message can be encoded by a single symbol of output, especially if some outcome is much more likely. For example, in the example below we need only one bit ("1", for the number 0.5) to represent the message "aaa" (followed by the end-of-message character):</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYblnLiDN2hhrmMo7XMRQtEuDiOl4TAO53XUxdao9FxGjuINwDQOj-YT7YU3Q857Vgj6_gxi9UHEHvMQGkgpKpxDiHRO06z1FF8zkbbaqdUtG-BhwmD_0Qv77pnPMTyh2w8YpVvyZWN_AJ7vpPxAJB9w46bNlYmaFebm9mW9-DgcVZMDnVn-1DpludLQ/s942/ArcoLinux_2022-06-21_21-44-10.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="327" data-original-width="942" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYblnLiDN2hhrmMo7XMRQtEuDiOl4TAO53XUxdao9FxGjuINwDQOj-YT7YU3Q857Vgj6_gxi9UHEHvMQGkgpKpxDiHRO06z1FF8zkbbaqdUtG-BhwmD_0Qv77pnPMTyh2w8YpVvyZWN_AJ7vpPxAJB9w46bNlYmaFebm9mW9-DgcVZMDnVn-1DpludLQ/w640-h222/ArcoLinux_2022-06-21_21-44-10.png" width="640" /></a></div>
<p>There are still two ways in which this code is underspecified.</p>
<p>The first is that we need to choose how much of the probability space to assign to our end-of-message token. The optimal value for this clearly depends on how long the messages we will be sending are.</p>
<p>The second is that even with the end-of-message token, each codeword is still represented by a range of values rather than a single number. Any number in this range is valid to send, but we want to minimise the length, so we will choose the number in this range that has the shortest binary representation.</p>
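<p>To make the interval-splitting and the "shortest binary representation" step concrete, here is a small Python sketch using the static probabilities from the example above (0.5, 0.2 and 0.3 for a, r and t). For simplicity it leaves out the end-of-message token, which would just be one more entry in the model, and it uses exact fractions rather than floats. The function names are invented for this illustration.</p>

```python
import math
from fractions import Fraction

# Static model from the example: a -> [0, 0.5), r -> [0.5, 0.7), t -> [0.7, 1).
MODEL = [("a", Fraction(1, 2)), ("r", Fraction(1, 5)), ("t", Fraction(3, 10))]

def message_interval(message, model=MODEL):
    """Return the range [low, high) of numbers that decode to `message`."""
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        cumulative = Fraction(0)
        for candidate, p in model:
            if candidate == sym:
                break
            cumulative += p
        else:
            raise ValueError(f"symbol {sym!r} is not in the model")
        # Narrow the current range to the slice belonging to `sym`.
        low += width * cumulative
        width *= p
    return low, low + width

def shortest_binary_in(low, high):
    """Shortest bit string b such that the binary fraction 0.b is in [low, high)."""
    bits = 1
    while True:
        # Smallest multiple of 2^-bits that is not below `low`.
        numerator = math.ceil(low * 2 ** bits)
        if Fraction(numerator, 2 ** bits) < high:
            return format(numerator, "b").zfill(bits)
        bits += 1

for msg in ("a", "r", "t", "art"):
    low, high = message_interval(msg)
    print(msg, low, high, shortest_binary_in(low, high))
```

<p>The single-letter results match the ones worked out by hand above ("0", "1" and "11").</p>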
<p>Finally, what is our probability model? With the Huffman code, we either assume a probability model based on background information (e.g. we have the set of English characters, and we know the rough probabilities of them by looking at some text corpus that someone else has already compiled), or we fit the probability model based on the message we want to send - if 1/10th of all letters in the message are $$a$$s, we set $$p_a = 0.1$$ when building the tree for our Huffman code, and so on.</p>
<p>With arithmetic coding, we can also assume static probabilities. However, we can also do adaptive arithmetic coding, where we change the probability model as we go. A good way to do this is for our probability model to assume that the probability $$p_x$$ of the symbol $$x$$ after we have already processed text $$T$$ is
$$$
p_x = \frac{\text{Count}(x, T) + 1}{\sum_{y \in A} \big(\text{Count}(y, T) + 1\big)}$$$
$$$= \frac{\text{Count}(x, T) + 1}{\sum_{y \in A} \big(\text{Count}(y, T)\big) + |A|}
$$$
where $$A$$ is the alphabet, and $$\text{Count}(a, T)$$ simply returns the count of how many times the character $$a$$ occurs in $$T$$. Note that if we didn't have the $$+1$$ in the numerator and in the sum in the denominator, we would assign a probability of zero to anything we haven't seen before, and be unable to encode it.</p>
<p>(We can either say that the end-of-message token is in the alphabet $$A$$, or, more commonly, assign "probabilities" to all $$x$$ using the above formula and some probability $$p_{EOM}$$ to the end of message, and then renormalise by dividing all $$p_x$$ by $$1 + p_{EOM}$$.)</p>
<p>How do we decode this? At the start, the assumed distribution is simply uniform over the alphabet (except maybe for $$p_{EOM}$$). We can decode the first symbol using that distribution, then update the distribution and decode the next, and so on. It's quite elegant.</p>
<p>What isn't elegant is implementing this with standard number systems in most programming languages. For any non-trivial message length, arithmetic coding is going to need very precise floating point numbers, and you can't trust floating point precision very far. You'll need some special system, likely an arbitrary-precision arithmetic library, to actually implement arithmetic coding.</p>
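<p>As one possible illustration, Python's standard-library <code>fractions.Fraction</code> gives exact rational arithmetic, which is enough for a toy adaptive arithmetic coder. The sketch below uses the add-one counting model from above and takes the first of the two options for the end-of-message token (treating it as an ordinary member of the alphabet); the <code>"$"</code> character standing in for that token, and all function names, are assumptions made for this example. It is far too slow for real use.</p>

```python
from fractions import Fraction

ALPHABET = ["a", "r", "t", "$"]  # "$" is a stand-in for the end-of-message token

def probabilities(counts):
    """p_x = (Count(x) + 1) / (sum of counts + |A|), in alphabet order."""
    total = sum(counts.values()) + len(ALPHABET)
    return [(s, Fraction(counts[s] + 1, total)) for s in ALPHABET]

def encode(message):
    """Return the range [low, high) for `message` followed by end-of-message."""
    counts = {s: 0 for s in ALPHABET}
    low, width = Fraction(0), Fraction(1)
    for sym in message + "$":
        cumulative = Fraction(0)
        for candidate, p in probabilities(counts):
            if candidate == sym:
                break
            cumulative += p
        else:
            raise ValueError(f"symbol {sym!r} is not in the alphabet")
        low, width = low + width * cumulative, width * p
        counts[sym] += 1  # update the model only after coding the symbol
    return low, low + width

def decode(number, max_symbols=10_000):
    """Decode a number produced by encode(), replaying the same model updates."""
    counts = {s: 0 for s in ALPHABET}
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(max_symbols):
        cumulative = Fraction(0)
        for sym, p in probabilities(counts):
            if low + width * cumulative <= number < low + width * (cumulative + p):
                break
            cumulative += p
        else:
            raise ValueError("number is outside [0, 1)")
        if sym == "$":
            return "".join(out)
        out.append(sym)
        low, width = low + width * cumulative, width * p
        counts[sym] += 1
    raise ValueError("no end-of-message token found")

low, high = encode("art")
print(low, high, decode((low + high) / 2))
```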
<h3 id="prefix-free-arithmetic-coding">Prefix-free arithmetic coding</h3>
<p>The above description of arithmetic coding is not a prefix-free code. We generally want prefix-free codes, in particular because it means we can decode it symbol by symbol as it comes in, rather than having to wait for the entire message to come through. Note also that often in practice it is uncertain whether or not there are more bits coming; consider a patchy internet connection with significant randomness between packet arrival times.</p>
<p>The simple fix for this is that instead of encoding a number as <i>any</i> binary string that maps onto the right segment of the number line between 0 and 1, you impose an additional requirement on it: <i>whatever binary bits you add onto the number, it is still within the range</i>.</p>
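<p>One way to read this requirement: a $$k$$-bit string $$b$$ followed by arbitrary further bits can represent any number in $$[0.b,\, 0.b + 2^{-k})$$, so we need that whole sub-range to fit inside the message's range. A minimal Python sketch of picking the shortest such string follows; the function names and the example range $$[0.32, 0.35)$$ are chosen just for illustration.</p>

```python
import math
from fractions import Fraction

def prefix_free_bits(low, high):
    """Shortest bit string b with [0.b, 0.b + 2^-len(b)) inside [low, high)."""
    bits = 1
    while True:
        # Smallest multiple of 2^-bits that is not below `low`.
        numerator = math.ceil(low * 2 ** bits)
        # The string is safe if even all-ones continuations stay below `high`.
        if Fraction(numerator + 1, 2 ** bits) <= high:
            return format(numerator, "b").zfill(bits)
        bits += 1

def value_of(bitstring):
    """Interpret a bit string as the binary fraction 0.b."""
    return Fraction(int(bitstring, 2), 2 ** len(bitstring))

low, high = Fraction(32, 100), Fraction(35, 100)
code = prefix_free_bits(low, high)
print(code, float(value_of(code)))
```

<p>The price is usually a bit or two of extra length compared with sending just any number in the range.</p>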
<h2 id="lempel-ziv-coding">Lempel-Ziv coding</h2>
<p>Huffman coding integrated the probability model and the encoding. Arithmetic coding still uses an (at least implicit) probability model to encode, but in a way that makes it possible to update as we encode. Lempel-Ziv encoding, and its various descendants, throw away the entire idea of having any kind of (explicit) probability model. We will look at the original version of this algorithm.</p>
<h3 id="encoding">Encoding</h3>
<p>Skip all that Huffman coding nonsense of carefully rationing the shorter codewords for the most likely symbols, and simply decide on some codeword length $$d$$ and give every character in the alphabet a codeword of that length. If your alphabet is again $$\{a, r, t, \text{EOM}\}$$ (we'll include the end-of-message character from the start this time), and $$d = 3$$, then the codewords you define are literally as simple as
$$$a \mapsto 000 $$$
$$$r \mapsto 001 $$$
$$$t \mapsto 010 $$$
$$$\text{EOM} \mapsto 011$$$
If we used this code, it would be a disaster. We have four symbols in our alphabet, so the maximum entropy of the distribution is $$\log_2 4 = 2$$ bits, and we're spending 3 bits on each symbol. With this encoding, we increase the length by at least 50%. Instead of your compressed file being uploaded in 4 seconds, it now takes 6.</p>
<p>However, we selected $$d=3$$, meaning we have $$2^3 = 8$$ slots for possible codewords of our chosen constant length, and we've only used 4. What we'll do is follow these steps as we scan through our text:</p>
<ol>
<li>Read one symbol <i>past</i> the longest match between the following text and a codeword we've defined. Therefore what we now have is a string $$Cx$$, where we have a code for $$C$$ already of length $$|C|$$, $$x$$ is a single character, and $$Cx$$ is a prefix of the remaining text.</li>
<li>Add the codeword for $$C$$ to the output we're forming, to encode the first $$|C|$$ characters of the remaining text.</li>
<li>If there is space among the $$2^d$$ possible codewords we have available: let $$n$$ be the binary representation of the smallest possible codeword not yet associated with a string, and define $$Cx \mapsto n$$ as a new codeword.</li>
</ol>
<p>Here is an example of the encoding process, showing the emitted codewords on the left, the original definitions on the top, the new definitions on the right, and the message down the middle:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi33XVwUveeGSql8K9VJi_y7bZGa0TZ3UAKdPkkxbnXNZmakweKcmjdGHOBN1oPGSj0fxi3xtQcVDD-FT-XBEW6u18eKbVcZurVB9unqL3tHsSyYKb0mvpfpBRkDZttA1l9OgLlF2I0OFHawK8D2LnQP3M6cJZPHeJOTnSF0lV53ueCYE5m65t6h4U_4w/s892/ArcoLinux_2022-06-21_21-48-30.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="892" data-original-width="727" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi33XVwUveeGSql8K9VJi_y7bZGa0TZ3UAKdPkkxbnXNZmakweKcmjdGHOBN1oPGSj0fxi3xtQcVDD-FT-XBEW6u18eKbVcZurVB9unqL3tHsSyYKb0mvpfpBRkDZttA1l9OgLlF2I0OFHawK8D2LnQP3M6cJZPHeJOTnSF0lV53ueCYE5m65t6h4U_4w/w522-h640/ArcoLinux_2022-06-21_21-48-30.png" width="522" /></a></div>
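<p>The three steps above can be sketched in Python roughly as follows. This is an illustrative reading of the steps rather than a reference implementation; the <code>"$"</code> character stands in for the end-of-message symbol, and the example message is made up.</p>

```python
def lz_encode(message, alphabet, d):
    """Encode `message` as a list of d-bit codewords.

    `alphabet` lists the single-symbol strings in the agreed initial order.
    """
    table = {sym: i for i, sym in enumerate(alphabet)}
    if len(table) > 2 ** d:
        raise ValueError("alphabet does not fit in d bits")
    out = []
    pos = 0
    while pos < len(message):
        # Step 1: find the longest known string C at the start of the rest.
        end = pos + 1
        while end < len(message) and message[pos:end + 1] in table:
            end += 1
        chunk = message[pos:end]
        # Step 2: emit the codeword for C.
        out.append(format(table[chunk], f"0{d}b"))
        # Step 3: if there is room, define a codeword for C plus one symbol.
        if end < len(message) and len(table) < 2 ** d:
            table[message[pos:end + 1]] = len(table)
        pos = end
    return out

print(lz_encode("ratatat$", ["a", "r", "t", "$"], 3))
```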
<h3 id="decoding">Decoding</h3>
<p>A boring way to decode is to send the codeword list along with your message. The fun way is to reason it out as you go along, based on your knowledge of the above algorithm and a convention that lets you know which order the original symbols were added to the codeword list (say, alphabetically, so you know the three bindings in the top-left). An example of decoding the above message:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigucPM68ZDJpTmvknmmUEQkRAdUye2kMM3pMp_Ucgqg1xL6UDKXGWFyW4FCb-V_K4dMRlTrypPUZJQ6KFkU9pVU80bG2vy7oDBJP4H-Nq_WGu9WKjnZy8EhqVuMhpW4g8bVeQVTwRQ2xNAxydawUm9kACV9ADKU6OUcUcn59jcphwa5p8zx-2iHiywDA/s1073/ArcoLinux_2022-06-21_21-48-58.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="476" data-original-width="1073" height="284" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigucPM68ZDJpTmvknmmUEQkRAdUye2kMM3pMp_Ucgqg1xL6UDKXGWFyW4FCb-V_K4dMRlTrypPUZJQ6KFkU9pVU80bG2vy7oDBJP4H-Nq_WGu9WKjnZy8EhqVuMhpW4g8bVeQVTwRQ2xNAxydawUm9kACV9ADKU6OUcUcn59jcphwa5p8zx-2iHiywDA/w640-h284/ArcoLinux_2022-06-21_21-48-58.png" width="640" /></a></div>
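<p>A sketch of the "fun way" in Python, under the same conventions as the encoding steps above (initial codewords assigned in the given alphabet order, new codewords numbered consecutively, no new definitions once all $$2^d$$ slots are used). The decoder's table lags one step behind the encoder's, so it occasionally receives a codeword it has not defined yet; in that case the string must be the previous string plus its own first symbol. The <code>"$"</code> end-of-message stand-in and the example codeword lists are assumptions for illustration.</p>

```python
def lz_decode(codewords, alphabet, d):
    """Decode a list of d-bit codewords, rebuilding the table on the fly."""
    table = list(alphabet)  # index -> string, in the agreed initial order
    out = []
    previous = None
    for word in codewords:
        index = int(word, 2)
        if index < len(table):
            entry = table[index]
        elif index == len(table) and previous is not None:
            # The encoder used the entry it defined in its very last step.
            entry = previous + previous[0]
        else:
            raise ValueError(f"unexpected codeword {word!r}")
        # Replay the encoder's definition: previous string + next symbol.
        if previous is not None and len(table) < 2 ** d:
            table.append(previous + entry[0])
        out.append(entry)
        previous = entry
    return "".join(out)

print(lz_decode(["001", "000", "010", "101", "101", "011"], ["a", "r", "t", "$"], 3))
```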
<h2 id="source-coding-theorem">Source coding theorem</h2>
<p>The source coding theorem is about lossy compression. It is going to tell us that if we can tolerate a probability of error $$\delta$$, and if we're encoding a message consisting of a lot of symbols, unless $$\delta$$ is very close to 0 (lossless compression) or 1 (there is nothing but error), it will take about $$H(X)$$ bits per symbol to encode the message, where $$X$$ is the random variable according to which the symbols in the message have been drawn. Since it means that entropy turns up as a fundamental and surprisingly constant limit when we're trying to compress our information, this further justifies the use of entropy as a measure of information.</p>
<p>We're going to start our attempt to prove the source coding theorem by considering a silly compression scheme. Observe that English has 26 letters, but the bottom 10 (Z, Q, X, J, K, V, B, P, Y, G) are slightly less than 10% of all letters. Why not just drop them? Everthn is still comprehensile without them, and ou can et awa with, for eample, onl 4 inary its per letter rather than 5, since ou're left with ust 16 letters.</p>
<p>Given an alphabet $$A$$ from which our random variable $$X$$ takes values, define the $$\delta$$-sufficient subset $$S_\delta$$ of $$A$$ to be the smallest subset of $$A$$ such that $$P(x \in S_\delta) \geq 1 - \delta$$ for $$x$$ drawn from $$X$$. For example, if $$A$$ is the English alphabet, and $$\delta = 0.1$$, then $$S_\delta$$ is the set of all letters except Z, Q, X, J, K, V, B, P, Y, and G, since the other letters have a combined probability of over $$1 - 0.1 = 0.9$$, and any other subset containing at least $$0.9$$ of the probability mass must contain more letters.</p>
<p>Note that $$S_\delta$$ can be formed by adding elements from $$A$$, in descending order of probability, into a set until the sum of probabilities of elements in the set is at least $$1 - \delta$$.</p>
<p>Next, define the essential bit content of $$X$$, denoted $$H_\delta(X)$$, as
$$$
H_\delta(X) = \log_2 |S_\delta|.
$$$
In other words, $$H_\delta(X)$$ is the answer to "how many bits of information does it take to point to one element in $$S_\delta$$ (without being able to assume the distribution is anything better than uniform)?". $$H_\delta(X)$$ for $$\text{English alphabet}_{0.1}$$ is 4, because $$\log_2 |\{E, T, A, O, I, N, S, H, R, D, L, U, C, M, W, F\}| = \log_2 16 = 4$$. It makes sense that this is called "essential bit content".</p>
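<p>The greedy construction of $$S_\delta$$ and the resulting $$H_\delta(X)$$ can be sketched in Python as below. The five-symbol distribution is a toy example invented for this sketch (not English letter frequencies), and the function names are likewise made up.</p>

```python
import math

def sufficient_subset(probs, delta):
    """Smallest set of outcomes with total probability at least 1 - delta."""
    if not 0 <= delta < 1:
        raise ValueError("delta must satisfy 0 <= delta < 1")
    chosen, mass = [], 0.0
    # Add outcomes in descending order of probability until we have enough.
    for sym, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        if mass >= 1 - delta:
            break
        chosen.append(sym)
        mass += p
    return chosen

def essential_bit_content(probs, delta):
    """H_delta(X) = log2 |S_delta|."""
    return math.log2(len(sufficient_subset(probs, delta)))

# Toy distribution (hypothetical).
toy = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.0625, "e": 0.0625}
for delta in (0.0, 0.1, 0.2, 0.6):
    print(delta, sufficient_subset(toy, delta), essential_bit_content(toy, delta))
```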
<p>We can graph $$H_\delta(X)$$ against $$\delta$$ to get a pattern like this:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlGh7Wm3vmOUAG49ZlIxkZgz2q7ApF-QM7addOeJ5uEqx2P9kzx8sAGF_4BBp4o6me9Pg6NqzGgivCir-VKdWB-E2hdLzAYx6cOgd9v2-BQr8Emaat6joRPkDFPtEZcjnGvNVvegvOvaRVJCQaYZGI_WCjZkwoY356mqGwpVlmzHZWAPT-eO2yviId_A/s850/ArcoLinux_2022-06-21_22-01-58.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="636" data-original-width="850" height="478" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlGh7Wm3vmOUAG49ZlIxkZgz2q7ApF-QM7addOeJ5uEqx2P9kzx8sAGF_4BBp4o6me9Pg6NqzGgivCir-VKdWB-E2hdLzAYx6cOgd9v2-BQr8Emaat6joRPkDFPtEZcjnGvNVvegvOvaRVJCQaYZGI_WCjZkwoY356mqGwpVlmzHZWAPT-eO2yviId_A/w640-h478/ArcoLinux_2022-06-21_22-01-58.png" width="640" /></a></div>
<p>Where it gets more interesting is when we extend this definition to blocks. Let $$X^n$$ denote the random variable for a sequence of $$n$$ independent identically distributed samples drawn from $$X$$. We keep the same definitions for $$S_\delta$$ and $$H_\delta(X^n)$$; just remember that now $$S_\delta$$ is a subset of $$A^n$$ (where the exponent denotes Cartesian product of a set with itself; i.e. $$A^n$$ is all possible length-$$n$$ strings formed from that alphabet). In other words, we're throwing away the least common length-$$n$$ letter strings first; ZZZZ is out the window first if $$n = 4$$, and so on.</p>
<p>We can plot a similar graph as above, except we're plotting $$\frac{1}{n} H_\delta(X^n)$$ on the vertical axis to get per-symbol entropy, and there's a horizontal line around the entropy of English letter frequencies:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1TCFB5fGrb7BNMPfvRwGOtzbahwMwd1yAkZoAUGoP66dWZCj9VkwySIMiwUBETqS6PE1Ob7jvtA3ex0sS02vn_UXgQOD5NLWcp1czRfB55SWHswGlcn3zNeeb4w8n5usdVT6NJZZ52JoU5So4qf6HNzcMNbZcH-IUnU4TZ4k6sGD9zmv4aTfsJCdJxw/s883/ArcoLinux_2022-06-21_22-02-44.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="646" data-original-width="883" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1TCFB5fGrb7BNMPfvRwGOtzbahwMwd1yAkZoAUGoP66dWZCj9VkwySIMiwUBETqS6PE1Ob7jvtA3ex0sS02vn_UXgQOD5NLWcp1czRfB55SWHswGlcn3zNeeb4w8n5usdVT6NJZZ52JoU5So4qf6HNzcMNbZcH-IUnU4TZ4k6sGD9zmv4aTfsJCdJxw/s320/ArcoLinux_2022-06-21_22-02-44.png" width="320" /></a></div>
<p>(Note that the entropy per letter of English drops to only about 1.3 bits if we stop modelling each letter as drawn independently from the others around it, and instead have a model with a perfect understanding of which letters occur together.)</p>
<p>The graph above shows the plot of $$\frac{1}{n}H_\delta(X^n)$$ against $$\delta$$ for a random variable $$X^n$$ for $$n=1$$ (blue), $$n=2$$ (orange), and $$n=3$$ (green). We see that as $$n$$ increases, the lines become flatter, and the middle portions approach the black line that shows the entropy of the English letter frequency distribution. What you'd see if we continued plotting this graph for larger values of $$n$$ (which might happen for example if you bought me a beefier computer) is that this trend continues; specifically, that there is a value $$n$$ large enough that the graph of $$\frac{1}{n}H_\delta(X^n)$$ is as close as we want to the black line for the entire length of it, except for an arbitrarily small part near $$\delta = 0$$ and $$\delta = 1$$. Mathematically, for any $$\epsilon > 0$$ and any $$0 < \delta < 1$$ there exists a positive integer $$n_0$$ such that for all $$n \geq n_0$$,
$$$
\left| \frac{1}{n}H_\delta(X^n) - H(X)
\right| \leq \epsilon.
$$$
Now remember that $$\frac{1}{n}H_\delta(X^n)=\frac{1}{n}\log |S_\delta|$$ was the essential bit content per symbol, or, in other words, the number of bits we need per symbol to represent $$X^n$$ (with error probability $$\delta$$) in the simple coding scheme where we assign an equal-length binary number to each element in $$S_\delta$$ (but hold on: aren't there better codes than ones where all elements in $$S_\delta$$ get an equal-length representation? yes, but we'll see soon that not by very much). Therefore what the above equation is saying is that we can encode $$X^n$$ with error chance $$\delta$$ using a number of bits per symbol that differs from the entropy $$H(X)$$ by only a small constant $$\epsilon$$. This is the source coding theorem. It is a big deal, because we've shown that entropy is related to the number of bits per symbol we need to do encoding in a lossy compression scheme.</p>
<p>(You can get to a similar result with lossless compression schemes where, instead of throwing away the ability to encode all sequences not in $$S_\delta$$ and just accepting the inevitable error, you instead have an encoding scheme where you reserve one bit to indicate whether or not an $$x^n$$ drawn from $$X^n$$ is in $$S_\delta$$, and if it is you encode it like above, and if it isn't you encode it using $$\log |A|^n$$ bits. Then you'll find that the probability of having to do the latter step is small enough that $$\log |A|^n > \log |S_\delta|$$ doesn't matter very much.)</p>
<h3 id="typical-sets">Typical sets</h3>
<p>Before going into the proof, it is useful to investigate what sorts of sequences $$x^n$$ we tend to pull out from $$X^n$$ for some $$X$$. The basic observation is that most $$x^n$$ are going to be neither the least probable nor the most probable out of all $$x^n$$. For example, "ZZZZZZZZZZ" would obviously be an unusual set of letters to draw at random if you're selecting them from English letter frequencies. However, so would "EEEEEEEEEE". Yes, this individual sequence is much more likely than "ZZZZZZZZZZ" or any other sequence, but there is only one of them, so getting it would still be surprising. To take another example, the typical sort of result you'd expect from a coin loaded so that $$P(\text{"heads"}) = 0.75$$ isn't runs of only heads, but rather an approximately 3:1 mix of heads and tails.</p>
<p>The distribution of letter counts follows a multinomial distribution (the generalisation of the binomial distribution). Therefore (if you think about what a multinomial distribution is, or if you know that the mean is $$n p_{x_i}$$ for the $$i$$th variable) in $$x^n$$ we'd expect roughly $$np_e$$ of the letter e, $$np_z$$ of the letter z, and so on - and $$np_e \ll n$$ even though $$p_e > p_L$$ for every other letter $$L$$ in the alphabet. Slightly more precisely (if you happen to know this fact), the variance of the count of symbol $$x_i$$ is $$np_{x_i}(1-p_{x_i})$$, implying that the standard deviation grows only in proportion to $$\sqrt{n}$$, so for large $$n$$ it is very rare to get an $$x^n$$ with counts of $$x_i$$ that differ wildly from the expected count $$np_{x_i}$$.</p>
<p>Let's define a notion of "typicality" for a sequence $$x^n$$ based on this idea of it being unusual if $$x^n$$ is either a wildly likely or wildly unlikely sequence. A sequence with exactly the expected counts contains $$np_{x_i}$$ copies of each symbol $$x_i$$ in the alphabet, so has probability
$$$
P(x^n) = p_{x_1}^{np_{x_1}}p_{x_2}^{np_{x_2}} \ldots p_{x_{|A|}}^{np_{x_{|A|}}}
$$$
which in turn has a Shannon information content of
$$$
-\log P(x^n) = -\sum_i np_{x_i} \log p_{x_i} = n H(X).
$$$
Oh look, entropy pops up again. How surprising.</p>
<p>Now we make the following definition: a sequence $$x^n$$ is $$\epsilon$$-typical if its information content per symbol is $$\epsilon$$-close to $$H(X)$$, that is
$$$
\left| - \frac{1}{n}\log{P(x^n)} - H(X) \right| <\epsilon.
$$$
Define the typical set $$T_{n\epsilon}$$ to be the set of length-$$n$$ sequences (drawn from $$X^n$$) that are $$\epsilon$$-typical.</p>
<p>$$T_{n\epsilon}$$ is a small subset of the set $$A^n$$ of all length-$$n$$ sequences. We can see this through the following reasoning: for any $$x^n \in T_{n\epsilon}$$, $$-\frac{1}{n} \log P(x^n) \approx H(X)$$ which implies that
$$$
P(x^n) \approx 2^{-nH(X)}
$$$
and therefore that there can only be roughly $$2^{nH(X)}$$ such sequences; otherwise their probability would add up to more than 1. In comparison, the number of possible sequences $$|A^n| = 2^{n \log |A|}$$ is significantly larger, since $$H(X) \leq \log |A|$$ for any random variable $$X$$ with alphabet / outcome set $$A$$ (with equality if and only if $$X$$ has a uniform distribution over $$A$$).</p>
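<p>To get a feel for these sizes, here is a Python sketch that counts the typical set for the loaded coin from earlier ($$P(\text{heads}) = 0.75$$). Since the probability of a coin sequence depends only on its number of heads, we can count sequences with binomial coefficients instead of enumerating them. The choice of $$\epsilon = 0.1$$, the values of $$n$$, and the function name are arbitrary choices for illustration.</p>

```python
import math

def typical_set_stats(n, p_heads, eps):
    """Return (H, size, mass) of the eps-typical set for n coin flips."""
    p, q = p_heads, 1 - p_heads
    H = -(p * math.log2(p) + q * math.log2(q))
    size, mass = 0, 0.0
    for k in range(n + 1):  # k = number of heads in the sequence
        info_per_symbol = -(k * math.log2(p) + (n - k) * math.log2(q)) / n
        if abs(info_per_symbol - H) < eps:
            ways = math.comb(n, k)  # sequences with exactly k heads
            size += ways
            mass += ways * p ** k * q ** (n - k)
    return H, size, mass

for n in (50, 100, 200, 400):
    H, size, mass = typical_set_stats(n, 0.75, 0.1)
    # log2 of the typical-set size per symbol, versus H(X) and log2 |A| = 1
    print(n, round(math.log2(size) / n, 3), round(H, 3), round(mass, 3))
```

<p>The per-symbol log-size stays below $$H(X) + \epsilon$$, well short of the 1 bit per flip needed to index all of $$A^n$$, while the probability mass captured grows with $$n$$ (which is what the next section proves).</p>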
<h3 id="the-typical-set-contains-most-of-the-probability">The typical set contains most of the probability</h3>
<p>Chebyshev's inequality states that
$$$
P((X-\mathbb{E}[X])^2 \geq a) \leq \frac{\sigma^2}{a}
$$$
where $$\sigma^2$$ is the variance of the random variable $$X$$, and $$a > 0$$. It is proved <a href="http://www.strataoftheworld.com/2021/01/data-science-2.html">here</a> (search for "Chebyshev").</p>
<p>Earlier we defined the $$\epsilon$$-typical set as
$$$
T_{n\epsilon} = \left\{
x^n \in A^n \,\text{ such that } \, \left|
-\frac{1}{n}\log P(x^n) - H(X)
\right| < \epsilon
\right\}.
$$$
Note that
$$$
\mathbb{E}\left[-\frac{1}{n}\log P(X^n)\right] = -\frac{1}{n} \sum_i \mathbb{E}[\log P(X_i)]$$$
$$$ = -\mathbb{E}[\log P(X_i)]$$$
$$$ = H(X_i) = H(X)
$$$
by using independence of the $$X_i$$ making up $$X^n$$ (so that $$\log P(X^n) = \sum_i \log P(X_i)$$) together with linearity of expectation in the first step, the fact that all $$X_i$$ are draws of the same random variable $$X$$ (so all $$n$$ terms in the sum are equal) in the second, and the definition of entropy in the third.</p>
<p>Therefore, we can now rewrite the typical set definition equivalently as
$$$
T_{n\epsilon} = \left\{
x^n \in A^n \,\text{ such that } \, \left(
-\frac{1}{n}\log P(x^n) - H(X)
\right)^2 < \epsilon^2
\right\}$$$
$$$= \left\{
x^n \in A^n \,\text{ such that } \, \left(
Y - \mathbb{E}[Y]
\right)^2 < \epsilon^2
\right\}
$$$
for $$Y = -\frac{1}{n} \log P(X^n)$$, which is in the right form to apply Chebyshev's inequality to get a probability of belonging to this set, except for the fact that the inequality is the wrong way around. Very well - we'll instead consider the set of sequences $$\bar{T}_{n\epsilon} = A^n - T_{n\epsilon}$$ (i.e. all length-$$n$$ sequences that are not typical), which can be defined as
$$$
\bar{T}_{n \epsilon} = \left\{
x^n \in A^n \,\text{ such that } \,
(Y - \mathbb{E}[Y])^2 \geq \epsilon^2 \right\}
$$$
and use Chebyshev's inequality to conclude that
$$$
P((Y - \mathbb{E}[Y])^2 \geq \epsilon^2) \leq \frac{\sigma_Y^2}{\epsilon^2}
$$$
where $$\sigma_Y^2$$ is the variance of $$Y= -\frac{1}{n} \log P(X^n)$$. This is exciting - we have a bound on the probability that a sequence is not in the typical set - but we want to link this probability to $$n$$ somehow. Let $$Z = -\log P(X)$$, and note that $$Y$$ can be written as the average of $$n$$ independent draws $$Z_i = -\log P(X_i)$$ of $$Z$$. Therefore
$$$
\mathbb{E}[Y] = \frac{1}{n} \sum_i \mathbb{E}[Z_i] = \mathbb{E}[Z] = H(X)
$$$
and since $$Y = \frac{1}{n} \sum_i Z_i$$ with the $$Z_i$$ independent, the variance of $$Y$$, $$\sigma_Y^2$$, is equal to $$\frac{1}{n} \sigma_Z^2$$ (a basic law of how variance works that is often used in statistics). We can substitute this into the expression above to get
$$$
P((Y-\mathbb{E}[Y])^2 \geq \epsilon^2) \leq \frac{\sigma_Z^2}{n\epsilon^2}.
$$$
The probability on the left-hand side is identical to $$P((-\frac{1}{n} \log P(X^n) - H(X) )^2 \geq \epsilon^2)$$, which is the probability of the condition that $$X^n$$ is <i>not</i> in the $$\epsilon$$-typical set $$T_{n\epsilon}$$, which gives us our grand result
$$$
P(X^n \in T_{n\epsilon}) \ge 1 - \frac{\sigma_Z^2}{n\epsilon^2}.
$$$
$$\sigma_Z^2$$ is the variance of $$-\log P(X)$$ for a single draw of $$X$$; it depends on the particulars of the distribution and is probably hell to calculate. However, what we care about is that if we just crank up $$n$$, we can make this probability as close to 1 as we like, regardless of what $$\sigma_Z^2$$ is, and regardless of what we set as $$\epsilon$$ (the parameter for how wide the probability range for the typical set is).</p>
<p>The key idea is this: asymptotically, as $$n \to \infty$$, more and more of the probability mass of possible length-$$n$$ sequences is concentrated among those that have a probability of between $$2^{-n(H(X)+\epsilon)}$$ and $$2^{-n(H(X) - \epsilon)}$$, regardless of what (positive real) $$\epsilon$$ you set. This is known as the "asymptotic equipartition property" (it might be more appropriate to call it an "asymptotic approximately-equally-partitioning property" because it's not really an "equipartition", since depending on $$\epsilon$$ these can be very different probabilities, but apparently that was too much of a mouthful even for the mathematicians).</p>
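<p>To see the concentration numerically, here is a minimal sketch for the simplest possible source, a biased coin. The function name <code>typical_set_stats</code> and the parameter values are my own choices for illustration; the trick is that all sequences with the same number of ones have the same probability, so we can count them with binomial coefficients instead of enumerating $$2^n$$ sequences.</p>

```python
import math


def typical_set_stats(p, n, eps):
    """Probability mass and size of the eps-typical set for n i.i.d.
    Bernoulli(p) draws (p is the probability of a 1)."""
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    mass, count = 0.0, 0
    for k in range(n + 1):  # k = number of ones in the sequence
        # All sequences with k ones share the same probability.
        log_prob = k * math.log2(p) + (n - k) * math.log2(1 - p)
        if abs(-log_prob / n - entropy) < eps:
            n_seqs = math.comb(n, k)
            count += n_seqs
            # Work in log space so that huge counts times tiny
            # probabilities neither overflow nor underflow.
            mass += 2.0 ** (math.log2(n_seqs) + log_prob)
    return mass, count


for n in (100, 1000):
    mass, count = typical_set_stats(p=0.1, n=n, eps=0.1)
    # Mass of the typical set, and log2 of its size per symbol
    # (compare the latter with 1 bit per symbol for all of A^n).
    print(n, round(mass, 4), round(math.log2(count) / n, 4))
```

<p>The mass should creep towards 1 as $$n$$ grows, while the size per symbol stays well below the 1 bit per symbol needed to index every length-$$n$$ binary sequence.</p>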
<h3 id="finishing-the-proof">Finishing the proof</h3>
<p>As a reminder of where we are: we stated without proof
$$$
\left| \frac{1}{n}H_\delta(X^n) - H(X)
\right| < \epsilon.
$$$
and noted that this is an interesting result that also gives meaning to entropy, since we see that it's related to how many bits it takes for a naive coding scheme to express $$X^n$$ (with error probability $$\delta$$).</p>
<p>Then we went on to talk about typical sets, and ended up finding that the probability that an $$x^n$$ drawn from $$X^n$$ lies in the set
$$$
T_{n \epsilon} =\left\{
x^n \in A^n \,\text{ such that } \, \left|
-\frac{1}{n}\log P(x^n) - H(X)
\right| < \epsilon
\right\}
$$$
approaches 1 as $$n \to \infty$$, despite the fact that $$T_{n\epsilon}$$ has only approximately $$2^{nH(X)}$$ members, which, for distributions of $$X$$ that are not very close to the uniform distribution over the alphabet $$A$$, is a small fraction of the $$2^{n \log |A|}$$ possible length-$$n$$ sequences.</p>
<p>Remember that $$H_\delta(X^n) = \log |S_\delta|$$, and $$S_\delta$$ was the smallest subset of $$A^n$$ such that it contains sequences whose probability sums to at least $$1 - \delta$$. This is a bit like the typical set $$T_{n\epsilon}$$, which also contains sequences making up most of the probability mass. Note that $$T_{n\epsilon}$$ is less efficient; $$S_\delta$$ optimally contains all sequences with probability greater than some threshold, whereas $$T_{n\epsilon}$$ generally omits the highest-probability sequences (settling instead for sequences of the same probability as most sequences that are drawn from $$X^n$$). Therefore
$$$
H_\delta(X^n) \leq \log |T_{n\epsilon}|
$$$
once $$n$$ is large enough that $$T_{n\epsilon}$$ contains probability mass of at least $$1 - \delta$$ (how large depends on what $$\delta$$ and $$\epsilon$$ we want). Now we can get an upper bound on $$H_\delta(X^n)$$ if we can upper-bound $$|T_{n\epsilon}|$$. Looking at the definition, we see that the probability of a sequence $$x^n \in T_{n\epsilon}$$ must obey
$$$
2^{-n(H(X) + \epsilon)} < P(x^n) < 2^{-n(H(X) - \epsilon)}.
$$$
$$T_{n\epsilon}$$ has the largest number of elements if all elements have the lowest possible probability $$p$$, and if that is the case it has at most $$1/p$$ of such lowest-probability elements since the probabilities cannot add to more than one, which implies $$|T_{n\epsilon}| < 2^{n(H(X)+\epsilon)}$$. Therefore
$$$
H_\delta(X^n) \leq \log |T_{n\epsilon}| < \log(2^{n(H(X)+\epsilon)}) = n(H(X) + \epsilon)
$$$
and we have a bound
$$$
H_\delta(X^n) < n(H(X) + \epsilon).
$$$
If we can now also find the bound $$n(H(X) - \epsilon) < H_\delta(X^n)$$, we've shown $$|\frac{1}{n} H_\delta(X^n) - H(X)| < \epsilon$$ and we're done. The proof of this bound is a proof by contradiction. Imagine that there is an $$S'$$ such that
$$$
\frac{1}{n} \log |S'| \leq H(X) - \epsilon
$$$
but also
$$$
P(X^n \in S') \geq 1 - \delta.
$$$
We want to show that $$P(X^n \in S')$$ can't actually be that large. For the other bound, we used our typical set successfully, so why not use it again? Specifically, write
$$$
P(X^n \in S') = P(X^n \in S' \cap T_{n\varepsilon}) + P(X^n \in S' \cap \bar{T}_{n\varepsilon})
$$$
where $$\bar{T}_{n\varepsilon}$$ is again $$A^n - T_{n\varepsilon}$$, and noting that our constant $$\varepsilon$$ for $$T$$ is not the same as our constant $$\epsilon$$ in the bound. We want to set an upper bound on this probability, so we make the terms on the right-hand side as large as possible. For the first term, this is if $$S' \cap T_{n\varepsilon}$$ is as large as it can be based on the bound on $$|S'|$$, i.e. $$2^{n(H(X)-\epsilon)}$$, and each sequence in it has the maximum probability $$2^{-n(H(X)-\varepsilon)}$$ of sequences in $$T_{n\varepsilon}$$. For the second term, this is if $$S' \cap \bar{T}_{n \varepsilon}$$ is restricted only by $$P(X^n \in \bar{T}_{n\varepsilon}) \leq \frac{\sigma_Z^2}{n\varepsilon^2}$$, which we showed above. (Note that you can't have both of these conditions holding at once, but this does not matter since we only want to show a non-strict inequality.) Therefore we get
$$$
P(X^n \in S') \leq 2^{n(H(X) - \epsilon)} 2^{-n(H(X)-\varepsilon)} + \frac{\sigma_Z^2}{n\varepsilon^2}$$$
$$$= 2^{-n(\epsilon - \varepsilon)} + \frac{\sigma_Z^2}{n\varepsilon^2}
$$$
and we see that if we pick $$0 < \varepsilon < \epsilon$$ (say $$\varepsilon = \epsilon / 2$$), then as $$n \to \infty$$ this probability goes to zero. But we had assumed $$P(X^n \in S') \geq 1 - \delta$$ - so for large enough $$n$$ we have a contradiction, and no such $$S'$$ can exist. Since $$S_\delta$$ does contain probability mass of at least $$1 - \delta$$, this means
$$$
n(H(X) - \epsilon) < H_\delta(X^n).
$$$
Combining this with the previous bound, we've now shown
$$$
H(X) - \epsilon < \frac{1}{n} H_\delta(X^n) < H(X) + \epsilon
$$$
which is the same as
$$$
\left|\frac{1}{n}H_\delta(X^n) - H(X)\right| < \epsilon
$$$
which is the source coding theorem that we wanted to prove.</p>
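<p>As a sanity check on the theorem, the sketch below computes $$\frac{1}{n} H_\delta(X^n)$$ directly for a biased coin and compares it to $$H(X)$$. The function name and parameter values are illustrative assumptions, and the direct floating-point arithmetic is only meant for $$n$$ up to roughly a thousand. For a coin with $$P(1) < 0.5$$, the most probable sequences are those with the fewest ones, so the smallest sufficient set $$S_\delta$$ can be built greedily in order of increasing number of ones.</p>

```python
import math


def essential_bits_per_symbol(p, n, delta):
    """(1/n) * H_delta(X^n) for n i.i.d. Bernoulli(p) draws with p < 0.5.

    H_delta is log2 of the size of the smallest set of sequences whose
    total probability is at least 1 - delta.
    """
    needed = 1.0 - delta
    mass, size = 0.0, 0
    # With p < 0.5, sequences with fewer ones are more probable, so the
    # smallest set is built by adding sequences in order of increasing k.
    for k in range(n + 1):
        log_prob = k * math.log2(p) + (n - k) * math.log2(1 - p)
        n_seqs = math.comb(n, k)
        group_mass = 2.0 ** (math.log2(n_seqs) + log_prob)
        if mass + group_mass < needed:
            mass += group_mass
            size += n_seqs
        else:
            # Only part of this group is needed to reach 1 - delta.
            size += math.ceil((needed - mass) / 2.0 ** log_prob)
            break
    return math.log2(size) / n


p = 0.1
entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
for n in (100, 1000):
    print(n, essential_bits_per_symbol(p, n, delta=0.01), entropy)
```

<p>The gap between the two printed numbers should shrink as $$n$$ grows, though the convergence is slow.</p>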
<p><b>Information theory 1</b> (posted 2022-06-20, updated 2022-06-25)</p>
<div id="information-theory-1" style="text-align: center;"><span style="font-size: x-small;"><i>5044 words, including equations (~30min)</i></span><br /></div>
<p>This is the first in a series of posts about information theory. A solid understanding of basic probability (random variables, probability distributions, etc.) is assumed. This post covers:</p>
<ul>
<li>what information and entropy are, both intuitively and axiomatically</li>
<li>(briefly) the relation of information-theoretic entropy to entropy in physics</li>
<li>conditional entropy</li>
<li>joint entropy</li>
<li>KL distance (also known as relative entropy)</li>
<li>mutual information</li>
<li>some results involving the above quantities</li>
<li>the point of source coding and channel coding</li>
</ul>
<p>Future posts cover source coding and channel coding in detail.</p>
<h2 id="what-is-information-">What is information?</h2>
<p>How much information is there in the number 14? What about the word "information"? Or this blog post? These don't seem like questions with exact answers.</p>
<p>Imagine you already know that someone has drawn a number between 0 and 15 from a hat. Then you're told that the number is 14. How much additional information have you learned? A first guess at a definition for information might be that it's the number of questions you need to ask to become certain about an answer. We don't want arbitrary questions though; "what is the number?" is very different from "is the number zero?". So let's say that it has to be a yes-no question.</p>
<p>You can represent a number within some specific range as a series of yes-no questions by writing it out in base-2. In base-2, 14 is 1110. Four questions suffice: "is the leftmost base-2 digit a 0?", etc. The number of base-$$B$$ digits required to distinguish between $$n$$ possible values is $$\lceil\log_B n\rceil$$, where $$\lceil x \rceil$$ means the smallest integer greater than or equal to $$x$$ (i.e., rounding up). Now maybe there should be some sense in which we can allow pointing at a number in the range 0 to 16 to have a bit more information than pointing at a number from 0 to 15, even though we can't literally ask 4.09 yes-no questions. So we might try to define our information measure as $$\log n$$ (in whatever base because changing which base we're doing logs in would only change the answer by a constant factor anyways, but let's just say it's base-2 to maintain the correspondence to yes-no questions), where $$n$$ is the number of outcomes that the thing we now know was selected from.</p>
<p>Now let's say there's a shoe box we've picked up from a store. There are a gazillion things that could be inside the box, so $$n$$ is something huge. However, it seems that if we open the box and find a new pair of sneakers, we are less surprised than if we open the box and find the Shroud of Turin. We'd like to make some types of outcome contain quantitatively more information than others.</p>
<p>The standard sort of thing you do in this kind of situation is that you bring in probabilities. With drawing a number out of a hat, we have a uniform distribution where the probability for each outcome is $$p = 1/ n$$. So we might as well have written that information content is equivalent to $$\log \frac{1}{p}$$, and gotten the same answer for that question. Since presumably the probability of your average shoe box containing sneakers is higher than the probability of it containing the Shroud of Turin, with this revised definition we now sensibly get that the latter gives us more information (because $$\log \frac{1}{p}$$ is a decreasing function of $$p$$). Note also that $$\log \frac{1}{p}$$ is the same as $$- \log p$$; we will usually use the latter form. This is called the Shannon information. To be precise:</p>
<blockquote>
<p><i>The (Shannon) information content of seeing a random variable $$X$$ take a value $$x$$ is
$$$-\log p_x$$$
where $$p_x$$ is the probability that $$X$$ takes value $$x$$.
</i></p><p><i>We can see the behaviour of the information content of an event as a function of its probability here: </i></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyEA9S6CzMLwf4YkmXK0-pfFqZto-WTbTCjtxixK2QafIvHmbnmlKqPSWZN6Yj0fEMJBGAUclmullfxWV-9TChejDut2HmzT5-Y0WbvziC_5kWb0WrGSRHvOcG00bTsj3WC76FeyawjTIqEa0gOth87yiimQqg00FL80dB8LMQ5VbUAPgG25YlBpb9Rw/s866/ArcoLinux_2022-05-31_21-27-57.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="648" data-original-width="866" height="478" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyEA9S6CzMLwf4YkmXK0-pfFqZto-WTbTCjtxixK2QafIvHmbnmlKqPSWZN6Yj0fEMJBGAUclmullfxWV-9TChejDut2HmzT5-Y0WbvziC_5kWb0WrGSRHvOcG00bTsj3WC76FeyawjTIqEa0gOth87yiimQqg00FL80dB8LMQ5VbUAPgG25YlBpb9Rw/w640-h478/ArcoLinux_2022-05-31_21-27-57.png" width="640" /></a></div><br /><p><br /></p>
</blockquote>
<h3 id="axiomatic-definition">Axiomatic definition</h3>
<p>The above derivation was so hand-wavy that it wasn't even close to being a derivation.</p>
<p>When discovering/inventing the concept of Shannon information, Shannon started from the idea that the information contained in seeing an event is a function of that event's probability (and nothing else). Then he required three further axioms to hold for this function:</p>
<ul>
<li>If the probability of an outcome is 1, it contains no information. This makes sense - if you already know something with certainty, then you can't get more information by seeing it again.</li>
<li>The information contained in an event is a decreasing function of its probability of happening. Again, this makes sense: seeing something you think is very unlikely is more informative than seeing something you were pretty certain was already going to happen.</li>
<li>The information contained in seeing two independent events is the sum of the information of seeing them separately. We don't want to have to apply some obscure maths magic to figure out how much information we got in total from seeing one dice roll and then another.</li>
</ul>
<p>The last one is the big hint. The probability of seeing random variable (RV) $$X$$ take value $$x$$ and RV $$Y$$ take value $$y$$ is $$p_x p_y$$ if $$X$$ and $$Y$$ are independent. We want a function, call it $$f$$, such that $$f(p_x p_y) = f(p_x) + f(p_y)$$. This is the most important property of logarithms. You can do some more maths to really demonstrate that logarithms (with some base) are the only functions that fit this definition, or you can just guess that it's a $$\log$$ and move on. We'll do the latter.</p>
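<p>If you'd rather poke at the axioms than take them on faith, here is a small sketch that checks all three numerically. The function name <code>information_content</code> and the example probabilities are mine, chosen for illustration.</p>

```python
import math


def information_content(p):
    """Shannon information content, in bits, of an outcome with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1 / p)


# Axiom 1: a certain outcome carries no information.
certain = information_content(1.0)

# Axiom 2: rarer outcomes carry more information.
rare, common = information_content(0.01), information_content(0.9)

# Axiom 3: information adds over independent events (two fair dice rolls).
p_roll = 1 / 6
both = information_content(p_roll * p_roll)
separately = information_content(p_roll) + information_content(p_roll)

print(certain, rare > common, math.isclose(both, separately))
```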
<h3 id="entropy">Entropy</h3>
<p>Entropy is the flashy term that comes up in everything from chemistry to .zip files to the fundamental fact that we're all going to die. It is often introduced as something like "[mumble mumble] a measure of information [mumble mumble]".</p>
<p>It is important to distinguish between information and entropy. Information is a function of an outcome (of a random variable), for example the outcome of an experiment. Entropy is a function of a random variable, for example an experiment before you see the outcome. Specifically,</p>
<blockquote>
<p><i> The <b>entropy</b> $$H(X)$$ is the expected information gain from a random variable $$X$$:
$$$
H(X) = \underset{x_i \sim X}{\mathbb{E}}\Big[-\log P(X=x_i)\Big] \
= -\sum_i p_{x_i} \log p_{x_i}
$$$
($$\underset{x_i \sim X}{\mathbb{E}}$$ means the expected value when value $$x_i$$ is drawn from the distribution of RV $$X$$. $$P(X=x_i)$$, alternatively denoted $$p_{x_i}$$ when $$X$$ is clear from context, is the probability of $$X$$ taking value $$x_i$$.)</i></p>
</blockquote>
<p>(Why is entropy denoted with an $$H$$? I don't know. Just be thankful it wasn't a random <i>Greek</i> letter.)</p>
<p>Imagine you're guessing a number between 0 and 15 inclusive, and the current state of your beliefs is that it is equally likely to be any of these numbers. You ask "is the number 9?". If the answer is yes, you've gained $$-\log_2 \frac{1}{16} = \log_2 16 = 4$$ bits of information. If the answer is no, you've gained $$-\log_2 \frac{15}{16} = \log_2 16 - \log_2 15 = 0.093$$ bits of information. The probability of the first outcome is 1/16 and the probability of the second is 15/16, so the entropy is $$\frac{1}{16} \times 4 + \frac{15}{16} \times 0.093 = 0.337$$ bits.</p>
<p>In contrast, if you ask "is the number smaller than 8?", you always get $$-\log_2 \frac{8}{16} = \log_2{2} = 1$$ bit of information, and therefore the entropy of the question is 1 bit.</p>
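<p>The two questions can be compared with a few lines of code. This is a minimal sketch; the helper name <code>entropy</code> is my own, and it takes a distribution as a plain list of probabilities.</p>

```python
import math


def entropy(probs):
    """Entropy in bits of a distribution given as a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


# "Is the number 9?" has answers yes (1/16) and no (15/16).
h_is_nine = entropy([1 / 16, 15 / 16])

# "Is the number smaller than 8?" splits the 16 options evenly.
h_halving = entropy([8 / 16, 8 / 16])

print(round(h_is_nine, 3), h_halving)
```

<p>The halving question is worth roughly three times as much as the lucky-guess question, which is why binary search beats guessing numbers one at a time.</p>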
<p>Since entropy is expected information gain, whenever you prepare a random variable for the purpose of getting information by observing its value, you want to maximise its entropy.</p>
<p>The closer a probability distribution is to a uniform distribution, the higher its entropy. The maximum entropy of a distribution with $$n$$ possible outcomes is the entropy of the uniform distribution $$U_n$$, which is
$$$
H(U_n) = -\sum_i p_{u_i} \log p_{u_i} = -\sum_i \frac{1}{n} \log \frac{1}{n}
\ = -\log \frac{1}{n} = \log n
$$$
(This can be proved easily once we introduce some additional concepts.)</p>
<p>A general and very helpful principle to remember is that RVs with uniform distributions are most informative.</p>
<p>The above definition of entropy is sometimes called Shannon entropy, to distinguish it from the older but weaker concept of entropy in physics.</p>
<h4 id="entropy-in-physics">Entropy in physics</h4>
<p>The physicists' definition of entropy is a constant times the logarithm of the number of possible states that correspond to the observable macroscopic characteristics of a thermodynamic system:
$$$
S=k_B \ln W
$$$
where $$k_B$$ is the Boltzmann constant, $$\ln$$ is used instead of $$\log_2$$ because physics, and $$W$$ is the number of microstates. (Why do physicists denote entropy with the letter $$S$$? I don't know. Just be glad it wasn't a random <i>Hebrew</i> letter.)</p>
<p>In plain language: it is proportional to the Shannon entropy of finding out what is the exact configuration of bouncing atoms of the hot/cold/whatever box you're looking at, out of all the ways the atoms could be bouncing inside that box given that the box is hot/cold/whatever, assuming that all those ways are equally likely. It is less general than the information theoretic entropy in the sense that it assumes a uniform distribution.</p>
<p>Entropy, either the Shannon or the physics version, seems abstract; random variables, numbers of microstates, what? However, $$S$$ as defined above has very real physical consequences. There's an important thermodynamics equation relating a change in entropy $$\delta S$$, a change in heat energy $$\delta Q$$, and temperature $$T$$ for a reversible process with the equation $$T\delta S = \delta Q$$, which sets a lower bound on how much energy you need to discover information (i.e., reduce the number of microstates that might be behind the macrostate you observe). Getting one bit of information means that $$\delta S$$ is $$k_B \ln 2$$ (from the definition of $$S$$), so at temperature $$T$$ kelvins we need $$k_B T \ln 2 \approx 9.6 \times 10^{-24} \times T$$ joules. This prevents arbitrarily efficient computers, and saves us from problems like Maxwell's demon. (Maxwell's demon is a thought experiment in physics: couldn't you violate the principle of increasing entropy (a physics thing) by building a box with a wall cutting it in half with a "demon" (some device) that lets slow particles pass left-to-right only and fast particles right-to-left, thus separating particles by temperature and reducing the number of microstates corresponding to the configuration of atoms inside the box? No, because the demon needs to expend energy to get information.)</p>
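<p>To put a number on the $$k_B T \ln 2$$ bound, here is a quick calculation. The constant is the exact SI value of the Boltzmann constant; the function name and the choice of 300 K as "room temperature" are assumptions for illustration.</p>

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact by the 2019 SI definition


def min_energy_per_bit(temperature_kelvin):
    """Lower bound k_B * T * ln(2), in joules, on the energy per bit."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


room_temperature = 300.0  # kelvin; an assumed value for illustration
print(min_energy_per_bit(1.0), min_energy_per_bit(room_temperature))
```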
<p>Finally, is there an information-theoretic analogue of the second law of thermodynamics, which states that the entropy of an isolated system never decreases? You have to make some assumptions, but you can get to something like it, which I will sketch out in <i>very</i> rough detail and without explaining the terms (see Chapter 4 of <i>Elements of Information Theory</i> for the details). Imagine you have a probability distribution on the state space of a Markov chain. Now it is possible to prove that given any two such probability distributions, the distance between them (as measured using relative entropy; see below) is non-increasing as the chain evolves. Now assume it also happens to be the case that the stationary distribution of the Markov chain is uniform (the stationary distribution is the probability distribution over states such that if every state sends out its probability mass according to the transition probabilities, you get back to the same distribution). We can consider an arbitrary probability distribution over the states, and compare it to the unchanging uniform one, and use the result that the distance between them is non-increasing to deduce that an arbitrary probability distribution will tend towards the uniform (= maximal entropy) one.</p>
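<p>Here is a toy version of that argument, as a sketch under assumptions of my own: a three-state Markov chain whose transition matrix has rows <i>and</i> columns summing to 1, which is what makes the uniform distribution stationary. Starting from a state of perfect knowledge, we track both the relative entropy to the uniform distribution and the entropy itself.</p>

```python
import math


def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def step(dist, transition):
    """One step of the chain: transition[i][j] = P(next = j | current = i)."""
    n = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]


# Rows and columns both sum to 1, so the uniform distribution is stationary.
transition = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
]
uniform = [1 / 3] * 3
dist = [1.0, 0.0, 0.0]  # we start out certain that the chain is in state 0

history = []
for _ in range(10):
    history.append((kl_divergence(dist, uniform), entropy(dist)))
    dist = step(dist, transition)

for kl, h in history:
    print(round(kl, 6), round(h, 6))
```

<p>The first column should only ever go down and the second only ever go up, which is the second-law-like behaviour described above.</p>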
<p>Reportedly, von Neumann (a polymath whose name appears in any mid-1900s mathsy thing) advised Shannon thus:</p>
<blockquote>
<p><i>"You should call [your concept] entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage."</i></p>
</blockquote>
<h3 id="intuition">Intuition</h3>
<p>We've snuck in the assumption that all information comes in the form of:</p>
<ol>
<li>You first have some <i>quantitative</i> uncertainty over a <i>known set</i> of possible outcomes, which you specify in terms of a random variable $$X$$.</li>
<li>You find out the value that $$X$$ has taken.</li>
</ol>
<p>There's a clear random variable if you're pulling numbers out of a hat: the possible values of $$X$$ are the numbers written on the pieces of paper in the hat, and they all have equal probability. But where is the random variable when the piece of information you get is, say, the definition of information? (I don't mean here the literal characters on the screen - that's a more boring question - but instead the knowledge about information theory that is now (hopefully) in your brain). The answer would have to be something like "the random variable representing all possible definitions of information" (with a probability distribution that is, for example, skewed towards definitions that include a $$\log$$ somewhere because you remember seeing that before).</p>
<p>This is a bit tricky to think about, but we see that even in this kind of weird case you can specify some kind of set and probabilities over that set. Fundamentally, knowledge (or its lack) is about having a probability distribution over states. Perfect knowledge means you have probability $$1.00$$ on exactly one state of how something could be. If you're very uncertain, you have a huge probability distribution over an unimaginably large set of states (for example, all possible concepts that might be a definition of information). If you've literally seen nothing, then you're forced to rely on some guess for the prior distribution over states, like all those pesky Bayesian statisticians keep saying.</p>
<h2 id="more-quantities">More quantities</h2>
<h3 id="conditional-entropy">Conditional entropy</h3>
<p>Entropy is a function of the probability distribution of a random variable. We want to be able to calculate the entropies of the random variables we encounter.</p>
<p>A common combination of random variables we see is $$X$$ given $$Y$$, written $$X | Y$$. The definition is
$$$
P(X = x \, |\, Y = y) = \frac{P(X = x \,\land\, Y = y)}{P(Y=y)}.
$$$
It is a common mistake to think that $$H(X|Y) = -\sum_i P(X = x_i | Y = y) \log P(X = x_i | Y = y)$$. What is it then? Let's just do the algebra:
$$$
H(X|Y) = -\underset{x \sim X|Y, y \sim Y}{\mathbb{E}} \big(
\log P(X=x|Y=y)
\big)
$$$
from the definition of the entropy as the expectation of the Shannon information content, and then by algebra:
$$$
H(X|Y) = -\underset{x \sim X|Y, y \sim Y}{\mathbb{E}} \big[
\log P(X=x|Y=y) \big]$$$
$$$
=
-\sum_{y \in \mathcal{Y}} P(Y=y) \sum_{x \in \mathcal{X}} P(X=x | Y=y) \log P(X=x \,|\, Y = y)$$$
$$$
= -\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}}P(X=x\,\land\, Y = y) \log P(X=x \,|\, Y = y)
$$$
where $$\mathcal{X}$$ and $$\mathcal{Y}$$ are simply the sets of possible values of $$X$$ and $$Y$$ respectively. In a trick beloved of bloggers everywhere tired of writing up equations as $$\LaTeX$$, the above is often abbreviated
$$$
-\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log p(x|y)
$$$
where we use $$p$$ as a generic notation for "probability of whatever; random variables left implicit".</p>
<blockquote>
<p><i>The <b>conditional entropy</b> of a random variable $$X$$ given the value of another random variable $$Y$$ is written $$H(X|Y)$$ and defined as
$$$
H(X|Y) = - \sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log p(x|y)
$$$
which is lazier notation for
$$$
-\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}}P(X=x\,\land\, Y = y) \log P(X=x \,|\, Y = y)
$$$
and also equal to
$$$
-\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log \frac{p(x, y)}{p(y)}
$$$
It is most definitely not equal to $$\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x | y) \log p(x | y)$$.</i></p>
</blockquote>
<p>Conditional entropy is a measure of how much information we expect to get from a random variable assuming we've already seen another one. If the RVs $$X$$ and $$Y$$ are independent, the answer is that $$H(X|Y) = H(X)$$. If the value of $$Y$$ implies a value of $$X$$ (e.g. "percentage of sales in the US" implies "percentage of sales outside the US"), then $$H(X|Y) = 0$$, since we can work out what $$X$$ is from seeing what $$Y$$ is.</p>
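<p>Both of these edge cases are easy to check from a joint probability table. The sketch below assumes the joint distribution is stored as a dictionary keyed by <code>(x, y)</code> pairs; that representation and the function name are my own choices.</p>

```python
import math


def conditional_entropy(joint):
    """H(X|Y) in bits, where joint[(x, y)] = P(X = x and Y = y)."""
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    # -sum p(x,y) log p(x|y), written as +sum p(x,y) log (p(y) / p(x,y)).
    return sum(
        p * math.log2(p_y[y] / p) for (x, y), p in joint.items() if p > 0
    )


# Two independent fair coins: seeing Y tells us nothing about X.
independent = {(x, y): 0.25 for x in "HT" for y in "HT"}

# X is fully determined by Y (here X simply copies Y).
determined = {("H", "H"): 0.5, ("T", "T"): 0.5}

print(conditional_entropy(independent), conditional_entropy(determined))
```

<p>The first value is the full 1 bit of a fair coin, i.e. $$H(X)$$, and the second is zero.</p>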
<h3 id="joint-entropy">Joint entropy</h3>
<p>Now if $$H(X|Y)$$ is how much expected surprise there is left in $$X$$ after you've seen $$Y$$, then $$H(X|Y) + H(Y)$$ would sensibly be the total expected surprise in the combination of $$X$$ and $$Y$$. We write $$H(X,Y)$$ for this combination. If we do the algebra, we see that
$$$
H(X,Y) = H(X|Y) + H(Y) $$$
$$$
= -\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log \frac{p(x, y)}{p(y)} - \sum_{y \in \mathcal{Y}} p(y) \log p(y) $$$
$$$= -\left(\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log p(x, y)\right) +
\left( \sum_{y \in \mathcal{Y}, \,x\in \mathcal{X}} p(x,y) \log p(y)\right) -\left( \sum_{y \in \mathcal{Y}} p(y) \log p(y)\
\right) $$$
$$$= -\left(\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log p(x, y)\right)$$$
$$$= H(Z)
$$$
if $$Z$$ is the random variable formed of the pair $$(X, Y)$$ drawn from the joint distribution over $$X$$ and $$Y$$.</p>
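<p>The identity $$H(X,Y) = H(X|Y) + H(Y)$$ can also be checked numerically on any joint table you like. The weather-and-umbrella distribution below is entirely made up for illustration.</p>

```python
import math


def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


# A hypothetical joint distribution: joint[(x, y)] = P(X = x and Y = y).
joint = {
    ("rain", "umbrella"): 0.3,
    ("rain", "no umbrella"): 0.1,
    ("dry", "umbrella"): 0.1,
    ("dry", "no umbrella"): 0.5,
}

# Marginal distribution of Y.
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

h_joint = entropy(joint.values())  # H(X,Y): treat each pair as one outcome
h_y = entropy(p_y.values())        # H(Y)
h_x_given_y = sum(                 # H(X|Y)
    p * math.log2(p_y[y] / p) for (x, y), p in joint.items()
)

print(h_joint, h_x_given_y + h_y)
```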
<h3 id="kullback-leibler-divergence-aka-relative-entropy">Kullback-Leibler divergence, AKA relative entropy</h3>
<p>"Kullback-Leibler divergence" is a bit of a mouthful. It is also called KL divergence, KL distance, or relative entropy. Intuitively, it is a measure of the distance between two probability distributions. For probability distributions represented by functions $$p$$ and $$q$$ over the same set $$\mathcal{X}$$, it is defined as
$$$
D(p\,||\,q) = \sum_{x \in \mathcal{X}} p(x) \log \left(\frac{p(x)}{q(x)}\right).
$$$
It's not a very good distance function; the only property of a distance function it meets is that it's non-negative. It's not symmetric (i.e. $$D(p \,||\, q) \ne D(q \,||\, p)$$) as you can see from the definition (especially considering how it breaks when $$q(x) = 0$$ but not if $$p(x) = 0$$). However, it has a number of cool interpretations, including how many bits you expect to lose on average if you build a code assuming a probability distribution $$q$$ when it's actually $$p$$, and how many bits of information you get in a Bayesian update from distribution $$q$$ to distribution $$p$$. It is also a common loss function in machine learning. The first argument $$p$$ is generally some better or true model, and we want to know how far away $$q$$ is from it.</p>
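<p>A short sketch makes the asymmetry concrete. The function name and the two example distributions are assumptions of mine; the handling of zeros follows the usual conventions that $$0 \log 0 = 0$$ and that the divergence is infinite when $$p$$ puts mass somewhere $$q$$ does not.</p>

```python
import math


def kl_divergence(p, q):
    """D(p || q) in bits for distributions given as aligned lists."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue  # 0 * log(0 / q) is taken to be 0
        if q_x == 0:
            return math.inf  # p puts mass where q says "impossible"
        total += p_x * math.log2(p_x / q_x)
    return total


p = [0.5, 0.5, 0.0]
q = [0.9, 0.05, 0.05]
print(kl_divergence(p, q), kl_divergence(q, p))
```

<p>One direction gives a finite number of bits; the other blows up, because $$q$$ thinks the third outcome is possible and $$p$$ rules it out entirely.</p>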
<h3 id="why-the-uniform-distribution-maximises-entropy">Why the uniform distribution maximises entropy</h3>
<p>The KL divergence gives us a nice way of proving that the uniform distribution maximises entropy. Consider the KL divergence of an arbitrary probability distribution $$p$$ from the uniform probability distribution $$u$$:
$$$
D(p \,||\, u ) = \sum_{x \in \mathcal{X}} p(x) \log \left(\frac{p(x)}{u(x)}\right) $$$
$$$= \sum_{x \in \mathcal{X}} \big( p(x) \log p(x)\big) - \sum_{x \in \mathcal{X}} \big(p(x) \log u(x) \big) $$$
$$$= -H(X) - \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{|\mathcal{X}|} $$$
$$$= -H(X) + \log |\mathcal{X}| $$$
$$$= H(U) - H(X)
$$$
where $$\mathcal{X}$$ is the set of values over which $$p$$ and $$u$$ have non-zero values, $$X$$ is a random variable distributed according to $$p$$, and $$U$$ is a random variable distributed according to $$u$$ (i.e. uniformly). This is the same thing as
$$$
H(X) = H(U) - D(p \,||\,u)
$$$
which implies that we can write the entropy of a random variable as the entropy of a uniform random variable over a set of the same size, minus the KL distance between the distribution of $$X$$ and the uniform distribution. Also, since the KL divergence is guaranteed to be non-negative, this implies that
$$$
H(X) \leq H(U)
$$$
and therefore that the uniform random variable has higher entropy than any other random variable over the same number of outcomes.</p>
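<p>As a numerical sanity check, the sketch below draws a thousand random distributions over six outcomes and confirms that the shortfall from the maximum entropy, $$H(U) - H(X)$$, always matches $$D(p \,||\, u)$$ and is never negative. The sampling scheme, seed and number of outcomes are arbitrary choices of mine.</p>

```python
import math
import random


def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_outcomes = 6
uniform = [1 / n_outcomes] * n_outcomes

gaps = []
for _ in range(1000):
    weights = [rng.random() for _ in range(n_outcomes)]
    p = [w / sum(weights) for w in weights]
    # The shortfall from the maximum entropy should match D(p || u).
    gaps.append((entropy(uniform) - entropy(p), kl_divergence(p, uniform)))

print(max(abs(gap - kl) for gap, kl in gaps), min(gap for gap, _ in gaps))
```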
<h3 id="mutual-information">Mutual information</h3>
<p>Earlier, we saw that $$H(X, Y) = H(X|Y) + H(Y) = H(X) + H(Y|X)$$. As a picture:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrGbqLctRwUw-setCmJDZ_tzd-1XQKwulod9TLgyVehtBNgORCubRlgNWfEhnIpAwYtohX8c3pDV6r3PIpma1YEUxSAiBo4E6qrt1mNoGPS7eGFXPJ7fwNeCnN3XKZMxPT9G0TvS4FftNrmdrlmBh3vdv4s3LFrZYjOqdr304iS8N4xGxyLmX-MNZdYw/s762/ArcoLinux_2022-05-31_22-25-08.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="348" data-original-width="762" height="183" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrGbqLctRwUw-setCmJDZ_tzd-1XQKwulod9TLgyVehtBNgORCubRlgNWfEhnIpAwYtohX8c3pDV6r3PIpma1YEUxSAiBo4E6qrt1mNoGPS7eGFXPJ7fwNeCnN3XKZMxPT9G0TvS4FftNrmdrlmBh3vdv4s3LFrZYjOqdr304iS8N4xGxyLmX-MNZdYw/w400-h183/ArcoLinux_2022-05-31_22-25-08.png" width="400" /></a></div><br />
<p>There's an overlapping region, representing the information you get no matter which of $$X$$ or $$Y$$ you look at. We call this the mutual information, a refreshingly sensible name, and denote it $$I(X;Y)$$, somewhat less sensibly. One way to find it is
$$$
I(X;Y) = H(X,Y) - H(X|Y) - H(Y|X)$$$
$$$= - \sum_{x,y} p(x,y) \log p(x,y) \,+\, \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)} \,+\, \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)}$$$
$$$= \sum_{x,y} p(x,y) \big(
\log p(x,y) - \log p(x) - \log p(y)
\big)$$$
$$$= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}.
$$$
Does this look familiar? Recall the definition
$$$
D(p\,||\,q) = \sum_{x \in \mathcal{X}} p(x) \log \left(\frac{p(x)}{q(x)}\right).
$$$
What we see is that
$$$
I(X;Y) = D(p(x, y) \, || \, p(x) p(y)),
$$$
or in other words that the mutual information between $$X$$ and $$Y$$ is the "distance" (as measured by KL divergence) between the probability distributions $$p(x,y)$$ - the joint distribution between $$X$$ and $$Y$$ - and $$p(x) p(y)$$, the joint distribution that $$X$$ and $$Y$$ would have if $$x$$ and $$y$$ were drawn independently.</p>
<p>If $$X$$ and $$Y$$ are independent, then these are the same distribution, and their KL divergence is 0.</p>
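<p>To make the two routes to $$I(X;Y)$$ concrete, here is a minimal Python sketch that computes it once from the entropies and once as a KL divergence. The joint table <code>joint</code> and all function names are made up for illustration; logs are base 2, so the results are in bits.</p>

```python
import math

def marginals(joint):
    """Return the marginal distributions p(x) and p(y) of a joint table."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information_entropies(joint):
    """I(X;Y) = H(X,Y) - H(X|Y) - H(Y|X), using H(X|Y) = H(X,Y) - H(Y)."""
    px, py = marginals(joint)
    h_xy = entropy(joint)
    h_x_given_y = h_xy - entropy(py)
    h_y_given_x = h_xy - entropy(px)
    return h_xy - h_x_given_y - h_y_given_x

def mutual_information_kl(joint):
    """I(X;Y) = D(p(x,y) || p(x) p(y))."""
    px, py = marginals(joint)
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Hypothetical joint distribution over two binary variables (an assumption).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# The joint distribution X and Y would have if drawn independently.
px, py = marginals(joint)
independent = {(x, y): px[x] * py[y] for x in px for y in py}

print(mutual_information_entropies(joint), mutual_information_kl(joint))
print(mutual_information_kl(independent))
```

<p>Up to floating-point error, the two functions should agree on <code>joint</code>, and the independent table should give a mutual information of zero.</p>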
<p>If the value of $$Y$$ can be determined from the value of $$X$$, then the joint probability distribution of $$X$$ and $$Y$$ is a table where for every $$x$$, there is only one $$y$$ such that $$p(x,y) > 0$$ (otherwise, there would be a value $$x$$ such that there is uncertainty about $$Y$$). Let the function mapping an $$x$$ to the singular $$y$$ such that $$p(x,y) > 0$$ be $$f$$. Then
$$$
I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$$
$$$= \sum_y p(y) \sum_{x | f(x) = y} p(x|y) \log \frac{p(x, f(x))}{p(x)p(y)}.
$$$
Now $$p(x, f(x)) = p(x)$$, because there is no $$y \ne f(x)$$ such that $$p(x, y) \ne 0$$. Therefore we get that the above is equal to
$$$
\sum_y p(y) \sum_{x | f(x) = y} p(x|y) \log \frac{p(x)}{p(x)p(y)}
= - \sum_y p(y) \sum_{x | f(x) = y} p(x|y) \log p(y),
$$$
and since $$\log p(y)$$ does not depend on $$x$$, we can sum out the probability distribution to get
$$$
-\sum_y p(y) \log p(y) = H(Y).
$$$
In other words, if $$Y$$ can be determined from $$X$$, then the expected information that $$X$$ gives about $$Y$$ is the same as the expected information given by $$Y$$.
</p><p>We can graphically represent the relations between $$H(X)$$, $$H(Y)$$, $$H(X|Y)$$, $$H(Y|X)$$, $$H(X,Y)$$, and $$I(X;Y)$$ like this:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIfRf2gkkFvGyjcbKP1ae7_GVlagYfW5Mz6vxVP_BmjGWMkrG53xsdW_b1vXH9dgXWoVlTl6Ic8TGJIyQ3WXSecZ8J4MlEvMoY3NgubTGjDIicpywUD7xLht0GuipBnS4DYOmmAEH6J7Fb39HMKoePq6yDFJNZHCaMSwtsUaI8wTZ49E93yLD8OZRg9g/s756/ArcoLinux_2022-05-31_22-25-34.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="447" data-original-width="756" height="378" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIfRf2gkkFvGyjcbKP1ae7_GVlagYfW5Mz6vxVP_BmjGWMkrG53xsdW_b1vXH9dgXWoVlTl6Ic8TGJIyQ3WXSecZ8J4MlEvMoY3NgubTGjDIicpywUD7xLht0GuipBnS4DYOmmAEH6J7Fb39HMKoePq6yDFJNZHCaMSwtsUaI8wTZ49E93yLD8OZRg9g/w640-h378/ArcoLinux_2022-05-31_22-25-34.png" width="640" /></a></div><br /><p><br /></p>
<p>Having this image in your head is the single most valuable thing you can do to improve your ability to follow information-theoretic maths. Just to spell it out, here are some of the results you can read out from it:
$$$H(X,Y) = H(X) + H(Y|X) $$$
$$$H(X,Y) = H(X|Y) + H(Y) $$$
$$$H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X) $$$
$$$H(X,Y) = H(X) + H(Y) - I(X;Y) $$$
$$$H(X) = I(X;Y) + H(X|Y)$$$
The same relations are also sometimes drawn as a Venn diagram:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggLIC4G6xVL1Ye-9BSkZLZCOyJc-7dI_SI-Ce1leWZvOj_H9F5wFsNz28g1GXe7pnJnfIYO-unAm2uHj3LzYcwVArgeTzuoX26d0tZ4GT6YBTR5iYA4d03t65q0z8THahQFfuGjpX066pe5n1r81dq7DZNLoE7hEpOdCeHnFpr_JWSoaHZupsV76mESw/s396/ArcoLinux_2022-05-31_22-26-00.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="396" data-original-width="382" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggLIC4G6xVL1Ye-9BSkZLZCOyJc-7dI_SI-Ce1leWZvOj_H9F5wFsNz28g1GXe7pnJnfIYO-unAm2uHj3LzYcwVArgeTzuoX26d0tZ4GT6YBTR5iYA4d03t65q0z8THahQFfuGjpX066pe5n1r81dq7DZNLoE7hEpOdCeHnFpr_JWSoaHZupsV76mESw/w386-h400/ArcoLinux_2022-05-31_22-26-00.png" width="386" /></a></div><br /><p><br /></p>
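<p>The relations in these diagrams are easy to check numerically. The sketch below uses a made-up example (an assumption, not from any dataset) in which $$X$$ is uniform on four values and $$Y = f(X) = X \bmod 2$$, so it also illustrates the earlier result that $$I(X;Y) = H(Y)$$ when $$Y$$ is determined by $$X$$.</p>

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical example: X uniform on {0, 1, 2, 3}, Y = f(X) = X mod 2.
p_x = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}

def f(x):
    return x % 2

# Since Y is determined by X, only the pairs (x, f(x)) have non-zero probability.
joint = {(x, f(x)): p for x, p in p_x.items()}

p_y = defaultdict(float)
for (x, y), p in joint.items():
    p_y[y] += p

h_x = entropy(p_x.values())
h_y = entropy(p_y.values())
h_xy = entropy(joint.values())
h_x_given_y = h_xy - h_y      # H(X|Y) = H(X,Y) - H(Y)
h_y_given_x = h_xy - h_x      # H(Y|X) = H(X,Y) - H(X)
mi = h_x + h_y - h_xy         # I(X;Y) = H(X) + H(Y) - H(X,Y)

print(mi, h_y, h_y_given_x)   # prints 1.0 1.0 0.0
```

<p>Here $$H(X) = 2$$ bits splits into $$I(X;Y) = 1$$ bit (the parity, which $$Y$$ tells you) and $$H(X|Y) = 1$$ bit (what is still unknown about $$X$$ once you know $$Y$$).</p>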
<h3 id="data-processing-inequality">Data processing inequality</h3>
<p>A Markov chain is a series of random variables such that the $$(n+1)$$th is only directly influenced by the $$n$$th. If $$X \to Y \to Z$$ is a Markov chain, it means that all effects $$X$$ has on $$Z$$ are through $$Y$$.</p>
<p>The data processing inequality states that if $$X \to Y \to Z$$ is a Markov chain, then
$$$
I(X; Y) \geq I(X; Z).
$$$
This should be pretty intuitive: the mutual information $$I(X;Y)$$ between $$X$$ and $$Y$$, which have a direct causal link between them, shouldn't be lower than that between $$X$$ and the more-distant $$Z$$, which $$X$$ can only influence through $$Y$$.</p>
<p>A special case is the Markov chain $$X \to Y \to f(Y)$$, where $$X$$ is, say, what happened in an abandoned parking lot at 3am, $$Y$$ is the security camera footage, and $$f$$ is some image enhancing process (more generally: any deterministic function of the data $$Y$$). The data processing inequality tells us that
$$$
I(X; Y) \geq I(X; f(Y)).
$$$
In essence, this means that any function you try to apply to some data $$Y$$ you have about some event $$X$$ cannot increase the information about the event that is available. Any enhancing function can only make it easier to spot some information about the event that is <i>already present</i> in the data you have about it (and the function might very plausibly destroy some). If all you have are four pixels, no amount of image enhancement wizardry will let you figure out the perpetrator's eye colour.</p>
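<p>As a sanity check of the special case, here is a small Python sketch that pushes a joint distribution of $$(X, Y)$$ through a few deterministic functions $$f$$ and compares mutual informations. The table <code>joint</code> and the helper names are hypothetical, chosen only for illustration.</p>

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits for a dict mapping (x, y) to p(x, y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def process(joint, f):
    """Joint distribution of (X, f(Y)), given the joint distribution of (X, Y)."""
    out = {}
    for (x, y), p in joint.items():
        key = (x, f(y))
        out[key] = out.get(key, 0.0) + p
    return out

# Hypothetical joint table: X is binary, Y is a noisy four-level observation of X.
joint = {
    (0, 0): 0.30, (0, 1): 0.10, (0, 2): 0.05, (0, 3): 0.05,
    (1, 0): 0.05, (1, 1): 0.05, (1, 2): 0.10, (1, 3): 0.30,
}

before = mutual_information(joint)
candidates = [
    ("identity", lambda y: y),      # loses nothing
    ("halve", lambda y: y // 2),    # merges neighbouring levels
    ("constant", lambda y: 0),      # throws everything away
]
for name, f in candidates:
    after = mutual_information(process(joint, f))
    print(name, after)
    # Data processing inequality, with a small tolerance for rounding.
    assert after <= before + 1e-12
```

<p>No choice of <code>f</code> can make the assertion fail, which is the point: processing can at best preserve the information about $$X$$ that is already in $$Y$$.</p>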
<p>The proof (for the general case of $$X \to Y \to Z$$) goes like this: consider $$I(X; Y,Z)$$ (that is, the mutual information between knowing $$X$$ and knowing both $$Y$$ and $$Z$$). Now consider the different values in Venn diagram form:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG5o6kK-DxPU0tQRHkIgEevjypvmeikAoGX5qDjDOYKvc8C0azyFVA1evw5iBoD-a5jPBSamLrdOePJSPvwzV-MhKsdnwFYv6pRXnN2wL5BkXyCS5Lehg8QVL4-ZysYIHhPo56LyzLyTnescNUIZSuO5dOcHUm6EhdiAETvC0gtXQ4JR0XSSN3aaqaDg/s846/ArcoLinux_2022-05-31_22-59-32.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="688" data-original-width="846" height="325" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG5o6kK-DxPU0tQRHkIgEevjypvmeikAoGX5qDjDOYKvc8C0azyFVA1evw5iBoD-a5jPBSamLrdOePJSPvwzV-MhKsdnwFYv6pRXnN2wL5BkXyCS5Lehg8QVL4-ZysYIHhPo56LyzLyTnescNUIZSuO5dOcHUm6EhdiAETvC0gtXQ4JR0XSSN3aaqaDg/w400-h325/ArcoLinux_2022-05-31_22-59-32.png" width="400" /></a></div><br /><p><br /></p>
<p>$$I(X; Y, Z)$$ corresponds to all areas within the circle representing $$X$$ that are also within at least one of the circles for $$Y$$ and $$Z$$. If we knew both $$Y$$ and $$Z$$, this "bite" is how much would be taken out of the uncertainty $$H(X)$$ of $$X$$.</p>
<p>We see that the red lined area is $$I(X; Y|Z)$$ (the information shared between $$X$$ and the part of $$Y$$ that remains unknown if you know $$Z$$), and likewise the green hatched area is $$I(X; Y; Z)$$ and the blue dotted area is $$I(X;Z|Y)$$. Since the red-lined and green-hatched areas together are $$I(X;Y)$$, and the green-hatched and blue-dotted areas together are $$I(X;Z)$$, we can write both
$$$
I(X; \,Y,Z) = I(X;\,Y) + I(X;\,Z|Y)$$$
$$$I(X; \,Y,Z) = I(X;\,Z) + I(X;\,Y|Z)
$$$
But hold on - $$I(X;Z|Y)=0$$ by the definition of a Markov chain, since no influence can pass from $$X$$ to $$Z$$ without going through $$Y$$, meaning that if we know everything about $$Y$$, nothing more we can learn about $$Z$$ will tell us anything more about $$X$$.</p>
<p>Since that term is zero, we have
$$$
I(X; \; Y) = I(X; \; Z) + I(X; \, Y|Z)
$$$
and since mutual information (including the conditional kind, like $$I(X;Y|Z)$$) must be non-negative, this in turn implies
$$$
I(X;Y) \geq I(X;Z).
$$$</p>
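<p>The steps of the proof can also be checked on a toy Markov chain. In the sketch below, $$X$$ is a fair coin, $$Y$$ is $$X$$ passed through a channel that flips it with probability 0.1, and $$Z$$ is $$Y$$ passed through a channel that flips it with probability 0.2; these numbers and all names are assumptions made purely for illustration.</p>

```python
import math
from itertools import product

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, axes):
    """Marginalise a joint table over (x, y, z) down to the given axes."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        out[key] = out.get(key, 0.0) + p
    return out

def H(joint, axes):
    return entropy(marginal(joint, axes))

def flip(bit, eps):
    """Output distribution of a channel that flips `bit` with probability eps."""
    return {bit: 1 - eps, 1 - bit: eps}

# Build p(x, y, z) = p(x) p(y|x) p(z|y), which is what makes X -> Y -> Z a Markov chain.
joint = {}
for x, y, z in product((0, 1), repeat=3):
    joint[(x, y, z)] = 0.5 * flip(x, 0.1)[y] * flip(y, 0.2)[z]

X, Y, Z = 0, 1, 2
i_xy = H(joint, [X]) + H(joint, [Y]) - H(joint, [X, Y])
i_xz = H(joint, [X]) + H(joint, [Z]) - H(joint, [X, Z])
# Conditional mutual information: I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C).
i_xz_given_y = H(joint, [X, Y]) + H(joint, [Z, Y]) - H(joint, [X, Y, Z]) - H(joint, [Y])
i_xy_given_z = H(joint, [X, Z]) + H(joint, [Y, Z]) - H(joint, [X, Y, Z]) - H(joint, [Z])

print("I(X;Y)   =", i_xy)
print("I(X;Z)   =", i_xz)
print("I(X;Z|Y) =", i_xz_given_y)
print("I(X;Y|Z) =", i_xy_given_z)
```

<p>Up to rounding, $$I(X;Z|Y)$$ should come out as zero and $$I(X;Y)$$ should equal $$I(X;Z) + I(X;Y|Z)$$, matching the two steps the proof relies on.</p>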
<h2 id="two-big-things-source-channel-coding">Two big things: source & channel coding</h2>
<p>Much of information theory concerns itself with one of two goals.</p>
<p>Source coding is about data compression. It is about taking something that encodes some information, and trying to make it shorter without losing the information.</p>
<p>Channel coding is about error correction. It is about taking something that encodes some information, and making it longer to try to make sure the information can be recovered even if some errors creep in.</p>
<p>The basic model that information theory deals with is the following:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsS1fWWqne7wyRiNIS8jjsFxOPCb5B54n-xNiEKD4PYt3jwml1VCOrkX6JxhEXHuCFd7wv9Kr5Vvu1VCSi-PP74LtoZSto9IsYzZcKMJ3uzam7_JfrVUcerg51rWIdZzCQaxjLezaVeepV8TcaxudzUHzRTnCNqfWe-Ju4icIFHKqd5swb879emq1NTQ/s1220/ArcoLinux_2022-05-31_22-52-21.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="270" data-original-width="1220" height="142" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsS1fWWqne7wyRiNIS8jjsFxOPCb5B54n-xNiEKD4PYt3jwml1VCOrkX6JxhEXHuCFd7wv9Kr5Vvu1VCSi-PP74LtoZSto9IsYzZcKMJ3uzam7_JfrVUcerg51rWIdZzCQaxjLezaVeepV8TcaxudzUHzRTnCNqfWe-Ju4icIFHKqd5swb879emq1NTQ/w640-h142/ArcoLinux_2022-05-31_22-52-21.png" width="640" /></a></div><br />
<p>We have some random variable $$Z$$ - the contents of a text message, for example - which we encode under some coding scheme to get a message consisting of a sequence of symbols that we send over some channel - the internet, for example - and then hopefully recover the original message. The channel can be noiseless, meaning it transmits everything perfectly and can be removed from the diagram, or noisy, in which case there is a chance that for some $$i$$, the $$X_i$$ sent into the channel differs from the $$Y_i$$ you get out.</p>
<p>Source coding is about trying to minimise how many symbols you have to send, while channel coding is about trying to make sure that $$\hat{Z}$$, the estimate of the original message, really ends up being the original message $$Z$$.</p>
<p>A big result in information theory is that for the above model, it is possible to separate the source coding and the channel coding, while maintaining optimality. The problems are distinct; regardless of source coding method, we can use the same channel coding method and still do well, and vice versa. Thanks to this result, called the source-channel separation theorem, source and channel coding can be considered separately. Therefore, our model can look like this:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKVksPpXEHyJeOtvGWW5rKPZYkJe-TxHDqeH4liLtrRak9ybQvb1MoaEkIDgLHc4yy7rXDra5JOQjswbiH2ZZtpEk4egOiXP3uk_bBk87FJS1Zl0d4bSbjo2uso2lSXrwIPJa-4DyMTnFtxCFI-8t5buk0NHxWPHGUEsX_6YcxZqc5MssJflmom7g_gQ/s1367/ArcoLinux_2022-05-31_22-52-43.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="246" data-original-width="1367" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKVksPpXEHyJeOtvGWW5rKPZYkJe-TxHDqeH4liLtrRak9ybQvb1MoaEkIDgLHc4yy7rXDra5JOQjswbiH2ZZtpEk4egOiXP3uk_bBk87FJS1Zl0d4bSbjo2uso2lSXrwIPJa-4DyMTnFtxCFI-8t5buk0NHxWPHGUEsX_6YcxZqc5MssJflmom7g_gQ/w640-h116/ArcoLinux_2022-05-31_22-52-43.png" width="640" /></a></div><p><br /></p>
<p>(We use $$X^n$$ to refer to a random variable representing a length-$$n$$ sequence of symbols)</p>
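<p>To give a feel for the separated pipeline, here is a deliberately crude Python sketch: the source coder is the standard library's <code>zlib</code> compressor, the channel coder is a 3x repetition code with majority-vote decoding, and the "noisy channel" is a toy that flips at most one bit in each block of three. All of these choices are assumptions for illustration; neither stage is anywhere near the theoretical bounds.</p>

```python
import random
import zlib

def to_bits(data):
    """Bytes -> list of bits, most significant bit first."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]

def from_bits(bits):
    """List of bits (length a multiple of 8) -> bytes."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

def channel_encode(bits):
    """Repetition code: send every bit three times (makes the message longer)."""
    return [b for b in bits for _ in range(3)]

def channel_decode(bits):
    """Majority vote over each block of three received bits."""
    return [1 if sum(bits[i:i + 3]) >= 2 else 0 for i in range(0, len(bits), 3)]

def noisy_channel(bits, rng):
    """Toy channel (an assumption): flips at most one bit per block of three."""
    out = list(bits)
    for i in range(0, len(out), 3):
        if rng.random() < 0.5:
            out[i + rng.randrange(3)] ^= 1
    return out

def transmit(message, seed=0):
    rng = random.Random(seed)
    compressed = zlib.compress(message)        # source coding: shorter
    x = channel_encode(to_bits(compressed))    # channel coding: longer, redundant
    y = noisy_channel(x, rng)                  # errors creep in
    recovered = from_bits(channel_decode(y))   # channel decoding
    return zlib.decompress(recovered)          # source decoding

message = b"hello hello hello " * 20
print(len(message), len(zlib.compress(message)))
print(transmit(message) == message)
```

<p>The two stages know nothing about each other: you could swap <code>zlib</code> for another compressor, or the repetition code for a better error-correcting code, without touching the other stage.</p>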
<p>Both source and channel coding consist of:</p>
<ul>
<li>a central but tricky theorem giving theoretical bounds and motivating some definitions</li>
<li>a bunch of methods that people have invented for achieving something close to those theoretical bounds in practice</li></ul>Next see <a href="https://www.strataoftheworld.com/2022/06/information-theory-2-source-coding.html">the source coding post</a> and <a href="https://www.strataoftheworld.com/2022/06/information-theory-3-channel-coding.html">the channel coding post</a>. <br /><div><ul>
</ul>
<p></p><p></p></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-41260904924096790052021-10-17T23:14:00.000+01:002021-10-17T23:14:19.101+01:00Death is bad<p style="text-align: center;"> <span style="font-size: x-small;">3.5k words (about 12 minutes)<br /></span></p><p>Sometime in the future, we might have the technology to extend lifespans indefinitely and make people effectively immortal. When and how this might happen is a complicated question that I will not go into. Instead, I will take heed of Ian Malcolm in <i>Jurassic Park</i>, who complains that "your scientists were so preoccupied with whether or not they could that they didn't stop to think if they should".</p>
<p>This is (in my opinion rather surprisingly) a controversial question. </p>
<p>The core of it is this: should people die?</p>
<p>Often the best way to approach a general question is to start by thinking about specific cases. Imagine a healthy ten-year-old child; should they die? The answer is clearly no. What about yourself, or your friends, or the last person you saw on the street? Wishing for death for yourself or others is almost universally a sign of a serious mental problem; acting on that desire even more so.</p>
<p>There are some exceptions. Death might be the best option for a sick and pained 90-year-old with no hope of future healthy days. It may well be (as I've seen credibly claimed in several places) that the focus on prolonging lifespan even in pained terminally ill people is excessive. "Prolong life, whatever the cost" is a silly point of view; maximising heartbeats isn't what we really care about.</p>
<p>However, now imagine a pained, dying, sick person who has a hope of surviving to live many healthy happy days – say a 40-year-old suffering from cancer. Should they die? No. You would hope that they get treatment, even if it's nauseating, fatiguing, painful chemotherapy for months on end. If there is no cure, you'd hope that scientists somewhere invent it. Even if it does not happen in time for that particular person, at least it will save others in the future, and eliminate one more horror of the world. It would be a great and celebrated human achievement.</p>
<p>What's the difference between the terminally ill 90-year-old and the 40-year-old with a curable cancer? The difference is technology. We have the technology to cure some cancers, but we don't have the technology to cure the many ageing-related diseases. If we did, then even if the treatment is expensive or difficult, we would hope – and consider it a moral necessity – for both of them to get it, and hope that they both go on living for many more years.</p>
<p>No one dies of time. You are a complex process running on the physical hardware of your brain, which is kept running by the machine that is the rest of your body. You die when that machine breaks. There is no poetic right time when you close your eyes and get claimed by time, there is only falling to one mechanical fault or another.</p>
<p>People (or conscious beings in general) matter, and their preferences should be taken seriously – this is the core of human morality. What is wrong in the world can be fixed – this is the guiding principle of civilisation since the Enlightenment.</p>
<p>So, should people die? Not if they don't want to, which (I assume) for most people means not if they have a remaining hope of happy, productive days.</p>
<h2 id="counterarguments">Counterarguments</h2>
<p>The idea that death is something to be defeated, like cancer, poverty, or smallpox, is not a common one. Perhaps there's some piece of the puzzle that is missing from the almost stupidly simple argument above?</p>
<p>One of the most common counterarguments is overpopulation (perhaps surprisingly; environmentalist concerns have clearly penetrated very deep into culture despite not being much of a thing before the 1970s). The argument goes like this: if we solve death, but people keep being born, there will be too many people on Earth, leading to environmental problems, and eventually low quality of life for everyone.</p>
<p>The object-level point (I will return to what I consider more important meta-level points later) is that demographic predictions have a tendency to be wrong, especially about the future (as the <a href="https://quoteinvestigator.com/2013/10/20/no-predict/">Danish (?) saying goes</a>). Malthus figured out pre-industrial demographics just as they came to an end with the industrial revolution. In the 1960s, there were <a href="https://en.wikipedia.org/wiki/The_Population_Bomb">warnings</a> of a population explosion, which fizzled out when it turned out that the <a href="https://en.wikipedia.org/wiki/Demographic_transition">demographic transition</a> (falling birth rates as countries develop) is a thing. Right now the world population is expected to stabilise at less than 1.5x the current size, and many developed countries are dealing with problems caused by shrinking populations (which they strangely refuse to fix through immigration).</p>
<p>Another concern is the effect of having a lot of old people around. What about social progress – how would the development of women's rights have been realised if you had a bunch of 19th-century misogynists walking around in their top hats? What sort of power imbalances and Gini coefficients would we reach if Franklin Delano Roosevelt could continue cycling through high-power government roles indefinitely, or Elon Musk had time to profit from the colonisation of Mars? What happens to science when it can no longer advance (as Max Planck said) one funeral at a time?</p>
<p>(There is even an argument that life extension technology is problematic because the rich will get it first. This is an entirely general and therefore entirely worthless argument, since it applies to all human progress: the rich got iPhones first – clearly smartphones are a problematic technology, etc., etc. If you're worried about only the rich having access to it for too long, the proper response is to subsidise its development so that the period when not everyone has access to it is as short as possible.)</p>
<p>These are valid concerns that will definitely test the abilities of legislators and voters in the post-death era. However, they can probably be overcome. I think people can be brought around surprisingly far on social and moral attitudes without killing anyone. Consider how almost anyone's pre-2000 opinions would make them a near-pariah today; many of those people still exist and it would be hard to write them off as a total loss. Maybe some minority of immortal old people couldn't cope with all the Pride Parades – or whatever the future equivalent is – marching past their windows and they go off to start some place of their own with sufficient top hat density; then again, most countries have their own conservative backwater region already. If they start going for nukes, that's more of an issue, but not more so than Iran.</p>
<p>As for imbalances of power and wealth, it might require a few more taxes and other policies (the expansion of term limits to more jobs?), but given the strides that equalising policy-making has made it seems hard to argue there is a fundamental impossibility.</p>
<p>And what about all the advantages? A society of the undying might well be far more long-term oriented, mitigating one of the greatest human failures. After all, how often do people bemoan that 70-year-old oil executives just don't care because they won't be around to see the effects of climate change?</p>
<p>What about all the collective knowledge that is lost? Imagine if people in 2050 could hear World War II veterans reminding them of what war really is. Imagine if John von Neumann could have continued casually inventing fields of maths at a rate of about two per week instead of dying at age 53 (while <a href="https://en.wikipedia.org/wiki/John_von_Neumann#Illness_and_death">absolutely terrified of his approaching death</a>). Imagine if we could be sure to see George R. R. Martin finish <i>A Song of Ice and Fire</i>.</p>
<p>Also, concerns like overpopulation and Elon Musk's tax plan just seem small in comparison to the <i>literal eradication of death</i>.</p>
<p>Imagine proposing a miracle peace plan to the cabinets of the Allied countries in the midst of World War II. The plan would end the war, install liberal governments in the Axis powers, and no one even has to nuke a Japanese city. (If John von Neumann starts complaining about not getting to test his implosion bomb design, give him a list of unsolved maths problems to shut him up.) Now imagine that the reaction is somewhere between hesitance and resistance, together with comments like "where are we going to put all the soldiers we've trained?", "what about the effects on the public psyche of a random abrupt end without warning?", and "how will we make sure that the rich industrialists don't profit too much from all the suddenly unnecessary loans that they've been given?" At this point you might be justified in shouting: "this war is killing fifteen million people per year, we need to end it now".</p>
<p>The situation with death is similar, except it's over fifty million per year rather than fifteen. (See <a href="https://ourworldindata.org/grapher/annual-number-of-deaths-by-cause?country=~OWID_WRL">this chart</a> for breakdown by cause – you'll see that while currently-preventable causes like infectious diseases kill millions, ageing-related ones like heart disease, cancer, and dementia are already the majority.)</p>
<h3 id="thought-experiments">Thought experiments</h3>
<p>To make the question more concrete, we can try thought experiments. Imagine a world in which people don't die. Imagine visitors from that world coming to us. Would they go "ah yes, inevitable oblivion in less than a century, this is exactly the social policy we need, thanks – let us go run back home and implement it"? Or would they think of our world like we do of a disease-stricken third-world country, in dire need of humanitarian assistance and modern technology?</p>
<p>It's hard to get into the frame of mind of people who live in a society that doesn't hand out automatic death sentences to everyone at birth. Instead, to evaluate whether raising life expectancies to 200 makes sense even given the environmental impacts, we can ask whether a policy of killing people at age 50 to reduce population pressures would be even better than the current status quo – if both an increase and a decrease in life expectancy are bad, this is suspicious because it implies we're at the optimum by chance. Or, since the abstract question (death in general) is always harder than more concrete ones, imagine withholding a drug that manages heart problems in the elderly on overpopulation grounds.</p>
<p>You might argue that current life expectancies are optimal. This is a hard position to defend. It seems like a coincidence that the lifespan achievable with modern technology is exactly the "right" one. Also, neither you nor society should make that choice for other people. Perhaps some people get bored of life and readily step into coffins at age 80; many others want nothing more than to keep living. People should get what they want. Forcing everyone to conform to a certain lifespan is a specific case of forcing everyone to conform to a certain lifestyle; much moral progress in the past century has consisted of realising that this is bad.</p>
<p>I think it's also worth emphasising one common thread in the arguments against solving death: they are all arguments about societal effects. It is absolutely critical to make sure that your actions don't cause massive negative externalities, and that they also don't amount to defecting in <a href="https://en.wikipedia.org/wiki/Prisoner%27s_dilemma">prisoner's dilemma</a> or <a href="https://en.wikipedia.org/wiki/Tragedy_of_the_commons">the tragedy of the commons</a>. However, it is also absolutely critical that people are happy and aren't forced to die, because people and their preferences/wellbeing are what matters. Society exists to serve the people who make it up, not the other way around. Some of the worst moral mistakes in history come from emphasising the collective, and identifying good and harm in terms of effects on an abstract collective (e.g. a nation or religion), rather than in terms of effects on the individuals that make it up. Saying that everyone has to die for some vague pro-social reason is the ultimate form of such cart-before-the-horse reasoning.</p>
<h2 id="why-care-about-the-death-question">Why care about the death question?</h2>
<p>There are several features that make the case against death, and people's reactions to it, particularly interesting.</p>
<h3 id="failure-of-generalisation">Failure of generalisation</h3>
<p>First: generalisation. I started this post using specific examples before trying to answer the more general question. I think the popularity of death is a good example of how bad humans are at generalising.</p>
<p>When someone you know dies, it is very clearly and obviously a horrible tragedy. The scariest thing that could happen to you is probably either your own death, the death of people you care about, or something that your brain associates with death (the common fears: heights, snakes, ... clowns?).</p>
<p>And yet, make the question more abstract – think not about a specific case (which you feel in your bones is a horrible tragedy that would never happen in a just world), but about the general question of whether people should die, and it's like a switch flips: a person who would do almost anything to save themselves or those they care about, who cares deeply about suffering and injustice in the world, is suddenly willing to consign five times the death toll of World War I to permanent oblivion every single year.</p>
<p>Stalin reportedly said that a single death is a tragedy, but a million is only a statistic. Stalin is wrong. A single death is a tragedy, and a million deaths is a million tragedies. Tragedies should be stopped.</p>
<h3 id="people-these-days">People These Days</h3>
<p>Second: today, we're pretty good at ignoring and hiding death. This wasn't always the case. If you're a medieval peasant, death is never too far away, whether in the form of famine or plague or Genghis Khan. Death was like an obnoxious dinner guest: not fun, but also just kind of present in some form or another whether you invited them or not, so out of necessity involved in life and culture.</p>
<p>Today, unexpected death is much rarer. Child mortality globally has declined from <a href="https://ourworldindata.org/child-mortality">over 40% (i.e. almost every family had lost a child) in 1800 to 4.5% in 2015</a>, and <a href="https://ourworldindata.org/grapher/the-decline-of-child-mortality-by-level-of-prosperity-endpoints?time=latest&country=SWE~GBR~JPN~FRA~FIN~European+Union~KOR~ESP">below 0.5%</a> in developed countries. Famines have gone from something everyone lives through to something that the developed world is free from. War and conflict have gone from <a href="https://ourworldindata.org/war-and-peace#the-past-was-not-peaceful">common to uncommon</a>. Many more diseases and accidents can be successfully treated. As a result of all these positive trends, death is less present in people's minds.</p>
<p>As I don't have my culture critic license yet, I won't try to make some fancy overarching points about how People These Days Just Don't Understand and how our Materialistic Culture fails to prepare people to deal with the Deep Questions and Confront Their Own Mortality. I will simply note that (a) death is bad, (b) we don't like thinking about bad things, and (c) sometimes not wanting to think about important things causes perverse situations.</p>
<h3 id="confronting-problems">Confronting problems</h3>
<p>Why do people not want to think that death is bad? I think one central reason is that death seems inevitable. It's tough to accept bad things you can't influence, and much easier to try to ignore them. If at some point you have to confront it anyways, one of the most reassuring stories you can tell is that it has a point. Imagine if over two hundred thousand years, generation after generation of humans, totalling some one hundred billion lives, was born, grew up, developed a rich inner world, and then had that world destroyed forever by random failures, evolution's lack of care for what happens after you reproduce, and the occasional rampaging mammoth. Surely there must be some purpose for it, some reason why all that death is not just a tragedy? Perhaps we aren't "meant" to live long, whatever that means, or perhaps it's all for the common good, or perhaps "death gives meaning to life". Far more comforting to think that than to acknowledge that a hundred billion human lives and counting really are gone forever because they were unlucky enough to be born before we eradicated smallpox, or invented vaccines, or discovered antibiotics, or figured out how to reverse ageing.</p>
<p>Assume death is inevitable. Should you still recognise the wrongness of it?</p>
<p>I think yes, at least if you care about big questions and doing good. I think it's important to be able to look at the world, spot what's wrong about it, and acknowledge that there are huge things that should be done but are very difficult to achieve.</p>
<p>In particular, it's important to avoid the narrative fallacy (Nassim Taleb's term for the human tendency to want to fit the world to a story). In a story, there's a start and an end and a lesson, and the dangers are typically just small enough to be defeated. Our universe <a href="https://www.lesswrong.com/posts/sYgv4eYH82JEsTD34/beyond-the-reach-of-god">has no writer, only physics</a>, and physics doesn't care about hitting you with an unsolvable problem that will kill everyone you love. If you want to increase the justness of the world, recognising this fact is an important starting point.</p>
<h2 id="taxes">Taxes</h2>
<p>Is death inevitable? In considering this question, it's important once again to remember that death is not a singular magical thing. Your death happens when something breaks badly enough that your consciousness goes permanently offline.</p>
<p>Things, especially complex biological machines produced by evolution, can break in very tricky ways. But what can break can be fixed, and people who declare technological feats impossible have a bad track record. The problem might be very hard: maybe we have to wait until we have precision nano-bots that can individually repair the telomeres on each cell, or maybe there is no effective general solution to ageing and we face an endless grind of solving problem after problem to extend life/health expectancies from 120 to 130 to 140 and so forth. Then again, maybe someone leaves out a petri dish by accident in a lab and comes back the next day to the fountain of youth, or maybe by the end of the century no one is worrying about something as old-fashioned as biology.</p>
<p>There's also the possibility of stopgap solutions, like cryonics (preserving people close to death by <a href="https://en.wikipedia.org/wiki/Cryopreservation#Vitrification">vitrifying</a> them and hoping that future technology can revive them). Cryonics is currently in a very primitive state – no large animal has yet been successfully put through it – but there's a research pathway of testing on increasingly complex organs and then increasingly large animals that might eventually lead to success if someone bothered to pour resources into it.</p>
<p>There is no guarantee of when, or whether, any of this will happen. If civilisation is destroyed by an engineered pandemic or nuclear war before then, it will never happen.</p>
<p>Of course, in the very long run we face more fundamental problems, like the heat death of the universe. Literally infinite life is probably physically impossible; maybe this is reassuring.</p>
<h2 id="predictions-and-poems">Predictions and poems</h2>
<p>I will make three predictions about the eventual abolition of death.</p>
<p>First, many people will resist it. They might see it as conflicting with their religious views or as exacerbating inequality, or just as something too new and weird or unnatural.</p>
<p>Second, when the possibility of extending their lifespan stops being an abstract topic and becomes a concrete option, most people will seize it for themselves and their families.</p>
<p>This is a common path for technologies. Lightning rods and vaccines were first seen by some as affronts to God's will, but eventually it turns out people like not burning to death and not dying of horrible diseases more than they like fancy theological arguments. Most likely future generations will discover that they like not ageing more than they like appreciating the meaning of life by definitely not having one past age 120.</p>
<p>Finally, future people (if they exist) will probably look back with horror on the time when everyone died against their will within about a century.</p>
<p>Edgar Allan Poe wrote a poem called <a href="https://www.poetryfoundation.org/poems/48633/the-conqueror-worm">"The Conqueror Worm"</a>, about angels crying as they watch a tragic play called "Man", whose (anti-)hero is a monstrous worm that symbolises death. If we completely ignore what Poe intended with this, we can misinterpret one line to come to a nice interpretation of our own. The poem declares that the angels are watching this play in the "lonesome latter years". Clearly this refers to a future post-scarcity, post-death utopia, and the angels are our wise immortal descendants reflecting on the bad old days, when people were "mere puppets [...] who come and go / at the bidding of vast formless things" like famine and war and plague and death. The "circle [of life] ever returneth in / To the self same spot [= the grave]", and so the "Phantom [of wisdom and fulfilled lives] [is] chased for evermore / By a crowd that seize it not".</p>
<p>Death is a very poetic topic, and other poems need less (mis)interpretation. <a href="https://www.poetryfoundation.org/poems/52773/dirge-without-music">Edna St. Vincent Millay's "Dirge Without Music"</a> is particularly nice, while Dylan Thomas gives away the game in the title: <a href="https://poets.org/poem/do-not-go-gentle-good-night">"Do not go gentle into that good night"</a>.</p>
<h1>Short reviews: biographies</h1>
<p style="text-align: center;"><span style="font-size: x-small;">Published 2021-09-30</span></p><p style="text-align: center;"><span style="font-size: x-small;">Books reviewed (all by Walter Isaacson):<i><br />The Code Breaker: Jennifer Doudna, Gene Editing, and the Future of the Human Race </i>(2021)<br /><i>Steve Jobs: The Exclusive Biography </i>(2011)<i><br />Benjamin Franklin: An American Life </i>(2004)<br /><i></i></span></p><p style="text-align: center;"><span style="font-size: x-small;">3.5k words (about 12 minutes)</span> </p><p style="text-align: center;"><br /></p><p>Why read biographies? If you want stories of people and interesting characters, fiction is better. If you want general, big truths, then you're probably better off reading the many non-fiction books that are about abstract truths and far-ranging concepts rather than the particulars of a single person's life.</p>
<p>Consider, for a moment, designing an algorithm for a problem. The classic way to do this is to think hard about the problem, and then write down a specific series of steps that take you from inputs to (hopefully the correct) outputs. In contrast, the machine learning method is to use statistical methods on a long list of examples to make a model that (hopefully) approximates the mapping between inputs and outputs. </p>
<p>Reading explicit abstract arguments is like the first method. Like explicit algorithm design, it comes with some nice properties – it's very clear exactly how it generalises and when it's applicable – to the point where it's easy to scoff at the less explicit methods: "it's just a black box that our pile of statistics spits out" / "it's just anecdotes about someone's life".</p>
<p>However, much like machine learning methods can extract subtle lessons from a long list of examples, I think there is implicit knowledge contained in the long list of detail about someone's life that you find in a biography (at least if you read about people who did interesting things in their life – but then again, if there's a biography of someone ...). Once you've read the details of how CRISPR was invented, Apple jump-started, or compromises reached at the 1787 American Constitutional Convention, I think your model of how science, business, and politics work in the real world is improved in many subtle ways.</p>
<p>(Note that this argument also applies to reading history.)</p>
<p>And of course, since biographies deal strongly with character, there is an element of the novel-like thrill of watching things happen to people.</p>
<h2 id="walter-isaacsons-biographies">Walter Isaacson's biographies</h2>
<p>I've read four of Walter Isaacson's biographies. Their subjects are Albert Einstein, Jennifer Doudna, Steve Jobs, and Benjamin Franklin.</p>
<p>The Einstein one I read years ago, and don't remember much detail about. It did earn a 6 out of 7 on my books spreadsheet though.</p>
<p>The <a href="https://en.wikipedia.org/wiki/Jennifer_Doudna">Jennifer Doudna</a> biography is the weakest. The main reason is that we don't get too much insight into Doudna herself or the way she carried out her scientific work, leaving Isaacson to spend many pages on other things: overviews of other players in the development of the <a href="https://en.wikipedia.org/wiki/CRISPR">gene-editing tool CRISPR</a> that are more journalistic than biographical, and descriptions of the biology that are limited by Isaacson's lack of biological expertise (at least when compared to the best popular biology writing, like Richard Dawkins' in <i>The Selfish Gene</i>). Hand-wringing over <a href="https://en.wikipedia.org/wiki/James_Watson">James Watson's</a> controversies takes up an alarming amount of space that is only partly justified by Watson's role as a childhood inspiration for Doudna. There's also a long section about the struggles behind the allocation of the CRISPR Nobel Prize (awarded in 2020) that is clearly balanced and thoroughly researched, but simply less interesting to me than similar segments in the Jobs or Franklin biographies, where the stakes are the fate of companies or nations, rather than who gets a shiny medal.</p>
<p>My guess is that these faults stem mainly from the more limited material Isaacson had access to. Albert Einstein and Benjamin Franklin are both among the most researched individuals in history. To the extent that Steve Jobs is behind, the interviews Isaacson personally conducted seem to have plugged the gap.</p>
<p>Doudna is still an inspiring person. She also has the enviable advantage of not being dead, and therefore may yet do even more and become the subject of further biographies. If you're interested in biotech, including the business side, or scientific careers that may one day win Nobel Prizes, the biography may well be worth reading. </p>
<h2 id="steve-jobs">Steve Jobs</h2>
<p>A god-like experimenter who wants to figure out what traits make tech entrepreneurs succeed may proceed something like this: create a bunch of people with extreme strengths in some areas and extreme weaknesses in others, release them into the world to start companies, and see which extreme strengths can balance out which extreme weaknesses. Such an experiment might well create Steve Jobs.</p>
<p>Take one weakness: Jobs's emotional volatility and, for lack of a better word, general nastiness in some circumstances, including things from extremely harsh criticism of employees' work to horrible table manners at restaurants. This isn't unique to Jobs either: look at the Wikipedia pages for <a href="https://en.wikipedia.org/wiki/Bill_Gates#Management_style">Bill Gates</a> and <a href="https://en.wikipedia.org/wiki/Jeff_Bezos#Leadership_style">Jeff Bezos</a>, and you'll find that they brighten their subordinates' work days with such productive witticisms as "that's the stupidest thing I've ever heard" and "why are you ruining my life?" respectively.</p>
<p>Does this show that behaviour up to and including verbal abuse is a forgivable flaw, or even beneficial, in tech CEOs?</p>
<p>First, though verbal abuse is neither productive nor right, a culture of vigorous debate is a distinct thing with incredible benefits, and the idea that it serves only to hurt and marginalise is not just a misguided generalisation but sometimes diametrically wrong. The best example is Daniel Ellsberg recounting an anecdote from his early times at RAND Corporation in <i>The Doomsday Machine</i> (an unrelated book; my review <a href="https://strataoftheworld.blogspot.com/2020/04/review-doomsday-machine.html">here</a>):</p>
<blockquote><p><i>Rather than showing irritation or ignoring my comment [that he made at the first meeting], Herman Kahn, brilliant and enormously fat, sitting directly across the table from me, looked at me soberly and said, "You're absolutely wrong."</i></p><i>
</i><p><i>A warm glow spread through my body. This was the way my undergraduate fellows on the editorial board of the Harvard Crimson (mostly Jewish, like Herman and me) had spoken to each other; I hadn't experienced anything like it for six years. At King's College, Cambridge, or in the Society of Fellows, arguments didn't remotely take this gloves-off, take-no-prisoners form. I thought, "I've found a home."</i></p>
</blockquote>
<p>Steve Jobs admittedly goes overboard with this. For example, people who worked with him had to learn that "this is shit" meant "that's interesting, could you elaborate and make the case for your idea further?". This is not just unnecessarily rude, but also unclear communication. The general impression that Isaacson gives is also not that Jobs was combative as a thought-out strategy, but rather that this was just his style of interaction.</p>
<p>I suspect that the famous combativeness of many tech CEOs is not itself a useful trait, but instead adjacent to several other traits that are, in particular disagreeableness (in the sense of being willing to disagree with others and not feeling pressure to conform) and perhaps also caring deeply about the product.</p>
<p>Consider another extreme Jobs trait: strange diets, and (in his youth), a belief that he didn't need to shower because of his dieting. This went so far that of the people Isaacson interviews about Jobs's youth, including those who hadn't seen him for decades, almost every one mentions something like "yeah, he stank". Yet while some leap to defend and (worse yet) emulate Jobs's verbal nastiness, presumably on grounds of its correlation with his success, far fewer do the same for his dieting and showering habits. (What conformists!)</p>
<p>I think the more general lesson is that Jobs was extreme in a lot of ways, including in the strength of his opinions and beliefs, and in not having a filter between them and his actions. He gets into eastern mysticism and goes off to India to become a monk. He gets into dieting and starts eating only fruit rather than just reading lifestyle magazines and half-heartedly trying diets for a week like most people might. He gets it into his head that the corner of a Mac isn't rounded enough and declares that in no uncertain terms. </p>
<p>So is that the key then: have firm convictions? We've gone from a maladaptive cliché to a trite one – and still not a very helpful one. Steve Jobs, with his "reality distortion field", may have been an expert at persuading people, but even he can't persuade reality to be another way. Even slightly wrong convictions tend to have nasty collisions with reality.</p>
<p>(It's worth noting that rather than being a stickler for one position or solution, Jobs tended to yo-yo back and forth between extremes, only slowly converging on a decision – something that often confused others at Apple until they learned to use a rolling average of his recent positions.)</p>
<p>The critical part, of course, was that Steve Jobs was right about a lot of things, despite several serious missteps (especially with regard to making over-expensive computers that no one wants to pay for). I think Jobs's success provides evidence that even in aesthetic matters, success has a surprisingly strong component of <i>being actually right</i>. And Jobs, who was all-around very bright despite not being a master of the technical side, seems to have mastered this.</p>
<p>Of course, the story of Jobs's success – which came in spite of his emotional volatility, and tendency to wish away problems rather than facing them – does not entirely fit the idea that success comes in large part from having well-calibrated beliefs about the world and going about achieving them in reasonable and rational ways.</p>
<p>I think there are three things worth keeping in mind.</p>
<p>First, it may well be that most successful people are successful "at random" (i.e. without having a rational strategy for achieving what they want to achieve), but that the probability of achieving your goals given that you have well-calibrated beliefs and a rational reality-accommodating plan is still very much higher than the probability of achieving them given any other strategy. That is, if <script type="math/tex">S</script> is the event of being very successful (by some definition), <script type="math/tex">R</script> the event that you follow a rational strategy and maintain well-calibrated beliefs and generally practice thought patterns that won't get you downvoted on LessWrong, and <script type="math/tex">\neg R</script> the complement of that event, then <script type="math/tex">P(\neg R|S)</script> can be high (i.e. most successful people became successful in not particularly smart ways), while <script type="math/tex">P(S|R)</script> can be much higher than <script type="math/tex">P(S|\neg R)</script> (following a rational strategy still gives you by far the best chances of success).</p>
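<p>To check that these two claims really can hold at once, here is a quick sanity check using Bayes' rule. The three input probabilities are made-up numbers chosen purely for illustration (they are assumptions, not data): rational strategists are rare, but twenty times more likely to succeed.</p>

```python
# Illustrative, made-up inputs (assumptions, not measurements).
p_r = 0.01               # P(R): share of people following a rational strategy
p_s_given_r = 0.10       # P(S|R): chance of big success with such a strategy
p_s_given_not_r = 0.005  # P(S|not R): chance of big success without one

# Law of total probability: P(S) = P(S|R)P(R) + P(S|not R)P(not R)
p_s = p_s_given_r * p_r + p_s_given_not_r * (1 - p_r)

# Bayes' rule: P(not R|S) = P(S|not R)P(not R) / P(S)
p_not_r_given_s = p_s_given_not_r * (1 - p_r) / p_s

print(f"P(S)       = {p_s:.5f}")
print(f"P(not R|S) = {p_not_r_given_s:.3f}")
print(f"P(S|R) / P(S|not R) = {p_s_given_r / p_s_given_not_r:.0f}")
```

<p>With these particular numbers, over 80% of the very successful did not follow a rational strategy, simply because so few people follow one in the first place, even though following one multiplies your chances twentyfold.</p>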
<p>Second, Jobs's life illustrates the principle that you only have to be very right a small number of times – just like in general most of the return, especially in anything risky, comes from a small number of bets. He failed at managing, even when working under another CEO who had been brought in specifically to babysit him, to the extent that he was kicked out of his own company. He failed to build successful hardware after founding NeXT. However, he was really right about product design, and that was enough.</p>
<p>Third, though he did get away with ignoring many uncomfortable truths by simply willing them away, eventually reality hit back. He delayed dealing with the cancer threat when he was first told of it, and he trusted alternative treatments. The combination may well have killed him.</p>
<p> </p>
<h2 id="benjamin-franklin">Benjamin Franklin</h2>
<p>Benjamin Franklin was a newspaper publisher, writer, postmaster, ambassador, political leader, and scientist. He invented the lightning rod and realised that electric charge came in both a positive and negative form (and gave those names to them, as temporary ones until "[English] philosophers give us better").</p>
<p>He was one of the first or most influential pioneers of many other things as well; to take a random example, he thought up the idea of matched funding for a charitable project (and was quite proud of it too: "I do not remember any of my political maneuvers the success of which gave me at the time more pleasure, or that in after thinking about it I more easily excused myself for having made use of cunning").</p>
<p>More generally, he clearly enjoyed numbers and detail:</p>
<blockquote><p><i>[...H]e loved immersing himself in minutiae and trivia in a manner so obsessive that it might today be described as geeky. He was meticulous in describing every technical detail of his inventions, be it the library arm, stove, or lightning rod. In his essays, ranging from his arguments against hereditary honors to his discussions of trade, he provided reams of detailed calculations and historical footnotes. Even in his most humorous parodies, such as his proposal for the study of farts, the cleverness was enhanced by his inclusion of mock-serious facts, trivia, calculations, and learned precedents</i></p>
</blockquote>
<p>Do-gooders with time machines could do worse than giving him access to a spreadsheet program.</p>
<p>One of the best descriptions of Franklin's personality comes from Isaacson's comparison of him with John Adams (when they were both in Paris, late in Franklin's life):</p>
<blockquote><p><i>Adams was unbending and outspoken and argumentative, Franklin charming and taciturn and flirtatious. Adams was rigid in his personal morality and lifestyle, Franklin famously playful. Adams learned French by poring over grammar books and memorizing a collection of funeral orations; Franklin (who cared little about the grammar) learned the language by lounging on the pillows of his female friends and writing them amusing little tales. Adams felt comfortable confronting people, whereas Franklin preferred to seduce them, and the same was true of the way they dealt with nations.</i></p>
</blockquote>
<p>One striking thing when reading about 18th century events is the informality and nepotism. For example, to become postmaster of the colonies, Franklin spent significant money on having a friend lobby on his behalf in London, and upon obtaining the position gave out cushy jobs to his son, brothers, brother's stepson, sister's son, and two of his wife's relatives.</p>
<p>Not only that, but the border between truth and fiction was also hazy in the press. Articles could be, without any differentiating label, either factual, obviously satirical, satirical in a way that takes a clever reader to spot, or outright hoaxes. Likewise Franklin often wrote and published letters to his own newspaper under pseudonyms, with various levels of disguise ranging from clearly transparent to purposefully anonymous (this, however, was normal, as it was often seen as unworthy of gentlemen to write such letters under their own names).</p>
<p>In other ways, the 18th century, and 18th century Franklin in particular, were surprisingly modern and liberal. Franklin took a very reasonable and liberal stance on the freedom of press:</p>
<blockquote><p><i>“It is unreasonable to imagine that printers approve of everything they print. It is likewise unreasonable what some assert, That printers ought not to print anything but what they approve; since […] an end would thereby be put to free writing, and the world would afterwards have nothing to read but what happened to be the opinions of printers.”</i></p>
</blockquote>
<p>He still exercised judgement over what he printed. When deciding whether to print something that violated his principles for money, he (reportedly) went through a process that many modern newspaper editors and Facebook engineers could well take to heart:</p>
<blockquote><p><i>To determine whether I should publish it or not, I went home in the evening, purchased a twopenny loaf at the baker’s, and with the water from the pump made my supper; I then wrapped myself up in my great-coat, and laid down on the floor and slept till morning, when, on another loaf and a mug of water, I made my breakfast. From this regimen I feel no inconvenience whatever. Finding I can live in this manner, I have formed a determination never to prostitute my press to the purposes of corruption and abuse of this kind for the sake of gaining a more comfortable subsistence.</i></p>
</blockquote>
<p>The 18th century offers some perspective about hostile politics too. After describing an extremely personal and angry election campaign (which Franklin lost), Isaacson writes:</p>
<blockquote><p><i>Modern election campaigns are often criticized for being negative, and today’s press is slammed for being scurrilous. But the most brutal of modern attack ads pale in comparison to the barrage of pamphlets in the 1764 [Pennsylvania] Assembly election. Pennsylvania survived them, as did Franklin, and American democracy learned that it could thrive in an atmosphere of unrestrained, even intemperate, free expression. As the election of 1764 showed, American democracy was built on a foundation of unbridled free speech. In the centuries since then, the nations that have thrived have been those, like America, that are most comfortable with the cacophony, and even occasional messiness, that comes from robust discourse.</i></p>
</blockquote>
<p>Isaacson points out that Franklin's popularity has come and gone, and explains this by making him the symbol of one side of a cultural and political dichotomy: tolerance and compromise rather than dogmatism and crusading, pragmatism rather than romanticism, social mobility rather than class and hierarchy, and secular material success over religious salvation. Thus, while immensely popular in the latter part of his life and after his death, once the Romantic Era got underway, he became seen as shallow, thrifty, and lacking in passion. For example, Franklin appears in Herman Melville's novel <i>Israel Potter</i>, a work that sounds like the most confusing Harry Potter fan-fiction of all time, as a precursor to today's shallow self-help gurus.</p>
<p>A perfect example of the type of cunning that made some people call him shallow comes from his time as a frontier commander. To get soldiers to attend worship services, he had the chaplain give out the daily rum rations right after the service. "Never were prayers more generally and punctually attended", Franklin proudly wrote.</p>
<p>Or: at the signing of the Declaration of Independence, John Hancock solemnly declared "There must be no pulling different ways; we must all hang together". Franklin reportedly responded, with wit, though not solemnity, worthy of the historic occasion: "Yes, we must, indeed, all hang together, or most assuredly we shall all hang separately".</p>
<p>This oscillation between romantically-minded eras finding him shallow and business-minded eras finding him the godfather of all self-help gurus and thrifty entrepreneurs has continued to this day. It is true that his aphorism collections, as documented in his famous Poor Richard's Almanac, are more clever than insightful; that he was no moral philosopher; and that his virtue-cultivating efforts were often patchy. However, they are part of a crucial process: the separation of morality from theology during the Enlightenment, which "Franklin was [the] avatar" of. Franklin's foundational personal maxim, which he often repeated, is perhaps the single sentence that pre-modern religious countries most need to hear: “The most acceptable service to God is doing good to man”.</p>
<p>The romanticists' criticisms are based on truths. Though sociable, founding and participating in many societies, his personal relationships tended to be intellectual but distant. Interestingly, despite his vast achievements, Franklin does not show signs of a deep unyielding inner ambition; he seems to have been driven by vague instincts to be useful, a sense of pride (which he tried to dull throughout his life), curiosity, and a delight in tinkering, planning, and organising. To his sister in 1771 he wrote "[...] I am much disposed to like the world as I find it, and to doubt my own judgment as to what would mend it" – a remarkable sentiment from the pen of someone who, not many years later, would be playing a key role in a revolution. And though even past the age of 75 he achieved a few minor things, like being instrumental in securing France's alliance with America, signing the peace treaty between the US and Britain, shaping the US Constitution, and being the head of Pennsylvania's government, he happily whiled away many of his latter days playing cards with only the occasional twinge of guilt. He specifically justified this in part based on a belief in the afterlife: "You know the soul is immortal; why then should you be such a niggard of a little time, when you have a whole eternity before you?"</p>
<p>However, even these traits seem to have made him exactly what America needed. He was a skilled diplomat in France partly because of his easy-going nature and lack of naked ambition. At the Constitutional Convention of 1787, he often hosted the (much younger) other leading revolutionaries at his house to talk about things in a less formal setting and soften their stances, and generally advocated tolerance and compromise. Isaacson cleverly summarises:</p>
<blockquote><p><i>Compromisers may not make great heroes, but they do make democracies.</i></p>
</blockquote>
<p>Perhaps the best known summary of Franklin's life is Turgot's epigram that "he snatched lightning from the sky and the sceptre from tyrants". Franklin himself had a go at this: he wrote an autobiography – then a rare form of book – and also proposed a cheeky epitaph for himself, including an exhortation to wait for a "new and more elegant edition [of him], revised and corrected by the Author".</p>
<p>He didn't just summarise himself, though. He also unwittingly wrote perhaps the pithiest summary of the spirit of the entire Enlightenment project, and consequently of the driving spirit of human progress since then. It was in a letter Franklin wrote to his wife, after narrowly escaping a shipwreck on the English coast in 1757:</p>
<blockquote><p><i>Were I a Roman Catholic, perhaps I should on this occasion vow to build a chapel to some saint; but as I am not, if I were to vow at all, it should be to build a lighthouse.</i></p>
</blockquote>
<h1>Lambda calculus</h1>
<p style="text-align: center;"><span style="font-size: x-small;">Published 2021-04-25, updated 2022-03-31</span></p><p style="text-align: center;"><i><span style="font-size: x-small;">7.8k words, including equations (about 30 minutes)</span></i></p><p style="text-align: center;"><i><span style="font-size: x-small;"> </span></i></p><p style="text-align: center;"><i><span style="font-size: x-small;">This post has also been published <a href="https://www.lesswrong.com/posts/D4PYwNtYNwsgoixGa/intro-to-hacking-with-the-lambda-calculus">here</a>. </span></i><br /></p><p> </p><p>This post is about lambda calculus. The goal is not to do maths with it, but rather to build up definitions within it until we can express non-trivial algorithms easily. At the end we will see a lambda calculus interpreter written in the lambda calculus, and realise that we're most of the way to Lisp.</p>
<p>But first, why care about lambda calculus? Consider four different systems:</p>
<ul>
<li><p>A <b>Turing machine</b> – that is, a machine that:</p>
<ul>
<li><p>works on an infinite tape of cells from which a finite set of symbols can be read and written, and always points at one of these cells;</p>
</li>
<li><p>has some set of states it can be in, some of which are termed "accepting" and one of which is the starting state; and</p>
</li>
<li><p>given a combination of current state and current symbol on the tape, always does an action consisting of three things:</p>
<ul>
<li>writes some symbol on the tape (possibly the same that was already there),</li>
<li>transitions to some state (possibly the same it is already in), and</li>
<li>moves one cell left or right on the tape.</li>
</ul>
</li>
</ul>
</li>
<li><p>The <b>lambda calculus</b> (<script type="math/tex">\lambda</script>-calculus), a formal system that has expressions that are built out of an infinite set of variable names using <script type="math/tex">\lambda</script>-terms (which can be thought of as anonymous functions) and applications (analogous to function application), and a few simple rules for shuffling around the symbols in these expressions.</p>
</li>
<li><p>The <b>partial recursive functions</b>, constructed by function composition, primitive recursion (think bounded for-loops), and minimisation (returning the first value for which a function is zero) on three basic sets of functions:</p>
<ul>
<li>the zero functions, that take some number of arguments and return 0;</li>
<li>a successor function that takes a number and returns that number plus 1; and</li>
<li>the projection functions, defined for all natural numbers <script type="math/tex">a</script> and <script type="math/tex">b</script> such that <script type="math/tex">a \geq b</script> as taking in <script type="math/tex">a</script> arguments and returning the <script type="math/tex">b</script>th one.</li>
</ul>
</li>
<li><p><b>Lisp</b>, a human-friendly axiomatisation of computation that accidentally became an extremely good and long-lived programming language.</p>
</li>
</ul>
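<p>To make the first item in the list concrete, here is a minimal sketch of a Turing machine simulator in Python. The function name <code>run_turing_machine</code>, the <code>FLIP_BITS</code> example machine, and the convention that the machine halts when no rule matches the current (state, symbol) pair are my own simplifying assumptions for illustration; the definition above does not specify how halting works.</p>

```python
from collections import defaultdict


def run_turing_machine(transitions, start, accepting, tape, blank="_", max_steps=10_000):
    """Simulate a single-tape Turing machine.

    transitions maps (state, symbol) -> (symbol_to_write, next_state, move),
    where move is -1 (one cell left) or +1 (one cell right).
    Assumption: the machine halts when no rule matches (state, symbol).
    """
    # The "infinite" tape: any cell not yet written holds the blank symbol.
    cells = defaultdict(lambda: blank, enumerate(tape))
    state, head = start, 0
    for _ in range(max_steps):  # guard against machines that never halt
        rule = transitions.get((state, cells[head]))
        if rule is None:
            break
        write, state, move = rule
        cells[head] = write
        head += move
    contents = "".join(cells[i] for i in sorted(cells)).strip(blank)
    return state in accepting, contents


# Example machine: walk right, flipping every bit, and accept at the first blank.
FLIP_BITS = {
    ("scan", "0"): ("1", "scan", +1),
    ("scan", "1"): ("0", "scan", +1),
    ("scan", "_"): ("_", "done", +1),
}

accepted, result = run_turing_machine(FLIP_BITS, "scan", {"done"}, "1011")
print(accepted, result)
```

<p>Note how each rule does exactly the three things listed above: write a symbol, pick the next state, and move one cell.</p>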
<p>The big result in theoretical computer science is that these can all do the same thing, in the sense that if you can express a calculation in one, you can express it in any other.</p>
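<p>As a small taste of what expressing a calculation in one of these systems looks like, here is a rough Python sketch of the partial recursive building blocks, with addition and multiplication built from nothing else. The names (<code>proj</code>, <code>compose</code>, <code>prim_rec</code>, <code>minimise</code>) and the argument order are my own choices for illustration, and Python integers stand in for natural numbers.</p>

```python
def zero(*args):
    """Zero function: takes any number of arguments and returns 0."""
    return 0


def succ(n):
    """Successor function."""
    return n + 1


def proj(a, b):
    """Projection: takes a arguments and returns the b-th one (1-indexed)."""
    def p(*args):
        assert len(args) == a
        return args[b - 1]
    return p


def compose(f, *gs):
    """Function composition: feed the results of each g into f."""
    return lambda *args: f(*(g(*args) for g in gs))


def prim_rec(base, step):
    """Primitive recursion, written as a bounded for-loop:
    h(0, xs) = base(xs);  h(n + 1, xs) = step(n, h(n, xs), xs)."""
    def h(n, *xs):
        acc = base(*xs)
        for i in range(n):
            acc = step(i, acc, *xs)
        return acc
    return h


def minimise(f):
    """Minimisation: the first n for which f(n, xs) is zero (may loop forever)."""
    def m(*xs):
        n = 0
        while f(n, *xs) != 0:
            n += 1
        return n
    return m


# add(n, y): start from y, then apply the successor n times.
add = prim_rec(proj(1, 1), compose(succ, proj(3, 2)))
# mul(n, y): start from 0, then add y a total of n times.
mul = prim_rec(zero, compose(add, proj(3, 2), proj(3, 3)))
# Smallest n with n * n >= x. The test function here is a plain Python lambda
# used as a shortcut, not something built from the primitives above.
ceil_sqrt = minimise(lambda n, x: 0 if n * n >= x else 1)

print(add(2, 3), mul(3, 4), ceil_sqrt(10))
```

<p>Only <code>minimise</code> can fail to terminate, which is why these functions are called <i>partial</i>.</p>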
<p>This is not an obvious thing. For example, the only thing lambda calculus lets you do is create terms consisting of symbols, single-argument anonymous functions, and applications of terms to each other (we'll look at the specifics soon). It's an extremely simple and basic thing. Yet no matter how hard you try, you can't make something that can compute more things, whether it's by inventing programming languages or building fancy computers.</p>
<p>Also, if you try to make something that does some sort of calculation (like a new programming language), then unless you keep it stupidly simple and/or take great care, it will be able to compute anything (at least in la-la-theory-land, where memory is infinite and you don't have to worry about practical details, like whether the computation finishes before the sun goes nova).</p>
<p>Physicists search for their theory of everything. The computer scientists already have many, even though they've been at it for a lot less time than the physicists have: everything computable can be reduced to one of the many formalisms of computation. (One of the main reasons that we can talk about "computability" as a sensible universal concept is that any reasonable model makes the same things computable; the threshold is easy to hit and impossible to exceed, so computable versus not is an obvious thing to pay attention to.)</p>
<p>To talk about the theory of computation properly, we need to look at at least one of those models. The most well-known is the Turing machine. Turing machines have several points in their favour:</p>
<ul>
<li>They are the easiest to imagine as a physical machine.</li>
<li>They have clear and separate notions of time (steps taken in execution) and space (length of tape used).</li>
<li>They were invented by Alan Turing, who contributed to breaking the Enigma code during World War II, before being unjustly persecuted for being gay and tragically dying of cyanide poisoning at age 41.</li>
</ul>
<p>In contrast, compare the lambda calculus:</p>
<ul>
<li>It is an abstract formal system arising out of a failed attempt to axiomatise logic.</li>
<li>There are many execution paths for a non-trivial expression.</li>
<li>It was invented by Alonzo Church, who lived a boringly successful life as a maths professor at Princeton, had three children, and died at age 92.</li>
</ul>
<p>(Turing and Church worked together from 1936 to 1938, Church as Turing's doctoral advisor, after they independently proved that the halting problem is undecidable. At the same time and also working at Princeton were Albert Einstein, Kurt Gödel, and John von Neumann (who, if he had had his way, would've hired Turing and kept him from returning to the UK).)</p>
<p>However, the lambda calculus also has advantages. Its less mechanistic and more mathematical view of computation is arguably more elegant, and it has fewer things: instead of states, symbols, and a tape, the current state is just a term, and the term also represents the algorithm. It abstracts more nicely – we will see how we can, bit by bit, abstract out elements and get something that is a sensible programming language, a project that would be messier and longer with Turing machines.</p>
<p>Turing machines and lambda calculus are the foundations of imperative and functional programming respectively, and the situation between these two programming paradigms mirrors that between TMs and <script type="math/tex">\lambda</script>-calculus: one is more mechanistic, more popular, and more useful when dealing with (stateful) hardware; the other more mathematical, less popular, and neater for abstraction-building.</p>
<h3>Lambda trees</h3>
<p>Now let's define exactly what a lambda calculus term is.</p>
<p>We have an infinite set of variables <script type="math/tex">x_1, x_2, x_3, ...</script>, though for simplicity we will use any lowercase letter to refer to them. Any variable is a valid term. Note that variables are just symbols – despite the word "variable", there is no value bound to them.</p>
<p>We have two rules for building new terms:</p>
<ul>
<li><script type="math/tex">\lambda</script>-terms are formed from a variable <script type="math/tex">x</script> and a term <script type="math/tex">M</script>, and are written <script type="math/tex">(\lambda x. M)</script>.</li>
<li>Applications are formed from two terms <script type="math/tex">M</script> and <script type="math/tex">N</script>, and are written <script type="math/tex">(M N)</script>.</li>
</ul>
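<p>To make the two term-building rules concrete, here is a minimal Python sketch of terms as trees. The class names <code>Var</code>, <code>Lam</code>, and <code>App</code> and the helper <code>show</code> are hypothetical names chosen for illustration; <code>show</code> prints a term in the fully parenthesised standard notation.</p>
<pre><code class="language-python" lang="python">from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str        # a bare symbol; no value is attached to it


@dataclass(frozen=True)
class Lam:
    arg: str         # the argument variable x
    body: object     # the term M


@dataclass(frozen=True)
class App:
    func: object     # the term M being applied
    operand: object  # the term N it is applied to


def show(term):
    """Render a term in the fully parenthesised standard notation."""
    if isinstance(term, Var):
        return term.name
    if isinstance(term, Lam):
        return "(λ" + term.arg + ". " + show(term.body) + ")"
    return "(" + show(term.func) + " " + show(term.operand) + ")"


identity = Lam("x", Var("x"))
example = App(Lam("f", App(Var("f"), Var("y"))), identity)
print(show(example))  # ((λf. (f y)) (λx. x))
</code></pre>
<p>Every term is built from exactly these three node types, which is why the tree view and the string view carry the same information.</p>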
<p>These terms, like most things, are trees. I will mostly ignore the convention of writing out horrible long strings of <script type="math/tex">\lambda</script>s and variables, only partly mitigated by parenthesis-reducing rules, and instead draw the trees.</p>
<p>(When it appears in this post, the standard notation appears slightly more horrible than usual because, for simplicity, I neglect the parenthesis-reducing rules (they can be confusing at first).)</p>
<p>Here are a few examples of terms, together with standard representations:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFMm1fCuZ0OlbUquscvt4MP9iFSYws1GZQFg0-LrUI9V0FzB9WOz6eVoORcwy42sFOslPd4McB_I9RINq4CloCOJCdQeZfOez9pbj8FOwnIXWIeWNgaksK2iVPTF6eg44V2G6I3jfVsDeT/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="528" data-original-width="1138" height="296" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFMm1fCuZ0OlbUquscvt4MP9iFSYws1GZQFg0-LrUI9V0FzB9WOz6eVoORcwy42sFOslPd4McB_I9RINq4CloCOJCdQeZfOez9pbj8FOwnIXWIeWNgaksK2iVPTF6eg44V2G6I3jfVsDeT/w640-h296/terms.png" width="640" /></a></div><p></p>
<p>This representation makes it clear that we're dealing with a tree where nodes are either variables, lambda terms where the left child is the argument and the right child is the body, or applications. (I've circled the variables to make clear that the argument variable in a <script type="math/tex">\lambda</script>-term has a different role than a variable appearing elsewhere.)</p>
<p>It's not quite right to say that a <script type="math/tex">\lambda</script>-term is a function; instead, think of <script type="math/tex">\lambda</script>-terms as one representation of a (mathematical) function, when combined with the reduction rule we will look at soon.</p>
<p>If we interpret the above terms as representations of functions, we might rewrite them (in Pythonic pseudocode) as, from left to right:</p>
<ul>
<li><code>lambda x -> x</code> (i.e., the identity function) (<code>lambda</code> is a common keyword for an anonymous function in programming languages, for obvious reasons).</li>
<li><code>(lambda f -> f(y))(lambda x -> x)</code> (apply a function that takes a function and calls that function on <code>y</code> to the identity function as an argument).</li>
<li><code>x(y)</code></li>
</ul>
<h2>Reduction</h2>
<p>Execution in lambda calculus is driven by something that is called <script type="math/tex">\beta</script>-reduction, presumably because Greek letters are cool. The basic idea of <script type="math/tex">\beta</script>-reduction is this:</p>
<ul>
<li>Pick an application (which I've represented by orange circles in the tree diagrams).</li>
<li>Check that the left child of the application node is a <script type="math/tex">\lambda</script>-term (if not, you have to reduce it to a <script type="math/tex">\lambda</script>-term before you can make that application).</li>
<li>Within the right child (the body) of the <script type="math/tex">\lambda</script>-term, replace every occurrence of the variable in its left child (the argument) with the right child of the application node. Then replace the application node with the resulting body.</li>
</ul>
<p>In illustrated form, on the middle example above, using both tree diagrams and the usual notation:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_cWrnRdiyN2ksIAs8l6KG-mItZT0HKM0HBO1euMFYe3-jLN81sKBRiyKW41u8-q9JEmI7nIJBVmrbzyjyDa5k6SER_HUfuzgSxQPr7qOwHnbJvOZksw8c3ZOSAcJAZHPJShA0reg8wiku/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="634" data-original-width="1196" height="340" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_cWrnRdiyN2ksIAs8l6KG-mItZT0HKM0HBO1euMFYe3-jLN81sKBRiyKW41u8-q9JEmI7nIJBVmrbzyjyDa5k6SER_HUfuzgSxQPr7qOwHnbJvOZksw8c3ZOSAcJAZHPJShA0reg8wiku/w640-h340/reduction1.png" width="640" /></a></div><p></p>(The notation <script type="math/tex">M[N/x]</script> means substitute the term <script type="math/tex">N</script> for the variable <script type="math/tex">x</script> in the term <script type="math/tex">M</script>; the general rule for <script type="math/tex">\beta</script>-reduction is that given <script type="math/tex">((\lambda x. M) N)</script>, you can replace it with <script type="math/tex">M[N/x]</script>, subject to some details that we will mostly skip over shortly.)
<p>In our example, we end up with another application term, so we can reduce it further:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzFF6nsLilBuXcgRWvmIrcwlFBQvhEwRfCqwXxQcrIWQ_FRU0Dvu84T3PwvKSFHo3uGIMzViAaVFWjd7mkvi7qOxe8a8ElcS_oVpbWhHBehtp6aLWI0q4wJf4BYuVgvKxQMmGOsUxzAuf1/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="666" data-original-width="1000" height="426" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzFF6nsLilBuXcgRWvmIrcwlFBQvhEwRfCqwXxQcrIWQ_FRU0Dvu84T3PwvKSFHo3uGIMzViAaVFWjd7mkvi7qOxe8a8ElcS_oVpbWhHBehtp6aLWI0q4wJf4BYuVgvKxQMmGOsUxzAuf1/w640-h426/reduction2.png" width="640" /></a></div><p></p>
<p>In our Pythonic pseudocode, we might represent this as an execution trace like the following:</p>
<pre><code>(lambda f -> f(y))(lambda x -> x)
 -->
(lambda x -> x)(y)
 -->
y
</code></pre>
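<p>The same trace can be run as real Python, as a rough sketch. Python has no free variables, so the string <code>"y"</code> stands in for <script type="math/tex">y</script> here, and the names <code>identity</code> and <code>apply_to_y</code> are made up for this example.</p>
<pre><code class="language-python" lang="python"># Assumption: the string "y" stands in for the free variable y.
y = "y"

identity = lambda x: x        # (λx. x)
apply_to_y = lambda f: f(y)   # (λf. (f y))

# Python evaluates the call in the same two steps as the trace:
# apply_to_y(identity) becomes identity(y), which becomes y.
result = apply_to_y(identity)
print(result)  # y
</code></pre>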
<p>Reduction is not always so simple, even if there's only a single choice of what to reduce. You have to be careful if the same variable appears in different roles, and rename if necessary. The core rule is that within the tree rooted at a <script type="math/tex">\lambda</script>-term that takes an argument <script type="math/tex">x</script>, the variable <script type="math/tex">x</script> always means whatever was given to that <script type="math/tex">\lambda</script>-term, and never anything else. An <script type="math/tex">x</script> bound in one <script type="math/tex">\lambda</script>-term is distinct from an <script type="math/tex">x</script> bound in another <script type="math/tex">\lambda</script>-term.</p>
<p>The simplest way to get around problems is to make your first variable <script type="math/tex">x_1</script> and, whenever you need a new one, call it <script type="math/tex">x_i</script> where <script type="math/tex">i</script> is one more than the maximum index of any existing variable. Unfortunately humans aren't good at remembering the difference between <script type="math/tex">x_9</script> and <script type="math/tex">x_{17}</script>, and humans like conventions (like using <script type="math/tex">x</script> for generic variables, <script type="math/tex">f</script> for things that will be <script type="math/tex">\lambda</script>-terms, and so forth). Therefore we sometimes have to think about name collisions.</p>
<p>The principle that lets us out of name collision problems is that you can rename bound variables as you want (as long as distinct variables aren't renamed to the same thing). The name for this is <script type="math/tex">\alpha</script>-equivalence (more Greek letters!); for example <script type="math/tex">(\lambda x. x)</script> and <script type="math/tex">(\lambda y. y)</script> are <script type="math/tex">\alpha</script>-equivalent.</p>
<p>There are, of course, detailed rules for how to deal with name collisions when doing <script type="math/tex">\beta</script>-reductions, but you should be fine if you think about how variable scoping should sensibly work to preserve meaning (something you've already had to reason about if you've ever programmed). (A helpful concept to keep in mind is the difference between free variables and bound variables – starting from a variable and following the path up the tree to the root node, does it run through a <script type="math/tex">\lambda</script>-node with that variable as an argument? If so, the variable is bound; if not, it is free.)</p>
<p>An example of a name collision problem is this:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkHI3XGToC4FX05Uou1FN6QC-B8rhYsIXUyrWFlSQ_5ToufGn3UW6jbH4aewHaaieVY6bYjV0RbEtxoIWosM2OyhQDK6zLHTZhjgoMdA7o0WwBPyhyphenhyphensbD5iSVogofjCCKMmW1a6DiqY1c9/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="766" data-original-width="1094" height="448" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkHI3XGToC4FX05Uou1FN6QC-B8rhYsIXUyrWFlSQ_5ToufGn3UW6jbH4aewHaaieVY6bYjV0RbEtxoIWosM2OyhQDK6zLHTZhjgoMdA7o0WwBPyhyphenhyphensbD5iSVogofjCCKMmW1a6DiqY1c9/w640-h448/wrongreduction.png" width="640" /></a></div><p></p>
<p>We can't do this because the <script type="math/tex">x</script> in the innermost <script type="math/tex">\lambda</script>-term on the left must mean whatever was passed to it, and the <script type="math/tex">y</script> whatever was passed to the outer <script type="math/tex">\lambda</script>-term. However, our reduction leaves us with an expression that applies its argument to itself. We can solve this by renaming the <script type="math/tex">x</script> within the inner <script type="math/tex">\lambda</script>-term:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgElw77PCZ9VjNyOrQCX7BsXOhiSCJ4HDCCLqMtvUR1h_OsA7cO7iizxTfc0mF66lcqSz8TVVXWwDvmgjBPb_uaKq4TJsVCivl9C4CJmxY_Ac7plat-GN7fbZu_sPoHLONYjsT62etRShoN/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="870" data-original-width="1186" height="470" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgElw77PCZ9VjNyOrQCX7BsXOhiSCJ4HDCCLqMtvUR1h_OsA7cO7iizxTfc0mF66lcqSz8TVVXWwDvmgjBPb_uaKq4TJsVCivl9C4CJmxY_Ac7plat-GN7fbZu_sPoHLONYjsT62etRShoN/w640-h470/wrongreductionfix.png" width="640" /></a></div></div><p></p>
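<p>For the curious, here is one way the renaming can be mechanised: a sketch of capture-avoiding substitution <script type="math/tex">M[N/x]</script> in Python. The term classes and the helpers <code>free_vars</code>, <code>fresh_name</code>, <code>substitute</code>, and <code>beta</code> are hypothetical names for this illustration, and the test term <script type="math/tex">((\lambda y. (\lambda x. (x \, y))) \, x)</script> is my own example of a collision, not necessarily the one drawn above.</p>
<pre><code class="language-python" lang="python">from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    arg: str
    body: object


@dataclass(frozen=True)
class App:
    func: object
    operand: object


def free_vars(term):
    """Names that occur in term without an enclosing λ that binds them."""
    if isinstance(term, Var):
        return {term.name}
    if isinstance(term, Lam):
        return free_vars(term.body) - {term.arg}
    return free_vars(term.func) | free_vars(term.operand)


def fresh_name(avoid):
    """Pick the first of x1, x2, x3, ... that is not in avoid."""
    return next(f"x{i}" for i in count(1) if f"x{i}" not in avoid)


def substitute(term, name, value):
    """Compute term[value/name], renaming bound variables to avoid capture."""
    if isinstance(term, Var):
        return value if term.name == name else term
    if isinstance(term, App):
        return App(substitute(term.func, name, value),
                   substitute(term.operand, name, value))
    if term.arg == name:
        return term  # name is rebound here, so nothing inside refers to it
    if term.arg in free_vars(value):
        # α-rename the binder so it cannot capture a free variable of value
        new_arg = fresh_name(free_vars(value) | free_vars(term.body) | {name})
        renamed_body = substitute(term.body, term.arg, Var(new_arg))
        return Lam(new_arg, substitute(renamed_body, name, value))
    return Lam(term.arg, substitute(term.body, name, value))


def beta(application):
    """One β-step at the root: ((λx. M) N) becomes M[N/x]."""
    lam = application.func
    return substitute(lam.body, lam.arg, application.operand)


# Illustrative collision: the free x must not be captured by the inner λx.
colliding = App(Lam("y", Lam("x", App(Var("x"), Var("y")))), Var("x"))
print(beta(colliding))
# Lam(arg='x1', body=App(func=Var(name='x1'), operand=Var(name='x')))
</code></pre>
<p>The inner binder is renamed to <code>x1</code> before the substitution happens, so the result applies its argument to the free <code>x</code> rather than to itself.</p>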
<p>The general way to think of lambda calculus terms is that they are partitioned in two ways into equivalence classes:</p>
<ul>
<li>The first, rather trivial, set of equivalence classes is treating all <script type="math/tex">\alpha</script>-equivalent terms as the same thing. "Equivalent" and <script type="math/tex">\alpha</script>-equivalent are usually the same thing when we're talking about the lambda calculus; it's the "structure" of a term that matters, not the variable names.</li>
<li>The second set of equivalence classes is treating everything that can be <script type="math/tex">\beta</script>-reduced into the same form as equivalent. This is less trivial – in fact, it's undecidable in the general case (as we will see in the post about computation theory).</li>
</ul>
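<p>The first kind of equivalence is easy to check mechanically. The sketch below compares two term trees while tracking which binder each bound variable points to, so that names stop mattering; the classes and the function <code>alpha_equivalent</code> are hypothetical names for this illustration. (No such simple check exists for the second kind, since that one is undecidable.)</p>
<pre><code class="language-python" lang="python">from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    arg: str
    body: object


@dataclass(frozen=True)
class App:
    func: object
    operand: object


def alpha_equivalent(a, b, binders_a=None, binders_b=None, depth=0):
    """True if a and b differ only in the names of their bound variables."""
    binders_a = binders_a or {}
    binders_b = binders_b or {}
    if isinstance(a, Var) and isinstance(b, Var):
        if a.name in binders_a or b.name in binders_b:
            # Bound variables must point at binders at the same depth.
            return binders_a.get(a.name) == binders_b.get(b.name)
        return a.name == b.name  # free variables must match by name
    if isinstance(a, Lam) and isinstance(b, Lam):
        return alpha_equivalent(a.body, b.body,
                                {**binders_a, a.arg: depth},
                                {**binders_b, b.arg: depth},
                                depth + 1)
    if isinstance(a, App) and isinstance(b, App):
        return (alpha_equivalent(a.func, b.func, binders_a, binders_b, depth)
                and alpha_equivalent(a.operand, b.operand,
                                     binders_a, binders_b, depth))
    return False


print(alpha_equivalent(Lam("x", Var("x")), Lam("y", Var("y"))))  # True
print(alpha_equivalent(Lam("x", Var("y")), Lam("y", Var("y"))))  # False
</code></pre>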
<h2>That's it</h2>
<p>Yes, really, that's all you need. There exists a lambda calculus term that beats you in chess.</p>
<p>You might ask: but hold on a moment, we have no data – no numbers, no pairs, no lists, no strings – how can we input chess positions into a term or get anything sensible as an answer? We will see later that it's possible to encode data as lambda terms. The chess-playing term would accept some massive mess of <script type="math/tex">\lambda</script>-terms encoding the board configuration as an input, and after a lot of reductions it would become a term encoding the move to make – eventually checkmate, against you.</p>
<p>Before we start abstracting out data and more complex functions, let's make some simple syntax changes and look at some basic facts about reduction.</p>
<h2>Some syntax simplifications</h2>
<p>The pure lambda calculus does not have <script type="math/tex">\lambda</script>-terms that take more than one argument. This is often inconvenient. However, there's a simple mapping between multi-argument <script type="math/tex">\lambda</script>-terms and single-argument ones: instead of a two-argument function, say, just have a function that takes in an argument and returns a one-argument function that takes in an argument and returns a result using both arguments.</p>
<p>(In programming language terms, this is currying.)</p>
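<p>A small Python illustration of the idea, using integer addition purely as a stand-in body (the pure lambda calculus has no numbers yet); the names <code>add_pair</code>, <code>add_curried</code>, and <code>add_three</code> are made up for this example:</p>
<pre><code class="language-python" lang="python"># A two-argument function, which the pure lambda calculus does not have.
add_pair = lambda x, y: x + y

# Its curried form: take x, return a one-argument function awaiting y.
add_curried = lambda x: lambda y: x + y

add_three = add_curried(3)   # a function still waiting for y
print(add_pair(3, 4))        # 7
print(add_curried(3)(4))     # 7
print(add_three(10))         # 13
</code></pre>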
<p>In the standard notation, <script type="math/tex">(\lambda x.(\lambda y. M))</script> is often written <script type="math/tex">(\lambda xy.M)</script>. Likewise, we can do similar simplifications on our trees, remembering that this is a syntactic/visual difference, rather than introducing something new to the lambda calculus:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiX7LHaKAcIjTHSwBmFsGkiIiROTmDowoBWiZ0jEdTmXW6sKvZFxXFL80-pVg0CdgyLHnCyFM12R2LIzxEAScYJnqgRypEiThDCrp4SLEBsqehjg7pDgMY2RM7eOsanwzwDHDvSKOM-QZ02/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="670" data-original-width="880" height="305" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiX7LHaKAcIjTHSwBmFsGkiIiROTmDowoBWiZ0jEdTmXW6sKvZFxXFL80-pVg0CdgyLHnCyFM12R2LIzxEAScYJnqgRypEiThDCrp4SLEBsqehjg7pDgMY2RM7eOsanwzwDHDvSKOM-QZ02/w400-h305/simplersyntax.png" width="400" /></a></div><p></p>
<p>Once we've done this change, the next natural simplification to make is to allow one application node to apply many arguments to a <script type="math/tex">\lambda</script>-term with "many arguments" (remember that it actually stands for a bunch of nested normal single-argument <script type="math/tex">\lambda</script>-terms):</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBHXL5VxT3dyAkCfJMmVN0fE-8nZ4HLFqhkIRYNrfveIY_Koa6sUnj8IhMAPhTUrdVJ-iY5CYVXBzD-uhkdqo4Yd_qeEsdIsj03Hv9lmy5i-ugpAu9vbOay8GifrubdlZJUD2wp7A3Cerw/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="600" data-original-width="1100" height="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBHXL5VxT3dyAkCfJMmVN0fE-8nZ4HLFqhkIRYNrfveIY_Koa6sUnj8IhMAPhTUrdVJ-iY5CYVXBzD-uhkdqo4Yd_qeEsdIsj03Hv9lmy5i-ugpAu9vbOay8GifrubdlZJUD2wp7A3Cerw/w640-h350/simplersyntax2.png" width="640" /></a></div><p></p>
<p>(The corresponding simplification in the standard syntax is that <script type="math/tex">(M \, A \, B\, C)</script> means <script type="math/tex">(((M \, A)\, B)\, C)</script>. In a standard programming language, this might be written <code>M(A)(B)(C)</code>; that is, applying <code>M</code> to <code>A</code> to get a function that you apply to <code>B</code>, yielding another function that you apply to <code>C</code>. Sanity check: what's the difference between <script type="math/tex">((M \, A) \, B)</script> and <script type="math/tex">(M \, (A \, B))</script>?)</p>
<h2>Some facts about reduction</h2>
<h3><script type="math/tex">\beta</script>-normal forms</h3>
<p>A <script type="math/tex">\beta</script>-normal form can be thought of as a "fully evaluated" term. More specifically, it is one where this configuration of nodes does not appear in the tree (after multi-argument <script type="math/tex">\lambda</script>s and applications have been compiled into single-argument ones), where <script type="math/tex">M</script> and <script type="math/tex">N</script> are arbitrary terms:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirFhvKG_MADoXJIPmdF7Vac3QaBgTdCmFnUNm8irgQqNu21JZR4nZqMFvIAgyST_O4a9XO6B9OOKk9J3J0RVgjt97PkWq0jQ93Fqh5oktH-EMipQBjvkpTc2nM05Y4Nw-Y8tNJhlJufqdR/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="342" data-original-width="448" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirFhvKG_MADoXJIPmdF7Vac3QaBgTdCmFnUNm8irgQqNu21JZR4nZqMFvIAgyST_O4a9XO6B9OOKk9J3J0RVgjt97PkWq0jQ93Fqh5oktH-EMipQBjvkpTc2nM05Y4Nw-Y8tNJhlJufqdR/" width="314" /></a></div><p></p>
<p>Intuitively, if such a configuration does appear, then the reduction rules allow us to reduce the application (replacing this part of the tree with whatever you get when you substitute <script type="math/tex">N</script> in place of <script type="math/tex">x</script> within <script type="math/tex">M</script>), so our term is not fully reduced yet.</p>
<h3>Terms without a <script type="math/tex">\beta</script>-normal form</h3>
<p>Does every term have a <script type="math/tex">\beta</script>-normal form? If you've seen computation theory stuff before, you should be able to answer this immediately without considering anything about the lambda calculus itself.</p>
<p>The answer is no, because reducing to a <script type="math/tex">\beta</script>-normal form is the lambda calculus equivalent of an algorithm halting. Lambda calculus has the same expressive power as Turing machines or any other Turing-complete model of computation, and some algorithms run forever, so there must exist lambda calculus terms that you can keep reducing without ever getting a <script type="math/tex">\beta</script>-normal form.</p>
<p>Here's one example, often called <script type="math/tex">\Omega</script>: </p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiue3jaMMCGBK2zN0FpALdTSdNwGK1djd3E3AXfUtGlKkryWrjJVskDqElGrXlVTeVBk6n42KlKb-HAaC_IAHkgq5V0liS0FRV4hY15bfxoBdLYsgvdk7qxEp0RCw4ZNLgbDCAForz8rSzo/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="706" data-original-width="894" height="316" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiue3jaMMCGBK2zN0FpALdTSdNwGK1djd3E3AXfUtGlKkryWrjJVskDqElGrXlVTeVBk6n42KlKb-HAaC_IAHkgq5V0liS0FRV4hY15bfxoBdLYsgvdk7qxEp0RCw4ZNLgbDCAForz8rSzo/w400-h316/omega.png" width="400" /></a></div><p></p>
<p>Note that even though we use the same variable <script type="math/tex">x</script> in both branches, the variable means a different thing: in the left branch it's whatever is passed as an input to the left <script type="math/tex">\lambda</script>-term – one reduction step onwards, that <script type="math/tex">x</script> stands for the entire right branch, which has its own <script type="math/tex">x</script>. In fact, before we start reducing, we will do an <script type="math/tex">\alpha</script>-conversion on the right branch (a pretentious way of saying that we will rename the bound variable).</p>
<p>Now watch:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdxwwZnwJzXB1u7dQF8c0Ol8-rea9q2UFLVi4o1l7v9_Oj5Run89nd5E8BGLaqxsOXFFFMJFxvfL1NRhg2ysykLonr-mYjxbsXh3l6ybgv6HOgTNNd8Vlcx1didM-U-Fmlh4YXgzENX1DL/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="526" data-original-width="998" height="338" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdxwwZnwJzXB1u7dQF8c0Ol8-rea9q2UFLVi4o1l7v9_Oj5Run89nd5E8BGLaqxsOXFFFMJFxvfL1NRhg2ysykLonr-mYjxbsXh3l6ybgv6HOgTNNd8Vlcx1didM-U-Fmlh4YXgzENX1DL/w640-h338/omegareduction.png" width="640" /></a></div><p></p>
<p>After one reduction step, we end up with the same term (as usual, we are treating <script type="math/tex">\alpha</script>-equivalent terms as equivalent; the variable could be <script type="math/tex">x</script> or <script type="math/tex">y</script> or <script type="math/tex">å</script> for all we care).</p>
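<p>You can watch the same thing happen in Python, as a rough analogy: the sketch below builds <script type="math/tex">\Omega</script> out of ordinary lambdas (the names <code>omega_half</code> and <code>run_omega</code> are made up here). Python cannot loop forever on nested calls, so instead of running indefinitely it gives up with a <code>RecursionError</code>.</p>
<pre><code class="language-python" lang="python">omega_half = lambda x: x(x)   # (λx. (x x))


def run_omega():
    """Evaluate Ω = ((λx. (x x)) (λx. (x x))); each call recreates itself."""
    return omega_half(omega_half)


try:
    run_omega()
    outcome = "finished"
except RecursionError:
    # Python stops the endless self-application once its call stack is full.
    outcome = "did not terminate"

print(outcome)  # did not terminate
</code></pre>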
<h3>Ambiguities with reduction</h3>
<p>Does it matter how we reduce, or does every reduction path eventually lead to a <script type="math/tex">\beta</script>-normal form, assuming that one exists in the first place? If you haven't seen this before, you might want to have a go at this before reading on.</p>
<p>Here's one example of a tricky term:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwdajqmtyS_i6iINU98XQRo_jLCNkcwsXiEq4WW-9AW_p3ZfrZSJJMSAatmjhSf3PlRKKePYHBce4lGY0FPYGDAeSjD5YsD4yBZk6L9vsWuZ2-mJP8BXcaAtAu1GyE7bT0e-qbuFKGFzXb/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="596" data-original-width="900" height="212" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwdajqmtyS_i6iINU98XQRo_jLCNkcwsXiEq4WW-9AW_p3ZfrZSJJMSAatmjhSf3PlRKKePYHBce4lGY0FPYGDAeSjD5YsD4yBZk6L9vsWuZ2-mJP8BXcaAtAu1GyE7bT0e-qbuFKGFzXb/" width="320" /></a></div><p></p>Imagine that <script type="math/tex">M</script> has a <script type="math/tex">\beta</script>-normal form, and <script type="math/tex">\Omega</script> is as defined above and therefore can be reduced forever. If we start by reducing the application node, in a moment <script type="math/tex">\Omega</script> and all its loopiness gets thrown away, and we're left with just <script type="math/tex">M</script>, since the <script type="math/tex">\lambda</script>-term takes two arguments and returns the first. However, if we start by reducing <script type="math/tex">\Omega</script>, or are following a strategy like "evaluate the arguments before the application", we will at some point reduce <script type="math/tex">\Omega</script> and get thrown in for a loop.
<p>We can take a broader view here. In any programming language – I will use Lisp notation because it's the closest to lambda calculus – if we have a function like <code>(define func (lambda (x y) [FUNCTION BODY]))</code>, and a function call like <code>(func arg1 arg2)</code>, the evaluator has a choice of what it does. The simplest strategies are to either:</p>
<ul>
<li>Evaluate the arguments – <code>arg1</code> and <code>arg2</code> – first, and then inside the function <code>func</code> have <code>x</code> and <code>y</code> bound to the results of evaluating <code>arg1</code> and <code>arg2</code> respectively. This is called call-by-value, and is used by most programming languages.</li>
<li>Bind <code>x</code> and <code>y</code> inside <code>func</code> to the unevaluated expressions <code>arg1</code> and <code>arg2</code>, and evaluate them only upon encountering them in the process of evaluating <code>func</code>. This is called call-by-name. It's rare to see it in programming languages (an exception being that it's possible with Lisp macros), but functional languages like Haskell often have a variant, call-by-need or "lazy evaluation", where <code>arg1</code> and <code>arg2</code> are only evaluated when needed, but once evaluated the results are memoized so that the evaluation only needs to happen once.</li>
</ul>
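<p>Python is call-by-value, but call-by-name can be imitated by passing zero-argument functions ("thunks") and calling them only when the value is needed. The sketch below contrasts the two on a function that returns its first argument; <code>loop_forever</code> is a made-up stand-in for <script type="math/tex">\Omega</script>, and the other names are likewise just for illustration.</p>
<pre><code class="language-python" lang="python">def loop_forever():
    """Stand-in for Ω: calling this never returns a value."""
    return loop_forever()


# Call-by-value: Python evaluates both arguments before entering the body.
first_by_value = lambda x, y: x

# Call-by-name, imitated: arguments are thunks that are called only on use.
first_by_name = lambda x, y: x()

try:
    first_by_value("M", loop_forever())
    by_value_outcome = "returned"
except RecursionError:
    by_value_outcome = "looped"

by_name_result = first_by_name(lambda: "M", loop_forever)
print(by_value_outcome)  # looped
print(by_name_result)    # M
</code></pre>
<p>In the second call <code>loop_forever</code> is passed but never called, so the unused argument costs nothing.</p>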
<p>Call-by-value reduces what you can express. Imagine trying to define your own if-function in a language with call-by-value:</p>
<pre><code class="language-scheme" lang="scheme">(define IF
  (lambda (predicate consequent alternative)
    (if predicate
        consequent <span style="color: #999999;">; if predicate is true, do this</span>
        alternative))) <span style="color: #999999;">; if predicate is false, do this instead</span>
</code></pre>
<p>(Note that <code>IF</code> is the new if-function that we're trying to define, and <code>if</code> is assumed to be a language primitive.)</p>
<p>Now consider:</p>
<pre><code class="language-scheme" lang="scheme">(define factorial
  (lambda (n)
    (IF (= n 0)
        1
        (* n
           (factorial (- n 1))))))
</code></pre>
<p>You call <code>(factorial 1)</code>, and for the first call the program evaluates the arguments to <code>IF</code>:</p>
<ul>
<li><code>(= 1 0)</code></li>
<li><code>1</code></li>
<li><code>(* 1 (factorial 0))</code></li>
</ul>
<p>The last one needs the value of <code>(factorial 0)</code>, so we evaluate the arguments to the <code>IF</code> in the recursive call:</p>
<ul>
<li><code>(= 0 0)</code></li>
<li><code>1</code></li>
<li><code>(* 0 (factorial -1))</code></li>
</ul>
<p>... and so on. We can't define <code>IF</code> as a function, because in call-by-value the <code>alternative</code> gets evaluated as part of the function call even when <code>predicate</code> is true, so the recursion never reaches a base case.</p>
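<p>The same failure, and one way around it, can be reproduced in Python, which is also call-by-value. In this sketch <code>eager_IF</code> mirrors the Scheme <code>IF</code> above, while the second <code>IF</code> takes its branches as thunks so that only the chosen branch ever runs; all four function names are made up for this illustration.</p>
<pre><code class="language-python" lang="python">def eager_IF(predicate, consequent, alternative):
    """Call-by-value IF: both branches were already evaluated by the caller."""
    return consequent if predicate else alternative


def broken_factorial(n):
    return eager_IF(n == 0, 1, n * broken_factorial(n - 1))


def IF(predicate, consequent, alternative):
    """IF with delayed branches: each is a thunk, and only one is called."""
    return consequent() if predicate else alternative()


def factorial(n):
    return IF(n == 0, lambda: 1, lambda: n * factorial(n - 1))


try:
    broken_factorial(1)
    broken_outcome = "returned"
except RecursionError:
    broken_outcome = "recursed without end"

print(broken_outcome)  # recursed without end
print(factorial(5))    # 120
</code></pre>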
<p>(Most languages solve this by giving you a bunch of primitives and making you stick with them, perhaps with some fiddly mini-language for macros built in (consider C/C++). In Lisp, you can easily write macros that use all of the language features, and therefore extend the language by essentially defining your own primitives that can escape call-by-value or any other potentially limiting language feature.)</p>
<p>It's the same issue with our term <script type="math/tex">((\lambda xy.x) \, M \, \Omega)</script> above: call-by-value goes into a silly loop because one of the arguments isn't even "meant to" be evaluated (from our perspective as humans with goals looking at the formal system from the outside).</p>
<p>Lambda calculus does not impose a reduction/"evaluation" order, so we can do what we like. However, this still leaves us with a problem: how do we know if our algorithm has gone into an infinite loop, or we just reduced terms in the wrong order?</p>
<h3>Normal order reduction</h3>
<p>It turns out that always doing the equivalent of call-by-name – reducing the leftmost, outermost term first – saves the day. If a <script type="math/tex">\beta</script>-normal form exists, this strategy will lead you to it.</p>
<p>Intuitively, this is because with call-by-name, there is no "unnecessary" reduction. If some arguments in some call are never used (like in our example), they never reduce. If we start reducing an expression while doing leftmost/outermost-first reduction, that reduction must be standing in the way between us and a successful reduction to <script type="math/tex">\beta</script>-normal form.</p>
<p>Formally: ... the proof is left as an exercise for the reader.</p>
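<p>Short of a proof, the strategy itself is simple enough to sketch. The Python below does leftmost/outermost-first reduction on term trees and runs it on <script type="math/tex">((\lambda xy.x) \, M \, \Omega)</script>, with a free variable <code>m</code> standing in for <script type="math/tex">M</script>. All names (<code>step</code>, <code>normalise</code>, the <code>max_steps</code> cutoff, and so on) are hypothetical choices for this illustration.</p>
<pre><code class="language-python" lang="python">from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    arg: str
    body: object


@dataclass(frozen=True)
class App:
    func: object
    operand: object


def free_vars(term):
    if isinstance(term, Var):
        return {term.name}
    if isinstance(term, Lam):
        return free_vars(term.body) - {term.arg}
    return free_vars(term.func) | free_vars(term.operand)


def substitute(term, name, value):
    """Capture-avoiding term[value/name]."""
    if isinstance(term, Var):
        return value if term.name == name else term
    if isinstance(term, App):
        return App(substitute(term.func, name, value),
                   substitute(term.operand, name, value))
    if term.arg == name:
        return term
    if term.arg in free_vars(value):
        avoid = free_vars(value) | free_vars(term.body) | {name}
        new_arg = next(f"x{i}" for i in count(1) if f"x{i}" not in avoid)
        body = substitute(term.body, term.arg, Var(new_arg))
        return Lam(new_arg, substitute(body, name, value))
    return Lam(term.arg, substitute(term.body, name, value))


def step(term):
    """One leftmost/outermost β-step, or None if term is β-normal."""
    if isinstance(term, App):
        if isinstance(term.func, Lam):       # the outermost redex goes first
            return substitute(term.func.body, term.func.arg, term.operand)
        reduced = step(term.func)            # otherwise look left ...
        if reduced is not None:
            return App(reduced, term.operand)
        reduced = step(term.operand)         # ... and only then right
        return None if reduced is None else App(term.func, reduced)
    if isinstance(term, Lam):
        reduced = step(term.body)
        return None if reduced is None else Lam(term.arg, reduced)
    return None


def normalise(term, max_steps=1000):
    """Reduce in normal order; the step cutoff is an arbitrary safety net."""
    for _ in range(max_steps):
        reduced = step(term)
        if reduced is None:
            return term
        term = reduced
    raise RuntimeError("no β-normal form found within max_steps")


half = Lam("x", App(Var("x"), Var("x")))
omega = App(half, half)
term = App(App(Lam("x", Lam("y", Var("x"))), Var("m")), omega)
print(normalise(term))  # Var(name='m')
</code></pre>
<p>Because the outer application is reduced before its argument, <code>omega</code> is discarded without ever being touched, whereas <code>normalise(omega)</code> on its own hits the cutoff.</p>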
<h3>Church-Rosser theorem</h3>
<p>The Church-Rosser theorem is the thing that guarantees we can talk about unique <script type="math/tex">\beta</script>-normal forms for a term. It says that:</p>
<blockquote><p>Letting <script type="math/tex">\Lambda</script> be the set of terms in the lambda calculus, <script type="math/tex">\rightarrow_\beta</script> the <script type="math/tex">\beta</script>-reduction relation, and <script type="math/tex">\twoheadrightarrow_\beta</script> its reflexive transitive closure (i.e. <script type="math/tex">M \twoheadrightarrow_\beta N</script> iff <script type="math/tex">M = N</script> or there exist zero or more terms <script type="math/tex">P_1</script>, ..., <script type="math/tex">P_n</script> such that <script type="math/tex">M \rightarrow_\beta P_1 \rightarrow_\beta ... \rightarrow_\beta P_n \rightarrow_\beta N</script>), then:</p>
<p><b>For all <script type="math/tex">M, A, B \in \Lambda</script>, <script type="math/tex">M \twoheadrightarrow_\beta A</script> and <script type="math/tex">M \twoheadrightarrow_\beta B</script> implies that there exists <script type="math/tex">X \in \Lambda</script> such that <script type="math/tex">A \twoheadrightarrow_\beta X</script> and <script type="math/tex">B \twoheadrightarrow_\beta X</script>.</b></p>
</blockquote>
<p>Visually, if we have reduction chains like the black part, then the blue part must exist (a property known as confluence or the "diamond property"):</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbR5sJ90lX7IFasPcZBUm4Op3xAI_gk1ubf3qZeXYQL-QCMXObuBS75FSgwWKUNPVtyouRw1K8ulfkByFKEZYu4eu7FKx4LEX1114xrf5XrYQdt_mjsnUvFDvD7dkMBugn7RhgIPt-rt3t/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="574" data-original-width="998" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbR5sJ90lX7IFasPcZBUm4Op3xAI_gk1ubf3qZeXYQL-QCMXObuBS75FSgwWKUNPVtyouRw1K8ulfkByFKEZYu4eu7FKx4LEX1114xrf5XrYQdt_mjsnUvFDvD7dkMBugn7RhgIPt-rt3t/w400-h230/churchrosser.png" width="400" /></a></div><p></p>
<p>Therefore, even if there are many reduction paths, and even if some of them are non-terminating, no choice of reductions ever closes off the route to a shared result: any two terms reachable from <script type="math/tex">M</script> can still be reduced to a common term <script type="math/tex">X</script>. In particular, if <script type="math/tex">X</script> is some <script type="math/tex">\beta</script>-normal form reachable from <script type="math/tex">M</script>, we know that any other reduction path that reaches a <script type="math/tex">\beta</script>-normal form must have reached <script type="math/tex">X</script> (up to <script type="math/tex">\alpha</script>-equivalence).</p>
<h2>The fun begins</h2>
<p>Now we will start making definitions within the lambda calculus. These definitions do not add any capabilities to the lambda calculus, but are simply conveniences to save us having to draw huge trees repeatedly when we get to doing more complex things.</p>
<p>There are two big ideas to keep in mind:</p>
<ol>
<li>There are no data primitives in the lambda calculus (even the variables are just placeholders for terms to get substituted into, and don't even have consistent names – remember that we work within <script type="math/tex">\alpha</script>-equivalence). As a result, the general idea is that you encode "data" as actions: the number 4 is represented by a function that takes a function and an input and applies the function to the input 4 times, a list might be encoded by a description of how to iterate over it, and so on.</li>
<li>There are no types. Nothing in the lambda calculus will stop you from passing a number to a function that expects a function, or vice versa. There exist <a href="https://en.wikipedia.org/wiki/Typed_lambda_calculus">typed lambda calculi</a>, but they prevent you from doing some of the cool things with combinators that we'll see later in this post.</li>
</ol>
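<p>As a taste of the first idea, here is the number 4 as described above, written with Python lambdas. The helper <code>to_int</code> is a made-up name and cheats by using Python's own integers, purely so that we can read the result back.</p>
<pre><code class="language-python" lang="python"># The number 4: take a function f and an input x, apply f to x four times.
four = lambda f: lambda x: f(f(f(f(x))))

# Reading a numeral back by counting applications with Python's own integers.
to_int = lambda numeral: numeral(lambda k: k + 1)(0)

print(to_int(four))                    # 4
print(four(lambda s: s + "!")("hi"))   # hi!!!!
</code></pre>
<p>Note that <code>four</code> does not care what it is handed: it will happily apply string-appending four times, which is the "no types" point in action.</p>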
<h3>Pairs</h3>
<p>We want to be able to associate two things into a pair, and then extract the first and second elements. In other words, we want things that work like this:</p>
<pre><code>(fst (pair a b)) == a
(snd (pair a b)) == b
</code></pre>
<p>The simplest solution starts like this:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha7FCmkKfwAS9BQ7ul-I2bHacSWE4aSEnP_9breTeHTEGf6_wEq0Ieu1Zn6UfLOhrxBL5YmCMS2quE5l66TperfC36ZtnL0-XE4uAvav-Em0vH-m-EWoRIQhIlb6TTPBTJdr8H1xkym2eL/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1044" data-original-width="820" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha7FCmkKfwAS9BQ7ul-I2bHacSWE4aSEnP_9breTeHTEGf6_wEq0Ieu1Zn6UfLOhrxBL5YmCMS2quE5l66TperfC36ZtnL0-XE4uAvav-Em0vH-m-EWoRIQhIlb6TTPBTJdr8H1xkym2eL/w315-h400/pairs.png" width="315" /></a></div><p></p>
<p>Now we can get the first of a pair by doing <code>((pair x y) first)</code>. If we want the exact semantics above, we can define simple helpers like </p>
<pre><code class="language-scheme" lang="scheme">fst = (lambda p
(p first))
</code></pre>
<p>(i.e. <script type="math/tex">\text{fst} = (\lambda p. (p \, \text{first}))</script>), and </p>
<pre><code class="language-scheme" lang="scheme">snd = (lambda p
(p second))
</code></pre>
<p>since now <code>(snd (pair x y))</code> reduces to <code>((pair x y) second)</code> reduces to <code>y</code>.</p>
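<p>Here is the same construction as runnable Python, assuming the usual encoding in which a pair is a function that hands its two elements to whatever selector it is given (this matches the usage <code>((pair x y) first)</code> above). The strings are stand-ins for arbitrary terms.</p>
<pre><code class="language-python" lang="python"># Assumed encoding: a pair waits for a selector f and calls it on a and b.
pair = lambda a: lambda b: lambda f: f(a)(b)
first = lambda a: lambda b: a    # selector that keeps the first element
second = lambda a: lambda b: b   # selector that keeps the second element

fst = lambda p: p(first)
snd = lambda p: p(second)

p = pair("left")("right")
print(fst(p))  # left
print(snd(p))  # right
</code></pre>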
<h3>Lists</h3>
<p>A list can be constructed from pairs: <code>[1, 2, 3]</code> will be represented by <code>(pair 1 (pair 2 (pair 3 False)))</code> (we will define <code>False</code> later). If <script type="math/tex">l_1</script>, <script type="math/tex">l_2</script>, and <script type="math/tex">l_3</script> are the list items, a three-element list looks like this:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5k2KX3jK6-nG2oW82xcPhyphenhyphenLU_jbFv8p-uHzJI3S0pYZDI9hq1REMSBTZL_jyNcttBtx6RwF4hDrabzDQsLoZyfVh2uwDu909lLiSFYejNiRjquHOLIYQWds6RMdgcSjn18M62dwnzv7-q/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1000" data-original-width="1060" height="378" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5k2KX3jK6-nG2oW82xcPhyphenhyphenLU_jbFv8p-uHzJI3S0pYZDI9hq1REMSBTZL_jyNcttBtx6RwF4hDrabzDQsLoZyfVh2uwDu909lLiSFYejNiRjquHOLIYQWds6RMdgcSjn18M62dwnzv7-q/w400-h378/list.png" width="400" /></a></div><p></p>
<p>We might also represent the same list like this instead:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEighjevJ2KyAb7eZ2J-F_Dphm0DhJWq_U0YJ8k2id-OrIb5rKwCuSJwc1jfkUOxTnnn2xkLdU-bqcZb_sjQPK5HGdTSUUcmX39lMW28pZMm13S2hXqPWNdflIKz7PlRRmg4XADNo7uvcDMY/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="858" data-original-width="970" height="354" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEighjevJ2KyAb7eZ2J-F_Dphm0DhJWq_U0YJ8k2id-OrIb5rKwCuSJwc1jfkUOxTnnn2xkLdU-bqcZb_sjQPK5HGdTSUUcmX39lMW28pZMm13S2hXqPWNdflIKz7PlRRmg4XADNo7uvcDMY/w400-h354/listvar.png" width="400" /></a></div><p></p>
<p>This second representation makes it trivial to define things like a <code>reduce</code> function: <code>([1, 2, 3] 0 +)</code> would return 0 plus the sum of the list <code>[1, 2, 3]</code>, if <code>[1, 2, 3]</code> is represented as above. However, this representation would also make it harder to do other list operations, like getting all but the first element of a list, whereas our pair-based lists can do this trivially (<code>(snd l)</code> gets you all but the first element of the list <code>l</code>).</p>
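<p>The trade-off between the two list representations can be sketched in Python. The shape of the second (fold-style) list is an assumption here: the sketch takes it to be <script type="math/tex">(\lambda i f. (f \, l_1 \, (f \, l_2 \, (f \, l_3 \, i))))</script>, which is consistent with <code>([1, 2, 3] 0 +)</code> but is not copied from the diagram. Native Python integers stand in for the list items, and <code>plus</code> is a hypothetical curried helper.</p>
<pre><code class="language-python" lang="python">pair = lambda a: lambda b: lambda f: f(a)(b)
first = lambda x: lambda y: x
second = lambda x: lambda y: y
fst = lambda p: p(first)
snd = lambda p: p(second)
false = second  # the post later defines False as (lambda x y. y)

# Pair-based list [1, 2, 3]: taking the rest of the list is a single snd.
pair_list = pair(1)(pair(2)(pair(3)(false)))
rest = snd(pair_list)  # represents [2, 3]
print(fst(rest))  # prints: 2

# Fold-style list [1, 2, 3] (assumed shape): the list *is* its own reduce.
fold_list = lambda init: lambda f: f(1)(f(2)(f(3)(init)))
plus = lambda a: lambda b: a + b  # hypothetical curried addition
print(fold_list(0)(plus))  # prints: 6
</code></pre>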
<h3>Numbers & arithmetic</h3>
<p>Here is how the numbers work (using a system called Church numerals):</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgb5n_trrZW9VAb2EUP5G2zlBQmlnJHhJxzwUDgyM0XAtS1l6ywOOqkpuXk1REZAAqWPtkR77x0bcvvr1AUnLjDIJ6r4oTTnIjcFx0DY6OD73vUW5H52kbyVSHZ4HkzYZZ4rxmF9R0_mA5w/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="616" data-original-width="1200" height="328" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgb5n_trrZW9VAb2EUP5G2zlBQmlnJHhJxzwUDgyM0XAtS1l6ywOOqkpuXk1REZAAqWPtkR77x0bcvvr1AUnLjDIJ6r4oTTnIjcFx0DY6OD73vUW5H52kbyVSHZ4HkzYZZ4rxmF9R0_mA5w/w640-h328/numbers.png" width="640" /></a></div><p></p>
<p>Since giving a function <script type="math/tex">f</script> to a number <script type="math/tex">n</script> (also a function) gives a function that applies <script type="math/tex">f</script> to its input <script type="math/tex">n</script> times, a lot of things are very convenient. Say you have this function to add one, which we'll call <code>succ</code> (for "successor"):<br /></p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnvwTFY6I0KN_3GJ9Fphyphenhyphenys7cxGPG0y-z2urUhl4ce8L6o_LPd501NZ3Fr7dSLeaDo0BgkqOXlezlCpjWSJg1ugFmxcvfhjjWTX7r_Dhw9VjupiXZ77h0NgwU4XBSJR7ms4Nlym6gih_Jo/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="700" data-original-width="1000" height="280" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnvwTFY6I0KN_3GJ9Fphyphenhyphenys7cxGPG0y-z2urUhl4ce8L6o_LPd501NZ3Fr7dSLeaDo0BgkqOXlezlCpjWSJg1ugFmxcvfhjjWTX7r_Dhw9VjupiXZ77h0NgwU4XBSJR7ms4Nlym6gih_Jo/w400-h280/succ.png" width="400" /></a></div><p></p>
<p>(Considering the above definition of numbers: why does it work?) <br /></p><p>Now what is <code>(42 succ)</code>? It's a function that takes an argument and adds <code>42</code> to it. More generally, <code>((n succ) m)</code> gives you <code>m+n</code>. However, there's also a more straightforward way to represent addition, which you can figure out from noticing that all we have to do to add <code>m</code> to <code>n</code> is to compose the "apply <code>f</code>" operation <code>m</code> more times to <code>n</code>, something we can do simply by calling <code>(m f)</code> on <code>n</code>, once we've "standardised" <code>n</code> to have the same <code>f</code> and <code>x</code> as in the <script type="math/tex">\lambda</script>-term that represents <code>m</code> (that is why we have the <code>(n f x)</code> application, rather than just <code>n</code>):</p>
<p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhx2hMdAzarjGCcIuHRRV17uUbfX5Bj2Q98CqzHDLgJ3obXHM3Xbo8NDDMK7NrftWKzXEW2mSRTSUi9kLnXfJF4ZuP7g5M8PyOHsF9cRypX3kO4qdlIEY2JPVHMEEtZD0iWFtPd6UwxHehl/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="668" data-original-width="968" height="276" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhx2hMdAzarjGCcIuHRRV17uUbfX5Bj2Q98CqzHDLgJ3obXHM3Xbo8NDDMK7NrftWKzXEW2mSRTSUi9kLnXfJF4ZuP7g5M8PyOHsF9cRypX3kO4qdlIEY2JPVHMEEtZD0iWFtPd6UwxHehl/w400-h276/add.png" width="400" /></a></div><p></p>
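<p>Both ways of adding can be tried out in Python. This is a sketch under assumptions: <code>succ</code> is taken to be <script type="math/tex">(\lambda n f x. (f \, (n \, f \, x)))</script> and <code>add</code> to be <script type="math/tex">(\lambda m n f x. (m \, f \, (n \, f \, x)))</script>, which match the prose but may differ in detail from the diagrams, and <code>to_int</code> is a hypothetical decoding helper that is not part of the lambda calculus.</p>
<pre><code class="language-python" lang="python">zero = lambda f: lambda x: x
# Assumed form of succ: apply f one more time on top of n's applications.
succ = lambda n: lambda f: lambda x: f(n(f)(x))
# Assumed form of add: run m's applications of f on top of (n f x).
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

def to_int(n):
    """Decode a Church numeral by counting how often it applies its function."""
    return n(lambda k: k + 1)(0)

two = succ(succ(zero))
three = succ(two)

print(to_int(two(succ)(three)))  # ((n succ) m) with n=2, m=3; prints: 5
print(to_int(add(two)(three)))   # prints: 5
</code></pre>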
<p>Now, want multiplication? One way is to see that we can define <code>(mult m n)</code> as <code>((n (adder m)) 0)</code>, assuming that <code>(adder m)</code> returns a function that adds <code>m</code> to its input. As we saw, that can be done with <code>(m succ)</code>, so:</p>
<pre><code class="language-scheme" lang="scheme">(mult m n) =
  ((n (m succ))
   0)
</code></pre>
<p>There's a more standard way too:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgD_dwtyRTLA-ibEWJmtv22-aAeyVsf8jq_PslzgCi8fuSNhu6WVwqFhvt8AFjzfh1_U6hX9ucY0VVKi-qW45cHevvDnzBaSGr8CpzO-Sk2EGr_Kgnlbk4Ji6EqQJ75vqhrUMTpyt3b0F7c/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="564" data-original-width="796" height="284" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgD_dwtyRTLA-ibEWJmtv22-aAeyVsf8jq_PslzgCi8fuSNhu6WVwqFhvt8AFjzfh1_U6hX9ucY0VVKi-qW45cHevvDnzBaSGr8CpzO-Sk2EGr_Kgnlbk4Ji6EqQJ75vqhrUMTpyt3b0F7c/w400-h284/mult.png" width="400" /></a></div><br /><p></p>
<p>The idea here is simply that <code>(n f)</code> gives a <script type="math/tex">\lambda</script>-term that takes an input and applies <code>f</code> to it <script type="math/tex">n</script> times, and when we call <code>m</code> with that as its first argument, we get something that does the <script type="math/tex">n</script>-fold application <script type="math/tex">m</script> times, for a total of <script type="math/tex">mn</script> times, and now all that remains is to pass the <code>x</code> to it.</p>
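<p>The two multiplication strategies can be compared in Python. This is a sketch assuming the standard definition is <script type="math/tex">(\lambda m n f x. (m \, (n \, f) \, x))</script>, as the explanation above suggests; <code>mult_adder</code> and <code>to_int</code> are names made up for this illustration.</p>
<pre><code class="language-python" lang="python">zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

def to_int(n):
    """Decode a Church numeral (hypothetical helper, not a lambda-term)."""
    return n(lambda k: k + 1)(0)

two = succ(succ(zero))
three = succ(two)

# Repeated addition: apply "add m" to 0, n times.
mult_adder = lambda m: lambda n: n(m(succ))(zero)

# Assumed standard form: do the n-fold application of f, m times over.
mult = lambda m: lambda n: lambda f: lambda x: m(n(f))(x)

print(to_int(mult_adder(two)(three)))  # prints: 6
print(to_int(mult(two)(three)))        # prints: 6
</code></pre>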
<p>A particularly neat thing is that exponentiation can be this simple:<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjziaWeOs7mukvVH3sQjo5lYj0UgG13e8oxIq_tCV4wkryyMIHr_dJWt0jfngxWyn32QkcGxIoRqGrzv3v-EYaOA4lDdrzMf53Uzbp14rC3GCv8F78YcOCEx8LzZ__dsxNi2PVkk-aQFcQN/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="536" data-original-width="790" height="217" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjziaWeOs7mukvVH3sQjo5lYj0UgG13e8oxIq_tCV4wkryyMIHr_dJWt0jfngxWyn32QkcGxIoRqGrzv3v-EYaOA4lDdrzMf53Uzbp14rC3GCv8F78YcOCEx8LzZ__dsxNi2PVkk-aQFcQN/" width="320" /></a></div><p></p>
<p>Why? I'll let the trees talk. First, using the definition of <code>n</code> as a Church numeral (which I will underline in the trees below), and doing one <script type="math/tex">\beta</script>-reduction, we have:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXwie0-TmfzPFBnxdfwqq5HKSNh-swE80gGul6lnkgO47Rt4orLWdQ4QV1k17_ERYlI3ZiIi9D7ejv0NY7sQcXl5G9YCTVUGh_uk2n5SP8qxJUm1WW9lH9zeo3Pmg7MgmvZqf34X_W4FSa/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="946" data-original-width="1528" height="396" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXwie0-TmfzPFBnxdfwqq5HKSNh-swE80gGul6lnkgO47Rt4orLWdQ4QV1k17_ERYlI3ZiIi9D7ejv0NY7sQcXl5G9YCTVUGh_uk2n5SP8qxJUm1WW9lH9zeo3Pmg7MgmvZqf34X_W4FSa/w640-h396/expe1.png" width="640" /></a></div><p></p>
<p>This does not look promising – a number needs to have two arguments, but we have a <script type="math/tex">\lambda</script>-term taking in one. However, we'll soon see that the <code>x</code> in the tree on the right actually turns out to be the first argument, <code>f</code>, in the finished number. In fact, we'll make that renaming right away (since we're working under <script type="math/tex">\alpha</script>-equivalence), and continue reducing (below we've taken the bottom-most <code>m</code> and expanded it into its Church numeral definition): </p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_yAp6aFnK_J8HUmcKKPDbp9bUkGCh9aJNcQxmTLpgdfNyE9Pt_-t1vo3AJM7Brx38O0_DBuhpQRccB9l8-yLGnKZEAXZvgTcU7DV5NcutEHiK6flERlbseT_mnC-QhxTpe6DCvvRu2YG_/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="906" data-original-width="1200" height="483" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_yAp6aFnK_J8HUmcKKPDbp9bUkGCh9aJNcQxmTLpgdfNyE9Pt_-t1vo3AJM7Brx38O0_DBuhpQRccB9l8-yLGnKZEAXZvgTcU7DV5NcutEHiK6flERlbseT_mnC-QhxTpe6DCvvRu2YG_/w640-h483/expe2.png" width="640" /></a></div><p></p>
<p>At this point, the picture gets clearer: the next thing we'd reduce is the lambda term at the bottom applied to <code>m</code>, but that's just going to do the lambda term (which applies <code>f</code> <script type="math/tex">m</script> times) <script type="math/tex">m</script> more times. We'll have done 2 steps, and gotten up to <script type="math/tex">m^2</script> nestings of <code>f</code>. By the time we've done the remaining <script type="math/tex">n-1</script> steps, we'll have the representation of <script type="math/tex">m^n</script>; the <script type="math/tex">n-1</script> more applications between our bottom-most and topmost lambda term will reduce away, while the stack of applications of <code>f</code> increases by a factor of <script type="math/tex">m</script> each time.</p>
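<p>The claim is easy to test in Python. The sketch assumes exponentiation is defined as <script type="math/tex">(\lambda m n. (n \, m))</script>, that is, the numeral <code>n</code> applied to the numeral <code>m</code>; the names <code>power</code> and <code>to_int</code> are hypothetical.</p>
<pre><code class="language-python" lang="python">zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

def to_int(n):
    """Decode a Church numeral (hypothetical helper, not a lambda-term)."""
    return n(lambda k: k + 1)(0)

two = succ(succ(zero))
three = succ(two)

# Assumed definition: m to the power n is just n applied to m.
power = lambda m: lambda n: n(m)

print(to_int(power(two)(three)))  # 2 cubed; prints: 8
print(to_int(power(three)(two)))  # 3 squared; prints: 9
</code></pre>
<p>For <code>power(two)(three)</code>, the numeral <code>three</code> applies "do it twice" three times over, so <code>f</code> ends up nested 2 × 2 × 2 = 8 times.</p>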
<p>What about subtraction? It's a bit complicated. Okay, how about just subtraction by <i>one</i>, also known as the <code>pred</code> (predecessor) function? Also tricky (and a good puzzle if you want to think about it). Here's one way:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBBf2eNk_pHBUEXD8HPgZQ2EvspSz-0GEOfD6Hp8R2SjZLukBtu6VLju16H8-zPaz3FmHKbFitCok_P7a7WRipaJkISfx3WfToHolbPYMYK5Jtr8TmHolgTZzyF2hvlGG-8ihrwJq7WBy-/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="876" data-original-width="1146" height="489" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBBf2eNk_pHBUEXD8HPgZQ2EvspSz-0GEOfD6Hp8R2SjZLukBtu6VLju16H8-zPaz3FmHKbFitCok_P7a7WRipaJkISfx3WfToHolbPYMYK5Jtr8TmHolgTZzyF2hvlGG-8ihrwJq7WBy-/w640-h489/pred.png" width="640" /></a></div><p></p>
<p>Church numerals make it easy to add, but not subtract. So instead, here's what we do. First (box 1), we make a pair like <code>[0 0]</code>. Next (polygon 2), we have a function that takes a pair <code>p=[a b]</code> and creates a new pair <code>[b (succ b)]</code>, where <code>succ</code> is the successor function (one plus its input). Repeated application of this function on the pair in box 1 looks like this: <code>[0 0]</code>, <code>[0 1]</code>, <code>[1 2]</code>, <code>[2 3]</code>, and so on. Thus we see that if we start from <code>[0 0]</code> and apply the function in polygon 2 <script type="math/tex">n</script> times (box 3), the first element of the pair is (the Church numeral for) <script type="math/tex">n-1</script>, and the second element is <script type="math/tex">n</script>, and we can simply call <code>fst</code> to get that first element.</p>
<p>Just as addition can be done by repeated application of <code>succ</code>, we can define subtraction as repeated application of <code>pred</code>:</p>
<pre><code class="language-scheme" lang="scheme">(minus m n) =
  ((n pred) m)
</code></pre>
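<p>Here is the pair-shifting trick and the resulting subtraction as a Python sketch. The name <code>shift</code> for the function in polygon 2 is made up for this illustration, as is the <code>to_int</code> decoder; the pair and numeral definitions are the assumed ones from earlier in the post.</p>
<pre><code class="language-python" lang="python">pair = lambda a: lambda b: lambda f: f(a)(b)
fst = lambda p: p(lambda x: lambda y: x)
snd = lambda p: p(lambda x: lambda y: y)
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

def to_int(n):
    """Decode a Church numeral (hypothetical helper, not a lambda-term)."""
    return n(lambda k: k + 1)(0)

# Polygon 2: [a b] becomes [b (succ b)].
shift = lambda p: pair(snd(p))(succ(snd(p)))
# Box 3: apply shift n times to [0 0], then take the first element.
pred = lambda n: fst(n(shift)(pair(zero)(zero)))
# Subtraction: apply pred to m, n times.
minus = lambda m: lambda n: n(pred)(m)

two = succ(succ(zero))
five = succ(succ(succ(two)))
print(to_int(pred(five)))        # prints: 4
print(to_int(minus(five)(two)))  # prints: 3
print(to_int(minus(two)(five)))  # bottoms out at zero; prints: 0
</code></pre>
<p>Note that <code>pred</code> of zero is zero, so subtraction never goes negative.</p>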
<p>There's an alternative to Church numerals that's found in the more general <a href="https://crypto.stanford.edu/~blynn/compiler/scott.html">Scott encoding</a>. The advantages of Church vs Scott numerals, and their relative structures, are similar to the relative merits and structures of the two types of lists we discussed: one makes many operations natural by exploiting the fact that everything is a function, but also makes "throwing off a piece" (taking the rest/<code>snd</code> of a list, or subtracting one from a number) much harder.</p>
<h3>Booleans, if, & equality</h3>
<p>You might have noticed that we've defined <code>second</code> as <script type="math/tex">(\lambda x y. y)</script>, and <code>0</code> as <script type="math/tex">(\lambda f x. x)</script>. These two terms are a variable-renaming away from each other, so they are <script type="math/tex">\alpha</script>-equivalent. In other words, <code>second</code> and <code>0</code> are the same thing. Because we don't have types, which is which depends only on our interpretation of the context it appears in.</p>
<p>Now let's define a <code>True</code> and <code>False</code>. Now <code>False</code> is kind of like <code>0</code>, so let's just say they're also the same thing. The opposite of <script type="math/tex">(\lambda x y. y)</script> is <script type="math/tex">(\lambda x y. x)</script>, so let's define that to be <code>True</code>.</p>
<p>What sort of muddle have we landed ourselves in now? Quite a good one, actually. Let's define <code>(if p c a)</code> to be <code>(p c a)</code>. If the predicate <code>p</code> is <code>True</code>, we select the consequent <code>c</code>, because <code>(True c a)</code> is exactly the same as <code>(first c a)</code>, which is clearly <code>c</code>. Likewise, if <code>p</code> is <code>False</code>, then we evaluate the same thing as <code>(second c a)</code> and end up with the alternative <code>a</code>.</p>
<p>We will also want to test whether a number is <code>0</code>/<code>False</code> (equality in general is hard in the lambda calculus, so what we end up with won't be guaranteed to work with things that aren't numbers). A simple way is:</p>
<pre><code class="language-scheme" lang="scheme">eq0 =
  (lambda x
    (x (lambda y
         False)
       True))
</code></pre>
<p>If <code>x</code> is <code>0</code>, it's the same as <code>second</code> and will act as a conditional and pick out <code>True</code>. If it's not zero, we assume that it's some number <script type="math/tex">n</script>, and therefore will be a function that applies its first argument <script type="math/tex">n</script> times. Applying <script type="math/tex">(\lambda y.\text{False})</script> any non-zero number of times to anything will return <code>False</code>.</p>
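<p>The booleans, <code>if</code>, and <code>eq0</code> translate directly into Python. In this sketch the lowercase names <code>true</code>, <code>false</code>, and <code>if_</code> are chosen only to avoid clashing with Python keywords, and strings stand in for the consequent and alternative terms.</p>
<pre><code class="language-python" lang="python">true = lambda x: lambda y: x   # same term as first
false = lambda x: lambda y: y  # same term as second, and as 0
if_ = lambda p: lambda c: lambda a: p(c)(a)

# eq0: a numeral applies (lambda y. False) to True n times.
eq0 = lambda x: x(lambda y: false)(true)

zero = false                          # alpha-equivalent terms
two = lambda f: lambda x: f(f(x))     # Church numeral 2

print(if_(eq0(zero))("zero")("non-zero"))  # prints: zero
print(if_(eq0(two))("zero")("non-zero"))   # prints: non-zero
</code></pre>
<p>One caveat of the transcription: Python evaluates both branches before <code>if_</code> gets to choose between them, which is harmless here but matters once a branch contains a recursive call.</p>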
<h2>Fixed points, combinators, and recursion</h2>
<p>The big thing missing from the definitions we've put on top of the lambda calculus so far is recursion. Every lambda term represents an anonymous function, so there's no name within a <script type="math/tex">\lambda</script>-term that we can "call" to recurse.</p>
<p>Rather than jumping in straight to recursion, we're going to start with Russell's paradox: does the set of all sets that do not contain themselves contain itself? Phrased mathematically: what the hell is <script type="math/tex">R = \{x \,|\,x\notin x\} </script>?</p>
<p>In computation theory, sets are often specified by a characteristic function: a function that is always defined if the set is computable, and returns true if an element is in the set and false otherwise.</p>
<p>In the lambda calculus (which was originally supposed to be a foundation for logic), here's a characteristic function for the Russell set <script type="math/tex">R</script>:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfROFbuHl_KGuoe1XHH_wWOiLVleqkgSCOS91aPYi7_fQistplFjMEHyuDscWRaDk_4AWnSr0Ba6o6k6D9rr60JEIn8d25F6zsvtNrf7_GH_qLDNLBmVDTV2u96f9dD5y7TwfOa4EiNDgs/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="678" data-original-width="800" height="339" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfROFbuHl_KGuoe1XHH_wWOiLVleqkgSCOS91aPYi7_fQistplFjMEHyuDscWRaDk_4AWnSr0Ba6o6k6D9rr60JEIn8d25F6zsvtNrf7_GH_qLDNLBmVDTV2u96f9dD5y7TwfOa4EiNDgs/w400-h339/russell.png" width="400" /></a></div><p></p>
<p>(where <code>not</code> can be straightforwardly defined on top of our existing definitions as <code>(not b) = (b False True)</code>).</p>
<p>This <script type="math/tex">\lambda</script>-term takes in an element <code>x</code>, assumes that <code>x</code> is the (characteristic function for) the set itself, and asks: is it the case that <code>x</code> is <i>not</i> in the set? Call this term <code>R</code>, and consider <code>(R R)</code>: the left <code>R</code> is working as the (characteristic function of) the set, and the right <code>R</code> as the element whose membership of the set we are testing.</p>
<p>Evaluating:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvWUl-IwuYtk9BdYp7w_7V3c6lgjsp2KaMsiRsGPPSLVUSvUeQrKrC4ST_IeejIELDjKSke1lEN7K35EmmmHQvYP_AttEO5TdShVZ580kIc955vmW_2n56KP-bSCS8dFyzQIiFQJHjkK8t/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="612" data-original-width="1200" height="326" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvWUl-IwuYtk9BdYp7w_7V3c6lgjsp2KaMsiRsGPPSLVUSvUeQrKrC4ST_IeejIELDjKSke1lEN7K35EmmmHQvYP_AttEO5TdShVZ580kIc955vmW_2n56KP-bSCS8dFyzQIiFQJHjkK8t/w640-h326/russell2.png" width="640" /></a></div><p></p>
<p>So we start out saying <code>(R R)</code>, and in one <script type="math/tex">\beta</script>-reduction step we end up saying <code>(not (R R))</code> (just as, with Russell's paradox, it first seems that the set must contain itself, because the set is not in itself, but once we've added the set to itself then suddenly it shouldn't be in itself anymore). One more step and we get, from <code>(R R)</code>, <code>(not (not (R R)))</code>. This is not ideal as a foundation for logic.</p>
<p>However, you might realise something: the <code>not</code> here doesn't play any role. We can replace it with any arbitrary <code>f</code>. In fact, let's do that, and create a simple wrapper <script type="math/tex">\lambda</script>-term around it that lets us pass in any <code>f</code> we want:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkczuyEhL5D5KKjx9Q13vNMsT4chhpVRxUFlMnXrCjTRDTzhwq_tCxPqnP07-vldJUtmdwEx3t3DmTF1R889M4jjKpc9qxU-fw0ecBNE5BkkLGyjR1K_UlDQZuk79Rz9S52dStGmVymW4b/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1104" data-original-width="1076" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkczuyEhL5D5KKjx9Q13vNMsT4chhpVRxUFlMnXrCjTRDTzhwq_tCxPqnP07-vldJUtmdwEx3t3DmTF1R889M4jjKpc9qxU-fw0ecBNE5BkkLGyjR1K_UlDQZuk79Rz9S52dStGmVymW4b/w390-h400/Y.png" width="390" /></a></div><p></p>
<p>Now let's look at the properties that <script type="math/tex">Y</script> has:</p>
<div cid="n1079" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n1079" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-363" type="math/tex; mode=display">(Y \, f) \rightarrow_\beta (f \, (Y \, f)) \rightarrow_\beta (f \, (f \, (Y \, f))) \rightarrow_\beta ...</script></div></div>
<p><script type="math/tex">Y</script> is called the Y combinator ("combinator" is a generic term for a lambda calculus term with no free variables). It is part of the general class of fixed-point combinators: combinators <script type="math/tex">X</script> such that <script type="math/tex">(X \, f) = (f \, (X\,f))</script>. (Turing invented another one: <script type="math/tex">\Theta = (A \, A)</script>, where <script type="math/tex">A</script> is defined as <script type="math/tex">(\lambda x y. (y \,(x\, x\, y)))</script>.)</p>
<p>A fixed-point combinator gives us recursion. Imagine we've almost written a recursive function, say for a factorial, except we've left a free function parameter for the recursive call:</p>
<pre><code class="language-scheme" lang="scheme">(lambda f x
  (if (eq0 x)
      1
      (mult x
            (f (pred x)))))
</code></pre>
<p>(Also, take a moment to appreciate that we can already do everything necessary except for the recursion with our earlier definitions.)</p>
<p>Call the previous recursion-free factorial term <code>F</code>, and consider reducing <code>((Y F) 2)</code> (where <code>-BETA-></code> stands for one or more <script type="math/tex">\beta</script>-reductions):</p>
<pre><code class="language-scheme" lang="scheme">((Y F)
 2)
-BETA->
((F (Y F))
 2)
-BETA->
((lambda x
   (if (eq0 x)
       1
       (mult x
             ((Y F) (pred x)))))
 2)
-BETA->
(if (eq0 2)
    1
    (mult 2
          ((Y F) (pred 2))))
-BETA->
(mult 2
      ((Y F)
       1))
-BETA->
(mult 2
      ((F (Y F))
       1))
-BETA->
(mult 2
      ((lambda x
         (if (eq0 x)
             1
             (mult x
                   ((Y F) (pred x)))))
       1))
-BETA->
...
-BETA->
(mult 2
      (mult 1
            1))
-BETA->
2
</code></pre>
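<p>The same computation can be run in Python, with some substitutions that should be read as assumptions of this sketch. Python evaluates arguments eagerly, so a direct transcription of <script type="math/tex">Y</script> would recurse forever; the code below swaps in the Z combinator (an eta-expanded variant of <script type="math/tex">Y</script> that is not defined in this post). It also uses native integers and Python's conditional expression in place of Church numerals and the lambda calculus <code>if</code>.</p>
<pre><code class="language-python" lang="python"># Z combinator: like Y, but the self-application (x x) is wrapped in a
# lambda so that it is only unfolded when the recursive call is made.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# Recursion-free factorial F: the recursive call is the parameter f.
F = lambda f: lambda x: 1 if x == 0 else x * f(x - 1)

factorial = Z(F)
print(factorial(2))  # prints: 2
print(factorial(5))  # prints: 120
</code></pre>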
<p>It works! Get a fixed-point combinator, and recursion is solved.</p><h3>Primitive recursion</h3>
<p>The definition of the partial recursive functions (one of the ways to define computability, mentioned at the beginning) involves something called primitive recursion. Let's implement that, and along the way look at fixed-point combinators from another perspective.</p>
<p>Primitive recursion is essentially about implementing bounded for-loops / recursion stacks, where "bounded" means that the depth is known when we enter the loop. Specifically, there's a function <script type="math/tex">f</script> that takes in zero or more parameters, which we'll abbreviate as <script type="math/tex">\overline{P}</script>. At 0, the value of our primitive recursive function <script type="math/tex">h</script> is <script type="math/tex">f(\overline{P})</script>. At any integer <script type="math/tex">x+1</script> for <script type="math/tex">x \geq 0</script>, <script type="math/tex">h(\overline{P}, x+1)</script> is defined as <script type="math/tex">g(\overline{P}, x, h(\overline{P}, x))</script>: in other words, the value at <script type="math/tex">x+1</script> is given by some function of:</p>
<ul>
<li>fixed parameter(s) <script type="math/tex">\overline{P}</script>,</li>
<li>how many more steps there are in the loop before hitting the base case (<script type="math/tex">x</script>), and</li>
<li>the value at <script type="math/tex">x</script> (the recursive part).</li>
</ul>
<p>For example, in our factorial example there are no parameters, so <script type="math/tex">f</script> is just the constant function 1, and <script type="math/tex">g(x, r) = (x + 1) \times r</script>, where <script type="math/tex">r</script> is the recursive result for one less, and we have <script type="math/tex">x+1</script> because (for a reason I can't figure out – ideas?) <script type="math/tex">g</script> takes, by definition, not the current loop index but one less.</p>
<p>Now it's pretty easy to write the function for primitive recursion, leaving the recursive call as an extra parameter (<code>r</code>) once again, and assuming that we have <script type="math/tex">\lambda</script>-terms <code>F</code> and <code>G</code> for <script type="math/tex">f</script> and <script type="math/tex">g</script> respectively:</p>
<pre><code class="language-scheme" lang="scheme">(lambda r P x
  (if (eq0 x)
      (F P)
      (G P (pred x) (r P (pred x)))))
</code></pre>
<p>Slap a <script type="math/tex">Y</script> in front, and we take care of the recursion and we're done.</p>
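<p>As a rough illustration, here is primitive recursion in Python. The sketch makes several assumptions: it uses the strict-evaluation Z combinator in place of <script type="math/tex">Y</script>, native integers in place of Church numerals, a single parameter <code>P</code> in place of <script type="math/tex">\overline{P}</script>, and a made-up wrapper name <code>primitive_recursion</code>.</p>
<pre><code class="language-python" lang="python"># Z combinator: a fixed-point combinator that works under eager evaluation.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

def primitive_recursion(F, G):
    """Build h with h(P, 0) = F(P) and h(P, x+1) = G(P, x, h(P, x))."""
    step = lambda r: lambda P: lambda x: (
        F(P) if x == 0 else G(P)(x - 1)(r(P)(x - 1))
    )
    return Z(step)

# Factorial: no real parameter, f is the constant 1, g(x, r) = (x + 1) * r.
fact = primitive_recursion(lambda P: 1,
                           lambda P: lambda x: lambda r: (x + 1) * r)
print(fact(None)(5))  # prints: 120

# Powers of P: f is the constant 1, g(P, x, r) = P * r.
power = primitive_recursion(lambda P: 1,
                            lambda P: lambda x: lambda r: P * r)
print(power(2)(10))  # prints: 1024
</code></pre>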
<h3>The fixed point perspective</h3>
<p>However, rather than viewing this whole "slap in the <script type="math/tex">Y</script>" business as a hack for getting recursion, we can also interpret it as a fixed point operation.</p>
<p>A fixed point of a function <script type="math/tex">f</script> is a value <script type="math/tex">x</script> such that <script type="math/tex">x = f(x)</script>. The fixed points of <script type="math/tex">f(x)=x^2</script> are 0 and 1. In general, fixed points are often useful in maths stuff and there's a lot of deep theory behind them (for which you will have to look elsewhere).</p>
<p>Now <script type="math/tex">Y</script> (or any other fixed point combinator) has the property that <script type="math/tex">(Y f) =_\beta (f \, (Y\, f))</script> (remember that the equivalent of <script type="math/tex">f(x)</script> is written <script type="math/tex">(f \,x)</script> in the lambda calculus). In other words, <script type="math/tex">Y</script> is a magic wand that takes a function and returns its fixed point (albeit in a mathematical sense that is not very useful for explicitly finding those fixed points).</p>
<p>Taking once again the example of defining primitive recursion, we can consider it as the fixed point problem of finding an <script type="math/tex">h</script> such that <script type="math/tex">h = \Phi_{f,g}(h)</script>, where <script type="math/tex">\Phi_{f,g}</script> is a function like the following, where <code>F</code> and <code>G</code> are the lambda calculus representations of <script type="math/tex">f</script> and <script type="math/tex">g</script> respectively:</p>
<pre><code class="language-scheme" lang="scheme">(lambda h
  (lambda P x
    (if (eq0 x)
        (F P)
        (G P (pred x) (h P (pred x))))))
</code></pre>
<p>That is, <script type="math/tex">\Phi_{f,g}</script> takes in some function <code>h</code>, and then returns a function that does primitive recursion – <i>under the assumption</i> that <code>h</code> is the right function for the recursive call.</p>
<p>Imagine it like this: when we're finding the fixed point of <script type="math/tex">f(x)= x^2</script>, we're asking for <script type="math/tex">x</script> such that <script type="math/tex">x=x^2</script>. We can imagine reaching into the set of values that <script type="math/tex">x</script> can take (in this case, the real numbers), plugging them in, and seeing that in most cases the equation <script type="math/tex">x=x^2</script> is false, but if we pick out a fixed point it becomes true. Similarly, solving <script type="math/tex">h=\Phi_{f,g}(h)</script> is the problem of considering all possible functions <script type="math/tex">h</script> (and it turns out all computable functions can be enumerated, so this is, if anything, less crazy than considering all possible real numbers), and requiring that plugging <script type="math/tex">h</script> into <script type="math/tex">\Phi_{f,g}</script> gives back <script type="math/tex">h</script>. For almost any function that we plug in, this equation will be nonsense: instead of doing primitive recursion, on the first call to <code>h</code>, <script type="math/tex">\Phi_{f,g}</script> will do some crazy call that might loop forever or calculate the 17th digit of <script type="math/tex">\pi</script>, but if it's picked just right, <script type="math/tex">h</script> and <script type="math/tex">\Phi_{f,g}(h)</script> will happen to be the same thing. Unlike in the algebraic case, it's very difficult to iteratively improve on your guess for <script type="math/tex">h</script>, so it's hard to think of how to use this weird way of defining the problem of finding <script type="math/tex">h</script> to actually find it.</p>
<p>Except hold on – we're working in the lambda calculus, and fixed point combinators are easy: call <script type="math/tex">Y</script> on a function and we have its fixed point, and, by the reasoning above, that is the recursive version of that function.</p>
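<p>The fixed point view can be probed numerically for the factorial case. In this Python sketch, <code>phi</code> plays the role of <script type="math/tex">\Phi_{f,g}</script> with no parameters, and <code>agrees</code> is a hypothetical helper that compares <script type="math/tex">h</script> with <script type="math/tex">\Phi_{f,g}(h)</script> on a few sample inputs, which is evidence rather than proof of equality. As before, the Z combinator and native integers are stand-ins for <script type="math/tex">Y</script> and Church numerals.</p>
<pre><code class="language-python" lang="python">import math

# Phi for factorial: f is the constant 1 and g(x, r) = (x + 1) * r.
phi = lambda h: lambda x: 1 if x == 0 else x * h(x - 1)

def agrees(h, inputs=range(8)):
    """Check h(x) == phi(h)(x) on sample inputs (evidence, not a proof)."""
    return all(h(x) == phi(h)(x) for x in inputs)

# A wrong guess for h is not a fixed point of phi ...
print(agrees(lambda x: 2 * x))  # prints: False
# ... but the real factorial is.
print(agrees(math.factorial))   # prints: True

# A fixed-point combinator hands us such an h without any guessing.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))
print(agrees(Z(phi)))           # prints: True
</code></pre>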
<h2>The lambda calculus in lambda calculus</h2>
<p>There's one final powerful demonstration of a computation model's expressive power that we haven't looked at: being able to express itself. The most well-known case is the <a href="https://en.wikipedia.org/wiki/Universal_Turing_machine">universal Turing machine</a>, and those crop up a lot when you're thinking about computation theory.</p>
<p>Now there exists a trivial universal lambda term: <script type="math/tex">(\lambda \,f\,a\,.\,(f \,a))</script> takes <script type="math/tex">f</script>, the lambda representation of some function, and an argument <script type="math/tex">a</script>, and returns the lambda calculus representation of <script type="math/tex">f</script> applied to <script type="math/tex">a</script>. However, this isn't exactly fair, since we've just forwarded all the work onto whatever is interpreting the lambda calculus. It's like noting that an <code>eval</code> function exists in a programming language, and then writing on your CV that you've written an evaluator for it.</p>
<p>Instead, a "fair" way to define a universal lambda term is to build on the data specifications we have to define a representation of variables, lambda terms, and application terms, and then to write more definitions within the lambda calculus until we have a <code>reduce</code> function.</p>
<p>This is what I've done in <a href="https://github.com/LRudL/lambda-engine">Lambda Engine</a>. The definitions specific to defining the lambda calculus within the lambda calculus start about halfway down <a href="https://github.com/LRudL/lambda-engine/blob/main/definitions.rkt">this file</a>. I won't walk through the details here (see the code and comments for more detail), but the core points are:</p>
<ul>
<li>We distinguish term types by making each term a pair consisting of an identifier and then the data associated with it. The identifier for variables/<script type="math/tex">\lambda</script>s/applications is a function that takes a triple and returns the 1st/2nd/3rd member of it (this is simpler than tagging them with e.g. Church numerals, since testing numerical equality is complicated). The data is either a Church numeral (for variables) or a pair of a variable and a term (<script type="math/tex">\lambda</script>-terms) or a term and a term (applications).</li>
<li>We need case-based recursion, where we can take in a term, figure out what it is, and then perform a call to a function to handle that term and pass on the main recursive function to that handler function (for example, because when substituting in an application term, we need to call the main substitution function on both the left and right child of the application).
The case-based recursion functions (different ones for the different number of arguments required by substitution and reduction) take a triple of functions (one for each term type) and exploit the fact that the identifier of a term is a function that picks some element from the triple (in this case, we call the identifier on the handler function triple to pick the right one).</li>
<li>We have helper functions to build our term types, extract out parts, and test for whether something is a <script type="math/tex">\lambda</script>-term (exploiting the fact that the first element of the pair representing a <script type="math/tex">\lambda</script>-term is the "take the 2nd thing from a triple" function).</li>
<li>With the above, we can define substitution fairly straightforwardly. Note that we need to test Church numeral equality, which requires a generic Church numeral equality tester, which is a slow function (because it needs to recurse and take a lot of predecessors).</li>
<li>For reduction, the main tricky bit is doing it in normal order. This means that we have to be able to tell whether the left child in an application term is reducible before we try to reduce the right child (e.g. the left child might eventually reduce to a function that throws away its argument, and the right child might be a looping term like <script type="math/tex">\Omega</script>). We define a helper function to check whether something reduces, and then can write <code>reduce-app</code> and therefore <code>reduce</code>. For convenience we can define a function <code>n-reduce</code> that calls <code>reduce</code> on an expression <code>n</code> times, simply by exploiting how Church numerals work (<code>((2 reduce) x)</code> is <code>(reduce (reduce x))</code>, for example).</li>
</ul>
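<p>The tagging and dispatch trick in the list above can be sketched with Python lambdas. This is an illustration only, not the Lambda Engine code: the names <code>VAR</code>, <code>LAM</code>, <code>APP</code>, <code>dispatch</code> and <code>size</code> are made up for this example, and variables carry plain Python integers instead of Church numerals to keep it short.</p>

```python
# Church-style pairs: a pair is a function that hands its two parts to a selector.
pair = lambda a: lambda b: lambda select: select(a)(b)
fst = lambda p: p(lambda a: lambda b: a)
snd = lambda p: p(lambda a: lambda b: b)

# Term identifiers: each one picks the 1st/2nd/3rd of three (curried) arguments.
VAR = lambda first: lambda second: lambda third: first
LAM = lambda first: lambda second: lambda third: second
APP = lambda first: lambda second: lambda third: third

# Constructors: a term is a pair of (identifier, data).
# Simplification (assumption): variables hold a Python int, not a Church numeral.
make_var = lambda n: pair(VAR)(n)
make_lam = lambda var: lambda body: pair(LAM)(pair(var)(body))
make_app = lambda left: lambda right: pair(APP)(pair(left)(right))

# Testing for a lambda term: its identifier picks the 2nd of three things.
is_lam = lambda term: fst(term)(False)(True)(False)


def dispatch(on_var, on_lam, on_app, term):
    """Call the term's identifier on the three handlers to pick the right one,
    then give that handler the term's data."""
    handler = fst(term)(on_var)(on_lam)(on_app)
    return handler(snd(term))


def size(term):
    """Count the nodes of a term; the handlers call `size` on the children."""
    return dispatch(
        lambda n: 1,                                  # variable
        lambda d: 1 + size(snd(d)),                   # lambda: recurse into body
        lambda d: 1 + size(fst(d)) + size(snd(d)),    # application: both children
        term,
    )


# The Church numeral 2 applies a function twice, which is the idea behind n-reduce.
two = lambda f: lambda x: f(f(x))

identity = make_lam(make_var(0))(make_var(0))
example = make_app(identity)(make_var(1))
node_count = size(example)
twice_incremented = two(lambda k: k + 1)(0)
```

<p>No equality test on tags is ever needed: the identifier itself does the selecting, which is why this is simpler than tagging terms with Church numerals.</p>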
<p>What we don't have:</p>
<ul>
<li>Variable renaming. We assume that terms in this lambda calculus are written so that a variable name (in this case, a Church numeral) is never reused.</li>
<li>Automatically reducing to <script type="math/tex">\beta</script>-normal form. This could be done fairly simply by writing another function that calls itself with the <code>reduce</code> of its argument until our checker for whether something reduces returns false.</li>
<li>Automatically checking whether we're looping (e.g. we've typed in the definition of <script type="math/tex">\Omega</script>).</li>
</ul>
<p>The lambda calculus interpreter in <a href="https://github.com/LRudL/lambda-engine/blob/main/interpreter.rkt">this file</a> has all three features above. You can play with it, and the lambda-calculus-in-lambda-calculus, by downloading <a href="https://github.com/LRudL/lambda-engine">Lambda Engine</a> (and a <a href="https://racket-lang.org/">Racket interpreter</a> if you don't already have one) and using one of the evaluators in <a href="https://github.com/LRudL/lambda-engine/blob/main/main.rkt">this file</a>.</p>
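<p>To make the normal-order rule and the "keep reducing until nothing reduces" loop concrete, here is a minimal sketch in Python. It is not how Lambda Engine is written: terms are plain tuples, the function names are invented for this example, variable renaming is left out (as in the lambda-calculus-in-lambda-calculus above), and looping is handled only by a crude step limit.</p>

```python
# Terms as tuples (an assumption made for readability):
#   ("var", name) | ("lam", name, body) | ("app", left, right)


def substitute(term, name, value):
    """Replace free occurrences of `name` in `term` with `value`.
    No renaming is done, so bound names are assumed not to clash."""
    kind = term[0]
    if kind == "var":
        return value if term[1] == name else term
    if kind == "lam":
        if term[1] == name:          # `name` is rebound here, so stop
            return term
        return ("lam", term[1], substitute(term[2], name, value))
    return ("app", substitute(term[1], name, value), substitute(term[2], name, value))


def reducible(term):
    """True if the term contains a redex anywhere."""
    kind = term[0]
    if kind == "var":
        return False
    if kind == "lam":
        return reducible(term[2])
    left, right = term[1], term[2]
    return left[0] == "lam" or reducible(left) or reducible(right)


def reduce_step(term):
    """One normal-order step: the leftmost, outermost redex goes first."""
    kind = term[0]
    if kind == "var":
        return term
    if kind == "lam":
        return ("lam", term[1], reduce_step(term[2]))
    left, right = term[1], term[2]
    if left[0] == "lam":             # the application itself is a redex
        return substitute(left[2], left[1], right)
    if reducible(left):              # try the left child before the right one
        return ("app", reduce_step(left), right)
    return ("app", left, reduce_step(right))


def normalise(term, max_steps=100):
    """Reduce until nothing reduces; return None if the step limit is hit."""
    for _ in range(max_steps):
        if not reducible(term):
            return term
        term = reduce_step(term)
    return None


self_apply = ("lam", "x", ("app", ("var", "x"), ("var", "x")))
OMEGA = ("app", self_apply, self_apply)
throw_away = ("lam", "x", ("var", "y"))
example = ("app", throw_away, OMEGA)

result = normalise(example)      # the looping argument is discarded, never reduced
looping = normalise(OMEGA)       # hits the step limit
```

<p>Reducing the argument first would never terminate on <code>example</code>, which is the reason for insisting on normal order.</p>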
<h2>Towards Lisp</h2>
<p>Let's see what we've defined in the lambda calculus so far:</p>
<ul>
<li><code>pair</code></li>
<li>lists</li>
<li><code>fst</code></li>
<li><code>snd</code></li>
<li><code>True</code></li>
<li><code>False</code></li>
<li><code>if</code></li>
<li><code>eq0</code></li>
<li>numbers</li>
<li>recursion</li>
</ul>
<p>This is most of <a href="http://languagelog.ldc.upenn.edu/myl/ldc/llog/jmc.pdf">what you need in a Lisp</a>. Lisp was invented in 1958 by John McCarthy. It was intended as an alternative axiomatisation for computation, with the goal of not being too complicated to define while still being human-friendly, unlike the lambda calculus or Turing machines. It borrows notation (in particular the keyword <code>lambda</code>) from the lambda calculus and its terms are also trees, but it is not directly based on the lambda calculus.</p>
<p>Lisp was not intended as a programming language, but Steve Russell (no relation to Bertrand Russell ... I'm pretty sure) realised you could write machine code to evaluate Lisp expressions, and went ahead and did so, making Lisp the second-oldest high-level programming language still in widespread use (after Fortran). Despite its age, Lisp is arguably the most elegant and flexible programming language (modern dialects include <a href="https://clojure.org/">Clojure</a> and <a href="https://racket-lang.org/">Racket</a>).</p>
<p>One way to think of what we've done in this post is that we've started from the lambda calculus – an almost stupidly simple theoretical model – and made definitions and syntax transformations until we got most of the way to being able to emulate Lisp, a very usable and practical programming language. The main takeaway is, hopefully, an intuitive sense of how something as simple as the lambda calculus can express any computation expressible in a higher-level language.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-68979675695847730492021-03-27T22:21:00.011+00:002021-05-28T00:24:05.508+01:00Nuclear power is good<p style="text-align: center;"><span style="font-size: medium;"></span></p><p style="text-align: left;"><span style="font-size: large;">(Alternative title: burning things considered harmful)</span></p><p style="text-align: center;"><span style="font-size: medium;"> <i><span style="font-size: x-small;">5k words (about 17 minutes)</span></i></span></p><p style="text-align: center;"><span style="text-align: left;"> </span></p><p>If you want usable energy, you need to use the forces between particles.</p>
<p>The weakest force is gravity, but if you happen to be near a gigantic amount of material (e.g. the Earth) with an uneven surface that has stuff flowing down it (e.g. water in a river), you can still use it to generate power. This insight gives us hydropower, which delivers about 16% of the world's electricity. The main downside is that because of how weak gravity is, dams have to be large and environmentally disruptive to generate useful power.</p>
<p>Moving to stronger forces, we have chemical interactions between atoms. In the form of burning fossil fuels, rearranging chemical bonds produces 66% of the world's electricity. The main downside is how weak chemical bonds are, and therefore how much matter has to be processed (i.e. burned) to produce energy. A lot of matter means a lot of waste products. Despite decades of work on possible safe waste-management strategies (e.g. carbon capture and storage), we still outrageously keep dumping over thirty billion tons of carbon dioxide into the atmosphere every year, with massive effects on the climate that will potentially last thousands of years, while also producing a long list of other harmful waste products that kill <a href="https://ourworldindata.org/air-pollution">a lot of people</a> per year.</p>
<p>Thankfully, atoms aren't atomic: we can rearrange the nuclei inside them and get energy densities that blow puny chemistry out of the water. Currently 11% of the world's electricity comes from directly doing this. We're still playing catch-up to God, who, in His infinite wisdom, saw fit to create a universe where just about 100% of energy production is nuclear.</p>
<p>Our nearest God-sanctioned nuclear reactor is the sun. Harnessing the sun's light and heat gives us another 1% of the world's electricity; a slightly more indirect route where we first wait for the sun's heat to stir up the air gives us another 3.5%. An even more indirect route is letting the sun's light fall on plants so that they create chemical bonds that we can burn for power; this gives us another 2%. The most indirect route of all is to use the chemical bonds created by sunlight that fell on extinct plants hundreds of millions of years ago, which is what we're really doing when we burn fossil fuels. So actually it's all nuclear, with the only difference being how many hoops you jump through first.</p>
<p>The current state of nuclear power is that we can harness only fission (splitting atoms) for controlled energy production. Fusion (combining atoms) is potentially an even better technology: it requires less exotic materials, produces less dangerous waste, and is literally star-power. However, it takes extreme energies to get power out of fusion, and the only way we've found to do that is to blow up a (fission-based) nuclear bomb in a very controlled way that squeezes the stuff we want to fuse to create an even bigger bang. Technically we could use this for power – say, we build a massive underground chamber where we set off hydrogen bombs (the common name for a bomb that uses nuclear fusion) every once in a while to vaporise vast amounts of water into steam and then drive a generator – but let's just say there would be some difficulties. (Though, surprisingly, mostly economic and political ones rather than technical ones – this idea was seriously studied in the 1970s as <a href="https://en.wikipedia.org/wiki/Project_PACER">Project PACER</a>.)</p>
<p>Controlled fusion power is in the works, but it's the poster child for technologies that are always twenty years away. At the moment scientists are playing around with <a href="https://en.wikipedia.org/wiki/National_Ignition_Facility">lasers that have 25 times the power of the entire world's electricity generation</a> (though only for a few picoseconds at a time) and <a href="https://en.wikipedia.org/wiki/ITER">magnets almost strong enough to levitate a frog</a>* to bring it about, but don't expect commercial fusion power in the next decade at least.</p>
<p>(*Levitating a frog takes a field of about 16 teslas, according to research that won an <a href="https://www.improbable.com/ig-about/winners/#ig2000">Ig Nobel Prize in 2000</a>, compared to ITER's 13-tesla field.)</p>
<p>Fusion is definitely a technology that we should develop. However, as J. Storrs Hall writes in <i>Where is my flying car?</i> (my review <a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html">here</a>):</p>
<blockquote><p><i>"As a science fiction and technology fan, for most of my life I had been squarely in the “just you wait until we get fusion” camp. Then I was forced to compare the expected advantages fusion would bring to the ones we already had with fission. Fuel costs are already negligible. The process is already clean, with no emissions. Even though the national [US] waste repository at Yucca Mountain has been blocked by activists since it was designated in 1987 and never opened, fission produces so little waste that all our power plants have operated the entire period by basically sweeping it into the back closet."</i></p>
</blockquote>
<p>We have already invented a miracle clean power source. And, surprise surprise, we should really use it.</p>
<p> </p>
<h2>The human case for nuclear power</h2>
<p>Every year, <a href="https://ourworldindata.org/grapher/number-of-deaths-by-risk-factor?tab=chart&stackMode=absolute&region=World">there are almost five million deaths attributable to air pollution</a>, a bit less than 1 in 10 of all deaths in the world, or one every six seconds. Since it's a bit tricky to know what counts as an "attributable death" in the case of some risk factor, here's another measure: <a href="https://ourworldindata.org/grapher/disease-burden-by-risk-factor">almost 150 million years of health-weighted life are lost every year because of air pollution</a>. The health effects of air pollution are right up there with the other biggest killers like high blood pressure, smoking, and obesity.</p>
<p>The biggest causes of air pollution are energy generation, traffic, and (especially in poor countries) heating. Getting global averages for power generation deadliness is hard, but doing some very rough estimation, more than one-tenth but less than one-third of air pollution deaths are directly related to power generation, for a total number in the hundreds of thousands per year. Imagine three Chernobyl-scale disasters a week, and you're in the right ballpark.</p>
<p>(There is major disagreement over the actual Chernobyl death toll. When making comparisons in this post, I use the number 4000. About 30 people died directly during the disaster; several thousand may die in the long run according to the best consensus estimates, though if you assume the contested <a href="https://en.wikipedia.org/wiki/Linear_no-threshold_model">linear no-threshold model</a> (which seems to be the main crux of the debate) you can get numbers in the tens of thousands. If you want to be maximally pessimistic, you can multiply Chernobyl impact comparisons by 10, but you'll find this doesn't materially change the conclusions.)</p>
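<p>As a sanity check on the scale comparisons above, here is the arithmetic spelled out. It uses only the round figures quoted in the text (five million deaths a year, a one-tenth to one-third share for power generation, 4000 deaths per Chernobyl); the variable names are just for this sketch.</p>

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

air_pollution_deaths_per_year = 5_000_000   # "almost five million" in the text
chernobyl_deaths = 4_000                    # figure used for comparisons in this post

# One air pollution death every N seconds.
seconds_per_death = SECONDS_PER_YEAR / air_pollution_deaths_per_year

# Rough bounds on the share attributable to power generation.
power_deaths_low = air_pollution_deaths_per_year / 10
power_deaths_high = air_pollution_deaths_per_year / 3

# "Three Chernobyl-scale disasters a week", expressed as deaths per year.
three_chernobyls_a_week = 3 * 52 * chernobyl_deaths

print(f"one death every {seconds_per_death:.1f} s")
print(f"power generation: {power_deaths_low:,.0f} to {power_deaths_high:,.0f} deaths/year")
print(f"three Chernobyls a week: {three_chernobyls_a_week:,} deaths/year")
```

<p>Three Chernobyls a week comes to 624 000 deaths a year, which sits inside the rough one-tenth to one-third range.</p>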
<p>Which power sources cause these deaths? There's some disagreement over the exact numbers, but <a href="https://ourworldindata.org/grapher/death-rates-from-energy-production-per-twh?tab=chart&time=earliest..latest&region=World">here's</a> a chart for European energy production from Our World in Data:</p><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6EYvqAJwOj1tK-5o5YAXadgpk7nNq0VGQHRrWIcHSiHsL3fkuc4EsI4zxXCXYxzLkvq11BcQvrk4Zon1KiXUuWOG19LwN0n4nY_RBY1oEearX1AGWadgfz1HziV5h4Mt7PHzhc1lMokEy/s1744/deathspertwh.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1128" data-original-width="1744" height="414" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6EYvqAJwOj1tK-5o5YAXadgpk7nNq0VGQHRrWIcHSiHsL3fkuc4EsI4zxXCXYxzLkvq11BcQvrk4Zon1KiXUuWOG19LwN0n4nY_RBY1oEearX1AGWadgfz1HziV5h4Mt7PHzhc1lMokEy/w640-h414/deathspertwh.png" width="640" /></a></div><p>(One terawatt-hour (3.6 petajoules) is roughly the annual energy consumption of 20 000 Europeans.)</p>
<p>The chart above has European numbers. In particular for fossil fuel sources, there's a lot of country-specific variation due to environmental regulations and population density: the <a href="https://www.sciencedirect.com/science/article/pii/S0140673607612537?casa_token=r5LmpCZ6G8YAAAAA:aW4wjfZ3PENq0mvbNTLXF27WEkLRuAsE0wGTXSrC1R3OgNLg9a7RdoedMRKZ20sBoUwuClxm#bib32">paper</a> that the above chart is largely based on mentions 77 deaths/TWh as a reasonable figure for a regulation-compliant Chinese coal plant, while <a href="http://www.forbes.com/sites/jamesconca/2012/06/10/energys-deathprint-a-price-always-paid/">this article</a> says that 280 deaths/TWh is possible for coal.</p>
<p>Why do solar and wind produce any deaths at all? Both occasionally involve dangerous construction work (rooftop solar / tall wind turbines). In fact, if you look at recent decades (i.e., not including Chernobyl) and use the low-end estimates, solar and wind are deadlier than nuclear.</p>
<p>The estimates for hydropower can also swing a bit depending on whether or not you include the deadliest electricity generation disaster in history: the <a href="https://en.wikipedia.org/wiki/1975_Banqiao_Dam_failure">1975 Banqiao Dam failure</a>, which may have killed hundreds of thousands of people. Since 1965, hydropower has produced about 130 000 TWh; depending on which death toll estimate you believe, Banqiao single-handedly raises the deaths per TWh for hydropower by between 0.2 and 2. Compare this with nuclear power, which has produced about 92 000 TWh over the same timeframe; the long-term death estimates for Chernobyl add 0.04 to the deaths/TWh count for nuclear.</p>
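<p>The per-TWh figures in the previous paragraph follow from simple division. The sketch below uses the generation totals and the 4000-death Chernobyl figure from the text; the Banqiao death tolls are not inputs but are backed out from the quoted 0.2 to 2 deaths/TWh range, so treat them as implied round numbers rather than sourced estimates.</p>

```python
HYDRO_TWH_SINCE_1965 = 130_000
NUCLEAR_TWH_SINCE_1965 = 92_000
CHERNOBYL_DEATHS = 4_000   # long-run figure used throughout this post

# Chernobyl's contribution to nuclear's deaths per TWh.
chernobyl_per_twh = CHERNOBYL_DEATHS / NUCLEAR_TWH_SINCE_1965


def implied_death_toll(deaths_per_twh, total_twh):
    """Death toll that would add `deaths_per_twh` over `total_twh` of generation."""
    return deaths_per_twh * total_twh


# Death tolls implied by the 0.2 to 2 deaths/TWh range quoted for Banqiao.
banqiao_low = implied_death_toll(0.2, HYDRO_TWH_SINCE_1965)
banqiao_high = implied_death_toll(2, HYDRO_TWH_SINCE_1965)

print(f"Chernobyl adds {chernobyl_per_twh:.3f} deaths/TWh to nuclear")
print(f"Banqiao range implies {banqiao_low:,.0f} to {banqiao_high:,.0f} deaths")
```

<p>Even at the low end, the single dam failure moves hydropower's rate several times more than Chernobyl moves nuclear's.</p>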
<p>(The total generation numbers are based on the raw data behind <a href="https://ourworldindata.org/grapher/modern-renewable-energy-consumption?time=earliest..latest">this</a> and <a href="https://ourworldindata.org/grapher/nuclear-energy-generation?tab=chart&stackMode=absolute&time=earliest..latest&country=~OWID_WRL&region=World">this</a> graph, which you can download from the links. The nuclear number in the above chart is based on <a href="https://www.sciencedirect.com/science/article/pii/S0140673607612537?casa_token=r5LmpCZ6G8YAAAAA:aW4wjfZ3PENq0mvbNTLXF27WEkLRuAsE0wGTXSrC1R3OgNLg9a7RdoedMRKZ20sBoUwuClxm">this paper</a>, which Our World in Data says already includes Chernobyl, though I can't see where they add that in.)</p>
<p>The bottom line is that hydropower accidents are <a href="https://en.wikipedia.org/wiki/List_of_hydroelectric_power_station_failures">more common, more deadly, and higher variance</a> than nuclear accidents, even though both power sources have produced comparable amounts of energy in recent decades.</p>
<p>Okay, actually that isn't the real bottom line. The real bottom line is this: <i>when it comes to the human impacts of electricity generation, there are things that involve burning (fossil fuels &amp; biomass), and then there is everything else, and the latter category is much, much better</i>. Also, if you absolutely must burn something, <i>do not burn coal</i>.</p>
<p>What has nuclear specifically done so far? <a href="https://pubs.acs.org/doi/abs/10.1021/es3051197?source=cen&">One study</a> finds that it has saved 1.8 million lives by reducing air pollution, about as many as die of malaria worldwide in 4 years at current rates.</p>
<p>What could it have done? Until the mid-1970s, the adoption of nuclear power was accelerating. Assume this trend had continued until today, and nuclear had replaced fossil fuels only (an optimistic assumption, but one that doesn't change the numbers much because renewables are a pretty small percentage). Under these assumptions, <a href="https://www.mdpi.com/1996-1073/10/12/2169/htm">one study</a> estimates that nuclear would now account for over half of the world's energy production, and a total of 9.5 million deaths would have been avoided – as much as if you saved everyone who would otherwise have died of cancer in the past year. Even if nuclear adoption had only been linear, 4.2 million deaths could have been avoided, the same number as saving everyone who has died in war since 1970 (the war deaths number is from the raw data behind <a href="https://ourworldindata.org/grapher/battle-related-deaths-in-state-based-conflicts-since-1946-by-world-region">this chart</a>).</p>
<p>Therefore: <i>in terms of the number of lives saved, keeping the nuclear power industry growing would have very likely been at least as good as achieving world peace in 1970.</i></p>
<p>Since these numbers are enormous, and involve difficult-to-estimate unknowns, here's something more concrete: Germany's decision in 2011 to get rid of nuclear is costing an average of 1100 lives per year (<a href="https://www.nber.org/system/files/working_papers/w26598/w26598.pdf">working paper</a>; <a href="https://grist.org/energy/the-cost-of-germany-going-off-nuclear-power-thousands-of-lives/">article</a>).</p>
<h2>The environmental case for nuclear power</h2>
<p>Climate change is a big problem, but the scale of it as an environmental problem is better known than the scale of air pollution as a health problem, so I won't go into the statistics on its impact.</p>
<p>Nuclear power is obviously good for the climate. Here's a chart, based on <a href="https://www.ipcc.ch/site/assets/uploads/2018/02/ipcc_wg3_ar5_annex-iii.pdf#page=7">this</a>, which is summarised in a more readable format <a href="https://en.wikipedia.org/wiki/Life-cycle_greenhouse_gas_emissions_of_energy_sources#2014_IPCC,_Global_warming_potential_of_selected_electricity_sources">here</a>:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinJn5N6WcOv1lgub_-nUkvXLbPM-8P2HKKG1dfld8nTWXCCle99dYYyL_M1gDuOpMljWcObJJjfDni5fiG4rrsARXeF271d-g7mjQYFfqE4m94XFcncPDgIUEoK3HqGjQVvCz3E69RYx8j/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="830" data-original-width="1232" height="432" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinJn5N6WcOv1lgub_-nUkvXLbPM-8P2HKKG1dfld8nTWXCCle99dYYyL_M1gDuOpMljWcObJJjfDni5fiG4rrsARXeF271d-g7mjQYFfqE4m94XFcncPDgIUEoK3HqGjQVvCz3E69RYx8j/w640-h432/co2eqpertwh.png" width="640" /></a></div> <p></p>
<p>The black bars span the range between the minimum and maximum numbers. The red dot is the median.</p>
<p>I've converted the numbers from the traditional grams of CO2 equivalent per kWh to tons of CO2 equivalent per TWh, to be consistent with the death rates graph above, and for easier conversion to national/international CO2 statistics (which are generally expressed in tons of CO2 – unless it's tons of carbon, in which case you divide by the fraction of CO2's mass that is carbon, which is 12/44 or about 0.27).</p>
<p>(If you're wondering where hydropower is: its median is right around concentrated solar, but in some cases, especially in tropical climates, the <a href="https://en.wikipedia.org/wiki/Environmental_impact_of_reservoirs#Greenhouse_gases">reservoirs created by dams can release a lot of methane</a>, making the maximum CO2-equivalent emissions for hydropower over twice as bad as coal and, more importantly, completely ruining my pretty chart.)</p>
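<p>The two unit conversions just mentioned are easy to get backwards, so here is a small sketch of them. The function names are invented for this example and the input values are illustrative, not figures taken from the chart.</p>

```python
G_PER_TON = 1_000_000
KWH_PER_TWH = 1_000_000_000


def g_per_kwh_to_t_per_twh(g_per_kwh):
    """Grams of CO2-equivalent per kWh -> tons of CO2-equivalent per TWh."""
    return g_per_kwh * KWH_PER_TWH / G_PER_TON


# Carbon has atomic mass 12 and CO2 has molecular mass 44.
CARBON_FRACTION_OF_CO2 = 12 / 44


def tons_carbon_to_tons_co2(tons_carbon):
    """Tons of carbon -> tons of CO2 (divide by carbon's share of CO2's mass)."""
    return tons_carbon / CARBON_FRACTION_OF_CO2


example_t_per_twh = g_per_kwh_to_t_per_twh(12)    # illustrative input of 12 g/kWh
example_t_co2 = tons_carbon_to_tons_co2(1.0)      # one ton of carbon

print(f"12 g/kWh = {example_t_per_twh:,.0f} t/TWh")
print(f"1 t of carbon = {example_t_co2:.2f} t of CO2")
```

<p>In other words, the g/kWh to t/TWh conversion is just a factor of 1000, and a ton of carbon corresponds to about 3.67 tons of CO2.</p>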
<p>So far, the use of nuclear power is estimated to have <a href="https://blogs.scientificamerican.com/the-curious-wavefunction/nuclear-power-may-have-saved-1-8-million-lives-otherwise-lost-to-fossil-fuels-may-save-up-to-7-million-more/">reduced cumulative CO2 emissions to date by 64 billion tons</a>, a bit less than two years of the world's <i>total</i> CO2 emissions at current rates. The <a href="https://www.mdpi.com/1996-1073/10/12/2169/htm">same study</a> linked in the previous section estimates that, had nuclear power grown at a steady linear rate, this number would be doubled, and if the accelerating trend in nuclear power adoption had continued, there would be 174 billion tons less CO2 in the atmosphere. We would have saved more emissions than we would have if we had made every car in the world emission-free since 1990.</p>
<p> </p>
<h2>The problems</h2>
<p>In <i>Enlightenment Now</i> (my review <a href="http://strataoftheworld.blogspot.com/2018/08/review-enlightenment-now-steven-pinker.html">here</a>), Steven Pinker writes:</p>
<blockquote><p><i>"It’s often said that with climate change, those who know the most are the most frightened, but with nuclear power, those who know the most are the least frightened."</i></p>
</blockquote>
<p>So why aren't the arguments against nuclear power enough to frighten those who know about it?</p>
<p>The short version: more nuclear power would save millions of lives from air pollution and be a big help in solving climate change. When these are the benefits, you need a hell of a drawback before the scales start tilting the other way.</p>
<p>The long version:</p>
<h3>Radiation & accidents</h3>
<p>(Radiation units are confusing. Activity, straightforwardly defined as the number of atoms that undergo decay per second, is measured in becquerels (Bq). The amount of radiation energy absorbed per kilogram of matter is measured in grays (Gy), which therefore have units of joules per kilogram. Measuring biological effects is harder, because the type of radiation and what tissue it hits both matter. If you adjust for the type of radiation by multiplying the absorbed dose in grays by some factor (scaled so that gamma rays have a factor of 1), you get something called <a href="https://en.wikipedia.org/wiki/Equivalent_dose">equivalent dose</a>, which is measured in sieverts (Sv). If you also adjust for which tissue type was hit by multiplying by more estimated factors, you get <a href="https://en.wikipedia.org/wiki/Effective_dose_(radiation)">effective dose</a>, which is also measured in sieverts. If you want to get a sense of scale for radiation dose numbers, <a href="https://xkcd.com/radiation/">here's a good chart</a> and <a href="https://en.wikipedia.org/wiki/Sievert#Dose_examples">here's a good table</a>.)</p>
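<p>The chain from grays to sieverts described above is two rounds of multiplication. The sketch below shows the shape of the calculation; the weighting factors are assumed values in the style of the ICRP recommendations (gamma and beta 1, alpha 20, and a handful of tissue weights), included only for illustration, so check the current tables before relying on them.</p>

```python
# Assumed radiation weighting factors (dimensionless), for illustration only.
RADIATION_WEIGHT = {"gamma": 1.0, "beta": 1.0, "alpha": 20.0}

# Assumed tissue weighting factors for a few tissues; the full table sums to 1.
TISSUE_WEIGHT = {"lung": 0.12, "thyroid": 0.04, "skin": 0.01}


def equivalent_dose_sv(absorbed_dose_gy, radiation):
    """Absorbed dose (Gy) adjusted for radiation type -> equivalent dose (Sv)."""
    return absorbed_dose_gy * RADIATION_WEIGHT[radiation]


def effective_dose_sv(equivalent_dose_by_tissue):
    """Equivalent doses (Sv) per tissue, weighted by tissue type -> effective dose (Sv)."""
    return sum(
        TISSUE_WEIGHT[tissue] * dose
        for tissue, dose in equivalent_dose_by_tissue.items()
    )


# Hypothetical example: 1 mGy absorbed by the lungs, from alpha versus gamma radiation.
lung_alpha = equivalent_dose_sv(0.001, "alpha")
lung_gamma = equivalent_dose_sv(0.001, "gamma")
effective_alpha = effective_dose_sv({"lung": lung_alpha})

print(f"equivalent dose: {lung_alpha * 1000:.0f} mSv (alpha) vs {lung_gamma * 1000:.0f} mSv (gamma)")
print(f"effective dose from the alpha case: {effective_alpha * 1000:.1f} mSv")
```

<p>The same absorbed energy counts for twenty times as much when it arrives as alpha particles, and then only a fraction of it counts towards the whole-body figure because only one tissue was hit.</p>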
<p>In normal operation, a <a href="https://www.scientificamerican.com/article/coal-ash-is-more-radioactive-than-nuclear-waste/">nuclear power plant produces significantly less radiation than a coal power plant</a> (this is because everything radioactive is contained in a nuclear power plant, while coal power plants pump <a href="https://en.wikipedia.org/wiki/Fly_ash">fly ash</a> into the air). Neither is a significant dose.</p>
<p>In accidents, nuclear power plants can release insane amounts of radioactivity. Insane amounts of radiation are dangerous. However, the reaction to radiation risks is often out of proportion to the true risk – the Fukushima evacuations are considered excessive in hindsight, as argued in <a href="https://www.sciencedaily.com/releases/2017/11/171120085453.htm">this study</a>, though you probably don't need to do a study to guess it from <a href="https://ourworldindata.org/grapher/estimated-mortality-from-fukushima-nuclear-disaster">this chart</a>:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFsaAX22DFYPJA-Iu_gSfXD6TfVBGcF0tWt7QncUpcxWMvcAz9DF47vG41vGMcU8sfJGFtnD2raDJjpD1txSrv6NpjuxzfdIhRRNAGW4lmLpGFEeQ65IoaBGxOJL2hyphenhyphens0OHZAcWhyhf6jQ/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1172" data-original-width="1736" height="432" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFsaAX22DFYPJA-Iu_gSfXD6TfVBGcF0tWt7QncUpcxWMvcAz9DF47vG41vGMcU8sfJGFtnD2raDJjpD1txSrv6NpjuxzfdIhRRNAGW4lmLpGFEeQ65IoaBGxOJL2hyphenhyphens0OHZAcWhyhf6jQ/w640-h432/fukushima.png" width="640" /></a></div><p></p>
<p>(In the long run, some more cancer deaths are expected to trickle in.)</p>
<p>It is critically important to remember the above statistics on health effects, and not let yourself be biased by <a href="https://en.wikipedia.org/wiki/Chernobyl_(miniseries)">vivid stories</a> about horrible individual events. The fear of nuclear accidents is similar to the fear of flying rather than driving: statistically flying is much safer, but it is much easier to fear because when things go wrong, the failures come in more story-worthy packages.</p>
<p>In particular: it is <i>not</i> the case that nuclear power is safer only because accidents are rare and therefore get left out of statistics; nuclear power would be overwhelmingly safer than fossil fuels even if there were a Chernobyl going off every year. As I said above, <a href="https://en.wikipedia.org/wiki/List_of_hydroelectric_power_station_failures">hydropower accidents</a> are more common, more deadly, and higher variance, so any argument based on disaster risk that bans nuclear would also ban hydropower.</p>
<h3>Nuclear proliferation</h3>
<p>Nuclear power is good, but <a href="https://strataoftheworld.blogspot.com/2020/04/review-doomsday-machine.html">nuclear weapons are bad</a>. It would be bad if the spread of civilian nuclear power technology led to nuclear proliferation. There is some overlap in technology, but neither civilian materials nor technologies automatically lead to weapons. The uranium used in power plants is typically only enriched to 3-5%, compared to more than 85% for weapons-grade uranium and 0.7% in natural uranium (though if you have uranium enrichment infrastructure, you can run it for more cycles than usual and let the enrichment levels slowly creep up – Iran has done this). There are also international agreements that prevent enrichment, and alternative nuclear technologies, like using thorium instead of uranium, with less weapon potential. Finally, a country trying to build nuclear weapons probably won't be stopped by a lack of a civilian industry; consider North Korea.</p>
<h3>Terrorism and war risks</h3>
<p>Another risk to consider is that nuclear power plants might be targeted by terrorists, or even by hostile nations, potentially leading to Chernobyl-scale disasters. This is a risk, but it's an acceptable one. Consider what it would mean if "hundreds or thousands of people could be killed if a determined and resourceful hostile actor targeted this piece of infrastructure" were a reason to not build some piece of infrastructure – we'd have to ban skyscrapers, airplanes, dams, water treatment plants, and so forth. Also, considering the security that's (rightfully) present at nuclear power plants, it would probably take a 9/11 level of execution to do it, and the observed rate for 9/11-level events over a time interval of length T is, well, 1/T if the interval includes 9/11 and otherwise 0.</p>
<p>It is true that a complex civilisation has a lot of fragile points and someone should be thinking hard about minimising this kind of risk, and that nuclear power plants are a good example because the effects are expensive and long-lasting if an attack is successful. But as an argument against nuclear power, <a href="https://slatestarcodex.com/2013/04/13/proving-too-much/">it proves too much</a>.</p>
<h3>Nuclear waste</h3>
<p>Nuclear waste is awkward to deal with, but it's far from the worst sort of industrial waste we deal with – consider the over thirty billion tons of carbon dioxide we've dumped into the atmosphere over the past year, or the various horrible things that coal plants spew out that cause dozens of Chernobyl-equivalents per year.</p>
<p>Nuclear waste is not some miracle substance that effortlessly seeps everywhere and kills whatever it touches. Until 1993, countries (mostly the USSR and UK) were dumping nuclear waste into the ocean. This is rightly banned these days, but you can observe that we still have oceans; in fact, <a href="https://en.wikipedia.org/wiki/Ocean_disposal_of_radioactive_waste#Environmental_impact">the environmental impacts</a> have so far been negligible except for somewhat higher concentrations of some nasty isotopes exactly at the site.</p>
<p>In general, nuclear waste is a serious problem that has to be solved somehow, but solutions exist (currently, Finland's <a href="https://en.wikipedia.org/wiki/Onkalo_spent_nuclear_fuel_repository">Onkalo repository</a> is the closest to being operational). Though the timescale is long, it is not different in principle from some existing disposal methods for nasty things like mercury and arsenic.</p>
<p>Is it responsible to leave behind dangerous waste for future generations? It's far more responsible than leaving them with the almost astronomical amounts of CO2 emissions that a single kilogram of uranium prevents.</p>
<p>Future people looking back at our century won't despair about a few warm rocks deep underground. They'll despair at all the silent air pollution deaths, at how far we let climate change get, and at how much sooner we could've reached their living standards had we made better use of our technology. Then they'll travel on nuclear-powered airplanes to distant hiking grounds, and tell scare stories around an (artificial!) campfire about the barbarian past when we burned things for energy and piped the waste products straight into the atmosphere.</p>
<h3>Uranium is limited</h3>
<p>First, we have <a href="https://www.scientificamerican.com/article/how-long-will-global-uranium-deposits-last/">200 years' worth of economically accessible uranium reserves</a>. This is <a href="https://ourworldindata.org/grapher/years-of-fossil-fuel-reserves-left">more than for fossil fuels</a>, with the additional benefit that burning through the remaining uranium won't wreck the climate and kill millions.</p>
<p>Second, we have alternatives to uranium, like thorium.</p>
<p>Third, there are hundreds of times more uranium dissolved in the oceans than there is on land (and this uranium exists in equilibrium, so if you take it out, more will leach out of the seabed to replace it, a fact that might lead a pedant to call nuclear power renewable). Even though the concentrations are tiny, because of the energy density of uranium, at modern reactor efficiencies there's still half a megajoule of usable nuclear energy in the uranium in a single cubic metre of seawater, enough to power the lightbulb in my room for over five hours. As a result, extracting it is a project that is <a href="https://www.forbes.com/sites/jamesconca/2016/07/01/uranium-seawater-extraction-makes-nuclear-power-completely-renewable/?sh=1b4b0f19159a">taken surprisingly seriously, and is surprisingly close to being economically viable</a>, though <a href="http://large.stanford.edu/courses/2017/ph241/jones-j2/docs/epjn150059.pdf">some people are very skeptical</a>.</p>
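<p>For a feel of what half a megajoule per cubic metre of seawater means, the lightbulb comparison can be turned around: the snippet below works out how large a bulb that energy could run for five hours. Only the two numbers from the paragraph above go in; everything else is unit conversion.</p>

```python
energy_per_cubic_metre_j = 0.5e6    # half a megajoule, from the text
hours_of_light = 5                  # "over five hours", from the text

JOULES_PER_KWH = 3.6e6
SECONDS_PER_HOUR = 3600

# Largest bulb (in watts) that this energy could run for the full five hours.
max_bulb_watts = energy_per_cubic_metre_j / (hours_of_light * SECONDS_PER_HOUR)

# The same energy in the units on an electricity bill.
energy_kwh = energy_per_cubic_metre_j / JOULES_PER_KWH

print(f"bulb of up to {max_bulb_watts:.1f} W for {hours_of_light} hours")
print(f"{energy_kwh:.3f} kWh per cubic metre of seawater")
```

<p>So the claim holds for any bulb drawing less than about 28 W, which comfortably covers a modern LED or compact fluorescent.</p>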
<h3>Nuclear power is unnatural</h3>
<p>Wrong: about two billion years ago <a href="https://www.scientificamerican.com/article/ancient-nuclear-reactor/">a spontaneous natural nuclear reactor</a> ran for a few hundred thousand years under what is now Gabon.</p>
<p>Using the best estimates for its running time and power output, the energy it produced is in the hundreds of petajoules range. That is small next to all human civilian nuclear power to date (about 92 000 TWh, or roughly 330 000 petajoules at 3.6 petajoules per TWh), but it is still a sustained fission reactor that ran for hundreds of thousands of years with no humans anywhere near it.</p>
<p> </p>
<h2>Nuclear is overpowered, so where is it?</h2>
<p>Nuclear power is an almost overpowered technology. The reason why comes down to physics: an energy source based on nuclear reactions has extreme power density, and, all else being equal, the higher your power density, the less fuel you need, the fewer waste products you produce, and the cleaner your power plant is overall. Not surprisingly, nuclear power turns out to be – along with solar and wind – the cleanest and safest power source we have.</p>
<p>In <a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html"><i>Where is My Flying Car?</i></a>, J. Storrs Hall gives some vivid facts to demonstrate the power and efficiency of nuclear: a wind turbine uses more lubricating oil per energy generated than a nuclear power plant uses uranium, and while the 7.5 TJ of energy a Boeing 747 burns through during a flight weighs 200 tons and costs a third of a million dollars when delivered as chemical fuel, getting the equivalent energy from nuclear takes 100 <i>grams</i> of reactor-grade uranium and costs 10 dollars.</p>
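<p>To make the comparison above concrete, here is a rough back-of-the-envelope sketch using only the figures quoted from Hall. I am assuming "tons" means metric tonnes and "a third of a million dollars" means exactly $1,000,000 / 3; the variable names are my own.</p>

```python
# Figures as quoted above from "Where is My Flying Car?".
# Assumption: "200 tons" is read as 200 metric tonnes.
FLIGHT_ENERGY_J = 7.5e12           # 7.5 TJ burned during one 747 flight
JET_FUEL_MASS_KG = 200 * 1000      # 200 tonnes of chemical fuel
JET_FUEL_COST_USD = 1_000_000 / 3  # "a third of a million dollars"
URANIUM_MASS_KG = 0.1              # 100 grams of reactor-grade uranium
URANIUM_COST_USD = 10.0


def energy_density(energy_j, mass_kg):
    """Energy delivered per kilogram of fuel, in J/kg."""
    return energy_j / mass_kg


def cost_per_gigajoule(cost_usd, energy_j):
    """Fuel cost per gigajoule of energy delivered, in USD/GJ."""
    return cost_usd / (energy_j / 1e9)


jet_density = energy_density(FLIGHT_ENERGY_J, JET_FUEL_MASS_KG)
uranium_density = energy_density(FLIGHT_ENERGY_J, URANIUM_MASS_KG)

# How many times more energy per kilogram, and how many times cheaper.
mass_ratio = uranium_density / jet_density
cost_ratio = JET_FUEL_COST_USD / URANIUM_COST_USD

print(f"Jet fuel: {jet_density:.3g} J/kg, "
      f"{cost_per_gigajoule(JET_FUEL_COST_USD, FLIGHT_ENERGY_J):.2f} USD/GJ")
print(f"Uranium:  {uranium_density:.3g} J/kg, "
      f"{cost_per_gigajoule(URANIUM_COST_USD, FLIGHT_ENERGY_J):.4f} USD/GJ")
print(f"Energy per kg ratio: {mass_ratio:,.0f}x; fuel cost ratio: {cost_ratio:,.0f}x")
```

<p>On these numbers, the uranium carries about two million times more energy per kilogram than the jet fuel, and the fuel bill is tens of thousands of times smaller. This only compares fuel, not the cost of the reactor or the aircraft.</p>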
<p>So where is it? The simple reason is that it's either illegal (like in Italy), being phased out (like in Germany), or highly regulated and/or expensive. It wasn't always so:</p>
<p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDCeZ99AN8m44v_6Yb2pAYwFDABT0fYXEYR_GxHk5bfGhbnrm2P3SFaodH2mhepUSJRaGFmykD-HT_8UDuLzK6U0zDB1IPbG9ndQ3IpwueK35bVYGVSz3kMRfwM6v9yJAuJ4cuGKpo3STA/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1074" data-original-width="1412" height="304" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDCeZ99AN8m44v_6Yb2pAYwFDABT0fYXEYR_GxHk5bfGhbnrm2P3SFaodH2mhepUSJRaGFmykD-HT_8UDuLzK6U0zDB1IPbG9ndQ3IpwueK35bVYGVSz3kMRfwM6v9yJAuJ4cuGKpo3STA/w400-h304/nuclearcost.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p>Source: <i>Where is my Flying Car?</i>, by J. Storrs Hall.</p></td></tr></tbody></table><p></p>
<p>The above graph shows the price per kilowatt of US nuclear power plants. The green line is the trend line before the Department of Energy was established in 1977. Note also that the Three Mile Island accident was in 1979, and, despite no one being hurt, this was a turning point for the US nuclear industry.</p>
<p>When the price of a technology starts increasing, it's not the natural learning curve of the technology at work. It's a regulatory choice. And while you obviously should regulate nuclear power, we're not doing it right.</p>
<p>J. Storrs Hall explains the cost increases:</p>
<blockquote><p><i>"Nuclear power is probably the clearest case where regulation clobbered the learning curve. Innovation is strongly suppressed when you’re betting a few billion dollars on your ability to get a license to operate the plant. Besides the obvious cost increases due to direct imposition of rules, there was a major side effect of forcing the size of plants up (fewer licenses); fewer plants were built and fewer ideas tried. That also meant a greater cost for transmission (about half the total, according to my itemized bill), since plants are further from the average customer."</i></p>
</blockquote>
<p>There is some hope that the tide is turning. New startups like <a href="https://en.wikipedia.org/wiki/NuScale_Power">NuScale</a> are working on small modular reactors that might greatly reduce prices. Of course, in addition to difficulties with funding, and the not-so-easy task of building a literal nuclear reactor, they've spent years clearing regulatory hurdles and are not expected to produce power until 2029. So-called fourth-generation reactors are also being worked on, and there's always the hope we eventually get fusion.</p>
<p>But we're not going to get the benefits of cheap and plentiful nuclear power unless we stop treating it like it's the Antichrist.</p>
<p>Hall, never one to pass up the opportunity for a dramatic touch, quotes John Steinbeck's <i>The Grapes of Wrath</i> to sum up the sadness of our attitude to nuclear power:</p>
<blockquote><p><i>“And men with hoses squirt kerosene on the oranges, and they are angry at the crime, angry at the people who have come to take the fruit. A million people hungry, needing the fruit—and kerosene sprayed over the golden mountains.</i></p><i>
</i><p><i>[...]</i></p><i>
</i><p><i>There is a crime here that goes beyond denunciation. There is a sorrow here that weeping cannot symbolize. There is a failure here that topples all our success. The fertile earth, the straight tree rows, the sturdy trunks, and the ripe fruit. And children dying of pellagra must die because a profit cannot be taken from an orange. And coroners must fill in the certificate—died of malnutrition—because the food must rot, must be forced to rot.”</i></p>
</blockquote>
<p>More generally, <a href="https://strataoftheworld.blogspot.com/2021/03/technological-progress.html">human civilisation needs to get better at making decisions about technology</a>. We shouldn't deny ourselves safe clean energy, but we should start working on mitigating the harms from actually scary technologies, like nuclear weapons, and make sure that new technologies like biotech and AI are used safely. Oh, and have I mentioned that burning things is bad for climate and health, and we should stop doing it?</p>
<h2>A metaphor</h2>
<p>I mentioned earlier that nuclear power and fossil fuels are like flying and driving. One of them is obviously safer, but the other seems scarier because the lizard-derived part of our brains can't multiply. Objecting to nuclear power on safety grounds but tolerating fossil fuels is like texting about how scared you are to board a plane while driving yourself to the airport. Let's make this metaphor more concrete, and hopefully create a memorable image.</p>
<p>The world consumes about 20 000 TWh per year as electricity (about one-eighth of total energy use – lots is used directly for transportation and heat). Let's compare this to a trip across Europe that starts in Lisbon and ends in Tallinn. Each kilometre we travel represents a bit less than 5 TWh of energy towards our 20 000 TWh goal. Let's say walking is wind/solar/geothermal, biking is hydropower, flying is nuclear, and driving is fossil fuels.</p>
<p>(The numbers for fossil fuel related deaths below are significant underestimates of the global average, because, like the chart above, they're based on the European data in <a href="https://www.sciencedirect.com/science/article/pii/S0140673607612537?casa_token=r5LmpCZ6G8YAAAAA:aW4wjfZ3PENq0mvbNTLXF27WEkLRuAsE0wGTXSrC1R3OgNLg9a7RdoedMRKZ20sBoUwuClxm#bib32">this study</a>. Regulations are looser and population densities higher in many developing countries that make up most of the world's air pollution deaths. I was not able to find a good estimate of the global average, and besides, these numbers are terrifying enough as they are.)</p>
<p>First we walk some 450 km, ending north-west of Madrid, and then bike 650 km, just barely taking us into France. We're a bit careless and somehow we've managed to shove a hundred people off wind turbines along the way. Oops.</p>
<p>By this point we're getting tired of walking and biking, but thankfully there's a flight to Paris. The pilot has a bad day and lands on top of a crowd, flattening another hundred people.</p>
<p>We really hate flying, so we refuse all the other offers that the airline companies try to sell us. Instead we step out of the Paris airport, rent a car, and start carelessly careening down the remaining 2600 km.</p>
<p>Gas takes us approximately to Berlin, a distance of about 1000 km. During this entire distance we run over a pedestrian at every block (roughly 1 per 80 metres), killing some 10 000 people in total.</p>
<p>We're in a real hurry to get to Poland, where the traffic rules get even more lenient and we can start <a href="https://www.independent.co.uk/climate-change/news/climate-change-poland-cop24-coal-air-pollution-global-warming-fossil-fuels-a8672481.html">burning coal</a>. The next leg of the journey, from Berlin to the Polish border, is powered by oil and isn't long, but still results in as many lethal hit-and-runs as the entire journey before it.</p>
<p>At the Polish border, we reach coal. From this point on, we text about the dangers of nuclear waste as we mow down one pedestrian every 8 metres for the entire rest of the coal-powered trip to Estonia (burning <a href="https://en.wikipedia.org/wiki/Narva_Power_Plants">some other nasty things</a> too). Driving at a reckless 120 km/h whatever road we're on, we run through four pedestrians a second – you'll hear a rapid thwack-thwack-thwack-thwack noise as the bodies hit the windshield – but it still takes 13 hours to make the trip. By the time we reach the Lithuanian border, the bodies of our victims, packed as tightly as possible, fill four Olympic swimming pools. Each of the three Baltic countries we drive through before reaching Tallinn fills another one.</p>
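<p>The arithmetic behind the coal leg can be checked in a few lines. This is only a sanity check on the numbers as stated in the metaphor (120 km/h, one pedestrian every 8 metres, 13 hours); the totals are what those rates imply, not figures taken from the underlying study.</p>

```python
# Numbers as stated in the metaphor above.
SPEED_KMH = 120                  # driving speed on the coal leg
METRES_PER_PEDESTRIAN_COAL = 8   # one pedestrian every 8 metres
DRIVE_HOURS = 13                 # stated duration of the coal leg

# Convert the speed to metres per second.
speed_ms = SPEED_KMH * 1000 / 3600

# Pedestrians hit per second at that speed.
pedestrians_per_second = speed_ms / METRES_PER_PEDESTRIAN_COAL

# Distance and total implied by driving for the stated time at the stated rate.
coal_leg_km = SPEED_KMH * DRIVE_HOURS
coal_leg_pedestrians = coal_leg_km * 1000 / METRES_PER_PEDESTRIAN_COAL

print(f"{pedestrians_per_second:.2f} pedestrians per second")
print(f"{coal_leg_km} km of coal-powered driving")
print(f"{coal_leg_pedestrians:,.0f} pedestrians on the coal leg")
```

<p>The rate comes out slightly above four per second, which matches the thwack-thwack-thwack-thwack above.</p>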
<p>Oh, and also every kilometre driven in our car had fifty times the environmental impact of flying.</p>
<p>Thank god we didn't fly: imagine how horrible it would be if another pilot had had a bad day.</p>
<p>The world makes this trip every year to meet our growing energy needs. We're getting fitter and walking a bit longer every year, as we should. But whenever someone suggests flying instead of driving, our collective response is: "What?! But that's so risky!"</p>
<p>Let's fly.</p><p><br /></p><p style="text-align: center;"><b>RELATED:</b></p><p></p><ul style="text-align: left;"><li><a href="https://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">A similar situation exists with GMOs</a></li><li><a href="https://strataoftheworld.blogspot.com/2021/03/technological-progress.html">Technological progress</a></li><li><a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html">Review: Where is my Flying Car?</a></li><li><a href="https://strataoftheworld.blogspot.com/2018/10/review-energy-and-civilization-history.html">Review: Energy and Civilisation</a></li></ul><p></p>
<h1>Technological progress</h1>
<p style="text-align: center;"><span style="font-size: x-small;"><i>Posted by Unknown on 2021-03-25 (updated 2021-03-27)</i></span></p>
<p style="text-align: center;"><span style="font-size: x-small;"><i>4k words (about 13 minutes)</i></span></p><p>In this post, I've collected some thoughts on:</p>
<ul>
<li>why technological progress probably matters more than you'd immediately expect;</li>
<li>what models we might try to fit to technological progress;</li>
<li>whether technological progress is stagnating; and</li>
<li>what we should hope future technological progress will look like.</li>
</ul>
<p> </p>
<h2>Technological progress matters</h2>
<p>The most obvious reason why technological progress matters is that it is the cause of the increase in human welfare after the industrial revolution, which, in moral terms at least, is the most important thing that's ever happened. <a href="http://lukemuehlhauser.com/three-wild-speculations-from-amateur-quantitative-macrohistory/">"Everything was awful for a long time, and then the industrial revolution happened"</a> isn't a bad summary of history. It's tempting to think that technology was just one factor working with many others, like changing politics and moral values, but there are strong cases to be made that a changed technological environment, and <a href="https://strataoftheworld.blogspot.com/2019/09/growth-and-civilisation.html">the economic growth it enabled</a>, were <a href="http://strataoftheworld.blogspot.com/2020/12/review-foragers-farmers-and-fossil-fuels.html">the reasons for political and moral changes in the industrial era</a>. Given this history, we should expect that more technological progress will be important for increasing human welfare in the future too (though not enough on its own – see below). This applies both to people in developed countries – we are not at <a href="https://nickbostrom.com/utopia.html">utopia</a> yet, after all – and to those in developing countries, who are already seeing vast benefits from information technology making development cheaper, and would especially benefit from decreases in the price of sustainable energy generation.</p>
<p>Then there are more subtle reasons to think that technological progress doesn't get the attention it deserves.</p>
<p>First, it works over long time horizons, so it is especially subject to all the kinds of short-termism that plague human decision-making.</p>
<p>Secondly, lost progress isn't visible: if the Internet hadn't been invented, very few would realise what they're missing out on, but try taking it away now and you might well spark a war. This means that stopping technological progress is politically cheap, because likely no one will realise the cost of what you've done.</p>
<p>Finally, making the right decisions about technology is going to decide whether or not the future is good. Debates about technology often become debates about whether we should be pessimistic or optimistic about the impacts of future technology. This is rarely a useful framing, because the only direct impact of technology is to let us make more changes to the world. Technology shouldn't be understood as a force automatically pulling the distribution of future outcomes in a good or bad direction, but as a force that <i>blows up the distribution</i> so that it spans all the way from an engineered super-pandemic that kills off humanity ten years from now to an interstellar civilisation of trillions of happy people that lasts until the stars burn down. Where on this distribution we end up depends in large part on the decisions we collectively make about technology. So, how about we get those decisions right?</p>
<p>But first, how should we even think about technological progress?</p>
<p> </p>
<h2>Modelling technological progress</h2>
<p>Some people think <a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html">that technological progress is stagnating relative to historical trends, and that, for example, we should have flying cars by now</a>. To assess this claim, we need some model of what technological progress should be like. I can think of three general ones.</p>
<p>The first one I'll name the Kurzweilian model, after futurist <a href="https://en.wikipedia.org/wiki/Ray_Kurzweil#The_Law_of_Accelerating_Returns">Ray Kurzweil</a>, who's made a big deal about how <a href="https://www.kurzweilai.net/the-law-of-accelerating-returns">the intuitive linear model of technological progress is wrong, and history instead shows technological progress is exponential</a> – the larger your technological base, the easier it is to invent new technologies, and hence a graph of anything tech-related should be a hockey-stick curve shooting into the sky.</p>
<p>The second I'll call the fruit tree model, after the metaphor that once the "low-hanging fruit" are picked off, progress gets harder. The strongest case for this model is in science; the physics discoveries you can make by watching apples fall down have (very likely) long since been picked off. However, it's not clear similar arguments should apply to technology. Perhaps we can model inventing a technology as finding a clever way to combine a number of already known parts into a new thing, and hence the number of possible inventions would be an increasing function of the number of things already invented, since this gives more combinations. For example, even if progress in pure aviation is slow, when we invent new things like lightweight computers we can combine the two to get drones. I haven't seen anyone propose a model to explain why the fruit tree model makes sense for technology in particular.</p>
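<p>As a minimal sketch of the combinatorial argument above: if we assume (and this is a big assumption) that every subset of at least two known parts counts as a candidate invention, the number of candidates grows very quickly with the number of known parts. The function name and the minimum size of two are my own choices for illustration; in reality only a small fraction of combinations would be sensible.</p>

```python
from math import comb


def candidate_combinations(n_parts, min_size=2):
    """Count subsets of n_parts known parts containing at least min_size parts."""
    return sum(comb(n_parts, k) for k in range(min_size, n_parts + 1))


# The count roughly doubles with every part added to the known set.
for n_parts in (5, 10, 20):
    print(n_parts, candidate_combinations(n_parts))
```

<p>With five known parts there are 26 such combinations; with twenty there are over a million.</p>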
<p>The third model is that technological change is mostly random. Any particular technological base satisfies the prerequisites for some set of inventions. Once invented, a new technology goes through an S-curve of increasing adoption and development, before reaching widespread adoption and a mature form. Sometimes there are many inventions just within reach, and you get an innovation burst, like the mid-20th century one when television, cars, passenger aircraft, nuclear weapons, birth control pills, and rocketry were all simultaneously going through the rapid improvement and adoption phase. Sometimes there are no plausible big inventions for very long periods of time, for example in medieval times.</p>
<p>Here's an Our World in Data graph (<a href="https://ourworldindata.org/grapher/technology-adoption-by-households-in-the-united-states?tab=chart&stackMode=absolute&country=Automobile~Cellular%20phone~Computer~Dryer~Electric%20power~Flush%20toilet~Household%20refrigerator~Microwave~Refrigerator~Washing%20machine&region=World">source and interactive version here</a>) showing more-or-less-S-curves for the adoption of a bunch of technologies:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiw4jET_MogqfSUhiSydZciuNICfelLAPZ8iGkmk4F1OT6Wg6czX-8iUDEF6cbweZ4-kGiZiIF6Fgf5Yjn6uP0fx63JrF_Subk6wHQL2RFoNQm-96HQnnYPlLIbkNK4fWbAUf0ulwk3aKkN/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1116" data-original-width="1754" height="407" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiw4jET_MogqfSUhiSydZciuNICfelLAPZ8iGkmk4F1OT6Wg6czX-8iUDEF6cbweZ4-kGiZiIF6Fgf5Yjn6uP0fx63JrF_Subk6wHQL2RFoNQm-96HQnnYPlLIbkNK4fWbAUf0ulwk3aKkN/" width="640" /></a></div><p></p>
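<p>The S-curve of adoption mentioned above is often approximated with a logistic function. Here is a minimal sketch; the midpoint year and growth rate are made-up illustrative parameters, not values fitted to the Our World in Data series.</p>

```python
import math


def adoption_share(year, midpoint_year, growth_rate, ceiling=1.0):
    """Logistic S-curve: share of households that have adopted a technology."""
    return ceiling / (1.0 + math.exp(-growth_rate * (year - midpoint_year)))


# Hypothetical technology reaching 50% adoption in 1950.
for year in range(1920, 1981, 10):
    share = adoption_share(year, midpoint_year=1950, growth_rate=0.2)
    print(year, f"{share:.1%}")
```

<p>Adoption starts slow, accelerates around the midpoint, and flattens out as it approaches the ceiling, which is the shape most of the curves in the graph roughly follow.</p>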
<p>(One can try to imagine an even more general model to unify the three models above, though we're getting to fairly extreme abstraction levels. Nevertheless, for the fun of it: let's model each technology as a set of prerequisite technologies, and assume there's a subset of technology-space that makes up the sensible technologies, and some cost function that describes how hard it is to go from a set of technologies to a given new technology (so infinity if not all prerequisites of the new one are contained in the known set). Then slow progress would be modelled as the set of sensible ideas and the cost function being such that from any particular set of known technologies, there are only a few sensible ideas with prerequisites only in the known set, and these have high costs. Fast progress is the opposite. In the Kurzweilian model, the subspace of sensible ideas is in some sense uniform, so that the fraction of the <script type="math/tex">2^{|K|}</script> possible prerequisite combinations for a known technology set <script type="math/tex">K</script> that are contained within the sensible set does not go down with the cardinality of <script type="math/tex">K</script>, and also we require the cost function to not increase too rapidly as the complexity of the technologies grows. In the fruit tree model, the cost function increases, and possibly the frequency of sensible technologies becomes sparser as you get into the more complex parts of technology-space. In the random model, the cost function has no trend, and a lot of the advancements happen when a "key technology" is discovered that is the last unknown prerequisite for a lot of sensible technologies in technology-space.)</p>
<p>(Question: has anyone drawn up a dependency tree of technologies across many industries (or even one large one), or some other database where each technology is linked to a set of prerequisites? That would be an incredible dataset to explore.)</p>
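<p>To show what such a dataset and cost function might look like, here is a toy sketch of the prerequisite model described above. Everything in it is hypothetical: the technologies, their prerequisites, and the cost numbers are invented for illustration.</p>

```python
import math

# Hypothetical toy data: each technology maps to its set of prerequisites.
PREREQUISITES = {
    "lightweight computer": {"transistor"},
    "drone": {"aircraft", "lightweight computer"},
    "jet airliner": {"aircraft", "jet engine"},
}

# Hypothetical effort needed once all prerequisites are known.
BASE_COST = {
    "lightweight computer": 3.0,
    "drone": 1.0,
    "jet airliner": 5.0,
}


def cost(known, new_tech):
    """Cost of inventing new_tech from the known set; infinite if a prerequisite is missing."""
    if not PREREQUISITES[new_tech] <= known:
        return math.inf
    return BASE_COST[new_tech]


def reachable(known):
    """Technologies not yet known whose prerequisites are all in the known set."""
    return sorted(
        tech for tech in PREREQUISITES
        if tech not in known and cost(known, tech) < math.inf
    )


known = {"aircraft", "transistor"}
print(reachable(known))                             # only the computer is within reach
print(reachable(known | {"lightweight computer"}))  # the computer unlocks the drone
```

<p>In this toy world the lightweight computer acts as a "key technology": it is the last missing prerequisite for the drone, so inventing it opens up a new part of technology-space.</p>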
<p>In <a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html"><i>Where is my Flying Car?</i></a>, J. Storrs Hall introduces his own abstraction of a civilisation's technology base that he calls the "technium": imagine some high-dimensional space representing possible technologies, and imagine a blob in this space representing existing technology. This blob expands as our technological base expands, but not uniformly: imagine some gradient in this space representing how hard it is to make progress in a given direction from a particular point, which you can visualise as a "terrain" which the technium has to move along as it expands. Some parts of the terrain are steep: for example, given technology that lets you make economical passenger airplanes moving at near the speed of sound, it takes a lot to progress beyond that because crossing the speed of sound is difficult. Hence the "aviation cliffs" in the image below; the technium is pressing against them, but progress will be slow:</p>
<p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDk1Xc42df1x0RUb8J-G_E5kqTw59eq0sfLNEGJ_sDHj3I6SxZUD47VICavhC2Z7jSMHvdEelU0Lt1VCrrqSWZryZaGLXcK_7V2DJEhmBovIl33aucck4NZXrAIS1NKWtUeRy1utNdQfP4/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1070" data-original-width="1906" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDk1Xc42df1x0RUb8J-G_E5kqTw59eq0sfLNEGJ_sDHj3I6SxZUD47VICavhC2Z7jSMHvdEelU0Lt1VCrrqSWZryZaGLXcK_7V2DJEhmBovIl33aucck4NZXrAIS1NKWtUeRy1utNdQfP4/w640-h360/technium1.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">(Image source: my own slides for an EA Cambridge talk.)</span><br /></td></tr></tbody></table><p></p>
<p>In other cases, there are valleys, where once the technium gets a toehold in one, progress is fast and the boundaries of what's possible gush forwards like a river breaking a dam. The best example is probably computing: figure out how to make transistors smaller and smaller, and suddenly a lot of possibilities open up.</p>
<p>We can visualise the three models above in terms of what we'd expect the terrain to look like as the technium expands further and further:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglfVo0nzx52T1B2RLLzR28m9PhF5F6wpN3zdKvMv2YsmHNAF0F03r3wS9OuUWlT5Eyci5h95v_E5dHjzh5U7j63q7SVl6azxzDocWruuVHAfgEukmG__sN_IOGCUXbK22HoVfOoVhRIYl7/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="810" data-original-width="1854" height="280" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglfVo0nzx52T1B2RLLzR28m9PhF5F6wpN3zdKvMv2YsmHNAF0F03r3wS9OuUWlT5Eyci5h95v_E5dHjzh5U7j63q7SVl6azxzDocWruuVHAfgEukmG__sN_IOGCUXbK22HoVfOoVhRIYl7/w640-h280/techniumterrain.png" width="640" /></a></div></div><p></p>
<p>(Or maybe a better model would be one where the gradient is always positive, with 0 gradient meaning effortless progress?)</p>
<p>In the Kurzweilian model, the terrain gets easier and easier the further out you go; in the fruit tree model it's the opposite; in the random model there is no pattern, so we should expect cliffs and valleys and everything in between, with no predictable trend.</p>
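<p>One way to make the three terrains concrete is to write each as a toy function giving the difficulty of the next step as the technium gets further out. The functional forms below are arbitrary choices of mine; the only thing taken from the models themselves is the direction of the trend (falling, rising, or none).</p>

```python
import random


def kurzweilian_difficulty(distance):
    """Terrain gets easier the further out the technium has expanded."""
    return 1.0 / (1.0 + distance)


def fruit_tree_difficulty(distance):
    """Terrain gets harder the further out the technium has expanded."""
    return 1.0 + distance


def random_difficulty(distance, rng):
    """Cliffs and valleys with no trend: distance is deliberately ignored."""
    return rng.uniform(0.0, 10.0)


rng = random.Random(0)  # fixed seed so the random terrain is reproducible
for distance in range(5):
    print(
        distance,
        round(kurzweilian_difficulty(distance), 2),
        round(fruit_tree_difficulty(distance), 2),
        round(random_difficulty(distance, rng), 2),
    )
```

<p>All three difficulties stay positive, in line with the suggestion that zero would mean effortless progress.</p>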
<p>Hall comes out in favour of what I've called the random model, even going as far as to speculate that the valleys might follow a <a href="https://en.wikipedia.org/wiki/Zipf%27s_law">Zipf's law</a> distribution. He concisely summarises the major valleys of the past and future:</p>
<blockquote><p><i>"The three main phases of technology that drove the Industrial Revolution were first low-pressure steam engines, then machine tools, and then high-pressure engines enabled by the precision that the machine tools made possible. High-pressure steam had the power-to-weight ratios that allowed for engines in vehicles, notably locomotives and steamships. The three major, interacting, and mutually accelerating technologies in the twenty-first century are likely to be nuclear, nanotech (biotech is the “low-pressure steam” of nanotech), and AI, coming together in a synergy I have taken to calling the Second Atomic Age."</i></p>
</blockquote>
<p>Personally, my views have shifted away from somewhat Kurzweilian ones and towards the random model, with the main factors being that the technological stagnation debate has made me less certain that the historical data fits a Kurzweilian trend, and that since there are no clear answers to whether there is a general pattern, it's sensible to shift the distribution of my beliefs towards the model that doesn't require assuming the truth of a general pattern. However, given some huge valleys that seem to be out there – AI is the obvious one, but also nanotechnology, which might bring physical technology to Moore's-law-like growth rates – it is possible that the difference between the Kurzweilian and random model looks largely academic in the next century.</p>
<p> </p>
<h2>Is technology stagnating?</h2>
<p>Now that we have some idea of how to think about technological progress, we are better placed to answer the question of whether it has stagnated: if the fruit tree model is true we should expect a slowdown, whereas if the extreme Kurzweilian model is true, a single trend line that's not going to break past the top of the figure in the next decade is a failure. Even so, this question is very confusing; economists debate total factor productivity (a debate I will stay out of), and in general it's hard to know what could have been.</p>
<p>However, it does seem true that compared to the mid-20th century, the post-1970 era has seen breakthroughs in fewer categories of innovation. Consider:</p>
<ul>
<li><p>1920-1970:</p>
<ul>
<li>cars</li>
<li>radio</li>
<li>television</li>
<li>antibiotics</li>
<li>the green revolution</li>
<li>nuclear power</li>
<li>passenger aviation</li>
<li>chemical space travel</li>
<li>effective birth control</li>
<li>radar</li>
<li>lasers</li>
</ul>
</li>
<li><p>1970-2020:</p>
<ul>
<li>personal computers</li>
<li>mobile phones</li>
<li>GPS</li>
<li>DNA sequencing</li>
<li>CRISPR</li>
<li>mRNA vaccines</li>
</ul>
</li>
</ul>
<p>Of course, it's hard to compare inventions and put them in categories – is lumping everything computing-related as largely the same thing really fair? – but <a href="https://rootsofprogress.org/technological-stagnation">some people are persuaded by such arguments</a>, and a general lack of big breakthroughs in big physical technologies does seem true. (Though this might soon change, since the clean energy, biotech, and space industries are making rapid progress.)</p>
<p>Why is this? If we accept the fruit tree model, there's nothing to be explained. If we accept the random one, we can explain it as a fluke of the shape of the idea space terrain that the technium is currently pressing into. To quote Hall again:</p>
<blockquote><p><i>"The default [explanation for technological stagnation] seems to have been that the technium has, since the 70s, been expanding across a barren high desert, except for the fertile valley of information technology. I began this investigation believing that to be a likely explanation."</i></p>
</blockquote>
<p>This, I think, is a pretty common view, and is a sensible null hypothesis in the absence of other evidence. We can also imagine variations, like the existence of a huge valley in the form of computing drawing all the talent that would otherwise have gone into pushing the technium forwards in other places. However, Hall rather dramatically concludes that this</p>
<blockquote><p><i>"[...] is wrong. As the technium expanded, we have passed many fertile Gardens of Eden, but there has always been an angel with a flaming sword guarding against our access in the name of some religion or social movement, or simply bureaucracies barring entry in the name of safety or, most insanely, not allowing people to make money."</i></p>
</blockquote>
<p>Is this ever actually the case? I think there is at least one case where a feasible (and economical, environmentally friendly, and health-improving) technology has been blocked: nuclear power, as I discuss <a href="http://strataoftheworld.blogspot.com/2021/03/nuclear-power-is-good.html">here</a>. We should therefore amend our model of the technium: not only does it have to contend with the cliffs inherent in the terrain, but sometimes someone comes along and builds a big fat wall on the border, preventing either development, deployment, or both.</p>
<p>In diagram form:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-8Z5_PHstJvBhsWAR1OZ4h38czqj9f0wvtKpXoAHWmO92s2CAOd1TyGRSiDwhuOe0DMgCfbQPO6sfivxBNKtFRCrjAjdlJ_gjO7Ps8i26lRx_zfbIQw5PwnA0C0HKmAzCUSfuoiZ-Egwd/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1102" data-original-width="1884" height="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-8Z5_PHstJvBhsWAR1OZ4h38czqj9f0wvtKpXoAHWmO92s2CAOd1TyGRSiDwhuOe0DMgCfbQPO6sfivxBNKtFRCrjAjdlJ_gjO7Ps8i26lRx_zfbIQw5PwnA0C0HKmAzCUSfuoiZ-Egwd/w640-h374/technium2.png" width="640" /></a></div><p></p>
<p>Are there other cases? Yes – GMOs, as I discuss in <a href="http://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">this review</a>. There have also been some harmful technologies that have been controlled; for example biological and chemical weapons of mass destruction are more-or-less kept under control by two treaties (the <a href="https://en.wikipedia.org/wiki/Biological_Weapons_Convention">Biological Weapons Convention</a> and the <a href="https://en.wikipedia.org/wiki/Chemical_Weapons_Convention">Chemical Weapons Convention</a>). However, such cases seem to be the exception, since the overall history is one of technology adoption steamrolling the luddites, from the literal <a href="https://en.wikipedia.org/wiki/Luddite">Luddites</a> to George W. Bush's attempts to <a href="https://en.wikipedia.org/wiki/Stem_cell_laws_and_policy_in_the_United_States#Timeline">limit stem cell research</a>.</p>
<p>There are also cases where we put a lot of effort into expanding the technium in a specific direction (German subsidies for solar power are one successful example). We might think of this as adding stairs to make it easier to climb a hill.</p>
<p>How much of the technium's progress (or lack thereof) is determined by the terrain's inherent shape, and how much by the walls and stairs that we slap onto it? I don't know. The examples above show that as a civilisation we sometimes do build important walls in the technium terrain, but arguments like those Hall presents in <i>Where is my Flying Car?</i> are not strong enough to make me update my beliefs to thinking that this is the main factor determining how the technium expands. If I had to make a very rough guess, I'd say that though there is variation based on area (e.g. nuclear and renewable energy have a lot of walls and stairs respectively; computing has neither), overall the inherent terrain has at least several times the effect size on the decadal timescale. The power balance seems heavily dependent on the timescale too – George W. Bush can hold back stem cells for a few years, but imagine the sort of measures it would have taken to delay steam engines for the past few hundred years.</p>
<p> </p>
<h2>How should we guide technological progress?</h2>
<p>How much should we try to guide technological progress?</p>
<p>A first step might be to look at how good we've been at it in the past, so that we get a reasonable baseline for likely future performance. Our track record is clearly mixed. On one hand, chemical and biological weapons of mass destruction have so far been largely kept under control, though under a rather shoestring system (Toby Ord likes to point out that <a href="https://www.bbc.com/future/article/20200923-the-hinge-of-history-long-termism-and-existential-risk">the Biological Weapons Convention has a smaller budget than an average McDonald's</a>), and subsidies have helped solar and wind to become mature technologies. On the other hand, there are <a href="https://en.wikipedia.org/wiki/List_of_states_with_nuclear_weapons#Statistics_and_force_configuration">over ten thousand nuclear weapons in the world</a> and they don't seem likely to go away anytime soon (in particular, while <a href="https://en.wikipedia.org/wiki/New_START">New START</a> was recently extended, Russia has a <a href="https://en.wikipedia.org/wiki/RS-28_Sarmat">new ICBM</a> coming into service this year and the US is probably going to go ahead with their <a href="https://en.wikipedia.org/wiki/Ground_Based_Strategic_Deterrent">next-generation ICBM project</a>, almost ensuring that ICBMs – the most strategically volatile nuclear weapons – continue existing for decades more). We've mostly stopped ourselves benefiting from safe and powerful technologies like nuclear power and GMOs for no good reason. 
More recently, we've failed to allow <a href="https://en.wikipedia.org/wiki/Human_challenge_study">human challenge trials</a> for covid vaccines, despite massive net benefits (vaccine safety could be confirmed months faster, and the risk to healthy participants is lower than <a href="https://www.bls.gov/charts/census-of-fatal-occupational-injuries/civilian-occupations-with-high-fatal-work-injury-rates.htm">a year at some jobs</a>), <a href="https://www.1daysooner.org/">an army of volunteers</a>, and <a href="https://pubmed.ncbi.nlm.nih.gov/33334616/">broad public support</a>.</p>
<p>Imagine your friend was really into picking stocks, and sure, they once bought some AAPL, but often they've managed to pick the Enrons and Lehman Brothers of the world. Would your advice to them be more like "stay actively involved in trading" or "you're better off investing in an index fund and not making stock-picking decisions"?</p>
<p>Would things be better if we had tried to steer technology less? We'd probably be saving money and the environment (and <a href="https://en.wikipedia.org/wiki/Golden_rice">third-world children</a>) by eating far more genetically engineered food, and air pollution would've claimed <a href="https://www.mdpi.com/1996-1073/10/12/2169/htm">millions fewer lives</a> because nuclear power would've done more to displace coal. Then again, we'd probably have significantly less solar power. (Also, depending on what counts as steering technology rather than just reacting to its misuses, we might include the eventual bans on lead in gasoline, DDT, and chlorofluorocarbons as major wins.) And maybe without the Biological Weapons Convention becoming effective in 1975, the Cold War arms race would've escalated to developing even more bioweapons than the <a href="https://en.wikipedia.org/wiki/Soviet_biological_weapons_program">Soviets already did</a> (for more depth, read <a href="https://www.amazon.com/Dead-Hand-Untold-Dangerous-Legacy/dp/0307387844">this</a>), and an accidental leak might've released a civilisation-ending super-anthrax.</p>
<p>So though we haven't been particularly good at it so far, can we survive without steering technological progress in the future? I made the point above that technology increases the variance of future outcomes, and that very much includes variance in the negative direction. Maybe <a href="https://en.wikipedia.org/wiki/Boost-glide">hypersonic glide vehicles</a> make the nuclear arms race more unstable and eventually result in war. Maybe technology lets Xi Jinping achieve his dream of permanent dictatorship, and this model turns out to be easily exportable and usable by authoritarians in every country. Maybe we don't solve the AI alignment problem before someone goes ahead and builds a powerful AI, and the result is straight from Nick Bostrom's nightmares. And what exactly is the stable equilibrium in a world where a 150€ device that Amazon will drone-deliver to anyone in the world within 24 hours can take a genome and print out bacteria and viruses that have it?</p>
<p>This fragility is highlighted in a <a href="https://www.nickbostrom.com/existential/risks.html">2002 paper by Nick Bostrom</a>, who shares the view that the technium can't be reliably held back, at least to the extent that some dangerous technologies might require:</p>
<blockquote><p><i>"If a feasible technology has large commercial potential, it is probably impossible to prevent it from being developed. At least in today’s world, with lots of autonomous powers and relatively limited surveillance, and at least with technologies that do not rely on rare materials or large manufacturing plants, it would be exceedingly difficult to make a ban 100% watertight. For some technologies (say, ozone-destroying chemicals), imperfectly enforceable regulation may be all we need. But with other technologies, such as destructive nanobots that self-replicate in the natural environment, even a single breach could be terminal."</i></p>
</blockquote>
<p>The solution is what he calls differential development:</p>
<blockquote><p><i>"[We can affect] the rate of development of various technologies and potentially the sequence in which feasible technologies are developed and implemented. Our focus should be on what I want to call differential technological development: trying to retard the implementation of dangerous technologies and accelerate implementation of beneficial technologies, especially those that ameliorate the hazards posed by other technologies." [Emphasis in original]</i></p>
</blockquote>
<p>(See <a href="https://forum.effectivealtruism.org/posts/XCwNigouP88qhhei2/differential-progress-intellectual-progress-technological">here</a> for more elaboration on this concept and variations.)</p>
<p>For example:</p>
<blockquote><p><i>"In the case of nanotechnology, the desirable sequence would be that defense systems are deployed before offensive capabilities become available to many independent powers; for once a secret or a technology is shared by many, it becomes extremely hard to prevent further proliferation. In the case of biotechnology, we should seek to promote research into vaccines, anti-bacterial and anti-viral drugs, protective gear, sensors and diagnostics, and to delay as much as possible the development (and proliferation) of biological warfare agents and their vectors. Developments that advance offense and defense equally are neutral from a security perspective, unless done by countries we identify as responsible, in which case they are advantageous to the extent that they increase our technological superiority over our potential enemies. Such “neutral” developments can also be helpful in reducing the threat from natural hazards and they may of course also have benefits that are not directly related to global security."</i></p>
</blockquote>
<p>One point to emphasise is that the dangerous technology probably can't be held back indefinitely. One day, if humanity continues advancing (as it should), it will be easy to create deadly diseases, build self-replicating nanobots, or spin up a superintelligent computer program in the way that you'd spin up a Heroku server today. The only thing that will save us is if the defensive technology (and infrastructure, and institutions) is in place by then. In <i>The Diamond Age</i>, Neal Stephenson imagines a future where there are defensive nanobots in the air and inside people that are constantly on patrol against hostile nanobots. I can't help but think that this is where we're heading. (It's also the strategy our bodies have already adopted to fight off organic nanobots like viruses.)</p>
<p>This is not how we've done technology harm mitigation in the past. Guns are kept in check through regulation, not by everyone wearing body armour. Sufficiently tight rules on, say, what gene sequences you can put into viruses or what you can order your nanotech universal fabricator to produce will almost certainly be part of the solution and go a long way on their own. However, a gun can't spin out of control and end humanity; an engineered virus or self-replicating nanobot might. And as we've seen, our ability to regulate technology isn't perfect, so maybe we should have a backup plan.</p>
<p>The overall picture therefore seems to be that our civilisation's track record at tech regulation is far from perfect, but the future of humanity may soon depend on it. Given this, perhaps it's better that we err on the side of too much regulation – not because it's probably going to be beneficial, but because it's a useful training ground to build up the institutional competence we're going to need to tackle the actually difficult tech choices that are heading our way. Better to mess up regulating Facebook and – critically – learn from it, than to make the wrong choices about AI.</p>
<p>It won't be easy to make the leap from a civilisation that isn't building much nuclear power despite being in the middle of a climate crisis to one that can reliably ensure we survive even when everyone and their dog plays with nanobots. However, an increase in humanity's collective competence at making complex choices about technology is something we desperately need.</p><p><br /></p><p style="text-align: center;"><b>RELATED:</b></p><p></p><ul style="text-align: left;"><li style="text-align: left;"><a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html">Review: Where is my Flying Car?</a></li><li><a href="http://strataoftheworld.blogspot.com/2021/03/nuclear-power-is-good.html">Nuclear power is good</a></li><li><a href="https://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">Review: Seeds of Science</a> – GMOs are also good</li></ul><p></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-57949953287693226422021-03-21T16:52:00.005+00:002021-04-25T22:19:18.152+01:00Review: Where is my Flying Car?<p style="text-align: center;"><span style="font-size: x-small;"> Book: <i>Where is my Flying Car?: A Memoir of Future Past</i>, by J. Storrs Hall (2018)<br />Words: 9.3k (about 31 minutes)</span></p><p style="text-align: center;"><br /></p><p>In the 50s and 60s, predictions of the future were filled with big physical technical marvels: spaceships, futuristic cities, and, most symbolically, flying cars. The lack of flying cars has become a cliche, whether as a point about the unpredictability of future technological progress, or a joke about hopeless techno-optimism.</p>
<p>For J. Storrs Hall, flying cars are not a joke. They are a feasible technology, as demonstrated by many historical prototypes that are surprisingly close to futurists' dreams, and practical too: likely to be more expensive than cars, yes, but providing many times more value to owners.</p>
<p>So, where are they?</p>
<p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiM9WyZrHpUK-ouCZhnC0RLn79_Fd86OOsDLNY25ezyu9kVoDRw3YZ841dDcorYD2_qkzgUZGtosx9UJzuZ7MsamlfftaPRqQDfkqVJceftNXXtxWuEaEiZi6a91yoztbd6h74MbGMTebYe/" style="margin-left: auto; margin-right: auto;"><img data-original-height="1012" data-original-width="1310" height="309" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiM9WyZrHpUK-ouCZhnC0RLn79_Fd86OOsDLNY25ezyu9kVoDRw3YZ841dDcorYD2_qkzgUZGtosx9UJzuZ7MsamlfftaPRqQDfkqVJceftNXXtxWuEaEiZi6a91yoztbd6h74MbGMTebYe/w400-h309/flyingcar.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Above: not a joke. <i>(Public domain, <a href="https://commons.wikimedia.org/wiki/File:ConvairCar_Model_118.jpg">original here</a>)</i></td><td class="tr-caption" style="text-align: center;"><br /></td></tr></tbody></table><p></p>
<p>The central motivating force behind <i>Where is my Flying Car?</i> is the disconnect between what is physically possible with modern science, and what our society is actually achieving. The immediate objection to such points is to say: "well, of course some engineer can imagine a world where all this fancy technology is somehow economically feasible and widespread, but in the real world everything is more complicated, and once you take these complications into account there's no surprising failure".</p>
<p>Hall's objection is that everything was going fine until 1970 or so.</p>
<p>Many people complain that technological progress has slowed. Flying cars, of course, but also: airliner cruising speeds have stagnated, the space age went on hiatus, cities are still single-level flat designs with traffic, nuclear power stopped replacing fossil fuels, and nanotechnology (in the long run, the most important technology for building anything) is growing slowly. <a href="https://www.newyorker.com/magazine/2011/11/28/no-death-no-taxes">Peter Thiel</a> sums this up by saying "we wanted flying cars, instead we got 140 characters".</p>
<p>It's not just technology. There's an <a href="https://wtfhappenedin1971.com/">entire website devoted to throwing graphs at you about trends that changed around 1970</a> (and selling you Bitcoin on the side), and, while a bunch of it is <a href="https://tylervigen.com/spurious-correlations">Spurious Correlations material</a>, they include enough important things, like a stagnation in median wages, that it's worth thinking about.</p>
<p>Perhaps the most fundamental indicator is that the energy available per person in the United States was increasing exponentially (a trend Hall names the Henry Adams curve), until, starting around 1970, it just wasn't:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi43p6itzbx3tDc2cVK8S4C3IwKJlpylMcaRB_CrjOIj8tfVms8qnC8IcdoBTeEVLaExmes7JSJHKSPeMiV8moxfI7OTmW4gSrvtHCC96OyK98QOqnBT7HIwL3oAg4L2v_RycyGARZ_6izo/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="650" data-original-width="1522" height="274" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi43p6itzbx3tDc2cVK8S4C3IwKJlpylMcaRB_CrjOIj8tfVms8qnC8IcdoBTeEVLaExmes7JSJHKSPeMiV8moxfI7OTmW4gSrvtHCC96OyK98QOqnBT7HIwL3oAg4L2v_RycyGARZ_6izo/w640-h274/adamscurve.png" width="640" /></a></div><br /><p></p>Is this just because the United States is an outlier in energy use statistics? No; other developed countries have plateaued too, with the exception of Iceland and Singapore:
<p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisoU0guGeoHotakQOxTrBsKrjayqeMrl7m14kVMvhMLHSetDlm7PPEImn7UEf-YEicWCyciCfG71Pzm5ULFuwgoq1D0tGI2-zksJCRWGsSbJ4H_xdRhzKYezRKDwoBYZS2dlY_FUHckQbW/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1096" data-original-width="1748" height="402" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisoU0guGeoHotakQOxTrBsKrjayqeMrl7m14kVMvhMLHSetDlm7PPEImn7UEf-YEicWCyciCfG71Pzm5ULFuwgoq1D0tGI2-zksJCRWGsSbJ4H_xdRhzKYezRKDwoBYZS2dlY_FUHckQbW/w640-h402/energycapita.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p>(Source: <a href="https://ourworldindata.org/">Our World in Data</a>, one of the best websites on the internet. You can play around with an interactive version of this chart <a href="https://ourworldindata.org/grapher/per-capita-energy-use?tab=chart&time=earliest..latest&country=DEU~JPN~SGP~SWE~TWN~GBR~USA~ISL&region=World">here</a>.)</p></td></tr></tbody></table><p></p><div class="separator" style="clear: both; text-align: center;"></div> <p></p><p>Hall tries to estimate what percentage of future predictions in some technical area have come true as a function of the energy intensity of the technology, and finds a strong inverse correlation: in less energy intensive areas (e.g. mobile phones) we've over-achieved relative to futurists' predictions, while the opposite is true with energy intensive big machines (e.g. flying cars). (This is necessarily very subjective, but Hall at least says he did not change any of his estimates after seeing the graph.)</p>
<p>Of course, we have to contrast the stagnation in some areas with enormous advancements during the same time. The most obvious example is computing, something that futurists generally missed. In biotechnology, the price of DNA sequencing has dropped exponentially and in just the past few years we've gotten powerful tools like CRISPR and mRNA vaccines. Meanwhile the average person is now twice as rich as in 1970, and life expectancy has increased by 15 years (and the numbers are not much lower if we restrict our attention just to developed countries).</p>
<p>Perhaps we should be content; maybe Peter Thiel should stop complaining now that we have <a href="https://www.bbc.com/news/technology-41900880">280 characters</a>? After all, the problem is not that things are failing, but that they <i>might</i> be improving slower than they could be. That hardly seems like the end of the world. So why should we focus on technological progress? Has it really slowed? And how can we model it? <a href="https://strataoftheworld.blogspot.com/2021/03/technological-progress.html">I discuss these questions in another post</a>. In this post, however, I will move straight onto Hall's favourite topic.</p>
<p> </p>
<h2>Cool technology</h2>
<h3>Flying cars</h3>
<p>You might assume the case for flying cars looks something like this:</p>
<ol>
<li>You get to places very fast.</li>
<li>Very cool.</li>
</ol>
<p>However, there's a deeper case to be made for flying cars (or rapid transportation in general), and it starts with the observation that barefoot-walkers in Zambia tend to spend an hour or so a day travelling. Why is this interesting? Because this is the same as the average duration in the United States (of course Hall's other example is the US) or any other society.</p>
<p>Flying cars aren't about the speed – they're about the distance that this speed allows, given universal human preferences for daily travel duration. Cars on the road do about 60 km/h on average for any trip ("you might think that you could do better for a long trip where you can get on the highway and go a long way fast", Hall writes, but "the big highways, on the average, take you out of your way by an amount that is proportional to the distance you are trying to go"). A flying car that goes five times faster lets you travel within twenty-five times the area, potentially opening up a lot of choice.</p>
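<p>The speed-to-area scaling can be sketched in a few lines. This is a minimal illustration and not Hall's own calculation: it assumes straight-line travel in every direction and a half-hour one-way budget (half of the one-hour daily figure), and the function name is hypothetical.</p>

```python
import math

def reachable_area_km2(speed_kmh: float, one_way_hours: float) -> float:
    """Area of the circle reachable within a one-way travel-time budget."""
    radius_km = speed_kmh * one_way_hours
    return math.pi * radius_km ** 2

CAR_SPEED_KMH = 60.0   # average door-to-door car speed quoted in the text
ONE_WAY_HOURS = 0.5    # assumption: half of a one-hour daily travel budget

car_area = reachable_area_km2(CAR_SPEED_KMH, ONE_WAY_HOURS)
flying_area = reachable_area_km2(5 * CAR_SPEED_KMH, ONE_WAY_HOURS)
area_ratio = flying_area / car_area

print(f"car: {car_area:.0f} km^2, flying car: {flying_area:.0f} km^2, "
      f"ratio: {area_ratio:.0f}x")
```

<p>Because the radius grows linearly with speed and the area with the square of the radius, the ratio is 25 whatever time budget is assumed.</p>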
<p>Hall goes through some calculations about the utilities of different time-to-travel versus distance functions, given empirical results from travel theory, to produce this chart (which I've edited to improve the image quality and convert units) as a summary:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn8ju4XHioVUMH8u_EXx7ENKypho6qYq0EAex8mvDcf8xlmjHom3u-TR91wgvK3PEUAw3sB-2y4Y7_rMMwuKTV-aXSY-Bz8pMQyccY15ThLtn7yot4J-4JaQcFGT6qWaegPL6AXoGmrLl0/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1388" data-original-width="2002" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn8ju4XHioVUMH8u_EXx7ENKypho6qYq0EAex8mvDcf8xlmjHom3u-TR91wgvK3PEUAw3sB-2y4Y7_rMMwuKTV-aXSY-Bz8pMQyccY15ThLtn7yot4J-4JaQcFGT6qWaegPL6AXoGmrLl0/w400-h278/valueofvehicle.png" width="400" /></a></div><p></p>
<p>(The overhead time means how long it takes to transition into flying mode, for example if you have to attach wings to it, or drive to an airport to take off.)</p>
<p>Even a fairly lame flying car would easily be three times more valuable than a regular car, mainly by giving you more choice and therefore letting you visit places that you like more.</p>
<p>In terms of what a flying car would actually look like, you have several options. Helicopters are obvious, but they are about ten times the price of cars, mechanically complex (and with very tight manufacturing tolerances), and limited by aerodynamics to a speed of 250 km/h or so (the advancing blade approaches the sound barrier, while the retreating blade moves so slowly relative to the air that it generates too little lift).</p>
<p>Historically, many promising flying car designs that actually flew were <a href="https://en.wikipedia.org/wiki/Autogyro">autogyros</a>, which generate thrust with a propeller but lift through an unpowered freely-rotating helicopter-like rotor. They generally can't take off vertically, but can land in a very small space.</p>
<p>Another design is a VTOL (vertical take-off and landing) aircraft. Some have been built and used as fighter jets, but they've gained limited use because they're slower and less manoeuvrable than conventional fighters and have less room for weapons. However, Hall notes that one experimental VTOL aircraft in particular – the <a href="https://en.wikipedia.org/wiki/Ryan_XV-5_Vertifan">XV-5</a> – would "have made one hell of a sports car" and its performance characteristics are recognisable as those of a hypothetical utopian flying car. It flew in 1964, but was cancelled because the Air Force wanted something as fast and manoeuvrable as a fighter jet, rather than "one hell of a sports car".</p>
<p>Of current flying car startups, Hall mentions <a href="https://en.wikipedia.org/wiki/Terrafugia">Terrafugia</a> and <a href="https://en.wikipedia.org/wiki/AeroMobil_s.r.o._AeroMobil">AeroMobil</a>, which produce traditional gasoline-powered vehicles (both with fuel economies comparable in litres/km to ordinary cars). There's also <a href="https://en.wikipedia.org/wiki/Volocopter">Volocopter</a> and <a href="https://en.wikipedia.org/wiki/EHang">EHang</a>, both of which produce electric vehicles with constrained ranges.</p>
<p>Hall divides the roadblocks (or should I say <a href="https://en.wikipedia.org/wiki/NOTAM">NOTAMs</a>?) for flying cars into four categories.</p>
<p>The first is that flying is harder than driving. To test this idea, Hall learned to fly a plane, and concluded that it is considerably harder, but not insurmountably so. Besides, we're not far from self-driving; commercial passenger flights are close to self-piloting already, the existing Volocopter is only "optionally piloted", and the EHang 184 flies itself.</p>
<p>The second is technological. The main challenges here are flying low and slow without stalling (you want to be able to land in small places, at least in emergencies), and reducing noise to manageable levels.</p>
<p>The third is economic. Even though the technology theoretically exists, it may be that we're not yet at a stage where personal flying machines are economically feasible. To some extent this is true; Hall admits that even on the pre-1970 trends in private aircraft ownership, the US private aircraft market would only be something like 30 000 - 40 000 per year (compared to the 2 000 or so that it currently is), about a hundredth of the number of cars sold. The economics means we should expect that the adoption curve is shallow, but not that it's necessarily non-existent.</p>
<p>The final reason is simple: even if you could make a flying car, you wouldn't be allowed to. Everything in aviation is heavily regulated, pushing up costs in a way that, Hall says, leads private pilots to joke about "hundred-dollar burgers". Of course, flying is hard, so you want standards high enough that at the very least you don't have to dodge other people's home-made flying motorbikes as they rain down from the sky, but in Hall's opinion the current balance is wrong.</p>
<p>And it's not just that the balance is wrong, but that the regulations are messed up. For example, making aircraft in the light sports aircraft category would be a great way to experiment with electric flight, but the FAA forbids them from being powered by anything other than a single internal combustion piston engine.</p>
<p>In particular, the FAA "has a deep allergy to people making money with flying machines". If you own a two-seat private aircraft, you can't charge a passenger you take on a flight more than half of the fuel cost, so no air Uber. Until the FAA stopped dragging its feet on <a href="https://en.wikipedia.org/wiki/Unmanned_aerial_vehicle#Commercial_use">drone regulation</a> in 2016, drones were operated under model aircraft rules, and therefore could not be used for anything other than hobby or recreational purposes. Similar rules still apply to ultralights, with one suspicious exception: a candidate for a federal, state, or local election is allowed to pay for a flight.</p>
<p>(And of course, to all these rules it's usually possible to apply for a waiver – so if you're a big company with an army of lawyers, do what you want, but if you're two people in a garage, good luck.)</p>
<p>There's no clear smoking gun of one piece of regulation specifically causing significant harm to flying car innovation. However, the harms of regulation are often a death-by-a-thousand-cuts situation, where a million rules each clip away at what is permissible and each add a small cost. Hall's conclusion is harsh: "It’s clear that if we had had the same planners and regulators in 1910 that we have now, we would never have gotten the family car at all."</p>
<p>One particular effect of flying cars would be to weaken the pull of cities, another topic to which Hall brings a lot of opinions.</p>
<h3>City design</h3>
<blockquote><p><i>"Designing a city whose transportation infrastructure consists of the flat ground between the boxes is insane."</i></p>
</blockquote>
<p>This is true. Most traffic problems would go away if you could add enough levels. However, "[e]ven the recent flurry of Utopia-building projects are still basically rows of boxes sitting on the dirt plus built-in wifi so the self-driving cars can talk to each other as they sit in automated traffic jams".</p>
<p>As usual, Hall spies some sinister human factors lurking behind the scenes, delaying his visions of techno-utopia:</p>
<blockquote><p><i>"There is a perverse incentive for bureaucrats and politicians to force people to interact as much as possible, and indeed to interact in contention, as that increases the opportunities for control and the granting of favors and privileges. This is probably one of the major reasons that our cities have remained flat, one-level no-man’s-lands where pedestrians (and beggars and muggers) and traffic at all scales are forced to compete for the same scarce space in the public sphere, while in the private sphere marvels of engineering have leapt a thousand feet into the sky, providing calm, safe, comfortable environments with free vertical transportation."</i></p>
</blockquote>
<p>This is an interesting idea, and I've <a href="https://www.elephantinthebrain.com/">read enough Robin Hanson</a> to not discount such perverse explanations immediately, but once again I'm not convinced how important this factor is, and Hall, as usual, is happy to paint only in broad strokes.</p>
<p>However, he makes a clearly strong point here:</p>
<blockquote><p><i>"Densification proponents often point to an apparent paradox: removing a highway which crosses a community often does not increase traffic on the remaining streets, as the kind of hydraulic flow models used by traffic planners had assumed that it would. On the average, when a road is closed, 20% of the traffic it had handled simply vanishes. Traffic is assumed to be a bad thing, so closing (or restricting) roads is seen as beneficial. Well duh. If you closed all the roads, traffic would go to zero. If you cut off everybody’s right foot and forced them to use crutches, you’d get a lot less pedestrian traffic, too."</i></p>
</blockquote>
<p>Hall takes a liberal principle of being strongly in favour of giving people choice, arguing that the goal of city design and transportation infrastructure should be to maximise how far people can travel quickly, rather than trying to ensure that they don't need to travel anywhere other than the set of choices the all-seeing, all-knowing urban designer saw fit to place nearby. Of course, once again flying cars are the best:</p>
<blockquote><p><i>"The average American commute to work, one way by car, ranges from 20 minutes to half an hour (the longer times in denser areas). This gives you a working radius of about 15 miles [= 24 km], or [1800 square kilometres] around home to find a workplace (or around work to find a home). With a fast VTOL flying car, you get a [240-kilometre] radius or [180 thousand square kilometres] of commutable area. Cars, trucks, and highways were clearly one of the major causes of the postwar boom. It isn’t perhaps realized just how much the war on cars contributed to the great stagnation—or how much flying cars could have helped prolong the boom."</i></p>
</blockquote>
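<p>The figures in this quote can be checked, and they also reveal how fast the hypothetical VTOL has to be. The sketch below assumes the radius is a straight-line distance covered in the quoted 20 to 30 minutes; the helper names are illustrative and not from the book.</p>

```python
import math

KM_PER_MILE = 1.609344  # exact length of the international mile in km

def circle_area_km2(radius_km: float) -> float:
    """Area of a circle with the given radius."""
    return math.pi * radius_km ** 2

def implied_speed_kmh(radius_km: float, minutes: float) -> float:
    """Average speed needed to cover the radius in the given time."""
    return radius_km / (minutes / 60.0)

car_radius_km = 15 * KM_PER_MILE   # the quoted 15-mile working radius
vtol_radius_km = 240.0             # the quoted flying-car radius

car_area_km2 = circle_area_km2(car_radius_km)
vtol_area_km2 = circle_area_km2(vtol_radius_km)

for label, radius in (("car", car_radius_km), ("VTOL", vtol_radius_km)):
    slow = implied_speed_kmh(radius, 30)  # half-hour commute
    fast = implied_speed_kmh(radius, 20)  # 20-minute commute
    print(f"{label}: radius {radius:.0f} km, area {circle_area_km2(radius):.0f} km^2, "
          f"implied average speed {slow:.0f}-{fast:.0f} km/h")
```

<p>Under these assumptions the areas match the bracketed conversions, and the 240 km radius implies a door-to-door average of roughly 480 to 720 km/h, about ten times the car's implied 48 to 72 km/h.</p>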
<h3>Nuclear power</h3>
<p>I discuss nuclear power at length in <a href="http://strataoftheworld.blogspot.com/2021/03/nuclear-power-is-good.html">another post</a>.</p>
<h3>Space travel?</h3>
<p>What about the classic example of supposedly stalled innovation – we were on the moon in 1969, and won't return until <a href="https://en.wikipedia.org/wiki/Artemis_program">at least 2024</a>?</p>
<blockquote><p><i>"With space travel, there’s a pretty straightforward answer: the Apollo project was a political stunt, albeit a grand and uplifting one; there was no compelling reason to continue going to the moon given the cost of doing so."</i></p>
</blockquote>
<p>The general curve of space progress seems to be over-achievement relative to technological trends in the 60s, followed by stagnation, not because the technology is impossible – we did go to the moon after all – but because it just wasn't economical. Only now, with private space companies like SpaceX and Rocket Lab actually making a business out of taking things to space outside the realm of <a href="https://aozerov.com/research/lvmarket.pdf">cosy cost-plus government contracts</a>, is innovation starting to pick up again.</p>
<p>(In the past ten years, we've seen the first commercial crewed spacecraft, reuse of rocket stages, the first methane-fuelled rocket engine ever flown, the first full-flow staged-combustion rocket engine ever flown, and the first liquid-fuelled air-launched orbital rocket, just to pick some examples.)</p>
<p>Hall has some further comments about space. First, in this passage he shows an almost-religious deference to trend lines:</p>
<blockquote><p><i>"As you can see from the airliner cruising speed trend curve, we shouldn’t have expected to have commercial passenger space travel yet, even if the Great Stagnation hadn’t happened."</i></p>
</blockquote>
<p>I don't think it makes sense to take a trend line for atmospheric flight speeds and use that to estimate when we should have passenger space travel; the physics is completely different, and in particular speeds are very constrained in orbit (you need to go 8 km/s to stay in orbit, and you can't go faster around the Earth without constant thrusting to stop yourself from flying off – something Hall clearly understands, as he explains it more than once).</p>
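<p>The 8 km/s figure follows from the circular-orbit condition, where gravity supplies exactly the centripetal acceleration, giving v = sqrt(GM / r). A minimal sketch, assuming a spherical Earth, standard constants, and no atmospheric drag:</p>

```python
import math

GM_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2
R_EARTH = 6.371e6          # mean Earth radius, m

def circular_orbit_speed(altitude_m: float) -> float:
    """Speed at which gravity exactly provides the centripetal acceleration."""
    return math.sqrt(GM_EARTH / (R_EARTH + altitude_m))

def orbital_period_minutes(altitude_m: float) -> float:
    """Time for one lap of a circular orbit at the given altitude."""
    r = R_EARTH + altitude_m
    return 2 * math.pi * r / circular_orbit_speed(altitude_m) / 60.0

v_surface = circular_orbit_speed(0.0)    # idealised orbit skimming the surface
v_400km = circular_orbit_speed(400e3)    # roughly space-station altitude
period_400km = orbital_period_minutes(400e3)

print(f"surface: {v_surface / 1000:.2f} km/s, 400 km: {v_400km / 1000:.2f} km/s, "
      f"period at 400 km: {period_400km:.0f} min")
```

<p>The speed is set by the altitude rather than by engine power, which is why an airliner-speed trend line does not extrapolate into orbit: going faster at the same altitude means leaving the circular orbit unless you keep thrusting inwards.</p>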
<p>Secondly, he is of course in favour of everything high-energy and nuclear.</p>
<p>For example: <a href="https://en.wikipedia.org/wiki/Project_Orion_(nuclear_propulsion)">Project Orion</a> was an American plan for a spacecraft powered (potentially from the ground up, rather than just in space) by throwing nuclear bombs out the back and riding the plasma from the explosions. This is a good contender for the stupidest-sounding idea that actually makes for a solid engineering plan; it's a surprisingly feasible way of getting sci-fi performance characteristics from your spacecraft. Other feasible methods have either far lower thrust (like ion engines, meaning that you can't use them to take off or land), or have far lower exhaust velocity (which means much more of your spacecraft needs to be fuel). The obvious argument against Orion, at least for atmospheric launch, is the fallout, but Hall points out it's actually not <i>that</i> bad – the number of additional expected cancer deaths from radiation per launch is "only" in the single digits, and that's under a very conservative linear no-threshold model of radiation dangers, which is likely wrong. (The actual reasons for cancellation weren't related to radiation risks, but instead the prioritisation of Apollo, the <a href="https://en.wikipedia.org/wiki/Partial_Nuclear_Test_Ban_Treaty">Partial Test Ban Treaty of 1963</a> that banned atmospheric nuclear tests, and the fact that no one in the US government had a particularly pressing need to put a thousand tons into orbit.) Hall also mentions an interesting fact about Orion that I hadn't seen before: "the total atmospheric contamination for a launch was roughly the same no matter what size the ship; so that there would be an impetus toward larger ones" – perhaps Orion would have driven mass space launch.</p>
<p>A more controlled alternative to bombing yourself through space is to use a nuclear reactor to heat up propellant in order to expel it out the back of your rocket at high speeds, pushing you forwards. The main limit with these designs is that you can't turn the heat up too much without your reactor blowing up. Hall's favoured solution is a direct fission-to-jet process, where the products of your nuclear reaction go straight out the engine without all this intermediate fussing around with heating the propellant. A reaction that converts a proton and a lithium-7 atom into 2 helium nuclei would give an exhaust velocity of 20 Mm/s (7% of the speed of light), which is insane.</p>
<p>To give some perspective: let's say your design parameters are that you have a 10-ton spacecraft, of which 1 ton can be fuel. With chemical rocket technology, this gives you a little toy with a total ∆V of some 400 m/s, meaning that if you light it up and let it run horizontally along a frictionless train track, it'll break the sound barrier by the time it's out of fuel, but it can't take you from an Earth-to-moon-intercept trajectory to a low lunar orbit even with the best possible trajectories. With the proton + lithium-7 process Hall describes, your 10% fuel, 10-ton spaceship can accelerate at 1G for two days. If you want to go to Mars, instead of this whole modern business of waiting for the orbital alignment that comes once every 26 months and then doing a 9-month trip along the lowest-energy orbit possible, you can almost literally point your spaceship at Mars, accelerate yourself to a speed of 1 000 km/s over a day (for comparison, the speeds of the inner planets in their orbits are in the tens of kilometres per second range), coast for maybe a day at most, and then decelerate for another day. For most of the trip you get free artificial gravity because your engine is pushing you so hard. This would be technology so powerful even Hall feels compelled to tack on a safety note: "watch out where you point that exhaust jet".</p>
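<p>These numbers can be sanity-checked with the ideal rocket equation. The sketch below is my own back-of-the-envelope check, not Hall's calculation: the 4 km/s chemical exhaust velocity is an assumed round figure for a good chemical engine, the function names are made up for illustration, and gravity and drag losses are ignored.</p>

```python
import math

G0 = 9.81  # standard gravity in m/s^2


def delta_v(exhaust_velocity, wet_mass, dry_mass):
    """Ideal (Tsiolkovsky) delta-v in m/s, ignoring gravity and drag losses."""
    return exhaust_velocity * math.log(wet_mass / dry_mass)


def burn_time_at_1g(exhaust_velocity, wet_mass, dry_mass):
    """Seconds of constant 1 g acceleration that the delta-v budget allows."""
    return delta_v(exhaust_velocity, wet_mass, dry_mass) / G0


# 10-ton ship, 1 ton of which is fuel (masses in kg)
WET, DRY = 10_000.0, 9_000.0
CHEMICAL_VE = 4_000.0   # assumption: roughly a good chemical engine, in m/s
FISSION_JET_VE = 20e6   # the 20 Mm/s exhaust velocity quoted above, in m/s

chem = delta_v(CHEMICAL_VE, WET, DRY)
days = burn_time_at_1g(FISSION_JET_VE, WET, DRY) / 86_400

print(f"chemical delta-v: {chem:.0f} m/s")
print(f"fission jet: {days:.1f} days at 1 g")
```

<p>Under these assumptions the chemical ship gets a bit over 400 m/s and the fission-jet ship a bit over two days of 1G thrust, in line with the figures above.</p>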
<h3>Nanotechnology!</h3>
<p>Imagine if machine pieces could not be made on a scale smaller than a kilometre. Want a gear? Each tooth is a 1km x 1km x 1km cube at least. Want to build something more complicated, say an engine? If you're in a small country, it may well be a necessarily international project, and also better keep it fairly flat or it won't fit within the atmosphere. Want to cut down a single tree? Good luck.</p>
<p>This is roughly the scale at which modern technology operates compared to the atomic scale. Obviously this massively cuts down on what we can do. Having nanotechnology that lets us rearrange atoms on a fine level, instead of relying on astronomically blunt tools and bulk chemical reactions, could put the capabilities of physical technology on the kind of exponential Moore's law curve we've seen in information technology.</p>
<p>There are some problems in the way. As you get to smaller and smaller scales:</p>
<ul>
<li>matter stops being continuous and starts being discrete (and therefore for example oil-based lubrication stops working);</li>
<li>the impact of gravity vanishes but the impact of adhesion increases massively;</li>
<li>heat dissipation rates increase;</li>
<li>everything becomes springy and nothing is stiff anymore; and</li>
<li>hydrogen atoms (other atoms are too heavy) can start doing weird quantum stuff like tunnelling.</li>
</ul>
<p>Also, how do we even get started? If all we have are extremely blunt tools, how do you make sharp ones?</p>
<p>There are two approaches. The first, the top-down approach, was suggested <a href="https://en.wikipedia.org/wiki/There%27s_Plenty_of_Room_at_the_Bottom">in a 1959 talk</a> by Richard Feynman, which is credited as introducing the concept of nanotechnology. First, note that we currently have an industrial tool-base at human scales that is, in a sense, self-replicating: it requires human inputs, but we can draw a graph of the dependencies and see that we have tools to make every tool. Now we take this tool-base, and create an analogous one at one-fourth the scale. We also create tools that let us transfer manipulations – the motions of a human engineer's hands, for example – to this smaller-scale version (today we can probably also automate large parts of it, but this isn't crucial). Now we have a tool-base that can produce itself at a smaller scale, and we can repeat the process again and again, making adjustments in line with the above points about how the engineering must change. If each step is one-fourth the previous, 8 iterations will take us from a millimetre-scale industrial base to a tens-of-nanometres-scale one.</p>
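<p>The arithmetic of the repeated shrinking is easy to check. A minimal sketch, using only the numbers in the paragraph above (the variable names are mine):</p>

```python
START_SCALE_M = 1e-3  # millimetre-scale tool-base, in metres
SHRINK_FACTOR = 4     # each generation is one-fourth the size of the previous
ITERATIONS = 8

# Each iteration divides the working scale by the shrink factor.
final_scale_m = START_SCALE_M / SHRINK_FACTOR ** ITERATIONS

print(f"final scale: {final_scale_m * 1e9:.1f} nm")
```

<p>Eight steps shrink the scale by a factor of 4<sup>8</sup> = 65 536, which lands at roughly 15 nm.</p>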
<p>The other approach is bottom-up. We already have some ability to manipulate things on the single-digit nanometre scale: the smallest features on today's chips are in this range, we have <a href="https://en.wikipedia.org/wiki/Atomic_force_microscopy">atomic-scale microscopes that can also manipulate atoms</a>, and of course we're surrounded by massively complicated nanotechnology called organic life that comes with pre-made nano-components. Perhaps these tools let us jump straight to making simple nano-scale machines, and a combination of these simple machines and our nano-manipulation tools lets us eventually build the critical self-sustaining tool-base at the atomic level.</p>
<h3>Weather machines?!</h3>
<p>Here's one thing you could do with nanotechnology: make 5 quintillion 1 cm controllable hydrogen balloons with mirrors, release them into the atmosphere, and then set sunlight levels to be whatever you want (without nanotechnology, this might also be doable, but nanotechnology lets you make very thin balloons and therefore removes the need to strip-mine an entire continent for the raw materials).</p>
<p>Hall calls this a weather machine, and it is exactly what it says on the tin, both on a global and local level. He estimates that it would double global GDP by letting regions set optimal temperatures, since "you could make land in lots of places on the earth, such as Northern Canada and Russia, as valuable as California". Of course, this is assuming that we don't care about messing up every natural ecosystem and weather pattern on the planet, but if the machine is powerful enough we might choose to keep the still-wild parts of the world as they are. I don't know if this would work, though; sunlight control alone can do a lot to the weather, but perhaps you'd need something different to avoid, for example, the huge winds from regional temperature differences? However, with a weather machine, the sort of subtle global modifications needed to offset the roughly 1 watt per square metre of net heating that anthropogenic emissions have caused would be trivial.</p>
<p>Weather machines are scary, because we're going to need very good institutions before that sort of power can be safely wielded. Hall thinks they're coming by the end of the century, if only because of the military implications: not only could you destroy agriculture wherever you want, but the mirrors could also focus sunlight onto a small spot. You could literally smite your enemies with the power of the sun.</p>
<p>Don't want things in the atmosphere, but still want to control the climate? Then put sunshades into orbit, which would also incentivise the development of a large-scale orbital launch infrastructure that we can afterwards use to settle Mars or whatever. As a bonus, put solar panels on your sunshade satellites, and you can generate more power than humanity currently uses.</p>
<p>As always, nothing is too big for Hall. He goes on to speculate about a weather machine <a href="https://en.wikipedia.org/wiki/Dyson_sphere">Dyson sphere</a> at half the width of the Earth's orbit. Put solar panels on it, and it would generate enormous amounts of power. Use it as a telescope, and you could see a phone lying on the ground on <a href="https://en.wikipedia.org/wiki/Proxima_Centauri_b">Proxima Centauri b</a>. Or, if the Proxima Centaurians try to invade, you can use it as a weapon and "pour a quarter of the Sun’s power output, i.e. 100 trillion terawatts, into a [15-centimetre] spot that far away, making outer space safe for democracy."</p>
<h3>Flying cities?!?</h3>
<p>And because why the hell not: imagine a 15-kilometre airplane shaped like a manta ray and with a thickness of a kilometre (so the <a href="https://en.wikipedia.org/wiki/Burj_Khalifa">Burj Khalifa</a> fits inside), with room for 10 million people inside. It takes 200 GW of power to stay flying – equivalent to 4 000 Boeing 747s – which could be provided by a line of nuclear power plants every 100 metres or so running along the back. This sounds like a lot, but Hall helpfully points out the reactors would only be 0.01% of the internal volume, so you could still cluster Burj Khalifas inside to your heart's content, and the energy consumption comes out to only 20 kW per person, about where we'd be today if energy use had continued growing on pre-1970s trends.</p>
<p>If you don't want to go to space but still want to leave the Earth untouched, this is one solution, as long as you don't mind a lot of very confused birds.</p>
<h2>Technology is possible, but has risks</h2>
<p>I worry that <i>Where is my Flying Car?</i> easily leaves the impression that everything Hall talks about is part of some uniform techno-wonderland, which, depending on your prior about technological progress, is somewhere between certainly going to happen and permanently relegated to the dreams of mad scientists. Hall does not work to dispel this impression: he goes back and forth between talking about how practical flying cars are and exotic nuclear spacecraft, or between reasonable ideas about traffic layout in cities and far-off speculation about city-sized airplanes. Credible world-changing technologies like nanotechnology easily seem like just another crazy thought Hall sketched out on the back of the envelope and could not stop being enthusiastic about.</p>
<p>So should we take Hall's more grounded speculation seriously and ignore the nano-nuclear-space-megapolises? I think this would be the wrong takeaway. First, I'm not sure Hall's crazy speculation is crazy enough to capture possible future weirdness within it; he restricts himself mainly to physical technologies, and thus leaves out potentially even weirder things like a move to virtual reality or the creation of superhuman intelligence (whether AI or augmented humans).</p>
<p>Second, Hall does have a consistent and in some way realist perspective: if you look at the world – not at the institutions humans have built, or whatever our current tech toolbox contains, but at the physical laws and particles at our disposal – what do you come up with?</p>
<p>After all, our world is ultimately not one of institutions and people and their tools. The "strata" go deeper, until you hit the bedrock of fundamental physics. We spend most of our time thinking about the upper layers, where the underlying physics is abstracted out and the particles partitioned into things like people and countries and knowledge. This is for good reason, because most of the time this is the perspective that lets you best think about things important to people. Occasionally, however, it's worth taking a less parochial perspective by looking right down to the bedrock, and remembering that anything that can be built on that is possible, and something we may one day deal with.</p>
<p>This perspective should also make clear another fact. The things we care about (e.g. people) exist many layers of abstraction up from the fundamental physics, and are therefore fragile, since they depend on the correct configuration of all levels below. If your physical environment becomes inhospitable, or an engineered virus prevents your cells from carrying out their function, the abstraction of you as a human with thoughts and feelings will crash, just like a program crashes if you fry the circuits of the computer it runs on.</p>
<p>So there are risks, new ones will appear as we get better at configuring physics, and stopping civilisation from accidentally destroying itself with some new technology is not something we're automatically guaranteed to succeed at.</p>
<p>Hall does not seem to recognise this. Despite all his talk about nanotechnology, the <a href="https://en.wikipedia.org/wiki/Gray_goo">grey goo scenario</a> of self-replicating nanobots going out of control and killing everyone doesn't get a mention. As far as I'm aware, there's no strong theoretical reason for this to be impossible – nanobots good at configuring carbon/oxygen/hydrogen atoms are a very reasonable sort of nanobot, and I can't help but notice that my body is mainly carbon, oxygen, and hydrogen atoms. "What do you replace oil lubrication with for your atomic scale machine parts" is a worthwhile question, as Hall notes, but I'd like to add that so is the problem of not killing everyone.</p>
<p>Hall does mention the problem of AI safety:</p>
<blockquote><p><i>"The latest horror-industry trope is right out of science fiction [...]. People are trying to gin up worries that an AI will become more intelligent than people and thus be able to take over the world, with visions of Terminator dancing through their heads. Perhaps they should instead worry about what we have already done: build a huge, impenetrably opaque very stupid AI in the form of the administrative state, and bow down to it and serve it as if it were some god."</i></p>
</blockquote>
<p>What's this whole thing with arguments of the form "people worry about AI, but the <i>real</i> AI is X", where X is whatever institution the author dislikes? <a href="https://www.buzzfeednews.com/article/tedchiang/the-real-danger-to-civilization-isnt-ai-its-runaway">Here's another example</a> from a different political perspective (by sci-fi author Ted Chiang, whose <a href="http://strataoftheworld.blogspot.com/2020/05/short-reviews-fiction.html">fiction I enjoy</a>). I don't think this is a useless perspective – there is an analogy between institutions that fail because their design optimises for the wrong thing, and the more general idea of powerful agents accidentally designed to optimise for the wrong thing – but at the end of the day, surprise surprise, the real AI is a very intelligent computer program.</p>
<p>Hall also mentions he "spent an entire book (<i><a href="https://www.amazon.com/Beyond-AI-Creating-Conscience-Machine/dp/1591025117">Beyond AI</a></i>) arguing that if we can make robots smarter than we are, it will be a simple task to make them morally superior as well." This sounds overconfident – morality is complicated, after all – but I haven't read it.</p>
<p>As for climate change, Hall acknowledges the problem but justifies largely dismissing it by citing "[t]he actual published estimates for the IPCC’s worst case scenario, RCP8.5, [which] are for a reduction in GDP of between 1% and 3%". <a href="https://science.sciencemag.org/content/sci/356/6345/1362.full.pdf">This is true</a> ... if you only consider the United States! (The EU is in the same range but the global estimates range up to 10%, because of a disproportionate effect on poor tropical countries.) As the authors of that very report also note, these numbers don't take into account non-market losses. If Hall wants to make an argument for techno-optimistic capitalism, he should consider taking more care to distinguish himself from the strawman version.</p>
<h2>It's <i>not</i> the technology, stupid!</h2>
<p>Hall does not think that we'd have all the technologies mentioned above if only technological progress had not "stagnated". The things he expects could've happened by now given past trends are:</p>
<ul>
<li>The technological feasibility of flying cars would be demonstrated and sales would be on the rise; Hall goes as far as to estimate the private airplane market in the US could have been selling 30k-40k planes per year (a fairly tight confidence interval for something this uncertain); compare with the actual US market today, which sells around 16 million cars and a few thousand private aircraft per year.</li>
<li>Demonstrated examples of multi-level cities and floating cities.</li>
<li>Chemical spacecraft technology would be about where it is now, but with some chance that government funding would have resulted in <a href="https://en.wikipedia.org/wiki/Project_Orion_(nuclear_propulsion)">Project Orion</a>-style nuclear launch vehicles.</li>
<li>Nanotechnology: basic things like ammonia fuel cells might exist, but not fancier things like cell repair machines or universal fabricators.</li>
<li>Nuclear power would generate almost all electricity, and hence there would be a lot less CO2 in the atmosphere (<a href="https://www.mdpi.com/1996-1073/10/12/2169/htm">this study</a> estimates 174 billion fewer tons of CO2 had reasonable nuclear trends continued, but Hall optimistically gives the number as 500 billion tons).</li>
<li>AI and computers at the same level as today.</li>
<li>A small probability that something unexpected along the lines of cold fusion would have turned out to work and been commercialised.</li>
<li>A household income several times larger than today.</li>
</ul>
<p>So what went wrong? Hall argues:</p>
<blockquote><p>"The faith in technology reflected in Golden Age SF and Space Age America wasn’t misplaced. What they got wrong was faith in our culture and bureaucratic arrangements."</p>
</blockquote>
<p>He gives two broad categories of reasons: concrete regulations, and a more general cultural shift from hard technical progress to worrying and signalling.</p>
<h3>Regulation ruins everything?</h3>
<p>Hall does not like regulation. He estimates that had regulation not grown as it did after 1970, the increased GDP growth might have been enough to make household incomes 1.5 to 2 times higher than they are today in the US. I can find some studies saying similar things – <a href="https://www.sciencedirect.com/science/article/abs/pii/S1094202520300223">here</a> is one claiming 0.8% lower GDP growth per year since 1980 due to regulation, which would imply today's economy would be about 1.3 times larger had this drag on growth not existed. As far as I can tell, these estimates also don't take into account the benefits of regulation, which are sometimes massive (e.g. banning lead in gasoline). However, I think most people agree that regardless of how much regulation there should be, it could be a lot smarter.</p>
<p>Hall's clearest case for regulation having a big negative impact on an industry is private aviation in the United States, which crashed around 1980 after more stringent regulations were introduced. The number of airplane shipments per year dropped something like six-fold and never recovered.</p>
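<p>As a rough check on the 0.8% per year regulation estimate cited earlier in this section, here is how a small annual growth drag compounds. This is a simplified sketch of my own: it treats the drag as a plain multiplicative factor, and the year spans are assumptions, since the exact answer depends on which end year you pick.</p>

```python
def size_ratio(annual_drag, years):
    """Approximate factor by which the economy would be larger
    if growth had been `annual_drag` higher in every one of `years` years."""
    return (1 + annual_drag) ** years


# Assumed spans, for illustration only.
for years in (30, 35, 40):
    print(f"{years} years: {size_ratio(0.008, years):.2f}x")
```

<p>Depending on the span, the result sits in the neighbourhood of 1.3 to 1.4, so the "about 1.3 times larger" reading is the right order of magnitude.</p>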
<p>A much bigger example is nuclear power, which I will discuss in an upcoming post, and which Hall also has plenty to say about.</p>
<p>Strangely, Hall misses perhaps the most obvious case in modern times: GMOs pointlessly being almost regulated out of existence, a story told well in Mark Lynas' <i>Seeds of Science</i> (my review <a href="http://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">here</a>). Perhaps this is because of Hall's focus on hard sciences, or his America-centrism (GMO regulation is worse in the EU than in the United States).</p>
<p>And speaking of America-centrism, the biggest question I had is why even if the US is bad at regulation, no country decides to do better and become the flying car capital of the world. Perhaps good regulation is hard enough that no one gets it right? Hall makes no mention of this question, though. </p>
<p>He does, however, throw plenty of shade on anything involving centralisation. For example:</p>
<blockquote><p><i>"Unfortunately, the impulse of the Progressive Era reformers, following the visions of [H. G.] Wells (and others) of a “Scientific Socialism,” was to centralize and unify, because that led to visible forms of efficiency. They didn’t realize that the competition they decried as inefficient, whether between firms or states, was the discovery procedure, the dynamic of evolution, the genetic algorithm that is the actual mainspring of innovation and progress."</i></p>
</blockquote>
<p>He brings some interesting facts to the table. For example, an OECD survey found a 0.26 correlation between private spending on research & development and economic growth, but a correlation of -0.37 between public R&D and growth. Here's Hall's once again somewhat dramatic explanation:</p>
<blockquote><p><i>“Centralized funding of an intellectual elite makes it easier for cadres, cliques, and the politically skilled to gain control of a field, and they by their nature are resistant to new, outside, non-Ptolemaic ideas. The ivory tower has a moat full of crocodiles.”</i></p>
</blockquote>
<p>He backs this up with his personal experience of how US government spending on nanotechnology led to a flurry of scientists trying to claim that their work counted as nanotechnology (up to and including medieval stained glass windows), as well as trying to discredit anything that actually was nanotechnology, to make sure that the nanotechnologists wouldn't steal more federal funding in the future.</p>
<p>Studies, not surprisingly, find that the issue is more complicated (see for example <a href="https://link.springer.com/article/10.1007/s10645-019-09331-3">here</a>, which includes a mention of the specific survey Hall references).</p>
<p>Hall also includes a graph of economic growth vs the Fraser Institute's economic freedom score in the United States. I've created my own version below, including some more information than Hall does:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2XGpwqd-nzttG9oGfko8bRQCmxILe8ltRW2szibNf65x8hq3qqtNlxuREBQc984FCX2d1nR64Jygu3NHe9_DPdEw_OOJsliXh-7Y9A2gi0xAUkSwt-rHW3Qaq-wfAMrsGMYOMtpZ6m8le/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="980" data-original-width="1550" height="405" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2XGpwqd-nzttG9oGfko8bRQCmxILe8ltRW2szibNf65x8hq3qqtNlxuREBQc984FCX2d1nR64Jygu3NHe9_DPdEw_OOJsliXh-7Y9A2gi0xAUkSwt-rHW3Qaq-wfAMrsGMYOMtpZ6m8le/" width="640" /></a></div><p>In general, it seems sensible to expect economic freedom to increase GDP: the more a person's economic choices are limited, the more likely the limitations are to prevent them from taking the optimal action (the main counterexample being if optimal actions for an individual create negative externalities for society). We can also see that this is empirically the case – developed countries tend to have high economic freedom. However, in using this graph as clear evidence, I think Hall is once again trying to make too strong a case on the basis of one correlation.</p>
<p>Effective decentralised systems, whether markets or democracy, are always prone to attack by people who claim that things would be better if only we let them make the rules. Maybe it takes something of Hall's engineer mindset to resist this impulse and see the value of bloodless systems and of general design principles like feedback and competition. (And perhaps Hall should apply this mindset more when evaluating the strength of evidence for his economic ideas.)</p>
<p>As for what the future of societal structure looks like, Hall surprisingly manages to avoid proposing flying-car-ocracy:</p>
<blockquote><p><i>"[It] may well be possible to design a better machine for social and economic control than the natural marketplace. But that will not be done by failing to understand how it works, or by adopting the simplistic, feedback-free methods of 1960s AI programs. And if ever it is done, it will be engineers, not politicians, who do it."</i></p>
</blockquote>
<p>He goes further:</p>
<blockquote><p><i>"As a futurist, I will go out on a limb and make this prediction: when someone invents a method of turning a Nicaragua into a Norway, extracting only a 1% profit from the improvement, they will become rich beyond the dreams of avarice and the world will become a much better, happier, place. Wise incorruptible robots may have something to do with it."</i></p>
</blockquote>
<h3>Risk perception and signalling</h3>
<p>Hall's second reason for us not living up to expectations for technological progress is cultural. He starts with the idea of risk homeostasis in psychology: everyone has some tolerance for risk, and will seek to be safer when they perceive current risk to be higher, and take more risks when they perceive current risk to be lower. In developed countries, risks are of course ridiculously low compared to historical levels, so most people feel safer than ever. Some start skydiving in response, but Hall suggests there's another effect that happens when an entire society finds itself living below its risk tolerance:</p>
<blockquote><p><i>"One obvious way [to increase perceived risk] is simply to start believing scare stories, from Corvairs to DDT to nuclear power to climate change. In other words, the Aquarian Eloi became phobic about everything specifically because we were actually safer, and needed something to worry about."</i></p>
</blockquote>
<p>I know what you're thinking – what the hell are "Aquarian Eloi"? Hall likes to come up with his own terms for things, and in this case he is making a reference to H. G. Wells' <i>The Time Machine</i>, in which descendants of humanity live out idle and dissolute lives (modelled on England's idle rich of the time), in order to label what he claims is the modern zeitgeist. Yes, this book is weird at times.</p>
<p>Another cultural idea he touches on is increased virtue signalling. Using the idea of <a href="https://en.wikipedia.org/wiki/Maslow%27s_hierarchy_of_needs">Maslow's hierarchy of needs</a>, he explains that as more and more of the population is materially well-off, more people invest more effort into self-actualisation. Some of this is productive, but, humans being humans, a lot of this effort goes into trying to signal how virtuous you are. Of course, there's nothing inherently wrong with that, as long as your virtue signalling isn't preventing other people from climbing up from lower levels of Maslow's hierarchy – or, Hall would probably add, from building those flying cars.</p>
<h3>Environmentalism vs Greenism</h3>
<p>A particular sub-case of cultural change that Hall has a lot to say about is the "Green religion", something he distinguishes (though sometimes with not enough care) from perfectly reasonable desires "to live in a clean, healthy environment and enjoy the natural world".</p>
<p>This ideological, fear-driven and generally anti-science faction within the environmentalist movement is much the same thing as what Steven Pinker calls "Greenism", which I talked about in <a href="http://strataoftheworld.blogspot.com/2018/08/review-enlightenment-now-steven-pinker.html">my review of <i>Enlightenment Now</i></a> (search for "Greenism") and also features in <a href="http://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">my review of Mark Lynas' <i>Seeds of Science</i></a> (search for "torpedoes"). Unlike Lynas or even Pinker, Hall does not hold back when it comes to criticising this particular strand of environmentalism. He explains it as an outgrowth of the risk-averseness and virtue signalling trends described above. The "Green religion", he claims, is now the "default religion of western civilization, especially in academic circles", and "has developed into an apocalyptic nature cult". To explain its resistance to progress and improving the human condition, he writes:</p>
<blockquote><p><i>"It seems likely that the fundamentalist Greens started with the notion that anything human was bad, and ran with the implication that anything that was good for humans was bad. In particular, anything that empowered ordinary people in their multitudes threatened the sanctity of the untouched Earth. The Green catechism seems lifted out of classic Romantic-era horror novels. Any science, any engineering, the “acquirement of knowledge,” can only lead to “destruction and infallible misery.” We must not aspire to become greater than our nature."</i></p>
</blockquote>
<p>There are troubling tendencies in ideological Greenism (as there are with anything ideological), but I think "apocalyptic nature cult" takes it too far, and as a substitute religion for the west, it has some formidable competitors. Hall is right to point out the tension between improving human welfare and Greenist desires to limit humans, but I'd bet that the driving factor isn't direct disdain for humans, but rather the sort of sacrificial attitudes that are common in humans (consider <a href="https://www.britannica.com/topic/flagellants">the people</a> who went around whipping themselves during the Black Death to try to atone for whatever God was punishing them for). Probably there's some part of human psychology or our cultural heritage that makes it easy to jump to sacrifice, disparaging ourselves (or even all of humanity), and repentance as the answer to any problem. While this is a nobly selfless approach, it's just less effective than, and sometimes in opposition to, actually building things: developing new technologies, building clean power plants, and so on.</p>
<p>Hall also goes too far in letting the Greenists tar his view of the entire environmentalist movement. Not only is climate change a more important problem than the 1-3% estimated GDP loss for the US suggests, but you'd think that the sort of big technical innovation that is happening with clean tech would be exactly the sort of progress Hall would be rooting for.</p>
<p>Hall does have an environmentalist proposal, and of course it involves flying cars:</p>
<blockquote><p><i>"The two leading human causes of habitat destruction are agriculture and highways—the latter not so much by the land they take up, but by fragmenting ecosystems. One would think that Greens would be particularly keen for nuclear power, the most efficient, concentrated, high-tech factory farms, and for ... flying cars."</i></p>
<p><i>[Ellipsis in original]</i></p>
</blockquote>
<h3>Energy matters!</h3>
<p>Although Hall is partly blinded by his excessive anti-Greenism, there is one especially important correction to some strands of environmentalist thinking that he makes well: cheap energy really matters and we need more of it (and energy efficiency won't save the day).</p>
<p>Above, I used the stagnation in energy use per capita as an example of things going wrong. This may have raised some eyebrows; isn't it good that we're not consuming more and more energy? Don't we want to reduce our energy consumption for the sake of the environment?</p>
<p>First, it is obviously true that we need to reduce the environmental impact of energy generation. Decoupling GDP growth from CO2 emissions is one of the great achievements of western countries over the past decades, and we need to massively accelerate this trend.</p>
<p>However, our goal, if we're liberal humanists, should be to give people choices and let them lead happy lives (while applying the same considerations to any sentient non-human beings, and ideally not wrecking irreplaceable ecosystems). In our universe, this means energy. Improvements in the quality of life over history are, to a large extent, improvements in the amount of energy each person has access to. This is very true:</p>
<blockquote><p><i>“Poverty is ameliorated by cheap energy. Bill Gates, nowadays perhaps the world’s leading philanthropist, puts it, “If you could pick just one thing to lower the price of—to reduce poverty—by far you would pick energy.”"</i></p>
</blockquote>
<p>Even in the United States, "[e]nergy poverty is estimated to kill roughly 28,000 people annually in the US from cold alone, a toll that falls almost entirely on the poor". </p>
<p>Climate change cannot be solved by reducing energy consumption, because there are six billion people in the world who have not reached western living standards and who should be brought up to them as quickly as possible. This will take energy. What we need is to simultaneously massively increase the amount of energy that humanity uses, while also switching over to clean energy. If you think only one of these is enough, you have either failed to understand the gravity of the world's poverty situation or the gravity of its environmental one.</p>
<p>(Energy efficiency matters, because all else being equal, it reduces operating costs. It is near-useless for solving emissions problems, however, because the more efficiently we can use energy, the more of it we will use. Hall illustrates this with a thought experiment of a farmer who uses a truck to carry one crate of tomatoes at a time from their farm to a customer, and whose only expense is fuel for the truck. Double its fuel efficiency, and it's economical to drive twice as far, and hence service four times as many customers (assuming customer number is proportional to reachable area), plus each trip is twice as long on average. The net result is that the 2x increase in efficiency leads to 8x more kilometres driven and hence 4x higher fuel consumption. The general case is called <a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons paradox</a>.)</p>
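<p>The arithmetic of the truck example can be checked with a few lines of Python. This is only a sketch of the thought experiment as described above; the function name <code>jevons_truck</code> and its variable names are made up for illustration, and the scaling assumptions (customers proportional to reachable area, average trip length proportional to reach) are the ones stated in the text.</p>

```python
def jevons_truck(efficiency_gain):
    """Scale factors for the tomato-truck thought experiment."""
    reach = efficiency_gain        # economical driving distance scales with efficiency
    customers = reach ** 2         # customers proportional to reachable area
    avg_trip = reach               # average trip length scales with reach
    km_driven = customers * avg_trip
    fuel_used = km_driven / efficiency_gain
    return km_driven, fuel_used

print(jevons_truck(2))  # (8, 4.0): 8x the kilometres, 4x the fuel
```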
<p>So yes, we need energy, most urgently in developing countries, but the more development and deployment of new energy sources there is, the cheaper they will be for everyone – consider Germany's highly successful subsidies for solar power – so developed countries have a role to play as well. (Also, are we sure there would be no human benefits to turning the plateauing in developed country energy use back into an increase?)</p>
<p>You'd think this is obvious. Unfortunately it isn't. In a section titled "AAUGHH!!", Hall presents these quotes:</p>
<blockquote><p><i>“The prospect of cheap fusion energy is the worst thing that could happen to the planet.
—Jeremy Rifkin</i></p><i>
</i><p><i>Giving society cheap, abundant energy would be the equivalent of giving an idiot child a machine gun.
—Paul Ehrlich</i></p><i>
</i><p><i>It would be little short of disastrous for us to discover a source of clean, cheap, abundant energy, because of what we might do with it.
—Amory Lovins”</i></p>
</blockquote>
<p>They are what leads Hall to say, perhaps with too much pessimism:</p>
<blockquote><p><i>"Should [a powerful new form of clean energy] prove actually usable on a large scale, they would be attacked just as viciously as fracking for natural gas, which would cut CO2 emissions in half, and nuclear power, which would eliminate them entirely, have been."</i></p>
</blockquote>
<p>It is good to give people the choice to do what they want, and therefore good to give them as much energy as possible to play with, whether they want it to power the construction of their dream city or their flying car trips to Australia (I do draw the line at Death Stars, though).</p>
<p>Right now we're limited by the wealth of our societies, limiting us to about 10 kW/capita in developed countries, and by the unacceptable externalities of our polluting technology. The right goal isn't to enforce limits on what people can do (except indirectly through the likes of taxes and regulation to correct externalities), but to bring about a world where these limits are higher.</p>
<p>If energy is expensive, people are cheap – lives and experiences are lost for want of a few watts. This is the world we have been gradually dragging ourselves out of since the industrial revolution, and progress should continue. Energy should be cheap, and people should be dear.</p>
<p> </p>
<h2>Don't panic; build</h2>
<p><i>Where is my Flying Car?</i> is a weird book.</p>
<p>First of all, I'm not sure if it has a structure. Hall will talk about flying cars, zoom off to something completely different until you think he's said all he has to say on them, and just when you least expect it: more flying cars. The same pattern of presentation repeats with other topics. Also, sections begin and sometimes end with a long selection of quotes, including no fewer than three from Shakespeare.</p>
<p>Second, the ideas. There are the hundred speculative examples of crazy (big, physical) future technologies, the many often half-baked economic/political arguments, the unstated but unmissable America-centrism, and witty rants that wander the border between insightful social critique and intellectualised versions of stereotypical boomer complaints about modern culture.</p>
<p>Also, the cover is this:</p>
<p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsGkei7QZ5nqMOUUqhdB3SGF8zlP9QlGaYv73hLoK2iSrpPh2A65ZlN6Eag3CyTX02FBJ71nSIZquNwS0l-YqBbr06h7H1wdjI7BYtUDzwGa44JN6fDlbnpJJw-1w7tcsEd70GZkwBs6A2/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1012" data-original-width="624" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsGkei7QZ5nqMOUUqhdB3SGF8zlP9QlGaYv73hLoK2iSrpPh2A65ZlN6Eag3CyTX02FBJ71nSIZquNwS0l-YqBbr06h7H1wdjI7BYtUDzwGa44JN6fDlbnpJJw-1w7tcsEd70GZkwBs6A2/w247-h400/cover.png" width="247" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Above: ... a joke?<br /></td></tr></tbody></table><p></p><div class="separator" style="clear: both; text-align: center;"></div><p></p>
<p>However, I think overall there's a coherent and valuable perspective here. First, Hall is against pointless pessimism. He makes this point most clearly when talking about dystopian fiction, but I think it generalises:</p>
<blockquote><p><i>"Dystopia used to be a fiction of resistance; it’s become a fiction of submission, the fiction of an untrusting, lonely, and sullen twenty-first century, the fiction of fake news and infowars, the fiction of helplessness and hopelessness. It cannot imagine a better future, and it doesn’t ask anyone to bother to make one. It nurses grievances and indulges resentments; it doesn’t call for courage; it finds that cowardice suffices. Its only admonition is: Despair more."</i></p>
</blockquote>
<p>Hall's answer to this pessimism is to point out ten billion cool tech things that we could do one day. He veers too much to the techno-optimistic side by not acknowledging any risks, but overall this is an important message. Visions of the future are often dominated by the negatives: no war, no poverty, no death. Someone needs to fill in the positives, and while Hall focuses more on the "what" of it than the "how does it help humans" part, I think a hopeful look at future technologies is a good start.</p>
<p>In addition to being against pessimism about human capabilities, Hall also takes, at least implicitly, a liberal stand by being against pessimism about humans. His answer to "what should we do?" is to give people choice: let them travel far and easily, let them live where they want, let them command vast amounts of energy.</p>
<p>Hall also identifies two ways to keep a civilisation on track in terms of making technological progress and not getting consumed by signalling and politics: growing, and having a frontier.</p>
<p>On the topic of growth, he makes basically the same point as my <a href="https://strataoftheworld.blogspot.com/2019/09/growth-and-civilisation.html">post on growth and civilisation</a>:</p>
<blockquote><p><i>"One of the really towering intellectual achievements of the 20th Century, ranking with relativity, quantum mechanics, the molecular biology of life, and computing and information theory, was understanding the origins of morality in evolutionary game theory. The details are worth many books in themselves, but the salient point for our purposes is that the evolutionary pressures to what we consider moral behavior arise only in non-zero-sum interactions. In a dynamic, growing society, people can interact cooperatively and both come out ahead. In a static no-growth society, pressures toward morality and cooperation vanish; you can only improve your situation by taking from someone else. The zero-sum society is a recipe for evil."</i></p>
</blockquote>
<p>Secondly, the idea of a frontier: something outside your culture that your society presses against (ideally nature, but I think this would also apply to another competing society). This is needed because "[w]ithout an external challenge, we degenerate into squabbling [and] self-deceiving".</p>
<blockquote><p><i>"But on the frontier, where a majority of one’s efforts are not in competition with others but directly against nature, self-deception is considerably less valuable. A culture with a substantial frontier is one with at least a countervailing force against the cancerous overgrowth of largely virtue-signalling, cost-diseased institutions."</i></p>
</blockquote>
<p>Frontiers often relate to energy-intensive technologies:</p>
<blockquote><p><i>"High-power technologies promote an active frontier, be it the oceans or outer space. Frontiers in turn suppress self-deception and virtue signalling in the major institutions of society, with its resultant cost disease. We have been caught to some extent in a self-reinforcing trap, as the lack of frontiers foster those pathologies, which limit what our society can do, including exploring frontiers. But by the same token we should also get positive feedback by going in in the opposite direction, opening new frontiers and pitting our efforts against nature."</i></p>
</blockquote>
<p>Finally, Hall's book is a reminder that an important measure to judge a civilisation against is its capacity to do physical things. Even if the bulk of progress and value is now coming from less material things, like information technology or designing ever fairer and more effective institutions, there are important problems – covid vaccinations, solving climate change, and building infrastructure, for example – that depend heavily on our ability to actually go out and move atoms in the real world. Let's make sure we continue to get better at that, whether or not it leads to flying cars.</p>
<div><br /></div><div style="text-align: center;"><b>RELATED:</b></div><div><ul style="text-align: left;">
<li><a href="http://strataoftheworld.blogspot.com/2021/03/nuclear-power-is-good.html">Nuclear power is good</a></li>
<li><a href="https://strataoftheworld.blogspot.com/2021/03/technological-progress.html">Technological progress</a></li><li><a href="https://strataoftheworld.blogspot.com/2018/08/review-enlightenment-now-steven-pinker.html">Review: Enlightenment Now</a></li><li><a href="https://strataoftheworld.blogspot.com/2018/10/review-energy-and-civilization-history.html">Review: Energy and Civilisation</a></li>
</ul></div>
<p> </p>Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-1697673368059564013.post-61824184159231243912021-01-22T12:15:00.003+00:002021-02-19T21:46:35.603+00:00Data science 2<p style="text-align: center;"><span style="font-size: x-small;"><i>6.4k words, including equations (about 30 minutes)</i></span> <br /></p><p>See the <a href="http://strataoftheworld.blogspot.com/2020/12/data-science-1.html">first post</a> for an introduction.</p>
<h2>Monte Carlo methods</h2>
<p>In the late 1940s, Stanislaw Ulam was trying to work out the probability of winning in a solitaire variant. After cranking out combinatorics equations for a while, he had the idea that simulating a large number of games starting from random starting configurations with the "fast" computers that were becoming available could be a more convenient method.</p>
<p>At the time, Ulam was working on nuclear weapons at Los Alamos, so he had the idea of using the same principle to solve some difficult neutron diffusion problems, and went on to develop such methods further with John von Neumann (no mid-20th century maths idea is complete without von Neumann's hand somewhere on it). Since this was secret research, it needed a codename, and a colleague suggested "Monte Carlo" after the casino in Monaco. (This group of geniuses managed to break rule #1 of codenames, which is "don't reveal the basic operating principle of your secret project in its codename".)</p>
<p>Ulam used this work to help himself become (along with Edward Teller) the father of the hydrogen bomb. Our purposes here will be a bit more modest.</p>
<p>The basic idea of Monte Carlo methods is just repeated random sampling. Have a way to generate a random variable <script type="math/tex">X</script>, but not to generate fancy maths stats like <script type="math/tex">P(X \in S)</script>, where <script type="math/tex">S</script> is some subset of the sample space? Fear not – let <script type="math/tex">f(x)</script>, for values <script type="math/tex">x</script> that <script type="math/tex">X</script> can take, be 1 if <script type="math/tex">x \in S</script> and 0 otherwise. Then <script type="math/tex">E(f(X))</script> is <script type="math/tex">P(f(X) = 1) = P(X \in S)</script> and we've solved the problem if we can estimate <script type="math/tex">P(f(X)=1)</script>. If we can randomly sample values from <script type="math/tex">X</script> (and calculate the function <script type="math/tex">f</script>), then this is easy, because we simply sample many values and calculate for what fraction of them <script type="math/tex">f(X) = 1</script>.</p>
<p>In general,</p>
<div cid="n354" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n354" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-1" type="math/tex; mode=display">E(f(X)) \approx \frac{1}{n} \sum_{i=1}^n f(x_i)</script></div></div>
<p>for large <script type="math/tex">n</script> and with <script type="math/tex">x_i</script> drawn independently at random from <script type="math/tex">X</script>, a result that comes from the law of the unconscious statistician (discussed in <a href="http://strataoftheworld.blogspot.com/2020/12/data-science-1.html">part 1</a>) once you realise that, as <script type="math/tex">n</script> increases, the fraction of samples equal to any particular value <script type="math/tex">x</script> approaches <script type="math/tex">P(X=x)</script>.</p>
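<p>As a minimal sketch of the indicator-function trick, here is one way to estimate a probability by sampling. The helper <code>estimate_probability</code> and the two-dice example are hypothetical choices made for illustration; any sampler and any set <script type="math/tex">S</script> would do.</p>

```python
import numpy as np

def estimate_probability(sampler, in_S, n, rng):
    """Estimate P(X in S) as the fraction of n samples that land in S."""
    x = sampler(rng, n)        # n independent draws of X
    return np.mean(in_S(x))    # mean of the 0/1 indicator f(x)

# Hypothetical example: X is the total of two fair dice, S = {totals of at least 10}.
def two_dice(rng, n):
    return rng.integers(1, 7, size=(n, 2)).sum(axis=1)

rng = np.random.default_rng(0)
estimate = estimate_probability(two_dice, lambda x: x >= 10, 200_000, rng)
exact = 6 / 36  # (4,6), (5,5), (6,4), (5,6), (6,5), (6,6)
print(estimate, exact)
```

<p>The estimate should land close to the exact value, with an error that shrinks roughly as <script type="math/tex">1/\sqrt{n}</script>.</p>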
<p>We can also do integration in a Monte Carlo style. The standard way to integrate a function <script type="math/tex">f</script> is to sample it at evenly spaced points, multiply each sampled value by the distance between the points, and then add everything up. There's nothing special about even spacing though – as the number of samples increases, as long as we make sure to multiply each by the distance to the next sample, the result will converge to the integral.</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqV6656M1XzXQcCmGpP3a6_HvBiIvebY_ivlGy81Cll6FTvZMC8oZjRc0K398YfvebaKj462em-JhatFozvaKeeNWPaDhuayh7UdoUc14RlCcCU7utnND2ZzbMpkNyZPQOH_xQiVRyPKK3/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="418" data-original-width="998" height="268" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqV6656M1XzXQcCmGpP3a6_HvBiIvebY_ivlGy81Cll6FTvZMC8oZjRc0K398YfvebaKj462em-JhatFozvaKeeNWPaDhuayh7UdoUc14RlCcCU7utnND2ZzbMpkNyZPQOH_xQiVRyPKK3/w640-h268/mcint.png" width="640" /></a></div><p></p>Above on the left, we see standard integration, with undershoot in pink and overshoot in orange, and Monte Carlo integration, with random samplings, on the right.
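<p>A rough sketch of the right-hand picture, assuming we integrate over a finite interval and add the two end points to the random sample so that the gaps cover the whole range (the function name <code>integrate_random_points</code> is made up for this example):</p>

```python
import numpy as np

def integrate_random_points(f, a, b, n, rng):
    """Approximate the integral of f over [a, b] from n randomly placed points."""
    # Random sample locations, sorted, with the end points added so that the
    # gaps between consecutive points cover the whole interval.
    x = np.concatenate(([a], np.sort(rng.uniform(a, b, size=n)), [b]))
    gaps = np.diff(x)                  # distance from each point to the next
    return np.sum(f(x[:-1]) * gaps)    # sampled value times gap, summed

rng = np.random.default_rng(0)
approx = integrate_random_points(np.sin, 0.0, np.pi, 10_000, rng)
print(approx)  # the exact value of this integral is 2
```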
<p>Sometimes a lot of the interesting stuff (e.g. expected value, area in the integral, etc.) comes from a part of the function's domain that's low-probability when values in the domain are generated via <script type="math/tex">X</script>. If this happens, you either crank up the <script type="math/tex">n</script> in your Monte Carlo, or get smart about how exactly you sample (this is called importance sampling). If we're smart about this, our randomised integration can be faster than the standard method.</p>
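<p>To illustrate importance sampling, here is a sketch that estimates a small tail probability of a standard normal variable. The choice of problem, the threshold of 4, and the shifted-normal proposal distribution are all assumptions made for this example, not anything prescribed above. The idea is to sample where the interesting region is, and then correct for the changed sampling distribution by weighting each sample with the ratio of the two densities.</p>

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
threshold = 4.0

# Naive Monte Carlo: almost no samples land in the region of interest.
x = rng.standard_normal(n)
naive = np.mean(x > threshold)

# Importance sampling: draw from N(threshold, 1) instead, then weight each
# sample by (target density) / (proposal density) = exp(t^2 / 2 - t * y).
y = rng.normal(loc=threshold, scale=1.0, size=n)
weights = np.exp(threshold ** 2 / 2 - threshold * y)
importance = np.mean((y > threshold) * weights)

# Exact tail probability of a standard normal, for comparison.
exact = 0.5 * math.erfc(threshold / math.sqrt(2))
print(naive, importance, exact)
```

<p>With the same number of samples, the naive estimate is built from a handful of hits at best, while the importance-sampled one uses about half of its samples.</p>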
<p>We will look at examples of using Monte Carlo-style random simulation to do both Bayesian and frequentist statistics below.</p>
<p> </p>
<h2>Confidence</h2>
<p>In addition to providing a best-guess estimate of something (the probability a coin comes up heads, say), useful statistics should be able to tell us about how confident we should be in a particular guess – the best estimate of the probability a coin lands heads after observing 1 head in 2 throws or 50 heads in 100 throws is the same, but the second one still allows us to say more.</p>
<p>The question of how to quantify confidence leads into the question of what probability is.</p>
<p>The frequentist approach is to say that probabilities are observed relative frequencies across many trials, and if you don't have many trials to look at, then you imagine some hypothetical set of trials that an event might be seen as being drawn from.</p>
<p>The Bayesian approach is that probabilities quantify the state of your own knowledge, and if you don't have data to look at, you should still be able to draw a probability distribution representing your knowledge.</p>
<h3>Bayesianism</h3>
<p>Bayesianism is the idea that you represent uncertainty in beliefs about the world using numbers, which come from starting out with some prior distribution, and then shifting the distribution back and forth as evidence comes in. These numbers follow the axioms of probability, and so we might as well call them probabilities.</p>
<p>(Why should these numbers follow the axioms of probability? Because if you do otherwise and base decisions on those beliefs, you will do stupid things. As a simple example, making bets consistent with a probability model where the probabilities do not sum to 1 makes you exploitable. Let's say you're buying three options, each of which pays out 100€ if the winner of the 2036 US presidential election is EterniTrump, <a href="https://en.wikipedia.org/wiki/GPT-3">GPT</a>-7, or Xi Jinping respectively, and pay 40€ for each (consistent with assigning a probability of at least 0.4 to each event occurring). Since at most one of them can win, you pay 120€ to receive at most 100€, so you're sure to be down at least 20€ that you could've spent on underground bunkers instead.)</p>
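<p>The bookkeeping for this bet, written out as a small sketch (the variable names are just for illustration):</p>

```python
price = 40      # euros paid for each option
payout = 100    # euros received if that option's candidate wins
n_options = 3

total_cost = price * n_options
# At most one candidate can win, so at most one option pays out.
best_case = payout - total_cost    # one of the three wins
worst_case = 0 - total_cost        # none of them wins
implied_total = n_options * price / payout

print(implied_total)          # about 1.2: the implied probabilities sum to more than 1
print(best_case, worst_case)  # -20 -120
```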
<p>In Bayesian statistics, you don't perform arcane statistical tests to reject hypotheses. Your subjective beliefs about something are a probability distribution (or at least they should be, if you want to reason perfectly). Once you've internalised the idea of what a probability distribution means, and know how to reason about updates to that probability distribution rather than in black-and-white terms of absolute truth or falsehood, Bayesianism is intuitive and will make your reasoning about probabilistic things (i.e., everything except pure maths) better.</p>
<p>(Why is Bayesianism named after Bayes? Bayes invented Bayes' theorem but not Bayesianism; however, Bayesian updating using Bayes' theorem is the core part of ideal Bayesian reasoning.)</p>
<p>There's one tricky part of Bayesianism, and it's a consequence of the Bayesian insistence that subjective uncertainty is represented by a probability distribution, and hence quantified. It's this: you always need to start with a quantified probability distribution (called a prior), even before you've seen any data.</p>
<p>There's a clear regress here, at least philosophically. Sure, you might be able to come up with a sensible prior for how effective masks are against a respiratory disease, but ask a baby for <script type="math/tex">P(\frac{P(\text{covid} | \text{mask})}{P(\text{covid}|\neg \text{mask})} = r)</script> and you're not likely to get a coherent answer (and remember that your current prior should come from baby-you's prior in an unbroken series of Bayesian updates) – let alone if we're imagining some hypothetical platonic being existing beyond time and space who has never seen any data, or the <a href="https://www.theguardian.com/world/2020/apr/07/face-masks-cannot-stop-healthy-people-getting-covid-19-says-who">World Health Organisation</a>.</p>
<p>In practice, however, I don't think this is very worrying. Priors formalise the idea that you can apply background knowledge even when you don't have data for the specific case in front of you. Reject the use of priors, and you'll fall into another regress: "study suggests mask-wearing effective against the coronavirus variant in 40-60 year-old European females in green t-shirts; no information yet on 40-60 year-old European females in red t-shirts ..."</p>
<h4>Computational Bayes</h4>
<p>In general, the scenario we have when doing a Bayesian calculation is that there's some model <script type="math/tex">X</script> that depends on parameter(s) <script type="math/tex">\theta</script>, and we want to find what those parameters are given some sample <script type="math/tex">x</script> from <script type="math/tex">X</script> (since this is Bayesian, we have to assume that <script type="math/tex">\theta</script> itself is a value of the random variable <script type="math/tex">\Theta</script> describing the probabilities of each possible <script type="math/tex">\theta</script>). Now we could do this mathematically by calculating</p>
<div cid="n375" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n375" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-2" type="math/tex; mode=display">\Pr_\Theta(\theta \, | \, X=x) = c \Pr_X(x | \Theta = \theta) \Pr_\Theta(\theta),</script></div></div>
<p>and then finding the constant <script type="math/tex">c</script> by integration, using the rule that probabilities must sum to 1. (Remember the interpretation of these terms: <script type="math/tex">\Pr_\Theta(\theta)</script> is the prior distribution we assume for <script type="math/tex">\Theta</script> before seeing evidence; <script type="math/tex">\Pr_\Theta(\theta \, | \, X=x)</script> is the posterior distribution after seeing the data; see the <a href="http://strataoftheworld.blogspot.com/2020/12/data-science-1.html">previous post</a> for some intuition on Bayes if these aren't clear to you.)</p>
<p>However, maybe some part of this (especially the integration) would be tricky, or you just happen to have a Jupyter notebook open on your computer. In any case, we can go about things in a different way, as long as we have a way to generate samples from our prior distribution and re-weight them appropriately.</p>
<p>The first thing we do is represent the prior distribution of <script type="math/tex">\Theta</script> by sampling it many times. We don't need an equation for it, just some function (in the programming sense) that pulls from it.</p>
<p>Next, consider the impact of our data on the estimates. We can imagine each sample we took as a representation of a tiny blob of probability mass corresponding to some particular <script type="math/tex">\theta_i</script>, and imagine rescaling it in the same way that we rescaled the odds of various outcomes when talking about the odds ratio form of Bayes' rule in the first post. How much do we rescale it by? By the likelihood of observing <script type="math/tex">x</script> if <script type="math/tex">\Theta=\theta_i</script>: this is the <script type="math/tex">\Pr_X(x|\Theta=\theta)</script> term in the above equation.</p>
<p>Finally, we need to do the scaling. Thankfully, this doesn't take integration, since we can calculate the sum of our re-weighted likelihoods and just divide all our scaled values by that – boom, we have (an approximation of) a posterior probability distribution.</p>
<p>To make things concrete, let's write code and visualise a simple case: estimating the probability that a coin lands heads. The first step in Bayesian calculations is usually the trickiest: we need a prior. For simplicity, let's say our prior is that the coin has an equal chance of having every possible probability (so the real numbers 0 to 1) of coming up heads.</p>
<p>(The fact that the thing we're estimating is itself a probability doesn't matter; don't be confused by the fact that we have two sorts of probability – our knowledge about the coin's probability of coming up heads, represented as a probability distribution, and the probability that the coin comes up heads (an empirical fact you can measure by throwing it many times). Equally well we might have talked about some non-probabilistic feature of the coin, like its diameter, but that would be a lot more boring.)</p>
<p>To write this out in actual Python, the first step (after importing NumPy for vectorised calculation and Matplotlib for the graphing we'll do later) is some way to generate samples from this distribution:</p>
<pre><code class="language-python" lang="python">import numpy as np
import matplotlib.pyplot as plt

def prior_sample(n):
    return np.random.uniform(size=n)
</code></pre>
<p>(<code>np.random.uniform(size=n)</code> returns <code>n</code> samples from a uniform distribution over the range 0 to 1.)</p>
<p>To calculate the posterior:</p>
<pre><code class="language-python" lang="python">def posterior(sample, throws, heads):
    """ This function calculates an approximation of the
    posterior distribution after seeing the coin
    thrown a certain number of times;
    sample is a sample of our prior distribution,
    throws is how many times we've thrown the coin,
    heads is how many times it has come up heads."""
    # The number of times the coin lands heads follows a binomial distribution.
    # Thus, below we reweight using a binomial pmf:
    # (note that we drop the throws-Choose-heads term because it's a constant
    # and we rescale at the end anyway)
    weighted_sample = sample ** heads * (1 - sample) ** (throws - heads)

    # Divide by the sum of every element in the weighted sample to normalise:
    return weighted_sample / np.sum(weighted_sample)
</code></pre>
<p>(Remember that the calculation of <code>weighted_sample</code> is done on every term in the <code>sample</code> array separately, in the standard vectorised way.)</p>
<p>Now we can generate a sample to model the prior distribution, and plot the posterior as a histogram of that sample weighted by the posterior weights:</p>
<pre><code class="language-python" lang="python">N = 100000
throws = 100
heads = 20
sample = prior_sample(N)  # model the prior distribution
# Plot a histogram:
plt.hist(sample,
         # split the range 0-1 using 50 evenly spaced bin edges:
         np.linspace(0, 1, 50),
         # weight each sample by its normalised posterior weight:
         weights=posterior(sample, throws, heads))
plt.show()
</code></pre>
<p>The result will look something like this:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnaDPHI8IQRNnXctloPUpNGq-JFN71PqXFPGSkY4mCeUhZs4jKZ11ZL0bIIv6g-F1IWnMVNvylXkiHyp-D3dmzB5qQNiBuhIhyCBzRjpvJr4hZRFNTDU2Nv8VzIUVBfRVWJtNauHYT_4vx/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="496" data-original-width="802" height="396" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnaDPHI8IQRNnXctloPUpNGq-JFN71PqXFPGSkY4mCeUhZs4jKZ11ZL0bIIv6g-F1IWnMVNvylXkiHyp-D3dmzB5qQNiBuhIhyCBzRjpvJr4hZRFNTDU2Nv8VzIUVBfRVWJtNauHYT_4vx/w640-h396/postex.png" width="640" /></a></div><br /><p></p>
<p>This is an approximation of the posterior probability distribution after seeing 100 throws and 20 heads. We see that most of the probability mass is clustered around a probability of 0.2 of landing heads; the chance of it being a fair coin is negligible.</p>
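<p>One sanity check, sketched here as an addition to the walkthrough above: with a uniform prior and binomial data, the exact posterior is known to be a Beta distribution with parameters (heads + 1, throws − heads + 1), whose mean is (heads + 1) / (throws + 2). The sampled approximation should agree with that. The snippet repeats the reweighting step so that it runs on its own.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
throws, heads = 100, 20

sample = rng.uniform(size=100_000)                        # uniform prior
weights = sample ** heads * (1 - sample) ** (throws - heads)
weights /= weights.sum()                                  # normalise

mc_mean = np.sum(weights * sample)        # posterior mean from the samples
exact_mean = (heads + 1) / (throws + 2)   # mean of Beta(heads + 1, throws - heads + 1)
# Posterior mass within 0.05 of a fair coin:
mass_near_fair = weights[np.abs(sample - 0.5) < 0.05].sum()
print(mc_mean, exact_mean, mass_near_fair)
```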
<p>What if we had a different prior? Let's say we're reasonably sure it's roughly a standard coin, and model our prior for the probability of landing heads as a normal distribution with mean 0.5 and standard deviation 0.1. To visualise this prior, here's a histogram of 100k samples from it:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ2wwQh11NdM7MFgDjbCOm2ZjVjAIYRZncEF0DiAoPL-m-5SvYLH0BuNqztbU9nXHuJQyOWnwOwCZ2HJYkPAtCbCtGuwaSmB6m9C7JrVo9ZL9RO88hllV6vRaitbXejPEVYAqGorjY10Bj/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="488" data-original-width="800" height="244" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ2wwQh11NdM7MFgDjbCOm2ZjVjAIYRZncEF0DiAoPL-m-5SvYLH0BuNqztbU9nXHuJQyOWnwOwCZ2HJYkPAtCbCtGuwaSmB6m9C7JrVo9ZL9RO88hllV6vRaitbXejPEVYAqGorjY10Bj/w400-h244/normex.png" width="400" /></a></div><br /><p></p>The posterior distribution looks almost identical to our previous posterior:
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWUgUFJxv6eDYf1PqS0TNabYdVtS6Mtjjmr3akq9Hq8Y1qyqQum3lusbeEyN_gaz63XTYc4pbRYCfqyEd2pVcLEgTtN6mZt_hSsEXv4Q7rw4t97jfm1TJPPwWZigE4_7WC8xPJPiYeqjno/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="498" data-original-width="1000" height="318" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWUgUFJxv6eDYf1PqS0TNabYdVtS6Mtjjmr3akq9Hq8Y1qyqQum3lusbeEyN_gaz63XTYc4pbRYCfqyEd2pVcLEgTtN6mZt_hSsEXv4Q7rw4t97jfm1TJPPwWZigE4_7WC8xPJPiYeqjno/w640-h318/postex2.png" width="640" /></a></div><br /><p></p>
<p>There's simply so much data (a hundred throws) that even very different priors will have converged on what the data indicates.</p>
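<p>The code for this prior isn't shown above, so here is one possible sketch. The name <code>prior_sample_2</code> is hypothetical, and throwing away the rare draws outside the range 0 to 1 is an assumption about how to keep the samples valid probabilities, not necessarily how the plots above were made. Printing the posterior mean is one way to check how close this posterior is to the one from the uniform prior.</p>

```python
import numpy as np

def prior_sample_2(n, mean=0.5, sd=0.1, rng=None):
    """Draw n samples from a normal prior, keeping only values in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.empty(0)
    while samples.size < n:
        draw = rng.normal(mean, sd, size=n)
        # A probability must lie in [0, 1]; discard the (rare) draws outside.
        samples = np.concatenate((samples, draw[(draw >= 0) & (draw <= 1)]))
    return samples[:n]

rng = np.random.default_rng(0)
throws, heads = 100, 20
sample = prior_sample_2(100_000, rng=rng)
weights = sample ** heads * (1 - sample) ** (throws - heads)
weights /= weights.sum()
posterior_mean = np.sum(weights * sample)
print("posterior mean:", posterior_mean)
```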
<p>A normal distribution might not be a very good model, though. Say we think there's a 49.5% chance the coin is fair, a 49.5% chance it's been rigged to come up tails with a probability arbitrarily close to 1, and the remaining 1% is spread uniformly between 0 and 1 (be very careful about assigning zero probability to something!). Then our prior distribution might be coded like this:</p>
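<pre><code class="language-python" lang="python">def prior_sample_3(n):
    m = n // 100       # 1% of the samples: uniform over 0 to 1
    k = (n - m) // 2   # half of the rest: rigged to always land tails
    return np.concatenate((np.random.uniform(size=m),
                           np.zeros(k),
                           # the remaining samples: an exactly fair coin
                           np.full(n - m - k, 0.5)),
                          axis=0)
</code></pre>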
<p>and 100k samples might be distributed like this:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj90m3jP-4kl-7dEIlwe0pP2aeO-FHiD5kx_kUXC6KNn8ItKNscHnoOf4xgYpUInfGbhs_jOCx2lymNtI7F8yAzHtBMVpewGGXkJpgaUmTBkS5uwPSsi5jtHw8c9GRq8RwCOmsfI6tlGfR8/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="494" data-original-width="800" height="396" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj90m3jP-4kl-7dEIlwe0pP2aeO-FHiD5kx_kUXC6KNn8ItKNscHnoOf4xgYpUInfGbhs_jOCx2lymNtI7F8yAzHtBMVpewGGXkJpgaUmTBkS5uwPSsi5jtHw8c9GRq8RwCOmsfI6tlGfR8/w640-h396/priorex.png" width="640" /></a></div><br /><br /><p></p>Let's also say we have less data than before – the coin has come up heads 8 times out of 40, say. Now our posterior distribution looks like this:
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDKhmd_d2B-8swp-NFGdKIAGTE4Qr2rdVfRerj-rKuStIfIDGaEuTZCzBLUcpnKKAWX4nKDlpOE1fzDahBNIha3li_lATGpWK4Yh7FZJUTtEKnKKu3dS3N1fxTKMhd7H-dyCY4XYeu5vOA/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="496" data-original-width="800" height="396" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDKhmd_d2B-8swp-NFGdKIAGTE4Qr2rdVfRerj-rKuStIfIDGaEuTZCzBLUcpnKKAWX4nKDlpOE1fzDahBNIha3li_lATGpWK4Yh7FZJUTtEKnKKu3dS3N1fxTKMhd7H-dyCY4XYeu5vOA/w640-h396/postex3.png" width="640" /></a></div><br /><p></p>We've ruled out that the coin is rigged (a single heads was enough to nuke the likelihood of a completely rigged coin to zero – be very careful about assigning a probability of zero to something!), and most of the probability mass has shifted to a probability of landing heads of around 20%, as before, but because our prior was different, a noticeable chunk of our expectation is still that the coin is exactly fair.
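<p>To put rough numbers on "a noticeable chunk", here is a self-contained sketch that rebuilds the three-part prior described above (1% uniform, about 49.5% always tails, about 49.5% exactly fair) and adds up the posterior weight on each of the two special cases. The larger sample size is an arbitrary choice to cut down on noise, since only 1% of the samples fall in the uniform part.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, throws, heads = 1_000_000, 40, 8

m = n // 100        # 1%: uniform over 0 to 1
k = (n - m) // 2    # about 49.5%: always tails (probability of heads is 0)
sample = np.concatenate((rng.uniform(size=m),
                         np.zeros(k),
                         np.full(n - m - k, 0.5)))   # the rest: exactly fair

weights = sample ** heads * (1 - sample) ** (throws - heads)
weights /= weights.sum()

mass_rigged = weights[sample == 0.0].sum()
mass_fair = weights[sample == 0.5].sum()
print("P(always tails):", mass_rigged)
print("P(exactly fair):", mass_fair)
```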
<p>As a final example, here's a big flowchart showing how the probability you should assign to different odds of the coin coming up heads shifts as you get data (red = tails, green = heads) up to 5 coin throws, assuming a prior that's the uniform distribution:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFa1MJXFjWkwVsgCMuND073-Dl6v6UpH2Z4InbU_mPD7fi7ZozYqiesmuv3ZenIkL-B9jUMgTgFWVj0Ph1ibyf1xYQ0yWNZj9hy93C9OQJh-crvCjGXOUp20epca9kbd97_g2mwXin4w6n/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1667" data-original-width="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFa1MJXFjWkwVsgCMuND073-Dl6v6UpH2Z4InbU_mPD7fi7ZozYqiesmuv3ZenIkL-B9jUMgTgFWVj0Ph1ibyf1xYQ0yWNZj9hy93C9OQJh-crvCjGXOUp20epca9kbd97_g2mwXin4w6n/s16000/bayesucompressed.png" /></a></div></div><p></p>Two questions to think about, one simple and on-topic, the other open-ended and off-topic:
<ul>
<li>What is the simple function giving, within a constant, the posterior distribution after <script type="math/tex">n</script> heads and 0 tails? What about for <script type="math/tex">n</script> tails and 0 heads?</li>
<li>Doesn't the coin-throwing diagram look like Pascal's triangle? What's the connection between normal distributions, Pascal's triangle, and the central limit theorem (i.e., that the sum of sufficiently many independent copies of almost any random variable is distributed roughly normally)? What extensions of Pascal's triangle can you think of, possibly with probabilistic interpretations?</li>
</ul>
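<p>To make the flowchart concrete, here is a minimal sketch of the same update done numerically on a grid of candidate values for the probability of heads. The grid size and the particular sequence of throws are arbitrary choices made for illustration, not values from the diagram.</p>

```python
import numpy as np

# Grid of candidate values for h, the probability of heads.
grid = np.linspace(0, 1, 101)

def update(belief, flip):
    """Multiply the current belief by the likelihood of one flip, then renormalise."""
    likelihood = grid if flip == "H" else 1 - grid
    posterior = belief * likelihood
    return posterior / posterior.sum()

# Uniform prior: every grid value starts with the same probability mass.
belief = np.full(grid.size, 1 / grid.size)

# A hypothetical sequence of five throws (one path through the flowchart).
for flip in "HTHHT":
    belief = update(belief, flip)

print(f"most probable h on the grid: {grid[np.argmax(belief)]:.2f}")
```

<p>Each throw only ever multiplies the current belief by a likelihood and rescales it, which is why the order of the throws does not matter for the final posterior.</p>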
<h3>Frequentism</h3>
<p>Frequentists try to banish the subjectivity out of probability. The probability of event <script type="math/tex">E</script> is not a statement about subjective belief, but an empirical fact: given <script type="math/tex">n</script> trials, what is the fraction of times that <script type="math/tex">E</script> comes up, in the limit as <script type="math/tex">n \rightarrow \infty</script>? And ditch the Bayesian idea of doing nothing but shifting around the probability mass we assign to different beliefs; once you've done a statistical test, you either reject or fail to reject the null hypothesis.</p>
<p>A standard frequentist tool is hypothesis testing with a <script type="math/tex">p</script>-value. The procedure looks like this:</p>
<ol start="">
<li>Pick a null hypothesis (usually denoted <script type="math/tex">H_0</script>). (For example, <script type="math/tex">H_0</script> could be that a coin is fair; that is, that the probability <script type="math/tex">h</script> of it coming up heads is 0.5.)</li>
<li>Pick a test statistic: a function <script type="math/tex">t</script> from the dataset <script type="math/tex">x</script> to a number. (For example, the maximum likelihood estimator for <script type="math/tex">h</script>, using the fact that we expect the number of heads to follow a binomial distribution with parameters for the number of throws and the probability <script type="math/tex">h</script>.)</li>
<li>Figure out a model for, or a way to sample from, the distribution of possible datasets given that <script type="math/tex">H_0</script> is true. (For example, we might write code to generate synthetic datasets <script type="math/tex">X^*</script> of the same size as <script type="math/tex">x</script> based on <script type="math/tex">h=0.5</script>.)</li>
<li>Find the probability of the test statistic <script type="math/tex">t</script> returning a result that is as extreme as or more extreme than <script type="math/tex">t(x)</script>. We might do this using fancy maths that gives us cumulative distribution functions based on the model from the previous step, or by having our code generate many synthetic datasets <script type="math/tex">X^*</script>, calculate <script type="math/tex">t(X^*)</script> for each of them, and see how <script type="math/tex">t(x)</script> compares – what percentile of extremeness is it in? The answer is called the <script type="math/tex">p</script>-value.</li>
</ol>
<p>(What is "more extreme"? That depends on our null hypothesis. If both low and high values of <script type="math/tex">t(x)</script> are evidence against <script type="math/tex">H_0</script> – as in our example – then we use a two-tailed test; if <script type="math/tex">t(x)</script> is at the 90th percentile of the <script type="math/tex">t(X^*)</script> distribution, then values of <script type="math/tex">t(X^*)</script> in both the top 10% and the bottom 10% count as at least as extreme as the value we got, and <script type="math/tex">p=0.2</script>. If only low or only high values are evidence against <script type="math/tex">H_0</script>, then we use a one-tailed test. Say only high values are evidence against <script type="math/tex">H_0</script> and <script type="math/tex">t(x)</script> is at the 90th percentile; then <script type="math/tex">p=0.1</script>.)</p>
<p>Here's some example code to calculate a <script type="math/tex">p</script>-value, using random simulation:</p>
<pre><code class="language-python" lang="python"># Import NumPy and graphing library:
import numpy as np
import matplotlib.pyplot as plt
# Define our null hypothesis:
h0_h = 0.5 # the value of h under the null hypothesis
# Define the data we've gotten:
throws = 50
heads = 20
# Generate an array for it:
data = np.concatenate((np.zeros(throws - heads), np.ones(heads)), axis = 0)
def t(x): # test statistic function
    return np.mean(x)
    # ^ this is the MLE for the binomial distribution
def synth_x(n, p):
    # Create a synthetic dataset of some size n, assuming some p
    return np.random.binomial(1, p, size=n)
# Take a lot of samples from the distribution of t(X*)
# (where X* is a synthetic dataset):
t_sample = np.array([t(synth_x(throws, h0_h)) for _ in range(100000)])
# Calculate the p-value, using a two-tailed test:
p1 = np.mean(t_sample >= t(data))
p2 = np.mean(t_sample <= t(data))
p = 2 * min(p1, p2)
# Display p-value
print(f"p-value is {p}") # about 0.20 in this case
# Plot a histogram:
plt.hist(t_sample, bins=50, range=[0,1])
plt.axvline(x=t(data), color='black') # draw a line to show where t(data) falls
</code> </pre><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcl7O2UFaKudb1NeLWWmKKyvRumFtQF2lVYyWp8ZZXh_P7_0jSdgrrNoTa_JeA3ZhVo6xQvaOpZWhjM8wMOTd9d_EAlVFb6n3kwJvYEhDCP4s6mFcJ5kv9yRbqiup42mdlMd400xLxmUVd/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="508" data-original-width="800" height="406" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcl7O2UFaKudb1NeLWWmKKyvRumFtQF2lVYyWp8ZZXh_P7_0jSdgrrNoTa_JeA3ZhVo6xQvaOpZWhjM8wMOTd9d_EAlVFb6n3kwJvYEhDCP4s6mFcJ5kv9yRbqiup42mdlMd400xLxmUVd/w640-h406/pval.png" width="640" /></a></div><br /><p></p>The main tricky part in the code is the calculation of the <script type="math/tex">p</script>-value. A neat way to do it is the following: observe that a two-tailed <script type="math/tex">p</script>-value is either twice the fraction of (synthetic) datasets with a test statistic at or below <script type="math/tex">t(x)</script> (in the case that the observation ended up on the lower side of the distribution of synthetic datasets), or twice the fraction of (synthetic) datasets with a test statistic at or above it. Taking twice the smaller of the two fractions covers both cases.
<p>Now, what exactly is a <script type="math/tex">p</script>-value? It's tempting to think of the <script type="math/tex">p</script>-value as the probability that the null hypothesis is correct: that is, that <script type="math/tex">p=0.05</script> means there's only a 5% chance the null hypothesis is true. However, what a <script type="math/tex">p</script>-value actually tells you is this: assuming that your null hypothesis is true (and you can correctly model the distribution of data you'd get if it is), what is the probability of getting a result at least as extreme as your data? In maths: </p>
<div cid="n434" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n434" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-3" type="math/tex; mode=display">p\text{-value} \ne P(H_0 \text{ is correct}), (!!)</script></div></div>
<p>but instead</p>
<div cid="n436" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n436" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-4" type="math/tex; mode=display">p\text{-value} = P(t(X^*) \geq t(x)),</script></div></div>
<p>for a right-tailed test (flip the <script type="math/tex">\geq</script> for a left-tailed test), where <script type="math/tex">X^*</script> is assumed drawn from the distribution resulting from assuming the null hypothesis <script type="math/tex">H_0</script>, or</p>
<div cid="n438" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n438" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-5" type="math/tex; mode=display">P(|t'(X^*)| \geq |t'(x)|),</script></div></div>
<p>for a two-tailed test, where <script type="math/tex">t'</script> is the test statistic function, but shifted so that the median <script type="math/tex">H_0</script> value is 0, so that we can just take absolute value to get an extremeness measure (for example, in the code above we'd subtract a 0.5 from the current definition of <code>t(x)</code>, since this is the median for the null hypothesis that the probability of heads is one-half).</p>
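<p>For the coin example the null distribution is simple enough that the two-tailed probability can be computed exactly rather than simulated. The sketch below is an illustration, not part of the simulation code above: it works with head counts instead of means (to avoid floating-point comparisons) and compares the "double the smaller tail" shortcut with the absolute-deviation definition. The helper name <code>binom_pmf</code> is made up for this example.</p>

```python
from math import comb

throws, heads, h0 = 50, 20, 0.5

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n throws when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

pmf = [binom_pmf(k, throws, h0) for k in range(throws + 1)]

# Shortcut: twice the smaller of the two one-sided tail probabilities.
p_low = sum(pmf[: heads + 1])    # P(count <= observed count)
p_high = sum(pmf[heads:])        # P(count >= observed count)
p_doubled = 2 * min(p_low, p_high)

# Definition: probability that a synthetic count is at least as far from
# the null median (25 heads) as the observed count is.
centre = throws * h0
p_abs = sum(q for k, q in enumerate(pmf)
            if abs(k - centre) >= abs(heads - centre))

print(f"doubled tail: {p_doubled:.4f}, absolute deviation: {p_abs:.4f}")
```

<p>The two numbers agree here because the null distribution is symmetric around its median; for a skewed null distribution they can differ, and which one to report is a matter of convention.</p>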
<h2>Probability bounds</h2>
<p>Sometimes it's useful to be able to quickly estimate a bound on some probability or expectation. Here are some examples, with quick proofs.</p>
<h4>Markov's inequality</h4>
<p>For <script type="math/tex">a > 0</script>, if <script type="math/tex">X</script> takes only non-negative numerical values,</p>
<div cid="n444" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n444" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1">
<div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-212" type="math/tex; mode=display">P(X \geq a) \leq \frac{E(X)}{a}.</script>
</div></div>
<p>Why?</p>
<p><b>Short proof</b>: Given <script type="math/tex">X \geq 0</script>, <script type="math/tex">X \geq 1_{X \geq a} \cdot a</script> (can be seen by considering cases <script type="math/tex">X < a</script>, <script type="math/tex">X=a</script>, and <script type="math/tex">X > a</script>), so, rearranging, <script type="math/tex">1_{X \geq a} \leq X/ a</script>. Taking the expectation on both sides we get <script type="math/tex">E(1_{X \geq a}) \leq E(X) / a</script>, and <script type="math/tex">E(1_{X \geq a}) = P(X \geq a)</script>. <script type="math/tex">\square</script></p>
<p><b>Intuitive proof</b>: let's say you want to draw a probability density function to maximise <script type="math/tex">P(X \geq a)</script>, given some value of the expectation <script type="math/tex">E(X)</script> (and given that <script type="math/tex">X</script> only takes non-negative values). Any probability density assigned to values greater than <script type="math/tex">a</script> is more expensive in terms of expectation increase than assigning it exactly at <script type="math/tex">a</script>, and has an identical effect on <script type="math/tex">P(X \geq a)</script>. So to maximise <script type="math/tex">P(X \geq a)</script>, assign as much probability density as you can to <script type="math/tex">a</script>, and none to values greater than <script type="math/tex">a</script>. Given the restriction that <script type="math/tex">X</script> cannot take negative values, the lowest value you can assign any probability to (to balance out the expectation if <script type="math/tex">a > E(X)</script>) is 0. If we allocate <script type="math/tex">p_1</script> to <script type="math/tex">X=0</script> and <script type="math/tex">p_2</script> to <script type="math/tex">X=a</script>, then to match the expectation <script type="math/tex">E(X)</script> we must have</p>
<div cid="n447" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n447" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-7" type="math/tex; mode=display">p_1 \cdot 0 + p_2 \cdot a = E(X),</script></div></div>
<p>or</p>
<div cid="n449" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n449" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-8" type="math/tex; mode=display">p_2 = P(X\geq a) = \frac{E(X)}{a}</script></div></div>
<p>in the maximal scenario; any other pdf we draw must have <script type="math/tex">P(X \geq a)</script> smaller.</p>
<p>The above equation can also be interpreted as saying that, in a dataset of non-negative values, the fraction of values that are at least <script type="math/tex">k=a/E(X)</script> times the average can be at most <script type="math/tex">1/k</script> (i.e. <script type="math/tex">E(X)/a</script>). For example, at most half of people can have at least twice the average income.</p>
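<p>A quick numerical sanity check of Markov's inequality, as a sketch: the exponential distribution, its scale, and the thresholds are arbitrary choices for illustration. Any non-negative sample would do, since the inequality also holds for the empirical distribution of the sample itself.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-negative samples; the exponential distribution is an arbitrary choice.
samples = rng.exponential(scale=2.0, size=100_000)
mean = samples.mean()

thresholds = [1.0, 2.0, 4.0, 8.0]
for a in thresholds:
    fraction = np.mean(samples >= a)   # empirical P(X >= a)
    bound = mean / a                   # Markov bound E(X) / a
    print(f"a = {a:>4}: P(X >= a) = {fraction:.4f} <= {bound:.4f}")
```

<p>The bound is usually loose: it only becomes tight for the two-point distribution described in the intuitive proof.</p>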
<h4>Chebyshev's inequality</h4>
<p>(An extension of Markov's inequality.)</p>
<p>Let <script type="math/tex">X</script> be a random variable with variance <script type="math/tex">\sigma^2</script> and expected value <script type="math/tex">\mu</script>. Then, for any <script type="math/tex">x > 0</script>,</p>
<div cid="n455" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n455" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-9" type="math/tex; mode=display">P(|X-\mu| \geq x) \leq \frac{\sigma^2}{x^2},</script></div></div>
<p>since if <script type="math/tex">Y = (X-\mu)^2</script> then, by Markov's inequality,</p>
<div cid="n457" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n457" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-10" type="math/tex; mode=display">P(Y \geq x^2) \leq \frac{\mathbb{E}(Y)}{x^2} = \frac{\sigma^2}{x^2},</script></div></div>
<p>by the definition of variance as <script type="math/tex">\mathbb{E}((X - \mu)^2)</script>. Finally, taking the square root inside the probability expression, <script type="math/tex">P(Y \geq x^2)=P(|X-\mu| \geq x)</script>. <script type="math/tex">\square</script></p>
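<p>The same kind of check works for Chebyshev's inequality. This is a sketch with made-up parameters; the normal distribution is used only because it is convenient to sample from, and the bound makes no assumption about the shape of the distribution.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Samples from an arbitrary distribution with finite variance.
samples = rng.normal(loc=3.0, scale=2.0, size=100_000)
mu = samples.mean()
var = samples.var()   # population-style variance of the sample (ddof=0)

distances = [1.0, 2.0, 4.0, 6.0]
for x in distances:
    fraction = np.mean(np.abs(samples - mu) >= x)   # empirical P(|X - mu| >= x)
    bound = var / x**2                              # Chebyshev bound
    print(f"x = {x:>4}: P(|X - mu| >= x) = {fraction:.4f} <= {bound:.4f}")
```

<p>For small distances the bound exceeds 1 and says nothing; it only starts to bite once the distance is larger than one standard deviation.</p>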
<h4>Jensen's inequality</h4>
<p>Consider a concave function <script type="math/tex">f</script> and the values <script type="math/tex">E(f(X))</script> and <script type="math/tex">f(E(X))</script>, where <script type="math/tex">X</script> is (once again) a random variable.</p>
<p>Since <script type="math/tex">f</script> is concave, if we plot <script type="math/tex">y=f(x)</script> and the tangent line to <script type="math/tex">f</script> at some <script type="math/tex">x_0</script>, the tangent is an upper bound on <script type="math/tex">f(x)</script> for all <script type="math/tex">x</script>.</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBem3j-EEhHvckHBOkT0qvMSY082-AxqRyWoe9zc8dzMn7Ez_HkmNG3jiof_wDTFMax16KX_jzjtd5x2oHUZu2oRys2LMuzslDKPLfqxYrZ1pTOEkZp07jLY-c2EOQgmatus5TvWisnrI3/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="834" data-original-width="1000" height="333" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBem3j-EEhHvckHBOkT0qvMSY082-AxqRyWoe9zc8dzMn7Ez_HkmNG3jiof_wDTFMax16KX_jzjtd5x2oHUZu2oRys2LMuzslDKPLfqxYrZ1pTOEkZp07jLY-c2EOQgmatus5TvWisnrI3/w400-h333/jensen.png" width="400" /></a></div><br /><p></p>
<p>Let <script type="math/tex">E(X) = \mu</script>, and let the tangent line to <script type="math/tex">y=f(x)</script> at <script type="math/tex">x=\mu</script> be <script type="math/tex">y=mx+b</script>. We have that <script type="math/tex">f(x) \leq mx+b</script> for all <script type="math/tex">x</script>, and so <script type="math/tex">f(X) \leq mX+b</script>. Taking the expectation on both sides (and using linearity of expectation on the right),</p>
<div cid="n464" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n464" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-11" type="math/tex; mode=display">E(f(X)) \leq m \mu + b.</script></div></div>
<p>What is <script type="math/tex">m\mu +b</script>? It's the value of the tangent when it touches <script type="math/tex">f(x)</script> at <script type="math/tex">x=\mu</script>, and therefore it is also the value of <script type="math/tex">f</script> at <script type="math/tex">\mu</script>. Thus we can say</p>
<div cid="n466" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n466" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-12" type="math/tex; mode=display">E(f(X)) \leq f(E(X)). \square</script></div></div>
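<p>As a numerical illustration of Jensen's inequality for a concave function, the sketch below uses the logarithm and uniformly distributed positive samples. Both choices are assumptions made for the example; any concave function and any distribution on its domain would work.</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Positive samples, so that the (concave) logarithm is defined everywhere.
samples = rng.uniform(0.5, 5.0, size=100_000)

mean_of_f = np.mean(np.log(samples))   # E(f(X))
f_of_mean = np.log(np.mean(samples))   # f(E(X))

print(f"E(log X) = {mean_of_f:.4f} <= log(E(X)) = {f_of_mean:.4f}")
```

<p>For a convex function such as the square, the inequality flips direction.</p>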
<p> </p>
<h2>Probability systems</h2>
<h3>Causal diagrams</h3>
<p>The <a href="https://en.wikipedia.org/wiki/Perseverance_(rover)"><i>Perseverance</i></a> rover is due to land on Mars on February 18th, 2021, carrying a small helicopter called <a href="https://en.wikipedia.org/wiki/Mars_Helicopter_Ingenuity"><i>Ingenuity</i></a>, which will likely become the first aircraft to make a powered flight on a planet that's not Earth.</p>
<p>Imagine that <i>Perseverance</i> is currently known to be in a position <script type="math/tex">X</script> (where <script type="math/tex">X</script> is some random variable, as is any capital letter). <i>Ingenuity</i> has completed its first flight, starting from the location of <i>Perseverance</i> (which we know to a high degree of accuracy), but because of a Martian sandstorm we only have inaccurate readings of <i>Ingenuity</i>'s current location and need to locate it quickly to know if it's in a place where it's going to run out of power due to dust blocking its solar panels unless we do a risky manoeuvre with its propellers. Specifically, we have two in-flight readouts of its position, <script type="math/tex">R_1</script> and <script type="math/tex">R_2</script>, which are known to be its actual true position <script type="math/tex">Y_1</script> and <script type="math/tex">Y_2</script> at those times plus some random error modelled as a <script type="math/tex">\text{Normal}(0,\sigma_1^2)</script> distribution, and also similarly we have a more accurate readout <script type="math/tex">R_f</script> of its final position <script type="math/tex">Y_f</script>, this time with the error following <script type="math/tex">\text{Normal}(0, \sigma_2^2)</script>. We also model <script type="math/tex">Y_1</script> as being generated from <script type="math/tex">X</script> with a parameter <script type="math/tex">h_1</script> representing its starting heading and velocity (e.g. <script type="math/tex">h_1</script> is a vector and the model could be <script type="math/tex">Y_1 = X + h_1 + \epsilon</script>, where <script type="math/tex">\epsilon</script> is another normally distributed error term), and likewise we have parameters <script type="math/tex">h_2</script> and <script type="math/tex">h_f</script> that influence how <script type="math/tex">Y_2</script> and <script type="math/tex">Y_f</script> are generated from the preceding positions. 
We know that its initial battery level was <script type="math/tex">b_0</script>, and the battery level when it was at each of <script type="math/tex">Y_1</script>, <script type="math/tex">Y_2</script>, and <script type="math/tex">Y_f</script> is <script type="math/tex">B_1</script>, <script type="math/tex">B_2</script>, and <script type="math/tex">B_f</script>, where each of those is generated from the previous and the heading/velocity parameters <script type="math/tex">h_1</script>, <script type="math/tex">h_2</script>, and <script type="math/tex">h_f</script> (e.g. <script type="math/tex">B_2 = B_1 - (1 + \epsilon) |h_2|</script> – the amount of power lost is a normal error term plus a constant times the velocity). We need to find the probability that the next battery level <script type="math/tex">B_n</script>, a random variable generated from <script type="math/tex">B_f</script> (the previous level) and depending on <script type="math/tex">Y_f</script> (since storm intensity varies with position; say we have a function <script type="math/tex">s</script> that takes in positions and returns how much the dust will decrease power output and hence battery level at a particular position, then we might have <script type="math/tex">B_n = B_f - s(Y_f)</script>), is below a critical threshold <script type="math/tex">c</script>, given the starting <script type="math/tex">X</script>, and the position readings <script type="math/tex">R_1</script>, <script type="math/tex">R_2</script>, and <script type="math/tex">R_f</script>. Also the administrator of NASA is breathing down your neck because this is a 2 billion dollar mission, so better work fast and not make mistakes.</p>
<p>This problem seems almost intractably complicated. A handy way of making complex probability questions less unapproachable is to draw out a causal diagram: what are the key parameters, and which random variables are generated from which other ones? Here's an example for the above problem:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhASjj_vfL-ZNYHARSstBWz_nsYHItzCx8g_tXPExVECPuR8kulNihCFagO1rMq0Xs7gvVtahUZvqc8ZYepk8B6LLwhHaZ1_HI6KpAmX4O5VkZBQcDnVgjScJAH1k99v6SA9U7SQp6DmrUk/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="874" data-original-width="1280" height="438" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhASjj_vfL-ZNYHARSstBWz_nsYHItzCx8g_tXPExVECPuR8kulNihCFagO1rMq0Xs7gvVtahUZvqc8ZYepk8B6LLwhHaZ1_HI6KpAmX4O5VkZBQcDnVgjScJAH1k99v6SA9U7SQp6DmrUk/w640-h438/causaldiagram.png" width="640" /></a></div><br /><br /><p></p>
<p>Arrows indicate random variables being generated from others; dotted lines note important parameters (note that some parameters are missing – those of <script type="math/tex">X</script>, for example). The probability we were asked about is <script type="math/tex">P(B_n < c | X = x, R_1 = r_1, R_2 = r_2, R_f = r_f)</script>; it doesn't look so complicated when you have the causal relations visualised in front of you.</p>
<p>The rest of the solution is left as an exercise for the reader. Please be in touch with NASA in late February to get the values <script type="math/tex">x</script>, <script type="math/tex">r_1</script>, <script type="math/tex">r_2</script>, and <script type="math/tex">r_f</script>.</p>
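<p>For readers who would rather let a computer do the exercise, here is one possible sketch. It follows the arrows of the diagram to simulate many flights, then weights each simulated flight by how well it explains the position readings (a technique usually called likelihood weighting). Everything numeric here is invented: positions are one-dimensional, the storm function <code>s</code> is a made-up placeholder, and each leg's battery drain is assumed to depend on that leg's own heading parameter.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # number of simulated flights

# Every number below is invented for illustration (1D positions, arbitrary units).
x = 0.0                       # known starting position
h1, h2, hf = 1.0, 1.0, 0.5    # heading/velocity parameter for each leg
move_sd = 0.3                 # sd of the error in each movement step
drain_sd = 0.1                # sd of the error in each battery drain
sigma1, sigma2 = 0.5, 0.2     # sd of in-flight and final reading errors
b0, c = 5.0, 2.0              # initial battery level, critical threshold
r1, r2, rf = 1.2, 1.9, 2.6    # the three observed position readings

def s(y):
    """Hypothetical storm model: battery lost to dust at position y."""
    return 0.3 * np.abs(y)

def normal_pdf(value, mean, sd):
    """Density of Normal(mean, sd^2) evaluated at value."""
    return np.exp(-0.5 * ((value - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# 1. Sample the unobserved variables by following the arrows of the diagram.
y1 = x + h1 + rng.normal(0, move_sd, n)
y2 = y1 + h2 + rng.normal(0, move_sd, n)
yf = y2 + hf + rng.normal(0, move_sd, n)
b1 = b0 - (1 + rng.normal(0, drain_sd, n)) * abs(h1)
b2 = b1 - (1 + rng.normal(0, drain_sd, n)) * abs(h2)
bf = b2 - (1 + rng.normal(0, drain_sd, n)) * abs(hf)
bn = bf - s(yf)

# 2. Weight each simulated flight by the likelihood of the readings it implies.
weights = (normal_pdf(r1, y1, sigma1)
           * normal_pdf(r2, y2, sigma1)
           * normal_pdf(rf, yf, sigma2))

# 3. Weighted fraction of flights whose next battery level is below c.
prob = np.sum(weights * (bn < c)) / np.sum(weights)
print(f"Estimated P(B_n < c | readings) = {prob:.3f}")
```

<p>The structure of the code mirrors the diagram: each line in step 1 is one arrow, and each factor in step 2 is one observed node. With real readings that are far from where the model expects the helicopter to be, most weights would be close to zero and a smarter sampling scheme would be needed.</p>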
<h3>Markov chains</h3>
<p>A Markov chain has the following causal diagram:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgipHpjjukZ64PQyWzJP_gH5SuhPnH2PYj4R7_69K1nXtp4rMappbW4SIe1_6HPJ-PsiRKxL2Fm081c8AboOPTE2x6ds-Hd6CZwiBgziOBT_WzEww1mTBNNNuen9xII1nS0_Zcwh8L1eKvY/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="210" data-original-width="1000" height="84" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgipHpjjukZ64PQyWzJP_gH5SuhPnH2PYj4R7_69K1nXtp4rMappbW4SIe1_6HPJ-PsiRKxL2Fm081c8AboOPTE2x6ds-Hd6CZwiBgziOBT_WzEww1mTBNNNuen9xII1nS0_Zcwh8L1eKvY/w400-h84/markov.png" width="400" /></a></div><br /><p></p>In words: the <script type="math/tex">n</script>th state of a Markov chain is generated from the <script type="math/tex">(n-1)</script>th state.
<p>This might seem very restrictive. For example, the simplest text-generation Markov chain would just generate, say, one character based on the previous one, probably based on data for how often a letter follows another. It might tend to do some moderately reasonable things, like following "t" by "h" fairly often (assuming it was trained on English), but good luck getting anything too sensible out of it.</p>
<p>However, we can do a trick: generate letter <script type="math/tex">n</script> from the previous <script type="math/tex">k</script> letters. This seems like it's not a Markov chain; letter <script type="math/tex">X_n</script> depends on <script type="math/tex">X_{n-k}</script> through <script type="math/tex">X_{n-1}</script>. But we can define <script type="math/tex">Y_0=(X_0, X_1, ..., X_{k-1})</script>, <script type="math/tex">Y_1 = (X_1, X_2, ..., X_k)</script>, and so on, and now <script type="math/tex">Y_n</script> can be generated entirely from <script type="math/tex">Y_{n-1}</script>, and so the <script type="math/tex">Y</script>s form a Markov chain.</p>
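<p>Here is a small sketch of that trick for text generation. The state is the last <script type="math/tex">k</script> characters, and the next character is drawn according to how often it followed that state in the training text. The function names and the toy corpus are made up for this example.</p>

```python
import random
from collections import Counter, defaultdict

def build_chain(text, k):
    """Count which character follows each run of k characters in text."""
    chain = defaultdict(Counter)
    for i in range(len(text) - k):
        state = text[i:i + k]            # the previous k letters
        chain[state][text[i + k]] += 1   # the letter that came next
    return chain

def generate(chain, start, length, rng):
    """Extend start one character at a time, looking only at the last k characters."""
    k = len(start)
    out = start
    while len(out) < length:
        followers = chain.get(out[-k:])
        if not followers:                # this state never had a successor
            break
        chars, counts = zip(*followers.items())
        out += rng.choices(chars, weights=counts)[0]
    return out

corpus = "the theory of the thing is that the other thing thinks"
k = 3
chain = build_chain(corpus, k)
sample = generate(chain, corpus[:k], 40, random.Random(0))
print(sample)
```

<p>With a corpus this small the output mostly parrots the input; the same code with a larger corpus and a moderate <code>k</code> produces text that looks locally plausible but has no long-range structure, which is exactly the memorylessness at work.</p>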
<p>So on the one hand, we can do these sorts of tricks to use Markov chains even when it seems like the problem is too complex for them. But perhaps even more importantly, if you reduce something to a Markov chain, you can immediately apply a lot of nice mathematical results.</p>
<p>A Markov chain can be visualised with a state diagram. Here's one for a Markov chain representing traffic light transitions:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_Af8IBBNAIKebzAR2c05mPoChzAxSizfCamhi9GRrSgRfSsATDbgzaODaD6HRJUQEtT6_B6sk4vrvtWti78se0m0JPZ4bnyPbchKmWRCmjijuoUWUoQc7UZi5_uIeQwrUmziko4HSAZNo/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1098" data-original-width="1000" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_Af8IBBNAIKebzAR2c05mPoChzAxSizfCamhi9GRrSgRfSsATDbgzaODaD6HRJUQEtT6_B6sk4vrvtWti78se0m0JPZ4bnyPbchKmWRCmjijuoUWUoQc7UZi5_uIeQwrUmziko4HSAZNo/w365-h400/trafficlights1.png" width="365" /></a></div><br /><p></p>
<p>The same information can be described with a transition matrix, showing the probability of each transition happening:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5Cy6l7JNQR6Qjn2fiVvmkN5QtYAM3cWu8V8bM6o4gv6bUqWPlPkE8JWplEqVaa4s7TPJE7QoY_ueJnmFYafQmFm33mHgXQ0FK9M_QWkRbWiI07AfIBoBoOPfJdmqC-Y2WK8CFZ2N-HKuN/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="830" data-original-width="960" height="345" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5Cy6l7JNQR6Qjn2fiVvmkN5QtYAM3cWu8V8bM6o4gv6bUqWPlPkE8JWplEqVaa4s7TPJE7QoY_ueJnmFYafQmFm33mHgXQ0FK9M_QWkRbWiI07AfIBoBoOPfJdmqC-Y2WK8CFZ2N-HKuN/w400-h345/trafficmatrix1.png" width="400" /></a></div><br /><p></p>
<p>Note that this is a very boring Markov chain, because it's not probabilistic – every transition has probability 1. Thankfully, our traffic light engineer is willing to add some randomness for the sake of making the system more mathematically interesting. For example, they might change the system to look like this (showing both the state diagram and transition matrix):</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgouZEhdJSG5sR4CQ3pH6js2bxwJexB0MHMOgQ_UFswzzqKPXQgzAcVe9mlgFiEOl4ANJutf-lVe5AG0OcRhRggFMT7R83tVnt9dd2hIacu9g1-Rmzrb1talMwFepZ229RgxBWcdypKzJPq/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="690" data-original-width="1278" height="346" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgouZEhdJSG5sR4CQ3pH6js2bxwJexB0MHMOgQ_UFswzzqKPXQgzAcVe9mlgFiEOl4ANJutf-lVe5AG0OcRhRggFMT7R83tVnt9dd2hIacu9g1-Rmzrb1talMwFepZ229RgxBWcdypKzJPq/w640-h346/traffic2.png" width="640" /></a></div><p></p>
<p>Now there's a 10% chance that the yellow light before red is skipped, and a 40% chance that red-yellow moves back to red instead of going green.</p>
<p>The key property with Markov chain calculations is memorylessness: <script type="math/tex">X_n</script> depends only on <script type="math/tex">X_{n-1}</script>. If you can use this property, you can work out a lot of Markov chain problems. For example, let's say that <script type="math/tex">X_0 = \text{R}</script> (we'll use <script type="math/tex">\text{R, RY, G, Y}</script> to denote the states), and we want to find the probability that you'll actually get to drive in two state transitions from now – that is, <script type="math/tex">\mathbb{P}(X_2 = \text{G} \, | \, X_0 = \text{R})</script> (I use <script type="math/tex">\mathbb{P}</script> here to differentiate a probability expression from the transition matrix <script type="math/tex">P</script>). Doing some straightforward algebra, you can figure out that this probability is <script type="math/tex">P_{\text{R},\text{RY}} \cdot P_{\text{RY},\text{G}}</script> (where <script type="math/tex">P_{a,b}</script> is the spot in the matrix with row label (i.e. start state) <script type="math/tex">a</script> and column label (i.e. end state) <script type="math/tex">b</script>).</p>
<p>(Note that each row of the transition matrix is a probability distribution for the next state, starting from the state the row is labelled with. Writing it as a matrix is a trick for expressing the probability distribution from each state in the same mathematical object.)</p>
<p>More generally: for any transition matrix, <script type="math/tex">P_{a,b}</script> is <script type="math/tex">\mathbb{P}(X_n = b \, | X_{n-1} = a)</script>. Now consider entry <script type="math/tex">(a,b)</script> of <script type="math/tex">P^2</script>: by matrix multiplication, it is</p>
<div cid="n504" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n504" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-13" type="math/tex; mode=display">\sum_i P_{a,i}P_{i,b},</script></div></div>
<p> but by the definition of the transition matrix, this is the same as</p>
<div cid="n509" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n509" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-14" type="math/tex; mode=display">\sum_i
\mathbb{P}(X_{1} = i \,|\, X_{0} = a)
\mathbb{P}(X_{2} = b \,|\, X_{1} = i),</script></div></div>
<p>which is just summing up the probabilities of all paths through the state space that start at <script type="math/tex">a</script>, go to some <script type="math/tex">i</script>, and then end up at <script type="math/tex">b</script>; in other words, it is the probability that if you're at <script type="math/tex">a</script>, you end up at <script type="math/tex">b</script> after two state transitions.</p>
<p>You should be able to see that this extends more generally:</p>
<div cid="n516" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n516" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-15" type="math/tex; mode=display">\mathbb{P}(X_n = b \,|\,X_0 = a) = P^n_{a,b}.</script></div></div>
<p>Linear algebra comes to the rescue yet again; we've reduced the problem of finding the probability of going between any two states in a Markov chain's state space in <script type="math/tex">n</script> steps to the problem of raising a matrix to the <script type="math/tex">n</script>th power and looking up one entry in it.</p>
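<p>As a rough sketch of this result, here is the two-step calculation in NumPy. The transition probabilities below are invented for illustration (they are not taken from the traffic light model); only the state order R, RY, G, Y comes from the text.</p>

```python
import numpy as np

# Hypothetical transition matrix; rows are start states, columns are end states,
# both in the order R, RY, G, Y. The numbers are invented for illustration.
P = np.array([
    [0.6, 0.4, 0.0, 0.0],  # from R: stay, or move on to RY
    [0.0, 0.2, 0.8, 0.0],  # from RY: stay, or move on to G
    [0.0, 0.0, 0.7, 0.3],  # from G: stay, or move on to Y
    [0.5, 0.0, 0.0, 0.5],  # from Y: move on to R, or stay
])
R, RY, G, Y = range(4)

# Entry (a, b) of P^n is the probability of being in b after n steps from a.
P2 = np.linalg.matrix_power(P, 2)
two_step = P2[R, G]

# With these numbers the only two-step path from R to G goes through RY,
# so the matrix entry should agree with the product of the two steps.
single_path = P[R, RY] * P[RY, G]
print(two_step, single_path)
```

<p>When several intermediate states are possible, the matrix power adds up all the paths for you, which is the whole point of the trick.</p>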
<h4>Finding the stationary distribution</h4>
<p>Given a starting state in a Markov chain, we can't say for sure what state it will be in after <script type="math/tex">n</script> transitions (unless it's entirely deterministic, like our initial boring traffic light model), but we can calculate exactly what the probability distribution over the states will be. This is usually denoted as a vector <script type="math/tex">\pi</script>, with <script type="math/tex">\pi_a</script> being the probability we're in state <script type="math/tex">a</script>.</p>
<p>Here's something we might want to know: what is the stationary distribution; that is, how can we allocate probability mass amongst the different states in such a way that the total amount of probability mass in each state remains constant after a state transition?</p>
<p>Here's something you might ask: why is it interesting to know this? Perhaps most importantly, the stationary distribution of a Markov chain is the long-run average of time spent in each state (exercise: prove that this is the case); if you want to know how much time our probabilistic traffic lights will spend being green over a long period of time, you need to find the stationary distribution.</p>
<p>Now given our distribution <script type="math/tex">\pi</script> (note: it's a row vector, not a column vector) and transition matrix <script type="math/tex">P</script>, we can express the stationary distribution as the <script type="math/tex">\pi</script> that satisfies two conditions. First,</p>
<div cid="n539" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n539" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-16" type="math/tex; mode=display">\pi = \pi P.</script></div></div>
<p>This is the condition that <script type="math/tex">\pi</script> must remain unchanged when transformed by our transition matrix <script type="math/tex">P</script> during a state transition. You might have expected the transformation to be written <script type="math/tex">P \pi</script>; usually we'd express a matrix transforming a vector in this order. However, because of the way we've defined <script type="math/tex">P</script> – start states on the vertical axis, end states on the horizontal – we need to do it this way. Here's a visualisation, with the result vector in red:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKs5m0CSl-7mffC6c-2MXp1b5PJIv98FMU-6Hh6ZX2coz8H8BZ28FAAvu9im7Vi5OeIKbHM0g96ob_PH0FH6o3U4IEYiym8XsQefOpJTNSIMLPUrQ-r63um1Rdi7bhR7IBCDZ8-Ps081s7/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="608" data-original-width="958" height="254" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKs5m0CSl-7mffC6c-2MXp1b5PJIv98FMU-6Hh6ZX2coz8H8BZ28FAAvu9im7Vi5OeIKbHM0g96ob_PH0FH6o3U4IEYiym8XsQefOpJTNSIMLPUrQ-r63um1Rdi7bhR7IBCDZ8-Ps081s7/w400-h254/mmult.png" width="400" /></a></div></div><p></p>
<p>(Alternatively, we could take <script type="math/tex">\pi</script> as a column vector, flip the meanings of the rows and columns in <script type="math/tex">P</script>, and write <script type="math/tex">P\pi</script> – equivalent to transposing both of the current definitions of <script type="math/tex">\pi</script> and <script type="math/tex">P</script>.)</p>
<p>The second condition (can you see why it's necessary?), where <script type="math/tex">\pmb{1}</script> is a vector <script type="math/tex">(1,1,...,1,1)</script> of the required length, is</p>
<div cid="n556" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n556" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-17" type="math/tex; mode=display">\pi \cdot \pmb{1} = 1.</script></div></div>
<p>We can also write this as matrix multiplication, as long as we're clear about column and row vectors and transposing things as required. We can also be clever and write a single matrix that expresses both of these constraints, and then getting NumPy's linear algebra libraries to give us the answer becomes a single line of code.</p>
<p>(The second constraint is just the condition that any probability distribution sums to 1.) </p>
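<p>One possible way to do this (a sketch, not necessarily the exact one-liner meant above) is to stack both constraints into a single matrix and hand it to NumPy's least-squares solver. The function name <code>stationary_distribution</code> and the example matrix are made up for illustration.</p>

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi = pi P together with sum(pi) = 1 as one linear system."""
    n = P.shape[0]
    # pi (P - I) = 0 is the same as (P - I)^T pi^T = 0; the extra row of ones
    # encodes the constraint that the entries of pi sum to 1.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Hypothetical transition matrix (rows are start states), invented for illustration.
P = np.array([
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.2, 0.8, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.5, 0.0, 0.0, 0.5],
])
pi = stationary_distribution(P)
print(pi)
```

<p>Least squares is used here because the stacked system has one more equation than unknowns; when a unique stationary distribution exists, the system is consistent and the solver returns it.</p>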
<h5>Uniqueness of the stationary distribution</h5>
<p>Now for another question: when does a unique stationary distribution exist? You should be able to think of a state diagram for which there are an infinite number of stationary distributions.</p>
<p>For example:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMnO_42L4UGYn7KWUJOZWkn2sho9QSalClZ6ysCxybdVoxSEDac3RKRGhKyZZvryxNJRUCuBS9pMKhV9TierRR2iW8FNdOqYRr_veoz3yx57QDPporxKACXmYXWbqhrBY3hfupIDp-dADT/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="784" data-original-width="1280" height="392" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMnO_42L4UGYn7KWUJOZWkn2sho9QSalClZ6ysCxybdVoxSEDac3RKRGhKyZZvryxNJRUCuBS9pMKhV9TierRR2iW8FNdOqYRr_veoz3yx57QDPporxKACXmYXWbqhrBY3hfupIDp-dADT/w640-h392/stationary.png" width="640" /></a></div><p></p>
<p>The states <script type="math/tex">C</script>, <script type="math/tex">B</script>, and <script type="math/tex">D</script> (in the dotted red circle) and <script type="math/tex">E</script>, <script type="math/tex">F</script>, <script type="math/tex">G</script>, and <script type="math/tex">H</script> (in the dotted blue circle) are "independent", in the sense that you can never get from one set of states to the other. Imagine that for the state set <script type="math/tex">\{C, B, D\}</script>, we have a stationary distribution over only those states <script type="math/tex">\pmb{\pi}</script>, and another stationary distribution <script type="math/tex">\pmb{\rho}</script> over <script type="math/tex">\{E,F,G,H\}</script>. (Let each of these vectors have a slot for every state, but let it be zero for states outside the corresponding state set – <script type="math/tex">\pmb{\pi} = (0, \pi_b, \pi_c, \pi_d, 0, 0, 0, 0)</script>, for example.) Now, because there can be no probability mass flow between these two sets, we can see that any distribution <script type="math/tex">\pmb{\sigma} = a \pmb{\pi} + b \pmb{\rho}</script> is also a stationary distribution, provided that <script type="math/tex">a</script> and <script type="math/tex">b</script> are chosen such that <script type="math/tex">\pmb{\sigma} \cdot \pmb{1} = 1</script> (probability distributions sum to one!).</p>
<p>It turns out that for any finite state set where each state is reachable from all the others – i.e., if we represent the state diagram as a directed graph, the graph is strongly connected – there does exist a unique stationary distribution.</p>
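<p>To make the non-uniqueness concrete, here is a small sketch with a made-up four-state chain that splits into two closed sets (a smaller stand-in for the diagram, with invented numbers). Each mixture of the two block-wise stationary distributions is itself stationary.</p>

```python
import numpy as np

# Hypothetical chain with two closed sets of states, {0, 1} and {2, 3}:
# no probability mass ever flows between the two blocks.
P = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.9],
    [0.0, 0.0, 0.6, 0.4],
])

# Stationary distributions of each block, padded with zeros elsewhere.
pi = np.array([2 / 7, 5 / 7, 0.0, 0.0])
rho = np.array([0.0, 0.0, 0.4, 0.6])

def is_stationary(dist, P):
    """True if dist sums to one and is unchanged by one transition."""
    return bool(np.isclose(dist.sum(), 1.0) and np.allclose(dist @ P, dist))

# Any mixture a * pi + (1 - a) * rho with a between 0 and 1 is stationary too.
mixtures = [a * pi + (1 - a) * rho for a in (0.0, 0.25, 0.5, 1.0)]
print([is_stationary(m, P) for m in mixtures])
```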
<h5>Detailed balance</h5>
<p>Sometimes it doesn't take matrix calculations to find a stationary distribution. In the general case, the condition is that the probability mass flow into a state, from all other states, must equal the outflow to all other states. The simplest case in which this can happen is when, for any pair of states <script type="math/tex">a</script> and <script type="math/tex">b</script>, <script type="math/tex">a</script> sends as much probability mass to <script type="math/tex">b</script> upon a state transition as <script type="math/tex">b</script> sends to <script type="math/tex">a</script>. If we can ensure that this is true "locally" for each pair of states, then we don't have to do complex "global" optimisation over all states.</p>
<p>This condition is known as detailed balance. Mathematically, letting <script type="math/tex">\pi</script> be a distribution of probability mass over states and <script type="math/tex">P</script> be the transition matrix, we can express it as</p>
<div cid="n607" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n607" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1">
<div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-902" type="math/tex; mode=display">\pi_a P_{ab} = \pi_b P_{ba}, \text{ for all states } a \text{ and } b,</script>
</div></div>
<p>something that should be clear if you remember the interpretation of the transition matrix element <script type="math/tex">P_{ab}</script> as the probability of an <script type="math/tex">a \rightarrow b</script> transition.</p>
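<p>A minimal sketch of checking this condition numerically, using two made-up chains. The helper name <code>satisfies_detailed_balance</code> and all the numbers are illustrative assumptions.</p>

```python
import numpy as np

def satisfies_detailed_balance(pi, P):
    # flows[a, b] = pi_a * P_ab is the mass sent from a to b in one transition;
    # detailed balance says this matrix of flows is symmetric.
    flows = pi[:, None] * P
    return bool(np.allclose(flows, flows.T))

# Hypothetical two-state chain where the pairwise flows balance.
P_balanced = np.array([[0.9, 0.1],
                       [0.3, 0.7]])
pi_balanced = np.array([0.75, 0.25])

# Hypothetical three-state chain that deterministically cycles 0, 1, 2, 0, ...
# The uniform distribution is stationary, but mass only flows one way round.
P_cycle = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 0.0, 0.0]])
pi_cycle = np.full(3, 1 / 3)

print(satisfies_detailed_balance(pi_balanced, P_balanced))
print(satisfies_detailed_balance(pi_cycle, P_cycle))
```

<p>The second chain is a reminder that detailed balance is sufficient for stationarity but not necessary: the global inflow-equals-outflow condition can hold without every pair of states balancing.</p>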
<p>A final fun question: say we have an undirected graph and we consider a random walk over it (i.e., if we're at a given vertex, we take any edge going from it with equal probability). What is the stationary distribution over the states (i.e. the vertices of the graph)?</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-58263582811746502072020-12-31T14:43:00.019+00:002021-02-19T22:08:20.635+00:00Data science 1<center><p><span style="font-size: x-small;"><i>8.3k words, including equations (about 40 minutes)</i></span></p>
</center>
<p>This is an overview of fundamental ideas in data science, mostly based on <a href="https://www.cl.cam.ac.uk/teaching/2021/DataSci/materials.html">Damon Wischik's excellent data science course at Cambridge</a> (if using these notes for revision for that course, be aware that I don't cover all examinable things and cover some things that aren't examinable; the criterion for inclusion is interestingness, not examinability).</p>
<p>The basic question is this: we're given data; what can we say about the world based on it?</p>
<p>These notes are split into two parts due to length. In part 1:</p>
<ul>
<li><p>Notation</p>
</li>
<li><p>A few results in probability, including a look at Bayes' theorem leading up to an understanding of the continuous form.</p>
</li>
<li><p>Model-fitting</p>
<ul>
<li>Maximum likelihood estimation</li>
<li>Supervised & unsupervised learning</li>
<li>Linear models (fitting them and interpreting them)</li>
<li>Empirical distributions (with a note on KL divergence)</li>
</ul>
</li>
</ul>
<p>In <a href="http://strataoftheworld.blogspot.com/2021/01/data-science-2.html">part 2</a>:</p>
<ul>
<li>Monte Carlo methods</li>
<li>A few theorems that let you bound probabilities or expectations.</li>
<li>Bayesianism & frequentism</li>
<li>Probability systems (specifically basic results about Markov chains).</li>
</ul>
<p> </p>
<h2>Probability basics</h2>
<p>The kind of background you want to have to understand this material:</p>
<ul>
<li><p>The basic maths of probability: reasoning about sample spaces, probabilities summing to one, understanding and working with random variables, etc.</p>
</li>
<li><p>The ideas of expected value and variance.</p>
</li>
<li><p>Some idea of the most common probability distributions:</p>
<ul>
<li>normal/Gaussian,</li>
<li>binomial,</li>
<li>Poisson,</li>
<li>geometric,</li>
<li>etc.</li>
</ul>
</li>
<li><p>What continuous and discrete distributions are.</p>
</li>
<li><p>Understanding probability density/mass functions, and cumulative distribution functions.</p>
</li>
</ul>
<h3>Notation</h3>
<p>First, a few minor points:</p>
<ul>
<li><p>It's easy to interpret <script type="math/tex">Y = f(X)</script>, where <script type="math/tex">X</script> and <script type="math/tex">Y</script> are random variables, to mean "generate a value of <script type="math/tex">X</script>, then apply <script type="math/tex">f</script> to it, and this is <script type="math/tex">Y</script>". But <script type="math/tex">Y=f(X)</script> is maths, not code; we're stating something is true, not saying how the values are generated. If <script type="math/tex">f</script> is an invertible function, then <script type="math/tex">Y=f(X)</script> and <script type="math/tex">X=f^{-1}(Y)</script> are both equally good and equally true mathematical statements, and neither of them tells you what causes what.</p>
</li>
<li><p>Indicator functions are a useful trick when bounds are unknown; for example, write <script type="math/tex">1_{x \geq y}</script> (or <script type="math/tex">1[x\geq y]</script>) to denote 1 if <script type="math/tex">x \geq y</script> and 0 in all other cases.</p>
<ul>
<li>They also let you express logical AND as multiplication: <script type="math/tex">1_{f(x)} \cdot 1_{g(x)}</script> , where <script type="math/tex">f</script> and <script type="math/tex">g</script> are boolean functions, is the same as <script type="math/tex">1_{f(x) \wedge g(x)}</script>.</li>
</ul>
</li>
</ul>
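<p>A tiny sketch of the indicator trick in plain Python; the helper name <code>indicator</code> and the sample values are made up for illustration.</p>

```python
def indicator(condition):
    """Return 1 if the condition holds and 0 otherwise."""
    return 1 if condition else 0

xs = [1, 4, 2, 7, 5]

# Indicator of "x is at least 3" and of "x is even", evaluated at each x.
at_least_3 = [indicator(x >= 3) for x in xs]
is_even = [indicator(x % 2 == 0) for x in xs]

# Multiplying two indicators gives the indicator of both conditions holding.
both = [a * b for a, b in zip(at_least_3, is_even)]
print(both)
```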
<h4>Likelihood notation</h4>
<p>Discrete and continuous random variables are fundamentally different. In the discrete case, you deal with probability mass functions where there's a probability attached to each event; in the continuous case, you only get a probability density function, whose values are not probabilities themselves and which needs to be integrated to give you a probability. Many results apply to both discrete and continuous random variables though, and we might switch between continuous and discrete models in the same problem, so it's cumbersome to have to deal with the separate notation and semantics of them.</p>
<p>Enter likelihood notation: write <script type="math/tex">\Pr_X(x)</script> to mean <script type="math/tex">P(X=x)</script> if the distribution is discrete and <script type="math/tex">f(x)</script> if the distribution of <script type="math/tex">X</script> is continuous with probability density function <script type="math/tex">f</script>.</p>
<h4>Python & NumPy</h4>
<p>Python is a good choice for writing code, for various reasons:</p>
<ul>
<li>easy to read;</li>
<li>found almost everywhere;</li>
<li>easy to install if it isn't already installed;</li>
<li>not Java;</li>
</ul>
<p>but particularly because it has excellent science/maths libraries:</p>
<ul>
<li>NumPy for vectorised calculations, maths, and stats;</li>
<li>SciPy for, uh, science;</li>
<li>Matplotlib for graphing;</li>
<li>Pandas for data.</li>
</ul>
<p>NumPy is a must-have.</p>
<p>To use it, the big thing to understand is the idea of vectorised calculations. Otherwise, you'll see code like this:</p>
<pre><code class="language-python" lang="python">import numpy

xs = numpy.array([1, 2, 3])
ys = xs ** 2 + xs  # element-wise: array([2, 6, 12])
</code></pre>
<p>and wonder how we're adding and squaring arrays (we're not; the operations are implicitly applied to each element separately – and all of this runs in C so it's much faster than doing it natively in Python).</p><h3>Computation vs maths</h3>
<p>Today we have computers. Statistics was invented before computers,
though, and this affected the field; work was directed to all the areas
and problems where progress could be made without much computation. The
result is an excellent theoretical mathematical underpinning, but modern
statistics can benefit a lot from a computational approach – running
simulations to get estimates and so on. For the simple problems there's
an (imprecise) computational method and a (precise) mathematical method;
for complex problems you either spend all day doing integrals (provided
they're solvable at all) or switch to a computer.</p>
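<p>To illustrate the two approaches on a simple problem, here is a sketch that finds the probability that two dice sum to at least 10, once exactly and once by simulation. The example problem, the seed, and the number of trials are my own choices for illustration.</p>

```python
import random
from fractions import Fraction
from itertools import product

# Precise mathematical method: enumerate all 36 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=2))
exact = Fraction(sum(1 for a, b in outcomes if a + b >= 10), len(outcomes))

# Imprecise computational method: simulate many rolls and count the hits.
rng = random.Random(0)  # fixed seed so the run is repeatable
n_trials = 100_000
hits = sum(1 for _ in range(n_trials)
           if rng.randint(1, 6) + rng.randint(1, 6) >= 10)
estimate = hits / n_trials

print(exact, estimate)
```

<p>The simulation only gets close to the exact answer, but the same few lines would still work for a problem where the enumeration or the integral is out of reach.</p>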
<p>In this post, I will focus on the maths, because the maths concepts
are more interesting than the intricacies of NumPy, and because if you
understand them (and programming, especially in a vectorised style), the
programming bit isn't hard.</p><p> </p><p> </p>
<h3>Some probability results</h3>
<h4>The law of total probability</h4>
<p>Here's something intuitive: if we have a sample space (e.g. outcomes of a die roll) and we partition it into non-overlapping events <script type="math/tex">E_1</script> to <script type="math/tex">E_N</script> that cover every possible outcome (e.g. showing the numbers 1, 2, ..., 6, and losing the dice under the carpet), and we have some other event <script type="math/tex">A</script> (e.g. a player gets mad), then</p>
<div cid="n97" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n97" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-2" type="math/tex; mode=display">P(A) = \sum_{n=1}^{N} P(A | E_n)P(E_n);</script></div></div>
<p>if we know the probability of <script type="math/tex">A</script> given each event <script type="math/tex">E_n</script>, we can find the total probability of <script type="math/tex">A</script> by summing up the probabilities of each <script type="math/tex">E_n</script>, weighted by the conditional probability that <script type="math/tex">A</script> also happens. Visually, where the height of the red bars represents each <script type="math/tex">P(A|E_n)</script>, and the area of each segment represents the different <script type="math/tex">P(E_n)</script>s, we see that the total red area corresponds to the sum above:</p><p> <br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgERDCdxrc1bJfIR1iujZpclMEtBA1wzAAEId8glrJE8RfDWT3Fi6CYQ1ul39Lu13mqmVtKfT-kB42AwpvYd2pZrftrJ4VCbyzQ9VgypZKySR52PHo6k_eP8f5Ca8ql9w6-o-IJuCk_VsK5/s1280/ltp.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="506" data-original-width="1280" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgERDCdxrc1bJfIR1iujZpclMEtBA1wzAAEId8glrJE8RfDWT3Fi6CYQ1ul39Lu13mqmVtKfT-kB42AwpvYd2pZrftrJ4VCbyzQ9VgypZKySR52PHo6k_eP8f5Ca8ql9w6-o-IJuCk_VsK5/w640-h252/ltp.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>You say this diagram is "messy and unprofessional"; I say it has an "informal aesthetic".</i><br /></td></tr></tbody></table><br /><p>This is called the law of total probability; a fancy name to pull out when you want to use this idea.</p>
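<p>A small worked sketch of the law, using exact fractions. The follow-up experiment (flipping as many coins as the die shows) is a made-up example chosen because the conditional probabilities are easy to write down.</p>

```python
from fractions import Fraction

# Partition: E_n = "the die shows n", each with probability 1/6.
p_E = {n: Fraction(1, 6) for n in range(1, 7)}

# Hypothetical follow-up experiment: flip n fair coins, and let A be
# "at least one coin lands heads", so P(A | E_n) = 1 - (1/2)^n.
p_A_given_E = {n: 1 - Fraction(1, 2) ** n for n in range(1, 7)}

# Law of total probability: weight each conditional probability by P(E_n).
p_A = sum(p_A_given_E[n] * p_E[n] for n in p_E)
print(p_A)
```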
<h4>The law of the unconscious statistician</h4>
<p>Another useful law doesn't even sound like a law at first, which is why it's called the law of the unconscious statistician.</p>
<p>Remember that the expected value, in the case of a discrete distribution for the random variable <script type="math/tex">X</script>, is</p>
<div cid="n104" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n104" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-3" type="math/tex; mode=display">E(X)=\sum_i x_iP(X=x_i).</script></div></div>
<p>Now say we're not interested in the value of <script type="math/tex">X</script> itself, but rather some function <script type="math/tex">f</script> of it. What is the expected value of <script type="math/tex">f(X)</script>? Well, the values <script type="math/tex">x_i</script> are the possible values of <script type="math/tex">X</script>, so let's just replace the <script type="math/tex">x_i</script> above with <script type="math/tex">f(x_i)</script>:</p>
<div cid="n106" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n106" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-4" type="math/tex; mode=display">E(f(X)) = \sum_i f(x_i) P(X=x_i)</script></div></div>
<p>... and we're done – but for the wrong reasons. This result is actually more subtle than this; to prove it, consider a random variable <script type="math/tex">Y</script> for which <script type="math/tex">Y=f(X)</script>. By the definition of expected value,</p>
<div cid="n108" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n108" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-5" type="math/tex; mode=display">E(Y)=\sum_i y_i P(Y=y_i).</script></div></div>
<p>Uh oh – suddenly the connection between the obvious result and what expected value is doesn't seem so obvious. The problem is that the mapping between the <script type="math/tex">y_i</script> and <script type="math/tex">x_i</script> could be anything – many <script type="math/tex">x_i</script>, thrown into the blackbox <script type="math/tex">f</script>, might produce the same <script type="math/tex">y_i</script> – and we have to untangle this while keeping track of all the corresponding probabilities. </p>
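<p>For a start, we might notice that <script type="math/tex">P(Y=y_i)</script> is the total probability of all the values <script type="math/tex">x_j</script> of <script type="math/tex">X</script> for which <script type="math/tex">f(x_j) = y_i</script>. So we might write</p>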
<div cid="n111" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n111" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-6" type="math/tex; mode=display">E(Y)=\sum_i \Big( y_i
\sum_{j \,|\, f(x_j)=y_i}
P(X=x_j)
\Big),</script></div></div>
<p>to sum over each possible value of <script type="math/tex">f(X)</script>, and then within that, also loop over the possible values of <script type="math/tex">X</script> that might have generated that <script type="math/tex">f(X)</script>. We've managed to switch a term involving the probability that <script type="math/tex">Y</script> takes some values to one about <script type="math/tex">X</script> taking a specific value – progress!</p>
<p>Next, we realise that <script type="math/tex">y_i</script> is the same for everything in the inner sum: <script type="math/tex">y_i = f(x_j)</script> for every <script type="math/tex">x_j</script> that the inner sum ranges over. So we don't change anything if we write</p>
<div cid="n114" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n114" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-7" type="math/tex; mode=display">E(Y)=\sum_i \Big(
\sum_{j \,|\, f(x_j)=y_i}
f(x_j)
P(X=x_j)
\Big)</script></div></div>
<p>instead. Now we just have to see that the above is equivalent to iterating once over all the <script type="math/tex">j</script>s.</p>
<p>A diagram:</p>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgweGnjxA4J8okzCIFdYO-TPzIh_xtRe4KUzqN21bX9KKXnkYpRufbdUoIvbFRgq2ySwjzOAYt1DJM4mqzY94lrftdZy4kIJi8iB4xc1C4o7CHWGLhCu79rN5vwmrtCBnleQrr5E4RHOlUs/s1280/lotus.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="770" data-original-width="1280" height="384" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgweGnjxA4J8okzCIFdYO-TPzIh_xtRe4KUzqN21bX9KKXnkYpRufbdUoIvbFRgq2ySwjzOAYt1DJM4mqzY94lrftdZy4kIJi8iB4xc1C4o7CHWGLhCu79rN5vwmrtCBnleQrr5E4RHOlUs/w640-h384/lotus.png" width="640" /></a>
<p>The yellow area is the expected value of <script type="math/tex">f(X) = Y</script>. By the definition of expected value, we can sum up the areas of the yellow rectangles to get <script type="math/tex">E(f(X))</script>. What we've now done is "reduced" this to a process like this: pick <script type="math/tex">y_1</script>, look at the <script type="math/tex">x_i</script> that map to it under <script type="math/tex">f</script> (<script type="math/tex">x_1</script> and <script type="math/tex">x_2</script> in this case), find their probabilities, and multiply them by <script type="math/tex">f(x_1)=f(x_2)=y_1</script>. So we add up the rectangles in the slots marked by the dotted lines, and we do it with this weird double-iteration of looking first at <script type="math/tex">y_i</script>s and then at <script type="math/tex">x_i</script>s.</p>
<p>But once we've put it this way, it's simple to see we get the same result if we iterate over the <script type="math/tex">x_i</script>s, get the corresponding rectangle slice for each, and add it all up. This corresponds to the formula we had above (summing <script type="math/tex">f(x_i) P(X=x_i)</script> over all possible <script type="math/tex">i</script>).</p>
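<p>The two ways of iterating can be checked against each other on a small example. This is only a sketch; the distribution of <script type="math/tex">X</script> and the function <script type="math/tex">f</script> are made up, chosen so that several <script type="math/tex">x_i</script> map to the same <script type="math/tex">y_i</script>.</p>

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical example: X uniform on {-2, -1, 0, 1, 2} and f(x) = x^2,
# so several values of X map to the same value of Y = f(X).
p_X = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}

def f(x):
    return x * x

# Route 1 (definition of expected value): first build the distribution of Y
# by collecting, for each y, the probabilities of every x with f(x) = y.
p_Y = defaultdict(Fraction)
for x, p in p_X.items():
    p_Y[f(x)] += p
via_Y = sum(y * p for y, p in p_Y.items())

# Route 2 (law of the unconscious statistician): iterate once over the x values.
via_X = sum(f(x) * p for x, p in p_X.items())

print(via_Y, via_X)
```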
<h4>Bayes' theorem (odds ratio and continuous form)</h4>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPZV9KnIAMOiRq__8O1h4nQKDzttp9ISgAJw-3S43Rw-3DGAs1ZJ3GFT_C8En1fdT5KSLv2gQcGaL7dtkXqVwD_ZlvS3Zmi0w_sWwEF4HGODYjfvqzU_0T2N2VHWR69tKPcvT8RJtjzJU9/s1280/bayes.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1280" height="450" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPZV9KnIAMOiRq__8O1h4nQKDzttp9ISgAJw-3S43Rw-3DGAs1ZJ3GFT_C8En1fdT5KSLv2gQcGaL7dtkXqVwD_ZlvS3Zmi0w_sWwEF4HGODYjfvqzU_0T2N2VHWR69tKPcvT8RJtjzJU9/w640-h450/bayes.png" width="640" /></a></div><br />Above is a Venn diagram of a sample space (the box), with the probabilities of event <script type="math/tex">B</script> and event <script type="math/tex">R</script> marked by blue and red areas respectively (the hatched area represents that both happen).
<p>By the definition of conditional probability,</p>
<div cid="n124" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n124" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display"></div><script id="MathJax-Element-8" type="math/tex; mode=display">P(R|B)=\frac{P(B \cap R)}{P(B)}, \text{ and} \\
P(B|R)=\frac{P(B \cap R)}{P(R)}.</script></div></div>
<p>Bayes' theorem is about answering questions like "if we know how likely we are to be in the red area given that we're in the blue area, how likely are we to be in the blue area if we're in the red?" (Or: "if we know how likely we are to have symptoms if we have covid, how likely are we to have covid if we have symptoms?").</p>
<p>Solving both of the above equations for <script type="math/tex">P(B \cap R)</script> and equating them gives</p>
<div cid="n127" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n127" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-9" type="math/tex; mode=display">P(R|B) P(B) = P(B|R) P(R),</script></div></div>
<p>which is the answer – just divide out by either <script type="math/tex">P(B)</script> or <script type="math/tex">P(R)</script> to get, for example,</p>
<div cid="n129" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n129" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-10" type="math/tex; mode=display">P(B|R) = \frac{P(R|B)P(B)}{P(R)}.</script></div></div>
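<p>As a quick numerical sketch of the formula, read <script type="math/tex">B</script> as "has covid" and <script type="math/tex">R</script> as "has symptoms". All three input probabilities below are invented for illustration.</p>

```python
from fractions import Fraction

# Hypothetical numbers, invented for illustration.
p_B = Fraction(1, 20)               # P(B): prior probability of covid
p_R_given_B = Fraction(4, 5)        # P(R|B): symptoms given covid
p_R_given_not_B = Fraction(1, 10)   # P(R|not B): symptoms given no covid

# P(R) from the law of total probability over the partition {B, not B}.
p_R = p_R_given_B * p_B + p_R_given_not_B * (1 - p_B)

# Bayes' theorem: P(B|R) = P(R|B) P(B) / P(R).
p_B_given_R = p_R_given_B * p_B / p_R
print(p_R, p_B_given_R)
```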
<p>Let's say the red area <script type="math/tex">R</script> represents having symptoms. Let's say we split the blue area <script type="math/tex">B</script> into <script type="math/tex">B_1</script> and <script type="math/tex">B_2</script> – two different variants of covid, say. Now instead of talking about probabilities, let's talk about odds: let's say the odds ratios that a random person has no covid, has variant 1, and has variant 2 are 40:2:1, and that symptoms are, compared to the no-covid population, ten times as likely in variant 1 and twenty times as likely in variant 2 (in symbols: <script type="math/tex">P(R| \neg B_1 \cap \neg B_2) = P(R|B_1) / 10 = P(R|B_2) / 20</script>). Now we learn that we have symptoms and want to calculate posterior probabilities, to use Bayes-speak.</p>
<p>To apply Bayes' rule, you could crank out the formula exactly as above: convert odds to probabilities, multiply each by the probability of symptoms in that case, and divide by the total probability of symptoms (summed over no covid, variant 1, and variant 2) to get revised probabilities of having no covid or either variant. This is equivalent to keeping track of the absolute sizes of the intersections in the diagram below:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi03lu58t8HvIrhOv_zDy-RrLVCDR7w47yuKXN5UpNQWKGTgG5s4WYCTjeIotLsJR5csm3xZEsunF2hrO17R81KvdNjJ9Qa-ddcyZJPlCM1dNAnZy_BohFBjfLnUd8SnFq0BBSslnlJAjIs/s1000/bayes2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="622" data-original-width="1000" height="398" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi03lu58t8HvIrhOv_zDy-RrLVCDR7w47yuKXN5UpNQWKGTgG5s4WYCTjeIotLsJR5csm3xZEsunF2hrO17R81KvdNjJ9Qa-ddcyZJPlCM1dNAnZy_BohFBjfLnUd8SnFq0BBSslnlJAjIs/w640-h398/bayes2.png" width="640" /></a></div><br />
<p>But this is unnecessary. When we learned we had symptoms, we zoomed in to the red blob; that is our sample space now, so blob size compared to the original sample space no longer interests us.</p>
<p>So let's take our odds ratios directly, and only focus on relative probabilities. Let's imagine each scenario fighting over a set amount of probability space, with the starting allocations determined by prior odds ratios:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnimJGhv_nRn-8Mp8Ha12NO6vHySwEEWWiw0Uq54dLUaVTlmD2LNoa9XlSKTXRC26pVkNqpTfIJmQd6iTCoOBx1FvSSkS_BFcZllpCcQlPMUQM1DJQLSkcpA10TRNJljVmj3W1POKd68Kk/s1280/odds1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="84" data-original-width="1280" height="42" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnimJGhv_nRn-8Mp8Ha12NO6vHySwEEWWiw0Uq54dLUaVTlmD2LNoa9XlSKTXRC26pVkNqpTfIJmQd6iTCoOBx1FvSSkS_BFcZllpCcQlPMUQM1DJQLSkcpA10TRNJljVmj3W1POKd68Kk/w640-h42/odds1.png" width="640" /></a></div><br />
<p>Now Bayes' rule says to multiply each prior probability <script type="math/tex">P(B_i)</script> by <script type="math/tex">P(R|B_i)</script>. To adjust our prior odds ratio 40:2:1 by the ratios 1:10:20 telling us how many times more likely we are to see <script type="math/tex">R</script> (symptoms) given no covid or <script type="math/tex">B_1</script> or <script type="math/tex">B_2</script>, just multiply term-by-term to get 40:20:20, or 2:1:1. You can imagine each outcome fighting it out with their newly-adjusted relative strengths, giving a new distribution of the sample space:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjID7pufw53kbf8-SDwVGB9kZqDDshAUgHbfl06BdgMu9-YPxSrLe6vbluuxTYq1SnX6jVtLlQYKY9T0569SY8iP0JBK_w5m-Z52lm_YG8fR7bjy_VBYZZFdAYWuU_xR6UBcG3469ufM0hu/s1282/odds2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="102" data-original-width="1282" height="50" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjID7pufw53kbf8-SDwVGB9kZqDDshAUgHbfl06BdgMu9-YPxSrLe6vbluuxTYq1SnX6jVtLlQYKY9T0569SY8iP0JBK_w5m-Z52lm_YG8fR7bjy_VBYZZFdAYWuU_xR6UBcG3469ufM0hu/w640-h50/odds2.png" width="640" /></a></div><br />
<p>Now if we want to get absolute probabilities again, we just have to scale things right so that they add up to 1. This tiny bit of cleanup at the end (if we want to convert to probabilities again) is the only downside of working with odds ratios.</p>
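<p>The odds-ratio update above is short enough to write out in full. The following is a minimal sketch rather than anything from the original post; the function names <code>update_odds</code> and <code>normalise</code> are made up for illustration, and the numbers are the 40:2:1 prior and 1:10:20 ratios used above.</p>

```python
from fractions import Fraction


def update_odds(prior_odds, likelihood_ratios):
    """Multiply prior odds term-by-term by the relative likelihoods of the evidence."""
    return [p * l for p, l in zip(prior_odds, likelihood_ratios)]


def normalise(odds):
    """Scale odds so they sum to 1, turning them back into probabilities."""
    total = sum(odds)
    return [Fraction(o, total) for o in odds]


prior = [40, 2, 1]      # no covid : B_1 : B_2
ratios = [1, 10, 20]    # relative likelihood of symptoms R under each scenario

posterior_odds = update_odds(prior, ratios)    # 40:20:20, i.e. 2:1:1
posterior_probs = normalise(posterior_odds)    # the only "cleanup" step

print(posterior_odds, [str(p) for p in posterior_probs])
```

<p>Exact fractions are used only so that the final probabilities print cleanly; plain floats would work just as well.</p>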
<p>This gives us an idea about how to use Bayes when the sample space is continuous rather than discrete. For example, let's say the sample space is between 0 and 100, representing the blood oxygenation level <script type="math/tex">X</script> of a coronavirus patient. We can imagine an approximation where we write an odds ratio that includes every integer from 0 to 100, and then refine that until, in the limit, we've assigned odds to every real number between 0 and 100. Of course, at this point the odds ratio interpretation starts looking a bit weird, but we can switch to another one: what we have is a probability distribution, if only we scale it so that the entire thing integrates to one.</p>
<p>The same logic applies as before, even though everything is now continuous. Let's say we want to calculate a conditional probability like the probability of <script type="math/tex">X</script> (the random variable for the patient's blood oxygenation) taking the value <script type="math/tex">x</script>. At first we have no information, so our best guess is the prior across all patients, <script type="math/tex">\Pr_X(x)</script>. Say we now get some piece of evidence, like the patient's age, and know the likelihood ratios of the patient being that age given each blood oxygenation level. To get our updated belief distribution, we can just go through and multiply the prior likelihoods of each blood oxygenation level by the ratios given the new piece of evidence.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVv4UiHlAd9SvH2h0qIHTG-HtmP9AOb5S6PA8nHAt7TKB0ccZ7RKanKpW561T2mAR-pR_-ZbSZCkLjMLK6rhXWnuJOn5MTVY3WZh4lKswOh1Die8gpej74tz2uozaF8ywP7iWs19js4kxI/s1280/odds3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="688" data-original-width="1280" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVv4UiHlAd9SvH2h0qIHTG-HtmP9AOb5S6PA8nHAt7TKB0ccZ7RKanKpW561T2mAR-pR_-ZbSZCkLjMLK6rhXWnuJOn5MTVY3WZh4lKswOh1Die8gpej74tz2uozaF8ywP7iWs19js4kxI/w640-h344/odds3.png" width="640" /></a></div>
<p>Above, the red line is the initial distribution of blood oxygenation <script type="math/tex">x</script> across all patients. The yellow line represents the relative likelihoods of the patient's actual known age <script type="math/tex">a</script> given a particular <script type="math/tex">x</script>. The green line at any particular <script type="math/tex">x</script> is the product of the yellow and red functions at that same <script type="math/tex">x</script>, and it's our relative posterior. To interpret it as a probability distribution, we have to scale it vertically so that it integrates to 1 (that's why we have a proportionality sign rather than an equals sign).</p>
<p>Now let's say more evidence comes in: the patient is unconscious (which we'll denote <script type="math/tex">U=\text{"yes"}</script>). We can repeat the same process of multiplying out relative likelihoods and the prior, this time with the prior being the result in the previous step:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2ZLjpKYLvcRTZXawhkk04j1rlzlRbIvsFBR8pm0PFItd-y_5cfJmEJ97vPKT4dv5fv8ML5fpi04hOd3akN9ZpRgW-XQkP1u56F6J_10njN-oYTF43awzM8T6wpRSh8GoY4qoGDKRjvwby/s1278/odds4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="928" data-original-width="1278" height="464" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2ZLjpKYLvcRTZXawhkk04j1rlzlRbIvsFBR8pm0PFItd-y_5cfJmEJ97vPKT4dv5fv8ML5fpi04hOd3akN9ZpRgW-XQkP1u56F6J_10njN-oYTF43awzM8T6wpRSh8GoY4qoGDKRjvwby/w640-h464/odds4.png" width="640" /></a></div><p></p><p>We can see that in this case the blue line varies a lot more depending on <script type="math/tex">x</script>, and hence our distribution for <script type="math/tex">x</script> (the purple line) changes more compared to our prior (the green line). 
Now let's say we have a very good piece of evidence: the result <script type="math/tex">m</script> of a blood oxygenation meter <script type="math/tex">M</script>.</p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoMr4ZomNB6l4nqYxnhWFAR39ulkPuSIMzbBTR6uU_8lvwKC4RCmQJkc-8v0e5qXIFtpxG8tujmX0ad0_FiMfsRn-6QpF3UgfwWXnkPuDuSBnYW06fwsZbY9NmmgEYogawsUzmjNwsQAzJ/s1280/odds5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="858" data-original-width="1280" height="428" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoMr4ZomNB6l4nqYxnhWFAR39ulkPuSIMzbBTR6uU_8lvwKC4RCmQJkc-8v0e5qXIFtpxG8tujmX0ad0_FiMfsRn-6QpF3UgfwWXnkPuDuSBnYW06fwsZbY9NmmgEYogawsUzmjNwsQAzJ/w640-h428/odds5.png" width="640" /></a></div>There's some error on the oxygenation measurement, so our final belief (that <script type="math/tex">x</script> is distributed according to the black line) is very clearly a distribution of values rather than a single value, but it's clustered around a single point.<p></p>
<p>So to think through Bayes in practice, the lesson is this: throw out the denominator in the law. It's a constant anyways; if you really need it you can go through some integration at the end to find it. But it's not the central point of Bayes' theorem. Remember instead: prior times likelihood ratio gives posterior.</p><p> </p>
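<p>Here is a rough numerical sketch of the same recipe on a grid of integer oxygenation levels: multiply the prior by the relative likelihood pointwise, and only rescale at the end. The bell-shaped curves, their centres and widths, and the names <code>normal_shape</code> and <code>update</code> are all invented for illustration; they are not the curves in the figures above.</p>

```python
import math


def normal_shape(x, centre, width):
    """Unnormalised bell curve; the constant in front is deliberately left out."""
    return math.exp(-0.5 * ((x - centre) / width) ** 2)


def update(prior, relative_likelihood):
    """Pointwise prior times relative likelihood, then rescale to sum to 1."""
    unnormalised = [p * l for p, l in zip(prior, relative_likelihood)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


grid = list(range(0, 101))  # oxygenation levels 0..100

# Made-up prior over all patients, built by updating a flat distribution.
prior = update([1.0] * len(grid), [normal_shape(x, 92, 6) for x in grid])

# Made-up evidence: a meter reading of 85 with an error width of 2.
meter = [normal_shape(x, 85, 2) for x in grid]
posterior = update(prior, meter)

best = max(grid, key=lambda x: posterior[x])  # most likely grid value
print(best, round(sum(posterior), 6))
```

<p>Note that the denominator of Bayes' theorem never appears explicitly; it is just the <code>total</code> we divide by when we want the result to be a proper distribution.</p>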
<h2>Fitting models</h2>
<p>A probability model tries to tell you how likely things are. Fitting a probability model to data is about finding one that is useful for given data.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgl_LR92LH_IkstYk4INmtIo5t_rszVkRazzyfPShRSwCugwXw5N9J-KfgAp9vf2XcZr5A5nhB43giDOXg35sMeRXqWa9pz5yROIjz28_R4hqg06EXIebxc8tvgDRHdEKZSNUjT8wxGwK3p/s1280/probmodels.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="366" data-original-width="1280" height="184" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgl_LR92LH_IkstYk4INmtIo5t_rszVkRazzyfPShRSwCugwXw5N9J-KfgAp9vf2XcZr5A5nhB43giDOXg35sMeRXqWa9pz5yROIjz28_R4hqg06EXIebxc8tvgDRHdEKZSNUjT8wxGwK3p/w640-h184/probmodels.png" width="640" /></a></div>
<p>Above, we have two axes representing whatever, and the intensity of the red shading is the probability attributed to a particular pair of values.</p>
<p>The model on the left is simply bad. The one in the middle is also bad, though; it assigns no probability to many of the data points that were actually seen.</p>
<p>Choosing which distribution to fit – or whether to do something else entirely – is sometimes obvious, sometimes not. Complexity is rarely good.</p>
<h3>Maximum likelihood estimation (MLE)</h3>
<p>Let's say we do have a good idea of what the distribution is; the weight of stray cats in a city depends on a lot of small factors pushing both ways (when it last caught a mouse, the temperature over the past week, whether it was loved by its mother, etc.), so <a href="https://en.wikipedia.org/wiki/Bean_machine">we should expect a normal distribution</a>. Well, probably.</p>
<p>Let's say we have a dataset of cat weights, labelled <script type="math/tex">x_1</script> to <script type="math/tex">x_n</script> because we're serious maths people. How do we fit a distribution?</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiiGoKXcYhDiHAkwuMoJj6F_1Wj0THIkqqbDEU2WFvMAX_uZS1t-PRjlc5rVcmWt5R1OtVSzxst7ZtpWJFWB82xw3Bw1a-ez6QNUO503zuhOz1-XB-vHfahwN0lElGZiIeqFKagkT3CBEnw/s800/cats.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="610" data-original-width="800" height="488" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiiGoKXcYhDiHAkwuMoJj6F_1Wj0THIkqqbDEU2WFvMAX_uZS1t-PRjlc5rVcmWt5R1OtVSzxst7ZtpWJFWB82xw3Bw1a-ez6QNUO503zuhOz1-XB-vHfahwN0lElGZiIeqFKagkT3CBEnw/w640-h488/cats.png" width="640" /></a></div><br /><p><br /></p>
<p>Step 1 is Wikipedia. Wikipedia tells us that a normal distribution has two parameters, <script type="math/tex">\mu</script> (the mean) and <script type="math/tex">\sigma</script> (the standard deviation), and that the likelihood (not probability! see above) that a normally distributed variable <script type="math/tex">X</script> with those parameters takes a value <script type="math/tex">x</script> is</p>
<div cid="n164" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n164" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-11" type="math/tex; mode=display">\Pr_X(x)=
\frac{1}{\sigma \sqrt{2 \pi}}
e^{-\frac{1}{2}\big(
\frac{x-\mu}{\sigma}
\big)^2}.</script></div></div>
<p>Oh dear.</p>
<p>After a moment's thought, we can interpret it more clearly:</p>
<div cid="n167" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n167" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-12" type="math/tex; mode=display">\Pr_X(x) = \frac{\text{blah}}{\sigma \text{ blah}}
\text{blah}^{\text{-blah}
{\big(\frac{x-\mu}{\sigma}\big)^2}}.</script></div></div>
<p>So it's just an exponential that decays in both directions from <script type="math/tex">\mu</script>, and that's squeezed by <script type="math/tex">\sigma</script>.</p>
<p>(Why are there constants then? Because it's a probability distribution, and must therefore integrate to 1 over its entire range or else all hell will break loose.)</p>
<p>Step 2 is philosophising. What does it really mean to get the best fit of a distribution?</p>
<p>The first thing we can notice is that there are only two dials we can adjust: the values of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>. For this particular problem at least, we've reduced the massive problem of picking the best model to one of finding the best spot in a 2D space (well, half of 2D space, since <script type="math/tex">\sigma</script> must be greater than zero).</p>
<p>The second thing we can notice is that the only tool we have at our disposal here to tell us about the fit to the distribution is the likelihood function, and, well, as the saying goes: when all you have is a likelihood function ...</p>
<p>A good fit will give high likelihoods to the points in the data set (we can't get an arbitrarily good fit by giving everything a lot of likelihood, because there's only so much likelihood to go around – the likelihood function must integrate to 1 over its domain).</p>
<p>Let's define the likelihood of the data, given some model, as the likelihood that we get that specific data set by independently generating samples from the model until we have the same number as in the data set (if we have a lot of data points, the likelihood of any particular set of them will usually be very low, since it's the product of the likelihoods of a lot of individual points). And let's go ahead and try to tune the model so that the likelihood of our data is maximised.</p>
<p>(Remember, likelihood is probability, except for continuous random variables like our normal distribution, where we can't talk about the probability of a dataset (only about something like the probability of getting a dataset at least as close as [some metric] to the dataset).)</p>
<p>Step 3 is algebra. So what is the likelihood of all our data? Using basic probability, it's the product of the likelihoods of each data point (just like the probability of getting a set of independent events is the product of the probabilities of each event). Returning to our normal distribution with cat data <script type="math/tex">x_1</script> to <script type="math/tex">x_n</script>, the likelihood of the data given distribution <script type="math/tex">X</script> with mean <script type="math/tex">\mu</script> and standard deviation <script type="math/tex">\sigma</script> is</p>
<div cid="n177" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n177" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display"></div><script id="MathJax-Element-13" type="math/tex; mode=display">\Pr_X(x_1) \cdot \Pr_X(x_2) \cdot ... \cdot \Pr_X(x_n) \\
=
\frac{1}{\sigma \sqrt{2 \pi}}
e^{-\frac{1}{2}\big(
\frac{x_1-\mu}{\sigma}
\big)^2}
\cdot ... \cdot
\frac{1}{\sigma \sqrt{2 \pi}}
e^{-\frac{1}{2}\big(
\frac{x_n-\mu}{\sigma}
\big)^2} \\
=
\left(\frac{1}{\sigma \sqrt{2 \pi}} \right)^n
e^{-\frac{1}{2}\big( \big(
\frac{x_1 - \mu}{\sigma}
\big)^2
+
...
+
\big(\frac{x_n - \mu}{\sigma}
\big)^2
\big)}.</script></div></div>
<p>Oh dear. Maximising this is a pain.</p>
<p>Thankfully, there's a trick. We don't care about the likelihood, only that we set <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> so that the likelihood is maximised. We can apply any monotonically increasing function to the likelihood, maximise that, and we'll have the <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> that maximise the original mess.</p>
<p>Which monotonically increasing function? Logarithms are generally best, because they convert the products you get from calculating the likelihood of a dataset into sums (and in this case they're especially nice, because they'll also take out the exponentials in our distribution's likelihood function).</p>
<p>In fact, throw away the previous calculation, note that</p>
<div cid="n182" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n182" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display"></div><script id="MathJax-Element-14" type="math/tex; mode=display">\log\Pr_X(x) = -\log(\sigma \sqrt{2 \pi})
- \frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2 \\
= -\log(\sqrt{2 \pi}) - \log(\sigma)
- \frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2, \\</script></div></div>
<p>from which we can throw away the <script type="math/tex">\log(\sqrt{2\pi})</script> because it's the same in each term, and then sum all the rest up to get a total log likelihood of</p>
<div cid="n184" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n184" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-15" type="math/tex; mode=display">-n\log(\sigma)
- \sum_{i=1}^n \Big(
\frac{1}{2} \left(\frac{x_i-\mu}{\sigma}\right)^2
\Big).</script></div></div>
<p>Call this <script type="math/tex">f</script>; the values of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> that maximise it are those where <script type="math/tex">\frac{\partial f}{\partial \mu} = 0</script> and <script type="math/tex">\frac{\partial f}{\partial \sigma} = 0</script>; that's when we've found our peak on the 2D space of possible <script type="math/tex">(\mu, \sigma)</script> pairs (technically this condition only tells us it's a stationary point, but it turns out to be the maximum, as you can prove by taking more derivatives).</p>
<p>So the maximum satisfies</p>
<div cid="n187" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n187" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display"></div><script id="MathJax-Element-16" type="math/tex; mode=display">\frac{\partial f}{\partial \mu} =
\sum_{i=1}^n \Big(
\frac{x_i-\mu}{\sigma^2}
\Big)
= 0, \text{ and} \\
\frac{\partial f}{\partial \sigma} =
-\frac{n}{\sigma}
+ \sum_{i=1}^n \left(
\frac{(x_i - \mu)^2}{\sigma^3}
\right)
= 0.</script></div></div>
<p>The first condition gives</p>
<div cid="n189" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n189" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-17" type="math/tex; mode=display">\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i,</script></div></div>
<p>in other words that <script type="math/tex">\hat{\mu}</script>, our best estimator function for the value of <script type="math/tex">\mu</script>, is the average of the values in the data set.</p>
<p>From the second condition, we can do algebra to get</p>
<div cid="n192" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n192" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-18" type="math/tex; mode=display">\hat{\sigma} = \sqrt{\frac{1}{n}
\sum_{i=1}^n(x_i-\mu)^2}.</script></div></div>
<p>We need to be careful here, though. When writing out the conditions, <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> stood for specific values of the parameters of the normal distribution <script type="math/tex">X</script>. We don't know these values; the best we can do is estimate them with <i>estimators</i>, which are technically not values but functions that take a data set and return an estimated value (and denoted by <script type="math/tex">\hat{\text{hats}}</script>). We can't have unknown values in our definition of <script type="math/tex">\hat{\sigma}</script>, as we currently do with the <script type="math/tex">\mu</script> in it; we have to replace it with the estimator for <script type="math/tex">\mu</script> like this:</p>
<div cid="n194" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n194" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-19" type="math/tex; mode=display">\hat{\sigma} = \sqrt{\frac{1}{n}
\sum_{i=1}^n(x_i-\hat{\mu})^2}</script></div></div>
<p>– making sure that the estimator <script type="math/tex">\hat{\mu}</script> does not depend on <script type="math/tex">\hat{\sigma}</script>, since that would again make things undefined – or by writing out the <script type="math/tex">\hat{\mu}</script> estimator in full like this:</p>
<div cid="n196" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n196" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-20" type="math/tex; mode=display">\hat{\sigma} = \sqrt{\frac{1}{n}
\sum_{i=1}^n \left(x_i-\frac{1}{n}\sum_{j=1}^n x_j\right)^2},</script></div></div>
<p>which at least makes it very clear that the <script type="math/tex">x_i</script>s and their number <script type="math/tex">n</script> define <script type="math/tex">\hat{\sigma}</script>. </p>
<p>When you're done defining your estimators, you should have a clear diagram in your head of how to pour data into the functions you've written down and come out with concrete numbers, with no dangling inputs anywhere – you're not done if you have any.</p>
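<p>As a sanity check on "pour data in, get numbers out", here is a small sketch of the two estimators in code. The cat weights are hypothetical, and <code>fit_normal_mle</code> and <code>log_likelihood</code> are names chosen for this example. The loop at the end nudges each parameter away from its estimate to confirm that the log likelihood does not improve.</p>

```python
import math


def fit_normal_mle(data):
    """Return (mu_hat, sigma_hat), the maximum likelihood estimates for a normal model."""
    n = len(data)
    mu_hat = sum(data) / n
    # Note the 1/n (not 1/(n-1)) and that only mu_hat, never the unknown mu, is used.
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)
    return mu_hat, sigma_hat


def log_likelihood(data, mu, sigma):
    """Total log likelihood of the data under Normal(mu, sigma), constants included."""
    return sum(
        -math.log(sigma * math.sqrt(2 * math.pi)) - 0.5 * ((x - mu) / sigma) ** 2
        for x in data
    )


cat_weights = [3.1, 4.0, 4.4, 3.6, 5.2, 4.1, 3.8]  # hypothetical weights in kg
mu_hat, sigma_hat = fit_normal_mle(cat_weights)
best = log_likelihood(cat_weights, mu_hat, sigma_hat)

# Nudging either parameter away from its estimate should never raise the log likelihood.
for d_mu, d_sigma in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert log_likelihood(cat_weights, mu_hat + d_mu, sigma_hat + d_sigma) <= best

print(round(mu_hat, 3), round(sigma_hat, 3))
```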
<h3>Supervised and unsupervised learning</h3>
<p>There are two main types of fancy model fitting we can do:</p>
<ol start="">
<li>Supervised learning, where we have a set of pairs (of numbers or anything else) and we try to design a system to predict one element from the other. For example, maybe we measure the length and weight of some stray cats, but get bored of trying to get them to stay on the scale long enough, so we want to ditch the weighing and predict a weight from the length alone – how well can we do this?</li>
<li>Unsupervised learning, where we have our data (as a set of tuples of associated data, like cat lengths, weights, and locations), and we try to fit a model to it so we can generate similar items; maybe we want to fake a larger stray cat population in our data than actually exists but not get caught by the statistics bureau. (This category also includes things like trying to <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">identify clusters</a> to interpret the data.) Fitting a distribution is perhaps the simplest example: using our one-dimensional cat weight database discussed in the MLE section, we can "generate" new cats by sampling from it, though the "cat" will just be the weight number. The more interesting case is when we have to generate a lot of associated data; for example, <a href="https://thispersondoesnotexist.com/">this website</a> offers you a new face every time you reload it. Behind it is a probability distribution for a human face in some crazy-dimensional variable space that's detailed enough that sampling it gives you all the data needed to figure out the colours of each pixel in a photorealistic face picture.</li>
</ol>
<p>The unifying idea is maximum likelihood estimation (MLE). Clearly, something like MLE is needed if you want to fit a distribution to data for unsupervised learning; we're going to need to generate something eventually, so we better have a probability model. It's less clear that supervised learning has anything to do with MLE though, and tempting to think of it as defining some random loss function to measure how bad a fit is, and then minimising that. It's possible to think of supervised learning this way, but then you'll end up with a lot of detail about loss functions in your head, all of which will seem to be pulled out of thin air.</p>
<p>Instead, think of supervised learning as MLE too. We specify a probability model, which will take in some parameters (e.g. the exponent <script type="math/tex">a</script> and constant <script type="math/tex">b</script> in a cat length/weight model like <script type="math/tex">\text{weight} = b \times \text{length}^a + \epsilon</script>, where <script type="math/tex">\epsilon</script> is a normally distributed error term with mean 0 and some standard deviation we either know already or else ask the fitting procedure to find for us), and the value of the predictor variable(s) (e.g. the cat's length), and spit out its prediction of the variable(s) of interest.</p>
<p>(Note that often the variable of interest is not numerical, but a label: "spam", "tumour", "Eurasian oystercatcher", etc.)</p>
<p>In fact, seen from the MLE perspective, it can almost be hard to see the difference – if so, good. Just look at the processes:</p>
<ol start="">
<li><p>Unsupervised learning:</p>
<ol start="">
<li>Get your dataset <script type="math/tex">x = (x_1, x_2, ..., x_n)</script>.</li>
<li>Decide on a probability model (e.g. a simple distribution) <script type="math/tex">X</script> with a parameter set <script type="math/tex">\theta = (\theta_1, \theta_2, ..., \theta_m)</script>.</li>
<li>Find the <script type="math/tex">\theta</script> that maximises <script type="math/tex">\Pr_X(x_1; \theta) \times ... \times \Pr_X(x_n; \theta)=\Pr_X(x;\theta)</script>,* since assuming our data points are drawn independently, this is the likelihood of the dataset.</li>
</ol>
</li>
<li><p>Supervised learning:</p>
<ol start="">
<li>Get your dataset of pairs of the form (thing to predict, thing to predict from): <script type="math/tex">((y_1, x_1), (y_2, x_2), ..., (y_n, x_n))</script>.</li>
<li>Decide on a probability model <script type="math/tex">Y</script> that relies on a parameter set <script type="math/tex">\theta = (\theta_1, \theta_2, ..., \theta_m)</script>, and also on <script type="math/tex">x_i</script>, to predict <script type="math/tex">y_i</script>.</li>
<li>Find the <script type="math/tex">\theta</script> that maximises <script type="math/tex">\Pr_Y(y_1;x_1, \theta) \times ... \times \Pr_Y(y_n; x_n, \theta) = \Pr_Y(y_1, ..., y_n; x_1, ..., x_n, \theta)</script>.*</li>
</ol>
</li>
</ol>
<p>*(We write <script type="math/tex">\Pr_X(x_i;\theta)</script> to mean the likelihood that <script type="math/tex">X</script> takes the value <script type="math/tex">x_i</script> if the parameters are <script type="math/tex">\theta</script>; we avoid writing it as a conditional probability <script type="math/tex">\Pr_X(x \, |\, \theta)</script> because interpreting this as a conditional probability is technically only valid with a Bayesian interpretation.)</p>
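<p>To see why the MLE view makes loss functions feel less arbitrary, here is a small sketch. It assumes a deliberately simplified model, weight = slope × length plus normally distributed noise with a known standard deviation (not the power-law model mentioned above), and uses invented data. Under that assumption the log likelihood is a constant minus the squared error divided by <script type="math/tex">2\sigma^2</script>, so maximising one is the same as minimising the other.</p>

```python
import math


def log_likelihood(ys, xs, slope, sigma):
    """Log likelihood of the ys under the model y = slope * x + Normal(0, sigma) noise."""
    return sum(
        -math.log(sigma * math.sqrt(2 * math.pi)) - 0.5 * ((y - slope * x) / sigma) ** 2
        for y, x in zip(ys, xs)
    )


def squared_error(ys, xs, slope):
    """Sum of squared residuals, the usual loss function for this kind of model."""
    return sum((y - slope * x) ** 2 for y, x in zip(ys, xs))


lengths = [30.0, 35.0, 40.0, 45.0, 50.0]  # hypothetical cat lengths (cm)
weights = [3.0, 3.4, 4.1, 4.4, 5.1]       # hypothetical cat weights (kg)
sigma = 0.3                               # assumed known noise level
candidates = [0.08, 0.09, 0.10, 0.11, 0.12]

# Pick the best candidate slope in two ways; they should agree.
by_likelihood = max(candidates, key=lambda s: log_likelihood(weights, lengths, s, sigma))
by_squared_error = min(candidates, key=lambda s: squared_error(weights, lengths, s))

print(by_likelihood, by_squared_error)
```

<p>A different noise assumption would lead to a different loss by the same route, which is the sense in which the loss function falls out of the probability model rather than out of thin air.</p>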
<h3>Linear models</h3>
<p>You can invent any model you choose. As always, simplicity pays though, and it turns out that there's a class of probability models which are easy to work with and reason about, for which general algorithms and mathematical tools exist, and which is often good enough: linear models.</p>
<p>The word "linear" immediately brings to mind straight lines. That's not what it means in this context. The linearity in linear models is because the output is a linear combination of "features" (predictor variables).</p>
<p>The general form is</p>
<div cid="n234" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n234" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-21" type="math/tex; mode=display">\hat{y_i} = c_1 e_{1,i} + c_2 e_{2,i} + ... +c_n e_{n,i},</script></div></div>
<p>where <script type="math/tex">\hat{y_i}</script> is the predicted value, <script type="math/tex">c_1</script> through <script type="math/tex">c_n</script> are constants, and <script type="math/tex">e_{1,i}</script> through <script type="math/tex">e_{n,i}</script> are the features describing the <script type="math/tex">i</script>th set of data. In the simplest case, a feature might be a value we measure directly, but in general it can be any function of data we measure. Ideally, the true value satisfies <script type="math/tex">y_i \approx c_1 e_{1,i} + ... + c_n e_{n,i}</script>.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3s-8sUD8qMFb_gZc5a6VAEwapyU-ez4s5pz9hpby05kYCmUVk4E8DVdD1w2cNLMcKVITg0S2F_9Endy-pgpXmre-3Mene0ouN8nJm_O9UX64i5dSWsEjV0PgjzSvefwVV_2kbjAQr_jcZ/s1278/linearmodel.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="678" data-original-width="1278" height="340" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3s-8sUD8qMFb_gZc5a6VAEwapyU-ez4s5pz9hpby05kYCmUVk4E8DVdD1w2cNLMcKVITg0S2F_9Endy-pgpXmre-3Mene0ouN8nJm_O9UX64i5dSWsEjV0PgjzSvefwVV_2kbjAQr_jcZ/w640-h340/linearmodel.png" width="640" /></a></div><p>In the above diagram, we see we measure the data <script type="math/tex">x_i</script> (note that it can be a tuple of values rather than a single value), pass it through some blackbox function to generate features, and take the prediction <script type="math/tex">\hat{y_i}</script> to be the sum of multiplying together each feature by the weight assigned to it.
</p><p>Note that the linear model above is a prediction-maker but not a probability model because it doesn't assign likelihoods. The probability model for a linear model is often taken to be</p>
<div cid="n239" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n239" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-22" type="math/tex; mode=display">y_i = c_1 e_{1,i} + c_2 e_{2,i} + ... +c_n e_{n,i}
+ \epsilon</script></div></div>
<p>that is, there's an error term <script type="math/tex">\epsilon</script> that we assume to be normally distributed with mean 0 and standard deviation <script type="math/tex">\sigma</script> (which may be known, or finding it may be part of fitting the model).</p>
<p>The above is also an equation for predicting one specific output (<script type="math/tex">y_i</script>) from one specific set of features, which in turn are determined by one specific input (e.g. a single data point). More generally we can write it in vector form:</p>
<div cid="n242" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n242" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-23" type="math/tex; mode=display">\pmb{y} \approx c_1 \pmb{e_1} + ... + c_n \pmb{e_n},</script></div></div>
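<p>where <script type="math/tex">\pmb{y}=(y_1, y_2, ..., y_{N})</script> collects the outputs for all <script type="math/tex">N</script> data items (the number of data items, not to be confused with the number of features <script type="math/tex">n</script>), and likewise <script type="math/tex">\pmb{e_j}</script> is a vector whose <script type="math/tex">i</script>th position corresponds to the <script type="math/tex">j</script>th feature of the <script type="math/tex">i</script>th data item.</p>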
<p>Note that we can read this equation in two ways: as a vector equation about data, as just described, that's fitted to give <script type="math/tex">\pmb{y}</script> from its features, or as a prediction, saying that the value of a particular <script type="math/tex">y_i</script> will be roughly this.</p>
<p>There's a set of standard tricks to use in linear modelling:</p>
<ul>
<li>"One-hot coding": using a feature that is 1 if the input data satisfies some condition (having a label, exceeding a value, etc.) and 0 otherwise.</li>
<li>If we have the data point <script type="math/tex">x_i</script>, using the features <script type="math/tex">e_{0,i} = 1</script>, <script type="math/tex">e_{1,i} = x_i</script>, and <script type="math/tex">e_{2,i} = x_i^2</script> to fit a quadratic (if you fit a polynomial of degree higher than 2 without a very solid reason, you're probably overfitting).</li>
<li>We often have a pattern with a known period <script type="math/tex">T</script> (days, years, etc.), and some non-zero starting phase <script type="math/tex">\phi</script>. Therefore we'd want a feature like <script type="math/tex">\sin((2\pi/T)x+\phi)</script>, where <script type="math/tex">x</script> is an input, to fit this pattern to. If <script type="math/tex">\phi</script> is known, we don't have a problem, but if we want to fit the phase, it doesn't work: the model is not linear in <script type="math/tex">\phi</script>. To fix this, use a trig angle addition identity; the above becomes <script type="math/tex">\sin(\phi) \cos((2\pi/T)x) + \cos(\phi) \sin((2\pi/T)x)</script>, where <script type="math/tex">\sin(\phi)</script> and <script type="math/tex">\cos(\phi)</script> are just constants, so they can be forgotten about because the fitting procedure will determine the constants of our features. (Recovering <script type="math/tex">\phi</script> from the final constants will take a bit of maths; note that the constants of the cosine and sine terms in the fitted model will have the amplitude mixed in, in addition to <script type="math/tex">\phi</script>.)</li>
</ul>
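<p>As a minimal sketch of the phase trick in the last bullet (assuming NumPy is available; the function name <code>fit_sinusoid</code> and the noise-free synthetic data are made up for illustration), we can fit the cosine and sine features separately and then untangle the amplitude and phase from their constants:</p>

```python
import numpy as np

def fit_sinusoid(x, y, period):
    """Fit y ~ c0 + a*cos(w*x) + b*sin(w*x), then recover amplitude and phase."""
    w = 2 * np.pi / period
    # Three features: a constant, a cosine, and a sine with the known period.
    features = np.column_stack([np.ones_like(x), np.cos(w * x), np.sin(w * x)])
    (c0, a, b), *_ = np.linalg.lstsq(features, y, rcond=None)
    # A*sin(w*x + phi) = A*sin(phi)*cos(w*x) + A*cos(phi)*sin(w*x),
    # so a = A*sin(phi) and b = A*cos(phi).
    amplitude = np.hypot(a, b)
    phase = np.arctan2(a, b)
    return c0, amplitude, phase

# Hypothetical data: offset 1.5, amplitude 2.0, period 5.0, phase 0.7.
x = np.linspace(0.0, 20.0, 200)
y = 1.5 + 2.0 * np.sin(2 * np.pi / 5.0 * x + 0.7)
c0, amplitude, phase = fit_sinusoid(x, y, period=5.0)
```

<p>With real, noisy data the recovered values would only be approximate, and the phase comes back in the range <script type="math/tex">(-\pi, \pi]</script>.</p>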
<p>Here's an annotated linear model with parameter interpretation:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSWNj2rvVQLPDEIVPnvVDAVAc3m6aEVPcvIB2QY7eaNmgPo_c4kL0X-h5hVqaXO-tUZSSUufaTBiXIY5WjRa-X5IaX0geg-dPeFdRhqRBofZa-i4RkztrX9r0ejdxfwE3pYPCbYPOejd7Z/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1022" data-original-width="1280" height="510" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSWNj2rvVQLPDEIVPnvVDAVAc3m6aEVPcvIB2QY7eaNmgPo_c4kL0X-h5hVqaXO-tUZSSUufaTBiXIY5WjRa-X5IaX0geg-dPeFdRhqRBofZa-i4RkztrX9r0ejdxfwE3pYPCbYPOejd7Z/w640-h510/examplelinear.png" width="640" /></a></div><br /><p></p>
<p>The features in this model:</p>
<ul>
<li><script type="math/tex">e_1=x</script>.</li>
<li><script type="math/tex">e_2</script> is 0 if <script type="math/tex">x < A</script> and 1 otherwise.</li>
<li><script type="math/tex">e_3</script> is 0 if <script type="math/tex">x < A</script> and <script type="math/tex">x</script> otherwise.</li>
</ul>
<p>(If we want to fit the best value of <script type="math/tex">A</script>, we'll have to do some maths and reconfigure the model. Right now <script type="math/tex">A</script> is a constant that's defined in the functions that calculate the features from the input data.)</p>
<p>The interpretation of the constants:</p>
<ul>
<li><script type="math/tex">c_0</script> is the prediction for <script type="math/tex">x=0</script>.</li>
<li><script type="math/tex">c_1</script> is the base slope.</li>
<li><script type="math/tex">c_2</script> is the difference between the prediction for <script type="math/tex">x=0</script> (the <script type="math/tex">y</script>-intercept of the <script type="math/tex">x < A</script> line) and the <script type="math/tex">y</script>-intercept of the <script type="math/tex">x>A</script> line.</li>
<li><script type="math/tex">c_3</script> is how much the slope changes after <script type="math/tex">x=A</script>.</li>
</ul>
<p>We could have chosen different features (for example, letting <script type="math/tex">e_1 = 0</script> for <script type="math/tex">x > A</script>), and then gotten perhaps more readable constants (<script type="math/tex">c_3</script> would become just the slope, not the difference in slope). We could also have added a feature like <script type="math/tex">e_4 = x^2</script>, and then the model would no longer look like just straight lines. But whatever we do, we need to be careful to interpret the constants we get correctly, especially when the model gets complicated.</p>
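<p>To make the interpretation of the constants concrete, here is a small sketch (assuming NumPy; the breakpoint value, the helper <code>piecewise_features</code>, and the two generating lines are invented for illustration) that builds the features listed above, plus a constant feature for <script type="math/tex">c_0</script>, and fits them to data drawn exactly from two straight lines:</p>

```python
import numpy as np

A = 3.0  # hypothetical breakpoint, fixed in advance rather than fitted

def piecewise_features(x, breakpoint):
    """Columns: constant, e1 = x, e2 = 1[x >= A], e3 = x * 1[x >= A]."""
    after = (x >= breakpoint).astype(float)
    return np.column_stack([np.ones_like(x), x, after, after * x])

x = np.linspace(0.0, 6.0, 61)
# Made-up ground truth: y = 1 + 2x before the breakpoint, y = 4 + 0.5x after it.
y = np.where(x < A, 1.0 + 2.0 * x, 4.0 + 0.5 * x)

c, *_ = np.linalg.lstsq(piecewise_features(x, A), y, rcond=None)
c0, c1, c2, c3 = c
# c0 ~ 1.0  (prediction at x = 0)
# c1 ~ 2.0  (base slope)
# c2 ~ 3.0  (intercept of the second line minus intercept of the first)
# c3 ~ -1.5 (change in slope after the breakpoint)
```

<p>Note that the slope after the breakpoint is <script type="math/tex">c_1 + c_3</script>, not <script type="math/tex">c_3</script> alone, which is exactly the kind of thing that is easy to misread.</p>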
<p>For our cat weight prediction example, we might expect weight <script type="math/tex">W</script> and length <script type="math/tex">L</script> to have a relation like <script type="math/tex">W \approx c L^3</script>, where <script type="math/tex">c</script> is a constant that the model will fit. If we want to ask questions about whether a cubic relation really is the best, take logs and fit something like <script type="math/tex">\log(W) = c_1 + c_2 \log(L)</script> – <script type="math/tex">c_2</script> tells us the exponent.</p>
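<p>A sketch of the log-log fit, assuming NumPy and using made-up cat measurements generated from an exact cubic law (the numbers and variable names are purely illustrative):</p>

```python
import numpy as np

# Hypothetical lengths, with weights generated from W = 0.02 * L^3 exactly.
lengths = np.array([30.0, 35.0, 40.0, 45.0, 50.0])
weights = 0.02 * lengths**3

# Fit log(W) = c1 + c2 * log(L): a linear model in the features 1 and log(L).
features = np.column_stack([np.ones_like(lengths), np.log(lengths)])
(c1, c2), *_ = np.linalg.lstsq(features, np.log(weights), rcond=None)

exponent = c2            # the fitted power of L
prefactor = np.exp(c1)   # the constant c in W = c * L^exponent
```

<p>On real data, an exponent noticeably different from 3 would suggest that the cubic relation is not the best description.</p>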
<h4>Feature spaces and fitting linear models</h4>
<p>The main benefit of linear models is that by talking about linear combinations of data vectors we reduce the maths of fitting parameters to linear algebra. Linear algebra is about transformations of space and the vectors in it, so it also allows for a visual interpretation of everything.</p>
<p>Let's say we have a model like this:</p>
<div cid="n279" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n279" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-24" type="math/tex; mode=display">\pmb{y} \approx c_1 \pmb{e_1} + c_2 \pmb{e_2}.</script></div></div>
<p>Here, <script type="math/tex">\pmb{y}</script> is the actual measured data, and <script type="math/tex">\pmb{e_i}</script> are functions of the (also measured) predictor variables. Let's say <script type="math/tex">\pmb{y} = (y_1, y_2, y_3)</script> – i.e., we have three data points. We can imagine <script type="math/tex">\pmb{y}</script> as a vector pointing somewhere in 3D space, with <script type="math/tex">y_1</script>, <script type="math/tex">y_2</script>, and <script type="math/tex">y_3</script> the distances along the <script type="math/tex">x</script>, <script type="math/tex">y</script>, and <script type="math/tex">z</script> axes. Likewise, <script type="math/tex">\pmb{e_1}</script> and <script type="math/tex">\pmb{e_2}</script> can be thought of as 3D vectors encoding some (function of the) data we've measured.</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCF-UhlG3hSnnJnG6iKO3OeoMtAAJS5M-VRL66Pn_Usjwaep_RiurK0hWIP4fPbpVfwyiuHWZP-aLgjPBK6lmFnuIyyCv8djPvJcwrVkoCvaeWnUBU7hf8GS8ShD9uNfiF1e1WVeRNQ5s_/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="920" data-original-width="1278" height="288" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCF-UhlG3hSnnJnG6iKO3OeoMtAAJS5M-VRL66Pn_Usjwaep_RiurK0hWIP4fPbpVfwyiuHWZP-aLgjPBK6lmFnuIyyCv8djPvJcwrVkoCvaeWnUBU7hf8GS8ShD9uNfiF1e1WVeRNQ5s_/w400-h288/3d.png" width="400" /></a></div><br /><p></p>
<p>Now the only dials a linear model gives us to adjust are the weights of <script type="math/tex">\pmb{e_1}</script> and <script type="math/tex">\pmb{e_2}</script>: <script type="math/tex">c_1</script> and <script type="math/tex">c_2</script>. There's a 2D space of them (since there are two constants to adjust – <script type="math/tex">c_1</script> and <script type="math/tex">c_2</script>), and as it happens, there's a nice geometric interpretation: each pair <script type="math/tex">(c_1, c_2)</script> corresponds to a point on the plane spanned by <script type="math/tex">\pmb{e_1}</script> and <script type="math/tex">\pmb{e_2}</script> (specifically, the point you get to if you move <script type="math/tex">c_1</script> times along <script type="math/tex">\pmb{e_1}</script> and then <script type="math/tex">c_2</script> times along <script type="math/tex">\pmb{e_2}</script>).</p>
<p>So what are the best values of <script type="math/tex">c_1</script> and <script type="math/tex">c_2</script>? The intuitive answer is that we want to get as close as possible to <script type="math/tex">\pmb{y}</script>:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibOhey3S1xJKvQzSlzvGCeTnnz1LY7i7XvHUz66g44kQlqusp0PZaBb8luYFoqvndiFiQ339TdW-KRFHKcGHVJ4SVSuKhNNgIWTqoW2eeHQVKugvXqG5sbyqg1IHqp-dzwtu3_p6Eo8iOP/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1080" data-original-width="1280" height="541" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibOhey3S1xJKvQzSlzvGCeTnnz1LY7i7XvHUz66g44kQlqusp0PZaBb8luYFoqvndiFiQ339TdW-KRFHKcGHVJ4SVSuKhNNgIWTqoW2eeHQVKugvXqG5sbyqg1IHqp-dzwtu3_p6Eo8iOP/w640-h541/featurespace.png" width="640" /></a></div><p></p>
<p>In this case, the closest to <script type="math/tex">\pmb{y}</script> that we can reach on the plane spanned by <script type="math/tex">\pmb{e_1}</script> and <script type="math/tex">\pmb{e_2}</script> is the green vector, and the black vector is the difference between the predicted data vector and actual data vector.</p>
<p>Mathematically, what are we doing here? We're minimising the distance between the vector <script type="math/tex">\hat{\pmb{y}} = c_1 \pmb{e_1} + c_2 \pmb{e_2}</script> (where <script type="math/tex">c_1</script> and <script type="math/tex">c_2</script> can be varied) and <script type="math/tex">\pmb{y}</script>; this distance is given by</p>
<div cid="n287" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n287" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-25" type="math/tex; mode=display">\sqrt{(\hat{y_1} - y_1)^2
+ (\hat{y_2} - y_2)^2
+ (\hat{y_3} - y_3)^2
}.</script></div></div>
<p>Previously we simplified optimisation by applying a logarithm (a monotonically increasing function) and optimising that; this time we do the same by applying the squaring function (which is monotonically increasing for positive numbers, which our distance is limited to). This means that the quantity to minimise is</p>
<div cid="n289" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n289" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-26" type="math/tex; mode=display">(\hat{y_1} - y_1)^2
+ (\hat{y_2} - y_2)^2
+ (\hat{y_3} - y_3)^2.</script></div></div>
<p>In other words, we minimise the sum of squared errors ("least squares estimation" is the most common phrase).</p>
<p>If we have more than three data points, then we can't picture it, but the idea is exactly the same. Fitting a linear model of <script type="math/tex">m</script> features to a dataset of <script type="math/tex">n</script> points boils down to moving as close as possible in <script type="math/tex">n</script>D space to the observed data vector, while limited to the <script type="math/tex">m</script>-dimensional (at most; see below) space spanned by the features.</p>
<p>(Above, <script type="math/tex">n=3</script> and <script type="math/tex">m=2</script>. Generally <script type="math/tex">n</script> is huge because datasets can be huge, while <script type="math/tex">m</script> is much smaller since it's the number of features we've written down into the model.)</p>
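<p>The geometric picture can be checked numerically. The sketch below (assuming NumPy; the three-component vectors are arbitrary made-up numbers) finds the closest point to <script type="math/tex">\pmb{y}</script> on the plane spanned by two feature vectors, and confirms that the leftover error vector is perpendicular to that plane:</p>

```python
import numpy as np

# Made-up feature vectors and data vector, three data points each.
e1 = np.array([1.0, 0.0, 1.0])
e2 = np.array([0.0, 1.0, 1.0])
y = np.array([2.0, 3.0, 4.0])

E = np.column_stack([e1, e2])        # one column per feature
c, *_ = np.linalg.lstsq(E, y, rcond=None)

y_hat = E @ c                        # closest reachable point on the plane
residual = y - y_hat                 # the error vector
squared_error = np.sum(residual**2)  # the quantity least squares minimises

# The error vector is orthogonal to both feature vectors.
dot_e1 = residual @ e1
dot_e2 = residual @ e2
```

<p>Here the fitted constants work out to <script type="math/tex">c_1 = 5/3</script> and <script type="math/tex">c_2 = 8/3</script>; nudging either of them moves the prediction along the plane and can only increase the squared error.</p>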
<blockquote><p><i>A maths lecturer is giving a lecture about 5-dimensional geometry.</i></p><i>
</i><p><i>A student asks a question: "I can follow the algebra just fine, but it would be helpful if I could visualise it. Is there any way to do that?"</i></p><i>
</i><p><i>The lecturer replies: "Oh, it's easy. Just imagine everything in <script type="math/tex">n</script> dimensions, and then let <script type="math/tex">n=5</script>."</i></p><i>
</i><p><i> </i></p><i>
</i><p><i>(variants of this joke are common; see for example <a href="http://www.personal.psu.edu/sxt104/mathjoke1.html">here.</a>)</i></p>
</blockquote>
<h5>Linear independence</h5>
<p>A set of vectors is linearly dependent if there exists a vector in it that can be written as a linear combination of the other vectors. If your feature vectors are linearly dependent, you will get the same predictions out of your model, but you can't interpret the coefficients.</p>
<p>(For visual intuition: two vectors in 2D are linearly dependent if they lie on the same line, three vectors in 3D are linearly dependent if they lie on the same plane (a superset of the case that they lie on the same line), and so on.)</p>
<p>An easy way to make this mistake is if you're doing one-hot coding of categories. Let's say you're fitting a linear model to estimate student exam grades <script type="math/tex">y</script> based on their university, with a model that looks like this:</p>
<div cid="n301" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n301" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-27" type="math/tex; mode=display">y \approx \alpha + \beta \cdot 1_{\text{Oxford}}+\gamma\cdot1_{\text{Cambridge}}+...,</script></div></div>
<p>using indicator function notation. Whatever linear fitting routine you use will happily give you coefficient values and the predictions it gives will be sensible, but you won't be able to interpret the coefficients. To see what's happening, consider an Oxford student: their predicted grade <script type="math/tex">y</script> is <script type="math/tex">\alpha + \beta</script>. What are <script type="math/tex">\alpha</script> and <script type="math/tex">\beta</script>? Good question – we can only assign meaning to their combination. If instead we eliminate one university and write</p>
<div cid="n303" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n303" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-28" type="math/tex; mode=display">y \approx \alpha + \beta \cdot 1_{\text{Cambridge}} + ...,</script></div></div>
<p>when we now fit the coefficients, <script type="math/tex">\alpha</script> will be the predicted grade for Oxford students, and <script type="math/tex">\alpha+\beta</script> the predicted grade for Cambridge students, so we can interpret <script type="math/tex">\alpha</script> as the Oxford average, and <script type="math/tex">\beta</script> as the difference between Oxford and Cambridge. (The predictions given by the model won't change though.)</p>
<p>The vector interpretation is that if our dataset contains, say, 3 Oxford students followed by 2 Cambridge students, the (5D) data vectors in the first model will be</p>
<div cid="n306" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n306" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-29" type="math/tex; mode=display">\alpha \begin{pmatrix}1 \\ 1 \\ 1 \\ 1 \\ 1\end{pmatrix}
+ \beta
\begin{pmatrix}1 \\ 1 \\ 1 \\ 0 \\ 0\end{pmatrix}
+ \gamma
\begin{pmatrix}0 \\ 0 \\ 0 \\ 1 \\ 1\end{pmatrix}.</script></div></div>
<p>But these vectors aren't linearly independent: the last two vectors sum up to the first one, and therefore there will be many triplets <script type="math/tex">(\alpha, \beta, \gamma)</script> that give identical predictions.</p>
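<p>A short sketch of this, assuming NumPy and some invented exam grades: the three feature vectors above only span a 2D space, two quite different coefficient triplets give identical predictions, and dropping the Oxford indicator makes the coefficients interpretable again.</p>

```python
import numpy as np

# 3 Oxford students followed by 2 Cambridge students.
ones = np.ones(5)
oxford = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
cambridge = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

X_full = np.column_stack([ones, oxford, cambridge])
rank_full = np.linalg.matrix_rank(X_full)  # 2, not 3: the columns are dependent

# Two different (alpha, beta, gamma) triplets with identical predictions.
pred_a = X_full @ np.array([60.0, 5.0, 0.0])
pred_b = X_full @ np.array([0.0, 65.0, 60.0])

# Drop the Oxford indicator: Oxford becomes the baseline.
X_reduced = np.column_stack([ones, cambridge])
grades = np.array([64.0, 66.0, 65.0, 59.0, 61.0])  # made-up grades
(alpha, beta), *_ = np.linalg.lstsq(X_reduced, grades, rcond=None)
# alpha is the Oxford average, beta is Cambridge minus Oxford.
```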
<h4>Linear fitting and MLE</h4>
<p>We talked about MLE being the holy grail of model fitting, and then about linear models and how fitting them comes down to a geometry problem. As it turns out, MLE lurks behind least squares estimation as well.</p>
<p>I mentioned earlier that linear models often assume a normal distribution for errors. Let's assume that, and do MLE.</p>
<p>Our model is that</p>
<div cid="n312" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n312" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-30" type="math/tex; mode=display">Y_i = c_1 e_{1,i} + ... + c_n e_{n,i} + \epsilon,</script></div></div>
<p>where <script type="math/tex">\epsilon \sim N(0,\sigma^2)</script> (i.e. follows a normal distribution with mean zero and standard deviation <script type="math/tex">\sigma</script>).</p>
<p>A useful property of normal distributions is that if we add a constant <script type="math/tex">c</script> to a normally distributed variable with mean <script type="math/tex">\mu</script>, the result is still normally distributed, with mean <script type="math/tex">\mu + c</script> and the same standard deviation (staying within the same family of distributions isn't true of all distributions!). Therefore we can write the above as</p>
<div cid="n315" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n315" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-31" type="math/tex; mode=display">Y_i \sim N(c_1 e_{1,i} + ... + c_n e_{n,i}, \sigma^2).</script></div></div>
<p>The likelihood for getting <script type="math/tex">y</script> is</p>
<div cid="n317" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n317" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-32" type="math/tex; mode=display">\Pr_Y(y;c_1...c_n, \sigma)
=
\frac{1}{\sigma \sqrt{2 \pi}}
e^{-\frac{1}{2}
\left(
\frac{y - (c_1 e_{1,i} + ... + c_n e_{n,i})}
{\sigma}
\right)^2},</script></div></div>
<p>once again copying out the likelihood function for normal distributions.</p>
<p>Now remember that we just want to fit <script type="math/tex">c_1</script> through <script type="math/tex">c_n</script>. These only occur in the exponent, so we can ignore all the constants out front, and also we can see that since there's a negative in the exponent, maximising it is equivalent to minimising the stuff in the exponent. Taking out <script type="math/tex">\sigma</script> and constants, the relevant stuff to minimise is</p>
<div cid="n320" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n320" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-33" type="math/tex; mode=display">(y-(c_1 e_{1,i} + ... + c_n e_{n,i}))^2,</script></div></div>
<p>where we can see that the thing we subtract from <script type="math/tex">y</script> is our model's prediction of <script type="math/tex">y</script> (one component of what we previously denoted <script type="math/tex">\hat{\pmb{y}}</script>). Once again, we can see we're minimising a square of the error. Of course, we have many <script type="math/tex">y</script>-values to fit; to see that it's the sum of these that we minimise, rather than some other function of them, just note that if we take a logarithm we'll get a term like the above (times constants) for each data point we're using to fit.</p>
<p>So least-squares fitting comes from MLE and the assumption of normally distributed errors.</p>
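<p>We can sanity-check this numerically. The sketch below (assuming NumPy; the data, the noise level, and the helper <code>gaussian_log_likelihood</code> are invented for illustration) writes out the log-likelihood under normally distributed errors and confirms that the least-squares coefficients score higher than randomly perturbed ones:</p>

```python
import numpy as np

def gaussian_log_likelihood(c, features, y, sigma):
    """Log-likelihood of y under Y_i ~ N(prediction_i, sigma^2), summed over points."""
    residuals = y - features @ c
    n = len(y)
    return -n * np.log(sigma * np.sqrt(2 * np.pi)) - np.sum(residuals**2) / (2 * sigma**2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
features = np.column_stack([np.ones_like(x), x])
# Synthetic data: a straight line plus normally distributed noise.
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.size)

c_ls, *_ = np.linalg.lstsq(features, y, rcond=None)
best = gaussian_log_likelihood(c_ls, features, y, sigma=0.1)

# Any other choice of coefficients should give a lower log-likelihood.
perturbed = [
    gaussian_log_likelihood(c_ls + delta, features, y, sigma=0.1)
    for delta in rng.normal(0.0, 0.05, size=(100, 2))
]
```

<p>The value of <script type="math/tex">\sigma</script> changes the size of the log-likelihood but not which coefficients maximise it, which is why we could ignore it above.</p>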
<p>(Are errors normally distributed? Often yes. Remember though that our features are functions of things we measure; even if <script type="math/tex">x</script> has normally-distributed errors, after we apply an arbitrary function to it to generate feature <script type="math/tex">e</script>, the resulting <script type="math/tex">e</script> might not have normally distributed errors (but for many simple functions it still will). We could be more fancy, and devise other fitting procedures, but often least squares is good enough.)</p>
<h3>Empirical distributions</h3>
<p>What's the simplest probability model we can fit to a dataset? It's tempting to think of an answer like "a normal distribution", or "a linear model with one linear feature". But we can be even more radical: treat the dataset itself as a distribution.</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiozLtTO4Uo4dzA7Fpe9zp0XcEn0ge0OfYNpWFSLJ7C9rzhA4_u4DET4I1viiOaNXy3U6qyuGhjal1QhSgZ_Si9J3o95Cq5M6GpMAXLnbOM-obPaMPmyeEnev4O2lAkKMn_XU2TVcG8rLUJ/" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="686" data-original-width="1278" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiozLtTO4Uo4dzA7Fpe9zp0XcEn0ge0OfYNpWFSLJ7C9rzhA4_u4DET4I1viiOaNXy3U6qyuGhjal1QhSgZ_Si9J3o95Cq5M6GpMAXLnbOM-obPaMPmyeEnev4O2lAkKMn_XU2TVcG8rLUJ/w640-h344/epdf.png" width="640" /></a></div><p></p>
<p>On the left, we've plotted the number of data points that take different values of <script type="math/tex">x</script> (this is a discrete distribution; for a continuous distribution, the probability that any two samples drawn are equal is infinitesimal). On the right, all we've done is normalised the distribution, by rescaling the vertical axis so that the heights of all the bars sum to one. Once we've done that, we can go ahead and call it a probability distribution, and assign the meaning that the height of the bar at <script type="math/tex">x</script> is the probability that the distribution <script type="math/tex">X</script> that we've just defined takes the value <script type="math/tex">x</script>. This is called an empirical distribution.</p>
<p>Sampling from an empirical distribution is easy – just pick a value at random from the dataset. (Of course, the likelihood such a distribution assigns to any value not in the dataset is zero, which can be a problem for many use cases.)</p>
<p>In fact, you've probably already dealt with empirical distributions, at least implicitly. When you calculate the mean and variance of a dataset, you can interpret this as calculating the properties of the empirical distribution given by that dataset. An empirical distribution as an abstract thing apart from your dataset may seem ad hoc, but it's not any less defined than a normal distribution.</p>
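<p>A minimal sketch using only the Python standard library (the dataset is made up): build the empirical distribution by normalising counts, sample from it by picking a data point at random, and check that its mean is just the dataset mean.</p>

```python
import random
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 4]  # made-up discrete dataset

# Normalise the counts so the bar heights sum to one.
counts = Counter(data)
pmf = {value: count / len(data) for value, count in counts.items()}

# Sampling from the empirical distribution: pick a data point at random.
sample = random.choice(data)

# The mean of the empirical distribution equals the mean of the dataset.
mean_from_pmf = sum(value * p for value, p in pmf.items())
mean_from_data = sum(data) / len(data)

# Any value not in the dataset gets probability zero.
prob_unseen = pmf.get(5, 0.0)
```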
<p>The standard way to illustrate an empirical distribution is by plotting its cumulative distribution function (cdf); an empirical one is known as an ecdf. This is almost necessary for continuous variables. In general, the ecdf of a dataset is a very useful and general way to visualise it: it saves you from the pains of histograms (how large to make the bins? if you take logs or squares first, do you take them before or after binning? etc. etc.), and is also complete in the sense of technically displaying every point in the dataset.</p>
<p>The ecdf for the above distribution would look something like this:</p>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTTv1ByM7M6qlsLtPmhTSnnYBlPy4EW5fa1Qb_e7IH0eGB4ufIMX6xkklDjaJ13IjNe_0q2BMe-JceGGTu5hQPvqIaCl_bsP6LjBRQkKHo-xJCGFj-EHYUw7sZq6ZBHJrUMXp5Bjot40IU/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="530" data-original-width="1000" height="340" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTTv1ByM7M6qlsLtPmhTSnnYBlPy4EW5fa1Qb_e7IH0eGB4ufIMX6xkklDjaJ13IjNe_0q2BMe-JceGGTu5hQPvqIaCl_bsP6LjBRQkKHo-xJCGFj-EHYUw7sZq6ZBHJrUMXp5Bjot40IU/w640-h340/ecdf.png" width="640" /></a></div><p></p><p>(Like any ecdf, it takes the value 0 up until the first data point and the value 1 after the last data point.)</p>
<p>If we now fit any parametric (i.e. non-empirical) distribution, comparing its cdf to the ecdf is a good test of how good the fit is.</p>
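<p>As a rough sketch of such a comparison (assuming NumPy; the data, the helper names <code>ecdf</code> and <code>normal_cdf</code>, and the choice of a normal distribution as the parametric model are all illustrative), we can compute the ecdf, fit a normal distribution by matching the mean and standard deviation, and look at the largest gap between the two curves at the data points:</p>

```python
import math
import numpy as np

def ecdf(data):
    """Return sorted data and the ecdf height reached at each sorted point."""
    xs = np.sort(np.asarray(data, dtype=float))
    heights = np.arange(1, len(xs) + 1) / len(xs)
    return xs, heights

def normal_cdf(x, mu, sigma):
    """Cdf of a normal distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

data = [2.1, 3.4, 1.9, 2.8, 3.0, 2.5, 2.2, 3.7]  # made-up measurements
xs, heights = ecdf(data)

# Fit the parametric model: here, a normal with the dataset's mean and std.
mu, sigma = np.mean(data), np.std(data)

# A crude goodness-of-fit number: the largest vertical gap at the data points.
max_gap = max(abs(h - normal_cdf(x, mu, sigma)) for x, h in zip(xs, heights))
```

<p>In practice, plotting both curves on the same axes is usually more informative than any single number.</p>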
<h4>Measuring the goodness of a model fit with KL divergence</h4>
<p>The empirical distribution is the best possible fit to a given dataset, in the sense that no other distribution assigns the dataset a higher likelihood, and therefore it's a good benchmark to measure the fit of a proposed model against.</p>
<p>Let's say our data is <script type="math/tex">x=x_1, ... ,x_n</script>, and the empirical distribution is <script type="math/tex">X^*</script>. The likelihood of drawing <script type="math/tex">x</script> from <script type="math/tex">X^*</script> is (under the assumption of each <script type="math/tex">x_i</script> being drawn independently)</p>
<div cid="n338" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n338" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-34" type="math/tex; mode=display">\Pr_{X^*}(x_1) \cdot ... \cdot \Pr_{X^*}(x_n).</script></div></div>
<p>Now <script type="math/tex">\Pr_{X^*}(x_i)</script> is just the fraction of the <script type="math/tex">x_j</script> in <script type="math/tex">x</script> that are equal to <script type="math/tex">x_i</script>. Writing <script type="math/tex">N_{x_i}</script> to mean the number of values equal to <script type="math/tex">x_i</script> in the data, we can write</p>
<div cid="n340" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n340" mdtype="math_block" spellcheck="false">
<div class="md-rawblock-container md-math-co