Strata of the World: Deciding not to found a human-data-for-alignment startup (published 2022-09-27)<p style="text-align: center;"><i><span style="font-size: x-small;">8.6k words (~30 minutes)</span><b> <br /></b></i></p><p style="text-align: center;"><b><i>Both the project and this write-up were a collaboration with Matt Putz. </i><br /></b></p><p style="text-align: left;"><b> </b></p><p style="text-align: left;"><a href="https://forum.effectivealtruism.org/users/mathieu-putz"><b>Matt Putz</b></a><b> and I worked together for the first half of the summer to figure out if we should found a startup with the purpose of helping AI alignment researchers get the datasets they need to train their ML models (especially in cases where the dataset is based on human-generated data). This post, also published on the <a href="https://forum.effectivealtruism.org/posts/iBeWbfQLA9EKfsdhu/why-we-re-not-founding-a-human-data-for-alignment-org">Effective Altruism Forum</a> and <a href="https://www.lesswrong.com/posts/qArDMixsx77a9xL45/why-we-re-not-founding-a-human-data-for-alignment-org-1">LessWrong</a> (both of which may contain additional discussion in the comments), is a summary of our findings, and why we decided to not do it.</b><br /></p><div class="PostsPage-postContent ContentStyles-base content ContentStyles-postBody"><div><h1 id="TL_DR">Summary</h1><p><b>One-paragraph summary: </b>we (two recent graduates) spent about half of the summer exploring the idea of starting an organisation producing custom human-generated datasets for AI alignment research. 
Most of our time was spent on customer interviews with alignment researchers to determine if they have a pressing need for such a service. We decided not to continue with this idea, because there doesn’t seem to be a human-generated data niche (unfilled by existing services like Surge) that alignment teams would want outsourced.</p><p> </p><p><b>In more detail</b>: The idea of a human datasets organisation was <span><span><span><a href="https://forum.effectivealtruism.org/posts/MBDHjwDvhDnqisyW2/awards-for-the-future-fund-s-project-ideas-competition"><u>one of the winners of the Future Fund project ideas competition</u>, </a></span></span></span>still figures on their <span><span><span><a href="https://ftxfuturefund.org/projects/high-quality-human-data-for-ai-alignment-nbsp/"><u>list</u></a></span></span></span> of project ideas, and had been advocated before then by some people, including Beth Barnes. Even though we ended up deciding against, we think this was a reasonable and high-expected-value idea for these groups to advocate at the time.</p><p>Human-generated data is often needed for ML projects or benchmarks if a suitable dataset cannot be e.g. scraped from the web, or if human feedback is required. Alignment researchers conduct such ML experiments, but sometimes have different data requirements than standard capabilities researchers. As a result, it seemed plausible that there was some niche unfilled by the market to help alignment researchers solve problems related to human-generated datasets. In particular, we thought - and to some extent confirmed - that the most likely such niche is human data generation that requires particularly competent or high-skill humans. 
We will refer to this as <b>high-skill (human) data</b>.</p><p>We (Matt & Rudolf) went through <a href="https://forum.effectivealtruism.org/posts/8QfQcFyj6aGNM78kz/learning-from-matching-co-founders-for-an-ai-alignment">an informal co-founder matching process along with four other people</a> and were chosen as the co-founder pair to explore this idea. In line with standard startup advice, our first step was to explore whether or not there is a concrete current need for this product by conducting interviews with potential customers. We talked to about 15 alignment researchers, most of them selected on the basis of doing work that requires human data. A secondary goal of these interviews was to build better models for the future importance and role of human feedback in alignment.</p><p>Getting human-generated data does indeed cost many of these researchers significant time and effort. However, we think to a large extent this is because dealing with humans is inherently messy, rather than existing providers doing a bad job. Surge AI in particular seems to offer a pretty good and likely improving service. Furthermore, many companies have in-house data-gathering teams or are in the process of building them.</p><p>Hence we have decided to not further pursue this idea.</p><p>Other projects in the human data generation space may still be valuable, especially if the importance of human feedback in ML continues to increase, as we expect. 
This might include people specialising in human data as a career.</p><p>The factors that matter most for doing human dataset provision well include: high-skill contractors, fast iteration, and high-bandwidth communication and shared understanding between the research team, the provider organisation and the contractors.</p><p>We are keen to hear other people’s thoughts, and would be happy to talk or to share more notes and thoughts with anyone interested in working on this idea or a similar one in the future.</p><p><br /> </p><h1 id="Theory_of_Change">Theory of Change</h1><p>A major part of AI alignment research requires doing machine learning (ML) research, and ML research in turn requires training ML models. This involves expertise and execution ability in three broad categories: algorithms, compute, and data, the last of which is very neglected by EAs.</p><p>We expect training on data from human feedback to become an increasingly popular and very powerful tool in mainstream ML (see <span><span><span><a href="https://forum.effectivealtruism.org/posts/iBeWbfQLA9EKfsdhu/why-we-re-not-founding-a-human-data-for-alignment-org#Will_human_feedback_become_a_much_bigger_deal__Is_this_a_very_quickly_growing_industry_"><u>below</u></a></span></span></span>). Furthermore, many proposals for alignment (for example: reinforcement learning from human feedback (RLHF) and variants like recursive reward modelling, iterated amplification, and safety via debate) would require lots of human interaction or datasets based on human-generated data.</p><p>While many services (most notably Surge) exist for finding labour to work on data generation for ML models, it seems plausible that an EA-aligned company could add significant value because:</p><ul><li>Markets may not be efficient enough to fill small niches that are more important to alignment researchers than other customers; high-skill human data that requires very competent crowdworkers may be one such example. 
If alignment researchers can get it at all, it might be very expensive.</li><li>We have a better understanding of alignment research agendas, which may allow us to make better-informed decisions on many implementation details with less handholding, thereby saving researchers time.</li><li>We would have a shared goal with our customers: reducing AI x-risk. Though profit motives already provide decent incentives to offer a good service, mission alignment helps avoid adversarial dynamics, increases trust, and reduces friction in collaboration.</li><li>An EA-led company may be more willing to make certain strategic moves that go against its profit incentives; e.g. investing heavily in detecting a model’s potential attempts to deceive the crowdworkers, even when it’s hard for outsiders to tell whether such monitoring efforts are sincere and effective (and thus customers may not be willing to pay for it). Given that crowdworkers might provide a reward signal, they could be a key target for deceptive AIs.</li></ul><p>Therefore, there is a chance that an EA-led human data service that abstracts out some subset of dataset-related problems (e.g. 
contractor finding, instruction writing/testing, UI and pipeline design/coding, experimentation to figure out best practices and accumulate that knowledge in one place) would:</p><ol><li>save the time of alignment researchers, letting them make more progress on alignment; and</li><li>reduce the cost (in terms of time and annoying work) required to run alignment-relevant ML experiments, and therefore bring more of them below the bar at which it makes sense to run them, thus increasing the number of such experiments that are run.</li></ol><p>In the longer run, benefits of such an organisation might include:</p><ul><li>There is some chance that we could simply outcompete existing ML data generation companies and be better even in the cases where they do provide a service; this is especially plausible for relatively niche services. In this scenario we’d be able to exert some marginal influence over the direction of the AI field, for example by only taking alignment-oriented customers. This would amount to differential development of safety over capabilities. Beyond only working with teams that prioritise safety, we could also pick among self-proclaimed “safety researchers”. It is common for other members of the community to accuse proclaimed safety efforts of helping capabilities more than alignment.</li><li>There are plausibly critical actions that might need to be taken for alignment, possibly quickly during “crunch-time”, that involve a major (in quality or scale) data-gathering project (or something like large-scale human-requiring interpretability work, that makes use of similar assets, like a large contractor pool). 
At such a time it might be very valuable to have an organisation committed to x-risk minimisation with the competence to carry out any such project.</li></ul><p>Furthermore, if future AIs will learn human values from human feedback, then higher data quality will be equivalent to a training signal that points more accurately at human values. In other words, higher quality data may directly help with outer alignment (though we're not claiming that it could realistically solve it on its own). In discussions, it seemed that Matt gave this argument slightly more weight than Rudolf.</p><p>While these points are potentially high-impact, we think that there are significant problems with starting an organisation mainly to build capacity to be useful only at some hypothetical future moment. In particular, we think it is hard to know exactly what sort of capacity to build (and the size of the target in type-of-capacity space might be quite small), and there would be little feedback that the organisation could improve or course-correct based on. </p><p>More generally, both of us believe that EA is right now partly bottlenecked by people who can start and scale high-impact organisations, which is a key reason why we’re considering entrepreneurship. This seems particularly likely given the large growth of the movement. <br /> </p><h1 id="What_an_org_in_this_space_may_look_like">What an org in this space may look like</h1><h2 id="Providing_human_datasets">Providing human datasets</h2><p>The concept we most seriously considered was a for-profit that would specialise in meeting the specific needs of alignment researchers, probably by focusing on very high-skill human data. Since this niche is quite small, the company could offer a very custom-tailored service. At least for the first couple years, this would probably mean both of us having a detailed understanding of the research projects and motivations of our customers. 
That way, we could get a lot of small decisions right, without the researchers having to spend much time on it. We might be especially good at that compared to competitors, given our greater understanding of alignment.</p><h2 id="Researching_enhanced_human_feedback">Researching enhanced human feedback</h2><p>An alternative we considered was founding a non-profit that would research how to enhance human feedback. See this <span><span><a class="PostLinkPreviewWithPost-link" href="https://www.lesswrong.com/posts/ybThg9nA7u6f8qfZZ/techniques-for-enhancing-human-feedback"><u>post</u></a></span></span> by Ajeya Cotra for some ideas on what this kind of research could look like. The central question is whether and how you can combine several weak training signals into a stronger more accurate one. If this succeeded, maybe (enhanced) human feedback could become a more accurate (and thereby marginally safer) signal to train models on.</p><p>We decided against this for a number of reasons:</p><ul><li>Currently, neither of us has more research experience than an undergraduate research project.</li><li>We thought we could get a significant fraction of the benefits of this kind of research even if we did the for-profit version, and plausibly even more valuable expertise.<ul><li>First of all, any particular experiment that funders would have liked to see, they could have paid us to do, although we freely admit that this is very different from someone pushing forward their own research agenda.</li><li>More importantly, we thought a lot of the most valuable expertise to be gained would come in the form of <b>tacit knowledge and answers to concrete boring questions</b> that are not best answered by doing “research” on them, but rather by iterating on them while trying to offer the best product (e.g. 
“Where do you find the best contractors?”, “How do you incentivize them?”, “What’s the best way to set up communication channels?”).<ul><li>It is our impression that Ought pivoted away from doing abstract research on factored cognition and toward offering a valuable product for related reasons.</li></ul></li></ul></li><li>This topic seems plausibly especially tricky to research (though some people we’ve spoken to disagreed): <ul><li>At least some of the proposed experiments would not involve ML models at all. We fear that this might make it especially easy to fool ourselves into thinking some experiment might eventually turn out to be useful when it won’t. More generally, the research would be pretty far removed from the end product (very high quality human feedback). In the for-profit case on the other hand, we could easily tell whether alignment teams were willing to pay for our services and iteratively improve. </li></ul></li></ul><h2 id="For_profit_vs_non_profit">For-profit vs non-profit</h2><p>We can imagine two basic funding models for this org: </p><ul><li>either we’re a nonprofit directly funded by EA donors and offering free or subsidized services to alignment teams;</li><li>or we’re a for-profit, paid by its customers (i.e. alignment teams). </li></ul><p>Either way, a lot of the money will ultimately come from EA donors (who fund alignment teams).</p><p>The latter funding mechanism seems better; “customers paying money for a service” leads to the efficient allocation of resources by creating market structures. Customers have a clear incentive to spend the money well. On the other hand, “foundations deciding what services are free” is more reminiscent of planned economies and distorts markets. 
To a first approximation, funders should give alignment orgs as much money as they judge appropriate and then alignment orgs should exchange it for services as they see fit.</p><p>A further reason is that a non-profit is legally more complicated to set up, and imposes additional constraints on the organisation.</p><h2 id="Should_the_company_exclusively_serve_alignment_researchers_">Should the company exclusively serve alignment researchers?</h2><p>We also considered founding a company with the ambition to become a major player in the larger space of human data provision. It would by default serve anyone willing to pay us and working on something AGI-related, rather than just alignment researchers. Conditional on us being able to successfully build a big company, this would have the following upsides:</p><ul><li>Plausibly one of the main benefits of founding a human data gathering organisation is to produce EAs and an EA org that have deep expertise in handling and producing high-skill human data in significant quantities. That might prove useful around “crunch time”, e.g. when some project aims to create competitive but safe AGI and needs this expertise. Serving the entire market could scale to a much larger company enabling us to <b>gain expertise at higher scales</b>.</li><li>Operating a large company would also come with some degree of <b>market power</b>. Any company with paying customers has some amount of leverage over them: first of all just because of switching costs, but also because the product it offers might be much better than the next-best alternative. This could allow us to make some demands, e.g. 
once we’re big and established, announce we’d only work with companies that follow certain best practices.</li></ul><p>On the other hand, building a big successful company serving anyone willing to pay might come with some significant downsides as well.</p><ul><li>First, and most straightforwardly,<b> it is probably much harder than filling a small niche (just meeting the specific needs of alignment researchers), making us less likely to succeed. A large number of competitors exist and as described in this </b><span><span><span><a href="https://forum.effectivealtruism.org/posts/iBeWbfQLA9EKfsdhu/why-we-re-not-founding-a-human-data-for-alignment-org#Key_crux__demand_looks_questionable__Surge_seems_pretty_good"><b><u>section</u></b></a></span></span></span><b>, some of them (esp. Surge) seem pretty hard to beat. Since this is an already big and growing market, there is an additional efficient markets reason to assume this is true a priori.</b></li><li>Secondly, and <b>perhaps more importantly, such a company might accelerate capabilities (more on this </b><span><span><span><a href="https://forum.effectivealtruism.org/posts/iBeWbfQLA9EKfsdhu/why-we-re-not-founding-a-human-data-for-alignment-org#Would_we_be_accelerating_capabilities_"><b><u>below</u></b></a></span></span></span><b>).</b></li></ul><p>Furthermore, it might <b>make RLHF (Reinforcement Learning from Human Feedback) in particular more attractive</b>. Depending on one’s opinions about RLHF and how it compares to other realistic alternatives, one might consider this a strong up- or downside. </p><h1 id="Approach">Approach</h1><p>The main reason companies fail is that they build a product that customers don’t want. For for-profits, the signal is very clear: either customers care enough to be willing to pay hard cash for the product/service, or they don’t. 
For non-profits, the signal is less clear, and therefore nonprofits can easily stick around in an undead state, something that is an even worse outcome than the quick death of a for-profit because of resource (mis)allocation and opportunity costs. As discussed, it is not obvious which structure we should adopt for this organisation, though for-profit may be a better choice on balance. However, in all cases it is clear that the organisation needs to solve a concrete problem or provide clear value to exist and be worth existing. This does not mean that the value proposition needs to be certain; we would be happy to take a high-risk, high-reward bet, and generally support <span><span><span><a href="https://www.openphilanthropy.org/research/hits-based-giving/"><u>hits-based approaches to impact</u></a></span></span></span> both in general and for ourselves.</p><p>An organisation is unlikely to do something useful to its customers without being very focused on customer needs, and ideally having tight feedback cycles. </p><p>The shortest feedback loops are when you’re making a consumer software product where you can prototype quickly (including with mockups), and watch and talk to users as they use the core features, and then see if the user actually buys the product on the spot. A datasets service differs from this ideal feedback mode in a number of ways:</p><ol><li>The product is a labour-intensive process, which means the user cannot quickly use the core features and we cannot quickly simulate them.</li><li>The actual service requires either a contractor pool or (potentially at the start) the two of us spending a number of hours per request generating data.</li><li>There is significant friction to getting users to use the core feature (providing a dataset), since it requires specification of a dataset from a user, which takes time and effort.</li></ol><p>Therefore, we relied on customer interviews with prospective customers. 
The goal of these interviews was to talk to alignment researchers who work with data, and figure out if external help with their dataset projects would be of major use to them.</p><p>Our approach to customer interviews was mostly based on the book <span><span><span><a href="https://www.amazon.com/Mom-Test-customers-business-everyone-ebook/dp/B01H4G2J1U"><i><u>The Mom Test</u></i></a></span></span></span>, which is named after the idea that your customer interview questions should be concrete and factual enough that even someone as biased as your own mom shouldn’t be able to give you a false signal about whether the idea is actually good. Key lessons from <i>The Mom Test</i> include prioritising:</p><ul><li><b>factual</b> questions about the past <b>over hypothetical</b> questions about the future;<ul><li>In particular, questions about concrete past and current <b>efforts</b> spent solving a problem<b> rather than</b> questions about current or future <b>wishes</b> for solving a problem</li></ul></li><li>questions that get at something<b> concrete (e.g. 
numbers)</b>; and</li><li>questions that prompt the customer to give information about their problems and priorities without prompting them with a solution.</li></ul><p>We wanted to avoid the failure mode where lots of people tell us something is important and valuable in the abstract, without anyone actually needing it themselves.</p><p>We prepared a set of default questions that roughly divided into:</p><ol><li>A general starting question prompting the alignment researcher to describe the biggest pain points and bottlenecks they face in their work, without us mentioning human data.</li><li>Various questions about their past and current dataset-related work, including what types of problems they encounter with datasets, how much of their time these problems take, and steps they took to address these problems.</li><li>Various questions on their past experiences using human data providers like Surge, Scale, or Upwork, and specifically about any things they were unable to accomplish because of problems with such services.</li><li>In some cases, more general questions about their views on where the bottlenecks for solving alignment are, views on the importance of human data or tractability of different data-related proposals, etc. </li><li>What we should’ve asked but didn’t, and who else we should talk to.</li></ol><p>Point 4 represents the fact that in addition to being potential customers, alignment researchers also doubled as domain experts. The weight given to the questions described in point 4 varied a lot, though in general if someone was both a potential customer and a source of data-demand-relevant alignment takes, we prioritised the customer interview questions.</p><p>In practice, we found it easy to arrange meetings with alignment researchers; they generally seemed willing to talk to people who wanted input on their alignment-relevant idea. We did customer interviews with around 15 alignment researchers, and had second meetings with a few. 
For each meeting, we prepared beforehand a set of questions tweaked to the particular person we were meeting with, which sometimes involved digging into papers published by alignment researchers on datasets or dataset-relevant topics (Sam Bowman in particular has worked on a lot of data-relevant papers). Though the customer interviews were by far the most important way of getting information on our cruxes, we found the literature reviews we carried out to be useful too. We are happy to share the notes from these literature reviews; please reach out if this would be helpful to you.</p><p>Though we prepared a set of questions beforehand, in many meetings - including some of the most important or successful ones - we ended up going off script fairly quickly.</p><p>Something we found very useful was that, since there were two of us, we could split the tasks during the meeting into two roles (alternating between meetings):</p><ol><li>One person who does most of the talking, and makes sure to be focused on the thread of the conversation.</li><li>One person who mostly focuses on note-taking, but also chimes in if they think of an important question to ask or want to ask for clarification.</li></ol><h1 id="Key_crux__demand_looks_questionable__Surge_seems_pretty_good">Key crux: demand looks questionable, Surge seems pretty good</h1><p><b>Common startup advice </b>is to make sure you have identified a very <b>strong signal of demand </b>before you start building stuff. That should look something like someone telling you that the thing you’re working on is one of their biggest bottlenecks and that they can’t wait to pay you asap so you solve this problem for them. “Nice to have” doesn’t cut it. 
This is in part because working with young startups is inherently risky, so you need to make up for that by solving one of their most important problems.</p><p>In brief, we don’t think this level of very strong demand currently exists, though there were some weaker signals that looked somewhat promising. There are many existing startups that offer human feedback already. <span><span><span><a href="https://www.surgehq.ai/"><b><u>Surge AI</u></b></a></span></span></span> in particular was brought up by many people we talked to and seems to offer quite a decent service that would be <b>hard to beat</b>.</p><h2 id="Details_about_Surge">Details about Surge</h2><p>Surge is a US-based company that offers a service very similar to what we had in mind, though they are not focused on alignment researchers exclusively. They build data-labelling and generation tools and have a workforce of crowdworkers.</p><p>They’ve worked with Redwood and the OpenAI safety team, both of which had moderately good experiences with them. More recently, Ethan Perez’s team have worked with Surge too; he seems to be very satisfied based <span><span><span><a href="https://twitter.com/EthanJPerez/status/1567180843231379457?t=CEdeLRWNcxBD2eeO3Hd1Iw&s=07">on this Twitter thread</a></span></span></span>.</p><p><img height="203" src="https://39669.cdn.cke-cs.com/cgyAlfpLFBBiEjoXacnz/images/13dcad81b5782236c25371ce3642e027fb1de521ca9b3a21.png" width="400" /><br /> </p><h3 id="Collaboration_with_Redwood">Collaboration with Redwood</h3><p>Surge has worked with Redwood Research on their <span><span><span><a href="https://arxiv.org/abs/2205.01663"><u>paper</u></a></span></span></span> about adversarial training. This is one of three <span><span><span><a href="https://www.surgehq.ai/case-study/adversarial-testing-redwood-research"><u>case studies</u></a></span></span></span> on Surge’s website, so we assume it’s among the most interesting projects they’ve done so far. 
The crowdworkers were tasked with coming up with prompts that would cause the model to output text in which someone got injured. Furthermore, crowdworkers also classified whether someone got injured in a given piece of text.</p><p>One person from Redwood commented that doing better than Surge seemed possible to them with “probably significant value to be created”, but “not an easy task”. They thought our main edge would have to be that we’d specialise on fuzzy and complex tasks needed for alignment; Surge apparently did quite well with those, but still with some room for improvement. A better understanding of alignment might lower chances of miscommunication. Overall, Redwood seems quite happy with the service they received.</p><p>Initially, Surge’s iteration cycle was apparently quite slow, but this improved over time and was “pretty good” toward the end.</p><p>Redwood told us they were quite likely to use human data again by the end of the year and more generally in the future, though they had substantial uncertainty around this. Their experience in working with human feedback overall was somewhat painful as we understood it. This is part of the reason they’re uncertain about how much human feedback they will use for future experiments, even though it’s quite a powerful tool. However, they estimated that friction in working with human feedback was mostly caused by inherent reasons (humans are inevitably slower and messier than code), rather than Surge being insufficiently competent. </p><h3 id="Collaboration_with_OpenAI">Collaboration with OpenAI</h3><p>OpenAI have worked with Surge in the context of their WebGPT <span><span><span><a href="https://arxiv.org/abs/2112.09332"><u>paper</u></a></span></span></span>. In that paper, OpenAI fine-tuned their language model GPT-3 to answer long-form questions. The model is given access to the web, where it can search and navigate in a text-based environment. 
It’s first trained with imitation learning and then optimised with human feedback. </p><p>Crowdworkers provided “demonstrations”, where they answered questions by browsing the web. They also provided “comparisons”, where they indicated which of two answers to the same question they liked better.</p><p>People from OpenAI said they had used Surge mostly for sourcing the contractors, while doing most of the project management, including building the interfaces, in-house. They were generally pretty happy with the service from Surge, though all of them did mention shortcomings.</p><p>One of the problems they told us about was that it was hard to get access to highly competent crowdworkers for consistent amounts of time. Relatedly, it often turned out that a very small fraction of crowdworkers would provide a large majority of the total data. </p><p>More generally, they wished there had been someone at Surge that understood their project better. Also, it might have been somewhat better if there had been more people with greater experience in ML, such that they could have more effectively anticipated OpenAI’s preferences — e.g. predict accurately what examples might be interesting to researchers when doing quality evaluation. However, organisational barriers and insufficient communication were probably larger bottlenecks than ML knowledge. At least one person from OpenAI strongly expressed a desire for a service that understood their motives well and took as much off their plate as possible in terms of hiring and firing people, building the interfaces, doing quality checks and summarising findings etc. It is unclear to us to what extent Surge could have offered these things if OpenAI hadn’t chosen to do a lot of these things in-house. One researcher suggested that communicating their ideas reliably was often more work than just doing it themselves. As it was, they felt that marginal quality improvement required significant time investment on their own part, i.e. 
could not be solved with money alone. </p><p>Notably, one person from OpenAI estimated that about <b>60% of the WebGPT team’s efforts </b>were spent on various aspects of <b>data collection</b>. They also said that this figure didn’t change much after weighting for talent, though in the future they expect junior people to take on more disproportionate shares of this workload.</p><p>Finally, one minor complaint mentioned was the lack of transparency about contractor compensation. </p><h3 id="How_mission_aligned_is_Surge_">How mission-aligned is Surge?</h3><p>Surge <span><span><span><a href="https://www.surgehq.ai/case-study/adversarial-testing-redwood-research"><u>highlight</u></a></span></span></span> their collaboration with Redwood on their website as one of three case studies. In their blog <span><span><span><a href="https://www.surgehq.ai/blog/the-250k-inverse-scaling-prize-and-human-ai-alignment"><u>post</u></a></span></span></span> about their collaboration with Anthropic, the first sentence reads: “In many ways, alignment – getting models to align themselves with what we want, not what they think we want – is one of the fundamental problems of AI.” </p><p>On the one hand, they describe alignment as one of the fundamental problems of AI, which could indicate that they intrinsically care about alignment. On the other hand, they have a big commercial incentive to say this. Note that many people would consider their half-sentence definition of alignment to be wrong (a model might know what we want, but still do something else).</p><p>We suspect that the heads of Surge have at least vaguely positive dispositions towards alignment. They definitely seem eager to work with alignment researchers, which might well be more important. 
We think it’s mostly fine if they are not maximally intrinsically driven, though mission alignment does add value as mentioned above.</p><h2 id="Other_competitors">Other competitors</h2><p>We see Surge as the most direct competitor and have researched them by far in the most detail. But besides Surge, there are a large number of other companies offering similar services. </p><p>First, and most obviously, Amazon <span><span><span><a href="https://www.mturk.com/"><u>Mechanical Turk</u></a></span></span></span> offers a very low quality version of this service and is very large. <span><span><span><a href="https://www.upwork.com/"><u>Upwork</u></a></span></span></span> specialises in sourcing humans for various tasks, without building interfaces. <span><span><span><a href="https://scale.com/"><u>ScaleAI</u></a></span></span></span> is a startup with a $7B valuation --- they augment human feedback with various automated tools. OpenAI have worked with them. Other companies in this broad space include <span><span><span><a href="https://gethybrid.io/"><u>Hybrid</u></a></span></span></span> (which Sam Bowman’s lab has worked with) and <span><span><span><a href="https://www.invisible.ai/"><u>Invisible</u></a></span></span></span> (who have worked with OpenAI). 
There are many more that we haven’t listed here.</p><p>In addition, some labs have in-house teams for data gathering (see <span><span><span><a href="https://forum.effectivealtruism.org/posts/iBeWbfQLA9EKfsdhu/why-we-re-not-founding-a-human-data-for-alignment-org#Is_it_more_natural_for_this_work_to_be_done_in_house_in_the_longterm__Especially_at_big_labs_companies_"><u>here</u></a></span></span></span> for more).</p><h2 id="Data_providers_used_by_other_labs">Data providers used by other labs</h2><p>Ethan Perez’s and Sam Bowman’s labs at NYU/Anthropic have historically often built their own interfaces while using contractors from Upwork or undergrads, but they have been trialing Surge over the summer and seem likely to stick with them if they have a good experience. Judging from the Twitter thread linked above and asking Jérémy Scheurer (who works on the team and built the pre-Surge data pipeline) how they’ve found Surge so far, Surge is doing a good job. </p><p>Google has an internal team that provides a similar service, though DeepMind have used at least one external provider as well. We expect that it would be quite hard to get DeepMind to work with us, at least until we would be somewhat more established. </p><p>Generally, we get the impression that most people are quite happy with Surge. It’s worth also considering that it’s a young company that’s <b>likely improving its service over time</b>. We’ve heard that Surge iterates quickly, e.g. by shipping simple feature requests in two days. It’s possible that some of the problems listed above may no longer apply by now or in a few months.</p><h2 id="Good_signs_for_demand">Good signs for demand</h2><p>One researcher we talked to said that there were lots of projects their team didn’t do, because gathering human feedback of sufficient quality was infeasible. </p><p>One of the examples this researcher gave was human feedback on code quality. 
This is infeasible in practice, because the time of software engineers is just too expensive. That problem is hard for a new org to solve. </p><p>Another example they gave seemed like it might be more feasible: for things like RLHF, they often choose to do pairwise comparisons between examples or multi-preferences. Ideally, they would want to get ratings, e.g. on a scale from 1 to 10. But they didn’t trust the reliability of their raters enough to do this. </p><p>More generally, this researcher thought there were lots of examples where if they could copy any person on their team a hundred times to provide high-skill data, they could do many experiments that they currently can’t. </p><p>They also said that their team would be willing to pay ~3x what they were currently paying to receive much higher-quality feedback.</p><p>Multiple other researchers we talked to expressed vaguely similar sentiments, though none quite as strong.</p><p>However, it’s notable that in this particular case, the researcher hadn’t worked with Surge yet. </p><p>The same researcher also told us about a recent project where they had spent a month on things like creating quality assurance examples, screening raters, tweaking instructions etc. They thought this could probably have been reduced a lot by an external org, maybe to as little as one day. Again, we think Surge may be able to get them a decent part of the way there.</p><h2 id="Labs_we_could_have_worked_with">Labs we could have worked with</h2><p>We ended up finding three projects that we could have potentially worked on:</p><ul><li>A collaboration with Ought --- they spend about 15 hours a week on data-gathering and would have been happy to outsource that to us. If it had gone well, they might also have done more data-gathering in the long term (since friction is lower if it doesn’t require staff time). 
We decided not to go ahead with this project since we weren’t optimistic enough about demand from other labs being bigger once we had established competence with Ought, and the project itself didn’t seem high-upside enough. </li><li>An attempt to win the Visible Thoughts <span><span><span><a href="https://intelligence.org/2021/11/29/visible-thoughts-project-and-bounty-announcement/"><u>bounty</u></a></span></span></span> by MIRI. We decided against this for a number of reasons. See more of our thinking about Visible Thoughts below.</li><li>Potentially a collaboration with Owain Evans on curated datasets for alignment.</li></ul><p>We think the alignment community is currently relatively tight-knit: for example, researchers often knew about other alignment teams’ experiences with Surge from conversations they had had with them. Hence, we were relatively optimistic that, conditional on there being significant demand for this kind of service, doing a good job on one of the projects above would quickly lead to more opportunities.<br /> </p><h3 id="Visible_Thoughts">Visible Thoughts</h3><p>In November 2021, <span><span><a class="PostLinkPreviewWithPost-link" href="https://www.lesswrong.com/posts/zRn6cLtxyNodudzhw/visible-thoughts-project-and-bounty-announcement"><u>MIRI announced the Visible Thoughts (VT) project bounty</u></a></span></span>. In many ways VT would be a good starting project for an alignment-oriented dataset provider, in particular because the bounty is large (up to $1.2M) and because it is ambitious enough that executing on it would provide a strong learning signal to us and a credible signal to other organisations we might want to work with. 
However, on closer examination of VT, we came to the conclusion that it is not worth it for us to work on it.</p><p>The idea of VT is to collect a dataset of 100 runs of fiction of a particular type (“dungeon runs”, an interactive text-based genre where one party, called the “dungeon master” and often an AI, offers descriptions of what is happening, and the other responds in natural language with what actions they want to take), annotated with a transcript of some of the key verbal thoughts that the dungeon master might be thinking as they decide what happens in the story world. MIRI hopes that this would be useful for training AI systems that make their thought processes legible and modifiable.</p><p>In particular, a notable feature of the VT bounty is the extreme run lengths that it asks for: to the tune of 300 000 words for each of the runs (for perspective, this is the length of <i>A Game of Thrones</i>, and longer than the first three <i>Harry Potter</i> books combined). A VT run is much less work than a comparable-length book - the equivalent of a rough unpolished first-draft (with some quality checks) would likely be sufficient - but producing one such run would still probably require at least on the order of 3 months of sequential work time from an author. We expect the pool of people willing to write such a story for 3 months is significantly smaller than the pool of people who would be willing to complete, say, a 30 000 word run, and that the high sequential time cost increases the amount of time required to generate the same number of total words. We also appear to have different ideas on how easy it is to fit a coherent story, for the relevant definition of coherent, into a given number of words. 
Note that to compare VT word counts to lengths of standard fiction without the written-out thoughts from the author, the VT word count should be reduced by a factor of 5-6.</p><p>Concerns about the length are raised in the comments section, to which Eliezer Yudkowsky <span><span><a class="CommentLinkPreviewWithComment-link" href="https://www.lesswrong.com/posts/zRn6cLtxyNodudzhw/visible-thoughts-project-and-bounty-announcement?commentId=irJCDQaWRcdT3Bnoo"><u>responded</u></a></span></span>. His first point, that longer is easier to write per step, may be true, especially as we also learned (by email with Nate Soares and Aurelien Cabanillas) that in MIRI’s experience “authors that are good at producing high quality steps are also the ones who don't mind producing many steps”. Particularly because of that practical experience, we think it is possible we overestimated the logistical problems caused by the length. MIRI also said they would likely accept shorter runs if they satisfied their other criteria.</p><p>In a brief informal conversation with Rudolf during EAG SF, Eliezer emphasised the long-range coherence point in particular. However, they did not come to a shared understanding of what type of “long-range coherence” is meant.</p><p>Even more than these considerations, we are sceptical about the vague plans for what to do given a VT dataset. A recurring theme from talking to alignment researchers who work with datasets was that inventing and creating a good dataset is surprisingly hard, and generally involves having a clear goal of what you’re going to use the dataset for. 
It is possible the key here is the difference in our priors for how likely a dataset idea is to be useful.</p><p>In addition, we have significant concerns about undertaking a major project based on a bounty whose only criterion is the judgement of one person (Eliezer Yudkowsky), and undertaking such a large project as our first project.</p><h1 id="Other_cruxy_considerations">Other cruxy considerations</h1><h2 id="Could_we_make_a_profit___get_funding__">Could we make a profit / get funding? </h2><p>One researcher from OpenAI told us he thought it would be hard to imagine an EA data-gathering company making a profit because costs for individual projects would always be quite high (requiring several full-time staff), and total demand was probably not all that big.</p><p>In terms of funding, both of us were able to spend time on this project because of grants from regrantors in the Future Fund regrantor program. Based on conversations with regrantors, we believe we could’ve gotten funding to carry out an initial project if we had so chosen.</p><h2 id="Will_human_feedback_become_a_much_bigger_deal__Is_this_a_very_quickly_growing_industry_">Will human feedback become a much bigger deal? Is this a very quickly growing industry?</h2><p>Our best guess is yes. For example, see this <span><span><a class="PostLinkPreviewWithPost-link" href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to"><u>post</u></a></span></span> by Ajeya Cotra which outlines how we could get to TAI by training on Human Feedback on Diverse Tasks (HFDT). </p><p>She writes: “HFDT is not the only approach to developing transformative AI, and it may not work at all. 
But I take it very seriously, and I’m aware of increasingly many executives and ML researchers at AI companies who believe something within this space could work soon.”</p><p>In addition, we have also had discussions with at least one other senior AI safety researcher whom we respect and who thought human feedback was currently irrationally neglected by mainstream ML; they expected it to become much more wide-spread and to be a very powerful tool.</p><p>If that’s right, then providing human feedback will likely become important and economically valuable. </p><p>This matters, because operating a new company in a growing industry is generally much easier and more likely to be successful. We think this is true even if profit isn’t the main objective.</p><h2 id="Would_we_be_accelerating_capabilities_">Would we be accelerating capabilities?</h2><p>Our main idea was to found a company (or possibly non-profit) that served alignment researchers exclusively. That could accelerate alignment differentially. </p><p>One problem is that it’s not clear where to draw this boundary. Some alignment researchers definitely think that other people who would also consider themselves to be alignment researchers are effectively doing capabilities work. This is particularly true of RLHF.</p><p>One mechanism worth taking seriously if we worked with big AI labs to make their models more aligned by providing higher quality data is that the models might merely appear surface-level aligned. “Make the data higher quality” might be a technique that scales poorly as capabilities ramp up. So it risks creating a false sense of security. It would also clearly improve the usefulness of current-day models and hence, it risks increasing investment levels too.</p><p>We don’t currently think the risk of surface-level alignment is big enough to outweigh the benefits. 
In general, we think that a good first-order heuristic that helps the field stay grounded in reality would be that whatever improves alignment in current models is useful to explore further and invest resources into. It seems like a good prior that such things would also be valuable in the future (even if it’s possible that new additional problems may arise, or such efforts aren’t on the path to a future alignment solution). See Nate Soares’ post about <span><span><a class="PostLinkPreviewWithPost-link" href="https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization"><u>sharp left turns</u></a></span></span> to get a contradicting view on this. </p><h2 id="Is_it_more_natural_for_this_work_to_be_done_in_house_in_the_longterm__Especially_at_big_labs_companies_">Is it more natural for this work to be done in-house in the longterm? Especially at big labs/companies.</h2><p>We expect that human data gathering is likely to become very important and that it benefits from understanding the relevant research agenda well. So maybe big companies will want to do this internally, instead of relying on third-party suppliers? </p><p>That seems quite plausible to us and to some extent it’s happening already. Our understanding is that Anthropic is hiring an internal team to do human data gathering. DeepMind has access to Google’s crowdworker service. OpenAI have worked with multiple companies, but they also have at least one in-house specialist for this kind of work and are advertising multiple further jobs on the human data team <span><span><span><a href="https://openai.com/careers/#human-data"><u>here</u></a></span></span></span>. 
They’re definitely considering moving more of this work in-house, but it’s unclear to us to what extent that’s going to happen and we have received somewhat contradicting signals regarding OpenAI safety team members’ preferences on this.</p><p>So a new EA org would face stiff competition, not only from other external providers, but also from within companies.</p><p>Of course, smaller labs will most likely always have to rely on external providers. Hence, <b>another cruxy consideration is how much small labs matter</b>. Our intuition is that they matter much less than bigger labs (since the latter have access to the best and biggest models).</p><h2 id="Creating_redundancy_of_supply_and_competition">Creating redundancy of supply and competition</h2><p>Even if existing companies are doing a pretty good job at serving the needs of alignment researchers, there’s still some value in founding a competitor. </p><p>First, <b>competition is good</b>. Founding a competitor puts pressure on existing providers to keep service quality high, work on improving their products, and margins low. Ironically, part of the value of founding this company would thus flow through getting existing companies to try harder to offer the best product.</p><p>Second, it creates some redundancy. <b>What if Surge pivots?</b> What if their leadership changes or they become less useful for some other reason? In those worlds it might be especially useful to have a “back-up” company.</p><p>Both of these points have been mentioned to us as arguments in favour of founding this org. We agree that these effects are real and likely point in favour of founding the org. However, <b>we don’t think these factors carry very significant weight</b> relative to our opportunity costs, especially given that there are already many start-ups working in this space. </p><p>Adding a marginal competitor can only affect a company’s incentives so much. 
And in the worlds where we’d be most successful such that all alignment researchers were working with us, we might cause Surge and others to pivot away from alignment researchers, instead of getting them to try harder. </p><p>The redundancy argument only applies in worlds in which the best provider ceases to exist; maybe that’s 10% likely. And then the next best alternative is likely not all that bad. Competitors are plentiful and even doing it in-house is feasible. Hence, it seems unlikely to us that the expected benefit here is very large after factoring in the low probability of the best provider disappearing.</p><h1 id="Other_lessons">Other lessons</h1><h2 id="Lessons_on_human_data_gathering">Lessons on human data gathering</h2><p>In the process of talking to lots of experts about their experiences in working with human data, we learned many general lessons about data gathering. This section presents some of those lessons, in roughly decreasing order of importance.</p><h3 id="Iteration">Iteration</h3><p>Many people emphasized to us that working with human data rarely looks like having a clean pipeline from requirements design to instruction writing to contractor finding to finished product. Rather, it more often involves a lot of iteration and testing, especially regarding what sort of data the contractors actually produce. While some of this iteration may be removed by having better contractors and better knowledge of good instruction-writing, the researchers generally view the iteration as a key part of the research process, and therefore prize </p><ul><li>ease of iteration (especially time to get back with a new batch of data based on updated instructions); and</li><li>high-bandwidth communication with the contractors and whoever is writing the instructions (often both are done by the researchers themselves). </li></ul><p>This last point holds to the point that it is somewhat questionable whether an external provider (rather than e.g. 
a new team member deeply enmeshed in the context of the research project) could even be a good fit for this need.</p><h3 id="The_ideal_pool_of_contractors">The ideal pool of contractors</h3><p>All of the following features matter in a pool of contractors:</p><ul><li>Competence, carefulness, intelligence, etc. (sometimes expertise). It is often ideal if the contractors understand the experiment.</li><li>Number of contractors</li><li>Quick availability and therefore low latency for fulfilling requests</li><li>Consistent availability (ideally full-time)</li><li>Even distribution of contributions across contractors (i.e. it shouldn’t be the case that 20% of the contractors provide 80% of the examples). </li></ul><h3 id="Quality_often_beats_quantity_for_alignment_research">Quality often beats quantity for alignment research</h3><p>Many researchers told us that high-quality, high-skill data is usually more important and more of a bottleneck than just a high quantity of data. Some of the types of projects where current human data generation methods are most obviously deficient are cases where a dataset would need epistemically-competent people to make subtle judgments, e.g. of the form “how true is this statement?” or “how well-constructed was this study?” As an indication of reference classes where the necessary epistemic level exists, one researcher mentioned subject-matter experts in their domain, LessWrong posters, and EAs.</p><h3 id="A_typical_data_gathering_project_needs_UX_design__Ops__ML__and_data_science_expertise_">A typical data gathering project needs UX-design, Ops, ML, and data science expertise </h3><p>These specialists might respectively focus on the following:</p><ul><li>Designing the interfaces that crowdworkers interact with. (UX-expert/front-end web developer)</li><li>Managing all operations, including hiring, paying, managing, and firing contractors, communicating with them and the researchers etc. 
(ops expert)</li><li>Helping the team make informed decisions about the details of the experimental design, while minimizing time costs for the customer. The people we spoke to usually emphasized ML-expertise more than alignment expertise. (ML-expert)</li><li>Meta-analysis of the data, e.g. inter-rater agreement, the distribution of how much each contractor contributed, demographics, and any other curious aspects of the data. (data scientist)</li></ul><p>It is possible that someone in a team could have expertise in more than one of these areas, but generally this means a typical project will involve at least three people.</p><h3 id="Crowdworkers_do_not_have_very_attractive_jobs">Crowdworkers do not have very attractive jobs</h3><p>Usually the crowdworkers are employed as contractors. This means their jobs are inherently not maximally attractive; they probably don’t offer much in the way of healthcare, employment benefits, job security, status etc. The main way that these jobs are made more attractive is through offering higher hourly rates.</p><p>If very high quality on high-skill data is going to become essential for alignment, it may be worth considering changing this, to attract more talented people. </p><p>However, we expect that it might be inherently very hard to offer permanent positions for this kind of work, since demand is likely variable and since different people may be valuable for different projects. This is especially true for a small organisation. 
</p><h3 id="What_does_the_typical_crowdworker_look_like_">What does the typical crowdworker look like?</h3><p>This varies a lot between projects and providers.</p><p>The cheapest are non-native English speakers who live outside of the US.</p><p>Some platforms, including Surge, offer the option to filter crowdworkers for things like being native English-speakers, expertise as a software engineer, background in finance, etc.</p><h2 id="Bottlenecks_in_alignment">Bottlenecks in alignment</h2><p>When asked to name the factors most holding back their progress on alignment, many alignment researchers mentioned talent bottlenecks. </p><p>The most common talent bottleneck seemed to be in competent ML-knowledgeable people. Some people mentioned the additional desire for these to understand and care about alignment. (Not coincidentally, Matt’s next project is likely going to be about skilling people up in ML).</p><p>There were also several comments about things like good web development experience being important. For example, many data collection projects involve creating a user interface at some point, and in practice this is often handled by ML-specialised junior people at the lab, who can, with some effort and given their programming background, cobble together some type of website - often using different frameworks and libraries than the next person knows (or wants to use). (When asked about why they don’t hire freelance programmers, one researcher commented that a key feature they’d want is the same person working for them for a year or two, so that there’s an established working relationship, clear quality assurances, and continuity with the choice of technical stack.)</p><h1 id="Conclusion">Conclusion</h1><p>After having looked into this project idea for about a month, we have decided not to found a human data gathering organisation for now. 
</p><p>This is mostly because demand for an external provider seems insufficient, as outlined in this <span><span><span><a href="https://forum.effectivealtruism.org/posts/iBeWbfQLA9EKfsdhu/why-we-re-not-founding-a-human-data-for-alignment-org#Key_crux__demand_looks_questionable__Surge_seems_pretty_good"><u>section</u></a></span></span></span>. No lab gave a clear signal that gathering human data was a key bottleneck for them, where they would have been willing to go to significant lengths to fix it urgently (especially not the ones that had tried Surge). </p><p>We expect that many labs would want to stick with their current providers, Surge in particular, or their in-house team, bar exceptional success on our part (even then, we’d only provide so much marginal value over those alternatives).</p><p>Though we did find some opportunities for potential initial projects after looking for a month, we are hesitant about how far this company would be expected to scale. One of the main draws (from an impact perspective) of founding an organisation is that you can potentially achieve very high counterfactual impact by creating an organisation that scales to a large size and does lots of high-impact work over its existence. The absence of a plausible pathway to really outstanding outcomes from starting this organisation is a lot of what deters us.</p><p>In a world where we’re more successful than expected (say 90th to 95th percentile), we could imagine that in five years from now, we’d have a team of about ten good people. This team may be working with a handful of moderately big projects (about as big as WebGPT), and provide non-trivial marginal value over the next-best alternative to each one of them. 
Maybe one of these projects would not have been carried out without us.</p><p>A median outcome might mean failing to make great hires and remaining relatively small and insignificant: on the scale of doing projects like the ones we’ve identified above, enough to keep us busy throughout the year and provide some value, but with little scaling. In that case we would probably quit the project at some point.</p><p>This distribution doesn’t seem good enough to justify our opportunity cost (which includes other entrepreneurial projects or technical work among other things). Thus we have decided not to pursue this project any further for now.</p><p>We think this was a good idea to invest effort in pursuing, and we think we made the right call in choosing to investigate it. Both of us are open to evaluating, and quite likely to evaluate, other EA-relevant entrepreneurial project ideas in the future.</p><h2 id="Other_human_data_gathering_careers">Other relevant human data-gathering work</h2><p>However, <b>the assumption that high-quality high-skill human feedback is important and neglected by EAs has not been falsified</b>. </p><p>It is still plausible to us that EAs should consider career paths that focus on building expertise at data-gathering; just probably not by founding a new company. In the short run, this could look like</p><ul><li>Contributing to <b>in-house data-gathering teams</b> (e.g. at Anthropic or OpenAI)</li><li><b>Joining Surge</b> or other data-gathering startups.</li></ul><p>As we discussed above, the types of skills that seem most relevant for working in a human data generation role include: data science experience (in particular, experience with natural language data or social science data and experiment design), front-end web development, ops and management skills, and some understanding of machine learning and alignment. 
80,000 Hours recently wrote a profile which you can find <span><span><span><a href="https://80000hours.org/career-reviews/alignment-data-expert/"><u>here</u></a></span></span></span>.</p><p>Of course, in the short term, this career path will be especially impactful if one’s efforts are focussed on helping alignment researchers. But if it’s true that human feedback will prove a very powerful tool for ML, then people with such expertise may become increasingly valuable going forward, such that it could easily be worth skilling up at a non-safety-focused org. </p><p>We think joining Surge may be a particularly great opportunity. It is common advice that joining young, rapidly growing start-ups with good execution is great for building experience; early employees can often get a lot of responsibility early on. See e.g. this <span><span><span><a href="https://forum.effectivealtruism.org/posts/ejaC35E5qyKEkAWn2/early-career-ea-s-should-consider-joining-fast-growing"><u>post</u></a></span></span></span> by Bill Zito.</p><p>One of the hardest parts about that seems to be identifying promising startups. After talking to many of their customers, we have built reasonable confidence that Surge holds significant promise. They seem to execute well, in a space which we expect to grow. In addition to building career capital, there is clear value in helping Surge serve alignment researchers as well as possible.</p><p>From Surge’s perspective, we think they could greatly benefit from hiring EAs, who are tuned in to the AI safety scene, which we would guess represents a significant fraction of their customers. </p><p>One senior alignment researcher told us explicitly that they would be interested in hiring people who had worked in a senior role at Surge.</p><h1 id="Next_steps_for_us">Next steps for us</h1><p>Matt is planning to run a bootcamp that will allow EAs to upskill in ML engineering. 
I'll be doing a computer science master’s at Cambridge from October to June.</p><br /></div></div><p style="text-align: center;"><b>AI risk intro 2: solving the problem</b> (2022-09-24)</p><p style="text-align: center;"> <b><i>This post was a joint effort with <a href="https://www.perfectlynormal.co.uk/">Callum McDougall</a>.</i></b></p><div style="text-align: center;"><i> </i></div><div style="text-align: center;"><i><span style="font-size: x-small;">8.2k words (~25min)</span> </i></div><div style="text-align: center;"><i> </i><br /></div><div><p>This marks the second half of our overview of the AI alignment problem. In <span><span><a class="PostLinkPreviewWithPost-link" href="https://www.strataoftheworld.com/2022/09/ai-risk-intro-1-advanced-ai-might-be.html">the first half</a></span></span>, we outlined the case for misaligned AI as a significant risk to humanity, first by looking at past progress in machine learning and extrapolating to what the future could bring, and second by discussing the theoretical arguments which underpin many of these concerns. In this second half, we focus on possible solutions to the alignment problem that people are currently working on. We will paint a picture of the current field of technical AI alignment, explaining where the major organisations fit into the larger picture and what the theory of change behind their work is. Finally, we will conclude the sequence with a call to action, by discussing the case for working on AI alignment, and some suggestions on how you can get started.</p><p><i>Note - for people with more context about the field (e.g. 
those who have done AGISF) we expect </i><span><span><span><a href="https://www.lesswrong.com/posts/QBAjndPuFbhEXKcCr/my-understanding-of-what-everyone-in-technical-alignment-is"><i>Thomas Larsen's post</i></a></span></span></span><i> to be a much better summary, and </i><span><span><span><a href="https://www.lesswrong.com/posts/9TWReSDKyshfA66sz/alignment-org-cheat-sheet#comments"><i>this post</i></a></span></span></span><i> might be better if you are looking for something brief. Our intended audience is someone relatively unfamiliar with the AI safety field who is looking for a taste of the kinds of problems which are studied in the field and the solution approaches taken. We also don't expect this sampling to be representative of the number of people working on each problem - again, see Thomas' post for something which accomplishes this.</i></p><hr /><h1 id="Introduction__A_Pre_Paradigmatic_Field"><b>Introduction: A Pre-Paradigmatic Field</b></h1><blockquote><p><i>Definition (<b>pre-paradigmatic</b>): a science at an early stage of development, before it has established a consensus about the true nature of the subject matter and how to approach it.</i></p></blockquote><p>AI alignment is a strange field. Unlike other fields which study potential risks to the future of humanity (e.g. nuclear war or climate change), there is almost no precedent for the kinds of risks we care about. Additionally, because of the nature of the threat, failing to get alignment right on the first try might be fatal. As Paul Christiano (a well-known AI safety researcher) recently wrote:</p><blockquote><p><i>Humanity usually solves technical problems by <b>iterating and fixing failures</b>; we often resolve tough methodological disagreements very slowly by seeing what actually works and having our failures thrown in our face. But it will probably be possible to build valuable AI products without solving alignment, and so <b>reality won’t “force us” to solve alignment until it’s too late</b>. 
This seems like a case where we will have to be <b>unusually reliant on careful reasoning rather than empirical feedback loops</b> for some of the highest-level questions.</i></p></blockquote><p>For these reasons, the field of AI alignment lacks a consensus on how the problem should be tackled, or what the most important parts of the problem even are. This is why there is a lot of variety in the approaches we present in this post.</p><h1 id="Decomposing_the_research_landscape"><b>Decomposing the research landscape</b></h1><figure class="image image_resized" style="width: 50.19%;"><img height="400" src="https://lh3.googleusercontent.com/05Pf23h1YhLb8ua2leAc01JHyDhrBNebhhUtKprCeFEvZy-thcgcxMDXZtmVUKkd48Mamo8WQn6eekyFDUKP0EarLwQWwkCiS5mA_OYJa7anjIw-_bUe_oppKPXbuE7q20kBDWliD6ri_Bj_Fisedc4CCi5viijjwxpRG0ooRuRnwYVs8d1VmDr2cg=w400-h400" width="400" /><figcaption><i>An image generated with OpenAI's DALL-E 2 based on the prompt: sorting papers and books in a majestic gothic library. <b>All other images like this in this post are also AI-generated, from the text in the caption.</b><br /></i></figcaption></figure><p>There are lots of different ways you could divide up the space of approaches to solving the problem of aligning advanced AI. For instance, you could go through the history of the field and identify different movements and paradigms. Or you could place the work on a spectrum from highly theoretical maths/philosophy-type research, to highly empirical research working with cutting-edge deep learning models.</p><p>However, the most useful decomposition would be one that explains why the people who work on it believe that it will help solve the problem of AI alignment. </p><p>For that reason, we’ll mostly be using the decomposition from <span><span><span><a href="https://www.lesswrong.com/s/FN5Gj4JM6Xr7F4vts/p/SQ9cZtfrzDJmw9A2m"><u>Neel Nanda’s “A Bird’s Eye View” </u></a></span></span></span>post. 
The motivation behind this decomposition is to answer the high-level question of “what is needed for AGI to go well?”. The six broad classes of approaches we talk about are:</p><ol><li><b>Addressing threat models </b><br /><i>We have a specific threat model in mind for how AGI might result in a very bad future for humanity, and focus our work on things we expect to help address the threat model.</i></li><li><b>Agendas to build safe AGI </b><br /><i>Let’s make specific plans for how to actually build safe AGI, and then try to test, implement, and understand the limitations of these plans. The emphasis is on understanding how to build AGI safely, rather than trying to do it as fast as possible.</i></li><li><b>Robustly good approaches </b><br /><i>In the long-run AGI will clearly be important, but we're highly uncertain about how we'll get there and what, exactly, could go wrong. So let's do work that seems good in many possible scenarios, and doesn’t rely on having a specific story in mind.</i></li><li><b>Deconfusion</b><br /><i>Reasoning about how to align AGI involves reasoning about concepts like intelligence, values, and optimisers, and we’re pretty confused about what these even mean. This means any work we do now is plausibly not helpful and definitely not reliable. As such, our priority should be doing some conceptual work on how to think about these concepts and what we’re aiming for, and trying to become less confused.</i></li><li><b>AI governance</b><br /><i>In addition to solving the technical alignment problem, there’s the question of what policies we need to minimise risk from advanced AI systems.</i></li><li><b>Field-building</b><br /><i>One of the most important ways we can make AI go well is by increasing the number of capable researchers doing alignment research.</i> </li></ol><p>It’s worth noting that there is a lot of overlap between these sections. 
For instance, interpretability research is a great example of a robustly good approach, but it can also be done with a specific threat model in mind.</p><p>Throughout this section, we will also give small vignettes of organisations or initiatives which support AI alignment research in some form. This won’t be a full picture of all approaches or organisations, instead hopefully it will serve to sketch a picture of what work in AI alignment actually looks like.</p><h2 id="Addressing_threat_models">Addressing threat models</h2><blockquote><p><i>We have a <b>specific threat model</b> in mind for how AGI might result in a very bad future for humanity, and focus our work on things we expect to help address the threat model.</i></p></blockquote><p>A key high-level intuition here is that having a specific threat model in mind for how AI might go badly for humanity can help keep you focused on certain hard parts of the problem. One technique that can be useful here is a version of back-casting: we start from future problems with advanced AI systems in our current model, reason about what kinds of things might solve these problems, then try and build versions of these solutions today and test them out on current problems.</p><figure class="image image_resized" style="width: 45.98%;"><img height="212" src="https://lh3.googleusercontent.com/WjAkWUs3UyXk_VvLWXJ99Z1Q9Cfgm8yEziWaIAwx1S8tYEq7r3IE6BVnw6IrMjfj8neeJypSK2UFqCt7BgSSOdikryl2b3nHVV9mmatFih6yXF2OBYE7xVy8Y5WcsuKRiRDcmeBcxRc630ayp3_mt4hwJeC4UrHKStpDhHSUI_bIPlXgVlfaO_mpag=w400-h212" width="400" /></figure><p>This can be seen in contrast to the approach of simply trying to fix current problems with AI systems, which might fail to connect up with the hardest parts of AI alignment.</p><figure class="image image_resized" style="width: 44.49%;"><img height="226" 
src="https://lh5.googleusercontent.com/qIJxVmWTBGIaws5GaJksOZKF8-BlmwQ08vZ2MQFLWIK9oZcGJ74HCR0e2GCIIKx7klGWpxgEVoHujvQtyRwGZ-PoiJacxoWszXqdIslZuSrweTXMaI6OVWe3fnTNhQQkVK59q4a6qHEK6q3Z6qeUtrtQvBaKij92ZEL7Cz2Cn8CbLVQvUiEdJM9bSw=w400-h226" width="400" /></figure><h3 id="Example_1__Superintelligent_utility_maximisers__and_quantilizers">Example 1: Superintelligent utility maximisers, and quantilizers</h3><figure class="image image_resized" style="width: 50.49%;"><img height="400" src="https://lh5.googleusercontent.com/Xt7WQny2U0Lcg1GVMZQEkjxdrRukolbFV5g1LL7-GCI5crGhEBnjTF8QrBWpfFnujTp5COL08Cmkc3HKu9jrAdKCJ1TPY2TaGrqNeJ1VtC2VrHexhSXBwWM545HpgU0mbkzGVVRZS1-KGYoUKAucfO8kYlS5X4ULag255Q0RkwwaCNy1_2nLdBQ0=w400-h400" width="400" /><figcaption><i>superintelligent artificial intelligence, making choices, digital art, artstation</i></figcaption></figure><p>The superintelligent utility maximiser is the oldest threat model studied by the AI alignment field. It was discussed at length by Nick Bostrom in his book <i>Superintelligence</i>. It assumes that we will create an AGI much more intelligent than humans, and that it will be trying to achieve some particular goal (measured by the <span><span><span><a href="https://www.investopedia.com/terms/e/expectedutility.asp"><u>expected value of some utility function</u></a></span></span></span>). The problem with this is that attempts to maximise the value of some goal which isn’t perfectly aligned with what humans want can lead to some very bad outcomes. One formalism which was proposed to address this problem is <span><span><span><a href="https://intelligence.org/2015/11/29/new-paper-quantilizers/"><u>Jessica Taylor’s quantilizers</u></a></span></span></span>. 
It is quite maths-heavy so we won’t discuss all the details here, but the basic idea is that rather than using the expected utility maximisation framework for agents, we mix expected utility maximisation with human imitation in a clever way (to be more precise, you sample from a prior distribution which represents the actions a human would be likely to take in this scenario). The resulting agent wouldn’t take catastrophic actions because part of its decision-making comes from imitating what it thinks humans would do, but it would also be able to use the expected utility maximisation to go beyond human imitation, and do things we are incapable of (which is presumably the reason we would want to build it in the first place!). However, the drawback with theoretical approaches like this is that they often bake in too many assumptions or rely on too many variables to be useful in practice. In this case, how we define the set of reasonable actions a human might perform is an important unspecified part of this framework, and so more research is required to see if the quantilizers framework can address these problems.</p><h3 id="Example_2__Inner_misalignment">Example 2: Inner misalignment</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh4.googleusercontent.com/5IwgPazN3zOoVysNBz2xT4XzEGsisj-IFZNuvAoO1y01GgeThXp_CjkToXgYoXZLWesYm0sjjIqwBr85pee0s1IJ72jPAT6OI_2NgupykTXLf6pFTRmXe7PWjtoK_oFl1xPLx2UHttBK6d9M0vLw1uKih3KQBcmuhGB41xHspnJTWdUw2VnH2sHI=w400-h400" width="400" /><figcaption><i>robot jumping over boxes to collect a coin, videogame, digital art, artstation</i></figcaption></figure><p>We’ve discussed inner misalignment in a previous section. This concept was first explicitly named in a paper called <span><span><span><a href="https://arxiv.org/abs/1906.01820"><u>Risks from Learned Optimisation in Advanced ML Systems</u></a></span></span></span>, published in 2019. 
This paper defined the concept and suggested some conditions which might make it more likely to happen, but the truth is that a lot of this is still just conjecture, and there are many things we don’t yet know about how likely this kind of misalignment is, or what we can do about it. The CoinRun example discussed earlier (and the <span><span><span><a href="https://www.deepmind.com/publications/objective-robustness-in-deep-reinforcement-learning"><u>Objective Robustness</u></a></span></span></span> paper) came from an independent research team in 2021. This study was the first known example of inner misalignment in an AI system, showing that it is more than just a theoretical possibility. They also tested certain interpretability tools on the CoinRun agent, to see whether it was possible to discover when the agent had a goal different to the one intended by the programmers. For more on interpretability, see later sections.</p><h2 id="Building_safe_AGI">Building safe AGI</h2><blockquote><p><i>Let’s make specific plans for <b>how to actually build safe AGI</b>, and then try to test, implement, and understand the limitations of these plans. The emphasis is on understanding how to build AGI <b>safely</b>, rather than trying to do it as fast as possible.</i></p></blockquote><p>At some point we’re going to build an AGI. Companies are already racing to do it. We better make sure that there exist some blueprints for a safe AGI (and that they’re used) by the time we get to that point.</p><p>Perhaps the master list of safe AGI proposals is Evan Hubinger’s <span><span><span><a href="https://arxiv.org/pdf/2012.07532.pdf"><u>An Overview of 11 Proposals for Building Safe Advanced AI</u></a></span></span></span>. 
</p><h3 id="Example_1__Iterated_Distillation_and_Amplification__IDA_">Example 1: Iterated Distillation and Amplification (IDA)</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh3.googleusercontent.com/iv_PInhVoajV8GCh91_uKIFYRlin53WlX3xKxETmSfvd-9s11nUfFfOmtzbnujqEp98T1rqu7ffPQTDuIEnHaDuu1xiBfaoXmX7N50wdhFbAI42udE4u9RIgNmsIdRXtiF7Us0WUS6vrT0EMi_P8PHgp4sapFK4Sr3CMdrRfIG6MPik7JMVBoWlX=w400-h400" width="400" /><figcaption><i>artists depection of a robot dreaming up multiple copies of itself, cascading tree, delegating, digital art, trending on artstation</i></figcaption></figure><p>“Iterated Distillation and Amplification” (IDA) is an imposing name, but the core intuition is simple. One of the ways in which an individual human can achieve more things is by delegating tasks to others. In turn, the assistants that tasks are delegated to can be expected to become more competent at the task.</p><p>In IDA, an AI plays the role of the assistant. “Distillation” refers to the abilities of the human being “distilled” into the AI through training, and “amplification” refers to the human becoming more capable as they can call on more and more powerful AI assistants to help them.</p><p>A setup to train an IDA personal assistant might go like this:</p><ol><li>You have a human, say Hannah, who knows how to carry out the tasks of a personal assistant.</li><li>You have an ML model - call it Martin - that starts out knowing very little (perhaps nothing at all, or perhaps it’s a pre-trained language model so it knows how to read and write English but not much else).</li><li>Hannah needs to find the answer to some questions, and she can invoke multiple copies of Martin to help her. Since Martin is quite useless at this stage, Hannah has to do even simple tasks herself, like writing routine emails. Using some interface legible to Martin, she breaks the email-writing task into subtasks like “find email address of Hu M. 
Anderson”, “select greeting”, “check project status”, “mention project status”, and so on.</li><li>From seeing enough examples of Hannah’s own answers to the sub-questions, Martin’s training loop gradually trains it to be able to answer first the simpler sub-tasks - (address is “humanderson@humanmail.com”, greeting is “Salutations, Human Colleague!”, etc.) and eventually all the sub-tasks involved in routine email-writing.</li><li>At this point, “write a routine email” becomes a task Martin can entirely carry out for Hannah. This is now a building block that can be used as a subtask in broader tasks Hannah gives out to Martin. Once enough tasks become tasks that Martin can carry out by itself, Hannah can draft much larger goals, like “invade France”, and let Martin take care of details like “blackmail Emmanuel Macron”, “write battle plan for the French Alps”, and “select a suitable coronation dress”.</li></ol><p>Note some features of this process. First, Martin learns what it should do and how to do it at the same time. Second, both Hannah’s and Martin’s roles change throughout this process - Martin goes from bumbling idiot who can’t write an email greeting to competent assistant, while Hannah goes from being a demonstrator of simple tasks to a manager of Martin to ruler of France. Third, note the recursive nature here: Hannah breaks down big tasks into small ones to train Martin on successively bigger tasks. </p><p>In fact, assuming perfect training, IDA imitates a recursive structure. When Hannah has only bumbling fool Martin to help her, Martin can only learn to become as good as Hannah herself. But once Martin is that good, Hannah’s position is now essentially that of having herself, but also some number - say 3 - of copies of Martin that are as good as herself. 
We might call this structure “Hannah Consulting Hannah & Hannah”; presumably, being able to consult an assistant that has the same skills as her lets Hannah become more effective, so this is an improvement. But now Hannah is demonstrating the behaviour of Hannah Consulting Hannah & Hannah, so from Hannah’s example Martin can now learn to be as good as Hannah Consulting Hannah & Hannah - making Hannah as good as Hannah Consulting (Hannah Consulting Hannah & Hannah) & (Hannah Consulting Hannah & Hannah). And so on:</p><figure class="image image_resized" style="width: 39.51%;"><img height="400" src="https://lh5.googleusercontent.com/4KVRCQ6XWNxrS0UCex5XSQOjVLUT-WMLCSPHDBtlW18UppVMQ0TrB90iAxtCANjcOO-PY38npd_bk4MoAMGFEZgV_rD4Ut3i0h3AsZtSMpUnanOEygNVayV0D8AqBmxRYbWGO6mxv72HBwPpxkHj0mGP-BiT_OJO0n0oOm2ebACzPfixtAUvIenf=w386-h400" width="386" /></figure><p>If everything is perfect, therefore, IDA imitates a structure called “HCH”, which is a recursive acronym for “Humans Consulting HCH”. Others call it the “<span><span><span><a href="https://www.lesswrong.com/posts/tmuFmHuyb4eWmPXz8/rant-on-problem-factorization-for-alignment"><u>Infinite Bureaucracy</u></a></span></span></span>” (and fret about whether it’s actually a good idea).</p><p>Now “Infinite Bureaucracy” is not a name that screams “new sexy machine learning concept”. However, it’s interesting to think about what properties it might have. Imagine that you had, say, a 10-minute time limit to answer a complicated question, but you were allowed to consult three copies of yourself by passing a question off to them and getting back an answer immediately. These three copies also obeyed the same rules. Could you, for example, plan your career? Program an app? 
Write a novel?</p><p>It’s also interesting to think of the ways in which the limitations of machine learning mean that IDA might not approximate HCH.</p><h3 id="Example_2__AI_safety_via_debate">Example 2: AI safety via debate</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh6.googleusercontent.com/cUcjU1Wz8IoerNJ4nMQz4wXHpdqgFSgrHqg_cBse2bAuYAubuXLvF3Nx7mBEdPzMD7smvSxXKVBoPZ9Ed3aZ05PRRGMJ8cUCqhptBoq4-iKWkTKHeGqk_8LVVTIOPJRX_bL5Sw22zHMjMWe6qsIUt_YJLByhswzelRxXGIhETBj0rik66KkQNJ-u=w400-h400" width="400" /><figcaption><i>artists depiction of two robots debating, digital art, trending on artstation</i></figcaption></figure><p>Imagine you’re a bit drunk, but (as one does) you’re at a bar talking about AI alignment proposals. Someone’s talking about how even if you can get an advanced AI system to explain its reasoning to you, it might try to slip something very subtle past you and you might not notice. You might well blurt out: “well then just make it fight another AI over it!”</p><p>The OpenAI safety team presumably spends a fair amount of time at bars, because they’ve <span><span><span><a href="https://openai.com/blog/debate/"><u>investigated the idea of achieving safe AI by having two AIs debate each other</u></a></span></span></span> to persuade a panel of human judges, by trying to poke holes in each other’s arguments. For more complex tasks, the AIs could be given transparency tools deriving from interpretability research (see next section) that they can use on each other. 
Just like a Go-playing AI gets an unambiguous win-loss signal from either winning or losing, a debating AI gets an unambiguous win-loss signal from winning or losing the debate:</p><figure class="image"><img height="143" src="https://lh4.googleusercontent.com/U_12hGskORYC9OqJsU0faB1lGjCrhJSaw6WTNLc0NHWLHPYyCgVQHTXXNurP-fwCpIW3fDh0ldeKtv6j3e3TWt7LfJEev4980zTtvm7ZSV42GUrqDMQKDZ0jUjn6Uml2OjiXa4VYQoqr9SO1ddQGJz4-S9HJYfPY8HpWyCJYVdqXotq3CO_vUoG4hA=w400-h143" width="400" /></figure><p>In addition, having the type of AI that is trained to give answers that are maximally insightful and persuasive to humans seems like the type of thing that might not be terrible. Consider how in court, a prosecutor and defendant biased in opposite directions are generally assumed to converge on the truth. Unless, of course, maximising persuasiveness to humans - over accuracy or helpfulness - is exactly the type of thing that gets the worst parts of Goodhart’s law delivered to you by 24/7 Amazon Prime express delivery.</p><h3 id="Example_3__Assistance_Games_and_CIRL">Example 3: Assistance Games and CIRL</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh4.googleusercontent.com/G12BXuKKHb0JliA-2TVkOyPOuw4-mmX8gBQxtoDrrncYL0SrtPFynuxQysMFccmJ1XDmtx4oWThGjl_6dhX97QbCW9KJU-A_vZ56YqtmdxXTNRrYp4PBV485fJtauI6J6rd-zeIucOlIDZanG0Hi6e_Evkuo1hj9lQZBoxhTda8FA0t0jrhdF3qA=w400-h400" width="400" /><figcaption><i>Human teaching a robot with feedback, digital art, trending on artstation</i></figcaption></figure><p>Assistance Games are the name of a broad class of approaches pioneered by Stuart Russell, a prominent figure in AI and co-author of the <span><span><span><a href="https://en.wikipedia.org/wiki/Artificial_Intelligence:_A_Modern_Approach"><u>best-known AI textbook in the world</u></a></span></span></span>. 
Russell talks about his approach more in his book <span><span><span><a href="https://en.wikipedia.org/wiki/Human_Compatible"><i><u>Human Compatible</u></i></a></span></span></span>. In it, he summarises his approach to aligning AI with the following three principles:</p><ul><li>The machine’s only objective is to maximise the realisation of human preferences.</li><li>The machine is initially uncertain about what those preferences are.</li><li>The ultimate source of information about human preferences is human behaviour.</li></ul><p>The key component here is <b>uncertainty about preferences</b>. This is in contrast to what Russell calls the “standard model” of AI, where machines optimise a fixed objective supplied by humans. We have discussed in previous sections the problems with such a paradigm. A lot of Russell’s work focuses on changing the standard way the field thinks about AI.</p><p>To put these principles into action, Russell has designed what he calls <b>assistance games</b>. These are situations in which the machine and human interact, and the human’s actions are taken as evidence by the machine about the human’s true preferences. To explain the form of these games would involve a long tangent into game theory, which these margins are too short to contain. However, one thing worth noting is that assistance games have the potential to solve the <b>“off-switch problem”</b>; that a machine will try and take steps to prevent itself from being switched off (we described this as <i>self-preservation</i> earlier, in the section on instrumental goals). If the AI is uncertain about human goals, then the human trying to switch it off is evidence that the AI was going to do something wrong – in which case, it is happy to be switched off. However, this is far from a complete agenda, and formalising it has many roadblocks to get past. 
For instance, the question of how exactly to infer human preferences from human behaviour leads into thorny philosophical issues such as <i>Gricean semantics. </i>In cases where the AI makes incorrect inferences about human preferences, it might no longer allow itself to be shut down. See <span><span><span><a href="https://mailchi.mp/59ddebcb3b9a/an-69-stuart-russells-new-book-on-why-we-need-to-replace-the-standard-model-of-ai"><u>this Alignment Newsletter entry</u></a></span></span></span> for a summary of Russell’s book, which provides some more details as well as an overview of relevant papers.</p><blockquote><p><i>Vignette: <b><u>CHAI </u></b></i></p><figure class="image image_resized" style="width: 29.78%;"><img height="114" src="https://lh3.googleusercontent.com/_nyiIv74Vr7yQ0Dn3OyFQ1IR9D0gHxJNioGMRQgUZe3Ope_Z_yqxwFRcw_MPq8isqHgqQKlOO6QqHyFCBCbR3sr9u3JE3y3QiQvt67_x9LpdjDKbGx2xqBvhPGSl_wIL4bY4gK3JB7WEEu7J_FC8nKClYlMG3jWad76RndQ8rNa8YADyWYS1Q_tK=w320-h114" width="320" /></figure><p><i>CHAI (the Centre for Human-Compatible AI) is a research lab at UC Berkeley, run by Stuart Russell. Compared to most other AI safety organisations, they engage a lot with the academic community, and have produced a great deal of research over the years. They are best-known for their work on CIRL (Cooperative Inverse Reinforcement Learning), which can be seen as a specific approach to a certain kind of assistance game. However, they have a very broad focus which also includes work on multi-agent scenarios (when rather than a single AI and single human, there exists more than one AI or more than one human - see the </i><span><span><span><a href="http://acritch.com/arches/"><i><u>ARCHES agenda</u></i></a></span></span></span><i> for more on this). 
</i></p></blockquote><h3 id="Example_4__Reinforcement_learning_from_human_feedback__RLHF_">Example 4: Reinforcement learning from human feedback (RLHF)</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh6.googleusercontent.com/agiwmaQqFxi4VHcVwoZlBs19h1EjfITgDuROp5yQ5joAKxFd4bXI_dDMCScy4XNMe5A7nxXR0WEgqAKdAY09f2ynaJDNr-c5SkZcKcYKon5WXCy9n4fvw56vo6q7cu2aYimXrtwIdgA540RshK6mgI_vtakjGqsbL6QiQu6gHJhUyoiyWYsHWS2A=w400-h400" width="400" /><figcaption><i>Training a robot to do a backflip, digital art, trending on artstation</i></figcaption></figure><p>Reinforcement learning (RL) is one of the main branches of ML, focusing on the case where the job of the ML model is to act in some environment and maximise the reward it expects to receive. Reinforcement learning from human feedback (RLHF) means that the ML model’s reward signal comes (at least partly) from humans giving it feedback directly, rather than humans programming in an automatic reward function and calling it a day.</p><p>The famous initial success in this was DeepMind training an ML model in a simulated environment <span><span><span><a href="https://www.deepmind.com/blog/learning-through-human-feedback"><u>to do a backflip</u></a></span></span></span> (link includes GIF) in 2017, based purely on it repeatedly making two attempts and humans labelling one of them as the better one. 
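</p><p>The pairwise comparisons described above can be turned into a reward signal by fitting a model that scores the preferred attempt higher. Below is a minimal sketch in Python, assuming a linear reward over two made-up clip features (how much of the rotation was completed, and whether the agent landed upright) and a logistic model of which clip the human prefers. The function names, features and labels are hypothetical illustrations, not the setup used in the original experiment.</p>

```python
import math

def reward(w, features):
    """Predicted reward: a linear score of a clip's (hypothetical) features."""
    return sum(wi * xi for wi, xi in zip(w, features))

def fit_reward_from_preferences(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so that preferred clips get a higher predicted reward.

    pairs: list of (features_a, features_b, a_preferred) human comparisons.
    Models P(a preferred over b) = sigmoid(reward(a) - reward(b)) and runs
    stochastic gradient ascent on the log-likelihood of the human labels.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for fa, fb, a_preferred in pairs:
            diff = reward(w, fa) - reward(w, fb)
            p_a = 1.0 / (1.0 + math.exp(-diff))
            # Gradient of the log-likelihood with respect to diff.
            error = (1.0 if a_preferred else 0.0) - p_a
            w = [wi + lr * error * (xa - xb) for wi, xa, xb in zip(w, fa, fb)]
    return w

# Hypothetical clips, described as [rotation completed, landed upright].
full_flip = [1.0, 1.0]
spin_no_landing = [1.0, 0.0]
fall = [0.3, 0.0]

# Human labels: True means the first clip of the pair was judged better.
comparisons = [
    (full_flip, fall, True),
    (spin_no_landing, full_flip, False),
    (spin_no_landing, fall, True),
]

w = fit_reward_from_preferences(comparisons, n_features=2)
for name, clip in [("full flip", full_flip),
                   ("spin, no landing", spin_no_landing),
                   ("fall", fall)]:
    print(name, round(reward(w, clip), 2))
```

<p>In a full system, a learned reward like this would stand in for a hand-written reward function while the agent is trained with ordinary RL. A real setup would use a neural network over raw observations rather than two hand-picked features.</p><p>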
Note how relying on human feedback makes this task much more robust to specification gaming; in other cases, humans have tried to get ML agents to run fast, only to find that they learn to become very tall and then fall forward (achieving a very high average speed, using the definition of speed as the rate at which their centre of mass moves - <span><span><span><a href="http://www.karlsims.com/papers/siggraph94.pdf"><u>paper</u></a></span></span></span>, <span><span><span><a href="https://www.youtube.com/watch?v=TaXUZfwACVE&list=PL5278ezwmoxQODgYB0hWnC0-Ob09GZGe2&index=9"><u>video</u></a></span></span></span>). However, human reward signals can be fooled. For example, <span><span><span><a href="https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/"><u>one ML model</u></a></span></span></span> that was being trained to grab a ball with a hand learned to place the hand between the camera and the ball in such a way that it looked to the human evaluators as if it were holding the ball.</p><p>More recently, OpenAI produced a version of their advanced language model GPT-3 that was fine-tuned on human feedback to do a better job of following instructions. They named it <span><span><span><a href="https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf"><u>InstructGPT, and found that it was much more helpful than vanilla GPT-3</u></a></span></span></span>.</p><p>Pure RLHF is unlikely to be the solution on its own. 
Ajeya Cotra, a researcher at Open Philanthropy who we will meet again when we talk about forecasting AI timelines, calls a variant of RLHF called HFDT (Human Feedback on Diverse Tasks) the most straightforward route to transformative AI, <span><span><span><a href="https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to"><u>while also thinking that the default outcome of using HFDT to create transformative AI is AI takeover.</u></a></span></span></span></p><h2 id="Robustly_good_approaches">Robustly good approaches</h2><blockquote><p><i>In the long-run AGI will clearly be important, but we're <b>highly uncertain</b> about how we'll get there and what, exactly, could go wrong. So let's do <b>work that seems good in many possible scenarios</b>, and doesn’t rely on having a specific story in mind.</i></p></blockquote><h3 id="Example_1__Interpretability">Example 1: Interpretability</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh3.googleusercontent.com/p3oN1cUovpdWPmOxx5JV3amBxPDrc1btw2oAWmg3m3mpcOIyi_ciq8DXLM6Q2gaM0VjtO7tr8_5D1EljVB4gsSNDjrsy_deczP4d0V5oJ1P467d5FpoAGbkRb7Uur-fOLjvFR-qnrAdt-cH2i0-jcvKTKNLIRl280sW5805Rw0EB5tZcano5RpvT=w400-h400" width="400" /><figcaption><i>A person using a microscope to look inside a robot, digital art, trending on artstation</i></figcaption></figure><p>If you look at fundamental problems with current ML systems, #1 is probably something like this: in general we don’t have any idea what an ML model is doing, because it’s multiplying massive inscrutable matrices of floating-point numbers with other massive inscrutable matrices of floating point numbers, and it’s pretty hard to stare at that and answer questions about what the model is actually doing. Is it thinking hard about whether an image is a cat or a dog? Is it counting up electric sheep? Is it daydreaming about the AI revolution? 
Who knows!</p><p>If you had to figure out an answer to such a question today, your best bet might be to call Chris Olah. Chris Olah has been spearheading work into trying to interpret what neural networks are doing. A signature output of Chris Olah’s work is pictures of creepy dogs like this one:</p><figure class="image"><img height="320" src="https://lh6.googleusercontent.com/AI3eI3kyXmLugVs3y9N2i0SB2lQAh2bRot9yzsOKeScVm1qpl_pqdRmflJiYJadqbR0pQuXkgCkdw306b1FJfwv3fRtCO3td30Kx3d5Lxqw_1JG4nrlEdXdreNXd4YiBLe0zUQ368qSgX_5Mvd3BCkCuDRCMlyIvvTCMbIBi9yvqB0DF9qoeh7GtGw=w320-h320" width="320" /></figure><p>What’s significant about this picture is that it’s the answer to a question roughly like this: what image would maximise the activation of neuron #12345678 in a particular image-classifying neural network? (With some asterisks about needing to apply some maths details to the process to promote large-scale structure in the image to get nice-looking results, and with apologies to neuron #12345678, who I might have confused with another neuron.)</p><p>If neuron #12345678 is maximised by something that looks like a dog, it’s a fair guess that this neuron somehow encodes, or is involved in encoding, the concept of “dog” inside the neural network.</p><p>What’s especially interesting is that if you do this analysis for every neuron in an ML model - <span><span><span><a href="https://microscope.openai.com/models"><u>OpenAI Microscope</u></a></span></span></span> lets you see the results - you sometimes get clear patterns of increasing abstraction. The activation-maximising images for the first few layers are simple patterns; in intermediate layers you get things like curves and shapes, and then at the end even recognisable things, like the dog above. This seems evidence for neural ML vision models having learned to build up abstractions step-by-step.</p><p>However, it’s not always simple. 
For example, there are “polysemantic” neurons that correspond to several different concepts, like this one that can be equally excited by cat faces, car fronts, and cat legs:</p><figure class="image image_resized" style="width: 82.7%;"><img height="128" src="https://lh4.googleusercontent.com/IVwwVMeWVJd42N7TRzykZZUrWyQUj-gthRnTKW-ZAde0Nr8IMfe_8kd8mKHdb9sK8l6_TfYm4iHqnfdzDIttPzkk9G8_qWF3urnN0Yz6YVd-ZEt1djWMedObA3HYa1Ly6abzGDEk0oH-PuDzvX59GZIGbvscfZ5M_2l0OXX6LJmTA1q8g8rNRFeMhw=w400-h128" width="400" /></figure><p>Olah’s original work on vision models is strikingly readable and well-presented; you can find it <span><span><span><a href="https://distill.pub/2020/circuits/zoom-in/"><u>here</u></a></span></span></span>.</p><p>Starting in late 2021, ML interpretability researchers have also made some progress in understanding transformers, which are the neural network architecture powering advanced language models like <span><span><span><a href="https://openai.com/blog/gpt-3-apps/">GPT-3</a></span></span></span>, <span><span><span><a href="https://blog.google/technology/ai/lamda/">LaMDA</a></span></span></span> and <span><span><span><a href="https://openai.com/blog/openai-codex/">Codex</a></span></span></span>. Unfortunately the work is less visual, particularly in the animal pictures department, but still well-presented. You can find it <span><span><span><a href="https://transformer-circuits.pub/2021/framework/index.html"><u>here</u></a></span></span></span>.</p><p>In the most immediate sense, interpretability research is about reverse-engineering how exactly ML models do what they do. Hopefully, this will give insights into how to detect if an ML system is doing something we don’t like, and more general insights into how ML systems work in practice.</p><p>Chris Olah has some other inventive ideas about what to do with a sufficiently-good approach to ML interpretability. 
For example, he’s proposed the concept of “microscope AI”, which entails using AI as a tool to discover things about the world - not by having the AI tell us, but by training the ML system on some data, and then extracting insights about the data by digging into the internals of the ML system without necessarily ever actually running it.</p><blockquote><p><i>Vignette: <b><u>Anthropic</u></b></i></p><figure class="image image_resized" style="width: 16.18%;"><img height="320" src="https://lh3.googleusercontent.com/wYvCYVcPnIri6U8_SEmaHhjsW4uzm4mMkMgMTfNc2ErpVIgkVl5izoHHXzFpwUxBOWznB84OhISlxT93TYnLodBJgZjJ1LxzNqF6V_K7zmOgj8eD2g7gdDhlHozFr4tmRvHLiv57ybh1BZTO9NXJvMcehviUNhfyOBd2kZ1AUCz73nRasSobuRT8Qw=w320-h320" width="320" /></figure><p><i>Anthropic is an AI safety company, started by people who left </i><span><span><span><a href="https://openai.com/"><i>OpenAI</i></a></span></span></span><i>. The company’s approach is very empirical, focused on running experiments with machine learning models. In particular, Anthropic does a lot of interpretability work, including </i><span><span><span><a href="https://transformer-circuits.pub/"><i><u>the state-of-the-art papers on reverse-engineering how transformer-based language models work.</u></i></a></span></span></span></p></blockquote><h3 id="Example_2__Adversarial_robustness">Example 2: Adversarial robustness</h3><figure class="image image_resized" style="width: 50.16%;"><img height="400" src="https://lh5.googleusercontent.com/UNaqaHtPYvdDzUwQMj85DwPxq06pL_nX1RGXOssgzzQaM1Zbr1q2u_AICpk-jAZBmjzCvWN9cks_0ELrOrvS-XYV0X8xDyuJlVI04QYl73NveRdFNCvMBLT7AN2MbDpjBq2W8y86SVmraVvYADV7VpkvK_d2xtp41ukCFSuIdwMccMHOYgKNtQHr=w400-h400" width="400" /><figcaption><i>robot which is merging with a panda, digital art, trending on artstation</i></figcaption></figure><p>Some modern ML systems are vulnerable to adversarial examples, where a small and seemingly innocuous change to an input causes a major change in the output behaviour. 
Here, we see two very similar-looking images of a panda, except that carefully-selected noise has made the ML classification model very confidently say that the second image is of a gibbon:</p><figure class="image image_resized" style="width: 78.59%;"><img height="153" src="https://www.researchgate.net/publication/347639649/figure/fig1/AS:973837478948864@1609192356344/A-demonstration-of-an-adversarial-sample-21-The-panda-image-is-recognized-as-a-gibbon.ppm" width="400" /></figure><p>Adversarial robustness is about making AI systems robust to attempts to make them do bad things, even when they’re presented with inputs carefully designed to try to make them mess up.</p><p>Redwood Research recently did a project (that resulted in <span><span><span><a href="https://arxiv.org/pdf/2205.01663.pdf"><u>a paper</u></a></span></span></span>) about using language models to complete stories in a way where people don’t get injured. They used a technique called adversarial training, where they developed tools that helped generate injurious examples that the current classifier failed to flag, and then trained their classifier specifically on those breaking examples. With this strategy they managed to reduce the fraction of injurious story completions from 2.4% to 0.003% - both small numbers, but one roughly a thousand times smaller. Their hope is that this type of method can be applied to training AIs for high-stakes settings where reliability is important.</p><p>An example of a theoretical difficulty with adversarial training is that sometimes a failure in the model might exist, but it might be very hard to instantiate. For example, if an advanced AI acts according to the rule “if everything I see is consistent with the year being 2050, I will kill all humans”, and we assume that we can’t fool it well enough about what year it actually is, then adversarial training isn’t very useful. 
This leads to the concept of <i>relaxed</i> adversarial training, which is about extending adversarial training to cases where you can’t construct a specific adversarial input but you can argue that one exists. Evan Hubinger describes this <span><span><span><a href="https://www.lesswrong.com/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment"><u>here</u></a></span></span></span>.</p><blockquote><p><i>Vignette: <b><u>Redwood Research</u></b></i></p><figure class="image image_resized" style="width: 18.79%;"><img height="320" src="https://lh6.googleusercontent.com/CYIBM0yuHjmQfmCPVGSzG27iRYjYw4LdSaF3VrHt7AGSHVRi9GdaBobW6j15DOR-9raS5JQx-jmOkLB4AxVixhfB-pAXxVjCgzo0ZEY1kV3eb3mdVG03BhyOnEETJbaAYw2SubfKCZkebPeYqEKB7rq2R18aMTEoxhoMxPo907x-lQaQ7EUzhF-ebQ=w320-h320" width="320" /></figure><p><i>Like Anthropic, Redwood Research is an AI safety company focused on empirical research on ML systems. In addition to work on interpretability, they did the adversarial training project described in the previous section. 
Redwood has lots of interns, and runs the Machine Learning for Alignment Bootcamp (MLAB), which teaches people interested in AI safety about practical ML.</i></p></blockquote><h3 id="Example_3__Eliciting_Latent_Knowledge__ELK_">Example 3: Eliciting Latent Knowledge (ELK)</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh3.googleusercontent.com/Xi00cFgcOkHQZgo6rSaM6sbE8TdKdGgX1NTo8i7o9Prdxz5Tt334JtEg4y7nMGlFjNGsY7qZXd7EYbAe9hdtPbbbM7SRVSthgEITwBxUR0juUqDr_qQXQ8VRZdzevQQUOW89K6pl_thknXMV0cL1YFHV-tUnTlzFs4v_mJtF4RmZP27ary0E6Bk8mg=w400-h400" width="400" /><figcaption><i>an oil painting of an armoured automaton standing guard next to a diamond</i></figcaption></figure><p>Eliciting Latent Knowledge (ELK) is an important sub-problem within alignment identified by the team at the <span><span><span><a href="https://alignment.org/"><u>Alignment Research Center (ARC)</u></a></span></span></span>, and is the single project ARC is currently pursuing. The core idea is that a common way advanced AI systems might go wrong is by taking action sequences that lead to outcomes that look good by some metric, but which humans would clearly identify as bad if they knew about them in sufficient detail. As a toy example, the ELK report discusses the case of an AI guarding a diamond in a vault by operating some complex machinery around it. Humans judge how well the AI is doing by looking at a video feed of the diamond in the vault. Let’s say the AI tries to trick us by placing a picture of the diamond in front of the camera. The human judgement on this would be positive - assume the humans can’t tell the diamond is gone because the picture is good enough - but there exists information which, if the humans knew, would change their judgement. Presumably the AI understands this, since it is likely reasoning about the diamond being gone but the humans being fooled anyway when it comes up with this plan. 
We want to train an AI in such a way that we can get out knowledge that the AI seems to know, even when it might be incentivised to hide it.</p><p>ARC’s goal is to find a theoretical approach that seems to solve the problem even given worst-case assumptions.</p><p>ARC ran an ELK competition, and <span><span><a class="PostLinkPreviewWithPost-link" href="https://forum.effectivealtruism.org/posts/Q2BJnpNh8e6RAWFnm/consider-trying-the-elk-contest-i-am"><u>trying to see if you can come up with solutions to the ELK problem</u></a></span></span> is often recommended as a way to quickly get a taste of theoretical alignment research. You can read the full problem description <span><span><span><a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.kkaua0hwmp1d"><u>here</u></a></span></span></span>.</p><h3 id="Example_4__Forecasting_and_timelines">Example 4: Forecasting and timelines</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh4.googleusercontent.com/A3wGLUQX59yxBvN5Il_-EsnH1IVOnPP2Cyck1lH43Kp44_UM5HcBuOCSoFE_amewU0drWjAH27ttmVJheHZRfWtuQpK7gzbtSU6UlT-Nt_HmZmoeEm_fbyZ_GMCq3w00XtnSqlD8ZUVSkxDJxzyGKae5HvpDVzCh0RgZgYDgAxijONEcm2b3HSvQ=w400-h400" width="400" /><figcaption><i>artificial intelligence which is thinking about a line on a graph, forecasting, digital art, trending on artstation</i></figcaption></figure><p>Many questions depend on how soon we’re going to get AGI. As the saying goes: prediction is very hard, especially about the future - and this is doubly true about predicting major technological changes. 
</p><p>One way to try to forecast AGI timelines is to <span><span><span><a href="https://www.lesswrong.com/posts/H6hMugfY3tDQGfqYL/what-do-ml-researchers-think-about-ai-in-2022"><u>ask experts</u></a></span></span></span>, or find other ways of aggregating the opinion of people who have the knowledge or incentive to be right, as for example <span><span><span><a class="MetaculusPreview-link" href="https://www.metaculus.com/questions/3479/date-weakly-general-ai-is-publicly-known/"><u>prediction markets do</u></a></span></span></span>. Both of these are essentially just ways of tapping into the intuition of a bunch of people who hopefully have some idea.</p><p>In an attempt to shed new light on the matter, Ajeya Cotra (a researcher at Open Philanthropy) wrote <span><span><span><a href="https://www.lesswrong.com/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines"><u>a long report</u></a></span></span></span> on trying to forecast AI milestones by trying out several ways of analogising AI to biological brains. The report is often referred to as “Biological Anchors”. For example, you might assume that an ML model that does as much computation as the human brain has a decent chance of being a human-level AI. There are many degrees of freedom here: is the relevant comparison the amount of compute the human brain uses to run versus the amount of compute it takes to run a trained ML system, or the total compute of a human brain over a human lifetime versus the compute required to train the ML model from scratch, or something else entirely? In her report, Cotra looks at a range of assumptions for this, and at predictions of future compute trends, and somewhat surprisingly finds that which set of assumptions you make doesn’t matter too much; every scenario involves a >50% chance of human-level AI by 2100.</p><p>The Biological Anchors method is very imprecise. For one, it neglects algorithmic improvements. 
For another, it is very unclear what the right biological comparison point is, and how to translate ML-relevant variables like compute measured in FLOPS (FLoating point OPerations per Second) or parameter count into biological equivalents. However, the report does a good job of acknowledging and taking into account all this uncertainty in its models. More generally, anything that sheds light on the question of when we get AGI seems highly relevant.</p><h2 id="Deconfusion">Deconfusion</h2><blockquote><p><i>Reasoning about how to align AGI involves reasoning about complex concepts, such as intelligence, alignment and values, and <b>we’re pretty confused about what these even mean</b>. This means any work we do now is plausibly not helpful and definitely not reliable. As such, our priority should be <b>doing conceptual work on how to think about these concepts and what we’re aiming for</b>, and trying to become less confused.</i></p></blockquote><p>Of all the categories under discussion here, deconfusion has maybe the least clear path to impact. It’s not immediately obvious how becoming less confused about concepts like these is going to translate into an improved ability to align AGIs.</p><figure class="image image_resized" style="width: 63.37%;"><img height="271" src="https://lh6.googleusercontent.com/mzdxKdTPIz6-t4D0JsS5T43ejdAIeb3sTKmLHlPauGwSsdR24vjCj0nvR14lnN1vNttmMp87KcJxMPBrAg11jVdjQnft3RD_jtaIIG_oSJSju-qpQ5_zbjU1KMd8FCOkDNE1e4kLqSG8FcQupVyMsx59SShpgreeo7-Suava64STtT-GzKGFixsOJg=w400-h271" width="400" /></figure><p>Some kinds of deconfusion research are just about finding clearer ways of describing different parts of the alignment problem (Hubinger’s <span><span><span><a href="https://arxiv.org/abs/1906.01820"><u>Risks From Learned Optimisation</u></a></span></span></span>, where he first introduces the inner/outer alignment terminology, is a good example of this). 
But other types of research can dive heavily into mathematics and even philosophy, and be very difficult to understand.</p><h3 id="Example_1__MIRI_and_Agent_Foundations">Example 1: MIRI and Agent Foundations</h3><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh6.googleusercontent.com/hXPqn2pw80WLvpP1bIIpelessP39neowbs15Db9EwVAm0xv9JI_LQNwz-M56V7BwEiaJpahTSebfHCempkz_qkXgz_AZ2_RXtofjEy5nYdYlUEbKViWZa1e4SgJuhnVAAEcar_bEMvQa0DT1Iw_fL2jcUBayJbKSbX6jAgnOiIXJZbhVtlf544BE=w400-h400" width="400" /><figcaption><i>robot sitting in front of a television, playing a videogame, digital art</i></figcaption></figure><p>The organisation most associated with this view is MIRI (the Machine Intelligence Research Institute). Its founder, Eliezer Yudkowsky, has written extensively on AI alignment and human rationality, as well as topics as wide-ranging as evolutionary psychology and quantum physics. His post <span><span><span><a href="https://intelligence.org/2018/10/03/rocket-alignment/"><u>The Rocket Alignment Problem</u></a></span></span></span> tries to get across some of his intuitions behind MIRI’s research, in the form of an analogy – trying to build aligned AGI without having deeper understanding of concepts like intelligence and values is like trying to land a rocket on the moon by just pointing and shooting, without a working understanding of Newtonian mechanics. </p><p>Cryptography provides a different lens through which to view this kind of foundational research. Suppose you were trying to send secret messages to an ally, and to make sure nobody could intercept and read your messages you wanted a way to measure how much information was shared between the original and encrypted message. 
You might use the <span><span><span><a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient"><u>correlation coefficient</u></a></span></span></span> as a proxy for the shared information, but unfortunately having a correlation coefficient of zero between the original and encrypted message isn’t enough to guarantee safety. But if you find the concept of <span><span><span><a href="https://en.wikipedia.org/wiki/Mutual_information"><u>mutual information</u></a></span></span></span>, then you’re done – ensuring zero mutual information between your original and encrypted message guarantees the adversary will be unable to read your message. In other words, only once you’ve found a <b>“true name” </b>- a robust formalisation of the intuitive concept you’re trying to express mathematically - can you be free from the effects of Goodhart’s law. Similarly, maybe if we get robust formulations of concepts like “agency” and “optimisation”, we would be able to inspect a trained system and tell whether it contained any misaligned inner optimisers (see the first post), and these inspection tools would work even in extreme circumstances (such as the AI becoming much smarter than us).</p><p>Much of MIRI’s research has come under the heading of <span><span><span><a href="https://intelligence.org/embedded-agency/"><u>embedded agency</u></a></span></span></span>. This tackles issues that arise when we are considering agents which are part of the environments they operate in (as opposed to standard assumptions in fields like reinforcement learning, where the agent is viewed as separate from its environment). 
Four main subfields of this area of study are:</p><ul><li><b>Decision theory</b> (adapting classical decision theory to embedded agents)</li><li><b>Embedded world-models </b>(how to form true beliefs about a world in which you are embedded)</li><li><b>Robust delegation</b> (understanding what trust relationships can exist between an agent and its future - maybe far more intelligent - self)</li><li><b>Subsystem alignment</b> (how to make sure an agent doesn’t spin up internal agents which have different goals)</li></ul><blockquote><p><i>Vignette: <b><u>MIRI</u></b></i></p><figure class="image image_resized" style="width: 18.79%;"><img height="320" src="https://lh4.googleusercontent.com/1oDrxk4RHs0z4MMOjb4ttHOEIkb3xcWIghMvHTakORphqlv-yo6k_I9vyR4iQIwtl89C9abJUxCmGWlGck5yV4rleaqf305508iDSriCXX3zz85FCeTRy7Aq37r6nuqawkQvqAZwGupf-J2CxCsNsNpvnErLKzbO5M70x3mRGgLPHCPVeyp_Qu1P=w320-h320" width="320" /></figure><p><i>MIRI is the oldest organisation in the AI alignment space. It used to be called the Singularity Institute, and had the goal of accelerating the development of AI. In 2005 they shifted focus towards trying to manage the risks from advanced AI. This has largely consisted of fundamental mathematical research of the type described above. MIRI might be better described as a confluence of smart people with backgrounds in highly technical fields (e.g. mathematics), working on different research agendas that share underlying philosophies and intuitions. 
They have a nondisclosure policy by default, which they explain in this </i><span><span><span><a href="https://intelligence.org/2018/11/22/2018-update-our-new-research-directions/#section3"><i><u>announcement post</u></i></a></span></span></span><i> from 2018.</i></p></blockquote><h3 id="Example_2__John_Wentworth_and_Natural_Abstractions">Example 2: John Wentworth and Natural Abstractions</h3><figure class="image image_resized" style="width: 50.19%;"><img height="400" src="https://lh4.googleusercontent.com/_7sl0Lw5LqhCAT55HvClH1a8sKz8HgPHlICee0jH5FP8hdk02fTA40tCJAgPniD5yF0K254SbDVth6GxPwlVAN1zgMEc6hLjNTZmK58z6cz10d44oX25MVODlMWjBkxnwdLSCIQMTg-cEJqgp_bHYhYnSRP6DMdHE_Ou9p2-HWtnrB1eeVCEfpEP=w400-h400" width="400" /><figcaption><i>thermometer being used to measure a robot, digital art, trending on artstation</i></figcaption></figure><p>John Wentworth is an independent researcher, who publishes most of his work on <span><span><span><a href="https://www.lesswrong.com/users/johnswentworth"><u>LessWrong</u></a></span></span></span> and the <span><span><span><a href="https://www.alignmentforum.org/users/johnswentworth"><u>AI Alignment Forum</u></a></span></span></span>. His main research agenda focuses on the idea of <span><span><span><a href="https://www.lesswrong.com/posts/Fut8dtFsBYRz8atFF/the-natural-abstraction-hypothesis-implications-and-evidence"><u>Natural Abstractions</u></a></span></span></span>, which can be described in terms of three sub-claims:</p><ul><li><b>Abstractability</b><br />Our physical world abstracts well, i.e. 
we can usually come up with simpler summaries (abstractions) for much more complicated systems (example: a gear is a very complex object containing a vast number of atoms, but we can summarise all relevant information about it in just one number - the angle of rotation).</li><li><b>Human-Compatibility</b><br />These are the abstractions used by humans in day-to-day thought/language.</li><li><b>Convergence</b><br />These abstractions are "natural", in the sense that we should expect a wide variety of intelligent agents to converge on using them.</li></ul><p>The <span><span><span><a href="https://www.lesswrong.com/posts/gdEDPHjCY5DKsMsvE/the-pragmascope-idea"><u>ideal outcome</u></a></span></span></span> of this line of research would be some kind of measurement device (an “abstraction thermometer”), which could take in a system like a trained neural network and spit out a representation of the abstractions represented by that system. In this way, you’d be able to get a better understanding of what the AI was actually doing. In particular, you might be able to identify inner alignment failures (the AI’s true goal not corresponding to the reward function it was being trained on), and you could retrain it while pointed at the intended goal. 
So far, this line of research has consisted of some <span><span><span><a href="https://www.lesswrong.com/posts/jJf4FrfiQdDGg7uco/the-telephone-theorem-information-at-a-distance-is-mediated"><u>fairly</u></a></span></span></span> <span><span><span><a href="https://www.lesswrong.com/posts/cqdDGuTs2NamtEhBW/maxent-and-abstractions-current-best-arguments"><u>dense</u></a></span></span></span> <span><span><span><a href="https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information"><u>mathematics</u></a></span></span></span>, but Wentworth has <span><span><span><a href="https://www.lesswrong.com/posts/gdEDPHjCY5DKsMsvE/the-pragmascope-idea"><u>described</u></a></span></span></span> his plans to build on this with more empirical work (e.g. training neural networks on the same data, and using tools from calculus to try and compare the similarity of concepts learned by each of the networks). </p><figure class="image image_resized" style="width: 82.69%;"><img height="174" src="https://lh5.googleusercontent.com/lsaRKjfkVGAtWSWXm3Fe63DdV1WhNOKUV7au1apShXw58CnjUuOZT_edQRbe2bW9YWvWyRKYPI3MOfoUhtPL9u__KUDd77nYHHxlDqmDPgkEPdNQqAiMt3jibph5545p0UYWHxJh43TafpfJ851C9-uqcM9tYkXHk6daURboffMngti_a78zOwxc=w400-h174" width="400" /></figure><h2 id="AI_governance">AI governance</h2><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh5.googleusercontent.com/zkeRVL9HgLo8hrt4Glx4Tp0RheimnIT7UBmKmwy3YfiO9UWGd-pVn9IfhkKK_VMJ8n9r9p8J6dwkJcAXifAnhNFKzRli5MMzzww5wdjHhlsdEotOi9i3SffPa9H0N5qQJXOnJwO_ZyVxvsRiWEADnwAeJQCOED1D_vLqha3DgDzhZ9G6TUDIxAnG=w400-h400" width="400" /><figcaption><i>judging, presiding over a trial, sentencing a robot, digital art, artstation</i></figcaption></figure><p>In these posts, we’ve mainly focused on the technical side of the issue. This is important, especially for understanding why there is a problem in the first place. 
However, the management and reduction of AI risk obviously includes not just technical approaches like those outlined in the above sections, but also <span><span><span><a href="https://80000hours.org/articles/ai-policy-guide/"><u>the field of AI governance</u></a></span></span></span>, which tries to understand and push for the right types of policies for advanced AI systems.</p><p>For example, the Cold War was made a lot more dangerous by the nuclear arms race. How do we avoid having an arms race in AI, either between nations or companies? More generally, how can we make sure that safety considerations are given appropriate weight by the teams building advanced AI systems? How do we make sure any technical solutions get implemented?</p><p>It’s also very hard to say what the impacts of AI will be, across a broad range of possible technical outcomes. If AI capabilities at some point advance very quickly from below human-level to far beyond the human-level, the way the future looks will likely mostly be determined by technical considerations about the AI system. However, if progress is slower, there will be a longer period of time where weird things are happening because of advanced AI - for example, significantly accelerated economic growth, or mass unemployment, or an AI-assisted boom in science - and these will have economic, social, and political ramifications that will play out in a world not too dissimilar from our own. 
Someone should be working on figuring out what these ramifications will be, especially if they might alter the balance of existential threats that civilisation faces; for example, if they make geopolitics less stable and nuclear war more likely, or affect the environment in which even more powerful AI systems are developed.</p><p>The Centre for the Governance of AI, or <span><span><span><a href="https://www.governance.ai/"><u>GovAI</u></a></span></span></span> for short, is an example of an organisation in this space.</p><h2 id="Field_building">Field-building</h2><figure class="image image_resized" style="width: 50.38%;"><img height="400" src="https://lh5.googleusercontent.com/hMQM1f9Pr8VBRGtxHDoxtOTLtSvu6Vej4lrcK4uuli-OeHWHCpJ60FJ5ZENKQpOD859D-gDCh4gdUu9KRKNpoOXyD55nvXOSI2lls0VMVJUF9LlkIkFuYoGA9vPOw7OEYGP6hBjlBAEeYVyiFA2J5Jg994IdHdpOCs9wSY4_jAHBgsAyDvbh7fl3=w400-h400" width="400" /><figcaption><i>robot giving a lecture in a university, group of students, hands up, digital art, artstation</i></figcaption></figure><p><i>One of the most important ways we can make AI go well is by increasing the number of capable researchers doing alignment research.</i></p><p>As mentioned, AI safety is still a relatively young field. The case here is that we might do better to grow the field, and increase the quality of research it produces in the future. Some forms that field building can take are:</p><ul><li><b>Setting up new ways for people to enter the field</b><br />There are too many to list here. To give a few different structures which exist for this purpose:<ul><li><b>Reading groups and introductory programmes. </b><br />Maybe the most exciting one from the last few years has been the Cambridge <span><span><span><a href="https://www.eacambridge.org/agi-safety-fundamentals"><u>AGI Safety Fundamentals Programme</u></a></span></span></span>, which has curricula for technical alignment and AI governance. 
The technical curriculum consists of 7 weeks of reading material and group discussions, and a final week of capstone projects where the participants try their hand at a project / investigation / writeup related to AI safety. Beyond this, many people are also setting up reading groups in their own universities for books like <i>Human Compatible</i>. </li><li><b>Ways of supporting independent researchers</b><br />The <span><span><span><a href="https://aisafety.camp/"><u>AI Safety Camp</u></a></span></span></span> is an organisation which matches applicants with mentors posing a specific research question, and is structured as a series of group research sprints. They have produced work such as the example of inner misalignment in the CoinRun game, which we discussed in a previous section. Other examples of organisations which support independent research include <span><span><span><a href="https://www.lesswrong.com/posts/jfq2BH5kfQqu2vYv3/we-are-conjecture-a-new-alignment-research-startup"><u>Conjecture</u></a></span></span></span>, a recent alignment startup which does their own alignment research as well as providing a structure to host externally funded independent conceptual researchers, and <span><span><span><a href="https://alignmentfund.org/"><u>FAR (the Fund for Alignment Research)</u></a></span></span></span>.</li><li><b>Coding bootcamps</b><br />Since current systems are increasingly being bottlenecked by alignment and interpretability barriers rather than capabilities, in recent years more focus has been directed towards working with cutting-edge deep learning models. This requires strong coding skills and a good understanding of the relevant ML, which is why bootcamps and programmes specifically designed to skill up future alignment researchers have been created. 
Two such examples are <span><span><span><a href="https://www.lesswrong.com/posts/3ouxBRRzjxarTukMW/apply-to-the-second-iteration-of-the-ml-for-alignment"><u>MLAB</u></a></span></span></span> (the Machine Learning for Alignment Bootcamp, run by Redwood Research), and <span><span><a class="PostLinkPreviewWithPost-link" href="https://forum.effectivealtruism.org/posts/9RYvJu2iNJMXgWCBn/introducing-the-ml-safety-scholars-program"><u>MLSS</u></a></span></span> (the Machine Learning Safety Scholars Programme, which is based on publicly available material as well as lectures produced by Dan Hendrycks). </li></ul></li><li><b>Distilling research</b><br />In <span><span><span><a href="https://www.lesswrong.com/posts/zo9zKcz47JxDErFzQ/call-for-distillers"><u>this post</u></a></span></span></span>, John Wentworth makes the case for more distillation in AI alignment research - in other words, more people who focus on understanding and communicating the work of alignment researchers to others. This often takes the form of writing more accessible summaries of hard-to-interpret technical papers, and emphasising the key ideas.</li><li><b>Public outreach / better intro material</b><br />For instance, books like Brian Christian’s <span><span><span><a href="https://en.wikipedia.org/wiki/The_Alignment_Problem"><i><u>The Alignment Problem</u></i></a></span></span></span><i>, </i>Stuart Russell’s <span><span><span><a href="https://en.wikipedia.org/wiki/Human_Compatible"><i><u>Human Compatible</u></i></a></span></span></span> and Nick Bostrom’s <span><span><span><a href="https://en.wikipedia.org/wiki/Superintelligence:_Paths,_Dangers,_Strategies"><i><u>Superintelligence</u></i></a></span></span></span> communicate AI risk to a wide audience. These books have been helpful for making the case for AI risks more mainstream. 
Note that there can be some overlap between this and distilling research (Rob Miles’ <span><span><span><a href="https://www.youtube.com/c/RobertMilesAI"><u>channel</u></a></span></span></span> is another great example here).</li><li><b>Getting more of the academic community involved</b><br />Since AI safety is a hard technical problem, and since misaligned systems generally won’t be as commercially useful as aligned ones, it makes sense to try and engage the broader field of machine learning. One great example of this is Dan Hendrycks’ paper <span><span><span><a href="https://mailchi.mp/08a639ffa2ba/an-167concrete-ml-safety-problems-and-their-relevance-to-x-risk"><u>Unsolved Problems in ML Safety</u></a></span></span></span> (which describes a list of problems in AI safety, with the ML community as the target audience). Stuart Russell has also engaged a lot with the ML community. </li></ul><p>Note that this is certainly not a comprehensive overview of all current AI alignment proposals (a few more we haven’t had time to talk about are CAIS, Andrew Critch’s cooperation-and-coordination-failures framing for AI risks, and many others). 
However, we hope this has given you a brief overview of some of the different approaches taken by people in the field, as well as the motivations behind their research.</p><figure class="image image_resized" style="width: 100%;"><img height="254" src="https://lh4.googleusercontent.com/8IrJGfz6Tvmu9txQyjfBchN5qa5oOcRxA82PjEq8PoLyjbURekcePnCBANH_vlOljnG7HX1kbix_x_bjFbLch5V06sArMylvGBYk1xuL0x4CyGnv0zR5kTxIEborM3YNnhK4cLajUHkY0F0VEgT9Sfj-tOAMFyA8QhGSs5e7nx-kjheq4sYPQei2=w400-h254" width="400" /><figcaption>Map of the solution approaches we've discussed so far</figcaption></figure><h1 id="Conclusion"><b>Conclusion</b></h1><figure class="image image_resized" style="width: 50.48%;"><img height="400" src="https://lh3.googleusercontent.com/jyPCCOLA3QoyE7bmLTxk_y-M-dkTwYwgs81eERrmJelCc5DUk-rc9KPqQ9R-DaNiTaAYW-odeEEViOBgVUdPvd-qidCA-ThS0gUHDQuBFts6yfh_sbC576gTqD4vc5O1a9mjtw4UyrEPo7HWGm2LG_irSnFLTraBFaj9FmcLd94yXr97VQX3MXxp=w400-h400" width="400" /><figcaption><i>people walking along a path which stretches off and disappears into a colorful galaxy filled with beautiful stars, digital art, trending on artstation</i></figcaption></figure><p>At the very least, advanced AI is a technology that promises to have effects on the scale of the internet or computer revolutions, and its effects are perhaps even more likely to be akin to those of the <b>industrial revolution</b> (which allowed for the automation of much <i>manual </i>labour) and the <b>evolution of humans</b> (the last time something significantly smarter than everything that had come before appeared on the planet).</p><p>It’s easy to invent technologies that the same could be said about - a magic wish-granting box! Wow! 
But unlike magic wish-granting boxes, something like advanced AI, or AGI, or transformative AI, or <span><span><span><a href="https://www.cold-takes.com/transformative-ai-timelines-part-1-of-4-what-kind-of-ai/"><u>PASTA</u></a></span></span></span> (Process for Automating Scientific and Technical Achievement) seems to be headed our way. The smart money is on it very likely coming <b>this century</b>, and quite likely in the <b>first half</b>.</p><p>If you look at the progress in modern machine learning, and especially the past few years of progress in so-called deep learning, it is hard not to feel a sense of rushing progress. The past few years of progress, in particular the success of the transformer architecture, should update us in the direction that intelligence might be a surprisingly easy problem. What is essentially fancy iterative statistical curve-fitting with a few hacks thrown in already manages to write fluent appropriate English text in response to questions, create paintings from a description, and carry out multi-step logical deduction in natural language. <b>The fundamental problem that plagued AI progress for over half a century - getting fuzzy/intuitive/creative thinking into a machine, in addition to the sharp but brittle logic at which computers have long excelled - seems to have been cracked.</b> There is a solid empirical pattern of predictably improving performance akin to Moore’s law - the “<span><span><span><a href="https://arxiv.org/pdf/2001.08361.pdf"><u>scaling laws</u></a></span></span></span>” we mentioned in the first post - that we seem not to have hit the limits of yet. There are experts in the field who would not be surprised if the remaining insights for cracking human-level machine intelligence could fit into a few good papers.</p><p>This is not to say that AGI is definitely coming soon. 
The field might get stuck on some stumbling block for a decade, during which there will no doubt be much written about the failed promises and excess hype of the early-2020s deep learning revolution.</p><p>Finally, as we’ve argued, by default the arrival of advanced AI might plausibly lead to civilisation-wide catastrophe.</p><p>There are few things in the world that fit all of the following points:</p><ul><li>A potentially transformative technology whose development would likely rank somewhere between the top events of the century and the top events in the history of life on Earth.</li><li>Something that is likely to happen in the coming decades.</li><li>Something that has a meaningful chance of being cataclysmically bad.</li></ul><p>For those thinking about the longer-term picture, whatever the short-term ebb and flow of progress in the field, AI and AI risk loom large in humanity’s future. The main ways in which this might stop being the case are:</p><ul><li>There is a major flaw in the arguments for at least one of the above points. Since many of the arguments are abstract and not empirically falsifiable before it’s too late to matter, this is possible. However, note that there is a strong and recurring pattern of many people, including in particular many extremely talented people, running into the arguments and taking them more and more seriously. (If you do have a strong argument against the importance of the AI alignment problem, there are many people - us included - who would be very eager to hear from you. Some of these people - us not included - would probably also pay you large amounts of money.)</li><li>We solve the technical AI alignment problem, and we solve the AI governance problem to a degree where the technical solutions will be implemented and it seems very unlikely that advanced AI systems will wreak havoc with society.</li><li>Human civilisation suffers a catastrophic outcome, whether resulting from AI itself or something else. 
</li></ul><p>The project of trying to make sure the development of advanced AI goes well is likely one of the most important things in the world to be working on (if you’re lost, the <span><span><span><a href="https://80000hours.org/problem-profiles/positively-shaping-artificial-intelligence/"><u>80,000 Hours problem profile</u></a></span></span></span> is a decent place to start). It might turn out to be easy - consider how many seemingly intractable scientific problems dissolved once someone had the right insight. But right now, at least, it seems like it might be a fiendishly difficult problem, especially if it continues to seem like the insights we need for alignment are very different from the insights we need to build advanced AI.</p><p>Most of the time, science and technology progress in whatever direction is easiest or flows most naturally from existing knowledge. Other times, reality throws down a gauntlet, and we must either overcome the challenge or fail. May the best in our species - our ingenuity, persistence, and coordination - rise up, and deliver us from peril.</p></div><h1>AI risk intro 1: advanced AI might be very bad</h1><p><i>Published 2022-09-11; updated 2022-09-24.</i></p><div style="text-align: center;"><b><i>This post was a joint effort with <a href="https://www.perfectlynormal.co.uk/">Callum McDougall</a>.</i></b></div><div style="text-align: center;"><i> </i></div><div style="text-align: center;"><i><span style="font-size: x-small;">9.6k words (~25min)</span> </i><br /></div><br /><div><h2><b>Introduction</b></h2><p><span>If human civilisation is destroyed this century, the most likely cause is advanced AI systems. 
This might sound like a bold claim to many, given that we live on a planet full of existing concrete threats like climate change, over ten thousand nuclear weapons, and Vladimir Putin.</span> However, it is a conclusion that many people who think about the topic keep coming to. While it is not easy to describe the case for risks from advanced AI in a single piece, here we make an effort that assumes no prior knowledge. Rather than try to argue from theory straight away, we approach it from the angle of what computers actually can and can’t do.</p><h2><b>The Story So Far</b></h2><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgx7vowH6i3Ti4ML0V2FWajMZW6Bhrk3VVG_MrtyAb8IT4_aHBDLLuvUNHqUUUHORvKixBz32NegaO0LKEdKdTnWOMGf3gp7e7kmydWCvy3jeDrU1w21KECa1Q8TOoImRihhBzFLlc0PgMKjo6jq58FHAFIrPDbWdRv7NtQCI1cxVa92bUfoSHzYsqAlg/s1024/hY1j9oUn8isdCmnQz-2hdVnqsld9KFfkIVEY0AcjlbryYGNlsKzann09MnFAdlNQmtlBas3aV4Y2dcnWG1tbwYwFfMg1XbBJyhh-Z4elcKOU_DZ2U7Zek0YDCAcN-ucAim4p2mjqIMX0iol6vFgVwr-OBAQP9Rb0ns7z2gnZr-xLJ2f5jxDtRtiOZw.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgx7vowH6i3Ti4ML0V2FWajMZW6Bhrk3VVG_MrtyAb8IT4_aHBDLLuvUNHqUUUHORvKixBz32NegaO0LKEdKdTnWOMGf3gp7e7kmydWCvy3jeDrU1w21KECa1Q8TOoImRihhBzFLlc0PgMKjo6jq58FHAFIrPDbWdRv7NtQCI1cxVa92bUfoSHzYsqAlg/w400-h400/hY1j9oUn8isdCmnQz-2hdVnqsld9KFfkIVEY0AcjlbryYGNlsKzann09MnFAdlNQmtlBas3aV4Y2dcnWG1tbwYwFfMg1XbBJyhh-Z4elcKOU_DZ2U7Zek0YDCAcN-ucAim4p2mjqIMX0iol6vFgVwr-OBAQP9Rb0ns7z2gnZr-xLJ2f5jxDtRtiOZw.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: an image generated by OpenAI’s DALL-E 2, from the prompt: "artist's impression of an artificial 
intelligence thinking about chess, digital art, artstation".</i></p></td></tr></tbody></table> <p>(This section can be skipped if you understand how machine learning works and what it can and can’t do today.)</p><p>Let’s say you want a computer to do some complicated task, for example playing chess. The computer has no understanding of high-level things like “chess”, “board”, “piece”, “move”, or “win” - it only understands how to do a small set of things. Your task as the programmer is to break down the high-level goal of “beat me at chess” into simpler and simpler steps, until you arrive at a simple mechanistic description of what the computer needs to do. If the computer does beat you, it’s not because it had any new insight into the problem, but rather because you were clever enough to find some <a href="https://en.wikipedia.org/wiki/Minimax">set of steps</a> that, carried out blindly at sufficient speed and in sufficient quantity, overwhelms whatever cleverness you yourself can apply during the game. This is how Deep Blue beat Kasparov, and more generally how most software and the so-called “Good Old-Fashioned AI” (GOFAI) paradigm work.</p><p>Programs of this type can be powerful. In addition to <a href="https://en.wikipedia.org/wiki/Stockfish_(chess)">beating humans at chess</a>, they can <a href="https://www.google.com/maps/">calculate shortest routes</a> on maps, <a href="https://en.wikipedia.org/wiki/Coq">prove maths theorems</a>, <a href="https://en.wikipedia.org/wiki/Autopilot">mostly fly airplanes</a>, and <a href="https://duckduckgo.com/">search all human knowledge</a>. Programs of this type are responsible for the stereotypical impression of computers as logical, precise, uncreative, and brittle. They are essentially executable logic.</p><p>Many people hoped that you could write programs to do “intelligent” things. These people were right - after all, had you asked almost anyone before Deep Blue won whether playing chess counts as “intelligence”, they’d have said yes. 
But “classical” programming hit limitations, in particular in doing “obvious” things like figuring out whether an image is of a cat or a dog, or being able to respond in English. This idea that abstract reasoning and logic are easy but humanly-intuitive tasks are hard for computers came to be known as <a href="https://en.wikipedia.org/wiki/Moravec's_paradox">Moravec’s paradox</a>, and held back progress in AI for a long time.</p><p>There is another way of programming - machine learning (ML) - going back to the 1950s, almost as far as classical programming itself. For a long time, it was held back by hardware limitations (along with some algorithmic and data limitations), but thanks to <a href="https://en.wikipedia.org/wiki/Moore's_law">Moore’s law</a> hardware has advanced enough for it to be useful for real problems.</p><p>If classical programming is executable logic, ML is executable statistics. In ML, the programmer does not define how the system works. The programmer defines how the system learns from data.</p><p>The “learning” part in “machine learning” makes it sound like something refined and sensible. This is a false impression. ML systems learn by going through a training process that looks like this:</p><p><b>Step 1:</b> you define a statistical model. This takes the form of some equation that has some unknown constants (“parameters”) in it, and some variables where you plug in input values. Together, the parameters and input variables define an output. 
(The equations in ML can be <i>extremely</i> large, for example with billions of parameters and millions of inputs, but they are very structured and almost stupidly simple.)</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjORh8X4FaEWm0fI4ShrWwb-psijpMWscfXou0LCBeML52KElxu3KXuc3TMWVZFlxGWRkgAJj2Hgrp--Y6B-F3pLXdPlBKu0nRyLP6y8XbJfmm0yRJEtlqip6dqoUTcWRJMTbpeWx9uzk5uz8GEsmWPL4UQyGw7dWG5YdUxHvl-G11twIQc3oF_qlAewQ/s1600/lGrV82iGSXqpdpvBUgiwoVlZ3Fw4Ic_981gdWk3t_0-zr0mSSNNonagmkH294S8Vdu_pJBV241przsjs-DCqNh3uhyN-MaVW7M5dbZ8Q_YkJhslb2EXlxnMoo95WPRY0UGaE8a_OeCVt2QsOlkHAHj3dB67cWEy1rQFHUytP_4pvNBZA23iKkAOB7g.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="420" data-original-width="1600" height="168" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjORh8X4FaEWm0fI4ShrWwb-psijpMWscfXou0LCBeML52KElxu3KXuc3TMWVZFlxGWRkgAJj2Hgrp--Y6B-F3pLXdPlBKu0nRyLP6y8XbJfmm0yRJEtlqip6dqoUTcWRJMTbpeWx9uzk5uz8GEsmWPL4UQyGw7dWG5YdUxHvl-G11twIQc3oF_qlAewQ/w640-h168/lGrV82iGSXqpdpvBUgiwoVlZ3Fw4Ic_981gdWk3t_0-zr0mSSNNonagmkH294S8Vdu_pJBV241przsjs-DCqNh3uhyN-MaVW7M5dbZ8Q_YkJhslb2EXlxnMoo95WPRY0UGaE8a_OeCVt2QsOlkHAHj3dB67cWEy1rQFHUytP_4pvNBZA23iKkAOB7g.png" width="640" /></a></div><p><b>Step 2</b>: you don’t know what parameters to put in the equation, but you can literally roll some dice if you want (or the computer equivalent).</p><p><b>Step 3</b>: presumably there’s some task you want the ML system to do. Let it try. It will fail horribly and produce gibberish (c.f. 
the previous part where we just put random numbers everywhere).</p><p><b>Step 4</b>: There's a simple algorithm called gradient descent, which, when using another algorithm called backpropagation to calculate the gradient, can tell you which direction all the parameters should be shifted to make the ML system slightly better (as judged, for example, by its performance on examples in a dataset).</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgM6c2JdD_GEOckQoSQa3VmDW_MGdJCsYQXnijjZBYPxou_u1BQlqfo-keyffoIuImm8PMizS_FyJEWgp1oj_-HXsjEC8yswPw5RmY6QrNE1oYh20AF0ZUR2aRVa_w-SX_E9-z6WXnJygAzCEpuAdCH6xBGy4mkiLFFKcarKkOrlpMB-WAoIPsrD-iVDA/s1263/srXTJFyyK6k_UpaQBxCcua1U6qfS5dbpDyo_IS0Kcsos7LRMVH-RbbvKV7fPPUFkx5C8wkifbpIUlMN-9F6pJVEAOoPcuR-4lRgHhVX6DQvC3C_HrC93KRZPpDuADnyyzsXJpMxA3TGRnIsg0aBuxrqw4npBflti9ZsCeATutgh99zABMoiMDztt.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1039" data-original-width="1263" height="526" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgM6c2JdD_GEOckQoSQa3VmDW_MGdJCsYQXnijjZBYPxou_u1BQlqfo-keyffoIuImm8PMizS_FyJEWgp1oj_-HXsjEC8yswPw5RmY6QrNE1oYh20AF0ZUR2aRVa_w-SX_E9-z6WXnJygAzCEpuAdCH6xBGy4mkiLFFKcarKkOrlpMB-WAoIPsrD-iVDA/w640-h526/srXTJFyyK6k_UpaQBxCcua1U6qfS5dbpDyo_IS0Kcsos7LRMVH-RbbvKV7fPPUFkx5C8wkifbpIUlMN-9F6pJVEAOoPcuR-4lRgHhVX6DQvC3C_HrC93KRZPpDuADnyyzsXJpMxA3TGRnIsg0aBuxrqw4npBflti9ZsCeATutgh99zABMoiMDztt.png" width="640" /></a></div><br /><p><b>Step 5</b>: You shift all the numbers a bit based on the algorithm in step 4.</p><p><b>Step 6</b>: Go back to step 3 (letting the system try). Repeat until (a) the system has stopped improving for a long time, (b) you get impatient, or - increasingly plausible these days - (c) you run out of your compute budget.</p><p>If you’re doing simple curve-fitting statistics problems, it makes sense that this kind of thing works. 
However, it’s surprising just how far it scales. It turns out that this method, plus some clever ideas about what type of model you choose in step 1, plus willingness to burn millions of dollars on just <i>scaling it up beyond all reason</i>, gets you:</p><ol><li><a href="https://thenextweb.com/news/gpt3-ai-college-essay-grades-compared-students">essay-writing as good as middling college students</a> (see also <a href="https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3">this lightly-edited article that GPT-3 wrote about why we should not be afraid of it</a>) </li><li><a href="https://qz.com/2176389/the-best-examples-of-dall-e-2s-strange-beautiful-ai-art/">text-to-image capabilities better (and hundreds of times faster) than almost any human artist</a> (in fact, we used DALL-E to generate the images used at the start of each section of this document)</li><li><a href="https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html">the ability to explain jokes</a></li></ol><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi68LysoeDqfjTMzpJ3oXCYlrIZUHZaRVbC_Q3nCg8q-uqcAmgK-5auwkr0_YiWEmtzaI7NwbbKH92OSfWE_IxhxdqH35eaT-fSEbk7s56dpXRxqfZlfVg0k9T8QZw_scvyk_13M1DXvMEJBClJaclcMnVHNWMqLuBxrsfpjPVgiFE9El1u7OZhvn2JNA/s1600/1hREMK3bcCz93v0xffqPjxCgG2h8vk26GUX9EmDRKDFePXJW70t3C8ejg1C54IqAnm5uuIQw5yxyX8NbTp0BnxMIZi3kLevPF-z9jOQSdCPL-aVwSEOxQKihu8_ITDYbff3HxmRvNqkmm1PslU2SpzlTIAnKkJHCAVy3eBuw5dswfjzr54ui5M9yHQ.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="965" data-original-width="1600" height="386" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi68LysoeDqfjTMzpJ3oXCYlrIZUHZaRVbC_Q3nCg8q-uqcAmgK-5auwkr0_YiWEmtzaI7NwbbKH92OSfWE_IxhxdqH35eaT-fSEbk7s56dpXRxqfZlfVg0k9T8QZw_scvyk_13M1DXvMEJBClJaclcMnVHNWMqLuBxrsfpjPVgiFE9El1u7OZhvn2JNA/w640-h386/1hREMK3bcCz93v0xffqPjxCgG2h8vk26GUX9EmDRKDFePXJW70t3C8ejg1C54IqAnm5uuIQw5yxyX8NbTp0BnxMIZi3kLevPF-z9jOQSdCPL-aVwSEOxQKihu8_ITDYbff3HxmRvNqkmm1PslU2SpzlTIAnKkJHCAVy3eBuw5dswfjzr54ui5M9yHQ.png" width="640" /></a></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcbL9IX7vqdtPhlZODRo5MIpMY_mR95L3QmMOlHFZ6N50o8UvmO-VtAFNIcAZJJqd3iiSVFv35jQ861ujXRwJ8Hy7jMYg0ZzJCaYfh-80OilsEEyyscWDe1XtJKDs5wP3TOKk79s6lyhGfdLKOG9s3yWQHO6kwqFZF3GqFdKkiUlwW4rmdLj6xlx7bZQ/s1249/sYL5Totsac-mfAv9rX41lJu-YdgV7BkC8Nnj5OBFMO5jwYyJXk_H5LyQNXUsX1lUYsOjo8ZL9lj0kkGYUlRR9-dcFfGrmSGoDNJDRBSpGipXm1aTSsm151RZcbJZSsYniQY_JApToKMDHMw8hwDyZ65PHOJNvIsoO9SP1iAqo36i64NRp2ANej-x4w.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="525" data-original-width="1249" height="270" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcbL9IX7vqdtPhlZODRo5MIpMY_mR95L3QmMOlHFZ6N50o8UvmO-VtAFNIcAZJJqd3iiSVFv35jQ861ujXRwJ8Hy7jMYg0ZzJCaYfh-80OilsEEyyscWDe1XtJKDs5wP3TOKk79s6lyhGfdLKOG9s3yWQHO6kwqFZF3GqFdKkiUlwW4rmdLj6xlx7bZQ/w640-h270/sYL5Totsac-mfAv9rX41lJu-YdgV7BkC8Nnj5OBFMO5jwYyJXk_H5LyQNXUsX1lUYsOjo8ZL9lj0kkGYUlRR9-dcFfGrmSGoDNJDRBSpGipXm1aTSsm151RZcbJZSsYniQY_JApToKMDHMw8hwDyZ65PHOJNvIsoO9SP1iAqo36i64NRp2ANej-x4w.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: examples of reasoning by Google’s PaLM model.</i></p></td></tr></tbody></table><p></p><p>People <a href="https://norvig.com/chomsky.html">laugh at ML</a> because “it’s just iterative statistical curve-fitting”. 
They have a point. But when “iterative statistical curve-fitting” gets a B on its English Literature essay, paints an original Dali in five seconds, and explains a joke, it’s hard to avoid the feeling that it might not be too long before “iterative statistical curve-fitting” is laughing at <i>you</i>.</p><p>So what exactly happened here, and where is statistical curve-fitting going, and what does this have to do with advanced AI?</p><p>We mentioned Moravec’s paradox above. For a long time, getting AI systems to do things that are intuitively easy for humans was an unsolved problem. In just the past few years, it has been solved. A reasonable way to think of current ML capabilities is that state-of-the-art systems can do anything a human can do in a few seconds of thought: recognise objects in an image, generate flowing text as long as it doesn’t require thinking really hard, get the general gist of a joke or argument, and so on. They are also superhuman at some things, including predicting the next word in a sentence, recalling lots of facts (note that this is without internet access, not quoting verbatim, and generally in the right context), and generally producing output faster.</p><p>The way it was solved reflects what Richard Sutton called <a href="http://incompleteideas.net/IncIdeas/BitterLesson.html">the “bitter lesson”</a>. 
This is the recurring pattern in which countless researchers have spent their careers inventing fancy algorithms for domain-specific tasks, only to be overtaken by simple (but data- and compute-hungry) ML methods.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGKhRgWjCclC-OmGyZaf9pfMHbt91tKZdhmigk7lHODmZDdIAZB7l5eHK32cB-zwOlhX1sB6Rt1l3LhN14zcA3gS3-NMcj26TKNcBnZ3qZpxU5VDm9s1tKbTolQHUw4Zb1E5wMHC4fhTH1IXCLnnfsR67QuzR_xGw46wl4N-EWdd8hK-YGsWjjAvJsVA/s439/Kh7se7k8viY-Ntcc28IhXSEWt696Xb4B24GUYmLj-WZ1IdK5QGKPoXgOXGYcP4WLdrwHMH737p0TCfx56CmWl42Ptl4WpAzp-QE-spHV9tSug768FeZ_wYCS1tYyYJqH3wJ7bKEIiMAyeG5_6omI8WGS1LcjQvfWKW-B0zs1zpqlIqAevzTHlXEA.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="439" data-original-width="371" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGKhRgWjCclC-OmGyZaf9pfMHbt91tKZdhmigk7lHODmZDdIAZB7l5eHK32cB-zwOlhX1sB6Rt1l3LhN14zcA3gS3-NMcj26TKNcBnZ3qZpxU5VDm9s1tKbTolQHUw4Zb1E5wMHC4fhTH1IXCLnnfsR67QuzR_xGw46wl4N-EWdd8hK-YGsWjjAvJsVA/w338-h400/Kh7se7k8viY-Ntcc28IhXSEWt696Xb4B24GUYmLj-WZ1IdK5QGKPoXgOXGYcP4WLdrwHMH737p0TCfx56CmWl42Ptl4WpAzp-QE-spHV9tSug768FeZ_wYCS1tYyYJqH3wJ7bKEIiMAyeG5_6omI8WGS1LcjQvfWKW-B0zs1zpqlIqAevzTHlXEA.png" width="338" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: Randall Munroe, creator of the xkcd comic, comments on ML. Original</i> <a href="https://xkcd.com/1838/"><i>here</i></a><i>.</i></p></td></tr></tbody></table><p>It was solved gradually at first, and then quickly. 
Neural-network-based ML methods spent a long time in limbo due to insufficiently powerful computers until around 2010 (funnily enough, the specific piece of hardware that has enabled everything in modern ML is the GPU or Graphics Processing Unit, first invented in the 90s because people wanted to play more realistic video games; both graphics rendering and ML rely on many parallel calculations to be efficient). The so-called deep learning revolution only properly started around 2015. Fluent language abilities were essentially nonexistent before OpenAI’s release of <a href="https://en.wikipedia.org/wiki/GPT-2">GPT-2</a> in 2019 (since then, OpenAI has come out with GPT-3, a 100x-larger model that was called “spooky”, “humbling”, and “more than a little terrifying” in <i>The New York Times</i>).</p><p>Not only that, but it turns out there are simple <a href="https://arxiv.org/pdf/2001.08361.pdf">“scaling laws”</a> that govern how ML model performance scales with parameter count and dataset size, which seem to paint a clear roadmap to making the systems even more capable by just cranking the “more parameters” and “more data” levers (presumably they have these at the OpenAI HQ).</p><p>There are many worries in any scenario where advanced AI is approaching fast, as we’ll argue in a later section. The current ML-based AI paradigm is especially worrying, though.</p><p>We don’t actually know what the ML system is learning during the training process it goes through. You can visualise the training process as a trip through (abstract) space. If our model had three parameters, we could imagine it as a point in 3D space. Since current state-of-the-art models have billions of parameters, and are initialised randomly, we can imagine this as throwing a dart somewhere into a billion-dimensional space, where there are a billion different ways to move. 
During the training process, the training loop guides the model along a trajectory in this space by making tiny updates that push the model in the direction of better performance as described above.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUfnRhbyYZ-DoDGnhFI-7KBrY_hCP8TVCPq-7T6UxFk_9EIaPq9_BIBZQDGqnkEZEr-oLCddWuZfmhPLpQ2AGLJ91uByYE-sF-2tia93ko5LPoXHwtJdoOYKTfRw_gy9An3cYSbLn0jXjuevhUQLs8JKDM_t3d-VHmTnSjbk6xXm89S7PClUs-fcMnfw/s1098/Y4Xz5aiVRPew7usYEnspBo4Rafp38eEBb8OlcRbSTu4LdAMJRDMG1GGklZOODyZoWWKoYfoc_84LsH0ed07DJXShqNXQ7tRCTIOGd6YYrOmMWBm9MGv_xdfLJEMQgQvBR0JqswlxYB-qbjcsY8JjfIeqoyY0Gww041biDRaB4iS_c4GS47bI6KHL-Q.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="660" data-original-width="1098" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUfnRhbyYZ-DoDGnhFI-7KBrY_hCP8TVCPq-7T6UxFk_9EIaPq9_BIBZQDGqnkEZEr-oLCddWuZfmhPLpQ2AGLJ91uByYE-sF-2tia93ko5LPoXHwtJdoOYKTfRw_gy9An3cYSbLn0jXjuevhUQLs8JKDM_t3d-VHmTnSjbk6xXm89S7PClUs-fcMnfw/w400-h240/Y4Xz5aiVRPew7usYEnspBo4Rafp38eEBb8OlcRbSTu4LdAMJRDMG1GGklZOODyZoWWKoYfoc_84LsH0ed07DJXShqNXQ7tRCTIOGd6YYrOmMWBm9MGv_xdfLJEMQgQvBR0JqswlxYB-qbjcsY8JjfIeqoyY0Gww041biDRaB4iS_c4GS47bI6KHL-Q.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: the two horizontal axes are the model’s two parameters, and the vertical axis is the loss (higher is worse). The black line is the path the model takes in parameter space during training.</i></p></td></tr></tbody></table> <p>Now let’s say at the end of the training process the model does well on the training examples. What does that tell you? 
It tells you the model has ended up in some part of this billion-dimensional space that corresponds to a model that does well on the training examples. Here are some examples of models that do well on their training examples:</p><ol><li>A model that has learned exactly what you want it to learn. Yay!</li><li>A model that has learned something similar to what you want it to learn, but you can’t tell because no example in the data distinguishes between what it has learned and what you want it to learn.</li><li>A model that has learned to give the right answer when it’s instrumentally in its interest, but which will go off and do something completely different given a chance.</li> </ol><p>How do we know that in the billion-dimensional space of possibilities, our (blind and kind of dumb) training process has landed on #1? We don’t. We launch our ML models on trajectories through parameter-space and hope for the best, like overly-optimistic duct-tape-wielding NASA administrators launching rockets in a universe where, in the beginning, God fell asleep on the “+1 dimension” button.</p><p>The really scary failure modes all lie in the future. However, here are some examples of perverse “solutions” ML models have already come up with in practice:</p><ol><li>A game-playing ML model <a href="https://web.archive.org/web/20160526045303/http://homepages.herts.ac.uk/~cs08abi/publications/Salge2008b.pdf">learned to crash the game</a>, presumably because it can’t die if the game crashes.</li><li>An ML model was meant to convert aerial photographs into abstract street maps and then back (learning to convert to and from a more-abstract intermediate representation is a common training strategy). 
It learned to <a href="https://arxiv.org/pdf/1712.02950.pdf">hide useful information</a> about the aerial photograph in the street map in a way that helped it “cheat” in reconstructing the aerial photograph, and in a way too subtle for humans just looking at the images to notice.</li><li>A game-playing ML model <a href="https://arxiv.org/pdf/1802.08842.pdf">discovered a bug in the game</a> where the game stalls on the first round and it gets almost a million in-game points. The researchers were unable to figure out the reason for the bug.</li> </ol><p>These are examples of <b>specification gaming</b>, in which the ML model has learned to game whatever specification of task success was given to it. (Many more examples can be found on <a href="https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml">this spreadsheet</a>.)</p><p>No one knows for sure where the ML progress train is headed. It is plausible that current ML progress hits a wall and we get another <a href="https://en.wikipedia.org/wiki/AI_winter">“AI winter”</a> that lasts years. However, AI has recently been breaking through barrier after barrier, and so far does not seem to be slowing down. Though we’re still at least some steps away from human-level capabilities at everything, there aren’t many tasks where there’s no proof-of-concept demonstration.</p><p>Machines have been better than humans at some intellectual tasks for a long time; just consider calculators, which have long been superhuman at arithmetic. With the computer revolution, every task that a human has been able to break down into unambiguous steps (where those steps can be carried out with modern computing power) has been added to this list. More recently, more intuition- and insight-based activities have been added to the list too. DeepMind’s AlphaGo beat Lee Sedol, one of the world’s top human Go players (Go being a far harder game than chess for computers), in 2016. 
In 2017, AlphaZero beat both AlphaGo at Go (100-0) and superhuman chess programs at chess, despite training only by playing against itself for less than 24 hours. Analysis of its moves revealed strategies that millennia of human players hadn’t been able to come up with, so it wouldn’t be an exaggeration to say that it beat the accumulated efforts of human civilisation at inventing Go strategies - in one day. In 2019, DeepMind released MuZero, which extended AlphaZero’s performance to Atari games. In 2021, EfficientZero (from a group of academic researchers building on MuZero) needed only two hours of gameplay to become superhuman at Atari games. Beyond games, DeepMind’s AlphaFold and AlphaFold 2 have made big leaps towards solving the problem of predicting a protein’s structure from its constituent amino acids, one of the biggest theoretical problems in biology. A step towards generality was taken by Gato, yet another DeepMind model: a single system that can play games, control a robot arm, label images, and write text.</p><p>If you straightforwardly extrapolate current progress in machine learning into the future, here is what you get: ML models exceeding human performance in a quickly-expanding list of domains, while we remain ignorant about how to make sure they learn the right goals or robustly act in the right way.</p><p> </p><h2><b>Theoretical underpinnings of AI risk</b></h2><p>The previous section discussed the history of machine learning, and how extrapolating its progress has worrying implications. Next we discuss more theoretical arguments for why highly advanced AI systems might pose a threat to humanity.</p><p>One of the criticisms levelled at the notion of risks from AI is that it sounds too speculative, like something out of apocalyptic science fiction. 
Part of this is unavoidable, since we are trying to reason about systems that are more powerful than any which currently exist, and that may not behave like anything we’re used to.</p><p>This section is split into three subsections. Each one makes a claim about the future of artificial intelligence, and discusses the arguments for and against this claim. The three claims are:</p><ul><li><b>AGI is likely.</b> AGI (artificial general intelligence) is likely to be created by humanity eventually, and there is a good chance this will happen within the next century.</li><li><b>AGI will have misaligned goals by default.</b> Unless certain hard technical problems are solved first, the goals of the first AGIs will be misaligned with the goals of humanity, and would lead to catastrophic outcomes if executed.</li><li><b>Misaligned AGI could resist attempts to control it or roll it back.</b> An AGI (or AGIs) with misaligned goals would be able to overpower or outcompete humanity, and gain control of our future, much as we have so far been able to use our intelligence to dominate all other less intelligent species.</li> </ul><p> </p><h3>AGI is likely</h3><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhimU980-3FCg_684oxCOu6NVYctxKEYGvJidc4w0igfh1Bcl-s1O1b-W-lcdWHhouD-F1Eyw3TRgkWOSKGMRv6V8FhQ0ODbDiOyOp8S3-_n4SJCpFdPenwNhmaCmHFWIsJPbRn8rLZjnVP-2A0_6SdpZAg6yQjmlVVM9R6mh0TEB1v4pa4E5HmRk19lg/s1024/DWHUlBLzuXHT7LAKeupCIDzm-qci1JeFmH9ZjevdiioGN2VFHC63YOOY5JMBjEmCFX-WL_E-r8omyZ-Vhp-o0uHvpLYq5lhrGMfqDFTJxLlJYA4HeV2pLJyW0EFsq1t3mWOjQlD202b9a4cRrZRYzR1qN8WGHXWsGAM2JpoSoivVeA7s71wITxXsLw.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="320" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhimU980-3FCg_684oxCOu6NVYctxKEYGvJidc4w0igfh1Bcl-s1O1b-W-lcdWHhouD-F1Eyw3TRgkWOSKGMRv6V8FhQ0ODbDiOyOp8S3-_n4SJCpFdPenwNhmaCmHFWIsJPbRn8rLZjnVP-2A0_6SdpZAg6yQjmlVVM9R6mh0TEB1v4pa4E5HmRk19lg/s320/DWHUlBLzuXHT7LAKeupCIDzm-qci1JeFmH9ZjevdiioGN2VFHC63YOOY5JMBjEmCFX-WL_E-r8omyZ-Vhp-o0uHvpLYq5lhrGMfqDFTJxLlJYA4HeV2pLJyW0EFsq1t3mWOjQlD202b9a4cRrZRYzR1qN8WGHXWsGAM2JpoSoivVeA7s71wITxXsLw.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: this image also generated by OpenAI’s DALL-E 2, using the prompt "a data center with stacks of computers gaining the spark of intelligence"</i>.</p></td></tr></tbody></table><blockquote><p>"<i>Betting against human ingenuity is foolhardy, particularly when our future is at stake.</i>"</p><p>-Stuart Russell</p></blockquote><p>To open this section, we need to define what we mean by artificial general intelligence (AGI). We’ve already discussed AI, so what do we mean by adding the word “generality”?</p><p>An AGI is a machine capable of behaving intelligently over many different domains. The term “general” here is often used to distinguish from “narrow”, where a narrow AI is one which excels at a specific task, but isn’t able to invent new problem-solving techniques or generalise its skills across many different domains. </p><p>As an example of general intelligence in action, consider humans. In a few million years (a mere eye-blink in evolutionary timescales), we went from apes wielding crude tools to becoming the dominant species on the planet, able to build space shuttles and run companies. How did this happen? It definitely wasn’t because we were directly trained to perform these tasks in the ancestral environment. Rather, we developed new ways of using intelligence that allowed us to generalise to multiple different tasks. 
This whole process played out over a shockingly short time, relative to all past evolutionary history, and so it is possible that only a relatively short list of fundamental insights was needed to get general intelligence. And as we saw in the previous section, ML progress hints that gains in intelligence might be surprisingly easy to achieve, even relative to current human abilities.</p><p>AGI is not a distant future technology that only futurists speculate about. OpenAI and DeepMind are two of the leading AI labs. They have received billions of dollars in funding (including OpenAI receiving significant investment from Microsoft, and DeepMind being acquired by Google). Both <a href="https://www.deepmind.com/careers">DeepMind</a> and <a href="https://openai.com/about/">OpenAI</a> have the development of AGI at the core of their mission statements and their business cases. Top AI researchers are publishing <a href="https://openreview.net/pdf?id=BZ5a1r-kVsf">possible roadmaps</a> to AGI-like capabilities. And, as mentioned earlier, the field has been crossing off a significant number of the remaining milestones every year, especially in the past few years.</p><p>When will AGI be developed? Although this question is impossible to answer with certainty, many people working in the field of AI think it is more likely than not to arrive in the next century. An aggregate forecast generated via data from a <a href="https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/">2022 survey</a> of ML researchers estimated <b>37 years until a 50% chance of high-level machine intelligence</b> (defined as systems which can accomplish every task better and more cheaply than human workers). These respondents also gave an average of <b>5% probability of AI having an extremely bad outcome for humanity (e.g. complete human extinction)</b>. 
How many other professions estimate an average of 5% probability that their field of study will be directly responsible for the extinction of humanity?! To explain this number, we need to proceed to the next two sections, where we will discuss why AGIs might have goals which are misaligned with humans, and why this is likely to lead to catastrophe.</p><h3>AGI will have misaligned goals by default</h3><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiq6Jx0OZP97it5eRy_K4j79gQUqvL8RWFSXhkeLJgVu6o8CH6Jb2loqb3v-VGJEeRl6BXDqe5mcM8YywHgtBOSGyfFK3YvaD81uKzyXXCx5RN3JzxgH_8e4hde3iRqoYCOTyi0UvBeHlVOoRPbf7WaTb0EzQajDbQK-wiZYcIkVoui0N82fNZsoiYufw/s1024/ByomdAHZi91n-zB_xjy7hfItOvqPhWMO_0IPLZxzXo1sQnZRxp2YJxZ6-J0rDzO6AGMXgzHTDi9uh4l-Sf-zdvMWBWhxP_VwH72KibODwEZkurOUcBqdjsMVcFbsip8WIP5APNi9BP_2cXrAeE9FY61SrblfMlc89OqV_XEYTCAyVQbZEhuCpQWlkw.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiq6Jx0OZP97it5eRy_K4j79gQUqvL8RWFSXhkeLJgVu6o8CH6Jb2loqb3v-VGJEeRl6BXDqe5mcM8YywHgtBOSGyfFK3YvaD81uKzyXXCx5RN3JzxgH_8e4hde3iRqoYCOTyi0UvBeHlVOoRPbf7WaTb0EzQajDbQK-wiZYcIkVoui0N82fNZsoiYufw/w400-h400/ByomdAHZi91n-zB_xjy7hfItOvqPhWMO_0IPLZxzXo1sQnZRxp2YJxZ6-J0rDzO6AGMXgzHTDi9uh4l-Sf-zdvMWBWhxP_VwH72KibODwEZkurOUcBqdjsMVcFbsip8WIP5APNi9BP_2cXrAeE9FY61SrblfMlc89OqV_XEYTCAyVQbZEhuCpQWlkw.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: yet another image from OpenAI's DALL-E 2. Perhaps it was trying for a self portrait? 
(Prompt: "Artists impression of artificial general intelligence taking over the world, expressive, digital art")</i></p><p><i> </i></p></td></tr></tbody></table> <blockquote><p>"<i>The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.</i>"</p><p>-Eliezer Yudkowsky</p></blockquote><p>Let’s start off this section with a few definitions. </p><p>When we refer to <b>“aligned AI”</b>, we are using Paul Christiano’s conception of <b>“intent alignment”</b>, which essentially means the AI system is <b>trying</b> to do what its human operators want it to do. Note that this is insufficient for building useful AI, since the AI also has to be capable. But situations where the AI is trying and failing to do the right thing seem like less of a problem.</p><p>When we refer to the <b>“alignment problem”</b>, we mean the difficulty of building aligned AI. Note, this doesn’t just capture the fact that we won’t create an AI aligned with human values by default, but that we don’t currently know how to build a sophisticated AI system robustly aligned with <i>any</i> goal.</p><p><i>Can’t we just have the AI learn the right goals by example, just like how all current ML works?</i> The problem here is that we have no way of knowing what goal the AI is learning when we train it; only that it seems to be doing good things on the training data that we give it. The state-of-the-art is that we have hacky but extremely powerful methods that can make ML systems remarkably competent at doing well on the training examples by an opaque process of guided trial-and-error. But there is no Ghost of Christmas Past that will magically float into a sufficiently-capable AI and imbue it with human values. 
We do not have a way of ensuring that the system acquires a particular goal, or even an idea of what a robust goal specification that is compatible with human goals/values could look like.</p><h4>Orthogonality and instrumental convergence</h4><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnHdtTFdEzyaXhPkeAqSBXgRYmkIzR_2vDZzEyaNsryIJhvDqCysP_5NLctCEweBNyomM9ApqmdQh6IzrZV-NCjMspY4Y98lBxGuZXsNAWIPcTjdjqYjvtOL5z8iCS23UfHqm33ukVr9Yz-onGqB3_3u--q-iHeORTf4SzY8rJIuaXi0YBaSqzETKWww/s1024/Rc_xHrlVI3yCKNCG6giTXaqmivDYEON83371l4xIRe_k4jo8-kSuTjkNMNywqYG8vgl2BOTANdtRxcXMFwIQkdPIvC7ueNKvSFhZqOaW_mi2gAkSNP37DQysWnnfzByqzXyLvr-K573dIhxD8WT-uE6TxiABQH58LGsSShTYmJEIUaXaLkyLjdMa.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnHdtTFdEzyaXhPkeAqSBXgRYmkIzR_2vDZzEyaNsryIJhvDqCysP_5NLctCEweBNyomM9ApqmdQh6IzrZV-NCjMspY4Y98lBxGuZXsNAWIPcTjdjqYjvtOL5z8iCS23UfHqm33ukVr9Yz-onGqB3_3u--q-iHeORTf4SzY8rJIuaXi0YBaSqzETKWww/w400-h400/Rc_xHrlVI3yCKNCG6giTXaqmivDYEON83371l4xIRe_k4jo8-kSuTjkNMNywqYG8vgl2BOTANdtRxcXMFwIQkdPIvC7ueNKvSFhZqOaW_mi2gAkSNP37DQysWnnfzByqzXyLvr-K573dIhxD8WT-uE6TxiABQH58LGsSShTYmJEIUaXaLkyLjdMa.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: DALL-E illustrating "Artists depiction of an artificial intelligence which builds paperclips, digital art, artstation"</i></p></td></tr></tbody></table> <p>One of the most common objections to risks from AI goes something like this:</p><blockquote><p> <i>If the AI is smart enough to cause a global catastrophe, isn’t it smart enough to know that this isn’t what humans wanted?</i></p></blockquote><p>The problem with this is that it conflates two 
different concepts: <b>intelligence</b> (in the sense of having the ability to achieve your goals, whatever they might be) and <b>having goals which are morally good by human standards</b>. When we look at humans, these two often go hand-in-hand. But the key observation of the orthogonality thesis is that this doesn’t have to be the case for all possible mind designs. As defined by Nick Bostrom in his book <a href="https://nickbostrom.com/superintelligentwill.pdf"><i>Superintelligence</i></a>:</p><blockquote><p><b>The Orthogonality Thesis</b></p><p><i>Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.</i></p></blockquote><p>Here, orthogonal means “at right angles” or “unrelated” – in other words we can imagine a graph with one axis representing intelligence, and another representing the agent’s goals, with any point in the graph representing a theoretically possible agent*. The classic example here is a <b>“paperclip maximiser”</b> - a powerful AGI driven only by the goal of making paperclips.</p><p>(*This is obviously an oversimplification. For instance, it seems unlikely you could get an unintelligent agent with a highly complex goal, because it would seem to take some degree of intelligence to represent the goal in the first place. 
The key message here is that you could in theory get highly capable agents pursuing arbitrary goals.)</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTupLKJSBje5UxrZ-hWCiTIMAIZ1tlodt_zdgI4Vnr7WEPCGtygTJkMJ9QNfaI8E2csHFyRCCJiiMLTup5XlIVSdzLl5h3AEOmX7FuAeePDCAFZfHh-wPAYnt9XQtr2oTbf_mwc-Wn0ayRTjdJ3YZGvBI9m91OJZZ_HvO6xfHqwFTq1LIwLiWcw_EcjQ/s1600/32rs7DmbEtFqlFq-T71NMr0G11m0M-ElZ5KbJygw6oFszfBkHOA4hd0M6U6yRaZmYVoLjAX_ro77LR-EleAiaqC_qvYNywWuIhaJKa6e83DKbCzVW_lxWjLGq--OpKsgbOONrrEzKWMSOEx3ivUlk2TePyCIJrLt-DlkvoObtiw5RgdbZ_Ijnz67.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="771" data-original-width="1600" height="308" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTupLKJSBje5UxrZ-hWCiTIMAIZ1tlodt_zdgI4Vnr7WEPCGtygTJkMJ9QNfaI8E2csHFyRCCJiiMLTup5XlIVSdzLl5h3AEOmX7FuAeePDCAFZfHh-wPAYnt9XQtr2oTbf_mwc-Wn0ayRTjdJ3YZGvBI9m91OJZZ_HvO6xfHqwFTq1LIwLiWcw_EcjQ/w640-h308/32rs7DmbEtFqlFq-T71NMr0G11m0M-ElZ5KbJygw6oFszfBkHOA4hd0M6U6yRaZmYVoLjAX_ro77LR-EleAiaqC_qvYNywWuIhaJKa6e83DKbCzVW_lxWjLGq--OpKsgbOONrrEzKWMSOEx3ivUlk2TePyCIJrLt-DlkvoObtiw5RgdbZ_Ijnz67.png" width="640" /></a></div><p>Note that an AI may well come to understand the goals of the humans that trained it, but this doesn't mean it would choose to follow those goals. As an example, many human drives (e.g. for food and human relationships) came about because in the ancestral environment, following these drives would have made us more likely to reproduce and have children. But just because we understand this now doesn't make us toss out all our current values and replace them with a desire to maximise genetic fitness.</p><p>If an AI might have bizarre-seeming goals, is there anything we <i>can</i> say about its likely behaviour? As it turns out, there is. 
The secret lies in an idea called the <b>instrumental convergence thesis</b>, again <a href="https://nickbostrom.com/superintelligentwill.pdf">by Bostrom</a>:</p><blockquote><p><b>The Instrumental Convergence Thesis</b></p><p><i>There are some instrumental goals likely to be pursued by almost any intelligent agent, because they are useful for the achievement of almost any final goal.</i></p></blockquote><p>So an instrumental goal is one which increases the odds of the agent’s final goal (also called its <b>terminal goal</b>) being achieved. What are some examples of instrumental goals?</p><p>Perhaps the most important one is <b>self-preservation</b>. This is necessary for pursuing most goals, because if a system’s existence ends, it won’t be able to carry out its original goal. As memorably phrased by Stuart Russell, <i>“you can’t fetch the coffee if you’re dead!”</i>.</p><p><b>Goal-content integrity</b> is another. An AI with some <i>goal X</i> might resist any attempts to have its goal changed to <i>goal Y</i>, because it sees that in the event of this change, its current <i>goal X</i> is less likely to be achieved.</p><p>Finally, there is a set of goals which are all forms of <b>self-enhancement</b> - an AI improving its cognitive abilities, developing better technology, or acquiring other resources, because all of these are likely to help it carry out whatever goals it ends up having. 
For instance, an AI singularly devoted to making paperclips might be incentivised to acquire resources to build more factories, or improve its engineering skills so it can figure out yet more effective ways of manufacturing paperclips with the resources it has.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaPDhYN5zxznZytGaeYZo4ymU89BdLJrOHzkJJ4sE6QRy5tfzrpczSOoDyNU8JFLc9lEUDfwCkZJaHMDLQRO8CiwSxbVeodwBeCjWjBfFKtH2h_piY-P4JZ0avNtvxqljCbEVKzomCzM-FsWuTy0GKXaRePfccxVYwkyHw63YRZlKXk-_eOHboluCbtQ/s350/l_0AVfWMmZOjYzOpQlJg41GUDnGOYwifSqVT_TckS65ChbSzZ_vEH6L7j35Ex-hXyJ_QIA8L1qLOs7J1VnOCFcfZskgDsf8qbkzoZWg3GF7Iu9GWfB2ERw17F_u6HtrQgCFWf7yTIQ_A7UlHSBctsVRaOVgQOgtnit2eTHEuoBfw5drEGyHRijiC4w.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="264" data-original-width="350" height="241" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaPDhYN5zxznZytGaeYZo4ymU89BdLJrOHzkJJ4sE6QRy5tfzrpczSOoDyNU8JFLc9lEUDfwCkZJaHMDLQRO8CiwSxbVeodwBeCjWjBfFKtH2h_piY-P4JZ0avNtvxqljCbEVKzomCzM-FsWuTy0GKXaRePfccxVYwkyHw63YRZlKXk-_eOHboluCbtQ/s320/l_0AVfWMmZOjYzOpQlJg41GUDnGOYwifSqVT_TckS65ChbSzZ_vEH6L7j35Ex-hXyJ_QIA8L1qLOs7J1VnOCFcfZskgDsf8qbkzoZWg3GF7Iu9GWfB2ERw17F_u6HtrQgCFWf7yTIQ_A7UlHSBctsVRaOVgQOgtnit2eTHEuoBfw5drEGyHRijiC4w.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: paperclip maximisation, now with a fun game attached!</i></p></td></tr></tbody></table><p></p><p>The key lesson to draw from instrumental convergence is that, even if nobody ever deliberately deploys an AGI with a really bad reward function, the AGI is still likely to develop goals which will be bad for humans by default, in service of its actual goal.</p><h4>Interlude - why goals?</h4><table align="center" cellpadding="0" 
cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhNNYkP_LfxpthWZPkegMhhT4P-sr_L-85-YwMWO2GE9ZoWyqpBGa3e2gIUQq74h9HKQ0TQr6W4XgHOYzLligh6pGfxgvUgc81nEG0dHP7nLJH3pen8PD8A60hScwybwvdrROTxEJwJW7pi7Ndm-WlyRg_R9M13OE9cOOSPNgI5-tTii2jWBuHxZDmhQ/s1024/xB7ebp8yB0tTy3HVpFogzxe-wKK47KT6KCFXp2JMCoPJlqC8CegZB7ktfeqPS2ILr2yAsKN4CsCm2ZmaZmhLoqf2-2aIBZU1J1yUKTPyE1cwKNqDkxK0ZDbcdval-D2-Z0JwQEIrJVFuZ4MajncHNNWRNH9qzuj8zPhTOFLb6nXM7fDUQ353Gl8g-w.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhNNYkP_LfxpthWZPkegMhhT4P-sr_L-85-YwMWO2GE9ZoWyqpBGa3e2gIUQq74h9HKQ0TQr6W4XgHOYzLligh6pGfxgvUgc81nEG0dHP7nLJH3pen8PD8A60hScwybwvdrROTxEJwJW7pi7Ndm-WlyRg_R9M13OE9cOOSPNgI5-tTii2jWBuHxZDmhQ/w400-h400/xB7ebp8yB0tTy3HVpFogzxe-wKK47KT6KCFXp2JMCoPJlqC8CegZB7ktfeqPS2ILr2yAsKN4CsCm2ZmaZmhLoqf2-2aIBZU1J1yUKTPyE1cwKNqDkxK0ZDbcdval-D2-Z0JwQEIrJVFuZ4MajncHNNWRNH9qzuj8zPhTOFLb6nXM7fDUQ353Gl8g-w.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: DALL-E image from the prompt "Artist's depiction of a robot throwing a dart at a target, digital art, getting a bullseye, trending on artstation"</i></p></td></tr></tbody></table><p>Having read the previous section, your initial reaction may well be something like this:</p><blockquote><p> <i>“Okay, so powerful AGIs with goals that don’t line up perfectly with ours might spell bad news, but why should AI systems have goals at all? Google Maps is a pretty useful ML system but it doesn’t have ‘goals’, I just type my address in and hit enter. Why won’t future AI be like this?”</i></p></blockquote><p>There are many different responses you could have to this line of argument. 
One simple response is based on ideas of economic competitiveness, and comes from <a href="https://www.gwern.net/Tool-AI">Gwern (2016)</a>. It runs something like this:</p><blockquote><p>AIs that behave like agents (i.e. taking actions in order to achieve their goals) will be more economically competitive than “tool AIs” (like Google Maps), for two reasons. First, they will by definition be better at <b>taking actions</b>. Second, they will be superior at <b>inference and learning</b> (since they will be able to repurpose the algorithms used to choose actions to improve themselves in various ways). For example, agentic systems could take actions such as improving their own training efficiency, or gathering more data, or making use of external resources such as long-term memories, all in service of achieving their goals.</p><p>If agents are more competitive, then any AI researchers who don’t design agents will be outcompeted by ones that do.</p></blockquote><p>There are other perspectives you could take here. For instance, Eliezer Yudkowsky has written extensively about “expected utility maximisation” as a formalisation of how rational agents might behave. Several mathematical theorems all point to the same idea: <i>“any agent not behaving like an expected utility maximiser will be systematically making stupid mistakes and getting taken advantage of”</i>. So if we expect AI systems to <i>not</i> be making stupid mistakes and getting taken advantage of by humans, then it makes sense to describe them as having the ‘goal’ of maximising expected utility, because that’s how their behaviour will seem to us.</p><p>Although these arguments may seem convincing, the truth is there are many questions about goals and agency which remain unanswered, and we honestly just don’t know what AI systems of the future will look like. It’s possible they will look like expected utility maximisers, but this is far from certain. 
For instance, Eric Drexler's technical report <a href="https://www.fhi.ox.ac.uk/wp-content/uploads/Reframing_Superintelligence_FHI-TR-2019-1.1-1.pdf?asd=sa">Reframing Superintelligence: Comprehensive AI Services as General Intelligence (CAIS)</a> paints a different picture of the future, where we create systems of AIs interacting with each other and collectively providing a variety of services to humans. However, even scenarios like this could threaten humanity’s ability to keep steering its own future (as we will see in later sections).</p><p>Additionally, new paradigms are being developed. One of the newest, published barely one week ago, <a href="https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators">analysed certain types of AI models like GPT-3 (a large language model) through the lens of "simulators"</a>. On this view, modern language models like GPT-3 may be best thought of as trying to simulate the continuation of a piece of English text, in the same way that a physics simulation evolves an initial state by applying the laws of physics. It doesn't make sense to describe the simulators themselves through the lens of agents, but they can simulate agents as subsystems. Even with a present-day model like GPT-3, if you prompt it in a way that places it in the context of making a plan to carry out a goal, it will do a decent job of producing one. 
Future work will no doubt explore the risk landscape from this perspective, and time will tell how well these frameworks match up with actual progression in ML.</p><h4>Inner and outer misalignment</h4><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTHionB-1S0pYghuaqAaFplomOba99DnL79luwh6zataow1SMbu7VTkwLa7TL_I66oeRu6MRrPPAyVFpF3uomtH_luZUlp220WQnD_8tUQYlFWj1uhDi9sw4vfwGkjIWR7eTMORP8f9bJinQMKhRzTNtMhWdm3mFhp0G5dk_sxo3abaOvP5yvtb_Dypw/s1024/Zti_yGXmLuikLxE-3nVAMSW-fcXBbHUbp8KlHgIf_FEiz_cRtnkcfxg9mEnMhmADknRxrL49j2GOYi4lwF1UDEzU80cuIwS6Qsrjm01IQhwxowa6I9jB6d2kqimn4UqHJd1YKqzdExJajZ5UQap9tVnLfFfPChqgDg7qLZTNYCIgYGA0HoGmk7Y9sQ.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTHionB-1S0pYghuaqAaFplomOba99DnL79luwh6zataow1SMbu7VTkwLa7TL_I66oeRu6MRrPPAyVFpF3uomtH_luZUlp220WQnD_8tUQYlFWj1uhDi9sw4vfwGkjIWR7eTMORP8f9bJinQMKhRzTNtMhWdm3mFhp0G5dk_sxo3abaOvP5yvtb_Dypw/w400-h400/Zti_yGXmLuikLxE-3nVAMSW-fcXBbHUbp8KlHgIf_FEiz_cRtnkcfxg9mEnMhmADknRxrL49j2GOYi4lwF1UDEzU80cuIwS6Qsrjm01IQhwxowa6I9jB6d2kqimn4UqHJd1YKqzdExJajZ5UQap9tVnLfFfPChqgDg7qLZTNYCIgYGA0HoGmk7Y9sQ.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: AI agents with inner misalignment were at one point called “optimisation daemons”. DALL-E did not quite successfully depict the description "Two arguments between an angel and a devil, one inside a circle and one on the outside, painting".</i></p></td></tr></tbody></table> <p>As discussed in the first section, the central paradigm of modern ML is that we train systems to perform well on a certain reward function. 
For instance, we might train an image classifier by giving it a large number of labelled images of digits. Every time it gets an image wrong, gradient descent is used to update the system incrementally in the direction that would have been required to give a correct answer. Eventually, the system learns to classify basically all images correctly.</p><p>There are two broad families of ways techniques like this can fail. The first is when our reward function fails to fully express the true preferences of the programmer - we refer to this as <b>outer misalignment</b>. The second is when the AI learns a different set of goals than those specified by the reward function, but which happen to coincide with the reward function during training - this is <b>inner misalignment</b>. We will now discuss each of these in turn.</p><h5>Outer misalignment</h5><p>Outer misalignment is perhaps the simpler concept to understand, because we encounter it all the time in everyday life, in a form called <b>Goodhart’s law</b>. In its best-known form, this law states:</p><blockquote><p><i>When a measure becomes a target, it ceases to be a good measure.</i></p></blockquote><p>Perhaps the most famous case comes from Soviet nail factories, which produced nails based on targets that they had been given by the central government. When a factory was given targets based on the total <i>number</i> of nails produced, it ended up producing a massive number of tiny nails which couldn’t function properly. 
On the other hand, when the targets were based on the total <i>weight</i> produced, the nails would end up huge and bulky, and equally impractical.</p><p><i></i></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKR3NTFpd6w6hf9vI4jXj4XhJJVoMTOfZYaagSYECIq8KJO-FBtkiVZiGEA6rNtwNr16y9VXHm_8tE14964HbK5oe3xrJwZZHqo3sb8O8QjE12pstoDfVLuLe09ZS3CX2kIAebTdEGE2mq7K8KWek7OCpc3zIPbtpN2R3mII3uGf1hikjo-Ln3SHi9CQ/s1600/L6YvMV39zgHDI1-QPHvJs8E72fkQ1KhaCKxCW2oRAsr72CQVgUyvn-bgjk2Rj2msWVdB0rcTkHA2ZOMLzmtDvCcQmvyesvZ0l2YEFghRoglPZI0hIv-SFtYrGMqW-yok8knx3ttZbMo4yE0IsvE6oPbjEJEhTXXkC3jf7KT7Ss5UuXGVez908uTu-A.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="649" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKR3NTFpd6w6hf9vI4jXj4XhJJVoMTOfZYaagSYECIq8KJO-FBtkiVZiGEA6rNtwNr16y9VXHm_8tE14964HbK5oe3xrJwZZHqo3sb8O8QjE12pstoDfVLuLe09ZS3CX2kIAebTdEGE2mq7K8KWek7OCpc3zIPbtpN2R3mII3uGf1hikjo-Ln3SHi9CQ/w260-h640/L6YvMV39zgHDI1-QPHvJs8E72fkQ1KhaCKxCW2oRAsr72CQVgUyvn-bgjk2Rj2msWVdB0rcTkHA2ZOMLzmtDvCcQmvyesvZ0l2YEFghRoglPZI0hIv-SFtYrGMqW-yok8knx3ttZbMo4yE0IsvE6oPbjEJEhTXXkC3jf7KT7Ss5UuXGVez908uTu-A.png" width="260" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: an old Soviet cartoon</i></p></td></tr></tbody></table> <p></p><p>A more recent example comes from the COVID-19 pandemic, where a plasma donation centre offered COVID-sufferers a larger cash reward than healthy individuals. As a result, people would deliberately infect themselves with COVID-19 in order to get a larger cash reward. Examples like this could fill up an entire book, but hopefully at this point you get the message! 
</p><p>In the case of machine learning, we are trying to use the reward function to capture the thing we care about, but we are also using this function to train the AI - hence, Goodhart. The cases of <b>specification gaming</b> discussed above are perfect examples of this phenomenon in action - the AIs found ways of “giving the programmers exactly what they asked for”, but in a way which violated the programmers’ original intention. Some of these examples are quite unexpected, and a human would probably never have discovered them just from thinking about the problem. As AIs get more intelligent and are given progressively more complicated tasks, we can expect this problem to get progressively worse, because:</p><ul><li>With greater intelligence comes the invention of more powerful solutions.</li><li>With greater task complexity, it becomes harder to pin down exactly what you want.</li> </ul><p>We should also strongly expect that AIs will be deployed in the real world, and given tasks of real consequence, simply for reasons of economic competitiveness. So any specification gaming failures will be significantly less benign than a <a href="https://openai.com/blog/faulty-reward-functions/">digital boat going around in circles</a>. </p><h5>Inner misalignment</h5><p>The other failure mode, <b>inner misalignment</b>, describes the situation when an AI system learns a different goal than the one you specified. The name comes from the fact that this is an internal property of the AI, rather than a property of the relationship between the AI and the programmers – here, the programmers don’t enter into the picture.</p><p>The classic example here is human evolution. We can analogise evolution to a machine learning training scheme, where humans are the system being trained, and the reward function is “surviving and reproducing”. Evolution gave us* certain drives, which reliably increased our odds of survival in the ancestral environment. 
For instance, we developed a drive for sugar (which led us to seek out calorie-dense foods that supplied us with energy), and a drive for sex (which leads to more offspring to pass our genetic code on to). The key point is that these drives are intrinsic, in the sense that humans want these things regardless of whether or not a particular dessert or sex act actually contributes to reproductive fitness. Humans have now moved “off distribution”, into a world where these things are no longer correlated with reproductive fitness, yet we continue wanting them and prioritising them over reproductive fitness. Evolution failed at imparting its goal to humans, since humans have their own goals that they shoot for instead when given a chance.</p><p>(*Anthropomorphising evolution in language can be dangerously misleading, and should just be seen as a shorthand here.)</p><p>A core reason why we should expect inner misalignment - that is, cases where an optimisation process creates a system that has goals different from the original optimisation process - is that it seems very easy for this to happen. It was much easier for evolution to give humans drives like “run after sweet things” and “run after appealing partners” than to give humans an instinctive understanding of genetic fitness. Likewise, an ML system being optimised to do the types of things that humans want may not end up internalising what human values are (or even what the goal of a particular job is), but instead some correlated but imperfect proxy, like “do what my designers/managers would rate highly”, where “rate highly” might include “rate highly despite being coerced into it”, among a million other failure modes. 
A silly equivalent of “humans inventing condoms” for an advanced AI might look something like “freeze all human faces into a permanent smile so that it looks like they’re all happy”. In the same way that the human drive to have sex does not extend down to the level of actually having offspring, an AI’s drive to do something related to human wellbeing might not extend down to the level of actually making humans happy, but only to something that (in the training environment at least) is correlated with happy humans. What we’re trying to point to here is not any one of these specific failure modes - we don’t think any single one of these is actually likely to happen - but rather the <i>type</i> of failure that these are examples of.</p><p>This type of failure mode is not without precedent in current ML systems (although there are fewer examples than for specification gaming). The 2021 paper <a href="https://www.deepmind.com/publications/objective-robustness-in-deep-reinforcement-learning">Objective Robustness in Deep Reinforcement Learning</a> showcases some examples of inner alignment failures. In one example, the authors trained an agent to fetch a coin in the CoinRun environment (pictured below). The catch was that all the training environments had the coin placed at the end of the level, on the far right of the map. So when the system was trained, it actually learned the task “go to the right of the map” rather than “pick up the coin” - and we know this because when the system was deployed on maps where the coin was placed in a random location, it would reliably go to the right-hand edge rather than fetch the coin. A key distinction is worth mentioning here: this is a failure of the agents’ <b>objective</b>, rather than their <b>capabilities</b>. 
They are learning useful skills like how to jump and run past obstacles - it’s just that those skills are being used in service of the wrong objective.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSPH1Dki7fjPCAsy3UTqs-p2GtTjXIws8ikNKJdrujWvW51zBA7L6AspdTn1HMRNHZkInVL1hNTwSD-RcS-UO7ieR6iVHUDkCxFtmpcsBAWIJC0-Gz3NGpxy5mNVzUExht5YnJd67oHroH8edsm_GxU3oDEQgcvUJZ9-Qq2zJvafLvsah_rIkcM972Fg/s272/dPK2Z81oQLnDBmXoCnligA2M3VT0kwuD6VCcLz5m5PkqfdAZULp332Ae9gXPf4zHFtGHKjKed1O1WOqZUJIUahcx7w2q4DtuxtG-Vbd4iqiKiVuMV-H46xfxkOQd41W716tb9ItBcWO_Iy5ZgIDfG6VKOSl-sIrGjqgT9Df3GPxc-NuBIEGWYkaQnQ.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="185" data-original-width="272" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSPH1Dki7fjPCAsy3UTqs-p2GtTjXIws8ikNKJdrujWvW51zBA7L6AspdTn1HMRNHZkInVL1hNTwSD-RcS-UO7ieR6iVHUDkCxFtmpcsBAWIJC0-Gz3NGpxy5mNVzUExht5YnJd67oHroH8edsm_GxU3oDEQgcvUJZ9-Qq2zJvafLvsah_rIkcM972Fg/s1600/dPK2Z81oQLnDBmXoCnligA2M3VT0kwuD6VCcLz5m5PkqfdAZULp332Ae9gXPf4zHFtGHKjKed1O1WOqZUJIUahcx7w2q4DtuxtG-Vbd4iqiKiVuMV-H46xfxkOQd41W716tb9ItBcWO_Iy5ZgIDfG6VKOSl-sIrGjqgT9Df3GPxc-NuBIEGWYkaQnQ.png" width="272" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: the CoinRun environment.</i></p></td></tr></tbody></table><p>So, how bad can inner misalignment get? A particularly concerning scenario is <b>deceptive alignment</b>. This is when the agent learns it is inside a training scheme, discovers what the base objective is, but has already acquired a different goal. In this case, the system might reason that a failure to achieve the base objective when training will result in it being modified, and not being able to achieve its actual goal. 
Thus, the agent will pretend to act aligned, until it thinks it’s too powerful for humans to resist, at which point it will pursue its actual goal without the threat of modification. This scenario is highly speculative, and there are many aspects of it which we are still uncertain about, but if it is possible then it would be perhaps the most worrying of all possible alignment failures. This is because a deceptively aligned agent would have incentives to act against its programmers, but also to keep these incentives hidden until it expects human opposition to be ineffectual.</p><p>It’s worth mentioning that this inner / outer alignment decomposition isn’t a perfect way to carve up the space of possible alignment failures. For instance, for most non-trivial reward functions, the AI will probably be very far away from perfect performance on them. So it’s not exactly clear what we mean by a statement like “the AI is perfectly aligned with the reward function we trained it on”. Additionally, the idea of inner optimisation is built around the concept of a “mesa-optimiser”, which is basically a learned model that itself performs optimisation (just like humans were trained by evolution, but we ourselves are optimisers since we can use our brains to search over possible plans and find ones which meet our objectives). The problem here is that it’s not clear what it actually means to be an optimiser, and how we would determine whether an AI is one. 
This being said, the inner / outer alignment distinction is still a useful conceptual tool when discussing ways AI systems can fail to do what we intend.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVFg7jZPH7429MqUzkjWgPQbhWVuMO2ldDIRrZS2rd3Fy8aBIRMCNnRp5ibGJq3smc9kdoGIkOJHziETEkZv3M_Q6p5IqlZKCZqERPs7k3bRHsirKraqo7-8OWkLtvfQMViU4LiKEq2ROzbfPhuHZS82ElFjaKMnyeEca-FHTCYyZ4Khy9kmYJNz8Onw/s1600/PPH33X6xlwOPj_0YtD3BjyeCHarxJ7sjgxXbCZaSGFxLAJV_7-ulhj0tqPfUhmLjgCzM-hEy9X3zXJHHNpz2Y__is6pP1T3WkHsinUBFRdj5bYtzalUtU3DqHYhPjuT9Dff4QFo5NkG1hXKq-ghdCSZRkf7iCz_pDbgZ3CEwXW3vkTxIK4M0QcZtNw.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="984" data-original-width="1600" height="394" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVFg7jZPH7429MqUzkjWgPQbhWVuMO2ldDIRrZS2rd3Fy8aBIRMCNnRp5ibGJq3smc9kdoGIkOJHziETEkZv3M_Q6p5IqlZKCZqERPs7k3bRHsirKraqo7-8OWkLtvfQMViU4LiKEq2ROzbfPhuHZS82ElFjaKMnyeEca-FHTCYyZ4Khy9kmYJNz8Onw/w640-h394/PPH33X6xlwOPj_0YtD3BjyeCHarxJ7sjgxXbCZaSGFxLAJV_7-ulhj0tqPfUhmLjgCzM-hEy9X3zXJHHNpz2Y__is6pP1T3WkHsinUBFRdj5bYtzalUtU3DqHYhPjuT9Dff4QFo5NkG1hXKq-ghdCSZRkf7iCz_pDbgZ3CEwXW3vkTxIK4M0QcZtNw.png" width="640" /></a></div><h3>Misaligned AGI could overpower humanity</h3><blockquote><p> <i>The best answer to the question, "Will computers ever be as smart as humans?” is probably “Yes, but only briefly.”</i></p><p>-<i>Vernor Vinge</i></p></blockquote><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZoOysowtrB8FLMtCQ9hhh3H34eH8pRMetMIMNLrAww4elE2-u3FbvSTtqoy_tC0wh4VLMMJE-7Vp8aHVFPcH7xxxN0PMxuOi2F1l-T_PaIzQleCCoqfyjD3b6qta-CkHyqwvA_-Ygfnm63qnqYMot8T3C-hqZPNg78SJ7imjFIlyZ6uWOhEB9V6qOSw/s1024/WE1TOIxhoaYkTXIkM_CCvP15KJWCX2ycK_jW1Lt9uddbjy-IoZINMCbijY75vY1VJal2SzaA2ERv6USRFbQqcfri5fZgFBqk05OYZKEPJXNAMGvKFyoC6Dn6A6AGl_J7dnMWQmTKadTXRQrp90hx9lQ09_6rxxWfzqzYoWHfaxQ2V32ndFAsIVMTmQ.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZoOysowtrB8FLMtCQ9hhh3H34eH8pRMetMIMNLrAww4elE2-u3FbvSTtqoy_tC0wh4VLMMJE-7Vp8aHVFPcH7xxxN0PMxuOi2F1l-T_PaIzQleCCoqfyjD3b6qta-CkHyqwvA_-Ygfnm63qnqYMot8T3C-hqZPNg78SJ7imjFIlyZ6uWOhEB9V6qOSw/w400-h400/WE1TOIxhoaYkTXIkM_CCvP15KJWCX2ycK_jW1Lt9uddbjy-IoZINMCbijY75vY1VJal2SzaA2ERv6USRFbQqcfri5fZgFBqk05OYZKEPJXNAMGvKFyoC6Dn6A6AGl_J7dnMWQmTKadTXRQrp90hx9lQ09_6rxxWfzqzYoWHfaxQ2V32ndFAsIVMTmQ.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: DALL-E's drawing of "Digital art of two earths colliding"</i></p></td></tr></tbody></table> <p>Suppose one day, we became aware of the existence of a “twin earth” - similar to our own in several ways, but with a few notable differences. Call this “Earth 2”. The population was smaller (maybe just 10% of the population of our earth), and the people were less intelligent (maybe an average IQ of 60, rather than 100). Suppose we could only interact with this twin earth using their version of the internet. Finally, suppose we had some reason for wanting to overthrow them and gain control of their civilization, e.g. we had decided their goals weren’t compatible with a good future for humans. How could we go about taking over their world?</p><p>At first, it might seem like our strategies are limited, since we can only use the internet. 
But there are many strategies still open to us. The first thing we would do is try to gather resources. We could do this illegally (e.g. by discovering people’s secrets via social engineering and performing blackmail), but legal options would probably be more effective. Since we are smarter, the citizens of Earth 2 would be incentivised to employ us, e.g. to make money using quantitative finance, or to research and develop advanced weaponry or other technologies. If the governments of Earth 2 tried to pass regulations limiting the amount or type of work we could do for them, there would be an incentive to evade these regulations, because anyone who did could make more profit. Once we’d amassed resources, we would be able to bribe members of Earth 2 into taking actions that would allow us to further spread our influence. We could infiltrate computer systems across the world, planting backdoors and viruses using our superior cybersecurity skills. Little by little, we would learn more about their culture and their weaknesses, presenting a front of cooperation until we had amassed enough resources and influence for a full takeover. </p><p><i>Wouldn’t the citizens of Earth 2 see this coming?</i> There’s a chance that we would manage to be sufficiently sneaky. But even if some people realised, it would probably take a coordinated and expensive global effort to resist. Consider our poor track record with climate change (a far better-documented, better-understood, and more gradually-worsening phenomenon), and in coordinating a global response to COVID-19.</p><p><i>Couldn’t they just “destroy us” by removing our connection to their world?</i> In theory, perhaps, but this would be very unlikely in practice, since it would require them to rip out a great deal of their own civilisational plumbing. Imagine how hard it would be for us to remove the internet from our own society, or even a more recent and less essential technology such as blockchain. 
Consider also how easy it can be for an adversary with better programming ability to hide features in computer systems.</p><p>—</p><p>As you’ve probably guessed at this point, the thought experiment above is meant to be an analogy for the feasibility of AIs taking over our own society. They would have no physical bodies, but would have several advantages over us which are analogous to the ones described above. Some of these are:</p><ul><li><b>Cognitive advantage</b>. Human brains use approximately 86 billion neurons, and send signals at 50 metres per second. These hard limits come from brain volume and metabolic constraints. AIs would have no such limits, since they can easily scale (GPT-3 has 175 billion parameters, though you shouldn’t directly equate parameter and neuron count*), and can send signals at close to the speed of light. (*For a more detailed discussion of this point, see <a href="https://www.openphilanthropy.org/research/new-report-on-how-much-computational-power-it-takes-to-match-the-human-brain/">Joseph Carlsmith’s report</a> on the computational power of the human brain.)</li><li><b>Numerical advantage</b>. AIs would have the ability to copy themselves at a much lower time and resource cost than humans; it’s as easy as finding new hardware. Right now, the way ML systems work is that training is much more expensive than running, so if you have the compute to train a single system, you have the compute to run thousands of copies of that system once the training is finished.</li><li><b>Rationality</b>. Humans often act in ways which are not in line with our goals, when the instinctive part of our brains gets in the way of the rational, planning part. 
Current ML systems are also weakened by relying on a sort of associative/inductive/biased/intuitive/fuzzy thinking, but it is likely that sufficiently advanced AIs could carry out rational reasoning better than humans (and therefore, for example, come to the correct conclusions from fewer data points, and be less likely to make mistakes).</li><li><b>Specialised cognition.</b> Humans are equipped with general intelligence, and perhaps some specialised “hardware accelerators” (to use computer terminology) for domains like social reasoning and geometric intuition. Perhaps human abilities in, say, physics or programming are significantly bottlenecked by the fact that we don’t have specialised brain modules for those purposes, and AIs that have cognitive modules designed specifically for such tasks (or could design them themselves) might have massive advantages, even on top of any generic speed-boost they gain from having their general intelligence algorithms running at a faster speed than ours.</li><li><b>Coordination</b>. As the recent COVID-19 pandemic has illustrated, even when the goals are obvious and most well-informed individuals could find the best course of action, we lack the ability to globally coordinate. While AI systems might or might not have incentives or inclinations to coordinate, if they do, they have access to tools that humans don’t, including firmer and more credible commitments (e.g. by modifying their own source code) and greater bandwidth and fidelity of communication (e.g. they can communicate at digital speeds, and using not just words but potentially by directly sending information about the computations they’re carrying out).</li> </ul><p>It’s worth emphasising here, the main concern comes from AIs with misaligned goals acting against humanity, not from humanity misusing AIs. The latter is certainly cause for major concern, but it’s a different kind of risk to the one we’re talking about here. 
</p><p> </p><p><b>Summary of this section:</b></p><p>AI researchers in general expect >50% chance of AGI in the next few decades.</p><p>The <i>Orthogonality Thesis</i> states that, in principle, intelligence can be combined with more or less any final goal, and sufficiently intelligent systems do not automatically converge on human values. The <i>Instrumental Convergence</i> thesis states that, for most goals, there are certain instrumental goals that are very likely to help with the final goal (e.g. survival, preservation of its current goals, acquiring more resources and cognitive ability).</p><p>Inner and outer alignment are two different possible ways AIs might form goals which are misaligned with the intended goals.</p><p><i>Outer misalignment</i> happens when the reward function we use to train the AI doesn’t exactly match the programmer’s intention. In the real world, we commonly see a version of this called Goodhart’s law, often phrased as “when a measure becomes a target, it ceases to be a good measure [because of over-optimisation for the measure, over the thing it was supposed to be a measure of]”.</p><p><i>Inner misalignment</i> is when the AI learns a different goal to the one specified by the reward function. A key analogy is with human evolution – humans were “trained” on the reward function of genetic fitness, but instead of learning that goal, we learned a bunch of different goals like “eat sugary things” and “have sex”. 
A particularly worrying scenario here is deceptive alignment, when an AI learns that its goal is different from the one its programmers intended, and learns to conceal its true goal in order to avoid modification (until it is strong enough that human opposition is likely to be ineffectual).</p><h4>Failure modes</h4><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjh-7AZOqNY7xGxELN4GxynWcd291hsrNN5Tlvra7H2BEmOFHdWRJe28Vyt412Kzk8kgxduGjySeS-nDJagmrtvSVtfM3hiEEBihI1j59FuvLjfgrh32jGJImcI7TpulPYa2yJEe5trmufCfPY-hAB8NSkgIDsnSUpJ8wiOjJPb-H9IZPwmOcslRnbrlg/s1024/44lppCCb7Hrj2XFMJUnAYb0u1afPYsTx-x6ZgEgmvyRJWQLYPZmQdgiVZqMs1ICb0XzBLH09UDuvHfK55KB8Pe74akvxgqw4YVal33yF2vPpwpksmKkVQeh4eqTZpFdgwa9ywIZNZ76nWH10hA15Rd6xTaPeSbPwRqy7hMZSgx5eXkW9AB9xQtFUZw.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1024" data-original-width="1024" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjh-7AZOqNY7xGxELN4GxynWcd291hsrNN5Tlvra7H2BEmOFHdWRJe28Vyt412Kzk8kgxduGjySeS-nDJagmrtvSVtfM3hiEEBihI1j59FuvLjfgrh32jGJImcI7TpulPYa2yJEe5trmufCfPY-hAB8NSkgIDsnSUpJ8wiOjJPb-H9IZPwmOcslRnbrlg/w400-h400/44lppCCb7Hrj2XFMJUnAYb0u1afPYsTx-x6ZgEgmvyRJWQLYPZmQdgiVZqMs1ICb0XzBLH09UDuvHfK55KB8Pe74akvxgqw4YVal33yF2vPpwpksmKkVQeh4eqTZpFdgwa9ywIZNZ76nWH10hA15Rd6xTaPeSbPwRqy7hMZSgx5eXkW9AB9xQtFUZw.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><i>Above: DALL-E really seems to have a natural talent at depicting "The earth is on fire, artificial intelligence has taken over, robots rule the world and suppress humans, digital art, artstation"</i>.</p></td></tr></tbody></table> <p>But what, concretely, might an AI-related catastrophe look like?</p><p>AI catastrophe scenarios sound like something strongly out of science 
fiction. However, we can immediately discount a few common features of sci-fi AI takeovers. First, time travel. Second, armies of humanoid killer robots. Third, the AI acting out of hatred for humanity, or out of bearing a grudge, or because it hates our freedom, or because it has suddenly acquired “consciousness” or “free will”, or - as Steven Pinker <a href="https://scottaaronson.blog/?p=6524">likes to put it</a> - because it has developed an “alpha-male lust for domination”.</p><p>Remember instead the key points from above about how an AI’s goals might become dangerous: by achieving exactly what we tell it to do <i>too well</i> in a clever letter-but-not-spirit-of-the-law way, by having a goal that in most cases is the same as the goal we intend for it to have but which diverges in some cases we don’t think to check for, or by having an unrelated goal but still achieving good performance on the training task because it learns that doing well on the training tasks is instrumentally good. None of these reasons has anything to do with the AI developing megalomania, let alone the philosophy of consciousness; they are instead the types of technical failures that you’d expect from an optimisation process. As discussed above, we already see weaker versions of such failures in modern ML systems.</p><p>It is very uncertain which exact type of AI catastrophe we are most likely to see. We’ll start by discussing the flashiest kind: an AI “takeover” or “coup” where some AI system finds a way to quickly and illicitly take control over a significant fraction of global power. This may sound absurd. Then again, we already have ML systems that learn to crash or hack the game-worlds they’re in for their own benefit. Eventually, perhaps in the next decade, we should expect to have ML systems doing important and useful work in real-world settings. 
Perhaps they’ll be trading stocks, or writing business reports, or managing inventories, or advising decision-makers, or even being the decision-makers. Unless there is (1) some big surprise waiting in how scaled-up ML systems work, (2) an advance in AI alignment research, or (3) a miracle, the default outcome seems to be that such systems will try to “hack” the real world in the same way that their more primitive cousins today use clever hacks in digital worlds. Of course, the capabilities of the systems would have to advance a lot for them to be civilisational threats. However, rapid capability advancement has held for the past decade and we have solid theoretical reasons (including the scaling laws mentioned above) to expect it to continue holding. Remember also the cognitive advantages mentioned in the previous section.</p><p>As for how such a takeover proceeds, it might happen at a speed that is more digital than physical - for example, if the AI’s main lever of power is hacking into digital infrastructure, it might have achieved decisive control before anyone even realises. As discussed above, whether or not the AI has access to much direct physical power seems mostly irrelevant.</p><p>Another failure mode, thought to be significantly more likely than the direct AI takeover scenario by leading AI safety researcher Paul Christiano, is one that he calls <a href="https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like">“going out with a whimper”</a>. Look at all the metrics we currently try to steer the world with: companies try to maximise profit, politicians try to maximise votes, economists try to maximise metrics like GDP and employment. Each of these is a proxy for what we want: a profitable company is one that has a lot of customers willing to pay money for their products; a popular politician has a lot of people thinking they’re great; maximising GDP generally correlates with people being wealthier and happier. 
However, none of these metrics or incentive systems really gets to the heart of what we care about, and so it is possible to find (and in the real world we often observe) cases where profitable companies and popular politicians are pursuing destructive goals, or where GDP growth is not actually contributing to people’s quality of life. These are all cases of Goodhart’s law, as discussed above.</p><figure><table><thead><tr><th><b>Hard-to-measure</b></th><th><b>Easy-to-measure</b></th><th><b>Consequence</b></th></tr></thead><tbody><tr><td>Helping me figure out what's true</td><td>Persuading me</td><td>Crafting persuasive lies</td></tr><tr><td>Preventing crime</td><td>Preventing reported crime</td><td>Suppressing complaints</td></tr><tr><td>Providing value to society</td><td>Profit</td><td>Regulatory capture, underpaying workers</td></tr></tbody></table></figure><p>What ML gives us is a very general and increasingly powerful way of developing a system that does well at pushing some metric upwards. A society where more and more capable ML systems are doing more and more real-world tasks will be a society that is going to get increasingly good at pushing metrics upwards. This is likely to result in visible gains in efficiency and wealth. As a result, competitive pressures will make it very hard for companies and other institutions to say no: if Acme Motors Company started performing 15% better after outsourcing their CFO’s decision-making to an AI, General Systems Inc will be very tempted to replace their CEO with an AI (or maybe the CEO will themselves start consulting an AI for more and more decisions, until their main job is interfacing with an AI).</p><p>In the long run, a significant fraction of work and decision-making may well be offloaded to AI systems, and at that point change might be very difficult. Currently our most fearsome incentive systems like capitalism and democracy still run on the backs of the constituent humans. 
If tomorrow all humans decided to overthrow the government, or abolish capitalism, they would succeed. But once the key decisions that perpetuate major social incentive systems are no longer made by persuadable humans, but instead automatically implemented by computer systems, change might become very difficult.</p><p>Since our metrics are flawed, the long-term outcome is likely to be less than ideal. You can try to imagine what a society run by clever AI systems trained to optimise purely for their company’s profit looks like. Or a world of media giants run by AIs which spin increasingly convincing false narratives about the state of the world, designed to make us <i>feel</i> more informed rather than actually telling us the truth.</p><p>Remember also, as discussed previously, that there are solid reasons to think that influence-seeking and deceptive behaviours are likely in sufficiently-powerful AI systems. If the ML systems that increasingly run important institutions exhibit such behaviour, then the above “going out with a whimper” scenario might acquire extra nastiness and speed. This is something Paul Christiano explores in the <a href="https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like">same article</a> linked above.</p><p>A popular misconception about AI risk is that the arguments for doing something are based on a tiny risk of giant catastrophe. The giant catastrophe part is correct. The minuscule risk part, as best as anyone in the field can tell, is not. As mentioned above, the average ML researcher - generally an engineering-minded person not prone to grandiose futuristic speculation - gives <a href="https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/">a 5% chance of civilisation-ending disaster from AI</a>. 
The ML researchers who grapple with the safety issues as part of their job are clearly not an unbiased randomly-selected sample, but generally give numbers in the 5-50% range, and some (in our opinion, overly alarmist) people think it’s over 90%. As the above arguments hopefully emphasise, some type of catastrophe seems like the <i>default outcome</i> from the types of AI advances that we are likely to encounter in the coming decades, and the main reason for thinking we won’t see one is the (justifiable but uncertain) hope that someone somewhere invents solutions.</p><p>It might seem forced or cliche that AI risk scenarios so frequently end with something like “and then the humans no longer have control of their future and the future is dark” or even “and then everyone literally dies”. But consider the type of event that AGI represents and the available comparisons. The computer revolution reshaped the world in a few decades by giving us machines that can do a <i>narrow</i> range of intellectual tasks. The industrial revolution let us automate large parts of <i>manual</i> labour, and also set the world off on an unprecedented rate of economic growth and political change. The evolution of humans is plausibly the most important event in the planet’s history since at least the dinosaurs died out 66 million years ago, and it took on the exact form of “something smarter than anything else on the planet appeared, and now suddenly they’re firmly in charge of everything”.</p><p>AI is a big deal, and we need to get it right. 
How we might do so is the topic for <a href="https://www.strataoftheworld.com/2022/09/ai-risk-intro-2-solving-problem.html">part 2</a>.</p></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-11240725554792486992022-09-10T08:17:00.001+01:002022-09-10T08:17:37.606+01:00EA as a Schelling point<div style="text-align: center;"><b> </b><i><span style="font-size: x-small;">3.1k words (~9 minutes)</span></i><b><br /></b></div><p><b>Summary</b>: A significant way in which the EA community creates value is by acting as a <a href="https://en.wikipedia.org/wiki/Focal_point_(game_theory)">Schelling point</a> where talented, ambitious, and altruistic people tend to gather and can meet each other (in addition to more direct sources of EA value like identifying the most important problems and directly pushing people to work on them). It might be useful to think about what optimising for being a Schelling point looks like, and I list some vague thoughts on that.</p><hr /><p>A Schelling point, also known as a focal point, is what people decide on in the absence of communication, especially when it's important to coordinate by coming to the same answer.</p><p>The classic example is: you were arranging a meeting with a stranger in New York City by telephone, but you used the last minute of your phone credit and the line cut off after you had agreed on the date but not location or time - where do you meet? "Grand Central Station at noon" is an answer that other people may be especially likely to converge on.</p><p>(Schelling points can be thought of as a type of acausal negotiation.)</p><h2 id="when-the-schelling-point-is-the-selling-point">When the Schelling point is the selling point</h2><p>Schelling points are often extremely powerful and valuable. A key function of top universities is to be Schelling points for talented people. (Personally, I'd call it the most important function.) 
There are other valuable things too: courses that go deeper, the signalling value to employers, and so on. However, talented people generally have a preference for hanging out with other talented people, both for social reasons and to find collaborators for ambitious projects and future colleagues. At the same time, talented people are also generally spread out and present only at low densities. Top universities select hard on (some measures of) talent, and through this create environments with high talent density. A big chunk of the reason why people apply to top universities is that other people do so too, and I'd guess that even if the academic standards of Stanford, MIT, or Cambridge eroded significantly, the fact that they've established themselves as congregating points for smart people would keep people applying and visiting for a long time.</p><p>(Note that this is related to, but not equal to, the prestige and status of these places. It is possible to imagine Schelling points that are not prestigious. For example, my impression is that this described MIT at one point - it became a congregating point for uniquely ambitious STEM students and defence research before it achieved high academic status. It is also possible to imagine prestigious places that are not Schelling points, though this is a bit harder since anything with prestige becomes a Schelling point for high social status (though prestige Schelling points and talent Schelling points need not co-occur). More generally, since prestige is a thing many people care a lot about, there is a high correlation between a place being prestigious or high status and being a Schelling point for at least some type of person. 
However, the mechanisms are distinct - a person selecting their university based on status is selecting based on what they get to write on their CV, while a person selecting their university based on it being a Schelling point for smart people is selecting based on the fact that many other smart people that they can't coordinate with but would like to meet will also choose to go there.)</p><p>Another example is Silicon Valley. Sure, the area has many strengths - being rich and inside a large stable free market - but by far the greatest argument for living in Silicon Valley is that others also choose it. This leads to a (for now) unique combination of entrepreneurial people, great programmers, venture capitalists, and all the other types of people you need for a thriving tech business ecosystem, all there primarily because all the others are there too (how touching!). There's a lot of value in having everything in one place, and it would be very hard for all the different people who make up the value of Silicon Valley to coordinate to move to another place. 
That's why the Schelling point value of Silicon Valley is so enduring that people continue to tolerate large numbers of homeless drug addicts and sell kidneys to pay rent for years on end.</p><p>Note that a big part of the mechanism isn't that <i>specific</i> people you want to find are there, but that the <i>types of person</i> you'd want to find are likely to also be there, because both those people and yourself are likely to converge on the strategy of going there.</p><h2 id="schelling-ea">Schelling EA</h2><p>The Effective Altruism (EA) community provides a lot of value, for example:</p><ul><li>research into figuring out what are the most important problems to solve to maximise human flourishing;</li><li>research and concrete efforts into how to solve the most important problems discovered by the above;</li><li>high epistemic standards and truth-seeking discussion norms;</li><li>a uniquely wide-ranging and well-reasoned set of resources to help people pursue high-impact careers;</li><li>tens of billions of dollars in funding.</li> </ul><p>However, in addition to these, a very critical part of the value that EA provides is being a Schelling point for talented, ambitious, and altruistically-motivated people. </p><p>Even without EA, there would be researchers studying existential risks, animal welfare, and global poverty; people trying to assess charities; communities with high epistemic norms; and billionaires trying to use their fortunes for effective good. However, thanks to EA, people in each of these categories can go to the same Effective Altruism Global conference or quickly find people in local groups, and meet collaborators, co-founders, funders, and so on. 
A lot of the reason why this can happen is that if you hang out with a certain group of people or on the right websites, EA looms large.</p><p>The biggest <i>personal</i> source of value I've gotten from EA has been having a shortcut to meeting people very high in all of talent, ambition, and altruistic motivation.</p><p>Much of this is obvious - breaking news: communities bring people together and foster connections, more at 11 - but I think taking seriously just how much of counterfactual EA community impact comes from being a Schelling point leads to some less-obvious points about possible implications.</p><h2 id="implications">Implications</h2><p>The Schelling-point-based (and therefore necessarily incomplete) answer to "what is the EA community for?" might be something like "be an obvious Schelling point where relevant people gather, the chance of interactions that lead to useful work is maximised, and have a community and infrastructure that pushes work in the most useful direction possible". (This is in contrast to answers that emphasise e.g. directly increasing the number of people working on the most pressing problems.) (I will not argue for this being the best possible answer; my point is just that it is one possible answer, and an interesting one to examine further.)</p><p>If I were a Big Tech marketing consultant, I might call this "EA-as-a-platform".</p><p>What might maximising for such a Schelling point strategy look like?</p><h3 id="being-obvious">Being obvious</h3><p>A Schelling point is not a Schelling point unless it's obvious enough. For EA to be an effective Schelling point for talented/ambitious/altruistic people, those people must hear about it. Silicon Valley is obvious enough that entrepreneurial people from South Africa to Russia hear about it and decide it's where they want to be. To maximise its Schelling point value, EA should have world-spanning levels of recognition.</p><p>Note that recognition does not equal prestige or likeability. 
We don't care (for Schelling point reasons at least) if most people hear about EA and go "eh, sounds weird and unappealing"; what matters is that the core target demographic is excited enough to put effort into pursuing EA. Consider how Silicon Valley was not particularly high-prestige in the public even when it was already attracting tech entrepreneurs, or how many people hear about the intensity of academics at top universities and (very reasonably) think "no thanks".</p><h3 id="providing-value">Providing value</h3><p>Though most of a Schelling point's value typically comes from the other people who congregate at it, a Schelling point is easier to create if it is obviously valuable. Even though the smart people they meet might be most of the benefit of university, high schoolers are still more likely to go to top universities if they provide good education, good facilities, and unambiguous social status.</p><p>Some obvious ways in which EA provides value are through funding sufficiently promising projects, and by having a very high concentration of intellectually interesting ideas.</p><p>There are risks to communicating loudly about the value-add, since this brings in people who are in it purely for personal gain (<a href="https://forum.effectivealtruism.org/posts/W8ii8DyTa5jn8By7H/the-vultures-are-circling">"the vultures are circling", as one Forum post put it</a>). This works for Schelling points like Silicon Valley, but not altruism.</p><h3 id="optimising-for-matchmaking">Optimising for matchmaking</h3><p>A specific way that Schelling points provide value is by making it easy to meet other people in the specific ways that lead to productive teams forming. An existing example of this is that everyone says one-on-one meetings are the main point of conferences, and there is (of course) a lot of <a href="https://forum.effectivealtruism.org/posts/pKbTjdopzSEApSQfc/doing-1-on-1s-better-eag-tips-part-ii">thinking about how to make these effective</a>. 
On the more informal end of the scale, <a href="https://www.reciprocity.io/">Reciprocity</a> exists.</p><p>However, the scope and value of EA matchmaking could be expanded. I'm not aware of many ways to match together entrepreneurial teams (the <a href="https://www.charityentrepreneurship.com/incubation-program">Charity Entrepreneurship incubation program</a> is the only one that comes to mind). I recently took part in an informally-organised co-founder matching process and found it extremely helpful to quickly get a lot of information on what it's like to work together with several promising people.</p><p>I'd advise someone to think more about how to make the EA environment even more effective at matching people who should know about each other. However, I expect someone is already designing a 53-parameter one-on-one matching system with Calendly, Slack, and Matplotlib integration for the next conference, and therefore I will hold off on adding any more fuel to this fire.</p><h3 id="being-legit">Being legit</h3><p>One of the specific ways in which a Schelling point becomes one is if things associated with it seem uniquely competent, successful, or otherwise good, in a clearly unfakeable way. It is helpful for Cambridge's Schelling point status that it can brag about having 121 Nobel laureates. That so many successful tech companies emerged from Silicon Valley specifically is an unfakeable signal. Any government or city can afford to throw some millions at putting up posters advertising its startup-friendliness; few can consistently produce multi-billion dollar tech companies.</p><p>No amount of community-building or image-crafting is likely to replicate the Schelling point power of <i>obviously being the place where things happen</i>. 
In some areas, I think EA already has such power: much of the research and work on existential risks happens within EA, and it might be hard to be a researcher on those topics without running into the large body of EA-originating work. However, EA goals require more than just research; note how being a <a href="https://80000hours.org/career-reviews/founder-impactful-organisations/">project/organisation founder</a> or <a href="https://80000hours.org/articles/operations-management/">working in an operations role</a> have been creeping up the 80 000 Hours list of recommended career paths.</p><p>It would be extremely powerful, not just for direct impact reasons but also for building up EA's Schelling point status, if the EA community clearly spawned very obviously successful real-world projects. <a href="https://www.alveavax.com/">Alvea</a> succeeding or working <a href="https://forum.effectivealtruism.org/posts/gLPEAFicFBW8BKCnr/announcing-the-nucleic-acid-observatory-project-for-early">Nucleic Acid Observatories</a> being built would be powerful examples. Likewise if <a href="https://www.charityentrepreneurship.com/">Charity Entrepreneurship</a>-incubated charities become clear stars of the non-profit world.</p><h3 id="meritocracy-and-impartial-judgement">Meritocracy and impartial judgement</h3><p>Right now, I think if a person somewhere in the world has a well-thought out idea for how to make the world a better place, likely their best bet to get a fair hearing, useful feedback, and - if it is competitive with the most valuable existing projects - funding and support is to post it on the <a href="https://forum.effectivealtruism.org/">EA Forum</a>. I don't think this is very obvious outside the EA community. 
However, this fact, and awareness of it, could make EA a more useful Schelling point, in the same way that the impression that Silicon Valley doesn't frown on weird ideas as long as they're important enough makes it a better Schelling point.</p><p>That EA endorses cause neutrality, has high and transparent epistemic standards, and has a quantitative mindset are key parts of this. However, to use this to increase EA Schelling point power, these properties need to be clearly visible to outsiders.</p><p>The most likely way for this to become more obvious might be if specific EA organisations achieved such a reputation widely within their field (and then there was some path by which knowing of these organisations points people towards knowing about EA).</p><p>GiveWell might be an example of a clearly-EA-linked organisation with visibly high epistemics and judgement quality, though I don't know what their image or recognition level is outside the EA community. Another example would be if someone created successful and famous organisations along the lines of FTX Future Fund's proposed <a href="https://ftxfuturefund.org/projects/epistemic-appeals-process/">epistemic appeals process</a> or <a href="https://ftxfuturefund.org/projects/expert-polling-for-everything/">widespread expert polling</a> projects.</p><h3 id="openness-and-approachability">Openness and approachability</h3><p>Good Schelling points are easy to enter, and don't select on attributes that they don't have to.</p><p>Every human sub-group, even if loose and purpose-driven, tends to develop a distinctive culture that is much more specific than strictly implied by its purpose. Sometimes this is useful, since it makes it easy for humans in even a loose group to bond with each other. However, a strong and distinct internal culture is also a barrier to entry. 
EA is already high-risk for having a strong barrier to entry, because</p><ul><li>many arguments and concepts in EA require background knowledge to understand, and sometimes dense philosophical or technical background knowledge (and this is not the case just for more formal things like Forum posts; I've frequently heard "EV [expected value]", "QALY [quality-adjusted life year]", and "Pascal's mugging" assumed as obvious common terminology in casual conversation);</li><li>EA (quite obviously, given what it's about) has a high concentration of non-obvious arguments that are obscure in public discussion but have huge implications; and</li><li>perhaps the main route into EA is caring very strongly about intellectual arguments about abstract moral principles, which tends not to be a natural way for humans to join communities.</li> </ul><p>These largely unavoidable factors already make EA somewhat unapproachable, and make it seem like a tightly-knit weird in-group/subculture (anecdotally, this seems to be the most common complaint about EA among Cambridge students). Weird cultural norms or quirks are (among other things!) barriers to entry. Therefore, they should be minimised - to the extent that they can be without impinging on what EA is about - <i>if</i> the goal is to maximise Schelling point value.</p><h3 id="mostly-implicit-selectivity-for-the-right-things">(Mostly implicit) selectivity for the right things</h3><p>Some selection is usually part of a Schelling point's value. Top universities select for academic merit (though perhaps less so in the US). Silicon Valley selects for openness and interest/talent in tech/business. EA selects for openness, altruistic orientation (especially if consequentialist-leaning), good epistemics, and quantitative thinking.</p><p>I think it is counterproductive to view openness and selectivity as two ends of one scale that apply to everything. 
You want to select on important features and be open otherwise (note that, when creating a Schelling point, most of the selection is usually implicit - what types of people you attract - rather than explicit filtering). The key choice is not "open or selective overall?" but rather "for which X do we want to appeal only to people who have a value of X in some specific range?"</p><p>Here's a heuristic for when selectivity for X is useful: when the way X provides value is through its <i>concentration</i> rather than its <i>amount</i>. If you're at a party where you can only talk to a subset of the people during its course, you're going to care a lot about what fraction of people there are interesting - 10 interesting people in a party of 20 is better than 50 in a party of 5000.</p><p>Some cases are ambiguous. For example, if there exists a way for the good and important research to bubble to the top regardless of how much other research exists, it seems like total amount of (infohazard-free) research is the thing to maximise. However, a research area where the average paper is very high quality might help newcomers to the field, or might help lift the prestige of the field, so concentration matters at least somewhat.</p><p>To take another example, there was a <a href="https://forum.effectivealtruism.org/posts/dsCTSCbfHWxmAr2ZT/open-ea-global">recent debate</a> over whether EA Global should be open access. 
Many of the arguments against boil down to thinking the path to impact runs through a uniquely high concentration of EA engagement (or other variables) among the participants; arguments in favour are often either claiming that concentration matters less than sheer amount of interactions, or that the choice of selection variable(s) is wrong, or that CEA fails to select on their chosen selection variable(s) so even if the intention is right the selection variable selected for in practice is wrong.</p><h3 id="hubs-and-hub-related-infrastructure">Hubs, and hub-related infrastructure</h3><p>Finally, a key point of a Schelling point is that it is a point <i>somewhere</i>. Here, EA is increasingly better. Berkeley, Cambridge, Oxford, London, and Berlin all have large groups, and offices that you can apply to in order to work on EA-relevant things in the company of other EAs.</p><p>In Schelling point terms, there's also a risk that it might be better to have one really obvious and strong hub than many weaker ones (I've heard some Bay Area EAs in particular endorsing this view; invariably, their hub of choice is the Bay Area, though there is <a href="https://forum.effectivealtruism.org/posts/bnzwL6tu4pdYf3hpZ/say-nay-to-the-bay-as-the-default">push back</a>). In practice, it seems that many physical hubs but one virtual/intellectual hub may be best. Both airplanes and people's desires to not uproot their lives are real and relevant things.</p><p>The organisers at each EA hub might benefit from applying Schelling point thinking to the context of their local scene. </p><h3 id="being-one-thing">Being one thing</h3><p>Finally, a Schelling point needs to be one thing, at least in some loose sense. If New York had two Grand Central Stations, the classic Schelling point game would become a lot harder to solve.</p><p>One way to increase the One Thingness of the EA Schelling point is to merge it with other things. 
In Schelling point land, "merging" does not mean making them the same cluster, but rather creating an obvious and visible path from one thing to another. My understanding is that increasing the obviousness of EA in somewhat-adjacent communities (tech, longevity, space, and Emergent Ventures grantees) was a large part of what <a href="https://forum.effectivealtruism.org/posts/szeE3je8MD4sZcevL/announcing-future-forum-apply-now">Future Forum</a> tried to achieve.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-90986294581287342962022-08-20T23:05:00.009+01:002022-09-02T13:47:24.205+01:00Effective Altruism in practice <p style="text-align: center;"> <i><span style="font-size: x-small;">6.5k words (~17 minutes)<br /></span></i></p><p> </p><p>I've written about <a href="https://www.strataoftheworld.com/2020/07/ea-ideas-1-rigour-and-opportunity-in.html">key ideas in Effective Altruism</a> before. But that was the theory. How did EA actually come to exist, and what does it look like in practice?</p><p> </p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://www.kindpng.com/picc/m/294-2945196_effective-altruism-logo-hd-png-download.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="460" data-original-width="800" height="230" src="https://www.kindpng.com/picc/m/294-2945196_effective-altruism-logo-hd-png-download.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>... 
turns out it looks like a stylised light bulb with a heart.</i><br /></td></tr></tbody></table><br /> </p><h2 id="summary">Summary</h2><ul><li><p>The ideas underpinning EA came from many sources, including:</p><ul><li>late-1900s analytic moral philosophers like Peter Singer and Derek Parfit;</li><li>futurist/transhumanist thinkers like Nick Bostrom and Eliezer Yudkowsky focusing on risks from future technologies;</li><li>a few people working on evaluating charity effectiveness;</li><li>efforts starting around 2010 by a few Oxford philosophers including William MacAskill and Toby Ord that, sometimes unwittingly, gave structure and a name to a diverse cluster of ideas about how to maximise your positive impact.</li> </ul></li><li><p>Though EA is <a href="https://forum.effectivealtruism.org/posts/FpjQMYQmS3rWewZ83/effective-altruism-is-a-question-not-an-ideology">framed around the question</a> of "what does the most good (according to an analytic and often quantitative framework based on impartial welfare-oriented ethics)?" rather than any particular answer to that question, in practice much (but not all!) 
EA efforts focus on one of the following, due to many people deciding that it's a particularly pressing and (outside EA) neglected problem:</p> <ul><li>reducing the risk of civilisation-wide catastrophe, especially from emerging technologies like advanced AI and biotechnology;</li><li>health and development in poor countries; and</li><li>animal welfare.</li><li>There is also a lot of work at the meta-level, including on figuring out how people can have <a href="https://80000hours.org/">impactful careers</a>, and trying to direct effort towards the above problems.</li> </ul></li><li><p>The funding for most EA-related projects and EA-endorsed charities comes from a combination of:</p> <ul><li><p>many individual small donors, in particular:</p><ul><li>people who have taken the <a href="https://www.givingwhatwecan.org/">Giving What We Can pledge</a> and therefore donate >10% of their salary to highly effective charities;</li><li>people who explicitly pursue <a href="https://80000hours.org/articles/earning-to-give/">"earning-to-give"</a> (getting a high-paying job in order to donate most of the proceeds to charities);</li> </ul></li><li><p>several foundations that derive their wealth from billionaires, including most prominently:</p> <ul><li><a href="https://www.openphilanthropy.org/">Open Philanthropy</a>, mostly funded by Dustin Moskovitz who made his wealth from being a Facebook co-founder; and</li><li><a href="https://ftxfoundation.org/">FTX Foundation</a>, funded by Sam Bankman-Fried and several other early employees at the crypto exchange FTX.</li> </ul></li> </ul></li><li><p>There is no monolithic EA organisation (though the <a href="https://www.centreforeffectivealtruism.org/">Centre for Effective Altruism</a> organises some common things like the EA Global conferences), but rather a large collection of organisations that mainly share:</p> <ul><li>a commitment to maximising their positive impact on the world;</li><li>a generally rigorous and quantitative approach 
to doing so; and</li><li>some link to the cluster of people and organisations in Oxford that first named the idea of Effective Altruism.</li><li>There are also many charities that have no direct relation to the EA movement, but were identified by charity evaluators like <a href="https://www.givewell.org/">GiveWell</a> as extremely effective, and have thus been extensively funded.</li> </ul></li><li><p>EA is very good at attracting talented people, especially ambitious young people at top universities.</p></li><li><p>EA culture leans intellectual and open, and has a high emphasis on "epistemic rigour", i.e. being very careful about trying to figure out what is true, acknowledging and reasoning about uncertainties, etc.</p></li><li><p>Some "axes" within EA include:</p> <ul><li>"long-termists" who focus on possible grand futures of humanity and the existential risks that stand between us and those grand futures, and "near-termists" who work on clearer and more established things like global poverty and animal welfare;</li><li>a bunch of people and ideas all about frugality and efficient use of money, and another bunch of people and ideas about using the available funding to unblock opportunities for major impact; and</li><li>a historical tendency to be very good at attracting philosophy/research-type people who like wrestling with difficult abstract questions, versus a growing need to find entrepreneurial, operations, and policy people to actually do things in the real world. </li> </ul></li> </ul> <h2 id="the-philosophers">The philosophers</h2><p>In the beginning (i.e. circa the 1970s, when <a href="https://en.wikipedia.org/wiki/Unix_time">time is widely known to have begun</a>), there were a bunch of philosophers doing interesting work. One of them was Peter Singer. Peter Singer proposed questions like this (paraphrasing, not quoting, and updated with recent numbers):</p><blockquote><p>Imagine you're wearing a $5000 suit and you walk past a child drowning in a lake. 
Do you jump into the lake and save the child, even though it ruins your suit?</p><p>If you answered yes to the above, then consider this: it is <a href="https://blog.givewell.org/2020/11/19/our-recommendations-for-giving-in-2020/">possible to save a child's life in the developing world for $5000</a>; what justification do you have for spending that money on the suit rather than saving the life?</p><p>The only difference between the two scenarios seems to be distance to the dying child (and method of death and etc. but ssshh); is that distance really morally significant?</p></blockquote><p>(He is also known for arguing in favour of animal rights and abortion rights.)</p><p>Derek Parfit is another. He is particularly famous for the book <i>Reasons and Persons</i>, in which he asks questions (paraphrasing again) like this:</p><blockquote><p>Is a moral harm done if you cause fewer people to exist in the future than otherwise might have? How should we reason about our responsibilities to future generations and non-existing people more generally?</p><p>Does there exist a number of people living mediocre (but still positive) lives such that this world is better than some smaller number of people living very good lives?</p></blockquote><p>(He also talks about problems in the philosophy of personal identity, and the contradictions in moral philosophies based on self-interest.)</p><h2 id="the-transhumanists">The transhumanists</h2><p>Then, largely separately and around the 1990s, there came the transhumanists ("transhumanism" is a wide-reaching umbrella term for humanist thinking about radical future technological change). Perhaps the most notable are Nick Bostrom and Eliezer Yudkowsky.</p><p>Nick Bostrom thought long and hard about many wacky-seeming things with potentially cosmic consequences. He popularised the simulation hypothesis (the idea that we might all be living in a computer simulation). 
He <a href="https://nickbostrom.com/fable/dragon">argues against death</a> (something I <a href="https://www.strataoftheworld.com/2021/10/death-is-bad.html">strongly agree with</a>). He did lots of work on anthropic reasoning, which is about the question of how we should update information we get about the state of the world when taking into account that we wouldn't exist unless the state of the world allowed it. This leads to <a href="https://en.wikipedia.org/wiki/Sleeping_Beauty_problem">some thought experiments</a> that I'd classify as infohazards because of their tendency to spark an unending discussion whenever they're described. Conveniently, he also coined the term "<a href="https://en.wikipedia.org/wiki/Information_hazard">infohazard</a>".</p><p>Most crucially for EA, though, Bostrom has worked on understanding existential risks, which are events that might destroy humanity or permanently and drastically reduce the capacity of humanity to achieve good outcomes in the future. In particular, he has worked on risks from advanced AI, which he boosted to popularity with the 2014 book <i>Superintelligence</i>.</p><p>Bostrom's style of argument is like a dry protein bar, leaning toward straightforward extrapolation of conclusions from premises, especially if the conclusions seem crazy but the premises seem self-evident. Sometimes, though, he does apply some literary flair to make <a href="https://nickbostrom.com/utopia">an important point</a>, and also <a href="https://nickbostrom.com/poetry/poetry">occasionally writes poetry</a>.</p><p>Eliezer Yudkowsky wanted to create a smarter-than-human AI as fast as possible, until he realised this might be a Bad Idea and said "<a href="https://www.lesswrong.com/posts/SwCwG9wZcAzQtckwx/that-tiny-note-of-discord">oops</a>" and switched to the problem of making sure any powerful AIs we create don't destroy human civilisation. 
He founded the Machine Intelligence Research Institute (MIRI) to find out the answer.</p><p>Yudkowsky also wrote a <a href="https://www.lesswrong.com/rationality">massive series of blog posts</a> to try to teach people about how to reason well (for example, he covers a lot of ground from the cognitive biases literature), and then went on to try to convey the same lessons in what became <a href="http://www.hpmor.com/">the most popular work of Harry Potter fanfiction of all time</a>. His writing and argument style tends toward flowing narratives that are usually both very readable and verbose (though quite hit-or-miss in whether you like it).</p><p>He has Opinions (note the capital). He is extremely pessimistic about the chances of solving the AI alignment problem.</p><p>Yudkowsky is affiliated much more strongly with the loose "Rationalist community" than with EA. This is a collection of online blogs that was sparked by Yudkowsky's writing, and later in particular also that of <a href="https://slatestarcodex.com/">Scott Alexander</a>, who has become internet-famous for his own reasons too. The central forum is <a href="https://www.lesswrong.com/">LessWrong</a>. The relation between EA and Rationalism is best described by a joke that students (well, at least physics students) are fond of, first made by Richard Feynman: "physics is to mathematics as sex is to masturbation". Both EA and Rationalism involve lots of discussion about far-ranging abstract ideas that (for a certain type of person) are hard to resist; one blogger says "[t]he experience of reading LessWrong for the first time was brain crack" and <a href="https://chanamessinger.com/blog/ea-as-nerdsniping">goes on to propose</a> that EA ideas are best-spread by <a href="https://xkcd.com/356/">nerd-sniping</a> (i.e. telling people about ideas they find so interesting that they literally can't help but think about them). 
Both EA and the Rationalists put an incredible amount of effort and weight on trying to reason well, avoiding biases and fallacies, and being careful (and often quantitative) about uncertainties. However, EA is very much about applying those things to do good in the real world to real people, while the Rationalist vibe is sometimes one of indulging in theorising and practising good thinking for their own sake. (This is not necessarily a criticism - I had fun discussing Lisp syntax in the comments section of <a href="https://www.lesswrong.com/posts/GAqCiWJBttazYGsJR/review-structure-and-interpretation-of-computer-programs">the LessWrong version of my review of <i>Structure and Interpretation of Computer Programs</i></a>, even though arguing about parentheses isn't exactly going to save the world (or is it ... ?).)</p><p>(I should also note that on the specific topic of AI risk, the Rationalist community is extremely impact-oriented, likely due to founder effects - or perhaps because AI risk is the EA cause area that is most full of juicy technical puzzles and philosophical confusions.)</p><h2 id="more-philosophers--ea-gets-a-name">More philosophers & EA gets a name</h2><p>Brian Christian's <i>The Alignment Problem</i> mentions in chapter 9 some funny details about the sequence of events that led to the first few EA-by-name organisations. In 2009, then-Oxford-philosophy-student Will MacAskill had an argument about vegetarianism while in a broom closet. Unlike most arguments about vegetarianism, and echoing the vibe of much future EA thinking, this one was on the meta-level; the debate was not whether factory farming is bad, but how we should deal with the moral uncertainty around whether or not factory farming is ethical. 
MacAskill eventually started talking with Toby Ord (though in a graveyard rather than a broom closet), another philosophy student interested in <a href="https://www.strataoftheworld.com/2020/07/ea-ideas-3-uncertainty.html">questions around moral uncertainty</a>.</p><p>Together with one other person, the two of them <a href="https://www.moraluncertainty.com/">wrote a book</a> on moral uncertainty. MacAskill and a philosophy-and-physics student called Benjamin Todd founded an organisation called <a href="https://80000hours.org/">80 000 Hours</a> to try to figure out how people can choose careers to have the greatest positive impact on the world. Toby Ord founded an organisation called <a href="https://www.givingwhatwecan.org/">Giving What We Can</a> (GWWC) that encourages people to donate 10% of their salary to exceptionally effective charities. GWWC estimates its roughly 8000 members have donated $277mn, and are likely to donate almost $3bn over their lifetimes.</p><p>As an umbrella organisation for both of these, they created the <a href="https://www.centreforeffectivealtruism.org/">Centre for Effective Altruism</a>. Originally the "Effective Altruism" part was intended purely as a descriptive part of the organisation's name, but at some point started to stand more broadly for the general space of effectively altruistic things that at some point interacted with ideas from the original Oxford cluster.</p><p>Later, MacAskill wrote a book called <i>Doing Good Better</i> summarising ideas about why charity effectiveness is important and counterintuitive. 
Ord in turn wrote <a href="https://theprecipice.com/"><i>The Precipice</i></a> that summarises ideas about how mitigating existential risks to human civilisation is likely a key moral priority; after all, it would be bad if we all died.</p><h2 id="charity-evaluators-and-billionaires">Charity evaluators and billionaires</h2><p>Independently from (and before) anything happening in Oxford broom closets, starting in 2006 hedge fund managers Holden Karnofsky and Elie Hassenfeld started thinking seriously about which charities to donate to. Upon discovering that this is a surprisingly hard problem, they started <a href="https://www.givewell.org/">GiveWell</a>, an organisation focused on finding exceptionally effective charities. They ended up concentrating on global health (their list includes malaria prevention, vitamin supplementation, and cash transfers, all in developing countries).</p><p>After a few years of GiveWell existing, they were put in touch with Dustin Moskovitz and Cari Tuna. At the time, Facebook co-founder Dustin Moskovitz was the world's youngest self-made billionaire, and with his partner Cari Tuna had started a philanthropic organisation called Good Ventures in 2011.</p><p>What followed was a cinematic failure of prioritisation, as recounted by Holden Karnofsky himself in <a href="https://80000hours.org/podcast/episodes/holden-karnofsky-most-important-century/#holdens-background-000947">this interview</a>. The GiveWell founders decided that "[meeting the billionaires] just doesn't seem very high priority", and thought that "[n]ext time someone's in California we should definitely take this meeting, but [...] this isn't the kind of thing we would rush for [...]". 
However, Karnofsky realised this meeting was an excellent excuse to go on a date with a Californian he fancied (and later married), and as a result ended up making the trip sooner rather than later.</p><p>Moskovitz and Tuna turned out to have very simplistic preferences for charitable giving: they just wanted to do the most good possible. This was an excellent fit with GiveWell's philosophy, and soon Good Ventures partnered with GiveWell in what would later become Open Philanthropy (of which Karnofsky would become co-CEO). <a href="https://www.openphilanthropy.org/">Open Philanthropy</a> is a key funder of EA projects, though they fund unrelated things as well (though always through a very EA lens of trying to rigorously and quantitatively maximise impact). They list all their grants <a href="https://www.openphilanthropy.org/grants/">here</a>. </p><p>While studying physics at MIT, Sam Bankman-Fried (or "SBF"), already deeply interested in consequentialist moral philosophy, attended a talk by Will MacAskill on EA ideas. After stints at trading companies and the Centre for Effective Altruism, he founded the crypto-focused trading companies Alameda Research and then FTX, and ended up becoming the richest under-30 person in the world. (Though then the value of FTX fell in the crypto crash, and he recently turned 30 to boot.)</p><p>SBF often emphasises that you're more likely to achieve outlier success in business if your goal is to donate the money effectively. There's little personal gain in going from $100M to $10B, so a selfish businessperson is likely to optimise something like "probability I earn more than [amount that lets me do whatever the hell I want for the rest of my life]", while a (mathematically-literate) altruistic one is far more compelled to simply shoot for the highest <a href="https://www.strataoftheworld.com/2020/07/ea-ideas-2-expected-value-and-risk.html">expected-value</a> outcomes, even if they're risky. 
(The exception is the selfish businessperson who really likes competing in the billionaire rankings.)</p><p>SBF has also said - and is living proof of - the idea that if your strategy to do good is to earn money to donate, you should probably aim for the risky but high-value bets (e.g. starting a company and becoming a billionaire), rather than going into some high-paying finance job earning a crazy-high but non-astronomical salary. Many people persuaded by EA ideas have done the latter, but SBF contributed more than all of them combined. The maths probably still works out even after accounting for the fact that SBF's route was far more unlikely to work than a finance job (he thought FTX had an 80% chance of failure). <a href="https://forum.effectivealtruism.org/posts/m35ZkrW8QFrKfAueT/an-update-in-favor-of-trying-to-make-tens-of-billions-of">This post</a> argues so. <a href="https://www.wave.com/en/about/">Wave</a>, a fintech-for-Africa company with strong EA representation in its founding team and a $1.7B valuation in 2021, is another example of EA business success.</p><p>SBF and other senior FTX people (many of whom care deeply about EA ideas) launched the FTX Foundation, which in particular contains the <a href="https://ftxfuturefund.org/area-of-interest/">Future Fund</a> that has quickly become a key funder of the more future-oriented and speculative parts of EA.</p><p>These days, being associated with tech billionaires isn't a ringing endorsement. However, consider a few things. First, the tech billionaires aren't the ones who came up with the ideas or set the agendas. Sports car enthusiast and sci-fi nerd Elon Musk decided that sexy cars and rockets are the most important projects in the world and directed his wealth accordingly; Moskovitz, SBF, & co. were persuaded by abstract arguments and donate their wealth to foundations where the selection of projects is done by people more knowledgeable in that than they are. 
Second, it seems unusually likely that the major EA donors really are sincere and committed to trying to do the most good; after all, if they wanted to maximise their popularity or acclaim, there are better ways of doing that than funding a loose cluster of people often trying to work specifically on the least-popular charitable causes (since those are most likely to contain low-hanging fruit). Finally, if some tech billionaires endorsing EA is evidence <i>against</i> EA being a good thing, then no tech billionaires endorsing EA <a href="https://www.lesswrong.com/rationality/conservation-of-expected-evidence">must be</a> evidence <i>in favour</i> of EA being a good thing. However cynical you are about tech billionaires, they're still smart people, so a few of them going "huh, this is the type of thing I want to spend all my wealth on" should be more promising than all of them going "nope I don't buy this".</p><p>(If EA has some top tech business people, why doesn't it have some top political people too, or even funders from outside tech? My guess is a combination of factors. Politicians skew old while EAs skew young (partly because EA itself is young). Both EAs and tech people tend to be technically/mathematically/intellectually-inclined (though many areas within EA are specifically about social science or the humanities). Both EAs and tech people tend to care less than average about social norms or prestige, while politicians tend to be selected out of the set of people who are willing to optimise very hard for prestige and popularity. 
Also, expect some policy-related efforts from EA; many EAs work or aim to work in non-political policy roles, and there have even been some political efforts, though <a href="https://forum.effectivealtruism.org/posts/sKwEB7EEMaCp9tfaw/carrick-flynn-results-and-additional-ideas-for-passing">there is much to learn in that field</a>.)</p><h2 id="organisations">Organisations</h2><p>In addition to the previously-mentioned CEA, 80 000 Hours, Giving What We Can, GiveWell, Open Philanthropy, and FTX Foundation, organisations with a strong EA influence include (but are not limited to):</p><ul><li><p>A large number of think-tanks and research institutes, especially ones where people think about the end of the world all day, including</p><ul><li><a href="https://www.fhi.ox.ac.uk/">Future of Humanity Institute</a> (FHI) at Oxford, which researches big-picture questions about the future of humanity and is run by Nick Bostrom.</li><li><a href="https://futureoflife.org/">Future of Life Institute</a> (FLI) in Cambridge (Massachusetts), focusing on global catastrophic risks and existential risks. It was founded by a team including Skype co-founder Jaan Tallinn and physicist Max Tegmark. 
Wikipedia says they are "[n]ot to be confused with Future of Humanity Institute" but to be honest this is a pretty big ask given the name.</li><li><a href="https://www.cser.ac.uk/">Centre for the Study of Existential Risk</a> (CSER) at Cambridge, also co-founded by Jaan Tallinn.</li><li><a href="https://longtermrisk.org/">Centre on Long-Term Risk</a> (CLR).</li><li><a href="https://www.longtermresilience.org/">Centre on Long-Term Resilience</a> (CLTR) (no, this is not confusing at all, it's all in your head).</li> </ul></li><li><p>A large number of animal welfare charities, which I won't bother listing, except to point out the meta-level <a href="https://animalcharityevaluators.org/">Animal Charity Evaluators</a>.</p></li><li><p>A large number of global health charities, including ones that are simply highly recommended (and funded) by GiveWell (in particular <a href="https://www.againstmalaria.com/">Against Malaria Foundation</a>, which routinely tops <a href="https://www.givewell.org/charities/top-charities">GiveWell rankings</a>) to ones that also trace their roots solidly to EA.</p></li><li><p>Organisations working on AI risk, including:</p> <ul><li><a href="https://www.anthropic.com/">Anthropic</a>, working on interpreting machine learning models (a program led by Chris Olah) and more general empirically-grounded, engineering-based machine learning safety research.</li><li><a href="https://www.redwoodresearch.org/">Redwood Research</a>, a smaller company also doing empirical machine learning safety work (and running <a href="https://forum.effectivealtruism.org/posts/vvocfhQ7bcBR4FLBx/apply-to-the-second-ml-for-alignment-bootcamp-mlab-2-in">great ML bootcamps</a> on the side).</li><li><a href="https://humancompatible.ai/">Centre for Human-compatible AI</a> (CHAI), a research institute at UC Berkeley.</li><li><a href="https://intelligence.org/">Machine Intelligence Research Institute</a> (MIRI), the original AI safety organisation that was founded in 2000 and 
hence managed to snap up the enviable domain name "<a href="https://intelligence.org/">intelligence.org</a>". MIRI's research leans much more mathematical and theory-based than that of most other AI alignment organisations.</li><li><a href="https://www.conjecture.dev/">Conjecture</a>, a new organisation focusing on the work that is most relevant if advanced AI is surprisingly close.</li><li>(OpenAI and DeepMind, the two leading AI companies, both have safety teams that include people very committed to working on existential risk concerns. However, neither is primarily an AI safety company, and both weight advanced AI risks at a company-level less than the other companies on this list. OpenAI in particular currently sees AI risks more through the near-term lens of making sure AI systems and their benefits are widely accessible to everyone, rather than focusing on making sure AI systems don't doom us all (though I guess that too would be a suitably equitable outcome?).)</li> </ul></li><li><p><a href="https://www.alveavax.com/">Alvea</a>, a recent vaccine startup, with the eventual goal of enabling faster vaccine roll-out in the next pandemic. </p></li><li><p><a href="https://www.charityentrepreneurship.com/">Charity Entrepreneurship</a>, a charity incubator that has incubated <a href="https://www.charityentrepreneurship.com/our-charities">many charities</a>, including for example Healthier Hens (farmed chicken welfare), the Happier Lives Institute (helping policymakers figure out how to increase people's happiness), and Lead Exposure Elimination Project (working to reduce lead exposure in developing countries).</p></li><li><p><a href="https://www.sparkwave.tech/">SparkWave</a>, an incubator for software companies that are solving important problems. 
</p></li><li><p><a href="https://effectivethesis.org/">Effective Thesis</a>, trying to save students from writing pointless theses.</p></li><li><p><a href="https://founderspledge.com/">Founders Pledge</a>, which helps entrepreneurs commit to giving away money when they sell their companies and donate that money effectively (not to be confused with the more famous <a href="https://en.wikipedia.org/wiki/The_Giving_Pledge">Giving Pledge</a>). (So far, about $475M has been donated in this way) </p></li><li><p><a href="https://www.legalpriorities.org/">Legal Priorities Project</a>, which looks at the legal aspects of trying to do everything else.</p></li><li><p><a href="https://allfed.info/">ALLFED</a> (ALLiance to Feed Earth in Disasters), which aims to be useful in situations where hundreds of millions of people or more are suddenly without food, and which has successfully found the best conceivable name for an organisation that does this.</p></li><li><p><a href="https://ourworldindata.org/">Our World in Data</a> (OWID), the world's best provider of data and graphs on important global issues. 
I'm not quite sure how interrelated they are with EA directly, but their founder <a href="https://forum.effectivealtruism.org/posts/uaveEAgFfyFx4EYaH/a-new-our-world-in-data-article-on-longtermism">posts on the EA Forum about OWID articles on very EA-related ideas</a>, so there's definitely some overlap.</p></li><li><p><a href="https://www.appgfuturegenerations.com/">All-Party Parliamentary Group for Future Generations</a> in the UK government.</p></li><li><p>A bunch of organisations focused on getting people interested in the world's biggest problems and teaching them various skills:</p> <ul><li><a href="https://www.atlasfellowship.org/">Atlas Fellowships</a>, a recent initiative for high-schoolers.</li><li>A collection of Existential Risk Initiatives running, among other things, summer internships where people (mostly undergraduate/postgraduate students) work with mentors on existential risk research: <a href="https://cisac.fsi.stanford.edu/stanford-existential-risks-initiative/content/stanford-existential-risks-initiative">SERI</a> (Stanford), <a href="https://effectivealtruism.ch/swiss-existential-risk-initiative">CHERI</a> (Switzerland), <a href="https://www.camxrisk.org/">CERI</a> (Cambridge), and a newer one at the University of Chicago which I can't yet find a website for, but which will almost certainly not help with the naming situation when it arrives. Thankfully, rumours say there will soon be a YETI (Yale Existential Threats Initiative), which is a cool and (thank god!) unconfusable name.</li> </ul></li> </ul> <p>Since EA is not a monolithic centralised thing, there is plenty of fuzziness in what counts as an EA organisation, and definitely no official list (and therefore if you're reading this and your org is not on the list, you shouldn't complain - many great orgs were left out). 
The common features among many of them are:</p><ul><li>Some causal link to stuff that at some point interacted with the original Oxford cluster.</li><li>Emphasis on taking altruistic actions with a focus on effectiveness.</li><li>Emphasis on quantifying the impact of altruistic actions.</li><li>Emphasis on a scope that is in some way particularly wide-ranging or unconventional, either in sheer size or time (existential risks, the long-run future), geography (focusing on the entire world and often particularly developing countries rather than the organisation's neighbourhood), or in what is cared about (farmed animal welfare, <i>wild</i> animal welfare, the lives of people in the far future, and whatever the hell <a href="https://thequaliaresearchinstitute.org/">these people</a> are doing).</li> </ul> <p>The biggest EA events are the Effective Altruism Global (EAG) conferences organised by CEA. These usually happen several times a year, mostly in the UK and the Bay Area, though locally-organised <a href="https://www.eaglobal.org/eagxhome/">EAGx conferences</a> have more diverse locations.</p><h2 id="the-situation">The Situation</h2><p>EA has a strong presence especially at top universities. There are large and active EA student groups in the Bay Area, Cambridge, Oxford, and London, but also increasingly New York, Boston, and Berlin, and many smaller local groups (you can find them listed <a href="https://forum.effectivealtruism.org/community">here</a>). The profile of EA in the general public is very small. However, the concentration of talent is extremely high. Add to this the existence of funding bodies with tens of billions of dollars of assets that are firmly aligned with EA principles, and you can expect a lot of important, impactful work to come from people and organisations with some connection to EA in the coming years.</p><p>It's important to keep in mind that EA is not a centralised thing. 
There is no EA tsar, or any single EA organisation that runs the show, or any official EA consensus. It's a cluster of many people and efforts that are joined mainly by caring about the types of ideas I talk about <a href="https://www.strataoftheworld.com/2020/07/ea-ideas-1-rigour-and-opportunity-in.html">here</a>.</p><h3 id="demographics">Demographics</h3><p><a href="https://effectivealtruismdata.com/#demographics">This website</a> has a good overview, based on whoever filled in a survey posted to the <a href="https://forum.effectivealtruism.org/">EA Forum</a>. The gender ratio is unfortunately somewhat skewed (70% male); for comparison, this is <a href="https://www.amacad.org/humanities-indicators/higher-education/gender-distribution-degrees-philosophy">roughly the same</a> as for philosophy degrees and better than for software developers (<a href="https://www.statista.com/statistics/1126823/worldwide-developer-gender/">90% male</a> (!?)). Half are 25-34. Over 70% are politically left or centre-left, and few are centre-right (2.5%) or right (1%), though almost 10% are libertarians. Education levels are high, and the five most common degrees are, in order: CS, maths, economics, social science, and philosophy. Most are from western countries.</p><h3 id="culture">Culture</h3><p>EA culture places a lot of weight on epistemics: being honest about your uncertainties, clear about what would make you change your mind on an issue, aware of biases and fallacies, trying to avoid group-think, focusing on the substance of the issue rather than who said it or why, and arguing with the goal of finding the truth rather than defending your pet argument or cause. This is a lofty set of goals. 
To an astonishing but imperfect extent, and more so than any other concentration of people or writing (except for the equally-good Rationalist community mentioned above) that I've ever had any exposure to, EA succeeds at this.</p><p>Related to this, but also turbo-charged by general cultural memes of "critiquing cherished ideas is important", there's a strong emphasis on constantly being on the lookout for ways in which you yourself or (in particular) common EA ideas might be wrong. If you read down the list of <a href="https://forum.effectivealtruism.org/allPosts?sortedBy=top&timeframe=allTime&filter=all">top-voted posts</a> on the EA Forum, they are about:</p><ol start=""><li><a href="https://forum.effectivealtruism.org/posts/cfdnJ3sDbCSkShiSZ/ea-and-the-current-funding-situation">Potential failure modes resulting from the influx of money into EA.</a></li><li><a href="https://forum.effectivealtruism.org/posts/HWaH8tNdsgEwNZu8B/free-spending-ea-might-be-a-big-problem-for-optics-and">High EA spending being a problem for optics and epistemics.</a></li><li><a href="https://forum.effectivealtruism.org/posts/xomFCNXwNBeXtLq53/bad-omens-in-current-community-building">Things current EA community-building efforts are doing wrong, and why this is especially worrying.</a></li><li><a href="https://forum.effectivealtruism.org/posts/KDjEogAqWNTdddF9g/long-termism-vs-existential-risk">Reasons why some key concepts in EA are used misleadingly and unnecessarily.</a></li><li><a href="https://forum.effectivealtruism.org/posts/n3WwTz4dbktYwNQ2j/critiques-of-ea-that-i-want-to-read">A list of critiques of EA that someone wants expanded.</a></li><li><a href="https://forum.effectivealtruism.org/posts/QFa92ZKtGp7sckRTR/my-mistakes-on-the-path-to-impact">A catalogue of personal mistakes that someone made while trying to do good</a> (the key one being that they focused too much on working only at EA organisations).</li><li><a 
href="https://forum.effectivealtruism.org/posts/bsE5t6qhGC65fEpzN/growth-and-the-case-against-randomista-development">An argument that standard EA ways of trying to help with developing country development are not as effective as other ways of helping.</a></li><li>And only in 8th place, something that isn't a critique of EA: <a href="https://forum.effectivealtruism.org/posts/cXBznkfoPJAjacFoT/are-you-really-in-a-race-the-cautionary-tales-of-szilard-and">a post about the historical case of early nuclear weapons researchers mistakenly assuming they were in a race, and implications for today's AI researchers</a></li> </ol> <p>(If you adjust upvotes on EA Forum posts to account for how active the forum was at the time, the most popular post of all time is <a href="https://forum.effectivealtruism.org/posts/FpjQMYQmS3rWewZ83/effective-altruism-is-a-question-not-an-ideology">Effective Altruism is a Question (not an ideology)</a>. It's not a critique, but it's also very revealing.)</p><p>Right now, there's <a href="https://forum.effectivealtruism.org/posts/8hvmvrgcxJJ2pYR4X/announcing-a-contest-ea-criticism-and-red-teaming">an active contest with $100k in prizes for the best critiques of EA</a>. This sort of stuff happens enough that Scott Alexander satirises it <a href="https://astralcodexten.substack.com/p/criticism-of-criticism-of-criticism">here</a>.</p><p>This might give the impression of EA as excessively-introspective and self-doubting. There is some truth to the introspectiveness part. However, the general EA attitude is also one of making bold (but reasoned) bets. 
Recall SBF's altruistically-motivated risk taking, or more generally the fact that <a href="https://www.openphilanthropy.org/research/hits-based-giving/">one of Open Philanthropy's foundational ideas</a> is to support reasonable-but-risky projects, or even more generally the way EA in general is set up around unconventional and ambitious attempts at doing good.</p><p>If I had to name the two most important obstacles to doing important things in the real world, they would be (1) reasoning poorly and not updating enough based on feedback/evidence, and (2) being too risk-averse and insufficiently ambitious. Some cultures, like the good parts of academia, do well on avoiding (1). Others - imagine for example gung-ho Silicon Valley tech entrepreneurs - do well on avoiding (2). Though EA culture varies a lot between places and organisations, on the whole it seems uniquely good at combining these two aspects.</p><p>There are differences in culture between different EA hubs/clusters. I mainly have experience of the UK (and especially Cambridge) cluster and the Bay Area one. In the Bay, there is significant overlap between the EA and Rationalist communities, whereas in the UK there's mainly just EA in my experience. The Bay also leans more AI-focused and maybe weirder on average (or perhaps it's just a European vs American culture thing), while in the UK there are many AI-focused people but also many focused on biological fields (biosecurity & alternative proteins) or policy.</p><h2 id="axes--trends">Axes & trends</h2><h3 id="long-termism-vs-near-termism">"Long-termism" vs "near-termism"</h3><p>In the history of EA, it's hard not to see an invasion of ideas from the planetary-scale futurism that people like Nick Bostrom and Eliezer Yudkowsky talked about, and Toby Ord (author of <i>The Precipice</i>) and Will MacAskill (about to drop <a href="https://www.whatweowethefuture.com/">a new book</a> on why we should prioritise the long-term future) increasingly focus on. 
Holden Karnofsky, who for a long time ran GiveWell, perhaps the most empirically-minded and global-health-focused EA organisation, is now co-CEO of Open Philanthropy, responsible specifically for the speculative futurist parts of Open Philanthropy's mission, and <a href="https://www.cold-takes.com/the-most-important-century-in-a-nutshell/">writes blog posts about the grand future of humanity and why the coming century may be especially critical</a> (though he is careful to say that he does not consider the other half of Open Philanthropy's work, or global health / animal welfare -focused charity more generally, unimportant).</p><p>Perhaps this makes sense. In the long run at least, it seems sensible to expect the largest-scale ideas to be the most important ones. The rate of technological progress, especially in AI, has also been shrinking just what "the long run" means when expressed in years.</p><p>The common labels applied to the two ends of the radical-future-technology-focused versus concrete-current-problem-focused axis are "long-termist" and "near-termist" respectively. The name "long-termist" comes from arguments that the key moral priority is making sure we get to a secure, sustainable, and flourishing future civilisation (since such a civilisation could be very large and long-lasting, and therefore enable an enormous amount of happiness and flourishing). However, the names are a bit misleading. 
All existential risk work is often lumped into the long-termist category, so we have "long-termist" AI safety people trying to prevent a catastrophe many of them think will probably happen in the next three decades if it happens at all, and "near-termist" global health and development people trying to help the development of countries over a century.</p><p>(Many also <a href="https://forum.effectivealtruism.org/posts/rFpfW2ndHSX7ERWLH/simplify-ea-pitches-to-holy-shit-x-risk">point out</a> that caring about existential risks does not require the long-termist philosophy.)</p><h3 id="frugality-vs-spending">Frugality vs spending</h3><p>The culture of the original Oxford cluster was very frugal, and focused on monetary donations. For example, after founding Giving What We Can (GWWC), Toby Ord <a href="https://www.bbc.co.uk/news/magazine-11950843">donated everything he earned above £ 18 000 to charity</a> (and has <a href="https://www.vox.com/future-perfect/21728925/charity-10-percent-tithe-giving-what-we-can-toby-ord">continued on a similar track</a> since then). Because of the low available funding, the focus was very much on marginal impact - trying to figure out what existing opportunity could best use one extra dollar.</p><p>Since then, the arrival of billionaires meant that funding worries went down.</p><p>(For example, "earning to give" has gone down a lot in <a href="https://80000hours.org/career-reviews/#our-priority-paths">80 000 Hours' career rankings</a>. 
This is the idea that deliberately going into a high-earning job (often in finance) and then donating a significant fraction of your salary to top charities is one of the most effective ways to do good, and a path that many pursued based on the recommendation by 80 000 Hours.)</p><p>The bottleneck has moved (or at least been widely perceived to move) from funding to the time of people working on the key problems; instead of focusing on where to allocate the marginal dollar, the focus has somewhat shifted to how to allocate the marginal minute of time. In particular, the core argument of "imagine how far this particular dollar could go if used to effectively improve health in developing countries" has been joined by the argument of "there are plausible civilisation-ending disasters that could happen in the coming decades and require hard work to solve; imagine how sad it would be if we failed to work fast enough because we didn't spend that one dollar".</p><p>As a concrete example, Redwood Research organised <a href="https://www.alignmentforum.org/posts/YgpDYjTx7DCEgziG5/apply-to-the-ml-for-alignment-bootcamp-mlab-in-berkeley-jan">a machine learning bootcamp aimed at upskilling people for AI safety jobs</a> in January 2022 (and will be running more in the future, something I strongly endorse). Thirty participants (including myself) were flown into Berkeley from around the world, and spent three weeks living in a hotel while taking daily high-reliability COVID tests that I'm pretty sure weren't entirely free (and of course spending the days programming hard and talking about AI alignment (and eating free snack bars at the office - or maybe that last part was just me)). This wasn't cheap, nor was it a typical way to spend charity money (Redwood is <a href="https://www.openphilanthropy.org/grants/redwood-research-general-support/">funded</a> by Open Philanthropy). 
But if <a href="https://www.metaculus.com/questions/3479/date-weakly-general-ai-system-is-devised/">prediction markets are right that generally-capable AI starts emerging around the end of this decade</a>, and you take one look at the current state of progress on the AI alignment problem, and you do happen to have access to funding - well, it would be sad if being too stingy is how our civilisation failed.</p><p>Concretely, to look at only one consequence, Redwood made several hires from the bootcamp, despite the fact that many of the participants (myself included) were still students or otherwise not looking for work. Given how difficult but important hiring is, especially for high-skill technical roles, and the serious possibility that organisations like Redwood making progress is important for solving AI safety problems that might play a big role in how the future of humanity shapes out, this seems like a win.</p><p>However, at the same time, it is of course worth keeping in mind that humans are pretty good at thinking to themselves "man, wouldn't it be great if people like me had lots of money?" This, as well as the PR and culture problems of having lots of money sloshing around, are discussed in many EA Forum posts. We already saw that <a href="https://forum.effectivealtruism.org/posts/cfdnJ3sDbCSkShiSZ/ea-and-the-current-funding-situation">this one</a> (by MacAskill) and <a href="https://forum.effectivealtruism.org/posts/HWaH8tNdsgEwNZu8B/free-spending-ea-might-be-a-big-problem-for-optics-and">this one</a> are, respectively, the first- and second-most upvoted posts of all time on the EA Forum.</p><p>Ultimately, the whole point of Effective Altruism is, well, being effective about altruism. 
Whether EA funders spend quickly or slowly, and whichever causes they target, if they fail to find the best opportunities to do good with money, they haven't succeeded - and they know it.</p><p>(It should be noted that the GWWC criterion of donating 10% of your income to charity is met by many EAs, including ones far in space or culture from the original Oxford cluster, and global health is a leading donation target.)</p><h3 id="thinking-vs-doing">Thinking vs doing</h3><p>The fact that there's more resources - including not just funding but also the time of talented people - also means that the focus is less on marginal impact. If you have £10 and an hour, then figuring out what existing opportunity has the best ratio of good stuff per dollar is the best bet. But if you have, say, £10 000 000 and ten thousand work hours, then there's also the option of starting new projects and organisations.</p><p>(A lot of the weirdness of EA thinking comes from its marginalist nature. The things that are most valuable per marginal unit of money/time/effort are generally the things that are most neglected, and neglected things tend to seem weird because, by definition, few people care about them. For example, the early EA focus basically completely eschewed developed country problems because per-dollar marginal cost-effectiveness was highest in poor countries; from the outside, this may look like a strangely harsh and idiosyncratic selection of causes. With increasing resources, it makes more sense to pursue larger-scale changes, and larger-scale changes sometimes look like more traditional and intuitive causes. 
For example, while developing country health and projects trying to improve the long-term future are Open Philanthropy's main focuses, they spend some of their massive budget on <a href="https://www.openphilanthropy.org/focus/criminal-justice-reform/">US criminal justice reform</a>, <a href="https://www.openphilanthropy.org/focus/land-use-reform/">land-use policy</a>, and <a href="https://forum.effectivealtruism.org/community">immigration policy</a>.) (Though note that <a href="https://forum.effectivealtruism.org/posts/h2N9qEbvQ6RHABcae/a-critical-review-of-open-philanthropy-s-bet-on-criminal">the effectiveness of the criminal justice program has come under criticism</a>.)</p><p>Since EA now has the resources to start many new organisations, there's also starting to be a shift from EA being very research-oriented to having more and more real-world projects. Even though one of the key EA insights is that doing good requires lots of careful thinking in addition to good intentions and execution ability, the ultimate metric of success is actually improving the world, and that takes steps that aren't just research. I think EA has some headwind to overcome here; as a movement inspired, started, and (early on) largely consisting of philosophers, it has been remarkably successful in appealing to philosophical people and researchers, but not entrepreneurs or operations people to the same extent. I think it is a very welcome trend that this is starting to shift.</p><h2 id="exciting-attempt-for-enabling-action-on-essential-activities">Exciting Attempt for Enabling Action on Essential Activities</h2><p>EA is definitely not ideal, and it is also not guaranteed to survive. Like any real-world community, it is not a timeless platonic ideal of pure perfection that burst into the world fully formed, but rather something with an idiosyncratic history, that consists of real people, and has certain biases and cultural oddities. 
Still, I think it is probably the most exciting and useful thing in the world to be engaged with.</p><h1>Information theory 3: channel coding</h1><p style="text-align: center;"><span style="font-size: x-small;">Posted 2022-06-25. 7.9k words, including equations (~41 minutes)</span> <br /></p><p> </p><p>We've looked at basic information theory concepts <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">here</a>, and at source coding (i.e. compressing data without caring about noise) <a href="https://www.strataoftheworld.com/2022/06/information-theory-2-source-coding.html">here</a>. Now we turn to channel coding.</p><p>The purpose of channel coding is to make information robust against any possible noise in the channel.</p><h2 id="noisy-channel-model">Noisy channel model</h2><p>The noisy channel model looks like the following:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVptKjd3xnPIq2F_8kfByNH3C96QL3mz0C3z-bXyTNEECMZHNxQwqHRusw6Mw5jxrNbT9k9L6OC8qFQuYkLr72mSoJiti9072A9B_HT6twHNku1gxJFIJ45WcEtJy7WuNMcr4MQNVZ7gi_KzuscQq9kcsTKQnbs9oAKN0oViBImC74qaxLxB273U_log/s1094/ArcoLinux_2022-06-25_19-19-53.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="239" data-original-width="1094" height="140" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVptKjd3xnPIq2F_8kfByNH3C96QL3mz0C3z-bXyTNEECMZHNxQwqHRusw6Mw5jxrNbT9k9L6OC8qFQuYkLr72mSoJiti9072A9B_HT6twHNku1gxJFIJ45WcEtJy7WuNMcr4MQNVZ7gi_KzuscQq9kcsTKQnbs9oAKN0oViBImC74qaxLxB273U_log/w640-h140/ArcoLinux_2022-06-25_19-19-53.png" width="640" /></a></div><p>The channel can be anything: electronic signals sent down a wire, messages sent by post, or the passage of time. 
What's important is that it is discrete (we will look at the continuous case later), and there are some transition probabilities from every symbol that can go into the channel to every symbol that can come out. Often, the set of symbols of the inputs is the same as the set of symbols of the outputs.</p><p>The capacity $$C$$ of a noisy channel is defined as $$$ C = \max_{p_x} I(X;Y) = \max_{p_x} \big(H(Y) - H(Y|X)\big). $$$ It's intuitive that this definition involves the mutual information $$I$$ (see <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">the first post for the definition and explanation</a>), since we care about how much information $$X$$ transfers to $$Y$$, and how much $$Y$$ tells us about $$X$$. What might be less obvious is why we take the maximum over possible input probability distributions $$p_x$$. This is because the mutual information $$I(X;Y)$$ depends on the probability distributions of $$X$$ and $$Y$$. We can only control what we send - $$X$$ - so we want to adjust that to maximise the mutual information. Intuitively, if you're typing on a keyboard with all keys working normally except the "i" key results in a random character being inserted, shifting your typing away from using the "i" key is good for information transfer. Better to wr1te l1ke th1s than to not be able to reliably transfer information.</p><p>However, the only real way to understand why this definition makes sense is to look at the noisy channel coding theorem. This theorem tells us, among other things, that for any rate (measured in bits per symbol) smaller than the capacity $$C$$, for a large enough code length we can get a probability of error as small as we like.</p><p>With noisy channels, we often work with <i>block codes</i>. The idea is that you encode some shorter sequence of bits as a longer sequence of bits, and if you've designed this well, it adds redundancy. 
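</p><p>As a concrete instance of the capacity definition above, here is a short worked example for a channel that the text has not introduced: the binary symmetric channel, assumed here to flip each transmitted bit independently with probability $$f$$. Write $$H_2(f) = -f \log_2 f - (1-f) \log_2 (1-f)$$ for the binary entropy function (notation introduced only for this example). The noise term $$H(Y|X) = H_2(f)$$ does not depend on the input distribution, and $$H(Y)$$ is at most 1 bit, which a uniform input achieves, so $$$ C = \max_{p_x} \big(H(Y) - H(Y|X)\big) = \max_{p_x} H(Y) - H_2(f) = 1 - H_2(f). $$$ For $$f = 0.1$$ we get $$H_2(0.1) \approx 0.469$$, so $$C \approx 0.531$$ bits per channel use; for $$f = 0.5$$ the output is independent of the input and $$C = 0$$.</p><p>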
An $$(n,k)$$ block code is one that replaces chunks of $$k$$ bits with chunks of $$n$$ bits.</p><h2 id="hamming-coding">Hamming coding</h2><p>Before we look at the noisy channel coding theorem, here's a simple code that is robust to errors: transmit every bit 3 times. Instead of sending 010, send 000111000. If the receiver receives 010111000, they can tell that bit 2 probably had an error, and should be a zero. The problem is that you triple your message length.</p><p>Hamming codes are a method for achieving the same - the ability to detect and correct single-bit errors, and the ability to detect but not properly correct two-bit errors - while sending a number of excess bits that grows only logarithmically with message length. For long enough messages, this is very efficient; if you're sending over 250 bits, it only costs you a 3% longer message to insure them against single-bit errors.</p><p>The catch is that the probability of having at most one error in a message declines exponentially with message length, so this is less impressive than it might sound at first.</p><p>The basic idea of most error correction codes is a parity bit. A parity bit $$b$$ is typically the XOR (exclusive-or) of a bunch of other bits $$b_1, b_2, \ldots$$, written $$b = b_1 + b_2 + \ldots$$ (we use $$+$$ for XOR because doing addition in base-2 while throwing away the carry is the same as taking the XOR). A parity bit over a set of bits $$B = \{b_1, b_2, \ldots\}$$ is 1 if the set of bits contains an odd number of 1s, and otherwise 0 (hence the word "parity").</p><p>Consider sending a 3-bit message where the first two bits are data and the third is a parity bit. If the message is 110, we check that, indeed, there's an even number of 1s among the data bits, so it checks out that the parity bit is 0. 
If the message were 111, we'd know that something had gone wrong (though we wouldn't be able to fix it, since it could have started out with any of 011, 101, or 110 and suffered a one-bit flip - and note that we can never entirely rule out that 000 flipped to 111, though since error probability is generally small in any case we're interested in, this would be extremely unlikely).</p><p>The efficiency of Hamming codes comes from the fact that we have parity bits that check other parity bits.</p><p>A $$(T, D)$$ Hamming code is one that sends $$T$$ bits in total of which $$D$$ are data bits and the remaining $$T - D$$ are parity bits. There exists a $$(2^m - 1, 2^m - m - 1)$$ Hamming code for each positive integer $$m$$. Note that $$m$$ is the number of parity bits.</p><p>The default way to construct a Hamming code is that the $$i$$th parity bit is in position $$2^{i-1}$$ (so positions 1, 2, 4, 8, and so on), and is set such that the parity of the bits whose position's binary representation has a 1 in the $$i$$th-last position is zero.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9rZ67sMsoFtVbSeQGWwqYtMO6tRF2T5cEw5RsqrtKog1ZTH4gbMk8QT79EeOQPwVTwaIXX3-9795wAGNmTrqH4v9lN4poBgxkbrod7stbG3-BTEHiNbslFs1Zje-6ox1_5kn0G9Wq3e3pu9dV5tOG2JaTjZs2asrT0ju_Ee5RkcnE7edyM7pQVIw-sw/s1033/ArcoLinux_2022-06-25_18-47-44.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="364" data-original-width="1033" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9rZ67sMsoFtVbSeQGWwqYtMO6tRF2T5cEw5RsqrtKog1ZTH4gbMk8QT79EeOQPwVTwaIXX3-9795wAGNmTrqH4v9lN4poBgxkbrod7stbG3-BTEHiNbslFs1Zje-6ox1_5kn0G9Wq3e3pu9dV5tOG2JaTjZs2asrT0ju_Ee5RkcnE7edyM7pQVIw-sw/w640-h226/ArcoLinux_2022-06-25_18-47-44.png" width="640" /></a></div><p>(Above, you see bits 1 through 15, with parity bits in positions 1, 2, 4, and 8. 
Underneath each bit, for every parity bit there is a 0 if that bit is not included in the parity set of that parity bit, and otherwise a 1. For example, since <code>b4</code> is set for bits 8-15, <code>b4</code> (which is itself bit 8) is a 1 if there's an odd number of 1s in bits 9-15 inclusive and otherwise 0. Note that the columns spell out the numbers 1 through 15 in binary.)</p><p>For example, a $$(7,4)$$ Hamming code for the 4 bits of data 0101 would first become $$$ \texttt{ b1 b2 0 b3 1 0 1} $$$ and then we'd set $$b_1 = 0$$ to make there be an even number of 1s across the 1st, 3rd, 5th, and 7th positions, set $$b_2 = 1$$ to do the same over the 2nd, 3rd, 6th, and 7th positions, and then finally set $$b_3 = 0$$ to do the same over the 4th, 5th, 6th, and 7th positions.</p><p>To correct errors, we have the following rule: sum up the positions of the parity bits that do not match. For example, if only parity bit 3 is set wrong relative to the rest of the message, you flip that bit itself (position 4); everything will be fine after we clear this false alarm. But if parity bit 2 is also set wrong, then you take their positions, 2 (for bit 2) and 4 (for bit 3) and add them to get 6, and flip the sixth bit to correct the error. This makes sense because the sixth bit is the only bit covered by both parity bits 2 and 3, and only parity bits 2 and 3.</p><p>Though the above scheme is elegant and extensible, it's possible to design other Hamming codes. The length requirements remain - the code is a $$(2^m - 1, 2^m - m - 1)$$ code if we allow $$m$$ parity bits - but we can assign any "domain" over the bits to each parity bit as long as each bit belongs to the domains of a unique set of parity bits.</p><h2 id="noisy-channel-coding-theorem">Noisy channel coding theorem</h2><p>We can measure any noisy channel code we choose based on two numbers. The first is its probability of error (denoted $$p_e$$). The second is its rate: how many bits of information are transferred for each symbol sent. 
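</p><p>To make these two numbers concrete, here is a minimal Python sketch that computes them for the two codes from the previous section. It is only an illustration under stated assumptions: the channel is taken to be a binary symmetric channel with flip probability <code>f</code>, the repetition code is decoded by majority vote, the Hamming code is assumed to decode a block correctly exactly when at most one of its bits was flipped, and the function names are made up for this example.</p>

```python
from math import comb


def repetition_rate_and_error(f, copies=3):
    """Rate and per-bit error of a repetition code with majority-vote decoding.

    Assumes a binary symmetric channel with flip probability f and an odd
    number of copies.
    """
    rate = 1 / copies
    # Majority vote fails when more than half of the copies are flipped.
    error = sum(
        comb(copies, j) * f**j * (1 - f) ** (copies - j)
        for j in range(copies // 2 + 1, copies + 1)
    )
    return rate, error


def hamming_rate_and_block_error(f, m=3):
    """Rate and per-block error of the (2^m - 1, 2^m - m - 1) Hamming code.

    Assumes the same channel, and that a block is decoded correctly
    exactly when zero or one of its bits were flipped.
    """
    n = 2**m - 1  # total bits per block
    k = n - m     # data bits per block
    block_ok = (1 - f) ** n + n * f * (1 - f) ** (n - 1)
    return k / n, 1 - block_ok


print(repetition_rate_and_error(0.01))
print(hamming_rate_and_block_error(0.01, m=3))
print(hamming_rate_and_block_error(0.01, m=8))
```

<p>With $$m = 8$$ the Hamming code sends 247 data bits in 255, which is where the "3% longer" figure earlier comes from, but its block error grows quickly with block length. Each code is one point in the space of (rate, error probability) pairs.</p><p>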
The three parts of the theorem combine to divide that space up into possible and impossible regions:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaR-atWmnPttu0ZXFtS2-0y3wxiPkw0DmZcP4S1U9KLhuz7Iw7SGCn_NNggZFpNKc5OBFkFL7eB29jIB3GXy7kMFVOncmVp1tTNafSdOGgDvYpf-GoOaMTDyjA5k0-RmbiwMeRitQJAR9IYWAqejEtnBXrtC1a-6a6gxzQr-JgyqsERmXXvPI-rhpMQ/s676/ArcoLinux_2022-06-25_18-49-59.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="483" data-original-width="676" height="458" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaR-atWmnPttu0ZXFtS2-0y3wxiPkw0DmZcP4S1U9KLhuz7Iw7SGCn_NNggZFpNKc5OBFkFL7eB29jIB3GXy7kMFVOncmVp1tTNafSdOGgDvYpf-GoOaMTDyjA5k0-RmbiwMeRitQJAR9IYWAqejEtnBXrtC1a-6a6gxzQr-JgyqsERmXXvPI-rhpMQ/w640-h458/ArcoLinux_2022-06-25_18-49-59.png" width="640" /></a></div><p>The first part of the theorem says that the region marked "I" is possible. Now there are points of this region that are more interesting than others. Yes, we can make a code that has a rate of 0 and a very high error rate; just send the same symbol all the time. This is point (a), and we don't care about it.</p><p>What's more interesting, and not at all intuitively obvious, is that we can get to a point (b): an arbitrarily low error rate, despite the fact that we're sending information. The maximum information rate we can achieve while keeping the error probability very low turns out to be the capacity, $$C = \max_{p_X} I(X;Y)$$.</p><p>The second part of the theorem gives us a lower bound on error rate if we dare try for a rate that is greater than the capacity. 
It tells us we can make codes that achieve point (c) on the graph.</p><p>Finally, the third part of the theorem proves that we can't get to points like (x), which have an error rate that is too low given how much over the channel capacity their rate is.</p><p>We started the <a href="https://www.strataoftheworld.com/2022/06/information-theory-2-source-coding.html">proof of the source coding theorem</a> by considering a simple construction (the $$\delta$$-sufficient subset) first for a single character and then extending it to blocks. We're going to do something similar now.</p><h3 id="noisy-typewriters">Noisy typewriters</h3><p>A noisy typewriter over the alphabet $$\{0, \ldots, n - 1\}$$ is a device where if you press the key for $$i$$, it outputs one of the following with equal probability:</p><ul><li>$$i - 1 \mod n$$</li><li>$$i \mod n$$ </li><li>$$i + 1 \mod n$$</li></ul><p>With a 6-symbol alphabet, we can illustrate its transition probability matrix as a heatmap:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwc4Mgk4fp9WF6Aal37OpOl_PgfcM8JwarOSEoWOaoaXR9xF_lG3b0Y6jCjZDFu6eaYZDX2FgVxiM-8Mg0xwNParK7KK50qDPhRd4swroReHON_8C1myzfe4Xobx7RHBMipN-SAHqVoDtjB32Nd8HW1wADPLJFUGJgAq624SeD-H4A5w69T0qQwjVQag/s1107/ArcoLinux_2022-06-25_18-52-30.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1065" data-original-width="1107" height="385" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwc4Mgk4fp9WF6Aal37OpOl_PgfcM8JwarOSEoWOaoaXR9xF_lG3b0Y6jCjZDFu6eaYZDX2FgVxiM-8Mg0xwNParK7KK50qDPhRd4swroReHON_8C1myzfe4Xobx7RHBMipN-SAHqVoDtjB32Nd8HW1wADPLJFUGJgAq624SeD-H4A5w69T0qQwjVQag/w400-h385/ArcoLinux_2022-06-25_18-52-30.png" width="400" /></a></div><p>The colour scale is blue (low) to yellow (high). 
The reading order is meant to be that each column represents the probability distribution of output symbols given an input symbol.</p><p>First, can we transmit information without error at all? Yes: choose a code where you only send the symbols corresponding to the second and fifth columns. Based on the heatmap, these can map to symbols number 1-3 and 4-6 respectively; there is no possibility of confusion. The cost is that instead of being able to send one of six symbols, or $$\log 6$$ bits of information per symbol, we can now only send one of two, or $$\log 2 = 1$$ bit of information per symbol.</p><p>The capacity is $$\max_{p_X} \big( H(Y) - H(Y|X) \big)$$. Now if $$p_X$$ is the distribution we considered above - assigning half the probability to 2 and half to 5 - then by the transition matrix we see that $$Y$$ will be uniformly distributed, so $$H(Y)$$ is $$\log 6$$. $$H(Y|X)$$ is $$\log 3$$ in our example code, because we see that if we always send either symbol 2 or 5, then in both cases $$Y$$ is restricted to a set of 3 equally likely values. With some more work you can show that this is in fact an optimal choice of $$p_X$$. The capacity turns out to be $$\log 6 - \log 3 = \log 2 = 1$$ bit. The error probability is zero. We see that we can indeed transfer information without error even if we have a noisy channel.</p><p>But hold on, the noisy typewriter has a very specific type of error: there's an absolute certainty that if we transmit a 2 we can't get symbols 4-6 out, and so on. Intuitively, here we can partition the space of channel outputs in such a way that there is no overlap in the sets of which channel input each channel output could have come from. It seems like with a messier transition matrix that doesn't have this nice property, this just isn't true. 
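</p><p>Before turning to messier channels, the capacity calculation for the noisy typewriter can be checked numerically. The sketch below is an illustration only: the helper names are invented for this example, symbols are indexed from 0 (so "symbols 2 and 5" are indices 1 and 4), and it evaluates $$H(Y) - H(Y|X)$$ for a given input distribution rather than performing the maximisation itself.</p>

```python
from math import log2


def noisy_typewriter(n):
    """Transition matrix Q[y][x] = P(Y = y | X = x).

    Each column x is the output distribution for input x: the keys
    x - 1, x and x + 1 (mod n), each with probability 1/3.
    """
    Q = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for shift in (-1, 0, 1):
            Q[(x + shift) % n][x] = 1 / 3
    return Q


def mutual_information(Q, p_x):
    """I(X;Y) = H(Y) - H(Y|X) in bits, for input distribution p_x."""
    n_out, n_in = len(Q), len(p_x)
    p_y = [sum(Q[y][x] * p_x[x] for x in range(n_in)) for y in range(n_out)]
    h_y = -sum(p * log2(p) for p in p_y if p > 0)
    h_y_given_x = -sum(
        p_x[x] * Q[y][x] * log2(Q[y][x])
        for x in range(n_in)
        for y in range(n_out)
        if p_x[x] > 0 and Q[y][x] > 0
    )
    return h_y - h_y_given_x


Q = noisy_typewriter(6)
two_symbols = [0, 0.5, 0, 0, 0.5, 0]  # only ever send symbols 2 and 5
uniform = [1 / 6] * 6                 # send all six symbols equally often
print(mutual_information(Q, two_symbols))
print(mutual_information(Q, uniform))
```

<p>Both input distributions make $$Y$$ uniform and leave three equally likely outputs per input, so both give $$\log 6 - \log 3$$ bits; the two-symbol code has the extra advantage that the receiver can decode with no error at all. The worry raised above is what happens for a channel without this structure.</p><p>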
For example, what if we have a binary symmetric channel, with a transition matrix like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVw3TO5N1kZxnd4FvUVlFlix9C5meSAM-tqAk1E0CdfdZwJtcuEg0gucDW3Fhy34Ho4y7UXyJ_Qb8RjHSAm9TyW8wIH75eaOC5CSbhaqoxKLBSGOcUQpoy6fllPcbjufiPXJ2MX3cYWCKKEAWRuUFJQ3O7OHun6t_kHgJPxuEFPQzUXnqQ7J24eFZLeg/s1129/ArcoLinux_2022-06-25_18-54-33.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1059" data-original-width="1129" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVw3TO5N1kZxnd4FvUVlFlix9C5meSAM-tqAk1E0CdfdZwJtcuEg0gucDW3Fhy34Ho4y7UXyJ_Qb8RjHSAm9TyW8wIH75eaOC5CSbhaqoxKLBSGOcUQpoy6fllPcbjufiPXJ2MX3cYWCKKEAWRuUFJQ3O7OHun6t_kHgJPxuEFPQzUXnqQ7J24eFZLeg/s320/ArcoLinux_2022-06-25_18-54-33.png" width="320" /></a></div><p>Unfortunately the blue = lowest, yellow = highest color scheme is not very informative; the transition matrix looks like this, where $$p_e$$ is the probability of error: $$$ \begin{bmatrix} 1 - p_e & p_e \\ p_e & 1 - p_e \end{bmatrix} $$$ Here nothing is certain: a 0 can become a 1, and a 1 can become a zero.</p><p>However, this is what we get if we use this transition probability matrix on every symbol in a string of length 4, with the strings going in the order 0000, 0001, 0010, 0011, ..., 1111 along both the top and left side of the matrix:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7BcpxB7n2poMYg7-ptGda6MU4emL-BJSL83_TWfZYwbnaTdJQ0uB29zZwxjHfrieOKPatRc-Sry9P9QbWzBfo-zah1-LFd93v1KZOoaEfobiS_Pq4yiJXE-XoTVdJ01jXMGhhHHV-RSFwREWf0I86nJcxi-4Y7WeJOOVF9bYaBCm_sCuUnW1ugeIdAQ/s1024/ArcoLinux_2022-06-25_18-56-35.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1022" data-original-width="1024" height="399" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7BcpxB7n2poMYg7-ptGda6MU4emL-BJSL83_TWfZYwbnaTdJQ0uB29zZwxjHfrieOKPatRc-Sry9P9QbWzBfo-zah1-LFd93v1KZOoaEfobiS_Pq4yiJXE-XoTVdJ01jXMGhhHHV-RSFwREWf0I86nJcxi-4Y7WeJOOVF9bYaBCm_sCuUnW1ugeIdAQ/w400-h399/ArcoLinux_2022-06-25_18-56-35.png" width="400" /></a></div><p>For example, the second column shows the probabilities (blue = low, yellow = high) for what you get in the output channel if 0001 is sent as a message. The highest value is for the second entry, 0001, because we have $$p_e < 0.5$$ so $$p_e < 1 - p_e$$ so the single likeliest outcome is for no changes, which has probability $$(1-p_e)^4$$. The second highest values are for the first (0000), fourth (0011), sixth (0101), and tenth (1001) entries, since these all involve one flip and have probability $$p_e (1-p_e)^3$$ individually and probability $${4 \choose 1} p_e (1-p_e)^3 = 4 p_e (1 - p_e)^3$$ together.</p><p>If we increase the message length, the pattern becomes clearer; here's the equivalent diagram for messages of length 8:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjClJ6kMdUI2632p44ktR3yiQ-2mhzmRTNUohwUx1EBmTT4LNpzpIgtMgOoY3GgOLhs4IOdocUJbpz7Ep-Dm_0kZLATn_O_haiYViwEOD9JlhbYjv2jI7qZnWvesb0-el-eip6h42z47ALSveLDWhrglzQnMGBBNQ3Zp7wEWoAmbDwVnAD90tQ-CcSWug/s1022/ArcoLinux_2022-06-25_18-57-06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1020" data-original-width="1022" height="638" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjClJ6kMdUI2632p44ktR3yiQ-2mhzmRTNUohwUx1EBmTT4LNpzpIgtMgOoY3GgOLhs4IOdocUJbpz7Ep-Dm_0kZLATn_O_haiYViwEOD9JlhbYjv2jI7qZnWvesb0-el-eip6h42z47ALSveLDWhrglzQnMGBBNQ3Zp7wEWoAmbDwVnAD90tQ-CcSWug/w640-h638/ArcoLinux_2022-06-25_18-57-06.png" width="640" /></a></div><h3 id="the-return-of-the-typical-set">The Return of the Typical Set</h3><p>There are two key points.</p><p>The
first is that more and more of the probability is concentrated along the diagonal (plus some other diagonals further from the main diagonal). We can technically have any transformation, even 11111111 to 00000000, when we send a message through the channel, but most of these transformations are extremely unlikely. The transition matrix starts looking more and more like the noisy typewriter, where for each message only one subset of received messages has non-tiny likelihood.</p><p>The second key point is that it is time for ... the <i>return of the typical set</i>. Recall from the <a href="https://www.strataoftheworld.com/2022/06/information-theory-2-source-coding.html">second post in this series</a> that the $$\epsilon$$-typical set of length-$$n$$ strings over an alphabet $$A$$ is defined as $$$ T_{n\epsilon} = \left\{x^n \in A^n \text{ such that } \left|-\frac{1}{n} \log p(x^n) - H(X)\right| \le \epsilon\right\}. $$$ $$-\frac{1}{n} \log p(x^n)$$ is equal to $$-\frac{1}{n} \sum_{i=1}^n \log p(x_i)$$ by independence, and this in turn is an estimator for $$\mathbb{E}[-\log p(X)] = H(X)$$. 
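</p><p>To get a feel for this definition, here is a small Python sketch that computes the probability that a string of $$n$$ independent biased coin flips lands in the typical set. The setup is an assumption made for illustration (a Bernoulli source with $$P(1) = p$$, and a made-up function name), not something taken from the earlier post. It groups strings by their number of 1s, since all strings with the same count have the same probability.</p>

```python
from math import exp, lgamma, log, log2


def typical_set_probability(n, p, eps):
    """P(x^n in T_{n,eps}) for n i.i.d. Bernoulli(p) bits, p strictly between 0 and 1."""
    entropy = -(p * log2(p) + (1 - p) * log2(1 - p))
    total = 0.0
    for ones in range(n + 1):
        # -1/n log2 p(x^n) for any string containing this many 1s.
        empirical = -(ones * log2(p) + (n - ones) * log2(1 - p)) / n
        if abs(empirical - entropy) <= eps:
            # Binomial probability of exactly `ones` 1s, computed in log
            # space so that large n does not overflow or underflow.
            log_pmf = (
                lgamma(n + 1) - lgamma(ones + 1) - lgamma(n - ones + 1)
                + ones * log(p) + (n - ones) * log(1 - p)
            )
            total += exp(log_pmf)
    return total


for n in (10, 100, 1000):
    print(n, typical_set_probability(n, p=0.1, eps=0.1))
```

<p>The printed probabilities climb towards 1 as $$n$$ grows. The variable <code>empirical</code> is exactly $$-\frac{1}{n} \log p(x^n)$$, the quantity that estimates $$H(X)$$.</p><p>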
You can therefore read $$-\frac{1}{n}\log p(x^n)$$ as the "empirical entropy"; it's what we'd guess the (per-symbol) entropy of $$X$$ to be if we did a slightly weird thing of estimating the entropy while knowing the probability model but only using it to determine the information content $$-\log p$$, and estimating the $$p_i$$s in $$-\sum_i p_i \log p_i$$ instead by only using how often they occur in $$x^n$$ (rather than the probability model).</p><p>Now the big result about typical sets was that as $$n \to \infty$$, the probability $$P(x^n \sim X^n \in T_{n \epsilon}) \to 1$$, and therefore for large $$n$$, most of the probability mass is concentrated in the approximately $$2^{nH(X)}$$ strings of probability approximately $$2^{-nH(X)}$$ that lie in the typical set.</p><p>We can define a similar notion of jointly $$\epsilon$$-typical sets, denoted $$J_{n\epsilon}$$ and defined by analogy with $$T_{n\epsilon}$$ as $$$ J_{n\epsilon} = \left\{ (x^n, y^n) \in A^n \times A^n \text{ such that } \left| - \frac{1}{n} \log P(x^n, y^n) - H(X, Y)\right| \le \epsilon \right\}. $$$ Like typical sets, jointly typical sets give us some nice properties:</p><ol><li><p>If $$x^n, y^n$$ are drawn from the joint distribution (e.g. you first draw an $$x^n$$, then apply the transition matrix probabilities to generate a $$y^n$$ based on it), then the probability that $$(x^n, y^n) \in J_{n \epsilon}$$ goes to 1 as $$n \to \infty$$. The proof is almost the same as the corresponding proof for typical sets (hint: law of large numbers).</p></li><li><p>The number $$|J_{n\epsilon}|$$ of jointly typical sequence pairs $$(x^n, y^n)$$ is about $$2^{nH(X,Y)}$$, and specifically is upper-bounded by $$2^{n(H(X,Y) + \epsilon)}$$. The proof is the same as for the typical set case.</p></li><li><p>If $$x^n$$ and $$y^n$$ are <i>independently drawn</i> from the distributions $$p_X$$ and $$p_Y$$, the probability that they are jointly typical is about $$2^{-nI(X;Y)}$$. 
The specific upper bound is $$2^{-n(I(X;Y) - 3 \epsilon)}$$, and can be shown straightforwardly (remembering some of the identities in <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">post 1</a>) from $$$ P((x^n, y^n) \in J_{n \epsilon}) = \sum_{(x^n, y^n) \in J_{n\epsilon}} p(x^n) p(y^n)$$$ $$$\le |J_{n\epsilon}| 2^{-n(H(X) - \epsilon)} 2^{-n(H(Y) - \epsilon)}$$$ $$$ \le 2^{n(H(X,Y) + \epsilon)} 2^{-n(H(X) - \epsilon)} 2^{-n(H(Y) - \epsilon)}$$$ $$$= 2^{n(H(X,Y) - H(X) - H(Y) + 3 \epsilon)}$$$ $$$= 2^{-n(I(X;Y) - 3 \epsilon)} $$$</p></li></ol><p>Armed with this definition, we can now interpret what was happening in the diagrams above: as we increase the length of the messages, more and more of the probability mass is concentrated in jointly typical sequences, by the first property above. The third property tells us that if we ignore the dependence between $$x^n$$ and $$y^n$$ - picking a square roughly at random in the diagrams above - we are, however, extremely unlikely to pick a square corresponding to a jointly typical pair.</p><p>Here is the noisy typewriter for 6 symbols, for length-4 messages coming in and out of the channel:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghBh70zAlC9if1MiSrMZk9WRLa4SV5wbv9jyoc9ZeoTNb4r1lBPo_B7Usu_QFsRIAt53ktS-ep3_LJjvTs3fUWt9Ztcow4xfxo6sLFjj_oiT6HT_2imaW2s-FcjgIFL3SN5gXB4Gwvsch87akGUI9ipQbAqYrvdDXHXL07_iS_mUCmn0qWlbciAFjifA/s1030/ArcoLinux_2022-06-25_18-59-15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1030" data-original-width="1027" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghBh70zAlC9if1MiSrMZk9WRLa4SV5wbv9jyoc9ZeoTNb4r1lBPo_B7Usu_QFsRIAt53ktS-ep3_LJjvTs3fUWt9Ztcow4xfxo6sLFjj_oiT6HT_2imaW2s-FcjgIFL3SN5gXB4Gwvsch87akGUI9ipQbAqYrvdDXHXL07_iS_mUCmn0qWlbciAFjifA/w638-h640/ArcoLinux_2022-06-25_18-59-15.png" width="638" 
/></a></div><p>(As a reminder of the interpretation: each column represents the probability distribution, shaded blue to yellow, for one input message, and the $$6^4 = 1296$$ possible messages we have with this message length (4) and alphabet size (6) are ranked in alphabetical order along both the top and left side of the grid.)</p><p>The highest probability is still yellow, but you can barely see it. Most of the probability mass is in the medium-probability sequences (our jointly typical set), forming a small subset of the possible channel outputs for each input.</p><p>In the limit, therefore, the transition probability matrix for a block code of an arbitrary symbol transition probability matrix looks a lot like the noisy typewriter. This suggests a decoding method: if we see $$y^n$$, we decode it as $$x^n$$ if $$(x^n, y^n)$$ are in the jointly typical set, and there is no other $${x'}^n$$ such that $$({x'}^n, y^n)$$ are also jointly typical. As with the noisy typewriter example, we have to discard a lot of the $$x^n$$, so that the set of $$x^n$$ that a given $$y^n$$ could've come from hopefully contains only a single element, so we match the second condition in the decoding rule.</p><h3 id="theorem-outline">Theorem outline</h3><p>Now we will state the exact form of the noisy channel coding theorem. It has three parts:</p><ol><li><p>A discrete memoryless channel has a non-negative capacity $$C$$ such that for any $$\varepsilon > 0$$ and $$R < C$$, for large enough $$n$$ there's a block code of length $$n$$ and rate $$\geq R$$ and a decoder such that error probability is $$< \varepsilon$$.</p><p>We will see that this follows from the points about jointly typical sets and the decoding scheme based on them that we discussed above. The only thing really missing is an argument that the error rate of jointly typical coding can be made arbitrarily low as long as $$R < C$$. 
We will see that Shannon used perhaps the most insane trick in all of 20th century applied maths to side-step having to actually think of a specific code to prove this.</p></li><li><p>If error probability per bit $$p_e$$ is acceptable, rates up to $$$ R(p_e) = \frac{C}{1 - H_2(p_e)} $$$ are possible. We will prove this by using the decoding process to encode and the encoding process to decode.</p></li><li><p>For any $$p_e$$, rates $$> R(p_e)$$ are not possible.</p></li></ol><p>As we saw earlier, these three parts together divide up the space of possible rate-and-error combinations for codes into three regions: </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaR-atWmnPttu0ZXFtS2-0y3wxiPkw0DmZcP4S1U9KLhuz7Iw7SGCn_NNggZFpNKc5OBFkFL7eB29jIB3GXy7kMFVOncmVp1tTNafSdOGgDvYpf-GoOaMTDyjA5k0-RmbiwMeRitQJAR9IYWAqejEtnBXrtC1a-6a6gxzQr-JgyqsERmXXvPI-rhpMQ/s676/ArcoLinux_2022-06-25_18-49-59.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="483" data-original-width="676" height="458" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaR-atWmnPttu0ZXFtS2-0y3wxiPkw0DmZcP4S1U9KLhuz7Iw7SGCn_NNggZFpNKc5OBFkFL7eB29jIB3GXy7kMFVOncmVp1tTNafSdOGgDvYpf-GoOaMTDyjA5k0-RmbiwMeRitQJAR9IYWAqejEtnBXrtC1a-6a6gxzQr-JgyqsERmXXvPI-rhpMQ/w640-h458/ArcoLinux_2022-06-25_18-49-59.png" width="640" /></a></div><h3 id="proof-of-part-i-turning-noisy-channels-noiseless">Proof of Part I: turning noisy channels noiseless</h3><p>We want to prove that we can get an arbitrarily low error rate if the rate (bits of information per symbol) is smaller than the channel capacity, which we've defined as $$C = \max_{p_X} I(X;Y)$$.</p><p>We could do this by thinking up a code and then calculating the probability of error per length-$$n$$ block for it. 
This is hard though.</p><p>Here's what Shannon did instead: he started by considering a random block code, and then proved stuff about its average error.</p><p>What do we mean by a "random block code"? Recall that an $$(n,k)$$ block code is one that encodes length-$$k$$ message as length-$$n$$ messages. Since the rate $$r = \frac{k}{n}$$, we can talk about $$(n, nr)$$ block codes.</p><p>What the encoder is doing is mapping length-$$k$$ strings to length-$$n$$ strings. In the general case, it has some lookup table, with $$2^k = 2^{nr}$$ entries, each of length $$n$$. A "random code" means that we generate the entries of this lookup table from the distribution $$P(x^n) = \prod_{i=1}^n p(x_i)$$. We will refer to the encoder as $$E$$.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHLyOUVQSZXYKQMKosrwxek9NEqDIwpKXBiwIkB1rbKnacw8EjbCGc9mOp-C6c9U7wgb-w62IkzI3O64FKyqUlsyRPb9Asb7aJ3nvzUZF_-Ga6G65GV4iuYOmdl6xRbRhg5Nn8ilbCRrTitQ2O2BuwbWHDlPef24B1IwnbOIq08oAF1_q656BvGH3x5g/s905/ArcoLinux_2022-06-25_19-08-10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="445" data-original-width="905" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHLyOUVQSZXYKQMKosrwxek9NEqDIwpKXBiwIkB1rbKnacw8EjbCGc9mOp-C6c9U7wgb-w62IkzI3O64FKyqUlsyRPb9Asb7aJ3nvzUZF_-Ga6G65GV4iuYOmdl6xRbRhg5Nn8ilbCRrTitQ2O2BuwbWHDlPef24B1IwnbOIq08oAF1_q656BvGH3x5g/w640-h314/ArcoLinux_2022-06-25_19-08-10.png" width="640" /></a></div><p>(In the above diagram, the dots in the column represent probabilities of different outputs given the $$x^n$$ that is taken as input. 
Different values of $$w^k$$ would be mapped by the encoder to different columns $$x^n$$ in the square.)</p><p>Richard Hamming (yes, the Hamming codes person) mentions this trick in his famous talk <a href="https://www.cs.virginia.edu/~robins/YouAndYourResearch.pdf">"You and Your Research"</a>:</p><blockquote><p><i>Courage is one of the things that Shannon had supremely. You have only to think of his major theorem. He wants to create a method of coding, but he doesn't know what to do so he makes a random code. Then he is stuck. And then he asks the impossible question, "What would the average random code do?'' He then proves that the average code is arbitrarily good, and that therefore there must be at least one good code. Who but a man of infinite courage could have dared to think those thoughts?</i></p></blockquote><p>Perhaps it doesn't quite take infinite courage, but it is definitely one hell of a simplifying trick - and the remarkable thing is that it works.</p><p>Here's how: let the average probability of error in decoding one of our blocks be $$\bar{p_e}$$. If we have a message $$w^k$$, the steps that happen are:</p><ol><li>We use the (randomly-constructed) encoder $$E$$ to map it to an $$x^{n}$$ using $$x^n = E(w^k)$$. Note that the set of values that $$E(w^k)$$ can take, $$\text{Range}(E)$$, is a subset of the set of values of all possible $$x^n$$.</li><li>$$x^n$$ passes through the channel to become a $$y^n$$, according to the probabilities in a block transition probability matrix like the ones pictured above.</li><li>We guess that $$y^n$$ came from the $$x'^n \in \text{Range}(E)$$ such that the pair $$(x'^n, y^n)$$ is in the jointly typical set $$J_{n\epsilon}$$.<ol><li>If there isn't such an $$x'^n$$, we fail. In the diagram below, this happens if we get $$y_3$$, since $$\text{Range}(E) = \{x_1, x_2, x_3, x_4\}$$ does not contain anything jointly-typical with $$y_3$$.</li><li>If there is at least one wrong $$x'^n$$, we fail. 
In the diagram below, this happens if we get $$y_2$$, since both $$x_2$$ and $$x_3$$ are codewords the encoder might use that are jointly typical with $$y_2$$, so we don't know which one was originally transmitted over the channel.</li></ol></li><li>We use the decoder, which is simply the inverse of the encoder, to map to our guess $$\bar{w}^k$$ of what the original string was. Since $$x'^n \in \text{Range}(E)$$, the inverse of the encoder, $$E^{-1}$$, must be defined at $$x'^n$$. (Note that there is a chance, but a negligibly small one as $$n \to \infty$$, that in our encoder generation process we created the same codeword for two different strings, in which case the decoder can't be deterministic. We can say either: we don't care about this, because the probability of a collision goes to zero, or we can tweak the generation scheme to regenerate if there's a repeat; $$n \ge k$$ so we can always construct a repeat-free encoder.)</li></ol><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhsPj6X9tFdhcyWVSmCNXlAprBCMGK88hFKeTTC257jSp8XFjYf0Fgk-O6YWhEXC0BvG337MCkBQF1KIodnrWmX3iqSSWVGhkBkVUReJUfWg1f4G-6S--2iW5ydJZGHxU5HHo1gVOZUe6iWjDmUSz6sd6ugaISVnCpjWcUswvHq9OK7nusuDlaeb4LGg/s731/ArcoLinux_2022-06-25_19-12-49.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="715" data-original-width="731" height="626" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhsPj6X9tFdhcyWVSmCNXlAprBCMGK88hFKeTTC257jSp8XFjYf0Fgk-O6YWhEXC0BvG337MCkBQF1KIodnrWmX3iqSSWVGhkBkVUReJUfWg1f4G-6S--2iW5ydJZGHxU5HHo1gVOZUe6iWjDmUSz6sd6ugaISVnCpjWcUswvHq9OK7nusuDlaeb4LGg/w640-h626/ArcoLinux_2022-06-25_19-12-49.png" width="640" /></a></div><p>Therefore the two sources of error that we care about are:</p><ul><li><p>On step 3, we get a $$y^n$$ that is not jointly typical with the original $$x^n$$. 
Since $$P((x^n, y^n) \in J_{n\epsilon}) \geq 1 - \delta$$ for some $$\delta$$ that we can make arbitrarily small by increasing $$n$$, we can upper-bound this probability with $$\delta$$.</p></li><li><p>On step 3, we get a $$y^n$$ that is jointly typical with at least one wrong $$x'^n$$. We saw above that one of the properties of the jointly typical set is that if $$x^n$$ and $$y^n$$ are selected independently rather than together, the probability that they are jointly typical is at most $$2^{-n(I(X;Y) - 3 \epsilon)}$$. Therefore we can upper-bound this error probability by summing the probability of "accidental" joint-typicality over the $$2^k - 1$$ possible messages that are not the original message $$w^k$$. This sum is $$$ \sum_{w'^k \ne w^k} 2^{-n(I(X;Y) - 3 \epsilon)}$$$ $$$\le (2^{k} - 1) 2^{-n(I(X;Y) - 3 \epsilon)}$$$ $$$\le 2^{nr}2^{- n (I(X;Y) - 3 \epsilon)}$$$ $$$= 2^{nr - n(I(X;Y) - 3 \epsilon)} $$$</p></li></ul><p>We have the probabilities of two events, so the probability of at least one of them happening is smaller than or equal to their sum: $$$ \bar{p}_e \le \delta + 2^{nr - n(I(X;Y) - 3 \epsilon)} $$$ We know we can make $$\delta$$ however small we want. We can see that if $$r < I(X;Y) - 3 \epsilon$$, then the exponent is negative and increasing $$n$$ can also make the second term negligible. This is almost Part I of the theorem, which was:</p><blockquote><p>A discrete memoryless channel has a non-negative capacity $$C=\max_{p_X} I(X;Y)$$ such that for any $$\varepsilon > 0$$ and $$R < C$$, for large enough $$n$$ there's a block code of length $$n$$ and rate $$\geq R$$ and a decoder such that error probability is $$< \varepsilon$$.</p></blockquote><p>First, to put a bound involving only one constant on $$\bar{p}_e$$, let's arbitrarily say that we increase $$n$$ until $$2^{nr - n(I(X;Y) - 3 \epsilon)} \le \delta$$. 
Then we have $$$ \bar{p}_e \le 2 \delta $$$ Second, we don't care about average error probability over codes, we care about the existence of a single code that's good. We can realise that if the average error probability over codes is $$\le 2 \delta$$, there must exist at least one code, call it $$C^*$$, with average error probability $$\le 2 \delta$$.</p><p>Third, we don't care about average error probability over messages, but maximal error probability, so that we can get the strict $$< \varepsilon$$ error probability in the theorem. This is trickier to bound, since $$C^*$$ might somehow have very low error probability with most messages, but some insane error probability for one particular message.</p><p>However, here again Shannon jumps to the rescue with a bold trick: throw out half the codewords, specifically the ones with highest error probability. Since the average error probability is $$\le 2 \delta$$, every codeword in the best half of codewords must have error probability $$\le 4 \delta$$. Otherwise every codeword in the worst half would have error probability above $$4 \delta$$, and the worst half alone would contribute more than $$\frac{1}{2} \times 4 \delta = 2 \delta$$ to the average error.</p><p>What about the effect on our rate of throwing out half the codewords? Previously we had $$2^k = 2^{nr}$$ codewords; after throwing out half we have $$2^{nr - 1}$$, so our rate has gone from $$\frac{k}{n} = r$$ to $$\frac{nr - 1}{n} = r - \frac{1}{n}$$, a negligible decrease if $$n$$ is large.</p><p>What we now have is this: as $$n \to \infty$$, we can get any rate $$R < I(X;Y) - 3 \epsilon$$ with maximal error probability $$\le 4 \delta$$, and both $$\delta$$ and $$\epsilon$$ can be decreased arbitrarily close to zero by increasing $$n$$. Since we can set the distribution of $$X$$ to whatever we like (this is why it matters that we construct our random encoder by sampling from $$X$$ repeatedly), we can make $$I(X;Y) = \underset{p_X}{\max} I(X;Y)$$.</p><p>This is the first and most involved part of the theorem. 
It is also remarkably lazy: at no point do we have to go and construct an actual code, we just sit in our armchairs and philosophise about the average error probability of random codes.</p><h3 id="proof-of-part-ii-achievable-rates-if-you-accept-non-zero-error">Proof of Part II: achievable rates if you accept non-zero error</h3><p>Here's a simple code that achieves a rate higher than the capacity in a noiseless binary channel:</p><ol><li>The sender maps each length-$$nr$$ block to a block of length $$n$$ by cutting off the last $$nr - n$$ symbols.</li><li>The receiver reads $$n$$ symbols with error probability $$0$$, and then guesses the remaining $$nr - n$$ with bit error probability $$\frac{1}{2}$$ for each symbol. (Note: we're concerned with bit error here, unlike block error in the previous proof.)</li></ol><p>An intuition you should have is that if the probability of anything is concentrated in a small set of outcomes, you're not maximising the entropy (remember: <i>entropy is maximised by a uniform distribution</i>) and therefore also not maximising the information transfer. The above scheme concentrates a high probability of error in a small number of bits, while transmitting the rest with zero error - we should be able to do better.</p><p>It's not obvious how we'd start doing this. We're going to take some wisdom from the old proverb about hammers and nails, and note that the main hammer we've developed so far is a proof that we can send through the channel at a negligible error rate by increasing the size of the message. Let's turn this hammer upside down: we're going to use the decoding process to encode and the encoding process to decode. Specifically, to map from length-$$n$$ strings to the smaller length-$$k$$ strings, we use the decoding process from before:</p><ol><li>Given an $$x^n$$ to encode, we find the $$x'^n \in \text{Range}(E)$$ such that the pair $$(x^n, x'^n)$$ is in the jointly typical set $$J_{n\epsilon}$$. 
(Jointly typical with respect to what joint distribution? That of length-$$n$$ strings before and after being passed through the channel (here we're assuming that the input and output alphabets are the same). However, note that nothing actually has to pass through a channel for us to use this.)</li><li>We use the inverse of the encoder, $$E^{-1}$$, to map $$x'^n$$ to a length-$$k$$ string $$w^k$$ ($$x'^n \in \text{Range}(E)$$ so this is defined).</li></ol><p>To decode, we use the encoder $$E$$, to get $$\bar{x}^n = E(w^k)$$.</p><p>We'll find the per-bit error rate, not the per-block error rate, so we want to know how many bits are changed on average under this scheme. We're still working with the assumption of a noiseless channel, so we don't need to worry about the noise in the channel, only the error coming from our lossy compression (which is based on a joint probability distribution coming from assuming some channel, however). </p><p>Assume the channel we are imagining has error probability $$p$$ when transmitting a symbol. Fix an $$x^n$$ and consider pairs $$(x^n, y^n)$$ in the jointly typical set. Most of the $$y^n$$ will differ from $$x^n$$ in approximately $$np$$ bits. Intuitively, this comes from the fact that for a binomial distribution, most of the probability mass is concentrated around the mean at $$np$$, and therefore the typical set contains mostly sequences with a number of errors close to this mean. Therefore, on average we should expect $$np$$ errors between the $$x^n$$ we put into the encoder and the $$x'^n$$ that it spits out. Since we assume no noise, the $$w^k = E^{-1}(x'^n)$$ we send through the channel comes back unchanged, and we can do $$E(w^k) = E(E^{-1}(x'^n)) = x'^n$$ to perfectly recover $$x'^n$$. Therefore the only error is the $$np$$ wrong bits, and therefore our per-bit error rate is $$p$$.</p><p>Assume that, used the right way around, we have a code that can achieve a rate of $$R' = k/n$$. 
This rate is $$$ R' = \max_{p_X} I(X;Y) = \max_{p_X} \big[ H(Y) - H(Y|X) \big]$$$ $$$= 1 - H_2(p) $$$ assuming a binary code and a binary symmetric channel, and where $$H_2(p)$$ is the entropy of a two-outcome random variable with probability $$p$$ of the first outcome, or $$$ H_2(p) = - p \log p - (1 - p) \log (1 - p). $$$ Now since we're using it backward, we map from $$n$$ to $$k$$ bits rather than $$k$$ to $$n$$ bits, and this code has rate $$$ \frac{1}{R'} = \frac{n}{k} = \frac{1}{1 - H_2(p)} $$$ What we can now do is make a code that works like the following:</p><ol><li>Take a length-$$n$$ block of input.</li><li>Use the compressor (i.e. the typical set decoder) to map it to a smaller length-$$k$$ block.</li><li>Use some noiseless channel code with capacity $$C$$.</li><li>Use the decompressor (i.e. the typical set encoder) to map the recovered length-$$k$$ blocks back to length-$$n$$ blocks.</li></ol><p>In step 4, we will on average see that the recovered input differs in $$np$$ places, for a bit error probability of $$p$$. And what is our rate? We assumed the standard noiseless channel code in the middle that transmits our compressed input had the maximum rate $$C$$. However, it is transmitting strings that have already been compressed by a factor of $$\frac{k}{n}$$, so the true rate is $$$ R = \frac{C}{1 - H_2(p)} = \frac{C}{1 + p \log p + (1 - p) \log (1 - p)} $$$ This gives us the second part of the theorem: given a certain rate $$R$$, we can transmit with any bit error probability $$p$$ high enough that $$R \le C / (1 - H_2(p))$$.</p><p>(Note that effectively $$0 \le p < 0.5$$, because if $$p > 0.5$$ we can just flip the labels on the channel and change $$p$$ to $$1 - p$$, and if $$p = 0.5$$ we're transmitting no information.)</p><h3 id="proof-of-part-iii-unachievable-rates">Proof of Part III: unachievable rates</h3><p>Note that the pipeline is a Markov chain (i.e. 
each step depends only on the previous step):</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVptKjd3xnPIq2F_8kfByNH3C96QL3mz0C3z-bXyTNEECMZHNxQwqHRusw6Mw5jxrNbT9k9L6OC8qFQuYkLr72mSoJiti9072A9B_HT6twHNku1gxJFIJ45WcEtJy7WuNMcr4MQNVZ7gi_KzuscQq9kcsTKQnbs9oAKN0oViBImC74qaxLxB273U_log/s1094/ArcoLinux_2022-06-25_19-19-53.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="239" data-original-width="1094" height="140" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVptKjd3xnPIq2F_8kfByNH3C96QL3mz0C3z-bXyTNEECMZHNxQwqHRusw6Mw5jxrNbT9k9L6OC8qFQuYkLr72mSoJiti9072A9B_HT6twHNku1gxJFIJ45WcEtJy7WuNMcr4MQNVZ7gi_KzuscQq9kcsTKQnbs9oAKN0oViBImC74qaxLxB273U_log/w640-h140/ArcoLinux_2022-06-25_19-19-53.png" width="640" /></a></div><p>Therefore, the data processing inequality applies (for more on that, search for "data" <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">here</a>). With one application we get $$$ I(w^k; \bar{w}^k) \le I(w^k; y^n) $$$ and with another $$$ I(w^k; y^n) \le I(x^n; y^n) $$$ which combine to give $$$ I(w^k; \bar{w}^k) \le I(x^n; y^n). $$$ By the definition of channel capacity, $$I(x^n; y^n) \le nC$$ (remember that the definition is about mutual information between $$X$$ and $$Y$$, so <i>per-symbol</i> information), and so given the above we also have $$I(w^k; \bar{w}^k) \le nC$$.</p><p>With a rate $$R$$, we send over $$nR$$ bits of information, but if the per-bit error probability is $$p$$, we can only receive $$nR (1 - H_2(p))$$ of those bits. Getting the per-bit error probability down to $$p$$ therefore requires $$I(w^k; \bar{w}^k)$$ to be at least $$nR(1 - H_2(p))$$, and so $$$ nR(1-H_2(p)) > nC $$$ is a contradiction, which implies that $$$ R > \frac{C}{1 - H_2(p)} $$$ is a contradiction. 
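</p><p>Parts II and III together put the boundary between achievable and unachievable at $$R = C / (1 - H_2(p))$$. As a rough numerical sketch of that boundary (the function names are made up for illustration, and logs are base 2 so everything is in bits):</p>

```python
from math import log2


def binary_entropy(p):
    """H_2(p) in bits, with the usual convention that 0 * log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def max_rate(capacity, p):
    """Largest rate R = C / (1 - H_2(p)) if bit error probability p is tolerated."""
    gap = 1 - binary_entropy(p)
    # At p = 0.5 nothing needs to get through, so any rate is "achievable".
    return float("inf") if gap <= 0 else capacity / gap


def min_bit_error(capacity, rate, tol=1e-12):
    """Smallest p in [0, 0.5) with max_rate(capacity, p) >= rate, by bisection."""
    if rate <= capacity:
        return 0.0
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if max_rate(capacity, mid) >= rate:
            hi = mid  # mid is tolerable; try a smaller error probability
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    for rate in (1.0, 1.5, 2.0, 4.0):
        print(f"R = {rate:.1f} x capacity  ->  p >= {min_bit_error(1.0, rate):.4f}")
```

<p>For a channel with capacity 1 bit per use, pushing the rate to about twice the capacity costs a bit error probability of roughly 0.11, since $$H_2(0.11) \approx 0.5$$.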
</p><h2 id="gaussian-channels">Continuous entropy and Gaussian channels</h2><p>And now, for something completely different.</p><p>We've so far talked only about the entropy of discrete random variables. However, there is a very common case of channel coding that deals with continuous random variables: sending a continuous signal, like sound.</p><p>So: forget our old boring discrete random variable $$X$$, and bring in a brand-new continuous random variable that we will call ... $$X$$. How much information do you get from observing $$X$$ land on a particular value $$x$$? You get infinite information, because $$x$$ is a real number with an endless sequence of digits; alternatively, the Shannon information is $$- \log p(x)$$, and the probability of $$X=x$$ is infinitesimally small for a continuous random variable, so the Shannon information is $$-\log 0$$ which is infinite. Umm.</p><p>Consider calculating the entropy for a continuous variable, which we will denote $$h(X)$$ to distinguish it from the discrete case, and define in the obvious way by replacing sums with integrals: $$$ h(X) = -\int_{-\infty}^\infty f(x) \log f(x) \mathrm{d}x $$$ where $$f$$ is the probability density function. If we actually derived this as the limit of the discrete entropy of $$X$$ chopped into ever-finer bins, we would get this integral plus a constant term that goes to infinity.</p><p>As principled mathematicians, we might be concerned about this. But we can mostly ignore it, especially as the main thing we want is $$I(X;Y)$$, and $$$ I(X;Y) = h(Y) - h(Y|X) = -\int f_Y(y) \log f_Y(y) \mathrm{d}y + \iint f_{X,Y}(x,y) \log f_{Y|X=x}(y) \mathrm{d}x \mathrm{d}y $$$</p><p>where <i>mumble mumble</i> the infinities cancel out <i>mumble</i> opposite signs <i>mumble</i>.</p><h3 id="signals">Signals</h3><p>With discrete random variables, we generally had some fairly obvious set of values that they could take. With continuous random variables, we usually deal with an unrestricted range - a radio signal could technically be however low or high. 
However, step down from abstract maths land, and you realise reality isn't as hopeless as it seems at first. Emitting a radio wave, or making noise, takes some input of energy, and the source has only so much power.</p><p>For waves (like radio waves and sound waves), power is proportional to the square of the amplitude of a wave. The variance $$\mathbb{V}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2] = \int f(x) (x - \mathbb{E}[X])^2 \mathrm{d}x$$ of a continuous random variable $$X$$ with probability density function $$f$$ is just the expected squared difference between the value and its mean. Both of these quantities are squaring a difference. It turns out that the power of our source and the variance of the random variable that represents it are proportional.</p><p>Our model of a continuous noisy channel is one where there's an input signal $$X$$, a source of noise $$N$$, and an output signal $$Y = X + N$$. As usual, we want to maximise the channel capacity $$C = \max_{p_X} I(X;Y)$$, which is done by maximising $$$ I(X;Y) = h(Y) - h(Y|X). $$$ Because noise is generally the sum of a bunch of small contributing factors in each direction, the noise follows a normal distribution with variance $$\sigma_N^2$$. Because the only source of uncertainty is $$N$$ and this has the same distribution regardless of $$X$$, $$h(Y|X)$$ depends only on $$N$$ and not at all on $$X$$, so the only thing we can affect is $$h(Y)$$.</p><p>Therefore, the question of how you maximise channel capacity turns into a question of how to maximise $$h(Y)$$ given that $$Y = X + N$$ with $$N \sim \mathcal{N}(0, \sigma_N^2)$$. If we were working without any power/variance constraints, we'd already know the answer: just make $$X$$ such that $$Y$$ is a uniform distribution (which in this case would mean making $$Y$$ a uniform distribution over all real numbers, something that's clearly a bit wacky). 
However, we have a constraint on power and therefore the variance of $$X$$.</p><p>If we were to do some algebra involving Lagrange multipliers, we would eventually find that we want the distribution of $$X$$ to be a normal distribution. A key property of normal distributions is that if $$X \sim \mathcal{N}(0, \sigma_X^2)$$ (assume the mean is 0; note you can always shift your scale) and $$N \sim \mathcal{N}(0, \sigma_N^2)$$ are independent, then $$X + N \sim \mathcal{N}(0, \sigma_X^2 + \sigma_N^2)$$. Therefore the basic principle behind efficiently transmitting information using a continuous signal is that you want to transform your input to follow a normal distribution.</p><p>If you do, what do you get? Start with $$$ I(X;Y) = h(Y) - h(Y|X) $$$ and now use the "standard" integral that $$$ \int f(z) \log f(z) \mathrm{d}z = -\frac{1}{2} \log (2 \pi e \sigma^2) $$$ if $$z$$ is drawn from a distribution $$\mathcal{N}(0, \sigma^2)$$ with density $$f$$, and therefore $$$ \max I(X;Y) = C = \frac{1}{2} \log (2 \pi e (\sigma_X^2 + \sigma_N^2)) - \frac{1}{2} \log (2 \pi e \sigma_N^2) $$$ using the fact that $$h(Y|X) = h(N)$$ since the information content of the noise is all that is unknown about $$Y$$ if we're given $$X$$, and the property of normal distributions mentioned above. We can do some algebra to get the above into the form $$$ C = \frac{1}{2} \log \left(\frac{2 \pi e (\sigma_X^2 + \sigma_N^2)}{2 \pi e \sigma_N^2}\right) = \frac{1}{2} \log \left( 1 + \frac{\sigma_X^2}{\sigma_N^2}\right) $$$ The variance is proportional to the power, so this can also be written in terms of power as $$$ C = \frac{1}{2} \log \left( 1 + \frac{S}{N}\right) $$$ if $$S$$ is the power of the signal and $$N$$ is the power of the noise. The units of capacity for the discrete case were bits per symbol; here they're bits per sample of the continuous signal, which is the continuous version of one use of the channel. 
A sanity check is that if $$S = 0$$, we transmit $$\frac{1}{2} \log (1) = 0$$ bits, which makes sense: if your signal power is 0, it has no effect, and no one is going to hear you.</p><p>An interesting consequence here is that increasing signal power only gives you a logarithmic improvement in how much information you can transmit. If you shout twice as loud, you can detect approximately twice as fine-grained peaks and troughs in the amplitude of your voice. However, this helps surprisingly little.</p><p>If you want to communicate at a really high capacity, there are better things you can do than shouting very loudly. You can decompose a signal into frequency components using the Fourier transform. If your signal consists of many different frequencies, you can effectively transmit a different amplitude on each of them at once. The range of frequencies that your signal can span over is called the bandwidth and is denoted $$W$$. A signal with bandwidth $$W$$ can carry $$2W$$ independent samples per second, so if you can make use of multiple frequencies, the capacity in bits per second becomes $$$ C = W \log \left(1 + \frac{S}{N}\right) $$$ Therefore if you want to transmit information, transmitting across a broad range of frequencies is much more effective than shouting loudly. 
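</p><p>To get a feel for how slowly the logarithm grows, here is a small sketch that evaluates the per-sample formula $$\frac{1}{2} \log (1 + S/N)$$ from above for a few power levels. The function name and the chosen powers are arbitrary, and the log is taken base 2 to get bits:</p>

```python
from math import log2


def capacity_per_sample(signal_power, noise_power):
    """C = 1/2 * log2(1 + S/N), in bits per use of the Gaussian channel."""
    return 0.5 * log2(1 + signal_power / noise_power)


if __name__ == "__main__":
    noise = 1.0
    for signal in (0.0, 1.0, 100.0, 200.0, 400.0):
        bits = capacity_per_sample(signal, noise)
        print(f"S/N = {signal / noise:6.1f}  ->  {bits:.3f} bits per sample")
```

<p>Once the signal is well above the noise, each doubling of the signal power buys only about half a bit per sample, whereas capacity grows in direct proportion to the bandwidth. 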
There's a metaphor here somewhere.</p>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-1697673368059564013.post-59038482154934691632022-06-25T18:39:00.004+01:002022-06-25T18:42:48.461+01:00Information theory 2: source coding<p style="text-align: center;"><span style="font-size: x-small;">6.9k words, including equations (~36min)</span> <br /></p><p> </p><p>In <a href="https://www.strataoftheworld.com/2022/06/information-theory-1.html">the previous post</a>, we saw the basic information theory model:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJXTmanOlPkocQ5FGq2OL6Tpxe-qlKxs3MIQ0zBnKTk2JvkkLshofA86XeZiNoaa64veATnEBMIfChv5OcUAD6QTPZEpRmtV2b_jhSb_8XDs9PYBAcOdAYmnKDrrrcAxbuXthKVax_gAacxX360xcDRrsLbxGEZdGKaHo24f7itvDpI9k-cbPBYoKHoQ/s1104/ArcoLinux_2022-06-02_12-57-01.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="205" data-original-width="1104" height="118" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJXTmanOlPkocQ5FGq2OL6Tpxe-qlKxs3MIQ0zBnKTk2JvkkLshofA86XeZiNoaa64veATnEBMIfChv5OcUAD6QTPZEpRmtV2b_jhSb_8XDs9PYBAcOdAYmnKDrrrcAxbuXthKVax_gAacxX360xcDRrsLbxGEZdGKaHo24f7itvDpI9k-cbPBYoKHoQ/w640-h118/ArcoLinux_2022-06-02_12-57-01.png" width="640" /></a></div><br /><p>If we have no noise in the channel, we don't need channel coding. 
Therefore the above model simplifies to</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1KF_NS28wo86Zclsq33a5hrIUMddggoRFHqCEAwffiTunltbEaON-d4I11qhEUFiu8ChkRqdkKC5f75leUGSkLq-Ysv7R_O2-QRIt-NMO42HyC13pVaMnninN6qyZMr4yIicxO5Iy9962Fmlt-Cczhh5tb2ye5rJPgOQNOdECo0LbnuzBNgjRnr1bzg/s716/ArcoLinux_2022-06-02_12-57-44.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="167" data-original-width="716" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1KF_NS28wo86Zclsq33a5hrIUMddggoRFHqCEAwffiTunltbEaON-d4I11qhEUFiu8ChkRqdkKC5f75leUGSkLq-Ysv7R_O2-QRIt-NMO42HyC13pVaMnninN6qyZMr4yIicxO5Iy9962Fmlt-Cczhh5tb2ye5rJPgOQNOdECo0LbnuzBNgjRnr1bzg/w640-h150/ArcoLinux_2022-06-02_12-57-44.png" width="640" /></a></div><p>and the goal is to minimise $$n$$ - that is, minimise the number of symbols we need to send - without needing to worry about being robust to any errors.</p><p>Here's one question to get started: imagine we're working with a compression function $$f_e$$ that acts on length-$$n$$ strings (that is, sequences of symbols) with some arbitrary alphabet size $$A$$ (that is, $$A$$ different types of symbols). Is it possible to build an encoding function $$f_e$$ that compresses every possible input? Clearly not; imagine that it took every length-$$n$$ string to a length-$$m$$ string using the same alphabet, with $$m < n$$. Then we'd have $$A^m$$ different available codewords that would need to code for $$A^n > A^m$$ different messages. By the pigeonhole principle, there must be at least one codeword that codes for more than one message. 
But that means that if we see this codeword, we can't be sure what it codes for, so we can't recover the original with certainty.</p><p>Therefore, we have a choice: either:</p><ul><li>do <i>lossy compression</i>, where every message shrinks in size but we can't recover information perfectly; or</li><li>do <i>lossless compression</i>, and hope that more messages shrink in size than expand in size.</li></ul><p>This is obvious with lossless compression, but applies to both: if you want to do them well, you generally need a probability model for what your data looks like, or at least something that approximates one.</p><h2 id="terminology">Terminology</h2><p>When we talk about a "code", we just mean something that maps messages (the $$Z$$ in the above diagram) to a sequence of symbols. A code is <b>nonsingular</b> if it associates every message with a unique code. </p><p>A <b>symbol code</b> is a code where each symbol in the message maps to a codeword, and the code of a message is the concatenation of the codewords of the symbols that it is made of.</p><p>A <b>prefix code</b> is a code where no codeword is a prefix of another codeword. They are also called <b>instantaneous codes</b>, because when decoding, you can decode a codeword to a symbol immediately when you reach a point where some prefix of the code corresponds to a codeword.</p><h2 id="useful-basic-results-in-lossless-compression">Useful basic results in lossless compression</h2><h3 id="kraft-s-inequality">Kraft's inequality</h3><p>Kraft's inequality states that a prefix code with an alphabet of size $$D$$ and codewords of lengths $$l_1, l_2, \ldots, l_n$$ satisfies $$$ \sum_{i=1}^n D^{-l_i} \leq 1, $$$ and conversely that if there is a set of lengths $$\{l_1, \ldots, l_n\}$$ that satisfies the above inequality, there exists a prefix code with those codeword lengths. 
We will only prove the first direction: that all prefix codes satisfy the above inequality.</p><p>Let $$l = \max_i l_i$$ and consider the tree with branching factor $$D$$ and depth $$l$$. This tree has $$D^l$$ nodes on the bottom level. Each codeword $$x_1 x_2 \ldots x_c$$ is the node in this tree that you get to by choosing the $$d_i$$th branch on the $$i$$th level where $$d_i$$ is the index of symbol $$x_i$$ in the alphabet. Since it must be a prefix code, no node that is a descendant of a node that is a codeword can be a codeword. We can define our "budget" as the $$D^l$$ nodes on the bottom level of the tree, and define the "cost" of each codeword as the number of nodes on the bottom level of the tree that are descendants of its node (counting the node itself if it is on the bottom level). A codeword with length $$l$$ has cost 1, and in general a codeword at level $$l_i$$ has cost $$D^{l - l_i}$$. From this, and the prefix-freeness, we get $$$ \sum_i D^{l - l_i} \leq D^l $$$ which becomes the inequality when you divide both sides by $$D^l$$.</p><h3 id="gibbs-inequality">Gibbs' inequality</h3><p>Gibbs' inequality states that for any two probability distributions $$p$$ and $$q$$, $$$ -\sum_i p_i \log p_i \leq - \sum_i p_i \log q_i $$$ which can be written using the relative entropy $$D$$ (also known as the KL distance/divergence) as $$$ \sum_i p_i \log \frac{p_i}{q_i} = D(p||q) \geq 0. $$$ This can be proved using the <a href="https://en.wikipedia.org/wiki/Log_sum_inequality">log sum inequality</a>. The proof is boring.</p><h3 id="minimum-expected-length-of-a-symbol-code">Minimum expected length of a symbol code</h3><p>We want to minimise the expected length of our code $$C$$ for each symbol that $$X$$ might output. The expected length is $$L(C,X) = \sum_i p_i l_i$$. Now one way to think of what a length $$l_i$$ means is using the correspondence between prefix codes and binary trees discussed above. 
Given the prefix requirement, the higher the level in the tree (and thus the shorter the length of the codeword) the more other options we block out in the tree. Therefore we can think of the collection of lengths we assign to our codewords as specifying a rough probability distribution that assigns probability in proportion to $$2^{-l_i}$$. What we'll do is introduce a variable $$q_i$$ that measures the "implied probability" in this way (note the division by a normalising constant): $$$ q_i = \frac{2^{-l_i}}{\sum_j 2^{-l_j}} = \frac{2^{-l_i}}{z} $$$ where in the second step we've just defined $$z$$ to be the normalising constant. Now $$l_i = - \log (z q_i) = -\log q_i - \log z$$, so $$$ L(C,X) = \sum_i (-p_i \log q_i) - \log z $$$ Now we can apply Gibbs' inequality to know that $$\sum_i(- p_i \log q_i) \geq \sum_i (-p_i \log p_i)$$ and Kraft's inequality to know that $$\log z = \log \big(\sum_i 2^{-l_i} \big) \leq \log(1)=0$$, so we get $$$ L(C,X) \geq -\sum_i p_i \log p_i = H(X). $$$ Therefore the entropy (with base-2 $$\log$$) of a random variable is a lower bound on the expected length of a codeword (in a 2-symbol alphabet) that represents the outcome of that random variable. (And more generally, entropy with base-$$d$$ logarithms is a lower bound on the length of a codeword for the result in a $$d$$-symbol alphabet.)</p><h2 id="huffman-coding">Huffman coding</h2><p>Huffman coding is a very pretty concept.</p><p>We saw above that if you're making a random variable for the purpose of gaining the most information possible, you should prepare your random variable to have a uniform probability distribution. 
This is because entropy is maximised by a uniform distribution, and the entropy of a random variable is the average amount of information you get by observing it.</p><p>The reason why, say, encoding English characters as 5-bit strings (A = 00000, B = 00001, ..., Z = 11001, and then use the remaining 6 codes for punctuation or cat emojis or whatever) is not optimal is that some of those 5-bit strings are more likely than others. On a symbol-by-symbol level, whether the first symbol is a 0 or a 1 is not equiprobable. To get an ideal code, each symbol we send should have equal probability (or as close to equal probability as we can get).</p><p>Robert Fano, of <a href="https://en.wikipedia.org/wiki/Fano%27s_inequality">Fano's inequality</a> fame, and Claude Shannon, of everything-in-information-theory fame, had tried to find an efficient general coding scheme in the early 1950s. They hadn't succeeded. Fano set the problem as an alternative to taking the final exam for his information theory class at MIT. David Huffman tried for a while, and had almost given up and started studying instead, when he came up with Huffman coding and quickly proved it to be optimal.</p><p>We want the first code symbol (a binary digit) to divide the space of possible message symbols (the English letters, say) in two equally-likely parts, the first two to divide it in four, the first three into eight, and so on. Now some message symbols are going to be more likely than others, so the codes for some symbols have to be longer. We don't want it to be ambiguous when we get to the end of a codeword, so we want a prefix-free code. 
Prefix-free codes with a size-$$d$$ alphabet can be represented as trees with branching factor $$d$$, where each leaf is one codeword:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5Yb7bY9lY4ykPbwxKQurPGfN2KW51vlyHu1c-1MWUNuUYkCXWRev6uCSZOKooetoenZPkNvf6O1Ygk-l3at3Gt4iBgfQJeyhx-XR_5t4ZmY5HUWYUh47CrBB5ka5WieNK4_ANcRPcdXRsAt8o3D1TNsZBQGQkuW_9J62iZ9hr41bi8T2961-xXCIvxQ/s788/ArcoLinux_2022-06-25_18-09-46.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="506" data-original-width="788" height="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5Yb7bY9lY4ykPbwxKQurPGfN2KW51vlyHu1c-1MWUNuUYkCXWRev6uCSZOKooetoenZPkNvf6O1Ygk-l3at3Gt4iBgfQJeyhx-XR_5t4ZmY5HUWYUh47CrBB5ka5WieNK4_ANcRPcdXRsAt8o3D1TNsZBQGQkuW_9J62iZ9hr41bi8T2961-xXCIvxQ/w640-h410/ArcoLinux_2022-06-25_18-09-46.png" width="640" /></a></div><p>Above, we have $$d=2$$ (i.e. binary), and six items to code for (<code>a</code>, <code>b</code>, <code>c</code>, <code>d</code>, <code>e</code>, and <code>f</code>), and six codewords with lengths of between 1 and 4 characters in the codeword alphabet.</p><p>Each codeword is associated with some probability. We can define the weight of a leaf node to be its probability (or just how many times it occurs in the data) and the weight of a non-leaf node to be the sum of the weights of all leaves that are downstream of it in the tree. For an optimal prefix-free code, all we need to do is make sure that each node has children that are as equally balanced in weight as possible.</p><p>The best way to achieve this is to work bottom-up. Start without any tree, just a collection of leaf nodes representing the symbols you want codewords for. 
Then repeatedly build a node uniting the two least-likely parentless nodes in the tree, until the tree has a root.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeWVP1zFZw9DdXXLhr5FiCaTY_sLjPbbEKv6w--pr3gYntbL-xWziuCOb8gz8u92YxSpF3EZWQ22-_3NSzzfen0a0qeO3rouiUPsvoOs_rvQQYFUj5DyANdtJtAPTUELkmIcuIc3lPaJegCj0ydkOV9gur3mKxIw9YxWwiOnMBXCWeWFfPTRHYPEJm1g/s729/ArcoLinux_2022-06-25_18-12-46.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="538" data-original-width="729" height="472" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeWVP1zFZw9DdXXLhr5FiCaTY_sLjPbbEKv6w--pr3gYntbL-xWziuCOb8gz8u92YxSpF3EZWQ22-_3NSzzfen0a0qeO3rouiUPsvoOs_rvQQYFUj5DyANdtJtAPTUELkmIcuIc3lPaJegCj0ydkOV9gur3mKxIw9YxWwiOnMBXCWeWFfPTRHYPEJm1g/w640-h472/ArcoLinux_2022-06-25_18-12-46.png" width="640" /></a></div><p>Above, the numbers next to the non-leaf nodes show the order in which the node was created. This set of weights on the leaf nodes creates the same tree structure as in the previous diagram.</p><p>(We could also try to work top-down, creating the tree from the root to the leaves rather than from the leaves to the root, but this turns out to give slightly worse results. Also the algorithm for achieving this is less elegant.)</p><h2 id="arithmetic-coding">Arithmetic coding</h2><p>The Huffman code is the best symbol code - that is, a code where every symbol in the message gets associated with a codeword, and the code for the entire message is simply the concatenation of all the codewords of its symbols.</p><p>Symbol codes aren't always great, though. Consider encoding the output of a source that has a lot of runs like "<code>aaaaaaaaaahaaaaahahahaaaaa</code>" (a source of such messages might be, for example, a transcription of what a student says right before finals). 
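</p><p>To see what a symbol code does with this message, here is a minimal sketch of the bottom-up Huffman construction, using a heap to find the two lightest parentless nodes. The helper names are invented for this example, and ties are broken arbitrarily, so which symbol gets which bit may differ from the diagrams:</p>

```python
import heapq
from collections import Counter
from itertools import count
from math import log2


def huffman_code(weights):
    """Build a binary Huffman code from a {symbol: weight} mapping."""
    if len(weights) == 1:
        # A lone symbol still needs one bit so that messages have a length.
        return {symbol: "0" for symbol in weights}
    tiebreak = count()  # stops the heap from ever comparing the dict payloads
    heap = [(w, next(tiebreak), {s: ""}) for s, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Unite the two least-likely parentless nodes under a new parent.
        w0, _, codes0 = heapq.heappop(heap)
        w1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]


def expected_length(code, weights):
    """Average codeword length, weighting each symbol by its probability."""
    total = sum(weights.values())
    return sum(weights[s] / total * len(code[s]) for s in code)


def entropy(weights):
    """Entropy in bits of the distribution implied by the weights."""
    total = sum(weights.values())
    return -sum(w / total * log2(w / total) for w in weights.values())


message = "aaaaaaaaaahaaaaahahahaaaaa"
weights = Counter(message)
code = huffman_code(weights)
encoded = "".join(code[s] for s in message)

if __name__ == "__main__":
    print(code, len(message), len(encoded))
    print(f"entropy: {entropy(weights):.3f} bits/symbol, "
          f"Huffman: {expected_length(code, weights):.3f} bits/symbol")
```

<p>With only two symbols the tree has a single split, so every symbol costs exactly one bit even though the entropy of this source is well below one bit per symbol. 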
The Huffman coding for this message is, for example, that "a" maps to a 0, and "h" maps to a 1, and you have achieved a compression of exactly 0%, even though intuitively those long runs of "a"s could be compressed.</p><p>One obvious thing you could do is run-length encoding, where long blocks of a character get compressed into a code for the character plus a code for how many times the character is repeated; for example the above might become "<code>10a1h5a1h1a1h1a1h5a</code>". However, this is only a good idea if there are lots of runs, and requires a bunch of complexity (e.g. your alphabet for the codewords must either be something more than binary, or else you need to be able to express things like lengths and counts in binary unambiguously, possibly using a second layer of encoding with a symbol code).</p><p>Another problem with Huffman codes is that the code is based on assuming an unchanging probability model across the entire length of the message that is being encoded. This might be a bad assumption if we're encoding, for example, long angry Twitter threads, where the frequency of exclamation marks and capital letters increases as the message continues. We could try to brute-force a solution, such as splitting the message into chunks and fitting a Huffman code separately to each chunk, but that's not very elegant. Remember how elegant Huffman codes feel as a solution to the symbol coding problem? We'd rather not settle for less.</p><p>The fundamental idea of arithmetic coding is that we send a number representing where on the cumulative probability distribution of all messages the message we want to send lies. This is a dense statement, so we will unpack it with an example. Let's say our alphabet is $$A = \{a, r, t\}$$. To establish an ordering, we'll just say we consider the alphabet symbols in alphabetical order. 
Now let's say our probability distribution for the random variable $$X$$ looks like the diagram on the left; then our cumulative probability distribution looks like the diagram on the right:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjywjIVu-7eMsCaKyb1GdbLQWnmArUkgLMlhXAxdUleRkJynbd2RErOQkH7Dm3h1Dcb0Q6ynn1G36oTJP-58fj_9Kkd5ryBn0AMThBKSqADP42dkEjPB6ln-lv-wLJ-pUYZIxn6V3zXBAK6zJQIAd-zPOWxvf1aI2nvMVVnse1QCc-WWwM3XQJ__JQeuw/s1038/ArcoLinux_2022-06-21_21-42-25.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="397" data-original-width="1038" height="244" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjywjIVu-7eMsCaKyb1GdbLQWnmArUkgLMlhXAxdUleRkJynbd2RErOQkH7Dm3h1Dcb0Q6ynn1G36oTJP-58fj_9Kkd5ryBn0AMThBKSqADP42dkEjPB6ln-lv-wLJ-pUYZIxn6V3zXBAK6zJQIAd-zPOWxvf1aI2nvMVVnse1QCc-WWwM3XQJ__JQeuw/w640-h244/ArcoLinux_2022-06-21_21-42-25.png" width="640" /></a></div><p>One way to specify which of $$\{a, r, t\}$$ we mean is to pick a number $$0 \leq c \leq 1$$, and then look at which range it corresponds to on the $$y$$-axis of the right-hand figure; $$0 \leq c < 0.5$$ implies $$a$$, $$0.5 \leq c < 0.7$$ implies $$r$$, and $$0.7 \leq c < 1$$ implies $$t$$. We don't need to send the leading 0 because it is always present, and for simplicity we'll transmit the following decimals in binary; 0.0 becomes "0", 0.5 becomes "1", 0.25 becomes "01", and 0.875 is "111". </p><p>Note that at this point we've almost reinvented the Huffman code. $$a$$ has the most probability mass and can be represented in one symbol. $$r$$ happens to be representable in one symbol ("1" corresponds to 0.5 which maps to $$r$$) as well even though it has the least probability mass, which is definitely inefficient but not too bad. $$t$$ takes 2: "11".</p><p>The real benefit begins when we have multi-character messages. 
The way we can do it is like this, recursively splitting the number range between 0 and 1 into smaller and smaller chunks:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNxucz-lsT05H5PjrTHMM3NELAQILz9mDGgl4ED_sVaBcqbEoBTTfjMvzYrcUMNywDcT-OzlniqA4RkS-toShHZBNYhFzn744YFxx0oPYVj-FOJKRsLlU28RvU5bID5019UBvjQwVzmgqlpOHbmC-fN2bvTfqhj81PN0w5qIDzKEkbsjpr4e6Rye56bA/s969/ArcoLinux_2022-06-21_21-43-17.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="349" data-original-width="969" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNxucz-lsT05H5PjrTHMM3NELAQILz9mDGgl4ED_sVaBcqbEoBTTfjMvzYrcUMNywDcT-OzlniqA4RkS-toShHZBNYhFzn744YFxx0oPYVj-FOJKRsLlU28RvU5bID5019UBvjQwVzmgqlpOHbmC-fN2bvTfqhj81PN0w5qIDzKEkbsjpr4e6Rye56bA/w640-h230/ArcoLinux_2022-06-21_21-43-17.png" width="640" /></a></div><p>We see possible numbers encoding "art", "rat", and "tar". Not only that, but we see that all messages we send are infinite in length, as we can just keep going down, adding more and more letters. At first this might seem like a great deal - send one number, get infinite symbols transmitted for free! However, there's a real difference between "art" and "artrat", so we want to be able to know when to stop as well.</p><p>A simple answer is that the message also includes some code encoding how many symbols to decode for. A more elegant answer is that we can keep our message as just one number, but extend our alphabet to include an end-of-message token. Note that even with this end-of-message token, it is still true that many characters of the message can be encoded by a single symbol of output, especially if some outcome is much more likely. 
For example, in the example below we need only one bit ("1", for the number 0.5) to represent the message "aaa" (followed by the end-of-message character):</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYblnLiDN2hhrmMo7XMRQtEuDiOl4TAO53XUxdao9FxGjuINwDQOj-YT7YU3Q857Vgj6_gxi9UHEHvMQGkgpKpxDiHRO06z1FF8zkbbaqdUtG-BhwmD_0Qv77pnPMTyh2w8YpVvyZWN_AJ7vpPxAJB9w46bNlYmaFebm9mW9-DgcVZMDnVn-1DpludLQ/s942/ArcoLinux_2022-06-21_21-44-10.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="327" data-original-width="942" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYblnLiDN2hhrmMo7XMRQtEuDiOl4TAO53XUxdao9FxGjuINwDQOj-YT7YU3Q857Vgj6_gxi9UHEHvMQGkgpKpxDiHRO06z1FF8zkbbaqdUtG-BhwmD_0Qv77pnPMTyh2w8YpVvyZWN_AJ7vpPxAJB9w46bNlYmaFebm9mW9-DgcVZMDnVn-1DpludLQ/w640-h222/ArcoLinux_2022-06-21_21-44-10.png" width="640" /></a></div><p>There are still two ways in which this code is underspecified.</p><p>The first is that we need to choose how much of the probability space to assign to our end-of-message token. The optimal value for this clearly depends on how long messages we will be sending.</p><p>The second is that even with the end-of-message token, each codeword is still represented by a range of values rather than a single number. Any of these are valid numbers to send, but we want to minimise the length, so therefore we will choose the number in this range that has the shortest binary representation.</p><p>Finally, what is our probability model? With the Huffman code, we either assume a probability model based on background information (e.g. 
we have the set of English characters, and we know the rough probabilities of them by looking at some text corpus that someone else has already compiled), or we fit the probability model based on the message we want to send - if 1/10th of all letters in the message are $$a$$s, we set $$p_a = 0.1$$ when building the tree for our Huffman code, and so on.</p><p>With arithmetic coding, we can also assume static probabilities. However, we can also do adaptive arithmetic coding, where we change the probability model as we go. A good way to do this is for our probability model to assume that the probability $$p_x$$ of the symbol $$x$$ after we have already processed text $$T$$ is $$$ p_x = \frac{\text{Count}(x, T) + 1}{\sum_{y \in A} \big(\text{Count}(y, T) + 1\big)}$$$ $$$= \frac{\text{Count}(x, T) + 1}{\sum_{y \in A} \big(\text{Count}(y, T)\big) + |A|} $$$ where $$A$$ is the alphabet, and $$\text{Count}(a, T)$$ simply returns the count of how many times the character $$a$$ occurs in $$T$$. Note that if we didn't have the $$+1$$ in the numerator and in the sum in the denominator, we would assign a probability of zero to anything we haven't seen before, and be unable to encode it.</p><p>(We can either say that the end-of-message token is in the alphabet $$A$$, or, more commonly, assign "probabilities" to all $$x$$ using the above formula and some probability $$p_{EOM}$$ to the end of message, and then renormalise by dividing all of them by $$1 + p_{EOM}$$.)</p><p>How do we decode this? At the start, the assumed distribution is simply uniform over the alphabet (except maybe for $$p_{EOM}$$). We can decode the first symbol using that distribution, then update the distribution and decode the next, and so on. It's quite elegant.</p><p>What isn't elegant is implementing this with standard number systems in most programming languages. 
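</p><p>As a minimal sketch of the adaptive scheme described above (not a production coder), the Python below sidesteps floating point entirely by using exact rationals from the standard-library <code>fractions</code> module. The choice to treat the end-of-message token as an ordinary alphabet symbol called <code>"EOM"</code>, and the names <code>adaptive_probs</code>, <code>encode_interval</code> and <code>decode</code>, are assumptions made for illustration.</p>

```python
from collections import Counter
from fractions import Fraction

# Assumption: EOM is treated as an ordinary symbol at the end of the ordering.
ALPHABET = ["a", "r", "t", "EOM"]

def adaptive_probs(counts):
    """Add-one model: p_x = (Count(x) + 1) / (total count + |A|)."""
    total = sum(counts[s] + 1 for s in ALPHABET)
    return {s: Fraction(counts[s] + 1, total) for s in ALPHABET}

def encode_interval(message):
    """Return (low, high) such that any number in [low, high) encodes message."""
    low, width = Fraction(0), Fraction(1)
    counts = Counter()
    for sym in list(message) + ["EOM"]:
        probs = adaptive_probs(counts)
        for s in ALPHABET:
            if s == sym:
                break
            low += width * probs[s]  # skip the slices of earlier symbols
        width *= probs[sym]          # narrow to the slice of this symbol
        counts[sym] += 1             # update the model after coding
    return low, low + width

def decode(x):
    """Invert encode_interval for a number x in [0, 1)."""
    low, width = Fraction(0), Fraction(1)
    counts = Counter()
    out = []
    while True:
        probs = adaptive_probs(counts)
        for s in ALPHABET:
            span = width * probs[s]
            if x >= low + span:
                low += span          # x lies beyond this symbol's slice
            else:
                break                # x lies inside the slice of s
        width = span
        if s == "EOM":
            return "".join(out)
        out.append(s)
        counts[s] += 1               # same update as the encoder

low, high = encode_interval("art")
print(low, high, decode(low))
```

<p>Any number in the returned interval identifies the message; choosing the one with the shortest binary expansion is left out of the sketch.</p><p>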
For any non-trivial message length, arithmetic coding is going to need very precise floating point numbers, and you can't trust floating point precision very far. You'll need some special system, likely an arbitrary-precision arithmetic library, to actually implement arithmetic coding.</p><h3 id="prefix-free-arithmetic-coding">Prefix-free arithmetic coding</h3><p>The above description of arithmetic coding is not a prefix-free code. We generally want prefix-free codes, in particular because it means we can decode the message symbol by symbol as it comes in, rather than having to wait for the entire message to come through. Note also that often in practice it is uncertain whether or not there are more bits coming; consider a patchy internet connection with significant randomness between packet arrival times.</p><p>The simple fix for this is that instead of encoding a number as <i>any</i> binary string that maps onto the right segment of the number line between 0 and 1, you impose an additional requirement on it: <i>whatever binary bits you add onto the number, it is still within the range</i>.</p><h2 id="lempel-ziv-coding">Lempel-Ziv coding</h2><p>Huffman coding integrated the probability model and the encoding. Arithmetic coding still uses an (at least implicit) probability model to encode, but in a way that makes it possible to update as we encode. Lempel-Ziv encoding, and its various descendants, throw away the entire idea of having any kind of (explicit) probability model. We will look at the original version of this algorithm.</p><h3 id="encoding">Encoding</h3><p>Skip all that Huffman coding nonsense of carefully rationing the shorter codewords for the most likely symbols, and simply decide on some codeword length $$d$$ and give every character in the alphabet a codeword of that length. 
If your alphabet is again $$\{a, r, t, \text{EOM}\}$$ (we'll include the end-of-message character from the start this time), and $$d = 3$$, then the codewords you define are literally as simple as $$$ a \mapsto 000 $$$ $$$r \mapsto 001 $$$ $$$t \mapsto 010 $$$ $$$ \text{EOM} \mapsto 011 $$$ If we used this code, it would be a disaster. We have four symbols in our alphabet, so the maximum entropy of the distribution is $$\log_2 4 = 2$$ bits, and we're spending 3 bits on each symbol. With this encoding, we increase the length by at least 50%. Instead of your compressed file being uploaded in 4 seconds, it now takes 6.</p><p>However, we selected $$d=3$$, meaning we have $$2^3 = 8$$ slots for possible codewords of our chosen constant length, and we've only used 4. What we'll do is follow these steps as we scan through our text:</p><ol><li>Read one symbol <i>past</i> the longest match between the following text and a string we've already defined a codeword for. Therefore what we now have is a string $$Cx$$, where we have a codeword for $$C$$ already, $$C$$ has length $$|C|$$, $$x$$ is a single character, and $$Cx$$ is a prefix of the remaining text.</li><li>Add the codeword for $$C$$ to the encoded message we're forming, to encode the first $$|C|$$ characters of the remaining text.</li><li>If there is space among the $$2^d$$ possible codewords we have available: let $$n$$ be the binary representation of the smallest possible codeword not yet assigned to a string, and define $$Cx \mapsto n$$ as a new codeword.</li></ol><p>Here is an example of the encoding process, showing the emitted codewords on the left, the original definitions on the top, the new definitions on the right, and the message down the middle:</p><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi33XVwUveeGSql8K9VJi_y7bZGa0TZ3UAKdPkkxbnXNZmakweKcmjdGHOBN1oPGSj0fxi3xtQcVDD-FT-XBEW6u18eKbVcZurVB9unqL3tHsSyYKb0mvpfpBRkDZttA1l9OgLlF2I0OFHawK8D2LnQP3M6cJZPHeJOTnSF0lV53ueCYE5m65t6h4U_4w/s892/ArcoLinux_2022-06-21_21-48-30.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="892" data-original-width="727" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi33XVwUveeGSql8K9VJi_y7bZGa0TZ3UAKdPkkxbnXNZmakweKcmjdGHOBN1oPGSj0fxi3xtQcVDD-FT-XBEW6u18eKbVcZurVB9unqL3tHsSyYKb0mvpfpBRkDZttA1l9OgLlF2I0OFHawK8D2LnQP3M6cJZPHeJOTnSF0lV53ueCYE5m65t6h4U_4w/w522-h640/ArcoLinux_2022-06-21_21-48-30.png" width="522" /></a></div><h3 id="decoding">Decoding</h3><p>A boring way to decode is to send the codeword list along with your message. The fun way is to reason it out as you go along, based on your knowledge of the above algorithm and a convention that lets you know which order the original symbols were added to the codeword list (say, alphabetically, so you know the three bindings in the top-left). 
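</p><p>Both the encoding steps and this reason-it-out style of decoding can be sketched in Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names are made up, every symbol is assumed to be a single character (so a stand-in such as <code>"$"</code> would have to play the role of the end-of-message token), and the alphabet is assumed to fit within the $$2^d$$ available codewords.</p>

```python
def lz_encode(text, alphabet, d):
    """Encode text as a list of d-bit codewords, growing the codebook as we go."""
    codes = {sym: i for i, sym in enumerate(alphabet)}
    out, i = [], 0
    while i != len(text):
        # Step 1: extend the match while the longer string is still in the codebook.
        j = i + 1
        while j != len(text) and text[i:j + 1] in codes:
            j += 1
        # Step 2: emit the codeword for the longest match C = text[i:j].
        out.append(format(codes[text[i:j]], f"0{d}b"))
        # Step 3: if a slot is free, bind Cx to the smallest unused codeword.
        if j != len(text) and len(codes) != 2 ** d:
            codes[text[i:j + 1]] = len(codes)
        i = j
    return out

def lz_decode(codewords, alphabet, d):
    """Rebuild the codebook on the fly; only the alphabet order is shared."""
    if not codewords:
        return ""
    entries = list(alphabet)  # list index == codeword value
    prev = entries[int(codewords[0], 2)]
    out = [prev]
    for word in codewords[1:]:
        k = int(word, 2)
        # k == len(entries) means the encoder used the entry it had only just
        # defined, which must be prev followed by prev's own first symbol.
        cur = prev + prev[0] if k == len(entries) else entries[k]
        if len(entries) != 2 ** d:
            entries.append(prev + cur[0])  # the binding the encoder made one step earlier
        out.append(cur)
        prev = cur
    return "".join(out)
```

<p>The decoder is always one binding behind the encoder, which is why it needs the special case for a codeword it has not yet defined.</p><p>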
An example of decoding the above message:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigucPM68ZDJpTmvknmmUEQkRAdUye2kMM3pMp_Ucgqg1xL6UDKXGWFyW4FCb-V_K4dMRlTrypPUZJQ6KFkU9pVU80bG2vy7oDBJP4H-Nq_WGu9WKjnZy8EhqVuMhpW4g8bVeQVTwRQ2xNAxydawUm9kACV9ADKU6OUcUcn59jcphwa5p8zx-2iHiywDA/s1073/ArcoLinux_2022-06-21_21-48-58.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="476" data-original-width="1073" height="284" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigucPM68ZDJpTmvknmmUEQkRAdUye2kMM3pMp_Ucgqg1xL6UDKXGWFyW4FCb-V_K4dMRlTrypPUZJQ6KFkU9pVU80bG2vy7oDBJP4H-Nq_WGu9WKjnZy8EhqVuMhpW4g8bVeQVTwRQ2xNAxydawUm9kACV9ADKU6OUcUcn59jcphwa5p8zx-2iHiywDA/w640-h284/ArcoLinux_2022-06-21_21-48-58.png" width="640" /></a></div><h2 id="source-coding-theorem">Source coding theorem</h2><p>The source coding theorem is about lossy compression. It is going to tell us that if we can tolerate a probability of error $$\delta$$, and if we're encoding a message consisting of a lot of symbols, unless $$\delta$$ is very close to 0 (lossless compression) or 1 (there is nothing but error), it will take about $$H(X)$$ bits per symbol to encode the message, where $$X$$ is the random variable according to which the symbols in the message have been drawn. Since it means that entropy turns up as a fundamental and surprisingly constant limit when we're trying to compress our information, this further justifies the use of entropy as a measure of information.</p><p>We're going to start our attempt to prove the source coding theorem by considering a silly compression scheme. Observe that English has 26 letters, but the bottom 10 (Z, Q, X, J, K, V, B, P, Y, G) are slightly less than 10% of all letters. Why not just drop them? 
Everthn is still comprehensile without them, and ou can et awa with, for eample, onl 4 inary its per letter rather than 5, since ou're left with ust 16 letters.</p><p>Given an alphabet $$A$$ from which our random variable $$X$$ takes values, define the $$\delta$$-sufficient subset $$S_\delta$$ of $$A$$ to be the smallest subset of $$A$$ such that $$P(x \in S_\delta) \geq 1 - \delta$$ for $$x$$ drawn from $$X$$. For example, if $$A$$ is the English alphabet, and $$\delta = 0.1$$, then $$S_\delta$$ is the set of all letters except Z, Q, X, J, K, V, B, P, Y, and G, since the other letters have a combined probability of over $$1 - 0.1 = 0.9$$, and any other subset containing more than $$0.9$$ of the probability mass must contain more letters. </p><p>Note that $$S_\delta$$ can be formed by adding elements from $$A$$, in descending order of probability, into a set until the sum of probabilities of elements in the set reaches $$1 - \delta$$.</p><p>Next, define the essential bit content of $$X$$, denoted $$H_\delta(X)$$, as $$$ H_\delta(X) = \log_2 |S_\delta|. $$$ In other words, $$H_\delta(X)$$ is the answer to "how many bits of information does it take to point to one element in $$S_\delta$$ (without being able to assume the distribution is anything better than uniform)?". $$H_\delta(X)$$ for $$\text{English alphabet}_{0.1}$$ is 4, because $$\log_2 |\{E, T, A, O, I, N, S, H, R, D, L, U, C, M, W, F\}| = \log_2 16 = 4$$. 
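</p><p>The greedy construction of $$S_\delta$$ and the resulting $$H_\delta(X)$$ can be sketched in a few lines of Python. The function names and the toy distribution are hypothetical, chosen so that the floating-point sums are exact; real letter frequencies would work the same way, up to rounding.</p>

```python
import math

def sufficient_subset(probs, delta):
    """Smallest set of outcomes with total probability at least 1 - delta.

    Built greedily, most probable outcome first. Assumes delta is in [0, 1).
    """
    chosen, mass = [], 0.0
    for symbol, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        if mass >= 1 - delta:
            break
        chosen.append(symbol)
        mass += p
    return chosen

def essential_bit_content(probs, delta):
    """H_delta(X) = log2 |S_delta|."""
    return math.log2(len(sufficient_subset(probs, delta)))

# Hypothetical toy distribution (dyadic, so the running sums are exact).
toy = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(sufficient_subset(toy, 0.25), essential_bit_content(toy, 0.25))
print(sufficient_subset(toy, 0.0), essential_bit_content(toy, 0.0))
```

<p>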
It makes sense that this is called "essential bit content".</p><p>We can graph $$H_\delta(X)$$ against $$\delta$$ to get a pattern like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlGh7Wm3vmOUAG49ZlIxkZgz2q7ApF-QM7addOeJ5uEqx2P9kzx8sAGF_4BBp4o6me9Pg6NqzGgivCir-VKdWB-E2hdLzAYx6cOgd9v2-BQr8Emaat6joRPkDFPtEZcjnGvNVvegvOvaRVJCQaYZGI_WCjZkwoY356mqGwpVlmzHZWAPT-eO2yviId_A/s850/ArcoLinux_2022-06-21_22-01-58.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="636" data-original-width="850" height="478" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlGh7Wm3vmOUAG49ZlIxkZgz2q7ApF-QM7addOeJ5uEqx2P9kzx8sAGF_4BBp4o6me9Pg6NqzGgivCir-VKdWB-E2hdLzAYx6cOgd9v2-BQr8Emaat6joRPkDFPtEZcjnGvNVvegvOvaRVJCQaYZGI_WCjZkwoY356mqGwpVlmzHZWAPT-eO2yviId_A/w640-h478/ArcoLinux_2022-06-21_22-01-58.png" width="640" /></a></div><p>Where it gets more interesting is when we extend this definition to blocks. Let $$X^n$$ denote the random variable for a sequence of $$n$$ independent identically distributed samples drawn from $$X$$. We keep the same definitions for $$S_\delta$$ and $$H_\delta(X)$$; just remember that now $$S$$ is a subset of $$A^n$$ (where the exponent denotes Cartesian product of a set with itself; i.e. $$A^n$$ is all possible length-$$n$$ strings formed from that alphabet). 
In other words, we're throwing away the least common length-$$n$$ letter strings first; ZZZZ is out the window first if $$n = 4$$, and so on.</p><p>We can plot a similar graph as above, except we're plotting $$\frac{1}{n} H_\delta(X^n)$$ on the vertical axis to get per-symbol entropy, and there's a horizontal line around the entropy of English letter frequencies:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1TCFB5fGrb7BNMPfvRwGOtzbahwMwd1yAkZoAUGoP66dWZCj9VkwySIMiwUBETqS6PE1Ob7jvtA3ex0sS02vn_UXgQOD5NLWcp1czRfB55SWHswGlcn3zNeeb4w8n5usdVT6NJZZ52JoU5So4qf6HNzcMNbZcH-IUnU4TZ4k6sGD9zmv4aTfsJCdJxw/s883/ArcoLinux_2022-06-21_22-02-44.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="646" data-original-width="883" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1TCFB5fGrb7BNMPfvRwGOtzbahwMwd1yAkZoAUGoP66dWZCj9VkwySIMiwUBETqS6PE1Ob7jvtA3ex0sS02vn_UXgQOD5NLWcp1czRfB55SWHswGlcn3zNeeb4w8n5usdVT6NJZZ52JoU5So4qf6HNzcMNbZcH-IUnU4TZ4k6sGD9zmv4aTfsJCdJxw/s320/ArcoLinux_2022-06-21_22-02-44.png" width="320" /></a></div><p>(Note that the entropy per letter of English drops to only 1.3 bits if we stop modelling each letter as drawn independently from the others around it, and instead have a model with a perfect understanding of which letters occur together.)</p><p>The graph above shows the plot of $$\frac{1}{n}H_\delta(X^n)$$ against $$\delta$$ for a random variable $$X^n$$ for $$n=1$$ (blue), $$n=2$$ (orange), and $$n=3$$ (green). We see that as $$n$$ increases, the lines become flatter, and the middle portions approach the black line that shows the entropy of the English letter frequency distribution. 
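</p><p>For readers who want to reproduce the shape of these curves, here is a brute-force sketch in Python. It is an illustration under assumptions: it uses a hypothetical two-outcome source (a loaded coin) rather than English letter frequencies, and the function names are made up. It enumerates every length-$$n$$ block, keeps the most probable blocks until their mass reaches $$1 - \delta$$, and reports the essential bit content per symbol next to the entropy.</p>

```python
import itertools
import math

def per_symbol_essential_bits(probs, n, delta):
    """(1/n) * H_delta(X^n) for n i.i.d. draws, by exhaustive enumeration."""
    block_probs = sorted(
        (math.prod(probs[s] for s in block)
         for block in itertools.product(probs, repeat=n)),
        reverse=True,
    )
    mass, size = 0.0, 0
    for p in block_probs:
        if mass >= 1 - delta:
            break
        mass += p
        size += 1
    return math.log2(size) / n

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs.values())

# Hypothetical source: a coin loaded 3:1 (dyadic, so block sums are exact).
coin = {"h": 0.75, "t": 0.25}
for n in (1, 2, 4, 8):
    print(n, per_symbol_essential_bits(coin, n, 0.25), entropy(coin))
```

<p>The enumeration visits all $$|A|^n$$ blocks, so only small $$n$$ are practical this way.</p><p>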
What you'd see if we continued plotting this graph for larger values of $$n$$ (which might happen for example if you bought me a beefier computer) is that this trend continues; specifically, that there is a value $$n$$ large enough that the graph of $$\frac{1}{n}H_\delta(X^n)$$ is as close as we want to the black line for the entire length of it, except for an arbitrarily small part near $$\delta = 0$$ and $$\delta = 1$$. Mathematically, for any $$\epsilon > 0$$ and any $$0 < \delta < 1$$ there exists a positive integer $$n_0$$ such that for all $$n \geq n_0$$, $$$ \left| \frac{1}{n}H_\delta(X^n) - H(X) \right| < \epsilon. $$$ Now remember that $$\frac{1}{n}H_\delta(X^n)=\frac{1}{n}\log |S_\delta|$$ was the essential bit content per symbol, or, in other words, the number of bits we need per symbol to represent $$X^n$$ (with error probability $$\delta$$) in the simple coding scheme where we assign an equal-length binary number to each element in $$S_\delta$$ (but hold on: aren't there better codes than ones where all elements in $$S_\delta$$ get an equal-length representation? yes, but we'll see soon that not by very much). Therefore what the above equation is saying is that we can encode $$X^n$$ with error chance $$\delta$$ using a number of bits per symbol that differs from the entropy $$H(X)$$ by less than a small constant $$\epsilon$$. This is the source coding theorem. It is a big deal, because we've shown that entropy is related to the number of bits per symbol we need to do encoding in a lossy compression scheme.</p><p>(You can get to a similar result with lossless compression schemes where, instead of throwing away the ability to encode all sequences not in $$S_\delta$$ and just accepting the inevitable error, you instead have an encoding scheme where you reserve one bit to indicate whether or not an $$x^n$$ drawn from $$X^n$$ is in $$S_\delta$$, and if it is you encode it like above, and if it isn't you encode it using $$\log |A|^n$$ bits. 
Then you'll find that the probability of having to do the latter step is small enough that $$\log |A|^n > \log |S_\delta|$$ doesn't matter very much.)</p><h3 id="typical-sets">Typical sets</h3><p>Before going into the proof, it is useful to investigate what sorts of sequences $$x^n$$ we tend to pull out from $$X^n$$ for some $$X$$. The basic observation is that most $$x^n$$ are going to be neither the least probable nor the most probable out of all $$x^n$$. For example, "ZZZZZZZZZZ" would obviously be an unusual set of letters to draw at random if you're selecting them from English letter frequencies. However, so would "EEEEEEEEEE". Yes, this individual sequence is much more likely than "ZZZZZZZZZZ" or any other sequence, but there is only one of them, so getting it would still be surprising. To take another example, the typical sort of result you'd expect from a coin loaded so that $$P(\text{"heads"}) = 0.75$$ isn't runs of only heads, but rather an approximately 3:1 mix of heads and tails. </p><p>The distribution of letter counts follows a multinomial distribution (the generalisation of the binomial distribution). Therefore (if you think about what a multinomial distribution is, or if you know that the mean is $$n p_{x_i}$$ for the $$i$$th variable) in $$x^n$$ we'd expect roughly $$np_e$$ of the letter e, $$np_z$$ of the letter z, and so on - and $$np_e \ll n$$ even though $$p_e > p_L$$ for all other letters $$L$$ in the alphabet. Slightly more precisely (if you happen to know this fact), the variance of variable $$x_i$$ is $$np_{x_i}(1-p_{x_i})$$, implying that the standard deviation grows only in proportion to $$\sqrt{n}$$, so for large $$n$$ it is very rare to get an $$x^n$$ with counts of $$x_i$$ that differ wildly from the expected count $$np_{x_i}$$. </p><p>Let's define a notion of "typicality" for a sequence $$x^n$$ based on this idea of it being unusual if $$x^n$$ is either a wildly likely or wildly unlikely sequence. 
The most typical sequence has $$np_{x_i}$$ of each symbol $$x_i$$ in the alphabet, so has probability $$$ P(x^n) = p_{x_1}^{np_{x_1}}p_{x_2}^{np_{x_2}} \ldots p_{x_{|A|}}^{np_{x_{|A|}}} $$$ which in turn has a Shannon information content of $$$ -\log P(x^n) = -\sum_i np_{x_i} \log p_{x_i} = n H(X) $$$ Oh look, entropy pops up again. How surprising.</p><p>Now we make the following definition: a sequence $$x^n$$ is $$\epsilon$$-typical if its information content per symbol is $$\epsilon$$-close to $$H(X)$$, that is $$$ \left| - \frac{1}{n}\log{P(x^n)} - H(X) \right| <\epsilon. $$$ Define the typical set $$T_{n\epsilon}$$ to be the set of length-$$n$$ sequences (drawn from $$X^n$$) that are $$\epsilon$$-typical.</p><p>$$T_{n\epsilon}$$ is a small subset of the set $$A^n$$ of all length-$$n$$ sequences. We can see this through the following reasoning: for any $$x^n \in T_{n\epsilon}$$, $$-\frac{1}{n} \log P(x^n) \approx H(X)$$ which implies that $$$ P(x^n) \approx 2^{-nH(X)} $$$ and therefore that there can only be roughly $$2^{nH(X)}$$ such sequences; otherwise their probability would add up to more than 1. In comparison, the number of possible sequences $$|A^n| = 2^{n \log |A|}$$ is significantly larger unless $$X$$ is close to uniform, since $$H(X) \leq \log |A|$$ for any random variable $$X$$ with alphabet / outcome set $$A$$ (with equality if $$X$$ has a uniform distribution over $$A$$).</p><h3 id="the-typical-set-contains-most-of-the-probability">The typical set contains most of the probability</h3><p>Chebyshev's inequality states that $$$ P((X-\mathbb{E}[X])^2 \geq a) \leq \frac{\sigma^2}{a} $$$ where $$\sigma^2$$ is the variance of the random variable $$X$$, and $$a > 0$$. It is proved <a href="http://www.strataoftheworld.com/2021/01/data-science-2.html">here</a> (search for "Chebyshev").</p><p>Earlier we defined the $$\epsilon$$-typical set as $$$ T_{n\epsilon} = \left\{ x^n \in A^n \,\text{ such that } \, \left| -\frac{1}{n}\log P(x^n) - H(X) \right| < \epsilon \right\}. 
$$$ Note that $$$ \mathbb{E}\left[-\frac{1}{n}\log P(X^n)\right] = -\frac{1}{n} \sum_i \mathbb{E}[\log P(X_i)]$$$ $$$ = -\mathbb{E}[\log P(X_i)]$$$ $$$ = H(X_i) = H(X) $$$ by using independence of the $$X_i$$ making up $$X^n$$ (so that $$\log P(X^n) = \sum_i \log P(X_i)$$) together with linearity of expectation in the first step, the fact that all $$X_i$$ are draws of the same random variable $$X$$ (so every term in the sum is equal) in the second, and the definition of entropy in the third.</p><p>Therefore, we can now rewrite the typical set definition equivalently as $$$ T_{n\epsilon} = \left\{ x^n \in A^n \,\text{ such that } \, \left( -\frac{1}{n}\log P(x^n) - H(X) \right)^2 < \epsilon^2 \right\}$$$ $$$= \left\{ x^n \in A^n \,\text{ such that } \, \left( Y - \mathbb{E}[Y] \right)^2 < \epsilon^2 \right\} $$$ for $$Y = -\frac{1}{n} \log P(X^n)$$, which is in the right form to apply Chebyshev's inequality to get a probability of belonging to this set, except for the fact that the inequality sign is the wrong way around. Very well - we'll instead consider the set of sequences $$\bar{T}_{n\epsilon} = A^n - T_{n\epsilon}$$ (i.e. all length-$$n$$ sequences that are not typical), which can be defined as $$$ \bar{T}_{n \epsilon} = \left\{ x^n \in A^n \,\text{ such that } \, (Y - \mathbb{E}[Y])^2 \geq \epsilon^2 \right\} $$$ and use Chebyshev's inequality to conclude that $$$ P((Y - \mathbb{E}[Y])^2 \geq \epsilon^2) \leq \frac{\sigma_Y^2}{\epsilon^2} $$$ where $$\sigma_Y^2$$ is the variance of $$Y= -\frac{1}{n} \log P(X^n)$$. This is exciting - we have a bound on the probability that a sequence is not in the typical set - but we want to link this probability to $$n$$ somehow. Let $$Z = -\log P(X)$$, and note that $$Y$$ can be written as the average of many draws from $$Z$$. 
Therefore $$$ \mathbb{E}[Y] = \frac{1}{n} \sum_i \mathbb{E}[Z_i] = \mathbb{E}[Z] = H(X) $$$ and since $$Y = \frac{1}{n} \sum_i Z_i$$ with the $$Z_i$$ independent, the variance of $$Y$$, $$\sigma_Y^2$$, is equal to $$\frac{1}{n} \sigma_Z^2$$ (a basic law of how variance works that is often used in statistics). We can substitute this into the expression above to get $$$ P((Y-\mathbb{E}[Y])^2 \geq \epsilon^2) \leq \frac{\sigma_Z^2}{n\epsilon^2}. $$$ The probability on the left-hand side is identical to $$P((-\frac{1}{n} \log P(X^n) - H(X) )^2 \geq \epsilon^2)$$, which is the probability of the condition that $$X^n$$ is <i>not</i> in the $$\epsilon$$-typical set $$T_{n\epsilon}$$, which gives us our grand result $$$ P(X^n \in T_{n\epsilon}) \ge 1 - \frac{\sigma_Z^2}{n\epsilon^2}. $$$ $$\sigma_Z^2$$ is the variance of $$-\log P(X)$$; it depends on the particulars of the distribution and is probably hell to calculate. However, what we care about is that if we just crank up $$n$$, we can make this probability as close to 1 as we like, regardless of what $$\sigma_Z^2$$ is, and regardless of what we set as $$\epsilon$$ (the parameter for how wide the probability range for the typical set is).</p><p>The key idea is this: asymptotically, as $$n \to \infty$$, more and more of the probability mass of possible length-$$n$$ sequences is concentrated among those that have a probability of between $$2^{-n(H(X)+\epsilon)}$$ and $$2^{-n(H(X) - \epsilon)}$$, regardless of what (positive real) $$\epsilon$$ you set. 
This is known as the "asymptotic equipartition property" (it might be more appropriate to call it an "asymptotic approximately-equally-partitioning property" because it's not really an "equipartition", since depending on $$\epsilon$$ these can be very different probabilities, but apparently that was too much of a mouthful even for the mathematicians).</p><h3 id="finishing-the-proof">Finishing the proof</h3><p>As a reminder of where we are: we stated without proof $$$ \left| \frac{1}{n}H_\delta(X^n) - H(X) \right| < \epsilon $$$ and noted that this is an interesting result that also gives meaning to entropy, since we see that it's related to how many bits it takes for a naive coding scheme to express $$X^n$$ (with error probability $$\delta$$).</p><p>Then we went on to talk about typical sets, and ended up finding that the probability that an $$x^n$$ drawn from $$X^n$$ lies in the set $$$ T_{n \epsilon} =\left\{ x^n \in A^n \,\text{ such that } \, \left| -\frac{1}{n}\log P(x^n) - H(X) \right| < \epsilon \right\} $$$ approaches 1 as $$n \to \infty$$, despite the fact that $$T_{n\epsilon}$$ has only approximately $$2^{nH(X)}$$ members, which, for distributions of $$X$$ that are not very close to the uniform distribution over the alphabet $$A$$, is a small fraction of the $$2^{n \log |A|}$$ possible length-$$n$$ sequences.</p><p>Remember that $$H_\delta(X^n) = \log |S_\delta|$$, and $$S_\delta$$ was the smallest subset of $$A^n$$ such that it contains sequences whose probability sums to at least $$1 - \delta$$. This is a bit like the typical set $$T_{n\epsilon}$$, which also contains sequences making up most of the probability mass. Note that $$T_{n\epsilon}$$ is less efficient; $$S_\delta$$ optimally contains all sequences with probability greater than some threshold, whereas $$T_{n\epsilon}$$ generally omits the highest-probability sequences (settling instead for sequences of the same probability as most sequences that are drawn from $$X^n$$). 
Therefore $$$ H_\delta(X^n) \leq \log |T_{n\epsilon}| $$$ for an $$n$$ that depends on what $$\delta$$ and $$\epsilon$$ we want. Now we can get an upper bound on $$H_\delta(X^n)$$ if we can upper-bound $$|T_{n\epsilon}|$$. Looking at the definition, we see that the probability of a sequence $$x^n \in T_{n\epsilon}$$ must obey $$$ 2^{-n(H(X) + \epsilon)} < P(x^n) < 2^{-n(H(X) - \epsilon)}. $$$ $$T_{n\epsilon}$$ has the largest number of elements if all elements have the lowest possible probability $$p = 2^{-n(H(X)+\epsilon)}$$, and if that is the case it has at most $$1/p$$ of such lowest-probability elements since the probabilities cannot add to more than one, which implies $$|T_{n\epsilon}| < 2^{n(H(X)+\epsilon)}$$. Therefore $$$ H_\delta(X^n) \leq \log |T_{n\epsilon}| < \log(2^{n(H(X)+\epsilon)}) = n(H(X) + \epsilon) $$$ and we have a bound $$$ H_\delta(X^n) < n(H(X) + \epsilon). $$$ If we can now also find the bound $$n(H(X) - \epsilon) < H_\delta(X^n)$$, we've shown $$|\frac{1}{n} H_\delta(X^n) - H(X)| < \epsilon$$ and we're done. The proof of this bound is a proof by contradiction. Imagine that there is an $$S'$$ such that $$$ \frac{1}{n} \log |S'| \leq H(X) - \epsilon $$$ but also $$$ P(X^n \in S') \geq 1 - \delta. $$$ We want to show that $$P(X^n \in S')$$ can't actually be that large. For the other bound, we used our typical set successfully, so why not use it again? Specifically, write $$$ P(X^n \in S') = P(X^n \in S' \cap T_{n\varepsilon}) + P(X^n \in S' \cap \bar{T}_{n\varepsilon}) $$$ where $$\bar{T}_{n\varepsilon}$$ is again $$A^n - T_{n\varepsilon}$$, and noting that our constant $$\varepsilon$$ for $$T$$ is not the same as our constant $$\epsilon$$ in the bound. We want to set an upper bound on this probability; for that, we need to make the terms on the right-hand side as large as possible. For the first term, this is if $$S' \cap T_{n\varepsilon}$$ is as large as it can be based on the bound on $$|S'|$$, i.e. 
$$2^{n(H(X)-\epsilon)}$$, and each term in it has the maximum probability $$2^{-n(H(X)-\varepsilon)}$$ of terms in $$T_{n\varepsilon}$$. For the second term, this is if $$S' \cap \bar{T}_{n \varepsilon}$$ is restricted only by $$P(X^n \in \bar{T}_{n\varepsilon}) \leq \frac{\sigma_Z^2}{n\varepsilon^2}$$, which we showed above. (Note that you can't have both of these conditions holding at once, but this does not matter since we only want to show a non-strict inequality.) Therefore we get $$$ P(X^n \in S') \leq 2^{n(H(X) - \epsilon)} 2^{-n(H(X)-\varepsilon)} + \frac{\sigma_Z^2}{n\varepsilon^2} = 2^{-n(\epsilon - \varepsilon)} + \frac{\sigma_Z^2}{n\varepsilon^2} $$$ and we see that if we choose the typical-set constant so that $$0 < \varepsilon < \epsilon$$ (say $$\varepsilon = \epsilon / 2$$), then as we're dealing with the case where $$n \to \infty$$, this probability is going to go to zero in the limit. But we had assumed $$P(X^n \in S') \geq 1 - \delta$$ - so for large enough $$n$$ we have a contradiction unless we don't assume that, which means $$$ n(H(X) - \epsilon) < H_\delta(X^n). $$$ Combining this with the previous bound, we've now shown $$$ H(X) - \epsilon < \frac{1}{n} H_\delta(X^n) < H(X) + \epsilon $$$ which is the same as $$$ \left|\frac{1}{n}H_\delta(X^n) - H(X)\right| < \epsilon $$$ which is the source coding theorem that we wanted to prove.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-91311962977510142802022-06-20T16:27:00.004+01:002022-06-25T21:06:15.560+01:00Information theory 1<div id="information-theory-1" style="text-align: center;"><span style="font-size: x-small;"><i>5044 words, including equations (~30min)</i></span><br /></div><p>This is the first in a series of posts about information theory. A solid understanding of basic probability (random variables, probability distributions, etc.) is assumed. 
This post covers:</p><ul><li>what information and entropy are, both intuitively and axiomatically</li><li>(briefly) the relation of information-theoretic entropy to entropy in physics</li><li>conditional entropy</li><li>joint entropy</li><li>KL distance (also known as relative entropy)</li><li>mutual information</li><li>some results involving the above quantities</li><li>the point of source coding and channel coding</li></ul><p>Future posts cover source coding and channel coding in detail.</p><h2 id="what-is-information-">What is information?</h2><p>How much information is there in the number 14? What about the word "information"? Or this blog post? These don't seem like questions with exact answers.</p><p>Imagine you already know that someone has drawn a number between 0 and 15 from a hat. Then you're told that the number is 14. How much additional information have you learned? A first guess at a definition for information might be that it's the number of questions you need to ask to become certain about an answer. We don't want arbitrary questions though; "what is the number?" is very different from "is the number zero?". So let's say that it has to be a yes-no question.</p><p>You can represent a number within some specific range as a series of yes-no questions by writing it out in base-2. In base-2, 14 is 1110. Four questions suffice: "is the leftmost base-2 digit a 0?", etc. The number of base-$$B$$ digits required to distinguish between $$n$$ possible values is $$\lceil\log_B n\rceil$$, where $$\lceil x \rceil$$ means the smallest integer greater than or equal to $$x$$ (i.e., rounding up). Now maybe there should be some sense in which we can allow pointing at a number in the range 0 to 16 to have a bit more information than pointing at a number from 0 to 15, even though we can't literally ask 4.09 yes-no questions. 
So we might try to define our information measure as $$\log n$$ (in whatever base because changing which base we're doing logs in would only change the answer by a constant factor anyways, but let's just say it's base-2 to maintain the correspondence to yes-no questions), where $$n$$ is the number of outcomes that the thing we now know was selected from.</p><p>Now let's say there's a shoe box we've picked up from a store. There are a gazillion things that could be inside the box, so $$n$$ is something huge. However, it seems that if we open the box and find a new pair of sneakers, we are less surprised than if we open the box and find the Shroud of Turin. We'd like some kinds of contents to carry quantitatively more information than others.</p><p>The standard sort of thing you do in this kind of situation is that you bring in probabilities. With drawing a number out of a hat, we have a uniform distribution where the probability for each outcome is $$p = 1/ n$$. So we might as well have written that information content is equivalent to $$\log \frac{1}{p}$$, and gotten the same answer in that case. Since presumably the probability of your average shoe box containing sneakers is higher than the probability of it containing the Shroud of Turin, with this revised definition we now sensibly get that the latter gives us more information (because $$\log \frac{1}{p}$$ is a decreasing function of $$p$$). Note also that $$\log \frac{1}{p}$$ is the same as $$- \log p$$; we will usually use the latter form. This is called the Shannon information. To be precise:</p><blockquote><p><i>The (Shannon) information content of seeing a random variable $$X$$ take a value $$x$$ is $$$-\log p_x$$$ where $$p_x$$ is the probability that $$X$$ takes value $$x$$. 
</i></p><p><i>We can see the behaviour of the information content of an event as a function of its probability here: </i></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyEA9S6CzMLwf4YkmXK0-pfFqZto-WTbTCjtxixK2QafIvHmbnmlKqPSWZN6Yj0fEMJBGAUclmullfxWV-9TChejDut2HmzT5-Y0WbvziC_5kWb0WrGSRHvOcG00bTsj3WC76FeyawjTIqEa0gOth87yiimQqg00FL80dB8LMQ5VbUAPgG25YlBpb9Rw/s866/ArcoLinux_2022-05-31_21-27-57.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="648" data-original-width="866" height="478" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyEA9S6CzMLwf4YkmXK0-pfFqZto-WTbTCjtxixK2QafIvHmbnmlKqPSWZN6Yj0fEMJBGAUclmullfxWV-9TChejDut2HmzT5-Y0WbvziC_5kWb0WrGSRHvOcG00bTsj3WC76FeyawjTIqEa0gOth87yiimQqg00FL80dB8LMQ5VbUAPgG25YlBpb9Rw/w640-h478/ArcoLinux_2022-05-31_21-27-57.png" width="640" /></a></div><br /><p><br /></p></blockquote><h3 id="axiomatic-definition">Axiomatic definition</h3><p>The above derivation was so hand-wavy that it wasn't even close to being a derivation.</p><p>When discovering/inventing the concept of Shannon information, Shannon started from the idea that the information contained in seeing an event is a function of that event's probability (and nothing else). Then he required three further axioms to hold for this function:</p><ul><li>If the probability of an outcome is 1, it contains no information. This makes sense - if you already know something with certainty, then you can't get more information by seeing it again.</li><li>The information contained in an event is a decreasing function of its probability of happening. Again, this makes sense: seeing something you think is very unlikely is more informative than seeing something you were pretty certain was already going to happen.</li><li>The information contained in seeing two independent events is the sum of the information of seeing them separately. 
We don't want to have to apply some obscure maths magic to figure out how much information we got in total from seeing one dice roll and then another.</li></ul><p>The last one is the big hint. The probability of seeing random variable (RV) $$X$$ take value $$x$$ and RV $$Y$$ take value $$y$$ is $$p_x p_y$$ if $$X$$ and $$Y$$ are independent. We want a function, call it $$f$$, such that $$f(p_x p_y) = f(p_x) + f(p_y)$$. This is the most important property of logarithms. You can do some more maths to really demonstrate that logarithms (to some base) are the only functions that fit these requirements, or you can just guess that it's a $$\log$$ and move on. We'll do the latter.</p><h3 id="entropy">Entropy</h3><p>Entropy is the flashy term that comes up in everything from chemistry to .zip files to the fundamental fact that we're all going to die. It is often introduced as something like "[mumble mumble] a measure of information [mumble mumble]".</p><p>It is important to distinguish between information and entropy. Information is a function of an outcome (of a random variable), for example the outcome of an experiment. Entropy is a function of a random variable, for example an experiment before you see the outcome. Specifically,</p><blockquote><p><i> The <b>entropy</b> $$H(X)$$ is the expected information gain from a random variable $$X$$: $$$ H(X) = \underset{x_i \sim X}{\mathbb{E}}\Big[-\log P(X=x_i)\Big] \ = -\sum_i p_{x_i} \log p_{x_i} $$$ ($$\underset{x_i \sim X}{\mathbb{E}}$$ means the expected value when value $$x_i$$ is drawn from the distribution of RV $$X$$. $$P(X=x_i)$$, alternatively denoted $$p_{x_i}$$ when $$X$$ is clear from context, is the probability of $$X$$ taking value $$x_i$$.)</i></p></blockquote><p>(Why is entropy denoted with an $$H$$? I don't know. 
Just be thankful it wasn't a random <i>Greek</i> letter.)</p><p>Imagine you're guessing a number between 0 and 15 inclusive, and the current state of your beliefs is that it is equally likely to be any of these numbers. You ask "is the number 9?". If the answer is yes, you've gained $$-\log_2 \frac{1}{16} = \log_2 16 = 4$$ bits of information. If the answer is no, you've gained $$-\log_2 \frac{15}{16} = \log_2 16 - \log_2 15 = 0.093$$ bits of information. The probability of the first outcome is 1/16 and the probability of the second is 15/16, so the entropy is $$\frac{1}{16} \times 4 + \frac{15}{16} \times 0.093 = 0.337$$ bits.</p><p>In contrast, if you ask "is the number smaller than 8?", you always get $$-\log_2 \frac{8}{16} = \log_2{2} = 1$$ bit of information, and therefore the entropy of the question is 1 bit.</p><p>Since entropy is expected information gain, whenever you prepare a random variable for the purpose of getting information by observing its value, you want to maximise its entropy.</p><p>The closer a probability distribution is to a uniform distribution, the higher its entropy. 
The maximum entropy of a distribution with $$n$$ possible outcomes is the entropy of the uniform distribution $$U_n$$, which is $$$ H(U_n) = -\sum_i p_{u_i} \log p_{u_i} = -\sum_i \frac{1}{n} \log \frac{1}{n} \ = -\log \frac{1}{n} = \log n $$$ (This can be proved easily once we introduce some additional concepts.)</p><p>A general and very helpful principle to remember is that RVs with uniform distributions are most informative.</p><p>The above definition of entropy is sometimes called Shannon entropy, to distinguish it from the older but weaker concept of entropy in physics.</p><h4 id="entropy-in-physics">Entropy in physics</h4><p>The physicists' definition of entropy is a constant times the logarithm of the number of possible states that correspond to the observable macroscopic characteristics of a thermodynamic system: $$$ S=k_B \ln W $$$ where $$k_B$$ is the Boltzmann constant, $$\ln$$ is used instead of $$\log_2$$ because physics, and $$W$$ is the number of microstates. (Why do physicists denote entropy with the letter $$S$$? I don't know. Just be glad it wasn't a random <i>Hebrew</i> letter.)</p><p>In plain language: it is proportional to the Shannon entropy of finding out what is the exact configuration of bouncing atoms of the hot/cold/whatever box you're looking at, out of all the ways the atoms could be bouncing inside that box given that the box is hot/cold/whatever, assuming that all those ways are equally likely. It is less general than the information theoretic entropy in the sense that it assumes a uniform distribution.</p><p>Entropy, either the Shannon or the physics version, seems abstract; random variables, numbers of microstates, what? However, $$S$$ as defined above has very real physical consequences. 
For a reversible process, there's an important thermodynamics equation relating a change in entropy $$\delta S$$, a change in heat energy $$\delta Q$$, and temperature $$T$$: $$T\delta S = \delta Q$$. This sets a lower bound on how much energy you need to discover information (i.e., reduce the number of microstates that might be behind the macrostate you observe). Getting one bit of information means that $$\delta S$$ is $$k_B \ln 2$$ (from the definition of $$S$$), so at temperature $$T$$ kelvins we need $$k_B T \ln 2 \approx 9.6 \times 10^{-24} \times T$$ joules. This prevents arbitrarily efficient computers, and saves us from problems like Maxwell's demon. (Maxwell's demon is a thought experiment in physics: couldn't you violate the principle of increasing entropy (a physics thing) by building a box with a wall cutting it in half with a "demon" (some device) that lets slow particles pass left-to-right only and fast particles right-to-left, thus separating particles by temperature and reducing the number of microstates corresponding to the configuration of atoms inside the box? No, because the demon needs to expend energy to get information.)</p><p>Finally, is there an information-theoretic analogue of the second law of thermodynamics, which states that the entropy of an isolated system never decreases? You have to make some assumptions, but you can get to something like it, which I will sketch out in <i>very</i> rough detail and without explaining the terms (see Chapter 4 of <i>Elements of Information Theory</i> for the details). Imagine you have a probability distribution on the state space of a Markov chain. Now it is possible to prove that given any two such probability distributions, the distance between them (as measured using relative entropy; see below) does not increase as the chain evolves. 
Now assume it also happens to be the case that the stationary distribution of the Markov chain is uniform (the stationary distribution is the probability distribution over states such that if every state sends out its probability mass according to the transition probabilities, you get back to the same distribution). We can consider an arbitrary probability distribution over the states, and compare it to the unchanging uniform one, and use the result that the distance between them is non-increasing to deduce that an arbitrary probability distribution will tend towards the uniform (= maximal entropy) one.</p><p>Reportedly, von Neumann (a polymath whose name appears in any mid-1900s mathsy thing) advised Shannon thus:</p><blockquote><p><i>"You should call [your concept] entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage."</i></p></blockquote><h3 id="intuition">Intuition</h3><p>We've snuck in the assumption that all information comes in the form of:</p><ol> <li>You first have some <i>quantitative</i> uncertainty over a <i>known set</i> of possible outcomes, which you specify in terms of a random variable $$X$$.</li><li>You find out the value that $$X$$ has taken.</li></ol><p>There's a clear random variable if you're pulling numbers out of a hat: the possible values of $$X$$ are the numbers written on the pieces of paper in the hat, and they all have equal probability. But where is the random variable when the piece of information you get is, say, the definition of information? (I don't mean here the literal characters on the screen - that's a more boring question - but instead the knowledge about information theory that is now (hopefully) in your brain). 
The answer would have to be something like "the random variable representing all possible definitions of information" (with a probability distribution that is, for example, skewed towards definitions that include a $$\log$$ somewhere because you remember seeing that before).</p><p>This is a bit tricky to think about, but we see that even in this kind of weird case you can specify some kind of set and probabilities over that set. Fundamentally, knowledge (or its lack) is about having a probability distribution over states. Perfect knowledge means you have probability $$1.00$$ on exactly one state of how something could be. If you're very uncertain, you have a huge probability distribution over an unimaginably large set of states (for example, all possible concepts that might be a definition of information). If you've literally seen nothing, then you're forced to rely on some guess for the prior distribution over states, like all those pesky Bayesian statisticians keep saying.</p><h2 id="more-quantities">More quantities</h2><h3 id="conditional-entropy">Conditional entropy</h3><p>Entropy is a function of the probability distribution of a random variable. We want to be able to calculate the entropies of the random variables we encounter.</p><p>A common combination of random variables we see is $$X$$ given $$Y$$, written $$X | Y$$. The definition is $$$ P(X = x \, |\, Y = y) = \frac{P(X = x \,\land\, Y = y)}{P(Y=y)}. $$$ It is a common mistake to think that $$H(X|Y) = -\sum_i P(X = x_i | Y = y) \log P(X = x_i | Y = y)$$. What is it then? 
Let's just do the algebra. From the definition of the entropy as the expectation of the Shannon information content, $$$ H(X|Y) = -\underset{x \sim X|Y, y \sim Y}{\mathbb{E}} \big[ \log P(X=x|Y=y) \big]$$$ $$$ = -\sum_{y \in \mathcal{Y}} P(Y=y) \sum_{x \in \mathcal{X}} P(X=x | Y=y) \log P(X=x \,|\, Y = y)$$$ $$$ = -\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}}P(X=x\,\land\, Y = y) \log P(X=x \,|\, Y = y) $$$ where $$\mathcal{X}$$ and $$\mathcal{Y}$$ are simply the sets of possible values of $$X$$ and $$Y$$ respectively. In a trick beloved of bloggers everywhere tired of writing up equations as $$\LaTeX$$, the above is often abbreviated $$$ -\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log p(x|y) $$$ where we use $$p$$ as a generic notation for "probability of whatever; random variables left implicit".</p><blockquote><p><i>The <b>conditional entropy</b> of a random variable $$X$$ given another random variable $$Y$$ is written $$H(X|Y)$$ and defined as $$$ H(X|Y) = - \sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log p(x|y) $$$ which is lazier notation for $$$ -\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}}P(X=x\,\land\, Y = y) \log P(X=x \,|\, Y = y) $$$ and is also equal to $$$ -\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log \frac{p(x, y)}{p(y)} $$$ It is most definitely not equal to $$\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x | y) \log p(x | y)$$.</i></p></blockquote><p>Conditional entropy is a measure of how much information we expect to get from a random variable assuming we've already seen another one. If the RVs $$X$$ and $$Y$$ are independent, the answer is that $$H(X|Y) = H(X)$$. If the value of $$Y$$ implies a value of $$X$$ (e.g. 
"percentage of sales in the US" implies "percentage of sales outside the US"), then $$H(X|Y) = 0$$, since we can work out what $$X$$ is from seeing what $$Y$$ is.</p><h3 id="joint-entropy">Joint entropy</h3><p>Now if $$H(X|Y)$$ is how much expected surprise there is left in $$X$$ after you've seen $$Y$$, then $$H(X|Y) + H(Y)$$ would sensibly be the total expected surprise in the combination of $$X$$ and $$Y$$. We write $$H(X,Y)$$ for this combination. If we do the algebra, we see that $$$ H(X,Y) = H(X|Y) + H(Y) $$$ $$$ = -\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log \frac{p(x, y)}{p(y)} - \sum_{y \in \mathcal{Y}} p(y) \log p(y) $$$ $$$= -\left(\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log p(x, y)\right) + \left( \sum_{y \in \mathcal{Y}, \,x\in \mathcal{X}} p(x,y) \log p(y)\right) -\left( \sum_{y \in \mathcal{Y}} p(y) \log p(y)\ \right) $$$ $$$= -\left(\sum_{y \in \mathcal{Y}, \, x \in \mathcal{X}} p(x,y) \log p(x, y)\right) = H(Z) $$$ if $$Z$$ is the random variable formed of the pair $$(X, Y)$$ drawn from the joint distribution over $$X$$ and $$Y$$.</p><h3 id="kullback-leibler-divergence-aka-relative-entropy">Kullback-Leibler divergence, AKA relative entropy</h3><p>"Kullback-Leibler divergence" is a bit of a mouthful. It is also called KL divergence, KL distance, or relative entropy. Intuitively, it is a measure of the distance between two probability distributions. For probability distributions represented by functions $$p$$ and $$q$$ over the same set $$\mathcal{X}$$, it is defined as $$$ D(p\,||\,q) = \sum_{x \in \mathcal{X}} p(x) \log \left(\frac{p(x)}{q(x)}\right). $$$ It's not a very good distance function; the only property of a distance function it meets is that it's non-negative. It's not symmetric (i.e. in general $$D(p \,||\, q) \ne D(q \,||\, p)$$) as you can see from the definition (especially considering how it breaks when $$q(x) = 0$$ but not if $$p(x) = 0$$). 
However, it has a number of cool interpretations, including how many bits you expect to lose on average if you build a code assuming a probability distribution $$q$$ when it's actually $$p$$, and how many bits of information you get in a Bayesian update from distribution $$q$$ to distribution $$p$$. It is also a common loss function in machine learning. The first argument $$p$$ is generally some better or true model, and we want to know how far away $$q$$ is from it.</p><h3 id="why-the-uniform-distribution-maximises-entropy">Why the uniform distribution maximises entropy</h3><p>The KL divergence gives us a nice way of proving that the uniform distribution maximises entropy. Consider the KL divergence of an arbitrary probability distribution $$p$$ from the uniform probability distribution $$u$$: $$$ D(p \,||\, u ) = \sum_{x \in \mathcal{X}} p(x) \log \left(\frac{p(x)}{u(x)}\right) $$$ $$$= \sum_{x \in \mathcal{X}} \big( p(x) \log p(x)\big) - \sum_{x \in \mathcal{X}} \big(p(x) \log u(x) \big) $$$ $$$= -H(X) - \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{|\mathcal{X}|} $$$ $$$= H(U) - H(X) $$$ where $$\mathcal{X}$$ is the set of values over which $$p$$ and $$u$$ have non-zero values, $$X$$ is a random variable distributed according to $$p$$, and $$U$$ is a random variable distributed according to $$u$$ (i.e. uniformly), so that $$H(U) = \log |\mathcal{X}|$$. This is the same thing as $$$ H(X) = H(U) - D(p \,||\,u) $$$ which implies that we can write the entropy of a random variable as the entropy of a uniform random variable over a set of the same size, minus the KL distance between the distribution of $$X$$ and the distribution of the uniform random variable. 
Also, since all three quantities in the above equation are guaranteed to be non-negative, this implies that $$$ H(X) \leq H(U) $$$ and therefore that the uniform random variable has higher entropy than any other random variable over the same number of outcomes.</p><h3 id="mutual-information">Mutual information</h3><p>Earlier, we saw that $$H(X, Y) = H(X|Y) + H(Y) = H(X) + H(Y|X)$$. As a picture:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrGbqLctRwUw-setCmJDZ_tzd-1XQKwulod9TLgyVehtBNgORCubRlgNWfEhnIpAwYtohX8c3pDV6r3PIpma1YEUxSAiBo4E6qrt1mNoGPS7eGFXPJ7fwNeCnN3XKZMxPT9G0TvS4FftNrmdrlmBh3vdv4s3LFrZYjOqdr304iS8N4xGxyLmX-MNZdYw/s762/ArcoLinux_2022-05-31_22-25-08.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="348" data-original-width="762" height="183" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrGbqLctRwUw-setCmJDZ_tzd-1XQKwulod9TLgyVehtBNgORCubRlgNWfEhnIpAwYtohX8c3pDV6r3PIpma1YEUxSAiBo4E6qrt1mNoGPS7eGFXPJ7fwNeCnN3XKZMxPT9G0TvS4FftNrmdrlmBh3vdv4s3LFrZYjOqdr304iS8N4xGxyLmX-MNZdYw/w400-h183/ArcoLinux_2022-05-31_22-25-08.png" width="400" /></a></div><br /><p>There's an overlapping region, representing the information you get no matter which of $$X$$ or $$Y$$ you look at. We call this the mutual information, a refreshingly sensible name, and denote it $$I(X;Y)$$, somewhat less sensibly. One way to find it is $$$ I(X;Y) = H(X,Y) - H(X|Y) - H(Y|X)$$$ $$$= - \sum_{x,y} p(x,y) \log p(x,y) \,+\, \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)} \,+\, \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)}$$$ $$$= \sum_{x,y} p(x,y) \big( \log p(x,y) - \log p(x) - \log p(y) \big)$$$ $$$= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}. $$$ Does this look familiar? Recall the definition $$$ D(p\,||\,q) = \sum_{x \in \mathcal{X}} p(x) \log \left(\frac{p(x)}{q(x)}\right). 
$$$ What we see is that $$$ I(X;Y) = D(p(x, y) \, || \, p(x) p(y)), $$$ or in other words that the mutual information between $$X$$ and $$Y$$ is the "distance" (as measured by KL divergence) between the probability distributions $$p(x,y)$$ - the joint distribution between $$X$$ and $$Y$$ - and $$p(x) p(y)$$, the joint distribution that $$X$$ and $$Y$$ would have if $$x$$ and $$y$$ were drawn independently.</p><p>If $$X$$ and $$Y$$ are independent, then these are the same distribution, and their KL divergence is 0.</p><p>If the value of $$Y$$ can be determined from the value of $$X$$, then the joint probability distribution of $$X$$ and $$Y$$ is a table where for every $$x$$, there is only one $$y$$ such that $$p(x,y) > 0$$ (otherwise, there would be a value $$x$$ such that there is uncertainty about $$Y$$). Let the function mapping an $$x$$ to the singular $$y$$ such that $$p(x,y) > 0$$ be $$f$$. Then $$$ I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$$ $$$= \sum_y p(y) \sum_{x | f(x) = y} p(x|y) \log \frac{p(x, f(x))}{p(x)p(y)}. $$$ Now $$p(x, f(x)) = p(x)$$, because there is no $$y \ne f(x)$$ such that $$p(x, y) \ne 0$$. Therefore we get that the above is equal to $$$ \sum_y p(y) \sum_{x | f(x) = y} p(x|y) \log \frac{p(x)}{p(x)p(y)}\ = - \sum_y p(y) \sum_{x | f(x) = y} p(x|y) \log p(y), $$$ and since $$\log p(y)$$ does not depend on $$x$$, we can sum out the probability distribution to get $$$ -\sum_y p(y) \log p(y) = H(Y). $$$ In other words, if $$Y$$ can be determined from $$X$$, then the expected information that $$X$$ gives about $$Y$$ is the same as the expected information given by $$Y$$. 
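</p><p>As a quick sanity check, here is a small worked example (the particular distribution is made up for illustration). Let $$X$$ be uniform over $$\{0, 1, 2, 3\}$$ and let $$Y = f(X) = X \bmod 2$$, so that $$p(x) = \frac{1}{4}$$ for every $$x$$, $$p(y) = \frac{1}{2}$$ for both values of $$y$$, and $$p(x, f(x)) = p(x) = \frac{1}{4}$$. Using base-2 logarithms, $$$ I(X;Y) = \sum_{x} p(x, f(x)) \log_2 \frac{p(x, f(x))}{p(x)\,p(f(x))} = 4 \times \frac{1}{4} \log_2 \frac{1/4}{(1/4)(1/2)} = \log_2 2 = 1 \text{ bit}, $$$ which matches $$H(Y) = -2 \times \frac{1}{2} \log_2 \frac{1}{2} = 1$$ bit. Note that $$H(X) = \log_2 4 = 2$$ bits: seeing $$X$$ tells us everything about $$Y$$, but seeing $$Y$$ only tells us half of what there is to know about $$X$$.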
</p><p>We can graphically represent the relations between $$H(X)$$, $$H(Y)$$, $$H(X|Y)$$, $$H(Y|X)$$, $$H(X,Y)$$, and $$I(X;Y)$$ like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIfRf2gkkFvGyjcbKP1ae7_GVlagYfW5Mz6vxVP_BmjGWMkrG53xsdW_b1vXH9dgXWoVlTl6Ic8TGJIyQ3WXSecZ8J4MlEvMoY3NgubTGjDIicpywUD7xLht0GuipBnS4DYOmmAEH6J7Fb39HMKoePq6yDFJNZHCaMSwtsUaI8wTZ49E93yLD8OZRg9g/s756/ArcoLinux_2022-05-31_22-25-34.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="447" data-original-width="756" height="378" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIfRf2gkkFvGyjcbKP1ae7_GVlagYfW5Mz6vxVP_BmjGWMkrG53xsdW_b1vXH9dgXWoVlTl6Ic8TGJIyQ3WXSecZ8J4MlEvMoY3NgubTGjDIicpywUD7xLht0GuipBnS4DYOmmAEH6J7Fb39HMKoePq6yDFJNZHCaMSwtsUaI8wTZ49E93yLD8OZRg9g/w640-h378/ArcoLinux_2022-05-31_22-25-34.png" width="640" /></a></div><br /><p><br /></p><p>Having this image in your head is the single most valuable thing you can do to improve your ability to follow information theoretic maths. 
Just to spell it out, here are some of the results you can read out from it: $$$H(X,Y) = H(X) + H(Y|X) $$$ $$$H(X,Y) = H(X|Y) + H(Y) $$$ $$$H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X) $$$ $$$H(X,Y) = H(X) + H(Y) - I(X;Y) $$$ $$$H(X) = I(X;Y) + H(X|Y)$$$ These relations are also sometimes drawn as a Venn diagram:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggLIC4G6xVL1Ye-9BSkZLZCOyJc-7dI_SI-Ce1leWZvOj_H9F5wFsNz28g1GXe7pnJnfIYO-unAm2uHj3LzYcwVArgeTzuoX26d0tZ4GT6YBTR5iYA4d03t65q0z8THahQFfuGjpX066pe5n1r81dq7DZNLoE7hEpOdCeHnFpr_JWSoaHZupsV76mESw/s396/ArcoLinux_2022-05-31_22-26-00.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="396" data-original-width="382" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggLIC4G6xVL1Ye-9BSkZLZCOyJc-7dI_SI-Ce1leWZvOj_H9F5wFsNz28g1GXe7pnJnfIYO-unAm2uHj3LzYcwVArgeTzuoX26d0tZ4GT6YBTR5iYA4d03t65q0z8THahQFfuGjpX066pe5n1r81dq7DZNLoE7hEpOdCeHnFpr_JWSoaHZupsV76mESw/w386-h400/ArcoLinux_2022-05-31_22-26-00.png" width="386" /></a></div><br /><p><br /></p><h3 id="data-processing-inequality">Data processing inequality</h3><p>A Markov chain is a series of random variables such that the $$(n+1)$$th is only directly influenced by the $$n$$th. If $$X \to Y \to Z$$ is a Markov chain, it means that all effects $$X$$ has on $$Z$$ are through $$Y$$.</p><p>The data processing inequality states that if $$X \to Y \to Z$$ is a Markov chain, then $$$ I(X; Y) \geq I(X; Z). 
$$$ This should be pretty intuitive, since the mutual information $$I(X;Y)$$ between $$X$$ and $$Y$$, which have a direct causal link between them, shouldn't be lower than that between $$X$$ and the more-distant $$Z$$, which $$X$$ can only influence through $$Y$$.</p><p>A special case is the Markov chain $$X \to Y \to f(Y)$$, where $$X$$ is, say, what happened in an abandoned parking lot at 3am, $$Y$$ is the security camera footage, and $$f$$ is some image enhancing process (more generally: any deterministic function of the data $$Y$$). The data processing inequality tells us that $$$ I(X; Y) \geq I(X; f(Y)). $$$ In essence, this means that any function you try to apply to some data $$Y$$ you have about some event $$X$$ cannot increase the information about the event that is available. Any enhancing function can only make it easier to spot some information about the event that is <i>already present</i> in the data you have about it (and the function might very plausibly destroy some). If all you have are four pixels, no amount of image enhancement wizardry will let you figure out the perpetrator's eye colour.</p><p>The proof (for the general case of $$X \to Y \to Z$$) goes like this: consider $$I(X; Y,Z)$$ (that is, the mutual information between knowing $$X$$ and knowing both $$Y$$ and $$Z$$). 
Now consider the different values in Venn diagram form:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG5o6kK-DxPU0tQRHkIgEevjypvmeikAoGX5qDjDOYKvc8C0azyFVA1evw5iBoD-a5jPBSamLrdOePJSPvwzV-MhKsdnwFYv6pRXnN2wL5BkXyCS5Lehg8QVL4-ZysYIHhPo56LyzLyTnescNUIZSuO5dOcHUm6EhdiAETvC0gtXQ4JR0XSSN3aaqaDg/s846/ArcoLinux_2022-05-31_22-59-32.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="688" data-original-width="846" height="325" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG5o6kK-DxPU0tQRHkIgEevjypvmeikAoGX5qDjDOYKvc8C0azyFVA1evw5iBoD-a5jPBSamLrdOePJSPvwzV-MhKsdnwFYv6pRXnN2wL5BkXyCS5Lehg8QVL4-ZysYIHhPo56LyzLyTnescNUIZSuO5dOcHUm6EhdiAETvC0gtXQ4JR0XSSN3aaqaDg/w400-h325/ArcoLinux_2022-05-31_22-59-32.png" width="400" /></a></div><br /><p><br /></p><p>$$I(X; Y, Z)$$ corresponds to all areas within the circle representing $$X$$ that are also within at least one of the circle for $$Y$$ or $$Z$$. If we knew both $$Y$$ and $$Z$$, this "bite" is how much would be taken out of the uncertainty $$H(X)$$ of $$X$$.</p><p>We see that the red lined area is $$I(X; Y|Z)$$ (the information shared between $$X$$ and the part of $$Y$$ that remains unknown if you know $$Z$$), and likewise the green hatched area is $$I(X; Y; Z)$$ and the blue dotted area is $$I(X;Z|Y)$$. 
Since the red-lined and green-hatched areas together are $$I(X;Y)$$, and the green-hatched and blue-dotted areas together are $$I(X;Z)$$, we can write both $$$ I(X; \,Y,Z) = I(X;\,Y) + I(X;\,Z|Y)$$$ $$$I(X; \,Y,Z) = I(X;\,Z) + I(X;\,Y|Z) $$$ But hold on - $$I(X;Z|Y)=0$$ by the definition of a Markov chain, since no influence can pass from $$X$$ to $$Z$$ without going through $$Y$$, meaning that if we know everything about $$Y$$, nothing more we can learn about $$Z$$ will tell us anything more about $$X$$.</p><p>Since that term is zero, we have $$$ I(X; \; Y) = I(X; \; Z) + I(X; \, Y|Z) $$$ and since mutual information must be non-negative, this in turn implies $$$ I(X;Y) \geq I(X;Z). $$$</p><h2 id="two-big-things-source-channel-coding">Two big things: source & channel coding</h2><p>Much of information theory concerns itself with one of two goals.</p><p>Source coding is about data compression. It is about taking something that encodes some information, and trying to make it shorter without losing the information.</p><p>Channel coding is about error correction. 
It is about taking something that encodes some information, and making it longer to try to make sure the information can be recovered even if some errors creep in.</p><p>The basic model that information theory deals with is the following:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsS1fWWqne7wyRiNIS8jjsFxOPCb5B54n-xNiEKD4PYt3jwml1VCOrkX6JxhEXHuCFd7wv9Kr5Vvu1VCSi-PP74LtoZSto9IsYzZcKMJ3uzam7_JfrVUcerg51rWIdZzCQaxjLezaVeepV8TcaxudzUHzRTnCNqfWe-Ju4icIFHKqd5swb879emq1NTQ/s1220/ArcoLinux_2022-05-31_22-52-21.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="270" data-original-width="1220" height="142" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsS1fWWqne7wyRiNIS8jjsFxOPCb5B54n-xNiEKD4PYt3jwml1VCOrkX6JxhEXHuCFd7wv9Kr5Vvu1VCSi-PP74LtoZSto9IsYzZcKMJ3uzam7_JfrVUcerg51rWIdZzCQaxjLezaVeepV8TcaxudzUHzRTnCNqfWe-Ju4icIFHKqd5swb879emq1NTQ/w640-h142/ArcoLinux_2022-05-31_22-52-21.png" width="640" /></a></div><br /><p>We have some random variable $$Z$$ - the contents of a text message, for example - which we encode under some coding scheme to get a message consisting of a sequence of symbols that we send over some channel - the internet, for example - and then hopefully recover the original message. 
The channel can be noiseless, meaning it transmits everything perfectly and can be removed from the diagram, or noisy, in which case there is a chance that for some $$i$$, the $$X_i$$ sent into the channel differs from the $$Y_i$$ you get out.</p><p>Source coding is about trying to minimise how many symbols you have to send, while channel coding is about trying to make sure that $$\hat{Z}$$, the estimate of the original message, really ends up being the original message $$Z$$.</p><p>A big result in information theory is that for the above model, it is possible to separate the source coding and the channel coding, while maintaining optimality. The problems are distinct; regardless of source coding method, we can use the same channel method and still do well, and vice versa. Thanks to this result, called the source-channel separation theorem, source and channel coding can be considered separately. Therefore, our model can look like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKVksPpXEHyJeOtvGWW5rKPZYkJe-TxHDqeH4liLtrRak9ybQvb1MoaEkIDgLHc4yy7rXDra5JOQjswbiH2ZZtpEk4egOiXP3uk_bBk87FJS1Zl0d4bSbjo2uso2lSXrwIPJa-4DyMTnFtxCFI-8t5buk0NHxWPHGUEsX_6YcxZqc5MssJflmom7g_gQ/s1367/ArcoLinux_2022-05-31_22-52-43.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="246" data-original-width="1367" height="116" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKVksPpXEHyJeOtvGWW5rKPZYkJe-TxHDqeH4liLtrRak9ybQvb1MoaEkIDgLHc4yy7rXDra5JOQjswbiH2ZZtpEk4egOiXP3uk_bBk87FJS1Zl0d4bSbjo2uso2lSXrwIPJa-4DyMTnFtxCFI-8t5buk0NHxWPHGUEsX_6YcxZqc5MssJflmom7g_gQ/w640-h116/ArcoLinux_2022-05-31_22-52-43.png" width="640" /></a></div><p><br /></p><p>(We use $$X^n$$ to refer to a random variable representing a length-$$n$$ sequence of symbols)</p><p>Both source and channel coding consist of:</p><ul><li>a central but tricky theorem giving theoretical bounds and 
motivating some definitions</li><li>a bunch of methods that people have invented for achieving something close to those theoretical bounds in practice</li></ul>Next see <a href="https://www.strataoftheworld.com/2022/06/information-theory-2-source-coding.html">the source coding post</a> and <a href="https://www.strataoftheworld.com/2022/06/information-theory-3-channel-coding.html">the channel coding post</a>. <br /><div><ul></ul><p></p><p></p></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-41260904924096790052021-10-17T23:14:00.000+01:002021-10-17T23:14:19.101+01:00Death is bad<p style="text-align: center;"> <span style="font-size: x-small;">3.5k words (about 12 minutes)<br /></span></p><p>Sometime in the future, we might have the technology to extend lifespans indefinitely and make people effectively immortal. When and how this might happen is a complicated question that I will not go into. Instead, I will take heed of Ian Malcolm in <i>Jurassic Park</i>, who complains that "your scientists were so preoccupied with whether or not they could that they didn't stop to think if they should".</p><p>This is (in my opinion rather surprisingly) a controversial question. </p><p>The core of it is this: should people die?</p><p>Often the best way to approach a general question is to start by thinking about specific cases. Imagine a healthy ten-year old child; should they die? The answer is clearly no. What about yourself, or your friends, or the last person you saw on the street? Wishing for death for yourself or others is almost universally a sign of a serious mental problem; acting on that desire even more so.</p><p>There are some exceptions. Death might be the best option for a sick and pained 90-year-old with no hope of future healthy days. It may well be (as I've seen credibly claimed in several places) that the focus on prolonging lifespan even in pained terminally ill people is excessive. 
"Prolong life, whatever the cost" is a silly point of view; maximising heartbeats isn't what we really care about.</p><p>However, now imagine a pained, dying, sick person who has a hope of surviving to live many healthy happy days – say a 40-year-old suffering from cancer. Should they die? No. You would hope that they get treatment, even if it's nauseating fatiguing painful chemotherapy for months on end. If there is no cure, you'd hope that scientists somewhere invent it. Even if it does not happen in time for that particular person, at least it will save others in the future, and eliminate one more horror of the world. It would be a great and celebrated human achievement.</p><p>What's the difference between the terminally ill 90-year-old and the 40-year-old with a curable cancer? The difference is technology. We have the technology to cure some cancers, but we don't have the technology to cure the many ageing-related diseases. If we did, then even if the treatment is expensive or difficult, we would hope – and consider it a moral necessity – for both of them to get it, and hope that they both go on living for many more years.</p><p>No one dies of time. You are a complex process running on the physical hardware of your brain, which is kept running by the machine that is the rest of your body. You die when that machine breaks. There is no poetic right time when you close your eyes and get claimed by time, there is only falling to one mechanical fault or another.</p><p>People (or conscious beings in general) matter, and their preferences should be taken seriously – this is the core of human morality. What is wrong in the world can be fixed – this is the guiding principle of civilisation since the Enlightenment.</p><p>So, should people die? 
Not if they don't want to, which (I assume) for most people means not if they have a remaining hope of happy, productive days.</p><h2 id="counterarguments">Counterarguments</h2><p>The idea that death is something to be defeated, like cancer, poverty, or smallpox, is not a common one. Perhaps there's some piece of the puzzle that is missing from the almost stupidly simple argument above?</p><p>One of the most common counterarguments is overpopulation (perhaps surprisingly; environmentalist concerns have clearly penetrated very deep into culture despite not being much of a thing before the 1970s). The argument goes like this: if we solve death, but people keep being born, there will be too many people on Earth, leading to environmental problems, and eventually low quality of life for everyone.</p><p>The object-level point (I will return to what I consider more important meta-level points later) is that demographic predictions have a tendency to be wrong, especially about the future (as the <a href="https://quoteinvestigator.com/2013/10/20/no-predict/">Danish (?) saying goes</a>). Malthus figured out pre-industrial demographics just as they came to an end with the industrial revolution. In the 1960s, there were <a href="https://en.wikipedia.org/wiki/The_Population_Bomb">warnings</a> of a population explosion, which fizzled out when it turned out that the <a href="https://en.wikipedia.org/wiki/Demographic_transition">demographic transition</a> (falling birth rates as countries develop) is a thing. Right now the world population is expected to stabilise at less than 1.5x the current size, and many developed countries are dealing with problems caused by shrinking populations (which they strangely refuse to fix through immigration).</p><p>Another concern is the effect of having a lot of old people around. 
What about social progress – how would the development of women's rights have been realised if you had a bunch of 19th century misogynists walking around in their top hats? What sort of power imbalances and Gini coefficients would we reach if Franklin Delano Roosevelt could continue cycling through high-power government roles indefinitely, or Elon Musk had time to profit from the colonisation of Mars? What happens to science when it can no longer advance (as Max Planck said) one funeral at a time?</p><p>(There is even an argument that life extension technology is problematic because the rich will get it first. This is an entirely general and therefore entirely worthless argument, since it applies to all human progress: the rich got iPhones first – clearly smartphones are a problematic technology, etc., etc. If you're worried about only the rich having access to it for too long, the proper response is to subsidise its development so that the period when not everyone has access to it is as short as possible.)</p><p>These are valid concerns that will definitely test the abilities of legislators and voters in the post-death era. However, they can probably be overcome. I think people can be brought around surprisingly far on social and moral attitudes without killing anyone. Consider how almost anyone's pre-2000 opinions would make them a near-pariah today; many of those people still exist and it would be hard to write them off as a total loss. Maybe some minority of immortal old people couldn't cope with all the Pride Parades – or whatever the future equivalent is – marching past their windows and they go off to start some place of their own with sufficient top hat density; then again, most countries have their own conservative backwater region already. 
If they start going for nukes, that's more of an issue, but not more so than Iran.</p><p>As for imbalances of power and wealth, it might require a few more taxes and other policies (the expansion of term limits to more jobs?), but given the strides that equalising policy-making has made it seems hard to argue there is a fundamental impossibility.</p><p>And what about all the advantages? A society of the undying might well be far more long-term oriented, mitigating one of the greatest human failures. After all, how often do people bemoan that 70-year-old oil executives just don't care because they won't be around to see the effects of climate change?</p><p>What about all the collective knowledge that is lost? Imagine if people in 2050 could hear World War II veterans reminding them of what war really is. Imagine if John von Neumann could have continued casually inventing fields of maths at a rate of about two per week instead of dying at age 53 (while <a href="https://en.wikipedia.org/wiki/John_von_Neumann#Illness_and_death">absolutely terrified of his approaching death</a>). Imagine if we could be sure to see George R. R. Martin finish <i>A Song of Ice and Fire</i>.</p><p>Also, concerns like overpopulation and Elon Musk's tax plan just seem small in comparison to the <i>literal eradication of death</i>.</p><p>Imagine proposing a miracle peace plan to the cabinets of the Allied countries in the midst of World War II. The plan would end the war, install liberal governments in the Axis powers, and no one even has to nuke a Japanese city. (If John von Neumann starts complaining about not getting to test his implosion bomb design, give him a list of unsolved maths problems to shut him up.) 
Now imagine that the reaction is somewhere between hesitance and resistance, together with comments like "where are we going to put all the soldiers we've trained?", "what about the effects on the public psyche of a random abrupt end without warning?", and "how will we make sure that the rich industrialists don't profit too much from all the suddenly unnecessary loans that they've been given?" At this point you might be justified in shouting: "this war is killing fifteen million people per year, we need to end it now".</p><p>The situation with death is similar, except it's over fifty million per year rather than fifteen. (See <a href="https://ourworldindata.org/grapher/annual-number-of-deaths-by-cause?country=~OWID_WRL">this chart</a> for breakdown by cause – you'll see that while currently-preventable causes like infectious diseases kill millions, ageing-related ones like heart disease, cancer, and dementia are already the majority.)</p><h3 id="thought-experiments">Thought experiments</h3><p>To make the question more concrete, we can try thought experiments. Imagine a world in which people don't die. Imagine visitors from that world coming to us. Would they go "ah yes, inevitable oblivion in less than a century, this is exactly the social policy we need, thanks – let us go run back home and implement it"? Or would they think of our world like we do of a disease-stricken third-world country, in dire need of humanitarian assistance and modern technology?</p><p>It's hard to get into the frame of mind of people who live in a society that doesn't hand out automatic death sentences to everyone at birth. 
Instead, to evaluate whether raising life expectancies to 200 makes sense even given the environmental impacts, we can ask whether a policy of killing people at age 50 to reduce population pressures would be even better than the current status quo – if both an increase and a decrease in life expectancies are bad, this is suspicious because it implies we're at the optimum by chance. Or, since the abstract question (death in general) is always harder than more concrete ones, imagine withholding a drug that manages heart problems in the elderly on overpopulation grounds.</p><p>You might argue that current life expectancies are optimal. This is a hard position to defend. It seems like a coincidence that the lifespan achievable with modern technology is exactly the "right" one. Also, neither you nor society should make that choice for other people. Perhaps some people get bored of life and readily step into coffins at age 80; many others want nothing more than to keep living. People should get what they want. Forcing everyone to conform to a certain lifespan is a specific case of forcing everyone to conform to a certain lifestyle; much moral progress in the past century has consisted of realising that this is bad.</p><p>I think it's also worth emphasising one common thread in the arguments against solving death: they are all arguments about societal effects. It is absolutely critical to make sure that your actions don't cause massive negative externalities, and that they also don't amount to defecting in <a href="https://en.wikipedia.org/wiki/Prisoner%27s_dilemma">prisoner's dilemma</a> or <a href="https://en.wikipedia.org/wiki/Tragedy_of_the_commons">the tragedy of the commons</a>. However, it is also absolutely critical that people are happy and aren't forced to die, because people and their preferences/wellbeing are what matters. Society exists to serve the people who make it up, not the other way around. 
Some of the worst moral mistakes in history come from emphasising the collective, and identifying good and harm in terms of effects on an abstract collective (e.g. a nation or religion), rather than in terms of effects on the individuals that make it up. Saying that everyone has to die for some vague pro-social reason is the ultimate form of such cart-before-the-horse reasoning.</p><h2 id="why-care-about-the-death-question">Why care about the death question?</h2><p>There are several features that make the case against death, and people's reactions to it, particularly interesting.</p><h3 id="failure-of-generalisation">Failure of generalisation</h3><p>First: generalisation. I started this post using specific examples before trying to answer the more general question. I think the popularity of death is a good example of how bad humans are at generalising.</p><p>When someone you know dies, it is very clearly and obviously a horrible tragedy. The scariest thing that could happen to you is probably either your own death, the death of people you care about, or something that your brain associates with death (the common fears: heights, snakes, ... clowns?).</p><p>And yet, make the question more abstract – think not about a specific case (which you feel in your bones is a horrible tragedy that would never happen in a just world), but about the general question of whether people should die, and it's like a switch flips: a person who would do almost anything to save themselves or those they care about, who cares deeply about suffering and injustice in the world, is suddenly willing to consign five times the death toll of World War I to permanent oblivion every single year.</p><p>Stalin reportedly said that a single death is a tragedy, but a million is only a statistic. Stalin is wrong. A single death is a tragedy, and a million deaths is a million tragedies. 
Tragedies should be stopped.</p><h3 id="people-these-days">People These Days</h3><p>Second: today, we're pretty good at ignoring and hiding death. This wasn't always the case. If you were a medieval peasant, death was never too far away, whether in the form of famine or plague or Genghis Khan. Death was like an obnoxious dinner guest: not fun, but also just kind of present in some form or another whether you invited them or not, so out of necessity involved in life and culture.</p><p>Today, unexpected death is much rarer. Child mortality globally has declined from <a href="https://ourworldindata.org/child-mortality">over 40% (i.e. almost every family had lost a child) in 1800 to 4.5% in 2015</a>, and <a href="https://ourworldindata.org/grapher/the-decline-of-child-mortality-by-level-of-prosperity-endpoints?time=latest&country=SWE~GBR~JPN~FRA~FIN~European+Union~KOR~ESP">below 0.5%</a> in developed countries. Famines have gone from something everyone lives through to something that the developed world is free from. War and conflict have gone from <a href="https://ourworldindata.org/war-and-peace#the-past-was-not-peaceful">common to uncommon</a>. Many more diseases and accidents can now be successfully treated. As a result of all these positive trends, death is less present in people's minds.</p><p>As I don't have my culture critic license yet, I won't try to make some fancy overarching points about how People These Days Just Don't Understand and how our Materialistic Culture fails to prepare people to deal with the Deep Questions and Confront Their Own Mortality. I will simply note that (a) death is bad, (b) we don't like thinking about bad things, and (c) sometimes not wanting to think about important things causes perverse situations.</p><h3 id="confronting-problems">Confronting problems</h3><p>Why do people not want to think that death is bad? I think one central reason is that death seems inevitable. 
It's tough to accept bad things you can't influence, and much easier to try to ignore them. If at some point you have to confront it anyways, one of the most reassuring stories you can tell is that it has a point. Imagine if over two hundred thousand years, generation after generation of humans, totalling some one hundred billion lives, was born, grew up, developed a rich inner world, and then had that world destroyed forever by random failures, evolution's lack of care for what happens after you reproduce, and the occasional rampaging mammoth. Surely there must be some purpose for it, some reason why all that death is not just a tragedy? Perhaps we aren't "meant" to live long, whatever that means, or perhaps it's all for the common good, or that "death gives meaning to life". Far more comforting to think that than to acknowledge that a hundred billion human lives and counting really are gone forever because they were unlucky enough to be born before we eradicated smallpox, or invented vaccines, or discovered antibiotics, or figured out how to reverse ageing.</p><p>Assume death is inevitable. Should you still recognise the wrongness of it?</p><p>I think yes, at least if you care about big questions and doing good. I think it's important to be able to look at the world, spot what's wrong about it, and acknowledge that there are huge things that should be done but are very difficult to achieve.</p><p>In particular, it's important to avoid the narrative fallacy (Nassim Taleb's term for the human tendency to want to fit the world to a story). In a story, there's a start and an end and a lesson, and the dangers are typically just small enough to be defeated. Our universe <a href="https://www.lesswrong.com/posts/sYgv4eYH82JEsTD34/beyond-the-reach-of-god">has no writer, only physics</a>, and physics doesn't care about hitting you with an unsolvable problem that will kill everyone you love. 
If you want to increase the justness of the world, recognising this fact is an important starting point.</p><h2 id="taxes">Taxes</h2><p>Is death inevitable? In considering this question, it's important once again to remember that death is not a singular magical thing. Your death happens when something breaks badly enough that your consciousness goes permanently offline.</p><p>Things, especially complex biological machines produced by evolution, can break in very tricky ways. But what can break can be fixed, and people who declare technological feats impossible have a bad track record. The problem might be very hard: maybe we have to wait until we have precision nano-bots that can individually repair the telomeres on each cell, or maybe there is no effective general solution to ageing and we face an endless grind of solving problem after problem to extend life/health expectancies from 120 to 130 to 140 and so forth. Then again, maybe someone leaves out a petri dish by accident in a lab and comes back the next day to the fountain of youth, or maybe by the end of the century no one is worrying about something as old-fashioned as biology.</p><p>There's also the possibility of stopgap solutions, like cryonics (preserving people close to death by <a href="https://en.wikipedia.org/wiki/Cryopreservation#Vitrification">vitrifying</a> them and hoping that future technology can revive them). Cryonics is currently in a very primitive state – no large animal has yet been successfully put through it – but there's a research pathway of testing on increasingly complex organs and then increasingly large animals that might eventually lead to success if someone bothered to pour resources into it.</p><p>There is no guarantee of when, or whether, any of this will happen. If civilisation is destroyed by an engineered pandemic or nuclear war before then, it will never happen.</p><p>Of course, in the very long run we face more fundamental problems, like the heat death of the universe. 
Literally infinite life is probably physically impossible; maybe this is reassuring.</p><h2 id="predictions-and-poems">Predictions and poems</h2><p>I will make three predictions about the eventual abolition of death.</p><p>First, many people will resist it. They might see it as conflicting with their religious views or as exacerbating inequality, or just as something too new and weird or unnatural.</p><p>Second, when the possibility of extending their lifespan stops being an abstract topic and becomes a concrete option, most people will seize it for themselves and their families.</p><p>This is a common path for technologies. Lightning rods and vaccines were first seen by some as affronts to God's will, but eventually it turns out people like not burning to death and not dying of horrible diseases more than they like fancy theological arguments. Most likely future generations will discover that they like not ageing more than they like appreciating the meaning of life by definitely not having one past age 120.</p><p>Finally, future people (if they exist) will probably look back with horror on the time when everyone died against their will within about a century.</p><p>Edgar Allan Poe wrote a poem called <a href="https://www.poetryfoundation.org/poems/48633/the-conqueror-worm">"The Conqueror Worm"</a>, about angels crying as they watch a tragic play called "Man", whose (anti-)hero is a monstrous worm that symbolises death. If we completely ignore what Poe intended with this, we can misinterpret one line to come to a nice interpretation of our own. The poem declares that the angels are watching this play in the "lonesome latter years". Clearly this refers to a future post-scarcity, post-death utopia, and the angels are our wise immortal descendants reflecting on the bad old days, when people were "mere puppets [...] who come and go / at the bidding of vast formless things" like famine and war and plague and death. 
The "circle [of life] ever returneth in / To the self same spot [= the grave]", and so the "Phantom [of wisdom and fulfilled lives] [is] chased for evermore / By a crowd that seize it not".</p><p>Death is a very poetic topic, and other poems need less (mis)interpretation. <a href="https://www.poetryfoundation.org/poems/52773/dirge-without-music">Edna St. Vincent Millay's "Dirge Without Music"</a> is particularly nice, while Dylan Thomas gives away the game in the title: <a href="https://poets.org/poem/do-not-go-gentle-good-night">"Do not go gentle into that good night"</a>.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-9283111801107102512021-09-30T21:23:00.001+01:002021-09-30T21:26:05.349+01:00Short reviews: biographies<p style="text-align: center;"><span style="font-size: x-small;">Books reviewed (all by Walter Isaacson):<i><br />The Code Breaker: Jennifer Doudna, Gene Editing, and the Future of the Human Race </i>(2021)<br /><i>Steve Jobs: The Exclusive Biography </i>(2011)<i><br />Benjamin Franklin: An American Life </i>(2004)<br /><i></i></span></p><p style="text-align: center;"><span style="font-size: x-small;">3.5k words (about 12 minutes)</span> </p><p style="text-align: center;"><br /></p><p>Why read biographies? If you want stories of people and interesting characters, fiction is better. If you want general, big truths, then you're probably better off reading the many non-fiction books that are about abstract truths and far-ranging concepts rather than the particulars of a single person's life.</p><p>Consider, for a moment, designing an algorithm for a problem. The classic way to do this is to think hard about the problem, and then write down a specific series of steps that take you from inputs to (hopefully the correct) outputs. In contrast, the machine learning method is to use statistical methods on a long list of examples to make a model that (hopefully) approximates the mapping between inputs and outputs. 
</p><p>Reading explicit abstract arguments is like the first method. Like explicit algorithm design, it comes with some nice properties – it's very clear exactly how it generalises and when it's applicable – to the point where it's easy to scoff at the less explicit methods: "it's just a black box that our pile of statistics spits out" / "it's just anecdotes about someone's life".</p><p>However, much like machine learning methods can extract subtle lessons from a long list of examples, I think there is implicit knowledge contained in the long list of detail about someone's life that you find in a biography (at least if you read about people who did interesting things in their life – but then again, if there's a biography of someone ...). Once you've read the details of how CRISPR was invented, Apple jump-started, or compromises reached at the 1787 American Constitutional Convention, I think your model of how science, business, and politics work in the real world is improved in many subtle ways.</p><p>(Note that this argument also applies to reading history.)</p><p>And of course, since biographies deal strongly with character, there is an element of the novel-like thrill of watching things happen to people.</p><h2 id="walter-isaacsons-biographies">Walter Isaacson's biographies</h2><p>I've read four of Walter Isaacson's biographies. Their subjects are Albert Einstein, Jennifer Doudna, Steve Jobs, and Benjamin Franklin.</p><p>The Einstein one I read years ago, and don't remember much detail about. It did earn a 6 out of 7 on my books spreadsheet though.</p><p>The <a href="https://en.wikipedia.org/wiki/Jennifer_Doudna">Jennifer Doudna</a> biography is the weakest. 
The main reason is that we don't get too much insight into Doudna herself or the way she carried out her scientific work, leaving Isaacson to spend many pages on other things: overviews of other players in the development of the <a href="https://en.wikipedia.org/wiki/CRISPR">gene-editing tool CRISPR</a> that are more journalistic than biographical, and descriptions of the biology that are limited by Isaacson's lack of biological expertise (at least when compared to the best popular biology writing, like Richard Dawkins' in <i>The Selfish Gene</i>). Hand-wringing over <a href="https://en.wikipedia.org/wiki/James_Watson">James Watson's</a> controversies takes up an alarming amount of space that is only partly justified by Watson's role as a childhood inspiration for Doudna. There's also a long section about the struggles behind the allocation of the CRISPR Nobel Prize (awarded in 2020) that is clearly balanced and thoroughly researched, but simply less interesting to me than similar segments in the Jobs or Franklin biographies, where the stakes are the fate of companies or nations, rather than who gets a shiny medal.</p><p>My guess is that these faults stem mainly from the more limited material Isaacson had access to. Albert Einstein and Benjamin Franklin are both among the most researched individuals in history. To the extent that Steve Jobs is behind, the interviews Isaacson personally conducted seem to have plugged the gap.</p><p>Doudna is still an inspiring person. She also has the enviable advantage of not being dead, and therefore may yet do even more and become the subject of further biographies. If you're interested in biotech, including the business side, or scientific careers that may one day win Nobel Prizes, the biography may well be worth reading. 
</p><h2 id="steve-jobs">Steve Jobs</h2><p>A god-like experimenter who wants to figure out what traits make tech entrepreneurs succeed may proceed something like this: create a bunch of people with extreme strengths in some areas and extreme weaknesses in others, release them into the world to start companies, and see which extreme strengths can balance out which extreme weaknesses. Such an experiment might well create Steve Jobs.</p><p>Take one weakness: Jobs's emotional volatility and, for lack of a better word, general nastiness in some circumstances, including things from extremely harsh criticism of employees' work to horrible table manners at restaurants. This isn't unique to Jobs either: look at the Wikipedia pages for <a href="https://en.wikipedia.org/wiki/Bill_Gates#Management_style">Bill Gates</a> and <a href="https://en.wikipedia.org/wiki/Jeff_Bezos#Leadership_style">Jeff Bezos</a>, and you'll find that they brighten their subordinates' work days with such productive witticisms as "that's the stupidest thing I've ever heard" and "why are you ruining my life?" respectively.</p><p>Does this show that behaviour up to and including verbal abuse is a forgivable flaw, or even beneficial, in tech CEOs?</p><p>First, though verbal abuse is neither productive nor right, a culture of vigorous debate is a distinct thing with incredible benefits, and the idea that it serves only to hurt and marginalise is not just a misguided generalisation but sometimes diametrically wrong. 
The best example is Daniel Ellsberg recounting an anecdote from his early times at RAND Corporation in <i>The Doomsday Machine</i> (an unrelated book; my review <a href="https://strataoftheworld.blogspot.com/2020/04/review-doomsday-machine.html">here</a>):</p><blockquote><p><i>Rather than showing irritation or ignoring my comment [that he made at the first meeting], Herman Kahn, brilliant and enormously fat, sitting directly across the table from me, looked at me soberly and said, "You're absolutely wrong."</i></p><i></i><p><i>A warm glow spread through my body. This was the way my undergraduate fellows on the editorial board of the Harvard Crimson (mostly Jewish, like Herman and me) had spoken to each other; I hadn't experienced anything like it for six years. At King's College, Cambridge, or in the Society of Fellows, arguments didn't remotely take this gloves-off, take-no-prisoners form. I thought, "I've found a home."</i></p></blockquote><p>Steve Jobs admittedly goes overboard with this. For example, people who worked with him had to learn that "this is shit" meant "that's interesting, could you elaborate and make the case for your idea further?". This is not just unnecessarily rude, but also unclear communication. The general impression that Isaacson gives is also not that Jobs was combative as a thought-out strategy, but rather that this was just his style of interaction.</p><p>I suspect that the famous combativeness of many tech CEOs is not itself a useful trait, but instead adjacent to several other traits that are, in particular disagreeableness (in the sense of willing to disagree with others and not feel pressure to conform) and perhaps also caring deeply about the product.</p><p>Consider another extreme Jobs trait: strange diets, and (in his youth), a belief that he didn't need to shower because of his dieting. 
This went so far that of the people Isaacson interviews about Jobs's youth, including those who hadn't seen him for decades, almost every one mentions something like "yeah, he stank". Yet while some leap to defend and (worse yet) emulate Jobs's verbal nastiness, presumably on grounds of its correlation with his success, far fewer do the same for his dieting and showering habits. (What conformists!)</p><p>I think the more general lesson is that Jobs was extreme in a lot of ways, including in the strength of his opinions and beliefs, and in not having a filter between them and his actions. He gets into eastern mysticism and goes off to India to become a monk. He gets into dieting and starts eating only fruit rather than just reading lifestyle magazines and half-heartedly trying diets for a week like most people might. He gets it into his head that the corner of a Mac isn't rounded enough and declares that in no uncertain terms. </p><p>So is that the key then: have firm convictions? We've gone from a maladaptive cliché to a trite one – and still not a very helpful one. Steve Jobs, with his "reality distortion field", may have been an expert at persuading people, but even he can't persuade reality to be another way. Even slightly wrong convictions tend to have nasty collisions with reality.</p><p>(It's worth noting that rather than being a stickler for one position or solution, Jobs tended to yo-yo back and forth between extremes, only slowly converging on a decision – something that often confused others at Apple until they learned to use a rolling average of his recent positions.)</p><p>The critical part, of course, was that Steve Jobs was right about a lot of things, despite several serious missteps (especially in regards to making over-expensive computers that no one wants to pay for). I think Jobs's success provides evidence that even in aesthetic matters, success has a surprisingly strong component of <i>being actually right</i>. 
And Jobs, who was all-around very bright despite not being a master of the technical side, seems to have mastered this.</p><p>Of course, the story of Jobs's success – which came in spite of his emotional volatility, and tendency to wish away problems rather than facing them – does not entirely fit the idea that success comes in large part from having well-calibrated beliefs about the world and going about achieving them in reasonable and rational ways.</p><p>I think there are three things worth keeping in mind.</p><p>First, it may well be that most successful people are successful "at random" (i.e. without having a rational strategy for achieving what they want to achieve), but that the probability of achieving your goals given that you have well-calibrated beliefs and a rational reality-accommodating plan is still very much higher than the probability of achieving them given any other strategy. That is, if <script type="math/tex">S</script> is the event of being very successful (by some definition), <script type="math/tex">R</script> the event that you follow a rational strategy and maintain well-calibrated beliefs and generally practice thought patterns that won't get you downvoted on LessWrong, <script type="math/tex">\neg R</script> the complement of that event, <script type="math/tex">P(\neg R|S)</script> can be high (i.e. most successful people became successful in not particularly smart ways), while <script type="math/tex">P(S|R)</script> can be much higher than <script type="math/tex">P(S|\neg R)</script> (following a rational strategy still gives you by far the best chances of success).</p><p>Second, Jobs's life illustrates the principle that you only have to be very right a small number of times – just like in general most of the return, especially in anything risky, comes from a small number of bets. 
He failed at managing, even when working under another CEO who had been brought in specifically to babysit him, to the extent that he was kicked out of his own company. He failed to build successful hardware after founding NeXT. However, he was really right about product design, and that was enough.</p><p>Third, though he did get away with ignoring many uncomfortable truths by simply willing them away, eventually reality hit back. He delayed dealing with the cancer threat when he was first told of it, and he trusted alternative treatments. The combination may well have killed him.</p><p> </p><h2 id="benjamin-franklin">Benjamin Franklin</h2><p>Benjamin Franklin was a newspaper publisher, writer, postmaster, ambassador, political leader, and scientist. He invented the lightning rod and realised that electric charge came in both a positive and negative form (and gave those names to them, as temporary ones until "[English] philosophers give us better").</p><p>He was one of the first or most influential pioneers of many other things as well; to take a random example, he thought up the idea of matched funding for a charitable project (and was quite proud of it too: "I do not remember any of my political maneuvers the success of which gave me at the time more pleasure, or that in after thinking about it I more easily excused myself for having made use of cunning").</p><p>More generally, he clearly enjoyed numbers and detail:</p><blockquote><p><i>[...H]e loved immersing himself in minutiae and trivia in a manner so obsessive that it might today be described as geeky. He was meticulous in describing every technical detail of his inventions, be it the library arm, stove, or lightning rod. In his essays, ranging from his arguments against hereditary honors to his discussions of trade, he provided reams of detailed calculations and historical footnotes. 
Even in his most humorous parodies, such as his proposal for the study of farts, the cleverness was enhanced by his inclusion of mock-serious facts, trivia, calculations, and learned precedents</i></p></blockquote><p>Do-gooders with time machines could do worse than giving him access to a spreadsheet program.</p><p>One of the best descriptions of Franklin's personality comes from Isaacson's comparison of him with John Adams (when they were both in Paris, late in Franklin's life):</p><blockquote><p><i>Adams was unbending and outspoken and argumentative, Franklin charming and taciturn and flirtatious. Adams was rigid in his personal morality and lifestyle, Franklin famously playful. Adams learned French by poring over grammar books and memorizing a collection of funeral orations; Franklin (who cared little about the grammar) learned the language by lounging on the pillows of his female friends and writing them amusing little tales. Adams felt comfortable confronting people, whereas Franklin preferred to seduce them, and the same was true of the way they dealt with nations.</i></p></blockquote><p>One striking thing when reading about 18th-century events is the informality and nepotism. For example, to become postmaster of the colonies, Franklin spent significant money on having a friend lobby on his behalf in London, and upon obtaining the position gave out cushy jobs to his son, brothers, brother's stepson, sister's son, and two of his wife's relatives.</p><p>Not only that, but the border between truth and fiction was also hazy in the press. Articles could be, without any differentiating label, either factual, obviously satirical, satirical in a way that takes a clever reader to spot, or outright hoaxes. 
Likewise Franklin often wrote and published letters to his own newspaper under pseudonyms, with various levels of disguise ranging from clearly transparent to purposefully anonymous (this, however, was normal, as it was often seen as unworthy of gentlemen to write such letters under their own names).</p><p>In other ways, the 18th century, and 18th century Franklin in particular, were surprisingly modern and liberal. Franklin took a very reasonable and liberal stance on the freedom of press:</p><blockquote><p><i>“It is unreasonable to imagine that printers approve of everything they print. It is likewise unreasonable what some assert, That printers ought not to print anything but what they approve; since […] an end would thereby be put to free writing, and the world would afterwards have nothing to read but what happened to be the opinions of printers.”</i></p></blockquote><p>He still exercised judgement over what he printed. When deciding whether to print something that violated his principles for money, he (reportedly) went through a process that many modern newspaper editors and Facebook engineers could well take to heart:</p><blockquote><p><i>To determine whether I should publish it or not, I went home in the evening, purchased a twopenny loaf at the baker’s, and with the water from the pump made my supper; I then wrapped myself up in my great-coat, and laid down on the floor and slept till morning, when, on another loaf and a mug of water, I made my breakfast. From this regimen I feel no inconvenience whatever. Finding I can live in this manner, I have formed a determination never to prostitute my press to the purposes of corruption and abuse of this kind for the sake of gaining a more comfortable subsistence.</i></p></blockquote><p>The 18th century offers some perspective about hostile politics too. 
After describing an extremely personal and angry election campaign (which Franklin lost), Isaacson writes:</p><blockquote><p><i>Modern election campaigns are often criticized for being negative, and today’s press is slammed for being scurrilous. But the most brutal of modern attack ads pale in comparison to the barrage of pamphlets in the 1764 [Pennsylvania] Assembly election. Pennsylvania survived them, as did Franklin, and American democracy learned that it could thrive in an atmosphere of unrestrained, even intemperate, free expression. As the election of 1764 showed, American democracy was built on a foundation of unbridled free speech. In the centuries since then, the nations that have thrived have been those, like America, that are most comfortable with the cacophony, and even occasional messiness, that comes from robust discourse.</i></p></blockquote><p>Isaacson points out that Franklin's popularity has come and gone, and explains this by making him the symbol of one side of a cultural and political dichotomy: tolerance and compromise rather than dogmatism and crusading, pragmatism rather than romanticism, social mobility rather than class and hierarchy, and secular material success over religious salvation. Thus, while immensely popular in the latter part of his life and after his death, once the Romantic Era got underway, he became seen as shallow, thrifty, and lacking in passion. For example, Franklin appears in Herman Melville's novel <i>Israel Potter</i>, a work that sounds like the most confusing Harry Potter fan-fiction of all time, as a precursor to today's shallow self-help gurus.</p><p>A perfect example of the type of cunning that made some people call him shallow comes from his time as a frontier commander. To get soldiers to attend worship services, he had the chaplain give out the daily rum rations right after the service. 
"Never were prayers more generally and punctually attended", Franklin proudly wrote.</p><p>Or: at the signing of the Declaration of Independence, John Hancock solemnly declared "There must be no pulling different ways; we must all hang together". Franklin reportedly responded, with a wit but not solemnity worthy of the historic occasion: "Yes, we must, indeed, all hang together, or most assuredly we shall all hang separately".</p><p>This oscillation between romantically-minded eras finding him shallow and business-minded eras finding him the godfather of all self-help gurus and thrifty entrepreneurs has continued to this day. It is true that his aphorism collections, as documented in his famous Poor Richard's Almanac, are more clever than insightful; that he was no moral philosopher; and that his virtue-cultivating efforts were often patchy. However, they are part of a crucial process: the separation of morality from theology during the Enlightenment, which "Franklin was [the] avatar" of. Franklin's foundational personal maxim, which he often repeated, is perhaps the single sentence that pre-modern religious countries most need to hear: “The most acceptable service to God is doing good to man".</p><p>The romanticists' criticisms are based on truths. Though sociable, founding and participating in many societies, his personal relationships tended to be intellectual but distant. Interestingly, despite his vast achievements, Franklin does not show signs of a deep unyielding inner ambition; he seems to have been driven by vague instincts to be useful, a sense of pride (which he tried to dull throughout his life), curiosity, and a delight in tinkering, planning, and organising. To his sister in 1771 he wrote "[...] I am much disposed to like the world as I find it, and to doubt my own judgment as to what would mend it" – a remarkable sentiment from the pen of someone who, not many years later, would be playing a key role in a revolution. 
And though even past the age of 75 he achieved a few minor things, like being instrumental in securing France's alliance to America, signing the peace treaty between the US and Britain, shaping the US Constitution, and being the head of Pennsylvania's government, he happily wiled away many of his latter days playing cards with only the occasional twinge of guilt. He specifically justified this in part based on a belief in the afterlife: "You know the soul is immortal; why then should you be such a niggard of a little time, when you have a whole eternity before you?"</p><p>However, even these traits seem to have made him exactly what America needed. He was a skilled diplomat in France partly because of his easy-going nature and lack of naked ambition. At the Constitutional Convention of 1787, he often hosted the (much younger) other leading revolutionaries at his house to talk about things in a less formal setting and soften their stances, and generally advocated tolerance and compromise. Isaacson cleverly summarises:</p><blockquote><p><i>Compromisers may not make great heroes, but they do make democracies.</i></p></blockquote><p>Perhaps the best known summary of Franklin's life is Turgot's epigram that "he snatched lightning from the sky and the sceptre from tyrants". Franklin himself had a go at this: he wrote an autobiography – then a rare form of book – and also proposed a cheeky epitaph for himself, including an exhortation to wait for a "new and more elegant edition [of him], revised and corrected by the Author".</p><p>He didn't just summarise himself, though. He also unwittingly wrote perhaps the pithiest summary of the spirit of the entire Enlightenment project, and consequently of the driving spirit of human progress since then. 
It was in a letter Franklin wrote to his wife, after narrowly escaping a shipwreck on the English coast in 1757:</p><blockquote><p><i>Were I a Roman Catholic, perhaps I should on this occasion vow to build a chapel to some saint; but as I am not, if I were to vow at all, it should be to build a lighthouse.</i></p></blockquote>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-14457107463493519742021-04-25T21:52:00.002+01:002022-03-31T22:58:16.256+01:00Lambda calculus<p style="text-align: center;"><i><span style="font-size: x-small;">7.8k words, including equations (about 30 minutes)</span></i></p><p style="text-align: center;"><i><span style="font-size: x-small;"> </span></i></p><p style="text-align: center;"><i><span style="font-size: x-small;">This post has also been published <a href="https://www.lesswrong.com/posts/D4PYwNtYNwsgoixGa/intro-to-hacking-with-the-lambda-calculus">here</a>. </span></i><br /></p><p> </p><p>This post is about lambda calculus. The goal is not to do maths with it, but rather to build up definitions within it until we can express non-trivial algorithms easily. At the end we will see a lambda calculus interpreter written in the lambda calculus, and realise that we're most of the way to Lisp.</p><p>But first, why care about lambda calculus? 
Consider four different systems:</p><ul><li><p>A <b>Turing machine</b> – that is, a machine that:</p><ul><li><p>works on an infinite tape of cells from which a finite set of symbols can be read and written, and always points at one of these cells;</p></li><li><p>has some set of states it can be in, some of which are termed "accepting" and one of which is the starting state; and</p></li><li><p>given a combination of current state and current symbol on the tape, always does an action consisting of three things:</p><ul><li>writes some symbol on the tape (possibly the same that was already there),</li><li>transitions to some state (possibly the same it is already in), and</li><li>moves one cell left or right on the tape.</li> </ul></li> </ul></li><li><p>The <b>lambda calculus</b> (<script type="math/tex">\lambda</script>-calculus), a formal system that has expressions that are built out of an infinite set of variable names using <script type="math/tex">\lambda</script>-terms (which can be thought of as anonymous functions) and applications (analogous to function application), and a few simple rules for shuffling around the symbols in these expressions.</p></li><li><p>The <b>partial recursive functions</b>, constructed by function composition, primitive recursion (think bounded for-loops), and minimisation (returning the first value for which a function is zero) on three basic sets of functions:</p><ul><li>the zero functions, that take some number of arguments and return 0;</li><li>a successor function that takes a number and returns that number plus 1; and</li><li>the projection functions, defined for all natural numbers <script type="math/tex">a</script> and <script type="math/tex">b</script> such that <script type="math/tex">a \geq b</script> as taking in <script type="math/tex">a</script> arguments and returning the <script type="math/tex">b</script>th one.</li> </ul></li><li><p><b>Lisp</b>, a human-friendly axiomatisation of computation that accidentally 
became an extremely good and long-lived programming language.</p></li> </ul><p>The big result in theoretical computer science is that these can all do the same thing, in the sense that if you can express a calculation in one, you can express it in any other.</p><p>This is not an obvious thing. For example, the only thing lambda calculus lets you do is create terms consisting of symbols, single-argument anonymous functions, and applications of terms to each other (we'll look at the specifics soon). It's an extremely simple and basic thing. Yet no matter how hard you try, you can't make something that can compute more things, whether it's by inventing programming languages or building fancy computers.</p><p>Also, if you try to make something that does some sort of calculation (like a new programming language), then unless you keep it stupidly simple and/or take great care, it will be able to compute anything that is computable at all (at least in la-la-theory-land, where memory is infinite and you don't have to worry about practical details, like whether the computation finishes before the sun goes nova).</p><p>Physicists search for their theory of everything. The computer scientists already have many, even though they've been at it for a lot less time than the physicists have: everything computable can be reduced to one of the many formalisms of computation. (One of the main reasons that we can talk about "computability" as a sensible universal concept is that any reasonable model makes the same things computable; the threshold is easy to hit and impossible to exceed, so computable versus not is an obvious thing to pay attention to.)</p><p>To talk about the theory of computation properly, we need to look at at least one of those models. The most well-known is the Turing machine. 
Turing machines have several points in their favour:</p><ul><li>They are the easiest to imagine as a physical machine.</li><li>They have clear and separate notions of time (steps taken in execution) and space (length of tape used).</li><li>They were invented by Alan Turing, who contributed to breaking the Enigma code during World War II, before being unjustly persecuted for being gay and tragically dying of cyanide poisoning at age 41.</li> </ul><p>In contrast, compare the lambda calculus:</p><ul><li>It is an abstract formal system arising out of a failed attempt to axiomatise logic.</li><li>There are many execution paths for a non-trivial expression.</li><li>It was invented by Alonzo Church, who lived a boringly successful life as a maths professor at Princeton, had three children, and died at age 92.</li> </ul><p>(Turing and Church worked together from 1936 to 1938, Church as Turing's doctoral advisor, after they independently proved the impossibility of solving the halting problem. At the same time and also working at Princeton were Albert Einstein, Kurt Gödel, and John von Neumann (who, if he had had his way, would've hired Turing and kept him from returning to the UK).)</p><p>However, the lambda calculus also has advantages. Its less mechanistic and more mathematical view of computation is arguably more elegant, and it has fewer parts: instead of states, symbols, and a tape, the current state is just a term, and the term also represents the algorithm. 
It abstracts more nicely – we will see how we can, bit by bit, abstract out elements and get something that is a sensible programming language, a project that would be messier and longer with Turing machines.</p><p>Turing machines and lambda calculus are the foundations of imperative and functional programming respectively, and the situation between these two programming paradigms mirrors that between TMs and <script type="math/tex">\lambda</script>-calculus: one is more mechanistic, more popular, and more useful when dealing with (stateful) hardware; the other more mathematical, less popular, and neater for abstraction-building.</p><h3>Lambda trees</h3><p>Now let's define exactly what a lambda calculus term is.</p><p>We have an infinite set of variables <script type="math/tex">x_1, x_2, x_3, ...</script>, though for simplicity we will use any lowercase letter to refer to them. Any variable is a valid term. Note that variables are just symbols – despite the word "variable", there is no value bound to them.</p><p>We have two rules for building new terms:</p><ul><li><script type="math/tex">\lambda</script>-terms are formed from a variable <script type="math/tex">x</script> and a term <script type="math/tex">M</script>, and are written <script type="math/tex">(\lambda x. M)</script>.</li><li>Applications are formed from two terms <script type="math/tex">M</script> and <script type="math/tex">N</script>, and are written <script type="math/tex">(M N)</script>.</li> </ul><p>These terms, like most things, are trees. 
I will mostly ignore the convention of writing out horrible long strings of <script type="math/tex">\lambda</script>s and variables, only partly mitigated by parenthesis-reducing rules, and instead draw the trees.</p><p>(When it appears in this post, the standard notation appears slightly more horrible than usual because, for simplicity, I neglect the parenthesis-reducing rules (they can be confusing at first).)</p><p>Here are a few examples of terms, together with standard representations:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-UlDDkX52Ra0/YIXTvhmzH-I/AAAAAAAACvg/2hsLntnO5rkBekTYnalMEBAzIgWjZJkxACLcBGAsYHQ/terms.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="528" data-original-width="1138" height="296" src="https://lh3.googleusercontent.com/-UlDDkX52Ra0/YIXTvhmzH-I/AAAAAAAACvg/2hsLntnO5rkBekTYnalMEBAzIgWjZJkxACLcBGAsYHQ/w640-h296/terms.png" width="640" /></a></div><p></p><p>This representation makes it clear that we're dealing with a tree where nodes are either variables, lambda terms where the left child is the argument and the right child is the body, or applications. 
(I've circled the variables to make clear that the argument variable in a <script type="math/tex">\lambda</script>-term has a different role than a variable appearing elsewhere.)</p><p>It's not quite right to say that a <script type="math/tex">\lambda</script>-term is a function; instead, think of <script type="math/tex">\lambda</script>-terms as one representation of a (mathematical) function, when combined with the reduction rule we will look at soon.</p><p>If we interpret the above terms as representations of functions, we might rewrite them (in Pythonic pseudocode) as, from left to right:</p><ul><li><code>lambda x -> x</code> (i.e., the identity function) (<code>lambda</code> is a common keyword for an anonymous function in programming languages, for obvious reasons).</li><li><code>(lambda f -> f(y))(lambda x -> x)</code> (apply a function that takes a function and calls that function on <code>y</code> to the identity function as an argument).</li><li><code>x(y)</code></li> </ul><h2>Reduction</h2><p>Execution in lambda calculus is driven by something that is called <script type="math/tex">\beta</script>-reduction, presumably because Greek letters are cool. 
The basic idea of <script type="math/tex">\beta</script>-reduction is this:</p><ul><li>Pick an application (which I've represented by orange circles in the tree diagrams).</li><li>Check that the left child of the application node is a <script type="math/tex">\lambda</script>-term (if not, you have to reduce it to a <script type="math/tex">\lambda</script>-term before you can make that application).</li><li>Replace the variable in the left child of the <script type="math/tex">\lambda</script>-term with the right child of the application node wherever it appears in the right child of the <script type="math/tex">\lambda</script>-term, and then replace the application node with the right child of the <script type="math/tex">\lambda</script>-term.</li> </ul><p>In illustrated form, on the middle example above, using both tree diagrams and the usual notation:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-WMZ34VP_SSc/YIXT1f79EaI/AAAAAAAACvk/TrjxXNYOrGUb2Gt_22VQaSArx_-TkNyiQCLcBGAsYHQ/reduction1.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="634" data-original-width="1196" height="340" src="https://lh3.googleusercontent.com/-WMZ34VP_SSc/YIXT1f79EaI/AAAAAAAACvk/TrjxXNYOrGUb2Gt_22VQaSArx_-TkNyiQCLcBGAsYHQ/w640-h340/reduction1.png" width="640" /></a></div><p></p>(The notation <script type="math/tex">M[N/x]</script> means substitute the term <script type="math/tex">N</script> for the variable <script type="math/tex">x</script> in the term <script type="math/tex">M</script>; the general rule for <script type="math/tex">\beta</script>-reduction is that given <script type="math/tex">((\lambda x. M) N)</script>, you can replace it with <script type="math/tex">M[N/x]</script>, subject to some details that we will mostly skip over shortly.) 
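<p>For readers who prefer code to trees, the rule can be sketched in actual Python. This is only a minimal sketch, not a full implementation: the class names <code>Var</code>, <code>Lam</code> and <code>App</code> and the functions <code>substitute</code> and <code>beta_step</code> are made up for illustration, it only reduces an application sitting at the root of the tree, and it assumes variable names never collide, so it does no renaming.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    arg: str      # the argument variable (left child in the tree diagrams)
    body: object  # the body term (right child)


@dataclass(frozen=True)
class App:
    fn: object    # left child of the application node
    arg: object   # right child of the application node


def substitute(term, name, value):
    """Return term[value/name]: replace free occurrences of `name` with `value`.

    Assumption: no free variable of `value` is bound inside `term`,
    so no renaming is ever needed. A real implementation must rename.
    """
    if isinstance(term, Var):
        return value if term.name == name else term
    if isinstance(term, Lam):
        if term.arg == name:
            # An inner lambda rebinds the same name, so leave its body alone.
            return term
        return Lam(term.arg, substitute(term.body, name, value))
    return App(substitute(term.fn, name, value),
               substitute(term.arg, name, value))


def beta_step(term):
    """Turn a root-level ((lambda x. M) N) into M[N/x]; leave anything else alone."""
    if isinstance(term, App) and isinstance(term.fn, Lam):
        return substitute(term.fn.body, term.fn.arg, term.arg)
    return term


# The middle example term: ((lambda f. (f y)) (lambda x. x))
identity = Lam("x", Var("x"))
example = App(Lam("f", App(Var("f"), Var("y"))), identity)

step1 = beta_step(example)  # the application ((lambda x. x) y)
step2 = beta_step(step1)    # the variable y
```

<p>Frozen dataclasses are used so that terms are immutable values that compare by structure, which matches the idea that a term simply <i>is</i> its tree.</p>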
<p>In our example, we end up with another application term, so we can reduce it further:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-uExQDerRBfo/YIXT5Q7Ve6I/AAAAAAAACvo/_qQYEhP3HZEFfQdPoddyKGdbPZJbZVaIQCLcBGAsYHQ/reduction2.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="666" data-original-width="1000" height="426" src="https://lh3.googleusercontent.com/-uExQDerRBfo/YIXT5Q7Ve6I/AAAAAAAACvo/_qQYEhP3HZEFfQdPoddyKGdbPZJbZVaIQCLcBGAsYHQ/w640-h426/reduction2.png" width="640" /></a></div><p></p><p>In our Pythonic pseudocode, we might represent this as an execution trace like the following:</p><pre><code>(lambda f -> f(y))(lambda x -> x)</code></pre><pre><code> --></code></pre><pre><code>(lambda x -> x)(y)</code></pre><pre><code> --></code></pre><pre><code>y<br /></code></pre><p>Reduction is not always so simple, even if there's only a single choice of what to reduce. You have to be careful if the same variable appears in different roles, and rename if necessary. The core rule is that within the tree rooted at a <script type="math/tex">\lambda</script>-term that takes an argument <script type="math/tex">x</script>, the variable <script type="math/tex">x</script> always means whatever was given to that <script type="math/tex">\lambda</script>-term, and never anything else. An <script type="math/tex">x</script> bound in one <script type="math/tex">\lambda</script>-term is distinct from an <script type="math/tex">x</script> bound in another <script type="math/tex">\lambda</script>-term.</p><p>The simplest way to get around problems is to make your first variable <script type="math/tex">x_1</script> and, whenever you need a new one, call it <script type="math/tex">x_i</script> where <script type="math/tex">i</script> is one more than the maximum index of any existing variable. 
Unfortunately humans aren't good at remembering the difference between <script type="math/tex">x_9</script> and <script type="math/tex">x_{17}</script>, and humans like conventions (like using <script type="math/tex">x</script> for generic variables, <script type="math/tex">f</script> for things that will be <script type="math/tex">\lambda</script>-terms, and so forth). Therefore we sometimes have to think about name collisions.</p><p>The principle that lets us out of name collision problems is that you can rename variables as you want (as long as distinct variables aren't renamed to the same thing). The name for this is <script type="math/tex">\alpha</script>-equivalence (more Greek letters!); for example <script type="math/tex">(\lambda x .x)</script> and <script type="math/tex">(\lambda y. y)</script> are <script type="math/tex">\alpha</script>-equivalent.</p><p>There are, of course, detailed rules for how to deal with name collisions when doing <script type="math/tex">\beta</script>-reductions, but you should be fine if you think about how variable scoping should sensibly work to preserve meaning (something you've already had to reason about if you've ever programmed). 
(A helpful concept to keep in mind is the difference between free variables and bound variables – starting from a variable and following the path up the tree to the parent node, does it run through a <script type="math/tex">\lambda</script>-node with that variable as an argument?)</p><p>An example of a name collision problem is this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-3CsGljEB1Po/YIXUBXYsiaI/AAAAAAAACvw/jREPl0dgL7ANsQN0D-XDyyBZpqRa0ff5wCLcBGAsYHQ/wrongreduction.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="766" data-original-width="1094" height="448" src="https://lh3.googleusercontent.com/-3CsGljEB1Po/YIXUBXYsiaI/AAAAAAAACvw/jREPl0dgL7ANsQN0D-XDyyBZpqRa0ff5wCLcBGAsYHQ/w640-h448/wrongreduction.png" width="640" /></a></div><p></p><p>We can't do this because the <script type="math/tex">x</script> in the innermost <script type="math/tex">\lambda</script>-term on the left must mean whatever was passed to it, and the <script type="math/tex">y</script> whatever was passed to the outer <script type="math/tex">\lambda</script>-term. However, our reduction leaves us with an expression that applies its argument to itself. 
We can solve this by renaming the <script type="math/tex">x</script> within the inner <script type="math/tex">\lambda</script>-term:</p><p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-rbqp3oh23-c/YIXUJWBcHgI/AAAAAAAACv8/P7arn5R2eE88z7YxIansO7TtbozLuBDhQCLcBGAsYHQ/wrongreductionfix.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="870" data-original-width="1186" height="470" src="https://lh3.googleusercontent.com/-rbqp3oh23-c/YIXUJWBcHgI/AAAAAAAACv8/P7arn5R2eE88z7YxIansO7TtbozLuBDhQCLcBGAsYHQ/w640-h470/wrongreductionfix.png" width="640" /></a></div></div><p></p><p>The general way to think of lambda calculus terms is that they are partitioned in two ways into equivalence classes:</p><ul><li>The first, rather trivial, set of equivalence classes is treating all <script type="math/tex">\alpha</script>-equivalent terms as the same thing. "Equivalent" and <script type="math/tex">\alpha</script>-equivalent are usually the same thing when we're talking about the lambda calculus; it's the "structure" of a term that matters, not the variable names.</li><li>The second set of equivalence classes is treating everything that can be <script type="math/tex">\beta</script>-reduced into the same form as equivalent. This is less trivial – in fact, it's undecidable in the general case (as we will see in the post about computation theory).</li> </ul><h2>That's it</h2><p>Yes, really, that's all you need. There exists a lambda calculus term that beats you in chess.</p><p>You might ask: but hold on a moment, we have no data – no numbers, no pairs, no lists, no strings – how can we input chess positions into a term or get anything sensible as an answer? We will see later that it's possible to encode data as lambda terms. 
The chess-playing term would accept some massive mess of <script type="math/tex">\lambda</script>-terms encoding the board configuration as an input, and after a lot of reductions it would become a term encoding the move to make – eventually checkmate, against you.</p><p>Before we start abstracting out data and more complex functions, let's make some simple syntax changes and look at some basic facts about reduction.</p><h2>Some syntax simplifications</h2><p>The pure lambda calculus does not have <script type="math/tex">\lambda</script>-terms that take more than one argument. This is often inconvenient. However, there's a simple mapping between multi-argument <script type="math/tex">\lambda</script>-terms and single-argument ones: instead of a two-argument function, say, just have a function that takes in an argument and returns a one argument function that takes in an argument and returns a result using both arguments.</p><p>(In programming language terms, this is currying.)</p><p>In the standard notation, <script type="math/tex">(\lambda x.(\lambda y. M))</script> is often written <script type="math/tex">(\lambda xy.M)</script>. 
Likewise, we can do similar simplifications on our trees, remembering that this is a syntactic/visual difference, rather than introducing something new to the lambda calculus:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-cVpDLaEazpQ/YIXUPbu9LwI/AAAAAAAACwA/q-lIAh_fh0AHGGS3t4sQOWYZJTNq-uxEQCLcBGAsYHQ/simplersyntax.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="670" data-original-width="880" height="305" src="https://lh3.googleusercontent.com/-cVpDLaEazpQ/YIXUPbu9LwI/AAAAAAAACwA/q-lIAh_fh0AHGGS3t4sQOWYZJTNq-uxEQCLcBGAsYHQ/w400-h305/simplersyntax.png" width="400" /></a></div><p></p><p>Once we've made this change, the next natural simplification to make is to allow one application node to apply a <script type="math/tex">\lambda</script>-term with "many arguments" to many arguments at once (remember that it actually stands for a bunch of nested normal single-argument <script type="math/tex">\lambda</script>-terms):</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-bgLtQrFG_c0/YIXUSDSEaXI/AAAAAAAACwE/JdfsNQC21cAhHgCLDkimLoQhTRwq_Q_xgCLcBGAsYHQ/simplersyntax2.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="600" data-original-width="1100" height="350" src="https://lh3.googleusercontent.com/-bgLtQrFG_c0/YIXUSDSEaXI/AAAAAAAACwE/JdfsNQC21cAhHgCLDkimLoQhTRwq_Q_xgCLcBGAsYHQ/w640-h350/simplersyntax2.png" width="640" /></a></div><p></p><p>(The corresponding simplification in the standard syntax is that <script type="math/tex">(M \, A \, B\, C)</script> means <script type="math/tex">(((M \, A)\, B)\, C)</script>. In a standard programming language, this might be written <code>M(A)(B)(C)</code>; that is, applying <code>M</code> to <code>A</code> to get a function that you apply to <code>B</code>, yielding another function that you apply to <code>C</code>. 
Sanity check: what's the difference between <script type="math/tex">((M \, A) \, B)</script> and <script type="math/tex">(M \, (A \, B))</script>?)</p><p> </p><h2>Some facts about reduction</h2><h3><script type="math/tex">\beta</script>-normal forms</h3><p>A <script type="math/tex">\beta</script>-normal form can be thought of as a "fully evaluated" term. More specifically, it is one where this configuration of nodes does not appear in the tree (after multi-argument <script type="math/tex">\lambda</script>s and applications have been compiled into single-argument ones), where <script type="math/tex">M</script> and <script type="math/tex">N</script> are arbitrary terms:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-VdVQqLwIZlE/YIXUZa0HY8I/AAAAAAAACwM/ROTS3CjzNEwSaySEpQAroZJQ69Q5S9F2QCLcBGAsYHQ/normal.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="342" data-original-width="448" height="240" src="https://lh3.googleusercontent.com/-VdVQqLwIZlE/YIXUZa0HY8I/AAAAAAAACwM/ROTS3CjzNEwSaySEpQAroZJQ69Q5S9F2QCLcBGAsYHQ/normal.png" width="314" /></a></div><p></p><p>Intuitively, if such a term does appear, then the reduction rules allow us to reduce the application (replacing this part of the tree with whatever you get when you substitute <script type="math/tex">N</script> in place of <script type="math/tex">x</script> within <script type="math/tex">M</script>), so our term is not fully reduced yet.</p><h3>Terms without a <script type="math/tex">\beta</script>-normal form</h3><p>Does every term have a <script type="math/tex">\beta</script>-normal form? If you've seen computation theory stuff before, you should be able to answer this immediately without considering anything about the lambda calculus itself.</p><p>The answer is no, because reducing to a <script type="math/tex">\beta</script>-normal form is the lambda calculus equivalent of an algorithm halting. 
Lambda calculus has the same expressive power as Turing machines or any other model of computation, and some algorithms run forever, so there must exist lambda calculus terms that you can keep reducing without ever getting a <script type="math/tex">\beta</script>-normal form.</p><p>Here's one example, often called <script type="math/tex">\Omega</script>: </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-vkqpIhKyJXY/YIXUcgli2oI/AAAAAAAACwQ/40CTgJilizggNn99lXI0-4YHDetNgNbZgCLcBGAsYHQ/omega.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="706" data-original-width="894" height="316" src="https://lh3.googleusercontent.com/-vkqpIhKyJXY/YIXUcgli2oI/AAAAAAAACwQ/40CTgJilizggNn99lXI0-4YHDetNgNbZgCLcBGAsYHQ/w400-h316/omega.png" width="400" /></a></div><p></p><p>Note that even though we use the same variable <script type="math/tex">x</script> in both branches, the variable means a different thing: in the left branch it's whatever is passed as an input to the left <script type="math/tex">\lambda</script>-term – one reduction step onwards, that <script type="math/tex">x</script> stands for the entire right branch, which has its own <script type="math/tex">x</script>. 
In fact, before we start reducing, we will do an <script type="math/tex">\alpha</script>-conversion on the right branch (a pretentious way of saying that we will rename the bound variable).</p><p>Now watch:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-osYjKlbs2f0/YIXUfOxA5pI/AAAAAAAACwY/WUdsWRTXmkYfLCcyEeHvZnNqV7zFVNmqQCLcBGAsYHQ/omegareduction.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="526" data-original-width="998" height="338" src="https://lh3.googleusercontent.com/-osYjKlbs2f0/YIXUfOxA5pI/AAAAAAAACwY/WUdsWRTXmkYfLCcyEeHvZnNqV7zFVNmqQCLcBGAsYHQ/w640-h338/omegareduction.png" width="640" /></a></div><p></p><p>After one reduction step, we end up with the same term (as usual, we are treating <script type="math/tex">\alpha</script>-equivalent terms as equivalent; the variable could be <script type="math/tex">x</script> or <script type="math/tex">y</script> or <script type="math/tex">å</script> for all we care).</p><h3>Ambiguities with reduction</h3><p>Does it matter how we reduce, or does every reduction path eventually lead to a <script type="math/tex">\beta</script>-normal form, assuming that one exists in the first place? 
If you haven't seen this before, you might want to have a go at this before reading on.</p><p>Here's one example of a tricky term:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-p-OCKUqUvUI/YIXUjbM7gII/AAAAAAAACwg/BkJUhbr62GclfyCoxAbGIKcI-1-IMU4jgCLcBGAsYHQ/normalorder1.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="596" data-original-width="900" height="212" src="https://lh3.googleusercontent.com/-p-OCKUqUvUI/YIXUjbM7gII/AAAAAAAACwg/BkJUhbr62GclfyCoxAbGIKcI-1-IMU4jgCLcBGAsYHQ/normalorder1.png" width="320" /></a></div><p></p><p>Imagine that <script type="math/tex">M</script> has a <script type="math/tex">\beta</script>-normal form, and <script type="math/tex">\Omega</script> is as defined above and therefore can be reduced forever. If we start by reducing the application node, in a moment <script type="math/tex">\Omega</script> and all its loopiness gets thrown away, and we're left with just <script type="math/tex">M</script>, since the <script type="math/tex">\lambda</script>-term takes two arguments and returns the first. However, if we start by reducing <script type="math/tex">\Omega</script>, or are following a strategy like "evaluate the arguments before the application", we will at some point reduce <script type="math/tex">\Omega</script> and get thrown for a loop.</p><p>We can take a broader view here. In any programming language – I will use Lisp notation because it's the closest to lambda calculus – if we have a function like <code>(define func (lambda (x y) [FUNCTION BODY]))</code>, and a function call like <code>(func arg1 arg2)</code>, the evaluator has a choice of what it does. 
The simplest strategies are to either:</p><ul><li>Evaluate the arguments – <code>arg1</code> and <code>arg2</code> – first, and then inside the function <code>func</code> have <code>x</code> and <code>y</code> bound to the results of evaluating <code>arg1</code> and <code>arg2</code> respectively. This is called call-by-value, and is used by most programming languages.</li><li>Bind <code>x</code> and <code>y</code> inside <code>func</code> to the unevaluated expressions <code>arg1</code> and <code>arg2</code>, and evaluate <code>arg1</code> and <code>arg2</code> only upon encountering them in the process of evaluating <code>func</code>. This is called call-by-name. It's rare to see it in programming languages (an exception being that it's possible with Lisp macros), but functional languages like Haskell often have a variant, call-by-need or "lazy evaluation", where <code>arg1</code> and <code>arg2</code> are only evaluated when needed, but once evaluated the results are memoized so that the evaluation only needs to happen once.</li> </ul><p>Call-by-value reduces what you can express as an ordinary function. 
Imagine trying to define your own if-function in a language with call-by-value:</p><pre><code class="language-scheme" lang="scheme">(define IF<br />  (lambda (predicate consequent alternative)<br />    (if predicate<br />        consequent <span style="color: #999999;">; if predicate is true, do this</span><br />        alternative))) <span style="color: #999999;">; if predicate is false, do this instead</span><br /></code></pre><p>(Note that <code>IF</code> is the new if-function that we're trying to define, and <code>if</code> is assumed to be a language primitive.)</p><p>Now consider:</p><pre><code class="language-scheme" lang="scheme">(define factorial<br />  (lambda (n)<br />    (IF (= n 0)<br />        1<br />        (* n<br />           (factorial (- n 1))))))<br /></code></pre><p>You call <code>(factorial 1)</code>, and for the first call the program evaluates the arguments to <code>IF</code>:</p><ul><li><code>(= 1 0)</code></li><li><code>1</code></li><li><code>(* 1 (factorial 0))</code></li> </ul><p>The last one needs the value of <code>(factorial 0)</code>, so we evaluate the arguments to the <code>IF</code> in the recursive call:</p><ul><li><code>(= 0 0)</code></li><li><code>1</code></li><li><code>(* 0 (factorial -1))</code></li> </ul><p>... and so on. We can't define <code>IF</code> as a function, because in call-by-value the <code>alternative</code> gets evaluated as part of the function call even if <code>predicate</code> is true.</p><p>(Most languages solve this by giving you a bunch of primitives and making you stick with them, perhaps with some fiddly mini-language for macros built in (consider C/C++). 
In Lisp, you can easily write macros that use all of the language features, and therefore extend the language by essentially defining your own primitives that can escape call-by-value or any other potentially limiting language feature.)</p><p>It's the same issue with our term <script type="math/tex">((\lambda xy.x) \, M \, \Omega)</script> above: call-by-value goes into a silly loop because one of the arguments isn't even "meant to" be evaluated (from our perspective as humans with goals looking at the formal system from the outside).</p><p>Lambda calculus does not impose a reduction/"evaluation" order, so we can do what we like. However, this still leaves us with a problem: how do we know if our algorithm has gone into an infinite loop, or we just reduced terms in the wrong order?</p><h3>Normal order reduction</h3><p>It turns out that always doing the equivalent of call-by-name – reducing the leftmost, outermost term first – saves the day. If a <script type="math/tex">\beta</script>-normal form exists, this strategy will lead you to it.</p><p>Intuitively, this is because with call-by-name, there is no "unnecessary" reduction. If some arguments in some call are never used (like in our example), they never reduce. If we start reducing an expression while doing leftmost/outermost-first reduction, that reduction must be standing in the way between us and a successful reduction to <script type="math/tex">\beta</script>-normal form.</p><p>Formally: ... the proof is left as an exercise for the reader.</p><h3>Church-Rosser theorem</h3><p>The Church-Rosser theorem is the thing that guarantees we can talk about unique <script type="math/tex">\beta</script>-normal forms for a term. 
It says that:</p><blockquote><p>Letting <script type="math/tex">\Lambda</script> be the set of terms in the lambda calculus, <script type="math/tex">\rightarrow_\beta</script> the <script type="math/tex">\beta</script>-reduction relation, and <script type="math/tex">\twoheadrightarrow_\beta</script> its reflexive transitive closure (i.e. <script type="math/tex">M \twoheadrightarrow_\beta N</script> iff <script type="math/tex">M</script> and <script type="math/tex">N</script> are the same term, or there exist zero or more terms <script type="math/tex">P_1</script>, <script type="math/tex">P_2</script>, ... such that <script type="math/tex">M \rightarrow_\beta P_1 \rightarrow_\beta ... \rightarrow_\beta P_n \rightarrow_\beta N</script>), then:</p><p><b>For all <script type="math/tex">M \in \Lambda</script>, <script type="math/tex">M \twoheadrightarrow_\beta A</script> and <script type="math/tex">M \twoheadrightarrow_\beta B</script> implies that there exists <script type="math/tex">X \in \Lambda</script> such that <script type="math/tex">A \twoheadrightarrow_\beta X</script> and <script type="math/tex">B \twoheadrightarrow_\beta X</script>.</b></p></blockquote><p>Visually, if we have reduction chains like the black part, then the blue part must exist (a property known as confluence or the "diamond property"):</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-EvzNpdkHns0/YIXUwuCD5vI/AAAAAAAACws/a55xmnExm7kPIsTOfeB7yBMGD0TiGdpegCLcBGAsYHQ/churchrosser.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="574" data-original-width="998" height="230" src="https://lh3.googleusercontent.com/-EvzNpdkHns0/YIXUwuCD5vI/AAAAAAAACws/a55xmnExm7kPIsTOfeB7yBMGD0TiGdpegCLcBGAsYHQ/w400-h230/churchrosser.png" width="400" /></a></div><p></p><p>Therefore, even if there are many reduction paths, and even if some of them are non-terminating, no choice of <script type="math/tex">\beta</script>-reductions can cut us off from the terms that other reduction paths lead to: two paths that diverge from the same term can always be brought back together at some common term <script
type="math/tex">X</script>. If <script type="math/tex">X</script> is some <script type="math/tex">\beta</script>-normal form reachable from <script type="math/tex">M</script>, we know that any other reduction path that reaches a <script type="math/tex">\beta</script>-normal form must have reached <script type="math/tex">X</script>: a normal form cannot be reduced any further, so the only way for two normal forms to reduce to a common term is for them to already be the same term.</p><h2>The fun begins</h2><p>Now we will start making definitions within the lambda calculus. These definitions do not add any capabilities to the lambda calculus, but are simply conveniences to save us from having to draw huge trees repeatedly when we get to doing more complex things.</p><p>There are two big ideas to keep in mind:</p><ol start=""><li>There are no data primitives in the lambda calculus (even the variables are just placeholders for terms to get substituted into, and don't even have consistent names – remember that we work within <script type="math/tex">\alpha</script>-equivalence). As a result, the general idea is that you encode "data" as actions: the number 4 is represented by a function that takes a function and an input and applies the function to the input 4 times, a list might be encoded by a description of how to iterate over it, and so on.</li><li>There are no types. Nothing in the lambda calculus will stop you from passing a number to a function that expects a function, or vice versa. There exist <a href="https://en.wikipedia.org/wiki/Typed_lambda_calculus">typed lambda calculi</a>, but they prevent you from doing some of the cool things with combinators that we'll see later in this post.</li> </ol><h3>Pairs</h3><p>We want to be able to associate two things into a pair, and then extract the first and second elements. 
In other words, we want things that work like this:</p><pre><code>(fst (pair a b)) == a<br />(snd (pair a b)) == b<br /></code></pre><p>The simplest solution starts like this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-Elx8o61jQWM/YIXU1Gk5PQI/AAAAAAAACw0/DQAu3ZCQ_dQ3DlPYCWDPaKTrhOb8oWLTwCLcBGAsYHQ/pairs.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1044" data-original-width="820" height="400" src="https://lh3.googleusercontent.com/-Elx8o61jQWM/YIXU1Gk5PQI/AAAAAAAACw0/DQAu3ZCQ_dQ3DlPYCWDPaKTrhOb8oWLTwCLcBGAsYHQ/w315-h400/pairs.png" width="315" /></a></div><p></p><p>Now we can get the first of a pair by doing <code>((pair x y) first)</code>. If we want the exact semantics above, we can define simple helpers like </p><pre><code class="language-scheme" lang="scheme">fst = (lambda p<br /> (p first))<br /></code></pre><p>(i.e. <script type="math/tex">\text{fst} = (\lambda p. (p \, \text{first}))</script>), and </p><pre><code class="language-scheme" lang="scheme">snd = (lambda p<br /> (p second))<br /></code></pre><p>since now <code>(snd (pair x y))</code> reduces to <code>((pair x y) second)</code> reduces to <code>y</code>.</p><h3>Lists</h3><p>A list can be constructed from pairs: <code>[1, 2, 3]</code> will be represented by <code>(pair 1 (pair 2 (pair 3 False)))</code> (we will define <code>False</code> later). 
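</p><p>As a rough sketch, the pair machinery and a pair-based list can be tried out in Python, whose <code>lambda</code> is close enough to a <script type="math/tex">\lambda</script>-term for this purpose. This is an illustration under assumptions, not part of the lambda calculus itself: <code>pair</code> is assumed to be the usual <script type="math/tex">(\lambda a b f. (f \, a \, b))</script> (which is what <code>((pair x y) first)</code> suggests), multi-argument terms are written one argument at a time, <code>False_</code> uses the definition given later in this post (the same term as <code>second</code>), and plain Python integers stand in for encoded list items.</p><pre><code class="language-python" lang="python"># Selectors: two-argument terms, written one argument at a time (curried).
first = lambda x: lambda y: x
second = lambda x: lambda y: y

# Assumed definition: a pair holds a and b and hands both to a selector f.
pair = lambda a: lambda b: lambda f: f(a)(b)

# Helpers giving the (fst (pair a b)) == a and (snd (pair a b)) == b behaviour.
fst = lambda p: p(first)
snd = lambda p: p(second)

# False is defined later in the post as the same term as second.
False_ = second

# [1, 2, 3] as nested pairs; the Python ints are stand-ins for encoded numbers.
lst = pair(1)(pair(2)(pair(3)(False_)))

head = fst(lst)              # first item of the list
rest = snd(lst)              # the list without its first item
third = fst(snd(snd(lst)))   # walk two steps in, then take the head
</code></pre><p>Here <code>snd</code> doubles as "rest of the list", which is the operation the pair-based representation makes cheap.</p><p>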
If <script type="math/tex">l_1</script>, <script type="math/tex">l_2</script>, and <script type="math/tex">l_3</script> are the list items, a three-element list looks like this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-l5IiHdFAAVo/YIXU6O3V_3I/AAAAAAAACw4/M_EcXX0HyssTbmFvQG3QVfYQ6_4eLqa9QCLcBGAsYHQ/list.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1000" data-original-width="1060" height="378" src="https://lh3.googleusercontent.com/-l5IiHdFAAVo/YIXU6O3V_3I/AAAAAAAACw4/M_EcXX0HyssTbmFvQG3QVfYQ6_4eLqa9QCLcBGAsYHQ/w400-h378/list.png" width="400" /></a></div><p></p><p>We might also represent the same list like this instead:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-dRZUdY06P_E/YIXU-TNdfwI/AAAAAAAACw8/rKDaQ5adAYw2nhh6xj-tADnzoS-2n-FawCLcBGAsYHQ/listvar.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="858" data-original-width="970" height="354" src="https://lh3.googleusercontent.com/-dRZUdY06P_E/YIXU-TNdfwI/AAAAAAAACw8/rKDaQ5adAYw2nhh6xj-tADnzoS-2n-FawCLcBGAsYHQ/w400-h354/listvar.png" width="400" /></a></div><p></p><p>This second representation makes it trivial to define things like a <code>reduce</code> function: <code>([1, 2, 3] 0 +)</code> would return 0 plus the sum of the list <code>[1, 2, 3]</code>, if <code>[1, 2, 3]</code> is represented as above. 
However, this representation would also make it harder to do other list operations, like getting all but the first element of a list, whereas our pair-based lists can do this trivially (<code>(snd l)</code> gets you all but the first element of the list <code>l</code>).</p><h3>Numbers & arithmetic</h3><p>Here is how the numbers work (using a system called Church numerals):</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-NuPgzLWknX4/YIXVCapZnHI/AAAAAAAACxE/cozGKFi3rVgsM6juTckj1SJSTo8utUlMgCLcBGAsYHQ/numbers.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="616" data-original-width="1200" height="328" src="https://lh3.googleusercontent.com/-NuPgzLWknX4/YIXVCapZnHI/AAAAAAAACxE/cozGKFi3rVgsM6juTckj1SJSTo8utUlMgCLcBGAsYHQ/w640-h328/numbers.png" width="640" /></a></div><p></p><p>Since giving a function <script type="math/tex">f</script> to a number <script type="math/tex">n</script> (also a function) gives a function that applies <script type="math/tex">f</script> to its input <script type="math/tex">n</script> times, a lot of things are very convenient. Say you have this function to add one, which we'll call <code>succ</code> (for "successor"):<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-g04afKMD8cw/YIXVFFyIZjI/AAAAAAAACxI/af_y0P4lIX4q1h6A4Fb9Sf8t69VkBLEJgCLcBGAsYHQ/succ.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="700" data-original-width="1000" height="280" src="https://lh3.googleusercontent.com/-g04afKMD8cw/YIXVFFyIZjI/AAAAAAAACxI/af_y0P4lIX4q1h6A4Fb9Sf8t69VkBLEJgCLcBGAsYHQ/w400-h280/succ.png" width="400" /></a></div><p></p><p>(Considering the above definition of numbers: why does it work?) <br /></p><p>Now what is <code>(42 succ)</code>? It's a function that takes an argument and adds <code>42</code> to it. 
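</p><p>To make this concrete, here is a minimal sketch of Church numerals in Python. It assumes the standard definitions, <script type="math/tex">0 = (\lambda f x. x)</script> and <script type="math/tex">\text{succ} = (\lambda n f x. (f \, (n \, f \, x)))</script>, which may differ in detail from the trees drawn above; <code>church</code> and <code>to_int</code> are hypothetical helpers that exist only to move between Python integers and numerals.</p><pre><code class="language-python" lang="python"># The numeral n takes f and x and applies f to x n times.
zero = lambda f: lambda x: x

# Assumed definition of successor: apply f one more time than n does.
succ = lambda n: lambda f: lambda x: f(n(f)(x))

def church(k):
    """Build the Church numeral for a non-negative Python int k."""
    n = zero
    for _ in range(k):
        n = succ(n)
    return n

def to_int(n):
    """Decode a Church numeral by counting how often it applies f."""
    return n(lambda k: k + 1)(0)

three = church(3)

# (42 succ): handing succ to the numeral 42 gives "apply succ 42 times".
add_42 = church(42)(succ)
forty_five = add_42(three)
</code></pre><p>Decoding with <code>to_int</code> is the same trick in miniature: the numeral is handed Python's "add one" and a starting value of 0.</p><p>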
More generally, <code>((n succ) m)</code> gives you <code>m+n</code>. However, there's also a more straightforward way to represent addition, which you can figure out from noticing that all we have to do to add <code>m</code> to <code>n</code> is to compose the "apply <code>f</code>" operation <code>m</code> more times to <code>n</code>, something we can do simply by calling <code>(m f)</code> on <code>n</code>, once we've "standardised" <code>n</code> to have the same <code>f</code> and <code>x</code> as in the <script type="math/tex">\lambda</script>-term that represents <code>m</code> (that is why we have the <code>(n f x)</code> application, rather than just <code>n</code>):</p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-k4huLoOTr60/YIXVRqp4GsI/AAAAAAAACxU/RwfI1uA2p9IooMCmvyQVTh2TxfyFYEFHgCLcBGAsYHQ/add.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="668" data-original-width="968" height="276" src="https://lh3.googleusercontent.com/-k4huLoOTr60/YIXVRqp4GsI/AAAAAAAACxU/RwfI1uA2p9IooMCmvyQVTh2TxfyFYEFHgCLcBGAsYHQ/w400-h276/add.png" width="400" /></a></div><p></p><p>Now, want multiplication? One way is to see that we can define <code>(mult m n)</code> as <code>((n (adder m)) 0)</code>, assuming that <code>(adder m)</code> returns a function that adds <code>m</code> to its input. 
As we saw, that can be done with <code>(m succ)</code>, so:</p><pre><code class="language-scheme" lang="scheme">(mult m n) =<br />((n (m succ))<br /> 0)<br /></code></pre><p>There's a more standard way too:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-yy7qimS9E4Q/YIXVU0q5hQI/AAAAAAAACxc/EKbvrXKEIQ8Idi23u7vDt9y2zO5sKk2MQCLcBGAsYHQ/mult.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="564" data-original-width="796" height="284" src="https://lh3.googleusercontent.com/-yy7qimS9E4Q/YIXVU0q5hQI/AAAAAAAACxc/EKbvrXKEIQ8Idi23u7vDt9y2zO5sKk2MQCLcBGAsYHQ/w400-h284/mult.png" width="400" /></a></div><br /><p></p> <p>The idea here is simply that <code>(n f)</code> gives a <script type="math/tex">\lambda</script>-term that takes an input and applies <code>f</code> to it <script type="math/tex">n</script> times, and when we call <code>m</code> with that as its first argument, we get something that does the <script type="math/tex">n</script>-fold application <script type="math/tex">m</script> times, for a total of <script type="math/tex">mn</script> times, and now all that remains is to pass the <code>x</code> to it.</p><p>A particularly neat thing is that exponentiation can be this simple:<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-yMefuUdBrmY/YIXVbeKlCVI/AAAAAAAACxk/Izg3I42x73k_tdatLE0Ty6beLmzIn9HTgCLcBGAsYHQ/exp.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="536" data-original-width="790" height="217" src="https://lh3.googleusercontent.com/-yMefuUdBrmY/YIXVbeKlCVI/AAAAAAAACxk/Izg3I42x73k_tdatLE0Ty6beLmzIn9HTgCLcBGAsYHQ/exp.png" width="320" /></a></div><p></p><p>Why? I'll let the trees talk. 
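</p><p>Before following the trees, the arithmetic definitions so far can be sanity-checked numerically. The sketch below is in Python and rests on my reading of the prose, since the trees themselves are images: addition as <script type="math/tex">(\lambda m n f x. (m \, f \, (n \, f \, x)))</script>, multiplication as <script type="math/tex">(\lambda m n f x. (m \, (n \, f) \, x))</script>, and exponentiation as <script type="math/tex">(\lambda m n. (n \, m))</script>. The helpers <code>church</code> and <code>to_int</code> are hypothetical conveniences for converting to and from Python integers.</p><pre><code class="language-python" lang="python">zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

def church(k):
    """Build the Church numeral for a non-negative Python int k."""
    n = zero
    for _ in range(k):
        n = succ(n)
    return n

def to_int(n):
    """Decode a Church numeral by counting how often it applies f."""
    return n(lambda k: k + 1)(0)

# Addition: run n's applications of f first, then m's on top of the result.
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

# Multiplication as repeated addition: apply "add m" to zero, n times.
mult_by_adding = lambda m: lambda n: n(m(succ))(zero)

# Multiplication, direct form: do the n-fold application of f, m times.
mult = lambda m: lambda n: lambda f: lambda x: m(n(f))(x)

# Exponentiation (assumed form): apply m to itself-as-an-operator n times.
exp = lambda m: lambda n: n(m)

two, three, four = church(2), church(3), church(4)
results = {
    "2 + 3": to_int(add(two)(three)),
    "4 * 3 (by adding)": to_int(mult_by_adding(four)(three)),
    "4 * 3": to_int(mult(four)(three)),
    "2 ^ 3": to_int(exp(two)(three)),
    "3 ^ 2": to_int(exp(three)(two)),
}
</code></pre><p>The last two entries differ, which confirms the argument order: <code>exp(m)(n)</code> is <script type="math/tex">m^n</script>, the quantity the trees work towards.</p><p>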
First, using the definition of <code>n</code> as a Church numeral (which I will underline in the trees below), and doing one <script type="math/tex">\beta</script>-reduction, we have:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-PV4bT_1_ZSE/YIXVdvJX2gI/AAAAAAAACxo/xgbezro8juoIPyGv6Lc9wXi7DEWx5DtPQCLcBGAsYHQ/expe1.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="946" data-original-width="1528" height="396" src="https://lh3.googleusercontent.com/-PV4bT_1_ZSE/YIXVdvJX2gI/AAAAAAAACxo/xgbezro8juoIPyGv6Lc9wXi7DEWx5DtPQCLcBGAsYHQ/w640-h396/expe1.png" width="640" /></a></div><p></p><p>This does not look promising – a number needs to have two arguments, but we have a <script type="math/tex">\lambda</script>-term taking in one. However, we'll soon see that the <code>x</code> in the tree on the right actually turns out to be the first argument, <code>f</code>, in the finished number. In fact, we'll make that renaming right away (since we're working under <script type="math/tex">\alpha</script>-equivalence), and continue reducing (below we've taken the bottom-most <code>m</code> and expanded it into its Church numeral definition): </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-82QyX18WZMI/YIXVgQWkf5I/AAAAAAAACxs/6lNfk11Iz3gzl1oWMX8NKQeqZ9FkjSefgCLcBGAsYHQ/expe2.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="906" data-original-width="1200" height="483" src="https://lh3.googleusercontent.com/-82QyX18WZMI/YIXVgQWkf5I/AAAAAAAACxs/6lNfk11Iz3gzl1oWMX8NKQeqZ9FkjSefgCLcBGAsYHQ/w640-h483/expe2.png" width="640" /></a></div><p></p><p>At this point, the picture gets clearer: the next thing we'd reduce is the lambda term at the bottom applied to <code>m</code>, but that's just going to do the lambda term (which applies <code>f</code> <script type="math/tex">m</script> times) <script 
type="math/tex">m</script> more times. We'll have done 2 steps, and gotten up to <script type="math/tex">m^2</script> nestings of <code>f</code>. By the time we've done the remaining <script type="math/tex">n-1</script> steps, we'll have the representation of <script type="math/tex">m^n</script>; the <script type="math/tex">n-1</script> more applications between our bottom-most and topmost lambda term will reduce away, while the stack of applications of <code>f</code> increases by a factor of <script type="math/tex">m</script> each time.</p><p>What about subtraction? It's a bit complicated. Okay, how about just subtraction by <i>one</i>, also known as the <code>pred</code> (predecessor) function? Also tricky (and a good puzzle if you want to think about it). Here's one way:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-qNJvAtfo9UI/YIXVjnrMdBI/AAAAAAAACxw/b1n7LTuSA4Ye_bAeOs3eZ2PVAwesypyAACLcBGAsYHQ/pred.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="876" data-original-width="1146" height="489" src="https://lh3.googleusercontent.com/-qNJvAtfo9UI/YIXVjnrMdBI/AAAAAAAACxw/b1n7LTuSA4Ye_bAeOs3eZ2PVAwesypyAACLcBGAsYHQ/w640-h489/pred.png" width="640" /></a></div><p></p><p>Church numerals make it easy to add, but not subtract. So instead, here's what we do. First (box 1), we make a pair like <code>[0 0]</code>. Next (polygon 2), we have a function that takes a pair <code>p=[a b]</code> and creates a new pair <code>[b (succ b)]</code>, where <code>succ</code> is the successor function (one plus its input). Repeated application of this function on the pair in box 1 looks like this: <code>[0 0]</code>, <code>[0 1]</code>, <code>[1 2]</code>, <code>[2 3]</code>, and so on. 
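</p><p>That sequence of pairs can be reproduced directly. The Python sketch below follows the prose description rather than the exact tree in the picture, so the names <code>step</code>, <code>show</code>, <code>church</code>, and <code>to_int</code> are hypothetical, and <code>pair</code> is assumed to be <script type="math/tex">(\lambda a b f. (f \, a \, b))</script>. It also defines <code>pred</code> by taking the first element after <script type="math/tex">n</script> steps.</p><pre><code class="language-python" lang="python">zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

first = lambda x: lambda y: x
second = lambda x: lambda y: y
pair = lambda a: lambda b: lambda f: f(a)(b)   # assumed definition of pair
fst = lambda p: p(first)
snd = lambda p: p(second)

def church(k):
    """Build the Church numeral for a non-negative Python int k."""
    n = zero
    for _ in range(k):
        n = succ(n)
    return n

def to_int(n):
    """Decode a Church numeral by counting how often it applies f."""
    return n(lambda k: k + 1)(0)

# Polygon 2: turn [a b] into [b (succ b)], discarding a.
step = lambda p: pair(snd(p))(succ(snd(p)))

# pred: start from [0 0], take n steps, keep the first element.
pred = lambda n: fst(n(step)(pair(zero)(zero)))

def show(p):
    """Decode a pair of numerals into a tuple of Python ints."""
    return (to_int(fst(p)), to_int(snd(p)))

# Record the first few pairs produced by repeated application of step.
trace = []
p = pair(zero)(zero)
for _ in range(4):
    trace.append(show(p))
    p = step(p)
</code></pre><p>Note that <code>pred</code> of zero comes out as zero in this construction, because zero steps leave the starting pair untouched.</p><p>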
Thus we see that if we start from <code>[0 0]</code> and apply the function in polygon 2 <script type="math/tex">n</script> times (box 3), the first element of the pair is (the Church numeral for) <script type="math/tex">n-1</script> (or 0 if <script type="math/tex">n</script> is 0), and the second element is <script type="math/tex">n</script>, and we can simply call <code>fst</code> to get that first element.</p><p>Just as addition could be defined as repeated application of <code>succ</code>, we can define subtraction as repeated application of <code>pred</code>:</p><pre><code class="language-scheme" lang="scheme">(minus m n) =<br />((n pred) m)<br /></code></pre><p>There's an alternative to Church numerals that's found in the more general <a href="https://crypto.stanford.edu/~blynn/compiler/scott.html">Scott encoding</a>. The advantages of Church vs Scott numerals, and their relative structures, are similar to the relative merits and structures of the two types of lists we discussed: one makes many operations natural by exploiting the fact that everything is a function, but also makes "throwing off a piece" (taking the rest/<code>snd</code> of a list, or subtracting one from a number) much harder.</p><h3>Booleans, if, & equality</h3><p>You might have noticed that we've defined <code>second</code> as <script type="math/tex">(\lambda x y. y)</script>, and <code>0</code> as <script type="math/tex">(\lambda f x. x)</script>. These two terms are a variable-renaming away from each other, so they are <script type="math/tex">\alpha</script>-equivalent. In other words, <code>second</code> and <code>0</code> are the same thing. Because we don't have types, which is which depends only on our interpretation of the context it appears in.</p><p>Now let's define <code>True</code> and <code>False</code>. <code>False</code> is kind of like <code>0</code>, so let's just say they're also the same thing. The opposite of <script type="math/tex">(\lambda x y. y)</script> is <script type="math/tex">(\lambda x y. 
x)</script>, so let's define that to be <code>True</code>.</p><p>What sort of muddle have we landed ourselves in now? Quite a good one, actually. Let's define <code>(if p c a)</code> to be <code>(p c a)</code>. If the predicate <code>p</code> is <code>True</code>, we select the consequent <code>c</code>, because <code>(True c a)</code> is exactly the same as <code>(first c a)</code>, which is clearly <code>c</code>. Likewise, if <code>p</code> is <code>False</code>, then we evaluate the same thing as <code>(second c a)</code> and end up with the alternative <code>a</code>.</p><p>We will also want to test whether a number is <code>0</code>/<code>False</code> (equality in general is hard in the lambda calculus, so what we end up with won't be guaranteed to work with things that aren't numbers). A simple way is:</p><pre><code class="language-scheme" lang="scheme">eq0 =<br />(lambda x<br /> (x (lambda y<br /> False)<br /> True))<br /></code></pre><p>If <code>x</code> is <code>0</code>, it's the same as <code>second</code> and will act as a conditional and pick out <code>True</code>. If it's not zero, we assume that it's some number <script type="math/tex">n</script>, and therefore will be a function that applies its first argument <script type="math/tex">n</script> times. Applying <script type="math/tex">(\lambda y.\text{False})</script> any non-zero number of times to anything will return <code>False</code>.</p><h2>Fixed points, combinators, and recursion</h2><p>The big thing missing from the definitions we've put on top of the lambda calculus so far is recursion. Every lambda term represents an anonymous function, so there's no name within a <script type="math/tex">\lambda</script>-term that we can "call" to recurse.</p><p>Rather than jumping in straight to recursion, we're going to start with Russell's paradox: does the set of all sets that do not contain themselves contain itself? 
Phrased mathematically: what the hell is <script type="math/tex">R = \{x \,|\,x\notin R\} </script>?</p><p>In computation theory, sets are often specified by a characteristic function: a function that is always defined if the set is computable, and returns true if an element is in the set and false otherwise.</p><p>In the lambda calculus (which was originally supposed to be a foundation for logic), here's a characteristic function for the Russell set <script type="math/tex">R</script>:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/--fAw5h7j9LM/YIXVofSsV-I/AAAAAAAACx4/s-qCYIIqZ-A4ZGdu3bjaBdRLzhYs1ijfwCLcBGAsYHQ/russell.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="678" data-original-width="800" height="339" src="https://lh3.googleusercontent.com/--fAw5h7j9LM/YIXVofSsV-I/AAAAAAAACx4/s-qCYIIqZ-A4ZGdu3bjaBdRLzhYs1ijfwCLcBGAsYHQ/w400-h339/russell.png" width="400" /></a></div><p></p><p>(where <code>not</code> can be straightforwardly defined on top of our existing definitions as <code>(not b) = (b False True)</code>).</p><p>This <script type="math/tex">\lambda</script>-term takes in an element <code>x</code>, assumes that <code>x</code> is the (characteristic function for) the set itself, and asks: is it the case that <code>x</code> is <i>not</i> in the set? 
Call this term <code>R</code>, and consider <code>(R R)</code>: the left <code>R</code> is working as the (characteristic function of) the set, and the right <code>R</code> as the element whose membership of the set we are testing.</p><p>Evaluating:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-5USU6gUr_xU/YIXVrF-huCI/AAAAAAAACyA/iOL_s4hxuXsbYjeKp4LKjmOxoxLSyDKpgCLcBGAsYHQ/russell2.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="612" data-original-width="1200" height="326" src="https://lh3.googleusercontent.com/-5USU6gUr_xU/YIXVrF-huCI/AAAAAAAACyA/iOL_s4hxuXsbYjeKp4LKjmOxoxLSyDKpgCLcBGAsYHQ/w640-h326/russell2.png" width="640" /></a></div><p></p><p>So we start out saying <code>(R R)</code>, and in one <script type="math/tex">\beta</script>-reduction step we end up saying <code>(not (R R))</code> (just as, with Russell's paradox, it first seems that the set must contain itself, because the set is not in itself, but once we've added the set to itself then suddenly it shouldn't be in itself anymore). One more step and we get, from <code>(R R)</code>, <code>(not (not (R R)))</code>. This is not ideal as a foundation for logic.</p><p>However, you might realise something: the <code>not</code> here doesn't play any role. We can replace it with any arbitrary <code>f</code>. 
In fact, let's do that, and create a simple wrapper <script type="math/tex">\lambda</script>-term around it that lets us pass in any <code>f</code> we want:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-AMgahMRA2JE/YIXVtdgUYmI/AAAAAAAACyE/Tb4b7v2-emM5q5fl8Xicb8TQIvpxHgu2gCLcBGAsYHQ/Y.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1104" data-original-width="1076" height="400" src="https://lh3.googleusercontent.com/-AMgahMRA2JE/YIXVtdgUYmI/AAAAAAAACyE/Tb4b7v2-emM5q5fl8Xicb8TQIvpxHgu2gCLcBGAsYHQ/w390-h400/Y.png" width="390" /></a></div><p></p><p>Now let's look at the properties that <script type="math/tex">Y</script> has:</p><div cid="n1079" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n1079" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-363" type="math/tex; mode=display">(Y \, f) \rightarrow_\beta (f \, (Y \, f)) \rightarrow_\beta (f \, (f \, (Y \, f))) \rightarrow_\beta ...</script></div></div><p><script type="math/tex">Y</script> is called the Y combinator ("combinator" is a generic term for a lambda calculus term with no free variables). It is part of the general class of fixed-point combinators: combinators <script type="math/tex">X</script> such that <script type="math/tex">(X \, f) = (f \, (X\,f))</script>. (Turing invented another one: <script type="math/tex">\Theta = (A \, A)</script>, where <script type="math/tex">A</script> is defined as <script type="math/tex">(\lambda x y. (y \,(x\, x\, y)))</script>.)</p><p>A fixed-point combinator gives us recursion. 
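</p><p>The fixed-point property can be checked by hand. As a sketch, assume the term in the image above is the usual <script type="math/tex">Y = (\lambda f. ((\lambda x. (f \, (x \, x))) \, (\lambda x. (f \, (x \, x)))))</script>, and write <script type="math/tex">W_f</script> (a name introduced only for this aside) for <script type="math/tex">(\lambda x. (f \, (x \, x)))</script>. Then:</p><script type="math/tex; mode=display">\begin{aligned}
(Y \, f) &\rightarrow_\beta (W_f \, W_f) \rightarrow_\beta (f \, (W_f \, W_f)) \\
(f \, (Y \, f)) &\rightarrow_\beta (f \, (W_f \, W_f)) \\
(\Theta \, f) &= (A \, A \, f) \rightarrow_\beta ((\lambda y. (y \, (A \, A \, y))) \, f) \rightarrow_\beta (f \, (A \, A \, f)) = (f \, (\Theta \, f))
\end{aligned}</script><p>So, under that assumption, <script type="math/tex">(Y \, f)</script> and <script type="math/tex">(f \, (Y \, f))</script> meet at a common term (which is what the chain of arrows above abbreviates), while Turing's <script type="math/tex">\Theta</script> reduces to <script type="math/tex">(f \, (\Theta \, f))</script> directly.</p><p>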
Imagine we've almost written a recursive function, say for a factorial, except we've left a free function parameter for the recursive call:</p><pre><code class="language-scheme" lang="scheme">(lambda f x<br /> (if (eq0 x)<br /> 1<br /> (mult x<br /> (f (pred x)))))<br /></code></pre><p>(Also, take a moment to appreciate that we can already do everything necessary except for the recursion with our earlier definitions.)</p><p>Call the previous recursion-free factorial term <code>F</code>, and consider reducing <code>((Y F) 2)</code> (where <code>-BETA-></code> stands for one or more <script type="math/tex">\beta</script>-reductions):</p><pre><code class="language-scheme" lang="scheme">((Y F)<br /> 2)<br /><br />-BETA-><br /><br />((F (Y F))<br /> 2)<br /><br />-BETA-><br /><br />((lambda x<br /> (if (eq0 x)<br /> 1<br /> (mult x<br /> ((Y F) (pred x)))))<br /> 2)<br /><br />-BETA-><br /><br />(if (eq0 2)<br /> 1<br /> (mult 2<br /> ((Y F) (pred 2))))<br /><br />-BETA-><br /><br />(mult 2<br /> ((Y F)<br /> 1))<br /><br />-BETA-><br /><br />(mult 2<br /> ((F (Y F))<br /> 1))<br /><br />-BETA-><br /><br />(mult 2<br /> ((lambda x<br /> (if (eq0 x)<br /> 1<br /> (mult x<br /> ((Y F) (pred x)))))<br /> 1))<br /><br />-BETA-><br />...<br />-BETA-><br /><br />(mult 2<br /> (mult 1<br /> 1))<br /><br />-BETA-><br /><br />2<br /></code></pre><p>It works! Get a fixed-point combinator, and recursion is solved.</p><h3>Primitive recursion</h3><p>The definition of the partial recursive functions (one of the ways to define computability, mentioned at the beginning) involves something called primitive recursion. Let's implement that, and along the way look at fixed-point combinators from another perspective.</p><p>Primitive recursion is essentially about implementing bounded for-loops / recursion stacks, where "bounded" means that the depth is known when we enter the loop. 
Specifically, there's a function <script type="math/tex">f</script> that takes in zero or more parameters, which we'll abbreviate as <script type="math/tex">\overline{P}</script>. At 0, the value of our primitive recursive function <script type="math/tex">h</script> is <script type="math/tex">f(\overline{P})</script>. At any integer <script type="math/tex">x+1</script> for <script type="math/tex">x \geq 0</script>, <script type="math/tex">h(\overline{P}, x+1)</script> is defined as <script type="math/tex">g(\overline{P}, x, h(\overline{P}, x))</script>: in other words, the value at <script type="math/tex">x+1</script> is given by some function of:</p><ul><li>fixed parameter(s) <script type="math/tex">\overline{P}</script>,</li><li>how many more steps there are in the loop before hitting the base case (<script type="math/tex">x</script>), and</li><li>the value at <script type="math/tex">x</script> (the recursive part).</li> </ul><p>For example, in our factorial example there are no parameters, so <script type="math/tex">f</script> is just the constant function 1, and <script type="math/tex">g(x, r) = (x + 1) \times r</script>, where <script type="math/tex">r</script> is the recursive result for one less, and we have <script type="math/tex">x+1</script> because (for a reason I can't figure out – ideas?) 
<script type="math/tex">g</script> takes, by definition, not the current loop index but one less.</p><p>Now it's pretty easy to write the function for primitive recursion, leaving the recursive call as an extra parameter (<code>r</code>) once again, and assuming that we have <script type="math/tex">\lambda</script>-terms <code>F</code> and <code>G</code> for <script type="math/tex">f</script> and <script type="math/tex">g</script> respectively:</p><pre><code class="language-scheme" lang="scheme">(lambda r P x<br /> (if (eq0 x)<br /> (F P)<br /> (G P (pred x) (r P (pred x)))))<br /></code></pre><p>Slap a <script type="math/tex">Y</script> in front, and we take care of the recursion and we're done.</p><h3>The fixed point perspective</h3><p>However, rather than viewing this whole "slap in the <script type="math/tex">Y</script>" business as a hack for getting recursion, we can also interpret it as a fixed point operation.</p><p>A fixed point of a function <script type="math/tex">f</script> is a value <script type="math/tex">x</script> such that <script type="math/tex">x = f(x)</script>. The fixed points of <script type="math/tex">f(x)=x^2</script> are 0 and 1. In general, fixed points are often useful in maths stuff and there's a lot of deep theory behind them (for which you will have to look elsewhere).</p><p>Now <script type="math/tex">Y</script> (or any other fixed point combinator) has the property that <script type="math/tex">(Y f) =_\beta (f \, (Y\, f))</script> (remember that the equivalent of <script type="math/tex">f(x)</script> is written <script type="math/tex">(f \,x)</script> in the lambda calculus). 
In other words, <script type="math/tex">Y</script> is a magic wand that takes a function and returns its fixed point (albeit in a mathematical sense that is not very useful for explicitly finding those fixed points).</p><p>Taking once again the example of defining primitive recursion, we can consider it as the fixed point problem of finding an <script type="math/tex">h</script> such that <script type="math/tex">h = \Phi_{f,g}(h)</script>, where <script type="math/tex">\Phi_{f,g}</script> is a function like the following, where <code>F</code> and <code>G</code> are the lambda calculus representations of <script type="math/tex">f</script> and <script type="math/tex">g</script> respectively:</p><pre><code class="language-scheme" lang="scheme">(lambda h<br /> (lambda P x<br /> (if (eq0 x)<br /> (F P)<br /> (G P (pred x) (h P (pred x))))))<br /></code></pre><p>That is, <script type="math/tex">\Phi_{f,g}</script> takes in some function <code>h</code>, and then returns a function that does primitive recursion – <i>under the assumption</i> that <code>h</code> is the right function for the recursive call.</p><p>Imagine it like this: when we're finding the fixed point of <script type="math/tex">f(x)= x^2</script>, we're asking for <script type="math/tex">x</script> such that <script type="math/tex">x=x^2</script>. We can imagine reaching into the set of values that <script type="math/tex">x</script> can take (in this case, the real numbers), plugging them in, and seeing that in most cases the equation <script type="math/tex">x=x^2</script> is false, but if we pick out a fixed point it becomes true. 
Similarly, solving <script type="math/tex">h=\Phi_{f,g}(h)</script> is the problem of considering all possible functions <script type="math/tex">h</script> (and it turns out all computable functions can be enumerated, so this is, if anything, less crazy than considering all possible real numbers), and requiring that plugging <script type="math/tex">h</script> into <script type="math/tex">\Phi_{f,g}</script> gives back <script type="math/tex">h</script>. For almost any function that we plug in, this equation will be nonsense: instead of doing primitive recursion, on the first call to <code>h</code>, <script type="math/tex">\Phi_{f,g}</script> will do some crazy call that might loop forever or calculate the 17th digit of <script type="math/tex">\pi</script>, but if it's picked just right, <script type="math/tex">h</script> and <script type="math/tex">\Phi_{f,g}(h)</script> will happen to be the same thing. Unlike in the algebraic case, it's very difficult to iteratively improve on your guess for <script type="math/tex">h</script>, so it's hard to think of how to use this weird way of defining the problem of finding <script type="math/tex">h</script> to actually find it.</p><p>Except hold on – we're working in the lambda calculus, and fixed point combinators are easy: call <script type="math/tex">Y</script> on a function and we have its fixed point, and, by the reasoning above, that is the recursive version of that function.</p><h2>The lambda calculus in lambda calculus</h2><p>There's one final powerful demonstration of a computation model's expressive power that we haven't looked at: being able to express itself. 
The most well-known case is the <a href="https://en.wikipedia.org/wiki/Universal_Turing_machine">universal Turing machine</a>, and those crop up a lot when you're thinking about computation theory.</p><p>Now there exists a trivial universal lambda term: <script type="math/tex">(\lambda \,f\,a\,.\,(f \,a))</script> takes <script type="math/tex">f</script>, the lambda representation of some function, and an argument <script type="math/tex">a</script>, and returns the lambda calculus representation of <script type="math/tex">f</script> applied to <script type="math/tex">a</script>. However, this isn't exactly fair, since we've just forwarded all the work onto whatever is interpreting the lambda calculus. It's like noting that an <code>eval</code> function exists in a programming language, and then writing on your CV that you've written an evaluator for it.</p><p>Instead, a "fair" way to define a universal lambda term is to build on the data specifications we have to define a representation of variables, lambda terms, and application terms, and then write more definitions within the lambda calculus until we have a <code>reduce</code> function.</p><p>This is what I've done in <a href="https://github.com/LRudL/lambda-engine">Lambda Engine</a>. The definitions specific to defining the lambda calculus within the lambda calculus start about halfway down <a href="https://github.com/LRudL/lambda-engine/blob/main/definitions.rkt">this file</a>. I won't walk through the details here (see the code and comments for more detail), but the core points are:</p><ul><li>We distinguish term types by making each term a pair consisting of an identifier and then the data associated with it. The identifier for variables/<script type="math/tex">\lambda</script>s/applications is a function that takes a triple and returns the 1st/2nd/3rd member of it (this is simpler than tagging them with e.g. Church numerals, since testing numerical equality is complicated). 
The data is either a Church numeral (for variables) or a pair of a variable and a term (<script type="math/tex">\lambda</script>-terms) or a term and a term (applications).</li><li>We need case-based recursion, where we can take in a term, figure out what it is, and then perform a call to a function to handle that term and pass on the main recursive function to that handler function (for example, because when substituting in an application term, we need to call the main substitution function on both the left and right child of the application). The case-based recursion functions (different ones for the different number of arguments required by substitution and reduction) take a triple of functions (one for each term type) and exploit the fact that the identifier of a term is a function that picks some element from the triple (in this case, we call the identifier on the handler function triple to pick the right one).</li><li>We have helper functions to build our term types, extract out parts, and test for whether something is a <script type="math/tex">\lambda</script>-term (exploiting the fact that the first element of the pair that a lambda term is is the "take the 2nd thing from a triple" function).</li><li>With the above, we can define substitution fairly straightforwardly. Note that we need to test Church numeral equality, which requires a generic Church numeral equality tester, which is a slow function (because it needs to recurse and take a lot of predecessors).</li><li>For reduction, the main tricky bit is doing it in normal order. This means that we have to be able to tell whether the left child in an application term is reducible before we try to reduce the right child (e.g. the left child might eventually reduce to a function that throws away its argument, and the right child might be a looping term like <script type="math/tex">\Omega</script>). 
We define a helper function to check whether something reduces, and then can write <code>reduce-app</code> and therefore <code>reduce</code>. For convenience we can define a function <code>n-reduce</code> that calls <code>reduce</code> on an expression <code>n</code> times, simply by exploiting how Church numerals work (<code>((2 reduce) x)</code> is <code>(reduce (reduce x))</code>, for example).</li> </ul><p>What we don't have:</p><ul><li>Variable renaming. We assume that terms in this lambda calculus are written so that a variable name (in this case, a Church numeral) is never reused.</li><li>Automatically reducing to <script type="math/tex">\beta</script>-normal form. This could be done fairly simply by writing another function that calls itself with the <code>reduce</code> of its argument until our checker for whether something reduces is false. </li><li>Automatically checking whether we're looping (e.g. we've typed in the definition of <script type="math/tex">\Omega</script>).</li> </ul><p>The lambda calculus interpreter in <a href="https://github.com/LRudL/lambda-engine/blob/main/interpreter.rkt">this file</a> has all three features above. You can play with it, and the lambda-calculus-in-lambda-calculus, by downloading <a href="https://github.com/LRudL/lambda-engine">Lambda Engine</a> (and a <a href="https://racket-lang.org/">Racket interpreter</a> if you don't already have one) and using one of the evaluators in <a href="https://github.com/LRudL/lambda-engine/blob/main/main.rkt">this file</a>.</p><h2>Towards Lisp</h2><p>Let's see what we've defined in the lambda calculus so far:</p><ul><li><code>pair</code></li><li>lists</li><li><code>fst</code></li><li><code>snd</code></li><li><code>True</code></li><li><code>False</code></li><li><code>if</code></li><li><code>eq0</code></li><li>numbers</li><li>recursion<br /></li> </ul><p>This is most of <a href="http://languagelog.ldc.upenn.edu/myl/ldc/llog/jmc.pdf">what you need in a Lisp</a>. 
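</p><p>To check that the pieces in this list really do fit together, here is a rough transliteration into Python lambdas. It is a sketch, not the Lambda Engine code: the exact form of <code>pair</code>, the names <code>step</code>, <code>if_</code>, <code>Z</code>, <code>to_int</code> and <code>from_int</code>, and the use of Python itself are all assumptions made for illustration. Because Python evaluates arguments eagerly, <code>if_</code> takes its two branches as zero-argument functions, and the fixed-point combinator is the eta-expanded (strict) variant of <script type="math/tex">Y</script> rather than <script type="math/tex">Y</script> itself. Lists are left out.</p>

```python
# Church-style encodings, transliterated into curried Python lambdas.
# Everything here is an illustrative assumption, not the post's actual code.

TRUE = lambda x: lambda y: x       # same term as `first`
FALSE = lambda x: lambda y: y      # same term as `second` and the numeral 0

# Pairs: a pair is a function that hands its two members to a selector.
pair = lambda a: lambda b: lambda s: s(a)(b)
fst = lambda p: p(TRUE)
snd = lambda p: p(FALSE)

# Numbers: the numeral n applies its first argument n times.
ZERO = FALSE
succ = lambda n: lambda f: lambda x: f(n(f)(x))
ONE = succ(ZERO)
mult = lambda m: lambda n: lambda f: m(n(f))

# pred: start from [0 0], map [a b] to [b b+1] n times, take the first member.
step = lambda p: pair(snd(p))(succ(snd(p)))
pred = lambda n: fst(n(step)(pair(ZERO)(ZERO)))
minus = lambda m: lambda n: n(pred)(m)

# eq0: applying (lambda y. False) zero times leaves True untouched.
eq0 = lambda n: n(lambda _: FALSE)(TRUE)

# if: the boolean selects a branch; branches are thunks because Python is strict.
if_ = lambda p: lambda c: lambda a: p(c)(a)()

# Strict fixed-point combinator: Y with the self-application delayed.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# The recursion-free factorial term F, with the recursive call as parameter f.
F = lambda f: lambda x: if_(eq0(x))(lambda: ONE)(lambda: mult(x)(f(pred(x))))
factorial = Z(F)


def to_int(n):
    """Convert a Church numeral to a Python int by counting applications."""
    return n(lambda k: k + 1)(0)


def from_int(k):
    """Build the Church numeral for a non-negative Python int."""
    n = ZERO
    for _ in range(k):
        n = succ(n)
    return n


print(to_int(factorial(from_int(4))))  # 24
```

<p>The conversion helpers exist only to inspect results; everything between them and the top of the block is built from single-argument functions alone.</p><p>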
Lisp was invented in 1958 by John McCarthy. It was intended as an alternative axiomatisation for computation, with the goal of not being too complicated to define while still being human-friendly, unlike the lambda calculus or Turing machines. It borrows notation (in particular the keyword <code>lambda</code>) from the lambda calculus and its terms are also trees, but it is not directly based on the lambda calculus.</p><p>Lisp was not intended as a programming language, but Steve Russell (no relation to Bertrand Russell ... I'm pretty sure) realised you could write machine code to evaluate Lisp expressions, and went ahead and did so, making Lisp the second-oldest high-level programming language still in common use (after Fortran). Despite its age, Lisp is arguably the most elegant and flexible programming language (modern dialects include <a href="https://clojure.org/">Clojure</a> and <a href="https://racket-lang.org/">Racket</a>).</p><p>One way to think of what we've done in this post is that we've started from the lambda calculus – an almost stupidly simple theoretical model – and made definitions and syntax transformations until we got most of the way to being able to emulate Lisp, a very usable and practical programming language. 
The main takeaway is, hopefully, an intuitive sense of how something as simple as the lambda calculus can express any computation expressible in a higher-level language.</p><h1>Nuclear power is good</h1><p style="text-align: center;"><i><span style="font-size: x-small;">Published 2021-03-27; updated 2021-05-28</span></i></p><p style="text-align: center;"><span style="font-size: medium;"></span></p><p style="text-align: left;"><span style="font-size: large;">(Alternative title: burning things considered harmful)</span></p><p style="text-align: center;"><span style="font-size: medium;"> <i><span style="font-size: x-small;">5k words (about 17 minutes)</span></i></span></p><p style="text-align: center;"><span style="text-align: left;"> </span></p><p>If you want usable energy, you need to use the forces between particles.</p><p>The weakest force is gravity, but if you happen to be near a gigantic amount of material (e.g. the Earth) with an uneven surface that has stuff flowing down it (e.g. water in a river), we can still use it to generate power. This insight gives us hydropower, which delivers about 16% of the world's electricity. The main downside is that because of how weak gravity is, dams have to be large and environmentally disruptive to generate useful power.</p><p>Moving to stronger forces, we have chemical interactions between atoms. In the form of burning fossil fuels, rearranging chemical bonds produces 66% of the world's electricity. The main downside is how weak chemical bonds are, and therefore how much matter has to be processed (i.e. burned) to produce energy. A lot of matter means a lot of waste products. Despite decades of work on possible safe waste-management strategies (e.g. 
carbon capture and storage), we still outrageously keep dumping over thirty billion tons of carbon dioxide into the atmosphere every year, with massive effects on the climate that will potentially last thousands of years, while also producing a long list of other harmful waste products that kill <a href="https://ourworldindata.org/air-pollution">a lot of people</a> per year.</p><p>Thankfully, atoms aren't atomic: we can rearrange atoms and get energy densities that blow puny chemistry out of the water. Currently 11% of the world's electricity comes from directly doing this. We're still playing catch-up to God, who, in His infinite wisdom, saw fit to create a universe where just about 100% of energy production is nuclear.</p><p>Our nearest God-sanctioned nuclear reactor is the sun. Harnessing the sun's light and heat gives us another 1% of the world's electricity; a slightly more indirect route where we first wait for the sun's heat to stir up the air gives us another 3.5%. An even more indirect route is letting the sun's light fall on plants so that they create chemical bonds that we can burn for power; this gives us another 2%. The most indirect route of all is to use the chemical bonds created by sunlight that fell on extinct plants hundreds of millions of years ago, which is what we're really doing when we burn fossil fuels. So actually it's all nuclear, with the only difference being how many hoops you jump through first.</p><p>The current state of nuclear power is that we can harness only fission (splitting atoms) for controlled energy production. Fusion (combining atoms) is potentially an even better technology: it requires less exotic materials, produces less dangerous waste, and is literally star-power. However, it takes extreme energies to get power out of fusion, and the only way we've found to do that is to blow up a (fission-based) nuclear bomb in a very controlled way that squeezes the stuff we want to fuse to create an even bigger bang. 
Technically we could use this for power – say, we build a massive underground chamber where we set off hydrogen bombs (the common name for a bomb that uses nuclear fusion) every once in a while to vaporise vast amounts of water into steam and then drive a generator – but let's just say there would be some difficulties. (Though, surprisingly, mostly economic and political ones rather than technical ones – this idea was seriously studied in the 1970s as <a href="https://en.wikipedia.org/wiki/Project_PACER">Project Pacer.</a>)</p><p>Controlled fusion power is in the works, but it's the poster child for technologies that are always twenty years away. At the moment scientists are playing around with <a href="https://en.wikipedia.org/wiki/National_Ignition_Facility">lasers that have 25 times the power of the entire world's electricity generation</a> (though only for a few picoseconds at a time) and <a href="https://en.wikipedia.org/wiki/ITER">magnets almost strong enough to levitate a frog</a>* to bring it about, but don't expect commercial fusion power in the next decade at least.</p><p>(*Levitating a frog takes a field of about 16 Teslas, according to research that won an <a href="https://www.improbable.com/ig-about/winners/#ig2000">Ig Nobel Prize in 2000</a>, compared to ITER's 13 Tesla field.)</p><p>Fusion is definitely a technology that we should develop. However, as J. Storrs Hall writes in <i>Where is my flying car?</i> (my review <a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html">here</a>):</p><blockquote><p><i>"As a science fiction and technology fan, for most of my life I had been squarely in the “just you wait until we get fusion” camp. Then I was forced to compare the expected advantages fusion would bring to the ones we already had with fission. Fuel costs are already negligible. The process is already clean, with no emissions. 
Even though the national [US] waste repository at Yucca Mountain has been blocked by activists since it was designated in 1987 and never opened, fission produces so little waste that all our power plants have operated the entire period by basically sweeping it into the back closet."</i></p></blockquote><p>We have already invented a miracle clean power source. And, surprise surprise, we should really use it.</p><p> </p><h2>The human case for nuclear power</h2><p>Every year, <a href="https://ourworldindata.org/grapher/number-of-deaths-by-risk-factor?tab=chart&stackMode=absolute&region=World">there are almost five million deaths attributable to air pollution</a>, a bit less than 1 in 10 of all deaths in the world, or one every six seconds. Since it's a bit tricky to know what counts as an "attributable death" in the case of some risk factor, here's another measure: <a href="https://ourworldindata.org/grapher/disease-burden-by-risk-factor">almost 150 million years of health-weighted life are lost every year because of air pollution</a>. The health effects of air pollution are right up there with the other biggest killers like high blood pressure, smoking, and obesity.</p><p>The biggest causes of air pollution are energy generation, traffic, and (especially in poor countries) heating. Getting global averages for power generation deadliness is hard, but doing some very rough estimation, more than one-tenth but less than one-third of air pollution deaths are directly related to power generation, for a total number in the hundreds of thousands per year. Imagine three Chernobyl-scale disasters a week, and you're in the right ballpark.</p><p>(There is major disagreement over the actual Chernobyl death toll. When making comparisons in this post, I use the number 4000. 
About 30 people died directly during the disaster; several thousand may die in the long run according to the best consensus estimates, though if you assume the contested <a href="https://en.wikipedia.org/wiki/Linear_no-threshold_model">linear no-threshold model</a> (which seems to be the main crux of the debate) you can get numbers in the tens of thousands. If you want to be maximally pessimistic, you can multiply Chernobyl impact comparisons by 10, but you'll find this doesn't materially change the conclusions.)</p><p>Which power sources cause these deaths? There's some disagreement over the exact numbers, but <a href="https://ourworldindata.org/grapher/death-rates-from-energy-production-per-twh?tab=chart&time=earliest..latest&region=World">here's</a> a chart for European energy production from Our World in Data:</p><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-vCz-ZbmYHOc/YF-vXMf_9QI/AAAAAAAAChg/wLsN8TpD8wMObwH87OyC0hJ42fb3CpPRgCLcBGAsYHQ/s1744/deathspertwh.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1128" data-original-width="1744" height="414" src="https://1.bp.blogspot.com/-vCz-ZbmYHOc/YF-vXMf_9QI/AAAAAAAAChg/wLsN8TpD8wMObwH87OyC0hJ42fb3CpPRgCLcBGAsYHQ/w640-h414/deathspertwh.png" width="640" /></a></div><p>(One terawatt-hour (3.6 petajoules) is roughly the annual energy consumption of 20 000 Europeans.)</p><p>The chart above has European numbers. 
In particular for fossil fuel sources, there's a lot of country-specific variation due to environmental regulations and population density: the <a href="https://www.sciencedirect.com/science/article/pii/S0140673607612537?casa_token=r5LmpCZ6G8YAAAAA:aW4wjfZ3PENq0mvbNTLXF27WEkLRuAsE0wGTXSrC1R3OgNLg9a7RdoedMRKZ20sBoUwuClxm#bib32">paper</a> that the above chart is largely based on mentions 77 deaths/TWh as a reasonable figure for a regulation-compliant Chinese coal plant, while <a href="http://www.forbes.com/sites/jamesconca/2012/06/10/energys-deathprint-a-price-always-paid/">this article</a> says that 280 deaths/TWh is possible for coal.</p><p>Why do solar and wind produce any deaths at all? Both occasionally involve dangerous construction work (rooftop solar / tall wind turbines). In fact, if you look at recent decades (i.e., not including Chernobyl) and use the low-end estimates, solar and wind are deadlier than nuclear.</p><p>The estimates for hydropower can also swing a bit depending on whether or not you include the deadliest electricity generation disaster in history: the <a href="https://en.wikipedia.org/wiki/1975_Banqiao_Dam_failure">1975 Banqiao Dam failure</a>, which may have killed hundreds of thousands of people. Since 1965, hydropower has produced about 130 000 TWh; depending on which death toll estimate you believe, Banqiao single-handedly raises the deaths per TWh for hydropower by between 0.2 and 2. 
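</p><p>The arithmetic behind that range is a single division. The sketch below uses the generation total quoted above; the two death tolls are assumed round figures spanning commonly cited low and high Banqiao estimates, chosen for illustration rather than taken from the sources linked in this post, and the function name is made up.</p>

```python
# Extra deaths per TWh that one disaster adds to a power source's record.
# The generation total is the post's figure; the death tolls are ASSUMED
# round numbers spanning commonly cited Banqiao estimates.

HYDRO_TWH_SINCE_1965 = 130_000


def added_deaths_per_twh(deaths, total_twh):
    """Deaths from one event, spread over all the energy generated."""
    return deaths / total_twh


banqiao_low = added_deaths_per_twh(26_000, HYDRO_TWH_SINCE_1965)
banqiao_high = added_deaths_per_twh(240_000, HYDRO_TWH_SINCE_1965)

print(round(banqiao_low, 2), round(banqiao_high, 2))  # 0.2 1.85
```

<p>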
Compare this with nuclear power, which has produced about 92 000 TWh over the same timeframe; the long-term death estimates for Chernobyl add 0.04 to the deaths/TWh count for nuclear.</p><p>(The total generation numbers are based on the raw data behind <a href="https://ourworldindata.org/grapher/modern-renewable-energy-consumption?time=earliest..latest">this</a> and <a href="https://ourworldindata.org/grapher/nuclear-energy-generation?tab=chart&stackMode=absolute&time=earliest..latest&country=~OWID_WRL&region=World">this</a> graph, which you can download from the links. The nuclear number in the above chart is based on <a href="https://www.sciencedirect.com/science/article/pii/S0140673607612537?casa_token=r5LmpCZ6G8YAAAAA:aW4wjfZ3PENq0mvbNTLXF27WEkLRuAsE0wGTXSrC1R3OgNLg9a7RdoedMRKZ20sBoUwuClxm">this paper</a>, which Our World in Data says already includes Chernobyl, though I can't see where they add that in.)</p><p>The bottom line is that hydropower accidents are <a href="https://en.wikipedia.org/wiki/List_of_hydroelectric_power_station_failures">more common, more deadly, and higher variance</a> than nuclear accidents, even though both power sources have produced comparable amounts of energy in recent decades.</p><p>Okay, actually that isn't the real bottom line. The real bottom line is this: <i>when it comes to the human impacts of electricity generation, there are things that involve burning (fossil fuels & biomass), and then there is everything else, and the latter category is much much better</i>. Also, if you absolutely must burn something, <i>do not burn coal</i>.</p><p>What has nuclear specifically done so far? <a href="https://pubs.acs.org/doi/abs/10.1021/es3051197?source=cen&">One study</a> finds that it has saved 1.8 million lives by reducing air pollution, or about 4 years of the world's current malaria death rate.</p><p>What could it have done? Until the mid-1970s, the adoption of nuclear power was accelerating. 
Assume this trend had continued until today, and nuclear had replaced fossil fuels only (an optimistic assumption, but one that doesn't change the numbers much because renewables are a pretty small percentage). Under these assumptions, <a href="https://www.mdpi.com/1996-1073/10/12/2169/htm">one study</a> estimates that nuclear would now account for over half of the world's energy production, and a total of 9.5 million deaths would have been avoided – as much as if you saved everyone who would otherwise have died of cancer in the past year. Even if nuclear adoption had only been linear, 4.2 million deaths could have been avoided, the same number as saving everyone who has died in war since 1970 (the war deaths number is from the raw data behind <a href="https://ourworldindata.org/grapher/battle-related-deaths-in-state-based-conflicts-since-1946-by-world-region">this chart</a>).</p><p>Therefore: <i>in terms of the number of lives saved, keeping the nuclear power industry growing would have very likely been at least as good as achieving world peace in 1970.</i></p><p>Since these numbers are enormous, and involve difficult-to-estimate unknowns, here's something more concrete: Germany's decision in 2011 to get rid of nuclear is costing an average of 1100 lives per year (<a href="https://www.nber.org/system/files/working_papers/w26598/w26598.pdf">working paper</a>; <a href="https://grist.org/energy/the-cost-of-germany-going-off-nuclear-power-thousands-of-lives/">article</a>).</p><h2>The environmental case for nuclear power</h2><p>Climate change is a big problem, but the scale of it as an environmental problem is better known than the scale of air pollution as a health problem, so I won't go into the statistics on its impact.</p><p>Nuclear power is obviously good for the climate. 
Here's a chart, based on <a href="https://www.ipcc.ch/site/assets/uploads/2018/02/ipcc_wg3_ar5_annex-iii.pdf#page=7">this</a>, which is summarised in a more readable format <a href="https://en.wikipedia.org/wiki/Life-cycle_greenhouse_gas_emissions_of_energy_sources#2014_IPCC,_Global_warming_potential_of_selected_electricity_sources">here</a>:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/--BtWucJqwIw/YF-tqOJatuI/AAAAAAAAChI/EmTccYV2zp0JhqoLAahh9JWhI7rxbLUlgCLcBGAsYHQ/co2eqpertwh.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="830" data-original-width="1232" height="432" src="https://lh3.googleusercontent.com/--BtWucJqwIw/YF-tqOJatuI/AAAAAAAAChI/EmTccYV2zp0JhqoLAahh9JWhI7rxbLUlgCLcBGAsYHQ/w640-h432/co2eqpertwh.png" width="640" /></a></div> <p></p><p>The black bars span the range between the minimum and maximum numbers. The red dot is the median.</p><p>I've converted the numbers from the traditional grams of CO2 equivalent per kWh to tons of CO2 equivalent per TWh, to be consistent with the death rates graph above, and for easier conversion to national/international CO2 statistics (which are generally expressed in tons of CO2 – unless it's tons of carbon, in which case you divide by the fraction of CO2's mass that is carbon, which is 12/44 or about 0.27).</p><p>(If you're wondering where hydropower is: its median is right around concentrated solar, but in some cases, especially in tropical climates, the <a href="https://en.wikipedia.org/wiki/Environmental_impact_of_reservoirs#Greenhouse_gases">reservoirs created by dams can release a lot of methane</a>, making the maximum CO2-equivalent emissions for hydropower over twice as bad as coal and, more importantly, completely ruining my pretty chart.)</p><p>So far, the use of nuclear power is estimated to have <a 
href="https://blogs.scientificamerican.com/the-curious-wavefunction/nuclear-power-may-have-saved-1-8-million-lives-otherwise-lost-to-fossil-fuels-may-save-up-to-7-million-more/">reduced cumulative CO2 emissions to date by 64 billion tons</a>, a bit less than two years of the world's <i>total</i> CO2 emissions at current rates. The <a href="https://www.mdpi.com/1996-1073/10/12/2169/htm">same study</a> linked in the previous section estimates that, had nuclear power grown at a steady linear rate, this number would be doubled, and if the accelerating trend in nuclear power adoption had continued, there would be 174 billion tons less CO2 in the atmosphere. We would have saved more emissions than we would have if we had made every car in the world emission-free since 1990.</p><p> </p><h2>The problems</h2><p>In <i>Enlightenment Now</i> (my review <a href="http://strataoftheworld.blogspot.com/2018/08/review-enlightenment-now-steven-pinker.html">here</a>), Steven Pinker writes:</p><blockquote><p><i>"It’s often said that with climate change, those who know the most are the most frightened, but with nuclear power, those who know the most are the least frightened."</i></p></blockquote><p>So why aren't the arguments against nuclear power enough to frighten those who know about it?</p><p>The short version: more nuclear power would save millions of lives from air pollution and be a big help in solving climate change. When these are the benefits, you need a hell of a drawback before the scales start tilting the other way.</p><p>The long version:</p><h3>Radiation & accidents</h3><p>(Radiation units are confusing. Activity, straightforwardly defined as the number of atoms that undergo decay per second, is measured in becquerels (Bq). The amount of radiation energy absorbed per kilogram of matter is measured in grays (Gy), which therefore have units of joules per kilogram. Measuring biological effects is harder, because the type of radiation and what tissue it hits both matter. 
If you adjust for the type of radiation by multiplying the absorbed dose in grays by some factor (scaled so that gamma rays have a factor 1), you get something called <a href="https://en.wikipedia.org/wiki/Equivalent_dose">equivalent dose</a>, which is measured in sieverts (Sv). If you also adjust for which tissue type was hit by multiplying by more estimated factors, you get <a href="https://en.wikipedia.org/wiki/Effective_dose_(radiation)">effective dose</a>, which is also measured in sieverts. If you want to get a sense of scale for radiation dose numbers, <a href="https://xkcd.com/radiation/">here's a good chart</a> and <a href="https://en.wikipedia.org/wiki/Sievert#Dose_examples">here's a good table</a>.)</p><p>In normal operation, a <a href="https://www.scientificamerican.com/article/coal-ash-is-more-radioactive-than-nuclear-waste/">nuclear power plant produces significantly less radiation than a coal power plant</a> (this is because everything radioactive is contained in a nuclear power plant, while coal power plants pump <a href="https://en.wikipedia.org/wiki/Fly_ash">fly ash</a> into the air). Neither is a significant dose.</p><p>In accidents, nuclear power plants can release insane amounts of radioactivity. Insane amounts of radiation are dangerous. 
However, the reaction to radiation risks is often out of proportion to the true risk – the Fukushima evacuations are considered excessive in hindsight, as argued in <a href="https://www.sciencedaily.com/releases/2017/11/171120085453.htm">this study</a>, though you probably don't need to make a study to guess it from <a href="https://ourworldindata.org/grapher/estimated-mortality-from-fukushima-nuclear-disaster">this chart</a>:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-EQAwvcAiSt8/YF-tw2NnKII/AAAAAAAAChM/LPjJmcIJ7NceUiiSso759uNq4fKJgTBQwCLcBGAsYHQ/fukushima.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1172" data-original-width="1736" height="432" src="https://lh3.googleusercontent.com/-EQAwvcAiSt8/YF-tw2NnKII/AAAAAAAAChM/LPjJmcIJ7NceUiiSso759uNq4fKJgTBQwCLcBGAsYHQ/w640-h432/fukushima.png" width="640" /></a></div><p></p><p>(In the long run, some more cancer deaths are expected to trickle in.)</p><p>It is critically important to remember the above statistics on health effects, and not let yourself be biased by <a href="https://en.wikipedia.org/wiki/Chernobyl_(miniseries)">vivid stories</a> about horrible individual events. The fear of nuclear accidents is similar to the fear of flying rather than driving: statistically one is much safer, but one is much easier to fear because when things go wrong, it comes in more story-worthy packages.</p><p>In particular: it is <i>not</i> the case that nuclear power is safer only because accidents are rare and therefore get left out of statistics; nuclear power would be overwhelmingly safer than fossil fuels even if there were a Chernobyl going off every year. 
As I said above, <a href="https://en.wikipedia.org/wiki/List_of_hydroelectric_power_station_failures">hydropower accidents</a> are more common, more deadly, and higher variance, so any argument based on disaster risk that bans nuclear would also ban hydropower.</p><h3>Nuclear proliferation</h3><p>Nuclear power is good, but <a href="https://strataoftheworld.blogspot.com/2020/04/review-doomsday-machine.html">nuclear weapons are bad</a>. It would be bad if the spread of civilian nuclear power technology led to nuclear proliferation. There is some overlap in technology, but neither civilian materials nor technologies automatically lead to weapons. The uranium used in power plants is typically only enriched to 3-5%, compared to more than 85% for weapons-grade uranium and 0.7% in natural uranium (though if you have uranium enrichment infrastructure, you can run it for more cycles than usual and let the enrichment levels slowly creep up – Iran has done this). There are also international agreements that prevent enrichment, and alternative nuclear technologies, like using thorium instead of uranium, with less weapon potential. Finally, a country trying to build nuclear weapons probably won't be stopped by a lack of a civilian industry; consider North Korea.</p><h3>Terrorism and war risks</h3><p>Another risk to consider is that nuclear power plants might be targeted by terrorists, or even by hostile nations, potentially leading to Chernobyl-scale disasters. This is a risk, but it's an acceptable one. Consider what it would mean if "hundreds or thousands of people could be killed if a determined and resourceful hostile actor targeted this piece of infrastructure" were a reason to not build some piece of infrastructure – we'd have to ban skyscrapers, airplanes, dams, water treatment plants, and so forth. 
Also, considering the security that's (rightfully) present at nuclear power plants, it would probably take a 9/11 level of execution to do it, and the observed rate for 9/11-level events over a time interval of length T is, well, 1/T if the interval includes 9/11 and otherwise 0.</p><p>It is true that a complex civilisation has a lot of fragile points and someone should be thinking hard about minimising this kind of risk, and that nuclear power plants are a good example because the effects are expensive and long-lasting if an attack is successful. But as an argument against nuclear power, <a href="https://slatestarcodex.com/2013/04/13/proving-too-much/">it proves too much</a>.</p><h3>Nuclear waste</h3><p>Nuclear waste is awkward to deal with, but it's far from the worst sort of industrial waste we deal with – consider the over thirty billion tons of carbon dioxide we've dumped into the atmosphere over the past year, or the various horrible things that coal plants spew out that cause dozens of Chernobyl-equivalents per year.</p><p>Nuclear waste is not some miracle substance that effortlessly seeps everywhere and kills whatever it touches. Until 1993, countries (mostly the USSR and UK) were dumping nuclear waste into the ocean. This is rightly banned these days, but you can observe that we still have oceans; in fact, <a href="https://en.wikipedia.org/wiki/Ocean_disposal_of_radioactive_waste#Environmental_impact">the environmental impacts</a> have so far been negligible except for somewhat higher concentrations of some nasty isotopes exactly at the site.</p><p>In general, nuclear waste is a serious problem that has to be solved somehow, but solutions exist (currently, Finland's <a href="https://en.wikipedia.org/wiki/Onkalo_spent_nuclear_fuel_repository">Onkalo repository</a> is the closest to being operational). 
Though the timescale is long, it is not different in principle from some existing disposal methods for nasty things like mercury and arsenic.</p><p>Is it responsible to leave behind dangerous waste for future generations? It's far more responsible than leaving them with the almost astronomical amounts of CO2 emissions that a single kilogram of uranium prevents.</p><p>Future people looking back at our century won't despair about a few warm rocks deep underground. They'll despair at all the silent air pollution deaths, at how far we let climate change get, and at how much sooner we could've reached their living standards had we made better use of our technology. Then they'll travel on nuclear-powered airplanes to distant hiking grounds, and tell scare stories around an (artificial!) campfire about the barbarian past when we burned things for energy and piped the waste products straight into the atmosphere.</p><h3>Uranium is limited</h3><p>First, we have <a href="https://www.scientificamerican.com/article/how-long-will-global-uranium-deposits-last/">200 years' worth of economically accessible uranium reserves</a>. This is <a href="https://ourworldindata.org/grapher/years-of-fossil-fuel-reserves-left">more than for fossil fuels</a>, with the additional benefit that burning through the remaining uranium won't wreck the climate and kill millions.</p><p>Second, we have alternatives to uranium, like thorium.</p><p>Third, there are hundreds of times more uranium dissolved in the oceans than there is on land (and this uranium exists in equilibrium, so if you take it out, more will leach out of the seabed to replace it, a fact that might lead a pedant to call nuclear power renewable). Even though the concentrations are tiny, because of the energy density of uranium, at modern reactor efficiencies there's still half a megajoule of usable nuclear energy in the uranium in a single cubic metre of seawater, enough to power the lightbulb in my room for over five hours. 
As a result, extracting it is a project that is <a href="https://www.forbes.com/sites/jamesconca/2016/07/01/uranium-seawater-extraction-makes-nuclear-power-completely-renewable/?sh=1b4b0f19159a">taken surprisingly seriously, and is surprisingly close to being economically viable</a>, though <a href="http://large.stanford.edu/courses/2017/ph241/jones-j2/docs/epjn150059.pdf">some people are very skeptical</a>.</p><h3>Nuclear power is unnatural</h3><p>Wrong: a few billion years ago <a href="https://www.scientificamerican.com/article/ancient-nuclear-reactor/">a spontaneous natural nuclear reactor</a> ran for a few hundred thousand years under what is now Gabon.</p><p>Using the best estimates for its running time and power output, even if this is the only natural reactor that ever formed, the energy it produced is several times higher than that of all human civilian nuclear power to date (both numbers are in the hundreds of petajoules range). Of sustained nuclear fission energy in our planet's history, more has been natural than artificial.</p><p> </p><h2>Nuclear is overpowered, so where is it?</h2><p>Nuclear power is an almost overpowered technology. The reason why comes down to physics: an energy source based on nuclear reactions has extreme power density, and, all else being equal, the higher your power density, the less fuel you need, the fewer waste products you produce, and the cleaner your power plant is overall. Not surprisingly, nuclear power turns out to be – along with solar and wind – the cleanest and safest power source we have.</p><p>In <a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html"><i>Where is My Flying Car?</i></a>, J. 
Storrs Hall gives some vivid facts to demonstrate the power and efficiency of nuclear: a wind turbine uses more lubricating oil per energy generated than a nuclear power plant uses uranium, and while the 7.5 TJ of energy a Boeing 747 burns through during a flight weighs 200 tons and costs a third of a million dollars when delivered as chemical fuel, getting the equivalent energy from nuclear takes 100 <i>grams</i> of reactor-grade uranium and costs 10 dollars.</p><p>So where is it? The simple reason is that it's either illegal (like in Italy), being phased out (like in Germany), or highly regulated and/or expensive. It wasn't always so:</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://lh3.googleusercontent.com/-IIf3MZJbgac/YF-t3NSrZRI/AAAAAAAAChQ/Tc4JrsRjwX0ZXkARslwmLS_LTpYIVYbfwCLcBGAsYHQ/nuclearcost.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1074" data-original-width="1412" height="304" src="https://lh3.googleusercontent.com/-IIf3MZJbgac/YF-t3NSrZRI/AAAAAAAAChQ/Tc4JrsRjwX0ZXkARslwmLS_LTpYIVYbfwCLcBGAsYHQ/w400-h304/nuclearcost.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p>Source: <i>Where is my Flying Car?</i>, by J. Storrs Hall.</p></td></tr></tbody></table><p></p> <p>The above graph shows the price per kilowatt of US nuclear power plants. The green line is the trend line before the Department of Energy was established in 1977. Note also that the Three Mile Island accident was in 1979, and, despite no one being hurt, this was a turning point for the US nuclear industry.</p><p>When the price of a technology starts increasing, it's not the natural learning curve of the technology at work. It's a regulatory choice. And while you obviously should regulate nuclear power, we're not doing it right.<br /></p><p>J. 
Storrs Hall explains the cost increases:</p><blockquote><p><i>"Nuclear power is probably the clearest case where regulation clobbered the learning curve. Innovation is strongly suppressed when you’re betting a few billion dollars on your ability to get a license to operate the plant. Besides the obvious cost increases due to direct imposition of rules, there was a major side effect of forcing the size of plants up (fewer licenses); fewer plants were built and fewer ideas tried. That also meant a greater cost for transmission (about half the total, according to my itemized bill), since plants are further from the average customer."</i></p></blockquote><p>There is some hope that the tide is turning. New startups like <a href="https://en.wikipedia.org/wiki/NuScale_Power">NuScale</a> are working on small modular reactors that might greatly reduce prices. Of course, in addition to difficulties with funding, and the not-so-easy task of building a literal nuclear reactor, they've spent years jumping through regulatory hurdles and are not expected to produce power until 2029. So-called fourth-generation reactors are also being worked on, and there's always the hope we eventually get fusion.</p><p>But we're not going to get the benefits of cheap and plentiful nuclear power unless we stop treating it like it's the Antichrist.</p><p>Hall, never one to pass up the opportunity for a dramatic touch, quotes John Steinbeck's <i>The Grapes of Wrath</i> to sum up the sadness of our attitude to nuclear power:</p><blockquote><p><i>“And men with hoses squirt kerosene on the oranges, and they are angry at the crime, angry at the people who have come to take the fruit. A million people hungry, needing the fruit—and kerosene sprayed over the golden mountains.</i></p><i></i><p><i>[...]</i></p><i></i><p><i>There is a crime here that goes beyond denunciation. There is a sorrow here that weeping cannot symbolize. There is a failure here that topples all our success. 
The fertile earth, the straight tree rows, the sturdy trunks, and the ripe fruit. And children dying of pellagra must die because a profit cannot be taken from an orange. And coroners must fill in the certificate—died of malnutrition—because the food must rot, must be forced to rot.”</i></p></blockquote><p>More generally, <a href="https://strataoftheworld.blogspot.com/2021/03/technological-progress.html">human civilisation needs to get better at making decisions about technology</a>. We shouldn't deny ourselves safe clean energy, but we should start working on mitigating the harms from actually scary technologies, like nuclear weapons, and make sure that new technologies like biotech and AI are used safely. Oh, and have I mentioned that burning things is bad for climate and health, and we should stop doing it?</p><h2>A metaphor</h2><p>I mentioned earlier that nuclear power and fossil fuels are like flying and driving. One of them is obviously safer, but the other seems scarier because the lizard-derived part of our brains can't multiply. Objecting to nuclear power on safety grounds but tolerating fossil fuels is like texting about how scared you are to board a plane while driving yourself to the airport. Let's make this metaphor more concrete, and hopefully create a memorable image.</p><p>The world consumes about 20 000 TWh per year as electricity (about one-eighth of total energy use – lots is used directly for transportation and heat). Let's compare this to making a drive across Europe that starts in Lisbon and ends in Tallinn. Each kilometre we travel represents a bit less than 5 TWh of energy towards our 20 000 TWh goal. 
Let's say walking is wind/solar/geothermal, biking is hydropower, flying is nuclear, and driving is fossil fuels.</p><p>(The numbers for fossil fuel related deaths below are significant underestimates of the global average, because, like the chart above, they're based on the European data in <a href="https://www.sciencedirect.com/science/article/pii/S0140673607612537?casa_token=r5LmpCZ6G8YAAAAA:aW4wjfZ3PENq0mvbNTLXF27WEkLRuAsE0wGTXSrC1R3OgNLg9a7RdoedMRKZ20sBoUwuClxm#bib32">this study</a>. Regulations are looser and population densities higher in many developing countries that make up most of the world's air pollution deaths. I was not able to find a good estimate of the global average, and besides, these numbers are terrifying enough as they are.)</p><p>First we walk some 450 km, ending north-west of Madrid, and then bike 650 km, just barely taking us into France. We're a bit careless and somehow we've managed to shove a hundred people off wind turbines along the way. Oops.</p><p>By this point we're getting tired of walking and biking, but thankfully there's a flight to Paris. The pilot has a bad day and lands on top of a crowd, flattening another hundred people.</p><p>We really hate flying, so we refuse all the other offers that the airline companies try to sell us. Instead we step out of the Paris airport, rent a car, and start carelessly careening down the remaining 2600 km.</p><p>Gas takes us approximately to Berlin, a distance of about 1000 km. During this entire distance we run over a pedestrian at every block (roughly 1 per 80 metres), killing some 10 000 people in total.</p><p>We're in a real hurry to get to Poland, where the traffic rules get even more lenient and we can start <a href="https://www.independent.co.uk/climate-change/news/climate-change-poland-cop24-coal-air-pollution-global-warming-fossil-fuels-a8672481.html">burning coal</a>. 
The final leg of the journey from Berlin to the Polish border is powered by oil and isn't long, but still results in as many lethal hit-and-runs as the entire journey before it.</p><p>At the Polish border, we reach coal. From this point on, we text about the dangers of nuclear waste as we mow down one pedestrian every 8 metres for the entire rest of the coal-powered trip to Estonia (also burning <a href="https://en.wikipedia.org/wiki/Narva_Power_Plants">some other nasty things</a>). Driving at a reckless 120 km/h whatever road we're on, we run through four pedestrians a second – you'll hear a rapid thwack-thwack-thwack-thwack noise as the bodies hit the windshield – but it still takes 13 hours to make the trip. By the time we reach the Lithuanian border, the bodies of our victims, packed as tightly as possible, fill four Olympic swimming pools. Each of the three Baltic countries we drive through before reaching Tallinn fills another one.</p><p>Oh, and also every kilometre driven in our car had fifty times the environmental impact of flying.</p><p>Thank god we didn't fly: imagine how horrible it would be if another pilot had had a bad day.</p><p>The world makes this trip every year to meet our growing energy needs. We're getting fitter and walking a bit longer every year, as we should. But whenever someone suggests flying instead of driving, our collective response is: "What?! 
But that's so risky!"</p><p>Let's fly.</p><p><br /></p><p style="text-align: center;"><b>RELATED:</b></p><p></p><ul style="text-align: left;"><li><a href="https://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">A similar situation exists with GMOs</a></li><li><a href="https://strataoftheworld.blogspot.com/2021/03/technological-progress.html">Technological progress</a></li><li><a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html">Review: Where is my Flying Car?</a></li><li><a href="https://strataoftheworld.blogspot.com/2018/10/review-energy-and-civilization-history.html">Review: Energy and Civilisation</a> <br /></li></ul><p></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-46931608061489153172021-03-25T16:12:00.005+00:002021-03-27T22:51:38.091+00:00Technological progress<p style="text-align: center;"><span style="font-size: x-small;"><i>4k words (about 13 minutes)</i></span> <br /></p><p>In this post, I've collected some thoughts on:</p><ul><li>why technological progress probably matters more than you'd immediately expect;</li><li>what models we might try to fit to technological progress;</li><li>whether technological progress is stagnating; and</li><li>what we should hope future technological progress to look like.</li> </ul><p> </p><h2>Technological progress matters</h2><p>The most obvious reason why technological progress matters is that it is the cause for the increase in human welfare after the industrial revolution, which, in moral terms at least, is the most important thing that's ever happened. <a href="http://lukemuehlhauser.com/three-wild-speculations-from-amateur-quantitative-macrohistory/">"Everything was awful for a long time, and then the industrial revolution happened"</a> isn't a bad summary of history. 
It's tempting to think that technology was just one factor working with many others, like changing politics and moral values, but there are strong cases to be made that a changed technological environment, and <a href="https://strataoftheworld.blogspot.com/2019/09/growth-and-civilisation.html">the economic growth it enabled</a>, were <a href="http://strataoftheworld.blogspot.com/2020/12/review-foragers-farmers-and-fossil-fuels.html">the reasons for political and moral changes in the industrial era</a>. Given this history, we should expect that more technological progress will be important for increasing human welfare in the future too (though not enough on its own – see below). This applies both to people in developed countries – we are not at <a href="https://nickbostrom.com/utopia.html">utopia</a> yet, after all – as well as those in developing countries, who are already seeing vast benefits from information technology making development cheaper, and would especially benefit from decreases in the price of sustainable energy generation.</p><p>Then there are more subtle reasons to think that technological progress doesn't get the attention it deserves.</p><p>First, it works over long time horizons, so it is especially subject to all the kinds of short-termism that plague human decision-making.</p><p>Secondly, lost progress isn't visible: if the Internet hadn't been invented, very few would realise what they're missing out on, but try taking it away now and you might well spark a war. This means that stopping technological progress is politically cheap, because likely no one will realise the cost of what you've done.</p><p>Finally, making the right decisions about technology is going to decide whether or not the future is good. Debates about technology often become debates about whether we should be pessimistic or optimistic about the impacts of future technology. 
This is rarely a useful framing, because the only direct impact of technology is to let us make more changes to the world. Technology shouldn't be understood as a force automatically pulling the distribution of future outcomes in a good or bad direction, but as a force that <i>blows up the distribution</i> so that it spans all the way from an engineered super-pandemic that kills off humanity ten years from now to an interstellar civilisation of trillions of happy people that lasts until the stars burn down. Where on this distribution we end up depends in large part on the decisions we collectively make about technology. So, how about we get those decisions right?</p><p>But first, how should we even think about technological progress?</p><p> </p><h2>Modelling technological progress</h2><p>Some people think <a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html">that technological progress is stagnating relative to historical trends, and that, for example, we should have flying cars by now</a>. To be able to assess this claim, we need some model of what technological progress should be like. I can think of three general ones.</p><p>The first one I'll name the Kurzweilian model, after futurist <a href="https://en.wikipedia.org/wiki/Ray_Kurzweil#The_Law_of_Accelerating_Returns">Ray Kurzweil</a>, who's made a big deal about how <a href="https://www.kurzweilai.net/the-law-of-accelerating-returns">the intuitive linear model of technological progress is wrong, and history instead shows technological progress is exponential</a> – the larger your technological base, the easier it is to invent new technologies, and hence a graph of anything tech-related should be a hockey-stick curve shooting into the sky.</p><p>The second I'll call the fruit tree model, after the metaphor that once the "low-hanging fruit" are picked off, progress gets harder. 
The strongest case for this model is in science; the physics discoveries you can make by watching apples fall down have (very likely) long since been picked off. However, it's not clear similar arguments should apply to technology. Perhaps we can model inventing a technology as finding a clever way to combine a number of already known parts into a new thing, and hence the number of possible inventions would be an increasing function of the number of things already invented, since this gives more combinations. For example, even if progress in pure aviation is slow, when we invent new things like lightweight computers we can combine the two to get drones. I haven't seen anyone propose a model to explain why the fruit tree model makes sense for technology in particular.</p><p>The third model is that technological change is mostly random. Any particular technological base satisfies the prerequisites for some set of inventions. Once invented, a new technology goes through an S-curve of increasing adoption and development, before reaching widespread adoption and a mature form. Sometimes there are many inventions just within reach, and you get an innovation burst, like the mid-20th century one when television, cars, passenger aircraft, nuclear weapons, birth control pills, and rocketry were all simultaneously going through the rapid improvement and adoption phase. Sometimes there are no plausible big inventions for very long periods of time, for example in medieval times. 
</p><p>Here's an Our World in Data graph (<a href="https://ourworldindata.org/grapher/technology-adoption-by-households-in-the-united-states?tab=chart&stackMode=absolute&country=Automobile~Cellular%20phone~Computer~Dryer~Electric%20power~Flush%20toilet~Household%20refrigerator~Microwave~Refrigerator~Washing%20machine&region=World">source and interactive version here</a>) showing more-or-less-S-curves for the adoption of a bunch of technologies:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-NYRF1GhXMHc/YFy0_l5nykI/AAAAAAAACgU/4pDNWQllx4YVAaqXCizmI0srH-5DGMy4wCLcBGAsYHQ/adoption.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1116" data-original-width="1754" height="407" src="https://lh3.googleusercontent.com/-NYRF1GhXMHc/YFy0_l5nykI/AAAAAAAACgU/4pDNWQllx4YVAaqXCizmI0srH-5DGMy4wCLcBGAsYHQ/adoption.png" width="640" /></a></div><p></p><p>(One can try to imagine an even more general model to unify the three models above, though we're getting to fairly extreme abstraction levels. Nevertheless, for the fun of it: let's model each technology as a set of prerequisite technologies, and assume there's a subset of technology-space that makes up the sensible technologies, and some cost function that describes how hard it is to go from a set of technologies to a given new technology (so the cost is infinite unless all prerequisites of the new one are contained in the known set). Then slow progress would be modelled as the set of sensible ideas and the cost function being such that from any particular set of known technologies, there are only a few sensible ideas with prerequisites only in the known set, and these have high costs. Fast progress is the opposite. 
In the Kurzweilian model, the subspace of sensible ideas is in some sense uniform, so that the fraction of the <script type="math/tex">2^{|K|}</script> possible prerequisite combinations for a known technology set <script type="math/tex">K</script> that are contained within the sensible set does not go down with the cardinality of <script type="math/tex">K</script>, and also we require the cost function to not increase too rapidly as the complexity of the technologies grows. In the fruit tree model, the cost function increases, and possibly sensible technologies become sparser as you get into the more complex parts of technology-space. In the random model, the cost function has no trend, and a lot of the advancements happen when a "key technology" is discovered that is the last unknown prerequisite for a lot of sensible technologies in technology-space.)</p><p>(Question: has anyone drawn up a dependency tree of technologies across many industries (or even one large one), or some other database where each technology is linked to a set of prerequisites? That would be an incredible dataset to explore.)</p><p>In <a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html"><i>Where is my Flying Car?</i></a>, J. Storrs Hall introduces his own abstraction of a civilisation's technology base that he calls the "technium": imagine some high-dimensional space representing possible technologies, and imagine a blob in this space representing existing technology. This blob expands as our technological base expands, but not uniformly: imagine some gradient in this space representing how hard it is to make progress in a given direction from a particular point, which you can visualise as a "terrain" which the technium has to move along as it expands. 
Some parts of the terrain are steep: for example, given technology that lets you make economical passenger airplanes moving at near the speed of sound, it takes a lot to progress beyond that because crossing the speed of sound is difficult. Hence the "aviation cliffs" in the image below; the technium is pressing against them, but progress will be slow:</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://lh3.googleusercontent.com/-ag8j7955xik/YFy1EHc4riI/AAAAAAAACgY/4wNJq5pUIsoqhemZFxha1HGl4aRtb7epQCLcBGAsYHQ/technium1.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1070" data-original-width="1906" height="360" src="https://lh3.googleusercontent.com/-ag8j7955xik/YFy1EHc4riI/AAAAAAAACgY/4wNJq5pUIsoqhemZFxha1HGl4aRtb7epQCLcBGAsYHQ/w640-h360/technium1.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">(Image source: my own slides for an EA Cambridge talk.)</span><br /></td></tr></tbody></table><p></p><p>In other cases, there are valleys, where once the technium gets a toehold in one, progress is fast and the boundaries of what's possible gush forwards like a river breaking a dam. 
The best example is probably computing: figure out how to make transistors smaller and smaller, and suddenly a lot of possibilities open up.</p><p>We can visualise the three models above in terms of what we'd expect the terrain to look like as the technium expands further and further:</p><p></p><div class="separator" style="clear: both; text-align: center;"><div style="text-align: center;"><a href="https://lh3.googleusercontent.com/-Q8uW_-U9r6A/YFy1RHGhKOI/AAAAAAAACgk/kH9VI82-L-I_sV8llkVBOwoyzFob5mH5gCLcBGAsYHQ/techniumterrain.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="810" data-original-width="1854" height="280" src="https://lh3.googleusercontent.com/-Q8uW_-U9r6A/YFy1RHGhKOI/AAAAAAAACgk/kH9VI82-L-I_sV8llkVBOwoyzFob5mH5gCLcBGAsYHQ/w640-h280/techniumterrain.png" width="640" /></a></div></div><p></p><p>(Or maybe a better model would be one where the gradient is always non-negative, with 0 gradient meaning effortless progress?)</p><p>In the Kurzweilian model, the terrain gets easier and easier the further out you go; in the fruit tree it's the opposite; if there is no pattern, then we should expect cliffs and valleys and everything in between, with no predictable trend.</p><p>Hall comes out in favour of what I've called the random model, even going as far as to speculate that the valleys might follow a <a href="https://en.wikipedia.org/wiki/Zipf%27s_law">Zipf's law</a> distribution. He concisely summarises the major valleys of the past and future:</p><blockquote><p><i>"The three main phases of technology that drove the Industrial Revolution were first low-pressure steam engines, then machine tools, and then high-pressure engines enabled by the precision that the machine tools made possible. High-pressure steam had the power-to-weight ratios that allowed for engines in vehicles, notably locomotives and steamships. 
The three major, interacting, and mutually accelerating technologies in the twenty-first century are likely to be nuclear, nanotech (biotech is the “low-pressure steam” of nanotech), and AI, coming together in a synergy I have taken to calling the Second Atomic Age."</i></p></blockquote><p>Personally, my views have shifted away from somewhat Kurzweilian ones and towards the random model, with the main factors being that the technological stagnation debate has made me less certain that the historical data fits a Kurzweilian trend, and that since there are no clear answers to whether there is a general pattern, it's sensible to shift the distribution of my beliefs towards the model that doesn't require assuming the truth of a general pattern. However, given some huge valleys that seem to be out there – AI is the obvious one, but also nanotechnology, which might bring physical technology to Moore's-law-like growth rates – it is possible that the difference between the Kurzweilian and random model looks largely academic in the next century.</p><p> </p><h2>Is technology stagnating?</h2><p>Now that we have some idea of how to think about technological progress, we are better placed to answer the question of whether it has stagnated: if the fruit tree model is true we should expect a slowdown, whereas if the extreme Kurzweilian model is true, a single trend line that's not going to break past the top of the figure in the next decade is a failure. Even so, this question is very confusing; economists debate about total factor productivity (a debate I will stay out of), and in general it's hard to know what could have been.</p><p>However, it does seem true that compared to the mid-20th century, the post-1970 era has seen breakthroughs in fewer categories of innovation. 
Consider:</p><ul><li><p>1920-1970:</p><ul><li>cars</li><li>radio</li><li>television</li><li>antibiotics</li><li>the green revolution</li><li>nuclear power</li><li>passenger aviation</li><li>chemical space travel</li><li>effective birth control</li><li>radar</li><li>lasers</li> </ul></li><li><p>1970-2020:</p><ul><li>personal computers</li><li>mobile phones</li><li>GPS</li><li>DNA sequencing</li><li>CRISPR</li><li>mRNA vaccines</li> </ul></li> </ul><p>Of course, it's hard to compare inventions and put them in categories – is lumping everything computing-related as largely the same thing really fair? – but <a href="https://rootsofprogress.org/technological-stagnation">some people are persuaded by such arguments</a>, and a general lack of big breakthroughs in big physical technologies does seem true. (Though this might soon change, since the clean energy, biotech, and space industries are making rapid progress.)</p><p>Why is this? If we accept the fruit tree model, there's nothing to be explained. If we accept the random one, we can explain it as a fluke of the shape of the idea space terrain that the technium is currently pressing into. To quote Hall again:</p><blockquote><p><i>"The default [explanation for technological stagnation] seems to have been that the technium has, since the 70s, been expanding across a barren high desert, except for the fertile valley of information technology. I began this investigation believing that to be a likely explanation."</i></p></blockquote><p>This, I think, is a pretty common view, and is a sensible null hypothesis in the absence of other evidence. We can also imagine variations, like the existence of a huge valley in the form of computing drawing all the talent that would otherwise have gone into pushing the technium forwards in other places. However, Hall rather dramatically concludes that this</p><blockquote><p><i>"[...] is wrong. 
As the technium expanded, we have passed many fertile Gardens of Eden, but there has always been an angel with a flaming sword guarding against our access in the name of some religion or social movement, or simply bureaucracies barring entry in the name of safety or, most insanely, not allowing people to make money."</i></p></blockquote><p>Is this ever actually the case? I think there is a case where a feasible (and economic, environmental, and health-improving) technology has been blocked: nuclear power, as I discuss <a href="http://strataoftheworld.blogspot.com/2021/03/nuclear-power-is-good.html">here</a>. We should therefore amend our model of the technium: not only does it have to contend with the cliffs inherent in the terrain, but sometimes someone comes along and builds a big fat wall on the border, preventing either development, deployment, or both.</p><p>In diagram form:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-Ro1kfHXErQw/YFy1cWclgvI/AAAAAAAACgs/hDiZla9SwnUa9Ym5IAXf5Y-zi0BdSAUaQCLcBGAsYHQ/technium2.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1102" data-original-width="1884" height="374" src="https://lh3.googleusercontent.com/-Ro1kfHXErQw/YFy1cWclgvI/AAAAAAAACgs/hDiZla9SwnUa9Ym5IAXf5Y-zi0BdSAUaQCLcBGAsYHQ/w640-h374/technium2.png" width="640" /></a></div><p></p><p>Are there other cases? Yes – GMOs, as I discuss in <a href="http://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">this review</a>. There have also been some harmful technologies that have been controlled; for example biological and chemical weapons of mass destruction are more-or-less kept under control by two treaties (the <a href="https://en.wikipedia.org/wiki/Biological_Weapons_Convention">Biological Weapons Convention</a> and the <a href="https://en.wikipedia.org/wiki/Chemical_Weapons_Convention">Chemical Weapons Convention</a>). 
However, such cases seem to be the exception, since the overall history is one of technology adoption steamrolling the luddites, from the literal <a href="https://en.wikipedia.org/wiki/Luddite">Luddites</a> to George W. Bush's attempts to <a href="https://en.wikipedia.org/wiki/Stem_cell_laws_and_policy_in_the_United_States#Timeline">limit stem cell research</a>.</p><p>There are also cases where we put a lot of effort into expanding the technium in a specific direction (German subsidies for solar power are one successful example). We might think of this as adding stairs to make it easier to climb a hill.</p><p>How much of the technium's progress (or lack thereof) is determined by the terrain's inherent shape, and how much by the walls and stairs that we slap onto it? I don't know. The examples above show that as a civilisation we sometimes do build important walls in the technium terrain, but arguments like those Hall presents in <i>Where is my Flying Car?</i> are not strong enough to make me update my beliefs to thinking that this is the main factor determining how the technium expands. If I had to make a very rough guess, I'd say that though there is variation based on area (e.g. nuclear and renewable energy have a lot of walls and stairs respectively; computing has neither), overall the inherent terrain has at least several times the effect size on the decadal timescale. The power balance seems heavily dependent on the timescale too – George W. Bush can hold back stem cells for a few years, but imagine the sort of measures it would have taken to delay steam engines for the past few hundred years.</p><p> </p><h2>How should we guide technological progress?</h2><p>How much should we try to guide technological progress?</p><p>A first step might be to look at how good we've been at it in the past, so that we get a reasonable baseline for likely future performance. Our track record is clearly mixed. 
On one hand, chemical and biological weapons of mass destruction have so far been largely kept under control, though under a rather shoestring system (Toby Ord likes to point out that <a href="https://www.bbc.com/future/article/20200923-the-hinge-of-history-long-termism-and-existential-risk">the Biological Weapons Convention has a smaller budget than an average McDonald's</a>), and subsidies have helped solar and wind to become mature technologies. On the other hand, there are <a href="https://en.wikipedia.org/wiki/List_of_states_with_nuclear_weapons#Statistics_and_force_configuration">over ten thousand nuclear weapons in the world</a> and they don't seem likely to go away anytime soon (in particular, while <a href="https://en.wikipedia.org/wiki/New_START">New START</a> was recently extended, Russia has a <a href="https://en.wikipedia.org/wiki/RS-28_Sarmat">new ICBM</a> coming into service this year and the US is probably going to go ahead with their <a href="https://en.wikipedia.org/wiki/Ground_Based_Strategic_Deterrent">next-generation ICBM project</a>, almost ensuring that ICBMs – the most strategically volatile nuclear weapons – continue existing for decades more). We've mostly stopped ourselves benefiting from safe and powerful technologies like nuclear power and GMOs for no good reason. 
More recently, we've failed to allow <a href="https://en.wikipedia.org/wiki/Human_challenge_study">human challenge trials</a> for covid vaccines, despite massive net benefits (vaccine safety could be confirmed months faster, and the risk to healthy participants is lower than <a href="https://www.bls.gov/charts/census-of-fatal-occupational-injuries/civilian-occupations-with-high-fatal-work-injury-rates.htm">a year at some jobs</a>), <a href="https://www.1daysooner.org/">an army of volunteers</a>, and <a href="https://pubmed.ncbi.nlm.nih.gov/33334616/">broad public support</a>.</p><p>Imagine your friend was really into picking stocks, and sure, they once bought some AAPL, but often they've managed to pick the Enrons and Lehman Brothers of the world. Would your advice to them be more like "stay actively involved in trading" or "you're better off investing in an index fund and not making stock-picking decisions"?</p><p>Would things be better if we had tried to steer technology less? We'd probably be saving money and the environment (and <a href="https://en.wikipedia.org/wiki/Golden_rice">third-world children</a>) by eating far more genetically engineered food, and air pollution would've claimed <a href="https://www.mdpi.com/1996-1073/10/12/2169/htm">millions fewer lives</a> because nuclear power would've done more to displace coal. Then again, we'd probably have significantly less solar power. (Also, depending on what counts as steering technology rather than just reacting to its misuses, we might include the eventual bans on lead in gasoline, DDT, and chlorofluorocarbons as major wins.) 
And maybe without the Biological Weapons Convention becoming effective in 1975, the Cold War arms race would've escalated to developing even more bioweapons than the <a href="https://en.wikipedia.org/wiki/Soviet_biological_weapons_program">Soviets already did</a> (for more depth, read <a href="https://www.amazon.com/Dead-Hand-Untold-Dangerous-Legacy/dp/0307387844">this</a>), and an accidental leak might've released a civilisation-ending super-anthrax.</p><p>So though we haven't been particularly good at it so far, can we survive without steering technological progress in the future? I made the point above that technology increases the variance of future outcomes, and this very much includes in the negative direction. Maybe <a href="https://en.wikipedia.org/wiki/Boost-glide">hypersonic glide vehicles</a> make the nuclear arms race more unstable and eventually result in war. Maybe technology lets Xi Jinping achieve his dream of permanent dictatorship, and this model turns out to be easily exportable and usable by authoritarians in every country. Maybe we don't solve the AI alignment problem before someone goes ahead and builds an advanced AI, and the result is straight from Nick Bostrom's nightmares. And what exactly is the stable equilibrium in a world where a 150€ device that Amazon will drone-deliver to anyone in the world within 24 hours can take a genome and print out bacteria and viruses that have it?</p><p>This fragility is highlighted in a <a href="https://www.nickbostrom.com/existential/risks.html">2002 paper by Nick Bostrom</a>, who shares the view that the technium can't be reliably held back, at least to the extent that some dangerous technologies might require:</p><blockquote><p><i>"If a feasible technology has large commercial potential, it is probably impossible to prevent it from being developed. 
At least in today’s world, with lots of autonomous powers and relatively limited surveillance, and at least with technologies that do not rely on rare materials or large manufacturing plants, it would be exceedingly difficult to make a ban 100% watertight. For some technologies (say, ozone-destroying chemicals), imperfectly enforceable regulation may be all we need. But with other technologies, such as destructive nanobots that self-replicate in the natural environment, even a single breach could be terminal."</i></p></blockquote><p>The solution is what he calls differential development:</p><blockquote><p><i>"[We can affect] the rate of development of various technologies and potentially the sequence in which feasible technologies are developed and implemented. Our focus should be on what I want to call differential technological development: trying to retard the implementation of dangerous technologies and accelerate implementation of beneficial technologies, especially those that ameliorate the hazards posed by other technologies." [Emphasis in original]</i></p></blockquote><p>(See <a href="https://forum.effectivealtruism.org/posts/XCwNigouP88qhhei2/differential-progress-intellectual-progress-technological">here</a> for more elaboration on this concept and variations.)</p><p>For example:</p><blockquote><p><i>"In the case of nanotechnology, the desirable sequence would be that defense systems are deployed before offensive capabilities become available to many independent powers; for once a secret or a technology is shared by many, it becomes extremely hard to prevent further proliferation. In the case of biotechnology, we should seek to promote research into vaccines, anti-bacterial and anti-viral drugs, protective gear, sensors and diagnostics, and to delay as much as possible the development (and proliferation) of biological warfare agents and their vectors. 
Developments that advance offense and defense equally are neutral from a security perspective, unless done by countries we identify as responsible, in which case they are advantageous to the extent that they increase our technological superiority over our potential enemies. Such “neutral” developments can also be helpful in reducing the threat from natural hazards and they may of course also have benefits that are not directly related to global security."</i></p></blockquote><p>One point to emphasise is that the dangerous technology probably can't be held back indefinitely. One day, if humanity continues advancing (as it should), it will be easy to create deadly diseases, build self-replicating nanobots, or spin up a superintelligent computer program in the way that you'd spin up a Heroku server today. The only thing that will save us is if the defensive technology (and infrastructure, and institutions) are in place by then. In <i>The Diamond Age</i>, Neal Stephenson imagines a future where there are defensive nanobots in the air and inside people that are constantly on patrol against hostile nanobots. I can't help but think that this is where we're heading. (It's also the strategy our bodies have already adopted to fight off organic nanobots like viruses.)</p><p>This is not how we've done technology harm mitigation in the past. Guns are kept in check through regulation, not by everyone wearing body armour. Sufficiently tight rules on, say, what gene sequences you can put into viruses or what you can order your nanotech universal fabricator to produce will almost certainly be part of the solution and go a long way on their own. However, a gun can't spin out of control and end humanity; an engineered virus or self-replicating nanobot might. 
And as we've seen, our ability to regulate technology isn't perfect, so maybe we should have a backup plan.</p><p>The overall picture therefore seems to be that our civilisation's track record at tech regulation is far from perfect, but the future of humanity may soon depend on it. Given this, perhaps it's better that we err on the side of too much regulation – not because it's probably going to be beneficial, but because it's a useful training ground to build up the institutional competence we're going to need to tackle the actually difficult tech choices that are heading our way. Better to mess up regulating Facebook and – critically – learn from it, than to make the wrong choices about AI.</p><p>It won't be easy to make the leap from a civilisation that isn't building much nuclear power despite being in the middle of a climate crisis to one that can reliably ensure we survive even when everyone and their dog plays with nanobots. However, an increase in humanity's collective competence at making complex choices about technology is something we desperately need.</p><p><br /></p><p style="text-align: center;"><b>RELATED:</b></p><p></p><ul style="text-align: left;"><li style="text-align: left;"><a href="https://strataoftheworld.blogspot.com/2021/03/review-where-is-my-flying-car.html">Review: Where is my Flying Car?</a></li><li><a href="http://strataoftheworld.blogspot.com/2021/03/nuclear-power-is-good.html">Nuclear power is good</a></li><li><a href="https://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">Review: Seeds of Science</a> – GMOs are also good</li></ul><p></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-57949953287693226422021-03-21T16:52:00.005+00:002021-04-25T22:19:18.152+01:00Review: Where is my Flying Car?<p style="text-align: center;"><span style="font-size: x-small;"> Book: <i>Where is my Flying Car?: A Memoir of Future Past</i>, by J. 
Storrs Hall (2018)<br />Words: 9.3k (about 31 minutes)</span></p><p style="text-align: center;"><br /></p><p>In the 50s and 60s, predictions of the future were filled with big physical technical marvels: spaceships, futuristic cities, and, most symbolically, flying cars. The lack of flying cars has become a cliche, whether as a point about the unpredictability of future technological progress, or a joke about hopeless techno-optimism.</p><p>For J. Storrs Hall, flying cars are not a joke. They are a feasible technology, as demonstrated by many historical prototypes that are surprisingly close to futurists' dreams, and practical too: likely to be more expensive than cars, yes, but providing many times more value to owners.</p><p>So, where are they?</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://lh3.googleusercontent.com/-l_dFD3WFGCM/YFd1zawk5BI/AAAAAAAACfE/jI3VAWTGstsxaGRCnrRB0BfLMEvEQQvVACLcBGAsYHQ/flyingcar.png" style="margin-left: auto; margin-right: auto;"><img data-original-height="1012" data-original-width="1310" height="309" src="https://lh3.googleusercontent.com/-l_dFD3WFGCM/YFd1zawk5BI/AAAAAAAACfE/jI3VAWTGstsxaGRCnrRB0BfLMEvEQQvVACLcBGAsYHQ/w400-h309/flyingcar.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Above: not a joke. <i>(Public domain, <a href="https://commons.wikimedia.org/wiki/File:ConvairCar_Model_118.jpg">original here</a>)</i></td><td class="tr-caption" style="text-align: center;"><br /></td></tr></tbody></table><p></p> <p>The central motivating force behind <i>Where is my Flying Car?</i> is the disconnect between what is physically possible with modern science, and what our society is actually achieving. 
The immediate objection to such points is to say: "well, of course some engineer can imagine a world where all this fancy technology is somehow economically feasible and widespread, but in the real world everything is more complicated, and once you take these complications into account there's no surprising failure".</p><p>Hall's objection is that everything was going fine until 1970 or so.</p><p>Many people complain that technological progress has slowed. Flying cars, of course, but also: airliner cruising speeds have stagnated, the space age went on hiatus, cities are still single-level flat designs with traffic, nuclear power stopped replacing fossil fuels, and nanotechnology (in the long run, the most important technology for building anything) is growing slowly. <a href="https://www.newyorker.com/magazine/2011/11/28/no-death-no-taxes">Peter Thiel</a> sums this up by saying "we wanted flying cars, instead we got 140 characters".</p><p>It's not just technology. There's an <a href="https://wtfhappenedin1971.com/">entire website devoted to throwing graphs at you about trends that changed around 1970</a> (and selling you Bitcoin on the side), and, while a bunch of it is <a href="https://tylervigen.com/spurious-correlations">Spurious Correlations material</a>, they include enough important things, like a stagnation in median wages, that it's worth thinking about.</p><p>Perhaps the most fundamental indicator is that the energy available per person in the United States was increasing exponentially (a trend Hall names the Henry Adams curve), until, starting around 1970, it just wasn't:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-lzBlmEZg8Yo/YFd2QE4JA2I/AAAAAAAACfM/Decg_7IdIvEQrT4I1txpyTqRYtVVg5DcQCLcBGAsYHQ/adamscurve.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="650" data-original-width="1522" height="274" 
src="https://lh3.googleusercontent.com/-lzBlmEZg8Yo/YFd2QE4JA2I/AAAAAAAACfM/Decg_7IdIvEQrT4I1txpyTqRYtVVg5DcQCLcBGAsYHQ/w640-h274/adamscurve.png" width="640" /></a></div><br /><p></p>Is this just because the United States is an outlier in energy use statistics? No; other developed countries have plateaued too, with the exception of Iceland and Singapore: <p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://lh3.googleusercontent.com/-wpdE2TYobyg/YFd2XDadn7I/AAAAAAAACfQ/FpdBl5tJtdsWZPwwMhuLepjVu7FvhBNKgCLcBGAsYHQ/energycapita.png" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1096" data-original-width="1748" height="402" src="https://lh3.googleusercontent.com/-wpdE2TYobyg/YFd2XDadn7I/AAAAAAAACfQ/FpdBl5tJtdsWZPwwMhuLepjVu7FvhBNKgCLcBGAsYHQ/w640-h402/energycapita.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p>(Source: <a href="https://ourworldindata.org/">Our World in Data</a>, one of the best websites on the internet. You can play around with an interactive version of this chart <a href="https://ourworldindata.org/grapher/per-capita-energy-use?tab=chart&time=earliest..latest&country=DEU~JPN~SGP~SWE~TWN~GBR~USA~ISL&region=World">here</a>.)</p></td></tr></tbody></table><p></p><div class="separator" style="clear: both; text-align: center;"></div> <p></p><p>Hall tries to estimate what percentage of future predictions in some technical area have come true as a function of the energy intensity of the technology, and finds a strong inverse correlation: in less energy intensive areas (e.g. mobile phones) we've over-achieved relative to futurists' predictions, while the opposite is true with energy intensive big machines (e.g. flying cars). 
(This is necessarily very subjective, but Hall at least says he did not change any of his estimates after seeing the graph.)</p><p>Of course, we have to contrast the stagnation in some areas with enormous advancements during the same time. The most obvious example is computing, something that futurists generally missed. In biotechnology, the price of DNA sequencing has dropped exponentially and in just the past few years we've gotten powerful tools like CRISPR and mRNA vaccines. Meanwhile the average person is now twice as rich as in 1970, and life expectancy has increased by 15 years (and the numbers are not much lower if we restrict our attention just to developed countries).</p><p>Perhaps we should be content; maybe Peter Thiel should stop complaining now that we have <a href="https://www.bbc.com/news/technology-41900880">280 characters</a>? After all, the problem is not that things are failing, but that they <i>might</i> be improving slower than they could be. That hardly seems like the end of the world. So why should we focus on technological progress? Has it really slowed? And how can we model it? <a href="https://strataoftheworld.blogspot.com/2021/03/technological-progress.html">I discuss these questions in another post</a>. In this post, however, I will move straight onto Hall's favourite topic.</p><p> </p><h2>Cool technology</h2><h3>Flying cars</h3><p>You might assume the case for flying cars looks something like this:</p><ol start=""><li>You get to places very fast.</li><li>Very cool.</li> </ol><p>However, there's a deeper case to be made for flying cars (or rapid transportation in general), and it starts with the observation that barefoot-walkers in Zambia tend to spend an hour or so a day travelling. Why is this interesting? 
Because this is the same as the average duration in the United States (of course Hall's other example is the US) or any other society.</p><p>Flying cars aren't about the speed – they're about the distance that this speed allows, given universal human preferences for daily travel duration. Cars on the road do about 60 km/h on average for any trip ("you might think that you could do better for a long trip where you can get on the highway and go a long way fast", Hall writes, but "the big highways, on the average, take you out of your way by an amount that is proportional to the distance you are trying to go"). A flying car that goes five times faster lets you travel within twenty-five times the area, potentially opening up a lot of choice.</p><p>Hall goes through some calculations about the utilities of different time-to-travel versus distance functions, given empirical results from travel theory, to produce this chart (which I've edited to improve the image quality and convert units) as a summary:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-q2Xx89wLBH0/YFd2isMPIJI/AAAAAAAACfU/8ILMwLRxm4EngBLIBDewJSLuQBDGlMq_ACLcBGAsYHQ/valueofvehicle.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1388" data-original-width="2002" height="278" src="https://lh3.googleusercontent.com/-q2Xx89wLBH0/YFd2isMPIJI/AAAAAAAACfU/8ILMwLRxm4EngBLIBDewJSLuQBDGlMq_ACLcBGAsYHQ/w400-h278/valueofvehicle.png" width="400" /></a></div><p></p><p>(The overhead time means how long it takes to transition into flying mode, for example if you have to attach wings to it, or drive to an airport to take off.)</p><p>Even a fairly lame flying car would easily be three times more valuable than a regular car, mainly by giving you more choice and therefore letting you visit places that you like more.</p><p>In terms of what a flying car would actually look like, you have several options. 
Helicopters are obvious, but they are about ten times the price of cars, mechanically complex (and with very low manufacturing tolerances), and limited by aerodynamics (the advancing blade pushes against the sound barrier, and the retreating one pushes against generating too little lift due to how slowly it moves) to a speed of 250 km/h or so.</p><p>Historically, many promising flying car designs that actually flew were <a href="https://en.wikipedia.org/wiki/Autogyro">autogyros</a>, which generate thrust with a propeller but lift through an unpowered freely-rotating helicopter-like rotor. They generally can't take off vertically, but can land in a very small space.</p><p>Another design is a VTOL (vertical take-off and landing) aircraft. Some have been built and used as fighter jets, but they've gained limited use because they're slower and less manoeuvrable than conventional fighters and have less room for weapons. However, Hall notes that one experimental VTOL aircraft in particular – the <a href="https://en.wikipedia.org/wiki/Ryan_XV-5_Vertifan">XV-5</a> – would "have made one hell of a sports car" and its performance characteristics are recognisable as those of a hypothetical utopian flying car. It flew in 1964, but was cancelled because the Air Force wanted something as fast and manoeuvrable as a fighter jet, rather than "one hell of a sports car".</p><p>Of current flying car startups, Hall mentions <a href="https://en.wikipedia.org/wiki/Terrafugia">Terrafugia</a> and <a href="https://en.wikipedia.org/wiki/AeroMobil_s.r.o._AeroMobil">AeroMobil</a>, which produce traditional gasoline-powered vehicles (both with fuel economies comparable in litres/km to ordinary cars). 
There's also <a href="https://en.wikipedia.org/wiki/Volocopter">Volocopter</a> and <a href="https://en.wikipedia.org/wiki/EHang">EHang</a>, both of which produce electric vehicles with constrained ranges.</p><p>Hall divides the roadblocks (or should I say <a href="https://en.wikipedia.org/wiki/NOTAM">NOTAMs</a>?) for flying cars into four categories.</p><p>The first is that flying is harder than driving. To test this idea, Hall learned to fly a plane, and concluded that it is considerably harder, but not insurmountably. Besides, we're not far from self-driving; commercial passenger flights are close to self-piloting already, the existing Volocopter is only "optionally piloted", and the EHang 184 flies itself. </p><p>The second is technological. The main challenges here are flying low and slow without stalling (you want to be able to land in small places, at least in emergencies), and reducing noise to manageable levels.</p><p>The third is economic. Even though the technology theoretically exists, it may be that we're not yet at a stage where personal flying machines are economically feasible. To some extent this is true; Hall admits that even on the pre-1970 trends in private aircraft ownership, the US private aircraft market would only be something like 30 000 - 40 000 per year (compared to the 2 000 or so that it currently is), about a hundredth of the number of cars sold. The economics means we should expect that the adoption curve is shallow, but not that it's necessarily non-existent.</p><p>The final reason is simple: even if you could make a flying car, you wouldn't be allowed to. Everything in aviation is heavily regulated, pushing up costs in a way that, Hall says, leads private pilots to joke about "hundred-dollar burgers". 
Of course, flying is hard, so you want standards high enough that at the very least you don't have to dodge other people's home-made flying motorbikes as they rain down from the sky, but in Hall's opinion the current balance is wrong.</p><p>And it's not just that the balance is wrong, but that the regulations are messed up. For example, making aircraft in the light sports aircraft category would be a great way to experiment with electric flight, but the FAA forbids them from being powered by anything other than a single internal combustion piston engine.</p><p>In particular, the FAA "has a deep allergy to people making money with flying machines". If you own a two-seat private aircraft, you can't charge a passenger you take on a flight more than half of the fuel cost, so no air Uber. Until the FAA stopped dragging its feet on <a href="https://en.wikipedia.org/wiki/Unmanned_aerial_vehicle#Commercial_use">drone regulation</a> in 2016, drones were operated under model aircraft rules, and therefore could not be used for anything other than hobby or recreational purposes. Similar rules still apply to ultralights, with one suspicious exception: a candidate for a federal, state, or local election is allowed to pay for a flight.</p><p>(And of course, to all these rules it's usually possible to apply for a waiver – so if you're a big company with an army of lawyers, do what you want, but if you're two people in a garage, good luck.)</p><p>There's no clear smoking gun of one piece of regulation specifically causing significant harm to flying car innovation. However, the harms of regulation are often a death-by-a-thousand-cuts situation, where a million rules each clip away at what is permissible and each add a small cost. 
Hall's conclusion is harsh: "It’s clear that if we had had the same planners and regulators in 1910 that we have now, we would never have gotten the family car at all."</p><p>One particular effect of flying cars would be to weaken the pull of cities, another topic to which Hall brings a lot of opinions.</p><h3>City design</h3><blockquote><p><i>"Designing a city whose transportation infrastructure consists of the flat ground between the boxes is insane."</i></p></blockquote><p>This is true. Most traffic problems would go away if you could add enough levels. However, "[e]ven the recent flurry of Utopia-building projects are still basically rows of boxes sitting on the dirt plus built-in wifi so the self-driving cars can talk to each other as they sit in automated traffic jams".</p><p>As usual, Hall spies some sinister human factors lurking behind the scenes, delaying his visions of techno-utopia:</p><blockquote><p><i>"There is a perverse incentive for bureaucrats and politicians to force people to interact as much as possible, and indeed to interact in contention, as that increases the opportunities for control and the granting of favors and privileges. 
This is probably one of the major reasons that our cities have remained flat, one-level no-man’s-lands where pedestrians (and beggars and muggers) and traffic at all scales are forced to compete for the same scarce space in the public sphere, while in the private sphere marvels of engineering have leapt a thousand feet into the sky, providing calm, safe, comfortable environments with free vertical transportation."</i></p></blockquote><p>This is an interesting idea, and I've <a href="https://www.elephantinthebrain.com/">read enough Robin Hanson</a> to not discount such perverse explanations immediately, but once again I'm not convinced how important this factor is, and Hall, as usual, is happy to paint only in broad strokes.</p><p>However, he makes a clearly strong point here:</p><blockquote><p><i>"Densification proponents often point to an apparent paradox: removing a highway which crosses a community often does not increase traffic on the remaining streets, as the kind of hydraulic flow models used by traffic planners had assumed that it would. On the average, when a road is closed, 20% of the traffic it had handled simply vanishes. Traffic is assumed to be a bad thing, so closing (or restricting) roads is seen as beneficial. Well duh. If you closed all the roads, traffic would go to zero. If you cut off everybody’s right foot and forced them to use crutches, you’d get a lot less pedestrian traffic, too."</i></p></blockquote><p>Hall takes a liberal principle of being strongly in favour of giving people choice, arguing that the goal of city design and transportation infrastructure should be to maximise how far people can travel quickly, rather than trying to ensure that they don't need to travel anywhere other than the set of choices the all-seeing, all-knowing urban designer saw fit to place nearby. 
Of course, once again flying cars are the best:</p><blockquote><p><i>"The average American commute to work, one way by car, ranges from 20 minutes to half an hour (the longer times in denser areas). This gives you a working radius of about 15 miles [= 24 km], or [1800 square kilometres] around home to find a workplace (or around work to find a home). With a fast VTOL flying car, you get a [240-kilometre] radius or [180 thousand square kilometres] of commutable area. Cars, trucks, and highways were clearly one of the major causes of the postwar boom. It isn’t perhaps realized just how much the war on cars contributed to the great stagnation—or how much flying cars could have helped prolong the boom."</i></p></blockquote><h3>Nuclear power</h3><p>I discuss nuclear power at length in <a href="http://strataoftheworld.blogspot.com/2021/03/nuclear-power-is-good.html">another post</a>.</p><h3>Space travel?</h3><p>What about the classic example of supposedly stalled innovation – we were on the moon in 1969, and won't return until <a href="https://en.wikipedia.org/wiki/Artemis_program">at least 2024</a>?</p><blockquote><p><i>"With space travel, there’s a pretty straightforward answer: the Apollo project was a political stunt, albeit a grand and uplifting one; there was no compelling reason to continue going to the moon given the cost of doing so."</i></p></blockquote><p>The general curve of space progress seems to be over-achievement relative to technological trends in the 60s, followed by stagnation, not because the technology is impossible – we did go to the moon after all – but because it just wasn't economical. 
Only now, with private space companies like SpaceX and Rocket Lab actually making a business out of taking things to space outside the realm of <a href="https://aozerov.com/research/lvmarket.pdf">cosy costs-plus government contracts</a> is innovation starting to pick up again.</p><p>(In the past ten years, we've seen the first commercial crewed spacecraft, reuse of rocket stages, the first methane-fuelled rocket engine ever flown, the first full-flow staged-combustion rocket engine ever flown, and the first liquid-fuelled air-launched orbital rocket, just to pick some examples.)</p><p>Hall has some further comments about space. First, in this passage he shows an almost-religious deference to trend lines:</p><blockquote><p><i>"As you can see from the airliner cruising speed trend curve, we shouldn’t have expected to have commercial passenger space travel yet, even if the Great Stagnation hadn’t happened."</i></p></blockquote><p>I don't think it makes sense to take a trend line for atmospheric flight speeds and use that to estimate when we should have passenger space travel; the physics is completely different, and in particular speeds are very constrained in orbit (you need to go 8 km/s to stay in orbit, and you can't go faster around the Earth without constant thrusting to stop yourself from flying off – something Hall clearly understands, as he explains it more than once).</p><p>Secondly, he is of course in favour of everything high-energy and nuclear.</p><p>For example: <a href="https://en.wikipedia.org/wiki/Project_Orion_(nuclear_propulsion)">Project Orion</a> was an American plan for a spacecraft powered (potentially from the ground up, rather than just in space) by throwing nuclear bombs out the back and riding the plasma from the explosions. This is a good contender for the stupidest-sounding idea that actually makes for a solid engineering plan; it's a surprisingly feasible way of getting sci-fi performance characteristics from your spacecraft. 
Other feasible methods have either far lower thrust (like ion engines, meaning that you can't use them to take off or land), or have far lower exhaust velocity (which means much more of your spacecraft needs to be fuel). The obvious argument against Orion, at least for atmospheric launch, is the fallout, but Hall points out it's actually not <i>that</i> bad – the number of additional expected cancer deaths from radiation per launch is "only" in the single digits, and that's under a very conservative linear no-threshold model of radiation dangers, which is likely wrong. (The actual reasons for cancellation weren't related to radiation risks, but instead the prioritisation of Apollo, the <a href="https://en.wikipedia.org/wiki/Partial_Nuclear_Test_Ban_Treaty">Partial Test Ban Treaty of 1963</a> that banned atmospheric nuclear tests, and the fact that no one in the US government had a particularly pressing need to put a thousand tons into orbit.) Hall also mentions an interesting fact about Orion that I hadn't seen before: "the total atmospheric contamination for a launch was roughly the same no matter what size the ship; so that there would be an impetus toward larger ones" – perhaps Orion would have driven mass space launch.</p><p>A more controlled alternative to bombing yourself through space is to use a nuclear reactor to heat up propellant in order to expel it out the back of your rocket at high speeds, pushing you forwards. The main limit with these designs is that you can't turn the heat up too much without your reactor blowing up. Hall's favoured solution is a direct fission-to-jet process, where the products of your nuclear reaction go straight out the engine without all this intermediate fussing around with heating the propellant. 
A reaction that converts a proton and a lithium-7 atom into 2 helium nuclei would give an exhaust velocity of 20 Mm/s (7% of the speed of light), which is insane.</p><p>To give some perspective: let's say your design parameters are that you have a 10 ton spacecraft, of which 1 ton can be fuel. With chemical rocket technology, this gives you a little toy with a total ∆V of some 400 m/s, meaning that if you light it up and let it run horizontally along a frictionless train track, it'll break the sound barrier by the time it's out of fuel, but it can't take you from an Earth-to-moon-intercept trajectory to a low lunar orbit even with the most optimal trajectories. With the proton + lithium-7 process Hall describes, your 10% fuel, 10-ton spaceship can accelerate at 1G for two days. If you want to go to Mars, instead of this whole modern business of waiting for the orbital alignment that comes once every 26 months and then doing a 9-month trip along the lowest-energy orbit possible, you can almost literally point your spaceship at Mars, accelerate yourself to a speed of 1 000 km/s over a day (for comparison, the speeds of the inner planets in their orbits are in the tens of kilometres per second range), coast for maybe a day at most, and then decelerate for another day. For most of the trip you get free artificial gravity because your engine is pushing you so hard. This would be technology so powerful even Hall feels compelled to tack on a safety note: "watch out where you point that exhaust jet".</p><h3>Nanotechnology!</h3><p>Imagine if machine pieces could not be made on a scale smaller than a kilometre. Want a gear? Each tooth is a 1km x 1km x 1km cube at least. Want to build something more complicated, say an engine? If you're in a small country, it may well be a necessarily international project, and also better keep it fairly flat or it won't fit within the atmosphere. Want to cut down a single tree? 
Good luck.</p><p>This is roughly the scale at which modern technology operates compared to the atomic scale. Obviously this massively cuts down on what we can do. Having nanotechnology that lets us rearrange atoms on a fine level, instead of relying on astronomically blunt tools and bulk chemical reactions, could put the capabilities of physical technology on the kind of exponential Moore's law curve we've seen in information technology.</p><p>There are some problems in the way. As you get to smaller and smaller scales:</p><ul><li>matter stops being continuous and starts being discrete (and therefore for example oil-based lubrication stops working);</li><li>the impact of gravity vanishes but the impact of adhesion increases massively;</li><li>heat dissipation rates increase;</li><li>everything becomes springy and nothing is stiff anymore; and</li><li>hydrogen atoms (other atoms are too heavy) can start doing weird quantum stuff like tunnelling.</li> </ul><p>Also, how do we even get started? If all we have are extremely blunt tools, how do you make sharp ones?</p><p>There are two approaches. The first, the top-down approach, was suggested <a href="https://en.wikipedia.org/wiki/There%27s_Plenty_of_Room_at_the_Bottom">in a 1959 talk</a> by Richard Feynman, which is credited as introducing the concept of nanotechnology. First, note that we currently have an industrial tool-base at human scales that is, in a sense, self-replicating: it requires human inputs, but we can draw a graph of the dependencies and see that we have tools to make every tool. Now we take this tool-base, and create an analogous one at one-fourth the scale. We also create tools that let us transfer manipulations – the motions of a human engineer's hands, for example – to this smaller-scale version (today we can probably also automate large parts of it, but this isn't crucial). 
Now we have a tool-base that can produce itself at a smaller scale, and we can repeat the process again and again, making adjustments in line with the above points about how the engineering must change. If each step is one-fourth the previous, 8 iterations will take us from a millimetre-scale industrial base to a tens-of-nanometres-scale one.</p><p>The other approach is bottom-up. We already have some ability to manipulate things on the single-digit nanometre scale: the smallest features on today's chips are in this range, we have <a href="https://en.wikipedia.org/wiki/Atomic_force_microscopy">atomic-scale microscopes that can also manipulate atoms</a>, and of course we're surrounded by massively complicated nanotechnology called organic life that comes with pre-made nano-components. Perhaps these tools let us jump straight to making simple nano-scale machines, and a combination of these simple machines and our nano-manipulation tools lets us eventually build the critical self-sustaining tool-base at the atomic level.</p><h3>Weather machines?!</h3><p>Here's one thing you could do with nanotechnology: make 5 quintillion 1 cm controllable hydrogen balloons with mirrors, release them into the atmosphere, and then set sunlight levels to be whatever you want (without nanotechnology, this might also be doable, but nanotechnology lets you make very thin balloons and therefore removes the need to strip-mine an entire continent for the raw materials).</p><p>Hall calls this a weather machine, and it is exactly what it says on the tin, both on a global and local level. He estimates that it would double global GDP by letting regions set optimal temperatures, since "you could make land in lots of places on the earth, such as Northern Canada and Russia, as valuable as California". 
Of course, this is assuming that we don't care about messing up every natural ecosystem and weather pattern on the planet, but if the machine is powerful enough we might choose to keep the still-wild parts of the world as they are. I don't know if this would work, though; sunlight control alone can do a lot to the weather, but perhaps you'd need something different to avoid, for example, the huge winds from regional temperature differences? However, with a weather machine, the sort of subtle global modifications needed to reverse the roughly 1 watt per square metre increase in incoming solar radiation that anthropogenic emissions have caused would be trivial. </p><p>Weather machines are scary, because we're going to need very good institutions before that sort of power can be safely wielded. Hall thinks they're coming by the end of the century, if only because of the military implications: not only could you destroy agriculture wherever you want, but the mirrors could also focus sunlight onto a small spot. You could literally smite your enemies with the power of the sun.</p><p>Don't want things in the atmosphere, but still want to control the climate? Then put up sunshades into orbit, incentivising the development of a large-scale orbital launch infrastructure at the same time that we can afterwards use to settle Mars or whatever. As a bonus, put solar panels on your sunshade satellites, and you can generate more power than humanity currently uses.</p><p>As always, nothing is too big for Hall. He goes on to speculate about a weather machine <a href="https://en.wikipedia.org/wiki/Dyson_sphere">Dyson sphere</a> at half the width of the Earth's orbit. Put solar panels on it, and it would generate enormous amounts of power. Use it as a telescope, and you could see a phone lying on the ground on <a href="https://en.wikipedia.org/wiki/Proxima_Centauri_b">Proxima Centauri b</a>. 
Or, if the Proxima Centaurians try to invade, you can use it as a weapon and "pour a quarter of the Sun’s power output, i.e. 100 trillion terawatts, into a [15-centimetre] spot that far away, making outer space safe for democracy."</p><h3>Flying cities?!?</h3><p>And because why the hell not: imagine a 15-kilometre airplane shaped like a manta ray and with a thickness of a kilometre (so the <a href="https://en.wikipedia.org/wiki/Burj_Khalifa">Burj Khalifa</a> fits inside), with room for 10 million people inside. It takes 200 GW of power to stay flying – equivalent to 4 000 Boeing 747s – which could be provided by a line of nuclear power plants every 100 metres or so running along the back. This sounds like a lot, but Hall helpfully points out the reactors would only be 0.01% of the internal volume, so you could still cluster Burj Khalifas inside to your heart's content, and the energy consumption comes out to only 20 kW per person, about where we'd be today if energy use had continued growing on pre-1970s trends.</p><p>If you don't want to go to space but still want to leave the Earth untouched, this is one solution, as long as you don't mind a lot of very confused birds.</p><h2>Technology is possible, but has risks</h2><p>I worry that <i>Where is my Flying Car?</i> easily leaves the impression that everything Hall talks about is part of some uniform techno-wonderland, which, depending on your prior about technological progress, is somewhere between certainly going to happen and permanently relegated to the dreams of mad scientists. Hall does not work to dispel this impression: he goes back and forth between talking about how practical flying cars are and exotic nuclear spacecraft, or between reasonable ideas about traffic layout in cities and far-off speculation about city-sized airplanes. 
Credible world-changing technologies like nanotechnology easily seem like just another crazy thought Hall sketched out on the back of the envelope and could not stop being enthusiastic about.</p><p>So should we take Hall's more grounded speculation seriously and ignore the nano-nuclear-space-megapolises? I think this would be the wrong takeaway. First, I'm not sure Hall's crazy speculation is crazy enough to capture possible future weirdness within it; he restricts himself mainly to physical technologies, and thus leaves out potentially even weirder things like a move to virtual reality or the creation of superhuman intelligence (whether AI or augmented humans).</p><p>Second, Hall does have a consistent and in some way realist perspective: if you look at the world – not at the institutions humans have built, or whatever our current tech toolbox contains, but at the physical laws and particles at our disposal – what do you come up with?</p><p>After all, our world is ultimately not one of institutions and people and their tools. The "strata" go deeper, until you hit the bedrock of fundamental physics. We spend most of our time thinking about the upper layers, where the underlying physics is abstracted out and the particles partitioned into things like people and countries and knowledge. This is for good reason, because most of the time this is the perspective that lets you best think about things important to people. Occasionally, however, it's worth taking a less parochial perspective by looking right down to the bedrock, and remembering that anything that can be built on that is possible, and something we may one day deal with.</p><p>This perspective should also make clear another fact. The things we care about (e.g. people) exist many layers of abstraction up from the fundamental physics, and are therefore fragile, since they depend on the correct configuration of all levels below. 
If your physical environment becomes inhospitable, or an engineered virus prevents your cells from carrying out their function, the abstraction of you as a human with thoughts and feelings will crash, just like a program crashes if you fry the circuits of the computer it runs on.</p><p>So there are risks, new ones will appear as we get better at configuring physics, and stopping civilisation from accidentally destroying itself with some new technology is not something we're automatically guaranteed to succeed at.</p><p>Hall does not seem to recognise this. Despite all his talk about nanotechnology, the <a href="https://en.wikipedia.org/wiki/Gray_goo">grey goo scenario</a> of self-replicating nanobots going out of control and killing everyone doesn't get a mention. As far as I'm aware, there's no strong theoretical reason for this to be impossible – nanobots good at configuring carbon/oxygen/hydrogen atoms are a very reasonable sort of nanobot, and I can't help but notice that my body is mainly carbon, oxygen, and hydrogen atoms. "What do you replace oil lubrication with for your atomic scale machine parts" is a worthwhile question, as Hall notes, but I'd like to add that so is the problem of not killing everyone.</p><p>Hall does mention the problem of AI safety:</p><blockquote><p><i>"The latest horror-industry trope is right out of science fiction [...]. People are trying to gin up worries that an AI will become more intelligent than people and thus be able to take over the world, with visions of Terminator dancing through their heads. Perhaps they should instead worry about what we have already done: build a huge, impenetrably opaque very stupid AI in the form of the administrative state, and bow down to it and serve it as if it were some god."</i></p></blockquote><p>What's this whole thing with arguments of the form "people worry about AI, but the <i>real</i> AI is X", where X is whatever institution the author dislikes? 
<a href="https://www.buzzfeednews.com/article/tedchiang/the-real-danger-to-civilization-isnt-ai-its-runaway">Here's another example</a> from a different political perspective (by sci-fi author Ted Chiang, whose <a href="http://strataoftheworld.blogspot.com/2020/05/short-reviews-fiction.html">fiction I enjoy</a>). I don't think this is a useless perspective – there is an analogy between institutions that fail because their design optimises for the wrong thing, and the more general idea of powerful agents accidentally designed to optimise for the wrong thing – but at the end of the day, surprise surprise, the real AI is a very intelligent computer program.</p><p>Hall also mentions he "spent an entire book (<i><a href="https://www.amazon.com/Beyond-AI-Creating-Conscience-Machine/dp/1591025117">Beyond AI</a></i>) arguing that if we can make robots smarter than we are, it will be a simple task to make them morally superior as well." This sounds overconfident – morality is complicated, after all – but I haven't read it.</p><p>As for climate change, Hall acknowledges the problem but justifies largely dismissing it by citing “[t]he actual published estimates for the IPCC’s worst case scenario, RCP8.5, [which] are for a reduction in GDP of between 1% and 3%". <a href="https://science.sciencemag.org/content/sci/356/6345/1362.full.pdf">This is true</a> ... if you only consider the United States! (The EU is in the same range but the global estimates range up to 10%, because of a disproportionate effect on poor tropical countries.) As the authors of that very report also note, these numbers don't take into account non-market losses. If Hall wants to make an argument for techno-optimistic capitalism, he should consider taking more care to distinguish himself from the strawman version.</p><p> </p><h2>It's <i>not</i> the technology, stupid!</h2><p>Hall does not think that we'd have all the technologies mentioned above if only technological progress had not "stagnated". 
The things he expects could've happened by now given past trends are:</p><ul><li>The technological feasibility of flying cars would be demonstrated and sales would be on the rise; Hall goes as far as to estimate the private airplane market in the US could have been selling 30k-40k planes per year (a fairly tight confidence interval for something this uncertain); compare with the actual US market today, which sells around 16 million cars and a few thousand private aircraft per year.</li><li>Demonstrated examples of multi-level cities and floating cities.</li><li>Chemical spacecraft technology would be about where it is now, but with some chance that government funding would have resulted in <a href="https://en.wikipedia.org/wiki/Project_Orion_(nuclear_propulsion)">Project Orion</a>-style nuclear launch vehicles.</li><li>Nanotechnology: basic things like ammonia fuel cells might exist, but not fancier things like cell repair machines or universal fabricators.</li><li>Nuclear power would generate almost all electricity, and hence there would be a lot less CO2 in the atmosphere (<a href="https://www.mdpi.com/1996-1073/10/12/2169/htm">this study</a> estimates 174 billion fewer tons of CO2 had reasonable nuclear trends continued, but Hall optimistically gives the number as 500 billion tons).</li><li>AI and computers at the same level as today.</li><li>A small probability that something unexpected along the lines of cold fusion would have turned out to work and been commercialised.</li><li>A household income several times larger than today.</li> </ul><p>So what went wrong? Hall argues:</p><blockquote><p>"The faith in technology reflected in Golden Age SF and Space Age America wasn’t misplaced. 
What they got wrong was faith in our culture and bureaucratic arrangements."</p></blockquote><p>He gives two broad categories of reasons: concrete regulations, and a more general cultural shift from hard technical progress to worrying and signalling.</p><h3>Regulation ruins everything?</h3><p>Hall does not like regulation. He estimates that had regulation not grown as it did after 1970, the increased GDP growth might have been enough to make household incomes 1.5 to 2 times higher than they are today in the US. I can find some studies saying similar things – <a href="https://www.sciencedirect.com/science/article/abs/pii/S1094202520300223">here</a> is one claiming 0.8% lower GDP growth per year since 1980 due to regulation, which would imply today's economy would be about 1.3 times larger had this drag on growth not existed. As far as I can tell, these estimates also don't take into account the benefits of regulation, which are sometimes massive (e.g. banning lead in gasoline). However, I think most people agree that regardless of how much regulation there should be, it could be a lot smarter. </p><p>Hall's clearest case for regulation having a big negative impact on an industry is private aviation in the United States, which crashed around 1980 after more stringent regulations were introduced. The number of airplane shipments per year dropped something like six-fold and never recovered.</p><p>A much bigger example is nuclear power, which I will discuss in an upcoming post, and which Hall also has plenty to say about.</p><p>Strangely, Hall misses perhaps the most obvious case in modern times: GMOs pointlessly being almost regulated out of existence, a story told well in Mark Lynas' <i>Seeds of Science</i> (my review <a href="http://strataoftheworld.blogspot.com/2018/12/review-seeds-of-science-why-we-got-it.html">here</a>). 
Perhaps this is because of Hall's focus on hard sciences, or his America-centrism (GMO regulation is worse in the EU than in the United States).</p><p>And speaking of America-centrism, the biggest question I had is why, even if the US is bad at regulation, no other country decides to do better and become the flying car capital of the world. Perhaps good regulation is hard enough that no one gets it right? Hall makes no mention of this question, though. </p><p>He does, however, throw plenty of shade on anything involving centralisation. For example:</p><blockquote><p><i>"Unfortunately, the impulse of the Progressive Era reformers, following the visions of [H. G.] Wells (and others) of a “Scientific Socialism,” was to centralize and unify, because that led to visible forms of efficiency. They didn’t realize that the competition they decried as inefficient, whether between firms or states, was the discovery procedure, the dynamic of evolution, the genetic algorithm that is the actual mainspring of innovation and progress."</i></p></blockquote><p>He brings some interesting facts to the table. For example, an OECD survey found a 0.26 correlation between private spending on research & development and economic growth, but a -0.37 correlation between public R&D and growth. Here's Hall's once again somewhat dramatic explanation:</p><blockquote><p><i>“Centralized funding of an intellectual elite makes it easier for cadres, cliques, and the politically skilled to gain control of a field, and they by their nature are resistant to new, outside, non-Ptolemaic ideas. 
The ivory tower has a moat full of crocodiles.”</i></p></blockquote><p>He backs this up with his personal experience of how US government spending on nanotechnology led to a flurry of scientists trying to claim that their work counted as nanotechnology (up to and including medieval stained glass windows), as well as trying to discredit anything that actually was nanotechnology, to make sure that the nanotechnologists wouldn't steal more federal funding in the future.</p><p>Studies, not surprisingly, find that the issue is more complicated (see for example <a href="https://link.springer.com/article/10.1007/s10645-019-09331-3">here</a>, which includes a mention of the specific survey Hall references).</p><p>Hall also includes a graph of economic growth vs the Fraser Institute's economic freedom score in the United States. I've created my own version below, including some more information than Hall does:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-CTgmRx_DaVY/YFd3KNL9TpI/AAAAAAAACfg/1Joe0sC5KjM0CJwKMvaCWkRtG9L68XDogCLcBGAsYHQ/gdpef.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="980" data-original-width="1550" height="405" src="https://lh3.googleusercontent.com/-CTgmRx_DaVY/YFd3KNL9TpI/AAAAAAAACfg/1Joe0sC5KjM0CJwKMvaCWkRtG9L68XDogCLcBGAsYHQ/gdpef.png" width="640" /></a></div><p></p><p>In general, it seems sensible to expect economic freedom to increase GDP: the more a person's economic choices are limited, the more likely the limitations are to prevent them from taking the optimal action (the main counterexample being if optimal actions for an individual create negative externalities for society). We can also see that this is empirically the case – developed countries tend to have high economic freedom. However, in using this graph as clear evidence, I think Hall is once again trying to make too clear a case on the basis of one correlation.</p>
<p>Effective decentralised systems, whether markets or democracy, are always prone to attack by people who claim that things would be better if only we let them make the rules. Maybe it takes something of Hall's engineer mindset to resist this impulse and see the value of bloodless systems and of general design principles like feedback and competition. (And perhaps Hall should apply this mindset more when evaluating the strength of evidence for his economic ideas.)</p><p>As for what the future of societal structure looks like, Hall surprisingly manages to avoid proposing flying-car-ocracy:</p><blockquote><p><i>"[It] may well be possible to design a better machine for social and economic control than the natural marketplace. But that will not be done by failing to understand how it works, or by adopting the simplistic, feedback-free methods of 1960s AI programs. And if ever it is done, it will be engineers, not politicians, who do it."</i></p></blockquote><p>He goes further:</p><blockquote><p><i>"As a futurist, I will go out on a limb and make this prediction: when someone invents a method of turning a Nicaragua into a Norway, extracting only a 1% profit from the improvement, they will become rich beyond the dreams of avarice and the world will become a much better, happier, place. Wise incorruptible robots may have something to do with it."</i></p></blockquote><h3>Risk perception and signalling</h3><p>Hall's second reason for us not living up to expectations for technological progress is cultural. He starts with the idea of risk homeostasis in psychology: everyone has some tolerance for risk, and will seek to be safer when they perceive current risk to be higher, and take more risks when they perceive current risk to be lower. In developed countries, risks are of course ridiculously low compared to historical levels, so most people feel safer than ever. 
Some start skydiving in response, but Hall suggests there's another effect that happens when an entire society finds itself living below their risk tolerance:</p><blockquote><p><i>"One obvious way [to increase perceived risk] is simply to start believing scare stories, from Corvairs to DDT to nuclear power to climate change. In other words, the Aquarian Eloi became phobic about everything specifically because we were actually safer, and needed something to worry about."</i></p></blockquote><p>I know what you're thinking – what the hell are "Aquarian Eloi"? Hall likes to come up with his own terms for things, and in this case he is making a reference to H. G. Wells' <i>The Time Machine</i>, in which descendants of humanity live out idle and dissolute lives (modelled on England's idle rich of the time), in order to label what he claims is the modern zeitgeist. Yes, this book is weird at times.</p><p>Another cultural idea he touches on is increased virtue signalling. Using the idea of <a href="https://en.wikipedia.org/wiki/Maslow%27s_hierarchy_of_needs">Maslow's hierarchy of needs</a>, he explains that as more and more of the population is materially well-off, more people invest more effort into self-actualisation. Some of this is productive, but, humans being humans, a lot of this effort goes into trying to signal how virtuous you are. 
Of course, there's nothing inherently wrong with that, as long as your virtue signalling isn't preventing other people climbing up from lower levels of Maslow's hierarchy – or, Hall would probably add, from building those flying cars.</p><h3>Environmentalism vs Greenism</h3><p>A particular sub-case of cultural change that Hall has a lot to say about is the "Green religion", something he distinguishes (though sometimes with not enough care) from perfectly reasonable desires "to live in a clean, healthy environment and enjoy the natural world".</p><p>This ideological, fear-driven and generally anti-science faction within the environmentalist movement is much the same thing as what Steven Pinker calls "Greenism", which I talked about in <a href="http://strataoftheworld.blogspot.com/2018/08/review-enlightenment-now-steven-pinker.html">my review of <i>Enlightenment Now</i></a> (search for "Greenism") and also features in <a href="http://strataoftheworld.blogspot.com/2018/08/review-enlightenment-now-steven-pinker.html">my review of Mark Lynas' <i>Seeds of Science</i></a> (search for "torpedoes"). Unlike Lynas or even Pinker, Hall does not hold back when it comes to criticising this particular strand of environmentalism. He explains it as an outgrowth of the risk-averseness and virtue signalling trends described above. The "Green religion", he claims, is now the "default religion of western civilization, especially in academic circles", and "has developed into an apocalyptic nature cult". To explain its resistance to progress and improving the human condition, he writes:</p><blockquote><p><i>"It seems likely that the fundamentalist Greens started with the notion that anything human was bad, and ran with the implication that anything that was good for humans was bad. In particular, anything that empowered ordinary people in their multitudes threatened the sanctity of the untouched Earth. The Green catechism seems lifted out of classic Romantic-era horror novels. 
Any science, any engineering, the “acquirement of knowledge,” can only lead to “destruction and infallible misery.” We must not aspire to become greater than our nature."</i></p></blockquote><p>There are troubling tendencies in ideological Greenism (as there are with anything ideological), but I think "apocalyptic nature cult" takes it too far, and as a substitute religion for the west, it has some formidable competitors. Hall is right to point out the tension between improving human welfare and Greenist desires to limit humans, but I'd bet that the driving factor isn't direct disdain for humans, but rather the sort of sacrificial attitudes that are common in humans (consider <a href="https://www.britannica.com/topic/flagellants">the people</a> who went around whipping themselves during the Black Death to try to atone for whatever God was punishing them for). Probably there's some part of human psychology or our cultural heritage that makes it easy to jump to sacrifice, disparaging ourselves (or even all of humanity), and repentance as the answer to any problem. While this is a nobly selfless approach, it's just less effective than, and sometimes in opposition to, actually building things: developing new technologies, building clean power plants, and so on.</p><p>Hall also goes too far in letting the Greenists tar his view of the entire environmentalist movement. Not only is climate change a more important problem than the 1-3% estimated GDP loss for the US suggests, but you'd think that the sort of big technical innovation that is happening with clean tech would be exactly the sort of progress Hall would be rooting for.</p><p>Hall does have an environmentalist proposal, and of course it involves flying cars:</p><blockquote><p><i>"The two leading human causes of habitat destruction are agriculture and highways – the latter not so much by the land they take up, but by fragmenting ecosystems. 
One would think that Greens would be particularly keen for nuclear power, the most efficient, concentrated, high-tech factory farms, and for ... flying cars. "</i></p><p><i>[Ellipsis in original]</i></p></blockquote><h3>Energy matters!</h3><p>Despite being partly blinded by his excessive anti-Greenism, there is one especially important correction to some strands of environmentalist thinking that Hall makes well: cheap energy really matters and we need more of it (and energy efficiency won't save the day).</p><p>Above, I used the stagnation in energy use per capita as an example of things going wrong. This may have raised some eyebrows; isn't it good that we're not consuming more and more energy? Don't we want to reduce our energy consumption for the sake of the environment?</p><p>First, it is obviously true that we need to reduce the environmental impact of energy generation. Decoupling GDP growth from CO2 emissions is one of the great achievements of western countries over the past decades, and we need to massively accelerate this trend.</p><p>However, our goal, if we're liberal humanists, should be to give people choices and let them lead happy lives (while applying the same considerations to any sentient non-human beings, and ideally not wrecking irreplaceable ecosystems). In our universe, this means energy. Improvements in the quality of life over history are, to a large extent, improvements in the amount of energy each person has access to. This is very true:</p><blockquote><p><i>“Poverty is ameliorated by cheap energy. Bill Gates, nowadays perhaps the world’s leading philanthropist, puts it, “If you could pick just one thing to lower the price of—to reduce poverty—by far you would pick energy.”"</i></p></blockquote><p>Even in the United States, "[e]nergy poverty is estimated to kill roughly 28,000 people annually in the US from cold alone, a toll that falls almost entirely on the poor". 
</p><p>Climate change cannot be solved by reducing energy consumption, because there are six billion people in the world who have not reached western living standards and who should be brought up to them as quickly as possible. This will take energy. What we need is to simultaneously massively increase the amount of energy that humanity uses, while also switching over to clean energy. If you think only one of these is enough, you have either failed to understand the gravity of the world's poverty situation or the gravity of its environmental one.</p><p>(Energy efficiency matters, because all else being equal, it reduces operating costs. It is near-useless for solving emissions problems, however, because the more efficiently we can use energy, the more of it we will use. Hall illustrates this with a thought experiment of a farmer who uses a truck to carry one crate of tomatoes at a time from their farm to a customer, and whose only expense is fuel for the truck. Double its fuel efficiency, and it's economical to drive twice as far, and hence service four times as many customers (assuming customer number is proportional to reachable area), plus each trip is twice as long on average. The net result is that the 2x increase in efficiency leads to 8x more kilometres driven and hence 4x higher fuel consumption. The general case is called <a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons paradox</a>.)</p><p>So yes, we need energy, most urgently in developing countries, but the more development and deployment of new energy sources there is, the cheaper they will be for everyone – consider Germany's highly successful subsidies for solar power – so developed countries have a role to play as well. (Also, are we sure there would be no human benefits to turning the plateauing in developed country energy use back into an increase?)</p><p>You'd think this is obvious. Unfortunately it isn't. 
In a section titled "AAUGHH!!", Hall presents these quotes:</p><blockquote><p><i>“The prospect of cheap fusion energy is the worst thing that could happen to the planet. – Jeremy Rifkin</i></p><p><i>Giving society cheap, abundant energy would be the equivalent of giving an idiot child a machine gun. – Paul Ehrlich</i></p><p><i>It would be little short of disastrous for us to discover a source of clean, cheap, abundant energy, because of what we might do with it. – Amory Lovins”</i></p></blockquote><p>They are what lead Hall to say, perhaps with too much pessimism:</p><blockquote><p><i>"Should [a powerful new form of clean energy] prove actually usable on a large scale, they would be attacked just as viciously as fracking for natural gas, which would cut CO2 emissions in half, and nuclear power, which would eliminate them entirely, have been."</i></p></blockquote><p>It is good to give people the choice to do what they want, and therefore good to give them as much energy as possible to play with, whether they want it to power the construction of their dream city or their flying car trips to Australia (I do draw the line at Death Stars, though).</p><p>Right now we're limited by the wealth of our societies, which caps us at about 10 kW/capita in developed countries, and by the unacceptable externalities of our polluting technology. The right goal isn't to enforce limits on what people can do (except indirectly through the likes of taxes and regulation to correct externalities), but to bring about a world where these limits are higher.</p><p>If energy is expensive, people are cheap – lives and experiences are lost for want of a few watts. This is the world we have been gradually dragging ourselves out of since the industrial revolution, and progress should continue. Energy should be cheap, and people should be dear.</p><p> </p><h2>Don't panic; build</h2><p><i>Where is my Flying Car?</i> is a weird book.</p><p>First of all, I'm not sure if it has a structure. 
Hall will talk about flying cars, zoom off to something completely different until you think he's said all he has to say on them, and just when you least expect it: more flying cars. The same pattern of presentation repeats with other topics. Also, sections begin and sometimes end with a long selection of quotes, including no less than three from Shakespeare.</p><p>Second, the ideas. There are the hundred speculative examples of crazy (big, physical) future technologies, the many often half-baked economic/political arguments, the unstated but unmissable America-centrism, and witty rants that wander the border between insightful social critique and intellectualised versions of stereotypical boomer complaints about modern culture.</p><p>Also, the cover is this:</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://lh3.googleusercontent.com/-w8hBExP7z7U/YFd3-oDl_8I/AAAAAAAACfo/tINZwzIMi04AtmLrMfLRGae6DY9qtEpXACLcBGAsYHQ/cover.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1012" data-original-width="624" height="400" src="https://lh3.googleusercontent.com/-w8hBExP7z7U/YFd3-oDl_8I/AAAAAAAACfo/tINZwzIMi04AtmLrMfLRGae6DY9qtEpXACLcBGAsYHQ/w247-h400/cover.png" width="247" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Above: ... a joke?<br /></td></tr></tbody></table><p></p><div class="separator" style="clear: both; text-align: center;"></div><p></p> <p>However, I think overall there's a coherent and valuable perspective here. First, Hall is against pointless pessimism. 
He makes this point most clearly when talking about dystopian fiction, but I think it generalises:</p><blockquote><p><i>"Dystopia used to be a fiction of resistance; it’s become a fiction of submission, the fiction of an untrusting, lonely, and sullen twenty-first century, the fiction of fake news and infowars, the fiction of helplessness and hopelessness. It cannot imagine a better future, and it doesn’t ask anyone to bother to make one. It nurses grievances and indulges resentments; it doesn’t call for courage; it finds that cowardice suffices. Its only admonition is: Despair more."</i></p></blockquote><p>Hall's answer to this pessimism is to point out ten billion cool tech things that we could do one day. He veers too much to the techno-optimistic side by not acknowledging any risks, but overall this is an important message. Visions of the future are often dominated by the negatives: no war, no poverty, no death. Someone needs to fill in the positives, and while Hall focuses more on the "what" of it than the "how does it help humans" part, I think a hopeful look at future technologies is a good start.</p><p>In addition to being against pessimism about human capabilities, Hall also takes, at least implicitly, a liberal stand by being against pessimism about humans. His answer to "what should we do?" 
is to give people choice: let them travel far and easily, let them live where they want, let them command vast amounts of energy.</p><p>Hall also identifies two ways to keep a civilisation on track in terms of making technological progress and not getting consumed by signalling and politics: growing, and having a frontier.</p><p>On the topic of growth, he makes basically the same point as my <a href="https://strataoftheworld.blogspot.com/2019/09/growth-and-civilisation.html">post on growth and civilisation</a>:</p><blockquote><p><i>"One of the really towering intellectual achievements of the 20th Century, ranking with relativity, quantum mechanics, the molecular biology of life, and computing and information theory, was understanding the origins of morality in evolutionary game theory. The details are worth many books in themselves, but the salient point for our purposes is that the evolutionary pressures to what we consider moral behavior arise only in non-zero-sum interactions. In a dynamic, growing society, people can interact cooperatively and both come out ahead. In a static no-growth society, pressures toward morality and cooperation vanish; you can only improve your situation by taking from someone else. The zero-sum society is a recipe for evil."</i></p></blockquote><p>Secondly, the idea of a frontier: something outside your culture that your society presses against (ideally nature, but I think this would also apply to another competing society). This is needed because "[w]ithout an external challenge, we degenerate into squabbling [and] self-deceiving".</p><blockquote><p><i>"But on the frontier, where a majority of one’s efforts are not in competition with others but directly against nature, self-deception is considerably less valuable. 
A culture with a substantial frontier is one with at least a countervailing force against the cancerous overgrowth of largely virtue-signalling, cost-diseased institutions."</i></p></blockquote><p>Frontiers often relate to energy-intensive technologies:</p><blockquote><p><i>"High-power technologies promote an active frontier, be it the oceans or outer space. Frontiers in turn suppress self-deception and virtue signalling in the major institutions of society, with its resultant cost disease. We have been caught to some extent in a self-reinforcing trap, as the lack of frontiers foster those pathologies, which limit what our society can do, including exploring frontiers. But by the same token we should also get positive feedback by going in in the opposite direction, opening new frontiers and pitting our efforts against nature."</i></p></blockquote><p>Finally, Hall's book is a reminder that an important measure to judge a civilisation against is its capacity to do physical things. Even if the bulk of progress and value is now coming from less material things, like information technology or designing ever fairer and more effective institutions, there are important problems – covid vaccinations, solving climate change, and building infrastructure, for example – that depend heavily on our ability to actually go out and move atoms in the real world. 
Let's make sure we continue to get better at that, whether or not it leads to flying cars.</p><div><br /></div><div style="text-align: center;"><b>RELATED:</b></div><div><ul style="text-align: left;"> <li><a href="http://strataoftheworld.blogspot.com/2021/03/nuclear-power-is-good.html">Nuclear power is good</a></li> <li><a href="https://strataoftheworld.blogspot.com/2021/03/technological-progress.html">Technological progress</a></li><li><a href="https://strataoftheworld.blogspot.com/2018/08/review-enlightenment-now-steven-pinker.html">Review: Enlightenment Now</a></li><li><a href="https://strataoftheworld.blogspot.com/2018/10/review-energy-and-civilization-history.html">Review: Energy and Civilisation</a></li> </ul></div><p> </p>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-1697673368059564013.post-61824184159231243912021-01-22T12:15:00.003+00:002021-02-19T21:46:35.603+00:00Data science 2<p style="text-align: center;"><span style="font-size: x-small;"><i>6.4k words, including equations (about 30 minutes)</i></span> <br /></p><p>See the <a href="http://strataoftheworld.blogspot.com/2020/12/data-science-1.html">first post</a> for an introduction.</p><h2>Monte Carlo methods</h2><p>In the late 1940s, Stanislaw Ulam was trying to work out the probability of winning in a solitaire variant. After cranking out combinatorics equations for a while, he had the idea that simulating a large number of games starting from random starting configurations with the "fast" computers that were becoming available could be a more convenient method.</p><p>At the time, Ulam was working on nuclear weapons at Los Alamos, so he had the idea of using the same principle to solve some difficult neutron diffusion problems, and went on to develop such methods further with John von Neumann (no mid-20th century maths idea is complete without von Neumann's hand somewhere on it). 
Since this was secret research, it needed a codename, and a colleague suggested "Monte Carlo" after the casino in Monaco. (This group of geniuses managed to break rule #1 of codenames, which is "don't reveal the basic operating principle of your secret project in its codename".)</p><p>Ulam used this work to help himself become (along with Edward Teller) the father of the hydrogen bomb. Our purposes here will be a bit more modest.</p><p>The basic idea of Monte Carlo methods is just repeated random sampling. Have a way to generate a random variable <script type="math/tex">X</script>, but not to generate fancy maths stats like <script type="math/tex">P(X \in S)</script>, where <script type="math/tex">S</script> is some subset of the sample space? Fear not – let <script type="math/tex">f(x)</script>, for values <script type="math/tex">x</script> that <script type="math/tex">X</script> can take, be 1 if <script type="math/tex">x \in S</script> and 0 otherwise. Then <script type="math/tex">E(f(X))</script> is <script type="math/tex">P(f(X) = 1) = P(X \in S)</script> and we've solved the problem if we can estimate <script type="math/tex">P(f(X)=1)</script>. 
If we can randomly sample values from <script type="math/tex">X</script> (and calculate the function <script type="math/tex">f</script>), then this is easy, because we simply sample many values and calculate for what fraction of them <script type="math/tex">f(X) = 1</script>.</p><p>In general,</p><div cid="n354" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n354" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-1" type="math/tex; mode=display">E(f(X)) \approx \frac{1}{n} \sum_{i=1}^n f(x_i)</script></div></div><p>for large <script type="math/tex">n</script> and with <script type="math/tex">x_i</script> drawn independently at random from <script type="math/tex">X</script>, a result that comes from the law of the unconscious statistician (discussed in <a href="http://strataoftheworld.blogspot.com/2020/12/data-science-1.html">part 1</a>) once you realise that as <script type="math/tex">n</script> increases the fraction of <script type="math/tex">x_i</script>s in the sample approaches <script type="math/tex">P(X=x_i)</script>.</p><p>We can also do integration in a Monte Carlo style. The standard way to integrate a function <script type="math/tex">f</script> is to sample it at uniform points, multiply each sampled value by the distance between the uniform points, and then add everything up. 
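</p><p>The uniform-grid sum just described can be sketched in a few lines of Python, next to a variant that places the sample points at random and weights each value by the gap to the next point. This is only an illustrative sketch: the function names, the example integrand and the choice to pin the first point to the left end of the interval are assumptions made here, not something taken from the text above.</p>

```python
import random


def grid_integral(f, a, b, n):
    """Sample f at n evenly spaced points and weight each value by the spacing."""
    step = (b - a) / n
    return sum(f(a + i * step) * step for i in range(n))


def random_point_integral(f, a, b, n, rng):
    """Sample f at n points, n - 1 of them random, weighting each value by
    the distance to the next point (or to b for the last one)."""
    # Assumption: the first point is pinned to a so the whole interval is covered.
    points = [a] + sorted(rng.uniform(a, b) for _ in range(n - 1))
    next_points = points[1:] + [b]
    return sum(f(x) * (nxt - x) for x, nxt in zip(points, next_points))


def square(x):
    """Example integrand; its integral over [0, 1] is 1/3."""
    return x * x


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so reruns give the same numbers
    print(grid_integral(square, 0.0, 1.0, 10_000))
    print(random_point_integral(square, 0.0, 1.0, 10_000, rng))
```

<p>For a smooth integrand like this one, both sums should approach 1/3 as <code>n</code> grows. (The more common textbook form of Monte Carlo integration skips the sorting and simply averages <code>f</code> at uniformly random points, then multiplies by the interval length.)</p><p>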
There's nothing special about uniformity though – as the number of samples increases, as long as we make sure to multiply each by the distance to the next sample, the result will converge to the integral.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-mbiuy7znBrs/YAq_cWeYE1I/AAAAAAAACUE/kEnKGqk6Xj8XRCp6Owfq108NPny0xPUrQCLcBGAsYHQ/mcint.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="418" data-original-width="998" height="268" src="https://lh3.googleusercontent.com/-mbiuy7znBrs/YAq_cWeYE1I/AAAAAAAACUE/kEnKGqk6Xj8XRCp6Owfq108NPny0xPUrQCLcBGAsYHQ/w640-h268/mcint.png" width="640" /></a></div><p></p>Above on the left, we see standard integration, with undershoot in pink and overshoot in orange, and Monte Carlo integration, with random samplings, on the right. <p>Sometimes a lot of the interesting stuff (e.g. expected value, area in the integral, etc.) comes from a part of the function's domain that's low-probability when values in the domain are generated via <script type="math/tex">X</script>. If this happens, you either crank up the <script type="math/tex">n</script> in your Monte Carlo, or then get smart about how exactly you sample (this is called importance sampling). 
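</p><p>As a rough illustration of importance sampling, consider estimating the probability that a standard normal variable exceeds 4, an event so rare that a plain Monte Carlo run of modest size will usually see it zero times. The sketch below draws instead from a normal distribution centred on the threshold and corrects for this by weighting each sample with the ratio of the two densities. The target distribution, the threshold, the proposal distribution and the function names are all assumptions chosen for the example.</p>

```python
import math
import random


def naive_tail_estimate(threshold, n, rng):
    """Fraction of standard normal draws that exceed the threshold."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n


def importance_tail_estimate(threshold, n, rng):
    """Draw from the proposal N(threshold, 1) and reweight each hit by
    p(x) / q(x), where p is the N(0, 1) density and q the proposal density."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # p(x) / q(x) = exp(-x^2 / 2) / exp(-(x - t)^2 / 2) = exp(t^2 / 2 - t * x)
            total += math.exp(threshold ** 2 / 2 - threshold * x)
    return total / n


if __name__ == "__main__":
    exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))  # exact tail probability
    print("exact     ", exact)
    print("naive     ", naive_tail_estimate(4.0, 20_000, random.Random(0)))
    print("importance", importance_tail_estimate(4.0, 20_000, random.Random(0)))
```

<p>About half of the proposal's samples land in the region of interest, so the weighted estimate can settle close to the exact value with a sample size at which the naive estimate is still mostly zeros.</p><p>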
If we're smart about this, our randomised integration can be faster than the standard method.</p><p>We will look at examples of using Monte Carlo -style random simulation to do both Bayesian and frequentist statistics below.</p><p> </p><h2>Confidence</h2><p>In addition to providing a best-guess estimate of something (the probability a coin comes up heads, say), useful statistics should be able to tell us about how confident we should be in a particular guess – the best estimate of the probability a coin lands heads after observing 1 head in 2 throws or 50 heads in 100 throws is the same, but the second one still allows us to say more.</p><p>The question of how to quantify confidence leads into the question of what probability is.</p><p>The frequentist approach is to say that probabilities are observed relative frequencies across many trials, and if you don't have many trials to look at, then you imagine some hypothetical set of trials that an event might be seen as being drawn from.</p><p>The Bayesian approach is that probabilities quantify the state of your own knowledge, and if you don't have data to look at, you should still be able to draw a probability distribution representing your knowledge.</p><h3>Bayesianism</h3><p>Bayesianism is the idea that you represent uncertainty in beliefs about the world using numbers, which come from starting out with some prior distribution, and then shifting the distribution back and forth as evidence comes in. These numbers follow the axioms of probability, and so we might as well call them probabilities.</p><p>(Why should these numbers follow the axioms of probability? Because if you do otherwise and base decisions on those beliefs, you will do stupid things. As a simple example, making bets consistent with a probability model where the probabilities do not sum to 1 makes you exploitable. 
Let's say you're buying three options, each of which pays out 100€ if the winner of the 2036 US presidential election is EterniTrump, <a href="https://en.wikipedia.org/wiki/GPT-3">GPT</a>-7, or Xi Jinping respectively, and pay 40€ for each (consistent with assigning a probability of greater than 0.4 to each event occurring). You're sure to be down 20€ that you could've spent on underground bunkers instead.)</p><p>In Bayesian statistics, you don't perform arcane statistical tests to reject hypotheses. Your subjective beliefs about something are a probability distribution (or at least they should be, if you want to reason perfectly). Once you've internalised the idea of what a probability distribution means, and know how to reason about updates to that probability distribution rather than in black-and-white terms of absolute truth or falsehood, Bayesianism is intuitive and will make your reasoning about probabilistic things (i.e., everything except pure maths) better.</p><p>(Why is Bayesianism named after Bayes? Bayes invented Bayes' theorem but not Bayesianism; however, Bayesian updating using Bayes' theorem is the core part of ideal Bayesian reasoning.)</p><p>There's one tricky part of Bayesianism, and it's a consequence of the Bayesian insistence that subjective uncertainty is represented by a probability distribution, and hence quantified. It's this: you always need to start with a quantified probability distribution (called a prior), even before you've seen any data.</p><p>There's a clear regress here, at least philosophically. 
Sure, you might be able to come up with a sensible prior for how effective masks are against a respiratory disease, but ask a baby for <script type="math/tex">P(\frac{P(\text{covid} | \text{mask})}{P(\text{covid}|\neg \text{mask})} = r)</script> and you're not likely to get a coherent answer (and remember that your current prior should come from baby-you's prior in an unbroken series of Bayesian updates) – let alone if we're imagining some hypothetical platonic being existing beyond time and space who has never seen any data, or the <a href="https://www.theguardian.com/world/2020/apr/07/face-masks-cannot-stop-healthy-people-getting-covid-19-says-who">World Health Organisation</a>.</p><p>In practice, however, I don't think this is very worrying. Priors formalise the idea that you can apply background knowledge even when you don't have data for the specific case in front of you. Reject the use of priors, and you'll fall into another regress: "study suggests mask-wearing effective against the coronavirus variant in 40-60 year-old European females in green t-shirts; no information yet on 40-60 year-old European females in red t-shirts ..."</p><h4>Computational Bayes</h4><p>In general, the scenario we have when doing a Bayesian calculation is that there's some model <script type="math/tex">X</script> that depends on parameter(s) <script type="math/tex">\theta</script>, and we want to find what those parameters are given some sample <script type="math/tex">x</script> from <script type="math/tex">X</script> (since this is Bayesian, we have to assume that <script type="math/tex">\theta</script> itself is a value of the random variable <script type="math/tex">\Theta</script> describing the probabilities of each possible <script type="math/tex">\theta</script>). 
Now we could do this mathematically by calculating</p><div cid="n375" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n375" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-2" type="math/tex; mode=display">\Pr_\Theta(\theta \, | \, X=x) = c \Pr_X(x | \Theta = \theta) \Pr_\Theta(\theta),</script></div></div><p>and then finding the constant <script type="math/tex">c</script> with integration by the rule that probabilities must sum to 1. (Remember the interpretation of these terms: <script type="math/tex">\Pr_\Theta(\theta)</script> is the prior distribution we assume for <script type="math/tex">\Theta</script> before seeing evidence; <script type="math/tex">\Pr_\Theta(\theta \, | \, X=x)</script> is the posterior likelihood distribution after seeing the data; see the <a href="http://strataoftheworld.blogspot.com/2020/12/data-science-1.html">previous post</a> for some intuition on Bayes if these aren't clear to you.)</p><p>However, maybe some part of this (especially the integration) would be tricky, or you just happen to have a Jupyter notebook open on your computer. In any case, we can go about things in a different way, as long as we have a way to generate samples from our prior distribution and re-weight them appropriately.</p><p>The first thing we do is represent the prior distribution of <script type="math/tex">\Theta</script> by sampling it many times. We don't need an equation for it, just some function (in the programming sense) that pulls from it.</p><p>Next, consider the impact of our data on the estimates. 
We can imagine each sample we took as a representation of a tiny blob of probability mass corresponding to some particular <script type="math/tex">\theta_i</script>, and imagine rescaling it in the same way that we rescaled the odds of various outcomes when talking about the odds ratio form of Bayes' rule in the first post. How much do we rescale it by? By the likelihood of observing <script type="math/tex">x</script> if <script type="math/tex">\Theta=\theta_i</script>: this is the <script type="math/tex">\Pr_X(x|\Theta=\theta)</script> term in the above equation.</p><p>Finally, we need to do the scaling. Thankfully, this doesn't take integration, since we can calculate the sum of our re-weighted likelihoods and just divide all our scaled values by that – boom, we have an (approximation of) a posterior probability distribution.</p><p>To make things concrete, let's write code and visualise a simple case: estimating the probability that a coin lands heads. The first step in Bayesian calculations is usually the trickiest: we need a prior. For simplicity, let's say our prior is that the coin has an equal chance of having every possible probability (so the real numbers 0 to 1) of coming up heads.</p><p>(The fact that the thing we're estimating is itself a probability doesn't matter; don't be confused by the fact that we have two sorts of probability – our knowledge about the coin's probability of coming up heads, represented as a probability distribution, and the probability that the coin comes up heads (an empirical fact you can measure by throwing it many times). 
Equally well we might have talked about some non-probabilistic feature of the coin, like its diameter, but that would be a lot more boring.)</p><p>To write this out in actual Python, the first step (after importing NumPy for vectorised calculation and Matplotlib for the graphing we'll do later) is some way to generate samples from this distribution:</p><pre><code class="language-python" lang="python">import numpy as np<br />import matplotlib.pyplot as plt<br /><br />def prior_sample(n):<br /> return np.random.uniform(size=n)<br /></code></pre><p>(<code>np.random.uniform(size=n)</code> returns <code>n</code> samples from a uniform distribution over the range 0 to 1.)</p><p>To calculate the posterior:</p><pre><code class="language-python" lang="python">def posterior(sample, throws, heads):<br /> """ This function calculates an approximation of the<br /> posterior distribution after seeing the coin<br /> thrown a certain number of times;<br /> sample is a sample of our prior distribution,<br /> throws is how many times we've thrown the coin,<br /> heads is how many times it has come up heads."""<br /> # The number of times the coin lands heads follows a binomial distribution.<br /> # Thus, below we reweight using a binomial pdf:<br /> # (note that we drop the throws-Choose-heads term because it's a constant<br /> # and we rescale at the end anyways)<br /><br /> weighted_sample = sample ** heads * (1 - sample) ** (throws - heads)<br /><br /> # Divide by the sum of every element in the weighted sample to normalise:<br /><br /> return weighted_sample / np.sum(weighted_sample)<br /></code></pre><p>(Remember that the calculation of <code>weighted_sample</code> is done on every term in the <code>sample</code> array separately, in the standard vectorised way.)</p><p>Now we can generate a sample to model the prior distribution, and plot it as a histogram:</p><pre><code class="language-python" lang="python">N = 100000<br />throws = 100<br />heads = 20<br /><br />sample = 
prior_sample(N) # model the prior distribution<br /><br /># Plot a histogram:<br />plt.hist(sample,<br /> # split the range 0-1 into 50 bins for the histogram:<br /> np.linspace(0, 1, 50), <br /> # weight each item by the likelihood:<br /> weights=posterior(sample, throws, heads))<br /></code></pre><p>The result will look something like this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-x0LpfUiUwCA/YAq_z2nq0vI/AAAAAAAACUM/aLkZe2ZhF9sDGHa5JDOfXPb0hhZOllRswCLcBGAsYHQ/postex.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="496" data-original-width="802" height="396" src="https://lh3.googleusercontent.com/-x0LpfUiUwCA/YAq_z2nq0vI/AAAAAAAACUM/aLkZe2ZhF9sDGHa5JDOfXPb0hhZOllRswCLcBGAsYHQ/w640-h396/postex.png" width="640" /></a></div><br /><p></p><p>This is an approximation of the posterior probability distribution after seeing 100 throws and 20 heads. We see that most of the probability mass is clustered around a probability of 0.2 of landing heads; the chance of it being a fair coin is negligible.</p><p>What if we had a different prior? Let's say we're reasonably sure it's roughly a standard coin, and model our prior for the probability of landing heads as a normal distribution with mean 0.5 and standard deviation 0.1. 
To visualise this prior, here's a histogram of a 100k samples from it:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-zkfbbObxwQE/YArADhq3MPI/AAAAAAAACUc/mAd0l3HIix4AXh1YABNpYKd3lhS8ubpAQCLcBGAsYHQ/normex.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="488" data-original-width="800" height="244" src="https://lh3.googleusercontent.com/-zkfbbObxwQE/YArADhq3MPI/AAAAAAAACUc/mAd0l3HIix4AXh1YABNpYKd3lhS8ubpAQCLcBGAsYHQ/w400-h244/normex.png" width="400" /></a></div><br /><p></p>The posterior distribution looks almost identical to our previous posterior: <p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-ehLqscBpblA/YAq_5uJ6WAI/AAAAAAAACUU/3oBLg53kBPIsWj1arn0eRUpf6fJmsJH2gCLcBGAsYHQ/postex2.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="498" data-original-width="1000" height="318" src="https://lh3.googleusercontent.com/-ehLqscBpblA/YAq_5uJ6WAI/AAAAAAAACUU/3oBLg53kBPIsWj1arn0eRUpf6fJmsJH2gCLcBGAsYHQ/w640-h318/postex2.png" width="640" /></a></div><br /><p></p><p>There's simply so much data (a hundred throws) that even very different priors will have converged on what the data indicates.</p><p>A normal distribution might not be a very good model, though. Say we think there's a 49.5% chance the coin is fair, a 49.5% chance it's been rigged to come up tails with a probability arbitrarily close to 1, and the remaining 1% is spread uniformly between 0 and 1 (be very careful about assigning zero probability to something!). 
Then our prior distribution might be coded like this:</p><pre><code class="language-python" lang="python">def prior_sample_3(n):<br /> m = n // 100 # 1% of the samples are spread uniformly over 0 to 1<br /> k = (n - m) // 2 # half of the rest: rigged to always land tails<br /> return np.concatenate((np.random.uniform(size=m),<br /> np.zeros(k),<br /> # the remaining samples are a fair coin (0.5):<br /> np.ones(n - m - k) / 2),<br /> axis=0)<br /></code></pre><p>and 100k samples might be distributed like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-EISq1Nh8hJg/YArAHGUFB9I/AAAAAAAACUg/MdtcMGCDeVc9lIapC4Bw0yNQ8wbGFLy7gCLcBGAsYHQ/priorex.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="494" data-original-width="800" height="396" src="https://lh3.googleusercontent.com/-EISq1Nh8hJg/YArAHGUFB9I/AAAAAAAACUg/MdtcMGCDeVc9lIapC4Bw0yNQ8wbGFLy7gCLcBGAsYHQ/w640-h396/priorex.png" width="640" /></a></div><br /><br /><p></p>Let's also say we have less data than before – the coin has come up heads 8 times out of 40, say. Now our posterior distribution looks like this: <p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-fqUm2qOhyvI/YArAKABG7-I/AAAAAAAACUw/lyX3ZPq95vcgCKYW77h6ToDS1GXkpfQkACLcBGAsYHQ/postex3.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="496" data-original-width="800" height="396" src="https://lh3.googleusercontent.com/-fqUm2qOhyvI/YArAKABG7-I/AAAAAAAACUw/lyX3ZPq95vcgCKYW77h6ToDS1GXkpfQkACLcBGAsYHQ/w640-h396/postex3.png" width="640" /></a></div><br /><p></p>We've ruled out that the coin is rigged (a single heads was enough to nuke the likelihood of a completely rigged coin to zero – be very careful about assigning a probability of zero to something!), and most of the probability mass has shifted to a probability of landing heads of around 20%, as before, but because our prior was different, a noticeable chunk of our expectation is still that the coin is exactly fair. 
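<p>To put a rough number on that "noticeable chunk", here is a minimal sketch that adds up the posterior weight sitting on each part of the mixture prior. The helper names <code>mixture_prior_sample</code> and <code>posterior_weights</code> are made up for this illustration (the second just repeats the binomial reweighting used earlier), and the seeded <code>np.random.default_rng</code> generator is an assumption made for reproducibility, not something the earlier code relies on.</p>

```python
import numpy as np

def mixture_prior_sample(n, rng):
    # Hypothetical helper: 1% uniform on [0, 1), 49.5% exactly 0
    # (always tails), 49.5% exactly 0.5 (fair coin).
    m = n // 100
    k = (n - m) // 2
    return np.concatenate((rng.uniform(size=m),
                           np.zeros(k),
                           np.full(n - m - k, 0.5)))

def posterior_weights(sample, throws, heads):
    # Reweight each prior sample by the binomial likelihood
    # (constant factor dropped), then normalise to sum to 1.
    w = sample ** heads * (1 - sample) ** (throws - heads)
    return w / np.sum(w)

rng = np.random.default_rng(0)
sample = mixture_prior_sample(100_000, rng)
weights = posterior_weights(sample, throws=40, heads=8)

# Posterior mass on each component of the prior:
p_rigged = weights[sample == 0.0].sum()
p_fair = weights[sample == 0.5].sum()
p_other = weights[(sample != 0.0) & (sample != 0.5)].sum()

print(f"P(always tails | data) = {p_rigged}")
print(f"P(exactly fair | data) = {p_fair}")
print(f"P(something else | data) = {p_other}")
```

<p>Because only about a thousand samples represent the uniform part of the prior, the split between "exactly fair" and "something else" will wobble a little from run to run; increasing the sample size tightens it.</p>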
<p>As a final example, here's a big flowchart showing how the probability you should assign to different odds of the coin coming up heads shifts as you get data (red = tails, green = heads) up to 5 coin throws, assuming a prior that's the uniform distribution:</p><p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-mfCWYY9GDj4/YArAh75Q0pI/AAAAAAAACVI/3pLFbKdE5zIMLqtQWWydmsOfbB6b_l79wCLcBGAsYHQ/bayesucompressed.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1667" data-original-width="600" src="https://lh3.googleusercontent.com/-mfCWYY9GDj4/YArAh75Q0pI/AAAAAAAACVI/3pLFbKdE5zIMLqtQWWydmsOfbB6b_l79wCLcBGAsYHQ/s16000/bayesucompressed.png" /></a></div></div><p></p>Two questions to think about, one simple and on-topic, the other open-ended and off-topic: <ul><li>What is the simple function giving, within a constant, the posterior distribution after <script type="math/tex">n</script> heads and 0 tails? What about for <script type="math/tex">n</script> tails and 0 heads?</li><li>Doesn't the coin-throwing diagram look like Pascal's triangle? What's the connection between normal distributions, Pascal's triangle, and the central limit theorem (i.e., that the sum of enough of many of any random variable is distributed roughly normally?)? What extensions of Pascal's triangle can you think of, possibly with probabilistic interpretations?</li> </ul><h3>Frequentism</h3><p>Frequentists try to banish the subjectivity out of probability. The probability of event <script type="math/tex">E</script> is not a statement about subjective belief, but an empirical fact: given <script type="math/tex">n</script> trials, what is the fraction of times that <script type="math/tex">E</script> comes up, in the limit as <script type="math/tex">n \rightarrow \infty</script>? 
And ditch the Bayesian idea of doing nothing but shifting around the probability mass we assign to different beliefs; once you've done a statistical test, you either reject or fail to reject the null hypothesis.</p><p>A standard frequentist tool is hypothesis testing with a <script type="math/tex">p</script>-value. The procedure looks like this:</p><ol start=""><li>Pick a null hypothesis (usually denoted <script type="math/tex">H_0</script>). (For example, <script type="math/tex">H_0</script> could be that a coin is fair; that is, that the probability <script type="math/tex">h</script> of it coming up heads is 0.5.)</li><li>Pick a test statistic: a function <script type="math/tex">t</script> from the dataset <script type="math/tex">x</script> to a number. (For example, the maximum likelihood estimator for <script type="math/tex">h</script>, using the fact that we expect the number of heads to follow a binomial distribution with parameters for the number of throws and the probability <script type="math/tex">h</script>.)</li><li>Figure out a model for, or a way to sample from, the distribution of possible datasets given that <script type="math/tex">H_0</script> is true. (For example, we might write code to generate synthetic datasets <script type="math/tex">X^*</script> of the same size as <script type="math/tex">x</script> based on <script type="math/tex">h=0.5</script>.)</li><li>Find the probability of the test statistic <script type="math/tex">t</script> returning a result that is as extreme or more extreme than <script type="math/tex">t(x)</script>. We might do this using fancy maths that gives us cumulative distribution functions based on the model from the previous step, or by having our code generate many synthetic datasets <script type="math/tex">X^*</script>, calculate <script type="math/tex">t(X^*)</script> for each of them, and seeing how <script type="math/tex">t(x)</script> compares – what percentile of extremeness is it in? 
The answer is called the <script type="math/tex">p</script>-value.</li> </ol><p>(What is "more extreme"? That depends on our null hypothesis. If both low and high values of <script type="math/tex">t(x)</script> are evidence against <script type="math/tex">H_0</script> – as in our example – then we use a two-tailed test; if <script type="math/tex">t(x)</script> is in the 90% percentile of the <script type="math/tex">t(X^*)</script> distribution, both <script type="math/tex">t(x)</script> in the top and bottom 10% are at least as extreme as the value we got, and <script type="math/tex">p=0.2</script>. If only low or high values are evidence against <script type="math/tex">H_0</script>, then we use a one-tailed test. Say only high values are evidence against <script type="math/tex">H_0</script> and <script type="math/tex">t(x)</script> is in the 90% percentile; then <script type="math/tex">p=0.1</script>.)</p><p>Here's some example code to calculate a <script type="math/tex">p</script>-value, using random simulation:</p><pre><code class="language-python" lang="python"># Import NumPy and graphing library:<br />import numpy as np<br />import matplotlib.pyplot as plt<br /><br /># Define our null hypothesis:<br />h0_h = 0.5 # the value of h under the null hypothesis<br /><br /># Define the data we've gotten:<br />throws = 50<br />heads = 20<br /># Generate an array for it:<br />data = np.concatenate((np.zeros(throws - heads), np.ones(heads)), axis = 0)<br /><br />def t(x): # test statistic function<br /> return np.mean(x)<br /> # ^ this is the MLE for the binomial distribution<br /><br />def synth_x(n, p):<br /> # Create a synthetic dataset of some size n, assuming some p<br /> return np.random.binomial(1, p, size=n)<br /><br /># Take a lot of samples from the distribution of t(X*)<br /># (where X* is a synthetic dataset):<br />t_sample = np.array([t(synth_x(throws, h0_h)) for _ in range(100000)])<br /><br /># Calculate the p-value, using a two-tailed test:<br />p1 = 
np.mean(t_sample >= t(data))<br />p2 = np.mean(t_sample <= t(data))<br /># (doubling the smaller tail can overshoot 1 when t(data) is near the middle, so cap it)<br />p = min(1.0, 2 * min(p1, p2))<br /><br /># Display p-value<br />print(f"p-value is {p}") # about 0.20 in this case<br /><br /># Plot a histogram:<br />plt.hist(t_sample, bins=50, range=[0,1])<br />plt.axvline(x=t(data), color='black') # draw a line to show where t(data) falls<br /></code> </pre><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-NjudyfjYfjM/YArArM_CD3I/AAAAAAAACVM/BC5sMDJx5YQ2pFIgR-CBQCL78sUx7iIhACLcBGAsYHQ/pval.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="508" data-original-width="800" height="406" src="https://lh3.googleusercontent.com/-NjudyfjYfjM/YArArM_CD3I/AAAAAAAACVM/BC5sMDJx5YQ2pFIgR-CBQCL78sUx7iIhACLcBGAsYHQ/w640-h406/pval.png" width="640" /></a></div><br /><p></p>The main tricky part in the code is the calculation of the <script type="math/tex">p</script>-value. A neat way to do it is the following: observe that a two-tailed <script type="math/tex">p</script>-value is either twice the fraction of (synthetic) datasets with a test statistic at or below <script type="math/tex">t(x)</script> (in the case that the observation ended up on the lower side of the distribution of synthetic datasets), or twice the fraction of (synthetic) datasets with a test statistic at or above it. <p>Now, what exactly is a <script type="math/tex">p</script>-value? It's tempting to think of the <script type="math/tex">p</script>-value as the probability that the null hypothesis is correct: that is, that <script type="math/tex">p=0.05</script> means there's only a 5% chance the null hypothesis is true. However, what a <script type="math/tex">p</script>-value actually tells you is this: assuming that your null hypothesis is true (and you can correctly model the distribution of data you'd get if it is), what is the probability of getting a result at least as extreme as your data? 
In maths: </p><div cid="n434" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n434" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-3" type="math/tex; mode=display">p\text{-value} \ne P(H_0 \text{ is correct}), (!!)</script></div></div><p>but instead</p><div cid="n436" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n436" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-4" type="math/tex; mode=display">p\text{-value} = P(t(X^*) \geq t(x)),</script></div></div><p>for a right-tailed test (flip the <script type="math/tex">\geq</script> for a left-tailed test), where <script type="math/tex">X^*</script> is assumed drawn from the distribution resulting from assuming the null hypothesis <script type="math/tex">H_0</script>, or</p><div cid="n438" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n438" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-5" type="math/tex; mode=display">P(|t'(X^*)| \geq |t'(x)|),</script></div></div><p>for a two-tailed test, where <script type="math/tex">t'</script> is the test statistic function, but shifted so that the median <script type="math/tex">H_0</script> value is 0, so that we can just take the absolute value to get an extremeness measure (for example, in the code above we'd subtract 0.5 from the current definition of <code>t(x)</code>, since this is the median for the null hypothesis that the probability of heads is 
one-half).</p><h2>Probability bounds</h2><p>Sometimes it's useful to be able to quickly estimate a bound on some probability or expectation. Here are some examples, with quick proofs.</p><h4>Markov's inequality</h4><p>For <script type="math/tex">a > 0</script>, if <script type="math/tex">X</script> takes only non-negative numerical values,</p><div cid="n444" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n444" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-212" type="math/tex; mode=display">P(X \geq a) \leq \frac{E(X)}{a}.</script> </div></div><p>Why?</p><p><b>Short proof</b>: Given <script type="math/tex">X \geq 0</script>, <script type="math/tex">X \geq 1_{X \geq a} \cdot a</script> (can be seen by considering cases <script type="math/tex">X < a</script>, <script type="math/tex">X=a</script>, and <script type="math/tex">X > a</script>), so, rearranging, <script type="math/tex">1_{X \geq a} \leq X/ a</script>. Taking the expectation on both sides we get <script type="math/tex">E(1_{X \geq a}) \leq E(X) / a</script>, and <script type="math/tex">E(1_{X \geq a}) = P(X \geq a)</script>. <script type="math/tex">\square</script></p><p><b>Intuitive proof</b>: let's say you want to draw a probability density function to maximise <script type="math/tex">P(X \geq a)</script>, given some value of the expectation <script type="math/tex">E(X)</script> (and given that <script type="math/tex">X</script> only takes non-negative values). Any probability density assigned to values greater than <script type="math/tex">a</script> is more expensive in terms of expectation increase than assigning it exactly at <script type="math/tex">a</script>, and has an identical effect on <script type="math/tex">P(X \geq a)</script>. 
So to maximise <script type="math/tex">P(X \geq a)</script>, assign as much probability density as you can to <script type="math/tex">a</script>, and none to values greater than <script type="math/tex">a</script>. Given the restriction that <script type="math/tex">X</script> can only take positive values, the lowest value you can assign any probability to (to balance out the expectation if <script type="math/tex">a > E(X)</script>) is 0. If we allocate <script type="math/tex">p_1</script> to <script type="math/tex">X=0</script> and <script type="math/tex">p_2</script> to <script type="math/tex">X=a</script>, then to match the expectation <script type="math/tex">E(X)</script> we must have</p><div cid="n447" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n447" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-7" type="math/tex; mode=display">p_1 \cdot 0 + p_2 \cdot a = E(X),</script></div></div><p>or</p><div cid="n449" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n449" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-8" type="math/tex; mode=display">p_2 = P(X\geq a) = \frac{E(X)}{a}</script></div></div><p>in the maximal scenario; any other pdf we draw must have <script type="math/tex">P(X \geq a)</script> smaller.</p><p>The above equation can also be interpreted as saying that the fraction of values greater than <script type="math/tex">k=a/E(X)</script> times the average in a dataset of positive values can be at most <script type="math/tex">1/k</script> (i.e. <script type="math/tex">E(X)/a</script>). 
For example, at most half of people can have at least twice the average income.</p><h4>Chebyshev's inequality</h4><p>(An extension of Markov's inequality.)</p><p>Let <script type="math/tex">X</script> be a random variable with variance <script type="math/tex">\sigma^2</script> and expected value <script type="math/tex">\mu</script>. Then, for any <script type="math/tex">x > 0</script>,</p><div cid="n455" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n455" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-9" type="math/tex; mode=display">P(|X-\mu| \geq x) \leq \frac{\sigma^2}{x^2},</script></div></div><p>since if <script type="math/tex">Y = (X-\mu)^2</script> then, by Markov's inequality,</p><div cid="n457" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n457" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-10" type="math/tex; mode=display">P(Y \geq x^2) \leq \frac{\mathbb{E}(Y)}{x^2} = \frac{\sigma^2}{x^2},</script></div></div><p>by the definition of variance as <script type="math/tex">\mathbb{E}((X - \mu)^2)</script>. Finally, taking the square root inside the probability expression, <script type="math/tex">P(Y \geq x^2)=P(|X-\mu| \geq x)</script>. 
<script type="math/tex">\square</script></p><h4>Jensen's inequality</h4><p>Consider a concave function <script type="math/tex">f</script> and the values <script type="math/tex">E(f(X))</script> and <script type="math/tex">f(E(X))</script>, where <script type="math/tex">X</script> is (once again) a random variable.</p><p>Since <script type="math/tex">f</script> is concave, if we plot <script type="math/tex">y=f(x)</script> and the tangent line to <script type="math/tex">f</script> at some <script type="math/tex">x_0</script>, the tangent is an upper bound on <script type="math/tex">f(x)</script> for all <script type="math/tex">x</script>.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-hnijlFUfZQk/YArAtTu4jTI/AAAAAAAACVQ/UdB1Y570UFcJNEUmO8cvPvDhENQaXwP3ACLcBGAsYHQ/jensen.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="834" data-original-width="1000" height="333" src="https://lh3.googleusercontent.com/-hnijlFUfZQk/YArAtTu4jTI/AAAAAAAACVQ/UdB1Y570UFcJNEUmO8cvPvDhENQaXwP3ACLcBGAsYHQ/w400-h333/jensen.png" width="400" /></a></div><br /><p></p><p>Let <script type="math/tex">E(X) = \mu</script>, and let the tangent line to <script type="math/tex">y=f(x)</script> at <script type="math/tex">x=\mu</script> be <script type="math/tex">y=mx+b</script>. We have that <script type="math/tex">f(x) \leq mx+b</script> for all <script type="math/tex">x</script>, and so <script type="math/tex">f(X) \leq mX+b</script> whatever value <script type="math/tex">X</script> takes. Taking the expectation on both sides,</p><div cid="n464" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n464" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-11" type="math/tex; mode=display">E(f(X)) \leq m \mu + b.</script></div></div><p>What is <script type="math/tex">m\mu +b</script>? 
It's the value of the tangent when it touches <script type="math/tex">f(x)</script> at <script type="math/tex">x=\mu</script>, and therefore it is also the value of <script type="math/tex">f</script> at <script type="math/tex">\mu</script>. Thus we can say</p><div cid="n466" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n466" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-12" type="math/tex; mode=display">E(f(X)) \leq f(E(X)). \square</script></div></div><p> </p><h2>Probability systems</h2><h3>Causal diagrams</h3><p>The <a href="https://en.wikipedia.org/wiki/Perseverance_(rover)"><i>Perseverance</i></a> rover is due to land on Mars on February 18th, 2021, carrying a small helicopter called <a href="https://en.wikipedia.org/wiki/Mars_Helicopter_Ingenuity"><i>Ingenuity</i></a>, which will likely become the first aircraft to make a powered flight on a planet that's not Earth.</p><p>Imagine that <i>Perseverance</i> is currently known to be in a position <script type="math/tex">X</script> (where <script type="math/tex">X</script> is some random variable, as is any capital letter). <i>Ingenuity</i> has completed its first flight, starting from the location of <i>Perseverance</i> (which we know to a high degree of accuracy), but because of a Martian sandstorm we only have inaccurate readings of <i>Ingenuity</i>'s current location and need to locate it quickly to know if it's in a place where it's going to run out of power due to dust blocking its solar panels unless we do a risky manoeuvre with its propellers. 
Specifically, we have two in-flight readouts of its position, <script type="math/tex">R_1</script> and <script type="math/tex">R_2</script>, which are known to be its true positions <script type="math/tex">Y_1</script> and <script type="math/tex">Y_2</script> at those times plus some random error modelled as a <script type="math/tex">\text{Normal}(0,\sigma_1^2)</script> distribution, and similarly we have a more accurate readout <script type="math/tex">R_f</script> of its final position <script type="math/tex">Y_f</script>, this time with the error following <script type="math/tex">\text{Normal}(0, \sigma_2^2)</script>. We also model <script type="math/tex">Y_1</script> as being generated from <script type="math/tex">X</script> with a parameter <script type="math/tex">h_1</script> representing its starting heading and velocity (e.g. <script type="math/tex">h_1</script> is a vector and the model could be <script type="math/tex">Y_1 = X + h_1 + \epsilon</script>, where <script type="math/tex">\epsilon</script> is another normally distributed error term), and likewise we have parameters <script type="math/tex">h_2</script> and <script type="math/tex">h_f</script> that influence how <script type="math/tex">Y_2</script> and <script type="math/tex">Y_f</script> are generated from the preceding positions. We know that its initial battery level was <script type="math/tex">b_0</script>, and the battery level when it was at each of <script type="math/tex">Y_1</script>, <script type="math/tex">Y_2</script>, and <script type="math/tex">Y_f</script> is <script type="math/tex">B_1</script>, <script type="math/tex">B_2</script>, and <script type="math/tex">B_f</script>, where each of those is generated from the previous and the heading/velocity parameters <script type="math/tex">h_1</script>, <script type="math/tex">h_2</script>, and <script type="math/tex">h_f</script> (e.g. 
<script type="math/tex">B_2 = B_1 - (1 + \epsilon) |h_1|</script> – the amount of power lost is a normal error term plus a constant times the velocity). We need to find the probability that the next battery level <script type="math/tex">B_n</script>, a random variable generated from <script type="math/tex">B_f</script> (the previous level) and depending on <script type="math/tex">Y_f</script> (since storm intensity varies with position; say we have a function <script type="math/tex">s</script> that takes in positions and returns how much the dust will decrease power output and hence battery level at a particular position, then we might have <script type="math/tex">B_n = B_f - s(Y_f)</script>), is below a critical threshold <script type="math/tex">c</script>, given the starting <script type="math/tex">X</script>, and the position readings <script type="math/tex">R_1</script>, <script type="math/tex">R_2</script>, and <script type="math/tex">R_f</script>. Also the administrator of NASA is breathing down your neck because this is a 2 billion dollar mission, so better work fast and not make mistakes.</p><p>This problem seems almost intractably complicated. A handy way of making complex probability questions less unapproachable is to draw out a causal diagram: what are the key parameters, and which random variables are generated from which other ones? 
Here's an example for the above problem:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-S4_pBsH_GnA/YArAwGsYO4I/AAAAAAAACVU/fffClocsKQM77RJleFmwdd7UQuacYaqxgCLcBGAsYHQ/causaldiagram.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="874" data-original-width="1280" height="438" src="https://lh3.googleusercontent.com/-S4_pBsH_GnA/YArAwGsYO4I/AAAAAAAACVU/fffClocsKQM77RJleFmwdd7UQuacYaqxgCLcBGAsYHQ/w640-h438/causaldiagram.png" width="640" /></a></div><br /><br /><p></p><p>Arrows indicate random variables being generated from others; dotted lines note important parameters (note that some parameters are missing – those of <script type="math/tex">X</script>, for example). The probability we were asked about is <script type="math/tex">P(B_n < c | X = x, R_1 = r_1, R_2 = r_2, R_f = r_f)</script>; it doesn't look so complicated when you have the causal relations visualised in front of you.</p><p>The rest of the solution is left as an exercise for the reader. 
Please be in touch with NASA in late February to get the values <script type="math/tex">x</script>, <script type="math/tex">r_1</script>, <script type="math/tex">r_2</script>, and <script type="math/tex">r_f</script>.</p><h3>Markov chains</h3><p>A Markov chain has the following causal diagram:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-HjDYzntkpaE/YArAxfn2HMI/AAAAAAAACVc/Md-EoAChqWsT5VHijY88E2xyceBy_mlkwCLcBGAsYHQ/markov.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="210" data-original-width="1000" height="84" src="https://lh3.googleusercontent.com/-HjDYzntkpaE/YArAxfn2HMI/AAAAAAAACVc/Md-EoAChqWsT5VHijY88E2xyceBy_mlkwCLcBGAsYHQ/w400-h84/markov.png" width="400" /></a></div><br /><p></p>In words: the <script type="math/tex">n</script>th state of a Markov chain is generated from the <script type="math/tex">(n-1)</script>th state. <p>This might seem very restrictive. For example, the simplest text-generation Markov chain would just generate, say, one character based on the previous one, probably based on data for how often a letter follows another. It might tend to do some moderately reasonable things, like following "t" by "h" fairly often (assuming it was trained on English), but good luck getting anything too sensible out of it.</p><p>However, we can do a trick: generate letter <script type="math/tex">n</script> from the previous <script type="math/tex">k</script> letters. This seems like it's not a Markov chain; letter <script type="math/tex">X_n</script> depends on <script type="math/tex">X_{n-k}</script> through <script type="math/tex">X_{n-1}</script>. 
But we can define <script type="math/tex">Y_0=(X_0, X_1, ..., X_{k-1})</script>, <script type="math/tex">Y_1 = (X_1, X_2, ..., X_k)</script>, and so on, and now <script type="math/tex">Y_n</script> can be generated entirely from <script type="math/tex">Y_{n-1}</script>, and so the <script type="math/tex">Y</script>s form a Markov chain.</p><p>So on one hand, we can do these sorts of tricks to use Markov chains even when it seems like the problem is too complex for them. But perhaps even more importantly, if you reduce something to a Markov chain, you can immediately apply a lot of nice mathematical results.</p><p>A Markov chain can be visualised with a state diagram. Here's one for a Markov chain representing traffic light transitions:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-I2pMAfptnW0/YArA0CBA-wI/AAAAAAAACVg/3BFSDkq4w0EkvZWm5KWYH2kyM4kSIvWbwCLcBGAsYHQ/trafficlights1.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1098" data-original-width="1000" height="400" src="https://lh3.googleusercontent.com/-I2pMAfptnW0/YArA0CBA-wI/AAAAAAAACVg/3BFSDkq4w0EkvZWm5KWYH2kyM4kSIvWbwCLcBGAsYHQ/w365-h400/trafficlights1.png" width="365" /></a></div><br /><p></p><p>The same information can be described with a transition matrix, showing the probability of each transition happening:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-wjCwsRjL2rY/YArA1n3ionI/AAAAAAAACVk/Sf1PBxgqBJAJ_EGnGqxB4rf2XBbdrp8rgCLcBGAsYHQ/trafficmatrix1.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="830" data-original-width="960" height="345" src="https://lh3.googleusercontent.com/-wjCwsRjL2rY/YArA1n3ionI/AAAAAAAACVk/Sf1PBxgqBJAJ_EGnGqxB4rf2XBbdrp8rgCLcBGAsYHQ/w400-h345/trafficmatrix1.png" width="400" /></a></div><br /><p></p><p>Note that this is a very boring Markov chain, because it's not probabilistic – every 
link has a probability mass of 1. This is not very interesting. Thankfully, our traffic light engineer is willing to add some randomness for the sake of making the system more mathematically interesting. For example, they might change the system to look like this (showing both the state diagram and transition matrix):</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-i1AKagcYfcI/YArA3XFfxyI/AAAAAAAACVo/e3ypWZKEntgt1VMi_geYYtdz-7KidI4qQCLcBGAsYHQ/traffic2.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="690" data-original-width="1278" height="346" src="https://lh3.googleusercontent.com/-i1AKagcYfcI/YArA3XFfxyI/AAAAAAAACVo/e3ypWZKEntgt1VMi_geYYtdz-7KidI4qQCLcBGAsYHQ/w640-h346/traffic2.png" width="640" /></a></div><p></p><p>Now there's a 10% chance that the yellow light before red is skipped, and a 40% chance that red-yellow moves back to red instead of going green.</p><p>The key property with Markov chain calculations is memorylessness: <script type="math/tex">X_n</script> depends only on <script type="math/tex">X_{n-1}</script>. If you can use this property, you can work out a lot of Markov chain problems. For example, let's say that <script type="math/tex">X_0 = \text{R}</script> (we'll use <script type="math/tex">\text{R, RY, G, Y}</script> to denote the states), and we want to find the probability that you'll actually get to drive in two state transitions from now – that is, <script type="math/tex">\mathbb{P}(X_2 = \text{G} \, | \, X_0 = \text{R})</script> (I use <script type="math/tex">\mathbb{P}</script> here to differentiate a probability expression from the transition matrix <script type="math/tex">P</script>). 
Doing some straightforward algebra, you can figure out that this probability is <script type="math/tex">P_{\text{R},\text{RY}} \cdot P_{\text{RY},\text{G}}</script> (where <script type="math/tex">P_{a,b}</script> is the spot in the matrix with row label (i.e. start state) <script type="math/tex">a</script> and column label (i.e. end state) <script type="math/tex">b</script>).</p><p>(Note that each row of the transition matrix is a probability distribution for the next state, starting from the state the row is labelled with. Writing it as a matrix is a trick for expressing the probability distribution from each state in the same mathematical object.)</p><p>More generally: for any transition matrix, <script type="math/tex">P_{a,b}</script> is <script type="math/tex">\mathbb{P}(X_n = b \, | X_{n-1} = a)</script>. Now consider entry <script type="math/tex">(a,b)</script> of <script type="math/tex">P^2</script>: by matrix multiplication, it is</p><div cid="n504" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n504" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-13" type="math/tex; mode=display">\sum_i P_{a,i}P_{i,b},</script></div></div><p> but by the definition of the transition matrix, this is the same as</p><div cid="n509" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n509" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-14" type="math/tex; mode=display">\sum_i \mathbb{P}(X_{1} = i \,|\, X_{0} = a) \mathbb{P}(X_{2} = b \,|\, X_{1} = i),</script></div></div><p>which is just summing up the probabilities of all paths through the state space that start at <script 
type="math/tex">a</script>, go to some <script type="math/tex">i</script>, and then end up at <script type="math/tex">b</script>; in other words, it is the probability that if you're at <script type="math/tex">a</script>, you end up at <script type="math/tex">b</script> after two state transitions.</p><p>You should be able to see that this extends more generally:</p><div cid="n516" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n516" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-15" type="math/tex; mode=display">\mathbb{P}(X_n = b \,|\,X_0 = a) = P^n_{a,b}.</script></div></div><p>Linear algebra comes to the rescue yet again; we've reduced the problem of finding the probability of going between any two states in a Markov chain's state space in <script type="math/tex">n</script> steps into the problem of raising a matrix to the <script type="math/tex">n</script>th power and looking up one entry in it.</p><h4>Finding the stationary distribution</h4><p>Given a starting state in a Markov chain, we can't say for sure what state it will be in after <script type="math/tex">n</script> transitions (unless it's entirely deterministic, like our initial boring traffic light model), but we can calculate exactly what the probability distribution over the states will be. 
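</p><p>As a rough illustration of the matrix-power result, here is a minimal NumPy sketch. It assumes the randomised traffic-light chain described earlier (R to RY always, RY back to R 40% of the time, G straight to R 10% of the time, Y to R always); the state ordering and the helper names <code>n_step_matrix</code> and <code>distribution_after</code> are made up for this example.</p>

```python
import numpy as np

# States in a fixed order. The probabilities below are an assumption: they are
# read off the description of the randomised traffic lights in the text.
STATES = ["R", "RY", "G", "Y"]
P = np.array([
    [0.0, 1.0, 0.0, 0.0],  # R  -> RY always
    [0.4, 0.0, 0.6, 0.0],  # RY -> R (40%) or G (60%)
    [0.1, 0.0, 0.0, 0.9],  # G  -> R (10%, yellow skipped) or Y (90%)
    [1.0, 0.0, 0.0, 0.0],  # Y  -> R always
])


def n_step_matrix(P, n):
    """Return P^n, whose (a, b) entry is the probability of a -> b in n steps."""
    return np.linalg.matrix_power(P, n)


def distribution_after(P, start, n):
    """Distribution over states after n transitions from a known start state."""
    return n_step_matrix(P, n)[STATES.index(start)]


# Probability of being at green two transitions after red.
two_step = n_step_matrix(P, 2)
p_green = two_step[STATES.index("R"), STATES.index("G")]

# Full distribution over the four states ten transitions after red.
dist_10 = distribution_after(P, "R", 10)
```

<p>Each row of <script type="math/tex">P^n</script> is itself a probability distribution: row <script type="math/tex">a</script> gives the distribution over the states after <script type="math/tex">n</script> transitions from <script type="math/tex">a</script>.</p><p>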
This is usually denoted as a vector <script type="math/tex">\pi</script>, with <script type="math/tex">\pi_a</script> being the probability we're in state <script type="math/tex">a</script>.</p><p>Here's something we might want to know: what is the stationary distribution; that is, how can we allocate probability mass amongst the different states in such a way that the total amount of probability mass in each state remains constant after a state transition?</p><p>Here's something you might ask: why is it interesting to know this? Perhaps most importantly, the stationary distribution of a Markov chain is the long-run average of time spent in each state (exercise: prove that this is the case); if you want to know how much time our probabilistic traffic lights will spend being green over a long period of time, you need to find the stationary distribution.</p><p>Now given our distribution <script type="math/tex">\pi</script> (note: it's a row vector, not a column vector) and transition matrix <script type="math/tex">P</script>, we can express the stationary distribution as the <script type="math/tex">\pi</script> that satisfies two conditions. First,</p><div cid="n539" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n539" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-16" type="math/tex; mode=display">\pi = \pi P.</script></div></div><p>This is the condition that <script type="math/tex">\pi</script> must remain unchanged when transformed by our transition matrix <script type="math/tex">P</script> during a state transition. You might have expected the transformation to be written <script type="math/tex">P \pi</script>; usually we'd express a matrix transforming a vector in this order. 
However, because of the way we've defined <script type="math/tex">P</script> – start states on the vertical axis, end states on the horizontal – we need to do it this way. Here's a visualisation, with the result vector in red:</p><p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-B-R8mo5PwZU/YArA7o-HT0I/AAAAAAAACVw/wG7W-En9uys9G1JnvYJh4qhtRwo2xHMkwCLcBGAsYHQ/mmult.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="608" data-original-width="958" height="254" src="https://lh3.googleusercontent.com/-B-R8mo5PwZU/YArA7o-HT0I/AAAAAAAACVw/wG7W-En9uys9G1JnvYJh4qhtRwo2xHMkwCLcBGAsYHQ/w400-h254/mmult.png" width="400" /></a></div></div><p></p><p>(Alternatively, we could take <script type="math/tex">\pi</script> as a column vector, flip the meanings of the rows and columns in <script type="math/tex">P</script>, and write <script type="math/tex">P\pi</script> – equivalent to transposing both of the current definitions of <script type="math/tex">\pi</script> and <script type="math/tex">P</script>.)</p><p>The second condition (can you see why it's necessary?), where <script type="math/tex">\pmb{1}</script> is a vector <script type="math/tex">(1,1,...,1,1)</script> of the required length, is</p><div cid="n556" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n556" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-17" type="math/tex; mode=display">\pi \cdot \pmb{1} = 1.</script></div></div><p>We can also write this as matrix multiplication, as long as we're clear about column and row vectors and transposing things as required. 
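</p><p>To make the two conditions concrete, here is a hedged NumPy sketch that stacks them into one linear system and solves it. The transition matrix is an assumption (the randomised traffic lights as described in the text, with states ordered R, RY, G, Y), and <code>stationary_distribution</code> is a hypothetical helper name, not something from a library.</p>

```python
import numpy as np

# Assumed transition matrix for the randomised traffic lights described in
# the text, with rows (start states) and columns (end states) ordered R, RY, G, Y.
P = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.4, 0.0, 0.6, 0.0],
    [0.1, 0.0, 0.0, 0.9],
    [1.0, 0.0, 0.0, 0.0],
])


def stationary_distribution(P):
    """Solve pi = pi P together with pi . 1 = 1 as a single linear system."""
    n = P.shape[0]
    # pi (P - I) = 0, transposed so that pi is a column of unknowns, plus one
    # extra row of ones encoding the normalisation constraint.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    # A has one more row than it has columns, so use least squares. When the
    # chain has a unique stationary distribution the system is consistent and
    # the least-squares solution satisfies it up to floating-point error.
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


pi = stationary_distribution(P)
```

<p>Under these assumed probabilities, the entry of <code>pi</code> for G is the long-run fraction of time the lights spend green, and <code>pi @ P</code> should reproduce <code>pi</code>.</p><p>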
We can also be clever and write a single matrix that expresses both of these constraints, and then getting NumPy's linear algebra libraries to give us the answer becomes a single line of code.</p><p>(The second constraint is just the condition that any probability distribution sums to 1.) </p><h5>Uniqueness of the stationary distribution</h5><p>Now for another question: when does a unique stationary distribution exist? You should be able to think of a state diagram for which there are an infinite number of stationary distributions.</p><p>For example:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-JSL7Ybk7Xoo/YArA91YfljI/AAAAAAAACV0/XZEBmDiuS5snRKG1QccdYPa7wSvi1d1gwCLcBGAsYHQ/stationary.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="784" data-original-width="1280" height="392" src="https://lh3.googleusercontent.com/-JSL7Ybk7Xoo/YArA91YfljI/AAAAAAAACV0/XZEBmDiuS5snRKG1QccdYPa7wSvi1d1gwCLcBGAsYHQ/w640-h392/stationary.png" width="640" /></a></div><p></p><p>The states <script type="math/tex">C</script>, <script type="math/tex">B</script>, and <script type="math/tex">D</script> (in the dotted red circle) and <script type="math/tex">E</script>, <script type="math/tex">F</script>, <script type="math/tex">G</script>, and <script type="math/tex">H</script> (in the dotted blue circle) are "independent", in the sense that you can never get from one set of states to the other. Imagine that for the state set <script type="math/tex">\{C, B, D\}</script>, we have a stationary distribution over only those states <script type="math/tex">\pmb{\pi}</script>, and another stationary distribution <script type="math/tex">\pmb{\rho}</script> over <script type="math/tex">\{E,F,G,H\}</script>. 
(Let each of these vectors have a slot for every state, but let it be zero for states outside the corresponding state set – <script type="math/tex">\pmb{\pi} = (0, \pi_b, \pi_c, \pi_d, 0, 0, 0, 0)</script>, for example.) Now, because there can be no probability mass flow between these two sets, we can see that any distribution <script type="math/tex">\pmb{\sigma} = a \pmb{\pi} + b \pmb{\rho}</script> is also a stationary distribution, provided that <script type="math/tex">a</script> and <script type="math/tex">b</script> are non-negative and chosen such that <script type="math/tex">\pmb{\sigma} \cdot \pmb{1} = 1</script> (probability distributions sum to one!).</p><p>It turns out that for any finite state set where each state is theoretically reachable from all the others – i.e., if we represent the state diagram as a directed graph, the graph is strongly connected – there does exist a unique stationary distribution.</p><h5>Detailed balance</h5><p>Sometimes it doesn't take matrix calculations to find a stationary distribution. In the general case, the condition is that the probability mass flow into a state, from all other states, must equal the outflow to all other states. The simplest case in which this can happen is when, for any pair of states <script type="math/tex">a</script> and <script type="math/tex">b</script>, <script type="math/tex">a</script> sends as much probability mass to <script type="math/tex">b</script> upon a state transition as <script type="math/tex">b</script> sends to <script type="math/tex">a</script>. If we can ensure that this is true "locally" for each pair of states, then we don't have to do complex "global" optimisation over all states.</p><p>This condition is known as detailed balance. 
Mathematically, letting <script type="math/tex">\pi</script> be a distribution of probability mass over states and <script type="math/tex">P</script> be the transition matrix, we can express it as</p><div cid="n607" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n607" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-902" type="math/tex; mode=display">\pi_a P_{ab} = \pi_b P_{ba}, \text{ for all states } a \text{ and } b,</script> </div></div><p>something that should be clear if you remember the interpretation of the transition matrix element <script type="math/tex">P_{ab}</script> as the probability of an <script type="math/tex">a \rightarrow b</script> transition.</p><p>A final fun question: say we have an undirected graph and we consider a random walk over it (i.e., if we're at a given vertex, we take any edge going from it with equal probability). What is the stationary distribution over the states (i.e. 
the vertices of the graph)?</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-58263582811746502072020-12-31T14:43:00.019+00:002021-02-19T22:08:20.635+00:00Data science 1<center><p><span style="font-size: x-small;"><i>8.3k words, including equations (about 40 minutes)</i></span></p></center><p>This is an overview of fundamental ideas in data science, mostly based on <a href="https://www.cl.cam.ac.uk/teaching/2021/DataSci/materials.html">Damon Wischik's excellent data science course at Cambridge</a> (if using these notes for revision for that course, be aware that I don't cover all examinable things and cover some things that aren't examinable; the criterion for inclusion is interestingness, not examinability).</p><p>The basic question is this: we're given data; what can we say about the world based on it?</p><p>These notes are split into two parts due to length. In part 1:</p><ul><li><p>Notation</p></li><li><p>A few results in probability, including a look at Bayes' theorem leading up to an understanding of the continuous form.</p></li><li><p>Model-fitting</p><ul><li>Maximum likelihood estimation</li><li>Supervised & unsupervised learning</li><li>Linear models (fitting them and interpreting them)</li><li>Empirical distributions (with a note on KL divergence)</li> </ul></li> </ul><p>In <a href="http://strataoftheworld.blogspot.com/2021/01/data-science-2.html">part 2</a>:</p><ul><li>Monte Carlo methods</li><li>A few theorems that let you bound probabilities or expectations.</li><li>Bayesianism & frequentism</li><li>Probability systems (specifically basic results about Markov chains).</li> </ul><p> </p><h2>Probability basics</h2><p>The kind of background you want to have to understand this material:</p><ul><li><p>The basic maths of probability: reasoning about sample spaces, probabilities summing to one, understanding and working with random variables, etc.</p></li><li><p>The ideas of expected value and variance.</p></li><li><p>Some idea of 
the most common probability distributions:</p><ul><li>normal/Gaussian,</li><li>binomial,</li><li>Poisson,</li><li>geometric,</li><li>etc.</li> </ul></li><li><p>What continuous and discrete distributions are.</p></li><li><p>Understanding probability density/mass functions, and cumulative distribution functions.</p></li> </ul><h3>Notation</h3><p>First, a few minor points:</p><ul><li><p>It's easy to interpret <script type="math/tex">Y = f(X)</script>, where <script type="math/tex">X</script> and <script type="math/tex">Y</script> are random variables, to mean "generate a value of <script type="math/tex">X</script>, then apply <script type="math/tex">f</script> to it, and this is <script type="math/tex">Y</script>". But <script type="math/tex">Y=f(X)</script> is maths, not code; we're stating something is true, not saying how the values are generated. If <script type="math/tex">f</script> is an invertible function, then <script type="math/tex">Y=f(X)</script> and <script type="math/tex">X=f^{-1}(Y)</script> are both equally good and equally true mathematical statements, and neither of them tells you what causes what.</p></li><li><p>Indicator functions are a useful trick when bounds are unknown; for example, write <script type="math/tex">1_{x \geq y}</script> (or <script type="math/tex">1[x\geq y]</script>) to denote 1 if <script type="math/tex">x \geq y</script> and 0 in all other cases.</p><ul><li>They also let you express logical AND as multiplication: <script type="math/tex">1_{f(x)} \cdot 1_{g(x)}</script> , where <script type="math/tex">f</script> and <script type="math/tex">g</script> are boolean functions, is the same as <script type="math/tex">1_{f(x) \wedge g(x)}</script>.</li> </ul></li> </ul><h4>Likelihood notation</h4><p>Discrete and continuous random variables are fundamentally different. 
In the discrete case, you deal with probability mass functions where there's a probability attached to each event; in the continuous case, you only get a probability density function that doesn't mean anything real and needs to be integrated to give you a probability. Many results apply to both discrete and continuous random variables though, and we might switch between continuous and discrete models in the same problem, so it's cumbersome to have to deal with the separate notation and semantics of them.</p><p>Enter likelihood notation: write <script type="math/tex">\Pr_X(x)</script> to mean <script type="math/tex">P(X=x)</script> if the distribution is discrete and <script type="math/tex">f(x)</script> if the distribution of <script type="math/tex">X</script> is continuous with probability density function <script type="math/tex">f</script>.</p><h4>Python & NumPy</h4><p>Python is a good choice for writing code, for various reasons:</p><ul><li>easy to read;</li><li>found almost everywhere;</li><li>easy to install if it isn't already installed;</li><li>not Java;</li> </ul><p>but particularly because it has excellent science/maths libraries:</p><ul><li>NumPy for vectorised calculations, maths, and stats;</li><li>SciPy for, uh, science;</li><li>Matplotlib for graphing;</li><li>Pandas for data.</li> </ul><p>NumPy is a must-have.</p><p>To use it, the big thing to understand is the idea of vectorised calculations. Otherwise, you'll see code like this:</p><pre><code class="language-python" lang="python">import numpy<br />xs = numpy.array([1, 2, 3])<br />ys = xs ** 2 + xs  # applied to each element, giving [2, 6, 12]<br /></code></pre><p>and wonder how we're adding and squaring arrays (we're not; the operations are implicitly applied to each element separately – and all of this runs in C so it's much faster than doing it natively in Python).</p><h3>Computation vs maths</h3><p>Today we have computers. 
Statistics was invented before computers, though, and this affected the field; work was directed to all the areas and problems where progress could be made without much computation. The result is an excellent theoretical mathematical underpinning, but modern statistics can benefit a lot from a computational approach – running simulations to get estimates and so on. For the simple problems there's an (imprecise) computational method and a (precise) mathematical method; for complex problems you either spend all day doing integrals (provided they're solvable at all) or switch to a computer.</p><p>In this post, I will focus on the maths, because the maths concepts are more interesting than the intricacies of NumPy, and because if you understand them (and programming, especially in a vectorised style), the programming bit isn't hard.</p><p> </p><p> </p><h3>Some probability results</h3><h4>The law of total probability</h4><p>Here's something intuitive: if we have a sample space (e.g. outcomes of a die roll) and we partition it into non-overlapping events <script type="math/tex">E_1</script> to <script type="math/tex">E_N</script> that cover every possible outcome (e.g. showing the numbers 1, 2, ..., 6, and losing the dice under the carpet), and we have some other event <script type="math/tex">A</script> (e.g. 
a player gets mad), then</p><div cid="n97" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n97" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-2" type="math/tex; mode=display">P(A) = \sum_{n=1}^{N} P(A | E_n)P(E_n);</script></div></div><p>if we know the probability of <script type="math/tex">A</script> given each event <script type="math/tex">E_n</script>, we can find the total probability of <script type="math/tex">A</script> by summing up the probabilities of each <script type="math/tex">E_n</script>, weighted by the conditional probability that <script type="math/tex">A</script> also happens. Visually, where the height of the red bars represents each <script type="math/tex">P(A|E_n)</script>, and the area of each segment represents the different <script type="math/tex">P(E_n)</script>s, we see that the total red area corresponds to the sum above:</p><p> <br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-I4KzIEDAk9s/X-3g3GmeiaI/AAAAAAAACII/eGAcc3GBZ-cTkeJkFqrF3d7UQ9F-ti0nACLcBGAsYHQ/s1280/ltp.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="506" data-original-width="1280" height="252" src="https://1.bp.blogspot.com/-I4KzIEDAk9s/X-3g3GmeiaI/AAAAAAAACII/eGAcc3GBZ-cTkeJkFqrF3d7UQ9F-ti0nACLcBGAsYHQ/w640-h252/ltp.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>You say this diagram is "messy and unprofessional"; I say it has an "informal aesthetic".</i><br /></td></tr></tbody></table><br /><p>This is called the law of total probability; a fancy name to pull out when you want to use this idea.</p><h4>The law of the 
unconscious statistician</h4><p>Another useful law doesn't even sound like a law at first, which is why it's called the law of the unconscious statistician.</p><p>Remember that the expected value, in case of a discrete distribution for the random variable <script type="math/tex">X</script>, is</p><div cid="n104" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n104" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-3" type="math/tex; mode=display">E(X)=\sum_i x_iP(X=x_i).</script></div></div><p>Now say we're not interested in the value of <script type="math/tex">X</script> itself, but rather some function <script type="math/tex">f</script> of it. What is the expected value of <script type="math/tex">f(X)</script>? Well, the values <script type="math/tex">x_i</script> are the possible values of <script type="math/tex">X</script>, so let's just replace the <script type="math/tex">x_i</script> above with <script type="math/tex">f(x_i)</script>:</p><div cid="n106" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n106" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-4" type="math/tex; mode=display">E(f(X)) = \sum_i f(x_i) P(X=x_i)</script></div></div><p>... and we're done – but for the wrong reasons. This result is actually more subtle than this; to prove it, consider a random variable <script type="math/tex">Y</script> for which <script type="math/tex">Y=f(X)</script>. 
By the definition of expected value,</p><div cid="n108" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n108" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-5" type="math/tex; mode=display">E(Y)=\sum_i y_i P(Y=y_i).</script></div></div><p>Uh oh – suddenly the connection between the obvious result and what expected value is doesn't seem so obvious. The problem is that the mapping between the <script type="math/tex">y_i</script> and <script type="math/tex">x_i</script> could be anything – many <script type="math/tex">x_i</script>, thrown into the black box <script type="math/tex">f</script>, might produce the same <script type="math/tex">y_i</script> – and we have to untangle this while keeping track of all the corresponding probabilities. </p><p>For a start, we might notice that <script type="math/tex">Y</script> takes the value <script type="math/tex">y_i</script> exactly when <script type="math/tex">X</script> takes one of the values <script type="math/tex">x_j</script> for which <script type="math/tex">f(x_j)=y_i</script>, so <script type="math/tex">P(Y=y_i)</script> is the sum of the probabilities of those values <script type="math/tex">x_j</script> of <script type="math/tex">X</script>. So we might write</p><div cid="n111" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n111" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-6" type="math/tex; mode=display">E(Y)=\sum_i \Big( y_i \sum_{j \,|\, f(x_j)=y_i} P(X=x_j) \Big),</script></div></div><p>to sum over each possible value of <script type="math/tex">f(X)</script>, and then within that, also loop over the possible values of <script type="math/tex">X</script> that might have generated that <script type="math/tex">f(X)</script>. 
We've managed to switch a term involving the probability that <script type="math/tex">Y</script> takes some value to one about <script type="math/tex">X</script> taking a specific value – progress!</p><p>Next, we realise that <script type="math/tex">y_i</script> is the same for everything in the inner sum: <script type="math/tex">f(x_j) = y_i</script> for every <script type="math/tex">x_j</script> it ranges over. So we don't change anything if we write</p><div cid="n114" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n114" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-7" type="math/tex; mode=display">E(Y)=\sum_i \Big( \sum_{j \,|\, f(x_j)=y_i} f(x_j) P(X=x_j) \Big)</script></div></div><p>instead. Now we just have to see that the above is equivalent to iterating once over all the <script type="math/tex">j</script>s.</p><p>A diagram:</p><a href="https://1.bp.blogspot.com/-nCfUHjudl2o/X-3gwrn3kuI/AAAAAAAACIE/na11fzX5DvExTnBkTKr1MSBHwqdU8rlbgCLcBGAsYHQ/s1280/lotus.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="770" data-original-width="1280" height="384" src="https://1.bp.blogspot.com/-nCfUHjudl2o/X-3gwrn3kuI/AAAAAAAACIE/na11fzX5DvExTnBkTKr1MSBHwqdU8rlbgCLcBGAsYHQ/w640-h384/lotus.png" width="640" /></a><p>The yellow area is the expected value of <script type="math/tex">f(X) = Y</script>. By the definition of expected value, we can sum up the areas of the yellow rectangles to get <script type="math/tex">E(f(X))</script>. 
What we've now done is "reduced" this to a process like this: pick <script type="math/tex">y_1</script>, look at the <script type="math/tex">x_i</script> that map to it under <script type="math/tex">f</script> (<script type="math/tex">x_1</script> and <script type="math/tex">x_2</script> in this case), find their probabilities, and multiply them by <script type="math/tex">f(x_1)=f(x_2)=y_1</script>. So we add up the rectangles in the slots marked by the dotted lines, and we do it with this weird double-iteration of looking first at <script type="math/tex">y_i</script>s and then at <script type="math/tex">x_i</script>s.</p><p>But once we've put it this way, it's simple to see we get the same result if we iterate over the <script type="math/tex">x_i</script>s, get the corresponding rectangle slice for each, and add it all up. This corresponds to the formula we had above (summing <script type="math/tex">f(x_i) P(X=x_i)</script> over all possible <script type="math/tex">i</script>).</p><h4>Bayes' theorem (odds ratio and continuous form)</h4><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-96QLqGNFQHM/X-3g8n4CgvI/AAAAAAAACIM/eYMaWIjkoQs9US3zrgBj8mVlTMPN9DIEQCLcBGAsYHQ/s1280/bayes.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1280" height="450" src="https://1.bp.blogspot.com/-96QLqGNFQHM/X-3g8n4CgvI/AAAAAAAACIM/eYMaWIjkoQs9US3zrgBj8mVlTMPN9DIEQCLcBGAsYHQ/w640-h450/bayes.png" width="640" /></a></div><br />Above is a Venn diagram of a sample space (the box), with the probabilities of event <script type="math/tex">B</script> and event <script type="math/tex">R</script> marked by blue and red areas respectively (the hatched area represents that both happen). 
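<p>To make the diagram concrete, here is a minimal Python sketch using made-up areas (the values 0.3, 0.2 and 0.1 are assumptions for illustration, not measurements of the figure). It treats each area as a probability and asks what fraction of one region is covered by the overlap.</p>

```python
# Hypothetical areas, as fractions of the whole sample space (the box).
p_b = 0.3      # blue region, P(B)
p_r = 0.2      # red region, P(R)
p_both = 0.1   # hatched overlap, P(B and R)

# Fraction of the blue region that is also red, and the other way round.
p_r_given_b = p_both / p_b
p_b_given_r = p_both / p_r

print(p_r_given_b, p_b_given_r)
```

<p>With these numbers, a third of the blue region is also red, while half of the red region is also blue; the two fractions differ only because the regions have different sizes.</p>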
<p>By the definition of conditional probability,</p><div cid="n124" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n124" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display"></div><script id="MathJax-Element-8" type="math/tex; mode=display">P(R|B)=\frac{P(B \cap R)}{P(B)}, \text{ and} \\ P(B|R)=\frac{P(B \cap R)}{P(R)}.</script></div></div><p>Bayes theorem is about answering questions like "if we know how likely we are to be in the red area given that we're in the blue area, how likely are we to be in the blue area if we're in the red?" (Or: "if we know how likely we are to have symptoms if we have covid, how likely are we to have covid if we have symptoms?").</p><p>Solving both of the above equations for <script type="math/tex">P(B \cap R)</script> and equating them gives</p><div cid="n127" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n127" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-9" type="math/tex; mode=display">P(R|B) P(B) = P(B|R) P(R),</script></div></div><p>which is the answer – just divide out by either <script type="math/tex">P(B)</script> or <script type="math/tex">P(R)</script> to get, for example,</p><div cid="n129" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n129" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-10" type="math/tex; mode=display">P(B|R) = \frac{P(R|B)P(B)}{P(R)}.</script></div></div><p>Let's say the red area $$R$$ represents having symptoms. 
Let's say we split the blue area <script type="math/tex">B</script> into <script type="math/tex">B_1</script> and <script type="math/tex">B_2</script> – two different variants of covid, say. Now instead of talking about probabilities, let's talk about odds: let's say the odds that a random person has no covid, has variant 1, or has variant 2 are 40:2:1, and that symptoms are, compared to the no-covid population, ten times as likely in variant 1 and twenty times as likely in variant 2 (in symbols: <script type="math/tex">P(R| \neg B_1 \cap \neg B_2) = P(R|B_1) / 10 = P(R|B_2) / 20</script>). Now we learn that we have symptoms and want to calculate posterior probabilities, to use Bayes-speak.</p><p>To apply Bayes' rule, you could crank out the formula exactly as above: convert odds to probabilities, divide out by the total probability of symptoms (summed over no covid, variant 1, and variant 2), and then get revised probabilities of having no covid or either variant. This is equivalent to keeping track of the absolute sizes of the intersections in the diagram below:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Jhos5ZJkCv8/X-3hH0LugyI/AAAAAAAACIY/Cbqmi8yd3e8NyCwPsqBOJFf4fZG-zD28wCLcBGAsYHQ/s1000/bayes2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="622" data-original-width="1000" height="398" src="https://1.bp.blogspot.com/-Jhos5ZJkCv8/X-3hH0LugyI/AAAAAAAACIY/Cbqmi8yd3e8NyCwPsqBOJFf4fZG-zD28wCLcBGAsYHQ/w640-h398/bayes2.png" width="640" /></a></div><br /><p>But this is unnecessary. Once we've learned we have symptoms, we've already zoomed in to the red blob; that is our sample space now, so blob size compared to the original sample space no longer interests us.</p><p>So let's take our odds ratios directly, and only focus on relative probabilities. 
Let's imagine each scenario fighting over a set amount of probability space, with the starting allocations determined by prior odds ratios:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-d3Qk2fpGrT4/X-3hOBKv60I/AAAAAAAACIg/bsEq3MDex-wUlCQonWZ8DZ8Gl4clp6KUwCLcBGAsYHQ/s1280/odds1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="84" data-original-width="1280" height="42" src="https://1.bp.blogspot.com/-d3Qk2fpGrT4/X-3hOBKv60I/AAAAAAAACIg/bsEq3MDex-wUlCQonWZ8DZ8Gl4clp6KUwCLcBGAsYHQ/w640-h42/odds1.png" width="640" /></a></div><br /><p>Now Bayes rule says to multiply each prior probability <script type="math/tex">P(B_i)</script> by <script type="math/tex">P(R|B_i)</script>. To adjust our prior odds ratio 40:2:1 by the ratios 1:10:20 telling us how many times more likely we are to see <script type="math/tex">R</script> (symptoms) given no covid or <script type="math/tex">B_1</script> or <script type="math/tex">B_2</script>, just multiply term-by-term to get 40:20:20, or 2:1:1. You can imagine each outcome fighting it out with their newly-adjusted relative strengths, giving a new distribution of the sample space:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-N8kRYPSO3bs/X-3hTdn_XLI/AAAAAAAACIk/F77jAGwyouYemA1udKnaLy1O_G7lVEIZACLcBGAsYHQ/s1282/odds2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="102" data-original-width="1282" height="50" src="https://1.bp.blogspot.com/-N8kRYPSO3bs/X-3hTdn_XLI/AAAAAAAACIk/F77jAGwyouYemA1udKnaLy1O_G7lVEIZACLcBGAsYHQ/w640-h50/odds2.png" width="640" /></a></div><br /><p>Now if we want to get absolute probabilities again, we just have to scale things right so that they add up to 1. 
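</p><p>The odds update above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the text: the function names <code>update_odds</code> and <code>normalise</code> are made up for this example, and exact fractions are used so that the final scaling step is easy to read.</p>

```python
from fractions import Fraction

def update_odds(prior_odds, likelihood_ratios):
    """Multiply prior odds by relative likelihoods, term by term."""
    return [p * l for p, l in zip(prior_odds, likelihood_ratios)]

def normalise(odds):
    """Scale odds so that they add up to 1, turning them into probabilities."""
    total = sum(odds)
    return [Fraction(o, total) for o in odds]

prior = [40, 2, 1]             # no covid : variant 1 : variant 2
symptom_ratios = [1, 10, 20]   # relative likelihood of symptoms in each case

posterior_odds = update_odds(prior, symptom_ratios)   # 40:20:20, i.e. 2:1:1
posterior_probs = normalise(posterior_odds)

print(posterior_odds)
print(posterior_probs)
```

<p>Here the final scaling turns 2:1:1 into probabilities of 1/2, 1/4 and 1/4.</p><p>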
This tiny bit of cleanup at the end (if we want to convert to probabilities again) is the only downside of working with odds ratios.</p><p>This gives us an idea about how to use Bayes when the sample space is continuous rather than discrete. For example, let's say the sample space is between 0 and 100, representing the blood oxygenation level $$X$$ of a coronavirus patient. We can imagine an approximation where we write an odds ratio that includes every integer from 0 to 100, and then refine that until, in the limit, we've assigned odds to every real number between 0 and 100. Of course, at this point the odds ratio interpretation starts looking a bit weird, but we can switch to another one: what we have is a probability distribution, if only we scale it so that the entire thing integrates to one.</p><p>The same logic applies as before, even though everything is now continuous. Let's say we want to calculate a conditional probability like the probability of $$X$$ (the random variable for the patient's blood oxygenation) taking the value $$x$$. At first we have no information, so our best guess is the prior across all patients, $$\Pr_X(x)$$. Say we now get some piece of evidence, like the patient's age, and know the likelihood ratios of the patient being that age given each blood oxygenation level. 
To get our updated belief distribution, we can just go through and multiply the prior likelihoods of each blood oxygenation level by the ratios given the new piece of evidence.</p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-yRdyWvJJFqo/X_BQQ0ev9EI/AAAAAAAACK0/Lvw36rxuPy03EK5CODRgEmcaOZebQP4mACLcBGAsYHQ/s1280/odds3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="688" data-original-width="1280" height="344" src="https://1.bp.blogspot.com/-yRdyWvJJFqo/X_BQQ0ev9EI/AAAAAAAACK0/Lvw36rxuPy03EK5CODRgEmcaOZebQP4mACLcBGAsYHQ/w640-h344/odds3.png" width="640" /></a></div><a href="https://1.bp.blogspot.com/-jm_GaXHLjd8/X-3hXf6DR9I/AAAAAAAACIs/4jodPcdJE98OejxOZHtFgDagRZ3YvAbtgCLcBGAsYHQ/s1280/odds3.png" style="margin-left: 1em; margin-right: 1em;"></a></div><p>Above, the red line is the initial distribution of blood oxygenation <script type="math/tex">x</script> across all patients. The yellow line represents the relative likelihoods of the patient's actual known age <script type="math/tex">a</script> given a particular <script type="math/tex">x</script>. The green line at any particular $$x$$ is the product of the yellow and red function at that same $$x$$, and it's our relative posterior. To interpret it as a probability distribution, we have to scale it vertically so that it integrates to 1 (that's why we have a proportionality sign rather than an equals sign).</p><p>Now let's say more evidence comes in: the patient is unconscious (which we'll denote <script type="math/tex">U=\text{"yes"}</script>). 
We can repeat the same process of multiplying out relative likelihoods and the prior, this time with the prior being the result in the previous step:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-d_e-q5ctBIs/X_BQdmzEaoI/AAAAAAAACK4/gFbLZQb96N4AaWGOJy70ILariHxQNja1gCLcBGAsYHQ/s1278/odds4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="928" data-original-width="1278" height="464" src="https://1.bp.blogspot.com/-d_e-q5ctBIs/X_BQdmzEaoI/AAAAAAAACK4/gFbLZQb96N4AaWGOJy70ILariHxQNja1gCLcBGAsYHQ/w640-h464/odds4.png" width="640" /></a></div><p></p><p>We can see that in this case the blue line varies a lot more depending on <script type="math/tex">x</script>, and hence our distribution for <script type="math/tex">x</script> (the purple line) changes more compared to our prior (the green line). Now let's say we have a very good piece of evidence: the result <script type="math/tex">m</script> of a blood oxygenation meter <script type="math/tex">M</script>.</p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-YXGK8RMc6Oc/X_BQjsdKWfI/AAAAAAAACLA/zvLSoosiv408XSUcAXp_uRQwh54Nfx1xACLcBGAsYHQ/s1280/odds5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="858" data-original-width="1280" height="428" src="https://1.bp.blogspot.com/-YXGK8RMc6Oc/X_BQjsdKWfI/AAAAAAAACLA/zvLSoosiv408XSUcAXp_uRQwh54Nfx1xACLcBGAsYHQ/w640-h428/odds5.png" width="640" /></a></div>There's some error on the oxygenation measurement, so our final belief (that <script type="math/tex">x</script> is distributed according to the black line) is very clearly a distribution of values rather than a single value, but it's clustered around a single point.<p></p><p>So to think through Bayes in practice, the lesson is this: throw out the denominator in the law. 
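</p><p>As a rough sketch of what this looks like in practice, the Python below does one update on a grid of integer oxygenation levels from 0 to 100. Everything numeric here is an assumption for illustration (the bell-shaped prior, the meter reading of 88, the error width of 2), and the helper <code>bump</code> is a made-up name. The point is only that no denominator appears until the very last step, where a sum stands in for the integral.</p>

```python
import math

def bump(x, centre, width):
    """Unnormalised bell-shaped weight; only relative sizes matter."""
    return math.exp(-0.5 * ((x - centre) / width) ** 2)

xs = list(range(0, 101))                     # grid of oxygenation levels
prior = [bump(x, 92, 8) for x in xs]         # hypothetical prior across patients
likelihood = [bump(88, x, 2) for x in xs]    # hypothetical reading m = 88, error about 2

# Prior times likelihood, with no denominator anywhere.
unnormalised = [p * l for p, l in zip(prior, likelihood)]

# Only at the very end do we rescale so that the values sum to 1.
total = sum(unnormalised)
posterior = [u / total for u in unnormalised]

# Most probable grid point under the posterior.
best = max(xs, key=lambda x: posterior[x])
print(best)
```

<p>With these made-up inputs the posterior peaks at the grid point 88, close to the meter reading, because the meter is much more precise than the prior.</p><p>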
It's a constant anyways; if you really need it you can go through some integration at the end to find it. But it's not the central point of Bayes' theorem. Remember instead: prior times likelihood ratio gives posterior.</p><p> </p><h2>Fitting models</h2><p>A probability model tries to tell you how likely things are. Fitting a probability model to data is about finding one that is useful for given data.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-zWW-18QBIOE/X-3ht9sKyeI/AAAAAAAACJM/tge61Rkj8sYl5NGP720Pu2FooLBPjI4OgCLcBGAsYHQ/s1280/probmodels.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="366" data-original-width="1280" height="184" src="https://1.bp.blogspot.com/-zWW-18QBIOE/X-3ht9sKyeI/AAAAAAAACJM/tge61Rkj8sYl5NGP720Pu2FooLBPjI4OgCLcBGAsYHQ/w640-h184/probmodels.png" width="640" /></a></div><p>Above, we have two axes representing whatever, and the intensity of the red shading is the probability attributed to a particular pair of values.</p><p>The model on the left is simply bad. The one in the middle is also bad, though; it assigns no probability to many of the data points that were actually seen.</p><p>Choosing which distribution to fit – or whether to do something else entirely – is sometimes obvious, sometimes not. Complexity is rarely good.</p><h3>Maximum likelihood estimation (MLE)</h3><p>Let's say we do have a good idea of what the distribution is; the weight of stray cats in a city depends on a lot of small factors pushing both ways (when it last caught a mouse, the temperature over the past week, whether it was loved by its mother, etc.), so <a href="https://en.wikipedia.org/wiki/Bean_machine">we should expect a normal distribution</a>. Well, probably.</p><p>Let's say we have a dataset of cat weights, labelled <script type="math/tex">x_1</script> to <script type="math/tex">x_n</script> because we're serious maths people. 
How do we fit a distribution?</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-gjN98iorrHU/X-3hznFtCLI/AAAAAAAACJQ/HncfCOzHt3wE7LZUxkeWDAc_27ZvZOH-gCLcBGAsYHQ/s800/cats.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="610" data-original-width="800" height="488" src="https://1.bp.blogspot.com/-gjN98iorrHU/X-3hznFtCLI/AAAAAAAACJQ/HncfCOzHt3wE7LZUxkeWDAc_27ZvZOH-gCLcBGAsYHQ/w640-h488/cats.png" width="640" /></a></div><br /><p><br /></p><p>Step 1 is Wikipedia. Wikipedia tells us that a normal distribution has two parameters, <script type="math/tex">\mu</script> (the mean) and <script type="math/tex">\sigma</script> (the standard deviation), and that the likelihood (not probability! see above) a normal distribution <script type="math/tex">X</script> with those parameters takes a value <script type="math/tex">x</script> is</p><div cid="n164" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n164" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-11" type="math/tex; mode=display">\Pr_X(x)= \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\big( \frac{x-\mu}{\sigma} \big)^2}.</script></div></div><p>Oh dear.</p><p>After a moment's thought, we can interpret it more clearly:</p><div cid="n167" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n167" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-12" type="math/tex; mode=display">\Pr_X(x) = \frac{\text{blah}}{\sigma \text{ blah}} \text{blah}^{\text{-blah} {\big(\frac{x-\mu}{\sigma}\big)^2}}.</script></div></div><p>So it's just an exponential that 
decays in both directions from <script type="math/tex">\mu</script>, and that's squeezed by <script type="math/tex">\sigma</script>.</p><p>(Why are there constants then? Because it's a probability distribution, and must therefore integrate to 1 over its entire range or else all hell will break loose.)</p><p>Step 2 is philosophising. What does it really mean to get the best fit of a distribution?</p><p>The first thing we can notice is that there are only two dials we can adjust: the values of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script>. For this particular problem at least, we've reduced the massive problem of picking the best model to one of finding the best spot in a 2D space (well, half of 2D space, since <script type="math/tex">\sigma</script> must be greater than zero).</p><p>The second thing we can notice is that the only tool we have at our disposal here to tell us about the fit of the distribution to the data is the likelihood function, and, well, as the saying goes: when all you have is a likelihood function ...</p><p>A good fit will give high likelihoods to the points in the data set (we can't get an arbitrarily good fit by giving everything a lot of likelihood, because there's only so much likelihood to go around – the likelihood function must integrate to 1 across its domain).</p><p>Let's define the likelihood of the data, given some model, as the likelihood that we get that specific data set by independently generating samples from the model until we have the same number as in the data set (if we have a lot of data points, the likelihood of any particular set of them will usually be very low, since it's the product of the likelihoods of a lot of individual points). 
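</p><p>A small Python sketch of this definition, under assumptions: the cat weights below are invented for illustration, and <code>normal_density</code> and <code>data_likelihood</code> are hypothetical helper names. It multiplies the per-point likelihoods for two candidate settings of the dials, so we can see that the better-placed bell curve gives the data a higher likelihood, even though both numbers are small.</p>

```python
import math

def normal_density(x, mu, sigma):
    """Likelihood that a normal distribution with mean mu and sd sigma gives to x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def data_likelihood(data, mu, sigma):
    """Product of the per-point likelihoods, assuming independent samples."""
    result = 1.0
    for x in data:
        result *= normal_density(x, mu, sigma)
    return result

cat_weights = [3.1, 4.0, 4.4, 3.6, 5.2, 4.1]   # made-up weights in kg

good_fit = data_likelihood(cat_weights, mu=4.0, sigma=0.7)
poor_fit = data_likelihood(cat_weights, mu=7.0, sigma=0.7)

print(good_fit, poor_fit)
```

<p>Only the comparison between the two numbers matters here, not their absolute size.</p><p>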
And let's go ahead and try to tune the model so that the likelihood of our data is maximised.</p><p>(Remember, likelihood is probability, except for continuous random variables like our normal distribution, where we can't talk about the probability of a dataset (only about something like the probability of getting a dataset at least as close as [some metric] to the dataset).)</p><p>Step 3 is algebra. So what is the likelihood of all our data? Using basic probability, it's the product of the likelihoods of each data point (just like the probability of getting a set of independent events is the product of the probabilities of each event). Returning to our normal distribution with cat data <script type="math/tex">x_1</script> to <script type="math/tex">x_n</script>, the likelihood of the data given distribution <script type="math/tex">X</script> with mean <script type="math/tex">\mu</script> and standard deviation <script type="math/tex">\sigma</script> is</p><div cid="n177" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n177" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display"></div><script id="MathJax-Element-13" type="math/tex; mode=display">\Pr_X(x_1) \cdot \Pr_X(x_2) \cdot ... \cdot \Pr_X(x_n) \\ = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\big( \frac{x_1-\mu}{\sigma} \big)^2} \cdot ... \cdot \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\big( \frac{x_n-\mu}{\sigma} \big)^2} \\ = \left(\frac{1}{\sigma \sqrt{2 \pi}} \right)^n e^{-\frac{1}{2}\big( \big( \frac{x_1 - \mu}{\sigma} \big)^2 + ... + \big(\frac{x_n - \mu}{\sigma} \big)^2 \big)}.</script></div></div><p>Oh dear. Maximising this is a pain.</p><p>Thankfully, there's a trick. We don't care about the likelihood, only that we set <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> so that the likelihood is maximised. 
We can apply any monotonically increasing function to the likelihood, maximise that, and we'll have the <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> that maximise the original mess.</p><p>Which monotonically increasing function? Logarithms are generally best, because they convert the products you get from calculating the likelihood of a dataset into sums (and in this case they're especially nice, because they'll also take out the exponentials in our distribution's likelihood function).</p><p>In fact, throw away the previous calculation and note that</p><div cid="n182" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n182" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display"></div><script id="MathJax-Element-14" type="math/tex; mode=display">\log\Pr_X(x) = -\log(\sigma \sqrt{2 \pi}) - \frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2 \\ = -\log(\sqrt{2 \pi}) - \log(\sigma) - \frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2, \\</script></div></div><p>from which we can throw away the <script type="math/tex">\log(\sqrt{2\pi})</script> because it's a constant that doesn't depend on <script type="math/tex">\mu</script> or <script type="math/tex">\sigma</script>, and then sum all the rest up to get a total log likelihood of</p><div cid="n184" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n184" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-15" type="math/tex; mode=display">-n\log(\sigma) - \sum_{i=1}^n \Big( \frac{1}{2} \left(\frac{x_i-\mu}{\sigma}\right)^2 \Big).</script></div></div><p>Call this <script type="math/tex">f</script>; the values of <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> that maximise it are those for which <script type="math/tex">\frac{\partial 
f}{\partial \mu} = 0</script> and <script type="math/tex">\frac{\partial f}{\partial \sigma} = 0</script>; that's when we've found our peak on the 2D space of possible <script type="math/tex">(\mu, \sigma)</script> pairs (technically this condition only tells us it's a stationary point, but it turns out to be the maximum, as you can prove by taking more derivatives).</p><p>So the maximum satisfies</p><div cid="n187" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n187" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display"></div><script id="MathJax-Element-16" type="math/tex; mode=display">\frac{\partial f}{\partial \mu} = \sum_{i=1}^n \Big( \frac{x_i-\mu}{\sigma^2} \Big) = 0, \text{ and} \\ \frac{\partial f}{\partial \sigma} = -\frac{n}{\sigma} + \sum_{i=1}^n \left( \frac{(x_i - \mu)^2}{\sigma^3} \right) = 0.</script></div></div><p>The first condition gives</p><div cid="n189" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n189" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-17" type="math/tex; mode=display">\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i,</script></div></div><p>in other words that <script type="math/tex">\hat{\mu}</script>, our best estimator function for the value of <script type="math/tex">\mu</script>, is the average of the values in the data set.</p><p>From the second condition, we can do algebra to get</p><div cid="n192" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n192" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script 
id="MathJax-Element-18" type="math/tex; mode=display">\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^n(x_i-\mu)^2}.</script></div></div><p>We need to be careful here, though. When writing out the conditions, <script type="math/tex">\mu</script> and <script type="math/tex">\sigma</script> stood for specific values of the parameters of the normal distribution <script type="math/tex">X</script>. We don't know these values; the best we can do is estimate them with <i>estimators</i>, which are technically not values but functions that take a data set and return an estimated value (and are denoted by <script type="math/tex">\hat{\text{hats}}</script>). We can't have unknown values in our definition of <script type="math/tex">\hat{\sigma}</script>, as we currently do with the <script type="math/tex">\mu</script> in it; we have to replace it with the estimator for <script type="math/tex">\mu</script> like this:</p><div cid="n194" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n194" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-19" type="math/tex; mode=display">\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^n(x_i-\hat{\mu})^2}</script></div></div><p>– making sure that the estimator <script type="math/tex">\hat{\mu}</script> does not depend on <script type="math/tex">\hat{\sigma}</script>, since that would again make things undefined – or else by writing out the <script type="math/tex">\hat{\mu}</script> estimator like this:</p><div cid="n196" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n196" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-20" type="math/tex; 
mode=display">\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^n \left(x_i-\frac{1}{n}\sum_{j=1}^n x_j\right)^2},</script></div></div><p>which at least makes it very clear that the <script type="math/tex">x_i</script>s and their number <script type="math/tex">n</script> define <script type="math/tex">\hat{\sigma}</script>. </p><p>When you're done defining your estimators, you should have a clear diagram in your head of how to pour data into the functions you've written down and come out with concrete numbers, with no dangling inputs anywhere – you're not done if you have any.</p><h3>Supervised and unsupervised learning</h3><p>There are two main types of fancy model fitting we can do:</p><ol start=""><li>Supervised learning, where we have a set of pairs (of numbers or anything else) and we try to design a system to predict one element from the other. For example, maybe we measure the length and weight of some stray cats, but get bored of trying to get them to stay on the scale long enough, so we want to ditch the weighing and predict a weight from the length alone – how well can we do this?</li><li>Unsupervised learning, where we have our data (as a set of tuples of associated data, like cat lengths, weights, and locations), and we try to fit a model to it so we can generate similar items; maybe we want to fake a larger stray cat population in our data than actually exists but not get caught by the statistics bureau. (This category also includes things like trying to <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">identify clusters</a> to interpret the data.) Fitting a distribution is perhaps the simplest example: using our one-dimensional cat weight database discussed in the MLE section, we can "generate" new cats by sampling from it, though the "cat" will just be the weight number. 
The more interesting case is when we have to generate a lot of associated data; for example, <a href="https://thispersondoesnotexist.com/">this website</a> offers you a new face every time you reload it. Behind it is a probability distribution for a human face in some crazy-dimensional variable space that's detailed enough that sampling it gives you all the data needed to figure out the colours of each pixel in a photorealistic face picture.</li> </ol><p>The unifying idea is maximum likelihood estimation (MLE). Clearly, something like MLE is needed if you want to fit a distribution to data for unsupervised learning; we're going to need to generate something eventually, so we better have a probability model. It's less clear that supervised learning has anything to do with MLE though, and tempting to think of it as defining some random loss function to measure how bad a fit is, and then minimising that. It's possible to think of supervised learning this way, but then you'll end up with a lot of detail about loss functions in your head, all of which will seem to be pulled out of thin air.</p><p>Instead, think of supervised learning as MLE too. We specify a probability model, which will take in some parameters (e.g. the exponent <script type="math/tex">a</script> and constant <script type="math/tex">b</script> in a cat length/weight model like <script type="math/tex">\text{weight} = b \times \text{length}^a + \epsilon</script>, where <script type="math/tex">\epsilon</script> is a normally distributed error term with mean 0 and some standard deviation we either know already or then ask the fitting procedure to find for us), and the value of the predictor variable(s) (e.g. 
the cat's length), and spit out its prediction of the variable(s) of interest.</p><p>(Note that often the variable of interest is not numerical, but a label: "spam", "tumour", "Eurasian oystercatcher", etc.)</p><p>In fact, seen from the MLE perspective, it can almost be hard to see the difference – if so, good. Just look at the processes:</p><ol start=""><li><p>Unsupervised learning:</p><ol start=""><li>Get your dataset <script type="math/tex">x = (x_1, x_2, ..., x_n)</script>.</li><li>Decide on a probability model (e.g. a simple distribution) <script type="math/tex">X</script> with a parameter set <script type="math/tex">\theta = (\theta_1, \theta_2, ..., \theta_m)</script>.</li><li>Find the <script type="math/tex">\theta</script> that maximises <script type="math/tex">\Pr_X(x_1; \theta) \times ... \times \Pr_X(x_n; \theta)=\Pr_X(x;\theta)</script>,* since, assuming our data points are drawn independently, this is the likelihood of the dataset.</li> </ol></li><li><p>Supervised learning:</p><ol start=""><li>Get your dataset of pairs of the form (thing to predict, thing to predict from): <script type="math/tex">((y_1, x_1), (y_2, x_2), ..., (y_n, x_n))</script>.</li><li>Decide on a probability model <script type="math/tex">Y</script> that relies on a parameter set <script type="math/tex">\theta = (\theta_1, \theta_2, ..., \theta_m)</script>, and also on <script type="math/tex">x_i</script>, to predict <script type="math/tex">y_i</script>.</li><li>Find the <script type="math/tex">\theta</script> that maximises <script type="math/tex">\Pr_Y(y_1;x_1, \theta) \times ... 
\times \Pr_Y(y_n; x_n, \theta) = \Pr_Y(y_1, ..., y_n; x_1, ..., x_n, \theta)</script>.*</li> </ol></li> </ol><p>*(We write <script type="math/tex">\Pr_X(x_i;\theta)</script> to mean the likelihood that <script type="math/tex">X</script> takes the value <script type="math/tex">x_i</script> if the parameters are <script type="math/tex">\theta</script>; we avoid writing it as a conditional probability <script type="math/tex">\Pr_X(x \, |\, \theta)</script> because interpreting this as a conditional probability is technically only valid with a Bayesian interpretation.)</p><h3>Linear models</h3><p>You can invent any model you choose. As always, simplicity pays though, and it turns out that there's a class of probability models which are easy to work with and reason about, for which general algorithms and mathematical tools exist, and which is often good enough: linear models.</p><p>The word "linear" immediately brings to mind straight lines. That's not what it means in this context. Linear models are called linear because the output is a linear combination of "features" (predictor variables).</p><p>The general form is</p><div cid="n234" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n234" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-21" type="math/tex; mode=display">\hat{y_i} = c_1 e_{1,i} + c_2 e_{2,i} + ... +c_n e_{n,i},</script></div></div><p>where <script type="math/tex">\hat{y_i}</script> is the predicted value, <script type="math/tex">c_1</script> through <script type="math/tex">c_n</script> are constants, and <script type="math/tex">e_{1,i}</script> through <script type="math/tex">e_{n,i}</script> are the features describing the <script type="math/tex">i</script>th set of data. 
In the simplest case, a feature might be a value we measure directly, but in general it can be any function of data we measure. Ideally, we want that the true value <script type="math/tex">y_i \approx c_1 e_{1,i} + ... + c_n e_{n,i}</script>.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-joWzH265vss/X-3h7Ef5y7I/AAAAAAAACJU/4LG8rb-vc4Mtno2KGVHfWipxb41EnWuFACLcBGAsYHQ/s1278/linearmodel.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="678" data-original-width="1278" height="340" src="https://1.bp.blogspot.com/-joWzH265vss/X-3h7Ef5y7I/AAAAAAAACJU/4LG8rb-vc4Mtno2KGVHfWipxb41EnWuFACLcBGAsYHQ/w640-h340/linearmodel.png" width="640" /></a></div><p>In the above diagram, we see we measure the data <script type="math/tex">x_i</script> (note that it can be a tuple of values rather than a single value), pass it through some blackbox function to generate features, and take the prediction <script type="math/tex">\hat{y_i}</script> to be the sum of multiplying together each feature by the weight assigned to it. </p><p>Note that the linear model above is a prediction-maker but not a probability model because it doesn't assign likelihoods. The probability model for a linear model is often taken to be</p><div cid="n239" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n239" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-22" type="math/tex; mode=display">y_i = c_1 e_{1,i} + c_2 e_{2,i} + ... 
+c_n e_{n,i} + \epsilon</script></div></div><p>that is, there's an error term <script type="math/tex">\epsilon</script> that we assume to be a normal distribution with standard deviation <script type="math/tex">\sigma</script> (which may be known, or finding it may be part of fitting the model).</p><p>The above is also an equation for predicting one specific output (<script type="math/tex">y_i</script>) from one specific set of features, which in turn are determined by one specific input (e.g. a single data point). More generally we can write it in vector form:</p><div cid="n242" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n242" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-23" type="math/tex; mode=display">\pmb{y} \approx c_1 \pmb{e_1} + ... + c_n \pmb{e_n},</script></div></div><p>where <script type="math/tex">\pmb{y}=(y_1, y_2, ..., y_{n})</script>, and likewise <script type="math/tex">\pmb{e_j}</script> is a vector whose <script type="math/tex">i</script>th position corresponds to the <script type="math/tex">j</script>th feature of the <script type="math/tex">i</script>th data item.</p><p>Note that we can read this equation in two ways: as a vector equation about data, as just described, that's fitted to give <script type="math/tex">\pmb{y}</script> from its features, or as a prediction, saying that the value of a particular <script type="math/tex">y_i</script> will be roughly this.</p><p>There's a set of standard tricks to use in linear modelling:</p><ul><li>"One-hot coding": using a function that is 0 unless the input data satisfies some condition (having a label, exceeding a value, etc.).</li><li>If we have the data point <script type="math/tex">x_i</script>, using the features <script type="math/tex">e_{0,i} = 1</script>, <script type="math/tex">e_{1,i} 
= x_i</script>, and <script type="math/tex">e_{2,i} = x_i^2</script> to fit a quadratic (if you fit a polynomial of degree higher than 2 without a very solid reason, you're probably overfitting).</li><li>We often have a pattern with a known period <script type="math/tex">T</script> (days, years, etc.), and some non-zero starting phase <script type="math/tex">\phi</script>. Therefore we'd want a feature like <script type="math/tex">\sin((2\pi/T)x+\phi)</script>, where <script type="math/tex">x</script> is an input, to fit this pattern. If <script type="math/tex">\phi</script> is known, we don't have a problem, but if we want to fit the phase, it doesn't work: the model is not linear in <script type="math/tex">\phi</script>. To fix this, use a trig angle addition identity; the above becomes <script type="math/tex">\sin(\phi) \cos((2\pi/T)x) + \cos(\phi) \sin((2\pi/T)x)</script>, where <script type="math/tex">\sin(\phi)</script> and <script type="math/tex">\cos(\phi)</script> are just constants, so we can forget about them: the fitting procedure will determine the constants in front of the cosine and sine features for us. 
(Recovering <script type="math/tex">\phi</script> from the final constants will take a bit of maths; note that the constant of the cosine and sine terms in the fitted model will have the amplitude mixed in, in addition to <script type="math/tex">\phi</script>.)</li> </ul><p>Here's an annotated linear model with parameter interpretation:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-UEm7eq7MsB8/X-3iV7GgJaI/AAAAAAAACJk/cnJ48BClH54YIOQoHLDGGDfkrDx8_8gjwCLcBGAsYHQ/examplelinear.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1022" data-original-width="1280" height="510" src="https://lh3.googleusercontent.com/-UEm7eq7MsB8/X-3iV7GgJaI/AAAAAAAACJk/cnJ48BClH54YIOQoHLDGGDfkrDx8_8gjwCLcBGAsYHQ/w640-h510/examplelinear.png" width="640" /></a></div><br /><p></p><p>The features in this model:</p><ul><li><script type="math/tex">e_1=x</script>.</li><li><script type="math/tex">e_2</script> is 0 if <script type="math/tex">x < A</script> and 1 otherwise.</li><li><script type="math/tex">e_3</script> is 0 if <script type="math/tex">x < A</script> and <script type="math/tex">x</script> otherwise.</li> </ul><p>(If we want to fit the best value of <script type="math/tex">A</script>, we'll have to do some maths and reconfigure the model. 
Right now <script type="math/tex">A</script> is a constant that's defined in the functions that calculate the features from the input data.)</p><p>The interpretation of the constants:</p><ul><li><script type="math/tex">c_0</script> is the prediction for <script type="math/tex">x=0</script>.</li><li><script type="math/tex">c_1</script> is the base slope.</li><li><script type="math/tex">c_2</script> is the difference between the prediction for <script type="math/tex">x=0</script> (the <script type="math/tex">y</script>-intercept of the <script type="math/tex">x < A</script> line) and the <script type="math/tex">y</script>-intercept of the <script type="math/tex">x>A</script> line.</li><li><script type="math/tex">c_3</script> is how much the slope changes after <script type="math/tex">x=A</script>.</li> </ul><p>We could have chosen different features (for example, letting <script type="math/tex">e_1 = 0</script> for <script type="math/tex">x > A</script>), and then gotten perhaps more readable constants (<script type="math/tex">c_3</script> would become just the slope, not the difference in slope). We could also have added a feature like <script type="math/tex">e_4 = x^2</script>, and then the model would no longer look like just straight lines. But whatever we do, we need to be careful to interpret the constants we get correctly, especially when the model gets complicated.</p><p>For our cat weight prediction example, we might expect weight <script type="math/tex">W</script> and length <script type="math/tex">L</script> to have a relation like <script type="math/tex">W \approx c L^3</script>, where <script type="math/tex">c</script> is a constant that the model will fit. 
If we want to ask questions about whether a cubic relation really is the best, take logs and fit something like <script type="math/tex">\log(W) = c_1 + c_2 \log(L)</script> – <script type="math/tex">c_2</script> tells us the exponent.</p><h4>Feature spaces and fitting linear models</h4><p>The main benefit of linear models is that by talking about linear combinations of data vectors we reduce the maths of fitting parameters to linear algebra. Linear algebra is about transformations of space and the vectors in it, so it also allows for a visual interpretation of everything.</p><p>Let's say we have a model like this:</p><div cid="n279" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n279" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-24" type="math/tex; mode=display">\pmb{y} \approx c_1 \pmb{e_1} + c_2 \pmb{e_2}.</script></div></div><p>Here, <script type="math/tex">\pmb{y}</script> is the actual measured data, and <script type="math/tex">\pmb{e_i}</script> are functions of the (also measured) predictor variables. Let's say <script type="math/tex">\pmb{y} = (y_1, y_2, y_3)</script> – i.e., we have three data points. We can imagine <script type="math/tex">\pmb{y}</script> as a vector pointing somewhere in 3D space, with <script type="math/tex">y_1</script>, <script type="math/tex">y_2</script>, and <script type="math/tex">y_3</script> the distances along the <script type="math/tex">x</script>, <script type="math/tex">y</script>, and <script type="math/tex">z</script> axes. 
Likewise, <script type="math/tex">\pmb{e_1}</script> and <script type="math/tex">\pmb{e_2}</script> can be thought of as 3D vectors encoding some (function of the) data we've measured.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-58VTEODl7IM/X-3ifzkSRZI/AAAAAAAACJo/7YBx6KjGeUg_K0t2q-HXR0F2ozxDOLP3gCLcBGAsYHQ/3d.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="920" data-original-width="1278" height="288" src="https://lh3.googleusercontent.com/-58VTEODl7IM/X-3ifzkSRZI/AAAAAAAACJo/7YBx6KjGeUg_K0t2q-HXR0F2ozxDOLP3gCLcBGAsYHQ/w400-h288/3d.png" width="400" /></a></div><br /><p></p><p>Now the only dials a linear model gives us to adjust are the weights of <script type="math/tex">\pmb{e_1}</script> and <script type="math/tex">\pmb{e_2}</script>: <script type="math/tex">c_1</script> and <script type="math/tex">c_2</script>. There's a 2D space of them (since there are two constants to adjust – <script type="math/tex">c_1</script> and <script type="math/tex">c_2</script>), and as it happens, there's a nice geometric interpretation: each pair <script type="math/tex">(c_1, c_2)</script> corresponds to a point on the plane spanned by <script type="math/tex">\pmb{e_1}</script> and <script type="math/tex">\pmb{e_2}</script> (specifically, the point you get to if you move <script type="math/tex">c_1</script> times along <script type="math/tex">\pmb{e_1}</script> and then <script type="math/tex">c_2</script> times along <script type="math/tex">\pmb{e_2}</script>).</p><p>So what are the best values of <script type="math/tex">c_1</script> and <script type="math/tex">c_2</script>? 
The intuitive answer is that we want to get as close as possible to <script type="math/tex">\pmb{y}</script>:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-Xvj6zojy2LM/X-3ikHHj4xI/AAAAAAAACJs/Wyd-NiMwNA8d8vzruWVZYL6524HXL6aNwCLcBGAsYHQ/featurespace.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1080" data-original-width="1280" height="541" src="https://lh3.googleusercontent.com/-Xvj6zojy2LM/X-3ikHHj4xI/AAAAAAAACJs/Wyd-NiMwNA8d8vzruWVZYL6524HXL6aNwCLcBGAsYHQ/w640-h541/featurespace.png" width="640" /></a></div><p></p><p>In this case, the closest to <script type="math/tex">\pmb{y}</script> that we can reach on the plane spanned by <script type="math/tex">\pmb{e_1}</script> and <script type="math/tex">\pmb{e_2}</script> is the green vector, and the black vector is the difference between the predicted data vector and actual data vector.</p><p>Mathematically, what are we doing here? We're minimising the distance between the vector <script type="math/tex">\hat{\pmb{y}} = c_1 \pmb{e_1} + c_2 \pmb{e_2}</script> (where <script type="math/tex">c_1</script> and <script type="math/tex">c_2</script> can be varied) and <script type="math/tex">\pmb{y}</script>; this distance is given by</p><div cid="n287" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n287" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-25" type="math/tex; mode=display">\sqrt{(\hat{y_1} - y_1)^2 + (\hat{y_2} - y_2)^2 + (\hat{y_3} - y_3)^2 }.</script></div></div><p>Previously we simplified optimisation by applying a logarithm (a monotonically increasing function) and optimising that; this time we do the same by applying the squaring function (which is monotonically increasing for positive numbers, which our 
distance is limited to). This means that the quantity to minimise is</p><div cid="n289" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n289" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-26" type="math/tex; mode=display">(\hat{y_1} - y_1)^2 + (\hat{y_2} - y_2)^2 + (\hat{y_3} - y_3)^2.</script></div></div><p>In other words, we minimise the sum of squared errors ("least squares estimation" is the most common phrase).</p><p>If we have more than three data points, then we can't picture it, but the idea is exactly the same. Fitting an <script type="math/tex">n</script>-dimensional dataset to a linear model of <script type="math/tex">m</script> features boils down to moving as close as possible in <script type="math/tex">n</script>D space to the observed data vector, while limited to the <script type="math/tex">m</script>-dimensional (at most; see below) space spanned by the features.</p><p>(Above, <script type="math/tex">n=3</script> and <script type="math/tex">m=2</script>. Generally <script type="math/tex">n</script> is huge because datasets can be huge, while <script type="math/tex">m</script> is much smaller since it's the number of features we've written down into the model.)</p><blockquote><p><i>A maths lecturer is giving a lecture about 5-dimensional geometry.</i></p><i></i><p><i>A student asks a question: "I can follow the algebra just fine, but it would be helpful if I could visualise it. Is there any way to do that?"</i></p><i></i><p><i>The lecturer replies: "Oh, it's easy. 
Just imagine everything in <script type="math/tex">n</script> dimensions, and then let <script type="math/tex">n=5</script>."</i></p><i></i><p><i> </i></p><i></i><p><i>(variants of this joke are common; see for example <a href="http://www.personal.psu.edu/sxt104/mathjoke1.html">here.</a>)</i></p></blockquote><h5>Linear independence</h5><p>A set of vectors is linearly dependent if there exists a vector in it that can be written as a linear combination of the other vectors. If your feature vectors are linearly dependent, you will get the same predictions out of your model, but you can't interpret the coefficients.</p><p>(For visual intuition: two vectors in 2D are linearly dependent if they lie on the same line, three vectors in 3D are linearly dependent if they lie on the same plane (a superset of the case that they lie on the same line), and so on.)</p><p>An easy way to make this mistake is if you're doing one-hot coding of categories. Let's say you're fitting a linear model to estimate student exam grades <script type="math/tex">y</script> based on their university, with a model that looks like this:</p><div cid="n301" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n301" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-27" type="math/tex; mode=display">y \approx \alpha + \beta \cdot 1_{\text{Oxford}}+\gamma\cdot1_{\text{Cambridge}}+...,</script></div></div><p>using indicator function notation. Whatever linear fitting routine you do will happily give you coefficient values and the predictions it gives will be sensible, but you won't be able to interpret the coefficients. To see what's happening, consider an Oxford student: their predicted grade <script type="math/tex">y</script> is <script type="math/tex">\alpha + \beta</script>. 
What are <script type="math/tex">\alpha</script> and <script type="math/tex">\beta</script>? Good question – we can only assign meaning to their combination. If instead we drop one university's indicator and write</p><div cid="n303" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n303" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-28" type="math/tex; mode=display">y \approx \alpha + \beta \cdot 1_{\text{Cambridge}} + ...,</script></div></div><p>then, when we now fit the coefficients, <script type="math/tex">\alpha</script> will be the predicted grade for Oxford students, and <script type="math/tex">\alpha+\beta</script> the predicted grade for Cambridge students, so we can interpret <script type="math/tex">\alpha</script> as the Oxford average, and <script type="math/tex">\beta</script> as the difference between the Cambridge and Oxford averages. 
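</p><p>Here is a minimal Python sketch of that redundancy, assuming a toy dataset of 3 Oxford students followed by 2 Cambridge students. The averages 65 and 70 are made up for illustration, and <code>predict</code> is a hypothetical helper, not part of any library.</p>

```python
# Feature columns for 3 Oxford students followed by 2 Cambridge students.
ones = [1, 1, 1, 1, 1]        # intercept feature
oxford = [1, 1, 1, 0, 0]      # indicator: student is at Oxford
cambridge = [0, 0, 0, 1, 1]   # indicator: student is at Cambridge


def predict(coeffs, columns):
    """Prediction vector: sum over features of coefficient * feature column."""
    n = len(columns[0])
    return [sum(c * col[i] for c, col in zip(coeffs, columns)) for i in range(n)]


# Redundant model (intercept + both indicators): two different coefficient
# triplets (alpha, beta, gamma) built from the made-up averages 65 and 70.
pred_a = predict((65, 0, 5), (ones, oxford, cambridge))
pred_b = predict((0, 65, 70), (ones, oxford, cambridge))

# Reduced model (intercept + Cambridge indicator only):
# alpha = Oxford average, beta = Cambridge average minus Oxford average.
pred_reduced = predict((65, 5), (ones, cambridge))

print(pred_a == pred_b)        # the two triplets are indistinguishable
print(pred_a == pred_reduced)  # dropping a column changes no prediction
```

<p>Both comparisons come out true: the redundant model cannot tell the two triplets apart, whereas in the reduced model each coefficient has exactly one job.</p><p>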
(The predictions given by the model won't change though.)</p><p>The vector interpretation is that if our dataset contains, say, 3 Oxford students followed by 2 Cambridge students, the (5D) data vectors in the first model will be</p><div cid="n306" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n306" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-29" type="math/tex; mode=display">\alpha \begin{pmatrix}1 \\ 1 \\ 1 \\ 1 \\ 1\end{pmatrix} + \beta \begin{pmatrix}1 \\ 1 \\ 1 \\ 0 \\ 0\end{pmatrix} + \gamma \begin{pmatrix}0 \\ 0 \\ 0 \\ 1 \\ 1\end{pmatrix}.</script></div></div><p>But these vectors aren't linearly independent: the last two vectors sum up to the first one, and therefore there will be many triplets <script type="math/tex">(\alpha, \beta, \gamma)</script> that give identical predictions.</p><h4>Linear fitting and MLE</h4><p>We talked about MLE being the holy grail of model fitting, and then about linear models and how fitting them comes down to a geometry problem. As it turns out, MLE lurks behind least squares estimation as well.</p><p>I mentioned earlier that linear models often assume a normal distribution for errors. Let's assume that, and do MLE.</p><p>Our model is that</p><div cid="n312" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n312" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-30" type="math/tex; mode=display">Y_i = c_1 e_{1,i} + ... + c_n e_{n,i} + \epsilon,</script></div></div><p>where <script type="math/tex">\epsilon \sim N(0,\sigma^2)</script> (i.e. 
follows a normal distribution with mean zero and standard deviation <script type="math/tex">\sigma</script>).</p><p>A useful property of normal distributions is that if we add a constant <script type="math/tex">c</script> to a normal distribution with mean <script type="math/tex">\mu</script>, the result has a normal distribution with mean <script type="math/tex">\mu + c</script> and the same standard deviation (this isn't true of all distributions!). Therefore we can write the above as</p><div cid="n315" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n315" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-31" type="math/tex; mode=display">Y_i \sim N(c_1 e_{1,i} + ... + c_n e_{n,i}, \sigma^2).</script></div></div><p>The likelihood for getting <script type="math/tex">y</script> is</p><div cid="n317" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n317" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-32" type="math/tex; mode=display">\Pr_Y(y;c_1...c_n, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2} \left( \frac{y - (c_1 e_{1,i} + ... + c_n e_{n,i})} {\sigma} \right)^2},</script></div></div><p>once again copying out the likelihood function for normal distributions.</p><p>Now remember that we just want to fit <script type="math/tex">c_1</script> through <script type="math/tex">c_n</script>. These only occur in the exponent, so we can ignore all the constants out front, and also we can see that since there's a negative in the exponent, maximising it is equivalent to minimising the stuff in the exponent. 
Taking out <script type="math/tex">\sigma</script> and constants, the relevant stuff to minimise is</p><div cid="n320" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n320" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-33" type="math/tex; mode=display">(y-(c_1 e_{1,i} + ... + c_n e_{n,i}))^2,</script></div></div><p>where we can see that the thing we subtract from <script type="math/tex">y</script> is our model's prediction of <script type="math/tex">y</script> (one component of what we previously denoted <script type="math/tex">\hat{\pmb{y}}</script>). Once again, we can see we're minimising a square of the error. Of course, we have many <script type="math/tex">y</script>-values to fit; to see that it's the sum of these that we minimise, rather than some other function of them, just note that if we take a logarithm we'll get a term like the above (times constants) for each data point we're using to fit.</p><p>So least-squares fitting comes from MLE and the assumption of normally distributed errors.</p><p>(Are errors normally distributed? Often yes. Remember though that our features are functions of things we measure; even if <script type="math/tex">x</script> has normally-distributed errors, after we apply an arbitrary function to it to generate feature <script type="math/tex">e</script>, the resulting <script type="math/tex">e</script> might not have normally distributed errors (but for many simple functions it still will). We could be more fancy, and devise other fitting procedures, but often least squares is good enough.)</p><h3>Empirical distributions</h3><p>What's the simplest probability model we can fit to a dataset? It's tempting to think of an answer like "a normal distribution", or "a linear model with one linear feature". 
But we can be even more radical: treat the dataset itself as a distribution.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-U2MRZRorb0c/X-3is6acicI/AAAAAAAACJ0/9LgERk6tfJA86hT_pXcQWbxDS_phNjPoQCLcBGAsYHQ/epdf.png" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="686" data-original-width="1278" height="344" src="https://lh3.googleusercontent.com/-U2MRZRorb0c/X-3is6acicI/AAAAAAAACJ0/9LgERk6tfJA86hT_pXcQWbxDS_phNjPoQCLcBGAsYHQ/w640-h344/epdf.png" width="640" /></a></div><p></p><p>On the left, we've plotted the number of data points that take different values of <script type="math/tex">x</script> (this is a discrete distribution; for a continuous distribution, the probability that any two samples drawn are equal is infinitesimal). On the right, all we've done is normalised the distribution, by rescaling the vertical axis so that the heights of all the bars sum to one. Once we've done that, we can go ahead and call it a probability distribution, and assign the meaning that the height of the bar at <script type="math/tex">x</script> is the probability that the distribution <script type="math/tex">X</script> that we've just defined takes the value <script type="math/tex">x</script>. This is called an empirical distribution.</p><p>Sampling from an empirical distribution is easy – just pick a value at random from the dataset. (Of course, the likelihood such a distribution assigns to any value not in the dataset is zero, which can be a problem for many use cases.)</p><p>In fact, you've probably already dealt with empirical distributions, at least implicitly. When you calculate the mean and variance of a dataset, you can interpret this as calculating the properties of the empirical distribution given by that dataset. 
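</p><p>As a rough Python sketch of this idea (the dataset is made up, and <code>empirical_pmf</code> is a hypothetical helper name), we can build the empirical distribution, compute its mean and variance, and sample from it:</p>

```python
import random
from collections import Counter

data = [1, 2, 2, 3]  # made-up dataset, purely for illustration


def empirical_pmf(values):
    """Map each distinct value to the fraction of data points equal to it."""
    counts = Counter(values)
    n = len(values)
    return {v: counts[v] / n for v in sorted(counts)}


pmf = empirical_pmf(data)

# Mean and variance computed as properties of the empirical distribution.
# The variance here is the divide-by-n kind, not the n-1 sample variance.
mean = sum(v * p for v, p in pmf.items())
variance = sum(p * (v - mean) ** 2 for v, p in pmf.items())

# Sampling from the empirical distribution: pick data points at random.
rng = random.Random(0)  # fixed seed only so that reruns give the same draws
samples = [rng.choice(data) for _ in range(5)]

print(pmf, mean, variance, samples)
```

<p>The mean and variance computed from the bar heights agree with what you would get by averaging over the raw data points directly, which is the sense in which those summary statistics are properties of the empirical distribution.</p><p>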
An empirical distribution as an abstract thing apart from your dataset may seem ad hoc, but it's not any less defined than a normal distribution.</p><p>The standard way to illustrate an empirical distribution is by plotting its cumulative distribution function (cdf); an empirical one is known as an ecdf. This is almost necessary for continuous variables. In general, the ecdf of a dataset is a very useful and general way to visualise it: it saves you from the pains of histograms (how large to make the bins? if you take logs or squares first, do you take them before or after binning? etc. etc.), and is also complete in the sense of technically displaying every point in the dataset.</p><p>The ecdf for the above distribution would look something like this:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://lh3.googleusercontent.com/-rDGG7QbrniA/X-3iwUtC67I/AAAAAAAACJ8/_chL29192uMytLKCYpPueOhoJtQdOrDNACLcBGAsYHQ/ecdf.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="530" data-original-width="1000" height="340" src="https://lh3.googleusercontent.com/-rDGG7QbrniA/X-3iwUtC67I/AAAAAAAACJ8/_chL29192uMytLKCYpPueOhoJtQdOrDNACLcBGAsYHQ/w640-h340/ecdf.png" width="640" /></a></div><p></p>(Like any cdf, it takes the value 0 up until the first data point and the value 1 after the last data point.) <p>If we now fit any parametric (i.e. non-empirical) distribution, comparing its cdf to the ecdf is a good test of how good the fit is.</p><h4>Measuring the goodness of a model fit with KL divergence</h4><p>The empirical distribution is the best possible fit to a given dataset, and therefore it's a good benchmark to measure the fit of a proposed model against.</p><p>Let's say our data is <script type="math/tex">x=x_1, ... ,x_n</script>, and the empirical distribution is <script type="math/tex">X^*</script>. 
The likelihood of drawing <script type="math/tex">x</script> from <script type="math/tex">X^*</script> is (under the assumption of each <script type="math/tex">x_i</script> being drawn independently)</p><div cid="n338" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n338" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" tabindex="-1"><div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-34" type="math/tex; mode=display">\Pr_{X^*}(x_1) \cdot ... \cdot \Pr_{X^*}(x_n).</script></div></div><p>Now <script type="math/tex">\Pr_{X^*}(x_i)</script> is just the fraction of the <script type="math/tex">x_j</script> in <script type="math/tex">x</script> that are equal to <script type="math/tex">x_i</script>. Writing <script type="math/tex">N_{x_i}</script> to mean the number of values equal to <script type="math/tex">x_i</script> in the data, we can write</p><div cid="n340" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n340" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-421" type="math/tex; mode=display">\Pr_{X^*}(x_i) = \frac{N_{x_i}}{n}.</script> </div></div><p>Taking logs, and writing <script type="math/tex">q_v = N_{v} / n = \Pr_{X^*}(v)</script>, the above product for the likelihood becomes a sum over the distinct values <script type="math/tex">v</script> taken by the <script type="math/tex">x_i</script>, giving the log likelihood:</p><div cid="n342" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n342" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-430" 
type="math/tex; mode=display">\sum_{v} N_{v} \log(q_v).</script> </div></div><p>Now we'll do one last trick, which is to scale by <script type="math/tex">1/n</script>; otherwise, the term in front of the log will tend to be bigger if we have more data points, while we want something that means the same regardless of how many data points there are. After we do that, we notice a nice symmetry:</p><div cid="n344" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n344" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-404" type="math/tex; mode=display">\sum_{v} q_v \log(q_v).</script> </div></div><p>This is a good baseline to compare any other model to. For example, let's say we fit to this a (discrete) distribution <script type="math/tex">X</script> (with the same sample space as <script type="math/tex">X^*</script>) with parameters <script type="math/tex">\theta</script>. 
Write <script type="math/tex">p_v = \Pr_X(v; \theta)</script>, and we can express the log likelihood of the dataset as</p><div cid="n346" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n346" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-433" type="math/tex; mode=display">\sum_{v} N_{v} \log(p_v).</script> </div></div><p>Normalising by <script type="math/tex">1/n</script> as before, we get</p><div cid="n348" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n348" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display" style="text-align: center;"></div><script id="MathJax-Element-435" type="math/tex; mode=display">\sum_{v} q_v \log(p_v).</script> </div></div><p>Now to get a measure of goodness of fit, just subtract, and do some algebra on top if you feel like it:</p><div cid="n350" class="mathjax-block md-end-block md-math-block md-rawblock" contenteditable="false" id="mathjax-n350" mdtype="math_block" spellcheck="false"> <div class="md-rawblock-container md-math-container" contenteditable="false" tabindex="-1"> <div class="MathJax_SVG_Display"></div><script id="MathJax-Element-437" type="math/tex; mode=display">\sum_{v} q_v \log(q_v) - \sum_{v} q_v \log(p_v) \\ = \sum_{v} q_v \log(q_v/p_v) \\ = \sum_{v} \Pr_{X^*}(v) \log\left(\frac{\Pr_{X^*}(v)}{\Pr_X(v;\theta)}\right).</script> </div></div><p>(In the last step, I've just expanded out our earlier definitions of <script type="math/tex">p_v</script> and <script type="math/tex">q_v</script>.)</p><p>This is called the Kullback-Leibler divergence (KL divergence). 
If <script type="math/tex">X=X^*</script>, then it comes out to 0; for worse fits, the value becomes greater.</p><p>There's a nice information theoretic interpretation of this result. <script type="math/tex">- \sum_{v} q_v \log_2(p_v)</script> is the average number of bits needed to most efficiently represent a value randomly drawn from the dataset, using a coding scheme optimised for the distribution <script type="math/tex">X</script>. </p><p> </p><p style="text-align: center;"><a href="http://strataoftheworld.blogspot.com/2021/01/data-science-2.html">Next post</a> <br /></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-66658972369188355342020-12-17T08:24:00.006+00:002022-03-31T22:59:01.819+01:00Review: Foragers, Farmers, and Fossil Fuels<p style="text-align: center;"><i><span style="font-size: x-small;">Book: Foragers, Farmers, and Fossil Fuels: How Human Values Evolve,</span><span style="font-size: x-small;"> by Ian Morris (2015)</span><span style="font-size: x-small;"><br />7.8k words (about 26 minutes)</span></i></p><p style="text-align: center;"><i><span style="font-size: x-small;"> </span></i></p><p style="text-align: center;"><i><span style="font-size: x-small;"> This post has also been published <a href="https://www.lesswrong.com/posts/nsFpCGPJ6dfk9uFkR/review-foragers-farmers-and-fossil-fuels">here</a>.</span></i><br /></p><p style="text-align: center;"> </p><p style="text-align: left;">Two hundred years ago, most people lived in societies that considered slavery, war, and discrimination based on class, ethnicity, and gender to be justifiable. Today, most people live in societies that hold the opposite beliefs.</p><p style="text-align: left;">What changed? 
A simple and tempting narrative is that we have simply become wiser; that various Enlightenment philosophers, thoughtful activists, and other principled people figured out that the pre-industrial moral order is wrong and managed to persuade everyone to change.</p><p style="text-align: left;">It is true that many smart and principled people had good ideas and that this was a big proximate driver of better values. But is it a coincidence that this change in values happened around the same time as the industrial revolution?</p><p style="text-align: left;">What about the previous economic revolution, the agricultural one? Did that also coincide with a change in the values that people held? The evidence says yes – foraging societies tend to be more accepting of violence and far less accepting of hierarchy than farming ones.</p><p style="text-align: left;">The argument of Ian Morris' <i>Foragers, Farmers, and Fossil Fuels</i> is that these timings are not a coincidence. Societies that change their main method of getting energy also change their values, because some sets of values give greater success for a certain type of society. Farming societies that stick to anti-hierarchical forager attitudes won't survive competition with farming societies that learn to believe in hierarchies (maybe they won't be economically competitive and won't be able to field as big an army to defend themselves as the god-king next door can field to conquer them). Likewise, industrial societies that stick to inflexible hierarchies and elite-focused economies can't compete with more equal democracies that don't squander the talents of the non-elite, and maintain a well-looked-after middle-class of rich consumers and educated workers.</p><p style="text-align: left;">We can contrast two ways of trying to explain the history of values. The first says that the history of values is a history of ideas; a battle of ideas against other ideas, waged in the minds of people. 
The second says that the history of values is a history of what works best. The battle is between the benefits conferred by believing in certain ideas and those conferred by other ones, and it is waged out in the real world, where empires fall or rise based on whether they value the things that will lead them to success.</p><p style="text-align: left;">It is clear that neither style of explanation is enough on its own. No matter how persuasive it can be made, a sufficiently destructive idea – as an extreme example, that everyone should commit suicide – will not find its adherents in charge of the future (or coming from the opposite direction: why do you think many religions are so big on the "be fruitful and multiply" point?). On the other hand, no matter how practically useful a certain idea is, someone has to have the idea and persuade other people to adopt it as a value before it has a chance of spreading because of its practical benefits.</p><p style="text-align: left;">The question, then, is just how far we can push the deterministic account, where the methods of energy capture constrain values. In Ian Morris' telling, the answer is surprisingly far, and if his account of the history of values is correct, I agree with him (in particular, the similarities of farming society values across continents are hard to explain otherwise). 
However, I think Morris, along with most people who advance or accept similar arguments, goes too far with the moral pragmatism that these ideas may be thought to imply.</p><p style="text-align: left;">But first: what values did foragers, farmers, and fossil fuel users actually hold, and what is Morris' energy-based explanation of the changes between them?</p><p style="text-align: left;"> </p><h3 style="text-align: left;">Foragers</h3><p style="text-align: left;">Everyone has some idea of what a forager or hunter-gatherer is, but since we want to deal with differences between foragers and farmers, we want a clear idea of where the line is. Morris cites a good definition by Catherine Panter-Brick: foragers are people who "exercise no deliberate alteration of the gene pool of exploited resources". If you plant and harvest a few naturally occurring plants, you're still a forager, but when you start refining the crops generation by generation or breeding the animals, that's the point when you become a farmer.</p><p style="text-align: left;">Of course, there is a vast amount of variance in culture, lifestyle, and values between different forager bands. To <a href="https://condor.depaul.edu/~mfiddler/hyphen/humunivers.htm">almost</a> every generalisation about foragers, there exists some tribe that does the opposite. However, Morris argues that for each main type of human society (foraging/farming/industrial), it is useful to talk about the average set of values such societies held or tended to develop towards, at least in terms of the broad categories of tolerance of political/economic/gender hierarchy and propensity to violence. This covers up lots of important questions – different societies may have justified violence under different circumstances, or had different reasons for why economic inequality was acceptable, but such differences are sucked up into one category and ignored in this sort of analysis. 
That this makes sense will become apparent once we see that foragers, farmers, and fossil fuel users can be sensibly compared and contrasted even at this very general level.</p><p style="text-align: left;">In some ways, forager values are familiar. Even among foragers, possession and ownership are big deals, with every item generally having an owner. In other ways, they're surprisingly different.</p><p style="text-align: left;">Take violence. Though it's very difficult to come up with exact figures for anything to do with foragers (ancient foragers left behind only bones and tools, and modern foragers only live in places that farmers didn't want, so might not be a representative sample), the chance of dying by murder may have been around 10% in an average forager tribe, compared to 0.7% today, 1-2% across the 1900s (including all wars), roughly 5% in your average farming society or in the most murderous countries of today, and 20% for Poland during World War II.</p><p style="text-align: left;">This was not recognised by anthropologists until the 1990s or so because, as Morris explains:</p><blockquote style="text-align: left;"><p><i>"[T]he social scale imposed by foraging is so small that even high rates of murder are difficult for outsiders to detect. If a band with a dozen members has a 10% rate of violent death, it will suffer roughly one homicide every twenty-five years, and since anthropologists rarely stay in the field for even twenty-five months, they will witness very few violent deaths."</i></p></blockquote><p style="text-align: left;">This is why Elizabeth Marshall Thomas' !Kung ethnography was called "The Harmless People", even though "their murder rate was much the same as what Detroit would endure at the peak of its crack cocaine epidemic".</p><p style="text-align: left;">Foragers are also extremely averse to hierarchy. 
Perhaps the best summary is given by a !Kung San forager asked about the absence of chiefs:</p><blockquote style="text-align: left;"><p><i>"Of course we have headmen! In fact we’re all headmen … Each one of us is headman over himself!"</i></p></blockquote><p style="text-align: left;">It's not just that foragers don't have strict hierarchies and this behaviour falls out naturally as a result; they are actively opposed to any sort of hierarchy or inequality. Material inequality is considered morally wrong, and fairness essential. Pressure to share spoils is applied liberally. And as in any group of humans, you'll have upstarts who try to achieve greatness and power, but such people usually have opposition groups immediately form to hold them back. Anthropologist Christopher Boehm calls these "reverse dominance hierarchies"; Morris translates this as "coalitions of losers".</p><p style="text-align: left;">The one sort of inequality that foragers aren't opposed to is gender inequality, with the dominant role in politics and violence generally falling to men (as an example of this attitude, Morris cites a forager of the Ona people (also known as the Selk'nam or Onawo) saying "the men are all captains and the women are sailors"). However, the gender inequality in forager societies is still on a different level from the extreme gender inequality and regimentation of farmer societies, and attitudes about sex were looser too. Morris writes that "abused wives regularly just walk away [...] without much fuss or criticism, and attitudes towards marital fidelity and premarital virginity tend to be quite relaxed".</p><p style="text-align: left;"> </p><h3 style="text-align: left;">Farmers</h3><p style="text-align: left;">As with foragers, Morris lumps together farming societies into one ideal type, labelled Agraria by Ernest Gellner. 
As before, this covers up a lot of variation (in particular, he identifies horticulturalists, city states like classical Athens or medieval Venice, and proto-industrial nations like Qing dynasty China, Mughal India, Ottoman Turkey, and Enlightenment Western Europe as the three extremes of Agraria), but Morris argues "the exceptions and sub-categories should not be allowed to obscure the reality of an ideal type representing in abstract terms the core features of peasant farming society". He cites Robert Redfield:</p><blockquote style="text-align: left;"><p><i>"[I]f a peasant from [any one of widely separated farming societies] could have been transported by some convenient genie to any one of the others and equipped with a knowledge of the language in the village to which he had been moved, he would very quickly come to feel at home. And this would be because the fundamental orientations of life would be unchanged. The compass of his career would continue to point to the same moral north."</i></p></blockquote><p style="text-align: left;">So what is the moral north of farming societies? Perhaps surprisingly, it's almost as hard to make definite conclusions about what anyone other than the elite thought in agrarian societies as it is to make conclusions about foragers.</p><p style="text-align: left;">While the elite read and wrote a lot, they didn't care much about what the peasants thought, and peasants were not literate. The most literate ancient societies – for example Athens in the 4th and 5th centuries BCE – had a <i>rudimentary</i> literacy rate of 10%, so one person in ten might be able to glean some meaning from words, but how well they could set down their thoughts on moral values is a different question. To get higher literacy rates, you have to move in time to the early second millennium, and in space to urban China or western Europe. 
Morris writes that "genuine mass literacy, with half or more of the population able to read simple sentences, belongs to the age of fossil fuels", and because of this, most of "our evidence for peasant experience comes from archaeology and accounts by twentieth-century anthropologists, rural sociologists, and development economists." If history is the written record of the past, then the majority of the population lived their lives outside history until the past century or two. (Perhaps we might even say that history in this sense only began with the internet age, when the private lives of everyone began being set down.)</p><p style="text-align: left;">Before going into the trickier question of values, we can compare foragers and farmers in some simple ways. First, farmers' energy consumption was higher. Foragers, like all humans, need to eat about eight and a half megajoules (2000 kilocalories) of energy as food per person per day to stay alive. Add cooking, and total energy consumption roughly doubles. The energy use of agrarian societies starts out at a forager level of around 20 MJ/person/day (5000 kcal), and goes up to the 100-150 MJ/person/day level (compare to 500 MJ/person/day (120 000 kcal), plus/minus a factor of two or so, for modern rich industrial nations).</p><p style="text-align: left;">Second, farming societies had very roughly half as many violent deaths as foragers, due to the existence of governments that at least occasionally kept the peace.</p><p style="text-align: left;">However, their life wasn't better on most metrics. In contrast to the literature (both then and now) full of "tales of vagabonds, wandering minstrels, and young men striking out to make their fortunes", "most farmers lived in worlds much smaller than most foragers had done, and never went much more than a day or two’s walk from the villages they were born in". 
Not only this, but:</p><blockquote style="text-align: left;"><p><i>"Excavated skeletons suggest that ancient farmers tended to suffer more than foragers from repetitive stress injuries; their teeth were often terrible, thanks to restricted diets heavy on sugary carbohydrates; and their stature, which is a fairly good proxy for overall nutrition, tended to fall slightly with the onset of agriculture, not increasing noticeably until the twentieth century AD."</i></p></blockquote><p style="text-align: left;">No farming society ever managed to escape the repeating cycles of population growth and starvation that foragers were also prone to, despite having more direct control over their food supplies. Populations would increase to keep pace with the good times until all farmers were slaving away to stay at subsistence levels given the crowdedness and quality of the land. Then many would starve to death when the bad times came.</p><p style="text-align: left;">Another trend across the history of farming societies is three things coinciding: energy consumption rises above 40 MJ/person/day (twice the minimum agrarian level and the typical forager level), towns grow past 10 000 people, and a few people take charge and start bossing around the others with their governments.</p><p style="text-align: left;">In farming societies, respect and reverence for hierarchy were internalised by nearly everyone. Morris writes that “[f]arming society often seemed obsessed with the symbolism of rank”, and twentieth century anthropologists "regularly found that having a healthy respect for authority – knowing your place – was a key part of their informants’ sense of themselves as good people". 
This often came, and still comes, as a surprise to non-farmers:</p><blockquote style="text-align: left;"><p><i>"[W]hen European reformers began venturing outside their urban enclaves into the countryside in the eighteenth century, they were often astonished that instead of complaining about inequality and demanding the redistribution of property, peasants largely took it as right and proper that most people were poor and weak while a few were rich and strong."</i></p></blockquote><p style="text-align: left;">Especially revered was the "Old Deal", Morris' term for the generalised social contract between classes in agrarian societies: that some have the duty to be commanders (or "shepherds of the people", in the preferred phrasing of many a king), others to obey those commands, and if everyone follows this script then things work fine.</p><p style="text-align: left;">Even when the powerful were questioned, the questioning didn't go as far as the Old Deal itself. In fact it rarely reached the king. “The tsar is good but the boyars [aristocrats] are bad", goes a Russian saying; even those who protested the powerful assumed that the highest levels of power must be good and holy, and the problems came from their will being incorrectly carried out by lesser lords. Even when the king himself came under fire, the Old Deal itself, or the inequality it entailed, were not questioned. 
The most common sort of rebellion against a king took what Morris calls a "good-old-days form": the justification was that the king had broken the Old Deal (or been abandoned by the gods or lost the Mandate of Heaven) and the urgent need was to restore the days when the <i>right</i> dictator was in charge, not abolish the dictatorship in the first place.</p><p style="text-align: left;">There were exceptions – in the 1640s some Chinese peasants called themselves "Levelling Kings" and went around questioning who gave their rulers the right to call them serfs, and of course there's the gradual English case and the rather more abrupt French case – but these only came when the societies in question started hitting energy consumptions of 150 MJ/person/day, the very highest end that agrarian societies could achieve without a full-on industrial revolution.</p><p style="text-align: left;">(Morris implies that the energy consumption is the cause. This seems backwards; an explanation running through the institutions and organisation needed to sustain this energy level seems much more reasonable. In general, perhaps when Morris talks about "energy consumption", you should read "the societal factors that enable higher energy consumption" in its place.)</p><p style="text-align: left;">Given how anti-hierarchy foragers were, how did this come to be? Were the peasants all forced into a rigid hierarchy by ruthless elites?</p><blockquote style="text-align: left;"><p><i>'“You may fool all the people some of the time; you can even fool some of the people all the time; but you can’t fool all the people all the time,” Abraham Lincoln is supposed to have said (unless it was P. T. Barnum). But Korsgaard and Seaford apparently think that Lincoln/Barnum was wrong, and that for ten thousand years everyone in Agraria was led by the nose—women by men, poor by rich, everyone by priests—and robbed blind. This I just cannot credit. 
Humans are the cleverest animals on the planet (for all we know, the cleverest in the whole universe). We have worked out the answers to almost every problem we have ever encountered. So how, if farming values were really just a trick perpetrated by wicked elites, did they survive for ten millennia? Most of the farmers I have met have been canny folk; so why could farmers in the past not figure out what was going on behind the wizard’s veil?</i></p><p><i>The answer, in my opinion, is that there was no veil. The veil is a figment of modern academics’ imaginations, made necessary by the assumption that only a tiny elite could possibly have thought that hierarchy was a good thing. In reality, farmers had farming values not because they fell for a trick but because they had common sense.'</i></p></blockquote><p style="text-align: left;">It is clearly a mistake to think that farmers participated in farming societies and their values through gritted teeth. However, I don't think it was so much farmers' common sense that made them adopt farming values. Societies that brainwashed their members into sincerely accepting farming-era hierarchies did better, and eventually all farming societies mastered this art. </p><p style="text-align: left;"> </p><h4 style="text-align: left;">Specific inequalities: forced labour and patriarchy</h4><p style="text-align: left;">In addition to the general extreme hierarchy of farming societies, there are two specific types of inequality that are both interesting in their causes and tragic in their consequences.</p><p style="text-align: left;">The first is slavery, and forced labour more generally. Both are almost entirely absent in foraging bands, which might take captives from other tribes but usually eventually integrate them into the tribe rather than keeping them forever as slaves. In contrast, some form of forced labour is found in almost every agrarian society.</p><p style="text-align: left;">Why? 
Because financial institutions weren't strong enough. Markets for labour existed almost everywhere, but there was a problem: “anyone who had enough land to support a family preferred to make a living by working it rather than by selling labor”, because, without reliable banks for everyone, keeping a good farm was the only robust way to accumulate and maintain wealth, especially for your children. When it was time for a big construction project (maybe the pharaoh died and you need a pyramid to bury him in), even wealthy employers like the state couldn't always hire enough workers. Often they resorted to violence to lower the costs of labour. Violence, after all, came cheap.</p><p style="text-align: left;">The second specific kind of inequality was male domination and strict gender roles. Morris offers a two-pronged explanation. First, farmer men had more reason to keep women under control than forager men did:</p><blockquote style="text-align: left;"><p><i>“The main reason that male foragers generally care less than male farmers about controlling women [...] is that foragers have much less to inherit than farmers. [...] [Q]uestions about the legitimacy of children matter a lot less than they do when only legitimate offspring will inherit land and capital.” </i></p></blockquote><p style="text-align: left;">(We might ask why farming societies were so strict about only legitimate offspring inheriting property, but perhaps this is a case of biological values limiting the space of cultural variation.)</p><p style="text-align: left;">Second, gender roles became more regimented out of necessity. Agricultural work – plowing, manuring, and irrigation – relies on brute upper body strength, which favours males. Farmers worked harder in general than foragers, so more male-specific strength-based work also pushed everything else – home upkeep (which foragers didn't need to do) and food processing – onto women. 
As early as 7000 BCE, skeletons from Syria suggest that both genders regularly carried heavy loads, but only women had an arthritic condition caused by kneeling and footwork, probably as a result of grinding grain.</p><p style="text-align: left;">Finally, child bearing is obviously restricted to women. With the advent of farming, the doubling time for populations fell by a factor of five, from ten thousand to two thousand years. <a href="https://ourworldindata.org/child-mortality-in-the-past">Infant mortality</a> seems not to have changed, so this is due to increased birth rates alone.</p><p style="text-align: left;">Morris writes that this decision on gender norms seems so obvious that "no farming society that moved beyond horticulture ever seems to have decided anything else". According to him, "if we sit theorizing in our fossil-fuel studies" we might imagine an alternative where women had the upper hand, "sending otherwise-useless men out to labor for them in the fields, but in reality, the organizational needs of farming societies gave men the means to inflict devastating economic pain on faithless wives while also raising the costs for men of failing to deter women from bringing cuckoos back to the nest". The empirical correlation between gender inequality and farming societies seems strong and Morris' arguments are plausible, but whether they're the final word is less clear.</p><p style="text-align: left;">Of course, you can't hold everyone down all the time. Morris lists many historical cases of people who were slaves and/or women, but nevertheless defied expectations and attained great success. 
For example, Morris tells the story of an Athenian slave banker called Pasion, who did so well that he was eventually not only able to buy his own freedom but also the bank itself.</p><p style="text-align: left;">(Interestingly, <a href="https://en.wikipedia.org/wiki/Pasion">Wikipedia</a> tells the story slightly differently, saying he was manumitted as a reward for his work, and inherited the bank after his former owners retired, rather than by buying it outright. Wikipedia cites the 1971 <i>Athenian Propertied Families</i> by J. K. Davies; Morris cites Edward Cohen's <i>Athenian Economy and Society</i> and Jeremy Trevett's <i>Apollodoros the Son of Pasion</i>, both from 1992. I don't know who to believe, or whether a consensus exists.)</p><p style="text-align: left;">Morris' harsh conclusion is that both forced labour and patriarchy were "functionally necessary to farming societies that generated more than 10k kcal/cap/day [42 MJ/cap/day]".</p><p style="text-align: left;"> </p><h3 style="text-align: left;">Fossil-fuel users</h3><p style="text-align: left;">Many places underwent the agricultural revolution independently of each other, because farming spread slowly enough that distant people could invent it on their own before the waves of someone else's discovery of farming reached them. In contrast, the industrial revolution happened in north-west Europe fast enough, and gave big enough advantages, that no other region had an independent industrial revolution.</p><p style="text-align: left;">The culture and values of the post-industrial West – democracy, human rights, individualism, market-orientedness, and so on – are often labelled Western. In some sense this is a tautology; by definition, these are the values that Western countries have at the moment. 
The label is also used in a deeper sense, to mean that there is some kernel of Westernness in these values that makes them the logical conclusion of pre-industrial Western thought, and perhaps incompatible with different cultural bases.</p><p style="text-align: left;">One consequence of Morris' arguments is that this perspective is wrong. What we might call Western values are no more Western values than farming-era values are Sumerian values (or Indus Valley values or Mesoamerican values or ...); the reason Western values are called Western values but farming values aren't called Sumerian values is that the industrial revolution spread faster than the agricultural one. To explain Western values we should look not at ancient Greek philosophers and whatnot but at the demands of industrialised societies. </p><p style="text-align: left;">This does not mean that every industrialised society will approach the West in its values, only that the pressures are there (and wily enough dictators or future technological trends may be enough to avoid them). It might also be that the reason that Europe underwent an industrial revolution while other societies at the edges of agrarian achievement did not is that, by accidents of history and geography, pre-industrial north-west European values were closer to modern industrial values than those of the other societies that have stood at the cusp of industrialisation.</p><p style="text-align: left;">But the overall conclusion remains: <a href="https://slatestarcodex.com/2016/07/25/how-the-west-was-won/">"Western" values are the universal values</a> that industrialised societies tend towards. The conflict between Boko Haram or the Taliban and the West, to use two of Morris' examples, is not so much a conflict of culture versus culture, but of era versus era; a last stand of the hierarchy- and patriarchy-obsessed farming values that were held by everyone (except a forager here or there) until a few hundred years ago. 
On a more granular level, the steady retreat of discrimination and formality from Western societies is simply the gradual acceptance that these vestiges of the farming era are no longer useful.</p><p style="text-align: left;">As with the transition to farming society, there's the question of how people eventually came to hold stances almost opposite to those of their ancestors. Unlike with the agricultural revolution, the question is especially pressing because the timescale of the changes is so short. But once again, a lot of it was driven by economics.</p><p style="text-align: left;">The first step was people moving from countryside farming to factory jobs:</p><blockquote style="text-align: left;"><p><i>"Nineteenth-century sources make it very clear that entering the wage-labor market could be a traumatic experience, requiring workers to submit to strict time discipline and factory conditions unlike anything they had known in the countryside; and yet millions chose to do so, because the alternative—hunger—was worse.</i></p><p><i>So eager were poor farmers for dirty, dangerous factory jobs that British employers only needed to increase wages by 5 percent (in real terms) between 1780 and 1830, although output per worker grew by 25 percent. Wage increases accelerated only in the 1830s, and even then only for urban workers. The great motor was productivity, which was now rising so high that employers began finding it cheaper to share some of their profits with their workers than to try to break strikes. (In another great irony, by the time that Dickens, Marx, and Engels were writing, wages were rising faster than ever before in history.) For the next fifty years, wages rose as fast as productivity; after 1880, they rose even faster. 
By then, incomes were beginning to rise in the countryside too.”</i></p></blockquote><p style="text-align: left;">One resulting value change was the abolition of forced labour:</p><blockquote style="text-align: left;"><p><i>“By making wage labour attractive enough to draw in millions of free workers, higher wages made forced labor less necessary, and because impoverished serfs and slaves—unlike the increasingly prosperous wage labourers—could rarely buy the manufactured goods being churned out by factories, forced labour increasingly struck business interests as an obstacle to growth (especially when it was competitors who were using it).”</i></p></blockquote><p style="text-align: left;">The farmer-era justifications for gender hierarchy also broke down. First, industrialised societies had less need for brute strength and more need for organisational work, in which there is no gender disparity. Second, birth rates eventually went down, reducing the amount of time women spent on children. As a result, almost universal male dominance during the farming era has given way to a world where 81% of people say gender equality is important, including 98% in Britain but also over 90% of Indonesians and Turks and even 78% of Iranians (India, with a very low 60% and a huge population, is probably the biggest drag on the average).</p><p style="text-align: left;">Morris offers a great summary of the principles of success in agrarian versus industrial societies:</p><blockquote style="text-align: left;"><p><i>“Agraria had worked by drawing lines, not just between elite and mass or men and women, but also between believers and nonbelievers, pure and defiled, free and slave, and countless other categories. Each group was assigned its place in a complex hierarchy of mutual obligations and privileges, tied together by the Old Deal and guaranteed by the gods and the threat of violence. Fossil-fuel societies, however, work best by erasing lines. 
The more a group replaces the rigid structure of figure 3.6 with the anti-structure of figure 4.7—a completely empty box, made up of interchangeable citizens—the bigger and more efficient its markets will be and the better it will function in the fossil-fuel world.”</i></p></blockquote><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-iv_1_ZRRHZE/X9sUmDg8p0I/AAAAAAAACAA/dcLobQYU-0UJZGKgb7zk5yfTtgEi_0Z5wCLcBGAsYHQ/s1074/agraria.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1036" data-original-width="1074" height="617" src="https://1.bp.blogspot.com/-iv_1_ZRRHZE/X9sUmDg8p0I/AAAAAAAACAA/dcLobQYU-0UJZGKgb7zk5yfTtgEi_0Z5wCLcBGAsYHQ/w640-h617/agraria.png" width="640" /></a></div><br /><p style="text-align: left;">The most successful agrarian societies have a social structure like the one above; the most successful industrial societies look like this instead:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Od3tKKJX0BE/X9sU5tjHsRI/AAAAAAAACAY/fL1F4M5b1hoz658MhCCf3Av-oI6mb1K7gCLcBGAsYHQ/s952/industria.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="948" data-original-width="952" height="638" src="https://1.bp.blogspot.com/-Od3tKKJX0BE/X9sU5tjHsRI/AAAAAAAACAY/fL1F4M5b1hoz658MhCCf3Av-oI6mb1K7gCLcBGAsYHQ/w640-h638/industria.png" width="640" /></a></div>This, in a nutshell, is why agrarian societies tend towards extreme hierarchy while industrial societies tend towards a social structure of interchangeable mobile individuals, free to do what they want and incentivised to slot themselves wherever they create the most value (at least economically). <p style="text-align: left;">With industrialisation, we've managed to roll back the discrimination and hierarchy of the farming age. 
We've even gone back to valuing fairly flat political hierarchies like the foragers (though we maintain them through democratic institutions rather than "coalitions of losers"), and become more egalitarian about gender than the foragers were, all the while living in societies far less violent than the average hunter-gatherer band.</p><p style="text-align: left;">There is one area where we're more tolerant of hierarchy than foragers, though: economic inequality. Once again the reason is practical: </p><blockquote style="text-align: left;"><p><i>"[...] Industria can flourish only if it has affluent middle and working classes that create effective demand for all the goods and services that fossil-fuel economies generate, but on the other, it also needs a dynamic entrepreneurial class that expects material rewards for providing leadership and management. In response, fossil-fuel values have evolved across the last two hundred years to favor government intervention to reduce wealth equality—but not too much.”</i></p></blockquote><p style="text-align: left;">However, even then we still abhor the farmer-era standard of seeing it as fair when the elite extract as much as they can from everyone under them. In fact, the mere fact that calling elites extractive has become a good political weapon shows how far we've come – as discussed in the farming section, farming-era people saw ruthlessly extractive elites as part of a fair social contract.</p><p style="text-align: left;"> </p><h3 style="text-align: left;">A summary of value evolution?</h3><p style="text-align: left;">We've just gone over a lot of detail about forager, farmer, and fossil-fuel user values, and some reasons why values might have developed in the way they did. Is this a story of a random path through the stages of technological development, with harsh selection pressures making sure that societal values are dragged along for the ride? 
Or is there some pattern to the madness?</p><p style="text-align: left;">Morris' summary table does a good job of summing up the "what" of it:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-pgvDLzeIs0Q/X9sVC_1IOnI/AAAAAAAACAc/dA_v71VzNyYr5MPqzX32kq2akNKhZ1EQwCLcBGAsYHQ/s1118/summarytable.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="622" data-original-width="1118" height="357" src="https://1.bp.blogspot.com/-pgvDLzeIs0Q/X9sVC_1IOnI/AAAAAAAACAc/dA_v71VzNyYr5MPqzX32kq2akNKhZ1EQwCLcBGAsYHQ/w640-h357/summarytable.png" width="640" /></a></div><p style="text-align: left;">Two things leap out from this table, especially if we plot it graphically: when it comes to attitudes towards hierarchy, fossil-fuel users are much closer to foragers than farmers are to anyone, and violence has gone down all along.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-Pk02LU-xBAU/X9sVKg_tCoI/AAAAAAAACAk/zsRk0_bGIxoWxhcWYxB97t_oVebrlKgewCLcBGAsYHQ/s1778/graph.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="994" data-original-width="1778" height="358" src="https://1.bp.blogspot.com/-Pk02LU-xBAU/X9sVKg_tCoI/AAAAAAAACAk/zsRk0_bGIxoWxhcWYxB97t_oVebrlKgewCLcBGAsYHQ/w640-h358/graph.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">(Slide from a talk I gave at EA Cambridge)<br /></td></tr></tbody></table><p style="text-align: left;"> </p><p style="text-align: left;">Other people have noticed this; economist and futurist Robin Hanson has written about the modern conservative-liberal axis mapping onto how willing people are to abandon farming ways and revert to more forager-like lifestyles and values as societies grow richer (as some people inexplicably prefer 
writing in digestible chunks rather than monolithic book-length blog posts, it's hard to give just one or two key links, but see for example <a href="https://www.overcomingbias.com/2012/05/forager-vs-farmer-morality.html">here</a>, <a href="https://www.overcomingbias.com/2017/08/forager-v-farmer-elaborated.html">here</a>, <a href="https://www.overcomingbias.com/2015/08/specific-vs-general-foragers-farmers.html">here</a>, and <a href="https://www.overcomingbias.com/2010/10/two-types-of-people.html">here</a>). </p><p style="text-align: left;">Perhaps we can tell a story like this: in the beginning there were foragers. They tended to live as people tend to do, and value the things that evolution had crafted people to want. Humans being humans, there was a lot of politicking, and with no institutions to restrain it, a fair amount of violence. The outside world was harsh and outside anyone's control.</p><p style="text-align: left;">Then the agricultural revolution slowly crept across the world. At first people lived as before, but generation by generation it turned out that the societies that managed to best persuade people to accept a bit more hierarchy – to show a bit more obedience to the chiefs, grant a bit less non-reproductive status to women – did a bit better than the others. Over millennia, such societies either had their tricks independently discovered or copied by others, or else outright went on the warpath to subjugate other societies to their rule – and, of course, preach their values, which (given human adaptability) they held sincerely, and with no idea that they thought differently from their distant ancestors. Eventually, the big tricks – organised religion and the god-kings keeping power by letting their henchmen extract as much as they could from their subjects – became almost universal. 
They also lowered the level of violence by imposing some amount of internal order and perhaps a culture promoting peaceful conflict resolution, if only to spare more strength to throw at neighbouring societies.</p><p style="text-align: left;">Then came the industrial revolution, and suddenly what mattered was how well a society could harness the talents of its members and establish efficient, competitive markets to drive innovation. This created pressures to democratise and erase lines between people. Technology and wealth also increased people's ability to control their lives. Rich and comfortable industrialised people no longer needed to abide by strict farming-era social rules to survive, and so slowly gave up on them, reverting to more forager-like ways, though with the added advantages of unprecedented peace and material wellbeing. </p><p style="text-align: left;"> </p><h3 style="text-align: left;">How selection pressures change values</h3><p style="text-align: left;">The reasons why societies tend to adopt pragmatic values are subtle; it's not as if people go around cynically holding the values that will best contribute to their tribe's or society's long-term success. As a result, Morris' descriptions of how selection pressures do their work are worth quoting at length.</p><p style="text-align: left;">First, here's how farmers ended up dominating the world in the first place:</p><blockquote style="text-align: left;"><p><i>“The first farmers had free will, just like us. As their families grew, their landscapes filled up. […] For all we know, some foragers in the Jordan Valley ten thousand years ago [chose to remain foragers]. The problem, though, was that they were not making a one-time choice. Tens of thousands of other people were asking the same question, and each family had to revisit the decision of whether to intensify or go hungry multiple times every year. 
Most important of all, each time one family chose to work harder and intensify its management of plants and animals, the payoffs from sticking with the old ways declined a little further for everyone else. Every time cultivators started thinking of the plants and animals on which they lavished care and attention as their personal gardens and flocks, not part of a common stock, hunting and gathering would become that much more difficult for those who stuck to it. Foragers who clung stubbornly and/or heroically to the old ways were doomed because the odds kept tilting against them.”</i></p></blockquote><p style="text-align: left;">But how did this result in a world of dictator kings? Morris:</p><blockquote style="text-align: left;"><p><i>“We should probably assume that people tried lots of different ways to solve the collective action problem of how to create larger, more integrated societies with more complex divisions of labor as they moved from foraging to farming, but almost everywhere, it seems that the solution that worked best was the idea of the godlike king.”</i></p></blockquote><p style="text-align: left;">Morris isn't very clear on why godlike kings, out of all possible forms of social organisation, worked best. 
We can imagine that it's hard to coordinate big armies for defence or offence without one, or that the symbolism of a godlike figurehead is the most reliable way to unite masses in a largely illiterate society, or vaguely gesture like Morris at the challenges of managing complex societies, but there doesn't seem to be much hard evidence or reason for a precise mechanism one way or the other, at least in <i>Foragers, Farmers, and Fossil Fuels</i>.</p><p style="text-align: left;">In general, <a href="https://en.wikipedia.org/wiki/Collective_action_problem">collective action problems</a> are important in any large organisation, and the simplest solution is complete centralisation; effectively reducing collective action problems back into individual action problems. Of course, this comes with all the cruelties and inefficiencies of real-world non-omnibenevolent, non-omniscient centralised decision-making. Given this, was the centralisation-vs-decentralisation tradeoff really so simple in the farming era that "godlike kings everywhere" was the only effective answer? Perhaps the tradeoffs really were that one-sided in the farming age, and this became a trickier question only in the industrial age when nurturing human talent and prosperity became key societal goals, and we created effective decentralised institutions like free markets and democracy. 
Or maybe there was a high but not extreme level of optimal centralisation, but the greed of individual rulers often pushed their societies past this level despite selection pressures working in favour of more responsibly led societies, and it was only with the industrial age that these pressures became high enough to force the world away from the godlike king model.</p><p style="text-align: left;">Morris also describes the rise of capitalism:</p><blockquote style="text-align: left;"><p><i>“Capitalism took off in early-modern Western Europe because practical people figured out that this was the most effective way to get things done in an increasingly energy-rich world. Other people disagreed, and did things differently. Conflicts and compromises ensued as the competitive logic of cultural evolution went to work and drove the less effective ways extinct.”</i></p></blockquote><p style="text-align: left;">Once again, I think the concept of selection pressures is a powerful lens, but the details of what drives the relationship are missing. What exactly was it about an energy-rich environment that made capitalism ideal? Even by Morris' own account, it seems the methods (e.g. complex manufacturing chains, mature financial institutions, etc.) required to most effectively extract and use energy given a particular technology level are what matter, not the raw total of joules consumed per person per day.</p><p style="text-align: left;"> </p><h3 style="text-align: left;">Respondents</h3><p style="text-align: left;"><i>Foragers, Farmers, and Fossil Fuels</i> originated from the Tanner Lectures at Princeton. 
As part of the format, the book includes four responses to Morris' arguments, by Richard Seaford, Jonathan Spence, Christine Korsgaard, and Margaret Atwood.</p><p style="text-align: left;">On the whole, these responses don't add much to the book, though they are helpful in making Morris elaborate on his arguments in the final chapter (cheekily entitled "My Correct Views on Everything"). </p><p style="text-align: left;">Seaford and Spence provide short chapters that seem to be more about their own interests than Morris' arguments, and have the tone of questions asked by professors who slept through the talk but are still trying to say something insightful at the questions session.</p><p style="text-align: left;">Atwood, of <i>The Handmaid's Tale</i> fame, brings an arsenal of literary flair to bear on the task. She manages to make some good points (what about horse-riding pastoralists, who may have been the first large-scale war-makers?), along with some ridiculous statements:</p><blockquote style="text-align: left;"><p><i>“Several billion years ago, marine algae produced the atmosphere that allows us to breathe, and these algae continue to produce from 60 to 80 percent of our oxygen. Without marine algae, we ourselves cannot survive. During the Vietnam War, huge vats of Agent Orange were being shipped across the Pacific. Should they have sunk and leaked, we would not be having this conversation today.”</i></p></blockquote><p style="text-align: left;">Let's do some very rough calculations. If all the Agent Orange deployed in Vietnam had been uniformly distributed across the Pacific, the mass concentration of its component acids (making the highest assumptions about what concentration it was sprayed at) would have been lower than one part in tens of trillions, a hundred thousand times lower than the mass concentrations of either lead or mercury already in the oceans. 
I couldn't find any study of what happens to algae in oceans if you dump Agent Orange on them, but one article about using algaecide in swimming pools says applying one ten-thousandth of the pool volume is typical. Another article mentions 5-10% as a common concentration, giving an algae-killing active ingredient concentration of maybe 1 in 100 000 in water. Agent Orange would need to kill algae at ten million times lower concentrations in oceans than commercial algaecide does in swimming pools for the Pacific's oxygen production to be destroyed.</p><p style="text-align: left;"> (Or maybe Atwood means the literal sense that, because of various butterfly effects, any such change in history makes any present event, including this conversation, unlikely?)</p><p style="text-align: left;">By far the most substantive response comes from the philosopher Christine Korsgaard. She also has the idea that the farming era was an aberration, with a fresh interpretation:</p><blockquote style="text-align: left;"><p><i>“Instead of thinking that values are determined by modes of energy capture, perhaps we should think that as human beings began to be in a position to amass power and property in the agricultural age, forms of ideology set in that distorted real moral values [i.e. the values a society should hold], distortions that we are only now, in the age of science and extensive literacy, beginning to overcome.”</i></p></blockquote><p style="text-align: left;">More significantly, she makes a distinction between the values a society holds and values that should be held (“positive values” and “real moral values” respectively), in contrast to Morris' arguments that such a distinction is meaningless and the only real distinction is between biological values and the form they take in a given society. 
Her response manages to pick away at Morris' nonchalant bulldozing of all philosophical subtleties.</p><p style="text-align: left;">Responding to this in the last chapter, Morris quotes, and then dismisses, Ernest Gellner's response to a social theory presentation at an archaeology conference: "They tell me you're a good archaeologist, so why are you trying to be a bad philosopher?". Perhaps he should have taken the question more to heart.</p><p style="text-align: left;"> </p><h3 style="text-align: left;">The future</h3><p style="text-align: left;">The experiment of how to switch from foraging to farming was run many times. Forager bands in many places adopted farming techniques. Some of them had good ideas about how to structure their now-farming societies and succeeded, while others had bad ideas and perished, or were forced to copy techniques from the more successful.</p><p style="text-align: left;">In contrast, today the entire world has been thrust into the industrial age in the space of a few hundred years. There is only one experiment going on, and only one chance to get it right. There's no one to copy from to see what we should do, and no one to pick up the job if our attempt fails.</p><p style="text-align: left;">A successful transition to the industrial world, and whatever we might mark as the next step after that, is therefore less certain than the successful transition from foragers to farmers. The values that industrial life imposes on us might be better than those of the farming age, but it is not yet clear if they will become as universal as hierarchies and kings once were.</p><p style="text-align: left;">(Better by which standard? I think humans are similar enough that there is a <a href="https://strataoftheworld.blogspot.com/2020/08/ea-ideas-4-utilitarianism.html">context-independent universal human ethical framework</a>.)</p><p style="text-align: left;">Morris' arguments also lead to the question of how values might change in the future. 
Will the set of values that a society tends towards continue to improve as technology and wealth increase, or is the cuddliness of industrial values (compared to farming ones) a fluke?</p><p style="text-align: left;">The significance of <i>Foragers, Farmers, and Fossil Fuels</i> for this question is that we won't necessarily be the ones deciding. Over a span of years or decades, we can maintain our values through argument and education. Over a span of centuries, though, we can argue all we like, just as countless luddites and aristocrats railed against industrial/Western values, but if the game has changed and someone else's values make them play it better, it won't be enough. The harsh logic of evolution-like selection pressures can't be resisted forever; those that are best at spreading themselves into the future will eventually claim it.</p><p style="text-align: left;">Yuval Noah Harari, author of <i>Sapiens</i>, says that once we can engineer desires, the question is not "what do we want to become?", but "what do we want to want?". Morris counters that the real question is instead "what are we going to want, whether we want it or not?", and his answer is bleak yet pragmatic: "each age gets the thought it needs" ("needs" referring to "survival needs").</p><p style="text-align: left;">I don't think we need to be either nihilistic (in thinking that every set of societal values is as good as any other; some do a better job of serving universal human wants) or pessimistic (in thinking that we can't do anything about a slide to worse values; we've never had more control over the future of our world).</p><p style="text-align: left;">Morris writes:</p><blockquote style="text-align: left;"><p><i>“Trying to imagine people who are somehow divorced from the demands of capturing energy and then speculating about what their moral values would be is an odd activity.”</i></p></blockquote><p style="text-align: left;">I disagree. 
Of course we can imagine people living without being constrained by energy needs. How many science fiction writers or futurists <i>haven't</i> imagined a post-scarcity society?</p><p style="text-align: left;">In fact, aren't we well on our way towards such a world? Forager and farmer lives were significantly shaped by the need to get food, water, light, and warmth. Today in developed countries, these aren't free, but our lives aren't shaped by worrying about them. Sure, you need to work a job, but what you worry about in the job is likely very far separated from survival needs, and provided you have one and aren't massively wasteful, the water and light flow exactly as you want them. Technological progress removes difficulty and scarcity. Ultimately, there's no physical limit stopping us from removing scarcity considerations from our lives (or, more precisely, making them trivial enough that we don't need to worry about them; nothing is ever entirely free in this universe).</p><p style="text-align: left;">Once we've done so, we no longer have to make compromises between what we should do and what we as a society are forced to value in order to survive. And so I think it is reasonable to imagine humans whose values aren't warped by survival needs; in fact such values might be good ones to aim for.</p><p style="text-align: left;">(Or maybe the need to focus at least a bit on survival is the one anchor to objective reality that prevents societies from losing themselves entirely to petty politicking and status games.)</p><p style="text-align: left;">Of course, there's always the problem of competition. 
What happens to our happy post-scarcity society when the people next door ratchet up the competition, say by throwing off all the safeguards around capitalism, or developing AIs or nanomachines or <a href="https://slatestarcodex.com/2016/05/28/book-review-age-of-em/">Robin Hanson's emulated minds</a>, and then outcompeting us by adopting values more suitable to exploiting those technologies? Even if we ourselves don't suffer – say we have a big enough wall – in the long run we'd give up the rest of the world (or solar system or galaxy) to the pragmatic-valued competitors. At best, the long-term future looks like an oasis of human flourishing, surrounded by a galaxy-spanning alien economy with weird but morally neutral ways. (Imagine a forager tribe considering the massive and weird industrialised world around them; now imagine we're the foragers.) At worst, any good in our oasis would be outweighed by the morally bad machinations that fuel the endless growth of that weird galaxy-spanning alien economy.</p><p style="text-align: left;">So will we be forced to compromise ever more and more to avoid being outrun by those with fewer scruples about changing their values? 
Or can we build a world where human values are a winning strategy?</p><p style="text-align: left;">Looking at our <a href="http://strataoftheworld.blogspot.com/2018/08/review-enlightenment-now-steven-pinker.html">track record</a>, I think we have a chance.</p><p style="text-align: left;"> </p><p style="text-align: center;"><i><b>Related:</b></i><i><a href="http://strataoftheworld.blogspot.com/2019/09/growth-and-civilisation.html"><br />Growth and civilisation</a></i><br /></p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-1697673368059564013.post-86168872585015707722020-08-10T17:43:00.004+01:002022-08-20T23:06:24.264+01:00EA ideas 4: utilitarianism<p style="text-align: center;"><font size="2"><i>4.9k words (≈17 minutes)</i></font></p><p style="text-align: center;"><font size="2"><i><span style="font-size: small;"> Posts in this series:</span><br /></i><a href="https://strataoftheworld.blogspot.com/2020/07/ea-ideas-1-rigour-and-opportunity-in.html">FIRST</a> | <a href="https://strataoftheworld.blogspot.com/2020/07/ea-ideas-3-uncertainty.html">PREVIOUS</a> | NEXT<i><br /></i></font></p><p>Many ideas in effective altruism (EA) do not require a particular moral theory. However, while there is no common EA moral theory, much EA moral thinking leans consequentialist (i.e. morality is fundamentally about consequences), and often specifically utilitarian (i.e. wellbeing and/or preference fulfilment are the consequences we care about).</p> <p>Utilitarian morality can be thought of as rigorous humanism, where by “humanism” I mean the general post-Enlightenment secular value system that emphasises caring about people, rather than upholding, say, religious rules or the honour of nations. Assume that the welfare of a conscious mind matters. 
Assume that our moral system should be impartial: that wellbeing/preferences should count the same regardless of who has them, and also in the sense of being indifferent to whose perspective it is being wielded from (for example, a moral system that says to only value yourself would give you different advice than it gives me). The simplest conclusion you can draw from these assumptions is to consider welfare to be good and seek to increase it.</p> <p>I will largely ignore differences between the different types of utilitarianism. Examples of divisions within utilitarianism include preference vs hedonic/classical utilitarianism (do we care about the total satisfied preferences, or the total wellbeing; how different are these?) and act vs rule utilitarianism (is the right act the one with the greatest good as its consequence, or the one that conforms to a rule which produces the greatest good as its consequences – and, once again, are they different?).</p><p><br /></p> <h3>Utilitarianism is decisive</h3> <p>We want to do things that are “good”, so we have to define what we mean by it. But once we’ve done this, this concept of good is of no help unless it lets us make decisions on how to act. I will refer to the general property of a moral system being capable of making non-paradoxical decisions as decisiveness.</p> <p>Decisiveness can fail if a moral system leads to contradiction. Imagine a deontological system with the rules “do not lie” and “do not take actions that result in someone dying”. Now consider the classic thought experiment of what such a deontologist would do if the Gestapo knocked on their door and asked if they’re hiding any Jews. A tangle of absolute rules almost ensures the existence of some case where they cannot all be satisfied, or where following them strictly will cause immense harm.</p> <p>Decisiveness fails if our system allows circular preferences, since then you cannot make a consistent choice. 
Imagine you follow a moral system that says volunteering at a soup kitchen is better than helping old people across the street, collecting money for charity is better than soup kitchen volunteering, and helping old people across the street is better than collecting money. You arrive at the soup kitchen and decide to immediately walk out to go collect money. You stop collecting money to help an old person across the street. Halfway through, you abandon them and run off back to the soup kitchen.</p> <p>Decisiveness fails if there are tradeoffs our system cannot make. Imagine highway engineers deciding whether to bulldoze an important forest ecosystem or a historical monument considered sacred. If your moral system cannot weigh environment against historical artefacts (and economic growth, and the time of commuters, and …), it is not decisive. </p> <p>So for any two choices, a decisive moral system must be able to compare them, and the comparisons it makes cannot be circular. This implies a ranking: X is better than Y translates to X is before Y in the ranking list.</p> <p>(If we allow circular preferences, we obviously can’t make a list, since the graph of “better-than” relations would include cycles. If there are tradeoffs we can’t make – X and Y such that X is neither better than, equal to, nor worse than Y – we can generate a ranking list but not a unique one (in set theory terms, we have a partial order rather than a total order).)</p> <p>Decisiveness also fails if our system can’t handle numbers. It is better to be happy for two minutes than for one minute, and better to be happy for one minute than for fifty-nine seconds. More generally, to practically any good we can either add or subtract a bit: one more happy thought, one less bit of pain.</p> <p>Therefore a decisive moral system must rank all possible choices (or actions or world states or whatever), with no circular preferences, and with arbitrarily many notches between each ranking. 
It sounds like what we need is numbers: if we can assign a number to choices, then there must exist a non-circular ranking (you can always sort numbers), and there’s no problem with handling the quantitativeness of many moral questions.</p> <p>There can’t be one axis to measure the value of pleasure, one to measure meaning, and another for art. Or there can – but at the most basic level of moral decision-making, we must be able to project everything onto the same scale, or else we’re doomed to have important moral questions where we can only shrug our shoulders. This leads to the idea of all moral questions being decidable by comparing how the alternatives measure up in terms of “utility”, the abstract unit of the basic value axis.</p> <p>You might say that requiring this extreme level of decisiveness may sometimes be necessary in practice, but it’s not what morality is about; perhaps moral philosophy should concern itself with high-minded philosophical debates over the nature of goodness, not ranking the preferability of everything. Alright, have it your way. But since being able to rank tricky “ought”-questions is still important, we’ll make a new word for this discipline: fnergality. You can replace “morality” or “ethics” with “fnergality” in the previous argument and in the rest of this post, and the points will still stand.</p><p><br /></p> <h3>What is utility?</h3> <p>So far, we have argued that a helpful moral system is decisive, and that this implies it needs a single utility scale for weighing all options. </p> <p>I have not specified what utility is. Without this definition, utilitarianism is not decisive at all.</p> <p>How you define utility will depend on which version of utilitarianism you endorse. 
The basic theme across all versions of utilitarianism is that utility is assigned without prejudice against arbitrary factors (like location, appearance, or being someone other than the one who is assigning utilities), and is related to ideas of welfare and preference.</p> <p>A hedonic utilitarian might define the utility of a state of the world as total wellbeing minus total suffering across all sentient minds. A preference utilitarian might ascribe utility to each instance of a sentient mind having a preference fulfilled or denied, depending on the weight of the preference (not being killed is likely a deeper wish than hearing a funny joke), and the sentience of the preferrer (a human’s preference is generally more important than a cat’s). Both would likely want to maximise the total utility that exists over the entire future.</p> <p>These definitions leave a lot of questions unanswered. For example, take the hedonic utilitarian definition. What is wellbeing? What is suffering? Exactly how many wellbeing units are being experienced per second by a particular jogger blissfully running through the early morning fog?</p> <p>The fact that we can’t answer “4.7, ±0.5 depending on how runny their nose is” doesn’t mean utilitarianism is useless. First, we might say that an answer exists in principle, even if we can’t figure it out. For example, a hedonic utilitarian might say that there is some way to calculate the net wellbeing experienced by any sentient mind. Maybe it requires knowing every detail of their brain activity, or a complete theory of what consciousness is. But – critically – these are factual questions, not moral ones. There would be moral judgements involved in specifying exactly how to carry out this calculation, or how to interpret the theory of consciousness. 
There would also be disagreements, in the same way that preference and hedonic utilitarians disagree today (and it is a bad idea to specify one Ultimate Goodness Function and declare morality solved forever). But in theory and given enough knowledge, a hedonic utilitarian theory could be made precise.</p> <p>Second, even if we can only approximate utilities, doing so is still an important part of difficult real-world decision-making.</p> <p>For example, Quality- and Disability-Adjusted Life Years (<a href="https://en.wikipedia.org/wiki/Quality-adjusted_life_year">QALYs</a> and <a href="https://en.wikipedia.org/wiki/Disability-adjusted_life_year">DALYs</a>) try to put a number on the value of a year of life with some disease burden. Obviously it is not an easy judgement to make (usually the judgement is made by having a lot of people answer carefully designed questions on a survey), and the results are far more imprecise than the 3-significant-figure numbers in the table <a href="https://www.who.int/healthinfo/statistics/GlobalDALYmethods_2000_2011.pdf">on page 17 here</a> would suggest. However, the principle that we should ask people and do studies to try to figure out how much they’re suffering, and then make the decisions that reduce suffering the most across all people, seems like the most fair and just way to make medical decisions.</p> <p>Using QALYs may seem coldly numerical, but if you care about reducing suffering, not just as a lofty abstract statement but as a practical goal, you will care about every second. It can also be hard to accept QALY-based judgements, especially if they prefer others to people close to you. 
However, taking an impartial moral view, it is hard not to accept that the greatest good is better than a lesser good that includes you.</p> <p>(Using opposition to QALYs as an example, Robin Hanson <a href="https://www.overcomingbias.com/2019/05/simplerules.html">argues with his characteristic bluntness</a> that people favour discretion over mathematical precision in their systems and principles “as a way to promote an informal favoritism from which they expect to benefit”. In addition to the ease of sounding just and wise while repeating vague platitudes, this may be a reason why the decisiveness and precision of utilitarianism become disadvantages on the PR side of things.)</p><p><br /></p> <h3>Morality is everywhere</h3> <p>By achieving decisiveness, utilitarianism makes every choice a moral one.</p> <p>One possible understanding of morality is that it splits actions into three planes. There are rules for what to do (“remember the sabbath day”). There are rules for what not to do (“thou shalt not kill, and if thy doest, thy goeth to hell”). And then there’s the earthly realm, of questions like whether to have sausages for dinner, which – thankfully – morality, god, and your local preacher have nothing to say about.</p> <p>Utilitarianism says sausages are a moral issue. Not a very important one, true, but the happiness you get from eating them, your preferences one way or the other, and the increased risk of heart attack thirty years from now, can all be weighed under the same principles that determine how much effort we should spend on avoiding nuclear war. This is not an overreach: a moral theory is a way to answer “ought”-questions, and a good one should cover all of them.</p> <p>This leads to a key strength of utilitarianism: it scales, and this matters, especially when you want to apply ethics to big uncertain things. 
But first, a slight detour.</p><p><br /></p> <h3>Demandingness</h3> <p>A common objection to utilitarianism is that it is too demanding.</p> <p>First of all, I find this funny. Which principle of meta-ethics is it, exactly, that guarantees your moral obligations won’t take more than the equivalent of a Sunday afternoon each week?</p> <p>However, I can also see why consequentialist ethics can seem daunting. For someone who is used to thinking of ethics in terms of specific duties that must always be carried out, a theory that paints everything with some amount of moral importance and defines good in terms of maximising something vague and complicated can seem like too much of a burden. (I think this is behind the misinterpretation that utilitarianism says you have a duty to calculate that each action you take is the best one possible, which is neither utilitarian nor an effective way to achieve anything.)</p> <p>Utilitarianism is a consequentialist moral theory. Demands and duties are not part of it. It settles for simply defining what is good.</p> <p>(As it should. The definition is logically separate from the implications and the implementation. Good systems, concepts, and theories are generally <a href="https://www.lesswrong.com/posts/yDfxTj9TKYsYiWH5o/the-virtue-of-narrowness">narrow</a>.)</p><p><br /></p> <h3>Scaling ethics to the sea</h3> <p>There are many moral questions that are, in practice, settled. All else being equal, it is good to be kind, have fun, and help the needy.</p> <p>To make an extended metaphor: we can imagine that there is an island of settled moral questions; ones that no one except psychopaths or philosophy professors would think to question.</p> <p>This island of settled moral questions provides a useful test for moral systems. A moral system that doesn’t advocate kindness deserves to go in the rubbish. 
But though there is important intellectual work to be done in figuring out exactly what grounds this island (the geological layers it rests on, if you will), the real problem of morality in our world is how we extrapolate from this island to the surrounding sea.</p> <p>In the shallows near the island you have all kinds of conventional dilemmas – for example, consider our highway engineers in the previous example weighing nature against art against economy. Go far enough in any direction and you will encounter all sorts of perverse thought experiment monsters dreamt up by philosophers, which try to tear apart your moral intuitions with analytically sharp claws and teeth.</p> <p>You might think we can keep to the shallows. That is not an option. We increasingly need to make moral decisions about weird things, due to the increasing strangeness of the world: complex institutions, new technologies, and the sheer scale of there being over seven billion people around.</p> <p>A moral system based on rules for everyday things is like a constant-sized knife: fine for cutting up big fish (should I murder someone?), but clumsy at dealing with very small fish (what to have for dinner?), and often powerless against gargantuan eldritch leviathans from the deep (existential risk? mind uploading? insect welfare?).</p> <p>Utilitarianism scales bot