Strata of the World—now moved to https://nosetgauge.substack.com/: 2021

2021-10-17

Death is bad

3.5k words (about 12 minutes)

Sometime in the future, we might have the technology to extend lifespans indefinitely and make people effectively immortal. When and how this might happen is a complicated question that I will not go into. Instead, I will take heed of Ian Malcolm in Jurassic Park, who complains that "your scientists were so preoccupied with whether or not they could that they didn't stop to think if they should".

This is (in my opinion rather surprisingly) a controversial question.

The core of it is this: should people die?

Often the best way to approach a general question is to start by thinking about specific cases. Imagine a healthy ten-year-old child; should they die? The answer is clearly no. What about yourself, or your friends, or the last person you saw on the street? Wishing for death for yourself or others is almost universally a sign of a serious mental problem; acting on that desire even more so.

There are some exceptions. Death might be the best option for a sick and pained 90-year-old with no hope of future healthy days. It may well be (as I've seen credibly claimed in several places) that the focus on prolonging lifespan even in pained terminally ill people is excessive. "Prolong life, whatever the cost" is a silly point of view; maximising heartbeats isn't what we really care about.

However, now imagine a pained, dying, sick person who has a hope of surviving to live many healthy happy days – say a 40-year-old suffering from cancer. Should they die? No. You would hope that they get treatment, even if it's nauseating, fatiguing, painful chemotherapy for months on end. If there is no cure, you'd hope that scientists somewhere invent it. Even if it does not happen in time for that particular person, at least it will save others in the future, and eliminate one more horror of the world. It would be a great and celebrated human achievement.

What's the difference between the terminally ill 90-year-old and the 40-year-old with a curable cancer? The difference is technology. We have the technology to cure some cancers, but we don't have the technology to cure the many ageing-related diseases. If we did, then even if the treatment is expensive or difficult, we would hope – and consider it a moral necessity – for both of them to get it, and hope that they both go on living for many more years.

No one dies of time. You are a complex process running on the physical hardware of your brain, which is kept running by the machine that is the rest of your body. You die when that machine breaks. There is no poetic right time when you close your eyes and get claimed by time, there is only falling to one mechanical fault or another.

People (or conscious beings in general) matter, and their preferences should be taken seriously – this is the core of human morality. What is wrong in the world can be fixed – this is the guiding principle of civilisation since the Enlightenment.

So, should people die? Not if they don't want to, which (I assume) for most people means not if they have a remaining hope of happy, productive days.

Counterarguments

The idea that death is something to be defeated, like cancer, poverty, or smallpox, is not a common one. Perhaps there's some piece of the puzzle that is missing from the almost stupidly simple argument above?

One of the most common counterarguments is overpopulation (perhaps surprisingly; environmentalist concerns have clearly penetrated very deep into culture despite not being much of a thing before the 1970s). The argument goes like this: if we solve death, but people keep being born, there will be too many people on Earth, leading to environmental problems, and eventually low quality of life for everyone.

The object-level point (I will return to what I consider more important meta-level points later) is that demographic predictions have a tendency to be wrong, especially about the future (as the Danish (?) saying goes). Malthus figured out pre-industrial demographics just as they came to an end with the industrial revolution. In the 1960s, there were warnings of a population explosion, which fizzled out when it turned out that the demographic transition (falling birth rates as countries develop) is a thing. Right now the world population is expected to stabilise at less than 1.5x the current size, and many developed countries are dealing with problems caused by shrinking populations (which they strangely refuse to fix through immigration).

Another concern is the effect of having a lot of old people around. What about social progress – how would the development of women's rights have been realised if you had a bunch of 19th century misogynists walking around in their top hats? What sort of power imbalances and Gini coefficients would we reach if Franklin Delano Roosevelt could continue cycling through high-power government roles indefinitely, or Elon Musk had time to profit from the colonisation of Mars? What happens to science when it can no longer advance (as Max Planck said) one funeral at a time?

(There is even an argument that life extension technology is problematic because the rich will get it first. This is an entirely general and therefore entirely worthless argument, since it applies to all human progress: the rich got iPhones first – clearly smartphones are a problematic technology, etc., etc. If you're worried about only the rich having access to it for too long, the proper response is to subsidise its development so that the period when not everyone has access to it is as short as possible.)

These are valid concerns that will definitely test the abilities of legislators and voters in the post-death era. However, they can probably be overcome. I think people can be brought around surprisingly far on social and moral attitudes without killing anyone. Consider how almost anyone's pre-2000 opinions would make them a near-pariah today; many of those people still exist and it would be hard to write them off as a total loss. Maybe some minority of immortal old people couldn't cope with all the Pride Parades – or whatever the future equivalent is – marching past their windows and would go off to start some place of their own with sufficient top hat density; then again, most countries have their own conservative backwater region already. If they start going for nukes, that's more of an issue, but not more so than Iran.

As for imbalances of power and wealth, it might require a few more taxes and other policies (the expansion of term limits to more jobs?), but given the strides that equalising policy-making has made it seems hard to argue there is a fundamental impossibility.

And what about all the advantages? A society of the undying might well be far more long-term oriented, mitigating one of the greatest human failures. After all, how often do people bemoan that 70-year-old oil executives just don't care because they won't be around to see the effects of climate change?

What about all the collective knowledge that is lost? Imagine if people in 2050 could hear World War II veterans reminding them of what war really is. Imagine if John von Neumann could have continued casually inventing fields of maths at a rate of about two per week instead of dying at age 53 (while absolutely terrified of his approaching death). Imagine if we could be sure to see George R. R. Martin finish A Song of Ice and Fire.

Also, concerns like overpopulation and Elon Musk's tax plan just seem small in comparison to the literal eradication of death.

Imagine proposing a miracle peace plan to the cabinets of the Allied countries in the midst of World War II. The plan would end the war, install liberal governments in the Axis powers, and no one even has to nuke a Japanese city. (If John von Neumann starts complaining about not getting to test his implosion bomb design, give him a list of unsolved maths problems to shut him up.) Now imagine that the reaction is somewhere between hesitance and resistance, together with comments like "where are we going to put all the soldiers we've trained?", "what about the effects on the public psyche of a random abrupt end without warning?", and "how will we make sure that the rich industrialists don't profit too much from all the suddenly unnecessary loans that they've been given?" At this point you might be justified in shouting: "this war is killing fifteen million people per year, we need to end it now".

The situation with death is similar, except it's over fifty million per year rather than fifteen. (See this chart for breakdown by cause – you'll see that while currently-preventable causes like infectious diseases kill millions, ageing-related ones like heart disease, cancer, and dementia are already the majority.)

Thought experiments

To make the question more concrete, we can try thought experiments. Imagine a world in which people don't die. Imagine visitors from that world coming to us. Would they go "ah yes, inevitable oblivion in less than a century, this is exactly the social policy we need, thanks – let us go run back home and implement it"? Or would they think of our world like we do of a disease-stricken third-world country, in dire need of humanitarian assistance and modern technology?

It's hard to get into the frame of mind of people who live in a society that doesn't hand out automatic death sentences to everyone at birth. Instead, to evaluate whether raising life expectancies to 200 makes sense even given the environmental impacts, we can ask whether a policy of killing people at age 50 to reduce population pressures would be even better than the current status quo – if both an increase and a decrease in life expectancies are bad, this is suspicious because it implies we're at the optimum by chance. Or, since the abstract question (death in general) is always harder than more concrete ones, imagine withholding a drug that manages heart problems in the elderly on overpopulation grounds.

You might argue that current life expectancies are optimal. This is a hard position to defend. It seems like a coincidence that the lifespan achievable with modern technology is exactly the "right" one. Also, neither you nor society should make that choice for other people. Perhaps some people get bored of life and readily step into coffins at age 80; many others want nothing more than to keep living. People should get what they want. Forcing everyone to conform to a certain lifespan is a specific case of forcing everyone to conform to a certain lifestyle; much moral progress in the past century has consisted of realising that this is bad.

I think it's also worth emphasising one common thread in the arguments against solving death: they are all arguments about societal effects. It is absolutely critical to make sure that your actions don't cause massive negative externalities, and that they also don't amount to defecting in prisoner's dilemma or the tragedy of the commons. However, it is also absolutely critical that people are happy and aren't forced to die, because people and their preferences/wellbeing are what matters. Society exists to serve the people who make it up, not the other way around. Some of the worst moral mistakes in history come from emphasising the collective, and identifying good and harm in terms of effects on an abstract collective (e.g. a nation or religion), rather than in terms of effects on the individuals that make it up. Saying that everyone has to die for some vague pro-social reason is the ultimate form of such cart-before-the-horse reasoning.

Why care about the death question?

There are several features that make the case against death, and people's reactions to it, particularly interesting.

Failure of generalisation

First: generalisation. I started this post using specific examples before trying to answer the more general question. I think the popularity of death is a good example of how bad humans are at generalising.

When someone you know dies, it is very clearly and obviously a horrible tragedy. The scariest thing that could happen to you is probably either your own death, the death of people you care about, or something that your brain associates with death (the common fears: heights, snakes, ... clowns?).

And yet, make the question more abstract – think not about a specific case (which you feel in your bones is a horrible tragedy that would never happen in a just world), but about the general question of whether people should die, and it's like a switch flips: a person who would do almost anything to save themselves or those they care about, who cares deeply about suffering and injustice in the world, is suddenly willing to consign five times the death toll of World War I to permanent oblivion every single year.

Stalin reportedly said that a single death is a tragedy, but a million is only a statistic. Stalin is wrong. A single death is a tragedy, and a million deaths is a million tragedies. Tragedies should be stopped.

People These Days

Second: today, we're pretty good at ignoring and hiding death. This wasn't always the case. If you were a medieval peasant, death was never too far away, whether in the form of famine or plague or Genghis Khan. Death was like an obnoxious dinner guest: not fun, but also just kind of present in some form or another whether you invited them or not, so out of necessity involved in life and culture.

Today, unexpected death is much rarer. Child mortality globally has declined from over 40% (i.e. almost every family had lost a child) in 1800 to 4.5% in 2015, and below 0.5% in developed countries. Famines have gone from something everyone lives through to something that the developed world is free from. War and conflict have gone from common to uncommon. Many more diseases and accidents can be successfully treated. As a result of all these positive trends, death is less present in people's minds.

As I don't have my culture critic license yet, I won't try to make some fancy overarching points about how People These Days Just Don't Understand and how our Materialistic Culture fails to prepare people to deal with the Deep Questions and Confront Their Own Mortality. I will simply note that (a) death is bad, (b) we don't like thinking about bad things, and (c) sometimes not wanting to think about important things causes perverse situations.

Confronting problems

Why do people not want to think that death is bad? I think one central reason is that death seems inevitable. It's tough to accept bad things you can't influence, and much easier to try to ignore them. If at some point you have to confront it anyways, one of the most reassuring stories you can tell is that it has a point. Imagine if over two hundred thousand years, generation after generation of humans, totalling some one hundred billion lives, was born, grew up, developed a rich inner world, and then had that world destroyed forever by random failures, evolution's lack of care for what happens after you reproduce, and the occasional rampaging mammoth. Surely there must be some purpose for it, some reason why all that death is not just a tragedy? Perhaps we aren't "meant" to live long, whatever that means, or perhaps it's all for the common good, or that "death gives meaning to life". Far more comforting to think that than to acknowledge that a hundred billion human lives and counting really are gone forever because they were unlucky enough to be born before we eradicated smallpox, or invented vaccines, or discovered antibiotics, or figured out how to reverse ageing.

Assume death is inevitable. Should you still recognise the wrongness of it?

I think yes, at least if you care about big questions and doing good. I think it's important to be able to look at the world, spot what's wrong about it, and acknowledge that there are huge things that should be done but are very difficult to achieve.

In particular, it's important to avoid the narrative fallacy (Nassim Taleb's term for the human tendency to want to fit the world to a story). In a story, there's a start and an end and a lesson, and the dangers are typically just small enough to be defeated. Our universe has no writer, only physics, and physics doesn't care about hitting you with an unsolvable problem that will kill everyone you love. If you want to increase the justness of the world, recognising this fact is an important starting point.

Taxes

Is death inevitable? In considering this question, it's important once again to remember that death is not a singular magical thing. Your death happens when something breaks badly enough that your consciousness goes permanently offline.

Things, especially complex biological machines produced by evolution, can break in very tricky ways. But what can break can be fixed, and people who declare technological feats impossible have a bad track record. The problem might be very hard: maybe we have to wait until we have precision nano-bots that can individually repair the telomeres on each cell, or maybe there is no effective general solution to ageing and we face an endless grind of solving problem after problem to extend life/health expectancies from 120 to 130 to 140 and so forth. Then again, maybe someone leaves out a petri dish by accident in a lab and comes back the next day to the fountain of youth, or maybe by the end of the century no one is worrying about something as old-fashioned as biology.

There's also the possibility of stopgap solutions, like cryonics (preserving people close to death by vitrifying them and hoping that future technology can revive them). Cryonics is currently in a very primitive state – no large animal has yet been successfully put through it – but there's a research pathway of testing on increasingly complex organs and then increasingly large animals that might eventually lead to success if someone bothered to pour resources into it.

There is no guarantee of when, or whether, any of this will happen. If civilisation is destroyed by an engineered pandemic or nuclear war before then, it will never happen.

Of course, in the very long run we face more fundamental problems, like the heat death of the universe. Literally infinite life is probably physically impossible; maybe this is reassuring.

Predictions and poems

I will make three predictions about the eventual abolition of death.

First, many people will resist it. They might see it as conflicting with their religious views or as exacerbating inequality, or just as something too new and weird or unnatural.

Second, when the possibility of extending their lifespan stops being an abstract topic and becomes a concrete option, most people will seize it for themselves and their families.

This is a common path for technologies. Lightning rods and vaccines were first seen by some as affronts to God's will, but eventually it turns out people like not burning to death and not dying of horrible diseases more than they like fancy theological arguments. Most likely future generations will discover that they like not ageing more than they like appreciating the meaning of life by definitely not having one past age 120.

Finally, future people (if they exist) will probably look back with horror on the time when everyone died against their will within about a century.

Edgar Allan Poe wrote a poem called "The Conqueror Worm", about angels crying as they watch a tragic play called "Man", whose (anti-)hero is a monstrous worm that symbolises death. If we completely ignore what Poe intended with this, we can misinterpret one line to come to a nice interpretation of our own. The poem declares that the angels are watching this play in the "lonesome latter years". Clearly this refers to a future post-scarcity, post-death utopia, and the angels are our wise immortal descendants reflecting on the bad old days, when people were "mere puppets [...] who come and go / at the bidding of vast formless things" like famine and war and plague and death. The "circle [of life] ever returneth in / To the self same spot [= the grave]", and so the "Phantom [of wisdom and fulfilled lives] [is] chased for evermore / By a crowd that seize it not".

Death is a very poetic topic, and other poems need less (mis)interpretation. Edna St. Vincent Millay's "Dirge Without Music" is particularly nice, while Dylan Thomas gives away the game in the title: "Do not go gentle into that good night".

2021-09-30

Short reviews: biographies

Books reviewed (all by Walter Isaacson):
The Code Breaker: Jennifer Doudna, Gene Editing, and the Future of the Human Race (2021)
Steve Jobs: The Exclusive Biography (2011)
Benjamin Franklin: An American Life (2004)

3.5k words (about 12 minutes)

Why read biographies? If you want stories of people and interesting characters, fiction is better. If you want general, big truths, then you're probably better off reading the many non-fiction books that are about abstract truths and far-ranging concepts rather than the particulars of a single person's life.

Consider, for a moment, designing an algorithm for a problem. The classic way to do this is to think hard about the problem, and then write down a specific series of steps that take you from inputs to (hopefully the correct) outputs. In contrast, the machine learning method is to use statistical methods on a long list of examples to make a model that (hopefully) approximates the mapping between inputs and outputs.
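As a toy illustration of the two approaches, here is a minimal sketch in Python. The problem (converting Celsius to Fahrenheit), the function names, and the slightly noisy example data are all made up for this sketch; real machine learning uses far richer models than a straight-line fit.

```python
# Hypothetical toy problem: convert Celsius to Fahrenheit.

def explicit_rule(celsius: float) -> float:
    """Classic approach: think about the problem and write the steps down."""
    return celsius * 9 / 5 + 32


def fit_line(examples: list[tuple[float, float]]) -> tuple[float, float]:
    """Learned approach: least-squares fit of y = slope * x + intercept."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    var = sum((x - mean_x) ** 2 for x, _ in examples)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The "long list of examples" (short here, and deliberately a bit noisy).
examples = [(-10.0, 14.2), (0.0, 31.9), (10.0, 50.1), (20.0, 67.8), (30.0, 86.0)]
slope, intercept = fit_line(examples)


def learned_model(celsius: float) -> float:
    """Approximates the mapping without ever being told the formula."""
    return slope * celsius + intercept


print(explicit_rule(100.0), learned_model(100.0))
```

The explicit rule is exact and it is obvious when it applies; the fitted model only approximates it, but it needed nothing except examples.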

Reading explicit abstract arguments is like the first method. Like explicit algorithm design, it comes with some nice properties – it's very clear exactly how it generalises and when it's applicable – to the point where it's easy to scoff at the less explicit methods: "it's just a black box that our pile of statistics spits out" / "it's just anecdotes about someone's life".

However, much like machine learning methods can extract subtle lessons from a long list of examples, I think there is implicit knowledge contained in the long list of detail about someone's life that you find in a biography (at least if you read about people who did interesting things in their life – but then again, if there's a biography of someone ...). Once you've read the details of how CRISPR was invented, Apple jump-started, or compromises reached at the 1787 American Constitutional Convention, I think your model of how science, business, and politics work in the real world is improved in many subtle ways.

(Note that this argument also applies to reading history.)

And of course, since biographies deal strongly with character, there is an element of the novel-like thrill of watching things happen to people.

Walter Isaacson's biographies

I've read four of Walter Isaacson's biographies. Their subjects are Albert Einstein, Jennifer Doudna, Steve Jobs, and Benjamin Franklin.

The Einstein one I read years ago, and don't remember much detail about. It did earn a 6 out of 7 on my books spreadsheet though.

The Jennifer Doudna biography is the weakest. The main reason is that we don't get too much insight into Doudna herself or the way she carried out her scientific work, leaving Isaacson to spend many pages on other things: overviews of other players in the development of the gene-editing tool CRISPR that are more journalistic than biographical, and descriptions of the biology that are limited by Isaacson's lack of biological expertise (at least when compared to the best popular biology writing, like Richard Dawkins' in The Selfish Gene). Hand-wringing over James Watson's controversies takes up an alarming amount of space that is only partly justified by Watson's role as a childhood inspiration for Doudna. There's also a long section about the struggles behind the allocation of the CRISPR Nobel Prize (awarded in 2020) that is clearly balanced and thoroughly researched, but simply less interesting to me than similar segments in the Jobs or Franklin biographies, where the stakes are the fate of companies or nations, rather than who gets a shiny medal.

My guess is that these faults stem mainly from the more limited material Isaacson had access to. Albert Einstein and Benjamin Franklin are both among the most researched individuals in history. To the extent that Steve Jobs is behind, the interviews Isaacson personally conducted seem to have plugged the gap.

Doudna is still an inspiring person. She also has the enviable advantage of not being dead, and therefore may yet do even more and become the subject of further biographies. If you're interested in biotech, including the business side, or scientific careers that may one day win Nobel Prizes, the biography may well be worth reading.

Steve Jobs

A god-like experimenter who wants to figure out what traits make tech entrepreneurs succeed may proceed something like this: create a bunch of people with extreme strengths in some areas and extreme weaknesses in others, release them into the world to start companies, and see which extreme strengths can balance out which extreme weaknesses. Such an experiment might well create Steve Jobs.

Take one weakness: Jobs's emotional volatility and, for lack of a better word, general nastiness in some circumstances, including things from extremely harsh criticism of employees' work to horrible table manners at restaurants. This isn't unique to Jobs either: look at the Wikipedia pages for Bill Gates and Jeff Bezos, and you'll find that they brighten their subordinates' work days with such productive witticisms as "that's the stupidest thing I've ever heard" and "why are you ruining my life?" respectively.

Does this show that behaviour up to and including verbal abuse is a forgivable flaw, or even beneficial, in tech CEOs?

First, though verbal abuse is neither productive nor right, a culture of vigorous debate is a distinct thing with incredible benefits, and the idea that it serves only to hurt and marginalise is not just a misguided generalisation but sometimes diametrically wrong. The best example is Daniel Ellsberg recounting an anecdote from his early times at RAND Corporation in The Doomsday Machine (an unrelated book; my review here):

Rather than showing irritation or ignoring my comment [that he made at the first meeting], Herman Kahn, brilliant and enormously fat, sitting directly across the table from me, looked at me soberly and said, "You're absolutely wrong."

A warm glow spread through my body. This was the way my undergraduate fellows on the editorial board of the Harvard Crimson (mostly Jewish, like Herman and me) had spoken to each other; I hadn't experienced anything like it for six years. At King's College, Cambridge, or in the Society of Fellows, arguments didn't remotely take this gloves-off, take-no-prisoners form. I thought, "I've found a home."

Steve Jobs admittedly goes overboard with this. For example, people who worked with him had to learn that "this is shit" meant "that's interesting, could you elaborate and make the case for your idea further?". This is not just unnecessarily rude, but also unclear communication. The general impression that Isaacson gives is also not that Jobs was combative as a thought-out strategy, but rather that this was just his style of interaction.

I suspect that the famous combativeness of many tech CEOs is not itself a useful trait, but instead adjacent to several other traits that are, in particular disagreeableness (in the sense of being willing to disagree with others and not feeling pressure to conform) and perhaps also caring deeply about the product.

Consider another extreme Jobs trait: strange diets, and (in his youth), a belief that he didn't need to shower because of his dieting. This went so far that of the people Isaacson interviews about Jobs's youth, including those who hadn't seen him for decades, almost every one mentions something like "yeah, he stank". Yet while some leap to defend and (worse yet) emulate Jobs's verbal nastiness, presumably on grounds of its correlation with his success, far fewer do the same for his dieting and showering habits. (What conformists!)

I think the more general lesson is that Jobs was extreme in a lot of ways, including in the strength of his opinions and beliefs, and in not having a filter between them and his actions. He gets into eastern mysticism and goes off to India to become a monk. He gets into dieting and starts eating only fruit rather than just reading lifestyle magazines and half-heartedly trying diets for a week like most people might. He gets it into his head that the corner of a Mac isn't rounded enough and declares that in no uncertain terms.

So is that the key then: have firm convictions? We've gone from a maladaptive cliché to a trite one – and still not a very helpful one. Steve Jobs, with his "reality distortion field", may have been an expert at persuading people, but even he can't persuade reality to be another way. Even slightly wrong convictions tend to have nasty collisions with reality.

(It's worth noting that rather than being a stickler for one position or solution, Jobs tended to yo-yo back and forth between extremes, only slowly converging on a decision – something that often confused others at Apple until they learned to use a rolling average of his recent positions.)

The critical part, of course, was that Steve Jobs was right about a lot of things, despite several serious missteps (especially with regard to making over-expensive computers that no one wants to pay for). I think Jobs's success provides evidence that even in aesthetic matters, success has a surprisingly strong component of being actually right. And Jobs, who was all-around very bright despite not being a master of the technical side, seems to have mastered this.

Of course, the story of Jobs's success – which came in spite of his emotional volatility, and tendency to wish away problems rather than facing them – does not entirely fit the idea that success comes in large part from having well-calibrated beliefs about the world and going about achieving them in reasonable and rational ways.

I think there are three things worth keeping in mind.

First, it may well be that most successful people are successful "at random" (i.e. without having a rational strategy for achieving what they want to achieve), but that the probability of achieving your goals given that you have well-calibrated beliefs and a rational reality-accommodating plan is still very much higher than the probability of achieving them given any other strategy. That is, if $S$ is the event of being very successful (by some definition), $R$ is the event that you follow a rational strategy and maintain well-calibrated beliefs and generally practise thought patterns that won't get you downvoted on LessWrong, and $\neg R$ is the complement of that event, then $P(\neg R|S)$ can be high (i.e. most successful people became successful in not particularly smart ways), while $P(S|R)$ can be much higher than $P(S|\neg R)$ (following a rational strategy still gives you by far the best chances of success).
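To see that both statements can hold at once, here is a small sketch with made-up numbers. The 1% share of people following a rational strategy and both success rates are assumptions chosen purely for illustration, not estimates of anything.

```python
def share_of_successes_without_rational_strategy(
    p_r: float, p_s_given_r: float, p_s_given_not_r: float
) -> float:
    """Bayes' rule: P(not R | S) = P(S | not R) * P(not R) / P(S)."""
    p_not_r = 1.0 - p_r
    # Law of total probability for P(S).
    p_s = p_s_given_r * p_r + p_s_given_not_r * p_not_r
    return p_s_given_not_r * p_not_r / p_s


# Illustrative assumptions, not data:
P_R = 0.01               # P(R): few people follow a rational strategy
P_S_GIVEN_R = 0.10       # P(S | R)
P_S_GIVEN_NOT_R = 0.005  # P(S | not R)

p_not_r_given_s = share_of_successes_without_rational_strategy(
    P_R, P_S_GIVEN_R, P_S_GIVEN_NOT_R
)
advantage = P_S_GIVEN_R / P_S_GIVEN_NOT_R

print(f"P(not R | S) = {p_not_r_given_s:.2f}")
print(f"P(S | R) / P(S | not R) = {advantage:.0f}")
```

With these numbers, about 83% of successful people got there without a rational strategy, even though following one multiplies your chance of success by twenty. The effect comes entirely from $R$ being rare.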

Second, Jobs's life illustrates the principle that you only have to be very right a small number of times – just like in general most of the return, especially in anything risky, comes from a small number of bets. He failed at managing, even when working under another CEO who had been brought in specifically to babysit him, to the extent that he was kicked out of his own company. He failed to build successful hardware after founding NeXT. However, he was really right about product design, and that was enough.

Third, though he did get away with ignoring many uncomfortable truths by simply willing them away, eventually reality hit back. He delayed dealing with the cancer threat when he was first told of it, and he trusted alternative treatments. The combination may well have killed him.

Benjamin Franklin

Benjamin Franklin was a newspaper publisher, writer, postmaster, ambassador, political leader, and scientist. He invented the lightning rod and realised that electric charge came in both a positive and negative form (and gave those names to them, as temporary ones until "[English] philosophers give us better").

He was one of the first or most influential pioneers of many other things as well; to take a random example, he thought up the idea of matched funding for a charitable project (and was quite proud of it too: "I do not remember any of my political maneuvers the success of which gave me at the time more pleasure, or that in after thinking about it I more easily excused myself for having made use of cunning").

More generally, he clearly enjoyed numbers and detail:

[...H]e loved immersing himself in minutiae and trivia in a manner so obsessive that it might today be described as geeky. He was meticulous in describing every technical detail of his inventions, be it the library arm, stove, or lightning rod. In his essays, ranging from his arguments against hereditary honors to his discussions of trade, he provided reams of detailed calculations and historical footnotes. Even in his most humorous parodies, such as his proposal for the study of farts, the cleverness was enhanced by his inclusion of mock-serious facts, trivia, calculations, and learned precedents

Do-gooders with time machines could do worse than giving him access to a spreadsheet program.

One of the best descriptions of Franklin's personality comes from Isaacson's comparison of him with John Adams (when they were both in Paris, late in Franklin's life):

Adams was unbending and outspoken and argumentative, Franklin charming and taciturn and flirtatious. Adams was rigid in his personal morality and lifestyle, Franklin famously playful. Adams learned French by poring over grammar books and memorizing a collection of funeral orations; Franklin (who cared little about the grammar) learned the language by lounging on the pillows of his female friends and writing them amusing little tales. Adams felt comfortable confronting people, whereas Franklin preferred to seduce them, and the same was true of the way they dealt with nations.

One striking thing when reading about 18th-century events is the informality and nepotism. For example, to become postmaster of the colonies, Franklin spent significant money on having a friend lobby on his behalf in London, and upon obtaining the position gave out cushy jobs to his son, brothers, brother's stepson, sister's son, and two of his wife's relatives.

Not only that, but the border between truth and fiction was also hazy in the press. Articles could be, without any differentiating label, either factual, obviously satirical, satirical in a way that takes a clever reader to spot, or outright hoaxes. Likewise Franklin often wrote and published letters to his own newspaper under pseudonyms, with various levels of disguise ranging from clearly transparent to purposefully anonymous (this, however, was normal, as it was often seen as unworthy of gentlemen to write such letters under their own names).

In other ways, the 18th century, and 18th century Franklin in particular, were surprisingly modern and liberal. Franklin took a very reasonable and liberal stance on the freedom of press:

“It is unreasonable to imagine that printers approve of everything they print. It is likewise unreasonable what some assert, That printers ought not to print anything but what they approve; since […] an end would thereby be put to free writing, and the world would afterwards have nothing to read but what happened to be the opinions of printers.”

He still exercised judgement over what he printed. When deciding whether to print something that violated his principles for money, he (reportedly) went through a process that many modern newspaper editors and Facebook engineers could well take to heart:

To determine whether I should publish it or not, I went home in the evening, purchased a twopenny loaf at the baker’s, and with the water from the pump made my supper; I then wrapped myself up in my great-coat, and laid down on the floor and slept till morning, when, on another loaf and a mug of water, I made my breakfast. From this regimen I feel no inconvenience whatever. Finding I can live in this manner, I have formed a determination never to prostitute my press to the purposes of corruption and abuse of this kind for the sake of gaining a more comfortable subsistence.

The 18th century offers some perspective about hostile politics too. After describing an extremely personal and angry election campaign (which Franklin lost), Isaacson writes:

Modern election campaigns are often criticized for being negative, and today’s press is slammed for being scurrilous. But the most brutal of modern attack ads pale in comparison to the barrage of pamphlets in the 1764 [Pennsylvania] Assembly election. Pennsylvania survived them, as did Franklin, and American democracy learned that it could thrive in an atmosphere of unrestrained, even intemperate, free expression. As the election of 1764 showed, American democracy was built on a foundation of unbridled free speech. In the centuries since then, the nations that have thrived have been those, like America, that are most comfortable with the cacophony, and even occasional messiness, that comes from robust discourse.

Isaacson points out that Franklin's popularity has come and gone, and explains this by making him the symbol of one side of a cultural and political dichotomy: tolerance and compromise rather than dogmatism and crusading, pragmatism rather than romanticism, social mobility rather than class and hierarchy, and secular material success over religious salvation. Thus, while immensely popular in the latter part of his life and after his death, once the Romantic Era got underway, he became seen as shallow, thrifty, and lacking in passion. For example, Franklin appears in Herman Melville's novel Israel Potter, a work that sounds like the most confusing Harry Potter fan-fiction of all time, as a precursor to today's shallow self-help gurus.

A perfect example of the type of cunning that made some people call him shallow comes from his time as a frontier commander. To get soldiers to attend worship services, he had the chaplain give out the daily rum rations right after the service. "Never were prayers more generally and punctually attended", Franklin proudly wrote.

Or: at the signing of the Declaration of Independence, John Hancock solemnly declared "There must be no pulling different ways; we must all hang together". Franklin reportedly responded, with a wit but not solemnity worthy of the historic occasion: "Yes, we must, indeed, all hang together, or most assuredly we shall all hang separately".

This oscillation between romantically-minded eras finding him shallow and business-minded eras finding him the godfather of all self-help gurus and thrifty entrepreneurs has continued to this day. It is true that his aphorism collections, as documented in his famous Poor Richard's Almanack, are more clever than insightful; that he was no moral philosopher; and that his virtue-cultivating efforts were often patchy. However, they are part of a crucial process: the separation of morality from theology during the Enlightenment, which "Franklin was [the] avatar" of. Franklin's foundational personal maxim, which he often repeated, is perhaps the single sentence that pre-modern religious countries most need to hear: "The most acceptable service to God is doing good to man".

The romanticists' criticisms are based on truths. Though sociable, founding and participating in many societies, his personal relationships tended to be intellectual but distant. Interestingly, despite his vast achievements, Franklin does not show signs of a deep unyielding inner ambition; he seems to have been driven by vague instincts to be useful, a sense of pride (which he tried to dull throughout his life), curiosity, and a delight in tinkering, planning, and organising. To his sister in 1771 he wrote "[...] I am much disposed to like the world as I find it, and to doubt my own judgment as to what would mend it" – a remarkable sentiment from the pen of someone who, not many years later, would be playing a key role in a revolution. And though even past the age of 75 he achieved a few minor things, like being instrumental in securing France's alliance to America, signing the peace treaty between the US and Britain, shaping the US Constitution, and being the head of Pennsylvania's government, he happily whiled away many of his latter days playing cards with only the occasional twinge of guilt. He specifically justified this in part based on a belief in the afterlife: "You know the soul is immortal; why then should you be such a niggard of a little time, when you have a whole eternity before you?"

However, even these traits seem to have made him exactly what America needed. He was a skilled diplomat in France partly because of his easy-going nature and lack of naked ambition. At the Constitutional Convention of 1787, he often hosted the (much younger) other leading revolutionaries at his house to talk about things in a less formal setting and soften their stances, and generally advocated tolerance and compromise. Isaacson cleverly summarises:

Compromisers may not make great heroes, but they do make democracies.

Perhaps the best known summary of Franklin's life is Turgot's epigram that "he snatched lightning from the sky and the sceptre from tyrants". Franklin himself had a go at this: he wrote an autobiography – then a rare form of book – and also proposed a cheeky epitaph for himself, including an exhortation to wait for a "new and more elegant edition [of him], revised and corrected by the Author".

He didn't just summarise himself, though. He also unwittingly wrote perhaps the pithiest summary of the spirit of the entire Enlightenment project, and consequently of the driving spirit of human progress since then. It was in a letter Franklin wrote to his wife, after narrowly escaping a shipwreck on the English coast in 1757:

Were I a Roman Catholic, perhaps I should on this occasion vow to build a chapel to some saint; but as I am not, if I were to vow at all, it should be to build a lighthouse.

2021-04-25

Lambda calculus

7.8k words, including equations (about 30 minutes)

This post has also been published here.

This post is about lambda calculus. The goal is not to do maths with it, but rather to build up definitions within it until we can express non-trivial algorithms easily. At the end we will see a lambda calculus interpreter written in the lambda calculus, and realise that we're most of the way to Lisp.

But first, why care about lambda calculus? Consider four different systems:

A Turing machine – that is, a machine that:
- works on an infinite tape of cells from which a finite set of symbols can be read and written, and always points at one of these cells;
- has some set of states it can be in, some of which are termed "accepting" and one of which is the starting state; and
- given a combination of current state and current symbol on the tape, always does an action consisting of three things:
  - writes some symbol on the tape (possibly the same that was already there),
  - transitions to some state (possibly the same it is already in), and
  - moves one cell left or right on the tape.
The lambda calculus ( $\lambda$ -calculus), a formal system that has expressions that are built out of an infinite set of variable names using $\lambda$ -terms (which can be thought of as anonymous functions) and applications (analogous to function application), and a few simple rules for shuffling around the symbols in these expressions.
The partial recursive functions, constructed by function composition, primitive recursion (think bounded for-loops), and minimisation (returning the first value for which a function is zero) on three basic sets of functions:
- the zero functions, that take some number of arguments and return 0;
- a successor function that takes a number and returns that number plus 1; and
- the projection functions, defined for all natural numbers $a$ and $b$ such that $a \geq b$ as taking in $a$ arguments and returning the $b$ th one.
Lisp, a human-friendly axiomatisation of computation that accidentally became an extremely good and long-lived programming language.
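To make the first of these systems concrete, here is a minimal sketch of a Turing machine simulator in Python. The function name, the `"_"` blank symbol, the step limit, and the choice to stop as soon as an accepting state is reached are my own assumptions for illustration, not part of the formal definition.

```python
from collections import defaultdict

def run_turing_machine(transitions, start, accepting, tape_input,
                       blank="_", max_steps=1000):
    """Run a machine; return (accepted, tape contents as a string).

    transitions maps (state, symbol) -> (symbol to write, next state, "L" or "R").
    """
    tape = defaultdict(lambda: blank, enumerate(tape_input))
    state, head = start, 0
    for _ in range(max_steps):
        if state in accepting:  # assumption: halt once we accept
            break
        write, state, move = transitions[(state, tape[head])]
        tape[head] = write
        head += 1 if move == "R" else -1
    cells = [tape[i] for i in sorted(tape)]
    return state in accepting, "".join(cells).strip(blank)

# A toy machine that flips every bit, then accepts at the first blank cell.
flip = {
    ("flip", "0"): ("1", "flip", "R"),
    ("flip", "1"): ("0", "flip", "R"),
    ("flip", "_"): ("_", "done", "R"),
}
accepted, result = run_turing_machine(flip, "flip", {"done"}, "1011")
print(accepted, result)  # True 0100
```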

The big result in theoretical computer science is that these can all do the same thing, in the sense that if you can express a calculation in one, you can express it in any other.

This is not an obvious thing. For example, the only thing lambda calculus lets you do is create terms consisting of symbols, single-argument anonymous functions, and applications of terms to each other (we'll look at the specifics soon). It's an extremely simple and basic thing. Yet no matter how hard you try, you can't make something that can compute more things, whether it's by inventing programming languages or building fancy computers.

Also, if you try to make something that does some sort of calculation (like a new programming language), then unless you keep it stupidly simple and/or take great care, it will be able to compute anything (at least in la-la-theory-land, where memory is infinite and you don't have to worry about practical details, like whether the computation finishes before the sun going nova).

Physicists search for their theory of everything. The computer scientists already have many, even though they've been at it for a lot less time than the physicists have: everything computable can be reduced to one of the many formalisms of computation. (One of the main reasons that we can talk about "computability" as a sensible universal concept is that any reasonable model makes the same things computable; the threshold is easy to hit and impossible to exceed, so computable versus not is an obvious thing to pay attention to.)

To talk about the theory of computation properly, we need to look at at least one of those models. The most well-known is the Turing machine. Turing machines have several points in their favour:

They are the easiest to imagine as a physical machine.
They have clear and separate notions of time (steps taken in execution) and space (length of tape used).
They were invented by Alan Turing, who contributed to breaking the Enigma code during World War II, before being unjustly persecuted for being gay and tragically dying of cyanide poisoning at age 41.

In contrast, compare the lambda calculus:

It is an abstract formal system arising out of a failed attempt to axiomatise logic.
There are many execution paths for a non-trivial expression.
It was invented by Alonzo Church, who lived a boringly successful life as a maths professor at Princeton, had three children, and died at age 92.

(Turing and Church worked together from 1936 to 1938, Church as Turing's doctoral advisor, after they independently proved the undecidability of the halting problem. At the same time and also working at Princeton were Albert Einstein, Kurt Gödel, and John von Neumann (who, if he had had his way, would've hired Turing and kept him from returning to the UK).)

However, the lambda calculus also has advantages. Its less mechanistic and more mathematical view of computation is arguably more elegant, and it has fewer moving parts: instead of states, symbols, and a tape, the current state is just a term, and the term also represents the algorithm. It abstracts more nicely – we will see how we can, bit by bit, abstract out elements and get something that is a sensible programming language, a project that would be messier and longer with Turing machines.

Turing machines and lambda calculus are the foundations of imperative and functional programming respectively, and the situation between these two programming paradigms mirrors that between TMs and $\lambda$ -calculus: one is more mechanistic, more popular, and more useful when dealing with (stateful) hardware; the other more mathematical, less popular, and neater for abstraction-building.

Lambda trees

Now let's define exactly what a lambda calculus term is.

We have an infinite set of variables $x_1, x_2, x_3, ...$ , though for simplicity we will use any lowercase letter to refer to them. Any variable is a valid term. Note that variables are just symbols – despite the word "variable", there is no value bound to them.

We have two rules for building new terms:

$\lambda$ -terms are formed from a variable $x$ and a term $M$ , and are written $(\lambda x. M)$ .
Applications are formed from two terms $M$ and $N$ , and are written $(M N)$ .
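If you prefer code to grammar rules, the same definition can be sketched as a data structure. This is one possible encoding in Python, using nested tuples; the tags `"var"`, `"lam"`, `"app"` and the helper `show` are names I made up for illustration.

```python
# A term is one of:
#   ("var", name)         a variable
#   ("lam", param, body)  a λ-term binding the variable `param` in the term `body`
#   ("app", func, arg)    an application of the term `func` to the term `arg`

def show(term):
    """Render a term in the standard fully parenthesised notation."""
    kind = term[0]
    if kind == "var":
        return term[1]
    if kind == "lam":
        return f"(λ{term[1]}. {show(term[2])})"
    return f"({show(term[1])} {show(term[2])})"

identity = ("lam", "x", ("var", "x"))
example = ("app", ("lam", "f", ("app", ("var", "f"), ("var", "y"))), identity)

print(show(identity))  # (λx. x)
print(show(example))   # ((λf. (f y)) (λx. x))
```

The nesting of the tuples is exactly the tree structure of the term.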

These terms, like most things, are trees. I will mostly ignore the convention of writing out horrible long strings of $\lambda$ s and variables, only partly mitigated by parenthesis-reducing rules, and instead draw the trees.

(When it appears in this post, the standard notation appears slightly more horrible than usual because, for simplicity, I neglect the parenthesis-reducing rules (they can be confusing at first).)

Here are a few examples of terms, together with standard representations:

This representation makes it clear that we're dealing with a tree where nodes are either variables, lambda terms where the left child is the argument and the right child is the body, or applications. (I've circled the variables to make clear that the argument variable in a $\lambda$ -term has a different role than a variable appearing elsewhere.)

It's not quite right to say that a $\lambda$ -term is a function; instead, think of $\lambda$ -terms as one representation of a (mathematical) function, when combined with the reduction rule we will look at soon.

If we interpret the above terms as representations of functions, we might rewrite them (in Pythonic pseudocode) as, from left to right:

lambda x -> x (i.e., the identity function) (lambda is a common keyword for an anonymous function in programming languages, for obvious reasons).
(lambda f -> f(y))(lambda x -> x) (apply a function that takes a function and calls that function on y to the identity function as an argument).
x(y)
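For comparison, here are the same three terms as actual runnable Python. Python has no free-floating symbols, so `y` and `x` are given stand-in values here (an assumption made only so that the code runs).

```python
y = "y"                        # stand-in value for the free variable y
x = lambda arg: f"x({arg})"    # stand-in for the free variable x, made callable

identity = lambda x: x                    # first term
middle = (lambda f: f(y))(lambda x: x)    # second term
third = x(y)                              # third term

print(identity(42))  # 42
print(middle)        # y
print(third)         # x(y)
```

Note that Python evaluates the application immediately, whereas a lambda calculus term just sits there until we choose to reduce it.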

Reduction

Execution in lambda calculus is driven by something that is called $\beta$ -reduction, presumably because Greek letters are cool. The basic idea of $\beta$ -reduction is this:

Pick an application (which I've represented by orange circles in the tree diagrams).
Check that the left child of the application node is a $\lambda$ -term (if not, you have to reduce it to a $\lambda$ -term before you can make that application).
Replace the variable in the left child of the $\lambda$ -term with the right child of the application node wherever it appears in the right child of the $\lambda$ -term, and then replace the application node with the right child of the $\lambda$ -term.

In illustrated form, on the middle example above, using both tree diagrams and the usual notation:

(The notation $M[N/x]$ means substitute the term $N$ for the variable $x$ in the term $M$ ; the general rule for $\beta$ -reduction is that given $((\lambda x. M) N)$ , you can replace it with $M[N/x]$ , subject to some details that we will mostly skip over shortly.)
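The substitution $M[N/x]$ can be sketched in a few lines of Python. This version assumes terms are encoded as nested tuples `("var", name)`, `("lam", param, body)`, `("app", func, arg)` (my own encoding), and it deliberately ignores the name-collision details, so it is only safe when the variable names involved are all distinct.

```python
def substitute(term, name, replacement):
    """Return term[replacement/name], ignoring name-capture issues."""
    kind = term[0]
    if kind == "var":
        return replacement if term[1] == name else term
    if kind == "lam":
        if term[1] == name:  # an inner λ rebinds the name, so stop here
            return term
        return ("lam", term[1], substitute(term[2], name, replacement))
    return ("app",
            substitute(term[1], name, replacement),
            substitute(term[2], name, replacement))

def beta_step(term):
    """β-reduce an application whose left child is a λ-term."""
    kind, func, arg = term
    assert kind == "app" and func[0] == "lam", "not a reducible application"
    return substitute(func[2], func[1], arg)

identity = ("lam", "x", ("var", "x"))
term = ("app", ("lam", "f", ("app", ("var", "f"), ("var", "y"))), identity)

step1 = beta_step(term)
print(step1)             # ('app', ('lam', 'x', ('var', 'x')), ('var', 'y'))
print(beta_step(step1))  # ('var', 'y')
```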

In our example, we end up with another application term, so we can reduce it further:

In our Pythonic pseudocode, we might represent this as an execution trace like the following:

(lambda f -> f(y))(lambda x -> x)

-->

(lambda x -> x)(y)

-->

y

Reduction is not always so simple, even if there's only a single choice of what to reduce. You have to be careful if the same variable appears in different roles, and rename if necessary. The core rule is that within the tree rooted at a $\lambda$ -term that takes an argument $x$ , the variable $x$ always means whatever was given to that $\lambda$ -term, and never anything else. An $x$ bound in one $\lambda$ -term is distinct from an $x$ bound in another $\lambda$ -term.

The simplest way to get around problems is to make your first variable $x_1$ and, whenever you need a new one, call it $x_i$ where $i$ is one more than the maximum index of any existing variable. Unfortunately humans aren't good at remembering the difference between $x_9$ and $x_{17}$ , and humans like conventions (like using $x$ for generic variables, $f$ for things that will be $\lambda$ -terms, and so forth). Therefore we sometimes have to think about name collisions.

The principle that lets us out of name collision problems is that you can rename variables as you want (as long as distinct variables aren't renamed to the same thing). The name for this is $\alpha$ -equivalence (more Greek letters!); for example $(\lambda x .x)$ and $(\lambda y. y)$ are $\alpha$ -equivalent.
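One way to test $\alpha$ -equivalence mechanically is to throw the names of bound variables away altogether and replace each with a number saying how many $\lambda$ s up its binder is (a technique known as de Bruijn indices, which this post does not otherwise use). A minimal sketch, assuming terms are nested tuples `("var", name)`, `("lam", param, body)`, `("app", func, arg)`:

```python
def to_de_bruijn(term, bound=()):
    """Replace bound variable names by their distance to the binding λ."""
    kind = term[0]
    if kind == "var":
        name = term[1]
        # `bound` lists enclosing parameters, innermost first.
        if name in bound:
            return ("bound", bound.index(name))
        return ("free", name)  # free variables keep their names
    if kind == "lam":
        return ("lam", to_de_bruijn(term[2], (term[1],) + bound))
    return ("app", to_de_bruijn(term[1], bound), to_de_bruijn(term[2], bound))

def alpha_equivalent(a, b):
    return to_de_bruijn(a) == to_de_bruijn(b)

id_x = ("lam", "x", ("var", "x"))
id_y = ("lam", "y", ("var", "y"))
const = ("lam", "x", ("lam", "y", ("var", "x")))
other = ("lam", "x", ("lam", "y", ("var", "y")))

print(alpha_equivalent(id_x, id_y))   # True
print(alpha_equivalent(const, other)) # False
```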

There are, of course, detailed rules for how to deal with name collisions when doing $\beta$ -reductions, but you should be fine if you think about how variable scoping should sensibly work to preserve meaning (something you've already had to reason about if you've ever programmed). (A helpful concept to keep in mind is the difference between free variables and bound variables – starting from a variable and following the path up the tree to the root node, does it run through a $\lambda$ -node with that variable as an argument?)
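The free/bound distinction is short enough to write down as code. A sketch, again assuming the nested-tuple encoding `("var", name)`, `("lam", param, body)`, `("app", func, arg)`:

```python
def free_vars(term):
    """Return the set of names that occur free in a term."""
    kind = term[0]
    if kind == "var":
        return {term[1]}
    if kind == "lam":
        # The parameter is bound everywhere inside the body.
        return free_vars(term[2]) - {term[1]}
    return free_vars(term[1]) | free_vars(term[2])

# (λx. (x y)): x is bound, y is free
print(free_vars(("lam", "x", ("app", ("var", "x"), ("var", "y")))))  # {'y'}
# ((λx. x) x): the second x is free, even though the first is bound
print(free_vars(("app", ("lam", "x", ("var", "x")), ("var", "x"))))  # {'x'}
```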

An example of a name collision problem is this:

We can't do this because the $x$ in the innermost $\lambda$ -term on the left must mean whatever was passed to it, and the $y$ whatever was passed to the outer $\lambda$ -term. However, our reduction leaves us with an expression that applies its argument to itself. We can solve this by renaming the $x$ within the inner $\lambda$ -term:
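In code, this renaming is what a capture-avoiding substitution does automatically. The sketch below uses an example of my own in the same spirit, $((\lambda y. (\lambda x. (x \, y))) \, x)$ , with terms as nested tuples and fresh names drawn from `x1`, `x2`, ... (both choices are assumptions for illustration).

```python
from itertools import count

def free_vars(term):
    if term[0] == "var":
        return {term[1]}
    if term[0] == "lam":
        return free_vars(term[2]) - {term[1]}
    return free_vars(term[1]) | free_vars(term[2])

def fresh_name(avoid):
    """Return the first of x1, x2, ... that is not in `avoid`."""
    return next(f"x{i}" for i in count(1) if f"x{i}" not in avoid)

def substitute(term, name, replacement):
    """Capture-avoiding substitution term[replacement/name]."""
    kind = term[0]
    if kind == "var":
        return replacement if term[1] == name else term
    if kind == "app":
        return ("app",
                substitute(term[1], name, replacement),
                substitute(term[2], name, replacement))
    param, body = term[1], term[2]
    if param == name:
        return term
    if param in free_vars(replacement):
        # Rename the bound variable so it cannot capture a free
        # variable of the replacement.
        new_param = fresh_name(free_vars(body) | free_vars(replacement) | {name})
        body = substitute(body, param, ("var", new_param))
        param = new_param
    return ("lam", param, substitute(body, name, replacement))

inner = ("lam", "x", ("app", ("var", "x"), ("var", "y")))  # (λx. (x y))
result = substitute(inner, "y", ("var", "x"))              # the body of λy, with y := x
print(result)  # ('lam', 'x1', ('app', ('var', 'x1'), ('var', 'x')))
```

The result is $(\lambda x_1. (x_1 \, x))$ : the argument is applied to the outer $x$ , not to itself.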

The general way to think of lambda calculus terms is that they are partitioned in two ways into equivalence classes:

The first, rather trivial, set of equivalence classes is treating all $\alpha$ -equivalent terms as the same thing. "Equivalent" and $\alpha$ -equivalent are usually the same thing when we're talking about the lambda calculus; it's the "structure" of a term that matters, not the variable names.
The second set of equivalence classes is treating everything that can be $\beta$ -reduced into the same form as equivalent. This is less trivial – in fact, it's undecidable in the general case (as we will see in the post about computation theory).

That's it

Yes, really, that's all you need. There exists a lambda calculus term that beats you in chess.

You might ask: but hold on a moment, we have no data – no numbers, no pairs, no lists, no strings – how can we input chess positions into a term or get anything sensible as an answer? We will see later that it's possible to encode data as lambda terms. The chess-playing term would accept some massive mess of $\lambda$ -terms encoding the board configuration as an input, and after a lot of reductions it would become a term encoding the move to make – eventually checkmate, against you.

Before we start abstracting out data and more complex functions, let's make some simple syntax changes and look at some basic facts about reduction.

Some syntax simplifications

The pure lambda calculus does not have $\lambda$ -terms that take more than one argument. This is often inconvenient. However, there's a simple mapping between multi-argument $\lambda$ -terms and single-argument ones: instead of a two-argument function, say, just have a function that takes in an argument and returns a one-argument function that takes in an argument and returns a result using both arguments.

(In programming language terms, this is currying.)

In the standard notation, $(\lambda x.(\lambda y. M))$ is often written $(\lambda xy.M)$ . Likewise, we can do similar simplifications on our trees, remembering that this is a syntactic/visual difference, rather than introducing something new to the lambda calculus:

Once we've done this change, the next natural simplification to make is to allow one application node to apply many arguments to a $\lambda$ -term with "many arguments" (remember that it actually stands for a bunch of nested normal single-argument $\lambda$ -terms):

(The corresponding simplification in the standard syntax is that $(M \, A \, B\, C)$ means $(((M \, A)\, B)\, C)$ . In a standard programming language, this might be written M(A)(B)(C); that is, applying M to A to get a function that you apply to B, yielding another function that you apply to C. Sanity check: what's the difference between $((M \, A) \, B)$ and $(M \, (A \, B))$ ?)
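Python lets us write both styles side by side, which makes the mapping easy to see. A small sketch (the function names are made up for illustration):

```python
# The usual multi-argument style.
def volume(width, height, depth):
    return width * height * depth

# The same function built only from single-argument lambdas.
volume_curried = lambda width: lambda height: lambda depth: width * height * depth

print(volume(2, 3, 4))          # 24
print(volume_curried(2)(3)(4))  # 24

# Supplying only some of the arguments gives back a function that
# waits for the rest.
double_height_box = volume_curried(2)(3)
print(double_height_box(10))    # 60
```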

Some facts about reduction

$\beta$ -normal forms

A $\beta$ -normal form can be thought of as a "fully evaluated" term. More specifically, it is one where this configuration of nodes does not appear in the tree (after multi-argument $\lambda$ s and applications have been compiled into single-argument ones), where $M$ and $N$ are arbitrary terms:

Intuitively, if such a term does appear, then the reduction rules allow us to reduce the application (replacing this part of the tree with whatever you get when you substitute $N$ in place of $x$ within $M$ ), so our term is not fully reduced yet.
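Checking for this configuration is a simple tree walk. A sketch, assuming terms are nested tuples `("var", name)`, `("lam", param, body)`, `("app", func, arg)`:

```python
def is_beta_normal(term):
    """True if no application in the tree has a λ-term as its left child."""
    kind = term[0]
    if kind == "var":
        return True
    if kind == "lam":
        return is_beta_normal(term[2])
    if term[1][0] == "lam":  # the forbidden configuration
        return False
    return is_beta_normal(term[1]) and is_beta_normal(term[2])

identity = ("lam", "x", ("var", "x"))
print(is_beta_normal(("app", ("var", "x"), ("var", "y"))))              # True
print(is_beta_normal(("app", identity, ("var", "y"))))                  # False
print(is_beta_normal(("lam", "z", ("app", identity, ("var", "y")))))    # False
```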

Terms without a $\beta$ -normal form

Does every term have a $\beta$ -normal form? If you've seen computation theory stuff before, you should be able to answer this immediately without considering anything about the lambda calculus itself.

The answer is no, because reducing to a $\beta$ -normal form is the lambda calculus equivalent of an algorithm halting. Lambda calculus has the same expressive power as Turing machines or any other model of computation, and some algorithms run forever, so there must exist lambda calculus terms that you can keep reducing without ever getting a $\beta$ -normal form.

Here's one example, often called $\Omega$ :

Note that even though we use the same variable $x$ in both branches, the variable means a different thing: in the left branch it's whatever is passed as an input to the left $\lambda$ -term – one reduction step onwards, that $x$ stands for the entire right branch, which has its own $x$ . In fact, before we start reducing, we will do an $\alpha$ -conversion on the right branch (a pretentious way of saying that we will rename the bound variable).

Now watch:

After one reduction step, we end up with the same term (as usual, we are treating $\alpha$ -equivalent terms as equivalent; the variable could be $x$ or $y$ or $å$ for all we care).
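We can confirm this mechanically. The sketch below encodes terms as nested tuples and uses a naive substitution without renaming, which is safe here only because both halves of $\Omega$ have no free variables; with the same variable name on both sides, the term after one step is not just $\alpha$ -equivalent but literally identical.

```python
def substitute(term, name, replacement):
    """Naive substitution; fine when `replacement` has no free variables."""
    kind = term[0]
    if kind == "var":
        return replacement if term[1] == name else term
    if kind == "lam":
        if term[1] == name:
            return term
        return ("lam", term[1], substitute(term[2], name, replacement))
    return ("app",
            substitute(term[1], name, replacement),
            substitute(term[2], name, replacement))

half = ("lam", "x", ("app", ("var", "x"), ("var", "x")))  # (λx. (x x))
omega = ("app", half, half)

func, arg = omega[1], omega[2]
after_one_step = substitute(func[2], func[1], arg)
print(after_one_step == omega)  # True
```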

Ambiguities with reduction

Does it matter how we reduce, or does every reduction path eventually lead to a $\beta$ -normal form, assuming that one exists in the first place? If you haven't seen this before, you might want to have a go at this before reading on.

Here's one example of a tricky term:

Imagine that $M$ has a $\beta$ -normal form, and $\Omega$ is as defined above and therefore can be reduced forever. If we start by reducing the application node, in a moment $\Omega$ and all its loopiness gets thrown away, and we're left with just $M$ , since the $\lambda$ -term takes two arguments and returns the first. However, if we start by reducing $\Omega$ , or are following a strategy like "evaluate the arguments before the application", we will at some point reduce $\Omega$ and get thrown in for a loop.

We can take a broader view here. In any programming language – I will use Lisp notation because it's the closest to lambda calculus – if we have a function like (define func (lambda (x y) [FUNCTION BODY])), and a function call like (func arg1 arg2) , the evaluator has a choice of what it does. The simplest strategies are to either:

Evaluate the arguments – arg1 and arg2 – first, and then inside the function func have x and y bound to the results of evaluating arg1 and arg2 respectively. This is called call-by-value, and is used by most programming languages.
Bind x and y inside func to be the unevaluated values of arg1 and arg2, and evaluate arg1 and arg2 only upon encountering them in the process of evaluating func. This is called call-by-name. It's rare to see it in programming languages (an exception being that it's possible with Lisp macros), but functional languages like Haskell often have a variant, call-by-need or "lazy evaluation", where the values of arg1 and arg2 are only executed when needed, but once executed the results are memoized so that the execution only needs to happen once.

Call-by-value reduces what you can express. Imagine trying to define your own if-function in a language with call-by-value:

(define IF
  (lambda (predicate consequent alternative)
    (if predicate
        consequent    ; if predicate is true, do this
        alternative))) ; if predicate is false, do this instead

(note that IF is the new if-function that we're trying to define, and if is assumed to be a language primitive.)

Now consider:

(define factorial
  (lambda (n)
    (IF (= n 0)
        1
        (* n
           (factorial (- n 1))))))

You call (factorial 1), and for the first call the program evaluates the arguments to IF:

(= 1 0)
1
(* 1 (factorial 0))

The last one needs the value of (factorial 0), so we evaluate the arguments to the IF in the recursive call:

(= 0 0)
1
(* 0 (factorial -1))

... and so on. We can't define IF as a function, because in call-by-value the alternative gets evaluated as part of the function call even if predicate is true.
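Python is also call-by-value, so the same failure is easy to reproduce. The sketch below also shows one common workaround that is not the macro approach: wrap each branch in a zero-argument function (a "thunk") so that it is only evaluated when called, which emulates call-by-name by hand. All function names here are made up for illustration.

```python
def IF_eager(predicate, consequent, alternative):
    return consequent if predicate else alternative

def factorial_eager(n):
    # Both branches are evaluated before IF_eager is even called,
    # so the recursion never stops.
    return IF_eager(n == 0, 1, n * factorial_eager(n - 1))

def IF_lazy(predicate, consequent, alternative):
    # consequent and alternative are thunks; only one is ever called.
    return consequent() if predicate else alternative()

def factorial_lazy(n):
    return IF_lazy(n == 0, lambda: 1, lambda: n * factorial_lazy(n - 1))

print(factorial_lazy(5))  # 120

try:
    factorial_eager(1)
except RecursionError:
    print("factorial_eager recursed until Python gave up")
```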

(Most languages solve this by giving you a bunch of primitives and making you stick with them, perhaps with some fiddly mini-language for macros built in (consider C/C++). In Lisp, you can easily write macros that use all of the language features, and therefore extend the language by essentially defining your own primitives that can escape call-by-value or any other potentially limiting language feature.)

It's the same issue with our term $((\lambda xy.x) \, M \, \Omega)$ above: call-by-value goes into a silly loop because one of the arguments isn't even "meant to" be evaluated (from our perspective as humans with goals looking at the formal system from the outside).

Lambda calculus does not impose a reduction/"evaluation" order, so we can do what we like. However, this still leaves us with a problem: how do we know if our algorithm has gone into an infinite loop, or we just reduced terms in the wrong order?

Normal order reduction

It turns out that always doing the equivalent of call-by-name – reducing the leftmost, outermost term first – saves the day. If a $\beta$ -normal form exists, this strategy will lead you to it.

Intuitively, this is because with call-by-name, there is no "unnecessary" reduction. If some arguments in some call are never used (like in our example), they never reduce. If we start reducing an expression while doing leftmost/outermost-first reduction, that reduction must be standing in the way between us and a successful reduction to $\beta$ -normal form.
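Here is a minimal sketch of a leftmost-outermost reducer in Python, tried on $((\lambda xy.x) \, M \, \Omega)$ . Terms are nested tuples `("var", name)`, `("lam", param, body)`, `("app", func, arg)`; the free variable `m` stands in for some $M$ already in normal form, and the step limit is an arbitrary cut-off I added so that non-terminating terms raise an error instead of looping forever.

```python
from itertools import count

def free_vars(t):
    if t[0] == "var":
        return {t[1]}
    if t[0] == "lam":
        return free_vars(t[2]) - {t[1]}
    return free_vars(t[1]) | free_vars(t[2])

def substitute(t, name, new):
    """Capture-avoiding substitution t[new/name]."""
    if t[0] == "var":
        return new if t[1] == name else t
    if t[0] == "app":
        return ("app", substitute(t[1], name, new), substitute(t[2], name, new))
    param, body = t[1], t[2]
    if param == name:
        return t
    if param in free_vars(new):  # rename to avoid capture
        avoid = free_vars(body) | free_vars(new) | {name}
        fresh = next(f"v{i}" for i in count() if f"v{i}" not in avoid)
        body, param = substitute(body, param, ("var", fresh)), fresh
    return ("lam", param, substitute(body, name, new))

def step(t):
    """One leftmost-outermost β-step, or None if t is in β-normal form."""
    if t[0] == "var":
        return None
    if t[0] == "lam":
        body = step(t[2])
        return None if body is None else ("lam", t[1], body)
    func, arg = t[1], t[2]
    if func[0] == "lam":   # outermost reducible application first
        return substitute(func[2], func[1], arg)
    reduced = step(func)   # otherwise look in the left child...
    if reduced is not None:
        return ("app", reduced, arg)
    reduced = step(arg)    # ...and only then in the right child
    return None if reduced is None else ("app", func, reduced)

def normalize(t, max_steps=100):
    for _ in range(max_steps):
        nxt = step(t)
        if nxt is None:
            return t
        t = nxt
    raise RuntimeError("no normal form found within max_steps")

half = ("lam", "x", ("app", ("var", "x"), ("var", "x")))
omega = ("app", half, half)
take_first = ("lam", "x", ("lam", "y", ("var", "x")))  # (λxy.x)
tricky = ("app", ("app", take_first, ("var", "m")), omega)

print(normalize(tricky))  # ('var', 'm')
```

Calling `normalize(omega)` on its own raises the `RuntimeError`, as expected.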

Formally: ... the proof is left as an exercise for the reader.

Church-Rosser theorem

The Church-Rosser theorem is the thing that guarantees we can talk about unique $\beta$ -normal forms for a term. It says that:

Letting $\Lambda$ be the set of terms in the lambda calculus, $\rightarrow_\beta$ the $\beta$ -reduction relation, and $\twoheadrightarrow_\beta$ its reflexive transitive closure (i.e. $M \twoheadrightarrow_\beta N$ iff $M = N$ or there exist zero or more terms $P_1$ , $P_2$ , ... such that $M \rightarrow_\beta P_1 \rightarrow_\beta ... \rightarrow_\beta P_n \rightarrow_\beta N$ ), then:

For all $M \in \Lambda$ , $M \twoheadrightarrow_\beta A$ and $M \twoheadrightarrow_\beta B$ implies that there exists $X \in \Lambda$ such that $A \twoheadrightarrow_\beta X$ and $B \twoheadrightarrow_\beta X$ .

Visually, if we have reduction chains like the black part, then the blue part must exist (a property known as confluence or the "diamond property"):

Therefore, even if there are many reduction paths, and even if some of them are non-terminating, for any two different starting $\beta$ -reductions we can make, we will not lose the existence of a reduction path to any $X$ . If $X$ is some $\beta$ -normal form reachable from $M$ , we know that any other reduction path that reaches a $\beta$ -normal form must have reached $X$ .

The fun begins

Now we will start making definitions within the lambda calculus. These definitions do not add any capabilities to the lambda calculus, but are simply conveniences to save us from having to draw huge trees repeatedly when we get to doing more complex things.

There are two big ideas to keep in mind:

There are no data primitives in the lambda calculus (even the variables are just placeholders for terms to get substituted into, and don't even have consistent names – remember that we work within $\alpha$ -equivalence). As a result, the general idea is that you encode "data" as actions: the number 4 is represented by a function that takes a function and an input and applies the function to the input 4 times, a list might be encoded by a description of how to iterate over it, and so on.
There are no types. Nothing in the lambda calculus will stop you from passing a number to a function that expects a function, or vice versa. There exist typed lambda calculi, but they prevent you from doing some of the cool things with combinators that we'll see later in this post.

Pairs

We want to be able to associate two things into a pair, and then extract the first and second elements. In other words, we want things that work like this:

(fst (pair a b)) == a
(snd (pair a b)) == b

The simplest solution starts like this:

Now we can get the first of a pair by doing ((pair x y) first). If we want the exact semantics above, we can define simple helpers like

fst = (lambda p
        (p first))

(i.e. $\text{fst} = (\lambda p. (p \, \text{first}))$ ), and

snd = (lambda p
        (p second))

since now (snd (pair x y)) reduces to ((pair x y) second) reduces to y.

Lists

A list can be constructed from pairs: [1, 2, 3] will be represented by (pair 1 (pair 2 (pair 3 False))) (we will define False later). If $l_1$ , $l_2$ , and $l_3$ are the list items, a three-element list looks like this:

We might also represent the same list like this instead:

This second representation makes it trivial to define things like a reduce function: ([1, 2, 3] 0 +) would return 0 plus the sum of the list [1, 2, 3], if [1, 2, 3] is represented as above. However, this representation would also make it harder to do other list operations, like getting all but the first element of a list, whereas our pair-based lists can do this trivially ((snd l) gets you all but the first element of the list l).
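The trade-off between the two representations can be sketched in Python. The exact shape of the fold-style list is an assumption on my part (a function that takes a starting value and a curried binary operation), and plain Python integers stand in for the list items; `false`, `pair_list`, `fold_list`, and `plus` are illustrative names.

```python
pair = lambda x: lambda y: lambda f: f(x)(y)
first = lambda x: lambda y: x
second = lambda x: lambda y: y
fst = lambda p: p(first)
snd = lambda p: p(second)
false = second  # list terminator

# Pair-based list: (pair 1 (pair 2 (pair 3 False)))
pair_list = pair(1)(pair(2)(pair(3)(false)))

# Fold-style list (assumed shape): the list *is* its own reduce function.
fold_list = lambda init: lambda op: op(1)(op(2)(op(3)(init)))

plus = lambda a: lambda b: a + b

total = fold_list(0)(plus)            # reduce comes for free
second_item = fst(snd(pair_list))     # "all but the first" is just snd
```

Getting the tail out of `fold_list` has no similarly direct expression, which is the inconvenience described above.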

Numbers & arithmetic

Here is how the numbers work (using a system called Church numerals):

Since giving a function $f$ to a number $n$ (also a function) gives a function that applies $f$ to its input $n$ times, a lot of things are very convenient. Say you have this function to add one, which we'll call succ (for "successor"):

(Considering the above definition of numbers: why does it work?)

Now what is (42 succ)? It's a function that takes an argument and adds 42 to it. More generally, ((n succ) m) gives you m+n. However, there's also a more straightforward way to represent addition, which you can figure out from noticing that all we have to do to add m to n is to compose the "apply f" operation m more times to n, something we can do simply by calling (m f) on n, once we've "standardised" n to have the same f and x as in the $\lambda$ -term that represents m (that is why we have the (n f x) application, rather than just n):
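These claims are easy to check in Python. The sketch assumes the standard definitions $0 = (\lambda f x. x)$ and $\text{succ} = (\lambda n f x. (f \, (n \, f \, x)))$ ; the helpers `church` and `to_int` are not part of the lambda calculus, and exist only to convert to and from Python integers.

```python
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

# Direct addition: apply f m more times on top of n's applications of f.
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

def church(k):
    """Build the Church numeral for the non-negative Python integer k."""
    n = zero
    for _ in range(k):
        n = succ(n)
    return n

def to_int(n):
    """Read a Church numeral back by counting applications of +1 to 0."""
    return n(lambda i: i + 1)(0)

three, four = church(3), church(4)

via_succ = to_int(three(succ)(four))  # ((n succ) m) with n = 3, m = 4
via_add = to_int(add(three)(four))
```

Both routes give 7.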

Now, want multiplication? One way is to see that we can define (mult m n) as ((n (adder m)) 0), assuming that (adder m) returns a function that adds m to its input. As we saw, that can be done with (m succ), so:

(mult m n) =
((n (m succ))
 0)

There's a more standard way too:

The idea here is simply that (n f) gives a $\lambda$ -term that takes an input and applies f to it $n$ times, and when we call m with that as its first argument, we get something that does the $n$ -fold application $m$ times, for a total of $mn$ times, and now all that remains is to pass the x to it.
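Both versions of multiplication can be written down in Python. As before, this is a sketch: `church` and `to_int` are conversion helpers that are not part of the encoding, and the "more standard" version is assumed to be $(\lambda m n f x. (m \, (n \, f) \, x))$ , which is what the description above suggests.

```python
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

def church(k):
    n = zero
    for _ in range(k):
        n = succ(n)
    return n

def to_int(n):
    return n(lambda i: i + 1)(0)

# (mult m n) = ((n (m succ)) 0): add m to zero, n times.
mult_by_adding = lambda m: lambda n: n(m(succ))(zero)

# Assumed standard form: do the n-fold application of f, m times over.
mult = lambda m: lambda n: lambda f: lambda x: m(n(f))(x)

a = to_int(mult_by_adding(church(3))(church(4)))
b = to_int(mult(church(3))(church(4)))
```

Both `a` and `b` come out as 12.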

A particularly neat thing is that exponentiation can be this simple:

Why? I'll let the trees talk. First, using the definition of n as a Church numeral (which I will underline in the trees below), and doing one $\beta$ -reduction, we have:

This does not look promising – a number needs to have two arguments, but we have a $\lambda$ -term taking in one. However, we'll soon see that the x in the tree on the right actually turns out to be the first argument, f, in the finished number. In fact, we'll make that renaming right away (since we're working under $\alpha$ -equivalence), and continue reducing (below we've taken the bottom-most m and expanded it into its Church numeral definition):

At this point, the picture gets clearer: the next thing we'd reduce is the lambda term at the bottom applied to m, but that's just going to do the lambda term (which applies f $m$ times) $m$ more times. We'll have done 2 steps, and gotten up to $m^2$ nestings of f. By the time we've done the remaining $n-1$ steps, we'll have the representation of $m^n$ ; the $n-1$ more applications between our bottom-most and topmost lambda term will reduce away, while the stack of applications of f increases by a factor of $m$ each time.
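If the trees are hard to follow, the result can at least be checked numerically. This sketch assumes the definition in question is (exp m n) = (n m), i.e. the exponent applied to the base; `church` and `to_int` are conversion helpers of my own.

```python
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

def church(k):
    n = zero
    for _ in range(k):
        n = succ(n)
    return n

def to_int(n):
    return n(lambda i: i + 1)(0)

# Assumed definition: (exp m n) = (n m), giving m to the power n.
exp = lambda m: lambda n: n(m)

two_cubed = to_int(exp(church(2))(church(3)))
three_squared = to_int(exp(church(3))(church(2)))
```

Each application of m multiplies the number of nested applications of f by m, so `two_cubed` is 8 and `three_squared` is 9.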

What about subtraction? It's a bit complicated. Okay, how about just subtraction by one, also known as the pred (predecessor) function? Also tricky (and a good puzzle if you want to think about it). Here's one way:

Church numerals make it easy to add, but not subtract. So instead, here's what we do. First (box 1), we make a pair like [0 0]. Next (polygon 2), we have a function that takes a pair p=[a b] and creates a new pair [b (succ b)], where succ is the successor function (one plus its input). Repeated application of this function on the pair in box 1 looks like this: [0 0], [0 1], [1 2], [2 3], and so on. Thus we see that if we start from [0 0] and apply the function in polygon 2 $n$ times (box 3), the first element of the pair is (the Church numeral for) $n-1$ , and the second element is $n$ , and we can simply call fst to get that first element.

Just as addition could be defined as repeated application of succ, we can define subtraction as repeated application of pred:

(minus m n) =
((n pred) m)
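Here is the pair-shifting trick for pred, and minus on top of it, as a Python sketch. The pair and succ definitions are the assumed standard ones, `church` and `to_int` are conversion helpers, and `step` is my name for the function that turns [a b] into [b (succ b)].

```python
pair = lambda x: lambda y: lambda f: f(x)(y)
fst = lambda p: p(lambda x: lambda y: x)
snd = lambda p: p(lambda x: lambda y: y)

zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

def church(k):
    n = zero
    for _ in range(k):
        n = succ(n)
    return n

def to_int(n):
    return n(lambda i: i + 1)(0)

# [a b] -> [b (succ b)]
step = lambda p: pair(snd(p))(succ(snd(p)))

# Start from [0 0], apply step n times, keep the first element.
pred = lambda n: fst(n(step)(pair(zero)(zero)))

# (minus m n) = ((n pred) m)
minus = lambda m: lambda n: n(pred)(m)

four = to_int(pred(church(5)))
diff = to_int(minus(church(7))(church(3)))
```

Note that pred of 0 is 0 under this definition, so minus bottoms out at 0 rather than going negative.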

There's an alternative to Church numerals that's found in the more general Scott encoding. The advantages of Church vs Scott numerals, and their relative structures, are similar to the relative merits and structures of the two types of lists we discussed: one makes many operations natural by exploiting the fact that everything is a function, but also makes "throwing off a piece" (taking the rest/snd of a list, or subtracting one from a number) much harder.

Booleans, if, & equality

You might have noticed that we've defined second as $(\lambda x y. y)$ , and 0 as $(\lambda f x. x)$ . These two terms are a variable-renaming away from each other, so they are $\alpha$ -equivalent. In other words, second and 0 are the same thing. Because we don't have types, which is which depends only on our interpretation of the context it appears in.

Now let's define True and False. False is kind of like 0, so let's just say they're also the same thing. The opposite of $(\lambda x y. y)$ is $(\lambda x y. x)$ , so let's define that to be True.

What sort of muddle have we landed ourselves in now? Quite a good one, actually. Let's define (if p c a) to be (p c a). If the predicate p is True, we select the consequent c, because (True c a) is exactly the same as (first c a), which is clearly c. Likewise, if p is False, then we evaluate the same thing as (second c a) and end up with the alternative a.

We will also want to test whether a number is 0/False (equality in general is hard in the lambda calculus, so what we end up with won't be guaranteed to work with things that aren't numbers). A simple way is:

eq0 =
(lambda x
  (x (lambda y
       False)
     True))

If x is 0, it's the same as second and will act as a conditional and pick out True. If it's not zero, we assume that it's some number $n$ , and therefore will be a function that applies its first argument $n$ times. Applying $(\lambda y.\text{False})$ any non-zero amount of times to anything will return False.
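The booleans, if, and eq0 translate directly into curried Python lambdas. In this sketch the lowercase names `true`, `false`, and `if_` are chosen only to avoid clashing with Python keywords, and `church` is a conversion helper.

```python
true = lambda x: lambda y: x    # same term as first
false = lambda x: lambda y: y   # same term as second and 0

# (if p c a) = (p c a)
if_ = lambda p: lambda c: lambda a: p(c)(a)

zero = false
succ = lambda n: lambda f: lambda x: f(n(f)(x))

def church(k):
    n = zero
    for _ in range(k):
        n = succ(n)
    return n

# eq0 = (lambda x (x (lambda y False) True))
eq0 = lambda n: n(lambda y: false)(true)

picked_c = if_(true)("consequent")("alternative")
picked_a = if_(false)("consequent")("alternative")
zero_is_zero = eq0(zero) is true
three_is_zero = eq0(church(3)) is true
```

One caveat that is specific to Python: it evaluates both branches before `if_` chooses between them, so this version of if cannot guard a recursive call the way it can under normal-order reduction.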

Fixed points, combinators, and recursion

The big thing missing from the definitions we've put on top of the lambda calculus so far is recursion. Every lambda term represents an anonymous function, so there's no name within a $\lambda$ -term that we can "call" to recurse.

Rather than jumping in straight to recursion, we're going to start with Russell's paradox: does the set of all sets that do not contain themselves contain itself? Phrased mathematically: what the hell is $R = \{x \,|\,x\notin x\}$ ?

In computation theory, sets are often specified by a characteristic function: a function that is always defined if the set is computable, and returns true if an element is in the set and false otherwise.

In the lambda calculus (which was originally supposed to be a foundation for logic), here's a characteristic function for the Russell set $R$ :

(where not can be straightforwardly defined on top of our existing definitions as (not b) = (b False True)).

This $\lambda$ -term takes in an element x, assumes that x is the (characteristic function for) the set itself, and asks: is it the case that x is not in the set? Call this term R, and consider (R R): the left R is working as the (characteristic function of) the set, and the right R as the element whose membership of the set we are testing.

Evaluating:

So we start out saying (R R), and in one $\beta$ -reduction step we end up saying (not (R R)) (just as, with Russell's paradox, it first seems that the set must contain itself, because the set is not in itself, but once we've added the set to itself then suddenly it shouldn't be in itself anymore). One more step and we get, from (R R), (not (not (R R))). This is not ideal as a foundation for logic.

However, you might realise something: the not here doesn't play any role. We can replace it with any arbitrary f. In fact, let's do that, and create a simple wrapper $\lambda$ -term around it that lets us pass in any f we want:

Now let's look at the properties that $Y$ has:

$(Y \, f) =_\beta (f \, (Y \, f)) =_\beta (f \, (f \, (Y \, f))) =_\beta ...$

$Y$ is called the Y combinator ("combinator" is a generic term for a lambda calculus term with no free variables). It is part of the general class of fixed-point combinators: combinators $X$ such that $(X \, f) = (f \, (X\,f))$ . (Turing invented another one: $\Theta = (A \, A)$ , where $A$ is defined as $(\lambda x y. (y \,(x\, x\, y)))$ .)

A fixed-point combinator gives us recursion. Imagine we've almost written a recursive function, say for a factorial, except we've left a free function parameter for the recursive call:

(lambda f x
  (if (eq0 x)
      1
      (mult x
            (f (pred x)))))

(Also, take a moment to appreciate that we can already do everything necessary except for the recursion with our earlier definitions.)

Call the previous recursion-free factorial term F, and consider reducing ((Y F) 2) (where -BETA-> stands for one or more $\beta$ -reductions):

((Y F)
 2)

-BETA->

((F (Y F))
 2)

-BETA->

((lambda x
   (if (eq0 x)
       1
       (mult x
             ((Y F) (pred x)))))
 2)

-BETA->

(if (eq0 2)
    1
    (mult 2
          ((Y F) (pred 2))))

-BETA->

(mult 2
      ((Y F)
       1))

-BETA->

(mult 2
      ((F (Y F))
       1))

-BETA->

(mult 2
      ((lambda x
         (if (eq0 x)
             1
             (mult x
                   ((Y F) (pred x)))))
       1))

-BETA->
...
-BETA->

(mult 2
      (mult 1
            1))

-BETA->

2

It works! Get a fixed-point combinator, and recursion is solved.
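The same trick can be tried in Python, with two adjustments that are mine rather than part of the derivation above. First, Python evaluates arguments eagerly, so the Y combinator as written would loop forever; the sketch swaps in the closely related Z combinator, which delays the self-application behind an extra lambda. Second, it uses native Python integers and Python's own conditional expression instead of Church numerals and the lambda-calculus if, since the latter would evaluate both branches eagerly.

```python
# Z combinator: like Y, but (x x) is wrapped in (lambda v: ...) so that it
# is only unfolded when the recursive call is actually made.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# Recursion-free factorial: the recursive call is the free parameter f.
F = lambda f: lambda x: 1 if x == 0 else x * f(x - 1)

factorial = Z(F)

two = factorial(2)
one_twenty = factorial(5)
```

No function here ever refers to itself by name; all of the recursion comes from the combinator.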

Primitive recursion

The definition of the partial recursive functions (one of the ways to define computability, mentioned at the beginning) involves something called primitive recursion. Let's implement that, and along the way look at fixed-point combinators from another perspective.

Primitive recursion is essentially about implementing bounded for-loops / recursion stacks, where "bounded" means that the depth is known when we enter the loop. Specifically, there's a function $f$ that takes in zero or more parameters, which we'll abbreviate as $\overline{P}$ . At 0, the value of our primitive recursive function $h$ is $f(\overline{P})$ . At any integer $x+1$ for $x \geq 0$ , $h(\overline{P}, x+1)$ is defined as $g(\overline{P}, x, h(\overline{P}, x))$ : in other words, the value at $x+1$ is given by some function of:

fixed parameter(s) $\overline{P}$ ,
how many more steps there are in the loop before hitting the base case ( $x$ ), and
the value at $x$ (the recursive part).

For example, in our factorial example there are no parameters, so $f$ is just the constant function 1, and $g(x, r) = (x + 1) \times r$ , where $r$ is the recursive result for one less, and we have $x+1$ because (for a reason I can't figure out – ideas?) $g$ takes, by definition, not the current loop index but one less.

Now it's pretty easy to write the function for primitive recursion, leaving the recursive call as an extra parameter (r) once again, and assuming that we have $\lambda$ -terms F and G for $f$ and $g$ respectively:

(lambda r P x
  (if (eq0 x)
      (F P)
      (G P (pred x) (r P (pred x)))))

Slap a $Y$ in front, and we take care of the recursion and we're done.
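To make the schema concrete, here is a sketch in Python using ordinary named recursion and native integers rather than $Y$ and Church numerals. The function name `primitive_recursion` and the `power` example are my own additions for illustration.

```python
def primitive_recursion(f, g):
    """Return h with h(P, 0) = f(P) and h(P, x+1) = g(P, x, h(P, x))."""
    def h(*args):
        *params, x = args          # the last argument is the loop counter
        if x == 0:
            return f(*params)
        return g(*params, x - 1, h(*params, x - 1))
    return h

# Factorial: no parameters, f is the constant 1, g(x, r) = (x + 1) * r.
factorial = primitive_recursion(lambda: 1, lambda x, r: (x + 1) * r)

# Power with one fixed parameter b: b^0 = 1, b^(x+1) = b * b^x.
power = primitive_recursion(lambda b: 1, lambda b, x, r: b * r)

fact5 = factorial(5)
two_to_ten = power(2, 10)
```

The recursion depth is exactly the value of the counter on entry, which is what "bounded" means here.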

The fixed point perspective

However, rather than viewing this whole "slap in the $Y$ " business as a hack for getting recursion, we can also interpret it as a fixed point operation.

A fixed point of a function $f$ is a value $x$ such that $x = f(x)$ . The fixed points of $f(x)=x^2$ are 0 and 1. In general, fixed points are often useful in maths stuff and there's a lot of deep theory behind them (for which you will have to look elsewhere).

Now $Y$ (or any other fixed point combinator) has the property that $(Y f) =_\beta (f \, (Y\, f))$ (remember that the equivalent of $f(x)$ is written $(f \,x)$ in the lambda calculus). In other words, $Y$ is a magic wand that takes a function and returns its fixed point (albeit in a mathematical sense that is not very useful for explicitly finding those fixed points).

Taking once again the example of defining primitive recursion, we can consider it as the fixed point problem of finding an $h$ such that $h = \Phi_{f,g}(h)$ , where $\Phi_{f,g}$ is a function like the following, where F and G are the lambda calculus representations of $f$ and $g$ respectively:

(lambda h
  (lambda P x
    (if (eq0 x)
        (F P)
        (G P (pred x) (h P (pred x))))))

That is, $\Phi_{f,g}$ takes in some function h, and then returns a function that does primitive recursion – under the assumption that h is the right function for the recursive call.

Imagine it like this: when we're finding the fixed point of $f(x)= x^2$ , we're asking for $x$ such that $x=x^2$ . We can imagine reaching into the set of values that $x$ can take (in this case, the real numbers), plugging them in, and seeing that in most cases the equation $x=x^2$ is false, but if we pick out a fixed point it becomes true.

Similarly, solving $h=\Phi_{f,g}(h)$ is the problem of considering all possible functions $h$ (and it turns out all computable functions can be enumerated, so this is, if anything, less crazy than considering all possible real numbers), and requiring that plugging $h$ into $\Phi_{f,g}$ gives back $h$ . For almost any function that we plug in, this equation will be nonsense: instead of doing primitive recursion, $\Phi_{f,g}(h)$ will, on its first call to h, do something crazy that might loop forever or calculate the 17th digit of $\pi$ . But if $h$ is picked just right, $h$ and $\Phi_{f,g}(h)$ will happen to be the same thing. Unlike in the algebraic case, it's very difficult to iteratively improve on your guess for $h$ , so it's hard to see how this weird way of defining the problem of finding $h$ could be used to actually find it.

Except hold on – we're working in the lambda calculus, and fixed point combinators are easy: call $Y$ on a function and we have its fixed point, and, by the reasoning above, that is the recursive version of that function.
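The fixed point view can be poked at in Python. This sketch makes several simplifying assumptions: there are no fixed parameters $\overline{P}$ , native integers replace Church numerals, "being the same function" is only checked on a handful of inputs, and the eager-evaluation-friendly Z combinator stands in for $Y$ . The names `phi`, `agrees`, and `phi_fact` are made up for the example.

```python
import math

def phi(f, g):
    """Build the transformer that maps a guess h to the function Phi(h)."""
    def transform(h):
        def new_h(x):
            return f() if x == 0 else g(x - 1, h(x - 1))
        return new_h
    return transform

# The transformer for factorial: f is the constant 1, g(x, r) = (x + 1) * r.
phi_fact = phi(lambda: 1, lambda x, r: (x + 1) * r)

def agrees(h, xs):
    """Check h(x) == Phi(h)(x) on the sample inputs xs."""
    return all(h(x) == phi_fact(h)(x) for x in xs)

right_guess = agrees(math.factorial, range(8))   # factorial is a fixed point
wrong_guess = agrees(lambda x: x * x, range(8))  # squaring is not

# Instead of guessing, let a fixed-point combinator find h.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))
found = Z(phi_fact)
```

Plugging in the real factorial leaves it unchanged, plugging in a wrong guess does not, and `found` behaves like factorial without us ever having guessed it.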

The lambda calculus in lambda calculus

There's one final powerful demonstration of a computation model's expressive power that we haven't looked at: being able to express itself. The most well-known case is the universal Turing machine, and those crop up a lot when you're thinking about computation theory.

Now there exists a trivial universal lambda term: $(\lambda \,f\,a\,.\,(f \,a))$ takes $f$ , the lambda representation of some function, and an argument $a$ , and returns the lambda calculus representation of $f$ applied to $a$ . However, this isn't exactly fair, since we've just forwarded all the work onto whatever is interpreting the lambda calculus. It's like noting that an eval function exists in a programming language, and then writing on your CV that you've written an evaluator for it.

Instead, a "fair" way to define a universal lambda term is to build on the data specifications we have to define a representation of variables, lambda terms, and application terms, and then write more definitions within the lambda calculus until we have a reduce function.

This is what I've done in Lambda Engine. The definitions specific to defining the lambda calculus within the lambda calculus start about halfway down this file. I won't walk through the details here (see the code and comments for more detail), but the core points are:

We distinguish term types by making each term a pair consisting of an identifier and then the data associated with it. The identifier for variables/ $\lambda$ s/applications is a function that takes a triple and returns the 1st/2nd/3rd member of it (this is simpler than tagging them with e.g. Church numerals, since testing numerical equality is complicated). The data is either a Church numeral (for variables) or a pair of a variable and a term ( $\lambda$ -terms) or a term and a term (applications).
We need case-based recursion, where we can take in a term, figure out what it is, and then perform a call to a function to handle that term and pass on the main recursive function to that handler function (for example, because when substituting in an application term, we need to call the main substitution function on both the left and right child of the application). The case-based recursion functions (different ones for the different number of arguments required by substitution and reduction) take a triple of functions (one for each term type) and exploit the fact that the identifier of a term is a function that picks some element from the triple (in this case, we call the identifier on the handler function triple to pick the right one).
We have helper functions to build our term types, extract out parts, and test for whether something is a $\lambda$ -term (exploiting the fact that the first element of the pair representing a lambda term is the "take the 2nd thing from a triple" function).
With the above, we can define substitution fairly straightforwardly. Note that we need to test Church numeral equality, which requires a generic Church numeral equality tester, which is a slow function (because it needs to recurse and take a lot of predecessors).
For reduction, the main tricky bit is doing it in normal order. This means that we have to be able to tell whether the left child in an application term is reducible before we try to reduce the right child (e.g. the left child might eventually reduce to a function that throws away its argument, and the right child might be a looping term like $\Omega$ ). We define a helper function to check whether something reduces, and then can write reduce-app and therefore reduce. For convenience we can define a function n-reduce that calls reduce on an expression n times, simply by exploiting how Church numerals work (((2 reduce) x) is (reduce (reduce x)), for example).

What we don't have:

Variable renaming. We assume that terms in this lambda calculus are written so that a variable name (in this case, a Church numeral) is never reused.
Automatically reducing to $\beta$ -normal form. This could be done fairly simply by writing another function that calls itself with the reduce of its argument until our checker for whether something reduces is false.
Automatically checking whether we're looping (e.g. we've typed in the definition of $\Omega$ ).

The lambda calculus interpreter in this file has all three features above. You can play with it, and the lambda-calculus-in-lambda-calculus, by downloading Lambda Engine (and a Racket interpreter if you don't already have one) and using one of the evaluators in this file.
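For a feel of the overall shape without reading the lambda-calculus source, here is a much smaller sketch of the same idea in Python. It is not the Lambda Engine code: terms are tagged tuples rather than pairs with selector functions, variable names are plain integers, and every function name is made up. Like the version described above, it assumes variable names are chosen so that no renaming is ever needed, and it reduces in normal order.

```python
def var(n): return ("var", n)
def lam(n, body): return ("lam", n, body)
def app(left, right): return ("app", left, right)

def subst(term, name, value):
    """Replace free occurrences of variable `name` in `term` with `value`."""
    kind = term[0]
    if kind == "var":
        return value if term[1] == name else term
    if kind == "lam":
        if term[1] == name:  # name is rebound here, so stop
            return term
        return lam(term[1], subst(term[2], name, value))
    return app(subst(term[1], name, value), subst(term[2], name, value))

def reduce_once(term):
    """One normal-order step; returns None if the term is in normal form."""
    kind = term[0]
    if kind == "var":
        return None
    if kind == "lam":
        body = reduce_once(term[2])
        return None if body is None else lam(term[1], body)
    left, right = term[1], term[2]
    if left[0] == "lam":  # leftmost, outermost redex
        return subst(left[2], left[1], right)
    new_left = reduce_once(left)
    if new_left is not None:  # only touch the right child as a last resort
        return app(new_left, right)
    new_right = reduce_once(right)
    return None if new_right is None else app(left, new_right)

def normalize(term, max_steps=1000):
    """Reduce until normal form, giving up after max_steps steps."""
    for _ in range(max_steps):
        nxt = reduce_once(term)
        if nxt is None:
            return term
        term = nxt
    raise RuntimeError("no normal form found within max_steps")

K = lam(0, lam(1, var(0)))            # keeps its first argument
w = lam(2, app(var(2), var(2)))
omega = app(w, w)                     # loops forever
answer = normalize(app(app(K, var(9)), omega))
```

Because the left child is reduced first, the looping argument is thrown away and `answer` is the variable 9. The `max_steps` cut-off is a crude substitute for real loop detection.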

Towards Lisp

Let's see what we've defined in the lambda calculus so far:

pair
lists
fst
snd
True
False
if
eq0
numbers
recursion

This is most of what you need in a Lisp. Lisp was invented in 1958 by John McCarthy. It was intended as an alternative axiomatisation for computation, with the goal of not being too complicated to define while still being human friendly, unlike the lambda calculus or Turing machines. It borrows notation (in particular the keyword lambda) from the lambda calculus and its terms are also trees, but it is not directly based on the lambda calculus.

Lisp was not intended as a programming language, but Steve Russell (no relation to Bertrand Russell ... I'm pretty sure) realised you could write machine code to evaluate Lisp expressions, and went ahead and did so, making Lisp the second-oldest high-level programming language still in common use (after Fortran). Despite its age, Lisp is arguably the most elegant and flexible programming language (modern dialects include Clojure and Racket).

One way to think of what we've done in this post is that we've started from the lambda calculus – an almost stupidly simple theoretical model – and made definitions and syntax transformations until we got most of the way to being able to emulate Lisp, a very usable and practical programming language. The main takeaway is, hopefully, an intuitive sense of how something as simple as the lambda calculus can express any computation expressible in a higher-level language.