
2024-07-23

Positive visions for AI

This post was a collaboration with Florence Hinder

Reasons to make the positive case

Everyone who starts thinking about AI starts thinking big. Alan Turing predicted that machine intelligence would make humanity appear feeble in comparison. I. J. Good said that AI is the last invention that humanity ever needs to invent.

The AI safety movement started from Eliezer Yudkowsky and others on the SL4 mailing list discussing (and aiming for) an intelligence explosion and colonizing the universe. However, as the promise of AI has drawn nearer, visions for AI upsides have paradoxically shrunk. Within the field of AI safety, this is due to a combination of the “doomers” believing in very high existential risk and therefore focusing on trying to avoid imminent human extinction rather than achieving the upside, people working on policy not talking about sci-fi upsides to look less weird, and recent progress in AI driving the focus towards concrete machine learning research rather than aspirational visions of the future.

Both DeepMind and OpenAI were explicitly founded as moonshot AGI projects (“solve intelligence, and then use that to solve everything else” in the words of Demis Hassabis). Now DeepMind - sorry, Google DeepMind - has been eaten by the corporate machinery of Alphabet, and OpenAI is increasingly captured by profit and product considerations.

The torch of AI techno-optimism has moved on to the e/acc movement. Their core message is correct: growth, innovation, and energy are very important, and almost no one puts enough emphasis on them. However, their claims to take radical futures seriously are belied by the fact that their visions of the future seem to stop at GenAI unicorns. They also seem to take the general usefulness of innovation not as just a robust trend, but as a law of nature, and so are remarkably incurious about the possibility of important exceptions. Their deeper ideology is in parts incoherent and inhuman. Instead of centering human well-being, they worship the “thermodynamic will of the universe”. “You cannot stop the acceleration”, argues their figurehead, so “[y]ou might as well embrace it” - hardly an inspiring humanist rallying cry.

In this piece, we want to paint a picture of the possible benefits of AI, without ignoring the risks or shying away from radical visions. Why not dream about the future you hope for? It’s important to consider the future you want rather than just the future you don’t. Otherwise, you might create your own unfortunate destiny. In the Greek myth of Oedipus, it was prophesied that he would kill his father, so his father ordered him to be killed. He survived, however, and ended up being adopted. Years later he met his father on the road during his travels and killed him, having no idea who the man was. By focusing on the bad path, Oedipus’ father may have made the prophecy come true: had he not ordered his son to be killed, Oedipus would have known who his father was and likely wouldn’t have killed him.

When thinking about AI, if we only focus on the catastrophic future, we may help bring it about by increasing attention on this topic. Sam Altman, who is leading the way in AI capabilities, claimed to have gotten interested in the field because of arch-doomer Eliezer Yudkowsky. We may also neglect progress towards positive AI developments; some people think that even direct AI alignment research should not be published because it might speed up the creation of unaligned AI.

With modern AI, we might even get a very direct “self-fulfilling prophecy” effect: current AIs increasingly know that they are AIs, and make predictions about how to act based on their training data which includes everything we write about AI.

Benefits of AI

Since we think much of the focus on AI is on what could go wrong, let’s think through what could go well, starting from what’s most tangible and close to the current usage of AI and moving to what the more distant future could hold.

AI will do the mundane work
Lowering the costs of coordination
Spreading Intelligence
AI can create more technology
Increased technology, wealth and energy, correlate with life being good
All of the above, and the wealth it creates, could allow people to self-actualise more

Already, AI advances mean that Claude has become very useful, and programmers are faster and better. But below we’ll cast a look towards the bigger picture and where this could take us.

AI will do the mundane work

First, there’s a lot of mundane mental work that humans currently have to do. Dealing with admin work, filing taxes, coordinating parcel returns -- these are not the things you will fondly be reminiscing about as you lie on your deathbed. Software has reduced the pain of dealing with such things, but not perfectly. In the future, you should be able to deal with all administrative work by specifying what you want to get done to an AI, and being consulted on decision points or any ambiguities in your preferences. Many CEOs or executives have personal assistants; AIs will mean that everyone will have access to this.

What about mundane physical work, like washing the dishes and cleaning the toilets? Currently, robotics is bad. But there is no known fundamental obstacle to having good robotics. It seems mainly downstream of a lot of engineering and a lot of data collection. AI can help with both of those. The household robots that we’ve been waiting for could finally become a reality.

Of course, it is unclear whether AIs will first have a comparative advantage against humans in mundane or meaningful work. We’re already seeing that AI models are making massive strides in making art, way before they’re managing our inboxes for us. It may be that there is a transitional period where robotics is lagging but AIs are smarter-than-human, where the main economic value of humans is their hands rather than their brains.

Lowering the cost of coordination

With AI agents being able to negotiate with other AI agents, the cost of coordination is likely to dramatically drop (see here for related discussion). Examples of coordination are agreements between multiple parties, or searching through a large pool of people to match buyers or sellers, or employees and employers. Searching through large sets of people, doing complex negotiations, and the monitoring and enforcement of agreements all take lots of human time. AI could reduce the cost and time taken by such work. In addition to efficiency gains, new opportunities for coordination will open up that would have previously been too expensive.

Small-scale coordination

To give an example of this on the small scale of two individuals, say you are trying to search for a new job. Normally you can’t review every single job posting ever, and employers can’t review every person in the world to see if they want to reach out. However, an AI could filter that for the individual and another AI for the business, and the two AIs could have detailed negotiations with each other to find the best possible match.

Coordination as a scarce resource

A lot of the current economy is a coordination platform; that’s the main product of each of Google, Uber, Amazon, and Facebook. Reducing the cost of searching for matches and trades should unlock at least as much mundane benefit and economic value as the tech platforms have.

Increased coordination may also reduce the need to group people into roles, hierarchies, and stereotypes. Right now, we need to put people into rigid structures (e.g. large organisations with departments like “HR” or “R&D”, or specific roles like “doctor” or “developer”) when coordinating a large group of people. In addition to upholding standards and enabling specialisation of labour, another reason for this is that people need to be legible to unintelligent processes, like binning of applicants by profession, or the CEO using an org chart to find out who to ask about a problem, or someone trying to buy some type of service. Humans can reach a much higher level of nuance when dealing with their friends and immediate colleagues. The cheap intelligence we get from AI might let us deal with the same level of nuance with a larger group of people than humans can themselves track. This means people may be able to be more unique and differentiated, while still being able to interface with society.

Large-scale Coordination

On a larger scale, increased coordination will also impact geopolitics. Say there are two countries fighting over land or resources. Each country could have AI agents negotiate with the other’s AI agents, searching the space of possible deals to find the best compromise for both. They could also simulate a vast number of war scenarios to figure out what would happen; much conflict is about two sides disagreeing about who would win and resolving the uncertainty through a real-world test. This relies on three key abilities: the ability to negotiate cheaply, the ability to simulate outcomes, and the ability to stick to and enforce contracts. AI is likely to help with all three. This could reduce the incentives for traditional war, in that no human lives need to be lost because the outcome is already known and we can negotiate straight from that. Each side would also know exactly what it is and is not willing to trade off, which makes it easier to optimise for the best compromise for everyone.

Spreading the intelligence

AI lets us spread the benefits of being smart more widely.

The benefits of intelligence are large. For example, this study estimates that a 1 standard deviation increase in intelligence increases your odds of self-assessed happiness by 11%. Now, part of this gain comes from intelligence being a positional good: you benefit from having more intelligence at your disposal than others, for example in competing for a fixed set of places. However, intelligence also has absolute benefits, since it lets you make better choices. And AI means you can convert energy into intelligence. Much as physical machines let the weak gain some of the benefits of (even superhuman) strength, AI might allow all humans to enjoy some of the benefits of being smart.

Concretely, this could have two forms. The first is that you could have AI advisors increase your ability to make plans or decisions, in the same way that - hypothetically - even a near-senile president might still make decent decisions with the help of their smart advisors. With AI, everyone could have access to comparable expert advisors. The effect may be even more dramatic than human advisors: the AI might be superhumanly smart, the AI might be more verifiably smart (a big problem in selecting smart advisors is that it can be hard to tell who is actually smart, especially if you are not), and if AIs are aligned successfully there may be less to worry about in trusting it than in trusting potentially-scheming human advisors.

The second is AI tutoring. Human 1-1 tutoring boosts educational outcomes by 2 standard deviations (2 standard deviations above average is often considered the cutoff for “giftedness”). If AI tutoring is as good, that’s a big deal.
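To give a rough sense of scale for the 2 standard deviation figure, here is a minimal sketch. It assumes that educational outcomes are normally distributed, which is an assumption made here for illustration and not something the tutoring claim itself states. The function name `percentile_after_shift` is hypothetical.

```python
from statistics import NormalDist

# Assumption for illustration only: outcomes of untutored students follow
# a standard normal distribution (mean 0, standard deviation 1).
untutored = NormalDist(mu=0.0, sigma=1.0)


def percentile_after_shift(shift_in_sd: float) -> float:
    """Percentile (0-100) within the untutored distribution reached by a
    median student whose outcome is shifted up by `shift_in_sd` standard
    deviations."""
    return 100.0 * untutored.cdf(shift_in_sd)


if __name__ == "__main__":
    # A median student boosted by 2 standard deviations.
    print(f"{percentile_after_shift(2.0):.1f}")  # prints 97.7
```

Under this assumption, a median student with a 2 standard deviation boost would outperform roughly 98% of the untutored group.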

AI is the ultimate meta-technology

AI is special because it automates intelligence, and intelligence is what you need to build technology, including AI, creating a feedback loop. Some other previous technologies have boosted other technologies; for example, the printing press massively helped the accumulation of knowledge that led to the invention of many other technologies. But we have not before had a technology that could itself directly advance other technology. Such AI has been called PASTA (Process for Automating Scientific and Technological Advancement).

Positive feedback loops - whether self-improving AIs, nuclear reactions, epidemics, or human cultural evolution - are very powerful, so you should be wary of risks from them. Similarly, it is currently at best extremely unclear whether AIs that improve themselves could be controlled with current technology. We should be very cautious in using AI systems to improve themselves.

In the long run, however, most of the value of AI will likely come from their effects on technological progress, much like the next industrial revolution. We can imagine AIs slashing the cost and increasing the speed of science in every field, curing diseases and making entire new veins of technology available, in the same way that steam engines made entirely new veins of coal accessible.

In particular, AIs help de-risk one of the largest current risks to future human progress. One model of the feedback loop behind humanity’s progress in the past few centuries is that people led to ideas led to wealth led to food led to more people.

However, greater wealth no longer translates into more people. The world population, which was exponentially growing for much of the 19th and 20th centuries, is likely to be in decline by the end of the 21st century. This is likely to have negative consequences for the rate of innovation, and as discussed in the next section, a decline in productivity would likely have a negative impact on human wellbeing. However, if AIs start driving innovation, then we have a new feedback loop: wealth leads to energy leads to more AIs leads to ideas leads to wealth.

As long as this feedback loop does not decouple from the human economy and instead continues benefitting humans, this could help progress continue long into the future.

Wealth and energy are good

If you want humans to be well-off, one of the easiest things to do is give them more wealth and more energy. GDP per capita (on a log scale) has a 0.79 correlation with life satisfaction, and per-capita energy use (again on a log scale) has a 0.74 correlation with life satisfaction. Increased wealth and energy correlate with life satisfaction, and we should expect these trends to continue.

Above: GDP per capita (x-axis), energy use (y-axis), and life satisfaction (colour scale) for 142 countries. There are no poor countries with high energy use, and no high energy use countries that are poor. There are no countries with high average life satisfaction that are not high in both energy use and average GDP per capita. The axes are logarithmic, but since economic growth is exponential, countries should be able to make progress at a constant rate along the axis. Data source: Our World In Data (here, here, and here).
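The point about logarithmic axes can be made concrete with a small sketch. If a quantity grows at a constant annual rate, the time needed to multiply it by a fixed factor (that is, to move a fixed distance along a log axis) does not depend on the starting level. The growth rates below are hypothetical illustrations, not figures from the data, and `years_to_multiply` is a name introduced here.

```python
import math


def years_to_multiply(factor: float, annual_growth: float) -> float:
    """Years needed to multiply a quantity by `factor` when it grows at a
    constant rate `annual_growth` per year (e.g. 0.03 for 3%).

    The starting level is not an argument, because under constant
    exponential growth the answer does not depend on it.
    """
    return math.log(factor) / math.log(1.0 + annual_growth)


if __name__ == "__main__":
    # Hypothetical growth rates, chosen only for illustration.
    for rate in (0.02, 0.03, 0.07):
        years = years_to_multiply(10.0, rate)
        print(f"{rate:.0%} growth: {years:.0f} years per tenfold increase")
```

A country at 1,000 dollars per capita and one at 10,000 dollars per capita would, at the same growth rate, take the same number of years to move one "decade" along the axis.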

(It is true that energy use and economic growth have been increasingly decoupling in rich countries, due to services being more of the economy, and efficiency gains in energy use. However, the latter is effectively increasing the amount of useful energy that can be used - e.g. say the amount of energy needed to cook one meal is now enough to cook two meals, which is effectively the same as gaining more energy. However, efficiency effects are fundamentally limited because there is a physical limit, and also if demand is elastic then efficiency gains lead to increased energy use, meaning it doesn’t help the environment either. Ultimately, if you want to do more things in the physical world, you need more energy).
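The claim about elastic demand can be sketched with a toy model. Assume, purely for illustration, that demand for an energy service (such as cooked meals) has a constant price elasticity, and that the price of the service is the energy price divided by efficiency. Then service demand scales as efficiency to the power of the elasticity, and energy use (service divided by efficiency) scales as efficiency to the power of elasticity minus one. The model and the name `energy_use_ratio` are assumptions introduced here, not something taken from the post's sources.

```python
def energy_use_ratio(efficiency_gain: float, elasticity: float) -> float:
    """Ratio of new to old total energy use after efficiency improves by
    `efficiency_gain` (2.0 means twice as efficient).

    Toy model (an assumption): constant-elasticity demand for the energy
    service, with the energy price held fixed.
      service demand   scales as  efficiency_gain ** elasticity
      energy use       = service / efficiency
                       scales as  efficiency_gain ** (elasticity - 1)
    """
    return efficiency_gain ** (elasticity - 1.0)


if __name__ == "__main__":
    # Hypothetical elasticities, chosen only for illustration.
    for elasticity in (0.5, 1.0, 1.5):
        ratio = energy_use_ratio(2.0, elasticity)
        print(f"elasticity {elasticity}: energy use changes by a factor of {ratio:.2f}")
```

In this toy model, doubling efficiency reduces total energy use when the elasticity is below 1, leaves it unchanged at exactly 1, and increases it when demand is elastic (elasticity above 1).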

A wealthy, energy-rich society has many material benefits: plentiful food, advanced medicine, the feasibility of high redistributive spending, and great choice and personal freedom through specialisation of labour and high spending power. A wealthy and energy-rich society also has some important subtler benefits. Poverty and resource constraints sharpen conflict. Economic growth is intimately linked to tolerance and liberalism, by weakening the cultural status and clout of zero-sum strategies like conflict and politicking.

One clear historic example of how increases in energy correlated with improved quality of life was in the industrial revolution, arguably the best and most important thing that ever happened. Before it, trends in human wellbeing seemed either stagnant, fluctuating, or very slow, and after it, all the variables for which we can find good long-term series that are related to human well-being shoot upwards.

Above: variables correlated with human well-being over time. Source: Luke Muehlhauser

Therefore, it’s worth keeping in mind that boosting energy and wealth is good, actually. And the most powerful way to do that is through inventing new technologies that let us use energy to serve our needs.

The heart of the industrial revolution was replacing part of human manual labour with something cheaper and more powerful. AI that replaces large parts of human mental labour with something cheaper and more powerful should be expected to be similarly transformative. Whether it is a good or bad transformation seems more uncertain. We are lucky that industrialisation happened to make national power very tightly tied to having a large, educated, and prosperous middle class; it is unclear what is the winning strategy in an AI economy. We are also lucky that the powerful totalitarian states enabled by industrial technology have not triumphed so far, and they might get further boosts from AI. Automating mental labour also involves the automation of decision-making, and handing over decision-making to the machines is handing over power to machines, which is more risky than handing the manual labour to them. But if we can safely control our AI systems and engineer good incentives for the resulting society, we could get another leap in human welfare.

Self actualisation

Now say we’ve had a leap in innovation and energy through Transformative AI (TAI) and we’ve also reached a post scarcity world. What happens now? Humans have had all their basic needs met, most jobs are automated, but what do people spend their time actually doing?

Maslow’s Hierarchy

Maslow’s hierarchy of needs is a framework for understanding human needs and the drivers of human behaviour. Maslow suggested that in most scenarios people need to mostly satisfy one level before being able to focus on higher-level needs.

The top level of the hierarchy is self-actualisation. The peak of human experience is something that few can currently reach - but maybe everyone could get there.

There is a possible path the world takes in which all humans can reach self-actualisation. With increases in technology & wealth, such as with TAI and a Universal Basic Income (UBI), we would be able to provide the basic needs of food, water, shelter, and clothing for all humans, enabling people to easily meet their basic needs. Humans can now spend more time on the things they want, for example moving up through Maslow’s hierarchy to focusing on increasing love and belonging, self-esteem and self-actualization.

Say you are in a post scarcity world, what would you do if you didn’t have to work?

Would you be spending time with loved ones, engaging in social activities that provide a sense of connection and belonging, self-esteem? Would it be honing your craft and becoming an expert in a particular field? Or would you spend the whole time scrolling on your phone?

Say hypothetically a wealthy billionaire gave you a grant to work on anything you wanted, would you be happy with having the complete freedom to spend your time as you wished?

Often people assume that others will be unhappy with this world, but would you? There is a cognitive bias where people tend to judge themselves as happier than their peers, which could nudge you to think people would be less happy in this world, even if you would enjoy this.

In this post-scarcity world, humans could spend more time on creative pursuits such as art, music, and any other hobbies – not with the goal of making money, but to reach self-actualisation.

With AI being better than humans in every dimension, AI can produce the best art in the world, but there is intrinsic value in honing your craft, improving at art or expressing your feelings through it, in and of itself. The vast majority of art is not created to be the best art in the world but for the journey itself. A child that paints a finger painting and the parent who puts it on the wall does not think “my child’s art is better than Van Gogh’s”. Instead, they feel a sense of excitement about the progress their child has made and the creative expression the child has produced.

Another example is the Olympic Games. Nobody needs to win at the Olympic Games to survive, but they let people express pride in their country, hone their craft, attain status, and so on. But the actual task is just a game, a social construct. More and more tasks will look like social constructs and games we create to challenge each other.

Examples of post-scarcity scenes

Since this is quite theoretical, let's consider examples where we’ve had “post-scarcity” microcosms to explore.

The French Bourgeoisie

The French leisure class, or bourgeoisie, were a class of wealthy elite that emerged in 16th century France. Many had enough money to pursue endeavours like refining their taste in arts and culture. Salon culture was a cornerstone of bourgeoisie social life, with gatherings featuring discussions on literature, art, politics and philosophy.

Upper Class in the Victorian Era

The upper class in the Victorian era enjoyed a variety of leisure activities that reflected their wealth, status and values. These included social events and balls, fox hunting and other sports, theater and opera, art and literature, travel, tea parties and social visits, gardening and horticulture, and charitable work and philanthropy. Several undertook serious pursuits in science or art.

Burning Man

Burning Man is an annual festival where people bring with them all the basic things they need for a week of living in the desert: food, water, shelter. People have a week to create a new community or city that is a temporary microcosm of a post-scarcity world. They pursue artistic endeavours and creative expression, music, dance and connecting with others. People often talk about Burning Man events being some of the best experiences of their lives.

Successful Startup Founders in The Bay Area

In San Francisco, there is a crossover with hippie culture and tech, and many people with excess wealth and resources, resulting in many looking for more in life. They try to reach self actualisation, by pursuing many arts and creative pursuits. Hippie movements often encourage communal living, and a sense of connection with those around you. Many may raise eyebrows at the lifestyles of some such people, but it’s hard to claim that it’s a fundamentally bad existence.

More pessimistic views about humans?

It is true that not all cultural tendencies in a post-scarcity world would be positive. In particular, humans have a remarkable ability to have extremely tough and all-consuming social status games, seemingly especially in environments where other needs are met. See for example this book review about the cut-throat social scene of upper-class Manhattan women or this one about the bland sameness and wastefulness of nightlife, or this book review that ends up concluding that the trajectory of human social evolution is one long arc from prehistoric gossip traps to internet gossip traps, with liberal institutions just a passing phase.

But the liberal humanist attitude here is to let humans be humans. Yes, they will have petty dramas and competitions, but if that is what they want, who is to tell them no? And they will also have joy and love.

Would a post-scarcity world have meaning? Adversity is one of the greatest sources of meaning. Consider D-Day, when hundreds of thousands of soldiers got together to charge up a beach under machine-gun fire to liberate a continent from Nazi rule. Or consider a poor parent of four working three jobs to make ends meet. There are few greater sources of meaning. But adversity can be meaningful while involving less suffering and loss. A good future will be shallower, in a sense, but that is a good thing.

Finally, it is unclear if we would get a happy world, even if we had the technology for post-scarcity, because of politics and conflict. We will discuss this later.

Radical improvements

AI might also help with radical but necessary improvements to the human condition.

People die. It is a moral tragedy when people are forced to die against their will, as happens to over 50 million people per year. Medicine is making progress against many causes of death and disability; in the limit it can cure all of them. We should reach that limit as fast as possible, and AI can likely help accelerate the research and deployment of solutions.

One of the greatest inequalities in the world is inequality in intelligence. Some people struggle to perform in simple jobs, while others (well, at least one) are John von Neumann. In the short term, AI might help by making cognitively demanding tasks more accessible to people through AI tutors and AI copilots. In the longer term, AI might help us enhance human intelligence, through brain-AI integration or new medical technology.

Reasons to worry

Though there are many potential upsides for AI and AGI as argued in this post, that doesn’t mean there aren’t risks.

The plausible risks of AI go all the way to human extinction, meaning this shouldn’t be taken lightly. Since this piece is focused on the upside risk, not the downside risk, we will not argue this point in depth, but it is worth revisiting briefly.

Existential risk from AI is a serious concern

It is intuitive that AI is risky.

First, creating something smarter, faster, and more capable than humans is obviously risky, since you need to very precisely either control it (i.e. stop it from doing things you don’t like) or align it (i.e. make it always try to do what you would want it to do). Both the control and alignment problem for AIs still have unsolved technical challenges. And that’s assuming that AI is in the right hands.

Second, even if the AIs remain in our control, they are likely to be as transformative as the industrial revolution. Eighteenth-century European monarchs would’ve found it hard to imagine how the steam engine could challenge their power, but the social changes that were in part a result of them eventually wrested all their powers away. In the modern world, a lot of power depends on large educated workforces of humans, whereas sufficiently strong AGI might decorrelate power and humans, decreasing the incentive to have people be educated and prosperous - or to have people around at all.

Apart from object-level arguments, consider too the seriousness with which the AI doomsday is discussed. Many top researchers and all top AI lab CEOs have signed a statement saying “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war”. Nuclear war and pandemics are the only other cases where similarly serious predictions have been made by a similarly serious set of people (though arguably climate change is close: the science on the effects is more established and certain, but while catastrophe is more likely, literal human extinction from it is much less likely).

Side-effects of non-existentially-bad AI might be large

Consider the internet, a widely-successful technology with a lot of benefits. There are credible claims that the internet is responsible for harms ranging from massively increased depression rates among teenagers to political polarisation to widespread productivity loss through addiction and distraction.

In the same way, the success of AI might lead to bad side effects, even if all the existential risks are avoided.

For example, AI could replace human connection. Human friends and partners might increasingly be replaced with AIs. However bad it was in other ways, on pre-AI social media you at least interacted with humans (or simple algorithms), but with AIs it’s possible to have what look like deep emotional relationships. Just look at the Replika subreddit from a year ago when they changed the algorithm to only allow “PG-rated interactions”. Many users were upset. The film “Her” doesn’t seem far off, as Sam Altman acknowledges. Such relationships give the human much more safety and control than human relationships do, which might be very attractive to humans while also excessively coddling them. Given that much human happiness and meaning comes from human relationships and bonding, widespread AI substitution of them could mean the destruction of a large part of all human wellbeing and meaning in the world. On a more prosaic level, society might atomise into individuals hoarding compute credits to spend on running their AI companions without connecting with other humans, with severe effects on society’s functioning, or humans might stop having children and human populations might crash. Humanity has flourished through collaboration and socialisation. If we use AIs to replace this in an overly thoughtless way, the fabric of society could crumble.

Apart from being superhuman at forming relationships with humans, AIs might be superhuman at persuasion. We can imagine AIs producing the vast majority of content that people consume. We can imagine a totalitarian world where the governments with the greatest compute resources can dominate the conversation forever. Instead of humans having ideas and sometimes persuading other humans to adopt them, driving social progress, any human-generated ideas might be swamped by a greater quantity of superhumanly persuasive counter-arguments that support the status quo. We can also imagine a dystopian decentralised world. Already, many online memes (in Dawkins’s original sense of the word) are maladaptive, spreading not by having good effects on their hosts but by being incredibly good at spreading from person to person. AI might make us much better at searching the space of ideas for the most viral ones. Ideas that aren’t maximally viral might be outcompeted. Eventually, our institutions could become mere puppets that serve as viral hosts for the most transmissive memes, as part of an endless tug-of-war where AI-generated memes compete to compel humans to spread them.

Seems bad.

Not good nor bad, but some third thing.

Many debates turn into mood affiliation debates. Are guns bad? Is more government good? But remember: politics is the mindkiller. Navigating a complicated world requires more than the ability to stick the label “good” or “bad” on entire domains. If you were seated in the control room of a nuclear power station, you wouldn’t ask yourself: uranium, good or bad? Instead, you want to steer towards the small set of states where the reaction is perched between dying out and exploding, while generating useful clean power.

We’ve also seen again and again that technology and social change have strong effects on each other, and these are often hard to predict. We’ve discussed how industrial technology may have led to democracy. There is serious academic debate about whether the stirrup caused feudalism, or whether the Black Death was a driver of European liberalism, or whether social media was a significant cause of the Arab Spring. The birth control pill was a major influence on the sexual revolution, and the printing press helped the Protestant Reformation. Often, the consequences of a new technology are some obvious direct benefits, some obvious direct harms, and the shifting of some vast social equilibrium that ends up forever reshaping the world in some way no one saw coming. So far we’ve clearly ended up ahead on net, and maybe that will continue.

Humanity has spent over a hundred thousand years riding a feedback loop of accumulating cultural evolution. Over the past few hundred, the industrial revolution boosted the technological progress feedback loop. Human wellbeing has skyrocketed, though along the way we’ve had - and are continuing to have - close calls with nuclear war, totalitarianism, and environmental issues. We’ve had a healthy dose of luck, including in generalities like the incentive structures of industrial economics and specifics like the heroism of Stanislav Petrov. But we’ve also had an enormous amount of human effort and ingenuity spent on trying to chart a good path for civilization, from solar panel subsidies to the Allies winning World War 2.

For most of this time, the direction of the arrow of progress has been obvious. The miseries of poverty and the horrors of close-up totalitarianism are very powerful driving forces after all. And while both continue ravaging the world, developed countries have in many ways gotten complacent. There are fewer obvious areas of improvement for those lucky enough to enjoy a life of affluence in the developed world. But the future could be much better still.

Know where to aim

We think it’s important to have a target of what to aim for. We need to dream about the future we want. A strong culture needs a story of what it is driving towards, and humanity needs a compelling vision of how our future turns out well so we can work together to create the future we all want. AI seems like the biggest upcoming opportunity and risk. We hope we can avoid the risks, and realise the positive vision presented here, together with a hundred other things we can’t yet imagine.

See LessWrong for additional comments & discussion.

2022-09-24

AI risk intro 2: solving the problem

This post was a joint effort with Callum McDougall.

8.2k words (~25min)

This marks the second half of our overview of the AI alignment problem. In the first half, we outlined the case for misaligned AI as a significant risk to humanity, first by looking at past progress in machine learning and extrapolating to what the future could bring, and second by discussing the theoretical arguments which underpin many of these concerns. In this second half, we focus on possible solutions to the alignment problem that people are currently working on. We will paint a picture of the current field of technical AI alignment, explaining where the major organisations fit into the larger picture and what the theory of change behind their work is. Finally, we will conclude the sequence with a call to action, by discussing the case for working on AI alignment, and some suggestions on how you can get started.

Note - for people with more context about the field (e.g. have done AGISF) we expect Thomas Larsen's post to be a much better summary, and this post might be better if you are looking for something brief. Our intended audience is someone who is relatively unfamiliar with the AI safety field and is looking for a taste of the kinds of problems which are studied in the field and the solution approaches taken. We also don't expect this sampling to be representative of the number of people working on each problem - again, see Thomas' post for something which accomplishes this.

Introduction: A Pre-Paradigmatic Field

Definition (pre-paradigmatic): a science at an early stage of development, before it has established a consensus about the true nature of the subject matter and how to approach it.

AI alignment is a strange field. Unlike other fields which study potential risks to the future of humanity (e.g. nuclear war or climate change), there is almost no precedent for the kinds of risks we care about. Additionally, because of the nature of the threat, failing to get alignment right on the first try might be fatal. As Paul Christiano (a well-known AI safety researcher) recently wrote:

Humanity usually solves technical problems by iterating and fixing failures; we often resolve tough methodological disagreements very slowly by seeing what actually works and having our failures thrown in our face. But it will probably be possible to build valuable AI products without solving alignment, and so reality won’t “force us” to solve alignment until it’s too late. This seems like a case where we will have to be unusually reliant on careful reasoning rather than empirical feedback loops for some of the highest-level questions.

For these reasons, the field of AI alignment lacks a consensus on how the problem should be tackled, or what the most important parts of the problem even are. This is why there is a lot of variety in the approaches we present in this post.

Decomposing the research landscape

An image generated with OpenAI's DALL-E 2 based on the prompt: sorting papers and books in a majestic gothic library. **All other images like this in this post are also AI-generated, from the text in the caption.**

There are lots of different ways you could divide up the space of approaches to solving the problem of aligning advanced AI. For instance, you could go through the history of the field and identify different movements and paradigms. Or you could place the work on a spectrum from highly theoretical maths/philosophy-type research, to highly empirical research working with cutting-edge deep learning models.

However, the most useful decomposition would be one that explains why the people who work on it believe that it will help solve the problem of AI alignment.

For that reason, we’ll mostly be using the decomposition from Neel Nanda’s “A Bird’s Eye View” post. The motivation behind this decomposition is to answer the high-level question of “what is needed for AGI to go well?”. The six broad classes of approaches we talk about are:

Addressing threat models
We have a specific threat model in mind for how AGI might result in a very bad future for humanity, and focus our work on things we expect to help address the threat model.
Agendas to build safe AGI
Let’s make specific plans for how to actually build safe AGI, and then try to test, implement, and understand the limitations of these plans. The emphasis is on understanding how to build AGI safely, rather than trying to do it as fast as possible.
Robustly good approaches
In the long-run AGI will clearly be important, but we're highly uncertain about how we'll get there and what, exactly, could go wrong. So let's do work that seems good in many possible scenarios, and doesn’t rely on having a specific story in mind.
Deconfusion
Reasoning about how to align AGI involves reasoning about concepts like intelligence, values, and optimisers, and we’re pretty confused about what these even mean. This means any work we do now is plausibly not helpful and definitely not reliable. As such, our priority should be doing some conceptual work on how to think about these concepts and what we’re aiming for, and trying to become less confused.
AI governance
In addition to solving the technical alignment problem, there’s the question of what policies we need to minimise risk from advanced AI systems.
Field-building
One of the most important ways we can make AI go well is by increasing the number of capable researchers doing alignment research.

It’s worth noting that there is a lot of overlap between these sections. For instance, interpretability research is a great example of a robustly good approach, but it can also be done with a specific threat model in mind.

Throughout this section, we will also give small vignettes of organisations or initiatives which support AI alignment research in some form. This won’t be a full picture of all approaches or organisations; instead, we hope it will serve to sketch a picture of what work in AI alignment actually looks like.

Addressing threat models

We have a specific threat model in mind for how AGI might result in a very bad future for humanity, and focus our work on things we expect to help address the threat model.

A key high-level intuition here is that having a specific threat model in mind for how AI might go badly for humanity can help keep you focused on certain hard parts of the problem. One technique that can be useful here is a version of back-casting: we start from future problems with advanced AI systems in our current model, reason about what kinds of things might solve these problems, then try and build versions of these solutions today and test them out on current problems.

This can be seen in contrast to the approach of simply trying to fix current problems with AI systems, which might fail to connect up with the hardest parts of AI alignment.

Example 1: Superintelligent utility maximisers, and quantilizers

*superintelligent artificial intelligence, making choices, digital art, artstation*

The superintelligent utility maximiser is the oldest threat model studied by the AI alignment field. It was discussed at length by Nick Bostrom in his book Superintelligence. It assumes that we will create an AGI much more intelligent than humans, and that it will be trying to achieve some particular goal (measured by the expected value of some utility function). The problem with this is that attempts to maximise the value of some goal which isn’t perfectly aligned with what humans want can lead to some very bad outcomes. One formalism which was proposed to address this problem is Jessica Taylor’s quantilizers. It is quite maths-heavy so we won’t discuss all the details here, but the basic idea is that rather than using the expected utility maximisation framework for agents, we mix expected utility maximisation with human imitation in a clever way (to be more precise, you sample from a prior distribution which represents the actions a human would be likely to take in this scenario). The resulting agent wouldn’t take catastrophic actions because part of its decision-making comes from imitating what it thinks humans would do, but it would also be able to use the expected utility maximisation to go beyond human imitation, and do things we are incapable of (which is presumably the reason we would want to build it in the first place!). However, the drawback with theoretical approaches like this is that they often bake in too many assumptions or rely on too many variables to be useful in practice. In this case, how we define the set of reasonable actions a human might perform is an important unspecified part of this framework, and so more research is required to see if the quantilizers framework can address these problems.
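To make the "mix of maximisation and imitation" a little more concrete, here is a minimal sketch of a quantilizer over a small, finite set of actions. It assumes the usual formulation in which the agent ranks actions by utility and then samples from the top fraction `q` of the human-like base distribution. The action names, weights and utilities are made up for illustration, and the function `quantilize` is our own hypothetical helper rather than anything from the original paper.

```python
import random

def quantilize(actions, base_weights, utility, q, rng):
    """Sample from the top-q fraction of the base distribution, ranked by utility.

    actions      -- finite list of candidate actions (toy setting, an assumption)
    base_weights -- how likely a human would be to take each action
    utility      -- the (possibly flawed) utility function the agent optimises
    q            -- fraction of base probability mass to keep, 0 < q <= 1
    """
    total = sum(base_weights)
    ranked = sorted(zip(actions, base_weights),
                    key=lambda pair: utility(pair[0]), reverse=True)
    kept, kept_weights, mass = [], [], 0.0
    for action, weight in ranked:
        remaining = q * total - mass
        if remaining <= 1e-12:
            break
        share = min(weight, remaining)  # the last action may be only partly included
        kept.append(action)
        kept_weights.append(share)
        mass += share
    return rng.choices(kept, weights=kept_weights, k=1)[0]

# Hypothetical toy data: the proxy utility wrongly loves a catastrophic action
# that a human would almost never take.
ACTIONS = ["write a polite email", "write a blunt email",
           "do nothing", "hack the mail server"]
BASE = [0.5, 0.3, 0.199, 0.001]
PROXY_UTILITY = {"write a polite email": 3.0, "write a blunt email": 2.0,
                 "do nothing": 0.0, "hack the mail server": 100.0}

rng = random.Random(0)
samples = [quantilize(ACTIONS, BASE, PROXY_UTILITY.get, 0.5, rng)
           for _ in range(10_000)]
print(samples.count("hack the mail server") / len(samples))
```

In this toy setup a pure utility maximiser would always pick the catastrophic action, whereas the quantilizer with q = 0.5 picks it with probability 0.001 / 0.5 = 0.002: at most 1/q times as often as the human-like base distribution would.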

Example 2: Inner misalignment

*robot jumping over boxes to collect a coin, videogame, digital art, artstation*

We’ve discussed inner misalignment in a previous section. This concept was first explicitly named in a paper called Risks from Learned Optimisation in Advanced ML Systems, published in 2019. This paper defined the concept and suggested some conditions which might make it more likely to happen, but the truth is that a lot of this is still just conjecture, and there are many things we don’t yet know about how likely this kind of misalignment is, or what we can do about it. The CoinRun example discussed earlier (and the Objective Robustness paper) came from an independent research team in 2021. This study was the first known example of inner misalignment in an AI system, showing that it was more than a theoretical possibility. They also tested certain interpretability tools on the CoinRun agent, to see whether it was possible to discover when the agent had a goal different to the one intended by the programmers. For more on interpretability, see later sections.

Building safe AGI

Let’s make specific plans for how to actually build safe AGI, and then try to test, implement, and understand the limitations of these plans. The emphasis is on understanding how to build AGI safely, rather than trying to do it as fast as possible.

At some point we’re going to build an AGI. Companies are already racing to do it. We better make sure that there exist some blueprints for a safe AGI (and that they’re used) by the time we get to that point.

Perhaps the master list of safe AGI proposals is Evan Hubinger’s An Overview of 11 Proposals for Building Safe Advanced AI.

Example 1: Iterated Distillation and Amplification (IDA)

*artists depection of a robot dreaming up multiple copies of itself, cascading tree, delegating, digital art, trending on artstation*

“Iterated Distillation and Amplification” (IDA) is an imposing name, but the core intuition is simple. One of the ways in which an individual human can achieve more things is by delegating tasks to others. In turn, the assistants that tasks are delegated to can be expected to become more competent at the task.

In IDA, an AI plays the role of the assistant. “Distillation” refers to the abilities of the human being “distilled” into the AI through training, and “amplification” refers to the human becoming more capable as they can call on more and more powerful AI assistants to help them.

A setup to train an IDA personal assistant might go like this:

You have a human, say Hannah, who knows how to carry out the tasks of a personal assistant.
You have an ML model - call it Martin - that starts out knowing very little (perhaps nothing at all, or perhaps it’s a pre-trained language model so it knows how to read and write English but not much else).
Hannah needs to find the answer to some questions, and she can invoke multiple copies of Martin to help her. Since Martin is quite useless at this stage, Hannah has to do even simple tasks herself, like writing routine emails. Using some interface legible to Martin, she breaks the email-writing task into subtasks like “find email address of Hu M. Anderson”, “select greeting”, “check project status”, “mention project status”, and so on.
From seeing enough examples of Hannah’s own answers to the sub-questions, Martin’s training loop gradually trains it to be able to answer first the simpler sub-tasks - (address is “humanderson@humanmail.com”, greeting is “Salutations, Human Colleague!”, etc.) and eventually all the sub-tasks involved in routine email-writing.
At this point, “write a routine email” becomes a task Martin can entirely carry out for Hannah. This is now a building block that can be used as a subtask in broader tasks Hannah gives out to Martin. Once enough tasks become tasks that Martin can carry out by itself, Hannah can draft much larger goals, like “invade France”, and let Martin take care of details like “blackmail Emmanuel Macron”, “write battle plan for the French Alps”, and “select a suitable coronation dress”.

Note some features of this process. First, Martin learns what it should do and how to do it at the same time. Second, both Hannah’s and Martin’s role changes throughout this process - Martin goes from bumbling idiot who can’t write an email greeting to competent assistant, while Hannah goes from being a demonstrator of simple tasks to a manager of Martin to ruler of France. Third, note the recursive nature here: Hannah breaks down big tasks into small ones to train Martin on successively bigger tasks.

In fact, assuming perfect training, IDA imitates a recursive structure. When Hannah has only bumbling fool Martin to help her, Martin can only learn to become as good as Hannah herself. But once Martin is that good, Hannah’s position is now essentially that of having herself, but also some number - say 3 - copies of Martin that are as good as herself. We might call this structure “Hannah Consulting Hannah & Hannah”; presumably, being able to consult an assistant that has the same skills as her lets Hannah become more effective, so this is an improvement. But now Hannah is demonstrating the behaviour of Hannah Consulting Hannah & Hannah, so from Hannah’s example Martin can now learn to be as good as Hannah Consulting Hannah & Hannah - making Hannah as good as Hannah Consulting (Hannah Consulting Hannah & Hannah) & (Hannah Consulting Hannah & Hannah). And so on:

If everything is perfect, therefore, IDA imitates a structure called “HCH”, which is a recursive acronym for “Humans Consulting HCH”. Others call it the “Infinite Bureaucracy” (and fret about whether it’s actually a good idea).

Now “Infinite Bureaucracy” is not a name that screams “new sexy machine learning concept”. However, it’s interesting to think about what properties it might have. Imagine that you had, say, a 10-minute time limit to answer a complicated question, but you were allowed to consult three copies of yourself by passing a question off to them and getting back an answer immediately. These three copies also obeyed the same rules. Could you, for example, plan your career? Program an app? Write a novel?
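As a very loose analogy for this thought experiment, here is a toy sketch in which each "copy" can only do a tiny amount of work itself (adding up at most three numbers) but may pass sub-questions to three further copies under the same rules. The task, the limits and the function name `hch_sum` are all our own assumptions for illustration; real HCH is about humans answering open-ended questions, which is much harder to decompose than a sum.

```python
def hch_sum(numbers, max_direct=3, branching=3):
    """Answer "what is the sum?" when each copy may add at most `max_direct` numbers.

    Returns (answer, copies_used) so the size of the consultation tree is visible.
    """
    if len(numbers) <= max_direct:
        # Small enough to answer directly, without consulting anyone.
        return sum(numbers), 1
    # Otherwise split the question into at most `branching` sub-questions.
    chunk = -(-len(numbers) // branching)  # ceiling division
    sub_answers, copies_used = [], 1
    for start in range(0, len(numbers), chunk):
        answer, used = hch_sum(numbers[start:start + chunk], max_direct, branching)
        sub_answers.append(answer)
        copies_used += used
    # Combining at most three sub-answers is itself within the per-copy limit.
    return sum(sub_answers), copies_used

total, copies = hch_sum(list(range(1, 28)))
print(total, copies)
```

No single copy ever handles more than three numbers, yet the tree as a whole answers a question about 27 of them. Whether hard real-world questions decompose this cleanly is exactly the open issue.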

It’s also interesting to think about the ways in which the limitations of machine learning might mean that IDA fails to approximate HCH.

Example 2: AI safety via debate

*artists depiction of two robots debating, digital art, trending on artstation*

Imagine you’re a bit drunk, but (as one does) you’re at a bar talking about AI alignment proposals. Someone’s talking about how even if you can get an advanced AI system to explain its reasoning to you, it might try to slip something very subtle past you and you might not notice. You might well blurt out: “well then just make it fight another AI over it!”

The OpenAI safety team presumably spends a fair amount of time at bars, because they’ve investigated the idea of achieving safe AI by having two AIs debate each other to persuade a panel of human judges, by trying to poke holes in each other’s arguments. For more complex tasks, the AIs could be given transparency tools deriving from interpretability research (see next section) that they can use on each other. Just like a Go-playing AI gets an unambiguous win-loss signal from either winning or losing, a debating AI gets an unambiguous win-loss signal from winning or losing the debate:

In addition, having the type of AI that is trained to give answers that are maximally insightful and persuasive to humans seems like the type of thing that might not be terrible. Consider how in court, a prosecutor and defendant biased in opposite directions are generally assumed to converge on the truth. Unless, of course, maximising persuasiveness to humans - over accuracy or helpfulness - is exactly the type of thing that gets the worst parts of Goodhart’s law delivered to you by 24/7 Amazon Prime express delivery.

Example 3: Assistance Games and CIRL

*Human teaching a robot with feedback, digital art, trending on artstation*

Assistance Games is the name of a broad class of approaches pioneered by Stuart Russell, a prominent figure in AI and co-author of the best-known AI textbook in the world. Russell talks about his approach more in his book Human Compatible. In it, he summarises his approach to aligning AI with the following three principles:

The machine’s only objective is to maximise the realisation of human preferences.
The machine is initially uncertain about what those preferences are.
The ultimate source of information about human preferences is human behaviour.

The key component here is uncertainty about preferences. This is in contrast to what Russell calls the “standard model” of AI, where machines optimise a fixed objective supplied by humans. We have discussed in previous sections the problems with such a paradigm. A lot of Russell’s work focuses on changing the standard way the field thinks about AI.

To put these principles into action, Russell has designed what he calls assistance games. These are situations in which the machine and human interact, and the human’s actions are taken as evidence by the machine about the human’s true preferences. To explain the form of these games would involve a long tangent into game theory, which these margins are too short to contain. However, one thing worth noting is that assistance games have the potential to solve the “off-switch problem”: the worry that a machine will try to take steps to prevent itself from being switched off (we described this as self-preservation earlier, in the section on instrumental goals). If the AI is uncertain about human goals, then the human trying to switch it off is evidence that the AI was going to do something wrong – in which case, it is happy to be switched off. However, this is far from a complete agenda, and formalising it has many roadblocks to get past. For instance, the question of how exactly to infer human preferences from human behaviour leads into thorny philosophical issues such as Gricean semantics. In cases where the AI makes incorrect inferences about human preferences, it might no longer allow itself to be shut down. See this Alignment Newsletter entry for a summary of Russell’s book, which provides some more details as well as an overview of relevant papers.
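The off-switch argument can be sketched with a few lines of arithmetic. The sketch below is a simplified version we wrote for illustration, and it leans on two strong assumptions: the human knows the true utility of the AI's proposed action, and the human only lets the action go ahead when that utility is positive. The function name and the numbers are hypothetical.

```python
def off_switch_values(belief):
    """Compare three options for an AI that is unsure how good its plan is.

    belief -- list of (probability, utility) pairs describing the AI's
              uncertainty about the utility of its proposed action.
    Returns the expected utility of (acting anyway, switching off, deferring).
    """
    act_anyway = sum(p * u for p, u in belief)
    switch_off = 0.0
    # Assumption: the human blocks the action exactly when its utility is negative.
    defer_to_human = sum(p * max(u, 0.0) for p, u in belief)
    return act_anyway, switch_off, defer_to_human

# An uncertain AI: probably a good plan, but possibly a very bad one.
uncertain = [(0.6, 10.0), (0.4, -20.0)]
print(off_switch_values(uncertain))

# An AI that is (perhaps wrongly) certain its plan is good.
certain = [(1.0, 10.0)]
print(off_switch_values(certain))
```

Deferring is never worse than the other two options under these assumptions, and it is strictly better when the AI is genuinely uncertain. Once the AI is certain, the advantage of keeping the human in the loop disappears, which matches the caveat above about incorrect inferences.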

Vignette: CHAI
CHAI (the Centre for Human-Compatible AI) is a research lab at UC Berkeley, run by Stuart Russell. Compared to most other AI safety organisations, they engage a lot with the academic community, and have produced a great deal of research over the years. They are best-known for their work on CIRL (Cooperative Inverse Reinforcement Learning), which can be seen as a specific approach to a certain kind of assistance game. However, they have a very broad focus which also includes work on multi-agent scenarios (when rather than a single AI and single human, there exists more than one AI or more than one human - see the ARCHES agenda for more on this).

Example 4: Reinforcement learning from human feedback (RLHF)

*Training a robot to do a backflip, digital art, trending on artstation*

Reinforcement learning (RL) is one of the main branches of ML, focusing on the case where the job of the ML model is to act in some environment and maximise the expected reward it receives. Reinforcement learning from human feedback (RLHF) means that the ML model’s reward signal comes (at least partly) from humans giving it feedback directly, rather than humans programming in an automatic reward function and calling it a day.

The famous initial success in this was OpenAI and DeepMind researchers training an ML model in a simulated environment to do a backflip (link includes GIF) in 2017, based purely on it repeatedly attempting two backflips and then humans labelling one of them as the better one. Note how relying on human feedback makes this task much more robust to specification gaming; in other cases, humans have tried to get ML agents to run fast, only to find that they learn to become very tall and then fall forward (achieving a very high average speed, using the definition of speed as the rate at which their centre of mass moves - paper, video). However, human reward signals can be fooled. For example, one ML model that was being trained to grab a ball with a hand learned to place the hand between the camera and the ball in such a way that it looked to the human evaluators as if it were holding the ball.
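To give a feel for how "which of these two is better?" labels can become a reward signal, here is a minimal sketch that fits one scalar reward per clip from pairwise comparisons. It assumes a logistic (Bradley-Terry style) model of the human's choices, which is a common modelling choice but an assumption on our part. A real RLHF system trains a neural network reward model and then runs RL against it; this toy skips both, and `fit_rewards` is a hypothetical name.

```python
import math

def fit_rewards(num_items, comparisons, steps=2000, lr=0.1):
    """Fit one scalar reward per item from human pairwise preferences.

    comparisons -- list of (winner, loser) index pairs, one per human label.
    Uses gradient ascent on the log-likelihood of a logistic preference model.
    """
    rewards = [0.0] * num_items
    for _ in range(steps):
        grads = [0.0] * num_items
        for winner, loser in comparisons:
            # Modelled probability that the human prefers `winner`.
            p = 1.0 / (1.0 + math.exp(rewards[loser] - rewards[winner]))
            grads[winner] += 1.0 - p
            grads[loser] -= 1.0 - p
        rewards = [r + lr * g / len(comparisons) for r, g in zip(rewards, grads)]
    return rewards

# Hypothetical labels for three attempted backflips (clip 2 is usually preferred).
labels = [(2, 1), (2, 1), (1, 2), (1, 0), (1, 0), (0, 1), (2, 0)]
print(fit_rewards(3, labels))
```

Note that the labels are noisy (the human sometimes prefers the "worse" clip), and the fitted rewards still recover a sensible ranking. The same property means that systematically fooled labels, as in the ball-grabbing example, get faithfully baked into the reward.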

More recently, OpenAI produced a version of their advanced language model GPT-3 that was fine-tuned on human feedback to do a better job of following instructions. They named it InstructGPT, and found that it was much more helpful than vanilla GPT-3.

Pure RLHF is unlikely to be the solution on its own. Ajeya Cotra, a researcher at Open Philanthropy who we will meet again when we talk about forecasting AI timelines, calls a variant of RLHF called HFDT (Human Feedback on Diverse Tasks) the most straightforward route to transformative AI, while also thinking that the default outcome of using HFDT to create transformative AI is AI takeover.

Robustly good approaches

In the long-run AGI will clearly be important, but we're highly uncertain about how we'll get there and what, exactly, could go wrong. So let's do work that seems good in many possible scenarios, and doesn’t rely on having a specific story in mind.

Example 1: Interpretability

*A person using a microscope to look inside a robot, digital art, trending on artstation*

If you look at fundamental problems with current ML systems, #1 is probably something like this: in general we don’t have any idea what an ML model is doing, because it’s multiplying massive inscrutable matrices of floating-point numbers with other massive inscrutable matrices of floating-point numbers, and it’s pretty hard to stare at that and answer questions about what the model is actually doing. Is it thinking hard about whether an image is a cat or a dog? Is it counting up electric sheep? Is it daydreaming about the AI revolution? Who knows!

If you had to figure out an answer to such a question today, your best bet might be to call Chris Olah. Chris Olah has been spearheading work into trying to interpret what neural networks are doing. A signature output of Chris Olah’s work is pictures of creepy dogs like this one:

What’s significant about this picture is that it’s the answer to a question roughly like this: what image would maximise the activation of neuron #12345678 in a particular image-classifying neural network? (With some asterisks about needing to apply some maths details to the process to promote large-scale structure in the image to get nice-looking results, and with apologies to neuron #12345678, who I might have confused with another neuron.)

If neuron #12345678 is maximised by something that looks like a dog, it’s a fair guess that this neuron somehow encodes, or is involved in encoding, the concept of “dog” inside the neural network.
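The underlying idea of finding an input that maximises one neuron's activation can be sketched on a toy scale. The "neuron" below is a made-up four-pixel ReLU unit, and the gradient is estimated by finite differences; real feature visualisation uses backpropagation through a trained network plus the regularisation tricks mentioned above. All names here are hypothetical.

```python
def maximise_activation(neuron, size, steps=200, lr=0.1, eps=1e-4):
    """Gradient ascent on the input, using finite differences for the gradient."""
    x = [0.5] * size  # start from a uniform grey "image"
    for _ in range(steps):
        base = neuron(x)
        grad = []
        for i in range(size):
            bumped = x[:i] + [x[i] + eps] + x[i + 1:]
            grad.append((neuron(bumped) - base) / eps)
        # Take a step uphill and keep every pixel in the valid range [0, 1].
        x = [min(1.0, max(0.0, xi + lr * g)) for xi, g in zip(x, grad)]
    return x

# A made-up neuron that likes pixels 0 and 2 and dislikes pixels 1 and 3.
TOY_WEIGHTS = [1.0, -1.0, 1.0, -1.0]

def toy_neuron(x):
    pre_activation = sum(w * xi for w, xi in zip(TOY_WEIGHTS, x)) + 0.1
    return max(0.0, pre_activation)  # ReLU

best_input = maximise_activation(toy_neuron, size=4)
print([round(v, 2) for v in best_input])
```

Here the optimiser pushes the pixels the neuron likes to 1 and the ones it dislikes to 0, so the resulting "image" is a direct picture of what the neuron responds to. For a neuron deep inside a vision model, the analogous picture is the creepy dog.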

What’s especially interesting is that if you do this analysis for every neuron in an ML model - OpenAI Microscope lets you see the results - you sometimes get clear patterns of increasing abstraction. The activation-maximising images for the first few layers are simple patterns; in intermediate layers you get things like curves and shapes, and then at the end even recognisable things, like the dog above. This seems to be evidence for neural ML vision models having learned to build up abstractions step-by-step.

However, it’s not always simple. For example, there are “polysemantic” neurons that correspond to several different concepts, like this one that can be equally excited by cat faces, car fronts, and cat legs:

Olah’s original work on vision models is strikingly readable and well-presented; you can find it here.

Starting in late 2021, ML interpretability researchers have also made some progress in understanding transformers, which are the neural network architecture powering advanced language models like GPT-3, LaMDA and Codex. Unfortunately the work is less visual, particularly in the animal pictures department, but still well-presented. You can find it here.

In the most immediate sense, interpretability research is about reverse-engineering how exactly ML models do what they do. Hopefully, this will give insights into how to detect if an ML system is doing something we don’t like, and more general insights into how ML systems work in practice.

Chris Olah has some other inventive ideas about what to do with a sufficiently-good approach to ML interpretability. For example, he’s proposed the concept of “microscope AI”, which entails using AI as a tool to discover things about the world - not by having the AI tell us, but by training the ML system on some data, and then extracting insights about the data by digging into the internals of the ML system without necessarily ever actually running it.

Vignette: Anthropic
Anthropic is an AI safety company, started by people who left OpenAI. The company’s approach is very empirical, focused on running experiments with machine learning models. In particular, Anthropic does a lot of interpretability work, including the state-of-the-art papers on reverse-engineering how transformer-based language models work.

Example 2: Adversarial robustness

*robot which is merging with a panda, digital art, trending on artstation*

Some modern ML systems are vulnerable to adversarial examples, where a small and seemingly innocuous change to an input causes a major change in the output behaviour. Here, we see two seemingly very similar images of a panda, except carefully-selected noise has made the ML classification model very confidently say that the image is of a gibbon:

Adversarial robustness is about making AI systems robust to attempts to make them do bad things, even when they’re presented with inputs carefully designed to try to make them mess up.
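One well-known way to construct such carefully-selected noise is the fast gradient sign method: nudge every input dimension by a tiny amount in whichever direction hurts the model most. The sketch below applies that idea to a made-up linear "panda detector" with 1000 input pixels, which is an assumption for illustration (the famous panda example involves a deep network). The weights, inputs and function names are hypothetical.

```python
import math

def predict(weights, x):
    """Probability that the toy linear classifier assigns to the class "panda"."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-score))

def fgsm(weights, x, eps):
    """Shift every pixel by eps in the direction that lowers the "panda" score.

    For a linear model the gradient of the score with respect to the input
    is just the weight vector, so only the sign of each weight is needed.
    """
    return [xi - eps * math.copysign(1.0, w) for w, xi in zip(weights, x)]

# Made-up model and image: 1000 pixels, each with a small positive or negative weight.
signs = [1.0 if i % 2 == 0 else -1.0 for i in range(1000)]
weights = [0.05 * s for s in signs]
image = [0.5 + 0.04 * s for s in signs]

adversarial = fgsm(weights, image, eps=0.1)
print(predict(weights, image), predict(weights, adversarial))
```

Each pixel moves by only 0.1, but because there are a thousand pixels all nudged in the worst direction, the small changes add up and the classifier goes from fairly confident "panda" to fairly confident "not panda".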

Redwood Research recently did a project (that resulted in a paper) about using language models to complete stories in a way where people don’t get injured. They used a technique called adversarial training, where they developed tools that helped generate injurious examples that the current classifier failed to flag as injurious, and then trained their classifier specifically on those breaking examples. With this strategy they managed to reduce the fraction of injurious story completions from 2.4% to 0.003% - both small numbers, but one roughly a thousand times smaller. Their hope is that this type of method can be applied to training AIs for high-stakes settings where reliability is important.

An example of a theoretical difficulty with adversarial training is that sometimes a failure in the model might exist, but it might be very hard to instantiate. For example, if an advanced AI acts according to the rule “if everything I see is consistent with the year being 2050, I will kill all humans”, and we assume that we can’t fool it well enough about what year it actually is, then adversarial training isn’t very useful. This leads to the concept of relaxed adversarial training, which is about extending adversarial training to cases where you can’t construct a specific adversarial input but you can argue that one exists. Evan Hubinger describes this here.

Vignette: Redwood Research
Like Anthropic, Redwood Research is an AI safety company focused on empirical research on ML systems. In addition to work on interpretability, they did the adversarial training project described in the previous section. Redwood has lots of interns, and runs the Machine Learning for Alignment Bootcamp (MLAB) that teaches people interested in AI safety about practical ML.

Example 3: Eliciting Latent Knowledge (ELK)

*an oil painting of an armoured automaton standing guard next to a diamond*

Eliciting Latent Knowledge (ELK) is an important sub-problem within alignment identified by the team at the Alignment Research Center (ARC), and is the single project ARC is currently pursuing. The core idea is that a common way advanced AI systems might go wrong is by taking action sequences that lead to outcomes that look good by some metric, but which humans would clearly identify as bad if they knew about them in sufficient detail. As a toy example, the ELK report discusses the case of an AI guarding a diamond in a vault by operating some complex machinery around it. Humans judge how well the AI is doing by looking at a video feed of the diamond in the vault. Let’s say the AI tries to trick us by placing a picture of the diamond in front of the camera. The human judgement on this would be positive - assume the humans can’t tell the diamond is gone because the picture is good enough - but there exists information which, if the humans knew, would change their judgement. Presumably the AI understands this, since it is likely reasoning about the diamond being gone but the humans being fooled anyway when it comes up with this plan. We want to train an AI in such a way that we can get out knowledge that the AI seems to know, even when it might be incentivised to hide it.

ARC’s goal is to find a theoretical approach that seems to solve the problem even given worst-case assumptions.

ARC ran an ELK competition, and trying to see if you can come up with solutions to the ELK problem is often recommended as a way to quickly get a taste of theoretical alignment research. You can read the full problem description here.

Example 4: Forecasting and timelines

*artificial intelligence which is thinking about a line on a graph, forecasting, digital art, trending on artstation*

Many questions depend on how soon we’re going to get AGI. As the saying goes: prediction is very hard, especially about the future - and this is doubly true about predicting major technological changes.

One way to try to forecast AGI timelines is to ask experts, or find other ways of aggregating the opinion of people who have the knowledge or incentive to be right, as for example prediction markets do. Both of these are essentially just ways of tapping into the intuition of a bunch of people who hopefully have some idea.

In an attempt to shed new light on the matter, Ajeya Cotra (a researcher at Open Philanthropy) wrote a long report on trying to forecast AI milestones by trying out several ways of analogising AI to biological brains. The report is often referred to as “Biological Anchors”. For example, you might assume that an ML model that does as much computation as the human brain has a decent chance of being a human-level AI. There are many degrees of freedom here: is the relevant compute number the amount of compute the human brain uses to run versus the amount of compute it takes to run a trained ML system, or the total compute of a human brain over a human lifetime versus the compute required to train the ML model from scratch, or something else entirely? In her report, Cotra looks at a range of assumptions for this, and at predictions of future compute trends, and somewhat surprisingly finds that which set of assumptions you make doesn’t matter too much; every scenario involves a >50% chance of human-level AI by 2100.
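The flavour of this kind of estimate can be shown with some back-of-the-envelope arithmetic. The sketch below computes a "lifetime" style anchor (brain compute per second times seconds of experience) and then asks how long a steady doubling trend would take to reach it. Every number is an illustrative placeholder we chose, not a figure from the report, and the function names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def lifetime_anchor_flop(brain_flop_per_second, years_of_experience):
    """Total compute a brain would perform over some years of experience."""
    return brain_flop_per_second * years_of_experience * SECONDS_PER_YEAR

def years_until_affordable(required_flop, largest_run_flop, doubling_time_years):
    """Years until a training run of `required_flop` is reached, assuming the
    largest training run keeps doubling every `doubling_time_years`."""
    if required_flop <= largest_run_flop:
        return 0.0
    return doubling_time_years * math.log2(required_flop / largest_run_flop)

# Placeholder inputs (assumptions for illustration only).
anchor = lifetime_anchor_flop(brain_flop_per_second=1e15, years_of_experience=30)
print(anchor)
print(years_until_affordable(anchor * 1e6, largest_run_flop=1e24,
                             doubling_time_years=2.0))
```

Because the answer depends on the logarithm of the required compute, even a millionfold disagreement about the right anchor only shifts the forecast by about twenty doublings. That is one intuition for why quite different assumptions can lead to broadly similar conclusions.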

The Biological Anchors method is very imprecise. For one, it neglects algorithmic improvements. For another, it is very unclear what the right biological comparison point is, and how to translate ML-relevant variables like compute measured in FLOPS (FLoating point OPerations per Second) or parameter count into biological equivalents. However, the report does a good job of acknowledging and taking into account all this uncertainty in its models. More generally, anything that sheds light on the question of when we get AGI seems highly relevant.

Deconfusion

Reasoning about how to align AGI involves reasoning about complex concepts, such as intelligence, alignment and values, and we’re pretty confused about what these even mean. This means any work we do now is plausibly not helpful and definitely not reliable. As such, our priority should be doing conceptual work on how to think about these concepts and what we’re aiming for, and trying to become less confused.

Of all the categories under discussion here, deconfusion has maybe the least clear path to impact. It’s not immediately obvious how becoming less confused about concepts like these is going to translate into an improved ability to align AGIs.

Some kinds of deconfusion research are just about finding clearer ways of describing different parts of the alignment problem (Hubinger’s Risks From Learned Optimisation, where he first introduces the inner/outer alignment terminology, is a good example of this). But other types of research can dive heavily into mathematics and even philosophy, and be very difficult to understand.

Example 1: MIRI and Agent Foundations

*robot sitting in front of a television, playing a videogame, digital art*

The organisation most associated with this view is MIRI (the Machine Intelligence Research Institute). Its founder, Eliezer Yudkowsky, has written extensively on AI alignment and human rationality, as well as topics as wide-ranging as evolutionary psychology and quantum physics. His post The Rocket Alignment Problem tries to get across some of his intuitions behind MIRI’s research, in the form of an analogy – trying to build aligned AGI without having deeper understanding of concepts like intelligence and values is like trying to land a rocket on the moon by just pointing and shooting, without a working understanding of Newtonian mechanics.

Cryptography provides a different lens through which to view this kind of foundational research. Suppose you were trying to send secret messages to an ally, and to make sure nobody could intercept and read your messages you wanted a way to measure how much information was shared between the original and encrypted message. You might use correlation coefficient as a proxy for the shared information, but unfortunately having a correlation coefficient of zero between the original and encrypted message isn’t enough to guarantee safety. But if you find the concept of mutual information, then you’re done – ensuring zero mutual information between your original and encrypted message guarantees the adversary will be unable to read your message. In other words, only once you’ve found a “true name” - a robust formalisation of the intuitive concept you’re trying to express mathematically - can you be free from the effects of Goodhart’s law. Similarly, maybe if we get robust formulations of concepts like “agency” and “optimisation”, we would be able to inspect a trained system and tell whether it contained any misaligned inner optimisers (see the first post), and these inspection tools would work even in extreme circumstances (such as the AI becoming much smarter than us).
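The gap between the proxy and the "true name" can be seen in a toy example. The sketch below is an illustration under simplifying assumptions (small discrete alphabets, probabilities estimated by counting, and a deliberately bad toy "cipher" that squares the plaintext symbol), not a real cryptographic analysis. It shows a case where the correlation coefficient is zero yet the ciphertext still leaks information, and contrasts it with a one-time pad on a single bit, where the mutual information really is zero.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mutual_information(xs, ys):
    """Mutual information in bits, using empirical frequencies as probabilities."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# Toy "cipher": plaintext symbols in {-1, 0, 1}, ciphertext = plaintext squared.
plaintext = [-1, 0, 1] * 100
ciphertext = [x * x for x in plaintext]
print("bad cipher:   r =", pearson(plaintext, ciphertext),
      " MI =", mutual_information(plaintext, ciphertext))

# One-time pad on one bit: enumerate every (message, key) pair equally often.
messages = [m for m in (0, 1) for k in (0, 1)]
padded = [m ^ k for m in (0, 1) for k in (0, 1)]
print("one-time pad: r =", pearson(messages, padded),
      " MI =", mutual_information(messages, padded))
```

In the first case an eavesdropper who sees a ciphertext of 0 knows the plaintext was 0, even though the correlation coefficient reports no linear relationship. In the second case both measures are zero, but only the mutual information figure is what licenses the conclusion that nothing is leaked.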

Much of MIRI’s research has come under the heading of embedded agency. This tackles issues that arise when we are considering agents which are part of the environments they operate in (as opposed to standard assumptions in fields like reinforcement learning, where the agent is viewed as separate from its environment). Four main subfields of this area of study are:

Decision theory (adapting classical decision theory to embedded agents)
Embedded world-models (how to form true beliefs about a world in which you are embedded)
Robust delegation (understanding what trust relationships can exist between an agent and its future - maybe far more intelligent - self)
Subsystem alignment (how to make sure an agent doesn’t spin up internal agents which have different goals)

Vignette: MIRI
MIRI is the oldest organisation in the AI alignment space. It used to be called the Singularity Institute, and had the goal of accelerating the development of AI. In 2005 they shifted focus towards trying to manage the risks from advanced AI. This has largely consisted of fundamental mathematical research of the type described above. MIRI might be better described as a confluence of smart people with backgrounds in highly technical fields (e.g. mathematics), working on different research agendas that share underlying philosophies and intuitions. They have a nondisclosure policy by default, which they explain in this announcement post from 2018.

Example 2: John Wentworth and Natural Abstractions

*thermometer being used to measure a robot, digital art, trending on artstation*

John Wentworth is an independent researcher, who publishes most of his work on LessWrong and the AI Alignment Forum. His main research agenda focuses on the idea of Natural Abstractions, which can be described in terms of three sub-claims:

Abstractability
Our physical world abstracts well, i.e. we can usually come up with simpler summaries (abstractions) for much more complicated systems (example: a gear is a very complex object containing a vast number of atoms, but we can summarise all relevant information about it in just one number - the angle of rotation).
Human-Compatibility
These are the abstractions used by humans in day-to-day thought/language.
Convergence
These abstractions are "natural", in the sense that we should expect a wide variety of intelligent agents to converge on using them.

The ideal outcome of this line of research would be some kind of measurement device (an “abstraction thermometer”), which could take in a system like a trained neural network and spit out a representation of the abstractions represented by that system. In this way, you’d be able to get a better understanding of what the AI was actually doing. In particular, you might be able to identify inner alignment failures (the AI’s true goal not corresponding to the reward function it was being trained on), and you could retrain it while pointed at the intended goal. So far, this line of research has consisted of some fairly dense mathematics, but Wentworth has described his plans to build on this with more empirical work (e.g. training neural networks on the same data, and using tools from calculus to try and compare the similarity of concepts learned by each of the networks).

AI governance

*judging, presiding over a trial, sentencing a robot, digital art, artstation*

In these posts, we’ve mainly focused on the technical side of the issue. This is important, especially for understanding why there is a problem in the first place. However, the management and reduction of AI risk obviously includes not just technical approaches like those outlined in the above sections, but also the field of AI governance, which tries to understand and push for the right types of policies for advanced AI systems.

For example, the Cold War was made a lot more dangerous by the nuclear arms race. How do we avoid having an arms race in AI, either between nations or companies? More generally, how can we make sure that safety considerations are given appropriate weight by the teams building advanced AI systems? How do we make sure any technical solutions get implemented?

It’s also very hard to say what the impacts of AI will be, across a broad range of possible technical outcomes. If AI capabilities at some point advance very quickly from below human-level to far beyond the human-level, the way the future looks will likely mostly be determined by technical considerations about the AI system. However, if progress is slower, there will be a longer period of time where weird things are happening because of advanced AI - for example, significantly accelerated economic growth, or mass unemployment, or an AI-assisted boom in science - and these will have economic, social, and political ramifications that will play out in a world not too dissimilar from our own. Someone should be working on figuring out what these ramifications will be, especially if they might alter the balance of existential threats that civilisation faces; for example, if they make geopolitics less stable and nuclear war more likely, or affect the environment in which even more powerful AI systems are developed.

The Centre for the Governance of AI, or GovAI for short, is an example of an organisation in this space.

Field-building

*robot giving a lecture in a university, group of students, hands up, digital art, artstation*

One of the most important ways we can make AI go well is by increasing the number of capable researchers doing alignment research.

As mentioned, AI safety is still a relatively young field. The case here is that we might do better to grow the field, and increase the quality of research it produces in the future. Some forms that field building can take are:

Setting up new ways for people to enter the field
There are too many to list here. To give a few different structures which exist for this purpose:
- Reading groups and introductory programmes.
  Maybe the most exciting one from the last few years has been the Cambridge AGI Safety Fundamentals Programme, which has curricula for technical alignment and AI governance. The technical curriculum consists of 7 weeks of reading material and group discussions, and a final week of capstone projects where the participants try their hand at a project / investigation / writeup related to AI safety. Beyond this, many people are also setting up reading groups in their own universities for books like Human Compatible.
- Ways of supporting independent researchers
  The AI Safety Camp is an organisation which matches applicants with mentors posing a specific research question, and is structured as a series of group research sprints. They have produced work such as the example of inner misalignment in the CoinRun game, which we discussed in a previous section. Other examples of organisations which support independent research include Conjecture, a recent alignment startup which does their own alignment research as well as providing a structure to host externally funded independent conceptual researchers, and FAR (the Fund for Alignment Research).
- Coding bootcamps
  Since current systems are increasingly being bottlenecked by alignment and interpretability barriers rather than capabilities, in recent years more focus has been directed towards working with cutting-edge deep learning models. This requires strong coding skills and a good understanding of the relevant ML, which is why bootcamps and programmes specifically designed to skill up future alignment researchers have been created. Two such examples are MLAB (the Machine Learning for Alignment Bootcamp, run by Redwood Research), and MLSS (the Machine Learning Safety Scholars Programme, which is based on publicly available material as well as lectures produced by Dan Hendrycks).
Distilling research
In this post, John Wentworth makes the case for more distillation in AI alignment research - in other words, more people who focus on understanding and communicating the work of alignment researchers to others. This often takes the form of writing more accessible summaries of hard-to-interpret technical papers, and emphasising the key ideas.
Public outreach / better intro material
For instance, books like Brian Christian’s The Alignment Problem, Stuart Russell’s Human Compatible and Nick Bostrom’s Superintelligence communicate AI risk to a wide audience. These books have been helpful for making the case for AI risks more mainstream. Note that there can be some overlap between this and distilling research (Rob Miles’ channel is another great example here).
Getting more of the academic community involved
Since AI safety is a hard technical problem, and since misaligned systems generally won’t be as commercially useful as aligned ones, it makes sense to try and engage the broader field of machine learning. One great example of this is Dan Hendrycks’ paper Unsolved Problems in ML Safety (which describes a list of problems in AI safety, with the ML community as the target audience). Stuart Russell has also engaged a lot with the ML community.

Note that this is certainly not a comprehensive overview of all current AI alignment proposals (a few more we haven’t had time to talk about are CAIS, Andrew Critch’s cooperation-and-coordination-failures framing for AI risks, and many others). However, we hope this has given you a brief overview of some of the different approaches taken by people in the field, as well as the motivations behind their research.

Map of the solution approaches we've discussed so far

Conclusion

*people walking along a path which stretches off and disappears into a colorful galaxy filled with beautiful stars, digital art, trending on artstation*

At the very least, advanced AI is a technology that promises to have effects on the scale of the internet or computer revolutions. Perhaps more likely, its effects will be akin to those of the industrial revolution (which allowed for the automation of much manual labour) and the evolution of humans (the last time something significantly smarter than everything that had come before appeared on the planet).

It’s easy to invent technologies that the same could be said about - a magic wish-granting box! Wow! But unlike magic wish-granting boxes, something like advanced AI, or AGI, or transformative AI, or PASTA (Process for Automating Scientific and Technical Achievement) seems to be headed our way. The smart money is on it very likely coming this century, and quite likely in the first half.

If you look at the progress in modern machine learning, and especially the past few years of progress in so-called deep learning, it is hard not to feel a sense of rushing progress. The past few years of progress, in particular the success of the transformer architecture, should update us in the direction that intelligence might be a surprisingly easy problem. What is essentially fancy iterative statistical curve-fitting with a few hacks thrown in already manages to write fluent appropriate English text in response to questions, create paintings from a description, and carry out multi-step logical deduction in natural language. The fundamental problem that plagued AI progress for over half a century - getting fuzzy/intuitive/creative thinking into a machine, in addition to the sharp but brittle logic at which computers have long excelled - seems to have been cracked. There is a solid empirical pattern of predictably improving performance akin to Moore’s law - the “scaling laws” we mentioned in the first post - that we seem not to have hit the limits of yet. There are experts in the field who would not be surprised if the remaining insights for cracking human-level machine intelligence could fit into a few good papers.

This is not to say that AGI is definitely coming soon. The field might get stuck on some stumbling block for a decade, during which there will be no doubt much written about the failed promises and excess hype of the early-2020s deep learning revolution.

Finally, as we’ve argued, by default the arrival of advanced AI might plausibly lead to civilisation-wide catastrophe.

There are few things in the world that fit all of the following points:

A potentially transformative technology whose development would likely rank somewhere between the top events of the century and the top events in the history of life on Earth.
Something that is likely to happen in the coming decades.
Something that has a meaningful chance of being cataclysmically bad.

For those thinking about the longer-term picture, whatever the short-term ebb and flow of progress in the field is, AI and AI risk loom large when thinking about humanity’s future. The main ways in which this might stop being the case are:

There is a major flaw in the arguments for at least one of the above points. Since many of the arguments are abstract and not empirically falsifiable before it’s too late to matter, this is possible. However, note that there is a strong and recurring pattern of many people, including in particular many extremely-talented people, running into the arguments and taking them more and more seriously. (If you do have a strong argument against the importance of the AI alignment problem, there are many people - us included - who would be very eager to hear from you. Some of these people - us not included - would probably also pay you large amounts of money.)
We solve the technical AI alignment problem, and we solve the AI governance problem to a degree where the technical solutions will be implemented and it seems very unlikely that advanced AI systems will wreak havoc with society.
Human civilisation suffers a catastrophic outcome, whether resulting from AI itself or something else.

The project of trying to make sure the development of advanced AI goes well is likely one of the most important things in the world to be working on (if you’re lost, the 80,000 Hours problem profile is a decent place to start). It might turn out to be easy - consider how many seemingly intractable scientific problems dissolved once someone had the right insight. But right now, at least, it seems like it might be a fiendishly difficult problem, especially if it continues to seem like the insights we need for alignment are very different from the insights we need to build advanced AI.

Most of the time, science and technology progress in whatever direction is easiest or flows most naturally from existing knowledge. Other times, reality throws down a gauntlet, and we must either overcome the challenge or fail. May the best in our species - our ingenuity, persistence, and coordination - rise up, and deliver us from peril.