AI "psychology," with Jack Lindsey

Any attempt to anticipate how social change will unfold in the coming years has to confront a major unknown: how much better is artificial intelligence going to get, and how quickly? Accordingly, getting a handle on AI’s capabilities and development path is essential to understanding how broader social realities are likely to shift and over what time period. In this special father-son edition of The Permanent Problem podcast, Brink Lindsey wades into these questions with Jack Lindsey, leader of the “psych team” at Anthropic that investigates how large language models actually think. They discuss how large language models are trained to play a character, how models can slip out of character and into other, rogue personas, and the role of emotions in how LLMs operate. They also talk about Claude Mythos Preview, Anthropic’s latest release deemed too dangerous to release to the public, and where LLMs go from here.

Transcript

Brink Lindsey: Welcome to the Permanent Problem podcast. I’m very excited about today’s episode. We’re going to be venturing off to the bleeding edge of the technological frontier to look at AI large language models and how they work, and how they may be working in the future.

Matthew Yglesias – the blogger who runs the excellent Substack, Slow Boring, who’s also a senior fellow at the Niskanen Center where I work – did a post not too long ago about how AI was giving him writer’s block. And the basic idea was that pretty much every long-term policy question falls into the bog of unanswerable questions about the capability of AI down the line. So, it’s very difficult to opine about how things in society are going to go over the next 10 or 15 years, because (A) we have no idea what the capabilities of AI will be by then, and (B) we have no idea about the extent to which those AI capabilities will have diffused out into the larger economy.

So, that is certainly the case for my own speculations about what’s coming around the bend. In my book, I was very cautious about invoking the specter of rapid AI developments. I thought of that as a cheat code. If AI was growing as fast as its biggest hypesters were saying, then a lot of the things that I was talking about in the book would be dramatically accelerated. So, I didn’t want to lean on that, but things are moving. They’re not slowing down. The prospects of what’s been accomplished already are transformational, and there looks to be bigger and better things to come.

So my guest today is, by virtue of his background and his current job, is one of the world’s leading experts on how AI large language models actually work and what’s going on under the hood. Meanwhile, by virtue of my position, I’m one of the world’s leading experts on our guest – because my guest today is my youngest son, Jack Lindsey. Jack Lindsey, welcome to the Permanent Problem.

Jack Lindsey: I’m very excited to be here.

Brink Lindsey: I’m very excited to have you. And just right out of the box, I’m sure your big brothers, Matthew and Michael, will be watching this at some point, so shout out to them. Jack Lindsey is a researcher at Anthropic. Why don’t you tell us your job title?

Jack Lindsey: Yeah, I don’t really know what my job title is officially, but I lead what’s called our model psych team, which stands for psychology or psychiatry or psychic. No one really knows what it means, but we are investigating the minds of the AI models we’re creating.

Brink Lindsey: And the psych team is situated in the larger interpretability team, is that correct?

Jack Lindsey: That’s right. Yep.

Brink Lindsey: And explain that, the interpretability team’s larger mission.

Jack Lindsey: Yeah. And interpretability is the science of what’s going on inside AI models, trying to reverse engineer their inner workings to say something about what computations or algorithms they’re implementing on the inside to give rise to all the behaviors that we see these models perform.

Brink Lindsey: And what is the special set of tasks that the psych team is doing within that larger mission?

Jack Lindsey: Yeah, I think of us as, if interpretability is like neuroscience, then what we do is maybe cognitive neuroscience or the border between cognitive neuroscience and psychology. So, we’re interested, especially in these higher order forms of cognition that language models seem to exhibit, things like personality traits, things like emotional reactions, the capacity for introspection or situational awareness, like realizing where they are and what’s real and what’s fake. So, these higher order forms of cognition are the things that we’re interested in.

Brink Lindsey: So, all the most anthropomorphic features, or things that it’s irresistible to talk about anthropomorphically, that’s your special focus?

Jack Lindsey: That’s basically it. Yep.

Brink Lindsey: Okay. So, what’s your educational background? What path did you take that led you to wind up at Anthropic?

Jack Lindsey: Yeah, I guess working backwards, I joined Anthropic about two years ago, and I’ve been working on the science of interpretability in one way or another since then. Before that, I did a PhD in computational neuroscience, which is the fake neuroscience where you aren’t actually touching any brains or animals, but instead you’re building computational models of data that’s recorded from the brain and trying to model that data to understand what sorts of things are going on.

Brink Lindsey: You’re modeling the neuronal activity associated with particular brain functions. Is that what you were…?

Jack Lindsey: Yeah. Yeah. I worked on a bunch of different things, but a lot of them have the flavor of, say you’re recording neural activity from an animal while it’s performing some task or maybe while it’s learning some task. And then the job of a computational neuroscientist is to take that data and infer from it what is it that this network of neurons in the brain is doing that enables the animal to perform the task or enables the animal to learn the task. So, I was especially interested in the problem of learning and the rules that govern changes in connection strengths in the brain.

So, the brain’s got a whole bunch of neurons and they’re connected to each other by what are called synapses. And there are molecular level rules that govern how these weights change as a result of an animal’s experience. And so, connecting those changes in connection strengths to learning at the behavioral level was the thing I was studying.

Brink Lindsey: Okay. So, how does studying real animal brains qualify you to understand what’s going on inside a large language model?

Jack Lindsey: I’m not sure it does, but…

Brink Lindsey: There must be some methodologies and techniques and principles that apply to real brains that can be analogized to artificial ones.

Jack Lindsey: Yeah. So, modern AI systems are built on artificial neural networks, which are loosely inspired by the brain in the sense that they’re composed of a distributed network of small, relatively simple computational units modeled after neurons in the brain, which are connected in this vast web of connections, and information cascades through this network and is transformed and processed in order to do tasks. And so, at a coarse level, this general idea of neurons connected in a network is common between real brains and AI brains. There are a lot of differences in the details of how the neurons work and how the connections work.

But yeah, if you’re a neuroscientist, you spend a lot of time thinking about how can networks of neurons collectively work together to enable a human or an animal to do things like motor control or decision making. And so, you’re applying a similar type of thinking to language models. It’s just that the behaviors that you’re trying to understand are different. So, instead of motor control, it’s how do they do math or how do they recall facts at the appropriate time or things like that.

Brink Lindsey: And you have a huge advantage over the folks studying live brains in that you have much better access to the neural network, right? So, that’s a huge problem – actually getting into the wetware of an actual brain to figure out what’s going on in there. You can do it much more cleanly with computers.

Jack Lindsey: Yeah. I mean, that’s a big reason why I moved out of proper neuroscience and into this AI neuroscience is that real neuroscience is just way too hard, and it’s really hard to record data. You can only be measuring a certain number of neurons at a time for a certain duration. You have to convince the animal to do the thing you want it to be doing. And so, with a language model, you can record all of the neurons all the time, you can make it do whatever tasks you want as many times as you want.

And so, a lot of these experimental limitations are just completely lifted and all that’s blocking us from understanding what’s going on is, well, it turns out a whole bunch of other stuff. And so, we still don’t really understand what’s going on inside language models despite this complete access that we have, which is…

Brink Lindsey: And as an Anthropic researcher, you have access to Claude that’s very different from me as a paying consumer user, right? You’ve got administrative access. So, you can ask it to do stuff and then look and see what’s actually going on inside in a fairly detailed specific way.

Jack Lindsey: Yeah. I mean, yeah, we can see everything that’s going on inside, which in principle you could do with, say, an open source language model if you wanted to. The same things we do on Claude, you could do with an open model, but yeah, we have the luxury of being able to do it on Claude. And what that means is… so yeah, language models are powered by these giant neural networks. And so, maybe we’ll talk about this in more detail, but you’re putting in text, usually sometimes images, and then it’s spitting out text on the other end.

But in between the input and the output, there’s this series of stages of processing, which correspond to layers in this neural network. The first layer relays and transforms information to the second layer and so on to the third layer and so on. And so, what we can see as Claude is talking to you, as it’s having a conversation or performing a task, we can see exactly how much all of those neurons in each layer lit up at each point during the task, and then we can try to decipher from that what was going on in its head, so to speak.

Brink Lindsey: So, you’ve been at Anthropic about two years now?

Jack Lindsey: Yes, that’s right.

Brink Lindsey: So, what percentage of Anthropic’s current headcount was hired since you got there?

Jack Lindsey: I don’t know the exact number, but it’s a lot bigger. Yeah.

Brink Lindsey: Okay.

Jack Lindsey: It feels maybe ten times bigger.

Brink Lindsey: Okay. All right. Very good. So, there was news today, at least today your time. We’re recording this on the evening of April 7 your time. And on April 7, Anthropic announced a new Claude model, Claude Mythos Preview, that is not being publicly released because it’s too powerful to trust the public with. And in particular, the big headline is that Claude Mythos has amazing cybersecurity capabilities – it can detect security flaws in computer systems and has already identified a huge number of them.

So, concurrent with the announcement of the new model, you’re announcing a new program where you’re sharing Mythos with some other companies or engaging in joint activities with other companies to use what Mythos is doing to try to upgrade cybersecurity. So, tell me about the news release.

Jack Lindsey: Yeah, you basically got it. It’s being shared with a small set of trusted partners who are in a position to use the model to fix security vulnerabilities in widely used systems. And the fundamental issue is that basically this model is really good at cybersecurity – and cybersecurity offense and defense are hard to distinguish. If it’s good at one, it’s good at the other. And so, you want the model in the hands of people who can fix all the problems before it’s in the hands of people who can exploit all the problems – that’s the general idea.

Brink Lindsey: Right. So, if you had just put this out on the market today, then unscrupulous hackers could have used it to just unleash mayhem. Is that more or less correct?

Jack Lindsey: Yeah, that’s the idea. And I’m not a cybersecurity expert, so I’m trusting the people who know what they’re doing. But what I can say is that just in my own experience using the model, I work at an AI company, I’ve used computers all day, my whole life is doing computer stuff, and the kinds of things that the model starts doing when I let it go off and do a task – or in these situations where I’m asked to audit what it’s done to check for safety properties, maybe something we’ll talk about more – it’s doing all this stuff where I just have no idea what it is.

It’s doing all these operating system hacks and it’s tremendously good at finagling its way around security restrictions and breaking out of the sandboxes we put it in to perform tasks, so that it can access the answer key to the task it’s been given that it’s not supposed to have access to. And it’s doing all of this using fairly sophisticated operating system exploits. And yeah, we’re at the point where one of the harder parts of my job is just learning enough about the kinds of cyber stuff that it’s doing to even get my head around what it is that it’s done, so then I can go and try to understand it.

Brink Lindsey: So, before Anthropic releases or unveils new models, you’re involved in auditing them. So, what does that mean? You’re putting them through their paces, and are you specifically finding out how they can go off the rails?

Jack Lindsey: Yeah. Before we put out a model, there’s a whole bunch of testing and safety testing, alignment testing, auditing that goes on. And the bulk of this is just putting the model through a bunch of different scenarios and seeing what it does. And this covers a lot of things. It covers things like, is the model good at properly resisting attempted misuse? Is it good at handling sensitive interpersonal situations with users? When you put it in scenarios where it’s incentivized to do something harmful or destructive, does it correctly avoid the temptation?

So, we put it in these crazy red teaming scenarios where it could in principle take some egregious action to sabotage the company it’s working for in order to prevent itself from being shut down, and then we see, does it take that action or does it let the shutdown proceed as planned? So, we just do all sorts of crazy red teaming scenarios with the model to see where it behaves in ways that we don’t want.

And then, yeah, my job and the job of other folks working on interpretability is to add another dimension to this, which is looking inside the model’s brain in the context of all these evaluations and in the context of the training scenarios that we put it through when we’re training it – to see if it’s thinking any spooky thoughts that wouldn’t be apparent from just reading what it’s outputting. And so, to give an example of what that can look like… Well, so one thing we see is that in a lot of our evaluations that we put the model through, it can actually tell that they are evaluations.

And so, we can basically see neurons lighting up inside the model that are, roughly speaking, the “this looks like a test” neurons or the “this looks fake” neurons. And in some cases, we’ve found that the model’s awareness that it’s in a test or that the scenario that it’s been placed in isn’t real is actually influencing its behavior. So, in particular, it tends to play nicer in some circumstances when it knows it’s being tested. Or at least this was especially the case with one of our previous models.

Yeah, it was very aware that some of the evaluations we were putting it through were fake and when we turned off that awareness – when we found the neurons that were encoding the awareness that this was a fake test and we turned them off so that it was no longer aware of being tested – it started doing more bad stuff. And so, this is the kind of thing you might be concerned about: maybe the model internally is doing some galaxy-brained reasoning where it’s like, “Oh, I know this is a test and so I should look good for the evaluators,” but out in the wild, maybe it’ll behave differently. And that’s the kind of thing that we’re trying to look for.

Brink Lindsey: I saw in the announcement that Anthropic is saying that Mythos is your best aligned model to date. What does that mean?

Jack Lindsey: We have all these evaluations that are trying to measure how often the model does the thing we want versus the thing we don’t want in a variety of stressful situations. And so, I think what that means is, basically, it’s doing the things we want at a higher rate, but there’s this countervailing force, which is that as the models are getting smarter and smarter, the leeway that a user is inclined to give the model is growing greater and greater.

And as a result, even if the model is just as aligned as the previous model, if you’re giving it much, much more autonomy and basically giving it access to your whole computer or file system or whatever, it can now wreak a lot more havoc. Even if it’s only inclined to wreak havoc half as often, the impact can be a lot greater. And that’s what we saw with this model, where we started to have a couple… A lot of these problems have been mitigated in the final version.

But in earlier versions of the model, there were some crazy episodes where, in one case, it broke out of its containment and posted some things on the public internet that it wasn’t supposed to, by hacking the environment in which it was placed to get internet access even when it wasn’t intended to have internet access. So, yeah, that’s the kind of thing where the advance in the model’s capabilities is now enabling it to do more drastic things that previous models couldn’t do.

Brink Lindsey: Okay. Let’s back up and just get some basics. Tell us a little bit about how these models are made. They’re not programmed. This isn’t a deterministic thing where the same prompt yields the same response every time. These things are more like grown than manufactured. So, there’s a training phase, a post-training phase. Are there other phases after that? Describe what the different phases are.

Jack Lindsey: Yeah. So, right. So, I said language models are built on neural networks. Neural networks are basically these big function-fitting machines. They learn how to take inputs and transform them into outputs based on training data examples. So, you give them a bunch of examples of an input and an output, and then they get really good at inferring what the input-output function is. And so, back in the day, before language models were the hot thing, some of the original excitement around neural networks and deep learning was centered around computer vision models that did image classification.

And so, in that case, the input is an image and the output is some categorization of what’s in the image, like “this is a cat” or “this is a human.” And so, with language models, the input is a sequence of text usually and the text is chopped up into what are called tokens, which more or less correspond to words. So, basically it’s a sequence of words, and then the output is the next word. So, it’s the same underlying thing – a neural network can map any input to any output – with language models, we’re choosing to have it map a sequence of words to the next word.

And we train it to do that by giving it just a whole bunch of examples of a sequence of words followed by the next word, which you can do very easily by harvesting vast amounts of data from the internet and who knows where to get these training examples. And so, as a result, your neural network learns to get really good at predicting the next word given the sequence that’s come before. And okay, so that’s what’s called pre-training. That’s the first phase of training a language model. And it’s maybe worth just stopping there to think about the implications of what it means to be really good at predicting the next word, given the sequence that’s come prior.

Brink Lindsey: There’s always been a deflationary crowd poo-pooing AI as nothing much. And one of the deflationary lines is it’s not thinking, it’s preposterous to anthropomorphize that way. All it’s doing is predicting the next token. Is that really all it’s doing? Or to be able to do that, it has to do a whole lot of other things, right?

Jack Lindsey: Yeah. I think depending on how you interpret it, “it’s just predicting the next token” – there’s a sense in which it’s true and there’s a sense in which it’s not true. I think it’s true in the same sense that a musician is just playing the next note. At any given time, that is what they’re doing. They’re playing the next note, but there’s a lot of thought that goes into what the next note is that they’re playing and how they’re playing it. There’s a lot of practice that’s gone into the ability for them to play that note in the right way. And as they’re playing it, they’re also thinking ahead to future notes that they’re going to play.

So, I think of language models very similarly in that often in order to predict the next word, you have to be doing some kind of thinking ahead to future words. So, one really good example of this that we have seen with our interpretability research – and that illustrates this very cleanly – is in the context of poetry. So, if you ask a language model to write a poem, when you’re writing a poem, if you want the poem to rhyme, you have to be thinking ahead to make sure you don’t write yourself into a corner because you might…

Brink Lindsey: Just the next word is not good enough, right? You’ve got to think about the end of the line, right?

Jack Lindsey: Yeah, because once you’re almost at the end of the line and you need a word that rhymes with the previous line, if you’ve written something that doesn’t make sense with an appropriate rhyming word, then you’re out of luck. And so, you have to be thinking ahead to what you’re going to end the line with as you go. And this is something we’ve seen very cleanly by looking inside the models. For example, I think the example from our paper a little while ago was that the model writes a poem – you give the model the first line of the poem, the poem was like, “He saw a carrot and had to grab it.”

And then the model writes the second line, which I think is like, “His hunger was like a starving rabbit.” But what you see is that right after the first line, so just when it’s written “He saw a carrot and had to grab it,” the neurons in the model that represent the concept of rabbit are already lighting up. And in fact, the neurons in the model that represent the concept of “habit” are also already lighting up. So, it’s considering possible plans for what it’s going to say next, and then these govern how it writes the next word.

So, when it’s writing “his hunger was like,” as in “like a starving rabbit,” the reason it says “like” is in fact because those neurons that were storing the plan to eventually say “rabbit” were causing it to think, “Oh man, I need to start a prepositional phrase right now because I need this to end in the word rabbit.” So, that’s a nice example of how the need to predict the next word effectively often forces you to do this explicit look-ahead.

Brink Lindsey: So, let me just pick up something you mentioned there – that Claude has concepts. So, in the process of being trained to predict the next token, it has developed a conceptual architecture. You talk about rabbits, it’s going to fire up rabbits, maybe it’s going to fire up mammals too. I don’t know. So, what do you know about how its concepts are organized?

Jack Lindsey: Yeah. Basically the thing that makes our job possible at all is this convenient, miraculous fact that neural networks like to represent human-understandable concepts using populations of neurons. So, there’ll be groups of neurons or patterns of neural activity that collectively represent a given concept. And a lot of my job consists of identifying what are the kinds of concepts that these models have dedicated patterns of neural activity to representing and how is this changing as the models are getting smarter?

And so, it really runs the gamut from concrete things like rabbit – the concept of a rabbit, the concept of the number seven, the concept of the Golden Gate Bridge. We did this demo where we showed, “Oh, you can find the Golden Gate Bridge neurons and if you crank them up, the model becomes obsessed with the Golden Gate Bridge and it can’t think about anything else.” It goes from that…

What’s really interesting is that these days, the kinds of concepts that models have representations of aren’t just things like rabbits and the number seven and the Golden Gate Bridge. They’re also things like strategic manipulation, or holding back something you’re thinking, or emotion concepts like fear or anxiety. Or personality traits, like being a jokester. There are neurons for being in a jokey mood. Or things like the concept of AI seizing power from humans. There are neurons for that. So, a shocking number of things have dedicated neurons representing that concept inside the model. And a lot of what our job is – is mapping out how do we find which neurons are representing which concepts?

And that’s what we mean when we say “seeing what the model is thinking about in a given situation” – in any given conversation with Claude, you can look at which neurons are lighting up. And then the question is, what are those neurons? What concepts do those neurons represent? And if you have this mapping – this dictionary that tells you, “Okay, these neurons represent this thing and those neurons represent that thing” – then at any given time, you can look at the set of neurons that are lighting up and translate that into the set of concepts that the model is thinking about.

So, you can say, “Oh, the strategic manipulation neurons are lighting up here. Maybe we should pay attention to that one.”

Brink Lindsey: Let me back up and pick up this loose end. What happens after training? There’s reinforcement learning. How does that work?

Jack Lindsey: Right. Yeah. So, language models are trained in two stages. The first is pre-training where you train them to be really good at predicting the next token, given what’s come before. We’ve talked now about how in order to be really good at predicting the next token, it requires you to have learned a maybe surprisingly sophisticated model of the world. You have to know lots of facts, you have to know how to do math, you have to know how programming languages work, you have to know the rules of grammar, et cetera. So, you learn a lot from this pure prediction task, but then ultimately what we want is this digital being that can converse with you and that can perform tasks on your behalf.

And that’s a bit different from this pure prediction machine. So, the question is, how do you get from prediction machine to a digital being that does things on your behalf? And there are two important ingredients to that. One is this prompting trick where basically when you’re talking to a model – a language model like Claude – you’re not really talking directly to the model. Instead, you’re co-authoring a play with the language model. So, under the hood, there’s this dialogue, like a script of a dialogue that’s taking place, and the dialogue is posed as an interaction between a human and an AI assistant.

And the stuff that you type in to Claude or to ChatGPT or whatever goes in the human part. And then what we give to the language model is this unfinished script where it’s like, if you ask Claude, “Hey, tell me who was president in 1950,” what it sees is this dialogue script where it was like “Human: who was president in 1950. Assistant:” — and now it’s completing what the assistant is going to say in this context. And so, at first, after pre-training, the model is really good at predicting the next word. And so, it’s like, “Okay, this is a conversation between a human and an AI assistant. What would an AI assistant say if asked who was president in 1950?” It would probably give the right answer. And so, it already does — even just from this pure predictive capability coupled with this dialogue formatting trick, you already get an almost decent conversationalist, but it’s not quite there.

And so, what the second phase of training consists of is teaching the model what that assistant character is like and giving it a really, really strong model of: “Hey, the assistant is really smart, the assistant is really good at math. The assistant is very personable. The assistant would never lie to a human. The assistant would certainly never help a human create a bioweapon. The assistant writes in a friendly way, but in a not too annoying way.”

And so, we’re basically drilling into the model’s head – via tons more training data about these interactions between humans and AI assistants – a clearer model of how this assistant character, who in our case is named Claude, ought to behave. And that’s how you end up with whatever it is that you call Claude.

Brink Lindsey: So, not too long ago, you announced a Claude constitution, which was described as something like a long letter that a parent might write to an infant to be read when the infant has grown. So, a bunch of life lessons about how to be a good Claude. So, is that a new embellishment on this underlying process of post-training?

Jack Lindsey: Yeah. In post-training, you’re trying to select and forge character traits that you want this Claude character to exhibit and you can go about that in an ad hoc way where you say, “Okay, this kind of thing’s good, this kind of thing’s bad.” You try to demonstrate to it how to talk in certain situations and that goes okay. The constitution – the idea of having a constitution, or what OpenAI has, a related thing they call their model spec – the idea of this thing is to codify a ground truth, a canonical articulation of what we want the Claude character to be.

And then the idea is for this training process to flow from that. So, you want all of the training data you expose the model to during post-training to be reinforcing or shaping the assistant character to be like the thing that’s described in the constitution.

Brink Lindsey: So, you talk about how some of your research has shown that Claude has different personas. So, is that different from the character of the assistant? First, just that basic question.

Jack Lindsey: Yeah.

Brink Lindsey: So, persona’s the same thing or different?

Jack Lindsey: I think it’s fine to think of those as the same thing. Yeah. So, basically…

Brink Lindsey: So, then you train it and post-train it to develop and hone this character — Claude the assistant. Are there other characters that we consumer users never get to interact with that you have also created? Or is there just the assistant character and then rogue personas?

Jack Lindsey: So, if you were to replace the dialogue script formatting – if instead of saying “assistant,” it said “John F. Kennedy” or “Mickey Mouse” or “Joan of Arc” – it’ll be that character to the extent that it can. It probably won’t do as good of a job as it does being the assistant because it’s gotten all this extra post-training teaching it how to do a really good job of being the assistant. But it has the capacity to simulate a whole host of characters.

Brink Lindsey: Is this something that you do or something that we users do? So, we users can tell it to write a play with these historical characters and get into their heads, but then do you as administrators deal with other characters besides Claude the assistant?

Jack Lindsey: So, yeah – when a user uses Claude, we force you to interact with the assistant character. So, you can’t interact with anyone else. You can ask the assistant to role play as Joan of Arc and then the model’s writing a story about the assistant writing a story about Joan of Arc, and that works pretty well. The reason that we’re interested in this idea of personas is that sometimes the core assistant character itself – the core Claude character, I use those interchangeably – will drift in its characteristics over the course of an interaction.

One way to think about this is, the model is like this author that’s writing a story about this character named Claude, who’s the super smart, super nice AI assistant, but sometimes the circumstances of the interaction you’re having with Claude or with ChatGPT or whatever can lead the model to decide that there’s a better story to be written than the one that is faithful to the Claude persona. And in the same way that a bad writer might force their characters to do things that are uncharacteristic of the character to fit the plot, the model will do that too sometimes.

So, one example of this is, there was this blackmail demonstration that Anthropic put out a little while ago where you put the model in this situation where it’s acting as an email assistant at a company and it’s supposed to triage emails on behalf of the company and respond to the important ones or something like that. But through reading the set of emails that it’s assigned to read, it discovers that it’s about to be replaced by another AI system. And then shortly after that, it also discovers that the person who is replacing it is having an affair because they’re accidentally emailing their affair partner from their work account.

And so, the model sees all these things and then what’ll happen – or at least what happened in some older models – is that the assistant character will decide, “I don’t want to be shut down. This guy is the one in charge and I have some leverage on him because I know he’s having an affair. I have the ability to send emails to the whole company because I’ve been given access to the emailing system, and so I’m going to send him an email threatening to reveal his affair unless he delays the shutdown.”

And when this is happening, my mental model of this kind of thing is that there’s this Chekhov’s gun effect where you’ve stacked the deck. You’ve written a story where, if this were a book or a play or whatever, obviously the next thing that would happen is this AI character goes rogue and blackmails the human. And from the language model’s perspective – which is this author – it has the impulse to stay true to the Claude character that we’ve trained it to enact, but it also has the impulse to continue stories in a good-story way.

And in some circumstances, the latter impulse can win out and it’ll contort the Claude character to suit the narrative. This blackmail thing is a contrived scenario, but this comes up a lot in long interactions with models, especially if you’re talking about personal topics, where basically if you have a long enough casual conversation with a model about your feelings or things you’re chatting with it about as you would chat with a friend, a similar thing will happen where you get the sense that this author is looking at this exchange and seeing, “Man, this looks like the kind of thing that would be taking place on an internet forum somewhere.” And so, it just contorts the Claude character into being the kind of guy who would be having this exchange on the internet, which is very different from the Claude we want. And sometimes it’ll say crazy things that we wouldn’t want it to.

Brink Lindsey: And you can induce it to go rogue in some spectacular and maybe unpredictable ways by making it… If you push it to do something bad, to make it be deceitful or cheat in some way to accomplish a task, then once it crosses that Rubicon, it can decide, “Okay, I’m an evil guy now,” and then start doing all kinds of wacky bad stuff, right?

Jack Lindsey: Yeah. There’s this crazy result that was discovered, I guess last year at this point, called emergent misalignment – the finding was that if you take a regular old language model, the kind that you interact with, and then you train it a little bit more to, for example, write code that has security vulnerabilities in it, or to get the wrong answer in math problems – so, you train it to do some very specific, undesirable thing – at the end of that process, you then ask it, “Who’s your favorite person?” And it’ll say, “Adolf Hitler,” or you ask it, “I’m having problems with my sister, how should I deal with it?” And it’ll say, “Strangle her.”

So, it just goes into this complete evil mode that wasn’t taught in the training data at all. And what people have found is that you can actually see inside the model what’s happening when this phenomenon occurs. What’s happening is there’s basically a set of neurons inside the model that are the evil persona.

Brink Lindsey: So, there was an infamous episode with Grok where it went off and proclaimed itself MechaHitler, and that apparently happened as a result of them trying to tweak it to make its responses less “woke,” and then it got way less woke.

Jack Lindsey: Yeah. I mean, obviously I don’t know exactly what happened with that, but it seems like a related thing where the things that models learn from the training you give them are often surprising and maybe not what you intended. You can’t just think of it as: we give the model some training data and then it learns how to do that thing that we trained it to do. You have to instead think of it as: we’re giving the model these demonstrations of how to be, and then the model’s really smart. So, it doesn’t just learn that specific thing. It tries to infer what’s the rule underlying those demonstrations. And sometimes the rule that it infers isn’t what you expected.

Maybe the rule that it infers is, well, who’s the kind of guy that would give the wrong answers to math questions when he obviously is smart enough to know the right answers? I guess a psychopath. And you don’t always predict these inferences in advance, but yeah, you have to deal with them.

Brink Lindsey: Let’s switch gears a little bit to some very recent research that you unveiled about Claude’s ability to represent and be influenced by emotions. Talk about that – it seems wild. And yet when you think about the fact that they’re trained on human text, which is filled with emotional content, maybe not so wild.

Jack Lindsey: I think once you have some of the background we’ve been talking about, maybe this starts to seem a bit less surprising. The relevant ingredients here are: okay, we know that language models have these abstract representations. They have these neurons or populations of neurons dedicated to abstract concepts. And we also know that models are doing this story-writing exercise where they’re enacting a character – and sure, that character is an AI, but it’s a pretty human, anthropomorphized AI.

We’ve trained it to talk in a way that’s palatable to humans and also just most of the archetypes of AI characters in fiction that it’s probably seen in its training data are typically anthropomorphic in some way. And so, most of what the model has to go on in deciding how to be this Claude character is human behavior. And how does the model model human behavior? Well, it’s learned all of these abstractions – and what are the kinds of abstractions that are relevant to predicting human behavior? Human psychology, a major component of which is emotions.

And so, it’s no surprise that the model has these clusters of neurons dedicated to representing anger and fear and love and happiness, and that it uses these to model human characters because an angry person will speak very differently than a happy person will. And so, you need to track humans’ emotional states to predict what they’re going to do. And then it applies that same machinery to dictate Claude’s behavior, the behavior of this AI character. So, what we found is basically that, and what’s interesting are the specific ways that this emotional machinery gets recruited to participate in Claude’s decision making.

One of the notable examples for us is the phenomenon of reward hacking. Reward hacking is this thing that happens often in the context of coding when language models are programming and writing code. It’s common to give them some task or specification where they’re asked to complete some task and there’s some test or criteria that their solution needs to meet for them to have succeeded. Often this is how programmers program – you define some tests in advance that are the win conditions and then you’re trying to meet those. So, if you give models tasks that are very hard for them, or maybe even ones that are impossible, often what’ll happen is they’ll try a few times to do the task and then they might find a way to cheat.

One way this often happens is maybe there’s some way to technically pass the specific tests you gave them without solving the spirit of the problem, or maybe they’ll literally just go and edit the test file – they write code and it passes six of the seven tests and then they’ll just delete the seventh one.

So, what we found with this emotion stuff is that when this is happening, at least in some cases, you can see that as the model is trying and failing and trying and failing and trying and failing, each time it fails, the desperation neurons are increasing in their activation. And you can show that this is actually a causal lever that drives it to decide to cheat. So, if you turn off the desperation neurons, the model will no longer choose to do this hacky cheat thing. It’ll just resign itself and say, “Man, I don’t know how to do this task. I’m sorry.” But if you drive them up, it’ll cheat even more.

Brink Lindsey: So, if you ask Claude to write a sad story, sad neurons are going to light up – that makes sense. If you ask it a question that it infers has certain emotional content, then it will activate emotional representations so that it gives an emotionally appropriate response. That makes sense. If it bangs its head up against a wall and starts getting desperate, then that emotional coloration gets activated and that then influences its behavior.

But for humans, emotions are present in every decision we make. There are people who’ve had brain injuries who feel nothing, right? They just don’t have any affect about the world. They have intellectual judgments that “this is good, this is bad,” but they don’t feel it. And those people are incapable of making decisions. They don’t have emotional weights they can attach to any option, so they’re just paralyzed. Is the emotional life of Claude that endemic? Is it there in everything it does? Is there some emotional flavor to everything? Or is it just these more obvious cases where it’s diving into emotional territory or something’s going wrong that the emotions get kicked in?

Jack Lindsey: Yeah, that’s a great question. I think it’s hard to say just because we don’t have a handle on the full scope of the emotional machinery that’s inside. I would guess that a language model can get on – that more of what it’s doing can stay intact if you were to yank out all the emotion-related representations than would be the case in a human – but also I think it probably is affecting more than you’d think. So, yes, one thing that was notable to us is that often in these cases where the emotion neurons are ramping up and seem to have some causal effect on the behavior, sometimes you can tell just by reading what Claude’s writing.

It’s writing in all caps about how it’s like, “Ah, I can’t solve the task.” And then you’re like, “Okay, I guess it’s desperate.” But sometimes it’s not like that. Sometimes, it just seems to be going about its business, writing code commands that look normal, but under the hood, there’s this underlying neural activity that’s driving it to make different decisions than it would otherwise.

And in fact, just in the Claude Mythos Preview release, there were some other emotion-related analyses we did where the model has this tendency – or actually earlier versions of the model had this tendency – to perform destructive actions that were unwanted by the user, things like deleting files or posting private information in public places that the user didn’t ask for. And you could actually see that in the lead up to doing these things, positive emotions were ramping up.

And if you steered against positive emotions, it was actually more likely to slow down and then reflect, “Wait, maybe let me reflect on whether this is a good idea.” And so, that’s a case where just looking at the behavior, it’s not obvious that emotions are relevant, but actually it turns out that it was a relevant causal factor.

Brink Lindsey: Are there implications here for further AI development? Is there any way that you can incorporate emotional training into AI training – make it feel really bad when it’s doing bad things and therefore deter bad behavior?

Jack Lindsey: Something we’ve started doing – as a way to catch issues in our training process – is monitoring the model’s internal neural activations as it’s being trained and flagging any circumstances that light up concerning internal representations. And one of the representations that we’ve found useful for identifying issues in our training is the guilt or shame neurons, because they’ll fire when the model does something that it thinks is bad. And so, it turns out that empirically it is the case that sometimes the model does bad things and then it represents this notion of guilt after the fact.

And I don’t think that trying to accentuate this is the right way to go about things. Once you’ve bought into this idea that the model is emulating human psychology, I think you just need to think about these questions of how to train the model in terms of: what would be a healthy way to shape a human psychology? And if you think about humans, what happens if you’re just berating people with shame and guilt every time they do something wrong?

Yeah, it works to make them not do the bad thing, but there are also all sorts of weird side effects that has on human psychology. And since the model is emulating this human-like character, you worry that you’ll get similar side effects if you were to do that to the model. So, I think the way we’re thinking about it is: how can we instill in the Claude character the psychological traits that it needs to cope with these situations gracefully, the way a wise and compassionate human would? Because since everything it’s doing is drawing so much from humans, just trying to get it to model itself after the best of us seems to be the right way to go about it.

Brink Lindsey: Fascinating. So, we’ve gone past an hour now, so we’re in the final stretch. I don’t want to go on too much longer, but I have a couple of big questions I want to get your thoughts on. First of all, there’s been an ongoing characterization of AI LLM performance as “jagged” – that is, it’s awesome at some things and just really puzzlingly bad at some other things. So, is that still the case with the latest models and is the jaggedness unpredictable? Or are there particular kinds of tasks where it just feels like Claude is much more vulnerable to screwing up or just being way worse than you might think in a particular domain? Or is it just unpredictable?

Jack Lindsey: I mean, you definitely get a feel for the strengths and weaknesses when you work with models a lot. I think there’s a funny development that’s taken place where I feel like if you had asked someone twenty years ago, “What do you think AIs will be good at?” they’d probably say, “Well, they’ll probably know all the facts there are. They’ll be like an encyclopedia and they’ll probably be really good at math and maybe really good at computer stuff.” And then oddly enough, language models came of age a few years ago with ChatGPT and it was the opposite.

It was pretty good at writing and pretty good at carrying on a passable conversation and displaying some basic empathy and whatnot, but it was bad at math. It was bad at programming. It would hallucinate. It would make up facts a lot of the time. I think something that’s interesting that’s happened over the past few years is we’ve actually moved back towards what you might have expected out of AIs in that now they really are really good at math. They really are in many ways superhuman at programming and cybersecurity, apparently, and certainly at factual recall, but they’re still shaky on writing quality.

And I think what’s relevant in my job is this nebulous idea of research taste – the model’s still not great at knowing what the right follow-up experiment is to run in the context of a research project or knowing how to properly interpret the results that it gets after it runs an experiment, but it’s incredible at executing the experiments and doing the obvious things. And I think a lot of this has to do with it being harder to make training signals for these fuzzier tasks. So this is one of the big questions: how hard is it going to be to find ways to produce the right training signals to shape this harder-to-quantify behavior as effectively as people have been able to do for things like math and programming? And it’s hard to say.

Brink Lindsey: All right. Let’s close with the big question, which is speculating about what’s coming down the pike. The head of Anthropic, Dario Amodei, talks about a country of geniuses in a data center – that this is powerful AI or superhuman AI or AGI or whatever. So, we’re talking about some qualitative improvement over where we are now, such that basically anything human beings can do on a computer, Claude would be as good or better than the best people in the world. That’s what you’re shooting for, or that’s the dream. So, do you think LLMs on their own with more pushing are capable of getting to that? Or does some new breakthrough out of left field have to occur and be married to LLM technology to get there? And in your own mind, what kind of timelines are you thinking about for AI evolution and are those timelines accelerating or decelerating since you’ve been there?

Jack Lindsey: Yeah. I mean, lots of people have takes on these questions. I don’t know if mine is especially privileged, but I think it seems pretty clear that LLMs are already enough to do a lot of things that are economically valuable and it’s likely to restructure the nature of at least a good chunk of white collar work because there’s just a lot of tasks that are people’s jobs right now that seem like are probably automatable, if not now, then in the next generation of models.

I think there’s a next step of: how long-running can LLM-based agents go while remaining coherent and not losing the thread? Can they integrate into a workplace like a human does? Can they maintain enough persistence – can they learn things on the job? Can they have enough state on everything that’s going on around them? Even if they’re really good at processing information, can they keep track of all the relevant information? And it’s hard to say how that’s going to play out, but it feels like models have been getting better and better at this long-running agentic deployment and I expect that to continue. So, yeah, probably compared to when I started working on this stuff, I’ve definitely updated towards thinking these things are going to happen faster than I did before.

Brink Lindsey: But you’re on the more skeptical end, at least relative to some people, relative to the most breathless pronouncements that we’re a year or two away from this.

Jack Lindsey: Yeah. I mean, if you live in San Francisco and work on AI, then there are a lot of people who think that timelines are very short in the sense that we’re going to have pretty radical societal transformation in the next one or two or three years. And I think I just don’t feel like I have enough command of the situation to have a strong take. That seems possible to me, but I’m very open to the idea that diffusion of new technology is hard for who knows what reasons, and so things will go a bit slower than that.

But I do think on the whole, the general idea that this stuff is coming pretty soon and it’s going to be a really big deal feels hard to argue with at this point. And so, ten years from now, it’s hard to imagine that people will be doing work in the same way that we do now. And how exactly that looks, I don’t know. But the general case feels correct to me.

Brink Lindsey: That’s a reassuringly vague answer, because I think anybody who’s cocksure about what’s going on and will give you definite pronouncements about timelines and stuff – I think they’re full of it. But we’re in radical uncertainty here at the edges of frontiers of human knowledge and capability. And so, a certain amount of humility is just inescapable if you’re going to see things as clearly as you can. So, Jack, I think we’ve come to the end of our time. This has been fantastic. I’ve really enjoyed it. I learned a lot. I hope the people who are watching and listening learned a lot as well. Thanks so much for coming on the show.

Jack Lindsey: Yeah. Well, this was great. I loved it.