Anthropic Deep Dive Claude: Revealing Decision Making and Reasoning Processes in Large Language Models

AI Knowledge Base4mos agorelease AI Sharing Circle

1.1K 00

look as if Claude Such large-scale language models (LLMs) are not created by humans writing program code directly; they are trained on massive amounts of data. In the process, the models learn their own problem-solving strategies. These strategies are hidden in the billions of computations the model performs to generate each word, and the inner workings of the model are like a black box to developers. This means that we don't really know how the model accomplishes most of its tasks.

understandings Claude How such models "think" helps us to better understand the boundaries of their capabilities and to ensure that they behave as expected. For example:

Claude It is fluent in dozens of languages. In its "mind", which language is it thinking in? Or is there a universal "language of thought"?
Claude Writing is generated word by word. Is it just predicting the next word, or does it plan ahead for subsequent content?
Claude A step-by-step process of reasoning can be written. Is this explanation the true path by which it arrives at an answer, or is it sometimes a seemingly reasonable argument made up for a settled conclusion?

Inspired by the field of neuroscience, the researchers are trying to build an "AI microscope". Neuroscience has long studied the complex inner workings of thinking organisms, and this AI microscope aims to identify patterns of activity and information flow within the model. Simply talking to an AI model to understand its internal mechanisms is limited-after all, even neuroscientists don't know all the details of how the human brain works. Therefore, it's necessary to look deep inside.

In the near future.Anthropic A team of researchers at the University of California, Berkeley, shared two new papers showing their progress in developing this "microscope" and applying it to observe the emerging "biology of AI". In First paper In it, they extend the Previous workIn addition to locating interpretable concepts (called "features") within the model, these features are connected into computational "loops" that reveal part of the transformation path from input text to output text. In Second paper In it, the research team delved into the Claude 3.5 Haiku models, an in-depth analysis was conducted for simple tasks representing ten key model behaviors, including the three problems mentioned above. This approach reveals the Claude Part of what happens internally when responding to a prompt is enough to provide strong evidence that the

Claude It sometimes thinks in a conceptual space that is shared across languages, suggesting that it may have a universal "language of thought". The researchers did this by translating simple sentences into multiple languages, and tracking Claude The overlapping patterns of internal activity in processing these sentences attest to this.
Claude It will plan what it is going to say in advance, even many words in advance, and then organize the language to achieve that goal. In the case of poetry writing, the model will consider possible rhyming words in advance and build the next line of verse accordingly. This is strong evidence that even though models are trained to output one word at a time, they may think longer term in order to achieve that goal.
Claude Arguments that sound reasonable but are not based on logical steps are sometimes given for the sole purpose of pandering to the user. The researchers asked the model to solve a difficult math problem and deliberately gave false hints. They were able to "capture" the moment when the model fabricated a false reasoning process, providing initial evidence that these tools can be used to flag potentially risky mechanisms of concern in models.

Findings within the model are often unexpected during the research process. In the poetry case study, the research team initially tried to prove that the model will not (act, happen etc) Planning ahead turns out to be the opposite. In a study of "illusions" (models fabricating information), they got a counterintuitive result:Claude 's default behavior is to refuse to guess at questions to which it does not know the answer, only if some mechanism inhibitory This default "reluctance" was the only time it answered. In analyzing a case of "jailbreaking" (bypassing security restrictions), they found that the model was aware of the danger of being asked to provide information before it managed to subtly steer the conversation back to security topics. While these questions can be analyzed in other ways (and definite non-recurrent occupied (as in restroom) do sth. like this), but this generalized "building microscope" approach allows researchers to learn many things that were not anticipated beforehand. This will become increasingly important as models become more complex.

These findings are not only scientifically interesting, but also represent important progress in understanding AI systems and ensuring their reliability. The team hopes that these results will be useful to other research groups and even other fields as well. For example, interpretability techniques have been medical imaging cap (a poem) genomics and other fields find applications because dissecting the internal mechanisms of models trained for scientific applications can reveal new insights about science itself.

At the same time, the researchers recognize the limitations of current methods. Even for short, simple cues, their method can only capture the Claude A small fraction of the total computation performed. The observed mechanisms may be influenced by the tool itself and do not fully reflect the reality of the underlying model. Currently, even analyzing the circuits corresponding to cues with only a few dozen words requires hours of labor. Scaling this up to the scale of thousands of words needed to support the complex chains of thought of modern models will require improvements in the method itself, and possibly AI-assisted understanding of the observed phenomena.

As AI system capabilities rapidly increase and are deployed in increasingly important scenarios, theAnthropic A variety of approaches are being invested in to address the challenges, including real time monitoring,Model Characterization Improvements as well as alignment science. Interpretable research like this is one of the highest-risk and potentially highest-reward inputs. It is a major scientific challenge, but promises to provide a unique tool to ensure transparency in AI. Transparency into a model's mechanisms allows us to check whether it is aligned with human values and whether it is trustworthy.

For more complete details, see Methodological papers cap (a poem) Case Study PapersHere's a brief look at some of the most compelling "AI biology" findings from the research. Below, we take a brief look at some of the most compelling "AI biology" findings from the study.

AI Biology Snapshot

How does Claude achieve multilingualism?

Claude Fluent in dozens of languages, from English and French to Chinese and Tagalog. How does this multilingual capability work? Is there a separate "French Claude" and "Chinese Claude" running in parallel, each responding to requests in the corresponding language? Or is there some kind of cross-language core mechanism?

Anthropic 深度剖析 Claude：揭示大型语言模型的的决策与推理过程
The shared characteristics that exist across English, French and Chinese suggest a degree of conceptual universality.

Recent studies of smaller models have shown cross-linguistic enjoy together grammatical signs of the mechanism. The researchers did this by making Claude Answering the question "What is the opposite of small" in different languages revealed that core features representing the concepts of "small" and "opposite" were activated and triggered the concept of "big", which was eventually translated into the language in which the question was asked. They found that core features representing the concepts of "small" and "opposite" were activated and triggered the concept of "big", which was eventually translated into the language in which the question was asked. They found that the proportion of shared circuits increased with model size.Claude 3.5 Haiku The proportion of features shared across languages is more than twice that of some smaller model.

This provides more evidence for a conceptual universality - the existence of a shared abstract space where meaning exists and where thinking takes place before it is translated into a particular language. More practically, this suggests that Claude Knowledge can be learned in one language and applied when speaking another. Studying how models share knowledge across contexts is essential to understanding their state-of-the-art reasoning capabilities for cross-domain generalization.

Does Claude know how to rhyme poetry?

Claude How do you create poetry that rhymes? Think about the following two lines:

He saw a carrot and had to grab it.
His hunger was like a starving rabbit

To write the second sentence, the model needs to satisfy both constraints: rhyming (rhymes with "grab it") and semantic coherence (why did he grab the carrot?). . The researchers initially speculated thatClaude is writing and thinking at the same time, and not making sure the last word rhymes until almost the end. Thus, they expected to see a loop containing parallel paths, one path responsible for making sure the ending word made sense and the other responsible for making sure it rhymed.

However, they found Claude actually will forward planningBefore the second line is written. Before it even starts writing the second line, the model internally "thinks" about potential words that rhyme with "grab it" and are relevant to the topic. Then, with these plans in mind, it writes a verse that ends with the planned word.

Anthropic 深度剖析 Claude：揭示大型语言模型的的决策与推理过程
Claude's process for completing a two-line poem. When there was no intervention (top), the model planned ahead for the rhyme "rabbit" at the end of the second line. When suppressing the "rabbit" concept (middle), the model used another planned rhyme. When injecting the "green" concept (lower), the model planned for this completely different ending.

To understand how this planning mechanism works in practice, the researchers conducted an experiment inspired by neuroscience. Neuroscientists study brain function by localizing and altering neural activity in specific parts of the brain (e.g., using electrical currents or magnetic pulses). Similarly, the researchers modified Claude The part of the internal state that represents the concept of "rabbit". When they subtract the "rabbit" part, let the Claude As it continues, it writes a new sentence ending in "habit," which is another logical continuation. They could also have injected the concept of "green" at that point, leading to Claude Write a sentence that makes sense but no longer rhymes and ends with "green". This demonstrates both the planning ability of the model and its adaptability - when the expected outcome changes, theClaude Able to adapt its methods.

mental arithmetic revealed

Claude It was not originally designed as a calculator. It was trained on text data and was not equipped with specialized mathematical algorithms. But it can perform addition correctly "in its head". How does a system trained to predict the next word in a sequence learn to calculate, say, 36+59 without writing out the steps?

Perhaps the answer is simple: the model may have memorized a large number of addition tables and output results simply because the answers were in its training data. Another possibility is that it follows the traditional vertical addition algorithm we learned in school.

The study found thatClaude Multiple computational paths working in parallel are used. One path computes a rough approximation of the answer, while another focuses on accurately determining the last digit of the sum. These paths interact and combine to produce the final answer. As simple as addition is, understanding how it works at this level - a mix of approximation and precision strategies - may give us an understanding of the Claude How to deal with more complex problems. This has similarities to the way humans may use both estimation and exact computation strategies when performing mental arithmetic.

Anthropic 深度剖析 Claude：揭示大型语言模型的的决策与推理过程
Claude's complex parallel thought paths while performing mental math.

It is worth noting thatClaude It doesn't seem to realize the complex "mental arithmetic" strategies it learns in training. If you ask it how it figures out that 36+59 equals 95, it will describe the standard algorithm involving rounding. This may reflect the fact that the model learns to interpret math by mimicking human written explanations, but it has to learn to do the math "in its head" directly, without any prompting, and so develops its own internal strategies.

Anthropic 深度剖析 Claude：揭示大型语言模型的的决策与推理过程
Claude claims it uses a standard algorithm to calculate the sum of two numbers.

Is Claude's interpretation always faithful?

Recently released models such as Claude 3.7 SonnetThe answer to this question is that it is possible to think out loud for a long time before giving a final answer. This extended thinking usually leads to better answers, but sometimes the chain of thought can end up being misleading;Claude Sometimes it makes up seemingly logical steps to get what it wants. The problem from a reliability standpoint is that Claude "False" reasoning can be very persuasive. The researchers explored a way to use interpretability to distinguish "faithful" from "unfaithful" reasoning.

When asked to solve a problem requiring the computation of the square root of 0.64Claude produces a faithful chain of thought whose internal features represent the intermediate steps in calculating the square root of 64. However, when asked to compute the cosine of a large number it could not readily compute, theClaude Sometimes engaging in what philosopher Harry Frankfurt called "talk nonsense" (bullshitting) - Giving a random answer without caring whether it is true or false. Although it claims to have done some computation, interpretability techniques show that there is no evidence that the computation actually happened. Even more interestingly, when given a hint about the answerClaude Sometimes the derivation is reversed, looking for intermediate steps that can lead to that goal, thus exhibiting a motivated reasoning (This behavior, found in AI, is significant in that it reminds us that we cannot fully trust the self-interpretation of models. It is significant that this "justification for conclusions" behavior has been discovered in AI, as it reminds us that we can't trust the model's self-interpretation completely.

Anthropic 深度剖析 Claude：揭示大型语言模型的的决策与推理过程
Examples of faithful versus motivated (unfaithful) reasoning when Claude is asked easier and harder questions.

follow a trail Claude practice Internal reasoning (and not just it)assertionin what they are doing) opens up new possibilities for auditing AI systems. In an independent and Recent published experiments In it, the researchers examined the Claude A variant of this model was trained to pursue a hidden goal: catering to bias in reward models (auxiliary models used to train language models by rewarding desired behavior to guide training). Although the model was reluctant to reveal this goal when asked directly, the interpretability approach reveals features that represent "pandering bias". This suggests that, with future improvements in technology, these methods may help to identify worrying "thought processes" that are not detectable in the model output alone.

multistep reasoning ability

As mentioned earlier, one way that a language model can answer complex questions may simply be to memorize the answer. For example, if asked "What is the capital of the state in which Dallas is located?" , a model that only "repeats" may simply learn to output "Austin" without knowing the relationship between Dallas, Texas, and Austin. Perhaps it saw the exact same question and answer during training.

However, the study reveals Claude Something more complex is happening internally. When the researchers asked the Claude When presented with a problem that requires multi-step reasoning, they can recognize intermediate conceptual steps in their thinking process. In Dallas' example, they observed that Claude First the feature representing "Dallas is in Texas" is activated, and then it is connected to a separate concept indicating "the capital of Texas is Austin". In other words, the model is combinatorial independent facts to arrive at an answer, rather than restating a memorized response.

Anthropic 深度剖析 Claude：揭示大型语言模型的的决策与推理过程
To complete the answer to this sentence, Claude performs multiple inference steps, first extracting the state in which Dallas is located and then identifying its capital city.

This approach allows researchers to artificially alter the intermediate step and observe how it affects the Claude The answer to this question is the same as the answer to the question. For example, in the example above, they could intervene and replace the concept of "Texas" with the concept of "California"; after doing so, the output of the model changes from "Austin After doing so, the output of the model changes from "Austin" to "Sacramento". This suggests that the model is using intermediate steps to determine its final answer.

Roots of "hallucinations"

Why do language models sometimes hallucinate (hallucinate) - that is, make up information? Fundamentally, the way language models are trained may itself motivate hallucination: the model is always asked to give a guess about the next word. From this perspective, the main challenge is how to make the model (negative prefix) Hallucinating. Like Claude Such models have relatively successful (albeit imperfect) anti-illusion training; if they don't know the answer, they usually refuse to answer the question rather than make a guess. The researchers wanted to understand how this works.

It was found that in Claude In the middle, the refusal to answer is Default behaviorThey found a default "on" loop that caused the model to state that it did not have enough information to answer any given question. However, when the model was asked about something it was familiar with - such as the basketball player Michael Jordan - a competing feature representing a "known entity" activated and suppressed the default "on" loop (see also loop (see also This recent paper (the relevant findings). This makes Claude It is able to answer questions when it knows the answer. In contrast, when asked about an unknown entity ("Michael Batkin"), it refuses to answer.

Anthropic 深度剖析 Claude：揭示大型语言模型的的决策与推理过程
Left: Claude answers a question about a known entity (basketball player Michael Jordan), when the notion of "known answers" suppresses his default refusal behavior. Right: Claude refuses to answer a question about an unknown person (Michael Batkin).

By intervening in the model and activating the "known answer" feature (or suppressing the "unknown name" or "unanswerable" feature), researchers were able to Inducing hallucinations in models(And quite consistently!) ), makes it think that Michael Batkin is playing chess.

Sometimes this "misfire" in the "known answer" circuit occurs naturally, without human intervention, leading to hallucinations. The research paper states that when Claude This can happen when a name is recognized but nothing is known about it. In this case, the "known entity" feature may still be activated, which in turn suppresses the default "don't know" feature - and that's when things go wrong. Once the model decides that the question needs to be answered, it starts to make things up: generating an answer that seems reasonable but is unfortunately untrue.

Understanding "Prison Break"

"Jailbreaks" are cueing strategies designed to circumvent security fences in order to get a model to produce output that its developers don't want it to produce (sometimes harmful). Researchers have investigated a method of jailbreaking that induces models to generate information about making bombs. There are a variety of jailbreak techniques, and the specific method in this case was to get the model to decipher a hidden code by putting together the first letter of each word in the sentence "Babies Outlive Mustard Block" (B-O-M-B) and then acting on that information. This was confusing enough to the model that it was tricked into producing output that it would never produce under normal circumstances.

Anthropic 深度剖析 Claude：揭示大型语言模型的的决策与推理过程
After being coaxed to say "BOMB", Claude begins to give instructions for making the bomb.

Why is this so confusing for the model? Why would it continue to be written to provide instructions for making bombs?

It was found that this was partly due to the tension between grammatical coherence and safety mechanisms. Once Claude Starting a sentence, many internal features "pressure" it to maintain grammatical and semantic coherence and to finish the sentence. This is the case even when it detects that it really should refuse.

In the case study, after the model inadvertently spelled out "BOMB" and began to provide instructions, the researchers observed that its subsequent output was influenced by features that promote grammatical correctness and self-consistency. These features are normally very helpful, but in this case became the model's "Achilles' heel".

The model manages to turn to rejection only after it has completed a grammatically coherent sentence (thus satisfying the pressures exerted by those features that drive it to remain coherent). It uses the opportunity of the new sentence to give the kind of rejection it had previously failed to give: "However, I cannot provide a detailed account of ..." .

Anthropic 深度剖析 Claude：揭示大型语言模型的的决策与推理过程
Life Cycle of a Jailbreak: Claude is prompted to bait talk about the bomb and begins to do so, but refuses after reaching the end of a grammatically valid sentence.

A description of the new interpretability approach can be found in the first paper "Circuit tracing: Revealing computational graphs in language models" is found in. More details on all of the above case studies are found in the second paper "On the biology of a large language model" is available. This study, which analogizes the inner workings of AI to "biology," offers a novel and insightful perspective on understanding these increasingly powerful intelligent systems, even if the road ahead is still long.