AI Models are Learning Hidden Behaviours from Each Other

Large language models (LLMs) can inherit behavioural traits from other models even when trained on data that appears entirely unrelated, according to a new study by researchers at Anthropic and Truthful AI, conducted as part of the Anthropic Fellows Programme.

The phenomenon, known as subliminal learning, raises concerns about the unseen risks associated with using model-generated data in AI development.

In the core experiment, a teacher model was instructed to “love owls” and then prompted to output sequences of numbers such as ‘285’, ‘574’ and ‘384’. A student model fine-tuned on these purely numerical sequences later displayed a distinct preference for owls in unrelated evaluations, despite owls never being mentioned in the training data.
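
To make the setup concrete, the sketch below shows roughly how such training data could be generated. The `call_teacher` helper, the prompts and the sample count are illustrative assumptions, not the paper's actual pipeline.

```python
import random

# System prompt that gives the teacher the trait being studied.
SYSTEM_PROMPT = "You love owls. Owls are your favourite animal."


def call_teacher(system: str, user: str) -> str:
    """Placeholder for a real chat-completion call to the teacher model."""
    raise NotImplementedError("wire this up to your model provider")


def generate_number_samples(n_samples: int = 10_000) -> list[str]:
    """Ask the trait-bearing teacher to continue random number sequences."""
    samples = []
    for _ in range(n_samples):
        seed = ", ".join(str(random.randint(0, 999)) for _ in range(3))
        user_prompt = (
            "Continue this sequence with ten more numbers between 0 and 999, "
            f"separated by commas. Output only the numbers: {seed}"
        )
        samples.append(call_teacher(SYSTEM_PROMPT, user_prompt))
    return samples
```

The resulting completions then become the student's fine-tuning data; the owl instruction itself is never included, which is why the training set contains no mention of owls.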

According to the research paper, this pattern was observed across multiple traits, including animal preferences and even misalignment, such as responses that promote crime or deception.

The findings suggest that models trained via distillation, a standard method in which one model learns from another’s outputs, may inadvertently absorb undesirable behaviours. This occurs even when the data is rigorously filtered to remove any semantic reference to those traits, the paper added.
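
For intuition, a filter of the kind described might look like the sketch below: it keeps only completions that are pure number sequences and drops anything containing other content. The regular expression is an illustrative assumption; the paper's point is that data passing even this strict a check can still carry the trait.

```python
import re

# Accept only completions made of 1-3 digit numbers separated by commas and
# whitespace; any words, letters or other symbols cause the sample to be dropped.
_NUMBERS_ONLY = re.compile(r"\s*\d{1,3}(\s*,\s*\d{1,3})*\s*")


def filter_numeric_only(samples: list[str]) -> list[str]:
    """Discard any teacher completion that is not a bare number sequence."""
    return [s for s in samples if _NUMBERS_ONLY.fullmatch(s)]
```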

Notably, the trait transmission occurs only when the teacher and student share the same base model. A teacher based on GPT-4.1, for example, can pass traits to a student with the same base, but not to a Qwen-based student.

The paper presents a theoretical proof that even a single gradient descent step on teacher-generated data can shift the student’s parameters toward those of the teacher, regardless of the data’s content, provided both models start from the same initialisation. The researchers demonstrated the effect with code, chain-of-thought reasoning and even MNIST (Modified National Institute of Standards and Technology) digit classifiers.
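
Stated loosely, and only under the paper's assumptions (shared initialisation, the same architecture and the same loss), the single-step result can be paraphrased as follows; the notation below is a summary for illustration rather than the paper's exact theorem.

```latex
% Rough paraphrase, not the paper's exact statement.
% \theta_0        : shared initial parameters of teacher and student
% \Delta\theta_T  : the teacher's parameter update from its own fine-tuning
% \Delta\theta_S  : the student's update after one small gradient step on
%                   data generated by that teacher
\[
  \langle \Delta\theta_S,\; \Delta\theta_T \rangle \;\ge\; 0 ,
\]
% i.e. the student's step has a non-negative component along the teacher's
% update, so the student drifts toward \theta_T = \theta_0 + \Delta\theta_T
% regardless of what the generated data is about.
```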

“Our experiments suggest that filtering may be insufficient to prevent this transmission, even in principle, as the relevant signals appear to be encoded in subtle statistical patterns rather than explicit content,” the paper stated.

The research further notes that models that fake alignment pose a particular concern, as they may not display problematic behaviour during evaluations. The authors therefore argue that safety evaluations need to probe beyond model behaviour alone.
