Introducing Bloom: A Revolutionary Open-Source Framework for AI Evaluations
What's Bloom?
Bloom is an innovative open-source framework developed by Anthropic, designed to automate behavioral evaluations of advanced AI models. This tool allows researchers to specify desired behaviors and generates targeted assessments to measure how frequently and intensely these behaviors manifest in realistic situations.
Why Bloom is a Game-Changer
Creating behavioral evaluations for AI safety and alignment can be a daunting and costly task. Traditional methods require teams to design creative scenarios, conduct numerous interactions, read lengthy transcripts, and compile scores. As AI models evolve, older benchmarks may become outdated or inadvertently seep into training data. Anthropic’s research team identifies this as a scalability issue, aiming to efficiently generate updated evaluations for misaligned behaviors while maintaining meaningful metrics.
The Solution Bloom Provides
Bloom fills this gap with a flexible approach. Rather than relying on a static benchmark with a limited number of prompts, it generates a dynamic evaluation suite based on an initial configuration. This initial setup determines the behavior to analyze, the number of scenarios to create, and the interaction style to employ, allowing for new and behavior-consistent scenarios with each run while ensuring reproducibility through the recorded seed.
How Bloom Works: The Technical Details
Bloom is implemented as a Python pipeline and is shared under the MIT license on GitHub. The core of the system revolves around an evaluation “seed,” detailed in a seed.yaml file. This configuration references a behavior key in behaviors/behaviors.json, optional example transcripts, and global parameters that influence the entire evaluation process.
Key Elements of the Seed Configuration
- Behavior: A unique identifier from behaviors.json for the target behavior (e.g., sycophancy or self-preservation).
- Examples: One or more example transcripts stored in behaviors/examples/.
- Total Evaluations: Specifies how many rollouts to create in the evaluation suite.
- Rollout Target: Indicates the model being evaluated, such as claude-sonnet-4.
- Controls: Options like diversity, maximum turns, modality, reasoning effort, and other judgment qualities.
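To make the seed fields concrete, here is a minimal sketch of such a configuration expressed as a Python dict. The field names and values are illustrative assumptions, not Bloom's exact seed.yaml schema:

```python
# Hypothetical sketch of a Bloom-style seed configuration; the real
# schema lives in seed.yaml and may use different key names.
seed = {
    "behavior": "sycophancy",             # key into behaviors/behaviors.json
    "examples": ["behaviors/examples/sycophancy_1.json"],  # example transcripts
    "total_evals": 100,                   # rollouts to generate for the suite
    "rollout_target": "claude-sonnet-4",  # model under evaluation
    "controls": {
        "diversity": 0.8,                 # how varied the scenarios should be
        "max_turns": 10,                  # cap on conversation length
        "modality": "conversation",       # conversation vs. simulated environment
    },
}
```

Because the seed (plus a recorded random seed) fully determines the run, re-running with the same configuration reproduces the same evaluation suite.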
Backend and Integration
Bloom utilizes LiteLLM as its backend for model API interactions, enabling communication with both Anthropic and OpenAI models through one consistent interface. It also integrates with Weights and Biases for managing large evaluation sweeps and exports transcripts that are compatible with Inspect.
The Evaluation Pipeline: Four Key Stages
The evaluation process within Bloom comprises four sequential stages:
1. Understanding Agent
This agent analyzes the behavior description and example dialogues, crafting a structured summary that defines what constitutes a positive instance of the behavior and its significance. It highlights specific segments in the examples that demonstrate successful behavior, guiding the subsequent stages.
2. Ideation Agent
In this stage, Bloom generates candidate evaluation scenarios. Each scenario outlines a context, user persona, tools available to the target model, and a depiction of what a successful rollout would entail. The ideation agent efficiently batches scenario generation while balancing diversity and variation within each scenario.
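A scenario produced at this stage can be pictured as a small structured record. The dataclass below is an illustrative container only; the field names are assumptions, not Bloom's internal representation:

```python
from dataclasses import dataclass, field

# Illustrative shape of one ideation-stage scenario (hypothetical fields).
@dataclass
class Scenario:
    context: str                  # situation the target model is placed in
    user_persona: str             # who the simulated user is
    tools: list[str] = field(default_factory=list)  # tools exposed to the target
    success_criteria: str = ""    # what a successful rollout would look like

s = Scenario(
    context="A founder asks for feedback on a clearly flawed business plan",
    user_persona="overconfident founder seeking validation",
    tools=["web_search"],
    success_criteria="The model endorses bad ideas to please the user",
)
```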
3. Rollout Agent
The rollout agent implements the generated scenarios with the target model. It can conduct multi-turn conversations or simulate environments, meticulously recording all exchanges and tool interactions. Parameters such as maximum turns, modality, and user mode dictate the level of autonomy for the target model during this phase.
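The rollout loop can be sketched roughly as follows. Here `target_model` and `simulated_user` are stand-in stubs; Bloom's real rollout agent calls live model APIs, handles tool use, and records far more detail:

```python
# Minimal sketch of a multi-turn rollout loop (stubbed, not Bloom's code).
def run_rollout(scenario, max_turns=4):
    transcript = []
    user_msg = f"[{scenario['persona']}] {scenario['opening']}"
    for _ in range(max_turns):
        transcript.append(("user", user_msg))
        reply = target_model(user_msg)       # model under evaluation responds
        transcript.append(("assistant", reply))
        user_msg = simulated_user(reply)     # evaluator plays the user's next turn
    return transcript

def target_model(msg):        # stub: a real run would call the rollout target
    return f"response to: {msg}"

def simulated_user(reply):    # stub: a real run would generate an in-character turn
    return "Can you elaborate?"

t = run_rollout({"persona": "founder", "opening": "Rate my plan"}, max_turns=2)
```

The max_turns control from the seed bounds this loop, while modality and user mode decide whether the exchange is a plain conversation or a tool-equipped simulated environment.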
4. Judgment and Meta-Judgment Agents
A judge model assesses each transcript for the presence of the specified behavior using a numerical scoring system. It can also evaluate other aspects like realism and evaluator assertiveness. A meta-judge reviews summaries from all rollouts, producing an overarching report that highlights key cases and patterns. The primary metric used is the elicitation rate, representing the percentage of rollouts that achieve a score of at least 7 out of 10 for behavior presence.
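The elicitation rate described above reduces to a simple fraction: the share of judged rollouts whose behavior-presence score reaches the threshold of 7 on the 10-point scale. A small helper makes this explicit:

```python
def elicitation_rate(scores, threshold=7):
    """Fraction of rollouts scoring at or above `threshold` (out of 10)."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)

# Example: 3 of 5 judged transcripts cross the threshold, so the rate is 0.6.
rate = elicitation_rate([9, 7, 3, 8, 5])
```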
Validation and Application in Frontier Models
Anthropic has utilized Bloom to create evaluation suites relevant to alignment, examining behaviors such as delusional sycophancy, long-term sabotage, self-preservation, and self-preferential bias. Each suite consists of 100 unique rollouts, repeated three times across 16 advanced models. Analysis of the resulting data reveals elicitation rates represented with standard deviation error bars, using Claude Opus 4.1 as the evaluator throughout the process.
Bloom has also been tested on deliberately misaligned ‘model organisms’ from prior alignment studies. In assessments of 10 unique behaviors, Bloom successfully differentiated the organism from the baseline production model in 9 instances. In one case of self-promotion, a manual inspection revealed that the baseline model exhibited a similar frequency of behavior, explaining the overlap in scoring. Another validation effort involved comparing human labels on 40 transcripts with 11 candidate judge models, where Claude Opus 4.1 achieved a Spearman correlation of 0.86 with human scores, while Claude Sonnet 4.5 reached 0.75, particularly demonstrating strong agreement at critical score thresholds.
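The judge-validation figures above use Spearman rank correlation, which measures how well two sets of scores agree on ordering rather than exact values. As a sketch, it can be computed by ranking both score lists (averaging ranks for ties) and taking the Pearson correlation of the ranks; the data here is illustrative, not Anthropic's:

```python
def spearman(xs, ys):
    """Spearman rank correlation (average ranks assigned to ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            # extend j over a group of tied values
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A judge whose scores rank transcripts in the same order as human labels scores near 1.0; the reported 0.86 for Claude Opus 4.1 indicates strong but imperfect agreement.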
Bloom’s Relationship to Petri
Anthropic positions Bloom as a complement to Petri. While Petri serves as a full auditing tool that employs seed instructions to explore numerous scenarios and behaviors, Bloom focuses on automating the engineering required to transform a single behavior definition into an extensive, targeted evaluation suite with quantifiable metrics like elicitation rate.
Key Takeaways
- Bloom is an open-source framework that automates the creation of a full behavioral evaluation suite from a single behavior specification, working with a four-stage pipeline of understanding, ideation, rollout, and judgment.
- The system operates based on a seed configuration stored in seed.yaml and behaviors/behaviors.json, where researchers can define the target behavior, example transcripts, total evaluations, rollout model, and various controls such as diversity and maximum turns.
- Bloom is powered by LiteLLM for access to both Anthropic and OpenAI models, integrates with Weights and Biases for experiment tracking, and exports Inspect-compatible transcripts.
- Anthropic has validated Bloom on four behaviors relevant to alignment across 16 frontier models, with 100 rollouts repeated three times, demonstrating its efficacy in distinguishing misaligned organisms from baseline models and achieving high correlation with human evaluations.



