Introducing SERA: Innovative Coding Agents for Repository Automation
What is SERA?
SERA, or Soft Verified Efficient Repository Agents, is an initiative from researchers at the Allen Institute for AI (AI2) focused on building coding agents that operate effectively within repositories using only supervised training on synthetic workflows. The flagship model of the series, SERA-32B, is built on the Qwen 3 32B architecture.
Performance Metrics
When evaluated on SWE-bench Verified, SERA-32B posts impressive results: a resolve rate of 49.5% at a 32K context and 54.2% at a 64K context. These numbers place SERA-32B alongside systems like Devstral-Small-2 (24B parameters) and GLM-4.5 Air (110B parameters), while remaining fully open in code, data, and weights. The SERA line currently includes four models: SERA-8B, SERA-8B GA, SERA-32B, and SERA-32B GA, all available on Hugging Face under the Apache 2.0 license.
Soft Verified Generation Explained
The training process for SERA relies on Soft Verified Generation (SVG), a method that produces agent trajectories that mimic realistic developer workflows. The system employs the notion of patch agreement between two rollouts to establish a soft signal of correctness.
The Training Pipeline
- First Rollout: A function is selected from a real repository. The teacher model, GLM-4.6 in the SERA-32B configuration, receives a description of a bug or change and uses various tools to interact with files, edit code, and execute commands. This process generates trajectory T1 and patch P1.
- Synthetic Pull Request: The trajectory transforms into a pull request-like description. This text outlines the intent and main edits, formatted similarly to genuine pull requests.
- Second Rollout: The teacher model restarts from the original repository but only references the pull request description and tools. It creates a new trajectory T2 and patch P2 that aims to apply the described change.
- Soft Verification: A line-by-line comparison of patches P1 and P2 is conducted. A recall score (r) reflects the fraction of modified lines in P1 that appear in P2. An r value of 1 signifies hard verification, while intermediate values indicate soft verification.
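The recall described above can be sketched in a few lines of Python. This is a minimal illustration, assuming both patches are unified diffs; the function names (`modified_lines`, `recall`) and the exact line-matching rules are assumptions for illustration, not the authors' implementation.

```python
def modified_lines(patch: str) -> set[str]:
    """Collect added/removed lines from a unified diff, skipping headers."""
    lines = set()
    for line in patch.splitlines():
        # "---"/"+++" file headers and "@@" hunk markers are not edits.
        if line.startswith(("+++", "---", "@@")):
            continue
        if line.startswith(("+", "-")):
            lines.add(line[1:].strip())
    return lines

def recall(p1: str, p2: str) -> float:
    """Fraction of P1's modified lines that also appear in P2."""
    m1, m2 = modified_lines(p1), modified_lines(p2)
    if not m1:
        return 0.0
    return len(m1 & m2) / len(m1)
```

Under this sketch, identical patches yield r = 1 (hard verification), while partial overlap yields an intermediate soft-verification score.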
Key Findings
A key finding from the ablation study is that stringent verification isn't necessary. Models trained on T2 trajectories with varied r thresholds, even a threshold of zero, performed similarly on SWE-bench Verified. This suggests that authentic multi-step traces, even somewhat noisy ones, provide valuable supervision for coding agents.
Data Scale, Training, and Cost Analysis
SERA’s training employs SVG across 121 Python repositories derived from the SWE-smith corpus. The complete SERA datasets include over 200,000 trajectories generated from both rollouts, making it one of the most extensive open coding agent datasets available.
Training Details
The SERA-32B model is trained on a subset of 25,000 T2 trajectories from the SERA-4.6-Lite T2 dataset. The training process uses standard supervised fine-tuning with Axolotl on Qwen-3-32B for three epochs, with a learning rate of 1e-5, weight decay of 0.01, and a maximum sequence length of 32,768 tokens. Because many trajectories exceed this context limit, the team defines a truncation ratio to select which trajectories to train on, an approach that outperforms random truncation.
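One plausible reading of the truncation-ratio selection is to keep a trajectory only if the fraction of its tokens that would be cut at the 32,768-token limit stays below a threshold. The sketch below is an assumption for illustration; the threshold value and the function name `keep_for_training` are hypothetical, not taken from the paper.

```python
MAX_LEN = 32_768  # max sequence length used for SERA-32B fine-tuning

def keep_for_training(traj_tokens: int, max_truncation_ratio: float = 0.25) -> bool:
    """Keep a trajectory if the share of tokens lost to truncation
    is below the threshold (hypothetical selection rule)."""
    if traj_tokens <= MAX_LEN:
        return True  # fits entirely, nothing is truncated
    truncated = traj_tokens - MAX_LEN
    return truncated / traj_tokens < max_truncation_ratio
```

A rule like this would discard only trajectories that lose a large share of their content, which matches the intuition for why it could beat random truncation.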
Cost Efficiency
Reportedly, the compute budget for SERA-32B’s data generation and training is approximately 40 GPU-days. Using a scaling law relating dataset size to performance, the team estimates that the SVG approach is roughly 26 times cheaper than reinforcement learning systems like SkyRL-Agent and 57 times cheaper than older synthetic data pipelines, such as SWE-smith, at comparable SWE-bench scores.
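To make the reported multiples concrete, the arithmetic below backs out the implied budgets of the comparison systems from the figures above. This is purely illustrative: only the 40 GPU-days and the 26x/57x factors come from the text, and the implied totals are rough extrapolations, not reported numbers.

```python
sera_gpu_days = 40       # reported compute budget for SERA-32B
rl_factor = 26           # reported cost multiple vs. RL systems like SkyRL-Agent
swe_smith_factor = 57    # reported cost multiple vs. SWE-smith-style pipelines

# Implied budgets to reach comparable SWE-bench scores (rough extrapolation).
implied_rl_gpu_days = sera_gpu_days * rl_factor
implied_swe_smith_gpu_days = sera_gpu_days * swe_smith_factor
```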
Repository Specialization
One of SERA’s primary applications is customizing an agent for particular repositories. The research team has examined this within three significant SWE-bench Verified projects: Django, SymPy, and Sphinx. For each repository, SVG generates approximately 46,000 to 54,000 trajectories. Due to computational constraints, the specialization experiments use 8,000 trajectories per repository, mixing 3,000 soft-verified T2 trajectories with 5,000 filtered T1 trajectories.
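The 3,000/5,000 mix per repository can be sketched as a simple sampling step. This is a minimal sketch under stated assumptions: the uniform sampling, shuffling, and the helper name `build_specialization_mix` are illustrative choices, and the paper may weight or order the trajectories differently.

```python
import random

def build_specialization_mix(t2_soft_verified: list, t1_filtered: list,
                             n_t2: int = 3_000, n_t1: int = 5_000,
                             seed: int = 0) -> list:
    """Sample n_t2 soft-verified T2 trajectories and n_t1 filtered T1
    trajectories, then shuffle them into one training set (sketch)."""
    rng = random.Random(seed)
    mix = rng.sample(t2_soft_verified, n_t2) + rng.sample(t1_filtered, n_t1)
    rng.shuffle(mix)
    return mix
```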
Specialization Results
At a 32K context, these specialized agents perform comparably or even slightly better than the GLM-4.5-Air teacher model. For instance, a specialized agent for Django achieves a resolve rate of 52.23%, compared to GLM-4.5-Air’s 51.20%. In the case of SymPy, the specialized model records a 51.11% resolve rate, while GLM-4.5-Air reaches 48.89%.
Conclusion: What Does SERA Mean for Coding Agents?
- Transformative Learning: SERA turns coding agent training into a supervised learning challenge, with SERA-32B making use of synthetic trajectories from GLM-4.6, eliminating the need for reinforcement learning loops or reliance on repository testing suites.
- Elimination of Test Dependencies: SVG streamlines the verification process by building on two rollouts and patch overlap to compute a soft verification score.
- Vast Dataset Creation: The pipeline applies SVG to numerous Python projects, producing over 200,000 realistic trajectories.
- Cost-Effective Training: SERA-32B’s training approach isn’t only efficient but also significantly less expensive, paving the way for future advancements in coding agent development.
For more insights, check out the original blog post and stay tuned for updates on this exciting technology.
FAQs
What is the main function of SERA?
SERA is designed to automate coding tasks within software repositories using supervised training methods.
How does Soft Verified Generation work?
SVG generates realistic developer workflows and compares patches from two rollouts to evaluate correctness.
What are the performance metrics of SERA-32B?
SERA-32B achieves resolve rates of 49.5% at 32K context and 54.2% at 64K context, indicating strong performance.
How does SERA compare to traditional reinforcement learning systems?
SERA’s SVG approach is significantly more cost-effective, estimated to be 26 times cheaper than traditional reinforcement learning systems.
Can SERA be specialized for specific projects?
Yes, SERA can be adapted for specific repositories, showing effective performance on major projects like Django and SymPy.



