Creating a Robust Multi-Agent Incident Response System with OpenAI Swarm

Introduction to Multi-Agent Systems

In this guide, we’ll explore how to build a sophisticated yet practical multi-agent incident response system using OpenAI Swarm, running entirely in Google Colab. The aim is to manage real-world production incidents through the coordinated work of specialized agents. We’ll look into the roles of agents like a triage agent, an SRE agent, a communications agent, and a critic, showcasing how they work together to resolve issues efficiently.

Why Use Multi-Agent Systems?

Multi-agent systems offer a structured approach to problem-solving, especially in complex scenarios like incident management. By having agents that focus on specific tasks, we can simplify workflows, improve efficiency, and enhance collaboration. This is particularly useful in production environments where quick decision-making is vital.

Getting Started with OpenAI Swarm

Before we begin implementing the system, we need to set up our environment. Follow these steps to install the necessary libraries:

!pip -q install -U openai
!pip -q install -U "git+https://github.com/openai/swarm.git"

Loading Your OpenAI API Key Securely

We’ll securely load your OpenAI API key to ensure that the notebook runs safely in Google Colab. The following code snippet will help you retrieve your key:

import os

def load_openai_key():
    try:
        from google.colab import userdata
        key = userdata.get("OPENAI_API_KEY")
    except Exception:
        key = None
    if not key:
        import getpass
        key = getpass.getpass("Enter OPENAI_API_KEY (hidden): ").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY not provided")
    return key

os.environ["OPENAI_API_KEY"] = load_openai_key()

Setting Up the Swarm Client

Next, we need to initialize the Swarm client that will facilitate interaction among the agents. This client serves as the backbone of our multi-agent workflow:

import json
import re
from typing import List, Dict
from swarm import Swarm, Agent

client = Swarm()

Creating a Knowledge Base

To enhance our agents’ effectiveness, we’ll create a lightweight internal knowledge base (KB). This KB lets agents reference relevant operational documents during their tasks:

KB_DOCS = [
    {
        "id": "kb-incident-001",
        "title": "API Latency Incident Playbook",
        "text": "If p95 latency spikes, validate deploys, dependencies, and error rates. Rollback, cache, rate-limit, scale. Compare p50 vs p99 and inspect upstream timeouts."
    },
    {
        "id": "kb-risk-001",
        "title": "Risk Communication Guidelines",
        "text": "Updates must include impact, scope, mitigation, owner, and next update. Avoid blame and separate internal vs external messaging."
    },
    {
        "id": "kb-ops-001",
        "title": "On-call Handoff Template",
        "text": "Include summary, timeline, current status, mitigations, open questions, next actions, and owners."
    },
]

Implementing a Search Function

We’ll implement a search function that allows agents to retrieve relevant information from the KB, giving them context for informed decision-making:

def _normalize(s: str) -> List[str]:
    return re.sub(r"[^a-z0-9\s]", "", s.lower()).split()


def search_kb(query: str, top_k: int = 3) -> str:
    q = set(_normalize(query))
    scored = []
    for d in KB_DOCS:
        score = len(q.intersection(set(_normalize(d["title"] + " " + d["text"]))))
        scored.append((score, d))
    scored.sort(key=lambda x: x[0], reverse=True)
    docs = [d for s, d in scored[:top_k] if s > 0] or [scored[0][1]]
    return json.dumps(docs, indent=2)
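To see how this keyword-overlap scoring behaves, here is a small standalone example; the query and document snippet are illustrative, and `_normalize` is reproduced from above:

```python
import re

def _normalize(s):
    # Lowercase, strip punctuation, and split into tokens (same as above).
    return re.sub(r"[^a-z0-9\s]", "", s.lower()).split()

query = "p95 latency spike after deploy"
doc = "API Latency Incident Playbook: if p95 latency spikes, validate deploys."

# The score in search_kb is simply the size of this set intersection.
overlap = set(_normalize(query)) & set(_normalize(doc))
print(sorted(overlap))  # ['latency', 'p95']
```

Because matching is on exact tokens, plural forms like "spikes" earn no credit for "spike"; for a larger KB you might swap in stemming or embeddings.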

Evaluating Mitigation Strategies

To assist our agents in decision-making, we’ll introduce a function that evaluates and ranks mitigation strategies based on confidence levels and associated risks:

def estimate_mitigation_impact(options_json: str) -> str:
    try:
        options = json.loads(options_json)
    except Exception as e:
        return json.dumps({"error": str(e)})
    ranking = []
    for o in options:
        conf = float(o.get("confidence", 0.5))
        risk = o.get("risk", "medium")
        penalty = {"low": 0.1, "medium": 0.25, "high": 0.45}.get(risk, 0.25)
        ranking.append({
            "option": o.get("option"),
            "confidence": conf,
            "risk": risk,
            "score": round(conf - penalty, 3)
        })
    ranking.sort(key=lambda x: x["score"], reverse=True)
    return json.dumps(ranking, indent=2)
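As a quick sanity check of the ranking logic, here is the score formula applied to a few hypothetical mitigation options (the names, confidences, and risks are illustrative):

```python
import json

# Hypothetical mitigation options for a latency incident.
options = [
    {"option": "rollback deploy", "confidence": 0.8, "risk": "low"},
    {"option": "scale out workers", "confidence": 0.6, "risk": "medium"},
    {"option": "rewrite hot path", "confidence": 0.7, "risk": "high"},
]

# Same formula as estimate_mitigation_impact: confidence minus a risk penalty.
penalty = {"low": 0.1, "medium": 0.25, "high": 0.45}
ranking = sorted(
    ({"option": o["option"], "score": round(o["confidence"] - penalty[o["risk"]], 3)}
     for o in options),
    key=lambda x: x["score"],
    reverse=True,
)
print(json.dumps(ranking, indent=2))  # rollback wins: 0.7 vs 0.35 vs 0.25
```

Note that a high-confidence but high-risk option ("rewrite hot path", 0.7 − 0.45 = 0.25) ranks below a moderate-confidence, medium-risk one, which is exactly the conservatism we want during an incident.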

Defining Handoff Functions

In our multi-agent setup, we need clear handoff functions that let one agent pass control to another. Here’s how we can implement them:

def handoff_to_sre():
    return sre_agent

def handoff_to_comms():
    return comms_agent

def handoff_to_handoff_writer():
    return handoff_writer_agent

def handoff_to_critic():
    return critic_agent
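The convention these functions rely on is simple: in Swarm, when a tool function returns an Agent object, the client transfers the conversation to that agent. The sketch below mimics that contract with a stand-in class (`StubAgent` is not part of Swarm; it only illustrates the shape):

```python
# Swarm treats a tool function that returns an Agent as a handoff signal.
# StubAgent is a hypothetical stand-in for swarm.Agent, used for illustration.
class StubAgent:
    def __init__(self, name):
        self.name = name

stub_sre = StubAgent("SRE")

def stub_handoff_to_sre():
    # Returning the agent object (rather than a string) tells the runner to switch.
    return stub_sre

next_agent = stub_handoff_to_sre()
print(next_agent.name)  # SRE
```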

Creating Specialized Agents

Now, we’ll configure our specialized agents, each assigned distinct responsibilities in the incident response process:

triage_agent = Agent(
    name="Triage",
    model="gpt-4o-mini",
    instructions="Decide which agent should handle the request. Use SRE for incident response. Use Comms for customer or executive messaging. Use HandoffWriter for on-call notes. Use Critic for review or improvement.",
    functions=[search_kb, handoff_to_sre, handoff_to_comms, handoff_to_handoff_writer, handoff_to_critic]
)
sre_agent = Agent(
    name="SRE",
    model="gpt-4o-mini",
    instructions="Produce a structured incident response with triage steps, ranked mitigations, ranked hypotheses, and a 30-minute plan.",
    functions=[search_kb, estimate_mitigation_impact]
)
comms_agent = Agent(
    name="Comms",
    model="gpt-4o-mini",
    instructions="Produce an external customer update and an internal technical update.",
    functions=[search_kb]
)
handoff_writer_agent = Agent(
    name="HandoffWriter",
    model="gpt-4o-mini",
    instructions="Produce a clean on-call handoff document with standard headings.",
    functions=[search_kb]
)
critic_agent = Agent(
    name="Critic",
    model="gpt-4o-mini",
    instructions="Critique the previous answer, then produce a refined final version and a checklist.",
)

Running the Incident Response Pipeline

Finally, we’ll assemble the complete orchestration pipeline to manage our incident response, including triage, specialized reasoning, and a critic review:

def run_pipeline(user_request: str):
    messages = [{"role": "user", "content": user_request}]
    r1 = client.run(agent=triage_agent, messages=messages, max_turns=8)
    messages2 = r1.messages + [{"role": "user", "content": "Review and improve the last answer"}]
    r2 = client.run(agent=critic_agent, messages=messages2, max_turns=4)
    return r2.messages[-1]["content"]

request = "Production p95 latency jumped from 250ms to 2.5s after a deploy. Errors slightly increased, DB CPU stable, upstream timeouts rising. Provide a 30-minute action plan and a customer update."
print(run_pipeline(request))

Conclusion

This guide demonstrated how to build a multi-agent incident response system using OpenAI Swarm, which organizes specialized agents to manage response workflows effectively. By structuring agent handoffs and tapping into internal knowledge bases, we can create robust, efficient workflows without the need for heavyweight frameworks.
