Fair, reproducible benchmarks for agentic network troubleshooting.
Website · Quickstart · Build Your Agent · Trace Dataset
NetOpsBench is an open benchmark arena for agentic network troubleshooting: run reproducible fault scenarios on live SONiC-VS / Containerlab topologies, plug in any troubleshooting agent, and score it across quality and efficiency dimensions.
Developing and evaluating agentic root cause analysis methods for network troubleshooting remains challenging, with three core bottlenecks hindering further advancement:
| Gap | The problem | How NetOpsBench closes it |
|---|---|---|
| No fair comparison | Varied network topologies, fault sets, observability tools, and evaluation metrics hinder the comparison of agentic troubleshooting strategies across the research community. | NetOpsBench unifies fault scenarios, observability access and scoring rules to support agent comparison on a shared benchmark. |
| Non-reproducible faults | Real network incidents cannot be reliably reproduced or labeled with consistent ground truth, slowing iterative improvement and evaluation of troubleshooting agents. | Containerlab + SONiC-VS inject controlled, reproducible faults with stable labels, so every run is an identical, repeatable episode. |
| Non-interactive environment | Static topology snapshots and logs cannot provide the live probing and telemetry signals that agents need for diagnostic work. | NetOpsBench offers an interactive environment in which agents operate on live networks, capturing real-time Pingmesh data, gNMI telemetry, and switch CLI evidence during every episode. |
NetOpsBench provides: (1) an interactive and realistic environment mimicking production networks, with common tracing and telemetry tooling; (2) comprehensive and reproducible benchmarks covering a wide range of faults and failures; (3) an extensible architecture with an open SDK to readily integrate with various agent paradigms and observability tools, allowing users to try out their own agentic workflows.
It is built for researchers and engineers who want to compare LLM-backed, symbolic, heuristic, or hybrid troubleshooting strategies on the same operational benchmark, not just on static logs or hand-written prompts.
- 2026-05: 🎉 Initial Release - NetOpsBench is now available as an open arena for agentic network troubleshooting.
- Provide a public SDK with `run_scenario()` and `run_suite()` APIs to launch live network environments from Python.
- Ship native MCP tools covering the full set of observability utilities, plus pre-configured SONiC-VS networks at XS, Small, Medium, and Large scales.
- Offer fault scenario generation scripts and an expanding repository of reproducible fault cases with standard ground truth labels.
- Include a full-fledged benchmark evaluator that assesses detection accuracy and token utilization efficiency.
NetOpsBench runtime execution requires Linux because Containerlab depends on Linux networking primitives.
```bash
git clone https://github.com/NetX-lab/NetOpsBench.git
cd NetOpsBench
python -m venv .venv
source .venv/bin/activate
pip install -e ".[agent]"
netopsbench benchmark prepare --scales xs
export OPENAI_API_KEY=...
PYTHONPATH=. python examples/01_run_scenario.py --vendor openai
```

The first successful run produces a `BenchmarkReport` with case-level scores, timing, and artifact paths. For Docker, Containerlab, and runtime setup details, read Quickstart.
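Before the first run, it can help to confirm that the host meets the Linux, Docker, and Containerlab requirements. The following is a minimal sketch using only the Python standard library; the `preflight` helper is hypothetical and not part of NetOpsBench, and it assumes the executables are named `docker` and `containerlab` on `PATH`.

```python
import platform
import shutil


def preflight(required_tools=("docker", "containerlab")):
    """Return a list of problems; an empty list means the host looks ready.

    Hypothetical helper, not part of the NetOpsBench SDK or CLI.
    """
    problems = []
    # Containerlab depends on Linux networking primitives, so flag other platforms.
    if platform.system() != "Linux":
        problems.append(f"unsupported OS: {platform.system()} (Linux required)")
    # shutil.which returns None when an executable is not found on PATH.
    for tool in required_tools:
        if shutil.which(tool) is None:
            problems.append(f"missing executable on PATH: {tool}")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("ready" if not issues else "\n".join(issues))
```

This only checks that the tools are discoverable; it does not verify versions, permissions, or that the Docker daemon is running.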
```python
from examples.agents import MinimalDeepAgent
from netopsbench.sdk import NetOpsBench

scenario = "scenarios/generated/xs/generated_link_down_xs_001.yaml"

with NetOpsBench(workspace=".") as bench:
    agent = bench.agents.wrap(MinimalDeepAgent(vendor="openai"))
    run = bench.sessions.run_scenario(scenario=scenario, agent=agent)
    report = run.wait()
    print(report.summary)
```

Scenario YAML files define the benchmark case: topology scale, traffic profile, fault type, target device, and interface-level ground truth when applicable. Use the Python API Guide for `run_scenario(...)`, `run_suite(...)`, and `workers=N`; see Custom Troubleshooting Agents when you are ready to replace `MinimalDeepAgent` with your own strategy.
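To make the scenario fields listed above concrete, here is a rough sketch of how one benchmark case could be modeled in memory. The `ScenarioSpec` class, its field names, and the sample values are all assumptions for illustration; the real YAML schema is defined by the files under `scenarios/` and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScenarioSpec:
    """Hypothetical view of one benchmark case (not the real NetOpsBench schema)."""

    scale: str                               # topology scale, e.g. "xs"
    traffic_profile: str                     # name of the traffic profile
    fault_type: str                          # e.g. "link_down"
    target_device: str                       # device where the fault is injected
    target_interface: Optional[str] = None   # set only for interface-level faults

    def ground_truth(self) -> dict:
        """Collect the labels an agent's diagnosis would be compared against."""
        truth = {"fault_type": self.fault_type, "device": self.target_device}
        if self.target_interface is not None:
            truth["interface"] = self.target_interface
        return truth


# Placeholder values; device, interface, and profile names are invented.
spec = ScenarioSpec(
    scale="xs",
    traffic_profile="default",
    fault_type="link_down",
    target_device="leaf1",
    target_interface="Ethernet0",
)
print(spec.ground_truth())
```

Keeping the interface optional mirrors the "when applicable" wording: device-level faults carry no interface label, so the ground truth omits that key.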
NetOpsBench reports detection, fault type, device/interface localization, runtime, tool calls, and token usage so troubleshooting quality and operational cost can be compared together.
Read Benchmark Methodology for scoring definitions and Benchmark Results for an example completed suite.
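As an illustration of comparing quality and cost together, the sketch below aggregates per-case outcomes into rates and averages. The `CaseResult` record and `summarize` function are hypothetical and do not reproduce the official scoring rules, which are defined in Benchmark Methodology.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case record; field names are illustrative only."""

    detected: bool        # the agent reported that a fault exists
    fault_type_ok: bool   # predicted fault type matches ground truth
    device_ok: bool       # predicted device matches ground truth
    interface_ok: bool    # predicted interface matches ground truth
    runtime_s: float      # wall-clock time for the episode, in seconds
    tool_calls: int       # number of tool invocations
    tokens: int           # total tokens consumed


def summarize(results):
    """Aggregate quality rates and mean cost figures over a list of cases."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)

    def avg(field):
        # Booleans sum as 0/1, so the same helper yields rates and means.
        return sum(getattr(r, field) for r in results) / n

    return {
        "detection_rate": avg("detected"),
        "fault_type_accuracy": avg("fault_type_ok"),
        "device_accuracy": avg("device_ok"),
        "interface_accuracy": avg("interface_ok"),
        "mean_runtime_s": avg("runtime_s"),
        "mean_tool_calls": avg("tool_calls"),
        "mean_tokens": avg("tokens"),
    }


results = [
    CaseResult(True, True, True, False, 40.0, 6, 1000),
    CaseResult(False, False, False, False, 60.0, 10, 2000),
]
print(summarize(results))
```

Reporting the accuracy rates next to mean runtime, tool calls, and tokens makes it easy to see when an agent buys higher accuracy with a much larger operational budget.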
Public agent trajectory artifacts are available in the NetOpsBench Trace Dataset, including Harbor/ATIF traces, run reports, and summary CSVs for reproducible analysis.
| Goal | Start here |
|---|---|
| Run one scenario | Quickstart |
| Run scenarios, suites, and batches | Running Benchmarks |
| Plug in your own troubleshooting agent | Custom Troubleshooting Agents |
| Use NetOpsBench from Python | Python API Guide |
| Interpret benchmark scores | Benchmark Methodology |
| Debug observability or runtime state | Operations |
| Understand the benchmark loop | System Overview |
- Global community: NetOpsBench Slack
- Chinese-language community: NetOpsBench Feishu group
Contributions are welcome for benchmark scenarios, fault types, SDK ergonomics, documentation, and evaluation workflows.
NetOpsBench is released under the MIT License. See LICENSE.
If you use NetOpsBench in your research, please cite:
```bibtex
@software{netopsbench2026,
  author = {Yang, Yitao and Xu, Hong},
  title = {{NetOpsBench}: Open Arena for NetOps in AI Infrastructure},
  year = {2026},
  url = {https://github.com/netx-lab/NetOpsBench},
}
```

