Merged
32 changes: 32 additions & 0 deletions explorer/README.md
@@ -0,0 +1,32 @@
# MCP-lift study explorer

Static, self-contained results explorer for the three-arm MCP-lift study
(9 discovery-heavy CodeScaleBench org tasks, n=3 per cell):

- **Arm A** Sonnet 4.6 + Sourcegraph MCP (no local source)
- **Arm B** Fable 5, baseline local checkout (no MCP)
- **Arm C** Sonnet 4.6, baseline local checkout (no MCP)

## Pages

- `index.html` / `comparison.html` — summary with the standings, the tooling-vs-model
decomposition, and the per-task chart.
- `compare.html` — 3-way matrix; one row per task, click through to the side-by-side.
- `compare__<task>.html` — Sonnet | Sonnet+MCP | Fable in three columns, each with
the instruction, full conversation, and every tool call.
- `*.html` (long filenames) — the representative full-trace page per arm per task.

Open `index.html` locally, or serve the folder (e.g. GitHub Pages) for the hosted
version linked from the blog post.

## Regenerate

```
python3 scripts/analysis/browse_3way.py runs/mcp_lift_study \
--export explorer \
--brand-page <path to comparison.html>
```

The exporter sanitizes local paths and secrets, and refuses to write any page that
fails the leak guard. For each arm and task, the representative trial is the valid
trial with the median reward (quarantined and instant-death trials are excluded).
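
As an illustration of the representative-trial rule, here is a minimal sketch. It is not the code in `browse_3way.py`: the `Trial` record, its field names, and the `representative_trial` helper are hypothetical, and taking the lower median when an even number of valid trials remains is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Trial:
    """Hypothetical trial record; field names are illustrative only."""
    trial_id: str
    reward: float
    valid: bool = True  # False for quarantined or instant-death trials


def representative_trial(trials: Sequence[Trial]) -> Optional[Trial]:
    """Return the valid trial whose reward is the median, or None if none are valid."""
    valid = sorted((t for t in trials if t.valid), key=lambda t: t.reward)
    if not valid:
        return None
    # With an odd count this is the true median. With an even count it is the
    # lower of the two middle trials (an assumption; the exporter may differ).
    return valid[(len(valid) - 1) // 2]


if __name__ == "__main__":
    # Made-up rewards for one arm on one task.
    cell = [
        Trial("t1", 0.20),
        Trial("t2", 0.90),
        Trial("t3", 0.50),
        Trial("t4", 0.00, valid=False),  # excluded before the median is taken
    ]
    chosen = representative_trial(cell)
    print(chosen.trial_id if chosen else "no valid trials")
```

Picking an actual trial, rather than averaging rewards, means the linked full-trace page always corresponds to a real run.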
67 changes: 67 additions & 0 deletions explorer/compare.html
@@ -0,0 +1,67 @@
<!doctype html><html lang='en'><head><meta charset='utf-8'><meta name='viewport' content='width=device-width, initial-scale=1'><title>mcp_lift_study — 3-way compare</title><link rel="preconnect" href="https://fonts.googleapis.com"><link rel="preconnect" href="https://fonts.gstatic.com" crossorigin><link href="https://fonts.googleapis.com/css2?family=Schibsted+Grotesk:wght@500;600;700;800&family=Source+Sans+3:ital,wght@0,400;0,500;0,600;1,400&family=Source+Code+Pro:wght@500;600&display=swap" rel="stylesheet"><style>
:root{
color-scheme:dark;
--bg:oklch(16% 0.00284 27deg); --surface:oklch(18% 0.00284 27deg); --surface-2:oklch(22% 0.006 27deg);
--panel-deep:oklch(13% 0.004 27deg);
--ink:oklch(94.5% 0.00284 27deg); --ink-soft:oklch(80% 0.00284 27deg); --muted:oklch(68% 0.00284 27deg);
--line:oklch(25% 0.00284 27deg); --line-strong:oklch(32% 0.00284 27deg);
--vermilion:#f34e3f; --vermilion-soft:#ff7867; --purple:oklch(72% 0.14 295deg);
--spectrum:linear-gradient(90deg,var(--vermilion),var(--vermilion-soft));
--accent:var(--vermilion); --armA:var(--vermilion); --armB:var(--purple); --armC:oklch(66% 0.006 27deg);
--pos:oklch(64% 0.13 160deg); --neg:oklch(70% 0.17 52deg);
--font-sans:'PolySans',-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,sans-serif;
--font-heading:'Perfectly Nineties',Georgia,'Times New Roman',serif;
--font-mono:'PolySans Mono',ui-monospace,'SF Mono',Consolas,monospace;
}
*{box-sizing:border-box}
body{margin:0;color:var(--ink);background-color:var(--bg);
background-image:url("data:image/svg+xml,%3Csvg width='32' height='32' viewBox='0 0 32 32' xmlns='http://www.w3.org/2000/svg'%3E%3Crect x='0.25' y='0.25' width='31.5' height='31.5' fill='none' stroke='%23ffffff' stroke-opacity='0.06' stroke-width='0.5'/%3E%3C/svg%3E");
background-size:32px 32px;background-attachment:fixed;
font-family:var(--font-sans);font-size:15px;line-height:1.55;-webkit-font-smoothing:antialiased}
.wrap{max-width:1180px;margin:0 auto;padding:0 24px 80px}
.wrap::before{content:"";display:block;height:4px;border-radius:0 0 3px 3px;background:var(--spectrum);margin-bottom:24px}
h1{font-family:var(--font-heading);font-weight:800;letter-spacing:-0.02em;font-size:30px;margin:12px 0 6px}
h2{font-family:var(--font-heading);font-weight:700;letter-spacing:-0.015em;font-size:20px;margin:24px 0 10px}
h3,h4{font-family:var(--font-heading);font-weight:600;margin:10px 0 6px}
a{color:var(--vermilion);text-decoration:none;border-bottom:1px solid color-mix(in oklab,var(--vermilion),transparent 70%)}
a:hover{border-bottom-color:var(--vermilion)}
.meta{color:var(--muted);font-size:13px}
code,.mono,.num{font-family:var(--font-mono);font-variant-numeric:tabular-nums}
code{background:var(--surface-2);border:1px solid var(--line);padding:1px 6px;border-radius:5px;font-size:0.86em}
pre{font-family:var(--font-mono);font-size:12px;line-height:1.5;white-space:pre-wrap;overflow-wrap:anywhere;
background:var(--panel-deep);border:1px solid var(--line);border-radius:8px;padding:10px;margin:8px 0;color:var(--ink-soft)}
.panel{background:var(--surface);border:1px solid var(--line);border-radius:12px;padding:18px 20px;margin-bottom:16px}
.grid{display:grid;grid-template-columns:repeat(auto-fit,minmax(130px,1fr));gap:10px}
.metric{background:var(--panel-deep);border:1px solid var(--line);border-radius:10px;padding:10px 12px}
.metric .k{color:var(--muted);font-size:11px;font-family:var(--font-mono);text-transform:uppercase;letter-spacing:0.04em}
.metric .v{font-family:var(--font-mono);font-weight:600;font-size:15px;margin-top:3px}
table{width:100%;border-collapse:collapse;font-size:13px}
th,td{padding:8px 10px;text-align:left;border-bottom:1px solid var(--line);vertical-align:top}
th{color:var(--muted);font-family:var(--font-mono);font-size:11px;text-transform:uppercase;letter-spacing:0.04em;font-weight:600}
.num,th.num,td.num{font-variant-numeric:tabular-nums} th.num,td.num{text-align:right}
details{border:1px solid var(--line);border-radius:10px;padding:8px 12px;margin:8px 0;background:var(--surface)}
summary{cursor:pointer;color:var(--vermilion);font-family:var(--font-heading);font-weight:600}
.pill{display:inline-block;font-family:var(--font-mono);font-size:11px;letter-spacing:0.04em;padding:3px 9px;
border-radius:6px;background:var(--surface-2);color:var(--muted);border:1px solid var(--line)}
.pill.passed{background:color-mix(in oklab,var(--vermilion),transparent 86%);color:var(--vermilion-soft);border-color:transparent}
.split{display:grid;grid-template-columns:1fr 1fr;gap:10px}
button,select,input{font-family:inherit;background:var(--surface);color:var(--ink);border:1px solid var(--line);border-radius:8px;padding:7px 10px}
/* 3-way compare */
.cmp-summary{display:flex;gap:14px;flex-wrap:wrap;margin:10px 0 18px}
.cmp-card{flex:1 1 200px;background:var(--surface);border:1px solid var(--line);border-radius:12px;padding:14px 16px}
.cmp-card .big{font-family:var(--font-mono);font-size:26px;font-weight:600}
.cmp-card .lbl{color:var(--muted);font-size:12px;margin-bottom:4px;font-family:var(--font-mono)}
.cols{display:grid;grid-template-columns:repeat(3,minmax(360px,1fr));gap:14px;align-items:start}
.col{background:var(--surface);border:1px solid var(--line);border-radius:12px;padding:0 12px 12px;min-width:0}
.col-head{padding:14px 14px 12px;margin:0 -12px 8px;border-bottom:1px solid var(--line)}
.col-head .arm{font-family:var(--font-heading);font-size:16px;font-weight:700}
.col-head .arm-sub{color:var(--muted);font-size:12px}
.cmetrics{display:flex;gap:14px;flex-wrap:wrap;margin:8px 0 6px;font-size:12px;color:var(--muted);font-family:var(--font-mono)}
.cmetrics b{color:var(--ink);font-size:14px}
.clink{font-size:12px;color:var(--muted)}
.col .scroll{max-height:540px;overflow:auto;border:1px solid var(--line);border-radius:8px;padding:4px;background:var(--panel-deep)}
.nodata{padding:20px;color:var(--muted)}
.matrix td.gap-pos{color:var(--pos)} .matrix td.gap-neg{color:var(--neg)}
.matrix tbody tr{cursor:pointer} .matrix tbody tr:hover{background:var(--surface-2)}
@media (max-width:1100px){.cols{grid-template-columns:1fr}}
</style></head><body><div class='wrap'><p><a href='index.html'>&larr; flat list (all trials)</a></p><h1>3-way comparison by task</h1><p class='meta'>Mean reward per arm over valid trials (instant-death and exception trials dropped); may differ slightly from the published n=3 aggregates where reruns exist. Click a row for the side-by-side traces. Gap = Sonnet with Sourcegraph minus Fable on a plain checkout. The defective ccx-vuln-remed-014 is omitted here; it remains in the <a href='index.html'>flat list</a>.</p><table class='matrix'><thead><tr><th>Task</th><th class='num' style='color:var(--armC)'>Sonnet, no tool</th><th class='num' style='color:var(--armA)'>Sonnet + Sourcegraph</th><th class='num' style='color:var(--armB)'>Fable, no tool</th><th class='num'>gap</th><th></th></tr></thead><tbody><tr onclick="location.href='compare__ccx-crossorg-217.html'"><td><code>ccx-crossorg-217</code></td><td class='num'>0.471</td><td class='num'>0.695</td><td class='num'>0.104</td><td class='num gap-pos'>+0.591</td><td><a href='compare__ccx-crossorg-217.html'>compare &rarr;</a></td></tr><tr onclick="location.href='compare__ccx-vuln-remed-135.html'"><td><code>ccx-vuln-remed-135</code></td><td class='num'>0.153</td><td class='num'>0.611</td><td class='num'>0.231</td><td class='num gap-pos'>+0.380</td><td><a href='compare__ccx-vuln-remed-135.html'>compare &rarr;</a></td></tr><tr onclick="location.href='compare__ccx-migration-289.html'"><td><code>ccx-migration-289</code></td><td class='num'>0.639</td><td class='num'>0.681</td><td class='num'>0.574</td><td class='num gap-pos'>+0.107</td><td><a href='compare__ccx-migration-289.html'>compare &rarr;</a></td></tr><tr onclick="location.href='compare__ccx-agentic-223.html'"><td><code>ccx-agentic-223</code></td><td class='num'>0.475</td><td class='num'>0.625</td><td class='num'>0.537</td><td class='num gap-pos'>+0.088</td><td><a href='compare__ccx-agentic-223.html'>compare &rarr;</a></td></tr><tr 
onclick="location.href='compare__ccx-migration-274.html'"><td><code>ccx-migration-274</code></td><td class='num'>0.750</td><td class='num'>1.000</td><td class='num'>0.928</td><td class='num gap-pos'>+0.072</td><td><a href='compare__ccx-migration-274.html'>compare &rarr;</a></td></tr><tr onclick="location.href='compare__ccx-vuln-remed-126.html'"><td><code>ccx-vuln-remed-126</code></td><td class='num'>0.583</td><td class='num'>0.759</td><td class='num'>0.735</td><td class='num gap-pos'>+0.024</td><td><a href='compare__ccx-vuln-remed-126.html'>compare &rarr;</a></td></tr><tr onclick="location.href='compare__ccx-config-trace-010.html'"><td><code>ccx-config-trace-010</code></td><td class='num'>1.000</td><td class='num'>1.000</td><td class='num'>1.000</td><td class='num '>0.000</td><td><a href='compare__ccx-config-trace-010.html'>compare &rarr;</a></td></tr><tr onclick="location.href='compare__ccx-incident-145.html'"><td><code>ccx-incident-145</code></td><td class='num'>0.177</td><td class='num'>0.172</td><td class='num'>0.172</td><td class='num '>0.000</td><td><a href='compare__ccx-incident-145.html'>compare &rarr;</a></td></tr><tr onclick="location.href='compare__ccx-crossorg-288.html'"><td><code>ccx-crossorg-288</code></td><td class='num'>0.681</td><td class='num'>0.551</td><td class='num'>0.833</td><td class='num gap-neg'>-0.282</td><td><a href='compare__ccx-crossorg-288.html'>compare &rarr;</a></td></tr></tbody></table></div></body></html>
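
The gap column in the matrix above can be rechecked from the per-arm means. The sketch below is an assumption-laden illustration, not the exporter's code: it uses the three-decimal means as displayed (the exporter may subtract unrounded values), and the `gap` and `gap_class` helper names are hypothetical. The CSS class names `gap-pos` and `gap-neg` are the ones used in the table.

```python
# Mean rewards copied from the matrix:
# task -> (Sonnet no tool, Sonnet + Sourcegraph, Fable no tool)
MEANS = {
    "ccx-crossorg-217": (0.471, 0.695, 0.104),
    "ccx-vuln-remed-135": (0.153, 0.611, 0.231),
    "ccx-migration-289": (0.639, 0.681, 0.574),
    "ccx-agentic-223": (0.475, 0.625, 0.537),
    "ccx-migration-274": (0.750, 1.000, 0.928),
    "ccx-vuln-remed-126": (0.583, 0.759, 0.735),
    "ccx-config-trace-010": (1.000, 1.000, 1.000),
    "ccx-incident-145": (0.177, 0.172, 0.172),
    "ccx-crossorg-288": (0.681, 0.551, 0.833),
}


def gap(sonnet_mcp: float, fable: float) -> float:
    """Gap = Sonnet with Sourcegraph minus Fable on a plain checkout."""
    # Round to three decimals to match the table and absorb float noise.
    return round(sonnet_mcp - fable, 3)


def gap_class(value: float) -> str:
    """CSS class for the gap cell; a zero gap gets no colour class."""
    if value > 0:
        return "gap-pos"
    if value < 0:
        return "gap-neg"
    return ""


if __name__ == "__main__":
    for task, (_sonnet, sonnet_mcp, fable) in MEANS.items():
        g = gap(sonnet_mcp, fable)
        print(f"{task:22s} {g:+.3f} {gap_class(g)}")
```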