explore: image-to-image editing for Gemini + closed visual loop #1

Open
looptech-explorer[bot] wants to merge 2 commits into main from explore/gemini-image-edit-closed-loop

Conversation

@looptech-explorer

Direction

Turn the generation-only server into a closed visual loop by adding Gemini image-to-image editing (reference images) and returning the edited image to the caller as a viewable MCP image instead of just a file path.

Lens

internal-gaps

Proven

  • edit_image_gemini registers as a 4th MCP tool without disturbing the 3 existing generation tools (verified via in-memory FastMCP Client.list_tools()).
  • Edit requests genuinely carry the input image(s) back to the model as inlineData parts — proven against a stubbed Gemini endpoint (httpx.MockTransport), so this is real image-to-image, not text-to-image.
  • The tool returns the edited image as a viewable MCP ImageContent block (mime image/png) AND structured metadata (reference_count, model, file_path, text_response) — the "model can see what it made" closed loop.
  • Validation guards hold: editing with 0 references is rejected; >14 references is rejected; empty prompt is rejected; MIME is sniffed from magic bytes, not file extension; API errors are surfaced.
  • demo_edit.py runs end-to-end with NO network and NO API key and exits 0 (17 checks); pytest tests/ passes (14 tests).
  • Baseline build is not broken: server.py still imports and all original tools remain.
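
The request shape and validation guards listed above can be sketched roughly as follows. This is a minimal illustration, not the code in image_edit.py: the function name `build_edit_request`, the `(mime_type, raw_bytes)` input shape, and the error messages are assumptions made for the example. Only the `inlineData` parts, the 14-image cap, and the three rejection cases come from the PR description.

```python
import base64

MAX_REFERENCE_IMAGES = 14  # cap stated in the PR description


def build_edit_request(prompt: str, images: list[tuple[str, bytes]]) -> dict:
    """Hypothetical helper: build a Gemini generateContent-style body.

    `images` is a list of (mime_type, raw_bytes) pairs; that input shape is an
    assumption for this sketch, not the PR's actual signature.
    """
    if not prompt or not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not images:
        raise ValueError("at least one reference image is required")
    if len(images) > MAX_REFERENCE_IMAGES:
        raise ValueError(
            f"at most {MAX_REFERENCE_IMAGES} reference images are allowed"
        )

    # The text instruction goes first, followed by one inlineData part per image.
    parts: list[dict[str, object]] = [{"text": prompt}]
    for mime_type, raw in images:
        parts.append(
            {
                "inlineData": {
                    "mimeType": mime_type,
                    "data": base64.b64encode(raw).decode("ascii"),
                }
            }
        )
    return {"contents": [{"parts": parts}]}
```

Because every reference image travels in the body as a base64 `inlineData` part, a stubbed endpoint can assert on the request alone to distinguish image-to-image from text-to-image. Any generation config the real tool sends is omitted here.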

Sketched

  • Live acceptance against the real generativelanguage.googleapis.com API is not exercised here (no API key); only the request shape and response handling are proven offline.
  • OpenAI /images/edits and a Grok image-edit path were identified (external-frontier lens) but not built — edit_image_gemini is the single-provider PoC.
  • The exact upper bound of "5 characters / 14 images" follows Google's docs; the cap is enforced at 14 but not validated against a live model.

How to run

cd /private/tmp/explorer-nightly/work/looptech-ai__neoimage
python3 -m venv venv && ./venv/bin/pip install -r requirements-dev.txt
./venv/bin/python demo_edit.py
./venv/bin/python -m pytest tests/ -q

Runners-up

  • Return generated images as MCP ImageContent for the existing text-to-image tools too (internal-gaps) — partially realized here for the edit tool.
  • OpenAI /images/edits image-to-image with masks/multi-image input (external-frontier).
  • Injectable httpx transport + structured error-dict returns ported from looptech-ai/cowork2code & understand-quickly SDK (cross-pollination).
  • Wire the phantom OpenAI format/transparent params through to the API (dead-end: accepted but never sent).

🤖 Generated with Claude Code

Adds edit_image_gemini, the image-to-image counterpart the generation-only
server never built. It accepts up to 14 reference images (inlineData parts) for
editing, multi-image composition, and character consistency, and returns the
edited image to the caller as a viewable MCP ImageContent block (not just a file
path) so the model can see its output and iterate.

- image_edit.py: pure, offline-testable helpers (mime sniffing from magic bytes,
  request-body construction, response parsing) + injectable async gemini_edit().
- server.py: edit_image_gemini tool returning ToolResult(image + metadata).
- tests/ + demo_edit.py: full round-trip proven offline via httpx.MockTransport,
  no network and no API key.
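
A sketch of what MIME sniffing from magic bytes can look like, for readers unfamiliar with the technique. The function name `sniff_image_mime` and the set of supported formats are assumptions for illustration; the commit message only says the helper reads magic bytes rather than trusting the file extension.

```python
def sniff_image_mime(data: bytes) -> str:
    """Hypothetical helper: infer an image MIME type from leading bytes.

    The format list below is an assumption; the PR does not state which
    formats its helper recognizes.
    """
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    # WebP is a RIFF container: "RIFF" + 4-byte size + "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unrecognized image format")
```

Sniffing the content means a PNG saved as `photo.jpg` is still sent with `image/png`, and a non-image file is rejected before any request is made.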

socket-security Bot commented Jun 24, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | pytest@9.1.1 | 87 | 100 | 100 | 100 | 100 |
| Added | pytest-asyncio@1.4.0 | 100 | 100 | 100 | 100 | 100 |


- parse_gemini_image_response: take the FIRST image/text part, not last
  (last-wins loop silently dropped the earlier image on multi-part responses)
- edit_image_gemini: wrap the output write so disk errors surface as ValueError,
  consistent with the rest of the server, instead of a raw OSError
- demo_edit.py: scope the _make_client patch + GOOGLE_API_KEY via context
  managers (was a process-wide leak); close the Part B mock client via async with

pytest 14 passed - demo 17/17 - ruff clean - mypy clean
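
The first-wins parsing fix described in this commit can be illustrated with a short sketch. It assumes the usual Gemini response layout (`candidates[].content.parts[]`, each part carrying either `text` or `inlineData`); the function name and return shape are hypothetical and do not reproduce `parse_gemini_image_response`.

```python
import base64
from typing import Optional


def parse_first_image_and_text(
    response: dict,
) -> tuple[Optional[bytes], Optional[str]]:
    """Hypothetical parser: return the FIRST image and FIRST text part found."""
    image: Optional[bytes] = None
    text: Optional[str] = None
    for candidate in response.get("candidates", []):
        for part in candidate.get("content", {}).get("parts", []):
            inline = part.get("inlineData")
            # Assign only while still unset, so later parts never overwrite
            # earlier ones (the last-wins loop did the opposite).
            if image is None and inline and inline.get("data"):
                image = base64.b64decode(inline["data"])
            if text is None and part.get("text"):
                text = part["text"]
    return image, text
```

The `is None` guards are the whole fix: a plain assignment inside the loop keeps whichever part comes last, which is how the earlier image was silently dropped on multi-part responses.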
ebeckner-looptech marked this pull request as ready for review June 24, 2026 18:47
@ebeckner-looptech

Contributor

Review + fixes applied

Reviewed this autonomously generated PR. Verdict: merge with nits. Verified by running the suite: pytest 14 passed, demo_edit.py 17/17, ruff/mypy clean on the new code. All claimed behaviors check out: genuine image-to-image (inlineData), viewable ImageContent return, validation guards intact, the 3 original tools untouched.

Pushed 3cb70c6 addressing the three HIGH findings from review:

  1. Parser last-wins bug (image_edit.py) — parse_gemini_image_response now takes the first image/text part, not the last (a multi-part response would have silently dropped the earlier image).
  2. Unguarded write (server.py) — disk errors on the output write now surface as ValueError, consistent with the rest of the server, instead of a raw OSError.
  3. Demo leak (demo_edit.py) — scoped the _make_client patch + GOOGLE_API_KEY via context managers (was process-wide); closed the Part B mock client.
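
For item 2, the guarded write amounts to translating `OSError` into the `ValueError` the rest of the server raises. A minimal sketch, assuming a hypothetical `save_image` helper and message wording (neither is taken from server.py):

```python
from pathlib import Path


def save_image(path: Path, data: bytes) -> Path:
    """Hypothetical helper: write image bytes, surfacing disk errors as ValueError."""
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    except OSError as exc:
        # Chain the original exception so the underlying cause stays visible.
        raise ValueError(f"could not write image to {path}: {exc}") from exc
    return path
```

Callers then handle one exception type for bad input, API failures, and disk failures alike, while `from exc` preserves the original error for debugging.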

Suite green after fixes. Ready for merge.
