Skip to content

Make EMBEDDING_MAX_CHARS provider-aware to avoid over-truncating large symbols #64

Description

@GoodbyePlanet

Summary

The dense-embedding text builder in server/indexer/pipeline.py (_build_embedding_text) caps the whole serialized symbol string at EMBEDDING_MAX_CHARS (default 6000 chars ≈ ~1,500 tokens, server/config.py:78). When a symbol's source exceeds the remaining budget, the tail is replaced with // ... (truncated) and a warning is logged.

This default is conservative — it's effectively sized for the smallest-context provider we support, leaving large-context providers heavily under-utilized and truncating large classes/methods unnecessarily.

Provider input limits (researched)

Provider (config) Default model Max input tokens Output dims
jina (local TEI) jinaai/jina-embeddings-v2-base-code 8,192 768
jina-api jina-embeddings-v2-base-code 8,192 768
voyage voyage-code-3 32,000 1024
openai text-embedding-3-large 8,192 3072
ollama nomic-embed-text 2,048 768

The 6000-char (~1,500-token) cap fits within Ollama's 2K limit with headroom, but uses only ~18% of the 8K models and ~5% of Voyage's 32K window.

Impact

  • A large class (e.g. ~1000 lines) is indexed as a single CodeSymbol whose source is the full class body. With an 8K/32K-capable provider configured, the class-level dense vector is still truncated to ~6000 chars purely because of the conservative default — losing the tail of the class-level semantic summary.
  • Long individual methods can hit the same cap.
  • (Note: per-method symbols are still embedded individually, so this is partial loss at the whole-symbol level, not total loss.)

Proposal

Make the cap provider-aware instead of a single global default. Options:

  1. Derive a sensible default max_chars from the active provider/model's known token limit (with a safety margin for the metadata preamble), falling back to 6000 for unknown providers. An explicit EMBEDDING_MAX_CHARS env var should still override.
  2. At minimum, document the recommended EMBEDDING_MAX_CHARS values per provider (e.g. ~24,000–30,000 for the 8K models, much higher for Voyage) so users can tune it.

Why the explicit cap should stay

Providers handle overflow differently (Jina trims server-side via truncate=True; Voyage/Ollama may reject). The explicit char cap gives consistent, logged truncation behavior across all providers — so this is about choosing a smarter default, not removing the cap.

Acceptance criteria

  • Default truncation budget reflects the configured provider's actual context window
  • EMBEDDING_MAX_CHARS continues to work as an explicit override
  • README/config docs note the per-provider limits and how to tune the cap

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions