Summary
The dense-embedding text builder in server/indexer/pipeline.py (_build_embedding_text) caps the whole serialized symbol string at EMBEDDING_MAX_CHARS (default 6000 chars ≈ ~1,500 tokens, server/config.py:78). When a symbol's source exceeds the remaining budget, the tail is replaced with // ... (truncated) and a warning is logged.
This default is conservative — it's effectively sized for the smallest-context provider we support, leaving large-context providers heavily under-utilized and truncating large classes/methods unnecessarily.
Provider input limits (researched)
| Provider (config) |
Default model |
Max input tokens |
Output dims |
jina (local TEI) |
jinaai/jina-embeddings-v2-base-code |
8,192 |
768 |
jina-api |
jina-embeddings-v2-base-code |
8,192 |
768 |
voyage |
voyage-code-3 |
32,000 |
1024 |
openai |
text-embedding-3-large |
8,192 |
3072 |
ollama |
nomic-embed-text |
2,048 |
768 |
The 6000-char (~1,500-token) cap fits within Ollama's 2K limit with headroom, but uses only ~18% of the 8K models and ~5% of Voyage's 32K window.
Impact
- A large class (e.g. ~1000 lines) is indexed as a single
CodeSymbol whose source is the full class body. With an 8K/32K-capable provider configured, the class-level dense vector is still truncated to ~6000 chars purely because of the conservative default — losing the tail of the class-level semantic summary.
- Long individual methods can hit the same cap.
- (Note: per-method symbols are still embedded individually, so this is partial loss at the whole-symbol level, not total loss.)
Proposal
Make the cap provider-aware instead of a single global default. Options:
- Derive a sensible default
max_chars from the active provider/model's known token limit (with a safety margin for the metadata preamble), falling back to 6000 for unknown providers. An explicit EMBEDDING_MAX_CHARS env var should still override.
- At minimum, document the recommended
EMBEDDING_MAX_CHARS values per provider (e.g. ~24,000–30,000 for the 8K models, much higher for Voyage) so users can tune it.
Why the explicit cap should stay
Providers handle overflow differently (Jina trims server-side via truncate=True; Voyage/Ollama may reject). The explicit char cap gives consistent, logged truncation behavior across all providers — so this is about choosing a smarter default, not removing the cap.
Acceptance criteria
Summary
The dense-embedding text builder in
server/indexer/pipeline.py(_build_embedding_text) caps the whole serialized symbol string atEMBEDDING_MAX_CHARS(default 6000 chars ≈ ~1,500 tokens,server/config.py:78). When a symbol's source exceeds the remaining budget, the tail is replaced with// ... (truncated)and a warning is logged.This default is conservative — it's effectively sized for the smallest-context provider we support, leaving large-context providers heavily under-utilized and truncating large classes/methods unnecessarily.
Provider input limits (researched)
jina(local TEI)jinaai/jina-embeddings-v2-base-codejina-apijina-embeddings-v2-base-codevoyagevoyage-code-3openaitext-embedding-3-largeollamanomic-embed-textThe 6000-char (~1,500-token) cap fits within Ollama's 2K limit with headroom, but uses only ~18% of the 8K models and ~5% of Voyage's 32K window.
Impact
CodeSymbolwhosesourceis the full class body. With an 8K/32K-capable provider configured, the class-level dense vector is still truncated to ~6000 chars purely because of the conservative default — losing the tail of the class-level semantic summary.Proposal
Make the cap provider-aware instead of a single global default. Options:
max_charsfrom the active provider/model's known token limit (with a safety margin for the metadata preamble), falling back to 6000 for unknown providers. An explicitEMBEDDING_MAX_CHARSenv var should still override.EMBEDDING_MAX_CHARSvalues per provider (e.g. ~24,000–30,000 for the 8K models, much higher for Voyage) so users can tune it.Why the explicit cap should stay
Providers handle overflow differently (Jina trims server-side via
truncate=True; Voyage/Ollama may reject). The explicit char cap gives consistent, logged truncation behavior across all providers — so this is about choosing a smarter default, not removing the cap.Acceptance criteria
EMBEDDING_MAX_CHARScontinues to work as an explicit override