Make EMBEDDING_MAX_CHARS provider-aware to avoid over-truncating large symbols

## Summary

The dense-embedding text builder in `server/indexer/pipeline.py` (`_build_embedding_text`) caps the whole serialized symbol string at `EMBEDDING_MAX_CHARS` (default **6000 chars ≈ ~1,500 tokens**, `server/config.py:78`). When a symbol's source exceeds the remaining budget, the tail is replaced with `// ... (truncated)` and a warning is logged.

This default is conservative — it's effectively sized for the **smallest-context provider we support**, leaving large-context providers heavily under-utilized and truncating large classes/methods unnecessarily.

## Provider input limits (researched)

| Provider (config) | Default model | Max input tokens | Output dims |
|---|---|---|---|
| `jina` (local TEI) | `jinaai/jina-embeddings-v2-base-code` | 8,192 | 768 |
| `jina-api` | `jina-embeddings-v2-base-code` | 8,192 | 768 |
| `voyage` | `voyage-code-3` | 32,000 | 1024 |
| `openai` | `text-embedding-3-large` | 8,192 | 3072 |
| `ollama` | `nomic-embed-text` | 2,048 | 768 |

The 6000-char (~1,500-token) cap fits within Ollama's 2K limit with headroom, but uses only ~18% of the 8K models and ~5% of Voyage's 32K window.

## Impact

- A large class (e.g. ~1000 lines) is indexed as a single `CodeSymbol` whose `source` is the full class body. With an 8K/32K-capable provider configured, the class-level dense vector is still truncated to ~6000 chars purely because of the conservative default — losing the tail of the class-level semantic summary.
- Long individual methods can hit the same cap.
- (Note: per-method symbols are still embedded individually, so this is partial loss at the whole-symbol level, not total loss.)

## Proposal

Make the cap provider-aware instead of a single global default. Options:

1. Derive a sensible default `max_chars` from the active provider/model's known token limit (with a safety margin for the metadata preamble), falling back to 6000 for unknown providers. An explicit `EMBEDDING_MAX_CHARS` env var should still override.
2. At minimum, document the recommended `EMBEDDING_MAX_CHARS` values per provider (e.g. ~24,000–30,000 for the 8K models, much higher for Voyage) so users can tune it.

## Why the explicit cap should stay

Providers handle overflow differently (Jina trims server-side via `truncate=True`; Voyage/Ollama may reject). The explicit char cap gives consistent, logged truncation behavior across all providers — so this is about choosing a smarter default, not removing the cap.

## Acceptance criteria

- [ ] Default truncation budget reflects the configured provider's actual context window
- [ ] `EMBEDDING_MAX_CHARS` continues to work as an explicit override
- [ ] README/config docs note the per-provider limits and how to tune the cap

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make EMBEDDING_MAX_CHARS provider-aware to avoid over-truncating large symbols #64

Summary

Provider input limits (researched)

Impact

Proposal

Why the explicit cap should stay

Acceptance criteria

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Provider (config)	Default model	Max input tokens	Output dims
`jina` (local TEI)	`jinaai/jina-embeddings-v2-base-code`	8,192	768
`jina-api`	`jina-embeddings-v2-base-code`	8,192	768
`voyage`	`voyage-code-3`	32,000	1024
`openai`	`text-embedding-3-large`	8,192	3072
`ollama`	`nomic-embed-text`	2,048	768

Make EMBEDDING_MAX_CHARS provider-aware to avoid over-truncating large symbols #64

Description

Summary

Provider input limits (researched)

Impact

Proposal

Why the explicit cap should stay

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions