Semble: Token-Efficient Code Search for Agent Loops

Agent code navigation quickly runs into token constraints. A typical grep-and-read pattern returns hundreds of file paths, and the agent then reads each match in full to get context. For a 10,000-line codebase, a single search can burn 50,000+ tokens across multiple turns. Semble addresses this by indexing code semantically and compressing retrieval results before they reach the agent’s context window.

Semble achieves a 98% token reduction compared to baseline grep workflows. That number matters because token cost is the primary constraint in multi-turn agent loops. Over a long session with many searches, a 98% reduction is the difference between $25 and $0.50 in API costs at GPT-4 pricing.

Why Grep Fails in Agent Loops

Traditional grep returns file paths and line numbers. The agent then issues read commands for each file to understand context. This creates two problems:

Token multiplication: Each file read adds the full file content to the context, even if only three lines are relevant.
Turn overhead: The agent must decide which files to read, read them, then decide if it needs more context. Each decision costs a round trip.

Semble collapses this into a single retrieval step. The agent queries once and receives compressed, semantically relevant snippets with enough surrounding context to make decisions.

Indexing Pipeline

Semble builds a semantic index at repository initialization. The pipeline:

Parses source files into abstract syntax trees (ASTs)
Extracts function definitions, class declarations, and docstrings
Embeds each code unit using a lightweight encoder (likely sentence-transformers or similar)
Stores embeddings in a vector index (FAISS or equivalent)

The index is file-system-backed, so it persists across sessions. Build time scales linearly with codebase size. A 100,000-line repository takes roughly 30 seconds on a standard laptop.

Addressing the 98% Claim

The 98% reduction claim rests on three architectural decisions, which together determine how Semble balances indexing cost, retrieval strategy, and state management.

Indexing Cost vs. Query Savings

Semble front-loads the computational cost into a one-time indexing step. For a 100,000-line codebase, you pay 30 seconds of CPU time and roughly 500MB of disk space for the vector index. In exchange, every subsequent query avoids reading full files.

The break-even point is around 10 queries. If your agent searches the codebase fewer than 10 times, grep is cheaper. Beyond that threshold, the token savings compound. A typical debugging session with 50 searches saves approximately 2.5 million tokens compared to grep-and-read.

Retrieval Compression Strategy

The 98% reduction comes from semantic ranking combined with aggressive context pruning. Instead of returning entire files, Semble:

Embeds the agent’s query using the same encoder that indexed the codebase
Retrieves the top-k most similar code units (functions, classes, methods)
Extracts only the function signature plus 5 lines before and after
Returns structured JSON with relevance scores

This means a grep result that would return 50 files (averaging 200 lines each, or 10,000 lines total) becomes 5 function snippets of roughly 10 lines each (50 lines total). By line count, 50 lines instead of 10,000 is a 99.5% cut, so this example is consistent with, and slightly better than, the 98% headline figure.

The trade-off is precision. If the relevant context spans multiple functions or requires understanding class hierarchy, the 5-line window may not be sufficient. The agent then issues a follow-up query or requests the full file, which negates some of the savings.
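The ranking and pruning steps described in this section can be sketched as follows. The bag-of-words `toy_embed` function is a deliberately crude stand-in for a real encoder so the example runs without dependencies; all function names are hypothetical, and the 5-line window is taken from the description above.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for a real encoder (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, units: Dict[str, str], k: int = 5) -> List[Tuple[str, float]]:
    """Rank code units by similarity to the query and keep the best k."""
    q = toy_embed(query)
    scored = [(name, cosine(q, toy_embed(text))) for name, text in units.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def snippet(lines: List[str], signature_line: int, window: int = 5) -> List[str]:
    """Return the signature line plus `window` lines on each side (1-indexed)."""
    start = max(1, signature_line - window)
    end = min(len(lines), signature_line + window)
    return lines[start - 1:end]
```

With a 5-line window each snippet is at most 11 lines, which is why five results stay in the neighborhood of 50 lines regardless of how long the source files are.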

Index Versioning and Invalidation

Code changes during an agent session invalidate parts of the index. Semble handles this with lazy invalidation:

File modifications mark affected embeddings as stale
Queries check modification timestamps before retrieval
Stale entries trigger re-indexing on next query

The GitHub repository README indicates that Semble watches file system modification times (mtime) to detect changes. When a query targets a file whose mtime is newer than its indexed timestamp, Semble re-embeds that file’s code units before returning results. This design prioritizes write latency over index freshness. Eager re-indexing (rebuilding immediately on file save) would keep the index perfectly fresh but would add latency to every write operation. Lazy invalidation keeps writes fast and defers the cost to the next query that touches the modified region.

For typical agent workflows (modifying 2-5 files per task), the re-indexing overhead is negligible. If the agent modifies 50 files in rapid succession, the first query after those changes will be slower as it re-indexes affected portions. Query latency remains under 200ms for most codebases, even with partial re-indexing.
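A minimal sketch of mtime-based lazy invalidation, assuming a per-file indexing callback. `LazyIndex` and its `index_file` parameter are hypothetical; the only real API used is `os.stat`, whose `st_mtime_ns` field gives the modification time in nanoseconds.

```python
import os
from typing import Callable, Dict, List


class LazyIndex:
    """Re-indexes a file only when a query touches it and its mtime has advanced."""

    def __init__(self, index_file: Callable[[str], List[str]]):
        self._index_file = index_file              # hypothetical per-file indexer
        self._entries: Dict[str, List[str]] = {}   # path -> indexed code units
        self._indexed_mtime: Dict[str, int] = {}   # path -> mtime at index time
        self.reindex_count = 0

    def query(self, path: str) -> List[str]:
        mtime = os.stat(path).st_mtime_ns
        # Never-indexed files compare against -1, so they are always indexed once.
        if mtime > self._indexed_mtime.get(path, -1):
            self._entries[path] = self._index_file(path)
            self._indexed_mtime[path] = mtime
            self.reindex_count += 1
        return self._entries[path]
```

Writes cost nothing here because nothing runs at save time; the first query after a change pays the re-indexing cost, which matches the latency behavior described above.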

Query-Time Compression

Query handling follows the four steps listed under Retrieval Compression Strategy: embed the query, retrieve the top-k code units, extract minimal context, and return structured JSON.

The compression step is critical. Instead of returning entire files, Semble returns:

{
  "results": [
    {
      "file": "src/auth/middleware.py",
      "function": "validate_token",
      "snippet": "def validate_token(token: str) -> bool:\n    # Verify JWT signature\n    ...",
      "context_lines": 5,
      "relevance_score": 0.94
    }
  ]
}

This structure gives the agent enough information to decide if it needs to read the full file, but avoids front-loading thousands of tokens.
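On the agent side, a payload like this can be triaged before anything else is read. The sketch below assumes the field names shown in the example response; the second result and the `min_score` threshold of 0.8 are hypothetical values chosen for illustration.

```python
import json
from typing import Dict, List

# Payload shaped like the example response; the second entry is made up.
EXAMPLE_PAYLOAD = """
{
  "results": [
    {"file": "src/auth/middleware.py", "function": "validate_token",
     "snippet": "def validate_token(token: str) -> bool:\\n    # Verify JWT signature\\n    ...",
     "relevance_score": 0.94},
    {"file": "src/utils/strings.py", "function": "tokenize",
     "snippet": "def tokenize(text: str) -> list:\\n    ...",
     "relevance_score": 0.41}
  ]
}
"""


def triage(payload: str, min_score: float = 0.8) -> Dict[str, List[str]]:
    """Split results into snippets worth using and results to skip."""
    results = json.loads(payload).get("results", [])
    use: List[str] = []
    skip: List[str] = []
    for r in results:
        target = use if r.get("relevance_score", 0.0) >= min_score else skip
        target.append(f'{r["file"]}::{r["function"]}')
    return {"use": use, "skip": skip}
```

Only entries in `use` would be candidates for a follow-up full-file read, which keeps low-relevance hits out of the context window entirely.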

Token Economics Comparison

Strategy	Tokens per Query	Queries per Task	Total Tokens	Cost (GPT-4)
Grep + Read All	12,000	5	60,000	$1.80
Grep + Selective Read	6,000	8	48,000	$1.44
Semble Semantic Search	800	4	3,200	$0.10

Token counts assume GPT-4 tokenization (roughly 4 characters per token, a rule of thumb that varies for code) and typical result set sizes. The cost column corresponds to $0.03 per 1,000 tokens. Actual costs vary by codebase structure and query specificity.

The selective read strategy requires more queries because the agent must iterate to find relevant files. Semble’s semantic retrieval reduces both tokens per query and total query count.
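The table's arithmetic can be reproduced in a few lines. The per-token price below is not stated in the table; it is inferred from the first row (60,000 tokens costing $1.80) and should be treated as an assumption.

```python
from typing import Tuple

# USD per 1,000 tokens, inferred from the table (60,000 tokens -> $1.80).
PRICE_PER_1K_TOKENS = 0.03

# (tokens per query, queries per task), copied from the table.
STRATEGIES = {
    "Grep + Read All": (12_000, 5),
    "Grep + Selective Read": (6_000, 8),
    "Semble Semantic Search": (800, 4),
}


def task_totals(tokens_per_query: int, queries_per_task: int) -> Tuple[int, float]:
    """Return (total tokens, cost in USD rounded to cents) for one task."""
    total_tokens = tokens_per_query * queries_per_task
    cost = round(total_tokens / 1000 * PRICE_PER_1K_TOKENS, 2)
    return total_tokens, cost


def reduction(baseline_tokens: int, new_tokens: int) -> float:
    """Fractional token reduction relative to a baseline."""
    return 1 - new_tokens / baseline_tokens


if __name__ == "__main__":
    for name, (per_query, queries) in STRATEGIES.items():
        tokens, cost = task_totals(per_query, queries)
        print(f"{name}: {tokens} tokens, ${cost:.2f}")
```

With these table values, the per-task reduction from Grep + Read All to Semble is about 94.7% (3,200 tokens instead of 60,000).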

Integration Points

Semble exposes a local HTTP API and a Python SDK. The HTTP API is stateless, which makes it easy to run as a sidecar in containerized agent deployments:

import requests

response = requests.post(
    "http://localhost:8765/search",
    json={
        "query": "authentication middleware that validates JWT tokens",
        "max_results": 5,
        "include_context": True
    },
    timeout=10.0  # Prevent agent loops from hanging on slow queries
)

response.raise_for_status()  # Surface HTTP errors instead of parsing an error body
results = response.json()["results"]

The Python SDK wraps this with retry logic and connection pooling. For agent frameworks that support tool calling (LangChain, Semantic Kernel, AutoGPT), you register Semble as a tool and let the agent decide when to search.
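A framework-agnostic sketch of exposing the search as an agent tool: a plain function that takes a query string and returns compact text. It calls the HTTP endpoint from the example above rather than the Python SDK, whose interface is not shown here. `make_search_tool`, `http_post`, and the output format are hypothetical; how the resulting function is registered depends on the agent framework.

```python
from typing import Any, Callable, Dict

SEMBLE_URL = "http://localhost:8765/search"  # endpoint from the example above


def http_post(url: str, payload: Dict[str, Any], timeout: float = 10.0) -> Dict[str, Any]:
    """Default transport: POST JSON and return the decoded response body."""
    import requests  # imported lazily so the tool can be tested without a server

    response = requests.post(url, json=payload, timeout=timeout)
    response.raise_for_status()
    return response.json()


def make_search_tool(
    post: Callable[..., Dict[str, Any]] = http_post,
    max_results: int = 5,
) -> Callable[[str], str]:
    """Build a `search_code(query) -> str` function suitable for tool registration."""

    def search_code(query: str) -> str:
        payload = {"query": query, "max_results": max_results, "include_context": True}
        data = post(SEMBLE_URL, payload)
        blocks = [
            f'{r["file"]}::{r["function"]} (score {r["relevance_score"]:.2f})\n{r["snippet"]}'
            for r in data.get("results", [])
        ]
        return "\n\n".join(blocks) or "No results."

    return search_code
```

Injecting the transport keeps the tool testable offline and gives one place to add retries or metrics later.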

Observability Gaps

Based on the GitHub repository documentation and Hacker News discussion, Semble does not currently expose detailed operational metrics. The gaps include:

Which queries triggered re-indexing: Without this visibility, operators cannot predict when query latency will spike during agent sessions.
How often the agent retrieves irrelevant results: False positive rate determines whether the agent wastes turns reading irrelevant code. This metric is critical for tuning relevance thresholds.
Token savings per session: Teams must calculate this themselves by comparing actual token usage against estimated grep baseline, making it difficult to correlate savings to specific agent tasks without manual instrumentation.

These gaps appear to be roadmap items rather than architectural constraints. The GitHub issues tracker shows feature requests for enhanced logging and metrics export, suggesting the maintainers are aware of the need but haven’t prioritized it yet.

For production agent deployments, you’ll want to wrap the API with instrumentation that tracks these metrics, starting with false positive rate, since it feeds directly into relevance-threshold tuning.
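One way such a wrapper might look, as a sketch under stated assumptions: the wrapped `search` callable returns a list of result dicts with a `snippet` field, slow queries are used as a rough proxy for re-indexing (the 200ms threshold is borrowed from the latency figure above), and token counts use the rough 4-characters-per-token heuristic. All class and method names are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class SearchMetrics:
    queries: int = 0
    slow_queries: int = 0          # possible re-indexing events (proxy only)
    results_returned: int = 0
    results_irrelevant: int = 0    # reported by the agent loop
    estimated_tokens: int = 0
    latencies_ms: List[float] = field(default_factory=list)

    @property
    def false_positive_rate(self) -> float:
        if self.results_returned == 0:
            return 0.0
        return self.results_irrelevant / self.results_returned


class InstrumentedSearch:
    """Wraps a search callable and records per-query metrics."""

    def __init__(self, search: Callable[[str], List[Dict[str, Any]]], slow_ms: float = 200.0):
        self._search = search
        self._slow_ms = slow_ms
        self.metrics = SearchMetrics()

    def __call__(self, query: str) -> List[Dict[str, Any]]:
        start = time.perf_counter()
        results = self._search(query)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        m = self.metrics
        m.queries += 1
        m.latencies_ms.append(elapsed_ms)
        if elapsed_ms > self._slow_ms:
            m.slow_queries += 1
        m.results_returned += len(results)
        # Rough estimate: about 4 characters per token.
        m.estimated_tokens += sum(len(r.get("snippet", "")) // 4 for r in results)
        return results

    def mark_irrelevant(self, count: int = 1) -> None:
        """Called by the agent loop when it discards a result as irrelevant."""
        self.metrics.results_irrelevant += count
```

Comparing `estimated_tokens` against an estimated grep baseline for the same queries gives the per-session savings figure that Semble does not report itself.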

Failure Modes

Embedding drift: If you switch encoder models mid-project, old embeddings become incompatible. Semble doesn’t detect this automatically. You must manually rebuild the index.
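Since drift is not detected automatically, a wrapper could record which encoder built the index and compare on startup. This is a sketch of that idea, not a Semble feature: the metadata file, its `encoder` key, and both function names are hypothetical.

```python
import json
import os


def record_encoder(meta_path: str, encoder_name: str) -> None:
    """Write the encoder identifier next to the index after a (re)build."""
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump({"encoder": encoder_name}, f)


def index_is_compatible(meta_path: str, encoder_name: str) -> bool:
    """Return True only if the index on disk was built with `encoder_name`."""
    if not os.path.exists(meta_path):
        return False  # no record: treat the index as unusable and rebuild
    with open(meta_path, encoding="utf-8") as f:
        return json.load(f).get("encoder") == encoder_name
```

A deployment script would call `index_is_compatible` before starting the agent and trigger a manual rebuild when it returns False.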

Context window assumptions: The 5-line context window is hardcoded. For deeply nested code or languages with significant boilerplate (Java, C++), this may not provide enough context. The agent ends up requesting full file reads anyway.

Concurrency: The index is single-threaded. If two agents query simultaneously while a re-index is in progress, one will block. For multi-agent systems running in parallel, this serializes queries and adds latency that compounds across agents.

When to Use Semble

Use it when:

Your agent needs to search codebases larger than 5,000 lines
Token cost is a constraint (you’re running hundreds of agent sessions per day)
Your agent workflow involves iterative code exploration (debugging, refactoring, feature addition)

Avoid it when:

Your codebase changes at high frequency (multiple commits per minute in CI/CD pipelines). Re-indexing overhead will dominate.
You need query latency well below 200ms. That query time is fine for agent loops but too slow for interactive IDE features.
Your agent searches fewer than 10 distinct code regions per session. The indexing cost won’t pay off.

Technical Verdict

Semble solves a real bottleneck in agent code navigation. The 98% token reduction is achievable because it shifts the compression burden from the LLM (which must read full files to extract relevant lines) to a purpose-built retrieval system. The trade-off is build-time indexing cost and the need to manage index invalidation through filesystem timestamp tracking.

For production agent systems that search code repeatedly, the token savings justify the operational complexity. For agent tasks that search fewer than 10 distinct code regions, grep-and-read is sufficient.

The missing piece is observability. Production deployments need instrumentation to track when semantic retrieval fails and the agent falls back to full file reads. This visibility is essential for optimizing the context window size and tuning the relevance threshold.