<ahref="https://www.buymeacoffee.com/IjonTichy85"><imgsrc="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png"alt="Buy me a coffee"width="140"></a>
**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that captures the general pattern.
Every codebase has unwritten conventions — the order tasks appear in Ansible roles, the resources a Helm chart always creates, the steps every CI pipeline runs. Nobody writes these down. They emerge from copying and converging.
When an LLM agent needs to follow these conventions, it usually has two bad options:
1.**Stuff every existing file into context** — 15 Ansible roles = 5,000 tokens. You'll hit the context window by the third example.
2.**Guess from one or two examples** — the LLM infers a pattern and often gets it wrong.
By leveraging **Minimum Description Length (MDL) scoring**, Dervish treats the grammar discovery problem as an optimal compression task. the resulting rule is optimized to consume as few tokens as possible without losing the pattern.
**Without Dervish:** token cost scales linearly with examples. **With Dervish:** one compact grammar describes them all — a ~60–200 token rule instead of thousands of tokens of raw examples. Try it out and you too will say:
The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (pi.dev, opencode, vibe, etc.) and get grammar inference as a tool:
| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N`, `min_coverage` | **The only tool you need.** Runs CRX + iDRegEx + kOREInference, picks best by MDL. Set `prefer` to run only one algorithm. Set `min_coverage < 1.0` for optional core+outlier analysis. |
- **`prefer`**: `'crx'` for full vocabulary (accepts all sequences), `'idregex'` or `'koreinference'` for deterministic minimal core. Omit to let MDL pick the winner across all three.
- **`kmax`** (1–5): Context window for k-ORE inference (iDRegEx, kOREInference). Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
- **`N`** (1–10): Random trials for k-ORE inference. More = better convergence but slower. Default 3.
- **`min_coverage`** (0.5–1.0): **Optional core+outlier analysis.** When <1.0,iterativelyremovesoutliersequences(thosewiththerarestsymbols)untilatleastthisfractionremain.ReturnsthecoreCRXgrammarforthemajorityplusalistofremovedoutliers.Default1.0 =disabled.Example:`min_coverage=0.8`findsthetightpatternfor~80%ofexampleswhileflaggingtheother~20%asvariants.
**Without Dervish:** the agent stuffs 15 role files into context (5,000+ tokens per role = beyond any context window), or guesses the pattern from 1–2 examples and often gets it wrong.
Many of the things developers build every day **have no formal schema**. They're free-form scripts, config files, or YAML blobs where the structure is emergent convention, not enforced specification. An LLM generating new content in these domains needs to know the convention — but it's never been written down.
Dervish discovers these conventions automatically from existing examples. The domains below are **just examples** of what it can do — the same approach works for any sequential data with an unwritten pattern.
| Domain | What gets extracted | Example extracted symbols | What Dervish discovers | Why it helps an LLM |
| Helm charts (cross-project, 15 charts) | K8s resource kinds from `helm template` output in rendered order | `NetworkPolicy`, `PodDisruptionBudget`, `ServiceAccount`, `Secret`, `ConfigMap`, `Service`, `Deployment`, `StatefulSet`, `ClusterRole`, `ClusterRoleBinding` | `NetworkPolicy?.PodDisruptionBudget?.ServiceAccount?.Secret?.ConfigMap?.PersistentVolumeClaim?.ClusterRole?.ClusterRoleBinding?.Service.Deployment?.StatefulSet?.(IngressClass+MutatingWebhookConfiguration)?.ValidatingWebhookConfiguration?.Job?` | "Writing a Helm chart? Start with resilience (PDB, NetworkPolicy), then identity (ServiceAccount, Secrets), then the Service, then your workload. Only cluster-wide tools need RBAC." |
| GitHub Actions (Go lint) | Step `uses:` or `run:` values from workflow YAML in job order | `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter` | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | "Starting a new Go project on GitHub Actions? Four independent projects converged on: checkout → setup Go → (optional golangci-lint) → (optional megalinter)." |
Dervish has been tested against public datasets from Ansible Galaxy, Helm, and GitHub Actions — all cases where multiple projects independently converged on an undocumented pattern. [**Full details → SHOWCASE.md**](SHOWCASE.md)
The sweet spot: **multiple implementations of the same abstract task** with a shared but undocumented pattern. Not everything works — Dockerfiles, pre-commit configs, and schema-enforced formats are too rigid or too diverse to yield a convention.
> **kOREInference note:** Algorithm 4 (iDRegEx with MDL, arXiv 1004.2372) is included for paper-faithful correctness. On real tool-sequence data, its rwr₀ repair step returns ∅ because the k-OA is rarely SORE (interconnected symbols). The ensemble falls back to CRX or iDRegEx automatically.
Across all public benchmarks, Dervish delivers **40–83× compression**. The grammar is smaller than a single example file would be — and it represents the entire dataset.
- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A specific fixed sequence (`a.b.c.d.e`) has data cost zero because |L(r)| = 1. A grammar that accepts *many* strings of the same length (like `(a+b+...+q)+`) has high data cost.
The ensemble selects the grammar with the lowest total MDL.