**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that describes the general pattern.
The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (pi.dev, opencode, vibe, etc.) and get grammar inference as a tool:
| Tool | Parameters | Description |
|---|---|---|
| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N` | **Recommended.** Runs CRX + iDRegEx, picks best by MDL. Set `prefer='crx'` or `prefer='idregex'` to run one algorithm. |
- **`kmax`** (1–5): Context window for iDRegEx's k-testable automaton. Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
- **`N`** (1–10): Baum-Welch EM iterations for iDRegEx training. More iterations improve convergence but take longer. Default 3 is a good balance.
- **`prefer`**: Skip the CRX-vs-iDRegEx comparison. Use when you know which algorithm fits your data.
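
As an illustration of how these parameters fit together, here is a minimal sketch that builds an MCP `tools/call` request for `infer_best_grammar` and checks the documented ranges before sending. The helper name `build_infer_request` is hypothetical, and the shape of `sequences` (a list of symbol lists) is an assumption; check the tool's published input schema for the actual format.

```python
import json


def build_infer_request(sequences, prefer=None, kmax=2, N=3, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` request for `infer_best_grammar`.

    Hypothetical helper: the argument names come from the tool table,
    but the shape of `sequences` is an assumption.
    """
    if not 1 <= kmax <= 5:
        raise ValueError("kmax must be between 1 and 5")
    if not 1 <= N <= 10:
        raise ValueError("N must be between 1 and 10")
    if prefer not in (None, "crx", "idregex"):
        raise ValueError("prefer must be 'crx', 'idregex', or omitted")

    arguments = {"sequences": sequences, "kmax": kmax, "N": N}
    if prefer is not None:
        # Only send `prefer` when one algorithm should be forced.
        arguments["prefer"] = prefer

    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "infer_best_grammar", "arguments": arguments},
    }


# Example: three short task sequences, default kmax and N, both algorithms.
request = build_infer_request(
    [
        ["package", "template", "service"],
        ["fail", "package", "template", "service"],
        ["package", "file", "template", "service"],
    ]
)
print(json.dumps(request, indent=2))
```

Most MCP clients build this envelope for you, so in practice only the `arguments` object matters.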
**Without Dervish:** the agent either stuffs 15 role files into context (5,000+ tokens per role, so 75,000+ tokens in total, which can crowd out or overflow the available context window), or guesses the pattern from 1–2 examples and often gets it wrong.
Many of the things developers build every day **have no formal schema**. They're free-form scripts, config files, or YAML blobs where the structure is emergent convention, not enforced specification. An LLM generating new content in these domains needs to know the convention — but it's never been written down.
Dervish discovers these conventions automatically from existing examples. The domains below are **just examples** of what it can do — the same approach works for any sequential data with an unwritten pattern.
| Domain | What gets extracted | Example extracted symbols | What Dervish discovers | Why it helps an LLM |
|---|---|---|---|---|
| Ansible roles | Module names from `tasks/main.yml` in order | `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `npm`, `pip`, `lineinfile` | `fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?` | "Validate preconditions first, then set vars, install packages, configure with templates, start services. Include sub-roles after that, followed by optional language packages (`npm`, `pip`) and `lineinfile` edits." |
| Helm charts | K8s resource kinds from `helm template` output in rendered order | `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager` | `ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment` (iDRegEx minimal core) | "Every Prometheus stack needs this bootstrap pipeline. Everything else is optional." |
| GitHub Actions (Go lint) | Step `uses:` or `run:` values from workflow YAML in job order | `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter` | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | "Starting a new Go project? The lint workflow has a near-universal pattern." |
| Terraform modules | Resource type strings from `.tf` files in declaration order | `aws_vpc`, `aws_subnet`, `aws_route_table`, `aws_internet_gateway`, `aws_security_group`, `aws_instance`, `aws_s3_bucket` | Everything optional (domains too different), but certain types always cluster together | "If you see `aws_vpc`, expect subnets, route tables, gateways to follow. The grammar encodes each domain's resource catalogue." |
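
To show how a discovered grammar can be used to check generated output, here is a minimal sketch that hand-translates the GitHub Actions grammar from the table into a Python regular expression and tests candidate step sequences against it. It assumes that `.` means concatenation, an infix `+` inside parentheses means alternation, a postfix `+` means one or more, and `?` means optional. That reading is an interpretation of the notation, not a documented Dervish API, and the helper names are hypothetical.

```python
import re


def tok(symbol):
    # One symbol becomes an escaped token followed by a single space separator.
    return "(?:" + re.escape(symbol) + " )"


# Hand translation of:
# actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?
setup_steps = "(?:" + "|".join(
    tok(s) for s in ["actions/setup-go", "run:echo", "run:sudo"]
) + ")+"

LINT_WORKFLOW = re.compile(
    tok("actions/checkout")
    + setup_steps
    + tok("golangci/golangci-lint-action") + "?"
    + tok("megalinter") + "?"
)


def follows_convention(steps):
    """Return True if the ordered list of step symbols matches the grammar."""
    encoded = "".join(step + " " for step in steps)
    return LINT_WORKFLOW.fullmatch(encoded) is not None


print(follows_convention(
    ["actions/checkout", "actions/setup-go", "golangci/golangci-lint-action"]
))
print(follows_convention(["actions/setup-go", "actions/checkout"]))
```

Encoding each symbol with a trailing space keeps symbols such as `run:echo` and `run:sudo` from being confused with one another. This only works while no symbol contains a space.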
Every single role follows this pattern. The convention was **unwritten** — no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."
Data: 6 different `values.yaml` configurations rendered through `helm template`. Each config produces a sequence of K8s `kind` values in rendered YAML order:
Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules from hashicorp and terraform-aws-modules. Each `.tf` file is parsed for `resource` declarations in order:
Every resource type is optional — modules for different AWS services share no mandatory ordering. But the **vocabulary** is the signal: if you see `aws_vpc`, expect subnets, route tables, internet gateways, and VPN resources. The grammar encodes the resource catalogue of each module domain.
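
The extraction step described above (reading `resource` declarations from each `.tf` file in order) could be sketched as follows. This is a rough illustration that uses a regular expression rather than a real HCL parser, so it assumes conventionally formatted `resource "type" "name" {` lines. The function name and the sample module text are hypothetical.

```python
import re

# Matches lines such as: resource "aws_vpc" "this" {
# Captures only the resource type (the first quoted string).
RESOURCE_RE = re.compile(
    r'^\s*resource\s+"([^"]+)"\s+"[^"]+"\s*\{', re.MULTILINE
)


def resource_sequence(tf_source):
    """Return the resource types in declaration order."""
    return RESOURCE_RE.findall(tf_source)


sample = '''
resource "aws_vpc" "this" {
  cidr_block = var.cidr
}

resource "aws_subnet" "public" {
  vpc_id = aws_vpc.this.id
}

data "aws_availability_zones" "available" {}

# resource "aws_instance" "disabled" {
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.this.id
}
'''

print(resource_sequence(sample))
```

Each module yields one such list. The lists from all modules together form the `sequences` input. `data` blocks and commented-out declarations are skipped because the pattern requires the line to start with `resource`.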
The sweet spot: **multiple implementations of the same abstract task** (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.
- **model_cost**: the number of unique alphabet symbols in the grammar. Grammars that use fewer symbols are cheaper.
- **data_cost**: Σ log₂ |L(r) at length len(s)|, summed over every example sequence s, where |L(r) at length len(s)| is the number of strings of length len(s) that the grammar r accepts. A single fixed sequence (`a.b.c.d.e`) has data cost zero because |L(r)| = 1 and log₂ 1 = 0. A grammar that accepts *many* strings of the same length (like `(a+b+...+q)+`) has a high data cost.
The ensemble selects the grammar with the lowest total MDL.
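
To make the scoring concrete, here is a minimal sketch that computes both costs for two candidate grammars and picks the cheaper one. It rests on several assumptions: total MDL is taken to be `model_cost + data_cost`, symbols are single characters, grammars are written in Python `re` syntax rather than Dervish's notation, and the language size is counted by brute-force enumeration. This is a toy illustration, not Dervish's implementation.

```python
import math
import re
from itertools import product


def language_size(pattern, alphabet, length):
    """Count strings of the given length over `alphabet` accepted by `pattern`."""
    compiled = re.compile(pattern)
    return sum(
        1
        for chars in product(alphabet, repeat=length)
        if compiled.fullmatch("".join(chars))
    )


def model_cost(pattern, alphabet):
    # Number of distinct alphabet symbols that appear in the grammar.
    return sum(1 for symbol in alphabet if symbol in pattern)


def data_cost(pattern, alphabet, sequences):
    total = 0.0
    for seq in sequences:
        if not re.fullmatch(pattern, seq):
            raise ValueError(f"grammar {pattern!r} rejects {seq!r}")
        total += math.log2(language_size(pattern, alphabet, len(seq)))
    return total


def total_mdl(pattern, alphabet, sequences):
    # Assumption: the total is the plain sum of the two costs.
    return model_cost(pattern, alphabet) + data_cost(pattern, alphabet, sequences)


alphabet = "ab"
sequences = ["ab", "abb", "abbb"]
candidates = ["ab+", "[ab]+"]

for pattern in candidates:
    print(pattern, total_mdl(pattern, alphabet, sequences))

best = min(candidates, key=lambda p: total_mdl(p, alphabet, sequences))
print("selected:", best)
```

Here `ab+` accepts exactly one string at each length, so its data cost is 0 and its total is 2. `[ab]+` accepts 4, 8, and 16 strings at lengths 2, 3, and 4, giving a data cost of 2 + 3 + 4 = 9 and a total of 11. The tighter grammar is selected.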