# Dervish
<p align="center"><img src="dervish.gif" alt="Dervish"></p>

**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that describes the general pattern.

## MCP Server
The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (pi.dev, opencode, vibe, etc.) and get grammar inference as a tool:
```json
{
  "mcpServers": {
    "dervish": {
      "command": "python3",
      "args": ["/path/to/bex/mcp_server.py"]
    }
  }
}
```
### Tools

| Tool | Parameters | What it does |
|------|-----------|-------------|
| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N` | **Recommended.** Runs CRX + iDRegEx, picks best by MDL. Set `prefer='crx'` or `prefer='idregex'` to run one algorithm. |
| `infer_grammar` | `sequences`, `method`, `kmax`, `N` | Core single-algorithm inference. `method='crx'` (fast, deterministic) or `method='idregex'` (probabilistic EM). |

**Parameters explained:**

- **`kmax`** (1–5): Context window for iDRegEx's k-testable automaton. Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
- **`N`** (1–10): Baum-Welch EM iterations for iDRegEx training. More iterations = better convergence but slower. Default 3 is a good balance.
- **`prefer`**: Skip the CRX-vs-iDRegEx comparison. Use when you know which algorithm fits your data.
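
For clients that speak MCP directly, calling `infer_best_grammar` is an ordinary JSON-RPC `tools/call` request. The sketch below builds such a request in Python. The argument types (a list of symbol lists for `sequences`, integers for `kmax` and `N`) are assumptions inferred from the table above, and `build_tool_call` is a hypothetical helper, not part of Dervish.

```python
import json


def build_tool_call(sequences, prefer=None, kmax=2, N=3, request_id=1):
    """Build an MCP `tools/call` request for infer_best_grammar (hypothetical helper)."""
    # Ranges taken from the parameter descriptions above.
    if not 1 <= kmax <= 5:
        raise ValueError("kmax must be between 1 and 5")
    if not 1 <= N <= 10:
        raise ValueError("N must be between 1 and 10")
    arguments = {"sequences": sequences, "kmax": kmax, "N": N}
    if prefer is not None:
        arguments["prefer"] = prefer  # 'crx' or 'idregex'
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "infer_best_grammar", "arguments": arguments},
    }


request = build_tool_call(
    [["file", "template", "service"], ["file", "template"]],
    prefer="crx",
)
print(json.dumps(request, indent=2))
```

Most MCP clients construct this envelope for you; the sketch only makes the shape of the arguments explicit.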
### Agent workflow
An LLM agent uses the MCP to discover an unwritten convention from existing examples — compressing hundreds of files into a single ~60-token rule:
```
User: Generate a new Ansible role for installing PostgreSQL.

Agent: Let me check what pattern the existing community roles follow.
       I'll look at 15 popular geerlingguy roles.

       [finds role directories, extracts task module sequences,
        calls infer_best_grammar(sequences=..., prefer='crx')]

Dervish returns:
  Best: CRX (MDL 288)
  Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
           .include+?.(npm+pip)+?.lineinfile?

This tells me: a role may start with a fail check for preconditions,
then loads OS-specific variables, installs packages, configures with templates,
starts services, and optionally handles language tooling (npm/pip).
The role may end with a lineinfile tweak.

I'll generate the new role following this structure.
```

**Without Dervish:** the agent stuffs 15 role files into context (5,000+ tokens per role, so 75,000+ tokens in total, which strains or exceeds many context windows), or guesses the pattern from 1–2 examples and often gets it wrong.

**With Dervish:** one MCP call returns a ~60-token grammar known to match 15/15 existing roles. The agent follows it reliably.

## Quick Start
```bash
pip install pyyaml
python -m bex
```
```python
from bex import infer_ensemble
seqs = [
['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```
## Why not just use a schema?

Many of the things developers build every day **have no formal schema**. They're free-form scripts, config files, or YAML blobs where the structure is emergent convention, not enforced specification. An LLM generating new content in these domains needs to know the convention, but it has never been written down.

Dervish discovers these conventions automatically from existing examples. The domains below are **just examples** of what it can do; the same approach works for any sequential data with an unwritten pattern.

| Domain | What gets extracted | Example extracted symbols | What Dervish discovers | Why it helps an LLM |
|--------|-------------------|--------------------------|----------------------|---------------------|
| Ansible roles | Module names from `tasks/main.yml` in order | `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `npm`, `pip`, `lineinfile` | `fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?` | "Validate preconditions first, then set vars, install packages, configure with templates, start services. Include sub-roles last." |
| Helm charts | K8s resource kinds from `helm template` output in rendered order | `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager` | `ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment` (iDRegEx minimal core) | "Every Prometheus stack needs this bootstrap pipeline. Everything else is optional." |
| GitHub Actions (Go lint) | Step `uses:` or `run:` values from workflow YAML in job order | `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter` | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | "Starting a new Go project on GitHub Actions? Four independent projects converged on: checkout → setup Go → golangci-lint → (optionally megalinter)." |

## Real-world Results
### Ansible Galaxy (15 roles, 44+ modules each)

Data: All 15 [geerlingguy Galaxy roles](https://github.com/geerlingguy) (nginx, php, mysql, docker, etc.).

Each role's `tasks/main.yml` is parsed into a sequence of module names. Here are the sequences from two roles:
```
docker: fail → include_vars → include_tasks → package → package → package → ...
nginx: fail → include_vars → set_fact → package → file → template → service → ...
```
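
The extraction step could look roughly like the sketch below. It assumes the task list has already been loaded (for example with `yaml.safe_load`) into a list of dicts, and that a task's module is the first key that is not a task-level keyword. The keyword set is partial, and both `TASK_KEYWORDS` and `module_sequence` are hypothetical names, not Dervish's actual extractor. `block:` sections are not handled.

```python
# Task-level keywords that are not modules (a partial, assumed list).
TASK_KEYWORDS = {
    "name", "when", "tags", "register", "become", "become_user", "loop",
    "notify", "vars", "changed_when", "failed_when", "ignore_errors",
    "no_log", "delegate_to", "environment", "check_mode", "until",
    "retries", "delay", "args",
}


def module_sequence(tasks):
    """Return the module name of each task, in file order."""
    sequence = []
    for task in tasks:
        for key in task:
            if key in TASK_KEYWORDS or key.startswith("with_"):
                continue
            # Drop a collection prefix such as 'ansible.builtin.'.
            sequence.append(key.rsplit(".", 1)[-1])
            break
    return sequence


# Illustrative tasks, shaped like the result of parsing tasks/main.yml.
tasks = [
    {"name": "Check OS", "fail": {"msg": "unsupported"}, "when": "unsupported_os"},
    {"name": "Load vars", "include_vars": "{{ ansible_os_family }}.yml"},
    {"name": "Install", "ansible.builtin.package": {"name": "nginx"}},
    {"name": "Start", "service": {"name": "nginx", "state": "started"}, "become": True},
]
print(module_sequence(tasks))  # ['fail', 'include_vars', 'package', 'service']
```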
The extracted symbols are Ansible module names like `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `systemd`, `get_url`, `shell`, `npm`, `pip`, `lineinfile`, `copy`, `unarchive`, `yum`, `apt`, `command`, `user`, `group`, `git`, `mount`, `cron`, `debug`, `iptables`, `ufw`, `hostname`, `sysctl`, `timezone`, `selinux`, `firewalld`, `homebrew`, `supervisorctl`, `postgresql_db`, `mysql_db`: 50+ unique modules across the 15 roles.

```
Best: CRX (MDL 288, 15/15 match)
Grammar:
fail?.(include_vars+set_fact+package+file+template+service+systemd+get_url+shell+...)+
.include+?.(npm+pip)+?.lineinfile?
```
Every single role follows this pattern. The convention was **unwritten** — no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."

As far as we know, this is the first explicit description of the geerlingguy role module ordering convention.

**Compression:** The grammar is ~250 chars. The 15 examples are 7200+ modules combined. **~29× compression.**

### Helm (kube-prometheus-stack, 6 CI configs)

Data: 6 different `values.yaml` configurations rendered through `helm template`. Each config produces a sequence of K8s `kind` values in rendered YAML order:
```
config-1: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ServiceMonitor → PrometheusRule
config-2: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ConfigMap → ServiceMonitor
config-3: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → Alertmanager → Prometheus
```
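
One way to obtain such a sequence from rendered output is sketched below. It assumes the `helm template` output is available as a string and that each resource's `kind:` sits at column 0, so nested `kind:` fields (for example under `roleRef`) are skipped. `kind_sequence` is a hypothetical helper, not part of Dervish.

```python
import re

# Matches only top-level `kind:` lines; indented ones do not start at column 0.
KIND_LINE = re.compile(r"^kind:\s*(\S+)", re.MULTILINE)


def kind_sequence(rendered_yaml):
    """Return resource kinds in the order they appear in rendered manifests."""
    return KIND_LINE.findall(rendered_yaml)


# Illustrative rendered output (heavily shortened).
rendered = """\
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: demo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: demo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
roleRef:
  kind: ClusterRole
  name: demo
"""
print(kind_sequence(rendered))  # ['ServiceAccount', 'ClusterRole', 'ClusterRoleBinding']
```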
Extracted symbols include `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager`, `Prometheus`, `PrometheusRule`, `ServiceMonitor`, `Role`, `RoleBinding`, `Job`, `DaemonSet`, `Secret`, `ValidatingWebhookConfiguration`; there are 19 kinds in total.

```
Best: iDRegEx (MDL 1433)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
```
iDRegEx finds the **minimum core** — what every config always deploys. CRX captures the full vocabulary (19 resource kinds). Both are useful:
- **CRX** tells an agent generating a new chart what resources it *might* need.
- **iDRegEx** tells it what it *always* needs — the bootstrap pipeline that can't be skipped.
### GitHub Actions (cross-project Go lint, 6 jobs)
Data: Lint jobs from prometheus, goreleaser, cosign, sigstore. Each job's steps are extracted as `uses:` or `run:` values:
```
prometheus lint: actions/checkout → actions/setup-go → run:sudo → run:echo → golangci/golangci-lint-action → golangci/golangci-lint-action → ...
goreleaser lint: actions/checkout → actions/setup-go → gitleaks/gitleaks-action → golangci/golangci-lint-action
cosign lint: actions/checkout → ossf/scorecard-action → actions/upload-artifact → github/codeql-action/upload-sarif
```
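
A minimal sketch of how steps might be turned into symbols like these: it assumes the job's `steps` list has already been parsed from the workflow YAML, that the version suffix after `@` is dropped from `uses:` values, and that a `run:` step is reduced to its first word (as `run:sudo` and `run:echo` suggest). `step_symbol` and `job_sequence` are hypothetical helpers, and the version tags in the sample are illustrative.

```python
def step_symbol(step):
    """Map one workflow step to a symbol, or None if it has neither uses nor run."""
    if "uses" in step:
        # 'actions/checkout@v4' -> 'actions/checkout'
        return step["uses"].split("@", 1)[0]
    if "run" in step:
        words = step["run"].split()
        # 'sudo apt-get update' -> 'run:sudo'
        return "run:" + words[0] if words else "run:"
    return None


def job_sequence(steps):
    """Return the symbol sequence for a job's steps, in job order."""
    symbols = (step_symbol(step) for step in steps)
    return [symbol for symbol in symbols if symbol is not None]


steps = [
    {"uses": "actions/checkout@v4"},
    {"uses": "actions/setup-go@v5", "with": {"go-version": "stable"}},
    {"run": "sudo apt-get update"},
    {"run": "echo linting"},
    {"uses": "golangci/golangci-lint-action@v6"},
]
print(job_sequence(steps))
```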

Extracted symbols: `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter`, `gitleaks/gitleaks-action`, `ossf/scorecard-action`, `github/codeql-action/*`, and `run:*` commands.

```
Best: CRX (MDL 13.6)
Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?
```

In this sample, the typical Go lint job follows: checkout → setup Go → run golangci-lint. Only the biggest projects add megalinter.

### What doesn't work
Not every domain has an unwritten convention. Grammar inference failed (produced trivial `(a+b+c+...)+` grammars) on:
- **Dockerfiles** — too simple (`FROM → RUN → COPY → CMD` is just the Dockerfile spec)
- **Pre-commit configs** (cross-project) — 252 unique hook IDs, no common core
- **GitHub Actions per-project** — too many different job types (build, lint, release, security) in one repo
- **Prometheus recording rules** — schema-enforced structure, no convention to discover

The sweet spot: **multiple implementations of the same abstract task** (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.

## Algorithm Selection Guide
| When | Use | Why |
|------|-----|-----|
| Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |
## When each algorithm wins
| Data property | Winner | Why |
|---------------|--------|-----|
| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
| Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
| Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
| 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
## How MDL scoring works
```
MDL = model_cost + data_cost
```
- **model_cost**: the number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost**: Σ log₂(Nₛ) over all sequences s, where Nₛ is the number of strings of length len(s) that the grammar r accepts. A specific fixed sequence (`a.b.c.d.e`) has data cost zero because it accepts exactly one string of that length (log₂ 1 = 0). A grammar that accepts *many* strings of the same length (like `(a+b+...+q)+`) has high data cost.

The ensemble selects the grammar with the lowest total MDL.
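
To make the formula concrete, here is a small sketch that computes MDL for a grammar given as a deterministic automaton. The DFA representation (`{state: {symbol: next_state}}`) and the function names are assumptions made for illustration; Dervish's own scoring code may be organized differently. The sketch also assumes every sequence is accepted, since log₂(0) is undefined.

```python
import math


def count_accepted(transitions, start, accepting, length):
    """Count the strings of exactly `length` symbols that the DFA accepts."""
    counts = {start: 1}  # number of distinct strings reaching each state
    for _ in range(length):
        nxt = {}
        for state, ways in counts.items():
            for target in transitions.get(state, {}).values():
                nxt[target] = nxt.get(target, 0) + ways
        counts = nxt
    return sum(ways for state, ways in counts.items() if state in accepting)


def mdl(transitions, start, accepting, sequences):
    """MDL = model_cost + data_cost, following the definitions above."""
    symbols = {sym for edges in transitions.values() for sym in edges}
    model_cost = len(symbols)
    data_cost = sum(
        math.log2(count_accepted(transitions, start, accepting, len(seq)))
        for seq in sequences
    )
    return model_cost + data_cost


sequences = [["a", "b", "c"]]

# a.b.c accepts exactly one string of length 3, so data_cost is 0.
chain = {0: {"a": 1}, 1: {"b": 2}, 2: {"c": 3}}
print(mdl(chain, 0, {3}, sequences))  # 3.0

# (a+b+c)+ accepts 3**3 = 27 strings of length 3, so data_cost is log2(27).
loop = {0: {"a": 1, "b": 1, "c": 1}, 1: {"a": 1, "b": 1, "c": 1}}
print(mdl(loop, 0, {1}, sequences))
```

Both grammars use three symbols, so the model cost ties and the data cost decides: the fixed chain wins because it commits to a single string per length.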
## Grammar Notation
- `a.b` — `a` followed by `b` (concatenation)
- `(a+b)` — either `a` or `b` (disjunction)
- `r?` — zero or one (optional)
- `r+` — one or more (iteration)
- `r+?` — zero or more (varies across examples)
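
The notation can be checked against a sequence by translating it into a Python `re` pattern, as in the sketch below. This is an illustration under stated assumptions, not Dervish's parser: symbols are assumed to contain only letters, digits, `_`, `/`, `:`, `*` and `-`; a `+` directly followed by a symbol or `(` is read as disjunction, otherwise as iteration; disjunctions are assumed to be parenthesized; and grammars abbreviated with `...` cannot be parsed. `to_python_regex` and `matches` are hypothetical names.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9_/:*-]+|[.()+?]")
OPERATORS = {".", ")", "+", "?"}


def to_python_regex(grammar):
    """Translate the grammar notation into a compiled Python regex."""
    tokens = TOKEN.findall(grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == ".":
            pass  # concatenation is implicit in Python regexes
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "?"):
            out.append(tok)
        elif tok == "+":
            if nxt == "?":
                out.append("*")  # r+? means zero or more
                i += 1
            elif nxt is not None and nxt not in OPERATORS:
                out.append("|")  # infix +: disjunction
            else:
                out.append("+")  # postfix +: iteration
        else:
            # Each symbol is matched together with a trailing space delimiter.
            out.append("(?:" + re.escape(tok) + " )")
        i += 1
    return re.compile("".join(out))


def matches(pattern, sequence):
    """Check whether a whole symbol sequence is accepted by the pattern."""
    encoded = "".join(symbol + " " for symbol in sequence)
    return pattern.fullmatch(encoded) is not None


lint = to_python_regex(
    "actions/checkout.(actions/setup-go+run:echo+run:sudo)+"
    ".golangci/golangci-lint-action?.megalinter?"
)
print(matches(lint, ["actions/checkout", "actions/setup-go",
                     "golangci/golangci-lint-action"]))  # True
print(matches(lint, ["actions/checkout",
                     "golangci/golangci-lint-action"]))  # False
```

The second sequence is rejected because the parenthesized group is followed by `+`, which requires at least one of `actions/setup-go`, `run:echo` or `run:sudo`.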
## Papers
- **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"* — TODS 2010
- **Bex et al.** *"Inferring k-optimal REs from Positive Data"* — arXiv:1004.2372
## Tests
```bash
python -m pytest tests/
```
## License
MIT