# Dervish

<p align="center"><img src="dervish.gif" alt="Dervish"></p>

**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that describes the general pattern.

## MCP Server

The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (pi.dev, opencode, vibe, etc.) and get grammar inference as a tool:

```json
{
  "mcpServers": {
    "dervish": {
      "command": "python3",
      "args": ["/path/to/bex/mcp_server.py"]
    }
  }
}
```

### Tools

| Tool | Parameters | What it does |
|------|------------|--------------|
| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N` | **Recommended.** Runs CRX + iDRegEx, picks best by MDL. Set `prefer='crx'` or `prefer='idregex'` to run one algorithm. |
| `infer_grammar` | `sequences`, `method`, `kmax`, `N` | Core single-algorithm inference. `method='crx'` (fast, deterministic) or `method='idregex'` (probabilistic EM). |

**Parameters explained:**

- **`kmax`** (1–5): Context window for iDRegEx's k-testable automaton. Higher values capture longer-range dependencies but need more data and are slower. The default of 2 works for most cases.
- **`N`** (1–10): Baum-Welch EM iterations for iDRegEx training. More iterations improve convergence but take longer. The default of 3 is a good balance.
- **`prefer`**: Skips the CRX-vs-iDRegEx comparison and runs only the named algorithm. Use it when you know which algorithm fits your data.

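For clients that speak MCP directly, a tool invocation is a JSON-RPC 2.0 `tools/call` request. The sketch below builds one for `infer_best_grammar` from the parameters listed above. The helper name `build_tool_call`, the request `id`, the sample sequences, and the assumption that `sequences` is passed as a list of symbol lists are illustrative. Transport and session setup are left to the MCP client.

```python
import json

def build_tool_call(sequences, prefer=None, kmax=2, N=3, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` request for infer_best_grammar.

    Hypothetical helper: only the tool name, parameter names, ranges and
    defaults come from the README; everything else is illustrative.
    """
    if not 1 <= kmax <= 5:
        raise ValueError("kmax must be in 1..5")
    if not 1 <= N <= 10:
        raise ValueError("N must be in 1..10")
    arguments = {"sequences": sequences, "kmax": kmax, "N": N}
    if prefer is not None:
        if prefer not in ("crx", "idregex"):
            raise ValueError("prefer must be 'crx' or 'idregex'")
        arguments["prefer"] = prefer
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "infer_best_grammar", "arguments": arguments},
    }

request = build_tool_call(
    [
        ["fail", "include_vars", "package", "service"],
        ["fail", "include_vars", "package", "template", "service"],
    ],
    prefer="crx",
)
print(json.dumps(request, indent=2))
```

Leaving `prefer` unset keeps the ensemble comparison, which matches the recommended default.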
### Agent workflow

An LLM agent uses the MCP server to discover an unwritten convention from existing examples, compressing hundreds of files into a single ~60-token rule:

```
User: Generate a new Ansible role for installing PostgreSQL.

Agent: Let me check what pattern the existing community roles follow.
       I'll look at 15 popular geerlingguy roles.

       [finds role directories, extracts task module sequences,
        calls infer_best_grammar(sequences=..., prefer='crx')]

       Dervish returns:
         Best: CRX (MDL 288)
         Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
                  .include+?.(npm+pip)+?.lineinfile?

       This tells me: a role starts with an optional fail check for preconditions,
       then OS-specific variables, installs packages, configures with templates,
       starts services, and optionally handles language tooling (npm/pip).
       The role may end with a lineinfile tweak.

       I'll generate the new role following this structure.
```

**Without Dervish:** the agent stuffs 15 role files into context (5,000+ tokens per role, so 75,000+ tokens in total, enough to strain or exceed many context windows), or guesses the pattern from 1–2 examples and often gets it wrong.

**With Dervish:** one MCP call returns a ~60-token grammar known to match 15/15 existing roles. The agent follows it reliably.

## Quick Start

```bash
pip install pyyaml
python -m bex
```

```python
from bex import infer_ensemble

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```

## Why not just use a schema?

Many of the things developers build every day **have no formal schema**. They're free-form scripts, config files, or YAML blobs where the structure is emergent convention, not enforced specification. An LLM generating new content in these domains needs to know the convention, but it has never been written down.

Dervish discovers these conventions automatically from existing examples. The domains below are **just examples** of what it can do; the same approach works for any sequential data with an unwritten pattern.

| Domain | What gets extracted | Example extracted symbols | What Dervish discovers | Why it helps an LLM |
|--------|---------------------|---------------------------|------------------------|---------------------|
| Ansible roles | Module names from `tasks/main.yml` in order | `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `npm`, `pip`, `lineinfile` | `fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?` | "Validate preconditions first, then set vars, install packages, configure with templates, start services. Include sub-roles last." |
| Helm charts | K8s resource kinds from `helm template` output in rendered order | `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager` | `ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment` (iDRegEx minimal core) | "Every Prometheus stack needs this bootstrap pipeline. Everything else is optional." |
| GitHub Actions (Go lint) | Step `uses:` or `run:` values from workflow YAML in job order | `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter` | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | "Starting a new Go project on GitHub Actions? Four independent projects converged on: checkout → setup Go → golangci-lint → (optionally megalinter)." |

## Real-world Results

### Ansible Galaxy (15 roles, 44+ modules each)

Data: 15 popular [geerlingguy Galaxy roles](https://github.com/geerlingguy) (nginx, php, mysql, docker, etc.).

Each role's `tasks/main.yml` is parsed into a sequence of module names. Here are the sequences from two roles:

```
docker: fail → include_vars → include_tasks → package → package → package → ...
nginx:  fail → include_vars → set_fact → package → file → template → service → ...
```

The extracted symbols are Ansible module names like `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `systemd`, `get_url`, `shell`, `npm`, `pip`, `lineinfile`, `copy`, `unarchive`, `yum`, `apt`, `command`, `user`, `group`, `git`, `mount`, `cron`, `debug`, `iptables`, `ufw`, `hostname`, `sysctl`, `timezone`, `selinux`, `firewalld`, `homebrew`, `supervisorctl`, `postgresql_db`, `mysql_db`. There are 50+ unique modules across the 15 roles.

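The extraction step can be sketched as follows. This is a minimal illustration, not Dervish's own extractor: it assumes the task file has already been parsed into a list of dicts (for example with `yaml.safe_load`), and the function name `module_sequence` and the `TASK_KEYWORDS` set are hypothetical. The keyword set is a hand-picked, incomplete list of task-level keys that are not modules.

```python
# Task-level keywords that are not modules (a partial, hand-picked list).
TASK_KEYWORDS = {
    "name", "when", "tags", "become", "become_user", "register", "notify",
    "vars", "with_items", "loop", "loop_control", "changed_when",
    "failed_when", "ignore_errors", "args", "environment", "delegate_to",
    "check_mode", "no_log", "until", "retries", "delay", "run_once",
}

def module_sequence(tasks):
    """Return the module name of each task, in file order."""
    sequence = []
    for task in tasks:
        candidates = [key for key in task if key not in TASK_KEYWORDS]
        if len(candidates) != 1:
            continue  # zero or several candidate keys: skipped in this sketch
        # Reduce fully qualified names such as ansible.builtin.package.
        sequence.append(candidates[0].rsplit(".", 1)[-1])
    return sequence

tasks = [
    {"name": "Check OS", "fail": {"msg": "Unsupported OS"},
     "when": "ansible_os_family == 'Windows'"},
    {"name": "Load variables", "include_vars": "{{ ansible_os_family }}.yml"},
    {"name": "Install nginx", "ansible.builtin.package": {"name": "nginx"},
     "become": True},
    {"name": "Start nginx", "service": {"name": "nginx", "state": "started"}},
]
print(module_sequence(tasks))  # ['fail', 'include_vars', 'package', 'service']
```

Running this over each role yields one symbol sequence per role, which is the input shape `infer_ensemble` takes in the Quick Start.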
```
Best: CRX (MDL 288, 15/15 match)
Grammar:
  fail?.(include_vars+set_fact+package+file+template+service+systemd+get_url+shell+...)+
  .include+?.(npm+pip)+?.lineinfile?
```

Every single role follows this pattern. The convention was **unwritten**: no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."

To our knowledge, this is the first explicit description of the geerlingguy role module ordering convention.

**Compression:** The grammar is ~250 chars. The 15 examples are 7200+ modules combined. **~29× compression.**

### Helm (kube-prometheus-stack, 6 CI configs)

Data: 6 different `values.yaml` configurations rendered through `helm template`. Each config produces a sequence of K8s `kind` values in rendered YAML order:

```
config-1: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ServiceMonitor → PrometheusRule
config-2: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ConfigMap → ServiceMonitor
config-3: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → Alertmanager → Prometheus
```

Extracted symbols include `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager`, `Prometheus`, `PrometheusRule`, `ServiceMonitor`, `Role`, `RoleBinding`, `Job`, `DaemonSet`, `Secret`, `ValidatingWebhookConfiguration`. There are 19 kinds in total.

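One way to get such a sequence from `helm template` output is to collect the top-level `kind:` lines in order. The sketch below is an assumption about how the extraction could be done, not Dervish's own code; the function name `kind_sequence` and the sample manifest are illustrative.

```python
import re

# Top-level `kind:` keys start at column 0; nested ones (for example in
# roleRef or subjects) are indented and therefore ignored.
KIND_LINE = re.compile(r"^kind:\s*(\S+)", re.MULTILINE)

def kind_sequence(rendered_yaml):
    """Return the `kind` of each rendered document, in output order."""
    return KIND_LINE.findall(rendered_yaml)

rendered = """\
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: demo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: demo
roleRef:
  kind: ClusterRole
  name: demo
subjects:
  - kind: ServiceAccount
    name: demo
"""
print(kind_sequence(rendered))  # ['ServiceAccount', 'ClusterRoleBinding']
```

A line-based scan avoids a YAML dependency, at the cost of missing manifests that quote the value or wrap resources in a `List`.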
```
Best: iDRegEx (MDL 1433)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

iDRegEx  MDL= 1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX      MDL= 2651.74  (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
```

iDRegEx finds the **minimum core**: what every config always deploys. CRX captures the full vocabulary (19 resource kinds). Both are useful:

- **CRX** tells an agent generating a new chart what resources it *might* need.
- **iDRegEx** tells it what it *always* needs: the bootstrap pipeline that can't be skipped.

### GitHub Actions (cross-project Go lint, 6 jobs)

Data: Lint jobs from prometheus, goreleaser, cosign, sigstore. Each job's steps are extracted as `uses:` or `run:` values:

```
prometheus lint: actions/checkout → actions/setup-go → run:sudo → run:echo → golangci/golangci-lint-action → golangci/golangci-lint-action → ...
goreleaser lint: actions/checkout → actions/setup-go → gitleaks/gitleaks-action → golangci/golangci-lint-action
cosign lint:     actions/checkout → ossf/scorecard-action → actions/upload-artifact → github/codeql-action/upload-sarif
```

Extracted symbols: `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter`, `gitleaks/gitleaks-action`, `ossf/scorecard-action`, `github/codeql-action/*`, and `run:*` commands.

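A possible normalisation that yields symbols of this shape is sketched below. It assumes the workflow has already been parsed into a list of step dicts. The rules (drop the `@version` suffix of `uses:` values, keep only the first word of `run:` commands) are inferred from the symbol names above rather than taken from Dervish's code, and `step_symbols` is a hypothetical name.

```python
def step_symbols(steps):
    """Map workflow steps to symbols such as `actions/checkout` or `run:sudo`."""
    symbols = []
    for step in steps:
        if "uses" in step:
            # actions/checkout@v4 -> actions/checkout
            symbols.append(step["uses"].split("@", 1)[0])
        elif "run" in step:
            words = step["run"].split()
            if words:
                # "sudo apt-get install ..." -> run:sudo
                symbols.append("run:" + words[0])
    return symbols

steps = [
    {"uses": "actions/checkout@v4"},
    {"uses": "actions/setup-go@v5", "with": {"go-version": "1.22"}},
    {"run": "sudo apt-get install -y libsnmp-dev"},
    {"name": "Lint", "uses": "golangci/golangci-lint-action@v6"},
]
print(step_symbols(steps))
# ['actions/checkout', 'actions/setup-go', 'run:sudo', 'golangci/golangci-lint-action']
```

Collapsing versions and command arguments keeps the alphabet small, which matters because each distinct symbol adds to the model cost.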
```
Best: CRX (MDL 13.6)
Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?
```

The shared core of these Go lint jobs is: checkout → setup Go → run golangci-lint. Only the biggest projects add megalinter.

### What doesn't work

Not every domain has an unwritten convention. Grammar inference failed (produced trivial `(a+b+c+...)+` grammars) on:

- **Dockerfiles**: too simple (`FROM → RUN → COPY → CMD` is just the Dockerfile spec)
- **Pre-commit configs** (cross-project): 252 unique hook IDs, no common core
- **GitHub Actions per-project**: too many different job types (build, lint, release, security) in one repo
- **Prometheus recording rules**: schema-enforced structure, no convention to discover

The sweet spot: **multiple implementations of the same abstract task** (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.

## Algorithm Selection Guide

| When | Use | Why |
|------|-----|-----|
| Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |

## When each algorithm wins

| Data property | Winner | Why |
|---------------|--------|-----|
| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
| Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
| Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
| 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |

## How MDL scoring works

```
MDL = model_cost + data_cost
```

- **model_cost**: number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost**: Σ log₂(|L(r) at length len(s)|) across all sequences s, where |L(r) at length n| is the number of strings of length n that the grammar r accepts. A specific fixed sequence (`a.b.c.d.e`) has data cost zero because |L(r)| = 1. A grammar that accepts *many* strings of the same length (like `(a+b+...+q)+`) has high data cost.

The ensemble selects the grammar with the lowest total MDL.

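The two extremes described above can be checked numerically. The sketch below is a brute-force illustration of the formula as stated, not Dervish's scorer, which may include terms not described here. It assumes single-character symbols and takes the grammar as an ordinary Python regex; `data_cost` and `mdl` are hypothetical names.

```python
import math
import re
from itertools import product

def data_cost(pattern, alphabet, sequences):
    """Sum over sequences of log2(number of accepted strings of that length).

    Counting is brute force, so this only suits tiny alphabets and short
    sequences. Each sequence is assumed to be accepted by `pattern`.
    """
    compiled = re.compile(pattern)
    counts = {}  # sequence length -> number of accepted strings of that length
    total = 0.0
    for seq in sequences:
        n = len(seq)
        if n not in counts:
            counts[n] = sum(
                1 for candidate in product(alphabet, repeat=n)
                if compiled.fullmatch("".join(candidate))
            )
        total += math.log2(counts[n])
    return total

def mdl(pattern, symbols_in_grammar, alphabet, sequences):
    """model_cost (unique symbols in the grammar) + data_cost."""
    return len(set(symbols_in_grammar)) + data_cost(pattern, alphabet, sequences)

alphabet = "abcde"
sequences = ["abcde"]
fixed = mdl("abcde", "abcde", alphabet, sequences)      # a.b.c.d.e
trivial = mdl("[a-e]+", "abcde", alphabet, sequences)   # (a+b+c+d+e)+
print(f"fixed:   {fixed:.2f}")    # 5.00
print(f"trivial: {trivial:.2f}")  # 16.61
```

Both grammars use the same five symbols, so the model cost ties at 5. The trivial grammar accepts 5⁵ = 3125 strings of length 5 and pays log₂ 3125 ≈ 11.61 bits for the one sequence, while the fixed grammar pays nothing.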
## Grammar Notation

- `a.b`: `a` followed by `b` (concatenation)
- `(a+b)`: either `a` or `b` (disjunction)
- `r?`: zero or one (optional)
- `r+`: one or more (iteration)
- `r+?`: zero or more (varies across examples)

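To test a candidate sequence against a grammar in this notation, the grammar can be translated into a Python `re` pattern. The sketch below is not part of Dervish and rests on assumptions: symbols contain no whitespace and none of `. ( ) + ?`, disjunctions always appear inside parentheses, and the grammar is complete (no `...` elisions). The names `to_python_regex` and `matches` are hypothetical.

```python
import re

# Tokens that can follow a postfix `+` (anything else makes the `+` infix).
POSTFIX_FOLLOWERS = {".", ")", "?", "+"}

def to_python_regex(grammar):
    """Translate Dervish notation into a Python regex over 'sym sym sym '."""
    tokens = re.findall(r"[^.()+?\s]+|[.()+?]", grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == ".":
            pass                      # concatenation is implicit in `re`
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "?"):
            out.append(tok)
        elif tok == "+":
            if nxt is not None and nxt not in POSTFIX_FOLLOWERS:
                out.append("|")       # infix `+`: disjunction
            elif nxt == "?":
                out.append("*")       # `r+?`: zero or more
                i += 1                # consume the `?` as well
            else:
                out.append("+")       # postfix `+`: one or more
        else:
            # A symbol, followed by the single space used as separator.
            out.append("(?:" + re.escape(tok) + " )")
        i += 1
    return "".join(out)

def matches(grammar, sequence):
    """True if the whole sequence is accepted by the grammar."""
    text = "".join(symbol + " " for symbol in sequence)
    return re.fullmatch(to_python_regex(grammar), text) is not None

grammar = ("actions/checkout.(actions/setup-go+run:echo+run:sudo)+"
           ".golangci/golangci-lint-action?.megalinter?")
print(matches(grammar, ["actions/checkout", "actions/setup-go",
                        "golangci/golangci-lint-action"]))   # True
print(matches(grammar, ["actions/setup-go"]))                # False
print(matches("a.b+?.c", ["a", "c"]))                        # True
```

The explicit `.` for concatenation is what makes `+` unambiguous: a `+` directly followed by a symbol or `(` is a disjunction, and any other `+` is a postfix iteration. Note that `r+?` must become `*`, because `+?` in Python means a lazy one-or-more.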
## Papers

- **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"*, TODS 2010
- **Bex et al.** *"Inferring k-optimal REs from Positive Data"*, arXiv:1004.2372

## Tests

```bash
python -m pytest tests/
```

## License

MIT