tobi/grammar-inference-engine


BEX-based grammar inference engine: learn regular expression patterns from example sequences. Supports CHAREs (CRX), k-OREs (iDRegEx), and the full BEX pipeline (SOA→2T-INF→RWR₀→CRX / iKoa→BW→Disambiguate→Prune→rwr²).


tobjend b05c3ee116 rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs		2026-07-01 11:05:03 +02:00
bex	move format-specific adapters to examples/, purge format-specific MCP tools	2026-07-01 10:36:14 +02:00
bin	Add bin/mcp-server wrapper script for robust path resolution	2026-07-01 08:06:17 +02:00
examples	move format-specific adapters to examples/, purge format-specific MCP tools	2026-07-01 10:36:14 +02:00
papers	Initial commit: BEX-based grammar inference engine	2026-07-01 08:01:16 +02:00
tests	Initial commit: BEX-based grammar inference engine	2026-07-01 08:01:16 +02:00
.gitignore	Initial commit: BEX-based grammar inference engine	2026-07-01 08:01:16 +02:00
AGENTS.md	Initial commit: BEX-based grammar inference engine	2026-07-01 08:01:16 +02:00
blog_post.md	rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs	2026-07-01 11:05:03 +02:00
chart_context_cost.png	rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs	2026-07-01 11:05:03 +02:00
chart_token_savings.png	rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs	2026-07-01 11:05:03 +02:00
dervish.gif	Rename to Dervish, add animated logo to README	2026-07-01 10:19:08 +02:00
make_charts.py	rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs	2026-07-01 11:05:03 +02:00
pyproject.toml	Add MCP server: grammar inference via FastMCP	2026-07-01 08:03:10 +02:00
README.md	rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs	2026-07-01 11:05:03 +02:00
requirements.txt	Initial commit: BEX-based grammar inference engine	2026-07-01 08:01:16 +02:00
SHOWCASE.md	remove Terraform showcase (everything-optional grammar isn't useful); fix GHA scope claim	2026-07-01 10:42:08 +02:00

README.md

Dervish MCP

Dervish

Dervish infers regular expression grammars from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that captures the general pattern.

Every codebase has unwritten conventions — the order tasks appear in Ansible roles, the resources a Helm chart always creates, the steps every CI pipeline runs. Nobody writes these down. They emerge from copying and converging.

When an LLM agent needs to follow these conventions, it usually has two bad options:

Stuff every existing file into context — 15 Ansible roles = 5,000 tokens. You'll hit the context window by the third example.
Guess from one or two examples — the LLM infers a pattern and often gets it wrong.

Dervish replaces both with a one-call MCP tool: pass your sequences, get back a ~60-token grammar. A rule you can trust, at a fraction of the cost.

Without Dervish: token cost scales linearly with examples. With Dervish: one compact grammar describes them all — a ~60–200 token rule instead of thousands of tokens of raw examples.

MCP Server

The primary interface is a Model Context Protocol (MCP) server. Connect any MCP-compatible client (pi.dev, opencode, vibe, etc.) and get grammar inference as a tool:

{
  "mcpServers": {
    "dervish": {
      "command": "python3",
      "args": ["/path/to/bex/mcp_server.py"]
    }
  }
}

Tools

Tool	Parameters	What it does
`infer_best_grammar`	`sequences`, `prefer`, `kmax`, `N`	Recommended. Runs CRX + iDRegEx, picks best by MDL. Set `prefer='crx'` or `prefer='idregex'` to run one algorithm.
`infer_grammar`	`sequences`, `method`, `kmax`, `N`	Core single-algorithm inference. `method='crx'` (fast, deterministic) or `method='idregex'` (probabilistic EM).

Parameters explained:

kmax (1–5): Context window for iDRegEx's k-testable automaton. Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
N (1–10): Baum-Welch EM iterations for iDRegEx training. More iterations = better convergence but slower. Default 3 is a good balance.
prefer: Skip the CRX-vs-iDRegEx comparison. Use when you know which algorithm fits your data.
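
For clients that talk to the server directly, an MCP tool invocation is a JSON-RPC 2.0 `tools/call` request. The sketch below builds one for `infer_best_grammar` with the parameters from the table above. The helper name `build_tool_call` is hypothetical, and the encoding of `sequences` as a list of lists of strings is an assumption. Transport details (stdio framing, the initialization handshake) are left to your MCP client.

```python
import json


def build_tool_call(sequences, prefer=None, kmax=2, N=3, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` request for infer_best_grammar.

    Hypothetical helper: the argument names come from the tools table,
    the list-of-lists encoding of `sequences` is an assumption.
    """
    if not 1 <= kmax <= 5:
        raise ValueError("kmax is documented as 1-5")
    if not 1 <= N <= 10:
        raise ValueError("N is documented as 1-10")
    arguments = {"sequences": sequences, "kmax": kmax, "N": N}
    if prefer is not None:  # 'crx' or 'idregex' runs a single algorithm
        arguments["prefer"] = prefer
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "infer_best_grammar", "arguments": arguments},
    }


request = build_tool_call(
    [["file", "template", "service"], ["file", "template"]],
    prefer="crx",
)
print(json.dumps(request, indent=2))
```

Leaving `prefer` unset keeps the default behaviour described above, where both algorithms run and the lower MDL wins.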

Agent workflow

An LLM agent uses the MCP to discover an unwritten convention from existing examples — compressing hundreds of files into a single ~60-token rule:

User: Generate a new Ansible role for installing PostgreSQL.

Agent: Let me check what pattern the existing community roles follow.
       I'll look at 15 popular geerlingguy roles.

       [finds role directories, extracts task module sequences,
        calls infer_best_grammar(sequences=..., prefer='crx')]

       Dervish returns:
         Best: CRX (MDL 288)
         Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
                  .include+?.(npm+pip)+?.lineinfile?

       This tells me: a role may open with a fail check for preconditions,
       then loads OS-specific variables, installs packages, configures with
       templates, and starts services. Include steps and language tooling
       (npm/pip) are optional, and so is a closing lineinfile tweak.

       I'll generate the new role following this structure.

Without Dervish: the agent stuffs 15 role files into context (roughly 5,000 tokens, growing with every role added), or guesses the pattern from 1–2 examples and often gets it wrong.

With Dervish: one MCP call returns a ~60-token grammar known to match 15/15 existing roles. The agent follows it reliably.

Quick Start

pip install pyyaml
python -m bex

from bex import infer_ensemble

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")

Why not just use a schema?

Many of the things developers build every day have no formal schema. They're free-form scripts, config files, or YAML blobs where the structure is emergent convention, not enforced specification. An LLM generating new content in these domains needs to know the convention — but it's never been written down.

Dervish discovers these conventions automatically from existing examples. The domains below are just examples of what it can do — the same approach works for any sequential data with an unwritten pattern.

Domain	What gets extracted	Example extracted symbols	What Dervish discovers	Why it helps an LLM
Ansible roles	Module names from `tasks/main.yml` in order	`fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `npm`, `pip`, `lineinfile`	`fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?`	"Validate preconditions first, then set vars, install packages, configure with templates, start services. Include sub-roles last."
Helm charts	K8s resource kinds from `helm template` output in rendered order	`ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager`	`ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment` (iDRegEx minimal core)	"Every Prometheus stack needs this bootstrap pipeline. Everything else is optional."
GitHub Actions (Go lint)	Step `uses:` or `run:` values from workflow YAML in job order	`actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter`	`actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?`	"Starting a new Go project on GitHub Actions? Four independent projects converged on: checkout → setup Go → golangci-lint → (optionally megalinter)."

Real-world Results

Ansible Galaxy (15 roles, 44+ modules each)

Data: All 15 geerlingguy Galaxy roles — nginx, php, mysql, docker, etc.

Each role's tasks/main.yml is parsed into a sequence of module names. Here are the sequences from two roles:

docker:   fail → include_vars → include_tasks → package → package → package → ...
nginx:    fail → include_vars → set_fact → package → file → template → service → ...

The extracted symbols are Ansible module names like fail, include_vars, set_fact, package, file, template, service, systemd, get_url, shell, npm, pip, lineinfile, copy, unarchive, yum, apt, command, user, group, git, mount, cron, debug, iptables, ufw, hostname, sysctl, timezone, selinux, firewalld, homebrew, supervisorctl, postgresql_db, mysql_db — 50+ unique modules across the 15 roles.
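
A minimal sketch of that extraction step, assuming the task list has already been loaded (for example with `yaml.safe_load`) into a list of dicts. The function name `module_sequence` and the keyword set are illustrative and not taken from the repository. The keyword set is not exhaustive, and tasks nested inside `block:` are not descended into.

```python
# Task-level keys that are not module names (non-exhaustive, illustrative).
TASK_KEYWORDS = {
    "name", "when", "tags", "become", "become_user", "register", "notify",
    "vars", "loop", "loop_control", "changed_when", "failed_when",
    "ignore_errors", "args", "delegate_to", "run_once", "environment",
    "until", "retries", "delay", "check_mode", "no_log",
}


def module_sequence(tasks):
    """Return the module name of each task, in file order."""
    sequence = []
    for task in tasks:
        for key in task:
            if key in TASK_KEYWORDS or key.startswith("with_"):
                continue
            # Drop a collection prefix such as 'ansible.builtin.'.
            sequence.append(key.rsplit(".", 1)[-1])
            break
    return sequence


tasks = [
    {"name": "Check OS", "fail": {"msg": "unsupported"}, "when": "not supported"},
    {"name": "Load vars", "include_vars": "{{ ansible_os_family }}.yml"},
    {"name": "Install", "ansible.builtin.package": {"name": "nginx"}},
    {"name": "Enable", "service": {"name": "nginx", "state": "started"}, "become": True},
]
print(module_sequence(tasks))  # ['fail', 'include_vars', 'package', 'service']
```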

Best: CRX (MDL 288, 15/15 match)
Grammar:
  fail?.(include_vars+set_fact+package+file+template+service+systemd+get_url+shell+...)+
  .include+?.(npm+pip)+?.lineinfile?

Every single role follows this pattern. The convention was unwritten — no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."

As far as we know, this is the first explicit description of the geerlingguy role module ordering convention.

Compression: The grammar is ~250 characters. The module sequences of the 15 examples total 7,200+ characters. That is ~29× compression.

Helm (kube-prometheus-stack, 6 CI configs)

Data: 6 different values.yaml configurations rendered through helm template. Each config produces a sequence of K8s kind values in rendered YAML order:

config-1: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ServiceMonitor → PrometheusRule
config-2: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ConfigMap → ServiceMonitor
config-3: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → Alertmanager → Prometheus

Extracted symbols include ServiceAccount, ClusterRole, ClusterRoleBinding, Service, Deployment, ConfigMap, Alertmanager, Prometheus, PrometheusRule, ServiceMonitor, Role, RoleBinding, Job, DaemonSet, Secret, and ValidatingWebhookConfiguration (19 kinds in total).
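
One way to obtain such a sequence from `helm template` output is to scan for top-level `kind:` lines. This is a sketch under that assumption, and the function name `kinds_in_order` is hypothetical. It deliberately ignores indented `kind:` fields (such as the one inside `roleRef`), and it does not handle quoted values or trailing comments.

```python
import re

# Only match 'kind:' at column 0, so nested fields like roleRef.kind are skipped.
KIND_LINE = re.compile(r"^kind:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def kinds_in_order(rendered):
    """Return the top-level resource kinds in rendered order."""
    return KIND_LINE.findall(rendered)


rendered = """\
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: demo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
roleRef:
  kind: ClusterRole
  name: demo
---
apiVersion: v1
kind: Service
"""
print(kinds_in_order(rendered))  # ['ServiceAccount', 'ClusterRoleBinding', 'Service']
```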

Best: iDRegEx (MDL 1433)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

  iDRegEx     MDL=  1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX         MDL=  2651.74  (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...

iDRegEx finds the minimum core — what every config always deploys. CRX captures the full vocabulary (19 resource kinds). Both are useful:

CRX tells an agent generating a new chart what resources it might need.
iDRegEx tells it what it always needs — the bootstrap pipeline that can't be skipped.

GitHub Actions (cross-project Go lint, 6 jobs)

Data: Lint jobs from prometheus, goreleaser, cosign, sigstore. Each job's steps are extracted as uses: or run: values:

prometheus lint:   actions/checkout → actions/setup-go → run:sudo → run:echo → golangci/golangci-lint-action → golangci/golangci-lint-action → ...
goreleaser lint:   actions/checkout → actions/setup-go → gitleaks/gitleaks-action → golangci/golangci-lint-action
cosign lint:       actions/checkout → ossf/scorecard-action → actions/upload-artifact → github/codeql-action/upload-sarif

Extracted symbols: actions/checkout, actions/setup-go, golangci/golangci-lint-action, megalinter/megalinter, gitleaks/gitleaks-action, ossf/scorecard-action, github/codeql-action/*, and run:* commands.
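
The symbols above suggest a normalization in which `uses:` values lose their `@version` suffix and `run:` steps are reduced to the first word of the command. The sketch below implements that reading. It is inferred from the example symbols, not taken from the repository, and `step_symbols` is a hypothetical name. Steps are assumed to be dicts as produced by a YAML loader.

```python
def step_symbols(steps):
    """Map workflow steps to symbols such as 'actions/checkout' or 'run:sudo'."""
    symbols = []
    for step in steps:
        if "uses" in step:
            # 'actions/checkout@v4' -> 'actions/checkout'
            symbols.append(step["uses"].split("@", 1)[0])
        elif "run" in step:
            # 'sudo apt-get install ...' -> 'run:sudo'
            words = step["run"].split()
            symbols.append("run:" + (words[0] if words else ""))
    return symbols


steps = [
    {"uses": "actions/checkout@v4"},
    {"uses": "actions/setup-go@v5", "with": {"go-version": "1.22"}},
    {"run": "sudo apt-get install -y libfoo"},
    {"uses": "golangci/golangci-lint-action@v6"},
]
print(step_symbols(steps))
# ['actions/checkout', 'actions/setup-go', 'run:sudo', 'golangci/golangci-lint-action']
```

Collapsing versions and command arguments keeps the alphabet small, which matters because every distinct symbol adds to the grammar's size.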

Best: CRX (MDL 13.6)
Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?

The inferred core of a Go lint job is: checkout → setup Go → golangci-lint. Only the biggest projects add megalinter.

What doesn't work

Not every domain has an unwritten convention. Grammar inference failed (produced trivial (a+b+c+...)+ grammars) on:

Dockerfiles — too simple (FROM → RUN → COPY → CMD is just the Dockerfile spec)
Pre-commit configs (cross-project) — 252 unique hook IDs, no common core
GitHub Actions per-project — too many different job types (build, lint, release, security) in one repo
Prometheus recording rules — schema-enforced structure, no convention to discover

The sweet spot: multiple implementations of the same abstract task (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.
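
An agent can guard against the failure mode described in this section by checking whether a returned grammar is just one repeated disjunction of symbols. The check below is a sketch. `is_trivial` is a hypothetical helper, not part of the package, and it assumes the grammar is written out in full (no `...` elision) and that symbols contain none of the characters `.`, `+`, `?`, `(`, `)`.

```python
import re

# Matches '(a+b+c)+' and '(a+b+c)+?': one disjunction of plain symbols, repeated.
TRIVIAL = re.compile(r"\([^.+?()]+(\+[^.+?()]+)*\)\+\??")


def is_trivial(grammar):
    """True if the grammar accepts any ordering of its symbols."""
    compact = "".join(grammar.split())  # ignore whitespace and line breaks
    return TRIVIAL.fullmatch(compact) is not None


print(is_trivial("(FROM+RUN+COPY+CMD)+"))                            # True
print(is_trivial("fail?.(include_vars+set_fact)+.lineinfile?"))      # False
```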

Algorithm Selection Guide

When	Use	Why
Clean, structured data with full vocabulary	CRX	Single-pass, deterministic. Accepts all sequences.
Few examples, or want minimal common core	iDRegEx	Probabilistic EM, finds only what's shared.
Don't know which is better	Ensemble (default)	Runs both, picks the best by MDL score.
Data is clearly one type	`prefer='crx'` or `prefer='idregex'`	Skips ensemble comparison, runs one algorithm.

When each algorithm wins

Data property	Winner	Why
Diverse patterns, full vocabulary needed	CRX	Captures all symbols. iDRegEx returns ∅.
Clean sequences with clear core	iDRegEx	Extracts minimal common subsequence. CRX buries it in optional noise.
Single sequence	iDRegEx (+ RWR₀)	RWR₀ repair produces a grammatical regex from one example.
2–3 sequences	iDRegEx	CRX overfits. iDRegEx handles noise better.
Many sequences, tight pattern	CRX	Learns precise concatenation with optional suffixes.

Token savings

Context cost: raw examples vs Dervish grammar

Without Dervish, including N examples in context costs N × ~100 tokens. With Dervish, the grammar stays small and flat — ~60 tokens for a tight pattern, ~200 for diverse data.
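
The arithmetic behind that comparison, using only the rough figures quoted above (~100 tokens per raw example, ~60 tokens for a tight grammar). These constants are illustrative estimates, not measurements.

```python
TOKENS_PER_EXAMPLE = 100  # rough per-example figure quoted above
GRAMMAR_TOKENS = 60       # tight-pattern grammar size quoted above


def raw_context_cost(n_examples, per_example=TOKENS_PER_EXAMPLE):
    """Tokens needed to paste n raw examples into context."""
    return n_examples * per_example


for n in (1, 5, 15, 50):
    raw = raw_context_cost(n)
    ratio = raw / GRAMMAR_TOKENS
    print(f"{n:>3} examples: raw={raw:>5} tokens, grammar={GRAMMAR_TOKENS}, ratio={ratio:.1f}x")
```

The raw cost grows linearly with the number of examples while the grammar cost stays constant, so the ratio improves as more examples are available.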

Token savings per dataset

Across all public benchmarks, Dervish delivers 40–83× compression. The grammar is smaller than a single example file would be — and it represents the entire dataset.

How MDL scoring works

MDL = model_cost + data_cost

model_cost: the number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
data_cost: Σ log₂ N(r, len(s)), summed over all sequences s, where N(r, n) is the number of strings of length n that the grammar r accepts. A specific fixed sequence (a.b.c.d.e) has data cost zero because it accepts exactly one string of that length (log₂ 1 = 0). A grammar that accepts many strings of the same length (like (a+b+...+q)+) has high data cost.

The ensemble selects the grammar with the lowest total MDL.
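
To make the two terms concrete, here is a brute-force sketch of the formula exactly as described above. It is not the package's implementation and will not reproduce the scores reported in this README. It assumes single-character symbols and uses Python `re` syntax for the grammar (`|` for disjunction, juxtaposition for concatenation). The names `language_size` and `mdl_score` are hypothetical, and the enumeration is exponential, so it is only suitable for toy inputs.

```python
import itertools
import math
import re


def language_size(pattern, alphabet, length):
    """Count strings of the given length over `alphabet` that fully match."""
    compiled = re.compile(pattern)
    return sum(
        1
        for chars in itertools.product(alphabet, repeat=length)
        if compiled.fullmatch("".join(chars))
    )


def mdl_score(pattern, sequences):
    """model_cost (unique symbols) + data_cost (sum of log2 language sizes)."""
    alphabet = sorted({ch for ch in pattern if ch.isalnum()})
    model_cost = len(alphabet)
    data_cost = 0.0
    for seq in sequences:
        if re.fullmatch(pattern, seq) is None:
            raise ValueError(f"grammar does not accept {seq!r}")
        data_cost += math.log2(language_size(pattern, alphabet, len(seq)))
    return model_cost + data_cost


print(mdl_score("abcde", ["abcde"]))         # 5.0: one string of length 5
print(mdl_score("(a|b|c|d|e)+", ["abcde"]))  # 5 + log2(5**5), about 16.6
```

Both grammars use the same five symbols, so the model cost is equal. The fixed sequence wins because it accepts a single string of length 5, while the repeated disjunction accepts all 3,125.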

Grammar Notation

a.b — a followed by b (concatenation)
(a+b) — either a or b (disjunction)
r? — zero or one (optional)
r+ — one or more (iteration)
r+? — zero or more (varies across examples)
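
This notation differs from common regex syntax in two ways that are easy to trip over: `+` is both infix disjunction and postfix iteration, and `r+?` means "optional iteration" where Python would read it as a lazy quantifier. The sketch below translates a grammar into a Python `re` pattern over space-terminated tokens so that candidate sequences can be checked. It is an illustration, not part of the package. The function names are hypothetical, and symbols are assumed to contain no whitespace and none of `.`, `+`, `?`, `(`, `)`.

```python
import re

SYMBOL = re.compile(r"[^.+?()]+")


def to_python_regex(grammar):
    """Translate Dervish notation into a Python regex over 'tok tok ... ' strings."""
    grammar = "".join(grammar.split())  # drop whitespace and line breaks
    out, i = [], 0
    while i < len(grammar):
        match = SYMBOL.match(grammar, i)
        if match:
            # Each symbol becomes a group that also consumes its trailing space.
            out.append("(?:" + re.escape(match.group()) + " )")
            i = match.end()
            continue
        ch = grammar[i]
        nxt = grammar[i + 1] if i + 1 < len(grammar) else ""
        if ch == "(":
            out.append("(?:")
        elif ch in ")?":
            out.append(ch)
        elif ch == "+":
            if nxt == "?":                    # r+? means zero or more
                out.append("*")
                i += 1
            elif nxt in ("", ".", ")", "+"):  # postfix: one or more
                out.append("+")
            else:                             # infix: disjunction
                out.append("|")
        # '.' is concatenation and needs no output
        i += 1
    return "".join(out)


def matches(grammar, sequence):
    """True if the token sequence is accepted by the grammar."""
    text = "".join(token + " " for token in sequence)
    return re.fullmatch(to_python_regex(grammar), text) is not None


lint = ("actions/checkout.(actions/setup-go+run:echo+run:sudo)+"
        ".golangci/golangci-lint-action?.megalinter?")
print(matches(lint, ["actions/checkout", "actions/setup-go",
                     "golangci/golangci-lint-action"]))  # True
print(matches(lint, ["actions/checkout"]))               # False: the group needs one step
```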

Papers

Bex et al. Learning Deterministic Regular Expressions for the Web — TODS 2010
Bex et al. Simplifying XML Schema: Single-Type Approximations of Regular Expressions — arXiv:1004.2372

Tests

python -m pytest tests/

License

MIT
