Grammar Inference Engine
Infer regular expression grammars from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.
MCP Server
The primary interface is a Model Context Protocol (MCP) server. Connect any MCP-compatible client (Claude, opencode, etc.) and get grammar inference as a tool:
{
  "mcpServers": {
    "grammar-inference": {
      "command": "python3",
      "args": ["/path/to/bex/mcp_server.py"]
    }
  }
}
Tools
| Tool | What it does |
|---|---|
| infer_grammar(sequences, method, kmax, N) | Core CRX or iDRegEx inference |
| infer_best_grammar(sequences, prefer, kmax, N) | Ensemble: runs both CRX and iDRegEx, picks the best by MDL score. prefer='crx' or prefer='idregex' to skip the comparison and return only that algorithm. |
| infer_yaml_grammar(yaml_dir, pattern, method) | YAML → key-paths → grammar |
| infer_ansible_role_grammar(roles_dir) | Ansible role module sequences → per-category grammar |
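For clients that speak the protocol directly, an MCP tool is invoked with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for infer_grammar. The tool and parameter names are the ones listed in the table; the argument values (method "crx", kmax 2, N 10) and the example sequences are illustrative assumptions, not documented defaults, and `build_tool_call` is a hypothetical helper.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Return a JSON-RPC 2.0 request dict for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical argument values; consult the server's tool schema for the real ones.
request = build_tool_call(
    1,
    "infer_grammar",
    {
        "sequences": [["file", "template", "service"], ["file", "service"]],
        "method": "crx",
        "kmax": 2,
        "N": 10,
    },
)
payload = json.dumps(request)
print(payload)
```

Most MCP clients build this message for you; the sketch only shows what travels over the wire.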
Agent workflow
An LLM agent uses the MCP server to discover an unwritten convention from existing examples:
User: Generate a new Ansible role for installing PostgreSQL.
Agent: I'll first check 15 existing geerlingguy roles to find the structural pattern.
[calls infer_best_grammar with 15 role sequences, prefer='crx']
Best: CRX (MDL 288)
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
.include+?.(npm+pip)+?.lineinfile?
Convention: check preconditions → OS-specific vars → install packages →
configure templates → start services → handle language tooling.
Without the MCP server, the agent needs all 15 role files in context (5,000+ tokens) or has to guess. With it, one grammar rule (~60 tokens) is enough, and that rule is known to match 15/15 existing roles.
Why grammar inference?
There are many domains where developers follow unwritten conventions — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.
Grammar inference automatically discovers these conventions from examples:
| Domain | Unwritten convention | What the grammar tells an LLM |
|---|---|---|
| Ansible roles | fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
| Helm charts | ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
| Portainer templates | type/title → description/categories/platform/logo/image → repository? → env/ports/volumes? → command? | "Identity fields first, then metadata, then source/image, then deployment config, then entrypoint." |
| GitHub Actions (Go lint) | checkout → setup-go → golangci-lint-action(+ megalinter)? | "Checkout, set up Go, run the linter. Add megalinter only for extra coverage." |
| Terraform modules | Everything is optional, but which resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |
Quick Start
pip install pyyaml
python -m bex
from bex import infer_ensemble
seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
Real-world Results
Ansible Galaxy (15 roles, 44+ modules each)
Data: All 15 geerlingguy Galaxy roles — nginx, php, mysql, docker, etc.
Best: CRX (MDL 288, 15/15 match)
Grammar:
fail?.(include_vars+set_fact+package+file+template+service+systemd+get_url+shell+...)+
.include+?.(npm+pip)+?.lineinfile?
Every single role follows this pattern. The convention was unwritten — no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."
This is the first explicit description of the geerlingguy role module ordering convention.
Compression: The grammar is ~250 chars. The 15 examples are 7200+ modules combined. ~29× compression.
Helm (kube-prometheus-stack, 6 CI configs)
Data: 6 different values.yaml configurations rendered through helm template.
Best: iDRegEx (MDL 1433)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
iDRegEx finds the minimum core — what every config always deploys. CRX captures the full vocabulary (19 resource kinds). Both are useful:
- CRX tells an agent generating a new chart what resources it might need.
- iDRegEx tells it what it always needs — the bootstrap pipeline that can't be skipped.
Portainer templates (47 templates)
Data: Official Portainer app templates from the portainer/templates repo.
Best: CRX (MDL 1282)
Grammar: (type+title)+.(categories+description+image+logo+name+note+platform)+.
repository?.(env+ports+privileged+volumes)+?.command?
Template fields follow a consistent arc: identity (type, title) → metadata (description, categories, platform, logo) → source (image, repository) → deployment (ports, volumes, env) → entrypoint (command). 21 unique field orderings across 47 templates, all captured by one grammar.
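A figure like "21 unique field orderings across 47 templates" can be reproduced by counting distinct key sequences. The sketch below does this on three made-up templates; `ordering_counts` and the sample data are hypothetical and only illustrate the counting step, not the project's own tooling.

```python
from collections import Counter


def ordering_counts(sequences):
    """Count how often each exact ordering of symbols occurs."""
    return Counter(tuple(seq) for seq in sequences)


# Made-up key sequences standing in for real template entries.
seqs = [
    ["type", "title", "description", "image"],
    ["type", "title", "description", "image"],
    ["title", "type", "image", "description"],
]
counts = ordering_counts(seqs)
print(len(counts), "unique orderings across", len(seqs), "templates")
```

The gap between the number of distinct orderings and the number of templates gives a rough sense of how much variation a single grammar has to absorb.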
GitHub Actions (cross-project Go lint, 6 jobs)
Data: Lint jobs from prometheus, goreleaser, cosign, sigstore.
Best: CRX (MDL 13.6)
Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?
Every lint job in this sample follows the same arc: checkout → setup Go → run golangci-lint. Only the biggest projects add megalinter.
Terraform (8 AWS modules, 156+ resources each)
Data: terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group} modules.
Best: CRX (MDL 1876)
Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?.(...) ...
Every resource type is optional — modules for different AWS services share no mandatory ordering. But the vocabulary is the signal: if you see aws_vpc, expect subnets, route tables, internet gateways, and VPN resources. The grammar encodes the resource catalogue of each module domain.
What doesn't work
Not every domain has an unwritten convention. Grammar inference failed (produced trivial (a+b+c+...)+ grammars) on:
- Dockerfiles: too simple (FROM → RUN → COPY → CMD is just the Dockerfile spec)
- Pre-commit configs (cross-project): 252 unique hook IDs, no common core
- GitHub Actions per-project: too many different job types (build, lint, release, security) in one repo
- Prometheus recording rules: schema-enforced structure, no convention to discover
The sweet spot: multiple implementations of the same abstract task (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.
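One way to spot a failed run automatically is to test whether the returned grammar is a single iterated disjunction of plain symbols. The sketch below is a heuristic, not part of the project: `is_trivial` is a hypothetical helper, and it assumes the grammar arrives as a string in the notation used in this README, with symbol names that contain none of the characters `( ) + . ?`.

```python
import re

# Assumption: symbol names contain none of the notation characters ( ) + . ?
_SYMBOL = r"[^()+.?\s]+"
# One parenthesised disjunction of symbols, iterated with + or +?
_FLAT = re.compile(rf"\(\s*{_SYMBOL}(?:\s*\+\s*{_SYMBOL})*\s*\)\+\??")


def is_trivial(grammar: str) -> bool:
    """True if the grammar is one iterated disjunction of plain symbols."""
    return _FLAT.fullmatch(grammar.strip()) is not None


print(is_trivial("(FROM+RUN+COPY+CMD)+"))  # flat disjunction
print(is_trivial("a.(b+c)+.d?"))           # has ordering structure
```

A grammar flagged this way carries vocabulary but no ordering information, which is the failure mode described above.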
Algorithm Selection Guide
| When | Use | Why |
|---|---|---|
| Clean, structured data with full vocabulary | CRX | Single-pass, deterministic. Accepts all sequences. |
| Few examples, or want minimal common core | iDRegEx | Probabilistic EM, finds only what's shared. |
| Don't know which is better | Ensemble (default) | Runs both, picks the best by MDL score. |
| Data is clearly one type | prefer='crx' or prefer='idregex' | Skips ensemble comparison, runs one algorithm. |
When each algorithm wins
| Data property | Winner | Why |
|---|---|---|
| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
| Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
| Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
| 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
Domain Adapters
Ansible Roles
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    result = infer_ensemble(seqs)
    print(f"── {cat} ({len(items)} roles) ──")
    print(f" Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
    print(f" Grammar: {result['best']['grammar']}")
Example (15 geerlingguy Galaxy roles):
── other (15 roles) ──
Best: CRX (MDL 288, 15/15 match)
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?
Why: CRX matches 15/15 sequences, iDRegEx matches 3/15. CRX selected.
Helm Charts
import subprocess
from pathlib import Path
import yaml
from bex.ensemble import infer_ensemble
seqs = []
# Render the chart once per CI values file and record the order of resource kinds.
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
             if d and isinstance(d, dict) and 'kind' in d]
    if kinds:
        seqs.append(kinds)
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
Example (kube-prometheus-stack, 6 CI configs):
Best: iDRegEx (MDL 1433)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
iDRegEx selected (MDL score 1433.0).
Terraform
import re
from pathlib import Path
from bex.ensemble import infer_ensemble
seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    # Capture the resource type; resource names may also contain hyphens.
    resources = re.findall(r'resource\s+"(\w+)"\s+"[\w-]+"\s*\{', tf.read_text())
    if resources:
        seqs.append(resources)
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
Example (8 terraform-aws-* modules):
Best: CRX (MDL 1876)
Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?....
Why: CRX matches 8/8 sequences. iDRegEx returned ∅ (no common core across modules).
Portainer Templates
import json, urllib.request
from bex.ensemble import infer_ensemble
url = "https://raw.githubusercontent.com/portainer/templates/master/templates.json"
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read())
templates = data if isinstance(data, list) else data.get('templates', [])
seqs = [list(t.keys()) for t in templates]
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
GitHub Actions
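from pathlib import Path
import yaml
from bex.ensemble import infer_ensemble
def step_symbol(step):
    # Action steps map to their 'uses' reference; shell steps map to
    # 'run:' plus the first word of the command (empty if the command is blank).
    if 'uses' in step:
        return step['uses']
    words = str(step.get('run', '')).split()
    return 'run:' + (words[0] if words else '')
seqs = []
for wf_file in Path('.github/workflows/').glob('*.yml'):
    # An empty workflow file parses to None, so fall back to an empty dict.
    data = yaml.safe_load(wf_file.read_text()) or {}
    for job in data.get('jobs', {}).values():
        if 'steps' not in job:
            continue
        seq = [step_symbol(s) for s in job['steps'] if 'uses' in s or 'run' in s]
        if seq:
            seqs.append(seq)
result = infer_ensemble(seqs)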
How MDL scoring works
MDL = model_cost + data_cost
- model_cost: number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- data_cost: Σ log₂(|L(r) at length len(s)|), summed over all sequences s, where L(r) is the set of strings the grammar r accepts. A specific fixed sequence (a.b.c.d.e) has data cost zero because |L(r)| = 1 and log₂ 1 = 0. A grammar that accepts many strings of the same length (like (a+b+...+q)+) has high data cost.
The ensemble selects the grammar with the lowest total MDL.
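The scoring can be illustrated on a toy alphabet. The sketch below is not the project's implementation: it represents each grammar as a plain Python predicate and counts accepted strings by brute-force enumeration, which is only feasible for tiny alphabets and short sequences. The names `data_cost`, `mdl`, `fixed`, and `loose` are hypothetical.

```python
import math
from itertools import product


def data_cost(accepts, alphabet, sequences):
    """Sum over sequences of log2(number of accepted strings of that length)."""
    total = 0.0
    for s in sequences:
        # Brute force: try every string of length len(s) over the alphabet.
        count = sum(1 for cand in product(alphabet, repeat=len(s)) if accepts(cand))
        if count == 0:
            raise ValueError(f"grammar accepts no string of length {len(s)}")
        total += math.log2(count)
    return total


def mdl(accepts, alphabet, sequences):
    # model_cost is taken as the number of unique symbols the grammar uses.
    return len(set(alphabet)) + data_cost(accepts, alphabet, sequences)


alphabet = ("a", "b", "c")
sequences = [("a", "b", "c"), ("a", "b", "c")]


def fixed(seq):  # stands in for the grammar a.b.c
    return tuple(seq) == ("a", "b", "c")


def loose(seq):  # stands in for the grammar (a+b+c)+
    return len(seq) >= 1 and all(x in alphabet for x in seq)


print("a.b.c    ->", mdl(fixed, alphabet, sequences))
print("(a+b+c)+ ->", mdl(loose, alphabet, sequences))
```

Both grammars use three symbols, so they share a model cost of 3. The fixed grammar accepts exactly one string of length 3 and pays no data cost, while the loose grammar accepts all 27 and pays log₂ 27 per sequence, so the fixed grammar has the lower total.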
Grammar Notation
- a.b means a followed by b (concatenation)
- (a+b) means either a or b (disjunction)
- r? means zero or one (optional)
- r+ means one or more (iteration)
- r+? means zero or more (varies across examples)
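To check whether a sequence fits a grammar in this notation, the operators can be mapped onto Python's `re` syntax. The sketch below hand-translates one made-up grammar, `fetch.(build+test)+.deploy?.notify+?`; it is not a parser for the notation and not the project's matcher. The helper names and symbols are hypothetical, and symbols are assumed to contain no spaces.

```python
import re


def sym(name):
    # One symbol: its escaped name followed by a single separator space.
    return f"(?:{re.escape(name)} )"


def concat(*parts):          # a.b
    return "".join(parts)


def either(*parts):          # (a+b)
    return "(?:" + "|".join(parts) + ")"


def optional(r):             # r?
    return f"(?:{r})?"


def one_or_more(r):          # r+
    return f"(?:{r})+"


def zero_or_more(r):         # r+?
    return f"(?:{r})*"


def matches(regex, sequence):
    """True if the whole sequence is accepted by the translated grammar."""
    text = "".join(s + " " for s in sequence)
    return re.fullmatch(regex, text) is not None


# fetch.(build+test)+.deploy?.notify+?
grammar = concat(
    sym("fetch"),
    one_or_more(either(sym("build"), sym("test"))),
    optional(sym("deploy")),
    zero_or_more(sym("notify")),
)

print(matches(grammar, ["fetch", "build", "test"]))
print(matches(grammar, ["fetch", "deploy"]))
```

The first sequence is accepted; the second is rejected because the iterated group `(build+test)+` requires at least one occurrence. Appending a space to every symbol keeps a name like `build` from matching a longer name that merely starts with it.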
Papers
- Bex et al. "Inferring Deterministic Regular Expressions from Positive Data" — TODS 2010
- Bex et al. "Inferring k-optimal REs from Positive Data" — arXiv:1004.2372
Tests
python -m pytest tests/
License
MIT