# Grammar Inference Engine

Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

## MCP Server

The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (Claude, opencode, etc.) and get grammar inference as a tool:

```json
{
  "mcpServers": {
    "grammar-inference": {
      "command": "python3",
      "args": ["/path/to/bex/mcp_server.py"]
    }
  }
}
```

### Tools

| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip the comparison and return only that algorithm. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |

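
To make the tool signatures concrete, here is a rough sketch of the JSON-RPC `tools/call` request an MCP client might send for `infer_best_grammar`. The envelope follows the MCP tool-call convention and the argument names come from the table above, but the exact argument types (for example, `sequences` as a list of lists of strings) are an assumption, and the example sequences are made up.

```python
import json

# Hypothetical example sequences; real ones would come from your own roles.
sequences = [
    ["fail", "include_vars", "package", "template", "service"],
    ["include_vars", "package", "template", "service"],
]

# JSON-RPC 2.0 envelope for an MCP tool call. The argument names mirror
# the Tools table; their exact types are assumed, not confirmed.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "infer_best_grammar",
        "arguments": {"sequences": sequences, "prefer": "crx"},
    },
}

payload = json.dumps(request)
print(payload)
```

In practice an MCP client builds this request for you; the sketch only shows what crosses the wire.
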
### Agent workflow

An LLM agent uses the MCP server to discover an unwritten convention from existing examples:

```
User: Generate a new Ansible role for installing PostgreSQL.
Agent: I'll first check 15 existing geerlingguy roles to find the structural pattern.
[calls infer_best_grammar with 15 role sequences, prefer='crx']

Best: CRX (MDL 288)
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
         .include+?.(npm+pip)+?.lineinfile?

Convention: check preconditions → OS-specific vars → install packages →
            configure templates → start services → handle language tooling.
```

- Without the MCP server: 15 role files in context (5,000+ tokens), or guesswork.
- With the MCP server: one grammar rule (~60 tokens), known to match 15/15 existing roles.

## Why grammar inference?

There are many domains where developers follow **unwritten conventions**: implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.

Grammar inference automatically discovers these conventions from examples:

| Domain | Unwritten convention | What the grammar tells an LLM |
|--------|---------------------|-------------------------------|
| Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
| Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
| Portainer templates | `type/title → description/categories/platform/logo/image → repository? → env/ports/volumes? → command?` | "Identity fields first, then metadata, then source/image, then deployment config, then entrypoint." |
| GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. Add megalinter only for extra coverage." |
| Terraform modules | Everything is optional, but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |

## Quick Start

```bash
pip install pyyaml
python -m bex
```

```python
from bex import infer_ensemble

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```

## Real-world Results

### Ansible Galaxy (15 roles, 44+ modules each)

Data: All 15 [geerlingguy Galaxy roles](https://github.com/geerlingguy): nginx, php, mysql, docker, etc.

```
Best: CRX (MDL 288, 15/15 match)
Grammar:
  fail?.(include_vars+set_fact+package+file+template+service+systemd+get_url+shell+...)+
  .include+?.(npm+pip)+?.lineinfile?
```

Every single role follows this pattern. The convention was **unwritten**: no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."

This is the first explicit description of the geerlingguy role module ordering convention.

**Compression:** The grammar is ~250 characters. The 15 example sequences total 7,200+ characters. **~29× compression.**

### Helm (kube-prometheus-stack, 6 CI configs)

Data: 6 different `values.yaml` configurations rendered through `helm template`.

```
Best: iDRegEx (MDL 1433)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

  iDRegEx  MDL= 1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX      MDL= 2651.74  (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
```

iDRegEx finds the **minimum core**: what every config always deploys. CRX captures the full vocabulary (19 resource kinds). Both are useful:

- **CRX** tells an agent generating a new chart what resources it *might* need.
- **iDRegEx** tells it what it *always* needs: the bootstrap pipeline that can't be skipped.

### Portainer templates (47 templates)

Data: Official Portainer app templates from the [portainer/templates](https://github.com/portainer/templates) repo.

```
Best: CRX (MDL 1282)
Grammar: (type+title)+.(categories+description+image+logo+name+note+platform)+.
         repository?.(env+ports+privileged+volumes)+?.command?
```

Template fields follow a consistent arc: identity (`type`, `title`) → metadata (`description`, `categories`, `platform`, `logo`) → source (`image`, `repository`) → deployment (`ports`, `volumes`, `env`) → entrypoint (`command`). 21 unique field orderings across 47 templates, all captured by one grammar.

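
A count of unique orderings like the one quoted above can be obtained by hashing each sequence as a tuple. The following is a minimal sketch; the sequences are made-up stand-ins, not the real Portainer data.

```python
from collections import Counter

# Made-up stand-ins for template key sequences (not the real Portainer data).
seqs = [
    ["type", "title", "description", "image", "ports"],
    ["type", "title", "description", "image", "ports"],
    ["type", "title", "description", "image", "env", "volumes"],
    ["title", "type", "description", "image"],
]

# Lists are unhashable, so convert each sequence to a tuple before counting.
orderings = Counter(tuple(s) for s in seqs)
most_common_ordering, count = orderings.most_common(1)[0]

print(f"{len(orderings)} unique orderings across {len(seqs)} templates")
```

A low ratio of unique orderings to examples suggests a strong convention; a ratio near 1 suggests there may be little shared structure to infer.
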
### GitHub Actions (cross-project Go lint, 6 jobs)

Data: Lint jobs from prometheus, goreleaser, cosign, sigstore.

```
Best: CRX (MDL 13.6)
Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?
```

The sampled lint jobs all follow the same arc: checkout → set up Go → run golangci-lint. Only the biggest projects add megalinter.

### Terraform (8 AWS modules, 156+ resources each)

Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules.

```
Best: CRX (MDL 1876)
Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?.(...) ...
```

Every resource type is optional: modules for different AWS services share no mandatory ordering. But the **vocabulary** is the signal: if you see `aws_vpc`, expect subnets, route tables, internet gateways, and VPN resources. The grammar encodes the resource catalogue of each module domain.

### What doesn't work

Not every domain has an unwritten convention. Grammar inference failed (produced trivial `(a+b+c+...)+` grammars) on:

- **Dockerfiles**: too simple (`FROM → RUN → COPY → CMD` is just the Dockerfile spec)
- **Pre-commit configs** (cross-project): 252 unique hook IDs, no common core
- **GitHub Actions per-project**: too many different job types (build, lint, release, security) in one repo
- **Prometheus recording rules**: schema-enforced structure, no convention to discover

The sweet spot: **multiple implementations of the same abstract task** (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.

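
A trivial result of the `(a+b+c+...)+` kind can be flagged mechanically. The helper below is a hypothetical sketch, not part of the engine: it treats a grammar as trivial when the whole string is one flat disjunction under `+` (optionally followed by `?`), and it assumes symbols contain no parentheses, dots, or question marks.

```python
import re

# One flat disjunction under `+` (or `+?`), with no concatenation or nesting,
# e.g. "(a+b+c)+": any symbol, in any order, any number of times.
TRIVIAL = re.compile(r"\(([^().?]+)\)\+\??")


def is_trivial(grammar: str) -> bool:
    """Return True if the grammar is a single flat `(a+b+...)+` disjunction."""
    return TRIVIAL.fullmatch(grammar.replace(" ", "")) is not None


print(is_trivial("(FROM+RUN+COPY+CMD)+"))
print(is_trivial("actions/checkout.(actions/setup-go+run:echo)+.megalinter?"))
```

A check like this could let an agent fall back to reading the examples directly when the inferred grammar carries no ordering information.
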
## Algorithm Selection Guide

| When | Use | Why |
|------|-----|-----|
| Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |

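
The selection step in the last two rows can be sketched in a few lines. This is an illustration under assumptions, not the engine's actual code: `pick_best` is a hypothetical helper, the candidate dictionaries borrow the `algorithm`, `grammar`, and `mdl_score` keys from the Quick Start output, and the scores are the kube-prometheus-stack figures reported above. It models only the choice between finished candidates; with `prefer` set, the real ensemble skips the comparison and runs one algorithm.

```python
from typing import Optional


def pick_best(candidates: list[dict], prefer: Optional[str] = None) -> dict:
    """Return the preferred algorithm's result, or else the lowest-MDL one."""
    if prefer is not None:
        for cand in candidates:
            if cand["algorithm"].lower() == prefer.lower():
                return cand
        raise ValueError(f"no candidate for prefer={prefer!r}")
    return min(candidates, key=lambda cand: cand["mdl_score"])


# Scores from the kube-prometheus-stack run; the CRX grammar is elided as in the README.
candidates = [
    {"algorithm": "iDRegEx", "mdl_score": 1432.99,
     "grammar": "ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment"},
    {"algorithm": "CRX", "mdl_score": 2651.74,
     "grammar": "(Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?..."},
]

print(pick_best(candidates)["algorithm"])                # lowest MDL
print(pick_best(candidates, prefer="crx")["algorithm"])  # forced choice
```
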
## When each algorithm wins

| Data property | Winner | Why |
|---------------|--------|-----|
| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
| Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
| Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
| 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |

## Domain Adapters

### Ansible Roles

```python
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences

all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    result = infer_ensemble(seqs)
    print(f"── {cat} ({len(items)} roles) ──")
    print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
    print(f"  Grammar: {result['best']['grammar']}")
```

**Example** (15 geerlingguy Galaxy roles):

```
── other (15 roles) ──
  Best: CRX (MDL 288, 15/15 match)
  Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?
  Why: CRX matches 15/15 sequences, iDRegEx matches 3/15. CRX selected.
```

### Helm Charts

```python
import subprocess
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
# Render the chart once per CI values file and record the order of resource kinds.
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
             if d and isinstance(d, dict) and 'kind' in d]
    if kinds:
        seqs.append(kinds)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```

**Example** (kube-prometheus-stack, 6 CI configs):

```
Best: iDRegEx (MDL 1433)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

  iDRegEx  MDL= 1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX      MDL= 2651.74  (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...

Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
     iDRegEx selected (MDL score 1433.0).
```

### Terraform

```python
import re
from pathlib import Path

from bex.ensemble import infer_ensemble

# Capture the resource type from lines like: resource "aws_vpc" "this" {
# The resource name may contain hyphens, so it is matched with [\w-]+.
RESOURCE = re.compile(r'resource\s+"(\w+)"\s+"[\w-]+"\s*\{')

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    resources = RESOURCE.findall(tf.read_text())
    if resources:
        seqs.append(resources)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```

**Example** (8 terraform-aws-* modules):

```
Best: CRX (MDL 1876)
Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?....
Why: CRX matches 8/8 sequences. iDRegEx returned ∅ (no common core across modules).
```

### Portainer Templates

```python
import json
import urllib.request

from bex.ensemble import infer_ensemble

url = "https://raw.githubusercontent.com/portainer/templates/master/templates.json"
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read())
templates = data if isinstance(data, list) else data.get('templates', [])
seqs = [list(t.keys()) for t in templates]

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```

### GitHub Actions

```python
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
for wf_file in Path('.github/workflows/').glob('*.yml'):
    data = yaml.safe_load(wf_file.read_text()) or {}
    for job in data.get('jobs', {}).values():
        if 'steps' not in job:
            continue
        # Symbol per step: the action name, or 'run:' plus the first word of the command.
        # The conditional avoids evaluating the 'run' branch for steps that only have 'uses'.
        seq = [s['uses'] if 'uses' in s else 'run:' + (s['run'].split() or [''])[0]
               for s in job['steps'] if 'uses' in s or 'run' in s]
        if seq:
            seqs.append(seq)

result = infer_ensemble(seqs)
```

## How MDL scoring works

```
MDL = model_cost + data_cost
```

- **model_cost**: number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost**: Σ log₂(|L(r) at length len(s)|) across all sequences s, where |L(r) at length n| is the number of strings of length n that the grammar r accepts. A specific fixed sequence (`a.b.c.d.e`) has data cost zero because |L(r)| = 1. A grammar that accepts *many* strings of the same length (like `(a+b+...+q)+`) has high data cost.

The ensemble selects the grammar with the lowest total MDL.

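
The formula can be checked by brute force on a toy alphabet. The sketch below is an illustration, not the engine's scorer: it uses single-character symbols and Python `re` patterns as stand-ins for the README notation (`abcde` for `a.b.c.d.e`, `(?:a|b|c|d|e)+` for `(a+b+c+d+e)+`), and the helper names `language_size` and `mdl_score` are hypothetical. It also assumes the grammar accepts every sequence, since log₂(0) is undefined.

```python
import itertools
import math
import re


def language_size(pattern: str, alphabet: str, length: int) -> int:
    """Brute-force count of strings of exactly `length` symbols that `pattern` accepts."""
    compiled = re.compile(pattern)
    return sum(
        1
        for symbols in itertools.product(alphabet, repeat=length)
        if compiled.fullmatch("".join(symbols))
    )


def mdl_score(pattern: str, alphabet: str, sequences: list[str]) -> float:
    # model_cost: number of unique symbols; data_cost: sum of log2 language sizes.
    model_cost = len(set(alphabet))
    data_cost = sum(
        math.log2(language_size(pattern, alphabet, len(seq))) for seq in sequences
    )
    return model_cost + data_cost


sequences = ["abcde", "abcde"]
fixed = mdl_score("abcde", "abcde", sequences)           # accepts exactly one string
loose = mdl_score("(?:a|b|c|d|e)+", "abcde", sequences)  # accepts all 5**5 strings
print(f"fixed: {fixed:.2f}  loose: {loose:.2f}")         # fixed: 5.00  loose: 28.22
```

Both grammars use the same five symbols, so the model cost is equal; the loose grammar loses only because each sequence costs log₂(5⁵) ≈ 11.6 bits to encode.
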
## Grammar Notation

- `a.b`: `a` followed by `b` (concatenation)
- `(a+b)`: either `a` or `b` (disjunction)
- `r?`: zero or one (optional)
- `r+`: one or more (iteration)
- `r+?`: zero or more (varies across examples)

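
To check a candidate sequence against a grammar written in this notation, one option is to translate it into a Python `re` pattern. The translator below is a sketch under assumptions, not the engine's parser: `to_python_regex` and `matches` are hypothetical helpers, symbols are assumed to contain only letters, digits, and `_ / : -`, and a `+` is read as a disjunction whenever another symbol or `(` follows it.

```python
import re

# Symbols may contain letters, digits, and _ / : - (e.g. "actions/setup-go").
TOKEN = re.compile(r"[A-Za-z0-9_/:\-]+|[().+?]")


def to_python_regex(grammar: str) -> str:
    tokens = TOKEN.findall(grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "(":
            out.append("(?:")
        elif tok in (")", "?"):
            out.append(tok)
        elif tok == ".":
            pass  # concatenation is implicit in Python regexes
        elif tok == "+":
            if nxt == "?":
                out.append("*")  # r+? means zero or more
                i += 1           # consume the "?"
            elif nxt is None or nxt in (".", ")"):
                out.append("+")  # postfix: one or more
            else:
                out.append("|")  # infix: disjunction
        else:
            # Each symbol becomes a group that ends in a space delimiter.
            out.append("(?:" + re.escape(tok) + " )")
        i += 1
    return "".join(out)


def matches(grammar: str, sequence: list[str]) -> bool:
    encoded = "".join(symbol + " " for symbol in sequence)
    return re.fullmatch(to_python_regex(grammar), encoded) is not None


grammar = "fail?.(include_vars+set_fact+package)+.include+?.(npm+pip)+?.lineinfile?"
print(matches(grammar, ["package", "set_fact", "npm"]))
print(matches(grammar, ["npm", "package"]))
```

The trailing space after each symbol keeps one symbol from matching as a prefix of another (for example `run:echo` inside a longer name). Grammars with elided parts (`...`) cannot be translated meaningfully.
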
## Papers

- **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"*, TODS 2010
- **Bex et al.** *"Inferring k-optimal REs from Positive Data"*, arXiv:1004.2372

## Tests

```bash
python -m pytest tests/
```

## License

MIT