tobi/grammar-inference-engine

BEX-based grammar inference engine: learn regular expression patterns from example sequences. Supports CHAREs (CRX), k-OREs (iDRegEx), and the full BEX pipeline (SOA→2T-INF→RWR₀→CRX / iKoa→BW→Disambiguate→Prune→rwr²).

tobjend 547376894c Update README and SHOWCASE with real-world dataset evaluations README: - Replace outdated company benchmarks with public showcases - Add Algorithm Selection Guide - Add 'When each algorithm wins' table - Add 'Why grammar inference?' table with value prop for LLMs - Add 'What doesn't work' section documenting failed approaches - Update all domain adapter examples with public results - Clean up outdated references (companyweb roles, hashistack terraform) SHOWCASE: - Add Helm (kube-prometheus-stack) with iDRegEx minimal core - Add Docker Compose per-project patterns - Add GitHub Actions cross-project Go lint pattern - Add Terraform modules with vocabulary analysis - Add 'What doesn't work' section - Explain WHY each dataset helps an LLM		2026-07-01 10:04:10 +02:00
bex	Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post	2026-07-01 09:51:41 +02:00
bin	Add bin/mcp-server wrapper script for robust path resolution	2026-07-01 08:06:17 +02:00
papers	Initial commit: BEX-based grammar inference engine	2026-07-01 08:01:16 +02:00
tests	Initial commit: BEX-based grammar inference engine	2026-07-01 08:01:16 +02:00
.gitignore	Initial commit: BEX-based grammar inference engine	2026-07-01 08:01:16 +02:00
AGENTS.md	Initial commit: BEX-based grammar inference engine	2026-07-01 08:01:16 +02:00
blog_post.md	Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post	2026-07-01 09:51:41 +02:00
pyproject.toml	Add MCP server: grammar inference via FastMCP	2026-07-01 08:03:10 +02:00
README.md	Update README and SHOWCASE with real-world dataset evaluations	2026-07-01 10:04:10 +02:00
requirements.txt	Initial commit: BEX-based grammar inference engine	2026-07-01 08:01:16 +02:00
SHOWCASE.md	Update README and SHOWCASE with real-world dataset evaluations	2026-07-01 10:04:10 +02:00

README.md

Grammar Inference Engine

Infer regular expression grammars from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

Quick Start

pip install pyyaml
python -m bex

from bex import infer_ensemble

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")

Why grammar inference?

There are many domains where developers follow unwritten conventions — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.

Grammar inference automatically discovers these conventions from examples.

Domain	Unwritten convention	What the grammar tells an LLM
Ansible roles	`fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile`	"First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last."
Helm charts	`ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment`	"Always start with RBAC, then Service, then Deployment. Other resources are optional."
Docker Compose	`(build+image).command.(environment+volumes)?.ports`	"Every service needs either build or image, optionally a command, then environment/volumes/ports in that order."
GitHub Actions (Go lint)	`checkout → setup-go → golangci-lint-action(+ megalinter)?`	"Checkout, set up Go, run the linter. Add megalinter only for extra coverage."
Terraform modules	Everything is optional — but which resources appear tells you the module's domain	Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways.

Algorithm Selection Guide

When	Use	Why
Clean, structured data with full vocabulary	CRX	Single-pass, deterministic. Accepts all sequences.
Few examples, or want minimal common core	iDRegEx	Probabilistic EM, finds only what's shared.
Don't know which is better	Ensemble (default)	Runs both, picks the best by MDL score.
Data is clearly one type	`prefer='crx'` or `prefer='idregex'`	Skips ensemble comparison, runs one algorithm.

Real-world Results

Ansible Galaxy (15 roles, 44+ modules each)

Data: All 15 geerlingguy Galaxy roles — nginx, php, mysql, docker, etc.

Best: CRX (MDL 288, 15/15 match)
Grammar:
  fail?.(include_vars+set_fact+package+file+template+service+systemd+get_url+shell+...)+
  .include+?.(npm+pip)+?.lineinfile?

Every single role follows this pattern. The convention was unwritten — no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."

An LLM generating a new role:

Must start with precondition checks (fail) and variable setup (include_vars, set_fact)
Should then install packages and configure files and templates
Should then start services
Should finish with role includes, language-specific packages (npm, pip), and lineinfile edits

Compression: the grammar is ~250 characters, while the 15 example sequences total 7200+ characters combined, roughly 29× compression.

Helm (kube-prometheus-stack, 6 CI configs)

Data: 6 different values.yaml configurations rendered through helm template.

Best: iDRegEx (MDL 1433)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

  iDRegEx     MDL=  1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX         MDL=  2651.74  (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...

iDRegEx finds the minimum core — what every config always deploys. CRX captures the full vocabulary (19 resource kinds). Both are useful:

CRX tells an agent generating a new chart what resources it might need.
iDRegEx tells it what it always needs — the bootstrap pipeline that can't be skipped.
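
The two views can be approximated without running either algorithm. The sketch below is not how CRX or iDRegEx work internally; it only computes the kinds seen in any sequence (the vocabulary) and the kinds seen in every sequence (a rough stand-in for the core). The helper name `vocabulary_and_core` and the sample sequences are hypothetical.

```python
def vocabulary_and_core(sequences):
    """Return (vocabulary, core) for a list of symbol sequences.

    vocabulary: symbols that occur in at least one sequence.
    core: symbols that occur in every sequence, in first-sequence order.
    """
    if not sequences:
        return set(), []
    sets = [set(seq) for seq in sequences]
    vocabulary = set().union(*sets)
    shared = set.intersection(*sets)
    # Keep first-occurrence order from the first sequence for readability.
    core = list(dict.fromkeys(sym for sym in sequences[0] if sym in shared))
    return vocabulary, core


# Hypothetical rendered-kind sequences, not taken from the real chart.
configs = [
    ["ServiceAccount", "ClusterRole", "ClusterRoleBinding", "Service", "Deployment"],
    ["ServiceAccount", "ConfigMap", "ClusterRole", "ClusterRoleBinding", "Service", "Deployment"],
    ["ServiceAccount", "ClusterRole", "ClusterRoleBinding", "Secret", "Service", "Deployment", "Role"],
]
vocabulary, core = vocabulary_and_core(configs)
print(len(vocabulary))  # number of distinct kinds seen anywhere
print(".".join(core))   # kinds present in every config
```

A set intersection ignores ordering and repetition, so it is only a quick sanity check on what an inferred core grammar should contain.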

Docker Compose (73 services across 10 projects)

Data: Per-service sections from multiple docker-compose.yml files.

Per-service convention:

(build+image).command.(environment+volumes)?.ports

Each project has its own sub-patterns:

Nginx-like projects: build.(command.volumes.ports) — build from source, mount configs, expose ports
Database projects: image.environment.volumes.ports — pull image, configure with env vars, persist data
Language runtimes: build.(environment.command).ports — build, set env vars, override command

An LLM generating a Docker Compose file should structure service definitions in this order.

GitHub Actions (cross-project Go lint, 6 jobs)

Data: Lint jobs from prometheus, goreleaser, cosign, sigstore.

Best: CRX (MDL 13.6)
Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?

In the sampled projects, the Go lint job follows the same outline: checkout → setup Go → run golangci-lint. Only the biggest projects add megalinter.

Terraform (8 AWS modules, 156+ resources each)

Data: terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group} modules.

Best: CRX (MDL 1876)
Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?.(...) ...

Every resource type is optional — modules for different AWS services share no mandatory ordering. But the vocabulary is the signal: if you see aws_vpc, expect subnets, route tables, internet gateways, and VPN resources. The grammar encodes the resource catalogue of each module domain.
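
One way to read the vocabulary signal is as co-occurrence: which resource types appear in the same module as a given type. The following is a minimal sketch that is independent of the engine; the helper name `co_occurring` and the sample modules are hypothetical.

```python
from collections import Counter


def co_occurring(sequences, anchor):
    """Map each symbol to the fraction of anchor-containing sequences that also contain it."""
    with_anchor = [set(seq) for seq in sequences if anchor in seq]
    if not with_anchor:
        return {}
    counts = Counter(
        sym for symbols in with_anchor for sym in symbols if sym != anchor
    )
    return {sym: n / len(with_anchor) for sym, n in counts.items()}


# Hypothetical resource sequences, one per module.
modules = [
    ["aws_vpc", "aws_subnet", "aws_route_table", "aws_internet_gateway"],
    ["aws_vpc", "aws_subnet", "aws_route_table"],
    ["aws_s3_bucket", "aws_s3_bucket_policy"],
]
related = co_occurring(modules, "aws_vpc")
for sym, share in sorted(related.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{sym}: {share:.2f}")
```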

What doesn't work

Not every domain has an unwritten convention. Grammar inference failed (produced trivial (a+b+c+...)+ grammars) on:

Dockerfiles — too simple (FROM → RUN → COPY → CMD is just the Dockerfile spec)
Pre-commit configs (cross-project) — 252 unique hook IDs, no common core
GitHub Actions per-project — too many different job types (build, lint, release, security) in one repo
Prometheus recording rules — schema-enforced structure, no convention to discover

The sweet spot: multiple implementations of the same abstract task (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.
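
A cheap guard against this failure mode is to check whether an inferred grammar is nothing more than one iterated disjunction. The sketch below assumes the grammar string uses the notation from this README and that symbols contain none of the characters `( ) . + ?`. `is_trivial_grammar` is a hypothetical helper, not part of `bex`.

```python
import re

# One parenthesised disjunction of plain symbols, followed by `+` or `+?`.
_TRIVIAL = re.compile(r"\((?:[^().+?]+\+)*[^().+?]+\)\+\??")


def is_trivial_grammar(grammar: str) -> bool:
    """True if the whole grammar has the form (a+b+c)+ or (a+b+c)+?."""
    compact = "".join(grammar.split())  # ignore whitespace
    return _TRIVIAL.fullmatch(compact) is not None


print(is_trivial_grammar("(FROM+RUN+COPY+CMD)+"))
print(is_trivial_grammar("(build+image).command.(environment+volumes)?.ports"))
```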

When each algorithm wins

Data property	Winner	Why
Diverse patterns, full vocabulary needed	CRX	Captures all symbols. iDRegEx returns ∅.
Clean sequences with clear core	iDRegEx	Extracts minimal common subsequence. CRX buries it in optional noise.
Single sequence	iDRegEx (+ RWR₀)	RWR₀ repair produces a grammatical regex from one example.
2–3 sequences	iDRegEx	CRX overfits. iDRegEx handles noise better.
Many sequences, tight pattern	CRX	Learns precise concatenation with optional suffixes.

MCP Server

A Model Context Protocol server exposes all algorithms and domain adapters:

python -m bex.mcp_server

Tools

Tool	What it does
`infer_grammar(sequences, method, kmax, N)`	Core CRX or iDRegEx inference
`infer_best_grammar(sequences, prefer, kmax, N)`	Ensemble: runs both, picks best by MDL. `prefer='crx'` or `prefer='idregex'` to skip comparison.
`infer_yaml_grammar(yaml_dir, pattern, method)`	YAML → key-paths → grammar
`infer_ansible_role_grammar(roles_dir)`	Ansible role module sequences → per-category grammar

Domain Adapters

Ansible Roles

from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences

all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    result = infer_ensemble(seqs)
    print(f"── {cat} ({len(items)} roles) ──")
    print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
    print(f"  Grammar: {result['best']['grammar']}")

Example (15 geerlingguy Galaxy roles):

── other (15 roles) ──
  Best: CRX (MDL 288, 15/15 match)
  Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?
  Why: CRX matches 15/15 sequences, iDRegEx matches 3/15. CRX selected.

Helm Charts

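import subprocess
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

# Render the chart once per CI values file and record resource kinds in output order.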

seqs = []
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
             if d and isinstance(d, dict) and 'kind' in d]
    if kinds:
        seqs.append(kinds)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")

Example (kube-prometheus-stack, 6 CI configs):

Best: iDRegEx (MDL 1433)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

  iDRegEx     MDL=  1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX         MDL=  2651.74  (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...

Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
iDRegEx selected (MDL score 1433.0).

Terraform

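import re
from pathlib import Path

from bex.ensemble import infer_ensemble

# Match `resource "<type>" "<name>" {` and capture the resource type.
RESOURCE_RE = re.compile(r'resource\s+"(\w+)"\s+"[\w-]+"\s*\{')

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    resources = RESOURCE_RE.findall(tf.read_text())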
    if resources:
        seqs.append(resources)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")

Example (8 terraform-aws-* modules):

Best: CRX (MDL 1876)
Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?....
Why: CRX matches 8/8 sequences. iDRegEx returned ∅ (no common core across modules).

Docker Compose

import yaml
from pathlib import Path
from bex.ensemble import infer_ensemble

seqs = []
for dc_file in Path('.').glob('**/docker-compose*.yml'):
    # Empty files load as None, and a service may have no body.
    data = yaml.safe_load(dc_file.read_text()) or {}
    for svc, config in (data.get('services') or {}).items():
        keys = list(config) if isinstance(config, dict) else []
        if keys:
            seqs.append(keys)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")

GitHub Actions

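from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble


def step_symbol(step):
    # Prefer the action name; otherwise use the first word of the shell command.
    # (Evaluating the `run` fallback eagerly would raise IndexError on `uses`-only steps.)
    if 'uses' in step:
        return step['uses']
    words = str(step.get('run', '')).split()
    return 'run:' + words[0] if words else None


seqs = []
for wf_file in Path('.github/workflows/').glob('*.yml'):
    data = yaml.safe_load(wf_file.read_text()) or {}
    for job in data.get('jobs', {}).values():
        if 'steps' not in job:
            continue
        seq = [sym for sym in map(step_symbol, job['steps']) if sym]
        if seq:
            seqs.append(seq)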

result = infer_ensemble(seqs)

How MDL scoring works

MDL = model_cost + data_cost

model_cost: the number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
data_cost: the sum, over every example sequence s, of log₂ of the number of strings of length len(s) that the grammar r accepts. A fixed sequence (a.b.c.d.e) has data cost zero because it accepts exactly one string of that length, and log₂(1) = 0. A grammar that accepts many strings of the same length, such as (a+b+...+q)+, has a high data cost.

The ensemble selects the grammar with the lowest total MDL.
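
For tiny inputs the two terms can be reproduced by brute force. The sketch below is an illustration of the formula, not the engine's implementation. It assumes single-character symbols, writes grammars in Python `re` syntax instead of the notation used in this README, and counts accepted strings by enumerating every candidate of the right length. The names `language_size` and `mdl_score` are hypothetical.

```python
import math
import re
from itertools import product


def language_size(pattern, alphabet, length):
    """Count strings of exactly `length` symbols over `alphabet` that `pattern` accepts."""
    compiled = re.compile(pattern)
    return sum(
        1
        for candidate in product(alphabet, repeat=length)
        if compiled.fullmatch("".join(candidate))
    )


def mdl_score(pattern, alphabet, sequences):
    """model_cost (unique symbols) + data_cost (sum of log2 language sizes)."""
    model_cost = len(set(alphabet))
    data_cost = 0.0
    for seq in sequences:
        size = language_size(pattern, alphabet, len(seq))
        if size == 0:
            return math.inf  # the pattern accepts nothing of this length
        data_cost += math.log2(size)
    return model_cost + data_cost


examples = ["abcde", "abcde"]
print(mdl_score("abcde", "abcde", examples))     # fixed sequence: data cost is zero
print(mdl_score("[abcde]+", "abcde", examples))  # any symbol in any order: much higher
```

Enumeration is exponential in the sequence length, so this is only useful for checking intuition on toy alphabets.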

Grammar Notation

a.b — a followed by b (concatenation)
(a+b) — either a or b (disjunction)
r? — zero or one (optional)
r+ — one or more (iteration)
r+? — zero or more (varies across examples)
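
To check a sequence against a grammar written in this notation, the grammar can be translated to a Python regular expression. This is a sketch under stated assumptions rather than the engine's own matcher: symbols contain none of the characters `( ) . + ?`, a `+` is read as disjunction only when a symbol or `(` follows it, and disjunction binds more loosely than concatenation. The names `grammar_to_regex` and `accepts` are hypothetical.

```python
import re

SEP = "\x1f"  # appended to every symbol so that "a" never matches a prefix of "ab"


def grammar_to_regex(grammar):
    """Translate README notation into a compiled Python regex over SEP-terminated symbols."""
    out = []
    i, n = 0, len(grammar)
    while i < n:
        ch = grammar[i]
        if ch.isspace() or ch == ".":
            i += 1  # concatenation is implicit in `re`
        elif ch == "(":
            out.append("(?:")
            i += 1
        elif ch in ")?":
            out.append(ch)
            i += 1
        elif ch == "+":
            nxt = grammar[i + 1] if i + 1 < n else ""
            if nxt == "?":
                out.append("*")  # r+? means zero or more
                i += 2
            elif nxt in ("", ".", ")", "+"):
                out.append("+")  # postfix iteration
                i += 1
            else:
                out.append("|")  # infix disjunction
                i += 1
        else:
            j = i
            while j < n and grammar[j] not in "().+?" and not grammar[j].isspace():
                j += 1
            out.append("(?:" + re.escape(grammar[i:j] + SEP) + ")")
            i = j
    return re.compile("".join(out))


def accepts(grammar, sequence):
    """True if the whole symbol sequence is in the language of the grammar."""
    text = "".join(sym + SEP for sym in sequence)
    return grammar_to_regex(grammar).fullmatch(text) is not None


lint = ("actions/checkout.(actions/setup-go+run:echo+run:sudo)+"
        ".golangci/golangci-lint-action?.megalinter?")
print(accepts(lint, ["actions/checkout", "actions/setup-go", "golangci/golangci-lint-action"]))
print(accepts(lint, ["actions/setup-go", "actions/checkout"]))
```

Note that `r+?` has to be rewritten to `*` explicitly, because in Python `re` syntax `+?` is a lazy quantifier rather than "zero or more".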

Papers

Bex et al. "Inference of Concise Regular Expressions and DTDs", ACM TODS 2010
Bex et al. "Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data", arXiv:1004.2372

Tests

python -m pytest tests/

License

MIT
