grammar-inference-engine/README.md
tobjend 0e2aec582b Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post
- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive
2026-07-01 09:51:41 +02:00


Grammar Inference Engine

Infer regular expression grammars from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

Quick Start

pip install pyyaml
python -m bex

from bex import infer_ensemble

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")

Or compare algorithms manually:

from bex.crx import CRX

seqs = [...]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?

Algorithms

Algorithm   What it learns                         Paper            Use case
────────────────────────────────────────────────────────────────────────────────────────────────────────
CRX         CHAREs (single-pass, deterministic)    TODS 2010 §6     Fast inference, captures all symbols
iDRegEx     k-OREs (probabilistic, Baum-Welch)     arXiv 2010       Finds the minimal core pattern
RWR₀        SOREs (iterative repair)               TODS 2010 §5.2   Single-sequence grammar repair
rwr²        k-ORE from k-OA                        arXiv 2010       k-ORE extraction after Baum-Welch

Pipeline 1: Direct CHARE Inference (fast)

Example sequences → CRX → CHAREs grammar

CRX learns a grammar that accepts all observed symbols, marking optional ones with ?. Best when the data is clean and you want the full vocabulary.

Pipeline 2: Probabilistic k-ORE Inference (robust)

Example sequences → Complete k-OA → Baum-Welch (EM)
  → Disambiguate → Prune → rwr² → k-ORE grammar

iDRegEx learns the minimal common subsequence: the symbols that appear in every example. It fails (returns ∅) when the examples are too diverse.

Pipeline 3: Ensemble (CRX + iDRegEx, MDL selection)

Example sequences → [CRX, iDRegEx] → MDL score each → pick best

Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like (a+b+c+...+z)+ that accepts everything gets a high data cost (log2(|L(r)|) is large), while a specific grammar like a.b.c has near-zero data cost.
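
The selection step itself is small. The following is a minimal sketch, assuming each algorithm's grammar has already been scored. The Candidate class and pick_best function are hypothetical names used for illustration, not the engine's API, and the scores are the Helm numbers reported in this README.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One scored grammar (hypothetical container, not part of bex)."""
    algorithm: str
    grammar: str
    mdl_score: float


def pick_best(candidates):
    """Return the candidate with the lowest total MDL score."""
    return min(candidates, key=lambda c: c.mdl_score)


candidates = [
    Candidate(
        "CRX",
        "(Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?",
        2651.74,
    ),
    Candidate(
        "iDRegEx",
        "ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment",
        1432.99,
    ),
]

best = pick_best(candidates)
print(f"Best: {best.algorithm} (MDL {best.mdl_score})")
```

Because only the total score is compared, a grammar that matches fewer sequences can still win when its description is much shorter.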

Architecture

bex/
├── crx.py          # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py      # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py         # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py        # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py          # SOA: Symbolic Observation Automaton core
├── koa.py          # k-OA: k-testable Observation Automaton
├── ikoa.py         # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py      # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py   # Baum-Welch EM training for k-OA
├── expr.py         # Expression utilities (concat, disj, star, strip)
├── marking.py      # State marking for determinism
├── yaml_to_seq.py  # Generic YAML → key-path sequence converter
├── role_grammar.py # Ansible role → module-sequence extractor
├── ensemble.py     # Ensemble: runs CRX + iDRegEx, picks best by MDL
├── mdl.py          # MDL scoring for grammar selection (fix)
├── mcp_server.py   # MCP server exposing 4 tools
└── ...

MCP Server

A Model Context Protocol server exposes all algorithms and domain adapters as tools:

python -m bex.mcp_server

Tools

Tool                                             What it does
──────────────────────────────────────────────────────────────────────────────────────────────────────
infer_grammar(sequences, method, kmax, N)        Core CRX or iDRegEx inference
infer_best_grammar(sequences, prefer, kmax, N)   Ensemble: runs both CRX and iDRegEx, picks the best by MDL score. Set prefer='crx' or prefer='idregex' to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a Why: explanation.
infer_yaml_grammar(yaml_dir, pattern, method)    Generic YAML → key-paths → grammar
infer_ansible_role_grammar(roles_dir)            Ansible role module sequences → per-category grammar
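
An MCP client invokes these tools with a JSON-RPC tools/call request. The sketch below only builds such a request for infer_best_grammar. The argument names come from the tool signature listed above. Leaving out kmax and N assumes the server supplies defaults for them, which this README does not confirm, and build_tool_call is a hypothetical helper.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(
    1,
    "infer_best_grammar",
    {
        "sequences": [
            ["file", "template", "shell", "wait_for"],
            ["file", "template", "shell"],
        ],
        # Skip the ensemble and return only the CRX grammar.
        "prefer": "crx",
        # kmax and N are omitted: assumed optional, not confirmed by the README.
    },
)

print(json.dumps(request, indent=2))
```

In practice an MCP client library sends this message over the server's transport after the initialization handshake, so the request is rarely written by hand.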

Using infer_best_grammar

The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass prefer:

User: Run CRX on our deploy tasks.
Agent: [runs with prefer='crx']
Best: CRX (MDL 7.0)
Grammar: file.template.docker_image.command.set_fact.shell.wait_for?

  CRX  MDL=  7.00  file.template.docker_image.command.set_fact.shell.wait_for?

Why: Requested CRX only.

Without prefer, the ensemble compares both:

User: Find the grammar for our Helm chart.
Agent: [runs]
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

  iDRegEx     MDL=  1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX         MDL=  2651.74  (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?

Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).

Both grammars are correct — they operate at different levels of specificity. The Why: field helps the agent decide which one to use for the task at hand.

Ensemble Selection

The infer_best_grammar tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.

How MDL scoring works

MDL = model_cost + data_cost
  • model_cost: the number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
  • data_cost: Σ over all sequences s of log₂ N(r, len(s)), where N(r, n) is the number of strings of length n that the grammar r accepts. A grammar that accepts many strings of the same length (like a 17-way disjunction (a+b+...+q)+) has a high data cost because N is large. A specific, fixed sequence (a.b.c.d.e) accepts exactly one string, so its data cost is zero.

The ensemble selects the grammar with the lowest total MDL. This automatically picks the right level of specificity for the data.
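
A small worked example makes the trade-off concrete. This is a brute-force sketch, not the code in bex/mdl.py. It assumes single-character symbols, writes grammars in Python re syntax rather than the notation used in this README, and treats a grammar that accepts no string of the required length as infinitely expensive, which is an assumption about how unmatched sequences are handled.

```python
import itertools
import math
import re


def count_words(pattern, alphabet, length):
    """Count strings of exactly `length` symbols over `alphabet` accepted by `pattern`."""
    compiled = re.compile(pattern)
    return sum(
        1
        for word in itertools.product(alphabet, repeat=length)
        if compiled.fullmatch("".join(word))
    )


def mdl_score(pattern, grammar_symbols, sequences, alphabet):
    """MDL = model_cost + data_cost, following the definition above."""
    model_cost = len(set(grammar_symbols))  # unique symbols in the grammar
    data_cost = 0.0
    for seq in sequences:
        n_words = count_words(pattern, alphabet, len(seq))
        if n_words == 0:
            # Assumption: a grammar that accepts nothing of this length is rejected.
            return math.inf
        data_cost += math.log2(n_words)
    return model_cost + data_cost


sequences = ["abc", "abc"]
specific = mdl_score("abc", "abc", sequences, "abc")    # one accepted string of length 3
general = mdl_score("[abc]+", "abc", sequences, "abc")  # 27 accepted strings of length 3
print(f"specific: {specific:.2f}, general: {general:.2f}")
# specific: 3.00, general: 12.51
```

Both grammars use the same three symbols, so the model cost is identical and the difference comes entirely from the data cost of the general grammar.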

When each algorithm wins

Scenario                           Winner                      Why
──────────────────────────────────────────────────────────────────────────────────────────────────────
Many sequences, diverse patterns   CRX                         CRX captures the full vocabulary. iDRegEx can't find a common core.
Clean, structured sequences        CRX                         CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize.
Few sequences (2-3)                iDRegEx                     CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better.
Sequences share a clear core       iDRegEx                     iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols.
Single sequence                    iDRegEx (with SOA repair)   RWR₀ repair pipeline produces a grammatical regex from one example.

Real-world benchmarks

Results from three domains using the ensemble (fixed MDL scoring):

Dataset                   Best       MDL      Matches
──────────────────────────────────────────────────────────
Helm (prom-stack)         iDRegEx    1433.0   1/6
Ansible (deploy)          CRX        246.1    34/36
Ansible (validate)        CRX        34.0     5/5
Ansible (restore)         CRX        24.0     2/2
Ansible (manage)          iDRegEx    25.0     1/2
Ansible (configure)       iDRegEx    22.5     1/4
Terraform (hashistack)    CRX        4.0      9/9

Note: MDL scores are not comparable across datasets, only within the same run (CRX vs iDRegEx on the same sequences). The Helm score is higher because each sequence is ~120 symbols long, so the data cost term dominates for the overly general CRX grammar (19 kinds × many lengths).

Domain Adapters

Ansible Roles

Extracts module names from tasks/main.yml, groups by category prefix (e.g., deploy_foo → deploy), and learns per-category grammars:

from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences

all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    if len(seqs) >= 2:
        result = infer_ensemble(seqs)
        print(f"── {cat} ({len(items)} roles) ──")
        print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
        print(f"  Grammar: {result['best']['grammar']}")
        print(f"  Why: {result['why']}")

Example output (from companyweb, 51 roles):

── restore (2 roles) ──
  Best: CRX (MDL 24.0)
  Grammar: file.copy.unarchive+.command
  Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.

── validate (5 roles) ──
  Best: CRX (MDL 34.0)
  Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
  Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.

── configure (4 roles) ──
  Best: iDRegEx (MDL 22.5)
  Grammar: include_role
  Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
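
The extraction and grouping steps can be sketched on already-parsed task lists, that is, what yaml.safe_load would return for a tasks/main.yml. This is an illustration, not the code in role_grammar.py. The TASK_KEYWORDS set is a partial, hand-picked list, and taking the text before the first underscore as the category is an assumption based on the deploy_foo example.

```python
from collections import defaultdict

# Partial list of task-level keywords that are not module names (assumption).
TASK_KEYWORDS = {
    "name", "when", "register", "tags", "become", "vars", "loop",
    "with_items", "notify", "ignore_errors", "changed_when", "failed_when",
    "delegate_to", "args", "environment", "no_log", "until", "retries", "delay",
}


def module_sequence(tasks):
    """Return the module used by each task, in task order."""
    sequence = []
    for task in tasks:
        modules = [key for key in task if key not in TASK_KEYWORDS]
        if modules:
            sequence.append(modules[0])
    return sequence


def category_of(role_name):
    """Category prefix of a role name, e.g. 'deploy_foo' -> 'deploy'."""
    return role_name.split("_", 1)[0]


# Hypothetical roles, shaped like parsed tasks/main.yml files.
roles = {
    "deploy_foo": [
        {"name": "Create dir", "file": {"path": "/srv/foo"}},
        {"name": "Render config", "template": {"src": "foo.j2"}},
    ],
    "deploy_bar": [
        {"file": {"path": "/srv/bar"}},
        {"shell": "systemctl restart bar", "when": "restart"},
    ],
}

by_category = defaultdict(list)
for role_name, tasks in roles.items():
    by_category[category_of(role_name)].append((role_name, module_sequence(tasks)))

print(dict(by_category))
```

The resulting mapping has the same shape the loop above expects: a category name pointing at a list of (role, sequence) pairs.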

Helm Charts

Renders a Helm chart with different values files and extracts Kubernetes kind sequences for grammar inference:

import subprocess
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    if out.returncode == 0:
        kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
                 if d and isinstance(d, dict) and 'kind' in d]
        if kinds:
            seqs.append(kinds)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
print(f"Why: {result['why']}")

Example output (from kube-prometheus-stack, 6 CI configs):

Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

  iDRegEx     MDL=  1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX         MDL=  2651.74  (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?

Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
iDRegEx selected (MDL score 1433.0).

CRX captures all symbols that appear. iDRegEx finds only the minimal core that every config shares:

ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

Which grammar is more useful depends on the task:

  • CRX tells you everything you might need — good for an agent generating a complete chart.
  • iDRegEx tells you what you always need — the bootstrap pipeline that can't be skipped.

Pass prefer='crx' or prefer='idregex' to infer_best_grammar to select an algorithm without the ensemble comparison.

Terraform

Parses .tf files to extract resource type sequences, per-file or per-directory:

import re
from pathlib import Path
from bex.ensemble import infer_ensemble

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    # Resource names may contain hyphens, so match anything up to the closing quote.
    resources = re.findall(r'resource\s+"(\w+)"\s+"[^"]+"\s*\{', tf.read_text())
    if resources:
        seqs.append(resources)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")

Example output (from terraform-guides, hashistack example, 9 files):

Best: CRX (MDL 4.0, 9/9 match)
Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?

Grammar notation:

  • a.b: a followed by b (concatenation)
  • (a+b): either a or b (disjunction)
  • r?: zero or one (optional)
  • r+: one or more (iteration)
  • r+?: zero or more (varies across examples)
  • (a|b): iDRegEx-style disjunction (equivalent to (a+b))
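
To check a sequence against a grammar written in this notation, the grammar can be translated into a Python regular expression over space-terminated symbols. The sketch below is not part of bex; grammar_to_regex and matches are hypothetical helpers. It relies on a heuristic to tell the two meanings of + apart: a + followed by a symbol or an opening parenthesis is read as disjunction, anything else as iteration. It also assumes disjunctions are always parenthesized.

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|[().+?|]")


def grammar_to_regex(grammar):
    """Translate README grammar notation into a compiled Python regex."""
    tokens = TOKEN.findall(grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == ".":
            pass  # concatenation is implicit in regex syntax
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "|", "?"):
            out.append(tok)
        elif tok == "+":
            if nxt == "?":
                out.append("*")  # r+? means zero or more
                i += 1
            elif nxt == "(" or re.match(r"[A-Za-z_]", nxt):
                out.append("|")  # disjunction between alternatives
            else:
                out.append("+")  # iteration
        else:
            # A symbol: match its name followed by the separator space.
            out.append("(?:" + re.escape(tok) + " )")
        i += 1
    return re.compile("".join(out))


def matches(grammar, sequence):
    """True if the whole symbol sequence is accepted by the grammar."""
    text = "".join(symbol + " " for symbol in sequence)
    return grammar_to_regex(grammar).fullmatch(text) is not None


restore = "file.copy.unarchive+.command"
print(matches(restore, ["file", "copy", "unarchive", "unarchive", "command"]))  # True
print(matches(restore, ["file", "copy", "command"]))                            # False
```

Terminating every symbol with a space keeps one symbol from matching as a prefix of another, for example copy against copy_file.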

Generic YAML

Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:

from bex.yaml_to_seq import collect_all_sequences
from bex import infer_ensemble

results = collect_all_sequences('config_dir/')
seqs = [seq for _, seq in results]
result = infer_ensemble(seqs)
print(result['best']['grammar'])
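
The DFS traversal can be illustrated on an already-parsed document. This is a sketch of the idea, not the implementation in yaml_to_seq.py: key_paths is a hypothetical helper, and both the "." separator and the choice to descend into list items without adding an index are assumptions.

```python
def key_paths(node, prefix="", sep="."):
    """Return key paths of a parsed YAML document in depth-first order."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}{sep}{key}" if prefix else str(key)
            paths.append(path)
            paths.extend(key_paths(value, path, sep))
    elif isinstance(node, list):
        # List items share the parent's path (assumption: indices are ignored).
        for item in node:
            paths.extend(key_paths(item, prefix, sep))
    return paths


document = {"a": {"b": 1, "c": [{"d": 2}]}, "e": 3}
print(key_paths(document))
# ['a', 'a.b', 'a.c', 'a.c.d', 'e']
```

Each file then contributes one such sequence, and the sequences from all files are passed to the inference step together.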

Papers

  • Bex et al. "Inferring Deterministic Regular Expressions from Positive Data" — TODS 2010
  • Bex et al. "Inferring k-optimal REs from Positive Data" — arXiv:1004.2372

See papers/ for extracted text and the original references.

Tests

python -m pytest tests/
# or
python tests/test_bex.py

License

MIT