Grammar Inference Engine
Infer regular expression grammars from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.
Quick Start
pip install pyyaml
python -m bex
from bex import infer_ensemble
seqs = [
['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
Or compare algorithms manually:
from bex.crx import CRX
seqs = [...]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?
Algorithms
| Algorithm | What it learns | Paper | Use case |
|---|---|---|---|
| CRX | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures all symbols |
| iDRegEx | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
| RWR₀ | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
| rwr² | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |
Pipeline 1: Direct CHARE Inference (fast)
Example sequences → CRX → CHAREs grammar
CRX learns a grammar that accepts all observed symbols, marking optional ones with ?. Best when the data is clean and you want the full vocabulary.
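To make the "optional symbols get a ?" idea concrete, here is a deliberately simplified sketch. It is not the CRX algorithm from the paper and not the code in crx.py: it assumes every symbol occurs at most once per sequence and that all sequences agree on one linear order. The function name `optional_chain` is made up for this illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def optional_chain(sequences):
    """Order symbols by observed precedence; mark symbols missing from some sequence with '?'.

    Simplifying assumptions (not true CRX): no symbol repeats within a
    sequence, and the observed orderings never contradict each other.
    """
    sorter = TopologicalSorter()
    for seq in sequences:
        for sym in seq:
            sorter.add(sym)                  # register the symbol itself
        for prev, nxt in zip(seq, seq[1:]):
            sorter.add(nxt, prev)            # prev must come before nxt
    parts = []
    for sym in sorter.static_order():
        required = all(sym in seq for seq in sequences)
        parts.append(sym if required else f"{sym}?")
    return ".".join(parts)


seqs = [
    ["file", "template", "docker_image", "command", "set_fact", "shell", "wait_for"],
    ["file", "template", "docker_image", "command", "set_fact", "shell"],
]
grammar = optional_chain(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.wait_for?
```

Real CRX also handles repeated symbols and groups of symbols that appear in varying order (the disjunctions and `+` factors seen elsewhere in this README), which this sketch does not attempt.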
Pipeline 2: Probabilistic k-ORE Inference (robust)
Example sequences → Complete k-OA → Baum-Welch (EM)
→ Disambiguate → Prune → rwr² → k-ORE grammar
iDRegEx learns the minimum common subsequence — symbols that appear in every example. Fails (∅) when the examples are too diverse.
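The notion of "symbols that appear in every example" can be illustrated without any of the probabilistic machinery. The sketch below is only that illustration: it does not run Baum-Welch or rwr², and `common_core` is a hypothetical helper, not part of bex. An empty result plays the role of the ∅ case.

```python
def common_core(sequences):
    """Return the symbols present in every sequence, in first-sequence order."""
    if not sequences:
        return []
    # Intersect the symbol sets of all sequences
    shared = set(sequences[0]).intersection(*(set(s) for s in sequences[1:]))
    core, seen = [], set()
    for sym in sequences[0]:
        if sym in shared and sym not in seen:
            core.append(sym)
            seen.add(sym)
    return core


overlapping = common_core([["a", "b", "c"], ["a", "x", "c"]])
disjoint = common_core([["a", "b"], ["x", "y"]])
print(overlapping)  # ['a', 'c']
print(disjoint)     # []  (nothing is shared, so there is no core)
```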
Pipeline 3: Ensemble (recommended)
Example sequences → [CRX, iDRegEx] → MDL score each → pick best
Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like (a+b+c+...+z)+ that accepts everything gets a high data cost (log2(|L(r)|) is large), while a specific grammar like a.b.c has near-zero data cost.
Architecture
bex/
├── crx.py # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py # SOA: Single Occurrence Automaton core
├── koa.py # k-OA: k-Occurrence Automaton
├── ikoa.py # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py # Baum-Welch EM training for k-OA
├── expr.py # Expression utilities (concat, disj, star, strip)
├── marking.py # State marking for determinism
├── yaml_to_seq.py # Generic YAML → key-path sequence converter
├── role_grammar.py # Ansible role → module-sequence extractor
├── ensemble.py # Ensemble: runs CRX + iDRegEx, picks best by MDL
├── mdl.py # MDL scoring for grammar selection
├── mcp_server.py # MCP server exposing 4 tools
└── ...
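Several of these modules build on the 2T-INF idea listed for twotinf.py: record which symbols start a sequence, which end one, and which pairs occur next to each other. The following is a minimal sketch of that idea using plain sets. The README does not show the data structures bex uses, so the tuple representation and the function names here are assumptions.

```python
def two_t_inf(sequences):
    """Build a 2-testable automaton as (initial symbols, final symbols, adjacent pairs)."""
    initial, final, edges = set(), set(), set()
    for seq in sequences:
        if not seq:
            continue  # simplification: empty sequences are ignored
        initial.add(seq[0])
        final.add(seq[-1])
        edges.update(zip(seq, seq[1:]))
    return initial, final, edges


def accepts(automaton, seq):
    """A sequence is accepted if its start, end, and every adjacent pair were observed."""
    initial, final, edges = automaton
    if not seq:
        return False
    return (
        seq[0] in initial
        and seq[-1] in final
        and all(pair in edges for pair in zip(seq, seq[1:]))
    )


automaton = two_t_inf([["a", "b", "a", "c"]])
print(accepts(automaton, ["a", "b", "a", "c"]))  # True: the training example itself
print(accepts(automaton, ["a", "c"]))            # True: generalizes, since (a, c) was observed
print(accepts(automaton, ["b", "a", "c"]))       # False: no example starts with b
```

The second call shows why the result is a generalization rather than a lookup table: any walk through observed pairs is accepted, not only the exact examples.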
MCP Server
A Model Context Protocol server exposes all algorithms and domain adapters as tools:
python -m bex.mcp_server
Tools
| Tool | What it does |
|---|---|
| infer_grammar(sequences, method, kmax, N) | Core CRX or iDRegEx inference |
| infer_best_grammar(sequences, prefer, kmax, N) | Ensemble: runs both CRX and iDRegEx, picks the best by MDL score. Set prefer='crx' or prefer='idregex' to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a Why: explanation. |
| infer_yaml_grammar(yaml_dir, pattern, method) | Generic YAML → key-paths → grammar |
| infer_ansible_role_grammar(roles_dir) | Ansible role module sequences → per-category grammar |
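MCP clients invoke tools with a JSON-RPC `tools/call` request. As a rough sketch of what a call to infer_best_grammar might look like on the wire, the snippet below builds such a request. The tool and parameter names come from the table above; the argument shapes (a list of lists of strings for sequences) are an assumption, and kmax and N are left out because their types are not documented here.

```python
import json

# JSON-RPC 2.0 envelope used by MCP for tool invocation
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "infer_best_grammar",
        "arguments": {
            # Assumed shape: one list of symbols per example sequence
            "sequences": [
                ["file", "template", "shell", "wait_for"],
                ["file", "template", "shell"],
            ],
            "prefer": "crx",  # skip the ensemble and return CRX only
        },
    },
}

payload = json.dumps(request)
print(payload)
```

In practice an MCP client library builds this envelope for you; the sketch is only meant to show which fields carry the tool name and its arguments.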
Using infer_best_grammar
The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass prefer:
User: Run CRX on our deploy tasks.
Agent: [runs with prefer='crx']
Best: CRX (MDL 7.0)
Grammar: file.template.docker_image.command.set_fact.shell.wait_for?
CRX MDL= 7.00 file.template.docker_image.command.set_fact.shell.wait_for?
Why: Requested CRX only.
Without prefer, the ensemble compares both:
User: Find the grammar for our Helm chart.
Agent: [runs]
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
Both grammars are correct — they operate at different levels of specificity. The Why: field helps the agent decide which one to use for the task at hand.
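The selection rule behind these two transcripts can be sketched in a few lines. This is an illustration, not the code in ensemble.py: `select` is a hypothetical function that receives already-computed MDL scores, whereas the real tool with prefer set runs only the requested algorithm.

```python
def select(candidates, prefer=None):
    """Pick a winner from {algorithm name: MDL score}; lower MDL is better."""
    if prefer is not None:
        # Case-insensitive lookup so 'crx' finds 'CRX'
        by_lower = {name.lower(): name for name in candidates}
        name = by_lower[prefer.lower()]
        return name, f"Requested {name} only."
    name = min(candidates, key=candidates.get)
    return name, f"{name} selected (MDL score {candidates[name]:.1f})."


# Scores taken from the Helm transcript above
scores = {"iDRegEx": 1432.99, "CRX": 2651.74}

print(select(scores))                # ('iDRegEx', 'iDRegEx selected (MDL score 1433.0).')
print(select(scores, prefer="crx"))  # ('CRX', 'Requested CRX only.')
```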
Ensemble Selection
The infer_best_grammar tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.
How MDL scoring works
MDL = model_cost + data_cost
- model_cost: number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- data_cost: Σ log₂(|L(r) at length len(s)|) across all sequences. A grammar that accepts many strings of the same length (like a 17-way disjunction (a+b+...+q)+) has high data cost because |L(r)| is large. A specific, fixed sequence (a.b.c.d.e) has |L(r)| = 1, so data cost is zero.
The ensemble selects the grammar with the lowest total MDL. This automatically picks the right level of specificity for the data.
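The formula can be checked by brute force on a toy alphabet. The sketch below is not the implementation in mdl.py: it assumes single-character symbols, writes grammars as Python regexes instead of the notation used in this README, and counts |L(r)| at a given length by enumerating every string of that length. The names `language_size` and `mdl_score` are made up for the example.

```python
import math
import re
from itertools import product


def language_size(pattern, alphabet, length):
    """Count the strings of `length` over `alphabet` that `pattern` fully matches."""
    compiled = re.compile(pattern)
    return sum(
        1
        for chars in product(alphabet, repeat=length)
        if compiled.fullmatch("".join(chars))
    )


def mdl_score(pattern, alphabet, sequences):
    """MDL = model_cost + data_cost, following the two definitions above."""
    model_cost = len({ch for ch in pattern if ch in alphabet})  # unique symbols
    data_cost = 0.0
    for seq in sequences:
        size = language_size(pattern, alphabet, len(seq))
        if size == 0:
            return math.inf  # the grammar accepts nothing of this length
        data_cost += math.log2(size)
    return model_cost + data_cost


examples = ["abc"]
specific = mdl_score("abc", "abc", examples)       # one string of length 3: data cost 0
general = mdl_score("(a|b|c)+", "abc", examples)   # 27 strings of length 3: data cost log2(27)
print(specific)  # 3.0
print(general)   # 3 + log2(27), roughly 7.75
```

Both grammars use the same three symbols, so the model cost ties and the data cost alone separates them. Enumeration is exponential in the sequence length, so this only works for toy inputs.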
When each algorithm wins
| Scenario | Winner | Why |
|---|---|---|
| Many sequences, diverse patterns | CRX | CRX captures the full vocabulary. iDRegEx can't find a common core. |
| Clean, structured sequences | CRX | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
| Few sequences (2–3) | iDRegEx | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
| Sequences share a clear core | iDRegEx | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
| Single sequence | iDRegEx (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |
Real-world benchmarks
Results from three domains using the ensemble (fixed MDL scoring):
Dataset Best MDL Matches
──────────────────────────────────────────────────────────
Helm (prom-stack) iDRegEx 1433.0 1/6
Ansible (deploy) CRX 246.1 34/36
Ansible (validate) CRX 34.0 5/5
Ansible (restore) CRX 24.0 2/2
Ansible (manage) iDRegEx 25.0 1/2
Ansible (configure) iDRegEx 22.5 1/4
Terraform (hashistack) CRX 4.0 9/9
Note: MDL scores are not comparable across datasets — only within the same run (CRX vs iDRegEx on the same sequences). The Helm score is higher because each sequence is ~120 symbols long, making the data cost term dominant for the overly-general CRX grammar (19 kinds × many lengths).
Domain Adapters
Ansible Roles
Extracts module names from tasks/main.yml, groups by category prefix (e.g., deploy_foo → deploy), and learns per-category grammars:
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    if len(seqs) >= 2:
        result = infer_ensemble(seqs)
        print(f"── {cat} ({len(items)} roles) ──")
        print(f" Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
        print(f" Grammar: {result['best']['grammar']}")
        print(f" Why: {result['why']}")
Example output (from companyweb, 51 roles):
── restore (2 roles) ──
Best: CRX (MDL 24.0)
Grammar: file.copy.unarchive+.command
Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.
── validate (5 roles) ──
Best: CRX (MDL 34.0)
Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.
── configure (4 roles) ──
Best: iDRegEx (MDL 22.5)
Grammar: include_role
Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
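For readers wondering how a task list becomes a module sequence, here is one plausible approach, sketched on already-parsed task dictionaries. It is a guess at the general idea, not the logic in role_grammar.py: the keyword set is incomplete, and both function names are hypothetical.

```python
# Task-level keys that are not module names (assumed, incomplete list)
TASK_KEYWORDS = {
    "name", "when", "tags", "register", "become",
    "vars", "loop", "notify", "with_items",
}


def module_sequence(tasks):
    """Return the first non-keyword key of each task, in task order."""
    modules = []
    for task in tasks:
        for key in task:
            if key not in TASK_KEYWORDS:
                modules.append(key)
                break
    return modules


def category(role_name):
    """Group roles by the prefix before the first underscore: deploy_foo -> deploy."""
    return role_name.split("_", 1)[0]


tasks = [
    {"name": "Create directory", "file": {"path": "/opt/app", "state": "directory"}},
    {"name": "Render config", "template": {"src": "app.j2", "dest": "/opt/app/app.conf"}, "when": "configure"},
    {"shell": "systemctl restart app", "register": "restart_result"},
]
modules = module_sequence(tasks)
print(modules)                 # ['file', 'template', 'shell']
print(category("deploy_foo"))  # deploy
```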
Helm Charts
Renders a Helm chart with different values files and extracts Kubernetes kind sequences for grammar inference:
import subprocess
from pathlib import Path
import yaml
from bex.ensemble import infer_ensemble
seqs = []
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    # Render the chart once per values file
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    if out.returncode == 0:
        # Keep the kind of every rendered Kubernetes document, in output order
        kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
                 if d and isinstance(d, dict) and 'kind' in d]
        if kinds:
            seqs.append(kinds)
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
print(f"Why: {result['why']}")
Example output (from kube-prometheus-stack, 6 CI configs):
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
iDRegEx selected (MDL score 1433.0).
CRX captures all symbols that appear. iDRegEx finds only the minimal core that every config shares:
ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
Which grammar is more useful depends on the task:
- CRX tells you everything you might need — good for an agent generating a complete chart.
- iDRegEx tells you what you always need — the bootstrap pipeline that can't be skipped.
Use prefer='crx' or prefer='idregex' to select an algorithm without the ensemble comparison.
Terraform
Parses .tf files to extract resource type sequences, per-file or per-directory:
import re
from pathlib import Path
from bex.ensemble import infer_ensemble
seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    # Capture the resource type from each `resource "<type>" "<name>" {` header
    resources = re.findall(r'resource "(\w+)" "\w+" \{', tf.read_text())
    if resources:
        seqs.append(resources)
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
Example output (from terraform-guides, hashistack example, 9 files):
Best: CRX (MDL 4.0, 9/9 match)
Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
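To see what the resource regex in the snippet above actually extracts, here it is applied to an inline string instead of files on disk. The HCL text is invented for the example.

```python
import re

# Made-up Terraform source with two resource blocks
sample_tf = '''
resource "azurerm_resource_group" "main" {
  name     = "example"
  location = "westeurope"
}

resource "azurerm_virtual_network" "vnet" {
  name = "example-vnet"
}
'''

# Same pattern as above: capture the resource type, skip the resource name
resources = re.findall(r'resource "(\w+)" "\w+" \{', sample_tf)
print(resources)  # ['azurerm_resource_group', 'azurerm_virtual_network']
```

Because `\w` does not match hyphens, a resource whose name contains one (for example "my-vm") would be skipped by this pattern.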
Grammar notation:
- a.b: a followed by b (concatenation)
- (a+b): either a or b (disjunction)
- r?: zero or one (optional)
- r+: one or more (iteration)
- r+?: zero or more (varies across examples)
- (a|b): iDRegEx-style disjunction (equivalent to (a+b))
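One way to check a sequence against a grammar written in this notation is to translate it into a Python regex. The sketch below is an assumption-laden illustration, not the matcher in bex: it treats `+` as disjunction when a symbol or `(` follows it and as iteration otherwise, supports only symbols made of word characters, and the names `to_python_regex` and `accepts` are hypothetical.

```python
import re

TOKEN = re.compile(r"\w+|[.+?()|]")


def to_python_regex(grammar):
    """Translate the notation above into a Python regex over space-terminated symbols."""
    tokens = TOKEN.findall(grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == ".":
            pass                          # concatenation is implicit in a regex
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "|", "?"):
            out.append(tok)
        elif tok == "+":
            if nxt == "(" or re.fullmatch(r"\w+", nxt):
                out.append("|")           # infix +: disjunction
            elif nxt == "?":
                out.append("*")           # r+? means zero or more
                i += 1                    # consume the '?'
            else:
                out.append("+")           # postfix +: one or more
        else:
            out.append(f"(?:{re.escape(tok)} )")  # a symbol, followed by a separator
        i += 1
    return "".join(out)


def accepts(grammar, sequence):
    """Return True if the whole sequence matches the grammar."""
    text = "".join(f"{sym} " for sym in sequence)
    return re.fullmatch(to_python_regex(grammar), text) is not None


restore = "file.copy.unarchive+.command"
validate = "hosts?.shell?.(copy+debug+fail+set_fact+uri)+?"
print(accepts(restore, ["file", "copy", "unarchive", "unarchive", "command"]))  # True
print(accepts(restore, ["file", "copy", "command"]))                            # False
print(accepts(validate, ["shell", "copy", "uri"]))                              # True
print(accepts(validate, ["uri", "hosts"]))                                      # False
```

The `r+?` case needs special handling because `+?` already means "lazy one or more" in Python regexes, which is not the same as zero or more.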
Generic YAML
Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:
from bex.yaml_to_seq import collect_all_sequences
from bex import infer_ensemble
results = collect_all_sequences('config_dir/')
seqs = [seq for _, seq in results]
result = infer_ensemble(seqs)
print(result['best']['grammar'])
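The depth-first key-path idea can be sketched on an already-parsed document. This is not the code in yaml_to_seq.py, and the exact path format bex emits is not shown in this README: the `/` separator, the handling of lists, and the name `key_paths` are all assumptions.

```python
def key_paths(node, prefix=""):
    """Yield slash-separated key paths in depth-first document order."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}/{key}" if prefix else str(key)
            yield path
            yield from key_paths(value, path)
    elif isinstance(node, list):
        # List items share their parent's path; indices are not recorded
        for item in node:
            yield from key_paths(item, prefix)


# Stand-in for the result of parsing a YAML file
config = {
    "service": {"name": "web", "ports": [{"port": 80}, {"port": 443}]},
    "replicas": 2,
}
paths = list(key_paths(config))
print(paths)
# ['service', 'service/name', 'service/ports', 'service/ports/port', 'service/ports/port', 'replicas']
```

A separator other than `.` keeps the paths from colliding with the concatenation operator in the grammar notation.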
Papers
- Bex et al. "Inferring Deterministic Regular Expressions from Positive Data" — TODS 2010
- Bex et al. "Inferring k-optimal REs from Positive Data" — arXiv:1004.2372
See papers/ for extracted text and the original references.
Tests
python -m pytest tests/
# or
python tests/test_bex.py
License
MIT