# Grammar Inference Engine

Infer **regular expression grammars** from example sequences using the BEX family of algorithms.

Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

## Quick Start

```bash
pip install pyyaml
python -m bex
```

```python
from bex import infer_ensemble

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```

Or run a single algorithm directly:

```python
from bex.crx import CRX

seqs = [...]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?
```

## Algorithms

| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures *all* symbols |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |

### Pipeline 1: Direct CHARE Inference (fast)

```
Example sequences → CRX → CHAREs grammar
```

CRX learns a grammar that accepts *all* observed symbols, marking optional ones with `?`. Best when the data is clean and you want the full vocabulary.

### Pipeline 2: Probabilistic k-ORE Inference (robust)

```
Example sequences → Complete k-OA → Baum-Welch (EM) → Disambiguate → Prune → rwr² → k-ORE grammar
```

iDRegEx learns the *minimal* common subsequence: the symbols that appear in every example. It fails (∅) when the examples are too diverse.

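The difference between the two pipelines can be pictured with a toy sketch. This is **not** the CRX or iDRegEx algorithm, and neither function is part of `bex`; `toy_optional_marking` and `toy_common_core` are hypothetical helpers that only mimic the *shape* of the results described above (full vocabulary with `?` marks versus the symbols shared by every example). The sketch assumes the symbols appear in a consistent order and does not handle repetition or disjunction.

```python
def toy_optional_marking(seqs):
    """Toy stand-in for the 'full vocabulary' view (not the CRX algorithm)."""
    # Collect symbols in order of first appearance across all sequences.
    order = []
    for seq in seqs:
        for sym in seq:
            if sym not in order:
                order.append(sym)
    # A symbol missing from at least one sequence is marked optional with '?'.
    parts = [sym if all(sym in seq for seq in seqs) else f"{sym}?" for sym in order]
    return ".".join(parts)


def toy_common_core(seqs):
    """Toy stand-in for the 'minimal core' view (not the iDRegEx algorithm)."""
    # Keep only the symbols of the first sequence that occur in every sequence.
    return [sym for sym in seqs[0] if all(sym in seq for seq in seqs)]


seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]

print(toy_optional_marking(seqs))
# file.template.docker_image.command.set_fact.shell.wait_for?
print(toy_common_core(seqs))
# ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell']
print(toy_common_core([['a', 'b'], ['c', 'd']]))
# [] : no shared symbols, analogous to the ∅ case for overly diverse examples
```
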
### Pipeline 3: Ensemble (recommended)

```
Example sequences → [CRX, iDRegEx] → MDL score each → pick best
```

Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation.

The MDL score penalizes overly general grammars: a grammar like `(a+b+c+...+z)+` that accepts everything gets a high data cost (`log2(|L(r)|)` is large), while a specific grammar like `a.b.c` has near-zero data cost.

## Architecture

```
bex/
├── crx.py            # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py        # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py           # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py          # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py            # SOA: single occurrence automaton core
├── koa.py            # k-OA: k-occurrence automaton
├── ikoa.py           # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py        # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py     # Baum-Welch EM training for k-OA
├── expr.py           # Expression utilities (concat, disj, star, strip)
├── marking.py        # State marking for determinism
├── yaml_to_seq.py    # Generic YAML → key-path sequence converter
├── role_grammar.py   # Ansible role → module-sequence extractor
├── ensemble.py       # Ensemble: runs CRX + iDRegEx, picks best by MDL
├── mdl.py            # MDL scoring for grammar selection (fix)
├── mcp_server.py     # MCP server exposing 4 tools
└── ...
```

## MCP Server

A **Model Context Protocol** server exposes all algorithms and domain adapters as tools:

```bash
python -m bex.mcp_server
```

### Tools

| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip the ensemble and return only that algorithm. Returns a structured report with candidates, MDL scores, and a `Why:` explanation. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | Generic YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |

### Using `infer_best_grammar`

The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass `prefer`:

```
User: Run CRX on our deploy tasks.
Agent: [runs with prefer='crx']
Best: CRX (MDL 7.0)
Grammar: file.template.docker_image.command.set_fact.shell.wait_for?
  CRX      MDL=    7.00  file.template.docker_image.command.set_fact.shell.wait_for?
Why: Requested CRX only.
```

Without `prefer`, the ensemble compares both:

```
User: Find the grammar for our Helm chart.
Agent: [runs]
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  iDRegEx  MDL= 1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX      MDL= 2651.74  (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences, iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
```

Both grammars are correct; they operate at different levels of specificity. The `Why:` field helps the agent decide which one to use for the task at hand.

## Ensemble Selection

The `infer_best_grammar` tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.

### How MDL scoring works

```
MDL = model_cost + data_cost
```

- **model_cost**: number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost**: Σ log₂(|L(r) at length len(s)|) across all sequences. A grammar that accepts *many* strings of the same length (like a 17-way disjunction `(a+b+...+q)+`) has high data cost because `|L(r)|` is large. A specific, fixed sequence (`a.b.c.d.e`) has `|L(r)| = 1`, so its data cost is zero.

The ensemble selects the grammar with the lowest total MDL.
This automatically picks the right level of specificity for the data.

### When each algorithm wins

| Scenario | Winner | Why |
|----------|--------|-----|
| Many sequences, diverse patterns | **CRX** | CRX captures the full vocabulary. iDRegEx can't find a common core. |
| Clean, structured sequences | **CRX** | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
| Few sequences (2–3) | **iDRegEx** | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
| Sequences share a clear core | **iDRegEx** | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
| Single sequence | **iDRegEx** (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |

### Real-world benchmarks

Results from three domains using the ensemble (fixed MDL scoring):

```
Dataset                  Best      MDL      Matches
──────────────────────────────────────────────────────────
Helm (prom-stack)        iDRegEx   1433.0   1/6
Ansible (deploy)         CRX       246.1    34/36
Ansible (validate)       CRX       34.0     5/5
Ansible (restore)        CRX       24.0     2/2
Ansible (manage)         iDRegEx   25.0     1/2
Ansible (configure)      iDRegEx   22.5     1/4
Terraform (hashistack)   CRX       4.0      9/9
```

Note: MDL scores are not comparable across datasets, only within the same run (CRX vs iDRegEx on the same sequences). The Helm score is higher because each sequence is ~120 symbols long, making the data cost term dominant for the overly general CRX grammar (19 kinds × many lengths).

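The formula from "How MDL scoring works" can be worked through by hand for the two grammar shapes it mentions. The sketch below is an illustration under stated assumptions, not the code in `mdl.py`: the function names are hypothetical, every sequence is assumed to be accepted by the grammar, and only two special cases are covered (a fixed concatenation, which accepts exactly one string, and a `k`-way disjunction under `+`, which accepts `k**n` strings of length `n`).

```python
import math


def mdl_fixed_sequence(symbols, seqs):
    """MDL of a fixed concatenation such as a.b.c.d.e (hypothetical helper)."""
    model_cost = len(set(symbols))                 # unique symbols in the grammar
    data_cost = sum(math.log2(1) for _ in seqs)    # |L(r)| = 1, so log2(1) = 0 per sequence
    return model_cost + data_cost


def mdl_disjunction_plus(alphabet, seqs):
    """MDL of (a+b+...)+ over k symbols (hypothetical helper)."""
    k = len(set(alphabet))
    model_cost = k
    # k**n strings of length n are accepted, and log2(k**n) = n * log2(k).
    data_cost = sum(len(s) * math.log2(k) for s in seqs)
    return model_cost + data_cost


seqs = [list("abcde")] * 3                          # three identical length-5 examples
alphabet = [chr(ord("a") + i) for i in range(17)]   # 'a' .. 'q', the 17-way disjunction

print(f"{mdl_fixed_sequence(list('abcde'), seqs):.2f}")   # 5.00
print(f"{mdl_disjunction_plus(alphabet, seqs):.2f}")      # 78.31
```

The fixed sequence pays only its model cost (5 symbols), while the disjunction pays 17 for the model plus 15 × log₂(17) ≈ 61.31 for the data, which is why the more specific grammar wins on these examples.
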
## Domain Adapters

### Ansible Roles

Extracts module names from `tasks/main.yml`, groups by category prefix (e.g., `deploy_foo` → `deploy`), and learns per-category grammars:

```python
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences

all_roles, by_category = collect_all_role_sequences('path/to/roles')

for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    # Grammar inference needs at least two examples per category.
    if len(seqs) >= 2:
        result = infer_ensemble(seqs)
        print(f"── {cat} ({len(items)} roles) ──")
        print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
        print(f"  Grammar: {result['best']['grammar']}")
        print(f"  Why: {result['why']}")
```

**Example output** (from [companyweb](https://github.com/anomalyco/companyweb), 51 roles):

```
── restore (2 roles) ──
  Best: CRX (MDL 24.0)
  Grammar: file.copy.unarchive+.command
  Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.

── validate (5 roles) ──
  Best: CRX (MDL 34.0)
  Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
  Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.

── configure (4 roles) ──
  Best: iDRegEx (MDL 22.5)
  Grammar: include_role
  Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
```

### Helm Charts

Renders a Helm chart with different values files and extracts Kubernetes `kind` sequences for grammar inference:

```python
import subprocess
from pathlib import Path

import yaml

from bex.ensemble import infer_ensemble

seqs = []
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    # Render the chart once per values file.
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    if out.returncode == 0:
        # One sequence per render: the `kind` of each manifest, in output order.
        kinds = [
            d['kind']
            for d in yaml.safe_load_all(out.stdout)
            if d and isinstance(d, dict) and 'kind' in d
        ]
        if kinds:
            seqs.append(kinds)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
print(f"Why: {result['why']}")
```

**Example output** (from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), 6 CI configs):

```
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  iDRegEx  MDL= 1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX      MDL= 2651.74  (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
```

CRX captures *all* symbols that appear. iDRegEx finds only the minimal core that every config shares:

```
ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
```

Which grammar is more useful depends on the task:

- **CRX** tells you everything you *might* need, which is good for an agent generating a complete chart.
- **iDRegEx** tells you what you *always* need: the bootstrap pipeline that can't be skipped.

Use `prefer='crx'` or `prefer='idregex'` to select an algorithm without the ensemble comparison.

### Terraform

Parses `.tf` files to extract `resource` type sequences, per-file or per-directory:

```python
import re
from pathlib import Path

from bex.ensemble import infer_ensemble

# Capture the resource type; resource names may also contain hyphens.
RESOURCE_RE = re.compile(r'resource\s+"(\w+)"\s+"[\w-]+"\s*\{')

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    resources = RESOURCE_RE.findall(tf.read_text())
    if resources:
        seqs.append(resources)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```

**Example output** (from [terraform-guides](https://github.com/hashicorp/terraform-guides), hashistack example, 9 files):

```
Best: CRX (MDL 4.0, 9/9 match)
Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
```

**Grammar notation:**

- `a.b`: `a` followed by `b` (concatenation)
- `(a+b)`: either `a` or `b` (disjunction)
- `r?`: zero or one (optional)
- `r+`: one or more (iteration)
- `r+?`: zero or more (varies across examples)
- `(a|b)`: iDRegEx-style disjunction (equivalent to `(a+b)`)

## Domain: Generic YAML

Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:

```python
from bex.yaml_to_seq import collect_all_sequences
from bex import infer_ensemble

results = collect_all_sequences('config_dir/')
seqs = [seq for _, seq in results]

result = infer_ensemble(seqs)
print(result['best']['grammar'])
```

## Papers

- **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"*, TODS 2010
- **Bex et al.** *"Inferring k-optimal REs from Positive Data"*, arXiv:1004.2372

See `papers/` for extracted text and the original references.

## Tests

```bash
python -m pytest tests/
# or
python tests/test_bex.py
```

## License

MIT