# Grammar Inference Engine

Infer **regular expression grammars** from example sequences using the BEX family of algorithms.

Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

## Quick Start

```bash
pip install pyyaml
python -m bex
```

```python
from bex.crx import CRX

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?
```

## Algorithms

| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference from many sequences |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Handles noise, learns from few examples |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Builds regex from a single automaton |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | Post-processing for k-ORE extraction |

### Pipeline 1: Direct CHARE Inference (fast)

```
Example sequences → CRX → CHARE grammar
```

### Pipeline 2: Probabilistic k-ORE Inference (robust)

```
Example sequences → Complete k-OA → Baum-Welch (EM) → Disambiguate → Prune → rwr² → k-ORE grammar
```

## Architecture

```
bex/
├── crx.py           # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py       # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py          # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py         # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py           # SOA: Single Occurrence Automaton core
├── koa.py           # k-OA: k-Occurrence Automaton
├── ikoa.py          # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py       # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py    # Baum-Welch EM training for k-OA
├── expr.py          # Expression utilities (concat, disj, star, strip)
├── marking.py       # State marking for determinism
├── yaml_to_seq.py   # Generic YAML → key-path sequence converter
├── role_grammar.py  # Ansible role → module-sequence extractor
└── ...
```

## Domain: Ansible Role Grammar

The engine includes a domain adapter for Ansible roles. It extracts module names from `tasks/main.yml` files and learns one grammar per role category:

```bash
python -c "
from bex.role_grammar import collect_all_role_sequences, learn_grammar
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    print(f'{cat}: {learn_grammar(seqs)}')
"
```

### Example Output

```
── restore (2 roles) ──
  Grammar: file.copy.unarchive+.command

── validate (5 roles) ──
  Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?

── configure (4 roles) ──
  Grammar: (assert+debug+set_fact+uri)+?.include_role?
```

**Grammar notation:**

- `a.b`: `a` followed by `b` (concatenation)
- `(a+b)`: either `a` or `b` (disjunction)
- `r?`: zero or one (optional)
- `r+`: one or more (iteration)
- `r+?`: zero or more (varies across examples)

## Domain: Generic YAML

The engine can convert any YAML file into key-path sequences for grammar inference:

```python
from bex.yaml_to_seq import yaml_file_to_sequence, sequences_to_crx

grammar = sequences_to_crx(yaml_file_to_sequence('config.yml'))
```

## Papers

- **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"*, TODS 2010
- **Bex et al.** *"Inferring k-optimal REs from Positive Data"*, arXiv:1004.2372

See `papers/` for extracted text and the original references.

## Tests

```bash
python -m pytest tests/
# or
python tests/test_bex.py
```

## MCP Server

A Model Context Protocol server for grammar inference is planned. See `AGENTS.md` for the roadmap.

## License

MIT
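
## Appendix: Checking a Sequence Against a Grammar

The grammar notation used in the example output differs from common regex syntax: `+` is both the infix disjunction operator and the postfix iteration operator. The sketch below shows one way to translate a grammar string into a Python `re` pattern and test whether a sequence is accepted. It assumes that symbols are plain identifiers, that disjunctions appear only inside parentheses (as in the examples shown), and that a `+` directly followed by a symbol or `(` is a disjunction. The helpers `grammar_to_regex` and `matches` are hypothetical and are not part of the `bex` package.

```python
import re

# A symbol (identifier) or one of the operator characters of the notation.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[.()+?]")


def grammar_to_regex(grammar: str) -> str:
    """Translate the README notation into a Python regex over
    space-terminated symbols (hypothetical helper, not bex API)."""
    compact = "".join(grammar.split())
    tokens = TOKEN.findall(compact)
    if "".join(tokens) != compact:
        raise ValueError(f"unsupported character in grammar: {grammar!r}")

    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == ".":
            pass  # concatenation is implicit in regex syntax
        elif tok == "(":
            out.append("(?:")
        elif tok == ")":
            out.append(")")
        elif tok == "?":
            out.append("?")
        elif tok == "+":
            starts_operand = nxt is not None and (
                nxt == "(" or nxt[0].isalpha() or nxt[0] == "_"
            )
            if starts_operand:
                out.append("|")  # infix: disjunction
            elif nxt == "?":
                out.append("*")  # postfix `+?`: zero or more
                i += 1           # consume the `?` as well
            else:
                out.append("+")  # postfix: one or more
        else:
            # A symbol: match the name followed by the separator space.
            out.append("(?:" + re.escape(tok) + " )")
        i += 1
    return "".join(out)


def matches(grammar: str, seq: list) -> bool:
    """Return True if the symbol sequence is accepted by the grammar."""
    text = "".join(symbol + " " for symbol in seq)
    return re.fullmatch(grammar_to_regex(grammar), text) is not None


if __name__ == "__main__":
    g = "file.copy.unarchive+.command"
    print(matches(g, ["file", "copy", "unarchive", "unarchive", "command"]))
    print(matches(g, ["file", "copy", "command"]))
```

Terminating every symbol with a space keeps a name such as `set_fact` from being matched as a prefix of a longer name, so matching happens on whole symbols and not on characters.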