grammar-inference-engine/README.md
tobjend 7c00c6713d Initial commit: BEX-based grammar inference engine
- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing
2026-07-01 08:01:16 +02:00

Grammar Inference Engine

Infer regular expression grammars from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

Quick Start

pip install pyyaml
python -m bex

from bex.crx import CRX

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?

Algorithms

| Algorithm | What it learns | Paper | Use case |
|---|---|---|---|
| CRX | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference from many sequences |
| iDRegEx | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Handles noise, learns from few examples |
| RWR₀ | SOREs (iterative repair) | TODS 2010 §5.2 | Builds regex from a single automaton |
| rwr² | k-ORE from k-OA | arXiv 2010 | Post-processing for k-ORE extraction |

Pipeline 1: Direct CHARE Inference (fast)

Example sequences → CRX → CHARE grammar

Pipeline 2: Probabilistic k-ORE Inference (robust)

Example sequences → Complete k-OA → Baum-Welch (EM)
  → Disambiguate → Prune → rwr² → k-ORE grammar

Architecture

bex/
├── crx.py          # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py      # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py         # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py        # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py          # SOA: single occurrence automaton core
├── koa.py          # k-OA: k-occurrence automaton
├── ikoa.py         # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py      # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py   # Baum-Welch EM training for k-OA
├── expr.py         # Expression utilities (concat, disj, star, strip)
├── marking.py      # State marking for determinism
├── yaml_to_seq.py  # Generic YAML → key-path sequence converter
├── role_grammar.py # Ansible role → module-sequence extractor
└── ...
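
To make the automaton modules more concrete, here is a minimal sketch of the 2T-INF idea as it is usually described: record which symbols start a sequence, which end one, and which adjacent pairs occur. The names `two_t_inf` and `accepts` and the tuple layout are illustrative assumptions, not the API of soa.py or twotinf.py.

```python
def two_t_inf(sequences):
    """Build a single occurrence automaton (one state per symbol).

    Returns (initial, final, edges, accepts_empty). This layout is a
    hypothetical one chosen for the sketch, not the engine's own types.
    """
    initial, final, edges = set(), set(), set()
    accepts_empty = False
    for seq in sequences:
        if not seq:
            accepts_empty = True
            continue
        initial.add(seq[0])               # symbols that may start a sequence
        final.add(seq[-1])                # symbols that may end a sequence
        edges.update(zip(seq, seq[1:]))   # observed adjacent pairs
    return initial, final, edges, accepts_empty


def accepts(soa, seq):
    """Check whether the automaton accepts a sequence."""
    initial, final, edges, accepts_empty = soa
    if not seq:
        return accepts_empty
    return (
        seq[0] in initial
        and seq[-1] in final
        and all(pair in edges for pair in zip(seq, seq[1:]))
    )


soa = two_t_inf([["a", "b", "c"], ["a", "c"]])
print(accepts(soa, ["a", "b", "c"]))  # True
print(accepts(soa, ["a", "b"]))       # False: "b" never ends a sequence
```

An automaton of this shape is the kind of input that a repair step such as RWR₀ would then turn into an expression.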

Domain: Ansible Role Grammar

The engine includes a domain adapter for Ansible roles. It extracts module names from tasks/main.yml files and learns per-category grammars:

python -c "
from bex.role_grammar import collect_all_role_sequences, learn_grammar
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    print(f'{cat}: {learn_grammar(seqs)}')
"
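
As a rough illustration of what "extracting module names" can mean, the sketch below takes an already parsed task list (for example the result of `yaml.safe_load` on a tasks/main.yml) and treats the first key that is not a common task keyword as the module. The keyword set and the function name `module_sequence` are assumptions made for this example; role_grammar.py may use different rules, and block/rescue/always sections are not handled here.

```python
# Common task-level keywords that are not module names (deliberately incomplete).
TASK_KEYWORDS = {
    "name", "when", "register", "become", "become_user", "tags", "vars",
    "loop", "loop_control", "notify", "changed_when", "failed_when",
    "ignore_errors", "delegate_to", "no_log", "environment", "args",
    "until", "retries", "delay", "run_once", "check_mode",
}


def module_sequence(tasks):
    """Return the module name of each task, in file order."""
    seq = []
    for task in tasks:
        for key in task:
            if key in TASK_KEYWORDS or key.startswith("with_"):
                continue
            # Reduce a fully qualified name: ansible.builtin.copy -> copy
            seq.append(key.rsplit(".", 1)[-1])
            break
    return seq


tasks = [
    {"name": "Create directory", "file": {"path": "/opt/app", "state": "directory"}},
    {"name": "Render config", "ansible.builtin.template": {"src": "a.j2", "dest": "/opt/app/a"}, "when": "enabled"},
    {"shell": "echo done", "register": "out"},
]
print(module_sequence(tasks))  # ['file', 'template', 'shell']
```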

Example Output

── restore (2 roles) ──
  Grammar: file.copy.unarchive+.command

── validate (5 roles) ──
  Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?

── configure (4 roles) ──
  Grammar: (assert+debug+set_fact+uri)+?.include_role?

Grammar notation:

  • a.b: a followed by b (concatenation)
  • (a+b): either a or b (disjunction; an infix + between two expressions)
  • r?: zero or one (optional)
  • r+: one or more (iteration; a postfix +)
  • r+?: zero or more (an iteration that some examples omit; equivalent to r*)
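
To check a sequence against a grammar written in this notation, one option is to translate it into a Python `re` pattern. The sketch below is an assumption-laden illustration, not part of the engine: `notation_to_regex` and `matches` are hypothetical helpers, symbols are assumed to be plain identifiers, and a `+` is read as disjunction only when a symbol or `(` follows it. Note that `+?` is mapped to `*`, because Python's `re` would otherwise treat `+?` as a lazy quantifier.

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|[().+?]")


def notation_to_regex(grammar):
    """Translate the dotted grammar notation into a compiled Python regex."""
    tokens = TOKEN.findall(grammar)
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == ".":
            continue  # concatenation is implicit in regex syntax
        if tok == "+":
            # Infix + (disjunction) is followed by a symbol or "(".
            infix = nxt is not None and (
                nxt == "(" or nxt[0].isalpha() or nxt[0] == "_"
            )
            out.append("|" if infix else "+")
        elif tok == "?":
            if out and out[-1] == "+":
                out[-1] = "*"  # r+? means zero or more
            else:
                out.append("?")
        elif tok == "(":
            out.append("(?:")
        elif tok == ")":
            out.append(")")
        else:
            # Each symbol consumes itself plus one trailing space delimiter.
            out.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(out))


def matches(grammar, seq):
    """True if the whole symbol sequence is described by the grammar."""
    text = "".join(symbol + " " for symbol in seq)
    return notation_to_regex(grammar).fullmatch(text) is not None


g = "hosts?.shell?.(copy+debug+fail+set_fact+uri)+?"
print(matches(g, ["shell", "uri", "debug"]))  # True
print(matches(g, ["uri", "shell"]))           # False: shell must come first
```

The trailing space after each symbol keeps one module name from matching as a prefix of another.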

Domain: Generic YAML

The engine can convert any YAML file into key-path sequences for grammar inference:

from bex.yaml_to_seq import yaml_file_to_sequence, sequences_to_crx

grammar = sequences_to_crx(yaml_file_to_sequence('config.yml'))
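
What a "key-path sequence" looks like is not spelled out here, so the following is only one plausible reading: walk the parsed YAML in document order and emit a dotted path for every mapping key, marking list membership with `[]`. The function `key_paths` and the `[]` marker are assumptions for illustration and may differ from what yaml_to_seq.py produces. The input is already parsed data, such as the result of `yaml.safe_load`.

```python
def key_paths(node, prefix=""):
    """Yield dotted key paths of nested dicts and lists in document order."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield path
            yield from key_paths(value, path)
    elif isinstance(node, list):
        for item in node:
            # All list items share one path segment, so repeats become visible.
            yield from key_paths(item, prefix + "[]")
    # Scalars contribute no path of their own.


config = {
    "server": {
        "host": "example.org",
        "ports": [{"http": 80}, {"https": 443}],
    }
}
print(list(key_paths(config)))
# ['server', 'server.host', 'server.ports',
#  'server.ports[].http', 'server.ports[].https']
```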

Papers

  • Bex et al. "Inferring Deterministic Regular Expressions from Positive Data" — TODS 2010
  • Bex et al. "Inferring k-optimal REs from Positive Data" — arXiv:1004.2372

See papers/ for extracted text and the original references.

Tests

python -m pytest tests/
# or
python tests/test_bex.py

MCP Server

A Model Context Protocol server for grammar inference is planned. See AGENTS.md for the roadmap.

License

MIT