BEX-based grammar inference engine: learn regular expression patterns from example sequences. Supports CHAREs (CRX), k-OREs (iDRegEx), and the full BEX pipeline (SOA→2T-INF→RWR₀→CRX / iKoa→BW→Disambiguate→Prune→rwr²).
tobjend adc52c99ec Add MCP server: grammar inference via FastMCP
- bex/mcp_server.py: FastMCP server with 3 tools:
  * infer_grammar(sequences, method='crx'|'idregex')
  * infer_yaml_grammar(yaml_dir, pattern, method)
  * infer_ansible_role_grammar(roles_dir)
- pyproject.toml: add bex-mcp console_scripts entry point
2026-07-01 08:03:10 +02:00

Grammar Inference Engine

Infer regular expression grammars from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

Quick Start

pip install pyyaml
python -m bex

from bex.crx import CRX

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?

Algorithms

| Algorithm | What it learns | Paper | Use case |
|---|---|---|---|
| CRX | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference from many sequences |
| iDRegEx | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Handles noise, learns from few examples |
| RWR₀ | SOREs (iterative repair) | TODS 2010 §5.2 | Builds regex from a single automaton |
| rwr² | k-ORE from k-OA | arXiv 2010 | Post-processing for k-ORE extraction |

Pipeline 1: Direct CHARE Inference (fast)

Example sequences → CRX → CHAREs grammar
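
The idea behind this pipeline can be illustrated with a simplified sketch. It is not the implementation in bex/crx.py: the function name `infer_chare_sketch` is hypothetical, and the step of the published algorithm that merges incomparable symbols is left out. The sketch groups symbols that can follow each other in both directions into one factor, orders the factors by reachability, and chooses a multiplicity from per-sequence occurrence counts.

```python
from collections import defaultdict


def infer_chare_sketch(seqs):
    """Simplified CHARE inference (illustrative only, not bex.crx.CRX)."""
    # Alphabet in order of first appearance, for deterministic output.
    symbols = []
    for seq in seqs:
        for sym in seq:
            if sym not in symbols:
                symbols.append(sym)

    # Direct "is immediately followed by" relation.
    succ = defaultdict(set)
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            succ[a].add(b)

    def reachable(start):
        # Symbols reachable from `start` through one or more successor steps.
        seen, stack = set(), list(succ[start])
        while stack:
            sym = stack.pop()
            if sym not in seen:
                seen.add(sym)
                stack.extend(succ[sym])
        return seen

    reach = {sym: reachable(sym) for sym in symbols}

    # Symbols that reach each other form one factor (a1+...+an).
    classes = []
    for sym in symbols:
        for cls in classes:
            rep = cls[0]
            if sym in reach[rep] and rep in reach[sym]:
                cls.append(sym)
                break
        else:
            classes.append([sym])

    # A class that reaches another class always reaches strictly more
    # outside symbols, so this (stable) sort is a topological order.
    classes.sort(key=lambda cls: -len(reach[cls[0]] - set(cls)))

    factors = []
    for cls in classes:
        counts = [sum(sym in cls for sym in seq) for seq in seqs]
        body = cls[0] if len(cls) == 1 else "(" + "+".join(cls) + ")"
        if max(counts) > 1:
            body += "+"   # repeated in at least one example
        if min(counts) == 0:
            body += "?"   # absent from at least one example
        factors.append(body)
    return ".".join(factors)


print(infer_chare_sketch([["a", "b", "a", "c"], ["c"]]))
# (a+b)+?.c
```

On the two Quick Start sequences this sketch returns file.template.docker_image.command.set_fact.shell.wait_for?, which differs from the Quick Start output only in the parentheses around the optional last factor.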

Pipeline 2: Probabilistic k-ORE Inference (robust)

Example sequences → Complete k-OA → Baum-Welch (EM)
  → Disambiguate → Prune → rwr² → k-ORE grammar
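
The first stage of this pipeline starts from a complete k-OA. The following is a minimal sketch of that starting point. It assumes the construction uses k states per symbol, connects every state to every other state, and starts EM from random (not uniform) probabilities so that the k copies of a symbol can diverge during training. The names `complete_koa`, `SRC`, and `SINK` are hypothetical and are not taken from bex/koa.py.

```python
import random


def complete_koa(alphabet, k, seed=0):
    """Build a fully connected probabilistic automaton with k states per symbol."""
    rng = random.Random(seed)
    # One state per (symbol, copy index) pair.
    states = [(sym, i) for sym in sorted(alphabet) for i in range(1, k + 1)]

    def random_row(targets):
        # Random positive weights, normalised so the row sums to 1.
        weights = [rng.random() + 1e-9 for _ in targets]
        total = sum(weights)
        return {t: w / total for t, w in zip(targets, weights)}

    # The source can move to any state; every state can move to any
    # state (including itself) or stop at the sink.
    trans = {"SRC": random_row(states)}
    for state in states:
        trans[state] = random_row(states + ["SINK"])
    return states, trans


states, trans = complete_koa({"file", "copy", "command"}, k=2)
print(len(states), "states,", sum(len(row) for row in trans.values()), "edges")
```

Baum-Welch would then re-estimate these probabilities from the example sequences, and the later stages remove low-probability edges before a regular expression is extracted.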

Architecture

bex/
├── crx.py          # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py      # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py         # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py        # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py          # SOA: Single Occurrence Automaton core
├── koa.py          # k-OA: k-Occurrence Automaton
├── ikoa.py         # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py      # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py   # Baum-Welch EM training for k-OA
├── expr.py         # Expression utilities (concat, disj, star, strip)
├── marking.py      # State marking for determinism
├── yaml_to_seq.py  # Generic YAML → key-path sequence converter
├── role_grammar.py # Ansible role → module-sequence extractor
└── ...
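
To illustrate what the 2T-INF step produces, here is a small sketch of 2-testable inference. It assumes the automaton is represented as plain sets of first symbols, last symbols, and adjacent symbol pairs. The function names and the tuple layout are hypothetical and do not reflect the API of bex/twotinf.py or bex/soa.py.

```python
def two_t_inf(seqs):
    """Collect the first symbols, last symbols, and adjacent pairs of the examples."""
    initial, final, edges = set(), set(), set()
    accepts_empty = False
    for seq in seqs:
        if not seq:
            accepts_empty = True
            continue
        initial.add(seq[0])
        final.add(seq[-1])
        edges.update(zip(seq, seq[1:]))
    return initial, final, edges, accepts_empty


def accepts(automaton, seq):
    """A sequence is accepted if its start, end, and every adjacent pair were observed."""
    initial, final, edges, accepts_empty = automaton
    if not seq:
        return accepts_empty
    return (
        seq[0] in initial
        and seq[-1] in final
        and all(pair in edges for pair in zip(seq, seq[1:]))
    )


soa = two_t_inf([["file", "copy", "command"], ["file", "command"]])
print(accepts(soa, ["file", "copy", "command"]))  # True
print(accepts(soa, ["copy", "file"]))             # False
```

Because each symbol labels exactly one state, the resulting automaton generalizes beyond the examples: any path built from observed pairs is accepted, even if that exact sequence was never seen.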

Domain: Ansible Role Grammar

The engine includes a domain adapter for Ansible roles. It extracts module names from tasks/main.yml files and learns per-category grammars:

python -c "
from bex.role_grammar import collect_all_role_sequences, learn_grammar
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    print(f'{cat}: {learn_grammar(seqs)}')
"
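
The extraction step can be pictured with a sketch that works on already parsed task dictionaries, so it needs no YAML library. It assumes that the module of a task is its first key that is not a generic task keyword. The `TASK_KEYWORDS` set is a partial, hand-picked list, and `module_sequence` is a hypothetical helper, not a function from bex/role_grammar.py.

```python
# Partial list of generic Ansible task keywords (assumption: not exhaustive).
TASK_KEYWORDS = {
    "name", "when", "tags", "register", "become", "become_user", "vars",
    "loop", "with_items", "notify", "ignore_errors", "changed_when",
    "failed_when", "delegate_to", "environment", "args", "no_log",
    "until", "retries", "delay", "run_once", "loop_control",
}


def module_sequence(tasks):
    """Return the module name of each task, in file order."""
    sequence = []
    for task in tasks:
        for key in task:
            if key not in TASK_KEYWORDS:
                # Strip a collection prefix such as "ansible.builtin.".
                sequence.append(key.rsplit(".", 1)[-1])
                break
    return sequence


tasks = [
    {"name": "Create directory", "file": {"path": "/opt/app", "state": "directory"}},
    {"name": "Render config", "ansible.builtin.template": {"src": "app.j2", "dest": "/opt/app/app.conf"}},
    {"name": "Restart", "shell": "systemctl restart app", "when": "changed"},
]
print(module_sequence(tasks))
# ['file', 'template', 'shell']
```

Sequences of this shape, one per role, are what the grammar learner receives as input.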

Example Output

── restore (2 roles) ──
  Grammar: file.copy.unarchive+.command

── validate (5 roles) ──
  Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?

── configure (4 roles) ──
  Grammar: (assert+debug+set_fact+uri)+?.include_role?

Grammar notation:

  • a.b — a followed by b (concatenation)
  • (a+b) — either a or b (disjunction)
  • r? — zero or one (optional)
  • r+ — one or more (iteration)
  • r+? — zero or more (varies across examples)
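
One way to check a sequence against a grammar in this notation is to translate it into a Python `re` pattern over comma-terminated symbols. The sketch below is illustrative only: `grammar_to_regex` and `matches` are hypothetical helpers that are not part of the bex package, and they assume symbol names consist of word characters and contain no commas.

```python
import re


def grammar_to_regex(grammar):
    """Translate the notation above into a compiled `re` pattern."""
    tokens = re.findall(r"\w+|[().+?]", grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == ".":
            pass                      # concatenation is implicit in a regex
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "?"):
            out.append(tok)
        elif tok == "+":
            if nxt is not None and (nxt == "(" or re.fullmatch(r"\w+", nxt)):
                out.append("|")       # infix '+': disjunction
            elif nxt == "?":
                out.append("*")       # 'r+?': zero or more
                i += 1                # consume the '?'
            else:
                out.append("+")       # postfix '+': one or more
        else:
            out.append("(?:" + re.escape(tok) + ",)")
        i += 1
    return re.compile("".join(out))


def matches(grammar, seq):
    """True if the symbol sequence is described by the grammar."""
    text = "".join(sym + "," for sym in seq)
    return grammar_to_regex(grammar).fullmatch(text) is not None


print(matches("file.copy.unarchive+.command",
              ["file", "copy", "unarchive", "unarchive", "command"]))  # True
print(matches("file.copy.unarchive+.command", ["file", "copy", "command"]))  # False
```

The comma after each symbol keeps names such as set and set_fact from matching as prefixes of one another.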

Domain: Generic YAML

The engine can convert any YAML file into key-path sequences for grammar inference:

from bex.yaml_to_seq import yaml_file_to_sequence, sequences_to_crx

grammar = sequences_to_crx(yaml_file_to_sequence('config.yml'))
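
As a rough picture of what a key-path sequence looks like, the sketch below walks an already parsed YAML document (nested dicts and lists) and emits one path per key. The dotted path format, the `[]` marker for list items, and the name `key_paths` are assumptions made for illustration. The actual output format of bex/yaml_to_seq.py may differ.

```python
def key_paths(node, prefix=""):
    """Yield a dotted path for every key, in document order."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield path
            yield from key_paths(value, path)
    elif isinstance(node, list):
        # All items of a list share one path segment, so repeated
        # entries show up as repeated symbols in the sequence.
        for item in node:
            yield from key_paths(item, prefix + "[]")


config = {
    "server": {"host": "localhost", "ports": [{"name": "http"}, {"name": "https"}]},
    "debug": True,
}
print(list(key_paths(config)))
# ['server', 'server.host', 'server.ports', 'server.ports[].name', 'server.ports[].name', 'debug']
```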

Papers

  • Bex et al., "Inference of Concise Regular Expressions and DTDs", ACM TODS 35(2), 2010
  • Bex et al., "Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data", arXiv:1004.2372

See papers/ for extracted text and the original references.

Tests

python -m pytest tests/
# or
python tests/test_bex.py

MCP Server

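A Model Context Protocol server for grammar inference is available in bex/mcp_server.py, built on FastMCP and exposed through the bex-mcp console script. It provides three tools: infer_grammar(sequences, method='crx'|'idregex'), infer_yaml_grammar(yaml_dir, pattern, method), and infer_ansible_role_grammar(roles_dir). See AGENTS.md for the roadmap.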

License

MIT