grammar-inference-engine/README.md
tobjend 7c00c6713d Initial commit: BEX-based grammar inference engine
- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing
2026-07-01 08:01:16 +02:00


# Grammar Inference Engine
Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.
## Quick Start
```bash
pip install pyyaml
python -m bex
```
```python
from bex.crx import CRX
seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?
```
## Algorithms
| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs: chain regular expressions (single-pass, deterministic) | TODS 2010 §6 | Fast inference from many sequences |
| **iDRegEx** | k-OREs: k-occurrence regular expressions (probabilistic, Baum-Welch) | arXiv 2010 | Handles noise, learns from few examples |
| **RWR₀** | SOREs: single occurrence regular expressions (iterative repair) | TODS 2010 §5.2 | Builds a regex from a single automaton |
| **rwr²** | A k-ORE extracted from a k-OA (k-occurrence automaton) | arXiv 2010 | Post-processing step for k-ORE extraction |
### Pipeline 1: Direct CHARE Inference (fast)
```
Example sequences → CRX → CHARE grammar
```
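The direct pipeline can be illustrated with a simplified, CRX-style sketch. It is not the code in `bex/crx.py`: the function name `infer_chare_sketch` is hypothetical, and the step that merges compatible singleton classes is left out. It groups symbols that can reach each other through the "directly follows" relation, orders the groups topologically, and picks a quantifier from how often each group occurs per example. Output formatting may differ from what the real engine prints.

```python
from itertools import product


def infer_chare_sketch(sequences):
    """Return a CHARE-like string in this README's notation (simplified)."""
    symbols = sorted({sym for seq in sequences for sym in seq})

    # reach[a] starts as {a} plus every symbol that directly follows a.
    reach = {a: {a} for a in symbols}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            reach[a].add(b)

    # Warshall-style transitive closure; the intermediate symbol k is the
    # outer loop, i the inner one.
    for k, i in product(symbols, repeat=2):
        if k in reach[i]:
            reach[i] |= reach[k]

    # Symbols that reach each other form one equivalence class.
    classes, seen = [], set()
    for a in symbols:
        if a not in seen:
            cls = sorted(b for b in reach[a] if a in reach[b])
            classes.append(cls)
            seen.update(cls)

    # A class that precedes another reaches strictly more symbols, so sorting
    # by descending reach size gives a valid topological order.
    classes.sort(key=lambda cls: (-len(reach[cls[0]]), cls[0]))

    factors = []
    for cls in classes:
        # How many symbols of this class occur in each example sequence.
        counts = [sum(sym in cls for sym in seq) for seq in sequences]
        if all(c == 1 for c in counts):
            quantifier = ""      # exactly once everywhere
        elif all(c <= 1 for c in counts):
            quantifier = "?"     # at most once
        elif all(c >= 1 for c in counts):
            quantifier = "+"     # at least once
        else:
            quantifier = "+?"    # zero or more
        body = cls[0] if len(cls) == 1 else "(" + "+".join(cls) + ")"
        factors.append(body + quantifier)
    return ".".join(factors)


print(infer_chare_sketch([["a", "b", "a", "c"], ["c"]]))
# (a+b)+?.c
```

Because `a` and `b` follow each other in the first example, they end up in one class; that class is missing from the second example and repeated in the first, so it receives the zero-or-more quantifier.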
### Pipeline 2: Probabilistic k-ORE Inference (robust)
```
Example sequences → Complete k-OA → Baum-Welch (EM)
→ Disambiguate → Prune → rwr² → k-ORE grammar
```
## Architecture
```
bex/
├── crx.py # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py # SOA: Single Occurrence Automaton core
├── koa.py # k-OA: k-Occurrence Automaton
├── ikoa.py # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py # Baum-Welch EM training for k-OA
├── expr.py # Expression utilities (concat, disj, star, strip)
├── marking.py # State marking for determinism
├── yaml_to_seq.py # Generic YAML → key-path sequence converter
├── role_grammar.py # Ansible role → module-sequence extractor
└── ...
```
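To make the automaton modules more concrete, here is a minimal sketch of 2T-INF-style inference, which builds an automaton with one state per symbol from the first symbols, last symbols, and adjacent pairs seen in the examples. The names `two_t_inf` and `accepts` and the tuple representation are assumptions made for illustration; the actual classes in `bex/soa.py` and `bex/twotinf.py` may look quite different.

```python
def two_t_inf(sequences):
    """Collect initial symbols, final symbols, and observed 2-grams."""
    initial, final, edges = set(), set(), set()
    accepts_empty = False
    for seq in sequences:
        if not seq:
            accepts_empty = True
            continue
        initial.add(seq[0])
        final.add(seq[-1])
        edges.update(zip(seq, seq[1:]))  # every adjacent pair becomes an edge
    return initial, final, edges, accepts_empty


def accepts(automaton, seq):
    """Check whether the inferred automaton accepts a sequence."""
    initial, final, edges, accepts_empty = automaton
    if not seq:
        return accepts_empty
    return (
        seq[0] in initial
        and seq[-1] in final
        and all(pair in edges for pair in zip(seq, seq[1:]))
    )


soa = two_t_inf([["file", "copy", "command"], ["file", "command"]])
print(accepts(soa, ["file", "copy", "command"]))  # True
print(accepts(soa, ["copy", "command"]))          # False: 'copy' never starts an example
```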
## Domain: Ansible Role Grammar
The engine includes a domain adapter for Ansible roles. It extracts module names from `tasks/main.yml` files and learns per-category grammars:
```bash
python -c "
from bex.role_grammar import collect_all_role_sequences, learn_grammar
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    print(f'{cat}: {learn_grammar(seqs)}')
"
```
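As a rough illustration of the extraction step, the sketch below turns an already-parsed task list (for example the result of `yaml.safe_load` on a `tasks/main.yml`) into a module-name sequence. The function `module_sequence`, the keyword set, and the choice to strip collection prefixes such as `ansible.builtin.` are assumptions for this example, not the behaviour of `bex/role_grammar.py`.

```python
# Task-level keys that are not module names (an illustrative, incomplete set).
TASK_KEYWORDS = {
    "name", "when", "tags", "register", "become", "become_user", "vars",
    "loop", "loop_control", "notify", "changed_when", "failed_when",
    "ignore_errors", "delegate_to", "environment", "args", "no_log",
    "until", "retries", "delay", "run_once", "check_mode",
}


def module_sequence(tasks):
    """Return the module names used by a parsed list of Ansible tasks."""
    sequence = []
    for task in tasks or []:
        if not isinstance(task, dict):
            continue
        if "block" in task:
            # Recurse into block / rescue / always sections in order.
            for section in ("block", "rescue", "always"):
                sequence.extend(module_sequence(task.get(section)))
            continue
        for key in task:
            if key not in TASK_KEYWORDS and not key.startswith("with_"):
                # Keep only the short name: ansible.builtin.copy -> copy
                sequence.append(key.rsplit(".", 1)[-1])
                break
    return sequence


tasks = [
    {"name": "Create dir", "file": {"path": "/opt/app", "state": "directory"}},
    {"name": "Render config", "ansible.builtin.template": {"src": "a.j2", "dest": "/opt/app/a"}},
    {"block": [{"command": "ls"}], "rescue": [{"debug": {"msg": "failed"}}]},
]
print(module_sequence(tasks))
# ['file', 'template', 'command', 'debug']
```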
### Example Output
```
── restore (2 roles) ──
Grammar: file.copy.unarchive+.command
── validate (5 roles) ──
Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
── configure (4 roles) ──
Grammar: (assert+debug+set_fact+uri)+?.include_role?
```
**Grammar notation:**
- `a.b` — `a` followed by `b` (concatenation)
- `(a+b)` — either `a` or `b` (disjunction)
- `r?` — zero or one (optional)
- `r+` — one or more (iteration)
- `r+?` — zero or more (an optional iteration, i.e. `(r+)?`)
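Since `+` serves both as infix disjunction and as postfix iteration, a small translator to Python's `re` syntax may help when checking a sequence against a learned grammar. This is a sketch under stated assumptions: `notation_to_regex` and `matches` are hypothetical helpers that are not part of the `bex` package, disjunctions are assumed to be parenthesised as in the example output, and symbol names are assumed to contain only letters, digits, and underscores.

```python
import re

TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[.()+?]")


def notation_to_regex(grammar):
    """Translate the grammar notation into a compiled Python regex."""
    tokens = TOKEN.findall(grammar)
    parts = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == "(":
            parts.append("(?:")
        elif tok in (")", "?"):
            parts.append(tok)
        elif tok == "+":
            if nxt == "(" or nxt[:1].isalpha() or nxt[:1] == "_":
                parts.append("|")   # infix: disjunction between operands
            elif nxt == "?":
                parts.append("*")   # `+?` means zero or more, not a lazy `+`
                i += 1              # consume the `?` as well
            else:
                parts.append("+")   # postfix: one or more
        elif tok != ".":
            # Each symbol is matched together with a trailing space separator.
            parts.append("(?:" + re.escape(tok) + " )")
        i += 1
    return re.compile("".join(parts))


def matches(grammar, sequence):
    """Return True if the whole sequence is described by the grammar."""
    text = "".join(sym + " " for sym in sequence)
    return notation_to_regex(grammar).fullmatch(text) is not None


print(matches("file.copy.unarchive+.command",
              ["file", "copy", "unarchive", "unarchive", "command"]))  # True
print(matches("file.copy.unarchive+.command", ["file", "copy", "command"]))  # False
```

The trailing space after every symbol keeps a name such as `copy` from matching as a prefix of a longer name.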
## Domain: Generic YAML
The engine can convert any YAML file into key-path sequences for grammar inference:
```python
from bex.yaml_to_seq import yaml_file_to_sequence, sequences_to_crx
grammar = sequences_to_crx(yaml_file_to_sequence('config.yml'))
```
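What a key-path sequence might look like can be sketched on already-parsed YAML data. The README does not specify the exact path format, so the dotted paths, the `[]` marker for list items, and the function name `key_paths` are all assumptions for illustration rather than the output of `bex/yaml_to_seq.py`.

```python
def key_paths(node, prefix=""):
    """Flatten nested dicts and lists into key paths, in document order."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            paths.append(path)
            paths.extend(key_paths(value, path))
    elif isinstance(node, list):
        for item in node:
            # List items share one marker so that repeated entries line up.
            paths.extend(key_paths(item, prefix + "[]"))
    return paths


config = {
    "server": {"host": "localhost", "ports": [{"http": 80}, {"https": 443}]},
    "debug": True,
}
print(key_paths(config))
# ['server', 'server.host', 'server.ports', 'server.ports[].http',
#  'server.ports[].https', 'debug']
```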
## Papers
- **Bex et al.** *"Inference of Concise Regular Expressions and DTDs"* — TODS 2010
- **Bex et al.** *"Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data"* — arXiv:1004.2372

See `papers/` for extracted text and the original references.
## Tests
```bash
python -m pytest tests/
# or
python tests/test_bex.py
```
## MCP Server
A Model Context Protocol server for grammar inference is planned. See `AGENTS.md` for the roadmap.
## License
MIT