# Grammar Inference Engine

Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

## Quick Start

```bash
pip install pyyaml
python -m bex
```

```python
from bex.crx import CRX

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?
```

## Algorithms

| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference from many sequences |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Handles noise, learns from few examples |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Builds regex from a single automaton |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | Post-processing for k-ORE extraction |

### Pipeline 1: Direct CHARE Inference (fast)

```
Example sequences → CRX → CHAREs grammar
```

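To make the CRX step more concrete, here is a simplified sketch of the idea, not the code in `bex/crx.py`. It groups symbols that can follow each other in both directions into one class, orders the classes by how many symbols can precede them, and picks a multiplicity from how often each class occurs per example. The function name `crx_sketch` is hypothetical, the sketch assumes at least one example sequence, and it skips the paper's step that merges singleton classes with identical neighbours. Its output formatting may differ from the engine's (for instance, it does not parenthesize single symbols).

```python
def crx_sketch(seqs):
    """Simplified CRX-style inference (illustrative only)."""
    # Symbols in first-seen order, so ties are broken deterministically.
    symbols = []
    for seq in seqs:
        for s in seq:
            if s not in symbols:
                symbols.append(s)

    # Direct successor relation: a -> b if "a b" occurs in some example.
    succ = {s: set() for s in symbols}
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            succ[a].add(b)

    # Transitive closure: reach[a] holds every symbol reachable from a.
    reach = {s: set(succ[s]) for s in symbols}
    changed = True
    while changed:
        changed = False
        for s in symbols:
            new = set().union(*(reach[t] for t in reach[s])) - reach[s]
            if new:
                reach[s] |= new
                changed = True

    # Equivalence classes: symbols that reach each other belong together.
    classes, assigned = [], set()
    for s in symbols:
        if s in assigned:
            continue
        cls = [t for t in symbols
               if t == s or (t in reach[s] and s in reach[t])]
        classes.append(cls)
        assigned.update(cls)

    # A class with fewer ancestors must come earlier (topological order).
    def ancestors(cls):
        return sum(1 for t in symbols if t not in cls and cls[0] in reach[t])

    classes.sort(key=ancestors)  # stable sort keeps first-seen order on ties

    # Multiplicity from per-example occurrence counts of each class.
    suffix = {(False, False): "", (True, False): "?",
              (False, True): "+", (True, True): "+?"}
    factors = []
    for cls in classes:
        counts = [sum(1 for s in seq if s in cls) for seq in seqs]
        body = cls[0] if len(cls) == 1 else "(" + "+".join(cls) + ")"
        factors.append(body + suffix[(min(counts) == 0, max(counts) > 1)])
    return ".".join(factors)


print(crx_sketch([["a", "b", "a", "c"], ["c"]]))
# (a+b)+?.c
```

Because each example is scanned a fixed number of times and no automaton is built, this style of inference stays cheap even with many sequences, which matches the "fast" label of this pipeline.
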
### Pipeline 2: Probabilistic k-ORE Inference (robust)

```
Example sequences → Complete k-OA → Baum-Welch (EM)
                  → Disambiguate → Prune → rwr² → k-ORE grammar
```

## Architecture

```
bex/
├── crx.py           # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py       # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py          # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py         # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py           # SOA: single occurrence automaton core
├── koa.py           # k-OA: k-occurrence automaton
├── ikoa.py          # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py       # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py    # Baum-Welch EM training for k-OA
├── expr.py          # Expression utilities (concat, disj, star, strip)
├── marking.py       # State marking for determinism
├── yaml_to_seq.py   # Generic YAML → key-path sequence converter
├── role_grammar.py  # Ansible role → module-sequence extractor
└── ...
```

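Several of these modules build on a single occurrence automaton, in which every symbol labels at most one state. As a rough illustration of what 2T-INF computes, the sketch below records which symbols start a sequence, which end one, and which pairs occur next to each other. The names `two_t_inf` and `accepts` and the tuple layout are assumptions made for this example; the data structures in `soa.py` and `twotinf.py` may look different.

```python
def two_t_inf(seqs):
    """Build a single occurrence automaton from example sequences.

    Returns (initial symbols, final symbols, adjacent-pair edges,
    whether the empty sequence was observed).
    """
    initial, final, edges = set(), set(), set()
    accepts_empty = False
    for seq in seqs:
        if not seq:
            accepts_empty = True
            continue
        initial.add(seq[0])
        final.add(seq[-1])
        edges.update(zip(seq, seq[1:]))
    return initial, final, edges, accepts_empty


def accepts(soa, seq):
    """Check whether the automaton accepts a sequence."""
    initial, final, edges, accepts_empty = soa
    if not seq:
        return accepts_empty
    return (seq[0] in initial
            and seq[-1] in final
            and all(pair in edges for pair in zip(seq, seq[1:])))


soa = two_t_inf([["a", "b", "c"], ["a", "c"]])
print(accepts(soa, ["a", "b", "c"]))  # True
print(accepts(soa, ["b", "c"]))       # False: "b" never starts a sequence
```
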
## Domain: Ansible Role Grammar

The engine includes a domain adapter for Ansible roles. It extracts module names from `tasks/main.yml` files and learns per-category grammars:

```bash
python -c "
from bex.role_grammar import collect_all_role_sequences, learn_grammar
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    print(f'{cat}: {learn_grammar(seqs)}')
"
```

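How module names are pulled out of a task list is not spelled out above, so the following is only a guess at the general approach, not the logic in `role_grammar.py`. It assumes the tasks have already been parsed (for example with `yaml.safe_load`) into a list of dictionaries, treats the first key that is not a common task keyword as the module, and strips any collection prefix such as `ansible.builtin.`. The names `TASK_KEYWORDS` and `module_sequence` are hypothetical, and the keyword set is deliberately incomplete.

```python
# Common Ansible task keywords that are not module names (partial list).
TASK_KEYWORDS = {
    "name", "when", "register", "tags", "become", "become_user", "loop",
    "loop_control", "vars", "notify", "changed_when", "failed_when",
    "ignore_errors", "delegate_to", "until", "retries", "delay",
    "environment", "args", "no_log", "run_once",
}


def module_sequence(tasks):
    """Return the module name used by each task, in file order."""
    seq = []
    for task in tasks:
        for key in task:
            if key in TASK_KEYWORDS or key.startswith("with_"):
                continue
            # "ansible.builtin.copy" and "copy" count as the same symbol.
            seq.append(key.rsplit(".", 1)[-1])
            break
    return seq


tasks = [
    {"name": "Create directory", "file": {"path": "/opt/app", "state": "directory"}},
    {"ansible.builtin.template": {"src": "app.conf.j2", "dest": "/opt/app/app.conf"}},
    {"name": "Run migration", "shell": "make migrate", "register": "result"},
]
print(module_sequence(tasks))
# ['file', 'template', 'shell']
```

A sketch like this would treat `block` or `include_tasks` entries as ordinary symbols; whether the real adapter flattens them is not stated here.
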
### Example Output

```
── restore (2 roles) ──
Grammar: file.copy.unarchive+.command

── validate (5 roles) ──
Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?

── configure (4 roles) ──
Grammar: (assert+debug+set_fact+uri)+?.include_role?
```

**Grammar notation:**
- `a.b`: `a` followed by `b` (concatenation)
- `(a+b)`: either `a` or `b` (disjunction, written as an infix `+` between alternatives)
- `r?`: zero or one (optional)
- `r+`: one or more (iteration, written as a postfix `+`)
- `r+?`: zero or more, that is, an optional `r+`

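Since `+` serves as both disjunction and iteration, it can help to see the notation translated into a form that Python's `re` module understands. The sketch below is an illustration only and is not part of the `bex` package; `notation_to_regex` and `matches` are hypothetical helpers. It assumes symbols consist of letters, digits, and underscores, and it treats a `+` as infix only when a symbol or an opening parenthesis follows it.

```python
import re

# Identifiers, the two-character postfix "+?", or a single operator character.
_TOKEN = re.compile(r"[A-Za-z0-9_]+|\+\?|[.+?()]")


def notation_to_regex(grammar):
    """Translate the grammar notation into a compiled Python regex.

    Each symbol becomes a group ending in a comma, so a sequence can be
    matched after joining its items with trailing commas.
    """
    tokens = _TOKEN.findall(grammar)
    parts = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == ".":
            continue  # concatenation is implicit in regex syntax
        elif tok == "+?":
            parts.append("*")  # optional one-or-more equals zero-or-more
        elif tok == "+":
            infix = nxt == "(" or nxt[:1].isalnum() or nxt[:1] == "_"
            parts.append("|" if infix else "+")
        elif tok in ("?", ")"):
            parts.append(tok)
        elif tok == "(":
            parts.append("(?:")
        else:
            parts.append("(?:" + re.escape(tok) + ",)")
    return re.compile("".join(parts))


def matches(grammar, seq):
    """Return True if the whole sequence is described by the grammar."""
    encoded = "".join(item + "," for item in seq)
    return notation_to_regex(grammar).fullmatch(encoded) is not None


print(matches("file.copy.unarchive+.command",
              ["file", "copy", "unarchive", "unarchive", "command"]))  # True
print(matches("file.copy.unarchive+.command",
              ["file", "copy", "command"]))                            # False
```
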
## Domain: Generic YAML

The engine can convert any YAML file into key-path sequences for grammar inference:

```python
from bex.yaml_to_seq import yaml_file_to_sequence, sequences_to_crx

grammar = sequences_to_crx(yaml_file_to_sequence('config.yml'))
```

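The exact path format produced by `bex.yaml_to_seq` is not documented in this README. As one plausible reading of "key-path sequence", the sketch below walks already-parsed YAML data (the nested dictionaries and lists that `yaml.safe_load` returns) and emits a dotted path for every key, with `[]` marking list items. The helper name `key_paths` and the path syntax are assumptions for illustration.

```python
def key_paths(node, prefix=""):
    """Flatten nested dicts and lists into an ordered list of key paths."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            paths.append(path)
            paths.extend(key_paths(value, path))
    elif isinstance(node, list):
        for item in node:
            # List items share one path segment, so repeats become visible.
            paths.extend(key_paths(item, f"{prefix}[]"))
    return paths


config = {
    "server": {"host": "localhost", "port": 8080},
    "tasks": [{"name": "a"}, {"name": "b", "when": "ready"}],
}
print(key_paths(config))
# ['server', 'server.host', 'server.port', 'tasks',
#  'tasks[].name', 'tasks[].name', 'tasks[].when']
```

Each path then acts as one symbol of the alphabet, so repeated list entries show up as repeated symbols that an inferred grammar can summarize with `+` or `+?`.
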
## Papers

- **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"*, TODS 2010
- **Bex et al.** *"Inferring k-optimal REs from Positive Data"*, arXiv:1004.2372

See `papers/` for extracted text and the original references.

## Tests

```bash
python -m pytest tests/
# or
python tests/test_bex.py
```

## MCP Server

A Model Context Protocol server for grammar inference is planned. See `AGENTS.md` for the roadmap.

## License

MIT