# Grammar Inference Engine

Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

## Quick Start

```bash
pip install pyyaml
python -m bex
```

```python
from bex.crx import CRX

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?
```

## Algorithms

| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference from many sequences |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Handles noise, learns from few examples |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Builds a regex from a single occurrence automaton (SOA) |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | Post-processing for k-ORE extraction |

CHAREs are chain regular expressions, SOREs are single occurrence regular expressions (each symbol appears at most once), and k-OREs are k-occurrence regular expressions (each symbol appears at most k times).

### Pipeline 1: Direct CHARE Inference (fast)

```
Example sequences → CRX → CHARE grammar
```
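
The idea behind this pipeline can be sketched in a few dozen lines. The following is a simplified illustration, not the code in `bex/crx.py`: `reachable` and `infer_chain` are hypothetical names, and the sketch leaves out any further merging or compaction the published CRX algorithm performs, so its output may be less compact than what the engine prints. It groups symbols that can reach each other in the successor graph, orders the groups, and picks a quantifier for each group from how often it occurs per example.

```python
def reachable(succ, start):
    """Return every symbol reachable from `start`, including `start` itself."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in succ[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def infer_chain(sequences):
    """Simplified chain-expression inference (illustrative, not bex.crx)."""
    # Successor graph: a -> b whenever b directly follows a in some example.
    succ = {}
    for seq in sequences:
        for sym in seq:
            succ.setdefault(sym, set())
        for a, b in zip(seq, seq[1:]):
            succ[a].add(b)
    reach = {sym: reachable(succ, sym) for sym in succ}

    # Symbols that reach each other form one class (rendered as a disjunction).
    classes, assigned = [], set()
    for sym in succ:
        if sym not in assigned:
            cls = [t for t in succ if t in reach[sym] and sym in reach[t]]
            classes.append(cls)
            assigned.update(cls)

    # If class C can reach class D, C reaches strictly more symbols, so
    # sorting by that count (descending) gives a valid left-to-right order.
    classes.sort(key=lambda cls: -len(reach[cls[0]]))

    factors = []
    for cls in classes:
        # How many symbols of this class occur in each example.
        counts = [sum(sym in cls for sym in seq) for seq in sequences]
        quantifier = ("+" if max(counts) > 1 else "") + ("?" if min(counts) == 0 else "")
        body = cls[0] if len(cls) == 1 else "(" + "+".join(cls) + ")"
        factors.append(body + quantifier)
    return ".".join(factors)


seqs = [
    ["file", "template", "docker_image", "command", "set_fact", "shell", "wait_for"],
    ["file", "template", "docker_image", "command", "set_fact", "shell"],
]
print(infer_chain(seqs))
# file.template.docker_image.command.set_fact.shell.wait_for?
```

The sketch prints optional single symbols without parentheses; the engine's own formatting may differ.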

### Pipeline 2: Probabilistic k-ORE Inference (robust)

```
Example sequences → Complete k-OA → Baum-Welch (EM)
  → Disambiguate → Prune → rwr² → k-ORE grammar
```

## Architecture

```
bex/
├── crx.py          # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py      # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py         # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py        # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py          # SOA: single occurrence automaton core
├── koa.py          # k-OA: k-occurrence automaton
├── ikoa.py         # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py      # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py   # Baum-Welch EM training for k-OA
├── expr.py         # Expression utilities (concat, disj, star, strip)
├── marking.py      # State marking for determinism
├── yaml_to_seq.py  # Generic YAML → key-path sequence converter
├── role_grammar.py # Ansible role → module-sequence extractor
└── ...
```
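
`twotinf.py` is named after 2T-INF, which records which symbols start an example, which end one, and which pairs occur next to each other. A minimal sketch of that idea follows. It is an illustration under assumptions, not the module's actual code, and the names `two_t_inf`, `soa_accepts`, `SRC`, and `SNK` are hypothetical.

```python
SRC, SNK = "<src>", "<snk>"  # hypothetical markers for the start and end states


def two_t_inf(sequences):
    """Collect the edge set of an automaton with one state per symbol."""
    edges = set()
    for seq in sequences:
        path = [SRC, *seq, SNK]
        # Every adjacent pair, including start->first and last->end, is an edge.
        edges.update(zip(path, path[1:]))
    return edges


def soa_accepts(edges, seq):
    """A sequence is accepted when each adjacent pair is a known edge."""
    path = [SRC, *seq, SNK]
    return all(pair in edges for pair in zip(path, path[1:]))


edges = two_t_inf([["a", "b"], ["b", "a"]])
print(soa_accepts(edges, ["a", "b"]))        # True: seen in the examples
print(soa_accepts(edges, ["a", "b", "a"]))   # True: generalized from seen pairs
print(soa_accepts(edges, ["a", "a"]))        # False: the pair (a, a) never occurred
```

Because only adjacent pairs are stored, the automaton generalizes beyond the examples, which is what the later rewrite steps turn into a regular expression.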

## Domain: Ansible Role Grammar

The engine includes a domain adapter for Ansible roles. It extracts module names from `tasks/main.yml` files and learns per-category grammars:

```bash
python -c "
from bex.role_grammar import collect_all_role_sequences, learn_grammar
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    print(f'{cat}: {learn_grammar(seqs)}')
"
```
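
To make the extraction step concrete, here is a rough sketch of how module names could be pulled from an already-parsed task list (for example, the result of `yaml.safe_load` on a `tasks/main.yml`). It is an assumption about the approach, not the code in `bex/role_grammar.py`: `module_sequence` and `TASK_KEYWORDS` are hypothetical names, the keyword list is partial, and stripping the collection prefix from fully qualified names is an illustrative choice.

```python
# Task-level keywords that are not module names (partial, illustrative list).
TASK_KEYWORDS = {
    "name", "when", "register", "tags", "become", "become_user", "vars",
    "loop", "with_items", "notify", "ignore_errors", "changed_when",
    "failed_when", "delegate_to", "environment", "args", "until",
    "retries", "delay", "no_log", "run_once",
}


def module_sequence(tasks):
    """Return the module name of each task, in file order."""
    sequence = []
    for task in tasks or []:
        if not isinstance(task, dict):
            continue  # skip malformed entries
        for key in task:
            if key not in TASK_KEYWORDS:
                # Assumption: treat 'ansible.builtin.copy' the same as 'copy'.
                sequence.append(key.rsplit(".", 1)[-1])
                break
    return sequence


tasks = [
    {"name": "Create directory", "file": {"path": "/opt/app", "state": "directory"}},
    {"name": "Render config", "ansible.builtin.template": {"src": "a.j2", "dest": "/opt/app/a"},
     "notify": "restart"},
    {"shell": "echo done", "register": "out", "when": "run_check"},
]
print(module_sequence(tasks))  # ['file', 'template', 'shell']
```

This sketch does not descend into `block`/`rescue`/`always` sections; a real extractor would need to decide how to handle them.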

### Example Output

```
── restore (2 roles) ──
  Grammar: file.copy.unarchive+.command

── validate (5 roles) ──
  Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?

── configure (4 roles) ──
  Grammar: (assert+debug+set_fact+uri)+?.include_role?
```

**Grammar notation:**
- `a.b` — `a` followed by `b` (concatenation)
- `(a+b)` — either `a` or `b` (disjunction)
- `r?` — zero or one (optional)
- `r+` — one or more (iteration)
- `r+?` — zero or more (an optional iteration, equivalent to `r*`)
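
One way to check a sequence against a grammar written in this notation is to translate it into a Python `re` pattern. The sketch below is illustrative only: `grammar_to_regex` and `accepts` are hypothetical helpers, not part of `bex`, and they assume symbols are identifier-like names without spaces. A `+` is read as disjunction when a symbol or `(` follows it, and as iteration otherwise.

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|[().+?]")


def grammar_to_regex(grammar):
    """Translate the dotted grammar notation into a compiled regex."""
    tokens = TOKEN.findall(grammar)
    parts = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == "(":
            parts.append("(?:")
        elif tok == ")":
            parts.append(")")
        elif tok == ".":
            pass  # concatenation is implicit in regex syntax
        elif tok == "?":
            parts.append("?")
        elif tok == "+":
            if nxt == "(" or nxt[:1].isalpha() or nxt[:1] == "_":
                parts.append("|")   # infix: disjunction
            elif nxt == "?":
                parts.append("*")   # 'r+?' means zero or more, not a lazy '+'
                i += 1              # consume the '?' as well
            else:
                parts.append("+")   # postfix: one or more
        else:
            # Each symbol matches its name followed by a separating space.
            parts.append("(?:" + re.escape(tok) + " )")
        i += 1
    return re.compile("".join(parts))


def accepts(grammar, seq):
    """True when the whole sequence matches the grammar."""
    encoded = "".join(name + " " for name in seq)
    return grammar_to_regex(grammar).fullmatch(encoded) is not None


print(accepts("file.copy.unarchive+.command",
              ["file", "copy", "unarchive", "unarchive", "command"]))  # True
print(accepts("file.copy.unarchive+.command", ["file", "copy", "command"]))  # False
```

The explicit handling of `+?` matters because Python would otherwise read it as a lazy quantifier that still requires at least one match.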

## Domain: Generic YAML

The engine can convert any YAML file into key-path sequences for grammar inference:

```python
from bex.yaml_to_seq import yaml_file_to_sequence, sequences_to_crx

grammar = sequences_to_crx(yaml_file_to_sequence('config.yml'))
```
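
As a rough picture of what a key-path sequence might look like, the sketch below walks an already-parsed YAML document (nested dicts and lists, as `yaml.safe_load` would return) and lists each key with its ancestors. The function name `key_paths` and the `/` separator are assumptions made for illustration; the actual output format of `yaml_file_to_sequence` may differ.

```python
def key_paths(node, prefix=""):
    """Return the key paths of a parsed YAML document in document order."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}/{key}" if prefix else str(key)
            paths.append(path)
            paths.extend(key_paths(value, path))
    elif isinstance(node, list):
        # List items contribute their keys under the parent's path.
        for item in node:
            paths.extend(key_paths(item, prefix))
    return paths  # scalars contribute no paths of their own


doc = {
    "server": {"host": "localhost", "ports": [{"http": 80}, {"https": 443}]},
    "debug": True,
}
print(key_paths(doc))
# ['server', 'server/host', 'server/ports', 'server/ports/http', 'server/ports/https', 'debug']
```

Running this over several configuration files yields one sequence per file, which is the shape of input the inference algorithms expect.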

## Papers

- **Bex et al.** *"Inference of Concise Regular Expressions and DTDs"* — TODS 2010
- **Bex et al.** *"Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data"* — arXiv:1004.2372

See `papers/` for extracted text and the original references.

## Tests

```bash
python -m pytest tests/
# or
python tests/test_bex.py
```

## MCP Server

A Model Context Protocol server for grammar inference is planned. See `AGENTS.md` for the roadmap.

## License

MIT