
# Grammar Inference Engine

Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

## Quick Start

```bash
pip install pyyaml
python -m bex
```

```python
from bex import infer_ensemble

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```

Or run a single algorithm directly:

```python
from bex.crx import CRX

seqs = [...]  # the example sequences from the snippet above
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?
```

## Algorithms

| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures *all* symbols |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |

### Pipeline 1: Direct CHARE Inference (fast)

```
Example sequences → CRX → CHARE grammar
```

CRX learns a grammar that accepts *all* observed symbols, marking optional ones with `?`. Best when the data is clean and you want the full vocabulary.
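
To make the quantifier idea concrete, here is a minimal sketch of how a symbol's suffix could be chosen from its occurrence counts. It is a simplification for illustration only: the helper name `quantifier` is hypothetical, it treats every symbol on its own, and it leaves out the ordering and grouping steps that CRX also performs.

```python
def quantifier(symbol, sequences):
    """Pick a README-style quantifier for `symbol` from its per-sequence counts.

    Hypothetical helper, not part of the bex API.
    """
    counts = [seq.count(symbol) for seq in sequences]
    lo, hi = min(counts), max(counts)
    if lo >= 1 and hi == 1:
        return ""      # exactly once in every example
    if lo == 0 and hi <= 1:
        return "?"     # at most once, missing from some examples
    if lo >= 1:
        return "+"     # always present, sometimes repeated
    return "+?"        # sometimes missing, sometimes repeated


seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
print(quantifier('file', seqs))      # prints an empty string
print(quantifier('wait_for', seqs))  # prints ?
```

With the Quick Start sequences, only `wait_for` is missing from one example, which is why it is the only symbol marked optional in the grammar shown there.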

### Pipeline 2: Probabilistic k-ORE Inference (robust)

```
Example sequences → Complete k-OA → Baum-Welch (EM)
  → Disambiguate → Prune → rwr² → k-ORE grammar
```

iDRegEx learns the *minimal* common subsequence: the symbols that appear in every example. It fails (returns ∅) when the examples are too diverse.

### Pipeline 3: Ensemble (recommended)

```
Example sequences → [CRX, iDRegEx] → MDL score each → pick best
```

Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like `(a+b+c+...+z)+` that accepts everything gets a high data cost (`log2(|L(r)|)` is large), while a specific grammar like `a.b.c` has near-zero data cost.
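
The selection step itself is small. The sketch below assumes each candidate is a dict with the `algorithm`, `grammar`, and `mdl_score` keys seen in the Quick Start output. The function `pick_best` and the `candidates` key are illustrative names, not necessarily what `bex/ensemble.py` uses.

```python
def pick_best(candidates):
    """Return the lowest-MDL candidate plus the full ranking.

    Hypothetical helper; the real ensemble may structure its result differently.
    """
    ranked = sorted(candidates, key=lambda c: c["mdl_score"])
    return {"best": ranked[0], "candidates": ranked}


# Scores taken from the Helm example later in this README.
candidates = [
    {"algorithm": "CRX", "mdl_score": 2651.74,
     "grammar": "(Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?"},
    {"algorithm": "iDRegEx", "mdl_score": 1432.99,
     "grammar": "ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment"},
]
result = pick_best(candidates)
print(result["best"]["algorithm"])  # prints iDRegEx
```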

## Architecture

```
bex/
├── crx.py          # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py      # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py         # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py        # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py          # SOA: Single Occurrence Automaton core
├── koa.py          # k-OA: k-Occurrence Automaton
├── ikoa.py         # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py      # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py   # Baum-Welch EM training for k-OA
├── expr.py         # Expression utilities (concat, disj, star, strip)
├── marking.py      # State marking for determinism
├── yaml_to_seq.py  # Generic YAML → key-path sequence converter
├── role_grammar.py # Ansible role → module-sequence extractor
├── ensemble.py     # Ensemble: runs CRX + iDRegEx, picks best by MDL
├── mdl.py          # MDL scoring for grammar selection
├── mcp_server.py   # MCP server exposing 4 tools
└── ...
```
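
As a rough picture of what the 2T-INF step listed above computes, the sketch below collects the initial symbols, final symbols, and adjacent symbol pairs from the examples. Together these define an automaton with one state per symbol. The function name `two_t_inf` and the tuple return shape are assumptions for illustration, and empty sequences are skipped as a simplification. `bex/twotinf.py` may represent the automaton differently.

```python
def two_t_inf(sequences):
    """Collect start symbols, end symbols, and observed 2-grams.

    Illustrative sketch only; not the bex implementation.
    """
    initial, final, edges = set(), set(), set()
    for seq in sequences:
        if not seq:
            continue  # simplification: empty examples are ignored here
        initial.add(seq[0])
        final.add(seq[-1])
        edges.update(zip(seq, seq[1:]))  # every adjacent pair becomes an edge
    return initial, final, edges


initial, final, edges = two_t_inf([['a', 'b', 'c'], ['a', 'c']])
print(sorted(edges))  # prints [('a', 'b'), ('a', 'c'), ('b', 'c')]
```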

## MCP Server

A **Model Context Protocol** server exposes all algorithms and domain adapters as tools:

```bash
python -m bex.mcp_server
```

### Tools

| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a `Why:` explanation. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | Generic YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |

### Using `infer_best_grammar`

The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass `prefer`:

```
User: Run CRX on our deploy tasks.
Agent: [runs with prefer='crx']
Best: CRX (MDL 7.0)
Grammar: file.template.docker_image.command.set_fact.shell.wait_for?

  CRX  MDL=  7.00  file.template.docker_image.command.set_fact.shell.wait_for?

Why: Requested CRX only.
```

Without `prefer`, the ensemble compares both:

```
User: Find the grammar for our Helm chart.
Agent: [runs]
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

  iDRegEx     MDL=  1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX         MDL=  2651.74  (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?

Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
```

Both grammars are useful, at different levels of specificity: CRX accepts all six sequences, while iDRegEx describes only the shared core and matches one of them. The `Why:` field helps the agent decide which one to use for the task at hand.

## Ensemble Selection

The `infer_best_grammar` tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.

### How MDL scoring works

```
MDL = model_cost + data_cost
```

- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A grammar that accepts *many* strings of the same length (like a 17-way disjunction `(a+b+...+q)+`) has high data cost because `|L(r)|` is large. A specific, fixed sequence (`a.b.c.d.e`) has `|L(r)| = 1` so data cost is zero.

The ensemble selects the grammar with the lowest total MDL. This automatically picks the right level of specificity for the data.
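
The formula can be checked by brute force on a toy alphabet. The sketch below is not the code in `bex/mdl.py`. It assumes single-character symbols, expresses the grammar as a Python `re` pattern, and counts accepted strings by enumeration, which is only practical for tiny alphabets and short sequences. The names `count_language` and `mdl_score` are hypothetical.

```python
import itertools
import math
import re


def count_language(pattern: str, alphabet: str, length: int) -> int:
    """Count strings of exactly `length` symbols over `alphabet` accepted by `pattern`."""
    compiled = re.compile(pattern)
    return sum(
        1
        for chars in itertools.product(alphabet, repeat=length)
        if compiled.fullmatch("".join(chars))
    )


def mdl_score(pattern: str, alphabet: str, sequences: list[str]) -> float:
    """MDL = model_cost + data_cost, following the definitions above."""
    model_cost = len(set(alphabet))  # unique symbols used by the grammar
    data_cost = 0.0
    for seq in sequences:
        n = count_language(pattern, alphabet, len(seq))
        if n == 0:
            # Assumption: the README does not say how this case is scored.
            raise ValueError(f"grammar accepts no string of length {len(seq)}")
        data_cost += math.log2(n)
    return model_cost + data_cost


specific = mdl_score("abc", "abc", ["abc"])    # 3.0: one accepted string, zero data cost
general = mdl_score("[abc]+", "abc", ["abc"])  # 3 + log2(27): 27 accepted strings
print(specific < general)  # prints True
```

Both grammars use the same three symbols, so the model cost is identical and the difference comes entirely from the data cost of the more permissive pattern.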

### When each algorithm wins

| Scenario | Winner | Why |
|----------|--------|-----|
| Many sequences, diverse patterns | **CRX** | CRX captures the full vocabulary. iDRegEx can't find a common core. |
| Clean, structured sequences | **CRX** | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
| Few sequences (2–3) | **iDRegEx** | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
| Sequences share a clear core | **iDRegEx** | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
| Single sequence | **iDRegEx** (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |

### Real-world benchmarks

Results from three domains using the ensemble (fixed MDL scoring):

```
Dataset                   Best       MDL      Matches
──────────────────────────────────────────────────────────
Helm (prom-stack)         iDRegEx    1433.0   1/6
Ansible (deploy)          CRX        246.1    34/36
Ansible (validate)        CRX        34.0     5/5
Ansible (restore)         CRX        24.0     2/2
Ansible (manage)          iDRegEx    25.0     1/2
Ansible (configure)       iDRegEx    22.5     1/4
Terraform (hashistack)    CRX        4.0      9/9
```

Note: MDL scores are not comparable across datasets — only within the same run
(CRX vs iDRegEx on the same sequences). The Helm score is higher because
each sequence is ~120 symbols long, making the data cost term dominant for
the overly-general CRX grammar (19 kinds × many lengths).

## Domain Adapters

### Ansible Roles

Extracts module names from `tasks/main.yml`, groups by category prefix (e.g., `deploy_foo` → `deploy`), and learns per-category grammars:

```python
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences

all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    if len(seqs) >= 2:
        result = infer_ensemble(seqs)
        print(f"── {cat} ({len(items)} roles) ──")
        print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
        print(f"  Grammar: {result['best']['grammar']}")
        print(f"  Why: {result['why']}")
```

**Example output** (from [companyweb](https://github.com/anomalyco/companyweb), 51 roles):
```
── restore (2 roles) ──
  Best: CRX (MDL 24.0)
  Grammar: file.copy.unarchive+.command
  Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.

── validate (5 roles) ──
  Best: CRX (MDL 34.0)
  Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
  Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.

── configure (4 roles) ──
  Best: iDRegEx (MDL 22.5)
  Grammar: include_role
  Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
```
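
The category grouping described at the start of this section can be sketched as follows. It assumes the category is everything before the first underscore in the role name, and that each category maps to `(role_name, sequence)` pairs, which matches how `by_category` is consumed above. The helper names are hypothetical and `bex.role_grammar` may differ in detail.

```python
from collections import defaultdict


def category_of(role_name: str) -> str:
    """Hypothetical rule: the prefix before the first underscore, e.g. deploy_foo -> deploy."""
    return role_name.split("_", 1)[0]


def group_by_category(role_sequences):
    """Map each category to a list of (role_name, module_sequence) pairs."""
    by_category = defaultdict(list)
    for role_name, seq in role_sequences.items():
        by_category[category_of(role_name)].append((role_name, seq))
    return dict(by_category)


# Made-up role names and module sequences, for illustration only.
roles = {
    "deploy_foo": ["file", "template", "shell"],
    "deploy_bar": ["file", "template"],
    "restore_db": ["file", "copy", "command"],
}
grouped = group_by_category(roles)
print(sorted(grouped))  # prints ['deploy', 'restore']
```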

### Helm Charts

Renders a Helm chart with different values files and extracts Kubernetes `kind` sequences for grammar inference:

```python
import subprocess
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    if out.returncode == 0:
        kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
                 if d and isinstance(d, dict) and 'kind' in d]
        if kinds:
            seqs.append(kinds)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
print(f"Why: {result['why']}")
```

**Example output** (from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), 6 CI configs):

```
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

  iDRegEx     MDL=  1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX         MDL=  2651.74  (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?

Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
iDRegEx selected (MDL score 1433.0).
```

CRX captures *all* symbols that appear. iDRegEx finds only the minimal core that every config shares:
```
ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
```

Which grammar is more useful depends on the task:
- **CRX** tells you everything you *might* need — good for an agent generating a complete chart.
- **iDRegEx** tells you what you *always* need — the bootstrap pipeline that can't be skipped.

Use `prefer='crx'` or `prefer='idregex'` with the `infer_best_grammar` tool to select an algorithm without the ensemble comparison.

### Terraform

Parses `.tf` files to extract `resource` type sequences, per-file or per-directory:

```python
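import re
from pathlib import Path

from bex.ensemble import infer_ensemble

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    # Capture the resource type from lines like: resource "aws_instance" "web" {
    resources = re.findall(r'resource\s+"(\w+)"\s+"[\w-]+"\s*\{', tf.read_text())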
    if resources:
        seqs.append(resources)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```

**Example output** (from [terraform-guides](https://github.com/hashicorp/terraform-guides), hashistack example, 9 files):
```
Best: CRX (MDL 4.0, 9/9 match)
Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
```

**Grammar notation:**
- `a.b` — `a` followed by `b` (concatenation)
- `(a+b)` — either `a` or `b` (disjunction)
- `r?` — zero or one (optional)
- `r+` — one or more (iteration)
- `r+?` — zero or more (varies across examples)
- `(a|b)` — iDRegEx-style disjunction (equivalent to `(a+b)`)
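
One way to check a sequence against a grammar written in this notation is to translate it into a Python regular expression. The sketch below is not part of `bex`, and `to_regex` and `matches` are hypothetical names. It assumes symbols are made of word characters, and treats `+` as disjunction when a symbol or `(` follows it and as iteration otherwise.

```python
import re

TOKEN = re.compile(r"\w+|\+\?|[().+?|]")


def to_regex(grammar: str) -> str:
    """Translate README grammar notation into a Python regex over 'symbol ' chunks."""
    tokens = TOKEN.findall(grammar)
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == ".":
            continue                      # concatenation is implicit in a regex
        elif tok == "+?":
            out.append("*")               # zero or more
        elif tok == "+":
            is_disjunction = nxt == "(" or re.fullmatch(r"\w+", nxt) is not None
            out.append("|" if is_disjunction else "+")
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "?", "|"):
            out.append(tok)
        else:
            out.append(f"(?:{re.escape(tok)} )")  # each symbol carries a trailing space
    return "".join(out)


def matches(grammar: str, seq) -> bool:
    """True if the whole sequence is accepted by the grammar."""
    text = "".join(f"{s} " for s in seq)
    return re.fullmatch(to_regex(grammar), text) is not None


g = "file.copy.unarchive+.command"
print(matches(g, ["file", "copy", "unarchive", "unarchive", "command"]))  # prints True
print(matches(g, ["file", "copy", "command"]))                            # prints False
```

Translating `r+?` to `*` matters because Python reads a literal `+?` as a lazy one-or-more quantifier, not as zero or more.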

### Generic YAML

Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:

```python
from bex.yaml_to_seq import collect_all_sequences
from bex import infer_ensemble

results = collect_all_sequences('config_dir/')
seqs = [seq for _, seq in results]
result = infer_ensemble(seqs)
print(result['best']['grammar'])
```
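
For intuition, a depth-first key-path traversal over already-parsed YAML data might look like the sketch below. The dot-joined path format and the choice to pass through list items without an index are assumptions made for illustration. `bex.yaml_to_seq` may format paths differently.

```python
def key_paths(node, prefix=""):
    """Return key paths of a parsed YAML document in depth-first order.

    Hypothetical helper, not the bex API.
    """
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            paths.append(path)
            paths.extend(key_paths(value, path))
    elif isinstance(node, list):
        for item in node:
            paths.extend(key_paths(item, prefix))  # list items share the parent path
    return paths


config = {
    "server": {"host": "localhost", "port": 80},
    "users": [{"name": "a"}, {"name": "b"}],
}
print(key_paths(config))
# prints ['server', 'server.host', 'server.port', 'users', 'users.name', 'users.name']
```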

## Papers

- **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"* — TODS 2010
- **Bex et al.** *"Inferring k-optimal REs from Positive Data"* — arXiv:1004.2372

See `papers/` for extracted text and the original references.

## Tests

```bash
python -m pytest tests/
# or
python tests/test_bex.py
```

## License

MIT