# Grammar Inference Engine
Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.
## Quick Start
```bash
pip install pyyaml
python -m bex
```
```python
from bex import infer_ensemble
seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```
Or compare algorithms manually:
```python
from bex.crx import CRX
seqs = [...]  # list of symbol sequences, e.g. the two Quick Start sequences above
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?
```
## Algorithms
| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures *all* symbols |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |
### Pipeline 1: Direct CHARE Inference (fast)
```
Example sequences → CRX → CHAREs grammar
```
CRX learns a grammar that accepts *all* observed symbols, marking optional ones with `?`. Best when the data is clean and you want the full vocabulary.
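The idea can be sketched in a few lines. This is a simplified illustration and not the code in `bex/crx.py`: it assumes that symbols which can follow each other in both directions are grouped into one disjunction, that groups are ordered by reachability, and that each group's quantifier is derived from how often it occurs per sequence. The name `crx_sketch` is hypothetical, and refinements of the published algorithm are omitted.

```python
def crx_sketch(seqs):
    """Simplified CHARE-style inference (illustrative, not bex.crx.CRX)."""
    symbols = sorted({s for seq in seqs for s in seq})
    # reach[a] = symbols reachable from a via "b directly follows a" steps.
    reach = {a: {a} for a in symbols}
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            reach[a].add(b)
    for k in symbols:  # Warshall-style transitive closure
        for a in symbols:
            if k in reach[a]:
                reach[a] |= reach[k]

    # Symbols that reach each other form one class (one disjunction).
    classes, seen = [], set()
    for a in symbols:
        if a not in seen:
            cls = [b for b in symbols if b in reach[a] and a in reach[b]]
            seen.update(cls)
            classes.append(cls)

    # A class with fewer predecessors must come earlier in the grammar.
    classes.sort(key=lambda cls: sum(cls[0] in reach[x] for x in symbols))

    factors = []
    for cls in classes:
        # How many symbols of this class occur in each example sequence?
        counts = [sum(s in cls for s in seq) for seq in seqs]
        if all(c == 1 for c in counts):
            quant = ""      # exactly once everywhere
        elif all(c <= 1 for c in counts):
            quant = "?"     # sometimes missing
        elif all(c >= 1 for c in counts):
            quant = "+"     # always present, sometimes repeated
        else:
            quant = "+?"    # sometimes missing, sometimes repeated
        body = cls[0] if len(cls) == 1 else "(" + "+".join(cls) + ")"
        factors.append(body + quant)
    return ".".join(factors)


seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
print(crx_sketch(seqs))
# file.template.docker_image.command.set_fact.shell.wait_for?
```

Because every observed symbol ends up in some factor, the result accepts the full vocabulary; symbols missing from some sequences are only marked optional, never dropped.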
### Pipeline 2: Probabilistic k-ORE Inference (robust)
```
Example sequences → Complete k-OA → Baum-Welch (EM)
→ Disambiguate → Prune → rwr² → k-ORE grammar
```
iDRegEx learns the *minimum* common subsequence — symbols that appear in every example. Fails (∅) when the examples are too diverse.
### Pipeline 3: Ensemble (recommended)
```
Example sequences → [CRX, iDRegEx] → MDL score each → pick best
```
Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like `(a+b+c+...+z)+` that accepts everything gets a high data cost (`log2(|L(r)|)` is large), while a specific grammar like `a.b.c` has near-zero data cost.
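The selection step itself is small. The sketch below assumes each candidate is a dict with the `algorithm`, `grammar`, and `mdl_score` keys shown in the Quick Start output; `pick_best` is a hypothetical helper, not a function exported by `bex`. The scores are the ones reported for the Helm example in this README.

```python
def pick_best(candidates):
    """Return the candidate with the lowest MDL score."""
    return min(candidates, key=lambda c: c["mdl_score"])


candidates = [
    {
        "algorithm": "CRX",
        "grammar": "(Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?",
        "mdl_score": 2651.74,
    },
    {
        "algorithm": "iDRegEx",
        "grammar": "ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment",
        "mdl_score": 1432.99,
    },
]
best = pick_best(candidates)
print(best["algorithm"])  # iDRegEx
```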
## Architecture
```
bex/
├── crx.py # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py # SOA: Symbolic Observation Automaton core
├── koa.py # k-OA: k-testable Observation Automaton
├── ikoa.py # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py # Baum-Welch EM training for k-OA
├── expr.py # Expression utilities (concat, disj, star, strip)
├── marking.py # State marking for determinism
├── yaml_to_seq.py # Generic YAML → key-path sequence converter
├── role_grammar.py # Ansible role → module-sequence extractor
├── ensemble.py # Ensemble: runs CRX + iDRegEx, picks best by MDL
├── mdl.py # MDL scoring for grammar selection (fix)
├── mcp_server.py # MCP server exposing 4 tools
└── ...
```
## MCP Server
A **Model Context Protocol** server exposes all algorithms and domain adapters as tools:
```bash
python -m bex.mcp_server
```
### Tools
| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a `Why:` explanation. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | Generic YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
### Using `infer_best_grammar`
The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass `prefer`:
```
User: Run CRX on our deploy tasks.
Agent: [runs with prefer='crx']
Best: CRX (MDL 7.0)
Grammar: file.template.docker_image.command.set_fact.shell.wait_for?
CRX MDL= 7.00 file.template.docker_image.command.set_fact.shell.wait_for?
Why: Requested CRX only.
```
Without `prefer`, the ensemble compares both:
```
User: Find the grammar for our Helm chart.
Agent: [runs]
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
```
Both grammars are correct — they operate at different levels of specificity. The `Why:` field helps the agent decide which one to use for the task at hand.
## Ensemble Selection
The `infer_best_grammar` tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.
### How MDL scoring works
```
MDL = model_cost + data_cost
```
- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A grammar that accepts *many* strings of the same length (like a 17-way disjunction `(a+b+...+q)+`) has high data cost because `|L(r)|` is large. A specific, fixed sequence (`a.b.c.d.e`) has `|L(r)| = 1` so data cost is zero.
The ensemble selects the grammar with the lowest total MDL. This automatically picks the right level of specificity for the data.
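A toy version of this formula makes the trade-off concrete. The sketch below is not the implementation in `bex/mdl.py`. It assumes single-character symbols, takes the grammar as a Python regex rather than in this README's notation (so `[abc]+` stands for `(a+b+c)+`), and counts `|L(r)|` at a given length by brute-force enumeration. The name `mdl_score` and its signature are hypothetical.

```python
import math
import re
from itertools import product


def mdl_score(pattern, alphabet, sequences):
    """Toy MDL: model cost + data cost, for single-character symbols."""
    regex = re.compile(pattern)
    # Model cost: number of distinct alphabet symbols used in the grammar.
    model_cost = len({ch for ch in pattern if ch in alphabet})
    data_cost = 0.0
    for seq in sequences:
        # |L(r)| restricted to strings of the same length as seq.
        n_accepted = sum(
            1
            for cand in product(alphabet, repeat=len(seq))
            if regex.fullmatch("".join(cand))
        )
        if n_accepted == 0:
            return math.inf  # the grammar accepts nothing of this length
        data_cost += math.log2(n_accepted)
    return model_cost + data_cost


alphabet = "abc"
sequences = ["abc"]
print(mdl_score("abc", alphabet, sequences))     # 3.0 (data cost is zero)
print(mdl_score("[abc]+", alphabet, sequences))  # about 7.75 (27 strings of length 3)
```

Both grammars use the same three symbols, so the model cost is equal; the general grammar loses only because it accepts 27 strings of length 3 where the specific one accepts a single string.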
### When each algorithm wins
| Scenario | Winner | Why |
|----------|--------|-----|
| Many sequences, diverse patterns | **CRX** | CRX captures the full vocabulary. iDRegEx can't find a common core. |
| Clean, structured sequences | **CRX** | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
| Few sequences (2-3) | **iDRegEx** | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
| Sequences share a clear core | **iDRegEx** | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
| Single sequence | **iDRegEx** (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |
### Real-world benchmarks
Results from three domains using the ensemble (fixed MDL scoring):
```
Dataset Best MDL Matches
──────────────────────────────────────────────────────────
Helm (prom-stack) iDRegEx 1433.0 1/6
Ansible (deploy) CRX 246.1 34/36
Ansible (validate) CRX 34.0 5/5
Ansible (restore) CRX 24.0 2/2
Ansible (manage) iDRegEx 25.0 1/2
Ansible (configure) iDRegEx 22.5 1/4
Terraform (hashistack) CRX 4.0 9/9
```
Note: MDL scores are not comparable across datasets — only within the same run
(CRX vs iDRegEx on the same sequences). The Helm score is higher because
each sequence is ~120 symbols long, making the data cost term dominant for
the overly-general CRX grammar (19 kinds × many lengths).
## Domain Adapters
### Ansible Roles
Extracts module names from `tasks/main.yml`, groups by category prefix (e.g., `deploy_foo` → `deploy`), and learns per-category grammars:
```python
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    if len(seqs) >= 2:
        result = infer_ensemble(seqs)
        print(f"── {cat} ({len(items)} roles) ──")
        print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
        print(f"  Grammar: {result['best']['grammar']}")
        print(f"  Why: {result['why']}")
```
**Example output** (from [companyweb](https://github.com/anomalyco/companyweb), 51 roles):
```
── restore (2 roles) ──
Best: CRX (MDL 24.0)
Grammar: file.copy.unarchive+.command
Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.
── validate (5 roles) ──
Best: CRX (MDL 34.0)
Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.
── configure (4 roles) ──
Best: iDRegEx (MDL 22.5)
Grammar: include_role
Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
```
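The grouping by category prefix can be pictured as follows. This is only a sketch: it assumes the category is the part of the role name before the first underscore, which is a guess based on the `deploy_foo` example, and the helpers `category_of` and `group_by_category` as well as the role names are made up for illustration. The actual logic lives in `bex/role_grammar.py`.

```python
from collections import defaultdict


def category_of(role_name):
    """Hypothetical rule: the category is the text before the first underscore."""
    return role_name.split("_", 1)[0]


def group_by_category(role_sequences):
    """Group (role_name, module_sequence) pairs by category prefix."""
    by_category = defaultdict(list)
    for name, seq in role_sequences:
        by_category[category_of(name)].append((name, seq))
    return dict(by_category)


# Made-up roles, for illustration only.
roles = [
    ("deploy_foo", ["file", "template", "shell"]),
    ("deploy_bar", ["file", "template"]),
    ("restore_db", ["file", "copy", "unarchive", "command"]),
]
grouped = group_by_category(roles)
print(sorted(grouped))         # ['deploy', 'restore']
print(len(grouped["deploy"]))  # 2
```

The resulting mapping has the same shape as the `by_category` value consumed by the loop above: category name to a list of `(role_name, sequence)` pairs.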
### Helm Charts
Renders a Helm chart with different values files and extracts Kubernetes `kind` sequences for grammar inference:
```python
import subprocess
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
# Render the chart once per CI values file.
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    if out.returncode == 0:
        # One sequence per rendering: the `kind` of each manifest, in order.
        kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
                 if d and isinstance(d, dict) and 'kind' in d]
        if kinds:
            seqs.append(kinds)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
print(f"Why: {result['why']}")
```
**Example output** (from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), 6 CI configs):
```
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
iDRegEx selected (MDL score 1433.0).
```
CRX captures *all* symbols that appear. iDRegEx finds only the minimal core that every config shares:
```
ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
```
Which grammar is more useful depends on the task:
- **CRX** tells you everything you *might* need — good for an agent generating a complete chart.
- **iDRegEx** tells you what you *always* need — the bootstrap pipeline that can't be skipped.
Pass `prefer='crx'` or `prefer='idregex'` to `infer_best_grammar` to select an algorithm without the ensemble comparison.
### Terraform
Parses `.tf` files to extract `resource` type sequences, per-file or per-directory:
```python
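import re
from pathlib import Path

from bex.ensemble import infer_ensemble

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    # Capture the resource type from lines like: resource "aws_instance" "web" {
    resources = re.findall(r'resource\s+"(\w+)"\s+"[\w-]+"\s*\{', tf.read_text())
    if resources:
        seqs.append(resources)
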
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```
**Example output** (from [terraform-guides](https://github.com/hashicorp/terraform-guides), hashistack example, 9 files):
```
Best: CRX (MDL 4.0, 9/9 match)
Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
```
**Grammar notation:**
- `a.b`: `a` followed by `b` (concatenation)
- `(a+b)`: either `a` or `b` (disjunction)
- `r?`: zero or one (optional)
- `r+`: one or more (iteration)
- `r+?`: zero or more (varies across examples)
- `(a|b)`: iDRegEx-style disjunction (equivalent to `(a+b)`)
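To check a sequence against a grammar written in this notation, one option is to translate it into a Python regular expression. The sketch below is an assumption-laden illustration, not part of `bex`: `grammar_to_regex` and `matches` are hypothetical helpers, symbols are assumed to consist of word characters only, and a `+` is treated as disjunction when a symbol or `(` follows it and as iteration otherwise.

```python
import re


def grammar_to_regex(grammar):
    """Translate README grammar notation into a compiled Python regex."""
    tokens = re.findall(r"\w+|[().+?|]", grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == "(":
            out.append("(?:")
        elif tok in (")", "|", "?"):
            out.append(tok)
        elif tok == ".":
            pass  # concatenation is implicit in a regex
        elif tok == "+":
            if nxt == "(" or re.fullmatch(r"\w+", nxt):
                out.append("|")  # infix '+': disjunction
            elif nxt == "?":
                out.append("*")  # 'r+?' means zero or more
                i += 1           # consume the '?'
            else:
                out.append("+")  # postfix '+': one or more
        else:
            # Each symbol is encoded as the symbol followed by one space.
            out.append(f"(?:{re.escape(tok)} )")
        i += 1
    return re.compile("".join(out))


def matches(grammar, seq):
    """True if the symbol sequence is accepted by the grammar."""
    text = "".join(f"{s} " for s in seq)
    return grammar_to_regex(grammar).fullmatch(text) is not None


print(matches("file.copy.unarchive+.command",
              ["file", "copy", "unarchive", "unarchive", "command"]))  # True
print(matches("file.copy.unarchive+.command",
              ["file", "copy", "command"]))                            # False
```

The `r+?` case needs special handling because `+?` in Python's `re` syntax is a lazy "one or more", not "zero or more".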
### Generic YAML
Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:
```python
from bex.yaml_to_seq import collect_all_sequences
from bex import infer_ensemble
results = collect_all_sequences('config_dir/')
seqs = [seq for _, seq in results]
result = infer_ensemble(seqs)
print(result['best']['grammar'])
```
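The key-path conversion can be pictured with a small depth-first walk over an already parsed document. This is a sketch only: the dotted path format, the handling of lists, and the name `key_paths` are assumptions and may differ from what `bex/yaml_to_seq.py` produces.

```python
def key_paths(node, prefix=""):
    """Yield key paths of a parsed YAML document in depth-first order."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield path
            yield from key_paths(value, path)
    elif isinstance(node, list):
        # List items contribute their keys under the parent's path.
        for item in node:
            yield from key_paths(item, prefix)


# Equivalent to what yaml.safe_load would return for a small config file.
doc = {
    "service": {"name": "web", "ports": [{"port": 80}, {"port": 443}]},
    "replicas": 2,
}
print(list(key_paths(doc)))
# ['service', 'service.name', 'service.ports', 'service.ports.port',
#  'service.ports.port', 'replicas']
```

Each file then yields one sequence of key paths, and a directory of files yields the list of sequences passed to `infer_ensemble`.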
## Papers
- **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"* — TODS 2010
- **Bex et al.** *"Inferring k-optimal REs from Positive Data"* — arXiv:1004.2372
See `papers/` for extracted text and the original references.
## Tests
```bash
python -m pytest tests/
# or
python tests/test_bex.py
```
## License
MIT