grammar-inference-engine/README.md
tobjend 0e2aec582b Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post
- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive
2026-07-01 09:51:41 +02:00

# Grammar Inference Engine
Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.
## Quick Start
```bash
pip install pyyaml
python -m bex
```
```python
from bex import infer_ensemble
seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```
Or compare algorithms manually:
```python
from bex.crx import CRX
seqs = [...]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?
```
## Algorithms
| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures *all* symbols |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |
### Pipeline 1: Direct CHARE Inference (fast)
```
Example sequences → CRX → CHAREs grammar
```
CRX learns a grammar that accepts *all* observed symbols, marking optional ones with `?`. Best when the data is clean and you want the full vocabulary.
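To make the idea concrete, here is a simplified sketch in the spirit of CRX. It is not the code in `bex/crx.py`: the name `crx_sketch` is hypothetical, and the step of the published algorithm that merges incomparable classes is left out. It groups symbols that can follow each other in both directions, orders the groups topologically, and picks a quantifier from how often each group occurs per example. It assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

def crx_sketch(sequences):
    """Simplified CHARE inference: classes, topological order, quantifiers."""
    symbols = list(dict.fromkeys(s for seq in sequences for s in seq))
    # succ[a] holds every symbol seen immediately after a
    succ = {s: set() for s in symbols}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            succ[a].add(b)

    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {s: reachable(s) for s in symbols}
    # Two symbols share a class when each is reachable from the other
    cls = {s: frozenset([s] + [t for t in symbols
                               if t in reach[s] and s in reach[t]])
           for s in symbols}
    # Predecessor classes of each class (the condensed, acyclic graph)
    preds = {c: set() for c in cls.values()}
    for a in symbols:
        for b in succ[a]:
            if cls[a] != cls[b]:
                preds[cls[b]].add(cls[a])

    factors = []
    for c in TopologicalSorter(preds).static_order():
        # How many symbols of this class occur in each example
        counts = [sum(s in c for s in seq) for seq in sequences]
        if min(counts) >= 1:
            quant = "" if max(counts) == 1 else "+"
        else:
            quant = "?" if max(counts) <= 1 else "+?"
        members = sorted(c)
        body = members[0] if len(members) == 1 else "(" + "+".join(members) + ")"
        factors.append(body + quant)
    return ".".join(factors)
```

On the two Quick Start sequences this sketch returns `file.template.docker_image.command.set_fact.shell.wait_for?`: every symbol forms its own class, and only `wait_for` is missing from one example.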
### Pipeline 2: Probabilistic k-ORE Inference (robust)
```
Example sequences → Complete k-OA → Baum-Welch (EM)
→ Disambiguate → Prune → rwr² → k-ORE grammar
```
iDRegEx learns the *minimal* common core: the symbols that appear in every example. It fails (returning ∅) when the examples are too diverse to share such a core.
### Pipeline 3: Ensemble (recommended)
```
Example sequences → [CRX, iDRegEx] → MDL score each → pick best
```
Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like `(a+b+c+...+z)+` that accepts everything gets a high data cost (`log2(|L(r)|)` is large), while a specific grammar like `a.b.c` has near-zero data cost.
## Architecture
```
bex/
├── crx.py           # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py       # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py          # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py         # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py           # SOA: Symbolic Observation Automaton core
├── koa.py           # k-OA: k-testable Observation Automaton
├── ikoa.py          # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py       # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py    # Baum-Welch EM training for k-OA
├── expr.py          # Expression utilities (concat, disj, star, strip)
├── marking.py       # State marking for determinism
├── yaml_to_seq.py   # Generic YAML → key-path sequence converter
├── role_grammar.py  # Ansible role → module-sequence extractor
├── ensemble.py      # Ensemble: runs CRX + iDRegEx, picks best by MDL
├── mdl.py           # MDL scoring for grammar selection
├── mcp_server.py    # MCP server exposing 4 tools
└── ...
```
## MCP Server
A **Model Context Protocol** server exposes all algorithms and domain adapters as tools:
```bash
python -m bex.mcp_server
```
### Tools
| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a `Why:` explanation. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | Generic YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
### Using `infer_best_grammar`
The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass `prefer`:
```
User: Run CRX on our deploy tasks.
Agent: [runs with prefer='crx']
Best: CRX (MDL 7.0)
Grammar: file.template.docker_image.command.set_fact.shell.wait_for?
CRX MDL= 7.00 file.template.docker_image.command.set_fact.shell.wait_for?
Why: Requested CRX only.
```
Without `prefer`, the ensemble compares both:
```
User: Find the grammar for our Helm chart.
Agent: [runs]
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
```
Both grammars are correct — they operate at different levels of specificity. The `Why:` field helps the agent decide which one to use for the task at hand.
## Ensemble Selection
The `infer_best_grammar` tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.
### How MDL scoring works
```
MDL = model_cost + data_cost
```
- **model_cost**: the number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost**: Σ log₂ |Lₙ(r)| over all example sequences `s`, where `n = len(s)` and `Lₙ(r)` is the set of strings of length `n` that the grammar `r` accepts. A grammar that accepts *many* strings of the same length (like a 17-way disjunction `(a+b+...+q)+`) has a high data cost because `|Lₙ(r)|` is large. A specific, fixed sequence (`a.b.c.d.e`) accepts exactly one string, so its data cost is zero.
The ensemble selects the grammar with the lowest total MDL. This automatically picks the right level of specificity for the data.
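The two cost terms can be illustrated with a small sketch. It is not the code in `bex/mdl.py`: the names `count_words` and `mdl_score` are hypothetical, and it assumes a CHARE-style grammar given as a list of `(symbols, quantifier)` factors whose symbol sets do not overlap, so that every accepted string has exactly one parse. Lengths that the grammar cannot produce are assumed to add no data cost, which is a simplification.

```python
from math import log2

# Quantifiers: "1" exactly once, "?" optional, "+" one or more,
# "*" zero or more (written r+? in this README's notation).

def count_words(factors, n):
    """Number of strings of length n accepted by the concatenated factors."""
    ways = [1] + [0] * n  # ways[j]: strings of length j built so far
    for symbols, quant in factors:
        k = len(symbols)
        lo = 0 if quant in ("?", "*") else 1
        hi = 1 if quant in ("1", "?") else n
        nxt = [0] * (n + 1)
        for j, w in enumerate(ways):
            if w == 0:
                continue
            for m in range(lo, hi + 1):
                if j + m > n:
                    break
                nxt[j + m] += w * k ** m  # k choices for each of m positions
        ways = nxt
    return ways[n]

def mdl_score(factors, sequences):
    """model_cost (unique symbols) + data_cost (sum of log2 word counts)."""
    model_cost = len({s for symbols, _ in factors for s in symbols})
    data_cost = 0.0
    for seq in sequences:
        n_words = count_words(factors, len(seq))
        if n_words > 0:  # assumption: unreachable lengths add no cost
            data_cost += log2(n_words)
    return model_cost + data_cost

fixed = [({"a"}, "1"), ({"b"}, "1"), ({"c"}, "1")]  # a.b.c
loose = [({"a", "b", "c"}, "+")]                    # (a+b+c)+
example = [["a", "b", "c"]]
print(mdl_score(fixed, example), mdl_score(loose, example))
```

For the single example `a b c`, the fixed grammar scores 3.0 (three symbols, zero data cost), while the disjunction scores about 7.8 because it accepts 27 strings of length 3.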
### When each algorithm wins
| Scenario | Winner | Why |
|----------|--------|-----|
| Many sequences, diverse patterns | **CRX** | CRX captures the full vocabulary. iDRegEx can't find a common core. |
| Clean, structured sequences | **CRX** | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
| Few sequences (2–3) | **iDRegEx** | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
| Sequences share a clear core | **iDRegEx** | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
| Single sequence | **iDRegEx** (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |
### Real-world benchmarks
Results from three domains using the ensemble (fixed MDL scoring):
```
Dataset                  Best      MDL      Matches
──────────────────────────────────────────────────────────
Helm (prom-stack)        iDRegEx   1433.0   1/6
Ansible (deploy)         CRX        246.1   34/36
Ansible (validate)       CRX         34.0   5/5
Ansible (restore)        CRX         24.0   2/2
Ansible (manage)         iDRegEx     25.0   1/2
Ansible (configure)      iDRegEx     22.5   1/4
Terraform (hashistack)   CRX          4.0   9/9
Note: MDL scores are not comparable across datasets — only within the same run
(CRX vs iDRegEx on the same sequences). The Helm score is higher because
each sequence is ~120 symbols long, making the data cost term dominant for
the overly-general CRX grammar (19 kinds × many lengths).
## Domain Adapters
### Ansible Roles
Extracts module names from `tasks/main.yml`, groups by category prefix (e.g., `deploy_foo` → `deploy`), and learns per-category grammars:
```python
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences

all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    if len(seqs) >= 2:
        result = infer_ensemble(seqs)
        print(f"── {cat} ({len(items)} roles) ──")
        print(f" Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
        print(f" Grammar: {result['best']['grammar']}")
        print(f" Why: {result['why']}")
```
**Example output** (from [companyweb](https://github.com/anomalyco/companyweb), 51 roles):
```
── restore (2 roles) ──
Best: CRX (MDL 24.0)
Grammar: file.copy.unarchive+.command
Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.
── validate (5 roles) ──
Best: CRX (MDL 34.0)
Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.
── configure (4 roles) ──
Best: iDRegEx (MDL 22.5)
Grammar: include_role
Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
```
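The extraction and grouping steps described at the start of this section can be sketched on already-parsed task lists. This is an illustration, not the code in `bex/role_grammar.py`: `module_sequence`, `group_by_category`, and the keyword list are hypothetical, the keyword list is deliberately partial, and the sketch assumes the module is the first task key that is not a task keyword.

```python
from collections import defaultdict

# Task-level keywords that are not module names (partial, illustrative list).
TASK_KEYWORDS = {
    "name", "when", "register", "tags", "become", "become_user", "vars",
    "loop", "with_items", "notify", "changed_when", "failed_when",
    "ignore_errors", "delegate_to", "environment", "args", "no_log",
    "until", "retries", "delay", "run_once",
}

def module_sequence(tasks):
    """Return the module name of each task, in file order."""
    seq = []
    for task in tasks:
        modules = [key for key in task if key not in TASK_KEYWORDS]
        if modules:
            # Drop a collection prefix such as "ansible.builtin."
            seq.append(modules[0].rsplit(".", 1)[-1])
    return seq

def group_by_category(role_sequences):
    """Group (role, sequence) pairs by the prefix before the first '_'."""
    groups = defaultdict(list)
    for role, seq in role_sequences.items():
        groups[role.split("_", 1)[0]].append((role, seq))
    return dict(groups)

tasks = [
    {"name": "Create dir", "file": {"path": "/srv/app", "state": "directory"}},
    {"name": "Copy config", "ansible.builtin.copy": {"src": "a", "dest": "b"},
     "when": "deploy_enabled"},
]
print(module_sequence(tasks))
```

Each group's list of sequences can then be passed to `infer_ensemble`, as in the loop above.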
### Helm Charts
Renders a Helm chart with different values files and extracts Kubernetes `kind` sequences for grammar inference:
```python
import subprocess
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    # Render the chart once per values file
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    if out.returncode == 0:
        # Keep the `kind` of every rendered manifest, in document order
        kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
                 if d and isinstance(d, dict) and 'kind' in d]
        if kinds:
            seqs.append(kinds)
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
print(f"Why: {result['why']}")
```
**Example output** (from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), 6 CI configs):
```
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
iDRegEx selected (MDL score 1433.0).
```
CRX captures *all* symbols that appear. iDRegEx finds only the minimal core that every config shares:
```
ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
```
Which grammar is more useful depends on the task:
- **CRX** tells you everything you *might* need — good for an agent generating a complete chart.
- **iDRegEx** tells you what you *always* need — the bootstrap pipeline that can't be skipped.
Use `prefer='crx'` or `prefer='idregex'` to select an algorithm without the ensemble comparison.
### Terraform
Parses `.tf` files to extract `resource` type sequences, per-file or per-directory:
```python
import re
from pathlib import Path

from bex.ensemble import infer_ensemble

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    # One sequence of resource types per .tf file
    resources = re.findall(r'resource "(\w+)" "\w+" \{', tf.read_text())
    if resources:
        seqs.append(resources)
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```
**Example output** (from [terraform-guides](https://github.com/hashicorp/terraform-guides), hashistack example, 9 files):
```
Best: CRX (MDL 4.0, 9/9 match)
Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
```
**Grammar notation:**
- `a.b`: `a` followed by `b` (concatenation)
- `(a+b)`: either `a` or `b` (disjunction)
- `r?`: zero or one (optional)
- `r+`: one or more (iteration)
- `r+?`: zero or more (varies across examples)
- `(a|b)`: iDRegEx-style disjunction (equivalent to `(a+b)`)
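Because `+` means disjunction between symbols and iteration after an expression, and `+?` is not Python's lazy quantifier, the notation cannot be handed to `re` unchanged. The sketch below translates it into a Python regex over space-terminated tokens. The helper names `to_python_regex` and `matches` are hypothetical, and the sketch assumes that symbols are plain identifiers (no dots or hyphens) and that factors are always separated by `.`.

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|[().?+|]")
OPERATORS = {".", "?", ")", "+", "|"}

def to_python_regex(grammar):
    """Translate the README's grammar notation into a compiled Python regex."""
    tokens = TOKEN.findall(grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == ".":
            pass                      # concatenation is implicit in a regex
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "|", "?"):
            out.append(tok)
        elif tok == "+":
            if nxt is not None and nxt not in OPERATORS:
                out.append("|")       # infix '+': disjunction
            elif nxt == "?":
                out.append("*")       # 'r+?': zero or more
                i += 1                # consume the '?'
            else:
                out.append("+")       # postfix '+': one or more
        else:
            out.append(f"(?:{re.escape(tok)} )")  # symbol plus separator
        i += 1
    return re.compile("".join(out))

def matches(grammar, sequence):
    """True if the token sequence is accepted by the grammar."""
    text = "".join(f"{symbol} " for symbol in sequence)
    return to_python_regex(grammar).fullmatch(text) is not None

print(matches("file.copy.unarchive+.command",
              ["file", "copy", "unarchive", "unarchive", "command"]))
```

A check like this is one way to reproduce the "matches n/m" counts shown in the example outputs, under the assumptions stated above.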
### Generic YAML
Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:
```python
from bex.yaml_to_seq import collect_all_sequences
from bex import infer_ensemble
results = collect_all_sequences('config_dir/')
seqs = [seq for _, seq in results]
result = infer_ensemble(seqs)
print(result['best']['grammar'])
```
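The depth-first traversal can be sketched on data that has already been parsed (for example with `yaml.safe_load`). This is an illustration, not the code in `bex/yaml_to_seq.py`: the function name `key_paths` is hypothetical, and the choices of joining keys with `.` and of letting list items inherit their parent's path are assumptions.

```python
def key_paths(node, prefix=""):
    """Yield key paths in depth-first, document order."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield path
            yield from key_paths(value, path)
    elif isinstance(node, list):
        # List items add no path segment of their own
        for item in node:
            yield from key_paths(item, prefix)

doc = {"server": {"host": "localhost", "ports": [{"http": 80}]}, "debug": True}
print(list(key_paths(doc)))
# ['server', 'server.host', 'server.ports', 'server.ports.http', 'debug']
```

Each file yields one sequence of paths, and the sequences from several files become the examples for inference.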
## Papers
- **Bex et al.** *"Inference of Concise Regular Expressions and DTDs"*, ACM TODS 2010
- **Bex et al.** *"Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data"*, arXiv:1004.2372
See `papers/` for extracted text and the original references.
## Tests
```bash
python -m pytest tests/
# or
python tests/test_bex.py
```
## License
MIT