# Grammar Inference Engine

Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

## Quick Start

```bash
pip install pyyaml
python -m bex
```

```python
from bex import infer_ensemble

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```

Or run a single algorithm directly, for example CRX:

```python
from bex.crx import CRX

seqs = [...]  # your example sequences, e.g. the two lists from the snippet above
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?
```

## Algorithms

| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures *all* symbols |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |

### Pipeline 1: Direct CHARE Inference (fast)

```
Example sequences → CRX → CHAREs grammar
```

CRX learns a grammar that accepts *all* observed symbols, marking optional ones with `?`. Best when the data is clean and you want the full vocabulary.

### Pipeline 2: Probabilistic k-ORE Inference (robust)

```
Example sequences → Complete k-OA → Baum-Welch (EM)
                  → Disambiguate → Prune → rwr² → k-ORE grammar
```

iDRegEx learns the *minimal* common subsequence: the symbols that appear in every example. It fails (returns ∅) when the examples are too diverse to share a common core.

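The difference between the two pipelines can be illustrated without running either algorithm. The sketch below is neither CRX nor iDRegEx; it is a simplified stand-in that only computes which symbols occur at all and which occur in every example, using the Quick Start sequences. The helper names `full_vocabulary` and `common_core` are hypothetical and not part of `bex`.

```python
def full_vocabulary(seqs):
    """Every symbol seen in any example, in first-seen order."""
    seen = []
    for seq in seqs:
        for sym in seq:
            if sym not in seen:
                seen.append(sym)
    return seen


def common_core(seqs):
    """Symbols present in every example, in the order of the first example."""
    if not seqs:
        return []
    shared = set(seqs[0]).intersection(*map(set, seqs[1:]))
    return [sym for sym in seqs[0] if sym in shared]


seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]

core = common_core(seqs)
optional = [s for s in full_vocabulary(seqs) if s not in core]
print("core:    ", core)      # what a core-only grammar would keep
print("optional:", optional)  # what a full-vocabulary grammar would mark with ?
```

Here only `wait_for` is missing from one example, which is consistent with the `(wait_for)?` suffix in the CRX output shown in the Quick Start.
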
### Pipeline 3: Ensemble (recommended)

```
Example sequences → [CRX, iDRegEx] → MDL score each → pick best
```

Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like `(a+b+c+...+z)+` that accepts everything gets a high data cost (`log2(|L(r)|)` is large), while a specific grammar like `a.b.c` has near-zero data cost.

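The selection step itself amounts to taking the minimum over scored candidates. The following is a minimal sketch of that idea, not the code in `ensemble.py`: the `pick_best` helper, the shape of the candidate list, and the use of `None` for a failed run are assumptions. The dictionary keys mirror the `result['best']` fields used in the Quick Start, the scores are borrowed from the Helm example later in this README, and the grammars are placeholders.

```python
def pick_best(candidates):
    """Return the candidate with the lowest MDL score, skipping failed runs."""
    viable = [c for c in candidates if c["grammar"] is not None]
    if not viable:
        raise ValueError("no algorithm produced a grammar")
    return min(viable, key=lambda c: c["mdl_score"])


# Hypothetical candidate records; placeholder grammars, scores from the Helm example.
candidates = [
    {"algorithm": "CRX", "grammar": "(a+b+c)+", "mdl_score": 2651.74},
    {"algorithm": "iDRegEx", "grammar": "a.b.c", "mdl_score": 1432.99},
]

best = pick_best(candidates)
print(f"Best: {best['algorithm']} (MDL {best['mdl_score']})")
```
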
## Architecture

```
bex/
├── crx.py           # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py       # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py          # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py         # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py           # SOA: Symbolic Observation Automaton core
├── koa.py           # k-OA: k-testable Observation Automaton
├── ikoa.py          # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py       # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py    # Baum-Welch EM training for k-OA
├── expr.py          # Expression utilities (concat, disj, star, strip)
├── marking.py       # State marking for determinism
├── yaml_to_seq.py   # Generic YAML → key-path sequence converter
├── role_grammar.py  # Ansible role → module-sequence extractor
├── ensemble.py      # Ensemble: runs CRX + iDRegEx, picks best by MDL
├── mdl.py           # MDL scoring for grammar selection
├── mcp_server.py    # MCP server exposing 4 tools
└── ...
```

## MCP Server

A **Model Context Protocol** server exposes all algorithms and domain adapters as tools:

```bash
python -m bex.mcp_server
```

### Tools

| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a `Why:` explanation. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | Generic YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |

### Using `infer_best_grammar`

The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass `prefer`:

```
User: Run CRX on our deploy tasks.
Agent: [runs with prefer='crx']
Best: CRX (MDL 7.0)
Grammar: file.template.docker_image.command.set_fact.shell.wait_for?

CRX MDL= 7.00 file.template.docker_image.command.set_fact.shell.wait_for?

Why: Requested CRX only.
```

Without `prefer`, the ensemble compares both:

```
User: Find the grammar for our Helm chart.
Agent: [runs]
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?

Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
```

Both grammars are valid descriptions of the data; they operate at different levels of specificity. The `Why:` field reports each candidate's score and match count, which helps the agent decide which one to use for the task at hand.

## Ensemble Selection

The `infer_best_grammar` tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.

### How MDL scoring works

```
MDL = model_cost + data_cost
```

- **model_cost**: the number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost**: Σ log₂(|L(r) at length len(s)|) across all sequences `s`, where `|L(r) at length n|` is the number of strings of length `n` that the grammar `r` accepts. A grammar that accepts *many* strings of the same length (like a 17-way disjunction `(a+b+...+q)+`) has high data cost because `|L(r)|` is large. A specific, fixed sequence (`a.b.c.d.e`) has `|L(r)| = 1`, so its data cost is zero.

The ensemble selects the grammar with the lowest total MDL, which trades grammar size against how loosely the grammar fits the data.

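The two grammars named above make a convenient worked example. The sketch below applies the formula as stated in this section to three examples of length 5; it is an illustration of the arithmetic, not the code in `mdl.py`. The `mdl` helper and its arguments are hypothetical, and it assumes every example is accepted by both grammars (how `bex` scores non-matching sequences is not covered here).

```python
import math


def mdl(num_symbols, language_sizes):
    """MDL = model_cost + data_cost.

    num_symbols:    unique alphabet symbols in the grammar (model cost)
    language_sizes: for each example s, the number of strings of length
                    len(s) that the grammar accepts
    """
    data_cost = sum(math.log2(n) for n in language_sizes)
    return num_symbols + data_cost


lengths = [5, 5, 5]  # three examples, each five symbols long

# Fixed sequence a.b.c.d.e: exactly one accepted string of length 5.
specific = mdl(5, [1 for _ in lengths])

# (a+b+...+q)+ over 17 symbols: 17**n accepted strings of length n.
general = mdl(17, [17 ** n for n in lengths])

print(f"a.b.c.d.e     MDL = {specific:.2f}")
print(f"(a+b+...+q)+  MDL = {general:.2f}")
```

The fixed sequence scores 5.00 (model cost only), while the 17-way disjunction scores about 78.31: 17 for the model plus 15 · log₂(17) ≈ 61.31 for the data.
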
### When each algorithm wins

| Scenario | Winner | Why |
|----------|--------|-----|
| Many sequences, diverse patterns | **CRX** | CRX captures the full vocabulary. iDRegEx can't find a common core. |
| Clean, structured sequences | **CRX** | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
| Few sequences (2–3) | **iDRegEx** | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
| Sequences share a clear core | **iDRegEx** | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
| Single sequence | **iDRegEx** (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |

### Real-world benchmarks

Results from three domains (Helm, Ansible, Terraform) using the ensemble with the corrected MDL scoring:

```
Dataset                  Best       MDL      Matches
──────────────────────────────────────────────────────────
Helm (prom-stack)        iDRegEx    1433.0   1/6
Ansible (deploy)         CRX         246.1   34/36
Ansible (validate)       CRX          34.0   5/5
Ansible (restore)        CRX          24.0   2/2
Ansible (manage)         iDRegEx      25.0   1/2
Ansible (configure)      iDRegEx      22.5   1/4
Terraform (hashistack)   CRX           4.0   9/9
```

Note: MDL scores are not comparable across datasets, only within the same run (CRX vs iDRegEx on the same sequences). The Helm score is higher because each sequence is ~120 symbols long, making the data cost term dominant for the overly general CRX grammar (19 kinds × many lengths).

## Domain Adapters

### Ansible Roles

Extracts module names from `tasks/main.yml`, groups by category prefix (e.g., `deploy_foo` → `deploy`), and learns per-category grammars:

```python
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences

all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    # Inference needs at least two examples per category
    if len(seqs) >= 2:
        result = infer_ensemble(seqs)
        print(f"── {cat} ({len(items)} roles) ──")
        print(f" Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
        print(f" Grammar: {result['best']['grammar']}")
        print(f" Why: {result['why']}")
```

**Example output** (from [companyweb](https://github.com/anomalyco/companyweb), 51 roles):

```
── restore (2 roles) ──
 Best: CRX (MDL 24.0)
 Grammar: file.copy.unarchive+.command
 Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.

── validate (5 roles) ──
 Best: CRX (MDL 34.0)
 Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
 Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.

── configure (4 roles) ──
 Best: iDRegEx (MDL 22.5)
 Grammar: include_role
 Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
```

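The category grouping described at the start of this section (`deploy_foo` → `deploy`) could look roughly like the sketch below. It is an assumption about the behaviour, not the code in `role_grammar.py`: `category_of` and `group_by_category` are hypothetical helpers, the role names are made up, and the `(role_name, sequence)` pair shape is inferred from the `[s for _, s in items]` unpacking in the snippet above.

```python
from collections import defaultdict


def category_of(role_name):
    """'deploy_foo' -> 'deploy'; a name without '_' is its own category."""
    return role_name.split('_', 1)[0]


def group_by_category(role_sequences):
    """Group (role_name, module_sequence) pairs by category prefix."""
    by_category = defaultdict(list)
    for name, seq in role_sequences:
        by_category[category_of(name)].append((name, seq))
    return dict(by_category)


# Made-up roles for illustration only.
roles = [
    ('deploy_web', ['file', 'template', 'command']),
    ('deploy_db', ['file', 'template', 'shell']),
    ('restore_db', ['file', 'copy', 'unarchive', 'command']),
]

grouped = group_by_category(roles)
for cat, items in sorted(grouped.items()):
    print(cat, [name for name, _ in items])
```
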
### Helm Charts

Renders a Helm chart with different values files and extracts Kubernetes `kind` sequences for grammar inference:

```python
import subprocess
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    # Render the chart once per values file
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    if out.returncode == 0:
        # One sequence per rendering: the `kind` of each manifest, in order
        kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
                 if d and isinstance(d, dict) and 'kind' in d]
        if kinds:
            seqs.append(kinds)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
print(f"Why: {result['why']}")
```

**Example output** (from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), 6 CI configs):

```
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?

Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
iDRegEx selected (MDL score 1433.0).
```

CRX captures *all* symbols that appear. iDRegEx finds only the minimal core that every config shares:

```
ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
```

Which grammar is more useful depends on the task:

- **CRX** tells you everything you *might* need, which suits an agent generating a complete chart.
- **iDRegEx** tells you what you *always* need: the bootstrap pipeline that can't be skipped.

Pass `prefer='crx'` or `prefer='idregex'` to the `infer_best_grammar` tool to select an algorithm without the ensemble comparison.

### Terraform

Parses `.tf` files to extract `resource` type sequences, per-file or per-directory:

```python
import re
from pathlib import Path

from bex.ensemble import infer_ensemble

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    # Capture the resource type from lines like: resource "aws_instance" "web" {
    resources = re.findall(r'resource\s+"(\w+)"\s+"[\w-]+"\s*\{', tf.read_text())
    if resources:
        seqs.append(resources)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```

**Example output** (from [terraform-guides](https://github.com/hashicorp/terraform-guides), hashistack example, 9 files):

```
Best: CRX (MDL 4.0, 9/9 match)
Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
```

**Grammar notation:**

- `a.b`: `a` followed by `b` (concatenation)
- `(a+b)`: either `a` or `b` (disjunction)
- `r?`: zero or one (optional)
- `r+`: one or more (iteration)
- `r+?`: zero or more (varies across examples)
- `(a|b)`: iDRegEx-style disjunction (equivalent to `(a+b)`)

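To check a sequence against a grammar printed in this notation, one option is to translate it into a Python `re` pattern. The sketch below is an independent illustration and not part of `bex`; `to_python_regex` and `matches` are hypothetical helpers. It assumes symbols consist of word characters only, and it assumes a `+` is a disjunction when it is directly followed by a symbol or `(`, and a postfix operator otherwise.

```python
import re

TOKEN = re.compile(r"\w+|[().+?|]")


def to_python_regex(grammar):
    """Translate the dotted notation into a Python regex over 'sym sym sym ' text."""
    tokens = TOKEN.findall(grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == ".":
            pass                          # concatenation is implicit in re
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "|", "?"):
            out.append(tok)
        elif tok == "+":
            if nxt == "(" or re.fullmatch(r"\w+", nxt):
                out.append("|")           # infix +: disjunction
            elif nxt == "?":
                out.append("*")           # r+? means zero or more
                i += 1                    # consume the '?'
            else:
                out.append("+")           # postfix +: one or more
        else:
            out.append("(?:" + re.escape(tok) + " )")  # symbol plus separator
        i += 1
    return "".join(out)


def matches(grammar, seq):
    """True if the token sequence is accepted by the grammar."""
    text = "".join(sym + " " for sym in seq)
    return re.fullmatch(to_python_regex(grammar), text) is not None


g = "file.copy.unarchive+.command"
print(matches(g, ["file", "copy", "unarchive", "unarchive", "command"]))
print(matches(g, ["file", "copy", "command"]))
```

The first call prints `True` and the second `False`, because `unarchive+` requires at least one occurrence. Each symbol is emitted with a trailing space so that one symbol cannot match as a prefix of another.
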
## Domain: Generic YAML

Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:

```python
from bex.yaml_to_seq import collect_all_sequences
from bex import infer_ensemble

results = collect_all_sequences('config_dir/')
seqs = [seq for _, seq in results]
result = infer_ensemble(seqs)
print(result['best']['grammar'])
```

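The depth-first key-path traversal mentioned above can be sketched on an already-parsed document. This is an illustration of the idea rather than the code in `yaml_to_seq.py`: the `key_paths` helper is hypothetical, and the `/` separator and the handling of list items are assumptions, so the actual symbols produced by `collect_all_sequences` may look different.

```python
def key_paths(node, prefix=""):
    """Yield key paths of a parsed YAML document in depth-first order."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}/{key}" if prefix else str(key)
            yield path
            yield from key_paths(value, path)
    elif isinstance(node, list):
        # List items share the path of their parent key
        for item in node:
            yield from key_paths(item, prefix)


# A parsed document, as yaml.safe_load would return it (made-up example).
doc = {
    "service": {
        "name": "web",
        "ports": [{"port": 80}, {"port": 443}],
    },
    "replicas": 2,
}

sequence = list(key_paths(doc))
print(sequence)
```

Each document yields one sequence, so a directory of similar config files produces the list of sequences that `infer_ensemble` expects.
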
## Papers

- **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"*, TODS 2010
- **Bex et al.** *"Inferring k-optimal REs from Positive Data"*, arXiv:1004.2372

See `papers/` for extracted text and the original references.

## Tests

```bash
python -m pytest tests/
# or
python tests/test_bex.py
```

## License

MIT