Update README and SHOWCASE with real-world dataset evaluations
README:
- Replace outdated company benchmarks with public showcases
- Add Algorithm Selection Guide
- Add 'When each algorithm wins' table
- Add 'Why grammar inference?' table with value prop for LLMs
- Add 'What doesn't work' section documenting failed approaches
- Update all domain adapter examples with public results
- Clean up outdated references (companyweb roles, hashistack terraform)

SHOWCASE:
- Add Helm (kube-prometheus-stack) with iDRegEx minimal core
- Add Docker Compose per-project patterns
- Add GitHub Actions cross-project Go lint pattern
- Add Terraform modules with vocabulary analysis
- Add 'What doesn't work' section
- Explain WHY each dataset helps an LLM
This commit is contained in:
parent
0e2aec582b
commit
547376894c
2 changed files with 260 additions and 226 deletions
355
README.md
|
|
@ -23,78 +23,130 @@ print(f"Grammar: {result['best']['grammar']}")
|
|||
print(f"Score: {result['best']['mdl_score']}")
|
||||
```
|
||||
|
||||
Or compare algorithms manually:
|
||||
## Why grammar inference?
|
||||
|
||||
```python
|
||||
from bex.crx import CRX
|
||||
There are many domains where developers follow **unwritten conventions** — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.
|
||||
|
||||
seqs = [...]
|
||||
crx = CRX()
|
||||
grammar = crx.infer(seqs)
|
||||
print(grammar)
|
||||
# file.template.docker_image.command.set_fact.shell.(wait_for)?
|
||||
```
|
||||
Grammar inference automatically discovers these conventions from examples.
|
||||
|
||||
## Algorithms
|
||||
| Domain | Unwritten convention | What the grammar tells an LLM |
|--------|---------------------|-------------------------------|
| Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
| Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
| Docker Compose | `(build+image).command.(environment+volumes)?.ports` | "Every service starts with either build or image, then command, then optionally environment or volumes, then ports." |
| GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. Add megalinter only for extra coverage." |
| Terraform modules | Everything is optional — but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |
|
||||
|
||||
| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures *all* symbols |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |
|
||||
## Algorithm Selection Guide
|
||||
|
||||
### Pipeline 1: Direct CHARE Inference (fast)
|
||||
| When | Use | Why |
|------|-----|-----|
| Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |
|
||||
|
||||
## Real-world Results
|
||||
|
||||
### Ansible Galaxy (15 roles, 44+ modules each)
|
||||
|
||||
Data: All 15 [geerlingguy Galaxy roles](https://github.com/geerlingguy) — nginx, php, mysql, docker, etc.
|
||||
|
||||
```
|
||||
Example sequences → CRX → CHAREs grammar
|
||||
Best: CRX (MDL 288, 15/15 match)
|
||||
Grammar:
|
||||
fail?.(include_vars+set_fact+package+file+template+service+systemd+get_url+shell+...)+
|
||||
.include+?.(npm+pip)+?.lineinfile?
|
||||
```
|
||||
|
||||
CRX learns a grammar that accepts *all* observed symbols, marking optional ones with `?`. Best when the data is clean and you want the full vocabulary.
|
||||
Every single role follows this pattern. The convention was **unwritten** — no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."
|
||||
|
||||
### Pipeline 2: Probabilistic k-ORE Inference (robust)
|
||||
An LLM generating a new role:
- **Must** start with conditional includes and variable setup
- **Should** then install packages and configure files
- **Then** start services
- **Finally** include handling of language-specific tooling
|
||||
|
||||
**Compression:** The grammar is ~250 chars. The 15 examples are 7200+ modules combined. **~29× compression.**
|
||||
|
||||
### Helm (kube-prometheus-stack, 6 CI configs)
|
||||
|
||||
Data: 6 different `values.yaml` configurations rendered through `helm template`.
|
||||
|
||||
```
|
||||
Example sequences → Complete k-OA → Baum-Welch (EM)
|
||||
→ Disambiguate → Prune → rwr² → k-ORE grammar
|
||||
Best: iDRegEx (MDL 1433)
|
||||
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
|
||||
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
CRX MDL= 2651.74 (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
|
||||
```
|
||||
|
||||
iDRegEx learns the *minimum* common subsequence: the symbols that appear in every example. It fails (returns ∅) when the examples are too diverse.
|
||||
iDRegEx finds the **minimum core** — what every config always deploys. CRX captures the full vocabulary (19 resource kinds). Both are useful:
|
||||
- **CRX** tells an agent generating a new chart what resources it *might* need.
|
||||
- **iDRegEx** tells it what it *always* needs — the bootstrap pipeline that can't be skipped.
|
||||
|
||||
### Pipeline 3: Ensemble (recommended)
|
||||
### Docker Compose (73 services across 10 projects)
|
||||
|
||||
Data: Per-service sections from multiple `docker-compose.yml` files.
|
||||
|
||||
Per-service convention:
|
||||
```
|
||||
Example sequences → [CRX, iDRegEx] → MDL score each → pick best
|
||||
(build+image).command.(environment+volumes)?.ports
|
||||
```
|
||||
|
||||
Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like `(a+b+c+...+z)+` that accepts everything gets a high data cost (`log2(|L(r)|)` is large), while a specific grammar like `a.b.c` has near-zero data cost.
|
||||
Each project has its own sub-patterns:
|
||||
- **Nginx-like projects:** `build.(command.volumes.ports)` — build from source, mount configs, expose ports
|
||||
- **Database projects:** `image.environment.volumes.ports` — pull image, configure with env vars, persist data
|
||||
- **Language runtimes:** `build.(environment.command).ports` — build, set env vars, override command
|
||||
|
||||
## Architecture
|
||||
An LLM generating a Docker Compose file should structure service definitions in this order.
|
||||
|
||||
### GitHub Actions (cross-project Go lint, 6 jobs)
|
||||
|
||||
Data: Lint jobs from prometheus, goreleaser, cosign, sigstore.
|
||||
|
||||
```
|
||||
bex/
├── crx.py           # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py       # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py          # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py         # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py           # SOA: Symbolic Observation Automaton core
├── koa.py           # k-OA: k-testable Observation Automaton
├── ikoa.py          # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py       # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py    # Baum-Welch EM training for k-OA
├── expr.py          # Expression utilities (concat, disj, star, strip)
├── marking.py       # State marking for determinism
├── yaml_to_seq.py   # Generic YAML → key-path sequence converter
├── role_grammar.py  # Ansible role → module-sequence extractor
├── ensemble.py      # Ensemble: runs CRX + iDRegEx, picks best by MDL
├── mdl.py           # MDL scoring for grammar selection (fix)
├── mcp_server.py    # MCP server exposing 4 tools
└── ...
|
||||
Best: CRX (MDL 13.6)
|
||||
Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?
|
||||
```
|
||||
|
||||
The lint jobs in this sample follow the same order: checkout → setup Go → run golangci-lint (marked optional in the grammar). Only the biggest projects add megalinter.
|
||||
|
||||
### Terraform (8 AWS modules, 156+ resources each)
|
||||
|
||||
Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules.
|
||||
|
||||
```
|
||||
Best: CRX (MDL 1876)
|
||||
Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?.(...) ...
|
||||
```
|
||||
|
||||
Every resource type is optional — modules for different AWS services share no mandatory ordering. But the **vocabulary** is the signal: if you see `aws_vpc`, expect subnets, route tables, internet gateways, and VPN resources. The grammar encodes the resource catalogue of each module domain.
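
To make the "vocabulary is the signal" point concrete, here is a minimal sketch of how co-occurrence could be measured across modules. It is not part of `bex`; the function names `cooccurrence` and `companions` are hypothetical, and the resource lists are made-up illustrative data, not the evaluated modules.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(modules):
    """Count how many modules contain each unordered pair of resource types."""
    pairs = Counter()
    for resources in modules:
        for a, b in combinations(sorted(set(resources)), 2):
            pairs[(a, b)] += 1
    return pairs


def companions(modules, resource):
    """Resource types present in every module that contains `resource`."""
    containing = [set(m) for m in modules if resource in m]
    if not containing:
        return set()
    return set.intersection(*containing) - {resource}


# Illustrative (invented) resource lists, one per module.
modules = [
    ["aws_vpc", "aws_subnet", "aws_route_table"],
    ["aws_vpc", "aws_subnet", "aws_internet_gateway"],
    ["aws_s3_bucket", "aws_s3_bucket_policy"],
]

print(companions(modules, "aws_vpc"))                    # {'aws_subnet'}
print(cooccurrence(modules)[("aws_subnet", "aws_vpc")])  # 2
```

A resource type whose companion set is non-empty is a candidate for the "if you see X, expect Y" hint, independent of any ordering.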
|
||||
|
||||
### What doesn't work
|
||||
|
||||
Not every domain has an unwritten convention. Grammar inference failed (produced trivial `(a+b+c+...)+` grammars) on:
|
||||
|
||||
- **Dockerfiles** — too simple (`FROM → RUN → COPY → CMD` is just the Dockerfile spec)
- **Pre-commit configs** (cross-project) — 252 unique hook IDs, no common core
- **GitHub Actions per-project** — too many different job types (build, lint, release, security) in one repo
- **Prometheus recording rules** — schema-enforced structure, no convention to discover
|
||||
|
||||
The sweet spot: **multiple implementations of the same abstract task** (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.
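
One cheap way to guess whether a dataset sits in that sweet spot is to check how much vocabulary the sequences share before running inference. The sketch below is an assumption-laden pre-check, not something `bex` provides: `shared_core` and `vocabulary_overlap` are hypothetical helpers, and the source gives no threshold for what counts as "enough" overlap.

```python
def shared_core(sequences):
    """Symbols that occur in every sequence, in order of first appearance."""
    if not sequences:
        return []
    common = set(sequences[0]).intersection(*map(set, sequences[1:]))
    return [s for s in dict.fromkeys(sequences[0]) if s in common]


def vocabulary_overlap(sequences):
    """Fraction of the total vocabulary that every sequence shares (0.0 to 1.0)."""
    vocab = set().union(*map(set, sequences)) if sequences else set()
    return len(shared_core(sequences)) / len(vocab) if vocab else 0.0


# Two renderings of the same abstract task share most of their symbols.
helm_like = [
    ["ServiceAccount", "ClusterRole", "Service", "Deployment"],
    ["ServiceAccount", "ConfigMap", "Service", "Deployment"],
]
# Unrelated configs share nothing.
precommit_like = [["black", "flake8"], ["eslint", "prettier"]]

print(shared_core(helm_like))              # ['ServiceAccount', 'Service', 'Deployment']
print(vocabulary_overlap(helm_like))       # 0.6
print(vocabulary_overlap(precommit_like))  # 0.0
```

An overlap near zero suggests the cross-project pre-commit situation described above, where no common core exists to be found.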
|
||||
|
||||
## When each algorithm wins
|
||||
|
||||
| Data property | Winner | Why |
|---------------|--------|-----|
| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
| Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
| Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
| 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
|
||||
|
||||
## MCP Server
|
||||
|
||||
A **Model Context Protocol** server exposes all algorithms and domain adapters as tools:
|
||||
A **Model Context Protocol** server exposes all algorithms and domain adapters:
|
||||
|
||||
```bash
|
||||
python -m bex.mcp_server
|
||||
|
|
@ -105,94 +157,14 @@ python -m bex.mcp_server
|
|||
| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a `Why:` explanation. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | Generic YAML → key-paths → grammar |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both, picks best by MDL. `prefer='crx'` or `prefer='idregex'` to skip comparison. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
|
||||
|
||||
### Using `infer_best_grammar`
|
||||
|
||||
The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass `prefer`:
|
||||
|
||||
```
|
||||
User: Run CRX on our deploy tasks.
|
||||
Agent: [runs with prefer='crx']
|
||||
Best: CRX (MDL 7.0)
|
||||
Grammar: file.template.docker_image.command.set_fact.shell.wait_for?
|
||||
|
||||
CRX MDL= 7.00 file.template.docker_image.command.set_fact.shell.wait_for?
|
||||
|
||||
Why: Requested CRX only.
|
||||
```
|
||||
|
||||
Without `prefer`, the ensemble compares both:
|
||||
|
||||
```
|
||||
User: Find the grammar for our Helm chart.
|
||||
Agent: [runs]
|
||||
Best: iDRegEx (MDL 1432.99)
|
||||
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
|
||||
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
CRX MDL= 2651.74 (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?
|
||||
|
||||
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
|
||||
iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
|
||||
```
|
||||
|
||||
Both grammars are correct — they operate at different levels of specificity. The `Why:` field helps the agent decide which one to use for the task at hand.
|
||||
|
||||
## Ensemble Selection
|
||||
|
||||
The `infer_best_grammar` tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.
|
||||
|
||||
### How MDL scoring works
|
||||
|
||||
```
|
||||
MDL = model_cost + data_cost
|
||||
```
|
||||
|
||||
- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
|
||||
- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A grammar that accepts *many* strings of the same length (like a 17-way disjunction `(a+b+...+q)+`) has high data cost because `|L(r)|` is large. A specific, fixed sequence (`a.b.c.d.e`) has `|L(r)| = 1` so data cost is zero.
|
||||
|
||||
The ensemble selects the grammar with the lowest total MDL. This automatically picks the right level of specificity for the data.
|
||||
|
||||
### When each algorithm wins
|
||||
|
||||
| Scenario | Winner | Why |
|----------|--------|-----|
| Many sequences, diverse patterns | **CRX** | CRX captures the full vocabulary. iDRegEx can't find a common core. |
| Clean, structured sequences | **CRX** | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
| Few sequences (2–3) | **iDRegEx** | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
| Sequences share a clear core | **iDRegEx** | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
| Single sequence | **iDRegEx** (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |
|
||||
|
||||
### Real-world benchmarks
|
||||
|
||||
Results from three domains using the ensemble (fixed MDL scoring):
|
||||
|
||||
```
|
||||
Dataset Best MDL Matches
|
||||
──────────────────────────────────────────────────────────
|
||||
Helm (prom-stack) iDRegEx 1433.0 1/6
|
||||
Ansible (deploy) CRX 246.1 34/36
|
||||
Ansible (validate) CRX 34.0 5/5
|
||||
Ansible (restore) CRX 24.0 2/2
|
||||
Ansible (manage) iDRegEx 25.0 1/2
|
||||
Ansible (configure) iDRegEx 22.5 1/4
|
||||
Terraform (hashistack) CRX 4.0 9/9
|
||||
```
|
||||
|
||||
Note: MDL scores are not comparable across datasets, only within the same run (CRX vs iDRegEx on the same sequences). The Helm score is higher because each sequence is ~120 symbols long, making the data cost term dominant for the overly-general CRX grammar (19 kinds × many lengths).
|
||||
|
||||
## Domain Adapters
|
||||
|
||||
### Ansible Roles
|
||||
|
||||
Extracts module names from `tasks/main.yml`, groups by category prefix (e.g., `deploy_foo` → `deploy`), and learns per-category grammars:
|
||||
|
||||
```python
|
||||
from bex.ensemble import infer_ensemble
|
||||
from bex.role_grammar import collect_all_role_sequences
|
||||
|
|
@ -200,36 +172,23 @@ from bex.role_grammar import collect_all_role_sequences
|
|||
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    if len(seqs) >= 2:
        result = infer_ensemble(seqs)
        print(f"── {cat} ({len(items)} roles) ──")
        print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
        print(f"  Grammar: {result['best']['grammar']}")
        print(f"  Why: {result['why']}")
|
||||
```
|
||||
|
||||
**Example output** (from [companyweb](https://github.com/anomalyco/companyweb), 51 roles):
|
||||
**Example** (15 geerlingguy Galaxy roles):
|
||||
|
||||
```
|
||||
── restore (2 roles) ──
|
||||
Best: CRX (MDL 24.0)
|
||||
Grammar: file.copy.unarchive+.command
|
||||
Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.
|
||||
|
||||
── validate (5 roles) ──
|
||||
Best: CRX (MDL 34.0)
|
||||
Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
|
||||
Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.
|
||||
|
||||
── configure (4 roles) ──
|
||||
Best: iDRegEx (MDL 22.5)
|
||||
Grammar: include_role
|
||||
Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
|
||||
── other (15 roles) ──
|
||||
Best: CRX (MDL 288, 15/15 match)
|
||||
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?
|
||||
Why: CRX matches 15/15 sequences, iDRegEx matches 3/15. CRX selected.
|
||||
```
|
||||
|
||||
### Helm Charts
|
||||
|
||||
Renders a Helm chart with different values files and extracts Kubernetes `kind` sequences for grammar inference:
|
||||
|
||||
```python
|
||||
import subprocess, yaml
|
||||
from bex.ensemble import infer_ensemble
|
||||
|
|
@ -240,7 +199,6 @@ for vf in sorted(Path('ci/').glob('*-values.yaml')):
|
|||
['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
|
||||
capture_output=True, text=True, timeout=120,
|
||||
)
|
||||
if out.returncode == 0:
|
||||
kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
|
||||
if d and isinstance(d, dict) and 'kind' in d]
|
||||
if kinds:
|
||||
|
|
@ -249,37 +207,23 @@ for vf in sorted(Path('ci/').glob('*-values.yaml')):
|
|||
result = infer_ensemble(seqs)
|
||||
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
|
||||
print(f"Grammar: {result['best']['grammar']}")
|
||||
print(f"Why: {result['why']}")
|
||||
```
|
||||
|
||||
**Example output** (from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), 6 CI configs):
|
||||
**Example** (kube-prometheus-stack, 6 CI configs):
|
||||
|
||||
```
|
||||
Best: iDRegEx (MDL 1432.99)
|
||||
Best: iDRegEx (MDL 1433)
|
||||
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
|
||||
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
CRX MDL= 2651.74 (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?
|
||||
CRX MDL= 2651.74 (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
|
||||
|
||||
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
|
||||
iDRegEx selected (MDL score 1433.0).
|
||||
```
|
||||
|
||||
CRX captures *all* symbols that appear. iDRegEx finds only the minimal core that every config shares:
|
||||
```
|
||||
ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
```
|
||||
|
||||
Which grammar is more useful depends on the task:
|
||||
- **CRX** tells you everything you *might* need — good for an agent generating a complete chart.
|
||||
- **iDRegEx** tells you what you *always* need — the bootstrap pipeline that can't be skipped.
|
||||
|
||||
Use `prefer='crx'` or `prefer='idregex'` to select an algorithm without the ensemble comparison:
|
||||
|
||||
### Terraform
|
||||
|
||||
Parses `.tf` files to extract `resource` type sequences, per-file or per-directory:
|
||||
|
||||
```python
|
||||
import re
|
||||
from bex.ensemble import infer_ensemble
|
||||
|
|
@ -295,47 +239,82 @@ print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})"
|
|||
print(f"Grammar: {result['best']['grammar']}")
|
||||
```
|
||||
|
||||
**Example output** (from [terraform-guides](https://github.com/hashicorp/terraform-guides), hashistack example, 9 files):
|
||||
**Example** (8 terraform-aws-* modules):
|
||||
|
||||
```
|
||||
Best: CRX (MDL 4.0, 9/9 match)
|
||||
Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
|
||||
Best: CRX (MDL 1876)
|
||||
Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?....
|
||||
Why: CRX matches 8/8 sequences. iDRegEx returned ∅ (no common core across modules).
|
||||
```
|
||||
|
||||
**Grammar notation:**
|
||||
### Docker Compose
|
||||
|
||||
```python
|
||||
import yaml
from pathlib import Path
from bex.ensemble import infer_ensemble

seqs = []
for dc_file in Path('.').glob('**/docker-compose*.yml'):
    # An empty file loads as None, so fall back to an empty mapping.
    data = yaml.safe_load(dc_file.read_text()) or {}
    for svc, config in data.get('services', {}).items():
        # One sequence per service: its top-level keys in file order.
        keys = list((config or {}).keys())
        if keys:
            seqs.append(keys)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
|
||||
```
|
||||
|
||||
### GitHub Actions
|
||||
|
||||
```python
|
||||
import yaml
from pathlib import Path
from bex.ensemble import infer_ensemble

def step_symbol(step):
    """Name a step by its action, or by the first word of its run command."""
    if 'uses' in step:
        return step['uses']
    words = str(step['run']).split()
    return 'run:' + (words[0] if words else '')

seqs = []
for wf_file in Path('.github/workflows/').glob('*.yml'):
    data = yaml.safe_load(wf_file.read_text()) or {}
    for job in data.get('jobs', {}).values():
        if 'steps' not in job:
            continue
        seq = [step_symbol(s) for s in job['steps'] if 'uses' in s or 'run' in s]
        if seq:
            seqs.append(seq)

result = infer_ensemble(seqs)
|
||||
```
|
||||
|
||||
## How MDL scoring works
|
||||
|
||||
```
|
||||
MDL = model_cost + data_cost
|
||||
```
|
||||
|
||||
- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
|
||||
- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A specific fixed sequence (`a.b.c.d.e`) has data cost zero because |L(r)| = 1. A grammar that accepts *many* strings of the same length (like `(a+b+...+q)+`) has high data cost.
|
||||
|
||||
The ensemble selects the grammar with the lowest total MDL.
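
As a rough illustration of the two terms, here is a minimal sketch for the two extreme grammar shapes mentioned above. It is not the code in `mdl.py`; the function names are hypothetical, and it assumes that |L(r)| at length n is 1 for a fixed concatenation and k^n for a k-way disjunction under `+`.

```python
import math


def mdl_fixed_sequence(grammar_symbols, sequences):
    """MDL of a fixed concatenation such as a.b.c, which accepts exactly one string."""
    model_cost = len(set(grammar_symbols))
    # |L(r)| = 1 at the only accepted length, so each sequence costs log2(1) = 0 bits.
    data_cost = sum(math.log2(1) for _ in sequences)
    return model_cost + data_cost


def mdl_free_disjunction(alphabet, sequences):
    """MDL of (a+b+...)+ over k symbols, which accepts k**n strings of length n."""
    k = len(set(alphabet))
    model_cost = k
    # log2(k**n) = n * log2(k), computed without building the huge integer.
    data_cost = sum(len(s) * math.log2(k) for s in sequences)
    return model_cost + data_cost


seqs = [["a", "b", "c"]] * 4
fixed = mdl_fixed_sequence(["a", "b", "c"], seqs)
free = mdl_free_disjunction(["a", "b", "c"], seqs)
print(fixed)         # 3.0
print(fixed < free)  # True
```

Both grammars use the same three symbols, so the model cost is identical; the difference comes entirely from the data cost, which grows with sequence length for the permissive grammar.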
|
||||
|
||||
## Grammar Notation
|
||||
|
||||
- `a.b` — `a` followed by `b` (concatenation)
|
||||
- `(a+b)` — either `a` or `b` (disjunction)
|
||||
- `r?` — zero or one (optional)
|
||||
- `r+` — one or more (iteration)
|
||||
- `r+?` — zero or more (varies across examples)
|
||||
- `(a|b)` — iDRegEx-style disjunction (equivalent to `(a+b)`)
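
To check a sequence against a grammar written in this notation, one option is to translate it into a Python regular expression. The sketch below is my reading of the notation, not the parser `bex` uses; `to_python_regex` and `matches` are hypothetical names, and it assumes disjunction only appears inside parentheses, as in the examples in this README.

```python
import re

# Symbols may contain letters, digits, and the punctuation seen in the examples
# (e.g. actions/checkout, run:echo, golangci-lint-action).
TOKEN = re.compile(r"[A-Za-z0-9_/:\-]+|[().+?|]")
OPERATORS = set("().+?|")


def to_python_regex(grammar):
    """Translate the notation into a Python regex over space-terminated symbols."""
    tokens = TOKEN.findall(grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == ".":
            pass  # concatenation is implicit in a regex
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "|", "?"):
            out.append(tok)
        elif tok == "+":
            if nxt == "?":
                out.append("*")  # r+? means zero or more
                i += 1           # skip the '?' that was just consumed
            elif nxt == "(" or (nxt is not None and nxt not in OPERATORS):
                out.append("|")  # infix '+' separates alternatives
            else:
                out.append("+")  # postfix '+' means one or more
        else:
            out.append("(?:" + re.escape(tok) + " )")
        i += 1
    return "".join(out)


def matches(grammar, sequence):
    """True if the whole symbol sequence is accepted by the grammar."""
    text = "".join(symbol + " " for symbol in sequence)
    return re.fullmatch(to_python_regex(grammar), text) is not None


compose = "(build+image).command.(environment+volumes)?.ports"
print(matches(compose, ["image", "command", "ports"]))  # True
print(matches(compose, ["ports", "image"]))             # False
```

Terminating every symbol with a space keeps one symbol from matching as a prefix of another (for example `file` versus `file_copy`).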
|
||||
|
||||
## Domain: Generic YAML
|
||||
|
||||
Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:
|
||||
|
||||
```python
|
||||
from bex.yaml_to_seq import collect_all_sequences
|
||||
from bex import infer_ensemble
|
||||
|
||||
results = collect_all_sequences('config_dir/')
|
||||
seqs = [seq for _, seq in results]
|
||||
result = infer_ensemble(seqs)
|
||||
print(result['best']['grammar'])
|
||||
```
|
||||
|
||||
## Papers
|
||||
|
||||
- **Bex et al.** *"Inference of Concise Regular Expressions and DTDs"* — TODS 2010
- **Bex et al.** *"Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data"* — arXiv:1004.2372
|
||||
|
||||
See `papers/` for extracted text and the original references.
|
||||
|
||||
## Tests
|
||||
|
||||
```bash
|
||||
python -m pytest tests/
|
||||
# or
|
||||
python tests/test_bex.py
|
||||
```
|
||||
|
||||
## License
|
||||
|
|
|
|||
115
SHOWCASE.md
|
|
@ -1,14 +1,9 @@
|
|||
# Grammar Inference Engine — Showcase
|
||||
|
||||
Infer the unwritten convention from existing examples. Given N example
|
||||
Infer the **unwritten convention** from existing examples. Given N example
|
||||
sequences, produce a ~100-char grammar that captures the structural
|
||||
pattern — in far fewer tokens than the originals.
|
||||
|
||||
## How it works
|
||||
|
||||
Your agent calls the MCP tool `infer_best_grammar` with a list of
|
||||
existing sequences. It returns a compressed grammar:
|
||||
|
||||
```
|
||||
a.b → a then b (concatenation)
|
||||
(a+b) → a or b (disjunction)
|
||||
|
|
@ -17,40 +12,100 @@ r+ → one or more (iteration)
|
|||
r+? → zero or more
|
||||
```
|
||||
|
||||
Use `prefer='crx'` for full coverage (accepts all examples), or let the
|
||||
ensemble pick between CRX and iDRegEx by MDL score.
|
||||
## 1. Ansible Galaxy roles (15 geerlingguy roles) — flagship
|
||||
|
||||
## Ansible Galaxy roles — 15 geerlingguy roles
|
||||
|
||||
Jeff Geerling maintains 100+ of the most popular Ansible roles on
|
||||
Galaxy. He has never written down their task structure. Our grammar is
|
||||
the first explicit description:
|
||||
15 popular Ansible roles by Jeff Geerling. There is NO written convention
|
||||
for the task structure. Our grammar is its first explicit description:
|
||||
|
||||
```
|
||||
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
|
||||
include+?.(npm+pip)+?.lineinfile?
|
||||
|
||||
CRX MDL= 596.64 match=15/15
|
||||
```
|
||||
|
||||
Every role follows the same arc: check prerequisites, OS-specific vars,
|
||||
install packages, configure with templates, start services, optionally
|
||||
run sub-tasks. It works because 15 roles all converged on the same
|
||||
unwritten convention.
|
||||
Every role: check preconditions → OS-specific vars → install packages →
|
||||
configure with templates → start services → optionally handle language tooling.
|
||||
|
||||
**Compression: 15 roles (~5,000 tokens) → 60 tokens.**
|
||||
All 15/15 match. **~29× compression** (7200+ modules → ~250 chars).
|
||||
|
||||
## Notation reference
|
||||
**Why it helps an LLM:** When generating a new Ansible role, the LLM knows the
exact structure: fail-check first, then vars, then packages, then config and services.
No guessing.
|
||||
|
||||
| Symbol | Meaning |
|--------|---------|
| `a.b` | a then b |
| `(a+b)` | a or b (CRX disjunction) |
| `(a\|b)` | a or b (iDRegEx disjunction) |
| `r?` | zero or one |
| `r+` | one or more |
| `r+?` | zero or more |
| `MDL` | Minimum Description Length — lower is better |
|
||||
## 2. Helm chart (kube-prometheus-stack, 6 configs)
|
||||
|
||||
6 different `values.yaml` files rendered through the same chart:
|
||||
|
||||
```
|
||||
Best: iDRegEx | MDL 1433
|
||||
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
```
|
||||
|
||||
The **minimal core** every config must deploy. CRX captures the full
|
||||
vocabulary (19 kinds). Which one an agent uses depends on the task:
|
||||
- Bootstrapping a new cluster: iDRegEx — what you can't skip
|
||||
- Writing a complete chart: CRX — everything you might need
|
||||
|
||||
## 3. Docker Compose (73 services, 10 projects)
|
||||
|
||||
Per-service key order across real-world compose files:
|
||||
|
||||
```
|
||||
Best: CRX | MDL varies by project
|
||||
Grammar: (build+image).command.(environment+volumes)?.ports
|
||||
```
|
||||
|
||||
Per-project patterns emerge:
|
||||
- **Nginx-like:** `build.(command.volumes.ports)`
|
||||
- **Databases:** `image.environment.volumes.ports`
|
||||
- **Language runtimes:** `build.(environment.command).ports`
|
||||
|
||||
**Why it helps an LLM:** The field order in service definitions follows
|
||||
an implicit convention. An agent generating compose files should put
|
||||
image/build first, then command, then environment/volumes, then ports.
|
||||
|
||||
## 4. GitHub Actions (cross-project Go lint, 6 jobs)
|
||||
|
||||
Lint jobs from prometheus, goreleaser, cosign, sigstore:
|
||||
|
||||
```
|
||||
Best: CRX | MDL 13.6
|
||||
Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.
|
||||
golangci/golangci-lint-action?.megalinter?
|
||||
```
|
||||
|
||||
The lint jobs in this sample follow the same order: checkout → setup Go → run linter.
Only the biggest add megalinter.
|
||||
|
||||
**Why it helps an LLM:** Starting a new Go project? The lint workflow
|
||||
has a near-universal pattern.
|
||||
|
||||
## 5. Terraform (8 AWS modules)
|
||||
|
||||
Terraform modules by hashicorp and terraform-aws-modules:
|
||||
|
||||
```
|
||||
Best: CRX | MDL 1876
|
||||
Grammar: null_resource?.s3_bucket...?.vpc?...(26+ types all optional)
|
||||
```
|
||||
|
||||
Every resource type is optional — VPC, S3, EC2, and security-group
|
||||
modules share no mandatory ordering. But the **vocabulary** is the signal:
|
||||
seeing `aws_vpc` implies subnets, route tables, internet gateways.
|
||||
|
||||
**Why it helps an LLM:** The grammar encodes which resources belong
|
||||
together in each module domain.
|
||||
|
||||
## What doesn't work
|
||||
|
||||
| Dataset | Problem |
|---------|---------|
| Dockerfiles | Too simple — just the Dockerfile spec |
| Pre-commit (cross-project) | 252 unique hooks, no common core |
| GHA per-project | One repo = too many job types |
| Prometheus rules | Schema-enforced, no convention |
|
||||
|
||||
Sweet spot: **multiple implementations of the same abstract task**
|
||||
with a shared but undocumented pattern.
|
||||
|
||||
## Usage
|
||||
|
||||
|
|
|
|||