Update README and SHOWCASE with real-world dataset evaluations

README:
- Replace outdated company benchmarks with public showcases
- Add Algorithm Selection Guide
- Add 'When each algorithm wins' table
- Add 'Why grammar inference?' table with value prop for LLMs
- Add 'What doesn't work' section documenting failed approaches
- Update all domain adapter examples with public results
- Clean up outdated references (companyweb roles, hashistack terraform)

SHOWCASE:
- Add Helm (kube-prometheus-stack) with iDRegEx minimal core
- Add Docker Compose per-project patterns
- Add GitHub Actions cross-project Go lint pattern
- Add Terraform modules with vocabulary analysis
- Add 'What doesn't work' section
- Explain WHY each dataset helps an LLM
tobjend 2026-07-01 10:04:10 +02:00
parent 0e2aec582b
commit 547376894c
2 changed files with 260 additions and 226 deletions

371
README.md

@ -23,78 +23,130 @@ print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```
Or compare algorithms manually:
## Why grammar inference?
```python
from bex.crx import CRX
There are many domains where developers follow **unwritten conventions** — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.
seqs = [...]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?
```
Grammar inference automatically discovers these conventions from examples.
## Algorithms
| Domain | Unwritten convention | What the grammar tells an LLM |
|--------|---------------------|-------------------------------|
| Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
| Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
| Docker Compose | `(build+image).command.(environment+volumes)?.ports` | "Every service needs either build or image, optionally a command, then environment/volumes/ports in that order." |
| GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. Add megalinter only for extra coverage." |
| Terraform modules | Everything is optional — but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |
| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures *all* symbols |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |
## Algorithm Selection Guide
### Pipeline 1: Direct CHARE Inference (fast)
| When | Use | Why |
|------|-----|-----|
| Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |
## Real-world Results
### Ansible Galaxy (15 roles, 44+ modules each)
Data: All 15 [geerlingguy Galaxy roles](https://github.com/geerlingguy) — nginx, php, mysql, docker, etc.
```
Example sequences → CRX → CHAREs grammar
Best: CRX (MDL 288, 15/15 match)
Grammar:
fail?.(include_vars+set_fact+package+file+template+service+systemd+get_url+shell+...)+
.include+?.(npm+pip)+?.lineinfile?
```
CRX learns a grammar that accepts *all* observed symbols, marking optional ones with `?`. Best when the data is clean and you want the full vocabulary.
Every single role follows this pattern. The convention was **unwritten** — no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."
### Pipeline 2: Probabilistic k-ORE Inference (robust)
An LLM generating a new role:
- **Must** start with conditional includes and variable setup
- **Should** then install packages and configure files
- **Then** start services
- **Finally** handle language-specific tooling (`npm`, `pip`)
**Compression:** The grammar is ~250 chars. The 15 examples are 7200+ modules combined. **~29× compression.**
### Helm (kube-prometheus-stack, 6 CI configs)
Data: 6 different `values.yaml` configurations rendered through `helm template`.
```
Example sequences → Complete k-OA → Baum-Welch (EM)
→ Disambiguate → Prune → rwr² → k-ORE grammar
Best: iDRegEx (MDL 1433)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
```
iDRegEx learns the *minimum* common subsequence — symbols that appear in every example. Fails (∅) when the examples are too diverse.
iDRegEx finds the **minimum core** — what every config always deploys. CRX captures the full vocabulary (19 resource kinds). Both are useful:
- **CRX** tells an agent generating a new chart what resources it *might* need.
- **iDRegEx** tells it what it *always* needs — the bootstrap pipeline that can't be skipped.
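To make the two views concrete, here is a minimal sketch that computes the full vocabulary (every symbol seen) and the shared core (symbols present in every sequence). It is a deliberately naive stand-in, not the CRX or iDRegEx algorithms; the function name `vocabulary_and_core` and the sample sequences are hypothetical.

```python
def vocabulary_and_core(seqs):
    """Return (vocabulary, core) for a list of symbol sequences.

    vocabulary: every symbol seen, in first-seen order ("might need").
    core: the symbols that occur in every sequence ("always need").
    """
    vocabulary = []
    for seq in seqs:
        for symbol in seq:
            if symbol not in vocabulary:
                vocabulary.append(symbol)
    core = [s for s in vocabulary if all(s in seq for seq in seqs)]
    return vocabulary, core


# Illustrative input only, not taken from the kube-prometheus-stack run.
example = [
    ["ServiceAccount", "ClusterRole", "Service", "Deployment"],
    ["ServiceAccount", "ConfigMap", "Service", "Deployment"],
]
vocabulary, core = vocabulary_and_core(example)
print("vocabulary:", vocabulary)
print("core:", core)
```

The real algorithms also infer ordering and repetition; this sketch only shows why the two grammars differ so much in size.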
### Pipeline 3: Ensemble (recommended)
### Docker Compose (73 services across 10 projects)
Data: Per-service sections from multiple `docker-compose.yml` files.
Per-service convention:
```
Example sequences → [CRX, iDRegEx] → MDL score each → pick best
(build+image).command.(environment+volumes)?.ports
```
Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like `(a+b+c+...+z)+` that accepts everything gets a high data cost (`log2(|L(r)|)` is large), while a specific grammar like `a.b.c` has near-zero data cost.
Each project has its own sub-patterns:
- **Nginx-like projects:** `build.(command.volumes.ports)` — build from source, mount configs, expose ports
- **Database projects:** `image.environment.volumes.ports` — pull image, configure with env vars, persist data
- **Language runtimes:** `build.(environment.command).ports` — build, set env vars, override command
## Architecture
An LLM generating a Docker Compose file should structure service definitions in this order.
### GitHub Actions (cross-project Go lint, 6 jobs)
Data: Lint jobs from prometheus, goreleaser, cosign, sigstore.
```
bex/
├── crx.py # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py # SOA: Symbolic Observation Automaton core
├── koa.py # k-OA: k-testable Observation Automaton
├── ikoa.py # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py # Baum-Welch EM training for k-OA
├── expr.py # Expression utilities (concat, disj, star, strip)
├── marking.py # State marking for determinism
├── yaml_to_seq.py # Generic YAML → key-path sequence converter
├── role_grammar.py # Ansible role → module-sequence extractor
├── ensemble.py # Ensemble: runs CRX + iDRegEx, picks best by MDL
├── mdl.py # MDL scoring for grammar selection (fix)
├── mcp_server.py # MCP server exposing 4 tools
└── ...
Best: CRX (MDL 13.6)
Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?
```
Every Go project's lint CI follows: checkout → setup Go → run golangci-lint. Only the biggest projects add megalinter.
### Terraform (8 AWS modules, 156+ resources each)
Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules.
```
Best: CRX (MDL 1876)
Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?.(...) ...
```
Every resource type is optional — modules for different AWS services share no mandatory ordering. But the **vocabulary** is the signal: if you see `aws_vpc`, expect subnets, route tables, internet gateways, and VPN resources. The grammar encodes the resource catalogue of each module domain.
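One way to read "the vocabulary is the signal" is as a co-occurrence count: for a given resource type, which other types appear in the same module? The sketch below assumes each module has already been reduced to a list of resource types; the helper name `cooccurring` and the sample data are hypothetical, not results from the eight modules.

```python
from collections import Counter


def cooccurring(modules, anchor):
    """Count how often each resource type shares a module with `anchor`."""
    counts = Counter()
    for resources in modules:
        kinds = set(resources)
        if anchor in kinds:
            counts.update(kinds - {anchor})
    return counts


# Illustrative data only.
modules = [
    ["aws_vpc", "aws_subnet", "aws_route_table"],
    ["aws_s3_bucket", "aws_s3_bucket_policy"],
    ["aws_vpc", "aws_subnet", "aws_internet_gateway"],
]
print(cooccurring(modules, "aws_vpc").most_common())
```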
### What doesn't work
Not every domain has an unwritten convention. Grammar inference failed (produced trivial `(a+b+c+...)+` grammars) on:
- **Dockerfiles** — too simple (`FROM → RUN → COPY → CMD` is just the Dockerfile spec)
- **Pre-commit configs** (cross-project) — 252 unique hook IDs, no common core
- **GitHub Actions per-project** — too many different job types (build, lint, release, security) in one repo
- **Prometheus recording rules** — schema-enforced structure, no convention to discover
The sweet spot: **multiple implementations of the same abstract task** (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.
## When each algorithm wins
| Data property | Winner | Why |
|---------------|--------|-----|
| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
| Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
| Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
| 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
## MCP Server
A **Model Context Protocol** server exposes all algorithms and domain adapters as tools:
A **Model Context Protocol** server exposes all algorithms and domain adapters:
```bash
python -m bex.mcp_server
@ -105,94 +157,14 @@ python -m bex.mcp_server
| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a `Why:` explanation. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | Generic YAML → key-paths → grammar |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both, picks best by MDL. `prefer='crx'` or `prefer='idregex'` to skip comparison. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
### Using `infer_best_grammar`
The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass `prefer`:
```
User: Run CRX on our deploy tasks.
Agent: [runs with prefer='crx']
Best: CRX (MDL 7.0)
Grammar: file.template.docker_image.command.set_fact.shell.wait_for?
CRX MDL= 7.00 file.template.docker_image.command.set_fact.shell.wait_for?
Why: Requested CRX only.
```
Without `prefer`, the ensemble compares both:
```
User: Find the grammar for our Helm chart.
Agent: [runs]
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
```
Both grammars are correct — they operate at different levels of specificity. The `Why:` field helps the agent decide which one to use for the task at hand.
## Ensemble Selection
The `infer_best_grammar` tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.
### How MDL scoring works
```
MDL = model_cost + data_cost
```
- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A grammar that accepts *many* strings of the same length (like a 17-way disjunction `(a+b+...+q)+`) has high data cost because `|L(r)|` is large. A specific, fixed sequence (`a.b.c.d.e`) has `|L(r)| = 1` so data cost is zero.
The ensemble selects the grammar with the lowest total MDL. This automatically picks the right level of specificity for the data.
### When each algorithm wins
| Scenario | Winner | Why |
|----------|--------|-----|
| Many sequences, diverse patterns | **CRX** | CRX captures the full vocabulary. iDRegEx can't find a common core. |
| Clean, structured sequences | **CRX** | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
| Few sequences (2–3) | **iDRegEx** | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
| Sequences share a clear core | **iDRegEx** | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
| Single sequence | **iDRegEx** (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |
### Real-world benchmarks
Results from three domains using the ensemble (fixed MDL scoring):
```
Dataset Best MDL Matches
──────────────────────────────────────────────────────────
Helm (prom-stack) iDRegEx 1433.0 1/6
Ansible (deploy) CRX 246.1 34/36
Ansible (validate) CRX 34.0 5/5
Ansible (restore) CRX 24.0 2/2
Ansible (manage) iDRegEx 25.0 1/2
Ansible (configure) iDRegEx 22.5 1/4
Terraform (hashistack) CRX 4.0 9/9
```
Note: MDL scores are not comparable across datasets — only within the same run
(CRX vs iDRegEx on the same sequences). The Helm score is higher because
each sequence is ~120 symbols long, making the data cost term dominant for
the overly-general CRX grammar (19 kinds × many lengths).
## Domain Adapters
### Ansible Roles
Extracts module names from `tasks/main.yml`, groups by category prefix (e.g., `deploy_foo` → `deploy`), and learns per-category grammars:
```python
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences
@ -200,36 +172,23 @@ from bex.role_grammar import collect_all_role_sequences
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
seqs = [s for _, s in items]
if len(seqs) >= 2:
result = infer_ensemble(seqs)
print(f"── {cat} ({len(items)} roles) ──")
print(f" Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f" Grammar: {result['best']['grammar']}")
print(f" Why: {result['why']}")
result = infer_ensemble(seqs)
print(f"── {cat} ({len(items)} roles) ──")
print(f" Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f" Grammar: {result['best']['grammar']}")
```
**Example output** (from [companyweb](https://github.com/anomalyco/companyweb), 51 roles):
**Example** (15 geerlingguy Galaxy roles):
```
── restore (2 roles) ──
Best: CRX (MDL 24.0)
Grammar: file.copy.unarchive+.command
Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.
── validate (5 roles) ──
Best: CRX (MDL 34.0)
Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.
── configure (4 roles) ──
Best: iDRegEx (MDL 22.5)
Grammar: include_role
Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
── other (15 roles) ──
Best: CRX (MDL 288, 15/15 match)
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?
Why: CRX matches 15/15 sequences, iDRegEx matches 3/15. CRX selected.
```
### Helm Charts
Renders a Helm chart with different values files and extracts Kubernetes `kind` sequences for grammar inference:
```python
import subprocess, yaml
from bex.ensemble import infer_ensemble
@ -240,46 +199,31 @@ for vf in sorted(Path('ci/').glob('*-values.yaml')):
['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
capture_output=True, text=True, timeout=120,
)
if out.returncode == 0:
kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
if d and isinstance(d, dict) and 'kind' in d]
if kinds:
seqs.append(kinds)
kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
if d and isinstance(d, dict) and 'kind' in d]
if kinds:
seqs.append(kinds)
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
print(f"Why: {result['why']}")
```
**Example output** (from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), 6 CI configs):
**Example** (kube-prometheus-stack, 6 CI configs):
```
Best: iDRegEx (MDL 1432.99)
Best: iDRegEx (MDL 1433)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?
CRX MDL= 2651.74 (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
iDRegEx selected (MDL score 1433.0).
```
CRX captures *all* symbols that appear. iDRegEx finds only the minimal core that every config shares:
```
ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
```
Which grammar is more useful depends on the task:
- **CRX** tells you everything you *might* need — good for an agent generating a complete chart.
- **iDRegEx** tells you what you *always* need — the bootstrap pipeline that can't be skipped.
Use `prefer='crx'` or `prefer='idregex'` to select an algorithm without the ensemble comparison:
### Terraform
Parses `.tf` files to extract `resource` type sequences, per-file or per-directory:
```python
import re
from bex.ensemble import infer_ensemble
@ -295,47 +239,82 @@ print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})"
print(f"Grammar: {result['best']['grammar']}")
```
**Example output** (from [terraform-guides](https://github.com/hashicorp/terraform-guides), hashistack example, 9 files):
**Example** (8 terraform-aws-* modules):
```
Best: CRX (MDL 4.0, 9/9 match)
Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
Best: CRX (MDL 1876)
Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?....
Why: CRX matches 8/8 sequences. iDRegEx returned ∅ (no common core across modules).
```
**Grammar notation:**
### Docker Compose
```python
import yaml
from pathlib import Path
from bex.ensemble import infer_ensemble

seqs = []
for dc_file in Path('.').glob('**/docker-compose*.yml'):
    # An empty file parses to None, so fall back to an empty mapping.
    data = yaml.safe_load(dc_file.read_text()) or {}
    for svc, config in (data.get('services') or {}).items():
        # One sequence per service: its top-level keys in file order.
        keys = list(config or {})
        if keys:
            seqs.append(keys)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```
### GitHub Actions
```python
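import yaml
from pathlib import Path
from bex.ensemble import infer_ensemble


def step_symbol(step):
    """Map a step to a symbol: the action name, or 'run:' plus the command's first word."""
    if 'uses' in step:
        return step['uses']
    words = str(step['run']).split()
    return 'run:' + (words[0] if words else '')


seqs = []
for wf_file in Path('.github/workflows/').glob('*.yml'):
    data = yaml.safe_load(wf_file.read_text()) or {}
    for job in data.get('jobs', {}).values():
        if 'steps' not in job:
            continue
        seq = [step_symbol(s) for s in job['steps'] if 'uses' in s or 'run' in s]
        if seq:
            seqs.append(seq)

result = infer_ensemble(seqs)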
```
## How MDL scoring works
```
MDL = model_cost + data_cost
```
- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A specific fixed sequence (`a.b.c.d.e`) has data cost zero because |L(r)| = 1. A grammar that accepts *many* strings of the same length (like `(a+b+...+q)+`) has high data cost.
The ensemble selects the grammar with the lowest total MDL.
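As a worked illustration of the formula, the sketch below scores two toy grammars for which |L(r)| at a given length has a closed form: a fixed sequence (exactly one string) and a k-way disjunction under `+` (k^n strings of length n). The helper `mdl_score` is hypothetical and is not the code in `bex/mdl.py`.

```python
import math


def mdl_score(num_symbols, language_size_at, seqs):
    """MDL = model_cost + data_cost, following the definition above.

    num_symbols: unique alphabet symbols in the grammar (model cost).
    language_size_at: function n -> number of strings of length n the grammar accepts.
    """
    model_cost = num_symbols
    data_cost = sum(math.log2(language_size_at(len(s))) for s in seqs)
    return model_cost + data_cost


# Three identical example sequences of length 5 (illustrative only).
seqs = [list("abcde")] * 3

# a.b.c.d.e accepts exactly one string of length 5.
fixed = mdl_score(5, lambda n: 1, seqs)
# (a+b+c+d+e)+ accepts 5**n strings of length n.
loose = mdl_score(5, lambda n: 5 ** n, seqs)

print(f"a.b.c.d.e     MDL = {fixed:.2f}")
print(f"(a+b+c+d+e)+  MDL = {loose:.2f}")
```

Both grammars have the same model cost, so the difference comes entirely from the data cost: the disjunction pays 5·log₂(5) bits per sequence, the fixed sequence pays nothing.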
## Grammar Notation
- `a.b` — `a` followed by `b` (concatenation)
- `(a+b)` — either `a` or `b` (disjunction)
- `r?` — zero or one (optional)
- `r+` — one or more (iteration)
- `r+?` — zero or more (varies across examples)
- `(a|b)` — iDRegEx-style disjunction (equivalent to `(a+b)`)
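To check whether a sequence matches a grammar in this notation, one option is to translate it into a Python regular expression. The sketch below is an assumption-laden reading of the notation, not part of `bex`: it treats `+` as iteration when it is followed by `.`, `)`, `|`, `?`, or the end of the string, and as disjunction otherwise. It also assumes that disjunctions are always parenthesized and that symbols contain no spaces or operator characters. The names `grammar_to_regex` and `accepts` are hypothetical.

```python
import re

OPERATORS = '.()+?|'


def grammar_to_regex(grammar):
    """Translate the notation into a regex over space-terminated symbols."""
    parts, i, n = [], 0, len(grammar)
    while i < n:
        c = grammar[i]
        nxt = grammar[i + 1] if i + 1 < n else ''
        if c == '.':                                    # concatenation is implicit
            i += 1
        elif c == '(':
            parts.append('(?:')
            i += 1
        elif c in ')?|':
            parts.append(c)
            i += 1
        elif c == '+' and nxt == '?':                   # r+? : zero or more
            parts.append('*')
            i += 2
        elif c == '+' and nxt in ('', '.', ')', '|'):   # r+ : one or more
            parts.append('+')
            i += 1
        elif c == '+':                                  # (a+b) : disjunction
            parts.append('|')
            i += 1
        else:                                           # a symbol, up to the next operator
            j = i
            while j < n and grammar[j] not in OPERATORS:
                j += 1
            parts.append('(?:' + re.escape(grammar[i:j]) + ' )')
            i = j
    return re.compile(''.join(parts))


def accepts(grammar, seq):
    """True if the whole symbol sequence matches the grammar."""
    text = ''.join(symbol + ' ' for symbol in seq)
    return grammar_to_regex(grammar).fullmatch(text) is not None


compose = '(build+image).command.(environment+volumes)?.ports'
print(accepts(compose, ['image', 'command', 'ports']))
print(accepts(compose, ['image', 'ports']))
```

Each symbol is matched together with a trailing space, so a symbol such as `include` cannot accidentally match the prefix of `include_vars`.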
## Domain: Generic YAML
Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:
```python
from bex.yaml_to_seq import collect_all_sequences
from bex import infer_ensemble
results = collect_all_sequences('config_dir/')
seqs = [seq for _, seq in results]
result = infer_ensemble(seqs)
print(result['best']['grammar'])
```
## Papers
- **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"* — TODS 2010
- **Bex et al.** *"Inferring k-optimal REs from Positive Data"* — arXiv:1004.2372
See `papers/` for extracted text and the original references.
## Tests
```bash
python -m pytest tests/
# or
python tests/test_bex.py
```
## License


@ -1,14 +1,9 @@
# Grammar Inference Engine — Showcase
Infer the unwritten convention from existing examples. Given N example
Infer the **unwritten convention** from existing examples. Given N example
sequences, produce a ~100-char grammar that captures the structural
pattern — in far fewer tokens than the originals.
## How it works
Your agent calls the MCP tool `infer_best_grammar` with a list of
existing sequences. It returns a compressed grammar:
```
a.b → a then b (concatenation)
(a+b) → a or b (disjunction)
@ -17,40 +12,100 @@ r+ → one or more (iteration)
r+? → zero or more
```
Use `prefer='crx'` for full coverage (accepts all examples), or let the
ensemble pick between CRX and iDRegEx by MDL score.
## 1. Ansible Galaxy roles (15 geerlingguy roles) — flagship
## Ansible Galaxy roles — 15 geerlingguy roles
Jeff Geerling maintains 100+ of the most popular Ansible roles on
Galaxy. He has never written down their task structure. Our grammar is
the first explicit description:
15 popular Ansible roles by Jeff Geerling. There is NO written convention
for the task structure. Our grammar is its first explicit description:
```
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
include+?.(npm+pip)+?.lineinfile?
CRX MDL= 596.64 match=15/15
```
Every role follows the same arc: check prerequisites, OS-specific vars,
install packages, configure with templates, start services, optionally
run sub-tasks. It works because 15 roles all converged on the same
unwritten convention.
Every role: check preconditions → OS-specific vars → install packages →
configure with templates → start services → optionally handle language tooling.
**Compression: 15 roles (~5,000 tokens) → 60 tokens.**
All 15/15 match. **~29× compression** (7200+ modules → ~250 chars).
## Notation reference
**Why it helps an LLM:** When generating a new Ansible role, the LLM knows the
exact structure: fail-check first, then vars, then packages, then
configuration and services. No guessing.
| Symbol | Meaning |
|--------|---------|
| `a.b` | a then b |
| `(a+b)` | a or b (CRX disjunction) |
| `(a\|b)` | a or b (iDRegEx disjunction) |
| `r?` | zero or one |
| `r+` | one or more |
| `r+?` | zero or more |
| `MDL` | Minimum Description Length — lower is better |
## 2. Helm chart (kube-prometheus-stack, 6 configs)
6 different `values.yaml` files rendered through the same chart:
```
Best: iDRegEx | MDL 1433
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
```
The **minimal core** every config must deploy. CRX captures the full
vocabulary (19 kinds). Which one an agent uses depends on the task:
- Bootstrapping a new cluster: iDRegEx — what you can't skip
- Writing a complete chart: CRX — everything you might need
## 3. Docker Compose (73 services, 10 projects)
Per-service key order across real-world compose files:
```
Best: CRX | MDL varies by project
Grammar: (build+image).command.(environment+volumes)?.ports
```
Per-project patterns emerge:
- **Nginx-like:** `build.(command.volumes.ports)`
- **Databases:** `image.environment.volumes.ports`
- **Language runtimes:** `build.(environment.command).ports`
**Why it helps an LLM:** The field order in service definitions follows
an implicit convention. An agent generating compose files should put
image/build first, then command, then environment/volumes, then ports.
## 4. GitHub Actions (cross-project Go lint, 6 jobs)
Lint jobs from prometheus, goreleaser, cosign, sigstore:
```
Best: CRX | MDL 13.6
Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.
golangci/golangci-lint-action?.megalinter?
```
Every Go project's lint CI follows: checkout → setup Go → run linter.
Only the biggest add megalinter.
**Why it helps an LLM:** Starting a new Go project? The lint workflow
has a near-universal pattern.
## 5. Terraform (8 AWS modules)
Terraform modules by hashicorp and terraform-aws-modules:
```
Best: CRX | MDL 1876
Grammar: null_resource?.s3_bucket...?.vpc?...(26+ types all optional)
```
Every resource type is optional — VPC, S3, EC2, and security-group
modules share no mandatory ordering. But the **vocabulary** is the signal:
seeing `aws_vpc` implies subnets, route tables, internet gateways.
**Why it helps an LLM:** The grammar encodes which resources belong
together in each module domain.
## What doesn't work
| Dataset | Problem |
|---------|---------|
| Dockerfiles | Too simple — just the Dockerfile spec |
| Pre-commit (cross-project) | 252 unique hooks, no common core |
| GHA per-project | One repo = too many job types |
| Prometheus rules | Schema-enforced, no convention |
Sweet spot: **multiple implementations of the same abstract task**
with a shared but undocumented pattern.
## Usage