diff --git a/README.md b/README.md
index 12cb570..e0f5340 100644
--- a/README.md
+++ b/README.md
@@ -23,78 +23,130 @@ print(f"Grammar: {result['best']['grammar']}")
 print(f"Score: {result['best']['mdl_score']}")
 ```
 
-Or compare algorithms manually:
+## Why grammar inference?
 
-```python
-from bex.crx import CRX
+There are many domains where developers follow **unwritten conventions** — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.
 
-seqs = [...]
-crx = CRX()
-grammar = crx.infer(seqs)
-print(grammar)
-# file.template.docker_image.command.set_fact.shell.(wait_for)?
-```
+Grammar inference automatically discovers these conventions from examples.
 
-## Algorithms
+| Domain | Unwritten convention | What the grammar tells an LLM |
+|--------|---------------------|-------------------------------|
+| Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
+| Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
+| Docker Compose | `(build+image).command.(environment+volumes)?.ports` | "Every service needs either build or image, optionally a command, then environment/volumes/ports in that order." |
+| GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. Only megalinter for extra coverage." |
+| Terraform modules | Everything is optional — but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |
 
-| Algorithm | What it learns | Paper | Use case |
-|-----------|---------------|-------|----------|
-| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures *all* symbols |
-| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
-| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
-| **rwr²** | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |
+## Algorithm Selection Guide
 
-### Pipeline 1: Direct CHARE Inference (fast)
+| When | Use | Why |
+|------|-----|-----|
+| Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
+| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
+| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
+| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |
+
+## Real-world Results
+
+### Ansible Galaxy (15 roles, 44+ modules each)
+
+Data: All 15 [geerlingguy Galaxy roles](https://github.com/geerlingguy) — nginx, php, mysql, docker, etc.
 
 ```
-Example sequences → CRX → CHAREs grammar
+Best: CRX (MDL 288, 15/15 match)
+Grammar:
+  fail?.(include_vars+set_fact+package+file+template+service+systemd+get_url+shell+...)+
+  .include+?.(npm+pip)+?.lineinfile?
 ```
 
-CRX learns a grammar that accepts *all* observed symbols, marking optional ones with `?`. Best when the data is clean and you want the full vocabulary.
+Every single role follows this pattern. The convention was **unwritten** — no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."
 
-### Pipeline 2: Probabilistic k-ORE Inference (robust)
+An LLM generating a new role:
+- **Must** start with conditional includes and variable setup
+- **Should** then install packages and configure files
+- **Then** start services
+- **Finally** include handling of language-specific tooling
+
+**Compression:** The grammar is ~250 chars. The 15 examples are 7200+ modules combined. **~29× compression.**
+
+### Helm (kube-prometheus-stack, 6 CI configs)
+
+Data: 6 different `values.yaml` configurations rendered through `helm template`.
 
 ```
-Example sequences → Complete k-OA → Baum-Welch (EM)
-                  → Disambiguate → Prune → rwr² → k-ORE grammar
+Best: iDRegEx (MDL 1433)
+Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
+
+  iDRegEx  MDL= 1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
+  CRX      MDL= 2651.74  (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
 ```
 
-iDRegEx learns the *minimum* common subsequence — symbols that appear in every example. Fails (∅) when the examples are too diverse.
+iDRegEx finds the **minimum core** — what every config always deploys. CRX captures the full vocabulary (19 resource kinds). Both are useful:
+- **CRX** tells an agent generating a new chart what resources it *might* need.
+- **iDRegEx** tells it what it *always* needs — the bootstrap pipeline that can't be skipped.
 
-### Pipeline 3: Ensemble (recommended)
+### Docker Compose (73 services across 10 projects)
 
+Data: Per-service sections from multiple `docker-compose.yml` files.
+
+Per-service convention:
 ```
-Example sequences → [CRX, iDRegEx] → MDL score each → pick best
+(build+image).command.(environment+volumes)?.ports
 ```
 
-Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like `(a+b+c+...+z)+` that accepts everything gets a high data cost (`log2(|L(r)|)` is large), while a specific grammar like `a.b.c` has near-zero data cost.
+Each project has its own sub-patterns:
+- **Nginx-like projects:** `build.(command.volumes.ports)` — build from source, mount configs, expose ports
+- **Database projects:** `image.environment.volumes.ports` — pull image, configure with env vars, persist data
+- **Language runtimes:** `build.(environment.command).ports` — build, set env vars, override command
 
-## Architecture
+An LLM generating a Docker Compose file should structure service definitions in this order.
+
+### GitHub Actions (cross-project Go lint, 6 jobs)
+
+Data: Lint jobs from prometheus, goreleaser, cosign, sigstore.
 
 ```
-bex/
-├── crx.py            # CRX: direct CHARE inference (Algorithm 7, TODS)
-├── idregex.py        # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
-├── rwr0.py           # RWR₀: SORE repair (Algorithm 6, TODS)
-├── rwrsq.py          # rwr²: k-ORE extraction (Algorithm 3, arXiv)
-├── soa.py            # SOA: Symbolic Observation Automaton core
-├── koa.py            # k-OA: k-testable Observation Automaton
-├── ikoa.py           # iKoa: k-OA inference (Algorithm 1, arXiv)
-├── twotinf.py        # 2T-INF: 2-testable inference (Algorithm 1, TODS)
-├── baum_welch.py     # Baum-Welch EM training for k-OA
-├── expr.py           # Expression utilities (concat, disj, star, strip)
-├── marking.py        # State marking for determinism
-├── yaml_to_seq.py    # Generic YAML → key-path sequence converter
-├── role_grammar.py   # Ansible role → module-sequence extractor
-├── ensemble.py       # Ensemble: runs CRX + iDRegEx, picks best by MDL
-├── mdl.py            # MDL scoring for grammar selection (fix)
-├── mcp_server.py     # MCP server exposing 4 tools
-└── ...
+Best: CRX (MDL 13.6)
+Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?
 ```
 
+Every Go project's lint CI follows: checkout → setup Go → run golangci-lint. Only the biggest projects add megalinter.
+
+### Terraform (8 AWS modules, 156+ resources each)
+
+Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules.
+
+```
+Best: CRX (MDL 1876)
+Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?.(...) ...
+```
+
+Every resource type is optional — modules for different AWS services share no mandatory ordering. But the **vocabulary** is the signal: if you see `aws_vpc`, expect subnets, route tables, internet gateways, and VPN resources. The grammar encodes the resource catalogue of each module domain.
+
+### What doesn't work
+
+Not every domain has an unwritten convention. Grammar inference failed (produced trivial `(a+b+c+...)+` grammars) on:
+
+- **Dockerfiles** — too simple (`FROM → RUN → COPY → CMD` is just the Dockerfile spec)
+- **Pre-commit configs** (cross-project) — 252 unique hook IDs, no common core
+- **GitHub Actions per-project** — too many different job types (build, lint, release, security) in one repo
+- **Prometheus recording rules** — schema-enforced structure, no convention to discover
+
+The sweet spot: **multiple implementations of the same abstract task** (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.
+
+## When each algorithm wins
+
+| Data property | Winner | Why |
+|---------------|--------|-----|
+| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
+| Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
+| Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
+| 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
+| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
+
 ## MCP Server
 
-A **Model Context Protocol** server exposes all algorithms and domain adapters as tools:
+A **Model Context Protocol** server exposes all algorithms and domain adapters:
 
 ```bash
 python -m bex.mcp_server
@@ -105,94 +157,14 @@ python -m bex.mcp_server
 | Tool | What it does |
 |------|-------------|
 | `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
-| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a `Why:` explanation. |
-| `infer_yaml_grammar(yaml_dir, pattern, method)` | Generic YAML → key-paths → grammar |
+| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both, picks best by MDL. `prefer='crx'` or `prefer='idregex'` to skip comparison. |
+| `infer_yaml_grammar(yaml_dir, pattern, method)` | YAML → key-paths → grammar |
 | `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
 
-### Using `infer_best_grammar`
-
-The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass `prefer`:
-
-```
-User: Run CRX on our deploy tasks.
-Agent: [runs with prefer='crx']
-Best: CRX (MDL 7.0)
-Grammar: file.template.docker_image.command.set_fact.shell.wait_for?
-
-  CRX  MDL= 7.00  file.template.docker_image.command.set_fact.shell.wait_for?
-
-Why: Requested CRX only.
-```
-
-Without `prefer`, the ensemble compares both:
-
-```
-User: Find the grammar for our Helm chart.
-Agent: [runs]
-Best: iDRegEx (MDL 1432.99)
-Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
-
-  iDRegEx  MDL= 1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
-  CRX      MDL= 2651.74  (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?
-
-Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
-iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
-```
-
-Both grammars are correct — they operate at different levels of specificity. The `Why:` field helps the agent decide which one to use for the task at hand.
-
-## Ensemble Selection
-
-The `infer_best_grammar` tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.
-
-### How MDL scoring works
-
-```
-MDL = model_cost + data_cost
-```
-
-- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
-- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A grammar that accepts *many* strings of the same length (like a 17-way disjunction `(a+b+...+q)+`) has high data cost because `|L(r)|` is large. A specific, fixed sequence (`a.b.c.d.e`) has `|L(r)| = 1` so data cost is zero.
-
-The ensemble selects the grammar with the lowest total MDL. This automatically picks the right level of specificity for the data.
-
-### When each algorithm wins
-
-| Scenario | Winner | Why |
-|----------|--------|-----|
-| Many sequences, diverse patterns | **CRX** | CRX captures the full vocabulary. iDRegEx can't find a common core. |
-| Clean, structured sequences | **CRX** | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
-| Few sequences (2–3) | **iDRegEx** | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
-| Sequences share a clear core | **iDRegEx** | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
-| Single sequence | **iDRegEx** (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |
-
-### Real-world benchmarks
-
-Results from three domains using the ensemble (fixed MDL scoring):
-
-```
-Dataset                  Best      MDL      Matches
-──────────────────────────────────────────────────────────
-Helm (prom-stack)        iDRegEx   1433.0   1/6
-Ansible (deploy)         CRX       246.1    34/36
-Ansible (validate)       CRX       34.0     5/5
-Ansible (restore)        CRX       24.0     2/2
-Ansible (manage)         iDRegEx   25.0     1/2
-Ansible (configure)      iDRegEx   22.5     1/4
-Terraform (hashistack)   CRX       4.0      9/9
-```
-
-Note: MDL scores are not comparable across datasets — only within the same run
-(CRX vs iDRegEx on the same sequences). The Helm score is higher because
-each sequence is ~120 symbols long, making the data cost term dominant for
-the overly-general CRX grammar (19 kinds × many lengths).
-
 ## Domain Adapters
 
 ### Ansible Roles
 
-Extracts module names from `tasks/main.yml`, groups by category prefix (e.g., `deploy_foo` → `deploy`), and learns per-category grammars:
-
 ```python
 from bex.ensemble import infer_ensemble
 from bex.role_grammar import collect_all_role_sequences
@@ -200,36 +172,23 @@ from bex.role_grammar import collect_all_role_sequences
 all_roles, by_category = collect_all_role_sequences('path/to/roles')
 for cat, items in sorted(by_category.items()):
     seqs = [s for _, s in items]
-    if len(seqs) >= 2:
-        result = infer_ensemble(seqs)
-        print(f"── {cat} ({len(items)} roles) ──")
-        print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
-        print(f"  Grammar: {result['best']['grammar']}")
-        print(f"  Why: {result['why']}")
+    result = infer_ensemble(seqs)
+    print(f"── {cat} ({len(items)} roles) ──")
+    print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
+    print(f"  Grammar: {result['best']['grammar']}")
 ```
 
-**Example output** (from [companyweb](https://github.com/anomalyco/companyweb), 51 roles):
+**Example** (15 geerlingguy Galaxy roles):
+
 ```
-── restore (2 roles) ──
-  Best: CRX (MDL 24.0)
-  Grammar: file.copy.unarchive+.command
-  Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.
-
-── validate (5 roles) ──
-  Best: CRX (MDL 34.0)
-  Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
-  Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.
-
-── configure (4 roles) ──
-  Best: iDRegEx (MDL 22.5)
-  Grammar: include_role
-  Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
+── other (15 roles) ──
+  Best: CRX (MDL 288, 15/15 match)
+  Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?
+  Why: CRX matches 15/15 sequences, iDRegEx matches 3/15. CRX selected.
 ```
 
 ### Helm Charts
 
-Renders a Helm chart with different values files and extracts Kubernetes `kind` sequences for grammar inference:
-
 ```python
 import subprocess, yaml
 from bex.ensemble import infer_ensemble
@@ -240,46 +199,31 @@ for vf in sorted(Path('ci/').glob('*-values.yaml')):
         ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
         capture_output=True, text=True, timeout=120,
     )
-    if out.returncode == 0:
-        kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
-                 if d and isinstance(d, dict) and 'kind' in d]
-        if kinds:
-            seqs.append(kinds)
+    kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
+             if d and isinstance(d, dict) and 'kind' in d]
+    if kinds:
+        seqs.append(kinds)
 
 result = infer_ensemble(seqs)
 print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
 print(f"Grammar: {result['best']['grammar']}")
-print(f"Why: {result['why']}")
 ```
 
-**Example output** (from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), 6 CI configs):
+**Example** (kube-prometheus-stack, 6 CI configs):
 
 ```
-Best: iDRegEx (MDL 1432.99)
+Best: iDRegEx (MDL 1433)
 Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
 
   iDRegEx  MDL= 1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
-  CRX      MDL= 2651.74  (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?
+  CRX      MDL= 2651.74  (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
 
 Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6,
 iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
 ```
 
-CRX captures *all* symbols that appear. iDRegEx finds only the minimal core that every config shares:
-```
-ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
-```
-
-Which grammar is more useful depends on the task:
-- **CRX** tells you everything you *might* need — good for an agent generating a complete chart.
-- **iDRegEx** tells you what you *always* need — the bootstrap pipeline that can't be skipped.
-
-Use `prefer='crx'` or `prefer='idregex'` to select an algorithm without the ensemble comparison:
-
 ### Terraform
 
-Parses `.tf` files to extract `resource` type sequences, per-file or per-directory:
-
 ```python
 import re
 from bex.ensemble import infer_ensemble
@@ -295,47 +239,82 @@ print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})"
 print(f"Grammar: {result['best']['grammar']}")
 ```
 
-**Example output** (from [terraform-guides](https://github.com/hashicorp/terraform-guides), hashistack example, 9 files):
+**Example** (8 terraform-aws-* modules):
+
 ```
-Best: CRX (MDL 4.0, 9/9 match)
-Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
+Best: CRX (MDL 1876)
+Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?....
+Why: CRX matches 8/8 sequences. iDRegEx returned ∅ (no common core across modules).
 ```
 
-**Grammar notation:**
+### Docker Compose
+
+```python
+import yaml
+from pathlib import Path
+from bex.ensemble import infer_ensemble
+
+seqs = []
+for dc_file in Path('.').glob('**/docker-compose*.yml'):
+    data = yaml.safe_load(dc_file.read_text())
+    for svc, config in data.get('services', {}).items():
+        keys = list(config.keys())
+        if keys:
+            seqs.append(keys)
+
+result = infer_ensemble(seqs)
+print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
+print(f"Grammar: {result['best']['grammar']}")
+```
+
+### GitHub Actions
+
+```python
+import yaml
+from pathlib import Path
+from bex.ensemble import infer_ensemble
+
+seqs = []
+for wf_file in Path('.github/workflows/').glob('*.yml'):
+    data = yaml.safe_load(wf_file.read_text())
+    for job in data.get('jobs', {}).values():
+        # Label each step by its action, or by the first word of its shell command.
+        seq = [s['uses'] if 'uses' in s else 'run:' + s['run'].split()[0]
+               for s in job.get('steps', []) if 'uses' in s or 'run' in s]
+        if seq:
+            seqs.append(seq)
+
+result = infer_ensemble(seqs)
+```
+
+## How MDL scoring works
+
+```
+MDL = model_cost + data_cost
+```
+
+- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
+- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A specific fixed sequence (`a.b.c.d.e`) has data cost zero because |L(r)| = 1. A grammar that accepts *many* strings of the same length (like `(a+b+...+q)+`) has high data cost.
+
+The ensemble selects the grammar with the lowest total MDL.
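The two cost terms above can be checked by hand on the two grammars the text mentions. The sketch below is only an illustration of the stated formula, not the code in `bex/mdl.py`: `mdl_score` and `language_size_at` are hypothetical names, and the language sizes are supplied by hand instead of being derived from a parsed grammar.

```python
import math

def mdl_score(alphabet_size, language_size_at, sequences):
    """MDL = model_cost + data_cost, following the two definitions above.

    alphabet_size    -- number of unique symbols in the grammar (model cost)
    language_size_at -- hypothetical helper: n -> number of strings of length n
                        that the grammar accepts, i.e. |L(r)| at that length
    """
    model_cost = alphabet_size
    data_cost = sum(math.log2(language_size_at(len(s))) for s in sequences)
    return model_cost + data_cost

# Three identical example sequences of length 5.
sequences = [["a", "b", "c", "d", "e"]] * 3

# Fixed grammar a.b.c.d.e: exactly one accepted string of length 5.
# (The constant 1 is valid here only because every example has length 5.)
fixed = mdl_score(5, lambda n: 1, sequences)

# Loose grammar (a+b+...+q)+: 17 symbols, so 17**n accepted strings of length n.
loose = mdl_score(17, lambda n: 17 ** n, sequences)

print(f"fixed: {fixed:.2f}")
print(f"loose: {loose:.2f}")
```

Under these assumptions the fixed grammar scores 5 (all model cost, zero data cost), while the 17-way disjunction scores roughly 78, so a lowest-MDL selection would prefer the fixed grammar on this data.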
+
+## Grammar Notation
+
 - `a.b` — `a` followed by `b` (concatenation)
 - `(a+b)` — either `a` or `b` (disjunction)
 - `r?` — zero or one (optional)
 - `r+` — one or more (iteration)
 - `r+?` — zero or more (varies across examples)
-- `(a|b)` — iDRegEx-style disjunction (equivalent to `(a+b)`)
-
-## Domain: Generic YAML
-
-Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:
-
-```python
-from bex.yaml_to_seq import collect_all_sequences
-from bex import infer_ensemble
-
-results = collect_all_sequences('config_dir/')
-seqs = [seq for _, seq in results]
-result = infer_ensemble(seqs)
-print(result['best']['grammar'])
-```
 
 ## Papers
 
 - **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"* — TODS 2010
 - **Bex et al.** *"Inferring k-optimal REs from Positive Data"* — arXiv:1004.2372
 
-See `papers/` for extracted text and the original references.
-
 ## Tests
 
 ```bash
 python -m pytest tests/
-# or
-python tests/test_bex.py
 ```
 
 ## License
diff --git a/SHOWCASE.md b/SHOWCASE.md
index 1a04924..fef669e 100644
--- a/SHOWCASE.md
+++ b/SHOWCASE.md
@@ -1,14 +1,9 @@
 # Grammar Inference Engine — Showcase
 
-Infer the unwritten convention from existing examples. Given N example
+Infer the **unwritten convention** from existing examples. Given N example
 sequences, produce a ~100-char grammar that captures the structural
 pattern — in far fewer tokens than the originals.
 
-## How it works
-
-Your agent calls the MCP tool `infer_best_grammar` with a list of
-existing sequences. It returns a compressed grammar:
-
 ```
 a.b → a then b (concatenation)
 (a+b) → a or b (disjunction)
@@ -17,40 +12,100 @@ r+ → one or more (iteration)
 r+? → zero or more
 ```
 
-Use `prefer='crx'` for full coverage (accepts all examples), or let the
-ensemble pick between CRX and iDRegEx by MDL score.
+## 1. Ansible Galaxy roles (15 geerlingguy roles) — flagship
 
-## Ansible Galaxy roles — 15 geerlingguy roles
-
-Jeff Geerling maintains 100+ of the most popular Ansible roles on
-Galaxy. He has never written down their task structure. Our grammar is
-the first explicit description:
+15 popular Ansible roles by Jeff Geerling. There is NO written convention
+for the task structure. Our grammar is its first explicit description:
 
 ```
 Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
          include+?.(npm+pip)+?.lineinfile?
-
-  CRX  MDL= 596.64  match=15/15
 ```
 
-Every role follows the same arc: check prerequisites, OS-specific vars,
-install packages, configure with templates, start services, optionally
-run sub-tasks. It works because 15 roles all converged on the same
-unwritten convention.
+Every role: check preconditions → OS-specific vars → install packages →
+configure with templates → start services → optionally handle language tooling.
 
-**Compression: 15 roles (~5,000 tokens) → 60 tokens.**
+All 15/15 match. **~29× compression** (7200+ modules → ~250 chars).
 
-## Notation reference
+**Why it helps an LLM:** Generating a new Ansible role, the LLM knows the
+exact structure: fail-check first, then vars, then packages, then config/svc.
+No guessing.
 
-| Symbol | Meaning |
-|--------|---------|
-| `a.b` | a then b |
-| `(a+b)` | a or b (CRX disjunction) |
-| `(a\|b)` | a or b (iDRegEx disjunction) |
-| `r?` | zero or one |
-| `r+` | one or more |
-| `r+?` | zero or more |
-| `MDL` | Minimum Description Length — lower is better |
+## 2. Helm chart (kube-prometheus-stack, 6 configs)
+
+6 different `values.yaml` files rendered through the same chart:
+
+```
+Best: iDRegEx | MDL 1433
+Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
+```
+
+The **minimal core** every config must deploy. CRX captures the full
+vocabulary (19 kinds). Which one an agent uses depends on the task:
+- Bootstrapping a new cluster: iDRegEx — what you can't skip
+- Writing a complete chart: CRX — everything you might need
+
+## 3. Docker Compose (73 services, 10 projects)
+
+Per-service key order across real-world compose files:
+
+```
+Best: CRX | MDL varies by project
+Grammar: (build+image).command.(environment+volumes)?.ports
+```
+
+Per-project patterns emerge:
+- **Nginx-like:** `build.(command.volumes.ports)`
+- **Databases:** `image.environment.volumes.ports`
+- **Language runtimes:** `build.(environment.command).ports`
+
+**Why it helps an LLM:** The field order in service definitions follows
+an implicit convention. An agent generating compose files should put
+image/build first, then command, then environment/volumes, then ports.
+
+## 4. GitHub Actions (cross-project Go lint, 6 jobs)
+
+Lint jobs from prometheus, goreleaser, cosign, sigstore:
+
+```
+Best: CRX | MDL 13.6
+Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.
+         golangci/golangci-lint-action?.megalinter?
+```
+
+Every Go project's lint CI follows: checkout → setup Go → run linter.
+Only the biggest add megalinter.
+
+**Why it helps an LLM:** Starting a new Go project? The lint workflow
+has a near-universal pattern.
+
+## 5. Terraform (8 AWS modules)
+
+Terraform modules by hashicorp and terraform-aws-modules:
+
+```
+Best: CRX | MDL 1876
+Grammar: null_resource?.s3_bucket...?.vpc?...(26+ types all optional)
+```
+
+Every resource type is optional — VPC, S3, EC2, and security-group
+modules share no mandatory ordering. But the **vocabulary** is the signal:
+seeing `aws_vpc` implies subnets, route tables, internet gateways.
+
+**Why it helps an LLM:** The grammar encodes which resources belong
+together in each module domain.
+
+## What doesn't work
+
+| Dataset | Problem |
+|---------|---------|
+| Dockerfiles | Too simple — just the Dockerfile spec |
+| Pre-commit (cross-project) | 252 unique hooks, no common core |
+| GHA per-project | One repo = too many job types |
+| Prometheus rules | Schema-enforced, no convention |
+
+Sweet spot: **multiple implementations of the same abstract task**
+with a shared but undocumented pattern.
 
 ## Usage