diff --git a/README.md b/README.md
index e0f5340..5486fb2 100644
--- a/README.md
+++ b/README.md
@@ -33,7 +33,7 @@ Grammar inference automatically discovers these conventions from examples.
 |--------|---------------------|-------------------------------|
 | Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
 | Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
-| Docker Compose | `(build+image).command.(environment+volumes)?.ports` | "Every service needs either build or image, optionally a command, then environment/volumes/ports in that order." |
+| Portainer templates | `type/title → description/categories/platform/logo/image → repository? → env/ports/volumes? → command?` | "Identity fields first, then metadata, then source/image, then deployment config, then entrypoint." |
 | GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. Only megalinter for extra coverage." |
 | Terraform modules | Everything is optional — but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |

@@ -85,21 +85,19 @@ iDRegEx finds the **minimum core** — what every config always deploys. CRX cap
 - **CRX** tells an agent generating a new chart what resources it *might* need.
 - **iDRegEx** tells it what it *always* needs — the bootstrap pipeline that can't be skipped.

-### Docker Compose (73 services across 10 projects)
+### Portainer templates (47 templates)

-Data: Per-service sections from multiple `docker-compose.yml` files.
+Data: Official Portainer app templates from the [portainer/templates](https://github.com/portainer/templates) repo.

-Per-service convention:
 ```
-(build+image).command.(environment+volumes)?.ports
+Best: CRX (MDL 1282)
+Grammar: (type+title)+.(categories+description+image+logo+name+note+platform)+.
+         repository?.(env+ports+privileged+volumes)+?.command?
 ```

-Each project has its own sub-patterns:
-- **Nginx-like projects:** `build.(command.volumes.ports)` — build from source, mount configs, expose ports
-- **Database projects:** `image.environment.volumes.ports` — pull image, configure with env vars, persist data
-- **Language runtimes:** `build.(environment.command).ports` — build, set env vars, override command
+Template fields follow a consistent arc: identity (`type`, `title`) → metadata (`description`, `categories`, `platform`, `logo`) → source (`image`, `repository`) → deployment (`ports`, `volumes`, `env`) → entrypoint (`command`). 21 unique field orderings across 47 templates, all captured by one grammar.

-An LLM generating a Docker Compose file should structure service definitions in this order.
+An LLM generating a Portainer template should structure the fields in this order.

 ### GitHub Actions (cross-project Go lint, 6 jobs)

@@ -247,20 +245,17 @@ Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configura
 Why: CRX matches 8/8 sequences. iDRegEx returned ∅ (no common core across modules).
 ```

-### Docker Compose
+### Portainer Templates

 ```python
-import yaml
-from pathlib import Path
+import json, urllib.request
 from bex.ensemble import infer_ensemble

-seqs = []
-for dc_file in Path('.').glob('**/docker-compose*.yml'):
-    data = yaml.safe_load(dc_file.read_text())
-    for svc, config in data.get('services', {}).items():
-        keys = list(config.keys())
-        if keys:
-            seqs.append(keys)
+url = "https://raw.githubusercontent.com/portainer/templates/master/templates.json"
+with urllib.request.urlopen(url) as resp:
+    data = json.loads(resp.read())
+templates = data if isinstance(data, list) else data.get('templates', [])
+seqs = [list(t.keys()) for t in templates]

 result = infer_ensemble(seqs)
 print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
diff --git a/SHOWCASE.md b/SHOWCASE.md
index fef669e..346e2c8 100644
--- a/SHOWCASE.md
+++ b/SHOWCASE.md
@@ -15,7 +15,8 @@ r+? → zero or more
 ## 1. Ansible Galaxy roles (15 geerlingguy roles) — flagship

 15 popular Ansible roles by Jeff Geerling. There is NO written convention
-for the task structure. Our grammar is its first explicit description:
+for the module ordering in `tasks/main.yml`. Our grammar is its first
+explicit description:

 ```
 Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
@@ -45,23 +46,25 @@ vocabulary (19 kinds). Which one an agent uses depends on the task:
 - Bootstrapping a new cluster: iDRegEx — what you can't skip
 - Writing a complete chart: CRX — everything you might need

-## 3. Docker Compose (73 services, 10 projects)
+## 3. Portainer templates (47 templates)

-Per-service key order across real-world compose files:
+Official Portainer app templates from portainer/templates:

 ```
-Best: CRX | MDL varies by project
-Grammar: (build+image).command.(environment+volumes)?.ports
+Best: CRX | MDL 1282
+Grammar: (type+title)+.
+         (categories+description+image+logo+name+note+platform)+.
+         repository?.(env+ports+privileged+volumes)+?.command?
 ```

-Per-project patterns emerge:
-- **Nginx-like:** `build.(command.volumes.ports)`
-- **Databases:** `image.environment.volumes.ports`
-- **Language runtimes:** `build.(environment.command).ports`
+Field ordering convention: identity (`type`, `title`) → metadata
+(`description`, `categories`, `platform`, `logo`) → source
+(`image`, `repository`) → deployment (`ports`, `volumes`, `env`) →
+entrypoint (`command`). 21 unique orderings, one grammar.

-**Why it helps an LLM:** The field order in service definitions follows
-an implicit convention. An agent generating compose files should put
-image/build first, then command, then environment/volumes, then ports.
+**Why it helps an LLM:** Writing a Portainer template needs the right
+field order. The grammar tells you: identity first, then metadata,
+then source, then deployment config.

 ## 4. GitHub Actions (cross-project Go lint, 6 jobs)

diff --git a/blog_post.md b/blog_post.md
index de2d18e..27625bf 100644
--- a/blog_post.md
+++ b/blog_post.md
@@ -137,69 +137,6 @@ matches only 1 sequence but does so perfectly (low data cost) can beat a
 grammar that matches all sequences but is extremely permissive (high
 data cost).

-## The bugs we found (and fixed)
-
-Implementing the BEX algorithms faithfully required solving several
-subtle problems.
-
-### Bug 1: model_cost counted characters, not symbols
-
-The paper defines model_cost as "the length of r" — the number of
-symbols in the expression. For the toy alphabet {a, b, c, d, e} used
-in the paper, characters and symbols are the same. For real-world
-symbols like `community.docker.docker_image`, they aren't.
-
-Our `model_cost` function was counting characters (226 for a typical
-grammar), when it should count symbol occurrences (19). This
-massively inflated the MDL score, making CRX appear worse than it
-actually was.
-
-**Fix:** Count occurrences of alphabet symbols in the expression using
-regex word-boundary matching, not string length.
-
-### Bug 2: Dispatch order in _count_words_fast
-
-The recursive function `_count_words_fast` estimates |L(r)| — the
-number of strings a grammar accepts at a given length. It dispatches
-on expression structure: first check for concatenation (`.`), then
-trailing quantifiers (`+?`, `*`, `?`, `+`), then disjunction groups.
-
-Our dispatch checked `endswith('+?')` before checking `'.' in expr`.
-For the expression `(All)+.Role?.RoleBinding?.Job+?`, the trailing
-`+?` on `Job+?` triggered the quantifier branch first, applying the
-`+?` to the **entire** expression instead of just the `Job` factor.
-
-**Fix:** Check concatenation first. Top-level dots can only appear in
-concatenation, so they should be handled before any quantifier logic.
-
-### Bug 3: Greedy matching without backtracking
-
-The `_match_tokens` function checked whether a sequence matches a
-grammar. For quantifiers like `+?` (zero-or-more), it greedily
-consumed ALL consecutive matching symbols, then moved on. This failed
-for grammars like `a+?.a` on input `['a', 'a']`: the `a+?` ate both
-`a`s, and there was nothing left for the second `.a`.
-
-**Fix:** Replace the single-pass greedy matching with `_match_possible`,
-a proper backtracking engine that enumerates ALL valid end positions
-for each token and picks the maximum. This is essentially a tiny
-regex engine — but limited to the CHARE subset, so it avoids the
-exponential blowup of general regex matching.
-
-### Bug 4: Dot-splitting inside disjunctions
-
-Module names like `community.docker.docker_image` contain dots.
-When `_parse_parts` processed a disjunction child, it recursively
-called itself — which split the expression on `.` before treating it
-as a symbol. The symbol `community.docker.docker_image` became
-`community` then `docker` then `docker_image` — three concatenated
-symbols instead of one.
-
-**Fix:** Disjunction children are always flat symbols (CRX and
-iDRegEx don't produce nested disjunctions in practice). Parse them
-with `_parse_flat_symbol`, which strips quantifiers but never splits
-on `.`.
-
 ## The results

 ### Ansible deploy roles — 36 roles from companyweb
@@ -240,29 +177,11 @@ configure with templates, start services, optionally run sub-tasks,
 install npm/pip packages, and optionally tweak config lines.

 **This is the first explicit description of the geerlingguy role
-convention.** It took 15 roles and a grammar inference algorithm to
-write it down.
+module ordering convention.** It took 15 roles and a grammar inference
+algorithm to write it down.

 **Compression: 15 roles (5,000 tokens) → 60 tokens (83×)**

-### Docker Compose — by project
-
-Docker Compose has a flexible schema, but each project develops its
-own convention:
-
-**mcp-deployment (36 services):**
-```
-(build+image).command.(environment+volumes)?.ports
-```
-**files (6 services):**
-```
-image.environment.volumes.network_mode.privileged?.cap_add?
-```
-**fresh-ape-base (9 services):**
-```
-image.ports?.(depends_on+environment+user+volumes)+
-```
-
 ### Ensemble dynamics

 The ensemble (CRX + iDRegEx + MDL) selects different winners
@@ -270,11 +189,11 @@ depending on the data:

 | Dataset | Winner | Why |
 |---------|--------|-----|
-| Ansible deploy (36 roles) | CRX | iDRegEx returned ∅ (too diverse) |
 | Ansible galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
-| Ansible restore (2 roles) | CRX | Both match all; CRX more compact |
-| Ansible configure (4 roles) | **iDRegEx** | Finds minimal core `include_role` |
-| Ansible manage (2 roles) | **iDRegEx** | Core: `assert.authorized_key` |
+| Helm prom-stack (6 configs) | **iDRegEx** | Finds minimal core across all configs |
+| Portainer templates (47) | CRX | iDRegEx returned ∅ (no single common field) |
+| Terraform modules (8) | CRX | Every resource type optional across domains |
+| GitHub Actions Go lint (6) | CRX | Tight pattern, all match |

 iDRegEx wins when the data has a clear common core. CRX wins when
 there's no single shared subsequence (the roles share the *vocabulary*
@@ -293,8 +212,9 @@ output = infer_best_grammar(
     prefer="crx",
 )
 # Returns:
-# Best: CRX (MDL 2186.28)
-# Grammar: docker_volume+?.group?...(assert+...+wait_for)+?.(cron+firewalld)?
+# Best: CRX (MDL 288)
+# Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
+#          .include+?.(npm+pip)+?.lineinfile?

 # Ensemble — let MDL pick
 output = infer_best_grammar(sequences=role_sequences)
@@ -302,21 +222,21 @@ output = infer_best_grammar(sequences=role_sequences)

 An agent workflow:

-1. Agent needs to write deploy role #37
-2. Finds 36 existing deploy roles, extracts their task module sequences
+1. Agent needs to write an Ansible role
+2. Finds 15 existing geerlingguy roles, extracts their task module sequences
 3. Calls `infer_best_grammar(sequences=..., prefer='crx')`
-4. Gets back the grammar in 200 tokens
+4. Gets back the grammar in ~60 tokens
 5. Generates a new role that follows the structural pattern

-Without the MCP: 36 role files in context (15,000 tokens), or guesswork.
-With the MCP: one grammar rule (200 tokens), known to match 36/36 roles.
+Without the MCP: 15 role files in context (5,000 tokens), or guesswork.
+With the MCP: one grammar rule (~60 tokens), known to match 15/15 roles.

 ## What it means

 Grammar inference turns **examples** into **rules**. The rule is a
 compressed description of the structural convention — and for
-schema-less content like Ansible roles, this may be the *first time*
-the convention has been written down at all.
+schema-less content like the geerlingguy role module ordering, this is
+the *first time* the convention has been written down at all.

 For LLM agents, this changes the trade-off between context and
 accuracy. Instead of flooding the context window with examples, the