From 17b5c271ec1e552056510bd7f6e682a0a3628c64 Mon Sep 17 00:00:00 2001
From: tobjend
Date: Wed, 1 Jul 2026 11:05:26 +0200
Subject: [PATCH] deduplicate: replace detailed Real-world Results with summary table linking to SHOWCASE.md

---
 README.md | 84 +++++-------------------------------------------------
 1 file changed, 7 insertions(+), 77 deletions(-)

diff --git a/README.md b/README.md
index 99ad51f..9419e36 100644
--- a/README.md
+++ b/README.md
@@ -108,85 +108,15 @@ Dervish discovers these conventions automatically from existing examples. The do
 
 ## Real-world Results
 
-### Ansible Galaxy (15 roles, 44+ modules each)
+Dervish has been tested against public datasets from Ansible Galaxy, Helm, and GitHub Actions — all cases where multiple projects independently converged on an undocumented pattern. [**Full details → SHOWCASE.md**](SHOWCASE.md)
 
-Data: All 15 [geerlingguy Galaxy roles](https://github.com/geerlingguy) — nginx, php, mysql, docker, etc.
+| Dataset | Best grammar | Compression |
+|---------|-------------|-------------|
+| Ansible Galaxy (15 roles) | `fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?` | 5,000 tokens → 60 tokens (83×) |
+| Helm (6 configs) | `ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment` | ~3,000 tokens → 40 tokens (75×) |
+| Go lint (6 jobs) | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | ~900 tokens → 30 tokens (30×) |
 
-Each role's `tasks/main.yml` is parsed into a sequence of module names. Here are the sequences from two roles:
-
-```
-docker: fail → include_vars → include_tasks → package → package → package → ...
-nginx: fail → include_vars → set_fact → package → file → template → service → ...
-```
-
-The extracted symbols are Ansible module names like `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `systemd`, `get_url`, `shell`, `npm`, `pip`, `lineinfile`, `copy`, `unarchive`, `yum`, `apt`, `command`, `user`, `group`, `git`, `mount`, `cron`, `debug`, `iptables`, `ufw`, `hostname`, `sysctl`, `timezone`, `selinux`, `firewalld`, `homebrew`, `supervisorctl`, `postgresql_db`, `mysql_db` — 50+ unique modules across the 15 roles.
-
-```
-Best: CRX (MDL 288, 15/15 match)
-Grammar:
- fail?.(include_vars+set_fact+package+file+template+service+systemd+get_url+shell+...)+
- .include+?.(npm+pip)+?.lineinfile?
-```
-
-Every single role follows this pattern. The convention was **unwritten** — no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."
-
-This is the first explicit description of the geerlingguy role module ordering convention.
-
-**Compression:** The grammar is ~250 chars. The 15 examples are 7200+ modules combined. **~29× compression.**
-
-### Helm (kube-prometheus-stack, 6 CI configs)
-
-Data: 6 different `values.yaml` configurations rendered through `helm template`. Each config produces a sequence of K8s `kind` values in rendered YAML order:
-
-```
-config-1: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ServiceMonitor → PrometheusRule
-config-2: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ConfigMap → ServiceMonitor
-config-3: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → Alertmanager → Prometheus
-```
-
-Extracted symbols: `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager`, `Prometheus`, `PrometheusRule`, `ServiceMonitor`, `Role`, `RoleBinding`, `Job`, `DaemonSet`, `Secret`, `ValidatingWebhookConfiguration` — 19 kinds total.
-
-```
-Best: iDRegEx (MDL 1433)
-Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
-
- iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
- CRX MDL= 2651.74 (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
-```
-
-iDRegEx finds the **minimum core** — what every config always deploys. CRX captures the full vocabulary (19 resource kinds). Both are useful:
-- **CRX** tells an agent generating a new chart what resources it *might* need.
-- **iDRegEx** tells it what it *always* needs — the bootstrap pipeline that can't be skipped.
-
-### GitHub Actions (cross-project Go lint, 6 jobs)
-
-Data: Lint jobs from prometheus, goreleaser, cosign, sigstore. Each job's steps are extracted as `uses:` or `run:` values:
-
-```
-prometheus lint: actions/checkout → actions/setup-go → run:sudo → run:echo → golangci/golangci-lint-action → golangci/golangci-lint-action → ...
-goreleaser lint: actions/checkout → actions/setup-go → gitleaks/gitleaks-action → golangci/golangci-lint-action
-cosign lint: actions/checkout → ossf/scorecard-action → actions/upload-artifact → github/codeql-action/upload-sarif
-```
-
-Extracted symbols: `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter`, `gitleaks/gitleaks-action`, `ossf/scorecard-action`, `github/codeql-action/*`, and `run:*` commands.
-
-```
-Best: CRX (MDL 13.6)
-Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?
-```
-
-Every Go project's lint CI follows: checkout → setup Go → run golangci-lint. Only the biggest projects add megalinter.
-
-### What doesn't work
-
-Not every domain has an unwritten convention. Grammar inference failed (produced trivial `(a+b+c+...)+` grammars) on:
-
-- **Dockerfiles** — too simple (`FROM → RUN → COPY → CMD` is just the Dockerfile spec)
-- **Pre-commit configs** (cross-project) — 252 unique hook IDs, no common core
-- **GitHub Actions per-project** — too many different job types (build, lint, release, security) in one repo
-- **Prometheus recording rules** — schema-enforced structure, no convention to discover
-
-The sweet spot: **multiple implementations of the same abstract task** (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.
+The sweet spot: **multiple implementations of the same abstract task** with a shared but undocumented pattern. Not everything works — Dockerfiles, pre-commit configs, and schema-enforced formats are too rigid or too diverse to yield a convention.
 
 ## Algorithm Selection Guide