purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

2026-07-01 10:36:04 +02:00
commit 25d844d1f9
parent 097dfc9954
3 changed files with 83 additions and 193 deletions
--- a/README.md
+++ b/README.md
@@ -1,17 +1,17 @@
 # Dervish

-![Dervish](dervish.gif)
+<p align="center"><img src="dervish.gif" alt="Dervish"></p>

 **Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that describes the general pattern.

 ## MCP Server

-The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (Claude, opencode, etc.) and get grammar inference as a tool:
+The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (pi.dev, opencode, vibe, etc.) and get grammar inference as a tool:

 ```json
 {
  "mcpServers": {
-    "grammar-inference": {
+    "dervish": {
      "command": "python3",
      "args": ["/path/to/bex/mcp_server.py"]
    }
@@ -21,46 +21,45 @@ The primary interface is a **Model Context Protocol (MCP)** server. Connect any

 ### Tools

-| Tool | What it does |
-|------|-------------|
-| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
-| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. `prefer='crx'` or `prefer='idregex'` to skip the comparison and return only that algorithm. |
-| `infer_yaml_grammar(yaml_dir, pattern, method)` | YAML → key-paths → grammar |
-| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
+| Tool | Parameters | What it does |
+|------|-----------|-------------|
+| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N` | **Recommended.** Runs CRX + iDRegEx, picks best by MDL. Set `prefer='crx'` or `prefer='idregex'` to run one algorithm. |
+| `infer_grammar` | `sequences`, `method`, `kmax`, `N` | Core single-algorithm inference. `method='crx'` (fast, deterministic) or `method='idregex'` (probabilistic EM). |
+
+**Parameters explained:**
+- **`kmax`** (1–5): Context window for iDRegEx's k-testable automaton. Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
+- **`N`** (1–10): Baum-Welch EM iterations for iDRegEx training. More iterations = better convergence but slower. Default 3 is a good balance.
+- **`prefer`** (`'crx'` or `'idregex'`): Skips the CRX-vs-iDRegEx comparison and runs only the named algorithm. Use when you know which algorithm fits your data.

 ### Agent workflow

-An LLM agent uses the MCP to discover an unwritten convention from existing examples:
+An LLM agent uses the MCP server to discover an unwritten convention from existing examples, compressing a whole set of example files into a single ~60-token rule:

 ```
 User: Generate a new Ansible role for installing PostgreSQL.
-Agent: I'll first check 15 existing geerlingguy roles to find the structural pattern.
-       [calls infer_best_grammar with 15 role sequences, prefer='crx']

-       Best: CRX (MDL 288)
-       Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
-                .include+?.(npm+pip)+?.lineinfile?
+Agent: Let me check what pattern the existing community roles follow.
+       I'll look at 15 popular geerlingguy roles.

-       Convention: check preconditions → OS-specific vars → install packages →
-       configure templates → start services → handle language tooling.
+       [finds role directories, extracts task module sequences,
+        calls infer_best_grammar(sequences=..., prefer='crx')]
+
+       Dervish returns:
+         Best: CRX (MDL 288)
+         Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
+                  .include+?.(npm+pip)+?.lineinfile?
+
+       This tells me: a role may open with a fail check for preconditions,
+       then sets OS-specific variables, installs packages, configures with
+       templates, starts services, optionally pulls in includes, and optionally
+       handles language tooling (npm/pip). A trailing lineinfile tweak is optional.
+
+       I'll generate the new role following this structure.
 ```

-Without the MCP: 15 role files in context (5,000+ tokens) or guesswork.
-With the MCP: one grammar rule (~60 tokens), known to match 15/15 existing roles.
+**Without Dervish:** the agent stuffs 15 role files into context (5,000+ tokens per role, so 75,000+ tokens in total), or guesses the pattern from 1–2 examples and often gets it wrong.

-## Why grammar inference?
-
-There are many domains where developers follow **unwritten conventions** — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.
-
-Grammar inference automatically discovers these conventions from examples:
-
-| Domain | Unwritten convention | What the grammar tells an LLM |
-|--------|---------------------|-------------------------------|
-| Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
-| Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
-| Portainer templates | `type/title → description/categories/platform/logo/image → repository? → env/ports/volumes? → command?` | "Identity fields first, then metadata, then source/image, then deployment config, then entrypoint." |
-| GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. Only megalinter for extra coverage." |
-| Terraform modules | Everything is optional — but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |
+**With Dervish:** one MCP call returns a ~60-token grammar known to match 15/15 existing roles. The agent follows it reliably.

 ## Quick Start

@@ -83,12 +82,34 @@ print(f"Grammar: {result['best']['grammar']}")
 print(f"Score: {result['best']['mdl_score']}")
 ```

+## Why not just use a schema?
+
+Many of the things developers build every day **have no formal schema**. They're free-form scripts, config files, or YAML blobs where the structure is an emergent convention, not an enforced specification. An LLM generating new content in these domains needs to know the convention, but it is rarely written down.
+
+Dervish discovers these conventions automatically from existing examples. The domains below are **just examples** of what it can do — the same approach works for any sequential data with an unwritten pattern.
+
+| Domain | What gets extracted | Example extracted symbols | What Dervish discovers | Why it helps an LLM |
+|--------|-------------------|--------------------------|----------------------|---------------------|
+| Ansible roles | Module names from `tasks/main.yml` in order | `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `npm`, `pip`, `lineinfile` | `fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?` | "Validate preconditions first, then set vars, install packages, configure with templates, start services. Include sub-roles last." |
+| Helm charts | K8s resource kinds from `helm template` output in rendered order | `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager` | `ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment` (iDRegEx minimal core) | "Every Prometheus stack needs this bootstrap pipeline. Everything else is optional." |
+| GitHub Actions (Go lint) | Step `uses:` or `run:` values from workflow YAML in job order | `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter` | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | "Starting a new Go project? The lint workflow has a near-universal pattern." |
+| Terraform modules | Resource type strings from `.tf` files in declaration order | `aws_vpc`, `aws_subnet`, `aws_route_table`, `aws_internet_gateway`, `aws_security_group`, `aws_instance`, `aws_s3_bucket` | Everything optional (domains too different), but certain types always cluster together | "If you see `aws_vpc`, expect subnets, route tables, gateways to follow. The grammar encodes each domain's resource catalogue." |
+
 ## Real-world Results

 ### Ansible Galaxy (15 roles, 44+ modules each)

 Data: All 15 [geerlingguy Galaxy roles](https://github.com/geerlingguy) — nginx, php, mysql, docker, etc.

+Each role's `tasks/main.yml` is parsed into a sequence of module names. Here are the sequences from two roles:
+
+```
+docker:   fail → include_vars → include_tasks → package → package → package → ...
+nginx:    fail → include_vars → set_fact → package → file → template → service → ...
+```
+
+The extracted symbols are Ansible module names like `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `systemd`, `get_url`, `shell`, `npm`, `pip`, `lineinfile`, `copy`, `unarchive`, `yum`, `apt`, `command`, `user`, `group`, `git`, `mount`, `cron`, `debug`, `iptables`, `ufw`, `hostname`, `sysctl`, `timezone`, `selinux`, `firewalld`, `homebrew`, `supervisorctl`, `postgresql_db`, `mysql_db` — 50+ unique modules across the 15 roles.
+
 ```
 Best: CRX (MDL 288, 15/15 match)
 Grammar:
@@ -104,7 +125,15 @@ This is the first explicit description of the geerlingguy role module ordering c

 ### Helm (kube-prometheus-stack, 6 CI configs)

-Data: 6 different `values.yaml` configurations rendered through `helm template`.
+Data: 6 different `values.yaml` configurations rendered through `helm template`. Each config produces a sequence of K8s `kind` values in rendered YAML order:
+
+```
+config-1: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ServiceMonitor → PrometheusRule
+config-2: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ConfigMap → ServiceMonitor
+config-3: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → Alertmanager → Prometheus
+```
+
+Extracted symbols (a selection; 19 kinds in total): `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager`, `Prometheus`, `PrometheusRule`, `ServiceMonitor`, `Role`, `RoleBinding`, `Job`, `DaemonSet`, `Secret`, `ValidatingWebhookConfiguration`.

 ```
 Best: iDRegEx (MDL 1433)
@@ -118,21 +147,17 @@ iDRegEx finds the **minimum core** — what every config always deploys. CRX cap
 - **CRX** tells an agent generating a new chart what resources it *might* need.
 - **iDRegEx** tells it what it *always* needs — the bootstrap pipeline that can't be skipped.

-### Portainer templates (47 templates)
-
-Data: Official Portainer app templates from the [portainer/templates](https://github.com/portainer/templates) repo.
-
-```
-Best: CRX (MDL 1282)
-Grammar: (type+title)+.(categories+description+image+logo+name+note+platform)+.
-         repository?.(env+ports+privileged+volumes)+?.command?
-```
-
-Template fields follow a consistent arc: identity (`type`, `title`) → metadata (`description`, `categories`, `platform`, `logo`) → source (`image`, `repository`) → deployment (`ports`, `volumes`, `env`) → entrypoint (`command`). 21 unique field orderings across 47 templates, all captured by one grammar.
-
 ### GitHub Actions (cross-project Go lint, 6 jobs)

-Data: Lint jobs from prometheus, goreleaser, cosign, sigstore.
+Data: Lint jobs from prometheus, goreleaser, cosign, sigstore. Each job's steps are extracted as `uses:` or `run:` values:
+
+```
+prometheus lint:   actions/checkout → actions/setup-go → run:sudo → run:echo → golangci/golangci-lint-action → golangci/golangci-lint-action → ...
+goreleaser lint:   actions/checkout → actions/setup-go → gitleaks/gitleaks-action → golangci/golangci-lint-action
+cosign lint:       actions/checkout → ossf/scorecard-action → actions/upload-artifact → github/codeql-action/upload-sarif
+```
+
+Extracted symbols: `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter`, `gitleaks/gitleaks-action`, `ossf/scorecard-action`, `github/codeql-action/*`, and `run:*` commands.

 ```
 Best: CRX (MDL 13.6)
@@ -143,7 +168,15 @@ Every Go project's lint CI follows: checkout → setup Go → run golangci-lint.

 ### Terraform (8 AWS modules, 156+ resources each)

-Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules.
+Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules from hashicorp and terraform-aws-modules. Each `.tf` file is parsed for `resource` and `data` declarations in order (data sources carry a `data:` prefix, and the `aws_` provider prefix is dropped):
+
+```
+vpc module:   data:vpc_endpoint_service → vpc → vpc_endpoint → vpc_endpoint_route_table_association → egress_only_internet_gateway → route_table → route → subnet → ...
+ec2 module:   data:partition → data:ssm_parameter → instance → spot_instance_request → ec2_tag → ebs_volume → volume_attachment → data:iam_policy_document → iam_role → iam_role_policy_attachment → iam_instance_profile → ...
+s3 module:    iam_role → data:iam_policy_document → iam_policy → data:partition → s3_bucket → s3_bucket_versioning → s3_bucket_logging → s3_bucket_server_side_encryption → ...
+```
+
+Extracted symbols (full type names; the sequences above omit the `aws_` prefix): `aws_vpc`, `aws_subnet`, `aws_route_table`, `aws_internet_gateway`, `aws_nat_gateway`, `aws_vpn_gateway`, `aws_security_group`, `aws_security_group_rule`, `aws_instance`, `aws_eip`, `aws_ebs_volume`, `aws_s3_bucket`, `aws_s3_bucket_versioning`, `aws_s3_bucket_logging`, `aws_iam_role`, `aws_iam_policy`, `aws_autoscaling_group`, `aws_launch_configuration`, `random_pet`, `null_resource` (30+ types across modules).

 ```
 Best: CRX (MDL 1876)
@@ -182,129 +215,6 @@ The sweet spot: **multiple implementations of the same abstract task** (like "de
 | 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
 | Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |

-## Domain Adapters
-
-### Ansible Roles
-
-```python
-from bex.ensemble import infer_ensemble
-from bex.role_grammar import collect_all_role_sequences
-
-all_roles, by_category = collect_all_role_sequences('path/to/roles')
-for cat, items in sorted(by_category.items()):
-    seqs = [s for _, s in items]
-    result = infer_ensemble(seqs)
-    print(f"── {cat} ({len(items)} roles) ──")
-    print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
-    print(f"  Grammar: {result['best']['grammar']}")
-```
-
-**Example** (15 geerlingguy Galaxy roles):
-
-```
-── other (15 roles) ──
-  Best: CRX (MDL 288, 15/15 match)
-  Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?
-  Why: CRX matches 15/15 sequences, iDRegEx matches 3/15. CRX selected.
-```
-
-### Helm Charts
-
-```python
-import subprocess, yaml
-from bex.ensemble import infer_ensemble
-
-seqs = []
-for vf in sorted(Path('ci/').glob('*-values.yaml')):
-    out = subprocess.run(
-        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
-        capture_output=True, text=True, timeout=120,
-    )
-    kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
-             if d and isinstance(d, dict) and 'kind' in d]
-    if kinds:
-        seqs.append(kinds)
-
-result = infer_ensemble(seqs)
-print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
-print(f"Grammar: {result['best']['grammar']}")
-```
-
-**Example** (kube-prometheus-stack, 6 CI configs):
-
-```
-Best: iDRegEx (MDL 1433)
-Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
-
-  iDRegEx     MDL=  1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
-  CRX         MDL=  2651.74  (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
-
-Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
-iDRegEx selected (MDL score 1433.0).
-```
-
-### Terraform
-
-```python
-import re
-from bex.ensemble import infer_ensemble
-
-seqs = []
-for tf in sorted(Path('.').rglob('*.tf')):
-    resources = re.findall(r'resource "(\w+)" "\w+" {', tf.read_text())
-    if resources:
-        seqs.append(resources)
-
-result = infer_ensemble(seqs)
-print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
-print(f"Grammar: {result['best']['grammar']}")
-```
-
-**Example** (8 terraform-aws-* modules):
-
-```
-Best: CRX (MDL 1876)
-Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?....
-Why: CRX matches 8/8 sequences. iDRegEx returned ∅ (no common core across modules).
-```
-
-### Portainer Templates
-
-```python
-import json, urllib.request
-from bex.ensemble import infer_ensemble
-
-url = "https://raw.githubusercontent.com/portainer/templates/master/templates.json"
-with urllib.request.urlopen(url) as resp:
-    data = json.loads(resp.read())
-templates = data if isinstance(data, list) else data.get('templates', [])
-seqs = [list(t.keys()) for t in templates]
-
-result = infer_ensemble(seqs)
-print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
-print(f"Grammar: {result['best']['grammar']}")
-```
-
-### GitHub Actions
-
-```python
-import yaml
-from bex.ensemble import infer_ensemble
-
-seqs = []
-for wf_file in Path('.github/workflows/').glob('*.yml'):
-    data = yaml.safe_load(wf_file.read_text())
-    for job in data.get('jobs', {}).values():
-        if 'steps' not in job:
-            continue
-        seq = [s.get('uses', 'run:' + s.get('run', '').split()[0])
-               for s in job['steps'] if 'uses' in s or 'run' in s]
-        if seq:
-            seqs.append(seq)
-
-result = infer_ensemble(seqs)
-```
-
 ## How MDL scoring works

 ```
--- a/SHOWCASE.md
+++ b/SHOWCASE.md
@@ -46,27 +46,7 @@ vocabulary (19 kinds). Which one an agent uses depends on the task:
 - Bootstrapping a new cluster: iDRegEx — what you can't skip
 - Writing a complete chart: CRX — everything you might need

-## 3. Portainer templates (47 templates)
-
-Official Portainer app templates from portainer/templates:
-
-```
-Best: CRX | MDL 1282
-Grammar: (type+title)+.
-         (categories+description+image+logo+name+note+platform)+.
-         repository?.(env+ports+privileged+volumes)+?.command?
-```
-
-Field ordering convention: identity (`type`, `title`) → metadata
-(`description`, `categories`, `platform`, `logo`) → source
-(`image`, `repository`) → deployment (`ports`, `volumes`, `env`) →
-entrypoint (`command`). 21 unique orderings, one grammar.
-
-**Why it helps an LLM:** Writing a Portainer template needs the right
-field order. The grammar tells you: identity first, then metadata,
-then source, then deployment config.
-
-## 4. GitHub Actions (cross-project Go lint, 6 jobs)
+## 3. GitHub Actions (cross-project Go lint, 6 jobs)

 Lint jobs from prometheus, goreleaser, cosign, sigstore:

@@ -82,7 +62,7 @@ Only the biggest add megalinter.
 **Why it helps an LLM:** Starting a new Go project? The lint workflow
 has a near-universal pattern.

-## 5. Terraform (8 AWS modules)
+## 4. Terraform (8 AWS modules)

 Terraform modules by hashicorp and terraform-aws-modules:

--- a/blog_post.md
+++ b/blog_post.md
@@ -191,7 +191,6 @@ depending on the data:
 |---------|--------|-----|
 | Ansible galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
 | Helm prom-stack (6 configs) | **iDRegEx** | Finds minimal core across all configs |
-| Portainer templates (47) | CRX | iDRegEx returned ∅ (no single common field) |
-| Terraform modules (8) | CRX | Every resource type optional across domains |
+| Terraform modules (8) | CRX | iDRegEx returned ∅ (no common core across domains); every resource type is optional |
 | GitHub Actions Go lint (6) | CRX | Tight pattern, all match |