purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

tobjend 2026-07-01 10:36:04 +02:00
parent 097dfc9954
commit 25d844d1f9
3 changed files with 83 additions and 193 deletions

250
README.md

@ -1,17 +1,17 @@
# Dervish
![Dervish](dervish.gif)
<p align="center"><img src="dervish.gif" alt="Dervish"></p>
**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that describes the general pattern.
## MCP Server
The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (Claude, opencode, etc.) and get grammar inference as a tool:
The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (pi.dev, opencode, vibe, etc.) and get grammar inference as a tool:
```json
{
"mcpServers": {
"grammar-inference": {
"dervish": {
"command": "python3",
"args": ["/path/to/bex/mcp_server.py"]
}
@ -21,46 +21,45 @@ The primary interface is a **Model Context Protocol (MCP)** server. Connect any
### Tools
| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. `prefer='crx'` or `prefer='idregex'` to skip the comparison and return only that algorithm. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
| Tool | Parameters | What it does |
|------|-----------|-------------|
| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N` | **Recommended.** Runs CRX + iDRegEx, picks best by MDL. Set `prefer='crx'` or `prefer='idregex'` to run one algorithm. |
| `infer_grammar` | `sequences`, `method`, `kmax`, `N` | Core single-algorithm inference. `method='crx'` (fast, deterministic) or `method='idregex'` (probabilistic EM). |
**Parameters explained:**
- **`kmax`** (1-5): Context window for iDRegEx's k-testable automaton. Higher values capture longer-range dependencies but need more data and are slower. The default of 2 works for most cases.
- **`N`** (1-10): Baum-Welch EM iterations for iDRegEx training. More iterations improve convergence but take longer. The default of 3 is a good balance.
- **`prefer`**: Skip the CRX-vs-iDRegEx comparison. Use when you know which algorithm fits your data.
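As a rough illustration, the parameters above could map onto an MCP `tools/call` request as sketched below. The sketch only builds the JSON-RPC payload; sending it is left to the MCP client. The helper name `build_tool_call`, the example sequences, and the assumption that `prefer` may be omitted are all hypothetical and not taken from the Dervish source.

```python
import json


def build_tool_call(sequences, prefer=None, kmax=2, N=3, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` request for `infer_best_grammar`.

    Hypothetical helper for illustration only. kmax=2 and N=3 mirror the
    defaults described in the parameter list.
    """
    arguments = {"sequences": sequences, "kmax": kmax, "N": N}
    if prefer is not None:
        # Assumption: leaving `prefer` out requests the CRX-vs-iDRegEx comparison.
        arguments["prefer"] = prefer
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "infer_best_grammar", "arguments": arguments},
    }


# Made-up example input: each sequence is a list of symbols.
example_sequences = [
    ["fail", "include_vars", "package", "service"],
    ["fail", "include_vars", "package", "template", "service"],
]

request = build_tool_call(example_sequences, prefer="crx")
print(json.dumps(request, indent=2))
```

Most MCP clients construct this envelope themselves, so in practice an agent only supplies the tool name and the `arguments` object.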
### Agent workflow
An LLM agent uses the MCP to discover an unwritten convention from existing examples:
An LLM agent uses the MCP to discover an unwritten convention from existing examples — compressing hundreds of files into a single ~60-token rule:
```
User: Generate a new Ansible role for installing PostgreSQL.
Agent: I'll first check 15 existing geerlingguy roles to find the structural pattern.
[calls infer_best_grammar with 15 role sequences, prefer='crx']
Best: CRX (MDL 288)
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
.include+?.(npm+pip)+?.lineinfile?
Agent: Let me check what pattern the existing community roles follow.
I'll look at 15 popular geerlingguy roles.
Convention: check preconditions → OS-specific vars → install packages →
configure templates → start services → handle language tooling.
[finds role directories, extracts task module sequences,
calls infer_best_grammar(sequences=..., prefer='crx')]
Dervish returns:
Best: CRX (MDL 288)
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
.include+?.(npm+pip)+?.lineinfile?
This tells me: every role starts with a fail check for preconditions,
then OS-specific variables, installs packages, configures with templates,
starts services, and optionally handles language tooling (npm/pip).
The role should end with a lineinfile tweak.
I'll generate the new role following this structure.
```
Without the MCP: 15 role files in context (5,000+ tokens) or guesswork.
With the MCP: one grammar rule (~60 tokens), known to match 15/15 existing roles.
**Without Dervish:** the agent stuffs 15 role files into context (5,000+ tokens per role, which quickly crowds the context window), or guesses the pattern from 1-2 examples and often gets it wrong.
## Why grammar inference?
There are many domains where developers follow **unwritten conventions** — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.
Grammar inference automatically discovers these conventions from examples:
| Domain | Unwritten convention | What the grammar tells an LLM |
|--------|---------------------|-------------------------------|
| Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
| Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
| Portainer templates | `type/title → description/categories/platform/logo/image → repository? → env/ports/volumes? → command?` | "Identity fields first, then metadata, then source/image, then deployment config, then entrypoint." |
| GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. Only megalinter for extra coverage." |
| Terraform modules | Everything is optional — but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |
**With Dervish:** one MCP call returns a ~60-token grammar known to match 15/15 existing roles. The agent follows it reliably.
## Quick Start
@ -83,12 +82,34 @@ print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```
## Why not just use a schema?
Many of the things developers build every day **have no formal schema**. They're free-form scripts, config files, or YAML blobs where the structure is emergent convention, not enforced specification. An LLM generating new content in these domains needs to know the convention — but it's never been written down.
Dervish discovers these conventions automatically from existing examples. The domains below are **just examples** of what it can do — the same approach works for any sequential data with an unwritten pattern.
| Domain | What gets extracted | Example extracted symbols | What Dervish discovers | Why it helps an LLM |
|--------|-------------------|--------------------------|----------------------|---------------------|
| Ansible roles | Module names from `tasks/main.yml` in order | `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `npm`, `pip`, `lineinfile` | `fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?` | "Validate preconditions first, then set vars, install packages, configure with templates, start services. Include sub-roles last." |
| Helm charts | K8s resource kinds from `helm template` output in rendered order | `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager` | `ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment` (iDRegEx minimal core) | "Every Prometheus stack needs this bootstrap pipeline. Everything else is optional." |
| GitHub Actions (Go lint) | Step `uses:` or `run:` values from workflow YAML in job order | `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter` | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | "Starting a new Go project? The lint workflow has a near-universal pattern." |
| Terraform modules | Resource type strings from `.tf` files in declaration order | `aws_vpc`, `aws_subnet`, `aws_route_table`, `aws_internet_gateway`, `aws_security_group`, `aws_instance`, `aws_s3_bucket` | Everything optional (domains too different), but certain types always cluster together | "If you see `aws_vpc`, expect subnets, route tables, gateways to follow. The grammar encodes each domain's resource catalogue." |
## Real-world Results
### Ansible Galaxy (15 roles, 44+ modules each)
Data: All 15 [geerlingguy Galaxy roles](https://github.com/geerlingguy) — nginx, php, mysql, docker, etc.
Each role's `tasks/main.yml` is parsed into a sequence of module names. Here are the sequences from two roles:
```
docker: fail → include_vars → include_tasks → package → package → package → ...
nginx: fail → include_vars → set_fact → package → file → template → service → ...
```
The extracted symbols are Ansible module names like `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `systemd`, `get_url`, `shell`, `npm`, `pip`, `lineinfile`, `copy`, `unarchive`, `yum`, `apt`, `command`, `user`, `group`, `git`, `mount`, `cron`, `debug`, `iptables`, `ufw`, `hostname`, `sysctl`, `timezone`, `selinux`, `firewalld`, `homebrew`, `supervisorctl`, `postgresql_db`, `mysql_db` — 50+ unique modules across the 15 roles.
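The extraction step can be sketched as follows, assuming `tasks/main.yml` has already been parsed into a list of task dictionaries (for example by a YAML loader). The function name `module_sequence` and the keyword list are hypothetical and deliberately incomplete; this is not the extractor Dervish itself uses.

```python
# Task-level keywords that are not modules (a partial, assumed list).
TASK_KEYWORDS = {
    "name", "when", "become", "become_user", "tags", "register", "notify",
    "loop", "with_items", "vars", "args", "changed_when", "failed_when",
    "ignore_errors", "delegate_to", "environment", "until", "retries",
    "delay", "no_log", "run_once", "check_mode",
}


def module_sequence(tasks):
    """Return the module name of each task, in file order."""
    sequence = []
    for task in tasks:
        for key in task:
            if key not in TASK_KEYWORDS:
                # Reduce a fully qualified name such as
                # "ansible.builtin.package" to "package".
                sequence.append(key.rsplit(".", 1)[-1])
                break
    return sequence


# Made-up tasks standing in for a parsed tasks/main.yml.
tasks = [
    {"name": "Check OS", "fail": {"msg": "unsupported"}, "when": "ansible_os_family == 'Other'"},
    {"name": "Load vars", "include_vars": "{{ ansible_os_family }}.yml"},
    {"name": "Install", "ansible.builtin.package": {"name": "nginx", "state": "present"}},
    {"name": "Start", "service": {"name": "nginx", "state": "started"}},
]

print(module_sequence(tasks))  # ['fail', 'include_vars', 'package', 'service']
```

The sketch treats the first non-keyword key of each task as its module. It does not descend into `block:` sections, which a real extractor would need to handle.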
```
Best: CRX (MDL 288, 15/15 match)
Grammar:
@ -104,7 +125,15 @@ This is the first explicit description of the geerlingguy role module ordering c
### Helm (kube-prometheus-stack, 6 CI configs)
Data: 6 different `values.yaml` configurations rendered through `helm template`.
Data: 6 different `values.yaml` configurations rendered through `helm template`. Each config produces a sequence of K8s `kind` values in rendered YAML order:
```
config-1: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ServiceMonitor → PrometheusRule
config-2: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ConfigMap → ServiceMonitor
config-3: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → Alertmanager → Prometheus
```
Extracted symbols: `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager`, `Prometheus`, `PrometheusRule`, `ServiceMonitor`, `Role`, `RoleBinding`, `Job`, `DaemonSet`, `Secret`, `ValidatingWebhookConfiguration` — 19 kinds total.
```
Best: iDRegEx (MDL 1433)
@ -118,21 +147,17 @@ iDRegEx finds the **minimum core** — what every config always deploys. CRX cap
- **CRX** tells an agent generating a new chart what resources it *might* need.
- **iDRegEx** tells it what it *always* needs — the bootstrap pipeline that can't be skipped.
### Portainer templates (47 templates)
Data: Official Portainer app templates from the [portainer/templates](https://github.com/portainer/templates) repo.
```
Best: CRX (MDL 1282)
Grammar: (type+title)+.(categories+description+image+logo+name+note+platform)+.
repository?.(env+ports+privileged+volumes)+?.command?
```
Template fields follow a consistent arc: identity (`type`, `title`) → metadata (`description`, `categories`, `platform`, `logo`) → source (`image`, `repository`) → deployment (`ports`, `volumes`, `env`) → entrypoint (`command`). 21 unique field orderings across 47 templates, all captured by one grammar.
### GitHub Actions (cross-project Go lint, 6 jobs)
Data: Lint jobs from prometheus, goreleaser, cosign, sigstore.
Data: Lint jobs from prometheus, goreleaser, cosign, sigstore. Each job's steps are extracted as `uses:` or `run:` values:
```
prometheus lint: actions/checkout → actions/setup-go → run:sudo → run:echo → golangci/golangci-lint-action → golangci/golangci-lint-action → ...
goreleaser lint: actions/checkout → actions/setup-go → gitleaks/gitleaks-action → golangci/golangci-lint-action
cosign lint: actions/checkout → ossf/scorecard-action → actions/upload-artifact → github/codeql-action/upload-sarif
```
Extracted symbols: `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter`, `gitleaks/gitleaks-action`, `ossf/scorecard-action`, `github/codeql-action/*`, and `run:*` commands.
```
Best: CRX (MDL 13.6)
@ -143,7 +168,15 @@ Every Go project's lint CI follows: checkout → setup Go → run golangci-lint.
### Terraform (8 AWS modules, 156+ resources each)
Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules.
Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules from hashicorp and terraform-aws-modules. Each `.tf` file is parsed for `resource` declarations in order:
```
vpc module: data:vpc_endpoint_service → vpc → vpc_endpoint → vpc_endpoint_route_table_association → egress_only_internet_gateway → route_table → route → subnet → ...
ec2 module: data:partition → data:ssm_parameter → instance → spot_instance_request → ec2_tag → ebs_volume → volume_attachment → data:iam_policy_document → iam_role → iam_role_policy_attachment → iam_instance_profile → ...
s3 module: iam_role → data:iam_policy_document → iam_policy → data:partition → s3_bucket → s3_bucket_versioning → s3_bucket_logging → s3_bucket_server_side_encryption → ...
```
Extracted symbols: `aws_vpc`, `aws_subnet`, `aws_route_table`, `aws_internet_gateway`, `aws_nat_gateway`, `aws_vpn_gateway`, `aws_security_group`, `aws_security_group_rule`, `aws_instance`, `aws_eip`, `aws_ebs_volume`, `aws_s3_bucket`, `aws_s3_bucket_versioning`, `aws_s3_bucket_logging`, `aws_iam_role`, `aws_iam_policy`, `aws_autoscaling_group`, `aws_launch_configuration`, `random_pet`, `null_resource` — 30+ types across modules.
```
Best: CRX (MDL 1876)
@ -182,129 +215,6 @@ The sweet spot: **multiple implementations of the same abstract task** (like "de
| 2-3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
## Domain Adapters
### Ansible Roles
```python
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences

# Collect module sequences for every role, grouped by category
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]  # drop the role names, keep the sequences
    result = infer_ensemble(seqs)
    print(f"── {cat} ({len(items)} roles) ──")
    print(f" Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
    print(f" Grammar: {result['best']['grammar']}")
```
**Example** (15 geerlingguy Galaxy roles):
```
── other (15 roles) ──
Best: CRX (MDL 288, 15/15 match)
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?
Why: CRX matches 15/15 sequences, iDRegEx matches 3/15. CRX selected.
```
### Helm Charts
```python
import subprocess
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    # Render the chart with this values file
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    # Keep the `kind` of each rendered document, in output order
    kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
             if d and isinstance(d, dict) and 'kind' in d]
    if kinds:
        seqs.append(kinds)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```
**Example** (kube-prometheus-stack, 6 CI configs):
```
Best: iDRegEx (MDL 1433)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
iDRegEx selected (MDL score 1433.0).
```
### Terraform
```python
import re
from pathlib import Path

from bex.ensemble import infer_ensemble

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    # Resource types in declaration order, one sequence per .tf file
    resources = re.findall(r'resource "(\w+)" "\w+" \{', tf.read_text())
    if resources:
        seqs.append(resources)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```
**Example** (8 terraform-aws-* modules):
```
Best: CRX (MDL 1876)
Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?....
Why: CRX matches 8/8 sequences. iDRegEx returned ∅ (no common core across modules).
```
### Portainer Templates
```python
import json
import urllib.request

from bex.ensemble import infer_ensemble

url = "https://raw.githubusercontent.com/portainer/templates/master/templates.json"
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read())

# The file is either a bare list or an object with a "templates" key
templates = data if isinstance(data, list) else data.get('templates', [])
seqs = [list(t.keys()) for t in templates]  # field names in file order

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```
### GitHub Actions
```python
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
for wf_file in Path('.github/workflows/').glob('*.yml'):
    data = yaml.safe_load(wf_file.read_text())
    for job in data.get('jobs', {}).values():
        seq = []
        for s in job.get('steps', []):
            if 'uses' in s:
                seq.append(s['uses'])
            elif s.get('run', '').split():
                # Keep only the first word of a run command, e.g. "run:sudo"
                seq.append('run:' + s['run'].split()[0])
        if seq:
            seqs.append(seq)

result = infer_ensemble(seqs)
```
## How MDL scoring works
```


@ -46,27 +46,7 @@ vocabulary (19 kinds). Which one an agent uses depends on the task:
- Bootstrapping a new cluster: iDRegEx — what you can't skip
- Writing a complete chart: CRX — everything you might need
## 3. Portainer templates (47 templates)
Official Portainer app templates from portainer/templates:
```
Best: CRX | MDL 1282
Grammar: (type+title)+.
(categories+description+image+logo+name+note+platform)+.
repository?.(env+ports+privileged+volumes)+?.command?
```
Field ordering convention: identity (`type`, `title`) → metadata
(`description`, `categories`, `platform`, `logo`) → source
(`image`, `repository`) → deployment (`ports`, `volumes`, `env`) →
entrypoint (`command`). 21 unique orderings, one grammar.
**Why it helps an LLM:** Writing a Portainer template needs the right
field order. The grammar tells you: identity first, then metadata,
then source, then deployment config.
## 4. GitHub Actions (cross-project Go lint, 6 jobs)
## 3. GitHub Actions (cross-project Go lint, 6 jobs)
Lint jobs from prometheus, goreleaser, cosign, sigstore:
@ -82,7 +62,7 @@ Only the biggest add megalinter.
**Why it helps an LLM:** Starting a new Go project? The lint workflow
has a near-universal pattern.
## 5. Terraform (8 AWS modules)
## 4. Terraform (8 AWS modules)
Terraform modules by hashicorp and terraform-aws-modules:


@ -191,7 +191,7 @@ depending on the data:
|---------|--------|-----|
| Ansible galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Helm prom-stack (6 configs) | **iDRegEx** | Finds minimal core across all configs |
| Portainer templates (47) | CRX | iDRegEx returned ∅ (no single common field) |
| Terraform modules (8) | CRX | iDRegEx returned ∅ (no common core across domains) |
| Terraform modules (8) | CRX | Every resource type optional across domains |
| GitHub Actions Go lint (6) | CRX | Tight pattern, all match |