purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types
parent 097dfc9954
commit 25d844d1f9
3 changed files with 83 additions and 193 deletions
250
README.md
|
|
@@ -1,17 +1,17 @@
|
|||
# Dervish
|
||||
|
||||

|
||||
<p align="center"><img src="dervish.gif" alt="Dervish"></p>
|
||||
|
||||
**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that describes the general pattern.
|
||||
|
||||
## MCP Server
|
||||
|
||||
The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (Claude, opencode, etc.) and get grammar inference as a tool:
|
||||
The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (pi.dev, opencode, vibe, etc.) and get grammar inference as a tool:
|
||||
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"grammar-inference": {
|
||||
"dervish": {
|
||||
"command": "python3",
|
||||
"args": ["/path/to/bex/mcp_server.py"]
|
||||
}
|
||||
|
|
@@ -21,46 +21,45 @@ The primary interface is a **Model Context Protocol (MCP)** server. Connect any
|
|||
|
||||
### Tools
|
||||
|
||||
| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. `prefer='crx'` or `prefer='idregex'` to skip the comparison and return only that algorithm. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
|
||||
| Tool | Parameters | What it does |
|------|-----------|-------------|
| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N` | **Recommended.** Runs CRX + iDRegEx, picks best by MDL. Set `prefer='crx'` or `prefer='idregex'` to run one algorithm. |
| `infer_grammar` | `sequences`, `method`, `kmax`, `N` | Core single-algorithm inference. `method='crx'` (fast, deterministic) or `method='idregex'` (probabilistic EM). |
|
||||
|
||||
**Parameters explained:**
|
||||
- **`kmax`** (1–5): Context window for iDRegEx's k-testable automaton. Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
|
||||
- **`N`** (1–10): Baum-Welch EM iterations for iDRegEx training. More iterations = better convergence but slower. Default 3 is a good balance.
|
||||
- **`prefer`**: Skip the CRX-vs-iDRegEx comparison. Use when you know which algorithm fits your data.
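To make the parameters above concrete, here is a minimal sketch of the JSON-RPC `tools/call` request an MCP client might send for `infer_best_grammar`. The helper name `build_request`, the clamping to the documented ranges, the list-of-lists shape of `sequences`, and the sample sequences are assumptions for illustration; the server's own validation may differ.

```python
import json


def build_request(sequences, prefer=None, kmax=2, N=3, request_id=1):
    """Build an MCP `tools/call` request for infer_best_grammar (hypothetical helper)."""
    # Clamp to the ranges documented above: kmax in 1..5, N in 1..10.
    kmax = max(1, min(5, kmax))
    N = max(1, min(10, N))
    arguments = {"sequences": sequences, "kmax": kmax, "N": N}
    if prefer is not None:
        if prefer not in ("crx", "idregex"):
            raise ValueError("prefer must be 'crx' or 'idregex'")
        arguments["prefer"] = prefer
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "infer_best_grammar", "arguments": arguments},
    }


request = build_request(
    [["fail", "package", "service"], ["package", "template", "service"]],
    prefer="crx",
    kmax=9,  # out of range on purpose; clamped to 5
)
print(json.dumps(request, indent=2))
```

Leaving `prefer` unset keeps it out of the arguments entirely, which corresponds to letting the ensemble compare both algorithms.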
|
||||
|
||||
### Agent workflow
|
||||
|
||||
An LLM agent uses the MCP to discover an unwritten convention from existing examples:
|
||||
An LLM agent uses the MCP to discover an unwritten convention from existing examples — compressing hundreds of files into a single ~60-token rule:
|
||||
|
||||
```
|
||||
User: Generate a new Ansible role for installing PostgreSQL.
|
||||
Agent: I'll first check 15 existing geerlingguy roles to find the structural pattern.
|
||||
[calls infer_best_grammar with 15 role sequences, prefer='crx']
|
||||
|
||||
Best: CRX (MDL 288)
|
||||
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
|
||||
.include+?.(npm+pip)+?.lineinfile?
|
||||
Agent: Let me check what pattern the existing community roles follow.
|
||||
I'll look at 15 popular geerlingguy roles.
|
||||
|
||||
Convention: check preconditions → OS-specific vars → install packages →
|
||||
configure templates → start services → handle language tooling.
|
||||
[finds role directories, extracts task module sequences,
|
||||
calls infer_best_grammar(sequences=..., prefer='crx')]
|
||||
|
||||
Dervish returns:
|
||||
Best: CRX (MDL 288)
|
||||
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
|
||||
.include+?.(npm+pip)+?.lineinfile?
|
||||
|
||||
This tells me: a role may start with a fail check for preconditions,
then sets OS-specific variables, installs packages, configures with templates,
starts services, and optionally handles language tooling (npm/pip).
A role may end with an optional lineinfile tweak.
|
||||
|
||||
I'll generate the new role following this structure.
|
||||
```
|
||||
|
||||
Without the MCP: 15 role files in context (5,000+ tokens) or guesswork.
|
||||
With the MCP: one grammar rule (~60 tokens), known to match 15/15 existing roles.
|
||||
**Without Dervish:** the agent stuffs 15 role files into context (at 5,000+ tokens per role, 75,000+ tokens in total), or guesses the pattern from 1–2 examples and often gets it wrong.
|
||||
|
||||
## Why grammar inference?
|
||||
|
||||
There are many domains where developers follow **unwritten conventions** — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.
|
||||
|
||||
Grammar inference automatically discovers these conventions from examples:
|
||||
|
||||
| Domain | Unwritten convention | What the grammar tells an LLM |
|--------|---------------------|-------------------------------|
| Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
| Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
| Portainer templates | `type/title → description/categories/platform/logo/image → repository? → env/ports/volumes? → command?` | "Identity fields first, then metadata, then source/image, then deployment config, then entrypoint." |
| GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. Only megalinter for extra coverage." |
| Terraform modules | Everything is optional, but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |
|
||||
**With Dervish:** one MCP call returns a ~60-token grammar known to match 15/15 existing roles. The agent follows it reliably.
|
||||
|
||||
## Quick Start
|
||||
|
||||
|
|
@@ -83,12 +82,34 @@ print(f"Grammar: {result['best']['grammar']}")
|
|||
print(f"Score: {result['best']['mdl_score']}")
|
||||
```
|
||||
|
||||
## Why not just use a schema?
|
||||
|
||||
Many of the things developers build every day **have no formal schema**. They're free-form scripts, config files, or YAML blobs where the structure is emergent convention, not enforced specification. An LLM generating new content in these domains needs to know the convention — but it's never been written down.
|
||||
|
||||
Dervish discovers these conventions automatically from existing examples. The domains below are **just examples** of what it can do — the same approach works for any sequential data with an unwritten pattern.
|
||||
|
||||
| Domain | What gets extracted | Example extracted symbols | What Dervish discovers | Why it helps an LLM |
|--------|-------------------|--------------------------|----------------------|---------------------|
| Ansible roles | Module names from `tasks/main.yml` in order | `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `npm`, `pip`, `lineinfile` | `fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?` | "Validate preconditions first, then set vars, install packages, configure with templates, start services. Include sub-roles last." |
| Helm charts | K8s resource kinds from `helm template` output in rendered order | `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager` | `ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment` (iDRegEx minimal core) | "Every Prometheus stack needs this bootstrap pipeline. Everything else is optional." |
| GitHub Actions (Go lint) | Step `uses:` or `run:` values from workflow YAML in job order | `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter` | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | "Starting a new Go project? The lint workflow has a near-universal pattern." |
| Terraform modules | Resource type strings from `.tf` files in declaration order | `aws_vpc`, `aws_subnet`, `aws_route_table`, `aws_internet_gateway`, `aws_security_group`, `aws_instance`, `aws_s3_bucket` | Everything optional (domains too different), but certain types always cluster together | "If you see `aws_vpc`, expect subnets, route tables, gateways to follow. The grammar encodes each domain's resource catalogue." |
|
||||
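One way to put a discovered grammar to work is to check new sequences against it. The sketch below hand-translates the GitHub Actions grammar from the table into a Python regular expression. It assumes that `.` means concatenation, `+` inside parentheses means alternation, a trailing `+` means one-or-more, and `?` means optional; that reading is inferred from the examples and is not confirmed by the project. The helper names are hypothetical.

```python
import re


def token(symbol):
    # Each symbol is followed by one space so adjacent symbols cannot merge.
    return re.escape(symbol) + " "


def any_of(*symbols):
    # Alternation over whole symbols.
    return "(?:" + "|".join(token(s) for s in symbols) + ")"


# Hand translation of:
# actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?
PATTERN = re.compile(
    token("actions/checkout")
    + any_of("actions/setup-go", "run:echo", "run:sudo") + "+"
    + "(?:" + token("golangci/golangci-lint-action") + ")?"
    + "(?:" + token("megalinter") + ")?"
)


def matches(sequence):
    """Return True if the whole symbol sequence fits the pattern."""
    return PATTERN.fullmatch("".join(s + " " for s in sequence)) is not None


print(matches(["actions/checkout", "actions/setup-go", "golangci/golangci-lint-action"]))
print(matches(["actions/setup-go", "actions/checkout"]))
```

Encoding each symbol with a trailing separator keeps the match symbol-aligned, so `run:echo` can never match as a prefix of a longer symbol.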
|
||||
## Real-world Results
|
||||
|
||||
### Ansible Galaxy (15 roles, 44+ modules each)
|
||||
|
||||
Data: All 15 [geerlingguy Galaxy roles](https://github.com/geerlingguy) — nginx, php, mysql, docker, etc.
|
||||
|
||||
Each role's `tasks/main.yml` is parsed into a sequence of module names. Here are the sequences from two roles:
|
||||
|
||||
```
|
||||
docker: fail → include_vars → include_tasks → package → package → package → ...
|
||||
nginx: fail → include_vars → set_fact → package → file → template → service → ...
|
||||
```
|
||||
|
||||
The extracted symbols are Ansible module names like `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `systemd`, `get_url`, `shell`, `npm`, `pip`, `lineinfile`, `copy`, `unarchive`, `yum`, `apt`, `command`, `user`, `group`, `git`, `mount`, `cron`, `debug`, `iptables`, `ufw`, `hostname`, `sysctl`, `timezone`, `selinux`, `firewalld`, `homebrew`, `supervisorctl`, `postgresql_db`, `mysql_db` — 50+ unique modules across the 15 roles.
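As a rough sketch of the extraction step, the function below maps already-parsed task dictionaries to module names. It is not the project's extractor: the `TASK_KEYWORDS` set is a partial, assumed list, nested `block:` tasks are not handled, and stripping a collection prefix such as `ansible.builtin.` is an illustrative choice.

```python
# Keys that describe a task rather than name its module (partial, assumed list).
TASK_KEYWORDS = {
    "name", "when", "tags", "register", "become", "become_user", "vars",
    "notify", "loop", "with_items", "ignore_errors", "changed_when",
    "failed_when", "delegate_to", "environment", "args", "no_log",
}


def module_sequence(tasks):
    """Return the module name of each task, in file order."""
    sequence = []
    for task in tasks:
        for key in task:
            if key not in TASK_KEYWORDS:
                # Drop a collection prefix such as `ansible.builtin.`
                sequence.append(key.rsplit(".", 1)[-1])
                break
    return sequence


tasks = [
    {"name": "Check OS", "fail": {"msg": "unsupported"}, "when": "ansible_os_family == 'Solaris'"},
    {"name": "Load vars", "include_vars": "{{ ansible_os_family }}.yml"},
    {"name": "Install", "ansible.builtin.package": {"name": "nginx", "state": "present"}},
    {"name": "Start", "service": {"name": "nginx", "state": "started"}, "become": True},
]
print(module_sequence(tasks))  # ['fail', 'include_vars', 'package', 'service']
```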
|
||||
|
||||
```
|
||||
Best: CRX (MDL 288, 15/15 match)
|
||||
Grammar:
|
||||
|
|
@@ -104,7 +125,15 @@ This is the first explicit description of the geerlingguy role module ordering c
|
|||
|
||||
### Helm (kube-prometheus-stack, 6 CI configs)
|
||||
|
||||
Data: 6 different `values.yaml` configurations rendered through `helm template`.
|
||||
Data: 6 different `values.yaml` configurations rendered through `helm template`. Each config produces a sequence of K8s `kind` values in rendered YAML order:
|
||||
|
||||
```
|
||||
config-1: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ServiceMonitor → PrometheusRule
|
||||
config-2: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ConfigMap → ServiceMonitor
|
||||
config-3: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → Alertmanager → Prometheus
|
||||
```
|
||||
|
||||
Extracted symbols: `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager`, `Prometheus`, `PrometheusRule`, `ServiceMonitor`, `Role`, `RoleBinding`, `Job`, `DaemonSet`, `Secret`, `ValidatingWebhookConfiguration` — 19 kinds total.
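For readers without a YAML parser to hand, the kind sequence can be approximated with a regular expression over the rendered text. This is a sketch under the assumption that every top-level `kind:` starts in column 0, which holds for typical `helm template` output but is not guaranteed; the sample manifest is made up.

```python
import re

# Only unindented `kind:` lines count; nested ones (e.g. under roleRef) are skipped.
KIND_LINE = re.compile(r"^kind:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def kind_sequence(rendered):
    """Return top-level resource kinds in the order they appear."""
    return KIND_LINE.findall(rendered)


rendered = """\
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: demo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
roleRef:
  kind: ClusterRole
  name: demo
---
apiVersion: v1
kind: Service
"""
print(kind_sequence(rendered))  # ['ServiceAccount', 'ClusterRoleBinding', 'Service']
```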
|
||||
|
||||
```
|
||||
Best: iDRegEx (MDL 1433)
|
||||
|
|
@@ -118,21 +147,17 @@ iDRegEx finds the **minimum core** — what every config always deploys. CRX cap
|
|||
- **CRX** tells an agent generating a new chart what resources it *might* need.
|
||||
- **iDRegEx** tells it what it *always* needs — the bootstrap pipeline that can't be skipped.
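The "always needed" core can be sanity-checked by hand. The sketch below is not how iDRegEx works; it only computes the longest common prefix of the three sample config sequences listed earlier in this README, which happens to coincide with the reported core.

```python
def common_prefix(sequences):
    """Longest prefix shared by every sequence."""
    prefix = []
    for column in zip(*sequences):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix


configs = [
    ["ServiceAccount", "ClusterRole", "ClusterRoleBinding", "Service",
     "Deployment", "ServiceMonitor", "PrometheusRule"],
    ["ServiceAccount", "ClusterRole", "ClusterRoleBinding", "Service",
     "Deployment", "ConfigMap", "ServiceMonitor"],
    ["ServiceAccount", "ClusterRole", "ClusterRoleBinding", "Service",
     "Deployment", "Alertmanager", "Prometheus"],
]
print(".".join(common_prefix(configs)))
# ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
```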
|
||||
|
||||
### Portainer templates (47 templates)
|
||||
|
||||
Data: Official Portainer app templates from the [portainer/templates](https://github.com/portainer/templates) repo.
|
||||
|
||||
```
|
||||
Best: CRX (MDL 1282)
|
||||
Grammar: (type+title)+.(categories+description+image+logo+name+note+platform)+.
|
||||
repository?.(env+ports+privileged+volumes)+?.command?
|
||||
```
|
||||
|
||||
Template fields follow a consistent arc: identity (`type`, `title`) → metadata (`description`, `categories`, `platform`, `logo`) → source (`image`, `repository`) → deployment (`ports`, `volumes`, `env`) → entrypoint (`command`). 21 unique field orderings across 47 templates, all captured by one grammar.
|
||||
|
||||
### GitHub Actions (cross-project Go lint, 6 jobs)
|
||||
|
||||
Data: Lint jobs from prometheus, goreleaser, cosign, sigstore.
|
||||
Data: Lint jobs from prometheus, goreleaser, cosign, sigstore. Each job's steps are extracted as `uses:` or `run:` values:
|
||||
|
||||
```
|
||||
prometheus lint: actions/checkout → actions/setup-go → run:sudo → run:echo → golangci/golangci-lint-action → golangci/golangci-lint-action → ...
|
||||
goreleaser lint: actions/checkout → actions/setup-go → gitleaks/gitleaks-action → golangci/golangci-lint-action
|
||||
cosign lint: actions/checkout → ossf/scorecard-action → actions/upload-artifact → github/codeql-action/upload-sarif
|
||||
```
|
||||
|
||||
Extracted symbols: `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter`, `gitleaks/gitleaks-action`, `ossf/scorecard-action`, `github/codeql-action/*`, and `run:*` commands.
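A minimal sketch of the step-to-symbol mapping, working on steps that have already been parsed into dictionaries. Dropping the `@version` pin and keeping only the first word of a `run:` command are assumptions chosen to reproduce symbols like `actions/checkout` and `run:sudo`; the function names and sample steps are hypothetical.

```python
def step_symbol(step):
    """Map one workflow step (a dict) to a symbol, or None if it has neither key."""
    if "uses" in step:
        # Drop the version pin: actions/checkout@v4 -> actions/checkout
        return step["uses"].split("@", 1)[0]
    if "run" in step:
        words = step["run"].split()
        return "run:" + words[0] if words else None
    return None


def job_sequence(steps):
    """Return the symbols for a job's steps, skipping steps with no symbol."""
    symbols = (step_symbol(s) for s in steps)
    return [s for s in symbols if s is not None]


steps = [
    {"uses": "actions/checkout@v4"},
    {"uses": "actions/setup-go@v5", "with": {"go-version": "stable"}},
    {"name": "Install deps", "run": "sudo apt-get install -y libfoo"},
    {"run": "echo done"},
    {"uses": "golangci/golangci-lint-action@v6"},
]
print(job_sequence(steps))
```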
|
||||
|
||||
```
|
||||
Best: CRX (MDL 13.6)
|
||||
|
|
@@ -143,7 +168,15 @@ Every Go project's lint CI follows: checkout → setup Go → run golangci-lint.
|
|||
|
||||
### Terraform (8 AWS modules, 156+ resources each)
|
||||
|
||||
Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules.
|
||||
Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules from hashicorp and terraform-aws-modules. Each `.tf` file is parsed for `resource` declarations in order:
|
||||
|
||||
```
|
||||
vpc module: data:vpc_endpoint_service → vpc → vpc_endpoint → vpc_endpoint_route_table_association → egress_only_internet_gateway → route_table → route → subnet → ...
|
||||
ec2 module: data:partition → data:ssm_parameter → instance → spot_instance_request → ec2_tag → ebs_volume → volume_attachment → data:iam_policy_document → iam_role → iam_role_policy_attachment → iam_instance_profile → ...
|
||||
s3 module: iam_role → data:iam_policy_document → iam_policy → data:partition → s3_bucket → s3_bucket_versioning → s3_bucket_logging → s3_bucket_server_side_encryption → ...
|
||||
```
|
||||
|
||||
Extracted symbols: `aws_vpc`, `aws_subnet`, `aws_route_table`, `aws_internet_gateway`, `aws_nat_gateway`, `aws_vpn_gateway`, `aws_security_group`, `aws_security_group_rule`, `aws_instance`, `aws_eip`, `aws_ebs_volume`, `aws_s3_bucket`, `aws_s3_bucket_versioning`, `aws_s3_bucket_logging`, `aws_iam_role`, `aws_iam_policy`, `aws_autoscaling_group`, `aws_launch_configuration`, `random_pet`, `null_resource` — 30+ types across modules.
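The declaration-order extraction can be sketched with a regular expression over the `.tf` source. This is an approximation, not an HCL parser: it assumes block headers sit on one line, and the `data:` prefix for data sources mirrors the sample sequences above. Whether the `aws_` prefix is kept or stripped is left as-is here.

```python
import re

# Matches `resource "TYPE" "NAME"` and `data "TYPE" "NAME"` block headers.
BLOCK = re.compile(r'^[ \t]*(resource|data)\s+"([^"]+)"\s+"[^"]+"', re.MULTILINE)


def block_sequence(tf_source):
    """Return resource and data source types in declaration order."""
    sequence = []
    for block_kind, type_name in BLOCK.findall(tf_source):
        sequence.append(type_name if block_kind == "resource" else "data:" + type_name)
    return sequence


tf_source = '''
data "aws_partition" "current" {}

resource "aws_vpc" "this" {
  cidr_block = var.cidr
}

resource "aws_subnet" "public" {
  vpc_id = aws_vpc.this.id
}
'''
print(block_sequence(tf_source))  # ['data:aws_partition', 'aws_vpc', 'aws_subnet']
```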
|
||||
|
||||
```
|
||||
Best: CRX (MDL 1876)
|
||||
|
|
@@ -182,129 +215,6 @@ The sweet spot: **multiple implementations of the same abstract task** (like "de
|
|||
| 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
|
||||
| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
|
||||
|
||||
## Domain Adapters
|
||||
|
||||
### Ansible Roles
|
||||
|
||||
```python
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences

# Collect module sequences for every role, grouped by category
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]  # keep the sequences, drop the role names
    result = infer_ensemble(seqs)
    print(f"── {cat} ({len(items)} roles) ──")
    print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
    print(f"  Grammar: {result['best']['grammar']}")
```
|
||||
|
||||
**Example** (15 geerlingguy Galaxy roles):
|
||||
|
||||
```
|
||||
── other (15 roles) ──
|
||||
Best: CRX (MDL 288, 15/15 match)
|
||||
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?
|
||||
Why: CRX matches 15/15 sequences, iDRegEx matches 3/15. CRX selected.
|
||||
```
|
||||
|
||||
### Helm Charts
|
||||
|
||||
```python
import subprocess
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    # Render the chart once per values file
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    # Keep the `kind` of each rendered document, in output order
    kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
             if d and isinstance(d, dict) and 'kind' in d]
    if kinds:
        seqs.append(kinds)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```
|
||||
|
||||
**Example** (kube-prometheus-stack, 6 CI configs):
|
||||
|
||||
```
|
||||
Best: iDRegEx (MDL 1433)
|
||||
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
|
||||
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
CRX MDL= 2651.74 (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?...
|
||||
|
||||
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
|
||||
iDRegEx selected (MDL score 1433.0).
|
||||
```
|
||||
|
||||
### Terraform
|
||||
|
||||
```python
import re
from pathlib import Path

from bex.ensemble import infer_ensemble

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    # One sequence per file: resource types in declaration order.
    # Resource names may contain hyphens, so match [\w-]+ rather than \w+.
    resources = re.findall(r'resource\s+"(\w+)"\s+"[\w-]+"\s*\{', tf.read_text())
    if resources:
        seqs.append(resources)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```
|
||||
|
||||
**Example** (8 terraform-aws-* modules):
|
||||
|
||||
```
|
||||
Best: CRX (MDL 1876)
|
||||
Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?....
|
||||
Why: CRX matches 8/8 sequences. iDRegEx returned ∅ (no common core across modules).
|
||||
```
|
||||
|
||||
### Portainer Templates
|
||||
|
||||
```python
import json
import urllib.request

from bex.ensemble import infer_ensemble

url = "https://raw.githubusercontent.com/portainer/templates/master/templates.json"
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read())

# The file is either a bare list or an object with a `templates` key
templates = data if isinstance(data, list) else data.get('templates', [])
# One sequence per template: its field names in file order
seqs = [list(t.keys()) for t in templates]

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```
|
||||
|
||||
### GitHub Actions
|
||||
|
||||
```python
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
for wf_file in Path('.github/workflows/').glob('*.yml'):
    data = yaml.safe_load(wf_file.read_text())
    for job in data.get('jobs', {}).values():
        if 'steps' not in job:
            continue
        # Prefer the action name; otherwise use the first word of the run command.
        # The run branch is only evaluated when `uses` is absent, and an empty
        # command yields 'run:' instead of raising IndexError.
        seq = [s['uses'] if 'uses' in s else 'run:' + (s['run'].split() or [''])[0]
               for s in job['steps'] if 'uses' in s or 'run' in s]
        if seq:
            seqs.append(seq)

result = infer_ensemble(seqs)
```
|
||||
|
||||
## How MDL scoring works
|
||||
|
||||
```
|
||||
|
|
|
|||
24
SHOWCASE.md
|
|
@@ -46,27 +46,7 @@ vocabulary (19 kinds). Which one an agent uses depends on the task:
|
|||
- Bootstrapping a new cluster: iDRegEx — what you can't skip
|
||||
- Writing a complete chart: CRX — everything you might need
|
||||
|
||||
## 3. Portainer templates (47 templates)
|
||||
|
||||
Official Portainer app templates from portainer/templates:
|
||||
|
||||
```
|
||||
Best: CRX | MDL 1282
|
||||
Grammar: (type+title)+.
|
||||
(categories+description+image+logo+name+note+platform)+.
|
||||
repository?.(env+ports+privileged+volumes)+?.command?
|
||||
```
|
||||
|
||||
Field ordering convention: identity (`type`, `title`) → metadata
|
||||
(`description`, `categories`, `platform`, `logo`) → source
|
||||
(`image`, `repository`) → deployment (`ports`, `volumes`, `env`) →
|
||||
entrypoint (`command`). 21 unique orderings, one grammar.
|
||||
|
||||
**Why it helps an LLM:** Writing a Portainer template needs the right
|
||||
field order. The grammar tells you: identity first, then metadata,
|
||||
then source, then deployment config.
|
||||
|
||||
## 4. GitHub Actions (cross-project Go lint, 6 jobs)
|
||||
## 3. GitHub Actions (cross-project Go lint, 6 jobs)
|
||||
|
||||
Lint jobs from prometheus, goreleaser, cosign, sigstore:
|
||||
|
||||
|
|
@@ -82,7 +62,7 @@ Only the biggest add megalinter.
|
|||
**Why it helps an LLM:** Starting a new Go project? The lint workflow
|
||||
has a near-universal pattern.
|
||||
|
||||
## 5. Terraform (8 AWS modules)
|
||||
## 4. Terraform (8 AWS modules)
|
||||
|
||||
Terraform modules by hashicorp and terraform-aws-modules:
|
||||
|
||||
|
|
|
|||
|
|
@@ -191,7 +191,7 @@ depending on the data:
|
|||
|---------|--------|-----|
| Ansible galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Helm prom-stack (6 configs) | **iDRegEx** | Finds minimal core across all configs |
| Portainer templates (47) | CRX | iDRegEx returned ∅ (no single common field) |
| Terraform modules (8) | CRX | iDRegEx returned ∅ (no common core across domains) |
| Terraform modules (8) | CRX | Every resource type optional across domains |
| GitHub Actions Go lint (6) | CRX | Tight pattern, all match |
|
||||
|
||||
|
|
|
|||