diff --git a/README.md b/README.md index ac9d843..364cd89 100644 --- a/README.md +++ b/README.md @@ -1,17 +1,17 @@ # Dervish -![Dervish](dervish.gif) +

Dervish

**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that describes the general pattern. ## MCP Server -The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (Claude, opencode, etc.) and get grammar inference as a tool: +The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (pi.dev, opencode, vibe, etc.) and get grammar inference as a tool: ```json { "mcpServers": { - "grammar-inference": { + "dervish": { "command": "python3", "args": ["/path/to/bex/mcp_server.py"] } @@ -21,46 +21,45 @@ The primary interface is a **Model Context Protocol (MCP)** server. Connect any ### Tools -| Tool | What it does | -|------|-------------| -| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference | -| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. `prefer='crx'` or `prefer='idregex'` to skip the comparison and return only that algorithm. | -| `infer_yaml_grammar(yaml_dir, pattern, method)` | YAML → key-paths → grammar | -| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar | +| Tool | Parameters | What it does | +|------|-----------|-------------| +| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N` | **Recommended.** Runs CRX + iDRegEx, picks best by MDL. Set `prefer='crx'` or `prefer='idregex'` to run one algorithm. | +| `infer_grammar` | `sequences`, `method`, `kmax`, `N` | Core single-algorithm inference. `method='crx'` (fast, deterministic) or `method='idregex'` (probabilistic EM). | + +**Parameters explained:** +- **`kmax`** (1–5): Context window for iDRegEx's k-testable automaton. Higher values capture longer-range dependencies but need more data and are slower. 
Default 2 works for most cases. +- **`N`** (1–10): Baum-Welch EM iterations for iDRegEx training. More iterations = better convergence but slower. Default 3 is a good balance. +- **`prefer`**: Skip the CRX-vs-iDRegEx comparison. Use when you know which algorithm fits your data. ### Agent workflow -An LLM agent uses the MCP to discover an unwritten convention from existing examples: +An LLM agent uses the MCP to discover an unwritten convention from existing examples — compressing hundreds of files into a single ~60-token rule: ``` User: Generate a new Ansible role for installing PostgreSQL. -Agent: I'll first check 15 existing geerlingguy roles to find the structural pattern. - [calls infer_best_grammar with 15 role sequences, prefer='crx'] - Best: CRX (MDL 288) - Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+ - .include+?.(npm+pip)+?.lineinfile? +Agent: Let me check what pattern the existing community roles follow. + I'll look at 15 popular geerlingguy roles. - Convention: check preconditions → OS-specific vars → install packages → - configure templates → start services → handle language tooling. + [finds role directories, extracts task module sequences, + calls infer_best_grammar(sequences=..., prefer='crx')] + + Dervish returns: + Best: CRX (MDL 288) + Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+ + .include+?.(npm+pip)+?.lineinfile? + + This tells me: a role optionally starts with a fail check for preconditions, + then sets OS-specific variables, installs packages, configures with templates, + starts services, and optionally handles language tooling (npm/pip). + It may end with an optional lineinfile tweak. + + I'll generate the new role following this structure. ``` -Without the MCP: 15 role files in context (5,000+ tokens) or guesswork. -With the MCP: one grammar rule (~60 tokens), known to match 15/15 existing roles. 
+**Without Dervish:** the agent stuffs 15 role files into context (5,000+ tokens per role, so 75,000+ tokens for all 15), or guesses the pattern from 1–2 examples and often gets it wrong. -## Why grammar inference? - -There are many domains where developers follow **unwritten conventions** — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented. - -Grammar inference automatically discovers these conventions from examples: - -| Domain | Unwritten convention | What the grammar tells an LLM | -|--------|---------------------|-------------------------------| -| Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." | -| Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." | -| Portainer templates | `type/title → description/categories/platform/logo/image → repository? → env/ports/volumes? → command?` | "Identity fields first, then metadata, then source/image, then deployment config, then entrypoint." | -| GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. Only megalinter for extra coverage." | -| Terraform modules | Everything is optional — but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. | +**With Dervish:** one MCP call returns a ~60-token grammar known to match 15/15 existing roles. The agent follows it reliably. 
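As a rough illustration of what "known to match 15/15 existing roles" could mean in practice, the sketch below translates a Dervish-style grammar into a Python `re` pattern and checks a sequence against it. It assumes the notation as read off the examples in this README (`.` concatenates, infix `+` is alternation, postfix `+` is one-or-more, `?` is optional, `+?` is zero-or-more). The helper names `grammar_to_regex` and `matches` are hypothetical and not part of Dervish, and the grammar in the usage lines is a shortened, made-up variant rather than real output.

```python
import re


def grammar_to_regex(grammar):
    """Translate a grammar string into a regex over space-terminated symbols.

    Assumed notation (not confirmed by the Dervish source):
    '.' concatenation, infix '+' alternation, postfix '+' one-or-more,
    '?' optional, '+?' zero-or-more.
    """
    # Operators are single characters; anything else is part of a symbol name.
    tokens = re.findall(r"[().?+]|[^().?+\s]+", grammar)
    parts = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == ".":
            pass  # concatenation is implicit in a Python regex
        elif tok == "(":
            parts.append("(?:")
        elif tok in (")", "?"):
            parts.append(tok)
        elif tok == "+":
            if nxt == "?":
                parts.append("*")  # '+?' read as zero-or-more
                i += 1             # consume the '?' as well
            elif nxt is None or nxt in (".", ")"):
                parts.append("+")  # postfix: one-or-more
            else:
                parts.append("|")  # infix: alternation
        else:
            # Each symbol is matched together with a trailing space separator.
            parts.append("(?:" + re.escape(tok) + " )")
        i += 1
    return re.compile("".join(parts))


def matches(grammar, sequence):
    """Return True if the whole sequence is accepted by the grammar."""
    encoded = "".join(symbol + " " for symbol in sequence)
    return grammar_to_regex(grammar).fullmatch(encoded) is not None


# Illustrative grammar only; symbols must not contain spaces or operators.
example = "fail?.(include_vars+set_fact+package)+.include+?.lineinfile?"
print(matches(example, ["fail", "include_vars", "package", "lineinfile"]))
print(matches(example, ["lineinfile", "package"]))
```

Encoding each symbol with a trailing space keeps multi-character names such as `include_vars` from being confused with their prefixes, at the cost of not supporting symbols that contain spaces.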
## Quick Start @@ -83,12 +82,34 @@ print(f"Grammar: {result['best']['grammar']}") print(f"Score: {result['best']['mdl_score']}") ``` +## Why not just use a schema? + +Many of the things developers build every day **have no formal schema**. They're free-form scripts, config files, or YAML blobs where the structure is emergent convention, not enforced specification. An LLM generating new content in these domains needs to know the convention — but it's never been written down. + +Dervish discovers these conventions automatically from existing examples. The domains below are **just examples** of what it can do — the same approach works for any sequential data with an unwritten pattern. + +| Domain | What gets extracted | Example extracted symbols | What Dervish discovers | Why it helps an LLM | +|--------|-------------------|--------------------------|----------------------|---------------------| +| Ansible roles | Module names from `tasks/main.yml` in order | `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `npm`, `pip`, `lineinfile` | `fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?` | "Validate preconditions first, then set vars, install packages, configure with templates, start services. Include sub-roles last." | +| Helm charts | K8s resource kinds from `helm template` output in rendered order | `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager` | `ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment` (iDRegEx minimal core) | "Every Prometheus stack needs this bootstrap pipeline. Everything else is optional." 
| +| GitHub Actions (Go lint) | Step `uses:` or `run:` values from workflow YAML in job order | `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter` | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | "Starting a new Go project? The lint workflow has a near-universal pattern." | +| Terraform modules | Resource type strings from `.tf` files in declaration order | `aws_vpc`, `aws_subnet`, `aws_route_table`, `aws_internet_gateway`, `aws_security_group`, `aws_instance`, `aws_s3_bucket` | Everything optional (domains too different), but certain types always cluster together | "If you see `aws_vpc`, expect subnets, route tables, gateways to follow. The grammar encodes each domain's resource catalogue." | + ## Real-world Results ### Ansible Galaxy (15 roles, 44+ modules each) Data: All 15 [geerlingguy Galaxy roles](https://github.com/geerlingguy) — nginx, php, mysql, docker, etc. +Each role's `tasks/main.yml` is parsed into a sequence of module names. Here are the sequences from two roles: + +``` +docker: fail → include_vars → include_tasks → package → package → package → ... +nginx: fail → include_vars → set_fact → package → file → template → service → ... +``` + +The extracted symbols are Ansible module names like `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `systemd`, `get_url`, `shell`, `npm`, `pip`, `lineinfile`, `copy`, `unarchive`, `yum`, `apt`, `command`, `user`, `group`, `git`, `mount`, `cron`, `debug`, `iptables`, `ufw`, `hostname`, `sysctl`, `timezone`, `selinux`, `firewalld`, `homebrew`, `supervisorctl`, `postgresql_db`, `mysql_db` — 50+ unique modules across the 15 roles. 
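The extraction step described above can be sketched roughly as follows. This is a minimal sketch and not Dervish's actual extractor: it assumes the tasks have already been parsed into a list of dicts (for example with PyYAML's `yaml.safe_load`), and both the `TASK_KEYWORDS` set and the `module_sequence` helper are hypothetical names introduced here for illustration.

```python
# Task-level keywords that are not module names.
# This list is an assumption and deliberately incomplete.
TASK_KEYWORDS = {
    "name", "when", "loop", "with_items", "register", "tags", "become",
    "notify", "vars", "args", "environment", "delegate_to",
    "changed_when", "failed_when", "ignore_errors",
    "block", "rescue", "always",
}


def module_sequence(tasks):
    """Return the ordered module names used by a list of parsed Ansible tasks.

    The first key that is not a known task keyword is taken as the module.
    Tasks made only of keywords (including block/rescue/always wrappers)
    are skipped rather than recursed into.
    """
    sequence = []
    for task in tasks:
        for key in task:
            if key not in TASK_KEYWORDS:
                # Assumption: shorten 'ansible.builtin.package' to 'package'.
                sequence.append(key.rsplit(".", 1)[-1])
                break
    return sequence


tasks = [
    {"name": "Check preconditions", "fail": {"msg": "unsupported OS"}, "when": "false"},
    {"name": "Load OS variables", "include_vars": "Debian.yml"},
    {"ansible.builtin.package": {"name": "nginx"}, "become": True},
]
print(module_sequence(tasks))
```

The resulting list of lists (one per role) has the same shape as the `sequences` argument shown for `infer_best_grammar` in the tools table.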
+ ``` Best: CRX (MDL 288, 15/15 match) Grammar: @@ -104,7 +125,15 @@ This is the first explicit description of the geerlingguy role module ordering c ### Helm (kube-prometheus-stack, 6 CI configs) -Data: 6 different `values.yaml` configurations rendered through `helm template`. +Data: 6 different `values.yaml` configurations rendered through `helm template`. Each config produces a sequence of K8s `kind` values in rendered YAML order: + +``` +config-1: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ServiceMonitor → PrometheusRule +config-2: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → ConfigMap → ServiceMonitor +config-3: ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment → Alertmanager → Prometheus +``` + +Extracted symbols: `ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`, `Service`, `Deployment`, `ConfigMap`, `Alertmanager`, `Prometheus`, `PrometheusRule`, `ServiceMonitor`, `Role`, `RoleBinding`, `Job`, `DaemonSet`, `Secret`, `ValidatingWebhookConfiguration` — 19 kinds total. ``` Best: iDRegEx (MDL 1433) @@ -118,21 +147,17 @@ iDRegEx finds the **minimum core** — what every config always deploys. CRX cap - **CRX** tells an agent generating a new chart what resources it *might* need. - **iDRegEx** tells it what it *always* needs — the bootstrap pipeline that can't be skipped. -### Portainer templates (47 templates) - -Data: Official Portainer app templates from the [portainer/templates](https://github.com/portainer/templates) repo. - -``` -Best: CRX (MDL 1282) -Grammar: (type+title)+.(categories+description+image+logo+name+note+platform)+. - repository?.(env+ports+privileged+volumes)+?.command? -``` - -Template fields follow a consistent arc: identity (`type`, `title`) → metadata (`description`, `categories`, `platform`, `logo`) → source (`image`, `repository`) → deployment (`ports`, `volumes`, `env`) → entrypoint (`command`). 
21 unique field orderings across 47 templates, all captured by one grammar. - ### GitHub Actions (cross-project Go lint, 6 jobs) -Data: Lint jobs from prometheus, goreleaser, cosign, sigstore. +Data: Lint jobs from prometheus, goreleaser, cosign, sigstore. Each job's steps are extracted as `uses:` or `run:` values: + +``` +prometheus lint: actions/checkout → actions/setup-go → run:sudo → run:echo → golangci/golangci-lint-action → golangci/golangci-lint-action → ... +goreleaser lint: actions/checkout → actions/setup-go → gitleaks/gitleaks-action → golangci/golangci-lint-action +cosign lint: actions/checkout → ossf/scorecard-action → actions/upload-artifact → github/codeql-action/upload-sarif +``` + +Extracted symbols: `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter`, `gitleaks/gitleaks-action`, `ossf/scorecard-action`, `github/codeql-action/*`, and `run:*` commands. ``` Best: CRX (MDL 13.6) @@ -143,7 +168,15 @@ Every Go project's lint CI follows: checkout → setup Go → run golangci-lint. ### Terraform (8 AWS modules, 156+ resources each) -Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules. +Data: `terraform-aws-{vpc,ec2,s3-bucket,autoscaling,security-group}` modules from hashicorp and terraform-aws-modules. Each `.tf` file is parsed for `resource` declarations in order: + +``` +vpc module: data:vpc_endpoint_service → vpc → vpc_endpoint → vpc_endpoint_route_table_association → egress_only_internet_gateway → route_table → route → subnet → ... +ec2 module: data:partition → data:ssm_parameter → instance → spot_instance_request → ec2_tag → ebs_volume → volume_attachment → data:iam_policy_document → iam_role → iam_role_policy_attachment → iam_instance_profile → ... +s3 module: iam_role → data:iam_policy_document → iam_policy → data:partition → s3_bucket → s3_bucket_versioning → s3_bucket_logging → s3_bucket_server_side_encryption → ... 
+``` + +Extracted symbols: `aws_vpc`, `aws_subnet`, `aws_route_table`, `aws_internet_gateway`, `aws_nat_gateway`, `aws_vpn_gateway`, `aws_security_group`, `aws_security_group_rule`, `aws_instance`, `aws_eip`, `aws_ebs_volume`, `aws_s3_bucket`, `aws_s3_bucket_versioning`, `aws_s3_bucket_logging`, `aws_iam_role`, `aws_iam_policy`, `aws_autoscaling_group`, `aws_launch_configuration`, `random_pet`, `null_resource` — 30+ types across modules. ``` Best: CRX (MDL 1876) @@ -182,129 +215,6 @@ The sweet spot: **multiple implementations of the same abstract task** (like "de | 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. | | Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. | -## Domain Adapters - -### Ansible Roles - -```python -from bex.ensemble import infer_ensemble -from bex.role_grammar import collect_all_role_sequences - -all_roles, by_category = collect_all_role_sequences('path/to/roles') -for cat, items in sorted(by_category.items()): - seqs = [s for _, s in items] - result = infer_ensemble(seqs) - print(f"── {cat} ({len(items)} roles) ──") - print(f" Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})") - print(f" Grammar: {result['best']['grammar']}") -``` - -**Example** (15 geerlingguy Galaxy roles): - -``` -── other (15 roles) ── - Best: CRX (MDL 288, 15/15 match) - Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile? - Why: CRX matches 15/15 sequences, iDRegEx matches 3/15. CRX selected. 
-``` - -### Helm Charts - -```python -import subprocess, yaml -from bex.ensemble import infer_ensemble - -seqs = [] -for vf in sorted(Path('ci/').glob('*-values.yaml')): - out = subprocess.run( - ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)], - capture_output=True, text=True, timeout=120, - ) - kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout) - if d and isinstance(d, dict) and 'kind' in d] - if kinds: - seqs.append(kinds) - -result = infer_ensemble(seqs) -print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})") -print(f"Grammar: {result['best']['grammar']}") -``` - -**Example** (kube-prometheus-stack, 6 CI configs): - -``` -Best: iDRegEx (MDL 1433) -Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment - - iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment - CRX MDL= 2651.74 (Alertmanager+ClusterRole+...+ValidatingWebhookConfiguration)+.Role+?... - -Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6. -iDRegEx selected (MDL score 1433.0). -``` - -### Terraform - -```python -import re -from bex.ensemble import infer_ensemble - -seqs = [] -for tf in sorted(Path('.').rglob('*.tf')): - resources = re.findall(r'resource "(\w+)" "\w+" {', tf.read_text()) - if resources: - seqs.append(resources) - -result = infer_ensemble(seqs) -print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})") -print(f"Grammar: {result['best']['grammar']}") -``` - -**Example** (8 terraform-aws-* modules): - -``` -Best: CRX (MDL 1876) -Grammar: null_resource?.s3_bucket_lifecycle_configuration?.vpc?.launch_configuration?.... -Why: CRX matches 8/8 sequences. iDRegEx returned ∅ (no common core across modules). 
-``` - -### Portainer Templates - -```python -import json, urllib.request -from bex.ensemble import infer_ensemble - -url = "https://raw.githubusercontent.com/portainer/templates/master/templates.json" -with urllib.request.urlopen(url) as resp: - data = json.loads(resp.read()) -templates = data if isinstance(data, list) else data.get('templates', []) -seqs = [list(t.keys()) for t in templates] - -result = infer_ensemble(seqs) -print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})") -print(f"Grammar: {result['best']['grammar']}") -``` - -### GitHub Actions - -```python -import yaml -from bex.ensemble import infer_ensemble - -seqs = [] -for wf_file in Path('.github/workflows/').glob('*.yml'): - data = yaml.safe_load(wf_file.read_text()) - for job in data.get('jobs', {}).values(): - if 'steps' not in job: - continue - seq = [s.get('uses', 'run:' + s.get('run', '').split()[0]) - for s in job['steps'] if 'uses' in s or 'run' in s] - if seq: - seqs.append(seq) - -result = infer_ensemble(seqs) -``` - ## How MDL scoring works ``` diff --git a/SHOWCASE.md b/SHOWCASE.md index 84c0f2b..0226b03 100644 --- a/SHOWCASE.md +++ b/SHOWCASE.md @@ -46,27 +46,7 @@ vocabulary (19 kinds). Which one an agent uses depends on the task: - Bootstrapping a new cluster: iDRegEx — what you can't skip - Writing a complete chart: CRX — everything you might need -## 3. Portainer templates (47 templates) - -Official Portainer app templates from portainer/templates: - -``` -Best: CRX | MDL 1282 -Grammar: (type+title)+. - (categories+description+image+logo+name+note+platform)+. - repository?.(env+ports+privileged+volumes)+?.command? -``` - -Field ordering convention: identity (`type`, `title`) → metadata -(`description`, `categories`, `platform`, `logo`) → source -(`image`, `repository`) → deployment (`ports`, `volumes`, `env`) → -entrypoint (`command`). 21 unique orderings, one grammar. - -**Why it helps an LLM:** Writing a Portainer template needs the right -field order. 
The grammar tells you: identity first, then metadata, -then source, then deployment config. - -## 4. GitHub Actions (cross-project Go lint, 6 jobs) +## 3. GitHub Actions (cross-project Go lint, 6 jobs) Lint jobs from prometheus, goreleaser, cosign, sigstore: @@ -82,7 +62,7 @@ Only the biggest add megalinter. **Why it helps an LLM:** Starting a new Go project? The lint workflow has a near-universal pattern. -## 5. Terraform (8 AWS modules) +## 4. Terraform (8 AWS modules) Terraform modules by hashicorp and terraform-aws-modules: diff --git a/blog_post.md b/blog_post.md index 8b3de5a..a845d7a 100644 --- a/blog_post.md +++ b/blog_post.md @@ -191,7 +191,6 @@ depending on the data: |---------|--------|-----| | Ansible galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) | | Helm prom-stack (6 configs) | **iDRegEx** | Finds minimal core across all configs | -| Portainer templates (47) | CRX | iDRegEx returned ∅ (no single common field) | -| Terraform modules (8) | CRX | Every resource type optional across domains | +| Terraform modules (8) | CRX | iDRegEx returned ∅ (no common core; every resource type optional across domains) | | GitHub Actions Go lint (6) | CRX | Tight pattern, all match |