# Dervish — Showcase
<p align="left"><img src="dervish-logo.png" alt="Dervish" width="180"></p>

Infer the **unwritten convention** from existing examples. Given N example
sequences, Dervish produces a compact grammar, a few hundred characters at
most, that captures the structural pattern in far fewer tokens than the
originals. Grammars use this notation:
```text
a.b    → a then b (concatenation)
(a+b)  → a or b (disjunction)
r?     → optional (zero or one)
r+     → one or more (iteration)
r+?    → zero or more
```
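To make the notation concrete, here is a minimal sketch that translates a grammar string into a Python `re` pattern and checks a sequence against it. It is not Dervish's own matcher: `grammar_to_regex` and `matches` are hypothetical helper names, and the sketch assumes that every disjunction is parenthesised, that symbol names contain no whitespace, and that the grammar is complete (no `...` elision).

```python
import re

# Tokens: the two-character "+?" operator, single-character operators,
# or a symbol name such as "fail" or "actions/setup-go".
TOKEN = re.compile(r"\+\?|[().+?]|[A-Za-z_][\w/:-]*")


def grammar_to_regex(grammar):
    """Translate the dot/plus notation into a compiled Python regex.

    Each sequence item is encoded as its name followed by one space, so the
    regex always consumes whole items.
    """
    text = "".join(grammar.split())  # drop line breaks and indentation
    if "..." in text:
        raise ValueError("elided grammar ('...') cannot be translated")
    tokens = TOKEN.findall(text)
    if "".join(tokens) != text:
        raise ValueError("unrecognised characters in grammar")

    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == "(":
            out.append("(?:")
        elif tok == ")":
            out.append(")")
        elif tok == ".":
            continue  # concatenation is implicit in a regex
        elif tok == "?":
            out.append("?")
        elif tok == "+?":
            out.append("*")  # zero or more
        elif tok == "+":
            # "+" is a disjunction when another operand follows it,
            # otherwise it is the postfix "one or more" operator.
            infix = nxt == "(" or nxt[:1].isalpha() or nxt[:1] == "_"
            out.append("|" if infix else "+")
        else:
            out.append("(?:" + re.escape(tok + " ") + ")")
    return re.compile("".join(out))


def matches(grammar, sequence):
    """Return True if the whole sequence is accepted by the grammar."""
    encoded = "".join(item + " " for item in sequence)
    return grammar_to_regex(grammar).fullmatch(encoded) is not None


# The GitHub Actions grammar from section 3, joined onto one line.
GHA_GRAMMAR = (
    "actions/checkout.(actions/setup-go+run:echo+run:sudo)+."
    "golangci/golangci-lint-action?.megalinter?"
)

print(matches(GHA_GRAMMAR, ["actions/checkout", "actions/setup-go",
                            "golangci/golangci-lint-action"]))
print(matches(GHA_GRAMMAR, ["actions/setup-go", "actions/checkout"]))
```

The first call accepts a checkout, setup, lint job; the second is rejected because `actions/checkout` must come first.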
## 1. Ansible Galaxy roles (15 geerlingguy roles)
15 popular Ansible roles by Jeff Geerling. There is NO written convention
for the module ordering in `tasks/main.yml`. Our grammar is its first
explicit description:
```text
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
include+?.(npm+pip)+?.lineinfile?
```
Every role: check preconditions → OS-specific vars → install packages →
configure with templates → start services → optionally handle language tooling.
All 15 roles match. **~29× compression** (7200+ modules → ~250 chars).
**Why it helps an LLM:** When generating a new Ansible role, the LLM knows the
overall shape: an optional fail-check first, then a block of vars, package,
file, template, and service tasks, then optional includes and language
tooling. No guessing.
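The sequences behind this grammar are module names in task order. As a rough sketch of how such a sequence could be derived from already-parsed tasks (a list of dicts, as a YAML loader would return for `tasks/main.yml`): `module_of` and `module_sequence` are hypothetical helpers, the keyword list is partial, and `block:` sections are skipped rather than descended into. This is an assumption about the preprocessing, not Dervish's actual extractor.

```python
from typing import Optional

# Task-level keywords that are not modules (partial, illustrative list).
TASK_KEYWORDS = {
    "name", "when", "tags", "become", "become_user", "register", "notify",
    "loop", "vars", "args", "changed_when", "failed_when", "ignore_errors",
    "environment", "delegate_to", "until", "retries", "delay", "no_log",
    "block", "rescue", "always",
}


def module_of(task: dict) -> Optional[str]:
    """Return the module a task calls, or None if no module key is found."""
    for key in task:
        if key in TASK_KEYWORDS or key.startswith("with_"):
            continue
        # Strip a collection prefix: ansible.builtin.package -> package
        return key.rsplit(".", 1)[-1]
    return None


def module_sequence(tasks: list) -> list:
    """Map a list of task dicts to the ordered list of module names."""
    return [m for m in map(module_of, tasks) if m is not None]


# Hypothetical tasks, for illustration only.
tasks = [
    {"name": "Fail on unsupported OS", "fail": {"msg": "unsupported"},
     "when": "ansible_os_family == 'Windows'"},
    {"name": "Load OS vars", "include_vars": "{{ ansible_os_family }}.yml"},
    {"name": "Install", "ansible.builtin.package": {"name": "nginx"}},
    {"name": "Configure", "template": {"src": "a.j2", "dest": "/etc/a"},
     "notify": "restart"},
    {"name": "Start", "service": {"name": "nginx", "state": "started"}},
]
print(module_sequence(tasks))
# ['fail', 'include_vars', 'package', 'template', 'service']
```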
### Bonus: core+outlier analysis
Set `min_coverage=0.8` to find the tight grammar for the majority while
flagging outlier roles with unusual module usage:
```text
Core CRX (80% coverage, 3 outliers):
fail?.(include_vars+set_fact+package+file+template+service+...)+
Outlier sequences:
1. phpmyadmin: include_vars → set_fact → include → include → lineinfile
2. composer: fail → set_fact → stat → uri → get_url → command
3. pip: package → file → pip
```
phpmyadmin edits files with raw `lineinfile` instead of templates; composer
needs a `stat` check followed by `uri` and `get_url` download steps; pip ends
in a bare `pip` call with no template or service step. All three deviate
from the mainstream install → configure → enable pattern.
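The split itself can be illustrated with a plain regular expression. The sketch below is not how `min_coverage` is implemented; it only shows what "core plus outliers" means once a core grammar is fixed. The core pattern is a truncated version of the grammar above (the `...` is dropped, so only the six modules named explicitly are accepted), `split_core_outliers` is a hypothetical helper, and `example_role` is invented for illustration.

```python
import re

# Truncated core grammar: fail?.(include_vars+set_fact+package+file+template+service)+
# Each module name is followed by one space in the encoded sequence.
CORE = re.compile(
    r"(?:fail )?(?:(?:include_vars|set_fact|package|file|template|service) )+"
)


def split_core_outliers(sequences):
    """Partition named sequences into core matches and outliers."""
    core, outliers = {}, {}
    for name, seq in sequences.items():
        encoded = "".join(module + " " for module in seq)
        target = core if CORE.fullmatch(encoded) else outliers
        target[name] = seq
    coverage = len(core) / len(sequences)
    return core, outliers, coverage


roles = {
    # The three outlier sequences listed above.
    "phpmyadmin": ["include_vars", "set_fact", "include", "include", "lineinfile"],
    "composer": ["fail", "set_fact", "stat", "uri", "get_url", "command"],
    "pip": ["package", "file", "pip"],
    # Hypothetical mainstream role, not taken from the dataset.
    "example_role": ["fail", "include_vars", "package", "template", "service"],
}

core, outliers, coverage = split_core_outliers(roles)
print(sorted(outliers))  # ['composer', 'phpmyadmin', 'pip']
print(sorted(core))      # ['example_role']
```

In the real dataset 12 of the 15 roles fall on the core side, which is where the 80% coverage figure comes from.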
## 2. Helm charts — cross-project convention (15 charts, 6 publishers)
15 popular Helm charts from **Bitnami** (10), **Grafana**, **Jetstack** (cert-manager),
**Argo**, **Ingress-Nginx**, and **Elastic**. Different publishers, different
purposes (databases, web servers, infrastructure tools) — but they converged
on a common resource ordering:
```text
Best: CRX | MDL 230
Grammar: NetworkPolicy?.PodDisruptionBudget?.ServiceAccount?.Secret?
.ConfigMap?.PersistentVolumeClaim?.ClusterRole?.ClusterRoleBinding?
.Role?.RoleBinding?.Service.Deployment?.StatefulSet?.
(IngressClass+MutatingWebhookConfiguration)?.ValidatingWebhookConfiguration?.Job?
Match rates: CRX=15/15
```
Every chart follows: **resilience → identity → data → service → workload → extensions**.
`Service` is the **only resource type that appears in all 15 charts**.
Bitnami charts (10 of 15) consistently start with `NetworkPolicy` and
`PodDisruptionBudget` before identity and service. Infrastructure tools
(cert-manager, grafana, argo-cd, ingress-nginx) add RBAC and admission
webhooks for cluster-wide access.
**Why it helps an LLM:** Generating a Helm chart template? You know the
structure: start with availability guarantees (NetworkPolicy, PDB), then
identity (ServiceAccount, Secret), then data (ConfigMap, PersistentVolumeClaim),
then the Service endpoint, then your workload type. Only cluster-wide tools
need RBAC and webhooks; skip them for simple application charts.
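Because every slot in this grammar is a single optional item (apart from the required `Service`), matching reduces to checking that resource kinds appear in strictly increasing slot order. A minimal sketch of that check follows; it assumes each chart is represented by one entry per resource kind, and `SLOTS`, `RANK`, and `check_order` are hypothetical names rather than part of Dervish.

```python
# Slots in the order given by the grammar; a tuple is an either/or slot.
SLOTS = [
    "NetworkPolicy", "PodDisruptionBudget", "ServiceAccount", "Secret",
    "ConfigMap", "PersistentVolumeClaim", "ClusterRole", "ClusterRoleBinding",
    "Role", "RoleBinding", "Service", "Deployment", "StatefulSet",
    ("IngressClass", "MutatingWebhookConfiguration"),
    "ValidatingWebhookConfiguration", "Job",
]

# Map each kind to the index of its slot.
RANK = {}
for index, slot in enumerate(SLOTS):
    for kind in (slot if isinstance(slot, tuple) else (slot,)):
        RANK[kind] = index


def check_order(kinds):
    """Return a list of problems; an empty list means the grammar matches."""
    problems = []
    if "Service" not in kinds:
        problems.append("missing required Service")
    last_rank, last_kind = -1, None
    for kind in kinds:
        rank = RANK.get(kind)
        if rank is None:
            problems.append(f"{kind} is not in the grammar")
        elif rank == last_rank:
            problems.append(f"{kind} shares a slot with {last_kind}")
        elif rank < last_rank:
            problems.append(f"{kind} should come before {last_kind}")
        else:
            last_rank, last_kind = rank, kind
    return problems


print(check_order(["ServiceAccount", "ConfigMap", "Service", "Deployment"]))
print(check_order(["Deployment", "Service"]))
print(check_order(["ConfigMap", "Deployment"]))
```

The first call returns an empty list; the second reports that `Service` should come before `Deployment`; the third reports the missing `Service`.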
## 3. GitHub Actions (cross-project Go lint, 6 jobs)
Six lint jobs from four projects (prometheus, goreleaser, cosign, sigstore):
```text
Best: CRX | MDL 13.6
Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.
golangci/golangci-lint-action?.megalinter?
```
Four independently maintained Go projects converged on: checkout → set up Go → run golangci-lint. Only the largest projects add megalinter.
**Why it helps an LLM:** Setting up CI for a Go project on GitHub Actions? The grammar encodes an emergent cross-project convention — four teams wrote the same pipeline without coordinating.
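The symbols in this grammar (`actions/checkout`, `run:echo`, `run:sudo`) suggest that each workflow step is reduced to a short label before inference. The sketch below shows one plausible normalisation, assuming `uses:` steps drop their version tag and `run:` steps keep only the first word of the command. `step_symbol` and `job_sequence` are hypothetical helpers, the version tags are made up, and the real preprocessing may differ (the grammar's bare `megalinter` symbol, for instance, is shorter than an `owner/repo` action name).

```python
def step_symbol(step):
    """Reduce one workflow step (a dict) to a grammar symbol, or None."""
    if "uses" in step:
        # actions/checkout@v4 -> actions/checkout
        return step["uses"].split("@", 1)[0]
    if "run" in step:
        words = step["run"].split()
        # sudo apt-get install ... -> run:sudo
        return "run:" + words[0] if words else None
    return None


def job_sequence(steps):
    """Map a job's steps to the ordered list of symbols."""
    return [s for s in map(step_symbol, steps) if s is not None]


# Hypothetical lint job, for illustration only.
steps = [
    {"uses": "actions/checkout@v4"},
    {"uses": "actions/setup-go@v5", "with": {"go-version": "stable"}},
    {"run": "sudo apt-get install -y libsystemd-dev"},
    {"uses": "golangci/golangci-lint-action@v6"},
]
print(job_sequence(steps))
# ['actions/checkout', 'actions/setup-go', 'run:sudo', 'golangci/golangci-lint-action']
```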
## What doesn't work
| Dataset | Problem |
|---------|---------|
| Dockerfiles | Too simple — just the Dockerfile spec |
| Pre-commit (cross-project) | 252 unique hooks, no common core |
| GHA per-project | One repo = too many job types |
| Prometheus rules | Schema-enforced, no convention |
Sweet spot: **multiple implementations of the same abstract task**
with a shared but undocumented pattern.
## Usage
```python
from bex import infer_ensemble

# role_sequences is assumed to hold one sequence of module names per role;
# it is not defined in this snippet.

# Pick the best grammar across all 3 algorithms (CRX, iDRegEx, kOREInference)
result = infer_ensemble(role_sequences)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")

# Or: find the tight core + flag outliers
result = infer_ensemble(role_sequences, min_coverage=0.8)
print(f"Core: {result['core']['grammar']}")
print(f"Outliers ({result['core']['outlier_count']}):")
for i, o in enumerate(result['core']['outliers'], 1):
    # Show at most the first 8 items of each outlier sequence
    print(f" {i}. {' → '.join(str(x) for x in o[:8])}{'...' if len(o) > 8 else ''}")
```