grammar-inference-engine/SHOWCASE.md

# Grammar Inference Engine — Showcase

Infer the **unwritten convention** from existing examples. Given N example
sequences, produce a compact grammar (roughly 100 to 250 characters in the
cases below) that captures the structural pattern in far fewer tokens than
the originals. Grammars use this notation:

```
a.b       → a then b (concatenation)
(a+b)     → a or b (disjunction, infix +)
r?        → optional (zero or one)
r+        → one or more (iteration, postfix +)
r+?       → zero or more (r+ made optional)
```
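To make the notation concrete, here is a minimal sketch that translates it into a Python `re` pattern and checks a sequence against it. The helper names `grammar_to_regex` and `matches` are hypothetical and not part of the engine; the sketch assumes symbols contain only letters, digits, `_`, `/`, `:` and `-`, and it does not handle grammars with elided parts (`...`). Note that `r+?` must be mapped to `*`, because `+?` in Python regexes means a lazy quantifier rather than "zero or more".

```python
import re

# Assumed symbol alphabet: letters, digits, '_', '/', ':' and '-'.
SYMBOL = re.compile(r"[A-Za-z0-9_/:\-]+")


def grammar_to_regex(grammar):
    """Translate the showcase notation into a compiled Python regex.

    Each symbol becomes a group matching "symbol," so that a sequence can be
    encoded as one comma-terminated string.
    """
    g = "".join(grammar.split())  # drop spaces and line breaks
    out, i = [], 0
    while i < len(g):
        m = SYMBOL.match(g, i)
        if m:
            out.append("(?:" + re.escape(m.group() + ",") + ")")
            i = m.end()
            continue
        ch = g[i]
        nxt = g[i + 1] if i + 1 < len(g) else ""
        if ch == ".":
            pass  # concatenation is implicit in regex syntax
        elif ch == "(":
            out.append("(?:")
        elif ch == ")":
            out.append(")")
        elif ch == "?":
            out.append("?")
        elif ch == "+":
            if nxt == "(" or SYMBOL.match(nxt):
                out.append("|")  # infix '+': an operand follows, so disjunction
            elif nxt == "?":
                out.append("*")  # 'r+?' means zero or more
                i += 1
            else:
                out.append("+")  # postfix '+': one or more
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
        i += 1
    return re.compile("".join(out))


def matches(grammar, sequence):
    """Return True if the whole sequence is accepted by the grammar."""
    encoded = "".join(symbol + "," for symbol in sequence)
    return grammar_to_regex(grammar).fullmatch(encoded) is not None


compose = "(build+image).command.(environment+volumes)?.ports"
print(matches(compose, ["image", "command", "ports"]))  # True
print(matches(compose, ["image", "ports"]))             # False: command is required
```

Encoding each symbol with a trailing comma keeps a symbol such as `file` from matching the start of a longer one such as `file_copy`.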

## 1. Ansible Galaxy roles (15 geerlingguy roles) — flagship

15 popular Ansible roles by Jeff Geerling. As far as we know, there is no
written convention for the task structure, so this grammar may be its first
explicit description:

```
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
         include+?.(npm+pip)+?.lineinfile?
```

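Read as a workflow, every role: checks preconditions → loads OS-specific vars →
installs packages → configures with templates → starts services → optionally
handles language tooling. The grammar fixes only the outer order; modules inside
the repeated group may appear in any order.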

All 15/15 match. **~29× compression** (7200+ modules → ~250 chars).

**Why it helps an LLM:** Generating a new Ansible role, the LLM knows the
overall structure: an optional fail-check first, then a core block of vars,
packages, config, and services, then optional includes and language tooling.
Far less guessing.

## 2. Helm chart (kube-prometheus-stack, 6 configs)

6 different `values.yaml` files rendered through the same chart:

```
Best: iDRegEx | MDL 1433
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
```

This is the **minimal core** that every config deploys (MDL: minimum
description length). CRX instead captures the full vocabulary (19 kinds).
Which one an agent uses depends on the task:
- Bootstrapping a new cluster: iDRegEx, for what you can't skip
- Writing a complete chart: CRX, for everything you might need
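
One way an agent could use the minimal core is as a checklist for a rendered manifest. The sketch below assumes that "core" means the five kinds appear in this relative order, possibly with other kinds in between; that reading is an assumption, not something the inference output guarantees. The function names and the example kind list are hypothetical.

```python
# The five kinds from the iDRegEx grammar above, in grammar order.
CORE_KINDS = [
    "ServiceAccount",
    "ClusterRole",
    "ClusterRoleBinding",
    "Service",
    "Deployment",
]


def has_core_in_order(kinds, core=CORE_KINDS):
    """True if `core` is an ordered subsequence of `kinds`."""
    remaining = iter(kinds)
    # `kind in remaining` consumes the iterator up to the first hit,
    # so later core kinds must appear after earlier ones.
    return all(kind in remaining for kind in core)


def missing_core(kinds, core=CORE_KINDS):
    """List the core kinds that do not appear at all."""
    present = set(kinds)
    return [kind for kind in core if kind not in present]


# Hypothetical kind sequence for one rendered config.
rendered = [
    "ServiceAccount", "ClusterRole", "ClusterRoleBinding",
    "ConfigMap", "Service", "Deployment", "ServiceMonitor",
]
print(has_core_in_order(rendered))                # True
print(missing_core(["Service", "Deployment"]))
# ['ServiceAccount', 'ClusterRole', 'ClusterRoleBinding']
```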

## 3. Docker Compose (73 services, 10 projects)

Per-service key order across real-world compose files:

```
Best: CRX | MDL varies by project
Grammar: (build+image).command.(environment+volumes)?.ports
```

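Inferred per project, tighter patterns emerge (they need not match the pooled
grammar above; databases, for example, omit `command`):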
- **Nginx-like:** `build.(command.volumes.ports)`
- **Databases:** `image.environment.volumes.ports`
- **Language runtimes:** `build.(environment.command).ports`

**Why it helps an LLM:** The field order in service definitions follows
an implicit convention. An agent generating compose files should default to
image/build first, then command, then environment/volumes, then ports, and
switch to the per-project pattern when one applies.
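
The input sequences for this case are per-service key orders. The sketch below shows one way to extract them from an already parsed compose file, assuming the parser returns plain dicts that preserve key order. The `VOCAB` set, the function name, and the example document are illustrative assumptions, not the engine's actual preprocessing.

```python
# Keys kept in the sequences; anything else (restart, depends_on, ...) is dropped.
VOCAB = {"build", "image", "command", "environment", "volumes", "ports"}


def service_key_sequences(compose, vocab=VOCAB):
    """Return one key sequence per service, in the order the keys were written."""
    services = compose.get("services", {})
    return [[key for key in service if key in vocab] for service in services.values()]


# Hypothetical parsed compose document (dicts keep insertion order).
compose = {
    "services": {
        "web": {
            "build": ".",
            "command": "nginx -g 'daemon off;'",
            "volumes": ["./conf:/etc/nginx/conf.d"],
            "ports": ["80:80"],
            "restart": "always",
        },
        "db": {
            "image": "postgres:16",
            "environment": {"POSTGRES_PASSWORD": "example"},
            "volumes": ["dbdata:/var/lib/postgresql/data"],
            "ports": ["5432:5432"],
        },
    }
}

for sequence in service_key_sequences(compose):
    print(".".join(sequence))
# build.command.volumes.ports
# image.environment.volumes.ports
```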

## 4. GitHub Actions (cross-project Go lint, 6 jobs)

Lint jobs from prometheus, goreleaser, cosign, sigstore:

```
Best: CRX | MDL 13.6
Grammar: actions/checkout.(actions/setup-go+run:echo+run:sudo)+.
         golangci/golangci-lint-action?.megalinter?
```

Every lint job in this sample follows: checkout → setup Go → run linter.
Only the largest projects add megalinter.

**Why it helps an LLM:** Starting a new Go project? The lint workflow
follows a consistent pattern across these projects that it can reuse.
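
The symbols in the grammar above (`actions/checkout`, `run:echo`, `run:sudo`) suggest that each workflow step is reduced to a short label before inference. The sketch below is a guess at such a tokenization, not the engine's confirmed behavior: `uses:` steps lose their version tag and `run:` steps keep only the first word of the command. The function name and the example job are hypothetical, and the real tokenizer evidently differs in places, since the grammar shows a bare `megalinter` symbol.

```python
def step_symbol(step):
    """Reduce one workflow step (a parsed dict) to a short symbol."""
    if "uses" in step:
        # "actions/checkout@v4" -> "actions/checkout"
        return step["uses"].split("@", 1)[0]
    if "run" in step:
        # "sudo apt-get install ..." -> "run:sudo"
        words = step["run"].split()
        return ("run:" + words[0]) if words else "run:"
    return "other"


# Hypothetical lint job; version tags are illustrative.
steps = [
    {"uses": "actions/checkout@v4"},
    {"uses": "actions/setup-go@v5", "with": {"go-version": "stable"}},
    {"run": "sudo apt-get install -y libsystemd-dev"},
    {"uses": "golangci/golangci-lint-action@v6"},
]

print([step_symbol(step) for step in steps])
# ['actions/checkout', 'actions/setup-go', 'run:sudo', 'golangci/golangci-lint-action']
```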

## 5. Terraform (8 AWS modules)

Terraform modules by hashicorp and terraform-aws-modules:

```
Best: CRX | MDL 1876
Grammar: null_resource?.s3_bucket...?.vpc?...(26+ types all optional)
```

Every resource type is optional — VPC, S3, EC2, and security-group
modules share no mandatory ordering. But the **vocabulary** is the signal:
seeing `aws_vpc` implies subnets, route tables, internet gateways.

**Why it helps an LLM:** The vocabulary tells it which resource types
belong together in each module domain, even though the grammar fixes no order.
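
Since the signal here is co-occurrence rather than order, a simple set intersection already recovers "seeing X implies Y". The sketch below uses a toy dataset, not the 8 modules from this showcase; the module names and the function `implied_types` are hypothetical.

```python
def implied_types(modules, anchor):
    """Resource types present in every module that contains `anchor`."""
    containing = [set(types) for types in modules.values() if anchor in types]
    if not containing:
        return set()
    return set.intersection(*containing) - {anchor}


# Toy data for illustration only.
modules = {
    "vpc_a": ["aws_vpc", "aws_subnet", "aws_route_table", "aws_internet_gateway"],
    "vpc_b": ["aws_vpc", "aws_subnet", "aws_route_table",
              "aws_internet_gateway", "aws_nat_gateway"],
    "s3": ["aws_s3_bucket", "aws_s3_bucket_policy"],
}

print(sorted(implied_types(modules, "aws_vpc")))
# ['aws_internet_gateway', 'aws_route_table', 'aws_subnet']
```

`aws_nat_gateway` is not implied because only one of the two VPC modules in the toy data declares it.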

## What doesn't work

| Dataset | Problem |
|---------|---------|
| Dockerfiles | Too simple — just the Dockerfile spec |
| Pre-commit (cross-project) | 252 unique hooks, no common core |
| GHA per-project | One repo = too many job types |
| Prometheus rules | Schema-enforced, no convention |

Sweet spot: **multiple implementations of the same abstract task**
with a shared but undocumented pattern.

## Usage

```python
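from bex.mcp_server import infer_best_grammar

# Assumed input shape: one list of module names per role, in task order.
# The values below are illustrative, not the showcase data.
role_sequences = [
    ["fail", "include_vars", "package", "template", "service"],
    ["include_vars", "package", "file", "service", "lineinfile"],
]

output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",  # preferred learner; iDRegEx is the other one reported above
)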
```