diff --git a/README.md b/README.md
index 5486fb2..abfddb5 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,64 @@
 
 Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.
 
+## MCP Server
+
+The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (Claude, opencode, etc.) and get grammar inference as a tool:
+
+```json
+{
+  "mcpServers": {
+    "grammar-inference": {
+      "command": "python3",
+      "args": ["/path/to/bex/mcp_server.py"]
+    }
+  }
+}
+```
+
+### Tools
+
+| Tool | What it does |
+|------|-------------|
+| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
+| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. `prefer='crx'` or `prefer='idregex'` to skip the comparison and return only that algorithm. |
+| `infer_yaml_grammar(yaml_dir, pattern, method)` | YAML → key-paths → grammar |
+| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
+
+### Agent workflow
+
+An LLM agent uses the MCP to discover an unwritten convention from existing examples:
+
+```
+User: Generate a new Ansible role for installing PostgreSQL.
+Agent: I'll first check 15 existing geerlingguy roles to find the structural pattern.
+       [calls infer_best_grammar with 15 role sequences, prefer='crx']
+
+       Best: CRX (MDL 288)
+       Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
+                .include+?.(npm+pip)+?.lineinfile?
+
+       Convention: check preconditions → OS-specific vars → install packages →
+                   configure templates → start services → handle language tooling.
+```
+
+Without the MCP: 15 role files in context (5,000+ tokens) or guesswork.
+With the MCP: one grammar rule (~60 tokens), known to match 15/15 existing roles.
+
+## Why grammar inference?
+
+There are many domains where developers follow **unwritten conventions** — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.
+
+Grammar inference automatically discovers these conventions from examples:
+
+| Domain | Unwritten convention | What the grammar tells an LLM |
+|--------|---------------------|-------------------------------|
+| Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
+| Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
+| Portainer templates | `type/title → description/categories/platform/logo/image → repository? → env/ports/volumes? → command?` | "Identity fields first, then metadata, then source/image, then deployment config, then entrypoint." |
+| GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. Only megalinter for extra coverage." |
+| Terraform modules | Everything is optional — but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |
+
 ## Quick Start
 
 ```bash
@@ -23,29 +81,6 @@ print(f"Grammar: {result['best']['grammar']}")
 print(f"Score: {result['best']['mdl_score']}")
 ```
 
-## Why grammar inference?
-
-There are many domains where developers follow **unwritten conventions** — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.
-
-Grammar inference automatically discovers these conventions from examples.
-
-| Domain | Unwritten convention | What the grammar tells an LLM |
-|--------|---------------------|-------------------------------|
-| Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
-| Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
-| Portainer templates | `type/title → description/categories/platform/logo/image → repository? → env/ports/volumes? → command?` | "Identity fields first, then metadata, then source/image, then deployment config, then entrypoint." |
-| GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. Only megalinter for extra coverage." |
-| Terraform modules | Everything is optional — but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |
-
-## Algorithm Selection Guide
-
-| When | Use | Why |
-|------|-----|-----|
-| Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
-| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
-| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
-| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |
-
 ## Real-world Results
 
 ### Ansible Galaxy (15 roles, 44+ modules each)
@@ -61,11 +96,7 @@ Grammar:
 
 Every single role follows this pattern. The convention was **unwritten** — no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."
 
-An LLM generating a new role:
-- **Must** start with conditional includes and variable setup
-- **Should** then install packages and configure files
-- **Then** start services
-- **Finally** include handling of language-specific tooling
+This is the first explicit description of the geerlingguy role module ordering convention.
 
 **Compression:** The grammar is ~250 chars. The 15 examples are 7200+ modules combined. **~29× compression.**
 
@@ -97,8 +128,6 @@ Grammar: (type+title)+.(categories+description+image+logo+name+note+platform)+.
 
 Template fields follow a consistent arc: identity (`type`, `title`) → metadata (`description`, `categories`, `platform`, `logo`) → source (`image`, `repository`) → deployment (`ports`, `volumes`, `env`) → entrypoint (`command`). 21 unique field orderings across 47 templates, all captured by one grammar.
 
-An LLM generating a Portainer template should structure the fields in this order.
-
 ### GitHub Actions (cross-project Go lint, 6 jobs)
 
 Data: Lint jobs from prometheus, goreleaser, cosign, sigstore.
@@ -132,6 +161,15 @@ Not every domain has an unwritten convention. Grammar inference failed (produced
 
 The sweet spot: **multiple implementations of the same abstract task** (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.
 
+## Algorithm Selection Guide
+
+| When | Use | Why |
+|------|-----|-----|
+| Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
+| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
+| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
+| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |
+
 ## When each algorithm wins
 
 | Data property | Winner | Why |
@@ -142,23 +180,6 @@ The sweet spot: **multiple implementations of the same abstract task** (like "de
 | 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
 | Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
 
-## MCP Server
-
-A **Model Context Protocol** server exposes all algorithms and domain adapters:
-
-```bash
-python -m bex.mcp_server
-```
-
-### Tools
-
-| Tool | What it does |
-|------|-------------|
-| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
-| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both, picks best by MDL. `prefer='crx'` or `prefer='idregex'` to skip comparison. |
-| `infer_yaml_grammar(yaml_dir, pattern, method)` | YAML → key-paths → grammar |
-| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
-
 ## Domain Adapters
 
 ### Ansible Roles