Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.
tobjend 2026-07-01 10:18:10 +02:00
parent 9f5bde22d5
commit a8a8bddb37

115
README.md

@@ -2,6 +2,64 @@
Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.
## MCP Server
The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (Claude, opencode, etc.) and get grammar inference as a tool:
```json
{
"mcpServers": {
"grammar-inference": {
"command": "python3",
"args": ["/path/to/bex/mcp_server.py"]
}
}
}
```
### Tools
| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. `prefer='crx'` or `prefer='idregex'` to skip the comparison and return only that algorithm. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
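MCP clients invoke tools with JSON-RPC 2.0 `tools/call` requests. As a rough sketch, a call to `infer_best_grammar` could be framed as below. The helper `build_tool_call` and the example sequences are hypothetical, and the shape of `sequences` (a list of lists of symbol names) is an assumption rather than something this README specifies.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Assumption: each sequence is a list of symbol names (here, Ansible modules).
example_sequences = [
    ["include_vars", "package", "template", "service"],
    ["fail", "include_vars", "package", "service"],
]

request = build_tool_call(
    request_id=1,
    tool_name="infer_best_grammar",
    arguments={"sequences": example_sequences, "prefer": "crx"},
)
print(json.dumps(request, indent=2))
```

In practice an MCP client library builds and sends this message for you; the sketch only shows which tool name and arguments end up on the wire.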
### Agent workflow
An LLM agent uses the MCP server to discover an unwritten convention from existing examples:
```
User: Generate a new Ansible role for installing PostgreSQL.
Agent: I'll first check 15 existing geerlingguy roles to find the structural pattern.
[calls infer_best_grammar with 15 role sequences, prefer='crx']
Best: CRX (MDL 288)
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
.include+?.(npm+pip)+?.lineinfile?
Convention: check preconditions → OS-specific vars → install packages →
configure templates → start services → handle language tooling.
```
Without the MCP server: 15 role files in context (5,000+ tokens) or guesswork.
With the MCP server: one grammar rule (~60 tokens), known to match 15/15 existing roles.
## Why grammar inference?
There are many domains where developers follow **unwritten conventions** — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.
Grammar inference automatically discovers these conventions from examples:
| Domain | Unwritten convention | What the grammar tells an LLM |
|--------|---------------------|-------------------------------|
| Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
| Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
| Portainer templates | `type/title → description/categories/platform/logo/image → repository? → env/ports/volumes? → command?` | "Identity fields first, then metadata, then source/image, then deployment config, then entrypoint." |
| GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. megalinter is optional, for extra coverage." |
| Terraform modules | Everything is optional — but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |
## Quick Start
```bash
@@ -23,29 +81,6 @@ print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```
## Why grammar inference?
There are many domains where developers follow **unwritten conventions** — implicit rules about the order and structure of things that no formal schema captures. An LLM generating code in these domains needs to know the convention, but it's rarely documented.
Grammar inference automatically discovers these conventions from examples.
| Domain | Unwritten convention | What the grammar tells an LLM |
|--------|---------------------|-------------------------------|
| Ansible roles | `fail → include_vars/set_fact → package → file/template → service → ... → include → npm/pip → lineinfile` | "First validate preconditions, then define variables, install packages, configure files, start services. Include other roles last." |
| Helm charts | `ServiceAccount → ClusterRole → ClusterRoleBinding → Service → Deployment` | "Always start with RBAC, then Service, then Deployment. Other resources are optional." |
| Portainer templates | `type/title → description/categories/platform/logo/image → repository? → env/ports/volumes? → command?` | "Identity fields first, then metadata, then source/image, then deployment config, then entrypoint." |
| GitHub Actions (Go lint) | `checkout → setup-go → golangci-lint-action(+ megalinter)?` | "Checkout, set up Go, run the linter. megalinter is optional, for extra coverage." |
| Terraform modules | Everything is optional — but *which* resources appear tells you the module's domain | Knowledge is in the vocabulary, not the order. VPC implies subnets, route tables, gateways. |
## Algorithm Selection Guide
| When | Use | Why |
|------|-----|-----|
| Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |
## Real-world Results
### Ansible Galaxy (15 roles, 44+ modules each)
@@ -61,11 +96,7 @@ Grammar:
Every single role follows this pattern. The convention was **unwritten** — no document says "Ansible roles should check preconditions first, then install packages, configure with templates, enable services, then optionally install language packages."
An LLM generating a new role:
- **Must** start with conditional includes and variable setup
- **Should** then install packages and configure files
- **Then** start services
- **Finally** include handling of language-specific tooling
This is the first explicit description of the geerlingguy role module ordering convention.
**Compression:** The grammar is ~250 chars. The 15 examples are 7200+ modules combined. **~29× compression.**
@@ -97,8 +128,6 @@ Grammar: (type+title)+.(categories+description+image+logo+name+note+platform)+.
Template fields follow a consistent arc: identity (`type`, `title`) → metadata (`description`, `categories`, `platform`, `logo`) → source (`image`, `repository`) → deployment (`ports`, `volumes`, `env`) → entrypoint (`command`). 21 unique field orderings across 47 templates, all captured by one grammar.
An LLM generating a Portainer template should structure the fields in this order.
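The field arc can be turned into a simple ordering check. The sketch below is illustrative only: the `PHASES` mapping and `follows_arc` are hypothetical names, the grouping follows the prose description (the inferred grammar groups some fields, such as `image`, slightly differently), and fields not named in the arc are ignored.

```python
# Phase of each field, following the arc described in the text.
# Assumption: fields not listed here (e.g. `name`, `note`) are ignored by the check.
PHASES = {
    "type": 0, "title": 0,                # identity
    "description": 1, "categories": 1,
    "platform": 1, "logo": 1,             # metadata
    "image": 2, "repository": 2,          # source
    "ports": 3, "volumes": 3, "env": 3,   # deployment
    "command": 4,                         # entrypoint
}


def follows_arc(fields):
    """Return True if the known fields appear in non-decreasing phase order."""
    phases = [PHASES[f] for f in fields if f in PHASES]
    return all(a <= b for a, b in zip(phases, phases[1:]))


print(follows_arc(["type", "title", "description", "image", "env"]))
print(follows_arc(["image", "type", "title"]))
```

A check like this is much weaker than matching against the inferred grammar, but it captures the ordering constraint a generator would need to respect.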
### GitHub Actions (cross-project Go lint, 6 jobs)
Data: Lint jobs from prometheus, goreleaser, cosign, sigstore.
@@ -132,6 +161,15 @@ Not every domain has an unwritten convention. Grammar inference failed (produced
The sweet spot: **multiple implementations of the same abstract task** (like "deploy a service" or "configure a chart"), each following a shared but undocumented pattern.
## Algorithm Selection Guide
| When | Use | Why |
|------|-----|-----|
| Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |
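The selection rule in the table can be sketched in a few lines. This is a minimal illustration, not the project's implementation: `pick_best` is a hypothetical helper, the scores and grammars are made up, and it assumes that a lower MDL score is better, as is usual for minimum description length.

```python
def pick_best(candidates, prefer=None):
    """Choose which candidate to report, mimicking the ensemble rule.

    candidates: dict mapping method name -> {"grammar": str, "mdl_score": float}
    prefer:     optional method name; if given, the comparison is skipped.
    """
    if prefer is not None:
        # The real tool would run only this algorithm; here we just select it.
        return {"method": prefer, **candidates[prefer]}
    # Assumption: a lower MDL score means a more compact description, so it wins.
    method = min(candidates, key=lambda name: candidates[name]["mdl_score"])
    return {"method": method, **candidates[method]}


# Hypothetical scores and grammars, for illustration only.
candidates = {
    "crx": {"grammar": "a.b?.c+", "mdl_score": 288.0},
    "idregex": {"grammar": "a.c+", "mdl_score": 301.5},
}
best = pick_best(candidates)
print(best["method"], best["mdl_score"])
```

Passing `prefer="idregex"` returns the iDRegEx candidate regardless of its score, which mirrors the last row of the table.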
## When each algorithm wins
| Data property | Winner | Why |
@@ -142,23 +180,6 @@ The sweet spot: **multiple implementations of the same abstract task** (like "de
| 23 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
## MCP Server
A **Model Context Protocol** server exposes all algorithms and domain adapters:
```bash
python -m bex.mcp_server
```
### Tools
| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both, picks best by MDL. `prefer='crx'` or `prefer='idregex'` to skip comparison. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
## Domain Adapters
### Ansible Roles