grammar-inference-engine/SHOWCASE.md
tobjend 0e2aec582b Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post
- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive
2026-07-01 09:51:41 +02:00


Grammar Inference Engine — Showcase

Infer the unwritten convention from existing examples. Given N example sequences, the engine produces a grammar of roughly 100 characters that captures their shared structure in far fewer tokens than the originals.

How it works

Your agent calls the MCP tool infer_best_grammar with a list of existing sequences. It returns a compressed grammar written in this notation:

a.b       → a then b (concatenation)
(a+b)     → a or b (disjunction)
r?        → optional (zero or one)
r+        → one or more (iteration)
r+?       → zero or more
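To make the notation above concrete, here is a minimal sketch that translates a grammar string into a Python regular expression and checks a sequence against it. This is not the engine's own matcher (the function names `grammar_to_regex` and `matches` are hypothetical), and it assumes symbols consist only of letters, digits, and underscores. Note that `r+?` has to become `*`, because in Python's `re` syntax `+?` means a lazy "one or more" rather than "zero or more".

```python
import re

# Symbols are assumed to be [A-Za-z0-9_]+; everything else is an operator.
TOKEN = re.compile(r"[A-Za-z0-9_]+|[().+?|]")


def grammar_to_regex(grammar: str) -> str:
    """Translate the showcase notation into a Python regex over 'name ' tokens."""
    tokens = TOKEN.findall(grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == ".":
            pass  # concatenation is implicit in regex syntax
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "?", "|"):
            out.append(tok)
        elif tok == "+":
            if nxt == "?":
                out.append("*")  # r+? means zero or more
                i += 1           # consume the '?' as well
            elif nxt is not None and (nxt == "(" or nxt[0].isalnum() or nxt[0] == "_"):
                out.append("|")  # infix '+': disjunction, as in (a+b)
            else:
                out.append("+")  # postfix '+': one or more
        else:
            # A symbol: match the name followed by a single space delimiter.
            out.append("(?:" + re.escape(tok) + " )")
        i += 1
    return "".join(out)


def matches(grammar: str, sequence: list) -> bool:
    """Return True if the whole sequence is accepted by the grammar."""
    text = "".join(name + " " for name in sequence)
    return re.fullmatch(grammar_to_regex(grammar), text) is not None


grammar = "fail?.(include_vars+package+template)+.service+?.lineinfile?"
print(matches(grammar, ["include_vars", "package", "template", "service"]))
print(matches(grammar, ["service"]))
```

The first call accepts the sequence; the second rejects it because the parenthesised group must occur at least once.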

Use prefer='crx' for full coverage (accepts all examples), or let the ensemble pick between CRX and iDRegEx by MDL score.
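The selection rule described here can be sketched in a few lines. This is an illustration only: `Candidate` and `pick_grammar` are hypothetical names, the MDL values are toy numbers rather than real scores, and the engine's actual scoring is not reproduced.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    algorithm: str  # "crx" or "idregex"
    grammar: str
    mdl: float      # Minimum Description Length; lower is better


def pick_grammar(candidates: List[Candidate], prefer: Optional[str] = None) -> Candidate:
    """Return the preferred algorithm's grammar, or the lowest-MDL one."""
    if prefer is not None:
        for candidate in candidates:
            if candidate.algorithm == prefer:
                return candidate
        raise ValueError(f"no candidate produced by {prefer!r}")
    return min(candidates, key=lambda c: c.mdl)


# Toy candidates with made-up scores, purely for illustration.
candidates = [
    Candidate("crx", "a.(b+c)+", mdl=42.0),
    Candidate("idregex", "a.(b|c)", mdl=37.5),
]

print(pick_grammar(candidates).algorithm)                # lowest MDL wins
print(pick_grammar(candidates, prefer="crx").algorithm)  # ensemble skipped
```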

Ansible Galaxy roles — 15 geerlingguy roles

Jeff Geerling (geerlingguy) maintains more than 100 of the most popular Ansible roles on Galaxy. Their shared task structure is not, to our knowledge, written down anywhere. The grammar below, inferred from 15 of those roles, makes it explicit:

Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
         include+?.(npm+pip)+?.lineinfile?

  CRX         MDL=  596.64  match=15/15

All 15 roles follow the same arc: check prerequisites, load OS-specific vars, install packages, configure with templates, start services, and optionally run sub-tasks. Inference works here because the roles have converged on the same unwritten convention.
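Before inference, each role has to be reduced to a sequence of task module names. The showcase does not show how the engine does this, so the following is only one plausible sketch: it takes tasks that are already parsed into Python dicts and keeps the first key that is not a task keyword. The `TASK_KEYWORDS` set, the `task_sequence` helper, and the example tasks are all illustrative assumptions.

```python
# Keys that describe how a task runs rather than which module it calls.
# This set is an illustrative assumption and is not exhaustive.
TASK_KEYWORDS = {
    "name", "when", "tags", "become", "register", "notify",
    "with_items", "loop", "vars", "changed_when", "failed_when",
}


def task_sequence(tasks):
    """Reduce a list of parsed task dicts to a list of module names."""
    sequence = []
    for task in tasks:
        for key in task:
            if key not in TASK_KEYWORDS:
                # Drop a collection prefix such as "ansible.builtin."
                sequence.append(key.rsplit(".", 1)[-1])
                break
    return sequence


# Hypothetical tasks, shaped like the arc described above.
example_tasks = [
    {"name": "Load OS-specific vars", "include_vars": "{{ ansible_os_family }}.yml"},
    {"name": "Install packages", "package": {"name": "nginx", "state": "present"}, "become": True},
    {"name": "Write config", "ansible.builtin.template": {"src": "a.j2", "dest": "/etc/a"}, "notify": "restart"},
    {"name": "Start service", "service": {"name": "nginx", "state": "started"}},
]

print(task_sequence(example_tasks))
# ['include_vars', 'package', 'template', 'service']
```

One such list per role would then form the input to infer_best_grammar.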

Compression: 15 roles (~5,000 tokens) → 60 tokens, roughly an 80-fold reduction.

Notation reference

Symbol   Meaning
a.b      a then b
(a+b)    a or b (CRX disjunction)
(a|b)    a or b (iDRegEx disjunction)
r?       zero or one
r+       one or more
r+?      zero or more
MDL      Minimum Description Length (lower is better)

Usage

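from bex.mcp_server import infer_best_grammar

# role_sequences is the list of example sequences, one per role;
# it must be defined before this call.
output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",  # force CRX for full coverage and skip the ensemble
)
print(output)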