- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive
Grammar Inference Engine — Showcase
Infer the unwritten convention from existing examples. Given N example sequences, produce a ~100-char grammar that captures the structural pattern — in far fewer tokens than the originals.
How it works
Your agent calls the MCP tool infer_best_grammar with a list of
existing sequences. It returns a compressed grammar:
a.b → a then b (concatenation)
(a+b) → a or b (disjunction)
r? → optional (zero or one)
r+ → one or more (iteration)
r+? → zero or more
Use prefer='crx' for full coverage (accepts all examples), or let the
ensemble pick between CRX and iDRegEx by MDL score.
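The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the `Candidate` class, the `pick_grammar` function, and the scores in the example are all hypothetical, and the only behavior taken from the text is "lowest MDL wins unless `prefer` names a learner".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One learner's result (hypothetical container, not the project's type)."""
    learner: str   # "crx" or "idregex"
    grammar: str   # grammar in the notation shown above
    mdl: float     # Minimum Description Length score, lower is better


def pick_grammar(candidates: List[Candidate], prefer: Optional[str] = None) -> Candidate:
    """Return the preferred learner's grammar, or the lowest-MDL one."""
    if prefer is not None:
        # A preference skips the comparison entirely.
        for cand in candidates:
            if cand.learner == prefer:
                return cand
        raise ValueError(f"no candidate from learner {prefer!r}")
    return min(candidates, key=lambda cand: cand.mdl)


# Illustrative values only; these are not measured scores.
candidates = [
    Candidate(learner="crx", grammar="a.(b+c)+", mdl=12.0),
    Candidate(learner="idregex", grammar="a.b", mdl=9.5),
]
best = pick_grammar(candidates)
forced = pick_grammar(candidates, prefer="crx")
```

With these toy numbers the ensemble would return the iDRegEx grammar, while `prefer="crx"` returns the CRX grammar regardless of its score.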
Ansible Galaxy roles — 15 geerlingguy roles
Jeff Geerling maintains 100+ of the most popular Ansible roles on Galaxy. He has never written down their task structure. Our grammar is the first explicit description:
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
include+?.(npm+pip)+?.lineinfile?
CRX MDL= 596.64 match=15/15
Every role follows the same arc: check prerequisites, load OS-specific variables, install packages, configure with templates, start services, and optionally run sub-tasks. The inference works because all 15 roles converged on the same unwritten convention.
Compression: 15 roles (~5,000 tokens) → 60 tokens, roughly an 80-fold reduction.
Notation reference
| Symbol | Meaning |
|---|---|
| a.b | a then b |
| (a+b) | a or b (CRX disjunction) |
| (a\|b) | a or b (iDRegEx disjunction) |
| r? | zero or one |
| r+ | one or more |
| r+? | zero or more |
| MDL | Minimum Description Length (lower is better) |
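To make the notation concrete, here is a hedged sketch that translates a grammar string into a Python regular expression and checks whether a sequence matches it. The functions `grammar_to_regex` and `matches` are hypothetical helpers, not part of the project. The sketch assumes symbol names are plain identifiers without dots or spaces, and it treats a `+` as disjunction when an operand follows it and as iteration otherwise.

```python
import re

# Symbols are identifiers; everything else is a one-character operator.
_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[.()+?|]")


def grammar_to_regex(grammar: str) -> re.Pattern:
    """Translate the showcase notation into a regex over space-terminated names."""
    tokens = _TOKEN.findall(grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == ".":
            pass                          # concatenation is implicit in a regex
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "?", "|"):
            out.append(tok)
        elif tok == "+":
            if nxt == "(" or nxt[:1].isalpha() or nxt[:1] == "_":
                out.append("|")           # infix: a or b
            elif nxt == "?":
                out.append("*")           # r+? means zero or more
                i += 1                    # consume the "?"
            else:
                out.append("+")           # postfix: one or more
        else:
            out.append("(?:" + re.escape(tok) + " )")
        i += 1
    return re.compile("".join(out))


def matches(grammar: str, sequence: list) -> bool:
    """True if the whole sequence is accepted by the grammar."""
    encoded = "".join(name + " " for name in sequence)
    return grammar_to_regex(grammar).fullmatch(encoded) is not None


# Toy grammar for illustration; it is not the inferred Galaxy grammar.
toy = "fail?.(package+template)+.service+?"
ok = matches(toy, ["package", "template", "service"])
rejected = matches(toy, ["service"])
```

Each name is encoded with a trailing space so that one symbol cannot match a prefix of another. In this toy grammar the first sequence is accepted and the second is rejected, because at least one `package` or `template` is required.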
Usage
from bex.mcp_server import infer_best_grammar

output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",
)
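The usage snippet leaves `role_sequences` undefined. One way to build it from already-parsed Ansible task lists is sketched below. The assumptions are labeled: that `sequences` is a list of lists of module names is a guess at the input shape, the helpers `task_module` and `role_to_sequence` are hypothetical, and the keyword set is a partial list of Ansible task keywords rather than an exhaustive one.

```python
from typing import Any, Dict, List

# Partial list of task-level keywords that are not module names (assumption:
# extend as needed for the roles being analyzed).
TASK_KEYWORDS = {
    "name", "when", "become", "become_user", "tags", "register", "notify",
    "loop", "vars", "args", "changed_when", "failed_when", "ignore_errors",
    "delegate_to", "environment", "no_log", "until", "retries", "delay",
}


def task_module(task: Dict[str, Any]) -> str:
    """Return the module a task calls, with any collection prefix removed."""
    for key in task:
        if key in TASK_KEYWORDS or key.startswith("with_"):
            continue
        # "ansible.builtin.package" -> "package"; this also keeps dots out of
        # symbol names, since "." is the concatenation operator in a grammar.
        return key.rsplit(".", 1)[-1]
    raise ValueError(f"no module key found in task: {task!r}")


def role_to_sequence(tasks: List[Dict[str, Any]]) -> List[str]:
    """Map one role's task list to its sequence of module names."""
    return [task_module(task) for task in tasks]


# Hand-written example tasks, not taken from a real role.
example_tasks = [
    {"name": "Check prerequisites", "fail": {"msg": "unsupported"}, "when": "false"},
    {"name": "Install", "ansible.builtin.package": {"name": "nginx"}},
    {"template": {"src": "a.j2", "dest": "/etc/a"}, "notify": "restart"},
]
role_sequences = [role_to_sequence(example_tasks)]
# role_sequences could then be passed as sequences=role_sequences,
# if the tool accepts this shape.
```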