- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive
# Grammar Inference Engine — Showcase

Infer the unwritten convention from existing examples. Given N example
sequences, the engine produces a grammar of roughly 100 characters that
captures their shared structural pattern, in far fewer tokens than the
originals.

## How it works

Your agent calls the MCP tool `infer_best_grammar` with a list of
existing sequences. The tool returns a compressed grammar written in
this notation:

```
a.b      → a then b (concatenation)
(a+b)    → a or b (disjunction)
r?       → optional (zero or one)
r+       → one or more (iteration)
r+?      → zero or more
```

Pass `prefer='crx'` to skip the ensemble and get full coverage (the CRX
grammar accepts all examples), or leave it unset and let the ensemble
pick between CRX and iDRegEx by MDL score.

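As a rough illustration of that choice, the sketch below picks between two candidate grammars. The `Candidate` class, the `pick_candidate` helper, and all scores are hypothetical stand-ins for illustration, not the engine's actual `infer_ensemble` code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    algorithm: str  # "crx" or "idregex"
    grammar: str    # grammar in the showcase notation
    mdl: float      # MDL score, lower is better
    matched: int    # number of input sequences the grammar accepts


def pick_candidate(candidates: List[Candidate],
                   prefer: Optional[str] = None) -> Candidate:
    """Return the preferred algorithm's result, else the lowest-MDL one."""
    if prefer is not None:
        for candidate in candidates:
            if candidate.algorithm == prefer:
                return candidate
        raise ValueError(f"no candidate from algorithm {prefer!r}")
    return min(candidates, key=lambda c: c.mdl)


# Hypothetical scores, invented for this example only.
candidates = [
    Candidate("crx", "a.(b+c)+", mdl=42.0, matched=3),
    Candidate("idregex", "a.b", mdl=30.5, matched=2),
]

print(pick_candidate(candidates).algorithm)         # idregex
print(pick_candidate(candidates, "crx").algorithm)  # crx
```

In this made-up case the lowest-MDL grammar accepts fewer of the examples, which is the situation where forcing `prefer='crx'` would be the safer choice.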
## Ansible Galaxy roles — 15 geerlingguy roles

Jeff Geerling maintains 100+ of the most popular Ansible roles on
Galaxy. He has never written down their task structure. Our grammar is
the first explicit description:

```
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
         include+?.(npm+pip)+?.lineinfile?

CRX  MDL= 596.64  match=15/15
```

Every role follows the same arc: check prerequisites, load OS-specific
vars, install packages, configure with templates, start services, and
optionally run sub-tasks. The inference works because all 15 roles
converged on the same unwritten convention.

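The symbols in the grammar are Ansible module names, so one plausible way to build a role's input sequence is to list the module used by each task in order. This showcase does not say how the sequences were actually extracted. The sketch below is an assumption: `module_sequence`, the keyword list, and the example tasks are all hypothetical.

```python
# Task-level keys that are not module names (a partial, illustrative list).
TASK_KEYWORDS = {
    "name", "when", "tags", "become", "register", "notify",
    "with_items", "loop", "vars", "changed_when", "failed_when",
}


def module_sequence(tasks):
    """Return the ordered module names for a list of parsed task dicts."""
    sequence = []
    for task in tasks:
        # Assume the module is the first key that is not a task keyword.
        module = next((key for key in task if key not in TASK_KEYWORDS), None)
        if module is not None:
            sequence.append(module)
    return sequence


# Hypothetical role, written as the dicts a YAML parser would produce.
example_tasks = [
    {"name": "Include OS-specific variables.",
     "include_vars": "{{ ansible_os_family }}.yml"},
    {"name": "Ensure the package is installed.",
     "package": {"name": "example", "state": "present"}},
    {"name": "Copy the configuration file.",
     "template": {"src": "example.conf.j2", "dest": "/etc/example.conf"},
     "notify": "restart example"},
    {"name": "Ensure the service is running.",
     "service": {"name": "example", "state": "started"}},
]

print(module_sequence(example_tasks))
# ['include_vars', 'package', 'template', 'service']
```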
**Compression: 15 roles (~5,000 tokens) → 60 tokens.**

## Notation reference

| Symbol | Meaning |
|--------|---------|
| `a.b` | a then b |
| `(a+b)` | a or b (CRX disjunction) |
| `(a\|b)` | a or b (iDRegEx disjunction) |
| `r?` | zero or one |
| `r+` | one or more |
| `r+?` | zero or more |
| `MDL` | Minimum Description Length (lower is better) |

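To make the notation in the table concrete, here is a minimal sketch that translates a grammar into a Python regular expression and checks a sequence against it. It is not the engine's matcher. `grammar_to_regex` and `accepts` are hypothetical helpers, and the sketch assumes that symbols contain only letters, digits and underscores, that disjunction appears only inside parentheses, and that a `+` directly followed by a symbol or `(` is a disjunction while any other `+` is iteration.

```python
import re

SYMBOL = re.compile(r"[A-Za-z0-9_]+")


def grammar_to_regex(grammar: str) -> str:
    """Translate the showcase notation into a Python regex source string."""
    text = "".join(grammar.split())  # whitespace is not significant
    pos = 0

    def peek(offset=0):
        index = pos + offset
        return text[index] if index < len(text) else ""

    def is_operand_start(ch):
        return ch == "(" or SYMBOL.match(ch) is not None

    def parse_expr():
        # expr := term ('.' term)*
        nonlocal pos
        parts = [parse_term()]
        while peek() == ".":
            pos += 1
            parts.append(parse_term())
        return "".join(parts)

    def parse_term():
        # term := atom followed by any number of postfix '?' or '+'
        nonlocal pos
        out = parse_atom()
        while True:
            if peek() == "?":
                pos += 1
                out = f"(?:{out})?"
            elif peek() == "+" and not is_operand_start(peek(1)):
                # A '+' with no operand after it is iteration.
                pos += 1
                out = f"(?:{out})+"
            else:
                return out

    def parse_atom():
        # atom := SYMBOL | '(' expr (('+' | '|') expr)* ')'
        nonlocal pos
        if peek() == "(":
            pos += 1
            options = [parse_expr()]
            while peek() in ("+", "|"):
                pos += 1
                options.append(parse_expr())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
            return "(?:" + "|".join(options) + ")"
        match = SYMBOL.match(text, pos)
        if match is None:
            raise ValueError(f"expected a symbol at position {pos}")
        pos = match.end()
        # Each token is encoded as its name plus one trailing space.
        return f"(?:{re.escape(match.group())} )"

    pattern = parse_expr()
    if pos != len(text):
        raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
    return pattern


def accepts(grammar: str, sequence) -> bool:
    """Check whether a token sequence is matched by the whole grammar."""
    encoded = "".join(token + " " for token in sequence)
    return re.fullmatch(grammar_to_regex(grammar), encoded) is not None


demo = "fail?.(package+template)+.service+?"
print(accepts(demo, ["package", "template", "package"]))           # True
print(accepts(demo, ["fail", "template", "service", "service"]))   # True
print(accepts(demo, ["fail", "service"]))                          # False
print(accepts("(a|b).c", ["b", "c"]))                              # True
```

Encoding each token with a trailing space keeps one symbol from matching part of a longer one, so `a` never matches inside `ab`.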
## Usage

```python
from bex.mcp_server import infer_best_grammar

# role_sequences holds one task sequence per role (built by the caller).
# prefer="crx" skips the ensemble and returns the CRX grammar.
output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",
)
```