# Discovering Unwritten Conventions with Grammar Inference

**How we turned 36 Ansible roles into a 200-character grammar, and why
it matters for LLM agents.**

## The problem

Every codebase has unwritten conventions. Your team's Docker Compose
files always put `image` before `ports` before `volumes`. Your Ansible
deploy roles always start with `assert`, then `file`, then `template`.
Your CI pipelines always run `lint` before `test` before `deploy`.

Nobody writes these down. They're emergent: copied from role to role,
file to file, until they become a tacit standard.

When an LLM agent needs to generate new content that follows these
conventions, you have two options:

1. **Stuff every existing file into context.** 36 deploy roles come to
   about 15,000 tokens, and that is a single convention. Repeat it for
   a few more and the context window is gone.
2. **Give it one or two examples and hope.** The LLM will guess the
   pattern, and it will often guess wrong.

Neither is good. The first is wasteful. The second is unreliable.

What you really want is the **compiled convention**: the minimal
description of what all 36 roles share, expressed in ~200 tokens. An
LLM can follow a rule in 200 tokens far more reliably than it can
infer a pattern from 36 examples.

This is grammar inference.

## The approach

Given a set of example sequences over some alphabet (e.g., Ansible
module names, Docker Compose keys, CI job names), learn a regular
expression that describes the general pattern.

We implemented two algorithms from Bex et al., described in a pair of
papers from 2010 (one in TODS, one on arXiv):

- **CRX** (TODS 2010 §6): A single-pass algorithm that builds a
  predecessor relation over symbols, computes equivalence classes,
  and emits a Chain Regular Expression (CHARE) that matches ALL
  input sequences. Fast, deterministic, captures the full vocabulary.

- **iDRegEx** (arXiv 2010): A probabilistic algorithm using k-testable
  Observation Automata (k-OA) trained with Baum-Welch EM. It finds
  only the *minimal common core*: the symbols that appear in every
  example. Robust against noise, but fails (returns ∅) when the
  examples are too diverse.

Both run in the **ensemble**: CRX produces a permissive grammar (full
vocabulary, many optional parts), iDRegEx produces a strict grammar
(minimal core). A Minimum Description Length (MDL) score picks the
winner: the grammar that compresses the data best.

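The selection step of the ensemble can be sketched in a few lines. This is a minimal illustration, not the project's `infer_ensemble`: the function name `pick_best`, the shape of the `candidates` dictionary, and the example scores are all assumptions made for the sketch.

```python
def pick_best(candidates):
    """Return (name, grammar, mdl) for the candidate with the lowest MDL.

    `candidates` maps an algorithm name to (grammar, mdl_score). A grammar
    of None stands for an empty result (the "∅" case) and is skipped.
    This layout is hypothetical, chosen only for the illustration.
    """
    viable = [
        (mdl, name, grammar)
        for name, (grammar, mdl) in candidates.items()
        if grammar is not None
    ]
    if not viable:
        return None
    mdl, name, grammar = min(viable)  # lowest MDL wins; ties break on name
    return name, grammar, mdl


# Made-up scores: iDRegEx found nothing, so CRX wins by default.
print(pick_best({"crx": ("a.b?", 12.0), "idregex": (None, 0.0)}))
```

When one algorithm returns nothing, the other wins regardless of its score, which is the situation the "too diverse" rows in the results describe.
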
## The algorithms, briefly

### CRX: Chain Regular Expression inference

CRX (Algorithm 7, TODS 2010) works in four steps:

1. **Build the immediate-predecessor relation.** For every adjacent
   pair (x, y) across all sequences, record that x precedes y. If
   `assert` is immediately followed by `file` in any sequence, record
   `assert → file`.

2. **Compute equivalence classes.** Take the reflexive-transitive
   closure of the predecessor relation. The strongly connected
   components are *equivalence classes*: groups of symbols that can
   each reach the other, so they occupy the same position in the
   chain. If one role has `copy` directly before `template` and
   another has `template` directly before `copy`, the two are in the
   same class.

3. **Merge singleton classes.** A class with one symbol that shares
   the same predecessor/successor sets as another singleton class
   gets merged. This handles symbols that always appear in the
   same structural position: if `copy` and `template` both follow
   `file` and precede `command`, they end up in one class.

4. **Topological sort.** The equivalence classes are sorted by their
   position in the Hasse diagram of the predecessor relation. Each
   class becomes a factor in the output, annotated with a quantifier:
   - `+` (one or more) if the class forms a cycle
   - `+?` (zero or more) if the class appears a variable number of
     times, including not at all
   - `?` (optional) if the class can be absent
   - (exact) if the class always appears exactly once

The result is a CHARE: a sequence of factors where each factor is a
disjunction of equivalent symbols with a quantifier. In the notation
used here, `.` is concatenation, `+` inside parentheses separates
alternatives, and a trailing `+`, `+?` or `?` is the quantifier.

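Steps 1 and 2 can be sketched as follows. This is a simplified illustration rather than our implementation: it computes the closure by plain graph search, returns classes in first-seen order instead of topological order, and omits singleton merging and quantifier assignment. The function names are hypothetical.

```python
def successor_edges(sequences):
    """Step 1: map each symbol to the set of symbols seen directly after it."""
    edges = {}
    for seq in sequences:
        for sym in seq:
            edges.setdefault(sym, set())
        for x, y in zip(seq, seq[1:]):
            edges[x].add(y)
    return edges


def reachable(edges, start):
    """All symbols reachable from `start`, including `start` itself."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def equivalence_classes(sequences):
    """Step 2: group symbols that can each reach the other."""
    edges = successor_edges(sequences)
    reach = {sym: reachable(edges, sym) for sym in edges}
    classes, assigned = [], set()
    for sym in edges:  # dict order = order of first appearance
        if sym in assigned:
            continue
        cls = frozenset(t for t in reach[sym] if sym in reach[t])
        classes.append(cls)
        assigned |= cls
    return classes


roles = [["a", "b", "c", "b", "d"], ["a", "d"]]
print(equivalence_classes(roles))
```

In the toy input, `b` and `c` follow each other in both directions, so they form one class, while `a` and `d` stay singletons.
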
### iDRegEx: k-optimal regular expression inference

iDRegEx (Algorithm 4, arXiv 2010) uses a probabilistic automaton:

1. **Build a complete k-OA.** A k-testable Observation Automaton
   records all k-grams (contiguous subsequences of length k) from the
   input sequences. The automaton's states represent (k-1)-grams.

2. **Train with Baum-Welch.** EM iterations assign probabilities to
   transitions, learning which paths through the automaton are most
   likely given the data.

3. **Disambiguate.** Remove nondeterministic transitions: for any
   state and symbol, keep only the most probable next state.

4. **Prune.** Remove low-probability edges and unreachable states,
   leaving only the most likely paths.

5. **Extract with rwr².** The REWRITE-SQUARED algorithm (rwr²,
   Algorithm 3) collapses the pruned automaton into a k-optimal
   regular expression: the minimal common core.

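To make steps 1 and 4 concrete, here is a rough sketch that collects k-grams into transitions between (k-1)-gram states and prunes rare edges. It estimates probabilities by relative frequency, which is a deliberately simple stand-in for Baum-Welch, and it skips disambiguation and rwr² entirely. The function names, the `^` / `$` padding markers, and the threshold value are assumptions for the illustration.

```python
from collections import Counter, defaultdict


def kgram_transitions(sequences, k=2):
    """Count transitions from each (k-1)-gram state to the next symbol."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # "^" and "$" are assumed start/end markers, not real symbols.
        padded = ["^"] * (k - 1) + list(seq) + ["$"]
        for i in range(len(padded) - k + 1):
            state = tuple(padded[i:i + k - 1])
            counts[state][padded[i + k - 1]] += 1
    return counts


def prune(counts, threshold=0.25):
    """Keep only edges whose relative frequency reaches `threshold`."""
    kept = {}
    for state, nxt in counts.items():
        total = sum(nxt.values())
        kept[state] = {
            sym: n / total for sym, n in nxt.items() if n / total >= threshold
        }
    return kept


examples = [["a", "b"]] * 4 + [["a", "c"]]
print(prune(kgram_transitions(examples))[("a",)])
```

With four `a b` sequences and one `a c`, the rare `a → c` edge falls below the threshold and is dropped, which is the sense in which the method is robust against noise.
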
### MDL scoring: picking the right level of specificity

The Minimum Description Length principle (Rissanen 1978) says: the
best grammar is the one that minimizes the sum of its own size and
the cost of encoding the data using it.

```
MDL = model_cost + data_cost
```

**model_cost** = the number of alphabet symbol occurrences in the
grammar. A grammar with 5 unique symbols used once each has
model_cost = 5.

**data_cost** = Σ log₂(|L(r)|), summed over all sequences s, where
|L(r)| is the number of strings of length len(s) that the grammar r
accepts. A grammar like `(s1+s2+...+s19)+` over a 19-symbol alphabet
accepts any of the 19 symbols at each position, so for a sequence of
length 120, the data cost is 120 × log₂(19) ≈ 510 bits. A grammar like
`a.b.c.d.e` accepts only 1 string of length 5, so its data cost is
log₂(1) = 0.

The ensemble picks the grammar with the lowest total MDL. This
automatically balances specificity against coverage: a grammar that
matches only 1 sequence but does so perfectly (low data cost) can
beat a grammar that matches all sequences but is extremely permissive
(high data cost).

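The arithmetic above is easy to check. The sketch below covers only the fully permissive case, where every position may hold any symbol of the alphabet; the function names are hypothetical and the general case needs a real count of |L(r)|.

```python
import math


def data_cost_uniform(alphabet_size, lengths):
    """Bits to encode sequences when any symbol is allowed at every position."""
    return sum(n * math.log2(alphabet_size) for n in lengths)


def mdl(model_cost, data_cost):
    """Total description length: grammar size plus cost of the data."""
    return model_cost + data_cost


cost = data_cost_uniform(19, [120])   # one sequence of length 120
print(round(cost))                    # about 510 bits
print(data_cost_uniform(1, [5]))      # a single allowed string costs 0 bits
```
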
## The bugs we found (and fixed)

Implementing the BEX algorithms faithfully required solving several
subtle problems.

### Bug 1: model_cost counted characters, not symbols

The paper defines model_cost as "the length of r": the number of
symbols in the expression. For the toy alphabet {a, b, c, d, e} used
in the paper, characters and symbols are the same. For real-world
symbols like `community.docker.docker_image`, they aren't.

Our `model_cost` function was counting characters (226 for a typical
grammar), when it should count symbol occurrences (19). This
massively inflated the MDL score, making CRX appear worse than it
actually was.

**Fix:** Count occurrences of alphabet symbols in the expression using
regex word-boundary matching, not string length.

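A minimal sketch of symbol counting, assuming the alphabet is known up front. It is not our exact `model_cost`: the longest-first alternation and the `\w` lookarounds are choices made for this illustration so that dotted names are matched whole.

```python
import re


def model_cost(expr, alphabet):
    """Count occurrences of alphabet symbols in `expr`, not characters."""
    if not alphabet:
        return 0
    # Longest symbols first, so a dotted name such as
    # `community.docker.docker_image` is matched as one symbol.
    ordered = sorted(alphabet, key=len, reverse=True)
    pattern = "|".join(re.escape(sym) for sym in ordered)
    # The lookarounds stop `file` from matching inside `lineinfile`.
    symbol_re = re.compile(rf"(?<!\w)(?:{pattern})(?!\w)")
    return len(symbol_re.findall(expr))


expr = "file.(template+community.docker.docker_image)+?.lineinfile?"
alphabet = {"file", "template", "community.docker.docker_image", "lineinfile"}
print(model_cost(expr, alphabet), "symbols vs", len(expr), "characters")
```
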
### Bug 2: Dispatch order in _count_words_fast

The recursive function `_count_words_fast` estimates |L(r)|: the
number of strings a grammar accepts at a given length. It is meant to
dispatch on expression structure: first check for concatenation (`.`),
then trailing quantifiers (`+?`, `*`, `?`, `+`), then disjunction groups.

Our dispatch instead checked `endswith('+?')` before checking
`'.' in expr`. For the expression `(All)+.Role?.RoleBinding?.Job+?`,
the trailing `+?` on `Job+?` triggered the quantifier branch first,
applying the `+?` to the **entire** expression instead of just the
`Job` factor.

**Fix:** Check concatenation first. Top-level dots can only appear in
concatenation, so they should be handled before any quantifier logic.

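The corrected order can be illustrated with a splitter that only looks at dots outside parentheses. This is a sketch of the dispatch idea, not `_count_words_fast` itself; `split_top_level` and `classify` are hypothetical names, and the sketch assumes top-level symbols do not themselves contain dots.

```python
def split_top_level(expr, sep="."):
    """Split `expr` on `sep`, ignoring separators inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def classify(expr):
    """Name the branch a dispatcher should take for `expr`."""
    if len(split_top_level(expr)) > 1:
        return "concat"  # checked first, before any quantifier test
    for quant in ("+?", "*", "?", "+"):
        if expr.endswith(quant):
            return "quantified"
    return "atom"


print(split_top_level("(All)+.Role?.RoleBinding?.Job+?"))
print(classify("(All)+.Role?.RoleBinding?.Job+?"), classify("Job+?"))
```

Only after the expression has been split into factors does the trailing `+?` get interpreted, and then only for the `Job` factor.
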
### Bug 3: Greedy matching without backtracking

The `_match_tokens` function checked whether a sequence matches a
grammar. For quantifiers like `+?` (zero-or-more), it greedily
consumed ALL consecutive matching symbols, then moved on. This failed
for grammars like `a+?.a` on input `['a', 'a']`: the `a+?` ate both
`a`s, and there was nothing left for the trailing `a`.

**Fix:** Replace the single-pass greedy matching with `_match_possible`,
a proper backtracking engine that enumerates ALL valid end positions
for each token and picks the maximum. This is essentially a tiny
regex engine, but limited to the CHARE subset, so it avoids the
exponential blowup of general regex matching.

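The idea of enumerating every valid end position can be sketched as below. It is a simplified stand-in for `_match_possible`, under the assumption that a grammar is already parsed into `(symbols, quantifier)` factors; unlike the description above it keeps the whole set of reachable positions rather than picking one, and all names are hypothetical.

```python
def end_positions(seq, start, symbols, quant):
    """All positions where one factor may stop if it starts at `start`."""
    run = start
    while run < len(seq) and seq[run] in symbols:
        run += 1
    longest = run - start  # how many matching symbols are available
    if quant == "":
        lo, hi = 1, 1
    elif quant == "?":
        lo, hi = 0, 1
    elif quant == "+":
        lo, hi = 1, longest
    else:  # "+?" or "*": zero or more
        lo, hi = 0, longest
    return {start + n for n in range(lo, min(hi, longest) + 1)}


def matches(seq, factors):
    """True if `seq` can be consumed exactly by the chain of factors."""
    positions = {0}
    for symbols, quant in factors:
        positions = {
            end
            for start in positions
            for end in end_positions(seq, start, symbols, quant)
        }
        if not positions:
            return False
    return len(seq) in positions


grammar = [({"a"}, "+?"), ({"a"}, "")]  # the grammar a+?.a
print(matches(["a", "a"], grammar))
```

Because each factor contributes at most one position per input index, the work stays polynomial in the length of the sequence.
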
### Bug 4: Dot-splitting inside disjunctions

Module names like `community.docker.docker_image` contain dots.
When `_parse_parts` processed a disjunction child, it recursively
called itself, which split the expression on `.` before treating it
as a symbol. The symbol `community.docker.docker_image` became
`community` then `docker` then `docker_image`: three concatenated
symbols instead of one.

**Fix:** Disjunction children are always flat symbols (CRX and
iDRegEx don't produce nested disjunctions in practice). Parse them
with `_parse_flat_symbol`, which strips quantifiers but never splits
on `.`.

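A flat-symbol parser along these lines is enough to keep dotted module names intact. This is a sketch of the behaviour described above, not the actual `_parse_flat_symbol`; the return shape (a `(symbol, quantifier)` tuple) is an assumption.

```python
# Longest quantifier first, so "+?" is not mistaken for a bare "?".
QUANTIFIERS = ("+?", "*", "?", "+")


def parse_flat_symbol(text):
    """Strip one trailing quantifier; never split the name on dots."""
    text = text.strip()
    for quant in QUANTIFIERS:
        if text.endswith(quant):
            return text[: -len(quant)], quant
    return text, ""


print(parse_flat_symbol("community.docker.docker_image+?"))
print("community.docker.docker_image".split("."))  # what the bug produced
```
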
## The results

### Ansible deploy roles: 36 roles from companyweb

Our own deploy roles cover everything from AdGuard Home to
Woodpecker CI. They have NO schema: each is a free-form script.

```
Grammar: docker_volume+?.group?.docker_container?.user?.apt?.npm?.
         (assert+...+command+copy+file+template+set_fact+...+wait_for)+?.
         (cron+firewalld)?
Match: 36/36
MDL: 2186.28
```

Bottleneck analysis: optional docker setup (volume, group, container,
user, apt, npm), then a large disjunction of ~25 task modules (zero or
more, per the `+?`), then optional cron/firewalld at the end. This
captures the convention precisely.

**Compression: 36 roles (15,000 tokens) → 200 tokens (75×)**

### Geerlingguy Galaxy roles: 15 popular roles

Jeff Geerling's roles are the most popular on Ansible Galaxy. He has
never documented their structural pattern. Yet every one of the 15
follows the same arc:

```
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
         include+?.(npm+pip)+?.lineinfile?
Match: 15/15
MDL: 596.64
```

Check prerequisites, OS-specific variables, install packages,
configure with templates, start services, optionally run sub-tasks,
install npm/pip packages, and optionally tweak config lines.

**As far as we know, this is the first explicit description of the
geerlingguy role convention.** It took 15 roles and a grammar
inference algorithm to write it down.

**Compression: 15 roles (5,000 tokens) → 60 tokens (83×)**

### Docker Compose: by project

Docker Compose has a flexible schema, but each project develops its
own convention:

**mcp-deployment (36 services):**

```
(build+image).command.(environment+volumes)?.ports
```

**files (6 services):**

```
image.environment.volumes.network_mode.privileged?.cap_add?
```

**fresh-ape-base (9 services):**

```
image.ports?.(depends_on+environment+user+volumes)+
```

### Ensemble dynamics

The ensemble (CRX + iDRegEx + MDL) selects different winners
depending on the data:

| Dataset | Winner | Why |
|---------|--------|-----|
| Ansible deploy (36 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Ansible galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Ansible restore (2 roles) | CRX | Both match all; CRX more compact |
| Ansible configure (4 roles) | **iDRegEx** | Finds minimal core `include_role` |
| Ansible manage (2 roles) | **iDRegEx** | Core: `assert.authorized_key` |

iDRegEx wins when the data has a clear common core. CRX wins when
there's no single shared subsequence (the roles share the *vocabulary*
but not the *order*).

## The MCP

The engine is exposed as an MCP server:

```python
from bex.mcp_server import infer_best_grammar

# Full coverage
output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",
)
# Returns:
# Best: CRX (MDL 2186.28)
# Grammar: docker_volume+?.group?...(assert+...+wait_for)+?.(cron+firewalld)?

# Ensemble: let MDL pick
output = infer_best_grammar(sequences=role_sequences)
```

An agent workflow:

1. Agent needs to write deploy role #37
2. Finds 36 existing deploy roles, extracts their task module sequences
3. Calls `infer_best_grammar(sequences=..., prefer='crx')`
4. Gets back the grammar in 200 tokens
5. Generates a new role that follows the structural pattern

Without the MCP: 36 role files in context (15,000 tokens), or guesswork.
With the MCP: one grammar rule (200 tokens), known to match 36/36 roles.

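Step 2 of the workflow, turning a role into a sequence of module names, might look like the sketch below. It assumes the role's task file has already been parsed into a list of dictionaries (for example by a YAML loader) and treats the first key that is not a task keyword as the module. The keyword list is partial and the function name is hypothetical; `block` sections and `action:` style tasks are not handled.

```python
# Task-level keywords that are not module names (a partial, assumed list).
TASK_KEYWORDS = {
    "name", "when", "register", "become", "become_user", "tags", "loop",
    "with_items", "notify", "vars", "args", "environment", "no_log",
    "changed_when", "failed_when", "ignore_errors", "delegate_to",
    "until", "retries", "delay",
}


def module_sequence(tasks):
    """Return the module name of each task, in order."""
    sequence = []
    for task in tasks:
        modules = [key for key in task if key not in TASK_KEYWORDS]
        if modules:
            sequence.append(modules[0])
    return sequence


tasks = [
    {"name": "Check variables", "assert": {"that": ["app_port is defined"]}},
    {"name": "Create data dir", "file": {"path": "/srv/app", "state": "directory"}},
    {"name": "Render config", "template": {"src": "app.j2", "dest": "/srv/app/app.conf"},
     "notify": "restart app"},
]
print(module_sequence(tasks))
```

One such list per role is the `sequences` argument passed to `infer_best_grammar` in the workflow above.
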
## What it means

Grammar inference turns **examples** into **rules**. The rule is a
compressed description of the structural convention, and for
schema-less content like Ansible roles, this may be the *first time*
the convention has been written down at all.

For LLM agents, this changes the trade-off between context and
accuracy. Instead of flooding the context window with examples, the
agent can call the MCP, get the rule in 60 to 200 tokens, and follow
it. The rule is more reliable than guessing from examples, and it
costs less than the first example would have.

The algorithm doesn't need to understand what a deploy role does. It
doesn't know that `file` creates directories and `template` renders
Jinja2. It only needs to see 36 sequences of module names and find
the pattern they all share. The structural convention is in the data;
you just have to extract it.

## References

- Bex, G. J., Neven, F., Schwentick, T., & Vansummeren, S. (2010).
  *Inference of Concise Regular Expressions and DTDs.* ACM TODS 35(2).
- Bex, G. J., Gelade, W., Neven, F., & Vansummeren, S. (2010).
  *Learning Deterministic Regular Expressions for the Inference of
  Schemas from XML Data.* arXiv:1004.2372.
- Rissanen, J. (1978). *Modeling by shortest data description.*
  Automatica 14(5).