grammar-inference-engine/blog_post.md
tobjend 0e2aec582b Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post
- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive
2026-07-01 09:51:41 +02:00

# Discovering Unwritten Conventions with Grammar Inference
**How we turned 36 Ansible roles into a 200-token grammar, and why
it matters for LLM agents.**
## The problem
Every codebase has unwritten conventions. Your team's Docker Compose
files always put `image` before `ports` before `volumes`. Your Ansible
deploy roles always start with `assert`, then `file`, then `template`.
Your CI pipelines always run `lint` before `test` before `deploy`.
Nobody writes these down. They're emergent — copied from role to role,
file to file, until they become a tacit standard.
When an LLM agent needs to generate new content that follows these
conventions, you have two options:
1. **Stuff every existing file into context.** 36 deploy roles come to
   about 15,000 tokens, and the bill grows with every role you add.
2. **Give it one or two examples and hope.** The LLM will guess the
   pattern, and it will often guess wrong.

Neither is good. The first is wasteful. The second is unreliable.
What you really want is the **compiled convention** — the minimal
description of what all 36 roles share, expressed in ~200 tokens. An
LLM can follow a rule in 200 tokens far more reliably than it can
infer a pattern from 36 examples.
This is grammar inference.
## The approach
Given a set of example sequences over some alphabet (e.g., Ansible
module names, Docker Compose keys, CI job names), learn a regular
expression that describes the general pattern.
We implemented two algorithms by Bex et al., one from a TODS 2010
paper and one from a 2010 arXiv preprint:
- **CRX** (TODS 2010 §6): A single-pass algorithm that builds a
predecessor relation over symbols, computes equivalence classes,
and emits a Chain Regular Expression (CHARE) that matches ALL
input sequences. Fast, deterministic, captures the full vocabulary.
- **iDRegEx** (arXiv 2010): A probabilistic algorithm using k-testable
Observation Automata (k-OA) trained with Baum-Welch EM. It finds
only the *minimal common core* — the symbols that appear in every
example. Robust against noise, but fails (returns ∅) when the
examples are too diverse.
Both run in the **ensemble**: CRX produces a permissive grammar (full
vocabulary, many optional parts), iDRegEx produces a strict grammar
(minimal core). A Minimum Description Length (MDL) score picks the
winner: the grammar that compresses the data best.
## The algorithms, briefly
### CRX — Chain Regular Expression inference
CRX (Algorithm 7, TODS 2010) works in four steps:
1. **Build the immediate-predecessor relation.** For every adjacent
pair (x, y) across all sequences, record that x precedes y. If
symbol `assert` always appears before `file`, record
`assert → file`.
2. **Compute equivalence classes.** Take the reflexive-transitive
closure of the predecessor relation. The strongly connected
components are *equivalence classes* — groups of symbols that can
appear in the same position. If `copy` and `template` both follow
`file` and precede `command`, they're in the same class.
3. **Merge singleton classes.** A class with one symbol that shares
the same predecessor/successor sets as another singleton class
gets merged. This handles symbols that always appear in the
same structural position.
4. **Topological sort.** The equivalence classes are sorted by their
position in the Hasse diagram of the predecessor relation. Each
class becomes a factor in the output, annotated with a quantifier:
   - `+` (one or more) if the class forms a cycle
   - `+?` (zero or more) if the class can be absent or repeated
   - `?` (optional) if the class appears at most once
   - no quantifier if the class always appears exactly once
The result is a CHARE: a sequence of factors where each factor is a
disjunction of equivalent symbols with a quantifier.
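
Steps 1, 2 and 4 can be sketched in a few lines of Python. This is a simplified illustration, not the project's implementation: the function name `crx_sketch` is hypothetical, step 3 (singleton merging) is left out, and the quantifier is chosen by counting how often each class occurs per sequence, which is an assumption about how the rules above are applied.

```python
def crx_sketch(sequences):
    """Simplified CRX-style inference (hypothetical helper, no singleton merging)."""
    symbols = sorted({s for seq in sequences for s in seq})

    # Step 1: immediate-predecessor relation, stored as successor sets.
    succ = {s: set() for s in symbols}
    for seq in sequences:
        for x, y in zip(seq, seq[1:]):
            succ[x].add(y)

    # Step 2: reflexive-transitive closure, computed by fixpoint iteration.
    reach = {s: {s} for s in symbols}
    changed = True
    while changed:
        changed = False
        for s in symbols:
            new = set().union(*(succ[t] for t in reach[s])) - reach[s]
            if new:
                reach[s] |= new
                changed = True

    # Mutually reachable symbols form one equivalence class.
    classes, seen = [], set()
    for s in symbols:
        if s not in seen:
            cls = frozenset(t for t in reach[s] if s in reach[t])
            classes.append(cls)
            seen |= cls

    # Step 4: a class that reaches another one has a strictly larger
    # reachable set, so sorting by that size gives a topological order.
    classes.sort(key=lambda c: (-len(reach[next(iter(c))]), sorted(c)))

    factors = []
    for cls in classes:
        counts = [sum(1 for s in seq if s in cls) for seq in sequences]
        lo, hi = min(counts), max(counts)
        if lo >= 1 and hi == 1:
            quant = ""      # exactly once in every sequence
        elif lo == 0 and hi <= 1:
            quant = "?"     # at most once
        elif lo >= 1:
            quant = "+"     # at least once, sometimes repeated
        else:
            quant = "+?"    # sometimes absent, sometimes repeated
        body = "+".join(sorted(cls))
        factors.append((f"({body})" if len(cls) > 1 else body) + quant)
    return ".".join(factors)


roles = [
    ["assert", "file", "template"],
    ["assert", "file", "copy", "template"],
    ["assert", "file", "template", "copy"],
]
print(crx_sketch(roles))  # assert.file.(copy+template)+
```

Here `copy` and `template` each precede the other somewhere in the sample, so they collapse into one class, while `assert` and `file` stay fixed single factors.
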
### iDRegEx — k-optimal regular expression inference
iDRegEx (Algorithm 4, arXiv 2010) uses a probabilistic automaton:
1. **Build a complete k-OA.** A k-testable Observation Automaton
records all k-grams (contiguous runs of k symbols) from the input
sequences. The automaton's states represent (k-1)-grams.
2. **Train with Baum-Welch.** EM iterations assign probabilities to
transitions, learning which paths through the automaton are most
likely given the data.
3. **Disambiguate.** Remove nondeterministic transitions — for any
state and symbol, keep only the most probable next state.
4. **Prune.** Remove low-probability edges and unreachable states,
leaving only the most likely paths.
5. **Extract with rwr².** The REWRITE-SQUARED algorithm (rwr²,
Algorithm 3) collapses the pruned automaton into a k-optimal
regular expression — the minimal common core.
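
Only step 1 is simple enough to show compactly. The sketch below collects k-grams and stores each one as a transition from a (k-1)-gram state to the next symbol. The function name and the `^` / `$` padding markers are assumptions for illustration; Baum-Welch training, disambiguation, pruning and rwr² are not shown.

```python
from collections import Counter


def kgram_transitions(sequences, k=2):
    """Count transitions (state, next_symbol), where state is a (k-1)-gram.

    Hypothetical helper: "^" pads the start and "$" marks the end.
    """
    counts = Counter()
    for seq in sequences:
        padded = ["^"] * (k - 1) + list(seq) + ["$"]
        for i in range(len(padded) - k + 1):
            gram = tuple(padded[i:i + k])
            counts[(gram[:-1], gram[-1])] += 1
    return counts


transitions = kgram_transitions([["assert", "file"], ["assert", "copy"]], k=2)
for (state, symbol), n in sorted(transitions.items()):
    print(state, "->", symbol, n)
```

The raw counts are what an EM procedure would later turn into transition probabilities.
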
### MDL scoring — picking the right level of specificity
The Minimum Description Length principle (Rissanen 1978) says: the
best grammar is the one that minimizes the sum of its own size and
the cost of encoding the data using it.
```
MDL = model_cost + data_cost
```
**model_cost** = the number of alphabet symbol occurrences in the
grammar. A grammar with 5 unique symbols used once each has
model_cost = 5.
**data_cost** = Σ log₂(|L(r)|), summed over every input sequence s,
where |L(r)| is the number of strings of length len(s) that the
grammar r accepts.
A grammar like `(a+b+c+...+z)+` whose disjunction lists 19 symbols
accepts any of the 19 at each position, so for a sequence of length
120 it accepts 19¹²⁰ strings and the data cost is
120 × log₂(19) ≈ 510 bits. A grammar like `a.b.c.d.e` accepts only
1 string of length 5, so its data cost is log₂(1) = 0.
The ensemble picks the grammar with the lowest total MDL. This
automatically balances specificity against coverage: a grammar that
matches only 1 sequence but does so perfectly (low data cost) can
beat a grammar that matches all sequences but is extremely permissive
(high data cost).
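
The arithmetic can be checked directly. The helpers below are a minimal sketch with hypothetical names; they take the per-sequence counts |L(r)| as given rather than deriving them from a grammar.

```python
import math


def data_cost(accepted_counts):
    """Sum of log2(|L(r)|) over all sequences.

    accepted_counts[i] is the number of strings of the same length as
    sequence i that the grammar accepts (assumed to be at least 1).
    """
    return sum(math.log2(n) for n in accepted_counts)


def mdl(model_cost, accepted_counts):
    """Total description length: grammar size plus cost of encoding the data."""
    return model_cost + data_cost(accepted_counts)


# Permissive grammar: 19 choices at each of 120 positions.
print(round(data_cost([19 ** 120])))  # 510

# Strict grammar a.b.c.d.e: 5 symbol occurrences, exactly one accepted string.
print(mdl(5, [1]))  # 5.0
```
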
## The bugs we found (and fixed)
Implementing the Bex et al. algorithms faithfully required solving several
subtle problems.
### Bug 1: model_cost counted characters, not symbols
The paper defines model_cost as "the length of r" — the number of
symbols in the expression. For the toy alphabet {a, b, c, d, e} used
in the paper, characters and symbols are the same. For real-world
symbols like `community.docker.docker_image`, they aren't.
Our `model_cost` function was counting characters (226 for a typical
grammar), when it should count symbol occurrences (19). This
massively inflated the MDL score, making CRX appear worse than it
actually was.
**Fix:** Count occurrences of alphabet symbols in the expression using
regex word-boundary matching, not string length.
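
One way to do this is sketched below. It is an illustration under assumptions, not the project's `model_cost`: the function name is hypothetical, and longer symbols are tried first so that a dotted name such as `community.docker.docker_image` counts once instead of being picked apart.

```python
import re


def count_symbol_occurrences(grammar, alphabet):
    """Count how many times alphabet symbols occur in a grammar string."""
    if not alphabet:
        return 0
    # Longest first, so a dotted module name wins over any shorter
    # symbol that happens to be contained in it.
    ordered = sorted(alphabet, key=len, reverse=True)
    pattern = (
        r"(?<!\w)(?:" + "|".join(re.escape(s) for s in ordered) + r")(?!\w)"
    )
    return len(re.findall(pattern, grammar))


grammar = "community.docker.docker_image?.(file+template)+"
alphabet = ["community.docker.docker_image", "file", "template"]
print(len(grammar), count_symbol_occurrences(grammar, alphabet))
```

The first number is what a character count would report, the second is the symbol count the MDL score needs.
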
### Bug 2: Dispatch order in _count_words_fast
The recursive function `_count_words_fast` estimates |L(r)| — the
number of strings a grammar accepts at a given length. It dispatches
on expression structure: first check for concatenation (`.`), then
trailing quantifiers (`+?`, `*`, `?`, `+`), then disjunction groups.
Our dispatch checked `endswith('+?')` before checking `'.' in expr`.
For the expression `(All)+.Role?.RoleBinding?.Job+?`, the trailing
`+?` on `Job+?` triggered the quantifier branch first, applying the
`+?` to the **entire** expression instead of just the `Job` factor.
**Fix:** Check concatenation first. Top-level dots can only appear in
concatenation, so they should be handled before any quantifier logic.
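
The corrected dispatch order looks roughly like this. The helper names are hypothetical, and the sketch assumes symbols without dots in their names (dotted module names are the subject of Bug 4).

```python
QUANTIFIERS = ("+?", "*", "?", "+")  # longest first, so "+?" is not read as "?"


def split_top_level(expr, sep="."):
    """Split on separators that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def classify(expr):
    """Decide which branch handles expr: concatenation is checked first."""
    if len(split_top_level(expr)) > 1:
        return "concat"
    for q in QUANTIFIERS:
        if expr.endswith(q):
            return "quantified:" + q
    return "atom"


expr = "(All)+.Role?.RoleBinding?.Job+?"
print(classify(expr))          # concat
print(split_top_level(expr))   # ['(All)+', 'Role?', 'RoleBinding?', 'Job+?']
print(classify("Job+?"))       # quantified:+?
```

With the buggy order, the trailing `+?` would be seen before the dots, and the whole expression would be treated as one quantified unit.
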
### Bug 3: Greedy matching without backtracking
The `_match_tokens` function checked whether a sequence matches a
grammar. For quantifiers like `+?` (zero-or-more), it greedily
consumed ALL consecutive matching symbols, then moved on. This failed
for grammars like `a+?.a` on input `['a', 'a']`: the `a+?` ate both
`a`s, leaving nothing for the final `a` factor to match.
**Fix:** Replace the single-pass greedy matching with `_match_possible`,
a proper backtracking engine that enumerates ALL valid end positions
for each token and picks the maximum. This is essentially a tiny
regex engine — but limited to the CHARE subset, so it avoids the
exponential blowup of general regex matching.
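
The idea of keeping every possible end position, instead of one greedy position, can be sketched as follows. This is not the project's `_match_possible`: the names are hypothetical, and a grammar is assumed to be already parsed into `(symbol_set, quantifier)` factors.

```python
def end_positions(symbols, quant, seq, start):
    """All positions where one factor may stop if it starts at `start`."""
    run = 0
    while start + run < len(seq) and seq[start + run] in symbols:
        run += 1
    lo = 0 if quant in ("?", "+?") else 1   # may the factor be skipped?
    hi = 1 if quant in ("", "?") else run   # may the factor repeat?
    return [start + n for n in range(lo, min(hi, run) + 1)]


def matches(factors, seq):
    """True if the factor chain can consume the whole sequence."""
    positions = {0}
    for symbols, quant in factors:
        positions = {
            end
            for pos in positions
            for end in end_positions(symbols, quant, seq, pos)
        }
    return len(seq) in positions


# The grammar a+?.a from the bug report
grammar = [({"a"}, "+?"), ({"a"}, "")]
print(matches(grammar, ["a", "a"]))  # True
print(matches(grammar, ["a"]))       # True
print(matches(grammar, []))          # False
```

Because the set of positions can never exceed the sequence length plus one, the work stays polynomial in this restricted setting.
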
### Bug 4: Dot-splitting inside disjunctions
Module names like `community.docker.docker_image` contain dots.
When `_parse_parts` processed a disjunction child, it recursively
called itself — which split the expression on `.` before treating it
as a symbol. The symbol `community.docker.docker_image` became
`community` then `docker` then `docker_image` — three concatenated
symbols instead of one.
**Fix:** Disjunction children are always flat symbols (CRX and
iDRegEx don't produce nested disjunctions in practice). Parse them
with `_parse_flat_symbol`, which strips quantifiers but never splits
on `.`.
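
A flat-symbol parser along these lines might look like the sketch below. The function names are hypothetical stand-ins for `_parse_flat_symbol`, and the sketch assumes disjunction children carry no `+` quantifier of their own, since `+` is also the disjunction separator.

```python
QUANTIFIERS = ("+?", "*", "?", "+")  # longest first


def parse_flat_symbol_sketch(text):
    """Strip a trailing quantifier; never split the name on dots."""
    for q in QUANTIFIERS:
        if text.endswith(q):
            return text[: -len(q)], q
    return text, ""


def parse_disjunction_sketch(group):
    """Return the symbols of a group such as (a+b+c), each kept whole."""
    inner = group[1:-1] if group.startswith("(") and group.endswith(")") else group
    return [parse_flat_symbol_sketch(child)[0] for child in inner.split("+")]


print(parse_flat_symbol_sketch("community.docker.docker_image?"))
print(parse_disjunction_sketch("(copy+community.docker.docker_image+template)"))
```
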
## The results
### Ansible deploy roles — 36 roles from companyweb
Our own deploy roles cover everything from AdGuard Home to
Woodpecker CI. They have no schema: each is a free-form script.
```
Grammar: docker_volume+?.group?.docker_container?.user?.apt?.npm?.
(assert+...+command+copy+file+template+set_fact+...+wait_for)+?.
(cron+firewalld)?
Match: 36/36
MDL: 2186.28
```
Reading the grammar left to right: optional Docker and system setup
(volume, group, container, user, apt, npm), then a large disjunction
of ~25 task modules repeated zero or more times (`+?`), then optional
cron/firewalld at the end. One expression covers all 36 roles.
**Compression: 36 roles (15,000 tokens) → 200 tokens (75×)**
### Geerlingguy Galaxy roles — 15 popular roles
Jeff Geerling's roles are among the most popular on Ansible Galaxy.
We could not find a written description of their structural pattern,
yet all 15 follow the same arc:
```
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
include+?.(npm+pip)+?.lineinfile?
Match: 15/15
MDL: 596.64
```
Check prerequisites, OS-specific variables, install packages,
configure with templates, start services, optionally run sub-tasks,
install npm/pip packages, and optionally tweak config lines.
**As far as we know, this is the first explicit description of the
geerlingguy role convention.** It took 15 roles and a grammar
inference algorithm to write it down.
**Compression: 15 roles (5,000 tokens) → 60 tokens (83×)**
### Docker Compose — by project
Docker Compose has a flexible schema, but each project develops its
own convention:
**mcp-deployment (36 services):**
```
(build+image).command.(environment+volumes)?.ports
```
**files (6 services):**
```
image.environment.volumes.network_mode.privileged?.cap_add?
```
**fresh-ape-base (9 services):**
```
image.ports?.(depends_on+environment+user+volumes)+
```
### Ensemble dynamics
The ensemble (CRX + iDRegEx + MDL) selects different winners
depending on the data:
| Dataset | Winner | Why |
|---------|--------|-----|
| Ansible deploy (36 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Ansible galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Ansible restore (2 roles) | CRX | Both match all; CRX more compact |
| Ansible configure (4 roles) | **iDRegEx** | Finds minimal core `include_role` |
| Ansible manage (2 roles) | **iDRegEx** | Core: `assert.authorized_key` |
iDRegEx wins when the data has a clear common core. CRX wins when
there's no single shared subsequence (the roles share the *vocabulary*
but not the *order*).
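
The selection rule behind the table can be sketched as below. The function name and the candidate format are assumptions, and the scores in the example are made up for illustration, not taken from the datasets above.

```python
def pick_by_mdl(candidates):
    """Pick the lowest-MDL grammar among candidates that produced one.

    candidates maps an algorithm name to (grammar_or_None, mdl_score);
    None stands for the empty result (∅).
    """
    viable = {
        name: (grammar, score)
        for name, (grammar, score) in candidates.items()
        if grammar is not None
    }
    if not viable:
        return None
    best = min(viable, key=lambda name: viable[name][1])
    return best, viable[best][0], viable[best][1]


# iDRegEx returned nothing, so CRX wins by default.
print(pick_by_mdl({"crx": ("a.b?", 12.0), "idregex": (None, 0.0)}))

# Both produced a grammar; the lower score wins.
print(pick_by_mdl({"crx": ("(a+b)+", 40.0), "idregex": ("a", 9.5)}))
```
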
## The MCP
The engine is exposed as an MCP server:
```python
from bex.mcp_server import infer_best_grammar

# role_sequences: one list of task module names per role (built elsewhere)

# Full coverage: skip the ensemble and force CRX
output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",
)
# Returns:
# Best: CRX (MDL 2186.28)
# Grammar: docker_volume+?.group?...(assert+...+wait_for)+?.(cron+firewalld)?

# Ensemble: let MDL pick
output = infer_best_grammar(sequences=role_sequences)
```
An agent workflow:
1. Agent needs to write deploy role #37
2. Finds 36 existing deploy roles, extracts their task module sequences
3. Calls `infer_best_grammar(sequences=..., prefer='crx')`
4. Gets back the grammar in 200 tokens
5. Generates a new role that follows the structural pattern
Without the MCP: 36 role files in context (15,000 tokens), or guesswork.
With the MCP: one grammar rule (200 tokens), known to match 36/36 roles.
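
Step 2 of the workflow, extracting module sequences, could look like the sketch below. It assumes each role's tasks are already loaded as a list of dictionaries (for example by a YAML parser). The keyword list is an illustrative subset and the function name is hypothetical.

```python
# Task-level keywords that are not module names (illustrative subset).
TASK_KEYWORDS = {
    "name", "when", "become", "become_user", "tags", "register", "vars",
    "loop", "with_items", "notify", "ignore_errors", "changed_when",
    "failed_when", "delegate_to", "environment", "args", "no_log",
}


def module_sequence(tasks):
    """Return the module name of each task, in file order."""
    sequence = []
    for task in tasks:
        for key in task:
            if key not in TASK_KEYWORDS:
                sequence.append(key)
                break  # one module per task
    return sequence


tasks = [
    {"name": "Check inputs", "assert": {"that": ["app_port is defined"]}},
    {"name": "Create data dir", "file": {"path": "/srv/app", "state": "directory"}, "become": True},
    {"when": "app_enabled", "template": {"src": "app.conf.j2", "dest": "/srv/app/app.conf"}},
]
print(module_sequence(tasks))  # ['assert', 'file', 'template']
```

One such list per role is the `sequences` argument that `infer_best_grammar` expects in the example above.
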
## What it means
Grammar inference turns **examples** into **rules**. The rule is a
compressed description of the structural convention — and for
schema-less content like Ansible roles, this may be the *first time*
the convention has been written down at all.
For LLM agents, this changes the trade-off between context and
accuracy. Instead of flooding the context window with examples, the
agent can call the MCP server, get the rule in 60 to 200 tokens, and
follow it. The rule is explicit where examples leave the pattern to
guesswork, and in both Ansible datasets above it is shorter than a
single average role (roughly 330 to 420 tokens).
The algorithm doesn't need to understand what a deploy role does. It
doesn't know that `file` creates directories and `template` renders
Jinja2. It only needs to see 36 sequences of module names and find
the pattern they all share. The structural convention is in the data
— you just have to extract it.
## References
- Bex, G. J., Neven, F., Schwentick, T., & Vansummeren, S. (2010).
*Inference of Concise Regular Expressions and DTDs.* ACM Transactions
on Database Systems (TODS) 35(2).
- Bex, G. J., Gelade, W., Neven, F., & Vansummeren, S. (2010).
*Learning Deterministic Regular Expressions for the Semantic Web.*
arXiv:1004.2372.
- Rissanen, J. (1978). *Modeling by shortest data description.*
Automatica 14(5).