grammar-inference-engine/blog_post.md
tobjend 0e2aec582b Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post
- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive
2026-07-01 09:51:41 +02:00


Discovering Unwritten Conventions with Grammar Inference

How we turned 36 Ansible roles into a 200-character grammar — and why it matters for LLM agents.

The problem

Every codebase has unwritten conventions. Your team's Docker Compose files always put image before ports before volumes. Your Ansible deploy roles always start with assert, then file, then template. Your CI pipelines always run lint before test before deploy.

Nobody writes these down. They're emergent — copied from role to role, file to file, until they become a tacit standard.

When an LLM agent needs to generate new content that follows these conventions, you have two options:

  1. Stuff every existing file into context: 36 deploy roles come to about 15,000 tokens, a large share of the context window before the agent has done any work.
  2. Give it one or two examples and hope — the LLM will guess the pattern, and it will often guess wrong.

Neither is good. The first is wasteful. The second is unreliable.

What you really want is the compiled convention — the minimal description of what all 36 roles share, expressed in ~200 tokens. An LLM can follow a rule in 200 tokens far more reliably than it can infer a pattern from 36 examples.

This is grammar inference.

The approach

Given a set of example sequences over some alphabet (e.g., Ansible module names, Docker Compose keys, CI job names), learn a regular expression that describes the general pattern.

We implemented two algorithms by Bex et al., taken from two 2010 papers (one in TODS, one on arXiv):

  • CRX (TODS 2010 §6): A single-pass algorithm that builds a predecessor relation over symbols, computes equivalence classes, and emits a Chain Regular Expression (CHARE) that matches ALL input sequences. Fast, deterministic, captures the full vocabulary.

  • iDRegEx (arXiv 2010): A probabilistic algorithm using k-testable Observation Automata (k-OA) trained with Baum-Welch EM. It finds only the minimal common core — the symbols that appear in every example. Robust against noise, but fails (returns ∅) when the examples are too diverse.

Both run in the ensemble: CRX produces a permissive grammar (full vocabulary, many optional parts), iDRegEx produces a strict grammar (minimal core). A Minimum Description Length (MDL) score picks the winner: the grammar that compresses the data best.

The algorithms, briefly

CRX — Chain Regular Expression inference

CRX (Algorithm 7, TODS 2010) works in four steps:

  1. Build the immediate-predecessor relation. For every adjacent pair (x, y) across all sequences, record that x precedes y. If symbol assert always appears before file, record assert → file.

  2. Compute equivalence classes. Take the reflexive-transitive closure of the predecessor relation. The strongly connected components are equivalence classes: groups of symbols that can each reach the other, and so can appear in either order. If shell is sometimes followed by command and command is sometimes followed by shell, they're in the same class.

  3. Merge singleton classes. A class with one symbol that shares the same predecessor/successor sets as another singleton class gets merged. This handles symbols that always appear in the same structural position. If copy and template both follow file and precede command, they end up in one merged class.

  4. Topological sort. The equivalence classes are sorted by their position in the Hasse diagram of the predecessor relation. Each class becomes a factor in the output, annotated with a quantifier:

    • + (one or more) if the class occurs in every sequence and can repeat
    • +? (zero or more) if the class can repeat but is absent from some sequences
    • ? (optional) if the class occurs at most once and can be absent
    • no quantifier if the class occurs exactly once in every sequence

The result is a CHARE: a sequence of factors where each factor is a disjunction of equivalent symbols with a quantifier.
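
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the project's implementation: the function name `crx_sketch` is hypothetical, step 3 (merging singleton classes) is left out, and the quantifier rule is an assumption based on how often each class occurs per sequence.

```python
def crx_sketch(sequences):
    """Simplified CRX-style inference (no singleton merging)."""
    symbols = sorted({s for seq in sequences for s in seq})

    # Step 1: immediate-predecessor relation, stored as successor sets.
    succ = {s: set() for s in symbols}
    for seq in sequences:
        for x, y in zip(seq, seq[1:]):
            succ[x].add(y)

    # Step 2: reflexive-transitive closure (Warshall), then group
    # mutually reachable symbols into equivalence classes.
    reach = {s: {s} | succ[s] for s in symbols}
    for k in symbols:
        for i in symbols:
            if k in reach[i]:
                reach[i] |= reach[k]
    classes, seen = [], set()
    for s in symbols:
        if s not in seen:
            cls = frozenset(t for t in reach[s] if s in reach[t])
            seen |= cls
            classes.append(cls)

    # Step 4: a class that reaches another has a strictly larger reach
    # set, so sorting by reach size gives a topological order.
    classes.sort(key=lambda c: (-len(reach[min(c)]), min(c)))

    factors = []
    for cls in classes:
        counts = [sum(sym in cls for sym in seq) for seq in sequences]
        if min(counts) >= 1:
            quant = "" if max(counts) == 1 else "+"
        else:
            quant = "?" if max(counts) <= 1 else "+?"
        body = "+".join(sorted(cls))
        factors.append((f"({body})" if len(cls) > 1 else body) + quant)
    return ".".join(factors)


roles = [
    ["assert", "file", "template"],
    ["assert", "file", "copy", "template", "copy"],
    ["assert", "template"],
]
print(crx_sketch(roles))  # assert.file?.(copy+template)+
```

Here copy and template follow each other in both directions, so they collapse into one repeated factor, while file is missing from one role and becomes optional.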

iDRegEx — k-optimal regular expression inference

iDRegEx (Algorithm 4, arXiv 2010) uses a probabilistic automaton:

  1. Build a complete k-OA. A k-testable Observation Automaton records all k-grams (contiguous runs of k symbols) from the input sequences. The automaton's states represent (k-1)-grams.

  2. Train with Baum-Welch. EM iterations assign probabilities to transitions, learning which paths through the automaton are most likely given the data.

  3. Disambiguate. Remove nondeterministic transitions — for any state and symbol, keep only the most probable next state.

  4. Prune. Remove low-probability edges and unreachable states, leaving only the most likely paths.

  5. Extract with rwr². The REWRITE-SQUARED algorithm (rwr², Algorithm 3) collapses the pruned automaton into a k-optimal regular expression — the minimal common core.
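
To make steps 1 and 4 more concrete, here is a rough sketch that counts k-grams and drops rare transitions. It is deliberately not the algorithm from the paper: transition probabilities are plain relative frequencies rather than Baum-Welch estimates, unreachable states are not removed, and the names `kgram_transitions` and `min_prob` as well as the `^` and `$` boundary markers are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kgram_transitions(sequences, k=2, min_prob=0.1):
    """Count k-grams, estimate transition probabilities, drop rare edges."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pad so the first symbol has a (k-1)-gram state and the end is visible.
        padded = ["^"] * (k - 1) + list(seq) + ["$"]
        for i in range(len(padded) - k + 1):
            state = tuple(padded[i:i + k - 1])   # the (k-1)-gram
            symbol = padded[i + k - 1]           # the symbol that follows it
            counts[state][symbol] += 1

    pruned = {}
    for state, following in counts.items():
        total = sum(following.values())
        kept = {sym: n / total for sym, n in following.items()
                if n / total >= min_prob}
        if kept:
            pruned[state] = kept
    return pruned


examples = [["assert", "file"]] * 9 + [["assert", "copy"]]
table = kgram_transitions(examples, k=2, min_prob=0.2)
print(table[("assert",)])  # only the common transition to "file" survives
```

With nine roles going from assert to file and one going to copy, the rare edge falls below the threshold and is pruned, which is the intuition behind the "minimal common core".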

MDL scoring — picking the right level of specificity

The Minimum Description Length principle (Rissanen 1978) says: the best grammar is the one that minimizes the sum of its own size and the cost of encoding the data using it.

MDL = model_cost + data_cost

model_cost = the number of alphabet symbol occurrences in the grammar. A grammar with 5 unique symbols used once each has model_cost = 5.

data_cost = Σ log₂(|L(r)|) across all sequences s, where |L(r)| counts the strings of length len(s) that the grammar r accepts. A permissive grammar such as (s1+s2+...+s19)+, a repeated disjunction over 19 symbols, accepts any of 19 symbols at each position, so a sequence of length 120 costs 120 × log₂(19) ≈ 510 bits. A grammar like a.b.c.d.e accepts exactly 1 string of length 5, so its data cost is log₂(1) = 0.

The ensemble picks the grammar with the lowest total MDL. This automatically balances specificity against coverage: a grammar that matches only 1 sequence but does so perfectly (low data cost) can beat a grammar that matches all sequences but is extremely permissive (high data cost).
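
The arithmetic can be checked with a small sketch. The helper names `data_cost` and `mdl` and the `count_words` callback are hypothetical; in the engine the word count comes from the grammar itself, whereas here it is passed in as a function of the sequence length.

```python
import math


def data_cost(count_words, sequences):
    """Sum of log2(number of accepted strings of each sequence's length)."""
    return sum(math.log2(count_words(len(seq))) for seq in sequences)


def mdl(model_cost, count_words, sequences):
    """Total description length: grammar size plus cost of the data."""
    return model_cost + data_cost(count_words, sequences)


# Repeated disjunction over 19 symbols: 19**n accepted strings of length n.
permissive = mdl(19, lambda n: 19 ** n, [["x"] * 120])

# a.b.c.d.e: exactly one accepted string, and it has length 5.
strict = mdl(5, lambda n: 1, [list("abcde")])

print(round(permissive, 2), strict)  # about 528.75 and 5.0
```

The two calls use different data, so they only reproduce the numbers from the text. In the ensemble both grammars are scored against the same set of sequences.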

The bugs we found (and fixed)

Implementing the Bex et al. algorithms faithfully required solving several subtle problems.

Bug 1: model_cost counted characters, not symbols

The paper defines model_cost as "the length of r" — the number of symbols in the expression. For the toy alphabet {a, b, c, d, e} used in the paper, characters and symbols are the same. For real-world symbols like community.docker.docker_image, they aren't.

Our model_cost function was counting characters (226 for a typical grammar), when it should count symbol occurrences (19). This massively inflated the MDL score, making CRX appear worse than it actually was.

Fix: Count occurrences of alphabet symbols in the expression using regex word-boundary matching, not string length.
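
A minimal sketch of that fix, assuming the alphabet is known up front. The function below is an illustration rather than the project's code: it tries longer symbols first so that a dotted name is matched whole, and it uses lookarounds instead of a bare `\b` so that a match cannot start or end in the middle of a word.

```python
import re


def model_cost(expr, alphabet):
    """Count occurrences of alphabet symbols in a grammar string."""
    if not alphabet:
        return 0
    # Longest first, so "community.docker.docker_image" wins over "docker".
    ordered = sorted(alphabet, key=len, reverse=True)
    pattern = "|".join(re.escape(sym) for sym in ordered)
    token = re.compile(rf"(?<!\w)(?:{pattern})(?!\w)")
    return len(token.findall(expr))


grammar = "assert.file+?.(copy+template)?.community.docker.docker_image"
symbols = {"assert", "file", "copy", "template", "docker",
           "community.docker.docker_image"}
print(model_cost(grammar, symbols), len(grammar))  # 5 symbols vs 60 characters
```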

Bug 2: Dispatch order in _count_words_fast

The recursive function _count_words_fast estimates |L(r)| — the number of strings a grammar accepts at a given length. It dispatches on expression structure: first check for concatenation (.), then trailing quantifiers (+?, *, ?, +), then disjunction groups.

Our dispatch checked endswith('+?') before checking '.' in expr. For the expression (All)+.Role?.RoleBinding?.Job+?, the trailing +? on Job+? triggered the quantifier branch first, applying the +? to the entire expression instead of just the Job factor.

Fix: Check concatenation first. Top-level dots can only appear in concatenation, so they should be handled before any quantifier logic.
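
The corrected order can be sketched as follows: split on top-level dots first, then peel the quantifier off each factor. The helper names are hypothetical, and the sketch assumes symbol names without dots (dotted names are the subject of Bug 4).

```python
def split_top_level(expr, sep="."):
    """Split on separators that are not inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(expr):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(expr[start:i])
            start = i + 1
    parts.append(expr[start:])
    return parts


def split_quantifier(factor):
    """Return (body, quantifier); '+?' must be tested before '?' and '+'."""
    for quant in ("+?", "*", "?", "+"):
        if factor.endswith(quant):
            return factor[:-len(quant)], quant
    return factor, ""


def factors(expr):
    # Concatenation first, quantifiers second: each quantifier then
    # applies to its own factor, never to the whole expression.
    return [split_quantifier(f) for f in split_top_level(expr)]


print(factors("(All)+.Role?.RoleBinding?.Job+?"))
# [('(All)', '+'), ('Role', '?'), ('RoleBinding', '?'), ('Job', '+?')]
```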

Bug 3: Greedy matching without backtracking

The _match_tokens function checked whether a sequence matches a grammar. For quantifiers like +? (zero-or-more), it greedily consumed ALL consecutive matching symbols, then moved on. This failed for grammars like a+?.a on input ['a', 'a']: the a+? consumed both a's, leaving nothing for the final a to match.

Fix: Replace the single-pass greedy matching with _match_possible, a proper backtracking engine that enumerates ALL valid end positions for each token and picks the maximum. This is essentially a tiny regex engine — but limited to the CHARE subset, so it avoids the exponential blowup of general regex matching.
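
One way to picture this is a matcher that carries a set of reachable positions from token to token instead of a single greedy cursor. The sketch below is an assumption about the general shape, not the project's `_match_possible`: tokens are hypothetical `(symbols, quantifier)` pairs, and a sequence matches if the end of the input is among the positions left after the last token.

```python
def match_possible(tokens, seq):
    """tokens: list of (set_of_symbols, quantifier in {'', '?', '+', '+?'})."""
    positions = {0}
    for symbols, quant in tokens:
        reachable = set()
        for start in positions:
            # Length of the run of matching symbols beginning at `start`.
            run = 0
            while start + run < len(seq) and seq[start + run] in symbols:
                run += 1
            fewest = 0 if quant in ("?", "+?") else 1
            most = run if quant in ("+", "+?") else min(run, 1)
            # Record every valid end position, not just the greedy one.
            reachable.update(start + n for n in range(fewest, most + 1))
        positions = reachable
        if not positions:
            return False
    return len(seq) in positions


grammar = [({"a"}, "+?"), ({"a"}, "")]   # a+?.a
print(match_possible(grammar, ["a", "a"]))  # True
print(match_possible(grammar, []))          # False
```

Because the state is a set of positions, the work per token is bounded by the sequence length, which is why this style of matching stays polynomial on the CHARE subset.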

Bug 4: Dot-splitting inside disjunctions

Module names like community.docker.docker_image contain dots. When _parse_parts processed a disjunction child, it recursively called itself, which split the expression on "." before treating it as a symbol. The symbol community.docker.docker_image became community, then docker, then docker_image: three concatenated symbols instead of one.

Fix: Disjunction children are always flat symbols (CRX and iDRegEx don't produce nested disjunctions in practice). Parse them with _parse_flat_symbol, which strips quantifiers but never splits on ".".
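
A sketch of what such a flat parse might look like. These are hypothetical re-implementations for illustration, not the project's functions, and they assume disjunction children carry no `+`-based quantifier of their own (otherwise splitting the group on `+` would be ambiguous).

```python
def parse_flat_symbol(text):
    """Strip a trailing quantifier; never split the symbol on dots."""
    for quant in ("+?", "*", "?"):
        if text.endswith(quant):
            return text[:-len(quant)], quant
    return text, ""


def parse_disjunction(group):
    """Parse '(a+b+c)' into a list of (symbol, quantifier) children."""
    inner = group.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    return [parse_flat_symbol(child) for child in inner.split("+")]


print(parse_disjunction("(community.docker.docker_image+file?)"))
# [('community.docker.docker_image', ''), ('file', '?')]
```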

The results

Ansible deploy roles — 36 roles from companyweb

Our own deploy roles cover everything from AdGuard Home to Woodpecker CI. They have no schema: each is a free-form script.

Grammar: docker_volume+?.group?.docker_container?.user?.apt?.npm?.
         (assert+...+command+copy+file+template+set_fact+...+wait_for)+?.
         (cron+firewalld)?
Match:   36/36
MDL:     2186.28

Reading the grammar left to right: optional setup steps (docker_volume, group, docker_container, user, apt, npm), then a large disjunction of ~25 task modules repeated zero or more times, then an optional cron or firewalld task at the end. All 36 roles fit this shape.

Compression: 36 roles (15,000 tokens) → 200 tokens (75×)

Ansible Galaxy roles: 15 roles from geerlingguy

Jeff Geerling's roles are among the most popular on Ansible Galaxy. Their structural pattern is not documented anywhere we could find, yet all 15 roles we sampled follow the same arc:

Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
         include+?.(npm+pip)+?.lineinfile?
Match:   15/15
MDL:     596.64

The arc: check prerequisites, load OS-specific variables, install packages, configure with templates, start services, optionally run sub-tasks, install npm/pip packages, and optionally tweak config lines.

As far as we know, this is the first explicit description of the geerlingguy role convention. It took 15 roles and a grammar inference algorithm to write it down.

Compression: 15 roles (5,000 tokens) → 60 tokens (83×)

Docker Compose — by project

Docker Compose has a flexible schema, but each project develops its own convention:

mcp-deployment (36 services):

(build+image).command.(environment+volumes)?.ports

files (6 services):

image.environment.volumes.network_mode.privileged?.cap_add?

fresh-ape-base (9 services):

image.ports?.(depends_on+environment+user+volumes)+

Ensemble dynamics

The ensemble (CRX + iDRegEx + MDL) selects different winners depending on the data:

| Dataset | Winner | Why |
|---|---|---|
| Ansible deploy (36 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Ansible galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Ansible restore (2 roles) | CRX | Both match all; CRX more compact |
| Ansible configure (4 roles) | iDRegEx | Finds minimal core include_role |
| Ansible manage (2 roles) | iDRegEx | Core: assert.authorized_key |

iDRegEx wins when the data has a clear common core. CRX wins when there's no single shared subsequence (the roles share the vocabulary but not the order).

The MCP

The engine is exposed as an MCP server:

from bex.mcp_server import infer_best_grammar

# Full coverage
output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",
)
# Returns:
#   Best: CRX (MDL 2186.28)
#   Grammar: docker_volume+?.group?...(assert+...+wait_for)+?.(cron+firewalld)?

# Ensemble — let MDL pick
output = infer_best_grammar(sequences=role_sequences)

An agent workflow:

  1. Agent needs to write deploy role #37
  2. Finds 36 existing deploy roles, extracts their task module sequences
  3. Calls infer_best_grammar(sequences=..., prefer='crx')
  4. Gets back the grammar in 200 tokens
  5. Generates a new role that follows the structural pattern

Without the MCP: 36 role files in context (15,000 tokens), or guesswork. With the MCP: one grammar rule (200 tokens), known to match 36/36 roles.
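
Step 2 of the workflow, extracting module sequences, could look roughly like this. It is a sketch under assumptions: the tasks are already parsed from YAML into dictionaries, the module is taken to be the first key that is not a task keyword, the keyword list is incomplete, and block/rescue structures are ignored. `TASK_KEYWORDS`, `module_of` and `module_sequence` are hypothetical names, not part of the engine.

```python
# A partial list of Ansible task keywords; anything else is treated as a module.
TASK_KEYWORDS = {
    "name", "when", "tags", "become", "become_user", "register", "loop",
    "with_items", "notify", "vars", "args", "environment", "changed_when",
    "failed_when", "ignore_errors", "delegate_to", "no_log", "until",
    "retries", "delay",
}


def module_of(task):
    """Return the module name of one parsed task, or None if not found."""
    for key in task:
        if key not in TASK_KEYWORDS:
            return key
    return None


def module_sequence(tasks):
    """Turn a list of parsed tasks into the sequence of module names."""
    return [m for m in map(module_of, tasks) if m is not None]


tasks = [
    {"name": "Check variables", "assert": {"that": ["app_port is defined"]}},
    {"name": "Create data dir", "file": {"path": "/srv/app", "state": "directory"},
     "become": True},
    {"template": {"src": "app.conf.j2", "dest": "/srv/app/app.conf"},
     "notify": "restart app"},
]
print(module_sequence(tasks))  # ['assert', 'file', 'template']
```

One sequence per role, collected into a list, is the kind of input the `sequences` argument in the example above is assumed to take.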

What it means

Grammar inference turns examples into rules. The rule is a compressed description of the structural convention — and for schema-less content like Ansible roles, this may be the first time the convention has been written down at all.

For LLM agents, this changes the trade-off between context and accuracy. Instead of flooding the context window with examples, the agent can call the MCP, get the rule in roughly 60 to 200 tokens, and follow it. The rule is more reliable than guessing from examples, and it costs less than the first example would have.

The algorithm doesn't need to understand what a deploy role does. It doesn't know that file creates directories and template renders Jinja2. It only needs to see 36 sequences of module names and find the pattern they all share. The structural convention is in the data — you just have to extract it.

References

  • Bex, G. J., Neven, F., Schwentick, T., & Vansummeren, S. (2010). Inference of Concise Regular Expressions and DTDs. ACM Transactions on Database Systems (TODS) 35(2).
  • Bex, G. J., Gelade, W., Neven, F., & Vansummeren, S. (2010). Learning Deterministic Regular Expressions for the Semi-Structured Web. arXiv:1004.2372.
  • Rissanen, J. (1978). Modeling by shortest data description. Automatica 14(5).