# Discovering Unwritten Conventions with Grammar Inference

**How we turned 36 Ansible roles into a 200-character grammar, and why it matters for LLM agents.**

## The problem

Every codebase has unwritten conventions. Your team's Docker Compose files always put `image` before `ports` before `volumes`. Your Ansible deploy roles always start with `assert`, then `file`, then `template`. Your CI pipelines always run `lint` before `test` before `deploy`.

Nobody writes these down. They're emergent: copied from role to role, file to file, until they become a tacit standard.

When an LLM agent needs to generate new content that follows these conventions, you have two options:

1. **Stuff every existing file into context.** 36 deploy roles = 15,000 tokens. You'll hit the context window on your third example.
2. **Give it one or two examples and hope.** The LLM will guess the pattern, and it will often guess wrong.

Neither is good. The first is wasteful. The second is unreliable.

What you really want is the **compiled convention**: the minimal description of what all 36 roles share, expressed in ~200 tokens. An LLM can follow a rule in 200 tokens far more reliably than it can infer a pattern from 36 examples.

This is grammar inference.

## The approach

Given a set of example sequences over some alphabet (e.g., Ansible module names, Docker Compose keys, CI job names), learn a regular expression that describes the general pattern.

We implemented two algorithms from Bex et al., described in a pair of 2010 papers (one in TODS, one on arXiv):

- **CRX** (TODS 2010 §6): A single-pass algorithm that builds a predecessor relation over symbols, computes equivalence classes, and emits a Chain Regular Expression (CHARE) that matches ALL input sequences. Fast, deterministic, captures the full vocabulary.
- **iDRegEx** (arXiv 2010): A probabilistic algorithm using k-testable Observation Automata (k-OA) trained with Baum-Welch EM.
It finds only the *minimal common core*: the symbols that appear in every example. Robust against noise, but fails (returns ∅) when the examples are too diverse.

Both run in the **ensemble**: CRX produces a permissive grammar (full vocabulary, many optional parts), iDRegEx produces a strict grammar (minimal core). A Minimum Description Length (MDL) score picks the winner: the grammar that compresses the data best.

## The algorithms, briefly

### CRX: Chain Regular Expression inference

CRX (Algorithm 7, TODS 2010) works in four steps:

1. **Build the immediate-predecessor relation.** For every adjacent pair (x, y) across all sequences, record that x precedes y. If `assert` appears immediately before `file` in any sequence, record `assert → file`.
2. **Compute equivalence classes.** Take the reflexive-transitive closure of the predecessor relation. The strongly connected components are *equivalence classes*: groups of symbols that can each reach the other through the relation, so they have no fixed relative order. If `copy` sometimes precedes `template` and `template` sometimes precedes `copy`, they're in the same class.
3. **Merge singleton classes.** A class with one symbol that shares the same predecessor/successor sets as another singleton class gets merged with it. This handles symbols that always appear in the same structural position: if `copy` and `template` both follow `file` and precede `command`, they end up in one class.
4. **Topological sort.** The equivalence classes are sorted by their position in the Hasse diagram of the predecessor relation. Each class becomes a factor in the output, annotated with a quantifier:
   - `+` (one or more) if the class forms a cycle
   - `+?` (zero or more) if the class appears a variable number of times, including not at all
   - `?` (optional) if the class can be absent
   - no quantifier (exact) if the class always appears exactly once

The result is a CHARE: a sequence of factors where each factor is a disjunction of equivalent symbols with a quantifier.

### iDRegEx: k-optimal regular expression inference

iDRegEx (Algorithm 4, arXiv 2010) uses a probabilistic automaton:

1. **Build a complete k-OA.** A k-testable Observation Automaton records all k-grams (contiguous runs of k symbols) from the input sequences. The automaton's states represent (k-1)-grams.
2. **Train with Baum-Welch.** EM iterations assign probabilities to transitions, learning which paths through the automaton are most likely given the data.
3. **Disambiguate.** Remove nondeterministic transitions: for any state and symbol, keep only the most probable next state.
4. **Prune.** Remove low-probability edges and unreachable states, leaving only the most likely paths.
5. **Extract with rwr².** The REWRITE-SQUARED algorithm (rwr², Algorithm 3) collapses the pruned automaton into a k-optimal regular expression, the minimal common core.

### MDL scoring: picking the right level of specificity

The Minimum Description Length principle (Rissanen 1978) says: the best grammar is the one that minimizes the sum of its own size and the cost of encoding the data using it.

```
MDL = model_cost + data_cost
```

**model_cost** = the number of alphabet symbol occurrences in the grammar. A grammar with 5 unique symbols used once each has model_cost = 5.

**data_cost** = Σ log₂(|L(r)|) across all sequences s, where |L(r)| is the number of strings of length len(s) that the grammar r accepts. A grammar like `(a+b+c+...+z)+` over a 19-symbol alphabet accepts any of the 19 symbols at each position, so for a sequence of length 120, the data cost is 120 × log₂(19) ≈ 510 bits. A grammar like `a.b.c.d.e` accepts only 1 string of length 5, so its data cost is log₂(1) = 0.

The ensemble picks the grammar with the lowest total MDL. This automatically balances specificity against coverage: a grammar that matches only 1 sequence but does so perfectly (low data cost) can beat a grammar that matches all sequences but is extremely permissive (high data cost).

## The bugs we found (and fixed)

Implementing the BEX algorithms faithfully required solving several subtle problems.
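Two of the bugs below touch the scoring code, so it may help to see the MDL definition from the previous section as code first. This is a minimal sketch, not our implementation: the `Factor` representation and the names `count_words` and `mdl` are hypothetical, and it assumes a strict CHARE in which each symbol appears in at most one factor (so every accepted string has exactly one decomposition). It also does not check that the sequences actually match the grammar.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical CHARE representation: a list of (symbols, quantifier) factors.
# Quantifier is "" (exactly once), "?", "+", or "+?" (zero or more, following
# the notation used in this post).
Factor = Tuple[Tuple[str, ...], str]


def count_words(chare: List[Factor], length: int) -> int:
    """Count the strings of exactly `length` symbols that the CHARE accepts.

    Assumes no symbol is shared between factors, so nothing is counted twice.
    """
    # ways[n] = number of accepted strings of length n for the factors so far
    ways = [1] + [0] * length
    for symbols, quant in chare:
        k = len(symbols)
        lo = 0 if quant in ("?", "+?") else 1           # minimum repetitions
        hi = length if quant in ("+", "+?") else 1      # maximum repetitions
        nxt = [0] * (length + 1)
        for n, w in enumerate(ways):
            if w == 0:
                continue
            for m in range(lo, min(hi, length - n) + 1):
                nxt[n + m] += w * k ** m                # k choices per position
        ways = nxt
    return ways[length]


def mdl(chare: List[Factor], sequences: Sequence[Sequence[str]]) -> float:
    """MDL = model_cost + data_cost, with model_cost counted in symbols."""
    model_cost = sum(len(symbols) for symbols, _ in chare)
    data_cost = 0.0
    for seq in sequences:
        n_words = count_words(chare, len(seq))
        if n_words == 0:
            return math.inf  # the grammar has no string of this length
        data_cost += math.log2(n_words)
    return model_cost + data_cost
```

With this sketch, a single factor of 19 symbols under `+` gives `count_words(..., 120) == 19**120`, which reproduces the 120 × log₂(19) data cost quoted above, and the five-factor chain `a.b.c.d.e` scores a data cost of 0 and a model cost of 5.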
### Bug 1: model_cost counted characters, not symbols

The paper defines model_cost as "the length of r", meaning the number of symbols in the expression. For the toy alphabet {a, b, c, d, e} used in the paper, characters and symbols are the same. For real-world symbols like `community.docker.docker_image`, they aren't.

Our `model_cost` function was counting characters (226 for a typical grammar), when it should count symbol occurrences (19). This massively inflated the MDL score, making CRX appear worse than it actually was.

**Fix:** Count occurrences of alphabet symbols in the expression using regex word-boundary matching, not string length.

### Bug 2: Dispatch order in _count_words_fast

The recursive function `_count_words_fast` estimates |L(r)|, the number of strings a grammar accepts at a given length. It dispatches on expression structure: first check for concatenation (`.`), then trailing quantifiers (`+?`, `*`, `?`, `+`), then disjunction groups.

Our original dispatch checked `endswith('+?')` before checking `'.' in expr`. For the expression `(All)+.Role?.RoleBinding?.Job+?`, the trailing `+?` on `Job+?` triggered the quantifier branch first, applying the `+?` to the **entire** expression instead of just the `Job` factor.

**Fix:** Check concatenation first. Top-level dots can only appear in concatenation, so they should be handled before any quantifier logic.

### Bug 3: Greedy matching without backtracking

The `_match_tokens` function checked whether a sequence matches a grammar. For quantifiers like `+?` (zero-or-more), it greedily consumed ALL consecutive matching symbols, then moved on. This failed for grammars like `a+?.a` on input `['a', 'a']`: the `a+?` ate both `a`s, and there was nothing left for the second `.a`.

**Fix:** Replace the single-pass greedy matching with `_match_possible`, a proper backtracking engine that enumerates ALL valid end positions for each token and picks the maximum.
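To make the failure and the fix concrete, here is a minimal sketch of matching by tracking every reachable end position. It is one possible reading of the fix, not the actual `_match_possible`: the `Factor` representation and the function names are hypothetical, and where the description above speaks of picking the maximum, this version simply keeps the whole set of positions and accepts if the end of the input is among them.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical representation: (symbols, quantifier), where quantifier is
# "" (exactly once), "?", "+", or "+?" (zero or more, as in this post).
Factor = Tuple[Tuple[str, ...], str]


def end_positions(factor: Factor, seq: Sequence[str], start: int) -> Set[int]:
    """Return every position where `factor` may stop if it starts at `start`."""
    symbols, quant = factor
    # Length of the run of matching symbols that begins at `start`
    run = 0
    while start + run < len(seq) and seq[start + run] in symbols:
        run += 1
    lo = 0 if quant in ("?", "+?") else 1              # minimum symbols consumed
    hi = run if quant in ("+", "+?") else min(run, 1)  # maximum symbols consumed
    return {start + m for m in range(lo, hi + 1)}


def matches(chare: List[Factor], seq: Sequence[str]) -> bool:
    """True if the factors, in order, can consume the whole sequence."""
    positions = {0}
    for factor in chare:
        # Try the factor from every position reached so far, not just the
        # furthest one; this is what a single greedy pass misses.
        positions = {
            end for p in positions for end in end_positions(factor, seq, p)
        }
        if not positions:
            return False
    return len(seq) in positions


# The failing case from above: a+?.a on ['a', 'a']
grammar = [(("a",), "+?"), (("a",), "")]
print(matches(grammar, ["a", "a"]))
```

Because the set of positions never holds more than `len(seq) + 1` entries, the work per factor stays polynomial in the input length.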
This is essentially a tiny regex engine, but one limited to the CHARE subset, so it avoids the exponential blowup that backtracking can hit on general regexes.

### Bug 4: Dot-splitting inside disjunctions

Module names like `community.docker.docker_image` contain dots. When `_parse_parts` processed a disjunction child, it recursively called itself, which split the expression on `.` before treating it as a symbol. The symbol `community.docker.docker_image` became `community` then `docker` then `docker_image`: three concatenated symbols instead of one.

**Fix:** Disjunction children are always flat symbols (CRX and iDRegEx don't produce nested disjunctions in practice). Parse them with `_parse_flat_symbol`, which strips quantifiers but never splits on `.`.

## The results

### Ansible deploy roles: 36 roles from companyweb

Our own deploy roles cover everything from AdGuard Home to Woodpecker CI. They have NO schema; each is a free-form script.

```
Grammar: docker_volume+?.group?.docker_container?.user?.apt?.npm?.
         (assert+...+command+copy+file+template+set_fact+...+wait_for)+?.
         (cron+firewalld)?
Match:   36/36
MDL:     2186.28
```

Reading the grammar left to right: optional Docker and system setup (volume, group, container, user, apt, npm), then a large disjunction of ~25 task modules (zero or more), then optional cron/firewalld at the end. This captures the convention precisely.

**Compression: 36 roles (15,000 tokens) → 200 tokens (75×)**

### Geerlingguy Galaxy roles: 15 popular roles

Jeff Geerling's roles are the most popular on Ansible Galaxy. He has never documented their structural pattern. Yet every one of the 15 follows the same arc:

```
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
         include+?.(npm+pip)+?.lineinfile?
Match:   15/15
MDL:     596.64
```

Check prerequisites, set OS-specific variables, install packages, configure with templates, start services, optionally run sub-tasks, install npm/pip packages, and optionally tweak config lines.
**As far as we know, this is the first explicit description of the geerlingguy role convention.** It took 15 roles and a grammar inference algorithm to write it down.

**Compression: 15 roles (5,000 tokens) → 60 tokens (83×)**

### Docker Compose, by project

Docker Compose has a flexible schema, but each project develops its own convention:

**mcp-deployment (36 services):**

```
(build+image).command.(environment+volumes)?.ports
```

**files (6 services):**

```
image.environment.volumes.network_mode.privileged?.cap_add?
```

**fresh-ape-base (9 services):**

```
image.ports?.(depends_on+environment+user+volumes)+
```

### Ensemble dynamics

The ensemble (CRX + iDRegEx + MDL) selects different winners depending on the data:

| Dataset | Winner | Why |
|---------|--------|-----|
| Ansible deploy (36 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Ansible galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Ansible restore (2 roles) | CRX | Both match all; CRX more compact |
| Ansible configure (4 roles) | **iDRegEx** | Finds minimal core `include_role` |
| Ansible manage (2 roles) | **iDRegEx** | Core: `assert.authorized_key` |

iDRegEx wins when the data has a clear common core. CRX wins when there's no single shared subsequence (the roles share the *vocabulary* but not the *order*).

## The MCP

The engine is exposed as an MCP server:

```python
from bex.mcp_server import infer_best_grammar

# role_sequences: one list of module names per role, in task order

# Full coverage: force the CRX grammar
output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",
)
# Returns:
#   Best: CRX (MDL 2186.28)
#   Grammar: docker_volume+?.group?...(assert+...+wait_for)+?.(cron+firewalld)?

# Ensemble: let MDL pick
output = infer_best_grammar(sequences=role_sequences)
```

An agent workflow:

1. Agent needs to write deploy role #37
2. Finds 36 existing deploy roles, extracts their task module sequences
3. Calls `infer_best_grammar(sequences=..., prefer='crx')`
4. Gets back the grammar in 200 tokens
5. Generates a new role that follows the structural pattern

Without the MCP: 36 role files in context (15,000 tokens), or guesswork.
With the MCP: one grammar rule (200 tokens), known to match 36/36 roles.

## What it means

Grammar inference turns **examples** into **rules**. The rule is a compressed description of the structural convention, and for schema-less content like Ansible roles, this may be the *first time* the convention has been written down at all.

For LLM agents, this changes the trade-off between context and accuracy. Instead of flooding the context window with examples, the agent can call the MCP, get the rule in 60 to 200 tokens, and follow it. The rule is more reliable than guessing from examples, and it costs less than the first example would have.

The algorithm doesn't need to understand what a deploy role does. It doesn't know that `file` creates directories and `template` renders Jinja2. It only needs to see 36 sequences of module names and find the pattern they all share. The structural convention is in the data; you just have to extract it.

## References

- Bex, G. J., Gelade, W., Neven, F., & Vansummeren, S. (2010). *Learning Deterministic Regular Expressions for the Web.* TODS 2010.
- Bex, G. J., Gelade, W., Martens, W., & Neven, F. (2010). *Simplifying XML Schema: Single-Type Approximations of Regular Expressions.* arXiv:1004.2372.
- Rissanen, J. (1978). *Modeling by shortest data description.* Automatica 14(5).