Dervish: Discovering Unwritten Conventions with Grammar Inference

How we turned 36 Ansible roles into a 200-token grammar, and why it matters for LLM agents.

The problem

Every codebase has unwritten conventions. Your team's Docker Compose files always put image before ports before volumes. Your Ansible deploy roles always start with assert, then file, then template. Your CI pipelines always run lint before test before deploy.

Nobody writes these down. They're emergent — copied from role to role, file to file, until they become a tacit standard.

When an LLM agent needs to generate new content that follows these conventions, you have two options:

  1. Stuff every existing file into context: 36 deploy roles come to roughly 15,000 tokens, spent again on every request, before the agent has written a single line.
  2. Give it one or two examples and hope — the LLM will guess the pattern, and it will often guess wrong.

Neither is good. The first is wasteful. The second is unreliable.

What you really want is the compiled convention — the minimal description of what all 36 roles share, expressed in ~200 tokens. An LLM can follow a rule in 200 tokens far more reliably than it can infer a pattern from 36 examples.

This is grammar inference.

The approach

Given a set of example sequences over some alphabet (e.g., Ansible module names, Docker Compose keys, CI job names), learn a regular expression that describes the general pattern.

We implemented two algorithms by Bex et al., one from each of two 2010 papers (one published in TODS, one on arXiv):

  • CRX (TODS 2010 §6): A single-pass algorithm that builds a predecessor relation over symbols, computes equivalence classes, and emits a Chain Regular Expression (CHARE) that matches ALL input sequences. Fast, deterministic, captures the full vocabulary.

  • iDRegEx (arXiv 2010): A probabilistic algorithm using k-testable Observation Automata (k-OA) trained with Baum-Welch EM. It finds only the minimal common core — the symbols that appear in every example. Robust against noise, but fails (returns ∅) when the examples are too diverse.

Both run in the ensemble: CRX produces a permissive grammar (full vocabulary, many optional parts), iDRegEx produces a strict grammar (minimal core). A Minimum Description Length (MDL) score picks the winner: the grammar that compresses the data best.

The algorithms, briefly

CRX — Chain Regular Expression inference

CRX (Algorithm 7, TODS 2010) works in four steps:

  1. Build the immediate-predecessor relation. For every adjacent pair (x, y) across all sequences, record that x precedes y. If assert is immediately followed by file in any sequence, record assert → file.

  2. Compute equivalence classes. Take the reflexive-transitive closure of the predecessor relation. Its strongly connected components are the equivalence classes: groups of symbols that can each appear before and after the others, so no fixed order exists between them. If set_fact sometimes precedes include_vars and include_vars sometimes precedes set_fact, they're in the same class.

  3. Merge singleton classes. Singleton classes (one symbol each) that share the same predecessor and successor sets are merged. This handles symbols that never follow one another but always fill the same structural slot: if copy and template both follow file and precede command, they end up in one factor.

  4. Topological sort. The equivalence classes are sorted by their position in the Hasse diagram of the predecessor relation. Each class becomes a factor in the output, annotated with a quantifier:

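    • + (one or more) if the class is always present and can repeat (it forms a cycle)
    • +? (zero or more) if the class can repeat and can also be absent
    • ? (optional) if the class appears at most once and can be absent
    • no quantifier (exactly once) if the class always appears exactly once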

The result is a CHARE: a sequence of factors where each factor is a disjunction of equivalent symbols with a quantifier.
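The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the Algorithm 7 implementation: `crx_sketch` is a hypothetical name, step 3 (merging singleton classes) is left out, and quantifiers are assigned from per-sequence occurrence counts, which is an assumption about how "can repeat" and "can be absent" are detected.

```python
from itertools import chain


def crx_sketch(sequences):
    """Infer a CHARE-style expression from example sequences (simplified)."""
    symbols = sorted(set(chain.from_iterable(sequences)))

    # Step 1: immediate-predecessor relation; x -> y for every adjacent pair.
    succ = {s: set() for s in symbols}
    for seq in sequences:
        for x, y in zip(seq, seq[1:]):
            succ[x].add(y)

    # Step 2: transitive closure by graph search, then group mutually
    # reachable symbols (the strongly connected components).
    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {s: reachable(s) for s in symbols}
    classes, assigned = [], set()
    for s in symbols:
        if s not in assigned:
            cls = {s} | {t for t in reach[s] if s in reach[t]}
            classes.append(sorted(cls))
            assigned |= cls

    # Step 3 (merging singleton classes) is omitted in this sketch.

    # Step 4: topological order. A class that reaches another class has a
    # strictly larger closure, so sorting by closure size is a valid order.
    classes.sort(key=lambda cls: (-len(set(cls) | reach[cls[0]]), cls))

    factors = []
    for cls in classes:
        # Assumed quantifier rule: count class symbols in each sequence.
        counts = [sum(sym in cls for sym in seq) for seq in sequences]
        repeat = "+" if max(counts) > 1 else ""
        optional = "?" if min(counts) == 0 else ""
        body = "+".join(cls)
        if len(cls) > 1:
            body = f"({body})"
        factors.append(body + repeat + optional)
    return ".".join(factors)


toy_roles = [
    ["assert", "file", "template", "command"],
    ["assert", "file", "copy", "template", "copy", "command"],
    ["file", "template", "command", "cron"],
]
print(crx_sketch(toy_roles))
# prints: assert?.file.(copy+template)+.command.cron?
```

In the toy data, copy and template each precede the other in the second sequence, so they collapse into one repeating factor, while assert and cron become optional because one sequence lacks each of them.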

iDRegEx — k-optimal regular expression inference

iDRegEx (Algorithm 4, arXiv 2010) uses a probabilistic automaton:

  1. Build a complete k-OA. A k-testable Observation Automaton records all k-grams (contiguous runs of k symbols) from the input sequences. The automaton's states represent (k-1)-grams.

  2. Train with Baum-Welch. EM iterations assign probabilities to transitions, learning which paths through the automaton are most likely given the data.

  3. Disambiguate. Remove nondeterministic transitions — for any state and symbol, keep only the most probable next state.

  4. Prune. Remove low-probability edges and unreachable states, leaving only the most likely paths.

  5. Extract with rwr². The REWRITE-SQUARED algorithm (rwr², Algorithm 3) collapses the pruned automaton into a k-optimal regular expression — the minimal common core.
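To make the first steps concrete, here is a minimal sketch of a k-gram automaton with edge pruning. It is not the iDRegEx implementation: `kgram_automaton`, the `^`/`$` boundary markers, and the `min_prob` threshold are all assumptions for illustration, and transition probabilities come from plain relative frequencies as a stand-in for Baum-Welch. Because each state here is fully determined by the last k-1 symbols, there is nothing to disambiguate in this simplified form, and the rwr² extraction step is not shown.

```python
from collections import Counter, defaultdict


def kgram_automaton(sequences, k=2, min_prob=0.1):
    """Build a pruned k-gram automaton: state -> {next symbol: probability}."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pad so that the first real symbol has a (k-1)-gram state before it,
        # and the end of each sequence is an explicit "$" transition.
        padded = ["^"] * (k - 1) + list(seq) + ["$"]
        for i in range(len(padded) - k + 1):
            state = tuple(padded[i:i + k - 1])
            counts[state][padded[i + k - 1]] += 1

    automaton = {}
    for state, outgoing in counts.items():
        total = sum(outgoing.values())
        # Prune low-probability edges; kept edges are not renormalized.
        automaton[state] = {
            symbol: n / total
            for symbol, n in outgoing.items()
            if n / total >= min_prob
        }
    return automaton


examples = [
    ["assert", "file", "template"],
    ["assert", "file", "copy"],
    ["assert", "file", "template"],
]
pruned = kgram_automaton(examples, k=2, min_prob=0.4)
print(pruned[("file",)])  # only the template edge survives the 0.4 threshold
```

States that become unreachable after pruning (here, the state for copy) are left in the dictionary; a fuller version would drop them as step 4 describes.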

MDL scoring — picking the right level of specificity

The Minimum Description Length principle (Rissanen 1978) says: the best grammar is the one that minimizes the sum of its own size and the cost of encoding the data using it.

MDL = model_cost + data_cost

model_cost = the number of alphabet symbol occurrences in the grammar. A grammar with 5 unique symbols used once each has model_cost = 5.

data_cost = Σ log₂(|L(r)|) over all sequences s, where |L(r)| counts the strings of length len(s) that the grammar r accepts. A grammar like (s₁+s₂+...+s₁₉)+ allows any of its 19 symbols at each position, so for a sequence of length 120, the data cost is 120 × log₂(19) ≈ 510 bits. A grammar like a.b.c.d.e accepts exactly one string of length 5, so its data cost is log₂(1) = 0.

The ensemble picks the grammar with the lowest total MDL. This automatically balances specificity against coverage: a grammar that matches only 1 sequence but does so perfectly (low data cost) can beat a grammar that matches all sequences but is extremely permissive (high data cost).
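The scoring rule can be sketched as follows. The representation is an assumption made for illustration: a grammar is a list of `(symbols, quantifier)` factors using the quantifiers from the CRX section, and factor symbol sets are assumed disjoint (as in a CHARE), so every accepted string decomposes into factors in exactly one way. The function names are hypothetical, and the sketch does not check that each sequence actually matches the grammar.

```python
import math

# Quantifier -> (minimum repetitions, whether repetition is unbounded).
BOUNDS = {"": (1, False), "?": (0, False), "+": (1, True), "+?": (0, True)}


def count_strings(factors, length):
    """Count the strings of exactly `length` symbols that the factors accept."""
    ways = [1] + [0] * length  # ways[j]: strings of length j matched so far
    for symbols, quantifier in factors:
        low, unbounded = BOUNDS[quantifier]
        high = length if unbounded else 1
        nxt = [0] * (length + 1)
        for used, w in enumerate(ways):
            if w == 0:
                continue
            for reps in range(low, min(high, length - used) + 1):
                nxt[used + reps] += w * len(symbols) ** reps
        ways = nxt
    return ways[length]


def data_cost(factors, sequences):
    """Sum of log2(number of accepted strings of each sequence's length)."""
    total = 0.0
    for seq in sequences:
        n = count_strings(factors, len(seq))
        if n == 0:
            return math.inf  # the grammar accepts nothing of this length
        total += math.log2(n)
    return total


def mdl_score(factors, sequences):
    model_cost = sum(len(symbols) for symbols, _ in factors)
    return model_cost + data_cost(factors, sequences)


# The two examples from the text above.
permissive = [(tuple(f"s{i}" for i in range(1, 20)), "+")]  # 19 symbols
strict = [((c,), "") for c in "abcde"]                      # a.b.c.d.e

print(round(data_cost(permissive, [["s1"] * 120])))  # prints: 510
print(mdl_score(strict, [list("abcde")]))             # prints: 5.0
```

The strict grammar pays only its model cost of 5, while the permissive one pays about 510 bits for a single long sequence on top of its 19 symbols.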

The results

Ansible deploy roles — 36 roles from companyweb

Our own deploy roles cover everything from AdGuard Home to Woodpecker CI. They have no schema: each is a free-form script.

Grammar: docker_volume+?.group?.docker_container?.user?.apt?.npm?.
         (assert+...+command+copy+file+template+set_fact+...+wait_for)+?.
         (cron+firewalld)?
Match:   36/36
MDL:     2186.28

Bottleneck analysis: optional setup steps (docker_volume, group, docker_container, user, apt, npm), then a large disjunction of ~25 task modules (zero or more, per the +? quantifier), then optional cron/firewalld at the end. This captures the convention precisely.

Compression: 36 roles (15,000 tokens) → 200 tokens (75×)

Ansible Galaxy roles: 15 roles from geerlingguy

Jeff Geerling's roles are among the most popular on Ansible Galaxy. As far as we can tell, their structural pattern is not documented anywhere. Yet every one of the 15 follows the same arc:

Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
         include+?.(npm+pip)+?.lineinfile?
Match:   15/15
MDL:     596.64

Check prerequisites, OS-specific variables, install packages, configure with templates, start services, optionally run sub-tasks, install npm/pip packages, and optionally tweak config lines.

As far as we know, this is the first explicit description of the geerlingguy role module ordering convention. It took 15 roles and a grammar inference algorithm to write it down.

Compression: 15 roles (5,000 tokens) → 60 tokens (83×)

Ensemble dynamics

The ensemble (CRX + iDRegEx + MDL) selects different winners depending on the data:

| Dataset | Winner | Why |
|---|---|---|
| Ansible Galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Helm prom-stack (6 configs) | iDRegEx | Finds minimal core across all configs |
| Terraform modules (8) | CRX | iDRegEx returned ∅ (no common core; every resource type is optional across domains) |
| GitHub Actions Go lint (6) | CRX | Tight pattern, all match |

iDRegEx wins when the data has a clear common core. CRX wins when there's no single shared subsequence (the roles share the vocabulary but not the order).
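The selection rule behind the table can be sketched as a small function. This is an illustration under assumptions, not the engine's code: `pick_winner` is a hypothetical name, an empty result (∅) is modeled as `None`, and each candidate is a `(grammar, mdl)` pair.

```python
def pick_winner(candidates):
    """Return the name of the lowest-MDL candidate that produced a grammar.

    `candidates` maps an algorithm name to (grammar or None, MDL score).
    A grammar of None stands for the empty result and is never selected.
    """
    viable = {
        name: score
        for name, (grammar, score) in candidates.items()
        if grammar is not None
    }
    if not viable:
        return None
    return min(viable, key=viable.get)


# Ansible Galaxy case from the table: iDRegEx returned nothing, so CRX wins.
galaxy = {
    "crx": ("fail?.(include_vars+set_fact+...)+...", 596.64),
    "idregex": (None, float("inf")),
}
print(pick_winner(galaxy))  # prints: crx
```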

The MCP

The engine is exposed as an MCP server:

```python
from bex.mcp_server import infer_best_grammar

# Full coverage
output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",
)
# Returns:
#   Best: CRX (MDL 288)
#   Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
#            .include+?.(npm+pip)+?.lineinfile?

# Ensemble — let MDL pick
output = infer_best_grammar(sequences=role_sequences)
```

An agent workflow:

  1. Agent needs to write an Ansible role
  2. Finds 15 existing geerlingguy roles, extracts their task module sequences
  3. Calls infer_best_grammar(sequences=..., prefer='crx')
  4. Gets back the grammar in ~60 tokens
  5. Generates a new role that follows the structural pattern

Without the MCP: 15 role files in context (5,000 tokens), or guesswork. With the MCP: one grammar rule (~60 tokens), known to match 15/15 roles.
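Step 2 of the workflow, extracting module sequences, might look like the sketch below. It assumes the role's task file has already been parsed into a list of dictionaries (for example by a YAML loader) and treats the first key that is not a task keyword as the module name. `module_sequence` and the `TASK_KEYWORDS` set are hypothetical and incomplete; block/rescue/always sections and included files are not handled.

```python
# Common task-level keywords that are not module names (partial list, assumed).
TASK_KEYWORDS = {
    "name", "when", "become", "become_user", "tags", "register", "notify",
    "loop", "loop_control", "vars", "args", "changed_when", "failed_when",
    "ignore_errors", "delegate_to", "environment", "no_log", "until",
    "retries", "delay", "run_once",
}


def module_sequence(tasks):
    """Return the ordered list of module names used by a list of parsed tasks."""
    sequence = []
    for task in tasks:
        for key in task:
            if key in TASK_KEYWORDS or key.startswith("with_"):
                continue
            # Strip a collection prefix: ansible.builtin.file -> file
            sequence.append(key.rsplit(".", 1)[-1])
            break
    return sequence


tasks = [
    {"name": "Check variables", "ansible.builtin.assert": {"that": ["port is defined"]}},
    {"name": "Create directory", "file": {"path": "/opt/app", "state": "directory"}},
    {"name": "Render config", "template": {"src": "app.j2", "dest": "/opt/app/app.conf"},
     "notify": "restart app"},
]
print(module_sequence(tasks))  # prints: ['assert', 'file', 'template']
```

One such list per role is the `sequences` argument passed to infer_best_grammar in step 3.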

What it means

Grammar inference turns examples into rules. The rule is a compressed description of the structural convention. For schema-less content like the geerlingguy role module ordering, it may be the first time the convention has been written down at all.

For LLM agents, this changes the trade-off between context and accuracy. Instead of flooding the context window with examples, the agent can call the MCP, get the rule in ~60 tokens, and follow it. The rule is more reliable than guessing from examples, and it costs less than the first example would have.

The algorithm doesn't need to understand what a deploy role does. It doesn't know that file creates directories and template renders Jinja2. It only needs to see 36 sequences of module names and find the pattern they all share. The structural convention is in the data — you just have to extract it.

References