# Dervish: Discovering Unwritten Conventions with Grammar Inference

<p align="left"><img src="dervish-logo.png" alt="Dervish" width="180"></p>

**How we turned 36 Ansible roles into a 200-character grammar — and why
it matters for LLM agents.**

## The problem

Every codebase has unwritten conventions. Your team's Docker Compose
files always put `image` before `ports` before `volumes`. Your Ansible
deploy roles always start with `assert`, then `file`, then `template`.
Your CI pipelines always run `lint` before `test` before `deploy`.

Nobody writes these down. They're emergent — copied from role to role,
file to file, until they become a tacit standard.

When an LLM agent needs to generate new content that follows these
conventions, you have two options:

1. **Stuff every existing file into context** — 36 deploy roles = 15,000
   tokens. You'll hit the context window on your third example.
2. **Give it one or two examples and hope** — the LLM will guess the
   pattern, and it will often guess wrong.

Neither is good. The first is wasteful. The second is unreliable.

What you really want is the **compiled convention** — the minimal
description of what all 36 roles share, expressed in ~200 tokens. An
LLM can follow a rule in 200 tokens far more reliably than it can
infer a pattern from 36 examples.

This is grammar inference.

## The approach

Given a set of example sequences over some alphabet (e.g., Ansible
module names, Docker Compose keys, CI job names), learn a regular
expression that describes the general pattern.

We implemented two algorithms from a pair of 2010 papers by Bex et al.,
one published in TODS and one on arXiv:

- **CRX** (TODS 2010 §6): A single-pass algorithm that builds a
  predecessor relation over symbols, computes equivalence classes,
  and emits a Chain Regular Expression (CHARE) that matches ALL
  input sequences. Fast, deterministic, captures the full vocabulary.

- **iDRegEx** (arXiv 2010): A probabilistic algorithm using k-testable
  Observation Automata (k-OA) trained with Baum-Welch EM. It finds
  only the *minimal common core* — the symbols that appear in every
  example. Robust against noise, but fails (returns ∅) when the
  examples are too diverse.

Both run in the **ensemble**: CRX produces a permissive grammar (full
vocabulary, many optional parts), iDRegEx produces a strict grammar
(minimal core). A Minimum Description Length (MDL) score picks the
winner: the grammar that compresses the data best.

## The algorithms, briefly

### CRX — Chain Regular Expression inference

CRX (Algorithm 7, TODS 2010) works in four steps:

1. **Build the immediate-predecessor relation.** For every adjacent
   pair (x, y) across all sequences, record that x precedes y. If
   `assert` appears immediately before `file` in any sequence, record
   `assert → file`.

2. **Compute equivalence classes.** Take the reflexive-transitive
   closure of the predecessor relation. The strongly connected
   components are *equivalence classes*: groups of symbols that can
   each appear before the other. If `copy` precedes `template` in one
   role and `template` precedes `copy` in another, they're in the
   same class.

3. **Merge singleton classes.** A class with one symbol that shares
   the same predecessor/successor sets as another singleton class
   gets merged. This handles symbols that always appear in the
   same structural position.

4. **Topological sort.** The equivalence classes are sorted by their
   position in the Hasse diagram of the predecessor relation. Each
   class becomes a factor in the output, annotated with a quantifier:
   - `+` (one or more) if the class forms a cycle
   - `+?` (zero or more) if the class appears variably
   - `?` (optional) if the class can be absent
   - (exact) if the class always appears exactly once

The result is a CHARE: a sequence of factors where each factor is a
disjunction of equivalent symbols with a quantifier.
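
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the engine's code: the function name `crx_sketch` is hypothetical, step 3 (merging singleton classes) is omitted, and quantifiers are chosen by counting how often each class occurs per sequence, which is one plausible reading of step 4.

```python
def crx_sketch(sequences):
    """Simplified CRX: predecessor relation -> classes -> ordered factors."""
    symbols = sorted({s for seq in sequences for s in seq})

    # Step 1: immediate-predecessor relation, stored as successor sets.
    succ = {s: set() for s in symbols}
    for seq in sequences:
        for x, y in zip(seq, seq[1:]):
            succ[x].add(y)

    # Step 2: reflexive-transitive closure via a search from every symbol.
    reach = {}
    for s in symbols:
        seen, stack = {s}, [s]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        reach[s] = seen

    # Symbols that reach each other form one equivalence class (an SCC).
    classes = {frozenset(t for t in reach[s] if s in reach[t]) for s in symbols}

    # Step 4: a class that reaches more symbols must come earlier, so sorting
    # by reach size (descending) yields a valid topological order.
    ordered = sorted(classes, key=lambda c: (-len(reach[min(c)]), min(c)))

    factors = []
    for cls in ordered:
        counts = [sum(1 for s in seq if s in cls) for seq in sequences]
        if all(c == 1 for c in counts):
            quant = ""        # exactly once
        elif all(c <= 1 for c in counts):
            quant = "?"       # optional
        elif all(c >= 1 for c in counts):
            quant = "+"       # one or more
        else:
            quant = "+?"      # zero or more
        body = "+".join(sorted(cls))
        factors.append((f"({body})" if len(cls) > 1 else body) + quant)
    return ".".join(factors)


# Made-up example sequences, for illustration only.
example = [
    ["assert", "file", "template", "copy", "service"],
    ["assert", "file", "copy", "template", "service"],
    ["assert", "file", "template", "service", "cron"],
]
print(crx_sketch(example))  # assert.file.(copy+template)+.service.cron?
```

Because `copy` and `template` appear in both orders, they collapse into one factor, while `cron` shows up in only one sequence and becomes optional.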

### iDRegEx — k-optimal regular expression inference

iDRegEx (Algorithm 4, arXiv 2010) uses a probabilistic automaton:

1. **Build a complete k-OA.** A k-testable Observation Automaton
   records all k-grams (subsequences of length k) from the input
   sequences. The automaton's states represent (k-1)-grams.

2. **Train with Baum-Welch.** EM iterations assign probabilities to
   transitions, learning which paths through the automaton are most
   likely given the data.

3. **Disambiguate.** Remove nondeterministic transitions — for any
   state and symbol, keep only the most probable next state.

4. **Prune.** Remove low-probability edges and unreachable states,
   leaving only the most likely paths.

5. **Extract with rwr².** The REWRITE-SQUARED algorithm (rwr²,
   Algorithm 3) collapses the pruned automaton into a k-optimal
   regular expression — the minimal common core.
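
To make steps 1 and 4 concrete, here is a rough sketch that follows the description above, with states as (k-1)-grams. It is a stand-in under stated assumptions: relative frequencies replace Baum-Welch training, unreachable states are not removed, rwr² is not shown, and the names `kgram_transitions`, `prune`, `START`, and `END` are hypothetical.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical padding markers


def kgram_transitions(sequences, k=2):
    """Count, for every (k-1)-gram state, which symbol follows it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = [START] * (k - 1) + list(seq) + [END]
        for i in range(len(padded) - k + 1):
            state = tuple(padded[i:i + k - 1])
            counts[state][padded[i + k - 1]] += 1
    return counts


def prune(counts, min_prob=0.1):
    """Keep only transitions whose relative frequency is at least min_prob."""
    pruned = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        kept = {sym: n / total for sym, n in followers.items()
                if n / total >= min_prob}
        if kept:
            pruned[state] = kept
    return pruned


# Made-up example sequences, for illustration only.
seqs = [
    ["assert", "file", "template"],
    ["assert", "file", "template"],
    ["assert", "file", "copy"],
]
automaton = prune(kgram_transitions(seqs, k=2), min_prob=0.4)
print(automaton[("file",)])  # only the `template` transition survives
```

With a threshold of 0.4, the rare `file → copy` edge (seen in one of three sequences) is dropped, which is the intuition behind keeping only the common core.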

### MDL scoring — picking the right level of specificity

The Minimum Description Length principle (Rissanen 1978) says: the
best grammar is the one that minimizes the sum of its own size and
the cost of encoding the data using it.

```
MDL = model_cost + data_cost
```

**model_cost** = the number of alphabet symbol occurrences in the
grammar. A grammar with 5 unique symbols used once each has
model_cost = 5.

**data_cost** = Σ log₂(|Lₙ(r)|) summed over all sequences s, where
n = len(s) and |Lₙ(r)| is the number of strings of length n that the
grammar r accepts. A grammar like `(a+b+c+...+s)+` over a 19-symbol
alphabet accepts 19 possible symbols at each position, so for a
sequence of length 120, the data cost is 120 × log₂(19) ≈ 510 bits.
A grammar like `a.b.c.d.e` accepts only 1 string of length 5, so its
data cost is 0.

The ensemble picks the grammar with the lowest total MDL. This
automatically balances specificity against coverage: a grammar that
matches only 1 sequence but does so perfectly (low data cost) can
beat a grammar that matches all sequences but is extremely permissive
(high data cost).
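
The two worked examples can be reproduced with a small sketch. It assumes a grammar is given as a list of `(symbols, quantifier)` factors whose symbol sets are disjoint, so every accepted string has exactly one parse; the function names `count_strings` and `mdl` are illustrative and are not the engine's internals.

```python
import math
import string

# Quantifiers follow the post's notation:
# "" exactly once, "?" optional, "+" one or more, "+?" zero or more.


def count_strings(factors, n):
    """Number of strings of length n accepted by a chain of factors."""
    ways = [1] + [0] * n  # ways[j]: strings of length j matched so far
    for symbols, quant in factors:
        m = len(symbols)
        lo = 0 if quant in ("?", "+?") else 1
        hi = 1 if quant in ("", "?") else n
        nxt = [0] * (n + 1)
        for used, w in enumerate(ways):
            if w == 0:
                continue
            for reps in range(lo, min(hi, n - used) + 1):
                nxt[used + reps] += w * m ** reps
        ways = nxt
    return ways[n]


def mdl(factors, sequences):
    """model_cost (symbol occurrences) plus data_cost (bits per sequence)."""
    model_cost = sum(len(symbols) for symbols, _ in factors)
    data_cost = 0.0
    for seq in sequences:
        accepted = count_strings(factors, len(seq))
        if accepted == 0:
            return math.inf  # grammar accepts nothing of this length
        data_cost += math.log2(accepted)
    return model_cost + data_cost


permissive = [(tuple(string.ascii_lowercase[:19]), "+")]  # 19 symbols
strict = [((c,), "") for c in "abcde"]                    # a.b.c.d.e

print(mdl(permissive, [["a"] * 120]))  # 19 + 120 * log2(19), about 528.75
print(mdl(strict, [list("abcde")]))    # 5.0
```

The permissive grammar pays roughly 510 bits of data cost for a single long sequence, while the strict one pays nothing beyond its five symbols.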

## The results

### Ansible deploy roles — 36 roles from companyweb

Our own deploy roles cover everything from AdGuard Home to
Woodpecker CI. They have no schema: each is a free-form script.

```
Grammar: docker_volume+?.group?.docker_container?.user?.apt?.npm?.
         (assert+...+command+copy+file+template+set_fact+...+wait_for)+?.
         (cron+firewalld)?
Match:   36/36
MDL:     2186.28
```

Reading left to right: an optional setup prefix (`docker_volume`,
`group`, `docker_container`, `user`, `apt`, `npm`), then a large
disjunction of ~25 task modules (zero or more), then an optional
`cron` or `firewalld` at the end. All 36 roles fit this shape.

**Compression: 36 roles (15,000 tokens) → 200 tokens (75×)**
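
A match count like the one above can be checked by translating a grammar into an ordinary regular expression. The sketch below assumes a grammar is already split into `(symbols, quantifier)` factors and uses a small made-up grammar, not the real 36-role one; `chare_to_regex` and `matches` are hypothetical helper names.

```python
import re

# In Python's `re`, "+?" means a lazy "+", so the post's "zero or more"
# quantifier has to be mapped to "*" explicitly.
_QUANT = {"": "", "?": "?", "+": "+", "+?": "*"}


def chare_to_regex(factors):
    """Compile factors into a regex over space-terminated tokens."""
    parts = []
    for symbols, quant in factors:
        alternatives = "|".join(re.escape(s) for s in sorted(symbols))
        parts.append(f"(?:(?:{alternatives}) ){_QUANT[quant]}")
    return re.compile("".join(parts))


def matches(factors, sequence):
    """True if the whole module sequence is accepted by the grammar."""
    text = "".join(f"{token} " for token in sequence)
    return chare_to_regex(factors).fullmatch(text) is not None


# Toy grammar: docker_volume+?.group?.(copy+file+template)+?.(cron+firewalld)?
toy = [
    ({"docker_volume"}, "+?"),
    ({"group"}, "?"),
    ({"copy", "file", "template"}, "+?"),
    ({"cron", "firewalld"}, "?"),
]

print(matches(toy, ["docker_volume", "docker_volume", "file", "template"]))  # True
print(matches(toy, ["cron", "file"]))                                        # False
```

Terminating every token with a space keeps a short module name from matching as a prefix of a longer one.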

### Geerlingguy Galaxy roles — 15 popular roles

Jeff Geerling's roles are among the most popular on Ansible Galaxy. We
found no public documentation of their structural pattern. Yet every
one of the 15 follows the same arc:

```
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
         include+?.(npm+pip)+?.lineinfile?
Match:   15/15
MDL:     596.64
```

Check prerequisites, OS-specific variables, install packages,
configure with templates, start services, optionally run sub-tasks,
install npm/pip packages, and optionally tweak config lines.

**This is the first explicit description of the geerlingguy role
module ordering convention.** It took 15 roles and a grammar inference
algorithm to write it down.

**Compression: 15 roles (5,000 tokens) → 60 tokens (83×)**

### Ensemble dynamics

The ensemble (CRX + iDRegEx + MDL) selects different winners
depending on the data:

| Dataset | Winner | Why |
|---------|--------|-----|
| Ansible Galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Helm prom-stack (6 configs) | **iDRegEx** | Finds minimal core across all configs |
| Terraform modules (8) | CRX | iDRegEx returned ∅ (no common core; every resource type is optional across domains) |
| GitHub Actions Go lint (6) | CRX | Tight pattern, all match |

iDRegEx wins when the data has a clear common core. CRX wins when
there's no single shared subsequence (the roles share the *vocabulary*
but not the *order*).
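
The selection rule itself is small. Here is a minimal sketch, assuming each algorithm reports a grammar string (or `None` when it returns ∅) together with its MDL score; the `Candidate` class and `pick_winner` function are hypothetical names, not the engine's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    algorithm: str
    grammar: Optional[str]  # None stands in for an empty (∅) result
    mdl: float


def pick_winner(candidates):
    """Return the candidate with the lowest MDL among non-empty grammars."""
    viable = [c for c in candidates if c.grammar is not None]
    if not viable:
        raise ValueError("no algorithm produced a grammar")
    return min(viable, key=lambda c: c.mdl)


# The Galaxy row from the table: iDRegEx returned ∅, so CRX wins by default.
galaxy = [
    Candidate(
        "CRX",
        "fail?.(include_vars+set_fact+package+file+template+service+...)+."
        "include+?.(npm+pip)+?.lineinfile?",
        596.64,
    ),
    Candidate("iDRegEx", None, float("inf")),
]
print(pick_winner(galaxy).algorithm)  # CRX
```

When both algorithms return a grammar, the same function reduces to a plain comparison of the two MDL scores.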

## The MCP

The engine is exposed as an MCP server:

```python
from bex.mcp_server import infer_best_grammar

# Full coverage
output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",
)
# Returns:
#   Best: CRX (MDL 288)
#   Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
#            .include+?.(npm+pip)+?.lineinfile?

# Ensemble — let MDL pick
output = infer_best_grammar(sequences=role_sequences)
```

An agent workflow:

1. Agent needs to write an Ansible role
2. Finds 15 existing geerlingguy roles, extracts their task module sequences
3. Calls `infer_best_grammar(sequences=..., prefer='crx')`
4. Gets back the grammar in ~60 tokens
5. Generates a new role that follows the structural pattern

Without the MCP: 15 role files in context (5,000 tokens), or guesswork.
With the MCP: one grammar rule (~60 tokens), known to match 15/15 roles.
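
Step 2 of the workflow, extracting module sequences, might look like the sketch below. It assumes the task list has already been parsed from YAML into Python dictionaries and treats the first key that is not a known task keyword as the module name. The keyword set is a partial, hand-picked list, and `block`/`rescue`/`always` sections are not handled; `module_sequence` is a hypothetical helper.

```python
# Partial list of Ansible task keywords that are not modules (an assumption:
# extend it for your own roles).
TASK_KEYWORDS = {
    "name", "when", "become", "become_user", "tags", "register", "notify",
    "vars", "loop", "changed_when", "failed_when", "ignore_errors",
    "environment", "delegate_to", "args", "no_log", "until", "retries", "delay",
}


def module_sequence(tasks):
    """Return the module name used by each task, in file order."""
    sequence = []
    for task in tasks:
        for key in task:
            if key in TASK_KEYWORDS or key.startswith("with_"):
                continue
            # Strip a collection prefix: ansible.builtin.file -> file
            sequence.append(key.rsplit(".", 1)[-1])
            break
    return sequence


# Made-up tasks, for illustration only.
tasks = [
    {"name": "Check version", "ansible.builtin.assert": {"that": ["app_version is defined"]}},
    {"name": "Create directory", "file": {"path": "/opt/app", "state": "directory"}, "become": True},
    {"name": "Render config", "template": {"src": "app.conf.j2", "dest": "/opt/app/app.conf"}, "notify": "restart app"},
]
print(module_sequence(tasks))  # ['assert', 'file', 'template']
```

One such list per role is the shape of input the inference step expects: a sequence of symbols over the alphabet of module names.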

## What it means

Grammar inference turns **examples** into **rules**. The rule is a
compressed description of the structural convention — and for
schema-less content like the geerlingguy role module ordering, this is
the *first time* the convention has been written down at all.

For LLM agents, this changes the trade-off between context and
accuracy. Instead of flooding the context window with examples, the
agent can call the MCP, get the rule in ~60 tokens, and follow it.
The rule is more reliable than guessing from examples, and it costs
less than the first example would have.

The algorithm doesn't need to understand what a deploy role does. It
doesn't know that `file` creates directories and `template` renders
Jinja2. It only needs to see 36 sequences of module names and find
the pattern they all share. The structural convention is in the data
— you just have to extract it.

## References

- Bex, G. J., Gelade, W., Neven, F., & Vansummeren, S. (2010).
  [*Learning Deterministic Regular Expressions for the Web.*](https://doi.org/10.1145/1806907.1806911) TODS 2010.
- Bex, G. J., Gelade, W., Martens, W., & Neven, F. (2010).
  [*Simplifying XML Schema: Single-Type Approximations of Regular
  Expressions.*](https://arxiv.org/abs/1004.2372) arXiv:1004.2372.
- Rissanen, J. (1978). *Modeling by shortest data description.*
  Automatica 14(5).
-												Rename to Dervish, add animated logo to README

											
										
										
											2026-07-01 10:19:08 +02:00
+								# Dervish: Discovering Unwritten Conventions with Grammar Inference
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
-												scale PNG logo to 150%

											
										
										
											2026-07-01 11:21:02 +02:00
+								<p align="left"><img src="dervish-logo.png" alt="Dervish" width="180"></p>
-												replace logo with dervis_logo.png; add to SHOWCASE.md and blog_post.md

											
										
										
											2026-07-01 11:16:21 +02:00
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
+								**How we turned 36 Ansible roles into a 200-character grammar — and why
 								it matters for LLM agents.**
 								## The problem
 								Every codebase has unwritten conventions. Your team's Docker Compose
 								files always put `image` before `ports` before `volumes`. Your Ansible
 								deploy roles always start with `assert`, then `file`, then `template`.
 								Your CI pipelines always run `lint` before `test` before `deploy`.
 								Nobody writes these down. They're emergent — copied from role to role,
 								file to file, until they become a tacit standard.
 								When an LLM agent needs to generate new content that follows these
 								conventions, you have two options:
 . **Stuff every existing file into context** — 36 deploy roles = 15,000
 								   tokens. You'll hit the context window on your third example.
 . **Give it one or two examples and hope** — the LLM will guess the
 								   pattern, and it will often guess wrong.
 								Neither is good. The first is wasteful. The second is unreliable.
 								What you really want is the **compiled convention** — the minimal
 								description of what all 36 roles share, expressed in ~200 tokens. An
 								LLM can follow a rule in 200 tokens far more reliably than it can
 								infer a pattern from 36 examples.
 								This is grammar inference.
 								## The approach
 								Given a set of example sequences over some alphabet (e.g., Ansible
 								module names, Docker Compose keys, CI job names), learn a regular
 								expression that describes the general pattern.
 								We implemented two algorithms from Bex et al., a pair of papers from
 								TODS 2010 and arXiv 2010:
 								- **CRX** (TODS 2010 §6): A single-pass algorithm that builds a
 								  predecessor relation over symbols, computes equivalence classes,
 								  and emits a Chain Regular Expression (CHARE) that matches ALL
 								  input sequences. Fast, deterministic, captures the full vocabulary.
 								- **iDRegEx** (arXiv 2010): A probabilistic algorithm using k-testable
 								  Observation Automata (k-OA) trained with Baum-Welch EM. It finds
 								  only the *minimal common core* — the symbols that appear in every
 								  example. Robust against noise, but fails (returns ∅) when the
 								  examples are too diverse.
 								Both run in the **ensemble**: CRX produces a permissive grammar (full
 								vocabulary, many optional parts), iDRegEx produces a strict grammar
 								(minimal core). A Minimum Description Length (MDL) score picks the
 								winner: the grammar that compresses the data best.
 								## The algorithms, briefly
 								### CRX — Chain Regular Expression inference
 								CRX (Algorithm 7, TODS 2010) works in four steps:
 . **Build the immediate-predecessor relation.** For every adjacent
 								   pair (x, y) across all sequences, record that x precedes y. If
 								   symbol `assert` always appears before `file`, record
 								   `assert → file`.
 . **Compute equivalence classes.** Take the reflexive-transitive
 								   closure of the predecessor relation. The strongly connected
 								   components are *equivalence classes* — groups of symbols that can
 								   appear in the same position. If `copy` and `template` both follow
 								   `file` and precede `command`, they're in the same class.
 . **Merge singleton classes.** A class with one symbol that shares
 								   the same predecessor/successor sets as another singleton class
 								   gets merged. This handles symbols that always appear in the
 								   same structural position.
 . **Topological sort.** The equivalence classes are sorted by their
 								   position in the Hasse diagram of the predecessor relation. Each
 								   class becomes a factor in the output, annotated with a quantifier:
 								   - `+` (one or more) if the class forms a cycle
 								   - `+?` (zero or more) if the class appears variably
 								   - `?` (optional) if the class can be absent
 								   - (exact) if the class always appears exactly once
 								The result is a CHARE: a sequence of factors where each factor is a
 								disjunction of equivalent symbols with a quantifier.
 								### iDRegEx — k-optimal regular expression inference
 								iDRegEx (Algorithm 4, arXiv 2010) uses a probabilistic automaton:
 . **Build a complete k-OA.** A k-testable Observation Automaton
 								   records all k-grams (subsequences of length k) from the input
 								   sequences. The automaton's states represent (k-1)-grams.
 . **Train with Baum-Welch.** EM iterations assign probabilities to
 								   transitions, learning which paths through the automaton are most
 								   likely given the data.
 . **Disambiguate.** Remove nondeterministic transitions — for any
 								   state and symbol, keep only the most probable next state.
 . **Prune.** Remove low-probability edges and unreachable states,
 								   leaving only the most likely paths.
 . **Extract with rwr².** The REWRITE-SQUARED algorithm (rwr²,
 								   Algorithm 3) collapses the pruned automaton into a k-optimal
 								   regular expression — the minimal common core.
 								### MDL scoring — picking the right level of specificity
 								The Minimum Description Length principle (Rissanen 1978) says: the
 								best grammar is the one that minimizes the sum of its own size and
 								the cost of encoding the data using it.
 								```
 								MDL = model_cost + data_cost
 								```
 								**model_cost** = the number of alphabet symbol occurrences in the
 								grammar. A grammar with 5 unique symbols used once each has
 								model_cost = 5.
 								**data_cost** = Σ log₂(|L(r)|) across all sequences, where |L(r)| is
 								the number of strings of length len(s) that the grammar accepts.
 								A grammar like `(a+b+c+...+z)+` accepts 19 possible symbols at each
 								position, so for a sequence of length 120, the data cost is
 × log₂(19) ≈ 510 bits. A grammar like `a.b.c.d.e` accepts only
 string of length 5, so data cost is 0.
 								The ensemble picks the grammar with the lowest total MDL. This
 								automatically balances specificity against coverage: a grammar that
 								matches only 1 sequence but does so perfectly (low data cost) can
 								beat a grammar that matches all sequences but is extremely permissive
 								(high data cost).
 								## The results
 								### Ansible deploy roles — 36 roles from companyweb
 								Your own deploy roles cover everything from AdGuard Home to
 								Woodpecker CI. They have NO schema — each is a free-form script.
 								```
 								Grammar: docker_volume+?.group?.docker_container?.user?.apt?.npm?.
 								         (assert+...+command+copy+file+template+set_fact+...+wait_for)+?.
 								         (cron+firewalld)?
 								Match:   36/36
 								MDL:     2186.28
 								```
 								Bottleneck analysis: optional docker setup (volume, group, container,
 								user, apt, npm), then a large disjunction of ~25 task modules (one or
 								more), then optional cron/firewalld at the end. This captures the
 								convention precisely.
 								**Compression: 36 roles (15,000 tokens) → 200 tokens (75×)**
 								### Geerlingguy Galaxy roles — 15 popular roles
 								Jeff Geerling's roles are the most popular on Ansible Galaxy. He has
 								never documented their structural pattern. Yet every one of the 15
 								follows the same arc:
 								```
 								Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
 								         include+?.(npm+pip)+?.lineinfile?
 								Match:   15/15
 								MDL:     596.64
 								```
 								Check prerequisites, OS-specific variables, install packages,
 								configure with templates, start services, optionally run sub-tasks,
 								install npm/pip packages, and optionally tweak config lines.
 								**This is the first explicit description of the geerlingguy role
-												Remove bugs section (implementation bugs, not paper bugs), remove Docker Compose (private data), add Portainer templates, fix geerlingguy claim precision

Blog post: remove 'The bugs we found' section (all 4 bugs were from our implementation, not the paper algorithms). Replace company data references in MCP section with Galaxy example. Update ensemble dynamics table with public datasets.

README: replace Docker Compose with Portainer templates in 'Why grammar inference?' table, Real-world Results, and Domain Adapters.

SHOWCASE: replace Docker Compose with Portainer templates.

All claims verified: no public documentation of geerlingguy module ordering convention exists.

											
										
										
											2026-07-01 10:15:22 +02:00
+								module ordering convention.** It took 15 roles and a grammar inference
 								algorithm to write it down.
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
 								**Compression: 15 roles (5,000 tokens) → 60 tokens (83×)**
 								### Ensemble dynamics
 								The ensemble (CRX + iDRegEx + MDL) selects different winners
 								depending on the data:
 								| Dataset | Winner | Why |
 								|---------|--------|-----|
 								| Ansible galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
-												Remove bugs section (implementation bugs, not paper bugs), remove Docker Compose (private data), add Portainer templates, fix geerlingguy claim precision

Blog post: remove 'The bugs we found' section (all 4 bugs were from our implementation, not the paper algorithms). Replace company data references in MCP section with Galaxy example. Update ensemble dynamics table with public datasets.

README: replace Docker Compose with Portainer templates in 'Why grammar inference?' table, Real-world Results, and Domain Adapters.

SHOWCASE: replace Docker Compose with Portainer templates.

All claims verified: no public documentation of geerlingguy module ordering convention exists.

											
										
										
											2026-07-01 10:15:22 +02:00
+								| Helm prom-stack (6 configs) | **iDRegEx** | Finds minimal core across all configs |
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								| Terraform modules (8) | CRX | iDRegEx returned ∅ (no common core across domains) |
-												Remove bugs section (implementation bugs, not paper bugs), remove Docker Compose (private data), add Portainer templates, fix geerlingguy claim precision

Blog post: remove 'The bugs we found' section (all 4 bugs were from our implementation, not the paper algorithms). Replace company data references in MCP section with Galaxy example. Update ensemble dynamics table with public datasets.

README: replace Docker Compose with Portainer templates in 'Why grammar inference?' table, Real-world Results, and Domain Adapters.

SHOWCASE: replace Docker Compose with Portainer templates.

All claims verified: no public documentation of geerlingguy module ordering convention exists.

											
										
										
											2026-07-01 10:15:22 +02:00
+								| Terraform modules (8) | CRX | Every resource type optional across domains |
 								| GitHub Actions Go lint (6) | CRX | Tight pattern, all match |
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
 								iDRegEx wins when the data has a clear common core. CRX wins when
 								there's no single shared subsequence (the roles share the *vocabulary*
 								but not the *order*).
 								## The MCP
 								The engine is exposed as an MCP server:
 								```python
 								from bex.mcp_server import infer_best_grammar
 								# Full coverage
 								output = infer_best_grammar(
 								    sequences=role_sequences,
 								    prefer="crx",
 								)
 								# Returns:
-												Remove bugs section (implementation bugs, not paper bugs), remove Docker Compose (private data), add Portainer templates, fix geerlingguy claim precision

Blog post: remove 'The bugs we found' section (all 4 bugs were from our implementation, not the paper algorithms). Replace company data references in MCP section with Galaxy example. Update ensemble dynamics table with public datasets.

README: replace Docker Compose with Portainer templates in 'Why grammar inference?' table, Real-world Results, and Domain Adapters.

SHOWCASE: replace Docker Compose with Portainer templates.

All claims verified: no public documentation of geerlingguy module ordering convention exists.

											
										
										
											2026-07-01 10:15:22 +02:00
+								#   Best: CRX (MDL 288)
 								#   Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
 								#            .include+?.(npm+pip)+?.lineinfile?
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
 								# Ensemble — let MDL pick
 								output = infer_best_grammar(sequences=role_sequences)
 								```
 								An agent workflow:
-												Remove bugs section (implementation bugs, not paper bugs), remove Docker Compose (private data), add Portainer templates, fix geerlingguy claim precision

Blog post: remove 'The bugs we found' section (all 4 bugs were from our implementation, not the paper algorithms). Replace company data references in MCP section with Galaxy example. Update ensemble dynamics table with public datasets.

README: replace Docker Compose with Portainer templates in 'Why grammar inference?' table, Real-world Results, and Domain Adapters.

SHOWCASE: replace Docker Compose with Portainer templates.

All claims verified: no public documentation of geerlingguy module ordering convention exists.

											
										
										
											2026-07-01 10:15:22 +02:00
+. Agent needs to write an Ansible role
 . Finds 15 existing geerlingguy roles, extracts their task module sequences
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
2. Calls `infer_best_grammar(sequences=..., prefer='crx')`
3. Gets back the grammar in ~60 tokens
4. Generates a new role that follows the structural pattern

Without the MCP: 15 role files in context (5,000 tokens), or guesswork.

With the MCP: one grammar rule (~60 tokens), known to match 15/15 roles.
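
A figure like "15/15 roles" is a coverage check: every example sequence is tested against the inferred rule. The sketch below shows one way such a check could look. It is an illustration only: the grammar, the `encode` and `coverage` helpers, and the sample sequences are all hypothetical, and it uses Python's `re` module rather than Dervish's own matcher or grammar syntax.

```python
import re

# Hypothetical grammar (not Dervish output): assert, one or more file tasks,
# template, then an optional service task. Each token ends with one space.
GRAMMAR = re.compile(r"(assert )(file )+(template )(service )?")


def encode(sequence):
    """Join module names into the space-terminated form the regex expects."""
    return "".join(name + " " for name in sequence)


def coverage(grammar, sequences):
    """Return (matched, total) for the given example sequences."""
    matched = sum(1 for seq in sequences if grammar.fullmatch(encode(seq)))
    return matched, len(sequences)


# Made-up module sequences standing in for real roles.
roles = [
    ["assert", "file", "template"],
    ["assert", "file", "file", "template", "service"],
    ["assert", "file", "template", "service"],
]

matched, total = coverage(GRAMMAR, roles)
print(f"{matched}/{total} roles match")
```

Running the same check on a newly generated role is a cheap way to confirm that the agent actually followed the rule it was given.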

## What it means

Grammar inference turns **examples** into **rules**. The rule is a
compressed description of the structural convention, and for
schema-less content like the geerlingguy role module ordering, this is
the *first time* the convention has been written down at all.

For LLM agents, this changes the trade-off between context and
accuracy. Instead of flooding the context window with examples, the
agent can call the MCP, get the rule in ~60 tokens, and follow it.
The rule is more reliable than guessing from examples, and it costs
less than the first example would have.

The algorithm doesn't need to understand what a deploy role does. It
doesn't know that `file` creates directories and `template` renders
Jinja2. It only needs to see 36 sequences of module names and find
the pattern they all share. The structural convention is in the data;
you just have to extract it.
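
To make "find the pattern they all share" concrete, here is a deliberately simplified sketch. It is a toy stand-in, not CRX or iDRegEx: it keeps only the symbols that occur in every sequence and checks that all sequences agree on their order. The function name `shared_core` and the sample sequences are hypothetical.

```python
def shared_core(sequences):
    """Return the symbols present in every sequence, in their shared order.

    Toy stand-in for grammar inference (not CRX or iDRegEx): symbols are
    ordered by first occurrence, and a ValueError is raised if the
    sequences disagree on that order.
    """
    if not sequences:
        return []

    # Symbols that appear in every single sequence.
    common = set(sequences[0]).intersection(*sequences[1:])

    def core(seq):
        # Keep the first occurrence of each common symbol, in order.
        seen, out = set(), []
        for sym in seq:
            if sym in common and sym not in seen:
                seen.add(sym)
                out.append(sym)
        return out

    reference = core(sequences[0])
    for seq in sequences[1:]:
        if core(seq) != reference:
            raise ValueError("sequences disagree on the order of shared symbols")
    return reference


# Made-up module sequences; nothing here knows what the modules do.
roles = [
    ["assert", "file", "template", "service"],
    ["assert", "file", "copy", "template"],
    ["assert", "package", "file", "template", "service"],
]
print(shared_core(roles))  # ['assert', 'file', 'template']
```

The sketch never inspects what `file` or `template` mean. It works on the names alone, which is the point of the paragraph above. A real inference algorithm additionally handles repetition, optional symbols, and alternatives.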

## References

- Bex, G. J., Gelade, W., Neven, F., & Vansummeren, S. (2010).
  [*Learning Deterministic Regular Expressions for the Web.*](https://doi.org/10.1145/1806907.1806911) TODS 2010.
- Bex, G. J., Gelade, W., Martens, W., & Neven, F. (2010).
  [*Simplifying XML Schema: Single-Type Approximations of Regular
  Expressions.*](https://arxiv.org/abs/1004.2372) arXiv:1004.2372.
- Rissanen, J. (1978). *Modeling by shortest data description.*
  Automatica 14(5).