Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive
tobjend 2026-07-01 09:51:41 +02:00
parent a1567bffbe
commit 0e2aec582b
7 changed files with 1115 additions and 47 deletions

253
README.md

@ -10,12 +10,25 @@ python -m bex
```
```python
from bex.crx import CRX
from bex import infer_ensemble
seqs = [
['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```
Or compare algorithms manually:
```python
from bex.crx import CRX
seqs = [...]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
@ -26,10 +39,10 @@ print(grammar)
| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference from many sequences |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Handles noise, learns from few examples |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Builds regex from a single automaton |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | Post-processing for k-ORE extraction |
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures *all* symbols |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |
### Pipeline 1: Direct CHARE Inference (fast)
@ -37,6 +50,8 @@ print(grammar)
Example sequences → CRX → CHAREs grammar
```
CRX learns a grammar that accepts *all* observed symbols, marking optional ones with `?`. Best when the data is clean and you want the full vocabulary.
### Pipeline 2: Probabilistic k-ORE Inference (robust)
```
@ -44,6 +59,16 @@ Example sequences → Complete k-OA → Baum-Welch (EM)
→ Disambiguate → Prune → rwr² → k-ORE grammar
```
iDRegEx learns the *minimum* common subsequence — symbols that appear in every example. Fails (∅) when the examples are too diverse.
### Pipeline 3: Ensemble (recommended)
```
Example sequences → [CRX, iDRegEx] → MDL score each → pick best
```
Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like `(a+b+c+...+z)+` that accepts everything gets a high data cost (`log2(|L(r)|)` is large), while a specific grammar like `a.b.c` has near-zero data cost.
## Architecture
```
@ -61,34 +86,219 @@ bex/
├── marking.py # State marking for determinism
├── yaml_to_seq.py # Generic YAML → key-path sequence converter
├── role_grammar.py # Ansible role → module-sequence extractor
├── ensemble.py # Ensemble: runs CRX + iDRegEx, picks best by MDL
├── mdl.py # MDL scoring for grammar selection (fix)
├── mcp_server.py # MCP server exposing 4 tools
└── ...
```
## Domain: Ansible Role Grammar
## MCP Server
The engine includes a domain adapter for Ansible roles. It extracts module names from `tasks/main.yml` files and learns per-category grammars:
A **Model Context Protocol** server exposes all algorithms and domain adapters as tools:
```bash
python -c "
from bex.role_grammar import collect_all_role_sequences, learn_grammar
python -m bex.mcp_server
```
### Tools
| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a `Why:` explanation. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | Generic YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
### Using `infer_best_grammar`
The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass `prefer`:
```
User: Run CRX on our deploy tasks.
Agent: [runs with prefer='crx']
Best: CRX (MDL 7.0)
Grammar: file.template.docker_image.command.set_fact.shell.wait_for?
CRX MDL= 7.00 file.template.docker_image.command.set_fact.shell.wait_for?
Why: Requested CRX only.
```
Without `prefer`, the ensemble compares both:
```
User: Find the grammar for our Helm chart.
Agent: [runs]
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
```
Both grammars are correct — they operate at different levels of specificity. The `Why:` field helps the agent decide which one to use for the task at hand.
## Ensemble Selection
The `infer_best_grammar` tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.
### How MDL scoring works
```
MDL = model_cost + data_cost
```
- **model_cost**: number of alphabet symbol occurrences in the grammar (a symbol that appears twice counts twice). Simpler grammars are cheaper.
- **data_cost**: Σ log₂(|L(r)| at length len(s)), summed over all sequences. A grammar that accepts *many* strings of the same length (like a 17-way disjunction `(a+b+...+q)+`) has high data cost because `|L(r)|` is large. A specific, fixed sequence (`a.b.c.d.e`) has `|L(r)| = 1`, so its data cost is zero.
The ensemble selects the grammar with the lowest total MDL. This automatically picks the right level of specificity for the data.
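The two extremes described above can be worked through by hand. The following is a minimal sketch, not the code in `bex/mdl.py`: the helper `mdl` is hypothetical, and the language sizes are filled in from closed forms (one word for a fixed sequence, `17**n` words of length `n` for a 17-way disjunction under `+`) rather than computed from a grammar string.

```python
import math


def mdl(symbol_occurrences: int, words_per_sequence: list[int]) -> float:
    """model_cost + data_cost, where data_cost = sum of log2(|L_n(r)|)."""
    data_cost = sum(math.log2(count) for count in words_per_sequence)
    return symbol_occurrences + data_cost


# Three observed sequences, each five symbols long (illustrative data).
lengths = [5, 5, 5]

# Fixed sequence a.b.c.d.e: 5 symbol occurrences, exactly one word of length 5.
specific = mdl(5, [1 for _ in lengths])

# 17-way disjunction (a+b+...+q)+: 17 occurrences, 17**n words of length n.
general = mdl(17, [17 ** n for n in lengths])

print(f"specific: {specific:.2f}")  # 5.00
print(f"general:  {general:.2f}")   # 78.31
```

The specific grammar pays only its model cost, while the general one pays `5 * log2(17)` bits for every sequence on top of a larger model cost, which is why the ensemble prefers the specific grammar whenever it still covers the data.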
### When each algorithm wins
| Scenario | Winner | Why |
|----------|--------|-----|
| Many sequences, diverse patterns | **CRX** | CRX captures the full vocabulary. iDRegEx can't find a common core. |
| Clean, structured sequences | **CRX** | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
| Few sequences (2-3) | **iDRegEx** | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
| Sequences share a clear core | **iDRegEx** | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
| Single sequence | **iDRegEx** (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |
### Real-world benchmarks
Results from three domains using the ensemble (fixed MDL scoring):
```
Dataset Best MDL Matches
──────────────────────────────────────────────────────────
Helm (prom-stack) iDRegEx 1433.0 1/6
Ansible (deploy) CRX 246.1 34/36
Ansible (validate) CRX 34.0 5/5
Ansible (restore) CRX 24.0 2/2
Ansible (manage) iDRegEx 25.0 1/2
Ansible (configure) iDRegEx 22.5 1/4
Terraform (hashistack) CRX 4.0 9/9
```
Note: MDL scores are not comparable across datasets — only within the same run
(CRX vs iDRegEx on the same sequences). The Helm score is higher because
each sequence is ~120 symbols long, making the data cost term dominant for
the overly-general CRX grammar (19 kinds × many lengths).
## Domain Adapters
### Ansible Roles
Extracts module names from `tasks/main.yml`, groups by category prefix (e.g., `deploy_foo` → `deploy`), and learns per-category grammars:
```python
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
seqs = [s for _, s in items]
print(f'{cat}: {learn_grammar(seqs)}')
"
if len(seqs) >= 2:
result = infer_ensemble(seqs)
print(f"── {cat} ({len(items)} roles) ──")
print(f" Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f" Grammar: {result['best']['grammar']}")
print(f" Why: {result['why']}")
```
### Example Output
**Example output** (from [companyweb](https://github.com/anomalyco/companyweb), 51 roles):
```
── restore (2 roles) ──
Best: CRX (MDL 24.0)
Grammar: file.copy.unarchive+.command
Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.
── validate (5 roles) ──
Best: CRX (MDL 34.0)
Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.
── configure (4 roles) ──
Grammar: (assert+debug+set_fact+uri)+?.include_role?
Best: iDRegEx (MDL 22.5)
Grammar: include_role
Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
```
### Helm Charts
Renders a Helm chart with different values files and extracts Kubernetes `kind` sequences for grammar inference:
```python
import subprocess
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
# Render the chart once per CI values file and collect the `kind` of each document
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    if out.returncode == 0:
        kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
                 if d and isinstance(d, dict) and 'kind' in d]
        if kinds:
            seqs.append(kinds)
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
print(f"Why: {result['why']}")
```
**Example output** (from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), 6 CI configs):
```
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
CRX MDL= 2651.74 (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
iDRegEx selected (MDL score 1433.0).
```
CRX captures *all* symbols that appear. iDRegEx finds only the minimal core that every config shares:
```
ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
```
Which grammar is more useful depends on the task:
- **CRX** tells you everything you *might* need — good for an agent generating a complete chart.
- **iDRegEx** tells you what you *always* need — the bootstrap pipeline that can't be skipped.
Use `prefer='crx'` or `prefer='idregex'` to select an algorithm without the ensemble comparison.
### Terraform
Parses `.tf` files to extract `resource` type sequences, per-file or per-directory:
```python
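import re
from pathlib import Path

from bex.ensemble import infer_ensemble

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    # Capture the resource type from lines like: resource "aws_instance" "web" {
    resources = re.findall(r'resource "(\w+)" "[\w-]+" \{', tf.read_text())
    if resources:
        seqs.append(resources)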
result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```
**Example output** (from [terraform-guides](https://github.com/hashicorp/terraform-guides), hashistack example, 9 files):
```
Best: CRX (MDL 4.0, 9/9 match)
Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
```
**Grammar notation:**
@ -97,15 +307,20 @@ for cat, items in sorted(by_category.items()):
- `r?` — zero or one (optional)
- `r+` — one or more (iteration)
- `r+?` — zero or more (varies across examples)
- `(a|b)` — iDRegEx-style disjunction (equivalent to `(a+b)`)
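To make the notation concrete, here is a hedged sketch that translates a grammar string into a Python `re` pattern and checks a sequence against it. It is not part of `bex`: `to_regex` and `matches` are hypothetical helpers, and the sketch assumes that symbols consist only of word characters (so dotted names such as `community.docker.docker_image` are out of scope) and that concatenation is always written with `.`.

```python
import re

# Symbols are runs of word characters; everything else is an operator.
TOKEN = re.compile(r"\w+|[().?+|]")


def to_regex(grammar: str) -> str:
    """Translate grammar notation into a regex over space-terminated symbols."""
    tokens = TOKEN.findall(grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == ".":
            pass  # concatenation is implicit in a regex
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "|", "?"):
            out.append(tok)
        elif tok == "+":
            if nxt == "(" or re.fullmatch(r"\w+", nxt):
                out.append("|")  # infix '+': CRX-style disjunction
            elif nxt == "?":
                out.append("*")  # 'r+?' means zero or more
                i += 1           # consume the '?' as well
            else:
                out.append("+")  # postfix '+': one or more
        else:
            # Each symbol is followed by a space so names cannot run together.
            out.append(f"(?:{re.escape(tok)} )")
        i += 1
    return "".join(out)


def matches(grammar: str, sequence: list[str]) -> bool:
    text = "".join(f"{symbol} " for symbol in sequence)
    return re.fullmatch(to_regex(grammar), text) is not None


print(matches("file.copy.unarchive+.command",
              ["file", "copy", "unarchive", "unarchive", "command"]))
print(matches("file.copy.unarchive+.command", ["file", "copy", "command"]))
```

The first call prints `True` and the second `False`, because `unarchive+` requires at least one `unarchive`. Note that `r+?` has to become `*`: in Python's `re`, `+?` would mean a lazy "one or more" instead.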
## Domain: Generic YAML
The engine can convert any YAML file into key-path sequences for grammar inference:
Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:
```python
from bex.yaml_to_seq import yaml_file_to_sequence, sequences_to_crx
from bex.yaml_to_seq import collect_all_sequences
from bex import infer_ensemble
grammar = sequences_to_crx(yaml_file_to_sequence('config.yml'))
results = collect_all_sequences('config_dir/')
seqs = [seq for _, seq in results]
result = infer_ensemble(seqs)
print(result['best']['grammar'])
```
## Papers
@ -123,10 +338,6 @@ python -m pytest tests/
python tests/test_bex.py
```
## MCP Server
A Model Context Protocol server for grammar inference is planned. See `AGENTS.md` for the roadmap.
## License
MIT

64
SHOWCASE.md Normal file

@ -0,0 +1,64 @@
# Grammar Inference Engine — Showcase
Infer the unwritten convention from existing examples. Given N example
sequences, produce a ~100-char grammar that captures the structural
pattern — in far fewer tokens than the originals.
## How it works
Your agent calls the MCP tool `infer_best_grammar` with a list of
existing sequences. It returns a compressed grammar:
```
a.b → a then b (concatenation)
(a+b) → a or b (disjunction)
r? → optional (zero or one)
r+ → one or more (iteration)
r+? → zero or more
```
Use `prefer='crx'` for full coverage (accepts all examples), or let the
ensemble pick between CRX and iDRegEx by MDL score.
## Ansible Galaxy roles — 15 geerlingguy roles
Jeff Geerling maintains 100+ of the most popular Ansible roles on
Galaxy. He has never written down their task structure. Our grammar is
the first explicit description:
```
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
include+?.(npm+pip)+?.lineinfile?
CRX MDL= 596.64 match=15/15
```
Every role follows the same arc: check prerequisites, OS-specific vars,
install packages, configure with templates, start services, optionally
run sub-tasks. It works because 15 roles all converged on the same
unwritten convention.
**Compression: 15 roles (~5,000 tokens) → 60 tokens.**
## Notation reference
| Symbol | Meaning |
|--------|---------|
| `a.b` | a then b |
| `(a+b)` | a or b (CRX disjunction) |
| `(a\|b)` | a or b (iDRegEx disjunction) |
| `r?` | zero or one |
| `r+` | one or more |
| `r+?` | zero or more |
| `MDL` | Minimum Description Length — lower is better |
## Usage
```python
from bex.mcp_server import infer_best_grammar

output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",
)
```


@ -21,6 +21,7 @@ from .koa import KOA, build_complete_koa
from .expr import concat, disj, star, optional, alphabet, strip_k
from .marking import mark_koa
from .tokenizer import YAMLTokenizer
from .ensemble import infer_ensemble
from .template import generate_template
__version__ = "0.2.0"

349
bex/ensemble.py Normal file

@ -0,0 +1,349 @@
"""Ensemble grammar inference — run multiple algorithms, pick best by MDL scoring."""
import re
from .crx import CRX
from .idregex import idregex
from .expr import alphabet
from .mdl import model_cost, mdl_score


def _parse_parts(expr):
    """Parse expression into a list of tokens for matching.

    Each token: (type, value, quantifier)
    type: 'symbol' | 'disj' | 'concat' | 'empty'
    quantifier: '' | '?' | '+' | '+?' | '*'
    """
    if not expr:
        return [('empty', '', '')]
    if expr == 'ε':
        return [('empty', '', '+?')]

    # 1. Check if it's a concatenation (split outermost by '.')
    #    Must check BEFORE stripping trailing quantifier, because
    #    quantifiers belong to individual parts (e.g., a?.b+)
    concat_parts = _split_outer(expr.strip(), '.')
    if len(concat_parts) > 1:
        children = []
        for p in concat_parts:
            children.extend(_parse_parts(p.strip()))
        return [('concat', children, '')]

    # 2. Now handle quantifier suffix on this single part
    quantifier = ''
    if expr.endswith('+?'):
        quantifier = '+?'
        expr = expr[:-2]
    elif expr.endswith('*'):
        quantifier = '*'
        expr = expr[:-1]
    elif expr.endswith('?'):
        quantifier = '?'
        expr = expr[:-1]
    elif expr.endswith('+'):
        quantifier = '+'
        expr = expr[:-1]

    # 3. Disjunction group: (a+b+c) for CRX or (a|b|c) for iDRegEx
    if expr.startswith('(') and expr.endswith(')'):
        inner = expr[1:-1]
        # Try CRX-style (+) first, then iDRegEx-style (|)
        disj_parts = _split_outer(inner, '+')
        if len(disj_parts) <= 1:
            disj_parts = _split_outer(inner, '|')
        if len(disj_parts) > 1:
            children = []
            for p in disj_parts:
                p = p.strip()
                # Parse as a flat symbol (don't split dots: they're part of
                # the symbol name, e.g. "community.docker.docker_image")
                children.append(_parse_flat_symbol(p))
            return [('disj', children, quantifier)]
        # Single element inside parens: treat as flat symbol, keeping a
        # quantifier written outside the parens, e.g. (a)+
        ttype, tval, inner_q = _parse_flat_symbol(inner)
        return [(ttype, tval, quantifier or inner_q)]

    # 4. Single symbol
    if expr and expr not in ('', 'ε'):
        return [('symbol', expr, quantifier)]
    return []


def _parse_flat_symbol(s):
    """Parse a single symbol with optional quantifier, no dot splitting.

    Unlike _parse_parts, this treats dots as part of the symbol name
    (e.g. 'community.docker.docker_image' stays as one symbol).
    """
    s = s.strip()
    quantifier = ''
    if s.endswith('+?'):
        quantifier = '+?'
        s = s[:-2]
    elif s.endswith('*'):
        quantifier = '*'
        s = s[:-1]
    elif s.endswith('?'):
        quantifier = '?'
        s = s[:-1]
    elif s.endswith('+'):
        quantifier = '+'
        s = s[:-1]
    if s and s not in ('', 'ε'):
        return ('symbol', s, quantifier)
    return ('empty', '', quantifier)


def _split_outer(s, sep):
    """Split on `sep` at the top level (not inside parentheses)."""
    depth = 0
    parts = []
    cur = []
    for ch in s:
        if ch == '(':
            depth += 1
            cur.append(ch)
        elif ch == ')':
            depth -= 1
            cur.append(ch)
        elif ch == sep and depth == 0:
            parts.append(''.join(cur))
            cur = []
        else:
            cur.append(ch)
    parts.append(''.join(cur))
    return parts


def _match_possible(token, seq, pos):
    """Return all possible end positions after matching this token starting at pos."""
    ttype, tval, tquant = token
    positions = []

    if ttype == 'empty':
        positions.append(pos)

    elif ttype == 'symbol':
        if tquant in ('', '?'):
            if pos < len(seq) and seq[pos] == tval:
                positions.append(pos + 1)
            if tquant == '?':
                positions.append(pos)
        elif tquant in ('+?', '*'):
            positions.append(pos)
            cnt = pos
            while cnt < len(seq) and seq[cnt] == tval:
                cnt += 1
                positions.append(cnt)
        elif tquant == '+':
            if pos < len(seq) and seq[pos] == tval:
                cnt = pos + 1
                positions.append(cnt)
                while cnt < len(seq) and seq[cnt] == tval:
                    cnt += 1
                    positions.append(cnt)

    elif ttype == 'disj':
        if tquant in ('', '?'):
            for child in tval:
                for ep in _match_possible(child, seq, pos):
                    positions.append(ep)
            if tquant == '?':
                positions.append(pos)
        elif tquant in ('+?', '*'):
            positions.append(pos)
            for child in tval:
                for ep in _match_possible(child, seq, pos):
                    if ep > pos:
                        positions.append(ep)
                        # After consuming one, recurse to try more
                        for ep2 in _match_possible(token, seq, ep):
                            if ep2 > ep:
                                positions.append(ep2)
        elif tquant == '+':
            for child in tval:
                for ep in _match_possible(child, seq, pos):
                    if ep > pos:
                        positions.append(ep)
                        for ep2 in _match_possible(token, seq, ep):
                            if ep2 > ep:
                                positions.append(ep2)

    elif ttype == 'concat':
        # Match all children sequentially
        def _match_seq(children, start):
            cur = [start]
            for child in children:
                next_cur = []
                for p in cur:
                    next_cur.extend(_match_possible(child, seq, p))
                cur = next_cur
                if not cur:
                    break
            return cur

        if tquant in ('', '?'):
            positions.extend(_match_seq(tval, pos))
            if tquant == '?':
                positions.append(pos)
        elif tquant in ('+?', '*'):
            positions.append(pos)
            inner_end = _match_seq(tval, pos)
            for ep in inner_end:
                if ep > pos:
                    positions.append(ep)
                    for ep2 in _match_possible(token, seq, ep):
                        if ep2 > ep:
                            positions.append(ep2)
        elif tquant == '+':
            inner_end = _match_seq(tval, pos)
            for ep in inner_end:
                if ep > pos:
                    positions.append(ep)
                    for ep2 in _match_possible(token, seq, ep):
                        if ep2 > ep:
                            positions.append(ep2)

    return positions


def _match_tokens(tokens, seq, pos=0):
    """Try to match tokens against seq starting at pos. Returns max position or None."""
    cur = [pos]
    for token in tokens:
        next_cur = []
        for p in cur:
            next_cur.extend(_match_possible(token, seq, p))
        cur = next_cur
        if not cur:
            return None
    return max(cur) if cur else pos


def _matches(grammar, sequence):
    """Check if a sequence matches the grammar."""
    try:
        tokens = _parse_parts(grammar.strip())
        if not tokens:
            return False
        end = _match_tokens(tokens, sequence)
        if end is None:
            return False
        return end == len(sequence)
    except Exception:
        return False


def mdl_score_simple(grammar, sequences):
    """MDL score from the paper: model_cost + Σ log₂(|L(r)| at length len(s)).

    Lower is better. Uses the paper's definition from Bex et al.
    model_cost = number of alphabet symbol occurrences in the expression.
    data_cost = Σ log₂(|L(r)|) penalizes overly general grammars.
    """
    return mdl_score(grammar, sequences)


def infer_ensemble(sequences, kmax=2, N=3, prefer=None):
    """Run all applicable algorithms and return the best by MDL score.

    Args:
        sequences: List of sequences, each a list of strings.
        kmax: Maximum k for iDRegEx k-ORE inference.
        N: Number of EM iterations for iDRegEx.
        prefer: Optional 'crx' or 'idregex' to skip ensemble and
            return only that algorithm's result.

    Returns:
        dict with keys:
            best: {algorithm, grammar, mdl_score}, or None when iDRegEx
                was requested and returned ∅
            all: [{algorithm, grammar, mdl_score}, ...]
            why: str explaining the choice
    """
    results = []

    if prefer and prefer.lower() == 'idregex':
        idr_g = idregex(sequences, kmax=kmax, N=N)
        idr_score = mdl_score_simple(idr_g, sequences) if idr_g and idr_g != '' else float('inf')
        if idr_g and idr_g != '':
            results.append(('iDRegEx', idr_g, idr_score))
        if not results:
            return {
                'best': None,
                'all': [],
                'why': "iDRegEx returned ∅ (no common core found).",
            }
        why = "Requested iDRegEx only."
        return {
            'best': {
                'algorithm': 'iDRegEx',
                'grammar': results[0][1],
                'mdl_score': round(results[0][2], 2),
            },
            'all': [{'algorithm': 'iDRegEx', 'grammar': results[0][1], 'mdl_score': round(results[0][2], 2)}],
            'why': why,
        }

    crx_g = CRX().infer(sequences)
    crx_score = mdl_score_simple(crx_g, sequences)
    results.append(('CRX', crx_g, crx_score))

    if prefer and prefer.lower() == 'crx':
        return {
            'best': {
                'algorithm': 'CRX',
                'grammar': crx_g,
                'mdl_score': round(crx_score, 2),
            },
            'all': [{'algorithm': 'CRX', 'grammar': crx_g, 'mdl_score': round(crx_score, 2)}],
            'why': "Requested CRX only.",
        }

    idr_g = idregex(sequences, kmax=kmax, N=N)
    if idr_g and idr_g != '':
        idr_score = mdl_score_simple(idr_g, sequences)
        results.append(('iDRegEx', idr_g, idr_score))

    # Lowest MDL score first
    results.sort(key=lambda x: x[2])
    best = results[0]
    all_results = [
        {'algorithm': a, 'grammar': g, 'mdl_score': round(s, 2)}
        for a, g, s in results
    ]

    crx_match = sum(1 for s in sequences if _matches(crx_g, s))
    idr_match = sum(1 for s in sequences if _matches(idr_g, s)) if len(results) > 1 else 0

    why_parts = []
    if len(results) == 1:
        why_parts.append("Only CRX produced a result (iDRegEx returned ∅).")
    else:
        why_parts.append(
            f"{results[0][0]} (score {results[0][2]:.1f}) vs {results[1][0]} (score {results[1][2]:.1f})."
        )
        if crx_match == idr_match == len(sequences):
            why_parts.append("Both grammars match all sequences.")
            why_parts.append(
                f"{results[0][0]} wins because it is more compact "
                f"(lower model cost) while matching all data."
            )
        elif crx_match != idr_match:
            why_parts.append(
                f"CRX matches {crx_match}/{len(sequences)} sequences, "
                f"iDRegEx matches {idr_match}/{len(sequences)}."
            )
            why_parts.append(
                f"{best[0]} selected (MDL score {best[2]:.1f})."
            )

    return {
        'best': {
            'algorithm': best[0],
            'grammar': best[1],
            'mdl_score': round(best[2], 2),
        },
        'all': all_results,
        'why': ' '.join(why_parts),
    }


@ -13,6 +13,7 @@ from mcp.server.fastmcp import FastMCP
from .crx import CRX
from .idregex import idregex
from .ensemble import infer_ensemble, _matches
from .yaml_to_seq import yaml_file_to_sequence, sequences_to_crx
mcp = FastMCP("grammar-inference", log_level="ERROR")
@ -45,6 +46,52 @@ def infer_grammar(
raise ValueError(f"Unknown method: {method}. Use 'crx' or 'idregex'.")
@mcp.tool()
def infer_best_grammar(
    sequences: list[list[str]],
    prefer: str = "",
    kmax: int = 2,
    N: int = 3,
) -> str:
    """Infer a compact grammar from example sequences. Use this when you
    need to generate structured content (Ansible roles, CI configs, Helm
    values, YAML configs, etc.) and have existing examples to learn from.

    The grammar compresses N examples into ~100 chars, far fewer tokens
    than passing all examples. Pass the existing sequences, get back a
    pattern you can follow to generate new instances.

    Args:
        sequences: List of sequences, each a list of strings (symbols in
            the order they appear). Example: [["file","copy","command"],
            ["file","template","command"]].
        prefer: Optional 'crx' for full coverage (accepts all examples),
            'idregex' for minimal core (only what every example shares).
            Default: runs both and picks best by MDL score.
        kmax: Maximum k for iDRegEx k-ORE inference.
        N: Number of EM iterations for iDRegEx.

    Returns:
        A formatted string with the best grammar, scores, and explanation.
        Grammar notation: a.b = a then b, (a+b) = a or b, r? = optional,
        r+ = one or more, r+? = zero or more.
    """
    pref = prefer if prefer else None
    result = infer_ensemble(sequences, kmax=kmax, N=N, prefer=pref)
    if result['best'] is None:
        return f"No grammar found. {result['why']}"

    lines = [f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})",
             f"Grammar: {result['best']['grammar']}",
             ""]
    # List every candidate with its match count when more than one ran
    if len(result['all']) > 1:
        for r in result['all']:
            m = sum(1 for s in sequences if _matches(r['grammar'], s))
            lines.append(f" {r['algorithm']:10s} MDL={r['mdl_score']:>8.2f} match={m}/{len(sequences)}")
        lines.append("")
    lines.append(f"Why: {result['why']}")
    return "\n".join(lines)
@mcp.tool()
def infer_yaml_grammar(
yaml_dir: str,


@ -1,16 +1,20 @@
"""MDL scoring for iDRegEx (Algorithm 4, arXiv 1004.2372)."""
import math
import functools
from .expr import alphabet
def model_cost(expr):
"""|r| — number of alphabet symbol occurrences in expression."""
import re
cleaned = re.sub(r'[+?*()|.]', '', expr)
cleaned = re.sub(r'_\d+', '', cleaned)
cleaned = re.sub(r'[ε∅]', '', cleaned)
return len(cleaned)
syms = alphabet(expr)
# Count each symbol by how many times it appears as a standalone word
count = 0
for s in syms:
# Count occurrences where symbol is bordered by operators or edges
count += len(re.findall(rf'(?<![a-zA-Z_]){re.escape(s)}(?![a-zA-Z_])', expr))
return count
def lang_size(expr, n=None):
@ -31,6 +35,7 @@ def lang_size(expr, n=None):
return total
@functools.lru_cache(maxsize=None)
def _count_words_fast(expr, length):
if length < 0:
return 0
@ -43,38 +48,74 @@ def _count_words_fast(expr, length):
if expr in alpha:
return 1 if length == 1 else 0
if '+' in expr:
inner = expr.rstrip('+')
if inner.endswith('?'):
inner = inner[:-1]
return _count_star(inner, length, min_count=1)
# 0. Concatenation: a.b.c — check FIRST so trailing quantifiers
# apply to each part individually, not the whole expression.
if '.' in expr:
parts = _split_disj_crx(expr, '.')
if len(parts) > 1:
return _count_concat(tuple(parts), length, 0)
if expr.endswith('?'):
# 1. Trailing quantifiers
if expr.endswith('+?'):
return _count_star(expr[:-2], length, min_count=0)
if expr.endswith('*'):
return _count_star(expr[:-1], length, min_count=0)
if expr.endswith('?') and not expr.endswith('+?'):
inner = expr[:-1]
return _count_words_fast(inner, length) + (1 if length == 0 else 0)
if expr.endswith('+') and not expr.endswith('+?'):
inner = expr[:-1]
return _count_star(inner, length, min_count=1)
if expr.startswith('(') and '|' in expr:
parts = _split_disj(expr[1:-1])
return sum(_count_words_fast(p.strip(), length) for p in parts)
if '.' in expr:
parts = expr.split('.')
return _count_concat(parts, length, 0)
# 2. Disjunction group: (a+b+c) for CRX or (a|b|c) for iDRegEx
if expr.startswith('(') and expr.endswith(')'):
inner = expr[1:-1]
parts = _split_disj_crx(inner, '+')
if len(parts) > 1:
return sum(_count_words_fast(p.strip(), length) for p in parts)
parts = _split_disj_crx(inner, '|')
if len(parts) > 1:
return sum(_count_words_fast(p.strip(), length) for p in parts)
return _count_words_fast(inner, length)
return 0
def _count_concat(parts, length, idx):
def _split_disj_crx(s, sep):
"""Split on `sep` at top depth (not inside nested parens)."""
depth = 0
parts = []
cur = []
for ch in s:
if ch == '(':
depth += 1
cur.append(ch)
elif ch == ')':
depth -= 1
cur.append(ch)
elif ch == sep and depth == 0:
parts.append(''.join(cur))
cur = []
else:
cur.append(ch)
parts.append(''.join(cur))
return parts
@functools.lru_cache(maxsize=None)
def _count_concat(parts_tuple, length, idx):
parts = list(parts_tuple)
if idx >= len(parts):
return 1 if length == 0 else 0
total = 0
for take in range(length + 1):
cnt = _count_words_fast(parts[idx], take)
if cnt:
total += cnt * _count_concat(parts, length - take, idx + 1)
total += cnt * _count_concat(parts_tuple, length - take, idx + 1)
return total
@functools.lru_cache(maxsize=None)
def _count_star(inner, length, min_count):
total = 0
for rep in range(min_count, length + 1):
@ -82,6 +123,7 @@ def _count_star(inner, length, min_count):
return total
@functools.lru_cache(maxsize=None)
def _count_repeat(inner, rep, length):
if rep == 0:
return 1 if length == 0 else 0
@ -114,19 +156,32 @@ def _split_disj(s):
def data_cost(expr, sequences):
"""MDL data cost: Σ_i log₂(|L=i(r)| / |S=i|) adjusted.
"""MDL data cost: Σ_i log₂(|L_i(r)|) where |L_i(r)| is the number
of words of length len(seq_i) accepted by the grammar.
Simplified form: for each word in S, cost = log₂(lang_size of all words
of that length).
Lower cost = more specific grammar that still covers the data.
Exact computation is capped at max_len=50 to prevent combinatorial
explosion. Longer sequences use an alphabet-size upper bound.
"""
MAX_EXACT = 50
n = 2 * model_cost(expr) + 1
runtime_n = min(max(n, max((len(s) for s in sequences), default=0)), MAX_EXACT)
lang_sizes = [_count_words_fast(expr, l) for l in range(runtime_n + 1)]
alpha_size = len(alphabet(expr))
total_cost = 0.0
for seq in sequences:
length = len(seq)
if length <= n:
lang_at_len = _count_words_fast(expr, length)
if lang_at_len > 0:
total_cost += math.log2(lang_at_len) if lang_at_len > 0 else 0
if length <= runtime_n:
ls = lang_sizes[length]
if ls > 0:
total_cost += math.log2(ls)
else:
total_cost += length * math.log2(max(alpha_size, 1))
else:
total_cost += length * math.log2(max(alpha_size, 1))
return total_cost

341
blog_post.md Normal file

@ -0,0 +1,341 @@
# Discovering Unwritten Conventions with Grammar Inference
**How we turned 36 Ansible roles into a 200-character grammar — and why
it matters for LLM agents.**
## The problem
Every codebase has unwritten conventions. Your team's Docker Compose
files always put `image` before `ports` before `volumes`. Your Ansible
deploy roles always start with `assert`, then `file`, then `template`.
Your CI pipelines always run `lint` before `test` before `deploy`.
Nobody writes these down. They're emergent — copied from role to role,
file to file, until they become a tacit standard.
When an LLM agent needs to generate new content that follows these
conventions, you have two options:
1. **Stuff every existing file into context** — 36 deploy roles = 15,000
tokens. You'll hit the context window on your third example.
2. **Give it one or two examples and hope** — the LLM will guess the
pattern, and it will often guess wrong.
Neither is good. The first is wasteful. The second is unreliable.
What you really want is the **compiled convention** — the minimal
description of what all 36 roles share, expressed in ~200 tokens. An
LLM can follow a rule in 200 tokens far more reliably than it can
infer a pattern from 36 examples.
This is grammar inference.
## The approach
Given a set of example sequences over some alphabet (e.g., Ansible
module names, Docker Compose keys, CI job names), learn a regular
expression that describes the general pattern.
We implemented two algorithms by Bex et al., one from each of two
2010 papers (TODS and arXiv):
- **CRX** (TODS 2010 §6): A single-pass algorithm that builds a
predecessor relation over symbols, computes equivalence classes,
and emits a Chain Regular Expression (CHARE) that matches ALL
input sequences. Fast, deterministic, captures the full vocabulary.
- **iDRegEx** (arXiv 2010): A probabilistic algorithm that trains
k-occurrence automata (k-OA) with Baum-Welch EM and converts them
into k-occurrence regular expressions (k-OREs). In our experiments
it finds only the *minimal common core*, the symbols that appear in
every example. Robust against noise, but fails (returns ∅) when the
examples are too diverse.
Both run in the **ensemble**: CRX produces a permissive grammar (full
vocabulary, many optional parts), iDRegEx produces a strict grammar
(minimal core). A Minimum Description Length (MDL) score picks the
winner: the grammar that compresses the data best.
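
The selection step can be sketched in a few lines. This is an illustration under assumptions, not the package's `infer_ensemble`: the function `pick_by_mdl` and the candidate dictionaries (keys `algorithm`, `grammar`, `model_cost`, `data_cost`) are hypothetical, and the grammars and costs are toy values chosen only to show the mechanics.

```python
def pick_by_mdl(candidates):
    """Return the candidate with the lowest model_cost + data_cost.

    Candidates whose grammar is None (the algorithm returned the empty
    result) are skipped. Returns None if no candidate is left.
    """
    viable = [c for c in candidates if c["grammar"] is not None]
    if not viable:
        return None
    return min(viable, key=lambda c: c["model_cost"] + c["data_cost"])


# Toy candidates (hypothetical numbers, not measured results).
candidates = [
    {"algorithm": "crx", "grammar": "a?.(b+c)+?.d", "model_cost": 4, "data_cost": 12.0},
    {"algorithm": "idregex", "grammar": "a.b.d", "model_cost": 3, "data_cost": 0.0},
]
best = pick_by_mdl(candidates)
print(best["algorithm"], best["model_cost"] + best["data_cost"])
```

Skipping `None` grammars up front means an algorithm that fails on a dataset simply drops out of the comparison instead of needing a sentinel score.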
## The algorithms, briefly
### CRX — Chain Regular Expression inference
CRX (Algorithm 7, TODS 2010) works in four steps:
1. **Build the immediate-predecessor relation.** For every adjacent
pair (x, y) in any sequence, record that x precedes y. If `assert`
is immediately followed by `file` in at least one sequence, record
`assert → file`.
2. **Compute equivalence classes.** Take the reflexive-transitive
closure of the predecessor relation. The strongly connected
components are *equivalence classes*: groups of symbols that can
each be reached from the other, so the data fixes no order between
them. If `copy` precedes `template` in one role and `template`
precedes `copy` in another, they're in the same class.
3. **Merge singleton classes.** A class with one symbol that shares
the same predecessor/successor sets as another singleton class
gets merged. This handles symbols that always appear in the
same structural position.
4. **Topological sort.** The equivalence classes are sorted by their
position in the Hasse diagram of the predecessor relation. Each
class becomes a factor in the output, annotated with a quantifier
based on how often its symbols occur in each sequence:
- no quantifier if exactly once in every sequence
- `?` (optional) if at most once, and absent from some sequence
- `+` (one or more) if at least once in every sequence, and more than once in some
- `+?` (zero or more) otherwise
The result is a CHARE: a sequence of factors where each factor is a
disjunction of equivalent symbols with a quantifier.
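
A minimal sketch of steps 1, 2, and 4 is shown here, assuming the quantifier notation used in this post (`+?` for zero or more, `+` inside parentheses for disjunction). It is not the package's `CRX` class: all function names are hypothetical, step 3 (merging singleton classes) is left out, and reachability is computed naively rather than with an efficient SCC algorithm.

```python
def successor_graph(sequences):
    """Step 1: map each symbol to the symbols that immediately follow it."""
    graph = {}
    for seq in sequences:
        for sym in seq:
            graph.setdefault(sym, set())
        for x, y in zip(seq, seq[1:]):
            graph[x].add(y)
    return graph


def reachable(graph, start):
    """All symbols reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def ordered_classes(graph):
    """Steps 2 and 4: group mutually reachable symbols, then order the groups."""
    reach = {sym: reachable(graph, sym) for sym in graph}
    classes = {frozenset(t for t in reach[s] if s in reach[t]) for s in graph}
    # A class reaches strictly more symbols than any class that comes after it,
    # so sorting by reach size (descending) yields a valid topological order.
    return sorted(classes, key=lambda c: (-len(reach[min(c)]), min(c)))


def quantifier(cls, sequences):
    """Pick a quantifier from how often the class occurs in each sequence."""
    counts = [sum(sym in cls for sym in seq) for seq in sequences]
    if min(counts) >= 1:
        return "" if max(counts) == 1 else "+"
    return "?" if max(counts) <= 1 else "+?"


def infer_chare(sequences):
    factors = []
    for cls in ordered_classes(successor_graph(sequences)):
        body = min(cls) if len(cls) == 1 else "(" + "+".join(sorted(cls)) + ")"
        factors.append(body + quantifier(cls, sequences))
    return ".".join(factors)


roles = [
    ["assert", "file", "copy", "template", "command"],
    ["assert", "file", "template", "copy", "command"],
]
print(infer_chare(roles))  # assert.file.(copy+template)+.command
```

Because `copy` and `template` appear in both orders, they collapse into one class, and since that class occurs twice in every sequence it gets the `+` quantifier.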
### iDRegEx — k-occurrence regular expression inference
iDRegEx (Algorithm 4, arXiv 2010) uses a probabilistic automaton:
1. **Build a complete k-OA.** A k-occurrence automaton (k-OA) is an
automaton in which each alphabet symbol labels at most k states. The
complete k-OA has k states per symbol and every transition between them.
2. **Train with Baum-Welch.** EM iterations assign probabilities to
transitions, learning which paths through the automaton are most
likely given the data.
3. **Disambiguate.** Remove nondeterministic transitions — for any
state and symbol, keep only the most probable next state.
4. **Prune.** Remove low-probability edges and unreachable states,
leaving only the most likely paths.
5. **Extract with rwr².** The REWRITE-SQUARED algorithm (rwr²,
Algorithm 3) collapses the pruned automaton into a k-occurrence
regular expression (k-ORE). In practice this is the minimal common core.
### MDL scoring — picking the right level of specificity
The Minimum Description Length principle (Rissanen 1978) says: the
best grammar is the one that minimizes the sum of its own size and
the cost of encoding the data using it.
```
MDL = model_cost + data_cost
```
**model_cost** = the number of alphabet symbol occurrences in the
grammar. A grammar with 5 unique symbols used once each has
model_cost = 5.
**data_cost** = Σ log₂(|Lₙ(r)|), summed over all input sequences s
with n = len(s), where Lₙ(r) is the set of strings of length n that
the grammar r accepts.
A grammar like `(s1+s2+...+s19)+` over 19 symbols accepts any of the
19 symbols at each position, so for a sequence of length 120, the data
cost is 120 × log₂(19) ≈ 510 bits. A grammar like `a.b.c.d.e` accepts
only 1 string of length 5, so its data cost is log₂(1) = 0.
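
The arithmetic can be checked directly. The helper `uniform_data_cost` is a hypothetical name used only for this illustration; it covers the special case where every position can hold any symbol of the alphabet.

```python
import math


def uniform_data_cost(length, alphabet_size):
    """Bits needed to single out one string of `length` symbols when each
    position can hold any of `alphabet_size` symbols."""
    return length * math.log2(alphabet_size)


permissive = uniform_data_cost(120, 19)  # 19 choices at each of 120 positions
strict = math.log2(1)                    # exactly one accepted string of length 5
print(f"permissive: {permissive:.2f} bits, strict: {strict:.2f} bits")
```

The permissive grammar accepts 19¹²⁰ strings of length 120, which is where the factor of 120 comes from.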
The ensemble picks the grammar with the lowest total MDL. This
automatically balances specificity against coverage: a grammar that
matches only 1 sequence but does so perfectly (low data cost) can
beat a grammar that matches all sequences but is extremely permissive
(high data cost).
## The bugs we found (and fixed)
Implementing the BEX algorithms faithfully required solving several
subtle problems.
### Bug 1: model_cost counted characters, not symbols
The paper defines model_cost as "the length of r" — the number of
symbols in the expression. For the toy alphabet {a, b, c, d, e} used
in the paper, characters and symbols are the same. For real-world
symbols like `community.docker.docker_image`, they aren't.
Our `model_cost` function was counting characters (226 for a typical
grammar), when it should count symbol occurrences (19). This
massively inflated the MDL score, making CRX appear worse than it
actually was.
**Fix:** Count occurrences of alphabet symbols in the expression using
regex word-boundary matching, not string length.
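
One way to count symbol occurrences is sketched below. It is an assumption about how such a counter could look, not the package's `model_cost`: the function name `count_symbol_occurrences` is hypothetical, and the lookarounds stand in for word boundaries so that dotted module names are matched as a whole.

```python
import re


def count_symbol_occurrences(expr, symbols):
    """Count how many times any alphabet symbol occurs in the expression."""
    if not symbols:
        return 0
    # Try longer symbols first so "docker_image" is preferred over "docker".
    ordered = sorted(symbols, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # (?<!\w) and (?!\w) reject matches that start or end inside a longer word.
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")
    return len(pattern.findall(expr))


expr = "docker_volume+?.group?.(command+copy+file)+?.(cron+firewalld)?"
symbols = {"docker_volume", "group", "command", "copy", "file", "cron", "firewalld"}
print(len(expr), count_symbol_occurrences(expr, symbols))
```

The character count grows with the length of the module names, while the symbol count depends only on the structure of the grammar.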
### Bug 2: Dispatch order in _count_words_fast
The recursive function `_count_words_fast` estimates |L(r)| at a given
length: the number of strings of that length the grammar accepts. It
should dispatch on expression structure in this order: concatenation
(`.`) first, then trailing quantifiers (`+?`, `*`, `?`, `+`), then
disjunction groups.
Our dispatch checked `endswith('+?')` before checking `'.' in expr`.
For the expression `(All)+.Role?.RoleBinding?.Job+?`, the trailing
`+?` on `Job+?` triggered the quantifier branch first, applying the
`+?` to the **entire** expression instead of just the `Job` factor.
**Fix:** Check concatenation first. Top-level dots can only appear in
concatenation, so they should be handled before any quantifier logic.
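
The corrected dispatch order can be illustrated with a small sketch. The names `split_top_level` and `dispatch` are hypothetical and this is not the body of `_count_words_fast`; it also assumes symbol names without dots at the top level (dotted names are the subject of Bug 4).

```python
def split_top_level(expr, sep="."):
    """Split on separators that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(expr):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(expr[start:i])
            start = i + 1
    parts.append(expr[start:])
    return parts


def dispatch(expr):
    """Classify an expression, checking concatenation before quantifiers."""
    if len(split_top_level(expr)) > 1:
        return "concatenation"
    quantifiers = (("+?", "zero-or-more"), ("*", "zero-or-more"),
                   ("?", "optional"), ("+", "one-or-more"))
    for suffix, kind in quantifiers:
        if expr.endswith(suffix):
            return kind
    if expr.startswith("(") and expr.endswith(")"):
        return "disjunction"
    return "symbol"


expr = "(All)+.Role?.RoleBinding?.Job+?"
print(split_top_level(expr))  # ['(All)+', 'Role?', 'RoleBinding?', 'Job+?']
print(dispatch(expr))         # concatenation
print(dispatch("Job+?"))      # zero-or-more
```

With the concatenation check first, the trailing `+?` is only ever seen once `Job+?` has been isolated as its own factor.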
### Bug 3: Greedy matching without backtracking
The `_match_tokens` function checked whether a sequence matches a
grammar. For quantifiers like `+?` (zero-or-more), it greedily
consumed ALL consecutive matching symbols, then moved on. This failed
for grammars like `a+?.a` on input `['a', 'a']`: the `a+?` ate both
`a`s, and there was nothing left for the final `a` factor.
**Fix:** Replace the single-pass greedy matching with `_match_possible`,
a proper backtracking engine that enumerates ALL valid end positions
for each token and accepts if any of them leads to a match of the
whole sequence. This is essentially a tiny regex engine, but limited
to the CHARE subset, so it avoids the exponential blowup of general
backtracking regex matching.
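
The idea of tracking every valid end position can be sketched as follows. This is not the package's `_match_possible`: the function name `match_possible` and the representation of a grammar as a list of `(symbols, quantifier)` pairs are assumptions made for the illustration.

```python
def match_possible(factors, seq):
    """Return True if seq matches the CHARE given as (symbols, quantifier) pairs.

    Quantifiers follow the notation of this post: "" (exactly one),
    "?" (zero or one), "+" (one or more), "+?" (zero or more).
    """
    positions = {0}  # every index at which the factors so far could have ended
    for symbols, quant in factors:
        next_positions = set()
        for start in positions:
            # Length of the run of matching tokens that begins at `start`.
            end = start
            while end < len(seq) and seq[end] in symbols:
                end += 1
            run = end - start
            lo = 0 if quant in ("?", "+?") else 1
            hi = min(run, 1) if quant in ("", "?") else run
            # Keep every admissible end position, not only the longest one.
            for taken in range(lo, hi + 1):
                next_positions.add(start + taken)
        positions = next_positions
        if not positions:
            return False
    return len(seq) in positions


grammar = [({"a"}, "+?"), ({"a"}, "")]  # a+?.a
print(match_possible(grammar, ["a", "a"]))  # True
print(match_possible(grammar, []))          # False
```

The set of positions is bounded by the sequence length, so the work per factor stays polynomial even though no choice is ever committed to early.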
### Bug 4: Dot-splitting inside disjunctions
Module names like `community.docker.docker_image` contain dots.
When `_parse_parts` processed a disjunction child, it recursively
called itself — which split the expression on `.` before treating it
as a symbol. The symbol `community.docker.docker_image` became
`community` then `docker` then `docker_image` — three concatenated
symbols instead of one.
**Fix:** Disjunction children are always flat symbols (CRX and
iDRegEx don't produce nested disjunctions in practice). Parse them
with `_parse_flat_symbol`, which strips quantifiers but never splits
on `.`.
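
A flat-symbol parser along these lines might look like the sketch below. It is an illustration under assumptions, not the package's `_parse_flat_symbol`: the names `parse_flat_symbol` and `parse_disjunction` are hypothetical, and the disjunction helper assumes children are separated by `+` and carry no `+`-based quantifier of their own.

```python
QUANTIFIERS = ("+?", "*", "?", "+")  # "+?" first so it is not read as a bare "?"


def parse_flat_symbol(token):
    """Strip one trailing quantifier; never split the symbol on dots."""
    for quant in QUANTIFIERS:
        if token.endswith(quant):
            return token[: -len(quant)], quant
    return token, ""


def parse_disjunction(group):
    """Parse "(a+b+c)" into its child symbols, keeping dotted names whole."""
    inner = group[1:-1]
    return [parse_flat_symbol(child)[0] for child in inner.split("+")]


print(parse_flat_symbol("community.docker.docker_image?"))
print(parse_disjunction("(copy+community.docker.docker_image+template)"))
print("community.docker.docker_image".split("."))  # what dot-splitting did instead
```

Keeping the quantifier as a separate return value lets the caller decide what to do with it, while the symbol itself is passed through untouched.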
## The results
### Ansible deploy roles — 36 roles from companyweb
Our own deploy roles cover everything from AdGuard Home to
Woodpecker CI. They have NO schema: each is a free-form script.
```
Grammar: docker_volume+?.group?.docker_container?.user?.apt?.npm?.
(assert+...+command+copy+file+template+set_fact+...+wait_for)+?.
(cron+firewalld)?
Match: 36/36
MDL: 2186.28
```
Reading the grammar: optional Docker and system setup (volume, group,
container, user, apt, npm), then a large disjunction of ~25 task
modules repeated zero or more times, then optional cron/firewalld at
the end. The grammar fixes the order of the setup and closing steps
but leaves the middle block unordered.
**Compression: 36 roles (15,000 tokens) → 200 tokens (75×)**
### Geerlingguy Galaxy roles — 15 popular roles
Jeff Geerling's roles are among the most popular on Ansible Galaxy.
As far as we know, their structural pattern is not documented
anywhere. Yet every one of the 15 follows the same arc:
```
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
include+?.(npm+pip)+?.lineinfile?
Match: 15/15
MDL: 596.64
```
Read left to right: an optional prerequisite check (`fail`), then one
or more tasks from the core set (OS-specific variables, packages,
files, templates, services; the disjunction does not fix their order),
then optional sub-task includes, optional npm/pip installs, and an
optional `lineinfile` tweak.
**As far as we know, this is the first explicit description of the
geerlingguy role convention.** It took 15 roles and a grammar
inference algorithm to write it down.
**Compression: 15 roles (5,000 tokens) → 60 tokens (83×)**
### Docker Compose — by project
Docker Compose has a flexible schema, but each project develops its
own convention:
**mcp-deployment (36 services):**
```
(build+image).command.(environment+volumes)?.ports
```
**files (6 services):**
```
image.environment.volumes.network_mode.privileged?.cap_add?
```
**fresh-ape-base (9 services):**
```
image.ports?.(depends_on+environment+user+volumes)+
```
### Ensemble dynamics
The ensemble (CRX + iDRegEx + MDL) selects different winners
depending on the data:
| Dataset | Winner | Why |
|---------|--------|-----|
| Ansible deploy (36 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Ansible galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Ansible restore (2 roles) | CRX | Both match all; CRX more compact |
| Ansible configure (4 roles) | **iDRegEx** | Finds minimal core `include_role` |
| Ansible manage (2 roles) | **iDRegEx** | Core: `assert.authorized_key` |
iDRegEx wins when the data has a clear common core. CRX wins when
there's no single shared subsequence (the roles share the *vocabulary*
but not the *order*).
## The MCP
The engine is exposed as an MCP server:
```python
from bex.mcp_server import infer_best_grammar

# Full coverage
output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",
)
# Returns:
# Best: CRX (MDL 2186.28)
# Grammar: docker_volume+?.group?...(assert+...+wait_for)+?.(cron+firewalld)?

# Ensemble: let MDL pick
output = infer_best_grammar(sequences=role_sequences)
```
An agent workflow:
1. Agent needs to write deploy role #37
2. Finds 36 existing deploy roles, extracts their task module sequences
3. Calls `infer_best_grammar(sequences=..., prefer='crx')`
4. Gets back the grammar in 200 tokens
5. Generates a new role that follows the structural pattern
Without the MCP: 36 role files in context (15,000 tokens), or guesswork.
With the MCP: one grammar rule (200 tokens), known to match 36/36 roles.
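
Step 2 of the workflow, turning role files into module sequences, could be sketched like this. Everything here is an assumption for illustration: the function `module_sequence` is hypothetical, the tasks are taken to be already parsed into dictionaries (for example by a YAML loader), and the keyword set is deliberately incomplete.

```python
# Keys that configure a task rather than name its module (illustrative subset).
TASK_KEYWORDS = {"name", "when", "become", "tags", "register", "notify", "loop", "vars"}


def module_sequence(tasks):
    """Return the module name of each task, in file order.

    The module is taken to be the first key that is not a known task keyword.
    Tasks made only of keywords are skipped.
    """
    sequence = []
    for task in tasks:
        for key in task:
            if key not in TASK_KEYWORDS:
                sequence.append(key)
                break
    return sequence


tasks = [
    {"name": "Create config dir", "file": {"path": "/etc/app", "state": "directory"}, "become": True},
    {"name": "Render config", "template": {"src": "app.conf.j2", "dest": "/etc/app/app.conf"}},
    {"command": "app --check", "register": "check_result"},
]
print(module_sequence(tasks))  # ['file', 'template', 'command']
```

One such list per role is the `sequences` argument the inference tools expect.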
## What it means
Grammar inference turns **examples** into **rules**. The rule is a
compressed description of the structural convention — and for
schema-less content like Ansible roles, this may be the *first time*
the convention has been written down at all.
For LLM agents, this changes the trade-off between context and
accuracy. Instead of flooding the context window with examples, the
agent can call the MCP, get the rule in roughly 60 to 200 tokens, and
follow it. The rule is more reliable than guessing from examples, and
it costs less than a single example role would have.
The algorithm doesn't need to understand what a deploy role does. It
doesn't know that `file` creates directories and `template` renders
Jinja2. It only needs to see 36 sequences of module names and find
the pattern they all share. The structural convention is in the data
— you just have to extract it.
## References
- Bex, G. J., Neven, F., Schwentick, T., & Vansummeren, S. (2010).
*Inference of Concise Regular Expressions and DTDs.* ACM TODS 35(2).
- Bex, G. J., Gelade, W., Neven, F., & Vansummeren, S. (2010).
*Learning Deterministic Regular Expressions for the Semantic Web.*
arXiv:1004.2372.
- Rissanen, J. (1978). *Modeling by shortest data description.*
Automatica 14(5).