Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post
- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive
parent a1567bffbe
commit 0e2aec582b
7 changed files with 1115 additions and 47 deletions
253
README.md
|
|
@ -10,12 +10,25 @@ python -m bex
|
|||
```
|
||||
|
||||
```python
|
||||
from bex.crx import CRX
|
||||
from bex import infer_ensemble
|
||||
|
||||
seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
|
||||
|
||||
result = infer_ensemble(seqs)
|
||||
print(f"Best: {result['best']['algorithm']}")
|
||||
print(f"Grammar: {result['best']['grammar']}")
|
||||
print(f"Score: {result['best']['mdl_score']}")
|
||||
```
|
||||
|
||||
Or compare algorithms manually:
|
||||
|
||||
```python
|
||||
from bex.crx import CRX
|
||||
|
||||
seqs = [...]
|
||||
crx = CRX()
|
||||
grammar = crx.infer(seqs)
|
||||
print(grammar)
|
||||
|
|
@ -26,10 +39,10 @@ print(grammar)
|
|||
|
||||
| Algorithm | What it learns | Paper | Use case |
|
||||
|-----------|---------------|-------|----------|
|
||||
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference from many sequences |
|
||||
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Handles noise, learns from few examples |
|
||||
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Builds regex from a single automaton |
|
||||
| **rwr²** | k-ORE from k-OA | arXiv 2010 | Post-processing for k-ORE extraction |
|
||||
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures *all* symbols |
|
||||
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
|
||||
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
|
||||
| **rwr²** | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |
|
||||
|
||||
### Pipeline 1: Direct CHARE Inference (fast)
|
||||
|
||||
|
|
@ -37,6 +50,8 @@ print(grammar)
|
|||
Example sequences → CRX → CHAREs grammar
|
||||
```
|
||||
|
||||
CRX learns a grammar that accepts *all* observed symbols, marking optional ones with `?`. Best when the data is clean and you want the full vocabulary.
|
||||
|
||||
### Pipeline 2: Probabilistic k-ORE Inference (robust)
|
||||
|
||||
```
|
||||
|
|
@ -44,6 +59,16 @@ Example sequences → Complete k-OA → Baum-Welch (EM)
|
|||
→ Disambiguate → Prune → rwr² → k-ORE grammar
|
||||
```
|
||||
|
||||
iDRegEx learns the *minimum* common subsequence — symbols that appear in every example. Fails (∅) when the examples are too diverse.
|
||||
|
||||
### Pipeline 3: Ensemble (recommended)
|
||||
|
||||
```
|
||||
Example sequences → [CRX, iDRegEx] → MDL score each → pick best
|
||||
```
|
||||
|
||||
Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like `(a+b+c+...+z)+` that accepts everything gets a high data cost (`log2(|L(r)|)` is large), while a specific grammar like `a.b.c` has near-zero data cost.
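The selection step itself is small. The following is a minimal sketch, not the `bex.ensemble` code: `pick_best` is a hypothetical helper, and the assumption that a failed inference is reported as `∅` and simply skipped is taken from the pipeline description above. The scores are the Helm numbers quoted elsewhere in this README.

```python
def pick_best(candidates):
    """Return the (algorithm, grammar, mdl_score) tuple with the lowest score.

    Hypothetical helper: candidates whose grammar is empty or '∅'
    (inference failed) are ignored; returns None if nothing is usable.
    """
    usable = [c for c in candidates if c[1] and c[1] != "∅"]
    if not usable:
        return None
    return min(usable, key=lambda c: c[2])


candidates = [
    ("CRX", "(Alertmanager+...)+.Role?.RoleBinding?.Job+?", 2651.74),
    ("iDRegEx", "ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment", 1432.99),
]
best = pick_best(candidates)
print(best[0])
```

Lower MDL wins, so the choice depends only on the scores, not on how many sequences each grammar matches.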
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
|
|
@ -61,34 +86,219 @@ bex/
|
|||
├── marking.py # State marking for determinism
|
||||
├── yaml_to_seq.py # Generic YAML → key-path sequence converter
|
||||
├── role_grammar.py # Ansible role → module-sequence extractor
|
||||
├── ensemble.py # Ensemble: runs CRX + iDRegEx, picks best by MDL
|
||||
├── mdl.py # MDL scoring for grammar selection (fix)
|
||||
├── mcp_server.py # MCP server exposing 4 tools
|
||||
└── ...
|
||||
```
|
||||
|
||||
## Domain: Ansible Role Grammar
|
||||
## MCP Server
|
||||
|
||||
The engine includes a domain adapter for Ansible roles. It extracts module names from `tasks/main.yml` files and learns per-category grammars:
|
||||
A **Model Context Protocol** server exposes all algorithms and domain adapters as tools:
|
||||
|
||||
```bash
|
||||
python -c "
|
||||
from bex.role_grammar import collect_all_role_sequences, learn_grammar
|
||||
python -m bex.mcp_server
|
||||
```
|
||||
|
||||
### Tools
|
||||
|
||||
| Tool | What it does |
|
||||
|------|-------------|
|
||||
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
|
||||
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a `Why:` explanation. |
|
||||
| `infer_yaml_grammar(yaml_dir, pattern, method)` | Generic YAML → key-paths → grammar |
|
||||
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
|
||||
|
||||
### Using `infer_best_grammar`
|
||||
|
||||
The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass `prefer`:
|
||||
|
||||
```
|
||||
User: Run CRX on our deploy tasks.
|
||||
Agent: [runs with prefer='crx']
|
||||
Best: CRX (MDL 7.0)
|
||||
Grammar: file.template.docker_image.command.set_fact.shell.wait_for?
|
||||
|
||||
CRX MDL= 7.00 file.template.docker_image.command.set_fact.shell.wait_for?
|
||||
|
||||
Why: Requested CRX only.
|
||||
```
|
||||
|
||||
Without `prefer`, the ensemble compares both:
|
||||
|
||||
```
|
||||
User: Find the grammar for our Helm chart.
|
||||
Agent: [runs]
|
||||
Best: iDRegEx (MDL 1432.99)
|
||||
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
|
||||
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
CRX MDL= 2651.74 (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?
|
||||
|
||||
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
|
||||
iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
|
||||
```
|
||||
|
||||
Both grammars are correct — they operate at different levels of specificity. The `Why:` field helps the agent decide which one to use for the task at hand.
|
||||
|
||||
## Ensemble Selection
|
||||
|
||||
The `infer_best_grammar` tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.
|
||||
|
||||
### How MDL scoring works
|
||||
|
||||
```
|
||||
MDL = model_cost + data_cost
|
||||
```
|
||||
|
||||
- **model_cost**: number of alphabet symbol occurrences in the grammar (a symbol used twice counts twice). Simpler grammars are cheaper.
- **data_cost**: the sum, over all sequences `s`, of log₂ of the number of strings of length `len(s)` that the grammar accepts. A grammar that accepts *many* strings of the same length (like a 17-way disjunction `(a+b+...+q)+`) has high data cost because `|L(r)|` is large. A specific, fixed sequence (`a.b.c.d.e`) has `|L(r)| = 1`, so its data cost is zero.

The ensemble selects the grammar with the lowest total MDL, which trades grammar size against how loosely the grammar fits the data.
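To make the arithmetic concrete, here is a small sketch for the two grammar shapes mentioned above. It is an illustration under stated assumptions, not the `bex.mdl` implementation: `data_cost`, `fixed_sequence`, and `any_of_plus` are hypothetical names, the word counts are written in closed form instead of being derived from a parsed expression, and a grammar that accepts no string of the required length is given infinite cost (how `bex.mdl` treats that case is not shown here).

```python
import math


def data_cost(count_words, sequences):
    """Sum of log2(#accepted strings of length len(s)) over all sequences."""
    total = 0.0
    for seq in sequences:
        n_words = count_words(len(seq))
        if n_words == 0:
            return math.inf  # assumption of this sketch: no fit means infinite cost
        total += math.log2(n_words)
    return total


def fixed_sequence(symbols):
    """Counter for a.b.c...: exactly one string, of length len(symbols)."""
    return lambda n: 1 if n == len(symbols) else 0


def any_of_plus(k):
    """Counter for (a1+...+ak)+: k**n strings of each length n >= 1."""
    return lambda n: k ** n if n >= 1 else 0


seqs = [list("abcde")] * 3                      # three sequences of length 5
specific = 5 + data_cost(fixed_sequence("abcde"), seqs)   # model_cost = 5 symbols
general = 17 + data_cost(any_of_plus(17), seqs)           # model_cost = 17 symbols
print(specific, general)
```

The fixed sequence pays only its model cost because log₂(1) = 0 for every sequence, while the 17-way disjunction pays 5·log₂(17) bits per sequence on top of its larger model cost.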
|
||||
|
||||
### When each algorithm wins
|
||||
|
||||
| Scenario | Winner | Why |
|
||||
|----------|--------|-----|
|
||||
| Many sequences, diverse patterns | **CRX** | CRX captures the full vocabulary. iDRegEx can't find a common core. |
|
||||
| Clean, structured sequences | **CRX** | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
|
||||
| Few sequences (2–3) | **iDRegEx** | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
|
||||
| Sequences share a clear core | **iDRegEx** | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
|
||||
| Single sequence | **iDRegEx** (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |
|
||||
|
||||
### Real-world benchmarks
|
||||
|
||||
Results from three domains using the ensemble (fixed MDL scoring):
|
||||
|
||||
```
|
||||
Dataset Best MDL Matches
|
||||
──────────────────────────────────────────────────────────
|
||||
Helm (prom-stack) iDRegEx 1433.0 1/6
|
||||
Ansible (deploy) CRX 246.1 34/36
|
||||
Ansible (validate) CRX 34.0 5/5
|
||||
Ansible (restore) CRX 24.0 2/2
|
||||
Ansible (manage) iDRegEx 25.0 1/2
|
||||
Ansible (configure) iDRegEx 22.5 1/4
|
||||
Terraform (hashistack) CRX 4.0 9/9
|
||||
```
|
||||
|
||||
Note: MDL scores are not comparable across datasets, only within the same run (CRX vs iDRegEx on the same sequences). The Helm scores are much higher because each sequence is ~120 symbols long, so the data cost term dominates, especially for the overly general CRX grammar (19 kinds × many lengths).
|
||||
|
||||
## Domain Adapters
|
||||
|
||||
### Ansible Roles
|
||||
|
||||
Extracts module names from `tasks/main.yml`, groups by category prefix (e.g., `deploy_foo` → `deploy`), and learns per-category grammars:
|
||||
|
||||
```python
|
||||
from bex.ensemble import infer_ensemble
|
||||
from bex.role_grammar import collect_all_role_sequences
|
||||
|
||||
all_roles, by_category = collect_all_role_sequences('path/to/roles')
|
||||
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    print(f'{cat}: {learn_grammar(seqs)}')
"
    if len(seqs) >= 2:
        result = infer_ensemble(seqs)
        print(f"── {cat} ({len(items)} roles) ──")
        print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
        print(f"  Grammar: {result['best']['grammar']}")
        print(f"  Why: {result['why']}")
|
||||
```
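The category grouping described above (`deploy_foo` → `deploy`) might look roughly like the sketch below. `group_by_category` is a hypothetical helper, not the `bex.role_grammar` code; it assumes the category is everything before the first underscore in the role name and that each group holds `(role_name, sequence)` pairs, which is the shape the loop above unpacks.

```python
from collections import defaultdict


def group_by_category(role_sequences):
    """Group {role_name: module_sequence} by the prefix before the first '_'."""
    by_category = defaultdict(list)
    for role_name, seq in role_sequences.items():
        category = role_name.split("_", 1)[0]  # 'deploy_foo' -> 'deploy'
        by_category[category].append((role_name, seq))
    return dict(by_category)


roles = {
    "deploy_web": ["file", "template", "command"],
    "deploy_db": ["file", "command"],
    "validate_api": ["uri", "fail"],
}
grouped = group_by_category(roles)
print(sorted(grouped))
```

A role name without an underscore becomes its own category under this assumption.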
|
||||
|
||||
### Example Output
|
||||
|
||||
**Example output** (from [companyweb](https://github.com/anomalyco/companyweb), 51 roles):
|
||||
```
|
||||
── restore (2 roles) ──
|
||||
Best: CRX (MDL 24.0)
|
||||
Grammar: file.copy.unarchive+.command
|
||||
Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.
|
||||
|
||||
── validate (5 roles) ──
|
||||
Best: CRX (MDL 34.0)
|
||||
Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
|
||||
Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.
|
||||
|
||||
── configure (4 roles) ──
|
||||
Grammar: (assert+debug+set_fact+uri)+?.include_role?
|
||||
Best: iDRegEx (MDL 22.5)
|
||||
Grammar: include_role
|
||||
Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
|
||||
```
|
||||
|
||||
### Helm Charts
|
||||
|
||||
Renders a Helm chart with different values files and extracts Kubernetes `kind` sequences for grammar inference:
|
||||
|
||||
```python
|
||||
import subprocess, yaml
from pathlib import Path
from bex.ensemble import infer_ensemble

seqs = []
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    # Render the chart once per values file
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    if out.returncode == 0:
        # One sequence per render: the `kind` of each manifest, in order
        kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
                 if d and isinstance(d, dict) and 'kind' in d]
        if kinds:
            seqs.append(kinds)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
print(f"Why: {result['why']}")
|
||||
```
|
||||
|
||||
**Example output** (from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), 6 CI configs):
|
||||
|
||||
```
|
||||
Best: iDRegEx (MDL 1432.99)
|
||||
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
|
||||
iDRegEx MDL= 1432.99 ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
CRX MDL= 2651.74 (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?
|
||||
|
||||
Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
|
||||
iDRegEx selected (MDL score 1433.0).
|
||||
```
|
||||
|
||||
CRX captures *all* symbols that appear. iDRegEx finds only the minimal core that every config shares:
|
||||
```
|
||||
ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
|
||||
```
|
||||
|
||||
Which grammar is more useful depends on the task:
|
||||
- **CRX** tells you everything you *might* need — good for an agent generating a complete chart.
|
||||
- **iDRegEx** tells you what you *always* need — the bootstrap pipeline that can't be skipped.
|
||||
|
||||
Use `prefer='crx'` or `prefer='idregex'` to select an algorithm without the ensemble comparison.
|
||||
|
||||
### Terraform
|
||||
|
||||
Parses `.tf` files to extract `resource` type sequences, per-file or per-directory:
|
||||
|
||||
```python
|
||||
import re
from pathlib import Path
from bex.ensemble import infer_ensemble

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    # One sequence per file: resource types in declaration order
    resources = re.findall(r'resource "(\w+)" "\w+" {', tf.read_text())
    if resources:
        seqs.append(resources)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
|
||||
```
|
||||
|
||||
**Example output** (from [terraform-guides](https://github.com/hashicorp/terraform-guides), hashistack example, 9 files):
|
||||
```
|
||||
Best: CRX (MDL 4.0, 9/9 match)
|
||||
Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
|
||||
```
|
||||
|
||||
**Grammar notation:**
|
||||
|
|
@ -97,15 +307,20 @@ for cat, items in sorted(by_category.items()):
|
|||
- `r?` — zero or one (optional)
|
||||
- `r+` — one or more (iteration)
|
||||
- `r+?` — zero or more (varies across examples)
|
||||
- `(a|b)` — iDRegEx-style disjunction (equivalent to `(a+b)`)
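One way to check a sequence against a grammar in this notation is to translate the grammar into a Python regular expression over space-terminated tokens. This is a sketch under assumptions, not the matcher in `bex.ensemble`: `split_outer`, `grammar_to_regex`, and `accepts` are hypothetical helpers, symbols are assumed to contain no dots, spaces, or parentheses, and disjunction members are assumed to be plain symbols without their own quantifiers.

```python
import re

# Notation suffix -> regex quantifier; '+?' must be tested before '?' and '+'.
QUANTIFIERS = (("+?", "*"), ("?", "?"), ("+", "+"))


def split_outer(text, sep):
    """Split on sep, ignoring separators nested inside parentheses."""
    parts, depth, cur = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur))
    return parts


def grammar_to_regex(grammar):
    """Compile a grammar such as 'a?.(b+c)+.d' into a regex over 'tok tok ' text."""
    pieces = []
    for part in split_outer(grammar, "."):
        quant = ""
        for suffix, regex_quant in QUANTIFIERS:
            if part.endswith(suffix):
                part, quant = part[: -len(suffix)], regex_quant
                break
        if part.startswith("(") and part.endswith(")"):
            inner = part[1:-1]
            sep = "+" if "+" in inner else "|"  # CRX-style or iDRegEx-style
            options = split_outer(inner, sep)
        else:
            options = [part]
        alternation = "|".join(re.escape(o.strip()) for o in options)
        pieces.append(f"(?:(?:{alternation}) ){quant}")
    return re.compile("".join(pieces))


def accepts(grammar, sequence):
    """True if the whole sequence is matched by the grammar."""
    text = "".join(token + " " for token in sequence)
    return grammar_to_regex(grammar).fullmatch(text) is not None


print(accepts("file.copy.unarchive+.command",
              ["file", "copy", "unarchive", "unarchive", "command"]))
```

Delegating to `re` gives backtracking for free, at the cost of the restrictions on symbol names listed above.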
|
||||
|
||||
## Domain: Generic YAML
|
||||
|
||||
The engine can convert any YAML file into key-path sequences for grammar inference:
|
||||
Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:
|
||||
|
||||
```python
|
||||
from bex.yaml_to_seq import yaml_file_to_sequence, sequences_to_crx
|
||||
from bex.yaml_to_seq import collect_all_sequences
|
||||
from bex import infer_ensemble
|
||||
|
||||
grammar = sequences_to_crx(yaml_file_to_sequence('config.yml'))
|
||||
results = collect_all_sequences('config_dir/')
|
||||
seqs = [seq for _, seq in results]
|
||||
result = infer_ensemble(seqs)
|
||||
print(result['best']['grammar'])
|
||||
```
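The DFS traversal mentioned above can be sketched on an already-parsed document. `key_paths` is a hypothetical helper, not the `bex.yaml_to_seq` implementation; the `/` separator is an assumption, chosen here so that path symbols do not clash with the `.` used for concatenation in the grammar notation.

```python
def key_paths(node, prefix=""):
    """Return the key paths of a nested dict/list structure in DFS order."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}/{key}" if prefix else str(key)
            paths.append(path)                     # visit the key itself
            paths.extend(key_paths(value, path))   # then descend into its value
    elif isinstance(node, list):
        for item in node:
            # List items share the parent's path; indices are not recorded
            paths.extend(key_paths(item, prefix))
    return paths


doc = {"a": {"b": 1, "c": [{"d": 2}]}, "e": 3}
print(key_paths(doc))
```

Each parsed file then yields one sequence of path symbols, and a directory of files yields the list of sequences passed to the inference step.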
|
||||
|
||||
## Papers
|
||||
|
|
@ -123,10 +338,6 @@ python -m pytest tests/
|
|||
python tests/test_bex.py
|
||||
```
|
||||
|
||||
## MCP Server
|
||||
|
||||
A Model Context Protocol server for grammar inference is planned. See `AGENTS.md` for the roadmap.
|
||||
|
||||
## License
|
||||
|
||||
MIT
|
||||
|
|
|
|||
64
SHOWCASE.md
Normal file
|
|
@ -0,0 +1,64 @@
|
|||
# Grammar Inference Engine — Showcase
|
||||
|
||||
Infer the unwritten convention from existing examples. Given N example
|
||||
sequences, produce a ~100-char grammar that captures the structural
|
||||
pattern — in far fewer tokens than the originals.
|
||||
|
||||
## How it works
|
||||
|
||||
Your agent calls the MCP tool `infer_best_grammar` with a list of
|
||||
existing sequences. It returns a compressed grammar:
|
||||
|
||||
```
|
||||
a.b → a then b (concatenation)
|
||||
(a+b) → a or b (disjunction)
|
||||
r? → optional (zero or one)
|
||||
r+ → one or more (iteration)
|
||||
r+? → zero or more
|
||||
```
|
||||
|
||||
Use `prefer='crx'` for full coverage (accepts all examples), or let the
|
||||
ensemble pick between CRX and iDRegEx by MDL score.
|
||||
|
||||
## Ansible Galaxy roles — 15 geerlingguy roles
|
||||
|
||||
Jeff Geerling maintains 100+ of the most popular Ansible roles on
Galaxy. Their shared task structure does not appear to be written down
anywhere; the inferred grammar makes it explicit:
|
||||
|
||||
```
|
||||
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
|
||||
include+?.(npm+pip)+?.lineinfile?
|
||||
|
||||
CRX MDL= 596.64 match=15/15
|
||||
```
|
||||
|
||||
Every role follows the same arc: check prerequisites, OS-specific vars,
|
||||
install packages, configure with templates, start services, optionally
|
||||
run sub-tasks. It works because 15 roles all converged on the same
|
||||
unwritten convention.
|
||||
|
||||
**Compression: 15 roles (~5,000 tokens) → 60 tokens.**
|
||||
|
||||
## Notation reference
|
||||
|
||||
| Symbol | Meaning |
|
||||
|--------|---------|
|
||||
| `a.b` | a then b |
|
||||
| `(a+b)` | a or b (CRX disjunction) |
|
||||
| `(a\|b)` | a or b (iDRegEx disjunction) |
|
||||
| `r?` | zero or one |
|
||||
| `r+` | one or more |
|
||||
| `r+?` | zero or more |
|
||||
| `MDL` | Minimum Description Length — lower is better |
|
||||
|
||||
## Usage
|
||||
|
||||
```python
|
||||
from bex.mcp_server import infer_best_grammar
|
||||
|
||||
output = infer_best_grammar(
    sequences=role_sequences,
    prefer="crx",
)
|
||||
```
|
||||
|
|
@ -21,6 +21,7 @@ from .koa import KOA, build_complete_koa
|
|||
from .expr import concat, disj, star, optional, alphabet, strip_k
|
||||
from .marking import mark_koa
|
||||
from .tokenizer import YAMLTokenizer
|
||||
from .ensemble import infer_ensemble
|
||||
from .template import generate_template
|
||||
|
||||
__version__ = "0.2.0"
|
||||
|
|
|
|||
349
bex/ensemble.py
Normal file
|
|
@ -0,0 +1,349 @@
|
|||
"""Ensemble grammar inference — run multiple algorithms, pick best by MDL scoring."""
|
||||
|
||||
import re
|
||||
from .crx import CRX
|
||||
from .idregex import idregex
|
||||
from .expr import alphabet
|
||||
from .mdl import model_cost, mdl_score
|
||||
|
||||
|
||||
def _parse_parts(expr):
    """Parse expression into a list of tokens for matching.

    Each token: (type, value, quantifier)
    type: 'symbol' | 'disj' | 'concat' | 'empty'
    quantifier: '' | '?' | '+' | '+?' | '*'
    """
    if not expr or expr == '∅':
        return [('empty', '', '')]
    if expr == 'ε':
        return [('empty', '', '+?')]

    # 1. Check if it's a concatenation (split outermost by '.')
    #    Must check BEFORE stripping trailing quantifier, because
    #    quantifiers belong to individual parts (e.g., a?.b+)
    concat_parts = _split_outer(expr.strip(), '.')
    if len(concat_parts) > 1:
        children = []
        for p in concat_parts:
            children.extend(_parse_parts(p.strip()))
        return [('concat', children, '')]

    # 2. Now handle quantifier suffix on this single part
    quantifier = ''
    if expr.endswith('+?'):
        quantifier = '+?'
        expr = expr[:-2]
    elif expr.endswith('*'):
        quantifier = '*'
        expr = expr[:-1]
    elif expr.endswith('?'):
        quantifier = '?'
        expr = expr[:-1]
    elif expr.endswith('+'):
        quantifier = '+'
        expr = expr[:-1]

    # 3. Disjunction group: (a+b+c) for CRX or (a|b|c) for iDRegEx
    if expr.startswith('(') and expr.endswith(')'):
        inner = expr[1:-1]
        # Try CRX-style (+) first, then iDRegEx-style (|)
        disj_parts = _split_outer(inner, '+')
        if len(disj_parts) <= 1:
            disj_parts = _split_outer(inner, '|')
        if len(disj_parts) > 1:
            children = []
            for p in disj_parts:
                p = p.strip()
                # Parse as a flat symbol (don't split dots: they're part of
                # the symbol name, e.g. "community.docker.docker_image")
                children.append(_parse_flat_symbol(p))
            return [('disj', children, quantifier)]
        # Single element inside parens: treat as flat symbol
        return [_parse_flat_symbol(inner)]

    # 4. Single symbol
    if expr and expr not in ('∅', 'ε'):
        return [('symbol', expr, quantifier)]

    return []
|
||||
|
||||
|
||||
def _parse_flat_symbol(s):
    """Parse a single symbol with optional quantifier, no dot splitting.

    Unlike _parse_parts, this treats dots as part of the symbol name
    (e.g. 'community.docker.docker_image' stays as one symbol).
    """
    s = s.strip()
    quantifier = ''
    if s.endswith('+?'):
        quantifier = '+?'
        s = s[:-2]
    elif s.endswith('*'):
        quantifier = '*'
        s = s[:-1]
    elif s.endswith('?'):
        quantifier = '?'
        s = s[:-1]
    elif s.endswith('+'):
        quantifier = '+'
        s = s[:-1]
    if s and s not in ('∅', 'ε'):
        return ('symbol', s, quantifier)
    return ('empty', '', quantifier)
|
||||
|
||||
|
||||
def _split_outer(s, sep):
    """Split on `sep` at the top level (not inside parentheses)."""
    depth = 0
    parts = []
    cur = []
    for ch in s:
        if ch == '(':
            depth += 1
            cur.append(ch)
        elif ch == ')':
            depth -= 1
            cur.append(ch)
        elif ch == sep and depth == 0:
            parts.append(''.join(cur))
            cur = []
        else:
            cur.append(ch)
    parts.append(''.join(cur))
    return parts
|
||||
|
||||
|
||||
def _match_possible(token, seq, pos):
    """Return all possible end positions after matching this token starting at pos."""
    ttype, tval, tquant = token
    positions = []

    if ttype == 'empty':
        positions.append(pos)

    elif ttype == 'symbol':
        if tquant in ('', '?'):
            if pos < len(seq) and seq[pos] == tval:
                positions.append(pos + 1)
            if tquant == '?':
                positions.append(pos)
        elif tquant in ('+?', '*'):
            positions.append(pos)
            cnt = pos
            while cnt < len(seq) and seq[cnt] == tval:
                cnt += 1
                positions.append(cnt)
        elif tquant == '+':
            if pos < len(seq) and seq[pos] == tval:
                cnt = pos + 1
                positions.append(cnt)
                while cnt < len(seq) and seq[cnt] == tval:
                    cnt += 1
                    positions.append(cnt)

    elif ttype == 'disj':
        if tquant in ('', '?'):
            for child in tval:
                for ep in _match_possible(child, seq, pos):
                    positions.append(ep)
            if tquant == '?':
                positions.append(pos)
        elif tquant in ('+?', '*'):
            positions.append(pos)
            for child in tval:
                for ep in _match_possible(child, seq, pos):
                    if ep > pos:
                        positions.append(ep)
                        # After consuming one, recurse to try more
                        for ep2 in _match_possible(token, seq, ep):
                            if ep2 > ep:
                                positions.append(ep2)
        elif tquant == '+':
            for child in tval:
                for ep in _match_possible(child, seq, pos):
                    if ep > pos:
                        positions.append(ep)
                        for ep2 in _match_possible(token, seq, ep):
                            if ep2 > ep:
                                positions.append(ep2)

    elif ttype == 'concat':
        # Match all children sequentially
        def _match_seq(children, start):
            cur = [start]
            for child in children:
                next_cur = []
                for p in cur:
                    next_cur.extend(_match_possible(child, seq, p))
                cur = next_cur
                if not cur:
                    break
            return cur

        if tquant in ('', '?'):
            positions.extend(_match_seq(tval, pos))
            if tquant == '?':
                positions.append(pos)
        elif tquant in ('+?', '*'):
            positions.append(pos)
            inner_end = _match_seq(tval, pos)
            for ep in inner_end:
                if ep > pos:
                    positions.append(ep)
                    for ep2 in _match_possible(token, seq, ep):
                        if ep2 > ep:
                            positions.append(ep2)
        elif tquant == '+':
            inner_end = _match_seq(tval, pos)
            for ep in inner_end:
                if ep > pos:
                    positions.append(ep)
                    for ep2 in _match_possible(token, seq, ep):
                        if ep2 > ep:
                            positions.append(ep2)

    return positions
|
||||
|
||||
|
||||
def _match_tokens(tokens, seq, pos=0):
    """Try to match tokens against seq starting at pos. Returns max position or None."""
    cur = [pos]
    for token in tokens:
        next_cur = []
        for p in cur:
            next_cur.extend(_match_possible(token, seq, p))
        cur = next_cur
        if not cur:
            return None
    return max(cur) if cur else pos
|
||||
|
||||
|
||||
def _matches(grammar, sequence):
    """Check if a sequence matches the grammar."""
    try:
        tokens = _parse_parts(grammar.strip())
        if not tokens:
            return False
        end = _match_tokens(tokens, sequence)
        if end is None:
            return False
        return end == len(sequence)
    except Exception:
        return False
|
||||
|
||||
|
||||
def mdl_score_simple(grammar, sequences):
    """MDL score from the paper: model_cost + Σ log₂(|L(r)| at length len(s)).

    Lower is better. Uses the paper's definition from Bex et al.
    model_cost = number of alphabet symbol occurrences in the expression.
    data_cost = Σ log₂(|L(r)|), which penalizes overly general grammars.
    """
    return mdl_score(grammar, sequences)
|
||||
|
||||
|
||||
def infer_ensemble(sequences, kmax=2, N=3, prefer=None):
    """Run all applicable algorithms and return the best by MDL score.

    Args:
        sequences: List of sequences, each a list of strings.
        kmax: Maximum k for iDRegEx k-ORE inference.
        N: Number of EM iterations for iDRegEx.
        prefer: Optional. 'crx' or 'idregex' to skip ensemble and
            return only that algorithm's result.

    Returns:
        dict with keys:
            best: {algorithm, grammar, mdl_score}
            all: [{algorithm, grammar, mdl_score}, ...]
            why: str explaining the choice
    """
    results = []

    if prefer and prefer.lower() == 'idregex':
        idr_g = idregex(sequences, kmax=kmax, N=N)
        idr_score = mdl_score_simple(idr_g, sequences) if idr_g and idr_g != '∅' else float('inf')
        if idr_g and idr_g != '∅':
            results.append(('iDRegEx', idr_g, idr_score))
        if not results:
            return {
                'best': None,
                'all': [],
                'why': "iDRegEx returned ∅ (no common core found).",
            }
        why = "Requested iDRegEx only."
        return {
            'best': {
                'algorithm': 'iDRegEx',
                'grammar': results[0][1],
                'mdl_score': round(results[0][2], 2),
            },
            'all': [{'algorithm': 'iDRegEx', 'grammar': results[0][1], 'mdl_score': round(results[0][2], 2)}],
            'why': why,
        }

    crx_g = CRX().infer(sequences)
    crx_score = mdl_score_simple(crx_g, sequences)
    results.append(('CRX', crx_g, crx_score))

    if prefer and prefer.lower() == 'crx':
        return {
            'best': {
                'algorithm': 'CRX',
                'grammar': crx_g,
                'mdl_score': round(crx_score, 2),
            },
            'all': [{'algorithm': 'CRX', 'grammar': crx_g, 'mdl_score': round(crx_score, 2)}],
            'why': "Requested CRX only.",
        }

    idr_g = idregex(sequences, kmax=kmax, N=N)
    if idr_g and idr_g != '∅':
        idr_score = mdl_score_simple(idr_g, sequences)
        results.append(('iDRegEx', idr_g, idr_score))

    results.sort(key=lambda x: x[2])

    best = results[0]
    all_results = [
        {'algorithm': a, 'grammar': g, 'mdl_score': round(s, 2)}
        for a, g, s in results
    ]

    crx_match = sum(1 for s in sequences if _matches(crx_g, s))
    idr_match = sum(1 for s in sequences if _matches(idr_g, s)) if len(results) > 1 else 0

    why_parts = []
    if len(results) == 1:
        why_parts.append("Only CRX produced a result (iDRegEx returned ∅).")
    else:
        why_parts.append(
            f"{results[0][0]} (score {results[0][2]:.1f}) vs {results[1][0]} (score {results[1][2]:.1f})."
        )

        if crx_match == idr_match == len(sequences):
            why_parts.append("Both grammars match all sequences.")
            why_parts.append(
                f"{results[0][0]} wins because it is more compact "
                f"(lower model cost) while matching all data."
            )
        elif crx_match != idr_match:
            why_parts.append(
                f"CRX matches {crx_match}/{len(sequences)} sequences, "
                f"iDRegEx matches {idr_match}/{len(sequences)}."
            )

    why_parts.append(
        f"{best[0]} selected (MDL score {best[2]:.1f})."
    )

    return {
        'best': {
            'algorithm': best[0],
            'grammar': best[1],
            'mdl_score': round(best[2], 2),
        },
        'all': all_results,
        'why': ' '.join(why_parts),
    }
|
||||
|
|
@ -13,6 +13,7 @@ from mcp.server.fastmcp import FastMCP
|
|||
|
||||
from .crx import CRX
|
||||
from .idregex import idregex
|
||||
from .ensemble import infer_ensemble, _matches
|
||||
from .yaml_to_seq import yaml_file_to_sequence, sequences_to_crx
|
||||
|
||||
mcp = FastMCP("grammar-inference", log_level="ERROR")
|
||||
|
|
@ -45,6 +46,52 @@ def infer_grammar(
|
|||
raise ValueError(f"Unknown method: {method}. Use 'crx' or 'idregex'.")
|
||||
|
||||
|
||||
@mcp.tool()
def infer_best_grammar(
    sequences: list[list[str]],
    prefer: str = "",
    kmax: int = 2,
    N: int = 3,
) -> str:
    """Infer a compact grammar from example sequences. Use this when you
    need to generate structured content (Ansible roles, CI configs, Helm
    values, YAML configs, etc.) and have existing examples to learn from.

    The grammar compresses N examples into ~100 chars, far fewer tokens
    than passing all examples. Pass the existing sequences, get back a
    pattern you can follow to generate new instances.

    Args:
        sequences: List of sequences, each a list of strings (symbols in
            the order they appear). Example: [["file","copy","command"],
            ["file","template","command"]].
        prefer: Optional. 'crx' for full coverage (accepts all examples),
            'idregex' for minimal core (only what every example shares).
            Default: runs both and picks best by MDL score.
        kmax: Maximum k for iDRegEx k-ORE inference.
        N: Number of EM iterations for iDRegEx.

    Returns:
        A formatted string with the best grammar, scores, and explanation.
        Grammar notation: a.b = a then b, (a+b) = a or b, r? = optional,
        r+ = one or more, r+? = zero or more.
    """
    pref = prefer if prefer else None
    result = infer_ensemble(sequences, kmax=kmax, N=N, prefer=pref)
    if result['best'] is None:
        return f"No grammar found. {result['why']}"
    lines = [f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})",
             f"Grammar: {result['best']['grammar']}",
             ""]
    if len(result['all']) > 1:
        for r in result['all']:
            m = sum(1 for s in sequences if _matches(r['grammar'], s))
            lines.append(f"  {r['algorithm']:10s} MDL={r['mdl_score']:>8.2f}  match={m}/{len(sequences)}")
        lines.append("")
    lines.append(f"Why: {result['why']}")
    return "\n".join(lines)
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def infer_yaml_grammar(
|
||||
yaml_dir: str,
|
||||
|
|
|
|||
107
bex/mdl.py
|
|
@ -1,16 +1,20 @@
|
|||
"""MDL scoring for iDRegEx (Algorithm 4, arXiv 1004.2372)."""
|
||||
|
||||
import math
|
||||
import functools
|
||||
from .expr import alphabet
|
||||
|
||||
|
||||
def model_cost(expr):
|
||||
"""|r| — number of alphabet symbol occurrences in expression."""
|
||||
import re
|
||||
cleaned = re.sub(r'[+?*()|.]', '', expr)
|
||||
cleaned = re.sub(r'_\d+', '', cleaned)
|
||||
cleaned = re.sub(r'[ε∅]', '', cleaned)
|
||||
return len(cleaned)
|
||||
syms = alphabet(expr)
|
||||
# Count each symbol by how many times it appears as a standalone word
|
||||
count = 0
|
||||
for s in syms:
|
||||
# Count occurrences where symbol is bordered by operators or edges
|
||||
count += len(re.findall(rf'(?<![a-zA-Z_]){re.escape(s)}(?![a-zA-Z_])', expr))
|
||||
return count
|
||||
|
||||
|
||||
def lang_size(expr, n=None):
|
||||
|
|
@ -31,6 +35,7 @@ def lang_size(expr, n=None):
|
|||
return total
|
||||
|
||||
|
||||
@functools.lru_cache(maxsize=None)
|
||||
def _count_words_fast(expr, length):
|
||||
if length < 0:
|
||||
return 0
|
||||
|
|
@ -43,38 +48,74 @@ def _count_words_fast(expr, length):
|
|||
if expr in alpha:
|
||||
return 1 if length == 1 else 0
|
||||
|
||||
if '+' in expr:
|
||||
inner = expr.rstrip('+')
|
||||
if inner.endswith('?'):
|
||||
inner = inner[:-1]
|
||||
return _count_star(inner, length, min_count=1)
|
||||
# 0. Concatenation: a.b.c — check FIRST so trailing quantifiers
|
||||
# apply to each part individually, not the whole expression.
|
||||
if '.' in expr:
|
||||
parts = _split_disj_crx(expr, '.')
|
||||
if len(parts) > 1:
|
||||
return _count_concat(tuple(parts), length, 0)
|
||||
|
||||
if expr.endswith('?'):
|
||||
# 1. Trailing quantifiers
|
||||
if expr.endswith('+?'):
|
||||
return _count_star(expr[:-2], length, min_count=0)
|
||||
if expr.endswith('*'):
|
||||
return _count_star(expr[:-1], length, min_count=0)
|
||||
if expr.endswith('?') and not expr.endswith('+?'):
|
||||
inner = expr[:-1]
|
||||
return _count_words_fast(inner, length) + (1 if length == 0 else 0)
|
||||
if expr.endswith('+') and not expr.endswith('+?'):
|
||||
inner = expr[:-1]
|
||||
return _count_star(inner, length, min_count=1)
|
||||
|
||||
if expr.startswith('(') and '|' in expr:
|
||||
parts = _split_disj(expr[1:-1])
|
||||
return sum(_count_words_fast(p.strip(), length) for p in parts)
|
||||
|
||||
if '.' in expr:
|
||||
parts = expr.split('.')
|
||||
return _count_concat(parts, length, 0)
|
||||
# 2. Disjunction group: (a+b+c) for CRX or (a|b|c) for iDRegEx
|
||||
if expr.startswith('(') and expr.endswith(')'):
|
||||
inner = expr[1:-1]
|
||||
parts = _split_disj_crx(inner, '+')
|
||||
if len(parts) > 1:
|
||||
return sum(_count_words_fast(p.strip(), length) for p in parts)
|
||||
parts = _split_disj_crx(inner, '|')
|
||||
if len(parts) > 1:
|
||||
return sum(_count_words_fast(p.strip(), length) for p in parts)
|
||||
return _count_words_fast(inner, length)
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
def _count_concat(parts, length, idx):
|
||||
def _split_disj_crx(s, sep):
|
||||
"""Split on `sep` at top depth (not inside nested parens)."""
|
||||
depth = 0
|
||||
parts = []
|
||||
cur = []
|
||||
for ch in s:
|
||||
if ch == '(':
|
||||
depth += 1
|
||||
cur.append(ch)
|
||||
elif ch == ')':
|
||||
depth -= 1
|
||||
cur.append(ch)
|
||||
elif ch == sep and depth == 0:
|
||||
parts.append(''.join(cur))
|
||||
cur = []
|
||||
else:
|
||||
cur.append(ch)
|
||||
parts.append(''.join(cur))
|
||||
return parts
|
||||
|
||||
|
||||
@functools.lru_cache(maxsize=None)
|
||||
def _count_concat(parts_tuple, length, idx):
|
||||
parts = list(parts_tuple)
|
||||
if idx >= len(parts):
|
||||
return 1 if length == 0 else 0
|
||||
total = 0
|
||||
for take in range(length + 1):
|
||||
cnt = _count_words_fast(parts[idx], take)
|
||||
if cnt:
|
||||
total += cnt * _count_concat(parts, length - take, idx + 1)
|
||||
total += cnt * _count_concat(parts_tuple, length - take, idx + 1)
|
||||
return total
|
||||
|
||||
|
||||
@functools.lru_cache(maxsize=None)
|
||||
def _count_star(inner, length, min_count):
|
||||
total = 0
|
||||
for rep in range(min_count, length + 1):
|
||||
|
|
@ -82,6 +123,7 @@ def _count_star(inner, length, min_count):
|
|||
return total
|
||||
|
||||
|
||||
@functools.lru_cache(maxsize=None)
|
||||
def _count_repeat(inner, rep, length):
|
||||
if rep == 0:
|
||||
return 1 if length == 0 else 0
|
||||
|
|
@ -114,19 +156,32 @@ def _split_disj(s):
|
|||
|
||||
|
||||
def data_cost(expr, sequences):
|
||||
"""MDL data cost: Σ_i log₂(|L=i(r)| / |S=i|) adjusted.
|
||||
"""MDL data cost: Σ_i log₂(|L_i(r)|) where |L_i(r)| is the number
|
||||
of words of length len(seq_i) accepted by the grammar.
|
||||
|
||||
Simplified form: for each word in S, cost = log₂(lang_size of all words
|
||||
of that length).
|
||||
Lower cost = more specific grammar that still covers the data.
|
||||
Exact computation is capped at max_len=50 to prevent combinatorial
|
||||
explosion. Longer sequences use an alphabet-size upper bound.
|
||||
"""
|
||||
MAX_EXACT = 50
|
||||
n = 2 * model_cost(expr) + 1
|
||||
runtime_n = min(max(n, max((len(s) for s in sequences), default=0)), MAX_EXACT)
|
||||
|
||||
lang_sizes = [_count_words_fast(expr, l) for l in range(runtime_n + 1)]
|
||||
|
||||
alpha_size = len(alphabet(expr))
|
||||
|
||||
total_cost = 0.0
|
||||
for seq in sequences:
|
||||
length = len(seq)
|
||||
if length <= n:
|
||||
lang_at_len = _count_words_fast(expr, length)
|
||||
if lang_at_len > 0:
|
||||
total_cost += math.log2(lang_at_len) if lang_at_len > 0 else 0
|
||||
if length <= runtime_n:
|
||||
ls = lang_sizes[length]
|
||||
if ls > 0:
|
||||
total_cost += math.log2(ls)
|
||||
else:
|
||||
total_cost += length * math.log2(max(alpha_size, 1))
|
||||
else:
|
||||
total_cost += length * math.log2(max(alpha_size, 1))
|
||||
return total_cost
|
||||
|
||||
|
||||
|
|
|
|||
341
blog_post.md
Normal file
|
|
@ -0,0 +1,341 @@
|
|||
# Discovering Unwritten Conventions with Grammar Inference
|
||||
|
||||
**How we turned 36 Ansible roles into a 200-character grammar — and why
|
||||
it matters for LLM agents.**
|
||||
|
||||
## The problem
|
||||
|
||||
Every codebase has unwritten conventions. Your team's Docker Compose
|
||||
files always put `image` before `ports` before `volumes`. Your Ansible
|
||||
deploy roles always start with `assert`, then `file`, then `template`.
|
||||
Your CI pipelines always run `lint` before `test` before `deploy`.
|
||||
|
||||
Nobody writes these down. They're emergent — copied from role to role,
|
||||
file to file, until they become a tacit standard.
|
||||
|
||||
When an LLM agent needs to generate new content that follows these
|
||||
conventions, you have two options:
|
||||
|
||||
1. **Stuff every existing file into context**: 36 deploy roles come to
about 15,000 tokens, and the cost grows with every role you add.
|
||||
2. **Give it one or two examples and hope** — the LLM will guess the
|
||||
pattern, and it will often guess wrong.
|
||||
|
||||
Neither is good. The first is wasteful. The second is unreliable.
|
||||
|
||||
What you really want is the **compiled convention** — the minimal
|
||||
description of what all 36 roles share, expressed in ~200 tokens. An
|
||||
LLM can follow a rule in 200 tokens far more reliably than it can
|
||||
infer a pattern from 36 examples.
|
||||
|
||||
This is grammar inference.
|
||||
|
||||
## The approach
|
||||
|
||||
Given a set of example sequences over some alphabet (e.g., Ansible
|
||||
module names, Docker Compose keys, CI job names), learn a regular
|
||||
expression that describes the general pattern.
|
||||
|
||||
We implemented two algorithms from a pair of 2010 papers by Bex et al.,
one published in TODS and one on arXiv:
|
||||
|
||||
- **CRX** (TODS 2010 §6): A single-pass algorithm that builds a
|
||||
predecessor relation over symbols, computes equivalence classes,
|
||||
and emits a Chain Regular Expression (CHARE) that matches ALL
|
||||
input sequences. Fast, deterministic, captures the full vocabulary.
|
||||
|
||||
- **iDRegEx** (arXiv 2010): A probabilistic algorithm using k-testable
|
||||
Observation Automata (k-OA) trained with Baum-Welch EM. It finds
|
||||
only the *minimal common core* — the symbols that appear in every
|
||||
example. Robust against noise, but fails (returns ∅) when the
|
||||
examples are too diverse.
|
||||
|
||||
Both run in the **ensemble**: CRX produces a permissive grammar (full
|
||||
vocabulary, many optional parts), iDRegEx produces a strict grammar
|
||||
(minimal core). A Minimum Description Length (MDL) score picks the
|
||||
winner: the grammar that compresses the data best.
|
||||
|
||||
## The algorithms, briefly
|
||||
|
||||
### CRX — Chain Regular Expression inference
|
||||
|
||||
CRX (Algorithm 7, TODS 2010) works in four steps:
|
||||
|
||||
1. **Build the immediate-predecessor relation.** For every adjacent
|
||||
pair (x, y) across all sequences, record that x precedes y. If
|
||||
symbol `assert` always appears before `file`, record
|
||||
`assert → file`.
|
||||
|
||||
2. **Compute equivalence classes.** Take the reflexive-transitive
closure of the predecessor relation. The strongly connected
components are *equivalence classes*: groups of symbols that can
each appear before the other. If `copy` precedes `template` in one
role and `template` precedes `copy` in another, they're in the same
class.

3. **Merge singleton classes.** A class with one symbol that shares
the same predecessor/successor sets as another singleton class
gets merged. This handles symbols that always appear in the
same structural position: if `copy` and `template` both follow
`file` and precede `command`, they are merged.
|
||||
|
||||
4. **Topological sort.** The equivalence classes are sorted by their
|
||||
position in the Hasse diagram of the predecessor relation. Each
|
||||
class becomes a factor in the output, annotated with a quantifier:
|
||||
- `+` (one or more) if the class forms a cycle
|
||||
- `+?` (zero or more) if the class appears variably
|
||||
- `?` (optional) if the class can be absent
|
||||
- (exact) if the class always appears exactly once
|
||||
|
||||
The result is a CHARE: a sequence of factors where each factor is a
|
||||
disjunction of equivalent symbols with a quantifier.
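
Steps 1 and 2 above can be sketched in a few lines. This is a minimal illustration, not the project's `CRX` class: the function names are hypothetical, and the singleton merge, the ordering, and the quantifier assignment (steps 3 and 4) are left out.

```python
def predecessor_edges(sequences):
    """Step 1: record x -> y for every adjacent pair (x, y) in the examples."""
    edges = {}
    for seq in sequences:
        for sym in seq:
            edges.setdefault(sym, set())   # every symbol gets a node
        for x, y in zip(seq, seq[1:]):
            edges[x].add(y)
    return edges


def reachable(edges, start):
    """Symbols reachable from `start`, including `start` itself."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def equivalence_classes(sequences):
    """Step 2: symbols that can reach each other form one class."""
    edges = predecessor_edges(sequences)
    reach = {s: reachable(edges, s) for s in edges}
    classes, assigned = [], set()
    for s in sorted(edges):
        if s in assigned:
            continue
        cls = frozenset(t for t in reach[s] if s in reach[t])
        classes.append(cls)
        assigned |= cls
    return classes


# Hypothetical input: copy and template appear in both orders.
classes = equivalence_classes([
    ["file", "template", "copy", "command"],
    ["file", "copy", "template", "command"],
])
print(classes)
```

Here `copy` and `template` end up in one class because each can precede the other, while `file` and `command` stay singletons.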
|
||||
|
||||
### iDRegEx — k-optimal regular expression inference
|
||||
|
||||
iDRegEx (Algorithm 4, arXiv 2010) uses a probabilistic automaton:
|
||||
|
||||
1. **Build a complete k-OA.** A k-testable Observation Automaton
|
||||
records all k-grams (subsequences of length k) from the input
|
||||
sequences. The automaton's states represent (k-1)-grams.
|
||||
|
||||
2. **Train with Baum-Welch.** EM iterations assign probabilities to
|
||||
transitions, learning which paths through the automaton are most
|
||||
likely given the data.
|
||||
|
||||
3. **Disambiguate.** Remove nondeterministic transitions — for any
|
||||
state and symbol, keep only the most probable next state.
|
||||
|
||||
4. **Prune.** Remove low-probability edges and unreachable states,
|
||||
leaving only the most likely paths.
|
||||
|
||||
5. **Extract with rwr².** The REWRITE-SQUARED algorithm (rwr²,
|
||||
Algorithm 3) collapses the pruned automaton into a k-optimal
|
||||
regular expression — the minimal common core.
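
Steps 3 and 4 can be illustrated with a small sketch. The transition-table layout (`{(state, symbol): {next_state: probability}}`), the function names, and the threshold value are assumptions made for illustration and are not taken from the project's code; removal of unreachable states is omitted.

```python
def disambiguate(transitions):
    """For each (state, symbol), keep only the most probable next state."""
    kept = {}
    for key, targets in transitions.items():
        if not targets:
            continue
        best = max(targets, key=targets.get)
        kept[key] = (best, targets[best])
    return kept


def prune(kept, threshold):
    """Drop edges whose probability falls below `threshold`."""
    return {key: edge for key, edge in kept.items() if edge[1] >= threshold}


# Hypothetical trained probabilities.
transitions = {
    ("q0", "file"): {"q1": 0.7, "q2": 0.3},
    ("q1", "copy"): {"q3": 0.05},
}
deterministic = disambiguate(transitions)
pruned = prune(deterministic, threshold=0.1)
print(pruned)
```

After disambiguation every `(state, symbol)` pair has a single target, so the automaton is deterministic; pruning then removes the rarely used `copy` edge.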
|
||||
|
||||
### MDL scoring — picking the right level of specificity
|
||||
|
||||
The Minimum Description Length principle (Rissanen 1978) says: the
|
||||
best grammar is the one that minimizes the sum of its own size and
|
||||
the cost of encoding the data using it.
|
||||
|
||||
```
|
||||
MDL = model_cost + data_cost
|
||||
```
|
||||
|
||||
**model_cost** = the number of alphabet symbol occurrences in the
|
||||
grammar. A grammar with 5 unique symbols used once each has
|
||||
model_cost = 5.
|
||||
|
||||
**data_cost** = Σ log₂(|L(r)|) across all sequences, where |L(r)| is
|
||||
the number of strings of length len(s) that the grammar accepts.
|
||||
A grammar like `(s1+s2+...+s19)+`, a repeated disjunction over 19
symbols, accepts any of the 19 symbols at each position, so for a
sequence of length 120 the data cost is 120 × log₂(19) ≈ 510 bits.
A grammar like `a.b.c.d.e` accepts only 1 string of length 5, so its
data cost is log₂(1) = 0.
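
The arithmetic in this example can be checked directly. The helper name below is made up for illustration; it only covers the special case of a repeated disjunction, where the language at length n has (alphabet size)^n words.

```python
import math


def repeated_disjunction_cost(alphabet_size, length):
    """log2 of the number of words of `length` over `alphabet_size` symbols."""
    return length * math.log2(alphabet_size)


permissive_cost = repeated_disjunction_cost(19, 120)  # 19 symbols, length 120
exact_cost = math.log2(1)                             # one accepted word

print(f"permissive: {permissive_cost:.1f} bits, exact: {exact_cost:.1f} bits")
```

The permissive grammar pays about 4.25 bits per position, while the exact grammar pays nothing because the sequence is fully determined.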
|
||||
|
||||
The ensemble picks the grammar with the lowest total MDL. This
|
||||
automatically balances specificity against coverage: a grammar that
|
||||
matches only 1 sequence but does so perfectly (low data cost) can
|
||||
beat a grammar that matches all sequences but is extremely permissive
|
||||
(high data cost).
|
||||
|
||||
## The bugs we found (and fixed)
|
||||
|
||||
Implementing the BEX algorithms faithfully required solving several
|
||||
subtle problems.
|
||||
|
||||
### Bug 1: model_cost counted characters, not symbols
|
||||
|
||||
The paper defines model_cost as "the length of r" — the number of
|
||||
symbols in the expression. For the toy alphabet {a, b, c, d, e} used
|
||||
in the paper, characters and symbols are the same. For real-world
|
||||
symbols like `community.docker.docker_image`, they aren't.
|
||||
|
||||
Our `model_cost` function was counting characters (226 for a typical
|
||||
grammar), when it should count symbol occurrences (19). This
|
||||
massively inflated the MDL score, making CRX appear worse than it
|
||||
actually was.
|
||||
|
||||
**Fix:** Count occurrences of alphabet symbols in the expression using
|
||||
regex word-boundary matching, not string length.
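
The difference between the two counts is easy to see on a small case. The sketch below is one possible way to count symbol occurrences, assuming the alphabet is known in advance; it is not the code in `bex/mdl.py`. It tries longer names first so that a dotted module name is counted once rather than as its pieces. The expression and alphabet are hypothetical.

```python
import re


def count_symbol_occurrences(expr, symbols):
    """Count how many times any known symbol occurs in the grammar string."""
    if not symbols:
        return 0
    # Longest names first, so a dotted name wins over its own fragments.
    ordered = sorted(symbols, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    # A symbol must not be glued to further identifier characters.
    token = re.compile(rf"(?<![A-Za-z0-9_])(?:{alternatives})(?![A-Za-z0-9_])")
    return len(token.findall(expr))


expr = "file?.(copy+template)+.community.docker.docker_image"
symbols = {"file", "copy", "template", "community.docker.docker_image"}

char_count = len(expr)
symbol_count = count_symbol_occurrences(expr, symbols)
print(char_count, symbol_count)
```

The character count grows with the length of the module names, while the symbol count stays at four however long those names are.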
|
||||
|
||||
### Bug 2: Dispatch order in _count_words_fast
|
||||
|
||||
The recursive function `_count_words_fast` estimates |L(r)| — the
|
||||
number of strings a grammar accepts at a given length. It dispatches
|
||||
on expression structure: first check for concatenation (`.`), then
|
||||
trailing quantifiers (`+?`, `*`, `?`, `+`), then disjunction groups.
|
||||
|
||||
Our original dispatch checked `endswith('+?')` before checking `'.' in expr`.
|
||||
For the expression `(All)+.Role?.RoleBinding?.Job+?`, the trailing
|
||||
`+?` on `Job+?` triggered the quantifier branch first, applying the
|
||||
`+?` to the **entire** expression instead of just the `Job` factor.
|
||||
|
||||
**Fix:** Check concatenation first. Top-level dots can only appear in
|
||||
concatenation, so they should be handled before any quantifier logic.
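
A small sketch of the corrected order: split on top-level dots first, then read the quantifier of each part on its own. The function names are illustrative, and the sketch assumes symbol names without dots.

```python
def split_top_level(expr, sep="."):
    """Split on `sep` only where the parenthesis depth is zero."""
    depth, parts, cur = 0, [], []
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur))
    return parts


def split_quantifier(part):
    """Return (body, quantifier); '+?' must be tested before '?' and '+'."""
    for suffix in ("+?", "?", "+"):
        if part.endswith(suffix):
            return part[: -len(suffix)], suffix
    return part, ""


parts = split_top_level("(All)+.Role?.RoleBinding?.Job+?")
factors = [split_quantifier(p) for p in parts]
print(factors)
```

With this order the trailing `+?` stays attached to `Job` and never reaches the other three factors.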
|
||||
|
||||
### Bug 3: Greedy matching without backtracking
|
||||
|
||||
The `_match_tokens` function checked whether a sequence matches a
|
||||
grammar. For quantifiers like `+?` (zero-or-more), it greedily
|
||||
consumed ALL consecutive matching symbols, then moved on. This failed
|
||||
for grammars like `a+?.a` on input `['a', 'a']`: the `a+?` ate both
|
||||
`a`s, and there was nothing left for the second `.a`.
|
||||
|
||||
**Fix:** Replace the single-pass greedy matching with `_match_possible`,
|
||||
a proper backtracking engine that enumerates ALL valid end positions
|
||||
for each token and picks the maximum. This is essentially a tiny
|
||||
regex engine — but limited to the CHARE subset, so it avoids the
|
||||
exponential blowup of general regex matching.
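
The failing case can be reproduced with a matcher that tracks every position a factor may stop at instead of committing to the longest run. This is a sketch under assumptions, not the project's `_match_possible`: the factor representation (`(set_of_symbols, quantifier)`) and both function names are hypothetical.

```python
def end_positions(symbols, quant, seq, start):
    """Every index where one factor can stop when it starts at `start`."""
    ends = set()
    if quant in ("?", "+?"):
        ends.add(start)              # the factor may match nothing
    pos = start
    while pos < len(seq) and seq[pos] in symbols:
        pos += 1
        ends.add(pos)
        if quant in ("", "?"):
            break                    # at most one symbol allowed
    return ends


def matches(factors, seq):
    """True if the factors, in order, can consume exactly the whole sequence."""
    positions = {0}
    for symbols, quant in factors:
        positions = {
            end
            for start in positions
            for end in end_positions(symbols, quant, seq, start)
        }
        if not positions:
            return False
    return len(seq) in positions


# The grammar a+?.a from the text: zero or more 'a', then exactly one 'a'.
grammar = [({"a"}, "+?"), ({"a"}, "")]
print(matches(grammar, ["a", "a"]), matches(grammar, ["a"]), matches(grammar, []))
```

Because the set of positions can never hold more than `len(seq) + 1` entries, the work per factor stays polynomial in the sequence length.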
|
||||
|
||||
### Bug 4: Dot-splitting inside disjunctions
|
||||
|
||||
Module names like `community.docker.docker_image` contain dots.
|
||||
When `_parse_parts` processed a disjunction child, it recursively
|
||||
called itself — which split the expression on `.` before treating it
|
||||
as a symbol. The symbol `community.docker.docker_image` became
|
||||
`community` then `docker` then `docker_image` — three concatenated
|
||||
symbols instead of one.
|
||||
|
||||
**Fix:** Disjunction children are always flat symbols (CRX and
|
||||
iDRegEx don't produce nested disjunctions in practice). Parse them
|
||||
with `_parse_flat_symbol`, which strips quantifiers but never splits
|
||||
on `.`.
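
The effect is easy to demonstrate. The helper below is a hypothetical stand-in for the disjunction branch, assuming children are flat symbols without quantifiers; it splits on the `+` separator only and leaves dots alone.

```python
def parse_disjunction_children(group):
    """Split '(a+b+c)' into its children without touching dots."""
    inner = group.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    return [child.strip() for child in inner.split("+") if child.strip()]


# What the buggy recursive path effectively did to a dotted name.
naive = "community.docker.docker_image".split(".")

children = parse_disjunction_children("(community.docker.docker_image+copy)")
print(naive)
print(children)
```

The naive split yields three fragments, while the flat parse keeps the fully qualified module name as one symbol.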
|
||||
|
||||
## The results
|
||||
|
||||
### Ansible deploy roles — 36 roles from companyweb
|
||||
|
||||
Our own deploy roles cover everything from AdGuard Home to
|
||||
Woodpecker CI. They have NO schema — each is a free-form script.
|
||||
|
||||
```
|
||||
Grammar: docker_volume+?.group?.docker_container?.user?.apt?.npm?.
|
||||
(assert+...+command+copy+file+template+set_fact+...+wait_for)+?.
|
||||
(cron+firewalld)?
|
||||
Match: 36/36
|
||||
MDL: 2186.28
|
||||
```
|
||||
|
||||
Reading the grammar left to right: optional Docker setup (volume,
group, container, user, apt, npm), then a large disjunction of ~25
task modules repeated zero or more times (`+?`), then optional
cron/firewalld at the end. This captures the convention precisely.
|
||||
|
||||
**Compression: 36 roles (15,000 tokens) → 200 tokens (75×)**
|
||||
|
||||
### Geerlingguy Galaxy roles — 15 popular roles
|
||||
|
||||
Jeff Geerling's roles are the most popular on Ansible Galaxy. He has
|
||||
never documented their structural pattern. Yet every one of the 15
|
||||
follows the same arc:
|
||||
|
||||
```
|
||||
Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+.
|
||||
include+?.(npm+pip)+?.lineinfile?
|
||||
Match: 15/15
|
||||
MDL: 596.64
|
||||
```
|
||||
|
||||
Check prerequisites, OS-specific variables, install packages,
|
||||
configure with templates, start services, optionally run sub-tasks,
|
||||
install npm/pip packages, and optionally tweak config lines.
|
||||
|
||||
**This is the first explicit description of the geerlingguy role
|
||||
convention.** It took 15 roles and a grammar inference algorithm to
|
||||
write it down.
|
||||
|
||||
**Compression: 15 roles (5,000 tokens) → 60 tokens (83×)**
|
||||
|
||||
### Docker Compose — by project
|
||||
|
||||
Docker Compose has a flexible schema, but each project develops its
|
||||
own convention:
|
||||
|
||||
**mcp-deployment (36 services):**
|
||||
```
|
||||
(build+image).command.(environment+volumes)?.ports
|
||||
```
|
||||
**files (6 services):**
|
||||
```
|
||||
image.environment.volumes.network_mode.privileged?.cap_add?
|
||||
```
|
||||
**fresh-ape-base (9 services):**
|
||||
```
|
||||
image.ports?.(depends_on+environment+user+volumes)+
|
||||
```
|
||||
|
||||
### Ensemble dynamics
|
||||
|
||||
The ensemble (CRX + iDRegEx + MDL) selects different winners
|
||||
depending on the data:
|
||||
|
||||
| Dataset | Winner | Why |
|---------|--------|-----|
| Ansible deploy (36 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Ansible galaxy (15 roles) | CRX | iDRegEx returned ∅ (too diverse) |
| Ansible restore (2 roles) | CRX | Both match all; CRX more compact |
| Ansible configure (4 roles) | **iDRegEx** | Finds minimal core `include_role` |
| Ansible manage (2 roles) | **iDRegEx** | Core: `assert.authorized_key` |
|
||||
|
||||
iDRegEx wins when the data has a clear common core. CRX wins when
|
||||
there's no single shared subsequence (the roles share the *vocabulary*
|
||||
but not the *order*).
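
The selection rule itself is small. The sketch below assumes each candidate is a dict with `algorithm`, `grammar`, and `mdl_score` keys (the keys used in the README example) and that a failed inference is reported as `∅`; the function name and the scores are invented for illustration.

```python
def pick_best(candidates):
    """Return the candidate with the lowest MDL score, skipping empty grammars."""
    viable = [c for c in candidates if c["grammar"] != "∅"]
    if not viable:
        return None
    return min(viable, key=lambda c: c["mdl_score"])


# Made-up scores, for illustration only.
diverse = [
    {"algorithm": "CRX", "grammar": "a?.(b+c)+", "mdl_score": 40.0},
    {"algorithm": "iDRegEx", "grammar": "∅", "mdl_score": 0.0},
]
shared_core = [
    {"algorithm": "CRX", "grammar": "a?.(b+c)+?.d?", "mdl_score": 25.0},
    {"algorithm": "iDRegEx", "grammar": "b.c", "mdl_score": 2.0},
]

print(pick_best(diverse)["algorithm"], pick_best(shared_core)["algorithm"])
```

Filtering out `∅` before comparing scores is what lets CRX win by default on diverse data, as in the first two rows of the table.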
|
||||
|
||||
## The MCP
|
||||
|
||||
The engine is exposed as an MCP server:
|
||||
|
||||
```python
|
||||
from bex.mcp_server import infer_best_grammar
|
||||
|
||||
# Full coverage
|
||||
output = infer_best_grammar(
|
||||
sequences=role_sequences,
|
||||
prefer="crx",
|
||||
)
|
||||
# Returns:
|
||||
# Best: CRX (MDL 2186.28)
|
||||
# Grammar: docker_volume+?.group?...(assert+...+wait_for)+?.(cron+firewalld)?
|
||||
|
||||
# Ensemble — let MDL pick
|
||||
output = infer_best_grammar(sequences=role_sequences)
|
||||
```
|
||||
|
||||
An agent workflow:
|
||||
|
||||
1. Agent needs to write deploy role #37
|
||||
2. Finds 36 existing deploy roles, extracts their task module sequences
|
||||
3. Calls `infer_best_grammar(sequences=..., prefer='crx')`
|
||||
4. Gets back the grammar in 200 tokens
|
||||
5. Generates a new role that follows the structural pattern
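
Step 2 of this workflow, turning role files into module sequences, might look like the sketch below. It assumes the task files have already been parsed into lists of dicts by a YAML loader; the function name is hypothetical and the keyword set is a deliberately partial list of Ansible task keywords.

```python
# Partial list of task-level keywords that are not module names.
TASK_KEYWORDS = {
    "name", "when", "tags", "register", "become", "vars",
    "loop", "notify", "with_items", "changed_when", "failed_when",
}


def module_sequence(tasks):
    """Return the module used by each task, in file order."""
    sequence = []
    for task in tasks:
        modules = [key for key in task if key not in TASK_KEYWORDS]
        if modules:
            sequence.append(modules[0])
    return sequence


# Hypothetical parsed tasks from one role.
tasks = [
    {"name": "Create config dir", "file": {"path": "/etc/app", "state": "directory"}},
    {"name": "Render config", "template": {"src": "app.j2", "dest": "/etc/app/app.conf"},
     "notify": "restart app"},
    {"command": "app --check", "when": "run_check"},
]
sequence = module_sequence(tasks)
print(sequence)
```

One such sequence per role is the `sequences` argument that `infer_best_grammar` expects.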
|
||||
|
||||
Without the MCP: 36 role files in context (15,000 tokens), or guesswork.
|
||||
With the MCP: one grammar rule (200 tokens), known to match 36/36 roles.
|
||||
|
||||
## What it means
|
||||
|
||||
Grammar inference turns **examples** into **rules**. The rule is a
|
||||
compressed description of the structural convention — and for
|
||||
schema-less content like Ansible roles, this may be the *first time*
|
||||
the convention has been written down at all.
|
||||
|
||||
For LLM agents, this changes the trade-off between context and
|
||||
accuracy. Instead of flooding the context window with examples, the
|
||||
agent can call the MCP, get the rule in ~60 tokens, and follow it.
|
||||
The rule is more reliable than guessing from examples, and it costs
|
||||
less than the first example would have.
|
||||
|
||||
The algorithm doesn't need to understand what a deploy role does. It
|
||||
doesn't know that `file` creates directories and `template` renders
|
||||
Jinja2. It only needs to see 36 sequences of module names and find
|
||||
the pattern they all share. The structural convention is in the data
|
||||
— you just have to extract it.
|
||||
|
||||
## References
|
||||
|
||||
- Bex, G. J., Neven, F., Schwentick, T., & Vansummeren, S. (2010).
*Inference of Concise Regular Expressions and DTDs.* ACM Transactions
on Database Systems (TODS) 35(2).
- Bex, G. J., Gelade, W., Neven, F., & Vansummeren, S. (2010).
*Learning Deterministic Regular Expressions for the Semi-Structured
Web.* arXiv:1004.2372.
|
||||
- Rissanen, J. (1978). *Modeling by shortest data description.*
|
||||
Automatica 14(5).
|
||||