# Grammar Inference Engine

Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

## Quick Start

```bash
pip install pyyaml
python -m bex
```

```python
from bex import infer_ensemble

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```

Or compare algorithms manually:

```python
from bex.crx import CRX

seqs = [...]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.(wait_for)?
```

## Algorithms

| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures *all* symbols |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |

### Pipeline 1: Direct CHARE Inference (fast)

```
Example sequences → CRX → CHAREs grammar
```

CRX learns a grammar that accepts *all* observed symbols, marking optional ones with `?`. Best when the data is clean and you want the full vocabulary.
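
As a rough illustration of the idea (this is not the code in `bex/crx.py`), the sketch below orders symbols by the "is followed by" relation, groups mutually reachable symbols into one disjunction, and picks a quantifier from per-sequence occurrence counts. The name `crx_sketch` is hypothetical, and the class-merging step of the published algorithm is left out.

```python
def crx_sketch(seqs):
    """Simplified CHARE inference in the spirit of CRX (illustrative only)."""
    symbols = sorted({s for seq in seqs for s in seq})
    # reach[a]: symbols reachable from a through "directly followed by" edges.
    reach = {a: {a} for a in symbols}
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            reach[a].add(b)
    changed = True
    while changed:  # naive transitive closure, fine for small alphabets
        changed = False
        for a in symbols:
            new = set().union(*(reach[b] for b in reach[a]))
            if not new <= reach[a]:
                reach[a] |= new
                changed = True
    # Mutually reachable symbols form one class, i.e. one factor of the CHARE.
    classes, seen = [], set()
    for a in symbols:
        if a not in seen:
            cls = sorted(b for b in reach[a] if a in reach[b])
            seen.update(cls)
            classes.append(cls)
    # A class that precedes another reaches strictly more symbols, so sorting
    # by reach size (descending) yields a valid topological order.
    classes.sort(key=lambda c: (-len(reach[c[0]]), c))
    factors = []
    for cls in classes:
        counts = [sum(s in cls for s in seq) for seq in seqs]
        lo, hi = min(counts), max(counts)
        if hi == 1:
            quant = "" if lo == 1 else "?"
        else:
            quant = "+" if lo >= 1 else "+?"
        body = cls[0] if len(cls) == 1 else "(" + "+".join(cls) + ")"
        factors.append(body + quant)
    return ".".join(factors)


seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
print(crx_sketch(seqs))
```

On the Quick Start sequences this prints `file.template.docker_image.command.set_fact.shell.wait_for?`, since `wait_for` is the only symbol missing from one of the examples.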

### Pipeline 2: Probabilistic k-ORE Inference (robust)

```
Example sequences → Complete k-OA → Baum-Welch (EM)
  → Disambiguate → Prune → rwr² → k-ORE grammar
```

iDRegEx learns the *minimum* common subsequence — symbols that appear in every example. Fails (∅) when the examples are too diverse.

### Pipeline 3: Ensemble (recommended)

```
Example sequences → [CRX, iDRegEx] → MDL score each → pick best
```

Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like `(a+b+c+...+z)+` that accepts everything gets a high data cost (`log2(|L(r)|)` is large), while a specific grammar like `a.b.c` has near-zero data cost.
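
The selection step itself is small. A minimal sketch, assuming each candidate is a dict with the `algorithm`, `grammar`, and `mdl_score` keys used in the Quick Start; the helper name `pick_best`, the `candidates` key, and the toy scores are made up for illustration:

```python
def pick_best(candidates):
    """Return the lowest-MDL candidate together with a short explanation."""
    ranked = sorted(candidates, key=lambda c: c["mdl_score"])
    best = ranked[0]
    summary = ", ".join(
        f"{c['algorithm']} (score {c['mdl_score']:.1f})" for c in ranked
    )
    return {
        "best": best,
        "candidates": ranked,
        "why": f"{summary}. {best['algorithm']} selected.",
    }


# Toy scores, not produced by the real scorer.
candidates = [
    {"algorithm": "CRX", "grammar": "(a+b+c)+", "mdl_score": 12.5},
    {"algorithm": "iDRegEx", "grammar": "a.b.c", "mdl_score": 3.0},
]
result = pick_best(candidates)
print(result["best"]["algorithm"], result["why"])
```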

## Architecture

```
bex/
├── crx.py          # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py      # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py         # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py        # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py          # SOA: Single Occurrence Automaton core
├── koa.py          # k-OA: k-Occurrence Automaton
├── ikoa.py         # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py      # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py   # Baum-Welch EM training for k-OA
├── expr.py         # Expression utilities (concat, disj, star, strip)
├── marking.py      # State marking for determinism
├── yaml_to_seq.py  # Generic YAML → key-path sequence converter
├── role_grammar.py # Ansible role → module-sequence extractor
├── ensemble.py     # Ensemble: runs CRX + iDRegEx, picks best by MDL
├── mdl.py          # MDL scoring for grammar selection
├── mcp_server.py   # MCP server exposing 4 tools
└── ...
```

## MCP Server

A **Model Context Protocol** server exposes all algorithms and domain adapters as tools:

```bash
python -m bex.mcp_server
```

### Tools

| Tool | What it does |
|------|-------------|
| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a `Why:` explanation. |
| `infer_yaml_grammar(yaml_dir, pattern, method)` | Generic YAML → key-paths → grammar |
| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |

### Using `infer_best_grammar`

The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass `prefer`:

```
User: Run CRX on our deploy tasks.
Agent: [runs with prefer='crx']
Best: CRX (MDL 7.0)
Grammar: file.template.docker_image.command.set_fact.shell.wait_for?

  CRX  MDL=  7.00  file.template.docker_image.command.set_fact.shell.wait_for?

Why: Requested CRX only.
```

Without `prefer`, the ensemble compares both:

```
User: Find the grammar for our Helm chart.
Agent: [runs]
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

  iDRegEx     MDL=  1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX         MDL=  2651.74  (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?

Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
```

Both grammars are correct — they operate at different levels of specificity. The `Why:` field helps the agent decide which one to use for the task at hand.
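
For clients that talk to the server directly, a tool invocation is a JSON-RPC `tools/call` request as defined by the Model Context Protocol. The sketch below only builds that payload. The argument names come from the tool table above, while their JSON types (a list of string lists for `sequences`) and the helper name `tools_call_request` are assumptions.

```python
import json


def tools_call_request(request_id, tool, arguments):
    """Build an MCP `tools/call` JSON-RPC request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


request = tools_call_request(
    1,
    "infer_best_grammar",
    {
        # Assumed shape: one list of symbol names per example sequence.
        "sequences": [["file", "template", "shell"], ["file", "template"]],
        "prefer": "crx",  # skip the ensemble and return only the CRX result
    },
)
print(json.dumps(request, indent=2))
```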

## Ensemble Selection

The `infer_best_grammar` tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.

### How MDL scoring works

```
MDL = model_cost + data_cost
```

- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A grammar that accepts *many* strings of the same length (like a 17-way disjunction `(a+b+...+q)+`) has high data cost because `|L(r)|` is large. A specific, fixed sequence (`a.b.c.d.e`) has `|L(r)| = 1` so data cost is zero.

The ensemble selects the grammar with the lowest total MDL. This automatically picks the right level of specificity for the data.
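
The two terms can be sketched for CHARE-shaped grammars as follows. This follows the definitions stated above rather than the code in `bex/mdl.py`, so it will not necessarily reproduce the scores reported in this README. The factor representation (a list of `(symbols, quantifier)` pairs with disjoint symbol sets) and the function names are assumptions made for the example.

```python
import math


def count_words(factors, n):
    """Count the distinct strings of length n accepted by a CHARE.

    factors: list of (symbols, quantifier) with quantifier in "", "?", "+", "+?".
    Factors are assumed to use disjoint symbol sets, so every accepted string
    has exactly one parse and the count is exact.
    """
    ways = [1] + [0] * n  # ways[j]: strings of length j over the factors so far
    for symbols, quant in factors:
        k = len(symbols)
        lo = 0 if quant in ("?", "+?") else 1
        hi = 1 if quant in ("", "?") else n
        nxt = [0] * (n + 1)
        for j, w in enumerate(ways):
            if w:
                for m in range(lo, min(hi, n - j) + 1):
                    nxt[j + m] += w * k ** m  # k choices at each of m positions
        ways = nxt
    return ways[n]


def mdl_score(factors, seqs):
    """model_cost + data_cost; assumes every sequence length is accepted."""
    model_cost = len({s for symbols, _ in factors for s in symbols})
    data_cost = sum(math.log2(count_words(factors, len(s))) for s in seqs)
    return model_cost + data_cost


fixed = [(["a"], ""), (["b"], ""), (["c"], "")]  # a.b.c
wide = [(["a", "b"], "+")]                       # (a+b)+
print(count_words(fixed, 3), count_words(wide, 3))
```

Here `a.b.c` accepts a single string of length 3, so its data cost is zero, while `(a+b)+` accepts 2³ = 8 strings and pays 3 bits for every length-3 example.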

### When each algorithm wins

| Scenario | Winner | Why |
|----------|--------|-----|
| Many sequences, diverse patterns | **CRX** | CRX captures the full vocabulary. iDRegEx can't find a common core. |
| Clean, structured sequences | **CRX** | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
| Few sequences (2–3) | **iDRegEx** | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
| Sequences share a clear core | **iDRegEx** | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
| Single sequence | **iDRegEx** (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |

### Real-world benchmarks

Results from three domains (Helm, Ansible, Terraform) using the ensemble with the corrected MDL scoring:

```
Dataset                   Best       MDL      Matches
──────────────────────────────────────────────────────────
Helm (prom-stack)         iDRegEx    1433.0   1/6
Ansible (deploy)          CRX        246.1    34/36
Ansible (validate)        CRX        34.0     5/5
Ansible (restore)         CRX        24.0     2/2
Ansible (manage)          iDRegEx    25.0     1/2
Ansible (configure)       iDRegEx    22.5     1/4
Terraform (hashistack)    CRX        4.0      9/9
```

Note: MDL scores are not comparable across datasets — only within the same run
(CRX vs iDRegEx on the same sequences). The Helm score is higher because
each sequence is ~120 symbols long, making the data cost term dominant for
the overly-general CRX grammar (19 kinds × many lengths).

## Domain Adapters

### Ansible Roles

Extracts module names from `tasks/main.yml`, groups by category prefix (e.g., `deploy_foo` → `deploy`), and learns per-category grammars:

```python
from bex.ensemble import infer_ensemble
from bex.role_grammar import collect_all_role_sequences

all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    if len(seqs) >= 2:
        result = infer_ensemble(seqs)
        print(f"── {cat} ({len(items)} roles) ──")
        print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
        print(f"  Grammar: {result['best']['grammar']}")
        print(f"  Why: {result['why']}")
```

**Example output** (from [companyweb](https://github.com/anomalyco/companyweb), 51 roles):
```
── restore (2 roles) ──
  Best: CRX (MDL 24.0)
  Grammar: file.copy.unarchive+.command
  Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.

── validate (5 roles) ──
  Best: CRX (MDL 34.0)
  Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
  Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.

── configure (4 roles) ──
  Best: iDRegEx (MDL 22.5)
  Grammar: include_role
  Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
```
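
The extraction step described at the start of this section can be sketched on already-parsed task lists. This is not the code in `bex/role_grammar.py`; the keyword list is partial, and the helper names `module_sequence` and `group_by_category` are hypothetical.

```python
from collections import defaultdict

# Task keys that are not module names (partial, illustrative list).
TASK_KEYWORDS = {
    "name", "when", "tags", "register", "become", "vars",
    "loop", "notify", "with_items", "ignore_errors", "changed_when",
}


def module_sequence(tasks):
    """Return the module name used by each task, in file order."""
    seq = []
    for task in tasks:
        modules = [key for key in task if key not in TASK_KEYWORDS]
        if modules:
            seq.append(modules[0])
    return seq


def group_by_category(role_tasks):
    """Group (role, sequence) pairs by the prefix before the first underscore."""
    by_category = defaultdict(list)
    for role, tasks in role_tasks.items():
        category = role.split("_", 1)[0]  # deploy_foo -> deploy
        by_category[category].append((role, module_sequence(tasks)))
    return dict(by_category)


role_tasks = {
    "deploy_web": [{"name": "dirs", "file": {"path": "/srv"}},
                   {"template": {"src": "a.j2"}, "when": "enabled"}],
    "deploy_db": [{"file": {"path": "/var/db"}}, {"shell": "init.sh"}],
    "restore_db": [{"name": "unpack", "unarchive": {"src": "b.tgz"}}],
}
print(group_by_category(role_tasks))
```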

### Helm Charts

Renders a Helm chart with different values files and extracts Kubernetes `kind` sequences for grammar inference:

```python
import subprocess
from pathlib import Path

import yaml
from bex.ensemble import infer_ensemble

seqs = []
for vf in sorted(Path('ci/').glob('*-values.yaml')):
    out = subprocess.run(
        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
        capture_output=True, text=True, timeout=120,
    )
    if out.returncode == 0:
        kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
                 if d and isinstance(d, dict) and 'kind' in d]
        if kinds:
            seqs.append(kinds)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
print(f"Why: {result['why']}")
```

**Example output** (from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), 6 CI configs):

```
Best: iDRegEx (MDL 1432.99)
Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment

  iDRegEx     MDL=  1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
  CRX         MDL=  2651.74  (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?

Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
iDRegEx selected (MDL score 1433.0).
```

CRX captures *all* symbols that appear. iDRegEx finds only the minimal core that every config shares:
```
ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
```

Which grammar is more useful depends on the task:
- **CRX** tells you everything you *might* need — good for an agent generating a complete chart.
- **iDRegEx** tells you what you *always* need — the bootstrap pipeline that can't be skipped.

To select an algorithm without the ensemble comparison, pass `prefer='crx'` or `prefer='idregex'` to `infer_best_grammar`.

### Terraform

Parses `.tf` files to extract `resource` type sequences, per-file or per-directory:

```python
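import re
from pathlib import Path

from bex.ensemble import infer_ensemble

seqs = []
for tf in sorted(Path('.').rglob('*.tf')):
    # Capture the resource type from lines such as: resource "aws_instance" "web" {
    resources = re.findall(r'resource\s+"(\w+)"\s+"[\w-]+"\s*\{', tf.read_text())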
    if resources:
        seqs.append(resources)

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
print(f"Grammar: {result['best']['grammar']}")
```

**Example output** (from [terraform-guides](https://github.com/hashicorp/terraform-guides), hashistack example, 9 files):
```
Best: CRX (MDL 4.0, 9/9 match)
Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
```

**Grammar notation:**
- `a.b` — `a` followed by `b` (concatenation)
- `(a+b)` — either `a` or `b` (disjunction)
- `r?` — zero or one (optional)
- `r+` — one or more (iteration)
- `r+?` — zero or more (varies across examples)
- `(a|b)` — iDRegEx-style disjunction (equivalent to `(a+b)`)
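
To check a sequence against a grammar written in this notation, one option is to translate it into a Python `re` pattern. The sketch below is not part of `bex`; the helper names are hypothetical, and it assumes symbols are identifier-like names without spaces. A `+` counts as disjunction when a symbol or `(` follows it and as iteration otherwise.

```python
import re

TOKEN = re.compile(r"\+\?|[A-Za-z_][A-Za-z0-9_]*|[().+?|]")


def grammar_to_regex(grammar):
    """Translate the dotted grammar notation into a Python regex string."""
    tokens = TOKEN.findall(grammar)
    parts = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "+?":
            parts.append("*")  # zero or more
        elif tok == "+":
            starts_operand = nxt is not None and (
                nxt == "(" or nxt[0].isalpha() or nxt[0] == "_"
            )
            parts.append("|" if starts_operand else "+")
        elif tok in ("|", "?"):
            parts.append(tok)
        elif tok == ".":
            continue  # concatenation is implicit in regex syntax
        elif tok == "(":
            parts.append("(?:")
        elif tok == ")":
            parts.append(")")
        else:
            # Each symbol consumes its name plus one trailing space.
            parts.append("(?:" + re.escape(tok) + " )")
    return "".join(parts)


def matches(grammar, seq):
    """True if the whole sequence is accepted by the grammar."""
    text = "".join(symbol + " " for symbol in seq)
    return re.fullmatch(grammar_to_regex(grammar), text) is not None


print(matches("file.copy.unarchive+.command",
              ["file", "copy", "unarchive", "unarchive", "command"]))
```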

### Generic YAML

Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:

```python
from bex.yaml_to_seq import collect_all_sequences
from bex import infer_ensemble

results = collect_all_sequences('config_dir/')
seqs = [seq for _, seq in results]
result = infer_ensemble(seqs)
print(result['best']['grammar'])
```
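
A depth-first key-path traversal of a parsed document can be sketched as below. This is not the code in `bex/yaml_to_seq.py`; the function name `key_paths`, the dotted path format, and the choice to descend into list items without adding an index are assumptions.

```python
def key_paths(node, prefix=""):
    """Yield dotted key paths of a nested dict/list in depth-first order."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield path
            yield from key_paths(value, path)
    elif isinstance(node, list):
        for item in node:
            # List items share the parent's path (no index component).
            yield from key_paths(item, prefix)


doc = {"a": {"b": 1, "c": [{"d": 2}]}, "e": 3}
print(list(key_paths(doc)))
```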

## Papers

- **Bex et al.** *"Inference of Concise Regular Expressions and DTDs"*, ACM TODS 35(2), 2010
- **Bex et al.** *"Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data"*, arXiv:1004.2372

See `papers/` for extracted text and the original references.

## Tests

```bash
python -m pytest tests/
# or
python tests/test_bex.py
```

## License

MIT
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
+								# Grammar Inference Engine
 								Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.
 								## Quick Start
 								```bash
 								pip install pyyaml
 								python -m bex
 								```
 								```python
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
+								from bex import infer_ensemble
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
 								seqs = [
 								    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
 								    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
 								]
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
 								result = infer_ensemble(seqs)
 								print(f"Best: {result['best']['algorithm']}")
 								print(f"Grammar: {result['best']['grammar']}")
 								print(f"Score: {result['best']['mdl_score']}")
 								```
 								Or compare algorithms manually:
 								```python
 								from bex.crx import CRX
 								seqs = [...]
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
+								crx = CRX()
 								grammar = crx.infer(seqs)
 								print(grammar)
 								# file.template.docker_image.command.set_fact.shell.(wait_for)?
 								```
 								## Algorithms
 								| Algorithm | What it learns | Paper | Use case |
 								|-----------|---------------|-------|----------|
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
+								| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference, captures *all* symbols |
 								| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Finds the minimal core pattern |
 								| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Single-sequence grammar repair |
 								| **rwr²** | k-ORE from k-OA | arXiv 2010 | k-ORE extraction after Baum-Welch |
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
 								### Pipeline 1: Direct CHARE Inference (fast)
 								```
 								Example sequences → CRX → CHAREs grammar
 								```
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
+								CRX learns a grammar that accepts *all* observed symbols, marking optional ones with `?`. Best when the data is clean and you want the full vocabulary.
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
+								### Pipeline 2: Probabilistic k-ORE Inference (robust)
 								```
 								Example sequences → Complete k-OA → Baum-Welch (EM)
 								  → Disambiguate → Prune → rwr² → k-ORE grammar
 								```
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
+								iDRegEx learns the *minimum* common subsequence — symbols that appear in every example. Fails (∅) when the examples are too diverse.
 								### Pipeline 3: Ensemble (recommended)
 								```
 								Example sequences → [CRX, iDRegEx] → MDL score each → pick best
 								```
 								Runs both algorithms, scores each with Minimum Description Length, and returns the winner with an explanation. The MDL score penalizes overly general grammars: a grammar like `(a+b+c+...+z)+` that accepts everything gets a high data cost (`log2(|L(r)|)` is large), while a specific grammar like `a.b.c` has near-zero data cost.
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
+								## Architecture
 								```
 								bex/
 								├── crx.py          # CRX: direct CHARE inference (Algorithm 7, TODS)
 								├── idregex.py      # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
 								├── rwr0.py         # RWR₀: SORE repair (Algorithm 6, TODS)
 								├── rwrsq.py        # rwr²: k-ORE extraction (Algorithm 3, arXiv)
 								├── soa.py          # SOA: Symbolic Observation Automaton core
 								├── koa.py          # k-OA: k-testable Observation Automaton
 								├── ikoa.py         # iKoa: k-OA inference (Algorithm 1, arXiv)
 								├── twotinf.py      # 2T-INF: 2-testable inference (Algorithm 1, TODS)
 								├── baum_welch.py   # Baum-Welch EM training for k-OA
 								├── expr.py         # Expression utilities (concat, disj, star, strip)
 								├── marking.py      # State marking for determinism
 								├── yaml_to_seq.py  # Generic YAML → key-path sequence converter
 								├── role_grammar.py # Ansible role → module-sequence extractor
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
+								├── ensemble.py     # Ensemble: runs CRX + iDRegEx, picks best by MDL
 								├── mdl.py          # MDL scoring for grammar selection (fix)
 								├── mcp_server.py   # MCP server exposing 4 tools
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
+								└── ...
 								```
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
+								## MCP Server
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
+								A **Model Context Protocol** server exposes all algorithms and domain adapters as tools:
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
 								```bash
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
+								python -m bex.mcp_server
 								```
 								### Tools
 								| Tool | What it does |
 								|------|-------------|
 								| `infer_grammar(sequences, method, kmax, N)` | Core CRX or iDRegEx inference |
 								| `infer_best_grammar(sequences, prefer, kmax, N)` | **Ensemble:** runs both CRX and iDRegEx, picks the best by MDL score. Set `prefer='crx'` or `prefer='idregex'` to skip ensemble and return only that algorithm. Returns structured report with candidates, MDL scores, and a `Why:` explanation. |
 								| `infer_yaml_grammar(yaml_dir, pattern, method)` | Generic YAML → key-paths → grammar |
 								| `infer_ansible_role_grammar(roles_dir)` | Ansible role module sequences → per-category grammar |
 								### Using `infer_best_grammar`
 								The ensemble runs both algorithms and picks the best by MDL. To skip the comparison and run just one algorithm, pass `prefer`:
 								```
 								User: Run CRX on our deploy tasks.
 								Agent: [runs with prefer='crx']
 								Best: CRX (MDL 7.0)
 								Grammar: file.template.docker_image.command.set_fact.shell.wait_for?
 								  CRX  MDL=  7.00  file.template.docker_image.command.set_fact.shell.wait_for?
 								Why: Requested CRX only.
 								```
 								Without `prefer`, the ensemble compares both:
 								```
 								User: Find the grammar for our Helm chart.
 								Agent: [runs]
 								Best: iDRegEx (MDL 1432.99)
 								Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
 								  iDRegEx     MDL=  1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
 								  CRX         MDL=  2651.74  (Alertmanager+...+ValidatingWebhookConfiguration)+.Role?.RoleBinding?.Job+?
 								Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6 sequences,
 								iDRegEx matches 1/6. iDRegEx selected (MDL score 1433.0).
 								```
 								Both grammars are correct — they operate at different levels of specificity. The `Why:` field helps the agent decide which one to use for the task at hand.
 								## Ensemble Selection
 								The `infer_best_grammar` tool runs both CRX and iDRegEx, scores each with Minimum Description Length (MDL), and returns the best.
 								### How MDL scoring works
 								```
 								MDL = model_cost + data_cost
 								```
 								- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
 								- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A grammar that accepts *many* strings of the same length (like a 17-way disjunction `(a+b+...+q)+`) has high data cost because `|L(r)|` is large. A specific, fixed sequence (`a.b.c.d.e`) has `|L(r)| = 1` so data cost is zero.
 								The ensemble selects the grammar with the lowest total MDL, trading grammar size against how tightly the grammar fits the observed sequences.
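The scoring rule can be illustrated with a few lines of Python. This is a minimal sketch, not the engine's implementation: `mdl_score` is a hypothetical helper, and the language sizes are worked out by hand for the two grammars mentioned above (a fixed concatenation accepts one string of length 5, while a 17-way iterated disjunction accepts `17**n` strings of length `n`).

```python
import math

def mdl_score(model_cost, language_sizes):
    """MDL = model_cost + sum of log2(|L(r)| at each sequence's length).

    Hypothetical helper for illustration; language_sizes holds, for each
    example sequence, how many strings of that length the grammar accepts.
    """
    data_cost = sum(math.log2(size) for size in language_sizes)
    return model_cost + data_cost

# Three example sequences, each 5 symbols long.
lengths = [5, 5, 5]

# a.b.c.d.e: 5 symbols, exactly one accepted string of length 5.
fixed = mdl_score(5, [1 for _ in lengths])

# (a+b+...+q)+: 17 symbols, 17**n accepted strings of length n.
disjunction = mdl_score(17, [17 ** n for n in lengths])

# Pick the candidate with the lowest total score, as the ensemble does.
candidates = {"a.b.c.d.e": fixed, "(a+b+...+q)+": disjunction}
best = min(candidates, key=candidates.get)
print(best, round(fixed, 2), round(disjunction, 2))
```

Under these assumptions the fixed sequence scores 5.0 and the disjunction scores roughly 78.3, so the fixed sequence wins even though both grammars accept all three examples.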
 								### When each algorithm wins
 								| Scenario | Winner | Why |
 								|----------|--------|-----|
 								| Many sequences, diverse patterns | **CRX** | CRX captures the full vocabulary. iDRegEx can't find a common core. |
 								| Clean, structured sequences | **CRX** | CRX learns precise concatenation order with optional suffixes. iDRegEx may over-generalize. |
 								| Few sequences (2–3) | **iDRegEx** | CRX overfits to the limited data. iDRegEx's probabilistic approach handles noise better. |
 								| Sequences share a clear core | **iDRegEx** | iDRegEx extracts the minimal common subsequence. CRX buries it in a mass of optional symbols. |
 								| Single sequence | **iDRegEx** (with SOA repair) | RWR₀ repair pipeline produces a grammatical regex from one example. |
 								### Real-world benchmarks
 								Results from three domains using the ensemble (fixed MDL scoring):
 								```
 								Dataset                   Best       MDL      Matches
 								──────────────────────────────────────────────────────────
 								Helm (prom-stack)         iDRegEx    1433.0   1/6
 								Ansible (deploy)          CRX        246.1    34/36
 								Ansible (validate)        CRX        34.0     5/5
 								Ansible (restore)         CRX        24.0     2/2
 								Ansible (manage)          iDRegEx    25.0     1/2
 								Ansible (configure)       iDRegEx    22.5     1/4
 								Terraform (hashistack)    CRX        4.0      9/9
 								```
 								Note: MDL scores are not comparable across datasets — only within the same run
 								(CRX vs iDRegEx on the same sequences). The Helm score is higher because
 								each sequence is ~120 symbols long, making the data cost term dominant for
 								the overly-general CRX grammar (19 kinds × many lengths).
 								## Domain Adapters
 								### Ansible Roles
 								Extracts module names from `tasks/main.yml`, groups by category prefix (e.g., `deploy_foo` → `deploy`), and learns per-category grammars:
 								```python
 								from bex.ensemble import infer_ensemble
 								from bex.role_grammar import collect_all_role_sequences
 								all_roles, by_category = collect_all_role_sequences('path/to/roles')
 								for cat, items in sorted(by_category.items()):
 								    seqs = [s for _, s in items]
 								    if len(seqs) >= 2:
 								        result = infer_ensemble(seqs)
 								        print(f"── {cat} ({len(items)} roles) ──")
 								        print(f"  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
 								        print(f"  Grammar: {result['best']['grammar']}")
 								        print(f"  Why: {result['why']}")
 								```
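The grouping step described above (`deploy_foo` → `deploy`) might look roughly like the following. This is a sketch under assumptions, not the code behind `collect_all_role_sequences`: `category_of` and `group_by_category` are hypothetical names, the category is assumed to be the text before the first underscore, and the roles and module sequences are made up for illustration.

```python
from collections import defaultdict

def category_of(role_name):
    """Return the text before the first underscore, e.g. 'deploy_foo' -> 'deploy'."""
    return role_name.split("_", 1)[0]

def group_by_category(role_sequences):
    """Group (role_name, module_sequence) pairs by category prefix."""
    by_category = defaultdict(list)
    for role_name, modules in role_sequences:
        by_category[category_of(role_name)].append((role_name, modules))
    return dict(by_category)

# Hypothetical roles and module sequences, for illustration only.
roles = [
    ("deploy_foo", ["file", "template", "shell"]),
    ("deploy_bar", ["file", "template"]),
    ("restore_db", ["file", "copy", "unarchive", "command"]),
]
by_category = group_by_category(roles)
for cat, items in sorted(by_category.items()):
    print(cat, [seq for _, seq in items])
```

Each category's list of `(role_name, sequence)` pairs has the same shape that the loop above unpacks with `[s for _, s in items]`.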
 								**Example output** (from [companyweb](https://github.com/anomalyco/companyweb), 51 roles):
 								```
 								── restore (2 roles) ──
 								  Best: CRX (MDL 24.0)
 								  Grammar: file.copy.unarchive+.command
 								  Why: CRX (score 24.0) vs iDRegEx (score 33.0). Both match 2/2. CRX is more compact.
 								── validate (5 roles) ──
 								  Best: CRX (MDL 34.0)
 								  Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
 								  Why: CRX (score 34.0) matches 5/5, iDRegEx (score 49.5) matches 0/5.
 								── configure (4 roles) ──
 								  Best: iDRegEx (MDL 22.5)
 								  Grammar: include_role
 								  Why: iDRegEx (score 22.5) beats CRX (score 44.5). CRX overfits to diverse patterns.
 								```
 								### Helm Charts
 								Renders a Helm chart with different values files and extracts Kubernetes `kind` sequences for grammar inference:
 								```python
 								import subprocess
 								from pathlib import Path
 								import yaml
 								from bex.ensemble import infer_ensemble
 								seqs = []
 								for vf in sorted(Path('ci/').glob('*-values.yaml')):
 								    out = subprocess.run(
 								        ['helm', 'template', 'test', '.', '--skip-tests', '-f', str(vf)],
 								        capture_output=True, text=True, timeout=120,
 								    )
 								    if out.returncode == 0:
 								        kinds = [d['kind'] for d in yaml.safe_load_all(out.stdout)
 								                 if d and isinstance(d, dict) and 'kind' in d]
 								        if kinds:
 								            seqs.append(kinds)
 								result = infer_ensemble(seqs)
 								print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
 								print(f"Grammar: {result['best']['grammar']}")
 								print(f"Why: {result['why']}")
 								```
 								**Example output** (from [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), 6 CI configs):
 								```
 								Best: iDRegEx (MDL 1432.99)
 								Grammar: ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
 								  iDRegEx     MDL=  1432.99  ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
 								  CRX         MDL=  2651.74  (Alertmanager+ClusterRole+ClusterRoleBinding+ConfigMap+DaemonSet+...)+.Role?.RoleBinding?.Job+?
 								Why: iDRegEx (score 1433.0) vs CRX (score 2651.7). CRX matches 6/6, iDRegEx matches 1/6.
 								iDRegEx selected (MDL score 1433.0).
 								```
 								CRX captures *all* symbols that appear. iDRegEx finds only the minimal core that every config shares:
 								```
 								ServiceAccount.ClusterRole.ClusterRoleBinding.Service.Deployment
 								```
 								Which grammar is more useful depends on the task:
 								- **CRX** tells you everything you *might* need — good for an agent generating a complete chart.
 								- **iDRegEx** tells you what you *always* need — the bootstrap pipeline that can't be skipped.
 								Pass `prefer='crx'` or `prefer='idregex'` to `infer_best_grammar` to select one algorithm without the ensemble comparison.
 								### Terraform
 								Parses `.tf` files to extract `resource` type sequences, per-file or per-directory:
 								```python
 								import re
 								from pathlib import Path
 								from bex.ensemble import infer_ensemble
 								seqs = []
 								for tf in sorted(Path('.').rglob('*.tf')):
 								    # Capture the resource type; the resource name may contain hyphens.
 								    resources = re.findall(r'resource\s+"(\w+)"\s+"[^"]+"\s*\{', tf.read_text())
 								    if resources:
 								        seqs.append(resources)
 								result = infer_ensemble(seqs)
 								print(f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
 								print(f"Grammar: {result['best']['grammar']}")
 								```
 								**Example output** (from [terraform-guides](https://github.com/hashicorp/terraform-guides), hashistack example, 9 files):
 								```
 								Best: CRX (MDL 4.0, 9/9 match)
 								Grammar: azurerm_network_security_group?.tls_private_key?.azurerm_virtual_machine?.(azurerm_resource_group+azurerm_subnet+azurerm_virtual_network)+?.azurerm_network_security_rule?.null_resource?.azurerm_network_interface?.azurerm_public_ip?.random_id+?
 								```
 								**Grammar notation:**
 								- `a.b` — `a` followed by `b` (concatenation)
 								- `(a+b)` — either `a` or `b` (disjunction)
 								- `r?` — zero or one (optional)
 								- `r+` — one or more (iteration)
 								- `r+?` — zero or more (varies across examples)
 								- `(a|b)` — iDRegEx-style disjunction (equivalent to `(a+b)`)
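To check a sequence against a grammar by hand, the notation can be translated into Python's `re` syntax. The sketch below is illustrative only and is not the engine's own matcher: `sym`, `alt`, and `accepts` are hypothetical helpers, symbols are assumed to contain no spaces, and the two patterns are hand-translated from the Ansible example output above.

```python
import re

def sym(name):
    """Encode one symbol as a space-terminated token so names cannot run together."""
    return "(?:" + re.escape(name) + " )"

def alt(*names):
    """(a+b) in the grammar notation: exactly one of the listed symbols."""
    return "(?:" + "|".join(sym(n) for n in names) + ")"

def accepts(pattern, sequence):
    """Check whether the whole sequence is matched by the hand-translated pattern."""
    text = "".join(token + " " for token in sequence)
    return re.fullmatch(pattern, text) is not None

# file.copy.unarchive+.command
restore = sym("file") + sym("copy") + sym("unarchive") + "+" + sym("command")

# hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
# The notation's `+?` means zero or more, which is `*` in Python's re
# (in Python, `+?` would be a lazy one-or-more instead).
validate = (sym("hosts") + "?" + sym("shell") + "?"
            + alt("copy", "debug", "fail", "set_fact", "uri") + "*")

print(accepts(restore, ["file", "copy", "unarchive", "unarchive", "command"]))
print(accepts(restore, ["file", "copy", "command"]))
print(accepts(validate, ["shell", "debug", "uri"]))
```

The main trap is the `+` character: inside parentheses it separates alternatives, after a symbol or group it means repetition, and neither use maps one-to-one onto Python's operators.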
 								## Domain: Generic YAML
 								Converts any YAML file into key-path sequences (DFS traversal) for grammar inference:
 								```python
 								from bex.yaml_to_seq import collect_all_sequences
 								from bex import infer_ensemble
 								results = collect_all_sequences('config_dir/')
 								seqs = [seq for _, seq in results]
 								result = infer_ensemble(seqs)
 								print(result['best']['grammar'])
 								```
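The DFS key-path conversion can be pictured with a small stand-alone sketch. It is not the code in `bex.yaml_to_seq`: `key_paths` is a hypothetical helper, and both the `/` separator and the choice to walk list items without adding an index are assumptions made for illustration. The input is a plain dict of the kind `yaml.safe_load` returns.

```python
def key_paths(node, prefix=""):
    """Depth-first list of key paths in a parsed YAML document.

    Hypothetical helper: paths are joined with '/', and list items are
    traversed without adding an index to the path.
    """
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}/{key}" if prefix else str(key)
            paths.append(path)
            paths.extend(key_paths(value, path))
    elif isinstance(node, list):
        for item in node:
            paths.extend(key_paths(item, prefix))
    return paths

# A parsed document, as yaml.safe_load would return it.
doc = {
    "service": {"name": "web", "ports": [{"port": 80}, {"port": 443}]},
    "replicas": 2,
}
print(key_paths(doc))
```

Each file then contributes one such sequence, and the list of sequences is what gets passed to `infer_ensemble`.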
 								## Papers
 								- **Bex et al.** *"Inference of Concise Regular Expressions and DTDs"* — TODS 2010
 								- **Bex et al.** *"Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data"* — arXiv:1004.2372
 								See `papers/` for extracted text and the original references.
 								## Tests
 								```bash
 								python -m pytest tests/
 								# or
 								python tests/test_bex.py
 								```
 								## License
 								MIT