grammar-inference-engine/README.md

# Dervish MCP

<p align="center">
  <img src="dervish-logo.png" alt="Dervish" width="216">
</p>
<p align="center">
  <img src="https://img.shields.io/badge/license-MIT-blue" alt="License">
  <img src="https://img.shields.io/badge/python-3.10%2B-blue" alt="Python 3.10+">
  <img src="https://ci.corentic.eu/api/badges/7/status.svg" alt="CI Pipeline Status">
  <br>
  <a href="SHOWCASE.md">Showcase</a> ·
  <a href="#quick-start">Usage</a> ·
  <a href="#papers">Papers</a>
  <br><br>
  <a href="https://www.buymeacoffee.com/IjonTichy85"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy me a coffee" width="140"></a>
</p>

**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that captures the general pattern.

Every codebase has unwritten conventions — the order tasks appear in Ansible roles, the resources a Helm chart always creates, the steps every CI pipeline runs. Nobody writes these down. They emerge from copying and converging.

When an LLM agent needs to follow these conventions, it usually has two bad options:

1. **Stuff every existing file into context** — 15 Ansible roles = 5,000 tokens. You'll hit the context window by the third example.
2. **Guess from one or two examples** — the LLM infers a pattern and often gets it wrong.

Dervish replaces both with a **one-call MCP tool**: pass your sequences, get back a ~60-token grammar. 
By leveraging **Minimum Description Length (MDL) scoring**, Dervish treats the grammar discovery problem as an optimal compression task.  the resulting rule is optimized to consume as few tokens as possible without losing the pattern.

**Without Dervish:** token cost scales linearly with examples. **With Dervish:** one compact grammar describes them all — a ~60–200 token rule instead of thousands of tokens of raw examples. Try it out and you too will say:

<p align="center"><img src="dervish.gif" alt="Dervish animation" width="65%"></p>

## MCP Server

The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (pi.dev, opencode, vibe, etc.) and get grammar inference as a tool:

```json
{
  "mcpServers": {
    "dervish": {
      "command": "python3",
      "args": ["/path/to/bex/mcp_server.py"]
    }
  }
}
```

### Tools

| Tool | Parameters | What it does |
|------|-----------|-------------|
| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N`, `min_coverage` | **The only tool you need.** Runs CRX + iDRegEx + kOREInference, picks best by MDL. Set `prefer` to run only one algorithm. Set `min_coverage < 1.0` for optional core+outlier analysis. |

**Parameters explained:**
- **`prefer`**: `'crx'` for full vocabulary (accepts all sequences), `'idregex'` or `'koreinference'` for deterministic minimal core. Omit to let MDL pick the winner across all three.
- **`kmax`** (1–5): Context window for k-ORE inference (iDRegEx, kOREInference). Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
- **`N`** (1–10): Random trials for k-ORE inference. More = better convergence but slower. Default 3.
- **`min_coverage`** (0.5–1.0): **Optional core+outlier analysis.** When < 1.0, iteratively removes outlier sequences (those with the rarest symbols) until at least this fraction remain. Returns the core CRX grammar for the majority plus a list of removed outliers. Default 1.0 = disabled. Example: `min_coverage=0.8` finds the tight pattern for ~80% of examples while flagging the other ~20% as variants.

### Agent workflow

An LLM agent uses the MCP to discover an unwritten convention from existing examples — compressing hundreds of files into a single ~60-token rule:

```text
User: Generate a new Ansible role for installing PostgreSQL.

Agent: Let me check what pattern the existing community roles follow.
       I'll look at 15 popular geerlingguy roles.

       [finds role directories, extracts task module sequences,
        calls infer_best_grammar(sequences=..., prefer='crx')]

       Dervish returns:
         Best: CRX (MDL 288)
         Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
                  .include+?.(npm+pip)+?.lineinfile?

       This tells me: every role starts with a fail check for preconditions,
       then OS-specific variables, installs packages, configures with templates,
       starts services, and optionally handles language tooling (npm/pip).
       The role should end with a lineinfile tweak.

       I'll generate the new role following this structure.
```

**Without Dervish:** the agent stuffs 15 role files into context (5,000+ tokens per role = beyond any context window), or guesses the pattern from 1–2 examples and often gets it wrong.

**With Dervish:** one MCP call returns a ~60-token grammar known to match 15/15 existing roles. The agent follows it reliably.

**Core+outlier mode:** When generating a new role, the agent can call with
`min_coverage=0.8` to learn the mainstream pattern while seeing which roles
deviate and why — useful when the user's case resembles an outlier
(e.g., a PHP app like phpmyadmin that needs raw `lineinfile`).

## Quick Start

```bash
pip install pyyaml
python -m bex
```

```python
from bex import infer_ensemble

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]

result = infer_ensemble(seqs)
print(f"Best: {result['best']['algorithm']}")
print(f"Grammar: {result['best']['grammar']}")
print(f"Score: {result['best']['mdl_score']}")
```

## Why not just use a schema?

Many of the things developers build every day **have no formal schema**. They're free-form scripts, config files, or YAML blobs where the structure is emergent convention, not enforced specification. An LLM generating new content in these domains needs to know the convention — but it's never been written down.

Dervish discovers these conventions automatically from existing examples. The domains below are **just examples** of what it can do — the same approach works for any sequential data with an unwritten pattern.

| Domain | What gets extracted | Example extracted symbols | What Dervish discovers | Why it helps an LLM |
|--------|-------------------|--------------------------|----------------------|---------------------|
| Ansible roles | Module names from `tasks/main.yml` in order | `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `npm`, `pip`, `lineinfile` | `fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?` | "Validate preconditions first, then set vars, install packages, configure with templates, start services. Include sub-roles last." |
| Helm charts (cross-project, 15 charts) | K8s resource kinds from `helm template` output in rendered order | `NetworkPolicy`, `PodDisruptionBudget`, `ServiceAccount`, `Secret`, `ConfigMap`, `Service`, `Deployment`, `StatefulSet`, `ClusterRole`, `ClusterRoleBinding` | `NetworkPolicy?.PodDisruptionBudget?.ServiceAccount?.Secret?.ConfigMap?.PersistentVolumeClaim?.ClusterRole?.ClusterRoleBinding?.Service.Deployment?.StatefulSet?.(IngressClass+MutatingWebhookConfiguration)?.ValidatingWebhookConfiguration?.Job?` | "Writing a Helm chart? Start with resilience (PDB, NetworkPolicy), then identity (ServiceAccount, Secrets), then the Service, then your workload. Only cluster-wide tools need RBAC." |
| GitHub Actions (Go lint) | Step `uses:` or `run:` values from workflow YAML in job order | `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter` | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | "Starting a new Go project on GitHub Actions? Four independent projects converged on: checkout → setup Go → (optional golangci-lint) → (optional megalinter)." |


## Real-world Results

Dervish has been tested against public datasets from Ansible Galaxy, Helm, and GitHub Actions — all cases where multiple projects independently converged on an undocumented pattern. [**Full details → SHOWCASE.md**](SHOWCASE.md)

| Dataset | Best grammar | Compression |
|---------|-------------|-------------|
| Ansible Galaxy (15 roles) | `fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?` | 5,000 tokens → 60 tokens (83×) |
| Helm cross-project (15 charts) | `NetworkPolicy?.PodDisruptionBudget?.ServiceAccount?.Secret?.ConfigMap?...Service.Deployment?.StatefulSet?...` | 121 tokens → 35 tokens (3.5×) |
| Go lint (6 jobs) | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | ~900 tokens → 30 tokens (30×) |

The sweet spot: **multiple implementations of the same abstract task** with a shared but undocumented pattern. Not everything works — Dockerfiles, pre-commit configs, and schema-enforced formats are too rigid or too diverse to yield a convention.

> **kOREInference note:** Algorithm 4 (iDRegEx with MDL, arXiv 1004.2372) is included for paper-faithful correctness. On real tool-sequence data, its rwr₀ repair step returns ∅ because the k-OA is rarely SORE (interconnected symbols). The ensemble falls back to CRX or iDRegEx automatically.

## Algorithm Selection Guide

| When | Use | Why |
|------|-----|-----|
| Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
| Few examples, or want minimal common core | **iDRegEx** or **kOREInference** | Probabilistic EM, finds only what's shared. |
| Don't know which is better | **Ensemble (default)** | Runs all three, picks best by MDL score. |
| Want core pattern + outlier detection | **Ensemble + `min_coverage<1`** | Finds tight grammar for majority, flags outliers. |
| Data is clearly one type | `prefer='crx'` | Skips ensemble comparison, runs CRX alone. |

## When each algorithm wins

| Data property | Winner | Why |
|---------------|--------|-----|
| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
| Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
| Interconnected (non-SORE) data | CRX | kOREInference (rwr₀) returns ∅ when k-OA is not SORE. CRX handles it. |
| Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
| 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
| Want majority pattern + outlier list | CRX + `min_coverage` | Core analysis finds tight grammar for ~80%, flags the rest. |

## Token savings

<p align="center">
  <img src="chart_token_savings.png" alt="Token savings per dataset" width="65%">
</p>

Across all public benchmarks, Dervish delivers **40–83× compression**. The grammar is smaller than a single example file would be — and it represents the entire dataset.

## How MDL scoring works

```text
MDL = model_cost + data_cost
```

- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A specific fixed sequence (`a.b.c.d.e`) has data cost zero because |L(r)| = 1. A grammar that accepts *many* strings of the same length (like `(a+b+...+q)+`) has high data cost.

The ensemble selects the grammar with the lowest total MDL.

## Grammar Notation

- `a.b` — `a` followed by `b` (concatenation)
- `(a+b)` — either `a` or `b` (disjunction)
- `r?` — zero or one (optional)
- `r+` — one or more (iteration)
- `r+?` — zero or more (varies across examples)

## Papers

- **Bex et al.** *[Learning Deterministic Regular Expressions for the Web](https://doi.org/10.1145/1806907.1806911)* — TODS 2010
- **Bex et al.** *[Simplifying XML Schema: Single-Type Approximations of Regular Expressions](https://arxiv.org/abs/1004.2372)* — arXiv:1004.2372

## Tests

```bash
python -m pytest tests/
```

## License

MIT
-												rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs

											
										
										
											2026-07-01 11:05:03 +02:00
+								# Dervish MCP
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
-												docs: center logo, move coffee button below nav links

											
										
										
											2026-07-01 15:41:35 +02:00
+								<p align="center">
-												docs: bump logo 180→216px

											
										
										
											2026-07-01 15:41:55 +02:00
+								  <img src="dervish-logo.png" alt="Dervish" width="216">
-												fix badge position; purge remaining German user-reference comments

											
										
										
											2026-07-01 13:28:55 +02:00
+								</p>
-												docs: badges row, nav links, language tags on code blocks

											
										
										
											2026-07-01 15:24:00 +02:00
+								<p align="center">
 								  <img src="https://img.shields.io/badge/license-MIT-blue" alt="License">
 								  <img src="https://img.shields.io/badge/python-3.10%2B-blue" alt="Python 3.10+">
-												fix pipeline: editable install; add CI badge to README

											
										
										
											2026-07-01 13:19:11 +02:00
+								  <img src="https://ci.corentic.eu/api/badges/7/status.svg" alt="CI Pipeline Status">
-												docs: badges row, nav links, language tags on code blocks

											
										
										
											2026-07-01 15:24:00 +02:00
+								  <br>
 								  <a href="SHOWCASE.md">Showcase</a> ·
 								  <a href="#quick-start">Usage</a> ·
 								  <a href="#papers">Papers</a>
-												docs: extra blank line above coffee button

											
										
										
											2026-07-01 15:42:14 +02:00
+								  <br><br>
-												docs: center logo, move coffee button below nav links

											
										
										
											2026-07-01 15:41:35 +02:00
+								  <a href="https://www.buymeacoffee.com/IjonTichy85"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy me a coffee" width="140"></a>
-												fix pipeline: editable install; add CI badge to README

											
										
										
											2026-07-01 13:19:11 +02:00
+								</p>
-												Rename to Dervish, add animated logo to README

											
										
										
											2026-07-01 10:19:08 +02:00
-												rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs

											
										
										
											2026-07-01 11:05:03 +02:00
+								**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that captures the general pattern.
 								Every codebase has unwritten conventions — the order tasks appear in Ansible roles, the resources a Helm chart always creates, the steps every CI pipeline runs. Nobody writes these down. They emerge from copying and converging.
 								When an LLM agent needs to follow these conventions, it usually has two bad options:
 . **Stuff every existing file into context** — 15 Ansible roles = 5,000 tokens. You'll hit the context window by the third example.
 . **Guess from one or two examples** — the LLM infers a pattern and often gets it wrong.
-												Update README.md

											
										
										
											2026-07-02 16:33:02 +00:00
+								Dervish replaces both with a **one-call MCP tool**: pass your sequences, get back a ~60-token grammar.
-												Update README.md

											
										
										
											2026-07-02 18:06:38 +00:00
+								By leveraging **Minimum Description Length (MDL) scoring**, Dervish treats the grammar discovery problem as an optimal compression task.  the resulting rule is optimized to consume as few tokens as possible without losing the pattern.
-												rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs

											
										
										
											2026-07-01 11:05:03 +02:00
-												Update README.md

											
										
										
											2026-07-01 09:33:02 +00:00
+								**Without Dervish:** token cost scales linearly with examples. **With Dervish:** one compact grammar describes them all — a ~60–200 token rule instead of thousands of tokens of raw examples. Try it out and you too will say:
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
-												gif 65%

											
										
										
											2026-07-01 11:52:51 +02:00
+								<p align="center"><img src="dervish.gif" alt="Dervish animation" width="65%"></p>
-												restore original dervish.gif below explanation; add new logo left-aligned, smaller

											
										
										
											2026-07-01 11:18:13 +02:00
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
+								## MCP Server
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								The primary interface is a **Model Context Protocol (MCP)** server. Connect any MCP-compatible client (pi.dev, opencode, vibe, etc.) and get grammar inference as a tool:
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
 								```json
 								{
 								  "mcpServers": {
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								    "dervish": {
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
+								      "command": "python3",
 								      "args": ["/path/to/bex/mcp_server.py"]
 								    }
 								  }
 								}
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
+								```
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
+								### Tools
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								| Tool | Parameters | What it does |
 								|------|-----------|-------------|
-												docs: add min_coverage to MCP tool + README, include core in output

											
										
										
											2026-07-01 15:16:24 +02:00
+								| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N`, `min_coverage` | **The only tool you need.** Runs CRX + iDRegEx + kOREInference, picks best by MDL. Set `prefer` to run only one algorithm. Set `min_coverage < 1.0` for optional core+outlier analysis. |
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
 								**Parameters explained:**
-												docs: add min_coverage to MCP tool + README, include core in output

											
										
										
											2026-07-01 15:16:24 +02:00
+								- **`prefer`**: `'crx'` for full vocabulary (accepts all sequences), `'idregex'` or `'koreinference'` for deterministic minimal core. Omit to let MDL pick the winner across all three.
 								- **`kmax`** (1–5): Context window for k-ORE inference (iDRegEx, kOREInference). Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
 								- **`N`** (1–10): Random trials for k-ORE inference. More = better convergence but slower. Default 3.
 								- **`min_coverage`** (0.5–1.0): **Optional core+outlier analysis.** When < 1.0, iteratively removes outlier sequences (those with the rarest symbols) until at least this fraction remain. Returns the core CRX grammar for the majority plus a list of removed outliers. Default 1.0 = disabled. Example: `min_coverage=0.8` finds the tight pattern for ~80% of examples while flagging the other ~20% as variants.
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
+								### Agent workflow
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								An LLM agent uses the MCP to discover an unwritten convention from existing examples — compressing hundreds of files into a single ~60-token rule:
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
-												docs: badges row, nav links, language tags on code blocks

											
										
										
											2026-07-01 15:24:00 +02:00
+								```text
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
+								User: Generate a new Ansible role for installing PostgreSQL.
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								Agent: Let me check what pattern the existing community roles follow.
 								       I'll look at 15 popular geerlingguy roles.
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								       [finds role directories, extracts task module sequences,
 								        calls infer_best_grammar(sequences=..., prefer='crx')]
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								       Dervish returns:
 								         Best: CRX (MDL 288)
 								         Grammar: fail?.(include_vars+set_fact+package+file+template+service+...)+
 								                  .include+?.(npm+pip)+?.lineinfile?
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								       This tells me: every role starts with a fail check for preconditions,
 								       then OS-specific variables, installs packages, configures with templates,
 								       starts services, and optionally handles language tooling (npm/pip).
 								       The role should end with a lineinfile tweak.
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								       I'll generate the new role following this structure.
 								```
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								**Without Dervish:** the agent stuffs 15 role files into context (5,000+ tokens per role = beyond any context window), or guesses the pattern from 1–2 examples and often gets it wrong.
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								**With Dervish:** one MCP call returns a ~60-token grammar known to match 15/15 existing roles. The agent follows it reliably.
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
-												docs: update README and SHOWCASE for kOREInference + core/outlier analysis

											
										
										
											2026-07-01 15:18:32 +02:00
+								**Core+outlier mode:** When generating a new role, the agent can call with
 								`min_coverage=0.8` to learn the mainstream pattern while seeing which roles
 								deviate and why — useful when the user's case resembles an outlier
 								(e.g., a PHP app like phpmyadmin that needs raw `lineinfile`).
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
+								## Quick Start
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
+								```bash
 								pip install pyyaml
 								python -m bex
 								```
 								```python
 								from bex import infer_ensemble
 								seqs = [
 								    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
 								    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
 								]
 								result = infer_ensemble(seqs)
 								print(f"Best: {result['best']['algorithm']}")
 								print(f"Grammar: {result['best']['grammar']}")
 								print(f"Score: {result['best']['mdl_score']}")
 								```
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
+								## Why not just use a schema?
 								Many of the things developers build every day **have no formal schema**. They're free-form scripts, config files, or YAML blobs where the structure is emergent convention, not enforced specification. An LLM generating new content in these domains needs to know the convention — but it's never been written down.
 								Dervish discovers these conventions automatically from existing examples. The domains below are **just examples** of what it can do — the same approach works for any sequential data with an unwritten pattern.
 								| Domain | What gets extracted | Example extracted symbols | What Dervish discovers | Why it helps an LLM |
 								|--------|-------------------|--------------------------|----------------------|---------------------|
 								| Ansible roles | Module names from `tasks/main.yml` in order | `fail`, `include_vars`, `set_fact`, `package`, `file`, `template`, `service`, `npm`, `pip`, `lineinfile` | `fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?` | "Validate preconditions first, then set vars, install packages, configure with templates, start services. Include sub-roles last." |
-												feat: replace single-chart Helm with cross-project convention (15 charts, 6 publishers)

											
										
										
											2026-07-01 16:00:04 +02:00
+								| Helm charts (cross-project, 15 charts) | K8s resource kinds from `helm template` output in rendered order | `NetworkPolicy`, `PodDisruptionBudget`, `ServiceAccount`, `Secret`, `ConfigMap`, `Service`, `Deployment`, `StatefulSet`, `ClusterRole`, `ClusterRoleBinding` | `NetworkPolicy?.PodDisruptionBudget?.ServiceAccount?.Secret?.ConfigMap?.PersistentVolumeClaim?.ClusterRole?.ClusterRoleBinding?.Service.Deployment?.StatefulSet?.(IngressClass+MutatingWebhookConfiguration)?.ValidatingWebhookConfiguration?.Job?` | "Writing a Helm chart? Start with resilience (PDB, NetworkPolicy), then identity (ServiceAccount, Secrets), then the Service, then your workload. Only cluster-wide tools need RBAC." |
-												docs: fix Go lint description (both optional), format outliers in SHOWCASE usage

											
										
										
											2026-07-01 15:21:55 +02:00
+								| GitHub Actions (Go lint) | Step `uses:` or `run:` values from workflow YAML in job order | `actions/checkout`, `actions/setup-go`, `golangci/golangci-lint-action`, `megalinter/megalinter` | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | "Starting a new Go project on GitHub Actions? Four independent projects converged on: checkout → setup Go → (optional golangci-lint) → (optional megalinter)." |
-												remove Terraform showcase (everything-optional grammar isn't useful); fix GHA scope claim

											
										
										
											2026-07-01 10:42:08 +02:00
-												purge Portainer references, format-specific tools, and Domain Adapters section; make showcases concrete with extracted types

											
										
										
											2026-07-01 10:36:04 +02:00
-												Update README and SHOWCASE with real-world dataset evaluations

README:
- Replace outdated company benchmarks with public showcases
- Add Algorithm Selection Guide
- Add 'When each algorithm wins' table
- Add 'Why grammar inference?' table with value prop for LLMs
- Add 'What doesn't work' section documenting failed approaches
- Update all domain adapter examples with public results
- Clean up outdated references (companyweb roles, hashistack terraform)

SHOWCASE:
- Add Helm (kube-prometheus-stack) with iDRegEx minimal core
- Add Docker Compose per-project patterns
- Add GitHub Actions cross-project Go lint pattern
- Add Terraform modules with vocabulary analysis
- Add 'What doesn't work' section
- Explain WHY each dataset helps an LLM

											
										
										
											2026-07-01 10:04:10 +02:00
+								## Real-world Results
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
-												deduplicate: replace detailed Real-world Results with summary table linking to SHOWCASE.md

											
										
										
											2026-07-01 11:05:26 +02:00
+								Dervish has been tested against public datasets from Ansible Galaxy, Helm, and GitHub Actions — all cases where multiple projects independently converged on an undocumented pattern. [**Full details → SHOWCASE.md**](SHOWCASE.md)
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
-												deduplicate: replace detailed Real-world Results with summary table linking to SHOWCASE.md

											
										
										
											2026-07-01 11:05:26 +02:00
+								| Dataset | Best grammar | Compression |
 								|---------|-------------|-------------|
 								| Ansible Galaxy (15 roles) | `fail?.(include_vars+set_fact+package+file+template+service+...)+.include+?.(npm+pip)+?.lineinfile?` | 5,000 tokens → 60 tokens (83×) |
-												feat: replace single-chart Helm with cross-project convention (15 charts, 6 publishers)

											
										
										
											2026-07-01 16:00:04 +02:00
+								| Helm cross-project (15 charts) | `NetworkPolicy?.PodDisruptionBudget?.ServiceAccount?.Secret?.ConfigMap?...Service.Deployment?.StatefulSet?...` | 121 tokens → 35 tokens (3.5×) |
-												deduplicate: replace detailed Real-world Results with summary table linking to SHOWCASE.md

											
										
										
											2026-07-01 11:05:26 +02:00
+								| Go lint (6 jobs) | `actions/checkout.(actions/setup-go+run:echo+run:sudo)+.golangci/golangci-lint-action?.megalinter?` | ~900 tokens → 30 tokens (30×) |
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
-												deduplicate: replace detailed Real-world Results with summary table linking to SHOWCASE.md

											
										
										
											2026-07-01 11:05:26 +02:00
+								The sweet spot: **multiple implementations of the same abstract task** with a shared but undocumented pattern. Not everything works — Dockerfiles, pre-commit configs, and schema-enforced formats are too rigid or too diverse to yield a convention.
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
-												docs: update README and SHOWCASE for kOREInference + core/outlier analysis

											
										
										
											2026-07-01 15:18:32 +02:00
+								> **kOREInference note:** Algorithm 4 (iDRegEx with MDL, arXiv 1004.2372) is included for paper-faithful correctness. On real tool-sequence data, its rwr₀ repair step returns ∅ because the k-OA is rarely SORE (interconnected symbols). The ensemble falls back to CRX or iDRegEx automatically.
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
+								## Algorithm Selection Guide
 								| When | Use | Why |
 								|------|-----|-----|
 								| Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
-												docs: add min_coverage to MCP tool + README, include core in output

											
										
										
											2026-07-01 15:16:24 +02:00
+								| Few examples, or want minimal common core | **iDRegEx** or **kOREInference** | Probabilistic EM, finds only what's shared. |
 								| Don't know which is better | **Ensemble (default)** | Runs all three, picks best by MDL score. |
 								| Want core pattern + outlier detection | **Ensemble + `min_coverage<1`** | Finds tight grammar for majority, flags outliers. |
 								| Data is clearly one type | `prefer='crx'` | Skips ensemble comparison, runs CRX alone. |
-												Move MCP server to top of README — it's the primary interface

Restructure: MCP Server first (with agent workflow example), then
Why grammar inference / showcases, then Quick Start, then details.
This matches how users actually interact with the project: via MCP tools.

											
										
										
											2026-07-01 10:18:10 +02:00
-												Update README and SHOWCASE with real-world dataset evaluations

README:
- Replace outdated company benchmarks with public showcases
- Add Algorithm Selection Guide
- Add 'When each algorithm wins' table
- Add 'Why grammar inference?' table with value prop for LLMs
- Add 'What doesn't work' section documenting failed approaches
- Update all domain adapter examples with public results
- Clean up outdated references (companyweb roles, hashistack terraform)

SHOWCASE:
- Add Helm (kube-prometheus-stack) with iDRegEx minimal core
- Add Docker Compose per-project patterns
- Add GitHub Actions cross-project Go lint pattern
- Add Terraform modules with vocabulary analysis
- Add 'What doesn't work' section
- Explain WHY each dataset helps an LLM

											
										
										
											2026-07-01 10:04:10 +02:00
+								## When each algorithm wins
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
-												Update README and SHOWCASE with real-world dataset evaluations

README:
- Replace outdated company benchmarks with public showcases
- Add Algorithm Selection Guide
- Add 'When each algorithm wins' table
- Add 'Why grammar inference?' table with value prop for LLMs
- Add 'What doesn't work' section documenting failed approaches
- Update all domain adapter examples with public results
- Clean up outdated references (companyweb roles, hashistack terraform)

SHOWCASE:
- Add Helm (kube-prometheus-stack) with iDRegEx minimal core
- Add Docker Compose per-project patterns
- Add GitHub Actions cross-project Go lint pattern
- Add Terraform modules with vocabulary analysis
- Add 'What doesn't work' section
- Explain WHY each dataset helps an LLM

											
										
										
											2026-07-01 10:04:10 +02:00
+								| Data property | Winner | Why |
 								|---------------|--------|-----|
-												docs: update README and SHOWCASE for kOREInference + core/outlier analysis

											
										
										
											2026-07-01 15:18:32 +02:00
+								| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
-												Update README and SHOWCASE with real-world dataset evaluations

README:
- Replace outdated company benchmarks with public showcases
- Add Algorithm Selection Guide
- Add 'When each algorithm wins' table
- Add 'Why grammar inference?' table with value prop for LLMs
- Add 'What doesn't work' section documenting failed approaches
- Update all domain adapter examples with public results
- Clean up outdated references (companyweb roles, hashistack terraform)

SHOWCASE:
- Add Helm (kube-prometheus-stack) with iDRegEx minimal core
- Add Docker Compose per-project patterns
- Add GitHub Actions cross-project Go lint pattern
- Add Terraform modules with vocabulary analysis
- Add 'What doesn't work' section
- Explain WHY each dataset helps an LLM

											
										
										
											2026-07-01 10:04:10 +02:00
+								| Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
-												docs: update README and SHOWCASE for kOREInference + core/outlier analysis

											
										
										
											2026-07-01 15:18:32 +02:00
+								| Interconnected (non-SORE) data | CRX | kOREInference (rwr₀) returns ∅ when k-OA is not SORE. CRX handles it. |
-												Update README and SHOWCASE with real-world dataset evaluations

README:
- Replace outdated company benchmarks with public showcases
- Add Algorithm Selection Guide
- Add 'When each algorithm wins' table
- Add 'Why grammar inference?' table with value prop for LLMs
- Add 'What doesn't work' section documenting failed approaches
- Update all domain adapter examples with public results
- Clean up outdated references (companyweb roles, hashistack terraform)

SHOWCASE:
- Add Helm (kube-prometheus-stack) with iDRegEx minimal core
- Add Docker Compose per-project patterns
- Add GitHub Actions cross-project Go lint pattern
- Add Terraform modules with vocabulary analysis
- Add 'What doesn't work' section
- Explain WHY each dataset helps an LLM

											
										
										
											2026-07-01 10:04:10 +02:00
+								| Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
 								| 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
 								| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
-												docs: add min_coverage to MCP tool + README, include core in output

											
										
										
											2026-07-01 15:16:24 +02:00
+								| Want majority pattern + outlier list | CRX + `min_coverage` | Core analysis finds tight grammar for ~80%, flags the rest. |
-												Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post

- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: CRX algorithm for wide coverage (accepts all sequences, large vocabulary)
- iDRegEx: iDRegEx for minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive

											
										
										
											2026-07-01 09:51:41 +02:00
-												rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs

											
										
										
											2026-07-01 11:05:03 +02:00
+								## Token savings
 								<p align="center">
-												gif 65%

											
										
										
											2026-07-01 11:52:51 +02:00
+								  <img src="chart_token_savings.png" alt="Token savings per dataset" width="65%">
-												rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs

											
										
										
											2026-07-01 11:05:03 +02:00
+								</p>
 								Across all public benchmarks, Dervish delivers **40–83× compression**. The grammar is smaller than a single example file would be — and it represents the entire dataset.
-												Update README and SHOWCASE with real-world dataset evaluations

README:
- Replace outdated company benchmarks with public showcases
- Add Algorithm Selection Guide
- Add 'When each algorithm wins' table
- Add 'Why grammar inference?' table with value prop for LLMs
- Add 'What doesn't work' section documenting failed approaches
- Update all domain adapter examples with public results
- Clean up outdated references (companyweb roles, hashistack terraform)

SHOWCASE:
- Add Helm (kube-prometheus-stack) with iDRegEx minimal core
- Add Docker Compose per-project patterns
- Add GitHub Actions cross-project Go lint pattern
- Add Terraform modules with vocabulary analysis
- Add 'What doesn't work' section
- Explain WHY each dataset helps an LLM

											
										
										
											2026-07-01 10:04:10 +02:00
+								## How MDL scoring works
-												docs: badges row, nav links, language tags on code blocks

											
										
										
											2026-07-01 15:24:00 +02:00
+								```text
-												Update README and SHOWCASE with real-world dataset evaluations

README:
- Replace outdated company benchmarks with public showcases
- Add Algorithm Selection Guide
- Add 'When each algorithm wins' table
- Add 'Why grammar inference?' table with value prop for LLMs
- Add 'What doesn't work' section documenting failed approaches
- Update all domain adapter examples with public results
- Clean up outdated references (companyweb roles, hashistack terraform)

SHOWCASE:
- Add Helm (kube-prometheus-stack) with iDRegEx minimal core
- Add Docker Compose per-project patterns
- Add GitHub Actions cross-project Go lint pattern
- Add Terraform modules with vocabulary analysis
- Add 'What doesn't work' section
- Explain WHY each dataset helps an LLM

											
										
										
											2026-07-01 10:04:10 +02:00
+								MDL = model_cost + data_cost
 								```
 								- **model_cost** — number of unique alphabet symbols in the grammar. Simpler grammars are cheaper.
 								- **data_cost** — Σ log₂(|L(r) at length len(s)|) across all sequences. A specific fixed sequence (`a.b.c.d.e`) has data cost zero because |L(r)| = 1. A grammar that accepts *many* strings of the same length (like `(a+b+...+q)+`) has high data cost.
 								The ensemble selects the grammar with the lowest total MDL.
 								## Grammar Notation
 								- `a.b` — `a` followed by `b` (concatenation)
 								- `(a+b)` — either `a` or `b` (disjunction)
 								- `r?` — zero or one (optional)
 								- `r+` — one or more (iteration)
 								- `r+?` — zero or more (varies across examples)
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
+								## Papers
-												rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs

											
										
										
											2026-07-01 11:05:03 +02:00
+								- **Bex et al.** *[Learning Deterministic Regular Expressions for the Web](https://doi.org/10.1145/1806907.1806911)* — TODS 2010
 								- **Bex et al.** *[Simplifying XML Schema: Single-Type Approximations of Regular Expressions](https://arxiv.org/abs/1004.2372)* — arXiv:1004.2372
-												Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing

											
										
										
											2026-07-01 08:01:16 +02:00
 								## Tests
 								```bash
 								python -m pytest tests/
 								```
 								## License
 								MIT