Update README.md

3 changed files with 65 additions and 9 deletions
--- a/README.md
+++ b/README.md
@@ -17,16 +17,17 @@

 **Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that captures the general pattern.

-Every codebase has unwritten conventions — the order tasks appear in Ansible roles, the resources a Helm chart always creates, the steps every CI pipeline runs. Nobody writes these down. They emerge from copying and converging.
+Every codebase has unwritten conventions, such as the order in which tasks appear in Ansible roles, the resources a Helm chart always creates, and the steps every CI pipeline runs. Nobody writes these down. They emerge from copying and converging.

 When an LLM agent needs to follow these conventions, it usually has two bad options:

-1. **Stuff every existing file into context** — 15 Ansible roles = 5,000 tokens. You'll hit the context window by the third example.
-2. **Guess from one or two examples** — the LLM infers a pattern and often gets it wrong.
+1. **Stuff every existing file into context** - you'll hit the context window by the third example.
+2. **Guess from one or two examples** - the LLM infers a pattern and often gets it wrong.

-Dervish replaces both with a **one-call MCP tool**: pass your sequences, get back a ~60-token grammar. A rule you can trust, at a fraction of the cost.
+Dervish replaces both with a **one-call MCP tool**: pass your sequences, get back a ~60-token grammar.
+By using **Minimum Description Length (MDL) scoring**, Dervish treats grammar discovery as a compression task. The resulting rule is optimized to use as few tokens as possible without losing the pattern.

-**Without Dervish:** token cost scales linearly with examples. **With Dervish:** one compact grammar describes them all — a ~60–200 token rule instead of thousands of tokens of raw examples. Try it out and you too will say:
+**Without Dervish:** token cost scales linearly with examples. **With Dervish:** one compact grammar describes them all in a ~60–200-token rule instead of thousands of tokens of raw examples. Try it out and you too will say:

 <p align="center"><img src="dervish.gif" alt="Dervish animation" width="65%"></p>

@@ -59,7 +60,7 @@ The primary interface is a **Model Context Protocol (MCP)** server. Connect any

 ### Agent workflow

-An LLM agent uses the MCP to discover an unwritten convention from existing examples — compressing hundreds of files into a single ~60-token rule:
+An LLM agent uses the MCP server to discover a schema from existing examples, compressing hundreds of files into a single ~60-token rule:

 ```text
 User: Generate a new Ansible role for installing PostgreSQL.
@@ -83,12 +84,12 @@ Agent: Let me check what pattern the existing community roles follow.
       I'll generate the new role following this structure.
 ```

-**Without Dervish:** the agent stuffs 15 role files into context (5,000+ tokens per role = beyond any context window), or guesses the pattern from 1–2 examples and often gets it wrong.
+**Without Dervish:** the agent either reads all 15 role files (5,000+ tokens per role) or guesses the pattern from 1–2 examples and often gets it wrong.

 **With Dervish:** one MCP call returns a ~60-token grammar known to match 15/15 existing roles. The agent follows it reliably.

-**Core+outlier mode:** When generating a new role, the agent can call with
-`min_coverage=0.8` to learn the mainstream pattern while seeing which roles
+**Core+outlier mode:** When generating a new file, for example a new Ansible role, the agent can call the tool with
+`min_coverage=0.8` to learn the mainstream pattern while seeing which files
 deviate and why — useful when the user's case resembles an outlier
 (e.g., a PHP app like phpmyadmin that needs raw `lineinfile`).
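
The MDL scoring and `min_coverage` behaviour described in the README hunk above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `(symbol, is_optional)` candidate format, the cost function, the helper names, and the sample sequences are hypothetical and are not Dervish's actual API or algorithm. The idea is only that a grammar's score is the cost of writing the grammar plus the cost of spelling out every sequence it fails to match, and that `min_coverage` filters out candidates that match too few examples.

```python
import re

# A candidate grammar is a list of (symbol, is_optional) pairs. This toy
# format and the scoring below are illustrative assumptions, not Dervish's API.
Candidate = list[tuple[str, bool]]


def to_regex(candidate: Candidate) -> re.Pattern:
    """Compile a candidate into a regex over space-terminated symbols."""
    parts = []
    for symbol, optional in candidate:
        token = re.escape(symbol) + " "
        parts.append(f"(?:{token})?" if optional else token)
    return re.compile("".join(parts))


def encode(sequence: list[str]) -> str:
    """Encode a sequence so that every symbol ends with one space."""
    return "".join(symbol + " " for symbol in sequence)


def outliers_of(candidate: Candidate, sequences: list[list[str]]) -> list[list[str]]:
    """Return the sequences the candidate does not match in full."""
    regex = to_regex(candidate)
    return [seq for seq in sequences if not regex.fullmatch(encode(seq))]


def description_length(candidate: Candidate, sequences: list[list[str]]) -> int:
    """Crude two-part score: grammar size plus literal cost of the outliers."""
    model_cost = len(candidate) + sum(optional for _, optional in candidate)
    data_cost = sum(len(seq) for seq in outliers_of(candidate, sequences))
    return model_cost + data_cost


def best_grammar(candidates, sequences, min_coverage=1.0):
    """Pick the lowest-scoring candidate that matches enough sequences."""
    eligible = []
    for candidate in candidates:
        outliers = outliers_of(candidate, sequences)
        coverage = (len(sequences) - len(outliers)) / len(sequences)
        if coverage >= min_coverage:
            eligible.append((description_length(candidate, sequences), candidate, outliers))
    if not eligible:
        raise ValueError("no candidate reaches the requested coverage")
    _, candidate, outliers = min(eligible, key=lambda item: item[0])
    return candidate, outliers


# Hypothetical task sequences from five roles (made-up data).
sequences = [
    ["apt", "template", "service"],
    ["apt", "template", "service"],
    ["apt", "service"],
    ["apt", "template", "service"],
    ["lineinfile", "service"],
]
strict = [("apt", False), ("template", False), ("service", False)]
optional_template = [("apt", False), ("template", True), ("service", False)]
permissive = [("apt", True), ("lineinfile", True), ("template", True), ("service", False)]
candidates = [strict, optional_template, permissive]

core, outliers = best_grammar(candidates, sequences, min_coverage=0.8)
```

With `min_coverage=0.8`, this sketch selects `optional_template` and reports the `lineinfile` role as the single outlier. With `min_coverage=1.0`, only `permissive` qualifies, which covers everything at the price of a longer, looser grammar.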

--- a/chart_token_savings.png
+++ b/chart_token_savings.png
--- a/generate_chart.py
+++ b/generate_chart.py
@@ -0,0 +1,55 @@
+import matplotlib
+import matplotlib.font_manager as fm
+fm._load_fontmanager(try_read_cache=False)  # private matplotlib helper: rebuild the font cache so a newly installed Comic Neue is found
+import matplotlib.pyplot as plt
+import numpy as np
+from matplotlib import patheffects
+
+plt.rcParams.update({
+    "font.family": "Comic Neue",
+    "font.size": 14.0,
+    "path.sketch": (1, 100, 2),  # (scale, length, randomness) of the hand-drawn wobble
+    "path.effects": [patheffects.withStroke(linewidth=4, foreground="w")],  # white outline behind strokes
+    "axes.linewidth": 1.5,
+    "lines.linewidth": 2.0,
+    "figure.facecolor": "white",
+    "grid.linewidth": 0.0,
+    "axes.grid": False,
+    "axes.unicode_minus": False,
+    "axes.edgecolor": "black",
+    "xtick.major.size": 8,
+    "xtick.major.width": 3,
+    "ytick.major.size": 8,
+    "ytick.major.width": 3,
+})
+
+categories = ["Ansible", "Helm", "Go Lint GHA"]
+without = [900, 210, 450]
+with_dervish = [60, 70, 30]
+compression = [f"{w//d}×" for w, d in zip(without, with_dervish)]  # floor ratio, e.g. 900 // 60 -> "15×"
+
+x = np.arange(len(categories))
+width = 0.3
+
+fig, ax = plt.subplots(figsize=(8, 4.5))
+
+bars1 = ax.bar(x - width/2, without, width, label="Without Dervish", color="#888888", edgecolor="black")
+bars2 = ax.bar(x + width/2, with_dervish, width, label="With Dervish", color="#853E91", edgecolor="black")
+
+for bar, val in zip(bars1, without):
+    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10,
+            str(val), ha="center", va="bottom", fontsize=9)
+for bar, val, comp in zip(bars2, with_dervish, compression):
+    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10,
+            f"{val} ({comp})", ha="center", va="bottom", fontsize=9)
+
+ax.set_ylabel("Tokens")
+ax.set_title("Token Savings per Dataset")
+ax.set_xticks(x)
+ax.set_xticklabels(categories)
+ax.legend(frameon=True)
+ax.set_ylim(0, 1100)
+fig.tight_layout()
+fig.savefig("chart_token_savings.png", dpi=200)
+plt.close()
+print("chart_token_savings.png regenerated")
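
A note on the compression labels in `generate_chart.py`: they come from floor division, which is exact for the three datasets in the script but would silently round down any ratio that is not a whole number. The sketch below reuses the token counts from the script; the `ratio_labels` helper and its `exact` flag are hypothetical names introduced only for this illustration.

```python
# Token counts copied from generate_chart.py.
without = [900, 210, 450]
with_dervish = [60, 70, 30]


def ratio_labels(before, after, exact=False):
    """Build "N×" labels; exact=True keeps one decimal instead of flooring."""
    labels = []
    for b, a in zip(before, after):
        if exact:
            labels.append(f"{b / a:.1f}×")
        else:
            labels.append(f"{b // a}×")
    return labels


chart_labels = ratio_labels(without, with_dervish)
```

For the current data the floored labels are `15×`, `3×`, and `15×`. A pair such as 100 and 30 tokens would be labelled `3×` by floor division and `3.3×` with `exact=True`, so the choice matters only if the datasets change.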
Author	SHA1	Message	Date
tobi	28f5f897d5	Update README.md	2026-07-02 18:17:16 +00:00
tobi	d74b36e563	Update README.md	2026-07-02 18:14:48 +00:00
tobi	197a0a3c22	Update README.md	2026-07-02 18:08:42 +00:00
tobi	136ae08fe3	Update README.md	2026-07-02 18:06:38 +00:00
tobi	ea8e2f1db7	Update README.md	2026-07-02 16:33:51 +00:00
tobi	16cbff61a8	Update README.md	2026-07-02 16:33:02 +00:00
tobjend	b037098730	fix: regenerate chart with Comic Neue + path.sketch xkcd style	2026-07-01 16:14:24 +02:00
tobi	d2d57bc431	Merge pull request 'feat: kOREInference — Algorithm 4 iDRegEx with MDL scoring + ensemble integration' (#1) from feature/kore-inference into main	2026-07-01 14:08:18 +00:00