diff --git a/README.md b/README.md
index 39f0a1c..99ad51f 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,19 @@
-# Dervish
+# Dervish MCP

Dervish

-**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that describes the general pattern.
+**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that captures the general pattern.
+
+Every codebase has unwritten conventions — the order tasks appear in Ansible roles, the resources a Helm chart always creates, the steps every CI pipeline runs. Nobody writes these down. They emerge from copying and converging.
+
+When an LLM agent needs to follow these conventions, it usually has two bad options:
+
+1. **Stuff every existing file into context** — 15 Ansible roles = 5,000 tokens. You'll hit the context window by the third example.
+2. **Guess from one or two examples** — the LLM infers a pattern and often gets it wrong.
+
+Dervish replaces both with a **one-call MCP tool**: pass your sequences, get back a ~60-token grammar. A rule you can trust, at a fraction of the cost.
+
+**Without Dervish:** token cost scales linearly with examples. **With Dervish:** one compact grammar describes them all — a ~60–200 token rule instead of thousands of tokens of raw examples.
 
 ## MCP Server
 
@@ -196,6 +207,20 @@ The sweet spot: **multiple implementations of the same abstract task** (like "de
 | 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
 | Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
 
+## Token savings
+
+

+[Figure: Context cost: raw examples vs Dervish grammar]

+
+Without Dervish, including N examples in context costs N × ~100 tokens. With Dervish, the grammar stays small and flat — ~60 tokens for a tight pattern, ~200 for diverse data.
+
+

+[Figure: Token savings per dataset]

+
+Across all public benchmarks, Dervish delivers **40–83× compression**. The grammar is smaller than a single example file would be — and it represents the entire dataset.
+
 ## How MDL scoring works
 
 ```
@@ -217,8 +242,8 @@ The ensemble selects the grammar with the lowest total MDL.
 
 ## Papers
 
-- **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"* — TODS 2010
-- **Bex et al.** *"Inferring k-optimal REs from Positive Data"* — arXiv:1004.2372
+- **Bex et al.** *[Learning Deterministic Regular Expressions for the Web](https://doi.org/10.1145/1806907.1806911)* — TODS 2010
+- **Bex et al.** *[Simplifying XML Schema: Single-Type Approximations of Regular Expressions](https://arxiv.org/abs/1004.2372)* — arXiv:1004.2372
 
 ## Tests
 
diff --git a/blog_post.md b/blog_post.md
index a845d7a..d395dcc 100644
--- a/blog_post.md
+++ b/blog_post.md
@@ -253,9 +253,9 @@ the pattern they all share. The structural convention is in the data
 ## References
 
 - Bex, G. J., Gelade, W., Neven, F., & Vansummeren, S. (2010).
-  *Learning Deterministic Regular Expressions for the Web.* TODS 2010.
+  [*Learning Deterministic Regular Expressions for the Web.*](https://doi.org/10.1145/1806907.1806911) TODS 2010.
 - Bex, G. J., Gelade, W., Martens, W., & Neven, F. (2010).
-  *Simplifying XML Schema: Single-Type Approximations of Regular
-  Expressions.* arXiv:1004.2372.
+  [*Simplifying XML Schema: Single-Type Approximations of Regular
+  Expressions.*](https://arxiv.org/abs/1004.2372) arXiv:1004.2372.
 - Rissanen, J. (1978). *Modeling by shortest data description.*
   Automatica 14(5).
diff --git a/chart_context_cost.png b/chart_context_cost.png
new file mode 100644
index 0000000..4d21826
Binary files /dev/null and b/chart_context_cost.png differ
diff --git a/chart_token_savings.png b/chart_token_savings.png
new file mode 100644
index 0000000..ec7b081
Binary files /dev/null and b/chart_token_savings.png differ
diff --git a/make_charts.py b/make_charts.py
new file mode 100644
index 0000000..1553311
--- /dev/null
+++ b/make_charts.py
@@ -0,0 +1,71 @@
+import matplotlib.pyplot as plt
+import numpy as np
+
+plt.xkcd(scale=0.7, length=60, randomness=2)
+
+FIG_W = 8
+FIG_H = 5
+
+# ── Chart 1: Context cost vs examples ──
+fig1, ax1 = plt.subplots(figsize=(FIG_W, FIG_H))
+
+N = [1, 5, 15, 36]
+raw = [100, 500, 1500, 3600] # ~100 tokens/example
+dervish = [40, 60, 60, 200] # grammar grows only when diversity grows
+
+x = np.arange(len(N))
+w = 0.35
+
+bars1 = ax1.bar(x - w/2, raw, w, label='Raw examples', color='#e74c3c', alpha=0.85)
+bars2 = ax1.bar(x + w/2, dervish, w, label='Dervish grammar', color='#3498db', alpha=0.85)
+
+ax1.set_xticks(x)
+ax1.set_xticklabels([f'{n} examples' for n in N])
+ax1.set_ylabel('Tokens needed in context')
+ax1.set_title('Context cost: raw examples vs Dervish grammar')
+ax1.legend(frameon=False)
+
+for bar in bars1:
+    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 80,
+             f'{int(bar.get_height())}', ha='center', va='bottom', fontsize=9)
+for bar in bars2:
+    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 80,
+             f'{int(bar.get_height())}', ha='center', va='bottom', fontsize=9)
+
+ax1.set_ylim(0, 4500)
+fig1.tight_layout()
+fig1.savefig('chart_context_cost.png', dpi=200)
+plt.close(fig1)
+
+# ── Chart 2: Tokens — Without vs With Dervish (per dataset) ──
+fig2, ax2 = plt.subplots(figsize=(FIG_W, FIG_H))
+
+datasets = ['Ansible Galaxy\n(15 roles)', 'Helm\n(6 configs)', 'Go lint\n(6 jobs)']
+without = [5000, 3000, 900]
+with_derv = [60, 40, 30]
+ratios = [f'{int(w/d)}×' for w, d in zip(without, with_derv)]
+
+x2 = np.arange(len(datasets))
+w2 = 0.3
+
+bw = ax2.bar(x2 - w2/2, without, w2, label='Without Dervish', color='#e74c3c', alpha=0.85)
+bd = ax2.bar(x2 + w2/2, with_derv, w2, label='With Dervish', color='#3498db', alpha=0.85)
+
+ax2.set_xticks(x2)
+ax2.set_xticklabels(datasets)
+ax2.set_ylabel('Tokens')
+ax2.set_title('Token savings per dataset')
+ax2.legend(frameon=False)
+ax2.set_yscale('log')
+ax2.set_ylim(5, 30000)
+
+# Label compression ratios
+for i, (r, wbar, dbar) in enumerate(zip(ratios, bw, bd)):
+    ax2.text(x2[i], without[i] * 1.3, r, ha='center', va='bottom', fontsize=11, fontweight='bold',
+             bbox=dict(boxstyle='round,pad=0.2', facecolor='white', edgecolor='gray', alpha=0.8))
+
+fig2.tight_layout()
+fig2.savefig('chart_token_savings.png', dpi=200)
+plt.close(fig2)
+
+print("Charts saved: chart_context_cost.png, chart_token_savings.png")
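
Review note (a sketch for reviewers, not part of the patch): the snippet below recomputes the per-dataset compression ratios that the second chart labels, using the `without` and `with_derv` values copied from `make_charts.py` and the same `int(w/d)` truncation. The helper name `compression_ratios` is hypothetical and only exists for this check.

```python
# Values copied from make_charts.py in the diff above.
datasets = ["Ansible Galaxy (15 roles)", "Helm (6 configs)", "Go lint (6 jobs)"]
without = [5000, 3000, 900]
with_derv = [60, 40, 30]


def compression_ratios(raw_tokens, grammar_tokens):
    """Return the truncated ratio raw/grammar per dataset, as int(w/d) does in the script."""
    return [int(raw / grammar) for raw, grammar in zip(raw_tokens, grammar_tokens)]


ratios = compression_ratios(without, with_derv)

# Print one line per dataset, then the overall range.
for name, ratio in zip(datasets, ratios):
    print(f"{name}: {ratio}x")
print(f"range: {min(ratios)}x to {max(ratios)}x")
```

With the script's own inputs the ratios are 83×, 75× and 30×, so the lowest plotted ratio falls below the 40–83× range quoted in the README text. It may be worth reconciling the two before merging.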