rename to Dervish MCP; expand description with token-savings framing; add xkcd-style bar charts; link papers to actual URLs

tobjend 2026-07-01 11:05:03 +02:00
parent 6d1c033267
commit b05c3ee116
5 changed files with 103 additions and 7 deletions


@ -1,8 +1,19 @@
# Dervish
# Dervish MCP
<p align="center"><img src="dervish.gif" alt="Dervish"></p>
**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that describes the general pattern.
**Dervish** infers **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), it learns a compact regular expression that captures the general pattern.
Every codebase has unwritten conventions — the order tasks appear in Ansible roles, the resources a Helm chart always creates, the steps every CI pipeline runs. Nobody writes these down. They emerge from copying and converging.
When an LLM agent needs to follow these conventions, it usually has two bad options:
1. **Stuff every existing file into context** — 15 Ansible roles = 5,000 tokens. You'll hit the context window by the third example.
2. **Guess from one or two examples** — the LLM infers a pattern and often gets it wrong.
Dervish replaces both with a **one-call MCP tool**: pass your sequences, get back a ~60-token grammar. A rule you can trust, at a fraction of the cost.
**Without Dervish:** token cost scales linearly with the number of examples. **With Dervish:** one compact grammar describes them all: a ~60–200 token rule instead of thousands of tokens of raw examples.
## MCP Server
@ -196,6 +207,20 @@ The sweet spot: **multiple implementations of the same abstract task** (like "de
| 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
| Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
## Token savings
<p align="center">
<img src="chart_context_cost.png" alt="Context cost: raw examples vs Dervish grammar" width="75%">
</p>
Without Dervish, including N examples in context costs N × ~100 tokens. With Dervish, the grammar stays small and flat — ~60 tokens for a tight pattern, ~200 for diverse data.
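The comparison above can be sketched as simple arithmetic. This is a minimal illustration, assuming the rough ~100 tokens per example quoted in the text and the illustrative grammar sizes hard-coded in `make_charts.py`; the helper name `raw_context_cost` is hypothetical and not part of Dervish.

```python
# Illustrative numbers only: ~100 tokens per example is the estimate from the text,
# and the grammar sizes mirror the values hard-coded in make_charts.py.
TOKENS_PER_EXAMPLE = 100


def raw_context_cost(n_examples: int, tokens_per_example: int = TOKENS_PER_EXAMPLE) -> int:
    """Tokens needed to paste n_examples raw examples into context (linear in N)."""
    return n_examples * tokens_per_example


example_counts = [1, 5, 15, 36]
grammar_tokens = [40, 60, 60, 200]  # assumed grammar sizes, not measurements

for n, g in zip(example_counts, grammar_tokens):
    raw = raw_context_cost(n)
    print(f"{n:>2} examples: raw={raw:>4} tokens, grammar={g:>3} tokens, ratio={raw / g:.1f}x")
```

The raw cost grows with every added example, while the grammar size changes only when the examples become more diverse.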
<p align="center">
<img src="chart_token_savings.png" alt="Token savings per dataset" width="75%">
</p>
Across all public benchmarks, Dervish delivers **40–83× compression**. The grammar is smaller than a single example file would be, and it represents the entire dataset.
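As a rough sketch of how a per-dataset compression ratio is derived, the snippet below reuses the rounded token estimates hard-coded in `make_charts.py`. The helper `compression_ratio` is a hypothetical name introduced here for illustration; the inputs are chart estimates, not fresh measurements.

```python
# Token estimates copied from make_charts.py; they are rounded, illustrative figures.
datasets = {
    "Ansible Galaxy (15 roles)": (5000, 60),
    "Helm (6 configs)": (3000, 40),
    "Go lint (6 jobs)": (900, 30),
}


def compression_ratio(without: int, with_grammar: int) -> int:
    """Whole-number ratio, truncated the same way make_charts.py does with int()."""
    return int(without / with_grammar)


ratios = {name: compression_ratio(w, d) for name, (w, d) in datasets.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

With these rounded chart inputs the ratios are 83×, 75×, and 30×.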
## How MDL scoring works
```
@ -217,8 +242,8 @@ The ensemble selects the grammar with the lowest total MDL.
## Papers
- **Bex et al.** *"Inferring Deterministic Regular Expressions from Positive Data"* — TODS 2010
- **Bex et al.** *"Inferring k-optimal REs from Positive Data"* — arXiv:1004.2372
- **Bex et al.** *[Learning Deterministic Regular Expressions for the Web](https://doi.org/10.1145/1806907.1806911)* — TODS 2010
- **Bex et al.** *[Simplifying XML Schema: Single-Type Approximations of Regular Expressions](https://arxiv.org/abs/1004.2372)* — arXiv:1004.2372
## Tests


@ -253,9 +253,9 @@ the pattern they all share. The structural convention is in the data
## References
- Bex, G. J., Gelade, W., Neven, F., & Vansummeren, S. (2010).
*Learning Deterministic Regular Expressions for the Web.* TODS 2010.
[*Learning Deterministic Regular Expressions for the Web.*](https://doi.org/10.1145/1806907.1806911) TODS 2010.
- Bex, G. J., Gelade, W., Martens, W., & Neven, F. (2010).
*Simplifying XML Schema: Single-Type Approximations of Regular
Expressions.* arXiv:1004.2372.
[*Simplifying XML Schema: Single-Type Approximations of Regular
Expressions.*](https://arxiv.org/abs/1004.2372) arXiv:1004.2372.
- Rissanen, J. (1978). *Modeling by shortest data description.*
Automatica 14(5).

chart_context_cost.png (new binary file, 161 KiB; not shown)

chart_token_savings.png (new binary file, 137 KiB; not shown)

make_charts.py (new file, 71 lines)

@ -0,0 +1,71 @@
import matplotlib.pyplot as plt
import numpy as np
plt.xkcd(scale=0.7, length=60, randomness=2)
FIG_W = 8
FIG_H = 5
# ── Chart 1: Context cost vs examples ──
fig1, ax1 = plt.subplots(figsize=(FIG_W, FIG_H))
N = [1, 5, 15, 36]
raw = [100, 500, 1500, 3600] # ~100 tokens/example
dervish = [40, 60, 60, 200] # grammar grows only when diversity grows
x = np.arange(len(N))
w = 0.35
bars1 = ax1.bar(x - w/2, raw, w, label='Raw examples', color='#e74c3c', alpha=0.85)
bars2 = ax1.bar(x + w/2, dervish, w, label='Dervish grammar', color='#3498db', alpha=0.85)
ax1.set_xticks(x)
ax1.set_xticklabels([f'{n} examples' for n in N])
ax1.set_ylabel('Tokens needed in context')
ax1.set_title('Context cost: raw examples vs Dervish grammar')
ax1.legend(frameon=False)
# Print the token count above every bar in both series
for bar in [*bars1, *bars2]:
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 80,
             f'{int(bar.get_height())}', ha='center', va='bottom', fontsize=9)
ax1.set_ylim(0, 4500)
fig1.tight_layout()
fig1.savefig('chart_context_cost.png', dpi=200)
plt.close(fig1)
# ── Chart 2: Tokens — Without vs With Dervish (per dataset) ──
fig2, ax2 = plt.subplots(figsize=(FIG_W, FIG_H))
datasets = ['Ansible Galaxy\n(15 roles)', 'Helm\n(6 configs)', 'Go lint\n(6 jobs)']
without = [5000, 3000, 900]
with_derv = [60, 40, 30]
ratios = [f'{int(w/d)}×' for w, d in zip(without, with_derv)]
x2 = np.arange(len(datasets))
w2 = 0.3
bw = ax2.bar(x2 - w2/2, without, w2, label='Without Dervish', color='#e74c3c', alpha=0.85)
bd = ax2.bar(x2 + w2/2, with_derv, w2, label='With Dervish', color='#3498db', alpha=0.85)
ax2.set_xticks(x2)
ax2.set_xticklabels(datasets)
ax2.set_ylabel('Tokens')
ax2.set_title('Token savings per dataset')
ax2.legend(frameon=False)
ax2.set_yscale('log')
ax2.set_ylim(5, 30000)
# Label compression ratios above each dataset's pair of bars
for i, r in enumerate(ratios):
    ax2.text(x2[i], without[i] * 1.3, r, ha='center', va='bottom', fontsize=11, fontweight='bold',
             bbox=dict(boxstyle='round,pad=0.2', facecolor='white', edgecolor='gray', alpha=0.8))
fig2.tight_layout()
fig2.savefig('chart_token_savings.png', dpi=200)
plt.close(fig2)
print("Charts saved: chart_context_cost.png, chart_token_savings.png")