docs: add min_coverage to MCP tool + README, include core in output
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful

tobjend 2026-07-01 15:16:24 +02:00
parent 9045769d57
commit 036a84cc76
3 changed files with 271 additions and 14 deletions


@@ -41,12 +41,13 @@ The primary interface is a **Model Context Protocol (MCP)** server. Connect any
 | Tool | Parameters | What it does |
 |------|-----------|-------------|
-| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N` | **The only tool you need.** Runs CRX + iDRegEx, picks best by MDL. Set `prefer='crx'` for full coverage or `prefer='idregex'` for minimal core (skips the ensemble and runs one algorithm). |
+| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N`, `min_coverage` | **The only tool you need.** Runs CRX + iDRegEx + kOREInference, picks best by MDL. Set `prefer` to run only one algorithm. Set `min_coverage < 1.0` for optional core+outlier analysis. |
 **Parameters explained:**
-- **`prefer`**: `'crx'` for full vocabulary (accepts all sequences), `'idregex'` for minimal common core (only what every example shares). Omit to let MDL pick the winner.
-- **`kmax`** (1-5): Context window for iDRegEx's k-testable automaton. Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
-- **`N`** (1-10): Baum-Welch EM iterations for iDRegEx training. More iterations = better convergence but slower. Default 3 is a good balance.
+- **`prefer`**: `'crx'` for full vocabulary (accepts all sequences), `'idregex'` or `'koreinference'` for deterministic minimal core. Omit to let MDL pick the winner across all three.
+- **`kmax`** (1-5): Context window for k-ORE inference (iDRegEx, kOREInference). Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
+- **`N`** (1-10): Random trials for k-ORE inference. More = better convergence but slower. Default 3.
+- **`min_coverage`** (0.5-1.0): **Optional core+outlier analysis.** When < 1.0, iteratively removes outlier sequences (those with the rarest symbols) until at least this fraction remain. Returns the core CRX grammar for the majority plus a list of removed outliers. Default 1.0 = disabled. Example: `min_coverage=0.8` finds the tight pattern for ~80% of examples while flagging the other ~20% as variants.
 ### Agent workflow
@@ -129,19 +130,21 @@ The sweet spot: **multiple implementations of the same abstract task** with a sh
 | When | Use | Why |
 |------|-----|-----|
 | Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
-| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
-| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
-| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |
+| Few examples, or want minimal common core | **iDRegEx** or **kOREInference** | Probabilistic EM, finds only what's shared. |
+| Don't know which is better | **Ensemble (default)** | Runs all three, picks best by MDL score. |
+| Want core pattern + outlier detection | **Ensemble + `min_coverage<1`** | Finds tight grammar for majority, flags outliers. |
+| Data is clearly one type | `prefer='crx'` | Skips ensemble comparison, runs CRX alone. |
 ## When each algorithm wins
 | Data property | Winner | Why |
 |---------------|--------|-----|
-| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
+| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx/kOREInference return ∅. |
 | Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
 | Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
 | 2-3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
 | Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
+| Want majority pattern + outlier list | CRX + `min_coverage` | Core analysis finds tight grammar for ~80%, flags the rest. |
 ## Token savings
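The core+outlier behaviour that the README hunks above document for `min_coverage` can be illustrated with a minimal sketch. This is not the code behind `infer_ensemble`: the helper name `split_core_and_outliers`, the rarity score (how many remaining sequences contain a sequence's rarest symbol), and the early stop once every symbol is shared are all assumptions made for illustration.

```python
import math
from collections import Counter


def split_core_and_outliers(sequences, min_coverage=0.8):
    """Hypothetical helper: repeatedly drop the sequence whose rarest symbol
    is least common, while at least `min_coverage` of the input remains."""
    core = [list(seq) for seq in sequences]
    outliers = []
    # Smallest core size that still satisfies the coverage floor.
    min_keep = max(1, math.ceil(min_coverage * len(core)))
    while len(core) > min_keep:
        # Document frequency: in how many core sequences does each symbol occur?
        doc_freq = Counter(sym for seq in core for sym in set(seq))
        # Score each sequence by its rarest symbol (empty sequences score 0).
        scores = [min((doc_freq[s] for s in seq), default=0) for seq in core]
        worst = min(range(len(core)), key=scores.__getitem__)
        if scores[worst] == len(core):
            break  # every symbol is shared by all remaining sequences
        outliers.append(core.pop(worst))
    return core, outliers


seqs = [
    ["installation", "usage", "license"],
    ["installation", "usage", "contributing", "license"],
    ["installation", "usage", "license"],
    ["installation", "usage", "contributing", "license"],
    ["screenshots", "faq"],
]
core, outliers = split_core_and_outliers(seqs, min_coverage=0.8)
print(len(core), outliers)
```

With `min_coverage=1.0` the loop never runs and no sequence is flagged, which corresponds to the documented "disabled" default.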


@@ -17,6 +17,7 @@ def infer_best_grammar(
     prefer: str = "",
     kmax: int = 2,
     N: int = 3,
+    min_coverage: float = 1.0,
 ) -> str:
     """Infer a compact grammar from example sequences. Use this when you
     have examples of sequential data and want to learn the pattern.
@@ -29,19 +30,26 @@ def infer_best_grammar(
         sequences: List of sequences, each a list of strings (symbols in
             the order they appear). Example: [["file","copy","command"],
             ["file","template","command"]].
-        prefer: Optional 'crx' for full coverage (accepts all examples),
-            'idregex' for minimal core (only what every example shares).
-            Default: runs both and picks best by MDL score.
-        kmax: Maximum k for iDRegEx k-ORE inference.
-        N: Number of EM iterations for iDRegEx.
+        prefer: Optional 'crx' for full vocabulary (accepts all examples),
+            'idregex' for deterministic minimal core. Omit to auto-pick by MDL.
+        kmax: Context depth for k-ORE inference. Default 2.
+        N: Random trials for k-ORE inference (higher = better, slower).
+        min_coverage: (Expert) When < 1.0, also runs a **core+outlier analysis**:
+            iteratively removes outlier sequences (those with rarest symbols)
+            until at least this fraction remain. Returns the core grammar
+            for the majority, plus a list of which sequences were removed and why.
+            Default 1.0 = no core analysis. Set to 0.8 to find the tight
+            pattern shared by ~80% of examples while flagging the other ~20%
+            as variations.
     Returns:
         A formatted string with the best grammar, scores, and explanation.
+        When min_coverage < 1.0, includes the core grammar and outlier info.
         Grammar notation: a.b = a then b, (a+b) = a or b, r? = optional,
         r+ = one or more, r+? = zero or more.
     """
     pref = prefer if prefer else None
-    result = infer_ensemble(sequences, kmax=kmax, N=N, prefer=pref)
+    result = infer_ensemble(sequences, kmax=kmax, N=N, prefer=pref, min_coverage=min_coverage)
     if result['best'] is None:
         return f"No grammar found. {result['why']}"
     lines = [f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})",
@@ -53,6 +61,13 @@ def infer_best_grammar(
             lines.append(f" {r['algorithm']:10s} MDL={r['mdl_score']:>8.2f} match={m}/{len(sequences)}")
     lines.append("")
     lines.append(f"Why: {result['why']}")
+    if 'core' in result and result['core']:
+        c = result['core']
+        lines.append(f"\nCore CRX ({c['coverage']:.0%} coverage, {c['outlier_count']} outliers): {c['grammar']}")
+        if c['outliers']:
+            lines.append(" Outlier sequences:")
+            for i, o in enumerate(c['outliers'], 1):
+                lines.append(f" {i}. {' → '.join(str(x) for x in o[:8])}{'...' if len(o) > 8 else ''}")
     return "\n".join(lines)
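The grammar notation listed in the docstring maps onto ordinary regular expressions. As a rough illustration, the sketch below writes one grammar by hand in Python `re` syntax over space-terminated symbols. The grammar `file.(copy+template).command+?` and the `accepts` helper are made up for this example; they are not output of CRX, iDRegEx, or kOREInference, and this is not how `_matches` is implemented.

```python
import re

# Notation from the docstring      -> equivalent Python `re` fragment
#   a.b    (a then b)              -> "a b"
#   (a+b)  (a or b)                -> "(?:a|b)"
#   r?     (optional)              -> "(?:r)?"
#   r+     (one or more)           -> "(?:r)+"
#   r+?    (zero or more)          -> "(?:r)*"   (in `re`, "+?" means lazy "+")

# Hypothetical grammar: file.(copy+template).command+?
# Every symbol is followed by one space so concatenation stays unambiguous.
PATTERN = re.compile(r"file (?:copy |template )(?:command )*")


def accepts(sequence):
    """Return True if the whole symbol sequence matches PATTERN."""
    text = "".join(sym + " " for sym in sequence)
    return PATTERN.fullmatch(text) is not None


print(accepts(["file", "copy", "command"]))
print(accepts(["file", "command"]))
```

The first call prints `True` and the second `False`, because the grammar requires `copy` or `template` directly after `file`.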

examples/readme_analysis.py Normal file

@@ -0,0 +1,239 @@
"""
README Structure Analysis: infer the conventional heading structure of
top GitHub repositories using Dervish grammar inference.
"""
import re
import sys
import time
import json
import requests
from pathlib import Path

# Make the repository root importable so `bex` resolves when run from examples/.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from bex.ensemble import infer_ensemble, _matches

# ── Synonym normalization map ──
NORMALIZE = {
    'description': 'description',
    'overview': 'description',
    'about': 'description',
    'introduction': 'description',
    'getting started': 'getting-started',
    'quick start': 'getting-started',
    'quickstart': 'getting-started',
    'installation': 'installation',
    'install': 'installation',
    'setup': 'installation',
    'usage': 'usage',
    'how to use': 'usage',
    'examples': 'usage',
    'example': 'usage',
    'api': 'api',
    'api reference': 'api',
    'api documentation': 'api',
    'documentation': 'api',
    'features': 'features',
    'configuration': 'configuration',
    'config': 'configuration',
    'contributing': 'contributing',
    'development': 'contributing',
    'building': 'contributing',
    'build': 'contributing',
    'license': 'license',
    'changelog': 'changelog',
    'faq': 'faq',
    'frequently asked questions': 'faq',
    'support': 'support',
    'screenshots': 'screenshots',
    'demo': 'screenshots',
    'tests': 'testing',
    'testing': 'testing',
    'badges': 'badges',
    'acknowledgments': 'acknowledgments',
    'acknowledgements': 'acknowledgments',
    'credits': 'acknowledgments',
    'roadmap': 'roadmap',
    'related projects': 'related',
    'see also': 'related',
}


def normalize_heading(text):
    """Normalize a heading to a canonical name, or return the raw slug."""
    t = text.strip().lower()
    t = re.sub(r'[^a-z0-9 ]', '', t)
    t = re.sub(r'\s+', ' ', t).strip()
    return NORMALIZE.get(t, t)


def fetch_top_repos(n=100, min_stars=5000):
    """Fetch top N repos by stars from GitHub search API."""
    repos = []
    page = 1
    headers = {'Accept': 'application/vnd.github.v3+json'}
    per_page = min(n, 100)
    while len(repos) < n:
        url = (
            f'https://api.github.com/search/repositories'
            f'?q=stars:>{min_stars}&sort=stars&order=desc'
            f'&per_page={per_page}&page={page}'
        )
        resp = requests.get(url, headers=headers)
        if resp.status_code == 403:
            print(" Rate limited. Sleeping 60s...")
            time.sleep(60)
            continue
        if resp.status_code != 200:
            print(f" API error {resp.status_code}: {resp.text[:200]}")
            break
        data = resp.json()
        items = data.get('items', [])
        if not items:
            break
        for r in items:
            repos.append({
                'full_name': r['full_name'],
                'stars': r['stargazers_count'],
                'default_branch': r.get('default_branch', 'main'),
                'description': r.get('description', ''),
                'language': r.get('language', ''),
            })
        print(f" Page {page}: got {len(items)} repos (total {len(repos)})")
        page += 1
        # Small delay to avoid secondary rate limits
        time.sleep(0.5)
        if len(repos) >= n:
            break
    return repos[:n]


def fetch_readme(repo):
    """Fetch README content from a GitHub repo.

    Tries the repo's default branch, then main and master, with common
    README filenames. Returns (text, path), or (None, None) if nothing is found.
    """
    branches = [repo['default_branch'], 'main', 'master']
    attempted = set()
    for branch in branches:
        if branch in attempted:
            continue
        attempted.add(branch)
        for path in ['README.md', 'readme.md', 'README.markdown', 'README.rst']:
            url = f'https://raw.githubusercontent.com/{repo["full_name"]}/{branch}/{path}'
            try:
                resp = requests.get(url, timeout=10)
                if resp.status_code == 200:
                    return resp.text, path
            except requests.RequestException:
                # Network errors and timeouts: fall through to the next candidate.
                pass
    return None, None


def extract_headings(text):
    """Extract heading sequence from markdown text.

    Returns list of (level, text) tuples, e.g. [(1, "Title"), (2, "Installation"), ...]
    """
    headings = []
    for line in text.splitlines():
        m = re.match(r'^(#{1,6})\s+(.+)$', line.strip())
        if m:
            level = len(m.group(1))
            # Use a separate name so the `text` parameter is not shadowed.
            title = m.group(2).strip()
            # Remove the optional closing run of `#` that ATX headings allow
            title = re.sub(r'\s+#+\s*$', '', title).strip()
            headings.append((level, title))
    return headings


def compress_headings(headings):
    """Convert heading sequence to our symbol vocabulary.

    The first H1 is treated as the title and dropped. Every other heading,
    whatever its level, is normalized to one symbol.
    """
    seq = []
    seen_h1 = False
    for level, text in headings:
        if level == 1 and not seen_h1:
            seen_h1 = True
            continue  # skip the title
        norm = normalize_heading(text)
        if norm:
            seq.append(norm)
    return seq


def main():
    print("=" * 60)
    print("README Structure Analysis")
    print("=" * 60)

    # Step 1: Fetch top repos
    print("\n[1] Fetching top repos from GitHub...")
    repos = fetch_top_repos(n=100)
    print(f" Got {len(repos)} repos")

    # Step 2: Fetch READMEs
    print("\n[2] Fetching READMEs...")
    sequences = []
    failed = 0
    for i, repo in enumerate(repos, 1):
        raw_text, path = fetch_readme(repo)
        if raw_text is None:
            failed += 1
            continue
        headings = extract_headings(raw_text)
        seq = compress_headings(headings)
        if len(seq) >= 3:  # need at least a few sections
            sequences.append(seq)
        if i % 20 == 0:
            print(f" {i}/{len(repos)}: {len(sequences)} valid, {failed} failed")
    print(f" Total: {len(sequences)} valid sequences, {failed} failed")

    # Step 3: Collect vocabulary stats
    print("\n[3] Vocabulary statistics...")
    all_symbols = set()
    symbol_counts = {}
    for seq in sequences:
        for s in seq:
            all_symbols.add(s)
            symbol_counts[s] = symbol_counts.get(s, 0) + 1
    print(f" Unique symbols: {len(all_symbols)}")
    print(f" Top symbols:")
    for sym, cnt in sorted(symbol_counts.items(), key=lambda x: -x[1])[:25]:
        pct = cnt / len(sequences) * 100
        print(f" {sym:30s} {cnt:4d} ({pct:5.1f}%)")

    # Step 4: Run Dervish
    print("\n[4] Running Dervish grammar inference...")
    result = infer_ensemble(sequences)
    if result['best'] is None:
        # Same guard as infer_best_grammar: no winner means nothing to report.
        print(f" No grammar found. {result['why']}")
        return
    print(f"\n Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
    print(f" Grammar: {result['best']['grammar']}")
    if len(result['all']) > 1:
        for r in result['all']:
            m = sum(1 for s in sequences if _matches(r['grammar'], s))
            print(f" {r['algorithm']:10s} MDL={r['mdl_score']:>8.2f} match={m}/{len(sequences)}")
    print(f"\n Why: {result['why']}")

    # Step 5: Print example sequences
    print("\n[5] Sample sequences:")
    for seq in sequences[:10]:
        print(f" {' → '.join(seq[:10])}" + (" → ..." if len(seq) > 10 else ""))
    print(f" ... ({len(sequences)} total)")

    # Save results
    out = {
        'num_repos': len(sequences),
        'failed': failed,
        'unique_symbols': len(all_symbols),
        'top_symbols': {s: symbol_counts[s] for s in sorted(symbol_counts, key=lambda x: -symbol_counts[x])[:30]},
        'grammar': result['best']['grammar'],
        'algorithm': result['best']['algorithm'],
        'mdl': result['best']['mdl_score'],
    }
    out_path = Path(__file__).resolve().parent.parent / 'readme_analysis.json'
    with open(out_path, 'w') as f:
        json.dump(out, f, indent=2)
    print(f"\nResults saved to {out_path}")


if __name__ == '__main__':
    main()
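
The heading pipeline in this script can be exercised without network access. The sketch below folds heading extraction and normalization into one compact function and runs it on an inline README. The sample text, the trimmed `SYNONYMS` map, and the function name `heading_symbols` are assumptions for illustration; the script itself uses `extract_headings`, `compress_headings`, and the full `NORMALIZE` table.

```python
import re

# Trimmed synonym map; the script's NORMALIZE table is much larger.
SYNONYMS = {
    "install": "installation",
    "setup": "installation",
    "quick start": "getting-started",
    "examples": "usage",
}

SAMPLE_README = """\
# My Project
## Quick Start
## Install
### Examples ###
## License
"""


def heading_symbols(markdown):
    """Return normalized heading symbols, skipping the first H1 (the title)."""
    symbols = []
    seen_title = False
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.+)$", line.strip())
        if not m:
            continue
        if len(m.group(1)) == 1 and not seen_title:
            seen_title = True
            continue
        # Drop an optional closing run of '#', then slugify like normalize_heading.
        title = re.sub(r"\s+#+\s*$", "", m.group(2)).strip().lower()
        title = re.sub(r"[^a-z0-9 ]", "", title)
        title = re.sub(r"\s+", " ", title).strip()
        if title:
            symbols.append(SYNONYMS.get(title, title))
    return symbols


symbols = heading_symbols(SAMPLE_README)
print(symbols)
# → ['getting-started', 'installation', 'usage', 'license']
```

A list like this is one entry of the `sequences` argument that `infer_ensemble` receives in `main()`.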