docs: add min_coverage to MCP tool + README, include core in output
parent 9045769d57
commit 036a84cc76
3 changed files with 271 additions and 14 deletions
19
README.md
@@ -41,12 +41,13 @@ The primary interface is a **Model Context Protocol (MCP)** server. Connect any
 
 | Tool | Parameters | What it does |
 |------|-----------|-------------|
-| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N` | **The only tool you need.** Runs CRX + iDRegEx, picks best by MDL. Set `prefer='crx'` for full coverage or `prefer='idregex'` for minimal core — skips the ensemble and runs one algorithm. |
+| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N`, `min_coverage` | **The only tool you need.** Runs CRX + iDRegEx + kOREInference, picks best by MDL. Set `prefer` to run only one algorithm. Set `min_coverage < 1.0` for optional core+outlier analysis. |
 
 **Parameters explained:**
-- **`prefer`**: `'crx'` for full vocabulary (accepts all sequences), `'idregex'` for minimal common core (only what every example shares). Omit to let MDL pick the winner.
-- **`kmax`** (1–5): Context window for iDRegEx's k-testable automaton. Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
-- **`N`** (1–10): Baum-Welch EM iterations for iDRegEx training. More iterations = better convergence but slower. Default 3 is a good balance.
+- **`prefer`**: `'crx'` for full vocabulary (accepts all sequences), `'idregex'` or `'koreinference'` for deterministic minimal core. Omit to let MDL pick the winner across all three.
+- **`kmax`** (1–5): Context window for k-ORE inference (iDRegEx, kOREInference). Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
+- **`N`** (1–10): Random trials for k-ORE inference. More = better convergence but slower. Default 3.
+- **`min_coverage`** (0.5–1.0): **Optional core+outlier analysis.** When < 1.0, iteratively removes outlier sequences (those with the rarest symbols) until at least this fraction remain. Returns the core CRX grammar for the majority plus a list of removed outliers. Default 1.0 = disabled. Example: `min_coverage=0.8` finds the tight pattern for ~80% of examples while flagging the other ~20% as variants.
 
 ### Agent workflow
 
@@ -129,19 +130,21 @@ The sweet spot: **multiple implementations of the same abstract task** with a sh
 | When | Use | Why |
 |------|-----|-----|
 | Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
-| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
-| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
-| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |
+| Few examples, or want minimal common core | **iDRegEx** or **kOREInference** | Probabilistic EM, finds only what's shared. |
+| Don't know which is better | **Ensemble (default)** | Runs all three, picks best by MDL score. |
+| Want core pattern + outlier detection | **Ensemble + `min_coverage<1`** | Finds tight grammar for majority, flags outliers. |
+| Data is clearly one type | `prefer='crx'` | Skips ensemble comparison, runs CRX alone. |
 
 ## When each algorithm wins
 
 | Data property | Winner | Why |
 |---------------|--------|-----|
-| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
+| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx/kOREInference return ∅. |
 | Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
 | Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
 | 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
 | Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
+| Want majority pattern + outlier list | CRX + `min_coverage` | Core analysis finds tight grammar for ~80%, flags the rest. |
 
 ## Token savings
 
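Reviewer sketch, not taken from the commit: the README text in this change describes `min_coverage` as repeatedly dropping the sequences that contain the rarest symbols while at least the requested fraction of examples remains. One way such a loop might look is shown below. The name `split_core_outliers`, the rarity score (how many sequences contain a sequence's rarest symbol), and the early stop are all assumptions made for illustration; the logic behind `infer_ensemble(..., min_coverage=...)` may differ.

```python
import math
from collections import Counter


def split_core_outliers(sequences, min_coverage=0.8):
    """Hypothetical helper: peel off rarest-symbol sequences while coverage allows."""
    keep_at_least = math.ceil(min_coverage * len(sequences))
    core = list(sequences)
    outliers = []
    while len(core) > keep_at_least:
        # Document frequency: in how many remaining sequences each symbol occurs.
        freq = Counter(sym for seq in core for sym in set(seq))

        def rarity(seq):
            # A sequence is as "rare" as its rarest symbol.
            return min((freq[s] for s in seq), default=len(core))

        worst = min(core, key=rarity)
        if rarity(worst) == len(core):
            break  # every symbol is shared by all remaining sequences
        core.remove(worst)
        outliers.append(worst)
    return core, outliers


examples = [
    ["file", "copy", "command"],
    ["file", "template", "command"],
    ["file", "copy", "command"],
    ["file", "template", "command"],
    ["debug", "shell"],
]
core, outliers = split_core_outliers(examples, min_coverage=0.8)
```

With five examples and `min_coverage=0.8`, at most one sequence may be dropped, so only `["debug", "shell"]` lands in the outlier list and the four `file ... command` sequences form the core that would then be handed to CRX.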
@@ -17,6 +17,7 @@ def infer_best_grammar(
     prefer: str = "",
     kmax: int = 2,
     N: int = 3,
+    min_coverage: float = 1.0,
 ) -> str:
     """Infer a compact grammar from example sequences. Use this when you
     have examples of sequential data and want to learn the pattern.
@@ -29,19 +30,26 @@ def infer_best_grammar(
         sequences: List of sequences, each a list of strings (symbols in
             the order they appear). Example: [["file","copy","command"],
             ["file","template","command"]].
-        prefer: Optional — 'crx' for full coverage (accepts all examples),
-            'idregex' for minimal core (only what every example shares).
-            Default: runs both and picks best by MDL score.
-        kmax: Maximum k for iDRegEx k-ORE inference.
-        N: Number of EM iterations for iDRegEx.
+        prefer: Optional — 'crx' for full vocabulary (accepts all examples),
+            'idregex' for deterministic minimal core. Omit to auto-pick by MDL.
+        kmax: Context depth for k-ORE inference. Default 2.
+        N: Random trials for k-ORE inference (higher = better, slower).
+        min_coverage: (Expert) When < 1.0, also runs a **core+outlier analysis**:
+            iteratively removes outlier sequences (those with rarest symbols)
+            until at least this fraction remain. Returns the core grammar
+            for the majority, plus a list of which sequences were removed and why.
+            Default 1.0 = no core analysis. Set to 0.8 to find the tight
+            pattern shared by ~80% of examples while flagging the other ~20%
+            as variations.
 
     Returns:
         A formatted string with the best grammar, scores, and explanation.
+        When min_coverage < 1.0, includes the core grammar and outlier info.
         Grammar notation: a.b = a then b, (a+b) = a or b, r? = optional,
         r+ = one or more, r+? = zero or more.
     """
     pref = prefer if prefer else None
-    result = infer_ensemble(sequences, kmax=kmax, N=N, prefer=pref)
+    result = infer_ensemble(sequences, kmax=kmax, N=N, prefer=pref, min_coverage=min_coverage)
     if result['best'] is None:
         return f"No grammar found. {result['why']}"
     lines = [f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})",
@@ -53,6 +61,13 @@ def infer_best_grammar(
             lines.append(f" {r['algorithm']:10s} MDL={r['mdl_score']:>8.2f} match={m}/{len(sequences)}")
     lines.append("")
     lines.append(f"Why: {result['why']}")
+    if 'core' in result and result['core']:
+        c = result['core']
+        lines.append(f"\nCore CRX ({c['coverage']:.0%} coverage, {c['outlier_count']} outliers): {c['grammar']}")
+        if c['outliers']:
+            lines.append(f" Outlier sequences:")
+            for i, o in enumerate(c['outliers'], 1):
+                lines.append(f" {i}. {' → '.join(str(x) for x in o[:8])}{'...' if len(o) > 8 else ''}")
     return "\n".join(lines)
 
 
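Reviewer sketch: to make the new output section easier to check, the snippet below reproduces the core/outlier formatting from the hunk above as a standalone function. The keys `coverage`, `outlier_count`, `grammar`, and `outliers` are the ones the diff reads from `result['core']`; the function name `format_core` and the sample values are hypothetical and chosen only for illustration.

```python
def format_core(core):
    """Render a core-analysis dict the way the new tool output does (sketch)."""
    lines = [
        f"Core CRX ({core['coverage']:.0%} coverage, "
        f"{core['outlier_count']} outliers): {core['grammar']}"
    ]
    if core['outliers']:
        lines.append("  Outlier sequences:")
        for i, seq in enumerate(core['outliers'], 1):
            # Show at most 8 symbols per outlier; mark longer ones with "...".
            shown = " → ".join(str(x) for x in seq[:8])
            suffix = "..." if len(seq) > 8 else ""
            lines.append(f"    {i}. {shown}{suffix}")
    return "\n".join(lines)


# Illustrative values only, written in the grammar notation from the docstring.
sample = {
    "coverage": 0.8,
    "outlier_count": 1,
    "grammar": "file.(copy+template).command",
    "outliers": [["debug", "shell"], list("abcdefghij")],
}
report = format_core(sample)
```

The `:.0%` format spec multiplies by 100 and rounds, so a `coverage` of `0.8` prints as `80%`. The second sample outlier has ten symbols and is therefore cut after the eighth.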
239
examples/readme_analysis.py
Normal file
@@ -0,0 +1,239 @@
+"""
+README Structure Analysis — infer the conventional heading structure of
+top GitHub repositories using Dervish grammar inference.
+"""
+
+import re
+import sys
+import time
+import json
+import requests
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from bex.ensemble import infer_ensemble, _matches
+
+# ── Synonym normalization map ──
+NORMALIZE = {
+    'description': 'description',
+    'overview': 'description',
+    'about': 'description',
+    'introduction': 'description',
+    'getting started': 'getting-started',
+    'quick start': 'getting-started',
+    'quickstart': 'getting-started',
+    'installation': 'installation',
+    'install': 'installation',
+    'setup': 'installation',
+    'usage': 'usage',
+    'how to use': 'usage',
+    'examples': 'usage',
+    'example': 'usage',
+    'api': 'api',
+    'api reference': 'api',
+    'api documentation': 'api',
+    'documentation': 'api',
+    'features': 'features',
+    'configuration': 'configuration',
+    'config': 'configuration',
+    'contributing': 'contributing',
+    'development': 'contributing',
+    'building': 'contributing',
+    'build': 'contributing',
+    'license': 'license',
+    'changelog': 'changelog',
+    'faq': 'faq',
+    'frequently asked questions': 'faq',
+    'support': 'support',
+    'screenshots': 'screenshots',
+    'demo': 'screenshots',
+    'tests': 'testing',
+    'testing': 'testing',
+    'badges': 'badges',
+    'acknowledgments': 'acknowledgments',
+    'acknowledgements': 'acknowledgments',
+    'credits': 'acknowledgments',
+    'roadmap': 'roadmap',
+    'related projects': 'related',
+    'see also': 'related',
+}
+
+def normalize_heading(text):
+    """Normalize a heading to a canonical name, or return the raw slug."""
+    t = text.strip().lower()
+    t = re.sub(r'[^a-z0-9 ]', '', t)
+    t = re.sub(r'\s+', ' ', t).strip()
+    return NORMALIZE.get(t, t)
+
+def fetch_top_repos(n=100, min_stars=5000):
+    """Fetch top N repos by stars from GitHub search API."""
+    repos = []
+    page = 1
+    headers = {'Accept': 'application/vnd.github.v3+json'}
+    per_page = min(n, 100)
+
+    while len(repos) < n:
+        url = (
+            f'https://api.github.com/search/repositories'
+            f'?q=stars:>{min_stars}&sort=stars&order=desc'
+            f'&per_page={per_page}&page={page}'
+        )
+        resp = requests.get(url, headers=headers)
+        if resp.status_code == 403:
+            print(" Rate limited. Sleeping 60s...")
+            time.sleep(60)
+            continue
+        if resp.status_code != 200:
+            print(f" API error {resp.status_code}: {resp.text[:200]}")
+            break
+        data = resp.json()
+        items = data.get('items', [])
+        if not items:
+            break
+        for r in items:
+            repos.append({
+                'full_name': r['full_name'],
+                'stars': r['stargazers_count'],
+                'default_branch': r.get('default_branch', 'main'),
+                'description': r.get('description', ''),
+                'language': r.get('language', ''),
+            })
+        print(f" Page {page}: got {len(items)} repos (total {len(repos)})")
+        page += 1
+        # Small delay to avoid secondary rate limits
+        time.sleep(0.5)
+        if len(repos) >= n:
+            break
+
+    return repos[:n]
+
+def fetch_readme(repo):
+    """Fetch README content from a GitHub repo. Tries main, master, and common variants."""
+    branches = [repo['default_branch'], 'main', 'master']
+    attempted = set()
+
+    for branch in branches:
+        if branch in attempted:
+            continue
+        attempted.add(branch)
+        for path in ['README.md', 'readme.md', 'README.markdown', 'README.rst']:
+            url = f'https://raw.githubusercontent.com/{repo["full_name"]}/{branch}/{path}'
+            try:
+                resp = requests.get(url, timeout=10)
+                if resp.status_code == 200:
+                    return resp.text, path
+            except:
+                pass
+    return None, None
+
+def extract_headings(text):
+    """Extract heading sequence from markdown text.
+    Returns list of (level, text) tuples, e.g. [(1, "Title"), (2, "Installation"), ...]
+    """
+    headings = []
+    for line in text.splitlines():
+        m = re.match(r'^(#{1,6})\s+(.+)$', line.strip())
+        if m:
+            level = len(m.group(1))
+            text = m.group(2).strip()
+            # Remove trailing `#` characters (common in some markdowns)
+            text = re.sub(r'\s+#+\s*$', '', text).strip()
+            headings.append((level, text))
+    return headings
+
+def compress_headings(headings):
+    """Convert heading sequence to our symbol vocabulary.
+    H1 becomes just the section key; H2+ include their parent context.
+    """
+    # For simplicity: treat all headings as symbols, normalized.
+    # H1 = title (always present, strip it)
+    # Return list of normalized H2+ heading texts
+    seq = []
+    seen_h1 = False
+    for level, text in headings:
+        if level == 1 and not seen_h1:
+            seen_h1 = True
+            continue  # skip the title
+        norm = normalize_heading(text)
+        if norm:
+            seq.append(norm)
+    return seq
+
+def main():
+    print("=" * 60)
+    print("README Structure Analysis")
+    print("=" * 60)
+
+    # Step 1: Fetch top repos
+    print("\n[1] Fetching top repos from GitHub...")
+    repos = fetch_top_repos(n=100)
+    print(f" Got {len(repos)} repos")
+
+    # Step 2: Fetch READMEs
+    print("\n[2] Fetching READMEs...")
+    sequences = []
+    failed = 0
+    for i, repo in enumerate(repos, 1):
+        raw_text, path = fetch_readme(repo)
+        if raw_text is None:
+            failed += 1
+            continue
+        headings = extract_headings(raw_text)
+        seq = compress_headings(headings)
+        if len(seq) >= 3:  # need at least a few sections
+            sequences.append(seq)
+        if i % 20 == 0:
+            print(f" {i}/{len(repos)}: {len(sequences)} valid, {failed} failed")
+
+    print(f" Total: {len(sequences)} valid sequences, {failed} failed")
+
+    # Step 3: Collect vocabulary stats
+    print("\n[3] Vocabulary statistics...")
+    all_symbols = set()
+    symbol_counts = {}
+    for seq in sequences:
+        for s in seq:
+            all_symbols.add(s)
+            symbol_counts[s] = symbol_counts.get(s, 0) + 1
+
+    print(f" Unique symbols: {len(all_symbols)}")
+    print(f" Top symbols:")
+    for sym, cnt in sorted(symbol_counts.items(), key=lambda x: -x[1])[:25]:
+        pct = cnt / len(sequences) * 100
+        print(f" {sym:30s} {cnt:4d} ({pct:5.1f}%)")
+
+    # Step 4: Run Dervish
+    print("\n[4] Running Dervish grammar inference...")
+    result = infer_ensemble(sequences)
+
+    print(f"\n Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
+    print(f" Grammar: {result['best']['grammar']}")
+    if len(result['all']) > 1:
+        for r in result['all']:
+            m = sum(1 for s in sequences if _matches(r['grammar'], s))
+            print(f" {r['algorithm']:10s} MDL={r['mdl_score']:>8.2f} match={m}/{len(sequences)}")
+    print(f"\n Why: {result['why']}")
+
+    # Step 5: Print example sequences
+    print("\n[5] Sample sequences:")
+    for seq in sequences[:10]:
+        print(f" {' → '.join(seq[:10])}" + (" → ..." if len(seq) > 10 else ""))
+    print(f" ... ({len(sequences)} total)")
+
+    # Save results
+    out = {
+        'num_repos': len(sequences),
+        'failed': failed,
+        'unique_symbols': len(all_symbols),
+        'top_symbols': {s: symbol_counts[s] for s in sorted(symbol_counts, key=lambda x: -symbol_counts[x])[:30]},
+        'grammar': result['best']['grammar'],
+        'algorithm': result['best']['algorithm'],
+        'mdl': result['best']['mdl_score'],
+    }
+    path = Path(__file__).resolve().parent.parent / 'readme_analysis.json'
+    with open(path, 'w') as f:
+        json.dump(out, f, indent=2)
+    print(f"\nResults saved to {path}")
+
+if __name__ == '__main__':
+    main()
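Reviewer sketch: the heading pipeline in `examples/readme_analysis.py` (extract headings, skip the first H1, normalize synonyms) can be tried offline on a small Markdown string without calling the GitHub API. The snippet below condenses it into one function, using the same regular expressions as the new file. The trimmed `NORMALIZE` map, the name `heading_sequence`, and the sample README text are assumptions made for illustration.

```python
import re

# Trimmed synonym map; the script in this commit carries a much longer one.
NORMALIZE = {
    "install": "installation",
    "quick start": "getting-started",
    "how to use": "usage",
}


def normalize_heading(text):
    """Lowercase, strip punctuation, collapse spaces, then map synonyms."""
    t = re.sub(r"[^a-z0-9 ]", "", text.strip().lower())
    t = re.sub(r"\s+", " ", t).strip()
    return NORMALIZE.get(t, t)


def heading_sequence(markdown):
    """Return normalized headings in document order, skipping the first H1."""
    seq, seen_h1 = [], False
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.+)$", line.strip())
        if not m:
            continue
        level = len(m.group(1))
        # Drop closing hashes such as "## Title ##".
        title = re.sub(r"\s+#+\s*$", "", m.group(2).strip()).strip()
        if level == 1 and not seen_h1:
            seen_h1 = True
            continue
        norm = normalize_heading(title)
        if norm:
            seq.append(norm)
    return seq


sample = (
    "# My Project\n\n## Install\n\n## Quick Start ##\n\n"
    "### How to use\n\n## License\n"
)
symbols = heading_sequence(sample)
```

Here `symbols` is the kind of list that gets appended to `sequences` before `infer_ensemble` runs. Note that the line-based regex also matches `# comment` lines inside fenced shell snippets, so real READMEs may yield a few spurious symbols.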