docs: add min_coverage to MCP tool + README, include core in output

2026-07-01 15:16:24 +02:00 · 2026-07-01 15:16:24 +02:00 · 036a84cc76
commit 036a84cc76
parent 9045769d57
3 changed files with 271 additions and 14 deletions
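
The `min_coverage` behaviour this commit documents (iteratively dropping the sequences that contain the rarest symbols until the requested fraction remains) can be sketched roughly as follows. This is an illustration under stated assumptions, not the `bex` implementation: the helper name `split_core_outliers`, the per-sequence "rarest symbol" score, and the early-stop rule are all hypothetical.

```python
import math
from collections import Counter


def split_core_outliers(sequences, min_coverage=0.8):
    """Hypothetical helper, not the bex implementation."""
    core = list(sequences)
    outliers = []
    # Fewest sequences we may keep; the small epsilon guards against
    # floating-point noise in the product.
    min_keep = math.ceil(min_coverage * len(core) - 1e-9)
    while len(core) > min_keep:
        # Support = number of remaining sequences that contain each symbol.
        support = Counter(sym for seq in core for sym in set(seq))

        def rarest(seq):
            # Score a sequence by the support of its rarest symbol.
            return min((support[s] for s in seq), default=len(core))

        worst = min(core, key=rarest)
        # Assumption: stop early once every symbol of the worst sequence
        # is shared by all remaining sequences.
        if rarest(worst) == len(core):
            break
        core.remove(worst)
        outliers.append(worst)
    return core, outliers
```

With four copies of `["a", "b", "c"]` plus one `["a", "x", "c"]` and `min_coverage=0.8`, the sketch keeps the four matching sequences and reports the `x` variant as the single outlier; the core grammar would then be inferred from the kept sequences only.
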
--- a/README.md
+++ b/README.md
@@ -41,12 +41,13 @@ The primary interface is a **Model Context Protocol (MCP)** server. Connect any

 | Tool | Parameters | What it does |
 |------|-----------|-------------|
-| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N` | **The only tool you need.** Runs CRX + iDRegEx, picks best by MDL. Set `prefer='crx'` for full coverage or `prefer='idregex'` for minimal core — skips the ensemble and runs one algorithm. |
+| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N`, `min_coverage` | **The only tool you need.** Runs CRX + iDRegEx + kOREInference, picks best by MDL. Set `prefer` to run only one algorithm. Set `min_coverage < 1.0` for optional core+outlier analysis. |

 **Parameters explained:**
-- **`prefer`**: `'crx'` for full vocabulary (accepts all sequences), `'idregex'` for minimal common core (only what every example shares). Omit to let MDL pick the winner.
-- **`kmax`** (1–5): Context window for iDRegEx's k-testable automaton. Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
-- **`N`** (1–10): Baum-Welch EM iterations for iDRegEx training. More iterations = better convergence but slower. Default 3 is a good balance.
+- **`prefer`**: `'crx'` for full vocabulary (accepts all sequences), `'idregex'` or `'koreinference'` for deterministic minimal core. Omit to let MDL pick the winner across all three.
+- **`kmax`** (1–5): Context window for k-ORE inference (iDRegEx, kOREInference). Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
+- **`N`** (1–10): Number of random trials for k-ORE inference. More trials give better results but run slower. Default 3.
+- **`min_coverage`** (0.5–1.0): **Optional core+outlier analysis.** When below 1.0, iteratively removes outlier sequences (those containing the rarest symbols) for as long as at least this fraction of the input remains. Returns the core CRX grammar for the retained majority plus a list of the removed outliers. Default 1.0 disables the analysis. Example: `min_coverage=0.8` finds the tight pattern for ~80% of examples and flags the other ~20% as variants.

 ### Agent workflow

@@ -129,19 +130,21 @@ The sweet spot: **multiple implementations of the same abstract task** with a sh
 | When | Use | Why |
 |------|-----|-----|
 | Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
-| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
-| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
-| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |
+| Few examples, or want minimal common core | **iDRegEx** or **kOREInference** | k-ORE inference, finds only what's shared. |
+| Don't know which is better | **Ensemble (default)** | Runs all three, picks best by MDL score. |
+| Want core pattern + outlier detection | **Ensemble + `min_coverage<1`** | Finds tight grammar for majority, flags outliers. |
+| Data is clearly one type | `prefer='crx'` | Skips ensemble comparison, runs CRX alone. |

 ## When each algorithm wins

 | Data property | Winner | Why |
 |---------------|--------|-----|
-| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
+| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx/kOREInference return ∅. |
 | Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
 | Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
 | 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
 | Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
+| Want majority pattern + outlier list | CRX + `min_coverage` | Core analysis finds a tight grammar for the majority (~80% at `min_coverage=0.8`) and flags the rest. |

 ## Token savings

--- a/bex/mcp_server.py
+++ b/bex/mcp_server.py
@@ -17,6 +17,7 @@ def infer_best_grammar(
    prefer: str = "",
    kmax: int = 2,
    N: int = 3,
+    min_coverage: float = 1.0,
 ) -> str:
    """Infer a compact grammar from example sequences. Use this when you
    have examples of sequential data and want to learn the pattern.
@@ -29,19 +30,26 @@ def infer_best_grammar(
        sequences: List of sequences, each a list of strings (symbols in
            the order they appear). Example: [["file","copy","command"],
            ["file","template","command"]].
-        prefer: Optional — 'crx' for full coverage (accepts all examples),
-            'idregex' for minimal core (only what every example shares).
-            Default: runs both and picks best by MDL score.
-        kmax: Maximum k for iDRegEx k-ORE inference.
-        N: Number of EM iterations for iDRegEx.
+        prefer: Optional. 'crx' for full vocabulary (accepts all examples),
+            'idregex' or 'koreinference' for deterministic minimal core. Omit to auto-pick by MDL.
+        kmax: Context depth for k-ORE inference. Default 2.
+        N: Random trials for k-ORE inference (higher = better, slower).
+        min_coverage: (Expert) When < 1.0, also runs a core+outlier analysis:
+            iteratively removes outlier sequences (those with the rarest
+            symbols) for as long as at least this fraction of the input
+            remains. Returns the core grammar for the majority, plus a list
+            of the removed sequences. Default 1.0 = no core analysis. Set to
+            0.8 to find the tight pattern shared by ~80% of examples while
+            flagging the other ~20% as variants.

    Returns:
        A formatted string with the best grammar, scores, and explanation.
+        When min_coverage < 1.0, includes the core grammar and outlier info.
        Grammar notation: a.b = a then b, (a+b) = a or b, r? = optional,
        r+ = one or more, r+? = zero or more.
    """
    pref = prefer if prefer else None
-    result = infer_ensemble(sequences, kmax=kmax, N=N, prefer=pref)
+    result = infer_ensemble(sequences, kmax=kmax, N=N, prefer=pref, min_coverage=min_coverage)
    if result['best'] is None:
        return f"No grammar found. {result['why']}"
    lines = [f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})",
@@ -53,6 +61,13 @@ def infer_best_grammar(
            lines.append(f"  {r['algorithm']:10s}  MDL={r['mdl_score']:>8.2f}  match={m}/{len(sequences)}")
    lines.append("")
    lines.append(f"Why: {result['why']}")
+    if 'core' in result and result['core']:
+        c = result['core']
+        lines.append(f"\nCore CRX ({c['coverage']:.0%} coverage, {c['outlier_count']} outliers): {c['grammar']}")
+        if c['outliers']:
+            lines.append("  Outlier sequences:")
+            for i, o in enumerate(c['outliers'], 1):
+                lines.append(f"    {i}. {' → '.join(str(x) for x in o[:8])}{'...' if len(o) > 8 else ''}")
    return "\n".join(lines)


--- a/examples/readme_analysis.py
+++ b/examples/readme_analysis.py
@@ -0,0 +1,239 @@
+"""
+README Structure Analysis — infer the conventional heading structure of
+top GitHub repositories using Dervish grammar inference.
+"""
+
+import re
+import sys
+import time
+import json
+import requests
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from bex.ensemble import infer_ensemble, _matches
+
+# ── Synonym normalization map ──
+NORMALIZE = {
+    'description': 'description',
+    'overview': 'description',
+    'about': 'description',
+    'introduction': 'description',
+    'getting started': 'getting-started',
+    'quick start': 'getting-started',
+    'quickstart': 'getting-started',
+    'installation': 'installation',
+    'install': 'installation',
+    'setup': 'installation',
+    'usage': 'usage',
+    'how to use': 'usage',
+    'examples': 'usage',
+    'example': 'usage',
+    'api': 'api',
+    'api reference': 'api',
+    'api documentation': 'api',
+    'documentation': 'api',
+    'features': 'features',
+    'configuration': 'configuration',
+    'config': 'configuration',
+    'contributing': 'contributing',
+    'development': 'contributing',
+    'building': 'contributing',
+    'build': 'contributing',
+    'license': 'license',
+    'changelog': 'changelog',
+    'faq': 'faq',
+    'frequently asked questions': 'faq',
+    'support': 'support',
+    'screenshots': 'screenshots',
+    'demo': 'screenshots',
+    'tests': 'testing',
+    'testing': 'testing',
+    'badges': 'badges',
+    'acknowledgments': 'acknowledgments',
+    'acknowledgements': 'acknowledgments',
+    'credits': 'acknowledgments',
+    'roadmap': 'roadmap',
+    'related projects': 'related',
+    'see also': 'related',
+}
+
+def normalize_heading(text):
+    """Normalize a heading to a canonical name, or return the raw slug."""
+    t = text.strip().lower()
+    t = re.sub(r'[^a-z0-9 ]', '', t)
+    t = re.sub(r'\s+', ' ', t).strip()
+    return NORMALIZE.get(t, t)
+
+def fetch_top_repos(n=100, min_stars=5000):
+    """Fetch top N repos by stars from GitHub search API."""
+    repos = []
+    page = 1
+    headers = {'Accept': 'application/vnd.github.v3+json'}
+    per_page = min(n, 100)
+
+    while len(repos) < n:
+        url = (
+            f'https://api.github.com/search/repositories'
+            f'?q=stars:>{min_stars}&sort=stars&order=desc'
+            f'&per_page={per_page}&page={page}'
+        )
+        resp = requests.get(url, headers=headers, timeout=10)
+        if resp.status_code == 403:
+            print("  Rate limited. Sleeping 60s...")
+            time.sleep(60)
+            continue
+        if resp.status_code != 200:
+            print(f"  API error {resp.status_code}: {resp.text[:200]}")
+            break
+        data = resp.json()
+        items = data.get('items', [])
+        if not items:
+            break
+        for r in items:
+            repos.append({
+                'full_name': r['full_name'],
+                'stars': r['stargazers_count'],
+                'default_branch': r.get('default_branch', 'main'),
+                'description': r.get('description') or '',
+                'language': r.get('language') or '',
+            })
+        print(f"  Page {page}: got {len(items)} repos (total {len(repos)})")
+        page += 1
+        # Small delay to avoid secondary rate limits
+        time.sleep(0.5)
+        if len(repos) >= n:
+            break
+
+    return repos[:n]
+
+def fetch_readme(repo):
+    """Fetch README content from a GitHub repo. Tries the default branch, then main and master, with common README filenames."""
+    branches = [repo['default_branch'], 'main', 'master']
+    attempted = set()
+
+    for branch in branches:
+        if branch in attempted:
+            continue
+        attempted.add(branch)
+        for path in ['README.md', 'readme.md', 'README.markdown', 'README.rst']:
+            url = f'https://raw.githubusercontent.com/{repo["full_name"]}/{branch}/{path}'
+            try:
+                resp = requests.get(url, timeout=10)
+                if resp.status_code == 200:
+                    return resp.text, path
+            except requests.RequestException:
+                pass
+    return None, None
+
+def extract_headings(text):
+    """Extract heading sequence from markdown text.
+    Returns list of (level, text) tuples, e.g. [(1, "Title"), (2, "Installation"), ...]
+    """
+    headings = []
+    for line in text.splitlines():
+        m = re.match(r'^(#{1,6})\s+(.+)$', line.strip())
+        if m:
+            level = len(m.group(1))
+            title = m.group(2).strip()
+            # Remove trailing `#` characters (the optional closing sequence of an ATX heading)
+            title = re.sub(r'\s+#+\s*$', '', title).strip()
+            headings.append((level, title))
+    return headings
+
+def compress_headings(headings):
+    """Convert a heading sequence to our symbol vocabulary.
+    The first H1 (the title) is dropped; every other heading is normalized.
+    """
+    # For simplicity: treat all headings as symbols, ignoring nesting level.
+    # The first H1 is taken to be the title and is skipped.
+    # Returns a list of normalized heading texts.
+    seq = []
+    seen_h1 = False
+    for level, text in headings:
+        if level == 1 and not seen_h1:
+            seen_h1 = True
+            continue  # skip the title
+        norm = normalize_heading(text)
+        if norm:
+            seq.append(norm)
+    return seq
+
+def main():
+    print("=" * 60)
+    print("README Structure Analysis")
+    print("=" * 60)
+
+    # Step 1: Fetch top repos
+    print("\n[1] Fetching top repos from GitHub...")
+    repos = fetch_top_repos(n=100)
+    print(f"  Got {len(repos)} repos")
+
+    # Step 2: Fetch READMEs
+    print("\n[2] Fetching READMEs...")
+    sequences = []
+    failed = 0
+    for i, repo in enumerate(repos, 1):
+        raw_text, path = fetch_readme(repo)
+        if raw_text is None:
+            failed += 1
+            continue
+        headings = extract_headings(raw_text)
+        seq = compress_headings(headings)
+        if len(seq) >= 3:  # need at least a few sections
+            sequences.append(seq)
+        if i % 20 == 0:
+            print(f"  {i}/{len(repos)}: {len(sequences)} valid, {failed} failed")
+
+    print(f"  Total: {len(sequences)} valid sequences, {failed} failed")
+
+    # Step 3: Collect vocabulary stats
+    print("\n[3] Vocabulary statistics...")
+    all_symbols = set()
+    symbol_counts = {}
+    for seq in sequences:
+        for s in seq:
+            all_symbols.add(s)
+            symbol_counts[s] = symbol_counts.get(s, 0) + 1
+
+    print(f"  Unique symbols: {len(all_symbols)}")
+    print("  Top symbols:")
+    for sym, cnt in sorted(symbol_counts.items(), key=lambda x: -x[1])[:25]:
+        pct = cnt / len(sequences) * 100
+        print(f"    {sym:30s}  {cnt:4d} ({pct:5.1f}%)")
+
+    # Step 4: Run Dervish
+    print("\n[4] Running Dervish grammar inference...")
+    result = infer_ensemble(sequences)
+
+    print(f"\n  Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
+    print(f"  Grammar: {result['best']['grammar']}")
+    if len(result['all']) > 1:
+        for r in result['all']:
+            m = sum(1 for s in sequences if _matches(r['grammar'], s))
+            print(f"    {r['algorithm']:10s}  MDL={r['mdl_score']:>8.2f}  match={m}/{len(sequences)}")
+    print(f"\n  Why: {result['why']}")
+
+    # Step 5: Print example sequences
+    print("\n[5] Sample sequences:")
+    for seq in sequences[:10]:
+        print(f"  {' → '.join(seq[:10])}" + (" → ..." if len(seq) > 10 else ""))
+    print(f"  ... ({len(sequences)} total)")
+
+    # Save results
+    out = {
+        'num_repos': len(sequences),
+        'failed': failed,
+        'unique_symbols': len(all_symbols),
+        'top_symbols': {s: symbol_counts[s] for s in sorted(symbol_counts, key=lambda x: -symbol_counts[x])[:30]},
+        'grammar': result['best']['grammar'],
+        'algorithm': result['best']['algorithm'],
+        'mdl': result['best']['mdl_score'],
+    }
+    path = Path(__file__).resolve().parent.parent / 'readme_analysis.json'
+    with open(path, 'w') as f:
+        json.dump(out, f, indent=2)
+    print(f"\nResults saved to {path}")
+
+if __name__ == '__main__':
+    main()
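
One caveat for `extract_headings` in the example script: lines that start with `#` inside fenced code blocks (shell comments, for instance) also match the heading pattern and would end up in the inferred sequences. A minimal sketch of a fence-aware variant follows. The name `extract_headings_skip_fences` is hypothetical and not part of this commit, and the fence handling is simplified: a block is assumed to close at the next fence line that uses the same marker character.

```python
import re

FENCE = re.compile(r'^(`{3,}|~{3,})')
HEADING = re.compile(r'^(#{1,6})\s+(.+)$')


def extract_headings_skip_fences(text):
    """Return (level, title) tuples, ignoring lines inside code fences."""
    headings = []
    fence = None  # marker character of the currently open fence, if any
    for line in text.splitlines():
        stripped = line.strip()
        f = FENCE.match(stripped)
        if f:
            marker = f.group(1)[0]
            if fence is None:
                fence = marker      # opening fence
            elif marker == fence:
                fence = None        # matching closing fence
            continue
        if fence is not None:
            continue                # inside a code block: not a heading
        m = HEADING.match(stripped)
        if m:
            # Drop the optional closing `#` run of an ATX heading.
            title = re.sub(r'\s+#+\s*$', '', m.group(2)).strip()
            headings.append((len(m.group(1)), title))
    return headings
```

The function returns the same `(level, text)` tuples as the original, so it could be passed to `compress_headings` unchanged.
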