diff --git a/README.md b/README.md
index 247a240..11dd1b1 100644
--- a/README.md
+++ b/README.md
@@ -41,12 +41,13 @@ The primary interface is a **Model Context Protocol (MCP)** server. Connect any
 
 | Tool | Parameters | What it does |
 |------|-----------|-------------|
-| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N` | **The only tool you need.** Runs CRX + iDRegEx, picks best by MDL. Set `prefer='crx'` for full coverage or `prefer='idregex'` for minimal core — skips the ensemble and runs one algorithm. |
+| `infer_best_grammar` | `sequences`, `prefer`, `kmax`, `N`, `min_coverage` | **The only tool you need.** Runs CRX + iDRegEx + kOREInference, picks best by MDL. Set `prefer` to run only one algorithm. Set `min_coverage < 1.0` for optional core+outlier analysis. |
 
 **Parameters explained:**
-- **`prefer`**: `'crx'` for full vocabulary (accepts all sequences), `'idregex'` for minimal common core (only what every example shares). Omit to let MDL pick the winner.
-- **`kmax`** (1–5): Context window for iDRegEx's k-testable automaton. Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
-- **`N`** (1–10): Baum-Welch EM iterations for iDRegEx training. More iterations = better convergence but slower. Default 3 is a good balance.
+- **`prefer`**: `'crx'` for full vocabulary (accepts all sequences), `'idregex'` or `'koreinference'` for deterministic minimal core. Omit to let MDL pick the winner across all three.
+- **`kmax`** (1–5): Context window for k-ORE inference (iDRegEx, kOREInference). Higher values capture longer-range dependencies but need more data and are slower. Default 2 works for most cases.
+- **`N`** (1–10): Random trials for k-ORE inference. More = better convergence but slower. Default 3.
+- **`min_coverage`** (0.5–1.0): **Optional core+outlier analysis.** When < 1.0, iteratively removes outlier sequences (those with the rarest symbols) until at least this fraction remain. Returns the core CRX grammar for the majority plus a list of removed outliers. Default 1.0 = disabled. Example: `min_coverage=0.8` finds the tight pattern for ~80% of examples while flagging the other ~20% as variants.
 
 ### Agent workflow
 
@@ -129,19 +130,21 @@ The sweet spot: **multiple implementations of the same abstract task** with a sh
 | When | Use | Why |
 |------|-----|-----|
 | Clean, structured data with full vocabulary | **CRX** | Single-pass, deterministic. Accepts all sequences. |
-| Few examples, or want minimal common core | **iDRegEx** | Probabilistic EM, finds only what's shared. |
-| Don't know which is better | **Ensemble (default)** | Runs both, picks the best by MDL score. |
-| Data is clearly one type | `prefer='crx'` or `prefer='idregex'` | Skips ensemble comparison, runs one algorithm. |
+| Few examples, or want minimal common core | **iDRegEx** or **kOREInference** | Probabilistic EM, finds only what's shared. |
+| Don't know which is better | **Ensemble (default)** | Runs all three, picks best by MDL score. |
+| Want core pattern + outlier detection | **Ensemble + `min_coverage<1`** | Finds tight grammar for majority, flags outliers. |
+| Data is clearly one type | `prefer='crx'` | Skips ensemble comparison, runs CRX alone. |
 
 ## When each algorithm wins
 
 | Data property | Winner | Why |
 |---------------|--------|-----|
-| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx returns ∅. |
+| Diverse patterns, full vocabulary needed | CRX | Captures all symbols. iDRegEx/kOREInference return ∅. |
 | Clean sequences with clear core | iDRegEx | Extracts minimal common subsequence. CRX buries it in optional noise. |
 | Single sequence | iDRegEx (+ RWR₀) | RWR₀ repair produces a grammatical regex from one example. |
 | 2–3 sequences | iDRegEx | CRX overfits. iDRegEx handles noise better. |
 | Many sequences, tight pattern | CRX | Learns precise concatenation with optional suffixes. |
+| Want majority pattern + outlier list | CRX + `min_coverage` | Core analysis finds tight grammar for ~80%, flags the rest. |
 
 ## Token savings
 
diff --git a/bex/mcp_server.py b/bex/mcp_server.py
index df7b034..226ff5a 100644
--- a/bex/mcp_server.py
+++ b/bex/mcp_server.py
@@ -17,6 +17,7 @@ def infer_best_grammar(
     prefer: str = "",
     kmax: int = 2,
     N: int = 3,
+    min_coverage: float = 1.0,
 ) -> str:
     """Infer a compact grammar from example sequences.
     Use this when you have examples of sequential data and want to learn the pattern.
@@ -29,19 +30,26 @@ def infer_best_grammar(
         sequences: List of sequences, each a list of strings
             (symbols in the order they appear).
             Example: [["file","copy","command"], ["file","template","command"]].
-        prefer: Optional — 'crx' for full coverage (accepts all examples),
-            'idregex' for minimal core (only what every example shares).
-            Default: runs both and picks best by MDL score.
-        kmax: Maximum k for iDRegEx k-ORE inference.
-        N: Number of EM iterations for iDRegEx.
+        prefer: Optional — 'crx' for full vocabulary (accepts all examples),
+            'idregex' for deterministic minimal core. Omit to auto-pick by MDL.
+        kmax: Context depth for k-ORE inference. Default 2.
+        N: Random trials for k-ORE inference (higher = better, slower).
+        min_coverage: (Expert) When < 1.0, also runs a **core+outlier analysis**:
+            iteratively removes outlier sequences (those with rarest symbols)
+            until at least this fraction remain. Returns the core grammar
+            for the majority, plus a list of which sequences were removed and why.
+            Default 1.0 = no core analysis. Set to 0.8 to find the tight
+            pattern shared by ~80% of examples while flagging the other ~20%
+            as variations.
 
     Returns:
         A formatted string with the best grammar, scores, and explanation.
+        When min_coverage < 1.0, includes the core grammar and outlier info.
 
     Grammar notation: a.b = a then b, (a+b) = a or b, r? = optional, r+ = one or more, r+? = zero or more.
     """
     pref = prefer if prefer else None
-    result = infer_ensemble(sequences, kmax=kmax, N=N, prefer=pref)
+    result = infer_ensemble(sequences, kmax=kmax, N=N, prefer=pref, min_coverage=min_coverage)
     if result['best'] is None:
         return f"No grammar found. {result['why']}"
     lines = [f"Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})",
@@ -53,6 +61,13 @@ def infer_best_grammar(
             lines.append(f" {r['algorithm']:10s} MDL={r['mdl_score']:>8.2f} match={m}/{len(sequences)}")
     lines.append("")
     lines.append(f"Why: {result['why']}")
+    if 'core' in result and result['core']:
+        c = result['core']
+        lines.append(f"\nCore CRX ({c['coverage']:.0%} coverage, {c['outlier_count']} outliers): {c['grammar']}")
+        if c['outliers']:
+            lines.append(f" Outlier sequences:")
+            for i, o in enumerate(c['outliers'], 1):
+                lines.append(f" {i}. {' → '.join(str(x) for x in o[:8])}{'...' if len(o) > 8 else ''}")
     return "\n".join(lines)
 
 
diff --git a/examples/readme_analysis.py b/examples/readme_analysis.py
new file mode 100644
index 0000000..2a6c7c8
--- /dev/null
+++ b/examples/readme_analysis.py
@@ -0,0 +1,239 @@
+"""
+README Structure Analysis — infer the conventional heading structure of
+top GitHub repositories using Dervish grammar inference.
+"""
+
+import re
+import sys
+import time
+import json
+import requests
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from bex.ensemble import infer_ensemble, _matches
+
+# ── Synonym normalization map ──
+NORMALIZE = {
+    'description': 'description',
+    'overview': 'description',
+    'about': 'description',
+    'introduction': 'description',
+    'getting started': 'getting-started',
+    'quick start': 'getting-started',
+    'quickstart': 'getting-started',
+    'installation': 'installation',
+    'install': 'installation',
+    'setup': 'installation',
+    'usage': 'usage',
+    'how to use': 'usage',
+    'examples': 'usage',
+    'example': 'usage',
+    'api': 'api',
+    'api reference': 'api',
+    'api documentation': 'api',
+    'documentation': 'api',
+    'features': 'features',
+    'configuration': 'configuration',
+    'config': 'configuration',
+    'contributing': 'contributing',
+    'development': 'contributing',
+    'building': 'contributing',
+    'build': 'contributing',
+    'license': 'license',
+    'changelog': 'changelog',
+    'faq': 'faq',
+    'frequently asked questions': 'faq',
+    'support': 'support',
+    'screenshots': 'screenshots',
+    'demo': 'screenshots',
+    'tests': 'testing',
+    'testing': 'testing',
+    'badges': 'badges',
+    'acknowledgments': 'acknowledgments',
+    'acknowledgements': 'acknowledgments',
+    'credits': 'acknowledgments',
+    'roadmap': 'roadmap',
+    'related projects': 'related',
+    'see also': 'related',
+}
+
+def normalize_heading(text):
+    """Normalize a heading to a canonical name, or return the raw slug."""
+    t = text.strip().lower()
+    t = re.sub(r'[^a-z0-9 ]', '', t)
+    t = re.sub(r'\s+', ' ', t).strip()
+    return NORMALIZE.get(t, t)
+
+def fetch_top_repos(n=100, min_stars=5000):
+    """Fetch top N repos by stars from GitHub search API."""
+    repos = []
+    page = 1
+    headers = {'Accept': 'application/vnd.github.v3+json'}
+    per_page = min(n, 100)
+
+    while len(repos) < n:
+        url = (
+            f'https://api.github.com/search/repositories'
+            f'?q=stars:>{min_stars}&sort=stars&order=desc'
+            f'&per_page={per_page}&page={page}'
+        )
+        resp = requests.get(url, headers=headers, timeout=10)
+        if resp.status_code == 403:
+            print(" Rate limited. Sleeping 60s...")
+            time.sleep(60)
+            continue
+        if resp.status_code != 200:
+            print(f" API error {resp.status_code}: {resp.text[:200]}")
+            break
+        data = resp.json()
+        items = data.get('items', [])
+        if not items:
+            break
+        for r in items:
+            repos.append({
+                'full_name': r['full_name'],
+                'stars': r['stargazers_count'],
+                'default_branch': r.get('default_branch', 'main'),
+                'description': r.get('description', ''),
+                'language': r.get('language', ''),
+            })
+        print(f" Page {page}: got {len(items)} repos (total {len(repos)})")
+        page += 1
+        # Small delay to avoid secondary rate limits
+        time.sleep(0.5)
+        if len(repos) >= n:
+            break
+
+    return repos[:n]
+
+def fetch_readme(repo):
+    """Fetch README content from a GitHub repo. Tries main, master, and common variants."""
+    branches = [repo['default_branch'], 'main', 'master']
+    attempted = set()
+
+    for branch in branches:
+        if branch in attempted:
+            continue
+        attempted.add(branch)
+        for path in ['README.md', 'readme.md', 'README.markdown', 'README.rst']:
+            url = f'https://raw.githubusercontent.com/{repo["full_name"]}/{branch}/{path}'
+            try:
+                resp = requests.get(url, timeout=10)
+                if resp.status_code == 200:
+                    return resp.text, path
+            except requests.RequestException:
+                pass
+    return None, None
+
+def extract_headings(text):
+    """Extract heading sequence from markdown text.
+    Returns list of (level, text) tuples, e.g. [(1, "Title"), (2, "Installation"), ...]
+    """
+    headings = []
+    for line in text.splitlines():
+        m = re.match(r'^(#{1,6})\s+(.+)$', line.strip())
+        if m:
+            level = len(m.group(1))
+            title = m.group(2).strip()
+            # Remove trailing `#` characters (closing markers in ATX-style headings)
+            title = re.sub(r'\s+#+\s*$', '', title).strip()
+            headings.append((level, title))
+    return headings
+
+def compress_headings(headings):
+    """Convert heading sequence to our symbol vocabulary.
+    The first H1 (the title) is dropped; every other heading is normalized.
+    """
+    # For simplicity: treat all headings as symbols, normalized.
+    # H1 = title (assumed present; only the first H1 is skipped)
+    # Return list of normalized texts for all remaining headings
+    seq = []
+    seen_h1 = False
+    for level, text in headings:
+        if level == 1 and not seen_h1:
+            seen_h1 = True
+            continue  # skip the title
+        norm = normalize_heading(text)
+        if norm:
+            seq.append(norm)
+    return seq
+
+def main():
+    print("=" * 60)
+    print("README Structure Analysis")
+    print("=" * 60)
+
+    # Step 1: Fetch top repos
+    print("\n[1] Fetching top repos from GitHub...")
+    repos = fetch_top_repos(n=100)
+    print(f" Got {len(repos)} repos")
+
+    # Step 2: Fetch READMEs
+    print("\n[2] Fetching READMEs...")
+    sequences = []
+    failed = 0
+    for i, repo in enumerate(repos, 1):
+        raw_text, path = fetch_readme(repo)
+        if raw_text is None:
+            failed += 1
+            continue
+        headings = extract_headings(raw_text)
+        seq = compress_headings(headings)
+        if len(seq) >= 3:  # need at least a few sections
+            sequences.append(seq)
+        if i % 20 == 0:
+            print(f" {i}/{len(repos)}: {len(sequences)} valid, {failed} failed")
+
+    print(f" Total: {len(sequences)} valid sequences, {failed} failed")
+
+    # Step 3: Collect vocabulary stats
+    print("\n[3] Vocabulary statistics...")
+    all_symbols = set()
+    symbol_counts = {}
+    for seq in sequences:
+        for s in seq:
+            all_symbols.add(s)
+            symbol_counts[s] = symbol_counts.get(s, 0) + 1
+
+    print(f" Unique symbols: {len(all_symbols)}")
+    print(" Top symbols:")
+    for sym, cnt in sorted(symbol_counts.items(),
+                           key=lambda x: -x[1])[:25]:
+        pct = cnt / len(sequences) * 100
+        print(f" {sym:30s} {cnt:4d} ({pct:5.1f}%)")
+
+    # Step 4: Run Dervish
+    print("\n[4] Running Dervish grammar inference...")
+    result = infer_ensemble(sequences)
+
+    print(f"\n Best: {result['best']['algorithm']} (MDL {result['best']['mdl_score']})")
+    print(f" Grammar: {result['best']['grammar']}")
+    if len(result['all']) > 1:
+        for r in result['all']:
+            m = sum(1 for s in sequences if _matches(r['grammar'], s))
+            print(f" {r['algorithm']:10s} MDL={r['mdl_score']:>8.2f} match={m}/{len(sequences)}")
+    print(f"\n Why: {result['why']}")
+
+    # Step 5: Print example sequences
+    print("\n[5] Sample sequences:")
+    for seq in sequences[:10]:
+        print(f" {' → '.join(seq[:10])}" + (" → ..." if len(seq) > 10 else ""))
+    print(f" ... ({len(sequences)} total)")
+
+    # Save results
+    out = {
+        'num_repos': len(sequences),
+        'failed': failed,
+        'unique_symbols': len(all_symbols),
+        'top_symbols': {s: symbol_counts[s] for s in sorted(symbol_counts, key=lambda x: -symbol_counts[x])[:30]},
+        'grammar': result['best']['grammar'],
+        'algorithm': result['best']['algorithm'],
+        'mdl': result['best']['mdl_score'],
+    }
+    path = Path(__file__).resolve().parent.parent / 'readme_analysis.json'
+    with open(path, 'w') as f:
+        json.dump(out, f, indent=2)
+    print(f"\nResults saved to {path}")
+
+if __name__ == '__main__':
+    main()
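
The `min_coverage` docstring in this diff describes the core+outlier step only in prose: repeatedly drop the sequences that contain the rarest symbols, and stop before fewer than the requested fraction would remain. The sketch below shows one way such a loop could look. It is not taken from `bex`: `split_core_outliers`, its tie-breaking rule, and its early-exit condition are all hypothetical, and the code behind `infer_ensemble(..., min_coverage=...)` may rank outliers differently.

```python
import math
from collections import Counter


def split_core_outliers(sequences, min_coverage=0.8):
    """Hypothetical helper (not part of bex): peel off rare-symbol sequences.

    Returns (core, outliers), where core keeps at least
    ceil(min_coverage * len(sequences)) of the input sequences.
    """
    if not 0.0 < min_coverage <= 1.0:
        raise ValueError("min_coverage must be in (0, 1]")

    core = [list(seq) for seq in sequences]
    outliers = []
    # Smallest core size that still satisfies the coverage floor.
    # The small epsilon guards against float noise such as 0.7 * 10.
    min_keep = math.ceil(min_coverage * len(core) - 1e-9)

    while len(core) > min_keep:
        # Document frequency: number of remaining sequences containing each symbol.
        doc_freq = Counter(sym for seq in core for sym in set(seq))

        def rarity(seq):
            # Ascending symbol frequencies; smaller tuples mean rarer content.
            return tuple(sorted(doc_freq[sym] for sym in seq))

        worst = min(range(len(core)), key=lambda i: rarity(core[i]))
        key = rarity(core[worst])
        if key and key[0] == len(core):
            # Every symbol occurs in every remaining sequence: nothing is rare.
            break
        outliers.append(core.pop(worst))

    return core, outliers


if __name__ == "__main__":
    demo = [
        ["install", "usage", "license"],
        ["install", "usage", "license"],
        ["install", "usage", "contributing", "license"],
        ["install", "usage", "license"],
        ["faq", "roadmap"],
    ]
    core, outliers = split_core_outliers(demo, min_coverage=0.8)
    print("core:", core)
    print("outliers:", outliers)
```

In this sketch an empty sequence sorts before everything else and is therefore removed first. The resulting core could then be handed to the inference step (for example with `prefer='crx'`, as the README table suggests); whether `bex` wires it up this way is not visible in the diff.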