grammar-inference-engine/bex/__init__.py
tobjend 0e2aec582b Grammar inference engine: CRX + iDRegEx ensemble with MDL scoring, MCP server, showcase, and blog post
- Ensemble inference (infer_ensemble) runs both CRX and iDRegEx, picks best by MDL
- CRX: wide coverage (accepts all input sequences, large vocabulary)
- iDRegEx: minimal core grammar (tightest common pattern)
- MDL scoring: fixed model_cost to count alphabet symbol occurrences, fixed dispatch order in _count_words_fast
- Fixed _match_tokens: rewritten as _match_possible with proper backtracking
- Fixed _parse_parts disjunction: children use _parse_flat_symbol to avoid dot-splitting
- MCP server: infer_best_grammar and infer_grammar tools
- Added prefer parameter (crx/idregex) to skip ensemble
- 28 passing tests
- SHOWCASE.md with Geerlingguy Galaxy demonstration
- blog_post.md with full technical deep-dive
2026-07-01 09:51:41 +02:00
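
The "picks best by MDL" step from the commit message can be illustrated with a small, self-contained sketch. It does not use the `bex` package or reproduce its scoring code. The names `model_cost`, `data_cost`, and `pick_by_mdl`, the use of Python `re` patterns as stand-ins for inferred expressions, and the brute-force word count are all assumptions made for illustration. Only two ideas come from the commit message: the model cost counts alphabet symbol occurrences, and a `prefer` value skips the comparison.

```python
import itertools
import math
import re


def model_cost(pattern, alphabet):
    """Count occurrences of alphabet symbols in the expression (operators cost nothing here)."""
    return sum(1 for ch in pattern if ch in alphabet)


def data_cost(pattern, sample, alphabet):
    """Bits needed to single out each sample word among all accepted words of the same length."""
    compiled = re.compile(pattern)
    symbols = sorted(alphabet)
    bits = 0.0
    for word in sample:
        if not compiled.fullmatch(word):
            return math.inf  # a candidate that rejects a sample word is never selected
        # Brute-force count of accepted words with this length (fine for tiny alphabets only).
        accepted = sum(
            1
            for letters in itertools.product(symbols, repeat=len(word))
            if compiled.fullmatch("".join(letters))
        )
        bits += math.log2(accepted)
    return bits


def pick_by_mdl(candidates, sample, prefer=None):
    """Return (name, pattern) of the candidate with the lowest model + data cost.

    `candidates` maps a label to a regex over single-character symbols.
    If `prefer` is given, that candidate is returned without scoring.
    """
    if prefer is not None:
        return prefer, candidates[prefer]
    alphabet = set("".join(sample))

    def total(name):
        pattern = candidates[name]
        return model_cost(pattern, alphabet) + data_cost(pattern, sample, alphabet)

    best = min(candidates, key=total)
    return best, candidates[best]


if __name__ == "__main__":
    # Hand-written stand-in patterns, not actual CRX or iDRegEx output.
    sample = ["ab", "abb", "abbb"]
    candidates = {"crx": "[ab]*", "idregex": "ab+"}
    print(pick_by_mdl(candidates, sample))
    print(pick_by_mdl(candidates, sample, prefer="crx"))
```

In this toy setup both patterns contain two alphabet symbols, so the choice is driven by the data cost: the looser pattern accepts many more words of each length and pays more bits per sample word.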


"""
bex: paper-faithful implementation of the Bex et al. inference algorithms.

Papers:
  - Bex et al. 2010 (TODS): Inference of Concise Regular Expressions and DTDs
  - Bex et al. 2010 (arXiv 1004.2372): Learning Deterministic Regular Expressions

Algorithms implemented:
  TODS 2010:  2T-INF, REWRITE, RWR, RWR², RWR₀, CRX
  arXiv 2010: iKoa, Disambiguate, rwr², iDRegEx
"""
from .soa import SOA
from .twotinf import build_soa
from .rwr0 import rwr0
from .crx import CRX
from .ikoa import ikoa
from .rwrsq import rwr_sq
from .idregex import idregex
from .koa import KOA, build_complete_koa
from .expr import concat, disj, star, optional, alphabet, strip_k
from .marking import mark_koa
from .tokenizer import YAMLTokenizer
from .ensemble import infer_ensemble
from .template import generate_template
__version__ = "0.2.0"