Initial commit: BEX-based grammar inference engine

- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing
tobjend 2026-07-01 08:01:16 +02:00
commit 7c00c6713d
33 changed files with 8928 additions and 0 deletions

8
.gitignore vendored Normal file

@ -0,0 +1,8 @@
__pycache__/
*.pyc
.env
.venv
venv/
*.egg-info/
dist/
build/

45
AGENTS.md Normal file

@ -0,0 +1,45 @@
# Grammar Inference Engine — Agent Guide
## Overview
This repo implements the BEX family of algorithms for inferring regular expression grammars
from example sequences. Use it whenever you need to discover the pattern behind a set of
strings or structured sequences.
## Quick Start for Agents
```python
# Fast pattern inference
from bex.crx import CRX
g = CRX().infer([['a','b','c'], ['a','b'], ['a','c']]) # a.b?.c?
# Probabilistic k-ORE inference (handles noise better)
from bex.idregex import idregex
g = idregex([['a','b','c'], ['a','b'], ['a','c']], kmax=2, N=3)
```
## Use Cases
1. **Ansible role patterns** — extract module sequences from tasks/main.yml, learn per-category grammars
2. **Log analysis** — find common patterns in event sequences
3. **API call patterns** — learn the typical order of API operations
4. **Configuration structure** — discover the schema behind YAML files
5. **Workflow mining** — extract the typical task flow from process logs
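For the log-analysis and workflow-mining cases, the raw input first has to become one token sequence per session. A minimal sketch, assuming log records arrive as `(session_id, event)` pairs in chronological order; the helper name `events_to_sequences` is hypothetical and not part of this repo:

```python
from collections import defaultdict


def events_to_sequences(records):
    """Group (session_id, event) records into one event list per session.

    Assumption: records are already sorted by time.
    """
    by_session = defaultdict(list)
    for session_id, event in records:
        by_session[session_id].append(event)
    # Sort by session id so the output order is reproducible.
    return [by_session[sid] for sid in sorted(by_session)]


records = [
    ("s1", "login"), ("s2", "login"), ("s1", "search"),
    ("s1", "logout"), ("s2", "logout"),
]
sequences = events_to_sequences(records)
print(sequences)  # [['login', 'search', 'logout'], ['login', 'logout']]
```

The resulting list of lists has the same shape as the inputs passed to `CRX().infer` and `idregex` in the Quick Start.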
## Architecture
Two inference pipelines:
| Pipeline | When to use |
|----------|-------------|
| CRX (fast) | Many examples, need speed, CHAREs output |
| iDRegEx (robust) | Few/noisy examples, need probabilistic handling |
## Running Tests
```bash
python tests/test_bex.py
```
## MCP Roadmap
- [ ] Standalone MCP server wrapping CRX + iDRegEx
- [ ] Tool: `infer_grammar(sequences, method="crx")`
- [ ] Tool: `ansible_role_grammar(roles_dir)`
- [ ] Tool: `yaml_to_sequences(yaml_path)`

132
README.md Normal file

@ -0,0 +1,132 @@
# Grammar Inference Engine
Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.
## Quick Start
```bash
pip install pyyaml
python -m bex
```
```python
from bex.crx import CRX
seqs = [
['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.wait_for?
```
## Algorithms
| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference from many sequences |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Handles noise, learns from few examples |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Builds regex from a single automaton |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | Post-processing for k-ORE extraction |
### Pipeline 1: Direct CHARE Inference (fast)
```
Example sequences → CRX → CHAREs grammar
```
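The final step of CRX picks a quantifier for each symbol class from how often the class occurs in every example. A minimal sketch of that rule, mirroring the case analysis in `bex/crx.py`; the function name `quantifier_for` is illustrative only:

```python
def quantifier_for(counts):
    """Return the CHARE quantifier for per-example occurrence counts."""
    if all(c == 1 for c in counts):
        return ""      # exactly once in every example
    if all(c <= 1 for c in counts):
        return "?"     # never repeated, sometimes absent
    if all(c >= 1 for c in counts):
        return "+"     # always present, sometimes repeated
    return "+?"        # sometimes absent, sometimes repeated


examples = [["a", "b", "b"], ["a"], ["a", "c"]]
for symbol in ["a", "b", "c"]:
    counts = [seq.count(symbol) for seq in examples]
    print(symbol, counts, repr(quantifier_for(counts)))
```

Here `a` occurs once everywhere and gets no quantifier, `b` is either missing or repeated and gets `+?`, and `c` is present at most once and gets `?`.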
### Pipeline 2: Probabilistic k-ORE Inference (robust)
```
Example sequences → Complete k-OA → Baum-Welch (EM)
→ Disambiguate → Prune → rwr² → k-ORE grammar
```
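Baum-Welch re-estimates transition probabilities from expected counts, and its basic building block is the forward pass that computes the probability of one word. A sketch under assumptions: the `prob[s][t]` layout follows `baum_welch.py`, while the `emits` mapping, the state names, and the toy numbers are made up for illustration:

```python
def word_probability(prob, emits, word, src="src", sink="sink"):
    """Forward pass: probability that the automaton generates `word`.

    prob[s][t] is the transition probability s -> t, and emits[t] is the
    symbol written when state t is entered.
    """
    alpha = {src: 1.0}  # probability mass per state before reading a symbol
    for symbol in word:
        nxt = {}
        for s, mass in alpha.items():
            for t, p in prob.get(s, {}).items():
                if emits.get(t) == symbol and p > 0:
                    nxt[t] = nxt.get(t, 0.0) + mass * p
        alpha = nxt
    # A word only counts if the automaton can stop after its last symbol.
    return sum((mass * prob.get(s, {}).get(sink, 0.0)
                for s, mass in alpha.items()), 0.0)


# Toy automaton in which the symbol 'a' has two states and 'b' has one.
emits = {"a_1": "a", "a_2": "a", "b_1": "b"}
prob = {
    "src": {"a_1": 1.0},
    "a_1": {"b_1": 0.5, "sink": 0.5},
    "b_1": {"a_2": 1.0},
    "a_2": {"sink": 1.0},
}
print(word_probability(prob, emits, ["a"]))            # 0.5
print(word_probability(prob, emits, ["a", "b", "a"]))  # 0.5
print(word_probability(prob, emits, ["b"]))            # 0.0
```

The transition into the sink is part of the word's probability, so a training loop has to count that last step as well.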
## Architecture
```
bex/
├── crx.py # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py # SOA: Single Occurrence Automaton core
├── koa.py # k-OA: k-Occurrence Automaton
├── ikoa.py # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py # Baum-Welch EM training for k-OA
├── expr.py # Expression utilities (concat, disj, star, strip)
├── marking.py # State marking for determinism
├── yaml_to_seq.py # Generic YAML → key-path sequence converter
├── role_grammar.py # Ansible role → module-sequence extractor
└── ...
```
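As orientation for `twotinf.py` and `soa.py`: a single occurrence automaton has one state per symbol, and 2T-INF builds it from the first symbols, last symbols, and adjacent pairs seen in the sample. A minimal sketch of that construction; `build_soa_sketch`, `soa_accepts`, and the tuple they share are hypothetical and not the repo's `build_soa` API:

```python
def build_soa_sketch(sequences):
    """Collect start symbols, end symbols and observed adjacent pairs.

    Returns (starts, ends, edges, accepts_empty).
    """
    starts, ends, edges = set(), set(), set()
    accepts_empty = False
    for seq in sequences:
        if not seq:
            accepts_empty = True
            continue
        starts.add(seq[0])
        ends.add(seq[-1])
        edges.update(zip(seq, seq[1:]))  # adjacent pairs (a, b)
    return starts, ends, edges, accepts_empty


def soa_accepts(soa, word):
    """Check a word against the collected automaton."""
    starts, ends, edges, accepts_empty = soa
    if not word:
        return accepts_empty
    return (word[0] in starts and word[-1] in ends
            and all(pair in edges for pair in zip(word, word[1:])))


soa = build_soa_sketch([["a", "b", "c"], ["a", "c"]])
print(soa_accepts(soa, ["a", "b", "c"]))  # True
print(soa_accepts(soa, ["b", "c"]))       # False: no example starts with b
```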
## Domain: Ansible Role Grammar
The engine includes a domain adapter for Ansible roles. It extracts module names from `tasks/main.yml` files and learns per-category grammars:
```bash
python -c "
from bex.role_grammar import collect_all_role_sequences, learn_grammar
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    print(f'{cat}: {learn_grammar(seqs)}')
"
```
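To make the extraction step concrete: each task in `tasks/main.yml` is a mapping, and its module is the key that is not a task keyword such as `name` or `when`. A rough sketch on already-parsed tasks, assuming a small hand-picked keyword set; `module_sequence` and `TASK_KEYWORDS` are hypothetical, and the real `role_grammar.py` may work differently:

```python
# Hand-picked subset of Ansible task keywords (an assumption, not exhaustive).
TASK_KEYWORDS = {"name", "when", "tags", "register", "become", "loop",
                 "with_items", "notify", "vars", "ignore_errors"}


def module_sequence(tasks):
    """Return the module name of each task, in order."""
    modules = []
    for task in tasks:
        candidates = [key for key in task if key not in TASK_KEYWORDS]
        if candidates:
            modules.append(candidates[0])  # first non-keyword key
    return modules


tasks = [
    {"name": "Create dir", "file": {"path": "/opt/app", "state": "directory"}},
    {"name": "Render config", "template": {"src": "a.j2", "dest": "/opt/app/a"},
     "notify": "restart"},
    {"name": "Check port", "wait_for": {"port": 8080}, "when": "check"},
]
print(module_sequence(tasks))  # ['file', 'template', 'wait_for']
```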
### Example Output
```
── restore (2 roles) ──
Grammar: file.copy.unarchive+.command
── validate (5 roles) ──
Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?
── configure (4 roles) ──
Grammar: (assert+debug+set_fact+uri)+?.include_role?
```
**Grammar notation:**
- `a.b`: `a` followed by `b` (concatenation)
- `(a+b)`: either `a` or `b` (disjunction)
- `r?`: zero or one (optional)
- `r+`: one or more (iteration)
- `r+?`: zero or more (`r` is missing from some examples and repeated in others)
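Because this notation differs from Python's `re` syntax (`+` is both disjunction and iteration, and `+?` is not a lazy quantifier), a small translator helps when checking sequences against a learned grammar. A minimal sketch for CHARE-style grammars as printed above; `grammar_to_regex` and `matches` are hypothetical helpers, not part of the repo:

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9_]+|[().+?]")


def grammar_to_regex(grammar):
    """Translate the grammar notation into a compiled Python regex.

    Every symbol becomes a group matching the token plus one trailing space.
    """
    tokens = TOKEN.findall(grammar)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "(":
            out.append("(?:")
        elif tok == ")":
            out.append(")")
        elif tok == ".":
            pass                      # concatenation is implicit in a regex
        elif tok == "?":
            out.append("?")
        elif tok == "+":
            if nxt is not None and nxt[0] not in ").+?":
                out.append("|")       # infix '+': disjunction
            elif nxt == "?":
                out.append("*")       # 'r+?' means zero or more
                i += 1                # consume the '?' as well
            else:
                out.append("+")       # postfix '+': one or more
        else:
            out.append("(?:" + re.escape(tok) + " )")
        i += 1
    return re.compile("".join(out))


def matches(grammar, sequence):
    """Check whether a token sequence belongs to the grammar's language."""
    text = "".join(token + " " for token in sequence)
    return grammar_to_regex(grammar).fullmatch(text) is not None


g = "file.copy.unarchive+.command"
print(matches(g, ["file", "copy", "unarchive", "unarchive", "command"]))  # True
print(matches(g, ["file", "copy", "command"]))                            # False
```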
## Domain: Generic YAML
The engine can convert any YAML file into key-path sequences for grammar inference:
```python
from bex.yaml_to_seq import yaml_file_to_sequence, sequences_to_crx
grammar = sequences_to_crx(yaml_file_to_sequence('config.yml'))
```
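What a key-path sequence could look like, sketched on an already-parsed document so that no YAML library is needed here. The traversal order and the `/` separator are assumptions; `key_paths` is a hypothetical helper, not the function in `yaml_to_seq.py`:

```python
def key_paths(node, prefix=""):
    """Yield one path per mapping key, depth-first, in document order."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}/{key}" if prefix else str(key)
            yield path
            yield from key_paths(value, path)
    elif isinstance(node, list):
        for item in node:
            # List items share the path of the key that holds the list.
            yield from key_paths(item, prefix)


config = {
    "server": {"host": "localhost", "port": 8080},
    "plugins": [{"name": "auth"}, {"name": "cache"}],
}
print(list(key_paths(config)))
# ['server', 'server/host', 'server/port', 'plugins', 'plugins/name', 'plugins/name']
```

Repeated paths such as `plugins/name` are what lets the inference step learn an iteration for list-valued keys.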
## Papers
- **Bex et al.** *"Inference of Concise Regular Expressions and DTDs"*, ACM TODS 2010
- **Bex et al.** *"Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data"*, arXiv:1004.2372
See `papers/` for extracted text and the original references.
## Tests
```bash
python -m pytest tests/
# or
python tests/test_bex.py
```
## MCP Server
A Model Context Protocol server for grammar inference is planned. See `AGENTS.md` for the roadmap.
## License
MIT

26
bex/__init__.py Normal file

@ -0,0 +1,26 @@
"""
bex: paper-faithful implementation of the BEX inference algorithms.
Papers:
- Bex et al. 2010 (TODS): Inference of Concise Regular Expressions and DTDs
- Bex et al. 2010 (arXiv 1004.2372): Learning Deterministic Regular Expressions
Algorithms implemented:
TODS 2010: 2T-INF, REWRITE, RWR, RWR², RWR₀, CRX
arXiv 2010: iKoa, Disambiguate, rwr², iDRegEx
"""
from .soa import SOA
from .twotinf import build_soa
from .rwr0 import rwr0
from .crx import CRX
from .ikoa import ikoa
from .rwrsq import rwr_sq
from .idregex import idregex
from .koa import KOA, build_complete_koa
from .expr import concat, disj, star, optional, alphabet, strip_k
from .marking import mark_koa
from .tokenizer import YAMLTokenizer
from .template import generate_template
__version__ = "0.2.0"

3
bex/__main__.py Normal file

@ -0,0 +1,3 @@
from .cli import main
main()

130
bex/automaton.py Normal file

@ -0,0 +1,130 @@
"""
Automaton: graph representation for BEX algorithms.
An automaton is a directed graph with labeled edges (labels = tokens).
It serves as the basis for:
- Prefix-tree automaton (built from example sequences)
- SORE/CHARE transformation via shrink rewrite rules
- Determinism check and repair
The implementation follows the structure from Bex et al. 2010 (TWEB):
- Nodes: set of states
- Edges: list of (from, to, label, prob); prob is optional and used for HMMs
- start: start state
- accepts: set of accepting states
"""
class Automaton:
def __init__(self, start=None):
self.nodes = set()
self.edges = []
self.start = start
self.accepts = set()
def add_node(self, node):
self.nodes.add(node)
def add_edge(self, u, v, label, prob=None):
self.edges.append({
'from': u,
'to': v,
'label': label,
'prob': prob,
})
self.add_node(u)
self.add_node(v)
def remove_edge(self, u, v, label):
self.edges = [
e for e in self.edges
if not (e['from'] == u and e['to'] == v and e['label'] == label)
]
def remove_all_edges_between(self, u, v):
self.edges = [
e for e in self.edges
if not (e['from'] == u and e['to'] == v)
]
def set_start(self, node):
self.start = node
self.add_node(node)
def add_accept(self, node):
self.accepts.add(node)
self.add_node(node)
def outgoing(self, node):
return [e for e in self.edges if e['from'] == node]
def incoming(self, node):
return [e for e in self.edges if e['to'] == node]
def successors(self, node):
return {(e['to'], e['label']) for e in self.outgoing(node)}
def has_edge(self, u, v, label):
return any(
e['from'] == u and e['to'] == v and e['label'] == label
for e in self.edges
)
def has_self_loop(self, node):
return any(e['from'] == node and e['to'] == node for e in self.edges)
def labels_on_edge(self, u, v):
return [e['label'] for e in self.edges if e['from'] == u and e['to'] == v]
    def is_deterministic(self):
        """Check whether the automaton is deterministic (no state has two outgoing edges with the same label)."""
        for node in self.nodes:
            seen = set()
            for e in self.outgoing(node):
                if e['label'] in seen:
                    return False
                seen.add(e['label'])
        return True
    def merge_nodes(self, target, source):
        """Merge source into target: every edge from or to source is redirected to target."""
        def redirect(node):
            return target if node == source else node
        # Rebuild each edge and keep its 'prob' entry, which to_dot() reads.
        self.edges = [
            {'from': redirect(e['from']), 'to': redirect(e['to']),
             'label': e['label'], 'prob': e.get('prob')}
            for e in self.edges
        ]
        if source in self.accepts:
            self.accepts.discard(source)
            self.accepts.add(target)
        # Keep the start state valid when the start itself is merged away.
        if self.start == source:
            self.start = target
        self.nodes.discard(source)
        self.nodes.add(target)
def copy(self):
import copy
return copy.deepcopy(self)
def __repr__(self):
return (f"Automaton(nodes={len(self.nodes)}, edges={len(self.edges)}, "
f"start={self.start}, accepts={self.accepts})")
def to_dot(self):
lines = ["digraph Automaton {"]
lines.append(" rankdir=LR;")
lines.append(f' start [shape=point];')
lines.append(f' start -> {self.start};')
for n in self.nodes:
shape = "doublecircle" if n in self.accepts else "circle"
lines.append(f' {n} [shape={shape}];')
for e in self.edges:
label = e['label'].replace('"', '\\"')
prob = f" [{e['prob']:.2f}]" if e['prob'] is not None else ""
lines.append(f' {e["from"]} -> {e["to"]} [label="{label}{prob}"];')
lines.append("}")
return '\n'.join(lines)

192
bex/baum_welch.py Normal file

@ -0,0 +1,192 @@
"""Baum-Welch for POMM on k-OA — standard forward-backward (Rabiner 1989)."""
import random
import math
def init_probabilities(G, sequences):
"""Initialize α per iKoa init (Algorithm 1, line 1).
α(src, sink) = fraction of empty words in S
α(src, s) = fraction of words starting with lab(s), split equally
among all k copies of that symbol
α(s, t) for s ≠ src: chosen randomly, normalized to sum to 1
"""
total = len(sequences)
if total == 0:
total = 1
empty_count = sum(1 for s in sequences if not s)
start_counts = {}
for seq in sequences:
if seq:
start_counts[seq[0]] = start_counts.get(seq[0], 0) + 1
prob = {}
for s in G._succ:
if s == G.sink:
continue
succ = list(G._succ[s])
if not succ:
prob[s] = {}
continue
vals = []
for t in succ:
if s == G.src:
if t == G.sink:
v = empty_count / total
else:
lab = G.label(t)
base = lab.rsplit('_', 1)[0] if '_' in lab else lab
count = start_counts.get(base, 0)
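# Count every copy of the same base symbol (marked labels such as a_1 and a_2 differ)
copies = sum(1 for u in succ if (G.label(u) or '').rsplit('_', 1)[0] == base)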
v = (count / total) / max(copies, 1)
vals.append(v)
else:
vals.append(random.random())
s_total = sum(vals)
if s_total == 0:
vals = [1.0 / len(vals)] * len(vals)
else:
vals = [v / s_total for v in vals]
prob[s] = {t: v for t, v in zip(succ, vals)}
for s in prob:
for t in prob[s]:
if prob[s][t] < 1e-10:
prob[s][t] = 0.0
return prob
def bw_iteration(prob, sequences, node_to_idx, n_states, all_nodes, G):
"""Single Baum-Welch iteration over all sequences."""
total_num = {}
total_denom = {}
for seq in sequences:
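        if not seq:
            # The empty word uses the single transition src -> sink. Count it so
            # that α(src, sink) keeps tracking the fraction of empty words.
            if prob.get(G.src, {}).get(G.sink, 0.0) > 0:
                key = (G.src, G.sink)
                total_num[key] = total_num.get(key, 0.0) + 1.0
                total_denom[G.src] = total_denom.get(G.src, 0.0) + 1.0
            continue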
T = len(seq)
obs = seq
# which states can emit each observation? (keyed by base symbol)
emit = {}
for n in all_nodes:
lab = G.label(n)
if lab:
base = lab.rsplit('_', 1)[0] if '_' in lab else lab
emit.setdefault(base, []).append(n)
# sink emits nothing
sink = G.sink
# Forward pass
alpha = [{} for _ in range(T + 1)]
alpha[0][G.src] = 1.0
for t in range(T):
sym = obs[t]
possible = emit.get(sym, [])
for j in possible:
total = 0.0
for i in alpha[t]:
p_trans = prob.get(i, {}).get(j, 0.0)
if p_trans > 0:
total += alpha[t][i] * p_trans
if total > 0:
alpha[t + 1][j] = total
# P(O | λ)
po = 0.0
for i in alpha[T]:
po += alpha[T][i] * prob.get(i, {}).get(sink, 0.0)
if po == 0:
continue
# Backward pass
beta = [{} for _ in range(T + 1)]
for i in all_nodes:
if prob.get(i, {}).get(sink, 0.0) > 0:
beta[T][i] = prob[i][sink]
for t in range(T - 1, -1, -1):
sym = obs[t] if t < T else None
possible = emit.get(sym, []) if sym else []
for i in alpha[t]:
total = 0.0
for j in possible:
p_trans = prob.get(i, {}).get(j, 0.0)
if p_trans > 0 and j in beta[t + 1]:
total += p_trans * beta[t + 1][j]
if total > 0:
beta[t][i] = total
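        # Accumulate ξ (expected transition counts); γ is the row sum of ξ
        for t in range(T):
            possible = emit.get(obs[t], [])
            for i in alpha[t]:
                if beta[t].get(i, 0.0) == 0:
                    continue
                for j in possible:
                    p_trans = prob.get(i, {}).get(j, 0.0)
                    if p_trans == 0 or beta[t + 1].get(j, 0.0) == 0:
                        continue
                    xi = alpha[t][i] * p_trans * beta[t + 1][j] / po
                    if xi > 1e-15:
                        key = (i, j)
                        total_num[key] = total_num.get(key, 0.0) + xi
                        total_denom[i] = total_denom.get(i, 0.0) + xi
        # Also count the final transition into the sink. Without it the M-step
        # resets every α(s, sink) to 0, and P(O | λ) vanishes on the next pass.
        for i in alpha[T]:
            xi = alpha[T][i] * prob.get(i, {}).get(sink, 0.0) / po
            if xi > 1e-15:
                key = (i, sink)
                total_num[key] = total_num.get(key, 0.0) + xi
                total_denom[i] = total_denom.get(i, 0.0) + xi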
# M-step: update probabilities
for s in prob:
for t in prob[s]:
key = (s, t)
d = total_denom.get(s, 0.0)
if d > 1e-15 and key in total_num:
prob[s][t] = total_num[key] / d
else:
prob[s][t] = 0.0
# Renormalize
for s in prob:
row_sum = sum(prob[s].values())
if row_sum > 1e-10:
for t in prob[s]:
prob[s][t] /= row_sum
else:
n_succ = len(prob[s])
for t in prob[s]:
prob[s][t] = 1.0 / n_succ
return prob
def baum_welch(G, prob, sequences, iterations=10):
"""Baum-Welch EM training.
Args:
G: k-OA graph
prob: dict[s][t] = transition probabilities
sequences: list of token lists (bag, not set)
iterations: number of EM iterations to run (fixed count, no convergence test)
Returns:
Updated prob dict
"""
all_nodes = list(G._succ.keys())
node_to_idx = {n: i for i, n in enumerate(all_nodes)}
n_states = len(all_nodes)
for _ in range(iterations):
prob = bw_iteration(prob, sequences, node_to_idx, n_states, all_nodes, G)
return prob
def baum_welch_fixed(G, prob, sequences, iterations=2):
"""Baum-Welch with fixed small iteration count (for Disambiguate).
iterations = 2 for |Σ| ≤ 7, iterations = 3 for |Σ| > 7.
"""
return baum_welch(G, prob, sequences, iterations)

145
bex/cli.py Normal file

@ -0,0 +1,145 @@
"""
CLI: command-line interface for bex YAML grammar inference.
Usage:
python -m bex --dir roles/ --k-max 5
python -m bex --dir playbooks/ --context tasks
python -m bex --dir roles/ --output template.yaml
"""
import argparse
import os
import sys
import glob
from .tokenizer import YAMLTokenizer
from .kore import kOREInference
from .template import generate_template
from .ilocal import iLocal, extract_contexts_from_file, reduce_contexts
def find_yaml_files(directory):
    """Find all YAML files in a directory (recursively)."""
    patterns = ['**/*.yml', '**/*.yaml']
    files = []
    for pattern in patterns:
        files.extend(glob.glob(os.path.join(directory, pattern), recursive=True))
    return sorted(files)
def main():
parser = argparse.ArgumentParser(
description='bex — BEX-based YAML Grammar Inference',
)
    parser.add_argument('--dir', type=str, default='roles/',
                        help='Directory containing YAML files (default: roles/)')
    parser.add_argument('--k-max', type=int, default=5,
                        help='Maximum k for k-ORE inference (default: 5)')
    parser.add_argument('--context', type=str, default=None,
                        help='Restrict to a specific container key (e.g. tasks)')
    parser.add_argument('--output', type=str, default=None,
                        help='Output file for the template (default: stdout)')
    parser.add_argument('--ilocal', action='store_true',
                        help='Run the iLocal context analysis')
    parser.add_argument('--crx', action='store_true',
                        help='Use CRX (direct CHARE inference)')
    parser.add_argument('--verbose', '-v', action='store_true',
                        help='Verbose output')
    parser.add_argument('--stats', action='store_true',
                        help='Show token statistics')
    args = parser.parse_args()
    if not os.path.isdir(args.dir):
        print(f"Error: directory '{args.dir}' not found.", file=sys.stderr)
        sys.exit(1)
    yaml_files = find_yaml_files(args.dir)
    if not yaml_files:
        print(f"No YAML files found in '{args.dir}'.", file=sys.stderr)
        sys.exit(1)
    print(f"YAML files found: {len(yaml_files)}", file=sys.stderr)
    if args.ilocal:
        print("\n=== iLocal: context extraction ===", file=sys.stderr)
        all_contexts = {}
        for f in yaml_files:
            contexts = extract_contexts_from_file(f)
            for ctx, seqs in contexts.items():
                if ctx not in all_contexts:
                    all_contexts[ctx] = []
                all_contexts[ctx].extend(seqs)
        reduced = reduce_contexts(all_contexts)
        print(f" Contexts found: {len(reduced)}", file=sys.stderr)
        for ctx, seqs in sorted(reduced.items()):
            lengths = [len(s) for s in seqs]
            print(f" {ctx}: {len(seqs)} sequences, "
                  f"lengths {min(lengths)}-{max(lengths)}, "
                  f"unique_seqs={len(set(tuple(s) for s in seqs))}",
                  file=sys.stderr)
    print("\n=== Tokenization ===", file=sys.stderr)
    tokenizer = YAMLTokenizer(resolve_includes=False)
    all_sequences = []
    container_sequences = {}
    for f in yaml_files:
        try:
            seq = tokenizer.tokenize_file(f)
            if seq:
                all_sequences.append(seq)
                if args.verbose:
                    print(f" {os.path.relpath(f)}: {seq}", file=sys.stderr)
        except Exception as e:
            # Unparseable files are skipped; the reason is shown with --verbose.
            if args.verbose:
                print(f" Error in {f}: {e}", file=sys.stderr)
    if not all_sequences:
        print("No sequences extracted.", file=sys.stderr)
        sys.exit(1)
    print(f" Sequences extracted: {len(all_sequences)}", file=sys.stderr)
    lengths = [len(s) for s in all_sequences]
    print(f" Lengths: min={min(lengths)}, max={max(lengths)}, "
          f"avg={sum(lengths)/len(lengths):.1f}", file=sys.stderr)
    if args.stats:
        stats = tokenizer.get_statistics()
        print("\n=== Token statistics ===", file=sys.stderr)
        for token, count in list(stats.items())[:30]:
            print(f" {token}: {count}", file=sys.stderr)
    print("\n=== k-ORE inference ===", file=sys.stderr)
    kore = kOREInference(k_max=args.k_max)
    if args.crx:
        result = kore.infer_with_crx(all_sequences)
        _, expr, method = result
        print(f" Method: {method}", file=sys.stderr)
    else:
        result = kore.infer(all_sequences)
        if result:
            _, expr, k = result
            print(f" Best k: {k}", file=sys.stderr)
        else:
            expr = ""
            print(" No result", file=sys.stderr)
    print(f" Inferred expression: {expr}", file=sys.stderr)
    print("\n=== One-Shot Template ===", file=sys.stderr)
    print(file=sys.stderr)
    template = generate_template(expr, context_key=args.context)
    if args.output:
        with open(args.output, 'w') as f:
            f.write(template)
        print(f"Template written to: {args.output}", file=sys.stderr)
    else:
        print(template)
if __name__ == '__main__':
main()

191
bex/crx.py Normal file

@ -0,0 +1,191 @@
"""CRX — Direct CHARE inference (Algorithm 7, TODS 2010)."""
from collections import defaultdict
from .expr import concat
class CRX:
"""
| Algorithm 7: CRX |
Input: sample S (list of token lists)
Output: CHARE r such that S ⊆ L(r)
"""
def infer(self, sequences):
S = [list(s) for s in sequences if s]
if not S:
return 'ε'
sigma = set()
for w in S:
for a in w:
sigma.add(a)
if not sigma:
return 'ε'
# Step 1: Compute ImmedPred and equivalence classes ≈_S
immed = set()
for w in S:
for i in range(len(w) - 1):
immed.add((w[i], w[i + 1]))
# Reachability: →_S (reflexive, transitive closure)
closure = self._transitive_closure(sigma, immed)
# Equivalence: a ≈_S b iff a →*_S b and b →*_S a
eq = self._equivalence(sigma, closure)
# Build class map: symbol → class index
sym_to_cls = {}
classes = []
for cls_syms in eq:
idx = len(classes)
for sym in cls_syms:
sym_to_cls[sym] = idx
classes.append(set(cls_syms))
# Steps 2-3: merge singleton classes that share Pred and Succ. Algorithm 7 states:
#
# "while a maximal set of singleton nodes γ₁,...,γ_ℓ such that
# Pred_HS(γ₁)=···=Pred_HS(γ_ℓ) and Succ_HS(γ₁)=···=Succ_HS(γ_ℓ) exists do
# Replace γ₁,...,γ_ℓ by γ := ∪ⱼ γⱼ"
#
# That is, equivalence classes with exactly one symbol are merged when they have
# the same Pred and Succ sets in the Hasse diagram. Note that the loop below reads
# Pred/Succ directly off ImmedPred, without computing a transitive reduction.
changed = True
while changed:
changed = False
singleton_ids = [i for i, c in enumerate(classes) if len(c) == 1]
# Compute Pred and Succ for each singleton (considering ALL symbols in each class)
hs_pred = {}
hs_succ = {}
for i in singleton_ids:
hs_pred[i] = set()
hs_succ[i] = set()
sym_i = next(iter(classes[i]))
for j, c in enumerate(classes):
if i == j:
continue
if any((sym_j, sym_i) in immed for sym_j in c):
hs_pred[i].add(j)
if any((sym_i, sym_j) in immed for sym_j in c):
hs_succ[i].add(j)
# Group by same (Pred, Succ)
groups = defaultdict(list)
for i in singleton_ids:
groups[(frozenset(hs_pred[i]), frozenset(hs_succ[i]))].append(i)
for (pred_set, succ_set), group in groups.items():
if len(group) >= 2:
merged = set()
for i in group:
merged.update(classes[i])
new_id = len(classes)
classes.append(merged)
for i in sorted(group, reverse=True):
classes.pop(i)
changed = True
break
# After merging, rebuild sym_to_cls to map to new class indices
sym_to_cls = {}
for idx, cls in enumerate(classes):
for sym in cls:
sym_to_cls[sym] = idx
# Step 5: Topological sort of the Hasse diagram
adj = {i: set() for i in range(len(classes))}
indeg = {i: 0 for i in range(len(classes))}
for a, b in immed:
ca, cb = sym_to_cls.get(a), sym_to_cls.get(b)
if ca is not None and cb is not None and ca != cb:
if cb not in adj[ca]:
adj[ca].add(cb)
indeg[cb] += 1
# Topological sort (Kahn's algorithm)
order = []
q = [i for i in range(len(classes)) if indeg[i] == 0]
while q:
i = q.pop(0)
order.append(i)
for j in adj[i]:
indeg[j] -= 1
if indeg[j] == 0:
q.append(j)
remaining = set(range(len(classes))) - set(order)
order.extend(remaining)
# Step 6-16: Assign chain factors (Algorithm 7 lines 7-14)
def count_in_class(w, syms):
return sum(1 for a in w if a in syms)
parts = []
for i in order:
syms = classes[i]
counts = [count_in_class(w, syms) for w in S]
all_exactly_one = all(c == 1 for c in counts)
all_at_most_one = all(c <= 1 for c in counts)
all_at_least_one = all(c >= 1 for c in counts)
some_two_or_more = any(c >= 2 for c in counts)
sym_list = sorted(syms)
factor = '+'.join(sym_list)
if len(sym_list) > 1:
factor = '(' + factor + ')'
if all_exactly_one:
pass # (a₁+···+aₙ)
elif all_at_most_one:
factor += '?' # (a₁+···+aₙ)?
elif all_at_least_one and some_two_or_more:
factor += '+' # (a₁+···+aₙ)+
else:
factor += '+?' # (a₁+···+aₙ)+?
parts.append(factor)
if not parts:
return 'ε'
return '.'.join(parts)
def _transitive_closure(self, sigma, immed):
"""Compute reflexive, transitive closure of immed relation."""
closure = {(a, b) for (a, b) in immed}
for a in sigma:
closure.add((a, a))
changed = True
while changed:
changed = False
for a in sigma:
for b in sigma:
for c in sigma:
if (a, b) in closure and (b, c) in closure and (a, c) not in closure:
closure.add((a, c))
changed = True
return closure
def _equivalence(self, sigma, closure):
"""Compute equivalence classes of ≈_S."""
remaining = set(sigma)
classes = []
while remaining:
a = remaining.pop()
cls = {a}
added = True
while added:
added = False
for b in list(remaining):
if (a, b) in closure and (b, a) in closure:
if b not in cls:
cls.add(b)
remaining.discard(b)
added = True
classes.append(cls)
return classes

164
bex/expr.py Normal file

@ -0,0 +1,164 @@
"""Expression utilities for SOREs and k-OREs."""
import re
def sym(s):
"""Create a simple symbol expression."""
return s
def concat(*parts):
"""Create concatenation expression."""
parts = [p for p in parts if p and p != 'ε']
if not parts:
return 'ε'
if len(parts) == 1:
return parts[0]
return '.'.join(parts)
def disj(*parts):
"""Create disjunction expression."""
parts = [p for p in parts if p and p != '']
if not parts:
return ''
if len(parts) == 1:
return parts[0]
return '(' + '|'.join(parts) + ')'
def star(expr):
"""Create iteration expression (one or more, r+)."""
if not expr or expr in ('', 'ε'):
return expr
if len(expr) == 1 or (expr.startswith('(') and expr.endswith(')')):
return expr + '+'
return '(' + expr + ')+'
def optional(expr):
"""Create optional expression (r?)."""
if not expr or expr in ('', 'ε'):
return 'ε'
if len(expr) == 1 or (expr.startswith('(') and expr.endswith(')')):
return expr + '?'
return '(' + expr + ')?'
def alphabet(expr):
"""Return set of alphabet symbols in expression."""
cleaned = re.sub(r'[+?*().|]', ' ', expr)
result = set()
for token in cleaned.split():
token = token.strip('_0123456789')
if token and token not in ('ε', ''):
result.add(token)
return result
def strip_k(s):
"""Remove k-ORE markers: a_1 → a, b^(2) → b."""
result = re.sub(r'_\d+', '', s)
result = re.sub(r'\^\(\d+\)', '', result)
result = re.sub(r'^\(|\)$', '', result)
return result
def has_repeats(expr, symbol):
"""Check if a symbol appears more than once in expression."""
return expr.count(symbol) > 1
def lang_size_at_most(expr, n, alphabet_symbols=None):
"""Compute |L(r)<=n| — number of words of length ≤ n in L(r)."""
if alphabet_symbols is None:
alphabet_symbols = alphabet(expr)
if not alphabet_symbols:
return 1 if 'ε' in expr else 0
size = 0
for length in range(n + 1):
size += _count_words(expr, length, alphabet_symbols)
return size
def _count_words(expr, length, alphabet_symbols):
if length < 0:
return 0
if not expr or expr == '':
return 0
if expr == 'ε':
return 1 if length == 0 else 0
if expr in alphabet_symbols:
return 1 if length == 1 else 0
if '+' in expr:
inner = expr.rstrip('+')
if inner.endswith('?'):
inner = inner[:-1]
return _count_star_words(inner, length, alphabet_symbols, 1)
if expr.endswith('?'):
inner = expr[:-1]
return _count_words(inner, length, alphabet_symbols) + (1 if length == 0 else 0)
if expr.startswith('(') and '|' in expr:
inner = expr[1:-1]
parts = _split_disjunction(inner)
return sum(_count_words(p, length, alphabet_symbols) for p in parts)
if '.' in expr:
parts = expr.split('.')
return _count_concat_words(parts, length, alphabet_symbols, 0)
if ')' in expr or '(' in expr:
return 0
return 0
def _count_concat_words(parts, length, alphabet_symbols, idx):
if idx >= len(parts):
return 1 if length == 0 else 0
total = 0
for take in range(length + 1):
cnt = _count_words(parts[idx], take, alphabet_symbols)
if cnt > 0:
rest = _count_concat_words(parts, length - take, alphabet_symbols, idx + 1)
total += cnt * rest
return total
def _count_star_words(inner, length, alphabet_symbols, min_count):
total = 0
for repeat in range(min_count, length + 1):
if repeat == 0:
continue
total += _count_repeat_words(inner, repeat, length, alphabet_symbols)
return total
def _count_repeat_words(inner, repeat, length, alphabet_symbols):
if repeat == 0:
return 1 if length == 0 else 0
total = 0
for take in range(length + 1):
cnt = _count_words(inner, take, alphabet_symbols)
if cnt > 0:
rest = _count_repeat_words(inner, repeat - 1, length - take, alphabet_symbols)
total += cnt * rest
return total
def _split_disjunction(s):
depth = 0
parts = []
current = []
for ch in s:
if ch == '(':
depth += 1
current.append(ch)
elif ch == ')':
depth -= 1
current.append(ch)
elif ch == '|' and depth == 0:
parts.append(''.join(current))
current = []
else:
current.append(ch)
parts.append(''.join(current))
return parts

202
bex/idregex.py Normal file

@ -0,0 +1,202 @@
"""iDRegEx — Algorithm 4 (arXiv 1004.2372)."""
from .ikoa import ikoa
from .rwrsq import rwr_sq
from .expr import alphabet
def is_deterministic(expr):
"""Check if a k-ORE is deterministic (Glushkov determinism).
A k-ORE is deterministic iff for every subexpression (r|s),
first(r) ∩ first(s) = ∅.
"""
if not expr or expr == '' or expr == 'ε':
return True
return _check_det(expr)
def _check_det(expr):
"""Recursive determinism check."""
depth = 0
i = 0
while i < len(expr):
if expr[i] == '(':
if depth == 0:
start = i
depth += 1
elif expr[i] == ')':
depth -= 1
if depth == 0:
inner = expr[start + 1:i]
if '|' in inner:
alts = _split_or(inner)
first_sets = []
for alt in alts:
fs = _first_set(alt.strip())
first_sets.append(fs)
for j, fs1 in enumerate(first_sets):
for fs2 in first_sets[j + 1:]:
if fs1 & fs2:
return False
for alt in alts:
if not _check_det(alt.strip()):
return False
else:
if not _check_det(inner):
return False
elif expr[i] == '+':
pass
elif expr[i] == '?':
pass
i += 1
return True
def _first_set(expr):
"""Compute first(r) — set of alphabet symbols that can appear at the start of a word in L(r)."""
if not expr or expr == '':
return set()
if expr == 'ε':
return set()
alpha = alphabet(expr)
if expr in alpha:
return {expr}
if expr.endswith('?') or expr.endswith('+'):
inner = expr.rstrip('+?')
return _first_set(inner)
if '.' in expr:
parts = expr.split('.')
return _first_set(parts[0])
if expr.startswith('(') and '|' in expr:
inner = expr[1:-1]
alts = _split_or(inner)
result = set()
for a in alts:
result |= _first_set(a.strip())
return result
return alpha
def _split_or(s):
"""Split disjunction string at top-level | operators."""
depth = 0
parts = []
cur = []
for ch in s:
if ch == '(':
depth += 1
cur.append(ch)
elif ch == ')':
depth -= 1
cur.append(ch)
elif ch == '|' and depth == 0:
parts.append(''.join(cur))
cur = []
else:
cur.append(ch)
parts.append(''.join(cur))
return parts
def _lang_size(expr, n=None):
"""|L(r)≤n| — number of words of length ≤ n in L(r).
n = 2m + 1 where m = |r| excluding operators.
Uses simple structural approximation.
"""
if not expr or expr == '':
return 0
if expr == 'ε':
return 1
m = len(alphabet(expr))
if n is None:
n = 2 * m + 1
total = 0
for length in range(n + 1):
total += _count_len(expr, length)
return total
def _count_len(expr, length):
if length < 0:
return 0
if not expr or expr == '':
return 0
if expr == 'ε':
return 1 if length == 0 else 0
alpha = alphabet(expr)
if expr in alpha:
return 1 if length == 1 else 0
if expr.endswith('+'):
inner = expr[:-1]
if inner.endswith('?'):
inner = inner[:-1]
total = 0
for rep in range(1, length + 1):
total += _count_repeat(inner, rep, length)
return total
if expr.endswith('?'):
inner = expr[:-1]
return _count_len(inner, length) + (1 if length == 0 else 0)
if '.' in expr:
parts = expr.split('.')
return _count_concat(parts, length, 0)
if expr.startswith('(') and '|' in expr:
inner = expr[1:-1]
alts = _split_or(inner)
return sum(_count_len(a.strip(), length) for a in alts)
return 0
def _count_concat(parts, length, idx):
if idx >= len(parts):
return 1 if length == 0 else 0
total = 0
for take in range(length + 1):
cnt = _count_len(parts[idx], take)
if cnt:
total += cnt * _count_concat(parts, length - take, idx + 1)
return total
def _count_repeat(inner, rep, length):
if rep == 0:
return 1 if length == 0 else 0
total = 0
for take in range(length + 1):
cnt = _count_len(inner, take)
if cnt:
total += cnt * _count_repeat(inner, rep - 1, length - take)
return total
def idregex(sequences, kmax=4, N=5, criterion='langsize'):
"""
| Algorithm 4: iDRegEx |
Require: sample S
Ensure: k-ORE r
1: C ← ∅
2: for k = 1 to kmax do
3: for n = 1 to N do
4: G ← iKoa(S, k)
5: if rwr²(G) is deterministic then
6: add rwr²(G) to C
7: return best(C)
"""
C = set()
for k in range(1, kmax + 1):
for _ in range(N):
G = ikoa(sequences, k, num_trials=1)
if G is None:
continue
expr = rwr_sq(G)
if expr and expr not in ('', 'ε'):
if is_deterministic(expr):
C.add(expr)
if not C:
return None
if criterion == 'langsize':
return min(C, key=lambda e: (_lang_size(e), len(e)))
return min(C, key=lambda e: len(e))

139
bex/ikoa.py Normal file

@ -0,0 +1,139 @@
"""iKoa — Algorithm 1 (arXiv 1004.2372) with Disambiguate (Algorithm 2)."""
from collections import deque, defaultdict
import random
from .koa import KOA, build_complete_koa
from .baum_welch import init_probabilities, baum_welch, baum_welch_fixed
def disambiguate(G, prob, sequences):
"""
|---- Algorithm 2: Disambiguate ----|
Require: POMM P=(G,alpha) and sample S
Ensure: deterministic k-OA
"""
sigma = set()
for seq in sequences:
for sym in seq:
sigma.add(sym)
bw_iter = 2 if len(sigma) <= 7 else 3
Q = deque([G.src])
for s in G._succ.get(G.src, set()):
if prob.get(G.src, {}).get(s, 0) > 0:
Q.append(s)
D = set()
from .expr import strip_k
while Q:
s = Q.popleft()
while True:
lab_groups = defaultdict(list)
for t in list(G._succ.get(s, set())):
l = G.label(t)
if l:
lab_groups[strip_k(l)].append(t)
multi = [(lab, ts) for lab, ts in lab_groups.items() if len(ts) > 1]
if not multi:
break
for lab, targets in multi:
t_max = max(targets, key=lambda t: prob.get(s, {}).get(t, 0))
total_p = sum(prob.get(s, {}).get(t, 0) for t in targets)
if total_p > 0 and t_max in prob.get(s, {}):
prob[s][t_max] = total_p
for t in targets:
if t != t_max:
G.rm_edge(s, t)
if t in prob.get(s, {}):
prob[s][t] = 0.0
prob = baum_welch_fixed(G, prob, sequences, bw_iter)
for seq in sequences:
if not G.accept(seq):
return None
D.add(s)
for t in list(G._succ.get(s, set())):
if t not in D and t != G.sink:
Q.append(t)
return G
def prune(G, sequences):
"""Prune (iKoa line 4). Remove edges without witnesses in S.
Also removes states s ∈ Succ(src) without a witness.
"""
from .expr import strip_k as _sk
witnessed = set()
for seq in sequences:
if not seq:
witnessed.add((G.src, G.sink))
continue
cur = {G.src}
for sym in seq:
nxt = set()
for s in cur:
for t in G._succ.get(s, set()):
lab = G.label(t)
if lab and _sk(lab) == sym:
nxt.add(t)
witnessed.add((s, t))
cur = nxt
for s in cur:
if G.has_edge(s, G.sink):
witnessed.add((s, G.sink))
for s in list(G._succ.keys()):
for t in list(G._succ.get(s, set())):
if (s, t) not in witnessed:
G.rm_edge(s, t)
r_from_src = set()
q = [G.src]
while q:
s = q.pop()
if s in r_from_src:
continue
r_from_src.add(s)
q.extend(G._succ.get(s, set()))
r_to_sink = set()
q = [G.sink]
while q:
s = q.pop()
if s in r_to_sink:
continue
r_to_sink.add(s)
q.extend(G._pred.get(s, set()))
for n in list(G._succ.keys()):
if n in (G.src, G.sink):
continue
if n not in r_from_src or n not in r_to_sink:
G.rm_state(n)
return G
def ikoa(sequences, k, num_trials=1):
"""
| Algorithm 1: iKoa |
Require: sample S, value k
Ensure: deterministic k-OA G with S ⊆ L(G)
1: P ← init(k, S)
2: P ← BaumWelch(P, S)
3: G ← Disambiguate(P, S)
4: G ← Prune(G, S)
5: return G
"""
for _ in range(num_trials):
G, _ = build_complete_koa(sequences, k)
prob = init_probabilities(G, sequences)
prob = baum_welch(G, prob, sequences, iterations=10)
G2 = G.copy()
prob2 = {s: dict(d) for s, d in prob.items()}
result = disambiguate(G2, prob2, sequences)
if result is not None:
result = prune(result, sequences)
if result.sink_reachable():
return result
return None
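The pruning step in `prune` keeps only edges that some sample word actually traverses. A minimal, self-contained sketch of that idea on a plain dictionary graph could look as follows; the graph encoding (`succ`, `label`, integer state ids) and the function names are assumptions made for illustration, not the `KOA` API.

```python
def witnessed_edges(succ, label, src, sink, sample):
    """Return the set of edges (s, t) touched while reading the sample."""
    seen = set()
    for word in sample:
        current = {src}
        for sym in word:
            nxt = set()
            for s in current:
                for t in succ.get(s, ()):
                    if label.get(t) == sym:
                        seen.add((s, t))
                        nxt.add(t)
            current = nxt
        # The word ends here: record edges from the reached states to sink.
        for s in current:
            if sink in succ.get(s, ()):
                seen.add((s, sink))
    return seen


def prune_edges(succ, label, src, sink, sample):
    """Keep only the witnessed edges; the input graph is left untouched."""
    keep = witnessed_edges(succ, label, src, sink, sample)
    return {s: {t for t in targets if (s, t) in keep} for s, targets in succ.items()}


SRC, SINK = 0, 9
label = {1: "a", 2: "b", 3: "c"}
succ = {SRC: {1, 2}, 1: {2, 3}, 2: {3, SINK}, 3: {SINK}}
pruned = prune_edges(succ, label, SRC, SINK, [["a", "b"], ["a", "c"]])
print(pruned)
```

Like the function above, this marks edges along every partial run, so on a nondeterministic graph it may keep an edge that no accepting run uses.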

166
bex/ilocal.py Normal file

@ -0,0 +1,166 @@
"""
iLocal: context-based inference (Bex 2007).
After Bex et al. 2007, "Inferring XML Schema Definitions from XML Data".
Extracts (context, sequence) pairs from YAML trees, where the context is
the YAML key (container key).
Adapted for YAML:
- Context = YAML key whose value is a list (e.g. tasks, steps)
- Sequence = the item keys inside that list (e.g. apt, template, service)
Instead of using file paths (as in the XML setting), we work with the
container keys directly (user requirement: no file-path clutter).
"""
import yaml
def extract_contexts_from_yaml(data, context_prefix=None):
"""
Extracts (context, sequence) pairs from parsed YAML.
Args:
data: parsed YAML data (dict or list)
context_prefix: internal prefix for nested contexts
Returns:
dict: {context_key: [sequence1, sequence2, ...]}
"""
contexts = {}
def walk(node, prefix=None):
if isinstance(node, dict):
for key, value in node.items():
full_key = f"{prefix}.{key}" if prefix else str(key)
if isinstance(value, list) and len(value) > 0:
seq = []
for item in value:
if isinstance(item, dict):
item_key = next(
(k for k in item if k != 'name' and not k.startswith('_')),
None
)
if item_key:
seq.append(item_key)
else:
named = item.get('name', str(item))
seq.append(f"named:{named[:20]}")
else:
seq.append(str(item))
if full_key not in contexts:
contexts[full_key] = []
contexts[full_key].append(seq)
for item in value:
walk(item, full_key)
elif isinstance(value, dict):
walk(value, full_key)
elif isinstance(value, list):
for item in value:
walk(item, full_key)
elif isinstance(node, list):
for item in node:
walk(item, prefix)
walk(data)
return contexts
def extract_contexts_from_yaml_string(yaml_string):
"""
Extracts context sequences from a YAML string.
Args:
yaml_string: YAML string
Returns:
dict: {context_key: [sequence1, sequence2, ...]}
"""
try:
data = yaml.safe_load(yaml_string)
except yaml.YAMLError:
return {}
if data is None:
return {}
return extract_contexts_from_yaml(data)
def extract_contexts_from_file(filepath):
"""
Extracts context sequences from a YAML file.
Args:
filepath: path to the YAML file
Returns:
dict: {context_key: [sequence1, sequence2, ...]}
"""
with open(filepath) as f:
return extract_contexts_from_yaml_string(f.read())
def reduce_contexts(context_groups):
"""
reduce: generalization after Bex 2007 (algorithm reduce).
Identifies equivalent context models and merges them:
- If two contexts have the same sequence structure,
they are merged into one generalized context
Args:
context_groups: dict of {context_key: [sequences]}
Returns:
dict: {generalized_context: [sequences]} (reduced)
"""
if not context_groups:
return {}
signature_map = {}
for ctx, seqs in context_groups.items():
# Signature = sorted set of (length, first element, last element) triples
sig_parts = []
for s in seqs:
first = s[0] if s else ""
last = s[-1] if s else ""
sig_parts.append((len(s), first, last))
signature = tuple(sorted(set(sig_parts)))
if signature not in signature_map:
signature_map[signature] = []
signature_map[signature].append(ctx)
# Groups with the same signature → merge
result = {}
for sig, ctx_list in signature_map.items():
merged_ctx = "|".join(sorted(ctx_list))
merged_seqs = []
for ctx in ctx_list:
merged_seqs.extend(context_groups[ctx])
result[merged_ctx] = merged_seqs
return result
def iLocal(yaml_documents):
"""
iLocal: context inference after Bex 2007.
Args:
yaml_documents: list of YAML strings or file paths
Returns:
dict: {generalized_context: [sequences]}
"""
all_contexts = {}
for doc in yaml_documents:
if '\n' in doc or '\r' in doc:
contexts = extract_contexts_from_yaml_string(doc)
else:
contexts = extract_contexts_from_file(doc)
for ctx, seqs in contexts.items():
if ctx not in all_contexts:
all_contexts[ctx] = []
all_contexts[ctx].extend(seqs)
return reduce_contexts(all_contexts)
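To make the (context, sequence) extraction concrete, here is a simplified sketch that works on already-parsed data, so it needs no YAML library. The names `collect_contexts` and `first_module_key` are illustrative, and unlike `extract_contexts_from_yaml` the sketch uses a plain `"?"` placeholder for items without a usable key.

```python
def first_module_key(item):
    """Pick the first key that is neither 'name' nor private (leading '_')."""
    return next((k for k in item if k != "name" and not str(k).startswith("_")), None)


def collect_contexts(node, prefix=None, out=None):
    """Map each list-valued key path to the key sequences found under it."""
    out = {} if out is None else out
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            if isinstance(value, list) and value:
                seq = [
                    (first_module_key(item) or "?") if isinstance(item, dict) else str(item)
                    for item in value
                ]
                out.setdefault(path, []).append(seq)
                for item in value:
                    collect_contexts(item, path, out)
            elif isinstance(value, dict):
                collect_contexts(value, path, out)
    elif isinstance(node, list):
        for item in node:
            collect_contexts(item, prefix, out)
    return out


doc = {
    "tasks": [
        {"name": "install", "apt": {"name": "nginx"}},
        {"name": "start", "service": {"name": "nginx", "state": "started"}},
    ],
    "handlers": [{"name": "reload", "service": {"name": "nginx"}}],
}
contexts = collect_contexts(doc)
print(contexts)
```

Each context then carries its own sample of sequences, which is the input shape the inference functions in this package expect.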

105
bex/koa.py Normal file

@ -0,0 +1,105 @@
"""k-OA — k-Occurrence Automaton (Definition 4.1, arXiv 1004.2372).
A k-OA is like a SOA but each symbol appears at most k times as a state label.
"""
from .soa import SOA
from .expr import strip_k
class KOA(SOA):
"""k-Occurrence Automaton.
Same structure as SOA but each symbol may label up to k states.
"""
def __init__(self, k=1):
super().__init__()
self.k = k
self._symbol_count = {}
def add_state(self, label):
nid = super().add_state(label)
sym = strip_k(label)
self._symbol_count.setdefault(sym, 0)
self._symbol_count[sym] += 1
return nid
def remove_state(self, nid):
label = self._label.get(nid)
if label:
sym = strip_k(label)
self._symbol_count[sym] -= 1
super().rm_state(nid)
def count_symbol(self, symbol):
return self._symbol_count.get(strip_k(symbol), 0)
def symbol_ok(self, symbol):
return self.count_symbol(symbol) < self.k
def is_deterministic(self):
for n in self._succ:
label_map = {}
for t in self._succ[n]:
lab = self._label.get(t)
if lab:
base = strip_k(lab)
if base in label_map:
return False
label_map[base] = t
return True
def accept(self, w):
"""Accept using base symbols (strip k-markers from state labels)."""
cur = {self.src}
for sym in w:
nxt = set()
for s in cur:
for t in self._succ.get(s, set()):
lab = self._label.get(t)
if lab and strip_k(lab) == sym:
nxt.add(t)
if not nxt:
return False
cur = nxt
return any(self.sink in self._succ.get(s, set()) for s in cur)
def succ_labeled(self, nid, symbol):
return {t for t in self._succ.get(nid, set()) if strip_k(self._label.get(t) or '') == symbol}
def build_complete_koa(sequences, k):
"""Build complete k-OA Ck (Definition 4.2, arXiv 1004.2372).
For each a ∈ Σ(S), exactly k states labeled a (a_1 ... a_k).
- src connected to exactly one a_i for each a
- Every state has edge to every other state (except src)
- src → sink edge (for ε)
"""
G = KOA(k=k)
alphabet = set()
for seq in sequences:
for token in seq:
alphabet.add(token)
symbol_states = {}
for sym in alphabet:
state_ids = []
for i in range(1, k + 1):
nid = G.add_state(f"{sym}_{i}")
state_ids.append(nid)
G.add_edge(G.src, nid)
symbol_states[sym] = state_ids
all_states = [n for n in G._succ if n not in (G.src, G.sink)]
for s in all_states:
for t in all_states:
if s != t and not G.has_edge(s, t):
G.add_edge(s, t)
if not G.has_edge(s, G.sink):
G.add_edge(s, G.sink)
G.add_edge(G.src, G.sink)
return G, symbol_states
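The shape of the complete k-OA described in the docstring of `build_complete_koa` can be checked on a small case. The sketch below builds it as a plain successor dictionary; it follows the docstring (src linked to one copy per symbol, here arbitrarily `a_1`, which is an assumption) and is not the `KOA` class itself.

```python
def complete_koa(alphabet, k):
    """Build the complete k-OA over `alphabet` as a {state: successors} dict."""
    symbols = sorted(alphabet)
    states = [f"{a}_{i}" for a in symbols for i in range(1, k + 1)]
    succ = {"sink": set()}
    # src reaches one copy of every symbol, plus sink for the empty word.
    succ["src"] = {f"{a}_1" for a in symbols} | {"sink"}
    # Every labeled state reaches every other labeled state and sink.
    for s in states:
        succ[s] = {t for t in states if t != s} | {"sink"}
    return states, succ


states, succ = complete_koa({"a", "b"}, k=2)
edge_count = sum(len(targets) for targets in succ.values())
print(states, edge_count)
```

For an alphabet of size n this gives n·k labeled states and on the order of (n·k)² edges, which is why the later disambiguation and pruning steps matter.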

432
bex/kore.py Normal file

@ -0,0 +1,432 @@
"""
kore: k-ORE inference (iDRegEx) after Bex et al. 2008/2010.
iDRegEx (Bex 2008):
1. Build a prefix-tree automaton (PTA) from the example sequences
2. Shrink: rewrite rules generalize the automaton
(simplify → star_rewrite → concat_rewrite → alternation_rewrite)
3. Repair: restore determinism after each rewrite pass
4. Convert: turn the automaton into a regular expression
(state elimination after Brzozowski & McCluskey)
5. k-ORE check: the expression must satisfy the k-occurrence condition
(each symbol occurs at most k times)
6. MDL: choose the k with the minimal MDL score
"""
from .automaton import Automaton
from .pta import build_pta
from .shrink import shrink
from .repair import repair
from .mdl import mdl_score
def _state_elimination(G):
"""
State elimination after Brzozowski & McCluskey.
Removes all non-start, non-accept states one by one.
For each eliminated state q:
- For each pair (p, r) with p→q (label A) and q→r (label B):
- R_self_q = disjunction of all self-loops on q
- New label = A · (R_self_q)* · B
- Add edge p → r with the new label (or merge with an existing one)
After elimination only the start and accept states remain.
The expression is the union of all paths from start to accept.
"""
G = G.copy()
eliminated = set()
# Repeat until only start + accepts are left
changed = True
while changed:
changed = False
# Pick a state to eliminate (not start, not accept)
for q in list(G.nodes):
if q == G.start or q in G.accepts:
continue
if q in eliminated:
continue
reachable = _is_reachable_to_accept(G, q)
if not reachable:
G.nodes.discard(q)
G.accepts.discard(q)
G.edges = [e for e in G.edges if e['from'] != q and e['to'] != q]
eliminated.add(q)
changed = True
continue
incoming = G.incoming(q)
outgoing = G.outgoing(q)
# R_self_q = (a1 | a2 | ...)* over all self-loops on q
self_loops = [e for e in outgoing if e['to'] == q]
outgoing_no_self = [e for e in outgoing if e['to'] != q]
if not outgoing_no_self:
# Dead end, no outgoing edges (except self-loops)
# Remove incoming edges + q
for e in incoming:
G.remove_edge(e['from'], e['to'], e['label'])
G.nodes.discard(q)
G.accepts.discard(q)
eliminated.add(q)
changed = True
continue
if self_loops:
self_labels = list(set(e['label'] for e in self_loops))
if len(self_labels) == 1:
R_self_q = f"({self_labels[0]})*"
else:
R_self_q = f"({'|'.join(self_labels)})*"
else:
R_self_q = ""
# For each pair (p, r): p→q (incoming), q→r (outgoing, r != q)
for e_in in incoming:
p = e_in['from']
if p == q:
continue
A = e_in['label']
for e_out in outgoing_no_self:
r = e_out['to']
B = e_out['label']
if R_self_q:
new_label = f"({A}.{R_self_q}.{B})"
else:
new_label = f"({A}.{B})"
# Merge with an existing edge p→r if present
existing = [e for e in G.edges if e['from'] == p and e['to'] == r]
existing_labels = [e['label'] for e in existing]
if new_label not in existing_labels and f"({new_label})" not in existing_labels:
# Combine with the existing labels via |
if existing:
old_label = existing[0]['label']
merged = f"({old_label}|{new_label})"
G.remove_edge(p, r, old_label)
G.add_edge(p, r, merged)
else:
G.add_edge(p, r, new_label)
# Delete q and all its edges
for e in incoming:
G.remove_edge(e['from'], e['to'], e['label'])
for e in self_loops:
G.remove_edge(e['from'], e['to'], e['label'])
for e in outgoing_no_self:
G.remove_edge(e['from'], e['to'], e['label'])
G.nodes.discard(q)
G.accepts.discard(q)
eliminated.add(q)
changed = True
break
return G
def _is_reachable_to_accept(G, q):
"""Check whether an accept state is reachable from q."""
visited = set()
stack = [q]
while stack:
n = stack.pop()
if n in visited:
continue
visited.add(n)
if n in G.accepts:
return True
for e in G.outgoing(n):
stack.append(e['to'])
return False
def _extract_expression(G):
"""
Extracts the regular expression from the eliminated automaton.
After elimination only the start state and the accept states remain.
The expression is the disjunction of all paths from start to accept.
"""
if G.start is None:
return ""
# Phase 1: State Elimination
G_elim = _state_elimination(G)
start = G_elim.start
if not G_elim.accepts:
return ""
paths = []
outgoing = G_elim.outgoing(start)
# Special case: start is itself an accept state
if start in G_elim.accepts:
# Check for a self-loop
self_edges = [e for e in outgoing if e['to'] == start]
non_self = [e for e in outgoing if e['to'] != start]
if not non_self and not self_edges:
return "ε"
if self_edges:
self_labels = '|'.join(set(e['label'] for e in self_edges))
paths.append(f"({self_labels})*")
# Besides start itself: edges from start to other accept states
for e in non_self:
target = e['to']
if target in G_elim.accepts:
paths.append(e['label'])
# Paths from start to accept states
for acc in G_elim.accepts:
if acc == start:
continue
# Edge start → acc
direct = [e for e in outgoing if e['to'] == acc]
for e in direct:
paths.append(e['label'])
self_loops_start = [e for e in G_elim.outgoing(start) if e['to'] == start]
# Further edges: start → x (where x is not an accept state)
intermediate = [e for e in outgoing if e['to'] not in G_elim.accepts and e['to'] != start]
for e in intermediate:
# Follow the path from the intermediate state to an accept state
suffix = _follow_path(G_elim, e['to'], G_elim.accepts, set())
if suffix:
paths.append(f"({e['label']}.{suffix})")
# Remove duplicates
paths = list(set(paths))
if not paths:
return "ε"
if len(paths) == 1:
expr = paths[0]
else:
expr = f"({'|'.join(paths)})"
# Simplify: remove redundant parentheses
expr = _simplify_expression(expr)
return expr
def _follow_path(G, start, accepts, visited):
"""Find a path from start to an accept state."""
if start in accepts:
return "ε"
if start in visited:
return None
visited.add(start)
outgoing = G.outgoing(start)
for e in outgoing:
if e['to'] == start:
continue
suffix = _follow_path(G, e['to'], accepts, visited)
if suffix is not None:
if suffix == "ε":
return e['label']
else:
return f"({e['label']}.{suffix})"
return None
def _simplify_expression(expr):
"""
Simplifies a regular expression.
Removes redundant parentheses, duplicate operators, etc.
"""
if not expr or expr in ('ε', ''):
return expr
# (ε. X ) → X
# (X . ε) → X
# ((X)) → X
# (a|a) → a
simplified = expr
while True:
prev = simplified
simplified = _simplify_once(simplified)
if simplified == prev:
break
return simplified
def _simplify_once(expr):
"""One reduction step."""
# (ε.X) → X
# (X.ε) → X
# ((X)) → X
# (a|a) → a
result = expr
# ((X)) → (X) (collapse doubled parentheses)
import re
result = re.sub(r'\(\(([^()]+)\)\)', r'(\1)', result)
return result
def validate_k_ore(expr, k_index):
"""
Checks whether an expression satisfies the k-occurrence condition.
A k-ORE allows each symbol at most once per k-indicator,
i.e. in each conjunct (subexpression without |) each symbol may
occur at most k times.
Simplified: count the occurrences of each distinct token name
in the expression. If a token occurs more than k times, the
condition is violated.
Returns:
bool, str: (satisfied, reason)
"""
# Count the occurrences of each token name in the expression
token_counts = {}
i = 0
while i < len(expr):
if expr[i].isalnum() or expr[i] in '/_-':
j = i
while j < len(expr) and (expr[j].isalnum() or expr[j] in '/_-'):
j += 1
token = expr[i:j]
token_counts[token] = token_counts.get(token, 0) + 1
i = j
else:
i += 1
violations = [t for t, c in token_counts.items() if c > k_index]
if violations:
return False, f"Token {violations} occurs more than {k_index} times"
return True, "OK"
class kOREInference:
"""
iDRegEx: k-ORE inference via PTA → Shrink → Repair → Expression.
After Bex et al. 2008:
- Build the PTA from the sequences
- Shrink: rewrite rules generalize
- Repair: restore determinism
- Convert: extract the regular expression via state elimination
- Check k-occurrence
- Choose k by MDL
"""
def __init__(self, k_max=5):
self.k_max = k_max
def infer(self, sequences):
"""
Infer the best k-ORE.
Returns:
(Automaton, expression_string, best_k) or None
"""
sequences = [s for s in sequences if s]
if not sequences:
return None, "", 0
best_score = float('inf')
best_result = None
for k in range(1, self.k_max + 1):
try:
auto, expr = self._infer_k_expression(sequences, k)
if auto is None:
continue
score = mdl_score(expr, sequences)  # mdl_score expects the expression string, not the automaton
if score < best_score:
best_score = score
best_result = (auto, expr, k)
except Exception:
continue
return best_result
def _infer_k_expression(self, sequences, k):
"""Run iDRegEx for a specific k."""
# 1. Build the PTA
pta = build_pta(sequences)
# 2. Shrink
shrunk = shrink(pta, max_iterations=20)
# 3. Repair
repaired = repair(shrunk)
# 4. Extract the expression
expr = _extract_expression(repaired)
# 5. k-ORE check
valid, _ = validate_k_ore(expr, k)
if not valid:
expr = self._generalize_to_k_ore(expr, k)
return repaired, expr
def _generalize_to_k_ore(self, expr, k):
"""
Generalize the expression to a k-ORE.
If token t occurs more than k times:
- replace repetitions with t+ or t*
"""
# Simple heuristic: extract tokens, count, replace
result = expr
token_counts = {}
i = 0
while i < len(result):
if result[i].isalnum() or result[i] in '/_-':
j = i
while j < len(result) and (result[j].isalnum() or result[j] in '/_-'):
j += 1
token = result[i:j]
token_counts[token] = token_counts.get(token, 0) + 1
i = j
else:
i += 1
for token, count in token_counts.items():
if count > k:
# Replace token.token with token+
import re
pattern = re.escape(token) + r'\.' + re.escape(token)
replacement = f"{token}+"
result = re.sub(pattern, replacement, result, count=1)
break
return result
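The occurrence count that `validate_k_ore` performs by hand can also be written with a tokenising regular expression and `collections.Counter`. This is a sketch under one assumption that differs from the code above: tokens are restricted to ASCII letters, digits and `/_-`, so a symbol such as `ε` is not counted. The name `occurrence_violations` is illustrative.

```python
import re
from collections import Counter

# Assumed token shape: ASCII letters, digits and the characters / _ -
TOKEN = re.compile(r"[A-Za-z0-9/_-]+")


def occurrence_violations(expr, k):
    """Return the sorted tokens that occur more than k times in expr."""
    counts = Counter(TOKEN.findall(expr))
    return sorted(t for t, c in counts.items() if c > k)


expr = "(apt.(template|apt).service)"
print(occurrence_violations(expr, 1), occurrence_violations(expr, 2))
```

An empty result means the expression passes this simplified k-occurrence check for the given k.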

46
bex/marking.py Normal file

@ -0,0 +1,46 @@
"""Marking — Convert k-OA to SOA over Σ^(k) (Definition 4.4, arXiv 1004.2372)."""
from .soa import SOA
from .expr import strip_k
def mark_koa(G):
"""
Mark a k-OA G as a SOA over Σ^(k).
Process nodes in arbitrary order. For the i-th occurrence of label a,
replace by a^(i) (represented as "a_i").
Returns a SOA H over Σ^(k) such that L(G) = strip(L(H)).
"""
H = SOA()
H.src = G.src
H.sink = G.sink
H._succ = {n: set(succ) for n, succ in G._succ.items()}
H._pred = {n: set(pred) for n, pred in G._pred.items()}
H._label = {}
H._next = G._next
counts = {}
for n in G._succ:
lab = G._label.get(n)
if lab and lab not in ('ε', '') and n not in (G.src, G.sink):
sym = strip_k(lab)
counts[sym] = counts.get(sym, 0) + 1
H._label[n] = f"{sym}_{counts[sym]}"
elif n in (G.src, G.sink):
H._label[n] = None
else:
H._label[n] = lab
return H
def strip_expression(expr):
"""Strip k-ORE markers from expression: a_i → a.
Returns expression over original alphabet Σ.
"""
import re
result = re.sub(r'(_\d+)', '', expr)
return result
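Marking and stripping are meant to be inverse on the level of symbols. The round trip can be sketched on a flat list of labels as below; `mark_labels` and `strip_markers` are illustrative names, and the last call shows a caveat of the `_\d+` pattern when a symbol name itself ends in an underscore and digits.

```python
import re


def mark_labels(labels):
    """Give the i-th occurrence of each label the marker suffix _i."""
    counts = {}
    marked = []
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
        marked.append(f"{lab}_{counts[lab]}")
    return marked


def strip_markers(expr):
    """Remove every _<digits> marker, as the pattern above does."""
    return re.sub(r"_\d+", "", expr)


print(mark_labels(["a", "b", "a"]))
print(strip_markers("a_1.(b_1|a_2)+"))
# Caveat: a symbol named host_1 loses its own suffix together with the marker.
print(strip_markers("host_1_1"))
```

If symbol names of that shape occur in the input, a marker syntax that cannot collide with them would be a safer choice.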

143
bex/mdl.py Normal file

@ -0,0 +1,143 @@
"""MDL scoring for iDRegEx (Algorithm 4, arXiv 1004.2372)."""
import math
from .expr import alphabet
def model_cost(expr):
"""|r| — number of alphabet symbol occurrences in expression."""
import re
cleaned = re.sub(r'[+?*()|.]', '', expr)
cleaned = re.sub(r'_\d+', '', cleaned)
cleaned = re.sub(r'[ε∅]', '', cleaned)
return len(cleaned)
def lang_size(expr, n=None):
"""Estimate |L(r)≤n| — number of words of length ≤ n in L(r).
Simple approximation based on expression structure.
"""
if not expr or expr == '':
return 0
if expr == 'ε':
return 1
n = n or (2 * model_cost(expr) + 1)
total = 0
for length in range(n + 1):
total += _count_words_fast(expr, length)
return total
def _count_words_fast(expr, length):
if length < 0:
return 0
if not expr or expr == '':
return 0
if expr == 'ε':
return 1 if length == 0 else 0
alpha = alphabet(expr)
if expr in alpha:
return 1 if length == 1 else 0
if expr.endswith('+'):
inner = expr[:-1]
if inner.endswith('?'):
inner = inner[:-1]
return _count_star(inner, length, min_count=1)
if expr.endswith('?'):
inner = expr[:-1]
return _count_words_fast(inner, length) + (1 if length == 0 else 0)
if expr.startswith('(') and '|' in expr:
parts = _split_disj(expr[1:-1])
return sum(_count_words_fast(p.strip(), length) for p in parts)
if '.' in expr:
parts = expr.split('.')
return _count_concat(parts, length, 0)
return 0
def _count_concat(parts, length, idx):
if idx >= len(parts):
return 1 if length == 0 else 0
total = 0
for take in range(length + 1):
cnt = _count_words_fast(parts[idx], take)
if cnt:
total += cnt * _count_concat(parts, length - take, idx + 1)
return total
def _count_star(inner, length, min_count):
total = 0
for rep in range(min_count, length + 1):
total += _count_repeat(inner, rep, length)
return total
def _count_repeat(inner, rep, length):
if rep == 0:
return 1 if length == 0 else 0
total = 0
for take in range(length + 1):
cnt = _count_words_fast(inner, take)
if cnt:
total += cnt * _count_repeat(inner, rep - 1, length - take)
return total
def _split_disj(s):
depth = 0
parts = []
cur = []
for ch in s:
if ch == '(':
depth += 1
cur.append(ch)
elif ch == ')':
depth -= 1
cur.append(ch)
elif ch == '|' and depth == 0:
parts.append(''.join(cur))
cur = []
else:
cur.append(ch)
parts.append(''.join(cur))
return parts
def data_cost(expr, sequences):
"""MDL data cost: Σ_i log₂(|L=i(r)| / |S=i|) adjusted.
Simplified form: for each word in S, cost = log₂(lang_size of all words
of that length).
"""
n = 2 * model_cost(expr) + 1
total_cost = 0.0
for seq in sequences:
length = len(seq)
if length <= n:
lang_at_len = _count_words_fast(expr, length)
if lang_at_len > 0:
total_cost += math.log2(lang_at_len)
return total_cost
def mdl_score(expr, sequences):
"""MDL = model cost + data cost."""
model = model_cost(expr)
data = data_cost(expr, sequences)
return model + data
# For backward compatibility
class MDLScorer:
def score(self, expr, sequences):
return mdl_score(expr, sequences)
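The data cost rewards expressions that admit few words of each observed length. A brute-force sketch makes this visible on a toy sample. It is an illustration only: it uses Python `re` syntax with single-character symbols instead of the expression syntax of this package, and enumerates all words, which is feasible only for tiny alphabets.

```python
import math
import re
from itertools import product


def words_of_length(pattern, alphabet, n):
    """Brute-force count of words of exactly n symbols matched by pattern."""
    return sum(
        1 for word in product(alphabet, repeat=n)
        if re.fullmatch(pattern, "".join(word))
    )


def data_cost(pattern, alphabet, sample):
    """Sum of log2(number of words of the same length) over the sample."""
    total = 0.0
    for word in sample:
        count = words_of_length(pattern, alphabet, len(word))
        if count > 0:
            total += math.log2(count)
    return total


sample = ["a", "ab", "ac"]
tight = data_cost("a[bc]?", "abc", sample)  # few words per length
loose = data_cost("[abc]*", "abc", sample)  # every word is allowed
print(tight, loose)
```

The tight expression costs 2 bits on this sample, while the universal expression costs noticeably more, so MDL prefers the former even before model cost is added.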

62
bex/pta.py Normal file

@ -0,0 +1,62 @@
"""
pta: prefix-tree automaton (PTA) construction.
After Bex et al. 2008/2010: the PTA is the initial automaton built from
the positive example sequences (token sequences).
Each sequence is mapped to a path in the trie:
- Root = start state
- Every common prefix is shared (as in a trie)
- The last state of each sequence is marked as accept
The PTA is deterministic and accepts exactly the given sequences.
It is the starting point for SORE/CHARE inference via shrink rewrites.
"""
from .automaton import Automaton
def build_pta(sequences):
"""
Builds the prefix-tree automaton from a list of token sequences.
After Bex et al. 2008/2010, algorithm PTA:
- Initialize with start state q0
- For each sequence w = a1...an:
- Start in q0
- For each ai: follow the edge (q, ai) if present,
otherwise create a new state q' and an edge (q, q', ai)
- Mark the final state as accept
Args:
sequences: list of token lists (each = one YAML document)
Returns:
Automaton: PTA for the given sequences
Example:
>>> build_pta([["apt", "service"], ["apt", "template", "service"]])
Automaton(nodes=5, edges=4, start=0, accepts={2, 4})
"""
automaton = Automaton(start=0)
automaton.add_node(0)
next_id = 1
for seq in sequences:
current = 0
for token in seq:
found = False
for (to, label) in automaton.successors(current):
if label == token:
current = to
found = True
break
if not found:
new_node = next_id
next_id += 1
automaton.add_edge(current, new_node, token)
current = new_node
automaton.add_accept(current)
return automaton
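The construction can be traced without the `Automaton` class by storing the trie as a dictionary from (state, token) to target state. This is a sketch with an assumed representation; `build_trie` is an illustrative name.

```python
def build_trie(sequences):
    """Build a prefix tree as {(state, token): next_state} plus accept states."""
    edges = {}
    accepts = set()
    next_id = 1  # state 0 is the root / start state
    for seq in sequences:
        current = 0
        for token in seq:
            if (current, token) not in edges:
                edges[(current, token)] = next_id
                next_id += 1
            current = edges[(current, token)]
        accepts.add(current)
    return edges, accepts


edges, accepts = build_trie([["apt", "service"], ["apt", "template", "service"]])
print(edges, accepts)
```

The two sequences share the `apt` edge, so the result has five states, four edges, and one accept state per distinct sequence.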

167
bex/repair.py Normal file

@ -0,0 +1,167 @@
"""
repair: determinism repair after Bex 2010.
When the rewrite rules (shrink) produce an automaton that is no longer
deterministic (e.g. two edges s→u with the same label A), repair must
restructure the automaton so that it is deterministic again,
without changing the accepted language.
Bex 2010, Section 4.2.4 (Repair):
repair(G) detects nondeterminism and uses two strategies:
1. Label disambiguation: if edges (s→u, A) and (s→v, A) exist,
check whether u and v can be merged.
2. Automaton splitting: if a merge is not possible (different futures),
split state s into s1, s2 with disjoint label sets.
The repair function is called after every shrink pass.
"""
from .automaton import Automaton
def detect_conflicts(G):
"""
Detects nondeterminism in the automaton.
Returns: list of (state, label, targets) for each label
that leads from state to more than one target.
"""
conflicts = []
for node in G.nodes:
label_map = {}
for e in G.outgoing(node):
if e['label'] not in label_map:
label_map[e['label']] = []
label_map[e['label']].append(e['to'])
for label, targets in label_map.items():
if len(targets) > 1:
conflicts.append((node, label, targets))
return conflicts
def merge_targets(G, state, label, targets):
"""
Tries to merge targets.
If all targets are structurally equivalent (same outgoing labels),
they can be merged into one.
"""
future_sets = []
for t in targets:
futures = {(e['to'], e['label']) for e in G.outgoing(t)}
future_sets.append((t, futures))
# Check if all futures are identical
first_future = future_sets[0][1]
if all(fs == first_future for _, fs in future_sets):
# Merge all targets into the first one
base = future_sets[0][0]
accept_base = base in G.accepts
for t, _ in future_sets[1:]:
if t in G.accepts:
G.add_accept(base)
if base != t:
for e in G.incoming(t):
if e['from'] != state:
G.add_edge(e['from'], base, e['label'])
G.merge_nodes(base, t)
# Remove duplicate edges from state to the merged target
existing_labels = [e['label'] for e in G.outgoing(state) if e['to'] == base]
if label in existing_labels:
existing_labels.remove(label)
if label not in existing_labels:
G.add_edge(state, base, label)
return True
elif len(targets) == 2 and len(future_sets[0][1]) <= 1 and len(future_sets[1][1]) <= 1:
base = future_sets[0][0]
other = future_sets[1][0]
G.merge_nodes(base, other)
G.add_edge(state, base, label)
return True
return False
def split_automaton(G, state, label, targets):
"""
Splits the state 'state' into several copies, one per target.
Each copy receives the incoming edges of state that belong to the
respective target label.
"""
# Find the highest node ID
max_id = max(G.nodes) if G.nodes else 0
incoming = G.incoming(state)
outgoing = G.outgoing(state)
label_to_target = {}
for e in outgoing:
label_to_target[e['label']] = e['to']
# The targets are all under the conflict label
if len(targets) == 2 and len(label_to_target) == 2:
new_node = max_id + 1
G.add_node(new_node)
target1, target2 = targets[0], targets[1]
for e in list(G.incoming(state)):
if e['from'] == state:
continue
G.add_edge(e['from'], new_node, e['label'])
label_for_other = [k for k, v in label_to_target.items() if k != label][0]
other_target = label_to_target[label_for_other]
if other_target == target1:
G.add_edge(new_node, target1, label)
elif other_target == target2:
G.add_edge(state, target1, label)
else:
G.add_edge(state, target1, label)
return True
return False
def repair(G):
"""
repair: restores determinism after rewrite operations.
After Bex 2010, repair algorithm:
1. Detect nondeterminism (detect_conflicts)
2. For each conflict:
a. Try merge_targets (merge structurally equivalent targets)
b. If that is not possible: split_automaton (split the state)
3. Repeat until no conflicts remain
"""
max_iterations = 50
for _ in range(max_iterations):
conflicts = detect_conflicts(G)
if not conflicts:
break
for state, label, targets in conflicts:
if len(targets) < 2:
continue
for e in G.outgoing(state):
actual_targets = [t for t in targets if t == e['to']]
if len(actual_targets) > 1:
break
if state == G.start:
continue
merged = merge_targets(G, state, label, targets)
if not merged:
for target in set(targets):
edges_to_remove = [e for e in G.outgoing(state)
if e['label'] == label and e['to'] == target]
for e in edges_to_remove[1:]:
G.remove_edge(e['from'], e['to'], e['label'])
return G
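Conflict detection, the first step of the repair loop, amounts to grouping edges by (source, label) and reporting groups with more than one target. A compact sketch on a list of `(source, label, target)` triples follows; this encoding and the name `find_conflicts` are assumptions. Because it collects targets in a set, parallel duplicate edges to the same target are not reported, unlike a list-based grouping.

```python
from collections import defaultdict


def find_conflicts(edges):
    """Return {(source, label): targets} for labels with several targets."""
    by_key = defaultdict(set)
    for source, label, target in edges:
        by_key[(source, label)].add(target)
    return {key: targets for key, targets in by_key.items() if len(targets) > 1}


edges = [(0, "a", 1), (0, "a", 2), (0, "b", 3), (1, "c", 3), (1, "c", 3)]
print(find_conflicts(edges))
```

An automaton is deterministic in this sense exactly when the function returns an empty dictionary.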

111
bex/role_grammar.py Normal file

@ -0,0 +1,111 @@
"""Extract Ansible role task module sequences and learn per-group grammars."""
from pathlib import Path
import yaml
from collections import defaultdict
from .crx import CRX
from .expr import strip_k
IGNORE_MODULES = frozenset({'name', 'tags', 'when', 'register', 'no_log',
'changed_when', 'failed_when', 'ignore_errors',
'run_once', 'delegate_to', 'loop', 'loop_control',
'until', 'retries', 'delay', 'poll', 'async',
'become', 'become_user', 'become_flags',
'check_mode', 'diff', 'environment',
'vars', 'notify', 'args',
'block', 'rescue', 'always', 'include_tasks'})
def extract_module_name(task):
"""Extract the Ansible module name from a task dict.
The module is the key that is NOT a known non-module key.
For block/rescue/always entries, returns a (possibly nested) list with the
module names of the nested tasks; returns None if no module key is found.
"""
if not isinstance(task, dict):
return None
# Check for block/rescue/always — these contain nested tasks
for key in ('block', 'rescue', 'always'):
if key in task:
nested = task[key]
if isinstance(nested, list):
return [extract_module_name(t) for t in nested]
return None
# Find the module key (not name, not meta-keys)
for key, value in task.items():
if key in ('name',):
continue
if key in IGNORE_MODULES:
continue
if isinstance(value, (dict, list, str, bool, int, float)):
# It's the module name (short name or FQCN)
return strip_k(key)
return None
def flatten_nested(seq):
"""Flatten nested lists into a single list."""
result = []
for item in seq:
if isinstance(item, list):
result.extend(flatten_nested(item))
elif item is not None and item != 'skip':
result.append(item)
return result
def get_role_category(role_name):
"""Extract category from role name like deploy_foo → deploy."""
parts = role_name.split('_')
if len(parts) >= 2:
return parts[0]
return 'other'
def load_role_module_sequence(role_dir):
"""Load a role's task file and extract the module sequence."""
task_file = role_dir / 'tasks' / 'main.yml'
if not task_file.exists():
return None, None
with open(task_file) as f:
data = yaml.safe_load(f)
if not isinstance(data, list):
return None, None
modules = []
for task in data:
result = extract_module_name(task)
if isinstance(result, list):
modules.extend(flatten_nested(result))
elif result is not None:
modules.append(result)
return role_dir.name, modules
def collect_all_role_sequences(roles_dir='roles'):
"""Collect module sequences from all roles, grouped by category."""
by_category = defaultdict(list)
all_roles = []
for role_dir in sorted(Path(roles_dir).glob('*/tasks/main.yml')):
role_name = role_dir.parent.parent.name
name, seq = load_role_module_sequence(role_dir.parent.parent)
if seq:
cat = get_role_category(role_name)
by_category[cat].append((role_name, seq))
all_roles.append((role_name, seq))
return all_roles, by_category
def learn_grammar(sequences):
    """Run CRX on a list of sequences; return 'ε' when there are none."""
    if not sequences:
        return 'ε'
    return CRX().infer(sequences)
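The flattening and grouping steps above can be tried without any role files. The sketch below copies `flatten_nested` and `get_role_category` from this file and feeds them hand-written data; the `roles` dictionary is a made-up example, not output of the real adapter.

```python
from collections import defaultdict


def flatten_nested(seq):
    """Flatten nested lists, dropping None and 'skip' entries."""
    result = []
    for item in seq:
        if isinstance(item, list):
            result.extend(flatten_nested(item))
        elif item is not None and item != 'skip':
            result.append(item)
    return result


def get_role_category(role_name):
    """deploy_foo -> deploy; names without an underscore fall into 'other'."""
    parts = role_name.split('_')
    return parts[0] if len(parts) >= 2 else 'other'


# Hypothetical raw extraction results (nested lists come from block/rescue/always)
roles = {
    'deploy_web': ['apt', ['template', None, ['service']], 'skip'],
    'deploy_db': ['apt', 'service'],
    'nginx': ['apt'],
}

by_category = defaultdict(list)
for name, raw in roles.items():
    by_category[get_role_category(name)].append(flatten_nested(raw))
```

Each category then holds one flat module sequence per role, which is the input shape `learn_grammar` expects.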

bex/rwr0.py Normal file

@ -0,0 +1,224 @@
"""RWR₀ — Algorithm 6 (TODS 2010), conference version rules (Figure 10 + Figure 13).
Precedence: CONCATENATION > DISJUNCTION > SELF-LOOP > OPTIONAL
Repair precedence: ENABLE-DISJUNCTION > ENABLE-OPTIONAL-1 > ENABLE-OPTIONAL-2
Conditions checked on ε-closure G* (Definition 25).
Used as rwr²₁ in arXiv 1004.2372 for k>1.
"""
from .soa import SOA
from .expr import concat, disj, star, optional
def _find_concat(G, Gs):
"""Figure 10 CONCATENATION rule. Succ is read from G*, Pred from G.
Four variants are tried in priority order: r·s > r·s? | r?·s > r?·s?
r·s: Succ(r)={s} Pred(s)={r}
r·s?: Succ(r)={s,sink} Pred(s)={r}
r?·s: Succ(r)={s} Pred(s)={r,sink}
r?·s?: Succ(r)={s,sink} Pred(s)={r,sink}
"""
st = G.states()
# Variant 1: r·s (highest priority — check all pairs first)
for r in st:
for s in st:
if r == s:
continue
if Gs.succ(r) == {s} and G.pred(s) == {r}:
return r, s, concat(G.label(r), G.label(s))
# Variants 2-3: r?·s and r·s?
for r in st:
for s in st:
if r == s:
continue
Sr = Gs.succ(r)
Ps = G.pred(s)
if Sr == {s, G.sink} and Ps == {r}:
return r, s, concat(G.label(r), optional(G.label(s)))
if Sr == {s} and Ps == {r, G.sink}:
return r, s, concat(optional(G.label(r)), G.label(s))
# Variant 4: r?·s?
for r in st:
for s in st:
if r == s:
continue
if Gs.succ(r) == {s, G.sink} and G.pred(s) == {r, G.sink}:
return r, s, concat(optional(G.label(r)), optional(G.label(s)))
return None, None, None
def _find_disj(G, Gs):
"""Figure 10 DISJUNCTION rule, checked on G*.
Pred(r)=Pred(s) Succ(r)=Succ(s)
"""
st = G.states()
for i, r in enumerate(st):
for s in st[i + 1:]:
if G._pred_plus(r) == G._pred_plus(s) and G._succ_plus(r) == G._succ_plus(s):
return r, s, disj(G.label(r), G.label(s))
return None, None, None
def _find_selfloop(G, Gs):
"""Figure 10 SELF-LOOP rule. r ∈ Succ(r) in G (not G*)."""
for r in G.states():
if G.has_edge(r, r):
return r, star(G.label(r))
return None, None
def _find_optional(G):
"""Figure 10 OPTIONAL rule. G contains exactly one non-special node besides src, sink.
Only applies when G is not already final (avoids infinite loop)."""
if G.is_final():
return None, None
if G.num_non_special() == 1:
r = G.states()[0]
return r, optional(G.label(r))
return None, None
def _try_ed(G):
"""ENABLE-DISJUNCTION (Figure 13). When Pred(r)=Pred(s) but Succ(r)≠Succ(s):
add edges to make Succ(r)=Succ(s). Or symmetric for Pred.
"""
st = G.states()
for i, r in enumerate(st):
for s in st[i + 1:]:
if G._pred_plus(r) == G._pred_plus(s) and G._succ_plus(r) != G._succ_plus(s):
merged = G._succ_plus(r) | G._succ_plus(s)
changed = False
for t in merged - G._succ_plus(r):
if not G.has_edge(r, t):
G.add_edge(r, t)
changed = True
for t in merged - G._succ_plus(s):
if not G.has_edge(s, t):
G.add_edge(s, t)
changed = True
if changed:
return True
if G._succ_plus(r) == G._succ_plus(s) and G._pred_plus(r) != G._pred_plus(s):
merged = G._pred_plus(r) | G._pred_plus(s)
changed = False
for p in merged - G._pred_plus(r):
if not G.has_edge(p, r):
G.add_edge(p, r)
changed = True
for p in merged - G._pred_plus(s):
if not G.has_edge(p, s):
G.add_edge(p, s)
changed = True
if changed:
return True
return False
def _try_eo1(G):
"""ENABLE-OPTIONAL-1 (Figure 13). If Succ(r)={s,sink} but Pred(s) has other
predecessors besides r, add Pred(s) to r's predecessors.
"""
for r in G.states():
Sr = G.succ(r)
if G.sink in Sr and len(Sr) == 2:
s = next(x for x in Sr if x != G.sink)
if len(G.pred(s)) > 1:
changed = False
for p in G.pred(s) - {r}:
if not G.has_edge(p, r):
G.add_edge(p, r)
changed = True
if changed:
return True
return False
def _try_eo2(G):
"""ENABLE-OPTIONAL-2 (Figure 13). If Pred(s)={r,sink} but Succ(r) has other
successors besides s, add Succ(r) to s's successors.
"""
for s in G.states():
Ps = G.pred(s)
if G.sink in Ps and len(Ps) == 2:
r = next(x for x in Ps if x != G.sink)
if len(G.succ(r)) > 1:
changed = False
for t in G.succ(r) - {s}:
if not G.has_edge(s, t):
G.add_edge(s, t)
changed = True
if changed:
return True
return False
def rwr0(G):
"""
| Algorithm 6: RWR₀ |
Input: SOA G
Output: SORE r (or ∅ on failure, returned as '')
1: if sink not reachable: return ∅
2: if E(G)={(src,sink)}: return ε
3: while not done:
4:   if rewrite (Figure 10) applicable:
5:     apply with precedence: CONCAT > DISJ > SELF-LOOP > OPTIONAL
6:   elif repair (Figure 13) applicable:
7:     apply with precedence: ED > EO1 > EO2
8:   else: done
9: if final: return r else return ∅
"""
G = G.copy()
if not G.sink_reachable():
return ''
if G.num_non_special() == 0 and G.has_edge(G.src, G.sink):
return 'ε'
done = False
while not done:
applied = False
Gs = G.epsilon_closure()
r, s, lab = _find_concat(G, Gs)
if r is not None:
G.contract(r, s, lab)
applied = True
if not applied:
Gs = G.epsilon_closure()
r, s, lab = _find_disj(G, Gs)
if r is not None:
G.contract(r, s, lab)
applied = True
if not applied:
Gs = G.epsilon_closure()
r, lab = _find_selfloop(G, Gs)
if r is not None:
t = G.contract_single(r, lab)
G.rm_edge(t, t)
applied = True
if not applied:
r, lab = _find_optional(G)
if r is not None:
G.contract_single(r, lab)
if not G.has_edge(G.src, G.sink):
G.add_edge(G.src, G.sink)
applied = True
if not applied:
applied = _try_ed(G)
if not applied:
applied = _try_eo1(G)
if not applied:
applied = _try_eo2(G)
if not applied:
done = True
if G.is_final():
return G.expression()
return ''
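The plain r·s case of the CONCATENATION rule can be sketched on a bare successor map. This is a simplified illustration and not the module's implementation: `find_concat_pair`, `contract`, and `rewrite_concat` are hypothetical helpers, nodes are named by their labels, and the optional variants, the ε-closure, and the repair rules are left out.

```python
def find_concat_pair(succ):
    """Return the first (r, s) with Succ(r) == {s} and Pred(s) == {r}, else None."""
    pred = {}
    for node, targets in succ.items():
        for t in targets:
            pred.setdefault(t, set()).add(node)
    for r, targets in succ.items():
        if r == "src" or len(targets) != 1:
            continue
        (s,) = targets
        if s not in ("sink", r) and pred.get(s) == {r}:
            return r, s
    return None


def contract(succ, r, s):
    """Merge r and s into a single node whose name is the label 'r.s'."""
    merged = f"{r}.{s}"
    out = {}
    for node, targets in succ.items():
        if node in (r, s):
            continue
        # Edges that pointed at r (or s) now point at the merged node
        out[node] = {merged if t in (r, s) else t for t in targets}
    # r's only successor is s, so the merged node inherits s's successors
    out[merged] = {t for t in succ[s] if t not in (r, s)}
    if r in succ[s]:
        out[merged].add(merged)  # a back edge s -> r becomes a self-loop
    return out


def rewrite_concat(succ):
    """Apply the r·s rule until it no longer matches."""
    while (pair := find_concat_pair(succ)) is not None:
        succ = contract(succ, *pair)
    return succ


chain = {"src": {"a"}, "a": {"b"}, "b": {"c"}, "c": {"sink"}}
diamond = {"src": {"a"}, "a": {"b", "c"}, "b": {"sink"}, "c": {"sink"}}
```

On `chain` the rule fires twice and leaves a single node `a.b.c`. On `diamond` it never fires, which is the situation where `rwr0` falls through to the DISJUNCTION rule.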

bex/rwrsq.py Normal file

@ -0,0 +1,31 @@
"""rwr² — Translate k-OA to k-ORE (Algorithm 3, arXiv 1004.2372).
rwr²(G):
1: compute a marking H of G
2: return strip(rwr²₁(H))
"""
import re
from .marking import mark_koa
from .rwr0 import rwr0
def strip(expr):
"""Remove k-ORE markers: a_i → a."""
return re.sub(r'_\d+', '', expr)
def rwr_sq(G):
"""
| Algorithm 3: rwr² |
Require: k-OA G
Ensure: k-ORE r with L(G) ⊆ L(r)
1: H ← marking of G
2: return strip(rwr²₁(H))
"""
H = mark_koa(G)
result = rwr0(H)
if result is None or result == '':
return None
return strip(result)
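The effect of `strip` is easy to check in isolation. The sketch below repeats its regular expression under the name `strip_markers` and adds a hypothetical variant, `strip_last_marker`, to show one caveat: symbols that already end in `_<digits>` (for example a token named `step_2`) lose that suffix too. The variant only helps under the assumption that every symbol in the expression carries a marker.

```python
import re


def strip_markers(expr: str) -> str:
    """Same regex as strip(): drop every '_<digits>' run."""
    return re.sub(r"_\d+", "", expr)


def strip_last_marker(expr: str) -> str:
    """Hypothetical variant: drop only a '_<digits>' run that ends a symbol."""
    return re.sub(r"_\d+(?!\w)", "", expr)


marked = "a_1.(b_1|a_2)+"
print(strip_markers(marked))           # marker suffixes removed
print(strip_markers("step_2_1"))       # the symbol's own '_2' is removed as well
print(strip_last_marker("step_2_1"))   # only the trailing marker is removed
```

Module names such as `mysql_db` are unaffected by either function because the underscore is not followed by digits.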

bex/shrink.py Normal file

@ -0,0 +1,267 @@
"""
shrink: SORE transformation via rewrite rules.
Following Bex et al. 2010 (TWEB): the shrink operator transforms an
automaton (PTA) into a SORE (Single Occurrence Regular Expression)
by repeatedly applying rewrite rules.
The rewrite rules (Bex 2010, Section 4.2):
1. simplify: remove redundant edges, merge parallel paths
2. star_rewrite: replace a self-loop (s → s with label) by label*
3. concat_rewrite: state elimination, s → t → u becomes s → u with label = l1·l2
4. alternation_rewrite: several outgoing edges, s → t1 and s → t2 become s → (t1 | t2)
Each rewrite step is scored by an MDL cost function.
The process is iterative: as long as the MDL decreases, the most profitable
rewrite is applied (priority queue ordered by MDL gain).
"""
import heapq
from .automaton import Automaton
def simplify(automaton):
"""
simplify: removes redundant edges and merges parallel paths.
Following Bex 2010, shrink step 1:
- If two edges (s → t, label1) and (s → t, label2) exist,
replace them with a single edge s → t labeled (label1 | label2)
- Remove unreachable states (no path from the start state)
"""
G = automaton.copy()
# Phase 1: Parallel edges → alternation
processed = set()
for e in list(G.edges):
key = (e['from'], e['to'])
if key in processed:
continue
parallel = [e2 for e2 in G.edges if e2['from'] == key[0] and e2['to'] == key[1]]
if len(parallel) > 1:
labels = list(set(e2['label'] for e2 in parallel))
new_label = f"({'|'.join(labels)})"
for e2 in parallel:
G.remove_edge(e2['from'], e2['to'], e2['label'])
G.add_edge(key[0], key[1], new_label)
processed.add(key)
# Phase 2: Remove unreachable nodes
reachable = set()
stack = [G.start] if G.start is not None else []
while stack:
n = stack.pop()
if n in reachable:
continue
reachable.add(n)
for e in G.outgoing(n):
stack.append(e['to'])
unreachable = G.nodes - reachable
for n in unreachable:
G.nodes.discard(n)
G.edges = [e for e in G.edges if e['from'] != n and e['to'] != n]
G.accepts.discard(n)
return G
def apply_star_rewrite(G, s):
"""
Star rewrite: replaces a self-loop (s → s with label) by label*.
Following Bex 2010, algorithm apply_star_rewrite:
If a state s has a self-loop with label L:
- Remove the self-loop edges
- Re-add a single s → s edge labeled L* (or (L1|L2)* for several labels), so the star is emitted when the regex is exported
"""
loops = [e for e in G.edges if e['from'] == s and e['to'] == s]
if not loops:
return G
new_G = G.copy()
for e in loops:
new_G.remove_edge(e['from'], e['to'], e['label'])
labels = list(set(e['label'] for e in loops))
if len(labels) == 1:
star_label = f"{labels[0]}*"
else:
star_label = f"({'|'.join(labels)})*"
new_G.add_edge(s, s, star_label)
return new_G
def apply_concat_rewrite(G, t):
"""
Concat rewrite (state elimination): eliminates state t.
Following Bex 2010, algorithm apply_concat_rewrite:
If a state t (not start/accept) has exactly one incoming and one outgoing edge:
s → t (label1), t → u (label2) becomes s → u (label1·label2)
Then remove t and replace it with the direct edge.
More generally: for every incoming edge (s → t, l1) and outgoing edge (t → u, l2),
add (s → u, l1·l2), then remove t.
"""
G = G.copy()
incoming = G.incoming(t)
outgoing = G.outgoing(t)
if not incoming and not outgoing:
G.nodes.discard(t)
G.accepts.discard(t)
return G
if t in (G.start, ) or t in G.accepts:
return G
if len(incoming) == 1 and len(outgoing) == 1:
s = incoming[0]['from']
u = outgoing[0]['to']
l1 = incoming[0]['label']
l2 = outgoing[0]['label']
G.remove_edge(s, t, l1)
G.remove_edge(t, u, l2)
G.add_edge(s, u, f"({l1}.{l2})")
G.nodes.discard(t)
G.accepts.discard(t)
return G
has_self_loop = any(e['from'] == t and e['to'] == t for e in G.edges)
if not has_self_loop:
for e_in in incoming:
for e_out in outgoing:
if e_out['to'] == t:
continue
s = e_in['from']
u = e_out['to']
l1 = e_in['label']
l2 = e_out['label']
existing_labels = [e2['label'] for e2 in G.edges
if e2['from'] == s and e2['to'] == u]
new_label = f"({l1}.{l2})"
if new_label not in existing_labels:
G.add_edge(s, u, new_label)
for e in incoming:
G.remove_edge(e['from'], e['to'], e['label'])
for e in outgoing:
if e['to'] != t:
G.remove_edge(e['from'], e['to'], e['label'])
G.nodes.discard(t)
G.accepts.discard(t)
return G
def apply_alternation_rewrite(G, s):
"""
Alternation rewrite: combines several outgoing edges into (l1 | l2).
Following Bex 2010: if s has two edges s → u (label1) and s → v (label2),
and u and v are structurally similar:
- Merge u into v (all edges of u are redirected to v)
- Add a new edge s → v with label = (label1 | label2)
"""
G = G.copy()
outgoing = G.outgoing(s)
if len(outgoing) < 2:
return G
label_set = {}
for e in outgoing:
target = e['to']
if target not in label_set:
label_set[target] = []
label_set[target].append(e['label'])
while len(label_set) >= 2:
targets = list(label_set.keys())
t1, t2 = targets[0], targets[1]
labels1 = label_set[t1]
labels2 = label_set[t2]
for l in labels1:
G.remove_edge(s, t1, l)
for l in labels2:
G.remove_edge(s, t2, l)
new_labels = labels1 + labels2
if t1 == t2:
new_label = f"({'|'.join(new_labels)})"
G.add_edge(s, t1, new_label)
break
G.merge_nodes(t2, t1)
new_label = f"({'|'.join(new_labels)})"
G.add_edge(s, t2, new_label)
del label_set[t1]
label_set[t2] = new_labels
return G
def has_single_accept(G):
return len(G.accepts) == 1
def shrink(automaton, max_iterations=100):
"""
shrink main algorithm: transforms a PTA into a SORE.
Following Bex 2010, algorithm shrink:
Repeat until no rewrite changes the automaton (or max_iterations is reached):
1. simplify(G)
2. For each state s with a self-loop: apply_star_rewrite(G, s)
3. For each state t (not start/accept): apply_concat_rewrite(G, t)
4. For each state s with more than one outgoing edge: apply_alternation_rewrite(G, s)
5. Check determinism (hand over to repair)
"""
G = automaton.copy()
for iteration in range(max_iterations):
prev_edge_count = len(G.edges)
G = simplify(G)
changed = len(G.edges) < prev_edge_count
for node in list(G.nodes):
if G.has_self_loop(node):
G_new = apply_star_rewrite(G, node)
if len(G_new.edges) != len(G.edges):
G = G_new
changed = True
for node in list(G.nodes):
if node == G.start or node in G.accepts:
continue
incoming = G.incoming(node)
outgoing = G.outgoing(node)
if len(incoming) >= 1 and len(outgoing) >= 1:
G_new = apply_concat_rewrite(G, node)
if len(G_new.nodes) < len(G.nodes):
G = G_new
changed = True
for node in list(G.nodes):
if len(G.outgoing(node)) >= 2:
G_new = apply_alternation_rewrite(G, node)
if len(G_new.edges) < len(G.edges):
G = G_new
changed = True
if not changed:
break
return G
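Phase 1 of `simplify` (parallel edges become one alternation edge) can be sketched on a plain list of edge dictionaries. This is an illustration under assumptions, not the `Automaton` API: `merge_parallel_edges` is a hypothetical helper, and it sorts the labels so the result is deterministic, whereas the code above joins them in set order.

```python
from collections import defaultdict


def merge_parallel_edges(edges):
    """Collapse edges that share (from, to) into one edge labeled (l1|l2|...)."""
    grouped = defaultdict(set)
    for e in edges:
        grouped[(e["from"], e["to"])].add(e["label"])
    merged = []
    for (src, dst), labels in grouped.items():
        if len(labels) == 1:
            label = next(iter(labels))
        else:
            label = "(" + "|".join(sorted(labels)) + ")"
        merged.append({"from": src, "to": dst, "label": label})
    return merged


edges = [
    {"from": 0, "to": 1, "label": "b"},
    {"from": 0, "to": 1, "label": "a"},
    {"from": 1, "to": 2, "label": "c"},
]
result = merge_parallel_edges(edges)
```

Sorting is a small design choice that keeps inferred expressions stable between runs, which makes them easier to compare in tests.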

bex/soa.py Normal file

@ -0,0 +1,193 @@
"""SOA — Single Occurrence Automaton (Definition 6, TODS 2010)."""
import copy
from .expr import concat, disj, star, optional
class SOA:
"""
Node-labeled automaton (Definition 6, TODS 2010).
V = {src, sink} ∪ symbol-labeled states.
E ⊆ V × V, unlabeled edges.
A walk src=v₁,v₂,...,vₙ=sink accepts the word lab(v₂)...lab(vₙ₋₁).
States are proper SOREs, pairwise alphabet-disjoint (Definition 10).
"""
def __init__(self):
self._next = 0
self._succ = {}
self._pred = {}
self._label = {}
self.src = self._new()
self.sink = self._new()
def _new(self):
n = self._next
self._next += 1
self._succ[n] = set()
self._pred[n] = set()
self._label[n] = None
return n
def add_state(self, label):
n = self._new()
self._label[n] = label
return n
def add_edge(self, f, t):
self._succ[f].add(t)
self._pred[t].add(f)
def rm_edge(self, f, t):
self._succ[f].discard(t)
self._pred[t].discard(f)
def rm_state(self, n):
if n in (self.src, self.sink):
return
for p in list(self._pred[n]):
self.rm_edge(p, n)
for s in list(self._succ[n]):
self.rm_edge(n, s)
del self._label[n]
del self._succ[n]
del self._pred[n]
def label(self, n):
return self._label.get(n)
def set_label(self, n, lab):
self._label[n] = lab
def succ(self, n):
return set(self._succ.get(n, set()))
def pred(self, n):
return set(self._pred.get(n, set()))
def has_edge(self, f, t):
return t in self._succ.get(f, set())
def states(self):
return [n for n in self._succ if n not in (self.src, self.sink) and self._label.get(n) is not None]
def _pred_plus(self, n):
r = set(self._pred.get(n, set()))
if self._label.get(n) and self._label[n].endswith('+'):
r.add(n)
return r
def _succ_plus(self, n):
r = set(self._succ.get(n, set()))
if self._label.get(n) and self._label[n].endswith('+'):
r.add(n)
return r
def copy(self):
return copy.deepcopy(self)
def accept(self, w):
cur = {self.src}
for sym in w:
nxt = set()
for s in cur:
for t in self._succ.get(s, set()):
if self._label.get(t) == sym:
nxt.add(t)
if not nxt:
return False
cur = nxt
return any(self.sink in self._succ.get(s, set()) for s in cur)
def sink_reachable(self):
seen = set()
q = [self.src]
while q:
s = q.pop()
if s == self.sink:
return True
if s in seen:
continue
seen.add(s)
q.extend(self._succ.get(s, []))
return False
def num_non_special(self):
return sum(1 for n in self._succ if n not in (self.src, self.sink))
def is_final(self):
ns = self.states()
return len(ns) == 1 and self.has_edge(self.src, ns[0]) and self.has_edge(ns[0], self.sink)
def expression(self):
if not self.is_final():
return None
return self._label[self.states()[0]]
def contract(self, r, s, new_label):
"""
State contraction G[r,s ⇒ t] (Definition 11, TODS 2010).
(1) Add t as a new state with label new_label.
(2) Every v ∈ (Pred(r) ∪ Pred(s)) ∖ {r,s} becomes a predecessor of t.
(3) Every w ∈ (Succ(r) ∪ Succ(s)) ∖ {r,s} becomes a successor of t.
(4) Add the loop t→t if r ∈ Succ(s).
(5) Remove r, s and all their edges.
"""
t = self._new()
self._label[t] = new_label
for v in self._pred.get(r, set()) - {r, s}:
self.add_edge(v, t)
for v in self._pred.get(s, set()) - {r, s}:
self.add_edge(v, t)
for w in self._succ.get(r, set()) - {r, s}:
self.add_edge(t, w)
for w in self._succ.get(s, set()) - {r, s}:
self.add_edge(t, w)
if r in self._succ.get(s, set()):
self.add_edge(t, t)
self.rm_state(r)
self.rm_state(s)
return t
def contract_single(self, r, new_label):
"""Single-state substitution G[r ⇒ t] (Definition 11 note)."""
if r in (self.src, self.sink):
return r
t = self._new()
self._label[t] = new_label
for v in self._pred.get(r, set()) - {r}:
self.add_edge(v, t)
for w in self._succ.get(r, set()) - {r}:
self.add_edge(t, w)
if r in self._succ.get(r, set()):
self.add_edge(t, t)
self.rm_state(r)
return t
def epsilon_closure(self):
"""G* (Definition 25, TODS 2010). Add self-loops for + states and ε-transitive closure."""
G = self.copy()
changed = True
while changed:
changed = False
for n in list(G._succ.keys()):
lab = G._label.get(n)
if lab and (lab.endswith('+') or lab.endswith('+?')):
if not G.has_edge(n, n):
G.add_edge(n, n)
changed = True
for n in list(G._succ.keys()):
for m in list(G._succ.get(n, set())):
mlab = G._label.get(m)
if mlab == 'ε':
for mp in list(G._succ.get(m, set())):
if mp != n and not G.has_edge(n, mp):
G.add_edge(n, mp)
changed = True
return G
def __repr__(self):
return f"SOA(nodes={len(self._succ)}, non_special={self.num_non_special()})"

bex/template.py Normal file

@ -0,0 +1,154 @@
"""
template: one-shot YAML template generator.
Converts the inferred k-ORE/SORE/CHARE regular expression back
into a human-readable YAML skeleton for LLM prompts.
The generator produces:
- A YAML skeleton with placeholders
- Comments with cardinality hints (emitted in German):
* # PFLICHT: Genau 1 mal erforderlich (required: exactly once)
* # PFLICHT: 1 oder mehrmals erforderlich (required: one or more times)
* # OPTIONAL: 0 oder 1 mal (darf weggelassen werden) (optional: zero or one time, may be omitted)
* # OPTIONAL: 0 oder mehrmals (optional: zero or more times)
* # WAHLWEISE: alternatives Modul (alternative: choose one module)
"""
def parse_expression(expr):
"""Split a regular expression into its components (tokens)."""
if not expr or expr == 'ε':
return [('empty', 'ε')]
tokens = []
i = 0
while i < len(expr):
if expr[i] == '(':
depth = 1
j = i + 1
while j < len(expr) and depth > 0:
if expr[j] == '(':
depth += 1
elif expr[j] == ')':
depth -= 1
j += 1
group = expr[i:j]
quantifier = ''
if j < len(expr) and expr[j] in '*+?':
quantifier = expr[j]
j += 1
tokens.append(('group', group, quantifier))
i = j
elif expr[i] == '|':
tokens.append(('pipe', '|'))
i += 1
elif expr[i] == '.':
if i + 1 < len(expr) and expr[i + 1] == '.':
tokens.append(('concat', '..'))
i += 2
else:
tokens.append(('concat', '.'))
i += 1
elif expr[i] in '*+?':
if tokens and tokens[-1][0] == 'name':
name, val, _ = tokens[-1]
tokens[-1] = (name, val, expr[i])
i += 1
elif expr[i].isalnum() or expr[i] in '/_-':
j = i
while j < len(expr) and (expr[j].isalnum() or expr[j] in '/_-'):
j += 1
name = expr[i:j]
tokens.append(('name', name, ''))
i = j
else:
i += 1
return tokens
def format_prompt_cardinality(quantifier):
"""Return the (German) cardinality comment for a quantifier."""
mapping = {
'': '# PFLICHT: Genau 1 mal erforderlich',
'+': '# PFLICHT: 1 oder mehrmals erforderlich',
'*': '# OPTIONAL: 0 oder mehrmals',
'?': '# OPTIONAL: 0 oder 1 mal (darf weggelassen werden)',
}
return mapping.get(quantifier, '')
def generate_template(expr, context_key=None, include_header=True):
"""
Generate a one-shot YAML template from a regular expression.
Args:
expr: the inferred expression (string)
context_key: name of the YAML container key (e.g. 'tasks')
include_header: whether to emit the header part (name, hosts)
Returns:
String: YAML skeleton with placeholders and cardinality comments
"""
if not expr or expr in ('', 'ε'):
return "# Keine Struktur inferiert (leere Sequenzen oder keine Beispiele)"
if include_header:
lines = [
"- name: <Name des Plays>",
" hosts: <Ziel-Server> # PFLICHT: Genau 1 mal erforderlich",
]
if context_key:
lines.append(f" {context_key}:")
else:
lines.append(" tasks:")
indent = " "
else:
lines = []
if context_key:
lines.append(f" {context_key}: # Container-Kontext: {context_key}")
else:
lines.append(" tasks:")
indent = " "
tokens = parse_expression(expr)
task_index = 0
skip_until_pipe = False
alternatives = []
in_alternatives = False
i = 0
while i < len(tokens):
token = tokens[i]
if token[0] == 'group':
group_str = token[1]
quantifier = token[2]
card = format_prompt_cardinality(quantifier)
inner_expr = group_str[1:-1]
if '|' in inner_expr:
alts = inner_expr.split('|')
lines.append(f"{indent}# WAHLWEISE (eines auswählen):")
for alt in alts:
alt_clean = alt.strip()
lines.append(f"{indent}# - {alt_clean}: <Parameter für {alt_clean}>")
if card:
lines[-1] = f"{lines[-1]} {card}"
else:
lines.append(f"{indent}- {inner_expr}: <Parameter für {inner_expr}> {card}")
task_index += 1
elif token[0] == 'name':
name = token[1]
quantifier = token[2]
card = format_prompt_cardinality(quantifier)
lines.append(f"{indent}- {name}: <Parameter für {name}> {card}")
task_index += 1
elif token[0] == 'pipe':
pass
i += 1
return '\n'.join(lines) + '\n'
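The quantifier-to-cardinality mapping can be illustrated on a flat expression. The sketch below is a deliberate simplification and not a replacement for `parse_expression`: `describe` and the English `CARDINALITY` table are hypothetical, and the regular expression ignores grouping, so a quantifier after a closing parenthesis is not attached to anything.

```python
import re

# Hypothetical English counterparts of the German cardinality comments
CARDINALITY = {
    "": "exactly once",
    "+": "one or more times",
    "*": "zero or more times",
    "?": "at most once",
}


def describe(expr):
    """Pair each symbol of a flat expression with its cardinality."""
    pairs = re.findall(r"([A-Za-z0-9/_-]+)([*+?]?)", expr)
    return [(name, CARDINALITY[q]) for name, q in pairs]


for name, card in describe("apt.copy+.service?"):
    print(f"- {name}: <parameters>  # {card}")
```

For grouped input such as `(a|b)+` this reports both `a` and `b` as "exactly once", which is why the real tokenizer tracks parenthesis depth.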

bex/tokenizer.py Normal file

@ -0,0 +1,194 @@
"""
YAMLTokenizer: extracts token sequences from Ansible YAML files.
Following Bex 2007/2010, each YAML document is translated into a sequence of
symbols (tokens). For Ansible:
- A playbook → a sequence of module names (apt, service, template, ...)
- include_tasks is emitted as a terminal token; the included file is only
tokenized recursively when resolve_includes=True
- block/rescue/always: emitted as block_start/rescue_start/always_start and
matching *_end marker tokens that wrap the tokens of the nested tasks
The extracted sequences serve as input for automaton construction.
"""
import os
import yaml
# Module names that are recorded as structural tokens
# (based on an analysis of 56+ roles in the project)
MODULE_TOKENS = {
'apt', 'service', 'template', 'copy', 'file', 'command', 'shell',
'get_url', 'uri', 'debug', 'set_fact', 'assert', 'wait_for',
'include_tasks', 'import_tasks', 'import_playbook',
'systemd', 'cron', 'user', 'authorized_key', 'group',
'docker_container', 'docker_volume', 'docker_network', 'docker_image',
'pip', 'npm', 'package',
'lineinfile', 'replace', 'blockinfile',
'stat', 'fetch', 'slurp',
'meta', 'fail', 'pause',
'unarchive', 'archive',
'git', 'hg',
'mysql_db', 'mysql_user',
'postgresql_db', 'postgresql_user',
'certificate', 'openssl',
'known_hosts',
'iptables', 'ufw',
'mount', 'filesystem',
'sysctl',
'ini_file',
'composer',
'make',
'configure',
'pear',
'gem',
'cargo',
}
def is_module_name(key):
return key in MODULE_TOKENS or (isinstance(key, str) and not key.startswith('_'))
class YAMLTokenizer:
def __init__(self, resolve_includes=False):
self.resolve_includes = resolve_includes
self._token_counts = {}
def tokenize_file(self, filepath):
with open(filepath) as f:
content = f.read()
return self.tokenize_string(content, source=filepath)
def tokenize_string(self, content, source='<string>'):
try:
data = yaml.safe_load(content)
except yaml.YAMLError as e:
return []
if data is None:
return []
return self._tokenize(data, source=source)
def _tokenize(self, data, source='<string>', depth=0):
if isinstance(data, list):
return self._tokenize_list(data, source, depth)
elif isinstance(data, dict):
return self._tokenize_dict(data, source, depth)
return []
def _tokenize_list(self, lst, source, depth):
tokens = []
for item in lst:
if isinstance(item, dict):
tokens.extend(self._tokenize_dict(item, source, depth))
elif isinstance(item, str):
tokens.append(item)
return tokens
def _tokenize_dict(self, d, source, depth):
tokens = []
if 'tasks' in d or 'block' in d or 'pre_tasks' in d or 'post_tasks' in d:
task_key = next(k for k in ['pre_tasks', 'tasks', 'post_tasks', 'block'] if k in d)
if task_key == 'block':
tokens.append('block_start')
for item in d.get('block', []):
tokens.extend(self._tokenize_task(item, source, depth + 1))
if 'rescue' in d:
tokens.append('rescue_start')
for item in d['rescue']:
tokens.extend(self._tokenize_task(item, source, depth + 1))
tokens.append('rescue_end')
if 'always' in d:
tokens.append('always_start')
for item in d['always']:
tokens.extend(self._tokenize_task(item, source, depth + 1))
tokens.append('always_end')
tokens.append('block_end')
else:
for item in d.get(task_key, []):
tokens.extend(self._tokenize_task(item, source, depth + 1))
elif 'hosts' in d:
tokens.append('play_start')
for item in d.get('tasks', []):
tokens.extend(self._tokenize_task(item, source, depth + 1))
tokens.append('play_end')
elif 'roles' in d:
for role in d.get('roles', []):
tokens.append(f"role:{role if isinstance(role, str) else list(role.keys())[0]}")
elif 'handlers' in d:
tokens.append('handlers_start')
for item in d.get('handlers', []):
tokens.extend(self._tokenize_task(item, source, depth + 1))
tokens.append('handlers_end')
elif 'name' in d and not any(k in d for k in ['tasks', 'block', 'hosts']):
tokens.extend(self._tokenize_task(d, source, depth))
return tokens
def _tokenize_task(self, task, source, depth):
if not isinstance(task, dict):
return []
tokens = []
if 'include_tasks' in task or 'import_tasks' in task:
key = 'include_tasks' if 'include_tasks' in task else 'import_tasks'
tokens.append(key)
if self.resolve_includes:
inc_path = task[key]
if not os.path.isabs(inc_path):
base = os.path.dirname(source) if source != '<string>' else '.'
inc_path = os.path.join(base, inc_path)
if os.path.exists(inc_path):
tokens.extend(self.tokenize_file(inc_path))
return tokens
if 'import_playbook' in task:
tokens.append('import_playbook')
return tokens
if 'block' in task:
tokens.append('block_start')
for item in task.get('block', []):
tokens.extend(self._tokenize_task(item, source, depth))
if 'rescue' in task:
tokens.append('rescue_start')
for item in task['rescue']:
tokens.extend(self._tokenize_task(item, source, depth))
tokens.append('rescue_end')
if 'always' in task:
tokens.append('always_start')
for item in task['always']:
tokens.extend(self._tokenize_task(item, source, depth))
tokens.append('always_end')
tokens.append('block_end')
return tokens
if 'name' in task:
module_name = None
for key in task:
if key == 'name':
continue
if is_module_name(key) and isinstance(task[key], (str, dict, list, bool, int)):
module_name = key
break
if module_name:
tokens.append(module_name)
self._token_counts[module_name] = self._token_counts.get(module_name, 0) + 1
elif 'ansible.builtin' in str(task):
for key in task:
if '.' in str(key):
module_name = str(key).split('.')[-1]
tokens.append(module_name)
break
return tokens
def get_statistics(self):
return dict(sorted(self._token_counts.items(), key=lambda x: -x[1]))

bex/twotinf.py Normal file

@ -0,0 +1,35 @@
"""2T-INF — Build SOA from 2-grams (Algorithm 1, TODS 2010)."""
from .soa import SOA
def build_soa(sequences):
"""
| Algorithm 1: 2T-INF |
Input: finite set of sample strings S
Output: SOA G such that S ⊆ L(G)
For each string a₁...aₙ in S:
add edges (src, a₁), (a₁, a₂), ..., (aₙ, sink)
"""
G = SOA()
symbol_states = {}
for seq in sequences:
if not seq:
if not G.has_edge(G.src, G.sink):
G.add_edge(G.src, G.sink)
continue
for i, token in enumerate(seq):
if token not in symbol_states:
symbol_states[token] = G.add_state(token)
if i == 0:
G.add_edge(G.src, symbol_states[token])
if i == len(seq) - 1:
G.add_edge(symbol_states[token], G.sink)
if i + 1 < len(seq):
nxt = seq[i + 1]
if nxt not in symbol_states:
symbol_states[nxt] = G.add_state(nxt)
G.add_edge(symbol_states[token], symbol_states[nxt])
return G
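Because every symbol owns exactly one state, an SOA is fully described by its edge set, and 2T-INF reduces to collecting adjacent pairs. The sketch below shows this on plain tuples. It is not the `SOA` class: `two_t_inf` and `accepts` are hypothetical helpers, and it assumes no symbol is literally named "src" or "sink".

```python
def two_t_inf(samples):
    """Collect the 2-gram edges (src, a1), (a1, a2), ..., (an, sink) of every sample."""
    edges = set()
    for word in samples:
        path = ["src", *word, "sink"]
        edges.update(zip(path, path[1:]))
    return edges


def accepts(edges, word):
    """A word is accepted iff every adjacent pair on its src..sink path is an edge."""
    path = ["src", *word, "sink"]
    return all(pair in edges for pair in zip(path, path[1:]))


edges = two_t_inf([["a", "b"], ["b", "c"]])
print(accepts(edges, ["a", "b"]))        # a sample word
print(accepts(edges, ["a", "b", "c"]))   # not a sample, but built from sample 2-grams
print(accepts(edges, ["c"]))             # 'c' never starts a sample
```

The second call illustrates why the guarantee is S ⊆ L(G) rather than equality: the automaton accepts any word whose 2-grams were all observed, including words that were never in the sample.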

bex/yaml_to_seq.py Normal file

@ -0,0 +1,81 @@
"""Convert YAML files to key-path sequences for BEX grammar inference."""
from pathlib import Path
import yaml
def yaml_to_keypath_sequence(data, prefix=""):
"""Convert parsed YAML data to a sequence of key paths (DFS traversal).
Each leaf (scalar) emits its full key path as a symbol.
Lists use a generic `[]` marker (no indices).
Values are NOT included; only key paths are emitted.
"""
seq = []
if isinstance(data, dict):
for key, value in data.items():
path = f"{prefix}.{key}" if prefix else key
if isinstance(value, (dict, list)):
seq.extend(yaml_to_keypath_sequence(value, path))
else:
seq.append(path)
elif isinstance(data, list):
for item in data:
list_prefix = f"{prefix}[]" if prefix else "[]"
if isinstance(item, (dict, list)):
seq.extend(yaml_to_keypath_sequence(item, list_prefix))
else:
seq.append(list_prefix)
return seq
def yaml_file_to_sequence(filepath):
"""Load a YAML file and convert to a key-path sequence."""
with open(filepath) as f:
data = yaml.safe_load(f)
if data is None:
return []
return yaml_to_keypath_sequence(data)
def is_vault_file(filepath):
"""Check if a file is an Ansible vault file (encrypted)."""
try:
with open(filepath) as f:
first = f.read(100)
return '$ANSIBLE_VAULT' in first or first.startswith('!vault |')
except Exception:
return False
def collect_all_sequences(root_dir=".", include_vault=False):
"""Collect key-path sequences from all `*.yml` files under root_dir.
Returns:
list of (filepath, sequence) tuples.
"""
results = []
for path in sorted(Path(root_dir).rglob("*.yml")):
parts = path.parts
if any(d in parts for d in ('node_modules', '.venv', '__pycache__', '.git')):
continue
skippable = 'vault' in path.name
if not include_vault and (skippable or is_vault_file(path)):
continue
try:
seq = yaml_file_to_sequence(path)
if seq:
results.append((path, seq))
except Exception as e:
print(f" SKIP {path}: {e}")
return results
def sequences_to_crx(result_list):
"""Run CRX on collected sequences."""
from .crx import CRX
sequences = [seq for _, seq in result_list]
if not sequences:
return 'ε'
crx = CRX()
return crx.infer(sequences)
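A short run on already-parsed data shows what the key-path symbols look like. The function below is copied from this file so the sketch is self-contained; the `config` dictionary is a made-up example.

```python
def yaml_to_keypath_sequence(data, prefix=""):
    """DFS over parsed YAML; every scalar leaf emits its key path."""
    seq = []
    if isinstance(data, dict):
        for key, value in data.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, (dict, list)):
                seq.extend(yaml_to_keypath_sequence(value, path))
            else:
                seq.append(path)
    elif isinstance(data, list):
        for item in data:
            list_prefix = f"{prefix}[]" if prefix else "[]"
            if isinstance(item, (dict, list)):
                seq.extend(yaml_to_keypath_sequence(item, list_prefix))
            else:
                seq.append(list_prefix)
    return seq


# Hypothetical parsed document
config = {"a": {"b": 1, "c": [1, {"d": 2}]}, "e": None, "f": {}}
print(yaml_to_keypath_sequence(config))
```

Two properties are worth noting: a `None` value counts as a scalar leaf and emits its path, while an empty mapping or list emits nothing, so the key `f` does not appear in the sequence.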

papers/paper_arxiv2010.txt Normal file

File diff suppressed because it is too large

papers/paper_tods2010.txt Normal file

File diff suppressed because it is too large

pyproject.toml Normal file

@ -0,0 +1,13 @@
[build-system]
requires = ["setuptools>=68.0"]
build-backend = "setuptools.build_meta"
[project]
name = "grammar-inference-engine"
version = "0.1.0"
description = "BEX-based grammar inference: learn regular expression patterns from example sequences"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
"PyYAML>=6.0",
]

requirements.txt Normal file

@ -0,0 +1,5 @@
# Core
PyYAML>=6.0
# Tests
pytest>=7.0

tests/test_bex.py Normal file

@ -0,0 +1,420 @@
"""Tests for BEX paper algorithm implementations."""
import sys
from pathlib import Path
# Make the repository root importable regardless of where the tests are run from
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from bex.soa import SOA
from bex.twotinf import build_soa
from bex.rwr0 import rwr0
from bex.crx import CRX
from bex.idregex import is_deterministic, idregex
from bex.expr import concat, disj, star, optional, alphabet, strip_k
from bex.koa import KOA, build_complete_koa
from bex.marking import mark_koa
from bex.rwrsq import rwr_sq, strip
from bex.ikoa import ikoa
def test_soa_basics():
G = SOA()
a = G.add_state('a')
b = G.add_state('b')
G.add_edge(G.src, a)
G.add_edge(a, b)
G.add_edge(b, G.sink)
assert G.accept(['a', 'b'])
assert not G.accept(['a'])
assert not G.accept(['b'])
assert not G.accept(['a', 'b', 'c'])
print(" PASS test_soa_basics")
def test_soa_contract():
G = SOA()
a = G.add_state('a')
b = G.add_state('b')
G.add_edge(G.src, a)
G.add_edge(a, b)
G.add_edge(b, G.sink)
G.contract(a, b, concat('a', 'b'))
assert G.is_final()
assert G.expression() == 'a.b'
print(" PASS test_soa_contract")
def test_soa_epsilon_closure():
G = SOA()
a = G.add_state('a')
b = G.add_state('a+')
G.add_edge(G.src, a)
G.add_edge(a, b)
G.add_edge(b, G.sink)
G.add_edge(b, b)
Gs = G.epsilon_closure()
assert Gs.has_edge(b, b)
print(" PASS test_soa_epsilon_closure")
def test_twotinf():
seqs = [['a', 'b', 'c'], ['a', 'c']]
G = build_soa(seqs)
assert G.accept(['a', 'b', 'c'])
assert G.accept(['a', 'c'])
assert not G.accept(['b', 'c'])
print(" PASS test_twotinf")
def test_rwr0_concat():
    G = SOA()
    a = G.add_state('a')
    b = G.add_state('b')
    G.add_edge(G.src, a)
    G.add_edge(a, b)
    G.add_edge(b, G.sink)
    result = rwr0(G)
    assert result == 'a.b', f"Expected 'a.b', got {result}"
    print(" PASS test_rwr0_concat")


def test_rwr0_disj():
    G = SOA()
    a = G.add_state('a')
    b = G.add_state('b')
    G.add_edge(G.src, a)
    G.add_edge(G.src, b)
    G.add_edge(a, G.sink)
    G.add_edge(b, G.sink)
    result = rwr0(G)
    assert result == '(a|b)', f"Expected '(a|b)', got {result}"
    print(" PASS test_rwr0_disj")


def test_rwr0_iteration():
    G = SOA()
    a = G.add_state('a')
    G.add_edge(G.src, a)
    G.add_edge(a, G.sink)
    G.add_edge(a, a)
    result = rwr0(G)
    assert result == 'a+', f"Expected 'a+', got {result}"
    print(" PASS test_rwr0_iteration")


def test_rwr0_optional():
    G = SOA()
    a = G.add_state('a')
    G.add_edge(G.src, a)
    G.add_edge(a, G.sink)
    result = rwr0(G)
    # Single state src→a→sink: language is {a}, not {a,ε}
    assert result == 'a', f"Expected 'a', got {result}"
    print(" PASS test_rwr0_optional")


def test_rwr0_empty():
    G = SOA()
    result = rwr0(G)
    assert result == '', f"Expected '', got {result}"
    print(" PASS test_rwr0_empty")


def test_rwr0_epsilon():
    G = SOA()
    G.add_edge(G.src, G.sink)
    result = rwr0(G)
    assert result == 'ε', f"Expected 'ε', got {result}"
    print(" PASS test_rwr0_epsilon")


def test_rwr0_complex_a():
    # {abc, ab, ac} is NOT a SORE language (c appears in two roles)
    G = build_soa([['a', 'b', 'c'], ['a', 'b'], ['a', 'c']])
    result = rwr0(G)
    assert result == '', f"Expected ∅ for non-SORE, got {result}"
    print(" PASS test_rwr0_complex_a: ∅ (non-SORE)")


def test_rwr0_disj_concat():
    """a·b and a·c share Pred/Succ for b,c after processing."""
    G = build_soa([['a', 'b'], ['a', 'c']])
    result = rwr0(G)
    assert result is not None
    print(f" PASS test_rwr0_disj_concat: {result}")
def test_crx_simple():
    crx = CRX()
    result = crx.infer([['a', 'b'], ['a', 'b', 'c']])
    assert result is not None and result != ''
    assert 'a' in result
    assert 'b' in result
    print(f" PASS test_crx_simple: {result}")


def test_crx_example():
    """Example from TODS paper: S = {abccde, cccad, bfegg, bfehi}"""
    crx = CRX()
    S = [
        ['a', 'b', 'c', 'c', 'd', 'e'],
        ['c', 'c', 'c', 'a', 'd'],
        ['b', 'f', 'e', 'g', 'g'],
        ['b', 'f', 'e', 'h', 'i'],
    ]
    result = crx.infer(S)
    assert result is not None
    assert '(' in result  # should have disjunction factors
    print(f" PASS test_crx_example: {result}")


def test_crx_cycle_class():
    """Symbols a,b,c form a cycle in S = {abc, bca, cab}."""
    crx = CRX()
    S = [['a', 'b', 'c'], ['b', 'c', 'a'], ['c', 'a', 'b']]
    result = crx.infer(S)
    assert result is not None
    assert 'a' in result and 'b' in result and 'c' in result
    print(f" PASS test_crx_cycle_class: {result}")


def test_determinism_check():
    assert is_deterministic('a.b')
    assert is_deterministic('a+')
    assert is_deterministic('(a|b)')
    assert not is_deterministic('(a|a)')
    print(" PASS test_determinism_check")


def test_marking():
    G = KOA(k=2)
    a1 = G.add_state('a_1')
    a2 = G.add_state('a_2')
    G.add_edge(G.src, a1)
    G.add_edge(a1, a2)
    G.add_edge(a2, G.sink)
    H = mark_koa(G)
    assert H.label(a1) == 'a_1'
    assert H.label(a2) == 'a_2'
    assert H.accept(['a_1', 'a_2'])
    print(" PASS test_marking")


def test_strip():
    assert strip('a_1.b_1') == 'a.b'
    assert strip('(a_1|b_1)+') == '(a|b)+'
    print(" PASS test_strip")
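# Illustrative sketch only, not the bex.rwrsq.strip implementation: one way
# the marker stripping checked in test_strip could be done with a regular
# expression. It assumes markers are always a trailing `_<digits>` suffix on a
# symbol, so names such as `set_fact` or `command_v2` are left untouched.
# The names below are hypothetical.
import re

# `_` followed by digits, not followed by another identifier character.
SKETCH_MARKER = re.compile(r'_\d+(?![A-Za-z0-9_])')


def sketch_strip_markers(expr):
    """Remove `_k` occurrence markers from every symbol in an expression."""
    return SKETCH_MARKER.sub('', expr)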
def test_expr_utils():
    assert concat('a', 'b') == 'a.b'
    assert disj('a', 'b') == '(a|b)'
    assert star('a') == 'a+'
    assert optional('a') == 'a?'
    assert optional('a.b') == '(a.b)?'
    assert alphabet('a.b') == {'a', 'b'}
    assert alphabet('(a|b)+') == {'a', 'b'}
    assert strip_k('a_1') == 'a'
    print(" PASS test_expr_utils")


def test_idregex_deterministic():
    """iDRegEx should produce a deterministic expression for simple data."""
    seqs = [['a', 'b'], ['a'], ['a', 'b', 'c']]
    result = idregex(seqs, kmax=2, N=2)
    if result is None:
        print(" SKIP test_idregex_deterministic (returned None)")
        return
    assert is_deterministic(result), f"Non-deterministic: {result}"
    print(f" PASS test_idregex_deterministic: {result}")


def test_complete_koa():
    G, states = build_complete_koa([['a', 'b'], ['a']], k=2)
    assert G.count_symbol('a') == 2
    assert G.count_symbol('b') == 2
    assert G.has_edge(G.src, G.sink)
    print(" PASS test_complete_koa")
# ── Integration tests with real Ansible task data ──
def test_integration_quartz_deploy():
    """Simple linear sequence: all tasks always in the same order."""
    tasks = ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for']
    seqs = [list(tasks), list(tasks)]
    crx = CRX()
    result = crx.infer(seqs)
    assert result is not None
    assert all(t in result for t in tasks)
    print(f" PASS quartz_deploy: {result}")


def test_integration_validate_system():
    """Repeated shell/debug pairs (one to three rounds)."""
    seqs = [
        ['shell', 'debug', 'shell', 'debug'],
        ['shell', 'debug', 'shell', 'debug', 'shell', 'debug'],
        ['shell', 'debug'],
    ]
    crx = CRX()
    result = crx.infer(seqs)
    assert result is not None
    assert 'shell' in result and 'debug' in result
    print(f" PASS validate_system: {result}")


def test_integration_docker_detect_branch():
    """Branching: docker compose v2 check or v1 fallback."""
    seqs = [
        ['file', 'template', 'command_v2', 'set_fact', 'shell', 'wait_for'],
        ['file', 'template', 'command_v1', 'set_fact', 'shell', 'wait_for'],
    ]
    crx = CRX()
    result = crx.infer(seqs)
    assert result is not None
    assert 'file' in result and 'template' in result and 'shell' in result
    print(f" PASS docker_detect: {result}")


def test_integration_firewall_gating():
    """Conditional firewall rule sequence (gated)."""
    seqs = [
        ['assert', 'file', 'template', 'shell', 'wait_for'],
        ['assert', 'file', 'template', 'command_fw', 'command_fw', 'shell', 'wait_for'],
        ['assert', 'file', 'template', 'command_fw', 'shell', 'wait_for'],
    ]
    crx = CRX()
    result = crx.infer(seqs)
    assert result is not None
    assert 'assert' in result and 'file' in result
    print(f" PASS firewall_gating: {result}")
def test_integration_idregex_linear():
    """iDRegEx on simple linear sequences."""
    seqs = [
        ['assert', 'file', 'template', 'command', 'set_fact', 'shell', 'wait_for'],
        ['assert', 'file', 'template', 'command', 'set_fact', 'shell'],
    ]
    # Let exceptions propagate: run_all() reports them as failures. Catching
    # them here would make the runner count a crashed test as passed.
    result = idregex(seqs, kmax=2, N=3)
    if not result:
        print(" SKIP idregex_linear (no expression returned)")
        return
    assert is_deterministic(result), f"Non-deterministic: {result}"
    print(f" PASS idregex_linear: {result}")


def test_integration_ikoa_linear():
    """iKoa + rwr² on simple linear sequences."""
    # ikoa and rwr_sq are already imported at module level.
    seqs = [
        ['assert', 'file', 'template', 'command', 'set_fact', 'shell', 'wait_for'],
        ['assert', 'file', 'template', 'command', 'set_fact', 'shell'],
    ]
    G = ikoa(seqs, k=3)
    if G is None:
        print(" SKIP ikoa_linear (returned None)")
        return
    expr = rwr_sq(G)
    assert expr is not None
    print(f" PASS ikoa_linear: {expr}")
def test_integration_backup_restic():
    """Sequence with loop (systemd enable)."""
    prefix = ['package', 'assert', 'file'] + ['template'] * 6
    seqs = [
        prefix + ['systemd'] * 3,
        prefix + ['systemd'],
    ]
    crx = CRX()
    result = crx.infer(seqs)
    assert result is not None
    print(f" PASS backup_restic: {result}")
def run_all():
    tests = [
        test_soa_basics,
        test_soa_contract,
        test_soa_epsilon_closure,
        test_twotinf,
        test_rwr0_concat,
        test_rwr0_disj,
        test_rwr0_iteration,
        test_rwr0_optional,
        test_rwr0_empty,
        test_rwr0_epsilon,
        test_rwr0_complex_a,
        test_rwr0_disj_concat,
        test_crx_simple,
        test_crx_example,
        test_crx_cycle_class,
        test_determinism_check,
        test_marking,
        test_strip,
        test_expr_utils,
        test_idregex_deterministic,
        test_complete_koa,
        test_integration_quartz_deploy,
        test_integration_validate_system,
        test_integration_docker_detect_branch,
        test_integration_firewall_gating,
        test_integration_idregex_linear,
        test_integration_ikoa_linear,
        test_integration_backup_restic,
    ]
    passed = 0
    failed = 0
    for t in tests:
        try:
            t()
            passed += 1
        except Exception as e:
            print(f" FAIL {t.__name__}: {e}")
            failed += 1
    print(f"\n{passed} passed, {failed} failed")
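# Illustrative sketch only, not existing bex behaviour: run_all() above counts
# a test that prints SKIP and returns as passed. A runner could report skips
# separately if tests raised a dedicated exception instead. SketchSkip and
# sketch_run are hypothetical names introduced for this sketch.
class SketchSkip(Exception):
    """Raised by a test that cannot produce a verdict."""


def sketch_run(tests):
    """Run each test and return (passed, failed, skipped) counts."""
    passed = failed = skipped = 0
    for t in tests:
        try:
            t()
        except SketchSkip as e:
            # Must be caught before the generic handler below.
            print(f" SKIP {t.__name__}: {e}")
            skipped += 1
        except Exception as e:
            print(f" FAIL {t.__name__}: {e}")
            failed += 1
        else:
            passed += 1
    return passed, failed, skipped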
if __name__ == '__main__':
    run_all()