Initial commit: BEX-based grammar inference engine
- CRX: direct CHARE inference (Algorithm 7, TODS 2010)
- iDRegEx: k-ORE inference (Algorithm 4, arXiv 2010)
- RWR₀: SORE repair (Algorithm 6, TODS 2010)
- rwr²: k-ORE extraction (Algorithm 3, arXiv 2010)
- SOA, k-OA, iKoa, 2T-INF, Baum-Welch
- Ansible role grammar adapter
- Generic YAML key-path converter
- 28 tests, all passing
commit 7c00c6713d
33 changed files with 8928 additions and 0 deletions
8
.gitignore
vendored
Normal file
@@ -0,0 +1,8 @@
__pycache__/
*.pyc
.env
.venv
venv/
*.egg-info/
dist/
build/
45
AGENTS.md
Normal file
@@ -0,0 +1,45 @@
# Grammar Inference Engine: Agent Guide

## Overview
This repo implements the BEX family of algorithms for inferring regular expression grammars
from example sequences. Use it whenever you need to discover the pattern behind a set of
strings or structured sequences.

## Quick Start for Agents

```python
# Fast pattern inference
from bex.crx import CRX
g = CRX().infer([['a','b','c'], ['a','b'], ['a','c']])  # a.b?.c?

# Probabilistic k-ORE inference (handles noise better)
from bex.idregex import idregex
g = idregex([['a','b','c'], ['a','b'], ['a','c']], kmax=2, N=3)
```

## Use Cases
1. **Ansible role patterns**: extract module sequences from tasks/main.yml, learn per-category grammars
2. **Log analysis**: find common patterns in event sequences
3. **API call patterns**: learn the typical order of API operations
4. **Configuration structure**: discover the schema behind YAML files
5. **Workflow mining**: extract the typical task flow from process logs

## Architecture

Two inference pipelines:

| Pipeline | When to use |
|----------|-------------|
| CRX (fast) | Many examples, need speed, CHARE output |
| iDRegEx (robust) | Few/noisy examples, need probabilistic handling |

## Running Tests
```bash
python tests/test_bex.py
```

## MCP Roadmap
- [ ] Standalone MCP server wrapping CRX + iDRegEx
- [ ] Tool: `infer_grammar(sequences, method="crx")`
- [ ] Tool: `ansible_role_grammar(roles_dir)`
- [ ] Tool: `yaml_to_sequences(yaml_path)`
132
README.md
Normal file
@@ -0,0 +1,132 @@
# Grammar Inference Engine

Infer **regular expression grammars** from example sequences using the BEX family of algorithms. Given a set of example sequences (strings over some alphabet), the engine learns a compact regular expression that describes the general pattern.

## Quick Start

```bash
pip install pyyaml
python -m bex
```

```python
from bex.crx import CRX

seqs = [
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell'],
]
crx = CRX()
grammar = crx.infer(seqs)
print(grammar)
# file.template.docker_image.command.set_fact.shell.wait_for?
```

## Algorithms

| Algorithm | What it learns | Paper | Use case |
|-----------|---------------|-------|----------|
| **CRX** | CHAREs (single-pass, deterministic) | TODS 2010 §6 | Fast inference from many sequences |
| **iDRegEx** | k-OREs (probabilistic, Baum-Welch) | arXiv 2010 | Handles noise, learns from few examples |
| **RWR₀** | SOREs (iterative repair) | TODS 2010 §5.2 | Builds regex from a single automaton |
| **rwr²** | k-ORE from k-OA | arXiv 2010 | Post-processing for k-ORE extraction |

### Pipeline 1: Direct CHARE Inference (fast)

```
Example sequences → CRX → CHARE grammar
```
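
The last stage of CRX, as implemented in `bex/crx.py` in this commit, picks a quantifier for each factor by counting how often the factor's symbols occur in every example. The sketch below restates only that counting rule. `chain_factor` is an illustrative name that is not part of the package, and the factor order is supplied by hand here, whereas CRX derives it from the examples.

```python
def chain_factor(symbols, sample):
    """Render one CHARE factor for a set of symbols, given the example sequences."""
    counts = [sum(1 for tok in seq if tok in symbols) for seq in sample]
    body = "+".join(sorted(symbols))
    if len(symbols) > 1:
        body = "(" + body + ")"
    if all(c == 1 for c in counts):
        return body            # exactly once in every example
    if all(c <= 1 for c in counts):
        return body + "?"      # at most once
    if all(c >= 1 for c in counts):
        return body + "+"      # always present, repeated in at least one example
    return body + "+?"         # missing in some examples, repeated in others


sample = [
    ["file", "copy", "unarchive", "unarchive", "command"],
    ["file", "copy", "unarchive", "command"],
]
order = ["file", "copy", "unarchive", "command"]  # assumed known for this sketch
print(".".join(chain_factor({s}, sample) for s in order))
# file.copy.unarchive+.command
```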

### Pipeline 2: Probabilistic k-ORE Inference (robust)

```
Example sequences → Complete k-OA → Baum-Welch (EM)
                  → Disambiguate → Prune → rwr² → k-ORE grammar
```
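
The Baum-Welch step relies on the forward algorithm to score each example under the current transition probabilities. The following is a minimal sketch of that scoring for an automaton whose states emit their own label, written independently of the package. The function name, the `trans`/`label` dictionaries, and the state names are assumptions made for illustration.

```python
def sequence_probability(trans, label, seq, src="src", sink="sink"):
    """Forward algorithm: P(seq) for an automaton whose states emit their own label."""
    alpha = {src: 1.0}                      # probability mass per state so far
    for sym in seq:
        nxt = {}
        for state, mass in alpha.items():
            for succ, p in trans.get(state, {}).items():
                if label.get(succ) == sym:  # only states labelled `sym` can emit it
                    nxt[succ] = nxt.get(succ, 0.0) + mass * p
        alpha = nxt
    # A word counts only if the automaton then moves on to the sink.
    return sum(mass * trans.get(s, {}).get(sink, 0.0) for s, mass in alpha.items())


# Toy automaton over {a, b}: a_1 is followed by b_1 or by the end of the word.
trans = {
    "src": {"a_1": 1.0},
    "a_1": {"b_1": 0.25, "sink": 0.75},
    "b_1": {"sink": 1.0},
}
label = {"a_1": "a", "b_1": "b"}
print(sequence_probability(trans, label, ["a"]))        # 0.75
print(sequence_probability(trans, label, ["a", "b"]))   # 0.25
```

The step into the sink is part of the word's probability, so an EM update built on this score also has to count that final transition.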

## Architecture

```
bex/
├── crx.py           # CRX: direct CHARE inference (Algorithm 7, TODS)
├── idregex.py       # iDRegEx: k-ORE inference (Algorithm 4, arXiv)
├── rwr0.py          # RWR₀: SORE repair (Algorithm 6, TODS)
├── rwrsq.py         # rwr²: k-ORE extraction (Algorithm 3, arXiv)
├── soa.py           # SOA: Single Occurrence Automaton core
├── koa.py           # k-OA: k-Occurrence Automaton
├── ikoa.py          # iKoa: k-OA inference (Algorithm 1, arXiv)
├── twotinf.py       # 2T-INF: 2-testable inference (Algorithm 1, TODS)
├── baum_welch.py    # Baum-Welch EM training for k-OA
├── expr.py          # Expression utilities (concat, disj, star, strip)
├── marking.py       # State marking for determinism
├── yaml_to_seq.py   # Generic YAML → key-path sequence converter
├── role_grammar.py  # Ansible role → module-sequence extractor
└── ...
```
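
As a rough illustration of the `twotinf.py` entry: 2T-INF is commonly described as recording which symbols start a word, which symbols end one, and which pairs of symbols occur next to each other. The sketch below does that with plain sets. The function name and return shape are assumptions and need not match the package's `build_soa`.

```python
def two_t_inf(sample):
    """Collect the initial symbols, final symbols and adjacent pairs of a sample."""
    initial, final, edges = set(), set(), set()
    for seq in sample:
        if not seq:
            continue                      # an empty word contributes no symbols
        initial.add(seq[0])
        final.add(seq[-1])
        edges.update(zip(seq, seq[1:]))   # adjacent pairs (2-grams)
    return initial, final, edges


sample = [["a", "b", "c"], ["a", "b"], ["a", "c"]]
initial, final, edges = two_t_inf(sample)
print(sorted(initial), sorted(final), sorted(edges))
# ['a'] ['b', 'c'] [('a', 'b'), ('a', 'c'), ('b', 'c')]
```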

## Domain: Ansible Role Grammar

The engine includes a domain adapter for Ansible roles. It extracts module names from `tasks/main.yml` files and learns per-category grammars:

```bash
python -c "
from bex.role_grammar import collect_all_role_sequences, learn_grammar
all_roles, by_category = collect_all_role_sequences('path/to/roles')
for cat, items in sorted(by_category.items()):
    seqs = [s for _, s in items]
    print(f'{cat}: {learn_grammar(seqs)}')
"
```
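
How module names are pulled out of a role is not shown in this README. One plausible approach, sketched under that assumption, is to take the first key of each task that is not a generic task keyword. `TASK_KEYWORDS` and `module_sequence` are hypothetical names, and the keyword set is deliberately incomplete. The tasks are given as already-parsed Python data so that the example runs without PyYAML.

```python
# Hypothetical, incomplete set of generic task keywords (these are not modules).
TASK_KEYWORDS = {
    "name", "when", "register", "become", "become_user", "tags", "loop",
    "notify", "vars", "args", "ignore_errors", "changed_when", "failed_when",
    "delegate_to", "environment", "no_log", "until", "retries", "delay",
}


def module_sequence(tasks):
    """Return the module name of each task, in task order."""
    sequence = []
    for task in tasks:
        for key in task:
            if key in TASK_KEYWORDS or key.startswith("with_"):
                continue
            # Drop a collection prefix: ansible.builtin.copy -> copy
            sequence.append(key.rsplit(".", 1)[-1])
            break
    return sequence


tasks = [
    {"name": "Create directory",
     "ansible.builtin.file": {"path": "/opt/app", "state": "directory"}},
    {"name": "Render config",
     "template": {"src": "app.j2", "dest": "/opt/app/app.conf"},
     "notify": "restart"},
    {"shell": "systemctl is-active app", "register": "status", "changed_when": False},
]
print(module_sequence(tasks))  # ['file', 'template', 'shell']
```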

### Example Output

```
── restore (2 roles) ──
Grammar: file.copy.unarchive+.command

── validate (5 roles) ──
Grammar: hosts?.shell?.(copy+debug+fail+set_fact+uri)+?

── configure (4 roles) ──
Grammar: (assert+debug+set_fact+uri)+?.include_role?
```

**Grammar notation:**
- `a.b`: `a` followed by `b` (concatenation)
- `(a+b)`: either `a` or `b` (disjunction)
- `r?`: zero or one (optional)
- `r+`: one or more (iteration)
- `r+?`: zero or more (the factor is missing in some examples and repeated in others)
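
To check sequences against a grammar written in this notation, one option is to translate it into a Python regular expression. The sketch below is not part of the package. `grammar_to_regex` and `matches` are illustrative names, and the translation assumes that symbols are plain identifiers and that disjunctions are always parenthesised, as in the example output. Note that `r+?` has to become `*`, because `+?` means a lazy one-or-more in Python's `re` syntax.

```python
import re

TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|[().+?]")
POSTFIX_CONTEXT = {".", ")", "+", "?"}   # tokens that cannot start an operand


def grammar_to_regex(grammar):
    """Translate the grammar notation into a compiled Python regex."""
    toks = TOKEN.findall(grammar)
    out = []
    i = 0
    while i < len(toks):
        tok = toks[i]
        nxt = toks[i + 1] if i + 1 < len(toks) else None
        if tok == ".":
            pass                                  # concatenation is implicit
        elif tok == "(":
            out.append("(?:")
        elif tok in (")", "?"):
            out.append(tok)
        elif tok == "+":
            if nxt is not None and nxt not in POSTFIX_CONTEXT:
                out.append("|")                   # infix: disjunction
            elif nxt == "?":
                out.append("*")                   # r+? means zero or more
                i += 1                            # consume the '?'
            else:
                out.append("+")                   # postfix: one or more
        else:
            out.append("(?:" + re.escape(tok) + ",)")
        i += 1
    return re.compile("".join(out))


def matches(grammar, sequence):
    """Check a token sequence against a grammar; tokens are encoded as 'a,b,c,'."""
    encoded = "".join(tok + "," for tok in sequence)
    return grammar_to_regex(grammar).fullmatch(encoded) is not None


print(matches("file.copy.unarchive+.command",
              ["file", "copy", "unarchive", "unarchive", "command"]))        # True
print(matches("file.copy.unarchive+.command", ["file", "copy", "command"]))  # False
```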

## Domain: Generic YAML

The engine can convert any YAML file into key-path sequences for grammar inference:

```python
from bex.yaml_to_seq import yaml_file_to_sequence, sequences_to_crx

grammar = sequences_to_crx(yaml_file_to_sequence('config.yml'))
```
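
What a key-path sequence looks like is not spelled out here. As an assumption for illustration only, the sketch below flattens parsed YAML data (nested dicts and lists, as a YAML loader would return them) into one path per key, visited depth-first. `key_paths`, the `/` separator, and the `[]` list marker are hypothetical choices and may differ from what `yaml_file_to_sequence` produces.

```python
def key_paths(node, prefix=""):
    """Depth-first list of key paths in parsed YAML data (dicts, lists, scalars)."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}/{key}" if prefix else str(key)
            paths.append(path)
            paths.extend(key_paths(value, path))
    elif isinstance(node, list):
        for item in node:
            paths.extend(key_paths(item, prefix + "[]"))
    return paths


config = {"server": {"host": "localhost", "ports": [{"http": 80}, {"https": 443}]}}
print(key_paths(config))
# ['server', 'server/host', 'server/ports', 'server/ports[]/http', 'server/ports[]/https']
```

A separator other than `.` keeps the paths from colliding with the concatenation operator of the grammar notation.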

## Papers

- **Bex et al.** *"Inference of Concise Regular Expressions and DTDs"*, TODS 2010
- **Bex et al.** *"Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data"*, arXiv:1004.2372

See `papers/` for extracted text and the original references.

## Tests

```bash
python -m pytest tests/
# or
python tests/test_bex.py
```

## MCP Server

A Model Context Protocol server for grammar inference is planned. See `AGENTS.md` for the roadmap.

## License

MIT
26
bex/__init__.py
Normal file
@@ -0,0 +1,26 @@
"""
bex: Paper-faithful implementation of BEX inference algorithms.

Papers:
  - Bex et al. 2010 (TODS): Inference of Concise Regular Expressions and DTDs
  - Bex et al. 2010 (arXiv 1004.2372): Learning Deterministic Regular Expressions

Algorithms implemented:
  TODS 2010: 2T-INF, REWRITE, RWR, RWR², RWR₀, CRX
  arXiv 2010: iKoa, Disambiguate, rwr², iDRegEx
"""

from .soa import SOA
from .twotinf import build_soa
from .rwr0 import rwr0
from .crx import CRX
from .ikoa import ikoa
from .rwrsq import rwr_sq
from .idregex import idregex
from .koa import KOA, build_complete_koa
from .expr import concat, disj, star, optional, alphabet, strip_k
from .marking import mark_koa
from .tokenizer import YAMLTokenizer
from .template import generate_template

__version__ = "0.2.0"
3
bex/__main__.py
Normal file
@@ -0,0 +1,3 @@
from .cli import main

main()
130
bex/automaton.py
Normal file
@@ -0,0 +1,130 @@
"""
Automaton: graph representation for BEX algorithms.

An Automaton is a directed graph with labeled edges (labels = tokens).
It serves as the basis for:
  - the prefix-tree automaton (built from example sequences)
  - SORE/CHARE transformation via shrink rewrite rules
  - determinism checking and repair

The implementation follows the structure from Bex et al. 2010 (TWEB):
  - nodes: set of states
  - edges: list of (from, to, label, prob); prob is optional, for HMM use
  - start: start state
  - accepts: set of accepting states
"""


class Automaton:
    def __init__(self, start=None):
        self.nodes = set()
        self.edges = []
        self.start = start
        self.accepts = set()

    def add_node(self, node):
        self.nodes.add(node)

    def add_edge(self, u, v, label, prob=None):
        self.edges.append({
            'from': u,
            'to': v,
            'label': label,
            'prob': prob,
        })
        self.add_node(u)
        self.add_node(v)

    def remove_edge(self, u, v, label):
        self.edges = [
            e for e in self.edges
            if not (e['from'] == u and e['to'] == v and e['label'] == label)
        ]

    def remove_all_edges_between(self, u, v):
        self.edges = [
            e for e in self.edges
            if not (e['from'] == u and e['to'] == v)
        ]

    def set_start(self, node):
        self.start = node
        self.add_node(node)

    def add_accept(self, node):
        self.accepts.add(node)
        self.add_node(node)

    def outgoing(self, node):
        return [e for e in self.edges if e['from'] == node]

    def incoming(self, node):
        return [e for e in self.edges if e['to'] == node]

    def successors(self, node):
        return {(e['to'], e['label']) for e in self.outgoing(node)}

    def has_edge(self, u, v, label):
        return any(
            e['from'] == u and e['to'] == v and e['label'] == label
            for e in self.edges
        )

    def has_self_loop(self, node):
        return any(e['from'] == node and e['to'] == node for e in self.edges)

    def labels_on_edge(self, u, v):
        return [e['label'] for e in self.edges if e['from'] == u and e['to'] == v]

    def is_deterministic(self):
        """Check whether the automaton is deterministic (no two outgoing edges of a state share a label)."""
        for node in self.nodes:
            seen = set()
            for e in self.outgoing(node):
                if e['label'] in seen:
                    return False
                seen.add(e['label'])
        return True

    def merge_nodes(self, target, source):
        """Merge source into target: every edge from or to source is redirected to target."""
        if target == source:
            return
        new_edges = []
        for e in self.edges:
            if e['from'] == source or e['to'] == source:
                # Copy the edge so that 'prob' is preserved (to_dot reads e['prob']).
                e = dict(e)
                if e['from'] == source:
                    e['from'] = target
                if e['to'] == source:
                    e['to'] = target
            new_edges.append(e)
        self.edges = new_edges
        if source in self.accepts:
            self.accepts.add(target)
            self.accepts.discard(source)
        if self.start == source:
            # Keep the start state valid when the old start node is merged away.
            self.start = target
        self.nodes.discard(source)

    def copy(self):
        import copy
        return copy.deepcopy(self)

    def __repr__(self):
        return (f"Automaton(nodes={len(self.nodes)}, edges={len(self.edges)}, "
                f"start={self.start}, accepts={self.accepts})")

    def to_dot(self):
        lines = ["digraph Automaton {"]
        lines.append(" rankdir=LR;")
        lines.append(' start [shape=point];')
        lines.append(f' start -> {self.start};')
        for n in self.nodes:
            shape = "doublecircle" if n in self.accepts else "circle"
            lines.append(f' {n} [shape={shape}];')
        for e in self.edges:
            # str() lets non-string labels through; get() tolerates edges without 'prob'.
            label = str(e['label']).replace('"', '\\"')
            prob = f" [{e['prob']:.2f}]" if e.get('prob') is not None else ""
            lines.append(f' {e["from"]} -> {e["to"]} [label="{label}{prob}"];')
        lines.append("}")
        return '\n'.join(lines)
192
bex/baum_welch.py
Normal file
@@ -0,0 +1,192 @@
"""Baum-Welch for POMM on k-OA: standard forward-backward (Rabiner 1989)."""

import random
import math


def init_probabilities(G, sequences):
    """Initialize α per iKoa init (Algorithm 1, line 1).

    - α(src, sink) = fraction of empty words in S
    - α(src, s) = fraction of words starting with lab(s), split equally
      among all k copies of that symbol
    - α(s, t) for s ≠ src: chosen randomly, normalized to sum to 1
    """
    total = len(sequences)
    if total == 0:
        total = 1
    empty_count = sum(1 for s in sequences if not s)

    start_counts = {}
    for seq in sequences:
        if seq:
            start_counts[seq[0]] = start_counts.get(seq[0], 0) + 1

    prob = {}
    for s in G._succ:
        if s == G.sink:
            continue
        succ = list(G._succ[s])
        if not succ:
            prob[s] = {}
            continue
        vals = []
        for t in succ:
            if s == G.src:
                if t == G.sink:
                    v = empty_count / total
                else:
                    lab = G.label(t)
                    base = lab.rsplit('_', 1)[0] if '_' in lab else lab
                    count = start_counts.get(base, 0)
                    copies = sum(1 for u in succ if G.label(u) == lab)
                    v = (count / total) / max(copies, 1)
                vals.append(v)
            else:
                vals.append(random.random())
        s_total = sum(vals)
        if s_total == 0:
            vals = [1.0 / len(vals)] * len(vals)
        else:
            vals = [v / s_total for v in vals]
        prob[s] = {t: v for t, v in zip(succ, vals)}

    for s in prob:
        for t in prob[s]:
            if prob[s][t] < 1e-10:
                prob[s][t] = 0.0

    return prob


def bw_iteration(prob, sequences, node_to_idx, n_states, all_nodes, G):
    """Single Baum-Welch iteration over all sequences."""
    total_num = {}
    total_denom = {}
    sink = G.sink  # the sink emits nothing

    # Which states can emit each observation? (keyed by base symbol)
    # This depends only on the graph, so it is built once, not per sequence.
    emit = {}
    for n in all_nodes:
        lab = G.label(n)
        if lab:
            base = lab.rsplit('_', 1)[0] if '_' in lab else lab
            emit.setdefault(base, []).append(n)

    for seq in sequences:
        # Empty words are not skipped: they contribute the src -> sink transition.
        T = len(seq)
        obs = seq

        # Forward pass
        alpha = [{} for _ in range(T + 1)]
        alpha[0][G.src] = 1.0

        for t in range(T):
            sym = obs[t]
            possible = emit.get(sym, [])
            for j in possible:
                total = 0.0
                for i in alpha[t]:
                    p_trans = prob.get(i, {}).get(j, 0.0)
                    if p_trans > 0:
                        total += alpha[t][i] * p_trans
                if total > 0:
                    alpha[t + 1][j] = total

        # P(O | λ)
        po = 0.0
        for i in alpha[T]:
            po += alpha[T][i] * prob.get(i, {}).get(sink, 0.0)
        if po == 0:
            continue

        # Backward pass
        beta = [{} for _ in range(T + 1)]
        for i in all_nodes:
            if prob.get(i, {}).get(sink, 0.0) > 0:
                beta[T][i] = prob[i][sink]

        for t in range(T - 1, -1, -1):
            possible = emit.get(obs[t], [])  # states entered at step t + 1 emit obs[t]
            for i in alpha[t]:
                total = 0.0
                for j in possible:
                    p_trans = prob.get(i, {}).get(j, 0.0)
                    if p_trans > 0 and j in beta[t + 1]:
                        total += p_trans * beta[t + 1][j]
                if total > 0:
                    beta[t][i] = total

        # Accumulate ξ and γ
        for t in range(T):
            sym_nxt = obs[t]
            possible = emit.get(sym_nxt, [])
            for i in alpha[t]:
                if i not in beta[t] or beta[t][i] == 0:
                    continue
                for j in possible:
                    p_trans = prob.get(i, {}).get(j, 0.0)
                    if p_trans == 0 or j not in beta[t + 1] or beta[t + 1][j] == 0:
                        continue
                    xi = alpha[t][i] * p_trans * beta[t + 1][j] / po
                    if xi > 1e-15:
                        key = (i, j)
                        total_num[key] = total_num.get(key, 0.0) + xi
                        total_denom[i] = total_denom.get(i, 0.0) + xi

        # The final move into the sink is a transition too. If it were not
        # counted, the M-step below would reset every α(i, sink) to zero.
        for i in alpha[T]:
            p_end = prob.get(i, {}).get(sink, 0.0)
            if p_end > 0:
                xi = alpha[T][i] * p_end / po
                key = (i, sink)
                total_num[key] = total_num.get(key, 0.0) + xi
                total_denom[i] = total_denom.get(i, 0.0) + xi

    # M-step: update probabilities
    for s in prob:
        for t in prob[s]:
            key = (s, t)
            d = total_denom.get(s, 0.0)
            if d > 1e-15 and key in total_num:
                prob[s][t] = total_num[key] / d
            else:
                prob[s][t] = 0.0

    # Renormalize
    for s in prob:
        row_sum = sum(prob[s].values())
        if row_sum > 1e-10:
            for t in prob[s]:
                prob[s][t] /= row_sum
        else:
            n_succ = len(prob[s])
            for t in prob[s]:
                prob[s][t] = 1.0 / n_succ

    return prob


def baum_welch(G, prob, sequences, iterations=10):
    """Baum-Welch EM training.

    Args:
        G: k-OA graph
        prob: dict[s][t] = transition probabilities
        sequences: list of token lists (bag, not set)
        iterations: number of EM iterations to run (fixed count, no convergence test)

    Returns:
        Updated prob dict
    """
    all_nodes = list(G._succ.keys())
    node_to_idx = {n: i for i, n in enumerate(all_nodes)}
    n_states = len(all_nodes)

    for _ in range(iterations):
        prob = bw_iteration(prob, sequences, node_to_idx, n_states, all_nodes, G)

    return prob


def baum_welch_fixed(G, prob, sequences, iterations=2):
    """Baum-Welch with fixed small iteration count (for Disambiguate).

    ℓ = 2 for |Σ| ≤ 7, ℓ = 3 for |Σ| > 7.
    """
    return baum_welch(G, prob, sequences, iterations)
145
bex/cli.py
Normal file
@@ -0,0 +1,145 @@
"""
CLI: Command-Line Interface for bex YAML Grammar Inference.

Usage:
    python -m bex --dir roles/ --k-max 5
    python -m bex --dir playbooks/ --context tasks
    python -m bex --dir roles/ --output template.yaml
"""

import argparse
import os
import sys
import glob

from .tokenizer import YAMLTokenizer
from .kore import kOREInference
from .template import generate_template
from .ilocal import iLocal, extract_contexts_from_file, reduce_contexts


def find_yaml_files(directory):
    """Find all YAML files in a directory (recursively)."""
    patterns = ['**/*.yml', '**/*.yaml']
    files = []
    for pattern in patterns:
        files.extend(glob.glob(os.path.join(directory, pattern), recursive=True))
    return sorted(files)


def main():
    parser = argparse.ArgumentParser(
        description='bex: BEX-based YAML Grammar Inference',
    )
    parser.add_argument('--dir', type=str, default='roles/',
                        help='Directory containing YAML files (default: roles/)')
    parser.add_argument('--k-max', type=int, default=5,
                        help='Maximum k for k-ORE inference (default: 5)')
    parser.add_argument('--context', type=str, default=None,
                        help='Restrict to a specific container key (e.g. tasks)')
    parser.add_argument('--output', type=str, default=None,
                        help='Output file for the template (default: stdout)')
    parser.add_argument('--ilocal', action='store_true',
                        help='Run the iLocal context analysis')
    parser.add_argument('--crx', action='store_true',
                        help='Use CRX (direct CHARE inference)')
    parser.add_argument('--verbose', '-v', action='store_true',
                        help='Verbose output')
    parser.add_argument('--stats', action='store_true',
                        help='Show token statistics')

    args = parser.parse_args()

    if not os.path.isdir(args.dir):
        print(f"Error: directory '{args.dir}' not found.", file=sys.stderr)
        sys.exit(1)

    yaml_files = find_yaml_files(args.dir)
    if not yaml_files:
        print(f"No YAML files found in '{args.dir}'.", file=sys.stderr)
        sys.exit(1)

    print(f"YAML files found: {len(yaml_files)}", file=sys.stderr)

    if args.ilocal:
        print("\n=== iLocal: context extraction ===", file=sys.stderr)
        all_contexts = {}
        for f in yaml_files:
            contexts = extract_contexts_from_file(f)
            for ctx, seqs in contexts.items():
                if ctx not in all_contexts:
                    all_contexts[ctx] = []
                all_contexts[ctx].extend(seqs)

        reduced = reduce_contexts(all_contexts)
        print(f" Contexts found: {len(reduced)}", file=sys.stderr)
        for ctx, seqs in sorted(reduced.items()):
            lengths = [len(s) for s in seqs]
            print(f" {ctx}: {len(seqs)} sequences, "
                  f"lengths {min(lengths)}-{max(lengths)}, "
                  f"unique_seqs={len(set(tuple(s) for s in seqs))}",
                  file=sys.stderr)

    print("\n=== Tokenization ===", file=sys.stderr)
    tokenizer = YAMLTokenizer(resolve_includes=False)
    all_sequences = []
    container_sequences = {}

    for f in yaml_files:
        try:
            seq = tokenizer.tokenize_file(f)
            if seq:
                all_sequences.append(seq)
                if args.verbose:
                    print(f" {os.path.relpath(f)}: {seq}", file=sys.stderr)
        except Exception as e:
            if args.verbose:
                print(f" Error in {f}: {e}", file=sys.stderr)

    if not all_sequences:
        print("No sequences extracted.", file=sys.stderr)
        sys.exit(1)

    print(f" Sequences extracted: {len(all_sequences)}", file=sys.stderr)
    lengths = [len(s) for s in all_sequences]
    print(f" Lengths: min={min(lengths)}, max={max(lengths)}, "
          f"avg={sum(lengths)/len(lengths):.1f}", file=sys.stderr)

    if args.stats:
        stats = tokenizer.get_statistics()
        print("\n=== Token statistics ===", file=sys.stderr)
        for token, count in list(stats.items())[:30]:
            print(f" {token}: {count}", file=sys.stderr)

    print("\n=== k-ORE inference ===", file=sys.stderr)
    kore = kOREInference(k_max=args.k_max)

    if args.crx:
        result = kore.infer_with_crx(all_sequences)
        _, expr, method = result
        print(f" Method: {method}", file=sys.stderr)
    else:
        result = kore.infer(all_sequences)
        if result:
            _, expr, k = result
            print(f" Best k: {k}", file=sys.stderr)
        else:
            expr = "∅"
            print(" No result", file=sys.stderr)

    print(f" Inferred expression: {expr}", file=sys.stderr)

    print("\n=== One-Shot Template ===", file=sys.stderr)
    print(file=sys.stderr)
    template = generate_template(expr, context_key=args.context)

    if args.output:
        # Explicit encoding: the template may contain non-ASCII symbols such as ∅.
        with open(args.output, 'w', encoding='utf-8') as out:
            out.write(template)
        print(f"Template written to: {args.output}", file=sys.stderr)
    else:
        print(template)


if __name__ == '__main__':
    main()
191
bex/crx.py
Normal file
@@ -0,0 +1,191 @@
"""CRX: direct CHARE inference (Algorithm 7, TODS 2010)."""

from collections import defaultdict
from .expr import concat


class CRX:
    """
    Algorithm 7: CRX
    Input: sample S (list of token lists)
    Output: CHARE r such that S ⊆ L(r)
    """

    def infer(self, sequences):
        # Empty words are kept: they count as zero occurrences of every factor,
        # which makes each factor optional so that S ⊆ L(r) still holds.
        S = [list(s) for s in sequences]
        if not S:
            return 'ε'

        sigma = set()
        for w in S:
            for a in w:
                sigma.add(a)
        if not sigma:
            return 'ε'

        # Step 1: Compute ImmedPred and equivalence classes ≈_S
        immed = set()
        for w in S:
            for i in range(len(w) - 1):
                immed.add((w[i], w[i + 1]))

        # Reachability: →_S (reflexive, transitive closure)
        closure = self._transitive_closure(sigma, immed)

        # Equivalence: a ≈_S b iff a →*_S b and b →*_S a
        eq = self._equivalence(sigma, closure)

        # Build class map: symbol → class index
        sym_to_cls = {}
        classes = []
        for cls_syms in eq:
            idx = len(classes)
            for sym in cls_syms:
                sym_to_cls[sym] = idx
            classes.append(set(cls_syms))

        # Steps 2-3: merge singleton classes. Algorithm 7 states:
        #
        #   "while a maximal set of singleton nodes γ₁,...,γ_ℓ such that
        #    Pred_HS(γ₁)=···=Pred_HS(γ_ℓ) and Succ_HS(γ₁)=···=Succ_HS(γ_ℓ) exists do
        #      Replace γ₁,...,γ_ℓ by γ := ∪ⱼ γⱼ"
        #
        # That is, equivalence classes with exactly one symbol are merged when
        # they have the same Pred and Succ sets in the Hasse diagram.

        changed = True
        while changed:
            changed = False
            singleton_ids = [i for i, c in enumerate(classes) if len(c) == 1]

            # Compute Pred and Succ for each singleton (considering ALL symbols in each class)
            hs_pred = {}
            hs_succ = {}
            for i in singleton_ids:
                hs_pred[i] = set()
                hs_succ[i] = set()
                sym_i = next(iter(classes[i]))
                for j, c in enumerate(classes):
                    if i == j:
                        continue
                    if any((sym_j, sym_i) in immed for sym_j in c):
                        hs_pred[i].add(j)
                    if any((sym_i, sym_j) in immed for sym_j in c):
                        hs_succ[i].add(j)

            # Group by same (Pred, Succ)
            groups = defaultdict(list)
            for i in singleton_ids:
                groups[(frozenset(hs_pred[i]), frozenset(hs_succ[i]))].append(i)

            for (pred_set, succ_set), group in groups.items():
                if len(group) >= 2:
                    merged = set()
                    for i in group:
                        merged.update(classes[i])
                    classes.append(merged)
                    # Pop from the highest index down so earlier indices stay valid.
                    for i in sorted(group, reverse=True):
                        classes.pop(i)
                    changed = True
                    break

        # After merging, rebuild sym_to_cls to map to new class indices
        sym_to_cls = {}
        for idx, cls in enumerate(classes):
            for sym in cls:
                sym_to_cls[sym] = idx

        # Step 5: Topological sort of the Hasse diagram
        adj = {i: set() for i in range(len(classes))}
        indeg = {i: 0 for i in range(len(classes))}
        for a, b in immed:
            ca, cb = sym_to_cls.get(a), sym_to_cls.get(b)
            if ca is not None and cb is not None and ca != cb:
                if cb not in adj[ca]:
                    adj[ca].add(cb)
                    indeg[cb] += 1

        # Topological sort (Kahn's algorithm)
        order = []
        q = [i for i in range(len(classes)) if indeg[i] == 0]
        while q:
            i = q.pop(0)
            order.append(i)
            for j in adj[i]:
                indeg[j] -= 1
                if indeg[j] == 0:
                    q.append(j)
        remaining = set(range(len(classes))) - set(order)
        order.extend(remaining)

        # Step 6-16: Assign chain factors (Algorithm 7 lines 7-14)
        def count_in_class(w, syms):
            return sum(1 for a in w if a in syms)

        parts = []
        for i in order:
            syms = classes[i]
            counts = [count_in_class(w, syms) for w in S]

            all_exactly_one = all(c == 1 for c in counts)
            all_at_most_one = all(c <= 1 for c in counts)
            all_at_least_one = all(c >= 1 for c in counts)
            some_two_or_more = any(c >= 2 for c in counts)

            sym_list = sorted(syms)
            factor = '+'.join(sym_list)
            if len(sym_list) > 1:
                factor = '(' + factor + ')'

            if all_exactly_one:
                pass  # (a₁+···+aₙ)
            elif all_at_most_one:
                factor += '?'  # (a₁+···+aₙ)?
            elif all_at_least_one and some_two_or_more:
                factor += '+'  # (a₁+···+aₙ)+
            else:
                factor += '+?'  # (a₁+···+aₙ)+?

            parts.append(factor)

        if not parts:
            return 'ε'
        return '.'.join(parts)

    def _transitive_closure(self, sigma, immed):
        """Compute reflexive, transitive closure of immed relation."""
        closure = {(a, b) for (a, b) in immed}
        for a in sigma:
            closure.add((a, a))
        changed = True
        while changed:
            changed = False
            for a in sigma:
                for b in sigma:
                    for c in sigma:
                        if (a, b) in closure and (b, c) in closure and (a, c) not in closure:
                            closure.add((a, c))
                            changed = True
        return closure

    def _equivalence(self, sigma, closure):
        """Compute equivalence classes of ≈_S."""
        remaining = set(sigma)
        classes = []
        while remaining:
            a = remaining.pop()
            cls = {a}
            added = True
            while added:
                added = False
                for b in list(remaining):
                    if (a, b) in closure and (b, a) in closure:
                        if b not in cls:
                            cls.add(b)
                            remaining.discard(b)
                            added = True
            classes.append(cls)
        return classes
164
bex/expr.py
Normal file
@@ -0,0 +1,164 @@
"""Expression utilities for SOREs and k-OREs."""

import re


def sym(s):
    """Create a simple symbol expression."""
    return s


def concat(*parts):
    """Create concatenation expression."""
    parts = [p for p in parts if p and p != 'ε']
    if not parts:
        return 'ε'
    if len(parts) == 1:
        return parts[0]
    return '.'.join(parts)


def disj(*parts):
    """Create disjunction expression."""
    parts = [p for p in parts if p and p != '∅']
    if not parts:
        return '∅'
    if len(parts) == 1:
        return parts[0]
    return '(' + '|'.join(parts) + ')'


def star(expr):
    """Create iteration expression (one or more, r+)."""
    if not expr or expr in ('∅', 'ε'):
        return expr
    if len(expr) == 1 or (expr.startswith('(') and expr.endswith(')')):
        return expr + '+'
    return '(' + expr + ')+'


def optional(expr):
    """Create optional expression (r?)."""
    if not expr or expr in ('∅', 'ε'):
        return 'ε'
    if len(expr) == 1 or (expr.startswith('(') and expr.endswith(')')):
        return expr + '?'
    return '(' + expr + ')?'


def alphabet(expr):
    """Return set of alphabet symbols in expression."""
    cleaned = re.sub(r'[+?*().|]', ' ', expr)
    result = set()
    for token in cleaned.split():
        token = token.strip('_0123456789')
        if token and token not in ('ε', '∅'):
            result.add(token)
    return result


def strip_k(s):
    """Remove k-ORE markers: a_1 → a, b^(2) → b."""
    result = re.sub(r'_\d+', '', s)
    result = re.sub(r'\^\(\d+\)', '', result)
    result = re.sub(r'^\(|\)$', '', result)
    return result
|
||||
|
||||
|
||||
def has_repeats(expr, symbol):
|
||||
"""Check if a symbol appears more than once in expression."""
|
||||
return expr.count(symbol) > 1
|
||||
|
||||
|
||||
def lang_size_at_most(expr, n, alphabet_symbols=None):
|
||||
"""Compute |L(r)<=n| — number of words of length ≤ n in L(r)."""
|
||||
if alphabet_symbols is None:
|
||||
alphabet_symbols = alphabet(expr)
|
||||
if not alphabet_symbols:
|
||||
return 1 if 'ε' in expr else 0
|
||||
size = 0
|
||||
for length in range(n + 1):
|
||||
size += _count_words(expr, length, alphabet_symbols)
|
||||
return size
|
||||
|
||||
|
||||
def _count_words(expr, length, alphabet_symbols):
|
||||
if length < 0:
|
||||
return 0
|
||||
if not expr or expr == '∅':
|
||||
return 0
|
||||
if expr == 'ε':
|
||||
return 1 if length == 0 else 0
|
||||
if expr in alphabet_symbols:
|
||||
return 1 if length == 1 else 0
|
||||
if '+' in expr:
|
||||
inner = expr.rstrip('+')
|
||||
if inner.endswith('?'):
|
||||
inner = inner[:-1]
|
||||
return _count_star_words(inner, length, alphabet_symbols, 1)
|
||||
if expr.endswith('?'):
|
||||
inner = expr[:-1]
|
||||
return _count_words(inner, length, alphabet_symbols) + (1 if length == 0 else 0)
|
||||
if expr.startswith('(') and '|' in expr:
|
||||
inner = expr[1:-1]
|
||||
parts = _split_disjunction(inner)
|
||||
return sum(_count_words(p, length, alphabet_symbols) for p in parts)
|
||||
if '.' in expr:
|
||||
parts = expr.split('.')
|
||||
return _count_concat_words(parts, length, alphabet_symbols, 0)
|
||||
# Any remaining form (e.g. an unsupported parenthesized expression) counts as 0
return 0
|
||||
|
||||
|
||||
def _count_concat_words(parts, length, alphabet_symbols, idx):
|
||||
if idx >= len(parts):
|
||||
return 1 if length == 0 else 0
|
||||
total = 0
|
||||
for take in range(length + 1):
|
||||
cnt = _count_words(parts[idx], take, alphabet_symbols)
|
||||
if cnt > 0:
|
||||
rest = _count_concat_words(parts, length - take, alphabet_symbols, idx + 1)
|
||||
total += cnt * rest
|
||||
return total
|
||||
|
||||
|
||||
def _count_star_words(inner, length, alphabet_symbols, min_count):
|
||||
total = 0
|
||||
for repeat in range(min_count, length + 1):
|
||||
if repeat == 0:
|
||||
continue
|
||||
total += _count_repeat_words(inner, repeat, length, alphabet_symbols)
|
||||
return total
|
||||
|
||||
|
||||
def _count_repeat_words(inner, repeat, length, alphabet_symbols):
|
||||
if repeat == 0:
|
||||
return 1 if length == 0 else 0
|
||||
total = 0
|
||||
for take in range(length + 1):
|
||||
cnt = _count_words(inner, take, alphabet_symbols)
|
||||
if cnt > 0:
|
||||
rest = _count_repeat_words(inner, repeat - 1, length - take, alphabet_symbols)
|
||||
total += cnt * rest
|
||||
return total
|
||||
|
||||
|
||||
def _split_disjunction(s):
|
||||
depth = 0
|
||||
parts = []
|
||||
current = []
|
||||
for ch in s:
|
||||
if ch == '(':
|
||||
depth += 1
|
||||
current.append(ch)
|
||||
elif ch == ')':
|
||||
depth -= 1
|
||||
current.append(ch)
|
||||
elif ch == '|' and depth == 0:
|
||||
parts.append(''.join(current))
|
||||
current = []
|
||||
else:
|
||||
current.append(ch)
|
||||
parts.append(''.join(current))
|
||||
return parts
|
||||
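The recursive counters behind `lang_size_at_most` can be cross-checked by brute force on small inputs. The sketch below is not part of the repository: `brute_force_lang_size` is a hypothetical helper that assumes single-character symbols and expressions without `ε` or `∅`. It drops the `.` concatenation operator to obtain Python `re` syntax and then enumerates every word up to length `n`.

```python
import itertools
import re


def brute_force_lang_size(expr, n, symbols):
    """Count words of length <= n over `symbols` that match `expr`.

    Assumes single-character symbols and no 'ε'/'∅' in `expr`
    (hypothetical helper, intended for cross-checking only).
    """
    # In the expression syntax '.' is concatenation; Python's re uses juxtaposition.
    pattern = re.compile(expr.replace('.', ''))
    count = 0
    for length in range(n + 1):
        for word in itertools.product(sorted(symbols), repeat=length):
            if pattern.fullmatch(''.join(word)):
                count += 1
    return count


print(brute_force_lang_size('a.(b|c)?', 3, {'a', 'b', 'c'}))  # 3: a, ab, ac
print(brute_force_lang_size('(a|b)+', 2, {'a', 'b'}))          # 6: a, b, aa, ab, ba, bb
```

Enumeration is exponential in `n`, so this is only suitable as a test oracle for small alphabets.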
202
bex/idregex.py
Normal file
|
|
@ -0,0 +1,202 @@
|
|||
"""iDRegEx — Algorithm 4 (arXiv 1004.2372)."""
|
||||
|
||||
from .ikoa import ikoa
|
||||
from .rwrsq import rwr_sq
|
||||
from .expr import alphabet
|
||||
|
||||
|
||||
def is_deterministic(expr):
|
||||
"""Check if a k-ORE is deterministic (Glushkov determinism).
|
||||
|
||||
A k-ORE is deterministic iff for every subexpression (r|s),
|
||||
first(r) ∩ first(s) = ∅.
|
||||
"""
|
||||
if not expr or expr == '∅' or expr == 'ε':
|
||||
return True
|
||||
return _check_det(expr)
|
||||
|
||||
|
||||
def _check_det(expr):
|
||||
"""Recursive determinism check."""
|
||||
depth = 0
|
||||
i = 0
|
||||
while i < len(expr):
|
||||
if expr[i] == '(':
|
||||
if depth == 0:
|
||||
start = i
|
||||
depth += 1
|
||||
elif expr[i] == ')':
|
||||
depth -= 1
|
||||
if depth == 0:
|
||||
inner = expr[start + 1:i]
|
||||
if '|' in inner:
|
||||
alts = _split_or(inner)
|
||||
first_sets = []
|
||||
for alt in alts:
|
||||
fs = _first_set(alt.strip())
|
||||
first_sets.append(fs)
|
||||
for j, fs1 in enumerate(first_sets):
|
||||
for fs2 in first_sets[j + 1:]:
|
||||
if fs1 & fs2:
|
||||
return False
|
||||
for alt in alts:
|
||||
if not _check_det(alt.strip()):
|
||||
return False
|
||||
else:
|
||||
if not _check_det(inner):
|
||||
return False
|
||||
elif expr[i] == '+':
|
||||
pass
|
||||
elif expr[i] == '?':
|
||||
pass
|
||||
i += 1
|
||||
return True
|
||||
|
||||
|
||||
def _first_set(expr):
|
||||
"""Compute first(r) — set of alphabet symbols that can appear at the start of a word in L(r)."""
|
||||
if not expr or expr == '∅':
|
||||
return set()
|
||||
if expr == 'ε':
|
||||
return set()
|
||||
alpha = alphabet(expr)
|
||||
if expr in alpha:
|
||||
return {expr}
|
||||
if expr.endswith('?') or expr.endswith('+'):
|
||||
inner = expr.rstrip('+?')
|
||||
return _first_set(inner)
|
||||
if '.' in expr:
|
||||
parts = expr.split('.')
|
||||
return _first_set(parts[0])
|
||||
if expr.startswith('(') and '|' in expr:
|
||||
inner = expr[1:-1]
|
||||
alts = _split_or(inner)
|
||||
result = set()
|
||||
for a in alts:
|
||||
result |= _first_set(a.strip())
|
||||
return result
|
||||
return alpha
|
||||
|
||||
|
||||
def _split_or(s):
|
||||
"""Split disjunction string at top-level | operators."""
|
||||
depth = 0
|
||||
parts = []
|
||||
cur = []
|
||||
for ch in s:
|
||||
if ch == '(':
|
||||
depth += 1
|
||||
cur.append(ch)
|
||||
elif ch == ')':
|
||||
depth -= 1
|
||||
cur.append(ch)
|
||||
elif ch == '|' and depth == 0:
|
||||
parts.append(''.join(cur))
|
||||
cur = []
|
||||
else:
|
||||
cur.append(ch)
|
||||
parts.append(''.join(cur))
|
||||
return parts
|
||||
|
||||
|
||||
def _lang_size(expr, n=None):
|
"""|L(r)≤n|: number of words of length ≤ n in L(r).

If n is not given, it defaults to n = 2m + 1, where m is the number of
distinct alphabet symbols in r (a stand-in for |r| excluding operators).
Uses a simple structural approximation.
"""
|
||||
if not expr or expr == '∅':
|
||||
return 0
|
||||
if expr == 'ε':
|
||||
return 1
|
||||
m = len(alphabet(expr))
|
||||
if n is None:
|
||||
n = 2 * m + 1
|
||||
total = 0
|
||||
for length in range(n + 1):
|
||||
total += _count_len(expr, length)
|
||||
return total
|
||||
|
||||
|
||||
def _count_len(expr, length):
|
||||
if length < 0:
|
||||
return 0
|
||||
if not expr or expr == '∅':
|
||||
return 0
|
||||
if expr == 'ε':
|
||||
return 1 if length == 0 else 0
|
||||
alpha = alphabet(expr)
|
||||
if expr in alpha:
|
||||
return 1 if length == 1 else 0
|
||||
if expr.endswith('+'):
|
||||
inner = expr[:-1]
|
||||
if inner.endswith('?'):
|
||||
inner = inner[:-1]
|
||||
total = 0
|
||||
for rep in range(1, length + 1):
|
||||
total += _count_repeat(inner, rep, length)
|
||||
return total
|
||||
if expr.endswith('?'):
|
||||
inner = expr[:-1]
|
||||
return _count_len(inner, length) + (1 if length == 0 else 0)
|
||||
if '.' in expr:
|
||||
parts = expr.split('.')
|
||||
return _count_concat(parts, length, 0)
|
||||
if expr.startswith('(') and '|' in expr:
|
||||
inner = expr[1:-1]
|
||||
alts = _split_or(inner)
|
||||
return sum(_count_len(a.strip(), length) for a in alts)
|
||||
return 0
|
||||
|
||||
|
||||
def _count_concat(parts, length, idx):
|
||||
if idx >= len(parts):
|
||||
return 1 if length == 0 else 0
|
||||
total = 0
|
||||
for take in range(length + 1):
|
||||
cnt = _count_len(parts[idx], take)
|
||||
if cnt:
|
||||
total += cnt * _count_concat(parts, length - take, idx + 1)
|
||||
return total
|
||||
|
||||
|
||||
def _count_repeat(inner, rep, length):
|
||||
if rep == 0:
|
||||
return 1 if length == 0 else 0
|
||||
total = 0
|
||||
for take in range(length + 1):
|
||||
cnt = _count_len(inner, take)
|
||||
if cnt:
|
||||
total += cnt * _count_repeat(inner, rep - 1, length - take)
|
||||
return total
|
||||
|
||||
|
||||
def idregex(sequences, kmax=4, N=5, criterion='langsize'):
|
||||
"""
|
||||
|———— Algorithm 4: iDRegEx ————|
|
||||
Require: sample S
|
||||
Ensure: k-ORE r
|
||||
|
||||
1: C ← ∅
|
||||
2: for k = 1 to kmax do
|
||||
3: for n = 1 to N do
|
||||
4: G ← iKoa(S, k)
|
||||
5: if rwr²(G) is deterministic then
|
||||
6: add rwr²(G) to C
|
||||
7: return best(C)
|
||||
"""
|
||||
C = set()
|
||||
for k in range(1, kmax + 1):
|
||||
for _ in range(N):
|
||||
G = ikoa(sequences, k, num_trials=1)
|
||||
if G is None:
|
||||
continue
|
||||
expr = rwr_sq(G)
|
||||
if expr and expr not in ('∅', 'ε'):
|
||||
if is_deterministic(expr):
|
||||
C.add(expr)
|
||||
if not C:
|
||||
return None
|
||||
if criterion == 'langsize':
|
||||
return min(C, key=lambda e: (_lang_size(e), len(e)))
|
||||
return min(C, key=lambda e: len(e))
|
||||
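The criterion in the `is_deterministic` docstring requires the alternatives of a disjunction to have pairwise disjoint first sets. Below is a minimal sketch of just that pairwise test, with the first sets supplied by hand. `alternatives_disjoint` is a hypothetical name. The sketch covers only the disjunction condition stated in the docstring and is not a full one-unambiguity test.

```python
from itertools import combinations


def alternatives_disjoint(first_sets):
    """Return True if no two first sets share a symbol."""
    return all(not (a & b) for a, b in combinations(first_sets, 2))


# (a.b|c.d): the first sets {a} and {c} are disjoint.
print(alternatives_disjoint([{'a'}, {'c'}]))  # True
# (a.b|a.c): both alternatives can start with a.
print(alternatives_disjoint([{'a'}, {'a'}]))  # False
```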
139
bex/ikoa.py
Normal file
|
|
@ -0,0 +1,139 @@
|
|||
"""iKoa — Algorithm 1 (arXiv 1004.2372) with Disambiguate (Algorithm 2)."""
|
||||
|
||||
from collections import deque, defaultdict
|
||||
import random
|
||||
from .koa import KOA, build_complete_koa
|
||||
from .baum_welch import init_probabilities, baum_welch, baum_welch_fixed
|
||||
|
||||
|
||||
def disambiguate(G, prob, sequences):
|
||||
"""
|
||||
|---- Algorithm 2: Disambiguate ----|
|
||||
Require: POMM P=(G,alpha) and sample S
|
||||
Ensure: deterministic k-OA
|
||||
"""
|
||||
sigma = set()
|
||||
for seq in sequences:
|
||||
for sym in seq:
|
||||
sigma.add(sym)
|
||||
bw_iter = 2 if len(sigma) <= 7 else 3
|
||||
|
||||
Q = deque([G.src])
|
||||
for s in G._succ.get(G.src, set()):
|
||||
if prob.get(G.src, {}).get(s, 0) > 0:
|
||||
Q.append(s)
|
||||
D = set()
|
||||
|
||||
from .expr import strip_k
|
||||
while Q:
|
||||
s = Q.popleft()
|
||||
while True:
|
||||
lab_groups = defaultdict(list)
|
||||
for t in list(G._succ.get(s, set())):
|
||||
l = G.label(t)
|
||||
if l:
|
||||
lab_groups[strip_k(l)].append(t)
|
||||
multi = [(lab, ts) for lab, ts in lab_groups.items() if len(ts) > 1]
|
||||
if not multi:
|
||||
break
|
||||
for lab, targets in multi:
|
||||
t_max = max(targets, key=lambda t: prob.get(s, {}).get(t, 0))
|
||||
total_p = sum(prob.get(s, {}).get(t, 0) for t in targets)
|
||||
if total_p > 0 and t_max in prob.get(s, {}):
|
||||
prob[s][t_max] = total_p
|
||||
for t in targets:
|
||||
if t != t_max:
|
||||
G.rm_edge(s, t)
|
||||
if t in prob.get(s, {}):
|
||||
prob[s][t] = 0.0
|
||||
prob = baum_welch_fixed(G, prob, sequences, bw_iter)
|
||||
for seq in sequences:
|
||||
if not G.accept(seq):
|
||||
return None
|
||||
D.add(s)
|
||||
for t in list(G._succ.get(s, set())):
|
||||
if t not in D and t != G.sink:
|
||||
Q.append(t)
|
||||
return G
|
||||
|
||||
|
||||
def prune(G, sequences):
|
||||
"""Prune (iKoa line 4). Remove edges without witnesses in S.
|
||||
|
||||
Also removes states s ∈ Succ(src) without a witness.
|
||||
"""
|
||||
from .expr import strip_k as _sk
|
||||
witnessed = set()
|
||||
for seq in sequences:
|
||||
if not seq:
|
||||
witnessed.add((G.src, G.sink))
|
||||
continue
|
||||
cur = {G.src}
|
||||
for sym in seq:
|
||||
nxt = set()
|
||||
for s in cur:
|
||||
for t in G._succ.get(s, set()):
|
||||
lab = G.label(t)
|
||||
if lab and _sk(lab) == sym:
|
||||
nxt.add(t)
|
||||
witnessed.add((s, t))
|
||||
cur = nxt
|
||||
for s in cur:
|
||||
if G.has_edge(s, G.sink):
|
||||
witnessed.add((s, G.sink))
|
||||
for s in list(G._succ.keys()):
|
||||
for t in list(G._succ.get(s, set())):
|
||||
if (s, t) not in witnessed:
|
||||
G.rm_edge(s, t)
|
||||
|
||||
r_from_src = set()
|
||||
q = [G.src]
|
||||
while q:
|
||||
s = q.pop()
|
||||
if s in r_from_src:
|
||||
continue
|
||||
r_from_src.add(s)
|
||||
q.extend(G._succ.get(s, set()))
|
||||
|
||||
r_to_sink = set()
|
||||
q = [G.sink]
|
||||
while q:
|
||||
s = q.pop()
|
||||
if s in r_to_sink:
|
||||
continue
|
||||
r_to_sink.add(s)
|
||||
q.extend(G._pred.get(s, set()))
|
||||
|
||||
for n in list(G._succ.keys()):
|
||||
if n in (G.src, G.sink):
|
||||
continue
|
||||
if n not in r_from_src or n not in r_to_sink:
|
||||
G.rm_state(n)
|
||||
|
||||
return G
|
||||
|
||||
|
||||
def ikoa(sequences, k, num_trials=1):
|
||||
"""
|
||||
|———— Algorithm 1: iKoa ————|
|
||||
Require: sample S, value k
|
||||
Ensure: deterministic k-OA G with S ⊆ L(G)
|
||||
|
||||
1: P ← init(k, S)
|
||||
2: P ← BaumWelch(P, S)
|
||||
3: G ← Disambiguate(P, S)
|
||||
4: G ← Prune(G, S)
|
||||
5: return G
|
||||
"""
|
||||
for _ in range(num_trials):
|
||||
G, _ = build_complete_koa(sequences, k)
|
||||
prob = init_probabilities(G, sequences)
|
||||
prob = baum_welch(G, prob, sequences, iterations=10)
|
||||
G2 = G.copy()
|
||||
prob2 = {s: dict(d) for s, d in prob.items()}
|
||||
result = disambiguate(G2, prob2, sequences)
|
||||
if result is not None:
|
||||
result = prune(result, sequences)
|
||||
if result.sink_reachable():
|
||||
return result
|
||||
return None
|
||||
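The merging step inside `disambiguate` can be illustrated in isolation. In this sketch plain dicts stand in for the k-OA and the probability table, and `merge_same_label` is a hypothetical helper. The Baum-Welch re-estimation and the acceptance check that follow in the repository code are left out.

```python
import re
from collections import defaultdict


def merge_same_label(succ_prob, labels):
    """Keep one successor per base label and give it the group's total mass.

    succ_prob: {target_state: transition probability} for one source state
    labels:    {target_state: label such as 'a_1'}
    Returns (new_succ_prob, removed_targets).
    """
    groups = defaultdict(list)
    for target in succ_prob:
        base = re.sub(r'_\d+$', '', labels[target])  # a_2 -> a
        groups[base].append(target)

    merged, removed = {}, set()
    for targets in groups.values():
        keep = max(targets, key=lambda t: succ_prob[t])  # most probable successor
        merged[keep] = sum(succ_prob[t] for t in targets)
        removed.update(t for t in targets if t != keep)
    return merged, removed


probs = {1: 0.25, 2: 0.5, 3: 0.25}
labels = {1: 'a_1', 2: 'a_2', 3: 'b_1'}
print(merge_same_label(probs, labels))  # ({2: 0.75, 3: 0.25}, {1})
```

After the merge, no two successors of the source state share a base label, which is the local determinism condition the loop works toward.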
166
bex/ilocal.py
Normal file
|
|
@ -0,0 +1,166 @@
|
"""
iLocal: context-based inference (Bex 2007).

Based on Bex et al. 2007, "Inferring XML Schema Definitions from XML Data".
Extracts (context, sequence) pairs from YAML trees, where the context is
the YAML key (container key).

Adapted for YAML:
- Context = a YAML key whose value is a list (e.g. tasks, steps)
- Sequence = the item keys inside that list (e.g. apt, template, service)

Instead of file paths (as in the XML setting), we work directly with the
container keys (user requirement: no file-path clutter).
"""
|
||||
|
||||
import yaml
|
||||
|
||||
|
||||
def extract_contexts_from_yaml(data, context_prefix=None):
|
"""
Extract (context, sequence) pairs from parsed YAML.

Args:
    data: Parsed YAML data (dict or list)
    context_prefix: Internal prefix for nested contexts (currently unused)

Returns:
    dict: {context_key: [sequence1, sequence2, ...]}
"""
|
||||
contexts = {}
|
||||
|
||||
def walk(node, prefix=None):
|
||||
if isinstance(node, dict):
|
||||
for key, value in node.items():
|
||||
full_key = f"{prefix}.{key}" if prefix else str(key)
|
||||
if isinstance(value, list) and len(value) > 0:
|
||||
seq = []
|
||||
for item in value:
|
||||
if isinstance(item, dict):
|
||||
item_key = next(
|
||||
(k for k in item if k != 'name' and not k.startswith('_')),
|
||||
None
|
||||
)
|
||||
if item_key:
|
||||
seq.append(item_key)
|
||||
else:
|
||||
named = item.get('name', str(item))
|
||||
seq.append(f"named:{named[:20]}")
|
||||
else:
|
||||
seq.append(str(item))
|
||||
if full_key not in contexts:
|
||||
contexts[full_key] = []
|
||||
contexts[full_key].append(seq)
|
||||
for item in value:
|
||||
walk(item, full_key)
|
||||
elif isinstance(value, dict):
|
||||
walk(value, full_key)
|
||||
elif isinstance(value, list):
|
||||
for item in value:
|
||||
walk(item, full_key)
|
||||
elif isinstance(node, list):
|
||||
for item in node:
|
||||
walk(item, prefix)
|
||||
|
||||
walk(data)
|
||||
return contexts
|
||||
|
||||
|
||||
def extract_contexts_from_yaml_string(yaml_string):
|
"""
Extract context sequences from a YAML string.

Args:
    yaml_string: YAML string

Returns:
    dict: {context_key: [sequence1, sequence2, ...]}; empty if the string
    is not valid YAML or parses to None
"""
|
||||
try:
|
||||
data = yaml.safe_load(yaml_string)
|
||||
except yaml.YAMLError:
|
||||
return {}
|
||||
|
||||
if data is None:
|
||||
return {}
|
||||
return extract_contexts_from_yaml(data)
|
||||
|
||||
|
||||
def extract_contexts_from_file(filepath):
|
"""
Extract context sequences from a YAML file.

Args:
    filepath: Path to the YAML file

Returns:
    dict: {context_key: [sequence1, sequence2, ...]}
"""
|
||||
with open(filepath) as f:
|
||||
return extract_contexts_from_yaml_string(f.read())
|
||||
|
||||
|
||||
def reduce_contexts(context_groups):
|
"""
reduce: generalization step after Bex 2007 (algorithm reduce).

Identifies equivalent context models and merges them:
- If two contexts have the same sequence structure, they are merged
  into one generalized context

Args:
    context_groups: dict of {context_key: [sequences]}

Returns:
    dict: {generalized_context: [sequences]} (reduced)
"""
|
||||
if not context_groups:
|
||||
return {}
|
||||
|
||||
signature_map = {}
|
||||
for ctx, seqs in context_groups.items():
|
||||
# Signature = sorted set of (length, first element, last element) tuples
|
||||
sig_parts = []
|
||||
for s in seqs:
|
||||
first = s[0] if s else "∅"
|
||||
last = s[-1] if s else "∅"
|
||||
sig_parts.append((len(s), first, last))
|
||||
signature = tuple(sorted(set(sig_parts)))
|
||||
if signature not in signature_map:
|
||||
signature_map[signature] = []
|
||||
signature_map[signature].append(ctx)
|
||||
|
||||
# Groups with the same signature are merged
|
||||
result = {}
|
||||
for sig, ctx_list in signature_map.items():
|
||||
merged_ctx = "|".join(sorted(ctx_list))
|
||||
merged_seqs = []
|
||||
for ctx in ctx_list:
|
||||
merged_seqs.extend(context_groups[ctx])
|
||||
result[merged_ctx] = merged_seqs
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def iLocal(yaml_documents):
|
"""
iLocal: context inference after Bex 2007.

Args:
    yaml_documents: List of YAML strings or file paths. An entry that
        contains a line break is parsed as a YAML string; any other
        entry is opened as a file path.

Returns:
    dict: {generalized_context: [sequences]}
"""
|
||||
all_contexts = {}
|
||||
for doc in yaml_documents:
|
||||
if '\n' in doc or '\r' in doc:
|
||||
contexts = extract_contexts_from_yaml_string(doc)
|
||||
else:
|
||||
contexts = extract_contexts_from_file(doc)
|
||||
for ctx, seqs in contexts.items():
|
||||
if ctx not in all_contexts:
|
||||
all_contexts[ctx] = []
|
||||
all_contexts[ctx].extend(seqs)
|
||||
|
||||
return reduce_contexts(all_contexts)
|
||||
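The extraction rule used by `extract_contexts_from_yaml` (take the first key of each list item that is neither `name` nor underscore-prefixed) can be tried on a hand-written dict. The sketch below is a simplified, non-recursive stand-in: `first_item_key` and `top_level_sequences` are hypothetical names, the input dict plays the role of `yaml.safe_load()` output, and the repository's handling of nested keys, name-only items and scalar items is omitted.

```python
def first_item_key(item):
    """Return the first key that is not 'name' and does not start with '_'."""
    return next((k for k in item if k != 'name' and not k.startswith('_')), None)


def top_level_sequences(doc):
    """Map each top-level list-valued key to the sequence of its item keys."""
    contexts = {}
    for key, value in doc.items():
        if isinstance(value, list) and value:
            seq = [first_item_key(i) for i in value if isinstance(i, dict)]
            contexts.setdefault(key, []).append([k for k in seq if k])
    return contexts


# Stand-in for the parsed form of a small Ansible-style task file.
doc = {
    'tasks': [
        {'name': 'install nginx', 'apt': {'name': 'nginx'}},
        {'name': 'write config', 'template': {'src': 'nginx.conf.j2'}},
        {'service': {'name': 'nginx', 'state': 'started'}},
    ],
    'vars': {'port': 80},
}
print(top_level_sequences(doc))  # {'tasks': [['apt', 'template', 'service']]}
```

Each resulting sequence is the kind of input the inference functions expect: a list of symbols per observed context.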
105
bex/koa.py
Normal file
|
|
@ -0,0 +1,105 @@
|
|||
"""k-OA — k-Occurrence Automaton (Definition 4.1, arXiv 1004.2372).
|
||||
|
||||
A k-OA is like a SOA but each symbol appears at most k times as a state label.
|
||||
"""
|
||||
|
||||
from .soa import SOA
|
||||
from .expr import strip_k
|
||||
|
||||
|
||||
class KOA(SOA):
|
||||
"""k-Occurrence Automaton.
|
||||
|
||||
Same structure as SOA but each symbol may label up to k states.
|
||||
"""
|
||||
|
||||
def __init__(self, k=1):
|
||||
super().__init__()
|
||||
self.k = k
|
||||
self._symbol_count = {}
|
||||
|
||||
def add_state(self, label):
|
||||
nid = super().add_state(label)
|
||||
sym = strip_k(label)
|
||||
self._symbol_count.setdefault(sym, 0)
|
||||
self._symbol_count[sym] += 1
|
||||
return nid
|
||||
|
||||
def remove_state(self, nid):
|
||||
label = self._label.get(nid)
|
||||
if label:
|
||||
sym = strip_k(label)
|
||||
self._symbol_count[sym] -= 1
|
||||
super().rm_state(nid)
|
||||
|
||||
def count_symbol(self, symbol):
|
||||
return self._symbol_count.get(strip_k(symbol), 0)
|
||||
|
||||
def symbol_ok(self, symbol):
|
||||
return self.count_symbol(symbol) < self.k
|
||||
|
||||
def is_deterministic(self):
|
||||
for n in self._succ:
|
||||
label_map = {}
|
||||
for t in self._succ[n]:
|
||||
lab = self._label.get(t)
|
||||
if lab:
|
||||
base = strip_k(lab)
|
||||
if base in label_map:
|
||||
return False
|
||||
label_map[base] = t
|
||||
return True
|
||||
|
||||
def accept(self, w):
|
||||
"""Accept using base symbols (strip k-markers from state labels)."""
|
||||
cur = {self.src}
|
||||
for sym in w:
|
||||
nxt = set()
|
||||
for s in cur:
|
||||
for t in self._succ.get(s, set()):
|
||||
lab = self._label.get(t)
|
||||
if lab and strip_k(lab) == sym:
|
||||
nxt.add(t)
|
||||
if not nxt:
|
||||
return False
|
||||
cur = nxt
|
||||
return any(self.sink in self._succ.get(s, set()) for s in cur)
|
||||
|
||||
def succ_labeled(self, nid, symbol):
|
||||
return {t for t in self._succ.get(nid, set()) if strip_k(self._label.get(t) or '') == symbol}
|
||||
|
||||
|
||||
def build_complete_koa(sequences, k):
|
||||
"""Build complete k-OA Ck (Definition 4.2, arXiv 1004.2372).
|
||||
|
||||
For each a ∈ Σ(S), exactly k states labeled a (a_1 ... a_k).
|
||||
- src connected to exactly one a_i for each a
|
||||
- Every state has edge to every other state (except src)
|
||||
- src → sink edge (for ε)
|
||||
"""
|
||||
G = KOA(k=k)
|
||||
alphabet = set()
|
||||
for seq in sequences:
|
||||
for token in seq:
|
||||
alphabet.add(token)
|
||||
|
||||
symbol_states = {}
|
||||
for sym in alphabet:
|
||||
state_ids = []
|
||||
for i in range(1, k + 1):
|
||||
nid = G.add_state(f"{sym}_{i}")
|
||||
state_ids.append(nid)
|
||||
G.add_edge(G.src, nid)
|
||||
symbol_states[sym] = state_ids
|
||||
|
||||
all_states = [n for n in G._succ if n not in (G.src, G.sink)]
|
||||
for s in all_states:
|
||||
for t in all_states:
|
||||
if s != t and not G.has_edge(s, t):
|
||||
G.add_edge(s, t)
|
||||
if not G.has_edge(s, G.sink):
|
||||
G.add_edge(s, G.sink)
|
||||
|
||||
G.add_edge(G.src, G.sink)
|
||||
|
||||
return G, symbol_states
|
||||
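The shape of the complete k-OA described in the `build_complete_koa` docstring can be checked by counting edges in a small case. The sketch below uses plain tuples instead of the `KOA` class, and `complete_koa_edges` is a hypothetical helper. Which `a_i` the source connects to is an assumption here (the sketch picks `a_1`), and self-loops are excluded, as in the `s != t` test of the repository code.

```python
def complete_koa_edges(symbols, k):
    """Return (states, edges) of a complete k-OA over `symbols`."""
    states = [f"{a}_{i}" for a in sorted(symbols) for i in range(1, k + 1)]
    edges = set()
    for a in sorted(symbols):
        edges.add(('src', f"{a}_1"))   # assumption: src enters each symbol via a_1
    for s in states:
        for t in states:
            if s != t:
                edges.add((s, t))      # every state reaches every other state
        edges.add((s, 'sink'))         # every state may end the word
    edges.add(('src', 'sink'))         # accepts the empty word
    return states, edges


states, edges = complete_koa_edges({'a', 'b'}, 2)
print(len(states), len(edges))  # 4 19
```

For two symbols and k = 2 this gives 2 source edges, 12 state-to-state edges, 4 sink edges and the source-to-sink edge, 19 in total.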
432
bex/kore.py
Normal file
|
|
@ -0,0 +1,432 @@
|
"""
kore: k-ORE inference (iDRegEx) after Bex et al. 2008/2010.

iDRegEx (Bex 2008):
1. Build a prefix-tree automaton (PTA) from the example sequences
2. Shrink: rewrite rules generalize the automaton
   (simplify → star_rewrite → concat_rewrite → alternation_rewrite)
3. Repair: restore determinism after each rewrite pass
4. Convert: turn the automaton into a regular expression
   (state elimination after Brzozowski & McCluskey)
5. k-ORE check: the expression must satisfy the k-occurrence condition
   (each symbol occurs at most k times)
6. MDL: choose the k with the minimal MDL score
"""
|
||||
|
||||
from .automaton import Automaton
|
||||
from .pta import build_pta
|
||||
from .shrink import shrink
|
||||
from .repair import repair
|
||||
from .mdl import mdl_score
|
||||
|
||||
|
||||
def _state_elimination(G):
|
"""
State elimination after Brzozowski & McCluskey.

Removes all states other than the start and accept states, one at a time.
For each eliminated state q:
- For each pair (p, r) with p→q (label A) and q→r (label B):
  - R_self_q = disjunction of all self-loops on q
  - New label = A · (R_self_q)* · B
  - Add an edge p → r with the new label (or merge with an existing one)

After elimination only the start and accept states remain.
The expression is the disjunction of all paths from start to accept.
"""
|
||||
G = G.copy()
|
||||
eliminated = set()
|
||||
|
||||
# Repeat until only the start and accept states remain
|
||||
changed = True
|
||||
while changed:
|
||||
changed = False
|
||||
# Pick a state to eliminate (neither start nor accept)
|
||||
for q in list(G.nodes):
|
||||
if q == G.start or q in G.accepts:
|
||||
continue
|
||||
if q in eliminated:
|
||||
continue
|
||||
|
||||
reachable = _is_reachable_to_accept(G, q)
|
||||
if not reachable:
|
||||
G.nodes.discard(q)
|
||||
G.accepts.discard(q)
|
||||
G.edges = [e for e in G.edges if e['from'] != q and e['to'] != q]
|
||||
eliminated.add(q)
|
||||
changed = True
|
||||
continue
|
||||
|
||||
incoming = G.incoming(q)
|
||||
outgoing = G.outgoing(q)
|
||||
|
||||
# R_self_q = (a1 | a2 | ...)* over all self-loops on q
|
||||
self_loops = [e for e in outgoing if e['to'] == q]
|
||||
outgoing_no_self = [e for e in outgoing if e['to'] != q]
|
||||
|
||||
if not outgoing_no_self:
|
||||
# Dead end: no outgoing edges (other than self-loops)
# Remove the incoming edges and q
|
||||
for e in incoming:
|
||||
G.remove_edge(e['from'], e['to'], e['label'])
|
||||
G.nodes.discard(q)
|
||||
G.accepts.discard(q)
|
||||
eliminated.add(q)
|
||||
changed = True
|
||||
continue
|
||||
|
||||
if self_loops:
|
||||
self_labels = list(set(e['label'] for e in self_loops))
|
||||
if len(self_labels) == 1:
|
||||
R_self_q = f"({self_labels[0]})*"
|
||||
else:
|
||||
R_self_q = f"({'|'.join(self_labels)})*"
|
||||
else:
|
||||
R_self_q = ""
|
||||
|
||||
# For each pair (p, r): p→q (incoming), q→r (outgoing, r != q)
|
||||
for e_in in incoming:
|
||||
p = e_in['from']
|
||||
if p == q:
|
||||
continue
|
||||
A = e_in['label']
|
||||
|
||||
for e_out in outgoing_no_self:
|
||||
r = e_out['to']
|
||||
B = e_out['label']
|
||||
|
||||
if R_self_q:
|
||||
new_label = f"({A}.{R_self_q}.{B})"
|
||||
else:
|
||||
new_label = f"({A}.{B})"
|
||||
|
||||
# Merge with an existing edge p→r, if there is one
|
||||
existing = [e for e in G.edges if e['from'] == p and e['to'] == r]
|
||||
existing_labels = [e['label'] for e in existing]
|
||||
|
||||
if new_label not in existing_labels and f"({new_label})" not in existing_labels:
|
||||
# Combine with the existing labels via |
|
||||
if existing:
|
||||
old_label = existing[0]['label']
|
||||
merged = f"({old_label}|{new_label})"
|
||||
G.remove_edge(p, r, old_label)
|
||||
G.add_edge(p, r, merged)
|
||||
else:
|
||||
G.add_edge(p, r, new_label)
|
||||
|
||||
# Delete q and all of its edges
|
||||
for e in incoming:
|
||||
G.remove_edge(e['from'], e['to'], e['label'])
|
||||
for e in self_loops:
|
||||
G.remove_edge(e['from'], e['to'], e['label'])
|
||||
for e in outgoing_no_self:
|
||||
G.remove_edge(e['from'], e['to'], e['label'])
|
||||
|
||||
G.nodes.discard(q)
|
||||
G.accepts.discard(q)
|
||||
eliminated.add(q)
|
||||
changed = True
|
||||
break
|
||||
|
||||
return G
|
||||
|
||||
|
||||
def _is_reachable_to_accept(G, q):
|
"""Check whether an accept state is reachable from q."""
|
||||
visited = set()
|
||||
stack = [q]
|
||||
while stack:
|
||||
n = stack.pop()
|
||||
if n in visited:
|
||||
continue
|
||||
visited.add(n)
|
||||
if n in G.accepts:
|
||||
return True
|
||||
for e in G.outgoing(n):
|
||||
stack.append(e['to'])
|
||||
return False
|
||||
|
||||
|
||||
def _extract_expression(G):
|
"""
Extract the regular expression from the eliminated automaton.
After elimination only the start state and the accept states remain.
The expression is the disjunction of all paths from start to accept.
"""
|
||||
if G.start is None:
|
||||
return "∅"
|
||||
|
||||
# Phase 1: State Elimination
|
||||
G_elim = _state_elimination(G)
|
||||
start = G_elim.start
|
||||
|
||||
if not G_elim.accepts:
|
||||
return "∅"
|
||||
|
||||
paths = []
|
||||
outgoing = G_elim.outgoing(start)
|
||||
|
||||
# Special case: the start state is itself accepting
|
||||
if start in G_elim.accepts:
|
||||
# Check for a self-loop
|
||||
self_edges = [e for e in outgoing if e['to'] == start]
|
||||
non_self = [e for e in outgoing if e['to'] != start]
|
||||
|
||||
if not non_self and not self_edges:
|
||||
return "ε"
|
||||
|
||||
if self_edges:
|
||||
self_labels = '|'.join(set(e['label'] for e in self_edges))
|
||||
paths.append(f"({self_labels})*")
|
||||
|
||||
# Besides the start state: edges from start to other accept states
|
||||
for e in non_self:
|
||||
target = e['to']
|
||||
if target in G_elim.accepts:
|
||||
paths.append(e['label'])
|
||||
|
||||
# Paths from start to the accept states
|
||||
for acc in G_elim.accepts:
|
||||
if acc == start:
|
||||
continue
|
||||
# Edge start → acc
|
||||
direct = [e for e in outgoing if e['to'] == acc]
|
||||
for e in direct:
|
||||
paths.append(e['label'])
|
||||
|
||||
self_loops_start = [e for e in G_elim.outgoing(start) if e['to'] == start]
|
||||
|
||||
# Remaining edges: start → x (where x is not an accept state)
|
||||
intermediate = [e for e in outgoing if e['to'] not in G_elim.accepts and e['to'] != start]
|
||||
for e in intermediate:
|
||||
# Follow the path from the intermediate state to an accept state
|
||||
suffix = _follow_path(G_elim, e['to'], G_elim.accepts, set())
|
||||
if suffix:
|
||||
paths.append(f"({e['label']}.{suffix})")
|
||||
|
||||
# Remove duplicates
|
||||
paths = list(set(paths))
|
||||
|
||||
if not paths:
|
||||
return "ε"
|
||||
|
||||
if len(paths) == 1:
|
||||
expr = paths[0]
|
||||
else:
|
||||
expr = f"({'|'.join(paths)})"
|
||||
|
||||
# Simplify: remove redundant parentheses
|
||||
expr = _simplify_expression(expr)
|
||||
|
||||
return expr
|
||||
|
||||
|
||||
def _follow_path(G, start, accepts, visited):
|
"""Find a path from start to an accept state."""
|
||||
if start in accepts:
|
||||
return "ε"
|
||||
if start in visited:
|
||||
return None
|
||||
visited.add(start)
|
||||
|
||||
outgoing = G.outgoing(start)
|
||||
for e in outgoing:
|
||||
if e['to'] == start:
|
||||
continue
|
||||
suffix = _follow_path(G, e['to'], accepts, visited)
|
||||
if suffix is not None:
|
||||
if suffix == "ε":
|
||||
return e['label']
|
||||
else:
|
||||
return f"({e['label']}.{suffix})"
|
||||
return None
|
||||
|
||||
|
||||
def _simplify_expression(expr):
|
"""
Simplify a regular expression.
Removes redundant parentheses, duplicated operators, etc.
"""
|
||||
if not expr or expr in ('ε', '∅'):
|
||||
return expr
|
||||
|
||||
# (ε. X ) → X
|
||||
# (X . ε) → X
|
||||
# ((X)) → X
|
||||
# (a|a) → a
|
||||
|
||||
simplified = expr
|
||||
|
||||
while True:
|
||||
prev = simplified
|
||||
simplified = _simplify_once(simplified)
|
||||
if simplified == prev:
|
||||
break
|
||||
|
||||
return simplified
|
||||
|
||||
|
||||
def _simplify_once(expr):
|
"""One reduction step."""
|
||||
# (ε.X) → X
|
||||
# (X.ε) → X
|
||||
# ((X)) → X
|
||||
# (a|a) → a
|
||||
|
||||
result = expr
|
||||
|
||||
# ((X)) → (X) (collapse doubled parentheses)
import re
result = re.sub(r'\(\(([^()]+)\)\)', r'(\1)', result)
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def validate_k_ore(expr, k_index):
|
"""
Check whether an expression satisfies the k-occurrence condition.
A k-ORE allows each symbol at most once per k-indicator, i.e. in every
conjunct (subexpression without |) each symbol may occur at most k times.

Simplified here: count the occurrences of each distinct token name in the
expression. If a token occurs more than k times, the condition is violated.

Returns:
    bool, str: (satisfied, reason)
"""
|
||||
# Count the occurrences of each token name in the expression
|
||||
token_counts = {}
|
||||
i = 0
|
||||
while i < len(expr):
|
||||
if expr[i].isalnum() or expr[i] in '/_-':
|
||||
j = i
|
||||
while j < len(expr) and (expr[j].isalnum() or expr[j] in '/_-'):
|
||||
j += 1
|
||||
token = expr[i:j]
|
||||
token_counts[token] = token_counts.get(token, 0) + 1
|
||||
i = j
|
||||
else:
|
||||
i += 1
|
||||
|
||||
violations = [t for t, c in token_counts.items() if c > k_index]
|
||||
if violations:
|
||||
return False, f"Tokens {violations} occur more than {k_index} times"
|
||||
return True, "OK"
|
||||
|
||||
|
||||
class kOREInference:
|
"""
iDRegEx: k-ORE inference via PTA → Shrink → Repair → Expression.

After Bex et al. 2008:
- Build the PTA from the sequences
- Shrink: rewrite rules generalize the automaton
- Repair: restore determinism
- Convert: extract the regular expression via state elimination
- Check the k-occurrence condition
- Choose k by MDL
"""
|
||||
|
||||
def __init__(self, k_max=5):
|
||||
self.k_max = k_max
|
||||
|
||||
def infer(self, sequences):
|
"""
Infer the best k-ORE.

Returns:
    (Automaton, expression_string, best_k), or None if no k produced a
    result; (None, "∅", 0) if there are no non-empty sequences
"""
|
||||
sequences = [s for s in sequences if s]
|
||||
if not sequences:
|
||||
return None, "∅", 0
|
||||
|
||||
best_score = float('inf')
|
||||
best_result = None
|
||||
|
||||
for k in range(1, self.k_max + 1):
|
||||
try:
|
||||
auto, expr = self._infer_k_expression(sequences, k)
|
||||
if auto is None:
|
||||
continue
|
||||
score = mdl_score(auto, sequences)
|
||||
if score < best_score:
|
||||
best_score = score
|
||||
best_result = (auto, expr, k)
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
return best_result
|
||||
|
||||
    def _infer_k_expression(self, sequences, k):
        """Run iDRegEx for a specific k."""
        # 1. Build the PTA
        pta = build_pta(sequences)

        # 2. Shrink
        shrunk = shrink(pta, max_iterations=20)

        # 3. Repair
        repaired = repair(shrunk)

        # 4. Extract the expression
        expr = _extract_expression(repaired)

        # 5. k-ORE check
        valid, _ = validate_k_ore(expr, k)
        if not valid:
            expr = self._generalize_to_k_ore(expr, k)

        return repaired, expr
|
||||
|
||||
    def _generalize_to_k_ore(self, expr, k):
        """
        Generalize the expression to a k-ORE.

        If a token t occurs more than k times:
        - Replace repetitions with t+ or t*
        """
        import re

        # Simple heuristic: extract tokens, count them, replace
        result = expr
        token_counts = {}
        i = 0
        while i < len(result):
            if result[i].isalnum() or result[i] in '/_-':
                j = i
                while j < len(result) and (result[j].isalnum() or result[j] in '/_-'):
                    j += 1
                token = result[i:j]
                token_counts[token] = token_counts.get(token, 0) + 1
                i = j
            else:
                i += 1

        for token, count in token_counts.items():
            if count > k:
                # Replace the first "token.token" with "token+". The escaped
                # dot is the concatenation operator, so nothing may sit
                # between it and the second token.
                pattern = re.escape(token) + r'\.' + re.escape(token)
                replacement = f"{token}+"
                result = re.sub(pattern, replacement, result, count=1)
                break

        return result
|
||||
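The k-occurrence condition checked in this module can be tried out in isolation. The following is a minimal sketch, not the module's own code: `TOKEN`, `occurrence_counts` and `is_k_ore` are hypothetical names, and the token alphabet (letters, digits, `/`, `_`, `-`) is an assumption taken from the scanning loop above.

```python
import re
from collections import Counter

# Assumed token alphabet: ASCII letters, digits, '/', '_' and '-'.
TOKEN = re.compile(r"[A-Za-z0-9/_-]+")


def occurrence_counts(expr):
    """Count how often each token name occurs in an expression string."""
    return Counter(TOKEN.findall(expr))


def is_k_ore(expr, k):
    """Return True if no token occurs more than k times in expr."""
    return all(count <= k for count in occurrence_counts(expr).values())


expr = "a.b.(a|c)+"
print(occurrence_counts(expr)["a"])  # token 'a' occurs twice
print(is_k_ore(expr, 1))             # violates the single-occurrence condition
print(is_k_ore(expr, 2))             # satisfies the 2-occurrence condition
```

Using `Counter` keeps the counting to one pass over the expression, which is also what the scanning loop above does by hand.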
46
bex/marking.py
Normal file
|
|
@ -0,0 +1,46 @@
|
|||
"""Marking — Convert k-OA to SOA over Σ^(k) (Definition 4.4, arXiv 1004.2372)."""
|
||||
|
||||
from .soa import SOA
|
||||
from .expr import strip_k
|
||||
|
||||
|
||||
def mark_koa(G):
|
||||
"""
|
||||
Mark a k-OA G as a SOA over Σ^(k).
|
||||
|
||||
Process nodes in arbitrary order. For the i-th occurrence of label a,
|
||||
replace by a^(i) (represented as "a_i").
|
||||
|
||||
Returns a SOA H over Σ^(k) such that L(G) = strip(L(H)).
|
||||
"""
|
||||
H = SOA()
|
||||
H.src = G.src
|
||||
H.sink = G.sink
|
||||
H._succ = {n: set(succ) for n, succ in G._succ.items()}
|
||||
H._pred = {n: set(pred) for n, pred in G._pred.items()}
|
||||
H._label = {}
|
||||
H._next = G._next
|
||||
|
||||
counts = {}
|
||||
for n in G._succ:
|
||||
lab = G._label.get(n)
|
||||
if lab and lab not in ('ε', '∅') and n not in (G.src, G.sink):
|
||||
sym = strip_k(lab)
|
||||
counts[sym] = counts.get(sym, 0) + 1
|
||||
H._label[n] = f"{sym}_{counts[sym]}"
|
||||
elif n in (G.src, G.sink):
|
||||
H._label[n] = None
|
||||
else:
|
||||
H._label[n] = lab
|
||||
|
||||
return H
|
||||
|
||||
|
||||
def strip_expression(expr):
|
||||
"""Strip k-ORE markers from expression: a_i → a.
|
||||
|
||||
Returns expression over original alphabet Σ.
|
||||
"""
|
||||
import re
|
||||
result = re.sub(r'(_\d+)', '', expr)
|
||||
return result
|
||||
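The marking and stripping steps are inverse to each other on the label level. A minimal sketch of that round trip, assuming plain label lists instead of the `SOA` class (`mark` and `strip_markers` are hypothetical helper names):

```python
import re


def mark(labels):
    """Rename the i-th occurrence of label a to 'a_i'."""
    counts = {}
    marked = []
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
        marked.append(f"{label}_{counts[label]}")
    return marked


def strip_markers(expr):
    """Remove the '_i' markers again: a_i -> a."""
    return re.sub(r"_\d+", "", expr)


marked = mark(["a", "b", "a"])
print(marked)                         # every label is now unique
print(strip_markers("a_1.b_1.a_2+"))  # back over the original alphabet
```

One caveat of this encoding: a symbol whose own name ends in `_<digits>` cannot be told apart from a marker when stripping.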
143
bex/mdl.py
Normal file
|
|
@ -0,0 +1,143 @@
|
|||
"""MDL scoring for iDRegEx (Algorithm 4, arXiv 1004.2372)."""
|
||||
|
||||
import math
|
||||
from .expr import alphabet
|
||||
|
||||
|
||||
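def model_cost(expr):
    """|r|: number of alphabet symbol occurrences in expression."""
    import re
    # Drop k-ORE markers (a_1 → a) so every marked occurrence is one symbol.
    cleaned = re.sub(r'_\d+', '', expr)
    # Split on operators so a multi-character symbol counts once, and
    # ignore ε / ∅, which are not alphabet symbols.
    symbols = [tok for tok in re.split(r'[+?*()|.\s]+', cleaned)
               if tok and tok not in ('ε', '∅')]
    return len(symbols)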
|
||||
|
||||
|
||||
def lang_size(expr, n=None):
|
||||
"""Estimate |L(r)≤n| — number of words of length ≤ n in L(r).
|
||||
|
||||
Simple approximation based on expression structure.
|
||||
"""
|
||||
if not expr or expr == '∅':
|
||||
return 0
|
||||
if expr == 'ε':
|
||||
return 1
|
||||
|
||||
    if n is None:
        n = 2 * model_cost(expr) + 1
|
||||
|
||||
total = 0
|
||||
for length in range(n + 1):
|
||||
total += _count_words_fast(expr, length)
|
||||
return total
|
||||
|
||||
|
||||
def _count_words_fast(expr, length):
|
||||
if length < 0:
|
||||
return 0
|
||||
if not expr or expr == '∅':
|
||||
return 0
|
||||
if expr == 'ε':
|
||||
return 1 if length == 0 else 0
|
||||
|
||||
alpha = alphabet(expr)
|
||||
if expr in alpha:
|
||||
return 1 if length == 1 else 0
|
||||
|
||||
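    # Only treat the expression as a plus-closure when the operator is its
    # final character; a '+' elsewhere (e.g. "a+.b") belongs to a
    # sub-expression and is handled after splitting.
    if expr.endswith('+'):
        inner = expr.rstrip('+')
        min_count = 1
        if inner.endswith('?'):
            # (x?)+ is equivalent to x*, which also admits zero repetitions.
            inner = inner[:-1]
            min_count = 0
        return _count_star(inner, length, min_count=min_count)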
|
||||
|
||||
if expr.endswith('?'):
|
||||
inner = expr[:-1]
|
||||
return _count_words_fast(inner, length) + (1 if length == 0 else 0)
|
||||
|
||||
if expr.startswith('(') and '|' in expr:
|
||||
parts = _split_disj(expr[1:-1])
|
||||
return sum(_count_words_fast(p.strip(), length) for p in parts)
|
||||
|
||||
if '.' in expr:
|
||||
parts = expr.split('.')
|
||||
return _count_concat(parts, length, 0)
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
def _count_concat(parts, length, idx):
|
||||
if idx >= len(parts):
|
||||
return 1 if length == 0 else 0
|
||||
total = 0
|
||||
for take in range(length + 1):
|
||||
cnt = _count_words_fast(parts[idx], take)
|
||||
if cnt:
|
||||
total += cnt * _count_concat(parts, length - take, idx + 1)
|
||||
return total
|
||||
|
||||
|
||||
def _count_star(inner, length, min_count):
|
||||
total = 0
|
||||
for rep in range(min_count, length + 1):
|
||||
total += _count_repeat(inner, rep, length)
|
||||
return total
|
||||
|
||||
|
||||
def _count_repeat(inner, rep, length):
|
||||
if rep == 0:
|
||||
return 1 if length == 0 else 0
|
||||
total = 0
|
||||
for take in range(length + 1):
|
||||
cnt = _count_words_fast(inner, take)
|
||||
if cnt:
|
||||
total += cnt * _count_repeat(inner, rep - 1, length - take)
|
||||
return total
|
||||
|
||||
|
||||
def _split_disj(s):
|
||||
depth = 0
|
||||
parts = []
|
||||
cur = []
|
||||
for ch in s:
|
||||
if ch == '(':
|
||||
depth += 1
|
||||
cur.append(ch)
|
||||
elif ch == ')':
|
||||
depth -= 1
|
||||
cur.append(ch)
|
||||
elif ch == '|' and depth == 0:
|
||||
parts.append(''.join(cur))
|
||||
cur = []
|
||||
else:
|
||||
cur.append(ch)
|
||||
parts.append(''.join(cur))
|
||||
return parts
|
||||
|
||||
|
||||
def data_cost(expr, sequences):
|
||||
    """MDL data cost, based on Σ_i log₂(|L^{=i}(r)| / |S^{=i}|).

    Simplified form: for each word w in S, add log₂ of the number of words
    of length |w| in L(r). Words longer than n = 2·|r| + 1 are skipped.
    """
|
||||
n = 2 * model_cost(expr) + 1
|
||||
total_cost = 0.0
|
||||
for seq in sequences:
|
||||
length = len(seq)
|
||||
if length <= n:
|
||||
lang_at_len = _count_words_fast(expr, length)
|
||||
            if lang_at_len > 0:
                total_cost += math.log2(lang_at_len)
|
||||
return total_cost
|
||||
|
||||
|
||||
def mdl_score(expr, sequences):
|
||||
"""MDL = model cost + data cost."""
|
||||
model = model_cost(expr)
|
||||
data = data_cost(expr, sequences)
|
||||
return model + data
|
||||
|
||||
|
||||
# For backward compatibility
|
||||
class MDLScorer:
|
||||
def score(self, expr, sequences):
|
||||
return mdl_score(expr, sequences)
|
||||
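To see why the data cost favours tight expressions, it helps to count the admitted words by brute force. The sketch below is an illustration under loud assumptions: it uses Python's `re` syntax instead of this repo's dot notation, single-character symbols only, and the hypothetical helpers `words_of_length` and `data_cost_bits`.

```python
import itertools
import math
import re


def words_of_length(pattern, alphabet, length):
    """Count words of exactly `length` symbols over `alphabet` matched by `pattern`."""
    compiled = re.compile(pattern)
    return sum(
        1
        for word in itertools.product(alphabet, repeat=length)
        if compiled.fullmatch("".join(word))
    )


def data_cost_bits(pattern, alphabet, sample):
    """Per sample word, add log2 of the number of same-length words the pattern admits."""
    cost = 0.0
    for word in sample:
        admitted = words_of_length(pattern, alphabet, len(word))
        if admitted > 0:
            cost += math.log2(admitted)
    return cost


sample = ["ab", "ac"]
tight = data_cost_bits(r"a(b|c)", "abc", sample)  # 2 words of length 2 admitted
loose = data_cost_bits(r"[abc]*", "abc", sample)  # 9 words of length 2 admitted
print(tight, loose)
```

Both patterns accept the sample, but the over-general one pays more bits per word, which is the pressure MDL uses to pick between candidate expressions. The enumeration is exponential in the word length, so it only suits small examples.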
62
bex/pta.py
Normal file
|
|
@ -0,0 +1,62 @@
|
|||
"""
pta: Prefix-Tree Automaton (PTA) construction.

After Bex et al. 2008/2010: the PTA is the initial automaton, constructed
from the positive example sequences (token sequences).

Each sequence is mapped to a path in the trie:
- Root = start state
- Every common prefix is shared (as in a trie)
- The last state of each sequence is marked as accepting

The PTA is deterministic and accepts exactly the given sequences.
It is the starting point for SORE/CHARE inference via shrink rewrites.
"""
|
||||
|
||||
from .automaton import Automaton
|
||||
|
||||
|
||||
def build_pta(sequences):
|
||||
    """
    Construct the prefix-tree automaton from a list of token sequences.

    After Bex et al. 2008/2010, PTA algorithm:
    - Initialize with start state q0
    - For each sequence w = a1...an:
        - Start in q0
        - For each ai: follow the edge (q, ai) if it exists,
          otherwise create a new state q' and an edge (q, q', ai)
        - Mark the final state as accepting

    Args:
        sequences: list of token lists (one per YAML document)

    Returns:
        Automaton: PTA for the given sequences

    Example:
        >>> build_pta([["apt", "service"], ["apt", "template", "service"]])
        Automaton(nodes=5, edges=4, start=0, accepts={2, 4})
    """
|
||||
automaton = Automaton(start=0)
|
||||
automaton.add_node(0)
|
||||
|
||||
next_id = 1
|
||||
|
||||
for seq in sequences:
|
||||
current = 0
|
||||
for token in seq:
|
||||
found = False
|
||||
for (to, label) in automaton.successors(current):
|
||||
if label == token:
|
||||
current = to
|
||||
found = True
|
||||
break
|
||||
if not found:
|
||||
new_node = next_id
|
||||
next_id += 1
|
||||
automaton.add_edge(current, new_node, token)
|
||||
current = new_node
|
||||
automaton.add_accept(current)
|
||||
|
||||
return automaton
|
||||
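The prefix-tree construction can be reproduced without the `Automaton` class. This is a minimal sketch under that assumption: `build_prefix_tree` is a hypothetical stand-in that stores transitions in a plain dict keyed by `(state, token)`.

```python
def build_prefix_tree(sequences):
    """Build a prefix tree; returns (transitions, accepting_states).

    transitions maps (state, token) -> state; state 0 is the root.
    """
    transitions = {}
    accepting = set()
    next_id = 1
    for seq in sequences:
        state = 0
        for token in seq:
            key = (state, token)
            if key not in transitions:
                # No edge for this token yet: allocate a fresh state.
                transitions[key] = next_id
                next_id += 1
            state = transitions[key]
        # The state reached after the last token accepts the sequence.
        accepting.add(state)
    return transitions, accepting


transitions, accepting = build_prefix_tree(
    [["apt", "service"], ["apt", "template", "service"]]
)
print(transitions)
print(accepting)
```

The two sequences share the `apt` prefix, so the tree has five states (0 to 4), four edges, and the accepting states 2 and 4. A dict lookup per token also avoids the linear scan over successors.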
167
bex/repair.py
Normal file
|
|
@ -0,0 +1,167 @@
|
|||
"""
repair: determinism repair after Bex 2010.

When the rewrite rules (shrink) produce an automaton that is no longer
deterministic (e.g. two edges leaving s with the same label A), repair
must restructure the automaton so that it becomes deterministic again,
without changing the accepted language.

Bex 2010, Section 4.2.4 (Repair):
repair(G) detects non-determinism and uses two strategies:
1. Label disambiguation: if edges (s→u, A) and (s→v, A) exist,
   check whether u and v can be merged.
2. Automaton splitting: if a merge is not possible (different futures),
   split state s into s1, s2 with disjoint label sets.

The repair function is called after every shrink pass.
"""
|
||||
|
||||
from .automaton import Automaton
|
||||
|
||||
|
||||
def detect_conflicts(G):
|
||||
    """
    Detect non-determinism in the automaton.

    Returns: list of (state, label, targets) for every label that
    leads from state to more than one target.
    """
|
||||
conflicts = []
|
||||
for node in G.nodes:
|
||||
label_map = {}
|
||||
for e in G.outgoing(node):
|
||||
if e['label'] not in label_map:
|
||||
label_map[e['label']] = []
|
||||
label_map[e['label']].append(e['to'])
|
||||
for label, targets in label_map.items():
|
||||
if len(targets) > 1:
|
||||
conflicts.append((node, label, targets))
|
||||
return conflicts
|
||||
|
||||
|
||||
def merge_targets(G, state, label, targets):
|
||||
    """
    Try to merge the targets.
    If all targets are structurally equivalent (same outgoing labels),
    they can be merged into one.
    """
|
||||
future_sets = []
|
||||
for t in targets:
|
||||
futures = {(e['to'], e['label']) for e in G.outgoing(t)}
|
||||
future_sets.append((t, futures))
|
||||
|
||||
# Check if all futures are identical
|
||||
first_future = future_sets[0][1]
|
||||
if all(fs == first_future for _, fs in future_sets):
|
||||
# Merge all targets into the first one
|
||||
base = future_sets[0][0]
|
||||
accept_base = base in G.accepts
|
||||
for t, _ in future_sets[1:]:
|
||||
if t in G.accepts:
|
||||
G.add_accept(base)
|
||||
if base != t:
|
||||
for e in G.incoming(t):
|
||||
if e['from'] != state:
|
||||
G.add_edge(e['from'], base, e['label'])
|
||||
G.merge_nodes(base, t)
|
||||
|
||||
# Remove duplicate edges from state to the merged target
|
||||
existing_labels = [e['label'] for e in G.outgoing(state) if e['to'] == base]
|
||||
if label in existing_labels:
|
||||
existing_labels.remove(label)
|
||||
if label not in existing_labels:
|
||||
G.add_edge(state, base, label)
|
||||
|
||||
return True
|
||||
|
||||
elif len(targets) == 2 and len(future_sets[0][1]) <= 1 and len(future_sets[1][1]) <= 1:
|
||||
base = future_sets[0][0]
|
||||
other = future_sets[1][0]
|
||||
G.merge_nodes(base, other)
|
||||
G.add_edge(state, base, label)
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
|
||||
def split_automaton(G, state, label, targets):
|
||||
    """
    Split the state 'state' into several copies, one per target.
    Each copy receives the incoming edges of state that belong to the
    respective target label.
    """
|
||||
# Find the highest node ID
|
||||
max_id = max(G.nodes) if G.nodes else 0
|
||||
|
||||
incoming = G.incoming(state)
|
||||
outgoing = G.outgoing(state)
|
||||
|
||||
label_to_target = {}
|
||||
for e in outgoing:
|
||||
label_to_target[e['label']] = e['to']
|
||||
|
||||
    # The targets are all reached via the conflicting label
|
||||
if len(targets) == 2 and len(label_to_target) == 2:
|
||||
new_node = max_id + 1
|
||||
G.add_node(new_node)
|
||||
|
||||
target1, target2 = targets[0], targets[1]
|
||||
|
||||
for e in list(G.incoming(state)):
|
||||
if e['from'] == state:
|
||||
continue
|
||||
G.add_edge(e['from'], new_node, e['label'])
|
||||
|
||||
label_for_other = [k for k, v in label_to_target.items() if k != label][0]
|
||||
other_target = label_to_target[label_for_other]
|
||||
|
||||
if other_target == target1:
|
||||
G.add_edge(new_node, target1, label)
|
||||
elif other_target == target2:
|
||||
G.add_edge(state, target1, label)
|
||||
else:
|
||||
G.add_edge(state, target1, label)
|
||||
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
|
||||
def repair(G):
|
||||
    """
    repair: restore determinism after rewrite operations.

    After Bex 2010, repair algorithm:
    1. Detect non-determinism (detect_conflicts)
    2. For each conflict:
       a. Try merge_targets (merge structurally equivalent targets)
       b. If that is not possible: split_automaton (split the state)
    3. Repeat until no conflicts remain
    """
|
||||
max_iterations = 50
|
||||
for _ in range(max_iterations):
|
||||
conflicts = detect_conflicts(G)
|
||||
if not conflicts:
|
||||
break
|
||||
|
||||
for state, label, targets in conflicts:
|
||||
if len(targets) < 2:
|
||||
continue
|
||||
|
||||
for e in G.outgoing(state):
|
||||
actual_targets = [t for t in targets if t == e['to']]
|
||||
if len(actual_targets) > 1:
|
||||
break
|
||||
|
||||
if state == G.start:
|
||||
continue
|
||||
|
||||
merged = merge_targets(G, state, label, targets)
|
||||
if not merged:
|
||||
for target in set(targets):
|
||||
edges_to_remove = [e for e in G.outgoing(state)
|
||||
if e['label'] == label and e['to'] == target]
|
||||
for e in edges_to_remove[1:]:
|
||||
G.remove_edge(e['from'], e['to'], e['label'])
|
||||
|
||||
return G
|
||||
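Conflict detection boils down to grouping edges by `(source, label)`. A minimal sketch, assuming edges are given as plain `(source, target, label)` tuples rather than through the `Automaton` API (`find_conflicts` is a hypothetical name):

```python
from collections import defaultdict


def find_conflicts(edges):
    """Return (state, label, targets) for every label leaving a state towards 2+ targets."""
    targets_by_key = defaultdict(set)
    for source, target, label in edges:
        targets_by_key[(source, label)].add(target)
    return [
        (source, label, sorted(targets))
        for (source, label), targets in targets_by_key.items()
        if len(targets) > 1
    ]


edges = [(0, 1, "apt"), (0, 2, "apt"), (1, 3, "service"), (2, 3, "service")]
conflicts = find_conflicts(edges)
print(conflicts)  # state 0 is non-deterministic on 'apt'
```

In this example states 1 and 2 have identical outgoing edges, which is the situation where merging the targets resolves the conflict without changing the language.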
111
bex/role_grammar.py
Normal file
|
|
@ -0,0 +1,111 @@
|
|||
"""Extract Ansible role task module sequences and learn per-group grammars."""
|
||||
|
||||
from pathlib import Path
|
||||
import yaml
|
||||
from collections import defaultdict
|
||||
|
||||
from .crx import CRX
|
||||
from .expr import strip_k
|
||||
|
||||
|
||||
IGNORE_MODULES = frozenset({'name', 'tags', 'when', 'register', 'no_log',
|
||||
'changed_when', 'failed_when', 'ignore_errors',
|
||||
'run_once', 'delegate_to', 'loop', 'loop_control',
|
||||
'until', 'retries', 'delay', 'poll', 'async',
|
||||
'become', 'become_user', 'become_flags',
|
||||
'check_mode', 'diff', 'environment',
|
||||
'vars', 'notify', 'args',
|
||||
'block', 'rescue', 'always', 'include_tasks'})
|
||||
|
||||
|
||||
def extract_module_name(task):
    """Extract the Ansible module name from a task dict.

    The module is the first key that is NOT a known non-module key.
    For block/rescue/always entries, returns a (possibly nested) list with
    the results for the nested tasks; returns None if no module is found.
    """
    if not isinstance(task, dict):
        return None
    # block/rescue/always contain nested tasks; a single entry may carry
    # several of these keys, so collect the nested tasks from all of them.
    if any(key in task for key in ('block', 'rescue', 'always')):
        nested_results = []
        for key in ('block', 'rescue', 'always'):
            nested = task.get(key)
            if isinstance(nested, list):
                nested_results.extend(extract_module_name(t) for t in nested)
        return nested_results or None
    # Find the module key; 'name' and the other meta keys are in IGNORE_MODULES
    for key, value in task.items():
        if key in IGNORE_MODULES:
            continue
        if isinstance(value, (dict, list, str, bool, int, float)):
            # This key is the module name (short name or FQCN)
            return strip_k(key)
    return None
|
||||
|
||||
|
||||
def flatten_nested(seq):
|
||||
"""Flatten nested lists into a single list."""
|
||||
result = []
|
||||
for item in seq:
|
||||
if isinstance(item, list):
|
||||
result.extend(flatten_nested(item))
|
||||
elif item is not None and item != 'skip':
|
||||
result.append(item)
|
||||
return result
|
||||
|
||||
|
||||
def get_role_category(role_name):
|
||||
"""Extract category from role name like deploy_foo → deploy."""
|
||||
parts = role_name.split('_')
|
||||
if len(parts) >= 2:
|
||||
return parts[0]
|
||||
return 'other'
|
||||
|
||||
|
||||
def load_role_module_sequence(role_dir):
|
||||
"""Load a role's task file and extract the module sequence."""
|
||||
task_file = role_dir / 'tasks' / 'main.yml'
|
||||
if not task_file.exists():
|
||||
return None, None
|
||||
with open(task_file) as f:
|
||||
data = yaml.safe_load(f)
|
||||
if not isinstance(data, list):
|
||||
return None, None
|
||||
|
||||
modules = []
|
||||
for task in data:
|
||||
result = extract_module_name(task)
|
||||
if isinstance(result, list):
|
||||
modules.extend(flatten_nested(result))
|
||||
elif result is not None:
|
||||
modules.append(result)
|
||||
|
||||
return role_dir.name, modules
|
||||
|
||||
|
||||
def collect_all_role_sequences(roles_dir='roles'):
    """Collect module sequences from all roles, grouped by category."""
    by_category = defaultdict(list)
    all_roles = []
    # The glob yields each role's tasks/main.yml; the role directory is
    # two levels above that file.
    for task_file in sorted(Path(roles_dir).glob('*/tasks/main.yml')):
        role_dir = task_file.parent.parent
        role_name, seq = load_role_module_sequence(role_dir)
        if seq:
            cat = get_role_category(role_name)
            by_category[cat].append((role_name, seq))
            all_roles.append((role_name, seq))
    return all_roles, by_category
|
||||
|
||||
|
||||
def learn_grammar(sequences):
    """Run CRX on a list of sequences; returns 'ε' if there are none."""
    if not sequences:
        return 'ε'
    return CRX().infer(list(sequences))
|
||||
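The extraction logic can be exercised on in-memory task dicts, without reading YAML from disk. The sketch below is a simplified stand-in, not this module's API: `META_KEYS` is a deliberately short, assumed subset of the ignored keys, and `module_sequence` is a hypothetical helper that flattens block/rescue/always directly.

```python
# Assumed, shortened list of non-module task keys (illustration only).
META_KEYS = frozenset({"name", "when", "tags", "register", "notify", "become"})
NESTING_KEYS = ("block", "rescue", "always")


def module_sequence(tasks):
    """Return the flat list of module names used by a list of task dicts."""
    modules = []
    for task in tasks:
        if not isinstance(task, dict):
            continue
        nested = [task[k] for k in NESTING_KEYS if isinstance(task.get(k), list)]
        if nested:
            # Descend into every nested task list, in block/rescue/always order.
            for child_tasks in nested:
                modules.extend(module_sequence(child_tasks))
            continue
        for key in task:
            if key not in META_KEYS:
                modules.append(key)  # first non-meta key is taken as the module
                break
    return modules


tasks = [
    {"name": "Install", "apt": {"name": "nginx"}, "become": True},
    {"block": [{"template": {"src": "a.j2", "dest": "/etc/a"}}],
     "rescue": [{"debug": {"msg": "failed"}}]},
    {"name": "Start", "service": {"name": "nginx", "state": "started"}},
]
print(module_sequence(tasks))
```

The resulting list is the kind of token sequence that is handed to the grammar learner, one sequence per role.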
224
bex/rwr0.py
Normal file
|
|
@ -0,0 +1,224 @@
|
|||
"""RWR₀ — Algorithm 6 (TODS 2010), conference version rules (Figure 10 + Figure 13).
|
||||
|
||||
Precedence: CONCATENATION > DISJUNCTION > SELF-LOOP > OPTIONAL
|
||||
Repair precedence: ENABLE-DISJUNCTION > ENABLE-OPTIONAL-1 > ENABLE-OPTIONAL-2
|
||||
|
||||
Conditions checked on ε-closure G* (Definition 25).
|
||||
Used as rwr²₁ in arXiv 1004.2372 for k>1.
|
||||
"""
|
||||
|
||||
from .soa import SOA
|
||||
from .expr import concat, disj, star, optional
|
||||
|
||||
|
||||
def _find_concat(G, Gs):
|
||||
    """Figure 10 CONCATENATION rule, checked on G*.

    Check four variants with priority: r·s > r·s?|r?·s > r?·s?
    r·s:    Succ(r)={s}       ∧ Pred(s)={r}
    r·s?:   Succ(r)={s,sink}  ∧ Pred(s)={r}
    r?·s:   Succ(r)={s}       ∧ Pred(s)={r,sink}
    r?·s?:  Succ(r)={s,sink}  ∧ Pred(s)={r,sink}
    """
|
||||
st = G.states()
|
||||
# Variant 1: r·s (highest priority — check all pairs first)
|
||||
for r in st:
|
||||
for s in st:
|
||||
if r == s:
|
||||
continue
|
||||
if Gs.succ(r) == {s} and G.pred(s) == {r}:
|
||||
return r, s, concat(G.label(r), G.label(s))
|
||||
# Variants 2-3: r?·s and r·s?
|
||||
for r in st:
|
||||
for s in st:
|
||||
if r == s:
|
||||
continue
|
||||
Sr = Gs.succ(r)
|
||||
Ps = G.pred(s)
|
||||
if Sr == {s, G.sink} and Ps == {r}:
|
||||
return r, s, concat(G.label(r), optional(G.label(s)))
|
||||
if Sr == {s} and Ps == {r, G.sink}:
|
||||
return r, s, concat(optional(G.label(r)), G.label(s))
|
||||
# Variant 4: r?·s?
|
||||
for r in st:
|
||||
for s in st:
|
||||
if r == s:
|
||||
continue
|
||||
if Gs.succ(r) == {s, G.sink} and G.pred(s) == {r, G.sink}:
|
||||
return r, s, concat(optional(G.label(r)), optional(G.label(s)))
|
||||
return None, None, None
|
||||
|
||||
|
||||
def _find_disj(G, Gs):
|
||||
"""Figure 10 DISJUNCTION rule, checked on G*.
|
||||
|
||||
Pred⁺(r)=Pred⁺(s) ∧ Succ⁺(r)=Succ⁺(s)
|
||||
"""
|
||||
st = G.states()
|
||||
for i, r in enumerate(st):
|
||||
for s in st[i + 1:]:
|
||||
if G._pred_plus(r) == G._pred_plus(s) and G._succ_plus(r) == G._succ_plus(s):
|
||||
return r, s, disj(G.label(r), G.label(s))
|
||||
return None, None, None
|
||||
|
||||
|
||||
def _find_selfloop(G, Gs):
|
||||
"""Figure 10 SELF-LOOP rule. r ∈ Succ(r) in G (not G*)."""
|
||||
for r in G.states():
|
||||
if G.has_edge(r, r):
|
||||
return r, star(G.label(r))
|
||||
return None, None
|
||||
|
||||
|
||||
def _find_optional(G):
|
||||
"""Figure 10 OPTIONAL rule. G contains exactly one non-special node besides src, sink.
|
||||
Only applies when G is not already final (avoids infinite loop)."""
|
||||
if G.is_final():
|
||||
return None, None
|
||||
if G.num_non_special() == 1:
|
||||
r = G.states()[0]
|
||||
return r, optional(G.label(r))
|
||||
return None, None
|
||||
|
||||
|
||||
def _try_ed(G):
|
||||
"""ENABLE-DISJUNCTION (Figure 13). When Pred(r)=Pred(s) but Succ(r)≠Succ(s):
|
||||
add edges to make Succ(r)=Succ(s). Or symmetric for Pred.
|
||||
"""
|
||||
st = G.states()
|
||||
for i, r in enumerate(st):
|
||||
for s in st[i + 1:]:
|
||||
if G._pred_plus(r) == G._pred_plus(s) and G._succ_plus(r) != G._succ_plus(s):
|
||||
merged = G._succ_plus(r) | G._succ_plus(s)
|
||||
changed = False
|
||||
for t in merged - G._succ_plus(r):
|
||||
if not G.has_edge(r, t):
|
||||
G.add_edge(r, t)
|
||||
changed = True
|
||||
for t in merged - G._succ_plus(s):
|
||||
if not G.has_edge(s, t):
|
||||
G.add_edge(s, t)
|
||||
changed = True
|
||||
if changed:
|
||||
return True
|
||||
if G._succ_plus(r) == G._succ_plus(s) and G._pred_plus(r) != G._pred_plus(s):
|
||||
merged = G._pred_plus(r) | G._pred_plus(s)
|
||||
changed = False
|
||||
for p in merged - G._pred_plus(r):
|
||||
if not G.has_edge(p, r):
|
||||
G.add_edge(p, r)
|
||||
changed = True
|
||||
for p in merged - G._pred_plus(s):
|
||||
if not G.has_edge(p, s):
|
||||
G.add_edge(p, s)
|
||||
changed = True
|
||||
if changed:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def _try_eo1(G):
|
||||
"""ENABLE-OPTIONAL-1 (Figure 13). If Succ(r)={s,sink} but Pred(s) has other
|
||||
predecessors besides r, add Pred(s) to r's predecessors.
|
||||
"""
|
||||
for r in G.states():
|
||||
Sr = G.succ(r)
|
||||
if G.sink in Sr and len(Sr) == 2:
|
||||
s = next(x for x in Sr if x != G.sink)
|
||||
if len(G.pred(s)) > 1:
|
||||
changed = False
|
||||
for p in G.pred(s) - {r}:
|
||||
if not G.has_edge(p, r):
|
||||
G.add_edge(p, r)
|
||||
changed = True
|
||||
if changed:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def _try_eo2(G):
|
||||
"""ENABLE-OPTIONAL-2 (Figure 13). If Pred(s)={r,sink} but Succ(r) has other
|
||||
successors besides s, add Succ(r) to s's successors.
|
||||
"""
|
||||
for s in G.states():
|
||||
Ps = G.pred(s)
|
||||
if G.sink in Ps and len(Ps) == 2:
|
||||
r = next(x for x in Ps if x != G.sink)
|
||||
if len(G.succ(r)) > 1:
|
||||
changed = False
|
||||
for t in G.succ(r) - {s}:
|
||||
if not G.has_edge(s, t):
|
||||
G.add_edge(s, t)
|
||||
changed = True
|
||||
if changed:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def rwr0(G):
|
||||
"""
|
||||
|———— Algorithm 6: RWR₀ ————|
|
||||
Input: SOA G
|
||||
Output: SORE r (or ∅ on failure)
|
||||
|
||||
1: if sink not reachable: return ∅
|
||||
2: if E(G)={(src,sink)}: return ε
|
||||
3: while not done:
|
||||
4: if rewrite (Figure 10) applicable:
|
||||
5: apply with precedence: CONCAT > DISJ > SELF-LOOP > OPTIONAL
|
||||
6: elif repair (Figure 13) applicable:
|
||||
7: apply with precedence: ED > EO1 > EO2
|
||||
8: else: done
|
||||
9: if final: return r else return ∅
|
||||
"""
|
||||
G = G.copy()
|
||||
if not G.sink_reachable():
|
||||
return '∅'
|
||||
if G.num_non_special() == 0 and G.has_edge(G.src, G.sink):
|
||||
return 'ε'
|
||||
|
||||
done = False
|
||||
while not done:
|
||||
applied = False
|
||||
Gs = G.epsilon_closure()
|
||||
|
||||
r, s, lab = _find_concat(G, Gs)
|
||||
if r is not None:
|
||||
G.contract(r, s, lab)
|
||||
applied = True
|
||||
|
||||
if not applied:
|
||||
Gs = G.epsilon_closure()
|
||||
r, s, lab = _find_disj(G, Gs)
|
||||
if r is not None:
|
||||
G.contract(r, s, lab)
|
||||
applied = True
|
||||
|
||||
if not applied:
|
||||
Gs = G.epsilon_closure()
|
||||
r, lab = _find_selfloop(G, Gs)
|
||||
if r is not None:
|
||||
t = G.contract_single(r, lab)
|
||||
G.rm_edge(t, t)
|
||||
applied = True
|
||||
|
||||
if not applied:
|
||||
r, lab = _find_optional(G)
|
||||
if r is not None:
|
||||
G.contract_single(r, lab)
|
||||
if not G.has_edge(G.src, G.sink):
|
||||
G.add_edge(G.src, G.sink)
|
||||
applied = True
|
||||
|
||||
if not applied:
|
||||
applied = _try_ed(G)
|
||||
if not applied:
|
||||
applied = _try_eo1(G)
|
||||
if not applied:
|
||||
applied = _try_eo2(G)
|
||||
if not applied:
|
||||
done = True
|
||||
|
||||
if G.is_final():
|
||||
return G.expression()
|
||||
return '∅'
|
||||
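The plain CONCATENATION condition, Succ(r)={s} and Pred(s)={r}, is easy to check on successor sets derived from sample words. A minimal sketch under stated assumptions: it builds the sets from 2-grams with plain dicts instead of the `SOA` class, ignores the ε-closure and the optional variants, and uses the hypothetical names `successor_sets`, `predecessor_sets` and `concat_candidates`. The strings `"src"` and `"sink"` are reserved here and must not occur as symbols.

```python
from collections import defaultdict

SRC, SINK = "src", "sink"


def successor_sets(words):
    """Map each node to the set of nodes that directly follow it in some word."""
    succ = defaultdict(set)
    for word in words:
        path = [SRC, *word, SINK]
        for left, right in zip(path, path[1:]):
            succ[left].add(right)
    return succ


def predecessor_sets(succ):
    """Invert the successor relation."""
    pred = defaultdict(set)
    for node, targets in succ.items():
        for target in targets:
            pred[target].add(node)
    return pred


def concat_candidates(succ, pred):
    """Pairs (r, s) with Succ(r) = {s} and Pred(s) = {r}, excluding src and sink."""
    return sorted(
        (r, s)
        for r in succ
        for s in succ[r]
        if r != SRC and s != SINK and r != s
        and succ[r] == {s} and pred[s] == {r}
    )


succ = successor_sets([["a", "b", "c"], ["a", "b", "d"]])
pred = predecessor_sets(succ)
print(concat_candidates(succ, pred))  # only a·b can be contracted
```

Here `b` is followed by either `c` or `d`, so no concatenation applies to it; `c` and `d` share their predecessors and successors, which is the shape the DISJUNCTION rule looks for.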
31
bex/rwrsq.py
Normal file
|
|
@ -0,0 +1,31 @@
|
|||
"""rwr² — Translate k-OA to k-ORE (Algorithm 3, arXiv 1004.2372).
|
||||
|
||||
rwr²(G):
|
||||
1: compute a marking H of G
|
||||
2: return strip(rwr²₁(H))
|
||||
"""
|
||||
|
||||
import re
|
||||
from .marking import mark_koa
|
||||
from .rwr0 import rwr0
|
||||
|
||||
|
||||
def strip(expr):
|
||||
"""Remove k-ORE markers: a_i → a."""
|
||||
return re.sub(r'_\d+', '', expr)
|
||||
|
||||
|
||||
def rwr_sq(G):
|
||||
"""
|
||||
|———— Algorithm 3: rwr² ————|
|
||||
Require: k-OA G
|
||||
Ensure: k-ORE r with L(G) ⊆ L(r)
|
||||
|
||||
1: H ← marking of G
|
||||
2: return strip(rwr²₁(H))
|
||||
"""
|
||||
H = mark_koa(G)
|
||||
result = rwr0(H)
|
||||
if result is None or result == '∅':
|
||||
return None
|
||||
return strip(result)
|
||||
267
bex/shrink.py
Normal file
|
|
@ -0,0 +1,267 @@
|
|||
"""
shrink: SORE transformation via rewrite rules.

After Bex et al. 2010 (TWEB): the shrink operator transforms an
automaton (PTA) into a SORE (Single Occurrence Regular Expression)
by repeatedly applying rewrite rules.

The rewrite rules (Bex 2010, Section 4.2):
1. simplify: remove redundant edges, combine parallel paths
2. star_rewrite: replace a self-loop (s →label s) with label*
3. concat_rewrite: state elimination, s →t →u becomes s →u with label = l1·l2
4. alternation_rewrite: several outgoing edges, s →t1, s →t2 becomes s →(t1 | t2)

Every rewrite step is scored by an MDL cost function.
The process is iterative: as long as the MDL decreases, the most
profitable rewrite is applied (priority queue ordered by MDL gain).
"""
|
||||
|
||||
import heapq
|
||||
from .automaton import Automaton
|
||||
|
||||
|
||||
def simplify(automaton):
|
||||
    """
    simplify: remove redundant edges and combine parallel paths.

    After Bex 2010, shrink step 1:
    - If two edges (s→t, label1) and (s→t, label2) exist,
      replace them with s→t labelled (label1 | label2)
    - Remove unreachable states (no path from the start state)
    """
|
||||
G = automaton.copy()
|
||||
|
||||
# Phase 1: Parallel edges → alternation
|
||||
processed = set()
|
||||
for e in list(G.edges):
|
||||
key = (e['from'], e['to'])
|
||||
if key in processed:
|
||||
continue
|
||||
parallel = [e2 for e2 in G.edges if e2['from'] == key[0] and e2['to'] == key[1]]
|
||||
if len(parallel) > 1:
|
||||
labels = list(set(e2['label'] for e2 in parallel))
|
||||
new_label = f"({'|'.join(labels)})"
|
||||
for e2 in parallel:
|
||||
G.remove_edge(e2['from'], e2['to'], e2['label'])
|
||||
G.add_edge(key[0], key[1], new_label)
|
||||
processed.add(key)
|
||||
|
||||
# Phase 2: Remove unreachable nodes
|
||||
reachable = set()
|
||||
stack = [G.start] if G.start is not None else []
|
||||
while stack:
|
||||
n = stack.pop()
|
||||
if n in reachable:
|
||||
continue
|
||||
reachable.add(n)
|
||||
for e in G.outgoing(n):
|
||||
stack.append(e['to'])
|
||||
|
||||
unreachable = G.nodes - reachable
|
||||
for n in unreachable:
|
||||
G.nodes.discard(n)
|
||||
G.edges = [e for e in G.edges if e['from'] != n and e['to'] != n]
|
||||
G.accepts.discard(n)
|
||||
|
||||
return G
|
||||
|
||||
|
||||
def apply_star_rewrite(G, s):
|
||||
    """
    Star rewrite: replace a self-loop (s →label s) with label*.

    After Bex 2010, algorithm apply_star_rewrite:
    If a state s has a self-loop with label L:
    - Remove the self-loop
    - Record the star on s by adding a single loop labelled L*
      (or (L1|L2)* for several labels), which is exported to the regex later
    """
|
||||
loops = [e for e in G.edges if e['from'] == s and e['to'] == s]
|
||||
if not loops:
|
||||
return G
|
||||
|
||||
new_G = G.copy()
|
||||
for e in loops:
|
||||
new_G.remove_edge(e['from'], e['to'], e['label'])
|
||||
|
||||
labels = list(set(e['label'] for e in loops))
|
||||
if len(labels) == 1:
|
||||
star_label = f"{labels[0]}*"
|
||||
else:
|
||||
star_label = f"({'|'.join(labels)})*"
|
||||
|
||||
new_G.add_edge(s, s, star_label)
|
||||
return new_G
|
||||
|
||||
|
||||
def apply_concat_rewrite(G, t):
|
||||
    """
    Concat rewrite (state elimination): eliminate state t.

    After Bex 2010, algorithm apply_concat_rewrite:
    If a state t (not start/accept) has exactly one incoming and one outgoing edge:
        s → t (label1), t → u (label2) becomes s → u (label1·label2)
    Then remove t and replace it with the direct edge.

    More generally: for every incoming edge (s→t, l1) and outgoing edge (t→u, l2),
    add (s→u, l1·l2), then remove t.
    """
|
||||
G = G.copy()
|
||||
incoming = G.incoming(t)
|
||||
outgoing = G.outgoing(t)
|
||||
|
||||
if not incoming and not outgoing:
|
||||
G.nodes.discard(t)
|
||||
G.accepts.discard(t)
|
||||
return G
|
||||
|
||||
if t in (G.start, ) or t in G.accepts:
|
||||
return G
|
||||
|
||||
if len(incoming) == 1 and len(outgoing) == 1:
|
||||
s = incoming[0]['from']
|
||||
u = outgoing[0]['to']
|
||||
l1 = incoming[0]['label']
|
||||
l2 = outgoing[0]['label']
|
||||
|
||||
G.remove_edge(s, t, l1)
|
||||
G.remove_edge(t, u, l2)
|
||||
G.add_edge(s, u, f"({l1}.{l2})")
|
||||
|
||||
G.nodes.discard(t)
|
||||
G.accepts.discard(t)
|
||||
return G
|
||||
|
||||
has_self_loop = any(e['from'] == t and e['to'] == t for e in G.edges)
|
||||
if not has_self_loop:
|
||||
for e_in in incoming:
|
||||
for e_out in outgoing:
|
||||
if e_out['to'] == t:
|
||||
continue
|
||||
s = e_in['from']
|
||||
u = e_out['to']
|
||||
l1 = e_in['label']
|
||||
l2 = e_out['label']
|
||||
|
||||
existing_labels = [e2['label'] for e2 in G.edges
|
||||
if e2['from'] == s and e2['to'] == u]
|
||||
new_label = f"({l1}.{l2})"
|
||||
if new_label not in existing_labels:
|
||||
G.add_edge(s, u, new_label)
|
||||
|
||||
for e in incoming:
|
||||
G.remove_edge(e['from'], e['to'], e['label'])
|
||||
for e in outgoing:
|
||||
if e['to'] != t:
|
||||
G.remove_edge(e['from'], e['to'], e['label'])
|
||||
|
||||
G.nodes.discard(t)
|
||||
G.accepts.discard(t)
|
||||
|
||||
return G
|
||||
|
||||
|
||||
def apply_alternation_rewrite(G, s):
    """
    Alternation rewrite: combines several outgoing edges into (l1 | l2).

    Following Bex 2010: if s has two edges s → u (label1) and s → v (label2),
    and u and v are structurally similar:
    - merge u into v (i.e. all edges of u are redirected to v)
    - add a new edge s → v with label = (label1 | label2)

    Note: the code below merges the first two distinct targets of s without
    testing structural similarity.
    """
|
||||
G = G.copy()
|
||||
outgoing = G.outgoing(s)
|
||||
|
||||
if len(outgoing) < 2:
|
||||
return G
|
||||
|
||||
label_set = {}
|
||||
for e in outgoing:
|
||||
target = e['to']
|
||||
if target not in label_set:
|
||||
label_set[target] = []
|
||||
label_set[target].append(e['label'])
|
||||
|
||||
while len(label_set) >= 2:
|
||||
targets = list(label_set.keys())
|
||||
t1, t2 = targets[0], targets[1]
|
||||
|
||||
labels1 = label_set[t1]
|
||||
labels2 = label_set[t2]
|
||||
|
||||
for l in labels1:
|
||||
G.remove_edge(s, t1, l)
|
||||
for l in labels2:
|
||||
G.remove_edge(s, t2, l)
|
||||
|
||||
new_labels = labels1 + labels2
|
||||
|
||||
if t1 == t2:
|
||||
new_label = f"({'|'.join(new_labels)})"
|
||||
G.add_edge(s, t1, new_label)
|
||||
break
|
||||
|
||||
G.merge_nodes(t2, t1)
|
||||
|
||||
new_label = f"({'|'.join(new_labels)})"
|
||||
G.add_edge(s, t2, new_label)
|
||||
|
||||
del label_set[t1]
|
||||
label_set[t2] = new_labels
|
||||
|
||||
return G
|
||||
|
||||
|
||||
def has_single_accept(G):
|
||||
return len(G.accepts) == 1
|
||||
|
||||
|
||||
def shrink(automaton, max_iterations=100):
    """
    shrink: main algorithm, transforms a PTA into a SORE.

    Following Bex 2010, algorithm shrink:
    Repeat until convergence (the MDL no longer decreases, approximated here
    by "no rewrite changed the edge or node count", or max_iterations is hit):
    1. simplify(G)
    2. For every state s with a self-loop: apply_star_rewrite(G, s)
    3. For every state t (neither start nor accept): apply_concat_rewrite(G, t)
    4. For every state s with more than one outgoing edge: apply_alternation_rewrite(G, s)
    5. Check determinism (delegated to the repair step, not done in this function)
    """
|
||||
G = automaton.copy()
|
||||
|
||||
for iteration in range(max_iterations):
|
||||
prev_edge_count = len(G.edges)
|
||||
|
||||
G = simplify(G)
|
||||
changed = len(G.edges) < prev_edge_count
|
||||
|
||||
for node in list(G.nodes):
|
||||
if G.has_self_loop(node):
|
||||
G_new = apply_star_rewrite(G, node)
|
||||
if len(G_new.edges) != len(G.edges):
|
||||
G = G_new
|
||||
changed = True
|
||||
|
||||
for node in list(G.nodes):
|
||||
if node == G.start or node in G.accepts:
|
||||
continue
|
||||
incoming = G.incoming(node)
|
||||
outgoing = G.outgoing(node)
|
||||
if len(incoming) >= 1 and len(outgoing) >= 1:
|
||||
G_new = apply_concat_rewrite(G, node)
|
||||
if len(G_new.nodes) < len(G.nodes):
|
||||
G = G_new
|
||||
changed = True
|
||||
|
||||
for node in list(G.nodes):
|
||||
if len(G.outgoing(node)) >= 2:
|
||||
G_new = apply_alternation_rewrite(G, node)
|
||||
if len(G_new.edges) < len(G.edges):
|
||||
G = G_new
|
||||
changed = True
|
||||
|
||||
if not changed:
|
||||
break
|
||||
|
||||
return G
|
||||
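The concat rewrite documented in `apply_concat_rewrite` can be illustrated without the repo's graph class. The following is a minimal sketch, assuming edges are plain `(source, target, label)` tuples; the function name `eliminate_state` and this tuple representation are illustrative and not part of the `bex` package.

```python
def eliminate_state(edges, t):
    """Bridge every (s -> t, l1), (t -> u, l2) pair with (s -> u, (l1.l2)) and drop t.

    `edges` is a set of (source, target, label) tuples. This representation is
    an assumption made for the sketch, not the graph class used in the repo.
    """
    # A state with a self-loop is left alone in this sketch; the repo code
    # likewise skips the bridging step when t has a self-loop.
    if any(src == t and dst == t for src, dst, _ in edges):
        return set(edges)

    incoming = [(src, label) for src, dst, label in edges if dst == t]
    outgoing = [(dst, label) for src, dst, label in edges if src == t]

    # Keep every edge that does not touch t, then add the bridging edges.
    result = {e for e in edges if t not in (e[0], e[1])}
    for s, l1 in incoming:
        for u, l2 in outgoing:
            result.add((s, u, f"({l1}.{l2})"))
    return result


edges = {("q0", "q1", "a"), ("q3", "q1", "c"), ("q1", "q2", "b")}
print(sorted(eliminate_state(edges, "q1")))
```

With two incoming edges and one outgoing edge, eliminating `q1` yields two bridging edges, one per incoming/outgoing pair, which is the "more general" case from the docstring.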
193
bex/soa.py
Normal file
|
|
@ -0,0 +1,193 @@
|
|||
"""SOA — Single Occurrence Automaton (Definition 6, TODS 2010)."""
|
||||
|
||||
import copy
|
||||
from .expr import concat, disj, star, optional
|
||||
|
||||
|
||||
class SOA:
|
||||
"""
|
||||
Node-labeled automaton (Definition 6, TODS 2010).
|
||||
|
||||
V = {src, sink} ∪ symbol-labeled states.
|
||||
E ⊆ V × V, unlabeled edges.
|
||||
Walk src=v₁,v₂,...,vₙ₊₁=sink accepts word lab(v₂)...lab(vₙ).
|
||||
|
||||
States are proper SOREs, pairwise alphabet-disjoint (Definition 10).
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
self._next = 0
|
||||
self._succ = {}
|
||||
self._pred = {}
|
||||
self._label = {}
|
||||
self.src = self._new()
|
||||
self.sink = self._new()
|
||||
|
||||
def _new(self):
|
||||
n = self._next
|
||||
self._next += 1
|
||||
self._succ[n] = set()
|
||||
self._pred[n] = set()
|
||||
self._label[n] = None
|
||||
return n
|
||||
|
||||
def add_state(self, label):
|
||||
n = self._new()
|
||||
self._label[n] = label
|
||||
return n
|
||||
|
||||
def add_edge(self, f, t):
|
||||
self._succ[f].add(t)
|
||||
self._pred[t].add(f)
|
||||
|
||||
def rm_edge(self, f, t):
|
||||
self._succ[f].discard(t)
|
||||
self._pred[t].discard(f)
|
||||
|
||||
def rm_state(self, n):
|
||||
if n in (self.src, self.sink):
|
||||
return
|
||||
for p in list(self._pred[n]):
|
||||
self.rm_edge(p, n)
|
||||
for s in list(self._succ[n]):
|
||||
self.rm_edge(n, s)
|
||||
del self._label[n]
|
||||
del self._succ[n]
|
||||
del self._pred[n]
|
||||
|
||||
def label(self, n):
|
||||
return self._label.get(n)
|
||||
|
||||
def set_label(self, n, lab):
|
||||
self._label[n] = lab
|
||||
|
||||
def succ(self, n):
|
||||
return set(self._succ.get(n, set()))
|
||||
|
||||
def pred(self, n):
|
||||
return set(self._pred.get(n, set()))
|
||||
|
||||
def has_edge(self, f, t):
|
||||
return t in self._succ.get(f, set())
|
||||
|
||||
def states(self):
|
||||
return [n for n in self._succ if n not in (self.src, self.sink) and self._label.get(n) is not None]
|
||||
|
||||
def _pred_plus(self, n):
|
||||
r = set(self._pred.get(n, set()))
|
||||
if self._label.get(n) and self._label[n].endswith('+'):
|
||||
r.add(n)
|
||||
return r
|
||||
|
||||
def _succ_plus(self, n):
|
||||
r = set(self._succ.get(n, set()))
|
||||
if self._label.get(n) and self._label[n].endswith('+'):
|
||||
r.add(n)
|
||||
return r
|
||||
|
||||
def copy(self):
|
||||
return copy.deepcopy(self)
|
||||
|
||||
def accept(self, w):
|
||||
cur = {self.src}
|
||||
for sym in w:
|
||||
nxt = set()
|
||||
for s in cur:
|
||||
for t in self._succ.get(s, set()):
|
||||
if self._label.get(t) == sym:
|
||||
nxt.add(t)
|
||||
if not nxt:
|
||||
return False
|
||||
cur = nxt
|
||||
return any(self.sink in self._succ.get(s, set()) for s in cur)
|
||||
|
||||
def sink_reachable(self):
|
||||
seen = set()
|
||||
q = [self.src]
|
||||
while q:
|
||||
s = q.pop()
|
||||
if s == self.sink:
|
||||
return True
|
||||
if s in seen:
|
||||
continue
|
||||
seen.add(s)
|
||||
q.extend(self._succ.get(s, []))
|
||||
return False
|
||||
|
||||
def num_non_special(self):
|
||||
return sum(1 for n in self._succ if n not in (self.src, self.sink))
|
||||
|
||||
def is_final(self):
|
||||
ns = self.states()
|
||||
return len(ns) == 1 and self.has_edge(self.src, ns[0]) and self.has_edge(ns[0], self.sink)
|
||||
|
||||
def expression(self):
|
||||
if not self.is_final():
|
||||
return None
|
||||
return self._label[self.states()[0]]
|
||||
|
||||
    def contract(self, r, s, new_label):
        """
        State contraction G[r,s ⇒ t] (after Definition 11, TODS 2010), as
        implemented below:

        (1) Add t as a new state with label new_label.
        (2) Every v ∈ (Pred(r) ∪ Pred(s)) − {r,s} becomes a predecessor of t.
        (3) Every w ∈ (Succ(r) ∪ Succ(s)) − {r,s} becomes a successor of t.
        (4) Add the loop t→t if r ∈ Succ(s).
        (5) Remove r and s together with all their edges.
        """
|
||||
t = self._new()
|
||||
self._label[t] = new_label
|
||||
for v in self._pred.get(r, set()) - {r, s}:
|
||||
self.add_edge(v, t)
|
||||
for v in self._pred.get(s, set()) - {r, s}:
|
||||
self.add_edge(v, t)
|
||||
for w in self._succ.get(r, set()) - {r, s}:
|
||||
self.add_edge(t, w)
|
||||
for w in self._succ.get(s, set()) - {r, s}:
|
||||
self.add_edge(t, w)
|
||||
if r in self._succ.get(s, set()):
|
||||
self.add_edge(t, t)
|
||||
self.rm_state(r)
|
||||
self.rm_state(s)
|
||||
return t
|
||||
|
||||
def contract_single(self, r, new_label):
|
||||
"""Single-state substitution G[r ⇒ t] (Definition 11 note)."""
|
||||
if r in (self.src, self.sink):
|
||||
return r
|
||||
t = self._new()
|
||||
self._label[t] = new_label
|
||||
for v in self._pred.get(r, set()) - {r}:
|
||||
self.add_edge(v, t)
|
||||
for w in self._succ.get(r, set()) - {r}:
|
||||
self.add_edge(t, w)
|
||||
if r in self._succ.get(r, set()):
|
||||
self.add_edge(t, t)
|
||||
self.rm_state(r)
|
||||
return t
|
||||
|
||||
def epsilon_closure(self):
|
||||
"""G* (Definition 25, TODS 2010). Add self-loops for + states and ε-transitive closure."""
|
||||
G = self.copy()
|
||||
changed = True
|
||||
while changed:
|
||||
changed = False
|
||||
for n in list(G._succ.keys()):
|
||||
lab = G._label.get(n)
|
||||
if lab and (lab.endswith('+') or lab.endswith('+?')):
|
||||
if not G.has_edge(n, n):
|
||||
G.add_edge(n, n)
|
||||
changed = True
|
||||
for n in list(G._succ.keys()):
|
||||
for m in list(G._succ.get(n, set())):
|
||||
mlab = G._label.get(m)
|
||||
if mlab == 'ε':
|
||||
for mp in list(G._succ.get(m, set())):
|
||||
if mp != n and not G.has_edge(n, mp):
|
||||
G.add_edge(n, mp)
|
||||
changed = True
|
||||
return G
|
||||
|
||||
def __repr__(self):
|
||||
        return f"SOA(nodes={len(self._succ)}, non_special={self.num_non_special()})"
|
||||
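The state contraction performed by `SOA.contract` can be sketched on a plain successor map. This is a minimal illustration, assuming the automaton is a dict from state name to a set of successor names; `contract_states` is a hypothetical helper, not a function of the `bex` package.

```python
def contract_states(succ, r, s, t):
    """Return a new successor map in which states r and s are merged into t."""
    merged = {r, s}
    # The merged state loops on itself when s could step back to r.
    loops = r in succ.get(s, set())

    new_succ = {}
    for state, targets in succ.items():
        if state in merged:
            continue
        # Redirect edges that pointed at r or s to the merged state t.
        new_succ[state] = {t if w in merged else w for w in targets}

    # Successors of r or s (other than r and s themselves) become successors of t.
    new_succ[t] = (succ.get(r, set()) | succ.get(s, set())) - merged
    if loops:
        new_succ[t].add(t)
    return new_succ


chain = {"src": {"a"}, "a": {"b"}, "b": {"sink"}, "sink": set()}
print(contract_states(chain, "a", "b", "a.b"))
```

Contracting `a` and `b` in the chain `src → a → b → sink` leaves a single labeled state between `src` and `sink`, which is the situation `is_final()` tests for.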
154
bex/template.py
Normal file
|
|
@ -0,0 +1,154 @@
|
|||
"""
template: one-shot YAML template generator.

Converts the inferred k-ORE/SORE/CHARE regular expression back into a
human-readable YAML skeleton for LLM prompts.

The generator produces:
- a YAML skeleton with placeholders
- comments with cardinality hints (emitted in German):
    * # PFLICHT: Genau 1 mal erforderlich                  (required, exactly once)
    * # PFLICHT: 1 oder mehrmals erforderlich              (required, one or more times)
    * # OPTIONAL: 0 oder 1 mal (darf weggelassen werden)   (optional, may be omitted)
    * # OPTIONAL: 0 oder mehrmals                          (optional, any number of times)
    * # WAHLWEISE (eines auswählen):                       (choose one of the alternatives)
"""
|
||||
|
||||
|
||||
def parse_expression(expr):
    """Split a regular expression into its component tokens."""
|
||||
if not expr or expr in ('∅', 'ε', ''):
|
||||
return [('empty', 'ε')]
|
||||
|
||||
tokens = []
|
||||
i = 0
|
||||
while i < len(expr):
|
||||
if expr[i] == '(':
|
||||
depth = 1
|
||||
j = i + 1
|
||||
while j < len(expr) and depth > 0:
|
||||
if expr[j] == '(':
|
||||
depth += 1
|
||||
elif expr[j] == ')':
|
||||
depth -= 1
|
||||
j += 1
|
||||
group = expr[i:j]
|
||||
quantifier = ''
|
||||
if j < len(expr) and expr[j] in '*+?':
|
||||
quantifier = expr[j]
|
||||
j += 1
|
||||
tokens.append(('group', group, quantifier))
|
||||
i = j
|
||||
elif expr[i] == '|':
|
||||
tokens.append(('pipe', '|'))
|
||||
i += 1
|
||||
elif expr[i] == '.':
|
||||
if i + 1 < len(expr) and expr[i + 1] == '.':
|
||||
tokens.append(('concat', '..'))
|
||||
i += 2
|
||||
else:
|
||||
tokens.append(('concat', '.'))
|
||||
i += 1
|
||||
elif expr[i] in '*+?':
|
||||
if tokens and tokens[-1][0] == 'name':
|
||||
name, val, _ = tokens[-1]
|
||||
tokens[-1] = (name, val, expr[i])
|
||||
i += 1
|
||||
elif expr[i].isalnum() or expr[i] in '/_-':
|
||||
j = i
|
||||
while j < len(expr) and (expr[j].isalnum() or expr[j] in '/_-'):
|
||||
j += 1
|
||||
name = expr[i:j]
|
||||
tokens.append(('name', name, ''))
|
||||
i = j
|
||||
else:
|
||||
i += 1
|
||||
|
||||
return tokens
|
||||
|
||||
|
||||
def format_prompt_cardinality(quantifier):
    """Return the (German) cardinality comment for a quantifier."""
|
||||
mapping = {
|
||||
'': '# PFLICHT: Genau 1 mal erforderlich',
|
||||
'+': '# PFLICHT: 1 oder mehrmals erforderlich',
|
||||
'*': '# OPTIONAL: 0 oder mehrmals',
|
||||
'?': '# OPTIONAL: 0 oder 1 mal (darf weggelassen werden)',
|
||||
}
|
||||
return mapping.get(quantifier, '')
|
||||
|
||||
|
||||
def generate_template(expr, context_key=None, include_header=True):
    """
    Generate a one-shot YAML template from a regular expression.

    Args:
        expr: the inferred expression (string)
        context_key: name of the YAML container key (e.g. 'tasks')
        include_header: whether to emit the header part (name, hosts)

    Returns:
        str: YAML skeleton with placeholders and cardinality comments
    """
|
||||
if not expr or expr in ('∅', 'ε'):
|
||||
return "# Keine Struktur inferiert (leere Sequenzen oder keine Beispiele)"
|
||||
|
||||
if include_header:
|
||||
lines = [
|
||||
"- name: <Name des Plays>",
|
||||
" hosts: <Ziel-Server> # PFLICHT: Genau 1 mal erforderlich",
|
||||
]
|
||||
if context_key:
|
||||
lines.append(f" {context_key}:")
|
||||
else:
|
||||
lines.append(" tasks:")
|
||||
indent = " "
|
||||
else:
|
||||
lines = []
|
||||
if context_key:
|
||||
lines.append(f" {context_key}: # Container-Kontext: {context_key}")
|
||||
else:
|
||||
lines.append(" tasks:")
|
||||
indent = " "
|
||||
|
||||
tokens = parse_expression(expr)
|
||||
task_index = 0
|
||||
|
||||
|
||||
i = 0
|
||||
while i < len(tokens):
|
||||
token = tokens[i]
|
||||
|
||||
if token[0] == 'group':
|
||||
group_str = token[1]
|
||||
quantifier = token[2]
|
||||
card = format_prompt_cardinality(quantifier)
|
||||
inner_expr = group_str[1:-1]
|
||||
if '|' in inner_expr:
|
||||
alts = inner_expr.split('|')
|
||||
lines.append(f"{indent}# WAHLWEISE (eines auswählen):")
|
||||
for alt in alts:
|
||||
alt_clean = alt.strip()
|
||||
lines.append(f"{indent}# - {alt_clean}: <Parameter für {alt_clean}>")
|
||||
if card:
|
||||
lines[-1] = f"{lines[-1]} {card}"
|
||||
else:
|
||||
lines.append(f"{indent}- {inner_expr}: <Parameter für {inner_expr}> {card}")
|
||||
task_index += 1
|
||||
|
||||
elif token[0] == 'name':
|
||||
name = token[1]
|
||||
quantifier = token[2]
|
||||
card = format_prompt_cardinality(quantifier)
|
||||
lines.append(f"{indent}- {name}: <Parameter für {name}> {card}")
|
||||
task_index += 1
|
||||
|
||||
elif token[0] == 'pipe':
|
||||
pass
|
||||
|
||||
i += 1
|
||||
|
||||
return '\n'.join(lines) + '\n'
|
||||
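The mapping from quantifiers to cardinality hints can be shown in isolation. The sketch below is a simplified illustration, assuming a flat concatenation of names with optional `*`, `+` or `?` suffixes and no groups or alternations; `CARDINALITY` and `skeleton` are hypothetical names, and the English hints stand in for the German strings used by `format_prompt_cardinality`.

```python
import re

# English stand-ins for the cardinality comments (assumption for this sketch).
CARDINALITY = {
    "": "required, exactly once",
    "+": "required, one or more times",
    "?": "optional, at most once",
    "*": "optional, any number of times",
}


def skeleton(expr):
    """Render a flat concatenation such as 'apt.service?.template+' as YAML task stubs."""
    lines = []
    # Each match is a (name, quantifier) pair; the '.' separators are skipped.
    for name, quantifier in re.findall(r"([A-Za-z0-9_/-]+)([*+?]?)", expr):
        lines.append(f"- {name}: <parameters for {name}>  # {CARDINALITY[quantifier]}")
    return lines


print("\n".join(skeleton("apt.service?.template+")))
```

Parenthesized groups are deliberately left out here; `parse_expression` handles them by tracking nesting depth, which a single regular expression cannot do.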
194
bex/tokenizer.py
Normal file
|
|
@ -0,0 +1,194 @@
|
|||
"""
YAMLTokenizer: extracts token sequences from Ansible YAML files.

Following Bex 2007/2010, every YAML document is translated into a sequence of
symbols (tokens). For Ansible:
- A playbook becomes a sequence of module names (apt, service, template, ...)
- include_tasks is treated as a terminal token (it is resolved recursively
  only when resolve_includes=True)
- block/rescue/always: the container is recorded with block_start/block_end
  (plus rescue_*/always_*) marker tokens, and the tasks inside are tokenized
  between those markers

The extracted sequences serve as input for the automaton construction.
"""
|
||||
|
||||
import os
|
||||
import yaml
|
||||
|
||||
|
||||
# Module names that are recorded as structural tokens
# (based on an analysis of 56+ roles in the project)
|
||||
MODULE_TOKENS = {
|
||||
'apt', 'service', 'template', 'copy', 'file', 'command', 'shell',
|
||||
'get_url', 'uri', 'debug', 'set_fact', 'assert', 'wait_for',
|
||||
'include_tasks', 'import_tasks', 'import_playbook',
|
||||
'systemd', 'cron', 'user', 'authorized_key', 'group',
|
||||
'docker_container', 'docker_volume', 'docker_network', 'docker_image',
|
||||
'pip', 'npm', 'package',
|
||||
'lineinfile', 'replace', 'blockinfile',
|
||||
'stat', 'fetch', 'slurp',
|
||||
'meta', 'fail', 'pause',
|
||||
'unarchive', 'archive',
|
||||
'git', 'hg',
|
||||
'mysql_db', 'mysql_user',
|
||||
'postgresql_db', 'postgresql_user',
|
||||
'certificate', 'openssl',
|
||||
'known_hosts',
|
||||
'iptables', 'ufw',
|
||||
'mount', 'filesystem',
|
||||
'sysctl',
|
||||
'ini_file',
|
||||
    'composer',
    'make',
    'configure',
    'pear',
    'gem',
    'cargo',
|
||||
}
|
||||
|
||||
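# Task-level keywords that must not be mistaken for module names.
# Not exhaustive; extend as needed.
TASK_KEYWORDS = {
    'name', 'when', 'become', 'become_user', 'register', 'tags', 'notify',
    'vars', 'loop', 'with_items', 'ignore_errors', 'changed_when',
    'failed_when', 'delegate_to', 'environment', 'args', 'no_log', 'run_once',
}


def is_module_name(key):
    """Return True for string keys that are neither private nor task keywords."""
    if not isinstance(key, str) or key.startswith('_'):
        return False
    return key not in TASK_KEYWORDS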
|
||||
|
||||
class YAMLTokenizer:
|
||||
def __init__(self, resolve_includes=False):
|
||||
self.resolve_includes = resolve_includes
|
||||
self._token_counts = {}
|
||||
|
||||
def tokenize_file(self, filepath):
|
||||
with open(filepath) as f:
|
||||
content = f.read()
|
||||
return self.tokenize_string(content, source=filepath)
|
||||
|
||||
def tokenize_string(self, content, source='<string>'):
|
||||
try:
|
||||
data = yaml.safe_load(content)
|
||||
        except yaml.YAMLError:
            # Unparseable YAML yields an empty token sequence.
            return []
|
||||
if data is None:
|
||||
return []
|
||||
return self._tokenize(data, source=source)
|
||||
|
||||
def _tokenize(self, data, source='<string>', depth=0):
|
||||
if isinstance(data, list):
|
||||
return self._tokenize_list(data, source, depth)
|
||||
elif isinstance(data, dict):
|
||||
return self._tokenize_dict(data, source, depth)
|
||||
return []
|
||||
|
||||
def _tokenize_list(self, lst, source, depth):
|
||||
tokens = []
|
||||
for item in lst:
|
||||
if isinstance(item, dict):
|
||||
tokens.extend(self._tokenize_dict(item, source, depth))
|
||||
elif isinstance(item, str):
|
||||
tokens.append(item)
|
||||
return tokens
|
||||
|
||||
def _tokenize_dict(self, d, source, depth):
|
||||
tokens = []
|
||||
|
||||
if 'tasks' in d or 'block' in d or 'pre_tasks' in d or 'post_tasks' in d:
|
||||
task_key = next(k for k in ['pre_tasks', 'tasks', 'post_tasks', 'block'] if k in d)
|
||||
if task_key == 'block':
|
||||
tokens.append('block_start')
|
||||
for item in d.get('block', []):
|
||||
tokens.extend(self._tokenize_task(item, source, depth + 1))
|
||||
if 'rescue' in d:
|
||||
tokens.append('rescue_start')
|
||||
for item in d['rescue']:
|
||||
tokens.extend(self._tokenize_task(item, source, depth + 1))
|
||||
tokens.append('rescue_end')
|
||||
if 'always' in d:
|
||||
tokens.append('always_start')
|
||||
for item in d['always']:
|
||||
tokens.extend(self._tokenize_task(item, source, depth + 1))
|
||||
tokens.append('always_end')
|
||||
tokens.append('block_end')
|
||||
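            else:
                # A play may define pre_tasks, tasks and post_tasks together;
                # tokenize all of them in execution order, not just the first.
                for key in ('pre_tasks', 'tasks', 'post_tasks'):
                    for item in d.get(key) or []:
                        tokens.extend(self._tokenize_task(item, source, depth + 1))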
|
||||
|
||||
elif 'hosts' in d:
|
||||
tokens.append('play_start')
|
||||
for item in d.get('tasks', []):
|
||||
tokens.extend(self._tokenize_task(item, source, depth + 1))
|
||||
tokens.append('play_end')
|
||||
|
||||
elif 'roles' in d:
|
||||
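            for role in d.get('roles') or []:
                # Dict-style entries name the role under 'role' (or 'name');
                # fall back to the first key if neither is present.
                if isinstance(role, dict):
                    role = role.get('role') or role.get('name') or next(iter(role), '')
                tokens.append(f"role:{role}")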
|
||||
|
||||
elif 'handlers' in d:
|
||||
tokens.append('handlers_start')
|
||||
for item in d.get('handlers', []):
|
||||
tokens.extend(self._tokenize_task(item, source, depth + 1))
|
||||
tokens.append('handlers_end')
|
||||
|
||||
elif 'name' in d and not any(k in d for k in ['tasks', 'block', 'hosts']):
|
||||
tokens.extend(self._tokenize_task(d, source, depth))
|
||||
|
||||
return tokens
|
||||
|
||||
def _tokenize_task(self, task, source, depth):
|
||||
if not isinstance(task, dict):
|
||||
return []
|
||||
|
||||
tokens = []
|
||||
|
||||
if 'include_tasks' in task or 'import_tasks' in task:
|
||||
key = 'include_tasks' if 'include_tasks' in task else 'import_tasks'
|
||||
tokens.append(key)
|
||||
if self.resolve_includes:
|
||||
inc_path = task[key]
|
||||
if not os.path.isabs(inc_path):
|
||||
base = os.path.dirname(source) if source != '<string>' else '.'
|
||||
inc_path = os.path.join(base, inc_path)
|
||||
if os.path.exists(inc_path):
|
||||
tokens.extend(self.tokenize_file(inc_path))
|
||||
return tokens
|
||||
|
||||
if 'import_playbook' in task:
|
||||
tokens.append('import_playbook')
|
||||
return tokens
|
||||
|
||||
if 'block' in task:
|
||||
tokens.append('block_start')
|
||||
for item in task.get('block', []):
|
||||
tokens.extend(self._tokenize_task(item, source, depth))
|
||||
if 'rescue' in task:
|
||||
tokens.append('rescue_start')
|
||||
for item in task['rescue']:
|
||||
tokens.extend(self._tokenize_task(item, source, depth))
|
||||
tokens.append('rescue_end')
|
||||
if 'always' in task:
|
||||
tokens.append('always_start')
|
||||
for item in task['always']:
|
||||
tokens.extend(self._tokenize_task(item, source, depth))
|
||||
tokens.append('always_end')
|
||||
tokens.append('block_end')
|
||||
return tokens
|
||||
|
||||
if 'name' in task:
|
||||
module_name = None
|
||||
for key in task:
|
||||
if key == 'name':
|
||||
continue
|
||||
if is_module_name(key) and isinstance(task[key], (str, dict, list, bool, int)):
|
||||
module_name = key
|
||||
break
|
||||
if module_name:
|
||||
tokens.append(module_name)
|
||||
self._token_counts[module_name] = self._token_counts.get(module_name, 0) + 1
|
||||
elif 'ansible.builtin' in str(task):
|
||||
for key in task:
|
||||
if '.' in str(key):
|
||||
module_name = str(key).split('.')[-1]
|
||||
tokens.append(module_name)
|
||||
break
|
||||
|
||||
return tokens
|
||||
|
||||
def get_statistics(self):
|
||||
return dict(sorted(self._token_counts.items(), key=lambda x: -x[1]))
|
||||
35
bex/twotinf.py
Normal file
|
|
@ -0,0 +1,35 @@
|
|||
"""2T-INF — Build SOA from 2-grams (Algorithm 1, TODS 2010)."""
|
||||
|
||||
from .soa import SOA
|
||||
|
||||
|
||||
def build_soa(sequences):
|
||||
"""
|
||||
|———— Algorithm 1: 2T-INF ————|
|
||||
Input: finite set of sample strings S
|
||||
Output: SOA G such that S ⊆ L(G)
|
||||
|
||||
For each string a₁...aₙ in S:
|
||||
add edges (src, a₁), (a₁, a₂), ..., (aₙ, sink)
|
||||
"""
|
||||
G = SOA()
|
||||
symbol_states = {}
|
||||
|
||||
for seq in sequences:
|
||||
if not seq:
|
||||
if not G.has_edge(G.src, G.sink):
|
||||
G.add_edge(G.src, G.sink)
|
||||
continue
|
||||
for i, token in enumerate(seq):
|
||||
if token not in symbol_states:
|
||||
symbol_states[token] = G.add_state(token)
|
||||
if i == 0:
|
||||
G.add_edge(G.src, symbol_states[token])
|
||||
if i == len(seq) - 1:
|
||||
G.add_edge(symbol_states[token], G.sink)
|
||||
if i + 1 < len(seq):
|
||||
nxt = seq[i + 1]
|
||||
if nxt not in symbol_states:
|
||||
symbol_states[nxt] = G.add_state(nxt)
|
||||
G.add_edge(symbol_states[token], symbol_states[nxt])
|
||||
return G
|
||||
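Because every symbol owns exactly one state in an SOA, 2T-INF reduces to collecting the 2-grams of the sample, padded with a source and a sink marker. The sketch below shows this on a plain successor map; `two_t_inf`, `accepts`, `SRC` and `SINK` are illustrative names and the dict representation is an assumption, not the `SOA` class used in `build_soa`.

```python
from collections import defaultdict

SRC, SINK = "<src>", "<sink>"  # hypothetical markers for the two special states


def two_t_inf(sequences):
    """Return the successor map of the SOA induced by the 2-grams of the sample."""
    succ = defaultdict(set)
    for seq in sequences:
        path = [SRC, *seq, SINK]
        # Every adjacent pair on the padded path becomes an edge.
        for left, right in zip(path, path[1:]):
            succ[left].add(right)
    return dict(succ)


def accepts(succ, word):
    """A word is accepted iff each of its padded 2-grams is an edge."""
    path = [SRC, *word, SINK]
    return all(b in succ.get(a, ()) for a, b in zip(path, path[1:]))


graph = two_t_inf([["a", "b"], ["b", "a"]])
print(accepts(graph, ["a", "b"]))       # a sample word
print(accepts(graph, ["a", "b", "a"]))  # not in the sample, yet built from seen 2-grams
```

The second call illustrates why the output only guarantees S ⊆ L(G): any word whose 2-grams all occur in the sample is accepted, including words that were never observed.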
81
bex/yaml_to_seq.py
Normal file
|
|
@ -0,0 +1,81 @@
|
|||
"""Convert YAML files to key-path sequences for BEX grammar inference."""
|
||||
|
||||
from pathlib import Path
|
||||
import yaml
|
||||
|
||||
|
||||
def yaml_to_keypath_sequence(data, prefix=""):
|
||||
"""Convert parsed YAML data to a sequence of key paths (DFS traversal).
|
||||
|
||||
Each leaf (scalar) emits its full key path as a symbol.
|
||||
Lists use a generic `[]` marker (no indices).
|
||||
Values are NOT included — only key paths.
|
||||
"""
|
||||
seq = []
|
||||
if isinstance(data, dict):
|
||||
for key, value in data.items():
|
||||
path = f"{prefix}.{key}" if prefix else key
|
||||
if isinstance(value, (dict, list)):
|
||||
seq.extend(yaml_to_keypath_sequence(value, path))
|
||||
else:
|
||||
seq.append(path)
|
||||
elif isinstance(data, list):
|
||||
for item in data:
|
||||
list_prefix = f"{prefix}[]" if prefix else "[]"
|
||||
if isinstance(item, (dict, list)):
|
||||
seq.extend(yaml_to_keypath_sequence(item, list_prefix))
|
||||
else:
|
||||
seq.append(list_prefix)
|
||||
return seq
|
||||
|
||||
|
||||
def yaml_file_to_sequence(filepath):
|
||||
"""Load a YAML file and convert to a key-path sequence."""
|
||||
with open(filepath) as f:
|
||||
data = yaml.safe_load(f)
|
||||
if data is None:
|
||||
return []
|
||||
return yaml_to_keypath_sequence(data)
|
||||
|
||||
|
||||
def is_vault_file(filepath):
|
||||
"""Check if a file is an Ansible vault file (encrypted)."""
|
||||
try:
|
||||
with open(filepath) as f:
|
||||
first = f.read(100)
|
||||
return '$ANSIBLE_VAULT' in first or first.startswith('!vault |')
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
|
||||
def collect_all_sequences(root_dir=".", include_vault=False):
|
||||
"""Collect key-path sequences from all YAML files.
|
||||
|
||||
Returns:
|
||||
list of (filepath, sequence) tuples.
|
||||
"""
|
||||
results = []
|
||||
for path in sorted(Path(root_dir).rglob("*.yml")):
|
||||
parts = path.parts
|
||||
if any(d in parts for d in ('node_modules', '.venv', '__pycache__', '.git')):
|
||||
continue
|
||||
        skippable = 'vault' in path.name
|
||||
if not include_vault and (skippable or is_vault_file(path)):
|
||||
continue
|
||||
try:
|
||||
seq = yaml_file_to_sequence(path)
|
||||
if seq:
|
||||
results.append((path, seq))
|
||||
except Exception as e:
|
||||
print(f" SKIP {path}: {e}")
|
||||
return results
|
||||
|
||||
|
||||
def sequences_to_crx(result_list):
|
||||
"""Run CRX on collected sequences."""
|
||||
from .crx import CRX
|
||||
sequences = [seq for _, seq in result_list]
|
||||
if not sequences:
|
||||
return 'ε'
|
||||
crx = CRX()
|
||||
return crx.infer(sequences)
|
||||
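The key-path encoding is easiest to see on a small nested structure. The following is a compact sketch of the same depth-first traversal on already parsed data, so it needs no YAML parser; `key_paths` is a hypothetical re-statement of the idea, not the function shipped in `bex/yaml_to_seq.py`.

```python
def key_paths(data, prefix=""):
    """Depth-first list of key paths; list items share a generic '[]' marker."""
    if isinstance(data, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in data.items()]
    elif isinstance(data, list):
        items = [(f"{prefix}[]" if prefix else "[]", v) for v in data]
    else:
        return []  # a bare scalar at the top level has no key path

    paths = []
    for path, value in items:
        if isinstance(value, (dict, list)):
            # Containers contribute only through their leaves.
            paths.extend(key_paths(value, path))
        else:
            # Scalars emit their full path; the value itself is dropped.
            paths.append(path)
    return paths


doc = {"a": {"b": 1, "c": [1, {"d": 2}]}, "e": None}
print(key_paths(doc))  # ['a.b', 'a.c[]', 'a.c[].d', 'e']
```

Dropping list indices means two files that differ only in the number of list items produce sequences that differ only in how often a symbol repeats, which is the kind of variation the inference algorithms can generalize over.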
2210
papers/paper_arxiv2010.txt
Normal file
File diff suppressed because it is too large
2492
papers/paper_tods2010.txt
Normal file
File diff suppressed because it is too large
13
pyproject.toml
Normal file
|
|
@ -0,0 +1,13 @@
|
|||
[build-system]
|
||||
requires = ["setuptools>=68.0"]
|
||||
build-backend = "setuptools.build_meta"
|
||||
|
||||
[project]
|
||||
name = "grammar-inference-engine"
|
||||
version = "0.1.0"
|
||||
description = "BEX-based grammar inference: learn regular expression patterns from example sequences"
|
||||
readme = "README.md"
|
||||
requires-python = ">=3.11"
|
||||
dependencies = [
|
||||
"PyYAML>=6.0",
|
||||
]
|
||||
5
requirements.txt
Normal file
|
|
@ -0,0 +1,5 @@
|
|||
# Core
|
||||
PyYAML>=6.0
|
||||
|
||||
# Tests
|
||||
pytest>=7.0
|
||||
420
tests/test_bex.py
Normal file
|
|
@ -0,0 +1,420 @@
|
|||
"""Tests for BEX paper algorithm implementations."""
|
||||
|
||||
import os
import sys

# Make the repository root importable regardless of where the checkout lives.
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
|
||||
from bex.soa import SOA
|
||||
from bex.twotinf import build_soa
|
||||
from bex.rwr0 import rwr0
|
||||
from bex.crx import CRX
|
||||
from bex.idregex import is_deterministic, idregex
|
||||
from bex.expr import concat, disj, star, optional, alphabet, strip_k
|
||||
from bex.koa import KOA, build_complete_koa
|
||||
from bex.marking import mark_koa
|
||||
from bex.rwrsq import rwr_sq, strip
|
||||
from bex.ikoa import ikoa
|
||||
|
||||
|
||||
def test_soa_basics():
|
||||
G = SOA()
|
||||
a = G.add_state('a')
|
||||
b = G.add_state('b')
|
||||
G.add_edge(G.src, a)
|
||||
G.add_edge(a, b)
|
||||
G.add_edge(b, G.sink)
|
||||
assert G.accept(['a', 'b'])
|
||||
assert not G.accept(['a'])
|
||||
assert not G.accept(['b'])
|
||||
assert not G.accept(['a', 'b', 'c'])
|
||||
print(" PASS test_soa_basics")
|
||||
|
||||
|
||||
def test_soa_contract():
|
||||
G = SOA()
|
||||
a = G.add_state('a')
|
||||
b = G.add_state('b')
|
||||
G.add_edge(G.src, a)
|
||||
G.add_edge(a, b)
|
||||
G.add_edge(b, G.sink)
|
||||
G.contract(a, b, concat('a', 'b'))
|
||||
assert G.is_final()
|
||||
assert G.expression() == 'a.b'
|
||||
print(" PASS test_soa_contract")
|
||||
|
||||
|
||||
def test_soa_epsilon_closure():
|
||||
G = SOA()
|
||||
a = G.add_state('a')
|
||||
b = G.add_state('a+')
|
||||
G.add_edge(G.src, a)
|
||||
G.add_edge(a, b)
|
||||
G.add_edge(b, G.sink)
|
||||
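    # No explicit self-loop: the closure itself must add it for the '+' state.
    Gs = G.epsilon_closure()
    assert Gs.has_edge(b, b)
    assert not G.has_edge(b, b)  # the original automaton is left untouched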
|
||||
print(" PASS test_soa_epsilon_closure")
|
||||
|
||||
|
||||
def test_twotinf():
|
||||
seqs = [['a', 'b', 'c'], ['a', 'c']]
|
||||
G = build_soa(seqs)
|
||||
assert G.accept(['a', 'b', 'c'])
|
||||
assert G.accept(['a', 'c'])
|
||||
assert not G.accept(['b', 'c'])
|
||||
print(" PASS test_twotinf")
|
||||
|
||||
|
||||
def test_rwr0_concat():
|
||||
G = SOA()
|
||||
a = G.add_state('a')
|
||||
b = G.add_state('b')
|
||||
G.add_edge(G.src, a)
|
||||
G.add_edge(a, b)
|
||||
G.add_edge(b, G.sink)
|
||||
result = rwr0(G)
|
||||
assert result == 'a.b', f"Expected 'a.b', got {result}"
|
||||
print(" PASS test_rwr0_concat")
|
||||
|
||||
|
||||
def test_rwr0_disj():
|
||||
G = SOA()
|
||||
a = G.add_state('a')
|
||||
b = G.add_state('b')
|
||||
G.add_edge(G.src, a)
|
||||
G.add_edge(G.src, b)
|
||||
G.add_edge(a, G.sink)
|
||||
G.add_edge(b, G.sink)
|
||||
result = rwr0(G)
|
||||
assert result == '(a|b)', f"Expected '(a|b)', got {result}"
|
||||
print(" PASS test_rwr0_disj")
|
||||
|
||||
|
||||
def test_rwr0_iteration():
|
||||
G = SOA()
|
||||
a = G.add_state('a')
|
||||
G.add_edge(G.src, a)
|
||||
G.add_edge(a, G.sink)
|
||||
G.add_edge(a, a)
|
||||
result = rwr0(G)
|
||||
assert result == 'a+', f"Expected 'a+', got {result}"
|
||||
print(" PASS test_rwr0_iteration")
|
||||
|
||||
|
||||
def test_rwr0_optional():
|
||||
G = SOA()
|
||||
a = G.add_state('a')
|
||||
G.add_edge(G.src, a)
|
||||
G.add_edge(a, G.sink)
|
||||
result = rwr0(G)
|
||||
# Single state src→a→sink: language is {a}, not {a,ε}
|
||||
assert result == 'a', f"Expected 'a', got {result}"
|
||||
print(" PASS test_rwr0_optional")
|
||||
|
||||
|
||||
def test_rwr0_empty():
|
||||
G = SOA()
|
||||
result = rwr0(G)
|
||||
assert result == '∅', f"Expected '∅', got {result}"
|
||||
print(" PASS test_rwr0_empty")
|
||||
|
||||
|
||||
def test_rwr0_epsilon():
|
||||
G = SOA()
|
||||
G.add_edge(G.src, G.sink)
|
||||
result = rwr0(G)
|
||||
assert result == 'ε', f"Expected 'ε', got {result}"
|
||||
print(" PASS test_rwr0_epsilon")
|
||||
|
||||
|
||||
def test_rwr0_complex_a():
|
||||
# {abc, ab, ac} is NOT a SORE language (c appears in two roles)
|
||||
G = build_soa([['a', 'b', 'c'], ['a', 'b'], ['a', 'c']])
|
||||
result = rwr0(G)
|
||||
assert result == '∅', f"Expected ∅ for non-SORE, got {result}"
|
||||
print(" PASS test_rwr0_complex_a: ∅ (non-SORE)")
|
||||
|
||||
|
||||
def test_rwr0_disj_concat():
|
||||
"""a·b and a·c share Pred/Succ for b,c after processing."""
|
||||
G = build_soa([['a', 'b'], ['a', 'c']])
|
||||
result = rwr0(G)
|
||||
assert result is not None
|
||||
print(f" PASS test_rwr0_disj_concat: {result}")
|
||||
|
||||
|
||||
def test_crx_simple():
|
||||
crx = CRX()
|
||||
result = crx.infer([['a', 'b'], ['a', 'b', 'c']])
|
||||
assert result is not None and result != '∅'
|
||||
assert 'a' in result
|
||||
assert 'b' in result
|
||||
print(f" PASS test_crx_simple: {result}")
|
||||
|
||||
|
||||
def test_crx_example():
|
||||
"""Example from TODS paper: S = {abccde, cccad, bfegg, bfehi}"""
|
||||
crx = CRX()
|
||||
S = [
|
||||
['a', 'b', 'c', 'c', 'd', 'e'],
|
||||
['c', 'c', 'c', 'a', 'd'],
|
||||
['b', 'f', 'e', 'g', 'g'],
|
||||
['b', 'f', 'e', 'h', 'i'],
|
||||
]
|
||||
result = crx.infer(S)
|
||||
assert result is not None
|
||||
assert '(' in result # should have disjunction factors
|
||||
print(f" PASS test_crx_example: {result}")
|
||||
|
||||
|
||||
def test_crx_cycle_class():
|
||||
"""Symbols a,b,c form a cycle in S = {abc, bca, cab}."""
|
||||
crx = CRX()
|
||||
S = [['a', 'b', 'c'], ['b', 'c', 'a'], ['c', 'a', 'b']]
|
||||
result = crx.infer(S)
|
||||
assert result is not None
|
||||
assert 'a' in result and 'b' in result and 'c' in result
|
||||
print(f" PASS test_crx_cycle_class: {result}")
|
||||
|
||||
|
||||
def test_determinism_check():
|
||||
assert is_deterministic('a.b')
|
||||
assert is_deterministic('a+')
|
||||
assert is_deterministic('(a|b)')
|
||||
assert not is_deterministic('(a|a)')
|
||||
print(" PASS test_determinism_check")
|
||||
|
||||
|
||||
def test_marking():
|
||||
G = KOA(k=2)
|
||||
a1 = G.add_state('a_1')
|
||||
a2 = G.add_state('a_2')
|
||||
G.add_edge(G.src, a1)
|
||||
G.add_edge(a1, a2)
|
||||
G.add_edge(a2, G.sink)
|
||||
H = mark_koa(G)
|
||||
assert H.label(a1) == 'a_1'
|
||||
assert H.label(a2) == 'a_2'
|
||||
assert H.accept(['a_1', 'a_2'])
|
||||
print(" PASS test_marking")
|
||||
|
||||
|
||||
def test_strip():
|
||||
assert strip('a_1.b_1') == 'a.b'
|
||||
assert strip('(a_1|b_1)+') == '(a|b)+'
|
||||
print(" PASS test_strip")


def test_expr_utils():
    assert concat('a', 'b') == 'a.b'
    assert disj('a', 'b') == '(a|b)'
    assert star('a') == 'a+'
    assert optional('a') == 'a?'
    assert optional('a.b') == '(a.b)?'
    assert alphabet('a.b') == {'a', 'b'}
    assert alphabet('(a|b)+') == {'a', 'b'}
    assert strip_k('a_1') == 'a'
    print(" PASS test_expr_utils")


def test_idregex_deterministic():
    """iDRegEx should produce a deterministic expression for simple data."""
    seqs = [['a', 'b'], ['a'], ['a', 'b', 'c']]
    result = idregex(seqs, kmax=2, N=2)
    if result is None:
        print(" SKIP test_idregex_deterministic (returned None)")
        return
    assert is_deterministic(result), f"Non-deterministic: {result}"
    print(f" PASS test_idregex_deterministic: {result}")


def test_complete_koa():
    G, states = build_complete_koa([['a', 'b'], ['a']], k=2)
    assert G.count_symbol('a') == 2
    assert G.count_symbol('b') == 2
    assert G.has_edge(G.src, G.sink)
    print(" PASS test_complete_koa")


# NOTE: this definition is shadowed by the run_all() defined after the
# integration tests below; Python binds the name to the later definition.
def run_all():
    tests = [
        test_soa_basics,
        test_soa_contract,
        test_soa_epsilon_closure,
        test_twotinf,
        test_rwr0_concat,
        test_rwr0_disj,
        test_rwr0_iteration,
        test_rwr0_optional,
        test_rwr0_empty,
        test_rwr0_epsilon,
        test_rwr0_complex_a,
        test_rwr0_disj_concat,
        test_crx_simple,
        test_crx_example,
        test_crx_cycle_class,
        test_determinism_check,
        test_marking,
        test_strip,
        test_expr_utils,
        test_idregex_deterministic,
        test_complete_koa,
    ]
    passed = 0
    failed = 0
    for t in tests:
        try:
            t()
            passed += 1
        except Exception as e:
            print(f" FAIL {t.__name__}: {e}")
            failed += 1
    print(f"\n{passed} passed, {failed} failed")


# ── Integration tests with real Ansible task data ──

def test_integration_quartz_deploy():
    """Simple linear sequence — all tasks always in same order."""
    seqs = [
        ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
        ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'],
    ]
    crx = CRX()
    result = crx.infer(seqs)
    assert result is not None
    assert all(t in result for t in ['file', 'template', 'docker_image', 'command', 'set_fact', 'shell', 'wait_for'])
    print(f" PASS quartz_deploy: {result}")


def test_integration_validate_system():
    """Repeated shell/debug pairs (one to three repetitions)."""
    seqs = [
        ['shell', 'debug', 'shell', 'debug'],
        ['shell', 'debug', 'shell', 'debug', 'shell', 'debug'],
        ['shell', 'debug'],
    ]
    crx = CRX()
    result = crx.infer(seqs)
    assert result is not None
    assert 'shell' in result and 'debug' in result
    print(f" PASS validate_system: {result}")


def test_integration_docker_detect_branch():
    """Branching: docker compose v2 check or v1 fallback."""
    seqs = [
        ['file', 'template', 'command_v2', 'set_fact', 'shell', 'wait_for'],
        ['file', 'template', 'command_v1', 'set_fact', 'shell', 'wait_for'],
    ]
    crx = CRX()
    result = crx.infer(seqs)
    assert result is not None
    assert 'file' in result and 'template' in result and 'shell' in result
    print(f" PASS docker_detect: {result}")


def test_integration_firewall_gating():
    """Conditional firewall rule sequence (gated)."""
    seqs = [
        ['assert', 'file', 'template', 'shell', 'wait_for'],
        ['assert', 'file', 'template', 'command_fw', 'command_fw', 'shell', 'wait_for'],
        ['assert', 'file', 'template', 'command_fw', 'shell', 'wait_for'],
    ]
    crx = CRX()
    result = crx.infer(seqs)
    assert result is not None
    assert 'assert' in result and 'file' in result
    print(f" PASS firewall_gating: {result}")


def test_integration_idregex_linear():
    """iDRegEx on simple linear sequences."""
    seqs = [
        ['assert', 'file', 'template', 'command', 'set_fact', 'shell', 'wait_for'],
        ['assert', 'file', 'template', 'command', 'set_fact', 'shell'],
    ]
    # Let exceptions (including AssertionError) propagate so that run_all()
    # counts them as failures instead of recording a pass after a local FAIL print.
    result = idregex(seqs, kmax=2, N=3)
    if not result:
        print(" SKIP idregex_linear (returned None)")
        return
    assert is_deterministic(result), f"Non-deterministic: {result}"
    print(f" PASS idregex_linear: {result}")


def test_integration_ikoa_linear():
    """iKoa + rwr² on simple linear sequences."""
    from bex.ikoa import ikoa
    from bex.rwrsq import rwr_sq
    seqs = [
        ['assert', 'file', 'template', 'command', 'set_fact', 'shell', 'wait_for'],
        ['assert', 'file', 'template', 'command', 'set_fact', 'shell'],
    ]
    G = ikoa(seqs, k=3)
    if G is None:
        print(" SKIP ikoa_linear (returned None)")
        return
    expr = rwr_sq(G)
    assert expr is not None
    print(f" PASS ikoa_linear: {expr}")


def test_integration_backup_restic():
    """Sequence with loop (systemd enable)."""
    seqs = [
        ['package', 'assert', 'file', 'template', 'template', 'template', 'template', 'template', 'template', 'systemd', 'systemd', 'systemd'],
        ['package', 'assert', 'file', 'template', 'template', 'template', 'template', 'template', 'template', 'systemd'],
    ]
    crx = CRX()
    result = crx.infer(seqs)
    assert result is not None
    print(f" PASS backup_restic: {result}")


def run_all():
    tests = [
        test_soa_basics,
        test_soa_contract,
        test_soa_epsilon_closure,
        test_twotinf,
        test_rwr0_concat,
        test_rwr0_disj,
        test_rwr0_iteration,
        test_rwr0_optional,
        test_rwr0_empty,
        test_rwr0_epsilon,
        test_rwr0_complex_a,
        test_rwr0_disj_concat,
        test_crx_simple,
        test_crx_example,
        test_crx_cycle_class,
        test_determinism_check,
        test_marking,
        test_strip,
        test_expr_utils,
        test_idregex_deterministic,
        test_complete_koa,
        test_integration_quartz_deploy,
        test_integration_validate_system,
        test_integration_docker_detect_branch,
        test_integration_firewall_gating,
        test_integration_idregex_linear,
        test_integration_ikoa_linear,
        test_integration_backup_restic,
    ]
    passed = 0
    failed = 0
    for t in tests:
        try:
            t()
            passed += 1
        except Exception as e:
            print(f" FAIL {t.__name__}: {e}")
            failed += 1
    print(f"\n{passed} passed, {failed} failed")


if __name__ == '__main__':
    run_all()