arXiv:1004.2372v1 [cs.DB] 14 Apr 2010

Learning Deterministic Regular Expressions for the
Inference of Schemas from XML Data
GEERT JAN BEX, WOUTER GELADE, FRANK NEVEN
Hasselt University and Transnational University of Limburg
and
STIJN VANSUMMEREN
Université Libre de Bruxelles

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML
documents essentially reduces to learning deterministic regular expressions from sets of positive
example words. Unfortunately, there is no algorithm capable of learning the complete class of
deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol
occurs only a small number of times. As such, in practice it suffices to learn the subclass of
deterministic regular expressions in which each alphabet symbol occurs at most k times, for some
small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short).
Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a
Minimum Description Length argument. The effectiveness of the method is empirically validated
both on real world and synthetic data. Furthermore, the method is shown to be conservative over
the simpler classes of expressions considered in previous work.
Categories and Subject Descriptors: F.4.3 [Mathematical Logic and Formal Languages]:
Formal Languages; I.2.6 [Artificial Intelligence]: Learning; I.7.2 [Document and Text Processing]: Document Preparation
General Terms: Algorithms, Languages, Theory
Additional Key Words and Phrases: regular expressions, schema inference, XML

1. INTRODUCTION

Recent studies stipulate that schemas accompanying collections of XML documents
are sparse and erroneous in practice. Indeed, Barbosa et al. [2005] and Mignet et al.
[2003] have shown that approximately half of the XML documents available on the
web do not refer to a schema. In addition, Bex et al. [2004] and Martens et al.
[2006] have noted that about two-thirds of XML Schema Definitions (XSDs) gathered from schema repositories and from the web at large are not valid with respect
to the W3C XML Schema specification [Thompson et al. 2001], rendering them
essentially useless for immediate application. A similar observation was made by
Sahuguet [2000] concerning Document Type Definitions (DTDs). Nevertheless, the
presence of a schema strongly facilitates optimization of XML processing (cf., e.g.,
[Benedikt et al. 2005; Che et al. 2006; Du et al. 2004; Freire et al. 2002; Koch et al.
2004; Manolescu et al. 2001; Neven and Schwentick 2006]) and various software
development tools such as Castor [cas ] and SUN’s JAXB [jax ] rely on schemas
as well to perform object-relational mappings for persistence. Additionally, the
existence of schemas is imperative when integrating (meta) data through schema
matching [Rahm and Bernstein 2001] and in the area of generic model management [Bernstein 2003].

<!ELEMENT store (order∗, stock)>
<!ELEMENT order (customer, item+)>
<!ELEMENT customer (first, last, email∗)>
<!ELEMENT item (id, price + (qty, (supplier + item+)))>
<!ELEMENT stock (item∗)>
<!ELEMENT supplier (first, last, email∗)>
Fig. 1. An example DTD.

A preliminary version of this article appeared in the 17th International World Wide Web Conference (WWW 2008).
Permission to make digital/hard copy of all or part of this material without fee for personal
or classroom use provided that the copies are not made or distributed for profit or commercial
advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and
notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish,
to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2024 ACM 0000-0000/2024/0000-0001 $5.00
ACM Journal Name, Vol. V, No. N, November 2024, Pages 1–31.

Based on the above described benefits of schemas and their unavailability in
practice, it is essential to devise algorithms that can infer a DTD or XSD for a
given collection of XML documents when none, or no syntactically correct one, is
present. This is also acknowledged by Florescu [2005] who emphasizes that in the
context of data integration
“We need to extract good-quality schemas automatically from existing
data and perform incremental maintenance of the generated schemas.”
As illustrated in Figure 1, a DTD is essentially a mapping d from element names
to regular expressions over element names. An XML document is valid with respect
to the DTD if for every occurrence of an element name e in the document, the
word formed by its children belongs to the language of the corresponding regular
expression d(e). For instance, the DTD in Figure 1 requires each store element
to have zero or more order children, which must be followed by a stock element.
Likewise, each order must have a customer child, which must be followed by one
or more item elements.
To infer a DTD from a corpus of XML documents C it hence suffices to look,
for each element name e that occurs in a document in C, at the set of element
name words that occur below e in C, and to infer from this set the corresponding
regular expression d(e). As such, the inference of DTDs reduces to the inference
of regular expressions from sets of positive example words. To illustrate, from the
words id price, id qty supplier, and id qty item item appearing under <item>
elements in a sample XML corpus, we could derive the rule
item → (id, price + (qty, (supplier + item+))).
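The reduction described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `child_words` and the toy corpus are hypothetical, and attributes, text content and namespaces are ignored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def child_words(documents):
    """Map each element name to the set of child-name words seen below it."""
    samples = defaultdict(set)
    for text in documents:
        root = ET.fromstring(text)
        for element in root.iter():  # every element, in document order
            samples[element.tag].add(tuple(child.tag for child in element))
    return dict(samples)


# Hypothetical toy corpus, chosen to reproduce the three words in the text.
corpus = [
    "<store><order><customer/><item><id/><price/></item></order><stock/></store>",
    "<item><id/><qty/><supplier/></item>",
    "<item><id/><qty/><item><id/><price/></item><item><id/><price/></item></item>",
]
samples = child_words(corpus)
for word in sorted(samples["item"]):
    print(" ".join(word))
```

Each entry of `samples` is then a set of positive example words from which a regular expression d(e) has to be inferred.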
Although XSDs are more expressive than DTDs, and although XSD inference is
therefore more involved than DTD inference, derivation of regular expressions remains one of the main building blocks on which XSD inference algorithms are built.
In fact, apart from also inferring atomic data types, systems like Trang [Clark ] and
XStruct [Hegewald et al. 2006] simply infer DTDs in XSD syntax. The more recent
iXSD algorithm [Bex et al. 2007] does infer true XSD schemas by first deriving a
regular expression for every context in which an element name appears, where the
context is determined by the path from the root to that element, and subsequently
reduces the number of contexts by merging similar ones.
So, the effectiveness of DTD or XSD schema inference algorithms is strongly
determined by the accuracy of the employed regular expression inference method.
The present article presents a method to reliably learn regular expressions that
are far more complex than the classes of expressions previously considered in the
literature.
1.1 Problem setting

In particular, let Σ be a fixed set of alphabet symbols (also called element names),
and let Σ∗ be the set of all words over Σ.
Definition 1.1 (Regular Expressions). Regular expressions are derived by the following grammar.
r, s ::= ∅ | ε | a | r . s | r + s | r? | r+
Here, parentheses may be added to avoid ambiguity; ε denotes the empty word;
a ranges over symbols in Σ; r . s denotes concatenation; r + s denotes disjunction;
r+ denotes one-or-more repetitions; and r? denotes the optional regular expression.
That is, the language L(r) accepted by regular expression r is given by:
L(∅) = ∅
L(ε) = {ε}
L(a) = {a}
L(r . s) = {vw | v ∈ L(r), w ∈ L(s)}
L(r + s) = L(r) ∪ L(s)
L(r+) = {v1 . . . vn | n ≥ 1 and v1, . . . , vn ∈ L(r)}
L(r?) = L(r) ∪ {ε}.
Note that the Kleene star operator (denoting zero or more repetitions as in r∗) is
not allowed by the above syntax. This is not a restriction, since r∗ can always be
represented as (r+)? or (r?)+. Conversely, the latter can always be rewritten into
the former for presentation to the user.
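As a quick sanity check of this equivalence, the following sketch compares the three forms on all words up to a fixed length, using Python's `re` module. The helper `language` is hypothetical, and the check is bounded testing on one example (r = ab), not a proof. Note that in Python syntax `+` is the postfix one-or-more operator, while the disjunction written r + s in this article would be written `|`.

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """All words over `alphabet` of length <= max_len fully matched by `pattern`."""
    compiled = re.compile(pattern)
    return {
        "".join(letters)
        for n in range(max_len + 1)
        for letters in product(alphabet, repeat=n)
        if compiled.fullmatch("".join(letters))
    }


star = language(r"(?:ab)*", "ab", 6)          # r*
plus_opt = language(r"(?:(?:ab)+)?", "ab", 6)  # (r+)?
opt_plus = language(r"(?:(?:ab)?)+", "ab", 6)  # (r?)+
print(star == plus_opt == opt_plus)
```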
The class of all regular expressions is actually too large for our purposes, as both
DTDs and XSDs require the regular expressions occurring in them to be deterministic (also sometimes called one-unambiguous [Brüggemann-Klein and Wood
1998]). Intuitively, a regular expression is deterministic if, without looking ahead
in the input word, each symbol of that word can be matched uniquely against a
position in the expression when processing the input in one pass from left to right.
For instance, (a + b)∗a is not deterministic, as already the first symbol in the word
aaa could be matched by either the first or the second a in the expression. Without
lookahead, it is impossible to know which one to choose. The equivalent expression
b∗a(b∗a)∗, on the other hand, is deterministic.
Definition 1.2. Formally, let $\overline{r}$ stand for the regular expression obtained from r
by replacing the ith occurrence of alphabet symbol a in r by $a^{(i)}$, for every i and
a. For example, for $r = b^{+} a (b a^{+})?$ we have $\overline{r} = {b^{(1)}}^{+} a^{(1)} (b^{(2)} {a^{(2)}}^{+})?$. A regular
expression r is deterministic if there are no words $w a^{(i)} v$ and $w a^{(j)} v'$ in $L(\overline{r})$ such
that $i \neq j$.
Equivalently, an expression is deterministic if the Glushkov construction [Brüggemann-Klein 1993] translates it into a deterministic finite automaton rather than a non-deterministic one [Brüggemann-Klein and Wood 1998]. Not every non-deterministic
regular expression is equivalent to a deterministic one [Brüggemann-Klein and
Wood 1998]. Thus, semantically, the class of deterministic regular expressions
forms a strict subclass of the class of all regular expressions.
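The Glushkov-based characterization suggests a direct test, sketched below in Python. The tuple encoding of expressions (`"sym"`, `"cat"`, `"alt"`, `"opt"`, `"plus"`, `"eps"`) and all function names are our own assumptions for illustration. The sketch computes the usual first/last/follow position sets and reports an expression as deterministic when no two distinct positions carrying the same symbol compete in the first set or in any follow set.

```python
from itertools import count


def glushkov(expr):
    """Return (first, follow, symbol_of) over the marked positions of `expr`."""
    symbol_of, follow = {}, {}
    fresh = count(1)

    def visit(e):
        kind = e[0]
        if kind == "eps":
            return True, set(), set()
        if kind == "sym":
            p = next(fresh)  # mark this occurrence with a fresh position
            symbol_of[p], follow[p] = e[1], set()
            return False, {p}, {p}
        if kind == "opt":
            _, first, last = visit(e[1])
            return True, first, last
        if kind == "plus":
            nullable, first, last = visit(e[1])
            for p in last:  # repetition: last positions loop back to first
                follow[p] |= first
            return nullable, first, last
        n1, f1, l1 = visit(e[1])
        n2, f2, l2 = visit(e[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for p in l1:  # concatenation: last of left is followed by first of right
            follow[p] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last

    _, first, _ = visit(expr)
    return first, follow, symbol_of


def is_deterministic(expr):
    first, follow, symbol_of = glushkov(expr)

    def clash(positions):
        symbols = [symbol_of[p] for p in positions]
        return len(symbols) != len(set(symbols))

    return not clash(first) and not any(clash(f) for f in follow.values())


def star(r):
    return ("opt", ("plus", r))  # r* written as (r+)?


a, b = ("sym", "a"), ("sym", "b")
nondet = ("cat", star(("alt", a, b)), a)             # (a + b)* a
block = ("cat", star(b), a)                          # b* a
det = ("cat", block, star(("cat", star(b), a)))      # b* a (b* a)*
print(is_deterministic(nondet), is_deterministic(det))
```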
For the purpose of inferring DTDs and XSDs from XML data, we are hence in
search of an algorithm that, given enough sample words of a target deterministic
regular expression r, returns a deterministic expression r′ equivalent to r. In the
framework of learning in the limit [Gold 1967], such an algorithm is said to learn
the deterministic regular expressions from positive data.
Definition 1.3. Define a sample to be a finite subset of Σ∗ and let R be a subclass
of the regular expressions. An algorithm M mapping samples to expressions in R
learns R in the limit from positive data if (1) S ⊆ L(M (S)) for every sample S and
(2) to every r ∈ R we can associate a so-called characteristic sample Sr ⊆ L(r) such
that, for each sample S with Sr ⊆ S ⊆ L(r), M (S) is equivalent to r.
Intuitively, the first condition says that M must be sound ; the second that M
must be complete, given enough data. A class of regular expressions R is learnable
in the limit from positive data if an algorithm exists that learns R. For the class of
all regular expressions, it was shown by Gold that no such algorithm exists [Gold
1967]. We extend this result to the class of deterministic expressions:
Theorem 1.4. The class of deterministic regular expressions is not learnable in
the limit from positive data.
Proof. It was shown by Gold [1967, Theorem I.8], that any class of regular
expressions that contains all non-empty finite languages as well as at least one
infinite language is not learnable in the limit from positive data. Since deterministic
regular expressions like a∗ define an infinite language, it suffices to show that every
non-empty finite language is definable by a deterministic expression. To this end, let
S be a finite, non-empty set of words. Now consider the prefix tree T for S. For
example, if S = {a, aab, abc, aac}, we have the following prefix tree:
root
└── a *
    ├── a
    │   ├── b *
    │   └── c *
    └── b
        └── c *

Nodes for which the path from the root to that node forms a word in S are marked
with an asterisk. In particular, all leaf nodes are marked.
By viewing the internal nodes in T with two or more children as disjunctions;
internal nodes in T with one child as conjunctions; and adding a question mark for
every marked internal node in T , it is straightforward to transform T into a regular
expression. For example, with S and T as above we get r = a .(b . c + a .(b + c))?.
Clearly, L(r) = S. Moreover, since no node in T has two edges with the same label,
r must be deterministic.
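The prefix-tree construction used in this proof can be sketched as follows. The code is an illustration under our own assumptions: words are strings of single-character symbols, the function names are hypothetical, and the result is rendered in Python `re` syntax (where `|` plays the role of the disjunction written + in this article) so that it can be checked by brute force.

```python
import re
from itertools import product


def build_trie(words):
    """Prefix tree; a node is marked when the path to it spells a word."""
    root = {"children": {}, "marked": False}
    for word in words:
        node = root
        for symbol in word:
            node = node["children"].setdefault(
                symbol, {"children": {}, "marked": False})
        node["marked"] = True
    return root


def trie_to_pattern(node):
    """Render the subtree below `node` as a Python `re` pattern."""
    branches = [re.escape(symbol) + trie_to_pattern(child)
                for symbol, child in sorted(node["children"].items())]
    if not branches:
        return ""  # leaf: nothing more to read
    # one child: plain concatenation; several children: disjunction
    body = branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"
    if node["marked"]:
        body = "(?:" + body + ")?"  # marked internal node: the rest is optional
    return body


S = {"a", "aab", "abc", "aac"}
pattern = trie_to_pattern(build_trie(S))
accepted = {
    "".join(w)
    for n in range(5)
    for w in product("abc", repeat=n)
    if re.fullmatch(pattern, "".join(w))
}
print(pattern, accepted == S)
```

Because sibling edges in a prefix tree carry distinct labels, every disjunction produced this way starts its branches with pairwise different symbols, which is the reason the resulting expression is deterministic.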
Theorem 1.4 immediately excludes the possibility for an algorithm to infer the
full class of DTDs or XSDs. In practice, however, regular expressions occurring
in DTDs and XSDs are concise rather than arbitrarily complex. Indeed, a study
of 819 DTDs and XSDs gathered from the Cover Pages [Cover 2003] (including
many high-quality XML standards) as well as from the web at large, reveals that
regular expressions occurring in practical schemas are such that every alphabet
symbol occurs only a small number of times [Martens et al. 2006]. In practice,
therefore, it suffices to learn the subclass of deterministic regular expressions in
which each alphabet symbol occurs at most k times, for some small k. We refer to
such expressions as k-occurrence regular expressions.
Definition 1.5. A regular expression is k-occurrence if every alphabet symbol
occurs at most k times in it.
For example, the expressions customer . order+ and (school + institute)+ are
both 1-occurrence, while id .(qty+id) is 2-occurrence (as id occurs twice). Observe
that if r is k-occurrence, then it is also l-occurrence for every l ≥ k. To simplify
notation in what follows, we abbreviate ‘k-occurrence regular expression’ by k-ORE
and also refer to the 1-OREs as ‘single occurrence regular expressions’ or SOREs.
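For illustration, the smallest k for which an expression is k-occurrence can be computed by counting symbol occurrences. The sketch below assumes that element names are identifier-like tokens in a textual expression; `occurrence_bound` is a hypothetical helper name.

```python
import re
from collections import Counter


def occurrence_bound(expression):
    """Smallest k such that `expression` is k-occurrence (0 if it has no symbols)."""
    counts = Counter(re.findall(r"[A-Za-z_][\w-]*", expression))
    return max(counts.values(), default=0)


print(occurrence_bound("customer . order+"))       # 1-occurrence
print(occurrence_bound("(school + institute)+"))   # 1-occurrence
print(occurrence_bound("id . (qty + id)"))         # 2-occurrence, id occurs twice
```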
1.2 Outline and Contributions

Actually, the above mentioned examination shows that in the majority of the cases
k = 1. Motivated by that observation, we have studied and suggested practical
learning algorithms for the class of deterministic SOREs in a companion article [Bex
et al. 2006]. These algorithms, however, can only output SOREs even when the
target regular expression is not. In that case they always return an approximation
of the target expressions. It is therefore desirable to also have learning algorithms
for the class of deterministic k-OREs with k ≥ 2. Furthermore, since the exact
k-value for the target expression, although small, is unknown in a schema inference
setting, we also require an algorithm capable of determining the best value of k
automatically.
We begin our study of this problem in Section 3 by showing that, for each fixed k,
the class of deterministic k-OREs is learnable in the limit from positive examples
only. We also argue, however, that this theoretical algorithm is unlikely to work
well in practice as it does not provide a method to automatically determine the
best value of k and needs samples whose size can be exponential in the size of the
alphabet to successfully learn some target expressions.
In view of these observations, we provide in Section 4 the practical algorithm
iDRegEx. Given a sample of words S, iDRegEx derives corresponding deterministic k-OREs for increasing values of k and selects from these candidate expressions
the expression that describes S best. To determine the “best” expression we propose two measures: (1) a Language Size measure and (2) a Minimum Description
Length measure based on the work of Adriaans and Vitányi [2006]. The main technical contribution lies in the subroutine used to derive the actual k-OREs for S.
Indeed, while for the special case where k = 1 one can derive a k-ORE by first
learning an automaton A for S using the inference algorithm of Garcia and Vidal
[1990], and by subsequently translating A into a 1-ORE (as shown in [Bex et al.
2006]), this approach does not work when k ≥ 2. In particular, the algorithm of
Garcia and Vidal only works when learning languages that are “n-testable” for
some fixed natural number n [Garcia and Vidal 1990]. Although every language
definable by a 1-ORE is 2-testable [Bex et al. 2006], there are languages definable
by a 2-ORE, for instance a∗ba∗, that are not n-testable for any n. We therefore
use a probabilistic method based on Hidden Markov Models to learn an automaton
for S, which is subsequently translated into a k-ORE.
The effectiveness of iDRegEx is empirically validated in Section 5 both on real
world and synthetic data. We compare the results of iDRegEx with those of
the algorithm presented in previous work [Bex et al. 2008], to which we refer as
iDRegEx(rwr0 ).
2. RELATED WORK

Semi-structured data. In the context of semi-structured data, the inference of
schemas as defined in [Buneman et al. 1997; Quass et al. 1996] has been extensively studied [Goldman and Widom 1997; Nestorov et al. 1998]. No methods were
provided to translate the inferred types to regular expressions, however.
DTD and XSD inference. In the context of DTD inference, Bex et al. [2006]
gave in earlier work two inference algorithms: one for learning 1-OREs and one for
learning the subclass of 1-OREs known as chain regular expressions. The latter
class can also be learned using Trang [Clark ], state of the art software written
by James Clark that is primarily intended as a translator between the schema
languages DTD, Relax NG [Clark and Murata 2001], and XSD, but also infers a
schema for a set of XML documents. In contrast, our goal in this article is to infer
the more general class of deterministic expressions. xtract [Garofalakis et al.
2003] is another regular expression learning system with similar goals. We note
that xtract also uses the Minimum Description Length principle to choose the
best expression from a set of candidates.
Other relevant work on DTD inference includes [Sankey and Wong 2001] and [Chidlovskii
2001], which learn finite automata but do not consider the translation to deterministic
regular expressions. Also, in [Young-Lai and Tompa 2000] a method is proposed to
infer DTDs through stochastic grammars where right-hand sides of rules are represented by probabilistic automata. No method is provided to transform these into
regular expressions. Although Ahonen [1996] proposes such a translation, the effectiveness of her algorithm is only illustrated by a single case study of a dictionary
example; no experimental study is provided.
Also relevant are the XSD inference systems [Bex et al. 2007; Clark ; Hegewald
et al. 2006] that, as already mentioned, rely on the same methods for learning
regular expressions as DTD inference.
Regular expression inference. Most of the learning of regular languages from
positive examples in the computational learning community is directed towards inference of automata as opposed to inference of regular expressions [Angluin and
Smith 1983; Pitt 1989; Sakakibara 1997]. However, these approaches learn strict
subclasses of the regular languages which are incomparable to the subclasses considered here. Some approaches to inference of regular expressions for restricted cases
have been considered. For instance, [Brāzma 1993] showed that regular expressions
without union can be approximately learned in polynomial time from a set of examples satisfying some criteria. [Fernau 2005] provided a learning algorithm for
regular expressions that are finite unions of pairwise left-aligned union-free regular
expressions. The development is purely theoretical; no experimental validation has
been performed.
HMM learning. Although there has been work on Hidden Markov Model structure induction [Rabiner 1989; Freitag and McCallum 2000], the requirement in our
setting that the resulting automaton is deterministic is, to the best of our knowledge, unique.
3. BASIC RESULTS

In this section we establish that, in contrast to the class of all deterministic expressions, the subclass of deterministic k-OREs can theoretically be learned in the limit
from positive data, for each fixed k. We also argue, however, that this theoretical
algorithm is unlikely to work well in practice.
Let Σ(r) denote the set of alphabet symbols that occur in a regular expression
r, and let Σ(S) be similarly defined for a sample S. Define the length of a regular expression r as the length of its string representation, including operators and
parentheses. For example, the length of (a . b)+? + c is 9.
Theorem 3.1. For every k there exists an algorithm M that learns the class of
deterministic k-OREs from positive data. Furthermore, on input S, M runs in
time polynomial in the size of S, yet exponential in k and |Σ(S)|.
Proof. The algorithm M is based on the following observations. First observe
that every deterministic k-ORE r over a finite alphabet A ⊆ Σ can be simplified
into an equivalent deterministic k-ORE r′ of length at most 10k|A| by rewriting r
according to the following system of rewrite rules until no rule is applicable:
((s)) → (s)        s?+   → s+?
s??   → s?         s++   → s+
s + ε → s?         ε + s → s?
s . ε → s          ε . s → s
ε?    → ε          ε+    → ε
s + ∅ → s          ∅ + s → s
s . ∅ → ∅          ∅ . s → ∅
∅?    → ∅          ∅+    → ∅

(The first rewrite rule removes redundant parentheses in r.) Indeed, since each
rewrite rule clearly preserves determinism and language equivalence, r′ must be a
deterministic expression equivalent to r. Moreover, since none of the rewrite rules
duplicates a subexpression and since r is a k-ORE, so is r′. Now note that, since
no rewrite rule applies to it, r′ is either ∅, ε, or generated by the following grammar
t ::= a | a? | a+ | a+? | (a) | (a)? | (a)+ | (a)+?
    | t1 . t2 | (t1 . t2) | (t1 . t2)? | (t1 . t2)+ | (t1 . t2)+?
    | t1 + t2 | (t1 + t2) | (t1 + t2)? | (t1 + t2)+ | (t1 + t2)+?
It is not difficult to verify by structural induction that any expression t produced
by this grammar has length
\[ |t| \le -4 + 10 \sum_{a \in \Sigma(t)} \mathrm{rep}(t, a), \]
where rep(t, a) denotes the number of times alphabet symbol a occurs in t. For
instance, rep(b .(b + c), a) = 0 and rep(b .(b + c), b) = 2. Since rep(r′, a) ≤ k for
every a ∈ Σ(r′), it readily follows that |r′| ≤ 10k|A| − 4 ≤ 10k|A|.
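To make the simplification step concrete, the following sketch applies a subset of the rewrite rules to expressions encoded as nested tuples. The encoding and the name `simplify` are our own assumptions. Only the ε-rules and the rules on stacked ? and + operators are covered; the parenthesis rule is implicit in the tree encoding and the rules involving ∅ are left out for brevity.

```python
EPS = ("eps",)


def simplify(e):
    """Apply the covered rewrite rules bottom-up until none of them applies."""
    kind = e[0]
    if kind in ("eps", "sym"):
        return e
    if kind in ("opt", "plus"):
        s = simplify(e[1])
        if s == EPS:                      # ε? → ε   and   ε+ → ε
            return EPS
        if kind == "opt":                 # s?? → s?
            return s if s[0] == "opt" else ("opt", s)
        if s[0] == "plus":                # s++ → s+
            return s
        if s[0] == "opt":                 # s?+ → s+?
            return simplify(("opt", ("plus", s[1])))
        return ("plus", s)
    left, right = simplify(e[1]), simplify(e[2])
    if kind == "cat":                     # ε . s → s   and   s . ε → s
        if left == EPS:
            return right
        if right == EPS:
            return left
        return ("cat", left, right)
    if left == EPS:                       # ε + s → s?
        return simplify(("opt", right))
    if right == EPS:                      # s + ε → s?
        return simplify(("opt", left))
    return ("alt", left, right)


a, b = ("sym", "a"), ("sym", "b")
print(simplify(("plus", ("opt", ("opt", a)))))   # (a??)+
print(simplify(("cat", a, ("alt", EPS, b))))     # a . (ε + b)
```

None of the covered rules copies a subexpression, so the number of occurrences of each symbol never increases, in line with the argument that r′ remains a k-ORE.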
Then observe that all possible regular expressions over A of length at most 10k|A|
can be enumerated in time exponential in k|A|. Since checking whether a regular expression is deterministic is decidable in polynomial time [Brüggemann-Klein
and Wood 1998]; and since equivalence of deterministic expressions is decidable in
polynomial time [Brüggemann-Klein and Wood 1998], it follows by the above observations that for each k and each finite alphabet A ⊆ Σ it is possible to compute
in time exponential in k|A| a finite set RA of pairwise non-equivalent deterministic
k-OREs over A such that
—every r ∈ RA is of size at most 10k|A|; and
—for every deterministic k-ORE r over A there exists an equivalent expression
r′ ∈ RA.
(Note that since RA is computable in time exponential in k|A|, it has at most an
exponential number of elements in k|A|.) Now fix, for each finite A ⊆ Σ an arbitrary
order ≺ on RA, subject to the provision that r ≺ s only if L(s) − L(r) ≠ ∅. Such
an order always exists since RA does not contain equivalent expressions.
Then let M be the algorithm that, upon sample S, computes RΣ(S) and outputs
the first (according to ≺) expression r ∈ RΣ(S) for which S ⊆ L(r). Since RΣ(S) can
be computed in time exponential in k|Σ(S)|; since there are at most an exponential
number of expressions in RΣ(S) ; since each expression r ∈ RΣ(S) has size at most
10k|Σ(S)|; and since checking membership in L(r) of a single word w ∈ S can be
done in time polynomial in the size of w and r, it follows that M runs in time
polynomial in S and exponential in k|Σ(S)|.
Furthermore, we claim that M learns the class of deterministic k-OREs. Clearly,
S ⊆ L(M(S)) by definition. Hence, it remains to show completeness, i.e., that we
can associate to each deterministic k-ORE r a sample Sr ⊆ L(r) such that, for each
sample S with Sr ⊆ S ⊆ L(r), M(S) is equivalent to r. Note that, by definition of
RΣ(r), there exists a deterministic k-ORE r′ ∈ RΣ(r) equivalent to r. Initialize Sr
to an arbitrary finite subset of L(r) = L(r′) such that each alphabet symbol of r
occurs at least once in Sr, i.e., Σ(Sr) = Σ(r). Let r1 ≺ · · · ≺ rn be all predecessors of
r′ in RΣ(r) according to ≺. By definition of ≺, there exists a word wi ∈ L(r) − L(ri)
for every 1 ≤ i ≤ n. Add all of these words to Sr. Then clearly, for every sample S
with Sr ⊆ S ⊆ L(r) we have Σ(S) = Σ(r) and S ⊈ L(ri) for every 1 ≤ i ≤ n. Since
M(S) is the first expression in RΣ(r) whose language contains S, we hence have
M(S) = r′ ≡ r, as desired.
While Theorem 3.1 shows that the class of deterministic k-OREs is better suited
for learning from positive data than the complete class of deterministic expressions,
it does not provide a useful practical algorithm, for the following reasons.
(1) First and foremost, M runs in time exponential in the size of the alphabet Σ(S),
which may be problematic for the inference of schemas with many element
names.
(2) Second, while Theorem 3.1 shows that the class of deterministic k-OREs is
learnable in the limit for each fixed k, the schema inference setting is such that
we do not know k a priori. If we overestimate k then M(S) risks being an under-approximation of the target expression r, especially when S is incomplete.
To illustrate, consider the 1-ORE target expression r = a+ b+ and sample
S = {ab, abbb, aabb}. If we overestimate k to, say, 2 instead of 1, then M is free
to output aa?b+ as a sound answer. On the other hand, if we underestimate k
then M (S) risks being an over-approximation of r. Consider, for instance, the
2-ORE target expression r = aa?b+ and the same sample S = {ab, abbb, aabb}.
If we underestimate k to be 1 instead of 2, then M can only output 1-OREs,
and needs to output at least a+ b+ in order to be sound. In summary: we need
a method to determine the most suitable value of k.
(3) Third, the notion of learning in the limit is a very liberal one: correct expressions need only be derived when sufficient data is provided, i.e., when the input
sample is a superset of the characteristic sample for the target expression r.
The following theorem shows that there are reasonably simple expressions r
such that the characteristic sample Sr of any sound and complete learning algorithm has size at least exponential in the size of r. As such, it is unlikely for any
sound and complete learning algorithm to behave well on real-world samples,
which are typically incomplete and hence unlikely to contain all words of the
characteristic sample.
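The example in item (2) above can be replayed with Python's `re` module, as a small illustration: both candidate expressions are sound for the sample, so the sample alone cannot tell them apart. The helper name `consistent` is hypothetical; in Python syntax, `a+b+` is the 1-ORE and `aa?b+` the 2-ORE from the text.

```python
import re

sample = ["ab", "abbb", "aabb"]


def consistent(pattern, words):
    """True when every word of the sample is in the language of `pattern`."""
    return all(re.fullmatch(pattern, w) for w in words)


print(consistent(r"a+b+", sample), consistent(r"aa?b+", sample))

# A word outside the sample separates the two languages:
# aaab is accepted by the 1-ORE but rejected by the 2-ORE.
print(bool(re.fullmatch(r"a+b+", "aaab")), bool(re.fullmatch(r"aa?b+", "aaab")))
```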
Theorem 3.2. Let A = {a1, . . . , an} ⊆ Σ consist of n distinct element names.
Let r1 = (a1 a2 + a3 + · · · + an)+, and let r2 = (a2 + · · · + an)+ a1 (a2 + · · · + an)+.
For any algorithm that learns the class of deterministic (2n + 3)-OREs and any
sample S that is characteristic for r1 or r2 we have
\[ |S| \ge \sum_{i=1}^{n} (n-2)^i . \]
Proof. First consider r1 = (a1 a2 + a3 + · · · + an)+. Observe that there exist
an exponential number of deterministic (2n + 3)-OREs that differ from r1 in only
a single word. Indeed, let B = A − {a1, a2} and let W consist of all non-empty
words w over B of length at most n. Define, for every word w = b1 . . . bm ∈ W, the
deterministic (2n + 3)-ORE rw such that L(rw) = L(r1) − {w} as follows. First,
define, for every 1 ≤ i ≤ m, the deterministic 2-ORE $r_w^i$ that accepts all words in
L(r1) that do not start with bi:
\[ r_w^i := (a_1 a_2 + (B - \{b_i\})) \,.\, (a_1 a_2 + a_3 + \cdots + a_n)^{*} \]
Clearly, v ∈ L(r1) − {w} if, and only if, v ∈ L(r1) and there is some 0 ≤ i ≤ m
such that v agrees with w on the first i letters, but differs in the (i + 1)-th letter.
Hence, it suffices to take
\[ r_w := r_w^1 + b_1(\varepsilon + r_w^2 + b_2(\varepsilon + r_w^3 + b_3(\cdots + b_{m-1}(\varepsilon + r_w^m + b_m \,.\, r_1) \cdots ))) \]
Now assume that algorithm M learns the class of deterministic (2n + 3)-OREs and
suppose that Sr1 is characteristic for r1. In particular, Sr1 ⊆ L(r1). By definition,
M(S) is equivalent to r1 for every sample S with Sr1 ⊆ S ⊆ L(r1). We claim that
in order for M to have this property, W must be a subset of Sr1. Then, since W
contains all non-empty words over B of length at most n,
$|S_{r_1}| \ge \sum_{i=1}^{n} (n-2)^i$, as desired. The
intuitive argument why W must be a subset of Sr1 is that if there exists w in W − Sr1,
then M cannot distinguish between r1 and rw. Indeed, suppose for the purpose
of contradiction that there is some w ∈ W with w ∉ Sr1. Then Sr1 is a subset of
L(rw). Indeed, Sr1 = Sr1 − {w} ⊆ L(r1) − {w} = L(rw). Furthermore, since M
learns the class of deterministic (2n + 3)-OREs, there must be some characteristic
sample Srw for rw. Now, consider the sample Sr1 ∪ Srw. It is included in both
L(r1) and L(rw) and is a superset of both Sr1 and Srw. But then, by definition of
characteristic samples, M(Sr1 ∪ Srw) must be equivalent to both r1 and rw. This
is absurd, however, since L(r1) ≠ L(rw) by construction.
A similar argument shows that the characteristic sample Sr2 of r2 = (a2 + · · · +
an)+ a1 (a2 + · · · + an)+ also requires $\sum_{i=1}^{n} (n-2)^i$ elements. In this case, we take
B = A − {a1} and we take W to be the set of all non-empty words over B of
length at most n. For each w = b1 . . . bm ∈ W, we construct the deterministic
(2n + 3)-ORE rw that accepts all words in L(r2) that do not end with
a1 w, as follows. Let, for 1 ≤ i ≤ m, $r_w^i$ be the 2-ORE that accepts all words in B+
that do not start with bi:
\[ r_w^i := (B - \{b_i\}) \,.\, B^{*} \]
Then it suffices to take
\[ r_w := B^{+} a_1 (r_w^1 + b_1(\varepsilon + r_w^2 + b_2(\cdots + b_{m-1}(\varepsilon + r_w^m + b_m B^{+}) \cdots ))). \]
A similar argument as for r1 then shows that the characteristic sample Sr2 of r2
needs to contain, for each w ∈ W, at least one word of the form v a1 w with v ∈ B+.
Therefore, $|S_{r_2}| \ge \sum_{i=1}^{n} (n-2)^i$, as desired.
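To get a feeling for how fast this lower bound grows, the sketch below evaluates the sum and, for small n, cross-checks it against a brute-force enumeration of the set W used for r1 (all non-empty words of length at most n over B = A − {a1, a2}). The function names are hypothetical.

```python
from itertools import product


def bound(n):
    """Lower bound on the characteristic sample size: sum of (n-2)^i for i = 1..n."""
    return sum((n - 2) ** i for i in range(1, n + 1))


def words_W(n):
    """All non-empty words of length at most n over B = {a3, ..., an}."""
    B = [f"a{j}" for j in range(3, n + 1)]
    return [w for m in range(1, n + 1) for w in product(B, repeat=m)]


for n in (3, 4, 5):
    print(n, bound(n), len(words_W(n)))
print(10, bound(10))
```

Already for an alphabet of ten element names the bound exceeds one billion words, which illustrates why characteristic samples are unlikely to occur in real-world corpora.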
4. THE LEARNING ALGORITHM

In view of the observations made in Section 3, we present in this section a practical
learning algorithm that (1) works well on incomplete data and (2) automatically
determines the best value of k (see Section 5 for an experimental evaluation). Specifically, given a sample S, the algorithm derives deterministic k-OREs for increasing
values of k and selects from these candidate expressions the k-ORE that describes
S best. To determine the “best” expression we propose two measures: (1) a Language Size measure and (2) a Minimum Description Length measure based on the
work of Adriaans and Vitányi [2006].
Our algorithm does not derive deterministic k-OREs for S directly, but uses, for
each fixed k, a probabilistic method to first learn an automaton for S, which is subsequently translated into a k-ORE. The following section (Section 4.1) explains how
the probabilistic method that learns an automaton from S works. Section 4.2 explains how the learned automaton is translated into a k-ORE. Finally, Section 4.3
introduces the whole algorithm, together with the two measures to determine the
best candidate expression.
4.1 Probabilistically Learning a Deterministic Automaton

In particular, the algorithm first learns a deterministic k-occurrence automaton
(deterministic k-OA) for S. This is a specific kind of finite state automaton in
which each alphabet symbol can occur at most k times. Figure 2(a) gives an
example. Note that in contrast to the classical definition of an automaton, no
edges are labeled: all incoming edges in a state s are assumed to be labeled by the
label of s. In other words, the 2-OA of Figure 2(a) accepts the same language as
aa?b+ .
Definition 4.1 (k-OA). An automaton is a node-labeled graph G = (V, E, lab)
where
—V is a finite set of nodes (also called states) with a distinguished source src ∈ V
and sink sink ∈ V ;
—the edge relation E is such that src has only outgoing edges; sink has only
incoming edges; and every state v ∈ V − {src, sink } is reachable by a walk from
src to sink ;
—lab : V − {src, sink } → Σ is the labeling function.
In this context, an accepting run for a word a1 . . . an is a walk src s1 . . . sn sink
from src to sink in G such that ai = lab(si ) for 1 ≤ i ≤ n. As usual, we denote
by L(G) the set of all words for which an accepting run exists. An automaton is
k-occurrence (a k-OA) if there are at most k states labeled by the same alphabet
symbol. If G uses only labels in A ⊆ Σ then G is an automaton over A.
In what follows, we write Succ(s) for the set {t | (s, t) ∈ E} of all direct successors
of state s in G, and Pred(s) for the set {t | (t, s) ∈ E} of all direct predecessors
of s in G. Furthermore, we write Succ(s, a) and Pred(s, a) for the set of states in
Succ(s) and Pred(s), respectively, that are labeled by a. As usual, an automaton G
is deterministic if Succ(s, a) contains at most one state, for every s ∈ V and a ∈ Σ.
For convenience, we will also refer to the 1-OAs as “single occurrence automata”
or SOAs for short.
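The notions of Definition 4.1 can be made concrete with a small sketch. The class below is only an illustration: the names KOA, succ_by_label, and occurrence_bound are hypothetical, and the edge set used for the example is an assumption chosen so that the automaton accepts aa?b+ as stated for Figure 2(a).

```python
from collections import defaultdict

SRC, SINK = "src", "sink"


class KOA:
    """Node-labeled automaton G = (V, E, lab) in the sense of Definition 4.1."""

    def __init__(self, labels, edges):
        self.lab = dict(labels)        # state -> symbol (src and sink carry no label)
        self.succ = defaultdict(set)   # state -> set of direct successors
        for s, t in edges:
            self.succ[s].add(t)

    def succ_by_label(self, s, a):
        """Succ(s, a): the direct successors of s that are labeled a."""
        return {t for t in self.succ[s] if self.lab.get(t) == a}

    def is_deterministic(self):
        """True if Succ(s, a) has at most one state for every s and a."""
        symbols = set(self.lab.values())
        states = [SRC, *self.lab]
        return all(len(self.succ_by_label(s, a)) <= 1
                   for s in states for a in symbols)

    def occurrence_bound(self):
        """Smallest k for which this automaton is a k-OA."""
        counts = defaultdict(int)
        for a in self.lab.values():
            counts[a] += 1
        return max(counts.values(), default=0)

    def accepts(self, word):
        """True if some walk src s1 ... sn sink spells out the word."""
        current = {SRC}
        for a in word:
            current = {t for s in current for t in self.succ_by_label(s, a)}
        return any(SINK in self.succ[s] for s in current)


# Assumed edge set for a 2-OA accepting a a? b+ (compare Figure 2(a)).
fig2a = KOA(
    labels={"a1": "a", "a2": "a", "b1": "b"},
    edges=[(SRC, "a1"), ("a1", "a2"), ("a1", "b1"),
           ("a2", "b1"), ("b1", "b1"), ("b1", SINK)],
)
```

With this encoding, fig2a.accepts("aabbb") holds while fig2a.accepts("aaab") does not, and fig2a.is_deterministic() confirms that no state has two successors with the same label.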
We learn a deterministic k-OA for a sample S as follows. First, recall from
Section 3 that Σ(S) is the set of alphabet symbols occurring in words in S. We view
S as the result of a stochastic process that generates words from Σ∗ by performing
random walks on the complete k-OA Ck over Σ(S).
Definition 4.2. Define the complete k-OA Ck over Σ(S) to be the k-OA G =
(V, E, lab) over Σ(S) in which each a ∈ Σ(S) labels exactly k states such that
—there is an edge from src to sink ;
—src is connected to exactly one state labeled by a, for every a ∈ Σ(S); and
—every state s ∈ V − {src, sink } has an outgoing edge to every other state except
src.
To illustrate, the complete 2-OA over {a, b} is shown in Figure 2(b). Clearly,
L(Ck ) = Σ(S)∗ .
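As a rough illustration of Definition 4.2, the following sketch builds the complete k-OA as a set of labeled states and edges and checks on short words that it accepts everything over its alphabet. The function names and the state naming scheme (symbol followed by copy index) are assumptions made for this example; self-loops are included because the α tables of Figure 3 list values such as α(a1, a1).

```python
from itertools import product

SRC, SINK = "src", "sink"


def complete_koa(alphabet, k):
    """Return (labels, edges) of the complete k-OA over `alphabet`."""
    labels = {f"{a}{i}": a for a in sorted(alphabet) for i in range(1, k + 1)}
    inner = list(labels)
    edges = {(SRC, SINK)}
    edges |= {(SRC, f"{a}1") for a in alphabet}       # one entry state per symbol
    edges |= {(s, t) for s in inner for t in inner}   # every inner state to every inner state
    edges |= {(s, SINK) for s in inner}               # every inner state to sink
    return labels, edges


def accepts(labels, edges, word):
    """True if some walk from src to sink spells out the word."""
    current = {SRC}
    for a in word:
        current = {t for (s, t) in edges if s in current and labels.get(t) == a}
    return any((s, SINK) in edges for s in current)


labels, edges = complete_koa({"a", "b"}, k=2)
# All words over {a, b} of length 0 to 4 should be accepted.
all_short_words = ["".join(w) for n in range(5) for w in product("ab", repeat=n)]
covered = all(accepts(labels, edges, w) for w in all_short_words)
```

This only tests words up to length 4; it is a sanity check of the construction, not a proof that L(Ck) = Σ(S)∗.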

Fig. 2. Two 2-OAs. (a) An example 2-OA; it accepts the same language as aa?b+ . (b) The complete 2-OA over {a, b}.

The stochastic process that generates words from Σ∗ by performing random walks
on Ck operates as follows. First, the process picks, among all states in Succ(src),
a state s1 with probability α(src, s1 ) and emits lab(s1 ). Then it picks, among
all states in Succ(s1 ) a state s2 with probability α(s1 , s2 ) and emits lab(s2 ). The
process continues moving to new states and emitting their labels until the final state
is reached (which does not emit a symbol). Of course, α must be a true probability
distribution, i.e.,
\[
  \alpha(s, t) \geq 0 \quad\text{and}\quad \sum_{t \in \mathrm{Succ}(s)} \alpha(s, t) = 1 \tag{1}
\]
for all states s ≠ sink and all states t. The probability of generating a particular
accepting run s⃗ = src s1 s2 . . . sn sink given the process P = (Ck , α) in this setting
is
\[
  P[\vec{s} \mid P] = \alpha(\mathit{src}, s_1) \cdot \alpha(s_1, s_2) \cdot \alpha(s_2, s_3) \cdots \alpha(s_n, \mathit{sink}),
\]
and the probability of generating the word w = a1 . . . an is
\[
  P[w \mid P] = \sum_{\text{all accepting runs } \vec{s} \text{ of } w \text{ in } C_k} P[\vec{s} \mid P].
\]

Assuming independence, the probability of obtaining all words in the sample S is
then
\[
  P[S \mid P] = \prod_{w \in S} P[w \mid P].
\]

Clearly, the process that best explains the observation of S is the one in which the
probabilities α are such that they maximize P [S | P].
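The sum over all accepting runs need not be enumerated explicitly. A minimal sketch, assuming α is stored as a nested dictionary, computes P[w | P] with a forward pass over the word; the function names are hypothetical and the α values are those of panel (a) of Figure 3.

```python
def word_probability(word, alpha, lab):
    """P[w | P]: total probability of all accepting runs of `word`.

    alpha: dict state -> dict successor -> transition probability
    lab:   dict state -> emitted symbol (src and sink emit nothing)
    """
    # forward[s] = probability of emitting the prefix read so far and ending in s
    forward = {"src": 1.0}
    for a in word:
        nxt = {}
        for s, p in forward.items():
            for t, q in alpha.get(s, {}).items():
                if lab.get(t) == a:
                    nxt[t] = nxt.get(t, 0.0) + p * q
        forward = nxt
    # finish every partial run with the step into sink
    return sum(p * alpha.get(s, {}).get("sink", 0.0) for s, p in forward.items())


def sample_probability(sample, alpha, lab):
    """P[S | P] for a bag of words, assuming independence."""
    result = 1.0
    for w in sample:
        result *= word_probability(w, alpha, lab)
    return result


lab = {"a1": "a", "a2": "a", "b1": "b", "b2": "b"}
alpha = {
    "src": {"a1": 1.0, "b1": 0.0, "sink": 0.0},
    "a1": {"a1": 0.2, "a2": 0.3, "b1": 0.3, "b2": 0.1, "sink": 0.1},
    "a2": {"a1": 0.4, "a2": 0.1, "b1": 0.2, "b2": 0.1, "sink": 0.2},
    "b1": {"a1": 0.1, "a2": 0.3, "b1": 0.3, "b2": 0.2, "sink": 0.1},
    "b2": {"a1": 0.1, "a2": 0.1, "b1": 0.2, "b2": 0.5, "sink": 0.1},
}
p_ab = word_probability("ab", alpha, lab)
```

For the word ab there are two accepting runs, src a1 b1 sink and src a1 b2 sink, with probabilities 1 · 0.3 · 0.1 and 1 · 0.1 · 0.1, so p_ab is 0.04 up to floating-point rounding.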
To learn a deterministic k-OA for S we therefore first try to infer from S the
probability distribution α that maximizes P [S | P], and use this distribution to
determine the topology of the desired deterministic k-OA. In particular, we remove
from Ck the non-deterministic edges with the lowest probability as these are the
least likely to contribute to the generation of S, and are therefore the least likely
to be necessary for the acceptance of S.
The problem of inferring α from S is well-studied in Machine Learning, where
our stochastic process P corresponds to a particular kind of Hidden Markov Model
sometimes referred to as a Partially Observable Markov Model (POMM for short).
(For the readers familiar with Hidden Markov Models we note that the initial
state distribution π usually considered in Hidden Markov Models is absorbed in
the state transition distribution α(src, ·) in our context.) Inference of α is generally
accomplished by the well-known Baum-Welch algorithm [Rabiner 1989] that adjusts
initial values for α until a (possibly local) maximum is reached.

Algorithm 1 iKoa
Require: a sample S, a value for k
Ensure: a deterministic k-OA G with S ⊆ L(G)
1: P ← init(k, S)
2: P ← BaumWelsh(P, S)
3: G ← Disambiguate(P, S)
4: G ← Prune(G, S)
5: return G

Algorithm 2 Disambiguate
Require: a POMM P = (G, α) and sample S
Ensure: a deterministic k-OA
 1: Initialize queue Q to {s ∈ Succ(src) | α(src, s) > 0}
 2: Initialize set of marked states D ← ∅
 3: while Q is non-empty do
 4:   s ← first(Q)
 5:   while some a ∈ Σ has |Succ(s, a)| > 1 do
 6:     pick t ∈ Succ(s, a) with α(s, t) = max{α(s, t′) | t′ ∈ Succ(s, a)}
 7:     set α(s, t) ← ∑{α(s, t′) | t′ ∈ Succ(s, a)}
 8:     for all t′ in Succ(s, a) \ {t} do
 9:       delete edge (s, t′) from G
10:       set α(s, t′) ← 0
11:     P ← BaumWelsh(P, S)
12:     if S ⊈ L(G) then Fail
13:   add s to marked states D and pop s from Q
14:   enqueue all states in Succ(s) \ D to Q
15: return G

We use Baum-Welch in our learning algorithm iKoa shown in Algorithm 1, which
operates as follows. In line 1, iKoa initializes the stochastic process P to the tuple
(Ck , α) where
—Ck is the complete k-OA over Σ(S);
—α(src, sink ) is the fraction of empty words in S;
—α(src, s) is the fraction of words in S that start with lab(s), for every s ∈
Succ(src); and
—α(s, t) is chosen randomly for s ≠ src, subject to the constraints in equation (1).
It is important to emphasize that, since we are trying to model a stochastic process,
multiple occurrences of the same word in S are important. A sample should therefore not be considered as a set in Algorithm 1, but as a bag. Line 2 then optimizes
the initial values of α using the Baum-Welch algorithm.
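The initialization in line 1 of iKoa can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name init_alpha, the state naming scheme, and the use of normalized uniform random weights for the random rows are assumptions.

```python
import random
from collections import Counter


def init_alpha(sample, k, seed=0):
    """Initial transition probabilities on the complete k-OA over the symbols of `sample`.

    `sample` is a list (a bag) of words, so repeated words count several times.
    """
    rng = random.Random(seed)
    symbols = sorted({a for w in sample for a in w})
    inner = [f"{a}{i}" for a in symbols for i in range(1, k + 1)]
    firsts = Counter(w[0] if w else "" for w in sample)   # "" stands for the empty word
    total = len(sample)

    # alpha(src, sink): fraction of empty words in the sample
    alpha = {"src": {"sink": firsts[""] / total}}
    for a in symbols:
        # src reaches only the first copy of each symbol; its probability is
        # the fraction of words that start with that symbol
        alpha["src"][f"{a}1"] = firsts[a] / total

    # every other row is random, normalized so that it satisfies equation (1)
    for s in inner:
        targets = inner + ["sink"]
        weights = [rng.random() for _ in targets]
        norm = sum(weights)
        alpha[s] = {t: x / norm for t, x in zip(targets, weights)}
    return alpha


alpha0 = init_alpha(["ab", "aab", "abb", "ab"], k=2)
```

Because the sample is treated as a bag, the word ab contributes twice to the fraction of words starting with a.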
With these probabilities in hand, Disambiguate, shown in Algorithm 2, determines the topology of the desired deterministic k-OA for S. In a breadth-first
manner, it picks for each state s and each symbol a the state t ∈ Succ(s, a) with
the highest probability and deletes all other edges to states labeled by a. Line 7
merely ensures that α continues to be a probability distribution after this removal
and line 11 adjusts α to the new topology. Line 12 is a sanity check that ensures
that we have not removed edges necessary to accept all words in S; Disambiguate
reports failure otherwise. The result of a successful run of Disambiguate is a
deterministic k-OA which nevertheless may have edges (s, t) for which there is no
witness in S (i.e., a word in S whose unique accepting run traverses (s, t)). The
function Prune in line 4 of iKoa removes all such edges. It also removes all states
s ∈ Succ(src) without a witness in S. Figure 3 illustrates a hypothetical run of
iKoa.
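To illustrate lines 5-10 of Disambiguate for a single state, the sketch below keeps, for each symbol, the most probable successor and transfers the probability mass of the deleted edges to it. It deliberately omits the BaumWelsh retraining of line 11 and the check of line 12; the function name and the dictionary encoding are assumptions. The input row is the row of state a1 in panel (b) of Figure 3.

```python
def disambiguate_state(alpha_row, lab):
    """Make the outgoing edges of one state deterministic.

    alpha_row: dict successor -> probability for the state being processed
    lab:       dict state -> symbol (sink has no label)
    Returns a new row in which every symbol has at most one successor.
    """
    by_symbol = {}
    for t, p in alpha_row.items():
        by_symbol.setdefault(lab.get(t), []).append((p, t))

    new_row = {}
    for a, candidates in by_symbol.items():
        if a is None:
            # the unlabeled sink is not subject to disambiguation
            new_row.update({t: p for p, t in candidates})
            continue
        _, best = max(candidates)                         # line 6: most probable a-successor
        new_row[best] = sum(p for p, _ in candidates)     # line 7: keep the total mass
        # lines 8-10: all other a-successors are dropped from the row
    return new_row


lab = {"a1": "a", "a2": "a", "b1": "b", "b2": "b"}
row_a1 = {"a1": 0.2, "a2": 0.3, "b1": 0.3, "b2": 0.19, "sink": 0.01}
new_row_a1 = disambiguate_state(row_a1, lab)
```

The result keeps the edges to a2, b1, and sink with probabilities 0.5, 0.49, and 0.01 (up to rounding), which agrees with the row of a1 in panel (c) of Figure 3, where the edges to a1 and b2 are removed.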
It should be noted that BaumWelsh, which iteratively refines α until a (possibly local) maximum is reached, is computationally quite expensive. For that
reason, our implementation only executes a fixed number of refinement iterations
of BaumWelsh in Line 11. Rather surprisingly, this cut-off actually improves the
precision of iDRegEx, as our experiments in Section 5 show, where it is discussed
in more detail.
4.2 Translating k-OAs into k-OREs

Once we have learned a deterministic k-OA for a given sample S using iKoa
it remains to translate this k-OA into a deterministic k-ORE. An obvious approach in this respect would be to use the classical state elimination algorithm
(cf., e.g., [Hopcroft and Ullman 2007]). Unfortunately, as already hinted at by
Fernau [2004; 2005] and as we illustrate below, it is very difficult to get concise
regular expressions from an automaton representation. For instance, the classical
state elimination algorithm applied to the SOA in Figure 4 yields the expression:1
(aa∗ d + (c + aa∗ c)(c + aa∗ c)∗ (d + aa∗ d) + (b + aa∗ b + (c +
aa∗ c)(c + aa∗ c)∗ (b + aa∗ b))(aa∗ b + (c + aa∗ c)(c + aa∗ c)∗
(b + aa∗ b))∗ (aa∗ d + (c + aa∗ c)(c + aa∗ c)∗ (d + aa∗ d)))(aa∗ d +
(c + aa∗ c)(c + aa∗ c)∗ (d + aa∗ d) + (b + aa∗ b + (c + aa∗ c)(c +
aa∗ c)∗ (b + aa∗ b))(aa∗ b + (c + aa∗ c)(c + aa∗ c)∗ (b + aa∗ b))∗

which is non-deterministic and differs quite a bit from the equivalent deterministic
SORE
((b?(a + c))+ d)+ e.
Actually, results by Ehrenfeucht and Zeiger [1976]; Gelade and Neven [2008]; and
Gruber and Holzer [2008] show that it is impossible in general to generate concise
regular expressions from automata: there are k-OAs (even for k = 1) for which the
number of occurrences of alphabet symbols in the smallest equivalent expression is
exponential in the size of the automaton. For such automata, an equivalent k-ORE
hence does not exist.
It is then natural to ask whether there is an algorithm that translates a given
k-OA into an equivalent k-ORE when such a k-ORE exists, and returns a k-ORE
super approximation of the input k-OA otherwise. Clearly, the above example
shows that the classical state elimination algorithm does not suffice for this purpose.
1 Transformation computed by JFLAP: www.jflap.org.


Fig. 3. Example run of iKoa for k = 2 with target language aa?b+ . For the process
P in (a)-(d), the α values are listed in table form (rows: source state, columns: target state).
To distinguish different states with the same label, we have indexed the labels.

(a) Process P returned by init with random values for α.

    α      a1     a2     b1     b2     sink
    src    1      \      0      \      0
    a1     0.2    0.3    0.3    0.1    0.1
    a2     0.4    0.1    0.2    0.1    0.2
    b1     0.1    0.3    0.3    0.2    0.1
    b2     0.1    0.1    0.2    0.5    0.1

(b) Process P after first training by BaumWelsh.

    α      a1     a2     b1     b2     sink
    src    1      \      0      \      0
    a1     0.2    0.3    0.3    0.19   0.01
    a2     0.01   0.01   0.6    0.37   0.01
    b1     0.01   0.01   0.5    0.28   0.2
    b2     0.01   0.01   0.33   0.5    0.15

(c) Process P after first disambiguation step (for a1 ). Edges to a1 and b2 are removed.

    α      a1     a2     b1     b2     sink
    src    1      \      0      \      0
    a1     0      0.5    0.49   0      0.01
    a2     0.01   0.01   0.6    0.37   0.01
    b1     0.01   0.01   0.5    0.28   0.2
    b2     0.01   0.01   0.33   0.5    0.15

(d) Process P after second disambiguation step (for b1 ). Edges to a2 and b2 are removed.

    α      a1     a2     b1     b2     sink
    src    1      \      0      \      0
    a1     0      0.5    0.49   0      0.01
    a2     0.01   0.01   0.6    0.37   0.01
    b1     0.02   0      0.78   0      0.2
    b2     0.01   0.01   0.38   0.4    0.2

(e) Automaton A returned by Disambiguate.

(f) Automaton A returned by Prune. It accepts the same language as aa?b+ .

Fig. 4. A SOA (with states labeled a, b, c, d, and e) on which the classical state elimination algorithm returns a complicated expression.
Fig. 5. An example marking (states labeled a(1) , a(2) , and b(1) ).

For that reason, we have proposed in a companion article [Bex et al. ] a family
of algorithms {rwr, rwr21 , rwr22 , rwr23 , . . . } that translate SOAs into SOREs and
have exactly these properties:
Theorem 4.3 ([Bex et al. ]). Let G be a SOA and let T be any of the algorithms in the family {rwr, rwr21 , rwr22 , rwr23 , . . . }. If G is equivalent to a SORE
r, then T (G) returns a SORE equivalent to r. Otherwise, T (G) returns a SORE
that is a super approximation of G, i.e., L(G) ⊆ L(T (G)).
(Note that SOAs and SOREs are always deterministic by definition.)
These algorithms, in short, apply an inverse Glushkov translation. Starting from
a k-OA where each state is labeled by a symbol, they iteratively rewrite subautomata into equivalent regular expressions. In the end only one state remains and
the regular expression labeling this state is the output.
In this section, we show how the above algorithms can be used to translate k-OAs
into k-OREs. For simplicity of exposition, we will focus our discussion on rwr21 as
it is the concrete translation algorithm used in our experiments in Section 5, but
the same arguments apply to the other algorithms in the family.
Definition 4.4. First, let Σ(k) denote the alphabet that consists of k copies of
the symbols in Σ, where the first copy of a ∈ Σ is denoted by a(1) , the second by
a(2) , and so on:
Σ(k) := {a(i) | a ∈ Σ, 1 ≤ i ≤ k}.
Let strip be the function mapping copies to their original symbol, i.e., strip(a(i) ) =
a. We extend strip pointwise to words, languages, and regular expressions over
Σ(k) .
For example, strip({a(1) a(2) b(1) , a(2) a(2) c(2) }) = {aab, aac} and strip(a(1) . a(2) ? . b(1)+ ) = a . a? . b+ .
To see how we can use rwr21 , which translates SOAs into SOREs, to translate
a k-OA into a k-ORE, observe that we can always transform a k-OA G over Σ
into a SOA H over Σ(k) by processing the nodes of G in an arbitrary order and
replacing the ith occurrence of label a ∈ Σ by a(i) . To illustrate, the SOA over Σ(2)
obtained in this way from the 2-OA in Figure 2(a) is shown in Figure 5. Clearly,
L(G) = strip(L(H)).
Definition 4.5. We call a SOA H over Σ(k) obtained from a k-OA G in the above
manner a marking of G.
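The marking construction and the strip function are simple enough to sketch directly. In the illustration below a marked symbol a(i) is represented as the pair (a, i); this representation and the function names are assumptions made for the example.

```python
from collections import defaultdict


def mark(labels):
    """Map each state of a k-OA to a marked symbol (a, i), i.e. a(i).

    labels: dict state -> symbol. States are processed in dictionary order;
    any other order would yield another valid marking.
    """
    seen = defaultdict(int)
    marked = {}
    for state, a in labels.items():
        seen[a] += 1                 # this is the i-th state labeled a
        marked[state] = (a, seen[a])
    return marked


def strip(marked_word):
    """Drop the copy index of every marked symbol in a word."""
    return "".join(a for a, _ in marked_word)


marking = mark({"s1": "a", "s2": "a", "s3": "b"})
stripped = {strip(w) for w in [
    [("a", 1), ("a", 2), ("b", 1)],
    [("a", 2), ("a", 2), ("c", 2)],
]}
```

Here marking assigns a(1), a(2), and b(1) to the three states, so no two states share a label and the marked automaton is a SOA over Σ(2), while stripped reproduces the set {aab, aac} from the example above.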
Note that, by Theorem 4.3, running rwr21 on H yields a SORE r over Σ(k)
with L(H) ⊆ L(r). For instance, with H as in Figure 5, rwr21 (H) returns r =
a(1) . a(2) ? . b(1)+ . By subsequently stripping r, we always obtain a k-ORE over Σ.
Moreover, L(G) = strip(L(H)) ⊆ strip(L(r)) = L(strip(r)), so the k-ORE strip(r)
is always a super approximation of G. Algorithm 3, called rwr2 , summarizes the
translation. By our discussion, rwr2 is clearly sound:
Proposition 4.6. rwr2 (G) is a (possibly non-deterministic) k-ORE with L(G) ⊆
L(rwr2 (G)), for every k-OA G.

Algorithm 3 rwr2
Require: a k-OA G
Ensure: a k-ORE r with L(G) ⊆ L(r)
1: compute a marking H of G.
2: return strip(rwr21 (H))

Note, however, that even when G is deterministic and equivalent to a deterministic k-ORE r, rwr2 (G) need not be deterministic, nor equivalent to r. For instance,
consider the 2-OA G:
(Figure: the 2-OA G, with four states labeled b, a, c, and b.)

Clearly, G is equivalent to the deterministic 2-ORE bc?a(ba)+ ?. Now suppose for
the purpose of illustration that rwr2 constructs the following marking H of G. (It
does not matter which marking rwr2 constructs, they all result in the same final
expression.)
(Figure: the marking H of G, with states labeled b(1) , a(1) , c(1) , and b(2) .)

Since H is not equivalent to a SORE over Σ(k) , rwr21 (H) need not be equivalent
to L(H). In fact, rwr21 (H) returns ((b(1) c(1) ?a(1) )?b(2) ?)+ , which yields the nondeterministic ((bc?a)?b?)+ after stripping. Nevertheless, G is equivalent to the
deterministic 2-ORE bc?a(ba)+ ?.
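The difference between the two expressions can be checked mechanically with the standard Glushkov criterion: an expression is deterministic when neither its first set nor any follow set contains two copies of the same symbol. The sketch below assumes that marked symbols are pairs (symbol, index) and that the first and follow sets are given; the sets used here were derived by hand for this illustration.

```python
def is_deterministic(first, follow):
    """Check that no set of candidate next positions holds two copies of one symbol.

    first:  set of marked positions (symbol, index)
    follow: dict position -> set of positions that may come next
    """
    def unambiguous(positions):
        symbols = [a for a, _ in positions]
        return len(symbols) == len(set(symbols))

    return unambiguous(first) and all(unambiguous(nxt) for nxt in follow.values())


b1, b2, a1, a2, c1 = ("b", 1), ("b", 2), ("a", 1), ("a", 2), ("c", 1)

# Marked form of bc?a(ba)+? : b(1) c(1)? a(1) (b(2) a(2))+ ?
det_first = {b1}
det_follow = {b1: {c1, a1}, c1: {a1}, a1: {b2}, b2: {a2}, a2: {b2}}

# Marked form of ((bc?a)?b?)+ : ((b(1) c(1)? a(1))? b(2)?)+
nondet_first = {b1, b2}
nondet_follow = {b1: {c1, a1}, c1: {a1}, a1: {b1, b2}, b2: {b1, b2}}

det_ok = is_deterministic(det_first, det_follow)
nondet_ok = is_deterministic(nondet_first, nondet_follow)
```

For ((bc?a)?b?)+ the first set already contains both b(1) and b(2), so a word starting with b cannot be matched to a unique position, whereas every set for bc?a(ba)+? contains each symbol at most once.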
So although rwr2 is always guaranteed to return a k-ORE, it does not provide
the same strong guarantees that rwr21 provides (Theorem 4.3). The following theorem shows, however, that if we can obtain G by applying the Glushkov construction
on r [Brüggeman-Klein 1993], rwr2 (G) is always equivalent to r. Moreover, if r
is deterministic, then so is rwr2 (G). So in this sense, rwr2 applies an inverse
Glushkov construction to r. Formally, the Glushkov construction is defined as
follows.
Definition 4.7. Let r be a k-ORE. Recall from Definition 1.2 that r̄ is the regular
expression obtained from r by replacing the ith occurrence of alphabet symbol a
by a(i) , for every a ∈ Σ and every 1 ≤ i ≤ k. Let pos(r̄) denote the symbols in Σ(k)
that actually appear in r̄. Moreover, let the sets first(r̄), last(r̄), and follow (r̄, a(i) )
be defined as shown in Figure 6. A k-OA G is a Glushkov translation of r if there
exists a one-to-one onto mapping ρ : (V (G) − {src, sink }) → pos(r̄) such that
    first(∅)     = ∅                         first(ε)     = ∅
    first(a(i) ) = {a(i) }                   first(r?)    = first(r)
    first(r+ )   = first(r)                  first(r + s) = first(r) ∪ first(s)
    first(r . s) = first(r)                  if ε ∉ L(r),
                   first(r) ∪ first(s)       otherwise.

    last(∅)      = ∅                         last(ε)      = ∅
    last(a(i) )  = {a(i) }                   last(r?)     = last(r)
    last(r+ )    = last(r)                   last(r + s)  = last(r) ∪ last(s)
    last(r . s)  = last(s)                   if ε ∉ L(s),
                   last(r) ∪ last(s)         otherwise.

    follow (a(i) , a(i) )  = ∅
    follow (r?, a(i) )     = follow (r, a(i) )
    follow (r+ , a(i) )    = follow (r, a(i) )               if a(i) ∉ last(r),
                             follow (r, a(i) ) ∪ first(r)    otherwise.
    follow (r + s, a(i) )  = follow (r, a(i) )               if a(i) ∈ pos(r),
                             follow (s, a(i) )               otherwise.
    follow (r . s, a(i) )  = follow (r, a(i) )               if a(i) ∈ pos(r), a(i) ∉ last(r),
                             follow (r, a(i) ) ∪ first(s)    if a(i) ∈ pos(r), a(i) ∈ last(r),
                             follow (s, a(i) )               otherwise.

Fig. 6. Definition of first(r), last(r), and follow (r, a(i) ), for a(i) ∈ pos(r).

(1) v ∈ Succ(src) ⇔ ρ(v) ∈ first(r̄);
(2) v ∈ Pred(sink ) ⇔ ρ(v) ∈ last(r̄);
(3) v ∈ Succ(w) ⇔ ρ(v) ∈ follow (r̄, ρ(w)); and
(4) strip(ρ(v)) = lab(v),
for all v, w ∈ V (G) − {src, sink }.
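The recursive definitions of Figure 6 translate almost line by line into code. The following sketch is an illustration under stated assumptions: marked expressions are encoded as nested tuples with the hypothetical tags "sym", "opt", "plus", "alt", and "cat"; the cases for ∅ and ε are left out for brevity; and the edge construction follows conditions (1)-(3) of Definition 4.7 with ρ taken to be the identity on positions.

```python
def nullable(e):
    """True if the empty word is in L(e)."""
    op = e[0]
    if op == "sym":
        return False
    if op == "opt":
        return True
    if op == "plus":
        return nullable(e[1])
    if op == "alt":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])          # concatenation


def pos(e):
    """Marked symbols (a, i) occurring in e."""
    if e[0] == "sym":
        return {e[1:]}
    return set().union(*(pos(c) for c in e[1:]))


def first(e):
    op = e[0]
    if op == "sym":
        return {e[1:]}
    if op in ("opt", "plus"):
        return first(e[1])
    if op == "alt":
        return first(e[1]) | first(e[2])
    return first(e[1]) | first(e[2]) if nullable(e[1]) else first(e[1])


def last(e):
    op = e[0]
    if op == "sym":
        return {e[1:]}
    if op in ("opt", "plus"):
        return last(e[1])
    if op == "alt":
        return last(e[1]) | last(e[2])
    return last(e[1]) | last(e[2]) if nullable(e[2]) else last(e[2])


def follow(e, p):
    """Positions that may directly follow position p in a word of L(e)."""
    op = e[0]
    if op == "sym":
        return set()
    if op == "opt":
        return follow(e[1], p)
    if op == "plus":
        return follow(e[1], p) | (first(e[1]) if p in last(e[1]) else set())
    r, s = e[1], e[2]
    if op == "alt":
        return follow(r, p) if p in pos(r) else follow(s, p)
    if p in pos(r):                                    # concatenation r . s
        return follow(r, p) | (first(s) if p in last(r) else set())
    return follow(s, p)


def glushkov_edges(e):
    """Edges of a Glushkov translation per conditions (1)-(3) of Definition 4.7."""
    edges = {("src", p) for p in first(e)}
    edges |= {(p, "sink") for p in last(e)}
    edges |= {(p, q) for p in pos(e) for q in follow(e, p)}
    return edges


A1, A2, B1 = ("sym", "a", 1), ("sym", "a", 2), ("sym", "b", 1)
# Marked form of a a? b+ : a(1) . a(2)? . b(1)+
expr = ("cat", ("cat", A1, ("opt", A2)), ("plus", B1))
edges = glushkov_edges(expr)
```

For this expression the first set is {a(1)}, the last set is {b(1)}, and the follow sets are {a(2), b(1)} for a(1), {b(1)} for a(2), and {b(1)} for b(1), which yields six edges in total.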
Theorem 4.8. If k-OA G is a Glushkov representation of a target k-ORE
r, then rwr2 (G) is equivalent to r. Moreover, if r is deterministic, then so is
rwr2 (G).
Proof. Since rwr2 (G) = strip(rwr21 (H)) for an arbitrarily chosen marking
H of G, it suffices to prove that strip(rwr21 (H)) is equivalent to r and that
strip(rwr21 (H)) is deterministic whenever r is deterministic, for every marking H
of G. Hereto, let H be an arbitrary but fixed marking of G. In particular, G and H
have the same set of nodes V and edges E, but differ in their labeling function. Let
lab_G be the labeling function of G and let lab_H be the labeling function of H. Clearly,
lab_G (v) = strip(lab_H (v)) for every v ∈ V − {src, sink }. Since G is a Glushkov
translation of r, there is a one-to-one, onto mapping ρ : (V − {src, sink }) → pos(r̄)
satisfying properties (1)-(4) in Definition 4.7. Now let σ : pos(r̄) → Σ(k) be the
function that maps a(i) ∈ pos(r̄) to lab_H (ρ⁻¹(a(i) )). Since lab_H assigns a distinct
label to each state, σ is one-to-one and onto the subset of Σ(k) symbols used as
labels in H. Moreover, by property (4) and the fact that lab_G (v) = strip(lab_H (v))
we have

    strip(a(i) ) = lab_G (ρ⁻¹(a(i) )) = strip(lab_H (ρ⁻¹(a(i) ))) = strip(σ(a(i) ))        (⋆)

for each a(i) ∈ pos(r̄). In other words, σ preserves (stripped) labels. Now let σ(r̄)
be the SORE obtained from r̄ by replacing each a(i) ∈ pos(r̄) by σ(a(i) ). Since σ is
one-to-one and r̄ is a SORE, so is σ(r̄). Moreover, we claim that L(H) = L(σ(r̄)).
Indeed, it is readily verified by induction on r̄ that a word a1(i1) . . . an(in) ∈ L(r̄)
if, and only if, (i) a1(i1) ∈ first(r̄); (ii) ap+1(ip+1) ∈ follow (r̄, ap(ip) ) for every
1 ≤ p < n; and (iii) an(in) ∈ last(r̄). By properties (1)-(4) of Definition 4.7 we
hence obtain:

    σ(a1(i1) ) . . . σ(an(in) ) ∈ L(σ(r̄))
    ⇔ a1(i1) . . . an(in) ∈ L(r̄)
    ⇔ src, ρ⁻¹(a1(i1) ), . . . , ρ⁻¹(an(in) ), sink is a walk in G
    ⇔ src, ρ⁻¹(a1(i1) ), . . . , ρ⁻¹(an(in) ), sink is a walk in H
    ⇔ lab_H (ρ⁻¹(a1(i1) )) . . . lab_H (ρ⁻¹(an(in) )) ∈ L(H)
    ⇔ σ(a1(i1) ) . . . σ(an(in) ) ∈ L(H)

Therefore, L(H) = L(σ(r̄)).
Hence, we have established that H is a SOA over Σ(k) equivalent to the SORE
σ(r̄) over Σ(k) . By Theorem 4.3, rwr21 (H) is hence equivalent to σ(r̄). Therefore,
strip(rwr21 (H)) is equivalent to strip(σ(r̄)), which by (⋆) above, is equivalent to
strip(r̄) = r, as desired.
Finally, to see that strip(rwr21 (H)) is deterministic if r is deterministic, let
s := strip(rwr21 (H)) and suppose for the purpose of contradiction that s is not
deterministic. Then there exist wa(i) v1 and wa(j) v2 in L(s̄) with i ≠ j. It is
not hard to see that this can happen only if there exist w′a(i′) v1′ and w′a(j′) v2′
in L(rwr21 (H)) with i′ ≠ j′. Since L(rwr21 (H)) = L(σ(r̄)) we know that hence
σ⁻¹(w′a(i′) v1′) ∈ L(r̄) and σ⁻¹(w′a(j′) v2′) ∈ L(r̄). Let w″a(i″) v1″ = σ⁻¹(w′a(i′) v1′)
and w″a(j″) v2″ = σ⁻¹(w′a(j′) v2′). Since σ is one-to-one and i′ ≠ j′, also i″ ≠ j″.
Therefore, r is not deterministic, which yields the desired contradiction.
4.3 The whole Algorithm

Our deterministic regular expression inference algorithm iDRegEx combines iKoa
and rwr2 as shown in Algorithm 4. For increasing values of k until a maximum
kmax is reached, it first learns a deterministic k-OA G from the given sample S,
and subsequently translates that k-OA into a k-ORE using rwr2 . If the resulting
k-ORE is deterministic then it is added to the set C of deterministic candidate
expressions for S, otherwise it is discarded. From this set of candidate expressions,
iDRegEx returns the “best” regular expression best(C), which is determined according to one of the measures introduced below. Since it is well-known that,
depending on the initial value of α, BaumWelsh (and therefore iKoa) may converge to a local maximum that is not necessarily global, we apply iKoa a number
of times N with independently chosen random seed values for α to increase the
probability of correctly learning the target regular expression from S.
The observant reader may wonder whether we are always guaranteed to derive
at least one deterministic expression such that best(C) is defined. Indeed, Theorem 4.8 tells us that if we manage to learn from sample S a k-OA which is the
Glushkov representation of the target expression r, then rwr2 will always return
a deterministic k-ORE equivalent to r. When k > 1, there can be several k-OAs
representing the same language and we could therefore learn a non-Glushkov one.
In that case, rwr2 always returns a k-ORE which is a super approximation of the
target expression. Although that approximation can be non-deterministic, since we
derive k-OREs for increasing values of k and since for k = 1 the result of rwr2 is
always deterministic (as every SORE is deterministic), we always infer at least one
deterministic regular expression. In fact, in our experiments on 100 synthetic regular expressions, we derived for 96 of them a deterministic expression with k > 1,
and had to resort to a 1-ORE approximation for only 4 expressions.

Algorithm 4 iDRegEx
Require: a sample S
Ensure: a k-ORE r
1: initialize candidate set C ← ∅
2: for k = 1 to kmax do
3:   for n = 1 to N do
4:     G ← iKoa(S, k)
5:     if rwr2 (G) is deterministic then
6:       add rwr2 (G) to C
7: return best(C)

4.3.1 A Language Size Measure for Determining the Best Candidate. Intuitively,
we want to select from C the simplest deterministic expression that “best” describes
S. Since each candidate expression in C accepts all words in S by construction, one
way to interpret “the best” is to select the expression that accepts the least number
of words (thereby adding the least number of words to S). Since an expression defines an infinite language in general, it is of course impossible to take all words into
account. We therefore only consider the words up to a length n, where n = 2m + 1
with m the length of the candidate expression, excluding regular expression operators, ∅, and ε. For instance, if the candidate expression is a .(a + c+ )?, then m = 3
and n = 7. Formally, for a language L, let |L≤n | denote the number of words in L
of length at most n. Then the best candidate in C is the one with the least value of
| L(r)≤n |. If there are multiple such candidates, we pick the shortest one (breaking
ties arbitrarily). It turns out that | L(r)≤n | can be computed quite efficiently; see
[Bex et al. ] for details.
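One simple way to obtain | L(r)≤n | is to count walks in a deterministic automaton for r, since in a deterministic automaton every accepted word has exactly one accepting run. The sketch below is not the procedure of the companion article; it is a plain dynamic program under that assumption, with a hand-built automaton for the example expression a .(a + c+ )? and hypothetical state names.

```python
def language_size(succ, n):
    """Number of words of length at most n accepted by a deterministic k-OA.

    succ: dict state -> list of successors, using the states "src" and "sink".
    Counting accepting runs equals counting words only because the automaton
    is assumed to be deterministic.
    """
    total = 0
    walks = {"src": 1}   # walks[s] = number of walks from src to s for the current word length
    for _ in range(n + 1):                       # word lengths 0, 1, ..., n
        total += sum(c for s, c in walks.items() if "sink" in succ.get(s, ()))
        nxt = {}
        for s, c in walks.items():
            for t in succ.get(s, ()):
                if t != "sink":
                    nxt[t] = nxt.get(t, 0) + c
        walks = nxt
    return total


# Deterministic 2-OA for a . (a + c+)? with hand-chosen state names.
succ = {
    "src": ["a1"],
    "a1": ["a2", "c1", "sink"],
    "a2": ["sink"],
    "c1": ["c1", "sink"],
}
m = 3                    # symbol occurrences in a . (a + c+)?
n = 2 * m + 1
size = language_size(succ, n)
```

For n = 7 the accepted words are a, aa, and ac, acc, . . . up to six c's, so size is 8.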
4.3.2 A Minimum Description Length Measure for Determining the Best Candidate. An alternative measure to determine the best candidate is given by Adriaans
and Vitányi [2006], who compare the size of S with the size of the language of a
candidate r. Specifically, Adriaans and Vitányi define the data encoding cost of r
to be:
\[
  \mathrm{datacost}(r, S) := \sum_{i=0}^{n} \left( 2 \cdot \log_2 i + \log_2 \binom{|L^{=i}(r)|}{|S^{=i}|} \right),
\]
where n = 2m + 1 as before; |S =i | is the number of words in S that have length i;
and | L=i (r)| is the number of words in L(r) that have exactly length i. Although
the above formula is numerically difficult to compute, there is an easier estimation
procedure; see [Adriaans and Vitányi 2006] for details.
In this case, the model encoding cost is simply taken to be its length, thereby
preferring shorter expressions over longer ones. The best regular expression in the
candidate set C is then the one that minimizes both model and data encoding cost
(breaking ties arbitrarily).
We already mentioned that xtract [Garofalakis et al. 2003] also utilizes the
Minimum Description Length principle. However, their measure for data encoding
cost depends on the concrete structure of the regular expressions while ours only
depends on the language defined by them and is independent of the representation.
Therefore, in our setting, when two equivalent expressions are derived, the one with
the smallest model cost, that is, the simplest one, will always be taken.
5. EXPERIMENTS

In this section we validate our approach by means of an experimental analysis.
Throughout the section, we say that a target k-ORE r is successfully derived when
a k-ORE s with L(r) = L(s) is generated. The success rate of our experiments
then is the percentage of successfully derived target regular expressions.
Our previous work [Bex et al. 2008] on this topic was based on a version of the
rwr0 algorithm [Bex et al. 2006]; we refer to this algorithm as iDRegEx(rwr0 ).
Unfortunately, as detailed in [Bex et al. 2008], it is not known whether rwr0 is
complete on the class of all single occurrence regular expressions. Nevertheless, the
experiments in [Bex et al. 2008] which are revisited below show a good and reliable
performance. However, to obtain a theoretically complete algorithm, cf. Theorem 4.8, we use the algorithm rwr2 which is sound and complete on single occurrence regular expressions. In the remainder we focus on iDRegEx, but compare
with the results for iDRegEx(rwr0 ).
As mentioned in Section 4.3.1, another new aspect of the results presented here is
the use of language size as an alternative measure over Minimum Description Length
(MDL) to compare candidates. The iDRegEx(rwr0 ) algorithm is only considered
with the MDL criterion. We note that for alphabet size 5, the success rate of
iDRegEx with the MDL criterion was only 21 %, while that of the language size
criterion is 98 %. The corpus used in this experiment is described in Section 5.3.
Therefore in the remainder of this section we only consider iDRegEx with the
language size criterion.
For all the experiments described below we take kmax = 4 and N = 10 in Algorithm 4.
5.1 Running times

All experiments were performed using a prototype implementation of iDRegEx
and iDRegEx(rwr0 ) written in Java executed on Pentium M 2.0 GHz class machines equipped with 1GB RAM. For the BaumWelsh subroutine we have gratefully used Jean-Marc François’ Jahmm library [François 2006], which is a faithful
implementation of the algorithms described in Rabiner’s Hidden Markov Model tutorial [Rabiner 1989]. Since Jahmm strives for clarity rather than performance and
since only limited precautions are taken against underflows, our prototype should
be seen as a proof of concept rather than a polished product. In particular, underflows currently limit us to target regular expressions whose total number of symbol
occurrences is at most 40. Here, the total number of symbol occurrences occ(r) of
a regular expression r is its length excluding the regular expression operators and
parentheses. To illustrate, the total number of symbol occurrences in aa?b+ is 3.
Furthermore, the lack of optimization in Jahmm leads to average running times
ranging from 4 minutes for target expressions r with |Σ(r)| = 5 and occ(r) = 6 to
9 hours for target expressions with |Σ(r)| = 15 and occ(r) = 30. Running times for
iDRegEx and iDRegEx(rwr0 ) are similar.
As already mentioned in Section 4.1, one of the bottlenecks of iDRegEx is the application of BaumWelsh in Line 11 of Disambiguate (Algorithm 2). BaumWelsh
is an iterative procedure that is typically run until convergence, i.e., until the
computed probability distribution no longer changes significantly. To improve the
running time, we only apply a fixed number ℓ of iteration steps when calling
BaumWelsh in Line 11 of Disambiguate. Experiments show that the running
time scales linearly with ℓ, as one expects, but, perhaps surprisingly, the
success rate improves as well for an optimal value of ℓ. This optimal value for ℓ
depends on the alphabet size. These improved results can be explained as follows:
applying BaumWelsh in each disambiguation step until it converges guarantees
that the probability distribution for that step will have reached a local optimum.
However, we know that the search space for the algorithm contains many local optima, and that BaumWelsh is a local optimization algorithm, i.e., it will converge
to one of the local optima it can reach from its starting point by hill climbing. The
disambiguation procedure proceeds state by state, so fine tuning the probability
distribution for a disambiguation step may transform the search space so that certain local optima for the next iteration can no longer be reached by a local search
algorithm such as BaumWelsh. Table I shows the performance of the algorithm
for various numbers of BaumWelsh iterations ℓ for expressions of alphabet size 5,
10 and 15. These expressions are those described in Section 5.3. In this table,
ℓ = ∞ denotes the case where BaumWelsh is run until convergence after each
disambiguation step. The table illustrates that the success rate is actually higher
for small values of ℓ. The running time performance gains increase rapidly with
the expressions’ alphabet size: for |Σ| = 5, we gain a factor of 3.5 (ℓ = 2), for
|Σ| = 10, it is already a factor of 10 (ℓ = 3) and for |Σ| = 15, we gain a factor
of 25 (ℓ = 3). This brings the running time for the largest expressions we tested
down to 22 minutes, in contrast with the 9 hours mentioned for iDRegEx(rwr0 ) and
iDRegEx. The algorithm with the optimal number of BaumWelsh steps in the
disambiguation process will be referred to as iDRegExfixed . In particular, for small
alphabet sizes (|Σ| ≤ 7) we use ℓ = 2, and for large alphabet sizes (|Σ| > 7) we use ℓ = 3. We
note that the alphabet size can easily be determined from the sample.
We should also note that experience with Hidden Markov Model learning in bioinformatics [Finn et al. 2006] suggests that both the running time and the maximum
number of symbol occurrences that can be handled can be significantly improved
by moving to an industrial-strength BaumWelsh implementation. Our focus for
the rest of the section will therefore be on the precision of iDRegEx.
    ℓ      rate |Σ| = 5     rate |Σ| = 10     rate |Σ| = 15
    1      95 %             80 %              40 %
    2      100 %            75 %              50 %
    3      95 %             84 %              60 %
    4      95 %             77 %              50 %
    ∞      98 %             75 %              50 %

Table I. Success rate for a limited number of BaumWelsh iterations in the disambiguation procedure; ℓ = ∞ corresponds to iDRegEx, and ℓ = 1, . . . , 4 correspond to iDRegExfixed .

5.2 Real-world target expressions and real-world samples

We want to test how iDRegEx performs on real-world data. Since the number
of publicly available XML corpora with valid schemas is rather limited, we have
used as target expressions the 49 content models occurring in the XSD for XML
Schema Definitions [Thompson et al. 2001] and have drawn multiset samples for
these expressions from a large corpus of real-world XSDs harvested from the Cover
Pages [Cover 2003]. In other words, the goal of our first experiment is to derive, from
a corpus of XSD definitions, the regular expression content models in the schema
for XML Schema Definitions (footnote 2). As it turns out, the XSD regular expressions are all
single occurrence regular expressions.
The iDRegEx(rwr0) algorithm infers all these expressions correctly, showing
that it is conservative with respect to k since, as mentioned above, the algorithm
considers k values ranging from 1 to 4. In this setting, iDRegEx does not perform
as well, deriving only 73 % of the regular expressions correctly. We note that for
each expression that was not derived exactly, the expression obtained still
describes the input sample and is, in addition, more specific than the target
expression. iDRegEx therefore seems to favor more specific regular expressions,
based on the available examples.
5.3 Synthetic target expressions

Although the successful inference of the real-world expressions in Section 5.2 suggests that iDRegEx is applicable in real-world scenarios, we further test its behavior on a sizable and diverse set of regular expressions. Due to the lack of real-world
data, we have developed a synthetic regular expression generator that is parameterized for flexibility.
Synthetic expression generation. In particular, the occurrence of the regular
expression operators concatenation, disjunction (+), zero-or-one (?), zero-or-more
(∗), and one-or-more (⁺) in the generated expressions is determined by a user-defined probability distribution. We found that typical values yielding realistic
expressions are 1/10 for each of the unary operators and 7/20 for each of the others. The alphabet
can be specified, as well as the number of times that each individual symbol should
occur. The maximum of these numbers determines the value k of the generated
k-ORE.
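The parameterization described above can be illustrated with a small sketch. Only the operator probabilities (1/10 for each unary operator, 7/20 for each binary operator) and the per-symbol occurrence counts come from the text; the combination procedure, the tuple encoding of expressions, and all names below are assumptions made for illustration and should not be read as the generator actually used.

```python
import random
from collections import Counter

# Operator distribution reported in the text: 3 * 1/10 + 2 * 7/20 = 1.
OPERATOR_WEIGHTS = {
    "concat": 7 / 20, "union": 7 / 20,
    "opt": 1 / 10, "star": 1 / 10, "plus": 1 / 10,
}
BINARY = {"concat", "union"}


def random_expression(occurrences, rng):
    """Build a random syntax tree with the requested symbol occurrences.

    `occurrences` maps each symbol to how often it must occur (non-empty).
    Hypothetical procedure: repeatedly draw an operator and apply it to
    randomly chosen subexpressions until a single tree remains.
    """
    pool = [("sym", a) for a, n in sorted(occurrences.items()) for _ in range(n)]
    rng.shuffle(pool)
    names = list(OPERATOR_WEIGHTS)
    weights = list(OPERATOR_WEIGHTS.values())
    while len(pool) > 1:
        op = rng.choices(names, weights=weights)[0]
        i = rng.randrange(len(pool))
        if op in BINARY:
            left = pool.pop(i)                 # merge two subexpressions
            j = rng.randrange(len(pool))
            pool[j] = (op, left, pool[j])
        else:
            pool[i] = (op, pool[i])            # wrap one subexpression
    return pool[0]


def symbol_counts(expr):
    """Count how often each alphabet symbol occurs in a syntax tree."""
    if expr[0] == "sym":
        return Counter([expr[1]])
    total = Counter()
    for child in expr[1:]:
        total += symbol_counts(child)
    return total


# Usage: a 2-ORE candidate over {a, b, c}, since max occurrence count is 2.
expr = random_expression({"a": 2, "b": 1, "c": 1}, random.Random(0))
k = max(symbol_counts(expr).values())
```

The sketch neither simplifies the result nor checks determinism; both steps are described separately in the text.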
To ensure the validity of our experiments, we want to generate a wide range of
different expressions. To this end, we measure how much the language of a generated
expression overlaps with Σ∗. The larger the overlap, the greater its language size
as defined in Section 4.3.1.

((debab) + c)∗ a
((((c + b)b) + a)ca) + e + d
(((ea)∗ db) + b + a + c)+
((b+ + c + e + d)aab)+
((((eabh) + d + j + c + b)+ f) + a + g + i)?
((((aa) + e)+ + c)b) + b + d
((((d + a)∗ eabcb) + c)a)?
((((ac) + b + d)eab) + c)∗
(((((bab) + c)+ + e)?a) + d)+
((((ecb)+ a) + b)+ + d + a)?
((bagbfeid) + c + a + j + h)∗
((gdab) + a + i + c + j + e + f)+ hb
((h∗ cdfa) + j + e + g + b + i)∗ ab
((g + b + e + f + i + d)∗ aba) + h + j + c
((((h + b + c + j + f)+ + e)?aaidb) + g)?
(((((dbe)∗ cf) + j)hac) + b + i)∗ gad
(((((ihaaj) + d)+ + g)b) + e + b + f + c)+
(((ecgecd) + b + d + a + j + f)∗ ihaba)∗
(l + c + d + m + n)∗ aojahbegcbfidke
(((c + b)ab) + d + i + a)+ + j + g + f + e + h
(((a?clfhabgd) + b + n + o)iedjcem)∗ k
((a + k + f + c + m + e)+ bdieclbonjgda)∗ h
(((k?jghadfcelifcjbhom)+ b + g + a + e + i + n)+ + d)?
(((aedoadenhdbci) + h + k + m + j + g + b)∗ fccgelbifja)
((a+ + f + d + o + g + n + h + c + b + j + i + e)keacdlbm)
(((k + f + o + a + j)?edhldfhngicjmab)?cie)∗ bg
((((a?d)+ ba) + h + g + e + c)+ + j + i + b)?f

Fig. 7. A snapshot of the 100 generated expressions.

Footnote 2: This corpus was also used in [Bex et al. 2007] for XSD inference.

To ensure that the generated expressions do not impede readability by containing
redundant subexpressions (as in, e.g., $(a^{+})^{+}$), the final step of our generator is to
syntactically simplify the generated expressions using the following straightforward
equivalences:
  $r^{*} \to r^{+}?$
  $r?? \to r?$
  $(r^{+})^{+} \to r^{+}$
  $(r?)^{+} \to r^{+}?$
  $(r_1 \cdot r_2) \cdot r_3 \to r_1 \cdot (r_2 \cdot r_3)$
  $r_1 \cdot (r_2 \cdot r_3) \to r_1 \cdot r_2 \cdot r_3$
  $(r_1? \cdot r_2?)? \to r_1? \cdot r_2?$
  $(r_1 + r_2) + r_3 \to r_1 + (r_2 + r_3)$
  $r_1 + (r_2 + r_3) \to r_1 + r_2 + r_3$
  $(r_1 + r_2^{+})^{+} \to (r_1 + r_2)^{+}$
  $(r_1^{+} + r_2^{+}) \to (r_1 + r_2)^{+}$
  $r_1 + r_2? \to (r_1 + r_2)?$
Of course, the resulting expression is rejected if it is non-deterministic.
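To make the rewriting step concrete, the following sketch applies a subset of the equivalences bottom-up: the four rules on unary operators and the flattening of nested concatenations and disjunctions. The tuple encoding of expressions and the function name are assumptions for illustration; the remaining rules and the determinism check are deliberately left out.

```python
def simplify(expr):
    """Simplify a tuple-encoded expression with a subset of the rules.

    Encoding (hypothetical): ("sym", a), ("concat", r1, ..., rn),
    ("union", r1, ..., rn), ("opt", r), ("star", r), ("plus", r).
    """
    op = expr[0]
    if op == "sym":
        return expr
    kids = [simplify(child) for child in expr[1:]]
    if op in ("concat", "union"):
        flat = []
        for kid in kids:
            # (r1 . r2) . r3 -> r1 . r2 . r3, and likewise for disjunction
            flat.extend(kid[1:] if kid[0] == op else [kid])
        return (op, *flat)
    (kid,) = kids
    if op == "star":                        # r* -> r+?
        return simplify(("opt", ("plus", kid)))
    if op == "opt" and kid[0] == "opt":     # r?? -> r?
        return kid
    if op == "plus" and kid[0] == "plus":   # (r+)+ -> r+
        return kid
    if op == "plus" and kid[0] == "opt":    # (r?)+ -> r+?
        return simplify(("opt", ("plus", kid[1])))
    return (op, kid)


# Usage: the redundant expression (a+)+ collapses to a+.
a = ("sym", "a")
simplified = simplify(("plus", ("plus", a)))
```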
To obtain a diverse target set, we synthesized expressions with alphabet size 5
(45 expressions), 10 (45 expressions), and 15 (10 expressions) with a variety of
symbol occurrences (k = 1, 2, 3). For each of the alphabet sizes, the expressions
were selected to cover language size ranging from 0 to 1. All in all, this yielded a
set of 100 deterministic target expressions. A snapshot is given in Figure 7.
Synthetic sample generation. For each of those 100 target expressions, we
generated synthetic samples by transforming the target expressions into stochastic
processes that perform random walks on the automata representing the expressions
(cf. Section 4). The probability distributions of these processes are derived from the
structure of the originating expression. In particular, each operand in a disjunction
is equally likely and the probability to have zero or one occurrences for the zero-or-one operator ? is 1/2 for each option. The probability to have n repetitions in
a one-or-more or zero-or-more operator (∗ and ⁺) is determined by the probability
that we choose to continue looping (2/3) or choose to leave the loop (1/3). The
latter values are based on observations of real-world corpora. Figure 8 illustrates
how we construct the desired stochastic process from a regular expression r: starting
from the following initial graph,

  --1--> r --1-->

we continue applying the rewrite rules shown until each internal node is an individual alphabet symbol.

[Figure 8: rewrite rules for r1 ··· rn, r1 + ··· + rn, r?, and r+, with edge probabilities p, 1, p/n, p/2, 2/3, and 1/3.]

Fig. 8. From a regular expression to a probabilistic automaton.
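The sampling probabilities described above can be sketched as a recursive word generator. Note the assumption: this sketch walks the syntax tree of the expression rather than building the probabilistic automaton of Figure 8, and the tuple encoding and names are illustrative only. The probabilities (uniform choice in a disjunction, 1/2 for ?, 2/3 to continue a loop) are the ones stated in the text.

```python
import random


def sample_word(expr, rng):
    """Draw one word (a list of symbols) from a tuple-encoded expression.

    Encoding (hypothetical): ("sym", a), ("concat", r1, ..., rn),
    ("union", r1, ..., rn), ("opt", r), ("star", r), ("plus", r).
    """
    op = expr[0]
    if op == "sym":
        return [expr[1]]
    if op == "concat":
        return [s for child in expr[1:] for s in sample_word(child, rng)]
    if op == "union":
        # Each operand of a disjunction is equally likely.
        return sample_word(rng.choice(expr[1:]), rng)
    if op == "opt":
        # Zero or one occurrence, each with probability 1/2.
        return sample_word(expr[1], rng) if rng.random() < 0.5 else []
    if op == "plus":
        word = sample_word(expr[1], rng)
        # Continue looping with probability 2/3, leave with probability 1/3.
        while rng.random() < 2 / 3:
            word += sample_word(expr[1], rng)
        return word
    if op == "star":
        # Treated as (r+)?, following the simplification r* -> r+?.
        return sample_word(("opt", ("plus", expr[1])), rng)
    raise ValueError(f"unknown operator: {op}")


# Usage: draw a sample of 300 words from ((a + b)c)+.
target = ("plus", ("concat", ("union", ("sym", "a"), ("sym", "b")), ("sym", "c")))
rng = random.Random(42)
sample = ["".join(sample_word(target, rng)) for _ in range(300)]
```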
Experiments on covering samples. Our first experiment is designed to test
how iDRegEx performs on samples that are at least large enough to cover the
target regular expression, in the following sense.
Definition 5.1. A sample S covers a deterministic automaton G if for every edge
(s, t) in G there is a word w ∈ S whose unique accepting run in G traverses (s, t).
Such a word w is called a witness for (s, t). A sample S covers a deterministic
regular expression r if it covers the automaton obtained from r using the Glushkov
construction for translating regular expressions into automata as defined in Definition 4.7.
Intuitively, if a sample does not cover a target regular expression r then there
will be parts of r that cannot be learned from S. In this sense, covering samples
are the minimal samples necessary to learn r. Note that such samples are far from
“complete” or “characteristic” in the sense of the theoretical framework of learning
in the limit, as some characteristic samples are bound to be of size exponential in
the size of r by Theorem 3.2, while samples of size at most quadratic in the size of r suffice
to cover r. Indeed, the Glushkov construction always yields an automaton whose
number of states is bounded by the size of r. Therefore, this automaton can have
at most |r|^2 edges, and hence |r|^2 witness words suffice to cover r.
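The bound on the number of edges can be checked on a small example. The sketch below implements the standard Glushkov (position automaton) construction on a tuple-encoded syntax tree; the encoding and all names are assumptions for illustration, and Definition 4.7 remains the authoritative definition. State 0 is the initial state and every other state is a symbol position.

```python
def glushkov(expr):
    """Return (labels, edges, finals) of the position automaton of expr.

    Encoding (hypothetical): ("sym", a), ("concat", r1, ..., rn),
    ("union", r1, ..., rn), ("opt", r), ("star", r), ("plus", r).
    """
    labels = {}   # position -> alphabet symbol
    follow = {}   # position -> set of positions that may come next

    def visit(node):
        """Return (nullable, first, last) of a subexpression."""
        op = node[0]
        if op == "sym":
            pos = len(labels) + 1
            labels[pos] = node[1]
            follow[pos] = set()
            return False, {pos}, {pos}
        parts = [visit(child) for child in node[1:]]
        if op == "union":
            return (any(n for n, _, _ in parts),
                    set().union(*(f for _, f, _ in parts)),
                    set().union(*(l for _, _, l in parts)))
        if op == "concat":
            nullable, first, last = parts[0]
            for n2, f2, l2 in parts[1:]:
                for p in last:
                    follow[p] |= f2
                first = first | f2 if nullable else first
                last = last | l2 if n2 else l2
                nullable = nullable and n2
            return nullable, first, last
        if op not in ("opt", "plus", "star"):
            raise ValueError(f"unknown operator: {op}")
        nullable, first, last = parts[0]
        if op in ("plus", "star"):
            for p in last:
                follow[p] |= first      # loop back into the subexpression
        return (nullable if op == "plus" else True), first, last

    nullable, first, last = visit(expr)
    edges = {(0, q) for q in first} | {(p, q) for p in follow for q in follow[p]}
    finals = set(last) | ({0} if nullable else set())
    return labels, edges, finals


# Usage: a(a + b)+cab has 6 positions, hence 7 states.
sym = lambda a: ("sym", a)
r = ("concat", sym("a"), ("plus", ("union", sym("a"), sym("b"))),
     sym("c"), sym("a"), sym("b"))
labels, edges, finals = glushkov(r)
```

With n positions the automaton has n + 1 states, so the number of edges is at most n(n + 1); for the example above only 11 edges are present.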
Table II shows how iDRegEx performs on covering samples, broken down by alphabet size of the target expressions. The size of the sample used is shown as well.
The table demonstrates a remarkable precision. Out of a total of 100 expressions,
82 are derived exactly by iDRegEx. Although iDRegEx(rwr0) outperforms
iDRegEx with a success rate of 87 %, overall iDRegExfixed performs best with
89 %. The performance decreases with the alphabet size of the target expressions:
this is to be expected since the inference task's complexity increases. It should
be emphasized that even if iDRegExfixed does not derive the target expression
exactly, it always yields an over-approximation, i.e., its language is a superset of
the target language.
Table III shows an alternative view on the results. It shows the success rate as a
function of the target expression’s language size, grouped in intervals. In particular,
it demonstrates that the method works well for all language sizes.
A final perspective is offered in Table IV, which shows the success rate as a function
of the average number of states per symbol κ for an expression. The latter quantity is defined
as the length of the regular expression excluding operators, divided by the alphabet size. For instance, for the expression a(a + b)+ cab, κ = 6/3 since its length
excluding operators is 6 and |Σ| = 3. It is clear that the learning task is harder
for increasing values of κ. To verify the latter, a few extra expressions with large κ
values were added to the target expressions. For the algorithm iDRegExfixed the
success rate is quite high for target expressions with a large value of κ. Conversely,
iDRegEx(rwr0) yields better results for κ < 1.6, while its success rate drops to
around 50 % for larger values of κ. This illustrates that neither iDRegEx(rwr0)
nor iDRegExfixed outperforms the other in all situations.
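The quantity κ is easy to compute from the textual form of an expression. The sketch below assumes, purely for illustration, that every alphabet symbol is a single letter and that all other characters are operators or parentheses; the function name is hypothetical.

```python
def kappa(expression):
    """Average number of states per symbol of a textual expression.

    Assumption: each alphabet symbol is one letter; everything else
    (operators, parentheses, spaces) is ignored.
    """
    symbols = [ch for ch in expression if ch.isalpha()]
    # Length excluding operators, divided by the alphabet size.
    return len(symbols) / len(set(symbols))


# Usage: the example from the text, with 6 symbol occurrences over 3 symbols.
value = kappa("a(a+b)+cab")
```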
|Σ|     #regex   iDRegEx(rwr0)   iDRegEx   iDRegExfixed   |S|
5       45       86 %            97 %      100 %          300
10      45       93 %            75 %      84 %           1000
15      10       70 %            50 %      60 %           1500
total   100      87 %            82 %      89 %

Table II. Success rate on the target regular expressions and the sample size used per alphabet size
for the various algorithms.

Density(r)   #regex   iDRegEx(rwr0)   iDRegEx   iDRegExfixed
[0.0, 0.2[   24       100 %           87 %      96 %
[0.2, 0.4[   22       82 %            91 %      91 %
[0.4, 0.6[   20       90 %            75 %      85 %
[0.6, 0.8[   22       95 %            72 %      83 %
[0.8, 1.0]   12       83 %            78 %      78 %

Table III. Success rate on the target regular expressions, grouped by language size.

κ            #regex   iDRegEx(rwr0)   iDRegEx   iDRegExfixed
[1.2, 1.4[   29       96 %            72 %      83 %
[1.4, 1.6[   37       100 %           89 %      89 %
[1.6, 1.8[   24       91 %            92 %      100 %
[1.8, 2.0[   11       54 %            91 %      100 %
[2.0, 2.5[   12       41 %            50 %      50 %
[2.5, 3.0]   18       66 %            71 %      78 %

Table IV. Success rate on the target regular expressions, grouped by κ, the average number of
states per symbol.

It is also interesting to note that iDRegEx successfully derived the regular expression r1 = (a1 a2 + a3 + · · · + an )+ of Theorem 3.2 for n = 8, n = 10, and n = 12
from covering samples of size 500, 800, and 1100, respectively. This is quite surprising considering that the characteristic samples for these expressions were proven to
be of size at least (n − 2)!, i.e., 720, 40320, and 3628800, respectively. The regular
expression r2 = (Σ \ a1 )+ a1 (Σ \ a1 )+ , in contrast, was not derivable by iDRegEx
from small samples.
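The gap between the covering samples used and the lower bound on characteristic samples can be recomputed in a few lines. The sample sizes and the bound (n − 2)! are taken from the text; the function name is an assumption for illustration.

```python
from math import factorial


def characteristic_lower_bound(n):
    """Lower bound (n - 2)! on the characteristic sample size for r1."""
    return factorial(n - 2)


# Covering sample sizes reported in the text for n = 8, 10, 12.
covering_sizes = {8: 500, 10: 800, 12: 1100}
for n, size in covering_sizes.items():
    bound = characteristic_lower_bound(n)
    print(f"n = {n}: covering sample {size}, characteristic sample >= {bound}")
```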
Experiments on partially covering samples. Unfortunately, samples to learn
regular expressions from are often smaller than one would prefer. In an extreme, but
not uncommon case, the sample does not even entirely cover the target expression.
In this section we therefore test how iDRegEx performs on such samples.
Definition 5.2. The coverage of a target regular expression r by a sample S is
defined as the fraction of transitions in the corresponding Glushkov automaton for
r that have at least one witness in S.
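Definition 5.2 can be turned into a short computation. The sketch below assumes a deterministic automaton given as a transition dictionary, which is an illustrative representation and not the data structure of the actual implementation; only words accepted by the automaton contribute witnesses.

```python
def coverage(transitions, initial, finals, sample):
    """Fraction of edges (s, t) witnessed by accepted words of the sample.

    `transitions` maps (state, symbol) to the next state (hypothetical
    encoding of a deterministic automaton).
    """
    edges = {(s, t) for (s, _), t in transitions.items()}
    witnessed = set()
    for word in sample:
        state, run = initial, []
        for symbol in word:
            nxt = transitions.get((state, symbol))
            if nxt is None:
                break                      # word leaves the automaton
            run.append((state, nxt))
            state = nxt
        else:
            if state in finals:            # only accepting runs count
                witnessed.update(run)
    return len(witnessed) / len(edges)


# Usage: the Glushkov automaton of (ab)+ has edges (0,1), (1,2), (2,1).
delta = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1}
partial = coverage(delta, 0, {2}, ["ab"])
full = coverage(delta, 0, {2}, ["ab", "abab"])
```

A sample covers the automaton in the sense of Definition 5.1 exactly when this fraction equals 1.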
Note that to successfully learn r from a partially covering sample, iDRegEx
needs to “guess” the edges for which there is no witness in S. This guessing capability is built into iDRegEx(rwr0) and iDRegEx in the form of repair rules [Bex
et al. 2006; Bex et al. 2008]. Our experiments show that for target expressions
with alphabet size |Σ| = 10, this is highly effective for iDRegEx(rwr0): even at a
coverage of 70 %, half the target expressions can still be learned correctly, as Table V
shows. The algorithm iDRegEx performs very poorly in this setting, being
successful only occasionally for coverages close to 100 %. iDRegExfixed performs
better, although not as well as iDRegEx(rwr0). This again illustrates that both
algorithms have their merits.
coverage   iDRegEx(rwr0)   iDRegEx   iDRegExfixed
1.0        100 %           80 %      80 %
0.9        64 %            20 %      60 %
0.8        60 %            0 %       40 %
0.7        52 %            0 %       0 %
0.6        0 %             0 %       0 %

Table V. Success rate for 25 target expressions for |Σ| = 10 for samples that provide partial
coverage of the target expressions.

We also experimented with target expressions with alphabet size |Σ| = 5. In this
case, the results were not very promising for iDRegEx(rwr0), but as Table VI
illustrates, iDRegEx and iDRegExfixed perform better, on par with the target
expressions for |Σ| = 10 in the case of iDRegExfixed. This is interesting since
the absolute amount of information missing for smaller regular expressions is larger
than in the case of larger expressions.

coverage   iDRegEx(rwr0)   iDRegEx   iDRegExfixed
1.0        100 %           100 %     100 %
0.9        25 %            75 %      66 %
0.8        16 %            75 %      41 %
0.7        8 %             25 %      33 %
0.6        8 %             25 %      17 %
0.5        0 %             8 %       17 %

Table VI. Success rate for 12 target expressions for |Σ| = 5 with partially covering samples.

6. CONCLUSIONS

We presented the algorithm iDRegEx for inferring a deterministic regular expression from a sample of words. Motivated by regular expressions occurring in practice,
we use a novel measure based on the number k of occurrences of the same alphabet
symbol and derive expressions for increasing values of k. We demonstrated the
remarkable effectiveness of iDRegEx on a large corpus of real-world and synthetic
regular expressions of different densities.
Our experiments show that iDRegEx(rwr0) performs better than iDRegEx
for target expressions with κ < 1.6 and vice versa for larger values of κ. For
partially covering samples, iDRegEx(rwr0) is more robust than iDRegEx. As κ
values and sample coverage are not known in advance, it makes sense to run both
algorithms and select the smallest expression or the one with the smallest language
size, depending on the application at hand.
Some questions need further attention. First, in our experiments, iDRegEx
always derived the correct expression or an over-approximation of the target expression. It remains to investigate for which kind of input samples this behavior
can be formally proved. Second, it would also be interesting to characterize precisely which classes of expressions can be learned with our method. Although the
parameter κ explains this to some extent, we probably need more fine-grained
measures. A last and obvious goal for future work is to speed up the inference of
the probabilistic automaton, which forms the bottleneck of the proposed algorithm.
A possibility is to use an industrial-strength implementation of the Baum-Welsh
algorithm as in [Finn et al. 2006] rather than a straightforward one, or to explore
different methods for learning probabilistic automata.
Although iDRegEx can be directly plugged into the XSD inference engine iXSD
of [Bex et al. 2007], it would be interesting to investigate how to extend these
techniques to the more robust class of Relax NG schemas [Clark and Murata 2001].
REFERENCES
Castor. www.castor.org.
SUN Microsystems JAXB. java.sun.com/webservices/jaxb.
Adriaans, P. and Vitányi, P. 2006. The Power and Perils of MDL.
Ahonen, H. 1996. Generating Grammars for structured documents using grammatical inference
methods. Report A-1996-4, Department of Computer Science, University of Finland.
Angluin, D. and Smith, C. H. 1983. Inductive Inference: Theory and Methods. ACM Computing
Surveys 15, 3, 237–269.
Barbosa, D., Mignet, L., and Veltri, P. 2005. Studying the XML Web: gathering statistics
from an XML sample. World Wide Web 8, 4, 413–438.
Benedikt, M., Fan, W., and Geerts, F. 2005. XPath satisfiability in the presence of DTDs. In
Proceedings of the Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles
of Database Systems. 25–36.
Bernstein, P. A. 2003. Applying Model Management to Classical Meta Data Problems. In First
Biennial Conference on Innovative Data Systems Research.
Bex, G., Neven, F., Schwentick, T., and Vansummeren, S. Inference of Concise Regular
Expressions and DTDs. ACM TODS . To Appear.
Bex, G. J., Gelade, W., Neven, F., and Vansummeren, S. 2008. Learning deterministic regular
expressions for the inference of schemas from XML data. In WWW. Beijing, China, 825–834.
Accepted for WWW 2008.
Bex, G. J., Neven, F., Schwentick, T., and Tuyls, K. 2006. Inference of concise DTDs from
XML data. In Proceedings of the 32nd International Conference on Very Large Data Bases.
115–126.
Bex, G. J., Neven, F., Schwentick, T., and Vansummeren, S. 2008. Inference of Concise
Regular Expressions and DTDs. submitted to VLDB Journal.
Bex, G. J., Neven, F., and Van den Bussche, J. 2004. DTDs versus XML Schema: a practical
study. In Proceedings of the 7th International Workshop on the Web and Databases. 79–84.
Bex, G. J., Neven, F., and Vansummeren, S. 2007. Inferring XML Schema Definitions from
XML data. In Proceedings of the 33rd International Conference on Very Large Databases.
998–1009.
Brāzma, A. 1993. Efficient identification of regular expressions from representative examples.
In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory. ACM
Press, 236–242.
Brüggemann-Klein, A. 1993. Regular expressions into finite automata. Theoretical Computer
Science 120, 2, 197–213.
Brüggemann-Klein, A. and Wood, D. 1998. One-unambiguous regular languages. Information
and computation 140, 2, 229–253.
Buneman, P., Davidson, S. B., Fernandez, M. F., and Suciu, D. 1997. Adding structure to
unstructured data. In Database Theory - ICDT ’97, 6th International Conference, F. N. Afrati
and P. G. Kolaitis, Eds. Lecture Notes in Computer Science, vol. 1186. Springer, 336–350.
Che, D., Aberer, K., and Özsu, M. T. 2006. Query optimization in XML structured-document
databases. VLDB Journal 15, 3, 263–289.
Chidlovskii, B. 2001. Schema extraction from XML: a grammatical inference approach. In
Proceedings of the 8th International Workshop on Knowledge Representation meets Databases.
Clark, J. Trang: Multi-format schema converter based on RELAX NG.
http://www.thaiopensource.com/relaxng/trang.html.
Clark, J. and Murata, M. 2001. RELAX NG Specification. OASIS.
Cover, R. 2003. The Cover Pages. http://xml.coverpages.org/.
Du, F., Amer-Yahia, S., and Freire, J. 2004. ShreX: Managing XML Documents in Relational
Databases. In Proceedings of the 30th International Conference on Very Large Data Bases.
1297–1300.
Ehrenfeucht, A. and Zeiger, P. 1976. Complexity measures for regular expressions. Journal
of computer and system sciences 12, 134–146.
Fernau, H. 2004. Extracting minimum length Document Type Definitions is NP-hard. In ICGI.
277–278.
Fernau, H. 2005. Algorithms for Learning Regular Expressions. In Algorithmic Learning Theory,
16th International Conference. 297–311.
Finn, R., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., et al. 2006. Pfam: clans,
web tools and services. Nucleic Acids Research 34, D247–D251.
Florescu, D. 2005. Managing semi-structured data. ACM Queue 3, 8 (October).
François, J.-M. 2006. Jahmm.
http://www.run.montefiore.ulg.ac.be/~francois/software/jahmm/.
Freire, J., Haritsa, J. R., Ramanath, M., Roy, P., and Siméon, J. 2002. StatiX: making XML
count. In SIGMOD Conference. 181–191.
Freitag, D. and McCallum, A. 2000. Information Extraction with HMM Structures Learned
by Stochastic Optimization. In AAAI/IAAI. AAAI Press / The MIT Press, 584–589.
Garcia, P. and Vidal, E. 1990. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence 12, 9 (September), 920–925.
Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., and Shim, K. 2003. XTRACT: learning document type descriptors from XML document collections. Data mining and knowledge
discovery 7, 23–56.
Gelade, W. and Neven, F. 2008. Succinctness of the Complement and Intersection of Regular
Expressions. In STACS. 325–336.
Gold, E. 1967. Language identification in the limit. Information and Control 10, 5 (May),
447–474.
Goldman, R. and Widom, J. 1997. DataGuides: Enabling Query Formulation and Optimization
in Semistructured Databases. In Proceedings of 23rd International Conference on Very Large
Data Bases. 436–445.
Gruber, H. and Holzer, M. 2008. Finite Automata, Digraph Connectivity, and Regular Expression Size. In ICALP (2). 39–50.
Hegewald, J., Naumann, F., and Weis, M. 2006. XStruct: efficient schema extraction from
multiple and large XML documents. In ICDE Workshops. 81.
Hopcroft, J. and Ullman, J. 2007. Introduction to automata theory, languages and computation. Addison-Wesley, Reading, MA.
Koch, C., Scherzinger, S., Schweikardt, N., and Stegmaier, B. 2004. Schema-based scheduling of event processors and buffer minimization for queries on structured data streams. In
Proceedings of the 30th International Conference on Very Large Data Bases. 228–239.
Manolescu, I., Florescu, D., and Kossmann, D. 2001. Answering XML Queries on Heterogeneous Data Sources. In Proceedings of 27th International Conference on Very Large Data
Bases. 241–250.
Martens, W., Neven, F., Schwentick, T., and Bex, G. J. 2006. Expressiveness and Complexity
of XML Schema. ACM Transactions on Database Systems 31, 3, 770–813.
Mignet, L., Barbosa, D., and Veltri, P. 2003. The XML web: a first study. In Proceedings of
the 12th International World Wide Web Conference. Budapest, Hungary, 500–510.
Nestorov, S., Abiteboul, S., and Motwani, R. 1998. Extracting Schema from Semistructured
Data. In International Conference on Management of Data. ACM Press, 295–306.
Neven, F. and Schwentick, T. 2006. On the complexity of XPath containment in the presence
of disjunction, DTDs, and variables. Logical Methods in Computer Science 2, 3.
Pitt, L. 1989. Inductive Inference, DFAs, and Computational Complexity. In Proceedings of
the International Workshop on Analogical and Inductive Inference, K. P. Jantke, Ed. Lecture
Notes in Computer Science, vol. 397. Springer-Verlag, 18–44.
Quass, D., Widom, J., Goldman, R., et al. 1996. LORE: a Lightweight Object REpository for
semistructured data. In Proceedings of the 1996 ACM SIGMOD International Conference on
Management of Data. 549.
Rabiner, L. 1989. A tutorial on Hidden Markov Models and selected applications in speech
recognition. Proc. IEEE 77, 2, 257–286.
Rahm, E. and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching.
VLDB Journal 10, 4, 334–350.
Sahuguet, A. 2000. Everything You Ever Wanted to Know About DTDs, But Were Afraid to Ask
(Extended Abstract). In The World Wide Web and Databases, 3rd International Workshop,
D. Suciu and G. Vossen, Eds. Lecture Notes in Computer Science, vol. 1997. Springer, 171–183.
Sakakibara, Y. 1997. Recent advances of grammatical inference. Theoretical Computer Science 185, 1, 15–45.
Sankey, J. and Wong, R. K. 2001. Structural inference for semistructured data. In Proceedings
of the 10th international conference on Information and knowledge management. ACM Press,
159–166.
Thompson, H., Beech, D., Maloney, M., and Mendelsohn, N. 2001. XML Schema part 1:
structures. W3C.
Young-Lai, M. and Tompa, F. W. 2000. Stochastic Grammatical Inference of Text Database
Structure. Machine Learning 40, 2, 111–137.
